Autonomous AI SRE: The Future of Site Reliability Engineering
Summary
TL;DR: In this podcast episode, host Demetrios interviews William, CTO of Cleric AI, discussing the role of AI Site Reliability Engineers (SREs) and the challenges they face in dynamic production environments. They delve into the use of knowledge graphs to diagnose root causes of system issues efficiently. William shares insights on confidence scoring, memory management in AI, tool integrations, and chaos engineering, highlighting the complexity of operational environments and how Cleric aims to enhance engineer productivity without increasing alert fatigue. They also touch on pricing strategies that encourage usage while managing operational costs.
Takeaways
- ☕️ Cleric AI is solving tough problems in AI and infrastructure.
- 🧠 Knowledge graphs aid in diagnosing issues efficiently.
- 🔍 Confidence scoring helps prioritize alerts for engineers.
- 💡 AI agents learn from episodic and procedural memories.
- 📊 Usage-based pricing is being explored to optimize adoption.
- ⚙️ Integration with tools like Datadog is crucial for performance.
- 🌪️ Chaos engineering is used to test AI robustness in production.
- ⚠️ Engineers are cautious about AI making changes to critical systems.
- 📈 The complexity of AI in dynamic environments poses unique challenges.
- 🤝 Building trust is essential for AI adoption among engineers.
Timeline
- 00:00:00 - 00:05:00
Introduction of the hosts and the context of the discussion focusing on AI SRE and knowledge graphs; William's background with Cleric AI and Feast. Begins with some light notes about a Christmas sweater and caffeine consumption, introducing a casual vibe to the conversation.
- 00:05:00 - 00:10:00
Discussion of AI SRE being a complex problem tied to MLOps; highlighted the differences between development and production environments, and how operational complexity increases with deployment across different applications, leading to challenges in root cause analysis.
- 00:10:00 - 00:15:00
Exploration of how agent systems are being built with modular components, balancing understanding and responsibility versus the productivity gains, leading to operational instability in larger organizations where pressure is high.
- 00:15:00 - 00:20:00
Innovation through the use of knowledge graphs to diagnose incidents within systems. Discussion emphasizes the complexity of relationships within IT infrastructures, and how mappings of even small clusters begin to reveal potential issues.
- 00:20:00 - 00:25:00
Deep dive into how agents scan systems to identify problems prior to alerts being raised, and the necessity for ongoing updating of knowledge graphs to maintain relevancy in fast-paced production environments.
- 00:25:00 - 00:30:00
Relationship between proactive monitoring of production state and agents' ability to query information; the graph's dual role in driving effective diagnosis and enabling exploration of root causes through structured data.
- 00:30:00 - 00:35:00
Understanding the background job of graph building and updating during investigations; the potential for agents to uncover hidden issues through continuous environment scanning, including the cost impacts of these processes.
- 00:35:00 - 00:40:00
Confidence scoring approach for agents ensuring they don't overwhelm engineers with false positives, emphasizing trust and utility; integrating human feedback into performance evaluations to boost efficiency.
- 00:40:00 - 00:45:00
The challenge of decision-making in AI-driven environments, where agents need to discern context to retrieve useful information without causing disruption; exploring trade-offs between human input and automated systems.
- 00:45:00 - 00:50:00
Agents learning from experience, including procedural and episodic memory management creating layered structures; feedback-driven iterations enhance performance, building a feedback loop with engineers to improve knowledge repositories.
- 00:50:00 - 00:55:58
Examining market pricing strategies for AI agents; approaching pricing based on usage while ensuring engineers can operate without cost-induced anxiety over their investigative actions. Emphasis on a thoughtful revenue model to drive engagement instead of risking underutilization.
Video Q&A
What is Cleric AI doing?
Cleric AI is building an AI Site Reliability Engineer that uses knowledge graphs and AI agents to diagnose root causes of issues in production environments.
What is an AI SRE?
An AI SRE refers to an AI Site Reliability Engineer who uses AI technologies to manage and maintain the reliability of systems.
What challenges do AI agents face in production environments?
Key challenges include the dynamic nature of systems, the unsupervised nature of problems, and the complexity of understanding relations among various components.
How does the knowledge graph help in troubleshooting?
The knowledge graph maps relationships and dependencies in the production environment, helping diagnose root causes of issues efficiently.
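As a rough illustration of why such a graph helps (a toy sketch, not Cleric AI's actual implementation; all service names are hypothetical), an agent can walk a dependency graph upstream from a failing component to rank root-cause candidates instead of probing the whole environment from first principles:

```python
from collections import deque

# Hypothetical mini knowledge graph: each component maps to what it
# depends on. Real graphs also span clouds, nodes, pods, and code.
DEPENDS_ON = {
    "frontend": ["checkout-service"],
    "checkout-service": ["payment-service", "product-db"],
    "payment-service": ["payment-db", "vault-secrets"],
}

def upstream_candidates(graph, failing_node):
    """Breadth-first walk from a failing component through its
    transitive dependencies: the candidate root causes, nearest-first."""
    seen, queue, order = set(), deque([failing_node]), []
    while queue:
        node = queue.popleft()
        for dep in graph.get(node, []):
            if dep not in seen:
                seen.add(dep)
                order.append(dep)
                queue.append(dep)
    return order

candidates = upstream_candidates(DEPENDS_ON, "frontend")
# candidates: ["checkout-service", "payment-service", "product-db",
#              "payment-db", "vault-secrets"]
```

With this map the agent checks `checkout-service` before `payment-db`, which is the search-space reduction discussed in the episode.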
What role does confidence scoring play?
Confidence scoring helps prioritize alerts and determine the reliability of findings before presenting them to engineers.
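The episode does not specify Cleric's scoring formula, but the idea of gating findings on similarity to past, positively-reviewed investigations might be sketched like this (the token-overlap similarity and the 0.3 floor are invented for illustration):

```python
def jaccard(a, b):
    """Token-overlap similarity between two issue descriptions."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

# Hypothetical history of past investigations and engineer feedback
# (1.0 = finding confirmed useful, 0.0 = rejected as noise).
HISTORY = [
    {"issue": "pod oom killed memory limit exceeded", "feedback": 1.0},
    {"issue": "tls certificate expired on ingress", "feedback": 0.0},
]

def confidence(new_issue, history, floor=0.3):
    """Weight similarity to the closest past issue by the feedback it
    received; below the floor, stay quiet rather than spam engineers."""
    best = max(history, key=lambda h: jaccard(new_issue, h["issue"]))
    score = jaccard(new_issue, best["issue"]) * best["feedback"]
    return score, score >= floor
```

A repeat of a previously rejected finding scores zero even on an exact match, which is one simple way to keep false positives from eroding trust.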
How does Cleric AI handle memory?
Cleric AI uses episodic and procedural memories to learn from past actions and improve its diagnostics based on feedback.
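A toy sketch of that layering (the promotion rule and names are invented, not Cleric's design): episodic memory keeps raw investigation records, and a fix is promoted into procedural memory only after repeated confirmed successes:

```python
from collections import Counter

class AgentMemory:
    """Two-layer memory: `episodic` holds raw (symptom, fix, success)
    records; `procedural` holds fixes trusted enough to reuse."""

    def __init__(self, promote_after=2):
        self.episodic = []
        self.procedural = {}
        self.promote_after = promote_after

    def record(self, symptom, fix, success):
        self.episodic.append((symptom, fix, success))
        # Count confirmed successes of this fix for this symptom.
        wins = Counter(f for s, f, ok in self.episodic
                       if s == symptom and ok)
        if success and wins[fix] >= self.promote_after:
            self.procedural[symptom] = fix

    def recall(self, symptom):
        return self.procedural.get(symptom)
```

A single success is treated as an episode, not a rule; only repetition plus positive feedback turns it into something the agent will reach for first.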
What pricing model is Cleric AI exploring?
Cleric AI is considering a usage-based pricing model to encourage adoption while covering operational costs.
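Later in the episode William describes capping spend per investigation (for example ten cents or a dollar) and stopping at the limit; in spirit that loop might look like this (a simplified sketch with invented step costs):

```python
def run_investigation(steps, budget_usd=0.10):
    """Run tool/LLM-call steps until the per-investigation budget is
    exhausted, then stop and report whatever was found so far.
    `steps` is a list of (cost_usd, finding_or_None) pairs standing
    in for real calls."""
    spent, findings = 0.0, []
    for cost, finding in steps:
        if spent + cost > budget_usd:
            return {"status": "budget_exhausted",
                    "spent": round(spent, 4), "findings": findings}
        spent += cost
        if finding:
            findings.append(finding)
    return {"status": "complete",
            "spent": round(spent, 4), "findings": findings}

result = run_investigation(
    [(0.03, None), (0.04, "suspect: payment-db"), (0.05, None)],
    budget_usd=0.10)
# Stops before the third step; the partial finding is still surfaced
# so the engineer can decide whether to extend the budget.
```

This matches the collaboration model described in the episode: the human can say "go a bit further" (raise the budget) or "stop here, I'll take over".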
What external tools does Cleric AI integrate with?
Cleric AI integrates with tools like Datadog to access logs, monitor system health, and gather necessary data for analysis.
How does Cleric AI incorporate chaos engineering?
Cleric AI employs chaos engineering by simulating failures and evaluating agent performance in controlled scenarios.
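One way such an evaluation harness could be structured (entirely illustrative; the fault catalogue and the keyword-matching "agent" are made up): inject a known fault, ask the agent to diagnose the symptom, and score it against ground truth:

```python
import random

def chaos_eval(agent_fn, faults, trials=50, seed=0):
    """Inject one known fault per trial and measure how often the
    agent's diagnosis matches the injected ground truth."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        fault, symptom = rng.choice(faults)
        if agent_fn(symptom) == fault:
            correct += 1
    return correct / trials

# (fault injected, symptom the agent observes)
FAULTS = [
    ("kill-pod", "pod restarts spiking"),
    ("fill-disk", "disk usage at 100%"),
]

def toy_agent(symptom):
    # Stand-in for a real diagnostic agent.
    return "kill-pod" if "pod" in symptom else "fill-disk"

accuracy = chaos_eval(toy_agent, FAULTS)
```

Because the fault is injected, this gives the ground truth that is otherwise missing from the unsupervised production setting discussed earlier.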
What are the major concerns with AI changes in production environments?
Concerns include maintaining system reliability, avoiding unnecessary disruptions, and ensuring that AI assists rather than complicates.
- 00:00:00Willem Pienaar, CTO of Cleric, we're
- 00:00:03building an
- 00:00:04AI SRE, we're based in San
- 00:00:07Francisco, black coffee is the way to go
- 00:00:10and if you want to join a team of
- 00:00:12veterans in AI and infrastructure on a
- 00:00:15really tough problem yeah come and chat
- 00:00:17to us, boom, welcome back to the
- 00:00:20MLOps Community podcast I'm your host
- 00:00:22Demetrios today we are talking with my
- 00:00:25good friend William some of you may know
- 00:00:27him as the CTO of Cleric AI doing some
- 00:00:31pretty novel stuff with the AI SRE which
- 00:00:36we dive into very deep in this next hour
- 00:00:41we talk all about how he's using
- 00:00:43knowledge graphs to triage root cause
- 00:00:46issues with their AI agent solution and
- 00:00:50others of you may know William because he
- 00:00:52is also the same guy that built the open
- 00:00:56source feature store Feast that's where
- 00:00:59I got to know him back four five years
- 00:01:02ago and since then I've been following
- 00:01:05what he is doing very closely and it's
- 00:01:08safe to say this guy never fails to
- 00:01:11disappoint let's get into the
- 00:01:13conversation right
- 00:01:15[Music]
- 00:01:20now let's start by prefacing this
- 00:01:23conversation with we are recording 2
- 00:01:26days before Christmas so when it comes
- 00:01:29out
- 00:01:30this sweater that I'm wearing is not
- 00:01:32going to be okay but today it is totally
- 00:01:35in bounds for me being able to wear
- 00:01:38it unfortunately I don't have a cool
- 00:01:40sweater like you and I'm in Sunny San
- 00:01:42Francisco but I guess it's got the fog
- 00:01:46yeah it's Christmas
- 00:01:48period dude I found out three four days
- 00:01:52ago that if you have
- 00:01:55this pill, magic pill, with caffeine it
- 00:02:00like minimizes the jitters so I have
- 00:02:04taken that as an excuse, L-theanine or
- 00:02:07whichever, yeah you've heard of it yeah yeah dude I
- 00:02:10I've just been abusing my caffeine
- 00:02:12intake and pounding these pills with it
- 00:02:16it's amazing I am so much more
- 00:02:18productive so that's my 2025 secret for
- 00:02:20everyone okay and a bit of magnesium
- 00:02:23for better sleep for actual
- 00:02:26sleep all right man enough of that you
- 00:02:29been building cleric you've been coming
- 00:02:32on occasionally to the
- 00:02:35different conferences that we've had and
- 00:02:38sharing your learnings but recently you
- 00:02:41put out a blog post and I want to go
- 00:02:42super deep on this blog post on what an
- 00:02:45AI SRE is just because it feels like
- 00:02:49SREs are very close to the MLOps world
- 00:02:52and AI agents are very much what we've
- 00:02:56been talking about a lot as we were
- 00:02:58presenting at the agents production
- 00:03:00conference the first thing that we
- 00:03:03should start with is just what a hard
- 00:03:06problem this is and why is it hard we
- 00:03:10can dive into those areas and I think
- 00:03:11we're going to get into that in this
- 00:03:13this conversation maybe just a set the
- 00:03:15stage everyone is building agents like
- 00:03:17agents of all the hype right now but
- 00:03:19every use case is different right you've
- 00:03:22got agents in law you've got agents for
- 00:03:24writing blog post you've got agents for
- 00:03:26social media one of the tricky things
- 00:03:28about our space is really if you
- 00:03:31consider two main things that an
- 00:03:33engineer does: they create software and
- 00:03:36then they deploy it into a production
- 00:03:37environment and it runs and operates
- 00:03:38actually has to have an impact on the
- 00:03:39real world that second world the
- 00:03:42operational environment is quite
- 00:03:44different from the development
- 00:03:45environment the development environment
- 00:03:46has tests it has an IDE it has tight
- 00:03:49feedback cycles often it has ground
- 00:03:51truth right so you can make a change and
- 00:03:54see if your tests pass there's
- 00:03:56permissionless data sets that are out
- 00:03:57there so you can go to GitHub and you
- 00:03:59can find like
- 00:04:00millions of issues that people create, PRs
- 00:04:02that are like the solutions to those
- 00:04:04issues yeah but consider like the
- 00:04:07production environment of an Enterprise
- 00:04:09company where do you find the data set
- 00:04:13that represents all the problems that
- 00:04:14they've had and all the solutions it's
- 00:04:16not just laying out there right you can
- 00:04:18get some like root causes and things
- 00:04:20that people have posted as blog posts
- 00:04:22but this is an unsupervised problem for
- 00:04:24the most part it's a very complicated
- 00:04:26problem I I guess we can get get into
- 00:04:28those details in a in a bit but that's
- 00:04:30really what makes this challenging it's
- 00:04:32complex sprawling Dynamic
- 00:04:35systems yeah the complexity of the
- 00:04:38systems does not help and I also think
- 00:04:40with the rise of the coding
- 00:04:44co-pilots does that not also make things
- 00:04:47more complex because you're running
- 00:04:49stuff in a production environment that
- 00:04:52maybe you know how it got created maybe
- 00:04:55you don't massively and I think even at
- 00:04:58our scale small startup it's become a
- 00:05:01topic
- 00:05:02internally how much do we delegate to AI
- 00:05:05because we're also Outsourcing and
- 00:05:07delegating to our own agents internally
- 00:05:09the produce code so I think all teams
- 00:05:12are trying to get to the boundaries of
- 00:05:14understanding and confidence so you're
- 00:05:16building these modular components like
- 00:05:18Lego blocks with internals you're unsure
- 00:05:20about but you're shipping into
- 00:05:21production and seeing how that succeeds
- 00:05:23and fails because it gives you so much
- 00:05:25velocity so the ROI is there but the
- 00:05:27understanding is like one of the things
- 00:05:28you lose over time and I think at scale
- 00:05:31where the incentives aren't aligned
- 00:05:32where you have many different teams and
- 00:05:34they're all being pressured to ship more
- 00:05:37belts are being tightened so there's not
- 00:05:39a lot of head count and they have to do
- 00:05:40more the production environment is
- 00:05:43really people are like putting their
- 00:05:45fingers in the dam wall but eventually
- 00:05:47it's going to break it's unstable at a
- 00:05:49lot of companies yeah so coding is going
- 00:05:52to make or AI generated coding is really
- 00:05:54going to make this a much more complex
- 00:05:56system to deal with so the Dynamics
- 00:05:59between these components that
- 00:06:01interrelate where there's much less
- 00:06:03understanding is going to explode yeah
- 00:06:06we're already seeing that dude there's
- 00:06:08so many different pieces on the complex
- 00:06:10systems that I want to dive into but the
- 00:06:13first one that stood out to me and has
- 00:06:15continued to replay in my mind is this
- 00:06:19knowledge graph that you presented at
- 00:06:21the conference and then subsequently in
- 00:06:24your blog post and you made the point of
- 00:06:27saying this is a
- 00:06:30Knowledge Graph that we created on a
- 00:06:32production environment but it's not like
- 00:06:34it's a gigantic kubernetes cluster it
- 00:06:38was a fairly small kubernetes cluster
- 00:06:41and all of the different relations from
- 00:06:43that and all the slack messages and all
- 00:06:45the GitHub issues and everything that is
- 00:06:48involved in that kubernetes cluster
- 00:06:50you've mapped out and that's just for
- 00:06:52one kubernetes cluster so I can't
- 00:06:54imagine across in a whole entire
- 00:06:56organization like an Enterprise size how
- 00:06:59complex this gets yeah so if you
- 00:07:03consider that specific cluster or graph
- 00:07:05I showed you was the OpenTelemetry
- 00:07:07reference architecture it's like a demo
- 00:07:09stack, it's like an e-commerce store,
- 00:07:10it's got about 12, 13 services yeah
- 00:07:14roughly in that
- 00:07:15range I've only shown you literally like
- 00:07:1810% of the relations maybe even less and
- 00:07:20it's only at the infrastructure layer
- 00:07:21right so it's not even talking about
- 00:07:22like buckets and Cloud infras nothing
- 00:07:25about nodes nothing about application
- 00:07:27internals right so if you consider one
- 00:07:28cloud project like a gcp project or AWS
- 00:07:32project mhm there's a whole tree there
- 00:07:34the networks the regions down to the
- 00:07:36kubernetes Clusters within a cluster
- 00:07:38there's the nodes, there's the pods,
- 00:07:40within the pods there's
- 00:07:42multiple containers potentially,
- 00:07:44within each of those many processes, each
- 00:07:46process has code with variables, and
- 00:07:50so this creates this tree structure but
- 00:07:52then between those nodes in the tree there can
- 00:07:54also be interrelations right like a
- 00:07:55piece of code here would be referencing
- 00:07:57an IP address but that IP address is
- 00:08:00owned by some cloud service somewhere and it's
- 00:08:02also connected to some other
- 00:08:04systems and you can't not use that
- 00:08:07information right because if a problem
- 00:08:09arrives and you're you know Landing your
- 00:08:11LA and you have to causally walk that
- 00:08:14graph to go Upstream to find the root
- 00:08:15cause in the security space this is a
- 00:08:19pretty well studied problem and there
- 00:08:21are traditional techniques people have
- 00:08:23been using to extract this from cloud
- 00:08:25environments but LLMs really unlock a new
- 00:08:28level of understanding there so they're
- 00:08:29extremely good at extracting these
- 00:08:32relationships taking really unstructured
- 00:08:33data so it can be conversations that you
- 00:08:36and I have it can be kubernetes objects
- 00:08:38it can be all of these like the whole
- 00:08:40Spectrum from unstructured to structured
- 00:08:42you can extract structured information
- 00:08:43so you can build these graphs the
- 00:08:46challenge really is twofold so you know
- 00:08:48you need to use this graph to get to a
- 00:08:50root cause but it's fuzzy right as soon
- 00:08:54as you extract that information you
- 00:08:56build that graph it's out of date almost
- 00:08:58instantly because systems change so
- 00:08:59quickly right so somebody's deploying
- 00:09:01something, an IP address gets rolled, pod
- 00:09:04names change and so you need to be
- 00:09:08able to make efficient decisions with
- 00:09:10your agent right so just to uh anchor
- 00:09:13this our agent is essentially a
- 00:09:16diagnostic agent right now so it helps
- 00:09:18teams quickly root cause a problem so if
- 00:09:21you've got an alert that fires, or
- 00:09:23an engineer presents an issue to the
- 00:09:25agent, it quickly navigates this
- 00:09:28graph and its awareness of your
- 00:09:30production environment to find the root
- 00:09:32cause, if it didn't have the graph it could
- 00:09:35still do it through first principles
- 00:09:36right it could still say looking at
- 00:09:38everything that's available I'll try
- 00:09:40this I'll try that but the graph allows
- 00:09:42it to very efficiently get to the root
- 00:09:44cause um and so that fuzziness is one
- 00:09:48of the challenges that the fact that
- 00:09:49it's out of date so quickly but it's so
- 00:09:52important to still have it
- 00:09:54regardless there's a few things that you
- 00:09:56mentioned about how with the vision or
- 00:10:00the understanding of the graph you can
- 00:10:03escalate up issues that may have been
- 00:10:06looked at in isolation is not that big
- 00:10:08of a deal and so can you explain how
- 00:10:11that works a little
- 00:10:12bit so the graph is essentially there's
- 00:10:16two, if you draw a box around the
- 00:10:17production environment right there are
- 00:10:19two kinds of issues right there's ones
- 00:10:21you have alerts for and ones you're unaware
- 00:10:23of so the one is you tell us like okay my alert
- 00:10:26fired here's a problem go look at it
- 00:10:27another is we scan the environment and
- 00:10:30we identify problems the graph is built
- 00:10:33in two ways one is a background job
- 00:10:36where it's just like looking through
- 00:10:37your infrastructure and finding new
- 00:10:39things and updating itself continuously
- 00:10:41and the other is when the agent's doing
- 00:10:42investigation and it sees new
- 00:10:44information and it just throws that back
- 00:10:45into the graph because it's got the
- 00:10:47information anyway so it just uses it to update the
- 00:10:49graph but in this background scanning
- 00:10:51process it might uncover things that it
- 00:10:54didn't realize was a problem but then it
- 00:10:56sees that this is actually a problem
- 00:10:58for example it could process your
- 00:11:01metrics or it could look at your
- 00:11:03configuration of your objects in
- 00:11:05kubernetes or maybe it finds a bucket
- 00:11:07and it's trying to create that node, the
- 00:11:10updated state of the bucket, and it sees it's
- 00:11:12exposed publicly so then you could
- 00:11:14surface this to an engineer and say
- 00:11:16your data is being exposed publicly or
- 00:11:19they've misconfigured this pod and the
- 00:11:21memory is growing this application and
- 00:11:24in about an hour or two this is going to
- 00:11:26crash yeah so there's a massive opportunity
- 00:11:29for LLMs to be used as reasoning engines
- 00:11:32where it can infer and predict a failure
- 00:11:35imminently and you can prevent that so you
- 00:11:37get to a proactive state of alerting
- 00:11:40that is of course quite inefficient
- 00:11:42today if you use an LLM to just slap
- 00:11:45a vision model onto a metrics graph or
- 00:11:48yeah onto you your objects in your Cloud
- 00:11:51infrastructure but there's a massive low
- 00:11:53hanging fruit there where you distill
- 00:11:55a lot of those inferencing capabilities
- 00:11:57into fine-tuned or more purpose-built
- 00:11:59models for each one of these tasks but
- 00:12:02how does the scanning work because I
- 00:12:05know that you also mention the agents
- 00:12:09will go until they run out of credit or
- 00:12:13something or until they hit their like
- 00:12:14spend limit when they're trying to root
- 00:12:17cause analysis some kind of a problem
- 00:12:21but I can imagine that you're not just
- 00:12:24continuously scanning or are you kicking
- 00:12:26off scans every x amount of seconds or
- 00:12:28minutes or days yeah so there are
- 00:12:31different parts to this if we do
- 00:12:33background scanning graph building we
- 00:12:35try and use more efficient models so
- 00:12:39because of the volume of data you don't
- 00:12:41use expensive models that are used for
- 00:12:43like you know very accurate reasoning
- 00:12:46yeah and so the costs all lower and so
- 00:12:47you set it like a daily budget of that
- 00:12:49and then you run up to the budget this
- 00:12:52is not something that's constantly
- 00:12:53running and processing large amounts of
- 00:12:55information think about it as like a
- 00:12:57human right you wouldn't process all
- 00:12:59logs and all information your Cloud
- 00:13:01infrastructure you just get like a lay
- 00:13:03of the land like like what are the most
- 00:13:05recent deployments what are the most
- 00:13:06recent conversations people are having
- 00:13:08it's like get like a play-by-play so
- 00:13:11that when an issue comes up you can
- 00:13:13quickly jump into action, you can do
- 00:13:14fast thinking, you can make the right
- 00:13:16decisions quickly but in investigation
- 00:13:19we set a cap we say per
- 00:13:23investigation let's say make it 10 cents
- 00:13:25or make it a dollar or whatever and then
- 00:13:28we tell the agent this is how much you've
- 00:13:30been assigned use it as best you can go
- 00:13:33find information that you need through
- 00:13:35your
- 00:13:35tools and they allow the human to say
- 00:13:38okay go a bit further or stop here I'll
- 00:13:41take over wow and so we bring the human
- 00:13:43in the loop as soon as the agent has
- 00:13:45something valuable to present to it so
- 00:13:48if the agent goes off on a quest and it
- 00:13:50finds almost nothing it can present that
- 00:13:52to the human, say, yep, nothing, or say okay
- 00:13:54couldn't find anything or just remain
- 00:13:56quiet depends on how you've configured
- 00:13:58it but it'll always stop at that budget limit
- 00:14:01yeah the benefit of it not finding
- 00:14:04anything also is that it will narrow
- 00:14:07down where the human has to go and
- 00:14:09search so now the human doesn't have to
- 00:14:12go and look through all this crap that
- 00:14:14the AI agent just looked through because
- 00:14:17ideally if the agent didn't catch
- 00:14:19anything it's hopefully not there and so
- 00:14:22the human can go and look in other
- 00:14:24places first and if they exhaust all
- 00:14:26their options they can go back and try
- 00:14:28and see where the agent was looking and
- 00:14:30see if that's where the problem
- 00:14:32is I think this comes back to the
- 00:14:34fundamental problem here and maybe we
- 00:14:37glassed over some of this like tools
- 00:14:41don't solve the problem of operation
- 00:14:43operations is an on call no amount of
- 00:14:46Datadogs or dashboards or kubectl
- 00:14:49commands will free your senior Engineers
- 00:14:52up from getting into the production
- 00:14:54environment
- 00:14:55so what we're trying to get to is end-to-
- 00:14:59end resolution when we find a problem
- 00:15:02can the agent go all the way multiple
- 00:15:05steps which today requires Engineers
- 00:15:07reasoning and judgment looking at
- 00:15:09different tools understanding tribal
- 00:15:11knowledge understanding why systems have
- 00:15:12been deployed we want to get the agents
- 00:15:15there but you can't start there because
- 00:15:17this is an unsupervised problem you
- 00:15:19can't just start changing things in
- 00:15:20production nobody would do that right
- 00:15:23now if you scale that back from
- 00:15:25resolution meaning change, like code-
- 00:15:27level change, Terraform things in your
- 00:15:31repos, if you walk it back from that
- 00:15:33it's understanding what the problem is
- 00:15:34and if you walk it back further from
- 00:15:35that it's search space reduction
- 00:15:37triangulating the problem into a
- 00:15:39specific area maybe not saying the line
- 00:15:41of code but saying here's the service or
- 00:15:43here's the cluster and that's already
- 00:15:45very compelling to a human or you can
- 00:15:47say it's not this these 400 other Cloud
- 00:15:50clusters or providers or Services is
- 00:15:53probably in this one and that is
- 00:15:56extremely useful to an engineer today so
- 00:15:59space reduction is one of the things
- 00:16:01that we are very reliable at and where
- 00:16:02we've started and we start in a kind of
- 00:16:05collaborative mode so we quickly reduce
- 00:16:08the search space we tell you what we
- 00:16:09checked and what we didn't and then we
- 00:16:11as an engineer we can say okay here's
- 00:16:12some more context go a bit further and
- 00:16:14try this piece of information and in
- 00:16:17that steering and then collaboration we
- 00:16:20learn from engineers and they teach us
- 00:16:22and we get better and better over time
- 00:16:23on this like road to
- 00:16:25resolution yeah I know you mentioned
- 00:16:27memory and I want to get into that in a
- 00:16:28sec but but keeping on the theme of
- 00:16:31money and cost and the Agents having
- 00:16:36more or less a budget that they can go
- 00:16:38expend and try and find what they're
- 00:16:40looking for do you see that agents will
- 00:16:44get stuck in recursive loops and then
- 00:16:46use their whole budget and not really
- 00:16:49get much of anything or is that
- 00:16:50something that was fairly
- 00:16:53common six or 10 months ago but now
- 00:16:57you've found ways to counterbalance that
- 00:17:01problem this problem space is one where
- 00:17:04small little additions to your or
- 00:17:06improvements to your product make a big
- 00:17:08difference over time because they
- 00:17:10compound we've learned a lot from
- 00:17:12the coding agents like SWE-agent and others
- 00:17:15so one of the things they found was that
- 00:17:17when the agent succeeds it succeeds very
- 00:17:19quickly, when it fails it fails very slowly so
- 00:17:22typically you can even see as a proxy
- 00:17:23has the agent run for 3 4 5 6 7 minutes
- 00:17:27it's probably wrong even if you don't
- 00:17:29score it at all and if it ran into like
- 00:17:32it came to a conclusion quickly like in
- 00:17:3330 seconds it's probably going to be
- 00:17:35right our agents sometimes do chase
- 00:17:38their tails so we have a confidence
- 00:17:40score and we have a critic at the end
- 00:17:42that assesses the agent so we try and
- 00:17:45not you know spam the human ultimately
- 00:17:48it's about attention and saving them
- 00:17:49time so if you keep throwing like bad
- 00:17:51findings and bad information they really
- 00:17:53they'll just rip you out of their
- 00:17:55production environment because it's
- 00:17:56going to be noisy right that's the last
- 00:17:57thing they want so yes depending on the
- 00:18:01use case the agent can go in a recursive
- 00:18:04loop or it can go in a direction that it
- 00:18:06shouldn't so for us a really effective
- 00:18:09mechanism to manage that is
- 00:18:12understanding where we're good and where
- 00:18:13we're bad so for each issue or event
- 00:18:16that comes in we do an enrichment and
- 00:18:17then we build the full context of that
- 00:18:19issue and then we look at have we seen
- 00:18:21this in the past, similar issues, have we
- 00:18:24solved, how have we solved this in the past, and
- 00:18:26have we had positive feedback and so if
- 00:18:27we fetch the right historical context
- 00:18:30get a good idea of our confidence on
- 00:18:31something before presenting that
- 00:18:33information to a human like the the
- 00:18:34ultimate set of findings but yeah
- 00:18:37sometimes it does go
- 00:18:39awry I'm trying to think is the
- 00:18:41knowledge graph something that you are
- 00:18:44creating once getting an idea the lay of
- 00:18:47the land and then there's almost like
- 00:18:51stuff doesn't really get updated until
- 00:18:53there's an incident and you go and you
- 00:18:55Explore More and what kind of knowledge
- 00:18:58graphs are using or you use many
- 00:18:59different knowledge graphs is it just
- 00:19:01one big one how does that even look in
- 00:19:04practice we originally started with one
- 00:19:06big Knowledge Graph the thing with these
- 00:19:08knowledge graphs is that they're often
- 00:19:10the first layer of them is deterministic
- 00:19:12methods so you can run kubectl and you
- 00:19:15can just walk the cluster with
- 00:19:17traditional techniques there's no AI
- 00:19:19or LLM involved but then you want to
- 00:19:22layer on top of that this the fuzzy
- 00:19:24relationships where you see this
- 00:19:25container has this sort of a reference
- 00:19:27to something over there or a ConfigMap
- 00:19:29mentions something that I've I've seen
- 00:19:32somewhere else and so what we've gone
- 00:19:35towards is a more layered approach so we
- 00:19:38have like multiple graph layers where
- 00:19:40some of them have a higher confidence
- 00:19:42and durability and can be updated
- 00:19:44quickly or perhaps using different
- 00:19:46techniques and then you layer on the
- 00:19:48more fuzzy layers on top of that or or
- 00:19:51different layers so you could use an LLM
- 00:19:52to kind of canvas the landscape between
- 00:19:55clusters or from a Kubernetes cluster to maybe
- 00:19:59the application layer or to the layers
- 00:20:00below but using smaller micro graphs has
- 00:20:03been easier for us from like a data
- 00:20:05management
- 00:20:07perspective what are other data points
- 00:20:09that you're then mapping out for the
- 00:20:11knowledge graph that can be helpful
- 00:20:13later on when the AI s re is trying to
- 00:20:19triage different
- 00:20:21problems in most teams there's an
- 00:20:2580/20, like Pareto, distribution of value
- 00:20:29um yeah so some of the key factors are
- 00:20:31often find in the same system I think it
- 00:20:33was M meta or yeah that's had some
- 00:20:37internal survey where they found out
- 00:20:39that 50 or 60% of their production
- 00:20:41issues were just due to config or code
- 00:20:43changes anything that disrupted their
- 00:20:45prod environment so if you're just
- 00:20:48looking at what people are deploying
- 00:20:49like you're following the humans you're
- 00:20:50going to probably find a lot of the
- 00:20:51problems so monitoring slack monitoring
- 00:20:55deployments is one of the most effective
- 00:20:57things to do looking at like releases or
- 00:21:01changes that people are scheduling and
- 00:21:03understanding those events so having an
- 00:21:05assessment of that and then in the
- 00:21:07resolution path there's also or the way
- 00:21:10to build the resolution looking at run
- 00:21:12books looking at how people have solved
- 00:21:14problems in the past
- 00:21:16like often what happens is like a slack
- 00:21:19thread is created right so the slack
- 00:21:21thread is like a contextual container
- 00:21:24for how do you go from a problem which
- 00:21:26somebody produced, to a
- 00:21:28solution
- 00:21:29and summarizing these Slack threads is
- 00:21:31extremely useful so you can basically
- 00:21:33say like this engineer came into this
- 00:21:35problem this was the discussion and this
- 00:21:37is the final conclusion and there's
- 00:21:38often like a PR attached to that so you
- 00:21:40can condense that down to almost like a
- 00:21:42guidance or like a run book yeah and
- 00:21:45attaching that into like novel scenarios
- 00:21:48is useful because it shows you how this
- 00:21:50team does things and they often
- 00:21:52contain tribal knowledge right so this
- 00:21:54is how we solve problems at our company
- 00:21:56we connect to our VPNs like this we
- 00:21:59access the systems like this these are the key
- 00:22:00systems right the most important
- 00:22:02systems in your production environment
- 00:22:03will be referenced by Engineers
- 00:22:05constantly yeah um often through shorthand
- 00:22:09notations um and if you speak to
- 00:22:11Engineers at most companies those will
- 00:22:14be the two biggest problems right one is
- 00:22:17you don't understand our systems and our
- 00:22:20processes and our context and the second
- 00:22:23one is that you don't know how to
- 00:22:24integrate or access these because
- 00:22:26they're custom and bespoke and homegrown
- 00:22:29and so those are the two challenges that
- 00:22:31we face as like agents basically we're
- 00:22:34like a new engineer on the team and you
- 00:22:35need to be taught by this engineering
- 00:22:37team if you're not taught then you're
- 00:22:39never going to succeed I hope that
- 00:22:41answered your question yeah and how do
- 00:22:43you overcome that you just are creating
- 00:22:47some kind of a glossary with these
- 00:22:50shorthand things that are
- 00:22:53fairly common within the organization or
- 00:22:56what yeah so there's multiple layers
- 00:22:58to this and I think this is quite an
- 00:23:01evolving space thankfully LLMs are
- 00:23:03pretty adaptive and forgiving in this
- 00:23:06regard so we can experiment with
- 00:23:08different ways to summarize different
- 00:23:09levels of granularity so we've looked at
- 00:23:11okay can you just take like a massive
- 00:23:13amount of information and just shove
- 00:23:15that into the context window give it in a
- 00:23:17relatively raw form and that works but
- 00:23:19it's quite expensive yeah and then you
- 00:23:21show it like more condensed form and you
- 00:23:23say this is just the like tip of the
- 00:23:25iceberg for any one of these topics you
- 00:23:27can query using this tool to get
- 00:23:29more information yeah and it's not
- 00:23:33always easy to know which one is the
- 00:23:36best because it's dependent on the issue
- 00:23:37at hand right because sometimes a key
- 00:23:39factor the needle in the haystack is buried one
- 00:23:41level deeper and the agent can't see it
- 00:23:44because it has to call a tool to get to
- 00:23:45it so we typically err on the side of
- 00:23:48spending more money and just having the
- 00:23:51agent see it and then optimizing cost
- 00:23:53and latency over time for us it's really
- 00:23:56about being valuable out of the gate
- 00:23:59Engineers should find this valuable and
- 00:24:01in that value the collaboration starts
- 00:24:04and then it creates a virtuous cycle
- 00:24:06where they feed us more information
- 00:24:07and they get
- 00:24:09more value because we take more grunt
- 00:24:11work off their plate and and it's it's
- 00:24:14like training a new person on your team
- 00:24:16if you see that oh this person is taking
- 00:24:18more and more tasks yeah I'll just get
- 00:24:20them more information I'll give them
- 00:24:21more scope yeah I want to go into a
- 00:24:23little bit of the ideas that you're
- 00:24:27talking about there like how you can
- 00:24:28interact with the agent and but I feel
- 00:24:32like the gravitational pull towards
- 00:24:35asking you about memory and how you're
- 00:24:38doing that is too strong so we got to go
- 00:24:40down that route first and
- 00:24:43specifically are you just caching these
- 00:24:47answers are you caching like successful
- 00:24:50runs how do you go about knowing that a
- 00:24:53something was successful and then where
- 00:24:55do you store it how do you like give
- 00:24:57that access
- 00:24:58or how do agents get access to that and they
- 00:25:01know that oh we've seen this before yeah
- 00:25:03cool boom it feels like that is quite
- 00:25:07complex in theory you would be like yeah
- 00:25:10of course we're just going to store
- 00:25:11these successful runs but then when you
- 00:25:13break it down and you say all right what
- 00:25:15does success mean and where are we
- 00:25:17going to store it and who's going to
- 00:25:20have access to that and how are we going to
- 00:25:21label that as successful like I was
- 00:25:23thinking how do you even go about
- 00:25:25labeling this kind of because is it
- 00:25:28you sitting there clicking and human
- 00:25:30annotating stuff or is it you're
- 00:25:33throwing it to another llm to say yay
- 00:25:36success what does it look like break
- 00:25:38that whole thing down for me because
- 00:25:40memory feels quite complex and that when
- 00:25:43you really look at
- 00:25:45it a big part of this is also the
- 00:25:48ux challenge because people don't want
- 00:25:50to just sit there and label I think
- 00:25:52people are just like especially
- 00:25:54Engineers are really tired of slop code
- 00:25:56and they're just being thrown this like
- 00:25:58slop and then they have to review they
- 00:26:00want to create and I think that's what
- 00:26:02we're trying to do is free them up from
- 00:26:03support but in doing so you don't want
- 00:26:05to get them to like constantly review
- 00:26:08your work with no benefit so that's the
- 00:26:11key thing there has to be interaction
- 00:26:13where there's implicit feedback and they
- 00:26:15get value out of that and so I'm getting
- 00:26:19to your point about memory so
- 00:26:21effectively there are three types of
- 00:26:23memory there's the like Knowledge Graph
- 00:26:25which captures the system State and the
- 00:26:27relations between
- 00:26:28things then there's episodic and
- 00:26:31procedural memory so the procedural
- 00:26:33memory is like how to ride a bicycle
- 00:26:35you've got your brakes here your
- 00:26:37pedal here it's like the guide It's
- 00:26:39almost like the Run book but the Run
- 00:26:41book doesn't describe for this specific
- 00:26:45issue that we had on this date what did
- 00:26:47we do the instance of that is the
- 00:26:50episode or the episodic memory and both
- 00:26:53of those need to be captured right so
- 00:26:55when we start we're indexing your
- 00:26:56environment getting all these like
- 00:26:58relations and things and then we also look
- 00:27:00at okay are there things that we can
- 00:27:02extract from this world where we've got
- 00:27:04procedures and then finally as we
- 00:27:08experience things or as we understand
- 00:27:10the experiences of others within this
- 00:27:12environment we can store those as well
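The three memory types William lays out, the knowledge graph, procedural memory, and episodic memory, might be modeled like this (a hypothetical sketch, not Cleric's actual schema):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class GraphEdge:
    # Knowledge graph: system state and the relations between things.
    source: str
    relation: str
    target: str

@dataclass
class Procedure:
    # Procedural memory: the general "how to ride a bicycle" guide,
    # like a runbook that is not tied to any one incident.
    name: str
    steps: list[str]

@dataclass
class Episode:
    # Episodic memory: what actually happened for a specific issue
    # on a specific date, and how it turned out.
    occurred_on: date
    problem: str
    actions: list[str]
    outcome: str
```

Keeping episodes separate from procedures is what lets new incidents be matched against concrete past experiences rather than only general guides.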
- 00:27:14we have really spent a lot of time and
- 00:27:17most companies care about this a lot
- 00:27:20securing data so we are deployed in your
- 00:27:23production environment and we only have
- 00:27:25read-only APIs so our agent cannot make
- 00:27:27changes it can only make suggestions
- 00:27:30you want to change that right
- 00:27:32later we'll talk about like how you want
- 00:27:35to eventually get to a different state
- 00:27:37but yeah continue yeah yeah we want to
- 00:27:39get to closed-loop resolution but that's a
- 00:27:42that's a longer part so we're storing
- 00:27:44all of these memories and I think
- 00:27:48the most valuable ones are the episodes right
- 00:27:50those are the like the instances like if
- 00:27:53this happened or this happened and I
- 00:27:54solved it in this way we had a bad rollout
- 00:27:57the cluster fell over we scaled it up
- 00:28:01and later we saw it was working
- 00:28:03and it was done and we did that two or
- 00:28:06three times and we think that's a good
- 00:28:08pattern like scaling is effective but
- 00:28:10that's all captured in the environment
- 00:28:14um of the customer our primary means
- 00:28:16of feedback is monitoring
- 00:28:19system health post change oh nice we
- 00:28:23can look at the system and see that this
- 00:28:26change has been effective and we can
- 00:28:27look at the code of the environment
- 00:28:29whether it's the application code or the
- 00:28:31infrastructure code basically as
- 00:28:33like a masking problem do we see or can
- 00:28:37we predict the change the human will
- 00:28:39make in order to solve this problem and
- 00:28:40if they do then make that change
- 00:28:42especially if it's a recommendation then
- 00:28:44we see that as a big green light look what
- 00:28:46we've done right they've actually
- 00:28:48approved our suggestion yeah that is not
- 00:28:52a super rich data source because the
- 00:28:54change that they make may be slightly
- 00:28:56different or we may not have access to
- 00:28:58those systems a more effective way is
- 00:29:02interaction so if we present findings
- 00:29:04and say Here's five findings and here's
- 00:29:06our diagnosis and you say this is dumb
- 00:29:09try something else then we know that was
- 00:29:10bad so we get a lot of negative examples
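Turning these implicit signals into labels, for example checking whether system health actually improved after a change landed, could be sketched as follows (the threshold and function names are illustrative assumptions, not Cleric's implementation):

```python
def change_was_effective(errors_before, errors_after, improvement=0.5):
    # Treat a change as a positive example only if the average error
    # rate drops by at least `improvement` (here 50%) after it lands.
    if not errors_before or not errors_after:
        return False
    baseline = sum(errors_before) / len(errors_before)
    current = sum(errors_after) / len(errors_after)
    if baseline == 0:
        return current == 0
    return current <= baseline * (1 - improvement)
```

As William notes, this kind of signal is sparse and lopsided, which is why a hand-labeled eval bench is still needed underneath it.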
- 00:29:13right so this is bad and so it's a
- 00:29:15little bit lopsided but then when you
- 00:29:17eventually say oh okay I'm going to
- 00:29:19approve this and I'm going to blast this
- 00:29:21out to the engineering team or I'm going
- 00:29:22to update my PagerDuty notes or I'm going
- 00:29:24to I want you to generate a pull request
- 00:29:27from the information then suddenly we've
- 00:29:30got like positive feedback on that in
- 00:29:32the user experience it's really an implicit
- 00:29:35source of information the interaction
- 00:29:37with the engineer and that gets attached
- 00:29:39to these memories But ultimately at the
- 00:29:42end of the day it's still a very sparse
- 00:29:44data set so these memories you you may
- 00:29:47not have true labels and so for us a
- 00:29:51massive investment has been our
- 00:29:53evaluation bench which is external from
- 00:29:56customers where we train our agents and
- 00:29:58we do a lot of really
- 00:30:00handcrafted labeling where even a
- 00:30:02smaller data set gets the agent to a
- 00:30:04much much higher degree of accuracy so
- 00:30:07you want a bit of both right you want
- 00:30:08the real production use cases with
- 00:30:09engineering feedback which does
- 00:30:12present good information but the eval
- 00:30:14bench is ultimately the firm
- 00:30:17foundation that gives you that coverage
- 00:30:18at the moment but it feels like the
- 00:30:21evals have to be specific to customers
- 00:30:24don't they and it also feels like each
- 00:30:26deployment of each agent has to be a bit
- 00:30:29bespoke and custom per agent or am I
- 00:30:34mistaken in that the patterns
- 00:30:37vary so the agents are pretty
- 00:30:39generalized the agents get contextual
- 00:30:41information per customer so it gets
- 00:30:44injected like localized customer
- 00:30:46specific procedures and memories and all
- 00:30:49those things but those are layered on the
- 00:30:52base which is developed inside of our
- 00:30:55product right like in the mothership or
- 00:30:57actually it's called the Temple of Cleric
- 00:31:00um so we distribute like new versions of
- 00:31:03cleric and our prompts our logic our
- 00:31:06reasoning generalized memories or
- 00:31:09approaches to solving problems are
- 00:31:12imbued in a Divine way into the cleric
- 00:31:14and it's sent up it's a layering
- 00:31:17challenge right because you do want to
- 00:31:18have cross cutting benefits to all
- 00:31:21customers and accuracy driven by the
- 00:31:23eval bench but also customization
- 00:31:27on their processes and like
- 00:31:29customer specific approaches all right
- 00:31:32so there's a few other things that are
- 00:31:36fascinating to me when it comes to the
- 00:31:37UI and the ux of how you're doing things
- 00:31:41specifically how you are very keen on
- 00:31:46not giving Engineers more alerts unless
- 00:31:50it absolutely needs to happen and I
- 00:31:53think that's something that I've been
- 00:31:54hearing since
- 00:31:562018 and it was all about alert fatigue and
- 00:32:00how when you have complex systems and
- 00:32:02you set up all of this monitoring and
- 00:32:04observability you inevitably are just
- 00:32:06getting pinged continuously because
- 00:32:09something is out of whack and so the
- 00:32:14ways that you made sure to do this and I
- 00:32:17thought this was fascinating is a have a
- 00:32:19confidence score so be able to say look
- 00:32:22we think that this is like this and
- 00:32:26we're giving it 75%
- 00:32:28confidence that this is going to happen
- 00:32:30or this could be bad or whatever it may
- 00:32:33be and then B if it is under a certain
- 00:32:38percent confidence score you just don't
- 00:32:41even tell anyone and you try and figure
- 00:32:43out if it isn't actually a problem and I'm
- 00:32:45guessing you continue working or you
- 00:32:47just forget about it explain that whole
- 00:32:51user experience and how you came about
- 00:32:53that yeah we realized because this is a
- 00:32:55trust building exercise we can't just
- 00:32:58respond with whatever we find and the
- 00:33:00Agents
- 00:33:02can sometimes just not be right
- 00:33:04especially during
- 00:33:06the onboarding phase they don't
- 00:33:07have the necessary access and they don't
- 00:33:09have the context right and so at least
- 00:33:11at the start when you're training the
- 00:33:13agent you don't want it to just spam you
- 00:33:15with these raw ideas and so the
- 00:33:17confidence score was one that I think a
- 00:33:20lot of teams are actually trying to
- 00:33:22build into their products as agent
- 00:33:23Builders it's extremely hard in this
- 00:33:26case because it's such an
- 00:33:28unsupervised
- 00:33:30problem I'm trying to not get into the
- 00:33:32gory details because there's a lot of like
- 00:33:34effort we've put into that like building
- 00:33:36this confidence score is a big part of
- 00:33:38our IP it's like how do we measure our own
- 00:33:41success you need a Divine name for the
- 00:33:45IP or something it's not your IP it's
- 00:33:48your what was it when Moses was up on
- 00:33:50the hill and he got the Revelation it
- 00:33:53was yeah this is not your IP this is
- 00:33:55your Revelations that you've had yeah
- 00:33:57but so the high level is basically
- 00:34:00that it's really driven by this data fly
- 00:34:04wheel it's really driven by experience
- 00:34:06and that's also how an engineer does
- 00:34:08things but those can be again like two
- 00:34:10layered like from the base layers of the
- 00:34:12product but also experiences in this
- 00:34:15company so we do use an LLM for self
- 00:34:18assessment but it's also driven and
- 00:34:20grounded by existing experiences so we
- 00:34:24inject a lot of those experiences and
- 00:34:26whether those are positive or negative
- 00:34:28outcomes and as an engineer you can set
- 00:34:32the threshold so you can say oh nice
- 00:34:35only extremely high relevance findings
- 00:34:38or diagnoses should be shown and you
- 00:34:41can set the conciseness and specificity
- 00:34:44so you can say I just wanted one
- 00:34:45sentence or just give me a word or give
- 00:34:49me all the raw information so what we do
- 00:34:53today is we're very asynchronous so an
- 00:34:56alert fires we'll go off and find
- 00:34:58whatever information we can and come
- 00:34:59back if we're confident we'll respond if
- 00:35:02not we'll just be quiet but then you can
- 00:35:05engage with us in a synchronous way so
- 00:35:07it starts async and then you can kick
- 00:35:09the ball back and forth in a synchronous
- 00:35:11way and
- 00:35:14in the synchronous mode it's very
- 00:35:16interactive and and lower latency we
- 00:35:18will almost always respond if you ask us
- 00:35:21a question we'll respond so then the
- 00:35:22confidence score is less important
- 00:35:24because then it's like the user is
- 00:35:26refining that answer saying go back try
- 00:35:28this go back try this but for us the key
- 00:35:31thing is we have to come back with good
- 00:35:33initial findings and that's why the
- 00:35:35confidence score is so important but
- 00:35:36again it's really driven by
- 00:35:39experiences just to like reiterate
- 00:35:41like why this is such a complex problem
- 00:35:44to solve you can't just take a
- 00:35:46production environment and say okay I'm
- 00:35:48going to spin this up in a Docker
- 00:35:49container and reproduce it at a specific
- 00:35:51point in time at many companies you
- 00:35:53can't even do a load test across services
- 00:35:56it's so complex it's all different
- 00:35:57different teams they're all interrelated
- 00:35:59you can do this for a small startup
- 00:36:01with one application running on Heroku
- 00:36:02or Vercel but doing this at scale is
- 00:36:05virtually impossible at most companies
- 00:36:07so you don't have that ground truth
- 00:36:10you can't say with 100% certainty
- 00:36:12whether you're right or wrong and that's
- 00:36:13just the state we're in right now
- 00:36:15despite that the confidence score has
- 00:36:18been a very powerful technique to at
- 00:36:21least eliminate most false positives
- 00:36:25and when we know that we
- 00:36:26don't have anything of substance just
- 00:36:29being
- 00:36:30quiet but how do you know if you got
- 00:36:35enough
- 00:36:36information when you were doing the scan
- 00:36:39or you were doing the search to go back
- 00:36:42to the human and give that information
- 00:36:46and also how do you know that you are
- 00:36:49fully understanding what the human is
- 00:36:52asking for when you're doing that back
- 00:36:53and forth honestly this is one of the
- 00:36:56key parts that's very challenging
- 00:36:58a human will say the checkout
- 00:37:01service is down and you need to know
- 00:37:04that they are probably maybe based on
- 00:37:07who the engineer is talking about
- 00:37:10production or if they've been talking
- 00:37:12about developing a new feature they're
- 00:37:14probably talking about the dev
- 00:37:15environment and if you go down the wrong
- 00:37:17path then you can spend some money and
- 00:37:20like a lot of time investigating something
- 00:37:22that's useless so what we do is even at
- 00:37:24the initial message that comes in we
- 00:37:27will ask a clarifying question
- 00:37:29if we are not sure about what you're
- 00:37:31asking if you've not been specific
- 00:37:33enough and most agent builders even
- 00:37:35Cognition's Devin do this so that
- 00:37:38initially they'll say okay do you mean X
- 00:37:39Y and Z okay this is my plan okay I'm
- 00:37:41going to go do it now so there is the
- 00:37:44sense of confidence built into these
- 00:37:45products at the UX layer and that's
- 00:37:47where we are right now with ChatGPT
- 00:37:50you can sometimes say or with Claude
- 00:37:52something very inaccurate or vague and it
- 00:37:56can probably guess the right answer
- 00:37:58because the cost is not multi-step right
- 00:38:00it's very cheap you can just quickly fix
- 00:38:02your text but for us we have to
- 00:38:05short-circuit that and make sure that you're
- 00:38:06specific enough in your initial
- 00:38:08instructions and then over time loosen
- 00:38:10that a bit as we understand a bit more
- 00:38:12what your teams are doing what things
- 00:38:13are what you're up to you can be more
- 00:38:16vague but for now it requires a bit more
- 00:38:18specificity and
- 00:38:20guidance speaking of the multi-turns
- 00:38:24and spending money for things or trying
- 00:38:27to not waste money and going down the
- 00:38:31wrong tree branch or Rabbit Hole how do
- 00:38:34you think about pricing for agents is it
- 00:38:38all consumption based are you looking at
- 00:38:40what the price of an SRE would be and
- 00:38:43you're saying oh we'll price a
- 00:38:44percentage of that because we're saving
- 00:38:46you time like what in your mind is the
- 00:38:50right way to base off of
- 00:38:55pricing well we're trying to build a
- 00:38:57product that Engineers love to use and
- 00:38:59so we want it to be a toothbrush we want
- 00:39:01it to be something that you reach for
- 00:39:03instead of your observability platform
- 00:39:05instead of going into the console so for
- 00:39:07us usage is very important so we don't
- 00:39:10want to have procurement stand in the
- 00:39:11way necessarily but the reality is there
- 00:39:14are costs and this is a business and we
- 00:39:17want to add value and money is how you
- 00:39:19show us that we're valuable so the
- 00:39:22original idea with agents was that there
- 00:39:24would be this augmentation of
- 00:39:26engineering teams and that you could
- 00:39:29charge some order of magnitude less but
- 00:39:31a fraction of engineering headcount or
- 00:39:34employee headcount by augmenting teams I
- 00:39:37think the jury is still out on that I
- 00:39:38think most agent Builders today are
- 00:39:41pricing to get into production
- 00:39:45environments or into these systems that
- 00:39:47they need to use to solve problems to
- 00:39:50get close to their Persona and if you
- 00:39:52look at what Devon did I think they also
- 00:39:54started at 10K per year or some such pricing
- 00:39:58and I think it's now like 500 a month
- 00:40:00but it's mostly a consumption based
- 00:40:02model so you get some committed amount
- 00:40:05of compute hours that is effectively
- 00:40:08giving you time um to use the product
- 00:40:11for us we're also orienting around that
- 00:40:14model so because we're not GA our
- 00:40:16pricing is still a little bit in
- 00:40:18flux and we're working with our initial
- 00:40:20customers to figure out like what do
- 00:40:21they think is reasonable what do they
- 00:40:22think is fair but I think we're going to
- 00:40:25land on something that's mostly similar
- 00:40:27to the Devon model where it's usage
- 00:40:29based we don't want Engineers to think
- 00:40:32about okay there's an investigation it's
- 00:40:34going to cost me X they should just be
- 00:40:36able to just run it and just see this is
- 00:40:38valuable or not and increase usage but
- 00:40:40it will be something about like a tiered
- 00:40:42amount of compute that you can use so
- 00:40:45maybe you get 5,000 investigations a
- 00:40:48month or something in that
- 00:40:50order okay nice yeah because that's what
- 00:40:53instantly came to my mind was you want
- 00:40:57folks to just reach for this and use it
- 00:41:00as much as possible but if you are on a
- 00:41:04usage based pricing then
- 00:41:07inevitably you're going to hit that
- 00:41:10friction where it's yeah I want to use
- 00:41:12it but H it's going to cost me yeah yeah
- 00:41:16so you do want to have a committed
- 00:41:18amount set aside at the front and we're
- 00:41:21also exploring like having a free tier
- 00:41:23or like a free band maybe the first X is
- 00:41:27just you can just kick the tires and try
- 00:41:29it out and as you get to higher limits
- 00:41:31then you can set additional limits so we
- 00:41:35haven't even talked about tool usage but
- 00:41:37that's another piece that feels like it
- 00:41:40is so complex because you're using tools
- 00:41:45you're using an
- 00:41:47array of tools and how do you tap into
- 00:41:50each of these tools right because it's
- 00:41:52if you're looking at logs or are you
- 00:41:57syncing directly with the data dogs of
- 00:42:00the world how do you see tool usage for
- 00:42:05this and what have been some
- 00:42:06specifically hard challenges to overcome
- 00:42:08in that
- 00:42:10Arena again this kind of goes back to
- 00:42:12why this is so challenging and
- 00:42:14especially one of the key things that
- 00:42:16we've seen is Agents solve problems very
- 00:42:18differently from humans but they need a
- 00:42:20lot of the things humans need they need
- 00:42:22the same tools if you're storing all of
- 00:42:24your data in Datadog we may not be
- 00:42:26able to find all the information we need
- 00:42:27to solve a problem by just looking at
- 00:42:29your actual application running and your
- 00:42:30cloud in front so we need to go to data
- 00:42:32do so we need access there and so
- 00:42:34engineering teams give us that access if
- 00:42:37you've then constructed a bunch of
- 00:42:39dashboards and metrics then and that's
- 00:42:42how you've laid out let's say your
- 00:42:44run books and your processes to debug
- 00:42:46issues we need to do things like look at
- 00:42:49multiple charts or graphs and infer
- 00:42:52across those in the time ranges that an
- 00:42:54issue happened what are the anomalies
- 00:42:56that happened across multiple services
- 00:42:58so if two of them are spiking in CPU
- 00:43:01they're interrelated so we should look at the
- 00:43:03relations between them but these are
- 00:43:05extremely hard problems for LLMs to solve
- 00:43:08even vision models they're not
- 00:43:10purpose-built for that so
- 00:43:14when it comes to Tool
- 00:43:15usage LLMs or Foundation models are
- 00:43:19good at certain types of information
- 00:43:21especially semantic ones so code config
- 00:43:24logs they're slightly less good at
- 00:43:28traces but also pretty decent but they
- 00:43:31really suck at metrics they really suck
- 00:43:33at time series so it's really dependent
- 00:43:36on your observability stack how useful
- 00:43:39it's going to be because as humans we
- 00:43:41just sit back and look at a bunch of
- 00:43:42dashboards we can see like pattern
- 00:43:44matching instantly you can see like
- 00:43:46these are spikes but for an LLM it sees
- 00:43:49something different so what we'll find
- 00:43:51is over time these observability tools
- 00:43:54at least will probably become less and
- 00:43:56less
- 00:43:57human Centric and may even become
- 00:44:00redundant um you may see completely
- 00:44:03different means of diagnosing problems and I
- 00:44:06think the honeycomb approach the trace
- 00:44:08based approach with these high
- 00:44:09cardinality events is probably the thing
- 00:44:12that I put my money on as the dominant
- 00:44:15pattern that I see winning can
- 00:44:19you explain that real fast I don't know
- 00:44:21what that is so so basically what they
- 00:44:22do is what Charity Majors and some of
- 00:44:25these others have been promoting for
- 00:44:26years is logging out traces but with
- 00:44:31Rich events attached to these so you
- 00:44:34basically can follow like a request
- 00:44:35through your whole application stack and
- 00:44:39um you can log out like a complete
- 00:44:42object payload at multiple steps along
- 00:44:44the way and store that in a system where
- 00:44:46you can query all the information so
- 00:44:48you've got the point and time you've got
- 00:44:50the whole like tree of the trace as well
- 00:44:53and then at each point you can see the
- 00:44:55individual attributes and Fields
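A wide, high-cardinality trace event of the kind just described, one rich record per unit of work rather than a pre-aggregated metric, might look like this (all field names and values are made up for illustration):

```python
# One "wide event": the full trace context plus arbitrary
# high-cardinality attributes attached to a single unit of work.
event = {
    "trace_id": "abc123",
    "span": "checkout/charge-card",
    "duration_ms": 412,
    "user_id": "u-9321",        # high-cardinality fields are welcome here
    "cart_value_usd": 83.50,
    "db_shard": "payments-7",
    "error": "card_declined",
}

def query(events, **filters):
    # Slice stored events by any attribute after the fact,
    # instead of relying on dashboards chosen in advance.
    return [e for e in events if all(e.get(k) == v for k, v in filters.items())]
```

Because each event keeps its raw attributes, an agent (or a human) can ask arbitrary questions later rather than pattern-matching on pre-drawn charts.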
- 00:44:57and so you get a lot more detail in that
- 00:45:00versus if you look at a time series
- 00:45:01you're basically seeing okay CPU is
- 00:45:03going up CPU goes down and what can you
- 00:45:06clean from that you basically have to
- 00:45:08like it's like witchcraft trying to find
- 00:45:11the root cause right but the
- 00:45:16Datadogs of the world have been
- 00:45:17making a lot of money um selling
- 00:45:19consumption and selling the Witchcraft
- 00:45:21to Engineers for years and so there's a
- 00:45:23real incentive to keep this
- 00:45:25going but I think as agents become
- 00:45:27more dominant we'll see them gravitate
- 00:45:30to the most valuable sources of
- 00:45:32information and then if you give your
- 00:45:34agent more and more scope you'll see
- 00:45:36Datadog is rarely involved in these
- 00:45:39root causes so why are we still paying
- 00:45:41for them so I'm not sure what it's going
- 00:45:43to look like in the next two or three
- 00:45:44years but it's going to be interesting
- 00:45:46how things play out as agents become the
- 00:45:49go-to for diagnosing and solving
- 00:45:52problems yeah I hadn't even thought
- 00:45:54about that how for human usage like
- 00:45:57maybe data dog is set up wonderfully
- 00:46:00because we look at it and it gives us
- 00:46:02everything we need and we can root cause
- 00:46:05it very quickly by pattern matching but
- 00:46:07if that turns out to be one of the
- 00:46:09harder things for agents to do instead
- 00:46:12of making an agent better at
- 00:46:15understanding metrics maybe you just
- 00:46:17give it different data and so that it
- 00:46:21can root cause it without those metrics
- 00:46:23and it will shift away from reading the
- 00:46:27information from those
- 00:46:29Services yeah if you look at like chess
- 00:46:31and the AI and like the Stockfishes of
- 00:46:34the world that's just one AI
- 00:46:36that plays against
- 00:46:39Grandmasters mhm even the top players have
- 00:46:42learned from the AI so they know that a
- 00:46:45pawn push on the side has been extremely
- 00:46:49powerful or a rook lift has been very
- 00:46:52powerful so now like the the top players
- 00:46:54in the world adopt these techniques they
- 00:46:56learn from the AIs but that's also
- 00:46:58because it's always a human in the loop
- 00:46:59we still want to see people playing
- 00:47:01people but if you just leave it up to
- 00:47:02the AI like the way they play the game
- 00:47:04is completely different they see things
- 00:47:05that we don't and I know I didn't
- 00:47:08answer your question at the start fully but these
- 00:47:11tools are grounding actions for us so
- 00:47:13the observability stack is one of them
- 00:47:14but ultimately we build a complete
- 00:47:18abstraction over the production
- 00:47:19environment so the agent uses these
- 00:47:23tools and learns how to use these tools
- 00:47:25and knows which tools are the most
- 00:47:26effective
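That tool abstraction, one interface the agent learns with interchangeable backends behind it, might be sketched like this (every class and method name here is hypothetical, including the `search_logs` client call):

```python
from abc import ABC, abstractmethod

class LogTool(ABC):
    # The agent only ever sees this interface; it never knows
    # which backend is answering.
    @abstractmethod
    def search(self, query: str) -> list[str]: ...

class DatadogLogs(LogTool):
    # Production backend wrapping a vendor client
    # (search_logs is a hypothetical client method, not a real API).
    def __init__(self, client):
        self.client = client
    def search(self, query):
        return self.client.search_logs(query)

class FakeLogs(LogTool):
    # Eval-bench backend: scripted logs for a chaos scenario, so the
    # agent can be moved into the fake world without noticing.
    def __init__(self, canned):
        self.canned = canned
    def search(self, query):
        return [line for line in self.canned if query in line]
```

Because both backends satisfy the same interface, the behavior the agent learns against one transfers to the other.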
- 00:47:27but we also build a transferability
- 00:47:31layer so you can shift the agent from
- 00:47:33the real production environment into the
- 00:47:35eval stack and it doesn't even know that
- 00:47:37it's running in an eval stack it's not
- 00:47:39suddenly just looking at like fake
- 00:47:40services fake Kubernetes clusters fake
- 00:47:43Datadogs fake scenarios or fake worlds so
- 00:47:47these tools are an incredibly important
- 00:47:49abstraction it's one of the key
- 00:47:50abstractions that the agent needs and
- 00:47:52honestly it's memory management and
- 00:47:54tools are the two big things that agent
- 00:47:57teams should be focusing on I'd say right
- 00:47:59now wait why do you switch it to this
- 00:48:02fake
- 00:48:03world because that's where you've got
- 00:48:05full control that's where you can
- 00:48:06introduce your own scenarios your own
- 00:48:08chaos and stretch your agent but if you
- 00:48:13do so in a way where the tools are
- 00:48:15different the worlds are different
- 00:48:16experience are different there's less
- 00:48:18transferability when you then take it
- 00:48:20into the production environment and
- 00:48:21suddenly it's going to fall flat so you
- 00:48:23want the like a a real simile of the
- 00:48:27environment in your your tool or your
- 00:48:29eval
- 00:48:30bench and are you doing any type of
- 00:48:34chaos engineering to just see how the
- 00:48:36agents
- 00:48:38perform yes that's pretty much where our
- 00:48:40eval stack is it's chaos we produce a
- 00:48:43world in which the reproduce chaos and
- 00:48:45then we say given this problem what's up
- 00:48:49what's the underlying cause and we see
- 00:48:51how close we can get to the Dost string
- 00:48:52cause yeah
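The chaos-bench loop just described can be sketched roughly like this: inject a fault with a known ground-truth cause into the simulated world, ask the agent for a diagnosis, and score how close it gets. The scenario names, the scoring rule, and the stub agent are all assumptions for illustration, not Cleric's actual benchmark.

```python
# Hypothetical chaos scenarios, each with a known ground-truth cause.
SCENARIOS = [
    {"inject": "kill_db_pod",     "ground_truth": "database pod OOM-killed"},
    {"inject": "revoke_iam_role", "ground_truth": "service lost IAM permissions"},
    {"inject": "fill_disk",       "ground_truth": "node disk pressure evicted pods"},
]

def score(diagnosis: str, ground_truth: str) -> float:
    """Crude token-overlap score; a real bench would use a stronger judge."""
    d, g = set(diagnosis.lower().split()), set(ground_truth.lower().split())
    return len(d & g) / len(g)

def run_bench(agent, scenarios=SCENARIOS):
    """Inject each fault, collect the agent's diagnosis, return mean score."""
    results = []
    for s in scenarios:
        diagnosis = agent(s["inject"])  # agent investigates the injected chaos
        results.append(score(diagnosis, s["ground_truth"]))
    return sum(results) / len(results)

# Stand-in agent stub that only recognizes one failure mode.
def stub_agent(fault):
    return {"kill_db_pod": "the database pod was OOM-killed"}.get(fault, "unknown")

print(f"mean score: {run_bench(stub_agent):.2f}")  # → mean score: 0.33
```

The key property is that the bench is scored against a known injected cause, so "how close we can get" becomes a number you can track across agent versions.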
- 00:48:54 Perfect opportunity for an incredible name, like Lucifer. This is the seventh layer of hell. I don't know, something along those lines.
- 00:49:08 Yeah, we've got some ideas; there's a blog post coming that will add some more layers to this idea, so TBD. I think one thing to note is that this is a very deep space. If you look at self-driving cars, lives are on the line, so people care a lot, and you have to hit a much higher bar than a human driving a car. It's very similar in this space, right? These production environments are sacred; they are important to these companies. If they go down, or if there's a data breach or anything, their business is on the line. CTOs really care. The bar that we have to hit is very high, and so we take security very seriously. But the whole product that we're building requires a lot of care, and there's a lot of complexity that goes into that. So I think it's extremely compelling as an engineer to work in this space, because there are so many compelling problems to solve: the knowledge graph building, the confidence scoring, how you do evaluation, how you learn from these environments and build them into your core product, the tooling layers, the chaos benches, all these things, and how you do that in a reliable, repeatable way. I think that's the other big challenge: if you're on AWS or GCP, using this stack or a different stack, if you're going from e-commerce to gaming to social media, how generalizable is your agent? Can you just scale it, or can you only solve one class of problem? So that's one of the things we're really leaning into right now: the repeatability of the product, and scaling this out to more and more enterprises. But yeah, I'd say it's an extremely complex problem to solve, and even though we're valuable today, true resolution, end-to-end resolution, is maybe multiple years out, just like self-driving cars: it took years to get to a point where we've got Waymos on the roads.
- 00:50:57 Yeah, that's what I wanted to ask you about: true resolution, and how that just scares me to think about, first of all. And I don't have anything running in production, let alone a multi-million dollar system, so I can only imagine that you would encounter a lot of resistance when you bring that up to engineers.
- 00:51:23 Surprisingly, no. There's definitely hesitation, but the hesitation is mostly based on uncertainty: what exactly can you do? And if you show them that we literally can't change things, we don't have the access, the API keys are read-only, we're constrained to these environments, and if you introduce change through the processes they already have, so pull requests, and there are guardrails in place, then they're very open to those ideas. I think a big part of this is that engineers really hate infra and support, so they yearn for something that can help free them from that. But it's a progressive trust-building exercise. We've spoken to quite a lot of enterprises, and almost all of them have different classes of sensitivity. You have your big-fish customers, for example, whose critical systems you don't want to touch, but then you've got your internal Airflow deployments, your CI/CD, your GitLab deployment. If that thing falls over, we can scale it up, or we could try to make a change with zero customer impact. So the areas where we're really helping teams today are the lower-severity or low-risk places where we can make changes, and if you're crushing those changes over time, then engineers will introduce you to the more high-value places. But yes, right now we're steering clear of the critical systems, because we don't want to make a change that is dangerous.
- 00:52:48 Yeah, and it just feels like it's too loaded. So even if you are doing everything right, because it is so high maintenance, you don't want to stick yourself in there just yet. Let the engineers bring you in when they're ready and when you feel like it's ready.
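The guardrail posture described, read-only by default, with proposed changes routed through the team's existing review process and only for systems classified as low-risk, might be sketched as a simple policy like the following. The tier names and functions are illustrative assumptions, not Cleric's implementation.

```python
from dataclasses import dataclass

@dataclass
class System:
    name: str
    tier: str  # assumed tiers: "critical" | "internal" | "low_risk"

def propose_change(system: System, change: str) -> str:
    """Gate agent actions by sensitivity class, never acting directly."""
    if system.tier == "critical":
        # Critical systems: surface a diagnosis only, no change proposed.
        return f"DIAGNOSIS ONLY for {system.name}: {change}"
    # Lower-risk systems: open a pull request so humans review the change
    # through the process they already trust.
    return f"OPEN PR against {system.name}: {change}"

print(propose_change(System("payments-db", "critical"), "restart replica"))
# → DIAGNOSIS ONLY for payments-db: restart replica
print(propose_change(System("internal-airflow", "low_risk"), "scale up workers"))
# → OPEN PR against internal-airflow: scale up workers
```

The point of the design is that trust is earned per tier: the agent only ever widens its blast radius when engineers reclassify a system, not by changing its own policy.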
- 00:53:10 I can see that for sure. Yeah, also behaviorally, engineers won't change the tools they reach for, or their processes, in a wartime scenario. When it's a relaxed environment, they're willing to try AI, experiment with it, and adopt it, but in a critical situation they don't want to introduce an AI and add more chaos into the mix, right? So they want something that reduces the uncertainty.
- 00:53:34 Yeah, that reminds me of one of the major things I notice whenever I'm working with agents or building systems that involve AI: the prompts can be the biggest hang-ups. Obviously I'm not building a product that relies on agents most of the time, so I don't have the drive to see it through, but a lot of times I will fiddle with prompts for so long that I get angry, because I feel like I should just do the thing that I am trying to do, and not get AI to do it. I don't really have an answer for you; that's just the nature of the beast.
- 00:54:31 Yes, exactly. I do want to double-click and say everybody has that problem; everybody struggles with that. You don't know if you're one prompt change away or twenty, and they're very good at making it seem like you're getting closer and closer, but you may not be. We found success in building frameworks for evaluations, so that we can at least extract, either from production or evals, the samples, the ground truth, that gives us confidence we're getting to the answer. Otherwise you can just go forever, right? Just tweaking things and never getting there.
- 00:55:07 That's it, and that's frustrating, because sometimes you take one step forward and two steps back, and you're like, oh my God. It's quite hard with content creation; I think it's harder in your space. I have all but stopped using it for content creation, that's for sure. Maybe to help me fill up a blank page and get directionally correct, but for the most part, yeah, I don't like the way it writes. Even if I prompt it to the maximum, it doesn't feel like it gives me deep insights, so I stopped.
- 00:55:42 But you're still on GPT-3.5, right?
- 00:55:46[Music]
- AI
- SRE
- Knowledge Graphs
- Cleric AI
- Diagnostics
- Root Cause Analysis
- Confidence Scoring
- Operational Complexity
- Chaos Engineering
- Memory Management