Autonomous AI SRE: The Future of Site Reliability Engineering
Summary
TL;DR: In this podcast episode, host Demetrios interviews William, CTO of Cleric AI, discussing the role of AI Site Reliability Engineers (SREs) and the challenges they face in dynamic production environments. They delve into the use of knowledge graphs to diagnose root causes of system issues efficiently. William shares insights on confidence scoring, memory management in AI, tool integrations, and chaos engineering, highlighting the complexity of operational environments and how Cleric aims to enhance engineer productivity without increasing alert fatigue. They also touch on pricing strategies that encourage usage while managing operational costs.
Takeaways
- ☕️ Cleric AI is solving tough problems in AI and infrastructure.
- 🧠 Knowledge graphs aid in diagnosing issues efficiently.
- 🔍 Confidence scoring helps prioritize alerts for engineers.
- 💡 AI agents learn from episodic and procedural memories.
- 📊 Usage-based pricing is being explored to optimize adoption.
- ⚙️ Integration with tools like Datadog is crucial for performance.
- 🌪️ Chaos engineering is used to test AI robustness in production.
- ⚠️ Engineers are cautious about AI making changes to critical systems.
- 📈 The complexity of AI in dynamic environments poses unique challenges.
- 🤝 Building trust is essential for AI adoption among engineers.
Timeline
- 00:00:00 - 00:05:00
Introduction of the hosts and the context of the discussion focusing on AI SRE and knowledge graphs; William's background with Cleric AI and Feast. Begins with some light notes about a Christmas sweater and caffeine consumption, introducing a casual vibe to the conversation.
- 00:05:00 - 00:10:00
Discussion of AI SRE being a complex problem tied to MLOps; highlighted the differences between development and production environments, and how operational complexity increases with deployment across different applications, leading to challenges in root cause analysis.
- 00:10:00 - 00:15:00
Exploration of how agent systems are being built with modular components, balancing understanding and responsibility versus the productivity gains, leading to operational instability in larger organizations where pressure is high.
- 00:15:00 - 00:20:00
Innovation through the use of knowledge graphs to diagnose incidents within systems. Discussion emphasizes the complexity of relationships within IT infrastructures, and how mappings of even small clusters begin to reveal potential issues.
- 00:20:00 - 00:25:00
Deep dive into how agents scan systems to identify problems prior to alerts being raised, and the necessity for ongoing updating of knowledge graphs to maintain relevancy in fast-paced production environments.
- 00:25:00 - 00:30:00
Relationship between proactive monitoring of production state and agents' ability to query information; the graph's dual role in driving effective diagnosis and enabling exploration of root causes through structured data.
- 00:30:00 - 00:35:00
Understanding the background job of graph building and updating during investigations; the potential for agents to uncover hidden issues through continuous environment scanning, including the cost impacts of these processes.
- 00:35:00 - 00:40:00
Confidence scoring approach for agents ensuring they don't overwhelm engineers with false positives, emphasizing trust and utility; integrating human feedback into performance evaluations to boost efficiency.
- 00:40:00 - 00:45:00
The challenge of decision-making in AI-driven environments, where agents need to discern context to retrieve useful information without causing disruption; exploring trade-offs between human input and automated systems.
- 00:45:00 - 00:50:00
Agents learning from experience, including procedural and episodic memory management creating layered structures; feedback-driven iterations enhance performance, building a feedback loop with engineers to improve knowledge repositories.
- 00:50:00 - 00:55:58
Examining market pricing strategies for AI agents; approaching pricing based on usage while ensuring engineers can operate without cost-induced anxiety over their investigative actions. Emphasis on a thoughtful revenue model to drive engagement instead of risking underutilization.
Video Q&A
What is Cleric AI doing?
Cleric AI is building an AI Site Reliability Engineer that uses knowledge graphs and AI agents to diagnose root causes of issues in production environments.
What is an AI SRE?
An AI SRE refers to an AI Site Reliability Engineer who uses AI technologies to manage and maintain the reliability of systems.
What challenges do AI agents face in production environments?
Key challenges include the dynamic nature of systems, the unsupervised nature of problems, and the complexity of understanding relations among various components.
How does the knowledge graph help in troubleshooting?
The knowledge graph maps relationships and dependencies in the production environment, helping diagnose root causes of issues efficiently.
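As a rough illustration of why such a graph helps (a toy sketch, not Cleric AI's actual implementation; all service names are hypothetical), an agent can walk a dependency graph upstream from a failing component to rank root-cause candidates instead of probing the whole environment from first principles:

```python
from collections import deque

# Hypothetical mini knowledge graph: each component maps to what it
# depends on. Real graphs also span clouds, nodes, pods, and code.
DEPENDS_ON = {
    "frontend": ["checkout-service"],
    "checkout-service": ["payment-service", "product-db"],
    "payment-service": ["payment-db", "vault-secrets"],
}

def upstream_candidates(graph, failing_node):
    """Breadth-first walk from a failing component through its
    transitive dependencies: the candidate root causes, nearest-first."""
    seen, queue, order = set(), deque([failing_node]), []
    while queue:
        node = queue.popleft()
        for dep in graph.get(node, []):
            if dep not in seen:
                seen.add(dep)
                order.append(dep)
                queue.append(dep)
    return order

candidates = upstream_candidates(DEPENDS_ON, "frontend")
# candidates: ["checkout-service", "payment-service", "product-db",
#              "payment-db", "vault-secrets"]
```

With this map the agent checks `checkout-service` before `payment-db`, which is the search-space reduction discussed in the episode.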
What role does confidence scoring play?
Confidence scoring helps prioritize alerts and determine the reliability of findings before presenting them to engineers.
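The episode does not specify Cleric's scoring formula, but the idea of gating findings on similarity to past, positively-reviewed investigations might be sketched like this (the token-overlap similarity and the 0.3 floor are invented for illustration):

```python
def jaccard(a, b):
    """Token-overlap similarity between two issue descriptions."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

# Hypothetical history of past investigations and engineer feedback
# (1.0 = finding confirmed useful, 0.0 = rejected as noise).
HISTORY = [
    {"issue": "pod oom killed memory limit exceeded", "feedback": 1.0},
    {"issue": "tls certificate expired on ingress", "feedback": 0.0},
]

def confidence(new_issue, history, floor=0.3):
    """Weight similarity to the closest past issue by the feedback it
    received; below the floor, stay quiet rather than spam engineers."""
    best = max(history, key=lambda h: jaccard(new_issue, h["issue"]))
    score = jaccard(new_issue, best["issue"]) * best["feedback"]
    return score, score >= floor
```

A repeat of a previously rejected finding scores zero even on an exact match, which is one simple way to keep false positives from eroding trust.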
How does Cleric AI handle memory?
Cleric AI uses episodic and procedural memories to learn from past actions and improve its diagnostics based on feedback.
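A toy sketch of that layering (the promotion rule and names are invented, not Cleric's design): episodic memory keeps raw investigation records, and a fix is promoted into procedural memory only after repeated confirmed successes:

```python
from collections import Counter

class AgentMemory:
    """Two-layer memory: `episodic` holds raw (symptom, fix, success)
    records; `procedural` holds fixes trusted enough to reuse."""

    def __init__(self, promote_after=2):
        self.episodic = []
        self.procedural = {}
        self.promote_after = promote_after

    def record(self, symptom, fix, success):
        self.episodic.append((symptom, fix, success))
        # Count confirmed successes of this fix for this symptom.
        wins = Counter(f for s, f, ok in self.episodic
                       if s == symptom and ok)
        if success and wins[fix] >= self.promote_after:
            self.procedural[symptom] = fix

    def recall(self, symptom):
        return self.procedural.get(symptom)
```

A single success is treated as an episode, not a rule; only repetition plus positive feedback turns it into something the agent will reach for first.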
What pricing model is Cleric AI exploring?
Cleric AI is considering a usage-based pricing model to encourage adoption while covering operational costs.
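Later in the episode William describes capping spend per investigation (for example ten cents or a dollar) and stopping at the limit; in spirit that loop might look like this (a simplified sketch with invented step costs):

```python
def run_investigation(steps, budget_usd=0.10):
    """Run tool/LLM-call steps until the per-investigation budget is
    exhausted, then stop and report whatever was found so far.
    `steps` is a list of (cost_usd, finding_or_None) pairs standing
    in for real calls."""
    spent, findings = 0.0, []
    for cost, finding in steps:
        if spent + cost > budget_usd:
            return {"status": "budget_exhausted",
                    "spent": round(spent, 4), "findings": findings}
        spent += cost
        if finding:
            findings.append(finding)
    return {"status": "complete",
            "spent": round(spent, 4), "findings": findings}

result = run_investigation(
    [(0.03, None), (0.04, "suspect: payment-db"), (0.05, None)],
    budget_usd=0.10)
# Stops before the third step; the partial finding is still surfaced
# so the engineer can decide whether to extend the budget.
```

This matches the collaboration model described in the episode: the human can say "go a bit further" (raise the budget) or "stop here, I'll take over".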
What external tools does Cleric AI integrate with?
Cleric AI integrates with tools like Datadog to access logs, monitor system health, and gather necessary data for analysis.
How does Cleric AI incorporate chaos engineering?
Cleric AI employs chaos engineering by simulating failures and evaluating agent performance in controlled scenarios.
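One way such an evaluation harness could be structured (entirely illustrative; the fault catalogue and the keyword-matching "agent" are made up): inject a known fault, ask the agent to diagnose the symptom, and score it against ground truth:

```python
import random

def chaos_eval(agent_fn, faults, trials=50, seed=0):
    """Inject one known fault per trial and measure how often the
    agent's diagnosis matches the injected ground truth."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        fault, symptom = rng.choice(faults)
        if agent_fn(symptom) == fault:
            correct += 1
    return correct / trials

# (fault injected, symptom the agent observes)
FAULTS = [
    ("kill-pod", "pod restarts spiking"),
    ("fill-disk", "disk usage at 100%"),
]

def toy_agent(symptom):
    # Stand-in for a real diagnostic agent.
    return "kill-pod" if "pod" in symptom else "fill-disk"

accuracy = chaos_eval(toy_agent, FAULTS)
```

Because the fault is injected, this gives the ground truth that is otherwise missing from the unsupervised production setting discussed earlier.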
What are the major concerns with AI changes in production environments?
Concerns include maintaining system reliability, avoiding unnecessary disruptions, and ensuring that AI assists rather than complicates.
- 00:00:00Willem Pienaar, CTO of Cleric, we're
- 00:00:03building an
- 00:00:04AI SRE, we're based in San
- 00:00:07Francisco, black coffee is the way to go
- 00:00:10and if you want to join a team of
- 00:00:12veterans in AI and infrastructure on a
- 00:00:15really tough problem yeah come and chat
- 00:00:17to us, boom, welcome back to the
- 00:00:20MLOps Community podcast I'm your host
- 00:00:22Demetrios today we are talking with my
- 00:00:25good friend William some of you may know
- 00:00:27him as the CTO of Cleric AI doing some
- 00:00:31pretty novel stuff with the AI SRE which
- 00:00:36we dive into very deep in this next hour
- 00:00:41we talk all about how he's using
- 00:00:43knowledge graphs to triage root cause
- 00:00:46issues with their AI agent solution and
- 00:00:50others of you may know William because he
- 00:00:52is also the same guy that built the open
- 00:00:56source feature store Feast that's where
- 00:00:59I got to know him back four five years
- 00:01:02ago and since then I've been following
- 00:01:05what he is doing very closely and it's
- 00:01:08safe to say this guy never fails to
- 00:01:11disappoint let's get into the
- 00:01:13conversation right
- 00:01:15[Music]
- 00:01:20now let's start by prefacing this
- 00:01:23conversation with we are recording 2
- 00:01:26days before Christmas so when it comes
- 00:01:29out
- 00:01:30this sweater that I'm wearing is not
- 00:01:32going to be okay but today it is totally
- 00:01:35in bounds for me being able to wear
- 00:01:38it unfortunately I don't have a cool
- 00:01:40sweater like you and I'm in Sunny San
- 00:01:42Francisco but I guess it's got the fog
- 00:01:46yeah it's Christmas
- 00:01:48period dude I found out three four days
- 00:01:52ago that if you have
- 00:01:55this pill, magic pill, with caffeine it
- 00:02:00like minimizes the jitters so I have
- 00:02:04taken that as an excuse, L-theanine or
- 00:02:07whichever, yeah you've heard of it yeah yeah dude I
- 00:02:10I've just been abusing my caffeine
- 00:02:12intake and pounding these pills with it
- 00:02:16it's amazing I am so much more
- 00:02:18productive so that's my 2025 secret for
- 00:02:20everyone okay and a bit of magnesium
- 00:02:23for better sleep for actual
- 00:02:26sleep all right man enough of that you
- 00:02:29been building cleric you've been coming
- 00:02:32on occasionally to the
- 00:02:35different conferences that we've had and
- 00:02:38sharing your learnings but recently you
- 00:02:41put out a blog post and I want to go
- 00:02:42super deep on this blog post on what an
- 00:02:45AI SRE is just because it feels like
- 00:02:49SREs are very close to the MLOps world
- 00:02:52and AI agents are very much what we've
- 00:02:56been talking about a lot as we were
- 00:02:58presenting at the agents production
- 00:03:00conference the first thing that we
- 00:03:03should start with is just what a hard
- 00:03:06problem this is and why is it hard we
- 00:03:10can dive into those areas and I think
- 00:03:11we're going to get into that in this
- 00:03:13this conversation maybe just a set the
- 00:03:15stage everyone is building agents like
- 00:03:17agents of all the hype right now but
- 00:03:19every use case is different right you've
- 00:03:22got agents in law you've got agents for
- 00:03:24writing blog post you've got agents for
- 00:03:26social media one of the tricky things
- 00:03:28about our space is really if you
- 00:03:31consider two main things that an
- 00:03:33engineer does: they create software and
- 00:03:36then they deploy it into a production
- 00:03:37environment and it runs and operates
- 00:03:38actually has to have an impact on the
- 00:03:39real world that second world the
- 00:03:42operational environment is quite
- 00:03:44different from the development
- 00:03:45environment the development environment
- 00:03:46has tests it has an IDE it has tight
- 00:03:49feedback cycles often it has ground
- 00:03:51truth right so you can make a change and
- 00:03:54see if your tests pass there's
- 00:03:56permissionless data sets that are out
- 00:03:57there so you can go to GitHub and you
- 00:03:59can find like
- 00:04:00millions of issues that people create, PRs
- 00:04:02that are like the solutions to those
- 00:04:04issues yeah but consider like the
- 00:04:07production environment of an Enterprise
- 00:04:09company where do you find the data set
- 00:04:13that represents all the problems that
- 00:04:14they've had and all the solutions it's
- 00:04:16not just laying out there right you can
- 00:04:18get some like root causes and things
- 00:04:20that people have posted as blog posts
- 00:04:22but this is an unsupervised problem for
- 00:04:24the most part it's a very complicated
- 00:04:26problem I I guess we can get get into
- 00:04:28those details in a in a bit but that's
- 00:04:30really what makes this challenging it's
- 00:04:32complex sprawling Dynamic
- 00:04:35systems yeah the complexity of the
- 00:04:38systems does not help and I also think
- 00:04:40with the rise of the coding
- 00:04:44co-pilots does that not also make things
- 00:04:47more complex because you're running
- 00:04:49stuff in a production environment that
- 00:04:52maybe you know how it got created maybe
- 00:04:55you don't massively and I think even at
- 00:04:58our scale small startup it's become a
- 00:05:01topic
- 00:05:02internally how much do we delegate to AI
- 00:05:05because we're also Outsourcing and
- 00:05:07delegating to our own agents internally
- 00:05:09the produce code so I think all teams
- 00:05:12are trying to get to the boundaries of
- 00:05:14understanding and confidence so you're
- 00:05:16building these modular components like
- 00:05:18Lego blocks with internals you're unsure
- 00:05:20about but you're shipping into
- 00:05:21production and seeing how that succeeds
- 00:05:23and fails because it gives you so much
- 00:05:25velocity so the ROI is there but the
- 00:05:27understanding is like one of the things
- 00:05:28you lose over time and I think at scale
- 00:05:31where the incentives aren't aligned
- 00:05:32where you have many different teams and
- 00:05:34they're all being pressured to ship more
- 00:05:37belts are being tightened so there's not
- 00:05:39a lot of head count and they have to do
- 00:05:40more the production environment is
- 00:05:43really people are like putting their
- 00:05:45fingers in the dam wall but eventually
- 00:05:47it's going to break it's unstable at a
- 00:05:49lot of companies yeah so coding is going
- 00:05:52to make or AI generated coding is really
- 00:05:54going to make this a much more complex
- 00:05:56system to deal with so the Dynamics
- 00:05:59between these components that
- 00:06:01interrelate where there's much less
- 00:06:03understanding is going to explode yeah
- 00:06:06we're already seeing that dude there's
- 00:06:08so many different pieces on the complex
- 00:06:10systems that I want to dive into but the
- 00:06:13first one that stood out to me and has
- 00:06:15continued to replay in my mind is this
- 00:06:19knowledge graph that you presented at
- 00:06:21the conference and then subsequently in
- 00:06:24your blog post and you made the point of
- 00:06:27saying this is a
- 00:06:30Knowledge Graph that we created on a
- 00:06:32production environment but it's not like
- 00:06:34it's a gigantic kubernetes cluster it
- 00:06:38was a fairly small kubernetes cluster
- 00:06:41and all of the different relations from
- 00:06:43that and all the slack messages and all
- 00:06:45the GitHub issues and everything that is
- 00:06:48involved in that kubernetes cluster
- 00:06:50you've mapped out and that's just for
- 00:06:52one kubernetes cluster so I can't
- 00:06:54imagine across in a whole entire
- 00:06:56organization like an Enterprise size how
- 00:06:59complex this gets yeah so if you
- 00:07:03consider that specific cluster or graph
- 00:07:05I showed you was the OpenTelemetry
- 00:07:07reference architecture it's like a demo
- 00:07:09stack, it's like an e-commerce store,
- 00:07:10it's got about 12, 13 services yeah
- 00:07:14roughly in that
- 00:07:15range I've only shown you literally like
- 00:07:1810% of the relations maybe even less and
- 00:07:20it's only at the infrastructure layer
- 00:07:21right so it's not even talking about
- 00:07:22like buckets and Cloud infras nothing
- 00:07:25about nodes nothing about application
- 00:07:27internals right so if you consider one
- 00:07:28cloud project like a gcp project or AWS
- 00:07:32project mhm there's a whole tree there
- 00:07:34the networks the regions down to the
- 00:07:36kubernetes Clusters within a cluster
- 00:07:38there's the nodes, there's the pods,
- 00:07:40within the pods there's
- 00:07:42multiple containers potentially,
- 00:07:44within each of those many processes, each
- 00:07:46process has code with variables, and
- 00:07:50so this creates this tree structure but
- 00:07:52then between those nodes in the tree there can
- 00:07:54also be interrelations right like a
- 00:07:55piece of code here would be referencing
- 00:07:57an IP address but that IP address is
- 00:08:00owned by some cloud service somewhere and it's
- 00:08:02also connected to some other
- 00:08:04systems and you can't not use that
- 00:08:07information right because if a problem
- 00:08:09arrives and you're you know Landing your
- 00:08:11LA and you have to causally walk that
- 00:08:14graph to go Upstream to find the root
- 00:08:15cause in the security space this is a
- 00:08:19pretty well studied problem and there
- 00:08:21are traditional techniques people have
- 00:08:23been using to extract this from cloud
- 00:08:25environments but LLMs really unlock a new
- 00:08:28level of understanding there so they're
- 00:08:29extremely good at extracting these
- 00:08:32relationships taking really unstructured
- 00:08:33data so it can be conversations that you
- 00:08:36and I have it can be kubernetes objects
- 00:08:38it can be all of these like the whole
- 00:08:40Spectrum from unstructured to structured
- 00:08:42you can extract structured information
- 00:08:43so you can build these graphs the
- 00:08:46challenge really is twofold so you know
- 00:08:48you need to use this graph to get to a
- 00:08:50root cause but it's fuzzy right as soon
- 00:08:54as you extract that information you
- 00:08:56build that graph it's out of date almost
- 00:08:58instantly because systems change so
- 00:08:59quickly right so somebody's deploying
- 00:09:01something, an IP address gets rolled, pod
- 00:09:04names change and so you need to be
- 00:09:08able to make efficient decisions with
- 00:09:10your agent right so just to uh anchor
- 00:09:13this our agent is essentially a
- 00:09:16diagnostic agent right now so it helps
- 00:09:18teams quickly root cause a problem so if
- 00:09:21you've got an alert that fires, or
- 00:09:23an engineer presents an issue to the
- 00:09:25agent, it quickly navigates this
- 00:09:28graph and its awareness of your
- 00:09:30production environment to find the root
- 00:09:32cause, if it didn't have the graph it could
- 00:09:35still do it through first principles
- 00:09:36right it could still say looking at
- 00:09:38everything that's available I'll try
- 00:09:40this I'll try that but the graph allows
- 00:09:42it to very efficiently get to the root
- 00:09:44cause um and so that fuzziness is one
- 00:09:48of the challenges that the fact that
- 00:09:49it's out of date so quickly but it's so
- 00:09:52important to still have it
- 00:09:54regardless there's a few things that you
- 00:09:56mentioned about how with the vision or
- 00:10:00the understanding of the graph you can
- 00:10:03escalate up issues that may have been
- 00:10:06looked at in isolation is not that big
- 00:10:08of a deal and so can you explain how
- 00:10:11that works a little
- 00:10:12bit so the graph is essentially there's
- 00:10:16two, if you draw a box around the
- 00:10:17production environment right there are
- 00:10:19two kinds of issues right there's ones
- 00:10:21you have alerts for and ones you're unaware
- 00:10:23of so the one is you tell us like okay my alert
- 00:10:26fired here's a problem go look at it
- 00:10:27another is we scan the environment and
- 00:10:30we identify problems the graph is built
- 00:10:33in two ways one is a background job
- 00:10:36where it's just like looking through
- 00:10:37your infrastructure and finding new
- 00:10:39things and updating itself continuously
- 00:10:41and the other is when the agent's doing
- 00:10:42investigation and it sees new
- 00:10:44information and it just throws that back
- 00:10:45into the graph because it's got the
- 00:10:47information anyway so it just uses it to update the
- 00:10:49graph but in this background scanning
- 00:10:51process it might uncover things that it
- 00:10:54didn't realize was a problem but then it
- 00:10:56sees that this is actually a problem
- 00:10:58for example it could process your
- 00:11:01metrics or it could look at your
- 00:11:03configuration of your objects in
- 00:11:05kubernetes or maybe it finds a bucket
- 00:11:07and it's trying to create that node, the
- 00:11:10updated state of the bucket, and it sees it's
- 00:11:12exposed publicly so then you could
- 00:11:14surface this to an engineer and say
- 00:11:16your data is being exposed publicly or
- 00:11:19they've misconfigured this pod and the
- 00:11:21memory is growing this application and
- 00:11:24in about an hour or two this is going to
- 00:11:26crash yeah so there's a massive opportunity
- 00:11:29for LLMs to be used as reasoning engines
- 00:11:32where it can infer and predict a failure
- 00:11:35imminently and you can prevent that so you
- 00:11:37get to a proactive state of alerting
- 00:11:40that is of course quite inefficient
- 00:11:42today if you use an LLM to just slap
- 00:11:45a vision model onto a metrics graph or
- 00:11:48yeah onto you your objects in your Cloud
- 00:11:51infrastructure but there's a massive low
- 00:11:53hanging fruit there where you distill
- 00:11:55a lot of those inferencing capabilities
- 00:11:57into fine-tuned or more purpose-built
- 00:11:59models for each one of these tasks but
- 00:12:02how does the scanning work because I
- 00:12:05know that you also mention the agents
- 00:12:09will go until they run out of credit or
- 00:12:13something or until they hit their like
- 00:12:14spend limit when they're trying to root
- 00:12:17cause analysis some kind of a problem
- 00:12:21but I can imagine that you're not just
- 00:12:24continuously scanning or are you kicking
- 00:12:26off scans every x amount of seconds or
- 00:12:28minutes or days yeah so there are
- 00:12:31different parts to this if we do
- 00:12:33background scanning graph building we
- 00:12:35try and use more efficient models so
- 00:12:39because of the volume of data you don't
- 00:12:41use expensive models that are used for
- 00:12:43like you know very accurate reasoning
- 00:12:46yeah and so the costs all lower and so
- 00:12:47you set it like a daily budget of that
- 00:12:49and then you run up to the budget this
- 00:12:52is not something that's constantly
- 00:12:53running and processing large amounts of
- 00:12:55information think about it as like a
- 00:12:57human right you wouldn't process all
- 00:12:59logs and all information your Cloud
- 00:13:01infrastructure you just get like a lay
- 00:13:03of the land like like what are the most
- 00:13:05recent deployments what are the most
- 00:13:06recent conversations people are having
- 00:13:08it's like get like a play-by-play so
- 00:13:11that when an issue comes up you can
- 00:13:13quickly jump into action, you can do
- 00:13:14fast thinking, you can make the right
- 00:13:16decisions quickly but in investigation
- 00:13:19we set a cap we say per
- 00:13:23investigation let's say make it 10 cents
- 00:13:25or make it a dollar or whatever and then
- 00:13:28we tell the agent this is how much you've
- 00:13:30been assigned use it as best you can go
- 00:13:33find information that you need through
- 00:13:35your
- 00:13:35tools and they allow the human to say
- 00:13:38okay go a bit further or stop here I'll
- 00:13:41take over wow and so we bring the human
- 00:13:43in the loop as soon as the agent has
- 00:13:45something valuable to present to it so
- 00:13:48if the agent goes off on a quest and it
- 00:13:50finds almost nothing it can present that
- 00:13:52to the human, say, yep, nothing, or say okay
- 00:13:54couldn't find anything or just remain
- 00:13:56quiet depends on how you've configured
- 00:13:58it but it'll always stop at that budget limit
- 00:14:01yeah the benefit of it not finding
- 00:14:04anything also is that it will narrow
- 00:14:07down where the human has to go and
- 00:14:09search so now the human doesn't have to
- 00:14:12go and look through all this crap that
- 00:14:14the AI agent just looked through because
- 00:14:17ideally if the agent didn't catch
- 00:14:19anything it's hopefully not there and so
- 00:14:22the human can go and look in other
- 00:14:24places first and if they exhaust all
- 00:14:26their options they can go back and try
- 00:14:28and see where the agent was looking and
- 00:14:30see if that's where the problem
- 00:14:32is I think this comes back to the
- 00:14:34fundamental problem here and maybe we
- 00:14:37glassed over some of this like tools
- 00:14:41don't solve the problem of operation
- 00:14:43operations is an on call no amount of
- 00:14:46Datadogs or dashboards or kubectl
- 00:14:49commands will free your senior Engineers
- 00:14:52up from getting into the production
- 00:14:54environment
- 00:14:55so what we're trying to get to is end-to-
- 00:14:59end resolution when we find a problem
- 00:15:02can the agent go all the way multiple
- 00:15:05steps which today requires Engineers
- 00:15:07reasoning and judgment looking at
- 00:15:09different tools understanding tribal
- 00:15:11knowledge understanding why systems have
- 00:15:12been deployed we want to get the agents
- 00:15:15there but you can't start there because
- 00:15:17this is an unsupervised problem you
- 00:15:19can't just start changing things in
- 00:15:20production nobody would do that right
- 00:15:23now if you scale that back from
- 00:15:25resolution meaning change, like code-
- 00:15:27level change, Terraform things in your
- 00:15:31repos, if you walk it back from that
- 00:15:33it's understanding what the problem is
- 00:15:34and if you walk it back further from
- 00:15:35that it's search space reduction
- 00:15:37triangulating the problem into a
- 00:15:39specific area maybe not saying the line
- 00:15:41of code but saying here's the service or
- 00:15:43here's the cluster and that's already
- 00:15:45very compelling to a human or you can
- 00:15:47say it's not this these 400 other Cloud
- 00:15:50clusters or providers or Services is
- 00:15:53probably in this one and that is
- 00:15:56extremely useful to an engineer today so
- 00:15:59space reduction is one of the things
- 00:16:01that we are very reliable at and where
- 00:16:02we've started and we start in a kind of
- 00:16:05collaborative mode so we quickly reduce
- 00:16:08the search space we tell you what we
- 00:16:09checked and what we didn't and then we
- 00:16:11as an engineer we can say okay here's
- 00:16:12some more context go a bit further and
- 00:16:14try this piece of information and in
- 00:16:17that steering and then collaboration we
- 00:16:20learn from engineers and they teach us
- 00:16:22and we get better and better over time
- 00:16:23on this like road to
- 00:16:25resolution yeah I know you mentioned
- 00:16:27memory and I want to get into that in a
- 00:16:28sec but but keeping on the theme of
- 00:16:31money and cost and the Agents having
- 00:16:36more or less a budget that they can go
- 00:16:38expend and try and find what they're
- 00:16:40looking for do you see that agents will
- 00:16:44get stuck in recursive loops and then
- 00:16:46use their whole budget and not really
- 00:16:49get much of anything or is that
- 00:16:50something that was fairly
- 00:16:53common six or 10 months ago but now
- 00:16:57you've found ways to counterbalance that
- 00:17:01problem this problem space is one where
- 00:17:04small little additions to your or
- 00:17:06improvements to your product make a big
- 00:17:08difference over time because they
- 00:17:10compound we've learned a lot from
- 00:17:12the coding agents like SWE-agent and others
- 00:17:15so one of the things they found was that
- 00:17:17when the agent succeeds it succeeds very
- 00:17:19quickly, when it fails it fails very slowly so
- 00:17:22typically you can even see as a proxy
- 00:17:23has the agent run for 3 4 5 6 7 minutes
- 00:17:27it's probably wrong even if you don't
- 00:17:29score it at all and if it ran into like
- 00:17:32it came to a conclusion quickly like in
- 00:17:3330 seconds it's probably going to be
- 00:17:35right our agents sometimes do chase
- 00:17:38their tails so we have a confidence
- 00:17:40score and we have a critic at the end
- 00:17:42that assesses the agent so we try and
- 00:17:45not you know spam the human ultimately
- 00:17:48it's about attention and saving them
- 00:17:49time so if you keep throwing like bad
- 00:17:51findings and bad information they really
- 00:17:53they'll just rip you out of their
- 00:17:55production environment because it's
- 00:17:56going to be noisy right that's the last
- 00:17:57thing they want so yes depending on the
- 00:18:01use case the agent can go in a recursive
- 00:18:04loop or it can go in a direction that it
- 00:18:06shouldn't so for us a really effective
- 00:18:09mechanism to manage that is
- 00:18:12understanding where we're good and where
- 00:18:13we're bad so for each issue or event
- 00:18:16that comes in we do an enrichment and
- 00:18:17then we build the full context of that
- 00:18:19issue and then we look at have we seen
- 00:18:21this in the past, similar issues, have we
- 00:18:24solved, how have we solved this in the past, and
- 00:18:26have we had positive feedback and so if
- 00:18:27we fetch the right historical context
- 00:18:30get a good idea of our confidence on
- 00:18:31something before presenting that
- 00:18:33information to a human like the the
- 00:18:34ultimate set of findings but yeah
- 00:18:37sometimes it does go
- 00:18:39awry I'm trying to think is the
- 00:18:41knowledge graph something that you are
- 00:18:44creating once getting an idea the lay of
- 00:18:47the land and then there's almost like
- 00:18:51stuff doesn't really get updated until
- 00:18:53there's an incident and you go and you
- 00:18:55Explore More and what kind of knowledge
- 00:18:58graphs are using or you use many
- 00:18:59different knowledge graphs is it just
- 00:19:01one big one how does that even look in
- 00:19:04practice we originally started with one
- 00:19:06big Knowledge Graph the thing with these
- 00:19:08knowledge graphs is that they're often
- 00:19:10the first layer of them is deterministic
- 00:19:12methods so you can run kubectl and you
- 00:19:15can just walk the cluster with
- 00:19:17traditional techniques there's no AI
- 00:19:19or LLM involved but then you want to
- 00:19:22layer on top of that this the fuzzy
- 00:19:24relationships where you see this
- 00:19:25container has this sort of a reference
- 00:19:27to something over there or a ConfigMap
- 00:19:29mentions something that I've I've seen
- 00:19:32somewhere else and so what we've gone
- 00:19:35towards is a more layered approach so we
- 00:19:38have like multiple graph layers where
- 00:19:40some of them have a higher confidence
- 00:19:42and durability and can be updated
- 00:19:44quickly or perhaps using different
- 00:19:46techniques and then you layer on the
- 00:19:48more fuzzy layers on top of that or or
- 00:19:51different layers so you could use an LLM
- 00:19:52to kind of canvas the landscape between
- 00:19:55clusters or from a Kubernetes cluster to maybe
- 00:19:59the application layer or to the layers
- 00:20:00below but using smaller micro graphs has
- 00:20:03been easier for us from like a data
- 00:20:05management
- 00:20:07perspective what are other data points
- 00:20:09that you're then mapping out for the
- 00:20:11knowledge graph that can be helpful
- 00:20:13later on when the AI s re is trying to
- 00:20:19triage different
- 00:20:21problems in most teams there's an
- 00:20:2580/20, like Pareto, distribution of value
- 00:20:29um yeah so some of the key factors are
- 00:20:31often find in the same system I think it
- 00:20:33was M meta or yeah that's had some
- 00:20:37internal survey where they found out
- 00:20:39that 50 or 60% of their production
- 00:20:41issues were just due to config or code
- 00:20:43changes anything that disrupted their
- 00:20:45prod environment so if you're just
- 00:20:48looking at what people are deploying
- 00:20:49like you're following the humans you're
- 00:20:50going to probably find a lot of the
- 00:20:51problems so monitoring slack monitoring
- 00:20:55deployments is one of the most effective
- 00:20:57things to do looking at like releases or
- 00:21:01changes that people are scheduling and
- 00:21:03understanding those events so having an
- 00:21:05assessment of that and then in the
- 00:21:07resolution path there's also or the way
- 00:21:10to build the resolution looking at run
- 00:21:12books looking at how people have solved
- 00:21:14problems in the past
- 00:21:16like often what happens is like a slack
- 00:21:19thread is created right so the slack
- 00:21:21thread is like a contextual container
- 00:21:24for how do you go from a problem which
- 00:21:26somebody produced, to a
- 00:21:28solution
- 00:21:29and summarizing these Slack threads is
- 00:21:31extremely useful so you can basically
- 00:21:33say like this engineer came into this
- 00:21:35problem this was the discussion and this
- 00:21:37is the final conclusion and there's
- 00:21:38often like a PR attached to that so you
- 00:21:40can condense that down to almost like a
- 00:21:42guidance or like a run book yeah and
- 00:21:45attaching that into like novel scenarios
- 00:21:48is useful because it shows you how this
- 00:21:50team does things and they often
- 00:21:52contain tribal knowledge right so this
- 00:21:54is how we solve problems at our company
- 00:21:56we connect to our VPNs like this we
- 00:21:59access the systems like this these are the key
- 00:22:00systems right the most important
- 00:22:02systems in your production environment
- 00:22:03will be referenced by Engineers
- 00:22:05constantly yeah um often through shorthand
- 00:22:09notations um and if you speak to
- 00:22:11Engineers at most companies those will
- 00:22:14be the two biggest problems right one is
- 00:22:17you don't understand our systems and our
- 00:22:20processes and our context and the second
- 00:22:23one is that you don't know how to
- 00:22:24integrate or access these because
- 00:22:26they're custom and bespoke and homegrown
- 00:22:29and so those are the two challenges that
- 00:22:31we face as like agents basically we're
- 00:22:34like a new engineer on the team and you
- 00:22:35need to be taught by this engineering
- 00:22:37team if you're not taught then you're
- 00:22:39never going to succeed I hope that
- 00:22:41answered your question yeah and how do
- 00:22:43you overcome that you just are creating
- 00:22:47some kind of a glossary with these
- 00:22:50shorthand things that are
- 00:22:53fairly common within the organization or
- 00:22:56what yeah so there's multiple layers
- 00:22:58to this and I think this is quite an
- 00:23:01evolving space thankfully LLMs are
- 00:23:03pretty adaptive and forgiving in this
- 00:23:06regard so we can experiment with
- 00:23:08different ways to summarize different
- 00:23:09levels of granularity so we've looked at
- 00:23:11okay can you just take like a massive
- 00:23:13amount of information and just shove
- 00:23:15that into the context window give it in a
- 00:23:17relatively raw form and that works but
- 00:23:19it's quite expensive yeah and then you
- 00:23:21show it like more condensed form and you
- 00:23:23say this is just the like tip of the
- 00:23:25iceberg for any one of these topics you
- 00:23:27can query using this tool to get
- 00:23:29more information yeah and it's not
- 00:23:33always easy to know which one is the
- 00:23:36best because it's dependent on the issue
- 00:23:37at hand right because sometimes a key
- 00:23:39factor the needle in the haystack is buried one
- 00:23:41level deeper and the agent can't see it
- 00:23:44because it has to call a tool to get to
- 00:23:45it so we typically err on the side of
- 00:23:48spending more money and just having the
- 00:23:51agent see it and then optimizing cost
- 00:23:53and latency over time for us it's really
- 00:23:56about being valuable out of the gate
- 00:23:59Engineers should find this valuable and
- 00:24:01in that value the collaboration starts
- 00:24:04and then it creates a virtuous cycle
- 00:24:06where they feed us more information
- 00:24:07and they get
- 00:24:09more value because we take more grunt
- 00:24:11work off their plate and and it's it's
- 00:24:14like training a new person on your team
- 00:24:16if you see that oh this person is taking
- 00:24:18more and more tasks yeah I'll just get
- 00:24:20them more information I'll give them
- 00:24:21more scope yeah I want to go into a
- 00:24:23little bit of the ideas that you're
- 00:24:27talking about there like how you can
- 00:24:28interact with the agent and but I feel
- 00:24:32like the gravitational pull towards
- 00:24:35asking you about memory and how you're
- 00:24:38doing that is too strong so we got to go
- 00:24:40down that route first and
- 00:24:43specifically are you just caching these
- 00:24:47answers are you caching like successful
- 00:24:50runs how do you go about knowing that a
- 00:24:53something was successful and then where
- 00:24:55do you store it how do you like give
- 00:24:57that access
- 00:24:58or how do agents get access to that and they
- 00:25:01know that oh we've seen this before yeah
- 00:25:03cool boom it feels like that is quite
- 00:25:07complex in theory you would be like yeah
- 00:25:10of course we're just going to store
- 00:25:11these successful runs but then when you
- 00:25:13break it down and you say all right what
- 00:25:15does success mean and where are we
- 00:25:17going to store it and who's going to
- 00:25:20have access to that and how are we going to
- 00:25:21label that as successful like I was
- 00:25:23thinking how do you even go about
- 00:25:25labeling this kind of because is it
- 00:25:28you sitting there clicking and human
- 00:25:30annotating stuff or is it you're
- 00:25:33throwing it to another llm to say yay
- 00:25:36success what does it look like break
- 00:25:38that whole thing down for me because
- 00:25:40memory feels quite complex and that when
- 00:25:43you really look at
- 00:25:45it a big part of this is also the
- 00:25:48ux challenge because people don't want
- 00:25:50to just sit there and label I think
- 00:25:52people are just like especially
- 00:25:54Engineers are really tired of slop code
- 00:25:56and they're just being thrown this like
- 00:25:58slop and then they have to review they
- 00:26:00want to create and I think that's what
- 00:26:02we're trying to do is free them up from
- 00:26:03support but in doing so you don't want
- 00:26:05to get them to like constantly review
- 00:26:08your work with no benefit so that's the
- 00:26:11key thing there has to be interaction
- 00:26:13where there's implicit feedback and they
- 00:26:15get value out of that and so I'm getting
- 00:26:19to your point about memory so
- 00:26:21effectively there are three types of
- 00:26:23memory there's the like Knowledge Graph
- 00:26:25which captures the system State and the
- 00:26:27relations between
- 00:26:28things then there's episodic and
- 00:26:31procedural memory so the procedural
- 00:26:33memory is like how to ride a bicycle
- 00:26:35you've got your brakes here your
- 00:26:37pedal here it's like the guide It's
- 00:26:39almost like the Run book but the Run
- 00:26:41book doesn't describe for this specific
- 00:26:45issue that we had on this date what did
- 00:26:47we do the instance of that is the
- 00:26:50episode or the episodic memory and both
- 00:26:53of those need to be captured right so
- 00:26:55when we start we're indexing your
- 00:26:56environment getting all these like
- 00:26:58relations and things and then we also look
- 00:27:00at okay are there things that we can
- 00:27:02extract from this world where we've got
- 00:27:04procedures and then finally as we
- 00:27:08experience things or as we understand
- 00:27:10the experiences of others within this
- 00:27:12environment we can store those as well
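The three memory types William lays out, the knowledge graph, procedural memory, and episodic memory, might be modeled like this (a hypothetical sketch, not Cleric's actual schema):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class GraphEdge:
    # Knowledge graph: system state and the relations between things.
    source: str
    relation: str
    target: str

@dataclass
class Procedure:
    # Procedural memory: the general "how to ride a bicycle" guide,
    # like a runbook that is not tied to any one incident.
    name: str
    steps: list[str]

@dataclass
class Episode:
    # Episodic memory: what actually happened for a specific issue
    # on a specific date, and how it turned out.
    occurred_on: date
    problem: str
    actions: list[str]
    outcome: str
```

Keeping episodes separate from procedures is what lets new incidents be matched against concrete past experiences rather than only general guides.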
- 00:27:14we have really spent a lot of time and
- 00:27:17most companies care about this a lot
- 00:27:20securing data so we are deployed in your
- 00:27:23production environment and we only have
- 00:27:25read-only APIs so our agent cannot make
- 00:27:27changes it can only make suggestions
- 00:27:30you want to change that right
- 00:27:32later we'll talk about like how you want
- 00:27:35to eventually get to a different state
- 00:27:37but yeah continue yeah yeah we want to
- 00:27:39get to closed-loop resolution but that's a
- 00:27:42that's a longer part so we're storing
- 00:27:44all of these memories and I think
- 00:27:48the most valuable ones are the episodes right
- 00:27:50those are the like the instances like if
- 00:27:53this happened or this happened and I
- 00:27:54solved it in this way we had a bad rollout
- 00:27:57the cluster fell over we scaled it up
- 00:28:01and later we saw it was working
- 00:28:03and it was done and we did that two or
- 00:28:06three times and we think that's a good
- 00:28:08pattern like scaling is effective but
- 00:28:10that's all captured in the environment
- 00:28:14um of the customer our primary means
- 00:28:16of feedback is monitoring
- 00:28:19system health post change oh nice we
- 00:28:23can look at the system and see that this
- 00:28:26change has been effective and we can
- 00:28:27look at the code of the environment
- 00:28:29whether it's the application code or the
- 00:28:31infrastructure code basically as
- 00:28:33like a masking problem do we see or can
- 00:28:37we predict the change the human will
- 00:28:39make in order to solve this problem and
- 00:28:40if they do then make that change
- 00:28:42especially if it's a recommendation then
- 00:28:44we see that as a big green light look what
- 00:28:46we've done right they've actually
- 00:28:48approved our suggestion yeah that is not
- 00:28:52a super rich data source because the
- 00:28:54change that they make may be slightly
- 00:28:56different or we may not have access to
- 00:28:58those systems a more effective way is
- 00:29:02interaction so if we present findings
- 00:29:04and say Here's five findings and here's
- 00:29:06our diagnosis and you say this is dumb
- 00:29:09try something else then we know that was
- 00:29:10bad so we get a lot of negative examples
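Turning these implicit signals into labels, for example checking whether system health actually improved after a change landed, could be sketched as follows (the threshold and function names are illustrative assumptions, not Cleric's implementation):

```python
def change_was_effective(errors_before, errors_after, improvement=0.5):
    # Treat a change as a positive example only if the average error
    # rate drops by at least `improvement` (here 50%) after it lands.
    if not errors_before or not errors_after:
        return False
    baseline = sum(errors_before) / len(errors_before)
    current = sum(errors_after) / len(errors_after)
    if baseline == 0:
        return current == 0
    return current <= baseline * (1 - improvement)
```

As William notes, this kind of signal is sparse and lopsided, which is why a hand-labeled eval bench is still needed underneath it.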
- 00:29:13right so this is bad and so it's a
- 00:29:15little bit lopsided but then when you
- 00:29:17eventually say oh okay I'm going to
- 00:29:19approve this and I'm going to blast this
- 00:29:21out to the engineering team or I'm going
- 00:29:22to update my PagerDuty notes or I'm going
- 00:29:24to I want you to generate a pull request
- 00:29:27from the information then suddenly we've
- 00:29:30got like positive feedback on that in
- 00:29:32the user experience it's really an implicit
- 00:29:35source of information the interaction
- 00:29:37with the engineer and that gets attached
- 00:29:39to these memories But ultimately at the
- 00:29:42end of the day it's still a very sparse
- 00:29:44data set so these memories you you may
- 00:29:47not have true labels and so for us a
- 00:29:51massive investment has been our
- 00:29:53evaluation bench which is external from
- 00:29:56customers where we train our agents and
- 00:29:58we do a lot of really
- 00:30:00handcrafted labeling where even a
- 00:30:02smaller data set gets the agent to a
- 00:30:04much much higher degree of accuracy so
- 00:30:07you want a bit of both right you want
- 00:30:08the real production use cases with
- 00:30:09engineering feedback which does
- 00:30:12present good information but the eval
- 00:30:14bench is ultimately the firm
- 00:30:17foundation that gives you that coverage
- 00:30:18at the moment but it feels like the
- 00:30:21evals have to be specific to customers
- 00:30:24don't they and it also feels like each
- 00:30:26deployment of each agent has to be a bit
- 00:30:29bespoke and custom per agent or am I
- 00:30:34mistaken in that the patterns
- 00:30:37vary so the agents are pretty
- 00:30:39generalized the agents get contextual
- 00:30:41information per customer so it gets
- 00:30:44injected like localized customer
- 00:30:46specific procedures and memories and all
- 00:30:49those things but those are layered on the
- 00:30:52base which is developed inside of our
- 00:30:55product right like in the mothership or
- 00:30:57actually it's called the Temple of Cleric
- 00:31:00um so we distribute like new versions of
- 00:31:03cleric and our prompts our logic our
- 00:31:06reasoning generalized memories or
- 00:31:09approaches to solving problems are
- 00:31:12imbued in a Divine way into the cleric
- 00:31:14and it's sent up it's a layering
- 00:31:17challenge right because you do want to
- 00:31:18have cross cutting benefits to all
- 00:31:21customers and accuracy driven by the
- 00:31:23eval bench but also customization
- 00:31:27on their processes and like
- 00:31:29customer specific approaches all right
- 00:31:32so there's a few other things that are
- 00:31:36fascinating to me when it comes to the
- 00:31:37UI and the ux of how you're doing things
- 00:31:41specifically how you are very keen on
- 00:31:46not giving Engineers more alerts unless
- 00:31:50it absolutely needs to happen and I
- 00:31:53think that's something that I've been
- 00:31:54hearing since
- 00:31:562018 and it was all about alert fatigue and
- 00:32:00how when you have complex systems and
- 00:32:02you set up all of this monitoring and
- 00:32:04observability you inevitably are just
- 00:32:06getting pinged continuously because
- 00:32:09something is out of whack and so the
- 00:32:14ways that you made sure to do this and I
- 00:32:17thought this was fascinating is a have a
- 00:32:19confidence score so be able to say look
- 00:32:22we think that this is like this and
- 00:32:26we're giving it 75%
- 00:32:28confidence that this is going to happen
- 00:32:30or this could be bad or whatever it may
- 00:32:33be and then B if it is under a certain
- 00:32:38percent confidence score you just don't
- 00:32:41even tell anyone and you try and figure
- 00:32:43out if it isn't actually a problem and I'm
- 00:32:45guessing you continue working or you
- 00:32:47just forget about it explain that whole
- 00:32:51user experience and how you came about
- 00:32:53that yeah we realized because this is a
- 00:32:55trust building exercise we can't just
- 00:32:58respond with whatever we find and the
- 00:33:00Agents
- 00:33:02can sometimes just not be right
- 00:33:04especially during
- 00:33:06the onboarding phase they don't
- 00:33:07have the necessary access and they don't
- 00:33:09have the context right and so at least
- 00:33:11at the start when you're training the
- 00:33:13agent you don't want it to just spam you
- 00:33:15with these raw ideas and so the
- 00:33:17confidence score was one that I think a
- 00:33:20lot of teams are actually trying to
- 00:33:22build into their products as agent
- 00:33:23Builders it's extremely hard in this
- 00:33:26case because it's such an
- 00:33:28unsupervised
- 00:33:30problem I'm trying to not get into the
- 00:33:32gory details because there's a lot of like
- 00:33:34effort we've put into that like building
- 00:33:36this confidence score is a big part of
- 00:33:38our IP it's like how do we measure our own
- 00:33:41success you need a Divine name for the
- 00:33:45IP or something it's not your IP it's
- 00:33:48your what was it when Moses was up on
- 00:33:50the hill and he got the Revelation it
- 00:33:53was yeah this is not your IP this is
- 00:33:55your Revelations that you've had yeah
- 00:33:57but so the high level is basically
- 00:34:00that it's really driven by this data fly
- 00:34:04wheel it's really driven by experience
- 00:34:06and that's also how an engineer does
- 00:34:08things but those can be again like two
- 00:34:10layered like from the base layers of the
- 00:34:12product but also experiences in this
- 00:34:15company so we do use an LLM for self
- 00:34:18assessment but it's also driven and
- 00:34:20grounded by existing experiences so we
- 00:34:24inject a lot of those experiences and
- 00:34:26whether those are positive or negative
- 00:34:28outcomes and as an engineer you can set
- 00:34:32the threshold so you can say oh nice
- 00:34:35only extremely high relevance findings
- 00:34:38or diagnoses should be shown and you
- 00:34:41can set the conciseness and specificity
- 00:34:44so you can say I just wanted one
- 00:34:45sentence or just give me a word or give
- 00:34:49me all the raw information so what we do
- 00:34:53today is we're very asynchronous so an
- 00:34:56alert fires we'll go off and find
- 00:34:58whatever information we can and come
- 00:34:59back if we're confident we'll respond if
- 00:35:02not we'll just be quiet but then you can
- 00:35:05engage with us in a synchronous way so
- 00:35:07it starts async and then you can kick
- 00:35:09the ball back and forth in a synchronous
- 00:35:11way and
- 00:35:14in the synchronous mode it's very
- 00:35:16interactive and and lower latency we
- 00:35:18will almost always respond if you ask us
- 00:35:21a question we'll respond so then the
- 00:35:22confidence score is less important
- 00:35:24because then it's like the user is
- 00:35:26refining that answer saying go back try
- 00:35:28this go back try this but for us the key
- 00:35:31thing is we have to come back with good
- 00:35:33initial findings and that's why the
- 00:35:35confidence score is so important but
- 00:35:36again it's really driven by
- 00:35:39experiences just to like reiterate
- 00:35:41like why this is such a complex problem
- 00:35:44to solve you can't just take a
- 00:35:46production environment and say okay I'm
- 00:35:48going to spin this up in a Docker
- 00:35:49container and reproduce it at a specific
- 00:35:51point in time at many companies you
- 00:35:53can't even do a load test across services
- 00:35:56it's so complex it's all different
- 00:35:57different teams they're all interrelated
- 00:35:59you can do this for a small startup
- 00:36:01with one application running on Heroku
- 00:36:02or Vercel but doing this at scale is
- 00:36:05virtually impossible at most companies
- 00:36:07so you don't have that ground truth
- 00:36:10you can't say with 100% certainty
- 00:36:12whether you're right or wrong and that's
- 00:36:13just the state we're in right now
- 00:36:15despite that the confidence score has
- 00:36:18been a very powerful technique to at
- 00:36:21least eliminate most false positives
- 00:36:25and when we know that we
- 00:36:26don't have anything of substance just
- 00:36:29being
- 00:36:30quiet but how do you know if you got
- 00:36:35enough
- 00:36:36information when you were doing the scan
- 00:36:39or you were doing the search to go back
- 00:36:42to the human and give that information
- 00:36:46and also how do you know that you are
- 00:36:49fully understanding what the human is
- 00:36:52asking for when you're doing that back
- 00:36:53and forth honestly this is one of the
- 00:36:56key parts that's very challenging
- 00:36:58a human will say the checkout
- 00:37:01service is down and you need to know
- 00:37:04that they are probably maybe based on
- 00:37:07who the engineer is talking about
- 00:37:10production or if they've been talking
- 00:37:12about developing a new feature they're
- 00:37:14probably talking about the dev
- 00:37:15environment and if you go down the wrong
- 00:37:17path then you can spend some money and
- 00:37:20like a lot of time investigating something
- 00:37:22that's useless so what we do is even at
- 00:37:24the initial message that comes in we
- 00:37:27will ask a clarifying question
- 00:37:29if we are not sure about what you're
- 00:37:31asking if you've not been specific
- 00:37:33enough and most agent builders even
- 00:37:35Cognition's Devin do this so that
- 00:37:38initially they'll say okay do you mean X
- 00:37:39Y and Z okay this is my plan okay I'm
- 00:37:41going to go do it now so there is the
- 00:37:44sense of confidence built into these
- 00:37:45products at the UX layer and that's
- 00:37:47where we are right now with ChatGPT
- 00:37:50you can sometimes say or with Claude
- 00:37:52something very inaccurate or vague and it
- 00:37:56can probably guess the right answer
- 00:37:58because the cost is not multi-step right
- 00:38:00it's very cheap you can just quickly fix
- 00:38:02your text but for us we have to
- 00:38:05short-circuit that and make sure that you're
- 00:38:06specific enough in your initial
- 00:38:08instructions and then over time loosen
- 00:38:10that a bit as we understand a bit more
- 00:38:12what your teams are doing what things
- 00:38:13are what you're up to you can be more
- 00:38:16vague but for now it requires a bit more
- 00:38:18specificity and
- 00:38:20guidance speaking of the multi-turns
- 00:38:24and spending money for things or trying
- 00:38:27to not waste money and going down the
- 00:38:31wrong tree branch or Rabbit Hole how do
- 00:38:34you think about pricing for agents is it
- 00:38:38all consumption based are you looking at
- 00:38:40what the price of an SRE would be and
- 00:38:43you're saying oh we'll price a
- 00:38:44percentage of that because we're saving
- 00:38:46you time like what in your mind is the
- 00:38:50right way to base off of
- 00:38:55pricing well we're trying to build a
- 00:38:57product that Engineers love to use and
- 00:38:59so we want it to be a toothbrush we want
- 00:39:01it to be something that you reach for
- 00:39:03instead of your observability platform
- 00:39:05instead of going into the console so for
- 00:39:07us usage is very important so we don't
- 00:39:10want to have procurement stand in the
- 00:39:11way necessarily but the reality is there
- 00:39:14are costs and this is a business and we
- 00:39:17want to add value and money is how you
- 00:39:19show us that we're valuable so the
- 00:39:22original idea with agents was that there
- 00:39:24would be this augmentation of
- 00:39:26engineering teams and that you could
- 00:39:29charge some order of magnitude less but
- 00:39:31a fraction of engineering headcount or
- 00:39:34employee headcount by augmenting teams I
- 00:39:37think the jury is still out on that I
- 00:39:38think most agent Builders today are
- 00:39:41pricing to get into production
- 00:39:45environments or into these systems that
- 00:39:47they need to use to solve problems to
- 00:39:50get close to their Persona and if you
- 00:39:52look at what Devon did I think they also
- 00:39:54started at 10K per year or some such pricing
- 00:39:58and I think it's now like 500 a month
- 00:40:00but it's mostly a consumption based
- 00:40:02model so you get some committed amount
- 00:40:05of compute hours that is effectively
- 00:40:08giving you time um to use the product
- 00:40:11for us we're also orienting around that
- 00:40:14model so because we're not GA our
- 00:40:16pricing is still a little bit in
- 00:40:18flux and we're working with our initial
- 00:40:20customers to figure out like what do
- 00:40:21they think is reasonable what do they
- 00:40:22think is fair but I think we're going to
- 00:40:25land on something that's mostly similar
- 00:40:27to the Devon model where it's usage
- 00:40:29based we don't want Engineers to think
- 00:40:32about okay there's an investigation it's
- 00:40:34going to cost me X they should just be
- 00:40:36able to just run it and just see this is
- 00:40:38valuable or not and increase usage but
- 00:40:40it will be something about like a tiered
- 00:40:42amount of compute that you can use so
- 00:40:45maybe you get 5,000 investigations a
- 00:40:48month or something in that
- 00:40:50order okay nice yeah because that's what
- 00:40:53instantly came to my mind was you want
- 00:40:57folks to just reach for this and use it
- 00:41:00as much as possible but if you are on a
- 00:41:04usage based pricing then
- 00:41:07inevitably you're going to hit that
- 00:41:10friction where it's yeah I want to use
- 00:41:12it but H it's going to cost me yeah yeah
- 00:41:16so you do want to have a committed
- 00:41:18amount set aside at the front and we're
- 00:41:21also exploring like having a free tier
- 00:41:23or like a free band maybe the first X is
- 00:41:27just you can just kick the tires and try
- 00:41:29it out and as you get to higher limits
- 00:41:31then you can set additional limits so we
- 00:41:35haven't even talked about tool usage but
- 00:41:37that's another piece that feels like it
- 00:41:40is so complex because you're using tools
- 00:41:45you're using an
- 00:41:47array of tools and how do you tap into
- 00:41:50each of these tools right because it's
- 00:41:52if you're looking at logs or are you
- 00:41:57syncing directly with the data dogs of
- 00:42:00the world how do you see tool usage for
- 00:42:05this and what have been some
- 00:42:06specifically hard challenges to overcome
- 00:42:08in that
- 00:42:10Arena again this kind of goes back to
- 00:42:12why this is so challenging and
- 00:42:14especially one of the key things that
- 00:42:16we've seen is Agents solve problems very
- 00:42:18differently from humans but they need a
- 00:42:20lot of the things humans need they need
- 00:42:22the same tools if you're storing all of
- 00:42:24your data in Datadog we may not be
- 00:42:26able to find all the information we need
- 00:42:27to solve a problem by just looking at
- 00:42:29your actual application running and your
- 00:42:30cloud in front so we need to go to data
- 00:42:32do so we need access there and so
- 00:42:34engineering teams give us that access if
- 00:42:37you've then constructed a bunch of
- 00:42:39dashboards and metrics then and that's
- 00:42:42how you've laid out let's say your
- 00:42:44run books and your processes to debug
- 00:42:46issues we need to do things like look at
- 00:42:49multiple charts or graphs and infer
- 00:42:52across those in the time ranges that an
- 00:42:54issue happened what are the anomalies
- 00:42:56that happened across multiple services
- 00:42:58so if two of them are spiking in CPU
- 00:43:01they're interrelated so we should look at the
- 00:43:03relations between them but these are
- 00:43:05extremely hard problems for LLMs to solve
- 00:43:08even vision models they're not
- 00:43:10purpose-built for that so
- 00:43:14when it comes to Tool
- 00:43:15usage LLMs or Foundation models are
- 00:43:19good at certain types of information
- 00:43:21especially semantic ones so code config
- 00:43:24logs they're slightly less good at
- 00:43:28traces but also pretty decent but they
- 00:43:31really suck at metrics they really suck
- 00:43:33at time series so it's really dependent
- 00:43:36on your observability stack how useful
- 00:43:39it's going to be because as humans we
- 00:43:41just sit back and look at a bunch of
- 00:43:42dashboards we can see like pattern
- 00:43:44matching instantly you can see like
- 00:43:46these are spikes but for an LLM it sees
- 00:43:49something different so what we'll find
- 00:43:51is over time these observability tools
- 00:43:54at least will probably become less and
- 00:43:56less
- 00:43:57human Centric and may even become
- 00:44:00redundant um you may see completely
- 00:44:03different means of diagnosing problems and I
- 00:44:06think the honeycomb approach the trace
- 00:44:08based approach with these high
- 00:44:09cardinality events is probably the thing
- 00:44:12that I put my money on as the dominant
- 00:44:15pattern that I see winning can
- 00:44:19you explain that real fast I don't know
- 00:44:21what that is so so basically what they
- 00:44:22do is what Charity Majors and some of
- 00:44:25these others have been promoting for
- 00:44:26years is logging out traces but with
- 00:44:31Rich events attached to these so you
- 00:44:34basically can follow like a request
- 00:44:35through your whole application stack and
- 00:44:39um you can log out like a complete
- 00:44:42object payload at multiple steps along
- 00:44:44the way and store that in a system where
- 00:44:46you can query all the information so
- 00:44:48you've got the point and time you've got
- 00:44:50the whole like tree of the trace as well
- 00:44:53and then at each point you can see the
- 00:44:55individual attributes and Fields
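A wide, high-cardinality trace event of the kind just described, one rich record per unit of work rather than a pre-aggregated metric, might look like this (all field names and values are made up for illustration):

```python
# One "wide event": the full trace context plus arbitrary
# high-cardinality attributes attached to a single unit of work.
event = {
    "trace_id": "abc123",
    "span": "checkout/charge-card",
    "duration_ms": 412,
    "user_id": "u-9321",        # high-cardinality fields are welcome here
    "cart_value_usd": 83.50,
    "db_shard": "payments-7",
    "error": "card_declined",
}

def query(events, **filters):
    # Slice stored events by any attribute after the fact,
    # instead of relying on dashboards chosen in advance.
    return [e for e in events if all(e.get(k) == v for k, v in filters.items())]
```

Because each event keeps its raw attributes, an agent (or a human) can ask arbitrary questions later rather than pattern-matching on pre-drawn charts.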
- 00:44:57and so you get a lot more detail in that
- 00:45:00versus if you look at a time series
- 00:45:01you're basically seeing okay CPU is
- 00:45:03going up CPU goes down and what can you
- 00:45:06clean from that you basically have to
- 00:45:08like it's like witchcraft trying to find
- 00:45:11the root cause right but the
- 00:45:16Datadogs of the world have been
- 00:45:17making a lot of money um selling
- 00:45:19consumption and selling the Witchcraft
- 00:45:21to Engineers for years and so there's a
- 00:45:23real incentive to keep this
- 00:45:25going but I think as agents become
- 00:45:27more dominant we'll see them gravitate
- 00:45:30to the most valuable sources of
- 00:45:32information and then if you give your
- 00:45:34agent more and more scope you'll see
- 00:45:36Datadog is rarely involved in these
- 00:45:39root causes so why are we still paying
- 00:45:41for them so I'm not sure what it's going
- 00:45:43to look like in the next two or three
- 00:45:44years but it's going to be interesting
- 00:45:46how things play out as agents become the
- 00:45:49go-to for diagnosing and solving
- 00:45:52problems yeah I hadn't even thought
- 00:45:54about that how for human usage like
- 00:45:57maybe data dog is set up wonderfully
- 00:46:00because we look at it and it gives us
- 00:46:02everything we need and we can root cause
- 00:46:05it very quickly by pattern matching but
- 00:46:07if that turns out to be one of the
- 00:46:09harder things for agents to do instead
- 00:46:12of making an agent better at
- 00:46:15understanding metrics maybe you just
- 00:46:17give it different data and so that it
- 00:46:21can root cause it without those metrics
- 00:46:23and it will shift away from reading the
- 00:46:27information from those
- 00:46:29Services yeah if you look at like chess
- 00:46:31and the AI and like the Stockfishes of
- 00:46:34the world that's just one AI
- 00:46:36that plays against
- 00:46:39Grandmasters mhm even the top players have
- 00:46:42learned from the AI so they know that a
- 00:46:45pawn push on the side has been extremely
- 00:46:49powerful or a rook lift has been very
- 00:46:52powerful so now like the the top players
- 00:46:54in the world adopt these techniques they
- 00:46:56learn from the AIs but that's also
- 00:46:58because it's always a human in the loop
- 00:46:59we still want to see people playing
- 00:47:01people but if you just leave it up to
- 00:47:02the AI like the way they play the game
- 00:47:04is completely different they see things
- 00:47:05that we don't and I know I didn't
- 00:47:08answer your question at the start fully but these
- 00:47:11tools are grounding actions for us so
- 00:47:13the observability stack is one of them
- 00:47:14but ultimately we build a complete
- 00:47:18abstraction over the production
- 00:47:19environment so the agent uses these
- 00:47:23tools and learns how to use these tools
- 00:47:25and knows which tools are the most
- 00:47:26effective
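That tool abstraction, one interface the agent learns with interchangeable backends behind it, might be sketched like this (every class and method name here is hypothetical, including the `search_logs` client call):

```python
from abc import ABC, abstractmethod

class LogTool(ABC):
    # The agent only ever sees this interface; it never knows
    # which backend is answering.
    @abstractmethod
    def search(self, query: str) -> list[str]: ...

class DatadogLogs(LogTool):
    # Production backend wrapping a vendor client
    # (search_logs is a hypothetical client method, not a real API).
    def __init__(self, client):
        self.client = client
    def search(self, query):
        return self.client.search_logs(query)

class FakeLogs(LogTool):
    # Eval-bench backend: scripted logs for a chaos scenario, so the
    # agent can be moved into the fake world without noticing.
    def __init__(self, canned):
        self.canned = canned
    def search(self, query):
        return [line for line in self.canned if query in line]
```

Because both backends satisfy the same interface, the behavior the agent learns against one transfers to the other.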
- 00:47:27but we also build a transferability
- 00:47:31layer so you can shift the agent from
- 00:47:33the real production environment into the
- 00:47:35eval stack and it doesn't even know that
- 00:47:37it's running in an eval stack it's not
- 00:47:39suddenly just looking at like fake
- 00:47:40services fake Kubernetes clusters fake
- 00:47:43Datadogs fake scenarios or fake worlds so
- 00:47:47these tools are an incredibly important
- 00:47:49abstraction it's one of the key
- 00:47:50abstractions that the agent needs and
- 00:47:52honestly it's memory management and
- 00:47:54tools are the two big things that agent
- 00:47:57teams should be focusing on I'd say right
- 00:47:59now wait why do you switch it to this
- 00:48:02fake
- 00:48:03world because that's where you've got
- 00:48:05full control that's where you can
- 00:48:06introduce your own scenarios your own
- 00:48:08chaos and stretch your agent but if you
- 00:48:13do so in a way where the tools are
- 00:48:15different the worlds are different
- 00:48:16experience are different there's less
- 00:48:18transferability when you then take it
- 00:48:20into the production environment and
- 00:48:21suddenly it's going to fall flat so you
- 00:48:23want the like a a real simile of the
- 00:48:27environment in your your tool or your
- 00:48:29eval
- 00:48:30bench and are you doing any type of
- 00:48:34chaos engineering to just see how the
- 00:48:36agents
- 00:48:38perform yes that's pretty much where our
- 00:48:40eval stack is it's chaos we produce a
- 00:48:43world in which the reproduce chaos and
- 00:48:45then we say given this problem what's up
- 00:48:49what's the underlying cause and we see
- 00:48:51how close we can get to the Dost string
- 00:48:52cause yeah
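The chaos-bench loop just described can be sketched roughly like this: inject a fault with a known ground-truth cause into the simulated world, ask the agent for a diagnosis, and score how close it gets. The scenario names, the scoring rule, and the stub agent are all assumptions for illustration, not Cleric's actual benchmark.

```python
# Hypothetical chaos scenarios, each with a known ground-truth cause.
SCENARIOS = [
    {"inject": "kill_db_pod",     "ground_truth": "database pod OOM-killed"},
    {"inject": "revoke_iam_role", "ground_truth": "service lost IAM permissions"},
    {"inject": "fill_disk",       "ground_truth": "node disk pressure evicted pods"},
]

def score(diagnosis: str, ground_truth: str) -> float:
    """Crude token-overlap score; a real bench would use a stronger judge."""
    d, g = set(diagnosis.lower().split()), set(ground_truth.lower().split())
    return len(d & g) / len(g)

def run_bench(agent, scenarios=SCENARIOS):
    """Inject each fault, collect the agent's diagnosis, return mean score."""
    results = []
    for s in scenarios:
        diagnosis = agent(s["inject"])  # agent investigates the injected chaos
        results.append(score(diagnosis, s["ground_truth"]))
    return sum(results) / len(results)

# Stand-in agent stub that only recognizes one failure mode.
def stub_agent(fault):
    return {"kill_db_pod": "the database pod was OOM-killed"}.get(fault, "unknown")

print(f"mean score: {run_bench(stub_agent):.2f}")  # → mean score: 0.33
```

The key property is that the bench is scored against a known injected cause, so "how close we can get" becomes a number you can track across agent versions.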
- 00:48:54 Perfect opportunity for an incredible name, like Lucifer. This is the seventh layer of hell. I don't know, something along those lines.
- 00:49:08 Yeah, we've got some ideas; there's a blog post coming that will add some more layers to this idea, so TBD. I think one thing to note is that this is a very deep space. If you look at self-driving cars, lives are on the line, so people care a lot, and you have to hit a much higher bar than a human driving a car. It's very similar in this space, right? These production environments are sacred; they are important to these companies. If they go down, or if there's a data breach or anything, their business is on the line. CTOs really care. The bar that we have to hit is very high, and so we take security very seriously. But the whole product that we're building requires a lot of care, and there's a lot of complexity that goes into that. So I think it's extremely compelling as an engineer to work in this space, because there are so many compelling problems to solve: the knowledge graph building, the confidence scoring, how you do evaluation, how you learn from these environments and build them into your core product, the tooling layers, the chaos benches, all these things, and how you do that in a reliable, repeatable way. I think that's the other big challenge: if you're on AWS or GCP, using this stack or a different stack, if you're going from e-commerce to gaming to social media, how generalizable is your agent? Can you just scale it, or can you only solve one class of problem? So that's one of the things we're really leaning into right now: the repeatability of the product, and scaling this out to more and more enterprises. But yeah, I'd say it's an extremely complex problem to solve, and even though we're valuable today, true resolution, end-to-end resolution, is maybe multiple years out, just like self-driving cars: it took years to get to a point where we've got Waymos on the roads.
- 00:50:57 Yeah, that's what I wanted to ask you about: true resolution, and how that just scares me to think about, first of all. And I don't have anything running in production, let alone a multi-million dollar system, so I can only imagine that you would encounter a lot of resistance when you bring that up to engineers.
- 00:51:23 Surprisingly, no. There's definitely hesitation, but the hesitation is mostly based on uncertainty: what exactly can you do? And if you show them that we literally can't change things, we don't have the access, the API keys are read-only, we're constrained to these environments, and if you introduce change through the processes they already have, so pull requests, and there are guardrails in place, then they're very open to those ideas. I think a big part of this is that engineers really hate infra and support, so they yearn for something that can help free them from that. But it's a progressive trust-building exercise. We've spoken to quite a lot of enterprises, and almost all of them have different classes of sensitivity. You have your big-fish customers, for example, whose critical systems you don't want to touch, but then you've got your internal Airflow deployments, your CI/CD, your GitLab deployment. If that thing falls over, we can scale it up, or we could try to make a change with zero customer impact. So the areas where we're really helping teams today are the lower-severity or low-risk places where we can make changes, and if you're crushing those changes over time, then engineers will introduce you to the more high-value places. But yes, right now we're steering clear of the critical systems, because we don't want to make a change that is dangerous.
- 00:52:48 Yeah, and it just feels like it's too loaded. So even if you are doing everything right, because it is so high maintenance, you don't want to stick yourself in there just yet. Let the engineers bring you in when they're ready and when you feel like it's ready.
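The guardrail posture described, read-only by default, with proposed changes routed through the team's existing review process and only for systems classified as low-risk, might be sketched as a simple policy like the following. The tier names and functions are illustrative assumptions, not Cleric's implementation.

```python
from dataclasses import dataclass

@dataclass
class System:
    name: str
    tier: str  # assumed tiers: "critical" | "internal" | "low_risk"

def propose_change(system: System, change: str) -> str:
    """Gate agent actions by sensitivity class, never acting directly."""
    if system.tier == "critical":
        # Critical systems: surface a diagnosis only, no change proposed.
        return f"DIAGNOSIS ONLY for {system.name}: {change}"
    # Lower-risk systems: open a pull request so humans review the change
    # through the process they already trust.
    return f"OPEN PR against {system.name}: {change}"

print(propose_change(System("payments-db", "critical"), "restart replica"))
# → DIAGNOSIS ONLY for payments-db: restart replica
print(propose_change(System("internal-airflow", "low_risk"), "scale up workers"))
# → OPEN PR against internal-airflow: scale up workers
```

The point of the design is that trust is earned per tier: the agent only ever widens its blast radius when engineers reclassify a system, not by changing its own policy.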
- 00:53:10 I can see that for sure. Yeah, also behaviorally, engineers won't change the tools they reach for, or their processes, in a wartime scenario. When it's a relaxed environment, they're willing to try AI, experiment with it, and adopt it, but in a critical situation they don't want to introduce an AI and add more chaos into the mix, right? So they want something that reduces the uncertainty.
- 00:53:34 Yeah, that reminds me of one of the major things I notice whenever I'm working with agents or building systems that involve AI: the prompts can be the biggest hang-ups. Obviously I'm not building a product that relies on agents most of the time, so I don't have the drive to see it through, but a lot of times I will fiddle with prompts for so long that I get angry, because I feel like I should just do the thing that I am trying to do, and not get AI to do it. I don't really have an answer for you; that's just the nature of the beast.
- 00:54:31 Yes, exactly. I do want to double-click and say everybody has that problem; everybody struggles with that. You don't know if you're one prompt change away or twenty, and they're very good at making it seem like you're getting closer and closer, but you may not be. We found success in building frameworks for evaluations, so that we can at least extract, either from production or evals, the samples, the ground truth, that gives us confidence we're getting to the answer. Otherwise you can just go forever, right? Just tweaking things and never getting there.
- 00:55:07 That's it, and that's frustrating, because sometimes you take one step forward and two steps back, and you're like, oh my God. It's quite hard with content creation; I think it's harder in your space. I have all but stopped using it for content creation, that's for sure. Maybe to help me fill up a blank page and get directionally correct, but for the most part, yeah, I don't like the way it writes. Even if I prompt it to the maximum, it doesn't feel like it gives me deep insights, so I stopped.
- 00:55:42 But you're still on GPT-3.5, right?
- 00:55:46[Music]
- AI
- SRE
- Knowledge Graphs
- Cleric AI
- Diagnostics
- Root Cause Analysis
- Confidence Scoring
- Operational Complexity
- Chaos Engineering
- Memory Management