Navigating AI for Testing: Insights on Context and Evaluation with Sourcegraph
Summary
TL;DR: The episode delves into AI testing and its integration within software development, featuring Rashab Merra from Sourcegraph, the company behind the coding assistant Cody. The discussion highlights the challenges of managing large code repositories and how AI can boost developer efficiency. By leveraging machine learning, Sourcegraph aims to refine code suggestions and testing, tailoring solutions to developer workflows. The conversation also covers the criticality of evaluation in verifying whether model improvements translate to real-world gains, noting that offline metrics like pass@1 do not always reflect true user experience. Rashab expands on the need for context in AI-driven testing, advocating for models that accurately account for the intricacies of the expansive codebases common in large enterprises. He also underscores the role of AI in writing better, more efficient unit tests that act as a safeguard against poor code entering the system. The episode further explores how AI tools balance speed and quality, particularly in latency-sensitive settings where developers require timely feedback. A recurring theme is the evolving, symbiotic relationship between human programmers and AI systems, suggesting a future where developers focus more on creative and complex problem-solving.
Key Takeaways
- 🤖 AI testing is crucial for efficient software development.
- 🚀 Sourcegraph's Cody aids developers with code completion and testing.
- 🧩 Context is vital for accurate AI-driven code evaluations.
- 🕒 Latency impacts developer satisfaction and tool efficacy.
- 🔄 Evaluation metrics must align with real-world applications.
- 💻 Open Context broadens the scope of code insights.
- 👨‍💻 Developers shift towards more creative roles with AI assistance.
- ⚙️ AI-generated unit tests act as code quality guardrails.
- 📊 Machine learning models must adapt to specific developer needs.
- 🔍 Continuous improvement in AI tools enhances productivity.
Timeline
- 00:00:00 - 00:05:00
The AI Native Dev episode focuses on AI testing, discussing the context models need, the evaluation of generated code, and the timing and automation of AI tests. Rashab Merra from Sourcegraph is introduced, explaining how Sourcegraph tackles the "big code" problem with tools like Cody that improve developer productivity through features like autocomplete and code suggestions.
- 00:05:00 - 00:10:00
Rashab explains his experience in AI since 2009, witnessing the evolution from traditional NLP to large language models (LLMs). He describes how AI, like Cody, abstracts complexity to enhance developer productivity, comparing this progression to his own experience transitioning from coding in C to using advanced frameworks.
- 00:10:00 - 00:15:00
The discussion shifts to Spotify and Netflix's recommendation systems, using various machine learning models to enhance user experience. Rashab parallels this to Cody's multifaceted approach, including features beyond code suggestions like chat, code edits, and unit test generation, each requiring different evaluations, models, and latencies.
- 00:15:00 - 00:20:00
Different features in coding assistants demand varying latency and quality levels. For instance, autocomplete needs low latency, whereas code edit and chat can tolerate more delay. Rashab explains the trade-offs between latency and model size, highlighting the benefits of fine-tuning models for specific tasks like Rust language auto-completion.
- 00:20:00 - 00:25:00
The conversation delves into efficient code and unit test generation balancing user trust in automated systems and developers' cognitive load reductions. As complexity in code suggestions increases, so does the trust issue, stressing the importance of effective evaluation systems and guardrails to prevent introducing errors through automation.
- 00:25:00 - 00:30:00
Evaluation is emphasized as crucial for development and successful adoption of AI-driven tools. Rashab highlights the importance of developing accurate evaluation metrics that mirror real-world usage over standard benchmarks. This ensures improvements in AI features truly enhance user experience.
- 00:30:00 - 00:35:00
Rashab describes the challenges of heterogeneity in coding tasks across different industries, identifying opportunities where pre-trained models excel and where fine-tuning can provide significant benefits by focusing on underserved languages or complex tasks often found in enterprise environments.
- 00:35:00 - 00:40:00
Discussion highlights the adversarial nature of good unit tests as guardrails against bad code. Effective testing prevents the introduction of errors, especially in automated settings where AI-generated code volume increases. The conversation underscores the need for unit testing to evolve alongside increased AI integration.
- 00:40:00 - 00:45:00
Different testing levels (e.g., unit, integration) and their role in ensuring code quality are explored. Rashab stresses the need for automated systems that continuously improve through feedback loops, suggesting an integrated process where human oversight complements machine-driven testing.
- 00:45:00 - 00:52:09
Rashab emphasizes the orchestration of AI tools where developers act as conductors, choosing where to spend effort between optimization of code, testing, and other tasks. He stresses the importance of human understanding and evaluation in deploying AI solutions effectively while embracing automation to reduce toil and improve productivity.
FAQ
Who is the guest on the episode?
Rashab Merra from Sourcegraph is the guest on the episode.
What major concept is discussed in the episode?
The episode discusses AI testing in software development.
How does Sourcegraph help developers?
Sourcegraph offers tools like Cody, a coding assistant that helps improve developer productivity by integrating into IDEs.
What is the role of machine learning in Sourcegraph's tools?
Machine learning in Sourcegraph's tools is used to enhance features such as code completion and testing, assisting developers by understanding and adapting to their workflows.
Why is evaluation critical in machine learning?
Evaluation is critical because it helps determine if improvements in machine learning models genuinely enhance user experience, especially when offline metrics don't always correlate with real-world usage.
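(For reference: pass@1 comes from the HumanEval benchmark. A minimal sketch of the standard unbiased pass@k estimator, with invented sample counts for illustration:)

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: chance that at least one of k samples, drawn
    from n generations of which c pass the unit tests, is correct."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Average per-problem estimates over a HumanEval-style benchmark.
# Each tuple is (generations sampled, generations that passed) -- invented.
results = [(10, 4), (10, 0), (10, 10), (10, 1)]
pass_at_1 = sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
print(f"pass@1 = {pass_at_1:.2f}")  # (0.4 + 0.0 + 1.0 + 0.1) / 4 -> 0.38
```

As the episode stresses, improving this offline number does not guarantee better online metrics for real users.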
What is the significance of latency in AI tools?
Latency is significant as it affects user satisfaction; developers expect fast responses for tasks like code completion, while they're more tolerant of longer wait times for complex tasks like code fixes.
How does Sourcegraph's Cody vary in its features?
Cody offers a range of features from code auto-completion to unit test generation, each with different latency and quality requirements.
What does Rashab consider a 'nightmare' issue?
A 'nightmare' issue is getting accurate unit test generation and evaluation in real-world, large-scale codebases.
What is 'open context' in Sourcegraph?
'Open Context' is a protocol designed to integrate additional context sources to provide better recommendations and code insights.
Why are unit tests important for AI-generated code?
Unit tests serve as guardrails to prevent bad code from entering the codebase, especially as more AI-generated code is used.
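As a toy illustration of that guardrail role (the function, amounts, and rounding rule below are hypothetical, not from the episode): a test pinned to a critical billing behavior blocks an AI-suggested edit that silently changes it.

```python
# Hypothetical guardrail: a critical rounding rule pinned down by a unit test.
from decimal import Decimal, ROUND_HALF_UP

def charge_amount(subtotal: Decimal, tax_rate: Decimal) -> Decimal:
    """Total charge rounded to cents, half-up, as billing requires."""
    total = subtotal * (Decimal(1) + tax_rate)
    return total.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

def test_charge_rounds_half_up_not_bankers():
    # 2.00 * 1.1225 = 2.245 -> 2.25 under half-up (banker's rounding
    # would give 2.24), so a generated "cleanup" that swaps the
    # rounding mode fails CI here instead of reaching production.
    assert charge_amount(Decimal("2.00"), Decimal("0.1225")) == Decimal("2.25")
```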
- 00:00:00[Music]
- 00:00:01you're listening to the AI native Dev
- 00:00:03brought to you by
- 00:00:14Tessl on today's episode we're going to
- 00:00:16be talking about all things AI testing
- 00:00:19so we're going to be dipping into things
- 00:00:22like uh the context that models are
- 00:00:24going to need to be able to create good
- 00:00:26tests uh things like evaluation what
- 00:00:28does it mean for generated code mod uh
- 00:00:30and then look into deeper into AI tests
- 00:00:33about when they should be done should
- 00:00:34they be done early should be should they
- 00:00:36be done late and more automated joining
- 00:00:38me today Rashab Merra from Sourcegraph
- 00:00:42those of you may know Sourcegraph
- 00:00:43better as the tool Cody welcome to the
- 00:00:45session tell us a little bit about
- 00:00:46yourself first of all thanks am this is
- 00:00:48really interesting topic for me and so I
- 00:00:50work with Sourcegraph we do we've done
- 00:00:52like code search for the last
- 00:00:54quite a few years and uh I mean we
- 00:00:56tackled the big code problem like if you
- 00:00:58look at Enterprises Enterprises have a
- 00:01:00lot of developers which is great for
- 00:01:01them they also have like massive code
- 00:01:03bases look at a bank in the US I mean
- 00:01:05they would have like 20 30,000
- 00:01:06developers they would have like more
- 00:01:07than that like 40,000 repositories right
- 00:01:10so this big code problem is a huge pain
- 00:01:12right so what what we've done as a
- 00:01:14company is like we've spent quite a few
- 00:01:15years in tackling code search and then
- 00:01:17last year we launched Cody which is a
- 00:01:20coding assistant and essentially Cody
- 00:01:21lives in your IDE it tries to make
- 00:01:23you a more productive developer
- 00:01:25and there's a bunch of different
- 00:01:26features and my role right now at Sourcegraph
- 00:01:29is to lead the AI effort so I've
- 00:01:31done a lot of like machine learning on
- 00:01:32the consumer side like looking at
- 00:01:33Spotify recommendations music short
- 00:01:35video recommendations my PhD was on
- 00:01:37search which again ties well because in
- 00:01:39llms you need context and we're going to
- 00:01:41talk about yeah a lot of that as well
- 00:01:43how many years would you say you've been
- 00:01:44in
- 00:01:45AI yeah so I think the first research
- 00:01:47paper I wrote was in 2009 and almost 15
- 00:01:50years ago and this was like NLP again
- 00:01:52this is not like deep learning NLP this
- 00:01:54is more like traditional NLP and I I've
- 00:01:57been in the world wherein like the
- 00:01:59domain experts handcraft these features
- 00:02:01and then embeddings came in and like
- 00:02:02they washed all of it away and then
- 00:02:04these like neural models came in they
- 00:02:05washed like custom topic models away and
- 00:02:07then these llms came in and they washed
- 00:02:09a bunch of these rankers away so I think
- 00:02:11like I've seen some of these like waves
- 00:02:12of if you create like specific models
- 00:02:14which are very handcrafted maybe a
- 00:02:16simpler large scale like generalizable
- 00:02:19model will sweep some of these models
- 00:02:21I've seen a few of these waves in the last decade
- 00:02:23and way before it was cool yeah exactly
- 00:02:24I think like I keep saying to a lot of
- 00:02:25like early PhD students like in 2010s a
- 00:02:28lot of the PhD students would get their
- 00:02:29doctorate by just creating latent
- 00:02:31variable models and doing some Gibbs
- 00:02:33sampling inference nobody even knows
- 00:02:35that 15 years ago this is what would get
- 00:02:37you a PhD in machine learning right so again
- 00:02:39things have really moved on you can say
- 00:02:40to that you can say you weren't there
- 00:02:41you weren't there in the early days
- 00:02:43there when you wrote Gibbs sampler code in
- 00:02:45C we didn't even have PyTorch like a bunch
- 00:02:48of these I I think like this is I think
- 00:02:50like we've seen this is Codi and a lot
- 00:02:52of these other gen coding tools they're
- 00:02:54trying to make us as developers work at
- 00:02:56higher and higher abstractions right I
- 00:02:58started my research or ml career not
- 00:03:01writing like PyTorch code right I wrote again
- 00:03:04like a Gibbs sampler in C right and then I'm
- 00:03:06not touching C until it's like really
- 00:03:08needed now so a lot of these like
- 00:03:10Frameworks have come in which have
- 00:03:11abstracted the complexities away and
- 00:03:13made my life easier and again we're
- 00:03:16seeing again I've seen it as an IC over
- 00:03:18the last 10 15 years but also that's
- 00:03:20also happening in the industry right you
- 00:03:22don't have to be bothered by lowlevel
- 00:03:23Primitives and you can maybe tackle
- 00:03:24higher level things and that's what
- 00:03:26exactly Cody is trying to do that how do
- 00:03:28I remove toil from your developers life
- 00:03:31and then make them focus on the
- 00:03:32interesting pieces and the creative
- 00:03:33pieces and like really get the
- 00:03:34architecture right yeah so let's let's
- 00:03:36talk a little bit about some of the
- 00:03:38features that you mentioned because
- 00:03:39Cody's obviously not it's when we think
- 00:03:41about a tool like Cody it's not just
- 00:03:45code suggestion there's a ton of
- 00:03:47different things and you mentioned some
- 00:03:48of them we'll be talking a little bit
- 00:03:49more probably about testing today but
- 00:03:51when we think about ML and the use of ml
- 00:03:54here how does it differ from the various
- 00:03:57features that a tool like Cody offers
- 00:04:03yeah that's an excellent point let
- 00:04:05me take a step back right let's not even
- 00:04:06talk about coding assist let's go back
- 00:04:08to recommendations and let's look at
- 00:04:10Spotify or Netflix right people think
- 00:04:12that hey if you're doing Spotify is
- 00:04:13famous for oh it knows what I want it it
- 00:04:16suggests nostalgic music for me so people
- 00:04:18in the world love the recommendations
- 00:04:20from Spotify I'm not saying this just
- 00:04:22because I was an employee at Spotify but
- 00:04:23also as a user right I like Spotify
- 00:04:25recommendations I have to say Netflix is
- 00:04:27there for me like I've never binge
- 00:04:29listened to anything on
- 00:04:31Spotify but Netflix has kept me up to
- 00:04:33the early hours exact questioning my
- 00:04:35life yeah I think there a series right
- 00:04:37like Netflix for videos you look at
- 00:04:39Spotify for music you look at Tik Tok
- 00:04:41Instagram me for short videos so again
- 00:04:43the mediums are like it used to be hours
- 00:04:45of content to like minutes of music to
- 00:04:47like seconds of short video but then the
- 00:04:49point is we love these products but this
- 00:04:52is not just one ranker Netflix is not
- 00:04:54just one ranker right it's a bunch of
- 00:04:55different features a bunch of surfaces
- 00:04:58and user touch points and each touch
- 00:05:00point is powered by a different machine
- 00:05:01learning model different thinking
- 00:05:02different evaluation right so my point is
- 00:05:04we we this is how the industry has
- 00:05:06evolved over the last 10 15 years right
- 00:05:08most of these user Centric applications
- 00:05:09which people love and like hundreds and
- 00:05:11whatever like 400 million users are
- 00:05:12using it monthly they are a mix of
- 00:05:15different features and each feature
- 00:05:17needs a design of what's the ml model
- 00:05:19what's the right tradeoffs what are the
- 00:05:20right metrics what's the right science
- 00:05:22behind it what's the right evaluation
- 00:05:23behind it so we have seen this and now
- 00:05:25I'm seeing exactly the same at Cody
- 00:05:27right if you look at Cody so Cody lives
- 00:05:29in your IDE so if you're a developer if
- 00:05:30you're using VS Code or JetBrains again
- 00:05:32if you're writing code then we can code
- 00:05:34complete right so autocomplete is an
- 00:05:36important feature that's like very high
- 00:05:37volume right why because it's not that
- 00:05:39again right a lot of users use it but
- 00:05:41also when you're writing code in a file
- 00:05:43you will trigger autocomplete like maybe
- 00:05:44hundreds of times because you're writing
- 00:05:46like one word you're writing like some
- 00:05:47syntax in line It'll like trigger and
- 00:05:49then if you like what like a ghost text
- 00:05:52which is like recommended you can just
- 00:05:54Tab and like it selects right yeah so
- 00:05:56then it helps you write code in a faster
- 00:05:58way now this is I would say that this is
- 00:06:00something on the like extreme of hey I
- 00:06:03have to be like very latency sensitive
- 00:06:04and like really be fast right because
- 00:06:06when you're on Google you're typing a
- 00:06:07query Google will do like auto suggestions
- 00:06:09right itself can I complete your query
- 00:06:11so we we've been like using some of
- 00:06:13these features for the last decade as
- 00:06:14users on the consumer side and this is
- 00:06:16really important as well because this
- 00:06:18interaction that you're just talking
- 00:06:19about here it's a very emotive one you
- 00:06:21will you will piss people off if you do
- 00:06:23not if you do not provide them cuz this
- 00:06:25is supposed to be an efficiency play if
- 00:06:27you don't provide them with something
- 00:06:28that they can actually improve their
- 00:06:30workflow and they can just Tab and say
- 00:06:32yeah this is so cool I'm tabing tabing
- 00:06:33tabing accepting then they're going to
- 00:06:35they're going to get annoyed to the
- 00:06:37extent that they will almost reject that
- 00:06:39kind exactly this is where quality is
- 00:06:41important but latency is also important
- 00:06:42right so now we see that there's a
- 00:06:43trade-off yeah that hey yes I want
- 00:06:45quality but then if you're doing it like
- 00:06:46400 milliseconds later hey I'm already
- 00:06:49annoyed why because the perception of
- 00:06:51the user on when they're competing right
- 00:06:53they don't want to wait if I have to
- 00:06:55wait then I'd rather go to code edit or
- 00:06:56like chat this is that's a great point
- 00:06:59is is latency does latency differ
- 00:07:01between the types of things that c
- 00:07:03offers or cuz I I would guess if if a
- 00:07:05developer is going to wait for something
- 00:07:06they might they'll be likely to wait a
- 00:07:08similar amount of time across all but
- 00:07:10different yeah not really the expectations differ let
- 00:07:12me go back to the features right Cody has
- 00:07:14auto complete which helps you complete
- 00:07:15code when you're typing out we have code
- 00:07:17edit code fix here there's a bug you can
- 00:07:19write a command or select something and
- 00:07:20say hey Cod fix it yeah now when you're
- 00:07:22fixing then people are okay to wait for
- 00:07:24a second or two as it figures out and
- 00:07:25then they're going to show a diff and oh
- 00:07:27I like this change and I'm going to
- 00:07:28accept it now this is going to span
- 00:07:29maybe 3,000 milliseconds right 3 to 4
- 00:07:31seconds yeah versus auto complete no I
- 00:07:33want everything tab complete selected
- 00:07:35within 400 500 millisecond latency again
- 00:07:37just there the difference start popping
- 00:07:39up now we've talked about autocomplete
- 00:07:40code edit code fix then we could go to
- 00:07:42chat as well right chat is okay I'm
- 00:07:44typing inquiry it'll take me like a few
- 00:07:46seconds to type in the right query and
- 00:07:47select the right code and do this right
- 00:07:49so the expectation of 400 milliseconds
- 00:07:51is not really the case in chat because
- 00:07:53I'm asking maybe a more complex query I
- 00:07:55want you to take your time and give the
- 00:07:57answer versus like unit test generation
- 00:07:59for example right write the
- 00:08:00entire test and make sure that you cover
- 00:08:03the right corner cases the unit test has
- 00:08:04great coverage and like you're not just
- 00:08:06missing important stuff you're making sure
- 00:08:08that the unit test is actually quite good now
- 00:08:11there I don't want you to complete in 400
- 00:08:13milliseconds take your time write
- 00:08:15good code I'm waiting I'm willing to
- 00:08:16wait a long time yeah let's take a step
- 00:08:18back what what are we looking at we
- 00:08:20looking at a few different features now
- 00:08:21similar right Netflix Spotify is not
- 00:08:23just one recommendation model you go do
- 00:08:25search you go do podcast you go do hey I
- 00:08:27want this category content a bunch of
- 00:08:28these right so similarly here in coding
- 00:08:31assistant for Cody you have autocomplete
- 00:08:32code edit code fix unit test generation
- 00:08:35you have a bunch of these commands you
- 00:08:36have chat chat is an entire nightmare I
- 00:08:38can talk about hours on like chat is
- 00:08:40like this one box Vision which people
- 00:08:42can come with like hundreds of intents
- 00:08:44yeah that's open-ended it's a nightmare for
- 00:08:47me as an engineer to say that are we
- 00:08:48doing well on that because in auto
- 00:08:50complete I can develop metrics around it
- 00:08:52I can think about okay this is a unified
- 00:08:54this is a specific feature yeah chat may
- 00:08:56be like masking hundreds of these
- 00:08:58features just by natural language so we
- 00:09:00can talk a little more about chat as
- 00:09:01well over a period of time but coming
- 00:09:04back to the original point you mentioned
- 00:09:05for autocomplete people will be latency
- 00:09:07sensitive yeah for unit test generation maybe
- 00:09:09less for chat maybe even less what that
- 00:09:11means is the design choices which I have
- 00:09:14as an ml engineer are different in order
- 00:09:16to complete I'm not going to look at
- 00:09:17like 400 billion parameter BS right I
- 00:09:19ownn something which is f right so if
- 00:09:21you look at the x-axis latency Y axis
- 00:09:23quality look I don't want to go the top
- 00:09:26right top right is high latency and high
- 00:09:28quality I don't want latency I want to
- 00:09:30be in the like whatever 400 500 n to n
- 00:09:33millisecond latency space so there small
- 00:09:35models kick in right and small models we
- 00:09:37can fine-tune for great effect right we
- 00:09:39we've done some work we just published
- 00:09:40a blog post a couple of weeks ago on
- 00:09:42hey if you fine-tune for Rust Rust is
- 00:09:44like a lot more has a lot of nuances
- 00:09:46which most of these large language
- 00:09:47models are not able to capture so we can
- 00:09:49fine-tune a model for Rust and do really
- 00:09:51well on auto completion within the
- 00:09:53latency requirements which we have for
- 00:09:54this feature yeah so then these
- 00:09:56trade-offs start emerging essentially
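(A rough sketch of the trade-off Rashab describes here: route each feature to the best model whose latency fits that feature's budget. Every model name, latency, and quality score below is invented for illustration.)

```python
# Hypothetical sketch: pick a model per feature based on its latency budget.
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    p50_latency_ms: int
    quality: float  # offline eval score, higher is better

MODELS = [
    Model("small-finetuned-rust", p50_latency_ms=250, quality=0.71),
    Model("medium-general", p50_latency_ms=900, quality=0.78),
    Model("large-general", p50_latency_ms=3500, quality=0.84),
]

LATENCY_BUDGET_MS = {"autocomplete": 500, "code_edit": 4000, "unit_test_gen": 30000}

def pick_model(feature: str) -> Model:
    """Best-quality model that still meets the feature's latency budget."""
    budget = LATENCY_BUDGET_MS[feature]
    candidates = [m for m in MODELS if m.p50_latency_ms <= budget]
    return max(candidates, key=lambda m: m.quality)

assert pick_model("autocomplete").name == "small-finetuned-rust"
assert pick_model("unit_test_gen").name == "large-general"
```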
- 00:09:58how does that change if you if the
- 00:10:00output that Cody was going to provide
- 00:10:02the developer would actually be on a
- 00:10:04larger scale so we talked when we're
- 00:10:05talking about autocomplete we're really
- 00:10:06talking about a one liner complete but
- 00:10:08what if we was to say I want you to
- 00:10:10write this method or I want you to write
- 00:10:11this module obviously then you don't
- 00:10:13want that you don't want the developer
- 00:10:15to accept an autocomplete or look at an
- 00:10:18autocomplete module or or function and
- 00:10:21think this is absolute nonsense it's
- 00:10:22giving me nonsense quickly presumably
- 00:10:24then they're willing to they're willing
- 00:10:26to wait that much longer yeah I think
- 00:10:27there's that's a really good point right
- 00:10:29that people use Cody and not just Cody
- 00:10:31and not just in the coding domain right we
- 00:10:33use co-pilots across different I mean
- 00:10:35sales co-pilot marketing co-pilot
- 00:10:36finance risk co-pilot people are using
- 00:10:38these agents or assistants in for
- 00:10:41various different tasks right and some
- 00:10:44of these tasks are like complex and more
- 00:10:46sophisticated some of these tasks are
- 00:10:47like simpler right yeah so let me let me
- 00:10:49just paint this like picture of how I
- 00:10:51view this right when you pick up a topic
- 00:10:53to learn right be it programming you
- 00:10:54don't start with like multi-threading
- 00:10:55you start with okay do I know the syntax
- 00:10:57can I instantiate variables can I do
- 00:10:59if-else can I do a for loop and can I do
- 00:11:00switch and then multithread and then
- 00:11:02parallelism so when we as humans learn
- 00:11:04there is a curriculum we learn on right
- 00:11:06we don't directly go to chapter 11 we
- 00:11:07start with chapter 1 similarly I think
- 00:11:09like the lens which which I view some of
- 00:11:12these agents and tools are it's okay you
- 00:11:14not at like chapter 12 yet but I know
- 00:11:17that there are simpler tasks you can do
- 00:11:18and then there are like medium tasks you
- 00:11:19can do and then there are like complex
- 00:11:20tasks you can do now this is a lens
- 00:11:23which is which I found to be like pretty
- 00:11:25useful because when you say that hey for
- 00:11:26aut complete for example I don't want
- 00:11:28again my use case probably for auto
- 00:11:30compete is not okay tackle a chapter 12
- 00:11:32complexity problem no for that I'll
- 00:11:34probably have an agentic setup yeah so
- 00:11:36this curriculum is a great way to look
- 00:11:37at it the other great way to look at
- 00:11:38things are like let's just call it like
- 00:11:40left versus right so on the left we have
- 00:11:41these tools which are living in your IDE
- 00:11:43and they're helping you write better
- 00:11:44code and like complete code and like
- 00:11:46really you are the main lead you're
- 00:11:47driving the car you're just getting some
- 00:11:48assistance along the way right versus
- 00:11:51things on the right are like agentic
- 00:11:53right that hey I'm going to type in here
- 00:11:54is my GitHub issue send me create a
- 00:11:56bootstrap me a PR for this right there I
- 00:11:59want the machine learning models to take
- 00:12:00control not full autonomy Quinn our CEO
- 00:12:03has an amazing blog post yeah on levels
- 00:12:05of code AI right you start level zero to
- 00:12:07level seven and some of these human
- 00:12:09initiated AI initiated AI-led and that
- 00:12:11gives us a spectrum to look at autonomy
- 00:12:13from a coding assistant perspective
- 00:12:15which is great I think everybody should
- 00:12:16look at it but coming back to the
- 00:12:18question auto complete is probably for
- 00:12:21I'm still the lead driver here help me
- 00:12:24but in some of the other cases I'm stuck
- 00:12:26take your time but then tackle more
- 00:12:28complex task yeah now the context and
- 00:12:30the model size the the latencies all of
- 00:12:32these start differing here right when
- 00:12:34you're writing this code you probably
- 00:12:35need for autocomplete local context
- 00:12:37right or like maybe if you're
- 00:12:38referencing a code from some of
- 00:12:40repository bring that as a dependency
- 00:12:42code and then use it in context if
- 00:12:44you're looking at a new file generation
- 00:12:46or like a new function generation that's
- 00:12:48okay you got to look at the entire
- 00:12:49repository and not just make one changes
- 00:12:51over here you have to make an entire
- 00:12:53file and make changes across three other
- 00:12:55files right M so even where is the
- 00:12:57impact the impact in auto it is like
- 00:12:59local in this file in this region right
- 00:13:02and then if you look at the full
- 00:13:03autonomy case or like agentic setups
- 00:13:05then the impact is okay I'm going to
- 00:13:06make five changes across three files
- 00:13:08into two repositories yeah right so
- 00:13:10that's that's the granularity at which
- 00:13:11some of these things are starting to
- 00:13:13operate essentially yeah and testing is
- 00:13:14going to be very similar as well
- 00:13:15right if someone is if someone's writing
- 00:13:18code in their IDE line by line and
- 00:13:20that's maybe using a code generation as
- 00:13:22well like Cody they're going to likely
- 00:13:25want to be able to have test step step
- 00:13:27in sync so as I write code you're
- 00:13:29automatically generating tests that are
- 00:13:32effectively providing me with that
- 00:13:33assurance that that the automatically
- 00:13:35generated code is working as I want to
- 00:13:38yeah that's a great point I think this
- 00:13:39is more like errors multiply yeah if I'm
- 00:13:41evaluating something long after writing
- 00:13:44it then it's worse off right because the
- 00:13:47errors I could have stopped the errors
- 00:13:48earlier on and then debugged it and
- 00:13:51fixed it locally and then moved on so
- 00:13:53especially so taking a step back look I
- 00:13:55love evaluation I really in machine
- 00:13:57learning I I started my PhD thinking
- 00:13:59that hey math and like fancy graphical
- 00:14:02models are the way to have impact using
- 00:14:03machine learning right and you spend one
- 00:14:06year in the industry realized nah it's
- 00:14:07not about the fancy model it's about do
- 00:14:09you have an evaluation you have these
- 00:14:11metrics do you know when something is
- 00:14:12working better yeah so I think getting
- 00:14:14the zero to one on evaluation on these
- 00:14:16data sets that is really key for any
- 00:14:19machine learning problem y now
- 00:14:21especially when what you mean by the
- 00:14:22zero to one yeah 0 to1 is look at like
- 00:14:24whenever a new language model gets
- 00:14:26launched right people are saying that
- 00:14:27hey for coding Llama 3 does well on
- 00:14:30coding why because oh we have this Human
- 00:14:32Eval data set and a pass@1 metric
- 00:14:34let's unpack that the HumanEval data set is
- 00:14:36a data set of 164 questions hey write me
- 00:14:39a binary search in this code right so
- 00:14:41essentially it's like you get a text and
- 00:14:43you write a function and then you're
- 00:14:45like hey does this function run
- 00:14:46correctly so they have a unit test for
- 00:14:47that and if it passes then you get plus
- 00:14:49one right yeah so now this is great it's
- 00:14:51a great start but is it really how
- 00:14:54people are using Cody and a bunch of
- 00:14:56other coding tools no they're like if
- 00:14:57I'm an Enterprise developer if if let's
- 00:14:59say I'm in a big bank then I have 20,000
- 00:15:01other peers and there are like 30,000
- 00:15:03depositories but I not writing binary
- 00:15:05search independent of everything else
- 00:15:07right I'm working in a massive code base
- 00:15:09which has been edited across the last 10
- 00:15:11years and there's some dependency by
- 00:15:13some team in Beijing and there's a
- 00:15:14function which I haven't even read right
- 00:15:16and maybe it's in a language I don't
- 00:15:18even care about or understand so my
- 00:15:20point is the evaluation which we need
- 00:15:23for these Real World products is
- 00:15:24different than the benchmarks which we
- 00:15:26have in the industry right now the one
- 00:15:29for evaluation is that hey sure let's
- 00:15:31use pass@1 on HumanEval at the start on
- 00:15:33Day Zero but then we see that you
- 00:15:35improve it by 10% we have results when
- 00:15:38we actually did improve pass@1 by
- 00:15:4010 15% we tried it online on Cody users
- 00:15:43and the metrics dropped yeah and we
- 00:15:45writing a blog post about it on offline
- 00:15:46online correlation yeah because if you
- 00:15:48trust your offline metric pass@1
- 00:15:50you improve it you hope that hey amazing
- 00:15:51users are going to love it yeah it
- 00:15:53wasn't true the context is so different
- 00:15:55yeah the context are so different now
- 00:15:57this is this means that I got to develop
- 00:15:59an evaluation for my feature and I got
- 00:16:02my evaluation should represent how my
- 00:16:03actual users using this feature feel
- 00:16:05about it m just because it's better on a
- 00:16:07metric which is an industry Benchmark
- 00:16:10doesn't mean that improving it will
- 00:16:11improve actual user experience and can
- 00:16:12that change from user to user as well so
- 00:16:14you mention a bank there if five other
- 00:16:16Banks is it going to be the same for
- 00:16:18them if something not in the fin Tech
- 00:16:20space is it going to be different for
- 00:16:21them that's a great point I think the
- 00:16:22Nuance you're trying to say is that hey
- 00:16:24one are you even feature-aware in your
- 00:16:27evaluation because pass@1 is not
- 00:16:28feature-aware right yeah pass@1
- 00:16:30doesn't care about autocomplete or
- 00:16:31unit test generation or code fixing I
- 00:16:33don't care what the end use case or
- 00:16:34application is this is just evaluation
- 00:16:36so I think the first jump is have an
- 00:16:38evaluation data set which is about your
- 00:16:41feature right the evaluation data set
- 00:16:42for unit test generation is going to be
- 00:16:44different that code completion it's
- 00:16:45going to be different than code edits
- 00:16:46it's going to be different than chat so
- 00:16:48I think the 0 to one we talking about 5
- 00:16:49minutes earlier you got to do 0 to ones
- 00:16:51for each of these features yeah and
- 00:16:53that's not easy because evaluation
- 00:16:55doesn't come naturally yeah and once you
- 00:16:56have it then the question becomes that
- 00:16:58hey okay once I have it for my feature
- 00:17:00then hey can I reuse it across
- 00:17:01Industries can I reuse it across like
- 00:17:04users and I think we've seen it I've
- 00:17:05seen it in the rec traditional
- 00:17:07recommendation space let's say most of
- 00:17:09these apps again if they got like seed
- 00:17:11funding last year or maybe series a
- 00:17:13there at what like 10,000 daily active
- 00:17:14users 5,000 daily active users today one
- 00:17:16year from now they're going to be 100K
- 00:17:18500k daily active users right now how
- 00:17:21representative is your subset of users
- 00:17:22today right the 5,000 users today are
- 00:17:24probably early adopters and if anything
- 00:17:27is scaling companies in the last 10
- 00:17:28years what it has told us is the early
- 00:17:30adopters are probably not the
- 00:17:31representative set of users you'll have
- 00:17:33once you have a mature adoption yeah
- 00:17:35what that means is the the metrics which
- 00:17:36are developed and the learnings which I've
- 00:17:38had from the initial AB test may not
- 00:17:41hold one year down the line six months
- 00:17:42down the line as in when the users start
- 00:17:44increasing right yeah now how does it
- 00:17:46link to the point you asked look there
- 00:17:48are heterogeneities across different
- 00:17:49domains different Industries luckily
- 00:17:51there are like homogeneities across
- 00:17:52language right if you're a front-end
- 00:17:53developer a versus B versus C companies
- 00:17:56a lot of the tasks you're trying to do
- 00:17:57are like similar and a lot of the task
- 00:17:59which the pre-training data set has seen
- 00:18:02is also similar because rarely again
- 00:18:04there are cases where you're doing
- 00:18:05something really novel but a lot of the
- 00:18:08junior development workflow probably is
- 00:18:10more like things which like hundreds and
- 00:18:11thousands of Engineers have done before
- 00:18:13so the pre-trained models have seen this
- 00:18:14before right so when we fine-tuned the model
- 00:18:16for us that's not where we saw
- 00:18:18advantages because yeah you've seen it
- 00:18:19before you're going to do it well MH
- 00:18:21coming back to the point I mentioned
- 00:18:22earlier it's going to be a curriculum
- 00:18:23right you can do simple things well you
- 00:18:25can do harder things in Python well you
- 00:18:27can't do harder things in Rust well you
- 00:18:29can't do harder things in MATLAB well so
- 00:18:31my goal of fine-tuning some of these
- 00:18:32models is that hey I'm going to show you
- 00:18:34examples of these hard task not because
- 00:18:37I just want to play with it but because
- 00:18:39some of our adopters right some we have
- 00:18:41a lot of Enterprise customers using us
- 00:18:43right and paying us for that right I get
- 00:18:44my salary because of that essentially so
- 00:18:46essentially I want those developers to
- 00:18:47be productive and they're trying to
- 00:18:49tackle some complex tasks in Rust which
- 00:18:51maybe we haven't paid attention when we
- 00:18:53were training this llama model or like
- 00:18:54this Anthropic model so then my goal is
- 00:18:56how do I extract those examples and then
- 00:18:59bring it to my Loop training Loop
- 00:19:01essentially and that's where right now
- 00:19:03if let's say one industry is struggling
- 00:19:06we know how the metrics are performing
- 00:19:08right that's what evaluation is so
- 00:19:09important we know where we suck at right
- 00:19:12now and then we can start collecting
- 00:19:13public examples and start focusing the
- 00:19:16models to do well on those right yeah
- 00:19:18again let me bring the exact point I
- 00:19:19mentioned I I'm going to say it 100
- 00:19:21times we've done it before if you spend
- 00:19:2320 minutes on Tik Tok you're going to
- 00:19:26look at what 40 short videos If you
- 00:19:28spend 5 minutes on Tik Tok or Instagram
- 00:19:30re you're going to look at like 10 short
- 00:19:31videos right yeah in the first nine
- 00:19:33short videos you're going to either skip
- 00:19:34it or like it or follow a Creator do
- 00:19:36something right so the 11th short video is
- 00:19:38like really personalized because I've
- 00:19:39seen what you're doing in the last 5
- 00:19:40minutes and I can do real-time
- 00:19:42personalization for you yeah what does
- 00:19:44that mean in the coding assistant world
- 00:19:45look I know how these models are used in
- 00:19:47the industry right now and how our
- 00:19:48Enterprise customers and our community
- 00:19:50users are using it yeah let's look at
- 00:19:51the completion acceptance rate for
- 00:19:53autocomplete oh for these languages in
- 00:19:55these use cases we get a high acceptance
- 00:19:57rate we show our recommendation
- 00:19:59people accept the code and move on but
- 00:20:01in these other examples oh we're not really
- 00:20:03doing well so the question then becomes
- 00:20:06oh this is something which maybe we
- 00:20:07should train on or we should fine-tune on
- 00:20:10and this establishes a feedback loop
- 00:20:11yeah that look at what's not working and
- 00:20:14then make the model look at more of
- 00:20:16those examples and create that feedback
- 00:20:17loop which can then make the models
- 00:20:19evolve over a period of time
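(A minimal sketch of the feedback loop described here; the telemetry fields and the 50% threshold are invented for illustration: aggregate autocomplete acceptance by language and flag low performers as fine-tuning targets.)

```python
# Hypothetical sketch: mine autocomplete telemetry for fine-tuning targets.
from collections import defaultdict

events = [  # (language, suggestion_accepted) -- invented telemetry
    ("python", True), ("python", True), ("python", False),
    ("rust", False), ("rust", False), ("rust", True),
]

shown, accepted = defaultdict(int), defaultdict(int)
for lang, ok in events:
    shown[lang] += 1
    accepted[lang] += ok  # bool counts as 0 or 1

for lang in shown:
    rate = accepted[lang] / shown[lang]
    if rate < 0.5:  # underserved: collect more examples of this language
        print(f"{lang}: acceptance {rate:.0%} -> candidate for fine-tuning data")
```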
- 00:20:20and so I think there's two pieces here if we move
- 00:20:22a little bit into the into what this
- 00:20:24then means for testing the more that
- 00:20:27evaluation like almost like testing of
- 00:20:29the model effectively gets better it
- 00:20:31effectively means that the suggested
- 00:20:33code is then more accurate as a result
- 00:20:36the you need to rely on your tests
- 00:20:39slightly less I'm not saying you
- 00:20:40shouldn't but slightly less because the
- 00:20:42generated code that is being suggested
- 00:20:44is a higher quality when it then comes
- 00:20:46to the suggestions of tests afterwards
- 00:20:50does that follow the same model in terms
- 00:20:52of learning from not just the what the
- 00:20:54user wants to test but also from what is
- 00:20:57being generated
- 00:20:59is there work that we can do there into
- 00:21:00building that test yeah I think that's
- 00:21:02an interesting point test generation is a
- 00:21:05self-contained problem in itself right
- 00:21:07yeah one I think let's just establish
- 00:21:08the fact that unit test generation is
- 00:21:10probably one of the highest value use
- 00:21:13cases yeah we got to get right in the
- 00:21:15coding industry and that's because I
- 00:21:17mentioned that look you will stop using
- 00:21:19spotify if I start showing shitty
- 00:21:21recommendations to you yeah if I don't
- 00:21:23learn from my mistakes you're going to
- 00:21:25keep skipping like short videos or music
- 00:21:27or podcast and you're going to go right
- 00:21:28I don't get value because I spend I'm
- 00:21:30running I'm jogging I want music to just
- 00:21:32come in and I don't want to stop my
- 00:21:33running and hit like skip because that's
- 00:21:36takes that's like dissatisfaction right
- 00:21:38yeah one of the things which I did at
- 00:21:39Spotify and like in my previous company
- 00:21:40was like I really wanted people to focus
- 00:21:43on dissatisfaction yeah because
- 00:21:44satisfaction is all users are happy yeah
- 00:21:46that's not where I make more money I
- 00:21:47make more money by reducing
- 00:21:49dissatisfaction where are you unhappy
- 00:21:51and how do I fix that and even there if
- 00:21:53I stop you there actually just quickly I
- 00:21:55think there are different levels to this
- 00:21:56as well in terms of the testing right
- 00:21:57cuz it's at some point it's yes at its
- 00:21:59most basic does this code work and then
- 00:22:03as it goes up it's does this code work
- 00:22:04well does this code work fast does this
- 00:22:06code is this really making you happy as
- 00:22:08a user and so I think where would you
- 00:22:12say we are right now in terms of the
- 00:22:14level of test generation are people most
- 00:22:16concerned with the this code is being
- 00:22:19generated or when I start creating my
- 00:22:21tests how do I validate that my code is
- 00:22:24correct in terms of it compiles it's
- 00:22:28it's covering my basic use cases that
- 00:22:30I'm asking for and doing the right thing
- 00:22:32no this we can talk an hour about this
- 00:22:34look I think this is a long journey yeah
- 00:22:36getting evaluation getting unit test
- 00:22:37generation right across the actual
- 00:22:39representative use case in the
- 00:22:41Enterprise that's a nightmare of a
- 00:22:43problem yeah look I can do it right for
- 00:22:45writing binary binary search algorithms
- 00:22:47right if you give me like if you give me
- 00:22:48a coding task which has nothing to do
- 00:22:50with a big code repository and like
- 00:22:52understanding context understanding what
- 00:22:54the 5,000 developers have done sure I
- 00:22:56can attempt it and create a unit test
- 00:22:58because this there's a code and there's
- 00:23:00a unit test this lives independently right
- 00:23:02they live on an island they're happily
- 00:23:03married amazing everything works but
- 00:23:04this is not how people use coding this
- 00:23:06is not how like the Enterprise or like
- 00:23:08even Pro developer use cases are the pro
- 00:23:10developer use cases are about like hey
- 00:23:12is this working correctly in this like
- 00:23:14wider context because there's a big code
- 00:23:16repository and like multiple of them and
- 00:23:18that's a wider context wherein you're
- 00:23:20actually writing the code and you're
- 00:23:21actually writing the unit test now I
- 00:23:23would bring in this philosophical
- 00:23:25argument of I think unit test generation I
- 00:23:27would look at it from an adversarial
- 00:23:28setting what's the point of having the
- 00:23:30unit test it's not just to make
- 00:23:32yourself or your manager happy that
- 00:23:34oh I have unit test coverage unit tests are
- 00:23:36probably like a guard rail to keep bad
- 00:23:39code from entering your system yes so
- 00:23:41what is this this is an adversarial
- 00:23:42setup maybe not intentionally
- 00:23:44adversarial but an adversarial setup okay somebody's
- 00:23:45trying to make bad things happen in your
- 00:23:47code and somebody else is stopping that
- 00:23:50from happening right so again if you
- 00:23:52start looking at unit test generation from
- 00:23:53this adversarial setup that look this is a
- 00:23:55good guy right the unit test is going to
- 00:23:58prevent bad things from happening in
- 00:23:59future to my code base that's why I need
- 00:24:02good unit test now this bad now this is
- 00:24:04a good guy right who are the bad people
- 00:24:06right now in the last up until the last
- 00:24:07few years the bad people not
- 00:24:09intentionally bad but the Bad actors in
- 00:24:11The Code base were developers yeah now
- 00:24:13we have ai yeah right now I am as a
- 00:24:15developer writing code and if I write
- 00:24:16shitty code then the unitest will catch
- 00:24:18it and I won't be able to merge yeah
- 00:24:20right if
- 00:24:21ifel right exactly we'll get to that
- 00:24:24right I'm yet to I'm yet to see a
- 00:24:26developer who who is not a tdd fan who
- 00:24:29absolutely lives for writing tests and
- 00:24:31building a perfect test suite for the code
- 00:24:33yeah yeah again right and there's a
- 00:24:34reason why test case coverage is
- 00:24:36like low across like all the
- 00:24:38repositories right it's not something
- 00:24:39which again I think like Beyang our CTO he
- 00:24:42loves to say that the goal of Cody is to
- 00:24:43remove developer toil yeah and how do I
- 00:24:45make you do a lot more happier job
- 00:24:48focusing on the right creative aspects
- 00:24:50of architecture design or system design
- 00:24:51and remove toil from your life right a
- 00:24:53bunch of developers for better or worse
- 00:24:55start looking at unit test generation
- 00:24:57as maybe it's not as interesting let's
- 00:24:59unpack that as well not all unit tests
- 00:25:01are like boring right yeah writing
- 00:25:02stupid unit test for stupid functions we
- 00:25:04shouldn't even like probably do it or
- 00:25:06like I I will let like machine learning
- 00:25:07do it essentially but the Nuance are
- 00:25:09like here is a very critical function if
- 00:25:12you screw this then maybe the payment
- 00:25:14system in your application gets screwed
- 00:25:16and then you lose money right and then
- 00:25:18you don't if you don't have
- 00:25:19observability then you lose money over a
- 00:25:21period of time and then you're literally
- 00:25:22costing company dollars millions of
- 00:25:23dollars if you screw this code
- 00:25:25essentially so the point is not all unit
- 00:25:26tests are the same because not all
- 00:25:27functions are equally important right
- 00:25:29there's going to be a distribution of
- 00:25:30some of the are like really really
- 00:25:31important functions you got to get a
- 00:25:33amazing unit test right I would rather I
- 00:25:35if I have a limited budget that if I
- 00:25:37have to principal Engineers I would make
- 00:25:39sure that the unit of these really
- 00:25:41critical pieces of component in my
- 00:25:43software stack are written by these
- 00:25:45Engineers or even if they written by
- 00:25:48like these agents or AI Solutions then
- 00:25:49at least like they wed it from some of
- 00:25:51these MH but before we get there let's
- 00:25:54just look at the fact of the need for
- 00:25:56unit test not just today but tomorrow
- 00:25:58yeah because right now if you have
- 00:26:00primarily developers writing unit test
- 00:26:02or like some starting tools tomorrow a
- 00:26:04lot more AI assistant I mean we are
- 00:26:06building one right we are trying to say
- 00:26:08that hey we're going to write more and
- 00:26:09more of your code yeah what that means
- 00:26:11is if in the adversarial setup unit test
- 00:26:14are like protecting your code base the
- 00:26:16the potential attacks not intentional
- 00:26:18but the bad code could come in from
- 00:26:19humans but also like thousands and
- 00:26:20millions of AI agents tomorrow yeah and
- 00:26:22you know what worries me a little bit
- 00:26:24here as well is in fact when you talked
- 00:26:25about that that those levels autonomy on
- 00:26:28the far left it's much more interactive
- 00:26:31right you have developers who are
- 00:26:32looking at the lines of code that
- 00:26:33suggested and looking at the tests that
- 00:26:35get generated so it's much more involved
- 00:26:38for the developer as soon as you go
- 00:26:40further right into that more automated
- 00:26:41World we're we're more in an we're more
- 00:26:43in an environment where um larger
- 00:26:46amounts of content is going to be
- 00:26:49suggested to that developer and if we go
- 00:26:51back to the same old that story where if
- 00:26:54you want 100 comments on your pull request
- 00:26:56in a code review you write two line
- 00:26:58change if you want zero you provide a
- 00:27:00500 line change right and when we
- 00:27:02provide that volume whether it's hey I'm
- 00:27:04going to build you this part of an
- 00:27:06application or a module or a test Suite
- 00:27:09based on some code how much is a
- 00:27:11developer actually going to look in
- 00:27:13detail at every single one of those
- 00:27:14right and I think this kind of comes
- 00:27:16back to your point of what are the most
- 00:27:19important parts that I need to to look
- 00:27:22at but yeah it revolves a little bit
- 00:27:24more around what you were saying earlier
- 00:27:25as well whereby tests becoming more more
- 00:27:28important for this kind of thing and
- 00:27:30exactly as code gets generated
- 00:27:31particularly in volume right what are
- 00:27:34where are the guard rails for this and
- 00:27:36it's all about tests I love the point
- 00:27:37you mentioned right that look as more
- 00:27:39and more code gets written like my the
- 00:27:41cognitive abilities of a developer to
- 00:27:43look at every change everywhere it just
- 00:27:46takes more time takes more effort takes
- 00:27:47more cognitive load right yeah now
- 00:27:49coupling the fact that if you've been
- 00:27:50using this system for a few months then
- 00:27:52there's an inherent trust in the system
- 00:27:55now this is when I get really scared you
- 00:27:57look at
- 00:27:58when I started using Alexa in 2015 right
- 00:28:00it would only get weather right right
- 00:28:02Google Home Alexa it wouldn't do any of the
- 00:28:04other things who can get weather
- 00:28:06right yeah weather prediction is a hard
- 00:28:09enough machine learning problem DeepMind
- 00:28:10Engineers are still working on it and
- 00:28:11still getting it right and doing it in
- 00:28:12London yeah I would pay for a service
- 00:28:15which predicts but the point is we have used
- 00:28:18as a society like these conversational
- 00:28:19agents for a decade now we just asking
- 00:28:21like crappy questions yeah because we
- 00:28:23trust it I asked you a complex question
- 00:28:24you don't have an answer I I forgot
- 00:28:26about you for the next few months but
- 00:28:28but then we start increasing the
- 00:28:29complexity of the questions we ask and
- 00:28:30that's great because now the the Siri
- 00:28:33and the and the Google assistant and
- 00:28:34Alexa was able to tackle these questions
- 00:28:36right and then you start trusting them
- 00:28:38because hey oh I've asked you these
- 00:28:39questions and we've answered them well
- 00:28:40so then I trust you to do these tasks
- 00:28:42well and again right if you look at
- 00:28:45people who use Spotify or Netflix their
- 00:28:47recommendations in the feed they have
- 00:28:49more Trust on your system yeah because
- 00:28:51most of these applications do provide
- 00:28:52you a way out right if you don't trust
- 00:28:53recommendations go to your library go do
- 00:28:55search yeah search search versus
- 00:28:58recommendations is that push versus pull
- 00:28:59Paradigm right recommendations I'm going
- 00:29:02to push content to you if you trust us
- 00:29:04you're going to consume this right if
- 00:29:05you don't trust a system if you don't
- 00:29:06trust your recommendations then you're
- 00:29:08going to pull content which is search
- 00:29:09right now especially as in when you seen
- 00:29:11there's a distribution of people who
- 00:29:12don't search at all right they're like
- 00:29:14they we live in that high trust world
- 00:29:16when they're going to they're going to
- 00:29:17like like a recommendation same right
- 00:29:19Google who goes through the second page
- 00:29:20of Google right who goes through Google
- 00:29:22now sorry uh but essentially the point
- 00:29:25is once people start trusting these
- 00:29:27systems
- 00:29:28the unit test generation is a system which I
- 00:29:31start trusting right and then it starts
- 00:29:33tackling more and more complex problems
- 00:29:34and then is oh I start I stopped looking
- 00:29:36at the corner cases and then that code
- 00:29:38was committed 6 months ago and that unit
- 00:29:40T is there and then code on top of it
- 00:29:42was committed three months ago and then
- 00:29:43there's probably unit test which I
- 00:29:44didn't write the agent wrote now this is
- 00:29:47where like complexity evolves for a
- 00:29:49period of time and maybe there's a
- 00:29:51generation of unit test which have been
- 00:29:53written maybe with less and less of me
- 00:29:55being involved yeah the the levels of
- 00:29:57code AI like that means like your
- 00:29:59involvement is not at the finer levels
- 00:30:01it's like higher up now this assumes it
- 00:30:03works well if the foundations are
- 00:30:04correct and everything is robust and we
- 00:30:06have like good checks in place yeah
- 00:30:08again right whatever could go wrong
- 00:30:10would go wrong yeah so then the point is
- 00:30:12in this complex code where in the series
- 00:30:14generations of unit test generations of
- 00:30:16code Cycles edits have been made by an
- 00:30:19agent then things could go horribly
- 00:30:21wrong so do we have what's the solution
- 00:30:23the solution is pay more respect to
- 00:30:25evaluation right look you got to you got
- 00:30:27to LD is guard is just like a very
- 00:30:30harmless way to for me to say that like
- 00:30:33unitz generation is important not just
- 00:30:34for unit generation today but for unit
- 00:30:37generation and code generation tomorrow
- 00:30:39so I think like the kind of metrics we
- 00:30:40need the kind of evaluation we need the
- 00:30:42kind of robust auditing of these systems
- 00:30:44and auditing of these unit tests so far
- 00:30:46again I don't have a huge 10-year experience
- 00:30:48in the coding industry because I've
- 00:30:50worked on recommendations and user-centric
- 00:30:51systems but for me it was like hey
- 00:30:54what is your test coverage in a
- 00:30:56repository that's the most commonly way
- 00:30:58look way to look at like repository and
- 00:31:00what's what I how advanced are you in
- 00:31:02your testing capabilities and that
- 00:31:03doesn't cut it are you covering the corner
- 00:31:05cases what's your again what's the
- 00:31:06complexity what's the severity so do we
- 00:31:08need automated tests for our tests or do
- 00:31:11we need people humans to we need both
- 00:31:14right we need human domain experts again
- 00:31:16this is and this is not just a coding
- 00:31:17problem look at look millions of dollars
- 00:31:19are spent by Anthropic and OpenAI on
- 00:31:21Scale AI Scale AI has they raised like a
- 00:31:23lot of money recently billion dollar
- 00:31:25valuations because we need domain
- 00:31:27experts to tag yeah this is also
- 00:31:28nightmare at Spotify in search I have a PhD
- 00:31:31in search my search was like I'll show
- 00:31:32users some results and they're going to
- 00:31:34tag this is correct or not I can't do
- 00:31:36this in coding MH because crowdsourcing
- 00:31:38has been a great assistance to machine
- 00:31:40learning systems for 20 years now
- 00:31:42because I can get that feedback from the
- 00:31:44user now to get feedback on a complex
- 00:31:47rust code where am I going to find those
- 00:31:49crowdsource workers on scale or Amazon
- 00:31:51Mt right they don't exist you're not
- 00:31:53going to pay them $20 an hour to write
- 00:31:55give feedback these are like thousand
- 00:31:57dollar an hour developers right yeah we
- 00:32:00don't even have a community right now
- 00:32:02around around crowdsource workers for
- 00:32:04code because this is domain specific
- 00:32:06 So my point is, again, this is not all doom-worthy. This is an important problem we have to get right, and it's going to be a long journey: we're going to do evaluations, and we're going to do generations of evaluations. I think the right way to think about it is to pay attention to unit tests, but also to the evaluation of unit tests and, taking a step back, to multiple levels of evaluation. You're going to evaluate: are we able to identify the important functions, the criticality of the code? And then look at unit test generation through that lens.
- 00:32:38 Now, one of the solutions, which I think we talked about briefly when we met over coffee, is this: let's say I generate 20 unit tests, and I want my principal or some respected, trusted engineer to vet at least some of them. Their day job is not just to review unit tests; their day job is to maintain the system and advance it. So they're going to have a limited budget to look at the unit tests I've generated. The question becomes: if this week I was able to generate 120 unit tests, and you are a principal engineer on my team, you're not going to look at all 120 tests and pass them. Maybe you have two hours to spare this week, so you'll look at maybe five of them. Now this becomes an interesting machine learning problem for me: of the 120 unit tests I've created, what is the subset of five I most need your input on?
- 00:33:28 One way to tackle this is to reduce the uncertainty; we've built uncertainty models in machine learning for 20 years in the industry. How certain am I that a given test is correct? If I'm certain, then sure, I won't show it to you. Maybe I'll show one anyway, just to check whether you confirm or reject something I thought I was certain about, and then I'll learn from that. But mostly I'm going to follow an information-maximization principle: what do I not know, and can I show that to you? What that means is that this is a budget-constrained subset selection problem: I've generated 120 unit tests, you can only look at five of them, and I have to pick those five. We can do it two ways: I can pick the five and show them to you all at once, or I can pick one, get your feedback, see what I additionally learn from it, then look at the remaining 119 again and ask: knowing what I know now, what is the next one?
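To make that concrete, here is a minimal sketch of the one-shot variant, purely an illustration rather than anything from Cody: rank the generated tests by the model's uncertainty about their correctness and surface only as many as the reviewer's budget allows. The test names and `p_correct` scores are hypothetical.

```python
import math

def entropy(p: float) -> float:
    """Binary entropy of the model's 'this test is correct' probability;
    highest when the model is most unsure (p near 0.5)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def pick_tests_for_review(candidates, budget=5):
    """Budget-constrained selection: spend the reviewer's limited time on
    the generated tests the model knows least about.
    `candidates` is a list of (test_id, p_correct) pairs."""
    ranked = sorted(candidates, key=lambda c: entropy(c[1]), reverse=True)
    return [test_id for test_id, _ in ranked[:budget]]

# 120 generated tests with made-up confidence scores; the reviewer only
# has time for five of them this week.
generated = [(f"test_{i}", (i % 10) / 10 + 0.05) for i in range(120)]
print(pick_tests_for_review(generated, budget=5))
```

The sequential variant described above would wrap this in a loop: surface one test, fold the reviewer's verdict back into the scores, and re-rank the remaining 119 before choosing the next.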
- 00:34:18 And previously you mentioned the most important parts of the code to test, because there could be critical paths, and there could be other areas where, if there's a bug, it's far less of a problem. Who provides that context? Is that something the LLM can decide, or is that something, like you said, the principal engineers should spend a little of their time on, saying: look, these are the main core areas that need to be bulletproof, and for the rest I would rather spend my time on these than on those?
- 00:34:43 I'm probably not the best person to answer who currently provides it. As a machine learning engineer, I see that there are signals. If I look at your system, there's: where were the bugs raised, and what was the severity? For each bug, for each incident, there was a severity report. What code was missed, what code wasn't, and what unit tests have people created so far? So I view it from a data-observability perspective: knowing what I know about your code, your severities, your issues, and your time to resolve some of these, I can develop a model personalized to your code base on what the core important pieces are. I can also look at just the code: which functions call which, where the dependencies are. We have amazing engineers who are compiler experts in the company; one of the reasons I joined was to bring in the ML knowledge while working with domain experts, and Sourcegraph has attracted amazing talent over the last few years, and it compounds. There are these compiler experts internally, Olaf and a few others among them, and they do really precise intelligence on the code, finding the dependency structure and all of that. That gives me content understanding of your code base. Then, if a function is deemed important from those graph links, essentially how many in-edges and out-edges the function has, and a lot of callers depend on it, that means there are a lot of downstream dependencies. So there is this way of looking at it.
- 00:36:03 Now, that is just pure code, but I also have observational data. Observational data means I know what the severities were, where the SEV-0s and SEV-1s were caused, where the really critical errors have happened over the last few months, where the probability of those happening sits right now, plus where you are writing unit tests right now. That gives me an additional layer of information. I have already parsed your code base and understood what's going on; I have that view. But now I also look at the real-world view of the data coming in: oh, you know what, there was a SEV-0 issue caused by this piece of code over here. Now the question is: can I go back one day before? If I had to predict that one error is going to pop up tomorrow, which part of the code base will it pop up in? That's a prediction I can make one day in advance, and I can start training these models.
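As a toy sketch of how those two signal families might be blended, the snippet below combines static call-graph fan-in with incident history; the weights, field names, and severity labels are illustrative assumptions, not Sourcegraph's actual model.

```python
from collections import defaultdict

def criticality_scores(call_graph, incidents, tested,
                       w_graph=0.5, w_sev=0.4, w_gap=0.1):
    """Score each function by blending: how many callers depend on it
    (content understanding), how often it was implicated in severe
    incidents (observational data), and whether it lacks tests today."""
    fan_in = defaultdict(int)
    for caller, callees in call_graph.items():
        for callee in callees:
            fan_in[callee] += 1
    max_fan = max(fan_in.values(), default=1)

    sev_weight = {"SEV0": 1.0, "SEV1": 0.6, "SEV2": 0.3}
    blame = defaultdict(float)
    for inc in incidents:  # e.g. {"function": "auth.check", "severity": "SEV0"}
        blame[inc["function"]] += sev_weight.get(inc["severity"], 0.1)
    max_blame = max(blame.values(), default=1.0)

    funcs = set(call_graph) | set(fan_in) | set(blame)
    return {
        f: w_graph * fan_in[f] / max_fan
           + w_sev * blame[f] / max_blame
           + w_gap * (0.0 if f in tested else 1.0)
        for f in funcs
    }

graph = {"api.pay": ["auth.check", "db.write"], "api.refund": ["auth.check"]}
incidents = [{"function": "auth.check", "severity": "SEV0"}]
scores = criticality_scores(graph, incidents, tested={"db.write"})
print(sorted(scores.items(), key=lambda kv: -kv[1]))  # auth.check ranks first
```

The same scores could seed the prediction he describes: train on yesterday's snapshot of the code, label with today's incidents, and ask where tomorrow's error is most likely to land.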
- 00:36:53 Now, this is again what we talked about earlier, at the start of the podcast: each of these features is a different ML model. We just talked about two ML models. One: if I have 120 unit tests, what is the subset of five I show you? That could be an LLM, or it could be subset selection; subset selection has known solutions with theoretical guarantees on performance, submodular subset selection and so on, and I've implemented them at 200-million-monthly-active-user scale at Spotify. So we can tackle this problem, but it's a new problem; it's not an LLM solution. The second one is: can I predict where the critical bug is going to be, then use that to identify critical components, and then use that to chain things together and put a unit test in there?
- 00:37:31 It's essentially human context, right? We talk about context a lot when we're talking about actual source files and how we can do code completion against so many parts of our project, but this is thinking about that almost behavioral context.
- 00:37:46 Exactly, and I'll literally be a broken record on this: we have done this before. Look, when you upload a short video, that video has zero views right now, so I look at: is this high quality, who's the author, what's the content composition. Why? Because zero people have interacted with it so far. Give it an hour and 10 million people will have interacted with that short video, and now I know which people will like it and which won't. So if you look at the recommendation life cycle of any podcast, including this one we're going to upload: there is something about the content of this podcast, and there is something about the observational, behavioral data of this podcast, that is, which developers, which users liked it or didn't, skipped it, streamed it, and all that. So we have designed recommendation systems as a combination of content and behavior. Same here: when I say unit test generation, I can look at your code base and make some inferences. On top of that, I have an additional view on this data: what the errors are, where you're writing unit tests, where you're devoting time. That gives me additional observational data on top of my content understanding of a code base. Combine the two and better things will emerge, essentially.
- 00:38:52 And we've talked about unit tests quite a bit. In terms of testing in general, there are obviously different layers to it, and I think the intent changes heavily as you move up. The higher you go, when we're talking about integration tests and things like that, the intent (to go back to context) is really about the use cases: a lot about how a user will actually use the application. And when we think about the areas of the codebase which are extremely important, those higher-level integration tests, and the flows they exercise through the application, will show which areas of code are most important as well. For our developer audience, the people we talk to: when we want to provide advice or best practices on how a developer should think about bringing AI into their general testing strategy, what's a good start today for introducing these kinds of technologies into people's processes and existing workflows successfully?
- 00:39:52 I think the simplest answer is: start using Cody. No, but really, even before I joined Sourcegraph, Cody helped me. I basically interviewed with Cody at Sourcegraph: we do an interview of, here's an open-source code base, look at it, try to make some changes. And I loved that. Psychologically I was bought in even before I had an offer, because you're making me do cognitive work on your code repository as part of the interview: just spend one hour, instead of only chatting, looking at the code and making some changes. So my point is: yes, use Cody. But the more interesting point here is, if you're trying to adopt Cody or any of the other tools for test generation, what are you going to do? You're going to try the off-the-shelf feature, hey, generate unit tests, and see where it works and where it doesn't. Now, Cody provides something called custom commands. Edit code, unit test generation: these are all commands.
- 00:40:39 What is a command? What is an LLM feature? Let's take a step back. An LLM feature is: I want to do this task, and I need some context. So I'm going to generate an English prompt, and I'm going to bring in a context strategy: what are the relevant pieces of information I should use? For example: here are the unit tests in the same folder, or here's a dependency you should be aware of. Bring in that context, write an English prompt, and send it to the LLM. That's a very simplified but useful way of looking at what an LLM feature is. So Cody provides the option of custom commands, which means I can say: hey, this doesn't work so well for me because of these nuances, so let me create a custom command. Say you're a staff engineer at this company: you can create a custom command, and now it's better. And in an enterprise setting you can share this custom command with all your colleagues: hey, you know what, this is a better way of doing unit test generation, because I've created this custom command and everybody can benefit.
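Here is a minimal sketch of that context-strategy-plus-prompt shape, as an illustration of the pattern rather than Cody's actual custom-command implementation; the helper names and the `call_llm` callable are hypothetical.

```python
from pathlib import Path

def sibling_tests(source_file: Path) -> list[str]:
    """Context strategy: existing tests in the same folder show the
    conventions the generated tests should follow."""
    return [p.read_text() for p in source_file.parent.glob("test_*.py")]

def unit_test_command(source_file: Path, call_llm) -> str:
    """An 'LLM feature' in miniature: gather context, wrap it in an
    English prompt, and send the whole thing to the model."""
    context = "\n\n".join(sibling_tests(source_file))
    prompt = (
        "Write unit tests for the code below, matching the style of the "
        "existing tests provided as context.\n\n"
        f"Existing tests:\n{context}\n\n"
        f"Code under test:\n{source_file.read_text()}\n"
    )
    return call_llm(prompt)
```

In this framing, a custom command is just a team-specific edit to the prompt text or a different context strategy, shared so everyone gets the improved version.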
- 00:41:30 Now, what makes you write a better custom command? Or even if you forget about custom commands and Cody, what makes you get a better output? This is where the 0-to-1 evaluation comes in: where are you currently failing? What sort of unit tests are we getting right, and what sort are we not? What about this is interesting, and what about your code base is interesting? The question then becomes: can I provide that as context? Can I track where it's failing and where it's not? And then there are a few interventions you can make: you can change the prompt with a new custom command, or you can create a new context source.
- 00:42:04 That's a great segue for me to mention one thing, which is open context. I think Quinn literally started that work as an individual contributor; one of the other impressive things about Sourcegraph is that if you look at the GitHub commit histories of the founders, you wonder whether they are running a company or racking up commits. It just blew me away when I first came across it. Essentially, Quinn introduced, and then a lot of the teams worked on, something called open context, OpenCtx. In an enterprise setting you have so much context that nobody can get it all right out of the box, and plugging in thousands of different heterogeneous context sources is what's going to help you get a better answer. So OpenCtx is a protocol: you can add a new context source for yourself, and because it's a protocol, Cody, and a lot of the other agents and tools around, can use it. What that means is: if you're writing unit tests and you know where it's not working, you'll make a change in the prompt, add a custom command, add some other examples, and then you think, hey, maybe I should add a context source, because I have this information, like we talked about: where the errors are coming from. That SEV-0 data is probably not something you've given Cody access to right now, but with OpenCtx you can add it as a context source and make your solutions better.
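As an illustration of the idea, deliberately not the real OpenCtx provider interface (which is its own protocol), here is a hypothetical context source that surfaces recent incident data for the file being worked on, so a test-generation prompt can be steered toward code with a history of severe failures; the file format and class shape are assumptions.

```python
import json
from pathlib import Path

class IncidentContextSource:
    """Hypothetical context source: maps source files to recent
    SEV-0/SEV-1 incidents so prompt assembly can include that
    observational signal alongside the code itself."""

    def __init__(self, incident_log: Path):
        # Assumed format: a JSON list of records like
        # {"file": "...", "severity": "SEV0", "summary": "...", "days_to_resolve": 3}
        self.incidents = json.loads(incident_log.read_text())

    def items(self, file_path: str) -> list[str]:
        return [
            f"[{i['severity']}] {i['summary']} (resolved in {i['days_to_resolve']}d)"
            for i in self.incidents
            if i["file"] == file_path
        ]

# source = IncidentContextSource(Path("incidents.json"))
# extra_context = source.items("services/payments/gateway.py")
```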
- 00:43:15 What are you doing here? You're doing applied machine learning 101 for your own feature. So again, this is exactly where you need a 0-to-1 evaluation: five examples where it doesn't work right now, so that when you make the prompt change or the context change, you can see it start working. You've done this mini 0-to-1 for your own goal.
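A minimal sketch of that 0-to-1 loop, with invented cases and a hypothetical `generate` callable, might look like this: a handful of concrete failing examples, re-run after every prompt or context tweak.

```python
def run_eval(generate, cases):
    """Re-run a small, fixed set of known-hard cases after each change.
    Each case pairs an input with a cheap check on the generated output."""
    results = {name: check(generate(source))
               for name, (source, check) in cases.items()}
    print(f"{sum(results.values())}/{len(results)} passing: {results}")
    return results

cases = {
    # Does the generated suite exercise the empty-string corner case?
    "empty_input": ("def parse(s): ...",
                    lambda out: "parse('')" in out or 'parse("")' in out),
    # Does it assert behavior rather than just calling the function?
    "has_asserts": ("def parse(s): ...", lambda out: "assert" in out),
}
# run_eval(my_unit_test_command, cases)  # repeat after every prompt change
```

Five such cases are enough to tell whether a prompt tweak or a new context source actually moved anything, which is the whole point of the 0-to-1.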
- 00:43:36 And I think there's a meta point here, which is: forget about coding assistants; we're all transitioning to that abstraction of working with an ML system, an AI system. I hate to use the phrase AI; coming from the machine learning world I'd rather say machine learning, but the audience probably buys 'AI' more, or some of the audience does. Essentially, a lot of us are starting to work with these systems, and I think the way we interact with them is going to be a bit more orchestrated: try this; figure out where it's working, great, I'm going to use it; figure out where it's not working, okay, cool, then I'm going to either give that feedback or adjust my workflow a bit and make it work better on those.
- 00:44:15 So I think we're all starting to be that applied scientist, in some way or another. And this is not just you as an engineer: if you're a domain expert, if you're a risk analyst who wants to create these plots, or if you're in sales using a sales copilot, you are working with an agentic ML setup, and you want to see where it's working, where it's not, and what changes you want to make. You have done this before: when you add a plus sign or double quotes in Google, you get those exact words. What is that? You're adapting: you know what's going to work, and you start adding those tricks. And because more and more of your daily workflow is going to revolve around these agents and systems, you start developing these feedback loops yourself. So what we are trying to do as ML engineers in the product is philosophically similar to what you are trying to do to use these products. A lot of my friends and other people ask me: hey, what's happening, I'm a domain expert. I was on another panel a few days ago, and the question there was about jobs and all of that. Not to drag the conversation into that, but essentially, if we start acting as orchestrators of these systems, we start developing intuitions on where they work and where they don't, and then we start putting in these guardrails, and those guardrails are going to help in unit test generation too.
- 00:45:27 And I think that's important, because our audience are all going to be somewhere on that journey from fully interactive to fully automated. People may want to settle somewhere along it, but they will progress from one end to the other. As we get into that more and more automated space, I remember you saying earlier, when we talked about the budget of a human: they have a limited amount of time, and if you have 120 tests, you want to hand them five. How do you place the importance of a developer's time in the future, when things get to that more automated state? How would you weigh a developer focusing on code versus focusing on testing versus focusing on something else? Where is it most valuable to have that developer's eyes?
- 00:46:16 Let's take a step back. You're a developer, I'm a developer, and there's a reason I have a job: there's a task to complete. The reason I'm writing this unit test is not that I get paid money just to write a better unit test. As a developer (not me specifically in my job, but any developer), I get paid if I can complete that task, and if I can at least spend some time making sure that in future my load for doing that task is easier and the system is helping me downstream. With that high-level view, what's happening is that rather than focusing on unit test generation or code completion as a silo, caring about it because I care about the silo, I care about it because I care about the task being completed. And if I can do this task 10x quicker, what's my path from spending five hours on it today to spending 20 minutes? This is where I said we're all going to be orchestrators. Look at a music orchestra: you have the symphony, there's an orchestra, and you're hand-waving your way into amazing music; art gets created. That's the goal: Cody wants to make sure we allow users, developers, to start creating art and not just toil. Now, I can say this in English, but I think a good developer would embody the spirit of it, which is: the sooner I can get to that orchestrator role in my mindset, the more I start using these tools, rather than being scared that it's writing my code.
- 00:47:42 And you mentioned there's going to be a spectrum, and you might want to sit somewhere on it, but technological evolution will march on and we're going to be pushed along. And it could be not just technology: individuals want to be individuals, some love writing code, and that will need to change depending on what technology offers us. But if we push further on that, what's the highest risk to what we're delivering not being the right solution? Is it testing now? Is it guardrails that become the most important thing, almost, I would say, more important than code? Or is code still the thing that we need to care about the most?
- 00:48:27 If I take this to the extreme, with my evaluation hat on: I want to be one of the most prominent, vocal proponents of evaluation in the industry, not just at Sourcegraph; in the machine learning industry, we should do more evaluation. So there I would say that writing a good evaluation is more important than writing a good model, and writing a good evaluation is more important than writing a better context source, because you don't know what a better context source is if you don't have a way to evaluate it. For me, evaluation precedes any feature development. If you don't have a way to evaluate, you're just throwing darts in a dark room; some will land by luck. So in that world, I have to rank unit tests and evaluation ahead of just code in terms of importance.
- 00:49:11 That said, I think what's more important overall is task success. What is task success? You're not just looking at the unit test as an evaluation; you're looking at evaluation of the overall goal: hey, did I do this task well? As an orchestrator, if I start treating these agents that way (it could be Cody autocomplete, or any standalone agent, probably powered by Sourcegraph as well), then evaluation of that task matters, because you are the domain expert. Assume AGI exists today; assume the foundation models keep getting smarter, with billions, trillions of dollars eventually going into training the smartest models, and they can do everything. You are still best placed to understand your domain and what the goal is right now. So you are the only person who can develop that evaluation: how do I know that you're correct? How do I know whether you're 90% correct or 92% correct? And the marginal gain from 92 to 94 is going to be much harder to earn than getting to 90; it always gets harder; there's an exponential increase in hardness there.
- 00:50:11 So essentially the point becomes, purely on evaluation, purely on unit tests: what are the nuances of this problem, of this domain, that the model needs to get right? Are we able to articulate those, and are we able to generate the unit tests, the guardrails, and the evaluations so that I can judge whether the models are getting better on that topic? The models are going to be far more intelligent, but then what is success? You as the domain expert get to define that. And this is a great thing, not just for coding: any domain expert using machine learning or these tools across domains knows what they're using them for, and the AGI tools are just tools to help do that job. So I think the onus is on you to write good evaluations. Or maybe tomorrow it's LLM-as-a-judge: people are developing foundation models just for evaluation, so there are going to be other tools to help you there as well. Code foundation models for unit tests: maybe that's the thing six months from now.
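The LLM-as-a-judge idea can be sketched in a few lines, again as a hypothetical illustration with `call_llm` standing in for whatever model client is available: a second model call grades a generated test against the domain nuances the expert wrote down.

```python
JUDGE_PROMPT = """You are reviewing one generated unit test.
Criteria, written by the domain expert:
1. Does it cover these corner cases: {corner_cases}?
2. Does it make real assertions about behavior, not just execute code?

Test:
{test_code}

Reply with PASS or FAIL on the first line, then one sentence of justification."""

def judge_test(test_code: str, corner_cases: list[str], call_llm) -> bool:
    """Grade a generated test against expert-defined criteria via a second
    model call; the expert owns the criteria, the model does the toil."""
    verdict = call_llm(JUDGE_PROMPT.format(
        corner_cases=", ".join(corner_cases), test_code=test_code))
    return verdict.strip().upper().startswith("PASS")
```

The division of labor matches the point above: the judge automates the grading, but defining what correct means for the payment gateway or the authentication path stays with the human.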
- 00:51:06 The point then becomes: what should it focus on? That's the role you're playing: orchestrating, but orchestrating on the evaluation. Did you get that corner case right? Or: you know what, this is about the criticality of the system; if the payment gateway link or the authentication link gets screwed up, massive bad things happen, and you know that. So I think that's where the human in the loop and your input to the system start getting amazing.
- 00:51:29 Rashab, we could talk for hours on this. This has been really interesting, and I loved the deep dive a little below the surface into the ML space as well; I'm sure a lot of our audience will find it very interesting. Thank you so much, really appreciate you coming on the podcast. Thanks so much, this was a fun conversation; yeah, it could go on for hours. Hopefully the insights help. Thank you.
- 00:51:48[Applause]
- 00:51:58 Thanks for tuning in. Join us next time on the AI Native Dev, brought to you by Tessl.
- 00:52:05[Music]
- AI Testing
- Software Development
- Machine Learning
- Code Evaluation
- Large Codebases
- Sourcegraph Cody
- Developer Productivity
- AI Context
- Unit Testing
- Evaluation Metrics