Navigating AI for Testing: Insights on Context and Evaluation with Sourcegraph
Summary
TL;DR: The episode delves into AI testing and its integration within software development, featuring Rashab Merra from Sourcegraph, the company behind the coding assistant Cody. The discussion highlights the challenges of managing large code repositories and how AI can boost developer efficiency. By leveraging machine learning, Sourcegraph aims to refine code suggestions and testing, tailoring solutions to developer workflows. The conversation also covers the criticality of evaluation in verifying whether model improvements translate to real-world gains, noting that offline metrics like pass@1 do not always reflect true user experience. Rashab expands on the need for context in AI-driven testing, advocating for models that accurately account for the intricacies of the expansive codebases common in large enterprises. He also underscores the role of AI in writing better, more efficient unit tests that act as a safeguard against poor code entering the system. The episode further explores how AI tools balance speed and quality, particularly in latency-sensitive settings where developers require timely feedback. A recurring theme is the evolving, symbiotic relationship between human programmers and AI systems, suggesting a future where developers focus more on creative and complex problem-solving.
Key Takeaways
- 🤖 AI testing is crucial for efficient software development.
- 🚀 Sourcegraph's Cody aids developers with code completion and testing.
- 🧩 Context is vital for accurate AI-driven code evaluations.
- 🕒 Latency impacts developer satisfaction and tool efficacy.
- 🔄 Evaluation metrics must align with real-world applications.
- 💻 Open Context broadens the scope of code insights.
- 👨‍💻 Developers shift towards more creative roles with AI assistance.
- ⚙️ AI-generated unit tests act as code quality guardrails.
- 📊 Machine learning models must adapt to specific developer needs.
- 🔍 Continuous improvement in AI tools enhances productivity.
Timeline
- 00:00:00 - 00:05:00
The AI Native Dev episode focuses on AI testing, discussing the context models need, the evaluation of generated code, and the timing and automation of AI tests. Rashab Merra from Sourcegraph is introduced, explaining how Sourcegraph tackles the "big code" problem with tools like Cody that improve developer productivity through features like autocomplete and code suggestions.
- 00:05:00 - 00:10:00
Rashab explains his experience in AI since 2009, witnessing the evolution from traditional NLP to large language models (LLMs). He describes how AI, like Cody, abstracts complexity to enhance developer productivity, comparing this progression to his own experience transitioning from coding in C to using advanced frameworks.
- 00:10:00 - 00:15:00
The discussion shifts to Spotify and Netflix's recommendation systems, using various machine learning models to enhance user experience. Rashab parallels this to Cody's multifaceted approach, including features beyond code suggestions like chat, code edits, and unit test generation, each requiring different evaluations, models, and latencies.
- 00:15:00 - 00:20:00
Different features in coding assistants demand varying latency and quality levels. For instance, autocomplete needs low latency, whereas code edit and chat can tolerate more delay. Rashab explains the trade-offs between latency and model size, highlighting the benefits of fine-tuning models for specific tasks like Rust language auto-completion.
- 00:20:00 - 00:25:00
The conversation delves into efficient code and unit test generation balancing user trust in automated systems and developers' cognitive load reductions. As complexity in code suggestions increases, so does the trust issue, stressing the importance of effective evaluation systems and guardrails to prevent introducing errors through automation.
- 00:25:00 - 00:30:00
Evaluation is emphasized as crucial for development and successful adoption of AI-driven tools. Rashab highlights the importance of developing accurate evaluation metrics that mirror real-world usage over standard benchmarks. This ensures improvements in AI features truly enhance user experience.
- 00:30:00 - 00:35:00
Rashab describes the challenges of heterogeneity in coding tasks across different industries, identifying opportunities where pre-trained models excel and where fine-tuning can provide significant benefits by focusing on underserved languages or complex tasks often found in enterprise environments.
- 00:35:00 - 00:40:00
Discussion highlights the adversarial nature of good unit tests as guardrails against bad code. Effective testing prevents the introduction of errors, especially in automated settings where AI-generated code volume increases. The conversation underscores the need for unit testing to evolve alongside increased AI integration.
- 00:40:00 - 00:45:00
Different testing levels (e.g., unit, integration) and their role in ensuring code quality are explored. Rashab stresses the need for automated systems that continuously improve through feedback loops, suggesting an integrated process where human oversight complements machine-driven testing.
- 00:45:00 - 00:52:09
Rashab emphasizes the orchestration of AI tools where developers act as conductors, choosing where to spend effort between optimization of code, testing, and other tasks. He stresses the importance of human understanding and evaluation in deploying AI solutions effectively while embracing automation to reduce toil and improve productivity.
FAQ
Who is the guest on the episode?
Rashab Merra from Sourcegraph is the guest on the episode.
What major concept is discussed in the episode?
The episode discusses AI testing in software development.
How does Sourcegraph help developers?
Sourcegraph offers tools like Cody, a coding assistant that helps improve developer productivity by integrating into IDEs.
What is the role of machine learning in Sourcegraph's tools?
Machine learning in Sourcegraph's tools is used to enhance features such as code completion and testing, assisting developers by understanding and adapting to their workflows.
Why is evaluation critical in machine learning?
Evaluation is critical because it helps determine if improvements in machine learning models genuinely enhance user experience, especially when offline metrics don't always correlate with real-world usage.
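(For reference: pass@1 comes from the HumanEval benchmark. A minimal sketch of the standard unbiased pass@k estimator, with invented sample counts for illustration:)

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: chance that at least one of k samples, drawn
    from n generations of which c pass the unit tests, is correct."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Average per-problem estimates over a HumanEval-style benchmark.
# Each tuple is (generations sampled, generations that passed) -- invented.
results = [(10, 4), (10, 0), (10, 10), (10, 1)]
pass_at_1 = sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
print(f"pass@1 = {pass_at_1:.2f}")  # (0.4 + 0.0 + 1.0 + 0.1) / 4 -> 0.38
```

As the episode stresses, improving this offline number does not guarantee better online metrics for real users.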
What is the significance of latency in AI tools?
Latency is significant as it affects user satisfaction; developers expect fast responses for tasks like code completion, while they're more tolerant of longer wait times for complex tasks like code fixes.
How does Sourcegraph's Cody vary in its features?
Cody offers a range of features from code auto-completion to unit test generation, each with different latency and quality requirements.
What does Rashab consider a 'nightmare' issue?
A 'nightmare' issue is getting accurate unit test generation and evaluation in real-world, large-scale codebases.
What is 'open context' in Sourcegraph?
'Open Context' is a protocol designed to integrate additional context sources to provide better recommendations and code insights.
Why are unit tests important for AI-generated code?
Unit tests serve as guardrails to prevent bad code from entering the codebase, especially as more AI-generated code is used.
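As a toy illustration of that guardrail role (the function, amounts, and rounding rule below are hypothetical, not from the episode): a test pinned to a critical billing behavior blocks an AI-suggested edit that silently changes it.

```python
# Hypothetical guardrail: a critical rounding rule pinned down by a unit test.
from decimal import Decimal, ROUND_HALF_UP

def charge_amount(subtotal: Decimal, tax_rate: Decimal) -> Decimal:
    """Total charge rounded to cents, half-up, as billing requires."""
    total = subtotal * (Decimal(1) + tax_rate)
    return total.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

def test_charge_rounds_half_up_not_bankers():
    # 2.00 * 1.1225 = 2.245 -> 2.25 under half-up (banker's rounding
    # would give 2.24), so a generated "cleanup" that swaps the
    # rounding mode fails CI here instead of reaching production.
    assert charge_amount(Decimal("2.00"), Decimal("0.1225")) == Decimal("2.25")
```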
- 00:00:00[Music]
- 00:00:01you're listening to the AI native Dev
- 00:00:03brought to you by
- 00:00:14Tessl on today's episode we're going to
- 00:00:16be talking about all things AI testing
- 00:00:19so we're going to be dipping into things
- 00:00:22like uh the context that models are
- 00:00:24going to need to be able to create good
- 00:00:26tests uh things like evaluation what
- 00:00:28does it mean for generated code mod uh
- 00:00:30and then look into deeper into AI tests
- 00:00:33about when they should be done should
- 00:00:34they be done early should be should they
- 00:00:36be done late and more automated joining
- 00:00:38me today Rashab Merra from Sourcegraph
- 00:00:42those of you may know Sourcegraph
- 00:00:43better as the tool Cody welcome to the
- 00:00:45session tell us a little bit about
- 00:00:46yourself first of all thanks am this is
- 00:00:48really interesting topic for me and so I
- 00:00:50work with Sourcegraph we do we've done
- 00:00:52like code search for the last
- 00:00:54quite a few years and uh I mean we
- 00:00:56tackled the big code problem like if you
- 00:00:58look at Enterprises Enterprises have a
- 00:01:00lot of developers which is great for
- 00:01:01them they also have like massive code
- 00:01:03bases look at a bank in the US I mean
- 00:01:05they would have like 20 30,000
- 00:01:06developers they would have like more
- 00:01:07than that like 40,000 repositories right
- 00:01:10so this big code problem is a huge pain
- 00:01:12right so what what we've done as a
- 00:01:14company is like we've spent quite a few
- 00:01:15years in tackling code search and then
- 00:01:17last year we launched Cody which is a
- 00:01:20coding assistant and essentially Cody
- 00:01:21lives in your IDE it tries to make
- 00:01:23you a more productive developer
- 00:01:25and there's a bunch of different
- 00:01:26features and my role right now at Sourcegraph
- 00:01:29is to lead the AI effort so I've
- 00:01:31done a lot of like machine learning on
- 00:01:32the consumer side like looking at
- 00:01:33Spotify recommendations music short
- 00:01:35video recommendations my PhD was on
- 00:01:37search which again ties well because in
- 00:01:39llms you need context and we're going to
- 00:01:41talk about yeah a lot of that as well
- 00:01:43how many years would you say you've been
- 00:01:44in
- 00:01:45AI yeah so I think the first research
- 00:01:47paper I wrote was in 2009 and almost 15
- 00:01:50years ago and this was like NLP again
- 00:01:52this is not like deep learning NLP this
- 00:01:54is more like traditional NLP and I I've
- 00:01:57been in the world wherein like the
- 00:01:59domain experts handcraft these features
- 00:02:01and then embeddings came in and like
- 00:02:02they washed all of it away and then
- 00:02:04these like neural models came in they
- 00:02:05washed like custom topic models away and
- 00:02:07then these llms came in and they washed
- 00:02:09a bunch of these rankers away so I think
- 00:02:11like I've seen some of these like waves
- 00:02:12of if you create like specific models
- 00:02:14which are very handcrafted maybe a
- 00:02:16simpler large scale like generalizable
- 00:02:19model will sweep some of these models
- 00:02:21I've seen a few of these waves in the last decade
- 00:02:23and way before it was cool yeah exactly
- 00:02:24I think like I keep saying to a lot of
- 00:02:25like early PhD students like in 2010s a
- 00:02:28lot of the PhD students would get their
- 00:02:29doctorate by just creating latent
- 00:02:31variable models and doing some Gibbs
- 00:02:33sampling inference nobody even knows
- 00:02:35that 15 years ago this is what would get
- 00:02:37you a PhD in machine learning right so again
- 00:02:39things have really moved on you can say
- 00:02:40to that you can say you weren't there
- 00:02:41you weren't there in the early days
- 00:02:43there when you wrote Gibbs sampler code in
- 00:02:45C we didn't even have PyTorch like a bunch
- 00:02:48of these I I think like this is I think
- 00:02:50like we've seen this is Codi and a lot
- 00:02:52of these other gen coding tools they're
- 00:02:54trying to make us as developers work at
- 00:02:56higher and higher abstractions right I
- 00:02:58started my research or ml career not
- 00:03:01writing like PyTorch code right I wrote again
- 00:03:04like a Gibbs sampler in C right and then I'm
- 00:03:06not touching C until it's like really
- 00:03:08needed now so a lot of these like
- 00:03:10Frameworks have come in which have
- 00:03:11abstracted the complexities away and
- 00:03:13made my life easier and again we're
- 00:03:16seeing again I've seen it as an IC over
- 00:03:18the last 10 15 years but also that's
- 00:03:20also happening in the industry right you
- 00:03:22don't have to be bothered by lowlevel
- 00:03:23Primitives and you can maybe tackle
- 00:03:24higher level things and that's what
- 00:03:26exactly Cody is trying to do that how do
- 00:03:28I remove toil from your developers life
- 00:03:31and then make them focus on the
- 00:03:32interesting pieces and the creative
- 00:03:33pieces and like really get the
- 00:03:34architecture right yeah so let's let's
- 00:03:36talk a little bit about some of the
- 00:03:38features that you mentioned because
- 00:03:39Cody's obviously not it's when we think
- 00:03:41about a tool like Cody it's not just
- 00:03:45code suggestion there's a ton of
- 00:03:47different things and you mentioned some
- 00:03:48of them we'll be talking a little bit
- 00:03:49more probably about testing today but
- 00:03:51when we think about ML and the use of ml
- 00:03:54here how does it differ from the various
- 00:03:57features that a tool like Cody offers
- 00:04:03yeah that's an excellent point let
- 00:04:05me take a step back right let's not even
- 00:04:06talk about coding assist let's go back
- 00:04:08to recommendations and let's look at
- 00:04:10Spotify or Netflix right people think
- 00:04:12that hey if you're doing Spotify is
- 00:04:13famous for oh it knows what I want it it
- 00:04:16suggests nostalgic music for me so people
- 00:04:18in the world love the recommendations
- 00:04:20from Spotify I'm not saying this just
- 00:04:22because I was an employee at Spotify but
- 00:04:23also as a user right I like Spotify
- 00:04:25recommendations I have to say Netflix is
- 00:04:27there for me like I've never binge
- 00:04:29listened to anything on
- 00:04:31Spotify but Netflix has kept me up to
- 00:04:33the early hours exact questioning my
- 00:04:35life yeah I think there a series right
- 00:04:37like Netflix for videos you look at
- 00:04:39Spotify for music you look at Tik Tok
- 00:04:41Instagram me for short videos so again
- 00:04:43the mediums are like it used to be hours
- 00:04:45of content to like minutes of music to
- 00:04:47like seconds of short video but then the
- 00:04:49point is we love these products but this
- 00:04:52is not just one ranker Netflix is not
- 00:04:54just one ranker right it's a bunch of
- 00:04:55different features a bunch of surfaces
- 00:04:58and user touch points and each touch
- 00:05:00point is powered by a different machine
- 00:05:01learning model different thinking
- 00:05:02different evaluation right so my point is
- 00:05:04we we this is how the industry has
- 00:05:06evolved over the last 10 15 years right
- 00:05:08most of these user Centric applications
- 00:05:09which people love and like hundreds and
- 00:05:11whatever like 400 million users are
- 00:05:12using it monthly they are a mix of
- 00:05:15different features and each feature
- 00:05:17needs a design of what's the ml model
- 00:05:19what's the right tradeoffs what are the
- 00:05:20right metrics what's the right science
- 00:05:22behind it what's the right evaluation
- 00:05:23behind it so we have seen this and now
- 00:05:25I'm seeing exactly the same at Cody
- 00:05:27right if you look at Cody so Cody lives
- 00:05:29in your IDE so if you're a developer if
- 00:05:30you're using VS Code or JetBrains again
- 00:05:32if you're writing code then we can code
- 00:05:34complete right so autocomplete is an
- 00:05:36important feature that's like very high
- 00:05:37volume right why because it's not that
- 00:05:39again right a lot of users use it but
- 00:05:41also when you're writing code in a file
- 00:05:43you will trigger autocomplete like maybe
- 00:05:44hundreds of times because you're writing
- 00:05:46like one word you're writing like some
- 00:05:47syntax in line It'll like trigger and
- 00:05:49then if you like what like a ghost text
- 00:05:52which is like recommended you can just
- 00:05:54Tab and like it selects right yeah so
- 00:05:56then it helps you write code in a faster
- 00:05:58way now this is I would say that this is
- 00:06:00something on the like extreme of hey I
- 00:06:03have to be like very latency sensitive
- 00:06:04and like really be fast right because
- 00:06:06when you're on Google you're typing a
- 00:06:07query Google will do like auto suggestions
- 00:06:09right itself can I complete your query
- 00:06:11so we we've been like using some of
- 00:06:13these features for the last decade as
- 00:06:14users on the consumer side and this is
- 00:06:16really important as well because this
- 00:06:18interaction that you're just talking
- 00:06:19about here it's a very emotive one you
- 00:06:21will you will piss people off if you do
- 00:06:23not if you do not provide them cuz this
- 00:06:25is supposed to be an efficiency play if
- 00:06:27you don't provide them with something
- 00:06:28that they can actually improve their
- 00:06:30workflow and they can just Tab and say
- 00:06:32yeah this is so cool I'm tabing tabing
- 00:06:33tabing accepting then they're going to
- 00:06:35they're going to get annoyed to the
- 00:06:37extent that they will almost reject that
- 00:06:39kind exactly this is where quality is
- 00:06:41important but latency is also important
- 00:06:42right so now we see that there's a
- 00:06:43trade-off yeah that hey yes I want
- 00:06:45quality but then if you're doing it like
- 00:06:46400 milliseconds later hey I'm already
- 00:06:49annoyed why because the perception of
- 00:06:51the user on when they're competing right
- 00:06:53they don't want to wait if I have to
- 00:06:55wait then I'd rather go to code edit or
- 00:06:56like chat this is that's a great point
- 00:06:59is is latency does latency differ
- 00:07:01between the types of things that c
- 00:07:03offers or cuz I I would guess if if a
- 00:07:05developer is going to wait for something
- 00:07:06they might they'll be likely to wait a
- 00:07:08similar amount of time across all but
- 00:07:10different yeah not really the expectations differ let
- 00:07:12me go back to the features right Cody has
- 00:07:14auto complete which helps you complete
- 00:07:15code when you're typing out we have code
- 00:07:17edit code fix here there's a bug you can
- 00:07:19write a command or select something and
- 00:07:20say hey Cod fix it yeah now when you're
- 00:07:22fixing then people are okay to wait for
- 00:07:24a second or two as it figures out and
- 00:07:25then they're going to show a diff and oh
- 00:07:27I like this change and I'm going to
- 00:07:28accept it now this is going to span
- 00:07:29maybe 3,000 milliseconds right 3 to 4
- 00:07:31seconds yeah versus auto complete no I
- 00:07:33want everything tab complete selected
- 00:07:35within 400 500 millisecond latency again
- 00:07:37just there the difference start popping
- 00:07:39up now we've talked about autocomplete
- 00:07:40code edit code fix then we could go to
- 00:07:42chat as well right chat is okay I'm
- 00:07:44typing inquiry it'll take me like a few
- 00:07:46seconds to type in the right query and
- 00:07:47select the right code and do this right
- 00:07:49so the expectation of 400 milliseconds
- 00:07:51is not really the case in chat because
- 00:07:53I'm asking maybe a more complex query I
- 00:07:55want you to take your time and give the
- 00:07:57answer versus like unit test generation
- 00:07:59for example right write the
- 00:08:00entire test and make sure that you cover
- 00:08:03the right corner cases the unit test has
- 00:08:04great coverage and like you're not just
- 00:08:06missing important stuff you're making sure
- 00:08:08that the unit test is actually quite good now
- 00:08:11there I don't want you to complete in 400
- 00:08:13milliseconds take your time write
- 00:08:15good code I'm waiting I'm willing to
- 00:08:16wait a long time yeah let's take a step
- 00:08:18back what what are we looking at we
- 00:08:20looking at a few different features now
- 00:08:21similar right Netflix Spotify is not
- 00:08:23just one recommendation model you go do
- 00:08:25search you go do podcast you go do hey I
- 00:08:27want this category content a bunch of
- 00:08:28these right so similarly here in coding
- 00:08:31assistant for Cody you have autocomplete
- 00:08:32code edit code fix unit test generation
- 00:08:35you have a bunch of these commands you
- 00:08:36have chat chat is an entire nightmare I
- 00:08:38can talk about hours on like chat is
- 00:08:40like this one box Vision which people
- 00:08:42can come with like hundreds of intents
- 00:08:44yeah that's open-ended it's a nightmare for
- 00:08:47me as an engineer to say that are we
- 00:08:48doing well on that because in auto
- 00:08:50complete I can develop metrics around it
- 00:08:52I can think about okay this is a unified
- 00:08:54this is a specific feature yeah chat may
- 00:08:56be like masking hundreds of these
- 00:08:58features just by natural language so we
- 00:09:00can talk a little more about chat as
- 00:09:01well over a period of time but coming
- 00:09:04back to the original point you mentioned
- 00:09:05for autocomplete people will be latency
- 00:09:07sensitive yeah for unit test generation maybe
- 00:09:09less for chat maybe even less what that
- 00:09:11means is the design choices which I have
- 00:09:14as an ml engineer are different in order
- 00:09:16to complete I'm not going to look at
- 00:09:17like 400 billion parameter BS right I
- 00:09:19ownn something which is f right so if
- 00:09:21you look at the x-axis latency Y axis
- 00:09:23quality look I don't want to go the top
- 00:09:26right top right is high latency and high
- 00:09:28quality I don't want latency I want to
- 00:09:30be in the like whatever 400 500 n to n
- 00:09:33millisecond latency space so there small
- 00:09:35models kick in right and small models we
- 00:09:37can fine-tune for great effect right we
- 00:09:39we've done some work we just published
- 00:09:40a blog post a couple of weeks ago on
- 00:09:42hey if you fine-tune for Rust Rust is
- 00:09:44like a lot more has a lot of nuances
- 00:09:46which most of these large language
- 00:09:47models are not able to capture so we can
- 00:09:49fine-tune a model for Rust and do really
- 00:09:51well on auto completion within the
- 00:09:53latency requirements which we have for
- 00:09:54this feature yeah so then these
- 00:09:56trade-offs start emerging essentially
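(A rough sketch of the trade-off Rashab describes here: route each feature to the best model whose latency fits that feature's budget. Every model name, latency, and quality score below is invented for illustration.)

```python
# Hypothetical sketch: pick a model per feature based on its latency budget.
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    p50_latency_ms: int
    quality: float  # offline eval score, higher is better

MODELS = [
    Model("small-finetuned-rust", p50_latency_ms=250, quality=0.71),
    Model("medium-general", p50_latency_ms=900, quality=0.78),
    Model("large-general", p50_latency_ms=3500, quality=0.84),
]

LATENCY_BUDGET_MS = {"autocomplete": 500, "code_edit": 4000, "unit_test_gen": 30000}

def pick_model(feature: str) -> Model:
    """Best-quality model that still meets the feature's latency budget."""
    budget = LATENCY_BUDGET_MS[feature]
    candidates = [m for m in MODELS if m.p50_latency_ms <= budget]
    return max(candidates, key=lambda m: m.quality)

assert pick_model("autocomplete").name == "small-finetuned-rust"
assert pick_model("unit_test_gen").name == "large-general"
```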
- 00:09:58how does that change if you if the
- 00:10:00output that Cody was going to provide
- 00:10:02the developer would actually be on a
- 00:10:04larger scale so we talked when we're
- 00:10:05talking about autocomplete we're really
- 00:10:06talking about a one liner complete but
- 00:10:08what if we was to say I want you to
- 00:10:10write this method or I want you to write
- 00:10:11this module obviously then you don't
- 00:10:13want that you don't want the developer
- 00:10:15to accept an autocomplete or look at an
- 00:10:18autocomplete module or or function and
- 00:10:21think this is absolute nonsense it's
- 00:10:22giving me nonsense quickly presumably
- 00:10:24then they're willing to they're willing
- 00:10:26to wait that much longer yeah I think
- 00:10:27there's that's a really good point right
- 00:10:29that people use Cody and not just Cody
- 00:10:31and not just in the coding domain right we
- 00:10:33use co-pilots across different I mean
- 00:10:35sales co-pilot marketing co-pilot
- 00:10:36finance risk co-pilot people are using
- 00:10:38these agents or assistants in for
- 00:10:41various different tasks right and some
- 00:10:44of these tasks are like complex and more
- 00:10:46sophisticated some of these tasks are
- 00:10:47like simpler right yeah so let me let me
- 00:10:49just paint this like picture of how I
- 00:10:51view this right when you pick up a topic
- 00:10:53to learn right be it programming you
- 00:10:54don't start with like multi-threading
- 00:10:55you start with okay do I know the syntax
- 00:10:57can I instantiate variables can I do
- 00:10:59if-else can I do a for loop and can I do
- 00:11:00switch and then multithread and then
- 00:11:02parallelism so when we as humans learn
- 00:11:04there is a curriculum we learn on right
- 00:11:06we don't directly go to chapter 11 we
- 00:11:07start with chapter 1 similarly I think
- 00:11:09like the lens which which I view some of
- 00:11:12these agents and tools are it's okay you
- 00:11:14not at like chapter 12 yet but I know
- 00:11:17that there are simpler tasks you can do
- 00:11:18and then there are like medium tasks you
- 00:11:19can do and then there are like complex
- 00:11:20tasks you can do now this is a lens
- 00:11:23which is which I found to be like pretty
- 00:11:25useful because when you say that hey for
- 00:11:26aut complete for example I don't want
- 00:11:28again my use case probably for auto
- 00:11:30compete is not okay tackle a chapter 12
- 00:11:32complexity problem no for that I'll
- 00:11:34probably have an agentic setup yeah so
- 00:11:36this curriculum is a great way to look
- 00:11:37at it the other great way to look at
- 00:11:38things are like let's just call it like
- 00:11:40left versus right so on the left we have
- 00:11:41these tools which are living in your IDE
- 00:11:43and they're helping you write better
- 00:11:44code and like complete code and like
- 00:11:46really you are the main lead you're
- 00:11:47driving the car you're just getting some
- 00:11:48assistance along the way right versus
- 00:11:51things on the right are like agentic
- 00:11:53right that hey I'm going to type in here
- 00:11:54is my GitHub issue send me create a
- 00:11:56bootstrap me a PR for this right there I
- 00:11:59want the machine learning models to take
- 00:12:00control not full autonomy Quinn our CEO
- 00:12:03has an amazing blog post yeah on levels
- 00:12:05of code AI right you start level zero to
- 00:12:07level seven and some of these human
- 00:12:09initiated AI initiated AI-led and that
- 00:12:11gives us a spectrum to look at autonomy
- 00:12:13from a coding assistant perspective
- 00:12:15which is great I think everybody should
- 00:12:16look at it but coming back to the
- 00:12:18question auto complete is probably for
- 00:12:21I'm still the lead driver here help me
- 00:12:24but in some of the other cases I'm stuck
- 00:12:26take your time but then tackle more
- 00:12:28complex task yeah now the context and
- 00:12:30the model size the the latencies all of
- 00:12:32these start differing here right when
- 00:12:34you're writing this code you probably
- 00:12:35need for autocomplete local context
- 00:12:37right or like maybe if you're
- 00:12:38referencing a code from some of
- 00:12:40repository bring that as a dependency
- 00:12:42code and then use it in context if
- 00:12:44you're looking at a new file generation
- 00:12:46or like a new function generation that's
- 00:12:48okay you got to look at the entire
- 00:12:49repository and not just make one changes
- 00:12:51over here you have to make an entire
- 00:12:53file and make changes across three other
- 00:12:55files right M so even where is the
- 00:12:57impact the impact in auto it is like
- 00:12:59local in this file in this region right
- 00:13:02and then if you look at the full
- 00:13:03autonomy case or like agentic setups
- 00:13:05then the impact is okay I'm going to
- 00:13:06make five changes across three files
- 00:13:08into two repositories yeah right so
- 00:13:10that's that's the granularity at which
- 00:13:11some of these things are starting to
- 00:13:13operate essentially yeah and testing is
- 00:13:14going to be very similar as well
- 00:13:15right if someone is if someone's writing
- 00:13:18code in their IDE line by line and
- 00:13:20that's maybe using a code generation as
- 00:13:22well like Cody they're going to likely
- 00:13:25want to be able to have test step step
- 00:13:27in sync so as I write code you're
- 00:13:29automatically generating tests that are
- 00:13:32effectively providing me with that
- 00:13:33assurance that that the automatically
- 00:13:35generated code is working as I want to
- 00:13:38yeah that's a great point I think this
- 00:13:39is more like errors multiply yeah if I'm
- 00:13:41evaluating something long after writing
- 00:13:44it then it's worse off right because the
- 00:13:47errors I could have stopped the errors
- 00:13:48earlier on and then debugged it and
- 00:13:51fixed it locally and then moved on so
- 00:13:53especially so taking a step back look I
- 00:13:55love evaluation I really in machine
- 00:13:57learning I I started my PhD thinking
- 00:13:59that hey math and like fancy graphical
- 00:14:02models are the way to have impact using
- 00:14:03machine learning right and you spend one
- 00:14:06year in the industry realized nah it's
- 00:14:07not about the fancy model it's about do
- 00:14:09you have an evaluation you have these
- 00:14:11metrics do you know when something is
- 00:14:12working better yeah so I think getting
- 00:14:14the zero to one on evaluation on these
- 00:14:16data sets that is really key for any
- 00:14:19machine learning problem y now
- 00:14:21especially when what you mean by the
- 00:14:22zero to one yeah 0 to1 is look at like
- 00:14:24whenever a new language model gets
- 00:14:26launched right people are saying that
- 00:14:27hey for coding Llama 3 does well on
- 00:14:30coding why because oh we have this Human
- 00:14:32Eval data set and a pass@1 metric
- 00:14:34let's unpack that the HumanEval data set is
- 00:14:36a data set of 164 questions hey write me
- 00:14:39a binary search in this code right so
- 00:14:41essentially it's like you get a text and
- 00:14:43you write a function and then you're
- 00:14:45like hey does this function run
- 00:14:46correctly so they have a unit test for
- 00:14:47that and if it passes then you get plus
- 00:14:49one right yeah so now this is great it's
- 00:14:51a great start but is it really how
- 00:14:54people are using Cody and a bunch of
- 00:14:56other coding tools no they're like if
- 00:14:57I'm an Enterprise developer if if let's
- 00:14:59say I'm in a big bank then I have 20,000
- 00:15:01other peers and there are like 30,000
- 00:15:03depositories but I not writing binary
- 00:15:05search independent of everything else
- 00:15:07right I'm working in a massive code base
- 00:15:09which has been edited across the last 10
- 00:15:11years and there's some dependency by
- 00:15:13some team in Beijing and there's a
- 00:15:14function which I haven't even read right
- 00:15:16and maybe it's in a language I don't
- 00:15:18even care about or understand so my
- 00:15:20point is the evaluation which we need
- 00:15:23for these Real World products is
- 00:15:24different than the benchmarks which we
- 00:15:26have in the industry right now the one
- 00:15:29for evaluation is that hey sure let's
- 00:15:31use pass@1 on HumanEval at the start on
- 00:15:33Day Zero but then we see that you
- 00:15:35improve it by 10% we have results when
- 00:15:38we actually did improve pass@1 by
- 00:15:4010 15% we tried it online on Cody users
- 00:15:43and the metrics dropped yeah and we
- 00:15:45writing a blog post about it on offline
- 00:15:46online correlation yeah because if you
- 00:15:48trust your offline metric pass@1
- 00:15:50you improve it you hope that hey amazing
- 00:15:51users are going to love it yeah it
- 00:15:53wasn't true the context is so different
- 00:15:55yeah the context are so different now
- 00:15:57this is this means that I got to develop
- 00:15:59an evaluation for my feature and I got
- 00:16:02my evaluation should represent how my
- 00:16:03actual users using this feature feel
- 00:16:05about it m just because it's better on a
- 00:16:07metric which is an industry Benchmark
- 00:16:10doesn't mean that improving it will
- 00:16:11improve actual user experience and can
- 00:16:12that change from user to user as well so
- 00:16:14you mention a bank there if five other
- 00:16:16Banks is it going to be the same for
- 00:16:18them if something not in the fin Tech
- 00:16:20space is it going to be different for
- 00:16:21them that's a great point I think the
- 00:16:22Nuance you're trying to say is that hey
- 00:16:24one are you even feature-aware in your
- 00:16:27evaluation because pass@1 is not
- 00:16:28feature-aware right yeah pass@1
- 00:16:30doesn't care about autocomplete or
- 00:16:31unit test generation or code fixing I
- 00:16:33don't care what the end use case or
- 00:16:34application is this is just evaluation
- 00:16:36so I think the first jump is have an
- 00:16:38evaluation data set which is about your
- 00:16:41feature right the evaluation data set
- 00:16:42for unit test generation is going to be
- 00:16:44different that code completion it's
- 00:16:45going to be different than code edits
- 00:16:46it's going to be different than chat so
- 00:16:48I think the 0 to one we talking about 5
- 00:16:49minutes earlier you got to do 0 to ones
- 00:16:51for each of these features yeah and
- 00:16:53that's not easy because evaluation
- 00:16:55doesn't come naturally yeah and once you
- 00:16:56have it then the question becomes that
- 00:16:58hey okay once I have it for my feature
- 00:17:00then hey can I reuse it across
- 00:17:01Industries can I reuse it across like
- 00:17:04users and I think we've seen it I've
- 00:17:05seen it in the rec traditional
- 00:17:07recommendation space let's say most of
- 00:17:09these apps again if they got like seed
- 00:17:11funding last year or maybe series a
- 00:17:13there at what like 10,000 daily active
- 00:17:14users 5,000 daily active users today one
- 00:17:16year from now they're going to be 100K
- 00:17:18500k daily active users right now how
- 00:17:21representative is your subset of users
- 00:17:22today right the 5,000 users today are
- 00:17:24probably early adopters and if anything
- 00:17:27is scaling companies in the last 10
- 00:17:28years what it has told us is the early
- 00:17:30adopters are probably not the
- 00:17:31representative set of users you'll have
- 00:17:33once you have a mature adoption yeah
- 00:17:35what that means is the the metrics which
- 00:17:36are developed and the learnings which I've
- 00:17:38had from the initial AB test may not
- 00:17:41hold one year down the line six months
- 00:17:42down the line as in when the users start
- 00:17:44increasing right yeah now how does it
- 00:17:46link to the point you asked look there
- 00:17:48are heterogeneities across different
- 00:17:49domains different Industries luckily
- 00:17:51there are like homogeneities across
- 00:17:52language right if you're a front-end
- 00:17:53developer a versus B versus C companies
- 00:17:56a lot of the tasks you're trying to do
- 00:17:57are like similar and a lot of the task
- 00:17:59which the pre-training data set has seen
- 00:18:02is also similar because rarely again
- 00:18:04there are cases where you're doing
- 00:18:05something really novel but a lot of the
- 00:18:08junior development workflow probably is
- 00:18:10more like things which like hundreds and
- 00:18:11thousands of Engineers have done before
- 00:18:13so the pre-trained models have seen this
- 00:18:14before right so when we fine-tuned the model
- 00:18:16for us that's not where we saw
- 00:18:18advantages because yeah you've seen it
- 00:18:19before you're going to do it well MH
- 00:18:21coming back to the point I mentioned
- 00:18:22earlier it's going to be a curriculum
- 00:18:23right you can do simple things well you
- 00:18:25can do harder things in Python well you
- 00:18:27can't do harder things in Rust well you
- 00:18:29can't do harder things in MATLAB well so
- 00:18:31my goal of fine-tuning some of these
- 00:18:32models is that hey I'm going to show you
- 00:18:34examples of these hard task not because
- 00:18:37I just want to play with it but because
- 00:18:39some of our adopters right some we have
- 00:18:41a lot of Enterprise customers using us
- 00:18:43right and paying us for that right I get
- 00:18:44my salary because of that essentially so
- 00:18:46essentially I want those developers to
- 00:18:47be productive and they're trying to
- 00:18:49tackle some complex tasks in Rust which
- 00:18:51maybe we haven't paid attention when we
- 00:18:53were training this llama model or like
- 00:18:54this Anthropic model so then my goal is
- 00:18:56how do I extract those examples and then
- 00:18:59bring it to my Loop training Loop
- 00:19:01essentially and that's where right now
- 00:19:03if let's say one industry is struggling
- 00:19:06we know how the metrics are performing
- 00:19:08right that's what evaluation is so
- 00:19:09important we know where we suck at right
- 00:19:12now and then we can start collecting
- 00:19:13public examples and start focusing the
- 00:19:16models to do well on those right yeah
- 00:19:18again let me bring the exact point I
- 00:19:19mentioned I I'm going to say it 100
- 00:19:21times we've done it before if you spend
- 00:19:2320 minutes on Tik Tok you're going to
- 00:19:26look at what 40 short videos If you
- 00:19:28spend 5 minutes on Tik Tok or Instagram
- 00:19:30re you're going to look at like 10 short
- 00:19:31videos right yeah in the first nine
- 00:19:33short videos you're going to either skip
- 00:19:34it or like it or follow a Creator do
- 00:19:36something right so the 11th short video is
- 00:19:38like really personalized because I've
- 00:19:39seen what you're doing in the last 5
- 00:19:40minutes and I can do real-time
- 00:19:42personalization for you yeah what does
- 00:19:44that mean in the coding assistant world
- 00:19:45look I know how these models are used in
- 00:19:47the industry right now and how our
- 00:19:48Enterprise customers and our community
- 00:19:50users are using it yeah let's look at
- 00:19:51the completion acceptance rate for
- 00:19:53autocomplete oh for these languages in
- 00:19:55these use cases we get a high acceptance
- 00:19:57rate we show our recommendation
- 00:19:59people accept the code and move on but
- 00:20:01in these other examples oh we're not really
- 00:20:03doing well so the question then becomes
- 00:20:06oh this is something which maybe we
- 00:20:07should train on or we should fine-tune on
- 00:20:10and this establishes a feedback loop
- 00:20:11yeah that look at what's not working and
- 00:20:14then make the model look at more of
- 00:20:16those examples and create that feedback
- 00:20:17loop which can then make the models
- 00:20:19evolve over a period of time
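(A minimal sketch of the feedback loop described here; the telemetry fields and the 50% threshold are invented for illustration: aggregate autocomplete acceptance by language and flag low performers as fine-tuning targets.)

```python
# Hypothetical sketch: mine autocomplete telemetry for fine-tuning targets.
from collections import defaultdict

events = [  # (language, suggestion_accepted) -- invented telemetry
    ("python", True), ("python", True), ("python", False),
    ("rust", False), ("rust", False), ("rust", True),
]

shown, accepted = defaultdict(int), defaultdict(int)
for lang, ok in events:
    shown[lang] += 1
    accepted[lang] += ok  # bool counts as 0 or 1

for lang in shown:
    rate = accepted[lang] / shown[lang]
    if rate < 0.5:  # underserved: collect more examples of this language
        print(f"{lang}: acceptance {rate:.0%} -> candidate for fine-tuning data")
```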
- 00:20:20and so I think there's two pieces here if we move
- 00:20:22a little bit into the into what this
- 00:20:24then means for testing the more that
- 00:20:27evaluation like almost like testing of
- 00:20:29the model effectively gets better it
- 00:20:31effectively means that the suggested
- 00:20:33code is then more accurate as a result
- 00:20:36the you need to rely on your tests
- 00:20:39slightly less I'm not saying you
- 00:20:40shouldn't but slightly less because the
- 00:20:42generated code that is being suggested
- 00:20:44is a higher quality when it then comes
- 00:20:46to the suggestions of tests afterwards
- 00:20:50does that follow the same model in terms
- 00:20:52of learning from not just the what the
- 00:20:54user wants to test but also from what is
- 00:20:57being generated
- 00:20:59is there work that we can do there into
- 00:21:00building that test yeah I think that's
- 00:21:02an interesting point test generation is a
- 00:21:05self-contained problem in itself right
- 00:21:07yeah one I think let's just establish
- 00:21:08the fact that unit test generation is
- 00:21:10probably one of the highest value use
- 00:21:13cases yeah we got to get right in the
- 00:21:15coding industry and that's because I
- 00:21:17mentioned that look you will stop using
- 00:21:19spotify if I start showing shitty
- 00:21:21recommendations to you yeah if I don't
- 00:21:23learn from my mistakes you're going to
- 00:21:25keep skipping like short videos or music
- 00:21:27or podcast and you're going to go right
- 00:21:28I don't get value because I spend I'm
- 00:21:30running I'm jogging I want music to just
- 00:21:32come in and I don't want to stop my
- 00:21:33running and hit like skip because that's
- 00:21:36takes that's like dissatisfaction right
- 00:21:38yeah one of the things which I did at
- 00:21:39Spotify and like in my previous company
- 00:21:40was like I really wanted people to focus
- 00:21:43on dissatisfaction yeah because
- 00:21:44satisfaction is all users are happy yeah
- 00:21:46that's not where I make more money I
- 00:21:47make more money by reducing
- 00:21:49dissatisfaction where are you unhappy
- 00:21:51and how do I fix that and even there if
- 00:21:53I stop you there actually just quickly I
- 00:21:55think there are different levels to this
- 00:21:56as well in terms of the testing right
- 00:21:57cuz it's at some point it's yes at its
- 00:21:59most basic does this code work and then
- 00:22:03as it goes up it's does this code work
- 00:22:04well does this code work fast does this
- 00:22:06code is this really making you happy as
- 00:22:08a user and so I think where would you
- 00:22:12say we are right now in terms of the
- 00:22:14level of test generation are people most
- 00:22:16concerned with the this code is being
- 00:22:19generated or when I start creating my
- 00:22:21tests how do I validate that my code is
- 00:22:24correct in terms of it compiles it's
- 00:22:28it's covering my basic use cases that
- 00:22:30I'm asking for and doing the right thing
- 00:22:32no this we can talk an hour about this
- 00:22:34look I think this is a long journey yeah
- 00:22:36getting evaluation getting unit test
- 00:22:37generation right across the actual
- 00:22:39representative use case in the
- 00:22:41Enterprise that's a nightmare of a
- 00:22:43problem yeah look I can do it right for
- 00:22:45writing binary binary search algorithms
- 00:22:47right if you give me like if you give me
- 00:22:48a coding task which has nothing to do
- 00:22:50with a big code repository and like
- 00:22:52understanding context understanding what
- 00:22:54the 5,000 developers have done sure I
- 00:22:56can attempt it and create a unit test
- 00:22:58because this there's a code and there's
- 00:23:00a unit test this lives independently right
- 00:23:02they live on an island they're happily
- 00:23:03married amazing everything works but
- 00:23:04this is not how people use coding this
- 00:23:06is not how like the Enterprise or like
- 00:23:08even Pro developer use cases are the pro
- 00:23:10developer use cases are about like hey
- 00:23:12is this working correctly in this like
- 00:23:14wider context because there's a big code
- 00:23:16repository and like multiple of them and
- 00:23:18that's a wider context wherein you're
- 00:23:20actually writing the code and you're
- 00:23:21actually writing the unit test now I
- 00:23:23would bring in this philosophical
- 00:23:25argument of I think unit test generation I
- 00:23:27would look at it from an adversarial
- 00:23:28setting what's the point of having the
- 00:23:30unit test it's not just to make
- 00:23:32yourself or your manager happy that
- 00:23:34oh I have unit test coverage unit tests are
- 00:23:36probably like a guard rail to keep bad
- 00:23:39code from entering your system yes so
- 00:23:41what is this this is an adversarial
- 00:23:42setup maybe not intentionally
- 00:23:44adversarial but an adversarial setup okay somebody's
- 00:23:45trying to make bad things happen in your
- 00:23:47code and somebody else is stopping that
- 00:23:50from happening right so again if you
- 00:23:52start looking at unit test generation from
- 00:23:53this adversarial setup that look this is a
- 00:23:55good guy right the unit test is going to
- 00:23:58prevent bad things from happening in
- 00:23:59future to my code base that's why I need
- 00:24:02good unit test now this bad now this is
- 00:24:04a good guy right who are the bad people
- 00:24:06right now in the last up until the last
- 00:24:07few years the bad people not
- 00:24:09intentionally bad but the Bad actors in
- 00:24:11The Code base were developers yeah now
- 00:24:13we have ai yeah right now I am as a
- 00:24:15developer writing code and if I write
- 00:24:16shitty code then the unitest will catch
- 00:24:18it and I won't be able to merge yeah
- 00:24:20right if
- 00:24:21ifel right exactly we'll get to that
- 00:24:24right I'm yet to I'm yet to see a
- 00:24:26developer who who is not a tdd fan who
- 00:24:29absolutely lives for writing tests and
- 00:24:31building a perfect test suite for the code
- 00:24:33yeah yeah again right and there's a
- 00:24:34reason why test case coverage is
- 00:24:36like low across like all the
- 00:24:38repositories right it's not something
- 00:24:39which again I think like Beyang our CTO he
- 00:24:42loves to say that the goal of Cody is to
- 00:24:43remove developer toil yeah and how do I
- 00:24:45make you do a lot more happier job
- 00:24:48focusing on the right creative aspects
- 00:24:50of architecture design or system design
- 00:24:51and remove toil from your life right a
- 00:24:53bunch of developers for better or worse
- 00:24:55start looking at unit test generation
- 00:24:57as maybe it's not as interesting let's
- 00:24:59unpack that as well not all unit tests
- 00:25:01are like boring right yeah writing
- 00:25:02stupid unit test for stupid functions we
- 00:25:04shouldn't even like probably do it or
- 00:25:06like I I will let like machine learning
- 00:25:07do it essentially but the Nuance are
- 00:25:09like here is a very critical function if
- 00:25:12you screw this then maybe the payment
- 00:25:14system in your application gets screwed
- 00:25:16and then you lose money right and then
- 00:25:18you don't if you don't have
- 00:25:19observability then you lose money over a
- 00:25:21period of time and then you're literally
- 00:25:22costing company dollars millions of
- 00:25:23dollars if you screw this code
- 00:25:25essentially so the point is not all unit
- 00:25:26tests are the same because not all
- 00:25:27functions are equally important right
- 00:25:29there's going to be a distribution of
- 00:25:30some of the are like really really
- 00:25:31important functions you got to get a
- 00:25:33amazing unit test right I would rather I
- 00:25:35if I have a limited budget that if I
- 00:25:37have to principal Engineers I would make
- 00:25:39sure that the unit of these really
- 00:25:41critical pieces of component in my
- 00:25:43software stack are written by these
- 00:25:45Engineers or even if they written by
- 00:25:48like these agents or AI Solutions then
- 00:25:49at least like they wed it from some of
- 00:25:51these MH but before we get there let's
- 00:25:54just look at the fact of the need for
- 00:25:56unit test not just today but tomorrow
- 00:25:58yeah because right now if you have
- 00:26:00primarily developers writing unit test
- 00:26:02or like some starting tools tomorrow a
- 00:26:04lot more AI assistant I mean we are
- 00:26:06building one right we are trying to say
- 00:26:08that hey we're going to write more and
- 00:26:09more of your code yeah what that means
- 00:26:11is if in the adversarial setup unit test
- 00:26:14are like protecting your code base the
- 00:26:16the potential attacks not intentional
- 00:26:18but the bad code could come in from
- 00:26:19humans but also like thousands and
- 00:26:20millions of AI agents tomorrow yeah and
- 00:26:22you know what worries me a little bit
- 00:26:24here as well is in fact when you talked
- 00:26:25about that that those levels autonomy on
- 00:26:28the far left it's much more interactive
- 00:26:31right you have developers who are
- 00:26:32looking at the lines of code that
- 00:26:33suggested and looking at the tests that
- 00:26:35get generated so it's much more involved
- 00:26:38for the developer as soon as you go
- 00:26:40further right into that more automated
- 00:26:41World we're we're more in an we're more
- 00:26:43in an environment where um larger
- 00:26:46amounts of content is going to be
- 00:26:49suggested to that developer and if we go
- 00:26:51back to the same old that story where if
- 00:26:54you want 100 comments on your pull request
- 00:26:56in a code review you write two line
- 00:26:58change if you want zero you provide a
- 00:27:00500 line change right and when we
- 00:27:02provide that volume whether it's hey I'm
- 00:27:04going to build you this part of an
- 00:27:06application or a module or a test Suite
- 00:27:09based on some code how much is a
- 00:27:11developer actually going to look in
- 00:27:13detail at every single one of those
- 00:27:14right and I think this kind of comes
- 00:27:16back to your point of what are the most
- 00:27:19important parts that I need to to look
- 00:27:22at but yeah it revolves a little bit
- 00:27:24more around what you were saying earlier
- 00:27:25as well whereby tests becoming more more
- 00:27:28important for this kind of thing and
- 00:27:30exactly as code gets generated
- 00:27:31particularly in volume right what are
- 00:27:34where are the guard rails for this and
- 00:27:36it's all about tests I love the point
- 00:27:37you mentioned right that look as more
- 00:27:39and more code gets written like my the
- 00:27:41cognitive abilities of a developer to
- 00:27:43look at every change everywhere it just
- 00:27:46takes more time takes more effort takes
- 00:27:47more cognitive load right yeah now
- 00:27:49coupling the fact that if you've been
- 00:27:50using this system for a few months then
- 00:27:52there's an inherent trust in the system
- 00:27:55now this is when I get really scared you
- 00:27:57look at
- 00:27:58when I started using Alexa in 2015 right
- 00:28:00it would only get weather right right
- 00:28:02Google Home Alexa it wouldn't do any of the
- 00:28:04other things who can get weather
- 00:28:06right yeah weather prediction is a hard
- 00:28:09enough machine learning problem DeepMind
- 00:28:10Engineers are still working on it and
- 00:28:11still getting it right and doing it in
- 00:28:12London yeah I would pay for a service
- 00:28:15which predicts but the point is we have used
- 00:28:18as a society like these conversational
- 00:28:19agents for a decade now we just asking
- 00:28:21like crappy questions yeah because we
- 00:28:23trust it I asked you a complex question
- 00:28:24you don't have an answer I I forgot
- 00:28:26about you for the next few months but
- 00:28:28but then we start increasing the
- 00:28:29complexity of the questions we ask and
- 00:28:30that's great because now the the Siri
- 00:28:33and the and the Google assistant and
- 00:28:34Alexa was able to tackle these questions
- 00:28:36right and then you start trusting them
- 00:28:38because hey oh I've asked you these
- 00:28:39questions and we've answered them well
- 00:28:40so then I trust you to do these tasks
- 00:28:42well and again right if you look at
- 00:28:45people who use Spotify or Netflix their
- 00:28:47recommendations in the feed they have
- 00:28:49more Trust on your system yeah because
- 00:28:51most of these applications do provide
- 00:28:52you a way out right if you don't trust
- 00:28:53recommendations go to your library go do
- 00:28:55search yeah search search versus
- 00:28:58recommendations is that push versus pull
- 00:28:59Paradigm right recommendations I'm going
- 00:29:02to push content to you if you trust us
- 00:29:04you're going to consume this right if
- 00:29:05you don't trust a system if you don't
- 00:29:06trust your recommendations then you're
- 00:29:08going to pull content which is search
- 00:29:09right now especially as in when you seen
- 00:29:11there's a distribution of people who
- 00:29:12don't search at all right they're like
- 00:29:14they we live in that high trust world
- 00:29:16when they're going to they're going to
- 00:29:17like like a recommendation same right
- 00:29:19Google who goes through the second page
- 00:29:20of Google right who goes through Google
- 00:29:22now sorry uh but essentially the point
- 00:29:25is once people start trusting these
- 00:29:27systems
- 00:29:28the unit test generation is a system which I
- 00:29:31start trusting right and then it starts
- 00:29:33tackling more and more complex problems
- 00:29:34and then is oh I start I stopped looking
- 00:29:36at the corner cases and then that code
- 00:29:38was committed 6 months ago and that unit
- 00:29:40T is there and then code on top of it
- 00:29:42was committed three months ago and then
- 00:29:43there's probably unit test which I
- 00:29:44didn't write the agent wrote now this is
- 00:29:47where like complexity evolves for a
- 00:29:49period of time and maybe there's a
- 00:29:51generation of unit test which have been
- 00:29:53written maybe with less and less of me
- 00:29:55being involved yeah the the levels of
- 00:29:57code AI like that means like your
- 00:29:59involvement is not at the finer levels
- 00:30:01it's like higher up now this assumes it
- 00:30:03works well if the foundations are
- 00:30:04correct and everything is robust and we
- 00:30:06have like good checks in place yeah
- 00:30:08again right whatever could go wrong
- 00:30:10would go wrong yeah so then the point is
- 00:30:12in this complex code where in the series
- 00:30:14generations of unit test generations of
- 00:30:16code Cycles edits have been made by an
- 00:30:19agent then things could go horribly
- 00:30:21wrong so do we have what's the solution
- 00:30:23the solution is pay more respect to
- 00:30:25evaluation right look you got to you got
- 00:30:27to LD is guard is just like a very
- 00:30:30harmless way to for me to say that like
- 00:30:33unitz generation is important not just
- 00:30:34for unit generation today but for unit
- 00:30:37generation and code generation tomorrow
- 00:30:39so I think like the kind of metrics we
- 00:30:40need the kind of evaluation we need the
- 00:30:42kind of robust auditing of these systems
- 00:30:44and auditing of these unit tests so far
- 00:30:46again I don't have a huge 10-year experience
- 00:30:48in the coding industry because I've
- 00:30:50worked on recommendations and user-centric
- 00:30:51systems but for me it was like hey
- 00:30:54what is your test coverage in a
- 00:30:56repository that's the most commonly way
- 00:30:58look way to look at like repository and
- 00:31:00what's what I how advanced are you in
- 00:31:02your testing capabilities and that
- 00:31:03doesn't cut it are you covering the corner
- 00:31:05cases what's your again what's the
- 00:31:06complexity what's the severity so do we
- 00:31:08need automated tests for our tests or do
- 00:31:11we need people humans to we need both
- 00:31:14right we need human domain experts again
- 00:31:16this is and this is not just a coding
- 00:31:17problem look at look millions of dollars
- 00:31:19are spent by Anthropic and OpenAI on
- 00:31:21Scale AI Scale AI has they raised like a
- 00:31:23lot of money recently billion dollar
- 00:31:25valuations because we need domain
- 00:31:27experts to tag yeah this is also
- 00:31:28nightmare at Spotify in search I have a PhD
- 00:31:31in search my search was like I'll show
- 00:31:32users some results and they're going to
- 00:31:34tag this is correct or not I can't do
- 00:31:36this in coding MH because crowdsourcing
- 00:31:38has been a great assistance to machine
- 00:31:40learning systems for 20 years now
- 00:31:42because I can get that feedback from the
- 00:31:44user now to get feedback on a complex
- 00:31:47rust code where am I going to find those
- 00:31:49crowdsource workers on scale or Amazon
- 00:31:51Mt right they don't exist you're not
- 00:31:53going to pay them $20 an hour to write
- 00:31:55give feedback these are like thousand
- 00:31:57dollar an hour developers right yeah we
- 00:32:00don't even have a community right now
- 00:32:02around around crowdsource workers for
- 00:32:04code because this is domain specific
- 00:32:06 So my point is, again, this is not all doom-worthy. This is an important problem we have to get right, and it's going to be a long journey: we're going to do evaluations, and we're going to do generations of evaluations. I think the right way to think about it is to pay attention to unit tests, but also to the evaluation of unit tests and, taking a step back, to multiple levels of evaluation. You're going to evaluate: are we able to identify the important functions, the criticality of the code? And then look at unit test generation through that lens.
- 00:32:38 Now, one of the solutions, which I think we talked about briefly when we met over coffee, is this: let's say I generate 20 unit tests, and I want my principal or some respected, trusted engineer to vet at least some of them. Their day job is not just to review unit tests; their day job is to maintain the system and advance it. So they're going to have a limited budget to look at the unit tests I've generated. The question becomes: if this week I was able to generate 120 unit tests, and you are a principal engineer on my team, you're not going to look at all 120 tests and pass them. Maybe you have two hours to spare this week, so you'll look at maybe five of them. Now this becomes an interesting machine learning problem for me: of the 120 unit tests I've created, what is the subset of five I most need your input on?
- 00:33:28 One way to tackle this is to reduce the uncertainty; we've built uncertainty models in machine learning for 20 years in the industry. How certain am I that a given test is correct? If I'm certain, then sure, I won't show it to you. Maybe I'll show one anyway, just to check whether you confirm or reject something I thought I was certain about, and then I'll learn from that. But mostly I'm going to follow an information-maximization principle: what do I not know, and can I show that to you? What that means is that this is a budget-constrained subset selection problem: I've generated 120 unit tests, you can only look at five of them, and I have to pick those five. We can do it two ways: I can pick the five and show them to you all at once, or I can pick one, get your feedback, see what I additionally learn from it, then look at the remaining 119 again and ask: knowing what I know now, what is the next one?
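To make that concrete, here is a minimal sketch of the one-shot variant, purely an illustration rather than anything from Cody: rank the generated tests by the model's uncertainty about their correctness and surface only as many as the reviewer's budget allows. The test names and `p_correct` scores are hypothetical.

```python
import math

def entropy(p: float) -> float:
    """Binary entropy of the model's 'this test is correct' probability;
    highest when the model is most unsure (p near 0.5)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def pick_tests_for_review(candidates, budget=5):
    """Budget-constrained selection: spend the reviewer's limited time on
    the generated tests the model knows least about.
    `candidates` is a list of (test_id, p_correct) pairs."""
    ranked = sorted(candidates, key=lambda c: entropy(c[1]), reverse=True)
    return [test_id for test_id, _ in ranked[:budget]]

# 120 generated tests with made-up confidence scores; the reviewer only
# has time for five of them this week.
generated = [(f"test_{i}", (i % 10) / 10 + 0.05) for i in range(120)]
print(pick_tests_for_review(generated, budget=5))
```

The sequential variant described above would wrap this in a loop: surface one test, fold the reviewer's verdict back into the scores, and re-rank the remaining 119 before choosing the next.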
- 00:34:18 And previously you mentioned the most important parts of the code to test, because there could be critical paths, and there could be other areas where, if there's a bug, it's far less of a problem. Who provides that context? Is that something the LLM can decide, or is that something, like you said, the principal engineers should spend a little of their time on, saying: look, these are the main core areas that need to be bulletproof, and for the rest I would rather spend my time on these than on those?
- 00:34:43 I'm probably not the best person to answer who currently provides it. As a machine learning engineer, I see that there are signals. If I look at your system, there's: where were the bugs raised, and what was the severity? For each bug, for each incident, there was a severity report. What code was missed, what code wasn't, and what unit tests have people created so far? So I view it from a data-observability perspective: knowing what I know about your code, your severities, your issues, and your time to resolve some of these, I can develop a model personalized to your code base on what the core important pieces are. I can also look at just the code: which functions call which, where the dependencies are. We have amazing engineers who are compiler experts in the company; one of the reasons I joined was to bring in the ML knowledge while working with domain experts, and Sourcegraph has attracted amazing talent over the last few years, and it compounds. There are these compiler experts internally, Olaf and a few others among them, and they do really precise intelligence on the code, finding the dependency structure and all of that. That gives me content understanding of your code base. Then, if a function is deemed important from those graph links, essentially how many in-edges and out-edges the function has, and a lot of callers depend on it, that means there are a lot of downstream dependencies. So there is this way of looking at it.
- 00:36:03 Now, that is just pure code, but I also have observational data. Observational data means I know what the severities were, where the SEV-0s and SEV-1s were caused, where the really critical errors have happened over the last few months, where the probability of those happening sits right now, plus where you are writing unit tests right now. That gives me an additional layer of information. I have already parsed your code base and understood what's going on; I have that view. But now I also look at the real-world view of the data coming in: oh, you know what, there was a SEV-0 issue caused by this piece of code over here. Now the question is: can I go back one day before? If I had to predict that one error is going to pop up tomorrow, which part of the code base will it pop up in? That's a prediction I can make one day in advance, and I can start training these models.
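As a toy sketch of how those two signal families might be blended, the snippet below combines static call-graph fan-in with incident history; the weights, field names, and severity labels are illustrative assumptions, not Sourcegraph's actual model.

```python
from collections import defaultdict

def criticality_scores(call_graph, incidents, tested,
                       w_graph=0.5, w_sev=0.4, w_gap=0.1):
    """Score each function by blending: how many callers depend on it
    (content understanding), how often it was implicated in severe
    incidents (observational data), and whether it lacks tests today."""
    fan_in = defaultdict(int)
    for caller, callees in call_graph.items():
        for callee in callees:
            fan_in[callee] += 1
    max_fan = max(fan_in.values(), default=1)

    sev_weight = {"SEV0": 1.0, "SEV1": 0.6, "SEV2": 0.3}
    blame = defaultdict(float)
    for inc in incidents:  # e.g. {"function": "auth.check", "severity": "SEV0"}
        blame[inc["function"]] += sev_weight.get(inc["severity"], 0.1)
    max_blame = max(blame.values(), default=1.0)

    funcs = set(call_graph) | set(fan_in) | set(blame)
    return {
        f: w_graph * fan_in[f] / max_fan
           + w_sev * blame[f] / max_blame
           + w_gap * (0.0 if f in tested else 1.0)
        for f in funcs
    }

graph = {"api.pay": ["auth.check", "db.write"], "api.refund": ["auth.check"]}
incidents = [{"function": "auth.check", "severity": "SEV0"}]
scores = criticality_scores(graph, incidents, tested={"db.write"})
print(sorted(scores.items(), key=lambda kv: -kv[1]))  # auth.check ranks first
```

The same scores could seed the prediction he describes: train on yesterday's snapshot of the code, label with today's incidents, and ask where tomorrow's error is most likely to land.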
- 00:36:53 Now, this is again what we talked about earlier, at the start of the podcast: each of these features is a different ML model. We just talked about two ML models. One: if I have 120 unit tests, what is the subset of five I show you? That could be an LLM, or it could be subset selection; subset selection has known solutions with theoretical guarantees on performance, submodular subset selection and so on, and I've implemented them at 200-million-monthly-active-user scale at Spotify. So we can tackle this problem, but it's a new problem; it's not an LLM solution. The second one is: can I predict where the critical bug is going to be, then use that to identify critical components, and then use that to chain things together and put a unit test in there?
- 00:37:31 It's essentially human context, right? We talk about context a lot when we're talking about actual source files and how we can do code completion against so many parts of our project, but this is thinking about that almost behavioral context.
- 00:37:46 Exactly, and I'll literally be a broken record on this: we have done this before. Look, when you upload a short video, that video has zero views right now, so I look at: is this high quality, who's the author, what's the content composition. Why? Because zero people have interacted with it so far. Give it an hour and 10 million people will have interacted with that short video, and now I know which people will like it and which won't. So if you look at the recommendation life cycle of any podcast, including this one we're going to upload: there is something about the content of this podcast, and there is something about the observational, behavioral data of this podcast, that is, which developers, which users liked it or didn't, skipped it, streamed it, and all that. So we have designed recommendation systems as a combination of content and behavior. Same here: when I say unit test generation, I can look at your code base and make some inferences. On top of that, I have an additional view on this data: what the errors are, where you're writing unit tests, where you're devoting time. That gives me additional observational data on top of my content understanding of a code base. Combine the two and better things will emerge, essentially.
- 00:38:52 And we've talked about unit tests quite a bit. In terms of testing in general, there are obviously different layers to it, and I think the intent changes heavily as you move up. The higher you go, when we're talking about integration tests and things like that, the intent (to go back to context) is really about the use cases: a lot about how a user will actually use the application. And when we think about the areas of the codebase which are extremely important, those higher-level integration tests, and the flows they exercise through the application, will show which areas of code are most important as well. For our developer audience, the people we talk to: when we want to provide advice or best practices on how a developer should think about bringing AI into their general testing strategy, what's a good start today for introducing these kinds of technologies into people's processes and existing workflows successfully?
- 00:39:52 I think the simplest answer is: start using Cody. No, but really, even before I joined Sourcegraph, Cody helped me. I basically interviewed with Cody at Sourcegraph: we do an interview of, here's an open-source code base, look at it, try to make some changes. And I loved that. Psychologically I was bought in even before I had an offer, because you're making me do cognitive work on your code repository as part of the interview: just spend one hour, instead of only chatting, looking at the code and making some changes. So my point is: yes, use Cody. But the more interesting point here is, if you're trying to adopt Cody or any of the other tools for test generation, what are you going to do? You're going to try the off-the-shelf feature, hey, generate unit tests, and see where it works and where it doesn't. Now, Cody provides something called custom commands. Edit code, unit test generation: these are all commands.
- 00:40:39 What is a command? What is an LLM feature? Let's take a step back. An LLM feature is: I want to do this task, and I need some context. So I'm going to generate an English prompt, and I'm going to bring in a context strategy: what are the relevant pieces of information I should use? For example: here are the unit tests in the same folder, or here's a dependency you should be aware of. Bring in that context, write an English prompt, and send it to the LLM. That's a very simplified but useful way of looking at what an LLM feature is. So Cody provides the option of custom commands, which means I can say: hey, this doesn't work so well for me because of these nuances, so let me create a custom command. Say you're a staff engineer at this company: you can create a custom command, and now it's better. And in an enterprise setting you can share this custom command with all your colleagues: hey, you know what, this is a better way of doing unit test generation, because I've created this custom command and everybody can benefit.
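Here is a minimal sketch of that context-strategy-plus-prompt shape, as an illustration of the pattern rather than Cody's actual custom-command implementation; the helper names and the `call_llm` callable are hypothetical.

```python
from pathlib import Path

def sibling_tests(source_file: Path) -> list[str]:
    """Context strategy: existing tests in the same folder show the
    conventions the generated tests should follow."""
    return [p.read_text() for p in source_file.parent.glob("test_*.py")]

def unit_test_command(source_file: Path, call_llm) -> str:
    """An 'LLM feature' in miniature: gather context, wrap it in an
    English prompt, and send the whole thing to the model."""
    context = "\n\n".join(sibling_tests(source_file))
    prompt = (
        "Write unit tests for the code below, matching the style of the "
        "existing tests provided as context.\n\n"
        f"Existing tests:\n{context}\n\n"
        f"Code under test:\n{source_file.read_text()}\n"
    )
    return call_llm(prompt)
```

In this framing, a custom command is just a team-specific edit to the prompt text or a different context strategy, shared so everyone gets the improved version.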
- 00:41:30 Now, what makes you write a better custom command? Or even if you forget about custom commands and Cody, what makes you get a better output? This is where the 0-to-1 evaluation comes in: where are you currently failing? What sort of unit tests are we getting right, and what sort are we not? What about this is interesting, and what about your code base is interesting? The question then becomes: can I provide that as context? Can I track where it's failing and where it's not? And then there are a few interventions you can make: you can change the prompt with a new custom command, or you can create a new context source.
- 00:42:04 That's a great segue for me to mention one thing, which is open context. I think Quinn literally started that work as an individual contributor; one of the other impressive things about Sourcegraph is that if you look at the GitHub commit histories of the founders, you wonder whether they are running a company or racking up commits. It just blew me away when I first came across it. Essentially, Quinn introduced, and then a lot of the teams worked on, something called open context, OpenCtx. In an enterprise setting you have so much context that nobody can get it all right out of the box, and plugging in thousands of different heterogeneous context sources is what's going to help you get a better answer. So OpenCtx is a protocol: you can add a new context source for yourself, and because it's a protocol, Cody, and a lot of the other agents and tools around, can use it. What that means is: if you're writing unit tests and you know where it's not working, you'll make a change in the prompt, add a custom command, add some other examples, and then you think, hey, maybe I should add a context source, because I have this information, like we talked about: where the errors are coming from. That SEV-0 data is probably not something you've given Cody access to right now, but with OpenCtx you can add it as a context source and make your solutions better.
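As an illustration of the idea, deliberately not the real OpenCtx provider interface (which is its own protocol), here is a hypothetical context source that surfaces recent incident data for the file being worked on, so a test-generation prompt can be steered toward code with a history of severe failures; the file format and class shape are assumptions.

```python
import json
from pathlib import Path

class IncidentContextSource:
    """Hypothetical context source: maps source files to recent
    SEV-0/SEV-1 incidents so prompt assembly can include that
    observational signal alongside the code itself."""

    def __init__(self, incident_log: Path):
        # Assumed format: a JSON list of records like
        # {"file": "...", "severity": "SEV0", "summary": "...", "days_to_resolve": 3}
        self.incidents = json.loads(incident_log.read_text())

    def items(self, file_path: str) -> list[str]:
        return [
            f"[{i['severity']}] {i['summary']} (resolved in {i['days_to_resolve']}d)"
            for i in self.incidents
            if i["file"] == file_path
        ]

# source = IncidentContextSource(Path("incidents.json"))
# extra_context = source.items("services/payments/gateway.py")
```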
- 00:43:15 What are you doing here? You're doing applied machine learning 101 for your own feature. So again, this is exactly where you need a 0-to-1 evaluation: five examples where it doesn't work right now, so that when you make the prompt change or the context change, you can see it start working. You've done this mini 0-to-1 for your own goal.
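A minimal sketch of that 0-to-1 loop, with invented cases and a hypothetical `generate` callable, might look like this: a handful of concrete failing examples, re-run after every prompt or context tweak.

```python
def run_eval(generate, cases):
    """Re-run a small, fixed set of known-hard cases after each change.
    Each case pairs an input with a cheap check on the generated output."""
    results = {name: check(generate(source))
               for name, (source, check) in cases.items()}
    print(f"{sum(results.values())}/{len(results)} passing: {results}")
    return results

cases = {
    # Does the generated suite exercise the empty-string corner case?
    "empty_input": ("def parse(s): ...",
                    lambda out: "parse('')" in out or 'parse("")' in out),
    # Does it assert behavior rather than just calling the function?
    "has_asserts": ("def parse(s): ...", lambda out: "assert" in out),
}
# run_eval(my_unit_test_command, cases)  # repeat after every prompt change
```

Five such cases are enough to tell whether a prompt tweak or a new context source actually moved anything, which is the whole point of the 0-to-1.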
- 00:43:36 And I think there's a meta point here, which is: forget about coding assistants; we're all transitioning to that abstraction of working with an ML system, an AI system. I hate to use the phrase AI; coming from the machine learning world I'd rather say machine learning, but the audience probably buys 'AI' more, or some of the audience does. Essentially, a lot of us are starting to work with these systems, and I think the way we interact with them is going to be a bit more orchestrated: try this; figure out where it's working, great, I'm going to use it; figure out where it's not working, okay, cool, then I'm going to either give that feedback or adjust my workflow a bit and make it work better on those.
- 00:44:15 So I think we're all starting to be that applied scientist, in some way or another. And this is not just you as an engineer: if you're a domain expert, if you're a risk analyst who wants to create these plots, or if you're in sales using a sales copilot, you are working with an agentic ML setup, and you want to see where it's working, where it's not, and what changes you want to make. You have done this before: when you add a plus sign or double quotes in Google, you get those exact words. What is that? You're adapting: you know what's going to work, and you start adding those tricks. And because more and more of your daily workflow is going to revolve around these agents and systems, you start developing these feedback loops yourself. So what we are trying to do as ML engineers in the product is philosophically similar to what you are trying to do to use these products. A lot of my friends and other people ask me: hey, what's happening, I'm a domain expert. I was on another panel a few days ago, and the question there was about jobs and all of that. Not to drag the conversation into that, but essentially, if we start acting as orchestrators of these systems, we start developing intuitions on where they work and where they don't, and then we start putting in these guardrails, and those guardrails are going to help in unit test generation too.
- 00:45:27 And I think that's important, because our audience are all going to be somewhere on that journey from fully interactive to fully automated. People may want to settle somewhere along it, but they will progress from one end to the other. As we get into that more and more automated space, I remember you saying earlier, when we talked about the budget of a human: they have a limited amount of time, and if you have 120 tests, you want to hand them five. How do you place the importance of a developer's time in the future, when things get to that more automated state? How would you weigh a developer focusing on code versus focusing on testing versus focusing on something else? Where is it most valuable to have that developer's eyes?
- 00:46:16 Let's take a step back. You're a developer, I'm a developer, and there's a reason I have a job: there's a task to complete. The reason I'm writing this unit test is not that I get paid money just to write a better unit test. As a developer (not me specifically in my job, but any developer), I get paid if I can complete that task, and if I can at least spend some time making sure that in future my load for doing that task is easier and the system is helping me downstream. With that high-level view, what's happening is that rather than focusing on unit test generation or code completion as a silo, caring about it because I care about the silo, I care about it because I care about the task being completed. And if I can do this task 10x quicker, what's my path from spending five hours on it today to spending 20 minutes? This is where I said we're all going to be orchestrators. Look at a music orchestra: you have the symphony, there's an orchestra, and you're hand-waving your way into amazing music; art gets created. That's the goal: Cody wants to make sure we allow users, developers, to start creating art and not just toil. Now, I can say this in English, but I think a good developer would embody the spirit of it, which is: the sooner I can get to that orchestrator role in my mindset, the more I start using these tools, rather than being scared that it's writing my code.
- 00:47:42 And you mentioned there's going to be a spectrum, and you might want to sit somewhere on it, but technological evolution will march on and we're going to be pushed along. And it could be not just technology: individuals want to be individuals, some love writing code, and that will need to change depending on what technology offers us. But if we push further on that, what's the highest risk to what we're delivering not being the right solution? Is it testing now? Is it guardrails that become the most important thing, almost, I would say, more important than code? Or is code still the thing that we need to care about the most?
- 00:48:27 If I take this to the extreme, with my evaluation hat on: I want to be one of the most prominent, vocal proponents of evaluation in the industry, not just at Sourcegraph; in the machine learning industry, we should do more evaluation. So there I would say that writing a good evaluation is more important than writing a good model, and writing a good evaluation is more important than writing a better context source, because you don't know what a better context source is if you don't have a way to evaluate it. For me, evaluation precedes any feature development. If you don't have a way to evaluate, you're just throwing darts in a dark room; some will land by luck. So in that world, I have to rank unit tests and evaluation ahead of just code in terms of importance.
- 00:49:11 That said, I think what's more important overall is task success. What is task success? You're not just looking at the unit test as an evaluation; you're looking at evaluation of the overall goal: hey, did I do this task well? As an orchestrator, if I start treating these agents that way (it could be Cody autocomplete, or any standalone agent, probably powered by Sourcegraph as well), then evaluation of that task matters, because you are the domain expert. Assume AGI exists today; assume the foundation models keep getting smarter, with billions, trillions of dollars eventually going into training the smartest models, and they can do everything. You are still best placed to understand your domain and what the goal is right now. So you are the only person who can develop that evaluation: how do I know that you're correct? How do I know whether you're 90% correct or 92% correct? And the marginal gain from 92 to 94 is going to be much harder to earn than getting to 90; it always gets harder; there's an exponential increase in hardness there.
- 00:50:11 So essentially the point becomes, purely on evaluation, purely on unit tests: what are the nuances of this problem, of this domain, that the model needs to get right? Are we able to articulate those, and are we able to generate the unit tests, the guardrails, and the evaluations so that I can judge whether the models are getting better on that topic? The models are going to be far more intelligent, but then what is success? You as the domain expert get to define that. And this is a great thing, not just for coding: any domain expert using machine learning or these tools across domains knows what they're using them for, and the AGI tools are just tools to help do that job. So I think the onus is on you to write good evaluations. Or maybe tomorrow it's LLM-as-a-judge: people are developing foundation models just for evaluation, so there are going to be other tools to help you there as well. Code foundation models for unit tests: maybe that's the thing six months from now.
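The LLM-as-a-judge idea can be sketched in a few lines, again as a hypothetical illustration with `call_llm` standing in for whatever model client is available: a second model call grades a generated test against the domain nuances the expert wrote down.

```python
JUDGE_PROMPT = """You are reviewing one generated unit test.
Criteria, written by the domain expert:
1. Does it cover these corner cases: {corner_cases}?
2. Does it make real assertions about behavior, not just execute code?

Test:
{test_code}

Reply with PASS or FAIL on the first line, then one sentence of justification."""

def judge_test(test_code: str, corner_cases: list[str], call_llm) -> bool:
    """Grade a generated test against expert-defined criteria via a second
    model call; the expert owns the criteria, the model does the toil."""
    verdict = call_llm(JUDGE_PROMPT.format(
        corner_cases=", ".join(corner_cases), test_code=test_code))
    return verdict.strip().upper().startswith("PASS")
```

The division of labor matches the point above: the judge automates the grading, but defining what correct means for the payment gateway or the authentication path stays with the human.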
- 00:51:06 The point then becomes: what should it focus on? That's the role you're playing: orchestrating, but orchestrating on the evaluation. Did you get that corner case right? Or: you know what, this is about the criticality of the system; if the payment gateway link or the authentication link gets screwed up, massive bad things happen, and you know that. So I think that's where the human in the loop and your input to the system start getting amazing.
- 00:51:29 Rashab, we could talk for hours on this. This has been really interesting, and I loved the deep dive a little below the surface into the ML space as well; I'm sure a lot of our audience will find it very interesting. Thank you so much, really appreciate you coming on the podcast. Thanks so much, this was a fun conversation; yeah, it could go on for hours. Hopefully the insights help. Thank you.
- 00:51:48[Applause]
- 00:51:58 Thanks for tuning in. Join us next time on the AI Native Dev, brought to you by Tessl.
- 00:52:05[Music]
- AI Testing
- Software Development
- Machine Learning
- Code Evaluation
- Large Codebases
- Sourcegraph Cody
- Developer Productivity
- AI Context
- Unit Testing
- Evaluation Metrics