Navigating AI for Testing: Insights on Context and Evaluation with Sourcegraph

00:52:09
https://www.youtube.com/watch?v=ExCpKtFgpHI

Summary

TLDR: The podcast episode delves into AI testing and its integration within software development, featuring Rishabh Mehrotra from Sourcegraph, known for its coding assistant Cody. The discussion highlights the challenges of managing large code repositories and how AI can boost developer efficiency. By leveraging machine learning, Sourcegraph aims to refine code suggestions and testing, tailoring solutions to developer workflows. The conversation also stresses the criticality of evaluation for verifying whether model improvements translate to real-world benefit, noting that offline metrics like pass@1 do not always reflect true user experience. Rishabh expands on the need for context in AI-driven testing, advocating for models that account for the intricacies of the expansive codebases found in large enterprises. He also underscores the role of AI in writing better and more efficient unit tests, which function as a safeguard against poor code entering the system. The episode further explores how AI tools balance speed and quality, particularly in latency-sensitive features where developers expect near-instant feedback. The notion of an evolving, symbiotic relationship between human programmers and AI systems is a recurring theme, suggesting a future where developers focus more on creative and complex problem-solving.
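
For readers unfamiliar with the metric, pass@1 (popularized alongside the HumanEval benchmark) is an offline score: sample candidate solutions for each benchmark problem, run the benchmark's unit tests, and report the expected fraction of problems solved on a single draw. Below is a minimal Python sketch using the commonly cited unbiased pass@k estimator; the function name and the example numbers are illustrative only, not Sourcegraph's code.

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k for one problem: n samples generated, c of them pass the tests."""
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Average over a benchmark: per problem, (samples generated, samples that passed).
    results = [(5, 2), (5, 0), (5, 5)]  # illustrative numbers, not real data
    pass_at_1 = sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
    print(round(pass_at_1, 3))  # 0.467

As the episode stresses, improving this kind of offline score does not guarantee a better experience for real users, because benchmark problems carry none of the repository context, latency constraints, or feature-specific behaviour of an actual IDE.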

Takeaways

  • πŸ€– AI testing is crucial for efficient software development.
  • πŸš€ Sourcegraph's Cody aids developers with code completion and testing.
  • 🧩 Context is vital for accurate AI-driven code evaluations.
  • πŸ•’ Latency impacts developer satisfaction and tool efficacy.
  • πŸ”„ Evaluation metrics must align with real-world applications.
  • πŸ’» Open Context broadens the scope of code insights.
  • πŸ‘¨β€πŸ’» Developers shift towards more creative roles with AI assistance.
  • βš™οΈ AI-generated unit tests act as code quality guardrails.
  • πŸ“Š Machine learning models must adapt to specific developer needs.
  • πŸ” Continuous improvement in AI tools enhances productivity.

Timeline

  • 00:00:00 - 00:05:00

    The AI Native Dev episode focuses on AI testing, discussing the context models need, the evaluation of generated code, and the timing and automation of AI tests. Rishabh Mehrotra from Sourcegraph is introduced, explaining how Sourcegraph tackles the big code problem with tools like Cody, which improve developer productivity through features like autocomplete and code suggestions.

  • 00:05:00 - 00:10:00

    Rishabh explains his experience in AI since 2009, having witnessed the evolution from traditional NLP to large language models (LLMs). He describes how AI assistants like Cody abstract complexity away to enhance developer productivity, comparing this progression to his own transition from coding in C to using advanced frameworks.

  • 00:10:00 - 00:15:00

    The discussion shifts to Spotify's and Netflix's recommendation systems, which use many different machine learning models to enhance the user experience. Rishabh draws a parallel to Cody's multifaceted approach, which includes features beyond code suggestions such as chat, code edits, and unit test generation, each requiring different evaluations, models, and latencies.

  • 00:15:00 - 00:20:00

    Different features in coding assistants demand different latency and quality levels. For instance, autocomplete needs low latency, whereas code edit and chat can tolerate more delay. Rishabh explains the trade-offs between latency and model size, highlighting the benefits of fine-tuning models for specific tasks like Rust autocompletion (a minimal model-routing sketch follows this timeline).

  • 00:20:00 - 00:25:00

    The conversation delves into efficient code and unit test generation, balancing user trust in automated systems against reducing developers' cognitive load. As code suggestions grow more complex, trust becomes harder to earn, stressing the importance of effective evaluation systems and guardrails to prevent errors from being introduced through automation.

  • 00:25:00 - 00:30:00

    Evaluation is emphasized as crucial for the development and successful adoption of AI-driven tools. Rishabh highlights the importance of evaluation metrics that mirror real-world usage rather than standard benchmarks, ensuring that improvements to AI features truly enhance the user experience.

  • 00:30:00 - 00:35:00

    Rishabh describes the challenges of heterogeneity in coding tasks across different industries, identifying where pre-trained models already excel and where fine-tuning can provide significant benefits by focusing on underserved languages or the complex tasks often found in enterprise environments.

  • 00:35:00 - 00:40:00

    The discussion highlights the adversarial nature of good unit tests, which act as guardrails against bad code. Effective testing prevents the introduction of errors, especially in automated settings where the volume of AI-generated code increases. The conversation underscores the need for unit testing to evolve alongside deeper AI integration.

  • 00:40:00 - 00:45:00

    Different testing levels (e.g., unit, integration) and their role in ensuring code quality are explored. Rishabh stresses the need for automated systems that continuously improve through feedback loops (a minimal acceptance-rate sketch appears after the transcript), suggesting an integrated process in which human oversight complements machine-driven testing.

  • 00:45:00 - 00:52:09

    Rishabh emphasizes the orchestration of AI tools, with developers acting as conductors who choose where to spend effort among code optimization, testing, and other tasks. He stresses the importance of human understanding and evaluation in deploying AI solutions effectively, while embracing automation to reduce toil and improve productivity.
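
To make the latency-versus-quality trade-off from the 00:15:00 segment concrete, here is a minimal, hypothetical sketch of routing each feature to the best model that still fits its latency budget. The budgets, tier names, and scores are illustrative assumptions, not Cody's actual configuration.

    from dataclasses import dataclass

    # Rough per-feature latency budgets, loosely following the numbers mentioned
    # in the episode (autocomplete ~400-500 ms, edits a few seconds, tests longer).
    LATENCY_BUDGET_MS = {
        "autocomplete": 500,
        "code_edit": 4000,
        "chat": 10000,
        "unit_test_generation": 30000,
    }

    @dataclass
    class ModelTier:
        name: str
        typical_latency_ms: int
        quality_score: float  # higher is better on your own offline evaluation

    # Illustrative tiers only; names and numbers are made up.
    TIERS = [
        ModelTier("small-finetuned", 300, 0.72),
        ModelTier("medium-general", 2500, 0.80),
        ModelTier("large-general", 8000, 0.88),
    ]

    def pick_model(feature: str) -> ModelTier:
        """Pick the highest-quality tier that still fits the feature's latency budget."""
        budget = LATENCY_BUDGET_MS[feature]
        candidates = [t for t in TIERS if t.typical_latency_ms <= budget]
        return max(candidates, key=lambda t: t.quality_score)

    print(pick_model("autocomplete").name)          # small-finetuned
    print(pick_model("unit_test_generation").name)  # large-general

The Rust example in the episode makes the same point: inside a tight budget like autocomplete's, a small fine-tuned model can be the best available option even when far larger models exist.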

Video Q&A

  • Who is the guest on the episode?

    Rishabh Mehrotra from Sourcegraph is the guest on the episode.

  • What major concept is discussed in the episode?

    The episode discusses AI testing in software development.

  • How does Sourcegraph help developers?

    Sourcegraph offers tools like Cody, a coding assistant that helps improve developer productivity by integrating into IDEs.

  • What is the role of machine learning in Sourcegraph's tools?

    Machine learning in Sourcegraph's tools is used to enhance features such as code completion and testing, assisting developers by understanding and adapting to their workflows.

  • Why is evaluation critical in machine learning?

    Evaluation is critical because it helps determine if improvements in machine learning models genuinely enhance user experience, especially when offline metrics don't always correlate with real-world usage.

  • What is the significance of latency in AI tools?

    Latency is significant as it affects user satisfaction; developers expect fast responses for tasks like code completion, while they're more tolerant of longer wait times for complex tasks like code fixes.

  • How does Sourcegraph's Cody vary in its features?

    Cody offers a range of features from code auto-completion to unit test generation, each with different latency and quality requirements.

  • What does Rishabh consider a 'nightmare' issue?

    A 'nightmare' issue is getting accurate unit test generation and evaluation in real-world, large-scale codebases.

  • What is 'open context' in Sourcegraph?

    'Open Context' is a protocol designed to integrate additional context sources to provide better recommendations and code insights.

  • Why are unit tests important for AI-generated code?

    Unit tests serve as guardrails that keep bad code out of the codebase, which matters more as the volume of AI-generated code grows (a minimal illustration follows this Q&A list).
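
As a concrete, entirely hypothetical illustration of the guardrail idea: a small "critical" function plus the kind of corner-case tests the episode argues deserve the most scrutiny, written in Python with pytest. Nothing here comes from Sourcegraph or Cody; it only shows how such tests block bad changes when run in CI.

    import pytest

    # Hypothetical "critical" function of the kind the episode says matters most:
    # a mistake here costs real money.
    def apply_discount(amount_cents: int, percent: float) -> int:
        """Return the discounted amount in cents, never negative."""
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        return max(round(amount_cents * (1 - percent / 100)), 0)

    # Corner-case coverage acting as a guardrail: if a later change (human- or
    # AI-written) breaks an invariant, the suite fails and the merge is blocked.
    def test_no_discount_returns_original_amount():
        assert apply_discount(1000, 0) == 1000

    def test_full_discount_is_free_but_never_negative():
        assert apply_discount(1000, 100) == 0

    def test_fractional_percent_rounds_to_whole_cents():
        assert apply_discount(999, 33.5) == 664  # 999 * 0.665 = 664.335

    def test_out_of_range_percent_is_rejected():
        with pytest.raises(ValueError):
            apply_discount(1000, 150)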

Subtitles
  • 00:00:00
    [Music]
  • 00:00:01
    you're listening to the AI native Dev
  • 00:00:03
    brought to you by
  • 00:00:14
    Tessl on today's episode we're going to
  • 00:00:16
    be talking about all things AI testing
  • 00:00:19
    so we're going to be dipping into things
  • 00:00:22
    like uh the context that models are
  • 00:00:24
    going to need to be able to create good
  • 00:00:26
    tests uh things like evaluation what
  • 00:00:28
    does it mean for generated code models uh
  • 00:00:30
    and then look into deeper into AI tests
  • 00:00:33
    about when they should be done should
  • 00:00:34
    they be done early should be should they
  • 00:00:36
    be done late and more automated joining
  • 00:00:38
    me today Rishabh Mehrotra from Sourcegraph
  • 00:00:42
    those of you may know Sourcegraph
  • 00:00:43
    better as the tool Cody welcome to the
  • 00:00:45
    session tell us a little bit about
  • 00:00:46
    yourself first of all thanks am this is
  • 00:00:48
    really interesting topic for me and so I
  • 00:00:50
    work with Sourcegraph we do we've done
  • 00:00:52
    like coding code search for for the last
  • 00:00:54
    quite a few years and uh I mean we
  • 00:00:56
    tackled the big code problem like if you
  • 00:00:58
    look at Enterprises Enterprises have a
  • 00:01:00
    lot of developers which is great for
  • 00:01:01
    them they also have like massive code
  • 00:01:03
    bases look at a bank in the US I mean
  • 00:01:05
    they would have like 20 30,000
  • 00:01:06
    developers they would have like more
  • 00:01:07
    than that like 40,000 repositories right
  • 00:01:10
    so this big code problem is a huge pain
  • 00:01:12
    right so what what we've done as a
  • 00:01:14
    company is like we've spent quite a few
  • 00:01:15
    years in tackling code search and then
  • 00:01:17
    last year we launched Codi which is a
  • 00:01:20
    coding assistant and essentially Codi
  • 00:01:21
    lives in your IDE it tries to make
  • 00:01:23
    you a better more productive developer
  • 00:01:25
    and there's a bunch of different
  • 00:01:26
    features and my role right now at source
  • 00:01:29
    graph is to lead the AI effort so I've
  • 00:01:31
    done a lot of like machine learning on
  • 00:01:32
    the consumer side like looking at
  • 00:01:33
    Spotify recommendations music short
  • 00:01:35
    video recommendations my PhD was on
  • 00:01:37
    search which again ties well because in
  • 00:01:39
    llms you need context and we're going to
  • 00:01:41
    talk about yeah a lot of that as well
  • 00:01:43
    how many years would you say you've been
  • 00:01:44
    in
  • 00:01:45
    AI yeah so I think the first research
  • 00:01:47
    paper I wrote was in 2009 and almost 15
  • 00:01:50
    years ago and this was like NLP again
  • 00:01:52
    this is not like deep learning NLP this
  • 00:01:54
    is more like traditional NLP and I I've
  • 00:01:57
    been in the world wherein like the
  • 00:01:59
    domain experts handcraft these features
  • 00:02:01
    and then embeddings came in and like
  • 00:02:02
    they washed all of it away and then
  • 00:02:04
    these like neural models came in they
  • 00:02:05
    washed like custom topic models away and
  • 00:02:07
    then these llms came in and they washed
  • 00:02:09
    a bunch of these rankers away so I think
  • 00:02:11
    like I've seen some of these like waves
  • 00:02:12
    of if you create like specific models
  • 00:02:14
    which are very handcrafted maybe a
  • 00:02:16
    simpler large scale like generalizable
  • 00:02:19
    model will sweep some of these models
  • 00:02:21
    I've seen a few of these waves in lastas
  • 00:02:23
    and way before it was cool yeah exactly
  • 00:02:24
    I think like I keep saying to a lot of
  • 00:02:25
    like early PhD students like in 2010s a
  • 00:02:28
    lot of the PhD students would get their
  • 00:02:29
    doctorate by just creating latent
  • 00:02:31
    variable models and doing some Gibbs
  • 00:02:33
    sampling inference nobody even knows
  • 00:02:35
    that 15 years ago this is what would get
  • 00:02:37
    you a PhD in machine learning right so again
  • 00:02:39
    things have really moved on you can say
  • 00:02:40
    to that you can say you weren't there
  • 00:02:41
    you weren't there in the early days
  • 00:02:43
    there when you wrote Gibbs sampler code and
  • 00:02:45
    see we even never hadow P like a bunch
  • 00:02:48
    of these I I think like this is I think
  • 00:02:50
    like we've seen this is Codi and a lot
  • 00:02:52
    of these other gen coding tools they're
  • 00:02:54
    trying to make us as developers work at
  • 00:02:56
    higher and higher abstractions right I
  • 00:02:58
    started my research or ml career not
  • 00:03:01
    writing like py code right I wrote again
  • 00:03:04
    like Gibbs sampler in C right and then I'm
  • 00:03:06
    not touching C until it's like really
  • 00:03:08
    needed now so a lot of these like
  • 00:03:10
    Frameworks have come in which have
  • 00:03:11
    abstracted the complexities away and
  • 00:03:13
    made my life easier and again we're
  • 00:03:16
    seeing again I've seen it as an IC over
  • 00:03:18
    the last 10 15 years but also that's
  • 00:03:20
    also happening in the industry right you
  • 00:03:22
    don't have to be bothered by lowlevel
  • 00:03:23
    Primitives and you can maybe tackle
  • 00:03:24
    higher level things and that's what
  • 00:03:26
    exactly Cody is trying to do that how do
  • 00:03:28
    I remove toil from your developers life
  • 00:03:31
    and then make them focus on the
  • 00:03:32
    interesting pieces and the creative
  • 00:03:33
    pieces and like really get the
  • 00:03:34
    architecture right yeah so let's let's
  • 00:03:36
    talk a little bit about some of the
  • 00:03:38
    features that you mentioned because
  • 00:03:39
    Cody's obviously not it's when we think
  • 00:03:41
    about a tool like Cody it's not just
  • 00:03:45
    code suggestion there's a ton of
  • 00:03:47
    different things and you mentioned some
  • 00:03:48
    of them we'll be talking a little bit
  • 00:03:49
    more probably about testing today but
  • 00:03:51
    when we think about ML and the use of ml
  • 00:03:54
    here how does it differ from the various
  • 00:03:57
    features that a tool like
  • 00:04:00
    does it
  • 00:04:03
    when yeah that's an excellent point let
  • 00:04:05
    me take a step back right let's not even
  • 00:04:06
    talk about coding assist let's go back
  • 00:04:08
    to recommendations and let's look at
  • 00:04:10
    Spotify or Netflix right people think
  • 00:04:12
    that hey if you're doing Spotify is
  • 00:04:13
    famous for oh it knows what I want it it
  • 00:04:16
    suggests nostalgic music for me so people
  • 00:04:18
    on the whole love the recommendations
  • 00:04:20
    from Spotify I'm not saying this just
  • 00:04:22
    because I was an employee at Spotify but
  • 00:04:23
    also as a user right I like Spotify
  • 00:04:25
    recommendations I have to say Netflix is
  • 00:04:27
    there for me like I've never binge
  • 00:04:29
    listened to anything on
  • 00:04:31
    Spotify but Netflix has kept me up to
  • 00:04:33
    the early hours exact questioning my
  • 00:04:35
    life yeah I think there a series right
  • 00:04:37
    like Netflix for videos you look at
  • 00:04:39
    Spotify for music you look at Tik Tok
  • 00:04:41
    Instagram Reels for short videos so again
  • 00:04:43
    the mediums are like it used to be hours
  • 00:04:45
    of content to like minutes of music to
  • 00:04:47
    like seconds of short video but then the
  • 00:04:49
    point is we love these products but this
  • 00:04:52
    is not just one ranker Netflix is not
  • 00:04:54
    just one ranker right it's a bunch of
  • 00:04:55
    different features a bunch of surfaces
  • 00:04:58
    and user touch points and each touch
  • 00:05:00
    point is powered by a different machine
  • 00:05:01
    learning model different thinking
  • 00:05:02
    different evaluation right so my point is
  • 00:05:04
    we we this is how the industry has
  • 00:05:06
    evolved over the last 10 15 years right
  • 00:05:08
    most of these user Centric applications
  • 00:05:09
    which people love and like hundreds and
  • 00:05:11
    whatever like 400 million users are
  • 00:05:12
    using it monthly they are a mix of
  • 00:05:15
    different features and each feature
  • 00:05:17
    needs a design of what's the ml model
  • 00:05:19
    what's the right tradeoffs what are the
  • 00:05:20
    right metrics what's the right science
  • 00:05:22
    behind it what's the right evaluation
  • 00:05:23
    behind it so we have seen this and now
  • 00:05:25
    I'm seeing exactly the same at Codi
  • 00:05:27
    right if you look at COD so Codi lives
  • 00:05:29
    in your IDE so if you're a developer if
  • 00:05:30
    you're using VS Code or JetBrains again
  • 00:05:32
    if you're writing code then we can code
  • 00:05:34
    complete right so autocomplete is an
  • 00:05:36
    important feature that's like very high
  • 00:05:37
    volume right why because it's not that
  • 00:05:39
    again right a lot of users use it but
  • 00:05:41
    also when you're writing code in a file
  • 00:05:43
    you will trigger autocomplete like maybe
  • 00:05:44
    hundreds of times because you're writing
  • 00:05:46
    like one word you're writing like some
  • 00:05:47
    syntax in line It'll like trigger and
  • 00:05:49
    then if you like what like a ghost text
  • 00:05:52
    which is like recommended you can just
  • 00:05:54
    Tab and like it selects right yeah so
  • 00:05:56
    then it helps you write code in a faster
  • 00:05:58
    way now this is I would say that this is
  • 00:06:00
    something on the like extreme of hey I
  • 00:06:03
    have to be like very latency sensitive
  • 00:06:04
    and like really be fast right because
  • 00:06:06
    when you're on Google you're typing a
  • 00:06:07
    query Google will do like auto suggestions
  • 00:06:09
    right itself can I complete your query
  • 00:06:11
    so we we've been like using some of
  • 00:06:13
    these features for the last decade as
  • 00:06:14
    users on the consumer side and this is
  • 00:06:16
    really important as well because this
  • 00:06:18
    interaction that you're just talking
  • 00:06:19
    about here it's a very emotive one you
  • 00:06:21
    will you will piss people off if you do
  • 00:06:23
    not if you do not provide them cuz this
  • 00:06:25
    is supposed to be an efficiency play if
  • 00:06:27
    you don't provide them with something
  • 00:06:28
    that can actually improve their
  • 00:06:30
    workflow and they can just Tab and say
  • 00:06:32
    yeah this is so cool I'm tabing tabing
  • 00:06:33
    tabing accepting then they're going to
  • 00:06:35
    they're going to get annoyed to the
  • 00:06:37
    extent that they will almost reject that
  • 00:06:39
    kind exactly this is where quality is
  • 00:06:41
    important but latency is also important
  • 00:06:42
    right so now we see that there's a
  • 00:06:43
    trade-off yeah that hey yes I want
  • 00:06:45
    quality but then if you're doing it like
  • 00:06:46
    400 milliseconds later hey I'm already
  • 00:06:49
    annoyed why because the perception of
  • 00:06:51
    the user on when they're completing right
  • 00:06:53
    they don't want to wait if I have to
  • 00:06:55
    wait then i' rather go to code edit or
  • 00:06:56
    like chat this is that's a great point
  • 00:06:59
    is is latency does latency differ
  • 00:07:01
    between the types of things that Cody
  • 00:07:03
    offers or cuz I I would guess if if a
  • 00:07:05
    developer is going to wait for something
  • 00:07:06
    they might they'll be likely to wait a
  • 00:07:08
    similar amount of time across all but
  • 00:07:10
    different yeah not really the expect let
  • 00:07:12
    me go back to the features right Cody has
  • 00:07:14
    auto complete which helps you complete
  • 00:07:15
    code when you're typing out we have code
  • 00:07:17
    edit code fix here there's a bug you can
  • 00:07:19
    write a command or select something and
  • 00:07:20
    say hey Cod fix it yeah now when you're
  • 00:07:22
    fixing then people are okay to wait for
  • 00:07:24
    a second or two as it figures out and
  • 00:07:25
    then they're going to show a diff and oh
  • 00:07:27
    I like this change and I'm going to
  • 00:07:28
    accept it now this is going to span
  • 00:07:29
    maybe 3,000 milliseconds right 3 to 4
  • 00:07:31
    seconds yeah versus auto complete no I
  • 00:07:33
    want everything tap complete selected
  • 00:07:35
    within 400 500 millisecond latency again
  • 00:07:37
    just there the difference start popping
  • 00:07:39
    up now we've talked about autocomplete
  • 00:07:40
    code edit code fix then we could go to
  • 00:07:42
    chat as well right chat is okay I'm
  • 00:07:44
    typing in a query it'll take me like a few
  • 00:07:46
    seconds to type in the right query and
  • 00:07:47
    select the right code and do this right
  • 00:07:49
    so the expectation of 400 milliseconds
  • 00:07:51
    is not really the case in chat because
  • 00:07:53
    I'm asking maybe a more complex query I
  • 00:07:55
    want you to take your time and give the
  • 00:07:57
    answer versus like unit test generation
  • 00:07:59
    for example right unit test write the
  • 00:08:00
    entire code and make sure that you cover
  • 00:08:03
    the right corner cases unit test is like
  • 00:08:04
    great coverage and like you're not just
  • 00:08:06
    missing important stuff you making sure
  • 00:08:08
    that the unit test is actually quite good now
  • 00:08:11
    there I don't want you to complete in 4
  • 00:08:13
    hundred milliseconds take your time write
  • 00:08:15
    good code I'm waiting I'm willing to
  • 00:08:16
    wait a long time yeah let's take a step
  • 00:08:18
    back what what are we looking at we
  • 00:08:20
    looking at a few different features now
  • 00:08:21
    similar right Netflix Spotify is not
  • 00:08:23
    just one recommendation model you go do
  • 00:08:25
    search you go do podcast you go do hey I
  • 00:08:27
    want this category content a bunch of
  • 00:08:28
    these right so similarly here in coding
  • 00:08:31
    assistant for COD you have auto complete
  • 00:08:32
    code edit code fix unit test generation
  • 00:08:35
    you have a bunch of these commands you
  • 00:08:36
    have chat chat is an entire nightmare I
  • 00:08:38
    can talk about hours on like chat is
  • 00:08:40
    like this one box Vision which people
  • 00:08:42
    can come with like hundreds of intents
  • 00:08:44
    yeah that's e in 10 it's a nightmare for
  • 00:08:47
    me as an engineer to say that are we
  • 00:08:48
    doing well on that because in auto
  • 00:08:50
    complete I can develop metrics around it
  • 00:08:52
    I can think about okay this is a unified
  • 00:08:54
    this is a specific feature yeah chat may
  • 00:08:56
    be like masking hundreds of these
  • 00:08:58
    features just by natural language so we
  • 00:09:00
    can talk a little more about chat as
  • 00:09:01
    well over a period of time but coming
  • 00:09:04
    back to the original point you mentioned
  • 00:09:05
    for auto complete people will be latency
  • 00:09:07
    sensitive yeah for unit test generation maybe
  • 00:09:09
    less for chat maybe even less what that
  • 00:09:11
    means is the design choices which I have
  • 00:09:14
    as an ML engineer are different in auto
  • 00:09:16
    complete I'm not going to look at
  • 00:09:17
    like 400 billion parameter BS right I
  • 00:09:19
    ownn something which is f right so if
  • 00:09:21
    you look at the x-axis latency Y axis
  • 00:09:23
    quality look I don't want to go the top
  • 00:09:26
    right top right is high latency and high
  • 00:09:28
    quality I don't want latency I want to
  • 00:09:30
    be in the like whatever 400 500 end to end
  • 00:09:33
    millisecond latency space so there small
  • 00:09:35
    models kick in right and small models we
  • 00:09:37
    can fine-tune for great effect right we
  • 00:09:39
    we've done some work we' just published
  • 00:09:40
    a blog post a couple of weeks ago on
  • 00:09:42
    hey if you fine-tune for Rust Rust is
  • 00:09:44
    like a lot more has a lot of nuances
  • 00:09:46
    which most of these large language
  • 00:09:47
    models are not able to capture so we can
  • 00:09:49
    fine-tune a model for Rust and do really
  • 00:09:51
    well on auto completion within the
  • 00:09:53
    latency requirements which we have for
  • 00:09:54
    this feature yeah so then these
  • 00:09:56
    trade-offs start emerging essentially
  • 00:09:58
    how does that change if you if the
  • 00:10:00
    output that Cody say was going to provide
  • 00:10:02
    the developer would actually be on a
  • 00:10:04
    larger scale so we talked when we're
  • 00:10:05
    talking about autocomplete we're really
  • 00:10:06
    talking about a one liner complete but
  • 00:10:08
    what if we was to say I want you to
  • 00:10:10
    write this method or I want you to write
  • 00:10:11
    this module obviously then you don't
  • 00:10:13
    want that you don't want the developer
  • 00:10:15
    to accept an autocomplete or look at an
  • 00:10:18
    autocomplete module or or function and
  • 00:10:21
    think this is absolute nonsense it's
  • 00:10:22
    giving me nonsense quickly presumably
  • 00:10:24
    then they're willing to they're willing
  • 00:10:26
    to wait that much longer yeah I think
  • 00:10:27
    there's that's a really good point right
  • 00:10:29
    that people use Codi and not just Codi
  • 00:10:31
    and not just encoding domain right we
  • 00:10:33
    use co-pilots across different I mean
  • 00:10:35
    sales co-pilot marketing co-pilot
  • 00:10:36
    Finance RIS co- pilot people are using
  • 00:10:38
    these agents or assistants in for
  • 00:10:41
    various different tasks right and some
  • 00:10:44
    of these tasks are like complex and more
  • 00:10:46
    sophisticated some of these tasks are
  • 00:10:47
    like simpler right yeah so let me let me
  • 00:10:49
    just paint this like picture of how I
  • 00:10:51
    view this right when you pick up a topic
  • 00:10:53
    to learn right be it programming you
  • 00:10:54
    don't start with like multi-threading
  • 00:10:55
    you start with okay do I know the syntax
  • 00:10:57
    can I instantiate variables can I do
  • 00:10:59
    if-else can I do a for loop and can I do
  • 00:11:00
    switch and then multithread and then
  • 00:11:02
    parallelism so when we as humans learn
  • 00:11:04
    there is a curriculum we learn on right
  • 00:11:06
    we don't directly go to chapter 11 we
  • 00:11:07
    start with chapter 1 similarly I think
  • 00:11:09
    like the lens which which I view some of
  • 00:11:12
    these agents and tools are it's okay you
  • 00:11:14
    not at like chapter 12 yet but I know
  • 00:11:17
    that there are simpler tasks you can do
  • 00:11:18
    and then there are like medium tasks you
  • 00:11:19
    can do and then there are like complex
  • 00:11:20
    tasks you can do now this is a lens
  • 00:11:23
    which is which I found to be like pretty
  • 00:11:25
    useful because when you say that hey for
  • 00:11:26
    aut complete for example I don't want
  • 00:11:28
    again my use case probably for auto
  • 00:11:30
    complete is not okay tackle a chapter 12
  • 00:11:32
    complexity problem no for that I'll
  • 00:11:34
    probably have an agentic setup yeah so
  • 00:11:36
    this curriculum is a great way to look
  • 00:11:37
    at it the other great way to look at
  • 00:11:38
    things are like let's just call it like
  • 00:11:40
    left versus right so on the left we have
  • 00:11:41
    these tools which are living in your IDE
  • 00:11:43
    and they're helping you write better
  • 00:11:44
    code and like complete code and like
  • 00:11:46
    really you are the main lead you're
  • 00:11:47
    driving the car you're just getting some
  • 00:11:48
    assistance along the way right versus
  • 00:11:51
    things on the right are like agentic
  • 00:11:53
    right that hey I'm going to type in here
  • 00:11:54
    is my GitHub issue send me create a
  • 00:11:56
    bootstrap me a PR for this right there I
  • 00:11:59
    want the machine learning models to take
  • 00:12:00
    control not full autonomy Quinn our CEO
  • 00:12:03
    has an amazing blog post yeah on levels
  • 00:12:05
    of code right you start level zero to
  • 00:12:07
    level seven and some of these human
  • 00:12:09
    initiated AI initiated AI led and that
  • 00:12:11
    gives us a spectrum to look at autonomy
  • 00:12:13
    from a coding assistant perspective
  • 00:12:15
    which is great I think everybody should
  • 00:12:16
    look at it but coming back to the
  • 00:12:18
    question auto complete is probably for
  • 00:12:21
    I'm still the lead driver here help me
  • 00:12:24
    but in some of the other cases I'm stuck
  • 00:12:26
    take your time but then tackle more
  • 00:12:28
    complex task yeah now the context and
  • 00:12:30
    the model size the the latencies all of
  • 00:12:32
    these start differing here right when
  • 00:12:34
    you're writing this code you probably
  • 00:12:35
    need for autocomplete local context
  • 00:12:37
    right or like maybe if you're
  • 00:12:38
    referencing a code from some of
  • 00:12:40
    repository bring that as a dependency
  • 00:12:42
    code and then use it in context if
  • 00:12:44
    you're looking at a new file generation
  • 00:12:46
    or like a new function generation that's
  • 00:12:48
    okay you got to look at the entire
  • 00:12:49
    repository and not just make one changes
  • 00:12:51
    over here you have to make an entire
  • 00:12:53
    file and make changes across three other
  • 00:12:55
    files right M so even where is the
  • 00:12:57
    impact the impact in autocomplete is like
  • 00:12:59
    local in this file in this region right
  • 00:13:02
    and then if you look at the full
  • 00:13:03
    autonomy case or like agentic setups
  • 00:13:05
    then the impact is okay I'm going to
  • 00:13:06
    make five changes across three files
  • 00:13:08
    into two repositories yeah right so
  • 00:13:10
    that's that's the granularity at which
  • 00:13:11
    some of these things are starting to
  • 00:13:13
    operate essentially yeah and testing is
  • 00:13:14
    going to be very similar as well
  • 00:13:15
    right if someone is if someone's writing
  • 00:13:18
    code in the in their IDE line by line and
  • 00:13:20
    that's maybe using a code generation as
  • 00:13:22
    well like Cody they're going to likely
  • 00:13:25
    want to be able to have tests stay
  • 00:13:27
    in sync so as I write code you're
  • 00:13:29
    automatically generating tests that are
  • 00:13:32
    effectively providing me with that
  • 00:13:33
    assurance that that the automatically
  • 00:13:35
    generated code is working as I want to
  • 00:13:38
    yeah that's a great point I think this
  • 00:13:39
    is more like errors multiply yeah if I'm
  • 00:13:41
    evaluating something after long writing
  • 00:13:44
    it then it's worse off right because the
  • 00:13:47
    errors I could have stopped the errors
  • 00:13:48
    earlier on and then debugged it and
  • 00:13:51
    fixed it locally and then moved on so
  • 00:13:53
    especially so taking a step back look I
  • 00:13:55
    love evaluation I really in machine
  • 00:13:57
    learning I I started my PhD thinking
  • 00:13:59
    that hey math and like fancy graphical
  • 00:14:02
    models are the way to have impact using
  • 00:14:03
    machine learning right and you spend one
  • 00:14:06
    year in the industry realized nah it's
  • 00:14:07
    not about the fancy model it's about do
  • 00:14:09
    you have an evaluation you have these
  • 00:14:11
    metrics do you know when something is
  • 00:14:12
    working better yeah so I think getting
  • 00:14:14
    the zero to one on evaluation on these
  • 00:14:16
    data sets that is really key for any
  • 00:14:19
    machine learning problem y now
  • 00:14:21
    especially when what you mean by the
  • 00:14:22
    zero to one yeah 0 to1 is look at like
  • 00:14:24
    whenever a new language model gets
  • 00:14:26
    launched right people are saying that
  • 00:14:27
    hey for coding Llama 3 does well on
  • 00:14:30
    coding why because oh we have this human
  • 00:14:32
    eval data set and a pass@1 metric
  • 00:14:34
    let's unpack that the HumanEval data set is
  • 00:14:36
    a data set of 164 questions hey write me
  • 00:14:39
    a binary search in this code right so
  • 00:14:41
    essentially it's like you get a text and
  • 00:14:43
    you write a function and then you're
  • 00:14:45
    like hey does this function run
  • 00:14:46
    correctly so they have a unit test for
  • 00:14:47
    that and if it passes then you get plus
  • 00:14:49
    one right yeah so now this is great it's
  • 00:14:51
    a great start but is it really how
  • 00:14:54
    people are using Cody and a bunch of
  • 00:14:56
    other coding tools no they're like if
  • 00:14:57
    I'm an Enterprise developer if if let's
  • 00:14:59
    say I'm in a big bank then I have 20,000
  • 00:15:01
    other peers and there are like 30,000
  • 00:15:03
    repositories but I'm not writing binary
  • 00:15:05
    search independent of everything else
  • 00:15:07
    right I'm working in a massive code base
  • 00:15:09
    which has been edited across the last 10
  • 00:15:11
    years and there's some dependency by
  • 00:15:13
    some team in Beijing and there's a
  • 00:15:14
    function which I haven't even read right
  • 00:15:16
    and maybe it's in a language I don't
  • 00:15:18
    even care about or understand so my
  • 00:15:20
    point is the evaluation which we need
  • 00:15:23
    for these Real World products is
  • 00:15:24
    different than the benchmarks which we
  • 00:15:26
    have in the industry right now the one
  • 00:15:29
    for evaluation is that hey sure let's
  • 00:15:31
    use pass@1 on HumanEval at the start on
  • 00:15:33
    Day Zero but then we see that you
  • 00:15:35
    improve it by 10% we have results when
  • 00:15:38
    we actually did improve pass@1 by
  • 00:15:40
    10 15% we tried it online on Cody users
  • 00:15:43
    and the metrics dropped yeah and we
  • 00:15:45
    writing a blog post about it on offline
  • 00:15:46
    online correlation yeah because if you
  • 00:15:48
    trust your offline metric pass@1
  • 00:15:50
    you improve it you hope that hey amazing
  • 00:15:51
    users are going to love it yeah it
  • 00:15:53
    wasn't true the context is so different
  • 00:15:55
    yeah the context are so different now
  • 00:15:57
    this is this means that I got to develop
  • 00:15:59
    an evaluation for my feature and I got
  • 00:16:02
    my evaluation should represent how my
  • 00:16:03
    actual users using this feature feel
  • 00:16:05
    about it m just because it's better on a
  • 00:16:07
    metric which is an industry Benchmark
  • 00:16:10
    doesn't mean that improving it will
  • 00:16:11
    improve actual user experience and can
  • 00:16:12
    that change from user to user as well so
  • 00:16:14
    you mention a bank there if five other
  • 00:16:16
    Banks is it going to be the same for
  • 00:16:18
    them if something not in the fin Tech
  • 00:16:20
    space is it going to be different for
  • 00:16:21
    them that's a great point I think the
  • 00:16:22
    Nuance you're trying to say is that hey
  • 00:16:24
    one are you even feature aware in your
  • 00:16:27
    evaluation because pass@1 is not
  • 00:16:28
    feature-aware right yeah pass@1
  • 00:16:30
    doesn't care about auto complete or
  • 00:16:31
    unit test generation or code fixing I
  • 00:16:33
    don't care what the end use case or
  • 00:16:34
    application is this is just evaluation
  • 00:16:36
    so I think the first jump is have an
  • 00:16:38
    evaluation data set which is about your
  • 00:16:41
    feature right the evaluation data set
  • 00:16:42
    for unit test generation is going to be
  • 00:16:44
    different that code completion it's
  • 00:16:45
    going to be different than code edits
  • 00:16:46
    it's going to be different than chat so
  • 00:16:48
    I think the 0 to one we talking about 5
  • 00:16:49
    minutes earlier you got to do 0 to ones
  • 00:16:51
    for each of these features yeah and
  • 00:16:53
    that's not easy because evaluation
  • 00:16:55
    doesn't come naturally yeah and once you
  • 00:16:56
    have it then the question becomes that
  • 00:16:58
    hey okay once I have it for my feature
  • 00:17:00
    then hey can I reuse it across
  • 00:17:01
    Industries can I reuse it across like
  • 00:17:04
    users and I think we've seen it I've
  • 00:17:05
    seen it in the rec traditional
  • 00:17:07
    recommendation space let's say most of
  • 00:17:09
    these apps again if they got like seed
  • 00:17:11
    funding last year or maybe series a
  • 00:17:13
    there at what like 10,000 daily active
  • 00:17:14
    users 5,000 daily active users today one
  • 00:17:16
    year from now they're going to be 100K
  • 00:17:18
    500k daily active users right now how
  • 00:17:21
    representative is your subset of users
  • 00:17:22
    today right the 5,000 users today are
  • 00:17:24
    probably early adopters and if anything
  • 00:17:27
    is scaling companies in the last 10
  • 00:17:28
    years what it has told us is the early
  • 00:17:30
    adopters are probably not the
  • 00:17:31
    representative set of users you'll have
  • 00:17:33
    once you have a mature adoption yeah
  • 00:17:35
    what that means is the the metrics which
  • 00:17:36
    are develop and the learnings which I've
  • 00:17:38
    had from the initial AB test may not
  • 00:17:41
    hold one year down the line six months
  • 00:17:42
    down the line as in when the users start
  • 00:17:44
    increasing right yeah now how does it
  • 00:17:46
    link to the point you asked look there
  • 00:17:48
    are heterogeneities across different
  • 00:17:49
    domains different Industries luckily
  • 00:17:51
    there are like homogeneities across
  • 00:17:52
    language right if you're a front-end
  • 00:17:53
    developer a versus B versus C companies
  • 00:17:56
    a lot of the tasks you're trying to do
  • 00:17:57
    are like similar and a lot of the task
  • 00:17:59
    which the pre-training data set has seen
  • 00:18:02
    is also similar because rarely again
  • 00:18:04
    there are cases where you're doing
  • 00:18:05
    something really novel but a lot of the
  • 00:18:08
    junior development workflow probably is
  • 00:18:10
    more like things which like hundreds and
  • 00:18:11
    thousands of Engineers have done before
  • 00:18:13
    so the pre-trained models have seen this
  • 00:18:14
    before right so when we fine-tune the model
  • 00:18:16
    for us that's not where we saw
  • 00:18:18
    advantages because yeah you've seen it
  • 00:18:19
    before you're going to do it well MH
  • 00:18:21
    coming back to the point I mentioned
  • 00:18:22
    earlier it's going to be a curriculum
  • 00:18:23
    right you can do simple things well you
  • 00:18:25
    can do harder things in Python well you
  • 00:18:27
    can't do harder things in Rust well you
  • 00:18:29
    can't do harder things in MATLAB well so
  • 00:18:31
    my goal of fine-tuning some of these
  • 00:18:32
    models is that hey I'm going to show you
  • 00:18:34
    examples of these hard task not because
  • 00:18:37
    I just want to play with it but because
  • 00:18:39
    some of our adopters right some we have
  • 00:18:41
    a lot of Enterprise customers using us
  • 00:18:43
    right and paying us for that right I get
  • 00:18:44
    my salary because of that essentially so
  • 00:18:46
    essentially I want those developers to
  • 00:18:47
    be productive and they're trying to
  • 00:18:49
    tackle some complex tasks in Rust which
  • 00:18:51
    maybe we haven't paid attention when we
  • 00:18:53
    were training this llama model or like
  • 00:18:54
    this Anthropic model so then my goal is
  • 00:18:56
    how do I extract those examples and then
  • 00:18:59
    bring it to my training loop
  • 00:19:01
    essentially and that's where right now
  • 00:19:03
    if let's say one industry is struggling
  • 00:19:06
    we know how the metrics are performing
  • 00:19:08
    right that's what evaluation is so
  • 00:19:09
    important we know where we suck at right
  • 00:19:12
    now and then we can start collecting
  • 00:19:13
    public examples and start focusing the
  • 00:19:16
    models to do well on those right yeah
  • 00:19:18
    again let me bring the exact point I
  • 00:19:19
    mentioned I I'm going to say it 100
  • 00:19:21
    times we've done it before if you spend
  • 00:19:23
    20 minutes on Tik Tok you're going to
  • 00:19:26
    look at what 40 short videos If you
  • 00:19:28
    spend 5 minutes on Tik Tok or Instagram
  • 00:19:30
    re you're going to look at like 10 short
  • 00:19:31
    videos right yeah in the first nine
  • 00:19:33
    short videos you're going to either skip
  • 00:19:34
    it or like it or follow a Creator do
  • 00:19:36
    something right so the 11th short video is
  • 00:19:38
    like really personalized because I've
  • 00:19:39
    seen what you're doing in the last 5
  • 00:19:40
    minutes and I can do real-time
  • 00:19:42
    personalization for you yeah what does
  • 00:19:44
    that mean in the coding assistant world
  • 00:19:45
    look I know how these models are used in
  • 00:19:47
    the industry right now and how our
  • 00:19:48
    Enterprise customers and our community
  • 00:19:50
    users are using it yeah let's look at
  • 00:19:51
    the completion acceptance rate for
  • 00:19:53
    autocomplete oh for these languages in
  • 00:19:55
    these use cases we get a high acceptance
  • 00:19:57
    rate we show our recommendation
  • 00:19:59
    people accept the code and move on but
  • 00:20:01
    in these on examples oh we're not really
  • 00:20:03
    doing well so the question then becomes
  • 00:20:06
    oh this is something which maybe we
  • 00:20:07
    should train on or we should fine-tune on
  • 00:20:10
    and this establishes a feedback loop
  • 00:20:11
    yeah that look at what's not working and
  • 00:20:14
    then make the model look at more of
  • 00:20:16
    those examples and create that feedback
  • 00:20:17
    loop which can then make the models
  • 00:20:19
    evolve over a period of time and so I
  • 00:20:20
    think there's two pieces here if we move
  • 00:20:22
    a little bit into the into what this
  • 00:20:24
    then means for testing the more that
  • 00:20:27
    evaluation like almost like testing of
  • 00:20:29
    the model effectively gets better it
  • 00:20:31
    effectively means that the suggested
  • 00:20:33
    code is then more accurate as a result
  • 00:20:36
    the you need to rely on your tests
  • 00:20:39
    slightly less I'm not saying you
  • 00:20:40
    shouldn't but slightly less because the
  • 00:20:42
    generated code that is being suggested
  • 00:20:44
    is a higher quality when it then comes
  • 00:20:46
    to the suggestions of tests afterwards
  • 00:20:50
    does that follow the same model in terms
  • 00:20:52
    of learning from not just the what the
  • 00:20:54
    user wants to test but also from what is
  • 00:20:57
    being generated
  • 00:20:59
    is there work that we can do there into
  • 00:21:00
    building that test yeah I think that's
  • 00:21:02
    an interesting point unit test generation is a
  • 00:21:05
    selfcontained problem in itself right
  • 00:21:07
    yeah one I think let's just establish
  • 00:21:08
    the fact that unit test generation is
  • 00:21:10
    probably one of the highest value use
  • 00:21:13
    cases yeah we got to get right in the
  • 00:21:15
    industry code and that's because I
  • 00:21:17
    mentioned that look you will stop using
  • 00:21:19
    spotify if I start showing shitty
  • 00:21:21
    recommendations to you yeah if I don't
  • 00:21:23
    learn from my mistakes you're going to
  • 00:21:25
    keep skipping like short videos or music
  • 00:21:27
    or podcast and you're going to go right
  • 00:21:28
    I don't get value because I spend I'm
  • 00:21:30
    running I'm jogging I want music to just
  • 00:21:32
    come in and I don't want to stop my
  • 00:21:33
    running and hit like skip because that's
  • 00:21:36
    takes that's like dissatisfaction right
  • 00:21:38
    yeah one of the things which I did at
  • 00:21:39
    Spotify and like in my previous company
  • 00:21:40
    was like I really wanted people to focus
  • 00:21:43
    on dissatisfaction yeah because
  • 00:21:44
    satisfaction is all users are happy yeah
  • 00:21:46
    that's not where I make more money I
  • 00:21:47
    make more money by reducing
  • 00:21:49
    dissatisfaction where are you unhappy
  • 00:21:51
    and how do I fix that and even there if
  • 00:21:53
    I stop you there actually just quickly I
  • 00:21:55
    think there are different levels to this
  • 00:21:56
    as well in terms of the testing right
  • 00:21:57
    cuz it's at some point it's yes at its
  • 00:21:59
    most basic does this code work and then
  • 00:22:03
    as it goes up it's does this code work
  • 00:22:04
    well does this code work fast does this
  • 00:22:06
    code is this really making you happy as
  • 00:22:08
    a user and so I think where would you
  • 00:22:12
    say we are right now in terms of the
  • 00:22:14
    level of test generation are people most
  • 00:22:16
    concerned with the this code is being
  • 00:22:19
    generated or when I start creating my
  • 00:22:21
    tests how do I validate that my code is
  • 00:22:24
    correct in terms of it compiles it's
  • 00:22:28
    it's covering my basic use cases that
  • 00:22:30
    I'm asking for and doing the right thing
  • 00:22:32
    no this we can talk an hour about this
  • 00:22:34
    look I think this is a long journey yeah
  • 00:22:36
    getting evaluation getting unit test
  • 00:22:37
    generation right across the actual
  • 00:22:39
    representative use case in the
  • 00:22:41
    Enterprise that's a nightmare of a
  • 00:22:43
    problem yeah look I can do it right for
  • 00:22:45
    writing binary search algorithms
  • 00:22:47
    right if you give me like if you give me
  • 00:22:48
    a coding task which has nothing to do
  • 00:22:50
    with a big code repository and like
  • 00:22:52
    understanding context understanding what
  • 00:22:54
    the 5,000 developers have done sure I
  • 00:22:56
    can attempt it and create a unit test
  • 00:22:58
    because this there's a code and there's
  • 00:23:00
    a unit test this lives independently right
  • 00:23:02
    they live on an island they're happily
  • 00:23:03
    married amazing everything works but
  • 00:23:04
    this is not how people use coding this
  • 00:23:06
    is not how like the Enterprise or like
  • 00:23:08
    even Pro developer use cases are the pro
  • 00:23:10
    developer use cases are about like hey
  • 00:23:12
    is this working correctly in this like
  • 00:23:14
    wider context because there's a big code
  • 00:23:16
    repository and like multiple of them and
  • 00:23:18
    that's a wider context wherein you're
  • 00:23:20
    actually writing the code and you're
  • 00:23:21
    actually writing the unit test now I
  • 00:23:23
    would bring in this philosophical
  • 00:23:25
    argument of I think unit test generation I
  • 00:23:27
    would look at it from an adversarial
  • 00:23:28
    setting what's the point of having the
  • 00:23:30
    unit test it's not just to feel make
  • 00:23:32
    your yourself or your manager happy that
  • 00:23:34
    oh I have unit test coverage unit tests are
  • 00:23:36
    probably like a guard rail to keep bad
  • 00:23:39
    code from entering your system yes so
  • 00:23:41
    what is this this is an adversarial
  • 00:23:42
    setup maybe not intentionally
  • 00:23:44
    adversarial setup okay somebody's
  • 00:23:45
    trying to make bad things happen in your
  • 00:23:47
    code and somebody else is stopping that
  • 00:23:50
    from happening right so again if you
  • 00:23:52
    start looking at unit test generation from
  • 00:23:53
    this adversarial setup that look this is a
  • 00:23:55
    good guy right the unit test is going to
  • 00:23:58
    prevent bad things from happening in
  • 00:23:59
    future to my code base that's why I need
  • 00:24:02
    good unit test now this bad now this is
  • 00:24:04
    a good guy right who are the bad people
  • 00:24:06
    right now in the last up until the last
  • 00:24:07
    few years the bad people not
  • 00:24:09
    intentionally bad but the Bad actors in
  • 00:24:11
    The Code base were developers yeah now
  • 00:24:13
    we have ai yeah right now I am as a
  • 00:24:15
    developer writing code and if I write
  • 00:24:16
    shitty code then the unitest will catch
  • 00:24:18
    it and I won't be able to merge yeah
  • 00:24:20
    right if
  • 00:24:21
    ifel right exactly we'll get to that
  • 00:24:24
    right I'm yet to I'm yet to see a
  • 00:24:26
    developer who who is not a TDD fan who
  • 00:24:29
    absolutely lives for writing tests and
  • 00:24:31
    building a perfect test s for the code
  • 00:24:33
    yeah yeah again right and there's a
  • 00:24:34
    reason why code like test case coverage
  • 00:24:36
    like low across like all the
  • 00:24:38
    repositories up right it's not something
  • 00:24:39
    which again I think like Beyang our CTO he
  • 00:24:42
    loves to say that the goal of Cody is to
  • 00:24:43
    remove developer toil yeah and how do I
  • 00:24:45
    make you do a lot more happier job
  • 00:24:48
    focusing on the right creative aspects
  • 00:24:50
    of architecture design or system design
  • 00:24:51
    and remove toil from your life right a
  • 00:24:53
    bunch of Developers for for better RSE
  • 00:24:55
    to start looking at unit as generation
  • 00:24:57
    as maybe it's not as interesting let's
  • 00:24:59
    unpack that as well not all unit tests
  • 00:25:01
    are like boring right yeah writing
  • 00:25:02
    stupid unit test for stupid functions we
  • 00:25:04
    shouldn't even like probably do it or
  • 00:25:06
    like I I will let like machine learning
  • 00:25:07
    do it essentially but the Nuance are
  • 00:25:09
    like here is a very critical function if
  • 00:25:12
    you screw this then maybe the payment
  • 00:25:14
    system in your application gets screwed
  • 00:25:16
    and then you lose money right and then
  • 00:25:18
    you don't if you don't have
  • 00:25:19
    observability then you lose money over a
  • 00:25:21
    period of time and then you're literally
  • 00:25:22
    costing company dollars millions of
  • 00:25:23
    dollars if you screw this code
  • 00:25:25
    essentially so the point is not all unit
  • 00:25:26
    tests are the same because not all
  • 00:25:27
    functions are equally important right
  • 00:25:29
    there's going to be a distribution of
  • 00:25:30
    some of the are like really really
  • 00:25:31
    important functions you got to get a
  • 00:25:33
    amazing unit test right I would rather I
  • 00:25:35
    if I have a limited budget that if I
  • 00:25:37
    have to principal Engineers I would make
  • 00:25:39
    sure that the unit of these really
  • 00:25:41
    critical pieces of component in my
  • 00:25:43
    software stack are written by these
  • 00:25:45
    Engineers or even if they written by
  • 00:25:48
    like these agents or AI Solutions then
  • 00:25:49
    at least like they wed it from some of
  • 00:25:51
    these MH but before we get there let's
  • 00:25:54
    just look at the fact of the need for
  • 00:25:56
    unit test not just today but tomorrow
  • 00:25:58
    yeah because right now if you have
  • 00:26:00
    primarily developers writing unit test
  • 00:26:02
    or like some starting tools tomorrow a
  • 00:26:04
    lot more AI assistant I mean we are
  • 00:26:06
    building one right we are trying to say
  • 00:26:08
    that hey we're going to write more and
  • 00:26:09
    more of your code yeah what that means
  • 00:26:11
    is if in the adversarial setup unit test
  • 00:26:14
    are like protecting your code base the
  • 00:26:16
    the potential attacks not intentional
  • 00:26:18
    but the bad code could come in from
  • 00:26:19
    humans but also like thousands and
  • 00:26:20
    millions of AI agents tomorrow yeah and
  • 00:26:22
    you know what worries me a little bit
  • 00:26:24
    here as well is in fact when you talked
  • 00:26:25
    about that that those levels autonomy on
  • 00:26:28
    the far left it's much more interactive
  • 00:26:31
    right you have developers who are
  • 00:26:32
    looking at the lines of code that
  • 00:26:33
    suggested and looking at the tests that
  • 00:26:35
    get generated so it's much more involved
  • 00:26:38
    for the developer as soon as you go
  • 00:26:40
    further right into that more automated
  • 00:26:41
    World we're we're more in an we're more
  • 00:26:43
    in an environment where um larger
  • 00:26:46
    amounts of content is going to be
  • 00:26:49
    suggested to that developer and if we go
  • 00:26:51
    back to the same old that story where if
  • 00:26:54
    you want 100 comments on your pull request
  • 00:26:56
    in a code review you write two line
  • 00:26:58
    change if you want zero you provide a
  • 00:27:00
    500 line change right and when we
  • 00:27:02
    provide that volume whether it's hey I'm
  • 00:27:04
    going to build you this part of an
  • 00:27:06
    application or a module or a test Suite
  • 00:27:09
    based on some code how much is a
  • 00:27:11
    developer actually going to look in
  • 00:27:13
    detail at every single one of those
  • 00:27:14
    right and I think this kind of comes
  • 00:27:16
    back to your point of what are the most
  • 00:27:19
    important parts that I need to to look
  • 00:27:22
    at but yeah it revolves a little bit
  • 00:27:24
    more around what you were saying earlier
  • 00:27:25
    as well whereby tests becoming more more
  • 00:27:28
    important for this kind of thing and
  • 00:27:30
    exactly as code gets generated
  • 00:27:31
    particularly in volume right what are
  • 00:27:34
    where are the guard rails for this and
  • 00:27:36
    it's all about tests I love the point
  • 00:27:37
    you mentioned right that look as more
  • 00:27:39
    and more code gets written like my the
  • 00:27:41
    cognitive abilities of a developer to
  • 00:27:43
    look at every change everywhere it just
  • 00:27:46
    takes more time takes more effort takes
  • 00:27:47
    more cognitive load right yeah now
  • 00:27:49
    coupling the fact that if you've been
  • 00:27:50
    using this system for a few months then
  • 00:27:52
    there's an inherent trust in the system
  • 00:27:55
    now this is when I get really scared you
  • 00:27:57
    look at
  • 00:27:58
    when I started using Alexa in 2015 right
  • 00:28:00
    it would only get weather right right
  • 00:28:02
    Google home Alexa it won't do any of
  • 00:28:04
    other who can get weather
  • 00:28:06
    right yeah weather prediction is a hard
  • 00:28:09
    enough maching problem Deep Mind
  • 00:28:10
    Engineers are still working on it and
  • 00:28:11
    still getting it right and doing it in
  • 00:28:12
    London yeah I would pay for a service
  • 00:28:15
    which predict but point is we have used
  • 00:28:18
    as a society like these conversational
  • 00:28:19
    agents for a decade now we just asking
  • 00:28:21
    like crappy questions yeah because we
  • 00:28:23
    trust it I asked you a complex question
  • 00:28:24
    you don't have an answer I I forgot
  • 00:28:26
    about you for the next few months but
  • 00:28:28
    but then we start increasing the
  • 00:28:29
    complexity of the questions we ask and
  • 00:28:30
    that's great because now the the Siri
  • 00:28:33
    and the and the Google assistant and
  • 00:28:34
    Alexa was able to tackle these questions
  • 00:28:36
    right and then you start trusting them
  • 00:28:38
    because hey oh I've asked you these
  • 00:28:39
    questions and we've answered them well
  • 00:28:40
    so then I trust you to do these tasks
  • 00:28:42
    well and again right if you look at
  • 00:28:45
    people who use Spotify or Netflix their
  • 00:28:47
    recommendations in the feed they have
  • 00:28:49
    more Trust on your system yeah because
  • 00:28:51
    most of these applications do provide
  • 00:28:52
    you a way out right if you don't trust
  • 00:28:53
    recommendations go to your library go do
  • 00:28:55
    search yeah search search versus
  • 00:28:58
    recommendations is that push versus pull
  • 00:28:59
    Paradigm right recommendations I'm going
  • 00:29:02
    to push content to you if you trust us
  • 00:29:04
    you're going to consume this right if
  • 00:29:05
    you don't trust a system if you don't
  • 00:29:06
    trust your recommendations then you're
  • 00:29:08
    going to pull content which is search
• 00:29:09
Now, there's a distribution of people who don't search at all; they live in that high-trust world where they'll just take a recommendation. Same with Google: who goes to the second page of Google? Who goes through Google at all now, sorry. But essentially the point is, once people start trusting these systems, unit test generation becomes a system I start trusting. Then it starts tackling more and more complex problems, and I stop looking at the corner cases. And then that code was committed six months ago and that unit test is there, and code on top of it was committed three months ago, and there are probably unit tests I didn't write, the agent wrote them. This is where complexity evolves over a period of time, and maybe there's a generation of unit tests which have been written with less and less of me being involved. That means my involvement is not at the finer levels any more; it's higher up. Now, this works well if the foundations are correct and everything is robust and we have good checks in place. But again, whatever could go wrong will go wrong. So the point is, in this complex code base, where a series of generations of unit tests and code cycles and edits have been made by an agent, things could go horribly wrong. So what's the solution? The solution is to pay more respect to evaluation. Calling unit tests a guardrail is just a harmless way for me to say that unit test generation is important, not only for the unit tests of today but for the unit test generation and code generation of tomorrow. So I think about the kind of metrics we need, the kind of evaluation we need, the kind of robust auditing of these systems and of these unit tests.
• 00:30:46
Again, I don't have ten years of experience in the coding industry, because I've worked on recommendation and user-facing systems. But for me it was always: what is your test coverage in a repository? That's the most common way to look at a repository and at how advanced you are in your testing capabilities, and it doesn't cut it. Are you covering the corner cases? What's the complexity, what's the severity? So, do we need automated tests for our tests, or do we need humans? We need both. We need human domain experts, and this is not just a coding problem. Millions of dollars are spent by Anthropic and OpenAI on Scale AI; Scale AI just raised a lot of money at billion-dollar valuations, because we need domain experts to tag data. This was also a nightmare at Spotify, in search. I have a PhD in search, and my search work was: I'll show users some results and they're going to tag whether each one is correct or not. I can't do that in coding. Crowdsourcing has been a great assistant to machine learning systems for 20 years now, because I can get that feedback from the user, but to get feedback on a complex piece of Rust code, where am I going to find those crowdsource workers, on Scale or Amazon Mechanical Turk? They don't exist. You're not going to pay them $20 an hour to give feedback; these are $1,000-an-hour developers. We don't even have a community right now of crowdsource workers for code, because this is domain-specific. So my point is, this is not all doom-worthy; it's an important problem we've got to get right.
• 00:32:15
It's going to be a long journey. We're going to do evaluations, and we're going to do generations of evaluations. I think the right way to think about it is paying attention to unit tests, but also to the evaluation of unit tests, and, taking a step back, to multiple levels of evaluation: evaluate whether we are able to identify the important functions, the criticality of the code, and then look at unit test generation through that lens. Now, one immediate solution, and I think when we met over coffee we talked briefly about it, is this: let's say I generate 20 unit tests. I want my principal, or some respected, trusted engineer, to vet at least some of them. Their day job is not just to review unit tests; their day job is to maintain the system and advance it, so they're going to have a limited budget to look at the unit tests I've generated.
• 00:33:02
Now the question becomes: if this week I was able to generate 120 unit tests, and you are a principal engineer on my team, you're not going to look at 120 unit tests and approve them. You maybe have two hours to spare this week, so you're going to look at maybe five of them. That becomes an interesting machine learning problem for me: of the 120 unit tests I've created, what is the subset of five I need your input on? One way to tackle this is to reduce uncertainty. Machine learning uncertainty models, we've used them for 20 years in the industry: how certain am I that this test is correct? If I'm certain, then sure, I won't show it to you; maybe I'll show one anyway just to make sure I get feedback, I thought I was certain, did you confirm or reject it, and then I'll learn. But beyond that I'm going to follow an information-maximization principle: what do I not know, and can I show that to you? What that means is it's a budget-constrained subset selection problem. I've generated 120 unit tests, you can only look at five of them, and I have to pick those five. And we can do it in different ways: I can pick the five and show them to you all at once, or I can pick one, get that feedback, see what I additionally learn from it, then look at the remaining 119 again and ask, knowing what I know now, which one is next?
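As a rough illustration of that idea, here is a minimal sketch of budget-constrained selection: greedily pick the tests the model is least sure about while avoiding asking the reviewer about the same functions twice. The scoring, field names, and weights are illustrative assumptions, not Sourcegraph's implementation.

```python
# Minimal sketch: choose a review budget's worth of generated tests by combining
# model uncertainty with a simple coverage-novelty term (a greedy, submodular-style
# selection). All fields and weights here are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class GeneratedTest:
    name: str
    uncertainty: float                           # model's doubt that the test is correct, 0..1
    covered: set = field(default_factory=set)    # functions the test exercises

def pick_for_review(tests, budget=5):
    chosen, seen = [], set()
    remaining = list(tests)
    for _ in range(min(budget, len(remaining))):
        # Gain: prefer uncertain tests that also touch functions we have not asked about yet.
        best = max(remaining, key=lambda t: t.uncertainty * (1 + len(t.covered - seen)))
        chosen.append(best)
        seen |= best.covered
        remaining.remove(best)
    return chosen

# 120 generated tests, a reviewer with time for 5.
tests = [GeneratedTest(f"test_{i}", uncertainty=(i % 10) / 10, covered={f"fn_{i % 7}"})
         for i in range(120)]
for t in pick_for_review(tests, budget=5):
    print(t.name, t.uncertainty, t.covered)
```

The sequential variant described above would re-run the same selection after each piece of reviewer feedback, updating the uncertainty estimates before choosing the next test.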
• 00:34:18
And previously you mentioned the most important parts of the code to test, because there could be critical paths, and there could be other areas where, if there's a bug, it's far less of a problem. Who provides that context? Is that something the LLM can decide, or is that something where, like you said, the principal engineers should spend a little bit of their time saying, look, these are the main core areas that we need to be bulletproof; for the other areas, I would rather spend my time on these than on those?
• 00:34:43
Yeah, I'm probably not the best person to answer who currently provides it, but as a machine learning engineer I see that there are signals. If I look at your system: where were the bugs raised, what was the severity? For each bug, for each incident, there was a severity report; which code was missed, which code wasn't; what unit tests have people created so far? So I view it from a data observability perspective. Knowing what I know about your code, your severities, your issues, your time to resolve some of these, I can develop a model personalized to your code base for what the core important pieces are. I can also look at just the code: which functions call which, where the dependencies are. We have amazing engineers who are compiler experts in the company, and one of the reasons I joined was to bring in the ML knowledge but work with domain experts; the company has attracted amazing talent over the last few years, and it compounds. There are these compiler experts internally, Olaf and a few others being among them, and they do really precise intelligence on the code, finding the dependency structure and so on. That gives me pure content understanding of your code base. And then a function can be deemed important from those graph links, essentially how many in-edges and out-edges connect to your function: if a lot of code is calling it, that means there are a lot of downstream dependencies.
• 00:36:02
So there is that way of looking at it, but it's just pure code. I also have observational data. Observational data means I know what the severities were, where the SEV-0s and SEV-1s were caused, where the really critical errors have happened over the last few months, and where the probability of those happening is right now, plus where you are writing unit tests right now. That gives me an additional layer of information. I've already parsed your code base and understood what's going on, I have that view, but now I also look at the real-world view of the data coming in: oh, you know what, there was a SEV-0 issue caused by this piece of code over here, and maybe that means something. Now the question is, can I go back one day before? If I had to predict that one error is going to pop up tomorrow, which part of the code base will that error pop up in? That is a prediction I can make one day ahead, and I can start training these models.
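To make that concrete, here is a small sketch of the kind of scoring such a model might start from: a structural signal from the call graph (how many other functions depend on this one) blended with an observational signal (past incident severities attributed to it). The weights, field names, and the idea of boosting untested hot spots are assumptions for illustration, not a description of Sourcegraph's actual models.

```python
# Illustrative ranking of functions by "criticality": call-graph fan-in plus
# historical incident severity, with a boost for functions that have no tests yet.
def criticality_ranking(call_graph, incidents, tested, w_graph=1.0, w_sev=3.0):
    """call_graph: {function: set of functions it calls}
    incidents:  {function: list of past severities, e.g. [0, 1] for a SEV-0 and a SEV-1}
    tested:     set of functions that already have unit tests"""
    fan_in = {fn: 0 for fn in call_graph}
    for callees in call_graph.values():
        for callee in callees:
            fan_in[callee] = fan_in.get(callee, 0) + 1

    scores = {}
    for fn in set(call_graph) | set(incidents):
        sev_signal = sum(1.0 / (1 + s) for s in incidents.get(fn, []))  # SEV-0 weighs most
        score = w_graph * fan_in.get(fn, 0) + w_sev * sev_signal
        if fn not in tested:
            score *= 1.5                      # untested hot spots rank higher
        scores[fn] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

graph = {"checkout": {"charge_card", "send_email"},
         "profile_page": {"send_email"},
         "charge_card": set(), "send_email": set()}
print(criticality_ranking(graph, incidents={"charge_card": [0]}, tested={"send_email"})[:3])
```

A trained model would replace these hand-set weights with ones learned from which functions actually produced incidents, which is the one-day-ahead prediction described above.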
• 00:36:53
Now, this is what we talked about earlier, at the start of the podcast: each of these features is a different ML model. We just talked about two. One: if I have 120 unit tests, what is the subset of five I show you? That could be an LLM, or it could be subset selection; subset selection has known solutions with theoretical guarantees on performance, submodular subset selection and so on, and I've implemented them at 200-million-monthly-active-user scale at Spotify. So we can tackle this problem, but it's a new problem, and it's not an LLM solution. The second one: can I predict where the critical bug is going to be, then use that to identify critical components, and then chain that and put a unit test in there?
• 00:37:33
It's essentially human context, right? We talk about context a lot when we're talking about actual source files and how we can do code completion against so many parts of our project, but this is thinking about almost behavioral context. Exactly, and again I'll be a broken record on this: we have done this before. When you upload a short video, that video has zero views right now, so I look at: is it high quality, who's the author, what's the content composition? Why? Because zero people have interacted with it so far. Give it an hour and ten million people will have interacted with that short video, and now I know which people will like it and which won't. If you look at the recommendation life cycle of any podcast, we're going to upload this episode, and its life cycle will be: there is something about the content of this podcast, and there is something about the observational, behavioral data of this podcast, which developers, which users liked it, didn't like it, skipped it, streamed it, and all that. So we have designed recommendation systems as a combination of content and behavior. It's the same here. For unit test generation, I can look at your code base and make some inferences; on top of that I have an additional view on the data, which is what the errors are, where you are writing unit tests, where you are devoting time. That gives me additional observational data on top of my content understanding of a code base. Combine the two together and better things will emerge, essentially.
• 00:38:52
And we've talked about unit tests quite a bit; in terms of testing in general, there are obviously different layers to it. I think the intent changes heavily as you go up: when we're talking about integration tests and things like that, and again to go back to context, the intent is about the use cases, a lot about how a user will actually use that application. And when we think about the areas of the codebase which are extremely important, those higher-level integration tests, the flows they take through the application, will show which areas of code are most important as well. In terms of our developer audience, the people we talk to, when we want to provide advice or best practices on how a developer should think about bringing AI into their general testing strategy, what's a good start today? How do they introduce these kinds of technologies into their processes and existing workflows successfully?
• 00:39:52
Yeah, I think the simplest answer is: start using Cody. No, but I love this question. Even before I joined Sourcegraph, Cody helped me. When I interviewed at Sourcegraph, we do an interview where it's: here's a code base, it's open source, look at it and try to make some changes. And I loved that; psychologically I was bought in even before I had an offer, because you're making me do cognitive work on your code repository as part of the interview. You spend one hour, instead of just chatting, looking at the code and making some changes. So my point is, yes, use Cody, but the more interesting point here is this: if you're trying to adopt Cody or any of the other tools for test generation, what are you going to do? You're going to try the off-the-shelf feature, hey, generate unit tests, and see where it works and where it doesn't. Now, Cody provides something called custom commands. Edit code, unit test generation, these are all commands.
• 00:40:39
So what is a command? What is an LLM feature? Let's take a step back. An LLM feature is: I want to do this task, and I need some context. So I'm going to bring in a context strategy, what are the relevant pieces of information I should use, for example, here are the unit tests in the same folder, or here's a dependency you should be aware of, then write an English prompt with that context and send it to the LLM. That's a very nice, simplified way of looking at what an LLM feature is. Cody provides the option of custom commands, which means I can say: hey, the default doesn't work so well for me because of these nuances, so let me create a custom command. You're a staff engineer at the company, you create a custom command, and oh, this is better now. And in an enterprise setting you can then share that custom command with all your colleagues: hey, you know what, this is a better way of doing unit test generation, because I've created this custom command, and everybody can benefit.
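As a rough sketch of that "context strategy plus English prompt" shape, here is what a unit-test command might reduce to in code. The context strategy (the file under test plus sibling test files) and the `call_llm` client are illustrative assumptions, not Cody's actual custom-command format.

```python
# Simplified anatomy of an LLM feature: gather context, assemble a prompt, call the model.
from pathlib import Path

def gather_context(target: Path) -> str:
    # Context strategy (example): the file under test plus existing tests in the same folder.
    pieces = [f"// File under test: {target.name}\n{target.read_text()}"]
    for sibling in target.parent.glob("*test*"):
        if sibling != target and sibling.is_file():
            pieces.append(f"// Existing test for reference: {sibling.name}\n{sibling.read_text()}")
    return "\n\n".join(pieces)

def unit_test_prompt(target: Path) -> str:
    return ("Write unit tests for the code below. Cover the corner cases, follow the "
            "conventions in the existing tests provided, and do not invent APIs.\n\n"
            + gather_context(target))

def generate_tests(target: Path, call_llm) -> str:
    return call_llm(unit_test_prompt(target))   # call_llm: whatever LLM client you use
```

A shared custom command is essentially this prompt and context strategy captured once, so the whole team gets the improved version.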
• 00:41:28
Now, what makes you write a better custom command, or, even forgetting custom commands and Cody, what makes you get a better output at all? This is where the 0-to-1 evaluation comes in: where is it currently failing, what sort of unit tests are we getting right, what sort are we not getting right, what about your code base is interesting? The question then becomes: can I provide that as context, can I track where it's failing and where it's not? And then there are a few interventions you can make: you can change the prompt, create a new custom command, or you can create a new context source.
• 00:42:04
Now, this is a great segue for me to mention one thing, which is Open Context. Quinn literally started that work as an IC; one of the other impressive things about Sourcegraph is that if you look at the GitHub commit histories of the founders, you wonder, are they running a company or are they coding these things up? It just blew me away when I first came across that. But essentially Quinn introduced, and then a lot of the teams worked on, something called Open Context. In an enterprise setting you have so much context that no one tool can capture it all, and plugging in thousands of different heterogeneous context sources is what's going to help you get a better answer. So Open Context is a protocol designed so that you can add a new context source yourself, and because it's a protocol, Cody, and a lot of the other agents and tools around, can use it. What that means is, if you are generating unit tests and you know where it's not working, you'll make a change in the prompt, add a custom command, add some other examples, and then you might say, hey, maybe I should add a context source, because I have this information, like we talked about: where are the errors coming from? That SEV-0 data is probably not something you have given Cody access to right now, but because of Open Context, OpenCtx, you can add it as a context source and make your solutions better.
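To illustrate the idea, here is a hedged sketch, in the same Python used for the other sketches here, of a context source that surfaces recent high-severity incidents for the file being worked on. It mirrors the spirit of a pluggable context protocol, but it is not the actual OpenCtx API, and the incident store is hypothetical.

```python
# Hypothetical "incident" context source: given a file path, return extra items a
# coding assistant could splice into its prompt. Not the real OpenCtx interface.
from dataclasses import dataclass

@dataclass
class ContextItem:
    title: str
    body: str

class IncidentContextSource:
    def __init__(self, incident_store):
        # incident_store: {file_path: [{"sev": 0, "summary": "..."}]}
        self.incidents = incident_store

    def items(self, file_path):
        return [ContextItem(title=f"SEV-{inc['sev']} incident touching {file_path}",
                            body=inc["summary"])
                for inc in self.incidents.get(file_path, [])
                if inc["sev"] <= 1]          # only surface the critical ones

source = IncidentContextSource({"billing/charge.py": [{"sev": 0, "summary": "Double charge on retry"}]})
for item in source.items("billing/charge.py"):
    print(item.title, "-", item.body)
```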
• 00:43:15
And what are you doing there? You're doing applied machine learning 101 for your own feature. This is exactly where you need that 0-to-1: five examples where it doesn't work right now, and if you make the prompt change or the context change, it starts working. You have done this mini 0-to-1 for your own goal.
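In practice that mini 0-to-1 can be as small as a handful of known-hard cases kept in a script and re-run after every prompt or context change. The cases, the `generate_tests` callable, and the pass criterion below are placeholders, assumptions for illustration rather than a prescribed harness.

```python
# A minimal "mini 0-to-1" evaluation loop: a few concrete cases the tool currently
# fails on, re-checked after each prompt or context-source change.
failing_cases = [
    {"file": "parser.py",  "must_mention": "empty input"},
    {"file": "billing.py", "must_mention": "currency rounding"},
    {"file": "auth.py",    "must_mention": "expired token"},
]

def evaluate(generate_tests):
    """Return the fraction of known-hard cases the current setup now handles."""
    passed = 0
    for case in failing_cases:
        output = generate_tests(case["file"])
        if case["must_mention"] in output:   # crude proxy for "covers that corner case"
            passed += 1
    return passed / len(failing_cases)

# Compare the baseline against a tweaked prompt or a new context source:
# print(evaluate(baseline_generate), evaluate(improved_generate))
```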
• 00:43:33
And I think there's a meta point here, which is: forget about coding assistants, we're all transitioning to that abstraction of working with an ML system, with an AI system. I hate to use the phrase AI; coming from the machine learning world I'd rather say machine learning, but the audience probably buys AI more, some of the audience anyway. Essentially, a lot of us are starting to work with these systems, and the way we interact with them is going to be a bit more orchestrated: try this, figure out where it's working, great, I'm going to use it; figure out where it's not working, okay, cool, then I'm going to either give that feedback or adjust my workflow a bit and make it work better on those cases. So we're all starting to become that applied scientist in some way or other, and not just as engineers. If you're a domain expert, if you're a risk analyst who wants to create these plots, or a sales assistant using a sales copilot, you are working with an agentic ML setup, and you want to see where it's working, where it's not, and what changes you want to make. Again, you have done this before: when you add a plus or double quotes in Google, you get those exact words. What is that? You're adapting; you know what's going to work, so you start adding those tricks. And now, because more and more of your daily workflow is going to be around these agents and systems, you start developing those feedback loops yourself. So what we are trying to do as ML engineers in the product area is philosophically similar to what you are trying to do in using these products.
• 00:45:03
So a lot of my friends and other people ask me, hey, what's happening, I'm a domain expert. I was on another panel a few days ago and the question there was about jobs and all of that. Not to steer the conversation onto that, but essentially, if we start acting as orchestrators of these systems, then we start developing intuitions about where they work and where they don't, and then we start putting in these guardrails, and those guardrails are going to help with the unit tests you would otherwise miss.
• 00:45:28
And I think that's important, because our audience are all going to be somewhere on that journey from fully interactive to fully automated, and people may want to sit somewhere on that journey, but they will progress from one end to the other. As we get into that more and more automated space, I remember you saying earlier, when we talked about the budget of a human, that they have a limited amount of time: if you have 120 tests, you want to provide them with five. How do you place the importance of a developer's time in the future, when things get to that more automated state? How would you weigh a developer focusing on code versus focusing on testing versus focusing on something else? Where is it most valuable to have that developer's eyes?
• 00:46:16
So let's take a step back. You're a developer, I'm a developer, and there's a reason I have a job: there is a task to complete. The reason I'm writing this unit test is not that I get paid just to write a better unit test; I get paid, again not me specifically in my role, but as a developer I get paid if I can complete that task, and if I can at least spend some time making sure that in future my load for doing that task is lighter and the system is helping me further down the line, with that high-level view. So what's happening is that, rather than focusing on unit test generation or code completion because I care about that silo, no, I care about it because I care about the task being completed. And if I can do this task ten times quicker, then what's my path from spending five hours on it today to spending 20 minutes? This is where I said we're all going to be orchestrators. Look at a music orchestra: there's the symphony, there's the orchestra, and you're hand-waving your way into amazing music; art gets created. That's the goal: Cody wants to make sure we allow users, developers, to start creating art and not just toil.
• 00:47:26
Now, I can say this in English, but I think a good developer would embody the spirit of it, which is: the sooner I can get to that orchestrator role in my mindset, the sooner I start using these tools properly, rather than being scared of, oh, it's writing the code instead of me. And again, you mentioned that you might want to sit somewhere on that spectrum, but the technological evolution will march on and we're going to be pushed along parts of it. And it could be not just the technology: individuals who love writing code, that will need to change depending on what technology offers us.
• 00:48:06
But I guess, if we push further into that, what's the highest risk to what we're delivering not being the right solution? Is it testing now, is it the guardrails, that become the most important thing, almost, I would say, more important than the code? Or is code still the thing that we need to care about the most?
• 00:48:27
I think if I take this to the extreme, if I put my evaluation hat on, I want to be one of the most prominent, vocal proponents of evaluation in the industry, not just for Cody but across the machine learning industry: we should do more evaluation. There I would say that writing a good evaluation is more important than writing a good model, and writing a good evaluation is more important than writing a better context source, because you don't know what a better context source is if you don't have a way to evaluate it. So for me, evaluation precedes any feature development. If you don't have a way to evaluate, you're just throwing darts in a dark room; some are going to land by luck. In that world, I have to ensure that unit tests and evaluation are right up there in importance, just like code.
• 00:49:11
That said, I think what's more important overall is task success. What is task success? You're not just looking at unit tests as an evaluation; you're looking at evaluation of the overall goal, which is: did I do this task well? As an orchestrator, if I start treating these agents in those terms, whether it's Cody autocomplete or any specific standalone agent, probably powered by Sourcegraph as well, then the evaluation of that task matters, because you are the domain expert. Assume AGI exists today; assume the foundation models are going to get smarter and smarter, with billions, trillions of dollars eventually going into training these foundation models, the smartest models, and they can do everything. You are still best placed to understand your domain and what the goal is right now, so you are the only person who can develop that evaluation: how do I know that you're correct, how do I know whether you're 90% correct or 92% correct? And again, the marginal gain from 92 to 94 is going to be a lot harder-won than getting to 90.
• 00:50:06
It always gets harder; there's going to be an exponential increase in hardness there. So the point then becomes, purely on evaluation, purely on unit tests: what are the nuances of this problem, of this domain, that the model needs to get right? Are we able to articulate those, and are we able to generate the unit tests, the guardrails, the evaluations, so that I can judge how the models are getting better on that front? The models are going to be far more intelligent, sure, but what is success? You as the domain expert get to define that, and this is a great thing not just for coding but for any domain expert using machine learning or these tools: you know what you're using them for, and the AGI tools are just tools to help you do that job. So I think the onus is on you to write good evaluations. Or, I mean, maybe tomorrow it's LLM-as-a-judge; people are developing foundation models just for evaluation, so there are going to be other tools to help you do that as well, maybe code foundation models just for unit tests are the thing six months from now. The point then becomes: what should it focus on? That's the role you're playing, orchestrating, but orchestrating on the evaluation: did you get that corner case right, or, you know what, this is a critical part of the system, the payment gateway flow and the authentication flow, and if some of these get screwed up, massive bad things happen. You know that, and that's where the human in the loop and your input to the system start getting really valuable.
  • 00:51:29
    hours on this I know this has been
  • 00:51:31
    really interesting and I love the Deep
  • 00:51:32
    dive like a little bit below into the ml
  • 00:51:34
    space as well I'm sure a lot of our
  • 00:51:36
    audience will find this very interesting
  • 00:51:37
    thank you so much really appreciate you
  • 00:51:39
    coming on the podcast thanks so much
  • 00:51:40
    this was this was a fun conversation
  • 00:51:42
    yeah it could go on for hours hopefully
  • 00:51:44
    the inside thank you
  • 00:51:48
    [Applause]
  • 00:51:58
    thanks for tuning in join us next time
  • 00:52:00
    on the AI native Dev brought to you by
  • 00:52:02
    Tesla
  • 00:52:05
    [Music]
Tags
  • AI Testing
  • Software Development
  • Machine Learning
  • Code Evaluation
  • Large Codebases
  • Sourcegraph Cody
  • Developer Productivity
  • AI Context
  • Unit Testing
  • Evaluation Metrics