Why Your RAG System Is Broken, and How to Fix It with Jason Liu - 709

00:57:34
https://www.youtube.com/watch?v=wexpoR1R03A

Summary

TLDR: In this episode of the TWIML AI Podcast, host Sam Charrington interviews Jason Liu, an AI consultant and expert in retrieval-augmented generation (RAG). They explore the importance of understanding customer needs and the pitfalls of fixating on more complex reasoning in AI models. Jason shares his approach to optimizing AI systems through evaluation metrics, the impact of user experience design, and the importance of building effective datasets. The conversation highlights the balance between generative capabilities and retrieval performance, emphasizing that teams often overlook crucial aspects like precise user needs and system feedback. The episode encourages iterative testing and understanding specific user queries to improve AI functionality.

Takeaways

  • 🤔 Understand user needs to guide product development.
  • 🔍 Focus on evaluation and testing to improve AI systems.
  • 💡 UX design can enhance feedback and engagement.
  • 📊 Regular testing helps identify areas of improvement.
  • ⚙️ Fine-tuning may not always be necessary; assess use case specifics.
  • 📈 Longer context provides better analysis but requires efficiency.
  • 🔄 Segmenting problems can streamline solution development.
  • 📅 Leverage existing data structures for answering queries effectively.
  • 💬 Encourage experimentation and trust in data-driven approaches.

Timeline

  • 00:00:00 - 00:05:00

    Customers express a desire for more complex reasoning capabilities in AI models, which leads to discussions about understanding customer needs and improving product clarity.

  • 00:05:00 - 00:10:00

    Sam Charrington introduces Jason Liu, a freelance AI consultant with a background in machine learning and recommendation systems; the conversation will delve into retrieval-augmented generation (RAG) and diagnosing broken RAG systems.

  • 00:10:00 - 00:15:00

    Jason shares his educational background and his experience at Stitch Fix, where he worked on multimodal embeddings for predicting outfit recommendations, work that maps closely onto how RAG systems are built today.

  • 00:15:00 - 00:20:00

    He discusses how organizations often seek improvements in their RAG systems, focusing on the need for better embedding models and retrieval mechanisms to enhance customer engagement and business performance.

  • 00:20:00 - 00:25:00

    Jason criticizes companies' tendency to focus on tuning the generation step rather than ensuring that the language model has adequate context, emphasizing the importance of retrieval quality over generation adjustments.

  • 00:25:00 - 00:30:00

    He mentions biases that affect evaluation of AI systems, such as absence bias and intervention bias, stressing the need to focus on retrieval effectiveness rather than tweaking generative prompts.

  • 00:30:00 - 00:35:00

    As companies start to recognize the importance of evaluations, Jason highlights the shift towards efficient evaluation practices and how quick tests could foster a more results-oriented environment.

  • 00:35:00 - 00:40:00

    He provides strategies for building effective datasets and tests, advocating iterative experiments and synthesizing evaluation data, for example by generating questions from existing text chunks.

  • 00:40:00 - 00:45:00

    Jason discusses the necessity of segmentation in problem-solving, asserting the importance of correctly identifying user questions to develop functional data sets and improve AI system performance.

  • 00:45:00 - 00:50:00

    He emphasizes that having structured workflows and understanding user questions leads to meaningful data set generation, allowing for precise evaluation of AI models' performance and answering capabilities.

  • 00:50:00 - 00:57:34

    Finally, he discusses when off-the-shelf embeddings are sufficient and argues that detailed experimentation, rather than generic assumptions, should guide the choice of embedding and retrieval approaches.

Video Q&A

  • What is Retrieval-Augmented Generation (RAG)?

    RAG is an approach that combines retrieval of information with natural language generation, allowing models to generate responses based on retrieved data (see the minimal sketch after this Q&A list).

  • How can companies improve their RAG systems?

    By focusing on user needs, conducting thorough evaluations, and optimizing retrieval processes rather than solely fine-tuning the generation aspect.

  • What role does user experience (UX) play in AI systems?

    Good UX design can significantly improve user engagement and feedback collection, ultimately enhancing AI system performance.

  • What should companies prioritize when deploying AI models?

    Focusing on the context in which the AI operates, understanding users' workflows, and enabling the AI to add value rather than just answering questions.

  • Is fine-tuning always necessary for AI models?

    Not always; it depends on the specific use case and whether off-the-shelf models can efficiently meet the needs without additional fine-tuning.

  • What is the importance of context length in RAG?

    Longer context length can allow models to analyze more data and provide better responses, but achieving a balance with system efficiency and latency is crucial.
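
To make the retrieve-then-generate loop described in the Q&A above concrete, here is a minimal, self-contained sketch. The bag-of-words embed() and the prompt-building step are toy stand-ins for a real embedding model and a real LLM call; they are assumptions for illustration, not anything used in the episode.

```python
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    # Toy "embedding": a bag-of-words count vector. A real system would call
    # an embedding model here; this stand-in just keeps the example runnable.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query: str, context: list[str]) -> str:
    # The "generation" half: retrieved chunks are placed in front of the
    # question, and the whole prompt would be sent to a language model.
    return "Answer using only this context:\n" + "\n".join(context) + f"\n\nQuestion: {query}"

docs = [
    "The contract was signed on 2024-03-01 by the vendor.",
    "Invoices are due within 30 days of delivery.",
]
question = "When was the contract signed?"
print(build_prompt(question, retrieve(question, docs, k=1)))
```
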

Transcript

[00:00:00] Jason: The big smell I usually like to call out very early on is customers saying, "Man, I really wish these models were capable of more complex reasoning." It's like: do you want the complex reasoning because you haven't reasoned about what the customer wants? Because the more we really think hard and reason ourselves about what the customer wants, the clearer the product becomes.

[00:00:35] Sam: All right everyone, welcome to another episode of the TWIML AI Podcast. I am your host, Sam Charrington. Today I'm joined by Jason Liu. Jason is a freelance AI consultant and advisor and the creator of the Instructor library. Before we get going, be sure to take a moment to hit that subscribe button wherever you're listening to today's show. Jason, welcome to the pod.

[00:00:54] Jason: Hey, pleased to be here, man. Really excited.

[00:00:56] Sam: I'm excited for this conversation as well. We've spoken a few times before, and I'm looking forward to picking your brain on all things retrieval-augmented generation and more. I think a fun way to dig into this conversation is to talk through what happens when you come in and someone's RAG is broken, and how they go about fixing it. But before we dive into that, I'd love to have you share a little bit about your background.

[00:01:30] Jason: Perfect. I graduated from the University of Waterloo, where I was doing a lot of physics, and basically after my first physics term I realized, oh man, machine learning is definitely going to be next. And given the Nobel Prize, I think I nailed it, right?

[00:01:45] Sam: Absolutely.

[00:01:46] Jason: Once I started doing more machine learning, a lot of my background was in computer vision and recommendation systems: image embeddings and text embeddings feeding recommendation systems. And it just so happens that now, where we do RAG, it's kind of the same thing all over again. It's text embeddings going into recommendation systems that now feed language models instead of people. So it was a very nice transition into RAG when I started doing more consulting.

[00:02:13] Sam: And you did some of that at Stitch Fix, if I remember correctly.

[00:02:16] Jason: Yeah, most of my background was doing multimodal embedding at Stitch Fix from about 2017 onwards: taking an outfit, putting it into some outfit embedding space, and trying to predict the next outfit the person would want in their box. How do you do replacement, how do you do these recommendation carousels, what is a similar item, all that fun stuff.

[00:02:37] Sam: So what were your first steps into RAG and helping folks with their generative AI challenges?

[00:02:48] Jason: It's funny. I basically took a year off after I left Stitch Fix, and when ChatGPT came out and people were talking about RAG, everyone was really amazed by the power of text embeddings. In my mind, text embeddings were my intern project in 2016, because I didn't know how to set up Elasticsearch. So it was a very exciting ride to come back and say, oh, I have like eight years of experience doing this kind of stuff; let's jump in and figure out how we can actually learn new embeddings and how we actually improve and measure these kinds of search systems. Especially now that most people are just plugging in OpenAI embeddings and doing some kind of vector search, there's not much room for improvement when it comes to retrieval. So a lot of companies come to me and just say, hey, it's not really working, we're losing customers, we're losing a bit of money, how do we make this better? And that's kind of how the conversation starts off.

[00:03:40] Sam: Got it. And when you say there's not much room for improvement around retrieval, what do you mean by that?

[00:03:47] Jason: If you think about how we used to do embeddings at Stitch Fix, Netflix, Shopify, or Spotify, a lot of it was using user and product interaction pairs to train embeddings optimized for a click-through rate or some kind of relevancy metric. To me, at least, it feels pretty crazy that we're going to use these external embedding models from OpenAI and just assume that, oh yeah, of course my question about some law is going to be embedded very similarly to exactly the paragraph that answers the question about that esoteric legal statement. That's not really true. Think about even just the sentences "I love coffee" and "I hate coffee": are they similar or dissimilar? They could be similar on a dating app because they're both preferences about coffee, but maybe they're dissimilar because they're opposite preferences. I should be able to choose which one it looks like, and I think that's where a lot of people are getting tripped up. It's a big assumption to think we know what is and is not similar in this embedding space.

[00:04:49] Sam: Interesting. I think the reason I zeroed in on you mentioning that is because when I talk to folks who are working on RAG and their RAG is broken, they are often trying to fix it by tuning the generation, and that invariably is not the right way to do it. There's often a lot more headroom in making sure that the LLM has the right context than in fine-tuning the prompts. But when you're called into those situations, how do you begin to diagnose the problem?

[00:05:29] Jason: The first thing I do is basically ban adjectives as words you can use during standup. At a lot of companies, things get described as good or bad, "looks better," "feels better." Okay, but by how much? 20%? What does that look like? That's usually the first step: getting away from a vibe-based estimate of the generation and really thinking about retrieval. I like to think about two biases I learned from an MBA book. The first is absence bias, which just says you can't really think about the thing you don't see. You always see the generation, the text coming out of the language model, so you think that's the thing you have to control, because you don't see the retrieved content. The second is intervention bias: wanting to change things in order to feel in control. If you want to feel in control of a RAG application, all you have to do is twiddle with some text in your prompt and hope that all the relevant data is in there, and usually that's not the case. I think that's where a lot of the issues stem from. You're looking too much at generation and not really thinking about recall or precision, and whether the language model is confused or even finding the right information.

[00:06:43] Sam: And that opens up the whole conversation around evals and evaluation loops and pipelines and flywheels and the like. At least from my perspective, nine or twelve months ago folks were trying to figure out how to spell "eval," and now it's coming up in a lot more conversations. Are you seeing a similar shift?

[00:07:07] Jason: Yeah, except the thing I've been noticing is that we've almost also delegated the scoring back to language models.

[00:07:16] Sam: Right, the LLM-as-judge idea.

[00:07:18] Jason: Exactly. I think it's very useful to get some kind of proxy for what is good and what is bad, but what ends up happening is that instead of trying to solve the relevancy problem, we're just solving the problem of prompting yet another language model. I basically said, don't fiddle with the generation of the language model; you feel in control because you can fumble with it, but you're not going to get any results. And they said, okay, well, let me work with a different prompt instead, rather than building a precision and recall dataset. There are many, many tasks that take milliseconds to compute, where we could run tests across thousands of examples and figure out what the relevancy looks like, rather than reaching for LLM-as-judge. Literally a couple of weeks ago during standup, a team was like, "Hey, who spent $1,000 this weekend?" And a junior engineer said, "Oh, I was trying to run evals to see how good the new changes are. Should I not do that?" Oh wow, the evals are so expensive that I've just incentivized someone to not run more tests. That's really not how I want things to be. I want tests that are really fast and really cheap, that you should be running every 10 or 20 minutes when you make a one-line code change in your system. So it's been pretty funny to see that transition and really push to be a lot faster, fail faster, and do very cheap evaluations.

[00:08:36] Sam: And part of that is just that building datasets is hard and time-consuming, and hey, pre-trained models were supposed to get me out of that business, right?

[00:08:47] Jason: Yeah, I think every data scientist and every data engineer feels like, yeah, I'm kind of the janitor. But then you get the Roomba, you spill something, and the Roomba just smears it all over the floor, and you go, oh, I should have done it myself. That's kind of how I think about these things. But also, in earnest, I think what's really happening is that because there are so many more engineers coming into the space with less data literacy, it's actually very hard to even describe what a good dataset looks like. Eugene Yan and I were basically trying to figure out what good data literacy looks like, and we really struggled; we could only come up with ten reasons why something was data illiterate. It's still hard to describe the intuition and the vibe of when you give up and when you try new things. You know, if the model says it's 98% accurate, you probably did something wrong. All those kinds of things are pretty undocumented when it comes to making this mental shift.

[00:09:46] Sam: So when you're talking to folks and encouraging them to take this initial step of building a dataset that will allow them to measure their retrieval evals, do they always know what that process needs to look like? And how do you guide the folks who don't through the process of building that dataset?

[00:10:10] Jason: I think they have some ideas, but what ends up happening is that some ideas feel so intuitive you almost need someone else's permission to trust your gut. The simplest thing to do, for example, is to say: given a text chunk, can I generate a synthetic question with a language model, save the two as a pair, and then check whether the question I just generated finds the text chunk? A lot of times there are engineers on the team who have this idea, but they're like, "Hey Jason, does this make sense?" Nobody really knows for certain whether what they're doing is ridiculous. With the really great engineers I work with, they almost just need permission to trust their gut and do these tiny experiments. In traditional engineering you have to enumerate the edge cases right away and then build out your tests; here, a lot of it is, well, we really do just have to try it. We have to try ten different things and figure out what works and what doesn't. Giving the engineering team permission to trust their gut has also been a pretty valuable lesson on my end.
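
The "simplest thing" Jason describes, generating a synthetic question per chunk and checking whether that question retrieves its own chunk, reduces to a recall@k loop like the sketch below. Both make_synthetic_question and the search argument are hypothetical stand-ins for whatever question-generation model and retriever a team actually uses.

```python
import random

def make_synthetic_question(chunk: str) -> str:
    # Stand-in for an LLM call like "write a question this passage answers";
    # it just reuses words from the chunk so the example runs end to end.
    words = chunk.split()
    return "What about " + " ".join(random.sample(words, min(3, len(words)))) + "?"

def recall_at_k(chunks: dict[str, str], search, k: int = 5) -> float:
    """chunks: chunk_id -> chunk text. `search(question, k)` returns chunk ids.
    Recall@k = fraction of synthetic questions that retrieve their own source chunk."""
    pairs = [(cid, make_synthetic_question(text)) for cid, text in chunks.items()]
    hits = sum(cid in search(q, k) for cid, q in pairs)
    return hits / len(pairs)

# Usage: plug in the retriever under test, e.g.
# print(recall_at_k(my_chunks, my_retriever.search, k=5))
```
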

[00:11:13] Sam: And are those the same ten things in every case, or are they ten edge-case-specific things that are different for every person who's trying to build a system?

[00:11:24] Jason: What I find is that one thing is often missing: they don't actually understand what the workflows ought to be and what kind of question types they ought to serve. Everyone wants AGI, and the G stands for general, so they want to solve everything. But the OpenAI definition of AGI has something to do with economic value: are we unlocking economic value for our customer? And the big smell I usually like to call out very early on is customers saying, "Man, I really wish these models were capable of more complex reasoning." It's like: do you want the complex reasoning because you haven't reasoned about what the customer wants? Because the more we really think hard and reason ourselves about what the customer wants, the clearer the product becomes. For example, on day one we have a bunch of user questions coming in. If we can do some kind of clustering and segmentation, we might find out that, oh wow, 30% of all the questions are looking for a contract and whether or not it's signed, and 10% of the questions are just "who modified the document last?" That's not even in the text chunk, but if we just append an additional token that says "modified by Jason," we can now serve 10% of our question base. If we just parse out the dates and an is-signed boolean variable, we can now serve another 30% of our questions. So I think the real trick is developing the habit of looking at the data, but also trusting that your job is to make these hypotheses; your job isn't to be right all the time. You have to be wrong, you have to do these experiments, and fail fast.

[00:13:07] Sam: That seems like it needs to come first: really understanding what questions you're trying to serve with your system, whether it's a chatbot or something else, because you can't even really build a dataset until you know what those questions need to look like.

[00:13:25] Jason: I mean, sometimes if you just have a bunch of PDFs, you can try to have the language model answer those questions. But to my surprise, there have been times where, if you use something like Paul Graham essays and you generate synthetic questions off of random text chunks, you get like 96 or 97% recall. The problem is too easy, so you have to make it harder. And there are other datasets where I do the same task and I get like 60% recall. For example, just take all GitHub issues.

[00:13:55] Sam: GitHub issues, okay.

[00:13:56] Jason: Yeah. A common question that gets generated is something like "how to get started." Well, it turns out that if you don't have a filter on repo, on repository, you can't answer "how do I best get started," because now there are filters involved. And if I just say "best ways to get started in repo X," I now have to parse things out and do some filtering, and maybe, if it's not the exact filter, do some string matching and all the other stuff.

[00:14:23] Sam: Are you saying when people are asking you the best way to get started with RAG, or when people want to be able to answer the question "what's the best way to get started" for their users? I'm not following.

[00:14:35] Jason: Answering the question. Imagine doing GitHub issue search, and I search "best way to get started." How could I possibly have found the chunk that question came from? There are thousands of "best way to get started" documents. And then you go, oh, okay, actually, in order to do this problem well, I have to have some kind of repo-matching mechanism, I probably need some kind of filtering mechanism, and now you slowly add complexity into the system that you've built. I think too many people just sort of throw the data into a bunch of PDFs and go, "Well, obviously I can just ask what the systematic risks of this investment are," and that doesn't seem to be the case.

[00:15:16] Sam: So if we're building up to steps, then one step is "know your questions," the next step might be "build out your test set," and the third step is to think really hard about metadata and about sourcing data that the LLM can use to answer the question, or really that a pre-processor can use to get the right information to the LLM. We're actually still in retrieval at this point.

[00:15:47] Jason: Yeah. I like to think about it this way: I'm going to do some segmentation. If I were doing marketing, I might segment men's versus women's, or East Coast versus West Coast; every problem you want to solve, you kind of want to segment in some way. So we're ultimately going to find these segments in the question space, and there are two kinds of segments. There are segments that don't do well because we have capability issues. For example, if I ask "who modified this document last?" and I don't have that metadata, I can't answer that question, so I need to improve my capabilities: the row exists, but I need an extra column. The other world is inventory issues, where the data itself doesn't exist. If you think of something like DoorDash, and you find out that "Greek restaurants near me" is a terrible search query, the solution might be to buy iPads for Greek restaurants so they can come onto the platform. There have been other times when we do this kind of debugging and realize, oh wow, we don't have the data to answer these questions; we don't have the scheduling information, we don't have the tables extracted. So there are usually two kinds of solutions, and you've got to start integrating: do you add capabilities by adding more columns to your dataset, or do you add inventory and increase the number of rows? That's kind of how I think about these things.

[00:17:10] Sam: Adding rows, in all of the examples you've given, is a much longer process than adding columns.

[00:17:18] Jason: Exactly, exactly. Sometimes it's like, oh man, we need to figure out contracts, so we need to figure out who is responsible, and the feature people really care about is whether we can contact them in some way. And now you start building an actual application, because you know what the customer wants: it's not just to ask a question, but to take some action, make some decision.
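
The "modified by," "is it signed," and "in repo X" examples above all amount to attaching extra metadata columns to each chunk and filtering on them before semantic search. A minimal sketch, with invented field names and a placeholder semantic_search function:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Chunk:
    text: str
    repo: Optional[str] = None          # for GitHub-issue style "in repo X" questions
    modified_by: Optional[str] = None   # serves "who touched this last?" questions
    is_signed: Optional[bool] = None    # serves "is this contract signed?" questions
    signed_on: Optional[date] = None

def filter_then_search(chunks: list[Chunk], semantic_search, query: str,
                       repo: Optional[str] = None, is_signed: Optional[bool] = None):
    # Structured filters run first, so "best way to get started in repo X" only
    # searches within repo X; semantic search then ranks whatever is left.
    candidates = [c for c in chunks
                  if (repo is None or c.repo == repo)
                  and (is_signed is None or c.is_signed == is_signed)]
    return semantic_search(query, candidates)
```
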

[00:17:40] Sam: You mentioned this issue of using off-the-shelf embedding models. Are you advising folks to build their own embedding systems, or just to get better data into the ones that are there, or is there a decision matrix? In fact, there are a lot of similar questions I run into where I've heard mixed opinions as to whether they're even important: the embedding scheme, the chunking strategy, headings and other contextual information around chunks. Are there one-size-fits-all answers to these, or a decision matrix, or how do you think about the space of implementation details around embedding?

[00:18:40] Jason: It's a really good question, primarily because: why guess when we can test? I think this is a matter of just investing more in a dataset and going, great, well, let me run 30 experiments over the weekend; I'll come back and I'll just know the answer. Even the nature of that question, to me, is a symptom of the lack of really good evaluations on your own datasets.

[00:19:10] Sam: But likewise, the answer to the question is an indication that there aren't clear patterns and it's very dataset- and use-case-specific. If you had said, for example, "we really only ever see chunking strategies giving a 1 or 2% lift, it's not usually worth it," that would be really informative. But you didn't say that, so there must be cases where you change your chunking strategy and, bam, you get some great results.

[00:19:42] Jason: Exactly. A good example of that might be thinking about chunking and then processing tables within PDFs completely differently from regular text chunks: paragraphs you chunk, and if you see a table, you save the entire table somewhere else as a separate index. That would be a good example of when chunking really matters. I would also say that if you have even thousands of examples of questions and labels on whether or not a chunk is relevant, it's probably pretty fruitful to fine-tune a reranker. But even then, I often surprise myself as to whether certain interventions perform better. There are times when hybrid search with embeddings and BM25 is like 3% better. That's often the case when the person who is searching the data is aware of the file names and the text that's in the data: if I wrote my own essay, it'll be easier for me to find it with full-text search, because I know what I wrote. There have been times when rerankers don't improve the performance of the system, and there are times when they do, and again, it just becomes superstition. But it is very easy to absolve myself of the superstition by having tests that run really, really fast: just, is the ordering better? Those things are really great. Whereas if we go into factuality or self-consistency or context recall, who knows; maybe the model just wants to choose its own output, and now you're doing a whole set of other experiments to prove that the model is aligned with a metric you never really defined yourself. That's when I think things get [laughs] expensive.
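
Jason mentions that hybrid search, embeddings plus BM25, is sometimes a few points better, and that only a fast test tells you when. One common way to merge the two ranked lists is reciprocal rank fusion; RRF is not named in the conversation, so treat this as one possible reading of "hybrid" rather than his specific method.

```python
def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document ids into one.
    Each document scores sum(1 / (k + rank)) over the lists it appears in."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Usage: fuse whatever the two retrievers return.
bm25_hits = ["doc3", "doc1", "doc7"]     # lexical / full-text results
vector_hits = ["doc1", "doc9", "doc3"]   # embedding results
print(reciprocal_rank_fusion([bm25_hits, vector_hits]))  # ['doc1', 'doc3', ...]
```
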

[00:21:35] Jason: Even time-wise, there have been times where we have summarization prompts in the loop. For example, when we want to retrieve images, do we use a CLIP embedding? Well, that means we also have to use a CLIP embedding for the text. What I've seen go really well is actually using a visual language model to give a detailed description of the image, as a paragraph, and then just text-embedding that paragraph. But what this means is that the "describe this image" prompt is now another hyperparameter to experiment against. I've seen situations where, if you just say "describe this image," recall is like 27%. But if you can teach this concept of recall to an engineer and you just make them hill-climb for a day and a half, we've been able to get to like 87% recall just by improving the prompt. Write a prompt, tell me what doesn't get recovered. Okay, find the blueprint, but also count the number of rooms: now it's like 35%. Okay, also transcribe the street addresses and include them in the description: 70%. Also describe whether it's north-facing or east-facing, and describe the positions of the cabinets, and all of a sudden you have a 96% recall system for finding blueprints, because you actually worked on the prompt.

[00:22:55] Sam: Sounds like feature engineering.

[00:22:59] Jason: Hey, I can't say that, because then they get confused. But yeah, that's what it ends up being, right? It's like, oh, this is classical machine learning, but we were able to hill-climb because our eval is very fast: it takes 50 milliseconds to try again and try again. I think that's where a lot of things can really be optimized. But yeah, it's definitely just feature engineering, though that's also another one of those words.

[00:23:26] Sam: One of the questions that comes up a lot around the whole idea of evals is tooling. Do you have go-to answers for that? I'm guessing it's going to be "build your dataset and some silly eval in a notebook," but do you find that there's a point at which it becomes more complex, and there's some open-source or off-the-shelf tooling that makes a difference for folks?

[00:23:58] Jason: I would say if you are working independently, you are likely best off just writing things to a JSON Lines file or a SQLite file, primarily because you're just building out these very fast evals. If you're just comparing length of summary divided by length of input and figuring out whether there's a compression rate you want to set a goal against, that's super fast. What I do when I work with bigger companies is use Braintrust, primarily because Braintrust was basically built because people had just been sharing screenshots of results from Jupyter notebooks, and every once in a while you go, I need a tool that does better than this. So often, if you need to collaborate on sharing datasets, collaborate on sharing results and getting feedback, and get coworkers on your team to label data with you, I think that's when a tool really, really shines.

[00:24:49] Sam: So it's for the collaboration aspect, not because the evaluations are better in some way?

[00:24:54] Jason: Exactly, because the evaluations you have to build yourself; you're not pulling evaluations off the shelf. Factuality, self-consistency, that stuff to me is crazy. It's like having someone grade their own assignment; it's like, oh man, I just hope... And then the other stuff that's valuable is, okay, how can I downsample my production traffic to also run these evaluations and make sure things are behaving in production? Can I monitor these things over time? A really simple example: I have a company where we do meeting summarization, and we plot the average length of a transcript, and we plot the average length of the summary divided by the average length of the transcript. Every once in a while there's a blip. Why did that blip happen? Well, it turns out we ran a marketing campaign, we got a whole new set of users, and these new users are doing three-hour-long podcasts, and it's really bad, because the summary is just "they talked about AI." You're like, oh man, the compression rate is too high; now let's go do something. Great, we build a rule that says if the call is less than an hour we use this prompt, and if it's greater than an hour we use that prompt. Okay, the ratios are recovering a little bit, and can we monitor that? I think that's how I think about building these systems: have the dumbest evals possible to tell you what to look at.
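
The "dumbest eval possible" described here, tracking summary length over transcript length and routing long calls to a different prompt, is only a few lines. The record fields and the words-per-hour cutoff below are invented for illustration:

```python
from statistics import mean

def compression_rate(summary: str, transcript: str) -> float:
    return len(summary) / max(len(transcript), 1)

def daily_compression(records: list[dict]) -> dict[str, float]:
    """records: [{'day': '2024-06-01', 'summary': ..., 'transcript': ...}, ...]
    Returns the average compression rate per day; a sudden drop is the kind of
    'blip' worth investigating (e.g. three-hour podcasts summarized to one line)."""
    by_day: dict[str, list[float]] = {}
    for r in records:
        by_day.setdefault(r["day"], []).append(compression_rate(r["summary"], r["transcript"]))
    return {day: mean(rates) for day, rates in by_day.items()}

def pick_prompt(transcript: str, words_per_hour: int = 9000) -> str:
    # The routing rule from the conversation: one prompt for calls under an hour,
    # another above it. The words-per-hour cutoff is a made-up stand-in for
    # however call duration is actually tracked.
    return "short_call_prompt" if len(transcript.split()) < words_per_hour else "long_call_prompt"
```
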

[00:26:31] Sam: You mentioned compression rate before talking about this specific example. Is that a metric that you've applied broadly, or is it just for this transcript summarization case?

[00:26:42] Jason: I've just found that a lot of the applications I tend to work on are ones where we're doing a lot of summarization, and summarization is a very interesting task, because LLMs are good at summarization in the sense that the output is indeed shorter than the input, but it's actually very hard to evaluate what a good summary is and when we lose nuance, and all that kind of stuff. Obviously we can have the entire LLM-as-a-judge model of doing things, but ideally we have much simpler metrics. So I have metrics that are just length of summary divided by length of transcript. I also have counts of named entities: for example, a summary that mentions my name, versus a summary that just says "they thought it was...," which is very ambiguous. Can we preserve some kind of information density? They're all proxies for some satisfaction or nuance that we could also point a language model at, but looking at odd examples of just simple numbers can still tell you a lot of information. For example, what we found was that when we plotted summary length against transcript length, it would go up, and then after about 20,000 tokens it actually got shorter again. Okay, that took six minutes to plot out and write the data for, but now we know there's some weird behavior when the transcript is really, really long. Great, let me change my prompt and rerun this. Perfect, we're good.

[00:28:23] Sam: Was there a step before changing the prompt that was trying to understand the intuition for why that might be happening, or was that ancillary to actually fixing the problem?

[00:28:32] Jason: I don't remember exactly what we did in that example. I think we just saw that, oh wow, not only is the summary getting shorter, the variance is also increasing as it drops. So what we want is a prompt that has lower variance in the compression rate. That seems like a very healthy and quantifiable goal, where someone can just say, "Hey Jason, I tried three different prompts and I was able to drop the standard deviation by 40%, and it's now monotonically increasing as a function of context." That becomes so scientific and so quantifiable that we don't have to worry about some of these bigger things. Obviously we might still lose nuance, but setting a goal against that is very easy.
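
The goal he describes, lower variance in the compression rate and a summary length that keeps growing with transcript length, is easy to score per prompt variant. A sketch with invented field names:

```python
from statistics import mean, pstdev

def score_prompt_variant(results: list[dict]) -> dict[str, float]:
    """results for one prompt variant: [{'transcript_tokens': int, 'summary_tokens': int}, ...].
    A lower std-dev of the compression ratio, and fewer cases where the summary
    shrinks as transcripts grow, both count as improvements."""
    ratios = [r["summary_tokens"] / r["transcript_tokens"] for r in results]
    ordered = sorted(results, key=lambda r: r["transcript_tokens"])
    lengths = [r["summary_tokens"] for r in ordered]
    # How often does the summary get *shorter* as the transcript gets longer?
    shrinks = sum(later < earlier for earlier, later in zip(lengths, lengths[1:]))
    return {"mean_ratio": mean(ratios), "ratio_std": pstdev(ratios), "shrink_count": shrinks}
```
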

[00:29:25] Sam: My sense is that for folks coming to this fresh and being told, "hey, you should build a test dataset," wrapping their head around precision and recall (whether you can keep those two straight without looking them up is another issue) is the obvious thing to measure and test against. Compression rate feels less obvious, or more nuanced in some way. Are there other nuanced kinds of metrics? I guess I'm thinking there are two ways you get these. One is banging your head against your problem, and that's probably the best way to come up with them, but part of what we're trying to do is accelerate learning and provide shortcuts. What are the shortcuts that you've come across for different problem classes, like "these four metrics, you probably wouldn't think about them, but they come up all the time"? Do you have that list?

[00:30:34] Jason: Another one that is pretty reasonable in this summarization task is just whether or not the output adheres to a certain schema and a certain formatting. But again, I try my best to just write a regular expression that captures this as quickly as possible. I could have a six-point grading scale on whether it fits the markdown format I want, but then there's too much nuance. Really, I just want to have a bunch of pass/fail tests that are very binary, where I can say, great, show me ten examples where I failed and ten examples where I succeeded, and let me go think really hard and figure out what is happening and how I can change that. Outside of that, I find a lot of it ends up being very, very specific. There's an example where I generate action items, but I want to evaluate whether or not each action item is correctly assigned to the right person. That's just a very specific eval that you have to build, and it sometimes ends up being very challenging to correctly assign something like that.
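
His preference for binary pass/fail checks over graded scales can come down to one regular expression per property. The "## Action Items" heading and bullet convention below are an assumed schema, purely for illustration:

```python
import re

def follows_schema(output: str) -> bool:
    """Binary pass/fail: is there an '## Action Items' heading followed by at
    least one bulleted line? Checks like this make it trivial to pull ten
    failing and ten passing examples and read them side by side."""
    has_heading = re.search(r"^## Action Items\s*$", output, flags=re.MULTILINE)
    has_bullet = re.search(r"^\s*[-*] .+", output, flags=re.MULTILINE)
    return bool(has_heading and has_bullet)

# Usage over a batch of model outputs under test:
# failures = [o for o in outputs if not follows_schema(o)][:10]
```
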

[00:31:40] Sam: Are you finding that you're always running all of your eval suite whenever you're making any change, or do you run specific, feature-specific evals and then run your broader suite less frequently?

[00:32:00] Jason: It depends on what kind of eval, to be honest. If I can afford to, I would just rather run them all the time, because you never know what kind of cross-influence there is. A really simple example: we had both an executive summary and a list of action items, and the action item descriptions were too long. You go, great, well, make the action items shorter, and then we got equally long action items, just fewer of them. That test is easy, because we can just parse out the asterisks and count them; I literally had one eval that was just "count the number of action items." The second one was the average character count of the action items, and the average length of the summary. So then you go, okay, just make the descriptions of the action items shorter, and all of a sudden the summary is also shorter. So there is cross-contamination, and one of the things that's valuable is to go, okay, I don't know why, but controlling one and not the other is so difficult that I'm going to break this down into two tasks: a summary task and an action item task. The reason I've added this extra complexity is that I have all these experiments to prove that I can't figure out how to combine them. Maybe if a new model comes out that's better and more steerable, we can re-evaluate this, but I'm going to separate these into two different tasks because I cannot get the evals to match what I want in terms of performance. So now you have this idea: I'm going to segment to make this simpler, but there are conditions under which I would recombine these tasks. Maybe when, say, Haiku 3.5 comes out, I'll rerun my old evals, see if I can fix these things and justify some of those investments. The idea really is that you're making your resource allocation, how you spend your time, and how you design your system and its complexity based on the trade-offs you're making with evals. And again, these evals are just regular expressions; it's not anything fancy, you're not calling any LLM.
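
The two evals he mentions for catching this cross-contamination, counting the action items and tracking their average character length, are each a couple of lines. Parsing out asterisks comes from the conversation; everything else is a sketch:

```python
def count_action_items(markdown: str) -> int:
    # The conversation's eval: parse out the asterisks and count them.
    return sum(1 for line in markdown.splitlines() if line.lstrip().startswith("* "))

def avg_action_item_length(markdown: str) -> float:
    items = [line.strip() for line in markdown.splitlines() if line.lstrip().startswith("* ")]
    return sum(map(len, items)) / len(items) if items else 0.0

# Tracking both numbers side by side is what exposed the trade-off in the episode:
# "make the items shorter" produced fewer items, and later a shorter summary too.
```
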
  • 00:34:23
    of uh rag and gen more broadly you know
  • 00:34:27
    what role do you see for
  • 00:34:30
    fine-tuning uh in the types of systems
  • 00:34:34
    that you know we're typically building
  • 00:34:35
    for rag I would say if you're going to
  • 00:34:37
    start fine tuning the first thing to
  • 00:34:39
    fine tune is likely going to be
  • 00:34:40
    something like a cohero ranker right
  • 00:34:43
    that's where you're going to have the
  • 00:34:44
    least amount of data you required it's
  • 00:34:46
    going to be very easy to label this data
  • 00:34:49
    if you have a bunch of questions and
  • 00:34:50
    text junks you can probably ask like the
  • 00:34:52
    smartest most expensive model you have
  • 00:34:54
    and just label thousands of examples
  • 00:34:57
    right
  • 00:34:58
    so just transfer learning that task is
  • 00:35:01
    pretty affordable probably for $50 you
  • 00:35:03
    can you can get a fine-tune ranker that
  • 00:35:05
    outperforms anything off the
  • 00:35:08
    shelf I don't know whether fine-tuning
  • 00:35:11
    and betting models is worth it just
  • 00:35:12
    because of like it's just annoying to
  • 00:35:15
    like own inference but I think that's
  • 00:35:18
    the second easiest thing to fine-tune is
  • 00:35:19
    fine tuning an edding model to do search
  • 00:35:22
    better after that the only thing I would
  • 00:35:25
    really fine-tuned is any kind of pre
  • 00:35:26
    rewriting steps
  • 00:35:28
    so you know can I given a question parse
  • 00:35:30
    it out to you know query start date end
  • 00:35:34
    date can I can I map it to metadata
  • 00:35:36
    filters that I think people should be
  • 00:35:38
    fine-tuning because it's a very specific
  • 00:35:40
    task you can fine-tune like a llama
  • 00:35:43
    model you can host it in a way that has
  • 00:35:46
    fast inference and it's usually going to
  • 00:35:48
    be pretty effective whereas I find it's
  • 00:35:50
    pretty challenging to really think about
  • 00:35:52
    how do how do you fine tune like 40 to
  • 00:35:55
    do answer generation you I would you
  • 00:35:58
    would need to be pretty Justified to
  • 00:36:00
    explore that especially
  • 00:36:02
    because as these model like these models
  • 00:36:04
    are going to get better in a way that we
  • 00:36:05
    can't control and they're always going
  • 00:36:07
    to be have better recall they're going
  • 00:36:09
    to have better robustness towards like
  • 00:36:12
    low Precision text chunks it's hard to
  • 00:36:15
    beat them because they actually have all
  • 00:36:16
    the data whereas um for something like
  • 00:36:22
    rankers you know they don't have that
  • 00:36:24
    data like we are the ones that are able
  • 00:36:26
    to capture the value have this data set
  • 00:36:28
    fine-tune and outperform uh the public
  • 00:36:33
benchmarks. One of the model-related
  • 00:36:37
    questions that comes up all the time
  • 00:36:40
    especially as the you know the big
  • 00:36:43
    models get better is like do I need to
  • 00:36:47
    think about any of this in a large
  • 00:36:50
    context length uh you know regime like
  • 00:36:55
    do I need to rerank do I need to you
  • 00:36:58
know, embed, chunk? Can I just throw
  • 00:37:02
    everything um you know of course for
  • 00:37:04
    some definitions of everything it's
  • 00:37:05
    going to be too big you know bigger than
  • 00:37:07
    whatever context window you have but
  • 00:37:10
    like uh assuming a large context um and
  • 00:37:15
    you know assuming context sufficient for
  • 00:37:18
    you know a a lot of your
  • 00:37:22
    context
  • 00:37:23
    um yeah you get where I'm going with
  • 00:37:26
    this like
  • 00:37:29
    yeah I
  • 00:37:30
    mean I think what's really going to
  • 00:37:32
    happen is we're going to go in the same
  • 00:37:34
    way that like the iPhone battery life
  • 00:37:35
    has gone
  • 00:37:37
    right like we've never had a better
  • 00:37:40
    battery and then longer battery life
  • 00:37:41
    we've just had more powerful
  • 00:37:45
applications. And so I think as context length
  • 00:37:48
    increases we're just going to have way
  • 00:37:49
    more complex instructions with like
  • 00:37:51
    different personalities or you know
  • 00:37:53
    maybe not only is going to have the
  • 00:37:54
    context length it's going to have my you
  • 00:37:56
    know my history all this kind of stuff
  • 00:37:59
    that said I think there's a great place
  • 00:38:01
    for long context models especially when
  • 00:38:03
    we have a few documents I would almost
  • 00:38:05
    rather always shove everything into
  • 00:38:06
    context right but we're always going to
  • 00:38:08
    run into latency tradeoffs if we think
  • 00:38:11
    of the recommendation systems or you
  • 00:38:13
    know e-commerce systems we know that
  • 00:38:16
    even a 100 milliseconds 300 milliseconds
  • 00:38:18
    of latency could be a 1% Revenue hit I
  • 00:38:21
    think that'll be the same thing for
  • 00:38:22
    these language models right there's
  • 00:38:24
    always going to be some Frontier of
  • 00:38:26
context length and latency and business
  • 00:38:28
    outcome that we're going to have to make
  • 00:38:30
    tradeoffs against I think that's what
  • 00:38:31
    that's what's really going to happen
  • 00:38:32
    yeah I was wondering if you had more
  • 00:38:36
    Nuance around the way you think about
  • 00:38:39
    the generation side um I I guess my
  • 00:38:44
    observation is like the length of the
  • 00:38:46
    context itself is insufficient as a
  • 00:38:50
    determinant of success right and you
  • 00:38:53
    know that's why for example we have
  • 00:38:54
    reranking because you know within a
  • 00:38:57
    given context length the model can't
  • 00:39:00
    really follow the plot all the way from
  • 00:39:02
    the top to the bottom right and so like
  • 00:39:05
just stating the
  • 00:39:07
    context length doesn't say enough about
  • 00:39:09
how good a job the
  • 00:39:12
    model does at attending to all the
  • 00:39:14
    various things in the context and so um
  • 00:39:18
    you know that gets to you know Concepts
  • 00:39:21
    like precision and recall and other
  • 00:39:22
    things like do you have a structured way
  • 00:39:25
    that you think about that or like the
  • 00:39:26
    way that you would approach evaluating a
  • 00:39:30
different context length? So when I use a
  • 00:39:34
    longer context model I'm usually working
  • 00:39:35
    with a very few set of documents right
  • 00:39:38
    so the question is like okay is the
  • 00:39:39
relevant information split, as text chunks,
  • 00:39:42
across many documents, or is it really going
  • 00:39:44
to be in a very few documents? And a good
  • 00:39:46
    example of this is we have an agent
  • 00:39:49
whose job is to take sales calls,
  • 00:39:52
    reference your pricing pages and give
  • 00:39:55
    you a compelling personalized pricing on
  • 00:39:58
    a certain service that you provide right
  • 00:40:00
so we have a one-hour-long transcript, we
  • 00:40:02
    have a 16-page PDF that describes our
  • 00:40:04
    pricing options for like different
  • 00:40:06
    add-ons and whatnot and the prompt goes
  • 00:40:08
    as follows right it says here's a
  • 00:40:10
    transcript here is 16 pages of our
  • 00:40:13
    pricing first list out all the variables
  • 00:40:17
    that are required to determine whether
  • 00:40:19
    or not you can personalize the price and
  • 00:40:21
    then it does that and then for
  • 00:40:24
    everything that we list out extract out
  • 00:40:26
    exactly what part of the transcript they
  • 00:40:28
    mention this variable so first it list
  • 00:40:31
    out the variables and then it lists out
  • 00:40:32
    the variables hydrated by
  • 00:40:34
    excerpts then you know reread the
  • 00:40:37
    transcript and the the page and list out
  • 00:40:41
    the resulting like price number that you
  • 00:40:44
    can give it and then construct a
  • 00:40:46
    follow-up email that offers a
  • 00:40:48
    personalized
  • 00:40:49
    price so what we're really doing is
  • 00:40:51
    we're trying to just push the language
  • 00:40:53
    model to do a lot of very specific Chain
  • 00:40:55
    of Thought where you're kind of
  • 00:40:56
    extracting the data then organizing it
  • 00:40:58
    again in a smaller package and then as
  • 00:41:01
    you generate the email you assume that
  • 00:41:03
    we're kind of only attending over this
  • 00:41:04
    like prepared notepad that the language
  • 00:41:07
model determined. That's mostly how I
  • 00:41:10
    think about using long context models
  • 00:41:13
    when it comes to very few data which is
  • 00:41:14
    just to say I want you to attend over
  • 00:41:16
    everything reorganize the information in
  • 00:41:19
    Your Chain of Thought in your scratch
  • 00:41:20
    pad and then finally give me a final
  • 00:41:22
    result and that has usually worked
  • 00:41:25
pretty well. That's been the
  • 00:41:26
difference in being able to ship
  • 00:41:28
    something that is actually sending
  • 00:41:29
follow-up emails right now in production.
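A sketch of what that single long "scratchpad" prompt might look like; the tag names, step wording, and the example variables in step 1 are illustrative, not the production prompt:

```python
SCRATCHPAD_PROMPT = """\
Here is a one-hour sales call transcript:
<transcript>
{transcript}
</transcript>

Here are 16 pages of our pricing options:
<pricing>
{pricing}
</pricing>

Work through the following steps in order, showing your work for each:
1. List every variable required to decide whether we can personalize a price
   (for example seat count, minimum seats, relevant add-ons).
2. For each variable, quote the exact excerpt of the transcript where it is mentioned.
3. Re-read the pricing pages and, using only the variables and excerpts above,
   work out the price we can offer.
4. Finally, write a follow-up email that presents this personalized price.
"""

def build_prompt(transcript: str, pricing: str) -> str:
    # Steps 1-3 are the "scratchpad"; the email in step 4 attends over that prepared notepad.
    return SCRATCHPAD_PROMPT.format(transcript=transcript, pricing=pricing)
```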
  • 00:41:32
    no that's really interesting so it's
  • 00:41:33
    kind of speaking to like long context
  • 00:41:37
    doesn't necessarily say that you're
  • 00:41:38
going to be able to one-shot your answer, but
  • 00:41:41
    if you can reduce your context
  • 00:41:44
    systematically you know you know through
  • 00:41:46
    a lens of the way you thought about
  • 00:41:47
    breaking down your problem then you
  • 00:41:50
could, you know, long context can be a
  • 00:41:53
    convenience for you exactly and it it
  • 00:41:56
    makes the problem really it's still it's
  • 00:41:57
    still a single prompt now right it's
  • 00:42:00
    just generating like scratch Pad one
  • 00:42:02
    scratch Pad two scratch Pad three and it
  • 00:42:04
    actually lets us work in a world when
  • 00:42:07
    when the next long context model exists
  • 00:42:09
    we can just replace the model number
  • 00:42:11
    rather than going you know we had this
  • 00:42:13
    like six prompt agentic system and I
  • 00:42:16
hope... Okay, yeah, I was envisioning
  • 00:42:18
this as a six-prompt agentic system. Oh
  • 00:42:21
yeah. Tell me, what
  • 00:42:23
    does the scratch Pad mean in that
  • 00:42:25
    context and how is a prompt
  • 00:42:27
    yeah incorporating that so basically I
  • 00:42:30
    say Okay first list out the variables
  • 00:42:32
    then list out the variables and the
  • 00:42:33
    transcripts but it's just doing it so
  • 00:42:35
    just in the the generation you're asking
  • 00:42:38
    it to show its work that kind of deal
  • 00:42:40
    yeah but like so it's like a like a very
  • 00:42:42
    very long show your work right it's like
  • 00:42:44
    it's you know maybe 3,000 tokens of
  • 00:42:46
    planning of just going like well uh we
  • 00:42:49
can offer a per-seat model if the number of
  • 00:42:51
seats is greater than 30; this person
  • 00:42:53
    mentioned 30 was the minimum seat number
  • 00:42:56
    and they said they had 48
  • 00:42:58
    seats so now it just sort of like goes
  • 00:43:01
    down this but it's as a single
  • 00:43:03
    generation interesting and are there is
  • 00:43:06
    there anything that you need to do or
  • 00:43:09
    prompt magic to get it to kind of stick
  • 00:43:11
    to the steps or does that generally work
  • 00:43:13
    pretty good for you know sufficiently
  • 00:43:15
    Advanced models so for the advanced
  • 00:43:18
    models because we have this long context
  • 00:43:20
    we just have like four or five examples
  • 00:43:22
    of this entire reasoning
  • 00:43:25
    protocol right
  • 00:43:27
    that's another reason that the long
  • 00:43:28
    context value matters because now we
  • 00:43:30
have the transcript, the 16 pages of pricing,
  • 00:43:33
    and four examples of reasoning about the
  • 00:43:36
    variables needed to create pricing pages
  • 00:43:38
    and and you know like if it's lower than
  • 00:43:40
    this price offer this package to do this
  • 00:43:44
    um that just becomes like way way more
  • 00:43:46
context that we can use, because we have a
  • 00:43:48
    longer context model but then ultimately
  • 00:43:51
    you run the thing it's like
  • 00:43:52
    178,000 tokens of prompt use right and I
  • 00:43:57
    think what's going to happen is as as
  • 00:44:00
    context models increase we're going to
  • 00:44:02
have much more sophisticated few-shot
  • 00:44:03
    examples maybe we have full examples of
  • 00:44:05
    transcripts in the past and how we
  • 00:44:07
reason about them. We're just going to
  • 00:44:09
    saturate everything as as much as we
  • 00:44:12
can. So we've got all our basics lined up,
  • 00:44:18
closed-loop evaluation, considering
  • 00:44:21
    techniques like
  • 00:44:23
    fine-tuning um are there other
  • 00:44:26
optimizations
  • 00:44:27
beyond fine-tuning, that someone
  • 00:44:31
    might think about once they've got the
  • 00:44:33
    basics lined up in the in the model
  • 00:44:37
    maybe less so but I think a lot of
  • 00:44:38
    people are sort of ignoring the ux and
  • 00:44:42
    the product facing side of things right
  • 00:44:45
    for example if we focus on streaming we
  • 00:44:47
    can make the perceived latency decrease
  • 00:44:50
    right if we just focus hard on building
  • 00:44:53
    great copy and great you know UI to
  • 00:44:55
    collect feedback we might be be able to
  • 00:44:57
    start uh fine-tuning rankers sooner
  • 00:45:00
    rather than later right for example if I
  • 00:45:03
    generate an answer with a bunch of files
  • 00:45:05
    what if I gave the user the ability to
  • 00:45:07
    delete one of the files and regenerate
  • 00:45:08
an answer? That becomes a negative sample
  • 00:45:12
in your reranker, because now we know that
  • 00:45:13
file was irrelevant.
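A minimal sketch of turning that UX action into reranker training data; the log format and function name are hypothetical:

```python
import json
import time

FEEDBACK_LOG = "reranker_feedback.jsonl"

def log_deleted_source(query: str, deleted_chunk: str, kept_chunks: list[str]) -> None:
    """Record a hard negative when a user deletes a cited file and regenerates.

    The deleted chunk is a negative example for the reranker; the chunks the user
    kept are weak positives. The schema here is illustrative.
    """
    with open(FEEDBACK_LOG, "a") as f:
        f.write(json.dumps({
            "ts": time.time(),
            "query": query,
            "negative": deleted_chunk,
            "positives": kept_chunks,
        }) + "\n")
```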
  • 00:45:17
Right? There are a lot of UX features we can build. Like, one of the
  • 00:45:18
great examples I discovered from
  • 00:45:20
working with Zapier for a couple
  • 00:45:22
months was we changed the copy from "how
  • 00:45:25
did we do?" to "did we answer your question
  • 00:45:29
today?" and that in itself 5x'd the amount
  • 00:45:32
    of feedback we were able to collect uh
  • 00:45:34
    per day and that basically me within you
  • 00:45:37
    know within one month we got enough data
  • 00:45:39
    that we could get together as a team
  • 00:45:42
    review all the examples and figure out
  • 00:45:44
    what we want to do next
  • 00:45:45
month, right? Just because we have that
  • 00:45:48
volume. And stuff like that, I think, is
  • 00:45:51
really overlooked, especially when the
  • 00:45:53
    ux can also be used to you know educate
  • 00:45:56
    the user if we discover question types
  • 00:45:58
    that are low volume and low success
  • 00:46:02
    maybe we just uh say no to answering
  • 00:46:04
    those kind of questions we if we have
  • 00:46:06
    question types that are low volume but
  • 00:46:08
    High success maybe we like preview that
  • 00:46:11
    as an example question that you can ask
  • 00:46:13
    and and teach users that we can actually
  • 00:46:15
    do this very well and we should be using
  • 00:46:17
    this to answer those kind of questions
  • 00:46:20
    right a lot of that education in the ux
  • 00:46:22
    I think is something that is often
  • 00:46:24
overlooked at smaller teams. Yeah, along
  • 00:46:27
    the lines of the ux one of the things
  • 00:46:32
    that I've been uh talking a bit about
  • 00:46:34
    recently is this idea like hey we've all
  • 00:46:36
    started with like trying to replicate
  • 00:46:39
    chat GPT for our business's data but
  • 00:46:42
    that chat experience isn't necessarily
  • 00:46:44
    the best experience for everything in
  • 00:46:46
    fact it might not be the best experience
  • 00:46:48
    for a lot of things and uh at least in
  • 00:46:52
    an Enterprise context maybe it's
  • 00:46:54
    different on a product context but in an
  • 00:46:56
    Enterprise
  • 00:46:57
    context um a lot can be gained by
  • 00:47:00
    integrating you know what you're trying
  • 00:47:02
    to accomplish with rag into an existing
  • 00:47:05
    workflow as opposed to creating some new
  • 00:47:07
    Standalone chatbot uh is that something
  • 00:47:09
    that you see in the folks that you work
  • 00:47:11
    with yeah one of my most sort of like
  • 00:47:15
    popular takes on rag is that question
  • 00:47:18
answering is sort of very low-value and
  • 00:47:20
cost-center centric, whereas one of the
  • 00:47:23
    big things I see in the companies that
  • 00:47:26
I've been advising, like
  • 00:47:28
    vantage.com for example they do report
  • 00:47:30
    generation right so instead of saying
  • 00:47:33
    give a data room can I ask a bunch of
  • 00:47:35
    questions about how the founders met and
  • 00:47:37
what is the, you know, TAM of their
  • 00:47:40
business, Vantage just says: if you give me
  • 00:47:43
    a data room I will just pre-generate
  • 00:47:45
    every report that you use to make a
  • 00:47:47
    decision in your
  • 00:47:49
    business and now you can just use the
  • 00:47:51
    workflow of reviewing
  • 00:47:53
    reports but now you can instead of
  • 00:47:55
    processing 40 businesses a quarter you
  • 00:47:58
    can do 80 businesses a quarter right and
  • 00:48:01
    now the question is instead of capturing
  • 00:48:03
a percentage of the cost of labor, we
  • 00:48:06
might be able to capture a
  • 00:48:08
    percentage of the ROI of the decision
  • 00:48:11
    and I think that's where a lot of really
  • 00:48:13
great RAG applications will come
  • 00:48:15
    about right can we capture the ROI
  • 00:48:18
rather than the cost of doing this
  • 00:48:19
kind of work.
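A sketch of that pre-generation pattern: instead of waiting for ad-hoc questions, run a fixed set of report prompts over the data room up front. The report names are hypothetical and `generate_report` is assumed to be an existing RAG pipeline:

```python
# Hypothetical report templates: the questions an analyst would otherwise ask one by one.
REPORT_TEMPLATES = {
    "founding_team": "Summarize the founders' backgrounds and how they met.",
    "market": "Estimate the TAM and describe the competitive landscape.",
    "financials": "Summarize revenue, burn, and runway from the data room.",
}

def pregenerate_reports(data_room_docs: list[str], generate_report) -> dict[str, str]:
    """generate_report(instruction, docs) is assumed to be your existing RAG pipeline."""
    return {
        name: generate_report(instruction, data_room_docs)
        for name, instruction in REPORT_TEMPLATES.items()
    }
```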
  • 00:48:22
And that also lends itself to kind of progressively
  • 00:48:27
    inserting rag into multiple places in a
  • 00:48:30
    long running workflow or business
  • 00:48:34
    process yeah and it goes back to this
  • 00:48:36
idea that if you wish the agent
  • 00:48:38
    had complex reasoning it's because you
  • 00:48:40
    have not thought hard about the problem
  • 00:48:42
    yourself it's a spicy take but I
  • 00:48:45
    oftentimes you know I think people admit
  • 00:48:46
    to uh agreeing with it even just a
  • 00:48:49
little bit. Yeah, multimodal is a popular
  • 00:48:53
topic. We've talked a little bit
  • 00:48:55
    about like
  • 00:48:57
    extracting tables from reports and um
  • 00:49:01
    some of the ways that you are extracting
  • 00:49:03
    metadata from images are there other
  • 00:49:05
    ways that you see multimodal coming up
  • 00:49:07
    yeah I think you know if you follow like
  • 00:49:10
Jo from Vespa or Ben from Answer.AI
  • 00:49:13
    they're all very excited and and me
  • 00:49:14
    included very excited on the models like
  • 00:49:17
ColPali, where we use visual language
  • 00:49:20
    models to do search effectively and not
  • 00:49:24
    only can you do search you can then use
  • 00:49:26
    visual language models to given the
  • 00:49:28
    images answer the question and because
  • 00:49:31
    it's all local well not all local but
  • 00:49:34
    you know it's it's open weights you can
  • 00:49:36
    also inspect the attention mechanism so
  • 00:49:39
    when I ask a question on a PDF I can
  • 00:49:42
    tell where the model is looking to
  • 00:49:44
    determine its relevancy I think there's
  • 00:49:46
    a lot of features there that can be very
  • 00:49:47
    useful in the context of maybe you know
  • 00:49:50
    if we have hundreds of PDFs with
  • 00:49:51
    hundreds of pages we can use something
  • 00:49:53
like ColPali to really be great at, you
  • 00:49:55
    know reading diagrams and understanding
  • 00:49:57
    structure without thinking about the OCR
  • 00:50:00
and the table extraction and all that
  • 00:50:01
    kind of work so that's something I'm
  • 00:50:03
    very excited about exploring we've
  • 00:50:05
    talked a little bit about uh agents
  • 00:50:09
    and
  • 00:50:11
    um you know there's one dimension of
  • 00:50:13
    agents that is like breaking up your
  • 00:50:15
    prompt into a bunch of steps uh and
  • 00:50:17
    using that as a kind of a reasoning
  • 00:50:20
    mechanism um but there's you know I
  • 00:50:23
    guess you could argue whether this is an
  • 00:50:25
    agentic thing or not um but like
  • 00:50:27
    function calls and tools and stuff like
  • 00:50:29
    that do you see those capabilities
  • 00:50:32
    coming into play in the rag systems that
  • 00:50:35
    you're uh
  • 00:50:36
    building yeah I mean I think the real
  • 00:50:39
    question is like how many hops is my rag
  • 00:50:41
    agent allowed to take right like if can
  • 00:50:44
    I do retrieval and then determine that I
  • 00:50:46
    still need to do more retrieval or do I
  • 00:50:48
    only have a couple attempts to answer
  • 00:50:50
    the question when I retrieve data
  • 00:50:53
    um I think the general idea is that
  • 00:50:57
    if we can segment the problem space or
  • 00:50:59
    the query space into these different you
  • 00:51:01
    know
  • 00:51:02
    buckets it probably benefits us to build
  • 00:51:05
    specific indices to serve each set of
  • 00:51:07
    questions right if I know 40% of the
  • 00:51:10
    questions I ask are going to be around
  • 00:51:12
    scheduling I might just develop a data
  • 00:51:14
structure optimized for querying schedules
  • 00:51:17
    and then have a function call hit that
  • 00:51:20
    API
  • 00:51:21
    right and then I think function calling
  • 00:51:24
    is effectively just building out routers
  • 00:51:25
    that can combine these separate indices
  • 00:51:28
    into a single API and letting the
  • 00:51:30
language model determine what's going
  • 00:51:32
on.
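A sketch of that routing idea using OpenAI-style tool definitions, one tool per purpose-built index; the tool names, parameters, and model are illustrative:

```python
from openai import OpenAI

client = OpenAI()

# One tool per purpose-built index; the model routes the question to the right one.
TOOLS = [
    {"type": "function", "function": {
        "name": "search_schedules",
        "description": "Query the index optimized for schedules and dates.",
        "parameters": {"type": "object", "properties": {
            "query": {"type": "string"},
            "start_date": {"type": "string"},
            "end_date": {"type": "string"},
        }, "required": ["query"]},
    }},
    {"type": "function", "function": {
        "name": "search_documents",
        "description": "Semantic search over the general document index.",
        "parameters": {"type": "object", "properties": {
            "query": {"type": "string"},
        }, "required": ["query"]},
    }},
]

def route(question: str):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
        tools=TOOLS,
    )
    return resp.choices[0].message.tool_calls  # which index-backed function(s) to hit
```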
  • 00:51:35
I think there's also a world where, if we have many, many tools, we might want to
  • 00:51:37
    do retrieval and search to figure out
  • 00:51:39
    what tools are relevant right imagine if
  • 00:51:41
we have 200 tools at our disposal; it's
  • 00:51:44
just another precision-recall evaluation
  • 00:51:47
    to figure out whether or not the
  • 00:51:48
    question is finding the right tool but I
  • 00:51:50
    think for the most part I've just been
  • 00:51:52
    able
  • 00:51:52
to really just push precision and recall to
  • 00:51:56
    almost be the hammer uh in a world where
  • 00:51:59
    I think I think llms are the hammer for
  • 00:52:00
    everything I've just gone back to to
  • 00:52:03
basics. And your last example
  • 00:52:06
    spoke
  • 00:52:07
    to um another interesting thing that
  • 00:52:10
    I've seen is like
  • 00:52:12
    using trying to get Beyond you know
  • 00:52:16
    building rag systems just with a bunch
  • 00:52:18
    of text but also including structured
  • 00:52:20
    data um which can be incorporated via
  • 00:52:23
    tools like it sounds like you're seeing
  • 00:52:25
    at least some of that you seeing that uh
  • 00:52:28
    grow I guess in the report generation
  • 00:52:31
    example that you mentioned that would be
  • 00:52:33
    a big part of it right yeah yeah exactly
  • 00:52:36
    I I think I think ultimately it will
  • 00:52:38
    just be function calling plus like the
  • 00:52:41
messages array, and I think that
  • 00:52:43
    can probably do a lot of these cases um
  • 00:52:46
    and outside of that you know like one
  • 00:52:48
    thing I like to say I forget what
  • 00:52:50
    theorem or Paradigm this was but it was
  • 00:52:52
    the idea that all complex systems are
  • 00:52:55
    derived from pre-existing complex
  • 00:52:58
    systems and if you think about chatbots
  • 00:53:00
and finite state machines, you know,
  • 00:53:02
LangGraph is covering that basis, right? It's
  • 00:53:04
    kind of the llm extension of a system
  • 00:53:07
    that already works you know if you think
  • 00:53:09
about, like, the OpenAI Swarm library; it's
  • 00:53:12
    very much like message passing and
  • 00:53:13
    distributed systems and like act the
  • 00:53:15
    actor model of programming so I think
  • 00:53:16
    we're already slowly seeing these
  • 00:53:18
    different forms of agentic programs
  • 00:53:21
    being remapped to like known successful
  • 00:53:26
    working Paradigm for building out these
  • 00:53:29
    kind of complex systems whether it's
  • 00:53:31
like LangGraph or Swarm or anything
  • 00:53:33
    like that I think we've sort of figured
  • 00:53:36
    out what works and our now our job is
  • 00:53:38
    just to scale that better and better I
  • 00:53:40
    guess maybe changing topics slightly you
  • 00:53:44
    uh Beyond rag another thing that you're
  • 00:53:47
    very excited about is like helping
  • 00:53:50
    folks kind of tool up as AI Consultants
  • 00:53:55
    like where did your interest in that
  • 00:53:56
    that come from well I just struggled so
  • 00:53:59
much myself personally, you know what I mean?
  • 00:54:01
    I feel
  • 00:54:02
    like like I didn't work for like a year
  • 00:54:05
    I came back to and I was like oh man
  • 00:54:06
    people are asking me for help I don't
  • 00:54:08
    really know how to turn this into a
  • 00:54:10
    business you know even a year down the
  • 00:54:12
    road I really feel like through a lot of
  • 00:54:14
like AI augmentation, there should be more
  • 00:54:16
    and more individuals who are able to
  • 00:54:19
    scale up their own knowledge work with
  • 00:54:22
    llms so I like well I just think there's
  • 00:54:24
    going to be more businesses like more
  • 00:54:25
    solo business like entrepreneurs making
  • 00:54:28
    six or seven figures and and so okay if
  • 00:54:30
    that's true I should try doing it but I
  • 00:54:34
    just realize that you know I think the
  • 00:54:36
    like if you're a technical person and
  • 00:54:37
    you enjoy technical work it is very hard
  • 00:54:39
    to do the sales and and do the writing
  • 00:54:42
    and figure out how to write proposals to
  • 00:54:44
    like charge more and I think everyone
  • 00:54:46
    tells you to charge more but you don't
  • 00:54:47
    know what that means and there's no
  • 00:54:48
    playbook for that right just say things
  • 00:54:51
    like well just look in the mirror and
  • 00:54:54
    name a price and then double it and if
  • 00:54:55
    you don't keep doubling it and at some
  • 00:54:58
    point you can just ask that
  • 00:55:02
    number and that never worked for me and
  • 00:55:04
so I basically bought a bunch of courses, I
  • 00:55:06
    read a bunch of books and I'm trying to
  • 00:55:08
    distill everything I know into a little
  • 00:55:10
    package on Maven and kind of just like
  • 00:55:13
    sort of save the regret and the
  • 00:55:15
    embarrassment of undercharging for for
  • 00:55:18
    so long and uh you know sort of pass it
  • 00:55:20
    forward and help them help everyone else
  • 00:55:22
    will figure it out yeah you know I feel
  • 00:55:25
    like the first job I did I asked they
  • 00:55:27
    asked me how much I charged and I said
  • 00:55:29
oh, between like 150 and 170 an hour,
  • 00:55:31
    and they just said great we'll do 170
  • 00:55:33
just send me the paperwork, and I was
  • 00:55:35
    like oh wow you answered in three
  • 00:55:37
    seconds I
  • 00:55:39
    really yeah I just I called my
  • 00:55:42
    girlfriend I was like Hey I just took
  • 00:55:43
    food out of both our mouths I'm really
  • 00:55:44
    sorry like I'll do better next maybe
  • 00:55:47
    I'll double it I don't know I'm nervous
  • 00:55:49
    and yeah the course I'm running this
  • 00:55:51
    this uh next month is sort of my goal
  • 00:55:54
    to not do that again yeah I can say uh
  • 00:56:00
    that everything that you mentioned is
  • 00:56:02
    true for being an industry
  • 00:56:06
    analyst and a podcaster you know either
  • 00:56:08
slash both, which I am. Um, you know, I've been
  • 00:56:12
    at it for quite a long time but you know
  • 00:56:14
    there's definitely a learning curve and
  • 00:56:15
    it changes all the time too so it's
  • 00:56:17
awesome. Exactly. Well, we will link to
  • 00:56:20
    uh to that course are you still doing
  • 00:56:23
    the rag course as well the rag course
  • 00:56:25
    we're we're running it again in uh
  • 00:56:27
    February 4th they'll also be six weeks
  • 00:56:29
    I'm pretty excited we already have some
  • 00:56:31
    folks from open AI who's taking the
  • 00:56:32
    course now so I've slowly yeah hopefully
  • 00:56:36
I can help their solutions engineers
  • 00:56:38
improve other people's RAG systems, and so
  • 00:56:40
    I'm very excited for the new cohort
  • 00:56:42
    that's a a bunch of really amazing
  • 00:56:43
    companies involved well we will uh be
  • 00:56:46
    sure to link to those in the show notes
  • 00:56:48
    and maybe we can work out some kind of
  • 00:56:49
    discount code for listeners or something
  • 00:56:52
    um yeah let's do it awesome awesome
  • 00:56:55
    Jason it has been been great catching up
  • 00:56:57
    and uh I feel like we probably could
  • 00:56:59
    have continued on for another hour but
  • 00:57:02
    but we should make sure to to keep in
  • 00:57:03
    touch thanks so much for jumping on and
  • 00:57:05
    sharing a bit about your uh your
  • 00:57:07
    experiences with us it's been super fun
  • 00:57:09
    man thanks so much awesome thank you
  • 00:57:29
    [Music]
Tags
  • AI
  • RAG
  • User Experience
  • Testing
  • Fine-Tuning
  • Machine Learning
  • Consulting
  • Data Evaluation
  • Podcast
  • Expert Insights