Why Your RAG System Is Broken, and How to Fix It with Jason Liu - 709
Summary
TL;DR: In this episode of the TWIML AI Podcast, host Sam Charrington interviews Jason Liu, an AI consultant and expert in retrieval-augmented generation (RAG). They explore the importance of understanding customer needs and the pitfalls of fixating on more complex reasoning in AI models. Jason shares his insights on optimizing AI systems through evaluation metrics, the impact of user experience design, and the significance of building effective datasets. The conversation highlights the balance between generative capabilities and retrieval performance, emphasizing that teams often overlook crucial aspects like precise user needs and system feedback. The episode encourages iterative testing and understanding specific user queries to improve AI functionality.
Takeaways
- 🤔 Understand user needs to guide product development.
- 🔍 Focus on evaluation and testing to improve AI systems.
- 💡 UX design can enhance feedback and engagement.
- 📊 Regular testing helps identify areas of improvement.
- ⚙️ Fine-tuning may not always be necessary; assess use case specifics.
- 📈 Longer context provides better analysis but requires efficiency.
- 🔄 Segmenting problems can streamline solution development.
- 📅 Leverage existing data structures for answering queries effectively.
- 💬 Encourage experimentation and trust in data-driven approaches.
Timeline
- 00:00:00 - 00:05:00
Customers express a desire for more complex reasoning capabilities in AI models, which leads to discussions about understanding customer needs and improving product clarity.
- 00:05:00 - 00:10:00
Sam Charrington introduces Jason Liu, a freelance AI consultant with a background in machine learning and recommendation systems, and the conversation will delve into retrieval-augmented generation (RAG) and system diagnostics.
- 00:10:00 - 00:15:00
Jason shares his educational background and experience at Stitch Fix, where he worked on multimodal embedding for predicting outfit recommendations, which aligns with current RAG principles in AI models.
- 00:15:00 - 00:20:00
He discusses how organizations often seek improvements in their RAG systems, focusing on the need for better embedding models and retrieval mechanisms to enhance customer engagement and business performance.
- 00:20:00 - 00:25:00
Jason criticizes the tendency of companies to focus on tuning generation rather than ensuring that the language model has adequate context, emphasizing the importance of retrieval quality over generation adjustments.
- 00:25:00 - 00:30:00
He mentions biases that affect evaluation of AI systems, such as absence bias and intervention bias, stressing the need to focus on retrieval effectiveness rather than tweaking generative prompts.
- 00:30:00 - 00:35:00
As companies start to recognize the importance of evaluations, Jason highlights the shift towards efficient evaluation practices and how quick tests could foster a more results-oriented environment.
- 00:35:00 - 00:40:00
He provides strategies for building effective datasets and tests, advocating for iterative experiments and synthesis of training data from simple existing questions and their answers.
- 00:40:00 - 00:45:00
Jason discusses the necessity of segmentation in problem-solving, asserting the importance of correctly identifying user questions to develop functional data sets and improve AI system performance.
- 00:45:00 - 00:50:00
He emphasizes that having structured workflows and understanding user questions leads to meaningful data set generation, allowing for precise evaluation of AI models' performance and answering capabilities.
- 00:50:00 - 00:57:34
Further, he explains the benefits of using off-the-shelf embeddings and suggests that detailed experimentation, rather than generic assumptions, should guide the choice of approaches for embedding and retrieval systems.
Video Q&A
What is Retrieval-Augmented Generation (RAG)?
RAG is an approach that combines retrieval of information with natural language generation, allowing models to generate responses based on retrieved data.
How can companies improve their RAG systems?
By focusing on user needs, conducting thorough evaluations, and optimizing retrieval processes rather than solely fine-tuning the generation aspect.
What role does user experience (UX) play in AI systems?
Good UX design can significantly improve user engagement and feedback collection, ultimately enhancing AI system performance.
What should companies prioritize when deploying AI models?
Focusing on the context in which the AI operates, understanding users' workflows, and enabling the AI to add value rather than just answering questions.
Is fine-tuning always necessary for AI models?
Not always; it depends on the specific use case and whether off-the-shelf models can efficiently meet the needs without additional fine-tuning.
What is the importance of context length in RAG?
Longer context length can allow models to analyze more data and provide better responses, but achieving a balance with system efficiency and latency is crucial.
- 00:00:00like the big smell I usually like to
- 00:00:01call out very early in the beginning is
- 00:00:03just you know customers saying man I
- 00:00:04really wish these models were capable of
- 00:00:06more complex
- 00:00:07reasoning it's like do you want the
- 00:00:09complex reasoning because you haven't
- 00:00:11reasoned about what the customer wants
- 00:00:13because the more we really think hard
- 00:00:15and we really reason ourselves like what
- 00:00:19the customer wants the product becomes
- 00:00:21much more clear
- 00:00:23[Music]
- 00:00:35all right everyone welcome to another
- 00:00:37episode of the TWIML AI podcast I am
- 00:00:39your host Sam Charrington today I'm
- 00:00:41joined by Jason Liu Jason is a freelance
- 00:00:43AI consultant and advisor and creator of
- 00:00:46the instructor library before we get
- 00:00:49going be sure to take a moment to hit
- 00:00:50that subscribe button wherever you're
- 00:00:52listening to Today's Show Jason welcome
- 00:00:54to the Pod hey pleased to be here man
- 00:00:56really excited uh I'm excited for this
- 00:00:58conversation as well we've uh spoken a
- 00:01:01few times before and uh I'm looking
- 00:01:04forward to picking your brain on uh all
- 00:01:07things retrieval augmented generation
- 00:01:10and more um and I think a fun way to dig
- 00:01:14into this conversation is to talk
- 00:01:17through uh when you come in and people's
- 00:01:20rag is broken how they go about fixing
- 00:01:23it um but before we dive into that I'd
- 00:01:27love to have you share a little bit
- 00:01:28about your background
- 00:01:30perfect so you know graduated from
- 00:01:31University of Waterloo where we were doing
- 00:01:34a lot of physics and basically after my
- 00:01:37first physics term I realized oh man
- 00:01:39machine learning is definitely going to
- 00:01:40be next and I think the Nobel Prize I
- 00:01:43nailed it right um absolutely so once I
- 00:01:47started doing more machine learning a
- 00:01:48lot of my background was mostly in
- 00:01:49computer vision and recommendation
- 00:01:51systems right so image embeddings text
- 00:01:53embeddings to do recommendation systems
- 00:01:56and it just so happened that now where
- 00:01:57where we do uh rag it's kind of the same
- 00:02:00thing all over again right it's all of
- 00:02:03text embeddings into recommendation
- 00:02:04systems that now feed into language
- 00:02:06models versus people and so it was a
- 00:02:08very nice transition into rag when I
- 00:02:10started doing more uh Consulting and you
- 00:02:13did some of that at Stitch fix if I
- 00:02:14remember correctly yeah most of my
- 00:02:16background was doing like multimodal
- 00:02:18embedding at Stitch Fix from like 2017
- 00:02:20onwards so like taking an outfit putting
- 00:02:23it into some outfit embedding space and
- 00:02:25trying to predict the next outfit that
- 00:02:27the person would want in their box yeah
- 00:02:30you know how you how do you do
- 00:02:31replacement how do you do these like
- 00:02:32recommendation carousels what is a
- 00:02:34similar item all that fun stuff and so
- 00:02:37um what were your first steps
- 00:02:40into uh you know Rag and helping folks
- 00:02:44with uh their gen challenges it's funny
- 00:02:48so you know I I basically took a year
- 00:02:50after I took a year off after I left
- 00:02:51Stitch Fix and when ChatGPT came back
- 00:02:53out and they were talking about rag
- 00:02:55everyone was really amazed by the power
- 00:02:57of text embeddings in my mind text
- 00:03:00embedding was my intern project in 2016
- 00:03:02because I didn't know how to set up
- 00:03:03elastic search
- 00:03:06it's a very exciting ride to come back and
- 00:03:08say oh I have like eight years
- 00:03:09experience doing this kind of stuff
- 00:03:11let's jump in and figure out how can we
- 00:03:13actually learn new embeddings and how do
- 00:03:15we actually improve and measure these
- 00:03:17kind of search systems um especially
- 00:03:20when more of these people now are just
- 00:03:22plugging in OpenAI embeddings and
- 00:03:24just doing some kind of vector search
- 00:03:26there's not much room for
- 00:03:28improvement when it comes to retrieval
- 00:03:29and so a lot of the companies come to me
- 00:03:31and they just say hey it's not really
- 00:03:32working we're losing customers we're
- 00:03:33losing a bit of money how do we make
- 00:03:35this better and that's kind of how the
- 00:03:37conversation uh starts off got it got
- 00:03:40and when you say there's not much room
- 00:03:42for improvement around retrieval what do
- 00:03:45you mean by that I think a lot of
- 00:03:47companies if you think about you know
- 00:03:49how we used to do embedding at know
- 00:03:51Stitch fix Netflix Shopify Spotify a lot
- 00:03:54of it is using user and product
- 00:03:56interaction pairs to train embeddings
- 00:03:58that are optimized for maybe a
- 00:04:00click-through rate or maybe for some
- 00:04:01kind of relevancy metric but to me at
- 00:04:04least it feels pretty crazy that we're
- 00:04:05going to use these external embedding
- 00:04:07models from open Ai and just assume that
- 00:04:11oh yeah of course my question about some
- 00:04:12law is going to be embedded very similar
- 00:04:15to exactly the paragraph that answers
- 00:04:17the question about the esoteric you know
- 00:04:19legal statement that's not really true
- 00:04:21right if you think about even just the
- 00:04:23sentence I love coffee and I hate coffee
- 00:04:26are they similar or
- 00:04:28dissimilar right they could be similar on a
- 00:04:30dating app because they're both
- 00:04:31preferences about coffee but maybe
- 00:04:33they're dissimilar because they're
- 00:04:34negative preferences against each other
- 00:04:37but you know I should be able to choose
- 00:04:40which one that looks like and I think
- 00:04:41that's where a lot of people are getting
- 00:04:42tripped up it's really a big assumption
- 00:04:44to think that we know what is and is not
- 00:04:47similar in this in this embedding space
- 00:04:49interesting interesting I think the
- 00:04:51reason why I zeroed in on on you
- 00:04:54mentioning that is because when I talk
- 00:04:55to folks that are working on Rag and
- 00:04:59their rag is broken they are often
- 00:05:02trying to fix it via tuning the
- 00:05:05generation and that invariably is not
- 00:05:08the right way to do it uh so there
- 00:05:10there's often a lot more Headroom in
- 00:05:13making sure that the llm has the right
- 00:05:15context than um you know fine-tuning the
- 00:05:19prompts but uh let's maybe you know
- 00:05:23when you're called into those situations
- 00:05:25like how do you begin to diagnose the
- 00:05:29the problem yeah so the first the first
- 00:05:32thing I do is I basically just ban
- 00:05:34adjectives as a word that you can use
- 00:05:36during standup I think a lot of
- 00:05:37companies you know describe stuff as good bad
- 00:05:41looks better feels better
- 00:05:44right 10% 20% what does that look like um
- 00:05:48that's that's usually the first step is
- 00:05:49really getting away from this like Vibe
- 00:05:51based estimate of the generation and
- 00:05:53really thinking about retrieval I like
- 00:05:55to think about these two biases I learned
- 00:05:57in this MBA book the first one is called
- 00:05:59absence bias which just says like you
- 00:06:01can't really think about the thing you
- 00:06:03don't see and you see the generation you
- 00:06:05always see the text coming out of the
- 00:06:06language model so you think that's the
- 00:06:08thing that I got to control because I
- 00:06:09don't see the uh the
- 00:06:11content the second thing is
- 00:06:14intervention bias which is I want to
- 00:06:16change things to feel in control and so
- 00:06:19if you want to feel in control of a rag
- 00:06:20application all you got to do is just
- 00:06:22twiddle with the change some text in
- 00:06:24your prompt and hope that all the
- 00:06:27relevant data is in there and usually
- 00:06:28that's not the case right um I think
- 00:06:31that's where a lot of the the issues
- 00:06:33stem from right you're looking too much
- 00:06:35at generation and not really thinking
- 00:06:36about recall or precision and whether or
- 00:06:39not a language model is confused or even
- 00:06:40finding the right
- 00:06:43information and that opens up the whole
- 00:06:46conversation around evals and
- 00:06:49evaluation loops and pipelines and
- 00:06:51flywheels and the like and uh at least
- 00:06:55from my
- 00:06:56perspective yeah 9 months 12 months
- 00:06:59ago like folks were trying to
- 00:07:01figure out how to spell eval and now
- 00:07:03like it's coming up in a lot more
- 00:07:04conversations are you seeing a similar
- 00:07:07shift yeah except the thing I've been
- 00:07:09noticing is we've almost also delegated
- 00:07:13the scoring to language models
- 00:07:16right the LLM as judge idea exactly I
- 00:07:19think it's very useful to get some kind
- 00:07:21of proxy for what is good and what is
- 00:07:22bad but what ends up happening is
- 00:07:25instead of trying to solve the relevancy
- 00:07:26problem we're just solving the problem of
- 00:07:29again prompting another language model
- 00:07:31right I I basically said don't fiddle
- 00:07:33with the generation of the language
- 00:07:34model you you feel in control because
- 00:07:37you can fumble with it but you're not
- 00:07:38going to get any results and they said
- 00:07:39okay well let me work with a different
- 00:07:41prompt
- 00:07:43instead rather than building like a
- 00:07:45Precision recall data set right I think
- 00:07:48you know there are many many tasks that
- 00:07:51take milliseconds to compute that we
- 00:07:53could run tests across thousands of
- 00:07:55examples and and figure what the
- 00:07:56relevancy looks like rather than looking
- 00:07:58at you know LLM as judge literally a couple
- 00:08:01weeks ago during standup a team was like
- 00:08:03hey uh who spent $1,000 this
- 00:08:07weekend and some Junior Engineers like
- 00:08:09oh I was like trying to run evals to see
- 00:08:12how good the new changes are should I
- 00:08:13not do that and like oh wow like they're
- 00:08:16so expensive I've just incentivized
- 00:08:17someone to not run more tests that's
- 00:08:19really not how I want things to be right
- 00:08:21I want tests that are really fast really
- 00:08:23cheap that you should be running every
- 00:08:24every 10 20 minutes when you make a line
- 00:08:27of code uh change in your system and so
- 00:08:29that's it's been pretty funny to sort of
- 00:08:31see that transition and and really push
- 00:08:33to be just a lot faster fail faster and
- 00:08:36do very cheap cheap evaluations and part
- 00:08:39of that is just that building data sets
- 00:08:42is hard and time consuming and hey
- 00:08:45pre-trained models was supposed to get
- 00:08:47me out of that business right yeah I
- 00:08:49think every data scientist every data
- 00:08:51engineer is like yeah you know I I'm
- 00:08:53kind of the janitor right but then what
- 00:08:55happens is you kind of get like you get
- 00:08:57the Roomba and you spill something the
- 00:08:59Roomba just smears it all over the
- 00:09:00whole floor and you're like
- 00:09:02oh I should have done it myself that's
- 00:09:05kind of how I think about these things
- 00:09:06and but
- 00:09:07also in Earnest I think what's really
- 00:09:09happening is because there are so many
- 00:09:12more Engineers coming into the space
- 00:09:15with like less data literacy it actually
- 00:09:18is just very hard to even describe what
- 00:09:20a good data set looks like you know
- 00:09:22Eugene uh Eugene Yan and I were
- 00:09:25basically trying to figure out what does
- 00:09:26good data literacy look like and we we
- 00:09:29really struggled we could only come up with
- 00:09:3110 reasons why someone was data illiterate
- 00:09:34but it's still hard to describe like
- 00:09:35what is the intuition and the vibe of
- 00:09:37when do you give up when do you try new
- 00:09:38things you know if the model said it's
- 00:09:4098% accurate you probably did something
- 00:09:42wrong all those kinds of things are uh
- 00:09:45pretty undocumented I think when it
- 00:09:46comes to making this mental shift so
- 00:09:50when you're talking to folks and you're
- 00:09:52encouraging them to take this initial
- 00:09:54step of
- 00:09:55building uh a data set that will allow
- 00:09:58them to measure their uh evals like do
- 00:10:00they always know what that process you
- 00:10:02know needs to look like to do that uh
- 00:10:05and you know how do you you know guide
- 00:10:08folks that don't through the process of
- 00:10:10building that data set I think they have
- 00:10:13some ideas but what ends up happening is
- 00:10:15some ideas feel so intuitive you almost
- 00:10:17need someone else's permission to
- 00:10:19believe your or trust your gut right
- 00:10:21like the simplest thing to do for
- 00:10:22example is say you know what given a
- 00:10:24text Chunk can I generate a synthetic
- 00:10:26question with a language model and then
- 00:10:28save the these two pairs and then let me
- 00:10:31check whether or not the question I just
- 00:10:32generated finds the text chunk right
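(A minimal sketch of the loop Jason describes, assuming an OpenAI-style client; the model names, the `chunks` input, and the recall@k cutoff are illustrative, not prescriptions from the episode:)

```python
# For each chunk: generate a synthetic question with an LLM, then check
# whether embedding search over all chunks finds that chunk again (recall@k).
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    vecs = np.array([d.embedding for d in resp.data])
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-norm rows

def synthetic_question(chunk: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Write one question this passage answers:\n\n{chunk}"}],
    )
    return resp.choices[0].message.content

def recall_at_k(chunks: list[str], k: int = 5) -> float:
    chunk_vecs = embed(chunks)
    q_vecs = embed([synthetic_question(c) for c in chunks])
    hits = 0
    for i, q in enumerate(q_vecs):
        top_k = np.argsort(chunk_vecs @ q)[::-1][:k]  # cosine sim, descending
        hits += int(i in top_k)
    return hits / len(chunks)
```

This is exactly the kind of millisecond-scale, deterministic test he contrasts with expensive LLM-as-judge evals: once the pairs are saved, rerunning retrieval is nearly free.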
- 00:10:35A lot of times I think you know there are
- 00:10:36Engineers on the team that have this
- 00:10:38idea but they just like hey Jason like
- 00:10:40does this does this make sense like just
- 00:10:43nobody really knows for certain if what
- 00:10:44they're doing is like ridiculous and I
- 00:10:46think a lot of it at least with the
- 00:10:48really great Engineers I work with they
- 00:10:49almost just need permission to trust
- 00:10:51their gut and do these tiny experiments
- 00:10:53because in engineering there is just
- 00:10:56like the edge cases you have to
- 00:10:58enumerate everything right away and then
- 00:11:00you sort of build out your test but here
- 00:11:01it's a lot of it it's like well we
- 00:11:03really do just have to try it we just
- 00:11:05have to try 10 different things and
- 00:11:06figure out what works and what
- 00:11:08doesn't and giving them giving the
- 00:11:10engineering team the permission to trust
- 00:11:11their gut has also been a pretty
- 00:11:13valuable lesson on my end and are those
- 00:11:15the same 10 things in every case or is
- 00:11:18it you know 10 Edge case specific things
- 00:11:21that are different for every person
- 00:11:22who's trying to build a
- 00:11:24system yeah what I find is one thing
- 00:11:28that is often missing they don't
- 00:11:29actually understand what the workflows
- 00:11:32ought to be and what kind of question
- 00:11:33type they ought to serve right I think
- 00:11:36everyone wants AGI right the G stands for
- 00:11:39general and they want to solve
- 00:11:42everything but you know the OpenAI
- 00:11:44definition of of AGI has something to do
- 00:11:46with like economic value like are we
- 00:11:48unlocking economic value for for our
- 00:11:50customer and like the big smell I
- 00:11:53usually like to call out very early in
- 00:11:55the beginning is just you know customers
- 00:11:56saying man I really wish these models
- 00:11:58were capable of more complex
- 00:12:00reasoning it's like do you want the
- 00:12:02complex reasoning because you haven't
- 00:12:03reasoned about what the customer
- 00:12:06wants because because the more we really
- 00:12:09think hard and we really reason
- 00:12:12ourselves like what the customer wants
- 00:12:14the product becomes much more clear
- 00:12:16right um for example day one we have a
- 00:12:20bunch of user questions coming in if we
- 00:12:21can do some kind of clustering and
- 00:12:23segmentation we might find out that oh
- 00:12:25wow you know 30% of all the questions
- 00:12:27are looking for contract and whether or
- 00:12:29not they're signed you know 10% of the
- 00:12:31questions were just who modified the
- 00:12:33document last that's not even in the
- 00:12:36text Chunk but if we just append you
- 00:12:38know an additional token that says like
- 00:12:40Modified by Jason we could now just
- 00:12:42serve 10% of our question base right
- 00:12:45if we just parsed out the dates and like
- 00:12:47an is_signed Boolean variable again we
- 00:12:49could now serve like 30% of our questions
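(A sketch of that clustering-and-segmentation pass over incoming questions, reusing the `embed` helper from the earlier sketch; the cluster count is a knob to tune, not a number from the episode:)

```python
# Cluster user questions to surface segments like "who modified this last?"
# or "is this contract signed?", along with each segment's traffic share.
from collections import Counter
from sklearn.cluster import KMeans

def segment_questions(questions: list[str], n_clusters: int = 20) -> None:
    vecs = embed(questions)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(vecs)
    # Print cluster sizes plus a few example questions per cluster so a
    # human can name each segment and decide which metadata column
    # (modified_by, is_signed, dates, ...) would serve it.
    for cluster, count in Counter(labels).most_common():
        examples = [q for q, l in zip(questions, labels) if l == cluster][:3]
        print(f"cluster {cluster}: {count / len(questions):.0%} of traffic", examples)
```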
- 00:12:51and so I think that the real
- 00:12:53trick is just developing the habit of
- 00:12:56looking at the data but also trusting
- 00:12:58that your your your job is to make these
- 00:13:01hypotheses and your job isn't to be
- 00:13:03right all the time right you have to be
- 00:13:06wrong you have to do these experiments
- 00:13:07and fail that seems like it needs
- 00:13:10to be first like really understanding
- 00:13:13what the questions you're trying to
- 00:13:15serve with uh your system whether it's a
- 00:13:19chatbot or something else um because you
- 00:13:22can't even really build a data set till
- 00:13:24you know what those questions need to
- 00:13:25look like I mean sometimes if you just
- 00:13:27have a bunch of PDFs you could try to
- 00:13:29have the language model answer these
- 00:13:30questions but to my surprise there have
- 00:13:33been times where you know if you use
- 00:13:35like Paul Graham essays and you generate
- 00:13:38synthetic questions off of random text
- 00:13:40chunks you get like 96 and 97% recall
- 00:13:44right the the problem is too easy so
- 00:13:45have to make it harder but there are
- 00:13:48other data sets where I do the same task
- 00:13:49and I get like 60% recall for example
- 00:13:53right if you just take all GitHub issues
- 00:13:55GitHub issues okay yeah like a common
- 00:13:58question that gets generated is like uh
- 00:14:00how to get
- 00:14:01started Well turns out if you don't have
- 00:14:03a filter on repo like repository you
- 00:14:06can't answer the question how best I get
- 00:14:07started because now there's filters
- 00:14:09involved right and turns out if I just
- 00:14:11say best ways to get started in repo I
- 00:14:15now have to sort of parse things out and
- 00:14:17do some filtering and maybe if it's not
- 00:14:18the exact filter I need to do some
- 00:14:20string matching and all the other stuff
- 00:14:23are you saying when people are asking
- 00:14:25you the best way to get started with rag
- 00:14:28or when people want to be able to serve
- 00:14:31or answer the question for their users
- 00:14:33the best way to get started I'm not
- 00:14:35following example answering the question
- 00:14:36like imagine doing like GitHub GitHub
- 00:14:39issue search and I search best way to
- 00:14:40get started right how could I have
- 00:14:43possibly found the trunk that came from
- 00:14:45there's there's thousands of best ways
- 00:14:47to get started documentations right um
- 00:14:50and then and then you do the go oh okay
- 00:14:52actually in order to do this problem
- 00:14:54well I have to do some kind of like repo
- 00:14:57matching mechanism I probably need to do
- 00:14:58some kind of like filtering mechanism
- 00:15:00and now you slowly add complexity into
- 00:15:02the system that you built I think too
- 00:15:04many people just sort of throw the data
- 00:15:06into a bunch of PDFs and go well
- 00:15:08obviously I can just ask you what the
- 00:15:10systematic risks of this investment is
- 00:15:13and that doesn't seem to be the case
- 00:15:16uh so if we're building up
- 00:15:18to steps then you know one step is like
- 00:15:21know your question the um you know next
- 00:15:24step might be build out your test set um
- 00:15:28and the third step is to you know think
- 00:15:32really hard about like metadata and like
- 00:15:35sourcing data that the llm can use to
- 00:15:40answer the question or really that a
- 00:15:42pre-processor can use to get the right
- 00:15:44information to the llm actually we're
- 00:15:45still in retrieval at this point yeah so
- 00:15:48so I like to think about it this way I'm
- 00:15:50going to do some segmentation if I was
- 00:15:52going to do marketing I might segment
- 00:15:54against men's and women's and East Coast
- 00:15:55which West Coast every every problem you
- 00:15:58want to solve you kind want to segment
- 00:16:00in some way so we're ultimately going to
- 00:16:02find these segments in uh the question
- 00:16:05space and then ultimately there's two
- 00:16:07kinds of segments there's going to be
- 00:16:09segments that don't do well because we
- 00:16:11have capabilities issues so for example
- 00:16:14if I ask who modified this document last
- 00:16:17if I don't have that metadata I can't
- 00:16:19answer that question so I need to
- 00:16:20improve my capabilities right that the
- 00:16:24the row exists but I need an extra
- 00:16:27column the other world is like inventory
- 00:16:31issues where the column doesn't exist
- 00:16:33right like if you think of maybe like a
- 00:16:35DoorDash and you find out that uh Greek
- 00:16:38restaurants near me is a terrible search
- 00:16:40query the solution might be to buy iPads
- 00:16:42for Greek restaurants so they can come
- 00:16:44onto the platform so there's been other
- 00:16:46times when we do this kind of debugging
- 00:16:48and we realize oh wow we don't have the
- 00:16:50data to answer these quick questions we
- 00:16:52don't have the scheduling information to
- 00:16:53do this we don't have the tables
- 00:16:55extracted to do this and so there's
- 00:16:57usually two kinds of solutions you got
- 00:16:58to start integrating right are there
- 00:17:00more capabilities by adding like more
- 00:17:02rows to your data set or just like sorry
- 00:17:05more columns or add inventory and improve
- 00:17:08the number of rows we have on a data set
- 00:17:09and that's kind of like how I think
- 00:17:10about these things adding rows in all of
- 00:17:13the examples you've given is a much
- 00:17:15longer process than adding columns
- 00:17:18exactly exactly I think sometimes it's
- 00:17:21like oh man like we need to figure out
- 00:17:22contracts so we need to figure out you
- 00:17:25know who if someone is responsible the
- 00:17:27feature people really care about can we
- 00:17:29contact them in some way and now you
- 00:17:31start building an actual application
- 00:17:33because you know what the customer wants
- 00:17:35right it's not just ask a question but
- 00:17:37to take some action make some decision
- 00:17:39you
- 00:17:40mentioned
- 00:17:42the you know this issue of using
- 00:17:45off-the-shelf
- 00:17:47embedding uh models are you finding are
- 00:17:51you advising folks to like
- 00:17:53build their own
- 00:17:54embedding systems or just get better
- 00:17:57data to the ones that are there or like
- 00:17:59is there a decision Matrix around and in
- 00:18:03fact there are a lot of similar
- 00:18:04questions that I run into and I've
- 00:18:09heard
- 00:18:10mixed uh you know mixed opinions as to
- 00:18:14like if they're even important and it's
- 00:18:16like the embedding uh scheme the
- 00:18:20chunking strategy like headings and
- 00:18:23other um you know contextual information
- 00:18:26around chunks like do you have
- 00:18:29are there one-size-fits-all
- 00:18:31answers to these or is it a decision
- 00:18:33Matrix or like how do you think about
- 00:18:35the space of like you know
- 00:18:37implementation details around uh
- 00:18:40embedding it's a really good question
- 00:18:43primarily because why guess when we can
- 00:18:45test
- 00:18:49right I think this is a matter of really
- 00:18:51just you know investing more in a data
- 00:18:53set and just going great well let me let
- 00:18:55me just run like 30 experiments you know
- 00:18:58over the weekend I'll come back and I'll
- 00:19:00just know the answer and I think that's
- 00:19:02kind
- 00:19:03of even the nature of that question to
- 00:19:05me is a symptom of sort of the lack of
- 00:19:07having really good evaluations on your
- 00:19:10own data sets but likewise the the
- 00:19:13answer to the question is an indication
- 00:19:16that there's there aren't clear patterns
- 00:19:19and it's very data set and use case
- 00:19:22specific and you know if you said for
- 00:19:25example yeah like you know we really
- 00:19:28really only ever see chunking strategies
- 00:19:30giving a you know one 2% lift it's not
- 00:19:33usually worth it like that would be
- 00:19:34really informative but you didn't say
- 00:19:36that so you know there must be cases
- 00:19:38where you change your chunking strategy
- 00:19:39and you bam get some great results
- 00:19:42exactly a good example of that might be
- 00:19:44thinking about chunking and then
- 00:19:47completely processing like tables within
- 00:19:49PDFs differently than regular text
- 00:19:51trunks it's like okay like paragraphs
- 00:19:53you chunk and then if you see a table we
- 00:19:55have to save the entire table somewhere
- 00:19:57else as a separate Index right um that
- 00:20:00would be a good example of like when
- 00:20:03chunking really matters I would also say
- 00:20:05you know if you have even thousands of
- 00:20:08examples of questions and labels on
- 00:20:11whether or not a chunk is relevant it's
- 00:20:14probably pretty fruitful to fine tune a
- 00:20:16ranker but even then I think oftentimes
- 00:20:19I surprise myself
- 00:20:20in whether or not um certain
- 00:20:24interventions perform better there are
- 00:20:27times when using hybrid search with
- 00:20:29embeddings and BM25 BM25 is like 3% better
- 00:20:34right that often is the case if the
- 00:20:37person who is searching the data is
- 00:20:39aware of the file names and the text
- 00:20:41that's in the data like if I wrote my
- 00:20:42own
- 00:20:43essay it'll be easier for me to find it
- 00:20:45if I use full text search because I know
- 00:20:47what I
- 00:20:49wrote right
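(Sketching the hybrid setup he's comparing — BM25 plus embeddings — assuming the `rank_bm25` package for the full-text side and the `embed` helper from the earlier sketch for the vector side; the reciprocal-rank-fusion constant 60 is a common default, not from the episode:)

```python
# Hybrid search: rank with BM25 and with embeddings, then fuse the two
# rankings with reciprocal rank fusion (RRF).
import numpy as np
from rank_bm25 import BM25Okapi

def hybrid_search(query: str, docs: list[str], k: int = 10) -> list[str]:
    bm25 = BM25Okapi([d.lower().split() for d in docs])
    bm25_rank = np.argsort(bm25.get_scores(query.lower().split()))[::-1]
    doc_vecs, q_vec = embed(docs), embed([query])[0]
    emb_rank = np.argsort(doc_vecs @ q_vec)[::-1]
    # RRF: each doc scores 1/(60 + rank) in each ranking; sum and re-sort.
    scores = np.zeros(len(docs))
    for ranking in (bm25_rank, emb_rank):
        for rank, idx in enumerate(ranking):
            scores[idx] += 1.0 / (60 + rank)
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]
```

Whether the fusion beats either side alone is exactly the superstition he says a fast eval set should settle per dataset.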
- 00:20:53there's been times when rankers don't improve the
- 00:20:55performance of the model and then there
- 00:20:57are times when the rankers do and again
- 00:20:58it just becomes the
- 00:20:59Superstition
- 00:21:01but it is very easy to sort of absolve
- 00:21:04myself of the Superstition by having
- 00:21:06these tests that run really really fast
- 00:21:08right just is the order better these
- 00:21:10things are really great whereas if we go
- 00:21:11into factuality or like self-consistency
- 00:21:14and like context recall who knows right
- 00:21:17maybe the model just wants to choose
- 00:21:19itself now you're doing a whole set of
- 00:21:21other experiments to prove that the
- 00:21:22model is aligned with a metric you never
- 00:21:26made up and that's when I think things
- 00:21:28get
- 00:21:28[Laughter]
- 00:21:31and
- 00:21:32expensive yeah
- 00:21:35yeah even TimeWise I feel like there's
- 00:21:37been times where we have like
- 00:21:38summarization prompts so for example um
- 00:21:42when we want to retrieve images do we
- 00:21:43use like a clip
- 00:21:45embedding well that means we also have
- 00:21:47to use a clip embedding for the text
- 00:21:50what I've seen do go really well is
- 00:21:52actually using a visual language model
- 00:21:53to give a detailed
- 00:21:55description as like a paragraph of the
- 00:21:57image and then just T Ed the paragraph
- 00:22:01but what this means is now the the uh
- 00:22:04describe this image prompt is another
- 00:22:06hyperparameter to experiment against and
- 00:22:09I've seen situations where if you just
- 00:22:11say describe this image uh recall is
- 00:22:14like
- 00:22:1527% but if you can teach this concept of
- 00:22:18recall to an engineer and you just make
- 00:22:20them Hill Climb for like a day and a
- 00:22:21half we've been able to get to like an
- 00:22:2387% recall just by improving the
- 00:22:26prompt WR a prompt tell me what doesn't
- 00:22:29recover okay uh find the blueprint but
- 00:22:32also count the number of rooms okay now
- 00:22:34it's like 35% okay also transcribe the
- 00:22:37street addresses and include that in the
- 00:22:40description 70% you know also describe
- 00:22:44like like whether it's north facing and
- 00:22:46east facing and like also describe the
- 00:22:48positions of the cabins and all of a
- 00:22:50sudden you have a 96% recall system for
- 00:22:52finding blueprints because you actually
- 00:22:55worked on the prompt
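(A sketch of that hill climb, treating the image-description prompt as a hyperparameter; `describe` stands in for whatever vision-language-model call your stack provides, `embed` is the helper from earlier, and the candidate prompts are paraphrased from the episode — all assumptions:)

```python
# Measure recall for each candidate description prompt against a labeled
# set of (image, target-document) pairs, and keep the best one.
import numpy as np

CANDIDATE_PROMPTS = [
    "Describe this image.",
    "Describe this image and count the number of rooms.",
    "Describe this image, transcribe any street addresses, and note orientation.",
]

def best_prompt(images, target_texts, prompts=CANDIDATE_PROMPTS, k=5):
    corpus_vecs = embed(target_texts)
    scores = {}
    for prompt in prompts:
        descriptions = [describe(img, prompt) for img in images]  # VLM call, assumed
        hits = sum(
            i in np.argsort(corpus_vecs @ q)[::-1][:k]
            for i, q in enumerate(embed(descriptions))
        )
        scores[prompt] = hits / len(images)
    return max(scores, key=scores.get), scores
```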
- 00:22:57sounds like feature engineering hey
- 00:22:59I can't say that because then they
- 00:23:01got confused but uh yeah it ends up
- 00:23:04being it right it's like oh this is this
- 00:23:05classical machine learning but we were
- 00:23:07able to Hill Climb because our eval is
- 00:23:10very fast it takes you know 50
- 00:23:12milliseconds to try again and try again
- 00:23:15and I think that's where a lot of things
- 00:23:16can be really optimized for but yeah
- 00:23:18it's definitely just feature engineering
- 00:23:19but that's also another
- 00:23:23word are you um you know one of the
- 00:23:26questions that that comes up a lot
- 00:23:28around the the whole idea of evals is
- 00:23:31like tooling like um do you have you
- 00:23:37know go-to answers for that I'm guessing
- 00:23:39it's going to be yeah build your data
- 00:23:41set in like you know some silly eval in
- 00:23:44a you know notebook but uh did do you
- 00:23:48find
- 00:23:49that there's a point at which it becomes
- 00:23:52more complex and there's you know some
- 00:23:55you know open source or off the-shelf
- 00:23:57tooling that makes it difference for
- 00:23:58folks so I would say if you are working
- 00:24:01independently you are likely best off
- 00:24:04just like writing things to a JSON-lines
- 00:24:06file or like a SQLite file primarily
- 00:24:08because you're just building out these
- 00:24:10very fast evals right like if you're
- 00:24:11just comparing like length of summary
- 00:24:13divided by length of input and figuring
- 00:24:15out if there's a compression rate I want
- 00:24:16to set a goal against super fast what I
- 00:24:19do with when I work with bigger
- 00:24:21companies is I use uh Brain Trust
- 00:24:23primarily
- 00:24:25because like Brain Trust was basically
- 00:24:27built because ER has just been like
- 00:24:29sharing screenshots of like results from
- 00:24:31a jupyter notebook every once in a while
- 00:24:32you're like I need a tool that does
- 00:24:33better than this and so often times if
- 00:24:36you need to collaborate on sharing data
- 00:24:39sets collaborate on sharing results and
- 00:24:41getting feedback and you know coworkers on
- 00:24:43your team to label data with you I think
- 00:24:46that's when a tool really really shines
- 00:24:49those for the collaboration aspect not
- 00:24:51because the evaluations are better in
- 00:24:54some kind of way exactly because the
- 00:24:56evaluations you have to build yourself
- 00:24:59right you're not pulling evaluations off the
- 00:25:01shelf yeah factuality the
- 00:25:04self-consistency like that stuff to me
- 00:25:06is uh it's crazy you know it's just
- 00:25:10like it's like it's like having someone
- 00:25:13grade their own assignment it's like oh
- 00:25:14man I just I
- 00:25:16hope um and then like the other stuff
- 00:25:19that's valuable is like okay how can I
- 00:25:22you know down sample my production
- 00:25:24traffic to also run these evaluations to
- 00:25:26make sure that things are running
- 00:25:27properly
- 00:25:28can I monitor these things over time
- 00:25:30right a really simple example is just I
- 00:25:34have a company where we do meeting
- 00:25:36summarization and we plot the average
- 00:25:39length of a transcript and we plot the
- 00:25:42average length of the summary divided by
- 00:25:44the average of the
- 00:25:45transcript and every once in a while
- 00:25:47there's like a blip and like why did
- 00:25:48that blip happen well it turns out you
- 00:25:51know we ran a marketing campaign and we
- 00:25:52got a whole new set of users and these
- 00:25:54new users are doing threeh hour long
- 00:25:57podcasts and it's really
- 00:25:59bad because the summary is just like
- 00:26:01they talked about
- 00:26:02AI
- 00:26:05right you're like oh man like the the
- 00:26:08compression rate is too high now let's
- 00:26:10go do something great we build a rule
- 00:26:13that says if the call is less than an
- 00:26:15hour we can use this prompt if we use
- 00:26:17greater than an hour can we use that
- 00:26:19prompt okay the ratios are like
- 00:26:21recovering a little bit and can we
- 00:26:22monitor that and I think that's how I
- 00:26:24think about building these systems have
- 00:26:25like the dumbest evals possible to tell
- 00:26:27you what to look at uh you mentioned
- 00:26:31compression rate previously before
- 00:26:33talking about this specific example is
- 00:26:35that a metric that you've applied
- 00:26:39broadly or is it just this transcription
- 00:26:42summarization thing yeah I mean I've
- 00:26:46just found a lot of the applications I
- 00:26:48tend to work on are ones where we're
- 00:26:49doing a lot of
- 00:26:50summarization and summarization is a
- 00:26:52very uh interesting task
- 00:26:56because like llms are good at
- 00:26:58summarization in the sense that indeed
- 00:27:00the output is shorter than the input but
- 00:27:03it's actually very hard to evaluate like
- 00:27:05what is a good summary and like when do
- 00:27:06we lose nuance and all that kind of
- 00:27:08stuff and
- 00:27:09so you know obviously we can have the
- 00:27:12entire like llm as a
- 00:27:14judge model of doing things but ideally
- 00:27:17we have much more like much simpler
- 00:27:19metrics right so I have metrics of just
- 00:27:22you know length of summary divided by
- 00:27:24length of uh transcript I also have the
- 00:27:27counts of named entities right for
- 00:27:29example if the summary is all mentioning
- 00:27:31my name and versus the summary just
- 00:27:33going like they thought it was you know
- 00:27:36it's like it's very like ambiguous and
- 00:27:38so can we can we can we preserve some
- 00:27:40kind of information density there there
- 00:27:42they're all proxies for some you know
- 00:27:45satisfaction or Nuance that we can also
- 00:27:47use a language model against but
- 00:27:49um looking at like odd examples of just
- 00:27:52simple numbers still can tell you a lot
- 00:27:54of
- 00:27:55information right like what we found was
- 00:27:57when we plot summary length by um
- 00:28:01transcript length it would go up and
- 00:28:04then after like 20,000 tokens it
- 00:28:06actually got shorter again I'm like
- 00:28:09okay that took six minutes to plot out
- 00:28:11and like write the data for but now we
- 00:28:14know that there's some weird behavior
- 00:28:15when the transcript is really really
- 00:28:18long great let me change my prompt rerun
- 00:28:21this it's trade again perfect we're good
- 00:28:25was there a step before changing the
- 00:28:26prompt that was trying to understand
- 00:28:28like the intuition for why that might be
- 00:28:30happening or was that ancillary
- 00:28:32to actually getting the problem don't
- 00:28:35remember like what we did in that
- 00:28:36example I I think we we just kind of saw
- 00:28:39that like oh wow not only is it getting
- 00:28:42dropping the variance is also increasing
- 00:28:44as we drop and so what we want a prompt
- 00:28:47that has lower variance in the like
- 00:28:49compression rate that seems like a very
- 00:28:52like healthy and quantifiable goal where
- 00:28:55we can just say hey Jason I tried three
- 00:28:58different prompts and I was able to drop
- 00:29:00the standard deviation by like
- 00:29:0240% and it now is like monotonically
- 00:29:05increasing as a function of context like
- 00:29:08that becomes so scientific and so
- 00:29:10quantifiable that we don't have to worry
- 00:29:12about some of these like bigger things
- 00:29:13and obviously we might still lose Nuance
- 00:29:16but um setting a goal against that is
- 00:29:19very
- 00:29:21easy my sense is that folks coming to
- 00:29:25this uh you know
- 00:29:29fresh and being told hey you should
- 00:29:31build a test data set um you know kind
- 00:29:35of wrapping their head around Precision
- 00:29:37recall whether
- 00:29:39you can keep those two straight without
- 00:29:40looking it up that's another issue but
- 00:29:42like you know that's like oh that's
- 00:29:44probably something that I need to be
- 00:29:45able to measure and test against is like
- 00:29:48um you know an obvious thing uh
- 00:29:52compression rate feels like less obvious
- 00:29:55or more nuanced in some way are
- 00:29:59there other kind of nuanced types of
- 00:30:03things I think you there's I guess I'm
- 00:30:06thinking there are two ways that you get
- 00:30:08this either one like you know banging
- 00:30:10your head against your problem and you
- 00:30:12know this is probably the best way to
- 00:30:14come up with these things but you know
- 00:30:16part of what we're trying to do is like
- 00:30:18accelerate learning and provide
- 00:30:19shortcuts like what are the shortcuts
- 00:30:22that you've come across for different
- 00:30:24problem classes like oh these four
- 00:30:26metrics like you probably wouldn't think
- 00:30:27about them but you know when you did
- 00:30:29like you discover that these come up all
- 00:30:31the time do do you have that list
- 00:30:34another one that is pretty reasonable in
- 00:30:37this like summarization task just
- 00:30:39whether or not it uh adheres to a
- 00:30:41certain schema and a certain uh
- 00:30:44formatting but again I try my best to
- 00:30:47just write a regular expression that
- 00:30:48tries to capture this as quickly as
- 00:30:50possible right and you know I could have
- 00:30:53like a six-point grading scale on whether
- 00:30:55or not it fits the the the markdown
- 00:30:58format that I want but then I lose all
- 00:31:01like there's too much Nuance right
- 00:31:02really I just want to have a bunch of
- 00:31:03pass-fail tests that are very binary
- 00:31:06where I can say great show me 10
- 00:31:08examples where I failed 10 examples
- 00:31:10where I succeeded let's let me just go
- 00:31:13like think really hard and figure out
- 00:31:15what is happening and how can I change
- 00:31:17that um outside of that I find it a lot
- 00:31:19of it ends up being very very specific
- 00:31:22there's an example where I generate
- 00:31:24action items but I want to evaluate
- 00:31:26whether or not the action item is
- 00:31:27correctly assigned to the person right
- 00:31:31uh that's just a very specific eval that
- 00:31:33you have to build and it ends up being
- 00:31:35very challenging sometimes to uh
- 00:31:37correctly assign something like that are
- 00:31:40you finding that you're always running
- 00:31:43all of your eval Suite whenever you're
- 00:31:46making any change or um do you find like
- 00:31:51running specific you know feature
- 00:31:54specific evals uh and then running your
- 00:31:57broader suite uh less
- 00:32:00frequently yeah it depends on like what
- 00:32:04kind of eval to be honest I if I can
- 00:32:06afford to I would just rather run them
- 00:32:07all the time because you never know what
- 00:32:09kind of
- 00:32:10cross uh influence there is like a
- 00:32:13really simple example was we had both an
- 00:32:15executive summary and a list of action
- 00:32:18items and the action item description
- 00:32:21was too
- 00:32:22long you're like great well uh make the
- 00:32:26action item shorter
- 00:32:28and then we got uh an equal length
- 00:32:30action item but just fewer action
- 00:32:34items but that test is easy
- 00:32:36because we can just like parse out the
- 00:32:38asterisk and count them and that's one
- 00:32:40EV like I literally had an Eva that was
- 00:32:42just count the number of action items
- 00:32:44the second one was like what is the
- 00:32:45average character count of the action
- 00:32:47items and what is the average summary
- 00:32:49count so then you do okay well
- 00:32:52uh just make the description of the
- 00:32:56action items shorter and then all of a
- 00:32:57sudden the summary is also
- 00:32:59shorter right and so there is cross
- 00:33:02contamination and one of the things
- 00:33:03that's valuable is to go okay
- 00:33:06well I don't know why but
- 00:33:09controlling one and not the other is
- 00:33:11so difficult I'm going to break this
- 00:33:13down into two
- 00:33:15tasks I'm gonna have a summary task and
- 00:33:17an action item task and the reason I've
- 00:33:19done this the reason I've added this
- 00:33:21extra
- 00:33:22complexity is because I have all these
- 00:33:24experiments to prove that I can't figure
- 00:33:26out how to combine them
- 00:33:28right maybe if a new model comes out and
- 00:33:31it's better and more steerable we can
- 00:33:34re-evaluate
- 00:33:36this but I'm going to separate these to
- 00:33:38two different tasks because I cannot get
- 00:33:40the evals to match uh what I want in
- 00:33:43terms of performance right so now you
- 00:33:45can have this idea of like I'm going to
- 00:33:47segment to make this simpler but there
- 00:33:50are conditions when I would recombine
- 00:33:52these tasks and maybe when you know Haiku
- 00:33:543.5 comes out I'll rerun my old
- 00:33:56evals see if I can fix these things and
- 00:33:59and justify some of these Investments
- 00:34:01but the idea really is you're making
- 00:34:04your resource allocation and and how you
- 00:34:05spent your time how you designed your
- 00:34:08system and its
- 00:34:10complexity based on the trade-offs
- 00:34:12you're making with
- 00:34:13evals and again these evals are just
- 00:34:15regular expressions not anything
- 00:34:17fancy you're not calling any LLM
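(A sketch of those regex-only evals for the action-item example above, assuming the items come back as markdown bullets; the exact pattern depends on your output format:)

```python
# Count action items by parsing markdown bullets and track average lengths,
# so "make the items shorter" can't silently become "emit fewer items"
# or shrink the executive summary as a side effect.
import re

def action_item_evals(output: str) -> dict[str, float]:
    items = re.findall(r"^\s*[-*] (.+)$", output, flags=re.MULTILINE)
    return {
        "n_action_items": len(items),
        "avg_item_chars": sum(map(len, items)) / len(items) if items else 0.0,
    }
```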
- 00:34:19fine-tuning comes up in the context
- 00:34:23of uh rag and gen more broadly you know
- 00:34:27what role do you see for
- 00:34:30fine-tuning uh in the types of systems
- 00:34:34that you know we're typically building
- 00:34:35for rag I would say if you're going to
- 00:34:37start fine tuning the first thing to
- 00:34:39fine tune is likely going to be
- 00:34:40something like a Cohere reranker right
- 00:34:43that's where you're going to have the
- 00:34:44least amount of data you required it's
- 00:34:46going to be very easy to label this data
- 00:34:49if you have a bunch of questions and
- 00:34:50text junks you can probably ask like the
- 00:34:52smartest most expensive model you have
- 00:34:54and just label thousands of examples
- 00:34:57right
- 00:34:58so just transfer learning that task is
- 00:35:01pretty affordable probably for $50 you
- 00:35:03you can get a fine-tuned reranker that
- 00:35:05outperforms anything off the
- 00:35:08shelf
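(A sketch of the labeling step only, assuming an OpenAI-style client as the "smartest most expensive model"; the fine-tune call itself is provider-specific — e.g. Cohere's reranker fine-tuning — so it isn't shown, and the JSONL field names are assumptions:)

```python
# Distill relevance labels from an expensive model into (query, passage,
# label) rows that a reranker fine-tune job can consume.
import json
from openai import OpenAI

client = OpenAI()

def label_pair(question: str, chunk: str) -> int:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": "Does this passage answer the question? "
                              f"Reply yes or no.\n\nQ: {question}\n\nPassage: {chunk}"}],
    )
    return int(resp.choices[0].message.content.strip().lower().startswith("yes"))

def write_training_file(pairs: list[tuple[str, str]],
                        path: str = "rerank_train.jsonl") -> None:
    with open(path, "w") as f:
        for q, c in pairs:
            row = {"query": q, "passage": c, "label": label_pair(q, c)}
            f.write(json.dumps(row) + "\n")
```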
- 00:35:11I don't know whether fine-tuning embedding models is worth it just
- 00:35:12because of like it's just annoying to
- 00:35:15like own inference but I think that's
- 00:35:18the second easiest thing to fine-tune is
- 00:35:19fine tuning an embedding model to do search
- 00:35:22better after that the only thing I would
- 00:35:25really fine-tune is any kind of query
- 00:35:26rewriting steps
- 00:35:28so you know can I given a question parse
- 00:35:30it out to you know query start date end
- 00:35:34date can I can I map it to metadata
- 00:35:36filters that I think people should be
- 00:35:38fine-tuning because it's a very specific
- 00:35:40task you can fine-tune like a llama
- 00:35:43model you can host it in a way that has
- 00:35:46fast inference and it's usually going to
- 00:35:48be pretty effective
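(A sketch of that query-rewriting step using Jason's own instructor library to pull structured filters out of a question; the schema fields here are hypothetical stand-ins for whatever metadata your index actually has:)

```python
# Parse a free-form question into a query string plus metadata filters
# (dates, repo, ...) that the retriever can apply before embedding search.
from datetime import date
from typing import Optional

import instructor
from openai import OpenAI
from pydantic import BaseModel

client = instructor.from_openai(OpenAI())

class SearchQuery(BaseModel):
    query: str
    start_date: Optional[date] = None  # hypothetical filter fields
    end_date: Optional[date] = None
    repo: Optional[str] = None

def parse_question(question: str) -> SearchQuery:
    return client.chat.completions.create(
        model="gpt-4o-mini",
        response_model=SearchQuery,
        messages=[{"role": "user", "content": f"Parse this search request: {question}"}],
    )
```

This is also the task he suggests fine-tuning a small hosted model (e.g. a Llama variant) for, since it is narrow and latency-sensitive.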
- 00:35:50whereas I find it's pretty challenging to really think about
- 00:35:52how do you fine tune like 4o to
- 00:35:55do answer generation you
- 00:35:58would need to be pretty Justified to
- 00:36:00explore that especially
- 00:36:02because as these model like these models
- 00:36:04are going to get better in a way that we
- 00:36:05can't control and they're always going
- 00:36:07to be have better recall they're going
- 00:36:09to have better robustness towards like
- 00:36:12low Precision text chunks it's hard to
- 00:36:15beat them because they actually have all
- 00:36:16the data whereas um for something like
- 00:36:22rankers you know they don't have that
- 00:36:24data like we are the ones that are able
- 00:36:26to capture the value have this data set
- 00:36:28fine-tune and outperform uh the public
- 00:36:33benchmarks one of the the model related
- 00:36:37questions that comes up all the time
- 00:36:40especially as the you know the big
- 00:36:43models get better is like do I need to
- 00:36:47think about any of this in a large
- 00:36:50context length uh you know regime like
- 00:36:55do I need to rerank do I need to you
- 00:36:58know embed like chunk can I just throw
- 00:37:02everything um you know of course for
- 00:37:04some definitions of everything it's
- 00:37:05going to be too big you know bigger than
- 00:37:07whatever context window you have but
- 00:37:10like uh assuming a large context um and
- 00:37:15you know assuming context sufficient for
- 00:37:18you know a a lot of your
- 00:37:22context
- 00:37:23[Music]
- 00:37:24um yeah you get where I'm going with
- 00:37:26this like
- 00:37:29yeah I
- 00:37:30mean I think what's really going to
- 00:37:32happen is we're going to go in the same
- 00:37:34way that like the iPhone battery life
- 00:37:35has gone
- 00:37:37right like we've never had a better
- 00:37:40battery and then longer battery life
- 00:37:41we've just had more powerful
- 00:37:45applications and so I think as context
- 00:37:48increases we're just going to have way
- 00:37:49more complex instructions with like
- 00:37:51different personalities or you know
- 00:37:53maybe not only is going to have the
- 00:37:54context length it's going to have my you
- 00:37:56know my history all this kind of stuff
- 00:37:59that said I think there's a great place
- 00:38:01for long context models especially when
- 00:38:03we have a few documents I would almost
- 00:38:05rather always shove everything into
- 00:38:06context right but we're always going to
- 00:38:08run into latency tradeoffs if we think
- 00:38:11of the recommendation systems or you
- 00:38:13know e-commerce systems we know that
- 00:38:16even a 100 milliseconds 300 milliseconds
- 00:38:18of latency could be a 1% Revenue hit I
- 00:38:21think that'll be the same thing for
- 00:38:22these language models right there's
- 00:38:24always going to be some Frontier of
- 00:38:26context length and latency and business
- 00:38:28outcome that we're going to have to make
- 00:38:30tradeoffs against I think that's what
- 00:38:31that's what's really going to happen
- 00:38:32yeah I was wondering if you had more
- 00:38:36Nuance around the way you think about
- 00:38:39the generation side um I I guess my
- 00:38:44observation is like the length of the
- 00:38:46context itself is insufficient as a
- 00:38:50determinant of success right and you
- 00:38:53know that's why for example we have
- 00:38:54reranking because you know within a
- 00:38:57given context length the model can't
- 00:39:00really follow the plot all the way from
- 00:39:02the top to the bottom right and so like
- 00:39:05just saying like this is the number of
- 00:39:07context length doesn't say enough about
- 00:39:09how well the model how good a job the
- 00:39:12model does at attending to all the
- 00:39:14various things in the context and so um
- 00:39:18you know that gets to you know Concepts
- 00:39:21like precision and recall and other
- 00:39:22things like do you have a structured way
- 00:39:25that you think about that or like the
- 00:39:26way that you would approach evaluating a
- 00:39:30different context length so when I use a
- 00:39:34longer context model I'm usually working
- 00:39:35with a very few set of documents right
- 00:39:38so the question is like okay is the
- 00:39:39relevant information in text Chunk split
- 00:39:42across many documents or really is going
- 00:39:44to be a very few documents and a good
- 00:39:46example of this is we have an agent
- 00:39:49that's job is to take sales calls
- 00:39:52reference your pricing pages and give
- 00:39:55you a compelling personalized pricing on
- 00:39:58a certain service that you provide right
- 00:40:00so we have a onh hour long transcript we
- 00:40:02have a 16-page PDF that describes our
- 00:40:04pricing options for like different
- 00:40:06add-ons and whatnot and the prompt goes
- 00:40:08as follows right it says here's a
- 00:40:10transcript here is 16 pages of our
- 00:40:13pricing first list out all the variables
- 00:40:17that are required to determine whether
- 00:40:19or not you can personalize the price and
- 00:40:21then it does that and then for
- 00:40:24everything that we list out extract out
- 00:40:26exactly what part of the transcript they
- 00:40:28mention this variable so first it list
- 00:40:31out the variables and then it lists out
- 00:40:32the variables hydrated by
- 00:40:34excerpts then you know reread the
- 00:40:37transcript and the the page and list out
- 00:40:41the resulting like price number that you
- 00:40:44can give it and then construct a
- 00:40:46follow-up email that offers a
- 00:40:48personalized
- 00:40:49price so what we're really doing is
- 00:40:51we're trying to just push the language
- 00:40:53model to do a lot of very specific Chain
- 00:40:55of Thought where you're kind of
- 00:40:56extracting the data then organizing it
- 00:40:58again in a smaller package and then as
- 00:41:01you generate the email you assume that
- 00:41:03we're kind of only attending over this
- 00:41:04like prepared notepad that the language
- 00:41:07model determined um that's mostly how I
- 00:41:10think about using long context models
- 00:41:13when it comes to very few data which is
- 00:41:14just to say I want you to attend over
- 00:41:16everything reorganize the information in
- 00:41:19Your Chain of Thought in your scratch
- 00:41:20pad and then finally give me a final
- 00:41:22result and that has usually worked
- 00:41:25pretty well that that's been the
- 00:41:26difference between we could ship to
- 00:41:28something that is actually sending
- 00:41:29followup emails right now in production
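(A sketch of that single-generation prompt shape — one long context, with labeled scratch pads before the final email; the wording is paraphrased from the episode, not his production prompt:)

```python
# One prompt, one generation: the model works through staged "scratch pads"
# (variables, excerpts, price) before attending only over its own notes
# to write the final email.
PRICING_PROMPT = """\
Here is a sales-call transcript:
{transcript}

Here are our pricing pages:
{pricing_pages}

Work through the following steps, in order, showing your work:
1. List every variable needed to personalize a price.
2. For each variable, quote the exact transcript excerpt that mentions it.
3. Reread both documents and compute the resulting price.
4. Write a follow-up email offering that personalized price.
"""
```

Because it stays a single prompt, swapping in the next long-context model is a one-line change rather than a rework of a multi-prompt agentic pipeline.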
- 00:41:32no that's really interesting so it's
- 00:41:33kind of speaking to like long context
- 00:41:37doesn't necessarily say that you're
- 00:41:38going to be able to one-shot your answer but
- 00:41:41if you can reduce your context
- 00:41:44systematically you know you know through
- 00:41:46a lens of the way you thought about
- 00:41:47breaking down your problem then you
- 00:41:50could you know one long contest can be a
- 00:41:53convenience for you exactly and it it
- 00:41:56 Exactly. And it keeps the problem simple: it's still a single prompt, right? It's just generating scratchpad one, scratchpad two, scratchpad three. And it actually lets us work in a world where, when the next long-context model exists, we can just swap the model number, rather than going, you know, we had this six-prompt agentic system and I hope—
- 00:42:18 Okay, yeah, I was envisioning a six-prompt agentic system. Tell me, what does the scratchpad mean in that context, and how is the prompt incorporating it?
- 00:42:30 Yeah. So basically I say: okay, first list out the variables, then list out the variables with the transcript excerpts. But it's just doing it.
- 00:42:35 So, just in the generation, you're asking it to show its work, that kind of deal?
- 00:42:40 Yeah, but it's a very, very long "show your work," right? It's maybe 3,000 tokens of planning, just going: well, we can offer a per-seat model if the seat count is greater than 30; this person mentioned 30 was the minimum seat number, and they said they had 48 seats. So it just sort of goes down this, but as a single generation.
- 00:43:03 Interesting. And is there anything that you need to do, any prompt magic, to get it to stick to the steps? Or does that generally work pretty well for sufficiently advanced models?
- 00:43:18 For the advanced models, because we have this long context, we just include four or five examples of this entire reasoning protocol. That's another reason the long context matters: now we have the transcript, 16 pages of pricing, and four examples of reasoning about the variables needed to create pricing — you know, "if it's lower than this price, offer this package," that kind of thing. That just becomes way more context we can use, because we have a longer-context model. But then ultimately you run the thing and it's like 178,000 tokens of prompt, right? And I think what's going to happen is that as context windows increase, we're going to have much more sophisticated few-shot examples — maybe full examples of past transcripts and how we reasoned about them. We're just going to saturate everything as much as we can.
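As a rough sketch of that "saturate the context window" idea — packing full worked examples into the prompt while watching the token budget — here is one way it might look. The `tiktoken` usage is real (it is OpenAI's open-source tokenizer), but the example store and the budget number are assumptions:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

def pack_few_shot(prompt: str, examples: list[str], budget: int = 180_000) -> str:
    """Greedily append full worked examples (past transcript + reasoning)
    until the long-context model's token budget is nearly saturated."""
    used = count_tokens(prompt)
    packed = []
    for ex in examples:
        cost = count_tokens(ex)
        if used + cost > budget:
            break
        packed.append(ex)
        used += cost
    # Few-shot examples first, then the live task.
    return "\n\n".join(packed + [prompt])
```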
- 00:44:18 So: we've got all our basics lined up — closed-loop evaluation, considering techniques like fine-tuning. Are there other optimizations, beyond fine-tuning, that someone might think about once they've got the basics lined up?
- 00:44:37 In the model, maybe less so. But I think a lot of people are ignoring the UX and the product-facing side of things. For example, if we focus on streaming, we can decrease the perceived latency. If we focus hard on building great copy and a great UI to collect feedback, we might be able to start fine-tuning rankers sooner rather than later. For example, if I generate an answer with a bunch of files, what if I gave the user the ability to delete one of the files and regenerate the answer? That becomes a negative sample in your ranker, because now we know that file was irrelevant.
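A minimal sketch of turning that UX event into ranker training data, assuming a JSONL store of (query, document, label) triples; the event shape and file path are illustrative:

```python
import json
from datetime import datetime, timezone

FEEDBACK_PATH = "ranker_feedback.jsonl"  # assumed training-data sink

def log_file_deletion(query: str, deleted_file_id: str,
                      kept_file_ids: list[str]) -> None:
    """When a user deletes a cited file and regenerates the answer, record
    the deleted file as an explicit negative sample for the ranker, and the
    kept files as weak positives."""
    now = datetime.now(timezone.utc).isoformat()
    with open(FEEDBACK_PATH, "a") as f:
        f.write(json.dumps({
            "ts": now,
            "query": query,
            "doc_id": deleted_file_id,
            "label": 0,  # explicit negative: user said this was irrelevant
        }) + "\n")
        for doc_id in kept_file_ids:
            f.write(json.dumps({
                "ts": now,
                "query": query,
                "doc_id": doc_id,
                "label": 1,  # implicit positive: user kept it in context
            }) + "\n")
```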
- 00:45:17 There are a lot of UX features like that we can build. One of the great examples I discovered working with Zapier for a couple of months: we changed the copy from "How did we do?" to "Did we answer your question today?", and that by itself 5x'd the amount of feedback we were able to collect per day. That basically meant that within one month we had enough data that we could get together as a team, review all the examples, and figure out what we want to do next month — just because we had that volume. That kind of thing, I think, is really overlooked, especially since the UX can also be used to educate the user. If we discover question types that are low-volume and low-success, maybe we just say no to answering those kinds of questions. If we have question types that are low-volume but high-success, maybe we preview one as an example question you can ask, and teach users that we can actually do this very well and that they should be using this to answer those kinds of questions. A lot of that education in the UX, I think, is often overlooked at smaller teams.
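One way to operationalize that volume-by-success triage, as a hedged sketch; the thresholds and policy names here are invented for illustration:

```python
def ux_policy(volume: int, success_rate: float,
              min_volume: int = 50, min_success: float = 0.8) -> str:
    """Map a question type's traffic and answer quality to a UX treatment."""
    if success_rate >= min_success:
        # High success: promote it, even at low volume, e.g. as a
        # suggested example question in the UI.
        return "promote_as_example" if volume < min_volume else "serve_normally"
    if volume < min_volume:
        # Low volume and low success: politely decline to answer.
        return "decline_with_message"
    # High volume but low success: the next thing worth engineering on.
    return "prioritize_for_improvement"
```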
- 00:46:27 Yeah, along the lines of the UX: one of the things I've been talking a bit about recently is this idea that we've all started by trying to replicate ChatGPT for our business's data, but that chat experience isn't necessarily the best experience for everything — in fact, it might not be the best experience for a lot of things. And, at least in an enterprise context — maybe it's different in a product context — a lot can be gained by integrating what you're trying to accomplish with RAG into an existing workflow, as opposed to creating some new standalone chatbot. Is that something that you see in the folks that you work with?
- 00:47:15 Yeah, one of my most popular takes on RAG is that question answering is very low-value and cost-center-centric. Whereas one of the big things I see in the companies I've been advising — Vantage (vantage.com), for example — is report generation. So instead of saying, given a data room, can I ask a bunch of questions about how the founders met and what the TAM of their business is, Vantage just says: if you give me a data room, I will pre-generate every report that you use to make a decision in your business. And now you can just use the workflow of reviewing reports. But now, instead of processing 40 businesses a quarter, you can do 80 businesses a quarter. And now the question is: instead of capturing a percentage of the cost of labor, we might be able to capture a percentage of the ROI of the decision. And I think that's where a lot of really great RAG applications will come about: can we capture the ROI, rather than the cost, of doing this kind of work?
- 00:48:22 And that also lends itself to progressively inserting RAG into multiple places in a long-running workflow or business process.
- 00:48:36 Yeah, and it goes back to this idea that if you wish the agent had complex reasoning, it's because you have not thought hard about the problem yourself. It's a spicy take, but oftentimes I think people admit to agreeing with it, even just a little bit.
- 00:48:53 Yeah. Multimodal is a popular topic. We've talked a little bit about extracting tables from reports, and about some of the ways you're extracting metadata from images. Are there other ways that you see multimodal coming up?
- 00:49:07 Yeah. If you follow, say, Jo from Vespa or Ben from Answer.AI, they're all very excited — and me included — about models like ColPali, where we use visual language models to do search effectively. And not only can you do search: you can then use visual language models to answer the question given the images. And because it's — well, not all local, but open weights — you can also inspect the attention mechanism. So when I ask a question about a PDF, I can tell where the model is looking to determine relevancy. I think there are a lot of features there that can be very useful in a context where, say, we have hundreds of PDFs with hundreds of pages: we can use something like ColPali to be really great at reading diagrams and understanding structure, without thinking about the OCR and the table extraction and all that kind of work. So that's something I'm very excited about exploring.
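For a sense of how a ColPali-style retriever scores pages, here is a minimal late-interaction (MaxSim) scorer in plain numpy. The embeddings are assumed to come from embedding each query token and each page-image patch with the vision-language model; this is a sketch of the scoring math, not ColPali's actual API:

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, page_vecs: np.ndarray) -> float:
    """Late-interaction scoring, as in ColBERT/ColPali-style models:
    for each query-token vector, take its best match over all page-patch
    vectors, then sum over query tokens.

    query_vecs: (n_query_tokens, dim); page_vecs: (n_patches, dim);
    both assumed L2-normalized.
    """
    sims = query_vecs @ page_vecs.T        # (n_query_tokens, n_patches)
    return float(sims.max(axis=1).sum())   # best patch per token, summed

def rank_pages(query_vecs: np.ndarray, pages: list[np.ndarray]) -> list[int]:
    """Return page indices sorted by MaxSim score, best first. Each entry
    of `pages` would be the patch-embedding matrix of one PDF page image."""
    scores = [maxsim_score(query_vecs, p) for p in pages]
    return sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)
```

The per-token max in `sims.max(axis=1)` is also what makes the "where is the model looking" inspection possible: each query token's best-matching patch can be highlighted on the page.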
- 00:50:05 We've talked a little bit about agents, and there's one dimension of agents that is breaking up your prompt into a bunch of steps and using that as a kind of reasoning mechanism. But then — you could argue whether this is an agentic thing or not — there are function calls and tools and things like that. Do you see those capabilities coming into play in the RAG systems that you're building?
- 00:50:39 Yeah, I mean, I think the real question is: how many hops is my RAG agent allowed to take? Can I do retrieval and then determine that I still need to do more retrieval, or do I only have a couple of attempts to answer the question once I retrieve data?
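As a sketch of that "how many hops" budget, here is a bounded retrieval loop; `search`, `is_sufficient`, and `answer` are hypothetical stand-ins for a retriever and two LLM calls:

```python
def answer_with_hops(question: str, search, is_sufficient, answer,
                     max_hops: int = 3) -> str:
    """Retrieve, check whether the accumulated context can answer the
    question, and retrieve again only while the hop budget allows it."""
    context: list[str] = []
    query = question
    for hop in range(max_hops):
        context.extend(search(query))
        if is_sufficient(question, context):  # LLM judge or heuristic
            break
        # In practice an LLM call would reformulate the follow-up query
        # around what is still missing; this placeholder just notes the hop.
        query = f"{question} (information still missing after hop {hop + 1})"
    return answer(question, context)
```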
- 00:50:53 I think the general idea is that if we can segment the problem space, or the query space, into these different buckets, it probably benefits us to build specific indices to serve each set of questions. If I know 40% of the questions I get are going to be around scheduling, I might just develop a data structure optimized for querying schedules, and then have a function call hit that API. And then I think function calling is effectively just building out routers that can combine these separate indices into a single API, letting the language model determine what's going on.
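A hedged sketch of that routing pattern — exposing each specialized index as a tool and letting the model pick — written in the OpenAI-style JSON-schema tool format; the tool names and the placeholder backends are invented for illustration:

```python
# Each specialized index becomes one tool; the LLM acts as the router.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "search_schedules",
            "description": "Query the schedule-optimized index "
                           "(meetings, dates, availability).",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "search_documents",
            "description": "Vector/full-text search over the general index.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
]

def search_schedules(query: str) -> list[str]:
    """Placeholder: would hit the schedule-optimized data structure."""
    return []

def search_documents(query: str) -> list[str]:
    """Placeholder: would hit the general document index."""
    return []

# Dispatch table from the tool name the model chooses to its backend.
ROUTES = {"search_schedules": search_schedules,
          "search_documents": search_documents}
```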
- 00:51:32 I think there's also a world where, if we have many, many tools, we might want to do retrieval and search to figure out which tools are relevant. Imagine we have 200 tools at our disposal: it's just another precision-recall evaluation, figuring out whether or not the question is finding the right tool. But for the most part, I've been able to push precision and recall to almost be the hammer — in a world where I think LLMs are the hammer for everything, I've just gone back to basics.
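When the tool catalog is that large, tool selection itself becomes retrieval. A minimal sketch, assuming a generic `embed` function (any sentence-embedding model) and cosine similarity over tool descriptions:

```python
import numpy as np

def select_tools(question: str, tool_descriptions: list[str],
                 embed, k: int = 5) -> list[int]:
    """Embed the question and every tool description, then return the
    indices of the top-k most similar tools to expose to the model.
    Recall@k here can be measured exactly like document retrieval."""
    q = embed(question)                                   # (dim,)
    T = np.stack([embed(d) for d in tool_descriptions])   # (n_tools, dim)
    q = q / np.linalg.norm(q)
    T = T / np.linalg.norm(T, axis=1, keepdims=True)
    sims = T @ q
    return list(np.argsort(-sims)[:k])
```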
- 00:52:03 And your last example spoke to another interesting thing that I've seen: trying to get beyond building RAG systems with just a bunch of text, and also including structured data, which can be incorporated via tools. It sounds like you're seeing at least some of that — are you seeing it grow? I guess in the report-generation example that you mentioned, that would be a big part of it, right?
- 00:52:36 Yeah, exactly. I think ultimately it will just be function calling plus the messages we already have, and I think that can probably cover a lot of these cases. And outside of that, one thing I like to say — I forget which theorem or paradigm this was — is the idea that all complex systems are derived from pre-existing complex systems. If you think about chatbots and finite state machines, LangGraph is covering that basis: it's kind of the LLM extension of a system that already works. If you think about the OpenAI Swarm library, it's very much like message passing, distributed systems, the actor model of programming. So I think we're already slowly seeing these different forms of agentic programs being remapped onto known, successful, working paradigms for building out these kinds of complex systems, whether it's LangGraph or Swarm or anything like that. I think we've sort of figured out what works, and now our job is just to scale that better and better.
- 00:53:40 I guess, maybe changing topics slightly: beyond RAG, another thing that you're very excited about is helping folks tool up as AI consultants. Where did your interest in that come from?
- 00:53:59 Well, I just struggled so long myself, personally, you know what I mean? I didn't work for like a year, and I came back and was like, oh man, people are asking me for help, and I don't really know how to turn this into a business — even a year down the road. I really feel that, through a lot of AI augmentation, there should be more and more individuals who are able to scale up their own knowledge work with LLMs. So I thought, there are just going to be more solo businesses, more entrepreneurs making six or seven figures — and okay, if that's true, I should try doing it. But I realized that if you're a technical person and you enjoy technical work, it is very hard to do the sales, do the writing, and figure out how to write proposals so you can charge more. Everyone tells you to charge more, but you don't know what that means, and there's no playbook for it. People just say things like, well, look in the mirror, name a price, then double it, and keep doubling it, and at some point you can just ask for that number. That never worked for me. So I basically took a bunch of courses, read a bunch of books, and now I'm trying to distill everything I know into a little package on Maven — to save people the regret and the embarrassment of undercharging for so long, and sort of pay it forward and help everyone else figure it out.
- 00:55:25 You know, the first job I did, they asked me how much I charged, and I said, oh, between like 150 and 170 an hour. And they just said, great, we'll do 170, just send me the paperwork. And I was like, oh wow, you answered in three seconds. I called my girlfriend and said, hey, I just took food out of both our mouths, I'm really sorry, I'll do better next time — maybe I'll double it, I don't know, I'm nervous. And yeah, the course I'm running next month is sort of my goal to not do that again.
- 00:56:00 Yeah, I can say that everything you mentioned is true of being an industry analyst and a podcaster — either or both, which I am. I've been at it for quite a long time, but there's definitely a learning curve, and it changes all the time too.
- 00:56:17 Exactly.
- 00:56:20 Well, we will link to that course. Are you still doing the RAG course as well?
- 00:56:25 The RAG course — we're running it again on February 4th, and it'll also be six weeks. I'm pretty excited; we already have some folks from OpenAI taking the course now, so hopefully I can help with their solutions as well as improve other people's RAG systems. I'm very excited for the new cohort — there's a bunch of really amazing companies involved.
- 00:56:46 Well, we will be sure to link to those in the show notes, and maybe we can work out some kind of discount code for listeners or something.
- 00:56:52 Yeah, let's do it.
- 00:56:55 Awesome. Jason, it has been great catching up, and I feel like we probably could have continued for another hour, but we should make sure to keep in touch. Thanks so much for jumping on and sharing a bit about your experiences with us.
- 00:57:07 It's been super fun, man. Thanks so much.
- 00:57:09 Awesome — thank you.
- AI
- RAG
- User Experience
- Testing
- Fine-Tuning
- Machine Learning
- Consulting
- Data Evaluation
- Podcast
- Expert Insights