Why Your RAG System Is Broken, and How to Fix It with Jason Liu - 709

00:57:34
https://www.youtube.com/watch?v=wexpoR1R03A

Summary

TLDR: In this episode of the TWIML AI Podcast, host Sam Charrington interviews Jason Liu, an AI consultant and expert in retrieval-augmented generation (RAG). They explore the importance of understanding customer needs and the pitfalls of fixating on more complex reasoning in AI models. Jason shares his approach to optimizing AI systems through evaluation metrics, the impact of user experience design, and the importance of building effective datasets. The conversation highlights the balance between generative capabilities and retrieval performance, emphasizing that teams often overlook crucial aspects like precise user needs and system feedback. The episode encourages iterative testing and understanding specific user queries to improve AI functionality.

Takeaways

  • 🤔 Understand user needs to guide product development.
  • 🔍 Focus on evaluation and testing to improve AI systems.
  • 💡 UX design can enhance feedback and engagement.
  • 📊 Regular testing helps identify areas of improvement.
  • ⚙️ Fine-tuning may not always be necessary; assess use case specifics.
  • 📈 Longer context provides better analysis but requires efficiency.
  • 🔄 Segmenting problems can streamline solution development.
  • 📅 Leverage existing data structures for answering queries effectively.
  • 💬 Encourage experimentation and trust in data-driven approaches.

Timeline

  • 00:00:00 - 00:05:00

    Customers express a desire for more complex reasoning capabilities in AI models, which leads to discussions about understanding customer needs and improving product clarity.

  • 00:05:00 - 00:10:00

    Sam Charrington introduces Jason Liu, a freelance AI consultant with a background in machine learning and recommendation systems; the conversation will delve into retrieval-augmented generation (RAG) and diagnosing broken RAG systems.

  • 00:10:00 - 00:15:00

    Jason shares his educational background and his experience at Stitch Fix, where he worked on multimodal embeddings for predicting outfit recommendations, work that maps closely onto how RAG systems are built today.

  • 00:15:00 - 00:20:00

    He discusses how organizations often seek improvements in their RAG systems, focusing on the need for better embedding models and retrieval mechanisms to enhance customer engagement and business performance.

  • 00:20:00 - 00:25:00

    Jason criticizes companies' tendency to focus on tuning the generation step rather than ensuring that the language model has adequate context, emphasizing the importance of retrieval quality over generation adjustments.

  • 00:25:00 - 00:30:00

    He mentions biases that affect evaluation of AI systems, such as absence bias and intervention bias, stressing the need to focus on retrieval effectiveness rather than tweaking generative prompts.

  • 00:30:00 - 00:35:00

    As companies start to recognize the importance of evaluations, Jason highlights the shift towards efficient evaluation practices and how quick tests could foster a more results-oriented environment.

  • 00:35:00 - 00:40:00

    He provides strategies for building effective datasets and tests, advocating iterative experiments and synthesizing evaluation data, for example by generating questions from existing text chunks.

  • 00:40:00 - 00:45:00

    Jason discusses the necessity of segmentation in problem-solving, asserting the importance of correctly identifying user questions to develop functional data sets and improve AI system performance.

  • 00:45:00 - 00:50:00

    He emphasizes that having structured workflows and understanding user questions leads to meaningful data set generation, allowing for precise evaluation of AI models' performance and answering capabilities.

  • 00:50:00 - 00:57:34

    Finally, he discusses when off-the-shelf embeddings are sufficient and argues that detailed experimentation, rather than generic assumptions, should guide the choice of embedding and retrieval approaches.

Video Q&A

  • What is Retrieval-Augmented Generation (RAG)?

    RAG is an approach that combines retrieval of information with natural language generation, allowing models to generate responses based on retrieved data (see the minimal sketch after this Q&A list).

  • How can companies improve their RAG systems?

    By focusing on user needs, conducting thorough evaluations, and optimizing retrieval processes rather than solely fine-tuning the generation aspect.

  • What role does user experience (UX) play in AI systems?

    Good UX design can significantly improve user engagement and feedback collection, ultimately enhancing AI system performance.

  • What should companies prioritize when deploying AI models?

    Focusing on the context in which the AI operates, understanding users' workflows, and enabling the AI to add value rather than just answering questions.

  • Is fine-tuning always necessary for AI models?

    Not always; it depends on the specific use case and whether off-the-shelf models can efficiently meet the needs without additional fine-tuning.

  • What is the importance of context length in RAG?

    Longer context length can allow models to analyze more data and provide better responses, but achieving a balance with system efficiency and latency is crucial.
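
To make the retrieve-then-generate loop described in the Q&A above concrete, here is a minimal, self-contained sketch. The bag-of-words embed() and the prompt-building step are toy stand-ins for a real embedding model and a real LLM call; they are assumptions for illustration, not anything used in the episode.

```python
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    # Toy "embedding": a bag-of-words count vector. A real system would call
    # an embedding model here; this stand-in just keeps the example runnable.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query: str, context: list[str]) -> str:
    # The "generation" half: retrieved chunks are placed in front of the
    # question, and the whole prompt would be sent to a language model.
    return "Answer using only this context:\n" + "\n".join(context) + f"\n\nQuestion: {query}"

docs = [
    "The contract was signed on 2024-03-01 by the vendor.",
    "Invoices are due within 30 days of delivery.",
]
question = "When was the contract signed?"
print(build_prompt(question, retrieve(question, docs, k=1)))
```
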

Transcript

[00:00:00] Jason: The big smell I usually like to call out very early on is customers saying, "Man, I really wish these models were capable of more complex reasoning." It's like: do you want the complex reasoning because you haven't reasoned about what the customer wants? Because the more we really think hard and reason ourselves about what the customer wants, the clearer the product becomes.

[00:00:35] Sam: All right everyone, welcome to another episode of the TWIML AI Podcast. I am your host, Sam Charrington. Today I'm joined by Jason Liu. Jason is a freelance AI consultant and advisor and the creator of the Instructor library. Before we get going, be sure to take a moment to hit that subscribe button wherever you're listening to today's show. Jason, welcome to the pod.

[00:00:54] Jason: Hey, pleased to be here, man. Really excited.

[00:00:56] Sam: I'm excited for this conversation as well. We've spoken a few times before, and I'm looking forward to picking your brain on all things retrieval-augmented generation and more. I think a fun way to dig into this conversation is to talk through what happens when you come in and someone's RAG is broken, and how they go about fixing it. But before we dive into that, I'd love to have you share a little bit about your background.

[00:01:30] Jason: Perfect. I graduated from the University of Waterloo, where I was doing a lot of physics, and basically after my first physics term I realized, oh man, machine learning is definitely going to be next. And given the Nobel Prize, I think I nailed it, right?

[00:01:45] Sam: Absolutely.

[00:01:46] Jason: Once I started doing more machine learning, a lot of my background was in computer vision and recommendation systems: image embeddings and text embeddings feeding recommendation systems. And it just so happens that now, where we do RAG, it's kind of the same thing all over again. It's text embeddings going into recommendation systems that now feed language models instead of people. So it was a very nice transition into RAG when I started doing more consulting.

[00:02:13] Sam: And you did some of that at Stitch Fix, if I remember correctly.

[00:02:16] Jason: Yeah, most of my background was doing multimodal embedding at Stitch Fix from about 2017 onwards: taking an outfit, putting it into some outfit embedding space, and trying to predict the next outfit the person would want in their box. How do you do replacement, how do you do these recommendation carousels, what is a similar item, all that fun stuff.

[00:02:37] Sam: So what were your first steps into RAG and helping folks with their generative AI challenges?

[00:02:48] Jason: It's funny. I basically took a year off after I left Stitch Fix, and when ChatGPT came out and people were talking about RAG, everyone was really amazed by the power of text embeddings. In my mind, text embeddings were my intern project in 2016, because I didn't know how to set up Elasticsearch. So it was a very exciting ride to come back and say, oh, I have like eight years of experience doing this kind of stuff; let's jump in and figure out how we can actually learn new embeddings and how we actually improve and measure these kinds of search systems. Especially now that most people are just plugging in OpenAI embeddings and doing some kind of vector search, there's not much room for improvement when it comes to retrieval. So a lot of companies come to me and just say, hey, it's not really working, we're losing customers, we're losing a bit of money, how do we make this better? And that's kind of how the conversation starts off.

[00:03:40] Sam: Got it. And when you say there's not much room for improvement around retrieval, what do you mean by that?

[00:03:47] Jason: If you think about how we used to do embeddings at Stitch Fix, Netflix, Shopify, or Spotify, a lot of it was using user and product interaction pairs to train embeddings optimized for a click-through rate or some kind of relevancy metric. To me, at least, it feels pretty crazy that we're going to use these external embedding models from OpenAI and just assume that, oh yeah, of course my question about some law is going to be embedded very similarly to exactly the paragraph that answers the question about that esoteric legal statement. That's not really true. Think about even just the sentences "I love coffee" and "I hate coffee": are they similar or dissimilar? They could be similar on a dating app because they're both preferences about coffee, but maybe they're dissimilar because they're opposite preferences. I should be able to choose which one it looks like, and I think that's where a lot of people are getting tripped up. It's a big assumption to think we know what is and is not similar in this embedding space.

[00:04:49] Sam: Interesting. I think the reason I zeroed in on you mentioning that is because when I talk to folks who are working on RAG and their RAG is broken, they are often trying to fix it by tuning the generation, and that invariably is not the right way to do it. There's often a lot more headroom in making sure that the LLM has the right context than in fine-tuning the prompts. But when you're called into those situations, how do you begin to diagnose the problem?

[00:05:29] Jason: The first thing I do is basically ban adjectives as words you can use during standup. At a lot of companies, things get described as good or bad, "looks better," "feels better." Okay, but by how much? 20%? What does that look like? That's usually the first step: getting away from a vibe-based estimate of the generation and really thinking about retrieval. I like to think about two biases I learned from an MBA book. The first is absence bias, which just says you can't really think about the thing you don't see. You always see the generation, the text coming out of the language model, so you think that's the thing you have to control, because you don't see the retrieved content. The second is intervention bias: wanting to change things in order to feel in control. If you want to feel in control of a RAG application, all you have to do is twiddle with some text in your prompt and hope that all the relevant data is in there, and usually that's not the case. I think that's where a lot of the issues stem from. You're looking too much at generation and not really thinking about recall or precision, and whether the language model is confused or even finding the right information.

[00:06:43] Sam: And that opens up the whole conversation around evals and evaluation loops and pipelines and flywheels and the like. At least from my perspective, nine or twelve months ago folks were trying to figure out how to spell "eval," and now it's coming up in a lot more conversations. Are you seeing a similar shift?

[00:07:07] Jason: Yeah, except the thing I've been noticing is that we've almost also delegated the scoring back to language models.

[00:07:16] Sam: Right, the LLM-as-judge idea.

[00:07:18] Jason: Exactly. I think it's very useful to get some kind of proxy for what is good and what is bad, but what ends up happening is that instead of trying to solve the relevancy problem, we're just solving the problem of prompting yet another language model. I basically said, don't fiddle with the generation of the language model; you feel in control because you can fumble with it, but you're not going to get any results. And they said, okay, well, let me work with a different prompt instead, rather than building a precision and recall dataset. There are many, many tasks that take milliseconds to compute, where we could run tests across thousands of examples and figure out what the relevancy looks like, rather than reaching for LLM-as-judge. Literally a couple of weeks ago during standup, a team was like, "Hey, who spent $1,000 this weekend?" And a junior engineer said, "Oh, I was trying to run evals to see how good the new changes are. Should I not do that?" Oh wow, the evals are so expensive that I've just incentivized someone to not run more tests. That's really not how I want things to be. I want tests that are really fast and really cheap, that you should be running every 10 or 20 minutes when you make a one-line code change in your system. So it's been pretty funny to see that transition and really push to be a lot faster, fail faster, and do very cheap evaluations.

[00:08:36] Sam: And part of that is just that building datasets is hard and time-consuming, and hey, pre-trained models were supposed to get me out of that business, right?

[00:08:47] Jason: Yeah, I think every data scientist and every data engineer feels like, yeah, I'm kind of the janitor. But then you get the Roomba, you spill something, and the Roomba just smears it all over the floor, and you go, oh, I should have done it myself. That's kind of how I think about these things. But also, in earnest, I think what's really happening is that because there are so many more engineers coming into the space with less data literacy, it's actually very hard to even describe what a good dataset looks like. Eugene Yan and I were basically trying to figure out what good data literacy looks like, and we really struggled; we could only come up with ten reasons why something was data illiterate. It's still hard to describe the intuition and the vibe of when you give up and when you try new things. You know, if the model says it's 98% accurate, you probably did something wrong. All those kinds of things are pretty undocumented when it comes to making this mental shift.

[00:09:46] Sam: So when you're talking to folks and encouraging them to take this initial step of building a dataset that will allow them to measure their retrieval evals, do they always know what that process needs to look like? And how do you guide the folks who don't through the process of building that dataset?

[00:10:10] Jason: I think they have some ideas, but what ends up happening is that some ideas feel so intuitive you almost need someone else's permission to trust your gut. The simplest thing to do, for example, is to say: given a text chunk, can I generate a synthetic question with a language model, save the two as a pair, and then check whether the question I just generated finds the text chunk? A lot of times there are engineers on the team who have this idea, but they're like, "Hey Jason, does this make sense?" Nobody really knows for certain whether what they're doing is ridiculous. With the really great engineers I work with, they almost just need permission to trust their gut and do these tiny experiments. In traditional engineering you have to enumerate the edge cases right away and then build out your tests; here, a lot of it is, well, we really do just have to try it. We have to try ten different things and figure out what works and what doesn't. Giving the engineering team permission to trust their gut has also been a pretty valuable lesson on my end.
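
The "simplest thing" Jason describes, generating a synthetic question per chunk and checking whether that question retrieves its own chunk, reduces to a recall@k loop like the sketch below. Both make_synthetic_question and the search argument are hypothetical stand-ins for whatever question-generation model and retriever a team actually uses.

```python
import random

def make_synthetic_question(chunk: str) -> str:
    # Stand-in for an LLM call like "write a question this passage answers";
    # it just reuses words from the chunk so the example runs end to end.
    words = chunk.split()
    return "What about " + " ".join(random.sample(words, min(3, len(words)))) + "?"

def recall_at_k(chunks: dict[str, str], search, k: int = 5) -> float:
    """chunks: chunk_id -> chunk text. `search(question, k)` returns chunk ids.
    Recall@k = fraction of synthetic questions that retrieve their own source chunk."""
    pairs = [(cid, make_synthetic_question(text)) for cid, text in chunks.items()]
    hits = sum(cid in search(q, k) for cid, q in pairs)
    return hits / len(pairs)

# Usage: plug in the retriever under test, e.g.
# print(recall_at_k(my_chunks, my_retriever.search, k=5))
```
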

[00:11:13] Sam: And are those the same ten things in every case, or are they ten edge-case-specific things that are different for every person who's trying to build a system?

[00:11:24] Jason: What I find is that one thing is often missing: they don't actually understand what the workflows ought to be and what kind of question types they ought to serve. Everyone wants AGI, and the G stands for general, so they want to solve everything. But the OpenAI definition of AGI has something to do with economic value: are we unlocking economic value for our customer? And the big smell I usually like to call out very early on is customers saying, "Man, I really wish these models were capable of more complex reasoning." It's like: do you want the complex reasoning because you haven't reasoned about what the customer wants? Because the more we really think hard and reason ourselves about what the customer wants, the clearer the product becomes. For example, on day one we have a bunch of user questions coming in. If we can do some kind of clustering and segmentation, we might find out that, oh wow, 30% of all the questions are looking for a contract and whether or not it's signed, and 10% of the questions are just "who modified the document last?" That's not even in the text chunk, but if we just append an additional token that says "modified by Jason," we can now serve 10% of our question base. If we just parse out the dates and an is-signed boolean variable, we can now serve another 30% of our questions. So I think the real trick is developing the habit of looking at the data, but also trusting that your job is to make these hypotheses; your job isn't to be right all the time. You have to be wrong, you have to do these experiments, and fail fast.

[00:13:07] Sam: That seems like it needs to come first: really understanding what questions you're trying to serve with your system, whether it's a chatbot or something else, because you can't even really build a dataset until you know what those questions need to look like.

[00:13:25] Jason: I mean, sometimes if you just have a bunch of PDFs, you can try to have the language model answer those questions. But to my surprise, there have been times where, if you use something like Paul Graham essays and you generate synthetic questions off of random text chunks, you get like 96 or 97% recall. The problem is too easy, so you have to make it harder. And there are other datasets where I do the same task and I get like 60% recall. For example, just take all GitHub issues.

[00:13:55] Sam: GitHub issues, okay.

[00:13:56] Jason: Yeah. A common question that gets generated is something like "how to get started." Well, it turns out that if you don't have a filter on repo, on repository, you can't answer "how do I best get started," because now there are filters involved. And if I just say "best ways to get started in repo X," I now have to parse things out and do some filtering, and maybe, if it's not the exact filter, do some string matching and all the other stuff.

[00:14:23] Sam: Are you saying when people are asking you the best way to get started with RAG, or when people want to be able to answer the question "what's the best way to get started" for their users? I'm not following.

[00:14:35] Jason: Answering the question. Imagine doing GitHub issue search, and I search "best way to get started." How could I possibly have found the chunk that question came from? There are thousands of "best way to get started" documents. And then you go, oh, okay, actually, in order to do this problem well, I have to have some kind of repo-matching mechanism, I probably need some kind of filtering mechanism, and now you slowly add complexity into the system that you've built. I think too many people just sort of throw the data into a bunch of PDFs and go, "Well, obviously I can just ask what the systematic risks of this investment are," and that doesn't seem to be the case.

[00:15:16] Sam: So if we're building up to steps, then one step is "know your questions," the next step might be "build out your test set," and the third step is to think really hard about metadata and about sourcing data that the LLM can use to answer the question, or really that a pre-processor can use to get the right information to the LLM. We're actually still in retrieval at this point.

[00:15:47] Jason: Yeah. I like to think about it this way: I'm going to do some segmentation. If I were doing marketing, I might segment men's versus women's, or East Coast versus West Coast; every problem you want to solve, you kind of want to segment in some way. So we're ultimately going to find these segments in the question space, and there are two kinds of segments. There are segments that don't do well because we have capability issues. For example, if I ask "who modified this document last?" and I don't have that metadata, I can't answer that question, so I need to improve my capabilities: the row exists, but I need an extra column. The other world is inventory issues, where the data itself doesn't exist. If you think of something like DoorDash, and you find out that "Greek restaurants near me" is a terrible search query, the solution might be to buy iPads for Greek restaurants so they can come onto the platform. There have been other times when we do this kind of debugging and realize, oh wow, we don't have the data to answer these questions; we don't have the scheduling information, we don't have the tables extracted. So there are usually two kinds of solutions, and you've got to start integrating: do you add capabilities by adding more columns to your dataset, or do you add inventory and increase the number of rows? That's kind of how I think about these things.

[00:17:10] Sam: Adding rows, in all of the examples you've given, is a much longer process than adding columns.

[00:17:18] Jason: Exactly, exactly. Sometimes it's like, oh man, we need to figure out contracts, so we need to figure out who is responsible, and the feature people really care about is whether we can contact them in some way. And now you start building an actual application, because you know what the customer wants: it's not just to ask a question, but to take some action, make some decision.
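
The "modified by," "is it signed," and "in repo X" examples above all amount to attaching extra metadata columns to each chunk and filtering on them before semantic search. A minimal sketch, with invented field names and a placeholder semantic_search function:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Chunk:
    text: str
    repo: Optional[str] = None          # for GitHub-issue style "in repo X" questions
    modified_by: Optional[str] = None   # serves "who touched this last?" questions
    is_signed: Optional[bool] = None    # serves "is this contract signed?" questions
    signed_on: Optional[date] = None

def filter_then_search(chunks: list[Chunk], semantic_search, query: str,
                       repo: Optional[str] = None, is_signed: Optional[bool] = None):
    # Structured filters run first, so "best way to get started in repo X" only
    # searches within repo X; semantic search then ranks whatever is left.
    candidates = [c for c in chunks
                  if (repo is None or c.repo == repo)
                  and (is_signed is None or c.is_signed == is_signed)]
    return semantic_search(query, candidates)
```
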

[00:17:40] Sam: You mentioned this issue of using off-the-shelf embedding models. Are you advising folks to build their own embedding systems, or just to get better data into the ones that are there, or is there a decision matrix? In fact, there are a lot of similar questions I run into where I've heard mixed opinions as to whether they're even important: the embedding scheme, the chunking strategy, headings and other contextual information around chunks. Are there one-size-fits-all answers to these, or a decision matrix, or how do you think about the space of implementation details around embedding?

[00:18:40] Jason: It's a really good question, primarily because: why guess when we can test? I think this is a matter of just investing more in a dataset and going, great, well, let me run 30 experiments over the weekend; I'll come back and I'll just know the answer. Even the nature of that question, to me, is a symptom of the lack of really good evaluations on your own datasets.

[00:19:10] Sam: But likewise, the answer to the question is an indication that there aren't clear patterns and it's very dataset- and use-case-specific. If you had said, for example, "we really only ever see chunking strategies giving a 1 or 2% lift, it's not usually worth it," that would be really informative. But you didn't say that, so there must be cases where you change your chunking strategy and, bam, you get some great results.

[00:19:42] Jason: Exactly. A good example of that might be thinking about chunking and then processing tables within PDFs completely differently from regular text chunks: paragraphs you chunk, and if you see a table, you save the entire table somewhere else as a separate index. That would be a good example of when chunking really matters. I would also say that if you have even thousands of examples of questions and labels on whether or not a chunk is relevant, it's probably pretty fruitful to fine-tune a reranker. But even then, I often surprise myself as to whether certain interventions perform better. There are times when hybrid search with embeddings and BM25 is like 3% better. That's often the case when the person who is searching the data is aware of the file names and the text that's in the data: if I wrote my own essay, it'll be easier for me to find it with full-text search, because I know what I wrote. There have been times when rerankers don't improve the performance of the system, and there are times when they do, and again, it just becomes superstition. But it is very easy to absolve myself of the superstition by having tests that run really, really fast: just, is the ordering better? Those things are really great. Whereas if we go into factuality or self-consistency or context recall, who knows; maybe the model just wants to choose its own output, and now you're doing a whole set of other experiments to prove that the model is aligned with a metric you never really defined yourself. That's when I think things get [laughs] expensive.
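
Jason mentions that hybrid search, embeddings plus BM25, is sometimes a few points better, and that only a fast test tells you when. One common way to merge the two ranked lists is reciprocal rank fusion; RRF is not named in the conversation, so treat this as one possible reading of "hybrid" rather than his specific method.

```python
def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document ids into one.
    Each document scores sum(1 / (k + rank)) over the lists it appears in."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Usage: fuse whatever the two retrievers return.
bm25_hits = ["doc3", "doc1", "doc7"]     # lexical / full-text results
vector_hits = ["doc1", "doc9", "doc3"]   # embedding results
print(reciprocal_rank_fusion([bm25_hits, vector_hits]))  # ['doc1', 'doc3', ...]
```
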

[00:21:35] Jason: Even time-wise, there have been times where we have summarization prompts in the loop. For example, when we want to retrieve images, do we use a CLIP embedding? Well, that means we also have to use a CLIP embedding for the text. What I've seen go really well is actually using a visual language model to give a detailed description of the image, as a paragraph, and then just text-embedding that paragraph. But what this means is that the "describe this image" prompt is now another hyperparameter to experiment against. I've seen situations where, if you just say "describe this image," recall is like 27%. But if you can teach this concept of recall to an engineer and you just make them hill-climb for a day and a half, we've been able to get to like 87% recall just by improving the prompt. Write a prompt, tell me what doesn't get recovered. Okay, find the blueprint, but also count the number of rooms: now it's like 35%. Okay, also transcribe the street addresses and include them in the description: 70%. Also describe whether it's north-facing or east-facing, and describe the positions of the cabinets, and all of a sudden you have a 96% recall system for finding blueprints, because you actually worked on the prompt.

[00:22:55] Sam: Sounds like feature engineering.

[00:22:59] Jason: Hey, I can't say that, because then they get confused. But yeah, that's what it ends up being, right? It's like, oh, this is classical machine learning, but we were able to hill-climb because our eval is very fast: it takes 50 milliseconds to try again and try again. I think that's where a lot of things can really be optimized. But yeah, it's definitely just feature engineering, though that's also another one of those words.

[00:23:26] Sam: One of the questions that comes up a lot around the whole idea of evals is tooling. Do you have go-to answers for that? I'm guessing it's going to be "build your dataset and some silly eval in a notebook," but do you find that there's a point at which it becomes more complex, and there's some open-source or off-the-shelf tooling that makes a difference for folks?

[00:23:58] Jason: I would say if you are working independently, you are likely best off just writing things to a JSON Lines file or a SQLite file, primarily because you're just building out these very fast evals. If you're just comparing length of summary divided by length of input and figuring out whether there's a compression rate you want to set a goal against, that's super fast. What I do when I work with bigger companies is use Braintrust, primarily because Braintrust was basically built because people had just been sharing screenshots of results from Jupyter notebooks, and every once in a while you go, I need a tool that does better than this. So often, if you need to collaborate on sharing datasets, collaborate on sharing results and getting feedback, and get coworkers on your team to label data with you, I think that's when a tool really, really shines.

[00:24:49] Sam: So it's for the collaboration aspect, not because the evaluations are better in some way?

[00:24:54] Jason: Exactly, because the evaluations you have to build yourself; you're not pulling evaluations off the shelf. Factuality, self-consistency, that stuff to me is crazy. It's like having someone grade their own assignment; it's like, oh man, I just hope... And then the other stuff that's valuable is, okay, how can I downsample my production traffic to also run these evaluations and make sure things are behaving in production? Can I monitor these things over time? A really simple example: I have a company where we do meeting summarization, and we plot the average length of a transcript, and we plot the average length of the summary divided by the average length of the transcript. Every once in a while there's a blip. Why did that blip happen? Well, it turns out we ran a marketing campaign, we got a whole new set of users, and these new users are doing three-hour-long podcasts, and it's really bad, because the summary is just "they talked about AI." You're like, oh man, the compression rate is too high; now let's go do something. Great, we build a rule that says if the call is less than an hour we use this prompt, and if it's greater than an hour we use that prompt. Okay, the ratios are recovering a little bit, and can we monitor that? I think that's how I think about building these systems: have the dumbest evals possible to tell you what to look at.
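
The "dumbest eval possible" described here, tracking summary length over transcript length and routing long calls to a different prompt, is only a few lines. The record fields and the words-per-hour cutoff below are invented for illustration:

```python
from statistics import mean

def compression_rate(summary: str, transcript: str) -> float:
    return len(summary) / max(len(transcript), 1)

def daily_compression(records: list[dict]) -> dict[str, float]:
    """records: [{'day': '2024-06-01', 'summary': ..., 'transcript': ...}, ...]
    Returns the average compression rate per day; a sudden drop is the kind of
    'blip' worth investigating (e.g. three-hour podcasts summarized to one line)."""
    by_day: dict[str, list[float]] = {}
    for r in records:
        by_day.setdefault(r["day"], []).append(compression_rate(r["summary"], r["transcript"]))
    return {day: mean(rates) for day, rates in by_day.items()}

def pick_prompt(transcript: str, words_per_hour: int = 9000) -> str:
    # The routing rule from the conversation: one prompt for calls under an hour,
    # another above it. The words-per-hour cutoff is a made-up stand-in for
    # however call duration is actually tracked.
    return "short_call_prompt" if len(transcript.split()) < words_per_hour else "long_call_prompt"
```
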

[00:26:31] Sam: You mentioned compression rate before talking about this specific example. Is that a metric that you've applied broadly, or is it just for this transcript summarization case?

[00:26:42] Jason: I've just found that a lot of the applications I tend to work on are ones where we're doing a lot of summarization, and summarization is a very interesting task, because LLMs are good at summarization in the sense that the output is indeed shorter than the input, but it's actually very hard to evaluate what a good summary is and when we lose nuance, and all that kind of stuff. Obviously we can have the entire LLM-as-a-judge model of doing things, but ideally we have much simpler metrics. So I have metrics that are just length of summary divided by length of transcript. I also have counts of named entities: for example, a summary that mentions my name, versus a summary that just says "they thought it was...," which is very ambiguous. Can we preserve some kind of information density? They're all proxies for some satisfaction or nuance that we could also point a language model at, but looking at odd examples of just simple numbers can still tell you a lot of information. For example, what we found was that when we plotted summary length against transcript length, it would go up, and then after about 20,000 tokens it actually got shorter again. Okay, that took six minutes to plot out and write the data for, but now we know there's some weird behavior when the transcript is really, really long. Great, let me change my prompt and rerun this. Perfect, we're good.

[00:28:23] Sam: Was there a step before changing the prompt that was trying to understand the intuition for why that might be happening, or was that ancillary to actually fixing the problem?

[00:28:32] Jason: I don't remember exactly what we did in that example. I think we just saw that, oh wow, not only is the summary getting shorter, the variance is also increasing as it drops. So what we want is a prompt that has lower variance in the compression rate. That seems like a very healthy and quantifiable goal, where someone can just say, "Hey Jason, I tried three different prompts and I was able to drop the standard deviation by 40%, and it's now monotonically increasing as a function of context." That becomes so scientific and so quantifiable that we don't have to worry about some of these bigger things. Obviously we might still lose nuance, but setting a goal against that is very easy.
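
The goal he describes, lower variance in the compression rate and a summary length that keeps growing with transcript length, is easy to score per prompt variant. A sketch with invented field names:

```python
from statistics import mean, pstdev

def score_prompt_variant(results: list[dict]) -> dict[str, float]:
    """results for one prompt variant: [{'transcript_tokens': int, 'summary_tokens': int}, ...].
    A lower std-dev of the compression ratio, and fewer cases where the summary
    shrinks as transcripts grow, both count as improvements."""
    ratios = [r["summary_tokens"] / r["transcript_tokens"] for r in results]
    ordered = sorted(results, key=lambda r: r["transcript_tokens"])
    lengths = [r["summary_tokens"] for r in ordered]
    # How often does the summary get *shorter* as the transcript gets longer?
    shrinks = sum(later < earlier for earlier, later in zip(lengths, lengths[1:]))
    return {"mean_ratio": mean(ratios), "ratio_std": pstdev(ratios), "shrink_count": shrinks}
```
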

[00:29:25] Sam: My sense is that for folks coming to this fresh and being told, "hey, you should build a test dataset," wrapping their head around precision and recall (whether you can keep those two straight without looking them up is another issue) is the obvious thing to measure and test against. Compression rate feels less obvious, or more nuanced in some way. Are there other nuanced kinds of metrics? I guess I'm thinking there are two ways you get these. One is banging your head against your problem, and that's probably the best way to come up with them, but part of what we're trying to do is accelerate learning and provide shortcuts. What are the shortcuts that you've come across for different problem classes, like "these four metrics, you probably wouldn't think about them, but they come up all the time"? Do you have that list?

[00:30:34] Jason: Another one that is pretty reasonable in this summarization task is just whether or not the output adheres to a certain schema and a certain formatting. But again, I try my best to just write a regular expression that captures this as quickly as possible. I could have a six-point grading scale on whether it fits the markdown format I want, but then there's too much nuance. Really, I just want to have a bunch of pass/fail tests that are very binary, where I can say, great, show me ten examples where I failed and ten examples where I succeeded, and let me go think really hard and figure out what is happening and how I can change that. Outside of that, I find a lot of it ends up being very, very specific. There's an example where I generate action items, but I want to evaluate whether or not each action item is correctly assigned to the right person. That's just a very specific eval that you have to build, and it sometimes ends up being very challenging to correctly assign something like that.
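
His preference for binary pass/fail checks over graded scales can come down to one regular expression per property. The "## Action Items" heading and bullet convention below are an assumed schema, purely for illustration:

```python
import re

def follows_schema(output: str) -> bool:
    """Binary pass/fail: is there an '## Action Items' heading followed by at
    least one bulleted line? Checks like this make it trivial to pull ten
    failing and ten passing examples and read them side by side."""
    has_heading = re.search(r"^## Action Items\s*$", output, flags=re.MULTILINE)
    has_bullet = re.search(r"^\s*[-*] .+", output, flags=re.MULTILINE)
    return bool(has_heading and has_bullet)

# Usage over a batch of model outputs under test:
# failures = [o for o in outputs if not follows_schema(o)][:10]
```
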

[00:31:40] Sam: Are you finding that you're always running all of your eval suite whenever you're making any change, or do you run specific, feature-specific evals and then run your broader suite less frequently?

[00:32:00] Jason: It depends on what kind of eval, to be honest. If I can afford to, I would just rather run them all the time, because you never know what kind of cross-influence there is. A really simple example: we had both an executive summary and a list of action items, and the action item descriptions were too long. You go, great, well, make the action items shorter, and then we got equally long action items, just fewer of them. That test is easy, because we can just parse out the asterisks and count them; I literally had one eval that was just "count the number of action items." The second one was the average character count of the action items, and the average length of the summary. So then you go, okay, just make the descriptions of the action items shorter, and all of a sudden the summary is also shorter. So there is cross-contamination, and one of the things that's valuable is to go, okay, I don't know why, but controlling one and not the other is so difficult that I'm going to break this down into two tasks: a summary task and an action item task. The reason I've added this extra complexity is that I have all these experiments to prove that I can't figure out how to combine them. Maybe if a new model comes out that's better and more steerable, we can re-evaluate this, but I'm going to separate these into two different tasks because I cannot get the evals to match what I want in terms of performance. So now you have this idea: I'm going to segment to make this simpler, but there are conditions under which I would recombine these tasks. Maybe when, say, Haiku 3.5 comes out, I'll rerun my old evals, see if I can fix these things and justify some of those investments. The idea really is that you're making your resource allocation, how you spend your time, and how you design your system and its complexity based on the trade-offs you're making with evals. And again, these evals are just regular expressions; it's not anything fancy, you're not calling any LLM.
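
The two evals he mentions for catching this cross-contamination, counting the action items and tracking their average character length, are each a couple of lines. Parsing out asterisks comes from the conversation; everything else is a sketch:

```python
def count_action_items(markdown: str) -> int:
    # The conversation's eval: parse out the asterisks and count them.
    return sum(1 for line in markdown.splitlines() if line.lstrip().startswith("* "))

def avg_action_item_length(markdown: str) -> float:
    items = [line.strip() for line in markdown.splitlines() if line.lstrip().startswith("* ")]
    return sum(map(len, items)) / len(items) if items else 0.0

# Tracking both numbers side by side is what exposed the trade-off in the episode:
# "make the items shorter" produced fewer items, and later a shorter summary too.
```
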
  • 00:34:23
    of uh rag and gen more broadly you know
  • 00:34:27
    what role do you see for
  • 00:34:30
    fine-tuning uh in the types of systems
  • 00:34:34
    that you know we're typically building
  • 00:34:35
    for rag I would say if you're going to
  • 00:34:37
    start fine tuning the first thing to
  • 00:34:39
    fine tune is likely going to be
  • 00:34:40
    something like a cohero ranker right
  • 00:34:43
    that's where you're going to have the
  • 00:34:44
    least amount of data you required it's
  • 00:34:46
    going to be very easy to label this data
  • 00:34:49
    if you have a bunch of questions and
  • 00:34:50
    text junks you can probably ask like the
  • 00:34:52
    smartest most expensive model you have
  • 00:34:54
    and just label thousands of examples
  • 00:34:57
    right
  • 00:34:58
    so just transfer learning that task is
  • 00:35:01
    pretty affordable probably for $50 you
  • 00:35:03
    can you can get a fine-tune ranker that
  • 00:35:05
    outperforms anything off the
  • 00:35:08
    shelf I don't know whether fine-tuning
  • 00:35:11
    and betting models is worth it just
  • 00:35:12
    because of like it's just annoying to
  • 00:35:15
    like own inference but I think that's
  • 00:35:18
    the second easiest thing to fine-tune is
  • 00:35:19
    fine tuning an edding model to do search
  • 00:35:22
    better after that the only thing I would
  • 00:35:25
    really fine-tuned is any kind of pre
  • 00:35:26
    rewriting steps
  • 00:35:28
    so you know can I given a question parse
  • 00:35:30
    it out to you know query start date end
  • 00:35:34
    date can I can I map it to metadata
  • 00:35:36
    filters that I think people should be
  • 00:35:38
    fine-tuning because it's a very specific
  • 00:35:40
    task you can fine-tune like a llama
  • 00:35:43
    model you can host it in a way that has
  • 00:35:46
    fast inference and it's usually going to
  • 00:35:48
    be pretty effective whereas I find it's
  • 00:35:50
    pretty challenging to really think about
  • 00:35:52
    how do how do you fine tune like 40 to
  • 00:35:55
    do answer generation you I would you
  • 00:35:58
    would need to be pretty Justified to
  • 00:36:00
    explore that especially
  • 00:36:02
    because as these model like these models
  • 00:36:04
    are going to get better in a way that we
  • 00:36:05
    can't control and they're always going
  • 00:36:07
    to be have better recall they're going
  • 00:36:09
    to have better robustness towards like
  • 00:36:12
    low Precision text chunks it's hard to
  • 00:36:15
    beat them because they actually have all
  • 00:36:16
    the data whereas um for something like
  • 00:36:22
    rankers you know they don't have that
  • 00:36:24
    data like we are the ones that are able
  • 00:36:26
    to capture the value have this data set
  • 00:36:28
    fine-tune and outperform uh the public
  • 00:36:33
benchmarks. One of the model-related
  • 00:36:37
    questions that comes up all the time
  • 00:36:40
    especially as the you know the big
  • 00:36:43
    models get better is like do I need to
  • 00:36:47
    think about any of this in a large
  • 00:36:50
    context length uh you know regime like
  • 00:36:55
    do I need to rerank do I need to you
  • 00:36:58
know, embed, chunk? Can I just throw
  • 00:37:02
    everything um you know of course for
  • 00:37:04
    some definitions of everything it's
  • 00:37:05
    going to be too big you know bigger than
  • 00:37:07
    whatever context window you have but
  • 00:37:10
    like uh assuming a large context um and
  • 00:37:15
    you know assuming context sufficient for
  • 00:37:18
    you know a a lot of your
  • 00:37:22
    context
  • 00:37:23
    um yeah you get where I'm going with
  • 00:37:26
    this like
  • 00:37:29
    yeah I
  • 00:37:30
    mean I think what's really going to
  • 00:37:32
    happen is we're going to go in the same
  • 00:37:34
    way that like the iPhone battery life
  • 00:37:35
    has gone
  • 00:37:37
    right like we've never had a better
  • 00:37:40
    battery and then longer battery life
  • 00:37:41
    we've just had more powerful
  • 00:37:45
applications. And so I think as context length
  • 00:37:48
    increases we're just going to have way
  • 00:37:49
    more complex instructions with like
  • 00:37:51
    different personalities or you know
  • 00:37:53
    maybe not only is going to have the
  • 00:37:54
    context length it's going to have my you
  • 00:37:56
    know my history all this kind of stuff
  • 00:37:59
    that said I think there's a great place
  • 00:38:01
    for long context models especially when
  • 00:38:03
    we have a few documents I would almost
  • 00:38:05
    rather always shove everything into
  • 00:38:06
    context right but we're always going to
  • 00:38:08
    run into latency tradeoffs if we think
  • 00:38:11
    of the recommendation systems or you
  • 00:38:13
    know e-commerce systems we know that
  • 00:38:16
    even a 100 milliseconds 300 milliseconds
  • 00:38:18
    of latency could be a 1% Revenue hit I
  • 00:38:21
    think that'll be the same thing for
  • 00:38:22
    these language models right there's
  • 00:38:24
    always going to be some Frontier of
  • 00:38:26
context length and latency and business
  • 00:38:28
    outcome that we're going to have to make
  • 00:38:30
    tradeoffs against I think that's what
  • 00:38:31
    that's what's really going to happen
  • 00:38:32
    yeah I was wondering if you had more
  • 00:38:36
    Nuance around the way you think about
  • 00:38:39
    the generation side um I I guess my
  • 00:38:44
    observation is like the length of the
  • 00:38:46
    context itself is insufficient as a
  • 00:38:50
    determinant of success right and you
  • 00:38:53
    know that's why for example we have
  • 00:38:54
    reranking because you know within a
  • 00:38:57
    given context length the model can't
  • 00:39:00
    really follow the plot all the way from
  • 00:39:02
    the top to the bottom right and so like
  • 00:39:05
just stating the
  • 00:39:07
    context length doesn't say enough about
  • 00:39:09
how good a job the
  • 00:39:12
    model does at attending to all the
  • 00:39:14
    various things in the context and so um
  • 00:39:18
    you know that gets to you know Concepts
  • 00:39:21
    like precision and recall and other
  • 00:39:22
    things like do you have a structured way
  • 00:39:25
    that you think about that or like the
  • 00:39:26
    way that you would approach evaluating a
  • 00:39:30
different context length? So when I use a
  • 00:39:34
    longer context model I'm usually working
  • 00:39:35
    with a very few set of documents right
  • 00:39:38
    so the question is like okay is the
  • 00:39:39
relevant information split, as text chunks,
  • 00:39:42
across many documents, or is it really going
  • 00:39:44
to be in a very few documents? And a good
  • 00:39:46
    example of this is we have an agent
  • 00:39:49
whose job is to take sales calls,
  • 00:39:52
    reference your pricing pages and give
  • 00:39:55
    you a compelling personalized pricing on
  • 00:39:58
    a certain service that you provide right
  • 00:40:00
so we have a one-hour-long transcript, we
  • 00:40:02
    have a 16-page PDF that describes our
  • 00:40:04
    pricing options for like different
  • 00:40:06
    add-ons and whatnot and the prompt goes
  • 00:40:08
    as follows right it says here's a
  • 00:40:10
    transcript here is 16 pages of our
  • 00:40:13
    pricing first list out all the variables
  • 00:40:17
    that are required to determine whether
  • 00:40:19
    or not you can personalize the price and
  • 00:40:21
    then it does that and then for
  • 00:40:24
    everything that we list out extract out
  • 00:40:26
    exactly what part of the transcript they
  • 00:40:28
    mention this variable so first it list
  • 00:40:31
    out the variables and then it lists out
  • 00:40:32
    the variables hydrated by
  • 00:40:34
    excerpts then you know reread the
  • 00:40:37
    transcript and the the page and list out
  • 00:40:41
    the resulting like price number that you
  • 00:40:44
    can give it and then construct a
  • 00:40:46
    follow-up email that offers a
  • 00:40:48
    personalized
  • 00:40:49
    price so what we're really doing is
  • 00:40:51
    we're trying to just push the language
  • 00:40:53
    model to do a lot of very specific Chain
  • 00:40:55
    of Thought where you're kind of
  • 00:40:56
    extracting the data then organizing it
  • 00:40:58
    again in a smaller package and then as
  • 00:41:01
    you generate the email you assume that
  • 00:41:03
    we're kind of only attending over this
  • 00:41:04
    like prepared notepad that the language
  • 00:41:07
model determined. That's mostly how I
  • 00:41:10
    think about using long context models
  • 00:41:13
    when it comes to very few data which is
  • 00:41:14
    just to say I want you to attend over
  • 00:41:16
    everything reorganize the information in
  • 00:41:19
    Your Chain of Thought in your scratch
  • 00:41:20
    pad and then finally give me a final
  • 00:41:22
    result and that has usually worked
  • 00:41:25
pretty well. That's been the
  • 00:41:26
difference in being able to ship
  • 00:41:28
    something that is actually sending
  • 00:41:29
follow-up emails right now in production.
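A sketch of what that single long "scratchpad" prompt might look like; the tag names, step wording, and the example variables in step 1 are illustrative, not the production prompt:

```python
SCRATCHPAD_PROMPT = """\
Here is a one-hour sales call transcript:
<transcript>
{transcript}
</transcript>

Here are 16 pages of our pricing options:
<pricing>
{pricing}
</pricing>

Work through the following steps in order, showing your work for each:
1. List every variable required to decide whether we can personalize a price
   (for example seat count, minimum seats, relevant add-ons).
2. For each variable, quote the exact excerpt of the transcript where it is mentioned.
3. Re-read the pricing pages and, using only the variables and excerpts above,
   work out the price we can offer.
4. Finally, write a follow-up email that presents this personalized price.
"""

def build_prompt(transcript: str, pricing: str) -> str:
    # Steps 1-3 are the "scratchpad"; the email in step 4 attends over that prepared notepad.
    return SCRATCHPAD_PROMPT.format(transcript=transcript, pricing=pricing)
```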
  • 00:41:32
    no that's really interesting so it's
  • 00:41:33
    kind of speaking to like long context
  • 00:41:37
    doesn't necessarily say that you're
  • 00:41:38
going to be able to one-shot your answer, but
  • 00:41:41
    if you can reduce your context
  • 00:41:44
    systematically you know you know through
  • 00:41:46
    a lens of the way you thought about
  • 00:41:47
    breaking down your problem then you
  • 00:41:50
could, you know, long context can be a
  • 00:41:53
    convenience for you exactly and it it
  • 00:41:56
    makes the problem really it's still it's
  • 00:41:57
    still a single prompt now right it's
  • 00:42:00
    just generating like scratch Pad one
  • 00:42:02
    scratch Pad two scratch Pad three and it
  • 00:42:04
    actually lets us work in a world when
  • 00:42:07
    when the next long context model exists
  • 00:42:09
    we can just replace the model number
  • 00:42:11
    rather than going you know we had this
  • 00:42:13
    like six prompt agentic system and I
  • 00:42:16
hope... Okay, yeah, I was envisioning
  • 00:42:18
this as a six-prompt agentic system. Oh
  • 00:42:21
yeah. Tell me, what
  • 00:42:23
    does the scratch Pad mean in that
  • 00:42:25
    context and how is a prompt
  • 00:42:27
    yeah incorporating that so basically I
  • 00:42:30
    say Okay first list out the variables
  • 00:42:32
    then list out the variables and the
  • 00:42:33
    transcripts but it's just doing it so
  • 00:42:35
    just in the the generation you're asking
  • 00:42:38
    it to show its work that kind of deal
  • 00:42:40
    yeah but like so it's like a like a very
  • 00:42:42
    very long show your work right it's like
  • 00:42:44
    it's you know maybe 3,000 tokens of
  • 00:42:46
    planning of just going like well uh we
  • 00:42:49
can offer a per-seat model if the number of
  • 00:42:51
seats is greater than 30; this person
  • 00:42:53
    mentioned 30 was the minimum seat number
  • 00:42:56
    and they said they had 48
  • 00:42:58
    seats so now it just sort of like goes
  • 00:43:01
    down this but it's as a single
  • 00:43:03
    generation interesting and are there is
  • 00:43:06
    there anything that you need to do or
  • 00:43:09
    prompt magic to get it to kind of stick
  • 00:43:11
    to the steps or does that generally work
  • 00:43:13
    pretty good for you know sufficiently
  • 00:43:15
    Advanced models so for the advanced
  • 00:43:18
    models because we have this long context
  • 00:43:20
    we just have like four or five examples
  • 00:43:22
    of this entire reasoning
  • 00:43:25
    protocol right
  • 00:43:27
    that's another reason that the long
  • 00:43:28
    context value matters because now we
  • 00:43:30
have the transcript, the 16 pages of pricing,
  • 00:43:33
    and four examples of reasoning about the
  • 00:43:36
    variables needed to create pricing pages
  • 00:43:38
    and and you know like if it's lower than
  • 00:43:40
    this price offer this package to do this
  • 00:43:44
    um that just becomes like way way more
  • 00:43:46
context that we can use, because we have a
  • 00:43:48
    longer context model but then ultimately
  • 00:43:51
    you run the thing it's like
  • 00:43:52
    178,000 tokens of prompt use right and I
  • 00:43:57
    think what's going to happen is as as
  • 00:44:00
    context models increase we're going to
  • 00:44:02
have much more sophisticated few-shot
  • 00:44:03
    examples maybe we have full examples of
  • 00:44:05
    transcripts in the past and how we
  • 00:44:07
reason about them. We're just going to
  • 00:44:09
    saturate everything as as much as we
  • 00:44:12
can. So we've got all our basics lined up,
  • 00:44:18
closed-loop evaluation, considering
  • 00:44:21
    techniques like
  • 00:44:23
    fine-tuning um are there other
  • 00:44:26
optimizations
  • 00:44:27
beyond fine-tuning, that someone
  • 00:44:31
    might think about once they've got the
  • 00:44:33
    basics lined up in the in the model
  • 00:44:37
    maybe less so but I think a lot of
  • 00:44:38
    people are sort of ignoring the ux and
  • 00:44:42
    the product facing side of things right
  • 00:44:45
    for example if we focus on streaming we
  • 00:44:47
    can make the perceived latency decrease
  • 00:44:50
    right if we just focus hard on building
  • 00:44:53
    great copy and great you know UI to
  • 00:44:55
    collect feedback we might be be able to
  • 00:44:57
    start uh fine-tuning rankers sooner
  • 00:45:00
    rather than later right for example if I
  • 00:45:03
    generate an answer with a bunch of files
  • 00:45:05
    what if I gave the user the ability to
  • 00:45:07
    delete one of the files and regenerate
  • 00:45:08
an answer? That becomes a negative sample
  • 00:45:12
in your reranker, because now we know that
  • 00:45:13
file was irrelevant.
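A minimal sketch of turning that UX action into reranker training data; the log format and function name are hypothetical:

```python
import json
import time

FEEDBACK_LOG = "reranker_feedback.jsonl"

def log_deleted_source(query: str, deleted_chunk: str, kept_chunks: list[str]) -> None:
    """Record a hard negative when a user deletes a cited file and regenerates.

    The deleted chunk is a negative example for the reranker; the chunks the user
    kept are weak positives. The schema here is illustrative.
    """
    with open(FEEDBACK_LOG, "a") as f:
        f.write(json.dumps({
            "ts": time.time(),
            "query": query,
            "negative": deleted_chunk,
            "positives": kept_chunks,
        }) + "\n")
```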
  • 00:45:17
Right? There are a lot of UX features we can build. Like, one of the
  • 00:45:18
great examples I discovered from
  • 00:45:20
working with Zapier for a couple
  • 00:45:22
months was we changed the copy from "how
  • 00:45:25
did we do?" to "did we answer your question
  • 00:45:29
today?" and that in itself 5x'd the amount
  • 00:45:32
    of feedback we were able to collect uh
  • 00:45:34
    per day and that basically me within you
  • 00:45:37
    know within one month we got enough data
  • 00:45:39
    that we could get together as a team
  • 00:45:42
    review all the examples and figure out
  • 00:45:44
    what we want to do next
  • 00:45:45
month, right? Just because we have that
  • 00:45:48
volume. And stuff like that, I think, is
  • 00:45:51
really overlooked, especially when the
  • 00:45:53
    ux can also be used to you know educate
  • 00:45:56
    the user if we discover question types
  • 00:45:58
    that are low volume and low success
  • 00:46:02
    maybe we just uh say no to answering
  • 00:46:04
    those kind of questions we if we have
  • 00:46:06
    question types that are low volume but
  • 00:46:08
    High success maybe we like preview that
  • 00:46:11
    as an example question that you can ask
  • 00:46:13
    and and teach users that we can actually
  • 00:46:15
    do this very well and we should be using
  • 00:46:17
    this to answer those kind of questions
  • 00:46:20
    right a lot of that education in the ux
  • 00:46:22
    I think is something that is often
  • 00:46:24
overlooked at smaller teams. Yeah, along
  • 00:46:27
    the lines of the ux one of the things
  • 00:46:32
    that I've been uh talking a bit about
  • 00:46:34
    recently is this idea like hey we've all
  • 00:46:36
    started with like trying to replicate
  • 00:46:39
    chat GPT for our business's data but
  • 00:46:42
    that chat experience isn't necessarily
  • 00:46:44
    the best experience for everything in
  • 00:46:46
    fact it might not be the best experience
  • 00:46:48
    for a lot of things and uh at least in
  • 00:46:52
    an Enterprise context maybe it's
  • 00:46:54
    different on a product context but in an
  • 00:46:56
    Enterprise
  • 00:46:57
    context um a lot can be gained by
  • 00:47:00
    integrating you know what you're trying
  • 00:47:02
    to accomplish with rag into an existing
  • 00:47:05
    workflow as opposed to creating some new
  • 00:47:07
    Standalone chatbot uh is that something
  • 00:47:09
    that you see in the folks that you work
  • 00:47:11
    with yeah one of my most sort of like
  • 00:47:15
    popular takes on rag is that question
  • 00:47:18
answering is sort of very low-value and
  • 00:47:20
cost-center centric, whereas one of the
  • 00:47:23
    big things I see in the companies that
  • 00:47:26
I've been advising, like
  • 00:47:28
    vantage.com for example they do report
  • 00:47:30
    generation right so instead of saying
  • 00:47:33
    give a data room can I ask a bunch of
  • 00:47:35
    questions about how the founders met and
  • 00:47:37
what is the, you know, TAM of their
  • 00:47:40
business, Vantage just says: if you give me
  • 00:47:43
    a data room I will just pre-generate
  • 00:47:45
    every report that you use to make a
  • 00:47:47
    decision in your
  • 00:47:49
    business and now you can just use the
  • 00:47:51
    workflow of reviewing
  • 00:47:53
    reports but now you can instead of
  • 00:47:55
    processing 40 businesses a quarter you
  • 00:47:58
    can do 80 businesses a quarter right and
  • 00:48:01
    now the question is instead of capturing
  • 00:48:03
a percentage of the cost of labor, we
  • 00:48:06
might be able to capture a
  • 00:48:08
    percentage of the ROI of the decision
  • 00:48:11
    and I think that's where a lot of really
  • 00:48:13
great RAG applications will come
  • 00:48:15
    about right can we capture the ROI
  • 00:48:18
rather than the cost of doing this
  • 00:48:19
kind of work.
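A sketch of that pre-generation pattern: instead of waiting for ad-hoc questions, run a fixed set of report prompts over the data room up front. The report names are hypothetical and `generate_report` is assumed to be an existing RAG pipeline:

```python
# Hypothetical report templates: the questions an analyst would otherwise ask one by one.
REPORT_TEMPLATES = {
    "founding_team": "Summarize the founders' backgrounds and how they met.",
    "market": "Estimate the TAM and describe the competitive landscape.",
    "financials": "Summarize revenue, burn, and runway from the data room.",
}

def pregenerate_reports(data_room_docs: list[str], generate_report) -> dict[str, str]:
    """generate_report(instruction, docs) is assumed to be your existing RAG pipeline."""
    return {
        name: generate_report(instruction, data_room_docs)
        for name, instruction in REPORT_TEMPLATES.items()
    }
```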
  • 00:48:22
And that also lends itself to kind of progressively
  • 00:48:27
    inserting rag into multiple places in a
  • 00:48:30
    long running workflow or business
  • 00:48:34
    process yeah and it goes back to this
  • 00:48:36
idea that if you wish the agent
  • 00:48:38
    had complex reasoning it's because you
  • 00:48:40
    have not thought hard about the problem
  • 00:48:42
    yourself it's a spicy take but I
  • 00:48:45
    oftentimes you know I think people admit
  • 00:48:46
    to uh agreeing with it even just a
  • 00:48:49
little bit. Yeah, multimodal is a popular
  • 00:48:53
topic. We've talked a little bit
  • 00:48:55
    about like
  • 00:48:57
    extracting tables from reports and um
  • 00:49:01
    some of the ways that you are extracting
  • 00:49:03
    metadata from images are there other
  • 00:49:05
    ways that you see multimodal coming up
  • 00:49:07
    yeah I think you know if you follow like
  • 00:49:10
Jo from Vespa or Ben from Answer.AI
  • 00:49:13
    they're all very excited and and me
  • 00:49:14
    included very excited on the models like
  • 00:49:17
ColPali, where we use visual language
  • 00:49:20
    models to do search effectively and not
  • 00:49:24
    only can you do search you can then use
  • 00:49:26
    visual language models to given the
  • 00:49:28
    images answer the question and because
  • 00:49:31
    it's all local well not all local but
  • 00:49:34
    you know it's it's open weights you can
  • 00:49:36
    also inspect the attention mechanism so
  • 00:49:39
    when I ask a question on a PDF I can
  • 00:49:42
    tell where the model is looking to
  • 00:49:44
    determine its relevancy I think there's
  • 00:49:46
    a lot of features there that can be very
  • 00:49:47
    useful in the context of maybe you know
  • 00:49:50
    if we have hundreds of PDFs with
  • 00:49:51
    hundreds of pages we can use something
  • 00:49:53
like ColPali to really be great at, you
  • 00:49:55
    know reading diagrams and understanding
  • 00:49:57
    structure without thinking about the OCR
  • 00:50:00
and the table extraction and all that
  • 00:50:01
    kind of work so that's something I'm
  • 00:50:03
    very excited about exploring we've
  • 00:50:05
    talked a little bit about uh agents
  • 00:50:09
    and
  • 00:50:11
    um you know there's one dimension of
  • 00:50:13
    agents that is like breaking up your
  • 00:50:15
    prompt into a bunch of steps uh and
  • 00:50:17
    using that as a kind of a reasoning
  • 00:50:20
    mechanism um but there's you know I
  • 00:50:23
    guess you could argue whether this is an
  • 00:50:25
    agentic thing or not um but like
  • 00:50:27
    function calls and tools and stuff like
  • 00:50:29
    that do you see those capabilities
  • 00:50:32
    coming into play in the rag systems that
  • 00:50:35
    you're uh
  • 00:50:36
    building yeah I mean I think the real
  • 00:50:39
    question is like how many hops is my rag
  • 00:50:41
    agent allowed to take right like if can
  • 00:50:44
    I do retrieval and then determine that I
  • 00:50:46
    still need to do more retrieval or do I
  • 00:50:48
    only have a couple attempts to answer
  • 00:50:50
    the question when I retrieve data
  • 00:50:53
    um I think the general idea is that
  • 00:50:57
    if we can segment the problem space or
  • 00:50:59
    the query space into these different you
  • 00:51:01
    know
  • 00:51:02
    buckets it probably benefits us to build
  • 00:51:05
    specific indices to serve each set of
  • 00:51:07
    questions right if I know 40% of the
  • 00:51:10
    questions I ask are going to be around
  • 00:51:12
    scheduling I might just develop a data
  • 00:51:14
structure optimized for querying schedules
  • 00:51:17
    and then have a function call hit that
  • 00:51:20
    API
  • 00:51:21
    right and then I think function calling
  • 00:51:24
    is effectively just building out routers
  • 00:51:25
    that can combine these separate indices
  • 00:51:28
    into a single API and letting the
  • 00:51:30
language model determine what's going
  • 00:51:32
on.
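A sketch of that routing idea using OpenAI-style tool definitions, one tool per purpose-built index; the tool names, parameters, and model are illustrative:

```python
from openai import OpenAI

client = OpenAI()

# One tool per purpose-built index; the model routes the question to the right one.
TOOLS = [
    {"type": "function", "function": {
        "name": "search_schedules",
        "description": "Query the index optimized for schedules and dates.",
        "parameters": {"type": "object", "properties": {
            "query": {"type": "string"},
            "start_date": {"type": "string"},
            "end_date": {"type": "string"},
        }, "required": ["query"]},
    }},
    {"type": "function", "function": {
        "name": "search_documents",
        "description": "Semantic search over the general document index.",
        "parameters": {"type": "object", "properties": {
            "query": {"type": "string"},
        }, "required": ["query"]},
    }},
]

def route(question: str):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
        tools=TOOLS,
    )
    return resp.choices[0].message.tool_calls  # which index-backed function(s) to hit
```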
  • 00:51:35
I think there's also a world where, if we have many, many tools, we might want to
  • 00:51:37
    do retrieval and search to figure out
  • 00:51:39
    what tools are relevant right imagine if
  • 00:51:41
we have 200 tools at our disposal; it's
  • 00:51:44
just another precision-recall evaluation
  • 00:51:47
    to figure out whether or not the
  • 00:51:48
    question is finding the right tool but I
  • 00:51:50
    think for the most part I've just been
  • 00:51:52
    able
  • 00:51:52
to really just push precision and recall to
  • 00:51:56
    almost be the hammer uh in a world where
  • 00:51:59
    I think I think llms are the hammer for
  • 00:52:00
    everything I've just gone back to to
  • 00:52:03
basics. And your last example
  • 00:52:06
    spoke
  • 00:52:07
    to um another interesting thing that
  • 00:52:10
    I've seen is like
  • 00:52:12
    using trying to get Beyond you know
  • 00:52:16
    building rag systems just with a bunch
  • 00:52:18
    of text but also including structured
  • 00:52:20
    data um which can be incorporated via
  • 00:52:23
    tools like it sounds like you're seeing
  • 00:52:25
    at least some of that you seeing that uh
  • 00:52:28
    grow I guess in the report generation
  • 00:52:31
    example that you mentioned that would be
  • 00:52:33
    a big part of it right yeah yeah exactly
  • 00:52:36
    I I think I think ultimately it will
  • 00:52:38
    just be function calling plus like the
  • 00:52:41
messages array, and I think that
  • 00:52:43
    can probably do a lot of these cases um
  • 00:52:46
    and outside of that you know like one
  • 00:52:48
    thing I like to say I forget what
  • 00:52:50
    theorem or Paradigm this was but it was
  • 00:52:52
    the idea that all complex systems are
  • 00:52:55
    derived from pre-existing complex
  • 00:52:58
    systems and if you think about chatbots
  • 00:53:00
and finite state machines, you know,
  • 00:53:02
LangGraph is covering that basis, right? It's
  • 00:53:04
    kind of the llm extension of a system
  • 00:53:07
    that already works you know if you think
  • 00:53:09
about, like, the OpenAI Swarm library; it's
  • 00:53:12
    very much like message passing and
  • 00:53:13
    distributed systems and like act the
  • 00:53:15
    actor model of programming so I think
  • 00:53:16
    we're already slowly seeing these
  • 00:53:18
    different forms of agentic programs
  • 00:53:21
    being remapped to like known successful
  • 00:53:26
    working Paradigm for building out these
  • 00:53:29
    kind of complex systems whether it's
  • 00:53:31
like LangGraph or Swarm or anything
  • 00:53:33
    like that I think we've sort of figured
  • 00:53:36
    out what works and our now our job is
  • 00:53:38
    just to scale that better and better I
  • 00:53:40
    guess maybe changing topics slightly you
  • 00:53:44
    uh Beyond rag another thing that you're
  • 00:53:47
    very excited about is like helping
  • 00:53:50
    folks kind of tool up as AI Consultants
  • 00:53:55
    like where did your interest in that
  • 00:53:56
    that come from well I just struggled so
  • 00:53:59
much myself personally, you know what I mean?
  • 00:54:01
    I feel
  • 00:54:02
    like like I didn't work for like a year
  • 00:54:05
    I came back to and I was like oh man
  • 00:54:06
    people are asking me for help I don't
  • 00:54:08
    really know how to turn this into a
  • 00:54:10
    business you know even a year down the
  • 00:54:12
    road I really feel like through a lot of
  • 00:54:14
like AI augmentation, there should be more
  • 00:54:16
    and more individuals who are able to
  • 00:54:19
    scale up their own knowledge work with
  • 00:54:22
    llms so I like well I just think there's
  • 00:54:24
    going to be more businesses like more
  • 00:54:25
    solo business like entrepreneurs making
  • 00:54:28
    six or seven figures and and so okay if
  • 00:54:30
    that's true I should try doing it but I
  • 00:54:34
    just realize that you know I think the
  • 00:54:36
    like if you're a technical person and
  • 00:54:37
    you enjoy technical work it is very hard
  • 00:54:39
    to do the sales and and do the writing
  • 00:54:42
    and figure out how to write proposals to
  • 00:54:44
    like charge more and I think everyone
  • 00:54:46
    tells you to charge more but you don't
  • 00:54:47
    know what that means and there's no
  • 00:54:48
    playbook for that right just say things
  • 00:54:51
    like well just look in the mirror and
  • 00:54:54
    name a price and then double it and if
  • 00:54:55
    you don't keep doubling it and at some
  • 00:54:58
    point you can just ask that
  • 00:55:02
    number and that never worked for me and
  • 00:55:04
so I basically bought a bunch of courses, I
  • 00:55:06
    read a bunch of books and I'm trying to
  • 00:55:08
    distill everything I know into a little
  • 00:55:10
    package on Maven and kind of just like
  • 00:55:13
    sort of save the regret and the
  • 00:55:15
    embarrassment of undercharging for for
  • 00:55:18
    so long and uh you know sort of pass it
  • 00:55:20
    forward and help them help everyone else
  • 00:55:22
    will figure it out yeah you know I feel
  • 00:55:25
    like the first job I did I asked they
  • 00:55:27
    asked me how much I charged and I said
  • 00:55:29
oh, between like 150 and 170 an hour,
  • 00:55:31
    and they just said great we'll do 170
  • 00:55:33
just send me the paperwork, and I was
  • 00:55:35
    like oh wow you answered in three
  • 00:55:37
    seconds I
  • 00:55:39
    really yeah I just I called my
  • 00:55:42
    girlfriend I was like Hey I just took
  • 00:55:43
    food out of both our mouths I'm really
  • 00:55:44
    sorry like I'll do better next maybe
  • 00:55:47
    I'll double it I don't know I'm nervous
  • 00:55:49
    and yeah the course I'm running this
  • 00:55:51
    this uh next month is sort of my goal
  • 00:55:54
    to not do that again yeah I can say uh
  • 00:56:00
    that everything that you mentioned is
  • 00:56:02
    true for being an industry
  • 00:56:06
    analyst and a podcaster you know either
  • 00:56:08
slash both, which I am. Um, you know, I've been
  • 00:56:12
    at it for quite a long time but you know
  • 00:56:14
    there's definitely a learning curve and
  • 00:56:15
    it changes all the time too so it's
  • 00:56:17
awesome. Exactly. Well, we will link to
  • 00:56:20
    uh to that course are you still doing
  • 00:56:23
    the rag course as well the rag course
  • 00:56:25
    we're we're running it again in uh
  • 00:56:27
    February 4th they'll also be six weeks
  • 00:56:29
    I'm pretty excited we already have some
  • 00:56:31
    folks from open AI who's taking the
  • 00:56:32
    course now so I've slowly yeah hopefully
  • 00:56:36
I can help their solutions engineers
  • 00:56:38
improve other people's RAG systems, and so
  • 00:56:40
    I'm very excited for the new cohort
  • 00:56:42
    that's a a bunch of really amazing
  • 00:56:43
    companies involved well we will uh be
  • 00:56:46
    sure to link to those in the show notes
  • 00:56:48
    and maybe we can work out some kind of
  • 00:56:49
    discount code for listeners or something
  • 00:56:52
    um yeah let's do it awesome awesome
  • 00:56:55
    Jason it has been been great catching up
  • 00:56:57
    and uh I feel like we probably could
  • 00:56:59
    have continued on for another hour but
  • 00:57:02
    but we should make sure to to keep in
  • 00:57:03
    touch thanks so much for jumping on and
  • 00:57:05
    sharing a bit about your uh your
  • 00:57:07
    experiences with us it's been super fun
  • 00:57:09
    man thanks so much awesome thank you
  • 00:57:29
    [Music]
Tags
  • AI
  • RAG
  • User Experience
  • Testing
  • Fine-Tuning
  • Machine Learning
  • Consulting
  • Data Evaluation
  • Podcast
  • Expert Insights