2 Years of LLM Advice in 35 Minutes (Sully Omar Interview)

00:49:04
https://www.youtube.com/watch?v=nMORNaE_qe4

Summary

TLDR: In this interview, Sully Omar, CEO of Cognosys, explores how to use and categorize AI language models. He explains a tier system for language models based on intelligence and cost, and how to choose the right model for specific tasks such as coding, summarizing documents, and generating prompts. He discusses combining multiple AI models to exploit their distinct strengths and weaknesses, and covers model distillation, where a large model's behavior is transferred to smaller, faster, cheaper models. He shares hands-on experience with model evaluation, stressing the need to understand the models' nuanced differences and to test them carefully for each application. He also predicts that AI-generated prompts will soon replace much of traditional manual prompt writing. The conversation offers a practical look at fitting AI models into everyday work and at the challenges of squeezing out the last few points of performance.

Key takeaways

  • πŸ€– AI models can be used in every aspect of daily life.
  • πŸ“Š Different AI models have distinct strengths and weaknesses.
  • πŸš€ Model distillation enhances task execution by refining large models.
  • πŸ› οΈ Prompt engineering can be optimized with AI-generated meta prompts.
  • 🌐 Model routing means selecting the best AI model for each task.
  • πŸ’‘ Understanding model capabilities is crucial for maximizing potential use.
  • πŸ”„ Iteration and testing improve AI model usage.
  • πŸ“ˆ Combining multiple AI models can improve productivity.
  • πŸ“œ Future developments might replace traditional prompt writing.
  • βš–οΈ Tiered categorization helps in selecting the right AI model.

Timeline

  • 00:00:00 - 00:05:00

    AI can enhance everyday tasks but has its limitations. The conversation discusses large language models (LLMs) and their nuanced differences, highlighting the difficulty in perfecting AI performance.

  • 00:05:00 - 00:10:00

    An interview with Sully Omar reveals insights into his system for ranking AI models and developing prompts. Omar uses meta prompts to create production-ready prompts and discusses distilling performance into smaller AI models.

  • 00:10:00 - 00:15:00

    Omar's framework categorizes models by intelligence and cost, with tier three being the cheap, frequently used workhorses. He gives examples such as GPT-4o mini and Gemini Flash.

  • 00:15:00 - 00:20:00

    Tiered AI models serve different applications, with Omar using tier two models for tasks not requiring the highest intelligence. He often pairs models to optimize task performance.

  • 00:20:00 - 00:25:00

    AI models have specific strengths and weaknesses. Omar shares an example of using Gemini for needle-in-a-haystack lookups in long text, while GPT-4o mini reasons better over the same context, showing how models can complement each other.

  • 00:25:00 - 00:30:00

    The future of AI may involve complex model-routing systems, though getting a product from roughly 90-95% reliability to the last few percentage points remains very hard. Current practice relies on hand-built combinations of models.

  • 00:30:00 - 00:35:00

    Model distillation is powerful yet demanding. Good results require solid eval sets and data pipelines to avoid regressions when moving a task to a smaller model. Distillation tooling and practices will keep improving.

  • 00:35:00 - 00:40:00

    Omar demonstrates his prompt optimization process using various AI models, leveraging voice interaction for natural input. He iterates across models to refine prompts before applying them for specific tasks.

  • 00:40:00 - 00:49:04

    Test-driven development with AI involves using language models to write tests before code, providing checks for accuracy and reliability. Omar adapts this method to improve coding processes.



Video Q&A

  • What is AI model distillation, as discussed in the interview?

    AI model distillation means taking a task a large model already performs well and transferring it to a smaller model, typically by fine-tuning the smaller model on the larger model's outputs, so the task runs faster and at lower cost. (A sketch of this loop appears after this Q&A list.)

  • How are AI models categorized into tiers in this interview?

    Different AI models are categorized by tiers based on intelligence and cost: Tier 1 models are the most intelligent but slow and costly, Tier 2 are balanced, and Tier 3 are less intelligent but cheap and fast. (A small tier map appears after this Q&A list.)

  • Does the speaker use various AI models for different tasks?

    Yes, the speaker uses different AI models for different tasks, evaluating their strengths and weaknesses for specific use cases like coding, summarizing documents, and generating prompts.

  • How does the speaker recommend improving the use of AI models?

    The speaker emphasizes building a deep understanding of model capabilities and continuously testing them under different conditions to maximize their potential use.

  • What is model routing, and what does the speaker think about it?

    Model routing involves automatically selecting the best model for a given task. It's seen as a future direction for optimizing AI model use but is currently complex to implement effectively. (A rough sketch of a router appears after this Q&A list.)

  • Does combining multiple AI models improve productivity as suggested in the video?

    The speaker finds it beneficial to use different models for their strengths, for instance leveraging one model's structured-output capabilities alongside another's reasoning skills. (A sketch of that pairing appears after this Q&A list.)

  • What prediction about the future of prompt engineering is mentioned?

    The speaker predicts that traditional prompt writing will be replaced by AI-generated prompts, making the process more efficient and refined.

  • How does the speaker approach prompt generation and optimization?

    He uses an iterative, comparison-based approach to generate optimized prompts, often passing a prompt through multiple AI models to refine it. (A sketch of the meta-prompting loop appears after this Q&A list.)
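
Sketch: model distillation (re: the distillation question above). A minimal outline of the "big model generates the training data, small model gets fine-tuned on it" loop, assuming the OpenAI Node SDK; the `examples` array, file names, and model IDs are illustrative, not the speaker's actual pipeline.

```typescript
import fs from "node:fs";
import OpenAI from "openai";

const client = new OpenAI();

// Hypothetical task inputs; in practice these come from real traffic or an eval set.
const examples = ["Summarize this support ticket: ...", "Extract the key moments from: ..."];

async function distill() {
  const lines: string[] = [];
  for (const input of examples) {
    // 1. The large "teacher" model produces the reference output.
    const teacher = await client.chat.completions.create({
      model: "gpt-4o",
      messages: [{ role: "user", content: input }],
    });
    // 2. Record input/output pairs as chat-format JSONL training examples.
    lines.push(
      JSON.stringify({
        messages: [
          { role: "user", content: input },
          { role: "assistant", content: teacher.choices[0].message.content },
        ],
      }),
    );
  }
  fs.writeFileSync("distill.jsonl", lines.join("\n"));

  // 3. Fine-tune the smaller "student" model on the teacher's outputs.
  const file = await client.files.create({
    file: fs.createReadStream("distill.jsonl"),
    purpose: "fine-tune",
  });
  await client.fineTuning.jobs.create({
    training_file: file.id,
    model: "gpt-4o-mini-2024-07-18",
  });
  // 4. Before switching traffic over, run the distilled model against the same
  //    eval set as the teacher -- catching regressions is the hard part.
}

distill();
```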

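Sketch: the three tiers as a lookup table (re: the tier question above). The model names are the ones mentioned in the interview; the selection heuristic is illustrative only, not the speaker's code.

```typescript
// Tier 1 = slow "thinking" models, tier 2 = balanced workhorses for coding and
// writing, tier 3 = cheap, fast models for high-volume work.
const TIERS = {
  1: ["o1", "o1-preview"],
  2: ["gpt-4o", "claude-3-5-sonnet-latest", "gemini-1.5-pro"],
  3: ["gpt-4o-mini", "gemini-1.5-flash"],
} as const;

type Tier = keyof typeof TIERS;

// Illustrative heuristic: bulk/simple work goes to tier 3, hard multi-step
// reasoning to tier 1, everything else to the tier 2 workhorses.
function pickTier(task: { bulk?: boolean; hardReasoning?: boolean }): Tier {
  if (task.hardReasoning) return 1;
  if (task.bulk) return 3;
  return 2;
}

console.log(TIERS[pickTier({ bulk: true })]); // ["gpt-4o-mini", "gemini-1.5-flash"]
```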
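Sketch: model routing (re: the routing question above). A cheap model classifies the request and the call is dispatched to a tier. This is a rough illustration of the idea, not a production router, and the speaker's caveat applies: the classification step itself adds variance.

```typescript
import OpenAI from "openai";

const client = new OpenAI();

// A cheap tier-3 model decides which tier the request deserves.
async function routeTier(userRequest: string): Promise<"1" | "2" | "3"> {
  const res = await client.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content:
          "Classify the request. Reply with only '1' (hard multi-step reasoning), " +
          "'2' (normal coding or writing), or '3' (simple extraction or formatting).",
      },
      { role: "user", content: userRequest },
    ],
  });
  const tier = res.choices[0].message.content?.trim();
  return tier === "1" || tier === "3" ? tier : "2"; // default to the workhorse tier
}

async function answer(userRequest: string): Promise<string | null> {
  const tier = await routeTier(userRequest);
  const model = { "1": "o1-preview", "2": "gpt-4o", "3": "gpt-4o-mini" }[tier];
  const res = await client.chat.completions.create({
    model,
    messages: [{ role: "user", content: userRequest }],
  });
  return res.choices[0].message.content;
}
```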

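Sketch: pairing a reasoning model with a structured-output model (re: the question above and the deduplication example in the transcript). The reasoning model answers in free-form prose; a cheaper model then converts it to strict JSON. The schema and prompts are illustrative.

```typescript
import OpenAI from "openai";

const client = new OpenAI();

async function dedupeAndStructure(items: string[]) {
  // Step 1: the reasoning model (tier 1) does the hard part -- deduplication --
  // and is allowed to answer in verbose prose.
  const reasoned = await client.chat.completions.create({
    model: "o1-preview",
    messages: [
      {
        role: "user",
        content: `Deduplicate this list, merging near-duplicates:\n${items.join("\n")}`,
      },
    ],
  });

  // Step 2: a cheap model turns the prose into strict JSON for downstream code.
  const structured = await client.chat.completions.create({
    model: "gpt-4o-mini",
    response_format: { type: "json_object" },
    messages: [
      {
        role: "system",
        content: 'Return JSON of the form {"items": ["..."]} and nothing else.',
      },
      { role: "user", content: reasoned.choices[0].message.content ?? "" },
    ],
  });
  return JSON.parse(structured.choices[0].message.content ?? "{}");
}
```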
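Sketch: the meta-prompting loop (re: the prompt-generation question above). Draft a prompt with one model, then ask a reasoning model to critique and rewrite it, as in the demo later in the transcript. The wording of the meta-prompts is made up.

```typescript
import OpenAI from "openai";

const client = new OpenAI();

async function metaPrompt(taskDescription: string): Promise<string> {
  // Pass 1: ask a tier-2 model for a first-draft system prompt.
  const draft = await client.chat.completions.create({
    model: "gpt-4o",
    messages: [
      {
        role: "user",
        content: `Write a production-quality system prompt for this task:\n${taskDescription}`,
      },
    ],
  });

  // Pass 2: ask a reasoning model to critique the draft and emit a revised prompt.
  const revised = await client.chat.completions.create({
    model: "o1-preview",
    messages: [
      {
        role: "user",
        content:
          `I asked another AI to draft a prompt for this task:\n${taskDescription}\n\n` +
          `Draft prompt:\n${draft.choices[0].message.content}\n\n` +
          "Point out weaknesses, then output only the improved prompt.",
      },
    ],
  });
  return revised.choices[0].message.content ?? "";
}
```

The result is only a starting point; as the speaker notes, it still goes through evals before being trusted in production.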
Transcript
  • 00:00:00
    it lets you use AI in basically every
  • 00:00:02
    nook and cranny of your day-to-day with
  • 00:00:04
    that model came out it actually opened
  • 00:00:06
    up a lot of things that you could do we
  • 00:00:08
    use a lot of different providers and
  • 00:00:09
    that's because what we've seen with our
  • 00:00:11
    internal evals is that they're all so
  • 00:00:15
    nuanced and different in like a variety
  • 00:00:17
    of different ways but you also start to
  • 00:00:19
    see where they lack you'll get to an AI
  • 00:00:22
    product you get it to 90% even 95% but
  • 00:00:25
    that last 5% is nearly impossible how
  • 00:00:27
    do you think about model distillation
  • 00:00:29
    it's very powerful but you have to be
  • 00:00:30
    very
  • 00:00:33
    [Laughter]
  • 00:00:35
    [Music]
  • 00:00:42
    careful I just had an amazing
  • 00:00:45
    conversation with Sully Omar the CEO of
  • 00:00:47
    Cognosys the company behind Otto not only
  • 00:00:51
    is he one of the best llm practitioners
  • 00:00:53
    that I've met but you can tell he has a
  • 00:00:55
    really deep feeling for how these models
  • 00:00:57
    are actually working he speaks from
  • 00:00:58
    experience in this interview we go
  • 00:01:00
    through his three tier system of
  • 00:01:02
    actually ranking language models he
  • 00:01:04
    shows us how he uses meta prompts to
  • 00:01:06
    develop his real prompts that he uses in
  • 00:01:08
    production he also shows us his cursor
  • 00:01:10
    development flow where he actually has
  • 00:01:12
    the language model write the test first
  • 00:01:15
    and then write the actual code and
  • 00:01:17
    finally he walks us through distilling
  • 00:01:18
    performance from large language models
  • 00:01:20
    to small language models without losing
  • 00:01:22
    performance let's jump into it and let's
  • 00:01:24
    see what wisdom our friend Sully has to
  • 00:01:26
    share uh the reason why we're doing this
  • 00:01:28
    interview here is because I see all the
  • 00:01:30
    cool stuff you're sharing on Twitter and
  • 00:01:32
    I'm like this guy clearly has not only
  • 00:01:35
    like a checklist learned uh ability to
  • 00:01:39
    manipulate these models but I can tell
  • 00:01:41
    you you feel them like you really feel
  • 00:01:43
    how these things are actually going and
  • 00:01:44
    the personalities and the nuances and so
  • 00:01:46
    I want to dig in dig into that today
  • 00:01:48
    yeah well thank you and I think it just
  • 00:01:50
    comes from playing with these things
  • 00:01:52
    every day day in day out and using them
  • 00:01:56
    and pushing them to their limit and like
  • 00:01:58
    as cliche as it is is just like
  • 00:02:00
    sometimes you got to use them to Vibe
  • 00:02:01
    with them you know like like right so
  • 00:02:04
    yeah yeah yeah it's so true well I tell
  • 00:02:06
    you what I want to start off with um one
  • 00:02:08
    framework that I saw you document
  • 00:02:09
    recently which was your three tier model
  • 00:02:13
    of language models so tier one through
  • 00:02:15
    tier three so could you tell me like
  • 00:02:17
    starting at tier three what are those
  • 00:02:19
    and how do you work your way up yeah so
  • 00:02:21
    that's a it's a framework that I I mean
  • 00:02:23
    I don't even know if you want to call it
  • 00:02:24
    a framework but it's I like to
  • 00:02:26
    categorize it them into like based on
  • 00:02:28
    intelligence and price which is
  • 00:02:30
    correlated right like the less
  • 00:02:32
    intelligent models are going to be your
  • 00:02:33
    tier three models and then your more
  • 00:02:35
    expensive slower are going to be your uh
  • 00:02:38
    more intelligent model so the reason I I
  • 00:02:41
    thought of it in three tiers was because
  • 00:02:43
    of the application purposes so the way
  • 00:02:45
    that you use something like let's say 01
  • 00:02:47
    so that would be like a tier one is
  • 00:02:49
    different than the way that you use
  • 00:02:50
    something like Gemini flash which is
  • 00:02:52
    tier three um and that's because they
  • 00:02:54
    all provide different purposes one is
  • 00:02:56
    super cheap super fast the other one's
  • 00:02:58
    like really smart and really slow so I I
  • 00:03:00
    broke it down to those three tiers and
  • 00:03:02
    the third tier is basically what I like
  • 00:03:04
    to call just like the you know the
  • 00:03:07
    Workhorse the the ones that you're just
  • 00:03:09
    constantly using 247 and within that
  • 00:03:12
    category I think there was three main
  • 00:03:15
    models but it's kind of come down to two
  • 00:03:16
    for me personally so the first one is
  • 00:03:19
    the one that I think people are probably
  • 00:03:21
    more familiar with which is GPT 40 mini
  • 00:03:24
    now that model and is actually like I I
  • 00:03:30
    really really like it because it lets
  • 00:03:32
    you use AI in a way that previously you
  • 00:03:34
    couldn't like if you were to go back
  • 00:03:36
    let's say six months ago when we had no
  • 00:03:38
    cheap models you had let's say GPT 4 and
  • 00:03:41
    maybe even Claude
  • 00:03:43
    3.5 there was a lot of scenarios where
  • 00:03:46
    you couldn't just be like throwing that
  • 00:03:47
    at like random problems like you
  • 00:03:49
    couldn't just be like hey I have this
  • 00:03:51
    you know 20page document I want you to
  • 00:03:53
    go paragraph by paragraph and like
  • 00:03:55
    extract the details because
  • 00:03:57
    realistically like you know you're going
  • 00:03:59
    to be paying a lot of money so with that
  • 00:04:01
    model came out it actually opened up a
  • 00:04:03
    lot of like things that you could do so
  • 00:04:05
    that was the the the the first one with
  • 00:04:07
    was gp4 mini and then the other one that
  • 00:04:09
    I'm starting to really like is Flash so
  • 00:04:11
    Gemini flash is actually half the price
  • 00:04:14
    of GPT 40 mini and those are the tier
  • 00:04:17
    three because like I said they they give
  • 00:04:19
    you a lot of optionality and the
  • 00:04:21
    different things that you could do that
  • 00:04:22
    you couldn't do before it lets you use
  • 00:04:23
    AI in basically every nook and cranny of
  • 00:04:26
    your day-to-day right if whether it's
  • 00:04:29
    your coding and you wanted to look at
  • 00:04:31
    like you know 50 different files to
  • 00:04:33
    summarize to help another model for
  • 00:04:35
    example if you wanted to take a podcast
  • 00:04:38
    and you know look at you know when did
  • 00:04:41
    someone say a specific word in that
  • 00:04:42
    podcast right you're not going to go to
  • 00:04:44
    a bigger model so that was that's what I
  • 00:04:46
    call the tier three um and then the
  • 00:04:48
    second tier that I have is sort of like
  • 00:04:50
    the the middle obviously it's the middle
  • 00:04:52
    tier and this is where I like to slot in
  • 00:04:54
    the actual gp4 Cloud 3.5 Gemini Pro this
  • 00:04:57
    is where I think the majority of people
  • 00:04:59
    you use these models and and kind of get
  • 00:05:01
    the maximum usage out of them um and
  • 00:05:03
    then the last tier is obviously like the
  • 00:05:05
    01 o1 preview and then what I like to
  • 00:05:07
    classify as thinking models yeah that's
  • 00:05:09
    so cool so I want to dig in more into
  • 00:05:11
    the use case side so which use case
  • 00:05:13
    tasks are you doing tier two with and
  • 00:05:16
    then I know that o 01 in tier one is
  • 00:05:18
    going to be um it's not just oh I need
  • 00:05:20
    it smarter it's almost like a different
  • 00:05:22
    type of task you're going to ask it to
  • 00:05:23
    do so how do you differentiate between
  • 00:05:24
    those two right so the way that I like
  • 00:05:27
    to differentiate is I like like I pair
  • 00:05:31
    them so I will use 01 and I use this in
  • 00:05:33
    my dayto day it's like I'll go to Chad
  • 00:05:35
    gbt and if you just go and say hey like
  • 00:05:38
    to o1 can you do this task for me one it's
  • 00:05:41
    going to take a little bit of time
  • 00:05:42
    you're probably going to hit some rate
  • 00:05:43
    limits because it's highly limited and
  • 00:05:44
    realistically you're not going to use
  • 00:05:46
    the model the way that I think it was
  • 00:05:48
    intended somewhat to be used so if you
  • 00:05:50
    say like hey how's it going like okay
  • 00:05:53
    sure you could use it like that but
  • 00:05:54
    realistically you're better off using
  • 00:05:56
    you know the the tier two so how I use
  • 00:05:58
    the tier 2 is actually the most I use it
  • 00:06:00
    the most um obviously everyone uses it
  • 00:06:02
    for coding whether it's CLA 3.5 gp4 um
  • 00:06:06
    using it for like function calling or to
  • 00:06:08
    call tool calling like it it is
  • 00:06:11
    obviously like a good balance between
  • 00:06:13
    intelligence and price um and and that's
  • 00:06:16
    kind of like what I use it the most
  • 00:06:18
    whether I'm writing whether I'm asking
  • 00:06:20
    it to like hey help me edit an email or
  • 00:06:22
    things like that I'm using those like
  • 00:06:23
    middle tier ones now how I actually use
  • 00:06:26
    that in tandem with 01 is I'll sort of
  • 00:06:29
    one of the cases I have is I'll come to
  • 00:06:31
    Chad GPT or Claude and I'll sit there
  • 00:06:34
    and I'll just create a giant
  • 00:06:35
    conversation about a specific topic so
  • 00:06:37
    let's say for example you know I'm deep
  • 00:06:41
    diving into a research topic and I want
  • 00:06:43
    to learn more about now I'm not going to
  • 00:06:45
    actually go straight into 01 because I
  • 00:06:47
    feel like one it's a bit slow what I'll
  • 00:06:49
    what I'll do is I'll start the topic
  • 00:06:50
    with gp4 or Claude and I'll like add
  • 00:06:54
    files because obviously I think right
  • 00:06:55
    now 01 doesn't support like files and
  • 00:06:57
    web search so there's a lot of
  • 00:06:58
    capabilities that o1 doesn't support and
  • 00:07:00
    what I like to call is the context
  • 00:07:01
    building so I will just go and build as
  • 00:07:04
    much context in this chat as I possibly
  • 00:07:06
    can or it could be you know in any
  • 00:07:08
    platform and and I'll sit there and
  • 00:07:09
    iterate I'll actually use voice mode as
  • 00:07:11
    well to sort of give context it's a lot
  • 00:07:13
    quicker and that's another workflow and
  • 00:07:15
    as soon as I have like you know let's
  • 00:07:18
    say like two to three pages worth of
  • 00:07:20
    documents I'll actually take that and
  • 00:07:23
    paste it into a chat with 01 or 01
  • 00:07:26
    preview and I'll say Hey you know do
  • 00:07:28
    this gigantic task for me so for example
  • 00:07:31
    I I'll give you one thing to use it for
  • 00:07:32
    is like I was using it to generate use
  • 00:07:35
    cases for my product and I was like okay
  • 00:07:37
    I want to generate use cases and I want
  • 00:07:39
    to understand you know what are some
  • 00:07:41
    potential um customer segments and icps
  • 00:07:44
    it's is like a pretty technical question
  • 00:07:46
    and if I were to just go to 01 and ask
  • 00:07:48
    it that it would have no context it
  • 00:07:49
    doesn't know what my product is it has
  • 00:07:51
    no clue what my product does who my
  • 00:07:53
    customers are and if I were to sit there
  • 00:07:55
    chat with it well I'm going to hit that
  • 00:07:56
    limit but if I go to Claude or Chad gbt
  • 00:07:58
    I can upload documents I can create this
  • 00:08:00
    basically a PDF and copy paste it into
  • 00:08:03
    01 and then I can say generate me you
  • 00:08:06
    know personas icpas it does a lot better
  • 00:08:08
    so that's sort of the the workflow and
  • 00:08:09
    use case that I have currently running
  • 00:08:11
    with like the the tier two and the tier
  • 00:08:13
    one models yeah yeah yeah one of the
  • 00:08:15
    ways that I found 01 works for me really
  • 00:08:17
    well is around actually deduplication so
  • 00:08:19
    if I have a long list of items that say
  • 00:08:21
    I've processed five different chunks
  • 00:08:23
    with the same type of workflow for each
  • 00:08:25
    chunk well I'm going to have a list of
  • 00:08:26
    duplicated items I give that whole thing
  • 00:08:28
    to o1 it's actually really good at
  • 00:08:30
    deduplicating and then I'll use one of the
  • 00:08:31
    tier 2 models to do the structured
  • 00:08:33
    output after that since 01 doesn't yet
  • 00:08:34
    support structured output and go from
  • 00:08:36
    there yeah that actually that's a good
  • 00:08:38
    one that's another thing I do as well is
  • 00:08:40
    I'll take 01 and give me like a long
  • 00:08:42
    verbose output and then take that and
  • 00:08:45
    turn it into structured data sets with
  • 00:08:46
    the uh the tier two and even sometimes
  • 00:08:49
    you could even get away with using that
  • 00:08:50
    with the tier three because it's you
  • 00:08:52
    don't even need to worry about the
  • 00:08:53
    output you're just like hey I want this
  • 00:08:55
    nicely formatted in whatever shape yeah
  • 00:08:57
    yeah yeah for sure so it sounds like
  • 00:08:59
    you're using different models across
  • 00:09:01
    different providers too for different
  • 00:09:04
    use cases or do you stick with one all
  • 00:09:06
    the time yes so we use a lot of
  • 00:09:09
    different providers and that's because
  • 00:09:11
    what we've seen with our internal evals
  • 00:09:13
    is that they're all so nuanced and
  • 00:09:17
    different in like a variety of different
  • 00:09:19
    ways so obviously the big one Gemini
  • 00:09:22
    multimodal right off the bat like
  • 00:09:24
    anything to do with videos or audios
  • 00:09:28
    I'll go you know dive straight into that
  • 00:09:30
    and and kind of use Gemini but you also
  • 00:09:33
    start to see where they lack so for
  • 00:09:37
    example a really interesting one is
  • 00:09:39
    Gemini models are really good at needle
  • 00:09:41
    in the haystack and so if you say hey
  • 00:09:44
    I want you to find one or two pieces of
  • 00:09:46
    information in this you know giant long
  • 00:09:48
    piece of text or video it's actually
  • 00:09:50
    really good but then I started to notice
  • 00:09:52
    that something like GPT 40 mini is a
  • 00:09:55
    little bit of a little bit better
  • 00:09:57
    reasoning over that so if I give it a
  • 00:09:59
    long piece of context and I say hey I
  • 00:10:01
    want you to sort of understand the
  • 00:10:03
    context of it I saw I found the GPT 40
  • 00:10:05
    mini is a little bit better so you start
  • 00:10:07
    to see where one model does better than
  • 00:10:10
    the other model in specific area so like
  • 00:10:12
    another example is Claude 3.5 and GPT 40
  • 00:10:15
    now Claude is obviously everyone loves
  • 00:10:17
    that model it's a really good model but
  • 00:10:19
    one thing it's absolutely horrible at is
  • 00:10:21
    tool use with structured outputs and
  • 00:10:24
    you'll start to see this if you it's a
  • 00:10:26
    very complex tool like I want you to
  • 00:10:29
    create the very deep like a a nested
  • 00:10:31
    Json a very you know long structured
  • 00:10:34
    output like a very large amount of the
  • 00:10:37
    time it fails and it gives you XML and
  • 00:10:39
    it just breaks all your parsers whereas
  • 00:10:41
    GPT 40 mini does a lot better job but
  • 00:10:44
    then the caveat is that gp4 o mini is
  • 00:10:48
    not as good at actually like thinking
  • 00:10:50
    through the problem and acting as an
  • 00:10:52
    assistant so there's always these like
  • 00:10:53
    tiny trade-offs that you don't really
  • 00:10:55
    like notice one of the that we did was
  • 00:10:58
    we set up a
  • 00:10:59
    like one of the use case was to get
  • 00:11:01
    around that was we set up Claude and GPT
  • 00:11:03
    40 mini to work together where the tool
  • 00:11:06
    use for Claude would be to call GPT 40
  • 00:11:09
    mini and we basically system where
  • 00:11:13
    Claude could orchestrate GPT 4 mini to
  • 00:11:15
    create the structured output so it would
  • 00:11:17
    say please do this so the user would say
  • 00:11:19
    I want this task all GP all Claude would
  • 00:11:22
    do was relay that information to GPT 40
  • 00:11:25
    mini 40 mini creates the structured
  • 00:11:27
    output and then I guess return so that
  • 00:11:28
    was like another use of like how we mix
  • 00:11:30
    and match so many models across
  • 00:11:32
    different use cases yeah isn't it wild
  • 00:11:35
    how all these little mini Vibe tricks we
  • 00:11:38
    have to kind of like hack together in
  • 00:11:40
    the early days of llms here and then I
  • 00:11:42
    think back to how far we've already come
  • 00:11:43
    like because even like you know like
  • 00:11:45
    January of 23 we're dealing with like
  • 00:11:47
    4,000 token context limits and gbt 3.5
  • 00:11:50
    and all the hacks that we had then we've
  • 00:11:52
    upgraded from them now but we still have
  • 00:11:54
    a bunch of hacks like the ones you're
  • 00:11:55
    talking about and so it just makes me
  • 00:11:57
    think we're never going to get rid of
  • 00:11:58
    the hacks and they're always going to to
  • 00:11:59
    be there for a long time I would say so
  • 00:12:02
    too because yeah like you're right it's
  • 00:12:05
    funny looking back at it the hacks that
  • 00:12:07
    you used in 2023 were so different you
  • 00:12:09
    were hacking around context window and
  • 00:12:11
    now you're hacking around well tool use
  • 00:12:14
    which didn't even exist a year ago right
  • 00:12:16
    or like you know a year and a half ago
  • 00:12:17
    so I I agree with you that we're always
  • 00:12:19
    going to be Min maxing as a user of
  • 00:12:22
    multiple models you're going to be Min
  • 00:12:23
    maxing trying to figure out for your use
  • 00:12:25
    case for your product for your company
  • 00:12:28
    where can I you know masch these
  • 00:12:29
    together so that I get the best possible
  • 00:12:31
    outcome for my users um and I know a lot
  • 00:12:34
    of people have and I'm curious what you
  • 00:12:35
    think a lot of people have spoken about
  • 00:12:37
    like model routers and how you know at
  • 00:12:39
    the end of the day like a model is just
  • 00:12:40
    going to pick it but my my personal
  • 00:12:43
    opinion is I I think that it's going to
  • 00:12:45
    cause a lot of unintended side like you
  • 00:12:47
    know side effects but I'm curious what
  • 00:12:49
    you think on like this whole idea of
  • 00:12:50
    like model routing because you know
  • 00:12:51
    we're talking what we're basically doing
  • 00:12:53
    we're internally with code model routing
  • 00:12:55
    but I'm curious what you think so
  • 00:12:57
    whenever I get asked a question like
  • 00:12:58
    this I think is there behavior in
  • 00:12:59
    practice that tells me um what the
  • 00:13:02
    prediction should be and you just
  • 00:13:03
    describe basically you're doing model
  • 00:13:05
    routing on your own like in in in and of
  • 00:13:08
    itself so that tells me yes model
  • 00:13:10
    routing will be a thing and I do still
  • 00:13:12
    think that fine-tuning models and having
  • 00:13:14
    bespoke small models is still too much
  • 00:13:16
    overhead like it's really hard to do
  • 00:13:18
    that and manage them and do with them
  • 00:13:19
    all right now all that is going to get
  • 00:13:21
    so much easier so I would imagine that
  • 00:13:23
    not only will we have model routing for
  • 00:13:25
    task specific things against like some
  • 00:13:26
    of the big ones where you have Vibe
  • 00:13:28
    based feels whether regards to
  • 00:13:30
    structured output or tool use or
  • 00:13:31
    whatever it may be but then also um for
  • 00:13:34
    task specific things um I will
  • 00:13:36
    absolutely do model routing so um I'm a
  • 00:13:38
    fan I think it's hard I think it will be
  • 00:13:40
    the future we're not quite there yet
  • 00:13:42
    though that's for sure gotcha yeah like
  • 00:13:45
    my my my sentiment there was that there
  • 00:13:47
    and it could be just because the models
  • 00:13:49
    just where we're at right now what I've
  • 00:13:51
    noticed is and I'm sure you've seen the
  • 00:13:53
    same is where you'll get to an AI
  • 00:13:55
    product you get it to 90% even 95% but
  • 00:13:59
    that last 5 to 10% is nearly
  • 00:14:03
    impossible I find like it even you can
  • 00:14:05
    run all the evals you want you can run
  • 00:14:07
    all the benchmarks getting that last 10%
  • 00:14:10
    and I my thought process there is
  • 00:14:13
    that if you have the model sort of
  • 00:14:16
    choosing other models that adds to the
  • 00:14:19
    variance so it causes a lot more
  • 00:14:22
    potential like you know that that's kind
  • 00:14:24
    of where my thinking is and that could
  • 00:14:25
    just be because like we're early like
  • 00:14:27
    realistically we're so early models have
  • 00:14:30
    you know multiple generations to get
  • 00:14:31
    better uh so that was my thought was
  • 00:14:33
    that maybe in the future but right now
  • 00:14:36
    probably not because it's it's so hard
  • 00:14:39
    to get a product in specifically like
  • 00:14:42
    llms into production where you're
  • 00:14:44
    handling every potential Edge case uh in
  • 00:14:47
    a manner that gives you as high of an
  • 00:14:49
    accuracy as you can and adding models
  • 00:14:51
    that you might not have an eval for could
  • 00:14:55
    give you an output that you didn't
  • 00:14:56
    expect yeah yeah totally uh well well I
  • 00:14:59
    tell you what one of the other
  • 00:15:00
    interesting things that came up during
  • 00:15:01
    research was your opinion on what is
  • 00:15:04
    kind of becoming known as model
  • 00:15:05
    distillation so you have a really really
  • 00:15:07
    good model you perfect the output from
  • 00:15:09
    there but then you realize wow I can
  • 00:15:11
    actually come up with a little bit of a
  • 00:15:12
    better prompt here and give it to a
  • 00:15:14
    smaller model so that you have it's
  • 00:15:16
    faster and it's cheaper so can you talk
  • 00:15:18
    me or walk me through how do you think
  • 00:15:19
    about model distillation in your own
  • 00:15:21
    workflow yeah so that's a something I
  • 00:15:23
    think about a lot and it's one of those
  • 00:15:26
    things where you need to be very careful
  • 00:15:28
    because it's very it's very powerful but
  • 00:15:30
    you have to be very careful because it
  • 00:15:32
    requires a lot of work and the reason it
  • 00:15:35
    needs a lot of work is
  • 00:15:37
    because you need to have a a good data
  • 00:15:40
    Pipeline and understand what you're
  • 00:15:42
    distilling so one of the things and
  • 00:15:44
    mistakes I made previously with the
  • 00:15:45
    product was that we went we had GPT 40
  • 00:15:49
    this was actually before GPT 40 it was
  • 00:15:50
    gp4 turbo and we used it and it was slow
  • 00:15:54
    and we're like hey let's distill that to
  • 00:15:55
    3.5 open AI has a has a really nice um
  • 00:15:59
    way to do it so we did that and the
  • 00:16:02
    problem was that we didn't have good
  • 00:16:04
    enough evals we didn't have a good
  • 00:16:05
    enough data set so as the potential you
  • 00:16:09
    know the various areas grew that people
  • 00:16:11
    could use the product we would notice
  • 00:16:13
    okay we have to revert back to gp4
  • 00:16:15
    because 3.5 was at that time not good
  • 00:16:18
    enough now where I do see distillation
  • 00:16:20
    in our workflow is when you have a
  • 00:16:22
    defined eval set you have like all your
  • 00:16:24
    benchmarks and you have a very good data
  • 00:16:27
    pipeline where you can say okay
  • 00:16:29
    in this 500 example set I'm using Claude
  • 00:16:33
    3.5 Sonnet or you know o1 for example
  • 00:16:36
    I have my data set and you can use a
  • 00:16:39
    bunch of different there's a lot of
  • 00:16:40
    different companies that provide you
  • 00:16:41
    with like ways to manage your
  • 00:16:43
    prompts and evals whether it's Braintrust
  • 00:16:45
    or LangSmith and then you can
  • 00:16:48
    very accurately uh detect and determine
  • 00:16:51
    the accuracy of the distilled model then
  • 00:16:54
    10 out of 10 times I would use it um and
  • 00:16:57
    the easy and it's actually really easy
  • 00:16:58
    like to actually distill the model down
  • 00:17:01
    it's like it's like it's a single API
  • 00:17:03
    call the challenging part is making sure
  • 00:17:06
    that you don't regress your product when
  • 00:17:08
    you do uh the distillation but I think
  • 00:17:11
    it's one of those things that it's going
  • 00:17:13
    to become more and more apparent as the
  • 00:17:15
    tooling around distillation becomes like
  • 00:17:17
    better I know there's a couple companies
  • 00:17:19
    working on it like open pipe is one of
  • 00:17:20
    them um and I know open AI straight up
  • 00:17:23
    offers you that so I think as the
  • 00:17:25
    tooling gets better you're going to see
  • 00:17:27
    this pattern in production
  • 00:17:29
    of companies launching with the biggest
  • 00:17:31
    best model they collect a bunch of data
  • 00:17:33
    they have a good eval set and engineering
  • 00:17:35
    team to support that then they go and
  • 00:17:37
    they distill it to whether open you know
  • 00:17:39
    GPT 40 mini or an open source model yeah
  • 00:17:42
    that's beautiful my favorite line with
  • 00:17:43
    that is the whole make it work make it
  • 00:17:45
    right make it fast and so it's like look
  • 00:17:47
    you're going to use the biggest one to
  • 00:17:48
    start us off but then you're going to
  • 00:17:49
    make it fast eventually and go from
  • 00:17:51
    there um this is awesome I tell you what
  • 00:17:54
    though so I know you're a practical
  • 00:17:56
    person I would love to jump into like
  • 00:17:58
    you actually showing us some of the ways
  • 00:17:59
    that you use these tools and I think a
  • 00:18:01
    really cool starting off point would be
  • 00:18:03
    I know that you're a fan of prompt
  • 00:18:05
    optimizers or like meta prompt writing
  • 00:18:08
    and so yes because you had you had a
  • 00:18:11
    tweet and literally said pretty good
  • 00:18:13
    chance you won't be prompting from
  • 00:18:14
    scratch in two to three months so I
  • 00:18:17
    would love to see the way you kind of
  • 00:18:18
    prompt engineer your way from like an
  • 00:18:20
    idea to like I'm going to go use this
  • 00:18:23
    thing okay yeah hopefully my prediction
  • 00:18:26
    uh ages well because I feel like it's
  • 00:18:28
    been a month since I said that and I
  • 00:18:29
    don't know if we're two to three months
  • 00:18:31
    away from it but okay let me yeah I just
  • 00:18:35
    to add some context I do a lot of this
  • 00:18:36
    sort of meta prompting where I'll come
  • 00:18:39
    in with a problem what is what is meta
  • 00:18:40
    prompting let's start there you come in
  • 00:18:42
    with a general idea of what you're
  • 00:18:44
    trying to do you have a problem that
  • 00:18:46
    you're trying to solve like
  • 00:18:47
    realistically if you're coming in you
  • 00:18:48
    don't know what problem you have that
  • 00:18:49
    you're trying to solve with an AI it's
  • 00:18:51
    it's sort of useless so an example would
  • 00:18:53
    be um the other day I was trying
  • 00:18:56
    to get uh one of the models to write
  • 00:18:58
    like me which to this day I I cannot for
  • 00:19:02
    whatever reason and I was like I came
  • 00:19:05
    into it and I came into Chad GPT and I
  • 00:19:07
    had all my examples and I was like okay
  • 00:19:09
    what do I write and I normally I would
  • 00:19:11
    write something like you know you you
  • 00:19:12
    write like a basic promp structure and
  • 00:19:15
    the reality is that prompt is probably
  • 00:19:16
    not that good so what meta prompting or
  • 00:19:19
    what I like to think about this work
  • 00:19:20
    this idea is that you come in with an
  • 00:19:21
    idea hey I want to have an AI right like
  • 00:19:24
    me I have examples and then I just give
  • 00:19:27
    that to o1 or Claude and I say please
  • 00:19:30
    create the prompt for me and that's sort
  • 00:19:31
    of what I like to think of like this I
  • 00:19:34
    come in with a a rough idea of what I'm
  • 00:19:35
    trying to do I don't really know
  • 00:19:37
    specifically how to optimize it I'll go
  • 00:19:39
    to these models and say hey like
  • 00:19:40
    actually give me this promp structure
  • 00:19:42
    and it does a pretty good job so that's
  • 00:19:43
    kind of the the rough idea of how it
  • 00:19:45
    works but let's let me should we just
  • 00:19:47
    hop into like yeah I would love to jump
  • 00:19:49
    into it if you could share your screen
  • 00:19:50
    and then are you using just a regular
  • 00:19:52
    chat interface or are you going to
  • 00:19:54
    anthropics workbench and doing their
  • 00:19:56
    prompt Optimizer I I just used the chat
  • 00:19:59
    interface because I feel like the prompt
  • 00:20:01
    I mean people some people do use it I
  • 00:20:03
    and I think you can start with it um but
  • 00:20:06
    I just find it easier because I can
  • 00:20:07
    iterate a lot better I can say hey start
  • 00:20:10
    like this and do that so let's actually
  • 00:20:12
    do it but I I want to start and say do
  • 00:20:14
    you have some sort of task that like we
  • 00:20:16
    should we start we should start with
  • 00:20:17
    like a rough idea because I like do you
  • 00:20:19
    have any like what what's the task we
  • 00:20:21
    could Dem let's do a straightforward one
  • 00:20:24
    let's do what I guess I'll give you a
  • 00:20:26
    few options you tell me what you think
  • 00:20:27
    is best we could do the classification
  • 00:20:29
    one which is very standard hey I have
  • 00:20:30
    some data sources or can you please
  • 00:20:32
    label them for me um we could do either
  • 00:20:36
    like uh unstructured to structured
  • 00:20:38
    extraction so like extracting insights
  • 00:20:40
    from a piece of text or we could do uh
  • 00:20:43
    idea generation that's always a fun one
  • 00:20:45
    too okay let's do the let's do the
  • 00:20:49
    extracting text one and I think that's a
  • 00:20:51
    good one so let's say we I like to
  • 00:20:53
    always preface it with like the problem
  • 00:20:54
    or what we're trying to do so again what
  • 00:20:56
    I like to come into it is like all right
  • 00:20:57
    I have a problem I'm trying trying to do
  • 00:20:59
    a specific task and usually this is like
  • 00:21:01
    my blank slate starting point so
  • 00:21:03
    let's say the task that I'm trying to do
  • 00:21:05
    is I have a large piece of text and I
  • 00:21:08
    want to you know turn that piece of text
  • 00:21:10
    into something else some sort of
  • 00:21:11
    structured output and it's it's funny
  • 00:21:14
    because a lot of people say like oh is
  • 00:21:15
    it complicated it's really like I just
  • 00:21:18
    come to Chad GPT and I or or claw and I
  • 00:21:20
    basically say that so the way that I go
  • 00:21:22
    is I'll say you know you could use
  • 00:21:24
    Claude or or chagy PT I haven't found
  • 00:21:26
    which one is really better again and
  • 00:21:29
    this is kind of going back to my
  • 00:21:30
    original workflow is what I'll do is
  • 00:21:32
    I'll actually start with gp4 or Claude
  • 00:21:34
    and I'll get like a rough idea for a
  • 00:21:36
    prompt and I'll copy that and I'll give
  • 00:21:38
    it to 01 and then I'll start to compare
  • 00:21:40
    across all three to see which one like
  • 00:21:42
    makes the most sense so let's say for
  • 00:21:44
    example in this one I am grabbing
  • 00:21:48
    transcripts from podcasts and I want to
  • 00:21:50
    know like you know I want a nice like
  • 00:21:54
    structured output for all of the key
  • 00:21:57
    exciting moments let's say that that
  • 00:21:58
    like the problem space so now you could
  • 00:22:00
    come in and you could create a prompt
  • 00:22:01
    and says okay given this video I want
  • 00:22:04
    you to do this or I come to CL and say
  • 00:22:05
    look like and actually the other
  • 00:22:07
    workflow that I I wish I could demo is I
  • 00:22:09
    use voice a lot so I don't know if um if
  • 00:22:12
    you use voice a lot but I've notice that
  • 00:22:16
    with voice here I don't use it a ton
  • 00:22:17
    yeah it hasn't entered my workflow yet
  • 00:22:19
    but I'm I'm voice curious so I want to
  • 00:22:21
    try actually see this let's see this
  • 00:22:23
    okay so I have I have something here I
  • 00:22:25
    want to show you the whole workflow that
  • 00:22:27
    I use so that I
  • 00:22:30
    so and let me know if you need a
  • 00:22:31
    transcript I have one handy for us
  • 00:22:34
    actually yeah could you could you toss
  • 00:22:35
    me it there and then I will use it I'll
  • 00:22:38
    copy paste this okay let me know when
  • 00:22:40
    you have the transcript and then mm
  • 00:22:43
    small plug this is MFM Vault website I
  • 00:22:45
    put together that does insight
  • 00:22:47
    extraction from My First Million there we
  • 00:22:48
    go Okay cool so let's say our goal is to
  • 00:22:51
    extract insights now my workflow is I
  • 00:22:54
    have a tool that transcribes it so I
  • 00:22:56
    think it works so let's say I'll just
  • 00:22:58
    exactly show you how to do it okay hey
  • 00:23:01
    uh I need a bit of help creating a
  • 00:23:02
    prompt uh for a uh use case so what
  • 00:23:05
    we're doing right now is taking podcast
  • 00:23:08
    transcripts and trying to extract all of
  • 00:23:10
    the key moments key insights so I need
  • 00:23:13
    you to create a a nice uh prompt that
  • 00:23:15
    will you know help us do that and I'll
  • 00:23:18
    I'll give I'm going to put in the prompt
  • 00:23:19
    as well later on the actual transcript
  • 00:23:21
    but I need you to create the prompt slash
  • 00:23:22
    system
  • 00:23:23
    prompt so boom so that's that's actually
  • 00:23:26
    sort of how I do it it's there's no
  • 00:23:28
    science to it and I I'll sit there and
  • 00:23:30
    kind of like here and I'll copy this and
  • 00:23:32
    I'll actually do this I'll go into Chad
  • 00:23:33
    GPT I'll paste it and I'll actually also
  • 00:23:34
    place it into
  • 00:23:36
    Claude and it's going to go and it's going
  • 00:23:39
    to give me like a uh starting
  • 00:23:42
    point and so right off the bat like if
  • 00:23:44
    you're maybe not as good at prompting or
  • 00:23:47
    you're new to prompting like you can
  • 00:23:50
    read this like obviously if you're more
  • 00:23:51
    experienced and you kind of know like
  • 00:23:54
    what you're doing these kind of prompts
  • 00:23:55
    are like pretty obvious but for a lot of
  • 00:23:57
    people they'll come in and be like okay
  • 00:23:59
    cool I have a a good starting point so
  • 00:24:01
    then all I'll do is I'll look at this
  • 00:24:02
    say okay the following is a podcast
  • 00:24:04
    transcript identify so and I'll compare
  • 00:24:07
    it to here so right off the bat I don't
  • 00:24:09
    know if you which one you think is
  • 00:24:10
    better but I'm looking at this and I
  • 00:24:12
    like the Claude output better um little
  • 00:24:15
    bit
  • 00:24:16
    more uh what's it called clear Direction
  • 00:24:19
    so I'll actually copy this and I'll be
  • 00:24:22
    like okay we have a rough outline I
  • 00:24:24
    liked the first pass I liked the one
  • 00:24:27
    from
  • 00:24:29
    uh Claude I'll take that and I'll go back
  • 00:24:31
    to Chad GPT and I'll open up a new tab
  • 00:24:34
    and then I'll
  • 00:24:35
    say let's go to o1 preview so then I'll
  • 00:24:37
    actually do the same thing um I'll say
  • 00:24:40
    and I'll actually give it more context
  • 00:24:41
    so I'll say something along the lines of
  • 00:24:43
    and again I I'll go back to the voice
  • 00:24:45
    mode here I'll say hey um you're going
  • 00:24:47
    to help me optimize a prompt so I
  • 00:24:49
    already got another AI model to give me
  • 00:24:51
    a rough idea for this prompt I want you
  • 00:24:53
    to look at it and tell me if there's any
  • 00:24:54
    areas in the prompt that we could
  • 00:24:55
    improve um so I'll give you the prompt
  • 00:24:57
    and I'll actually give you the prompt
  • 00:24:58
    that I gave to the I AI that generated
  • 00:25:00
    this
  • 00:25:01
    prompt so it's going to go and then I'm
  • 00:25:04
    going to go like this so this is sort of
  • 00:25:06
    here you
  • 00:25:08
    know
  • 00:25:10
    original prompt to AI I'll paste that in
  • 00:25:13
    a sec um what's amazing is just how you
  • 00:25:17
    speak to it just like a human like it's
  • 00:25:19
    not complicated it's literally just
  • 00:25:20
    being clear in your
  • 00:25:22
    directions it's something
  • 00:25:25
    that I recently started to do and
  • 00:25:29
    I think it's a very a lot of people talk
  • 00:25:31
    to the AI as if it's not a human but
  • 00:25:33
    they perform the best when you just
  • 00:25:34
    speak to it naturally and I found that
  • 00:25:37
    voice is the best modality to do that in
  • 00:25:39
    because it's very hard to sound robotic
  • 00:25:42
    when you're talking to like the the chat
  • 00:25:45
    it's like you have to just talk
  • 00:25:46
    naturally um and then I found that it's
  • 00:25:48
    it's also a lot faster like if I were to
  • 00:25:50
    sit here and type that it would take me
  • 00:25:51
    a lot so here I'll go here I'll T I'll
  • 00:25:54
    paste this um original prompt you know
  • 00:25:58
    and then I'll say Okay cool so I like
  • 00:26:00
    that one and now this is the second pass
  • 00:26:02
    and now this is where again kind of
  • 00:26:04
    going back to the workflow that I use
  • 00:26:05
    right is I'll come in here and iterate
  • 00:26:07
    with voice on this specific subset of a
  • 00:26:09
    problem which is generating this kind of
  • 00:26:11
    like like a prompt we sat there with
  • 00:26:13
    GPT-4 we sat there with Claude iterated a
  • 00:26:16
    bit um and then I'm I'm like okay I have
  • 00:26:18
    a rough idea this prompt looks somewhat
  • 00:26:20
    good and then I'll come back to 01
  • 00:26:22
    preview and I'll say okay cool I want
  • 00:26:24
    you to optimize this and I haven't found
  • 00:26:27
    like I don't have a real scientific
  • 00:26:29
    method to which one is best because I
  • 00:26:31
    just kind of sit there and and this is
  • 00:26:32
    kind of where I have like a good first
  • 00:26:34
    generation of the prompt realistically
  • 00:26:36
    I'll put this into production I'll write
  • 00:26:38
    a couple of like you know uh evals I'll
  • 00:26:40
    say okay how does this actually perform
  • 00:26:42
    and then kind of iterate back but this
  • 00:26:43
    is sort of my starting point so we'll
  • 00:26:46
    let this go
  • 00:26:49
    um okay so
  • 00:26:53
    here and then it gives me some
  • 00:26:56
    things can you please generate the new
  • 00:27:00
    prompt now all right cool it gives me
  • 00:27:03
    the revised prompt so it it did gives
  • 00:27:07
    you finally the answer yeah and and sort
  • 00:27:10
    of you can see here and you can OB say
  • 00:27:13
    here this is just for the sake of this
  • 00:27:14
    and now what I'll do is I will take this
  • 00:27:17
    and then I will actually go to and this
  • 00:27:19
    is my full workflow we can use any model
  • 00:27:22
    but let's say we're going to use um you
  • 00:27:25
    have a preference of which model you
  • 00:27:26
    want to test out the actual uh
  • 00:27:29
    transcription we can actually do I'd
  • 00:27:31
    love to hear which one you think and why
  • 00:27:33
    and let's just test it out let's let's
  • 00:27:36
    test it out so now we go to studio so
  • 00:27:38
    and you see what I mean it's like
  • 00:27:40
    there's all these different models I'll
  • 00:27:43
    go to Studio which is Gemini now we're
  • 00:27:45
    going to go to Gemini which I found so
  • 00:27:48
    specifically Gemini Pro uh better at
  • 00:27:51
    sorts of these these sort of tasks um
  • 00:27:53
    and now I'm here with Gemini Pro which
  • 00:27:55
    I'm going to take and grab the prompt I
  • 00:27:58
    crafted with 01 put it into the system
  • 00:28:01
    prompt of uh what's it called Gemini Pro
  • 00:28:04
    paste in the the transcript and we'll
  • 00:28:06
    see how it goes all right beautiful yeah
  • 00:28:08
    that sounds
  • 00:28:10
    great all right let's copy this
  • 00:28:14
    here okay this is how the sausage is
  • 00:28:17
    made yeah it's it's this is how I like
  • 00:28:20
    to think of like the first generation of
  • 00:28:21
    a promp or I'm not really sure where I'm
  • 00:28:24
    starting off with obviously like is this
  • 00:28:26
    something that I would use in production
  • 00:28:27
    probably not because you want to test it
  • 00:28:29
    out and have a lot of back and forth um
  • 00:28:31
    but okay cool can I is there a way to
  • 00:28:34
    copy paste the transcript you're just
  • 00:28:36
    gonna have to select all down at the
  • 00:28:38
    bottom
  • 00:28:39
    there that would be nice to copy the
  • 00:28:41
    transcript actually I think I might add
  • 00:28:43
    that feature in there yeah it's a let me
  • 00:28:46
    see if I can just this I'm
  • 00:28:50
    on all
  • 00:28:53
    right cool now we go grab this okay and
  • 00:28:58
    then I'll obviously like do a second
  • 00:28:59
    pass to make sure that this actually
  • 00:29:01
    makes sense key moments obviously yeah
  • 00:29:05
    okay this looks pretty good time stamp
  • 00:29:08
    three to takeaways extract one sentence
  • 00:29:11
    discussion themes theme name
  • 00:29:14
    um yeah like Okay cool so here I'll
  • 00:29:17
    paste this in and we'll let it we'll let
  • 00:29:19
    it run here so I'm using Gemini
  • 00:29:21
    Pro um all right 177,000 tokens and and
  • 00:29:24
    for for people who are curious like
  • 00:29:26
    Gemini Pro
  • 00:29:28
    I I talked about this recently is that a
  • 00:29:30
    lot of models can't actually reason over
  • 00:29:32
    a large context like um but for
  • 00:29:36
    something like Gemini Pro anything under
  • 00:29:37
    100K tokens it's uh it's pretty good at
  • 00:29:40
    like being able to synthesize a a
  • 00:29:42
    relatively intelligent answer so
  • 00:29:45
    here okay that's really
  • 00:29:49
    cool and now yeah key moments how you
  • 00:29:52
    leverage CrossFit I'm actually curious
  • 00:29:55
    to like see how it this would do against
  • 00:29:57
    like you know other benchmarks because
  • 00:29:59
    we don't really know if this is a good
  • 00:30:00
    output or not and that's where the whole
  • 00:30:02
    point of evals is but there you go you
  • 00:30:05
    have how I went from an idea to
  • 00:30:10
    generating like a full I guess optim air
  • 00:30:13
    quote here optimize prompt and the
  • 00:30:16
    reason for that is just like for me to
  • 00:30:18
    sit here and write this probably would
  • 00:30:20
    have taken like an hour hour and a half
  • 00:30:22
    maybe like give or take depending on how
  • 00:30:24
    good you are but you know we just did it
  • 00:30:26
    live in whatever 10 minutes so yeah
  • 00:30:28
    that's super super cool I love that um
  • 00:30:31
    so then out of curiosity what are you
  • 00:30:33
    using for prompt management so I saw a
  • 00:30:36
    um a tweet by the CEO of prompt layer
  • 00:30:39
    Jared and he's like yeah I see everybody
  • 00:30:40
    they go through the same they go through
  • 00:30:42
    the same world first their prompts are
  • 00:30:44
    just hard-coded in their code and then
  • 00:30:46
    second their prompts are hard-coded in
  • 00:30:47
    text files but they're still in their
  • 00:30:49
    code base and then third you actually go
  • 00:30:50
    to a prompt manager what what are you
  • 00:30:52
    using for prompt management so for
  • 00:30:55
    that's an interesting one we obviously
  • 00:30:57
    we use GitHub for our our our prompts
  • 00:31:01
    yeah so we use a lot of a couple of
  • 00:31:03
    different things maybe maybe we're not
  • 00:31:05
    like we're not prompt managing correctly
  • 00:31:08
    but we just have our prompts that we
  • 00:31:11
    store in Langs Smith and sort of I'll
  • 00:31:14
    just have data sets and I'll compare
  • 00:31:18
    that prompt to that data set so for
  • 00:31:20
    example we have a giant data set of like
  • 00:31:22
    a thousand examples that I I run or test
  • 00:31:25
    against different models different
  • 00:31:26
    prompts and that prompt is just like
  • 00:31:29
    stored you know in in the data set and
  • 00:31:33
    then whenever I want to change the
  • 00:31:34
    prompt I'll actually change it and
  • 00:31:36
    duplicate data set paste in the new
  • 00:31:38
    prompt and like my version so to speak
  • 00:31:41
    so the actual prompt stays in my
  • 00:31:44
    codebase with the latest version of like
  • 00:31:46
    this is the the source of Truth and all
  • 00:31:49
    previous other versions are different
  • 00:31:51
    data sets where I can see how they
  • 00:31:53
    perform so for example if I want to go
  • 00:31:55
    back to a prompt that was like you know
  • 00:31:56
    let's say from a week ago I just look at
  • 00:31:58
    the data set that was from a week ago
  • 00:32:00
    and I can see the prompt is there and I
  • 00:32:01
    can also see how it performs so that's
  • 00:32:03
    how I manage uh like versioning I'm
  • 00:32:06
    not sure if that's the right approach but
  • 00:32:07
    that's the way I do it so in your
  • 00:32:09
    code is the prompt that's being called
  • 00:32:11
    is it actually in your code or are you
  • 00:32:13
    calling out to langub and Lang Smith
  • 00:32:14
    every single time it's in the code so
  • 00:32:17
    the the our code it's in GitHub and the
  • 00:32:19
    nice part is because it's just all
  • 00:32:21
    Version Control like I could look at the
  • 00:32:23
    git history and I can actually see okay
  • 00:32:26
    this person changed this line is as well
  • 00:32:28
    which is nice so I have the line by line
  • 00:32:30
    version controlled from git um and then
  • 00:32:32
    if I want to see the full prompt I can
  • 00:32:34
    look back at like a you know the the
  • 00:32:37
    data management tool yeah that's very
  • 00:32:38
    cool um I tell you what I had one more
  • 00:32:40
    demo on here that I was like this would
  • 00:32:42
    be so cool if Sully would show us how we
  • 00:32:44
    use this um it's a cursor one actually
  • 00:32:46
    so I saw that you tweet you you said do
  • 00:32:49
    I actually have the llm write the test
  • 00:32:52
    first then the code it helps a ton which
  • 00:32:55
    that's a framework I don't see too many
  • 00:32:57
    people doing of course there's test
  • 00:32:58
    driven development but like not in
  • 00:33:00
    practice not usually I'm not seeing a
  • 00:33:01
    lot of people do that could you walk us
  • 00:33:03
    through like how do you write that test
  • 00:33:05
    first and then how do you ask it to
  • 00:33:06
    write code right after that yeah okay
  • 00:33:09
    this is one that I the reason I started
  • 00:33:11
    to do was because the problem I was
  • 00:33:13
    facing the model just kept messing up
  • 00:33:14
    like every single time it was within our
  • 00:33:16
    code base and I was like this is this is
  • 00:33:19
    a waste of my time the model can't
  • 00:33:20
    figure it out how about I just get it to
  • 00:33:23
    generate the test first and then if the
  • 00:33:25
    test works then it can maybe look at the
  • 00:33:28
    code and say where the issues are
  • 00:33:29
    because models guess what if a test
  • 00:33:31
    fails you can grab the error output give
  • 00:33:34
    it back to the model and say hey like
  • 00:33:36
    please decipher that so let's actually
  • 00:33:38
    see if I can like I can spin up um like
  • 00:33:41
    a little mini project or something or
  • 00:33:43
    yeah yeah let's see here if I can spin
  • 00:33:45
    up something new I actually think this
  • 00:33:47
    is really cool and this is like
  • 00:33:48
    something like really truly not enough
  • 00:33:50
people are doing this and it legit
  • 00:33:52
    helps you write better code because it
  • 00:33:54
    makes sense you have the test that's
  • 00:33:56
    supposed to run successfully and it can
  • 00:33:58
    use that as instructions and it can use
  • 00:33:59
    that to like test to make sure it's
  • 00:34:00
    actually working I'm surprised not a lot
  • 00:34:02
    of people not more people are doing this
  • 00:34:04
    where it's like right that's like it's
  • 00:34:05
    just a lot easier for the llm to like do
  • 00:34:08
    that and then your code
  • 00:34:10
    is I guess like you know less spaghetti
  • 00:34:12
because you're not
  • 00:34:13
    worried about you know if something
  • 00:34:15
    changes like the model like you start
  • 00:34:17
    with the tests and it's really easy for
  • 00:34:18
    the model to generate it okay so I got I
  • 00:34:21
    that took a little time I got a uh a
  • 00:34:24
    cursor here so this is just a super
  • 00:34:26
    quick um let me just grab the screen
  • 00:34:29
    here super quick here so I have this you
  • 00:34:33
    know super basic thing we can just
  • 00:34:34
    terminal we can run it and I can go you
  • 00:34:37
know bun
  • 00:34:40
index.ts, Hello World. Um, now what I'd
  • 00:34:43
    like to do is start with Cursor and I'll just
  • 00:34:45
    say something along the lines of like
  • 00:34:47
    literally and again I actually don't
  • 00:34:49
know how to write tests in Bun so I can
  • 00:34:51
just go to Cursor, I open up Command+I
  • 00:34:53
    and for those who don't know, this is like the
  • 00:34:54
    composer it lets you uh coordinate and
  • 00:34:57
    create file so I'm going to say you know
  • 00:34:58
I'm using Bun, uh, for now create a test uh
  • 00:35:04
    file for a method and then make the
  • 00:35:08
    method uh that let's say for now
  • 00:35:11
    reverses a string super simple um and oh
  • 00:35:15
    I guess I'm out of slow request
  • 00:35:17
    unfortunately okay wow so it what it'll
  • 00:35:20
    first do is it'll create the test right
  • 00:35:22
    and this is obviously a really simple
  • 00:35:23
    example and so here I'm happy with this
  • 00:35:27
    all right I'll I'll just accept this um
  • 00:35:31
    and now right off the bat like there's
  • 00:35:34
    you know how many whatever five tests
  • 00:35:36
    here so obviously I have the actual
  • 00:35:39
    function so here in this example just
  • 00:35:40
    reversing string now the nice part is I
  • 00:35:43
    can go here I can say you know bun I
  • 00:35:45
    guess it's uh reverse
  • 00:35:49
    test.ts um and I can again debug with
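For reference, roughly what a test file and method like this could look like with Bun's built-in test runner; the file and function names are illustrative, not the exact code Composer generated:

```typescript
// reverse.ts -- the method under test
export function reverseString(input: string): string {
  return input.split("").reverse().join("");
}

// reverse.test.ts -- run with `bun test reverse.test.ts`
import { test, expect } from "bun:test";
import { reverseString } from "./reverse";

test("reverses a simple string", () => {
  expect(reverseString("hello")).toBe("olleh");
});

test("handles an empty string", () => {
  expect(reverseString("")).toBe("");
});

test("is its own inverse", () => {
  expect(reverseString(reverseString("round trip"))).toBe("round trip");
});
```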
  • 00:35:53
    composer this is a nice part I can just
  • 00:35:54
    go debug with I
  • 00:35:59
    got up I got up
  • 00:36:01
    my man I'm out of the
  • 00:36:07
free requests, that's how much I use Cursor, I
  • 00:36:10
    just always blow through the budget but
  • 00:36:11
    Okay cool so here it like you know
  • 00:36:14
    passes the test but let's actually say
  • 00:36:15
    that like we are using something a
  • 00:36:16
    little bit more complicated than
  • 00:36:18
    reversing uh a string now I can go into
  • 00:36:20
    here and I can say let's just not
  • 00:36:23
    reverse it let's just say like let's
  • 00:36:24
    just break the code let's just say here
  • 00:36:27
    we'll split it like this okay um return
  • 00:36:31
    dot okay so now if I go here I go test
  • 00:36:35
    if I go test file so all these tests
  • 00:36:36
fail right now obviously this is like a
  • 00:36:38
    pretty simple example and it's almost as
  • 00:36:41
    simple as just clicking this button that
  • 00:36:43
    says add to composer and then I say um
  • 00:36:47
    you
  • 00:36:48
    know please fix the reverse method due
  • 00:36:53
    to errors and now the nice part is here
  • 00:36:56
    cursor will pull in that terminal
  • 00:36:58
    that'll throw you know the errors where
  • 00:36:59
    it happens and what cursor will do is
  • 00:37:03
it'll look at that and it'll say hey
  • 00:37:04
    look I see what the issue is and it'll
  • 00:37:06
    just fix it so this is kind of what I
  • 00:37:08
like to call, I don't actually
  • 00:37:09
    have a name for it yet, maybe LLM test
  • 00:37:12
    driven development whatever you want to
  • 00:37:14
    call it but it's like you come in and
  • 00:37:15
    you describe what you're trying to do
  • 00:37:17
here the LLM writes the tests for it
  • 00:37:20
    and then it's going to write the method
  • 00:37:22
    and then what you can do is have it run
  • 00:37:24
    and now if the method itself like this
  • 00:37:25
    function which is reversing a string is
  • 00:37:27
complex or confusing, it will be able
  • 00:37:30
    to sort of like essentially agentically
  • 00:37:32
    air quote here fix itself if that makes
  • 00:37:34
    sense it'll test the code see if it
  • 00:37:36
    passes the tests if not it'll update the
  • 00:37:39
    code and then sort of do that until it
  • 00:37:41
    can you know pass the test and all you
  • 00:37:43
    have to do is make sure that the tests
  • 00:37:45
you're writing are correct, and I use
  • 00:37:47
    this a lot for obviously for simple
  • 00:37:49
    functions it's not that useful but when
  • 00:37:51
    you have code that is across a couple
  • 00:37:53
different files, you know, in a
  • 00:37:55
    modern code base it's not just a single
  • 00:37:57
    function it's like you have like you
  • 00:37:58
know a bunch of different files and
  • 00:38:00
    stuff connecting and ones that require a
  • 00:38:04
    lot of like conditionals or
  • 00:38:07
    like they're not as simple as this it's
  • 00:38:09
    like that's where I found that whenever
  • 00:38:11
    I would try to get like cursor or sorry
  • 00:38:13
I'd get like Sonnet to one-shot it, it would
  • 00:38:14
    fail every single time but then a second
  • 00:38:16
    that I was like okay please let's write
  • 00:38:19
    the test for it and then I would sit
  • 00:38:20
    there and kind of help it write the test
  • 00:38:21
    it was able to debug itself a lot better
  • 00:38:23
    and go through these like bigger maybe
  • 00:38:26
meatier functions that normally it wouldn't
  • 00:38:28
    be able to do, that even like o1 and o1-mini
  • 00:38:30
    couldn't solve, but the second that I would
  • 00:38:32
    apply this like test driven development
  • 00:38:34
    whatever you want to call it the model
  • 00:38:35
    was able to look at the output see where
  • 00:38:37
    it messes up adjust the code and kind of
  • 00:38:39
    iterate on itself like that that's cool
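A minimal sketch of that test-first loop in a Bun/TypeScript project; `askModelForFix` is a hypothetical stand-in for whatever model call or Composer step you use, not a real API:

```typescript
// tdd-loop.ts -- illustrative only: run the tests, and while they fail, hand
// the error output back to a model and apply the fix it suggests.
import { spawnSync } from "node:child_process";

// Hypothetical stand-in: send the current code plus the failing test output to
// an LLM and get back a revised version of the file. Replace with a real call.
async function askModelForFix(code: string, testOutput: string): Promise<string> {
  return code; // placeholder: returns the code unchanged
}

export async function fixUntilTestsPass(file: string, maxAttempts = 5): Promise<boolean> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const run = spawnSync("bun", ["test"], { encoding: "utf8" });
    if (run.status === 0) return true; // all tests pass, we're done

    // Tests failed: grab the error output and ask the model to decipher it.
    const errors = `${run.stdout}\n${run.stderr}`;
    const currentCode = await Bun.file(file).text();
    const fixedCode = await askModelForFix(currentCode, errors);
    await Bun.write(file, fixedCode);
  }
  return false; // give up; a human should look at the code and the tests
}
```

The only part you have to verify by hand is the tests themselves; as long as they encode what "correct" means, the loop can keep retrying until they pass.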
  • 00:38:41
    so not only does this test first mindset
  • 00:38:45
    um it's kind of like a prompt
  • 00:38:46
    engineering technique it's almost like
  • 00:38:47
    think out loud but it's almost like
  • 00:38:49
    write the goal first and then tell me
  • 00:38:50
    what you think we should do for it but
  • 00:38:52
    you also get tests out the other end and
  • 00:38:54
    so you get a little bit of extra utility
  • 00:38:55
    as a byproduct
  • 00:38:59
    exactly it's a it's a win-win you get a
  • 00:39:01
    little bit of both and to me that was
  • 00:39:03
    the one thing I never understood why
  • 00:39:05
    people haven't done more of because you
  • 00:39:07
would think, well, if it passes all the
  • 00:39:08
tests, the code is like you know
  • 00:39:11
    you're happy that it passed the test but
  • 00:39:12
    it's something that I haven't seen a lot
  • 00:39:13
    of people do yeah yeah for sure well
  • 00:39:15
    that's awesome well that's fabulous
  • 00:39:17
    thank you for showing me the cursor
  • 00:39:18
    example one of the questions I love
  • 00:39:20
    asking is I want to know what the smart
  • 00:39:22
    people are talking about right now like
  • 00:39:24
    in AI so like as You observe on Twitter
  • 00:39:27
    in your circles what are the smart
  • 00:39:29
    people talking about that's a good
  • 00:39:31
    question oh man I
  • 00:39:33
    think what I see a lot of people talking
  • 00:39:36
    about is sort of the you know what's it
  • 00:39:39
called like test-time compute, like o1
  • 00:39:41
    thinking I see a lot of people talking
  • 00:39:42
    about those I see a lot of people
  • 00:39:44
talking about having those
  • 00:39:47
    thinking models do more agentic sorts of
  • 00:39:50
    tasks um and basically bringing this
  • 00:39:54
    what I like to think of as an agent as a
  • 00:39:56
for loop inside the model's, uh, thinking
  • 00:39:59
    process, and training the
  • 00:40:02
    model to just innately be able to call
  • 00:40:04
    tools like and we saw that I think a
  • 00:40:07
good example of that is, uh, computer use
  • 00:40:09
    right, from Anthropic, right, they
  • 00:40:11
    obviously fine-tuned in on that so I see
  • 00:40:13
    a lot of people talking about that um I
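A minimal sketch of that "agent as a for loop" framing; `callModel` and `runTool` are hypothetical placeholders, not any particular provider's API:

```typescript
// agent-loop.ts -- illustrative: the "agent" is just a loop that calls the
// model, executes whatever tool it asks for, feeds the result back in, and
// stops when the model returns a final answer.
type ModelStep =
  | { type: "final"; answer: string }
  | { type: "tool"; name: string; args: string };

// Hypothetical placeholders for the model call and the tool executor.
async function callModel(history: string[]): Promise<ModelStep> {
  return { type: "final", answer: "done" }; // stub
}
async function runTool(name: string, args: string): Promise<string> {
  return `result of ${name}(${args})`; // stub
}

export async function runAgent(task: string, maxSteps = 10): Promise<string> {
  const history = [task];
  for (let step = 0; step < maxSteps; step++) {
    const next = await callModel(history);
    if (next.type === "final") return next.answer;
    // The model asked for a tool: run it and append the observation.
    history.push(`tool ${next.name} returned: ${await runTool(next.name, next.args)}`);
  }
  return "stopped: step limit reached";
}
```

The trend being described is pushing that loop into the model's own thinking and training, so tool calls happen "innately" rather than in wrapper code like this.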
  • 00:40:15
    do see what I started to notice is
  • 00:40:17
    people starting to talk about whether
  • 00:40:19
    we've hit some variation of a wall I
  • 00:40:21
    don't know if you've seen it too and
  • 00:40:22
I've been hearing little rumors that you
  • 00:40:24
know Claude 3.5 Opus is not up to par and
  • 00:40:28
    like the the new Gemini model is not as
  • 00:40:31
good, so I've been hearing that as well um
  • 00:40:35
    and what else are people really talking
  • 00:40:37
    about and I think I think we spoke a lot
  • 00:40:39
    about the other things model
  • 00:40:40
    distillation um and the other thing I'm
  • 00:40:42
    starting to see more of is people being
  • 00:40:44
    a little bit not I guess talking more
  • 00:40:47
    about evals like I I think a lot of
  • 00:40:48
    people didn't really talk about it and
  • 00:40:50
    people are saying hey like from a
  • 00:40:51
    product perspective if you want your
  • 00:40:53
    product to be good you need to write
  • 00:40:54
    evals which are just a way of writing
  • 00:40:56
tests, so that's kind of what I'm seeing and
  • 00:40:57
    I don't know if you've seen anything
  • 00:40:58
    different but just from what I've heard
  • 00:41:00
    from people talking yeah let me think is
  • 00:41:02
    there any anything else I would
  • 00:41:04
    add to that list um the one thing people
  • 00:41:07
    aren't talking about it but I think it
  • 00:41:09
    will be a big deal when it actually
  • 00:41:10
    comes out is the whole feature
  • 00:41:11
    engineering um weight manipulation uh
  • 00:41:14
    like the Golden Gate uh Claude
  • 00:41:17
from Anthropic. I'm still waiting for access
  • 00:41:20
    to that because that is going to be an
  • 00:41:21
    alternative to prompt engineering and I
  • 00:41:22
    have no idea like how easy it's going to
  • 00:41:24
    be to work with what kind of results
  • 00:41:26
we're going to get but I'm excited to test
  • 00:41:27
    that whenever it comes out yeah I I I
  • 00:41:29
    remember seeing that I was like I was
  • 00:41:31
    blown away and I kind of forgot about it
  • 00:41:32
    so that I'm actually interested to see
  • 00:41:34
if they will ever let you have that
  • 00:41:36
    much interoperability with those models
  • 00:41:38
    like maybe there's like no no we're good
  • 00:41:40
    sorry we're shelving it like you're not
  • 00:41:41
allowed to touch it, right, but that would be
  • 00:41:43
    really interesting yeah for sure for
  • 00:41:45
    sure um awesome two more questions here
  • 00:41:48
    last one I love hearing about what is in
  • 00:41:50
    people's tool kit so I've seen you use
  • 00:41:52
Excalidraw on your YouTube
  • 00:41:56
    videos, I've seen you use Replit, I've
  • 00:41:58
    heard rumblings about v0, what else is in
  • 00:42:00
    your toolkit that is in your kind of
  • 00:42:02
    day-to-day workflows okay okay so
  • 00:42:04
    there's a lot I guess yeah you got you
  • 00:42:05
got a couple: v0, obviously there's
  • 00:42:07
    Cursor, um, Excalidraw, I like it for
  • 00:42:10
    drawing little diagrams um the other one
  • 00:42:13
    I guess that I use a lot is the
  • 00:42:15
    playground from anthropic and from open
  • 00:42:17
    AI uh which is like different than chat
  • 00:42:19
    GPT I use that to iterate on prompts
  • 00:42:22
    um I use this yeah the the one that I
  • 00:42:25
    use for transcribing uh the actual audio
  • 00:42:28
is called Whisper Flow, it's the one
  • 00:42:30
    where I like I have a hotkey that I
  • 00:42:31
    press and it takes the voice and
  • 00:42:34
    transcribes it into the inputs that you
  • 00:42:35
    saw me use um the other tooling that I
  • 00:42:38
    use I mean we can go do you want to go
  • 00:42:40
    into the technical side or are we just
  • 00:42:42
    going to leave it at like the high level
  • 00:42:44
    I let's let's not go like I don't want
  • 00:42:45
to know your entire tech stack but like
  • 00:42:47
    what is in like the cool AI stuff that
  • 00:42:49
    like you're you're you're grabbing for I
  • 00:42:52
    think that's pretty much it I think um I
  • 00:42:55
    think you got it there I I there's not
  • 00:42:57
    many other tools that I honestly use
  • 00:42:58
    like I just like I a lot of it's yeah
  • 00:43:02
like just writing the code. LangSmith
  • 00:43:03
    is one actually, I will say that we
  • 00:43:06
    use LangSmith a lot for evals, that's
  • 00:43:08
    like the other one um but yeah that's
  • 00:43:10
    pretty much it from from me I think you
  • 00:43:12
nailed it: v0, Cursor, Excalidraw, um, OBS
  • 00:43:16
    if you're recording videos yeah yeah
  • 00:43:18
    yeah yeah for sure um all right last
  • 00:43:20
    question and this is kind of off topic
  • 00:43:22
    from the AI side but I know people would
  • 00:43:23
    be interested in it so you've had a few
  • 00:43:25
    bangers on Twitter like just some things
  • 00:43:27
    that just absolutely pop and as somebody
  • 00:43:29
    who does a little bit of Twitter himself
  • 00:43:30
    too I can look at a tweet and be like
  • 00:43:32
    that person thought about it and they
  • 00:43:33
    did a really good job as to how they
  • 00:43:34
    architected and constructed it and I
  • 00:43:36
    noticed that with yourself so what hits
  • 00:43:38
    on Twitter and what what's your advice
  • 00:43:40
    for people who like want to do better on
  • 00:43:43
    it oh man okay so Twitter is just this
  • 00:43:46
    hilarious platform that the algorithm
  • 00:43:49
    changes a lot so it's you kind of got to
  • 00:43:51
    get a feel for what works and what
  • 00:43:53
    doesn't and luckily the cost so for
  • 00:43:55
    anyone's looking to grow the cost to
  • 00:43:57
    post on X Twitter is zero like you don't
  • 00:44:00
    pay anything if it doesn't do well no
  • 00:44:02
    one cares so it's the one platform where
  • 00:44:05
    the cost is literally zero because
  • 00:44:07
    you're just typing so type things away
  • 00:44:09
    how I craft a banger it's like a mixture
  • 00:44:12
    of what I see trending so what I see
  • 00:44:15
    what people are talking about and
  • 00:44:17
    there's two ways to craft a banger one
  • 00:44:20
    is you have to be controversial I'm
  • 00:44:22
you are not going to craft a banger
  • 00:44:23
    if you're not controversial now there's
  • 00:44:25
    pros and cons if you're posting that
  • 00:44:27
    kind of stuff all the time people will
  • 00:44:28
    be like hey you're just posting
  • 00:44:30
    clickbait so you got to be careful with
  • 00:44:31
    it you can't be like this is insane and
  • 00:44:34
    every single tweet starts with that like
  • 00:44:36
    no one and no one's going to believe you
  • 00:44:37
    but start saying something controversial
  • 00:44:40
    and the most important part of crafting
  • 00:44:42
a banger is your hook. I can tell, like,
  • 00:44:45
    honestly I'll post something and I can
  • 00:44:47
    tell within 20 minutes if it's going to
  • 00:44:49
    be a banger or not and it's basically
  • 00:44:52
    how natural does it come that's one
  • 00:44:54
    that's like how natural did this thought
  • 00:44:55
    come to me and how well did I craft that
  • 00:44:57
    hook everything in
  • 00:44:59
    between like you could you can kind of
  • 00:45:01
sit there and min-max, but that's
  • 00:45:04
    how I sit there and sometimes I'll sit
  • 00:45:05
    on something and I'll be like oh man
  • 00:45:07
    like I just don't know the right way to
  • 00:45:09
    say it so I won't post it but then it'll
  • 00:45:11
    just come to me and I'll be like all
  • 00:45:13
    right I got this I all the words I'm
  • 00:45:16
    using the right structure it's like the
  • 00:45:18
    the right timing and and that's kind of
  • 00:45:21
    what goes into crafting it so um the one
  • 00:45:23
    piece of advice that I will give from my
  • 00:45:25
    personal experience is don't spend too
  • 00:45:27
    much time on a tweet because I unless
  • 00:45:30
you're doing educational content, there
  • 00:45:32
    should be a diagram where the more time
  • 00:45:34
    you spend thinking about a tweet the
  • 00:45:36
worse it does because I swear the
  • 00:45:38
    majority of my bangers I spend like 15
  • 00:45:40
    minutes thinking about I'm like all
  • 00:45:41
    right I'm just going to post it you know
  • 00:45:43
grab a coffee, I come back and it blew
  • 00:45:44
    up and then all of a sudden you see
  • 00:45:46
    1.4 million
  • 00:45:48
    views oh man do I have time I have I
  • 00:45:51
    have to I have to tell you the story of
  • 00:45:53
how it started, do I have time for that
  • 00:45:55
    yeah yeah let's hear it okay so because
  • 00:45:57
    it's so relevant to the Banger tweet
  • 00:45:59
    so my company we we started like a year
  • 00:46:03
    and a half ago and right this is around
  • 00:46:05
    the time that agents like people were
  • 00:46:07
    talking about them but didn't have any
  • 00:46:08
    clue this was let's say March
  • 00:46:12
2023, and at this time no one
  • 00:46:16
    actually knew of my account I literally
  • 00:46:18
    had I had been posting tweets and no one
  • 00:46:21
    replied you know the classic zero views
  • 00:46:23
    you know that's just what happens and
  • 00:46:25
    then and I remember I saw someone else
  • 00:46:28
    post something about Auto GPT and I saw
  • 00:46:30
    it and I was like it looks pretty cool
  • 00:46:32
    but I ignored it and then it came up
  • 00:46:34
    again and I was like no I can't I cannot
  • 00:46:36
    not ignore this like this seems
  • 00:46:38
something very interesting and I'd been
  • 00:46:39
building, actually, like AI side
  • 00:46:41
    projects before this and I was like you
  • 00:46:43
    know what let me like try this thing out
  • 00:46:44
    and obviously I tried it and back then I
  • 00:46:46
    was like dude this is insane agents AI
  • 00:46:49
    is gonna be crazy so when I was like I
  • 00:46:52
    just posted about it and like I didn't
  • 00:46:54
    post anything crazy and I was like oh
  • 00:46:56
yeah this thing is kind of cool it's
  • 00:46:57
    pretty crazy and it like got like I
  • 00:46:59
    think that was the first post that got
  • 00:47:01
    over a thousand likes and I was like
  • 00:47:02
    wait a minute wow and then I was like
  • 00:47:04
    hold up hold a second then I saw this
  • 00:47:06
    trend that people wanted to do something
  • 00:47:09
    about like AI agents and it's
  • 00:47:11
    interestingly enough I like thought back
  • 00:47:13
to an episode of, um, like, My First
  • 00:47:15
    Million, so funny, and I remember them
  • 00:47:17
    talking about like there's sometimes you
  • 00:47:19
    see like this opportunity and I was like
  • 00:47:20
    dude I got to sit here and I got to do
  • 00:47:22
    two things first I got to craft
  • 00:47:24
    something I got to make a product that
  • 00:47:25
    people want to use and I got to figure
  • 00:47:27
    out the right Twitter thread and
  • 00:47:30
    narrative and story to craft to get
  • 00:47:31
    people on it so that weekend I spent the
  • 00:47:34
whole weekend building v0 of Cognosys
  • 00:47:37
    which was like our previous product in
  • 00:47:39
    the meantime posting Twitter bangers and
  • 00:47:43
    threads about how AI agents were going
  • 00:47:46
    to change everyone's life and every
  • 00:47:48
    single post was getting like a million
  • 00:47:50
    views I'm not even exaggerating oh and I
  • 00:47:52
    was like dude and and I was like okay
  • 00:47:55
    and all I would be posting I was like it
  • 00:47:56
was kind of clickbaity, I was like
  • 00:47:58
    this is going to change your life and
  • 00:47:59
    then getting like million view million
  • 00:48:01
views, and then I posted the product, like I was
  • 00:48:03
    like Hey like here I built this thing
  • 00:48:04
    for you people to go and try because I
  • 00:48:06
    know from what you've been telling me um
  • 00:48:09
    you don't want to go through GitHub and
  • 00:48:10
and I posted it out and I literally
  • 00:48:12
    built it in like three days and within
  • 00:48:15
    like two days we got 50,000 users so my
  • 00:48:19
goodness, that is so crazy. The
  • 00:48:22
    craziest two weeks and the most
  • 00:48:23
    stressful two weeks of my life and it
  • 00:48:26
started all from "how can I craft a
  • 00:48:28
    banger tweet," so I will say that that
  • 00:48:31
    was why it's so relevant and so funny it
  • 00:48:33
    just shows how powerful uh writing well
  • 00:48:36
    and writing with the right timing and
  • 00:48:38
    structure given what's happening can
  • 00:48:40
    potentially you know help you start a
  • 00:48:42
    company so and with that that is an
  • 00:48:44
    absolutely beautiful story to end on
  • 00:48:46
Sully, thank you very much for joining us
  • 00:48:48
    today oh dude it it was a pleasure I I
  • 00:48:50
    enjoyed it and hopefully my workflow is
  • 00:48:53
    applicable to other people people can
  • 00:48:54
    look at it and see that like hey using
  • 00:48:57
    AI is just not that hard you just got to
  • 00:48:59
    talk to the computer and it'll do stuff
  • 00:49:02
    for you
Tags
  • AI models
  • model distillation
  • prompt engineering
  • model routing
  • AI evaluation
  • task optimization
  • language models
  • efficient AI use