00:00:05
[Music]
00:00:33
[Music]
00:00:55
[Music]
00:01:07
[Music]
00:01:19
[Music]
00:01:35
[Music]
00:01:45
[Music]
00:02:14
[Music]
00:02:20
[Music]
00:02:32
[Music]
00:02:45
[Music]
00:02:56
[Music]
00:02:59
e
00:03:34
ah there we go yes we have sound Massa
00:03:37
Hill can you hear me
00:03:40
okay all right I think you might still
00:03:42
be muted I'm gonna just do a quick intro
00:03:45
I'm muted yeah good now all right here
00:03:48
we go no sound the sound should be
00:03:51
coming in any moment
00:03:54
now no rip audio no audio yep it'll be
00:03:57
fixed I still don't know how to operate
00:03:59
a mute
00:04:02
button all right we got uh we got audio
00:04:05
good okay so today super excited to
00:04:09
share we have uh two special guests
00:04:11
joining us today uh Matt Schumer
00:04:14
co-founder and CEO of hyperr AI and the
00:04:20
author creator of the most uh recently
00:04:23
dropped amazing open source open weights
00:04:25
model reflection 70b we also have sahil
00:04:30
Chow from uh glaive he is the founder of
00:04:34
glaive who was uh pivotal in helping get
00:04:37
this model to everybody so very excited
00:04:40
to talk to you both thank you so much
00:04:42
for joining me today uh cannot wait to
00:04:45
hear more about what went into this
00:04:47
model but uh welcome to the stream
00:04:51
guys thanks for having
00:04:54
us all right um so first let's just talk
00:04:59
about what happened so yesterday
00:05:02
actually I'm going to share your your
00:05:03
Twitter
00:05:04
post let's see you probably posted a ton
00:05:06
of stuff since
00:05:08
yesterday too
00:05:11
much cool here's the announcement so
00:05:14
match humer yesterday uh 11:51 a.m. I'm
00:05:17
excited to announce reflection 70b the
00:05:19
world's top open source model trained
00:05:23
using a new technique reflection tuning
00:05:25
we'll talk about what that is today and
00:05:28
it is doing incredibly well uh if you
00:05:32
look at the benchmarks just across the
00:05:34
board competitive with the other
00:05:37
Frontier both closed source and open
00:05:39
source models beating llama 3.1 405b the
00:05:42
larger one and according to Matt we have
00:05:45
uh the 405b model coming out next week
00:05:47
so first Matt maybe just tell us a
00:05:51
little bit about yourself uh what do you
00:05:53
what do you do and what made you want to
00:05:56
create this
00:05:58
model yeah yeah so um high level on
00:06:02
myself and the company right so uh I've
00:06:04
been personally starting company since I
00:06:05
was 12 uh we can skip a lot of time a
00:06:07
few years ago uh when I was in college I
00:06:09
started uh other side AI which is the
00:06:11
company that sort of became hyperight
00:06:14
right and the idea about hyperight
00:06:15
initially was can we create an AI that
00:06:18
can write your emails for you we were
00:06:19
actually the first um that were aware of
00:06:21
uh VC backed company in this sort of
00:06:23
like quote unquote generative AI space
00:06:24
that used the open AI models at the
00:06:26
beginning we've grown quite a bit since
00:06:28
then now we have well a few million
00:06:31
users uh millions in Revenue were
00:06:33
profitable um for a much sort of like
00:06:36
expanded version of that product it's
00:06:38
sort of writing in general we're the
00:06:39
best AI for writing continues to get
00:06:42
better we have a lot of sort of specific
00:06:43
things we do but that's that's the high
00:06:45
level um in terms of this model this is
00:06:48
sort of just a fun thing uh very long
00:06:50
story short I was actually on vacation
00:06:52
and my mind was just everywhere and I
00:06:54
was like okay I'm I'm kind of bored of
00:06:56
this I need to actually do something
00:06:57
productive and I've been noling on this
00:07:00
idea for a very long time and it was
00:07:01
sort of like time to just do it so
00:07:03
towards the end of the trip I actually
00:07:05
reached out to sahill and I was like can
00:07:06
we collaborate and that's how this came
00:07:08
to be how long ago was
00:07:12
that three weeks I think so to be clear
00:07:18
uh you had the idea reached out to
00:07:21
sahill put together the data set
00:07:25
fine-tuned the model and published it
00:07:27
all within three weeks
00:07:31
yes and I know that sounds crazy I get
00:07:35
that but I think a lot of people
00:07:37
underestimate what can be done today
00:07:38
with a very small amount of resources
00:07:40
right too many people look to the big
00:07:42
the big AI labs and they're like hey
00:07:43
look you know they spent billions of
00:07:44
dollars doing this there's no hope for
00:07:46
anybody to compete with a small budget
00:07:47
and small team small amount of time
00:07:50
right but thill and I did this on the
00:07:52
side right this wasn't even sort of our
00:07:53
major Focus not even close to it and
00:07:55
it's just thinking about the problem in
00:07:57
a different way right the data set was
00:08:00
everything for this and glaive made that
00:08:03
possible we just had to kind of know
00:08:04
what we were going after and once we
00:08:06
sort of had the idea of what this would
00:08:08
look like it was very easy to actually
00:08:09
put into into practice awesome and and
00:08:13
on that note sahill please tell us about
00:08:16
yourself tell us about what glaive
00:08:20
does yeah yeah uh kind of similar to M
00:08:23
I've been
00:08:24
doing startups specifically AI startups
00:08:27
for a few years now uh just before
00:08:29
starting live I worked at a company
00:08:32
called banana where we were building
00:08:34
sess gpus for machine
00:08:36
learning uh realized that people cannot
00:08:38
host custom models because to have
00:08:40
really high performing custom models
00:08:42
that use case specific you need high
00:08:43
quality data and companies like that
00:08:46
most of the times and that is when I
00:08:48
decided to just uh be a founder and
00:08:52
start claive and glaive is essentially a
00:08:54
platform where companies can build use
00:08:57
case specific data sets uh using using
00:08:59
our synthetic data generation Pipeline
00:09:02
and train mods on that and basically
00:09:04
keep iterating on them uh because it's
00:09:07
uh as I think match said it's it's
00:09:09
pretty quick to iterate on data sets
00:09:11
with the help of synthetic data uh the
00:09:13
generation process is pretty quick yeah
00:09:17
uh that's uh that's summary cool uh so I
00:09:22
mean Matt you said it the data set is
00:09:24
everything but I'm I want to touch on
00:09:26
that a little bit because we can talk
00:09:28
about you know publicly available data
00:09:30
and and how that is essentially all used
00:09:34
in current Frontier models and there's
00:09:37
really like two solutions to that you
00:09:38
either create synthetic data or you do
00:09:41
more with the data that you have and so
00:09:44
which approach were did you take with
00:09:47
reflection and what what makes
00:09:49
reflection special what is the the
00:09:51
secret sauce
00:09:53
there yeah so I think s could probably
00:09:55
talk more to the specifics of the data
00:09:57
side and how the data set was
00:09:58
constructed I could talk more the sort
00:09:59
of idea behind reflection and why it
00:10:02
works um so at a high level right the
00:10:05
general idea is llms are sort of getting
00:10:08
to the point where they can they can
00:10:09
think quote unquote think um and it's
00:10:12
sort of mirrored after how a human
00:10:13
thinks right you kind of talk through it
00:10:15
in your head you know some people do it
00:10:16
differently but you know some people
00:10:17
kind of talk through the problem in
00:10:18
their head and arrive at an answer and
00:10:20
llms do that today with Chain of Thought
00:10:22
that's how they have you know it's one
00:10:24
of the reasons they've improved in
00:10:25
performance so much over the last few
00:10:27
years uh it's been a corner so know
00:10:29
pretty much every major llm whether it's
00:10:32
coding whether it's just general
00:10:33
reasoning or writing even um so we do a
00:10:36
lot of work with that and one of the
00:10:37
things that I realized is like look as a
00:10:40
person I can think through a problem but
00:10:41
I'm not always going to get every little
00:10:43
bit right and what happens if I get
00:10:44
something wrong well personally I
00:10:46
reflect I'm like wait I made a mistake I
00:10:49
backtrack and I fix myself but an llm
00:10:51
doesn't do that one of the things we
00:10:53
noticed is that llms once they make that
00:10:55
first mistake they kind of accept it as
00:10:57
fact right if I'm asking it 2 plus two
00:11:01
and then multiply that by seven right
00:11:03
and it says okay I'll do that let's
00:11:05
start with two plus two that equals five
00:11:07
and then five time 7even right it's
00:11:08
already made that mistake and it's just
00:11:10
assuming the thing that it said is right
00:11:12
and if we could
00:11:14
essentially teach the model to think
00:11:17
more like we do and actually reflect on
00:11:20
behavior that it makes a mistake in the
00:11:23
model gets smarter it gets more reliable
00:11:26
gets more accurate that's generally the
00:11:28
idea here we did a lot of even beyond
00:11:31
the data it's how we trained the model
00:11:32
right we created what we call like our
00:11:34
own splitting strategy that allows the
00:11:37
model to not learn to make mistakes
00:11:38
because if you do this wrong it's very
00:11:39
easy to teach the model to make more
00:11:41
mistakes and then fix them the idea is
00:11:43
to keep the model's performance as is or
00:11:44
make it even better and then on the
00:11:47
things where it actually ends up making
00:11:48
a mistake and it would no matter what we
00:11:50
can fix that and it's still not perfect
00:11:52
it's still very early it's still a first
00:11:54
version of this but I think we've come
00:11:56
quite a bit um away in a few weeks so
00:11:58
that's that's sort of like how it works
00:12:00
sah maybe you want to talk a little bit
00:12:01
more to the the data generation process
00:12:03
itself because that was all
00:12:05
you yeah yeah I can do that uh
00:12:09
surprisingly the data set we used for
00:12:11
this isn't that big uh what people would
00:12:14
expect it's roughly we generated uh and
00:12:17
we generated it in steps uh we started
00:12:19
out with I think just 10,000 samples uh
00:12:22
to see if this is actually possible and
00:12:24
we scaled up to 100,000
00:12:26
samples uh we decided that we'll do uh
00:12:29
some code data some math data General
00:12:31
reasoning function calling and multiturn
00:12:34
so uh the goal really wasn't to get a
00:12:37
lot of data and teach the model how to
00:12:39
reason but it was essentially to teach
00:12:41
the model to recognize its own mistake
00:12:44
and I think uh to that point see a lot
00:12:47
of people uh mentioned that you can kind
00:12:49
of get similar gains out of just
00:12:51
prompting uh regular instruct models to
00:12:54
use reflection and I think that's true
00:12:57
to some extent you can get uh percentage
00:12:59
of the gains by just prompting but if
00:13:02
you try to prompt even son 3.5 for
00:13:05
example you'll see that it's there's a
00:13:07
lot of bias in the model outputs where
00:13:10
the model almost always believes that
00:13:12
what it's saying is correct and we face
00:13:15
that problem in data generation as well
00:13:17
right like we use language models we use
00:13:19
f tune language models to generate
00:13:21
synthetic data and when we first try to
00:13:24
actually generate reflection data we
00:13:25
realize that if we ask the model to
00:13:28
actually make the mistake and then
00:13:30
reflect on it Ms are really bad at doing
00:13:32
that uh maybe it's just R if you do that
00:13:35
like the MS will make mistakes easily
00:13:37
but if you ask a model to deliberately
00:13:39
make a mistake it just won't be able to
00:13:43
and that was the main goal of just find
00:13:45
you to teach the model that it can
00:13:48
actually make mistakes and and correct
00:13:51
that yeah uh I I have so many questions
00:13:55
about this so uh for those of you
00:13:58
watching uh maybe this is a good way to
00:14:00
break it down you you effectively took
00:14:02
really kind of sophisticated prompt
00:14:04
engineering strategies you know s even
00:14:08
something as simple as explain your
00:14:09
reasoning step by step which is
00:14:11
something that I put into all of my
00:14:13
prompts when I'm giving it more complex
00:14:15
reasoning and logic questions um but you
00:14:18
can get obviously even more
00:14:20
sophisticated than that Chain of Thought
00:14:22
um and and so on and so you've taken
00:14:25
that and you've built it into the model
00:14:27
itself and you've actually given it
00:14:29
examples of making mistakes along the
00:14:32
way and that's actually what I'm showing
00:14:33
I believe on the screen right here
00:14:35
actually there's a lot going on
00:14:36
hopefully yall can see it uh but but
00:14:39
effectively you're teaching the model in
00:14:43
the fine-tuning data itself to think
00:14:45
through everything you know more step by
00:14:48
step but also to and hence the name
00:14:52
reflect on the output as it's doing the
00:14:56
inference and then potentially correct
00:14:58
itself during the inference process so
00:15:01
what first of all H like what was the
00:15:06
trigger for thinking that this would be
00:15:08
really a a a good strategy and then why
00:15:11
is it better than simply uh put like
00:15:15
having a system message with some of
00:15:17
this prompting techniques or or putting
00:15:19
it in the prompt
00:15:22
itself yeah I mean why don't I start
00:15:24
with what you you just asked the last
00:15:25
question s covered a little bit of this
00:15:27
um some of my person findings with this
00:15:29
because I've been prompt engineering for
00:15:32
too many years now um like basically
00:15:34
since 2019 with gpd2 um what I found was
00:15:39
essentially if you asked it to do this
00:15:41
and didn't train on it a couple things
00:15:43
would happen sometimes it was sort of
00:15:44
like what side was talking about where
00:15:46
you know the model was overconfident but
00:15:48
there was another issue I found which is
00:15:50
sometimes this is kind of getting a
00:15:51
little weird the model would actually
00:15:53
deliberately make mistakes that it
00:15:54
wouldn't have made otherwise because
00:15:56
when you're prompting it really wants to
00:15:58
follow those instructions
00:15:59
really wants to follow that system
00:16:00
prompt and if you say when you make a
00:16:03
mistake fix it with a reflection tag
00:16:06
you're going to notice that it's going
00:16:07
to make that mistake what you notice
00:16:08
with our model is that it doesn't always
00:16:10
use reflection it only uses it when it
00:16:12
really thinks it needs to sometimes
00:16:14
little a little bit too much but we're
00:16:15
you know we're we're much more towards
00:16:17
the end of like use it when you need to
00:16:18
don't when you don't because if you're
00:16:20
teaching it to do it all the time right
00:16:21
it's just going to make mistakes
00:16:23
deliberately um so yeah it doesn't
00:16:26
really work for prompting and I do have
00:16:28
to apologize like I said said I not
00:16:29
slept in two days so what was the first
00:16:30
part of your question no I
00:16:33
I I'm trying to understand like what the
00:16:38
reasoning is like what if it's better to
00:16:41
put the prompt in the actual fine-tune
00:16:45
data or if it's better to do it
00:16:47
afterwards and obviously like your model
00:16:49
is is crushing The Benchmark so you've
00:16:52
done something really well but like what
00:16:55
what was that idea that made you think
00:16:58
that it it's actually like let's let's
00:17:00
push this to the fine tune
00:17:03
itself yeah I mean I've had this idea
00:17:05
like I mentioned for for many months now
00:17:07
and different forms of it um so it was
00:17:10
sort of like how do you enable that for
00:17:12
everyone without them needing to figure
00:17:13
it out themselves but even if they
00:17:15
wanted to figure out themselves without
00:17:16
a fine tune it doesn't work super well
00:17:17
you know prompting like like sah said
00:17:19
can get you a marginal gain like you
00:17:21
definitely will see better performance
00:17:22
but it's not going to be the performance
00:17:23
jump we saw I mean we saw 10 plus
00:17:25
percent uh jumps in many benchmarks I
00:17:27
mean it was it was taking L 70b and
00:17:30
bring it past 405b and much further
00:17:32
beyond that I mean we're expecting to
00:17:34
see with 45b is going to be hopefully
00:17:36
insane um so yeah it's really that this
00:17:39
can't be done with prompting alone if
00:17:41
you really want to get the full force of
00:17:42
it um this is more close this is closer
00:17:45
to how people think I think there are
00:17:47
many more Gams we can make there and it
00:17:49
was just one of those stupid obvious
00:17:51
ideas that no one really thought to do
00:17:54
right I think a lot of the major labs
00:17:55
they they're way too in the weeds in
00:17:56
many ways they're thinking to
00:17:58
technically and too intelligently about
00:18:00
this where sometimes the simple thing is
00:18:02
what actually is going to work right
00:18:04
just like so many people back in you
00:18:06
know 2018 whatever they were they were
00:18:08
working on all these new architectures
00:18:09
but scaling them when we had was the
00:18:11
right answer yeah and sahill I think you
00:18:14
mentioned like
00:18:16
putting almost like telling the model to
00:18:19
make the mistake and then showing it how
00:18:20
to correct itself I is is one of the
00:18:23
secrets to reflection but you also said
00:18:26
if you do that too much it will just
00:18:28
start making mistakes what is that
00:18:30
balance how did you discover where that
00:18:32
kind of Cliff is before it started just
00:18:35
taking those mistakes and and and
00:18:37
accepting them as
00:18:39
truth yeah I think there's a bunch of
00:18:41
small uh details there and I think next
00:18:44
week when we put out like a technical
00:18:46
report it's going to help people
00:18:48
understand that as well but
00:18:49
essentially uh if you if you look at uh
00:18:54
the responses the model generate you'll
00:18:56
see in somewhere or other it's not
00:18:58
always structure but it's going to
00:18:59
classify whether the problem is a hard
00:19:01
problem moderate or an easy problem and
00:19:04
you're going to see that it only uses
00:19:05
reflection andine of thought for
00:19:07
problems it thinks uh is a hard problem
00:19:10
for itself and that was part of the
00:19:12
training so we uh only added Reflections
00:19:16
we classified problems as easy moderate
00:19:18
and hard and we only added Reflections
00:19:20
to the hard problems because you don't
00:19:23
always need Reflections you don't want
00:19:24
to teach them all to
00:19:26
overthink uh that was one part of it and
00:19:28
I think I think uh the second balance is
00:19:30
we also added uh data samples which are
00:19:34
essentially just Reflections but the all
00:19:37
reflects on something it did which was
00:19:39
correct and just realizes that it does
00:19:41
not need to actually course correct and
00:19:43
it was already correct and move on with
00:19:45
the existing uh Chain of Thought So we
00:19:48
we try to balance it out with uh these
00:19:51
type of samples trying to cover as many
00:19:52
H cases as possible I think there's
00:19:54
still room for
00:19:56
improvement but uh it was essentially I
00:19:58
think someone on twit found out that
00:20:00
this was V5 of the model we trained we
00:20:02
had like V1 2 3 and four and that's
00:20:05
essentially what it was uh training
00:20:07
model and figuring out uh how to balance
00:20:10
the data set even more so if we're if
00:20:13
we're looking at this example I I guess
00:20:15
I'm trying to figure out uh you you give
00:20:17
it a prompt and it's running inference
00:20:20
and outputting is it truly able to
00:20:25
reflect on the output as it's giving the
00:20:28
output like what is what does that flow
00:20:30
look
00:20:32
like yeah so that was that was actually
00:20:34
an intentional a very intentional design
00:20:36
decision
00:20:37
where I think a lot of people get
00:20:39
frustrated when they look at this is
00:20:41
actually something we learned from hyper
00:20:42
right people get frustrated when they
00:20:43
see Chain of Thought you know some of
00:20:44
the more power Usery people they like it
00:20:47
but your average user doesn't they just
00:20:48
want to know the answer to the question
00:20:50
and if they want to dive deeper great
00:20:51
but they don't always want to dive
00:20:53
deeper and they don't want to dig
00:20:54
through amount of text Chain of Thought
00:20:55
text to find that answer so this gets
00:20:59
even worse with reflection right because
00:21:00
when it's reflecting in its output it's
00:21:02
like oh my God there's so much going on
00:21:04
here how do I parse through this it's
00:21:05
just the cognitive load is crazy so how
00:21:08
can we fix that and one of the ideas we
00:21:10
had um was just separate the reasoning
00:21:13
out right especially as inference is
00:21:15
getting faster look at what's going on
00:21:16
with for example Gro and cerebrus right
00:21:18
inference is going to get way way faster
00:21:21
and if we can do some of the thinking up
00:21:22
front and just sort of hide that from
00:21:24
the user and show it optionally if they
00:21:26
want to see it and then spit out a
00:21:28
perfect answer at the end that's what is
00:21:30
ideal um for most users in most used
00:21:32
cases the idea here is right the model
00:21:35
will start reasoning inside sort of what
00:21:37
we call output tags or sorry thinking
00:21:39
tags so it'll that's what we're seeing
00:21:42
right
00:21:43
here yeah it's going to start thinking
00:21:46
it does its Reflections in there it does
00:21:47
it Chain of Thought in there it does all
00:21:48
its planning all that and then it's like
00:21:51
okay once it's got the right answer or
00:21:52
it thinks it's got the right answer it's
00:21:54
like I'm done then it says okay time for
00:21:56
an output here's what I'm to show the
00:21:57
user and it writes the final message to
00:22:00
the user and I think different companies
00:22:02
and different use cases and interfaces
00:22:04
will use this differently um some people
00:22:07
want to see it all some people want to
00:22:08
make it optional some people want to
00:22:10
hide it entirely um but I think that
00:22:12
especially as models get faster and the
00:22:15
actual sort of thinking process takes
00:22:17
like less than a second for example I
00:22:19
think this is the right
00:22:20
approach and so that that was the
00:22:23
purpose of the tags just to clarify so
00:22:25
you have the thinking tags if you don't
00:22:26
want to see the thinking you just want
00:22:28
the output put it's kind of like
00:22:29
standard mode and then you have advanced
00:22:31
mode uh and then you can actually see
00:22:32
what's going on personally I love seeing
00:22:35
it like I want to know what it's
00:22:36
thinking why it's thinking that and that
00:22:39
can actually help you kind of recraft
00:22:42
the original prompt if you need to uh
00:22:44
because you could see if there's a
00:22:45
misunderstanding uh or or not so like
00:22:49
what what are your thoughts on on this
00:22:51
this method of self-correction during
00:22:54
inference time versus let's say having
00:22:57
multiple large language models uh
00:23:00
working in collaboration correcting each
00:23:02
other how does this
00:23:04
differ why not use them both right
00:23:07
everything can build on everything hell
00:23:10
yeah and if you can have multiple of
00:23:12
these models right thinking about
00:23:13
different angles on each problem and you
00:23:15
have reflection with within each model
00:23:17
it's just like having a better model
00:23:19
going after um a problem but having them
00:23:21
do it together okay amazing um and then
00:23:25
one last question for me and then I I
00:23:27
know we have a bunch of questions in the
00:23:29
chat so hopefully I know you have
00:23:31
probably about 12 more minutes so uh
00:23:33
hopefully we'll get to them both um what
00:23:35
have you seen in terms of token usage on
00:23:38
output as compared to a
00:23:40
non-reflection
00:23:43
model definitely more sah you might have
00:23:45
a much better response than definitely
00:23:49
more yeah uh haven't test it out like I
00:23:53
don't have exact numbers to share right
00:23:55
now but uh from just the test I've run
00:23:58
it seems to be uh 1.5 two times the
00:24:02
number of tokens uh for harder problems
00:24:05
so it's it's definitely generating a lot
00:24:08
more tokens and yeah I think if you're
00:24:12
using a hosted API that causes the the
00:24:14
cost of the latency problem but again I
00:24:16
think uh as Matt said INF is going to
00:24:19
get much faster and cheaper so uh we
00:24:22
kind of betting on that as
00:24:24
well all right um thank you so we got a
00:24:29
bunch of questions already uh we're
00:24:31
using a new streaming software I'm not
00:24:33
sure how to put it on the screen so I'm
00:24:34
just going to read them uh Tresor asks
00:24:37
will it be possible to fine-tune other
00:24:39
models like GPT 40 mini to implement the
00:24:42
reflection approach so basically does
00:24:44
this data set does this approach work
00:24:46
across the
00:24:48
board it should there will be
00:24:51
constraints with closed provider fine
00:24:54
tunings because uh they often have their
00:24:57
own limits on how you can do the fine
00:24:58
tuning so we'll go into this in more
00:25:00
detail when we release a report and we
00:25:02
we I think are going to be releasing the
00:25:03
data set um we're still making the final
00:25:05
decision there but it's likely G to be a
00:25:07
yes um it has to be trained in a very
00:25:10
specific way it's a very simple training
00:25:12
process it's straight up just like
00:25:14
standard fine-tuning right um there's no
00:25:16
rhf going on none of that but the way
00:25:19
the data is sort of split up it's to
00:25:21
sort of like enable it to not learn to
00:25:24
make extra mistakes but learn to correct
00:25:26
them um so for any open source model it
00:25:29
should be doable and relatively easy
00:25:31
compared to some of the other approaches
00:25:32
people have uh put forward for things
00:25:35
like fine tuning open AI maybe it's
00:25:38
possible especially if you're creative
00:25:39
about it I don't I haven't fine- tune on
00:25:41
the open API in a while but from what I
00:25:43
remember it would be a little bit
00:25:44
limiting for something like this yeah
00:25:46
and I I like what you said earlier um
00:25:49
this was just an overlooked strategy
00:25:51
like it's it's maybe it was it was too
00:25:54
simple uh not that's not to be
00:25:56
pejorative at all it's like but it is
00:25:58
you know sometimes it's just like hey
00:26:00
it's right in front of you here here's
00:26:01
here's a great strategy you can use um
00:26:03
so Yash asks what about an 8B version
00:26:07
and we know you mentioned the 405b is
00:26:10
coming out uh what about an 8B version
00:26:12
that more people can run maybe
00:26:15
locally yeah I mean we found and to be
00:26:18
fair we trained the AP first um but we
00:26:21
found that it made some big improvements
00:26:23
on certain benchmarks but others it
00:26:25
really didn't and eight
00:26:29
we weren't really sure where some of the
00:26:30
gains were coming from it it felt like
00:26:33
it was a little too dumb to pick up on
00:26:35
this perfectly um like the 70b
00:26:38
definitely sort of like crossed a
00:26:39
threshold where it was like able to
00:26:40
grock uh grock this to some extent um
00:26:43
but given the sort of crazy interest
00:26:45
we've seen like we we expected this to
00:26:46
do well we didn't expect it to do this
00:26:49
well um like it's been the craziest
00:26:52
launcher of My Life
00:26:54
um given that I think there probably is
00:26:57
enough interest to figure out something
00:26:59
there but you know you said before right
00:27:01
there there are many things that are
00:27:02
sort of like these low hanging fruit
00:27:03
things that just don't pay attention to
00:27:05
and we have a few other ideas in mind
00:27:06
that I think could do far better than
00:27:08
reflection so I think it's a question of
00:27:09
do we want to focus more on reflection
00:27:11
and really optimizing an 8B that you
00:27:13
know may be obsolete soon or really
00:27:15
going crazy and focusing on something
00:27:17
that could actually Top This by a good
00:27:19
bit and I think I'm more excited about
00:27:21
the lad but I was just about to ask you
00:27:25
that Matt sorry sorry to cut you off I
00:27:27
was just about to ask if you have any
00:27:29
kind of new ideas going on do you uh
00:27:31
want to hint at anything or not quite
00:27:35
yet I think if you look at my past
00:27:38
work and you think about this from the
00:27:40
perspective of like look some of the
00:27:43
learnings we've gotten from prompting
00:27:44
have not been properly thought about in
00:27:46
terms of training them into the model I
00:27:48
think you can kind of see where I'm
00:27:49
going with
00:27:50
this
00:27:52
um there are a few different ideas that
00:27:55
I'm kind of toying with and we're
00:27:56
talking about right now and I think the
00:27:58
interesting thing is with glaive like
00:27:59
they're not going to be so hard to
00:28:01
implement um they actually might be even
00:28:04
easier than reflection um they can be
00:28:06
comined with reflection reflection but
00:28:09
yeah I think it's the idea of like we're
00:28:11
going to start exploring what it looks
00:28:12
like to take these things that make it
00:28:15
better from a prompting perspective and
00:28:16
can you make it much better by training
00:28:19
on
00:28:20
that how do you give everyone the power
00:28:22
of like the world- class prompt
00:28:23
engineering um sort of stuff maked into
00:28:25
the model and even go beyond that that's
00:28:27
kind of how about it that's that's great
00:28:30
and uh you know Victor asked kind of as
00:28:32
an extension of that uh are you going to
00:28:35
provide the data sets publicly H so
00:28:38
other people can maybe you know uh
00:28:42
manicure them as they see fit and and
00:28:44
roll their
00:28:47
own I think it's highly likely s if you
00:28:50
have different feelings that we can
00:28:52
figure that
00:28:53
out no I think in the past uh like most
00:28:57
of the open Source work we have done uh
00:29:00
as always included data sets we are
00:29:02
essentially a data set company so it
00:29:03
makes sense plus it just uh makes it
00:29:06
really easy for people to reproduce uh
00:29:08
benchmarks uh the technique so overall I
00:29:12
think it would be it would make sense to
00:29:13
open source once we have uh the four
00:29:16
five be trained as
00:29:18
well awesome uh Mark Cox asks uh how
00:29:22
exactly does it recognize a quote
00:29:25
unquote mistake that requires a
00:29:28
reflection and actually I have even kind
00:29:30
of a followup to that
00:29:33
do does it always output a reflection is
00:29:36
it always
00:29:40
necessary? No. So in the dataset,
00:29:45
we only have reflections as part of
00:29:48
samples where the model is likely to
00:29:51
make mistakes, and we decided that by
00:29:54
just classifying the problem as hard. So
00:29:56
if you just ask a simple question, it
00:29:59
won't add a reflection tag. If the
00:30:01
question is really simple, it won't even
00:30:03
do chain of thought and will just answer
00:30:05
pretty straightforwardly. So the model can
00:30:08
almost infer the difficulty of a
00:30:12
problem and decide when to do
00:30:14
reflections. But we've tried to train
00:30:16
the model to do reflections whenever it
00:30:18
solves a hard problem and
00:30:21
it encounters the step where it's making
00:30:23
a hard chain-of-thought step. So even if
00:30:27
you have a really complicated problem,
00:30:29
there would be a few steps within
00:30:31
that problem where the model is more
00:30:32
likely to make mistakes, and we have
00:30:34
added reflections in the dataset right at
00:30:36
that step. So whenever, for
00:30:39
example, it does arithmetic, it's more
00:30:41
likely to use a reflection tag
00:30:43
right after that and check whether what it did
00:30:45
was correct or
00:30:48
not. Okay, great. Alex Volkov asked: any
00:30:53
gut feeling about using MoE, mixture of
00:30:56
experts, with this technique?
00:30:58
Maybe if some experts are trained
00:31:00
like this and others
00:31:05
aren't? First of all, hi Alex. I don't
00:31:09
know, I don't see why it would be much
00:31:12
different. I've had poor experiences
00:31:16
training mixture-of-experts models
00:31:17
just
00:31:18
generally, so I don't know if you found
00:31:20
the
00:31:23
same. Yeah, I mean, one benefit could be
00:31:26
that,
00:31:28
if there's an inference speed trade-off,
00:31:31
because MoEs are usually faster
00:31:34
at inference, that could be a
00:31:36
benefit. Intuitively, I don't think we
00:31:39
would see any performance gain
00:31:42
relative to just the dense models, but
00:31:44
again, I have never tried this, so I'm not
00:31:47
sure. Okay, people will be able to try it
00:31:50
next week, especially if we have the data
00:31:51
out, so hopefully everyone does. Yeah, I
00:31:54
mean, once you do that, I'm sure everybody
00:31:56
will want to get their hands on it,
00:31:58
try it, and experiment with their own
00:32:01
techniques built on top of what you've
00:32:03
done with reflection. And then,
00:32:06
actually, another one from Alex:
00:32:08
effects of quantization on this? And
00:32:11
Matt, we were talking about this
00:32:12
earlier. I'm planning on running it
00:32:14
through my full LLM test suite, and I
00:32:17
want to download it. I have two RTX
00:32:20
A6000s; maybe that's enough for, I think
00:32:22
you said, FP8. But how do
00:32:26
you see quantization affect the quality
00:32:28
of the output using this
00:32:34
technique? I don't know for sure. I do
00:32:37
suspect, especially from my limited
00:32:39
testing (there's already an API
00:32:40
up that is
00:32:42
FP8), that there's definitely a
00:32:45
performance loss. I just don't know how
00:32:47
much it is, right? Maybe it's 1% and it really
00:32:50
doesn't matter. Maybe it is more than
00:32:53
that. I do think, from what I saw, at a
00:32:54
minimum it is much less consistent. I
00:32:58
did notice that; I just haven't spent
00:32:59
enough time with it
00:33:03
personally. Okay, Daniel Ranger is
00:33:06
asking about context length: is there a
00:33:08
reason why the context length is 4K
00:33:10
instead of 128K? And I guess, how have
00:33:12
you seen (maybe that's not the case, based
00:33:14
on your reaction), how have you seen
00:33:17
context length affect the performance of
00:33:21
the model? And maybe you could just
00:33:22
clarify what the context length
00:33:26
is. It should be the full Llama 3.1, I believe.
00:33:29
Go ahead. Go ahead. Yeah, so it's
00:33:33
the full Llama context length, I think
00:33:36
128K or around that. It's just that
00:33:40
we haven't included data that's very
00:33:41
long-context in the training, so we
00:33:45
don't know how the performance will be
00:33:47
on very long contexts, and that's
00:33:50
something to improve on in the
00:33:52
future. But you can essentially run
00:33:54
it through the entire context length of
00:33:57
Llama 3.1.
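A rough memory estimate connects two of the answers above: running the FP8 weights on two RTX A6000s, and using the full 128K Llama 3.1 context. This is back-of-envelope arithmetic, not a measurement from the stream. It assumes about 70.6B parameters and the Llama 3.1 70B shape (80 layers, 8 key/value heads, head dimension 128), 48 GB per A6000, and a BF16 KV cache for a single sequence, and it ignores activations and framework overhead.

```python
# Back-of-envelope GPU memory estimate (assumptions, not measurements):
# ~70.6B parameters and the Llama 3.1 70B shape (80 layers, 8 KV heads,
# head_dim 128); activations, CUDA context and framework overhead are ignored.

PARAMS = 70.6e9
N_LAYERS = 80
N_KV_HEADS = 8
HEAD_DIM = 128
GPU_MEM_GB = 48.0  # assumed memory of one RTX A6000


def weight_gb(params: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in decimal gigabytes (1e9 bytes)."""
    return params * bytes_per_param / 1e9


def kv_cache_gb(tokens: int, bytes_per_value: float = 2.0) -> float:
    """KV cache for one sequence: one K and one V vector per layer and KV head."""
    bytes_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * bytes_per_value
    return tokens * bytes_per_token / 1e9


if __name__ == "__main__":
    budget = 2 * GPU_MEM_GB  # two cards
    for label, bpp in (("BF16", 2.0), ("FP8", 1.0)):
        print(f"{label} weights: {weight_gb(PARAMS, bpp):.1f} GB (budget {budget:.0f} GB)")
    for ctx in (4_096, 131_072):
        print(f"BF16 KV cache at {ctx:,} tokens: {kv_cache_gb(ctx):.1f} GB")
```

Under these assumptions the FP8 weights (about 70.6 GB) fit in the 96 GB of two cards while BF16 weights (about 141 GB) do not, and a single full 131,072-token BF16 KV cache adds about 43 GB on top, so very long prompts on that setup would likely need a shorter context or a smaller KV cache.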
00:33:58
Awesome. Okay, I know you guys have to
00:34:01
go in, like, two minutes, so I'll just wrap
00:34:03
it up quickly. Your post got 2.2
00:34:06
million views, Matt. We have 1,700-plus
00:34:10
viewers in here and growing; it keeps
00:34:12
going up. So I think there's a lot of
00:34:15
appreciation for what you've done, and
00:34:17
you as well, Sahil. I just want to say
00:34:19
thank you, especially from the open-source
00:34:21
community. I love seeing open source
00:34:24
compete with closed-source
00:34:25
frontier models. It just makes
00:34:27
me so happy that I can download this
00:34:29
stuff and play with it myself. So
00:34:31
thank you so much. If you want to check
00:34:33
out Matt, here's his Twitter right
00:34:36
here: @mattshumer_
00:34:38
He is the CEO and founder
00:34:43
of HyperWrite, so check out HyperWrite AI. And
00:34:46
all of this was built with Sahil's
00:34:48
company. He's the founder of Glaive, G-L-A-I-V-E, so
00:34:54
glaive.ai, I believe it is. Is that
00:34:56
right, Sahil? Yeah, that's correct. Yeah,
00:34:59
so if you have a novel approach just
00:35:02
like Matt did, contact Sahil at Glaive
00:35:05
and you can make it happen. And I just
00:35:07
want to say thank you both. I mean, this
00:35:09
was a last-minute thing. You just
00:35:10
pinged me, Matt, and I was so happy
00:35:13
you agreed to join, you too, Sahil. So
00:35:15
thank you so much. Thanks to everybody
00:35:17
who joined, and I'm glad we got
00:35:19
some
00:35:22
details. Yeah, thank you for having us, and
00:35:24
thank you for the
00:35:25
support. Yeah, all right. You know where to
00:35:27
find them; we'll drop it all in the
00:35:29
description.