00:00:00
so OpenAI is the leading AI company and
00:00:03
of course their recent iteration of
00:00:05
models, the o1 series, is by far the most
00:00:08
advanced AI that we currently have
00:00:10
access to now incredibly this AI model
00:00:13
has been shrouded in secrecy to the
00:00:15
point that if you ever dare to ask the
00:00:17
model what it was thinking about while
00:00:19
it was generating a response,
00:00:22
the model gives you a response where it
00:00:24
tells you to never ask a question like
00:00:25
that again and if you do it too many
00:00:27
times you can actually get banned from
00:00:30
using OpenAI's service and now the reason
00:00:32
that this is shrouded in so much secrecy
00:00:34
is because this is a big step towards
00:00:36
AGI, and many are thinking that OpenAI
00:00:39
is quite likely to be the first company
00:00:41
to achieve it now with that being said
00:00:43
many have wanted to know exactly how
00:00:45
this system works. OpenAI has, of course,
00:00:46
published a few
00:00:48
different publications,
00:00:50
but nothing to the point
00:00:52
where we truly understand what's going
00:00:54
on under the hood. However, there has
00:00:56
been a recent research paper from a
00:00:58
group of researchers in China and we are
00:01:01
now asking ourselves if they just
00:01:03
managed to crack the code did they
00:01:05
figure out how o1 works and release a
00:01:08
roadmap to build something similar? So
00:01:11
this is the paper "Scaling of Search and
00:01:13
Learning: A Roadmap to Reproduce o1 from
00:01:15
Reinforcement Learning Perspective," and
00:01:17
this is the paper that could change
00:01:19
everything because if this is true then
00:01:21
it means the playing field is leveled
00:01:23
and it means it's only a matter of time
00:01:25
before many other companies start to
00:01:27
produce their own AI models that are going
00:01:29
to be on par with OpenAI's. Now I'm
00:01:31
actually going to break this down into
00:01:32
four parts but let's actually first
00:01:34
understand the basics of how this AI
00:01:36
thing even works so one of the first
00:01:38
things that we do have is we have of
00:01:39
course reinforcement learning with AI so
00:01:42
essentially we can use a game analogy so
00:01:45
imagine you're trying to teach a dog a
00:01:47
trick so you would give this dog a treat
00:01:49
which is the reward when it does
00:01:51
something right and it then learns to
00:01:54
repeat those actions to get more treats
00:01:56
and that is basically reinforcement
00:01:57
learning now with AI the dog is
00:02:00
essentially a program and the treat is a
00:02:03
digital reward and the trick could be
00:02:05
anything from winning a game to writing
00:02:07
code now why is reinforcement learning
00:02:10
important for the o1 series? This is
00:02:12
because OpenAI seems to believe that
00:02:14
reinforcement learning is the key to
00:02:16
making o1 so smart. It's basically
00:02:19
how o1 learns to reason and solve
00:02:21
complex problems through trial and error
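To make that treat analogy concrete, here is a minimal sketch of a reward loop in Python. Everything in it (the action names, the reward function, the exploration rate) is invented for illustration; it is just the basic reinforcement learning idea, not anything specific to o1 or the paper.

```python
import random

# A toy "dog learns a trick" loop: the agent picks an action, gets a reward,
# and raises the value of actions that earned treats (names are made up).
actions = ["sit", "roll_over", "bark"]
value = {a: 0.0 for a in actions}      # learned preference for each action
counts = {a: 0 for a in actions}

def reward(action: str) -> float:
    """The 'treat': only the desired trick is rewarded."""
    return 1.0 if action == "roll_over" else 0.0

for step in range(500):
    # Explore sometimes, otherwise exploit the best-known action (epsilon-greedy).
    if random.random() < 0.1:
        a = random.choice(actions)
    else:
        a = max(actions, key=lambda x: value[x])
    r = reward(a)
    counts[a] += 1
    # Incremental average: nudge the estimate toward the observed reward.
    value[a] += (r - value[a]) / counts[a]

print(value)  # "roll_over" ends up with the highest learned value
```

Run it and "roll_over" ends up with the highest value, which is the whole trick: rewarded actions get repeated.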
00:02:24
now there are four pillars of this
00:02:26
according to the paper you can see right
00:02:28
here they give us an overview of how o1
00:02:31
essentially works. We've got the policy
00:02:34
initialization this is the starting
00:02:36
point of the model this sets up the
00:02:38
model's initial reasoning abilities
00:02:39
using pre-training or fine-tuning and
00:02:41
this is basically the foundation of the
00:02:43
model we've got reward design which is
00:02:45
of course how the model is rewarded
00:02:46
which we just spoke about I'm going to
00:02:48
speak about that in more detail and then
00:02:49
of course we've got search which is
00:02:51
how, during inference time when
00:02:53
the model is quote-unquote thinking, it
00:02:55
searches through
00:02:57
different possibilities and of course we
00:02:58
have learning and this is where you
00:03:00
improve the model by analyzing the data
00:03:03
generated during the search process and
00:03:05
then you use different techniques such
00:03:06
as reinforcement learning to make the
00:03:08
model better over time and essentially
00:03:10
the central idea is reinforcement
00:03:12
learning okay and the core mechanism
00:03:14
ties these components together the model
00:03:17
which is the policy interacts with its
00:03:18
environment. Data flows from search
00:03:21
results into the learning process and
00:03:22
the improved policy is fed back into the
00:03:24
search, creating a continuous improvement
00:03:26
loop, and the diagram basically
00:03:28
emphasizes the cyclic nature of the
00:03:30
process search generates data for
00:03:32
learning, learning updates the policy, and
00:03:33
so on. So if we want to actually
00:03:36
understand how this works we have to
00:03:37
actually understand the policy so this
00:03:39
is the basics this is the foundation of
00:03:41
the model so imagine you're basically
00:03:43
teaching someone to play a complex game
00:03:45
like chess you wouldn't throw them into
00:03:47
a match against a Grandmaster on their
00:03:49
first day right you'd start by teaching
00:03:51
them the basics how the pieces move
00:03:53
basic strategies and maybe some common
00:03:55
opening moves that's essentially what
00:03:57
policy initialization is for AI now in
00:04:00
the context of a powerful AI like o1
00:04:03
policy initialization is essentially
00:04:05
giving the AI a very strong
00:04:07
foundation in reasoning before it even
00:04:09
starts trying to solve really hard
00:04:11
problems it's about equipping it with a
00:04:13
basic set of skills and knowledge that
00:04:15
it can then build upon through
00:04:16
reinforcement learning the paper
00:04:18
suggests that for o1 this head start
00:04:20
likely comes in two main phases number
00:04:23
one the pre-training which we can see
00:04:25
here which is you know where you train
00:04:26
it on massive text data. Think of this
00:04:29
like letting the AI read the
00:04:31
entirety of the internet or at least a
00:04:34
huge chunk of it and by doing this the
00:04:36
AI learns how language works, how words
00:04:39
relate to each other and gains a vast
00:04:41
amount of general knowledge about the
00:04:43
world. Think of it like learning grammar,
00:04:45
vocabulary the basic facts before trying
00:04:47
to write a novel and it will also learn
00:04:49
basic reasoning abilities by training on
00:04:52
this data and then this is where we get
00:04:53
to the important bit which is where we
00:04:55
get the fine-tuning with instructions
00:04:57
and human-like reasoning, and this is
00:04:59
where we actually give the AI more
00:05:01
specific lessons on how to reason and
00:05:03
solve problems and this involves two key
00:05:05
techniques which we can see right here
00:05:07
prompt engineering and supervised
00:05:09
fine-tuning so prompt engineering is
00:05:11
where essentially you know you give the
00:05:13
AI carefully crafted instructions or
00:05:16
examples to guide its behavior, and the
00:05:18
paper mentions behaviors like problem
00:05:19
analysis which is where you restate the
00:05:22
problem to make sure it's understood
00:05:23
task decomposition like breaking down a
00:05:26
complex problem into smaller easier
00:05:28
steps which is where you literally say
00:05:29
you know, "first, think step by step," and of
00:05:31
course with supervised fine-tuning, which
00:05:33
is right here (SFT). This involves training
00:05:36
the AI on examples of humans solving
00:05:38
problems like basically showing it the
00:05:40
right way to think and reason it could
00:05:42
involve showing it examples of experts
00:05:45
explaining their thought process step by
00:05:46
step.
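As a rough sketch of what those two ingredients might look like in practice, here is a toy example in Python. The prompt template and the training record are made up for illustration; real supervised fine-tuning data would pair prompts with expert-written, step-by-step solutions at a much larger scale.

```python
from dataclasses import dataclass

# 1) Prompt engineering: instructions that elicit problem analysis and
#    task decomposition before the final answer (template text is invented).
REASONING_PROMPT = (
    "Restate the problem in your own words.\n"
    "Break it into smaller steps and think step by step.\n"
    "Then give the final answer.\n\n"
    "Problem: {problem}"
)

# 2) Supervised fine-tuning: (prompt, expert reasoning) pairs the model imitates.
@dataclass
class SFTExample:
    prompt: str
    target: str   # expert chain of reasoning ending in the answer

example = SFTExample(
    prompt=REASONING_PROMPT.format(problem="What is 17 * 24?"),
    target=(
        "Restating: I need the product of 17 and 24.\n"
        "Step 1: 17 * 20 = 340.\n"
        "Step 2: 17 * 4 = 68.\n"
        "Step 3: 340 + 68 = 408.\n"
        "Final answer: 408."
    ),
)

print(example.prompt)
print("---")
print(example.target)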
00:05:48
So, in a nutshell, policy initialization is about giving the AI a
00:05:50
solid foundation in language, knowledge,
00:05:52
and basic reasoning skills, setting
00:05:55
it up for success in the later
00:05:56
stages of learning and problem solving
00:05:58
and this phase of o1 is essentially
00:06:00
crucial for developing human-like
00:06:02
reasoning behaviors in AI, enabling it
00:06:04
to think systematically and explore
00:06:06
solution spaces efficiently. Next we get
00:06:08
to something super interesting this is
00:06:11
where we get to reward design so this
00:06:13
image that you can see on the screen
00:06:15
illustrates two types of reward systems
00:06:18
used in reinforcement learning outcome
00:06:21
reward modeling, which is ORM over here,
00:06:23
and then we've got process reward
00:06:24
modeling which is PRM now as for the
00:06:27
explanation it's actually pretty
00:06:28
straightforward so outcome reward
00:06:31
modeling is something that only
00:06:33
evaluates the solution based on the
00:06:35
final result so if the final answer is
00:06:38
incorrect the entire solution is marked
00:06:40
as wrong even if these steps right here
00:06:43
or even if most steps are correct and in
00:06:45
this example there are some steps that
00:06:47
are actually correct but due to the fact
00:06:49
that the final output is incorrect the
00:06:51
entire thing is just marked as wrong but
00:06:54
this is where we actually use process
00:06:56
reward modeling which is much better so
00:06:58
with process reward modeling, this evaluates
00:07:01
each step in the solution individually
00:07:03
this is where we reward the correct
00:07:05
steps and we penalize the incorrect ones
00:07:08
and this one actually provides more
00:07:10
granular feedback which helps guide
00:07:12
improvements during training so we can
00:07:14
see that steps one two and three are
00:07:16
correct and then they receive the
00:07:17
rewards and steps four and five are
00:07:19
incorrect and are thus flagged as
00:07:21
errors and this approach is far better
00:07:24
because it pinpoints the exact errors in
00:07:26
the process rather than discarding the
00:07:28
entire solution. And this diagram
00:07:30
basically emphasizes the importance of
00:07:32
process rewards in tasks that involve
00:07:34
multi-step reasoning as it allows for
00:07:36
iterative improvements and better
00:07:38
learning outcomes, which is essentially
00:07:40
what they believe o1 is using.
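Here is a tiny Python sketch of the difference between the two reward schemes. The solution is just a list of made-up booleans marking whether each step is correct; it only illustrates why process rewards give more granular feedback than outcome rewards.

```python
from typing import List

def outcome_reward(steps: List[bool]) -> float:
    # ORM: only the final result matters; one score for the whole solution.
    return 1.0 if steps[-1] else 0.0

def process_reward(steps: List[bool]) -> List[float]:
    # PRM: every step is scored individually, so errors can be pinpointed.
    return [1.0 if ok else -1.0 for ok in steps]

solution = [True, True, True, False, False]   # steps 1-3 correct, 4-5 wrong

print(outcome_reward(solution))   # 0.0 -> the whole solution is "wrong"
print(process_reward(solution))   # [1.0, 1.0, 1.0, -1.0, -1.0] -> granular feedback
```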
00:07:43
Now this is where we get into the really
00:07:45
interesting thing because this is where
00:07:47
we get to search and many have heralded
00:07:49
search as the thing that could take us
00:07:51
to superintelligence. In fact, I did
00:07:53
recently see a tweet that stated just
00:07:55
that; I'm sure I'll manage to add it on
00:07:57
screen so when we decide to break this
00:07:59
down this is essentially where we have
00:08:01
the AI thinking so you know when you
00:08:03
have a powerful AI like o1, it needs time
00:08:06
to think to explore different
00:08:08
possibilities and find the best solution
00:08:10
this thinking process is what the paper
00:08:12
refers to as search. So thinking more is
00:08:15
where they say that you know one way you
00:08:17
could improve the performance is by
00:08:20
thinking more during inference, which
00:08:22
means that instead of just generating
00:08:23
one answer it explores multiple possible
00:08:26
solutions before picking the best one so
00:08:28
you know let's say you think about
00:08:30
writing an essay you don't just write
00:08:31
the first draft and submit it right you
00:08:33
brainstorm ideas you write multiple
00:08:35
drafts you revise and edit until you're
00:08:37
happy with the final product and that is
00:08:39
essentially a form of search too. So there
00:08:43
are two main strategies that are in the
00:08:46
search area and the paper highlights
00:08:47
these strategies that o1 might be using
00:08:50
for this thinking process so coming in
00:08:52
at number one we have the tree search so
00:08:55
imagine a branching tree where each branch
00:08:58
represents a different choice or you
00:09:00
know action that the AI could
00:09:02
potentially take. Tree search is like, you
00:09:04
know exploring the tree following
00:09:06
different paths to see where they lead
00:09:08
for example in a game of chess an AI
00:09:10
might consider all the possible moves
00:09:11
that it could make then all the possible
00:09:13
responses its opponent could make and
00:09:15
then build on this tree of possibilities
00:09:18
and then it uses a certain kind of
00:09:20
criteria to decide which branch to
00:09:22
explore further and which to prune
00:09:25
focusing on the most promising path
00:09:27
basically thinking about where you're
00:09:28
going to go what decisions you're going
00:09:30
to make and which one yields the best
00:09:32
rewards it's kind of like a gardener
00:09:34
selectively trimming branches to help a
00:09:36
tree grow in the right direction a
00:09:38
simple example of this is best-of-N
00:09:39
sampling, where the model generates N
00:09:41
possible solutions and then picks the
00:09:43
best one based on some kind of criteria
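A minimal best-of-N sketch in Python, with stand-ins for both pieces: the candidate generator is just a noisy guess at a toy arithmetic answer, and the scoring criterion is a hand-written reward. In a real system the candidates would be sampled from the model and ranked by a learned reward model.

```python
import random

def generate_candidate() -> int:
    # Hypothetical stand-in for sampling one solution from the model.
    return 6 * 7 + random.randint(-3, 3)

def score(answer: int) -> float:
    # Stand-in criterion: closer to the true answer (42) is better.
    return -abs(answer - 42)

def best_of_n(n: int) -> int:
    candidates = [generate_candidate() for _ in range(n)]
    return max(candidates, key=score)

print(best_of_n(8))   # usually 42: more samples, better odds of a good one
```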
00:09:45
now on the bottom right here this is
00:09:47
where we have sequential revisions this
00:09:50
is like writing that essay we talked
00:09:52
about earlier and the AI starts with an
00:09:54
initial attempt at a solution then
00:09:56
refines it step by step along the way
00:09:59
making improvements for example an AI
00:10:03
might generate an initial answer to a
00:10:06
math problem and then it might check its
00:10:08
work then identify the errors and then
00:10:12
revise its solution accordingly it's
00:10:14
kind of like editing your essay catching
00:10:16
the mistakes and then making it better
00:10:18
each time you review it.
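Here is a toy sequential-revision loop along those lines: start with a rough answer, check it, and revise until the check passes. The checker and the reviser are invented stand-ins, not anything from the paper.

```python
TARGET = 144   # the toy problem: find 12 squared

def check(answer: int) -> bool:
    return answer == TARGET

def revise(answer: int) -> int:
    # Move the answer toward the correct result, like fixing one mistake per pass.
    return answer + (1 if answer < TARGET else -1)

answer = 100   # initial (wrong) attempt
revisions = 0
while not check(answer) and revisions < 200:
    answer = revise(answer)
    revisions += 1

print(answer, "after", revisions, "revisions")   # 144 after 44 revisions
```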
00:10:20
So you also have to think about, you know, how
00:10:22
does the AI decide which paths to
00:10:25
explore in the tree search or how to
00:10:27
even, you know, revise the solution in
00:10:30
sequential revision so the paper
00:10:32
mentions two types of guidance so we
00:10:34
have internal guidance and this is where
00:10:37
you've got the AI using its own internal
00:10:39
knowledge and calculations to guide its
00:10:42
search and one example is of course you
00:10:44
know model uncertainty and this is where
00:10:47
the model can actually estimate how
00:10:49
confident it is in certain parts of its
00:10:52
solution it might focus on areas where
00:10:54
it's less certain, exploring alternatives
00:10:57
or making revisions it's kind of like
00:10:58
double checking your work when you're
00:11:00
not really sure if you've made a mistake
00:11:03
another example of this is of course you
00:11:04
know self-evaluation. This is where you
00:11:07
know the AI can be trained to assess its
00:11:09
own work identifying potential errors or
00:11:12
areas for improvement it's kind of like
00:11:13
having an internal editor that reviews
00:11:16
your writing and suggests changes.
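A small sketch of that internal-guidance idea: the model attaches a confidence score to each step of its own solution, and the least confident steps are the ones flagged for another look. The steps and the confidence numbers here are made up for illustration.

```python
from typing import List, Tuple

solution: List[Tuple[str, float]] = [
    ("Restate the problem", 0.95),
    ("Set up the equation", 0.90),
    ("Solve for x",         0.40),   # the model is unsure about this step
    ("State the answer",    0.55),
]

CONFIDENCE_THRESHOLD = 0.6

# Focus revision effort on the steps the model is least confident about.
to_revisit = [text for text, conf in solution if conf < CONFIDENCE_THRESHOLD]
print("Steps to double-check:", to_revisit)
```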
00:11:18
Then we've got external guidance, and this is
00:11:21
like getting feedback from the outside
00:11:22
world to guide the search so one example
00:11:25
is environmental feedback which is where
00:11:27
in some cases AI can interact with a
00:11:30
real or simulated environment and get
00:11:32
feedback on its actions for example a
00:11:35
robot learning to navigate a maze might
00:11:38
get feedback on whether it's moving
00:11:39
closer to or farther from the goal and
00:11:42
another example of this is using a
00:11:44
reward model which we discussed earlier
00:11:46
the reward model can provide feedback on
00:11:48
the quality of different solutions or
00:11:51
actions guiding the AI towards better
00:11:53
outcomes it's kind of like having a
00:11:56
teacher who grades your work and tells
00:11:58
you what you did well and tells you
00:11:59
where you need to improve.
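And here is a toy version of external guidance in the spirit of that maze example: an agent on a number line is only told whether each move brought it closer to or farther from the goal, and that outside signal alone is enough to correct its behaviour. The goal position and the movement rule are invented for this sketch.

```python
GOAL = 7
position = 0

def feedback(old: int, new: int) -> str:
    # The environment's signal: did the move bring us closer or farther?
    return "closer" if abs(GOAL - new) < abs(GOAL - old) else "farther"

direction = -1                     # initial guess points the wrong way
for step in range(20):
    if position == GOAL:
        break
    new_position = position + direction
    if feedback(position, new_position) == "farther":
        direction = -direction     # external feedback corrects the behaviour
        new_position = position + direction
    position = new_position

print("reached", position, "in", step, "steps")
```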
00:12:01
In essence, search, the process by which
00:12:04
o1 explores different possibilities and
00:12:06
refines its solutions, is guided by both
00:12:08
its internal knowledge and its external
00:12:10
feedback and this is a crucial part of
00:12:12
what makes o1 so good at complex
00:12:15
reasoning tasks so of course search is
00:12:18
how the AI thinks about a problem but
00:12:20
how does it actually get better at
00:12:21
solving problems over time this is where
00:12:24
learning comes in so the paper suggests
00:12:26
that o1 uses a powerful technique called
00:12:29
reinforcement learning to improve its
00:12:31
performance so search generates the
00:12:34
training data so remember how we talked
00:12:35
about search generating multiple
00:12:37
possible solutions? Well, those solutions,
00:12:39
along with the feedback from internal or
00:12:42
external guidance,
00:12:44
become valuable training data for the AI
00:12:47
think of it like a student practicing
00:12:49
for an exam they might try and solve
00:12:51
many different practice problems getting
00:12:53
feedback on their answers and learning
00:12:55
from their mistakes each attempt whether
00:12:57
successful or not provides valuable
00:12:58
information
00:12:59
that actually helps them learn and
00:13:01
improve now we've got two main learning
00:13:04
methods and the paper focuses on two
00:13:06
main methods that o1 might be using
00:13:08
to learn from this search-
00:13:10
generated data. Number one is policy
00:13:12
gradient methods like PPO, and these
00:13:15
methods are a little bit more complex
00:13:17
but the basic idea is that the AI
00:13:18
adjusts its internal policy which is the
00:13:21
strategy for choosing its actions based
00:13:23
on the rewards that it receives, and
00:13:25
actions that lead to high rewards are
00:13:27
made more likely while actions that lead
00:13:28
to low rewards are made less likely it's
00:13:31
kind of like fine-tuning the ai's
00:13:32
decision-making process based on its own
00:13:34
experiences. Then we've got PPO, which is
00:13:37
essentially proximal policy optimization
00:13:39
and is a popular policy gradient
00:13:42
method that is known for its stability
00:13:43
and efficiency it's like having a
00:13:45
careful and methodical way of updating
00:13:47
the AI's strategy, making sure it doesn't
00:13:50
change too drastically in its response
00:13:51
to any single experience.
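To ground the policy-gradient idea, here is a bare-bones REINFORCE-style update on a one-step problem with two actions, one of which pays off more often. All the numbers are illustrative, and PPO's clipped objective, the part that keeps each update small and stable, is deliberately left out to keep the sketch short.

```python
import math
import random

logits = [0.0, 0.0]                      # the "policy": preferences for actions 0 and 1
LEARNING_RATE = 0.1

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def reward(action: int) -> float:
    # Action 1 pays off 80% of the time, action 0 only 20% (toy numbers).
    return 1.0 if random.random() < (0.8 if action == 1 else 0.2) else 0.0

for _ in range(2000):
    probs = softmax(logits)
    action = 0 if random.random() < probs[0] else 1
    r = reward(action)
    baseline = 0.5                       # crude baseline to reduce variance
    # Gradient of log pi(action) with respect to the logits of a softmax policy:
    for i in range(2):
        grad = (1.0 if i == action else 0.0) - probs[i]
        logits[i] += LEARNING_RATE * (r - baseline) * grad

print(softmax(logits))                   # probability mass shifts toward action 1
```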
00:13:54
Then, of course, here we have behavior cloning, and this
00:13:56
is a simpler method where the AI learns
00:13:59
to mimic successful solutions. It's like
00:14:01
learning via imitation if the search
00:14:03
process finds a really good solution one
00:14:06
that gets a high reward the AI can learn
00:14:08
to copy that solution in similar
00:14:10
situations it's like a student learning
00:14:12
to solve a math problem by studying a
00:14:15
worked example and the paper suggests
00:14:17
that o1 might use behavior cloning to
00:14:20
learn from the very best Solutions found
00:14:22
during search effectively adding them to
00:14:24
its repertoire of successful strategies
00:14:27
or it could be used as an initial way to
00:14:30
warm up the model before using more
00:14:32
complex methods like PPO.
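Here is a rough sketch of that behavior-cloning step: run search, keep only the highest-reward solutions, and treat them as imitation targets. The problem, the generator, and the reward are toy stand-ins; in practice the kept solutions would become supervised fine-tuning examples.

```python
import random

def generate_solution() -> str:
    # Stand-in for sampling one candidate solution from the model.
    steps = random.randint(1, 5)
    return "step " + " -> step ".join(str(i + 1) for i in range(steps))

def reward(solution: str) -> float:
    # Toy reward: solutions with more worked-out steps score higher.
    return solution.count("step")

# Search phase: sample many candidate solutions.
candidates = [generate_solution() for _ in range(50)]

# Behavior-cloning phase: keep only the best ones to imitate later.
best_score = max(reward(c) for c in candidates)
imitation_set = [c for c in candidates if reward(c) == best_score]

print(len(imitation_set), "solutions kept as imitation targets")
print(imitation_set[0])
```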
00:14:34
Now, of course, we've got iterative search and learning,
00:14:36
and the real power of this approach
00:14:38
comes from combining search and of
00:14:40
course learning in an iterative loop. So
00:14:42
the AI searches for solutions, learns
00:14:45
from the results, then uses its improved
00:14:47
knowledge to conduct even better
00:14:49
searches in the future it's like a
00:14:51
continuous cycle of practice feedback
00:14:53
and Improvement and the paper suggests
00:14:55
that this iterative process is key to
00:14:59
o1's ability to achieve superhuman
00:15:01
performance on certain tasks by
00:15:03
continuously searching and learning the
00:15:05
AI can surpass the limitations of its
00:15:07
initial training data, potentially
00:15:09
discovering new and better solutions that
00:15:12
humans haven't thought of.
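Putting the pieces together, here is a toy version of that practice, feedback, and improvement cycle: each round the "model" samples candidates around its current best guess, a reward picks the winner, and the model updates toward it, so the next round starts from a better place. The hidden target and every parameter here are invented for illustration.

```python
import random

HIDDEN_TARGET = 73.0
current_guess = 0.0                    # what the model "knows" after initialization

def reward(x: float) -> float:
    # Closer to the hidden target is better.
    return -abs(HIDDEN_TARGET - x)

for round_number in range(1, 6):
    # Search: explore several candidates around the current policy.
    candidates = [current_guess + random.uniform(-20, 20) for _ in range(16)]
    best = max(candidates, key=reward)
    # Learning: move the policy toward the best solution found.
    current_guess += 0.7 * (best - current_guess)
    print(f"round {round_number}: best guess so far = {current_guess:.1f}")
```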
00:15:14
So with all that being said about how o1 works, and
00:15:17
now that you know the basics the four
00:15:18
key pillars do you guys think we are
00:15:20
close to superintelligence after
00:15:22
reading this research paper and
00:15:24
understanding the key granular details
00:15:25
about how o1 works, I think I really do
00:15:28
understand why the wider AI Community is
00:15:30
saying that superintelligence isn't
00:15:31
that far away if an AI can search for
00:15:34
solutions, then learn from those results,
00:15:36
and use that improved knowledge to
00:15:38
conduct even better searches in the
00:15:39
future having a continuous cycle of
00:15:41
practice, feedback, and improvement,
00:15:43
achieving superhuman performance would
00:15:45
be possible in theory so maybe
00:15:47
artificial superintelligence isn't that
00:15:49
far away with that being said I'd love
00:15:50
to know your thoughts and hopefully you
00:15:52
guys have a