So today actually marks a very historic moment for the AI community, as it is probably going to be regarded as the day AGI actually happened. Now, if you don't know why this is the case, it's because OpenAI today released, or announced I guess you could say, their new o3 model, which is the second iteration of their o1 series, the model that thinks for a long time. Now, if you don't understand why this is potentially AGI, it's because the new system managed to surpass human performance on the ARC benchmark. The reason the ARC benchmark is such an important benchmark is that it is resistant to memorization.
Sure, so ARC is intended as a kind of IQ test for machine intelligence, and what makes it different from most benchmarks out there is that it's designed to be resistant to memorization. If you look at the way LLMs work, they're basically this big interpolative memory, and the way you scale up their capabilities is by trying to cram as much knowledge and as many patterns as possible into them. By contrast, ARC does not require a lot of knowledge at all. It's designed to only require what's known as core knowledge, which is basic knowledge about things like elementary physics, objectness, counting, that sort of thing, the sort of knowledge that any four-year-old or five-year-old possesses. But what's interesting is that each puzzle in ARC is novel, something that you've probably not encountered before, even if you've memorized the entire internet.
Now, if you want to know what ARC actually looks like, in terms of this test that humans are so easily able to pass but, you know, these AI systems currently aren't, you can take a look at the examples right here in this video.
In this example here, ARC-AGI is all about having input examples and output examples. "Well, they're good, they're good." Okay, input examples and output examples. Now, the goal is that you want to understand the rule of the transformation and guess it on the output. So Sam, what do you think is happening here? "Probably putting a dark blue square in the empty space." Yes, that is exactly it. Now, it's easy for humans to intuitively guess what that is, but it's actually surprisingly hard for AI to understand what's going on. What's interesting, though, is that AI has not been able to get this problem thus far, even though we verified that a panel of humans could actually do it. Now, the unique part about ARC-AGI is that every task requires distinct skills, and what I mean by that is there won't be another task where you need to fill in the corners with blue squares. We do that on purpose, and the reason is that we want to test the model's ability to learn new skills on the fly; we don't just want it to repeat what it's already memorized. That's the whole point here.
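To make that demo concrete, here is a minimal sketch of how an ARC-style task could be represented and how a candidate rule would be checked against the training pairs. The grids, colours, and the fill rule below are made up for illustration; real ARC tasks ship as JSON with "train" and "test" input/output grid pairs.

```python
# A minimal, illustrative sketch of an ARC-style task (not a real ARC puzzle).
# Each grid is a 2-D list of integers, where each integer encodes a colour.
train_pairs = [
    {
        "input":  [[1, 1, 1],
                   [1, 0, 1],
                   [1, 1, 1]],
        "output": [[1, 1, 1],
                   [1, 2, 1],   # the empty (0) cell gets filled with colour 2
                   [1, 1, 1]],
    },
]

def candidate_rule(grid, fill_colour=2):
    """Hypothesised transformation: fill every empty (0) cell with fill_colour."""
    return [[fill_colour if cell == 0 else cell for cell in row] for row in grid]

# A solver only earns credit if its hypothesised rule reproduces every training
# output and then generalises to the held-out test input.
assert all(candidate_rule(pair["input"]) == pair["output"] for pair in train_pairs)
print(candidate_rule([[0, 1], [1, 1]]))  # apply the same rule to an unseen grid
```

The point of the benchmark is that the next task will need a completely different rule, so memorizing this one buys the model nothing.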
Now, ARC-AGI-1 took five years to go from 0% to 5% with leading frontier models. However, today I'm very excited to say that o3 has scored a new state-of-the-art score that we have verified. On low compute, o3 has scored 75.7% on the ARC-AGI semi-private holdout set. This is extremely impressive because it is within the compute requirements that we have for our public leaderboard, and it is the new number-one entry on ARC-AGI-Pub, so congratulations on that.
Now, I know those of you outside the AI community might not think this is a big deal, but it really is, because it's something we've been trying to solve for, I think, around five years now. This is a benchmark that many would have heralded as the gold standard for AI, and it of course marks the first time we've actually managed to get a system that can outperform humans at a task that AI systems have traditionally failed at.
What was interesting was that they had two versions: o3 with low tuning and o3 with high tuning. o3 with low tuning is the low-reasoning-effort setting; this is the model operating with minimal computational effort, optimized for speed and cost efficiency, and it's suitable for simpler tasks where deep reasoning is not required, you know, basic coding and straightforward tasks. Then of course you have the high-tuned one, where the model takes more time and resources to analyze and solve problems; this is optimized for performance on complex tasks requiring deeper reasoning or multi-step problem solving. And what we can see here is that when we actually tune the model to think for longer, it manages to surpass where humans currently are.
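As a rough illustration of that low-versus-high distinction, here is a hedged sketch of what requesting different reasoning effort might look like through an OpenAI-style Python client. The "o3" model identifier and the availability of a reasoning_effort knob for it are assumptions here, not confirmed details of the released product.

```python
# Hedged sketch only: the model name "o3" and its support for a
# "reasoning_effort" parameter are assumptions, not confirmed API details.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(prompt: str, effort: str):
    # effort is expected to be "low" or "high": low trades reasoning depth
    # for speed and cost, high lets the model spend more test-time compute.
    return client.chat.completions.create(
        model="o3",                # assumed identifier
        reasoning_effort=effort,   # assumed to accept "low" / "medium" / "high"
        messages=[{"role": "user", "content": prompt}],
    )

cheap = ask("Reverse the string 'benchmark'.", effort="low")
deep  = ask("Prove that the square root of 2 is irrational.", effort="high")
```

The same prompt goes in either way; what changes is how much thinking the model is allowed to do before answering, and therefore how much it costs.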
Now, what's crazy about this is that the people who created the ARC-AGI benchmark said that o3's performance on ARC-AGI highlights a genuine breakthrough in adaptation to novelty; this is not incremental progress, we are in new territory. So they start to ask: is this AGI? o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence. And the guy you saw at the beginning, François Chollet, actually spoke about how he doesn't believe this is exactly AGI, but it does represent a big milestone on the way towards AGI. He says there's still a fair number of easy ARC-AGI-1 tasks that o3 can't solve, and there are early indications that ARC-AGI-2 will remain extremely challenging for o3.
He's stating that this shows it's feasible to create unsaturated, interesting benchmarks that are easy for humans yet impossible for AI, without involving specialist knowledge. And he now states, which some people could argue is moving the goalposts, that we will have AGI when creating such evals becomes outright impossible. But this contrasts a little with what he said earlier this year; take a look at what he said six months ago about the benchmark surpassing 80%, which it did today.
Let me turn the question around to you: suppose that in a year a multimodal model can solve ARC, let's say get 80%, whatever the average human would get, then AGI? Quite possibly, yes. Honestly, what I would like to see is an LLM-type model solving ARC at like 80%, but after having only been trained on core-knowledge-related stuff.
Now, one of the limitations of the model is actually the compute cost. You can see right here that he asks: does this mean the ARC Prize competition is beaten? He says no, the ARC Prize competition targets the fully private dataset, which is a different and somewhat harder evaluation, and the ARC Prize is of course the one where your solutions must run with a fixed amount of compute, which is about 10 cents per task. The reason that is really interesting is because, I'm not sure if you guys have seen this, but if we take a look down the bottom here, you can see that for o3 high-tuned, the amount of compute being put into the model means it cost over $1,000 per task, which is pretty expensive if you're trying to use that AI for anything at all. "I think these models that search over many different solutions are going to be really expensive." And we can see that reflected here, with this one being over $1,000 per task, which is ridiculously expensive when you think about using it to perform any kind of task.
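To see why searching over many candidate solutions gets so expensive, here is a back-of-the-envelope estimate. Every number in it is an assumption chosen only to illustrate how sample count, chain length, and token price multiply into a per-task bill in the ballpark quoted above; it is not ARC Prize's or OpenAI's actual accounting.

```python
# Illustrative per-task cost estimate. All three inputs are assumptions.
samples_per_task    = 1024       # assumed number of candidate solutions sampled
tokens_per_sample   = 55_000     # assumed reasoning + answer tokens per sample
price_per_1m_tokens = 20.00      # assumed blended $ price per million tokens

cost_per_task = samples_per_task * tokens_per_sample / 1_000_000 * price_per_1m_tokens
print(f"~${cost_per_task:,.0f} per task")   # ~$1,126 with these assumed numbers
```

Shrink any one of those three factors and the bill drops proportionally, which is why cheaper inference and smaller distilled models are the obvious path down from here.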
Of course, as we've seen with AI, costs will eventually come down, so the fact that they've managed to beat the benchmark at all is the main thing. We can draw similarities with how technology was pretty bulky in the early days, you know, the big TVs and the really bulky phones, but over time you find ways to become more and more efficient, and eventually you do things faster and, of course, cheaper, so it's quite likely this will happen in AI too. And what I do find crazy about this is that, you know, two years ago he said that the ARC-AGI benchmark being fully solved was not going to happen within the next eight years, and for 70% he said hopefully less than eight years, perhaps four or five. Of course, you can see AI managing to speed past most people's predictions.
We also got the fact that it did very well on SWE-bench, which is of course a very hard software engineering benchmark, and I'm guessing that if you are a software engineer this is probably not the best news for you. But I'm sure that with a bunch more people now coding, there's probably going to be a lot more demand for software engineers who actually understand the code that's being written. This is actually rather interesting, because one thing I realized when doing this video was that this model is effectively o2, not o3.
I just want to go on a quick tangent here, because whilst I was reading about this breakthrough with o3, one thing I needed to remember is that this is actually only the second iteration of the model. I think some people might be thinking that o3 is OpenAI's third iteration, but o2 is simply being skipped because of a naming conflict with O2, the British mobile service provider. So the fact that this is only the second iteration of the model does go to show that potentially o3, or even o4, is going to be quite a large jump and might reach benchmark saturation.
Now, if we also look at the math benchmarks and the PhD-level science benchmarks, we can see there is a decent improvement there as well. I do think, though, that this is sort of reaching the benchmark saturation area, because we can see that this one, on competition math, is around 96.7%, and this one is 87.7%. And like I said before, one of the things most people are starting to say is, okay, AI has slowed down because it's no longer improving at the same rate as before. What we have to understand is that as these benchmarks get up to 95% or so, the incremental gains are going to be harder and harder to reach, because, number one, you only have around 10% left to get, and number two, it's quite likely that 3 to 5% of all the questions are contaminated, meaning there are errors in those questions anyway, which means that 100% is simply not possible on certain benchmarks. That is why they decided to create the FrontierMath benchmark.
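As a quick illustration of how little headroom is left once a benchmark sits in the mid-90s, here is a tiny calculation. The 96.7% score is the one mentioned above; treating the lower-bound 3% contamination rate as an effective ceiling is an assumption made purely for illustration.

```python
# Sketch of why gains look small near benchmark saturation (assumed figures).
score        = 0.967                 # competition-math score quoted in the video
contaminated = 0.03                  # assumed share of flawed/unanswerable items
ceiling      = 1.0 - contaminated    # best score realistically reachable
remaining    = ceiling - score       # headroom left on this benchmark
print(f"effective ceiling ~{ceiling:.1%}, headroom left ~{remaining:.1%}")  # ~97.0%, ~0.3%
```

With only a fraction of a percent of answerable questions left, a flat-looking curve says more about the benchmark than about the model.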
At the time the benchmark was released, which I think was around two or three months ago, the current models could only do about 2% on these questions; you can see that in the lead were Gemini 1.5 Pro and Claude 3.5 Sonnet, and o1-preview and o1-mini were there as well. But for those of you who haven't realized just how good o3 is, this is a model that gets 25%. That is something really incredible when it comes to research-level math, and you have to understand that this kind of math is super difficult and all of those questions are completely novel.
So the previous benchmarks we had recently aren't truly showing the capabilities. I mean, think about it like this: when you look at those benchmarks, you would say maybe the model has gotten, you know, 10% better overall, which is just not true, because if we look at the really hard benchmark, where it's genuinely solving unseen math problems, we can see a more than tenfold improvement over the previous state of the art, from around 2% to 25%, which is just absolutely incredible. So I think this is probably the most important image for people to take in, because you can't compare this kind of model, even though it does show a massive increase on those nearly saturated benchmarks, to this one, where it's putting up, you know, this insane level of scores.
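For the arithmetic behind that claim, here is the one-liner, using the approximate figures quoted in the video rather than exact decimals.

```python
# Relative jump on FrontierMath using the rounded figures quoted above.
previous_sota = 0.02   # ~2% for the best models at the benchmark's release
o3_score      = 0.25   # ~25% reported for o3
print(f"{o3_score / previous_sota:.1f}x the previous state of the art")  # ~12.5x
```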
Now, what's important to understand is that Noam Brown, who works on reasoning at OpenAI, said that o3 is going to continue on that trajectory. So for those of you thinking that maybe AI is slowing down, it is clear that that is not the case.
Now, interestingly enough, we also got Sam Altman in an interview talking about what he believes AGI to be, and like I said before, the definition is constantly shifting and constantly changing. It used to be a term that people used a lot, and it meant this really smart AI that was very far off in the future. As we get closer to it, I think it's become a less useful term; people use it to mean very different things. Some people use it to mean something that's not that different from o1, you know, and some people use it to mean true superintelligence, something smarter than all of humanity put together. We try now to use these different levels; we have a five-level framework, and we're on level two with agents now, sorry, with reasoning now. Rather than the binary of is it AGI or is it not, I think that became too coarse as we get closer. But I will say, by the end of next year, the end of '25, I expect we will have systems that can do truly astonishing cognitive tasks, where you'll use it and be like, that thing is smarter than me at a lot of hard problems.
Now, the only other thing here is that if you are a safety researcher, he says please consider applying to help test o3-mini and o3, because he's excited to get these out for general availability soon, and he is extremely proud of the work OpenAI has been doing in creating these amazing models. So it will be interesting to see, when these models are actually released, what people are going to do with them, what they're going to build, and of course what happens next. If you enjoyed this video, let me know your thoughts on whether this is AGI or not. I do think that that benchmark has been broken, and we're constantly finding new ways forward, so it will be interesting to see where things head next.