00:00:20
Sam: Good morning! We have an exciting one for you today. We started this 12-day event 12 days ago with the launch of o1, our first reasoning model. It's been amazing to see what people are doing with it, and gratifying to hear how much people like it. We view this as the beginning of the next phase of AI, where you can use these models to do increasingly complex tasks that require a lot of reasoning. So for the last day of this event, we thought it would be fun to go from one frontier model to our next frontier model. Today we're going to talk about that next frontier model, which you would think, logically, should maybe be called o2. But out of respect to our friends at Telefónica, and in the grand tradition of OpenAI being really, truly bad at names, it's going to be called o3. Actually, we're going to announce, not launch, two models today: o3 and o3-mini. o3 is a very, very smart model; o3-mini is an incredibly smart model too, and really good on performance and cost. To get the bad news out of the way first: we're not going to publicly launch these today. The good news is that we're going to make them available for public safety testing starting today; you can apply, and we'll talk about that later. We've taken safety testing seriously as our models get more and more capable, and at this new level of capability we want to try adding a new part to our safety testing procedure, which is to allow public access for researchers who want to help us test. We'll talk more at the end about when we expect to make these models generally available. But we're so excited to show you what they can do and talk about their performance. We've got a little surprise, and we'll show you some demos. Without further ado, I'll hand it over to Mark to talk about it.

Mark: Cool, thank you so much, Sam.
00:01:55
Mark: My name is Mark, and I lead research at OpenAI. I want to talk a little bit about o3's capabilities. o3 is a really strong model on very hard technical benchmarks, and I want to start with the coding benchmarks, if you can bring those up. On software-style benchmarks we have SWE-bench Verified, a benchmark consisting of real-world software tasks. We're seeing that o3 performs at about 71.7% accuracy, which is over 20% better than our o1 models. This really signifies that we're climbing the frontier of utility as well. On competition code, we see that o1 achieves an ELO of about 1891 on the contest-coding site Codeforces. At our most aggressive high test-time compute settings, o3 is able to achieve almost a 2727 ELO.

Sam: Mark was a competitive programmer, and actually still coaches competitive programming; he's very, very good. What was your best?

Mark: I think my best on a comparable site was about 2500.

Sam: That's tough.

Mark: I will say, you know, this is also better than our chief scientist Jakub's score. I think there's one guy at OpenAI who's still at 3,000-something.

Sam: A few more months to enjoy, then.

Mark: Hopefully we have a couple of months to enjoy that.

Sam: Great. I mean, this model is incredible at programming.

Mark: Yeah, and not just programming but also mathematics. On competition math benchmarks, just as in competitive programming, we achieve very, very strong scores: o3 gets about 96.7% accuracy versus an o1 performance of 83.3% on AIME.

Sam: What's your best AIME score?

Mark: I did get a perfect score once, so I'm safe. But really, what this signifies is that o3 often misses just one question whenever we test it on this very hard feeder exam for the USA Mathematical Olympiad. There's another very tough benchmark called GPQA Diamond, which measures the model's performance on PhD-level science questions. Here we get another state-of-the-art number: 87.7%, which is about 10% better than our o1 performance, which was at 78%. To put this in perspective, an expert PhD typically gets about 70% in their own field of strength.
00:04:09
Mark: One thing that you might notice from some of these benchmarks is that we're reaching or nearing saturation on a lot of them. The last year has really highlighted the need for harder benchmarks to accurately assess where our frontier models lie, and I think a couple have emerged as fairly promising over the last months. One in particular I want to call out is Epoch AI's FrontierMath benchmark. You can see the scores look a lot lower than they did on the previous benchmarks we showed, and that's because this is considered today the toughest mathematical benchmark out there. It's a dataset of novel, unpublished, and very hard to extremely hard problems; even Terence Tao, you know, has said it would take professional mathematicians hours or even days to solve one of these problems. Today, all offerings out there have less than 2% accuracy on this benchmark, and we're seeing that o3, at aggressive test-time settings, is able to get over 25%.

Sam: Yeah, that's awesome.
00:05:14
Mark: In addition to Epoch AI's FrontierMath benchmark, we have one more surprise for you. I want to talk about the ARC benchmark at this point, and I would love to invite one of our friends, Greg, the president of the ARC Prize Foundation, to talk about this benchmark.

Greg: Wonderful. Sam and Mark, thank you very much for having us today.

Sam: Of course.

Greg: Hello everybody, my name is Greg Kamradt, and I'm the president of the ARC Prize Foundation. ARC Prize is a nonprofit with the mission of being a North Star towards AGI through enduring benchmarks. Our first benchmark, ARC-AGI, was developed in 2019 by François Chollet in his paper "On the Measure of Intelligence". However, it has been unbeaten for five years now, and in the AI world that feels like centuries. So the system that beats ARC-AGI is going to be an important milestone towards general intelligence, and I'm excited to say today that we have a new state-of-the-art score to announce. Before I get into that, though, I want to talk about what ARC-AGI is, so I would love to show you an example. ARC-AGI is all about having input examples and output examples, and the goal is to understand the rule of the transformation and guess the output. So, Sam, what do you think is happening here?
00:06:33
Sam: Probably putting a dark blue square in the empty space.

Greg: Yes, that is exactly it. It's easy for humans to intuitively guess the rule, but it's actually surprisingly hard for AI to understand what's going on.
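The rule Sam guessed for this first puzzle is simple enough to sketch in a few lines of Python. This is illustrative only: grids are lists of integers, and using 0 for "empty" and 1 for "dark blue" is an assumption for the sketch, not ARC-AGI's actual encoding.

```python
# Sketch of the first ARC-AGI example's rule, as guessed on stage:
# paint every empty cell dark blue. Grids are lists of lists of ints;
# 0 standing for "empty" and 1 for "dark blue" is an assumed palette.
def fill_empty(grid, fill=1):
    return [[fill if cell == 0 else cell for cell in row] for row in grid]
```

The point of ARC-AGI is that each task has a different hidden rule, so no single function like this generalizes; the solver has to infer the rule from the examples.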
00:06:49
Greg: I want to show one more, harder example. Now, Mark, I'm going to put you on the spot: what do you think is going on in this task?

Mark: Okay, so you take each of these yellow squares, you count the number of colored squares inside it, and you create a border with that count.

Greg: That is exactly it, and that was much quicker than most people, so congratulations. What's interesting, though, is that AI has not been able to solve this problem so far, even though we verified that a panel of humans could actually do it. The unique part about ARC-AGI is that every task requires distinct skills; what I mean by that is there won't be another task where you need to fill in the corners with blue squares. We do that on purpose, because we want to test the model's ability to learn new skills on the fly. We don't just want it to repeat what it has already memorized; that's the whole point here.
00:07:47
Greg: ARC-AGI version one took five years to go from 0% to 5% with leading frontier models. However, today I'm very excited to say that o3 has scored a new state-of-the-art score that we have verified. On low compute, o3 scored 75.7% on ARC-AGI's semi-private holdout set. This is extremely impressive, because it's within the compute requirements we have for our public leaderboard, and it's the new number-one entry on ARC-AGI-Pub. So congratulations on that.

Mark: Thank you so much.

Greg: Now, as a capabilities demonstration, when we asked o3 to think longer and actually ramped up to high compute, o3 was able to score 87.5% on the same hidden holdout set. This is especially important because human performance is comparable, at an 85% threshold, so being above this is a major milestone. We have never tested a system, or any model, that has done this before, so this is new territory in the ARC-AGI world. Congratulations on that.

Mark: And congratulations on making such a great benchmark.
00:09:01
Greg: When I look at these scores, I realize I need to switch my worldview a little bit; I need to fix my intuitions about what AI can actually do and what it's capable of, especially in this o3 world. But the work is not over yet, and these are still the early days of AI, so we need more enduring benchmarks like ARC-AGI to help measure and guide progress. I'm excited to accelerate that progress, and I'm excited to partner with OpenAI next year to develop our next frontier benchmark.

Mark: Amazing. You know, it's also a benchmark that we've been targeting and that has been on our minds for a very long time, so we're excited to work with you in the future.

Sam: Worth mentioning: we think it's an awesome benchmark, but we didn't go do anything specific for it; this is just, you know, the general o3. But we really appreciate the partnership, and this was a fun one to do.

Greg: Absolutely. And even though o3 has done so well, ARC Prize will continue in 2025, and anybody can find out more at arcprize.org.

Sam: Great, thank you so much.

Greg: Absolutely.
00:09:59
Sam: Okay, next up we're going to talk about o3-mini, which is a thing that we're really, really excited about, and Hongyu, who trained the model, will come out and join us.

Hongyu: Hi everyone, I'm Hongyu, an OpenAI researcher working on reasoning. This September we released o1-mini, an efficient reasoning model in the o1 family that's really capable at math and coding, probably among the best in the world given its low cost. Now, together with o3, I'm very happy to tell you more about o3-mini, a brand-new model in the o3 family that truly defines a new cost-efficient reasoning frontier.

Sam: It's incredible.

Hongyu: Though it's not available to our users today, we are opening access to the model to our safety and security researchers to test it out. Following the release of adaptive thinking time in the API a couple of days ago, o3-mini will support three different options, low, medium, and high reasoning effort, so that users can freely adjust the thinking time based on their use cases. For example, we may want the model to think longer on more complicated problems and for less time on simpler ones.
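The low/medium/high options described here might surface in an API request shaped roughly like the sketch below. The `reasoning_effort` parameter name and the request layout are assumptions for illustration, not the announced interface.

```python
VALID_EFFORTS = ("low", "medium", "high")

def build_request(prompt: str, effort: str = "medium") -> dict:
    """Build a chat request with a selectable reasoning effort.

    The `reasoning_effort` field mirrors the low/medium/high options
    described on stage, but the exact parameter name and request shape
    here are assumed for illustration."""
    if effort not in VALID_EFFORTS:
        raise ValueError(f"effort must be one of {VALID_EFFORTS}")
    return {
        "model": "o3-mini",
        "reasoning_effort": effort,
        "messages": [{"role": "user", "content": prompt}],
    }
```

The idea is that the same model serves all three tiers; only the thinking-time budget changes per request.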
00:11:24
Hongyu: With that, I'm happy to show the first set of evals for o3-mini. On the left-hand side we show the coding eval, Codeforces ELO, which measures how good a programmer is; higher is better. As you can see on the plot, with more thinking time o3-mini achieves an increasing ELO, all settings outperforming o1-mini, and with medium thinking time it performs even better than o1.

Sam: So for an order of magnitude more speed and lower cost, we can deliver the same coding performance on this eval, or even better.

Hongyu: Right. And although o3-mini high is still a couple hundred points away from Mark, it's not far.

Mark: It's better than me, probably.

Sam: It's just an incredible cost-to-performance gain over what we've been able to offer with o1, and we think people will really love this.

Hongyu: I hope so. On the right-hand plot we show the estimated cost versus Codeforces ELO trade-off, and it's pretty clear that o3-mini defines a new cost-efficient reasoning frontier: it achieves better performance than o1 at a fraction of the cost.

Sam: Amazing.
00:12:36
Hongyu: With that being said, I would like to do a live demo of o3-mini, and hopefully test out all three thinking-time settings of the model. Let me paste the prompt. I'm testing o3-mini high first, and the task is to ask the model to use Python to implement a code generator and executor. If I run the resulting Python script, it will launch a server locally with a UI that contains a text box. We can then make coding requests in the text box; the request is sent to the o3-mini API, which solves the task and returns a piece of code. The script then saves that code locally on my desktop and opens a terminal to execute it automatically. So it's a fairly complicated prompt, and it outputs a big chunk of code.
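The demo's pipeline (model writes code, code is saved locally, then executed in a fresh process) can be sketched as follows. The model call is stubbed with a canned snippet; the real demo sends the prompt to the o3-mini API.

```python
import pathlib
import subprocess
import sys
import tempfile

def call_model(prompt: str) -> str:
    # Placeholder for the o3-mini API call: the real demo sends the coding
    # request to the model and gets a Python program back. Canned here.
    return 'print("OpenAI", 41)'

def generate_and_run(prompt: str) -> str:
    code = call_model(prompt)                        # 1. model writes the code
    path = pathlib.Path(tempfile.mkdtemp()) / "generated.py"
    path.write_text(code)                            # 2. save it locally
    result = subprocess.run(                         # 3. execute in a new process
        [sys.executable, str(path)], capture_output=True, text=True
    )
    return result.stdout.strip()
```

Running untrusted generated code in a subprocess like this is only a demo convenience; a real deployment would sandbox it.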
00:13:57
Hongyu: So if we copy the code, paste it into our server file, and launch the server, we should get a text box.

Sam: It seems to be launching something.

Hongyu: Great, we have a UI where we can enter some coding prompts. Let's try a simple one, like "print OpenAI and a random number", and submit. It's sending the request to o3-mini medium, so it should be pretty fast. There on the terminal: 41, that's the magic number, right? It saved the generated code to a local script on the desktop and printed out "OpenAI" and 41. Is there any other task you guys want to test?

Mark: I wonder if you could get it to run its own GPQA numbers.
00:14:56
Hongyu: That's a great ask, and just what I expected; we practiced a lot yesterday. Okay, so now let me copy the prompt and send it in the UI. In this task we ask the model to evaluate o3-mini with low reasoning effort on the hard GPQA dataset. The model needs to first download the raw file from a URL, then figure out which part is the question, which part is the answer, and which part is the options; then formulate all the questions, ask the model to answer them, parse the results, and grade them.
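The evaluation loop described here (download, parse questions/options/answers, query the model, grade) boils down to a small harness like this sketch. The CSV column names and the tiny stand-in dataset are made up for illustration; GPQA's real file is larger and has a different schema.

```python
import csv
import io

# Tiny stand-in dataset; the real GPQA file has a different schema.
CSV_TEXT = """question,options,answer
Which gas do plants absorb?,A) O2;B) CO2;C) N2,B
What is 2+2?,A) 3;B) 4;C) 5,B
"""

def evaluate(csv_text, ask):
    # ask(question, options) -> the model's chosen letter; in the demo this
    # would be an o3-mini API call, here it is whatever callable you pass in.
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    correct = sum(ask(r["question"], r["options"]) == r["answer"] for r in rows)
    return correct / len(rows)
```

The model in the demo wrote exactly this kind of harness on the fly, with itself as the `ask` function.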
00:15:49
Sam: That's actually blazingly fast.

Hongyu: Yeah, it's really fast because it's calling o3-mini with low reasoning effort. Let's see how it goes. I guess two of the tasks are really hard here, the long tail of the problems. GPQA is a hard dataset.

Sam: For it, it's like maybe 196 easy problems and two pretty hard ones.

Mark: While we're waiting for this, do you want to show what the request was again?

Hongyu: Oh, it actually returned the results: 61.6% with the low-reasoning-effort model. That's pretty fast: the full evaluation finished in about a minute.

Sam: Somehow it's very cool to just ask a model to evaluate itself like this.

Hongyu: Exactly. To summarize what we just did: we asked the model to write a script to evaluate itself on this hard GPQA dataset, from a UI, a code generator and executor, created by the model itself.

Mark: Next year we're going to bring you on, and you're going to have to ask the model to improve it.

Hongyu: Yeah, let's definitely ask the model to improve it next time. Or maybe not.
00:17:10
Hongyu: Besides Codeforces and GPQA, the model is also a pretty good math model. We show on this plot, for the AIME 2024 dataset, that o3-mini low achieves performance comparable to o1-mini, and o3-mini medium achieves comparable or better performance than o1, looking at the solid bars, which are pass@1; and we can push the performance further with o3-mini high. On the right-hand plot, measuring latency on anonymized o1-preview traffic, we show that o3-mini low drastically reduces the latency of o1-mini, almost achieving latency comparable to GPT-4o, under a second, so it's basically an instant response. And o3-mini medium is about half the latency of o1.
00:18:06
Hongyu: And here's another set of evals I'm even more excited to show you: API features. We get a lot of requests from our developer community to support function calling, structured outputs, and developer messages on our mini models, and o3-mini will support all of these features, the same as o1. Notably, it achieves comparable or better performance than GPT-4o on most of these evals, providing a more cost-effective solution for our developers.

Sam: Cool.
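On the developer side, function calling ultimately needs a dispatcher that routes a model-issued tool call to local code. A minimal sketch follows; the call format only loosely mirrors common chat-API conventions and is not the exact schema.

```python
import json

# Local functions the model is allowed to call; "add" is a toy example.
TOOLS = {"add": lambda a, b: a + b}

def dispatch(tool_call, tools=TOOLS):
    # tool_call: {"name": ..., "arguments": "<json string>"}, loosely
    # mirroring chat-API function-calling conventions (illustrative only).
    fn = tools[tool_call["name"]]
    return fn(**json.loads(tool_call["arguments"]))
```

The model proposes the call; the developer's code validates and executes it, then feeds the result back into the conversation.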
00:18:43
Hongyu: And if we actually unveil the true GPQA Diamond performance that I ran a couple of days ago, o3-mini low is actually at 62%, which basically matches what we got by asking the model to evaluate itself.

Mark: Next time you should just ask the model to do the evaluation automatically instead of us.

Hongyu: Yeah. So with that, that's it for o3-mini, and I hope our users can have a much better experience early next year.

Sam: Fantastic work.

Mark: Really great work, thank you.
00:19:09
Mark: Cool. So I know you're excited to get this into your own hands. We're working very hard to post-train this model, to do some safety interventions on top of it, and we're doing a lot of internal safety testing right now. But something new we're doing this time is also opening up the model to external safety testing, starting today with o3-mini and eventually with o3. So how do you get early access as a safety researcher or a security researcher? You can go to our website, where you'll see a form like the one on the screen. Applications for this form are rolling and will close on January 10th, and we really invite you to apply. We're excited to see what kinds of things you explore with this and what kinds of jailbreaks and other issues you discover.
00:19:58
Mark: Great. One other thing I'm excited to talk about is a new report that we published, I think yesterday or today, that advances our safety program. It's a new technique called deliberative alignment. Typically, when we do safety training on top of our models, we're trying to learn a decision boundary between what's safe and what's unsafe, and usually that's done purely by showing examples: this is a safe prompt, this is an unsafe prompt. But we can now leverage the reasoning capabilities of our models to find a more accurate safety boundary. This technique takes a safety spec and allows the model to reason over a prompt and tell you whether it's safe or not. Often, within the reasoning, it will uncover that, hey, this user is trying to trick me, or they're expressing some hidden intent. So even if you try to cipher your prompts, the reasoning will often break that.
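The flow Mark describes (reason over a safety spec, then act on the verdict) can be caricatured in a few lines. Here the model's chain of thought is stubbed by a `reason` callback and the one-line spec is invented for the sketch; in deliberative alignment the model itself is trained to produce this reasoning.

```python
# Hypothetical one-line safety spec; the real spec is far more detailed.
SPEC = "Refuse requests that facilitate wrongdoing; answer benign requests."

def deliberative_gate(prompt, reason):
    # reason(spec, prompt) stands in for the model's chain of thought and
    # must end with a verdict token, SAFE or UNSAFE. In deliberative
    # alignment the model itself reasons over the spec before answering.
    verdict = reason(SPEC, prompt).split()[-1]
    return "answer" if verdict == "SAFE" else "refuse"
```

The key difference from example-only safety training is that the spec is an input the model reasons about, not just a pattern it has memorized.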
00:20:56
Mark: The primary result is in the figure shown here. We have our performance on a rejection benchmark on the x-axis and on over-refusals on the y-axis, and here up and to the right is better: it reflects both our ability to accurately tell when we should reject something and our ability to tell when we should not refuse. Typically you think of these two metrics as having some sort of trade-off, and it's really hard to do well on both.

Sam: It is really hard to do.

Mark: But it seems that with deliberative alignment we can get these two green points in the top right, whereas the red and blue points show the performance of our previous models. So we're really starting to leverage reasoning to get better safety.

Sam: I think this is a really great safety result. Fantastic.
00:21:42
Sam: Okay, so to sum this up: o3-mini and o3. Please apply, if you'd like, to help us safety test these models as an additional step. We plan to launch o3-mini around the end of January and the full o3 shortly after that; the more people who can help us safety test, the better we can make sure we hit that. So please check it out, and thanks for following along with us. This has been a lot of fun for us, and we hope you've enjoyed it too. Merry Christmas!

All: Merry Christmas!

[Applause]
00:22:11