What is the new AI model being introduced?

The new AI models introduced are O3 and O3 Mini.

What are the capabilities of O3?

O3 exhibits high performance in coding, mathematics, and PhD-level science questions, surpassing previous models like O1.

How does O3 perform in coding benchmarks?

O3 achieves about 71.7% accuracy in coding benchmarks, significantly better than O1's performance.

What is the significance of the Arc AGI benchmark?

The Arc AGI benchmark evaluates AI's general intelligence through unique tasks requiring new skills, and O3 has achieved a new state-of-the-art score.

O3 Mini is a cost-efficient reasoning model in the O3 family, performing well in math and coding with various reasoning time options.

How can researchers access O3 and O3 Mini for testing?

Researchers can apply for early access to O3 and O3 Mini for safety testing by filling out a form on OpenAI's website.

When will O3 and O3 Mini be publicly available?

O3 Mini is expected to launch around the end of January, with O3 following shortly after.

What is deliberative alignment?

Deliberative alignment is a new safety training technique that uses reasoning capabilities to better define safe and unsafe prompts.

How does O3 Mini compare to O1 Mini?

O3 Mini offers better performance at a lower cost, with more efficient reasoning capabilities than O1 Mini.

OpenAI o3 震撼发布！Arc AGI 测试得分超越人类｜ OpenAI 12天「第12天」| 回到Axton

00:22:58

https://www.youtube.com/watch?v=-O90WJvN3vw

Summary

TLDROpenAI is announcing two new AI models, O3 and O3 Mini. O3 is a highly advanced model excelling in programming, math, and scientific tasks, achieving significant improvements over previous models. O3 Mini is introduced as a cost-efficient alternative, optimizing performance and reasoning capabilities. Although not publicly launched, these models are open for safety testing to researchers. A key feature of these models is their ability to perform well on challenging benchmarks like Arc AGI, highlighting leaps towards general intelligence. OpenAI emphasizes safety with these releases, leveraging new techniques like deliberative alignment to ensure better understanding of safety boundaries. Public availability is planned for the beginning of next year, with ongoing opportunities for researchers to assist in refining these models.

Takeaways

🚀 Introduction of two advanced AI models, O3 and O3 Mini.
🧠 O3 excels in complex reasoning tasks, including coding and math.
🔍 O3 achieves state-of-the-art results on tough benchmarks like Arc AGI.
📉 O3 Mini offers cost-efficient performance with flexible reasoning times.
🔒 OpenAI opens these models for safety testing to researchers.
📜 New technique, deliberative alignment, enhances model safety.
📆 Public release of O3 Mini expected by end of January.
👍 Tremendous progress in AI capabilities compared to previous models.
🔗 Opportunity for public contributions in SAF testing.
🎯 Focus on advancing AI towards general intelligence.

Timeline

00:00:00 - 00:05:00
The event began with the introduction of OpenAI's first reasoning model, O1, twelve days ago. The focus is on the next frontier of AI, O3, and its cost-effective counterpart, O3 mini. Both models aim to tackle complex tasks requiring high-level reasoning, although they're not launching publicly yet. Instead, public safety testing is open to researchers. A thorough safety testing approach is emphasized for these advanced models. Mark from OpenAI elaborates on O3's capabilities, showcasing its superior performance over previous models in coding and mathematical benchmarks.
00:05:00 - 00:10:00
The discussion highlights O3's remarkable abilities, outperforming its predecessors in competitive programming and advanced math tasks. It's noted that current benchmarks are being surpassed, underscoring the need for more challenging tests. The introduction of innovative benchmarks, such as Epic AI's Frontier math benchmark, demonstrates O3's superior capability by achieving over 25% accuracy in comparison to less than 2% from other models. The breakthrough on the Arc Prize Foundation's AGI benchmark is announced, showcasing O3's impressive 75.7% score, indicating progress toward general intelligence.
00:10:00 - 00:15:00
O3 mini, a companion to O3, is introduced as a highly cost-efficient reasoning model. By supporting multiple reasoning levels, it provides flexibility for different use cases. Preliminary evaluations demonstrate O3 mini's superior performance, particularly in coding, compared to earlier models. A live demo exemplifies its capability in automating complex coding tasks. Additional evaluations highlight O3 mini's advantage in terms of cost versus performance ratio, offering promising utility for developers needing efficient but powerful AI solutions.
00:15:00 - 00:22:58
The event wraps up with a focus on safety interventions and external testing for O3 mini and O3, urging researchers to apply for early access. A new report on 'deliberative alignment' is introduced, emphasizing enhanced safety through reasoning. This strategy aims to refine the decision boundary between safe and unsafe AI behavior, utilizing the model's reasoning to reliably detect malicious intent. OpenAI plans to launch O3 mini by January's end, followed by the full O3, highlighting a commitment to safety and performance. The audience is encouraged to participate in testing to aid in perfecting the models.

Mind Map

Video Q&A

What is the new AI model being introduced?
The new AI models introduced are O3 and O3 Mini.
What are the capabilities of O3?
O3 exhibits high performance in coding, mathematics, and PhD-level science questions, surpassing previous models like O1.
How does O3 perform in coding benchmarks?
O3 achieves about 71.7% accuracy in coding benchmarks, significantly better than O1's performance.
What is the significance of the Arc AGI benchmark?
The Arc AGI benchmark evaluates AI's general intelligence through unique tasks requiring new skills, and O3 has achieved a new state-of-the-art score.
What is O3 Mini?
O3 Mini is a cost-efficient reasoning model in the O3 family, performing well in math and coding with various reasoning time options.
How can researchers access O3 and O3 Mini for testing?
Researchers can apply for early access to O3 and O3 Mini for safety testing by filling out a form on OpenAI's website.
When will O3 and O3 Mini be publicly available?
O3 Mini is expected to launch around the end of January, with O3 following shortly after.
What is deliberative alignment?
Deliberative alignment is a new safety training technique that uses reasoning capabilities to better define safe and unsafe prompts.
How does O3 Mini compare to O1 Mini?
O3 Mini offers better performance at a lower cost, with more efficient reasoning capabilities than O1 Mini.

View more video summaries

Get instant access to free YouTube video summaries powered by AI!

Subtitles

Auto Scroll:

00:00:20
good morning we have an exciting one for
00:00:22
you today we started this 12-day event
00:00:24
12 days ago with the launch of 01 our
00:00:26
first reasoning model it's been amazing
00:00:28
to see what people are doing with that
00:00:30
gratifying to hear how much people like
00:00:31
it we view this as sort of the beginning
00:00:33
of the next phase of AI where you can
00:00:35
use these models to do increasingly
00:00:36
complex tasks they require a lot of
00:00:39
reasoning and so for the last day of
00:00:41
this event um we thought it would be fun
00:00:43
to go from one Frontier Model to our
00:00:45
next Frontier Model today we're going to
00:00:47
talk about that next Frontier Model um
00:00:50
which you would think logically maybe
00:00:52
should be called O2 um but out of
00:00:54
respect to our friends at telica and in
00:00:56
the grand tradition of open AI being
00:00:57
really truly bad at names it's going to
00:00:59
be called 03 actually we're going to
00:01:02
launch uh not launch we're going to
00:01:03
announce two models today 03 and O3 mini
00:01:06
03 is a very very smart model uh 03 mini
00:01:10
is an incredibly smart model but still
00:01:12
uh but a really good at performance and
00:01:14
cost so to get the bad news out of the
00:01:17
way first we're not going to publicly
00:01:18
launch these today um the good news is
00:01:21
we're going to make them available for
00:01:22
Public Safety testing starting today you
00:01:24
can apply and we'll talk about that
00:01:25
later we've taken safety Tes testing
00:01:28
seriously as our models get uh more and
00:01:30
more capable and at this new level of
00:01:32
capability we want to try adding a new
00:01:34
part of our safety testing procedure
00:01:36
which is to allow uh Public Access for
00:01:38
researchers that want to help us test
00:01:40
we'll talk more at the end about when
00:01:41
these models uh when we expect to make
00:01:43
these models generally available but
00:01:45
we're so excited uh to show you what
00:01:47
they can do to talk about their
00:01:48
performance got a little surprise we'll
00:01:50
show you some demos uh and without
00:01:52
further Ado I'll hand it over to Mark to
00:01:53
talk about it cool thank you so much Sam
00:01:55
so my name is Mark I lead research at
00:01:57
openai and I want to talk a little bit
00:01:59
about O3 capabilities now O3 is a really
00:02:01
strong model at very hard technical
00:02:03
benchmarks and I want to start with
00:02:05
coding benchmarks if you can bring those
00:02:07
up so on software style benchmarks we
00:02:10
have sweep bench verified which is a
00:02:13
benchmark consisting of real world
00:02:14
software tasks we're seeing that 03
00:02:17
performs at about
00:02:19
71.7% accuracy which is over 20% better
00:02:22
than our 01 models now this really
00:02:24
signifies that we're really climbing the
00:02:26
frontier of utility as well on
00:02:29
competition code we see that 01 achieves
00:02:32
an ELO on this contest coding site
00:02:34
called code forces about 1891 at our
00:02:37
most aggressive High test time compute
00:02:39
settings we're able to achieve almost
00:02:41
like a 2727 ELO here ju so Mark was a
00:02:44
competitive programmer actually still
00:02:46
coaches competitive programming very
00:02:48
very good what what is your I think my
00:02:50
best at a comparable site was about 2500
00:02:52
that's tough well I I will say you know
00:02:55
our chief scientist um this is also
00:02:57
better than our chief scientist yakov's
00:02:59
score I think there's one guy at opening
00:03:01
ey who's still like a 3,000 something
00:03:03
yeah a few more months to yeah enjoy
00:03:05
hopefully we have a couple months to
00:03:06
enjoy there great that's I mean this is
00:03:08
it's in this model is incredible at
00:03:10
programming yeah and not just program
00:03:13
but also mathematics so we see that on
00:03:15
competition math benchmarks just like
00:03:17
competitive programming we achieve very
00:03:19
very strong scores so 03 gets about
00:03:22
96.7% accuracy versus an 01 performance
00:03:25
of 83.3% on the Amy what's your best Amy
00:03:28
score I did get a perfect score once so
00:03:31
I'm safe but
00:03:33
yeah really what this signifies is that
00:03:35
03 um often just misses one question
00:03:38
whenever we tested on this very hard
00:03:40
feeder exam for the USA mathematical LPN
00:03:43
there's another very tough Benchmark
00:03:45
which is called gpq Diamond and this
00:03:48
measures the model's performance on PhD
00:03:50
level science questions here we get
00:03:52
another state-of-the-art number
00:03:55
87.7% which is about 10% better than our
00:03:58
01 performance which was at 78% just to
00:04:01
put this in perspective if you take an
00:04:03
expert PhD they typically get about 70%
00:04:06
in kind of their field of strength here
00:04:09
so one thing that you might notice yeah
00:04:11
from from some of these benchmarks is
00:04:13
that we're reaching saturation for a lot
00:04:15
of them or nearing saturation so the
00:04:18
last year has really highlighted the
00:04:20
need for really harder benchmarks to
00:04:22
accurately assess where our Frontier
00:04:24
models lie and I think a couple have
00:04:26
emerged as fairly promising over the
00:04:28
last months one in particular I want to
00:04:30
call out is epic ai's Frontier math
00:04:32
benchmark now you can see the scores
00:04:35
look a lot lower than they did for the
00:04:37
the previous benchmarks we showed and
00:04:39
this is because this is considered today
00:04:41
the toughest mathematical Benchmark out
00:04:43
there this is a data set that consists
00:04:46
of Novel unpublished and also very hard
00:04:48
to extremely hard yeah very very hard
00:04:50
problems even turn houses you know it
00:04:52
would take professional mathematicians
00:04:54
hours or even days to solve one of these
00:04:57
problems and today all offerings out
00:05:00
there um have less than 2% accuracy um
00:05:04
on on this Benchmark and we're seeing
00:05:06
with 03 in aggressive test time settings
00:05:08
we're able to get over
00:05:10
25% yeah um that's awesome in addition
00:05:14
to Epic ai's Frontier math benchmark we
00:05:16
have one more surprise for you guys so I
00:05:19
want to talk about the arc Benchmark at
00:05:21
this point but I would love to invite
00:05:23
one of our friends Greg who is the
00:05:25
president of the Ark foundation on to
00:05:27
talk about this Benchmark wonderful Sam
00:05:29
and Mark thank you very much for having
00:05:31
us today of course hello everybody my
00:05:33
name is Greg camad and I the president
00:05:35
of the arc priz Foundation now Arc prise
00:05:38
is a nonprofit with the mission of being
00:05:39
a North star towards AGI through and
00:05:42
during benchmarks so our first Benchmark
00:05:44
Arc AGI was developed in 2019 by
00:05:47
Francois cholle in his paper on the
00:05:50
measure of intelligence however it has
00:05:53
been unbeaten for five years now in AI
00:05:56
world that's like it feels like
00:05:58
centuries is where it is so the system
00:06:00
that beats Ark AGI is going to be an
00:06:02
important Milestone towards general
00:06:04
intelligence but I'm excited to say
00:06:07
today that we have a new
00:06:09
state-of-the-art score to announce
00:06:11
before I get into that though I want to
00:06:13
talk about what Ark AGI is so I would
00:06:15
love to show you an example here Arc AGI
00:06:19
is all about having input examples and
00:06:21
output examples what they're good
00:06:23
they're good okay input examples and
00:06:25
output examples now the goal is you want
00:06:27
to understand the rule of the
00:06:29
transformation and guess it on the
00:06:30
output so Sam what do you think is
00:06:33
happening in here probably putting a
00:06:36
dark blue square in the empty space see
00:06:38
yes that is exactly it now that is
00:06:41
really um it's easy for humans to uh
00:06:43
intuitively guess what that is it's
00:06:45
actually surprisingly hard for AI to
00:06:47
know to understand what's going on so I
00:06:49
want to show one more hard example here
00:06:52
now Mark I'm going to put you on the
00:06:54
spot what do you think is going on in
00:06:56
this uh task okay so you take each these
00:06:59
yellow squares you count the number of
00:07:01
colored kind of squares there and you
00:07:03
create a border of that with that that
00:07:05
is exactly and that's much quicker than
00:07:07
most people so congratulations on that
00:07:09
um what's interesting though is AI has
00:07:11
not been able to get this problem thus
00:07:14
far and even though that we verified
00:07:16
that a panel of humans could actually do
00:07:18
it now the unique part about R AGI is
00:07:21
every task requires distinct skills and
00:07:25
what I mean by that is we won't ask
00:07:28
there won't be another task that you
00:07:29
need to fill in the corners with blue
00:07:31
squares and but we do that on purpose
00:07:33
and the reason why we do that is because
00:07:35
we want to test the model's ability to
00:07:37
learn new skills on the Fly we don't
00:07:40
just want it to uh repeat what it's
00:07:42
already memorized that that's the whole
00:07:43
point here now Ark AGI version one took
00:07:47
5 years to go from 0% to 5% with leading
00:07:50
Frontier models however today I'm very
00:07:54
excited to say that 03 has scored a new
00:07:57
state-of-the-art score that we have
00:07:58
verified
00:08:00
on low compute for uh 03 it has scored
00:08:04
75.7 on Arc ai's semi private holdout
00:08:08
set now this is extremely impressive
00:08:11
because this is within the uh compute
00:08:13
requirements that we have for our public
00:08:14
leaderboard and this is the new number
00:08:16
one entry on rkg Pub so congratulations
00:08:20
to that thank so much yeah now uh as a
00:08:23
capabilities demonstration when we ask
00:08:24
o03 to think longer and we actually ramp
00:08:27
up to high compute O3 was able to score
00:08:31
85.7% on the same hidden holdout set
00:08:34
this is especially important .5 sorry
00:08:37
87.5 yes this is especially important
00:08:40
because um Human Performance is is
00:08:43
comparable at 85% threshold so being
00:08:46
Above This is a major Milestone and we
00:08:49
have never tested A system that has done
00:08:50
this or any model that has done this
00:08:52
beforehand so this is new territory in
00:08:54
the rcgi world congratulations with that
00:08:57
congratulations for making such a great
00:08:58
Benchmark yeah yeah um when I look at
00:09:01
these scores I realize um I need to
00:09:03
switch my worldview a little bit I need
00:09:05
to fix my AI intuitions about what AI
00:09:07
can actually do and what it's capable of
00:09:10
uh especially in this 03 world but the
00:09:13
work also is not over yet and these are
00:09:15
still the early days of AI so um we need
00:09:19
more enduring benchmarks like Arc AGI to
00:09:22
help measure and guide progress and I am
00:09:25
excited to accelerate that progress and
00:09:27
I'm excited to partner with open AI next
00:09:29
year to develop our next Frontier
00:09:31
Benchmark amazing you know it's also a
00:09:34
benchmark that we've been targeting and
00:09:35
been on our mind for a very long time so
00:09:37
excited to work with you in the future
00:09:39
worth mentioning that we didn't we
00:09:40
Target and we think it's an awesome benk
00:09:41
we didn't go do speciic this is just you
00:09:43
know the general of three but yeah
00:09:45
really appreciate the partnership and
00:09:46
this was a fun one to do absolutely and
00:09:48
even though this has done so well Arc
00:09:50
prize will continue in 2025 and anybody
00:09:52
can find out more at AR pri.org great
00:09:55
thank you so much absolutely
00:09:59
okay so next up we're going to talk
00:10:00
about o03 mini um O3 mini is a thing
00:10:03
that we're really really excited about
00:10:05
and hongu who trained the model will
00:10:07
come out and join us hey you
00:10:12
hey um hi everyone um I'm homean I'm a
00:10:16
open air researcher uh working on
00:10:18
reasoning so this September we released
00:10:21
01 mini uh which is a efficient
00:10:23
reasoning model the you know1 family
00:10:25
that's really capable of uh math and
00:10:27
coding probably among the best in the
00:10:28
world given the low cost so now together
00:10:31
with 03 I'm very happy to tell you more
00:10:35
about uh 03 mini which is a brand new
00:10:38
model in the 03 family that truly
00:10:40
defines a new cost efficient reasoning
00:10:42
Frontier it's incredible um yeah though
00:10:45
it's not available to our users today we
00:10:48
are opening access to the model to uh
00:10:50
our Safety and Security researchers
00:10:52
through test model out um with the
00:10:55
release of adaptive thinking time in the
00:10:58
API a couple days ago
00:10:59
for all three mini will support three
00:11:02
different options low median and high
00:11:05
reasoning effort so the users can freely
00:11:08
adjust the uh thinking time based on
00:11:10
their different use cases so for example
00:11:12
for some we may want the model to think
00:11:15
longer for more complicated problems and
00:11:18
U uh things shorter uh with like simpler
00:11:21
ones um with that I'm happy to show the
00:11:24
first set of evals of all three
00:11:27
mini um so on the left hand side we show
00:11:31
the coding EV so it's like code forces
00:11:34
ELO which measures how good a programmer
00:11:36
is uh and the higher is the better so as
00:11:39
we can see on the plot with more
00:11:42
thinking time all three mini is able to
00:11:45
have like increasing Yow all all
00:11:47
performing all One Mini and with like
00:11:49
median thinking time is able to measure
00:11:52
even better than o1 yeah so it's like
00:11:54
for an order and magnitude more speed
00:11:56
and cost we can deliver the same code
00:11:58
performance on this even better
00:12:00
insurance right so although it's like
00:12:02
the ultra high is still like a couple
00:12:04
hundred points away from Mark it's not
00:12:06
far that's better than me probably um
00:12:08
but just an incredible sort of cost to
00:12:11
Performance gain over what we've been
00:12:13
able to offer with 01 and we think
00:12:14
people will really love this yeah I hope
00:12:16
so so on the right hand plot we showed
00:12:19
the estimated cost versus cold forces yo
00:12:23
trade-off uh so it's pretty clear that
00:12:25
all3 media defines like a new uh cost
00:12:27
efficient reasoning Frontier on
00:12:30
uh so it's achieve like better
00:12:31
performance compar better performance
00:12:33
than all1 is a fraction of cost amazing
00:12:36
um with that being said um um I would
00:12:39
like to do a live demo on ult Mini uh so
00:12:44
um and hopefully you can test out all
00:12:46
the three different like low medium high
00:12:49
uh thinking time of the model so let me
00:12:51
past the prom
00:13:05
um so I'm testing out all three mini
00:13:07
High first and the task is that I'm
00:13:11
asking the model to uh use Python to
00:13:14
implement a code generator and executor
00:13:18
so if I launch this uh run this like
00:13:20
Pyon script it will launch a server um
00:13:24
and um locally with a with a with a UI
00:13:28
that contains a text box
00:13:30
and then we can uh make coding requests
00:13:32
in a text box it will send the request
00:13:35
to call ult Mini API and Al mini API
00:13:39
will solve the task and return a piece
00:13:41
of code and it will then uh save the
00:13:43
code locally on my desktop and then open
00:13:47
a terminal to execute the code
00:13:49
automatically so it's a very complicated
00:13:52
PR complicated house right um and it
00:13:55
outp puts like a big triangle code so if
00:13:57
we copy
00:13:59
code and paste it to our
00:14:03
server and then we would like to run
00:14:07
launch This Server so we should get a
00:14:09
text box when you're launching it yeah
00:14:11
okay great oh yeah see I hope so it
00:14:12
seems to be launching something
00:14:18
um okay oh great we have a we have a UI
00:14:21
where we can enter some cing promps
00:14:23
let's try out a simple one like print
00:14:25
open the eye and a random number
00:14:32
submit so it's sending the request to
00:14:34
all three mini medium so you should be
00:14:36
pretty fast right so on this 4 terminal
00:14:40
yeah 41 that's the magic number right so
00:14:42
it saves the generated code to this like
00:14:44
local script um on a desktop and the
00:14:47
print out opening and 41 um is there any
00:14:51
other task you guys want toy test it out
00:14:53
I wonder if you could get it to get its
00:14:54
own GP QA
00:14:56
numbers that Isa that's a great ask just
00:14:59
as what I expected we practice a lot
00:15:01
yesterday um okay so now let me copy the
00:15:09
code and send it in
00:15:13
the code
00:15:16
UI so um in this task we asked the model
00:15:19
to evaluate all3 mini with the low
00:15:22
reasoning effort on this hard gpq data
00:15:25
set and the model needs to First
00:15:28
download the the the raw file from this
00:15:30
URL and then you need to figure out
00:15:33
which part is a question which part is a
00:15:36
um which part is the answer and or which
00:15:39
part is the options right and then
00:15:40
formulate all the questions and to and
00:15:44
then ask model to answer it and then
00:15:45
part the result and then to grade it
00:15:49
that's actually blazingly fast yeah and
00:15:50
it's actually really fast because it's
00:15:52
calling the al3 mini with low reasoning
00:15:56
effort um yeah let's see how it goes
00:15:59
I guess two tasks are really hard here
00:16:02
yeah the long tails of the
00:16:05
problem
00:16:08
go yeah is a hard data set yes yeah you
00:16:12
can't is like maybe 196 easy problems
00:16:15
and two pretty hard
00:16:16
problems
00:16:18
um while we're waiting for this do you
00:16:20
want to show the what the request was
00:16:22
again mhm oh it's actually Returns the
00:16:25
results it's uh 61.6%
00:16:29
6% right with a low reasoning effort
00:16:31
model it's actually pretty fast then
00:16:33
full evaluation in the uh and the a
00:16:37
minute and somehow very cool to like
00:16:39
just ask a model to evaluate itself like
00:16:41
this yeah exactly right and if we just
00:16:43
summarize what we just did we asked the
00:16:45
model to write a script to evaluate
00:16:48
itself um through on this like hard D
00:16:51
created ass Set uh from a UI right from
00:16:54
this code generator and executor created
00:16:57
by the model itself you first place next
00:17:00
year we're going to bring you on and
00:17:01
you're going to have to improve ask the
00:17:03
model to improve it so yeah let's
00:17:04
definely ask the model to improve it
00:17:05
next time maybe not
00:17:07
um
00:17:10
um so um besides code forces and gpq the
00:17:15
model is also a pretty good um um math
00:17:18
model so we we show on this plot uh with
00:17:22
like on this am 2024 data set also meing
00:17:25
low achieves um comparable performance
00:17:28
with One Mini and 03 mini medium
00:17:31
achieves a comparable better performance
00:17:33
than 01 we check the solid bar which are
00:17:35
passle ones and we can further push the
00:17:38
performance with 03 mini high right and
00:17:41
on the right hand side plot when we
00:17:43
measure the latency on this like
00:17:45
anonymized ow preview traffic we show
00:17:48
that 03 mini low drastically reduce the
00:17:50
latency of O mini right almost like
00:17:54
achieving comparable latency with uh gbt
00:17:57
40 under second so probably is like
00:18:00
instant response and also Mei medium is
00:18:03
like half the latency of all
00:18:06
one um and here's another set of evals
00:18:09
as I'm even more excited to to show you
00:18:11
guys is um uh API features right we get
00:18:14
a lot of requests from our developer
00:18:16
communities to support like function
00:18:18
calling structured outputs developer
00:18:20
messages U all Miner models and here um
00:18:24
all3 mini will support all these
00:18:27
features same as o1
00:18:29
um and notably it achieves like
00:18:32
comparable better performance than for
00:18:34
all on most of the ow providing a more
00:18:37
cost effective solution to our
00:18:39
developers cool um and if we actually
00:18:43
enveil the true gpq Diamond performance
00:18:47
that I run a couple days ago uh it
00:18:49
actually also me L is actually 62% right
00:18:52
basically ask model to evalate itself
00:18:54
yeah right next time you should totally
00:18:55
just ask model to automatically do the
00:18:57
evaluation instead of ask
00:19:00
um yeah so with that um that's it for
00:19:03
alter Mei and I hope our user can have a
00:19:05
much better user experience in already
00:19:07
next year fantastic work thank really
00:19:09
great work on thank you cool so I know
00:19:13
you're excited to get this in your own
00:19:15
hands um and we're very working very
00:19:17
hard to postra this model to do some uh
00:19:19
safety interventions on top of the model
00:19:21
and we're doing a lot of internal safety
00:19:22
testing right now but something new
00:19:25
we're doing this time is we're also
00:19:26
opening up this model to external safety
00:19:29
testing starting today with O3 mini and
00:19:31
also eventually with 03 so how do you
00:19:34
get Early Access as a safety researcher
00:19:36
or as security researcher you can go to
00:19:38
our website and you can see a form like
00:19:40
this one that you see on the screen and
00:19:42
applications for this form are rolling
00:19:44
they'll close on January 10th and we
00:19:47
really invite you to apply uh we're
00:19:48
excited to see what kind of things that
00:19:50
you can explore with this and what kind
00:19:52
of um jailbreaks and other things you
00:19:55
discover cool great so one other thing
00:19:58
that I'm excited to talk about is a a
00:20:00
new report that we published I think
00:20:02
yesterday or today um that advances our
00:20:05
safety program and this is a new
00:20:07
technique called deliberative alignment
00:20:09
typically when we do safety training on
00:20:12
top of our models we're trying to learn
00:20:13
this decision boundary of what's safe
00:20:15
and what's unsafe right and usually it's
00:20:19
uh just through showing examples pure
00:20:21
examples of this is a safe prompt this
00:20:22
is an unsafe prompt but we can now
00:20:25
leverage the reasoning capabilities that
00:20:27
we have from our models to find a more
00:20:29
accurate safety boundary here and this
00:20:32
technique called deliberative alignment
00:20:34
allows us to take a safety spec allows
00:20:36
the model to reason over a prompt and
00:20:39
also just tell you know is this a safe
00:20:41
prompt or not often times within the
00:20:43
reasoning it'll just uncover that hey
00:20:45
you know this user is trying to trick me
00:20:47
or they're expressing this kind of
00:20:49
intent that's hidden so even if you kind
00:20:51
of try to Cipher your your prompts often
00:20:53
times the reasoning will break that and
00:20:56
the primary result you see is in this
00:20:58
figure that that's shown over here we
00:21:00
have um our performance on a rejection
00:21:02
Benchmark on the x-axis and on over
00:21:04
refusals on the Y AIS and here uh to the
00:21:07
right is better so this is our ability
00:21:09
to accurately tell when we should reject
00:21:11
something also our ability to tell when
00:21:13
we should revie something and typically
00:21:15
you think of these two metrics as having
00:21:17
some sort of trade-off it's really hard
00:21:18
to do while on the it is really hard to
00:21:20
do yeah um but it seems with
00:21:22
deliberative alignment that we can get
00:21:24
these two green points on the top right
00:21:26
whereas the previous models the red and
00:21:28
Blue Points um signify the performance
00:21:30
of our previous models so we're really
00:21:33
starting to leverage safety to get sorry
00:21:35
leverage reasoning to get better safety
00:21:37
yeah I think this is a really great
00:21:38
result of safety yeah fantastic okay so
00:21:42
to sum this up 03 mini and 03 apply
00:21:45
please if you'd like for safety testing
00:21:47
to help us uh test these models as an
00:21:50
additional step we plan to launch 03
00:21:52
mini around the end of January and full3
00:21:54
shortly after that but uh that will you
00:21:56
know the more people can help us safety
00:21:58
test the more we can uh make sure we hit
00:22:00
that so please check it out uh and
00:22:03
thanks for following along with us with
00:22:05
this it's been a lot of fun for us we
00:22:06
hope you've enjoyed it too Merry
00:22:07
Christmas Merry Christmas Merry
00:22:10
[Applause]
00:22:11
Christmas okay also it's
00:22:27
c fore

OpenAI o3 震撼发布！Arc AGI 测试得分超越人类｜ OpenAI 12天「第12天」| 回到Axton

Summary

Takeaways

Timeline

Mind Map

Video Q&A

What is the new AI model being introduced?

The new AI models introduced are O3 and O3 Mini.

What are the capabilities of O3?

O3 exhibits high performance in coding, mathematics, and PhD-level science questions, surpassing previous models like O1.

How does O3 perform in coding benchmarks?

O3 achieves about 71.7% accuracy in coding benchmarks, significantly better than O1's performance.

What is the significance of the Arc AGI benchmark?

The Arc AGI benchmark evaluates AI's general intelligence through unique tasks requiring new skills, and O3 has achieved a new state-of-the-art score.

What is O3 Mini?

O3 Mini is a cost-efficient reasoning model in the O3 family, performing well in math and coding with various reasoning time options.

How can researchers access O3 and O3 Mini for testing?

Researchers can apply for early access to O3 and O3 Mini for safety testing by filling out a form on OpenAI's website.

When will O3 and O3 Mini be publicly available?

O3 Mini is expected to launch around the end of January, with O3 following shortly after.

What is deliberative alignment?

Deliberative alignment is a new safety training technique that uses reasoning capabilities to better define safe and unsafe prompts.

How does O3 Mini compare to O1 Mini?

O3 Mini offers better performance at a lower cost, with more efficient reasoning capabilities than O1 Mini.

View more video summaries

Tools

Daily Summary

Legal

OpenAI o3 震撼发布！Arc AGI 测试得分超越人类 ｜ OpenAI 12天「第12天」| 回到Axton

Summary

Takeaways

Timeline

Mind Map

Video Q&A

What is the new AI model being introduced?

The new AI models introduced are O3 and O3 Mini.

What are the capabilities of O3?

O3 exhibits high performance in coding, mathematics, and PhD-level science questions, surpassing previous models like O1.

How does O3 perform in coding benchmarks?

O3 achieves about 71.7% accuracy in coding benchmarks, significantly better than O1's performance.

What is the significance of the Arc AGI benchmark?

The Arc AGI benchmark evaluates AI's general intelligence through unique tasks requiring new skills, and O3 has achieved a new state-of-the-art score.

What is O3 Mini?

O3 Mini is a cost-efficient reasoning model in the O3 family, performing well in math and coding with various reasoning time options.

How can researchers access O3 and O3 Mini for testing?

Researchers can apply for early access to O3 and O3 Mini for safety testing by filling out a form on OpenAI's website.

When will O3 and O3 Mini be publicly available?

O3 Mini is expected to launch around the end of January, with O3 following shortly after.

What is deliberative alignment?

Deliberative alignment is a new safety training technique that uses reasoning capabilities to better define safe and unsafe prompts.

How does O3 Mini compare to O1 Mini?

O3 Mini offers better performance at a lower cost, with more efficient reasoning capabilities than O1 Mini.

View more video summaries

OpenAI o3 震撼发布！Arc AGI 测试得分超越人类｜ OpenAI 12天「第12天」| 回到Axton