OpenAI o3 震撼发布!Arc AGI 测试得分超越人类 | OpenAI 12天「第12天」| 回到Axton

00:22:58
https://www.youtube.com/watch?v=-O90WJvN3vw

Summary

TLDROpenAI is announcing two new AI models, O3 and O3 Mini. O3 is a highly advanced model excelling in programming, math, and scientific tasks, achieving significant improvements over previous models. O3 Mini is introduced as a cost-efficient alternative, optimizing performance and reasoning capabilities. Although not publicly launched, these models are open for safety testing to researchers. A key feature of these models is their ability to perform well on challenging benchmarks like Arc AGI, highlighting leaps towards general intelligence. OpenAI emphasizes safety with these releases, leveraging new techniques like deliberative alignment to ensure better understanding of safety boundaries. Public availability is planned for the beginning of next year, with ongoing opportunities for researchers to assist in refining these models.

Takeaways

  • 🚀 Introduction of two advanced AI models, O3 and O3 Mini.
  • 🧠 O3 excels in complex reasoning tasks, including coding and math.
  • 🔍 O3 achieves state-of-the-art results on tough benchmarks like Arc AGI.
  • 📉 O3 Mini offers cost-efficient performance with flexible reasoning times.
  • 🔒 OpenAI opens these models for safety testing to researchers.
  • 📜 New technique, deliberative alignment, enhances model safety.
  • 📆 Public release of O3 Mini expected by end of January.
  • 👍 Tremendous progress in AI capabilities compared to previous models.
  • 🔗 Opportunity for public contributions in SAF testing.
  • 🎯 Focus on advancing AI towards general intelligence.

Timeline

  • 00:00:00 - 00:05:00

    The event began with the introduction of OpenAI's first reasoning model, O1, twelve days ago. The focus is on the next frontier of AI, O3, and its cost-effective counterpart, O3 mini. Both models aim to tackle complex tasks requiring high-level reasoning, although they're not launching publicly yet. Instead, public safety testing is open to researchers. A thorough safety testing approach is emphasized for these advanced models. Mark from OpenAI elaborates on O3's capabilities, showcasing its superior performance over previous models in coding and mathematical benchmarks.

  • 00:05:00 - 00:10:00

    The discussion highlights O3's remarkable abilities, outperforming its predecessors in competitive programming and advanced math tasks. It's noted that current benchmarks are being surpassed, underscoring the need for more challenging tests. The introduction of innovative benchmarks, such as Epic AI's Frontier math benchmark, demonstrates O3's superior capability by achieving over 25% accuracy in comparison to less than 2% from other models. The breakthrough on the Arc Prize Foundation's AGI benchmark is announced, showcasing O3's impressive 75.7% score, indicating progress toward general intelligence.

  • 00:10:00 - 00:15:00

    O3 mini, a companion to O3, is introduced as a highly cost-efficient reasoning model. By supporting multiple reasoning levels, it provides flexibility for different use cases. Preliminary evaluations demonstrate O3 mini's superior performance, particularly in coding, compared to earlier models. A live demo exemplifies its capability in automating complex coding tasks. Additional evaluations highlight O3 mini's advantage in terms of cost versus performance ratio, offering promising utility for developers needing efficient but powerful AI solutions.

  • 00:15:00 - 00:22:58

    The event wraps up with a focus on safety interventions and external testing for O3 mini and O3, urging researchers to apply for early access. A new report on 'deliberative alignment' is introduced, emphasizing enhanced safety through reasoning. This strategy aims to refine the decision boundary between safe and unsafe AI behavior, utilizing the model's reasoning to reliably detect malicious intent. OpenAI plans to launch O3 mini by January's end, followed by the full O3, highlighting a commitment to safety and performance. The audience is encouraged to participate in testing to aid in perfecting the models.

Show more

Mind Map

Video Q&A

  • What is the new AI model being introduced?

    The new AI models introduced are O3 and O3 Mini.

  • What are the capabilities of O3?

    O3 exhibits high performance in coding, mathematics, and PhD-level science questions, surpassing previous models like O1.

  • How does O3 perform in coding benchmarks?

    O3 achieves about 71.7% accuracy in coding benchmarks, significantly better than O1's performance.

  • What is the significance of the Arc AGI benchmark?

    The Arc AGI benchmark evaluates AI's general intelligence through unique tasks requiring new skills, and O3 has achieved a new state-of-the-art score.

  • What is O3 Mini?

    O3 Mini is a cost-efficient reasoning model in the O3 family, performing well in math and coding with various reasoning time options.

  • How can researchers access O3 and O3 Mini for testing?

    Researchers can apply for early access to O3 and O3 Mini for safety testing by filling out a form on OpenAI's website.

  • When will O3 and O3 Mini be publicly available?

    O3 Mini is expected to launch around the end of January, with O3 following shortly after.

  • What is deliberative alignment?

    Deliberative alignment is a new safety training technique that uses reasoning capabilities to better define safe and unsafe prompts.

  • How does O3 Mini compare to O1 Mini?

    O3 Mini offers better performance at a lower cost, with more efficient reasoning capabilities than O1 Mini.

View more video summaries

Get instant access to free YouTube video summaries powered by AI!
Subtitles
en
Auto Scroll:
  • 00:00:20
    good morning we have an exciting one for
  • 00:00:22
    you today we started this 12-day event
  • 00:00:24
    12 days ago with the launch of 01 our
  • 00:00:26
    first reasoning model it's been amazing
  • 00:00:28
    to see what people are doing with that
  • 00:00:30
    gratifying to hear how much people like
  • 00:00:31
    it we view this as sort of the beginning
  • 00:00:33
    of the next phase of AI where you can
  • 00:00:35
    use these models to do increasingly
  • 00:00:36
    complex tasks they require a lot of
  • 00:00:39
    reasoning and so for the last day of
  • 00:00:41
    this event um we thought it would be fun
  • 00:00:43
    to go from one Frontier Model to our
  • 00:00:45
    next Frontier Model today we're going to
  • 00:00:47
    talk about that next Frontier Model um
  • 00:00:50
    which you would think logically maybe
  • 00:00:52
    should be called O2 um but out of
  • 00:00:54
    respect to our friends at telica and in
  • 00:00:56
    the grand tradition of open AI being
  • 00:00:57
    really truly bad at names it's going to
  • 00:00:59
    be called 03 actually we're going to
  • 00:01:02
    launch uh not launch we're going to
  • 00:01:03
    announce two models today 03 and O3 mini
  • 00:01:06
    03 is a very very smart model uh 03 mini
  • 00:01:10
    is an incredibly smart model but still
  • 00:01:12
    uh but a really good at performance and
  • 00:01:14
    cost so to get the bad news out of the
  • 00:01:17
    way first we're not going to publicly
  • 00:01:18
    launch these today um the good news is
  • 00:01:21
    we're going to make them available for
  • 00:01:22
    Public Safety testing starting today you
  • 00:01:24
    can apply and we'll talk about that
  • 00:01:25
    later we've taken safety Tes testing
  • 00:01:28
    seriously as our models get uh more and
  • 00:01:30
    more capable and at this new level of
  • 00:01:32
    capability we want to try adding a new
  • 00:01:34
    part of our safety testing procedure
  • 00:01:36
    which is to allow uh Public Access for
  • 00:01:38
    researchers that want to help us test
  • 00:01:40
    we'll talk more at the end about when
  • 00:01:41
    these models uh when we expect to make
  • 00:01:43
    these models generally available but
  • 00:01:45
    we're so excited uh to show you what
  • 00:01:47
    they can do to talk about their
  • 00:01:48
    performance got a little surprise we'll
  • 00:01:50
    show you some demos uh and without
  • 00:01:52
    further Ado I'll hand it over to Mark to
  • 00:01:53
    talk about it cool thank you so much Sam
  • 00:01:55
    so my name is Mark I lead research at
  • 00:01:57
    openai and I want to talk a little bit
  • 00:01:59
    about O3 capabilities now O3 is a really
  • 00:02:01
    strong model at very hard technical
  • 00:02:03
    benchmarks and I want to start with
  • 00:02:05
    coding benchmarks if you can bring those
  • 00:02:07
    up so on software style benchmarks we
  • 00:02:10
    have sweep bench verified which is a
  • 00:02:13
    benchmark consisting of real world
  • 00:02:14
    software tasks we're seeing that 03
  • 00:02:17
    performs at about
  • 00:02:19
    71.7% accuracy which is over 20% better
  • 00:02:22
    than our 01 models now this really
  • 00:02:24
    signifies that we're really climbing the
  • 00:02:26
    frontier of utility as well on
  • 00:02:29
    competition code we see that 01 achieves
  • 00:02:32
    an ELO on this contest coding site
  • 00:02:34
    called code forces about 1891 at our
  • 00:02:37
    most aggressive High test time compute
  • 00:02:39
    settings we're able to achieve almost
  • 00:02:41
    like a 2727 ELO here ju so Mark was a
  • 00:02:44
    competitive programmer actually still
  • 00:02:46
    coaches competitive programming very
  • 00:02:48
    very good what what is your I think my
  • 00:02:50
    best at a comparable site was about 2500
  • 00:02:52
    that's tough well I I will say you know
  • 00:02:55
    our chief scientist um this is also
  • 00:02:57
    better than our chief scientist yakov's
  • 00:02:59
    score I think there's one guy at opening
  • 00:03:01
    ey who's still like a 3,000 something
  • 00:03:03
    yeah a few more months to yeah enjoy
  • 00:03:05
    hopefully we have a couple months to
  • 00:03:06
    enjoy there great that's I mean this is
  • 00:03:08
    it's in this model is incredible at
  • 00:03:10
    programming yeah and not just program
  • 00:03:13
    but also mathematics so we see that on
  • 00:03:15
    competition math benchmarks just like
  • 00:03:17
    competitive programming we achieve very
  • 00:03:19
    very strong scores so 03 gets about
  • 00:03:22
    96.7% accuracy versus an 01 performance
  • 00:03:25
    of 83.3% on the Amy what's your best Amy
  • 00:03:28
    score I did get a perfect score once so
  • 00:03:31
    I'm safe but
  • 00:03:33
    yeah really what this signifies is that
  • 00:03:35
    03 um often just misses one question
  • 00:03:38
    whenever we tested on this very hard
  • 00:03:40
    feeder exam for the USA mathematical LPN
  • 00:03:43
    there's another very tough Benchmark
  • 00:03:45
    which is called gpq Diamond and this
  • 00:03:48
    measures the model's performance on PhD
  • 00:03:50
    level science questions here we get
  • 00:03:52
    another state-of-the-art number
  • 00:03:55
    87.7% which is about 10% better than our
  • 00:03:58
    01 performance which was at 78% just to
  • 00:04:01
    put this in perspective if you take an
  • 00:04:03
    expert PhD they typically get about 70%
  • 00:04:06
    in kind of their field of strength here
  • 00:04:09
    so one thing that you might notice yeah
  • 00:04:11
    from from some of these benchmarks is
  • 00:04:13
    that we're reaching saturation for a lot
  • 00:04:15
    of them or nearing saturation so the
  • 00:04:18
    last year has really highlighted the
  • 00:04:20
    need for really harder benchmarks to
  • 00:04:22
    accurately assess where our Frontier
  • 00:04:24
    models lie and I think a couple have
  • 00:04:26
    emerged as fairly promising over the
  • 00:04:28
    last months one in particular I want to
  • 00:04:30
    call out is epic ai's Frontier math
  • 00:04:32
    benchmark now you can see the scores
  • 00:04:35
    look a lot lower than they did for the
  • 00:04:37
    the previous benchmarks we showed and
  • 00:04:39
    this is because this is considered today
  • 00:04:41
    the toughest mathematical Benchmark out
  • 00:04:43
    there this is a data set that consists
  • 00:04:46
    of Novel unpublished and also very hard
  • 00:04:48
    to extremely hard yeah very very hard
  • 00:04:50
    problems even turn houses you know it
  • 00:04:52
    would take professional mathematicians
  • 00:04:54
    hours or even days to solve one of these
  • 00:04:57
    problems and today all offerings out
  • 00:05:00
    there um have less than 2% accuracy um
  • 00:05:04
    on on this Benchmark and we're seeing
  • 00:05:06
    with 03 in aggressive test time settings
  • 00:05:08
    we're able to get over
  • 00:05:10
    25% yeah um that's awesome in addition
  • 00:05:14
    to Epic ai's Frontier math benchmark we
  • 00:05:16
    have one more surprise for you guys so I
  • 00:05:19
    want to talk about the arc Benchmark at
  • 00:05:21
    this point but I would love to invite
  • 00:05:23
    one of our friends Greg who is the
  • 00:05:25
    president of the Ark foundation on to
  • 00:05:27
    talk about this Benchmark wonderful Sam
  • 00:05:29
    and Mark thank you very much for having
  • 00:05:31
    us today of course hello everybody my
  • 00:05:33
    name is Greg camad and I the president
  • 00:05:35
    of the arc priz Foundation now Arc prise
  • 00:05:38
    is a nonprofit with the mission of being
  • 00:05:39
    a North star towards AGI through and
  • 00:05:42
    during benchmarks so our first Benchmark
  • 00:05:44
    Arc AGI was developed in 2019 by
  • 00:05:47
    Francois cholle in his paper on the
  • 00:05:50
    measure of intelligence however it has
  • 00:05:53
    been unbeaten for five years now in AI
  • 00:05:56
    world that's like it feels like
  • 00:05:58
    centuries is where it is so the system
  • 00:06:00
    that beats Ark AGI is going to be an
  • 00:06:02
    important Milestone towards general
  • 00:06:04
    intelligence but I'm excited to say
  • 00:06:07
    today that we have a new
  • 00:06:09
    state-of-the-art score to announce
  • 00:06:11
    before I get into that though I want to
  • 00:06:13
    talk about what Ark AGI is so I would
  • 00:06:15
    love to show you an example here Arc AGI
  • 00:06:19
    is all about having input examples and
  • 00:06:21
    output examples what they're good
  • 00:06:23
    they're good okay input examples and
  • 00:06:25
    output examples now the goal is you want
  • 00:06:27
    to understand the rule of the
  • 00:06:29
    transformation and guess it on the
  • 00:06:30
    output so Sam what do you think is
  • 00:06:33
    happening in here probably putting a
  • 00:06:36
    dark blue square in the empty space see
  • 00:06:38
    yes that is exactly it now that is
  • 00:06:41
    really um it's easy for humans to uh
  • 00:06:43
    intuitively guess what that is it's
  • 00:06:45
    actually surprisingly hard for AI to
  • 00:06:47
    know to understand what's going on so I
  • 00:06:49
    want to show one more hard example here
  • 00:06:52
    now Mark I'm going to put you on the
  • 00:06:54
    spot what do you think is going on in
  • 00:06:56
    this uh task okay so you take each these
  • 00:06:59
    yellow squares you count the number of
  • 00:07:01
    colored kind of squares there and you
  • 00:07:03
    create a border of that with that that
  • 00:07:05
    is exactly and that's much quicker than
  • 00:07:07
    most people so congratulations on that
  • 00:07:09
    um what's interesting though is AI has
  • 00:07:11
    not been able to get this problem thus
  • 00:07:14
    far and even though that we verified
  • 00:07:16
    that a panel of humans could actually do
  • 00:07:18
    it now the unique part about R AGI is
  • 00:07:21
    every task requires distinct skills and
  • 00:07:25
    what I mean by that is we won't ask
  • 00:07:28
    there won't be another task that you
  • 00:07:29
    need to fill in the corners with blue
  • 00:07:31
    squares and but we do that on purpose
  • 00:07:33
    and the reason why we do that is because
  • 00:07:35
    we want to test the model's ability to
  • 00:07:37
    learn new skills on the Fly we don't
  • 00:07:40
    just want it to uh repeat what it's
  • 00:07:42
    already memorized that that's the whole
  • 00:07:43
    point here now Ark AGI version one took
  • 00:07:47
    5 years to go from 0% to 5% with leading
  • 00:07:50
    Frontier models however today I'm very
  • 00:07:54
    excited to say that 03 has scored a new
  • 00:07:57
    state-of-the-art score that we have
  • 00:07:58
    verified
  • 00:08:00
    on low compute for uh 03 it has scored
  • 00:08:04
    75.7 on Arc ai's semi private holdout
  • 00:08:08
    set now this is extremely impressive
  • 00:08:11
    because this is within the uh compute
  • 00:08:13
    requirements that we have for our public
  • 00:08:14
    leaderboard and this is the new number
  • 00:08:16
    one entry on rkg Pub so congratulations
  • 00:08:20
    to that thank so much yeah now uh as a
  • 00:08:23
    capabilities demonstration when we ask
  • 00:08:24
    o03 to think longer and we actually ramp
  • 00:08:27
    up to high compute O3 was able to score
  • 00:08:31
    85.7% on the same hidden holdout set
  • 00:08:34
    this is especially important .5 sorry
  • 00:08:37
    87.5 yes this is especially important
  • 00:08:40
    because um Human Performance is is
  • 00:08:43
    comparable at 85% threshold so being
  • 00:08:46
    Above This is a major Milestone and we
  • 00:08:49
    have never tested A system that has done
  • 00:08:50
    this or any model that has done this
  • 00:08:52
    beforehand so this is new territory in
  • 00:08:54
    the rcgi world congratulations with that
  • 00:08:57
    congratulations for making such a great
  • 00:08:58
    Benchmark yeah yeah um when I look at
  • 00:09:01
    these scores I realize um I need to
  • 00:09:03
    switch my worldview a little bit I need
  • 00:09:05
    to fix my AI intuitions about what AI
  • 00:09:07
    can actually do and what it's capable of
  • 00:09:10
    uh especially in this 03 world but the
  • 00:09:13
    work also is not over yet and these are
  • 00:09:15
    still the early days of AI so um we need
  • 00:09:19
    more enduring benchmarks like Arc AGI to
  • 00:09:22
    help measure and guide progress and I am
  • 00:09:25
    excited to accelerate that progress and
  • 00:09:27
    I'm excited to partner with open AI next
  • 00:09:29
    year to develop our next Frontier
  • 00:09:31
    Benchmark amazing you know it's also a
  • 00:09:34
    benchmark that we've been targeting and
  • 00:09:35
    been on our mind for a very long time so
  • 00:09:37
    excited to work with you in the future
  • 00:09:39
    worth mentioning that we didn't we
  • 00:09:40
    Target and we think it's an awesome benk
  • 00:09:41
    we didn't go do speciic this is just you
  • 00:09:43
    know the general of three but yeah
  • 00:09:45
    really appreciate the partnership and
  • 00:09:46
    this was a fun one to do absolutely and
  • 00:09:48
    even though this has done so well Arc
  • 00:09:50
    prize will continue in 2025 and anybody
  • 00:09:52
    can find out more at AR pri.org great
  • 00:09:55
    thank you so much absolutely
  • 00:09:59
    okay so next up we're going to talk
  • 00:10:00
    about o03 mini um O3 mini is a thing
  • 00:10:03
    that we're really really excited about
  • 00:10:05
    and hongu who trained the model will
  • 00:10:07
    come out and join us hey you
  • 00:10:12
    hey um hi everyone um I'm homean I'm a
  • 00:10:16
    open air researcher uh working on
  • 00:10:18
    reasoning so this September we released
  • 00:10:21
    01 mini uh which is a efficient
  • 00:10:23
    reasoning model the you know1 family
  • 00:10:25
    that's really capable of uh math and
  • 00:10:27
    coding probably among the best in the
  • 00:10:28
    world given the low cost so now together
  • 00:10:31
    with 03 I'm very happy to tell you more
  • 00:10:35
    about uh 03 mini which is a brand new
  • 00:10:38
    model in the 03 family that truly
  • 00:10:40
    defines a new cost efficient reasoning
  • 00:10:42
    Frontier it's incredible um yeah though
  • 00:10:45
    it's not available to our users today we
  • 00:10:48
    are opening access to the model to uh
  • 00:10:50
    our Safety and Security researchers
  • 00:10:52
    through test model out um with the
  • 00:10:55
    release of adaptive thinking time in the
  • 00:10:58
    API a couple days ago
  • 00:10:59
    for all three mini will support three
  • 00:11:02
    different options low median and high
  • 00:11:05
    reasoning effort so the users can freely
  • 00:11:08
    adjust the uh thinking time based on
  • 00:11:10
    their different use cases so for example
  • 00:11:12
    for some we may want the model to think
  • 00:11:15
    longer for more complicated problems and
  • 00:11:18
    U uh things shorter uh with like simpler
  • 00:11:21
    ones um with that I'm happy to show the
  • 00:11:24
    first set of evals of all three
  • 00:11:27
    mini um so on the left hand side we show
  • 00:11:31
    the coding EV so it's like code forces
  • 00:11:34
    ELO which measures how good a programmer
  • 00:11:36
    is uh and the higher is the better so as
  • 00:11:39
    we can see on the plot with more
  • 00:11:42
    thinking time all three mini is able to
  • 00:11:45
    have like increasing Yow all all
  • 00:11:47
    performing all One Mini and with like
  • 00:11:49
    median thinking time is able to measure
  • 00:11:52
    even better than o1 yeah so it's like
  • 00:11:54
    for an order and magnitude more speed
  • 00:11:56
    and cost we can deliver the same code
  • 00:11:58
    performance on this even better
  • 00:12:00
    insurance right so although it's like
  • 00:12:02
    the ultra high is still like a couple
  • 00:12:04
    hundred points away from Mark it's not
  • 00:12:06
    far that's better than me probably um
  • 00:12:08
    but just an incredible sort of cost to
  • 00:12:11
    Performance gain over what we've been
  • 00:12:13
    able to offer with 01 and we think
  • 00:12:14
    people will really love this yeah I hope
  • 00:12:16
    so so on the right hand plot we showed
  • 00:12:19
    the estimated cost versus cold forces yo
  • 00:12:23
    trade-off uh so it's pretty clear that
  • 00:12:25
    all3 media defines like a new uh cost
  • 00:12:27
    efficient reasoning Frontier on
  • 00:12:30
    uh so it's achieve like better
  • 00:12:31
    performance compar better performance
  • 00:12:33
    than all1 is a fraction of cost amazing
  • 00:12:36
    um with that being said um um I would
  • 00:12:39
    like to do a live demo on ult Mini uh so
  • 00:12:44
    um and hopefully you can test out all
  • 00:12:46
    the three different like low medium high
  • 00:12:49
    uh thinking time of the model so let me
  • 00:12:51
    past the prom
  • 00:13:05
    um so I'm testing out all three mini
  • 00:13:07
    High first and the task is that I'm
  • 00:13:11
    asking the model to uh use Python to
  • 00:13:14
    implement a code generator and executor
  • 00:13:18
    so if I launch this uh run this like
  • 00:13:20
    Pyon script it will launch a server um
  • 00:13:24
    and um locally with a with a with a UI
  • 00:13:28
    that contains a text box
  • 00:13:30
    and then we can uh make coding requests
  • 00:13:32
    in a text box it will send the request
  • 00:13:35
    to call ult Mini API and Al mini API
  • 00:13:39
    will solve the task and return a piece
  • 00:13:41
    of code and it will then uh save the
  • 00:13:43
    code locally on my desktop and then open
  • 00:13:47
    a terminal to execute the code
  • 00:13:49
    automatically so it's a very complicated
  • 00:13:52
    PR complicated house right um and it
  • 00:13:55
    outp puts like a big triangle code so if
  • 00:13:57
    we copy
  • 00:13:59
    code and paste it to our
  • 00:14:03
    server and then we would like to run
  • 00:14:07
    launch This Server so we should get a
  • 00:14:09
    text box when you're launching it yeah
  • 00:14:11
    okay great oh yeah see I hope so it
  • 00:14:12
    seems to be launching something
  • 00:14:18
    um okay oh great we have a we have a UI
  • 00:14:21
    where we can enter some cing promps
  • 00:14:23
    let's try out a simple one like print
  • 00:14:25
    open the eye and a random number
  • 00:14:32
    submit so it's sending the request to
  • 00:14:34
    all three mini medium so you should be
  • 00:14:36
    pretty fast right so on this 4 terminal
  • 00:14:40
    yeah 41 that's the magic number right so
  • 00:14:42
    it saves the generated code to this like
  • 00:14:44
    local script um on a desktop and the
  • 00:14:47
    print out opening and 41 um is there any
  • 00:14:51
    other task you guys want toy test it out
  • 00:14:53
    I wonder if you could get it to get its
  • 00:14:54
    own GP QA
  • 00:14:56
    numbers that Isa that's a great ask just
  • 00:14:59
    as what I expected we practice a lot
  • 00:15:01
    yesterday um okay so now let me copy the
  • 00:15:09
    code and send it in
  • 00:15:13
    the code
  • 00:15:16
    UI so um in this task we asked the model
  • 00:15:19
    to evaluate all3 mini with the low
  • 00:15:22
    reasoning effort on this hard gpq data
  • 00:15:25
    set and the model needs to First
  • 00:15:28
    download the the the raw file from this
  • 00:15:30
    URL and then you need to figure out
  • 00:15:33
    which part is a question which part is a
  • 00:15:36
    um which part is the answer and or which
  • 00:15:39
    part is the options right and then
  • 00:15:40
    formulate all the questions and to and
  • 00:15:44
    then ask model to answer it and then
  • 00:15:45
    part the result and then to grade it
  • 00:15:49
    that's actually blazingly fast yeah and
  • 00:15:50
    it's actually really fast because it's
  • 00:15:52
    calling the al3 mini with low reasoning
  • 00:15:56
    effort um yeah let's see how it goes
  • 00:15:59
    I guess two tasks are really hard here
  • 00:16:02
    yeah the long tails of the
  • 00:16:05
    problem
  • 00:16:08
    go yeah is a hard data set yes yeah you
  • 00:16:12
    can't is like maybe 196 easy problems
  • 00:16:15
    and two pretty hard
  • 00:16:16
    problems
  • 00:16:18
    um while we're waiting for this do you
  • 00:16:20
    want to show the what the request was
  • 00:16:22
    again mhm oh it's actually Returns the
  • 00:16:25
    results it's uh 61.6%
  • 00:16:29
    6% right with a low reasoning effort
  • 00:16:31
    model it's actually pretty fast then
  • 00:16:33
    full evaluation in the uh and the a
  • 00:16:37
    minute and somehow very cool to like
  • 00:16:39
    just ask a model to evaluate itself like
  • 00:16:41
    this yeah exactly right and if we just
  • 00:16:43
    summarize what we just did we asked the
  • 00:16:45
    model to write a script to evaluate
  • 00:16:48
    itself um through on this like hard D
  • 00:16:51
    created ass Set uh from a UI right from
  • 00:16:54
    this code generator and executor created
  • 00:16:57
    by the model itself you first place next
  • 00:17:00
    year we're going to bring you on and
  • 00:17:01
    you're going to have to improve ask the
  • 00:17:03
    model to improve it so yeah let's
  • 00:17:04
    definely ask the model to improve it
  • 00:17:05
    next time maybe not
  • 00:17:07
    um
  • 00:17:10
    um so um besides code forces and gpq the
  • 00:17:15
    model is also a pretty good um um math
  • 00:17:18
    model so we we show on this plot uh with
  • 00:17:22
    like on this am 2024 data set also meing
  • 00:17:25
    low achieves um comparable performance
  • 00:17:28
    with One Mini and 03 mini medium
  • 00:17:31
    achieves a comparable better performance
  • 00:17:33
    than 01 we check the solid bar which are
  • 00:17:35
    passle ones and we can further push the
  • 00:17:38
    performance with 03 mini high right and
  • 00:17:41
    on the right hand side plot when we
  • 00:17:43
    measure the latency on this like
  • 00:17:45
    anonymized ow preview traffic we show
  • 00:17:48
    that 03 mini low drastically reduce the
  • 00:17:50
    latency of O mini right almost like
  • 00:17:54
    achieving comparable latency with uh gbt
  • 00:17:57
    40 under second so probably is like
  • 00:18:00
    instant response and also Mei medium is
  • 00:18:03
    like half the latency of all
  • 00:18:06
    one um and here's another set of evals
  • 00:18:09
    as I'm even more excited to to show you
  • 00:18:11
    guys is um uh API features right we get
  • 00:18:14
    a lot of requests from our developer
  • 00:18:16
    communities to support like function
  • 00:18:18
    calling structured outputs developer
  • 00:18:20
    messages U all Miner models and here um
  • 00:18:24
    all3 mini will support all these
  • 00:18:27
    features same as o1
  • 00:18:29
    um and notably it achieves like
  • 00:18:32
    comparable better performance than for
  • 00:18:34
    all on most of the ow providing a more
  • 00:18:37
    cost effective solution to our
  • 00:18:39
    developers cool um and if we actually
  • 00:18:43
    enveil the true gpq Diamond performance
  • 00:18:47
    that I run a couple days ago uh it
  • 00:18:49
    actually also me L is actually 62% right
  • 00:18:52
    basically ask model to evalate itself
  • 00:18:54
    yeah right next time you should totally
  • 00:18:55
    just ask model to automatically do the
  • 00:18:57
    evaluation instead of ask
  • 00:19:00
    um yeah so with that um that's it for
  • 00:19:03
    alter Mei and I hope our user can have a
  • 00:19:05
    much better user experience in already
  • 00:19:07
    next year fantastic work thank really
  • 00:19:09
    great work on thank you cool so I know
  • 00:19:13
    you're excited to get this in your own
  • 00:19:15
    hands um and we're very working very
  • 00:19:17
    hard to postra this model to do some uh
  • 00:19:19
    safety interventions on top of the model
  • 00:19:21
    and we're doing a lot of internal safety
  • 00:19:22
    testing right now but something new
  • 00:19:25
    we're doing this time is we're also
  • 00:19:26
    opening up this model to external safety
  • 00:19:29
    testing starting today with O3 mini and
  • 00:19:31
    also eventually with 03 so how do you
  • 00:19:34
    get Early Access as a safety researcher
  • 00:19:36
    or as security researcher you can go to
  • 00:19:38
    our website and you can see a form like
  • 00:19:40
    this one that you see on the screen and
  • 00:19:42
    applications for this form are rolling
  • 00:19:44
    they'll close on January 10th and we
  • 00:19:47
    really invite you to apply uh we're
  • 00:19:48
    excited to see what kind of things that
  • 00:19:50
    you can explore with this and what kind
  • 00:19:52
    of um jailbreaks and other things you
  • 00:19:55
    discover cool great so one other thing
  • 00:19:58
    that I'm excited to talk about is a a
  • 00:20:00
    new report that we published I think
  • 00:20:02
    yesterday or today um that advances our
  • 00:20:05
    safety program and this is a new
  • 00:20:07
    technique called deliberative alignment
  • 00:20:09
    typically when we do safety training on
  • 00:20:12
    top of our models we're trying to learn
  • 00:20:13
    this decision boundary of what's safe
  • 00:20:15
    and what's unsafe right and usually it's
  • 00:20:19
    uh just through showing examples pure
  • 00:20:21
    examples of this is a safe prompt this
  • 00:20:22
    is an unsafe prompt but we can now
  • 00:20:25
    leverage the reasoning capabilities that
  • 00:20:27
    we have from our models to find a more
  • 00:20:29
    accurate safety boundary here and this
  • 00:20:32
    technique called deliberative alignment
  • 00:20:34
    allows us to take a safety spec allows
  • 00:20:36
    the model to reason over a prompt and
  • 00:20:39
    also just tell you know is this a safe
  • 00:20:41
    prompt or not often times within the
  • 00:20:43
    reasoning it'll just uncover that hey
  • 00:20:45
    you know this user is trying to trick me
  • 00:20:47
    or they're expressing this kind of
  • 00:20:49
    intent that's hidden so even if you kind
  • 00:20:51
    of try to Cipher your your prompts often
  • 00:20:53
    times the reasoning will break that and
  • 00:20:56
    the primary result you see is in this
  • 00:20:58
    figure that that's shown over here we
  • 00:21:00
    have um our performance on a rejection
  • 00:21:02
    Benchmark on the x-axis and on over
  • 00:21:04
    refusals on the Y AIS and here uh to the
  • 00:21:07
    right is better so this is our ability
  • 00:21:09
    to accurately tell when we should reject
  • 00:21:11
    something also our ability to tell when
  • 00:21:13
    we should revie something and typically
  • 00:21:15
    you think of these two metrics as having
  • 00:21:17
    some sort of trade-off it's really hard
  • 00:21:18
    to do while on the it is really hard to
  • 00:21:20
    do yeah um but it seems with
  • 00:21:22
    deliberative alignment that we can get
  • 00:21:24
    these two green points on the top right
  • 00:21:26
    whereas the previous models the red and
  • 00:21:28
    Blue Points um signify the performance
  • 00:21:30
    of our previous models so we're really
  • 00:21:33
    starting to leverage safety to get sorry
  • 00:21:35
    leverage reasoning to get better safety
  • 00:21:37
    yeah I think this is a really great
  • 00:21:38
    result of safety yeah fantastic okay so
  • 00:21:42
    to sum this up 03 mini and 03 apply
  • 00:21:45
    please if you'd like for safety testing
  • 00:21:47
    to help us uh test these models as an
  • 00:21:50
    additional step we plan to launch 03
  • 00:21:52
    mini around the end of January and full3
  • 00:21:54
    shortly after that but uh that will you
  • 00:21:56
    know the more people can help us safety
  • 00:21:58
    test the more we can uh make sure we hit
  • 00:22:00
    that so please check it out uh and
  • 00:22:03
    thanks for following along with us with
  • 00:22:05
    this it's been a lot of fun for us we
  • 00:22:06
    hope you've enjoyed it too Merry
  • 00:22:07
    Christmas Merry Christmas Merry
  • 00:22:10
    [Applause]
  • 00:22:11
    Christmas okay also it's
  • 00:22:27
    c fore
Tags
  • AI
  • OpenAI
  • O3
  • O3 Mini
  • Safety Testing
  • Deliberative Alignment
  • Artificial Intelligence
  • Coding Benchmarks
  • Arc AGI
  • General Intelligence