OpenAI Just Revealed They ACHIEVED AGI (OpenAI o3 Explained)

00:12:05
https://www.youtube.com/watch?v=CRuhyF3oj0c

Overview

TL;DR: The video presents OpenAI's release of the o3 model as a potential milestone on the road to artificial general intelligence (AGI). The model has surpassed human performance on the ARC benchmark, a test designed to assess machine intelligence without relying on memorization; because ARC requires only core knowledge, this is a significant achievement in AI research. Despite the result, o3 still fails on some simple tasks, indicating it is not true AGI yet. The video also discusses the model's variants: low-compute settings optimize for speed and cost, while high-compute settings handle complex tasks at substantial expense. The advances are likened to historical improvements in technological efficiency and cost. Continued AI progress hints at remarkable capabilities by 2025, although AGI remains a debated and evolving concept.

Key Takeaways

  • 🤖 OpenAI's o3 model marks a potential step towards AGI.
  • 📊 The ARC benchmark resists memorization, testing true intelligence.
  • 🧠 o3 surpasses human performance on ARC, but isn't fully AGI.
  • 💡 Low vs. high compute settings trade off performance against cost.
  • 💸 High compute costs limit the high-tuned model's practical use.
  • 🔬 AI achievements draw parallels to historical gains in tech efficiency.
  • 🔍 Continued AI improvements are expected, sparking debate on AGI.
  • 🗨 Sam Altman views "AGI" as an evolving term and expects major AI strides by 2025.
  • 📉 o3 still faces fundamental challenges, with some easy tasks unsolved.
  • 📈 The video suggests the future of AI holds potential beyond current milestones.

Timeline

  • 00:00:00 - 00:05:00

    Today marks a historic moment in AI as OpenAI released a new model, o3, part of its o1 series of models that can think for a long time. The model surpasses human performance on the ARC benchmark, a difficult test designed to evaluate a machine's ability to adapt without relying on memorization. ARC requires only core knowledge, such as elementary physics and basic skills any young child has, and presents novel problems that AI has historically struggled with. o3 scored 75.7% on the ARC-AGI semi-private holdout set, the highest score yet on this difficult benchmark, previously considered a gold standard.

  • 00:05:00 - 00:12:05

    Despite impressive gains on the ARC benchmark, o3 still lacks some capabilities of human intelligence, missing some easy tasks. This shows it is not full AGI but a significant step towards it. The model is costly to run, with per-task compute costs potentially prohibitive, though efficiency gains similar to past technological advances are expected. Although there is no o2 (the name was skipped because of a naming conflict), o3 is only the second iteration of this model line, foreshadowing further advances. The model also achieves high scores on various other benchmarks, indicating impressive growth. AI's progress isn't slowing; the expectation is that models will define new milestones in cognitive tasks by next year.


Video Q&A

  • What marks a historic moment for the AI community in the video?

    The release of OpenAI's o3 model, which potentially signifies the arrival of AGI by surpassing human performance on the ARC benchmark.

  • Why is the ARC benchmark important?

    The ARC benchmark is resistant to memorization and is considered a test of machine intelligence, relying on core knowledge rather than vast knowledge banks.

  • What challenges does the o3 model face despite surpassing human performance?

    The o3 model still struggles with some easy tasks, indicating differences from human intelligence.

  • What are the differences between low-tuned and high-tuned versions of the o3 model?

    The low-tuned version is optimized for speed and simpler tasks, while the high-tuned version focuses on complex tasks and deeper reasoning with more compute resources.

  • What is the stance of the ARC-AGI benchmark creators towards the o3 model?

    They recognize o3's performance as a breakthrough but emphasize that challenges remain, as some easy tasks are still unsolved.

  • How does the cost of using the high-tuned o3 model affect its practicality?

    At around $11,000 per task, the high-tuned model is currently impractical for widespread use.

  • What are the expectations for AI's progress according to the video?

    AI is expected to achieve more efficient and cost-effective solutions, continuing to improve even beyond current capabilities.

  • What is Sam Altman's view on AGI according to the video?

    Sam Altman sees AGI as a shifting term and believes by 2025, AI will achieve remarkable cognitive tasks that surpass human abilities.
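
The ARC tasks discussed above are small grid-transformation puzzles: a few input/output example pairs demonstrate a rule, and the solver must infer the rule and apply it to a new input, with the answer counting only if it matches exactly. As a rough illustration of that task structure in Python (the grids, the rule, and the `fill_empty` helper are invented for this sketch, not actual ARC data or tooling):

```python
# Minimal sketch of an ARC-style task: a few input/output grid pairs
# demonstrate a transformation, and the solver must infer the rule.
# The grids and the rule here are illustrative, not actual ARC data.

# Grids are lists of rows; integers are colors, 0 = empty.
task = {
    "train": [
        {"input": [[1, 1], [1, 0]], "output": [[1, 1], [1, 1]]},
        {"input": [[0, 1], [1, 1]], "output": [[1, 1], [1, 1]]},
    ],
    "test": {"input": [[1, 0], [1, 1]]},
}

def fill_empty(grid, color=1):
    """Candidate rule: replace every empty (0) cell with `color`."""
    return [[color if cell == 0 else cell for cell in row] for row in grid]

# A candidate rule is only accepted if it reproduces every training
# output exactly; then it is applied to the held-out test input.
assert all(fill_empty(pair["input"]) == pair["output"] for pair in task["train"])

prediction = fill_empty(task["test"]["input"])
print(prediction)  # [[1, 1], [1, 1]]
```

Real ARC grids go up to 30×30 with ten colors, and each task demonstrates a distinct rule, which is what makes memorization-based approaches ineffective.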

Transcript (en)
  • 00:00:00

    So today actually marks a very historic moment for the AI community, as it will probably be regarded as the day when AGI actually happened. If you don't know why, it's because OpenAI today released, or announced I guess you could say, their new o3 model, the second iteration of their o1 series, the model that thinks for a long time. If you don't understand why this is potentially AGI, it's because the new system managed to surpass human performance on the ARC benchmark, and the reason ARC is such an important benchmark is that it is resistant to memorization. ARC is intended as a kind of IQ test for machine intelligence, and what makes it different from most benchmarks out there is that it's designed to be resistant to memorization. If you look at the way LLMs work, they're basically this big interpolative memory, and the way you scale up the capabilities is by trying to cram as much knowledge and as many patterns as possible into them. By contrast, ARC does not require a lot of knowledge at all; it's designed to only require what's known as core knowledge, which is basic knowledge about things like elementary physics, objectness and counting, the sort of knowledge that any four-year-old or five-year-old possesses. What's interesting is that each puzzle in ARC is novel, something you've probably not encountered before even if you've memorized the entire internet.
  • 00:01:34

    If you want to know what ARC actually looks like, the kind of test that humans can easily pass but these AI systems currently can't, you can take a look at the examples right here. In this video example, a task is all about having input examples and output examples; the goal is to understand the rule of the transformation and apply it to produce the output. "So Sam, what do you think is happening here?" "Probably putting a dark blue square in the empty space." "Yes, that is exactly it." It's easy for humans to intuitively guess the rule, but it's surprisingly hard for AI to understand what's going on. What's interesting, though, is that AI has not been able to solve this kind of problem thus far, even though a panel of humans was verified to be able to do it. The unique part about ARC-AGI is that every task requires distinct skills: there won't be another task where you need to fill in the corners with blue squares. That is done on purpose, because the benchmark is meant to test the model's ability to learn new skills on the fly, not just to repeat what it's already memorized; that's the whole point. ARC-AGI version 1 took five years to go from 0% to 5% with leading frontier models. However, today I'm very excited to say that o3 has scored a new state-of-the-art score that we have verified: on low compute, o3 scored 75.7% on the ARC-AGI semi-private holdout set. This is extremely impressive because it is within the compute requirements we have for our public leaderboard, and it is the new number-one entry on ARC-AGI-Pub, so congratulations on that. Those of you outside the AI community might not think this is a big deal, but it really is, because this is something we've been trying to solve for around five years now. This is a benchmark that many would have heralded as the gold standard for AI, and it marks the first time we've actually managed to get a system that can outperform humans at a task that traditionally AI systems would particularly fail at. Now, what was
  • 00:03:43
    interesting was that they had two
  • 00:03:45
    versions so we had o03 with low tuning
  • 00:03:48
    and we had 03 with high tuning so the 03
  • 00:03:51
    with low tuning is the low reasoning
  • 00:03:53
    effort and this is the model that
  • 00:03:55
    operates with minimum computational
  • 00:03:56
    effort which is optimized for Speed and
  • 00:03:58
    cost efficiency and this one is suitable
  • 00:04:00
    for simpler task where deep reasoning is
  • 00:04:02
    not required so you know basic coding
  • 00:04:04
    straightforward tasks and then of course
  • 00:04:06
    you have the high tuned one which is
  • 00:04:08
    where the model takes more time and
  • 00:04:10
    resources to analyze and solve problems
  • 00:04:11
    and this is optimized for performance on
  • 00:04:13
    complex task requiring deeper reasoning
  • 00:04:15
    or multi-step problem solving and what
  • 00:04:17
    we can see here is that when we actually
  • 00:04:19
    you know tune the model and make it
  • 00:04:21
    think for longer we can see that we
  • 00:04:22
    actually manage to surpass where humans
  • 00:04:24
    currently are now what's crazy about
  • 00:04:25
    this is that the people that have
  • 00:04:27
    created The Arc AGI Benchmark said that
  • 00:04:29
    the performance on Arc AGI highlights a
  • 00:04:31
    genuine breakthrough in novelty
  • 00:04:33
    adaptation this is not incremental
  • 00:04:35
    progress we are in New Territory so they
  • 00:04:38
    start to ask is this AGI 03 still fails
  • 00:04:41
    on some very easy tasks indicating
  • 00:04:43
    fundamental differences with human
  • 00:04:44
    intelligence and the guy that you saw at
  • 00:04:46
    the beginning Francis cholet actually
  • 00:04:48
    spoke about how he doesn't believe that
  • 00:04:49
    this is exactly AGI but it does
  • 00:04:51
    represent a big milestone on the way to
  • 00:04:53
    WS AGI he says there's still a fair
  • 00:04:55
    number of easy Arc AGI one talks to 03
  • 00:04:58
    can't solve and we have early
  • 00:04:59
    indications that Ark AI 2 will remain
  • 00:05:02
    extremely challenging 403 so he's
  • 00:05:04
    stating that it shows that it's feasible
  • 00:05:05
    to create unsaturated interesting bench
  • 00:05:07
    marks that are easy for humans yet
  • 00:05:09
    impossible for AI without involving
  • 00:05:11
    specialist knowledge and he States now
  • 00:05:13
    which is you know some people could
  • 00:05:14
    argue that he's moved the goal that we
  • 00:05:16
    will have AGI when creating such evals
  • 00:05:19
    become outright impossible but this is a
  • 00:05:21
    little bit contrast to what he said
  • 00:05:23
    earlier this year take a look at what he
  • 00:05:24
    said 6 months ago about the Benchmark
  • 00:05:27
    surpassing 80% which it did today turn
  • 00:05:29
    question around to you so suppose that
  • 00:05:31
    it's the case that in a year a
  • 00:05:33
    multimodal model can solve Arc let's say
  • 00:05:37
    get 80% whatever the average human would
  • 00:05:39
    get then AGI quite possibly yes I think
  • 00:05:43
    if you if you start so honestly what I
  • 00:05:45
    would like to see is uh an llm type
  • 00:05:48
    model solving Arc at like 80% but after
  • 00:05:52
    having only been trained on core
  • 00:05:54
    knowledge related stuff now one of the
  • 00:05:56

    limitations of the model is actually the compute cost. You can see right here that he asks: does this mean the ARC Prize competition is beaten? He says no; the ARC Prize competition targets the fully private dataset, which is a different and somewhat harder evaluation, and the ARC Prize is of course the one where your solutions must run with a fixed amount of compute, about 10 cents per task. The reason that is really interesting is that, if you take a look down the bottom here, you can see that for o3 high-tuned, the amount of compute being put into the model means it costs around $11,000 per task, which is pretty expensive if you're trying to use that AI for anything at all. I think these models that search over many different solutions are going to be really expensive, and we can see that reflected here, with this one being over $1,000 per task, which is ridiculously expensive when you think about using it to perform any kind of task. Of course, as we've seen with AI, costs will eventually come down, so the fact that they've managed to beat the benchmark at all is the key thing. We can draw similarities with how technology was pretty bulky in the early days, like the big TVs and the really bulky phones, but over time you find ways to become more and more efficient, and eventually you do things faster and, of course, cheaper; it's quite likely that this will happen in AI too. What I do find crazy is that two years ago he said the ARC-AGI benchmark being fully solved was not going to happen within the next eight years, giving it "70%: hopefully less than 8 years, perhaps four or five"; you can see AI managing to speed past most people's predictions. We also got the fact that it did very well on SWE-bench, which is of course a very hard software-engineering benchmark. If you are a software engineer this is probably not the best news for you, though with a bunch more people now coding, there's probably a lot more demand for software engineers who actually understand the code that's being written. But this is rather interesting, because one thing I realized while making this video is that this is effectively the second iteration, not the third. Just to go on a quick tangent: some people might think o3 is OpenAI's third iteration, but the name o2 is simply being skipped because of a conflict with O2, a British mobile service provider. The fact that this is only the second iteration of the model goes to show that o3, or even o4, is potentially going to be quite a large jump and might reach benchmark saturation. Now, if we also look at
  • 00:08:26

    the math benchmarks and the PhD-level science benchmarks, we can see there is a decent improvement there as well. I do think, though, that this is sort of reaching the benchmark-saturation area, because this one on competition math is around 96.7% and this one is 87.7%. Like I said before, one of the things most people are starting to say is that AI has slowed down because it's no longer improving at the same rates as before. What we have to understand is that as these benchmarks get to 95% or so, the incremental gains are going to be harder and harder to reach, because, number one, you only have 10% left to get, and, number two, it's quite likely that 3 to 5% of the questions are contaminated, meaning there are errors in those questions anyway, which means that 100% is simply not possible on certain benchmarks. That is why they decided to create the FrontierMath benchmark. At the time the benchmark was released, which I think was around two or three months ago, the current models could do only about 2% on these questions; in the lead were Gemini 1.5 Pro and Claude 3.5 Sonnet, and o1-preview and o1-mini were there too. For those of you who haven't realized just how good o3 is: this is a model that gets 25%, which is really incredible when it comes to research-level math. You have to understand that this kind of math is super difficult and all of those questions are completely novel, so the previous benchmarks we recently had aren't truly showing the capabilities. Think about it like this: looking at those benchmarks, you might say the model has gotten maybe 10% better overall, which is just not true, because if we look at the really hard benchmarks, where it's solving unseen math problems, there is a 20-times improvement over the current state of the art, which is absolutely incredible. I think this is probably the most important image for people to take in, because you can't compare this kind of model, even though it shows a massive increase on these kinds of benchmarks, to one posting that insane level of scores. What's important to understand is that Noam Brown, who works on reasoning at OpenAI, said that o3 is going to continue on that trajectory, so for those of you thinking maybe AI is slowing down, it is clear that that is not the case. Interestingly enough, we also got Sam Altman in an interview talking about what he believes AGI to be, and like I said before, the definition is constantly shifting and changing. It used to be a term that people used a lot, and it was this really smart AI that was very far off in the future; as we get closer to it, I think it's become a less useful term, and people use it to mean very different things. Some people use it to mean something that's not that different from o1, you know,
  • 00:10:59

    and some people use it to mean true superintelligence, something smarter than all of humanity put together. We try now to use these different levels; we have a five-level framework, and we're on level two, with reasoning, rather than the binary of "is it AGI or is it not", which I think became too coarse as we get closer. But I will say, by the end of next year, the end of 2025, I expect we will have systems that can do truly astonishing cognitive tasks, where you'll use one and think: that thing is smarter than me at a lot of hard problems. The only other thing here is that if you are a safety researcher, he says, please consider applying to help test o3-mini and o3, because he's excited to get these out for general availability soon, and he is extremely proud of the work OpenAI has been doing in creating these amazing models. So it will be interesting to see, when these models are actually released, what people are going to do with them, what they're going to build, and of course what happens next. If you enjoyed this video, let me know your thoughts on whether this is AGI or not; I do think that benchmark has been broken, and we're constantly finding new ones, so it will be interesting to see where things head next.
Tags
  • AGI
  • ARC benchmark
  • AI Community
  • OpenAI
  • machine intelligence
  • core knowledge
  • AI progress
  • cost efficiency