Chinese Researchers Just Cracked OpenAI's AGI Secrets

00:15:53
https://www.youtube.com/watch?v=LyKRUwLNPO8

Summary

TLDR: This content discusses OpenAI's latest AI model, the o1 series, highlighting its advancement and the secrecy surrounding it. The o1 model represents a significant step toward Artificial General Intelligence (AGI). A recent research paper from China claims to demystify how the o1 model works, potentially leveling the AI development playing field by providing a roadmap for building similar systems. The video covers the basics of how such AI functions, focusing on reinforcement learning, in which the system learns from rewards; o1 uses this learning method to solve complex problems. The four pillars essential to o1's operation are policy initialization, reward design, search, and learning. Policy initialization trains the AI on a vast amount of data to develop basic reasoning. The discussion then turns to search methods, such as tree search and sequential revisions, and how they enhance the model's reasoning, and to how reinforcement learning corrects errors through trial and error. The iterative cycle of search and learning may lead to superhuman problem-solving abilities, nudging the field closer to superintelligence.
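
As a rough illustration of the reward-driven learning described above, here is a minimal Python sketch of the trial-and-error loop. The toy "tricks", the reward function, and the additive preference update are all invented for illustration; they are not OpenAI's actual method, just the video's dog-and-treat analogy in code.

```python
import random

# Toy reinforcement learning: the "agent" must discover which trick earns a treat.
tricks = ["sit", "roll_over", "fetch"]
preferences = {t: 1.0 for t in tricks}   # the agent's current "policy"

def reward(trick: str) -> float:
    # The environment only rewards one behavior (the digital "treat").
    return 1.0 if trick == "fetch" else 0.0

for episode in range(200):
    # Trial: pick a trick, weighted by how well each has worked so far.
    total = sum(preferences.values())
    weights = [preferences[t] / total for t in tricks]
    trick = random.choices(tricks, weights=weights)[0]
    # Feedback: rewarded tricks become more likely next time.
    preferences[trick] += reward(trick)

print(preferences)   # "fetch" ends up with by far the largest preference
```

In the video's terms, a language model plays the role of the agent and solving reasoning problems plays the role of the trick; the rest of the summary describes how that reward signal is designed and used.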

Takeaways

  • 🤖 OpenAI's o1 model is highly advanced and kept highly secret.
  • 🤔 It uses reinforcement learning to solve complex problems.
  • 📚 Policy initialization sets up the initial reasoning capabilities.
  • 🎯 Reward design is crucial for precise AI learning.
  • 🔍 Search is how the AI explores different possibilities.
  • 🔄 Learning from search results improves AI over time.
  • 🌲 Tree search explores potential problem-solving paths.
  • ✍️ Sequential revisions refine AI's solutions step-by-step.
  • 🧠 Superintelligence might be within reach with continuous improvements.
  • 🇨🇳 A Chinese paper claims to decode o1's workings.
  • 📝 Policy initialization involves massive text data training.
  • 💡 Behavior cloning mimics successful solutions.

Timeline

  • 00:00:00 - 00:05:00

    OpenAI's new o1 series is considered a major step toward achieving AGI (Artificial General Intelligence), and its inner workings are kept highly secret. A Chinese research paper proposes a roadmap for replicating the o1 model, suggesting that reinforcement learning is central to its success. The o1 series develops its reasoning abilities through four key processes: policy initialization, reward design, search, and learning.

  • 00:05:00 - 00:10:00

    Policy initialization and reward design are pivotal for the AI's foundation. Policy initialization involves pre-training and fine-tuning, essentially equipping the AI with language and reasoning skills before it tackles complex problems. Reward design involves two types of modeling: outcome reward modeling evaluates a solution based only on its final result, while process reward modeling assesses each step for correctness, providing more granular feedback for iterative learning and improvement (a short sketch contrasting the two reward styles follows this timeline).

  • 00:10:00 - 00:15:53

    In the search and learning phases, the AI refines its problem-solving skills. Search lets the AI explore multiple candidate solutions, with internal and external guidance steering the process. Reinforcement learning, coupled with methods like policy gradient and behavior cloning, lets the AI iterate on its strategies and improve over time (a sketch of one search-and-learn iteration follows this timeline). This continuous loop of practice and refinement could push the AI toward superhuman performance, suggesting that artificial superintelligence may not be far off.
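
To make the outcome-versus-process distinction from the 00:05:00 entry concrete, here is a small hypothetical Python sketch. The hard-coded arithmetic checker and the target answer stand in for whatever learned reward model a real system would use.

```python
# A candidate solution: a chain of reasoning steps plus a final answer.
solution = {
    "steps": ["2 + 3 = 5", "5 * 4 = 20", "20 - 1 = 18"],   # the last step slips
    "final_answer": 18,
}
TARGET = 19   # the correct final answer for this toy problem

def step_is_correct(step: str) -> bool:
    # Toy per-step verifier: check the arithmetic written on each line.
    expression, claimed = step.split("=")
    return eval(expression) == float(claimed)

def outcome_reward(sol) -> float:
    # Outcome reward modeling (ORM): one score, based only on the final result.
    return 1.0 if sol["final_answer"] == TARGET else 0.0

def process_rewards(sol) -> list[float]:
    # Process reward modeling (PRM): one score per step, so good steps still count.
    return [1.0 if step_is_correct(s) else 0.0 for s in sol["steps"]]

print(outcome_reward(solution))    # 0.0  -> the whole attempt is marked wrong
print(process_rewards(solution))   # [1.0, 1.0, 0.0]  -> only the faulty step is penalized
```

The step-level scores are the "more granular feedback" the timeline mentions: training can target exactly the step that went wrong instead of discarding the entire solution.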
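
Here is one way the search and learning phases from the 00:10:00 entry could be wired together, strictly as a sketch with assumed names (`generate_candidate`, `reward_model`, and `best_of_n` are all hypothetical): best-of-N sampling stands in for search, and copying the highest-scoring attempts into a fine-tuning set stands in for behavior cloning.

```python
import random

def generate_candidate(problem: str) -> str:
    # Stand-in for the policy model proposing one solution attempt.
    return f"attempt-{random.randint(0, 9)} for {problem}"

def reward_model(candidate: str) -> float:
    # Stand-in for external guidance: score a candidate between 0 and 1.
    return random.random()

def best_of_n(problem: str, n: int = 8) -> tuple[str, float]:
    # Search: propose n candidates and keep the one the reward model scores highest.
    candidates = [generate_candidate(problem) for _ in range(n)]
    scored = [(c, reward_model(c)) for c in candidates]
    return max(scored, key=lambda pair: pair[1])

# Learning via behavior cloning: keep the best attempts as new training data.
fine_tuning_set = []
for problem in ["problem-1", "problem-2", "problem-3"]:
    best, score = best_of_n(problem)
    if score > 0.5:                 # only clone solutions the reward model trusts
        fine_tuning_set.append((problem, best))

print(fine_tuning_set)   # a real system would now fine-tune the policy on these pairs
```

Repeating this cycle, where the improved policy produces better candidates for the next round of search, is the continuous loop the timeline describes.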

Video Q&A

  • Why is OpenAI's o1 model shrouded in secrecy?

    The o1 model is shrouded in secrecy due to its advanced capabilities and its implications for achieving AGI, leading OpenAI to restrict certain inquiries in order to protect its technology.

  • What recent development challenges OpenAI's AI model dominance?

    A research paper from China outlines a roadmap to reproduce OpenAI's o1 model, potentially leveling the AI development playing field.

  • What is reinforcement learning in the context of AI?

    Reinforcement learning involves a system receiving rewards for completed tasks, learning through trial and error to improve over time.

  • How does policy initialization contribute to AI development?

    Policy initialization involves pre-training AI on massive datasets for basic reasoning skills, setting a foundation before tackling complex problems.

  • What role does reward design play in AI learning?

    Reward design determines how solutions are evaluated, either by the final outcome alone or step by step, which shapes the feedback that drives the AI's learning.

  • How does search improve AI performance?

    Search lets the AI explore multiple candidate solutions and refine its approach to a problem, which is crucial for complex reasoning tasks.

  • What techniques are used in AI search processes?

    Techniques include tree search for exploring potential paths and sequential revisions for refining solutions, guided by internal and external feedback (see the revision sketch after this Q&A).

  • How does reinforcement learning contribute to AI improvement?

    Reinforcement learning uses experiences from search outcomes to adjust the AI's strategies, employing methods like policy gradient and behavior cloning (a minimal policy-gradient sketch follows this Q&A).

  • What is the potential impact of achieving superintelligence?

    Achieving superintelligence could revolutionize problem solving, allowing AI to surpass human abilities in certain domains.

  • How do iterative search and learning cycles benefit AI?

    Iteratively combining search and learning allows AI to continuously refine its abilities, leading to potential superhuman performance on complex tasks.
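
As a hedged illustration of the sequential-revision idea from the Q&A above, the sketch below runs a draft, self-check, revise loop. The arithmetic task (refining a square-root estimate) is only a stand-in for the textual revisions the video describes, and every function name here is invented.

```python
def initial_attempt(problem: float) -> float:
    # First draft: a crude guess at the square root of `problem`.
    return problem / 2.0

def self_evaluate(problem: float, draft: float) -> float:
    # Internal guidance: how far off is the current draft? (draft^2 should equal problem)
    return draft * draft - problem

def revise(problem: float, draft: float) -> float:
    # Sequential revision: use the measured error to produce a better draft.
    return draft - (draft * draft - problem) / (2.0 * draft)

problem = 2.0
draft = initial_attempt(problem)
for _ in range(10):                      # revise step by step, not all at once
    error = self_evaluate(problem, draft)
    if abs(error) < 1e-9:                # confident enough: stop revising
        break
    draft = revise(problem, draft)

print(draft)   # ~1.41421356, an answer refined through repeated self-checks
```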
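
As a minimal sketch of the policy-gradient idea also referenced above, the code below applies a bare REINFORCE-style update to a softmax policy with a made-up reward; it is not PPO and not a description of o1's actual training.

```python
import math
import random

actions = ["guess_a", "guess_b", "guess_c"]
logits = {a: 0.0 for a in actions}        # the policy's adjustable parameters

def policy() -> dict[str, float]:
    # Softmax over the logits gives the probability of choosing each action.
    exps = {a: math.exp(v) for a, v in logits.items()}
    total = sum(exps.values())
    return {a: v / total for a, v in exps.items()}

def reward(action: str) -> float:
    return 1.0 if action == "guess_c" else 0.0   # made-up reward signal

LEARNING_RATE = 0.5
for _ in range(300):
    probs = policy()
    action = random.choices(actions, weights=[probs[a] for a in actions])[0]
    r = reward(action)
    # REINFORCE: the gradient of log prob(action) with respect to each logit is
    # (1 if that logit belongs to the chosen action, else 0) minus its probability.
    for a in actions:
        grad = (1.0 if a == action else 0.0) - probs[a]
        logits[a] += LEARNING_RATE * r * grad

print(policy())   # probability mass has shifted toward the rewarded action, guess_c
```

Actions that lead to high rewards are made more likely, exactly as the answer above describes; PPO adds constraints so that each update stays small and stable.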

Transcript

  • 00:00:00
    so OpenAI is the leading AI company and
  • 00:00:03
    of course their recent iteration of
  • 00:00:05
    models the o1 series is by far the most
  • 00:00:08
    advanced AI that we currently have
  • 00:00:10
    access to now incredibly this AI model
  • 00:00:13
    has been shrouded with secrecy to the
  • 00:00:15
    point that if you ever dare to ask the
  • 00:00:17
    model what it was thinking about during
  • 00:00:19
    the process it was giving you a response
  • 00:00:22
    the model gives you a response where it
  • 00:00:24
    tells you to never ask a question like
  • 00:00:25
    that again and if you do it too many
  • 00:00:27
    times you can actually get banned from
  • 00:00:30
    using OpenAI's service and now the reason
  • 00:00:32
    that this is shrouded in so much secrecy
  • 00:00:34
    is because this is a big step towards
  • 00:00:36
    AGI and many are thinking that OpenAI
  • 00:00:39
    are quite likely to be the first company
  • 00:00:41
    to achieve it now with that being said
  • 00:00:43
    many have wanted to know exactly how
  • 00:00:45
    this system works and there have been
  • 00:00:46
    many different ways OpenAI have of
  • 00:00:48
    course published a few different
  • 00:00:50
    Publications but nothing to the point
  • 00:00:52
    where we truly understand what's going
  • 00:00:54
    on beneath the hood however there has
  • 00:00:56
    been a recent research paper from a
  • 00:00:58
    group of researchers in China and we are
  • 00:01:01
    now asking ourselves if they just
  • 00:01:03
    managed to crack the code did they
  • 00:01:05
    figure out how o1 works and release a
  • 00:01:08
    road map to build something similar so
  • 00:01:11
    this is the paper scaling of search and
  • 00:01:13
    learning a road map to reproduce o1 from
  • 00:01:15
    reinforcement learning perspective and
  • 00:01:17
    this is the paper that could change
  • 00:01:19
    everything because if this is true then
  • 00:01:21
    it means the playing field is leveled
  • 00:01:23
    and it means it's only a matter of time
  • 00:01:25
    before many other companies start to
  • 00:01:27
    produce their AI models that are going
  • 00:01:29
    to be on par with OpenAI now I'm
  • 00:01:31
    actually going to break this down into
  • 00:01:32
    four parts but let's actually first
  • 00:01:34
    understand the basics of how this AI
  • 00:01:36
    thing even works so one of the first
  • 00:01:38
    things that we do have is we have of
  • 00:01:39
    course reinforcement learning with AI so
  • 00:01:42
    essentially we can use a game analogy so
  • 00:01:45
    imagine you're trying to teach a dog a
  • 00:01:47
    trick so you would give this dog a treat
  • 00:01:49
    which is the reward when it does
  • 00:01:51
    something right and it then learns to
  • 00:01:54
    repeat those actions to get more treats
  • 00:01:56
    and that is basically reinforcement
  • 00:01:57
    learning now with AI the dog is
  • 00:02:00
    essentially a program and the treat is a
  • 00:02:03
    digital reward and the trick could be
  • 00:02:05
    anything from winning a game to writing
  • 00:02:07
    code now why is reinforcement learning
  • 00:02:10
    important for the o1 series and this is
  • 00:02:12
    because OpenAI seems to believe that
  • 00:02:14
    reinforcement learning is the key to
  • 00:02:16
    making o1 so smart it's basically
  • 00:02:19
    how o1 learns to reason and solve
  • 00:02:21
    complex problems through trial and error
  • 00:02:24
    now there are four pillars of this
  • 00:02:26
    according to the paper you can see right
  • 00:02:28
    here they give us an overview of how o1
  • 00:02:31
    essentially Works we've got the policy
  • 00:02:34
    initialization this is the starting
  • 00:02:36
    point of the model this sets up the
  • 00:02:38
    model's initial reasoning abilities
  • 00:02:39
    using pre-training or fine-tuning and
  • 00:02:41
    this is basically the foundation of the
  • 00:02:43
    model we've got reward design which is
  • 00:02:45
    of course how the model is rewarded
  • 00:02:46
    which we just spoke about I'm going to
  • 00:02:48
    speak about that in more detail and then
  • 00:02:49
    of course we've got search which is
  • 00:02:51
    where during the inference time where
  • 00:02:53
    the model is quote unquote thinking this
  • 00:02:55
    is how the model searches through
  • 00:02:57
    different possibilities and of course we
  • 00:02:58
    have learning and this is where you
  • 00:03:00
    improve the model by analyzing the data
  • 00:03:03
    generated during the search process and
  • 00:03:05
    then you use different techniques such
  • 00:03:06
    as reinforcement learning to make the
  • 00:03:08
    model better over time and essentially
  • 00:03:10
    the central idea is reinforcement
  • 00:03:12
    learning okay and the core mechanism
  • 00:03:14
    ties these components together the model
  • 00:03:17
    which is the policy interacts with its
  • 00:03:18
    environment data flows from search
  • 00:03:21
    results into the learning process and
  • 00:03:22
    the improved policy is fed back into the
  • 00:03:24
    search creating a continuous Improvement
  • 00:03:26
    Loop and the diagram basically
  • 00:03:28
    emphasizes the cyclic nature of the
  • 00:03:30
    process search generates data for
  • 00:03:32
    learning learning updates the policy and
  • 00:03:33
    yada y yada so if we want to actually
  • 00:03:36
    understand how this works we have to
  • 00:03:37
    actually understand the policy so this
  • 00:03:39
    is the basics this is the foundation of
  • 00:03:41
    the model so imagine you're basically
  • 00:03:43
    teaching someone to play a complex game
  • 00:03:45
    like chess you wouldn't throw them into
  • 00:03:47
    a match against a Grandmaster on their
  • 00:03:49
    first day right you'd start by teaching
  • 00:03:51
    them the basics how the pieces move
  • 00:03:53
    basic strategies and maybe some common
  • 00:03:55
    opening moves that's essentially what
  • 00:03:57
    policy initialization is for AI now in
  • 00:04:00
    the context of a powerful AI like 01
  • 00:04:03
    policy initialization is essentially
  • 00:04:05
    giving the AI just the very strong
  • 00:04:07
    foundation and reasoning before it even
  • 00:04:09
    starts trying to solve really hard
  • 00:04:11
    problems it's about equipping it with a
  • 00:04:13
    basic set of skills and knowledge that
  • 00:04:15
    it can then build upon through
  • 00:04:16
    reinforcement learning the paper
  • 00:04:18
    suggests that for o1 this head start
  • 00:04:20
    likely comes in two main phases number
  • 00:04:23
    one the pre-training which we can see
  • 00:04:25
    here which is you know where you train
  • 00:04:26
    it on massive text Data think of this
  • 00:04:29
    like letting the AI read the
  • 00:04:31
    entirety of the internet or at least a
  • 00:04:34
    huge chunk of it and by doing this the
  • 00:04:36
    AI learns how language Works how words
  • 00:04:39
    relate to each other and gains a vast
  • 00:04:41
    amount of general knowledge about the
  • 00:04:43
    world think of it like learning grammar
  • 00:04:45
    vocabulary the basic facts before trying
  • 00:04:47
    to write a novel and it will also learn
  • 00:04:49
    basic reasoning abilities by training on
  • 00:04:52
    this data and then this is where we get
  • 00:04:53
    to the important bit which is where we
  • 00:04:55
    get the fine-tuning with instructions
  • 00:04:57
    and humanlike reasoning and this is
  • 00:04:59
    where we actually give the AI more
  • 00:05:01
    specific lessons on how to reason and
  • 00:05:03
    solve problems and this involves two key
  • 00:05:05
    techniques which we can see right here
  • 00:05:07
    prompt engineering and supervised
  • 00:05:09
    fine-tuning so prompt engineering is
  • 00:05:11
    where essentially you know you give the
  • 00:05:13
    AI carefully crafted instructions or
  • 00:05:16
    examples to guide Its Behavior and the
  • 00:05:18
    paper mentions behaviors like problem
  • 00:05:19
    analysis which is where you restate the
  • 00:05:22
    problem to make sure it's understood
  • 00:05:23
    task decomposition like breaking down a
  • 00:05:26
    complex problem into smaller easier
  • 00:05:28
    steps which is where you literally say
  • 00:05:29
    you know first think step by step and of
  • 00:05:31
    course with supervised finetuning which
  • 00:05:33
    is right here sft this involves training
  • 00:05:36
    the AI on examples of human solving
  • 00:05:38
    problems like basically showing it the
  • 00:05:40
    right way to think and reason it could
  • 00:05:42
    involve showing it examples of experts
  • 00:05:45
    explaining their thought process step by
  • 00:05:46
    step so in a nutshell policy
  • 00:05:48
    initialization is about giving AI a
  • 00:05:50
    solid foundation and language knowledge
  • 00:05:52
    and basic reasoning skills
  • 00:05:55
    setting it up for success in the later
  • 00:05:56
    stages of learning and problem solving
  • 00:05:58
    and this phase of o1 is essentially
  • 00:06:00
    crucial for developing human-like
  • 00:06:02
    reasoning behaviors in AI enabling them
  • 00:06:04
    to think systematically and explore
  • 00:06:06
    solution spaces efficiently next we get
  • 00:06:08
    to something super interesting this is
  • 00:06:11
    where we get to reward design so this
  • 00:06:13
    image that you can see on the screen
  • 00:06:15
    illustrates two types of reward systems
  • 00:06:18
    used in reinforcement learning outcome
  • 00:06:21
    reward modeling which is ORM over here
  • 00:06:23
    and then we've got process reward
  • 00:06:24
    modeling which is PRM now as for the
  • 00:06:27
    explanation it's actually pretty
  • 00:06:28
    straightforward so outcome reward
  • 00:06:31
    modeling is something that only
  • 00:06:33
    evaluates the solution based on the
  • 00:06:35
    final result so if the final answer is
  • 00:06:38
    incorrect the entire solution is marked
  • 00:06:40
    as wrong even if these steps right here
  • 00:06:43
    or even if most steps are correct and in
  • 00:06:45
    this example there are some steps that
  • 00:06:47
    are actually correct but due to the fact
  • 00:06:49
    that the final output is incorrect the
  • 00:06:51
    entire thing is just marked as wrong but
  • 00:06:54
    this is where we actually use process
  • 00:06:56
    reward modeling which is much better so
  • 00:06:58
    with process reward modeling this evaluates
  • 00:07:01
    each step in the solution individually
  • 00:07:03
    this is where we reward the correct
  • 00:07:05
    steps and we penalize the incorrect ones
  • 00:07:08
    and this one actually provides more
  • 00:07:10
    granular feedback which helps guide
  • 00:07:12
    improvements during training so we can
  • 00:07:14
    see that steps one two and three are
  • 00:07:16
    correct and then they receive the
  • 00:07:17
    rewards and steps four and five are
  • 00:07:19
    incorrect and are thus flagged as
  • 00:07:21
    errors and this approach is far better
  • 00:07:24
    because it pinpoints the exact errors in
  • 00:07:26
    the process rather than discarding the
  • 00:07:28
    entire solution and this diagram
  • 00:07:30
    basically emphasizes the importance of
  • 00:07:32
    process rewards in tasks that involve
  • 00:07:34
    multi-step reasoning as it allows for
  • 00:07:36
    iterative improvements and Better
  • 00:07:38
    Learning outcomes which is essentially
  • 00:07:40
    what they believe o1 is using now this
  • 00:07:43
    is where we get into the really
  • 00:07:45
    interesting thing because this is where
  • 00:07:47
    we get to search and many have heralded
  • 00:07:49
    search as the thing that could take us
  • 00:07:51
    to Super intelligence in fact I did
  • 00:07:53
    recently see a tweet that just stated
  • 00:07:55
    that I'm sure I'll manage to add that on
  • 00:07:57
    screen so when we decide to break this
  • 00:07:59
    down this is essentially where we have
  • 00:08:01
    the AI thinking so you know when you
  • 00:08:03
    have a powerful AI like o1 it needs time
  • 00:08:06
    to think to explore different
  • 00:08:08
    possibilities and find the best solution
  • 00:08:10
    this thinking process is what the paper
  • 00:08:12
    refers to as search so thinking more is
  • 00:08:15
    where they say that you know one way you
  • 00:08:17
    could improve the performance is by
  • 00:08:20
    thinking more during the inference which
  • 00:08:22
    means that instead of just generating
  • 00:08:23
    one answer it explores multiple possible
  • 00:08:26
    solutions before picking the best one so
  • 00:08:28
    you know let's say you think about
  • 00:08:30
    writing an essay you don't just write
  • 00:08:31
    the first draft and submit it right you
  • 00:08:33
    brainstorm ideas you write multiple
  • 00:08:35
    drafts you revise and edit until you're
  • 00:08:37
    happy with the final product and that is
  • 00:08:39
    essentially a form of search too so there
  • 00:08:43
    are two main strategies that are in the
  • 00:08:46
    search area and the paper highlights
  • 00:08:47
    these strategies that o1 might be using
  • 00:08:50
    for this thinking process so coming in
  • 00:08:52
    at number one we have the tree search so
  • 00:08:55
    imagine a branching tree where a branch
  • 00:08:58
    represents a different choice or you
  • 00:09:00
    know action that the AI could
  • 00:09:02
    potentially take tree search is like you
  • 00:09:04
    know exploring the tree following
  • 00:09:06
    different paths to see where they lead
  • 00:09:08
    for example in a game of chess an AI
  • 00:09:10
    might consider all the possible moves
  • 00:09:11
    that it could make then all the possible
  • 00:09:13
    responses its opponent could make and
  • 00:09:15
    then build on this tree of possibilities
  • 00:09:18
    and then it uses a certain kind of
  • 00:09:20
    criteria to decide which branch to
  • 00:09:22
    explore further and which to prune
  • 00:09:25
    focusing on the most promising path
  • 00:09:27
    basically thinking about where you're
  • 00:09:28
    going to go what decisions you're going
  • 00:09:30
    to make and which one yields the best
  • 00:09:32
    rewards it's kind of like a gardener
  • 00:09:34
    selectively trimming branches to help a
  • 00:09:36
    tree grow in the right direction a
  • 00:09:38
    simple example of this is best-of-N
  • 00:09:39
    sampling where the model generates N
  • 00:09:41
    possible solutions and then picks the
  • 00:09:43
    best one based on some kind of criteria
  • 00:09:45
    now on the bottom right here this is
  • 00:09:47
    where we have sequential revisions this
  • 00:09:50
    is like writing that essay we talked
  • 00:09:52
    about earlier and the AI starts with an
  • 00:09:54
    initial attempt at a solution then
  • 00:09:56
    refines it step by step along the way
  • 00:09:59
    making improvements for example an AI
  • 00:10:03
    might generate an initial answer to a
  • 00:10:06
    math problem and then it might check its
  • 00:10:08
    work then identify the errors and then
  • 00:10:12
    revise its solution accordingly it's
  • 00:10:14
    kind of like editing your essay catching
  • 00:10:16
    the mistakes and then making it better
  • 00:10:18
    with every time you review it so you
  • 00:10:20
    have to also think about you know how
  • 00:10:22
    does the AI decide which paths to
  • 00:10:25
    explore in the tree search or how to
  • 00:10:27
    even you know revise the solution in
  • 00:10:30
    sequential revision so the paper
  • 00:10:32
    mentions two types of guidance so we
  • 00:10:34
    have internal guidance and this is where
  • 00:10:37
    you've got the AI using its own internal
  • 00:10:39
    knowledge and calculations to guide its
  • 00:10:42
    search and one example is of course you
  • 00:10:44
    know model uncertainty and this is where
  • 00:10:47
    the model can actually estimate how
  • 00:10:49
    confident it is in certain parts of its
  • 00:10:52
    solution it might focus on areas where
  • 00:10:54
    it's less certain exploring Alternatives
  • 00:10:57
    or making revisions it's kind of like
  • 00:10:58
    double checking your work when you're
  • 00:11:00
    not really sure if you've made a mistake
  • 00:11:03
    another example of this is of course you
  • 00:11:04
    know self-evaluation this is where you
  • 00:11:07
    know the AI can be trained to assess its
  • 00:11:09
    own work identifying potential errors or
  • 00:11:12
    areas for improvement it's kind of like
  • 00:11:13
    having an internal editor that reviews
  • 00:11:16
    your writing and suggest changes then
  • 00:11:18
    we've got external guidance and this is
  • 00:11:21
    like getting feedback from the outside
  • 00:11:22
    world to guide the search so one example
  • 00:11:25
    is environmental feedback which is where
  • 00:11:27
    in some cases AI can interact with a
  • 00:11:30
    real or simulated environment and get
  • 00:11:32
    feedback on its actions for example a
  • 00:11:35
    robot learning to navigate a maze might
  • 00:11:38
    get feedback on whether it's moving
  • 00:11:39
    closer to or farther from the goal and
  • 00:11:42
    another example of this is using a
  • 00:11:44
    reward model which we discussed earlier
  • 00:11:46
    the reward model can provide feedback on
  • 00:11:48
    the quality of different solutions or
  • 00:11:51
    actions guiding the AI towards better
  • 00:11:53
    outcomes it's kind of like having a
  • 00:11:56
    teacher who grades your work and tells
  • 00:11:58
    you what you did well and tells you
  • 00:11:59
    where you need to improve in essence the
  • 00:12:01
    search element and the process by which
  • 00:12:04
    o1 explores different possibilities and
  • 00:12:06
    refines its solution is Guided by both
  • 00:12:08
    its internal knowledge and its external
  • 00:12:10
    feedback and this is a crucial part of
  • 00:12:12
    what makes o1 so good at complex
  • 00:12:15
    reasoning tasks so of course search is
  • 00:12:18
    how the AI thinks about a problem but
  • 00:12:20
    how does it actually get better at
  • 00:12:21
    solving problems over time this is where
  • 00:12:24
    learning comes in so the paper suggests
  • 00:12:26
    that o1 uses a powerful technique called
  • 00:12:29
    reinforcement learning to improve its
  • 00:12:31
    performance so search generates the
  • 00:12:34
    training data so remember how we talked
  • 00:12:35
    about search generating multiple
  • 00:12:37
    possible solutions well those Solutions
  • 00:12:39
    along with the feedback from internal or
  • 00:12:42
    external guidance
  • 00:12:44
    become valuable training data for the AI
  • 00:12:47
    think of it like a student practicing
  • 00:12:49
    for an exam they might try and solve
  • 00:12:51
    many different practice problems getting
  • 00:12:53
    feedback on their answers and learning
  • 00:12:55
    from their mistakes each attempt whether
  • 00:12:57
    successful or not provides valuable
  • 00:12:58
    information
  • 00:12:59
    that actually helps them learn and
  • 00:13:01
    improve now we've got two main learning
  • 00:13:04
    methods and the paper focuses on two
  • 00:13:06
    main methods that o1 might be using
  • 00:13:08
    to learn from during this search
  • 00:13:10
    generated data number one is policy
  • 00:13:12
    gradient methods like PPO and these
  • 00:13:15
    methods are a little bit more complex
  • 00:13:17
    but the basic idea is that the AI
  • 00:13:18
    adjusts its internal policy which is the
  • 00:13:21
    strategy for choosing its actions based
  • 00:13:23
    on the reward that it achieves and
  • 00:13:25
    actions that lead to high rewards are
  • 00:13:27
    made more likely while actions that lead
  • 00:13:28
    to low rewards are made less likely it's
  • 00:13:31
    kind of like fine-tuning the ai's
  • 00:13:32
    decision-making process based on its own
  • 00:13:34
    experiences then we've got PPO which is
  • 00:13:37
    essentially proximal policy optimization
  • 00:13:39
    which is a popular policy gradient
  • 00:13:42
    method that is known for its stability
  • 00:13:43
    and efficiency it's like having a
  • 00:13:45
    careful and methodical way of updating
  • 00:13:47
    the AI strategy making sure it doesn't
  • 00:13:50
    change too drastically in its response
  • 00:13:51
    to any single experience then of course
  • 00:13:54
    here we have Behavior cloning and this
  • 00:13:56
    is a simpler method where the AI learns
  • 00:13:59
    to mimic successful Solutions it's like
  • 00:14:01
    learning via imitation if the search
  • 00:14:03
    process finds a really good solution one
  • 00:14:06
    that gets a high reward the AI can learn
  • 00:14:08
    to copy that solution in similar
  • 00:14:10
    situations it's like a student learning
  • 00:14:12
    to solve a math problem by studying a
  • 00:14:15
    worked example and the paper suggests
  • 00:14:17
    that o1 might use behavior cloning to
  • 00:14:20
    learn from the very best Solutions found
  • 00:14:22
    during search effectively adding them to
  • 00:14:24
    its repertoire of successful strategies
  • 00:14:27
    or it could be used as an initial way to
  • 00:14:30
    warm up the model before using more
  • 00:14:32
    complex methods like PPO now of course
  • 00:14:34
    we've got iterative search and learning
  • 00:14:36
    and the real power of this approach
  • 00:14:38
    comes from combining search and of
  • 00:14:40
    course learning in an iterative Loop so
  • 00:14:42
    the AI searches for Solutions learns
  • 00:14:45
    from the results then use its improved
  • 00:14:47
    knowledge to conduct even better
  • 00:14:49
    searches in the future it's like a
  • 00:14:51
    continuous cycle of practice feedback
  • 00:14:53
    and Improvement and the paper suggests
  • 00:14:55
    that this iterative progress is key to
  • 00:14:59
    o1's ability to achieve superhuman
  • 00:15:01
    performance on certain tasks by
  • 00:15:03
    continuously searching and learning the
  • 00:15:05
    AI can surpass the limitations of its
  • 00:15:07
    initial training data potentially
  • 00:15:09
    discover new and better solutions that
  • 00:15:12
    humans haven't thought of so with all
  • 00:15:14
    that being said about how o1 works and
  • 00:15:17
    now that you know the basics the four
  • 00:15:18
    key pillars do you guys think we are
  • 00:15:20
    close to Super intelligence after
  • 00:15:22
    reading this research paper and
  • 00:15:24
    understanding the key granular details
  • 00:15:25
    about how o1 works I think I really do
  • 00:15:28
    understand why the wider AI Community is
  • 00:15:30
    saying that super intelligence isn't
  • 00:15:31
    that far away if an AI can search for
  • 00:15:34
    Solutions then learn from those results
  • 00:15:36
    and use that improved knowledge to
  • 00:15:38
    conduct even better searches in the
  • 00:15:39
    future having a continuous cycle of
  • 00:15:41
    practice feedback and Improvement
  • 00:15:43
    achieving superhuman performance would
  • 00:15:45
    be possible in theory so maybe
  • 00:15:47
    artificial super intelligence isn't that
  • 00:15:49
    far away with that being said I'd love
  • 00:15:50
    to know your thoughts and hopefully you
  • 00:15:52
    guys have a
Tags
  • OpenAI
  • AI models
  • AGI
  • Reinforcement Learning
  • o1 series
  • Search
  • Learning
  • Superintelligence
  • Policy Initialization
  • Reward Design