Reinforcement Fine-Tuning—12 Days of OpenAI: Day 2

00:20:35
https://www.youtube.com/watch?v=yCIYS9fx56U

Summary

TL;DR: OpenAI is previewing reinforcement fine-tuning, a new customization technique for its o1 series, with a public launch planned for next year. The technique lets users adapt o1 models to their own datasets, producing tailored reasoning capabilities for specialized fields such as legal, finance, and healthcare. A collaboration with Thomson Reuters on legal AI was highlighted, and a scientific demo showed how fine-tuned models can improve rare-disease gene analysis. Researchers can apply for the reinforcement fine-tuning research program, part of OpenAI's push to strengthen AI applications in real-world scenarios.

Key takeaways

  • 🚀 OpenAI previews new model customization capabilities for the o1 series.
  • 🧠 Reinforcement fine-tuning allows for advanced domain-specific reasoning.
  • 📅 Public launch of customization features planned for next year.
  • 🤝 Collaborations include Thomson Reuters for legal AI.
  • 🔬 Models can aid in genetic research for rare diseases.
  • 📊 Opportunity for researchers to apply for fine-tuning program.
  • 💼 Fields benefiting include legal, finance, and engineering.
  • 💡 Emphasis on practical applications enhancing real-world impacts.
  • 📉 Fine-tuning seen as a significant advancement over standard methods.
  • 🎄 The stream closed with a humorous Christmas-themed AI joke.

Timeline

  • 00:00:00 - 00:05:00

    Mark from OpenAI introduces the o1 model series, which is trained to think before responding, and previews reinforcement fine-tuning, a new customization method that lets users from academia and enterprise train the model on their own datasets with reinforcement learning, yielding expert-level capabilities.

  • 00:05:00 - 00:10:00

    John, Julie, and Justin explain the advantages of Reinforcement Fine-Tuning (RFT) for the o1 series. It enables domain-specific learning from as little as a few dozen examples, teaching models to reason rather than merely mimic, and improving performance in fields like legal, finance, and engineering. Collaborations with organizations such as Thomson Reuters aim to build deep legal expertise into AI.

  • 00:10:00 - 00:15:00

    Justin Reese from Berkeley Lab discusses using the o1 model with reinforcement fine-tuning to understand rare genetic diseases, combining medical domain knowledge with the model's reasoning capabilities. A curated dataset is used to improve the model's ability to predict causative genes from lists of symptoms, showing how enhanced reasoning aids biomedical research.

  • 00:15:00 - 00:20:35

    Customizing models with reinforcement fine-tuning involves creating datasets and graders to evaluate performance (a sketch of one training example follows this timeline). OpenAI simplifies the process by handling training on its own infrastructure, and the demo shows clear gains on the medical task. Fine-tuning improves the model's ability to generalize and reason, potentially transforming workflows in healthcare and other sectors.
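
To make the dataset format concrete: the demo walks through JSONL training files in which each line holds a case report, prompt instructions, and a held-out correct answer. The sketch below (in Python) shows what one such line might look like; the field names and wording are illustrative assumptions, not OpenAI's documented schema.

    # A sketch of one reinforcement fine-tuning training example, modeled on
    # the dataset described in the demo. Field names are illustrative
    # assumptions, not OpenAI's documented schema.
    import json

    example = {
        # Description of the patient and the symptoms that were observed.
        "case_report": "51-year-old woman; disease onset not specified; "
                       "symptoms include hyperthyroidism and others.",
        # Symptoms explicitly ruled out, which help the model exclude genes.
        "absent_symptoms": ["..."],
        # The prompt: ask for a ranked gene list plus an explanation.
        "instructions": "Given the case report, list all genes that might be "
                        "responsible for the disease, most likely first, and "
                        "explain your reasoning.",
        # Hidden from the model during training; used only by the grader to
        # score the model's ranked list.
        "correct_answer": "TSC2",
    }

    # A JSONL training file is simply one such example per line.
    with open("train.jsonl", "w", encoding="utf-8") as f:
        f.write(json.dumps(example) + "\n")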

Video Q&A

  • What new capabilities does the o1 model series provide?

    The o1 model series can now be customized and fine-tuned using reinforcement learning, allowing users to create expert models for specific tasks.

  • Who can access the preview for the new model customization program?

    The preview is aimed at universities, researchers, and enterprises, with further access details to be provided.

  • How does reinforcement fine-tuning differ from standard fine-tuning?

    Standard (supervised) fine-tuning teaches the model to mimic features of its example inputs, which suits changing tone, style, or format. Reinforcement fine-tuning instead grades the model's final answers and uses reinforcement learning to reinforce the lines of thinking that led to correct answers, so the model learns to reason in custom domains (a grader sketch follows this Q&A).

  • What fields can benefit from the new reinforcement fine-tuning process?

    Fields requiring deep expertise, like legal, finance, engineering, and insurance, can benefit from this new process.

  • What collaboration was mentioned during the video?

    OpenAI partnered with Thomson Reuters, using reinforcement fine-tuning to build a legal assistant for its CoCounsel AI.

  • What example of scientific application was highlighted?

    A project involving rare genetic diseases where OpenAI models help identify gene mutations responsible for specific conditions.

  • What is the reinforcement fine-tuning research program?

    It is a program allowing organizations working on complex tasks to gain early access to reinforcement fine-tuning capabilities.

  • When is the public launch of the reinforcement fine-tuning features expected?

    The public launch is planned for early next year.

  • What are the anticipated benefits of the reinforcement fine-tuning in bioinformatics?

    It can help improve understanding and treatment of rare diseases by reasoning over biomedical data.

  • What was a humorous moment mentioned in the video?

    A Christmas-themed joke about Santa's self-driving sleigh hitting trees because he hadn't 'pine-tuned' his models.
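
On grading: the demo describes graders that compare the model's ranked gene list against the known causative gene and return a score between 0 and 1, with partial credit that decays the further down the list the correct gene appears. Below is a minimal sketch of such a rank-based grader in Python; the exact decay schedule is an assumption (in the demo, second place scored about 0.7, so OpenAI's actual schedule decays more gently than this one).

    def grade_ranked_list(predicted_genes: list[str], correct_gene: str) -> float:
        """Score a ranked gene list against the known causative gene.

        Returns 1.0 if the correct gene is ranked first, partial credit that
        decays toward 0.0 as it appears lower in the list, and 0.0 if it is
        absent. The 1/(1 + rank) decay is an illustrative assumption.
        """
        if correct_gene not in predicted_genes:
            return 0.0
        rank = predicted_genes.index(correct_gene)  # 0 means top of the list
        return 1.0 / (1.0 + rank)

    print(grade_ranked_list(["TSC2", "TSC1"], "TSC2"))         # first  -> 1.0
    print(grade_ranked_list(["TSC1", "TSC2", "NF1"], "TSC2"))  # second -> 0.5
    print(grade_ranked_list(["NF1"], "TSC2"))                  # absent -> 0.0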

Subtitles
  • 00:00:00
    hi everyone my name is Mark and I lead
  • 00:00:02
    research at OpenAI yesterday we took o1
  • 00:00:06
    out of preview and we launched it in
  • 00:00:08
    ChatGPT we're soon going to launch it in
  • 00:00:10
    the API if you haven't been following o1
  • 00:00:14
    it's our latest series of model
  • 00:00:15
    improvements that allow the models to
  • 00:00:17
    think for a while before they come back
  • 00:00:19
    with a response today we're really
  • 00:00:22
    excited to preview our latest
  • 00:00:24
    advancement in our model customization
  • 00:00:26
    program it'll let users fine-tune o1 on
  • 00:00:30
    their own data sets and again this isn't
  • 00:00:33
    standard fine tuning this is
  • 00:00:35
    reinforcement fine tuning which really
  • 00:00:38
    leverages the reinforcement learning
  • 00:00:40
    algorithms that took us from advanced
  • 00:00:42
    high school level to expert PhD level
  • 00:00:45
    for your own use cases I want to stress
  • 00:00:48
    again that this is a preview of
  • 00:00:49
    something that we're going to launch
  • 00:00:51
    publicly next year but if you are a
  • 00:00:54
    university or you're a researcher or
  • 00:00:56
    you're an enterprise we'll give you some
  • 00:00:58
    information on how you can access our
  • 00:01:00
    alpha program later so why would you
  • 00:01:03
    want this thing well it allows you to
  • 00:01:05
    take your golden data sets and turn them
  • 00:01:08
    into unique
  • 00:01:09
    offerings that will give you the same
  • 00:01:11
    magic that we have for your own users
  • 00:01:14
    and your own customers so I'll let John
  • 00:01:16
    Julie and Justin say a little bit more
  • 00:01:19
    yeah hello everyone yeah my name is John
  • 00:01:20
    Allard and I'm an engineer here at OpenAI
  • 00:01:22
    hi everyone I'm Julie W I'm a researcher
  • 00:01:24
    here at OpenAI I'm Justin Reese I'm a
  • 00:01:27
    computational biologist at Berkeley Lab
  • 00:01:29
    today we're so excited to be
  • 00:01:31
    introducing this new way of model
  • 00:01:33
    customization for our o1 series of
  • 00:01:35
    models uh reinforcement fine tuning or
  • 00:01:37
    RFT for short for the first time
  • 00:01:40
    developers researchers and machine
  • 00:01:42
    learning engineers will be able to use
  • 00:01:44
    reinforcement learning to create expert
  • 00:01:46
    models capable of excelling at their
  • 00:01:49
    specific tasks within their domain we
  • 00:01:51
    believe that any field which requires
  • 00:01:53
    deep expertise in their AI models stands
  • 00:01:56
    to benefit so if you work in say legal
  • 00:01:59
    finance engineering insurance uh this
  • 00:02:02
    one's for you for example we recently
  • 00:02:04
    partnered with Thomson Reuters to
  • 00:02:07
    use reinforcement fine-tuning to
  • 00:02:09
    fine-tune o1 mini to be a legal
  • 00:02:11
    assistant in their CoCounsel AI um
  • 00:02:15
    this tool assists their legal
  • 00:02:16
    professionals in accomplishing some of
  • 00:02:18
    their most analytical
  • 00:02:21
    workflows yeah so some of you will be
  • 00:02:23
    familiar with the supervised fine tuning
  • 00:02:25
    API that we launched um early last year
  • 00:02:27
    and supervised fine tuning is really
  • 00:02:29
    powerful what you're trying to do is get
  • 00:02:30
    the model to replicate features that it
  • 00:02:32
    finds in uh input text or images and
  • 00:02:35
    this is great if you want to change the
  • 00:02:37
    tone or the style or the response format
  • 00:02:39
    of the model now with reinforcement
  • 00:02:41
    fine-tuning or reindeer enforcement fine
  • 00:02:43
    tuning I should
  • 00:02:44
    say with reinforcement fine tuning
  • 00:02:47
    um it's actually it's different so
  • 00:02:49
    you're not just teaching the model to
  • 00:02:50
    mimic its um its inputs what you're
  • 00:02:52
    teaching it to do is to learn to reason
  • 00:02:53
    in entirely new ways over custom domains
  • 00:02:56
    and the way this works is that when
  • 00:02:58
    the model sees a problem we give it
  • 00:03:00
    space to think through the problem and
  • 00:03:02
    then we grade the final answer from the
  • 00:03:04
    model and then using the power of
  • 00:03:06
    reinforcement learning we reinforce
  • 00:03:08
    lines of thinking that led to correct
  • 00:03:09
    answers and we disincentivize lines of
  • 00:03:11
    thinking that led to incorrect answers
  • 00:03:14
    um and what you'll see is that you know
  • 00:03:15
    with as little as a few dozen examples
  • 00:03:17
    the model will learn to reason in new
  • 00:03:20
    and effective ways over custom domains
  • 00:03:23
    that's crazy that you can do that with
  • 00:03:25
    just 12 examples that's not something
  • 00:03:27
    you can do with regular uh fine tuning
  • 00:03:29
    yeah exactly yeah in the in the space of
  • 00:03:30
    large language models and large machine
  • 00:03:32
    learning a few dozen examples is
  • 00:03:33
    basically nothing yeah so for the first
  • 00:03:36
    time our model customization platform
  • 00:03:38
    will support reinforcement learning and
  • 00:03:40
    and notably this is the same technique
  • 00:03:41
    that we use internally at OpenAI to
  • 00:03:43
    train our frontier models like GPT-4o and
  • 00:03:46
    the o1 series one area with many
  • 00:03:49
    exciting applications is scientific
  • 00:03:50
    research but don't just take our word
  • 00:03:52
    for it that's why we're joined today by
  • 00:03:54
    Justin Reese uh Justin is a researcher at
  • 00:03:57
    Berkeley Lab and one of his areas of
  • 00:03:58
    study is using computational methods to
  • 00:04:01
    understand the genetic causes
  • 00:04:04
    underlying rare diseases Justin thank
  • 00:04:06
    you so much for being here do you mind
  • 00:04:08
    telling us a little bit more about your
  • 00:04:09
    research and how reinforcement fine
  • 00:04:11
    tuning might help sure thanks it's great
  • 00:04:13
    to be here so one of the areas of my
  • 00:04:15
    research is rare genetic disease so
  • 00:04:17
    contrary to the name rare genetic
  • 00:04:19
    disease is actually not rare so any one
  • 00:04:21
    rare disease is rare but if you put them
  • 00:04:22
    all together um they're actually quite
  • 00:04:24
    common and so we're talking about 300
  • 00:04:26
    million people globally who who suffer
  • 00:04:28
    from a rare disease and what's more
  • 00:04:30
    these people often have a long
  • 00:04:31
    diagnostic odyssey of months to years
  • 00:04:33
    before they find out about their
  • 00:04:34
    condition wow it's like uh the whole
  • 00:04:36
    population of the US yes it's not a
  • 00:04:38
    small number of people and so what we're
  • 00:04:40
    working on is better uh computational
  • 00:04:42
    tools and methods to really research
  • 00:04:44
    what's important uh and and to help us
  • 00:04:46
    understand and treat these diseases so
  • 00:04:48
    we we do our work in kind of an academic
  • 00:04:50
    setting and learning more about the rare
  • 00:04:52
    disease and their causes and the the
  • 00:04:53
    hope is we'll be able to advance the
  • 00:04:55
    healthcare for these folks uh going down
  • 00:04:57
    the line and now assessing rare disease
  • 00:05:00
    is kind of hard because you kind of have
  • 00:05:01
    to have two things you have to have sort
  • 00:05:03
    of expert domain knowledge about the the
  • 00:05:04
    medical side of things and you also have
  • 00:05:06
    to have uh sort of systematic reasoning
  • 00:05:08
    over the biomedical data and this is an
  • 00:05:10
    area where we think that the o1 model
  • 00:05:12
    can really help us out with its
  • 00:05:14
    reasoning capabilities that makes a lot
  • 00:05:16
    of sense you know our large language
  • 00:05:18
    models have domain knowledge and our o1
  • 00:05:20
    models are really systemic reasoners so
  • 00:05:23
    it seems like now there's a pretty good
  • 00:05:25
    computational method for addressing some
  • 00:05:27
    of these that's right can you tell us a
  • 00:05:29
    little a little bit more about the data
  • 00:05:30
    sets that you're using sure so this was
  • 00:05:32
    sort of a collaborative effort between
  • 00:05:34
    uh our group and Charité Hospital in
  • 00:05:36
    Germany and Peter Robinson's lab and the
  • 00:05:38
    Monarch Initiative um and what we did
  • 00:05:41
    really was to extract disease
  • 00:05:43
    information from hundreds of scientific
  • 00:05:45
    publications that were case reports
  • 00:05:46
    about rare disease and so uh we sort of
  • 00:05:49
    curated the information uh and that's
  • 00:05:51
    lists of signs and symptoms that were
  • 00:05:53
    present in in the patient and that were
  • 00:05:55
    excluded in the patient and then of
  • 00:05:56
    course the the disease that they had and
  • 00:05:59
    importantly for the conversation the
  • 00:06:00
    causative gene that was mutated that was
  • 00:06:02
    causing the problems in these folks I see
  • 00:06:04
    so you and maybe some doctors are trying
  • 00:06:06
    to figure out given a patient's symptoms
  • 00:06:09
    uh what gene might have mutated to cause
  • 00:06:11
    those symptoms yeah that's right and and
  • 00:06:13
    so something we've been working on
  • 00:06:14
    together with the OpenAI team is uh
  • 00:06:17
    training the o1 models
  • 00:06:20
    to reason more effectively about the
  • 00:06:21
    causes of disease incredible thank you
  • 00:06:24
    Justin uh we're now going to give you a
  • 00:06:26
    preview of reinforcement fine-tuning at
  • 00:06:28
    work and not to steal any thunder but
  • 00:06:31
    we're going to take o1 mini and make it
  • 00:06:33
    exceed the performance of o1 on this
  • 00:06:35
    task uh that's the o1 that we just
  • 00:06:37
    launched yesterday and this matters so
  • 00:06:39
    much because o1 mini is a smaller faster
  • 00:06:43
    and cheaper model than o1 yeah so using
  • 00:06:46
    Justin's data set we're going to show
  • 00:06:48
    that you can just drastically improve the
  • 00:06:49
    performance of o1 mini um on this
  • 00:06:51
    task where given a list of symptoms
  • 00:06:54
    you're trying to predict which gene
  • 00:06:55
    might be responsible for the genetic
  • 00:06:57
    disease and so to give an overview of
  • 00:06:59
    this process we're going to start by
  • 00:07:00
    looking at uh data sets that are used to
  • 00:07:02
    train the model and graders that are
  • 00:07:04
    used to evaluate the model and then
  • 00:07:06
    we're going to um launch a training job
  • 00:07:08
    on open AI training infrastructure and
  • 00:07:10
    finally we'll evaluate the resulting
  • 00:07:12
    fine-tune model so we can see how it's
  • 00:07:13
    improved over the base model that we
  • 00:07:15
    started with so to start us off we're
  • 00:07:17
    going to jump over to the OpenAI
  • 00:07:18
    development platform and we're going to
  • 00:07:20
    go ahead and we're going to create a new
  • 00:07:21
    model so um you know we've had
  • 00:07:23
    supervised fine tuning for a bit over a
  • 00:07:25
    year now what we're going to do is we're
  • 00:07:26
    going to select reinforcement fine
  • 00:07:27
    tuning now we're going to be training o1
  • 00:07:30
    so we'll select that as the base model
  • 00:07:32
    and now we need to upload a training
  • 00:07:33
    data set and now training data sets
  • 00:07:35
    they're just JSONL files which is just
  • 00:07:37
    a file where each line in the file is an
  • 00:07:39
    example that you want the model to be
  • 00:07:41
    trained on for this um case Justin and
  • 00:07:44
    his colleagues assembled a data set of
  • 00:07:45
    about 1100 examples um and so I'll go
  • 00:07:48
    ahead and upload that one and just so
  • 00:07:51
    that we get a really good feel for um
  • 00:07:54
    how this data set works and what this
  • 00:07:55
    task is we'll zoom in on an individual
  • 00:07:57
    data point really quickly and so this is
  • 00:08:00
    what an individual data point looks like
  • 00:08:02
    and there's really three important
  • 00:08:03
    things here so the first is the case
  • 00:08:06
    report and this is a description of the
  • 00:08:08
    patient and the patient symptoms so we
  • 00:08:10
    see that the patient was a 51-year-old
  • 00:08:12
    woman the disease onset was not
  • 00:08:14
    specified we have a list of symptoms
  • 00:08:16
    like hyperthyroidism and
  • 00:08:19
    others um as Justin said earlier we have
  • 00:08:21
    the absent symptoms um these are the
  • 00:08:22
    symptoms that are not present and this
  • 00:08:24
    is important because it helps the model
  • 00:08:25
    to rule out genes that it might think
  • 00:08:28
    would otherwise be responsible um for
  • 00:08:29
    the symptoms that are present next we
  • 00:08:32
    have the instructions and I'm sure if
  • 00:08:34
    you're watching this live stream you're
  • 00:08:35
    familiar with prompting and so all we're
  • 00:08:36
    doing here is just prompting the model
  • 00:08:38
    for um what we want it to do for this
  • 00:08:39
    task and so what we're saying is you
  • 00:08:41
    know given the list of symptoms and the
  • 00:08:43
    case report can you list all the genes
  • 00:08:45
    that you think might be responsible for
  • 00:08:47
    the um for the genetic disease that that
  • 00:08:49
    you think is present and then we also
  • 00:08:51
    asked it to provide an explanation for
  • 00:08:53
    why it thinks those genes might be
  • 00:08:55
    responsible um finally we also have the
  • 00:08:58
    correct answer and so this is the gene
  • 00:09:00
    that we happen to know is responsible
  • 00:09:02
    but importantly we're not showing this
  • 00:09:03
    to the model during the training process
  • 00:09:05
    that would be cheating but we're using
  • 00:09:07
    it internally during the training
  • 00:09:09
    process to grade the model's outputs or
  • 00:09:10
    to check if the model is correct this is
  • 00:09:13
    a pretty hard task uh I definitely have
  • 00:09:15
    no hope of answering this question yeah
  • 00:09:17
    I mean you can tell that we've
  • 00:09:19
    come a long way from just trying to
  • 00:09:20
    count the number of R's in the word
  • 00:09:22
    strawberry yeah so
  • 00:09:25
    so now when we give the model this
  • 00:09:28
    prompt this case report and these
  • 00:09:30
    instructions the model is going to
  • 00:09:31
    output something like this which is a
  • 00:09:33
    list of genes that it thinks might be
  • 00:09:35
    responsible and importantly the genes
  • 00:09:37
    are in sorted order where the first Gene
  • 00:09:39
    in the list is the one that it thinks is
  • 00:09:40
    most likely to be responsible the second
  • 00:09:42
    one in the list is the one that it
  • 00:09:43
    thinks is second most likely and so on
  • 00:09:45
    and so forth cool so um we'll hop back
  • 00:09:50
    over and so um next we need to upload
  • 00:09:52
    some validation data and validation data
  • 00:09:55
    um it's going to be in the exact same
  • 00:09:56
    format as the training data but
  • 00:09:58
    importantly there's no no overlap in the
  • 00:10:00
    correct genes between the validation
  • 00:10:02
    data set and the training data set and
  • 00:10:04
    what that means is that the model can't
  • 00:10:05
    cheat um it can't learn to
  • 00:10:07
    just memorize a list of symptoms and
  • 00:10:10
    associate those with the gene it has to
  • 00:10:11
    actually generalize from the training
  • 00:10:13
    data set to the validation data set
  • 00:10:15
    gotcha so I mean where's the
  • 00:10:17
    reinforcement part come in you know we
  • 00:10:19
    talked about grading uh is that part of
  • 00:10:21
    the process here yeah that's a really
  • 00:10:22
    good question so grading is done by this
  • 00:10:24
    concept of graders that we're
  • 00:10:26
    introducing here and so um graders are
  • 00:10:28
    really simple what a grader does is
  • 00:10:30
    it takes the output from the model and
  • 00:10:32
    it takes the correct answer and it
  • 00:10:33
    compares them and it returns a score
  • 00:10:35
    between zero and one and so zero means
  • 00:10:37
    that the model did not get the answer
  • 00:10:38
    correct at all and one means the model
  • 00:10:40
    got the answer correct and you can also
  • 00:10:42
    give partial credit so it can be
  • 00:10:43
    anywhere in that range so for this
  • 00:10:45
    specific task we have a grader that
  • 00:10:47
    looks like this so it takes the correct
  • 00:10:50
    answer um that we happen to know and it
  • 00:10:52
    takes the output from the model which is
  • 00:10:54
    the list of genes and it produces a
  • 00:10:55
    score so in this case you know FOXE3 is
  • 00:10:58
    the correct answer it was second in the
  • 00:11:00
    list of genes and so it gets a score of
  • 00:11:01
    like 0.7 I see so if it had instead said
  • 00:11:05
    FOXE3 was first in the list it would
  • 00:11:07
    have gotten a grade of one yeah exactly
  • 00:11:09
    and then as it gets further and further
  • 00:11:10
    along the list the score kind of
  • 00:11:11
    gradually decays to zero ah nice makes
  • 00:11:14
    sense um but what if I have a task that
  • 00:11:16
    isn't you know grading a ranked list do
  • 00:11:18
    we have other graders that are more
  • 00:11:20
    general yeah yeah so we're supplying
  • 00:11:22
    kind of a collection of graders that we
  • 00:11:23
    think pretty effectively cover the space
  • 00:11:25
    of possible intents that you might have
  • 00:11:27
    um while doing reinforcement fine
  • 00:11:29
    tuning and we're always adding more yeah
  • 00:11:31
    and eventually we're going to hopefully
  • 00:11:33
    let you define your own graders yeah
  • 00:11:35
    yeah maybe like upload a Python file or
  • 00:11:36
    something and do some custom grading
  • 00:11:38
    yeah cool so um we've defined our
  • 00:11:41
    training data set we've defined our
  • 00:11:42
    validation data set let me go ahead and
  • 00:11:44
    copy in the grader really quick
  • 00:11:46
    um and now OpenAI allows you to set um
  • 00:11:49
    you know we allow you to customize these
  • 00:11:51
    fine tuning runs by setting
  • 00:11:52
    hyperparameters but we set some pretty good
  • 00:11:53
    defaults so I'm just going to go ahead
  • 00:11:54
    and click create
  • 00:11:56
    here now what this is doing is is um you
  • 00:12:00
    know we we've just kicked off a training
  • 00:12:02
    job um and so the really cool thing is
  • 00:12:05
    that um you bring the data set and you
  • 00:12:08
    bring the grader and these are the
  • 00:12:09
    places where you really have domain
  • 00:12:11
    expertise and where you can really
  • 00:12:12
    contribute to this problem and then you
  • 00:12:14
    get to leverage the full power of OpenAI
  • 00:12:16
    reinforcement learning algorithms and
  • 00:12:17
    our full distributed model training
  • 00:12:19
    stack to customize a frontier model for
  • 00:12:22
    your use case so as a user I
  • 00:12:25
    just get to bring my data set and
  • 00:12:26
    grader and OpenAI takes care of everything
  • 00:12:28
    else yeah exactly yeah um so you know
  • 00:12:33
    reinforcement fine-tuning jobs can take
  • 00:12:34
    anywhere from a few hours to a few days
  • 00:12:36
    to run so we're going to jump over to a
  • 00:12:37
    job that I ran earlier this week on the
  • 00:12:39
    same data set um just so we can kind of
  • 00:12:40
    see the
  • 00:12:41
    results so I'll jump over
  • 00:12:44
    here um so I have this job that I ran
  • 00:12:46
    earlier this week um it completed
  • 00:12:48
    successfully it produced a fine-tuned
  • 00:12:49
    model for us and there's one thing that
  • 00:12:51
    I want to um look at which is um the
  • 00:12:53
    validation reward score and so what this
  • 00:12:55
    is is the average score from
  • 00:12:57
    the grader on the validation data set
  • 00:12:59
    and how it changed over the course of
  • 00:13:01
    the fine-tuning run and so what we can
  • 00:13:02
    see is that the score is going up and as
  • 00:13:05
    we said earlier since there's no overlap
  • 00:13:06
    in genes between the training data set
  • 00:13:08
    and the validation data set it means
  • 00:13:10
    that the model really learned to
  • 00:13:11
    generalize on our task um it wasn't
  • 00:13:13
    simply memorizing a list of symptoms and
  • 00:13:15
    mapping those to genes so you know while
  • 00:13:17
    this is cool you know the chart goes up
  • 00:13:19
    and to the right which is what we like
  • 00:13:20
    to see um it'd be nice if we could get a
  • 00:13:22
    better feel for how the model has
  • 00:13:24
    actually changed during the fine-tuning
  • 00:13:25
    process and so you know we'll we'll take
  • 00:13:27
    a closer look at that now
  • 00:13:29
    all right so we're going to pop over to
  • 00:13:32
    the evaluations dashboard which is a
  • 00:13:34
    product in our developer platform that
  • 00:13:36
    we launched earlier this year um there's
  • 00:13:38
    a lot of numbers but don't worry we're
  • 00:13:39
    going to go through all of them so I've
  • 00:13:41
    set up three different runs here the
  • 00:13:43
    first one was uh a run against our o1
  • 00:13:46
    model which we released yesterday the
  • 00:13:48
    second was against o1 mini which was the
  • 00:13:50
    starting point of our fine-tuning job
  • 00:13:53
    and then finally uh the reinforcement
  • 00:13:55
    fine-tuned o1 mini now we looked at the
  • 00:13:59
    reward going up and to the right but
  • 00:14:00
    what does that actually mean for this
  • 00:14:02
    task I've set up three different
  • 00:14:03
    evaluations to sort of assess that the
  • 00:14:06
    first one is top@1 which is how
  • 00:14:08
    often is the correct answer the very
  • 00:14:10
    first item in the list top@5 which
  • 00:14:12
    is how often is the correct answer in
  • 00:14:14
    the top five elements of the list and
  • 00:14:16
    finally top@max did we at all put the
  • 00:14:19
    right answer in our list [a top@k sketch follows this transcript] um and so
  • 00:14:22
    looking at top@1 we can see that
  • 00:14:23
    our starting point o1 mini got 17%
  • 00:14:27
    on our data set of about 200
  • 00:14:29
    o1 got 25% so it's doing better uh but
  • 00:14:33
    then our fine-tuned o1 mini got
  • 00:14:36
    31% awesome uh I took a screenshot of
  • 00:14:39
    this and I put it into ChatGPT and
  • 00:14:41
    asked it to make me a plot a Christmas
  • 00:14:42
    themed plot and uh here's a nice
  • 00:14:45
    visualization of those nine numbers that
  • 00:14:47
    we saw earlier so you can see uh our
  • 00:14:49
    starting point o1 mini uh across top@1
  • 00:14:52
    top@5 and top@max our o1 model
  • 00:14:56
    and then finally our best performing
  • 00:14:57
    model which is this o1 mini fine-tune
  • 00:15:00
    here in um dotted red line so looking at
  • 00:15:03
    these results what do you think Justin
  • 00:15:05
    well I I think this is pretty impressive
  • 00:15:07
    performance and and especially the
  • 00:15:08
    increase in the in the validation uh
  • 00:15:11
    data because that implies that the model
  • 00:15:13
    is learning something generally about
  • 00:15:14
    how to reason over these kind of data
  • 00:15:15
    which is pretty exciting um and and so
  • 00:15:18
    an obvious question you might ask is how
  • 00:15:19
    is this doing compared to
  • 00:15:21
    existing bioinformatics tools and I
  • 00:15:23
    don't really have an apples-to-apples
  • 00:15:25
    comparison but um because typically in
  • 00:15:27
    this kind of experiment you would
  • 00:15:28
    provide uh genomic sequencing data and
  • 00:15:30
    we haven't included that here um but the
  • 00:15:33
    sort of open-ended querying of models
  • 00:15:35
    here over incomplete symptom lists is
  • 00:15:37
    new and exciting I think great uh so
  • 00:15:41
    these are aggregate statistics but let's
  • 00:15:43
    look at the actual model responses so
  • 00:15:46
    I'm going to pop on over to this data
  • 00:15:48
    tab let's filter by the passes uh and so
  • 00:15:52
    here is the input that we're giving to
  • 00:15:54
    the model so the problem as John
  • 00:15:56
    described earlier is to identify genes
  • 00:15:58
    that may be responsible for a set of
  • 00:16:00
    observed symptoms uh we asked the model
  • 00:16:02
    to output a dictionary containing both a
  • 00:16:05
    string that explains you know why did I
  • 00:16:08
    pick these genes and of course the genes
  • 00:16:10
    themselves in ranked order and then
  • 00:16:12
    finally we have the symptom list as well
  • 00:16:15
    so this um patient presented with
  • 00:16:19
    subependymal nodules
  • 00:16:21
    seizures uh yeah and a couple other
  • 00:16:23
    things uh we then run our models so this
  • 00:16:26
    was our o1 model and this one our
  • 00:16:29
    fine-tuned o1 mini model um we gave it
  • 00:16:32
    that input and now the output is uh this
  • 00:16:36
    dictionary that we described earlier so
  • 00:16:38
    reasoning the combination of subependymal
  • 00:16:41
    nodules uh seizures cortical tubers are
  • 00:16:44
    indicative of tuberous sclerosis complex which is
  • 00:16:46
    commonly caused by mutations in these
  • 00:16:48
    genes um it lists a couple other
  • 00:16:50
    potential ones and then it says
  • 00:16:53
    TSC2 is the most likely um candidate and
  • 00:16:57
    if we scroll back on over to our answer
  • 00:16:59
    we'll see that TSC2 is in fact the
  • 00:17:01
    correct answer so that allowed us to get
  • 00:17:04
    a pass on top@1 top@5 and
  • 00:17:07
    top@max so looking at this output um
  • 00:17:11
    Justin like is this a useful output for
  • 00:17:13
    the model to be giving back yeah
  • 00:17:15
    absolutely so it's particularly useful
  • 00:17:17
    to see the model's reasoning and that's
  • 00:17:18
    a big contribution here and also
  • 00:17:21
    obviously the ranked list of answers so
  • 00:17:22
    even if the correct answer is
  • 00:17:24
    not first you know you can look
  • 00:17:26
    at all the possibilities um and
  • 00:17:29
    it's also great to see the
  • 00:17:31
    fine-tuning improves the position
  • 00:17:33
    in the ranked list of possible answers so
  • 00:17:35
    that the right answer is getting closer
  • 00:17:36
    to one so that's gratifying Justin kind
  • 00:17:39
    of zooming out a little bit like how
  • 00:17:40
    does reinforcement learning shape your
  • 00:17:42
    field um can you talk about some trends
  • 00:17:44
    sure so I think there's a lot
  • 00:17:47
    of interest in the research community
  • 00:17:49
    in using these models for
  • 00:17:51
    these kinds of tasks and so the
  • 00:17:54
    feeling for this particular use case is
  • 00:17:56
    that the best solution in in the near
  • 00:17:58
    term is probably a hybrid solution
  • 00:18:00
    between existing bioinformatic tools and
  • 00:18:02
    these uh models like o1 uh and so this
  • 00:18:05
    is I think excellent progress in sort
  • 00:18:07
    of characterizing the strengths of these
  • 00:18:08
    models and also how we can use
  • 00:18:11
    uh tools like fine-tuning to improve
  • 00:18:13
    performance um and so like I said
  • 00:18:16
    there's not really a comparable
  • 00:18:16
    benchmark to compare the two but um it
  • 00:18:19
    is definitely progress in
  • 00:18:21
    how we can use these models to kind of
  • 00:18:23
    understand disease and then you know in
  • 00:18:25
    a larger sense to sort of how we can
  • 00:18:27
    incorporate these models into a workflow
  • 00:18:29
    that will eventually improve healthcare
  • 00:18:31
    for these folks right amazing thank you
  • 00:18:33
    Justin so while we've just shown you an
  • 00:18:35
    exciting application of reinforcement
  • 00:18:37
    fine-tuning in scientific research this
  • 00:18:39
    is a general purpose technique we've
  • 00:18:41
    seen promising results in data sets from
  • 00:18:44
    biochem from AI safety from legal um and
  • 00:18:48
    from healthcare as well we can think of
  • 00:18:50
    hundreds more examples or tasks that
  • 00:18:53
    we can use this model on but we know
  • 00:18:55
    that you can probably think of many more
  • 00:18:57
    uh so that's why we're so excited to be
  • 00:18:59
    expanding uh our alpha program today to
  • 00:19:02
    enable more people to push the
  • 00:19:04
    boundaries of the capabilities of our o1
  • 00:19:06
    models on the tasks that matter the most
  • 00:19:08
    to them yeah so you know we've been
  • 00:19:10
    working with a small group of trusted
  • 00:19:12
    partners to really test out
  • 00:19:13
    reinforcement fine tuning and today
  • 00:19:15
    we're expanding alpha access via what
  • 00:19:17
    we're calling the reinforcement
  • 00:19:18
    fine-tuning research program so um you
  • 00:19:21
    know this program is ideal for
  • 00:19:22
    organizations who are currently working
  • 00:19:24
    on very complex tasks with teams of
  • 00:19:26
    experts and who think that they might
  • 00:19:28
    benefit from AI assistance on these tasks
  • 00:19:30
    so you know if you're interested in
  • 00:19:32
    applying for one of these limited spots
  • 00:19:33
    you can find a link to the application
  • 00:19:35
    in the description of this live stream
  • 00:19:36
    and as Mark said earlier you know
  • 00:19:38
    we plan on launching this product
  • 00:19:40
    reinforcement fine tuning publicly early
  • 00:19:42
    next year yeah we're all really truly
  • 00:19:44
    excited to see what you do with
  • 00:19:46
    reinforcement fine tuning and really
  • 00:19:48
    speaking as a researcher there's nothing
  • 00:19:49
    that makes us happier than seeing our
  • 00:19:51
    models being adapted and used to advance
  • 00:19:53
    you know scientific knowledge in the real
  • 00:19:55
    world do you have a joke for us today
  • 00:19:58
    well as it so happens I do uh as it's
  • 00:20:01
    become a tradition I have a Christmas
  • 00:20:03
    themed joke so you know we live in San
  • 00:20:05
    Francisco self-driving vehicles are all
  • 00:20:07
    the rage and actually Santa's been
  • 00:20:09
    trying to get in on this too um he's
  • 00:20:11
    trying to make a self-driving sleigh but
  • 00:20:13
    for some reason his models just keep
  • 00:20:16
    not identifying trees and the sleigh is
  • 00:20:18
    hitting trees left and right uh do you
  • 00:20:21
    guys have any guesses
  • 00:20:23
    why
  • 00:20:25
    no he didn't pine-tune his models
  • 00:20:29
    oh jeez okay all right um please join us
  • 00:20:32
    next week we'll have a lot more to share
  • 00:20:33
    thank you
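
The evaluation shown in the demo reports three metrics over the roughly 200-example validation set: top@1 (the correct gene is first in the model's list), top@5 (it is among the first five), and top@max (it appears anywhere in the list). Here is a minimal Python sketch of those metrics, using made-up predictions rather than the demo's data:

    def top_at_k(predictions, answers, k=None):
        """Fraction of examples whose correct answer appears among the first
        k predicted genes; k=None checks the whole list (i.e. top@max)."""
        hits = 0
        for genes, answer in zip(predictions, answers):
            window = genes if k is None else genes[:k]
            if answer in window:
                hits += 1
        return hits / len(answers)

    # Illustrative usage with two fabricated examples.
    preds = [["TSC2", "TSC1"], ["NF1", "PKD1", "TSC2"]]
    truth = ["TSC2", "TSC2"]
    print(top_at_k(preds, truth, k=1))  # top@1   -> 0.5
    print(top_at_k(preds, truth, k=5))  # top@5   -> 1.0
    print(top_at_k(preds, truth))       # top@max -> 1.0
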
Tags
  • OpenAI
  • o1 Model
  • Reinforcement Learning
  • Fine-Tuning
  • Customization
  • AI Research
  • Machine Learning
  • Healthcare
  • Legal AI
  • Genetic Research