CVPR24 FM4AS | Ted Xiao: What's Missing for Robotics-first Foundation Models?

00:46:07
https://www.youtube.com/watch?v=hPQhGI9w9x0

Summary

TLDR: The talk focuses on the current challenges and opportunities in robotics, especially in relation to foundation models. The speaker emphasizes that there is significant energy and interest in the field, but that important gaps remain before robots become generally useful. Three central missing pieces are presented: positive transfer at scale, steerability, and scalable evaluation. The speaker argues that current methods are not sufficient and that paradigm shifts are needed to exploit the opportunities offered by large datasets and advanced models. The talk also discusses how robots' ability to generalize and adapt to new situations can be improved.

Takeaways

  • 🤖 Robots face challenges with positive transfer.
  • 📊 Scalable evaluation is essential for robot performance.
  • 🔄 Steerability can be improved with human feedback.
  • 🌐 Large datasets can help generalize skills.
  • 🔍 Paradigm shifts are needed in robotics.

Timeline

  • 00:00:00 - 00:05:00

    Today I'm talking about what's missing in robotics with respect to foundation models. There is enormous energy in robotics and AI, and many new researchers are attending conferences like CVPR. I will focus on the most pressing challenges that must be solved to make robotics more accessible, and discuss how we can read the current trends to find the missing pieces.

  • 00:05:00 - 00:10:00

    Robotics today follows a pipeline approach of perception, planning, and control. The last 10 years have been highly interdisciplinary, with advances in language processing and computer vision quickly adopted in robotics. But with the emergence of foundation models, we need to rethink how we apply these technologies in robotics.

  • 00:10:00 - 00:15:00

    One challenge is that current models are not optimized for control and decision making. There are also narrow bottlenecks between the modules, which limit the benefits of scale and transfer. We must find ways to break down these barriers to create a robotics-first foundation model.

  • 00:15:00 - 00:20:00

    There are three things missing in robotics that we see in language models: positive transfer at large scale, steerability, and scalable evaluation. Today, positive transfer is rare in robotics, and we do not have the same scaling laws as in language models. We must find ways to achieve these properties in robotics.

  • 00:20:00 - 00:25:00

    To achieve positive transfer, we must scale the data. We can either make robot data look more like internet data, or pool data from different robot tasks to see whether we can achieve positive transfer. We have seen that treating robot data as part of internet data can yield some positive results.

  • 00:25:00 - 00:30:00

    We have also investigated how to train across data from different robots to improve performance. Surprisingly, we have seen that generalist models trained on data from many different robots can outperform specialist models trained only on specific tasks.

  • 00:30:00 - 00:35:00

    It is important to understand that even though we have access to large amounts of data, that is not always sufficient. We must also consider how to use robot data to improve our models and achieve better performance on specific tasks.

  • 00:35:00 - 00:40:00

    The next challenge is steerability and promptability. Today, a robot's inputs are often constrained, which makes it hard to exploit the capabilities that large language models offer. We must find ways to widen robots' input interfaces so they can learn and adapt better.

  • 00:40:00 - 00:46:07

    Finally, scalable evaluation is a major challenge in robotics. We need methods that evaluate robot performance in a meaningful way, capturing their abilities across different situations. This requires a systematic approach to evaluation that we have not yet achieved.

Video Q&A

  • What are the three central missing pieces in robotics according to the talk?

    The three central missing pieces are positive transfer at scale, steerability, and scalable evaluation.

  • How can robots benefit from large amounts of data?

    By treating robot data as part of a larger data pool, robots can achieve better generalization and transfer of skills.

  • What does 'positive transfer' mean in robotics?

    Positive transfer refers to the ability to apply learning from one task to another, which is rare in current robotics.

  • What is 'steerability' in the context of robots?

    Steerability refers to a robot's ability to adapt its actions based on human feedback or changed conditions.

  • Why is scalable evaluation important for robots?

    Scalable evaluation is important for assessing robot performance across different tasks and environments.

Subtitles (en)

  • 00:00:00
    The title of my talk today is "What's Missing for Robotics-first Foundation Models?" To preface this, I really want to emphasize the absolute amount of energy right now in embodied AI, and in robotics specifically. Even today at CVPR I see so many first-time roboticists for whom this is their first CVPR. I think it's the result of a lot of converging trends across different fields, from language modeling to computer vision to embodied decision making, so it's really such an exciting time to be working on these problems. But the purpose of my talk today is to try to paint a picture of what I view as the bleeding edge of modern trends that are ongoing in robotics and, more importantly, to ask some provocative questions about how understanding this bleeding edge can point us to the missing pieces and open challenges that are on the path between us and really solving general-purpose robotics. The work I discuss today comes from a great group at Google DeepMind, and any controversial opinions are only my own, especially the wrong ones.

  • 00:01:10
    To give a brief glance at today's agenda: I'll first motivate what a robot foundation model is, and since you've heard from so many great speakers today I'll keep this brief. Most importantly, I'd like to focus on a few missing pieces that I hope will paint a path toward the most important problems to solve before robotics becomes generally accessible. Finally, I'll save some time for the horizons of how the field might evolve at a meta level.

  • 00:01:38
    First off, some very brief preliminaries. The modern sense-plan-act paradigm in robotics is a pipeline system: you have perception, understanding the world; you have planning, how you synthesize different types of information, especially semantics, human intent, and so on; and of course control and actuation, moving the robot to influence the world. What this has looked like in the modern era, and I use that term in the loosest sense to mean the last 10 years, is that robotics has been very interdisciplinary, a very full-stack effort, where advances in perception, in language, and in other domains are quickly adopted and utilized by roboticists to drive progress forward. This looks like taking the state-of-the-art, leading-edge model from CVPR and plugging it in as a perception stack, taking some kind of off-the-shelf planning system, or recently a learned system, and then really focusing on the control part as the last mile. This has worked pretty well, especially when we were scaling up CNNs or scaling up initial smaller-scale planners, but in the age of foundation models, some of the lessons we've taken to heart as robotics people over the last 10 years may not continue to hold in the next few years.

  • 00:02:56
    One issue you might think of is that it's great to keep taking these off-the-shelf state-of-the-art models, but of course these are frozen. They're pre-trained; they're not optimized for control; they're not optimized for decision making. The same goes for the language model planners: if you view emergence as a phenomenon that only arises organically at scale, then maybe the types of emergent capabilities you need for low-level control are not the same ones you need for internet-scale text modeling. The second issue, which is related, is that the bottlenecks between these modules are very narrow. They're heuristic-driven, they're expert-defined, and they're perhaps not expressive enough to encompass the benefits of scale and transfer that we all know and love from foundation modeling.

  • 00:03:44
    So how do we tackle these two issues? Maybe we can take some lessons from trends and practices that have worked in foundation modeling. I'm sure all of you know this story: in language modeling and in vision there were all these separate tasks and datasets and specialties, and the insight of scaling internet-scale foundation models is that by treating these problems as interoperable, by breaking down the barriers between the strict definitions of what you call sentiment classification versus what you call translation, you get a lot of benefits and really great scaling trends. In robotics, going back to the modern loop, we today still have very big brick walls between these modules. Is there a way we can break them down? If we can do so, that would be a robotics-first foundation model. I think we have some ideas of how to do this, which I'll cover in today's talk, but even beyond that there are a lot of questions we don't yet have answers to, where I think it will be very exciting to make progress.
    progress so why we even want to do this
  • 00:04:57
    okay let's say we break down these brick
  • 00:04:59
    walls we robotics versus Foundation
  • 00:05:00
    model what does that mean I think
  • 00:05:02
    there's three missing pieces that are
  • 00:05:04
    really nice properties that we see in
  • 00:05:06
    today's language models and vision
  • 00:05:08
    language models that we as of today do
  • 00:05:10
    not yet see in robotics one is really
  • 00:05:14
    positive transfer at imense Scales where
  • 00:05:17
    you basically can assume by default that
  • 00:05:18
    positive transfer and scaling will work
  • 00:05:20
    as opposed to robotics these days it's
  • 00:05:22
    more rare than you know that you see
  • 00:05:24
    positive transfer than cases where you
  • 00:05:26
    get negative results um and again when
  • 00:05:28
    this happens you can are making
  • 00:05:30
    predictive scaling laws about how much
  • 00:05:32
    you should scale how many how much you
  • 00:05:33
    know compute how many tokens you need uh
  • 00:05:36
    and Robotics is far from that today
  • 00:05:38
    another one I think that is very close
  • 00:05:40
    to my heart is steerability and
  • 00:05:41
    pumpability right uh in in my opinion
  • 00:05:44
    this is what really made me see the
  • 00:05:46
    light between gpt2 and gpt3 eras where
  • 00:05:48
    you saw that few shot learning where you
  • 00:05:51
    saw pump engineering really start to
  • 00:05:53
    take off at a massive scale and in
  • 00:05:54
    robotics I think we're far from that and
  • 00:05:57
    finally I think scalable evaluation is
  • 00:06:00
    an aspect where I really enjoy the
  • 00:06:02
    progress that's been made in Foundation
  • 00:06:03
    modeling with everything from very
  • 00:06:05
    realistic meaningful evaluations to
  • 00:06:08
    predicted benchmarks that correlate well
  • 00:06:10
    with how users actually would would you
  • 00:06:12
    know leverage the capabilities these
  • 00:06:15
    models and my claim here is that these
  • 00:06:18
    properties are not just nice to have but
  • 00:06:20
    we actually need that in order to tackle
  • 00:06:22
    the full unstructured Wilderness of the
  • 00:06:25
    real world again robotics has kind of
  • 00:06:27
    spedrun the the scale up from one one
  • 00:06:30
    bin for indust object picking to one
  • 00:06:32
    room to one lab building but to go from
  • 00:06:34
    here to all the VC funded human
  • 00:06:37
    companies and home robot companies and
  • 00:06:39
    assisting the elderly and and playing
  • 00:06:41
    with your dogs and kids this is kind of
  • 00:06:43
    a huge gap I think that you don't get by
  • 00:06:46
    just incremental progress but by
  • 00:06:48
    fundamentally transformative Paradigm
  • 00:06:49
    shifts which can leverage emergence
  • 00:06:51
    scale
  • 00:06:54
    generalization and a hot take here is
  • 00:06:56
    that I think while we see Signs of Life
  • 00:06:59
    today I think if you froze technology
  • 00:07:02
    today you cannot solve Robotics and
  • 00:07:03
    again I think there's companies and
  • 00:07:05
    startups and academics betting on this
  • 00:07:07
    of it's just a data problem it's just
  • 00:07:09
    engineering I think there are still
  • 00:07:10
    fundamental Paradigm shifts that are
  • 00:07:12
    algorithmic or or you know maybe data
  • 00:07:14
    breakthroughs that or needed um until
  • 00:07:17
    it's it's going to be trackable to try
  • 00:07:18
    to go and solve
  • 00:07:21
    RS and let's then jump into the
  • 00:07:24
    nitty-gritty of these missing pieces
  • 00:07:25
    then right so the first one I mentioned
  • 00:07:28
    was positive transfer from scaling and I
  • 00:07:30
    think you know to to kind of illustrate
  • 00:07:32
    what this mean I think we can just point
  • 00:07:33
    at all of the kind of forerunners in
  • 00:07:35
    language modeling and BLM right you see
  • 00:07:37
    these great large scale internet scale
  • 00:07:40
    data sets you see great you know scaling
  • 00:07:42
    laws and chinchilla optimality and again
  • 00:07:45
    I think this is building off of the fact
  • 00:07:46
    that passive the interet gets you a lot
  • 00:07:48
    of great properties that are good for
  • 00:07:51
    language modeling and and vision langage
  • 00:07:52
    modeling but in robotics that's not the
  • 00:07:55
    case right so I I I think in robotics in
  • 00:07:57
    order to study how you can get positive
  • 00:08:00
    transfer you need to First scale the
  • 00:08:02
    data and there's probably two ways which
  • 00:08:05
    I'll dive into shortly is is one maybe
  • 00:08:07
    you can just make robot data look more
  • 00:08:09
    like internet data you can just treat
  • 00:08:11
    them as the same data source and the
  • 00:08:13
    second is maybe maybe your specific
  • 00:08:15
    robot there's not enough data and even
  • 00:08:17
    with you know a bajillion dollars you
  • 00:08:18
    can't get enough but maybe if you pull
  • 00:08:20
    together all of the robot data that's
  • 00:08:22
    been collected across different
  • 00:08:23
    embodiment tasks environments maybe you
  • 00:08:26
    have a better shot of seeing some kind
  • 00:08:28
    of positive transfer in St
  • 00:08:30
    us um so again Vision language models
  • 00:08:34
    this is kind of the golden standard for
  • 00:08:35
    what we want to model the properties we
  • 00:08:37
    like to see in robotics off this is one
  • 00:08:39
    existence group of pro concept what's
  • 00:08:41
    worked and BMS capture both Visual and
  • 00:08:43
    semantic knowledge of the world and then
  • 00:08:46
    you put it side by side here this is the
  • 00:08:48
    rth network architecture for a model
  • 00:08:50
    called rt1 from our team from almost two
  • 00:08:53
    and a half years ago which is kind of
  • 00:08:54
    like a home grw Garage Band Transformer
  • 00:08:57
    model that has these components of you
  • 00:08:59
    know know a perception Vision encoder
  • 00:09:01
    some kind of like cross modal tension we
  • 00:09:03
    s for and then you know a Transformer
  • 00:09:05
    blocks be the B for the reasoning and
  • 00:09:07
    finally some kind of action token output
  • 00:09:09
    again these are design choices that were
  • 00:09:11
    really optimized at our scale on our
  • 00:09:12
    problems which we're still pretty large
  • 00:09:14
    scales by academic domains on hundreds
  • 00:09:16
    and hundreds of robot tasks but again if
  • 00:09:18
    if you kind of squint at this these
  • 00:09:19
    components start to look like the design
  • 00:09:21
    decisions that are used in state of the
  • 00:09:23
    art Vision language models right and you
  • 00:09:25
    can kind of just like try to think about
  • 00:09:27
    how you would fit the kind of design we
  • 00:09:30
    have to make the small rp1 scale into a
  • 00:09:33
    large VM and you see a lot of properties
  • 00:09:36
    that are same again these are disti
  • 00:09:38
    tokens uh they're very similar just
  • 00:09:40
    maybe with some slight output
  • 00:09:42
    differences and input differences uh but
  • 00:09:45
    maybe you can just use the de as a
  • 00:09:46
    policy you don't need to iterate at
  • 00:09:48
    small scales on your home brew Garage
  • 00:09:50
    Band policies you can just leverage the
  • 00:09:52
    kind of great work and infra and data
  • 00:09:55
    from Vision language mod and then the
  • 00:09:57
    main question becomes well how do you
  • 00:09:59
    align the action domain how do you align
  • 00:10:01
    the action manable um and again this is
  • 00:10:04
    what our actions look like they're kind
  • 00:10:07
    of robot specific but they Encompass
  • 00:10:09
    things like positional change and
  • 00:10:10
    rotations and Grier open close and and
  • 00:10:12
    termin action types but that that's
  • 00:10:14
    broadly you know going to be specific to
  • 00:10:16
    per robot and uh you can you can choose
  • 00:10:19
    discretize them but I think the main
  • 00:10:20
    point is that these look still a little
  • 00:10:22
    bit different right from like the textt
  • 00:10:24
    you find on Wikipedia articles or on
  • 00:10:27
    image captions on the web but maybe
  • 00:10:29
    there's ways you can just make them the
  • 00:10:30
    same by adding making just turning it
  • 00:10:33
    into a string the most naive thing you
  • 00:10:35
    can do and see how that works and that's
  • 00:10:36
    exactly what we did in this work called
  • 00:10:38
    rt2 um we just converted the tokenized
  • 00:10:41
    discretized robot actions into a string
  • 00:10:44
    such as this list of eight integers and
  • 00:10:47
    and and you can just treat that as a
  • 00:10:50
    caption so robot action prediction is
  • 00:10:51
    now just a v2a um of course there's
  • 00:10:54
    Alternatives and that we can think of
  • 00:10:56
    such as maybe we use floats but then
  • 00:10:58
    there's more tokens and consen is a
  • 00:11:00
    thing you can use extra IDs or like
  • 00:11:02
    least Us action tokenization you can use
  • 00:11:04
    strings and I think these are all very
  • 00:11:06
    compelling choices but I think the most
  • 00:11:07
    simple thing to start with is just
  • 00:11:09
    turning your raw robot actions and just
  • 00:11:11
    casting it as a string and seeing what
  • 00:11:13
    happens uh and this is what's what's now
  • 00:11:16
    called a vision language action model
  • 00:11:17
    that treats actions as just another
  • 00:11:19
    rality represented in text um for the
  • 00:11:21
    backbones we look at po X models which
  • 00:11:24
    is a model for Google is all py at the
  • 00:11:27
    5B 12b and5 variance and uh importantly
  • 00:11:31
    we we we we co- find un we co- Trin the
  • 00:11:34
    robot data at the bottom along with the
  • 00:11:36
    original vqa data the internet data that
  • 00:11:39
    the original were trained with start the
  • 00:11:42
    robot data is this offline data set that
  • 00:11:44
    we iterated on for rt1 and many other
  • 00:11:46
    works it's a it's a large expert
  • 00:11:48
    demonstration data set that we save
  • 00:11:50
    outline um and what we see then is that
  • 00:11:52
    when we when we train this large vaa
  • 00:11:55
    model we start to see some emergent
  • 00:11:57
    skills and I use emergent in the sense
  • 00:11:58
    that these were not not specific
  • 00:12:00
    capabilities uh such as you know OCR or
  • 00:12:02
    recognizing Flags these are not any
  • 00:12:04
    concepts that we collected data for in
  • 00:12:06
    our robot data set but kind of emerg via
  • 00:12:09
    positive transfer from whatever semantic
  • 00:12:11
    Concepts were present in the internet
  • 00:12:13
    scale VM training the VM data and
  • 00:12:17
    quantitatively right you know post talk
  • 00:12:19
    we can kind of try to like you know
  • 00:12:20
    squint and see what kind of these new
  • 00:12:22
    capabilities are and we we roughly group
  • 00:12:25
    them into things like symol
  • 00:12:26
    understanding reasoning or or human
  • 00:12:27
    recognition and we see that
  • 00:12:29
    this is indeed positive transfer
  • 00:12:31
    happening thanks to the V paradig and
  • 00:12:34
    again though those were kind of like
  • 00:12:35
    hand evals right you maybe you don't
  • 00:12:37
    necessarily care about if you're
  • 00:12:38
    roboticist uh you know whether or not
  • 00:12:40
    your robot Foundation model recognizes
  • 00:12:42
    Taylor S maybe you do care about though
  • 00:12:44
    is generalization to distribution shifts
  • 00:12:46
    like lighting conditions or new objects
  • 00:12:48
    or new background conditions and here we
  • 00:12:50
    see that actually the VA models the web
  • 00:12:52
    data also gives you some of this for
  • 00:12:54
    free just by treating robotics as the
  • 00:12:57
    same as in data
  • 00:13:00
    switching uh to the second part though
  • 00:13:02
    so we've treated robot data now the same
  • 00:13:04
    as internet data but often times maybe
  • 00:13:07
    that robot data still isn't enough when
  • 00:13:08
    you get any additional bank gr up by
  • 00:13:10
    training on all the robot data even if
  • 00:13:13
    the robots look different than the one
  • 00:13:14
    you have in your lab in your factory um
  • 00:13:17
    here we we had a a cross institutional
  • 00:13:19
    collaboration called Open Cross
  • 00:13:21
    embodiment where we pulled together uh
  • 00:13:23
    the data sets that were open source for
  • 00:13:25
    more than 30 robot labs and arreated
  • 00:13:27
    together and see whether or not uh we
  • 00:13:29
    could get some nice scaling Properties
  • 00:13:31
    by treating all the robot data as the
  • 00:13:34
    same uh and this data set is very very
  • 00:13:37
    heterogeneous we have many different
  • 00:13:39
    robot moralities and volums we have many
  • 00:13:41
    different environments across the world
  • 00:13:44
    and many different CTIC tasks that each
  • 00:13:46
    of the individual researchers I cared
  • 00:13:47
    about when they collected their original
  • 00:13:49
    data set it's different from clock
  • 00:13:51
    folding to this like cable routing where
  • 00:13:52
    you have to put a cable very dextrously
  • 00:13:55
    into this small socket um it's really
  • 00:13:58
    quite diverse um and with this diers
  • 00:14:00
    data set we studied kind of the scaling
  • 00:14:03
    properties at two model classes one is
  • 00:14:05
    the rg1 the small um homemade
  • 00:14:08
    Transformer that we had 35 million
  • 00:14:09
    parameters as well as rg2 the full vaa
  • 00:14:12
    55 billion parameter vaa scale to see
  • 00:14:14
    whether or not uh you know treating
  • 00:14:16
    these different robot AC you know data
  • 00:14:18
    sets as just input is some robot images
  • 00:14:21
    output are some robot actions
  • 00:14:23
    representative strings what would happen
  • 00:14:25
    in this case and here surprisingly we
  • 00:14:27
    actually see signs where the generalist
  • 00:14:30
    policies which are trained on all of the
  • 00:14:32
    robot data outperforms the Specialists
  • 00:14:34
    which are alized for each of their
  • 00:14:35
    evaluation settings and in this case
  • 00:14:38
    right it's actually I think this was to
  • 00:14:40
    me very shocking because for each of
  • 00:14:42
    these individual kind of uh you know
  • 00:14:44
    Labs here each of these Lo sport on to
  • 00:14:46
    lab there generally I think the what the
  • 00:14:50
    best accepted um you know understanding
  • 00:14:52
    of roboticist is that if you care about
  • 00:14:53
    maximizing performance on your specific
  • 00:14:55
    task on your robot and your setting like
  • 00:14:58
    you should just TR on robot data from
  • 00:15:00
    your robot because adding any other data
  • 00:15:02
    is going to be you know not is it's just
  • 00:15:04
    going to like you know your policies
  • 00:15:06
    will not be robust to those if they're
  • 00:15:08
    going to overfit to them it's not going
  • 00:15:09
    to help but in this case it seems that
  • 00:15:11
    training on all the data actually
  • 00:15:13
    improve performance even on the if you
  • 00:15:15
    only cared about performance on your
  • 00:15:16
    specific setup that was pretty you know
  • 00:15:19
    exciting to see in robotics because this
  • 00:15:21
    is a try we've seen in other fields but
  • 00:15:22
    in robotics this is really really rare
  • 00:15:24
    so far and again the Improvement SC was
  • 00:15:27
    almost 50%
  • 00:15:29
    another interesting Trend that we found
  • 00:15:31
    is that actually we were getting to the
  • 00:15:32
    point where we were seeing that smaller
  • 00:15:34
    models would underfit this large
  • 00:15:36
    heterogeneous robot data set and again
  • 00:15:39
    overfitting is normally uh not something
  • 00:15:42
    uh I think that underfitting is not
  • 00:15:44
    something that we usually see in
  • 00:15:45
    robotics because usually the data amount
  • 00:15:47
    is so small that model capacity is never
  • 00:15:49
    an issue and here it it was very
  • 00:15:51
    surprising that model capacity actually
  • 00:15:53
    was required to kind of soak up all of
  • 00:15:55
    this diverse robot data
  • 00:15:59
    and one big question though was that
  • 00:16:02
    great you had all this robot data from
  • 00:16:04
    all the robots wasn't even needed you
  • 00:16:06
    already had all the web data from R22
  • 00:16:09
    does adding all this incremental you
  • 00:16:11
    know Poss data at anything at all and
  • 00:16:13
    the good news is that we did sp there
  • 00:16:15
    was know a lot of these spatial
  • 00:16:17
    reasoning skills about the the the
  • 00:16:19
    motions and the Precision of specific
  • 00:16:21
    tasks like whether you put the apple on
  • 00:16:24
    the cloth versus near the cloth these
  • 00:16:26
    were tasks that only adding the robot
  • 00:16:28
    data would enable the policies to do
  • 00:16:31
    whereas the internet data alone wouldn't
  • 00:16:33
    give you these understanding and
  • 00:16:34
    capabilities for
  • 00:16:36
    As a brief recap, we've seen attempts at data scaling and positive transfer in the two projects I just highlighted. One was taking our robot dataset from RT-1, putting it alongside internet data, and treating them the same; then we scaled it up even further by taking all the robot data from many different embodiments. We saw that in both cases, by treating robot actions and robot embodiments as just another data modality, you did see signs of positive transfer. For RT-1 this meant we do well on the training set, in distribution; for RT-2 we start to get internet-scale semantics; and for RT-X we start to understand spatial precision and concepts like that, which are important for action and for physics.

  • 00:17:21
    But that's roughly where we got to by the end of last year, the state of the world, and there are many open challenges and things which do not work that I want to highlight. One is that VLAs in current training paradigms still overfit to robotics data distributions. What I mean is: we take a VLA and query it on VQA tasks, an internet image with an internet prompt, and it does well; we give it a robot image and ask it for actions, and it does well. But when you try to mix and match across these data distributions, for both the visual input space and the text input space, it doesn't really seem like you're getting a ton of transfer. This is a very interesting failure mode, because it suggests there are still fundamental issues with how we're doing co-training, how we're mixing robot data with internet data.

  • 00:18:13
    We also see issues with reasoning. Oftentimes reasoning stems from the power of the language-model backbone, from language-model pre-training, and mixing this kind of reasoning process with low-level physical action is also not going exactly as expected. We see cases where reasoning and planning work well when all the objects are out of distribution, but when you add one in-distribution object, such as a Coke can, the model starts to revert to what it knows. Even though it reasons, "hey, if I want to hammer a nail I should use a rock," it has never picked up a rock before, but it has picked up tens of thousands of Coke cans; so even though it reasons that it should use the rock, it still goes for the Coke can. Examples like this are everywhere, but they're not normally as visible, so I wanted to highlight them as interesting failure modes.

  • 00:19:05
    And finally, the model architecture decisions, the design decisions, are also not well thought through. There's a paper, actually just from three days ago, studying how you tokenize and represent your actions, continuous versus discrete classification, and what actual token choices you use. I think that's a great step in the right direction, in making VLA training more of a science and less of an art.
  • 00:19:30
    Moving on (let me just check how I'm doing on time; okay, great), the next missing piece is steerability and promptability. This is very interesting because in robotics today, historically due to the scale of the policies we've been working with, we have really tried to constrain the amount of entropy and information that can be fed into the policy's input. Usually it's just one image, or maybe you frame-stack some history of the past images you've seen, and then you convey a goal in the simplest way possible: a one-hot ID, a very simple English sentence, or maybe a goal image. Generally you're not expanding the bandwidth of your information input to the scale of an open-ended chat interface, like many language models see today. We've seen in language modeling how large context windows enable so many great emergent capabilities, such as few-shot learning, chain-of-thought reasoning, reflection, all this good stuff that we can't really have in robotics because our input interfaces are just so constrained. I really want a promptable robot where you can just say, "hey, can you try again? You moved too far right, can you adjust it on your next try?" We don't have any meta-learning like that right now, and it won't naturally emerge by just adding more of the same types of robot data and more of the same types of internet data. There has to be a paradigm shift to get from where we are today to a promptable, steerable robot control model.

  • 00:21:03
    One hypothesis is that maybe it's language that's the bottleneck, at least language the way we approach it today. We've been taking all the recipes that have worked in other domains and saying: okay, let's just add language as a conditioning modality for robots, let's just add language datasets. But this maybe misses a lot of what makes robots hard. What makes robots a unique and interesting problem to study is the motions, the physics, the causal nature of interaction with the world, and when you try to condense everything into something a Wikipedia article might look like, maybe you just lose a lot of that in translation.
  • 00:21:44
    Translation um and and so towards that
  • 00:21:47
    uh you know maybe honing in on the
  • 00:21:49
    motion of Robotics is is what this work
  • 00:21:52
    has to do called Archy trajectory which
  • 00:21:54
    takes the idea of you know hindsight
  • 00:21:56
    experience replay you take the goal you
  • 00:21:59
    you turn that into the target for your
  • 00:22:00
    policy learning but does that for the
  • 00:22:02
    motion of the trajectory that was
  • 00:22:04
    executed when we collect robot data
  • 00:22:05
    often times we end Factor like Pro
  • 00:22:08
    perception for all your joint States and
  • 00:22:10
    your Motors are kind of store and we
  • 00:22:12
    currently just toss all that away when
  • 00:22:13
    we just save out RGB action trajectories
  • 00:22:16
    maybe you can actually utilize you know
  • 00:22:18
    the motion of what happened in the real
  • 00:22:19
    world and use that to condition
  • 00:22:21
    condition your policy because you
  • 00:22:22
    already have this data for free we just
  • 00:22:24
    currently toss it out so we do in this
  • 00:22:26
    work is we take that proception and
  • 00:22:27
    effector data we project it into 2D RGB
  • 00:22:30
    space and that just becomes a kind of
  • 00:22:32
    visual hint a visual Chain of Thought if
  • 00:22:34
    you will of not just what you should do
  • 00:22:36
    you should you know put the chip back in
  • 00:22:38
    the in the drawer but how you should do
  • 00:22:40
    it how you should approach it from what
  • 00:22:41
    angle where you should close how you
  • 00:22:42
    should avoid object obstacles at what
  • 00:22:45
    depth and what Heights you should be
  • 00:22:46
    operating at and then we feed it into an
  • 00:22:48
    rp1 policy and and and what's nice is
  • 00:22:51
    that even though training this was
  • 00:22:52
    automated at inference time then you
  • 00:22:55
    know maybe getting this trajectory from
  • 00:22:57
    scratch would have been hard but
  • 00:22:59
    uh the policy is agnostic to what
  • 00:23:01
    trajectory you give it it doesn't have
  • 00:23:02
    to be hindsight it can be handdrawn by a
  • 00:23:05
    human it can be you can take a human
  • 00:23:07
    video and do you know pose extraction
  • 00:23:09
    with the hand and and then get this
  • 00:23:10
    strory out of there or you can even give
  • 00:23:12
    Foundation models either you know
  • 00:23:14
    generative image models or or or
  • 00:23:16
    language models to go and predict what
  • 00:23:18
    the trajectory should be and then that
  • 00:23:19
    can condition your pre-trained policy
  • 00:23:22
    which is only train on H side
  • 00:23:23
    trajectories and what's nice here is
  • 00:23:25
    that we start to see signs of motion
  • 00:23:27
    generalization right this is this is
  • 00:23:30
    where you have tasks which have
  • 00:23:32
    fundamentally maybe they're operating at
  • 00:23:34
    different Heights or different motions
  • 00:23:35
    just combinations of State action
  • 00:23:37
    transition which were just never in the
  • 00:23:38
    training data which has been
  • 00:23:40
    traditionally hard for some of our
  • 00:23:41
    policies which are just kind of overfit
  • 00:23:43
    to pick up the cocan and and that that's
  • 00:23:46
    all I know and you can't really
  • 00:23:47
    generalize to cocan and vastly different
  • 00:23:50
    motion profiles um but we start to see
  • 00:23:52
    that here with tasks such as swiveling
  • 00:23:54
    chairs or or folding towels or or
  • 00:23:56
    picking objects up from a chair at a
  • 00:23:58
    very l low
  • 00:23:59
    height but most interesting for the
  • 00:24:01
    topic here is that this was the first um
  • 00:24:04
    project I worked on at least where I
  • 00:24:05
    really got some kind of Shadow or some
  • 00:24:08
    hint of I felt that prompt engineering
  • 00:24:10
    was possible here we'd sometimes try out
  • 00:24:12
    you know evaluations in very different
  • 00:24:14
    settings with like in like a home
  • 00:24:15
    setting like Ikea like setting with with
  • 00:24:17
    with Darkwood furniture at different
  • 00:24:19
    heights and the robot would sometimes be
  • 00:24:21
    able to generalize this out of tradtion
  • 00:24:22
    settings but then just by drawing a a
  • 00:24:25
    prompt trajectory differently the human
  • 00:24:27
    can kind of learn the failure the the
  • 00:24:29
    centricities of the robot and just by
  • 00:24:31
    like you know a trained operator could
  • 00:24:32
    then learn to draw better trajectories
  • 00:24:34
    and then start to zero shot new tasks uh
  • 00:24:37
    again this was no other policy that we
  • 00:24:39
    had today where the input is always
  • 00:24:40
    something as simple as you know hey pick
  • 00:24:42
    up a Coke can pick up a Pepsi can close
  • 00:24:44
    a drawer going from that to like you
  • 00:24:46
    know hey you're this brand new setting
  • 00:24:47
    the drawer is going to be five
  • 00:24:48
    centimeters lower than you're used to
  • 00:24:50
    and there's also clutter can you go and
  • 00:24:52
    do that like there's no way any policy
  • 00:24:53
    could be responsive to that yet with
  • 00:24:55
    trajectories it seems like sometimes we
  • 00:24:57
    would get that
  • 00:24:59
    and concurrently it's been really great
  • 00:25:01
    to see over the past six months a lot of
  • 00:25:02
    other works also kind of adopts
  • 00:25:04
    paradigms using you know Point tracking
  • 00:25:07
    Optical flow using other kinds of motion
  • 00:25:09
    representations to also have you know
  • 00:25:12
    touch upon the same idea is that what
  • 00:25:14
    makes robotics unique its physics its
  • 00:25:16
    motion its trajectories maybe adding
  • 00:25:18
    this information to our foundation
  • 00:25:20
    models can unlock things which language
  • 00:25:22
    canot um which is very excited to see
  • 00:25:25
    but I think overall it's still very nent
  • 00:25:27
    and none of these methods have scaled up
  • 00:25:29
    to the level of let's say a 55 billion
  • 00:25:31
    parameter Vision language action
  • 00:25:34
    model next up though one other
  • 00:25:37
    interesting question I I was asking you
  • 00:25:38
    know maybe a year ago was I have all
  • 00:25:40
    these rights of language in the section
  • 00:25:42
    of this this section right was like what
  • 00:25:44
    can we do Beyond language but maybe the
  • 00:25:46
    language maybe language was not the
  • 00:25:48
    problem just the way we were using
  • 00:25:49
    language that we're using language in
  • 00:25:51
    twoo sinful of a way and if the language
  • 00:25:53
    was more grounded in in motions and
  • 00:25:55
    physics and what makes robotics unique I
  • 00:25:58
    more hierarchical granular we could get
  • 00:26:00
    the benefits and so in this project
  • 00:26:01
    called Archy hierarchy we would kind of
  • 00:26:03
    do a Chain of Thought prediction where
  • 00:26:05
    you turn the the pick cocan the very
  • 00:26:07
    abstract very like long Horizon T into
  • 00:26:10
    intermediate very grounded language
  • 00:26:13
    Motions like move your arm right and
  • 00:26:15
    rotate it clockwise and then close your
  • 00:26:17
    gripper into like language that's very
  • 00:26:19
    grounded towards very short Horizon um
  • 00:26:22
    actions and again this can kind of be
  • 00:26:23
    viewed as like chain of thoughts still
  • 00:26:25
    using language at the Chain of Thought
  • 00:26:26
    medium but doing it in a way which is
  • 00:26:28
    much easier than perhaps to generalize
  • 00:26:31
    and what's interesting then is that this
  • 00:26:32
    unlocked learning from uh you know a
  • 00:26:36
    category of tasks we had been collecting
  • 00:26:38
    for over six months that no other policy
  • 00:26:40
    was able to use that data from because
  • 00:26:41
    they were just so hard so it's kind of
  • 00:26:43
    like Tas related to the serial task
  • 00:26:45
    where there's like a small Gap you have
  • 00:26:46
    to put a bowl and you have to push a
  • 00:26:47
    lever and then there's this like
  • 00:26:49
    cluttered like you know thing of oal
  • 00:26:51
    packets and it's like slightly more
  • 00:26:53
    dextrous and precise than the past we've
  • 00:26:55
    seen before but policies were just
  • 00:26:57
    really struggling to kind of you know
  • 00:27:00
    work at the the entropy of these systems
  • 00:27:02
    which was which was just a little bit
  • 00:27:04
    more complex than before and then these
  • 00:27:06
    language motions are what unlock the
  • 00:27:08
    ability to work on those types of
  • 00:27:11
    TS and talking about like pumpability
  • 00:27:14
    and steerability we also saw that having
  • 00:27:17
    interventions in language of hey you
  • 00:27:19
    know actually you you you're doing the
  • 00:27:21
    right low level control it was just your
  • 00:27:23
    intermediate plan of translating closed
  • 00:27:25
    the pistachio jar you wanted to move up
  • 00:27:28
    you should have moved left humans could
  • 00:27:30
    correct this in a dagger likee setting
  • 00:27:32
    in a much more efficient manner than it
  • 00:27:33
    would have taken for low-level action
  • 00:27:35
    interventions and this was a way to kind
  • 00:27:37
    of scale up this Chain of Thought level
  • 00:27:39
    Improvement where you don't always need
  • 00:27:41
    to make your you know iterate at the
  • 00:27:42
    Action level which is the most expensive
  • 00:27:44
    level maybe sometimes you can you know
  • 00:27:45
    intervene at the language motion or
  • 00:27:47
    highering level that is also kind of
  • 00:27:49
    stability and probability one of the
  • 00:27:51
    rare that
  • 00:27:53
    I and and to kind of summarize here rt1
  • 00:27:57
    and rp2 even though they were scaled up
  • 00:27:59
    so much they were still shoving together
  • 00:28:00
    all of the foundation model knowledge
  • 00:28:02
    all the large data sets through a very
  • 00:28:04
    narrow bottleneck of simple language
  • 00:28:05
    instructions and by whing that kind of
  • 00:28:08
    bottom neck via either things like C
  • 00:28:10
    trajectories via motion Centric
  • 00:28:12
    representations or even via language
  • 00:28:14
    just better language seem to unlock more
  • 00:28:16
    of the intelligence that's contained in
  • 00:28:18
    these internet scale
  • 00:28:21
    foundations and the question here
  • 00:28:23
    becomes is that I view these these
  • 00:28:25
    projects kind of as fruital Concepts I'm
  • 00:28:27
    very excited about results but again we
  • 00:28:29
    haven't scaled them up yet to the full
  • 00:28:31
    internet scale and the question is maybe
  • 00:28:33
    to scale them up to you know a full
  • 00:28:35
    vision reaction model maybe we need more
  • 00:28:37
    robot data and my hot take here is that
  • 00:28:41
    actually this is not a call to action
  • 00:28:42
    for me to you know a lot of you know
  • 00:28:44
    companies for example right now are
  • 00:28:46
    convinced that algorithms are are solved
  • 00:28:48
    it's just data and maybe this slide
  • 00:28:50
    would have suggested that but I think we
  • 00:28:51
    don't actually know what kinds of robot
  • 00:28:53
    data you need and so if you prematurely
  • 00:28:54
    try to scale deploy
  • 00:28:59
    youting not correct maybe sometimes that
  • 00:29:02
    wouldn't have been a recoverable mistake
  • 00:29:04
    so again I don't think I'm suggesting to
  • 00:29:06
    not do that because I think we
  • 00:29:07
    definitely need to do that but I think
  • 00:29:08
    there's also work that robot researchers
  • 00:29:10
    need to think about on can we figure out
  • 00:29:13
    the right types of modalities and data
  • 00:29:15
    sets and tasks and skills and labels
  • 00:29:17
    that we need when we go about collecting
  • 00:29:19
    these like Society scale data
  • 00:29:23
    sets and finally then I think uh one
  • 00:29:26
    missing piece is scalable evaluation
  • 00:29:29
    right this is I think uh generally a
  • 00:29:31
    problem throughout all of AI right now
  • 00:29:33
    um we we we see that all these different
  • 00:29:36
    you know benchmarks and you know
  • 00:29:38
    leaderboards uh are are oftentimes a lot
  • 00:29:40
    of hype these days because they're very
  • 00:29:41
    easily gainable but they're good
  • 00:29:43
    attempts I think at least in Foundation
  • 00:29:45
    Molly Broadley to try to capture various
  • 00:29:48
    representations of capabilities that we
  • 00:29:50
    want and the challenge here however is
  • 00:29:52
    that is all AI models across the you
  • 00:29:55
    know broadly the field are everything
  • 00:29:57
    that want wants to be a genous these
  • 00:29:58
    days right and and that's a very high
  • 00:30:00
    claim if your genous model can do
  • 00:30:02
    everything how do you back up rigorously
  • 00:30:04
    that you can actually do everything and
  • 00:30:06
    and be robust at that and I think this
  • 00:30:09
    is clearly already an issue in in
  • 00:30:11
    Foundation modeling but I think things
  • 00:30:12
    like these ELO based leader boards or
  • 00:30:14
    just getting these models out and
  • 00:30:16
    deployed and open source or even as
  • 00:30:18
    close you know Source apis these are
  • 00:30:20
    ways where you can just leave it up to
  • 00:30:22
    you know the the court of the public to
  • 00:30:25
    kind of say hey are your models actually
  • 00:30:27
    good and in robotics this is kind of
  • 00:30:29
    hard right because robotics Target
  • 00:30:31
    physical data distributions and it's
  • 00:30:33
    it's not really clear whether or not
  • 00:30:34
    some representative small set of you
  • 00:30:37
    know evaluations can capture you know
  • 00:30:40
    all of the the properties you want to
  • 00:30:41
    measure um or or if there's just a lot
  • 00:30:44
    of stuff that you need to do before you
  • 00:30:45
    get to a product that's out but I think
  • 00:30:47
    regardless of what the answer is the
  • 00:30:49
    problem is very clear and the problem is
  • 00:30:51
    that as our policies within our team at
  • 00:30:53
    Google de might have scaled the amount
  • 00:30:54
    of evaluations we've had to do have also
  • 00:30:56
    scaled right and I think this might not
  • 00:30:58
    sound like a lot of 3,000 or 6,000
  • 00:31:00
    trials but if you consider that each
  • 00:31:01
    trial can take up to 10 15 minutes you
  • 00:31:04
    know this is quickly going to become
  • 00:31:06
    intractable if we add another order of
  • 00:31:08
    magnitude to what we claim our policies
  • 00:31:10
    can do the next year or two um that just
  • 00:31:12
    not going to be tractable for any
  • 00:31:13
    industry or academic
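
    The arithmetic behind that claim is easy to check (trial counts and per-trial durations as quoted above):

        trials = 6000               # real-robot trials for one model release
        minutes_per_trial = 12.5    # midpoint of the quoted 10-15 minutes
        robot_days = trials * minutes_per_trial / 60 / 24
        print(f"{robot_days:.0f} robot-days of evaluation")  # ~52 robot-days

    Another 10x on top of that is on the order of a robot-year and a half spent on nothing but evaluation.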
  • 00:31:15
    Maybe one way forward is to decompose the space of claims you want to make about these robot foundation models and break them down into particular axes meant to represent policy generalization, encompassing the properties we hope to measure: different backgrounds, added distractor objects and clutter, changed lighting conditions, new objects, and so on. Then we can measure generalization gaps from our training set to our test set. We did this in the real world, and we indeed found that some of these axes of generalization, these factors of distribution shift, were harder than others, and that there were scaling curves as you added more of these factors to your training data. I think this was an initial attempt at codifying or formalizing what it means for your robot foundation model to be a generalist foundation model, and it's definitely a challenge now, because every new robot foundation-model policy that comes out has to do all this on its own again, define its own evals and its own benchmarks, and it's just hard to compare apples to apples in today's academic landscape.
    landscape maybe another path however is
  • 00:32:23
    of course using simulation and and today
  • 00:32:25
    there's so many great talks on world
  • 00:32:26
    models in Sim especially for ab and in
  • 00:32:29
    robot manipulation it's been a bit
  • 00:32:31
    harder because manipulation right with
  • 00:32:33
    contact forces and physics and visual
  • 00:32:35
    distribution shifts and oftentimes data
  • 00:32:36
    sets are very tuned to one limited
  • 00:32:38
    setting which raises the bar for how
  • 00:32:40
    realistic the systems have to be this
  • 00:32:42
    has been an ongoing challenge so an
  • 00:32:44
    Insight in our recent work here called
  • 00:32:46
    simpler was that maybe you don't need a
  • 00:32:48
    full Fidelity digital twin in order to
  • 00:32:51
    that you would need for syrial training
  • 00:32:53
    maybe all you can need to do is optimize
  • 00:32:55
    for correlation between the ranking of
  • 00:32:57
    policies in Sim and how they would rank
  • 00:32:59
    in the real world if you had evaluated
  • 00:33:01
    them in the real world and by doing this
  • 00:33:03
    we would try to just get a minimal
  • 00:33:04
    viable Sim like gets you useful signal
  • 00:33:07
    of you know which which checkpoints you
  • 00:33:09
    should use which with kind of policies
  • 00:33:10
    you should devote your very expensive
  • 00:33:12
    real house to and again I don't think
  • 00:33:14
    it's perfect but I think it it was
  • 00:33:16
    working well for a various classes of
  • 00:33:18
    generalist robot policies that we were
  • 00:33:20
    operating on such as rt1 rt1 X rt2
  • 00:33:24
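Reduced to code, the idea is that sim success rates don't have to match real ones; only the ranking of policies has to agree. Below is a minimal sketch using Spearman rank correlation as one possible agreement measure (SIMPLER defines its own metrics; the policy names and success rates here are invented).

```python
from scipy.stats import spearmanr

# Success rates for the same policies, evaluated in sim and in real.
# All values are invented for illustration.
policies     = ["rt1_ckpt_a", "rt1_ckpt_b", "rt1x", "rt2"]
sim_success  = [0.31, 0.45, 0.52, 0.70]
real_success = [0.41, 0.38, 0.60, 0.74]

# A "minimum viable sim" only needs to preserve the ordering of
# policies, not reproduce their absolute success rates.
rho, p = spearmanr(sim_success, real_success)
print(f"Spearman rank correlation: {rho:.2f} (p={p:.3f})")
# High rho means sim evals can tell you which checkpoints deserve
# your scarce real-robot evaluation hours.
```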
And again, a shout-out to concurrent and related works, including the ones presented today by Alex and by Sherry, and also Genie, a latent-action-conditioned video model from Google DeepMind. These are all directionally where we would want to go in seeing whether these kinds of approaches can also act as good offline policy evaluation. My hot take, then, is that despite all of this great progress, real-world evals will always be the gold standard; you can never fully replace them. Offline evals can help you decide how to spend your limited bandwidth for real-world evals, but they won't replace them completely. So if you actually need real evals, and you need them at scale, that becomes a very challenging problem, and it's perhaps one that will be solved by products deployed in the wild, where you actually get evals at scale from actual user deployments, from people actually using your robots and your robot policies.
And finally, I'd like to talk a bit about what this means for how we can predict the next year or two of what robot foundation models might look like. The first part is a recap. We talked about positive transfer and the scaling laws of robotics as we increase our datasets and our model capacities. The bleeding edge here, from what I talked about today, is the vision-language-action (VLA) modeling paradigm, plus a lot of improvements in cross-training on many different types of robot data. This is again just my own opinion, but I would give the field a six out of 10, in terms of how much we have progressed and how many unknown unknowns remain. As you saw, the failure modes are still overfitting, and VLA training is still more of an art than a science: your data-training mixtures and how you go about composing them, your action-tokenization decisions, and so on. I think there's a lot of science that still needs to happen here.
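As one concrete example of such a design decision, a common recipe is to discretize each continuous action dimension into a fixed number of uniform bins so that actions can be emitted as ordinary tokens (RT-2-style VLAs use 256 bins per dimension); the action ranges below are illustrative, and other binning schemes are part of the "art".

```python
import numpy as np

def tokenize_action(action, low, high, n_bins: int = 256):
    """Discretize each continuous action dimension into integer tokens
    via uniform binning over [low, high] per dimension."""
    action = np.clip(action, low, high)
    scaled = (action - low) / (high - low)            # -> [0, 1]
    return np.minimum((scaled * n_bins).astype(int), n_bins - 1)

def detokenize_action(tokens, low, high, n_bins: int = 256):
    """Map integer tokens back to bin-center continuous values."""
    return low + (tokens + 0.5) / n_bins * (high - low)

# 7-DoF example: xyz delta, rpy delta, gripper (ranges are illustrative).
low  = np.array([-0.05] * 6 + [0.0])
high = np.array([ 0.05] * 6 + [1.0])
a = np.array([0.01, -0.02, 0.0, 0.0, 0.0, 0.03, 1.0])
toks = tokenize_action(a, low, high)
print(toks)                                           # integer tokens
print(detokenize_action(toks, low, high).round(3))    # ~recovers `a`
```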
For steerability and promptability, we talked about going beyond language: thinking about motions, thinking about trajectories. Here I would say the field is at a four out of 10, because while we've had signs of life, maybe sparks of what could become a promptable or steerable robot learning paradigm, none of these have been scaled up to a large scale. We don't yet know what data requirements or bottlenecks will occur when we do try to scale these methods, and we don't know whether they will scale to more dexterous embodiments such as humanoids or consumer robots.
And finally, for scalable evaluation, where we can think about generalization or simulation, I would score the field at a three out of 10, just because oftentimes we don't even know what we should be evaluating for. Even before we can design the eval, we need to set up the paradigms, the structure, the formalisms, and all of that is still very ad hoc, done almost on a per-project basis. So going forward, I hope that either through concerted efforts, or just by bringing robot prediction and low-level control closer to foundation modeling, these problems of evaluation can also become more homogeneous.
With that, let me take these final points, the reasons I gave these ratings, and try to predict what the solutions might be, say on the 2025-to-2026 horizon. The first is that once we understand VLAs better, once VLA training turns into a science and not just an art, I think robotics research will also start to split into pre-training and post-training, just as foundation modeling has. Once you have access to very robust starting points, foundation models from Google, maybe from OpenAI, maybe from other companies, then post-training your robot will become more of what your daily cycles look like as a practitioner or a researcher. For robot-specific data, I think the robot data engine, if you're really thinking about deployments at a societal scale, is where industry and startups will be able to contribute a lot to the field. What that's going to look like, we'll see, but I think the race is really starting up this year, with so many startups and exciting new companies.
And then finally, for evaluations: with all of the amazing progress we're seeing in video modeling, I'm really excited to see what the action-conditioned variants of these world models will look like, and how well they'll be able to model out-of-distribution behaviors, not just in-distribution training data. For evaluating robot policies, you don't just want them to be aesthetic or realistic; you want them to actually measure the long tail of very rare events, and you want them to model contact forces and physical causality, which may not always correlate with the aesthetic pre-training YouTube video data they may be trained on. As well as, of course, product deployments; so I think evaluation is another setting where we'll see industry and startups contribute. Those are again just my own opinions, but I'm very excited to see how the field of robot foundation models develops. With that said, thanks for your time. Reach me at my email here, and I'll share the slides afterwards. Thanks for your time.
Thank you very much, Ted, for the great talk. There's already a question; go ahead.
[Inaudible audience question.]
So I guess the question here was: with texture randomization being one axis of generalization, a distribution shift that seems challenging, are there methods or data points we have on how to address that? There have been some works on data augmentation for robotics in a semantic fashion: instead of broadly domain-randomizing with just random textures, actually using semantically relevant textures that are appropriate for the scene. You don't want a neon random-RGB wallpaper in your home; you want wallpapers that are actually seen in people's homes. I think this is still very nascent, but there have already been very good results from people pushing on these generalization settings with things like diffusion models for image editing.
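Here is a sketch of what that kind of semantic augmentation can look like with an off-the-shelf diffusion inpainting pipeline; the model id, file paths, and prompt are placeholders of my own choosing, not from the talk.

```python
# Sketch of semantic data augmentation for robot camera frames:
# instead of random textures, inpaint semantically plausible
# variations. Paths, prompt, and model id are illustrative.
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    torch_dtype=torch.float16,
).to("cuda")

frame = Image.open("episode_0001/frame_042.png").convert("RGB")
# Mask covering the background wall, e.g. from a segmentation model.
wall_mask = Image.open("episode_0001/wall_mask_042.png").convert("L")

# Semantically relevant variation: a wallpaper you'd actually see in a
# home, rather than a neon random-RGB texture.
augmented = pipe(
    prompt="living room wall with beige floral wallpaper, photorealistic",
    image=frame,
    mask_image=wall_mask,
).images[0]
augmented.save("episode_0001/frame_042_aug.png")
```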
Other questions? Yes, please.

[Partially inaudible audience question:] Video data, real data, simulation data: are they all contributing? If you have to pick one, which one would you go for? Or is there transfer between them, so it doesn't even matter which representation you use?
Yeah, absolutely. So the question is: video data, sim data, real data, are they all equal, do they transfer? My sense, at least based on the last few years, is that real data is absolutely key. If you offer me the same quantity of real data versus sim versus video, I'm picking real any day of the week. However, maybe at some point there will be diminishing returns, and if you offer me a thousand times more, or a million times more, sim or video data compared to robot action data, then maybe the trade-off starts to become different. But I would say the past is also not a very good predictor of what might happen in the future, because in the past the capacities of robot foundation-model policies were so small that if you could only operate at the scale of 100,000 trajectories, you would of course just operate on 100,000 real trajectories. You never had the opportunity to operate at internet scale, with a VLA that did have the capacity to consume a million times more sim and video data, and I think we have the opportunity to do that in the next year, or even now.
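One way to operationalize that trade-off is as a co-training mixture weight, where real data stays heavily over-weighted per batch even when the sim and video corpora are orders of magnitude larger. A minimal sketch; the dataset sizes and the 60/25/15 split are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dataset sizes (episodes): real robot data is scarce; sim and video
# corpora are vastly larger. All numbers are made up.
sizes   = {"real": 1e5, "sim": 1e7, "video": 1e9}
# Per-batch sampling weights: deliberately over-weight real data
# rather than sampling proportionally to dataset size.
weights = {"real": 0.60, "sim": 0.25, "video": 0.15}

def sample_batch_sources(batch_size: int):
    """Pick a data source for each example in a training batch."""
    names = list(weights)
    p = np.array([weights[n] for n in names])
    return rng.choice(names, size=batch_size, p=p)

batch = sample_batch_sources(256)
for name in weights:
    print(name, round((batch == name).mean(), 3))
```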
[Partially inaudible audience question about what's missing from robotics foundation models beyond vision and language: the physical sense of touch.]
Yeah, great question. So the question was about physical touch, sensing: this is so important for humans, how about for robots? My sense here is that I put up two columns of what could make robots steerable and what's missing from foundation models that's unique to robots, and I would definitely put sensing and other modalities there as well. The reason I think it's a little farther off, or at least not a top priority for me, is that just grounding the idea of temporal motion, putting that into a visual-language latent space, is already so hard. Adding a completely new modality like touch sensors, where you don't have a lot of data, where the data is noisy, where it's so different from internet data, is going to be even harder. If we can't even add trajectories of motions, then adding sensing data might be even harder. So that's why, for me at least, the aim is to try to make progress on the motions, the physics, the trajectories first; hopefully that will tell us how we should approach sensing as well.
In the interest of time, let's have two last questions. I see two hands; you can start in the front, and then the other.

[Inaudible audience question.]
Yeah, great question. So the question is that in this RT-H work, the interface between the high-level planning and the low-level control was this low-level language motion; are there other choices beyond language? Absolutely, I think there are many other kinds of representations you could use. For example, trajectories are one: your high-level plan could be, hey, how do I wipe the surface of the chair, and then the interface is a curve drawn on the RGB image. There's definitely a lot more; language is just a natural one, where you could hope to get a bit more transfer from things like chain-of-thought in non-embodied domains.
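Here is a sketch of the interface being discussed: a high-level model emits an intermediate representation that the low-level policy conditions on. The function bodies are stand-ins, and the language-motion string and action format are illustrative, not the actual RT-H implementation.

```python
# Sketch of a hierarchical planning/control interface. `high_level`
# and `low_level` are placeholders for real models.
def high_level(image, task: str) -> str:
    """Plan the next language motion given the scene and task
    (RT-H-style phrases look like 'move arm forward')."""
    return "move arm forward"                      # placeholder output

def low_level(image, task: str, motion: str):
    """Predict a 7-DoF action conditioned on the language motion."""
    return [0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]    # placeholder action

image = None                                       # stand-in camera frame
task = "wipe the surface of the chair"
motion = high_level(image, task)                   # interface: language
action = low_level(image, task, motion)
print(motion, action)

# The same structure admits other interfaces, e.g. a 2D trajectory
# sketch drawn on the RGB image (as in RT-Trajectory) instead of text.
```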
And the final question.

[Inaudible audience question.]
Yeah, the question was about interpolation versus true out-of-distribution generalization and reasoning: maybe oftentimes, when we claim we're generalizing to new objects or lighting conditions, we're actually just interpolating the data we already saw. That's why, when I define "emergent", I try to embrace this by saying that emergent doesn't mean it's not present in any of the training data and has emerged magically; it's that it was in the data, but internet-scale data is too much for us to codify, so it's just emerging, in the sense that we're finding out what was in the data. We're finding out what's in the data, it's being projected into robot action space, and we're seeing what has projected successfully. So I don't even know whether we necessarily need to solve true generalization in order for robotics to work well. If we're able to solve just interpolation at scale, in a very predictable fashion, I would already be happy with that in terms of getting general-purpose robots. Beyond that, for AGI, a loaded term, yes, probably.

Okay, sorry, so let's thank the speaker again.
Tags
  • robotics
  • foundation models
  • positive transfer
  • steerability
  • scalable evaluation
  • general applicability
  • data
  • machine learning
  • interdisciplinary research
  • challenges