CVPR24 FM4AS | Ted Xiao: What's Missing for Robotics-first Foundation Models?

00:46:07
https://www.youtube.com/watch?v=hPQhGI9w9x0

Summary

TLDR: The talk focuses on the current challenges and opportunities in robotics, especially in relation to foundation models. The speaker emphasizes that there is significant energy and interest in the field, but that important gaps remain before robots become generally useful. Three central missing pieces are presented: positive transfer at scale, steerability, and scalable evaluation. The speaker argues that current methods are not sufficient and that paradigm shifts are needed to exploit the opportunities offered by large datasets and advanced models. The talk also discusses how robots' ability to generalize and adapt to new situations can be improved.

Takeaways

  • 🤖 Robots face challenges with positive transfer.
  • 📊 Scalable evaluation is essential for robot performance.
  • 🔄 Steerability can be improved with human feedback.
  • 🌐 Large datasets can help generalize skills.
  • 🔍 Paradigm shifts are needed in robotics.

Timeline

  • 00:00:00 - 00:05:00

    Today I'm talking about what's missing in robotics with respect to foundation models. There is enormous energy in robotics and AI, and many new researchers are attending conferences like CVPR. I will focus on the most pressing challenges that must be solved to make robotics more accessible, and discuss how we can read the current trends to find the missing pieces.

  • 00:05:00 - 00:10:00

    Robotics today follows a pipeline approach of perception, planning, and control. The last 10 years have been highly interdisciplinary, with advances in language processing and computer vision quickly adopted in robotics. But with the emergence of foundation models, we need to rethink how we apply these technologies in robotics.

  • 00:10:00 - 00:15:00

    One challenge is that current models are not optimized for control and decision making. There are also narrow bottlenecks between the modules, which limit the benefits of scale and transfer. We must find ways to break down these barriers to create a robotics-first foundation model.

  • 00:15:00 - 00:20:00

    There are three things missing in robotics that we see in language models: positive transfer at large scale, steerability, and scalable evaluation. Today, positive transfer is rare in robotics, and we do not have the same scaling laws as in language models. We must find ways to achieve these properties in robotics.

  • 00:20:00 - 00:25:00

    To achieve positive transfer, we must scale the data. We can either make robot data look more like internet data, or pool data from different robot tasks to see whether we can achieve positive transfer. We have seen that treating robot data as part of internet data can yield some positive results.

  • 00:25:00 - 00:30:00

    We have also investigated how to train across data from different robots to improve performance. Surprisingly, we have seen that generalist models trained on data from many different robots can outperform specialist models trained only on specific tasks.

  • 00:30:00 - 00:35:00

    It is important to understand that even though we have access to large amounts of data, that is not always sufficient. We must also consider how to use robot data to improve our models and achieve better performance on specific tasks.

  • 00:35:00 - 00:40:00

    The next challenge is steerability and promptability. Today, a robot's inputs are often constrained, which makes it hard to exploit the capabilities that large language models offer. We must find ways to widen robots' input interfaces so they can learn and adapt better.

  • 00:40:00 - 00:46:07

    Finally, scalable evaluation is a major challenge in robotics. We need methods that evaluate robot performance in a meaningful way, capturing their abilities across different situations. This requires a systematic approach to evaluation that we have not yet achieved.

Video Q&A

  • What are the three central missing pieces in robotics according to the talk?

    The three central missing pieces are positive transfer at scale, steerability, and scalable evaluation.

  • How can robots benefit from large amounts of data?

    By treating robot data as part of a larger data pool, robots can achieve better generalization and transfer of skills.

  • What does 'positive transfer' mean in robotics?

    Positive transfer refers to the ability to apply learning from one task to another, which is rare in current robotics.

  • What is 'steerability' in the context of robots?

    Steerability refers to a robot's ability to adapt its actions based on human feedback or changed conditions.

  • Why is scalable evaluation important for robots?

    Scalable evaluation is important for assessing robot performance across different tasks and environments.

Subtitles (en)

  • 00:00:00
    The title of my talk today is "What's Missing for Robotics-first Foundation Models?" To preface this, I really want to emphasize the absolute amount of energy right now in embodied AI, and in robotics specifically. Even today at CVPR I see so many first-time roboticists for whom this is their first CVPR. I think it's the result of a lot of converging trends across different fields, from language modeling to computer vision to embodied decision making, so it's really such an exciting time to be working on these problems. But the purpose of my talk today is to try to paint a picture of what I view as the bleeding edge of modern trends that are ongoing in robotics and, more importantly, to ask some provocative questions about how understanding this bleeding edge can point us to the missing pieces and open challenges that are on the path between us and really solving general-purpose robotics. The work I discuss today comes from a great group at Google DeepMind, and any controversial opinions are only my own, especially the wrong ones.

  • 00:01:10
    To give a brief glance at today's agenda: I'll first motivate what a robot foundation model is, and since you've heard from so many great speakers today I'll keep this brief. Most importantly, I'd like to focus on a few missing pieces that I hope will paint a path toward the most important problems to solve before robotics becomes generally accessible. Finally, I'll save some time for the horizons of how the field might evolve at a meta level.

  • 00:01:38
    First off, some very brief preliminaries. The modern sense-plan-act paradigm in robotics is a pipeline system: you have perception, understanding the world; you have planning, how you synthesize different types of information, especially semantics, human intent, and so on; and of course control and actuation, moving the robot to influence the world. What this has looked like in the modern era, and I use that term in the loosest sense to mean the last 10 years, is that robotics has been very interdisciplinary, a very full-stack effort, where advances in perception, in language, and in other domains are quickly adopted and utilized by roboticists to drive progress forward. This looks like taking the state-of-the-art, leading-edge model from CVPR and plugging it in as a perception stack, taking some kind of off-the-shelf planning system, or recently a learned system, and then really focusing on the control part as the last mile. This has worked pretty well, especially when we were scaling up CNNs or scaling up initial smaller-scale planners, but in the age of foundation models, some of the lessons we've taken to heart as robotics people over the last 10 years may not continue to hold in the next few years.

  • 00:02:56
    One issue you might think of is that it's great to keep taking these off-the-shelf state-of-the-art models, but of course these are frozen. They're pre-trained; they're not optimized for control; they're not optimized for decision making. The same goes for the language model planners: if you view emergence as a phenomenon that only arises organically at scale, then maybe the types of emergent capabilities you need for low-level control are not the same ones you need for internet-scale text modeling. The second issue, which is related, is that the bottlenecks between these modules are very narrow. They're heuristic-driven, they're expert-defined, and they're perhaps not expressive enough to encompass the benefits of scale and transfer that we all know and love from foundation modeling.

  • 00:03:44
    So how do we tackle these two issues? Maybe we can take some lessons from trends and practices that have worked in foundation modeling. I'm sure all of you know this story: in language modeling and in vision there were all these separate tasks and datasets and specialties, and the insight of scaling internet-scale foundation models is that by treating these problems as interoperable, by breaking down the barriers between the strict definitions of what you call sentiment classification versus what you call translation, you get a lot of benefits and really great scaling trends. In robotics, going back to the modern loop, we today still have very big brick walls between these modules. Is there a way we can break them down? If we can do so, that would be a robotics-first foundation model. I think we have some ideas of how to do this, which I'll cover in today's talk, but even beyond that there are a lot of questions we don't yet have answers to, where I think it will be very exciting to make progress.
    progress so why we even want to do this
  • 00:04:57
    okay let's say we break down these brick
  • 00:04:59
    walls we robotics versus Foundation
  • 00:05:00
    model what does that mean I think
  • 00:05:02
    there's three missing pieces that are
  • 00:05:04
    really nice properties that we see in
  • 00:05:06
    today's language models and vision
  • 00:05:08
    language models that we as of today do
  • 00:05:10
    not yet see in robotics one is really
  • 00:05:14
    positive transfer at imense Scales where
  • 00:05:17
    you basically can assume by default that
  • 00:05:18
    positive transfer and scaling will work
  • 00:05:20
    as opposed to robotics these days it's
  • 00:05:22
    more rare than you know that you see
  • 00:05:24
    positive transfer than cases where you
  • 00:05:26
    get negative results um and again when
  • 00:05:28
    this happens you can are making
  • 00:05:30
    predictive scaling laws about how much
  • 00:05:32
    you should scale how many how much you
  • 00:05:33
    know compute how many tokens you need uh
  • 00:05:36
    and Robotics is far from that today
  • 00:05:38
    another one I think that is very close
  • 00:05:40
    to my heart is steerability and
  • 00:05:41
    pumpability right uh in in my opinion
  • 00:05:44
    this is what really made me see the
  • 00:05:46
    light between gpt2 and gpt3 eras where
  • 00:05:48
    you saw that few shot learning where you
  • 00:05:51
    saw pump engineering really start to
  • 00:05:53
    take off at a massive scale and in
  • 00:05:54
    robotics I think we're far from that and
  • 00:05:57
    finally I think scalable evaluation is
  • 00:06:00
    an aspect where I really enjoy the
  • 00:06:02
    progress that's been made in Foundation
  • 00:06:03
    modeling with everything from very
  • 00:06:05
    realistic meaningful evaluations to
  • 00:06:08
    predicted benchmarks that correlate well
  • 00:06:10
    with how users actually would would you
  • 00:06:12
    know leverage the capabilities these
  • 00:06:15
    models and my claim here is that these
  • 00:06:18
    properties are not just nice to have but
  • 00:06:20
    we actually need that in order to tackle
  • 00:06:22
    the full unstructured Wilderness of the
  • 00:06:25
    real world again robotics has kind of
  • 00:06:27
    spedrun the the scale up from one one
  • 00:06:30
    bin for indust object picking to one
  • 00:06:32
    room to one lab building but to go from
  • 00:06:34
    here to all the VC funded human
  • 00:06:37
    companies and home robot companies and
  • 00:06:39
    assisting the elderly and and playing
  • 00:06:41
    with your dogs and kids this is kind of
  • 00:06:43
    a huge gap I think that you don't get by
  • 00:06:46
    just incremental progress but by
  • 00:06:48
    fundamentally transformative Paradigm
  • 00:06:49
    shifts which can leverage emergence
  • 00:06:51
    scale
  • 00:06:54
    generalization and a hot take here is
  • 00:06:56
    that I think while we see Signs of Life
  • 00:06:59
    today I think if you froze technology
  • 00:07:02
    today you cannot solve Robotics and
  • 00:07:03
    again I think there's companies and
  • 00:07:05
    startups and academics betting on this
  • 00:07:07
    of it's just a data problem it's just
  • 00:07:09
    engineering I think there are still
  • 00:07:10
    fundamental Paradigm shifts that are
  • 00:07:12
    algorithmic or or you know maybe data
  • 00:07:14
    breakthroughs that or needed um until
  • 00:07:17
    it's it's going to be trackable to try
  • 00:07:18
    to go and solve
  • 00:07:21
    RS and let's then jump into the
  • 00:07:24
    nitty-gritty of these missing pieces
  • 00:07:25
    then right so the first one I mentioned
  • 00:07:28
    was positive transfer from scaling and I
  • 00:07:30
    think you know to to kind of illustrate
  • 00:07:32
    what this mean I think we can just point
  • 00:07:33
    at all of the kind of forerunners in
  • 00:07:35
    language modeling and BLM right you see
  • 00:07:37
    these great large scale internet scale
  • 00:07:40
    data sets you see great you know scaling
  • 00:07:42
    laws and chinchilla optimality and again
  • 00:07:45
    I think this is building off of the fact
  • 00:07:46
    that passive the interet gets you a lot
  • 00:07:48
    of great properties that are good for
  • 00:07:51
    language modeling and and vision langage
  • 00:07:52
    modeling but in robotics that's not the
  • 00:07:55
    case right so I I I think in robotics in
  • 00:07:57
    order to study how you can get positive
  • 00:08:00
    transfer you need to First scale the
  • 00:08:02
    data and there's probably two ways which
  • 00:08:05
    I'll dive into shortly is is one maybe
  • 00:08:07
    you can just make robot data look more
  • 00:08:09
    like internet data you can just treat
  • 00:08:11
    them as the same data source and the
  • 00:08:13
    second is maybe maybe your specific
  • 00:08:15
    robot there's not enough data and even
  • 00:08:17
    with you know a bajillion dollars you
  • 00:08:18
    can't get enough but maybe if you pull
  • 00:08:20
    together all of the robot data that's
  • 00:08:22
    been collected across different
  • 00:08:23
    embodiment tasks environments maybe you
  • 00:08:26
    have a better shot of seeing some kind
  • 00:08:28
    of positive transfer in St
  • 00:08:30
    us um so again Vision language models
  • 00:08:34
    this is kind of the golden standard for
  • 00:08:35
    what we want to model the properties we
  • 00:08:37
    like to see in robotics off this is one
  • 00:08:39
    existence group of pro concept what's
  • 00:08:41
    worked and BMS capture both Visual and
  • 00:08:43
    semantic knowledge of the world and then
  • 00:08:46
    you put it side by side here this is the
  • 00:08:48
    rth network architecture for a model
  • 00:08:50
    called rt1 from our team from almost two
  • 00:08:53
    and a half years ago which is kind of
  • 00:08:54
    like a home grw Garage Band Transformer
  • 00:08:57
    model that has these components of you
  • 00:08:59
    know know a perception Vision encoder
  • 00:09:01
    some kind of like cross modal tension we
  • 00:09:03
    s for and then you know a Transformer
  • 00:09:05
    blocks be the B for the reasoning and
  • 00:09:07
    finally some kind of action token output
  • 00:09:09
    again these are design choices that were
  • 00:09:11
    really optimized at our scale on our
  • 00:09:12
    problems which we're still pretty large
  • 00:09:14
    scales by academic domains on hundreds
  • 00:09:16
    and hundreds of robot tasks but again if
  • 00:09:18
    if you kind of squint at this these
  • 00:09:19
    components start to look like the design
  • 00:09:21
    decisions that are used in state of the
  • 00:09:23
    art Vision language models right and you
  • 00:09:25
    can kind of just like try to think about
  • 00:09:27
    how you would fit the kind of design we
  • 00:09:30
    have to make the small rp1 scale into a
  • 00:09:33
    large VM and you see a lot of properties
  • 00:09:36
    that are same again these are disti
  • 00:09:38
    tokens uh they're very similar just
  • 00:09:40
    maybe with some slight output
  • 00:09:42
    differences and input differences uh but
  • 00:09:45
    maybe you can just use the de as a
  • 00:09:46
    policy you don't need to iterate at
  • 00:09:48
    small scales on your home brew Garage
  • 00:09:50
    Band policies you can just leverage the
  • 00:09:52
    kind of great work and infra and data
  • 00:09:55
    from Vision language mod and then the
  • 00:09:57
    main question becomes well how do you
  • 00:09:59
    align the action domain how do you align
  • 00:10:01
    the action manable um and again this is
  • 00:10:04
    what our actions look like they're kind
  • 00:10:07
    of robot specific but they Encompass
  • 00:10:09
    things like positional change and
  • 00:10:10
    rotations and Grier open close and and
  • 00:10:12
    termin action types but that that's
  • 00:10:14
    broadly you know going to be specific to
  • 00:10:16
    per robot and uh you can you can choose
  • 00:10:19
    discretize them but I think the main
  • 00:10:20
    point is that these look still a little
  • 00:10:22
    bit different right from like the textt
  • 00:10:24
    you find on Wikipedia articles or on
  • 00:10:27
    image captions on the web but maybe
  • 00:10:29
    there's ways you can just make them the
  • 00:10:30
    same by adding making just turning it
  • 00:10:33
    into a string the most naive thing you
  • 00:10:35
    can do and see how that works and that's
  • 00:10:36
    exactly what we did in this work called
  • 00:10:38
    rt2 um we just converted the tokenized
  • 00:10:41
    discretized robot actions into a string
  • 00:10:44
    such as this list of eight integers and
  • 00:10:47
    and and you can just treat that as a
  • 00:10:50
    caption so robot action prediction is
  • 00:10:51
    now just a v2a um of course there's
  • 00:10:54
    Alternatives and that we can think of
  • 00:10:56
    such as maybe we use floats but then
  • 00:10:58
    there's more tokens and consen is a
  • 00:11:00
    thing you can use extra IDs or like
  • 00:11:02
    least Us action tokenization you can use
  • 00:11:04
    strings and I think these are all very
  • 00:11:06
    compelling choices but I think the most
  • 00:11:07
    simple thing to start with is just
  • 00:11:09
    turning your raw robot actions and just
  • 00:11:11
    casting it as a string and seeing what
  • 00:11:13
    happens uh and this is what's what's now
  • 00:11:16
    called a vision language action model
  • 00:11:17
    that treats actions as just another
  • 00:11:19
    rality represented in text um for the
  • 00:11:21
    backbones we look at po X models which
  • 00:11:24
    is a model for Google is all py at the
  • 00:11:27
    5B 12b and5 variance and uh importantly
  • 00:11:31
    we we we we co- find un we co- Trin the
  • 00:11:34
    robot data at the bottom along with the
  • 00:11:36
    original vqa data the internet data that
  • 00:11:39
    the original were trained with start the
  • 00:11:42
    robot data is this offline data set that
  • 00:11:44
    we iterated on for rt1 and many other
  • 00:11:46
    works it's a it's a large expert
  • 00:11:48
    demonstration data set that we save
  • 00:11:50
    outline um and what we see then is that
  • 00:11:52
    when we when we train this large vaa
  • 00:11:55
    model we start to see some emergent
  • 00:11:57
    skills and I use emergent in the sense
  • 00:11:58
    that these were not not specific
  • 00:12:00
    capabilities uh such as you know OCR or
  • 00:12:02
    recognizing Flags these are not any
  • 00:12:04
    concepts that we collected data for in
  • 00:12:06
    our robot data set but kind of emerg via
  • 00:12:09
    positive transfer from whatever semantic
  • 00:12:11
    Concepts were present in the internet
  • 00:12:13
    scale VM training the VM data and
  • 00:12:17
    quantitatively right you know post talk
  • 00:12:19
    we can kind of try to like you know
  • 00:12:20
    squint and see what kind of these new
  • 00:12:22
    capabilities are and we we roughly group
  • 00:12:25
    them into things like symol
  • 00:12:26
    understanding reasoning or or human
  • 00:12:27
    recognition and we see that
  • 00:12:29
    this is indeed positive transfer
  • 00:12:31
    happening thanks to the V paradig and
  • 00:12:34
    again though those were kind of like
  • 00:12:35
    hand evals right you maybe you don't
  • 00:12:37
    necessarily care about if you're
  • 00:12:38
    roboticist uh you know whether or not
  • 00:12:40
    your robot Foundation model recognizes
  • 00:12:42
    Taylor S maybe you do care about though
  • 00:12:44
    is generalization to distribution shifts
  • 00:12:46
    like lighting conditions or new objects
  • 00:12:48
    or new background conditions and here we
  • 00:12:50
    see that actually the VA models the web
  • 00:12:52
    data also gives you some of this for
  • 00:12:54
    free just by treating robotics as the
  • 00:12:57
    same as in data
  • 00:13:00
    switching uh to the second part though
  • 00:13:02
    so we've treated robot data now the same
  • 00:13:04
    as internet data but often times maybe
  • 00:13:07
    that robot data still isn't enough when
  • 00:13:08
    you get any additional bank gr up by
  • 00:13:10
    training on all the robot data even if
  • 00:13:13
    the robots look different than the one
  • 00:13:14
    you have in your lab in your factory um
  • 00:13:17
    here we we had a a cross institutional
  • 00:13:19
    collaboration called Open Cross
  • 00:13:21
    embodiment where we pulled together uh
  • 00:13:23
    the data sets that were open source for
  • 00:13:25
    more than 30 robot labs and arreated
  • 00:13:27
    together and see whether or not uh we
  • 00:13:29
    could get some nice scaling Properties
  • 00:13:31
    by treating all the robot data as the
  • 00:13:34
    same uh and this data set is very very
  • 00:13:37
    heterogeneous we have many different
  • 00:13:39
    robot moralities and volums we have many
  • 00:13:41
    different environments across the world
  • 00:13:44
    and many different CTIC tasks that each
  • 00:13:46
    of the individual researchers I cared
  • 00:13:47
    about when they collected their original
  • 00:13:49
    data set it's different from clock
  • 00:13:51
    folding to this like cable routing where
  • 00:13:52
    you have to put a cable very dextrously
  • 00:13:55
    into this small socket um it's really
  • 00:13:58
    quite diverse um and with this diers
  • 00:14:00
    data set we studied kind of the scaling
  • 00:14:03
    properties at two model classes one is
  • 00:14:05
    the rg1 the small um homemade
  • 00:14:08
    Transformer that we had 35 million
  • 00:14:09
    parameters as well as rg2 the full vaa
  • 00:14:12
    55 billion parameter vaa scale to see
  • 00:14:14
    whether or not uh you know treating
  • 00:14:16
    these different robot AC you know data
  • 00:14:18
    sets as just input is some robot images
  • 00:14:21
    output are some robot actions
  • 00:14:23
    representative strings what would happen
  • 00:14:25
    in this case and here surprisingly we
  • 00:14:27
    actually see signs where the generalist
  • 00:14:30
    policies which are trained on all of the
  • 00:14:32
    robot data outperforms the Specialists
  • 00:14:34
    which are alized for each of their
  • 00:14:35
    evaluation settings and in this case
  • 00:14:38
    right it's actually I think this was to
  • 00:14:40
    me very shocking because for each of
  • 00:14:42
    these individual kind of uh you know
  • 00:14:44
    Labs here each of these Lo sport on to
  • 00:14:46
    lab there generally I think the what the
  • 00:14:50
    best accepted um you know understanding
  • 00:14:52
    of roboticist is that if you care about
  • 00:14:53
    maximizing performance on your specific
  • 00:14:55
    task on your robot and your setting like
  • 00:14:58
    you should just TR on robot data from
  • 00:15:00
    your robot because adding any other data
  • 00:15:02
    is going to be you know not is it's just
  • 00:15:04
    going to like you know your policies
  • 00:15:06
    will not be robust to those if they're
  • 00:15:08
    going to overfit to them it's not going
  • 00:15:09
    to help but in this case it seems that
  • 00:15:11
    training on all the data actually
  • 00:15:13
    improve performance even on the if you
  • 00:15:15
    only cared about performance on your
  • 00:15:16
    specific setup that was pretty you know
  • 00:15:19
    exciting to see in robotics because this
  • 00:15:21
    is a try we've seen in other fields but
  • 00:15:22
    in robotics this is really really rare
  • 00:15:24
    so far and again the Improvement SC was
  • 00:15:27
    almost 50%
  • 00:15:29
    another interesting Trend that we found
  • 00:15:31
    is that actually we were getting to the
  • 00:15:32
    point where we were seeing that smaller
  • 00:15:34
    models would underfit this large
  • 00:15:36
    heterogeneous robot data set and again
  • 00:15:39
    overfitting is normally uh not something
  • 00:15:42
    uh I think that underfitting is not
  • 00:15:44
    something that we usually see in
  • 00:15:45
    robotics because usually the data amount
  • 00:15:47
    is so small that model capacity is never
  • 00:15:49
    an issue and here it it was very
  • 00:15:51
    surprising that model capacity actually
  • 00:15:53
    was required to kind of soak up all of
  • 00:15:55
    this diverse robot data
  • 00:15:59
    and one big question though was that
  • 00:16:02
    great you had all this robot data from
  • 00:16:04
    all the robots wasn't even needed you
  • 00:16:06
    already had all the web data from R22
  • 00:16:09
    does adding all this incremental you
  • 00:16:11
    know Poss data at anything at all and
  • 00:16:13
    the good news is that we did sp there
  • 00:16:15
    was know a lot of these spatial
  • 00:16:17
    reasoning skills about the the the
  • 00:16:19
    motions and the Precision of specific
  • 00:16:21
    tasks like whether you put the apple on
  • 00:16:24
    the cloth versus near the cloth these
  • 00:16:26
    were tasks that only adding the robot
  • 00:16:28
    data would enable the policies to do
  • 00:16:31
    whereas the internet data alone wouldn't
  • 00:16:33
    give you these understanding and
  • 00:16:34
    capabilities for
  • 00:16:36
    As a brief recap, we've seen attempts at data scaling and positive transfer in the two projects I just highlighted. One was taking our robot dataset from RT-1, putting it alongside internet data, and treating them the same; then we scaled it up even further by taking all the robot data from many different embodiments. We saw that in both cases, by treating robot actions and robot embodiments as just another data modality, you did see signs of positive transfer. For RT-1 this meant we do well on the training set, in distribution; for RT-2 we start to get internet-scale semantics; and for RT-X we start to understand spatial precision and concepts like that, which are important for action and for physics.

  • 00:17:21
    But that's roughly where we got to by the end of last year, the state of the world, and there are many open challenges and things which do not work that I want to highlight. One is that VLAs in current training paradigms still overfit to robotics data distributions. What I mean is: we take a VLA and query it on VQA tasks, an internet image with an internet prompt, and it does well; we give it a robot image and ask it for actions, and it does well. But when you try to mix and match across these data distributions, for both the visual input space and the text input space, it doesn't really seem like you're getting a ton of transfer. This is a very interesting failure mode, because it suggests there are still fundamental issues with how we're doing co-training, how we're mixing robot data with internet data.

  • 00:18:13
    We also see issues with reasoning. Oftentimes reasoning stems from the power of the language-model backbone, from language-model pre-training, and mixing this kind of reasoning process with low-level physical action is also not going exactly as expected. We see cases where reasoning and planning work well when all the objects are out of distribution, but when you add one in-distribution object, such as a Coke can, the model starts to revert to what it knows. Even though it reasons, "hey, if I want to hammer a nail I should use a rock," it has never picked up a rock before, but it has picked up tens of thousands of Coke cans; so even though it reasons that it should use the rock, it still goes for the Coke can. Examples like this are everywhere, but they're not normally as visible, so I wanted to highlight them as interesting failure modes.

  • 00:19:05
    And finally, the model architecture decisions, the design decisions, are also not well thought through. There's a paper, actually just from three days ago, studying how you tokenize and represent your actions, continuous versus discrete classification, and what actual token choices you use. I think that's a great step in the right direction, in making VLA training more of a science and less of an art.
  • 00:19:30
    Moving on (let me just check how I'm doing on time; okay, great), the next missing piece is steerability and promptability. This is very interesting because in robotics today, historically due to the scale of the policies we've been working with, we have really tried to constrain the amount of entropy and information that can be fed into the policy's input. Usually it's just one image, or maybe you frame-stack some history of the past images you've seen, and then you convey a goal in the simplest way possible: a one-hot ID, a very simple English sentence, or maybe a goal image. Generally you're not expanding the bandwidth of your information input to the scale of an open-ended chat interface, like many language models see today. We've seen in language modeling how large context windows enable so many great emergent capabilities, such as few-shot learning, chain-of-thought reasoning, reflection, all this good stuff that we can't really have in robotics because our input interfaces are just so constrained. I really want a promptable robot where you can just say, "hey, can you try again? You moved too far right, can you adjust it on your next try?" We don't have any meta-learning like that right now, and it won't naturally emerge by just adding more of the same types of robot data and more of the same types of internet data. There has to be a paradigm shift to get from where we are today to a promptable, steerable robot control model.

  • 00:21:03
    One hypothesis is that maybe it's language that's the bottleneck, at least language the way we approach it today. We've been taking all the recipes that have worked in other domains and saying: okay, let's just add language as a conditioning modality for robots, let's just add language datasets. But this maybe misses a lot of what makes robots hard. What makes robots a unique and interesting problem to study is the motions, the physics, the causal nature of interaction with the world, and when you try to condense everything into something a Wikipedia article might look like, maybe you just lose a lot of that in translation.
  • 00:21:44
    Translation um and and so towards that
  • 00:21:47
    uh you know maybe honing in on the
  • 00:21:49
    motion of Robotics is is what this work
  • 00:21:52
    has to do called Archy trajectory which
  • 00:21:54
    takes the idea of you know hindsight
  • 00:21:56
    experience replay you take the goal you
  • 00:21:59
    you turn that into the target for your
  • 00:22:00
    policy learning but does that for the
  • 00:22:02
    motion of the trajectory that was
  • 00:22:04
    executed when we collect robot data
  • 00:22:05
    often times we end Factor like Pro
  • 00:22:08
    perception for all your joint States and
  • 00:22:10
    your Motors are kind of store and we
  • 00:22:12
    currently just toss all that away when
  • 00:22:13
    we just save out RGB action trajectories
  • 00:22:16
    maybe you can actually utilize you know
  • 00:22:18
    the motion of what happened in the real
  • 00:22:19
    world and use that to condition
  • 00:22:21
    condition your policy because you
  • 00:22:22
    already have this data for free we just
  • 00:22:24
    currently toss it out so we do in this
  • 00:22:26
    work is we take that proception and
  • 00:22:27
    effector data we project it into 2D RGB
  • 00:22:30
    space and that just becomes a kind of
  • 00:22:32
    visual hint a visual Chain of Thought if
  • 00:22:34
    you will of not just what you should do
  • 00:22:36
    you should you know put the chip back in
  • 00:22:38
    the in the drawer but how you should do
  • 00:22:40
    it how you should approach it from what
  • 00:22:41
    angle where you should close how you
  • 00:22:42
    should avoid object obstacles at what
  • 00:22:45
    depth and what Heights you should be
  • 00:22:46
    operating at and then we feed it into an
  • 00:22:48
    rp1 policy and and and what's nice is
  • 00:22:51
    that even though training this was
  • 00:22:52
    automated at inference time then you
  • 00:22:55
    know maybe getting this trajectory from
  • 00:22:57
    scratch would have been hard but
  • 00:22:59
    uh the policy is agnostic to what
  • 00:23:01
    trajectory you give it it doesn't have
  • 00:23:02
    to be hindsight it can be handdrawn by a
  • 00:23:05
    human it can be you can take a human
  • 00:23:07
    video and do you know pose extraction
  • 00:23:09
    with the hand and and then get this
  • 00:23:10
    strory out of there or you can even give
  • 00:23:12
    Foundation models either you know
  • 00:23:14
    generative image models or or or
  • 00:23:16
    language models to go and predict what
  • 00:23:18
    the trajectory should be and then that
  • 00:23:19
    can condition your pre-trained policy
  • 00:23:22
    which is only train on H side
  • 00:23:23
    trajectories and what's nice here is
  • 00:23:25
    that we start to see signs of motion
  • 00:23:27
    generalization right this is this is
  • 00:23:30
    where you have tasks which have
  • 00:23:32
    fundamentally maybe they're operating at
  • 00:23:34
    different Heights or different motions
  • 00:23:35
    just combinations of State action
  • 00:23:37
    transition which were just never in the
  • 00:23:38
    training data which has been
  • 00:23:40
    traditionally hard for some of our
  • 00:23:41
    policies which are just kind of overfit
  • 00:23:43
    to pick up the cocan and and that that's
  • 00:23:46
    all I know and you can't really
  • 00:23:47
    generalize to cocan and vastly different
  • 00:23:50
    motion profiles um but we start to see
  • 00:23:52
    that here with tasks such as swiveling
  • 00:23:54
    chairs or or folding towels or or
  • 00:23:56
    picking objects up from a chair at a
  • 00:23:58
    very l low
  • 00:23:59
    height but most interesting for the
  • 00:24:01
    topic here is that this was the first um
  • 00:24:04
    project I worked on at least where I
  • 00:24:05
    really got some kind of Shadow or some
  • 00:24:08
    hint of I felt that prompt engineering
  • 00:24:10
    was possible here we'd sometimes try out
  • 00:24:12
    you know evaluations in very different
  • 00:24:14
    settings with like in like a home
  • 00:24:15
    setting like Ikea like setting with with
  • 00:24:17
    with Darkwood furniture at different
  • 00:24:19
    heights and the robot would sometimes be
  • 00:24:21
    able to generalize this out of tradtion
  • 00:24:22
    settings but then just by drawing a a
  • 00:24:25
    prompt trajectory differently the human
  • 00:24:27
    can kind of learn the failure the the
  • 00:24:29
    centricities of the robot and just by
  • 00:24:31
    like you know a trained operator could
  • 00:24:32
    then learn to draw better trajectories
  • 00:24:34
    and then start to zero shot new tasks uh
  • 00:24:37
    again this was no other policy that we
  • 00:24:39
    had today where the input is always
  • 00:24:40
    something as simple as you know hey pick
  • 00:24:42
    up a Coke can pick up a Pepsi can close
  • 00:24:44
    a drawer going from that to like you
  • 00:24:46
    know hey you're this brand new setting
  • 00:24:47
    the drawer is going to be five
  • 00:24:48
    centimeters lower than you're used to
  • 00:24:50
    and there's also clutter can you go and
  • 00:24:52
    do that like there's no way any policy
  • 00:24:53
    could be responsive to that yet with
  • 00:24:55
    trajectories it seems like sometimes we
  • 00:24:57
    would get that
  • 00:24:59
    and concurrently it's been really great
  • 00:25:01
    to see over the past six months a lot of
  • 00:25:02
    other works also kind of adopts
  • 00:25:04
    paradigms using you know Point tracking
  • 00:25:07
    Optical flow using other kinds of motion
  • 00:25:09
    representations to also have you know
  • 00:25:12
    touch upon the same idea is that what
  • 00:25:14
    makes robotics unique its physics its
  • 00:25:16
    motion its trajectories maybe adding
  • 00:25:18
    this information to our foundation
  • 00:25:20
    models can unlock things which language
  • 00:25:22
    canot um which is very excited to see
  • 00:25:25
    but I think overall it's still very nent
  • 00:25:27
    and none of these methods have scaled up
  • 00:25:29
    to the level of let's say a 55 billion
  • 00:25:31
    parameter Vision language action
  • 00:25:34
    model next up though one other
  • 00:25:37
    interesting question I I was asking you
  • 00:25:38
    know maybe a year ago was I have all
  • 00:25:40
    these rights of language in the section
  • 00:25:42
    of this this section right was like what
  • 00:25:44
    can we do Beyond language but maybe the
  • 00:25:46
    language maybe language was not the
  • 00:25:48
    problem just the way we were using
  • 00:25:49
    language that we're using language in
  • 00:25:51
    twoo sinful of a way and if the language
  • 00:25:53
    was more grounded in in motions and
  • 00:25:55
    physics and what makes robotics unique I
  • 00:25:58
    more hierarchical granular we could get
  • 00:26:00
    the benefits and so in this project
  • 00:26:01
    called Archy hierarchy we would kind of
  • 00:26:03
    do a Chain of Thought prediction where
  • 00:26:05
    you turn the the pick cocan the very
  • 00:26:07
    abstract very like long Horizon T into
  • 00:26:10
    intermediate very grounded language
  • 00:26:13
    Motions like move your arm right and
  • 00:26:15
    rotate it clockwise and then close your
  • 00:26:17
    gripper into like language that's very
  • 00:26:19
    grounded towards very short Horizon um
  • 00:26:22
    actions and again this can kind of be
  • 00:26:23
    viewed as like chain of thoughts still
  • 00:26:25
    using language at the Chain of Thought
  • 00:26:26
    medium but doing it in a way which is
  • 00:26:28
    much easier than perhaps to generalize
  • 00:26:31
    and what's interesting then is that this
  • 00:26:32
    unlocked learning from uh you know a
  • 00:26:36
    category of tasks we had been collecting
  • 00:26:38
    for over six months that no other policy
  • 00:26:40
    was able to use that data from because
  • 00:26:41
    they were just so hard so it's kind of
  • 00:26:43
    like Tas related to the serial task
  • 00:26:45
    where there's like a small Gap you have
  • 00:26:46
    to put a bowl and you have to push a
  • 00:26:47
    lever and then there's this like
  • 00:26:49
    cluttered like you know thing of oal
  • 00:26:51
    packets and it's like slightly more
  • 00:26:53
    dextrous and precise than the past we've
  • 00:26:55
    seen before but policies were just
  • 00:26:57
    really struggling to kind of you know
  • 00:27:00
    work at the the entropy of these systems
  • 00:27:02
    which was which was just a little bit
  • 00:27:04
    more complex than before and then these
  • 00:27:06
    language motions are what unlock the
  • 00:27:08
    ability to work on those types of
  • 00:27:11
    TS and talking about like pumpability
  • 00:27:14
    and steerability we also saw that having
  • 00:27:17
    interventions in language of hey you
  • 00:27:19
    know actually you you you're doing the
  • 00:27:21
    right low level control it was just your
  • 00:27:23
    intermediate plan of translating closed
  • 00:27:25
    the pistachio jar you wanted to move up
  • 00:27:28
    you should have moved left humans could
  • 00:27:30
    correct this in a dagger likee setting
  • 00:27:32
    in a much more efficient manner than it
  • 00:27:33
    would have taken for low-level action
  • 00:27:35
    interventions and this was a way to kind
  • 00:27:37
    of scale up this Chain of Thought level
  • 00:27:39
    Improvement where you don't always need
  • 00:27:41
    to make your you know iterate at the
  • 00:27:42
    Action level which is the most expensive
  • 00:27:44
    level maybe sometimes you can you know
  • 00:27:45
    intervene at the language motion or
  • 00:27:47
    highering level that is also kind of
  • 00:27:49
    stability and probability one of the
  • 00:27:51
    rare that
  • 00:27:53
    I and and to kind of summarize here rt1
  • 00:27:57
    and rp2 even though they were scaled up
  • 00:27:59
    so much they were still shoving together
  • 00:28:00
    all of the foundation model knowledge
  • 00:28:02
    all the large data sets through a very
  • 00:28:04
    narrow bottleneck of simple language
  • 00:28:05
    instructions and by whing that kind of
  • 00:28:08
    bottom neck via either things like C
  • 00:28:10
    trajectories via motion Centric
  • 00:28:12
    representations or even via language
  • 00:28:14
    just better language seem to unlock more
  • 00:28:16
    of the intelligence that's contained in
  • 00:28:18
    these internet scale
  • 00:28:21
    foundations and the question here
  • 00:28:23
    becomes is that I view these these
  • 00:28:25
    projects kind of as fruital Concepts I'm
  • 00:28:27
    very excited about results but again we
  • 00:28:29
    haven't scaled them up yet to the full
  • 00:28:31
    internet scale and the question is maybe
  • 00:28:33
    to scale them up to you know a full
  • 00:28:35
    vision reaction model maybe we need more
  • 00:28:37
    robot data and my hot take here is that
  • 00:28:41
    actually this is not a call to action
  • 00:28:42
    for me to you know a lot of you know
  • 00:28:44
    companies for example right now are
  • 00:28:46
    convinced that algorithms are are solved
  • 00:28:48
    it's just data and maybe this slide
  • 00:28:50
    would have suggested that but I think we
  • 00:28:51
    don't actually know what kinds of robot
  • 00:28:53
    data you need and so if you prematurely
  • 00:28:54
    try to scale deploy
  • 00:28:59
    youting not correct maybe sometimes that
  • 00:29:02
    wouldn't have been a recoverable mistake
  • 00:29:04
    so again I don't think I'm suggesting to
  • 00:29:06
    not do that because I think we
  • 00:29:07
    definitely need to do that but I think
  • 00:29:08
    there's also work that robot researchers
  • 00:29:10
    need to think about on can we figure out
  • 00:29:13
    the right types of modalities and data
  • 00:29:15
    sets and tasks and skills and labels
  • 00:29:17
    that we need when we go about collecting
  • 00:29:19
    these like Society scale data
  • 00:29:23
    sets and finally then I think uh one
  • 00:29:26
    missing piece is scalable evaluation
  • 00:29:29
    right this is I think uh generally a
  • 00:29:31
    problem throughout all of AI right now
  • 00:29:33
    um we we we see that all these different
  • 00:29:36
    you know benchmarks and you know
  • 00:29:38
    leaderboards uh are are oftentimes a lot
  • 00:29:40
    of hype these days because they're very
  • 00:29:41
    easily gainable but they're good
  • 00:29:43
    attempts I think at least in Foundation
  • 00:29:45
    Molly Broadley to try to capture various
  • 00:29:48
    representations of capabilities that we
  • 00:29:50
    want and the challenge here however is
  • 00:29:52
    that is all AI models across the you
  • 00:29:55
    know broadly the field are everything
  • 00:29:57
    that want wants to be a genous these
  • 00:29:58
    days right and and that's a very high
  • 00:30:00
    claim if your genous model can do
  • 00:30:02
    everything how do you back up rigorously
  • 00:30:04
    that you can actually do everything and
  • 00:30:06
    and be robust at that and I think this
  • 00:30:09
    is clearly already an issue in in
  • 00:30:11
    Foundation modeling but I think things
  • 00:30:12
    like these ELO based leader boards or
  • 00:30:14
    just getting these models out and
  • 00:30:16
    deployed and open source or even as
  • 00:30:18
    close you know Source apis these are
  • 00:30:20
    ways where you can just leave it up to
  • 00:30:22
    you know the the court of the public to
  • 00:30:25
    kind of say hey are your models actually
  • 00:30:27
    good and in robotics this is kind of
  • 00:30:29
    hard right because robotics Target
  • 00:30:31
    physical data distributions and it's
  • 00:30:33
    it's not really clear whether or not
  • 00:30:34
    some representative small set of you
  • 00:30:37
    know evaluations can capture you know
  • 00:30:40
    all of the the properties you want to
  • 00:30:41
    measure um or or if there's just a lot
  • 00:30:44
    of stuff that you need to do before you
  • 00:30:45
    get to a product that's out but I think
  • 00:30:47
    regardless of what the answer is the
  • 00:30:49
    problem is very clear and the problem is
  • 00:30:51
    that as our policies within our team at
  • 00:30:53
    Google de might have scaled the amount
  • 00:30:54
    of evaluations we've had to do have also
  • 00:30:56
    scaled right and I think this might not
  • 00:30:58
    sound like a lot of 3,000 or 6,000
  • 00:31:00
    trials but if you consider that each
  • 00:31:01
    trial can take up to 10 15 minutes you
  • 00:31:04
    know this is quickly going to become
  • 00:31:06
    intractable if we add another order of
  • 00:31:08
    magnitude to what we claim our policies
  • 00:31:10
    can do the next year or two um that just
  • 00:31:12
    not going to be tractable for any
  • 00:31:13
    industry or academic
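
    The arithmetic behind that claim is easy to check (trial counts and per-trial durations as quoted above):

        trials = 6000               # real-robot trials for one model release
        minutes_per_trial = 12.5    # midpoint of the quoted 10-15 minutes
        robot_days = trials * minutes_per_trial / 60 / 24
        print(f"{robot_days:.0f} robot-days of evaluation")  # ~52 robot-days

    Another 10x on top of that is on the order of a robot-year and a half spent on nothing but evaluation.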
  • 00:31:15
    Maybe one way forward is to decompose the space of claims you want to make about these robot foundation models and break them down into particular axes meant to represent policy generalization, encompassing the properties we hope to measure: different backgrounds, added distractor objects and clutter, changed lighting conditions, new objects, and so on. Then we can measure generalization gaps from our training set to our test set. We did this in the real world, and we indeed found that some of these axes of generalization, these factors of distribution shift, were harder than others, and that there were scaling curves as you added more of these factors to your training data. I think this was an initial attempt at codifying or formalizing what it means for your robot foundation model to be a generalist foundation model, and it's definitely a challenge now, because every new robot foundation-model policy that comes out has to do all this on its own again, define its own evals and its own benchmarks, and it's just hard to compare apples to apples in today's academic landscape.
    landscape maybe another path however is
  • 00:32:23
    of course using simulation and and today
  • 00:32:25
    there's so many great talks on world
  • 00:32:26
    models in Sim especially for ab and in
  • 00:32:29
    robot manipulation it's been a bit
  • 00:32:31
    harder because manipulation right with
  • 00:32:33
    contact forces and physics and visual
  • 00:32:35
    distribution shifts and oftentimes data
  • 00:32:36
    sets are very tuned to one limited
  • 00:32:38
    setting which raises the bar for how
  • 00:32:40
    realistic the systems have to be this
  • 00:32:42
    has been an ongoing challenge so an
  • 00:32:44
    Insight in our recent work here called
  • 00:32:46
    simpler was that maybe you don't need a
  • 00:32:48
    full Fidelity digital twin in order to
  • 00:32:51
    that you would need for syrial training
  • 00:32:53
    maybe all you can need to do is optimize
  • 00:32:55
    for correlation between the ranking of
  • 00:32:57
    policies in Sim and how they would rank
  • 00:32:59
    in the real world if you had evaluated
  • 00:33:01
    them in the real world and by doing this
  • 00:33:03
    we would try to just get a minimal
  • 00:33:04
    viable Sim like gets you useful signal
  • 00:33:07
    of you know which which checkpoints you
  • 00:33:09
    should use which with kind of policies
  • 00:33:10
    you should devote your very expensive
  • 00:33:12
    real house to and again I don't think
  • 00:33:14
    it's perfect but I think it it was
  • 00:33:16
    working well for a various classes of
  • 00:33:18
    generalist robot policies that we were
  • 00:33:20
    operating on such as rt1 rt1 X rt2
  • 00:33:24
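Reduced to code, the idea is that sim success rates don't have to match real ones; only the ranking of policies has to agree. Below is a minimal sketch using Spearman rank correlation as one possible agreement measure (SIMPLER defines its own metrics; the policy names and success rates here are invented).

```python
from scipy.stats import spearmanr

# Success rates for the same policies, evaluated in sim and in real.
# All values are invented for illustration.
policies     = ["rt1_ckpt_a", "rt1_ckpt_b", "rt1x", "rt2"]
sim_success  = [0.31, 0.45, 0.52, 0.70]
real_success = [0.41, 0.38, 0.60, 0.74]

# A "minimum viable sim" only needs to preserve the ordering of
# policies, not reproduce their absolute success rates.
rho, p = spearmanr(sim_success, real_success)
print(f"Spearman rank correlation: {rho:.2f} (p={p:.3f})")
# High rho means sim evals can tell you which checkpoints deserve
# your scarce real-robot evaluation hours.
```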
And again, a shout-out to concurrent and related works, including the ones presented today by Alex and by Sherry, and also Genie, a latent-action-conditioned video model from Google DeepMind. These are all directionally where we would want to go in seeing whether these kinds of approaches can also act as good offline policy evaluation. My hot take, then, is that despite all of this great progress, real-world evals will always be the gold standard; you can never fully replace them. Offline evals can help you decide how to spend your limited bandwidth for real-world evals, but they won't replace them completely. So if you actually need real evals, and you need them at scale, that becomes a very challenging problem, and it's perhaps one that will be solved by products deployed in the wild, where you actually get evals at scale from actual user deployments, from people actually using your robots and your robot policies.
And finally, I'd like to talk a bit about what this means for how we can predict the next year or two of what robot foundation models might look like. The first part is a recap. We talked about positive transfer and the scaling laws of robotics as we increase our datasets and our model capacities. The bleeding edge here, from what I talked about today, is the vision-language-action (VLA) modeling paradigm, plus a lot of improvements in cross-training on many different types of robot data. This is again just my own opinion, but I would give the field a six out of 10, in terms of how much we have progressed and how many unknown unknowns remain. As you saw, the failure modes are still overfitting, and VLA training is still more of an art than a science: your data-training mixtures and how you go about composing them, your action-tokenization decisions, and so on. I think there's a lot of science that still needs to happen here.
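As one concrete example of such a design decision, a common recipe is to discretize each continuous action dimension into a fixed number of uniform bins so that actions can be emitted as ordinary tokens (RT-2-style VLAs use 256 bins per dimension); the action ranges below are illustrative, and other binning schemes are part of the "art".

```python
import numpy as np

def tokenize_action(action, low, high, n_bins: int = 256):
    """Discretize each continuous action dimension into integer tokens
    via uniform binning over [low, high] per dimension."""
    action = np.clip(action, low, high)
    scaled = (action - low) / (high - low)            # -> [0, 1]
    return np.minimum((scaled * n_bins).astype(int), n_bins - 1)

def detokenize_action(tokens, low, high, n_bins: int = 256):
    """Map integer tokens back to bin-center continuous values."""
    return low + (tokens + 0.5) / n_bins * (high - low)

# 7-DoF example: xyz delta, rpy delta, gripper (ranges are illustrative).
low  = np.array([-0.05] * 6 + [0.0])
high = np.array([ 0.05] * 6 + [1.0])
a = np.array([0.01, -0.02, 0.0, 0.0, 0.0, 0.03, 1.0])
toks = tokenize_action(a, low, high)
print(toks)                                           # integer tokens
print(detokenize_action(toks, low, high).round(3))    # ~recovers `a`
```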
For steerability and promptability, we talked about going beyond language: thinking about motions, thinking about trajectories. Here I would say the field is at a four out of 10, because while we've had signs of life, maybe sparks of what could become a promptable or steerable robot learning paradigm, none of these have been scaled up to a large scale. We don't yet know what data requirements or bottlenecks will occur when we do try to scale these methods, and we don't know whether they will scale to more dexterous embodiments such as humanoids or consumer robots.
And finally, for scalable evaluation, where we can think about generalization or simulation, I would score the field at a three out of 10, just because oftentimes we don't even know what we should be evaluating for. Even before we can design the eval, we need to set up the paradigms, the structure, the formalisms, and all of that is still very ad hoc, done almost on a per-project basis. So going forward, I hope that either through concerted efforts, or just by bringing robot prediction and low-level control closer to foundation modeling, these problems of evaluation can also become more homogeneous.
With that, let me take these final points, the reasons I gave these ratings, and try to predict what the solutions might be, say on the 2025-to-2026 horizon. The first is that once we understand VLAs better, once VLA training turns into a science and not just an art, I think robotics research will also start to split into pre-training and post-training, just as foundation modeling has. Once you have access to very robust starting points, foundation models from Google, maybe from OpenAI, maybe from other companies, then post-training your robot will become more of what your daily cycles look like as a practitioner or a researcher. For robot-specific data, I think the robot data engine, if you're really thinking about deployments at a societal scale, is where industry and startups will be able to contribute a lot to the field. What that's going to look like, we'll see, but I think the race is really starting up this year, with so many startups and exciting new companies.
And then finally, for evaluations: with all of the amazing progress we're seeing in video modeling, I'm really excited to see what the action-conditioned variants of these world models will look like, and how well they'll be able to model out-of-distribution behaviors, not just in-distribution training data. For evaluating robot policies, you don't just want them to be aesthetic or realistic; you want them to actually measure the long tail of very rare events, and you want them to model contact forces and physical causality, which may not always correlate with the aesthetic pre-training YouTube video data they may be trained on. As well as, of course, product deployments; so I think evaluation is another setting where we'll see industry and startups contribute. Those are again just my own opinions, but I'm very excited to see how the field of robot foundation models develops. With that said, thanks for your time. Reach me at my email here, and I'll share the slides afterwards. Thanks for your time.
Thank you very much, Ted, for the great talk. There's already a question; go ahead.
[Inaudible audience question.]
So I guess the question here was: with texture randomization being one axis of generalization, a distribution shift that seems challenging, are there methods or data points we have on how to address that? There have been some works on data augmentation for robotics in a semantic fashion: instead of broadly domain-randomizing with just random textures, actually using semantically relevant textures that are appropriate for the scene. You don't want a neon random-RGB wallpaper in your home; you want wallpapers that are actually seen in people's homes. I think this is still very nascent, but there have already been very good results from people pushing on these generalization settings with things like diffusion models for image editing.
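Here is a sketch of what that kind of semantic augmentation can look like with an off-the-shelf diffusion inpainting pipeline; the model id, file paths, and prompt are placeholders of my own choosing, not from the talk.

```python
# Sketch of semantic data augmentation for robot camera frames:
# instead of random textures, inpaint semantically plausible
# variations. Paths, prompt, and model id are illustrative.
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    torch_dtype=torch.float16,
).to("cuda")

frame = Image.open("episode_0001/frame_042.png").convert("RGB")
# Mask covering the background wall, e.g. from a segmentation model.
wall_mask = Image.open("episode_0001/wall_mask_042.png").convert("L")

# Semantically relevant variation: a wallpaper you'd actually see in a
# home, rather than a neon random-RGB texture.
augmented = pipe(
    prompt="living room wall with beige floral wallpaper, photorealistic",
    image=frame,
    mask_image=wall_mask,
).images[0]
augmented.save("episode_0001/frame_042_aug.png")
```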
Other questions? Yes, please.

[Partially inaudible audience question:] Video data, real data, simulation data: are they all contributing? If you have to pick one, which one would you go for? Or is there transfer between them, so it doesn't even matter which representation you use?
Yeah, absolutely. So the question is: video data, sim data, real data, are they all equal, do they transfer? My sense, at least based on the last few years, is that real data is absolutely key. If you offer me the same quantity of real data versus sim versus video, I'm picking real any day of the week. However, maybe at some point there will be diminishing returns, and if you offer me a thousand times more, or a million times more, sim or video data compared to robot action data, then maybe the trade-off starts to become different. But I would say the past is also not a very good predictor of what might happen in the future, because in the past the capacities of robot foundation-model policies were so small that if you could only operate at the scale of 100,000 trajectories, you would of course just operate on 100,000 real trajectories. You never had the opportunity to operate at internet scale, with a VLA that did have the capacity to consume a million times more sim and video data, and I think we have the opportunity to do that in the next year, or even now.
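One way to operationalize that trade-off is as a co-training mixture weight, where real data stays heavily over-weighted per batch even when the sim and video corpora are orders of magnitude larger. A minimal sketch; the dataset sizes and the 60/25/15 split are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dataset sizes (episodes): real robot data is scarce; sim and video
# corpora are vastly larger. All numbers are made up.
sizes   = {"real": 1e5, "sim": 1e7, "video": 1e9}
# Per-batch sampling weights: deliberately over-weight real data
# rather than sampling proportionally to dataset size.
weights = {"real": 0.60, "sim": 0.25, "video": 0.15}

def sample_batch_sources(batch_size: int):
    """Pick a data source for each example in a training batch."""
    names = list(weights)
    p = np.array([weights[n] for n in names])
    return rng.choice(names, size=batch_size, p=p)

batch = sample_batch_sources(256)
for name in weights:
    print(name, round((batch == name).mean(), 3))
```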
[Partially inaudible audience question about what's missing from robotics foundation models beyond vision and language: the physical sense of touch.]
Yeah, great question. So the question was about physical touch, sensing: this is so important for humans, how about for robots? My sense here is that I put up two columns of what could make robots steerable and what's missing from foundation models that's unique to robots, and I would definitely put sensing and other modalities there as well. The reason I think it's a little farther off, or at least not a top priority for me, is that just grounding the idea of temporal motion, putting that into a visual-language latent space, is already so hard. Adding a completely new modality like touch sensors, where you don't have a lot of data, where the data is noisy, where it's so different from internet data, is going to be even harder. If we can't even add trajectories of motions, then adding sensing data might be even harder. So that's why, for me at least, the aim is to try to make progress on the motions, the physics, the trajectories first; hopefully that will tell us how we should approach sensing as well.
In the interest of time, let's have two last questions. I see two hands; you can start in the front, and then the other.

[Inaudible audience question.]
Yeah, great question. So the question is that in this RT-H work, the interface between the high-level planning and the low-level control was this low-level language motion; are there other choices beyond language? Absolutely, I think there are many other kinds of representations you could use. For example, trajectories are one: your high-level plan could be, hey, how do I wipe the surface of the chair, and then the interface is a curve drawn on the RGB image. There's definitely a lot more; language is just a natural one, where you could hope to get a bit more transfer from things like chain-of-thought in non-embodied domains.
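Here is a sketch of the interface being discussed: a high-level model emits an intermediate representation that the low-level policy conditions on. The function bodies are stand-ins, and the language-motion string and action format are illustrative, not the actual RT-H implementation.

```python
# Sketch of a hierarchical planning/control interface. `high_level`
# and `low_level` are placeholders for real models.
def high_level(image, task: str) -> str:
    """Plan the next language motion given the scene and task
    (RT-H-style phrases look like 'move arm forward')."""
    return "move arm forward"                      # placeholder output

def low_level(image, task: str, motion: str):
    """Predict a 7-DoF action conditioned on the language motion."""
    return [0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]    # placeholder action

image = None                                       # stand-in camera frame
task = "wipe the surface of the chair"
motion = high_level(image, task)                   # interface: language
action = low_level(image, task, motion)
print(motion, action)

# The same structure admits other interfaces, e.g. a 2D trajectory
# sketch drawn on the RGB image (as in RT-Trajectory) instead of text.
```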
And the final question.

[Inaudible audience question.]
Yeah, the question was about interpolation versus true out-of-distribution generalization and reasoning: maybe oftentimes, when we claim we're generalizing to new objects or lighting conditions, we're actually just interpolating the data we already saw. That's why, when I define "emergent", I try to embrace this by saying that emergent doesn't mean it's not present in any of the training data and has emerged magically; it's that it was in the data, but internet-scale data is too much for us to codify, so it's just emerging, in the sense that we're finding out what was in the data. We're finding out what's in the data, it's being projected into robot action space, and we're seeing what has projected successfully. So I don't even know whether we necessarily need to solve true generalization in order for robotics to work well. If we're able to solve just interpolation at scale, in a very predictable fashion, I would already be happy with that in terms of getting general-purpose robots. Beyond that, for AGI, a loaded term, yes, probably.

Okay, sorry, so let's thank the speaker again.
Tags
  • robotics
  • foundation models
  • positive transfer
  • steerability
  • scalable evaluation
  • general applicability
  • data
  • machine learning
  • interdisciplinary research
  • challenges