CVPR24 FM4AS | Ted Xiao: What's Missing for Robotics-first Foundation Models?
Summary
TLDR: The talk focuses on the current challenges and opportunities in robotics, particularly with respect to foundation models. The speaker highlights that there is significant energy and interest in the field, but that substantial gaps remain before robots become generally useful. Three central missing pieces are presented: positive transfer at scale, steerability, and scalable evaluation. The speaker argues that current methods are not sufficient and that paradigm shifts are needed to exploit the opportunities offered by large datasets and advanced models. The talk also discusses how robots' ability to generalize and adapt to new situations can be improved.
Takeaways
- 🤖 Robots face challenges with positive transfer.
- 📊 Scalable evaluation is essential for measuring robot performance.
- 🔄 Steerability can be improved with human feedback.
- 🌐 Large datasets can help generalize skills.
- 🔍 Paradigm shifts are needed in robotics.
Timeline
- 00:00:00 - 00:05:00
Today I'm talking about what's missing in robotics with respect to foundation models. There is enormous energy in robotics and AI, and many new researchers are attending conferences like CVPR. I will focus on the most pressing challenges that must be solved to make robotics more accessible, and discuss how understanding current trends can point us to the missing pieces.
- 00:05:00 - 00:10:00
Robotics today follows a pipeline approach of perception, planning, and control. The last 10 years have been marked by interdisciplinarity, where advances in language processing and computer vision have quickly been adopted in robotics. But with the rise of foundation models, we need to rethink how we apply these technologies in robotics.
- 00:10:00 - 00:15:00
One challenge is that current models are not optimized for control and decision making. There are also narrow bottlenecks between the modules, which limit the benefits of scale and transfer. We need to find ways to break down these barriers to create a robotics foundation model.
- 00:15:00 - 00:20:00
There are three elements missing in robotics that we see in language models: positive transfer at large scale, steerability, and scalable evaluation. Today, positive transfer is rare in robotics, and we do not have the same scaling laws as in language models. We need to find ways to achieve these properties in robotics.
- 00:20:00 - 00:25:00
To achieve positive transfer, we need to scale the data. We can either make robot data look more like internet data, or pool data from different robot tasks to see whether positive transfer emerges. We have seen that by treating robot data as part of internet data, we can achieve some positive results.
- 00:25:00 - 00:30:00
We have also investigated how training across different robot datasets can improve performance. Surprisingly, we have seen that generalist models trained on data from many different robots can outperform specialist models trained only on specific tasks.
- 00:30:00 - 00:35:00
It is important to understand that even with access to large amounts of data, that alone is not always sufficient. We must also consider how to use robot data to improve our models and achieve better performance on specific tasks.
- 00:35:00 - 00:40:00
The next challenge is steerability and promptability. Today, robots' inputs are often constrained, which makes it hard to exploit the opportunities that large language models offer. We need to find ways to expand robots' input interfaces so they can learn and adapt better.
- 00:40:00 - 00:46:07
Finally, scalable evaluation is a major challenge in robotics. We need methods to evaluate robot performance in a meaningful way that captures their abilities across different situations. This requires a systematic approach to evaluation that we have not yet achieved.
Video Q&A
What are the three central missing pieces in robotics according to the talk?
The three central missing pieces are positive transfer at scale, steerability, and scalable evaluation.
How can robots benefit from large amounts of data?
By treating robot data as part of a larger data pool, robots can achieve better generalization and transfer of skills.
What does 'positive transfer' mean in robotics?
Positive transfer refers to the ability to apply learning from one task to another, which is rare in current robotics.
What is 'steerability' in the context of robots?
Steerability refers to a robot's ability to adapt its actions based on human feedback or changed conditions.
Why is scalable evaluation important for robots?
Scalable evaluation is important for assessing robot performance across different tasks and environments.
- 00:00:00 The title of my talk today is "What's Missing for Robotics-first Foundation Models?" To preface this, I really want to emphasize the absolute amount of energy right now in embodied AI, and in robotics specifically. Even today at CVPR I see so many roboticists for whom this is their first CVPR. I think it's the result of a lot of converging trends in different fields, from language modeling to computer vision to embodied decision making, so it's really such an exciting time right now to be working on these problems.
- 00:00:31 But the purpose of my talk today is to try to paint a picture of what I view as the bleeding edge of the modern trends ongoing in robotics, and more importantly, to ask some provocative questions about how understanding this bleeding edge can point us to the missing pieces and open challenges that are on the path between us and really solving general-purpose robotics. The work I discuss today comes from a great group at Google DeepMind, and the controversial opinions are only my own, especially the wrong ones.
- 00:01:12 For a brief glance at today's agenda: I'll first motivate what a robot foundation model is, and since you've heard from so many great speakers today, I'll keep this brief. Most importantly, I'd like to focus on a few missing pieces that I hope will paint a path toward the most important problems to solve before robotics becomes generally accessible. Finally, I'll save some time for the horizons of how the field might evolve at a meta level.
- 00:01:38 First off, some very brief preliminaries.
- 00:01:44 The modern sense-plan-act paradigm in robotics is a pipeline system: you have perception, understanding the world; you have planning, how you synthesize different types of information, especially semantics, human intent, and so on; and of course control and actuation, moving the robot to influence the world. What this has looked like in the modern era (and I use that in the loosest sense, meaning the last 10 years) is that robotics has been very interdisciplinary, a very full-stack effort, where advances in perception, language, and other domains are quickly adopted and utilized by roboticists to drive progress forward. This looks like taking the state-of-the-art, leading-edge model from CVPR and plugging it in as the perception stack, taking some kind of great off-the-shelf planning system (or, recently, a learned system), and then really focusing on the control part as the last mile. This has worked pretty well, I think, especially when we were scaling up CNNs or scaling up initially smaller-scale planners. But in the age of foundation models, some of the lessons that we've taken to heart as robotics people over the last 10 years may not continue into the next few years.
- 00:02:56 One issue you might think of is that it's great to keep taking these off-the-shelf state-of-the-art models, but of course they're frozen, they're pre-trained; they're not optimized for control, not optimized for decision making. The same goes for the language-model planners: if you really view emergence as a phenomenon that only arises organically at scale, then maybe the types of emergent capabilities and intelligence you need for low-level control are not the same ones you need for internet-scale text modeling. The second, related issue is that the bottlenecks between these modules are very narrow: they're heuristic-driven, they're expert-defined, and they're perhaps not expressive enough to encompass the benefits of scale and transfer that we all know and love from foundation modeling.
- 00:03:47 So how do we tackle these two issues? Well, maybe we can take some lessons from the trends and practices that have worked in foundation modeling. I'm sure all of you know this story: in language modeling and in vision there were all these separate tasks, datasets, and specialties, and the insight of scaling large, internet-scale foundation models is that by treating these problems as interoperable, by breaking down the barriers between the strict definitions of what you call sentiment classification versus what you call translation, you get a lot of benefits and really great scaling trends. In robotics, going back to the modern loop, we today still have these very big brick walls between the modules. Is there a way we can break them down? If we can do so, that would be a robotics-first foundation model. I think we have some ideas of how to do this, which I'll cover in today's talk, but even beyond that there are a lot of questions we don't yet have answers to, which I think will be very exciting to make progress on.
- 00:04:57 So why do we even want to do this? Okay, let's say we break down these brick walls and build a robotics-first foundation model: what does that mean? I think there are three missing pieces, really nice properties that we see in today's language models and vision language models, that we do not yet see in robotics. One is positive transfer at immense scales, where you can basically assume by default that positive transfer and scaling will work; in robotics these days, cases where you see positive transfer are rarer than cases where you get negative results. And when positive transfer happens, you can make predictive scaling laws about how much you should scale, how much compute, how many tokens you need; robotics is far from that today.
- 00:05:38 Another one that is very close to my heart is steerability and promptability. In my opinion, this is what really made me see the light between the GPT-2 and GPT-3 eras, where you saw few-shot learning and prompt engineering really start to take off at a massive scale; in robotics I think we're far from that. And finally, scalable evaluation is an aspect where I really enjoy the progress that's been made in foundation modeling, with everything from very realistic, meaningful evaluations to predictive benchmarks that correlate well with how users would actually leverage the capabilities of these models.
- 00:06:15 My claim here is that these properties are not just nice to have; we actually need them in order to tackle the full unstructured wilderness of the real world. Robotics has kind of speedrun the scale-up from one bin for industrial object picking, to one room, to one lab building. But to go from here to all the VC-funded humanoid companies and home-robot companies, to assisting the elderly and playing with your dogs and kids, is a huge gap that I think you don't cross by incremental progress, but by fundamentally transformative paradigm shifts which can leverage emergence, scale, and generalization. And a hot take here: while we see signs of life today, I think if you froze technology today you could not solve robotics. There are companies, startups, and academics betting that it's just a data problem, just engineering; I think there are still fundamental paradigm shifts, algorithmic or perhaps data breakthroughs, that are needed before it's going to be tractable to go and solve robotics.
- 00:07:21 Let's then jump into the nitty-gritty of these missing pieces. The first one I mentioned was positive transfer from scaling. To illustrate what this means, we can just point at all the forerunners in language modeling and VLMs: you see these great large-scale, internet-scale datasets, you see great scaling laws and Chinchilla optimality. This builds off the fact that passively scraping the internet gets you a lot of great properties that are good for language modeling and vision language modeling; in robotics, that's not the case. So in robotics, in order to study how you can get positive transfer, you need to first scale the data, and there are probably two ways, which I'll dive into shortly. One: maybe you can just make robot data look more like internet data, and treat them as the same data source. Two: maybe for your specific robot there's not enough data, and even with a bajillion dollars you can't get enough; but if you pool together all of the robot data that's been collected across different embodiments, tasks, and environments, maybe you have a better shot of seeing some kind of positive transfer.
- 00:08:30 Vision language models are kind of the gold standard for the properties we'd like to see in robotics; they're one existence proof of what's worked, and VLMs capture both visual and semantic knowledge of the world. Put side by side with that, here is the network architecture of a model called RT-1 from our team, from almost two and a half years ago, which is kind of a homegrown, garage-band Transformer model. It has a perception vision encoder, some kind of cross-modal attention, then Transformer blocks for the reasoning, and finally some kind of action-token output. These are design choices that were optimized at our scale on our problems (which are still pretty large scales by academic standards, on hundreds and hundreds of robot tasks), but if you squint at this, these components start to look like the design decisions used in state-of-the-art vision language models. You can think about how the design we made at the small RT-1 scale would fit into a large VLM, and you see a lot of the same properties: these are just tokens, very similar, maybe with some slight input and output differences. So maybe you can just use the VLM as a policy: you don't need to iterate at small scales on your homebrew garage-band policies; you can leverage the great work, infrastructure, and data from vision language models.
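The component mapping described above (vision encoder → cross-modal fusion → Transformer reasoning → action tokens) can be sketched in a few lines. This is purely an illustrative data-flow skeleton with made-up dimensions and random weights; it is not RT-1's or any real VLM's architecture, and the "fusion" and "reasoning" steps are reduced to trivial stand-ins just to show the interfaces.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes -- purely illustrative, not RT-1's or any real VLM's dimensions.
D, NUM_PATCHES, PATCH_DIM, NUM_TEXT, NUM_ACTION_DIMS, VOCAB = 64, 16, 48, 8, 8, 256

def vision_encoder(image_patches):
    # Stand-in for the perception / vision-encoder component:
    # project raw patch features into the shared token width D.
    w = rng.standard_normal((image_patches.shape[-1], D)) / np.sqrt(image_patches.shape[-1])
    return image_patches @ w                       # (NUM_PATCHES, D)

def fuse_and_reason(vision_tokens, text_tokens):
    # Stand-in for cross-modal attention + Transformer blocks:
    # here just concatenation and a mean-pool, to show the data flow.
    tokens = np.concatenate([vision_tokens, text_tokens], axis=0)
    return tokens.mean(axis=0)                     # (D,)

def action_head(features):
    # Stand-in for the action-token output: one categorical
    # distribution over VOCAB bins per action dimension.
    w = rng.standard_normal((D, NUM_ACTION_DIMS * VOCAB)) / np.sqrt(D)
    logits = (features @ w).reshape(NUM_ACTION_DIMS, VOCAB)
    return logits.argmax(axis=-1)                  # (NUM_ACTION_DIMS,) token ids

image_patches = rng.standard_normal((NUM_PATCHES, PATCH_DIM))  # fake patch features
text_tokens = rng.standard_normal((NUM_TEXT, D))               # fake instruction tokens
action_tokens = action_head(fuse_and_reason(vision_encoder(image_patches), text_tokens))
```

The point of the sketch is only that the policy's interface, tokens in and discrete action tokens out, is the same shape as a VLM's interface, which is what makes swapping in a large pretrained backbone plausible.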
- 00:09:57 Then the main question becomes: how do you align the action modality? This is what our actions look like: they're robot-specific, but they encompass things like positional changes, rotations, gripper open/close, and a terminate-episode action type. That's broadly going to be specific to each robot, and you can choose to discretize them, but the main point is that these still look a little bit different from the text you find in Wikipedia articles or in image captions on the web. Maybe, though, there's a way to make them the same by just turning them into a string, the most naive thing you can do, and seeing how that works.
- 00:10:36 That's exactly what we did in this work called RT-2: we converted the tokenized, discretized robot actions into a string, such as a list of eight integers, and you can just treat that as a caption, so robot action prediction is now just a VQA task. Of course there are alternatives you can think of: maybe you use floats, but then there are more tokens and the token count becomes a concern; you can use extra IDs, or least-used-token action tokenizations, or strings. These are all very compelling choices, but I think the simplest thing to start with is just taking your raw robot actions, casting them to a string, and seeing what happens.
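The naive discretize-and-stringify scheme described above can be sketched as follows. The 256-bin count matches what's described in the talk; the per-dimension ranges are hypothetical placeholders (real bounds are robot-specific), so treat this as a shape sketch rather than RT-2's actual tokenizer.

```python
import numpy as np

NUM_BINS = 256

# Hypothetical action bounds for an 8-dim action:
# 3 position deltas, 3 rotation deltas, gripper open/close, terminate flag.
# Real values are robot-specific; these are illustrative only.
ACTION_LOW = np.array([-0.1] * 3 + [-np.pi / 2] * 3 + [0.0, 0.0])
ACTION_HIGH = np.array([0.1] * 3 + [np.pi / 2] * 3 + [1.0, 1.0])

def action_to_string(action: np.ndarray) -> str:
    """Discretize each of the 8 action dimensions into 256 bins, then
    render the result as a space-separated string of integers, so a
    VLM can emit the action as an ordinary text 'caption'."""
    clipped = np.clip(action, ACTION_LOW, ACTION_HIGH)
    frac = (clipped - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW)
    bins = np.round(frac * (NUM_BINS - 1)).astype(int)
    return " ".join(str(b) for b in bins)

def string_to_action(text: str) -> np.ndarray:
    """Invert the mapping: parse the integer string back into
    continuous values in the original action range."""
    bins = np.array([int(t) for t in text.split()])
    frac = bins / (NUM_BINS - 1)
    return ACTION_LOW + frac * (ACTION_HIGH - ACTION_LOW)
```

For example, `action_to_string(np.zeros(8))` yields an eight-integer string, and `string_to_action` recovers the original action up to quantization error of at most one bin width per dimension.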
- 00:11:16 This is what's now called a vision-language-action (VLA) model: it treats actions as just another modality, represented in text. For the backbones we looked at the PaLI-X models, a family of vision language models from Google, at the 5B, 12B, and 55B variants, and importantly we co-fine-tuned, we co-trained, on the robot data along with the original VQA data, the internet data the original models were trained with. The robot data is the offline dataset we iterated on for RT-1 and many other works; it's a large expert demonstration dataset that we saved offline. What we see is that when we train this large VLA model, we start to see some emergent skills, and I use "emergent" in the sense that these were not specific capabilities (such as OCR or recognizing flags) that we collected data for in our robot dataset; they kind of emerged via positive transfer from whatever semantic concepts were present in the internet-scale VLM training data. Quantitatively, post hoc, we can squint and see what these new capabilities are: we roughly grouped them into things like symbol understanding, reasoning, and human recognition, and we see that this is indeed positive transfer happening, thanks to the VLA paradigm.
- 00:12:34 Those were somewhat hand-picked evals, though; as a roboticist, maybe you don't necessarily care whether your robot foundation model recognizes Taylor Swift. What you maybe do care about is generalization to distribution shifts, like lighting conditions, new objects, or new backgrounds, and here we see that the web data in the VLA models also gives you some of this for free, just by treating robotics data the same as internet data.
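The co-training setup described above becomes possible precisely because actions are rendered as text: robot examples and web VQA examples share one (image, prompt) → text format and can be interleaved in a single batch. Here is a minimal sketch of such a mixture sampler; the example records and the 50/50 mixing weight are illustrative assumptions, not the data or ratios used in the actual work.

```python
import random

# Toy stand-ins for the two data sources: in the talk these are the
# offline robot demonstration dataset and the original web VQA data.
robot_examples = [
    {"image": "robot_cam_0.png",
     "prompt": "What action should the robot take to pick up the apple?",
     "target": "1 128 91 241 5 101 127 217"},  # action rendered as a token string
]
web_examples = [
    {"image": "web_photo_0.png",
     "prompt": "What is in this image?",
     "target": "a bowl of fruit on a table"},
]

def cotrain_batch(batch_size: int, robot_weight: float = 0.5, seed: int = 0):
    """Sample a co-training batch from the two sources. Because actions
    are just text, both sources fit the same (image, prompt) -> text
    supervision and can be mixed freely. The 50/50 weight is an
    illustrative choice, not the ratio used in the actual work."""
    rng = random.Random(seed)
    batch = []
    for _ in range(batch_size):
        source = robot_examples if rng.random() < robot_weight else web_examples
        batch.append(rng.choice(source))
    return batch
```

In practice the mixture ratio is itself a tuning knob: too little web data and the model forgets its internet-scale semantics; too little robot data and it never learns precise action prediction.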
- 00:13:00 Switching to the second part: we've now treated robot data the same as internet data, but oftentimes that robot data still isn't enough. Do you get an additional boost by training on all the robot data, even if those robots look different from the one in your lab or your factory? Here we had a cross-institutional collaboration called Open X-Embodiment, where we pulled together the open-source datasets from more than 30 robot labs, aggregated them, and asked whether we could get some nice scaling properties by treating all the robot data the same. This dataset is very, very heterogeneous: many different robot morphologies and embodiments, many different environments across the world, and many different robotic tasks that the individual researchers cared about when they collected their original datasets, everything from cloth folding to cable routing, where you have to place a cable very dexterously into a small socket. It's really quite diverse.
- 00:13:58 With this diverse dataset we studied the scaling properties of two model classes: RT-1, the small homemade Transformer with 35 million parameters, and RT-2, the full VLA at 55-billion-parameter scale, to see what would happen if we treated these different robot datasets as just inputs (some robot images) and outputs (some robot actions represented as strings). Here, surprisingly, we actually see signs that the generalist policies, trained on all of the robot data, outperform the specialists, which are specialized for each of their evaluation settings. This was, to me, very shocking, because the best-accepted understanding among roboticists is that if you care about maximizing performance on your specific task, on your robot, in your setting, you should just train on robot data from your robot: adding any other data means your policies will overfit to it or not be robust to your setting; it's not going to help. But in this case, it seems that training on all the data actually improved performance even if you only cared about your specific setup. That was pretty exciting to see in robotics, because this is a trend we've seen in other fields, but in robotics it has been really, really rare so far. And the improvement was almost 50%.
- 00:15:29 Another interesting trend we found is that we were getting to the point where smaller models would underfit this large, heterogeneous robot dataset. Underfitting is not something we usually see in robotics, because usually the data amount is so small that model capacity is never an issue; here it was very surprising that model capacity was actually required to soak up all of this diverse robot data.
- 00:15:59 One big question, though: great, you had all this robot data from all the robots, but was it even needed? You already had all the web data from RT-2; does adding all this incremental robot data add anything at all? The good news is that it did: there were a lot of spatial reasoning skills, about the motions and the precision of specific tasks (like whether you put the apple on the cloth versus near the cloth), that only adding the robot data would enable the policies to do; the internet data alone wouldn't give you these understandings and capabilities for free.
- 00:16:36 As a brief recap, we've seen attempts at data scaling and positive transfer in the two projects I just highlighted: one was taking our robot dataset from RT-1, putting it alongside internet data, and treating them the same; then we scaled it up even further by taking all the robot data from many different embodiments. We saw that in both cases, by treating robot actions and robot embodiments as just another data modality, you did see signs of positive transfer. For RT-1, we do well on the training set, in distribution; for RT-2, we start to get internet-scale semantics; and for RT-X, we start to understand spatial precision and concepts like that, which are important for action and for physics. But that's roughly where we got to by the end of last year, I would say; the state of the world.
- 00:17:29 Beyond that, though, there are many open challenges and things which do not work that I want to highlight. One is that VLAs in current training paradigms still overfit to robotics data distributions. What I mean is: take a VLA and query it on VQA tasks (an internet image, an internet prompt) and it does well; give it a robot image and ask it for actions, and it does well. But when you try to mix and match across these data distributions, for both the visual input space and the text input space, it doesn't really seem like you're getting a ton of transfer. This is a very interesting failure mode, because it suggests there are still fundamental issues with how we're doing co-training, how we're mixing robot data with internet data.
- 00:18:13 We also see issues with reasoning. Oftentimes reasoning stems from the power of the language-model backbone, from language-model pre-training, and mixing that kind of reasoning process with low-level physical action is also not going exactly as expected. We see cases where, for example, your reasoning and planning work well when all the objects are out of distribution, but when you add one in-distribution object, such as a Coke can, your model starts to revert to what it knows. Even though it reasons that if it wants to hammer a nail it should use a rock, it's never picked up a rock before, but it has picked up tens of thousands of Coke cans, so it still goes for the Coke can. Examples like this are everywhere, but they're not normally as visible, so I wanted to highlight them as interesting failure modes.
- 00:19:05 Finally, the model architecture and design decisions are also not well thought through. There's actually a paper from just three days ago studying how you tokenize and represent your actions, continuous versus discrete classification, and which actual token choices you use, which I think is a great step in the right direction in making VLA training more of a science and less of an art.
- 00:19:30 Moving on then (let me just check I'm doing OK on time; great), the next missing piece is steerability and promptability. This is very interesting, because in robotics today, historically due to the scale of data the policies have been working with, we have really tried to constrain the amount of entropy and information that can be fed into the policy's input. Usually it is just one image, or maybe a frame-stacked history of past images, and then a goal conveyed in the simplest way possible: a one-hot ID, a very simple English sentence, or maybe a goal image. Generally you are not expanding the bandwidth of your information input to the scale of an open-ended chat interface, as many language models see today. And again, we have seen in language modeling how large context windows enable so many great emergent capabilities, such as few-shot learning, chain-of-thought reasoning, and reflection: all the good stuff that we can't really have in robotics, because our input domains and interfaces are just so constrained.
- 00:20:42 And I really want a promptable robot, one where you can just say: hey, can you try again? You moved too far; can you adjust it on your next try? We don't have any meta-learning like that right now, and it won't naturally emerge just by adding more of the same types of robot data and more of the same types of internet data. There has to be a paradigm shift in order to get from where we are today to a promptable, steerable robot control model.
- 00:21:06 One hypothesis here is that maybe it's language that is the bottleneck, at least language the way we approach it today. We have been taking all the recipes that worked in other domains and saying: let's just add language as a conditioning modality for robots, let's just add language datasets. But this misses a lot of what makes robots hard. What makes robots a unique and interesting problem to study are the motions, the physics, the causal nature of interaction with the world; and when you try to condense everything into something that a language model's Wikipedia article might look like, maybe you just lose a lot of that in translation.
- 00:21:47 Toward that, maybe honing in on the motion of robotics is what this work, called RT-Trajectory, does. It takes the idea of hindsight experience replay (take the goal and turn it into the target for your policy learning), but does that for the motion of the trajectory that was executed. When we collect robot data, the end-effector pose and the proprioception for all your joint states and motors are stored, and we currently toss all of that away when we save out RGB-action trajectories. Maybe you can actually utilize the motion of what happened in the real world and use it to condition your policy, because you already have this data for free; we currently just toss it out.
- 00:22:26 So what we do in this work is take that proprioception and end-effector data and project it into 2D RGB space, and that becomes a visual hint, a visual chain of thought if you will, of not just what you should do (put the chips back in the drawer) but how you should do it: from what angle you should approach, where you should close the gripper, how you should avoid obstacles, and at what depths and heights you should be operating. Then we feed it into an RT-1 policy.
- 00:22:51 What's nice is that even though training this was automated (producing such a trajectory from scratch at inference time would have been hard), the policy is agnostic to what trajectory you give it. It doesn't have to be hindsight: it can be hand-drawn by a human; you can take a human video, do pose extraction on the hand, and get the trajectory out of that; or you can even ask foundation models, either generative image models or language models, to predict what the trajectory should be. All of these can condition your pre-trained policy, which was only trained on hindsight trajectories.
- 00:23:25 And what's nice here is that we start to see signs of motion generalization. These are tasks that operate at fundamentally different heights or with different motions, combinations of state-action transitions that were simply never in the training data. That has traditionally been hard for policies that overfit to "pick up the Coke can, that's all I know" and can't generalize to vastly different motion profiles, but we start to see it here with tasks such as swiveling chairs, folding towels, or picking objects up from a chair at a very low height.

- 00:23:59 Most interesting for the topic here, though, is that this was the first project I worked on where I got some hint that prompt engineering was possible. We would sometimes run evaluations in very different settings, such as a home-like, IKEA-like setting with dark wood furniture at different heights, and the robot would sometimes generalize to these out-of-distribution settings. But just by drawing a prompt trajectory differently, a human could learn the eccentricities and failure modes of the robot, and a trained operator could then learn to draw better trajectories and start to zero-shot new tasks. There was no other policy we had where the input could go beyond something as simple as "pick up a Coke can", "pick up a Pepsi can", "close a drawer" to something like: hey, you're in this brand-new setting, the drawer is five centimeters lower than you're used to, and there's also clutter; can you go and do that? No policy could be responsive to that, yet with trajectories it seems like sometimes we would get that.
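The core mechanism described above, projecting logged 3D end-effector waypoints into the image plane so they can be drawn over the RGB observation as a visual prompt, can be sketched roughly as follows. The pinhole intrinsics, the coordinates, and the OpenCV rasterization mention are illustrative assumptions, not details from RT-Trajectory itself.

```python
import numpy as np

# Illustrative sketch: project logged 3D end-effector waypoints (camera
# frame) into 2D pixel coordinates with a pinhole model, so the path can
# be drawn on the policy's input image. Intrinsics below are assumed.
K = np.array([[500.0,   0.0, 320.0],   # fx, cx
              [  0.0, 500.0, 240.0],   # fy, cy
              [  0.0,   0.0,   1.0]])

def project_waypoints(points_cam: np.ndarray) -> np.ndarray:
    """Project Nx3 points (camera frame, z > 0) to Nx2 pixel coordinates."""
    uvw = (K @ points_cam.T).T           # homogeneous image coordinates
    return uvw[:, :2] / uvw[:, 2:3]      # perspective divide

# A short end-effector path in camera coordinates (meters, assumed values).
path = np.array([[0.0, 0.0, 1.0],
                 [0.1, 0.0, 1.0],
                 [0.1, 0.1, 0.5]])
pixels = project_waypoints(path)
# `pixels` can now be rasterized as a polyline onto the RGB frame that
# conditions the policy (e.g. with cv2.polylines).
```

Because the overlay lives in pixel space, the same interface accepts hindsight paths, hand-drawn curves, or trajectories predicted by a foundation model, which is what makes the policy agnostic to the trajectory's source.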
- 00:24:59 And concurrently, it has been really great to see over the past six months a lot of other works also adopting paradigms using point tracking, optical flow, and other kinds of motion representations, touching on the same idea: what makes robotics unique is its physics, its motions, its trajectories, and maybe adding this information to our foundation models can unlock things which language cannot. That is very exciting to see, but I think overall it is still very nascent, and none of these methods has been scaled up to the level of, say, a 55-billion-parameter vision-language-action model.
- 00:25:37 One other interesting question I was asking maybe a year ago (I had all these critiques of language in this section: what can we do beyond language?) was that maybe language was not the problem, just the way we were using it. We were using language in too simplistic a way, and if the language were more grounded in the motions and physics that make robotics unique, more hierarchical and granular, we could get the benefits. So in this project, called RT-Hierarchy, we do a chain-of-thought prediction where you turn "pick Coke can", a very abstract, long-horizon task, into intermediate, very grounded language motions such as "move your arm right", "rotate it clockwise", and "close your gripper": language that is grounded in very short-horizon actions. Again, this can be viewed as chain of thought, still using language as the chain-of-thought medium, but doing it in a way that is perhaps much easier to generalize.
- 00:26:31 And what's interesting is that this unlocked learning from a category of tasks we had been collecting for over six months that no other policy had been able to use data from, because they were just so hard. These were tasks like the cereal task, where there is a small gap, you have to place a bowl, you have to push a lever, and there is a cluttered bin of oatmeal packets: slightly more dexterous and precise than the tasks we had seen before. Policies had really been struggling to operate at the entropy of these setups, which were just a little more complex than before, and these language motions are what unlocked the ability to work on those types of tasks.
- 00:27:14 And speaking of promptability and steerability, we also saw that interventions in language (hey, you're actually doing the right low-level control; it was just your intermediate plan: translating "close the pistachio jar", you wanted to move up when you should have moved left) let humans correct the policy in a DAgger-like setting much more efficiently than low-level action interventions would have. This was a way to scale up improvement at the chain-of-thought level: you don't always need to iterate at the action level, which is the most expensive level; sometimes you can intervene at the language-motion or hierarchy level. That, too, is a kind of steerability and promptability, one of the rare examples of it we have.
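As a rough sketch of the kind of grounded "language motion" labeling described above, one could map short-horizon end-effector deltas to language strings in hindsight. The vocabulary, thresholds, and sign conventions below are my own illustrative assumptions, not the actual method's.

```python
import numpy as np

# Illustrative hindsight labeler: turn a short-horizon end-effector delta
# into a grounded language motion. Thresholds, axis conventions, and the
# phrase vocabulary are assumptions for the sake of the sketch.
def language_motion(delta_xyz: np.ndarray, gripper_delta: float) -> str:
    if abs(gripper_delta) > 0.5:
        # Assumed convention: negative gripper delta means closing.
        return "close your gripper" if gripper_delta < 0 else "open your gripper"
    axis = int(np.argmax(np.abs(delta_xyz)))     # dominant motion axis
    positive = delta_xyz[axis] > 0
    names = [("move your arm right", "move your arm left"),
             ("move your arm forward", "move your arm backward"),
             ("move your arm up", "move your arm down")]
    return names[axis][0] if positive else names[axis][1]

# Example: a mostly-upward delta with the gripper held fixed.
label = language_motion(np.array([0.01, 0.0, 0.08]), gripper_delta=0.0)
```

Labels produced this way come for free from logged data, which is what makes it possible to train the intermediate language-motion layer at scale, and to accept human language corrections at that same layer at test time.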
- 00:27:53 And to summarize here: RT-1 and RT-2, even though they were scaled up so much, were still shoving all of the foundation-model knowledge and all the large datasets through a very narrow bottleneck of simple language instructions. Widening that bottleneck, whether via things like trajectories, via motion-centric representations, or even via language (just better language), seemed to unlock more of the intelligence contained in these internet-scale foundations.
- 00:28:23 The question then becomes: I view these projects as proofs of concept. I am very excited about the results, but we haven't scaled them up to full internet scale yet, and maybe to scale them up to a full vision-language-action model we need more robot data. My hot take here is that this is not simply a call to action to collect data. A lot of companies right now are convinced that the algorithms are solved and it's just data, and maybe this slide would have suggested that, but I think we don't actually know what kinds of robot data we need; if you prematurely try to scale deployment and your assumptions turn out not to be correct, sometimes that wouldn't be a recoverable mistake. I'm not suggesting we shouldn't do it, because we definitely need to, but there is also work that robotics researchers need to do on figuring out the right types of modalities, datasets, tasks, skills, and labels before we go about collecting these society-scale datasets.
- 00:29:26 And finally, I think one missing piece is scalable evaluation. This is a problem throughout all of AI right now. All these benchmarks and leaderboards often attract a lot of hype these days because they are easily gameable, but they are at least good attempts, in foundation modeling broadly, to capture various representations of the capabilities we want. The challenge is that every AI model across the field wants to be a generalist these days, and that is a very strong claim: if your generalist model can do everything, how do you rigorously back up that it can actually do everything, and be robust at it? This is clearly already an issue in foundation modeling, but things like ELO-based leaderboards, or just getting models out and deployed, whether open source or even as closed-source APIs, are ways to leave it up to the court of public opinion to say whether your models are actually good.
- 00:30:29 In robotics this is hard, because robot policies target physical data distributions, and it is not really clear whether some small, representative set of evaluations can capture all the properties you want to measure, or whether there is just a lot you need to do before you get to a shipped product. But regardless of the answer, the problem is very clear: as the policies within our team at Google DeepMind have scaled, the number of evaluations we have had to run has scaled too. 3,000 or 6,000 trials might not sound like a lot, but if you consider that each trial can take 10 to 15 minutes, this quickly becomes intractable if we add another order of magnitude to what we claim our policies can do in the next year or two. That is just not going to be tractable for any industry or academic lab.
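A quick back-of-envelope check of the evaluation-scaling point above, using the trial counts and per-trial times quoted in the talk:

```python
# Back-of-envelope check of the evaluation-scaling claim. The trial count
# and the 10-15 minute range come from the talk; the midpoint is mine.
trials = 6_000
minutes_per_trial = 12.5                 # midpoint of the quoted 10-15 min
robot_days = trials * minutes_per_trial / 60 / 24
print(f"{robot_days:.0f} robot-days")                    # prints "52 robot-days"
print(f"{robot_days * 10:.0f} robot-days at 10x scale")  # prints "521 robot-days at 10x scale"
```

Roughly 52 days of continuous single-robot time already; one more order of magnitude pushes it past a robot-year, which is the intractability the talk is pointing at.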
- 00:31:18 Maybe one way is to decompose the space of claims you want to make about these robot foundation models and break it down into particular axes meant to represent policy generalization, axes that encompass properties we hope to measure, such as different backgrounds, added distractor objects and clutter, changed lighting conditions, new objects, and so on, and then measure generalization gaps from our training set to our test set.
- 00:31:42 We did this in the real world, and we indeed found that some of these axes of generalization, these factors of distribution shift, were harder than others, and that they had these kinds of scaling curves as you added more of them to your training. I think this was an initial attempt at codifying, or formalizing, what it means for your robot foundation model to be a generalist foundation model. It is definitely a challenge right now, because every new robot foundation-model policy that comes out has to do all of this on its own again, define its own evals and its own benchmarks, and it is just hard to compare apples to apples in today's academic landscape.
- 00:32:23 Maybe another path, of course, is simulation, and today there were so many great talks on world models and sim, especially for autonomous vehicles. In robot manipulation it has been a bit harder: with contact forces, physics, and visual distribution shifts, and with datasets often tuned to one limited setting, the bar for how realistic the simulation has to be is raised, and this has been an ongoing challenge. An insight in our recent work here, called SIMPLER, was that maybe you don't need the full-fidelity digital twin that you would need for sim-to-real training; maybe all you need to do is optimize for correlation between the ranking of policies in sim and how they would rank in the real world, had you evaluated them there. By doing this, we tried to get a minimum viable sim that gives you useful signal about which checkpoints to use and which policies to devote your very expensive real-world evaluations to. Again, I don't think it's perfect, but it was working well for the various classes of generalist robot policies we were operating on, such as RT-1, RT-1-X, RT-2, and so on.

- 00:33:24 And again, a shout-out to concurrent and related work, including work presented earlier today, and Genie, a language- and action-conditioned video diffusion model from Google DeepMind. These are all directionally where we would want to go in seeing whether these kinds of approaches can also act as good offline policy evaluation. My hot take here is that despite all this great progress, I think real-world evals will always be the gold standard; you can never fully replace them. Offline evals can help you decide how to spend your limited bandwidth for real-world evals, but they won't replace them completely. So if you actually need real evals, and you need them at scale, that becomes a very challenging problem, and it is perhaps one that will be solved by products deployed in the wild, where you actually get scaled evals from real user deployments, from people actually using your robots and your robot policies.
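The SIMPLER-style criterion described above, judging a simulator by whether it preserves the real-world ranking of policies rather than by visual fidelity, can be sketched with a rank correlation. The success rates below are made-up illustrative numbers, and plain Spearman correlation is one reasonable choice of ranking metric, not necessarily the paper's exact one.

```python
# Illustrative sketch: score a simulator by how well the ranking of
# policies in sim correlates with their real-world ranking.

def ranks(xs):
    """Rank values from 1 (smallest) upward, averaging ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average rank for the tied block
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra) ** 0.5
    vb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (va * vb)

sim_success = [0.30, 0.55, 0.70, 0.40]   # per-policy success rates in sim (made up)
real_success = [0.25, 0.60, 0.80, 0.35]  # same policies on real robots (made up)
rho = spearman(sim_success, real_success)  # 1.0 here: sim preserves the ranking
```

A high rank correlation means the cheap sim evals can be trusted to pick which checkpoints deserve expensive real-world trials, even when the absolute success rates in sim are off.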
- 00:34:29 Finally, I'd like to talk a bit about what this means for how we might predict what the next year or two of robot foundation models will look like. The first item here is a recap: we talked about positive transfer and the scaling laws of robotics as we increase our datasets and our model capacities. The bleeding edge here, from what I talked about today, is the vision-language-action modeling paradigm, with a lot of improvements from cross-training on many different types of robot data. This is, again, just my own opinion, but I would give the field a six out of ten for how much we have progressed versus how many unknown unknowns remain. As you saw, the failure modes still include overfitting, and VLA training is still more of an art than a science: your data and training mixtures, how you go about them, and your action-tokenization decisions. There is a lot of science that still needs to happen here.
- 00:35:21 For steerability and promptability, we talked about going beyond language: thinking about motions, thinking about trajectories. Here I would say the field is at a four out of ten, because while we have had signs of life, maybe sparks of what could become a promptable or steerable robot-learning paradigm, none of these has been scaled up to a large scale. We don't yet know what kinds of data requirements or bottlenecks will occur when we do try to scale these methods, and we don't know whether they will scale to more dexterous embodiments such as humanoids or consumer robots.
- 00:35:57 And finally, for scalable evaluation, where maybe we can think about generalization axes or simulation, I would score the field at a three out of ten, because oftentimes we don't even know what we should be evaluating for. Even before we can design the eval, we need to set up the paradigms, the structure, the formalisms, and all of that is still very ad hoc, done almost on a per-project basis. So going forward, I hope that either concerted efforts, or simply bringing robot perception and low-level control closer to foundation modeling, will make these evaluation problems more homogeneous.
- 00:36:37 With that, let me take these final points, the reasons I gave these ratings, and try to predict what the solutions might be: what might happen on the 2025-2026 horizon. The first is that once we understand VLAs better, once VLA training turns into a science and not just an art, I think robotics research will also start to split into pre-training and post-training, just as foundation modeling has. Once you have access to very robust starting points, foundation models from Google, maybe from OpenAI, maybe from other companies, then post-training your robot will become more of what your daily cycle looks like as a practitioner or researcher. For robot-specific data, the robot data engine: if you are really thinking about scaled deployments at a society scale, this is where industry and startups will be able to contribute a lot to the field. What that is going to look like, we will see, but the race is really starting up this year, with so many startups and exciting new companies forming.
- 00:37:45 And then finally, for evaluations: with all the amazing progress we are seeing in video modeling, I am really excited to see what the action-conditioned variants of these world models will look like, and how well they will be able to model out-of-distribution behaviors and not just in-distribution training data. For evaluating robot policies, you don't just want them to be aesthetic or realistic; you want them to actually measure the long tail of very rare events, and to model contact forces and physical causality, which may not always correlate with the aesthetic pretraining YouTube video data they may have been trained on. And, of course, product deployments: evaluations are another setting where we will see industry and startups contribute. Those are, again, just my own opinions, but I am very excited to see how the field of robot foundation models evolves.

- 00:38:37 With that said, thanks for your time. Reach me at my email here, and I'll share the slides afterwards. Thanks for your time.

- 00:38:48 Thank you very much, Ted, for the great talk. There is already a question; go ahead.
- 00:39:29 So I guess the question here was: with texture randomization being one axis of generalization and distribution shift that seems challenging, are there methods, or any data points, on how to address that? There has been some work on data augmentation for robotics in a semantic fashion: instead of just broadly domain-randomizing with random textures, actually using semantically relevant textures that are appropriate. You don't want a neon random-RGB wallpaper in your home; you want wallpapers that are actually seen in people's homes. I think this is still very nascent, but there have already been very good results from people pushing on these generalization settings, for example with diffusion models for data augmentation.

- 00:40:14 Other questions? Yes, please.
- 00:40:19 Are you saying that video data, real data, and simulation data are all contributing? If you had to pick one, which one would you go for? Or is there transfer between them, so that it doesn't even matter, given your representation?

- 00:40:37 Yeah, absolutely.
- 00:40:39 So the question, I guess, is: video data, sim data, real data; are they all equal, and do they transfer? My sense, at least based on the last few years, is that real data is absolutely key. If you offer me the same quantity of real data versus sim versus video, I am picking real any day of the week. However, maybe at some point there will be diminishing returns, or maybe if you offer me a thousand times or a million times more sim or video data compared to robot action data, the trade-off starts to become different. And the past is not a very good predictor of what might happen here, because in the past, robot foundation-model policy capacities were so small that if you could only operate over the scale of, say, 100,000 trajectories, you would of course just use 100,000 real trajectories. You never had the opportunity to operate at internet scale, where you did have the capacity to consume a million times more sim and video data. I think we have the opportunity to do that in the next year, or even now.
- 00:41:41 [Partly inaudible question: the talk was mostly about what's missing from robotics across vision and language; what about the physical sense of touch?]
- 00:42:17 Yeah, great question. The question was about physical touch and sensing: this is so important for humans, so what about robots? My sense here is that I keep two columns of what could make robots capable, of what's missing from foundation models that is unique to robots, and I would definitely put sensing and other modalities there as well. The reason I think it is a little farther off, or at least not a top priority for me, is that just grounding the idea of temporal motion into the visual-language latent space is already so hard, and adding a completely new modality like touch sensors, where you don't have a lot of data, where it is noisy, and where it is so different from internet data, is going to be even harder. If we can't even add trajectories of motions, then adding sensing data might be harder still. That is why, for me at least, the aim is to make some progress on the motions, the physics, and the trajectories first; hopefully that will tell us how we should approach sensing as well.

- 00:43:18 In the interest of time, let's have two last questions. I saw two hands; you can start in the front, and then the other.
- 00:43:43 Yeah, great question. The question is that in this RT-Hierarchy work, the interface between the high-level planning and the low-level control was this low-level language motion; is language the only option there? Absolutely not; there are many other kinds of representations you could use. Even these trajectories are one: your high-level plan could be "how do I wipe the seat of the chair?", and the interface is this RGB curve. There is definitely a lot more; language is just a natural one, where you can hope to get a bit more transfer from chain of thought in non-embodied domains.

- 00:44:18 And the final question?
- 00:44:54 The question was about interpolation versus true out-of-distribution generalization and reasoning: maybe, oftentimes, when we claim we are generalizing to new objects or lighting conditions, we are actually just interpolating the data we already saw. That is why, when I define "emergent", I try to embrace this by saying that emergent doesn't mean it was not present in any of the training data and emerged magically. It was in the data, but internet-scale data is too much for us to codify, so it is emerging in the sense that we are finding out what was in the data: it is being projected into robot action space, and we are seeing what has successfully projected. And I don't even know if we necessarily need to solve true generalization in order for robotics to work well. If we are able to solve just interpolation at scale, in a very predictable fashion, I would already be happy with that in terms of getting general-purpose robots. Beyond that, for AGI (a loaded term), yes, probably.

- 00:46:00 OK; so let's thank the speaker again.
- robotics
- foundation models
- positive transfer
- steerability
- scalable evaluation
- general applicability
- data
- machine learning
- interdisciplinary research
- challenges