“The Future of AI is Here” — Fei-Fei Li Unveils the Next Frontier of AI
Summary
TLDR: The video discusses the fundamentals of visual spatial intelligence and its place in the development of artificial intelligence, arguing that visual spatial skills are as integral as language. The discussion takes a historical view of AI, highlighting critical advances in deep learning: the inception of neural networks, breakthroughs like AlexNet, and the impact of the Transformer model. It elaborates on the crucial role of data, with projects like ImageNet demonstrating the benefits of comprehensive datasets, and recounts the journeys and contributions of AI experts Fei-Fei Li and Justin Johnson. It describes how the remarkable growth in computing power has turned AI from theory into practice, enabling complex model training and faster learning. The conversation then explores the mission of World Labs, which aims to harness this deep understanding of data and compute to unlock spatial intelligence, with applications ranging from interactive 3D world creation for education and gaming to augmented reality and robotics. Finally, it contrasts spatial intelligence with language models, emphasizing the former's essential focus on 3D space for richer world interaction and representation, as opposed to the largely 1D nature of language models.
Takeaways
- 📈 Visual spatial intelligence is as fundamental as language.
- 🔄 AI has transitioned from an AI winter to new peaks with deep learning.
- 🧠 Key AI figures like Fei-Fei Li and Justin Johnson have influenced the field.
- 💾 Computing power has substantially accelerated AI model training.
- 📊 Large datasets, like ImageNet, have driven significant AI advancements.
- 🤖 World Labs focuses on unlocking spatial intelligence.
- 🌐 Spatial intelligence emphasizes 3D perception and action.
- ⏩ Progress in AI enables new media forms, blending virtual and physical worlds.
- 🕶️ Augmented reality and robotics can benefit from advanced spatial intelligence.
- 🔍 Differences between language models and spatial intelligence are crucial.
- 🎮 World generation for gaming is a potential application of spatial intelligence.
- 📚 Educational tools can leverage spatial intelligence technologies.
Timeline
- 00:00:00 - 00:05:00
The discussion starts with highlighting the fundamental role of visual-spatial intelligence, mentioning the progress in understanding data and advancements in algorithms, and setting the stage to focus on unlocking new potentials in AI.
- 00:05:00 - 00:10:00
Over the past two years, there has been a significant increase in consumer AI companies, marking a wild and exciting time for AI development. The conversation touches on the historical context of AI, from the AI winter to the emergence of deep learning, and its current transformative state involving various data forms such as text, pixels, video, and audio.
- 00:10:00 - 00:15:00
The speakers recount their individual journeys into AI. One was inspired by groundbreaking deep learning papers during undergraduate studies, highlighting the importance of combining powerful algorithms with large compute and data for breakthrough results — a notion established around 2011-2012.
- 00:15:00 - 00:20:00
As one speaker's journey continued, they noticed a critical shift in the importance of data for AI, recognizing data as an overlooked element crucial for model generalization. This realization led to initiatives like ImageNet, which emphasized large-scale data acquisition as a key unlock for machine learning models and was conceived just as the internet was coming of age.
- 00:20:00 - 00:25:00
The dialogue transitions to the concept of 'big unlocks' in AI. While major algorithmic innovations like Transformers have fueled AI progress, the conversation sheds light on the underestimated impact of computational power. The example of AlexNet, trained with substantially less compute compared to modern standards, underscores this point.
- 00:25:00 - 00:30:00
The discussion pivots to differentiating the types of AI tasks, particularly generative versus predictive modeling. Historical attempts at generative tasks are noted, with more recent advancements in generative AI (using GANs) marking significant strides towards generating novel outputs like images from textual descriptions.
- 00:30:00 - 00:35:00
Further exploring generative AI advances, the speakers discuss projects involving style transfer and real-time generation, emphasizing the evolution of generative modeling from static image rendering to dynamic, real-time applications. This illustrates the broad transformation the field has undergone over the years.
- 00:35:00 - 00:40:00
Emphasizing spatial intelligence, the speakers outline a journey focused on visual intelligence, suggesting it is just as fundamental as language. Given present algorithmic advances and computational capabilities, they argue that now is the right time to invest in developing technologies like those that power World Labs.
- 00:40:00 - 00:48:10
Finally, the conversation delves into the specifics of spatial intelligence, contrasting it with language-based AI approaches. Spatial intelligence emphasizes understanding and interacting in 3D environments, a fundamental aspect that language models, inherently one-dimensional, cannot fully grasp. This complements generative AI, transcending text and 2D to richer 3D representations.
Video Q&A
Why is visual spatial intelligence described as fundamental?
Visual spatial intelligence is compared to language in its fundamental importance due to its ancient and essential role in understanding and interacting with the world.
How has AI evolved over the past decades, according to the video?
AI has transitioned from theoretical models and the AI winter into practical, deep learning applications, involving significant advances such as language models and image recognition.
What major AI breakthroughs are discussed?
The discussion highlights the significance of deep learning, especially neural networks like AlexNet, the importance of computing power, and algorithmic advances like Transformers.
Who are the key figures mentioned in the evolution of AI and deep learning?
Notable figures include Fei-Fei Li and Justin Johnson, along with references to Andrew Ng, Hinton, and others involved in foundational deep learning research.
How is computational power significant to AI development?
Increased computational power has enabled the practical application and fast training of complex AI models, transforming theoretical constructs into effective tools.
What role does data play in AI development, according to the speakers?
Large datasets have been crucial for training AI models, enabling discoveries and the development of more accurate and generalizable models.
What is the significance of the ImageNet project?
ImageNet played a crucial role in demonstrating the power of large datasets, helping propel computer vision and AI into practical applications.
What differentiates spatial intelligence from language models?
Spatial intelligence focuses on 3D perception and action, essential for interacting with the physical world, contrasting with the 1D sequence processing in language models.
What is the mission of World Labs?
World Labs aims to unlock spatial intelligence, leveraging advancements in algorithms, computing, and data to create technology that perceives and interacts with the 3D world.
How might spatial intelligence be applied?
Potential applications include world generation for games and education, augmented reality, and enhancing robotics with better 3D understanding.
- 00:00:00 Visual spatial intelligence is so fundamental; it's as fundamental as language. We've got the ingredients: compute, a deeper understanding of data, and some advancement of algorithms. We are in the right moment to really make a bet, to focus, and just unlock it.
- 00:00:26 [Music]
- 00:00:28 Over the last two years we've seen this massive rush of consumer AI companies and technology, and it's been quite wild. But you've been doing this now for decades, so maybe walk through a little bit of how we got here, your key contributions and insights along the way.
- 00:00:46 It is a very exciting moment. Just zooming back: AI is in a very exciting moment, and I personally have been doing this for two decades plus. We have come out of the last AI winter; we have seen the birth of modern AI; then we have seen deep learning taking off, showing us possibilities like playing chess; and then we started to see the deepening of the technology and the industry adoption of some of the earlier possibilities, like language models. Now I think we're in the middle of a Cambrian explosion, in almost a literal sense, because in addition to text you're seeing pixels, videos, and audio all coming out with possible AI applications and models. So it's a very exciting moment.
- 00:01:37 I know you both so well, and many people know you both because you're so prominent in the field, but not everybody grew up in AI, so maybe it's worth going through your quick backgrounds just to level-set the audience.
- 00:01:48 Yeah, sure. I first got into AI at the end of my undergrad. I did math and computer science for undergrad at Caltech, which was awesome. Toward the end of that, there was this paper that came out, at the time a very famous paper, the "cat paper," from Quoc Le, Andrew Ng, and others who were at Google Brain at the time. That was the first time I came across this concept of deep learning, and to me it just felt like this amazing technology. It was the first time I came across the recipe that would come to define the next decade-plus of my life: you can take these amazingly powerful learning algorithms that are very generic, couple them with very large amounts of compute and very large amounts of data, and magical things start to happen when you combine those ingredients. I first came across that idea around 2011 or 2012, and I just thought, oh my God, this is going to be what I want to do. It was obvious you had to go to grad school to do this stuff, and I saw that Fei-Fei was at Stanford, one of the few people in the world at the time who was on that train.
- 00:02:44 That was just an amazing time to be in deep learning and computer vision specifically, because that was really the era when this went from the first nascent bits of technology that were just starting to work to being really developed and spread across a ton of different applications. Over that time we saw the beginning of language modeling; we saw the beginnings of discriminative computer vision, where you could take pictures and understand what's in them in a lot of different ways; and we also saw some of the early bits of what we would now call generative modeling, generating images and generating text. A lot of those core algorithmic pieces actually got figured out by the academic community during my PhD years. There was a time I would wake up every morning and check the new papers on arXiv, and it was like unwrapping presents on Christmas: every day you knew there was going to be some amazing new discovery, some amazing new application or algorithm somewhere in the world. What happened in the last two years is that everyone else in the world kind of came to the same realization, using AI to get new Christmas presents every day, but for those of us who have been in the field for a decade or more, we've had that experience for a very long time. Obviously, I'm much older than Justin.
- 00:03:45 I came to AI through a different angle, which is physics, because my undergraduate background was physics. Physics is the kind of discipline that teaches you to ask audacious questions and to think about the remaining mysteries of the world. In physics, of course, that means the atomic world, the universe, and all that, but somehow that kind of training got me into the audacious question that really captured my own imagination, which is intelligence. So I did my PhD in AI and computational neuroscience at Caltech. Justin and I actually didn't overlap, but we share the same alma mater at Caltech.
- 00:04:30 Oh, and the same adviser.
- 00:04:33 Yes, the same adviser: your undergraduate adviser and my PhD adviser, Pietro Perona. My PhD time, which is similar to your PhD time, was when AI was still in the winter in the public eye. But it was not in the winter in my eye, because it was that pre-spring hibernation in which there is so much life. Machine learning and statistical modeling were really gaining power, and I think I was part of the native generation of machine learning and AI, whereas I look at Justin's generation as the native deep learning generation. Machine learning was the precursor of deep learning, and we were experimenting with all kinds of models.
- 00:05:24 But one thing came out at the end of my PhD and the beginning of my assistant professorship: there was an overlooked element of AI that is mathematically important for driving generalization, and the whole field was not thinking that way. It was data. We were thinking about the intricacy of Bayesian models, or kernel methods, and all that, but what my students and my lab realized, probably earlier than most people, is that if you let data drive models, you can unleash a kind of power that we hadn't seen before. That was really the reason we went on a pretty crazy bet on ImageNet. Forget about any scale we're seeing now; datasets then were thousands of data points. The NLP community had their own datasets, like the UC Irvine dataset, and the datasets in NLP were small; the vision community had their datasets too, but all on the order of thousands or tens of thousands. We said we need to drive this to internet scale, and luckily it was also the coming of age of the internet, so we were riding that wave. That's when I came to Stanford.
- 00:06:57 So these epochs are what we often talk about: ImageNet is clearly the epoch that created, or at least made popular and viable, computer vision. And in the GenAI wave we talk about two core unlocks: one is the Transformers paper, which is attention, and we talk about Stable Diffusion. Is that a fair way to think about this, that there are these two algorithmic unlocks that came from academia or Google and that's where everything comes from? Or has it been more deliberate? Or have there been other big unlocks that brought us here that we don't talk about as much?
- 00:07:25 Yeah, I think the big unlock is compute. I know the story of AI is often the story of compute, but no matter how much people talk about it, I think people underestimate it. The amount of growth we've seen in computational power over the last decade is astounding. The first paper that's really credited with the breakthrough moment in computer vision for deep learning was AlexNet, a 2012 paper where a deep neural network did really well on the ImageNet challenge and just blew away all the other algorithms, the types of algorithms Fei-Fei had been working on in grad school. AlexNet was a 60-million-parameter deep neural network, and it was trained for six days on two GTX 580s, which was the top consumer card at the time; it came out in 2010. I was looking at some numbers last night just to put these in perspective. The newest, latest and greatest from Nvidia is the GB200. Do either of you want to guess how much raw compute factor we have between the GTX 580 and the GB200?
- 00:08:20 Shoot, no. Go for it.
- 00:08:23 It's in the thousands. I ran the numbers last night: that training run of six days on two GTX 580s, if you scale it, comes out to just under five minutes on a single GB200.
- 00:08:39 Justin is making a really good point. The 2012 AlexNet paper on the ImageNet challenge used a literally classic model, the convolutional neural network, which was published in the 1980s; it was one of the first papers I remember learning as a graduate student. It also has more or less six or seven layers. So what is practically the only difference between AlexNet and that ConvNet? The GPUs, the two GPUs, and the deluge of data.
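Justin's GTX 580 to GB200 comparison can be sanity-checked with back-of-the-envelope arithmetic; this sketch takes his quoted figures (six days on two GPUs, just under five minutes on one GB200) as givens and derives the implied factor:

```python
# Back-of-the-envelope check of the quoted speedup: AlexNet trained
# for 6 days on 2 GTX 580s; the same run is said to take just under
# 5 minutes on a single GB200.
gpu_days = 6 * 2                   # 6 days of training x 2 GPUs
gpu_minutes = gpu_days * 24 * 60   # total GPU-minutes: 17280

# Implied per-GPU compute factor if one GB200 does it in ~5 minutes:
factor = gpu_minutes / 5
print(factor)  # 3456.0 -- "in the thousands", consistent with the quote
```

The exact factor depends on the five-minute figure, but any nearby value lands in the low thousands, matching Justin's guess.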
- 00:09:17 Well, so that's where I was going to go. I think most people now are familiar with, quote, "the bitter lesson," and what the bitter lesson says is: if you make an algorithm, don't be cute; just make sure you can take advantage of available compute, because the available compute will show up. On the other hand, there's another narrative, which seems to me just as credible, which is that it's actually new data sources that unlock deep learning. ImageNet is a great example, but a lot of people will say self-attention in Transformers is great, and they'll also say it's a way you can exploit human labeling of data, because it's the humans that put the structure in the sentences. And if you look at CLIP, they'll say, well, we're using the internet to have humans label images via the alt tag. So that's a story of data, not a story of compute. Is the answer just both, or is one more than the other?
- 00:10:03 I think it's both, but you're hitting another really good point. I think there are actually two eras here that to me feel quite distinct in the algorithmics. The ImageNet era is actually the era of supervised learning. In the era of supervised learning, you have a lot of data, but you don't know how to use the data on its own. The expectation of ImageNet and other datasets of that time period was that we're going to get a lot of images, but we need people to label every one: for all of the training data that we train on, a human labeler has looked at every image and said something about it. The big algorithmic unlocks were learning how to train on things that don't require human-labeled data.
- 00:10:38 As the naive person in the room who doesn't have an AI background: it seems to me that if you're training on human data, the humans have labeled it; it's just not explicit.
- 00:10:47 I knew you were going to say that, Martin. I knew it. Philosophically, that's a really important question, but that's actually more true of language than of pixels.
- 00:10:56 Fair enough. Yeah, 100 percent.
- 00:10:58 But I do think it's an important point: what the model learned is just more implicit than explicit; it's still human-labeled. The distinction is that in the supervised learning era, our learning tasks were much more constrained. You would have to come up with an ontology of concepts that you want to discover. If you were doing ImageNet, Fei-Fei and her students at the time spent a lot of time thinking about which thousand categories should be in the ImageNet challenge. Other datasets of that time, like the COCO dataset for object detection, thought really hard about which 80 categories to put in there.
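The supervised-era versus self-supervised-era contrast the speakers draw can be sketched in a few lines; the filenames, categories, and sentence below are illustrative toys, not the actual ImageNet, COCO, or training data:

```python
# Supervised era (ImageNet/COCO style): a human-chosen ontology of
# categories, and a human-written label attached to every example.
ontology = ["cat", "dog", "sheep"]   # illustrative, not the real 1000 classes
supervised_data = [
    ("img_001.jpg", "cat"),          # each pair touched by a human labeler
    ("img_002.jpg", "sheep"),
]

# Self-supervised era (language modeling): the data supervises itself.
# The structure humans put into sentences becomes the target: predict
# each token from the tokens before it, with no separate label step.
tokens = "the sheep is on the grass".split()
lm_examples = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
# e.g. (["the"], "sheep"), (["the", "sheep"], "is"), ...
```

In the first setup, someone had to decide the ontology and label every image; in the second, every sentence on the internet yields training pairs for free, which is the implicit human labeling Martin points at.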
- 00:11:36 So let's walk toward GenAI. When I was doing my PhD, before all that, I took machine learning from Andrew Ng, and then I took something Bayesian and very complicated from Daphne Koller; it was very complicated for me. A lot of that was just predictive modeling. Then I remember the whole vision unlock, but the generative stuff has shown up, I would say, in the last four years, which to me is very different: you're not identifying objects, you're not predicting something, you're generating something. So maybe walk through the key unlocks that got us there, why it's different, whether we should think about it differently, and whether or not it's part of a continuum.
- 00:12:10 It is so interesting. Even during my graduate time, generative models were there. We wanted to do generation; nobody remembers, but even with letters and numbers we were trying to generate. Jeff Hinton had papers on generation; we were thinking about how to generate. In fact, if you think from a probability-distribution point of view, you can mathematically generate; it's just that nothing we generated would ever impress anybody. So this concept of generation was there mathematically and theoretically, but nothing worked.
- 00:12:52 I do want to call out Justin's PhD. Justin was saying he got enamored by deep learning, so he came to my lab, and his entire PhD is almost a mini-story of the trajectory of the field. He started his first project on data; I forced him to, and he didn't like it.
- 00:13:13 In retrospect, I learned a lot of really useful things.
- 00:13:16 I'm glad you say that now. So we moved Justin to deep learning, and the core problem there was taking images and generating words.
- 00:13:29 Well, actually, I think there were three discrete phases on this trajectory. The first one was matching images and words: we have an image, we have words, and can we say how well they align? My first paper of my PhD, my first academic publication ever, was image retrieval with scene graphs. Then we went into taking pixels and generating words; Justin and Andrej really worked on that, but that was still a very lossy way of generating and getting information out of the pixel world.
- 00:14:05 And in the middle, Justin went off and did a very famous piece of work, and it was the first time someone made it real-time.
- 00:14:16 So the story there is that there was this paper that came out in 2015, "A Neural Algorithm of Artistic Style," led by Leon Gatys. The paper came out, and they showed these real-world photographs that they had converted into Van Gogh style. We're kind of used to seeing things like this in 2024, but this was 2015. The paper just popped up on arXiv one day, and it blew my mind. I got this generative brainworm in my brain in 2015; it did something to me, and I thought, oh my God, I need to understand this algorithm, I need to play with it, I need to make my own images into Van Gogh. So I read the paper, and over a long weekend I reimplemented the thing and got it to work. It was actually a very simple algorithm; my implementation was about 300 lines of Lua, because this was pre-PyTorch and we were using Lua Torch at the time. But even though the algorithm was simple, it was slow: it was an optimization-based thing, so for every image you want to generate, you need to run an optimization loop, a gradient descent loop, for that image. The images were beautiful, but I just wanted it to be faster.
- 00:15:16 And Justin just did it. It was actually, I think, your first taste of an academic work having an industry impact.
- 00:15:27 A bunch of people had seen this artistic style transfer stuff at the time, and I and a couple of others came up with different ways to speed it up around the same time, but mine was the one that got a lot of traction.
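The optimization-based structure Justin describes, a fresh gradient-descent loop for every generated image, can be shown as a toy. This is an illustrative sketch only: a fixed random linear map stands in for the pretrained VGG features of the actual Gatys et al. method, and the sizes are tiny so a numeric gradient suffices.

```python
import numpy as np

rng = np.random.default_rng(0)
C, P = 4, 8                      # toy feature channels, "pixels"
W = rng.normal(size=(C, P))      # stand-in for a pretrained feature extractor

def features(img):               # img: flat array of P values
    return W * img               # (C, P): each channel re-weights the image

def gram(img):                   # style statistic: feature correlations
    F = features(img)
    return F @ F.T               # (C, C) Gram matrix, as in the real method

content = rng.normal(size=P)     # toy "content photo"
style = rng.normal(size=P)       # toy "Van Gogh painting"
G_style = gram(style)

def loss(img, alpha=1.0, beta=1e-3):
    # stay close to the content image while matching style statistics
    return (alpha * np.sum((img - content) ** 2)
            + beta * np.sum((gram(img) - G_style) ** 2))

# The slow part: a per-image gradient-descent loop on the pixels.
img = content.copy()
lr, eps = 1e-3, 1e-5
for step in range(200):
    # central-difference numeric gradient; fine at this toy size
    grad = np.array([(loss(img + eps * e) - loss(img - eps * e)) / (2 * eps)
                     for e in np.eye(P)])
    img -= lr * grad
```

The feed-forward speedups Justin alludes to train a network to approximate the output of this loop, replacing per-image optimization with a single forward pass.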
- 00:15:38 So I was very proud of Justin, but there's one more thing I was very proud of Justin for, to connect to GenAI. Before the world understood GenAI, there was Justin's last piece of work in his PhD, which I knew about because I was forcing you to do it...
- 00:15:53 That one was fun.
- 00:15:57 ...and it was actually inputting language and getting a whole picture out. It's one of the first GenAI works; it used GANs, which were so hard to use. But the problem was that we were not ready to use a natural piece of language, so, as you heard, Justin worked on scene graphs: we had to input a scene graph, a language structure, with the sheep, the grass, and the sky in a graph form. It literally was one of our photos. Then he and a very good master's student, Agrim, got that GAN to work. So you can see the arc: from data, to matching, to style transfer, to generating images. You asked whether this was an abrupt change: for people like us it was already happening as a continuum, but for the world the results are more abrupt.
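The scene-graph input Fei-Fei describes, the sheep, grass, and sky with their relationships rather than free-form language, can be pictured as a small data structure. The format below is an illustrative assumption, not the exact representation used in the paper:

```python
# A toy scene graph like the sheep/grass/sky example: objects as nodes,
# spatial relationships as labeled (subject, predicate, object) edges.
scene_graph = {
    "objects": ["sheep", "grass", "sky"],
    "relationships": [
        ("sheep", "standing on", "grass"),
        ("sky", "above", "grass"),
    ],
}

def objects_in(graph):
    """Return the set of object nodes in a scene graph."""
    return set(graph["objects"])

print(sorted(objects_in(scene_graph)))  # ['grass', 'sheep', 'sky']
```

Structured input like this gave the generator an explicit layout to work from at a time when conditioning directly on natural language was not yet workable, which is the constraint Fei-Fei notes.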
- 00:17:00 I read your book, and for those who are listening, it's a phenomenal book; I really recommend you read it. It seems that for a long time, and I'm talking to you, Fei-Fei, a lot of your research and your direction has been toward spatial stuff, pixel stuff, and intelligence, and now you're doing World Labs, which is around spatial intelligence. So maybe talk through it: has this been part of a long journey for you? Why did you decide to do it now? Is it a technical unlock, a personal unlock? Just move us from that milieu of AI research to World Labs.
- 00:17:35 Sure. For me it is both personal and intellectual. You talked about my book: my entire intellectual journey is really this passion to seek North Stars, but also believing that those North Stars are critically important for the advancement of our field. At the beginning, I remember, after graduate school I thought my North Star was telling the stories of images, because for me that's such an important piece of visual intelligence, which is part of what you call AI or AGI. But when Justin and Andrej did that, I was like, oh my God, that was my life's dream; what do I do next? It came a lot faster than I expected; I thought it would take a hundred years.
- 00:18:29 But visual intelligence is my passion, because I do believe that for every intelligent being, like people or robots or some other form, knowing how to see the world, reason about it, and interact in it, whether you're navigating or manipulating or making things, is something you can even build civilization upon. Visual spatial intelligence is so fundamental; it's as fundamental as language, possibly more ancient and more fundamental in certain ways. So it's very natural for me that World Labs' North Star is to unlock spatial intelligence. And the moment to me is right to do it. Like Justin was saying, we've got these ingredients: we've got compute; we've got a much deeper understanding of data, way deeper than in the ImageNet days, and compared to those days we're so much more sophisticated; and we've got some advancement of algorithms, including from World Labs co-founders like Ben Mildenhall and Christoph Lassner, who were at the cutting edge of NeRF. We are in the right moment to really make a bet, to focus, and just unlock that.
- 00:19:59 So I just want to clarify for folks who are listening: you're starting this company, World Labs, and spatial intelligence is how you're generally describing the problem you're solving. Can you try to crisply describe what that means?
- 00:20:11 Yeah. Spatial intelligence is about machines' ability to perceive, reason, and act in 3D space and time: to understand how objects and events are positioned in 3D space and time, how interactions in the world can affect those 3D, even 4D, positions over space-time, and to both perceive and reason about the world and generate and interact with it. It's about really taking the machine out of the mainframe, out of the data center, and putting it out into the world, understanding the 3D and 4D world with all of its richness.
- 00:20:41 To be very clear, are we talking about the physical world, or just an abstract notion of world?
- 00:20:45 I think it can be both, and that encompasses our vision long term. Even if you're generating worlds, generating content, doing that positioned in 3D has a lot of benefits. And if you're recognizing the real world, being able to put 3D understanding into the real world is part of it as well.
- 00:21:04 Great. Just for everybody listening: the two other co-founders, Ben Mildenhall and Christoph Lassner, are absolute legends in the field, at the same level. These four decided to come out and do this company now, and so I'm trying to dig into why now is the right time.
- 00:21:20 Yeah, I mean this is again part of a longer evolution for me. After my PhD, when I really wanted to develop into my own independent researcher for my later career, I was thinking about what the big problems in AI and computer vision are. The conclusion I came to around that time was that the previous decade had mostly been about understanding data that already exists, but the next decade was going to be about understanding new data. The data that already existed was all the images and videos that were on the web; the new data is different. People have smartphones, those smartphones have cameras, those cameras have new sensors, and those cameras are positioned in the 3D world. It's no longer that you get a bag of pixels from the internet, know nothing about it, and try to say whether it's a cat or a dog. We want to treat images as universal sensors to the physical world and ask how we can use them to understand the 3D and 4D structure of the world, either in physical spaces or in generative spaces.
- 00:22:18 So I made a pretty big pivot post-PhD into 3D computer vision, predicting the 3D shapes of objects with some of my colleagues at FAIR at the time. Then later I got really enamored with this idea of learning 3D structure through 2D. We talk about data a lot, and 3D data is hard to get on its own, but there's a very strong mathematical connection here: our 2D images are projections of a 3D world, and there's a lot of mathematical structure we can take advantage of. So even if you only have a lot of 2D data, people have done amazing work figuring out how to back out the 3D structure of the world from large quantities of 2D observations. And then in 2020, you asked about breakthrough moments, there was a really big one from our co-founder Ben Mildenhall, with his paper on NeRF, Neural Radiance Fields. It was a very simple, very clear way of backing out 3D structure from 2D observations, and it lit a fire under this whole space of 3D computer vision.
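The NeRF idea described here, recovering 3D structure by fitting a radiance field to 2D observations, centers on one differentiable operation: volume rendering along camera rays. The sketch below is a toy illustration of that rendering step only; the real method trains an MLP on posed photographs, whereas the stand-in `radiance_field` here just returns random values so the compositing math can be shown end to end.

```python
import numpy as np

def radiance_field(xyz, rng):
    """Stand-in for NeRF's MLP: maps 3D sample points to RGB color and
    volume density sigma. A real NeRF fits an MLP to posed photographs;
    this returns random values purely to illustrate the rendering step."""
    return rng.random((xyz.shape[0], 3)), rng.random(xyz.shape[0])

def render_ray(origin, direction, near=0.0, far=4.0, n_samples=64, seed=0):
    """Volume-render one camera ray: sample depths, query the field, and
    alpha-composite the samples front to back into one pixel color."""
    rng = np.random.default_rng(seed)
    t = np.linspace(near, far, n_samples)                # depths along the ray
    pts = origin + t[:, None] * direction                # 3D sample positions
    rgb, sigma = radiance_field(pts, rng)
    delta = np.diff(t, append=t[-1] + (t[1] - t[0]))     # spacing between samples
    alpha = 1.0 - np.exp(-sigma * delta)                 # opacity of each segment
    transmittance = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = alpha * transmittance                      # contribution per sample
    return (weights[:, None] * rgb).sum(axis=0)          # composited RGB pixel

pixel = render_ray(np.zeros(3), np.array([0.0, 0.0, 1.0]))
print(pixel.shape)  # (3,)
```

Training a NeRF amounts to repeating this render for many rays from many posed images and backpropagating the pixel error into the field; that is the sense in which 3D structure is "backed out" of 2D observations.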
- 00:23:13 I think there's another aspect here that maybe people outside the field don't quite understand: that was also the time when large language models were starting to take off. A lot of the work on language modeling had actually been developed in academia; even during my PhD I did some early work with Andrej Karpathy on language modeling, in 2014. I still remember LSTMs, RNNs, GRUs; this was pre-Transformer. But at some point, around the GPT-2 era, you couldn't really do those kinds of models in academia anymore, because they took way, way more resources. There was one really interesting thing, though, about the NeRF approach Ben came up with: you could train these models in an hour, a couple of hours, on a single GPU. So there was a dynamic where a lot of academic researchers ended up focusing on these problems, because there was core algorithmic stuff to figure out, and because you could actually do a lot without a ton of compute; you could get state-of-the-art results on a single GPU. Because of those dynamics, a lot of researchers in academia were moving to think about the core algorithmic ways we could advance this area as well. Then I ended up chatting with Fei-Fei more, and I realized we were actually...
- 00:24:22 She's very convincing.
- 00:24:23 She is very convincing. But, you know, we talked about trying to figure out your own independent research trajectory apart from your adviser; well, it turns out we ended up converging on similar things.
- 00:24:34 Well, from my end, I wanted to talk to the smartest person, so I called Justin. There's no question about it.
- 00:24:41 I do want to talk about a very interesting technical story about pixels, one that most people who work in language don't realize. In the field of computer vision, those of us who work on pixels have a long history in an area of research called reconstruction, 3D reconstruction, which dates back to the 70s. Because humans have two eyes, it generally starts with stereo photos: you take two photographs and try to triangulate the geometry and make a 3D shape out of it. It is a really, really hard problem, and to this day it's not fundamentally solved, because of correspondence and related issues.
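For readers outside vision, the stereo setup Fei-Fei describes reduces, in the simplest rectified two-camera case, to similar triangles: depth is focal length times baseline divided by disparity. A minimal sketch; the 800-pixel focal length and 10 cm baseline are illustrative assumptions, not values from the discussion:

```python
def stereo_depth(x_left, x_right, focal_px, baseline_m):
    """Depth of a point seen by two rectified cameras, via similar
    triangles: Z = f * B / d, where d is the disparity in pixels."""
    disparity = x_left - x_right
    if disparity <= 0:
        raise ValueError("expected positive disparity for a point in front of the rig")
    return focal_px * baseline_m / disparity

# A feature at pixel column 420 in the left image and 380 in the right
# image has 40 px of disparity, hence depth 800 * 0.1 / 40 = 2.0 m.
print(stereo_depth(420.0, 380.0, focal_px=800.0, baseline_m=0.1))  # 2.0
```

The hard part the speakers allude to, correspondence, is everything hidden before this formula: deciding which pixel in the right image actually matches pixel 420 in the left image.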
- 00:25:31 So this whole field, which is an older way of thinking about 3D, has been going on and making really good progress. But when NeRF happened, in the context of generative methods, in the context of diffusion models, suddenly reconstruction and generation started to really merge. Now, within a really short period of time, in the field of computer vision it's hard to talk about reconstruction versus generation anymore. We suddenly have a moment where, whether we see something or imagine something, both can converge toward generating it. To me that's a really important moment for computer vision, but most people missed it, because we're not talking about it as much as LLMs.
- 00:26:25 Right. So in pixel space there's reconstruction, where you reconstruct a scene that's real, and if you haven't seen the scene, you use generative techniques. These things are very similar.
- 00:26:35 Throughout this entire conversation you've been talking about language and about pixels, so maybe it's a good time to talk about how spatial intelligence, what you're working on, contrasts with language approaches, which of course are very popular now. Is it complementary? Is it orthogonal?
- 00:26:51 I think they're complementary.
- 00:26:53 I don't mean to be too leading here, but maybe just contrast them. Everybody says: listen, I know OpenAI, and I know GPT, and I know multimodal models, and a lot of what you're talking about, they've got pixels and they've got language. Doesn't that kind of do what we want with spatial reasoning?
- 00:27:07spatial reasoning yeah so I think to do
- 00:27:09that you need to open up the Black Box a
- 00:27:10little bit of how these systems work
- 00:27:11under the hood um so with language
- 00:27:13models and the multimodal language
- 00:27:14models that we're seeing nowadays
- 00:27:16they're their their underlying
- 00:27:18representation under the hood is is a
- 00:27:19one-dimensional representation we talk
- 00:27:21about context lengths we talk about
- 00:27:23Transformers we talk about sequences
- 00:27:25attention attention fundamentally their
- 00:27:27representation of the world is is
- 00:27:29onedimensional so these things
- 00:27:30fundamentally operate on a
- 00:27:31onedimensional sequence of tokens so
- 00:27:33this is a very natural representation
- 00:27:35when you're talking about language
- 00:27:37because written text is a
- 00:27:38one-dimensional sequence of discret
- 00:27:39letters so that kind of underlying
- 00:27:41representation is the thing that led to
- 00:27:43llms and now the multimodal llms that
- 00:27:45we're seeing now you kind of end up
- 00:27:47shoehorning the other modalities into
- 00:27:49this underlying representation of a 1D
- 00:27:51sequence of tokens um now when we move
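That 1D bottleneck is visible directly in code: whatever modality comes in, a Transformer-style model only ever sees a (sequence, dimension) array. A small sketch, with an arbitrary toy 8x8 image and 2x2 patching, of flattening an image into tokens and running plain dot-product self-attention, which by itself knows nothing about where a token sat in 2D, let alone 3D:

```python
import numpy as np

rng = np.random.default_rng(0)

# An image must be flattened into a 1D sequence of tokens before a
# Transformer can touch it; its 2D (let alone 3D) layout is lost unless
# it is re-injected separately through positional encodings.
image = rng.random((8, 8, 3))                 # toy 8x8 RGB image
tokens = (image.reshape(4, 2, 4, 2, 3)        # cut into 2x2 patches
               .transpose(0, 2, 1, 3, 4)
               .reshape(16, -1))              # 16 tokens of dimension 12

def self_attention(x):
    """Plain dot-product self-attention over a 1D token sequence."""
    scores = x @ x.T / np.sqrt(x.shape[1])    # (seq, seq) similarities
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over the sequence
    return weights @ x                        # mix tokens; still (seq, dim)

out = self_attention(tokens)
print(tokens.shape, out.shape)  # (16, 12) (16, 12)
```

Whether the 16 tokens came from text, an image, or a video frame, the attention operation treats them as a pure sequence; that is the "shoehorning" being described.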
- 00:27:54 Now, when we move to spatial intelligence, it's kind of going the other way: we're saying the three-dimensional nature of the world should be front and center in the representation. From an algorithmic perspective, that opens the door for us to process data in different ways, to get different kinds of outputs out of it, and to tackle slightly different problems. Even at a coarse level, you might look from the outside and say, well, multimodal LLMs can look at images too. They can, but they don't have that fundamental 3D representation at the heart of their approach.
- 00:28:24 I totally agree with Justin. The 1D versus fundamentally-3D representation is one of the most core differentiations. The other thing is slightly philosophical, but really important, to me at least: language is fundamentally a purely generated signal. There's no language out there in nature; you don't go out and find words written in the sky for you. Whatever data you feed in, you can pretty much, with enough generalizability, regurgitate the same kind of data out: language in, language out. But the 3D world is not like that. There is a 3D world out there that follows the laws of physics and has its own structures, due to materials and many other things. To fundamentally back that information out, to be able to represent it, and to be able to generate it is just quite a different problem. We will borrow similar or useful ideas from language and LLMs, but to me this is, philosophically, a fundamentally different problem.
- 00:29:41 Right. So language is 1D, and probably a bad representation of the physical world, because it was generated by humans and it's probably lossy. There's a whole other modality of generative AI models: pixels, meaning 2D images and 2D video. One could say that when you look at a video you can see 3D things, because you can pan a camera and so on. So how would spatial intelligence be different from, say, 2D video?
- 00:30:07 When I think about this, it's useful to disentangle two things: one is the underlying representation, and two is the user-facing affordances you have. Here's where it can get confusing: fundamentally, we see in 2D. Our retinas are 2D structures in our bodies, and we've got two of them, so fundamentally our visual system perceives 2D images. But the problem is that depending on what representation you use, different affordances become more or less natural. Even if, at the end of the day, you're seeing a 2D image or a 2D video, your brain is perceiving it as a projection of a 3D world. There are things you might want to do, like move objects around or move the camera around. In principle you might be able to do these with a purely 2D representation and model, but it's just not a good fit for the problems you're asking the model to do. Modeling the 2D projections of a dynamic 3D world is a function that probably can be modeled, but by putting a 3D representation at the heart of the model, there's going to be a better fit between the representation the model works on and the tasks you want it to do. So our bet is that by threading a bit more 3D representation under the hood, we'll enable better affordances for users.
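Justin's "projection of a 3D world" point can be stated precisely with the pinhole camera model: projecting to the image plane divides out depth, which is exactly the information a 3D-first representation keeps. A minimal sketch, with a unit focal length assumed for simplicity:

```python
import numpy as np

def project(points_3d, focal=1.0):
    """Pinhole projection: a camera-space point (X, Y, Z) maps to the
    image point (f*X/Z, f*Y/Z). Depth Z is divided away and lost."""
    pts = np.asarray(points_3d, dtype=float)
    return focal * pts[:, :2] / pts[:, 2:3]

# Two different 3D points that land on the same 2D pixel:
print(project([[0.5, 0.5, 1.0],    # a near point
               [1.0, 1.0, 2.0]]))  # a far point, twice the depth; same pixel
```

A purely 2D model has to learn to undo this many-to-one collapse implicitly; a 3D representation never discards the depth in the first place, which is the "better fit" being argued for.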
- 00:31:24 And this also goes back to the North Star for me. Why is it spatial intelligence, why is it not flat pixel intelligence? Because I think the arc of intelligence has to go to what Justin calls affordances. If you look at evolution, the arc of intelligence eventually enables animals, and humans especially as intelligent animals, to move around the world, interact with it, create civilization, even make a sandwich, whatever you do in this 3D world. Translating that into a piece of technology means that native 3D-ness is fundamentally important for the floodgate of possible applications, even if for some of them the serving looks 2D; it's innately 3D underneath. To me, this is actually very subtle.
- 00:32:34 Yeah, and an incredibly critical point, so I think it's worth digging into, and a good way to do that is to talk about use cases. Just to level-set: we're talking about building a technology, let's call it a model, that can do spatial intelligence. In the abstract, what might that look like? And a little more concretely, what would be the potential use cases you could apply it to?
- 00:32:57 I think there are a couple of different kinds of things we imagine these spatially intelligent models being able to do over time, and one I'm really excited about is world generation. We're all used to text-to-image generators, and we're starting to see text-to-video generators, where you put in a prompt and out pops an amazing image or an amazing short clip. But I think you could imagine leveling this up into 3D. One thing we could imagine spatial intelligence helping us with in the future is upleveling these experiences, where you're not getting just an image or a clip out, but a fully simulated, vibrant, interactive 3D world.
- 00:33:34 Maybe for gaming?
- 00:33:37 Maybe for gaming, maybe for virtual photography, you name it.
- 00:33:40 I think even if you just got this to work, there would be a million applications.
- 00:33:43 For education?
- 00:33:45 Yeah, for education. One of my views is that in some sense this enables a new form of media, because we already have the ability to create virtual interactive worlds, but it costs hundreds of millions of dollars and a ton of development time. As a result, the place where people drive this technological ability is video games. We do have the ability, as a society, to create amazingly detailed virtual interactive worlds that give you amazing experiences, but because it takes so much labor to do so, the only economically viable use of that technology in its form today is games that can be sold for $70 a piece to millions and millions of people to recoup the investment. If we had the ability to create these same vibrant, interactive 3D worlds more cheaply, you would see a lot of other applications, because when you bring down the cost of producing that kind of content, people use it for other things. What if you could have a personalized 3D experience that's as good, as rich, and as detailed as one of these AAA video games that cost hundreds of millions of dollars to produce, but catered to a very niche interest that maybe only a couple of people would want? That's not a particular product or a particular roadmap, but I think that's a vision of a new kind of media that would be enabled by spatial intelligence in the generative realm.
- 00:35:11 If I think about a world, I actually think about things that are not just scene generation; I think about things like movement and physics. In the limit, is that included? And second, if I'm interacting with it, are there semantics? By that I mean: if I open a book, are there pages, are there words in it, do they mean something? Are we talking about a full-depth experience, or about a static scene?
- 00:35:33 I think you'll see a progression of this technology over time. This is really hard stuff to build, so the static problem is a little bit easier. But in the limit, we want this to be fully dynamic, fully interactable, all the things you just said.
- 00:35:46 I mean, that's the definition of spatial intelligence. So there is going to be a progression; we'll start with more static, but everything you've said is on the roadmap of spatial intelligence.
- 00:36:03 This is kind of in the name of the company itself, World Labs: the "world" is about building and understanding worlds. This is actually a little bit inside baseball; I realized after we told people the name that they don't always get it. In computer vision, in reconstruction and generation, we often make a distinction, a delineation, between the kinds of things you can do. The first level is objects: a microphone, a cup, a chair, discrete things in the world. A lot of the ImageNet-style work Fei-Fei did was about recognizing objects in the world. Then the next level up from objects, I think, is scenes. Scenes are compositions of objects: now we've got this recording studio, with a table and microphones and people in chairs, some composition of objects. But we envision worlds as a step beyond scenes. Scenes are maybe individual things, but we want to break the boundaries and go outside the door: step back from the table, walk out the door, walk down the street, see the cars buzzing past and the leaves on the trees moving, and be able to interact with those things.
- 00:37:00 Another thing that's really exciting, just to mention the words "new media": with this technology, the boundary between the real world and the virtual, imagined world, or the augmented world, or the predicted world, becomes blurry. The real world is 3D, so in the digital world you have to have a 3D representation to even blend with it. You cannot have a 2D, or a 1D, representation and interface with the real 3D world in an effective way. With this, that's unlocked, so the use cases can be quite limitless.
- 00:37:43 Right, so the first use case Justin was talking about would be the generation of a virtual world, for any number of use cases. The one you're just alluding to would be more of an augmented reality?
- 00:37:53 Yes. Just around the time World Labs was being formed, the Vision Pro was released by Apple, and they used the term "spatial computing." They almost stole our...
- 00:38:05 But we're spatial intelligence, so spatial computing needs spatial intelligence.
- 00:38:11 That's exactly right.
- 00:38:14 So we don't know what hardware form it will take: goggles, glasses, contact lenses. But that interface between the true real world and what you can do on top of it, whether it's to augment your capability to work on a piece of machinery and fix your car even if you're not a trained mechanic, or just to be in a Pokémon Go Plus-Plus for entertainment, suddenly this piece of technology is going to be, basically, the operating system for AR, VR, and mixed reality.
- 00:38:55 In the limit, what does an AR device need to do? It's this thing that's always on, it's with you, it's looking out into the world, so it needs to understand the stuff you're seeing, and maybe help you out with tasks in your daily life. But I'm also really excited about this blend between virtual and physical; it becomes really critical when you have the ability to understand what's around you in real time, in perfect 3D, because then it actually starts to deprecate large parts of the real world as well. Right now, how many differently sized screens do we all own for different use cases? Too many. You've got your phone, your iPad, your computer monitor, your TV, your watch. These are all basically different-sized screens, because they need to present information to you in different contexts and positions. But if you've got the ability to seamlessly blend virtual content with the physical world, it deprecates the need for all of those: it ideally, seamlessly blends the information you need to know in the moment with the right mechanism for giving you that information.
- 00:39:49 Another huge case of being able to blend the digital virtual world with the 3D physical world is enabling agents to do things in the physical world. Humans can use these mixed-reality devices to do things: like I said, I don't know how to fix a car, but if I have to, I put on these goggles or glasses and suddenly I'm guided to do it. But there are other types of agents, namely robots, any kind of robot, not just humanoids. Their interface, by definition, is the 3D world; but their compute, their brain, by definition, is the digital world. What connects the learning and the behaving, between a robot's brain and the real world, has to be spatial intelligence.
- 00:40:43 So you've talked about virtual worlds, you've talked about more of an augmented reality, and now you've talked about the purely physical world, basically, which would be used for robotics. For any company that would be a very large charter, especially if you're going to get into each of these different areas. So how do you think about the idea of deep tech versus any of these specific application areas?
- 00:41:03 We see ourselves as a deep tech company, as the platform company that provides models that can serve different use cases.
- 00:41:15 Of these three, is there any one you think is more natural early on, that people can expect the company to lean into?
- 00:41:22 I think it suffices to say the devices are not totally ready. I actually got my first VR headset in grad school, and it was one of those transformative technology experiences: you put it on and go, oh my God, this is crazy. I think a lot of people have that experience the first time they use VR, so I've been excited about this space for a long time. And I love the Vision Pro; I stayed up late to order one of the first ones the day it came out. But I think the reality is it's just not there yet as a platform for mass-market appeal, so very likely, as a company, we will move into a market that's more ready. And I think there can sometimes be simplicity in generality. We have this notion of being a deep tech company: we believe there are underlying fundamental problems that need to be solved really well, and that if solved really well can apply to a lot of different domains. We really view the long arc of the company as building and realizing the dreams of spatial intelligence writ large.
- 00:42:17 So this is a lot of technology to build, it seems to me.
- 00:42:19 Yeah, I think it's a really hard problem. Sometimes people who are not directly in the AI space see AI as one undifferentiated mass of talent, and for those of us who have been here longer, you realize there are a lot of different kinds of talent that need to come together to build anything in AI, and in particular this. We've talked a little bit about the data problem, and a little bit about some of the algorithms I worked on during my PhD, but there's a lot of other stuff we need to do this too. You need really high-quality, large-scale engineering; you need really deep understanding of the 3D world; and there are actually a lot of connections with computer graphics, because they've been attacking a lot of the same problems from the opposite direction. So when we think about team construction, we think about how to find the absolute top-of-the-world best experts in each of the different subdomains necessary to build this really hard thing.
- 00:43:13 When I thought about how we form the best founding team for World Labs, it had to start with a group of phenomenal multidisciplinary founders. Of course Justin was a natural choice for me. Justin, cover your ears: he was one of my best students and one of the smartest technologists. But there were two other people I had known by reputation, one of whom Justin had even worked with, that I was drooling over. One is Ben Mildenhall; we talked about his seminal work on NeRF. The other is Christoph Lassner, who has a strong reputation in the computer graphics community, and who had the foresight to work on a precursor of the Gaussian Splatting representation for 3D modeling five years before Gaussian Splatting took off. When we talked about the potential possibility of working with Christoph Lassner, Justin just about jumped out of his chair.
- 00:44:23 Ben and Christoph are absolute legends. Maybe just quickly talk about how you thought about the build-out of the rest of the team, because again, there's a lot to build here and a lot to work on, not just in AI or graphics but in systems and so forth.
- 00:44:39is what so far I'm personally most proud
- 00:44:42of is the formidable team I've had the
- 00:44:45privilege of working with the smartest
- 00:44:48young people in my entire career right
- 00:44:50from from the top universities being a
- 00:44:52professor at Stanford but the kind of
- 00:44:56talent that we put together here at uh
- 00:44:59at uh World Labs is just phenomenal I've
- 00:45:02never seen the concentration and I think
- 00:45:04the biggest
- 00:45:06differentiating um element here is that
- 00:45:09we're Believers of uh spatial
- 00:45:11intelligence all of the
- 00:45:13multidisciplinary talents whether it's
- 00:45:16system engineering machine uh machine
- 00:45:18learning infra to you know uh generative
- 00:45:21modeling to data to you know Graphics
- 00:45:26all of us whether it's our personal
- 00:45:28research Journey or or technology
- 00:45:31Journey or even personal hobby we
- 00:45:34believe that spatial intelligence has to
- 00:45:36happen at this moment with this group of
- 00:45:39people and uh that's how we really found
- 00:45:42our founding team and uh and that focus
- 00:45:46of energy and talent is is is really
- 00:45:50just uh um humbling to me I I just love
- 00:45:53it so I know you've been Guided by an
- 00:45:55Northstar so something about North Stars
- 00:45:58is like you can't actually reach
- 00:46:00them because they're in the sky but it's
- 00:46:02a great way to have guidance so how will
- 00:46:03you know when you've accomplished what
- 00:46:06you've set out to accomplish or is this
- 00:46:08a lifelong thing that's going to
- 00:46:09continue kind of infinitely first of all
- 00:46:13there's real northstars and virtual
- 00:46:15North Stars sometimes you can reach
- 00:46:17virtual northstars fair enough good
- 00:46:18enough in the world in the world model
- 00:46:22exactly like I said I thought one of my
- 00:46:25Northstar that would take a 100 years
- 00:46:27with storytelling of images and uh
- 00:46:30Justin and Andre you know in my opinion
- 00:46:33solved it for me so um so we could get
- 00:46:36to our Northstar but I think for me is
- 00:46:39when so many people and so many
- 00:46:42businesses are using our models to
- 00:46:44unlock their um needs for spatial
- 00:46:47intelligence and that's the moment I
- 00:46:49know we have reached a major Milestone
- 00:46:53actual deployment actual impact actually
- 00:46:56yeah I I don't think going to get there
- 00:46:57um I I think that this is such a
- 00:46:59fundamental thing like the universe is a
- 00:47:01giant evolving four-dimensional
- 00:47:03structure and spatial intelligence r
- 00:47:05large is just understanding that in all
- 00:47:07of its depths and figuring out all the
- 00:47:08applications to that so I I think that
- 00:47:11we have a we have a particular set of
- 00:47:12ideas in mind today but I I think this I
- 00:47:14think this journey is going to take us
- 00:47:16places that we can't even imagine right
- 00:47:17now the magic of good technology is that
- 00:47:20technology opens up more possibilities
- 00:47:23and and unknown so so we will be pushing
- 00:47:26and then the possibilities will will be
- 00:47:28expanding brilliant thank you Justin
- 00:47:30thank you fa this was fantastic thank
- 00:47:32you Martin thank you Martin thank you so
- 00:47:35much for listening to the a16z podcast
- Visual Spatial Intelligence
- Deep Learning
- AI Evolution
- Neural Networks
- Computational Power
- Data in AI
- ImageNet
- World Labs
- 3D Representation
- AI Applications