“The Future of AI is Here” — Fei-Fei Li Unveils the Next Frontier of AI

00:48:10
https://www.youtube.com/watch?v=vIXfYFB7aBI

Summary

TL;DR: The video discusses the fundamentals of visual-spatial intelligence and its place in the development of artificial intelligence, arguing that visual-spatial skills are as integral as language. The discussion takes a historical view of AI, highlighting critical advances in deep learning: the inception of neural networks, breakthroughs like AlexNet, and the impact of the Transformer model. It elaborates on data's crucial role, with projects like ImageNet demonstrating the benefits of comprehensive datasets, and recounts the journeys and contributions of AI experts Fei-Fei Li and Justin Johnson. It describes how the remarkable growth in computing power has transformed AI from theory into practice, enabling complex model training and faster learning. The conversation then explores the mission of World Labs, which aims to harness this deep understanding of data and computing to unlock spatial intelligence, and the applications this might have in fields ranging from interactive 3D world creation for education or gaming to augmented reality and robotics. Finally, the differences between spatial intelligence and language models are explained, emphasizing the former's essential focus on 3D space for more effective world interaction and representation, as opposed to the largely one-dimensional nature of language models.

Takeaways

  • 📈 Visual spatial intelligence is as fundamental as language.
  • 🔄 AI has transitioned from an AI winter to new peaks with deep learning.
  • 🧠 Key AI figures like Fei-Fei Li and Justin Johnson have influenced the field.
  • 💾 Computing power has substantially accelerated AI model training.
  • 📊 Large datasets, like ImageNet, have driven significant AI advancements.
  • 🤖 World Labs focuses on unlocking spatial intelligence.
  • 🌐 Spatial intelligence emphasizes 3D perception and action.
  • ⏩ Progress in AI enables new media forms, blending virtual and physical worlds.
  • 🕶️ Augmented reality and robotics can benefit from advanced spatial intelligence.
  • 🔍 Differences between language models and spatial intelligence are crucial.
  • 🎮 World generation for gaming is a potential application of spatial intelligence.
  • 📚 Educational tools can leverage spatial intelligence technologies.

Timeline

  • 00:00:00 - 00:05:00

    The discussion starts with highlighting the fundamental role of visual-spatial intelligence, mentioning the progress in understanding data and advancements in algorithms, and setting the stage to focus on unlocking new potentials in AI.

  • 00:05:00 - 00:10:00

    Over the past two years, there has been a significant increase in consumer AI companies, marking a wild and exciting time for AI development. The conversation touches on the historical context of AI from the AI winter to the emergence of deep learning, and its current transformative state involving various data forms such as text, pixels, video, and audio.

  • 00:10:00 - 00:15:00

    The speakers recount their individual journeys into AI. One was inspired by groundbreaking deep learning papers during undergraduate studies, highlighting the importance of combining powerful algorithms with large compute and data for breakthrough results — a notion established around 2011-2012.

  • 00:15:00 - 00:20:00

    As one speaker’s journey continued, they noticed a critical shift in the importance of data for AI, recognizing data as an overlooked element crucial for model generalization. This realization led to initiatives like ImageNet, which emphasized large-scale data acquisition, made possible by the advent of the internet, as a key unlock for machine learning models.

  • 00:20:00 - 00:25:00

    The dialogue transitions to the concept of 'big unlocks' in AI. While major algorithmic innovations like Transformers have fueled AI progress, the conversation sheds light on the underestimated impact of computational power. The example of AlexNet, trained with substantially less compute compared to modern standards, underscores this point.
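The compute comparison made here can be checked with simple arithmetic. A back-of-envelope sketch (the six-days-on-two-GTX-580s and roughly-five-minutes-on-one-GB200 figures are quoted in the conversation itself, not independently benchmarked):

```python
# Back-of-envelope estimate of the per-device compute factor the talk
# cites between a 2010-era GTX 580 and a modern GB200. The timings are
# the speakers' figures, not official benchmarks.
alexnet_minutes = 6 * 24 * 60              # original AlexNet run: ~6 days
alexnet_gpu_minutes = alexnet_minutes * 2  # spread across two GTX 580s
gb200_minutes = 5                          # quoted time on one GB200

speedup = alexnet_gpu_minutes / gb200_minutes
print(f"~{speedup:.0f}x per-device")       # ~3456x per-device
```

The ratio lands in the mid-thousands, consistent with the "it's in the thousands" answer given in the talk.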

  • 00:25:00 - 00:30:00

    The discussion pivots to differentiating the types of AI tasks, particularly generative versus predictive modeling. Historical attempts at generative tasks are noted, with more recent advancements in generative AI (using GANs) marking significant strides towards generating novel outputs like images from textual descriptions.
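One concrete example discussed later in the talk is that early GAN-based image generation could not yet accept free-form natural language, so the input was a structured scene graph of objects and relationships instead. A minimal sketch of what such an input looks like (a hypothetical encoding; the object names just mirror the sheep/grass/sky example from the conversation):

```python
# Hypothetical minimal scene-graph input for a graph-to-image model:
# nodes are objects, edges are pairwise relationships between them.
scene_graph = {
    "objects": ["sheep", "grass", "sky"],
    "relationships": [
        ("sheep", "standing on", "grass"),
        ("sky", "above", "grass"),
    ],
}

def describe(graph: dict) -> str:
    # Flatten the graph back to text, just to show what it encodes.
    return "; ".join(f"{s} {p} {o}" for s, p, o in graph["relationships"])

print(describe(scene_graph))  # sheep standing on grass; sky above grass
```

A generator consumes this structure rather than a sentence, which sidesteps natural-language understanding while still specifying the scene's layout.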

  • 00:30:00 - 00:35:00

    Further exploring generative AI advances, the speakers discuss projects involving style transfer and real-time generation, emphasizing the evolution of generative modeling from static image rendering to dynamic, real-time applications. This illustrates the broad transformation the field has undergone over the years.
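The style-transfer work referenced here (Gatys et al.'s neural style algorithm and the fast feed-forward variants that followed) rests on a simple statistic: the Gram matrix of a convolutional layer's feature maps, which captures "style" as channel-to-channel correlations. A minimal NumPy sketch of that core idea (illustrative only, not the speakers' actual implementation):

```python
import numpy as np

def gram_matrix(features: np.ndarray) -> np.ndarray:
    """Channel-correlation 'style' statistic of a (C, H, W) feature map."""
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    return flat @ flat.T / (c * h * w)   # normalize by map size

def style_loss(generated: np.ndarray, style: np.ndarray) -> float:
    """Mean squared difference between the two images' Gram matrices."""
    return float(np.mean((gram_matrix(generated) - gram_matrix(style)) ** 2))

# In the original (slow) algorithm this loss is minimized by gradient
# descent on the generated image's pixels, one optimization loop per
# image -- exactly the per-image cost the real-time variants removed.
feats = np.random.rand(8, 16, 16)
print(style_loss(feats, feats))  # 0.0 for identical features
```

The fast variants train a feed-forward network once per style so that generation becomes a single forward pass, which is the static-to-real-time shift this segment describes.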

  • 00:35:00 - 00:40:00

    Emphasizing spatial intelligence, the speakers outline a journey focused on visual intelligence, suggesting it is just as fundamental as language. Given present algorithmic advances and computational capabilities, they argue it is the right time to invest in developing technologies like those that power World Labs.

  • 00:40:00 - 00:48:10

    Finally, the conversation delves into the specifics of spatial intelligence, contrasting it with language-based AI approaches. Spatial intelligence emphasizes understanding and interacting in 3D environments, a fundamental aspect that language models, inherently one-dimensional, cannot fully grasp. This complements generative AI, transcending text and 2D to richer 3D representations.



Video Q&A

  • Why is visual spatial intelligence described as fundamental?

    Visual spatial intelligence is compared to language in its fundamental importance due to its ancient and essential role in understanding and interacting with the world.

  • How has AI evolved over the past decades, according to the video?

    AI has transitioned from theoretical models and the AI winter into practical, deep learning applications, involving significant advances such as language models and image recognition.

  • What major AI breakthroughs are discussed?

    The discussion highlights the significance of deep learning, especially neural networks like AlexNet, the importance of computing power, and algorithmic advances like Transformers.

  • Who are the key figures mentioned in the evolution of AI and deep learning?

    Notable figures include Fei-Fei Li and Justin Johnson, along with references to Andrew Ng, Hinton, and others involved in foundational deep learning research.

  • How is computational power significant to AI development?

    Increased computational power has enabled the practical application and fast training of complex AI models, transforming theoretical constructs into effective tools.

  • What role does data play in AI development, according to the speakers?

    Large datasets have been crucial for training AI models, enabling discoveries and the development of more accurate and generalizable models.

  • What is the significance of the ImageNet project?

    ImageNet played a crucial role in demonstrating the power of large datasets, helping propel computer vision and AI into practical applications.

  • What differentiates spatial intelligence from language models?

    Spatial intelligence focuses on 3D perception and action, essential for interacting with the physical world, contrasting with the 1D sequence processing in language models.

  • What is the mission of World Labs?

    World Labs aims to unlock spatial intelligence, leveraging advancements in algorithms, computing, and data to create technology that perceives and interacts with the 3D world.

  • How might spatial intelligence be applied?

    Potential applications include world generation for games and education, augmented reality, and enhancing robotics with better 3D understanding.

Subtitles
  • 00:00:00
    visual spatial intelligence is so
  • 00:00:03
    fundamental it's as fundamental as
  • 00:00:05
    language we've got this ingredients
  • 00:00:08
    compute deeper understanding of data and
  • 00:00:11
    we've got some advancement of algorithms
  • 00:00:14
    we are in the right moment to really
  • 00:00:17
    make a bet and to focus and just unlock
  • 00:00:26
    [Music]
  • 00:00:28
    that over the last two years we've seen
  • 00:00:31
    this kind of massive Rush of consumer AI
  • 00:00:33
    companies and technology and it's been
  • 00:00:35
    quite wild but you've been doing this
  • 00:00:38
    now for decades and so maybe walk
  • 00:00:40
    through a little bit about how we got
  • 00:00:41
    here kind of like your key contributions
  • 00:00:43
    and insights along the way so it is a
  • 00:00:46
    very exciting moment right just zooming
  • 00:00:48
    back AI is in a very exciting moment I
  • 00:00:51
    personally have been doing this for for
  • 00:00:53
    two decades plus and you know we have
  • 00:00:56
    come out of the last AI winter we have
  • 00:00:58
    seen the birth of modern AI then we have
  • 00:01:01
    seen deep learning taking off showing us
  • 00:01:04
    possibilities like playing chess but
  • 00:01:07
    then we're starting to see the the the
  • 00:01:10
    deepening of the technology and the
  • 00:01:12
    industry um adoption of uh of some of
  • 00:01:16
    the earlier possibilities like language
  • 00:01:19
    models and now I think we're in the
  • 00:01:21
    middle of a Cambrian explosion in almost
  • 00:01:24
    a literal sense because now in addition
  • 00:01:27
    to texts you're seeing pixels videos
  • 00:01:30
    audios all coming out with possible AI
  • 00:01:35
    applications and models so it's very
  • 00:01:37
    exciting moment I know you both so well
  • 00:01:40
    and many people know you both so well
  • 00:01:41
    because you're so prominent in the field
  • 00:01:42
    but not everybody like grew up in AI so
  • 00:01:44
    maybe it's kind of worth just going
  • 00:01:45
    through like your quick backgrounds just
  • 00:01:47
    to kind of level set the audience yeah
  • 00:01:48
    sure so I first got into AI uh at the
  • 00:01:51
    end of my undergrad uh I did math and
  • 00:01:53
    computer science for undergrad at
  • 00:01:54
    Caltech that was awesome but then
  • 00:01:55
    towards the end of that there was this
  • 00:01:57
    paper that came out that was at the time
  • 00:01:59
    a very famous paper the cat paper um
  • 00:02:01
    from Quoc Le and Andrew Ng and others that
  • 00:02:03
    were at Google Brain at the time and
  • 00:02:04
    that was like the first time that I came
  • 00:02:06
    across this concept of deep learning um
  • 00:02:08
    and to me it just felt like this amazing
  • 00:02:10
    technology and that was the first time
  • 00:02:12
    that I came across this recipe that
  • 00:02:13
    would come to define the next like more
  • 00:02:15
    than decade of my life which is that you
  • 00:02:17
    can get these amazingly powerful
  • 00:02:19
    learning algorithms that are very
  • 00:02:20
    generic couple them with very large
  • 00:02:22
    amounts of compute couple them with very
  • 00:02:23
    large amounts of data and magic things
  • 00:02:25
    started to happen when you combine those
  • 00:02:27
    ingredients so I I first came across
  • 00:02:29
    that idea like around 2011 2012-ish and
  • 00:02:31
    I just thought like oh my God this is
  • 00:02:33
    this is going to be what I want to do so
  • 00:02:35
    it was obvious you got to go to grad
  • 00:02:36
    school to do this stuff and then um sort
  • 00:02:38
    of saw that Fei-Fei was at Stanford one of
  • 00:02:40
    the few people in the world at the time
  • 00:02:41
    who was kind of on that on that train
  • 00:02:44
    and that was just an amazing time to be
  • 00:02:45
    in deep learning and computer vision
  • 00:02:47
    specifically because that was really the
  • 00:02:49
    era when this went from these first nascent
  • 00:02:52
    bits of technology that were just
  • 00:02:53
    starting to work and really got
  • 00:02:54
    developed and spread across a ton of
  • 00:02:56
    different applications so then over that
  • 00:02:58
    time we saw the beginning of language
  • 00:03:00
    modeling we saw the beginnings of
  • 00:03:02
    discriminative computer vision you could
  • 00:03:03
    take pictures and understand what's in
  • 00:03:05
    them in a lot of different ways we also
  • 00:03:06
    saw some of the early bits of what we
  • 00:03:08
    would now call generative modeling
  • 00:03:10
    generating images generating text a lot
  • 00:03:12
    of those core algorithmic pieces
  • 00:03:14
    actually got figured out by the academic
  • 00:03:16
    Community um during my PhD years like
  • 00:03:18
    there was a time I would just like wake
  • 00:03:19
    up every morning and check the new
  • 00:03:21
    papers on arXiv and just be ready it
  • 00:03:23
    was like unwrapping presents on
  • 00:03:24
    Christmas that like every day you know
  • 00:03:25
    there's going to be some amazing new
  • 00:03:27
    discovery some amazing new application
  • 00:03:28
    or algorithm somewhere in the world what
  • 00:03:30
    happened is in the last two years
  • 00:03:32
    everyone else in the world kind of came
  • 00:03:33
    to the same realization using AI to get
  • 00:03:35
    new Christmas presents every day but I
  • 00:03:37
    think for those of us that have been in
  • 00:03:38
    the field for a decade or more um we've
  • 00:03:40
    sort of had that experience for a very
  • 00:03:41
    long time obviously I'm much older than
  • 00:03:45
    Justin I I come to AI through a
  • 00:03:49
    different angle which is from physics
  • 00:03:51
    because my undergraduate uh background
  • 00:03:53
    was physics but physics is the kind of
  • 00:03:56
    discipline that teaches you to think
  • 00:03:59
    audacious questions and think about
  • 00:04:01
    what is the remaining mystery of the
  • 00:04:04
    world of course in physics it's the atomic
  • 00:04:06
    world you know the universe and all that but
  • 00:04:09
    somehow that kind of training and thinking
  • 00:04:13
    got me into the audacious question that
  • 00:04:16
    really captured my own imagination which
  • 00:04:18
    is intelligence so I did my PhD in Ai
  • 00:04:22
    and computational neuroscience at Caltech so
  • 00:04:26
    Justin and I actually didn't overlap but
  • 00:04:28
    we share um
  • 00:04:30
    the same alma mater at Caltech oh and
  • 00:04:33
    and the same adviser at Caltech yes same
  • 00:04:35
    adviser your undergraduate adviser in my
  • 00:04:37
    PhD advisor Pietro Perona and my PhD time
  • 00:04:41
    which is similar to your your your PhD
  • 00:04:44
    time was when AI was still in the winter
  • 00:04:47
    in the public eye but it was not in the
  • 00:04:50
    winter in my eye because it's that
  • 00:04:53
    pre-spring hibernation there's so much life
  • 00:04:56
    machine learning statistical modeling
  • 00:04:59
    was really gaining uh gaining power and
  • 00:05:03
    we I I think I was one of the Native
  • 00:05:07
    generation in machine learning and AI
  • 00:05:11
    whereas I look at Justin's generation as
  • 00:05:13
    the native deep learning generation so
  • 00:05:16
    so so machine learning was the precursor
  • 00:05:19
    of deep learning and we were
  • 00:05:21
    experimenting with all kinds of models
  • 00:05:24
    but one thing came out at the end of my
  • 00:05:26
    PhD and the beginning of my assistant
  • 00:05:29
    professor
  • 00:05:30
    there was an
  • 00:05:32
    overlooked element of AI that is
  • 00:05:36
    mathematically important to drive
  • 00:05:39
    generalization but the whole field was
  • 00:05:41
    not thinking that way and it was Data
  • 00:05:45
    because we were thinking about um you
  • 00:05:47
    know the intricacy of Bayesian models or
  • 00:05:50
    or whatever you know um uh kernel
  • 00:05:53
    methods and all that but what was
  • 00:05:56
    fundamental that my students and my lab
  • 00:05:59
    realized probably uh earlier than most
  • 00:06:01
    people is that if you if you let Data
  • 00:06:05
    Drive models you can unleash the kind of
  • 00:06:08
    power that we haven't seen before and
  • 00:06:11
    that was really the the the reason we
  • 00:06:14
    went on a pretty
  • 00:06:17
    crazy bet on ImageNet which is you know
  • 00:06:21
    what just forget about any scale we're
  • 00:06:24
    seeing now which is thousands of data
  • 00:06:26
    points at that point uh NLP community
  • 00:06:29
    has their own data sets I remember UC
  • 00:06:32
    Irvine data set or some data set in
  • 00:06:34
    NLP it was small in comparison the vision
  • 00:06:37
    Community has their data sets but all in
  • 00:06:40
    the order of thousands or tens of
  • 00:06:42
    thousands were like we need to drive it
  • 00:06:44
    to internet scale and luckily it was
  • 00:06:48
    also the the the coming of age of
  • 00:06:50
    Internet so we were riding that wave and
  • 00:06:54
    that's when I came to Stanford so these
  • 00:06:57
    epochs are what we often talk about like
  • 00:06:59
    ImageNet is clearly the epoch that created you
  • 00:07:01
    know or or at least like maybe made like
  • 00:07:04
    popular and viable computer vision and
  • 00:07:07
    the Gen wave we talk about two kind of
  • 00:07:09
    core unlocks one is like the
  • 00:07:10
    Transformers paper which is attention we
  • 00:07:12
    talk about stable diffusion is that a
  • 00:07:13
    fair way to think about this which is
  • 00:07:15
    like there's these two algorithmic
  • 00:07:16
    unlocks that came from Academia or
  • 00:07:18
    Google and like that's where everything
  • 00:07:19
    comes from or has it been more
  • 00:07:20
    deliberate or have there been other kind
  • 00:07:22
    of big unlocks that kind of brought us
  • 00:07:24
    here that we don't talk as much about
  • 00:07:25
    yeah I I think the big unlock is compute
  • 00:07:28
    like I know the story of AI is often the
  • 00:07:29
    story of compute but even no matter how
  • 00:07:31
    much people talk about it I I think
  • 00:07:32
    people underestimate it right and the
  • 00:07:34
    amount of the amount of growth that
  • 00:07:35
    we've seen in computational power over
  • 00:07:37
    the last decade is astounding the first
  • 00:07:39
    paper that's really credited with the
  • 00:07:40
    like Breakthrough moment in computer
  • 00:07:42
    vision for deep learning was AlexNet um
  • 00:07:45
    which was a 2012 paper that where a deep
  • 00:07:47
    neural network did really well on the
  • 00:07:48
    image net Challenge and just blew away
  • 00:07:50
    all the other algorithms that Fei-Fei had been
  • 00:07:53
    working on the types of algorithms that
  • 00:07:54
    they' been working on more in grad
  • 00:07:55
    school that AlexNet was a 60 million
  • 00:07:57
    parameter deep neural network um and it
  • 00:07:59
    was trained for six days on two GTX 580s
  • 00:08:03
    which was the top consumer card at the
  • 00:08:04
    time which came out in 2010 um so I was
  • 00:08:07
    looking at some numbers last night just
  • 00:08:09
    to you know put these in perspective the
  • 00:08:11
    newest the latest and greatest from
  • 00:08:12
    Nvidia is the GB200 um do either of you
  • 00:08:15
    want to guess how much raw compute
  • 00:08:17
    Factor we have between the GTX 580 and
  • 00:08:20
    the gb200 shoot no what go for it it's
  • 00:08:23
    uh it's in the thousands so I I ran the
  • 00:08:26
    numbers last night like that two We R
  • 00:08:28
    that two we training run that of Six
  • 00:08:30
    Days on two GTX 580s if you scale it it
  • 00:08:33
    comes out to just under five minutes on
  • 00:08:35
    a single GB on a single gb200 Justin is
  • 00:08:39
    making a really good point the 2012 Alex
  • 00:08:42
    net paper on image net challenge is
  • 00:08:45
    literally a very classic Model and that
  • 00:08:49
    is the convolution on your network model
  • 00:08:52
    and that was published in 1980s the
  • 00:08:54
    first paper I remember as a graduate
  • 00:08:56
    student learning that and it more or
  • 00:09:00
    less also has six seven layers the
  • 00:09:03
    practically the only difference between
  • 00:09:06
    AlexNet and the ConvNet what's the
  • 00:09:08
    difference is the gpus the two gpus and
  • 00:09:14
    the deluge of data yeah well so that's
  • 00:09:17
    what I was going to go which is like so
  • 00:09:18
    I think most people now are familiar
  • 00:09:20
    with like quote the bitter lesson and
  • 00:09:21
    the bitter lesson says is if you make an
  • 00:09:23
    algorithm don't be cute yeah just make
  • 00:09:25
    sure you can take advantage of available
  • 00:09:26
    compute because the available compute
  • 00:09:28
    will show up right and so like you just
  • 00:09:29
    like need to like why like on the other
  • 00:09:32
    hand there's another narrative um which
  • 00:09:35
    seems to me to be like just as credible
  • 00:09:36
    which is like it's actually new data
  • 00:09:37
    sources that unlock deep learning right
  • 00:09:39
    like ImageNet is a great example but like a
  • 00:09:40
    lot of people like self attention is
  • 00:09:42
    great from Transformers but they'll also
  • 00:09:44
    say this is a way you can exploit human
  • 00:09:45
    labeling of data because like it's the
  • 00:09:47
    humans that put the structure in the
  • 00:09:48
    sentences and if you look at clip
  • 00:09:50
    they'll say well like we're using the
  • 00:09:51
    internet to like actually like have
  • 00:09:53
    humans use the alt tag to label images
  • 00:09:56
    right and so like that's a story of data
  • 00:09:58
    that's not a story of compute and so is
  • 00:10:00
    it just is the answer just both or is
  • 00:10:02
    like one more than the other or I think
  • 00:10:03
    it's both but you're hitting another
  • 00:10:05
    really good point so I think there's
  • 00:10:06
    actually two eras that to me feel quite
  • 00:10:08
    distinct in the algorithmics here so
  • 00:10:10
    like the ImageNet era is actually the
  • 00:10:12
    era of supervised learning um so in the
  • 00:10:14
    era of supervised learning you have a
  • 00:10:15
    lot of data but you don't know how to
  • 00:10:17
    use data on its own like the expectation
  • 00:10:20
    of ImageNet and other data sets of that time
  • 00:10:22
    period was that we're going to get a lot
  • 00:10:23
    of images but we need people to label
  • 00:10:25
    every one and all of the training data
  • 00:10:27
    that we're going to train on like a
  • 00:10:29
    person a human labeler has looked at
  • 00:10:30
    every one and said something about that
  • 00:10:32
    image yeah um and the big algorithmic
  • 00:10:34
    unlock is that we know how to train on things
  • 00:10:36
    that don't require human labeled data as
  • 00:10:38
    as the naive person in the room that
  • 00:10:39
    doesn't have an AI background it seems
  • 00:10:41
    to me if you're training on human data
  • 00:10:43
    like the humans have labeled it it's
  • 00:10:45
    just not explicit I knew you were gonna
  • 00:10:47
    say that Mar I knew that yes
  • 00:10:49
    philosophically that's a really
  • 00:10:51
    important question but that actually is
  • 00:10:53
    more true of language than pixels fair
  • 00:10:56
    enough yeah 100 yeah yeah yeah yeah yeah
  • 00:10:58
    but I do think it's an important
  • 00:11:05
    point it learned it all just more implicit
  • 00:11:08
    than explicit yeah it's still it's still
  • 00:11:09
    human labeled the distinction is that
  • 00:11:11
    for for this supervised learning era um
  • 00:11:13
    our learning tasks were much more
  • 00:11:14
    constrained so like you would have to
  • 00:11:16
    come up with this ontology of Concepts
  • 00:11:18
    that we want to discover right if you're
  • 00:11:19
    doing ImageNet like Fei-Fei and her
  • 00:11:22
    students at the time spent a lot of time
  • 00:11:24
    thinking about you know which thousand
  • 00:11:26
    categories should be in the ImageNet
  • 00:11:27
    challenge other data sets of that time
  • 00:11:29
    like the COCO data set for object
  • 00:11:30
    detection like they thought really hard
  • 00:11:32
    about which 80 categories we put in
  • 00:11:34
    there so let's let's walk to GenAI um so
  • 00:11:36
    so when I was doing my my PhD before
  • 00:11:38
    that um you came so I took machine
  • 00:11:41
    learning from Andrew Ng and then I took
  • 00:11:43
    like Bayesian something very complicated
  • 00:11:44
    from Daphne Koller and it was very
  • 00:11:45
    complicated for me a lot of that was
  • 00:11:47
    just predictive modeling yeah um and then
  • 00:11:49
    like I remember the whole kind of vision
  • 00:11:51
    stuff that you unlock but then the
  • 00:11:52
    generative stuff is shown up like I
  • 00:11:53
    would say in the last four years which
  • 00:11:55
    is to me very different like you're not
  • 00:11:57
    identifying objects you're not you know
  • 00:11:59
    predicting something you're generating
  • 00:12:00
    something and so maybe kind of walk
  • 00:12:02
    through like the key unlocks that got us
  • 00:12:04
    there and then why it's different and if
  • 00:12:06
    we should think about it differently and
  • 00:12:07
    is it part of a Continuum is it not it
  • 00:12:10
    is so interesting even during my
  • 00:12:13
    graduate time generative model was there
  • 00:12:17
    we wanted to do generation we nobody
  • 00:12:20
    remembers even with the uh letters and
  • 00:12:24
    uh numbers we were trying to do some you
  • 00:12:26
    know Jeff Hinton had generative
  • 00:12:29
    papers we were thinking about how to
  • 00:12:31
    generate and in fact if
  • 00:12:34
    you think from a probability
  • 00:12:36
    distribution point of view you can
  • 00:12:37
    mathematically generate it's just
  • 00:12:39
    nothing we generate would ever impress
  • 00:12:42
    anybody right so this concept of
  • 00:12:45
    generation mathematically theoretically
  • 00:12:47
    is there but nothing worked so then I do
  • 00:12:52
    want to call out Justin's PhD and Justin
  • 00:12:55
    was saying that he got enamored by Deep
  • 00:12:57
    learning so he came to my lab Justin PhD
  • 00:12:59
    his entire PhD is a story almost a mini
  • 00:13:03
    story of the trajectory of the
  • 00:13:07
    field he started his first project in
  • 00:13:09
    data I forced him to he didn't like
  • 00:13:13
    it so in retrospect I learned a lot of
  • 00:13:16
    really useful things I'm glad you say
  • 00:13:18
    that now so we moved Justin to um to
  • 00:13:22
    deep learning and the core problem there
  • 00:13:25
    was taking images and generating words
  • 00:13:29
    well actually it was even about there
  • 00:13:31
    were I think there were three discret
  • 00:13:32
    phases here on this trajectory so the
  • 00:13:34
    first one was actually matching images
  • 00:13:36
    and words right right right like we have
  • 00:13:38
    we have an image we have words and can
  • 00:13:40
    we say how much they align so actually
  • 00:13:41
    my first paper both of my PhD and like
  • 00:13:44
    ever my first academic publication ever
  • 00:13:47
    was the image retrieval with scene
  • 00:13:48
    graphs and then we went into generation
  • 00:13:51
    uh taking pixels generating words and
  • 00:13:53
    Justin and Andrej uh really worked on
  • 00:13:56
    that but that was still a very very
  • 00:14:00
    lossy way of of of generating and
  • 00:14:03
    getting information out of the pixel
  • 00:14:05
    world and then in the middle Justin went
  • 00:14:07
    off and did a very famous piece of work
  • 00:14:10
    and it was the first time that uh
  • 00:14:13
    someone made it real time right yeah
  • 00:14:16
    yeah so so the story there is there was
  • 00:14:17
    this paper that came out in 2015 a
  • 00:14:19
    neural algorithm of artistic style led
  • 00:14:21
    by Leon Gatys and it was like the paper
  • 00:14:24
    came out and they showed like these
  • 00:14:25
    these real world photographs that they
  • 00:14:26
    had converted into van Gogh style and like
  • 00:14:29
    we are kind of used to seeing things
  • 00:14:30
    like this in 2024 but this was in 2015
  • 00:14:33
    so this paper just popped up on arXiv
  • 00:14:35
    one day and it like blew my mind like I
  • 00:14:37
    just got this like gen brainworm like in
  • 00:14:39
    my brain in like 2015 and it like did
  • 00:14:42
    something to me and I thought like oh my
  • 00:14:44
    God I need to understand this algorithm
  • 00:14:45
    I need to play with it I need to make my
  • 00:14:47
    own images into van Gogh so then I like
  • 00:14:49
    read the paper and over a long weekend I
  • 00:14:51
    reimplemented the thing and got it to
  • 00:14:52
    work it was a very actually very simple
  • 00:14:55
    algorithm um so like my implementation
  • 00:14:57
    was like 300 lines of Lua because at the
  • 00:14:59
    time it was pre it was Lua there was
  • 00:15:01
    there was um this was pre-PyTorch so
  • 00:15:03
    we were using Lua Torch um but it was
  • 00:15:05
    like very simple algorithm but it was
  • 00:15:06
    slow right so it was an
  • 00:15:08
    optimization based thing every image you
  • 00:15:10
    want to generate you need to run this
  • 00:15:11
    optimization loop run this gradient descent
  • 00:15:12
    loop for every image that you generate
  • 00:15:14
    the images were beautiful but I just
  • 00:15:16
    like wanted to be faster and and Justin
  • 00:15:19
    just did it and it was actually I think
  • 00:15:21
    your first taste
  • 00:15:23
    of an academic work having an industry
  • 00:15:27
    impact a bunch of people seen this this
  • 00:15:30
    artistic style transfer stuff at the
  • 00:15:31
    time and me and a couple others at the
  • 00:15:33
    same time came up with different ways to
  • 00:15:34
    speed this up yeah um but mine was the
  • 00:15:37
    one that got a lot of traction right so
  • 00:15:38
    I was very proud of Justin but there's
  • 00:15:40
    one more thing I was very proud of
  • 00:15:41
    Justin to connect to GenAI is that before
  • 00:15:45
    the world understood GenAI Justin's last
  • 00:15:48
    piece of uh uh work in PhD which I I
  • 00:15:52
    knew about it because I was forcing you
  • 00:15:53
    to do it that one was fun that was was
  • 00:15:57
    actually uh input
  • 00:16:00
    language and getting a whole picture out
  • 00:16:03
    it's one of the first gen uh work it's
  • 00:16:07
    using GANs which was so hard to use but
  • 00:16:10
    the problem is that we are not ready to
  • 00:16:12
    use a natural piece of language so
  • 00:16:14
    justtin you heard he worked on sing
  • 00:16:16
    graph so we have to input a sing graph
  • 00:16:20
    language structure so you know the Sheep
  • 00:16:23
    the the the grass the sky in a graph way
  • 00:16:26
    it literally was one of our photos right
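The scene-graph input Fei-Fei describes is just objects plus labeled pairwise relationships. A minimal sketch of that structure, using the sheep/grass/sky example from the conversation; the exact schema here is an illustrative assumption, not the paper's actual format:

```python
# Minimal scene-graph sketch: objects as nodes, relationships as labeled
# edges. The labels mirror the sheep/grass/sky example from the talk;
# the dict layout is illustrative, not the original work's data format.
scene_graph = {
    "objects": ["sheep", "grass", "sky"],
    "relationships": [
        ("sheep", "standing on", "grass"),   # (subject, predicate, object)
        ("grass", "below", "sky"),
    ],
}

def describe(graph):
    """Flatten the graph into readable subject-predicate-object triples."""
    return [f"{s} {p} {o}" for s, p, o in graph["relationships"]]

print(describe(scene_graph))
```

A generator conditioned on a structure like this gets an unambiguous layout to render, which is exactly what made it tractable before models could handle free-form natural language.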
Then he and another very good Master's student, Agrim, got that GAN to work. So you can see the progression: from data, to matching, to style transfer, to generative images. You asked whether this was an abrupt change: for people like us it's a continuum that was already happening, but for the world the results feel more
abrupt.

So, I read your book, and for those that are listening, it's a phenomenal book; I really recommend you read it. It seems like for a long time, and I'm talking to you now, Fei-Fei, a lot of your research and your direction has been towards spatial stuff, pixel stuff, and intelligence, and now you're doing World Labs, which is around spatial intelligence. So maybe talk us through it: has this been part of a long journey for you? Why did you decide to do it now? Is it a technical unlock, is it a personal unlock? Just move us from that milieu of
AI research to World Labs.

Sure. For me it is both personal and intellectual. My entire intellectual journey, you talked about my book, is really this passion to seek North Stars, while also believing that those North Stars are critically important for the advancement of our field. At the beginning, I remember that after graduate school I thought my North Star was telling the stories of images, because for me that's such an important piece of visual intelligence, part of what you call AI or AGI. But when Justin and Andrej did that, I was like, oh my God, that was my life's dream; what do I do next? It came a lot faster than I expected; I thought it would take a hundred years.

But visual intelligence is my passion, because I do believe that for every intelligent being, whether people or robots or some other form, knowing how to see the world, reason about it, and interact in it, whether you're navigating or manipulating or making things (you can even build civilization upon it), is essential. Visual spatial intelligence is so fundamental. It's as fundamental as language, possibly more ancient and more fundamental in certain ways. So it's very natural for me that World Labs' North Star is to unlock
spatial intelligence. And the moment is right to do it. Like Justin was saying, we've got the ingredients: we've got compute; we've got a much deeper understanding of data than in the ImageNet days (compared to those days, we're so much more sophisticated); and we've got real advances in algorithms, including from our co-founders at World Labs, Ben Mildenhall and Christoph Lassner, who were at the cutting edge of NeRF. We are at the right moment to really make a bet, to focus, and to just unlock that.

So I
just want to clarify, for folks that are listening to this: you're starting this company, World Labs, and spatial intelligence is how you're generally describing the problem you're solving. Can you maybe try to
crisply describe what that means?

Yeah. Spatial intelligence is about machines' ability to perceive, reason, and act in 3D space and time: to understand how objects and events are positioned in 3D space and time, and how interactions in the world can affect those 3D (4D, over space-time) positions; and to perceive, reason about, generate, and interact with all of that. It's really about taking the machine out of the mainframe, out of the data center, putting it out into the world, and understanding the 3D
and 4D world with all of its richness.

So, to be very clear, are we talking about the physical world, or are we just talking about an abstract notion of a world?

I think it can be both, and that encompasses our long-term vision. Even if you're generating worlds, even if you're generating content, doing that positioned in 3D has a lot of benefits; or, if you're recognizing the real world, being able to put 3D understanding
into the real world as well is part of it.

Great. Just for everybody listening: the two other co-founders, Ben Mildenhall and Christoph Lassner, are absolute legends in the field, at the same level. These four decided to come out and do this company now, and so I'm trying to dig into why
now is the right time.

Yeah. This is again part of a longer evolution for me. After my PhD, when I really wanted to develop into my own independent researcher for my later career, I was thinking about what the big problems in AI and computer vision are. The conclusion I came to around that time was that the previous decade had mostly been about understanding data that already exists, all the images and videos that were already on the web, but the next decade was going to be about understanding new data. People have smartphones, those smartphones have cameras, those cameras have new sensors, and those cameras are positioned in the 3D world. It's no longer that you get a bag of pixels from the internet, know nothing about it, and try to say whether it's a cat or a dog. We want to treat images as universal sensors to the physical world, and ask how we can use them to understand the 3D and 4D structure of the world, either in physical spaces or in generative spaces. So I made a pretty big
pivot post-PhD into 3D computer vision, predicting 3D shapes of objects with some of my colleagues at FAIR at the time. Then later I got really enamored with the idea of learning 3D structure through 2D. We talk about data a lot, and 3D data is hard to get on its own, but there's a very strong mathematical connection here: our 2D images are projections of a 3D world, and there's a lot of mathematical structure we can take advantage of. So even if you only have large amounts of 2D data, a lot of people have done amazing work figuring out how to back out the 3D structure of the world from large quantities of 2D observations.
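The "images are projections of a 3D world" structure Justin leans on is the pinhole camera model: a 3D point maps to a 2D pixel, and the depth is lost in the perspective divide, which is why backing 3D out again is a real inference problem. A minimal sketch; the intrinsics values are illustrative assumptions:

```python
import numpy as np

# Pinhole projection: a 3D point in camera coordinates maps to a 2D pixel.
# The intrinsics (focal lengths fx, fy and principal point cx, cy) are
# illustrative numbers, not any particular camera's calibration.
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])

def project(point_3d):
    """Project a 3D point (camera frame, z > 0) to pixel coordinates."""
    p = K @ point_3d
    return p[:2] / p[2]          # perspective divide: depth disappears here

pixel = project(np.array([0.2, -0.1, 2.0]))
print(pixel)                     # every point along this ray lands on the same pixel
```

Because infinitely many 3D points project to the same pixel, recovering 3D structure requires extra constraints: multiple views, motion, or learned priors.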
Then in 2020 (you asked about breakthrough moments) there was a really big one from our co-founder Ben Mildenhall, with his paper NeRF, Neural Radiance Fields. That was a very simple, very clear way of backing out 3D structure from 2D observations, and it lit a fire under this whole space of 3D computer vision.
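NeRF's core recipe, as Justin summarizes it, is a field that maps a 3D point to density and color, rendered along camera rays by volume rendering; the field is trained so that renders match ordinary 2D photos. A minimal sketch of the rendering half; the hand-made "fog sphere" field below is an illustrative stand-in for the trained MLP:

```python
import numpy as np

# Core of NeRF, sketched: a field maps a 3D point to (density, color), and a
# pixel's color is volume-rendered along its camera ray. In the real method
# the field is an MLP fit to 2D photos; a hand-made "fog sphere" stands in.
def field(p):
    density = 5.0 if np.linalg.norm(p) < 0.5 else 0.0   # dense inside a sphere
    color = np.array([1.0, 0.3, 0.1])                    # constant orange
    return density, color

def render_ray(origin, direction, n_samples=64, t_far=3.0):
    ts = np.linspace(0.0, t_far, n_samples)
    dt = ts[1] - ts[0]
    transmittance, out = 1.0, np.zeros(3)
    for t in ts:                                         # march along the ray
        density, color = field(origin + t * direction)
        alpha = 1.0 - np.exp(-density * dt)              # local opacity
        out += transmittance * alpha * color
        transmittance *= 1.0 - alpha                     # light still unblocked
    return out

# A ray through the sphere picks up its color; a ray that misses stays black.
hit = render_ray(np.array([0.0, 0.0, -2.0]), np.array([0.0, 0.0, 1.0]))
miss = render_ray(np.array([2.0, 0.0, -2.0]), np.array([0.0, 0.0, 1.0]))
print(hit, miss)
```

Since this render is differentiable, replacing `field` with a small network and descending on the photo-reconstruction error recovers the scene, which is why, as discussed below, it trained in hours on one GPU.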
I think there's another aspect here that maybe people outside the field don't quite understand: that was also the time when large language models were starting to take off. A lot of the work on language modeling had actually been developed in academia; even during my PhD I did some early work with Andrej Karpathy on language modeling, back in 2014. I still remember LSTMs, RNNs, GRUs; this was pre-Transformer. But at some point, around the GPT-2 era, you couldn't really build those kinds of models in academia anymore, because they took way more resources. There was one really interesting thing, though, about the NeRF approach that Ben came up with: you could train these models in an hour or a couple of hours on a single GPU. So there was a dynamic at that time where a lot of academic researchers ended up focusing on these problems, because there was core algorithmic work to figure out, and because you could actually do a lot without a ton of compute and get state-of-the-art results on a single GPU. Because of those dynamics, a lot of researchers in academia were moving to think about the core algorithmic ways we could advance this area. Then I ended up
chatting with Fei-Fei more, and I realized that we were actually...

She's very convincing.

She is very convincing. Well, there's that, but, you know, we talk about trying to figure out your own independent research trajectory apart from your adviser; well, it turns out we ended up, oh no, kind of converging on similar things.

Okay, well, from my end, I wanted to talk to the smartest person I know, so I called Justin. There's no question about it.
I do want to talk about a very interesting technical issue, a technical story of pixels, that most people who work in language don't realize. In the pre-generative era of computer vision, those of us who work on pixels actually had a long history in an area of research called reconstruction, 3D reconstruction, which dates back to the 1970s. You take photos; because humans have two eyes, it generally starts with stereo photos; and then you try to triangulate the geometry and make a 3D shape out of it. It is a really, really hard problem, and to this day it's not fundamentally solved, because of the correspondence problem and all of that.
So this whole field, which is an older way of thinking about 3D, had been going on and making really good progress. But when NeRF happened, in the context of generative methods and diffusion models, reconstruction and generation suddenly started to really merge. Now, within a really short period of time, it's hard in computer vision to talk about reconstruction versus generation anymore. We suddenly have a moment where, whether we see something or we imagine something, both can converge towards generating it.

Right. That's, to me, a really important moment for computer vision, but most people missed it, because we're not talking about it as much as LLMs.
Right, so in pixel space there's reconstruction, where you reconstruct a scene that's real, and if you haven't seen the scene, you use generative techniques; these things end up being very similar. Throughout this entire conversation you've been talking about language and you've been talking about pixels, so maybe it's a good time to discuss how spatial intelligence, and what you're working on, contrasts with language approaches, which of course are very popular now. Is it complementary? Is it orthogonal?

Yeah, I
think they're complementary.

I don't mean to be too leading here; maybe just contrast them. Everybody says: listen, I know OpenAI, and I know GPT, and I know multimodal models, and a lot of what you're talking about is there: they've got pixels and they've got language. Doesn't that kind of do what we want to do with spatial reasoning?

Yeah. So I think to do
that, you need to open up the black box a little bit and look at how these systems work under the hood. With the language models and the multimodal language models we're seeing nowadays, the underlying representation under the hood is one-dimensional. We talk about context lengths, we talk about Transformers, sequences, attention; fundamentally, their representation of the world is one-dimensional. These things operate on a one-dimensional sequence of tokens. That's a very natural representation when you're talking about language, because written text is a one-dimensional sequence of discrete letters. That underlying representation is the thing that led to LLMs, and with the multimodal LLMs we're seeing now, you end up shoehorning the other modalities into this underlying representation of a 1D sequence of tokens.
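The "shoehorning" Justin describes is easy to see concretely: a common scheme (ViT-style patchification) chops an image into patches and flattens them into the same 1D token sequence the Transformer already consumes. A toy sketch; the sizes are illustrative assumptions:

```python
import numpy as np

# A 2D image becomes a 1D token sequence, ViT-style: cut into patches,
# flatten each patch, read them off in raster order. The spatial layout
# survives only implicitly, via token order (position embeddings omitted).
image = np.arange(8 * 8).reshape(8, 8)        # toy 8x8 "image"
patch = 4

tokens = [
    image[i:i + patch, j:j + patch].ravel()   # one flattened patch = one token
    for i in range(0, 8, patch)
    for j in range(0, 8, patch)
]
sequence = np.stack(tokens)                   # shape (4, 16): a 1D token list
print(sequence.shape)                         # the Transformer sees only this
```

The model then has to re-infer all the 2D (let alone 3D) structure from that flat ordering, which is the contrast with putting 3D front and center that the conversation turns to next.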
Now, when we move to spatial intelligence, we're going the other way: we're saying that the three-dimensional nature of the world should be front and center in the representation. From an algorithmic perspective, that opens the door to processing data in different ways, getting different kinds of outputs, and tackling slightly different problems. At a coarse level you might look from the outside and say, oh, multimodal LLMs can look at images too. They can, but I think they don't have that fundamental 3D representation at the heart of their approaches.

I totally agree with Justin. I
think talking about 1D versus fundamentally 3D representation is one of the most core differentiators. The other thing, which is slightly philosophical but really important, at least to me, is that language is fundamentally a purely generated signal. There's no language out there in nature; you don't go out and find words written in the sky for you. Whatever data you feed in, you can pretty much, with enough generalizability, regurgitate the same kind of data out; that's language to language. But the 3D world is not like that. There is a 3D world out there that follows the laws of physics, that has its own structures due to materials and many other things. To fundamentally back that information out, and to be able to represent it and generate it, is just quite a different problem. We will be borrowing similar ideas, useful ideas, from language and LLMs, but this is, fundamentally and philosophically to me, a different
problem.

Right, so language is 1D, and probably a bad representation of the physical world, because it's been generated by humans and it's probably lossy. There's a whole other modality of generative AI models, which is pixels: 2D images and 2D video. One could say that when you look at a video you can see 3D stuff, because you can pan a camera or whatever. So how would spatial intelligence be different from, say, 2D video?

When I
think about this, it's useful to disentangle two things: one is the underlying representation, and two is the user-facing affordances you get. This is where things can get confusing, because fundamentally we see in 2D: our retinas are 2D structures in our bodies, and we've got two of them, so fundamentally our visual system perceives 2D images. But depending on what representation you use, different affordances can be more or less natural. Even if, at the end of the day, what you're seeing is a 2D image or a 2D video, your brain is perceiving it as a projection of a 3D world. There are things you might want to do, like move objects around or move the camera around; in principle you might be able to do these with a purely 2D representation and model, but it's just not a good fit to the problems you're asking the model to solve. Modeling the 2D projections of a dynamic 3D world is a function that can probably be learned, but by putting a 3D representation into the heart of the model, there's going to be a better fit between the kind of representation the model works with and the kind of tasks you want the model to do. So our bet is that by threading a bit more 3D representation under the hood, we'll enable better affordances for users.
This also goes back to the North Star for me: why is it spatial intelligence, and why is it not flat pixel intelligence? Because I think the arc of intelligence has to lead to what Justin calls affordances. If you look at evolution, the arc of intelligence eventually enables animals and humans, especially humans as intelligent animals, to move around the world, interact with it, create civilization, create life, make a sandwich, whatever you do in this 3D world. Translating that into a piece of technology, that native 3D-ness is fundamentally important for the floodgate of possible applications, even if for some of them the serving looks 2D; underneath, it's innately 3D.

To me, I think this is actually a very subtle
and incredibly critical point, so I think it's worth digging into, and a good way to do that is to talk about use cases. Just to level-set for this conversation: we're talking about building a technology, let's call it a model, that can do spatial intelligence. In the abstract, what might that look like? A little more concretely, what would be the potential use cases
you could apply it to?

So I think there are a couple of different kinds of things we imagine these spatially intelligent models being able to do over time. One that I'm really excited about is world generation. We're all used to text-to-image generators, and we're starting to see text-to-video generators, where you put in a prompt and out pops an amazing image or an amazing two-second clip. But I think you can imagine leveling this up and getting 3D worlds out. One thing we could imagine spatial intelligence helping with in the future is upleveling these experiences into 3D, where you're not just getting an image or a clip out, but a fully simulated, vibrant, interactive 3D world.

For gaming?

Maybe for gaming, right,
maybe for virtual photography, you name it. I think if you got this to work, there would be a million applications.

For education?

Yeah, for education. I mean, one of my points is that, in some sense, this enables a new form of media,
right? Because we already have the ability to create virtual interactive worlds, but it costs hundreds of millions of dollars and a ton of development time. As a result, the place where people drive this technological ability is video games. We do have the ability as a society to create amazingly detailed virtual interactive worlds that give you amazing experiences, but because it takes so much labor to do so, the only economically viable use of that technology in its current form is games that can be sold for $70 apiece to millions and millions of people to recoup the investment. If we had the ability to create these same virtual, interactive, vibrant 3D worlds cheaply, you would see a lot of other applications, because if you bring down the cost of producing that kind of content, people will use it for other things. What if you could have a personalized 3D experience that's as good, as rich, and as detailed as one of those AAA video games that cost hundreds of millions of dollars to produce, but catered to a niche that maybe only a couple of people would want? That's not a particular product or a particular roadmap, but I think that's a vision of a new kind of media that would be enabled by spatial intelligence in the
generative realm.

When I think about a world, I actually think about things that are not just scene generation; I think about stuff like movement and physics. In the limit, is that included? And the second question: if I'm interacting with it, are there semantics? By that I mean: if I open a book, are there pages, are there words in them, and do they mean something? Are we talking about a full-depth experience, or are we talking about a static scene?

I think you'll see a progression of this technology over time; this is really hard stuff to build. The static problem is a little bit easier, but in the limit we want this to be fully dynamic, fully interactable, all the things you just said.

I mean, that's the definition of spatial intelligence.

Yeah. So there is going to be a progression; we'll start with more static, but everything you've said is on the roadmap of
spatial intelligence. This is kind of in the name of the company itself, World Labs: the "world" is about building and understanding worlds. This is actually a little bit of inside baseball; I realized after we told people the name that they don't always get it, because in computer vision, and in reconstruction and generation, we often make a delineation between the kinds of things you can do. The first level is objects: a microphone, a cup, a chair; these are discrete things in the world, and a lot of the ImageNet-style stuff that Fei-Fei worked on was about recognizing objects in the world. Then, leveling up from objects, the next level I think of as scenes: scenes are compositions of objects. Now we've got this recording studio, with a table and microphones and people in chairs, as some composition of objects. But we envision worlds as a step beyond scenes. Scenes might be individual things, but we want to break the boundaries and go outside the door: step up from the table, walk out the door, walk down the street, see the cars buzzing past and the leaves on the trees moving, and be able to interact with those things.
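The objects, scenes, worlds progression Justin lays out can be made concrete as nested structure. A toy sketch; the classes, fields, and labels are illustrative assumptions, not World Labs' actual representation:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Toy version of the hierarchy from the conversation: objects are discrete
# things, a scene is a composition of objects, and a world links scenes
# together so you can "walk out the door". All fields are illustrative.

@dataclass
class Object3D:
    name: str
    position: Tuple[float, float, float]      # (x, y, z) within the scene

@dataclass
class Scene:
    name: str
    objects: List[Object3D] = field(default_factory=list)

@dataclass
class World:
    scenes: List[Scene] = field(default_factory=list)
    connections: List[Tuple[str, str]] = field(default_factory=list)  # "doors"

studio = Scene("recording studio", [
    Object3D("table", (0.0, 0.0, 0.0)),
    Object3D("microphone", (0.0, 1.0, 0.0)),
])
street = Scene("street", [Object3D("car", (5.0, 0.0, 2.0))])
world = World([studio, street], [("recording studio", "street")])

print(len(world.scenes), world.connections[0])
```

The jump from scene to world is visible in the `connections` list: it is what lets you leave one composition of objects and enter another, rather than staying inside a single bounded scene.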
  • 00:36:59
    those things another thing that's really
  • 00:37:00
    exciting is just to mention the word New
  • 00:37:02
    Media with this technology the boundary
  • 00:37:06
    between real world and virtual imagin
  • 00:37:09
    world or augmented world or predicted
  • 00:37:12
    world is all blurry you really it there
  • 00:37:17
    the real world is 3D right so in the
  • 00:37:20
    digital world you have to have a
  • 00:37:23
    3D representation to even blend with the
  • 00:37:26
    real world you know you cannot have a 2d
  • 00:37:29
    you cannot have a 1D to be able to
  • 00:37:31
    interface with the real 3D World in an
  • 00:37:34
    effective way and with this it unlocks
  • 00:37:36
    it so it it the use cases can can be
  • 00:37:40
    quite Limitless because of this right so
the first use case that Justin was talking about would be the generation of a virtual world for any number of purposes, and the one you're alluding to now would be more of an augmented reality.

Yes. Just around the time World Labs was being formed, the Vision Pro was released by Apple, and they used the term "spatial computing". We were almost like, they almost stole our name, but we're spatial intelligence, so
  • 00:38:09
    spatial Computing needs spatial
  • 00:38:11
    intelligence that's exactly right so we
  • 00:38:14
    don't know what Hardware form it will
  • 00:38:17
    take it will be goggles glasses contact
  • 00:38:19
    lenses contact lenses but that interface
  • 00:38:23
    between the true real world and what you
  • 00:38:26
    can do on top of it whether it's to help
  • 00:38:29
    you to augment your capability to work
  • 00:38:32
    on a piece of machine and fix your car
  • 00:38:34
    even if you are not a trained mechanic
  • 00:38:37
    or to just be in a Pokemon go Plus+ for
  • 00:38:42
    entertainment suddenly this piece of
  • 00:38:44
    technology is is going to be the the the
  • 00:38:49
    operating system basically uh for for
  • 00:38:52
    arvr uh Mixr in the limit like what does
  • 00:38:55
    In the limit, what does an AR device need to do? It's this thing that's always on, it's with you, it's looking out into the world, so it needs to understand the stuff you're seeing and maybe help you out with tasks in your daily life.
  • 00:39:06
    But I'm also really excited about this blend between virtual and physical. It becomes really critical: if you have the ability to understand what's around you in real time, in perfect 3D, then it actually starts to deprecate large parts of the real world as well. Right now, how many differently sized screens do we all own for different use cases? Too many. You've got your phone, your iPad, your computer monitor, your TV, your watch. These are all basically differently sized screens, because they need to present information to you in different contexts and in different positions. But if you've got the ability to seamlessly blend virtual content with the physical world, it deprecates the need for all of those: it just, ideally, seamlessly blends the information you need in the moment with the right mechanism for giving you that information.
  • 00:39:49
    Another huge case for being able to blend the digital, virtual world with the 3D physical world is for agents to be able to do things in the physical world. Humans can use these mixed-reality devices to do things: like I said, I don't know how to fix a car, but if I have to, I put on these goggles or glasses and suddenly I'm guided to do it. But there are other types of agents, namely robots, any kind of robot, not just humanoids. Their interface, by definition, is the 3D world, but their compute, their brain, by definition, is the digital world. So what connects the learning to the behaving, between a robot's brain and the real world? It has to be spatial intelligence.
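The robot case described above, where the 3D world is the interface and the digital brain acts on a spatial estimate of it, can be sketched as a toy perceive-then-act loop. This is purely illustrative (the functions and constants are assumptions, not any real robot stack): perception back-projects a 2D pixel plus a depth reading into a 3D point, and the "brain" turns that 3D point into a motion command.

```python
import numpy as np

def perceive(depth_m, pixel, fx, fy, cx, cy):
    """Back-project one pixel with a depth reading into a 3D point in the
    camera frame: the step that turns 2D sensor data into spatial structure."""
    u, v = pixel
    z = depth_m
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])

def act(target_cam, step=0.1):
    """Digital 'brain': emit a small motion command toward the 3D target."""
    direction = target_cam / np.linalg.norm(target_cam)
    return step * direction  # meters to move this control tick

# A cup seen at pixel (900, 400) with a 1.5 m depth reading:
target = perceive(1.5, (900, 400), fx=800.0, fy=800.0, cx=640.0, cy=360.0)
command = act(target)
```

The design point the sketch tries to capture: without the `perceive` step producing genuine 3D coordinates, the digital side has nothing spatial to plan against.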
  • 00:40:43
    So you've talked about virtual worlds, you've talked about more of an augmented reality, and now you've just talked about the purely physical world, basically, which would be used for robotics. For any company, that would be a very large charter, especially if you're going to get into each one of these different areas. So how do you think about the idea of deep tech versus any of these specific application areas?
  • 00:41:06
    We see ourselves as a deep tech company, as the platform company that provides models that can serve different use cases.
  • 00:41:15
    Of these three, is there any one that you think is more natural early on, that people can expect the company to lean into?
  • 00:41:22
    I think it suffices to say the devices are not totally ready. I actually got my first VR headset in grad school, and it was one of those transformative technology experiences: you put it on and think, oh my god, this is crazy. I think a lot of people have that experience the first time they use VR, so I've been excited about this space for a long time. And I love the Vision Pro; I stayed up late to order one of the first ones the day it came out. But I think the reality is it's just not there yet as a platform for mass-market appeal, so very likely as a company we will move into a market that's more ready.
  • 00:41:55
    I think there can sometimes be simplicity in generality. We have this notion of being a deep tech company: we believe there are some underlying fundamental problems that need to be solved really well, and that, if solved really well, can apply to a lot of different domains. We really view the long arc of the company as building and realizing the dreams of spatial intelligence writ large.
  • 00:42:17
    So this is a lot of technology to build, it seems to me.
  • 00:42:19
    Yeah, I think it's a really hard problem. I think people who are not directly in the AI space sometimes see AI as one undifferentiated mass of talent, and those of us who have been here longer realize that a lot of different kinds of talent need to come together to build anything in AI, and in particular this. We've talked a little bit about the data problem, and a little bit about some of the algorithms I worked on during my PhD, but there's a lot of other stuff needed to do this too. You need really high-quality, large-scale engineering; you need a really deep understanding of the 3D world; and there are actually a lot of connections with computer graphics, because they've been attacking a lot of the same problems from the opposite direction. So when we think about team construction, we think about how to find the absolute best experts in the world in each of the different subdomains that are necessary to build this really hard thing.
  • 00:43:13
    When I thought about how to form the best founding team for World Labs, it had to start with a group of phenomenal, multidisciplinary founders. Of course, Justin was natural for me (Justin, cover your ears) as one of my best students and one of the smartest technologists. But there were two other people I had known by reputation, one of whom Justin had even worked with, that I was drooling over. One is Ben Mildenhall; we talked about his seminal work on NeRF. The other is Christoph Lassner, who is well regarded in the computer graphics community, and who had the foresight to work on a precursor of the Gaussian splatting representation for 3D modeling five years before Gaussian splatting took off. When we talked about the possibility of working with Christoph Lassner, Justin just jumped off his chair.
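For context on the representation mentioned here, below is a heavily simplified sketch of what one 3D Gaussian "splat" stores and how depth-ordered splats alpha-composite into a pixel color. It illustrates the general technique only (real renderers also project each splat's 3D covariance into a 2D screen footprint); it is not Lassner's precursor work or any World Labs code, and the field names are assumptions.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Splat:
    """One primitive of a Gaussian-splat scene representation (simplified)."""
    mean: np.ndarray   # 3D center position
    opacity: float     # alpha in [0, 1] at the splat's center
    color: np.ndarray  # RGB in [0, 1]

def composite(splats_front_to_back):
    """Front-to-back alpha compositing of the splats covering one pixel.
    For simplicity, every splat here is assumed to fully cover the pixel."""
    color = np.zeros(3)
    transmittance = 1.0
    for s in splats_front_to_back:
        color += transmittance * s.opacity * s.color
        transmittance *= 1.0 - s.opacity
    return color

near = Splat(np.array([0.0, 0.0, 1.0]), 0.5, np.array([1.0, 0.0, 0.0]))
far  = Splat(np.array([0.0, 0.0, 2.0]), 1.0, np.array([0.0, 0.0, 1.0]))
pixel = composite([near, far])  # half-transparent red over opaque blue
```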
  • 00:44:25
    Ben and Christoph are legends. Maybe just quickly talk about how you thought about the build-out of the rest of the team, because again, there's a lot to build here and a lot to work on, not just in AI or graphics but in systems and so forth.
  • 00:44:39
    Yeah. What I'm personally most proud of so far is the formidable team. I've had the privilege of working with the smartest young people of my entire career, from the top universities, being a professor at Stanford, but the kind of talent we've put together at World Labs is just phenomenal; I've never seen this concentration. And I think the biggest differentiating element here is that we're all believers in spatial intelligence. All of the multidisciplinary talents, whether it's systems engineering, machine learning, infrastructure, generative modeling, data, or graphics: all of us, whether through our personal research journeys, our technology journeys, or even personal hobbies, believe that spatial intelligence has to happen at this moment, with this group of people. That's how we really formed our founding team, and that focus of energy and talent is really humbling to me. I just love it.
  • 00:45:55
    So I know you've been guided by a North Star. The thing about North Stars is that you can't actually reach them, because they're in the sky, but they're a great way to get guidance. So how will you know when you've accomplished what you set out to accomplish? Or is this a lifelong thing that's going to continue indefinitely?
  • 00:46:13
    First of all, there are real North Stars and virtual North Stars. Sometimes you can reach virtual North Stars.
    Fair enough. Good enough in the world model.
    Exactly. Like I said, I thought one of my North Stars, storytelling about images, would take a hundred years, and Justin and Andrej, in my opinion, solved it for me. So we could get to our North Star. For me, it's when so many people and so many businesses are using our models to unlock their needs for spatial intelligence. That's the moment I know we have reached a major milestone.
  • 00:46:53
    Actual deployment. Actual impact.
    Exactly. Yeah, I don't think we're ever going to get there. I think this is such a fundamental thing: the universe is a giant, evolving, four-dimensional structure, and spatial intelligence writ large is just understanding that in all of its depths and figuring out all the applications of it. So we have a particular set of ideas in mind today, but I think this journey is going to take us places that we can't even imagine right now.
  • 00:47:17
    The magic of good technology is that technology opens up more possibilities and unknowns. So we will be pushing, and then the possibilities will keep expanding.
    Brilliant. Thank you, Justin.
  • 00:47:30
    Thank you, Fei-Fei. This was fantastic.
    Thank you, Martin.
    Thank you, Martin.
  • 00:47:35
    Thank you so much for listening to the a16z podcast.
Tags
  • Visual Spatial Intelligence
  • Deep Learning
  • AI Evolution
  • Neural Networks
  • Computational Power
  • Data in AI
  • ImageNet
  • World Labs
  • 3D Representation
  • AI Applications