“The Future of AI is Here” — Fei-Fei Li Unveils the Next Frontier of AI

00:48:10
https://www.youtube.com/watch?v=vIXfYFB7aBI

Summary

TL;DR: The video discusses the fundamentals of visual-spatial intelligence and its place in the development of artificial intelligence, arguing that visual-spatial skills are as integral as language. The discussion takes a historical view of AI, highlighting critical advances in deep learning: the inception of neural networks, breakthroughs like AlexNet, and the impact of the Transformer model. It elaborates on data's crucial role, with projects like ImageNet demonstrating the benefits of comprehensive datasets, and recounts the journeys and contributions of AI experts Fei-Fei Li and Justin Johnson. It describes how the remarkable growth in computing power has transformed AI from theory into practice, enabling complex model training and faster learning. The conversation then explores the mission of World Labs, which aims to harness this deep understanding of data and computing to unlock spatial intelligence, and the applications this might have in fields ranging from interactive 3D world creation for education or gaming to augmented reality and robotics. Finally, the differences between spatial intelligence and language models are explained, emphasizing the former's essential focus on 3D space for more effective world interaction and representation, as opposed to the largely one-dimensional nature of language models.

Takeaways

  • 📈 Visual spatial intelligence is as fundamental as language.
  • 🔄 AI has transitioned from an AI winter to new peaks with deep learning.
  • 🧠 Key AI figures like Fei-Fei Li and Justin Johnson have influenced the field.
  • 💾 Computing power has substantially accelerated AI model training.
  • 📊 Large datasets, like ImageNet, have driven significant AI advancements.
  • 🤖 World Labs focuses on unlocking spatial intelligence.
  • 🌐 Spatial intelligence emphasizes 3D perception and action.
  • ⏩ Progress in AI enables new media forms, blending virtual and physical worlds.
  • 🕶️ Augmented reality and robotics can benefit from advanced spatial intelligence.
  • 🔍 Differences between language models and spatial intelligence are crucial.
  • 🎮 World generation for gaming is a potential application of spatial intelligence.
  • 📚 Educational tools can leverage spatial intelligence technologies.

Timeline

  • 00:00:00 - 00:05:00

    The discussion starts with highlighting the fundamental role of visual-spatial intelligence, mentioning the progress in understanding data and advancements in algorithms, and setting the stage to focus on unlocking new potentials in AI.

  • 00:05:00 - 00:10:00

    Over the past two years, there has been a significant increase in consumer AI companies, marking a wild and exciting time for AI development. The conversation touches on the historical context of AI from the AI winter to the emergence of deep learning, and its current transformative state involving various data forms such as text, pixels, video, and audio.

  • 00:10:00 - 00:15:00

    The speakers recount their individual journeys into AI. One was inspired by groundbreaking deep learning papers during undergraduate studies, highlighting the importance of combining powerful algorithms with large compute and data for breakthrough results — a notion established around 2011-2012.

  • 00:15:00 - 00:20:00

    As one speaker’s journey continued, they noticed a critical shift in the importance of data for AI, recognizing data as an overlooked element crucial for model generalization. This realization led to initiatives like ImageNet, which emphasized large-scale data acquisition, made possible by the advent of the internet, as a key unlock for machine learning models.

  • 00:20:00 - 00:25:00

    The dialogue transitions to the concept of 'big unlocks' in AI. While major algorithmic innovations like Transformers have fueled AI progress, the conversation sheds light on the underestimated impact of computational power. The example of AlexNet, trained with substantially less compute compared to modern standards, underscores this point.
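The compute comparison made here can be checked with simple arithmetic. A back-of-envelope sketch (the six-days-on-two-GTX-580s and roughly-five-minutes-on-one-GB200 figures are quoted in the conversation itself, not independently benchmarked):

```python
# Back-of-envelope estimate of the per-device compute factor the talk
# cites between a 2010-era GTX 580 and a modern GB200. The timings are
# the speakers' figures, not official benchmarks.
alexnet_minutes = 6 * 24 * 60              # original AlexNet run: ~6 days
alexnet_gpu_minutes = alexnet_minutes * 2  # spread across two GTX 580s
gb200_minutes = 5                          # quoted time on one GB200

speedup = alexnet_gpu_minutes / gb200_minutes
print(f"~{speedup:.0f}x per-device")       # ~3456x per-device
```

The ratio lands in the mid-thousands, consistent with the "it's in the thousands" answer given in the talk.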

  • 00:25:00 - 00:30:00

    The discussion pivots to differentiating the types of AI tasks, particularly generative versus predictive modeling. Historical attempts at generative tasks are noted, with more recent advancements in generative AI (using GANs) marking significant strides towards generating novel outputs like images from textual descriptions.
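One concrete example discussed later in the talk is that early GAN-based image generation could not yet accept free-form natural language, so the input was a structured scene graph of objects and relationships instead. A minimal sketch of what such an input looks like (a hypothetical encoding; the object names just mirror the sheep/grass/sky example from the conversation):

```python
# Hypothetical minimal scene-graph input for a graph-to-image model:
# nodes are objects, edges are pairwise relationships between them.
scene_graph = {
    "objects": ["sheep", "grass", "sky"],
    "relationships": [
        ("sheep", "standing on", "grass"),
        ("sky", "above", "grass"),
    ],
}

def describe(graph: dict) -> str:
    # Flatten the graph back to text, just to show what it encodes.
    return "; ".join(f"{s} {p} {o}" for s, p, o in graph["relationships"])

print(describe(scene_graph))  # sheep standing on grass; sky above grass
```

A generator consumes this structure rather than a sentence, which sidesteps natural-language understanding while still specifying the scene's layout.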

  • 00:30:00 - 00:35:00

    Further exploring generative AI advances, the speakers discuss projects involving style transfer and real-time generation, emphasizing the evolution of generative modeling from static image rendering to dynamic, real-time applications. This illustrates the broad transformation the field has undergone over the years.
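The style-transfer work referenced here (Gatys et al.'s neural style algorithm and the fast feed-forward variants that followed) rests on a simple statistic: the Gram matrix of a convolutional layer's feature maps, which captures "style" as channel-to-channel correlations. A minimal NumPy sketch of that core idea (illustrative only, not the speakers' actual implementation):

```python
import numpy as np

def gram_matrix(features: np.ndarray) -> np.ndarray:
    """Channel-correlation 'style' statistic of a (C, H, W) feature map."""
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    return flat @ flat.T / (c * h * w)   # normalize by map size

def style_loss(generated: np.ndarray, style: np.ndarray) -> float:
    """Mean squared difference between the two images' Gram matrices."""
    return float(np.mean((gram_matrix(generated) - gram_matrix(style)) ** 2))

# In the original (slow) algorithm this loss is minimized by gradient
# descent on the generated image's pixels, one optimization loop per
# image -- exactly the per-image cost the real-time variants removed.
feats = np.random.rand(8, 16, 16)
print(style_loss(feats, feats))  # 0.0 for identical features
```

The fast variants train a feed-forward network once per style so that generation becomes a single forward pass, which is the static-to-real-time shift this segment describes.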

  • 00:35:00 - 00:40:00

    Emphasizing spatial intelligence, the speakers outline a journey focused on visual intelligence, suggesting it is just as fundamental as language. Given present algorithmic advances and computational capabilities, they argue it is the right time to invest in developing technologies like those that power World Labs.

  • 00:40:00 - 00:48:10

    Finally, the conversation delves into the specifics of spatial intelligence, contrasting it with language-based AI approaches. Spatial intelligence emphasizes understanding and interacting in 3D environments, a fundamental aspect that language models, inherently one-dimensional, cannot fully grasp. This complements generative AI, transcending text and 2D to richer 3D representations.



Video Q&A

  • Why is visual spatial intelligence described as fundamental?

    Visual spatial intelligence is compared to language in its fundamental importance due to its ancient and essential role in understanding and interacting with the world.

  • How has AI evolved over the past decades, according to the video?

    AI has transitioned from theoretical models and the AI winter into practical, deep learning applications, involving significant advances such as language models and image recognition.

  • What major AI breakthroughs are discussed?

    The discussion highlights the significance of deep learning, especially neural networks like AlexNet, the importance of computing power, and algorithmic advances like Transformers.

  • Who are the key figures mentioned in the evolution of AI and deep learning?

    Notable figures include Fei-Fei Li and Justin Johnson, along with references to Andrew Ng, Hinton, and others involved in foundational deep learning research.

  • How is computational power significant to AI development?

    Increased computational power has enabled the practical application and fast training of complex AI models, transforming theoretical constructs into effective tools.

  • What role does data play in AI development, according to the speakers?

    Large datasets have been crucial for training AI models, enabling discoveries and the development of more accurate and generalizable models.

  • What is the significance of the ImageNet project?

    ImageNet played a crucial role in demonstrating the power of large datasets, helping propel computer vision and AI into practical applications.

  • What differentiates spatial intelligence from language models?

    Spatial intelligence focuses on 3D perception and action, essential for interacting with the physical world, contrasting with the 1D sequence processing in language models.

  • What is the mission of World Labs?

    World Labs aims to unlock spatial intelligence, leveraging advancements in algorithms, computing, and data to create technology that perceives and interacts with the 3D world.

  • How might spatial intelligence be applied?

    Potential applications include world generation for games and education, augmented reality, and enhancing robotics with better 3D understanding.

Subtitles
  • 00:00:00
    visual spatial intelligence is so
  • 00:00:03
    fundamental it's as fundamental as
  • 00:00:05
    language we've got this ingredients
  • 00:00:08
    compute deeper understanding of data and
  • 00:00:11
    we've got some advancement of algorithms
  • 00:00:14
    we are in the right moment to really
  • 00:00:17
    make a bet and to focus and just unlock
  • 00:00:26
    [Music]
  • 00:00:28
    that over the last two years we've seen
  • 00:00:31
    this kind of massive Rush of consumer AI
  • 00:00:33
    companies and technology and it's been
  • 00:00:35
    quite wild but you've been doing this
  • 00:00:38
    now for decades and so maybe walk
  • 00:00:40
    through a little bit about how we got
  • 00:00:41
    here kind of like your key contributions
  • 00:00:43
    and insights along the way so it is a
  • 00:00:46
    very exciting moment right just zooming
  • 00:00:48
    back AI is in a very exciting moment I
  • 00:00:51
    personally have been doing this for for
  • 00:00:53
    two decades plus and you know we have
  • 00:00:56
    come out of the last AI winter we have
  • 00:00:58
    seen the birth of modern AI then we have
  • 00:01:01
    seen deep learning taking off showing us
  • 00:01:04
    possibilities like playing chess but
  • 00:01:07
    then we're starting to see the the the
  • 00:01:10
    deepening of the technology and the
  • 00:01:12
    industry um adoption of uh of some of
  • 00:01:16
    the earlier possibilities like language
  • 00:01:19
    models and now I think we're in the
  • 00:01:21
    middle of a Cambrian explosion in almost
  • 00:01:24
    a literal sense because now in addition
  • 00:01:27
    to texts you're seeing pixels videos
  • 00:01:30
    audios all coming out with possible AI
  • 00:01:35
    applications and models so it's very
  • 00:01:37
    exciting moment I know you both so well
  • 00:01:40
    and many people know you both so well
  • 00:01:41
    because you're so prominent in the field
  • 00:01:42
    but not everybody like grew up in AI so
  • 00:01:44
    maybe it's kind of worth just going
  • 00:01:45
    through like your quick backgrounds just
  • 00:01:47
    to kind of level set the audience yeah
  • 00:01:48
    sure so I first got into AI uh at the
  • 00:01:51
    end of my undergrad uh I did math and
  • 00:01:53
    computer science for undergrad at
  • 00:01:54
    Caltech that was awesome but then
  • 00:01:55
    towards the end of that there was this
  • 00:01:57
    paper that came out that was at the time
  • 00:01:59
    a very famous paper the cat paper um
  • 00:02:01
    from Quoc Le and Andrew Ng and others that
  • 00:02:03
    were at Google Brain at the time and
  • 00:02:04
    that was like the first time that I came
  • 00:02:06
    across this concept of deep learning um
  • 00:02:08
    and to me it just felt like this amazing
  • 00:02:10
    technology and that was the first time
  • 00:02:12
    that I came across this recipe that
  • 00:02:13
    would come to define the next like more
  • 00:02:15
    than decade of my life which is that you
  • 00:02:17
    can get these amazingly powerful
  • 00:02:19
    learning algorithms that are very
  • 00:02:20
    generic couple them with very large
  • 00:02:22
    amounts of compute couple them with very
  • 00:02:23
    large amounts of data and magic things
  • 00:02:25
    started to happen when you combine those
  • 00:02:27
    ingredients so I I first came across
  • 00:02:29
    that idea like around 2011 2012-ish and
  • 00:02:31
    I just thought like oh my God this is
  • 00:02:33
    this is going to be what I want to do so
  • 00:02:35
    it was obvious you got to go to grad
  • 00:02:36
    school to do this stuff and then um sort
  • 00:02:38
    of saw that Fei-Fei was at Stanford one of
  • 00:02:40
    the few people in the world at the time
  • 00:02:41
    who was kind of on that on that train
  • 00:02:44
    and that was just an amazing time to be
  • 00:02:45
    in deep learning and computer vision
  • 00:02:47
    specifically because that was really the
  • 00:02:49
    era when this went from these first nascent
  • 00:02:52
    bits of technology that were just
  • 00:02:53
    starting to work and really got
  • 00:02:54
    developed and spread across a ton of
  • 00:02:56
    different applications so then over that
  • 00:02:58
    time we saw the beginning of language
  • 00:03:00
    modeling we saw the beginnings of
  • 00:03:02
    discriminative computer vision you could
  • 00:03:03
    take pictures and understand what's in
  • 00:03:05
    them in a lot of different ways we also
  • 00:03:06
    saw some of the early bits of what we
  • 00:03:08
    would now call generative modeling
  • 00:03:10
    generating images generating text a lot
  • 00:03:12
    of those core algorithmic pieces
  • 00:03:14
    actually got figured out by the academic
  • 00:03:16
    Community um during my PhD years like
  • 00:03:18
    there was a time I would just like wake
  • 00:03:19
    up every morning and check the new
  • 00:03:21
    papers on arXiv and just be ready it
  • 00:03:23
    was like unwrapping presents on
  • 00:03:24
    Christmas that like every day you know
  • 00:03:25
    there's going to be some amazing new
  • 00:03:27
    discovery some amazing new application
  • 00:03:28
    or algorithm somewhere in the world what
  • 00:03:30
    happened is in the last two years
  • 00:03:32
    everyone else in the world kind of came
  • 00:03:33
    to the same realization using AI to get
  • 00:03:35
    new Christmas presents every day but I
  • 00:03:37
    think for those of us that have been in
  • 00:03:38
    the field for a decade or more um we've
  • 00:03:40
    sort of had that experience for a very
  • 00:03:41
    long time obviously I'm much older than
  • 00:03:45
    Justin I I come to AI through a
  • 00:03:49
    different angle which is from physics
  • 00:03:51
    because my undergraduate uh background
  • 00:03:53
    was physics but physics is the kind of
  • 00:03:56
    discipline that teaches you to think
  • 00:03:59
    audacious questions and think about
  • 00:04:01
    what is the remaining mystery of the
  • 00:04:04
    world of course in physics it's the atomic
  • 00:04:06
    world you know the universe and all that but
  • 00:04:09
    somehow that kind of training and thinking
  • 00:04:13
    got me into the audacious question that
  • 00:04:16
    really captured my own imagination which
  • 00:04:18
    is intelligence so I did my PhD in Ai
  • 00:04:22
    and computational neuroscience at Caltech so
  • 00:04:26
    Justin and I actually didn't overlap but
  • 00:04:28
    we share um
  • 00:04:30
    the same alma mater at Caltech oh and
  • 00:04:33
    and the same adviser at Caltech yes same
  • 00:04:35
    adviser your undergraduate adviser in my
  • 00:04:37
    PhD advisor Pietro Perona and my PhD time
  • 00:04:41
    which is similar to your your your PhD
  • 00:04:44
    time was when AI was still in the winter
  • 00:04:47
    in the public eye but it was not in the
  • 00:04:50
    winter in my eye because it's that
  • 00:04:53
    pre-spring hibernation there's so much life
  • 00:04:56
    machine learning statistical modeling
  • 00:04:59
    was really gaining uh gaining power and
  • 00:05:03
    we I I think I was one of the Native
  • 00:05:07
    generation in machine learning and AI
  • 00:05:11
    whereas I look at Justin's generation as
  • 00:05:13
    the native deep learning generation so
  • 00:05:16
    so so machine learning was the precursor
  • 00:05:19
    of deep learning and we were
  • 00:05:21
    experimenting with all kinds of models
  • 00:05:24
    but one thing came out at the end of my
  • 00:05:26
    PhD and the beginning of my assistant
  • 00:05:29
    professor
  • 00:05:30
    there was an
  • 00:05:32
    overlooked element of AI that is
  • 00:05:36
    mathematically important to drive
  • 00:05:39
    generalization but the whole field was
  • 00:05:41
    not thinking that way and it was Data
  • 00:05:45
    because we were thinking about um you
  • 00:05:47
    know the intricacy of Bayesian models or
  • 00:05:50
    or whatever you know um uh kernel
  • 00:05:53
    methods and all that but what was
  • 00:05:56
    fundamental that my students and my lab
  • 00:05:59
    realized probably uh earlier than most
  • 00:06:01
    people is that if you if you let Data
  • 00:06:05
    Drive models you can unleash the kind of
  • 00:06:08
    power that we haven't seen before and
  • 00:06:11
    that was really the the the reason we
  • 00:06:14
    went on a pretty
  • 00:06:17
    crazy bet on ImageNet which is you know
  • 00:06:21
    what just forget about any scale we're
  • 00:06:24
    seeing now which is thousands of data
  • 00:06:26
    points at that point uh NLP community
  • 00:06:29
    has their own data sets I remember UC
  • 00:06:32
    Irvine data set or some data set in
  • 00:06:34
    NLP it was small in comparison the vision
  • 00:06:37
    Community has their data sets but all in
  • 00:06:40
    the order of thousands or tens of
  • 00:06:42
    thousands were like we need to drive it
  • 00:06:44
    to internet scale and luckily it was
  • 00:06:48
    also the the the coming of age of
  • 00:06:50
    Internet so we were riding that wave and
  • 00:06:54
    that's when I came to Stanford so these
  • 00:06:57
    epochs are what we often talk about like
  • 00:06:59
    ImageNet is clearly the epoch that created you
  • 00:07:01
    know or or at least like maybe made like
  • 00:07:04
    popular and viable computer vision and
  • 00:07:07
    the Gen wave we talk about two kind of
  • 00:07:09
    core unlocks one is like the
  • 00:07:10
    Transformers paper which is attention we
  • 00:07:12
    talk about stable diffusion is that a
  • 00:07:13
    fair way to think about this which is
  • 00:07:15
    like there's these two algorithmic
  • 00:07:16
    unlocks that came from Academia or
  • 00:07:18
    Google and like that's where everything
  • 00:07:19
    comes from or has it been more
  • 00:07:20
    deliberate or have there been other kind
  • 00:07:22
    of big unlocks that kind of brought us
  • 00:07:24
    here that we don't talk as much about
  • 00:07:25
    yeah I I think the big unlock is compute
  • 00:07:28
    like I know the story of AI is often the
  • 00:07:29
    story of compute but even no matter how
  • 00:07:31
    much people talk about it I I think
  • 00:07:32
    people underestimate it right and the
  • 00:07:34
    amount of the amount of growth that
  • 00:07:35
    we've seen in computational power over
  • 00:07:37
    the last decade is astounding the first
  • 00:07:39
    paper that's really credited with the
  • 00:07:40
    like Breakthrough moment in computer
  • 00:07:42
    vision for deep learning was AlexNet um
  • 00:07:45
    which was a 2012 paper that where a deep
  • 00:07:47
    neural network did really well on the
  • 00:07:48
    image net Challenge and just blew away
  • 00:07:50
    all the other algorithms that Fei-Fei had been
  • 00:07:53
    working on the types of algorithms that
  • 00:07:54
    they' been working on more in grad
  • 00:07:55
    school that AlexNet was a 60 million
  • 00:07:57
    parameter deep neural network um and it
  • 00:07:59
    was trained for six days on two GTX 580s
  • 00:08:03
    which was the top consumer card at the
  • 00:08:04
    time which came out in 2010 um so I was
  • 00:08:07
    looking at some numbers last night just
  • 00:08:09
    to you know put these in perspective the
  • 00:08:11
    newest the latest and greatest from
  • 00:08:12
    Nvidia is the GB200 um do either of you
  • 00:08:15
    want to guess how much raw compute
  • 00:08:17
    Factor we have between the GTX 580 and
  • 00:08:20
    the gb200 shoot no what go for it it's
  • 00:08:23
    uh it's in the thousands so I I ran the
  • 00:08:26
    numbers last night like that two We R
  • 00:08:28
    that two we training run that of Six
  • 00:08:30
    Days on two GTX 580s if you scale it it
  • 00:08:33
    comes out to just under five minutes on
  • 00:08:35
    a single GB on a single gb200 Justin is
  • 00:08:39
    making a really good point the 2012 Alex
  • 00:08:42
    net paper on image net challenge is
  • 00:08:45
    literally a very classic Model and that
  • 00:08:49
    is the convolution on your network model
  • 00:08:52
    and that was published in 1980s the
  • 00:08:54
    first paper I remember as a graduate
  • 00:08:56
    student learning that and it more or
  • 00:09:00
    less also has six seven layers the
  • 00:09:03
    practically the only difference between
  • 00:09:06
    AlexNet and the ConvNet what's the
  • 00:09:08
    difference is the gpus the two gpus and
  • 00:09:14
    the deluge of data yeah well so that's
  • 00:09:17
    what I was going to go which is like so
  • 00:09:18
    I think most people now are familiar
  • 00:09:20
    with like quote the bitter lesson and
  • 00:09:21
    the bitter lesson says is if you make an
  • 00:09:23
    algorithm don't be cute yeah just make
  • 00:09:25
    sure you can take advantage of available
  • 00:09:26
    compute because the available compute
  • 00:09:28
    will show up right and so like you just
  • 00:09:29
    like need to like why like on the other
  • 00:09:32
    hand there's another narrative um which
  • 00:09:35
    seems to me to be like just as credible
  • 00:09:36
    which is like it's actually new data
  • 00:09:37
    sources that unlock deep learning right
  • 00:09:39
    like ImageNet is a great example but like a
  • 00:09:40
    lot of people like self attention is
  • 00:09:42
    great from Transformers but they'll also
  • 00:09:44
    say this is a way you can exploit human
  • 00:09:45
    labeling of data because like it's the
  • 00:09:47
    humans that put the structure in the
  • 00:09:48
    sentences and if you look at clip
  • 00:09:50
    they'll say well like we're using the
  • 00:09:51
    internet to like actually like have
  • 00:09:53
    humans use the alt tag to label images
  • 00:09:56
    right and so like that's a story of data
  • 00:09:58
    that's not a story of compute and so is
  • 00:10:00
    it just is the answer just both or is
  • 00:10:02
    like one more than the other or I think
  • 00:10:03
    it's both but you're hitting another
  • 00:10:05
    really good point so I think there's
  • 00:10:06
    actually two eras that to me feel quite
  • 00:10:08
    distinct in the algorithmics here so
  • 00:10:10
    like the ImageNet era is actually the
  • 00:10:12
    era of supervised learning um so in the
  • 00:10:14
    era of supervised learning you have a
  • 00:10:15
    lot of data but you don't know how to
  • 00:10:17
    use data on its own like the expectation
  • 00:10:20
    of ImageNet and other data sets of that time
  • 00:10:22
    period was that we're going to get a lot
  • 00:10:23
    of images but we need people to label
  • 00:10:25
    every one and all of the training data
  • 00:10:27
    that we're going to train on like a
  • 00:10:29
    person a human labeler has looked at
  • 00:10:30
    every one and said something about that
  • 00:10:32
    image yeah um and the big algorithmic
  • 00:10:34
    unlock is that we know how to train on things
  • 00:10:36
    that don't require human labeled data as
  • 00:10:38
    as the naive person in the room that
  • 00:10:39
    doesn't have an AI background it seems
  • 00:10:41
    to me if you're training on human data
  • 00:10:43
    like the humans have labeled it it's
  • 00:10:45
    just not explicit I knew you were gonna
  • 00:10:47
    say that Mar I knew that yes
  • 00:10:49
    philosophically that's a really
  • 00:10:51
    important question but that actually is
  • 00:10:53
    more true of language than pixels fair
  • 00:10:56
    enough yeah 100 yeah yeah yeah yeah yeah
  • 00:10:58
    but I do think it's an important
  • 00:11:05
    point it learned it all just more implicit
  • 00:11:08
    than explicit yeah it's still it's still
  • 00:11:09
    human labeled the distinction is that
  • 00:11:11
    for for this supervised learning era um
  • 00:11:13
    our learning tasks were much more
  • 00:11:14
    constrained so like you would have to
  • 00:11:16
    come up with this ontology of Concepts
  • 00:11:18
    that we want to discover right if you're
  • 00:11:19
    doing ImageNet like Fei-Fei and her
  • 00:11:22
    students at the time spent a lot of time
  • 00:11:24
    thinking about you know which thousand
  • 00:11:26
    categories should be in the ImageNet
  • 00:11:27
    challenge other data sets of that time
  • 00:11:29
    like the COCO data set for object
  • 00:11:30
    detection like they thought really hard
  • 00:11:32
    about which 80 categories we put in
  • 00:11:34
    there so let's let's walk to GenAI um so
  • 00:11:36
    so when I was doing my my PhD before
  • 00:11:38
    that um you came so I took machine
  • 00:11:41
    learning from Andrew Ng and then I took
  • 00:11:43
    like Bayesian something very complicated
  • 00:11:44
    from Daphne Koller and it was very
  • 00:11:45
    complicated for me a lot of that was
  • 00:11:47
    just predictive modeling yeah um and then
  • 00:11:49
    like I remember the whole kind of vision
  • 00:11:51
    stuff that you unlock but then the
  • 00:11:52
    generative stuff is shown up like I
  • 00:11:53
    would say in the last four years which
  • 00:11:55
    is to me very different like you're not
  • 00:11:57
    identifying objects you're not you know
  • 00:11:59
    predicting something you're generating
  • 00:12:00
    something and so maybe kind of walk
  • 00:12:02
    through like the key unlocks that got us
  • 00:12:04
    there and then why it's different and if
  • 00:12:06
    we should think about it differently and
  • 00:12:07
    is it part of a Continuum is it not it
  • 00:12:10
    is so interesting even during my
  • 00:12:13
    graduate time generative model was there
  • 00:12:17
    we wanted to do generation we nobody
  • 00:12:20
    remembers even with the uh letters and
  • 00:12:24
    uh numbers we were trying to do some you
  • 00:12:26
    know Jeff Hinton had generative
  • 00:12:29
    papers we were thinking about how to
  • 00:12:31
    generate and in fact if
  • 00:12:34
    you think from a probability
  • 00:12:36
    distribution point of view you can
  • 00:12:37
    mathematically generate it's just
  • 00:12:39
    nothing we generate would ever impress
  • 00:12:42
    anybody right so this concept of
  • 00:12:45
    generation mathematically theoretically
  • 00:12:47
    is there but nothing worked so then I do
  • 00:12:52
    want to call out Justin's PhD and Justin
  • 00:12:55
    was saying that he got enamored by Deep
  • 00:12:57
    learning so he came to my lab Justin PhD
  • 00:12:59
    his entire PhD is a story almost a mini
  • 00:13:03
    story of the trajectory of the
  • 00:13:07
    field he started his first project in
  • 00:13:09
    data I forced him to he didn't like
  • 00:13:13
    it so in retrospect I learned a lot of
  • 00:13:16
    really useful things I'm glad you say
  • 00:13:18
    that now so we moved Justin to um to
  • 00:13:22
    deep learning and the core problem there
  • 00:13:25
    was taking images and generating words
  • 00:13:29
    well actually it was even about there
  • 00:13:31
    were I think there were three discret
  • 00:13:32
    phases here on this trajectory so the
  • 00:13:34
    first one was actually matching images
  • 00:13:36
    and words right right right like we have
  • 00:13:38
    we have an image we have words and can
  • 00:13:40
    we say how much they align so actually
  • 00:13:41
    my first paper both of my PhD and like
  • 00:13:44
    ever my first academic publication ever
  • 00:13:47
    was the image retrieval with scene
  • 00:13:48
    graphs and then we went into generation
  • 00:13:51
    uh taking pixels generating words and
  • 00:13:53
    Justin and Andrej uh really worked on
  • 00:13:56
    that but that was still a very very
  • 00:14:00
    lossy way of of of generating and
  • 00:14:03
    getting information out of the pixel
  • 00:14:05
    world and then in the middle Justin went
  • 00:14:07
    off and did a very famous piece of work
  • 00:14:10
    and it was the first time that uh
  • 00:14:13
    someone made it real time right yeah
  • 00:14:16
    yeah so so the story there is there was
  • 00:14:17
    this paper that came out in 2015 a
  • 00:14:19
    neural algorithm of artistic style led
  • 00:14:21
    by Leon Gatys and it was like the paper
  • 00:14:24
    came out and they showed like these
  • 00:14:25
    these real world photographs that they
  • 00:14:26
    had converted into van Gogh style and like
  • 00:14:29
    we are kind of used to seeing things
  • 00:14:30
    like this in 2024 but this was in 2015
  • 00:14:33
    so this paper just popped up on arXiv
  • 00:14:35
    one day and it like blew my mind like I
  • 00:14:37
    just got this like gen brainworm like in
  • 00:14:39
    my brain in like 2015 and it like did
  • 00:14:42
    something to me and I thought like oh my
  • 00:14:44
    God I need to understand this algorithm
  • 00:14:45
    I need to play with it I need to make my
  • 00:14:47
    own images into van Gogh so then I like
  • 00:14:49
    read the paper and over a long weekend I
  • 00:14:51
    reimplemented the thing and got it to
  • 00:14:52
    work it was a very actually very simple
  • 00:14:55
    algorithm um so like my implementation
  • 00:14:57
    was like 300 lines of Lua because at the
  • 00:14:59
    time it was pre it was Lua there was
  • 00:15:01
    there was um this was pre-PyTorch so
  • 00:15:03
    we were using Lua Torch um but it was
  • 00:15:05
    like very simple algorithm but it was
  • 00:15:06
    slow right so it was an
  • 00:15:08
    optimization based thing every image you
  • 00:15:10
    want to generate you need to run this
  • 00:15:11
    optimization loop run this gradient descent
  • 00:15:12
    loop for every image that you generate
  • 00:15:14
    the images were beautiful but I just
  • 00:15:16
    like wanted to be faster and and Justin
  • 00:15:19
    just did it and it was actually I think
  • 00:15:21
    your first taste
  • 00:15:23
    of an academic work having an industry
  • 00:15:27
    impact a bunch of people seen this this
  • 00:15:30
    artistic style transfer stuff at the
  • 00:15:31
    time and me and a couple others at the
  • 00:15:33
    same time came up with different ways to
  • 00:15:34
    speed this up yeah um but mine was the
  • 00:15:37
    one that got a lot of traction right so
  • 00:15:38
    I was very proud of Justin but there's
  • 00:15:40
    one more thing I was very proud of
  • 00:15:41
    Justin to connect to GenAI is that before
  • 00:15:45
    the world understood GenAI Justin's last
  • 00:15:48
    piece of uh uh work in PhD which I I
  • 00:15:52
    knew about it because I was forcing you
  • 00:15:53
    to do it that one was fun that was was
  • 00:15:57
    actually uh input
  • 00:16:00
    language and getting a whole picture out
  • 00:16:03
    it's one of the first gen uh work it's
  • 00:16:07
    using GANs which was so hard to use but
  • 00:16:10
    the problem is that we are not ready to
  • 00:16:12
    use a natural piece of language so
  • 00:16:14
    justtin you heard he worked on sing
  • 00:16:16
    graph so we have to input a sing graph
  • 00:16:20
    language structure so you know the Sheep
  • 00:16:23
    the the the grass the sky in a graph way
  • 00:16:26
    it literally was one of our photos right
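The scene-graph input Fei-Fei describes is just objects plus labeled pairwise relationships. A minimal sketch of that structure, using the sheep/grass/sky example from the conversation; the exact schema here is an illustrative assumption, not the paper's actual format:

```python
# Minimal scene-graph sketch: objects as nodes, relationships as labeled
# edges. The labels mirror the sheep/grass/sky example from the talk;
# the dict layout is illustrative, not the original work's data format.
scene_graph = {
    "objects": ["sheep", "grass", "sky"],
    "relationships": [
        ("sheep", "standing on", "grass"),   # (subject, predicate, object)
        ("grass", "below", "sky"),
    ],
}

def describe(graph):
    """Flatten the graph into readable subject-predicate-object triples."""
    return [f"{s} {p} {o}" for s, p, o in graph["relationships"]]

print(describe(scene_graph))
```

A generator conditioned on a structure like this gets an unambiguous layout to render, which is exactly what made it tractable before models could handle free-form natural language.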
Then he and another very good Master's student, Agrim, got that GAN to work. So you can see the progression: from data, to matching, to style transfer, to generative images. You asked whether this was an abrupt change: for people like us it's a continuum that was already happening, but for the world the results feel more
abrupt.

So, I read your book, and for those that are listening, it's a phenomenal book; I really recommend you read it. It seems like for a long time, and I'm talking to you now, Fei-Fei, a lot of your research and your direction has been towards spatial stuff, pixel stuff, and intelligence, and now you're doing World Labs, which is around spatial intelligence. So maybe talk us through it: has this been part of a long journey for you? Why did you decide to do it now? Is it a technical unlock, is it a personal unlock? Just move us from that milieu of
AI research to World Labs.

Sure. For me it is both personal and intellectual. My entire intellectual journey, you talked about my book, is really this passion to seek North Stars, while also believing that those North Stars are critically important for the advancement of our field. At the beginning, I remember that after graduate school I thought my North Star was telling the stories of images, because for me that's such an important piece of visual intelligence, part of what you call AI or AGI. But when Justin and Andrej did that, I was like, oh my God, that was my life's dream; what do I do next? It came a lot faster than I expected; I thought it would take a hundred years.

But visual intelligence is my passion, because I do believe that for every intelligent being, whether people or robots or some other form, knowing how to see the world, reason about it, and interact in it, whether you're navigating or manipulating or making things (you can even build civilization upon it), is essential. Visual spatial intelligence is so fundamental. It's as fundamental as language, possibly more ancient and more fundamental in certain ways. So it's very natural for me that World Labs' North Star is to unlock
spatial intelligence. And the moment is right to do it. Like Justin was saying, we've got the ingredients: we've got compute; we've got a much deeper understanding of data than in the ImageNet days (compared to those days, we're so much more sophisticated); and we've got real advances in algorithms, including from our co-founders at World Labs, Ben Mildenhall and Christoph Lassner, who were at the cutting edge of NeRF. We are at the right moment to really make a bet, to focus, and to just unlock that.

So I
just want to clarify, for folks that are listening to this: you're starting this company, World Labs, and spatial intelligence is how you're generally describing the problem you're solving. Can you maybe try to
crisply describe what that means?

Yeah. Spatial intelligence is about machines' ability to perceive, reason, and act in 3D space and time: to understand how objects and events are positioned in 3D space and time, and how interactions in the world can affect those 3D (4D, over space-time) positions; and to perceive, reason about, generate, and interact with all of that. It's really about taking the machine out of the mainframe, out of the data center, putting it out into the world, and understanding the 3D
and 4D world with all of its richness.

So, to be very clear, are we talking about the physical world, or are we just talking about an abstract notion of a world?

I think it can be both, and that encompasses our long-term vision. Even if you're generating worlds, even if you're generating content, doing that positioned in 3D has a lot of benefits; or, if you're recognizing the real world, being able to put 3D understanding
into the real world as well is part of it.

Great. Just for everybody listening: the two other co-founders, Ben Mildenhall and Christoph Lassner, are absolute legends in the field, at the same level. These four decided to come out and do this company now, and so I'm trying to dig into why
now is the right time.

Yeah. This is again part of a longer evolution for me. After my PhD, when I really wanted to develop into my own independent researcher for my later career, I was thinking about what the big problems in AI and computer vision are. The conclusion I came to around that time was that the previous decade had mostly been about understanding data that already exists, all the images and videos that were already on the web, but the next decade was going to be about understanding new data. People have smartphones, those smartphones have cameras, those cameras have new sensors, and those cameras are positioned in the 3D world. It's no longer that you get a bag of pixels from the internet, know nothing about it, and try to say whether it's a cat or a dog. We want to treat images as universal sensors to the physical world, and ask how we can use them to understand the 3D and 4D structure of the world, either in physical spaces or in generative spaces. So I made a pretty big
pivot post-PhD into 3D computer vision, predicting 3D shapes of objects with some of my colleagues at FAIR at the time. Then later I got really enamored with the idea of learning 3D structure through 2D. We talk about data a lot, and 3D data is hard to get on its own, but there's a very strong mathematical connection here: our 2D images are projections of a 3D world, and there's a lot of mathematical structure we can take advantage of. So even if you only have large amounts of 2D data, a lot of people have done amazing work figuring out how to back out the 3D structure of the world from large quantities of 2D observations.
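The "images are projections of a 3D world" structure Justin leans on is the pinhole camera model: a 3D point maps to a 2D pixel, and the depth is lost in the perspective divide, which is why backing 3D out again is a real inference problem. A minimal sketch; the intrinsics values are illustrative assumptions:

```python
import numpy as np

# Pinhole projection: a 3D point in camera coordinates maps to a 2D pixel.
# The intrinsics (focal lengths fx, fy and principal point cx, cy) are
# illustrative numbers, not any particular camera's calibration.
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])

def project(point_3d):
    """Project a 3D point (camera frame, z > 0) to pixel coordinates."""
    p = K @ point_3d
    return p[:2] / p[2]          # perspective divide: depth disappears here

pixel = project(np.array([0.2, -0.1, 2.0]))
print(pixel)                     # every point along this ray lands on the same pixel
```

Because infinitely many 3D points project to the same pixel, recovering 3D structure requires extra constraints: multiple views, motion, or learned priors.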
Then in 2020 (you asked about breakthrough moments) there was a really big one from our co-founder Ben Mildenhall, with his paper NeRF, Neural Radiance Fields. That was a very simple, very clear way of backing out 3D structure from 2D observations, and it lit a fire under this whole space of 3D computer vision.
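NeRF's core recipe, as Justin summarizes it, is a field that maps a 3D point to density and color, rendered along camera rays by volume rendering; the field is trained so that renders match ordinary 2D photos. A minimal sketch of the rendering half; the hand-made "fog sphere" field below is an illustrative stand-in for the trained MLP:

```python
import numpy as np

# Core of NeRF, sketched: a field maps a 3D point to (density, color), and a
# pixel's color is volume-rendered along its camera ray. In the real method
# the field is an MLP fit to 2D photos; a hand-made "fog sphere" stands in.
def field(p):
    density = 5.0 if np.linalg.norm(p) < 0.5 else 0.0   # dense inside a sphere
    color = np.array([1.0, 0.3, 0.1])                    # constant orange
    return density, color

def render_ray(origin, direction, n_samples=64, t_far=3.0):
    ts = np.linspace(0.0, t_far, n_samples)
    dt = ts[1] - ts[0]
    transmittance, out = 1.0, np.zeros(3)
    for t in ts:                                         # march along the ray
        density, color = field(origin + t * direction)
        alpha = 1.0 - np.exp(-density * dt)              # local opacity
        out += transmittance * alpha * color
        transmittance *= 1.0 - alpha                     # light still unblocked
    return out

# A ray through the sphere picks up its color; a ray that misses stays black.
hit = render_ray(np.array([0.0, 0.0, -2.0]), np.array([0.0, 0.0, 1.0]))
miss = render_ray(np.array([2.0, 0.0, -2.0]), np.array([0.0, 0.0, 1.0]))
print(hit, miss)
```

Since this render is differentiable, replacing `field` with a small network and descending on the photo-reconstruction error recovers the scene, which is why, as discussed below, it trained in hours on one GPU.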
I think there's another aspect here that maybe people outside the field don't quite understand: that was also the time when large language models were starting to take off. A lot of the work on language modeling had actually been developed in academia; even during my PhD I did some early work with Andrej Karpathy on language modeling, back in 2014. I still remember LSTMs, RNNs, GRUs; this was pre-Transformer. But at some point, around the GPT-2 era, you couldn't really build those kinds of models in academia anymore, because they took way more resources. There was one really interesting thing, though, about the NeRF approach that Ben came up with: you could train these models in an hour or a couple of hours on a single GPU. So there was a dynamic at that time where a lot of academic researchers ended up focusing on these problems, because there was core algorithmic work to figure out, and because you could actually do a lot without a ton of compute and get state-of-the-art results on a single GPU. Because of those dynamics, a lot of researchers in academia were moving to think about the core algorithmic ways we could advance this area. Then I ended up
chatting with Fei-Fei more, and I realized that we were actually...

She's very convincing.

She is very convincing. Well, there's that, but, you know, we talk about trying to figure out your own independent research trajectory apart from your adviser; well, it turns out we ended up, oh no, kind of converging on similar things.

Okay, well, from my end, I wanted to talk to the smartest person I know, so I called Justin. There's no question about it.
I do want to talk about a very interesting technical issue, a technical story of pixels, that most people who work in language don't realize. In the pre-generative era of computer vision, those of us who work on pixels actually had a long history in an area of research called reconstruction, 3D reconstruction, which dates back to the 1970s. You take photos; because humans have two eyes, it generally starts with stereo photos; and then you try to triangulate the geometry and make a 3D shape out of it. It is a really, really hard problem, and to this day it's not fundamentally solved, because of the correspondence problem and all of that.
So this whole field, which is an older way of thinking about 3D, had been going on and making really good progress. But when NeRF happened, in the context of generative methods and diffusion models, reconstruction and generation suddenly started to really merge. Now, within a really short period of time, it's hard in computer vision to talk about reconstruction versus generation anymore. We suddenly have a moment where, whether we see something or we imagine something, both can converge towards generating it.

Right. That's, to me, a really important moment for computer vision, but most people missed it, because we're not talking about it as much as LLMs.
Right, so in pixel space there's reconstruction, where you reconstruct a scene that's real, and if you haven't seen the scene, you use generative techniques; these things end up being very similar. Throughout this entire conversation you've been talking about language and you've been talking about pixels, so maybe it's a good time to discuss how spatial intelligence, and what you're working on, contrasts with language approaches, which of course are very popular now. Is it complementary? Is it orthogonal?

Yeah, I
think they're complementary.

I don't mean to be too leading here; maybe just contrast them. Everybody says: listen, I know OpenAI, and I know GPT, and I know multimodal models, and a lot of what you're talking about is there: they've got pixels and they've got language. Doesn't that kind of do what we want to do with spatial reasoning?

Yeah. So I think to do
that, you need to open up the black box a little bit and look at how these systems work under the hood. With the language models and the multimodal language models we're seeing nowadays, the underlying representation under the hood is one-dimensional. We talk about context lengths, we talk about Transformers, sequences, attention; fundamentally, their representation of the world is one-dimensional. These things operate on a one-dimensional sequence of tokens. That's a very natural representation when you're talking about language, because written text is a one-dimensional sequence of discrete letters. That underlying representation is the thing that led to LLMs, and with the multimodal LLMs we're seeing now, you end up shoehorning the other modalities into this underlying representation of a 1D sequence of tokens.
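The "shoehorning" Justin describes is easy to see concretely: a common scheme (ViT-style patchification) chops an image into patches and flattens them into the same 1D token sequence the Transformer already consumes. A toy sketch; the sizes are illustrative assumptions:

```python
import numpy as np

# A 2D image becomes a 1D token sequence, ViT-style: cut into patches,
# flatten each patch, read them off in raster order. The spatial layout
# survives only implicitly, via token order (position embeddings omitted).
image = np.arange(8 * 8).reshape(8, 8)        # toy 8x8 "image"
patch = 4

tokens = [
    image[i:i + patch, j:j + patch].ravel()   # one flattened patch = one token
    for i in range(0, 8, patch)
    for j in range(0, 8, patch)
]
sequence = np.stack(tokens)                   # shape (4, 16): a 1D token list
print(sequence.shape)                         # the Transformer sees only this
```

The model then has to re-infer all the 2D (let alone 3D) structure from that flat ordering, which is the contrast with putting 3D front and center that the conversation turns to next.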
Now, when we move to spatial intelligence, we're going the other way: we're saying that the three-dimensional nature of the world should be front and center in the representation. From an algorithmic perspective, that opens the door to processing data in different ways, getting different kinds of outputs, and tackling slightly different problems. At a coarse level you might look from the outside and say, oh, multimodal LLMs can look at images too. They can, but I think they don't have that fundamental 3D representation at the heart of their approaches.

I totally agree with Justin. I
think talking about 1D versus fundamentally 3D representation is one of the most core differentiators. The other thing, which is slightly philosophical but really important, at least to me, is that language is fundamentally a purely generated signal. There's no language out there in nature; you don't go out and find words written in the sky for you. Whatever data you feed in, you can pretty much, with enough generalizability, regurgitate the same kind of data out; that's language to language. But the 3D world is not like that. There is a 3D world out there that follows the laws of physics, that has its own structures due to materials and many other things. To fundamentally back that information out, and to be able to represent it and generate it, is just quite a different problem. We will be borrowing similar ideas, useful ideas, from language and LLMs, but this is, fundamentally and philosophically to me, a different
problem.

Right, so language is 1D, and probably a bad representation of the physical world, because it's been generated by humans and it's probably lossy. There's a whole other modality of generative AI models, which is pixels: 2D images and 2D video. One could say that when you look at a video you can see 3D stuff, because you can pan a camera or whatever. So how would spatial intelligence be different from, say, 2D video?

When I
think about this, it's useful to disentangle two things: one is the underlying representation, and two is the user-facing affordances you get. This is where things can get confusing, because fundamentally we see in 2D: our retinas are 2D structures in our bodies, and we've got two of them, so fundamentally our visual system perceives 2D images. But depending on what representation you use, different affordances can be more or less natural. Even if, at the end of the day, what you're seeing is a 2D image or a 2D video, your brain is perceiving it as a projection of a 3D world. There are things you might want to do, like move objects around or move the camera around; in principle you might be able to do these with a purely 2D representation and model, but it's just not a good fit to the problems you're asking the model to solve. Modeling the 2D projections of a dynamic 3D world is a function that can probably be learned, but by putting a 3D representation into the heart of the model, there's going to be a better fit between the kind of representation the model works with and the kind of tasks you want the model to do. So our bet is that by threading a bit more 3D representation under the hood, we'll enable better affordances for users.
This also goes back to the North Star for me: why is it spatial intelligence, and why is it not flat pixel intelligence? Because I think the arc of intelligence has to lead to what Justin calls affordances. If you look at evolution, the arc of intelligence eventually enables animals and humans, especially humans as intelligent animals, to move around the world, interact with it, create civilization, create life, make a sandwich, whatever you do in this 3D world. Translating that into a piece of technology, that native 3D-ness is fundamentally important for the floodgate of possible applications, even if for some of them the serving looks 2D; underneath, it's innately 3D.

To me, I think this is actually a very subtle
and incredibly critical point, so I think it's worth digging into, and a good way to do that is to talk about use cases. Just to level-set for this conversation: we're talking about building a technology, let's call it a model, that can do spatial intelligence. In the abstract, what might that look like? A little more concretely, what would be the potential use cases
you could apply it to?

So I think there are a couple of different kinds of things we imagine these spatially intelligent models being able to do over time. One that I'm really excited about is world generation. We're all used to text-to-image generators, and we're starting to see text-to-video generators, where you put in a prompt and out pops an amazing image or an amazing two-second clip. But I think you can imagine leveling this up and getting 3D worlds out. One thing we could imagine spatial intelligence helping with in the future is upleveling these experiences into 3D, where you're not just getting an image or a clip out, but a fully simulated, vibrant, interactive 3D world.

For gaming?

Maybe for gaming, right,
maybe for virtual photography, you name it. I think if you got this to work, there would be a million applications.

For education?

Yeah, for education. I mean, one of my points is that, in some sense, this enables a new form of media,
right? Because we already have the ability to create virtual interactive worlds, but it costs hundreds of millions of dollars and a ton of development time. As a result, the place where people drive this technological ability is video games. We do have the ability as a society to create amazingly detailed virtual interactive worlds that give you amazing experiences, but because it takes so much labor to do so, the only economically viable use of that technology in its current form is games that can be sold for $70 apiece to millions and millions of people to recoup the investment. If we had the ability to create these same virtual, interactive, vibrant 3D worlds cheaply, you would see a lot of other applications, because if you bring down the cost of producing that kind of content, people will use it for other things. What if you could have a personalized 3D experience that's as good, as rich, and as detailed as one of those AAA video games that cost hundreds of millions of dollars to produce, but catered to a niche that maybe only a couple of people would want? That's not a particular product or a particular roadmap, but I think that's a vision of a new kind of media that would be enabled by spatial intelligence in the
generative realm.

When I think about a world, I actually think about things that are not just scene generation; I think about stuff like movement and physics. In the limit, is that included? And the second question: if I'm interacting with it, are there semantics? By that I mean: if I open a book, are there pages, are there words in them, and do they mean something? Are we talking about a full-depth experience, or are we talking about a static scene?

I think you'll see a progression of this technology over time; this is really hard stuff to build. The static problem is a little bit easier, but in the limit we want this to be fully dynamic, fully interactable, all the things you just said.

I mean, that's the definition of spatial intelligence.

Yeah. So there is going to be a progression; we'll start with more static, but everything you've said is on the roadmap of
spatial intelligence. This is kind of in the name of the company itself, World Labs: the "world" is about building and understanding worlds. This is actually a little bit of inside baseball; I realized after we told people the name that they don't always get it, because in computer vision, and in reconstruction and generation, we often make a delineation between the kinds of things you can do. The first level is objects: a microphone, a cup, a chair; these are discrete things in the world, and a lot of the ImageNet-style stuff that Fei-Fei worked on was about recognizing objects in the world. Then, leveling up from objects, the next level I think of as scenes: scenes are compositions of objects. Now we've got this recording studio, with a table and microphones and people in chairs, as some composition of objects. But we envision worlds as a step beyond scenes. Scenes might be individual things, but we want to break the boundaries and go outside the door: step up from the table, walk out the door, walk down the street, see the cars buzzing past and the leaves on the trees moving, and be able to interact with those things.
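The objects, scenes, worlds progression Justin lays out can be made concrete as nested structure. A toy sketch; the classes, fields, and labels are illustrative assumptions, not World Labs' actual representation:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Toy version of the hierarchy from the conversation: objects are discrete
# things, a scene is a composition of objects, and a world links scenes
# together so you can "walk out the door". All fields are illustrative.

@dataclass
class Object3D:
    name: str
    position: Tuple[float, float, float]      # (x, y, z) within the scene

@dataclass
class Scene:
    name: str
    objects: List[Object3D] = field(default_factory=list)

@dataclass
class World:
    scenes: List[Scene] = field(default_factory=list)
    connections: List[Tuple[str, str]] = field(default_factory=list)  # "doors"

studio = Scene("recording studio", [
    Object3D("table", (0.0, 0.0, 0.0)),
    Object3D("microphone", (0.0, 1.0, 0.0)),
])
street = Scene("street", [Object3D("car", (5.0, 0.0, 2.0))])
world = World([studio, street], [("recording studio", "street")])

print(len(world.scenes), world.connections[0])
```

The jump from scene to world is visible in the `connections` list: it is what lets you leave one composition of objects and enter another, rather than staying inside a single bounded scene.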
  • 00:36:59
    those things another thing that's really
  • 00:37:00
    exciting is just to mention the word New
  • 00:37:02
    Media with this technology the boundary
  • 00:37:06
    between real world and virtual imagin
  • 00:37:09
    world or augmented world or predicted
  • 00:37:12
    world is all blurry you really it there
  • 00:37:17
    the real world is 3D right so in the
  • 00:37:20
    digital world you have to have a
  • 00:37:23
    3D representation to even blend with the
  • 00:37:26
    real world you know you cannot have a 2d
  • 00:37:29
    you cannot have a 1D to be able to
  • 00:37:31
    interface with the real 3D World in an
  • 00:37:34
    effective way and with this it unlocks
  • 00:37:36
    it so it it the use cases can can be
  • 00:37:40
    quite Limitless because of this right so
the first use case that Justin was talking about would be the generation of a virtual world for any number of purposes, and the one you're alluding to now would be more of an augmented reality.

Yes. Just around the time World Labs was being formed, the Vision Pro was released by Apple, and they used the term "spatial computing". We were almost like, they almost stole our name, but we're spatial intelligence, so
  • 00:38:09
    spatial Computing needs spatial
  • 00:38:11
    intelligence that's exactly right so we
  • 00:38:14
    don't know what Hardware form it will
  • 00:38:17
    take it will be goggles glasses contact
  • 00:38:19
    lenses contact lenses but that interface
  • 00:38:23
    between the true real world and what you
  • 00:38:26
    can do on top of it whether it's to help
  • 00:38:29
    you to augment your capability to work
  • 00:38:32
    on a piece of machine and fix your car
  • 00:38:34
    even if you are not a trained mechanic
  • 00:38:37
    or to just be in a Pokemon go Plus+ for
  • 00:38:42
    entertainment suddenly this piece of
  • 00:38:44
    technology is is going to be the the the
  • 00:38:49
    operating system basically uh for for
  • 00:38:52
    arvr uh Mixr in the limit like what does
  • 00:38:55
    In the limit, what does an AR device need to do? It's this thing that's always on, it's with you, it's looking out into the world, so it needs to understand the stuff you're seeing and maybe help you out with tasks in your daily life.
  • 00:39:06
    But I'm also really excited about this blend between virtual and physical. It becomes really critical: if you have the ability to understand what's around you in real time, in perfect 3D, then it actually starts to deprecate large parts of the real world as well. Right now, how many differently sized screens do we all own for different use cases? Too many. You've got your phone, your iPad, your computer monitor, your TV, your watch. These are all basically differently sized screens, because they need to present information to you in different contexts and in different positions. But if you've got the ability to seamlessly blend virtual content with the physical world, it deprecates the need for all of those: it just, ideally, seamlessly blends the information you need in the moment with the right mechanism for giving you that information.
  • 00:39:49
    Another huge case for being able to blend the digital, virtual world with the 3D physical world is for agents to be able to do things in the physical world. Humans can use these mixed-reality devices to do things: like I said, I don't know how to fix a car, but if I have to, I put on these goggles or glasses and suddenly I'm guided to do it. But there are other types of agents, namely robots, any kind of robot, not just humanoids. Their interface, by definition, is the 3D world, but their compute, their brain, by definition, is the digital world. So what connects the learning to the behaving, between a robot's brain and the real world? It has to be spatial intelligence.
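The robot case described above, where the 3D world is the interface and the digital brain acts on a spatial estimate of it, can be sketched as a toy perceive-then-act loop. This is purely illustrative (the functions and constants are assumptions, not any real robot stack): perception back-projects a 2D pixel plus a depth reading into a 3D point, and the "brain" turns that 3D point into a motion command.

```python
import numpy as np

def perceive(depth_m, pixel, fx, fy, cx, cy):
    """Back-project one pixel with a depth reading into a 3D point in the
    camera frame: the step that turns 2D sensor data into spatial structure."""
    u, v = pixel
    z = depth_m
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])

def act(target_cam, step=0.1):
    """Digital 'brain': emit a small motion command toward the 3D target."""
    direction = target_cam / np.linalg.norm(target_cam)
    return step * direction  # meters to move this control tick

# A cup seen at pixel (900, 400) with a 1.5 m depth reading:
target = perceive(1.5, (900, 400), fx=800.0, fy=800.0, cx=640.0, cy=360.0)
command = act(target)
```

The design point the sketch tries to capture: without the `perceive` step producing genuine 3D coordinates, the digital side has nothing spatial to plan against.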
  • 00:40:43
    So you've talked about virtual worlds, you've talked about more of an augmented reality, and now you've just talked about the purely physical world, basically, which would be used for robotics. For any company, that would be a very large charter, especially if you're going to get into each one of these different areas. So how do you think about the idea of deep tech versus any of these specific application areas?
  • 00:41:06
    We see ourselves as a deep tech company, as the platform company that provides models that can serve different use cases.
  • 00:41:15
    Of these three, is there any one that you think is more natural early on, that people can expect the company to lean into?
  • 00:41:22
    I think it suffices to say the devices are not totally ready. I actually got my first VR headset in grad school, and it was one of those transformative technology experiences: you put it on and think, oh my god, this is crazy. I think a lot of people have that experience the first time they use VR, so I've been excited about this space for a long time. And I love the Vision Pro; I stayed up late to order one of the first ones the day it came out. But I think the reality is it's just not there yet as a platform for mass-market appeal, so very likely as a company we will move into a market that's more ready.
  • 00:41:55
    I think there can sometimes be simplicity in generality. We have this notion of being a deep tech company: we believe there are some underlying fundamental problems that need to be solved really well, and that, if solved really well, can apply to a lot of different domains. We really view the long arc of the company as building and realizing the dreams of spatial intelligence writ large.
  • 00:42:17
    So this is a lot of technology to build, it seems to me.
  • 00:42:19
    Yeah, I think it's a really hard problem. I think people who are not directly in the AI space sometimes see AI as one undifferentiated mass of talent, and those of us who have been here longer realize that a lot of different kinds of talent need to come together to build anything in AI, and in particular this. We've talked a little bit about the data problem, and a little bit about some of the algorithms I worked on during my PhD, but there's a lot of other stuff needed to do this too. You need really high-quality, large-scale engineering; you need a really deep understanding of the 3D world; and there are actually a lot of connections with computer graphics, because they've been attacking a lot of the same problems from the opposite direction. So when we think about team construction, we think about how to find the absolute best experts in the world in each of the different subdomains that are necessary to build this really hard thing.
  • 00:43:13
    When I thought about how to form the best founding team for World Labs, it had to start with a group of phenomenal, multidisciplinary founders. Of course, Justin was natural for me (Justin, cover your ears) as one of my best students and one of the smartest technologists. But there were two other people I had known by reputation, one of whom Justin had even worked with, that I was drooling over. One is Ben Mildenhall; we talked about his seminal work on NeRF. The other is Christoph Lassner, who is well regarded in the computer graphics community, and who had the foresight to work on a precursor of the Gaussian splatting representation for 3D modeling five years before Gaussian splatting took off. When we talked about the possibility of working with Christoph Lassner, Justin just jumped off his chair.
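For context on the representation mentioned here, below is a heavily simplified sketch of what one 3D Gaussian "splat" stores and how depth-ordered splats alpha-composite into a pixel color. It illustrates the general technique only (real renderers also project each splat's 3D covariance into a 2D screen footprint); it is not Lassner's precursor work or any World Labs code, and the field names are assumptions.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Splat:
    """One primitive of a Gaussian-splat scene representation (simplified)."""
    mean: np.ndarray   # 3D center position
    opacity: float     # alpha in [0, 1] at the splat's center
    color: np.ndarray  # RGB in [0, 1]

def composite(splats_front_to_back):
    """Front-to-back alpha compositing of the splats covering one pixel.
    For simplicity, every splat here is assumed to fully cover the pixel."""
    color = np.zeros(3)
    transmittance = 1.0
    for s in splats_front_to_back:
        color += transmittance * s.opacity * s.color
        transmittance *= 1.0 - s.opacity
    return color

near = Splat(np.array([0.0, 0.0, 1.0]), 0.5, np.array([1.0, 0.0, 0.0]))
far  = Splat(np.array([0.0, 0.0, 2.0]), 1.0, np.array([0.0, 0.0, 1.0]))
pixel = composite([near, far])  # half-transparent red over opaque blue
```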
  • 00:44:25
    Ben and Christoph are legends. Maybe just quickly talk about how you thought about the build-out of the rest of the team, because again, there's a lot to build here and a lot to work on, not just in AI or graphics but in systems and so forth.
  • 00:44:39
    Yeah. What I'm personally most proud of so far is the formidable team. I've had the privilege of working with the smartest young people of my entire career, from the top universities, being a professor at Stanford, but the kind of talent we've put together at World Labs is just phenomenal; I've never seen this concentration. And I think the biggest differentiating element here is that we're all believers in spatial intelligence. All of the multidisciplinary talents, whether it's systems engineering, machine learning, infrastructure, generative modeling, data, or graphics: all of us, whether through our personal research journeys, our technology journeys, or even personal hobbies, believe that spatial intelligence has to happen at this moment, with this group of people. That's how we really formed our founding team, and that focus of energy and talent is really humbling to me. I just love it.
  • 00:45:55
    So I know you've been guided by a North Star. The thing about North Stars is that you can't actually reach them, because they're in the sky, but they're a great way to get guidance. So how will you know when you've accomplished what you set out to accomplish? Or is this a lifelong thing that's going to continue indefinitely?
  • 00:46:13
    First of all, there are real North Stars and virtual North Stars. Sometimes you can reach virtual North Stars.
    Fair enough. Good enough in the world model.
    Exactly. Like I said, I thought one of my North Stars, storytelling about images, would take a hundred years, and Justin and Andrej, in my opinion, solved it for me. So we could get to our North Star. For me, it's when so many people and so many businesses are using our models to unlock their needs for spatial intelligence. That's the moment I know we have reached a major milestone.
  • 00:46:53
    Actual deployment. Actual impact.
    Exactly. Yeah, I don't think we're ever going to get there. I think this is such a fundamental thing: the universe is a giant, evolving, four-dimensional structure, and spatial intelligence writ large is just understanding that in all of its depths and figuring out all the applications of it. So we have a particular set of ideas in mind today, but I think this journey is going to take us places that we can't even imagine right now.
  • 00:47:17
    The magic of good technology is that technology opens up more possibilities and unknowns. So we will be pushing, and then the possibilities will keep expanding.
    Brilliant. Thank you, Justin.
  • 00:47:30
    Thank you, Fei-Fei. This was fantastic.
    Thank you, Martin.
    Thank you, Martin.
  • 00:47:35
    Thank you so much for listening to the a16z podcast.
Tags
  • Visual Spatial Intelligence
  • Deep Learning
  • AI Evolution
  • Neural Networks
  • Computational Power
  • Data in AI
  • ImageNet
  • World Labs
  • 3D Representation
  • AI Applications