00:00:00
take a look at this line This is a $100
00:00:01
million line probably the most expensive
00:00:03
math equation in history it's a limit an
00:00:05
imaginary wall of physics and
00:00:07
Mathematics of how intelligent
00:00:09
artificial intelligence can ever be so
00:00:11
take a good look cuz I promise I'm not
00:00:13
going to show it again in this video
00:00:14
like I'm not pretending this is a
00:00:16
science channel so we're not going
00:00:17
to go that deep but I will explain in
00:00:19
the simplest of words why there's a
00:00:21
limit to how smart these models can get
00:00:24
a limit that even the top scientists
00:00:26
have not been able to overcome and the
00:00:28
clues about this have already
00:00:29
started to show up and this is the
00:00:32
equation that
00:00:34
might put an end to all this bubbly
00:00:37
Behavior around
00:00:40
it so for years we've been imagining an
00:00:43
artificial intelligence that outsmarts
00:00:45
us but let's get something out of
00:00:46
the way computers are already way better
00:00:48
than us at plenty of things computers
00:00:51
absolutely kick our ass at solving math
00:00:53
and they're definitely smarter than us
00:00:55
at storing data and reciting it back
00:00:57
that's not really intelligence current
00:00:58
AI companies have even had to
00:01:00
differentiate our current AI from AGI
00:01:02
artificial general intelligence because
00:01:04
they know deep down that what we have
00:01:06
now is not really intelligence but
00:01:09
assuming that we gave hands to a GPT
00:01:11
could it actually cook me an egg for
00:01:13
breakfast hold that thought the fact
00:01:15
that AI is the buzzword of the year and
00:01:17
that Nvidia is worth some trillion
00:01:18
dollar number is based on three premises
00:01:21
one that the smarter models are going to
00:01:24
need all of those gpus two that more
00:01:26
people are going to adopt AI into their
00:01:29
daily lives and three that our AI models
00:01:32
are going to get exponentially smarter
00:01:34
everyone is in panic mode because some
00:01:36
Chinese startup allegedly built and
00:01:38
trained a ChatGPT-level model using a
00:01:41
fraction of the compute and a
00:01:42
fraction of the cost one Chinese startup
00:01:45
just launched a new AI model to rival
00:01:47
OpenAI DeepSeek has become the most
00:01:50
downloaded free app passing ChatGPT
00:01:52
which is pretty shocking we're going to
00:01:54
get to this it remains to be seen how
00:01:55
much the average Joe or the average
00:01:56
company adopts AI into their lives but
00:02:00
none of those betting on big Tech AI
00:02:02
transformation are even questioning the
00:02:05
possibility of that third bullet point
00:02:07
and therein lies the problem making our
00:02:10
current AI models smarter is almost
00:02:13
impossible in order to understand why
00:02:15
that math equation is so dangerous to
00:02:17
these companies we need to understand at
00:02:19
least the basics of how our current
00:02:21
models are trained and how they think so
00:02:23
let's just go to our explainer time
00:02:25
explainer
00:02:26
time what GPT excels at to the point
00:02:29
where it acts like an intelligent
00:02:31
sentient bot enough to fool many of us
00:02:33
to write entire essays what it does
00:02:36
is predicting the next word in a
00:02:38
sentence but being really good at
00:02:40
predicting words or the next word
00:02:42
already allows a model like GPT to beat
00:02:44
us at most standardized tests which is
00:02:47
kind of the way we measure our own human
00:02:49
intelligence isn't it and yet GPT
00:02:52
doesn't do that well at math it only
00:02:54
beat about 50% of the students in these
00:02:56
tests why is that well one number that
00:02:58
you're going to hear about all the time
00:02:59
when people talk about these models is
00:03:02
the number of parameters that a model
00:03:04
was trained on how many parameters is a
00:03:05
model using GPT-3 for example which is
00:03:08
almost useless, dumb compared to the
00:03:11
models that we use today used 175
00:03:14
billion parameters what the hell does
00:03:16
that mean so let me show you now a model
00:03:18
like GPT uses pre-trained Transformers
00:03:21
to generate text hence the name now this
00:03:23
sounds like nonsense to you right now
00:03:24
but I promise it'll make sense in
00:03:26
exactly 4 minutes imagine that we feed
00:03:28
the model a sentence so that it can try
00:03:30
and predict what the next word is so the
00:03:32
first thing that the model needs to do
00:03:34
is try to understand what this group of
00:03:36
words means like we're seeing words
00:03:38
here but the computer is really just
00:03:40
seeing bits out of all of this and
00:03:42
the first thing that the model will do
00:03:43
is try to break this into tokens maybe
00:03:45
you've heard the term before then what
00:03:47
we'll try to do is classify those tokens
00:03:49
based on their
00:03:51
meaning LLMs classify words by grouping
00:03:54
them together with words that have
00:03:56
similar meaning technically this is done
00:03:57
not by grouping entire words but tokens
00:04:00
which are fractions of a word but I'm
00:04:02
going to stick with the concept of words
00:04:03
just for simplicity's sake
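To make the token idea concrete, here is a toy sketch in Python. This is not OpenAI's actual tokenizer (GPT-3 uses byte-pair encoding over roughly 50,000 learned fragments); the tiny vocabulary and greedy matching below are purely illustrative assumptions.

```python
# Toy illustration of subword tokenization (NOT the real GPT tokenizer).
# A tiny made-up vocabulary, matched greedily from left to right.
TOY_VOCAB = ["break", "fast", "ing", "cook", "egg", " ",
             "a", "b", "c", "e", "f", "g", "i", "k", "n", "o", "r", "s", "t"]

def toy_tokenize(text: str) -> list[str]:
    tokens, i = [], 0
    while i < len(text):
        # Take the longest vocabulary entry that matches at this position.
        match = max((v for v in TOY_VOCAB if text.startswith(v, i)),
                    key=len, default=text[i])
        tokens.append(match)
        i += len(match)
    return tokens

print(toy_tokenize("breakfast"))  # ['break', 'fast']
print(toy_tokenize("cooking"))    # ['cook', 'ing']
```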
00:04:05
For example ring may be classified with other words
00:04:08
like ear like Jewel maybe around the
00:04:11
world Circle so here's a 2d simple
00:04:14
two-dimensional axis we have a horizontal
00:04:16
axis X and a vertical axis y this is
00:04:18
great for numbers right because we can
00:04:19
just go up or down depending on the
00:04:21
number but we're dealing with words here
00:04:23
and there are thousands of words out
00:04:25
there with thousands of different
00:04:27
meanings it would be kind of impossible
00:04:29
to group words by meaning into a 2d
00:04:32
space even in a 3D space we'd run out of
00:04:34
directions to go to very quickly so GPT-3
00:04:37
classifies words or tokens really into
00:04:41
12,288 different dimensions that means a
00:04:43
grid with
00:04:45
12,288 axes we can't see it of course we
00:04:48
can't even imagine it it's like that
00:04:50
Interstellar tesseract but with 12,284
00:04:53
more Dimensions to go but don't worry
00:04:57
what you need to understand is that in
00:04:59
this unimaginable Cloud this black hole
00:05:02
of directions words with similar
00:05:04
meanings are going to be grouped close
00:05:07
to each other so from the GPT-3 paper we
00:05:10
know that open AI used about 50,000
00:05:13
tokens basically a dictionary of tokens
00:05:15
and mapped them into this 12,000
00:05:17
Dimension space which already puts the
00:05:19
count of parameters that the model is
00:05:21
going to need at around 600 million
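As a rough check on that figure (assuming the roughly 50,000-token vocabulary from the GPT-3 paper and the 12,288-dimensional space mentioned above, with one learned vector per token):

```python
# Back-of-the-envelope count of GPT-3's embedding parameters:
# one learned vector of 12,288 numbers for each token in the dictionary.
vocab_size = 50_257      # tokens in GPT-3's vocabulary (GPT-3 paper)
embedding_dim = 12_288   # coordinates per token

print(f"{vocab_size * embedding_dim:,}")  # 617,558,016 -> roughly 600 million
```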
00:05:24
still that's far from the 175 billion
00:05:26
parameters that GPT-3 had so let's keep
00:05:28
digging but before I do that I want to
00:05:30
take a moment to thank NordPass for
00:05:31
helping us fund today's explainer
00:05:32
NordPass is a secure password manager
00:05:35
created by the experts behind NordVPN to
00:05:37
help you and your team store and share
00:05:39
passwords and credit card details
00:05:40
securely one in four people can still
00:05:42
log into accounts from their previous
00:05:44
jobs granting them access to stuff that
00:05:46
they shouldn't have but passwords shared
00:05:48
through NordPass can be revoked in
00:05:50
seconds you have full visibility into
00:05:52
who has access to which shared Company
00:05:54
accounts and the vulnerability of these
00:05:56
making life a lot simpler for your it
00:05:58
departments we migrated our old password
00:06:01
manager into NordPass with a simple
00:06:03
export import function and everybody hit
00:06:05
the ground running in minutes it's easy
00:06:07
to use you can sync it across devices
00:06:09
and it has this user-friendly interface
00:06:11
so I really can't recommend it enough
00:06:13
NordPass also has this really cool feature
00:06:14
called data breach scanner which gives
00:06:16
you live alerts if any of your corporate
00:06:18
data appears on the dark net so it gives
00:06:20
you advance warning to change your
00:06:22
passwords before any of your accounts
00:06:23
are breached which can of course cause
00:06:26
financial and reputational damage we
00:06:28
partnered with NordPass to bring you a
00:06:29
3-month free trial of NordPass for Business
00:06:32
and 20% off their business plans no
00:06:34
credit card is required you can just go
00:06:36
to nordpass.com/slidebean use the code
00:06:38
slidebean at signup or you can just scan
00:06:41
this QR code you'll level up your
00:06:43
business security you'll save a lot of
00:06:44
money and you'll help our channel in the
00:06:46
process okay so now let's dig into what
00:06:48
happens after the embedding so the
00:06:50
mapping of words into this
00:06:52
incomprehensible tesseract black hole is
00:06:54
called embedding this is the embedding
00:06:56
step and it's how a model turns words
00:06:58
into something that computers can
00:07:00
understand and process but
00:07:01
just understanding that the word ring
00:07:03
lives in a neighborhood of other words
00:07:05
we still don't know what it means in
00:07:06
this context ring might be a sound might
00:07:09
be an earring might be the One Ring so
00:07:11
how does the model know and so that's
00:07:13
where the transformer comes in what the
00:07:14
model's going to do is well transform
00:07:17
the word, that is, essentially move this word
00:07:19
in this 12,000-dimensional space for this
00:07:22
specific sentence it'll move it closer
00:07:24
to the meaning that's based on the
00:07:26
context around this specific word in
00:07:28
this sentence so that context could
00:07:30
be the word before the word after could
00:07:32
be the words mentioned earlier in the
00:07:34
conversation for example the model might
00:07:36
notice that this R is capitalized even
00:07:39
though it's not at the beginning of the
00:07:40
sentence which must mean something it might
00:07:42
also look at adjectives and how they
00:07:44
affect nouns so this transformation
00:07:46
layer makes tiny adjustments in the
00:07:48
region of space where this particular
00:07:51
word
00:07:52
lives
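Here is a deliberately tiny sketch of that "nudge in meaning space" idea, in three dimensions instead of 12,288. Real attention layers use learned query, key, and value projections, so treat this weighted-average version with made-up vectors as an intuition aid, not GPT's actual math.

```python
import numpy as np

# Cartoon of attention: blend context-word vectors into the ambiguous word
# "ring", nudging it toward one of its meanings.
emb = {
    "ring":    np.array([0.5, 0.5, 0.0]),   # halfway between "sound" and "jewelry"
    "phone":   np.array([1.0, 0.0, 0.0]),   # lives in the "sound" region
    "wedding": np.array([0.0, 1.0, 0.0]),   # lives in the "jewelry" region
}

def attend(word: str, context: list[str]) -> np.ndarray:
    scores = np.array([emb[word] @ emb[c] for c in context])  # similarity to each context word
    weights = np.exp(scores) / np.exp(scores).sum()           # softmax into blending weights
    pulled = sum(w * emb[c] for w, c in zip(weights, context))
    return 0.5 * emb[word] + 0.5 * pulled                     # the nudged vector

print(attend("ring", ["wedding"]))  # [0.25 0.75 0.  ] -> pulled toward jewelry
print(attend("ring", ["phone"]))    # [0.75 0.25 0.  ] -> pulled toward sound
```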
00:07:55
Now all of these Transformers are going to run at the same time and that's
00:07:57
in part why gpus are so good at doing
00:08:00
this thing because they were built to
00:08:01
calculate all the pixels in your screen
00:08:03
at the same time now each Transformer in
00:08:05
GPT-3 has about 1.8 billion parameters
00:08:09
around 600 million of those parameters
00:08:11
are in this first attention layer which
00:08:13
helps Focus the word in space and about
00:08:16
1.2 billion of those parameters are in
00:08:18
the feed-forward network layer which is
00:08:20
kind of like a zoom in on the
00:08:23
meaning of the word but that's as far as
00:08:24
we're going to zoom in today now GPT-3
00:08:26
uses 96 of these transformers for a total
00:08:29
of almost 174 billion parameters
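Putting numbers on that: in a standard GPT-style block, the attention weights come to roughly 4·d² parameters and the feed-forward weights to roughly 8·d², which is where the ~600 million and ~1.2 billion figures come from. This sketch ignores biases and a few small extras, so it lands just under the exact totals.

```python
d = 12_288                           # embedding width
attention_per_block = 4 * d * d      # ~0.6 billion
feed_forward_per_block = 8 * d * d   # ~1.2 billion
blocks = 96

block_params = blocks * (attention_per_block + feed_forward_per_block)
embedding_params = 50_257 * d

print(f"per block: {(attention_per_block + feed_forward_per_block) / 1e9:.2f}B")   # 1.81B
print(f"all blocks: {block_params / 1e9:.1f}B + embedding: {embedding_params / 1e9:.1f}B")
# all blocks: 173.9B + embedding: 0.6B -> right around the 175B headline figure
```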
00:08:32
We're almost done now the last few parameters
00:08:34
are on the output layer which
00:08:35
essentially does the unembedding the
00:08:38
inverse of the input layer it brings
00:08:40
this word these 12,000 Dimensions into
00:08:44
our old 2D bit world and it gives us the
00:08:47
result of this massive operation of the
00:08:49
model as words not as numbers the result
00:08:52
of this massive massive mathematical
00:08:55
journey is a list of words along with
00:08:57
the probability of which word comes next
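A toy version of that final step, with made-up scores for three candidate words (the real model produces a score for every one of its ~50,000 tokens at once):

```python
import math

# Turn raw scores ("logits") for candidate next words into probabilities.
logits = {"eggs": 4.1, "toast": 3.2, "gravel": -2.0}

exp = {w: math.exp(s) for w, s in logits.items()}
total = sum(exp.values())
probs = {w: round(v / total, 3) for w, v in exp.items()}

print(probs)  # {'eggs': 0.71, 'toast': 0.289, 'gravel': 0.002} -> "eggs" is the likeliest next word
```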
00:09:00
now the whole idea of machine learning
00:09:02
is that we don't have to go and train
00:09:04
each one of those 175 billion parameters
00:09:06
to tell it what it needs to do it learns
00:09:08
itself AKA machine learning now the
00:09:11
first time this runs this thing is going
00:09:12
to spit out just gibberish but during
00:09:14
training the model adjusts these
00:09:16
parameters using algorithms to reduce
00:09:18
these errors think of them like small
00:09:20
knobs that slightly move to generate
00:09:24
slightly different mathematical outcomes
00:09:25
in the end it's like trial and error on
00:09:27
a trillion scale
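In that same spirit, here is a one-knob caricature of the trial and error: nudge a parameter at random and keep the change only if the error goes down. Real training computes gradients with backpropagation instead of random nudges, and does it across billions of knobs at once.

```python
import random

target = 0.73      # the "correct" output we want
knob = 0.0         # one parameter, starting out badly (the "gibberish" stage)

def error(k: float) -> float:
    return (k - target) ** 2

for _ in range(1000):
    nudge = random.uniform(-0.05, 0.05)
    if error(knob + nudge) < error(knob):   # reinforce helpful changes,
        knob += nudge                       # discard harmful ones
print(round(knob, 2))   # typically lands very close to 0.73
```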
00:09:30
if a particular set of values helps the
00:09:31
model make a correct prediction those
00:09:33
values are reinforced if not they're
00:09:35
adjusted each of these connections
00:09:36
between one value and the other is a
00:09:39
neuron which makes a neural network and
00:09:41
it works not so differently from Human
00:09:43
neurons like it may sound impossible but
00:09:45
after billions of operations and
00:09:46
training data this thing can actually
00:09:50
and pretty accurately predict the next
00:09:52
word in a sentence again this thing has
00:09:53
consumed billions and billions of texts
00:09:55
written by humans and has become so good
00:09:57
at predicting words that it can pass our
00:09:59
tests and predicting words is the LLM
00:10:01
example but you can apply this logic of
00:10:03
predicting the next thing to how a pixel
00:10:06
should look to generate an image or
00:10:08
understanding if this dress is blue or
00:10:10
gold same basic principle now you know
00:10:13
the reason why it failed the high school
00:10:14
math exam a pure GPT model doesn't do
00:10:17
math at least not directly in the
00:10:19
simplest of terms if you ask it what's
00:10:21
1 + 1 it knows the answer is two because
00:10:24
it read a million times that the answer
00:10:26
is two and it's incredibly efficient at
00:10:28
identifying patterns but not because it
00:10:30
pulled up a calculator and added 1 + 1
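That distinction in miniature, as a purely illustrative sketch (a real LLM's "memory" is statistical and spread across its parameters, not a literal lookup table):

```python
# Recalling an answer seen in training text vs. actually computing it.
memorized_text = {"1 + 1 =": "2", "2 + 2 =": "4"}   # stand-in for patterns seen in training

def answer_like_a_plain_llm(prompt: str) -> str:
    return memorized_text.get(prompt, "?")          # recall only, no arithmetic

def answer_like_a_calculator(prompt: str) -> str:
    a, _, b, _ = prompt.split()                     # parse "1 + 1 ="
    return str(int(a) + int(b))                     # actually compute

print(answer_like_a_plain_llm("1 + 1 ="))         # "2"   (seen it before)
print(answer_like_a_plain_llm("1234 + 5678 ="))   # "?"   (never memorized)
print(answer_like_a_calculator("1234 + 5678 =")) # "6912"
```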
00:10:32
that's not bad per se but it's going to
00:10:34
be a problem later the thing is once you
00:10:36
have a computer that can understand
00:10:39
these relationships within words you can
00:10:41
give it instructions in plain English
00:10:43
and it'll base its responses on that
00:10:45
like this Transformer model with an
00:10:47
instruction on top of it is the same
00:10:49
concept that Grok and Llama are using
00:10:52
and they're all limited by the same
00:10:53
equation now that 175 billion parameter
00:10:56
GPT-3 model had problems like you could
00:10:59
tell it was AI because it didn't write
00:11:01
quite like a human it couldn't count the
00:11:03
R's in strawberry it also had a rather
00:11:05
small limit of context how many tokens
00:11:08
before the current word are processed
00:11:10
and considered for the prediction of the
00:11:12
next word
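What a context limit means in practice, sketched with words standing in for tokens (GPT-3's window was 2,048 tokens; the sentence below is just a made-up example):

```python
# Only the most recent N tokens are fed to the model; anything earlier
# is simply invisible to the next-word prediction.
CONTEXT_LIMIT = 8

conversation = ("my name is Alex and I like scrambled eggs "
                "now please cook me an egg for breakfast").split()

visible = conversation[-CONTEXT_LIMIT:]   # the slice the model actually "sees"
print(visible)
# ['now', 'please', 'cook', 'me', 'an', 'egg', 'for', 'breakfast']
# The name and the egg preference fell outside the window, so the model can't use them.
```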
00:11:14
So let's just train it with more, right? OpenAI theorized that by
00:11:16
scaling the amount of data and the
00:11:17
amount of parameters the model would get
00:11:19
a lot smarter and it did a way to
00:11:21
measure the effectiveness of the model
00:11:23
is with the error rate so the word
00:11:24
predictions that are incorrect in
00:11:26
very simple terms it's more
00:11:27
complicated than that but anyway
00:11:29
they projected the error rate decreasing
00:11:31
the bigger the model was and the more
00:11:33
data was used for its training and so
00:11:35
they went and did it they spent over
00:11:37
$100 million training this thing
00:11:39
leaked data from OpenAI says that GPT-4
00:11:41
uses 1.8 trillion parameters it has more
00:11:45
Transformer steps potentially with more
00:11:46
dimensions for the tokens and it took
00:11:48
about 25,000 GPUs running for over 3
00:11:51
months to train GPT-4
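Those figures roughly hang together; here is the back-of-the-envelope, where the per-GPU-hour price is an assumed illustrative rate, not a number from the video or from OpenAI:

```python
gpus = 25_000
days = 90                    # "over 3 months"
price_per_gpu_hour = 2.0     # ASSUMED rental-style rate in USD, for illustration only

gpu_hours = gpus * days * 24
print(f"{gpu_hours:,} GPU-hours -> ~${gpu_hours * price_per_gpu_hour / 1e6:.0f} million")
# 54,000,000 GPU-hours -> ~$108 million
```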
00:11:55
But it worked: the results were way better than GPT-3, so
00:11:57
let's just keep doing that right more
00:11:59
gpus more data more parameters well
00:12:03
that's when they hit a
00:12:05
wall now that wall is this formula I
00:12:08
said that I wouldn't show you again cuz
00:12:09
you would think that a bigger model in
00:12:11
this case you know the bigger the size
00:12:13
of the model the better the performance
00:12:15
at some fantastic astronomical
00:12:17
level but OpenAI has kind of reached
00:12:19
this wall of diminishing returns it's
00:12:21
kind of like here right there's not a
00:12:23
lot that we can do like regardless of
00:12:25
the size we just can't get that
00:12:27
performance up a lot even if we throw a
00:12:28
lot more data and create neural networks
00:12:30
with quadrillions of parameters the
00:12:32
improvements are going to be marginal
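The video never writes the formula out, but the wall being described matches the published neural scaling laws; one widely cited form (from Hoffmann et al.'s Chinchilla paper, offered here as an assumption about what that $100 million line refers to) is:

```latex
L(N, D) \;=\; E \;+\; \frac{A}{N^{\alpha}} \;+\; \frac{B}{D^{\beta}}
```

Here L is the loss (roughly, the error rate), N is the parameter count, D is the number of training tokens, and E, A, B, α, β are fitted constants. Because E never shrinks and the other two terms fall off with diminishing returns, the curve flattens into exactly this kind of plateau no matter how far N and D are pushed.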
00:12:34
all the way through 2024 we had lived on
00:12:37
this part of the chart right but GPT-5's
00:12:41
failures seem to reveal that we've kind
00:12:44
of arrived at this Plateau right
00:12:47
here and that's not even the worst of it
00:12:49
so a recent paper concluded that there
00:12:51
is simply not enough data to train them
00:12:55
there is a point in this curve where the
00:12:57
amount of data needed for training is
00:12:59
bigger than the amount of data that
00:13:01
exists we just haven't produced enough
00:13:03
data that can be used for training text
00:13:06
knowledge images speech to satisfy the
00:13:08
needs that the models would have to
00:13:11
reach perfection or a very very small
00:13:14
error rate
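Rough arithmetic behind that claim, using the roughly 20-training-tokens-per-parameter rule of thumb from the scaling-law literature; the "available text" figure below is an assumed placeholder, since published estimates of usable human-written text vary widely:

```python
TOKENS_PER_PARAM = 20       # compute-optimal rule of thumb from the scaling-law papers
AVAILABLE_TOKENS = 300e12   # ASSUMED stock of usable text, for illustration only

for params in (175e9, 1.8e12, 20e12, 100e12):
    needed = TOKENS_PER_PARAM * params
    verdict = "fine" if needed <= AVAILABLE_TOKENS else "more than the assumed stock"
    print(f"{params / 1e12:>6.2f}T params -> {needed / 1e12:>7,.0f}T training tokens ({verdict})")
#   0.18T params ->       4T training tokens (fine)
#   1.80T params ->      36T training tokens (fine)
#  20.00T params ->     400T training tokens (more than the assumed stock)
# 100.00T params ->   2,000T training tokens (more than the assumed stock)
```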
00:13:15
So in other words we have found the limit of the current machine
00:13:18
learning algorithms the models are
00:13:21
flawed and Humanity doesn't have the
00:13:23
resources to train them let's
00:13:25
be real for a second like this series of
00:13:27
tubes a series of tubes this Transformer
00:13:30
model is arguably one of the most
00:13:31
important scientific breakthroughs of
00:13:33
the century and I'm focusing on language
00:13:36
models here but we have now built models
00:13:38
to predict the shape of proteins which
00:13:40
seemed an impossible task for a human if
00:13:42
you wanted to produce an image of
00:13:44
something that didn't exist you needed
00:13:46
creative people illustrators Photoshop
00:13:47
artists 3D rendering and now an AI can
00:13:50
just deduce how something looks from
00:13:53
previous training like I don't think
00:13:55
enough people talk about what this means
00:13:57
for 3D artists when we used to spend so long trying to
00:13:59
build a world a new world from scratch
00:14:01
and now a computer can reverse engineer
00:14:03
that from training data and just give us
00:14:05
the same result in a fraction of the
00:14:07
time but it's not over though for years
00:14:09
we thought there was no other way to
00:14:10
reach this level of performance unless
00:14:12
we had like two trillion parameters
00:14:14
billions of dollars and servers and
00:14:16
piles of training data but it looks like
00:14:18
there is a better way a way around it
00:14:20
based on DeepSeek's efficiency with
00:14:22
apparently a fraction of the parameters
00:14:24
and the cost we're yet to see if that's
00:14:26
true still I think it's only a matter of
00:14:28
time before we find a more efficient way
00:14:30
to do all of this easier and
00:14:32
cheaper but even more importantly the
00:14:34
answer may not be GPT-5 or 6 or 7
00:14:37
LLMs have proven that we can make a
00:14:39
computer understand natural language and
00:14:41
so companies figured The Next Step was
00:14:43
connecting other systems to that brain
00:14:45
and this is why GPT can now see images
00:14:47
or recognize speech giving eyes and ears
00:14:50
to the system hello there cutie that
00:14:52
eventually will turn into hands but how
00:14:55
far is it from cooking an egg or doing
00:14:57
my dishes there's some reasoning needed
00:14:59
behind
00:15:01
that so this is the whole idea of what
00:15:03
models like o1 and more recently o3 try
00:15:06
to do this model that we built
00:15:07
originally just tries to spit out the
00:15:09
next word as quickly as possible but
00:15:11
scientists came up with this concept of
00:15:13
reasoning Now using the same Transformer
00:15:16
the same model at its core it tries to
00:15:18
interpret the question and tries to
00:15:20
break it down into smaller subtasks or
00:15:23
prompts and then it tries to solve each
00:15:25
one of those prompts in order kind of
00:15:27
giving it like partial answers to your
00:15:29
original question now once that possible
00:15:32
response is done it analyzes it again to
00:15:34
see if it makes sense against the
00:15:36
original question and the original
00:15:38
context of the conversation so it does a
00:15:39
bit like your own brain's thought
00:15:42
process you know writing that email
00:15:43
response starting over readjusting
00:15:45
rereading before you hit send nailed
00:15:50
it like this step-by-step process is
00:15:52
called Chain of Thought again not too
00:15:54
different from your train of thought
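A minimal sketch of that decompose, solve, and check loop; the function names here are hypothetical, and the real o1/o3 pipeline has not been published, so this only illustrates the shape of the idea:

```python
def base_model(prompt: str) -> str:
    """Placeholder for a call to the underlying next-word predictor."""
    return f"<model answer to: {prompt}>"

def reason(question: str, max_attempts: int = 3) -> str:
    draft = ""
    for _ in range(max_attempts):
        # 1. Break the question into smaller sub-prompts.
        plan = base_model(f"List the steps needed to answer: {question}")
        # 2. Solve each sub-prompt in order, feeding earlier answers forward.
        partials = []
        for step in plan.splitlines():
            partials.append(base_model(f"{step}\nGiven so far: {partials}"))
        draft = base_model(f"Combine into one answer: {partials}")
        # 3. Check the draft against the original question before replying.
        check = base_model(f"Does this fully answer '{question}'? {draft}")
        if "yes" in check.lower():
            return draft
    return draft   # after a few passes, return the best draft we have

print(reason("How do I cook an egg for breakfast?"))
```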
00:15:56
But let's go back to that table that I
00:15:57
showed you earlier this iterative
00:15:59
thinking has actually allowed current
00:16:00
models to beat us at General human
00:16:03
intelligence tests IQ structured logic
00:16:06
decision-making scenarios it's good
00:16:08
enough to write basic and even some
00:16:09
intermediate code it's got a fair share
00:16:12
of Engineers struggling to find jobs
00:16:14
who would have thought that they would be
00:16:15
the first to be replaced by this but
00:16:17
anyway in startups at least there's
00:16:18
an unspoken truth about the number
00:16:22
of jobs that AI has already replaced but
00:16:24
it's very bad press and nobody really
00:16:26
wants to talk about it but when you get
00:16:27
out of controlled environments and into the
00:16:30
real world that's where AI struggles
00:16:32
like Common Sense reasoning creativity
00:16:35
decision-making in real-world scenarios
00:16:37
and even advanced mathematics where
00:16:39
problem solving and some creativity is
00:16:41
required like these models take minutes
00:16:44
to process through all of this while it
00:16:46
takes us just seconds to make these
00:16:47
Advanced decisions that's a processing
00:16:50
and capacity problem not an architecture
00:16:52
problem also what happens when these
00:16:54
models are allowed to escape containment
00:16:58
when they can start doing things in the
00:17:00
real world Operator is a research preview
00:17:03
of an agent that uses a browser to help
00:17:06
users do things I'm not doing anything
00:17:08
right now the operator is doing
00:17:09
everything by itself it's okay Mom
00:17:10
should help I think we just keep
00:17:13
pushing the bar of what intelligence is
00:17:17
feelings right creativity true invention
00:17:20
we still have some of those and
00:17:23
computers don't but the set of human
00:17:25
only skills is shrinking it's running
00:17:28
out and I think we have to deal with the
00:17:29
reality that it's no longer an if but a
00:17:32
when question do computers really think
00:17:36
let's just say it all depends on what
00:17:39
you mean by thinking now if you enjoyed
00:17:43
today's explainer you should watch our
00:17:45
video from last week on how money gets
00:17:46
created and why 93% of today's money
00:17:49
doesn't really exist catch you on the
00:17:51
next one
00:17:53
[Music]