Building makemore Part 5: Building a WaveNet
Summary
TL;DR: In this lecture, the speaker continues the implementation of a character-level language model, evolving from a simple multi-layer perceptron to a more complex architecture inspired by WaveNet. The model now processes eight characters to predict the ninth, utilizing a hierarchical approach to progressively fuse information. The speaker addresses challenges with batch normalization and emphasizes the importance of maintaining the correct state during training and evaluation. The lecture concludes with a discussion on potential improvements and future topics, including dilated convolutions and hyperparameter tuning, highlighting the need for a more structured experimental approach to optimize the model's performance.
Takeaways
- The speaker is in Kyoto, enhancing the lecture's atmosphere.
- Transitioning from a simple model to a complex architecture inspired by WaveNet.
- The model now takes eight characters as input for prediction.
- Batch normalization is crucial for stabilizing the learning process.
- A hierarchical approach is used to progressively fuse information.
- Future topics include dilated convolutions and residual connections.
- Emphasis on the importance of hyperparameter tuning for model performance.
- The need for a structured experimental approach is highlighted.
Timeline
- 00:00:00 - 00:05:00
The speaker introduces the continuation of their character-level language model implementation while in Kyoto. They discuss the architecture of a multi-layer perceptron that predicts the next character based on three previous characters, and express the intention to enhance this model by increasing the input sequence length and making it deeper, similar to a Wavenet architecture.
- 00:05:00 - 00:10:00
The speaker explains the starter code for part five, which is based on the previous part three, and describes the data processing steps. They mention having 182,000 examples for training, and the importance of building modular layers for the neural network, similar to PyTorch's API.
- 00:10:00 - 00:15:00
The speaker discusses the implementation of various layers, including linear layers and batch normalization. They highlight the complexity of batch normalization and the need to manage its state during training and evaluation, as well as the importance of maintaining running statistics for the layers.
- 00:15:00 - 00:20:00
The speaker simplifies the code by removing unnecessary generator objects and organizing the neural network elements. They explain the embedding table and the structure of the layers, emphasizing the need for a clean and efficient forward pass in the model.
- 00:20:00 - 00:25:00
The speaker addresses issues with the loss function and the need for a better evaluation method. They discuss the importance of setting the model to evaluation mode before validation and mention the current validation loss of 2.10, indicating room for improvement in the model's performance.
- 00:25:00 - 00:30:00
The speaker discusses the need to improve the graph representation of the loss function and demonstrates how to average values for better visualization. They also mention the learning rate decay and its impact on optimization, leading to a more stable loss curve.
- 00:30:00 - 00:35:00
The speaker simplifies the forward pass by organizing layers into a sequential structure, allowing for cleaner code. They introduce a custom sequential class to manage layers and streamline the forward pass process, making it easier to evaluate the model's performance.
- 00:35:00 - 00:40:00
The speaker discusses the need to modify the flattening operation to accommodate the new hierarchical structure of the model. They explain how to reshape tensors to maintain the necessary dimensions for processing pairs of characters in the input sequence.
- 00:40:00 - 00:45:00
The speaker implements a new flattening method that allows for grouping consecutive elements in the input tensor. They demonstrate how to adjust the model's architecture to process these groups effectively, leading to a more efficient neural network design.
- 00:45:00 - 00:50:00
The speaker discusses the implementation of a hierarchical model that processes pairs of characters and gradually fuses information. They explain the changes made to the linear layers and the expected input shapes for the model, emphasizing the importance of maintaining the correct dimensions throughout the network.
- 00:50:00 - 00:56:21
The speaker reflects on the performance improvements achieved by increasing the context length from 3 to 8 characters, resulting in a validation loss of 2.02. They express the need for further optimization and hyperparameter tuning to enhance the model's performance even more.
Video Q&A
What is the main focus of this lecture?
The lecture focuses on implementing a more complex character-level language model inspired by WaveNet.
How many characters does the new model take as input?
The new model takes eight characters as input to predict the ninth character.
What is the purpose of batch normalization in this context?
Batch normalization helps stabilize the learning process by normalizing the inputs to each layer.
What improvements were made to the model's architecture?
The architecture was changed to progressively fuse information in a hierarchical manner, similar to WaveNet.
What are some future topics mentioned for exploration?
Future topics include dilated convolutions, residual connections, and hyperparameter tuning.
- 00:00:00hi everyone today we are continuing our
- 00:00:02implementation of make more our favorite
- 00:00:04character level language model
- 00:00:06now you'll notice that the background
- 00:00:07behind me is different that's because I
- 00:00:09am in Kyoto and it is awesome so I'm in
- 00:00:12a hotel room here
- 00:00:13now over the last few lectures we've
- 00:00:15built up to this architecture that is a
- 00:00:17multi-layer perceptron character level
- 00:00:19language model so we see that it
- 00:00:21receives three previous characters and
- 00:00:23tries to predict the fourth character in
- 00:00:24a sequence using a very simple multi-layer
- 00:00:26perceptron using one hidden layer of
- 00:00:28neurons with tanh nonlinearities
- 00:00:31so we'd like to do now in this lecture
- 00:00:33is I'd like to complexify this
- 00:00:34architecture in particular we would like
- 00:00:36to take more characters in a sequence as
- 00:00:38an input not just three and in addition
- 00:00:41to that we don't just want to feed them
- 00:00:42all into a single hidden layer because
- 00:00:45that squashes too much information too
- 00:00:46quickly instead we would like to make a
- 00:00:49deeper model that progressively fuses
- 00:00:51this information to make its guess about
- 00:00:53the next character in a sequence
- 00:00:55and so we'll see that as we make this
- 00:00:57architecture more complex we're actually
- 00:00:59going to arrive at something that looks
- 00:01:01very much like a wavenet
- 00:01:03the wavenet is this paper published by
- 00:01:05DeepMind in 2016 and it is also a
- 00:01:09language model basically but it tries to
- 00:01:11predict audio sequences instead of
- 00:01:13character level sequences or Word level
- 00:01:15sequences but fundamentally the modeling
- 00:01:18setup is identical it is an
- 00:01:20autoregressive model and it tries to predict
- 00:01:23next character in a sequence and the
- 00:01:25architecture actually takes this
- 00:01:26interesting hierarchical sort of
- 00:01:29approach to predicting the next
- 00:01:31character in a sequence uh with this
- 00:01:33tree-like structure and this is the
- 00:01:35architecture and we're going to
- 00:01:36implement it in the course of this video
- 00:01:38so let's get started so the starter code
- 00:01:41for part five is very similar to where
- 00:01:43we ended up in in part three recall that
- 00:01:46part four was the manual backpropagation
- 00:01:47exercise that is kind of an
- 00:01:49aside so we are coming back to part
- 00:01:51three copy pasting chunks out of it and
- 00:01:53that is our starter code for part five
- 00:01:55I've changed very few things otherwise
- 00:01:57so a lot of this should look familiar to
- 00:01:59if you've gone through part three so in
- 00:02:01particular very briefly we are doing
- 00:02:03Imports we are reading our data set
- 00:02:05of words and we are processing that set
- 00:02:09of words into individual examples and
- 00:02:11none of this data generation code has
- 00:02:13changed and basically we have lots and
- 00:02:15lots of examples in particular we have
- 00:02:17182,000 examples of three characters trying
- 00:02:21to predict the fourth one and we've
- 00:02:24broken up every one of these words into
- 00:02:25little problems of given three
- 00:02:27characters predict the fourth one so
- 00:02:29this is our data set and this is what
- 00:02:30we're trying to get the neural net to do
- 00:02:32now in part three we started to develop
- 00:02:35our code around these layer modules
- 00:02:39um that are for example like class
- 00:02:40linear and we're doing this because we
- 00:02:42want to think of these modules as
- 00:02:44building blocks and like a Lego building
- 00:02:47block bricks that we can sort of like
- 00:02:49stack up into neural networks and we can
- 00:02:51feed data between these layers and stack
- 00:02:53them up into a sort of graphs
- 00:02:56now we also developed these layers to
- 00:02:59have apis and signatures very similar to
- 00:03:01those that are found in pytorch so we
- 00:03:04have torch.nn and it's got all these
- 00:03:05layer building blocks that you would use
- 00:03:07in practice and we were developing all
- 00:03:09of these to mimic the apis of these so
- 00:03:11for example we have linear so there will
- 00:03:13also be a torch.nn.linear and its
- 00:03:17signature will be very similar to our
- 00:03:18signature and the functionality will be
- 00:03:20also quite identical as far as I'm aware
- 00:03:22so we have the linear layer with the
- 00:03:24batchnorm1d layer and the tanh layer
- 00:03:27that we developed previously
- 00:03:29and linear is just a matrix multiply in
- 00:03:32the forward pass of this module
- 00:03:35batchnorm of course is this crazy layer
- 00:03:36that we developed in the previous
- 00:03:37lecture and what's crazy about it is
- 00:03:40well there's many things number one it
- 00:03:42has these running mean and variances
- 00:03:44that are trained outside of back
- 00:03:46propagation they are trained using
- 00:03:49exponential moving average inside this
- 00:03:52layer when we call the forward pass
- 00:03:54in addition to that
- 00:03:56there's this training plug because the
- 00:03:58behavior of batchnorm is different during
- 00:03:59train time and evaluation time and so
- 00:04:02suddenly we have to be very careful that
- 00:04:03batchnorm is in its correct state that
- 00:04:05it's in the evaluation state or training
- 00:04:07state so that's something to now keep
- 00:04:08track of something that sometimes
- 00:04:10introduces bugs
- 00:04:11uh because you forget to put it into the
- 00:04:13right mode and finally we saw that
- 00:04:15batchnorm couples the statistics or
- 00:04:18the activations across the examples in
- 00:04:20the batch so normally we thought of the
- 00:04:22batch as just an efficiency thing but now
- 00:04:25we are coupling the computation across
- 00:04:28batch elements and it's done for the
- 00:04:30purposes of controlling the activation
- 00:04:32statistics as we saw in the previous
- 00:04:33video
- 00:04:34so it's a very weird layer and it leads to a
- 00:04:36lot of bugs
- 00:04:38partly for example because you have to
- 00:04:40modulate the training in eval phase and
- 00:04:42so on
- 00:04:44um in addition for example you have to
- 00:04:46wait for uh the mean and the variance to
- 00:04:49settle and to actually reach a steady
- 00:04:51state and so um you have to make sure
- 00:04:53that you basically there's state in this
- 00:04:55layer and state is harmful uh usually
- 00:04:59now I brought out the generator object
- 00:05:02previously we had a generator equals g
- 00:05:04and so on inside these layers I've
- 00:05:07discarded that in favor of just
- 00:05:08initializing the torch RNG outside here
- 00:05:12use it just once globally just for
- 00:05:15Simplicity
- 00:05:16and then here we are starting to build
- 00:05:18out some of the neural network elements
- 00:05:19this should look very familiar we are we
- 00:05:22have our embedding table C and then we
- 00:05:24have a list of layers and uh it's a
- 00:05:27linear feeds to batchnorm feeds to tanh
- 00:05:29and then a linear output layer and its
- 00:05:32weights are scaled down so we are not
- 00:05:33confidently wrong at the initialization
- 00:05:36we see that this is about 12 000
- 00:05:38parameters we're telling pytorch that
- 00:05:40the parameters require gradients
- 00:05:42the optimization is as far as I'm aware
- 00:05:44identical and should look very very
- 00:05:46familiar
- 00:05:47nothing changed here
- 00:05:49uh loss function looks very crazy we
- 00:05:52should probably fix this and that's
- 00:05:54because 32 batch elements are too few
- 00:05:56and so you can get very lucky lucky or
- 00:05:59unlucky in any one of these batches and
- 00:06:01it creates a very thick loss function
- 00:06:04um so we're going to fix that soon
- 00:06:06now once we want to evaluate the trained
- 00:06:08neural network we need to remember
- 00:06:09because of the batchnorm layers to set
- 00:06:11all the layers to be training equals
- 00:06:13false so this only matters for the
- 00:06:15batchnorm layer so far
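A minimal sketch of the evaluation toggle being described here, assuming the `training` attribute exposed by the from-scratch layers built in this series:

```python
# Put every layer into inference mode so that batchnorm uses its running
# statistics instead of the statistics of the current batch.
for layer in layers:
    layer.training = False
```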
- 00:06:17and then we evaluate
- 00:06:19we see that currently we have validation
- 00:06:22loss of 2.10 which is fairly good but
- 00:06:25there's still ways to go but even at
- 00:06:282.10 we see that when we sample from the
- 00:06:30model we actually get relatively
- 00:06:31name-like results that do not exist in a
- 00:06:34training set so for example Yvonne kilo
- 00:06:37Pros
- 00:06:40Alaia Etc so certainly not
- 00:06:43reasonable not unreasonable I would say
- 00:06:46but not amazing and we can still push
- 00:06:48this validation loss even lower and get
- 00:06:50much better samples that are even more
- 00:06:52name-like
- 00:06:53so let's improve this model
- 00:06:56okay first let's fix this graph because
- 00:06:58it is daggers in my eyes and I just
- 00:07:00can't take it anymore
- 00:07:01um so lossi if you recall is a python
- 00:07:05list of floats so for example the first
- 00:07:0710 elements
- 00:07:10now what we'd like to do basically is we
- 00:07:12need to average up
- 00:07:14um some of these values to get a more
- 00:07:16sort of Representative uh value along
- 00:07:19the way so one way to do this is the
- 00:07:20following
- 00:07:21in pytorch if I create for example
- 00:07:24a tensor of the first 10 numbers
- 00:07:27then this is currently a one-dimensional
- 00:07:29array but recall that I can view this
- 00:07:31array as two-dimensional so for example
- 00:07:33I can use it as a two by five array and
- 00:07:36this is a 2d tensor now two by five and
- 00:07:39you see what pytorch has done is that
- 00:07:40the first row of this tensor is the
- 00:07:42first five elements and the second row
- 00:07:44is the second five elements
- 00:07:46I can also view it as a five by two as
- 00:07:48an example
- 00:07:50and then recall that I can also
- 00:07:52use negative one in place of one of
- 00:07:55these numbers
- 00:07:55and pytorch will calculate what that
- 00:07:58number must be in order to make the
- 00:07:59number of elements work out so this can
- 00:08:01be
- 00:08:03this or like that but it will work of
- 00:08:06course this would not work
- 00:08:09okay so this allows it to spread out
- 00:08:11some of the consecutive values into rows
- 00:08:13so that's very helpful because what we
- 00:08:15can do now is first of all we're going
- 00:08:17to create a torch.tensor out of the lossi
- 00:08:21list of floats
- 00:08:22and then we're going to view it as
- 00:08:24whatever it is but we're going to
- 00:08:26stretch it out into rows of 1000
- 00:08:29consecutive elements so the shape of
- 00:08:31this now becomes 200 by 1000. and each
- 00:08:35row is one thousand um consecutive
- 00:08:37elements in this list
- 00:08:39so that's very helpful because now we
- 00:08:41can do a mean along the rows
- 00:08:43and the shape of this will just be 200.
- 00:08:47and so we've taken basically the mean on
- 00:08:48every row so plt.plot of that should be
- 00:08:51something nicer
- 00:08:53much better
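A small sketch of the smoothing trick just described, with a synthetic placeholder list standing in for the recorded losses:

```python
import torch
import matplotlib.pyplot as plt

# lossi stands in for the Python list of per-step losses kept during
# training; here it is synthetic placeholder data of length 200,000.
lossi = [2.5 - 0.0001 * i for i in range(200_000)]

# View the flat list as rows of 1000 consecutive steps and average each
# row, giving one smoothed value per 1000 optimization steps.
smoothed = torch.tensor(lossi).view(-1, 1000).mean(1)   # shape: (200,)
plt.plot(smoothed)
plt.show()
```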
- 00:08:55so we see that we basically made a lot
- 00:08:56of progress and then here this is the
- 00:08:59learning rate Decay so here we see that
- 00:09:01the learning rate Decay subtracted a ton
- 00:09:03of energy out of the system and allowed
- 00:09:05us to settle into sort of the local
- 00:09:07minimum in this optimization
- 00:09:09so this is a much nicer plot let me come
- 00:09:12up and delete the monster and we're
- 00:09:15going to be using this going forward now
- 00:09:16next up what I'm bothered by is that you
- 00:09:19see our forward pass is a little bit
- 00:09:22gnarly and takes way too many lines of
- 00:09:24code
- 00:09:24so in particular we see that we've
- 00:09:26organized some of the layers inside the
- 00:09:28layers list but not all of them uh for
- 00:09:30no reason so in particular we see that
- 00:09:32we still have the embedding table a
- 00:09:34special case outside of the layers and
- 00:09:37in addition to that the viewing
- 00:09:39operation here is also outside of our
- 00:09:40layers so let's create layers for these
- 00:09:43and then we can add those layers to just
- 00:09:45our list
- 00:09:46so in particular the two things that we
- 00:09:48need is here we have this embedding
- 00:09:50table and we are indexing at the
- 00:09:53integers inside uh the batch XB uh
- 00:09:56inside the tensor xB
- 00:09:58so that's an embedding table lookup just
- 00:10:00done with indexing and then here we see
- 00:10:03that we have this view operation which
- 00:10:04if you recall from the previous video
- 00:10:06Simply rearranges the character
- 00:10:09embeddings and stretches them out into a
- 00:10:12row and effectively what print that does
- 00:10:14is the concatenation operation basically
- 00:10:16except it's free because viewing is very
- 00:10:19cheap in pytorch no no memory is being
- 00:10:22copied we're just re-representing how we
- 00:10:24view that tensor so let's create
- 00:10:27um
- 00:10:28modules for both of these operations the
- 00:10:31embedding operation and flattening
- 00:10:32operation
- 00:10:33so I actually wrote the code in just to
- 00:10:37save some time
- 00:10:38so we have a module Embedding and a
- 00:10:40module Flatten and both of them simply
- 00:10:43do the indexing operation in the forward
- 00:10:45pass and the flattening operation here
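A minimal sketch of the two modules being described, closely following the from-scratch layer API used throughout this series:

```python
import torch

class Embedding:
    # Lookup table mapping integer character indices to learned vectors.
    def __init__(self, num_embeddings, embedding_dim):
        self.weight = torch.randn((num_embeddings, embedding_dim))

    def __call__(self, IX):
        self.out = self.weight[IX]   # plain indexing performs the lookup
        return self.out

    def parameters(self):
        return [self.weight]


class Flatten:
    # Stretches everything after the batch dimension into one long row,
    # which acts as a free concatenation of the character embeddings.
    def __call__(self, x):
        self.out = x.view(x.shape[0], -1)
        return self.out

    def parameters(self):
        return []
```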
- 00:10:49and this C now will just become
- 00:10:53self.weight inside an Embedding module
- 00:10:56and I'm calling these layers
- 00:10:58specifically Embedding and Flatten
- 00:10:59because it turns out that both of them
- 00:11:01actually exist in pytorch so in
- 00:11:03pytorch we have nn.Embedding and
- 00:11:06it also takes the number of embeddings
- 00:11:07and the dimensionality of the embedding
- 00:11:09just like we have here but in addition
- 00:11:11pytorch takes in a lot of other keyword
- 00:11:13arguments that we are not using for our
- 00:11:15purposes yet
- 00:11:17and for flatten that also exists in
- 00:11:19pytorch and it also takes additional
- 00:11:21keyword arguments that we are not using
- 00:11:23so we have a very simple Flatten
- 00:11:26but both of them exist in pytorch
- 00:11:28they're just a bit simpler and now
- 00:11:30that we have these we can simply take
- 00:11:33out some of these special cased
- 00:11:36um things so instead of C we're just
- 00:11:40going to have an Embedding
- 00:11:41of vocab size and n_embd
- 00:11:45and then after the embedding we are
- 00:11:47going to flatten
- 00:11:48so let's construct those modules and now
- 00:11:51I can take out this the
- 00:11:53and here I don't have to special case
- 00:11:54anymore because now C is the embeddings
- 00:11:57weight and it's inside layers
- 00:12:01so this should just work
- 00:12:03and then here our forward pass
- 00:12:06simplifies substantially because we
- 00:12:08don't need to do these now outside of
- 00:12:10these layer outside and explicitly
- 00:12:13they're now inside layers
- 00:12:15so we can delete those
- 00:12:17but now to to kick things off we want
- 00:12:19this little X which in the beginning is
- 00:12:21just XB uh the tensor of integers
- 00:12:24specifying the identities of these
- 00:12:26characters at the input
- 00:12:27and so these characters can now directly
- 00:12:29feed into the first layer and this
- 00:12:31should just work
- 00:12:32so let me come here and insert a break
- 00:12:35because I just want to make sure that
- 00:12:36the first iteration of this runs and
- 00:12:38then there's no mistake so that ran
- 00:12:40properly and basically we substantially
- 00:12:42simplified the forward pass here okay
- 00:12:45I'm sorry I changed my microphone so
- 00:12:46hopefully the audio is a little bit
- 00:12:48better
- 00:12:49now one more thing that I would like to
- 00:12:51do in order to pytorch-ify our code even
- 00:12:53further is that right now we are
- 00:12:54maintaining all of our modules in a
- 00:12:56naked list of layers and we can also
- 00:12:59simplify this uh because we can
- 00:13:01introduce the concept of Pi torch
- 00:13:03containers so in torch.nn which we are
- 00:13:05basically rebuilding from scratch here
- 00:13:07there's a concept of containers
- 00:13:09and these containers are basically a way
- 00:13:10of organizing layers into
- 00:13:13lists or dicts and so on so in
- 00:13:16particular there's a sequential which
- 00:13:18maintains a list of layers and is a
- 00:13:20module class in pytorch and it basically
- 00:13:23just passes a given input through all
- 00:13:25the layers sequentially exactly as we
- 00:13:27are doing here
- 00:13:28so let's write our own sequential
- 00:13:31I've written a code here and basically
- 00:13:33the code for sequential is quite
- 00:13:35straightforward we pass in a list of
- 00:13:37layers which we keep here and then given
- 00:13:39any input in a forward pass we just call
- 00:13:41all the layers sequentially and return
- 00:13:43the result in terms of the parameters
- 00:13:45it's just all the parameters of the
- 00:13:46child modules
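A minimal sketch of such a Sequential container, following the same from-scratch module API:

```python
class Sequential:
    # Container that calls a list of layers one after another.
    def __init__(self, layers):
        self.layers = layers

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        self.out = x
        return self.out

    def parameters(self):
        # All parameters of the child modules, flattened into one list.
        return [p for layer in self.layers for p in layer.parameters()]
```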
- 00:13:48so we can run this and we can again
- 00:13:50simplify this substantially because we
- 00:13:52don't maintain this naked list of layers
- 00:13:54we now have a notion of a model which is
- 00:13:57a module and in particular is a
- 00:14:00sequential of all these layers
- 00:14:04and now parameters are simply just a
- 00:14:07model about parameters
- 00:14:09and so that list comprehension now lives
- 00:14:11here
- 00:14:13and then here we are press here we are
- 00:14:15doing all the things we used to do
- 00:14:17now here the code again simplifies
- 00:14:19substantially because we don't have to
- 00:14:22do this forwarding here instead of just
- 00:14:24call the model on the input data and the
- 00:14:26input data here are the integers inside
- 00:14:28xB so we can simply do logits which are
- 00:14:31the outputs of our model are simply the
- 00:14:33model called on xB
- 00:14:36and then the cross entropy here takes
- 00:14:38the logits and the targets
- 00:14:41so this simplifies substantially
- 00:14:43and then this looks good so let's just
- 00:14:46make sure this runs that looks good
- 00:14:49now here we actually have some work to
- 00:14:51do still here but I'm going to come back
- 00:14:52later for now there's no more layers
- 00:14:54there's a model.layers but it's not
- 00:14:57ideal to access attributes of these classes
- 00:15:00directly so we'll come back and fix this
- 00:15:01later
- 00:15:03and then here of course this simplifies
- 00:15:05substantially as well because logits are
- 00:15:07the model called on x
- 00:15:10and then these logits come here
- 00:15:14so we can evaluate the train and
- 00:15:15validation loss which currently is
- 00:15:17terrible because we just initialized the
- 00:15:19neural net and then we can also sample
- 00:15:21from the model and this simplifies
- 00:15:22dramatically as well
- 00:15:24because we just want to call the model
- 00:15:25onto the context and outcome logits
- 00:15:30and these logits go into softmax and get
- 00:15:32the probabilities Etc so we can sample
- 00:15:35from this model
- 00:15:37what did I screw up
- 00:15:42okay so I fixed the issue and we now get
- 00:15:44the result that we expect which is
- 00:15:46gibberish because the model is not
- 00:15:48trained because we re-initialize it from
- 00:15:49scratch
- 00:15:50the problem was that when I fixed this
- 00:15:52cell to be modeled out layers instead of
- 00:15:52cell to be model.layers instead of
- 00:15:56cell and so our neural net was in a
- 00:15:58training mode and what caused the issue
- 00:16:01here is the batchnorm layer as batchnorm
- 00:16:03layers like to do because
- 00:16:05batchnorm was in a training mode and here
- 00:16:07we are passing in an input which is a
- 00:16:09batch of just a single example made up
- 00:16:11of the context
- 00:16:12and so if you are trying to pass in a
- 00:16:15single example into a batchnorm that is
- 00:16:16in the training mode you're going to end
- 00:16:18up estimating the variance using the
- 00:16:20input and the variance of a single
- 00:16:21number is is not a number because it is
- 00:16:24a measure of a spread so for example the
- 00:16:26variance of just the single number five
- 00:16:28you can see is not a number and so
- 00:16:31that's what happened in the master
- 00:16:33basically caused an issue and then that
- 00:16:35polluted all of the further processing
- 00:16:37so all that we have to do was make sure
- 00:16:39that this runs and we basically made the
- 00:16:43issue of
- 00:16:45again we didn't actually see the issue
- 00:16:46with the loss we could have evaluated
- 00:16:48the loss but we got the wrong result
- 00:16:49because batchnorm was in the training mode
- 00:16:52and uh and so we still get a result it's
- 00:16:54just the wrong result because it's using
- 00:16:56the uh sample statistics of the batch
- 00:16:59whereas we want to use the running mean
- 00:17:00and running variance inside the batchnorm
- 00:17:02and so
- 00:17:04again an example of introducing a bug
- 00:17:06inline because we did not properly
- 00:17:09maintain the state of what is training
- 00:17:10or not okay so I Rewritten everything
- 00:17:12and here's where we are as a reminder we
- 00:17:15have the training loss of 2.05 and
- 00:17:17validation 2.10
- 00:17:18now because these losses are very
- 00:17:21similar to each other we have a sense
- 00:17:22that we are not overfitting too much on
- 00:17:24this task and we can make additional
- 00:17:26progress in our performance by scaling
- 00:17:28up the size of the neural network and
- 00:17:29making everything bigger and deeper
- 00:17:32now currently we are using this
- 00:17:33architecture here where we are taking in
- 00:17:35some number of characters going into a
- 00:17:37single hidden layer and then going to
- 00:17:39the prediction of the next character
- 00:17:41the problem here is we don't have a
- 00:17:43naive way of making this bigger in a
- 00:17:46productive way we could of course use
- 00:17:48our layers sort of building blocks and
- 00:17:51materials to introduce additional layers
- 00:17:53here and make the network deeper but it
- 00:17:55is still the case that we are crushing
- 00:17:56all of the characters into a single
- 00:17:58layer all the way at the beginning
- 00:18:00and even if we make this a bigger layer
- 00:18:02and add neurons it's still kind of like
- 00:18:04silly to squash all that information so
- 00:18:07fast in a single step
- 00:18:09so we'd like to do instead is we'd like
- 00:18:11our Network to look a lot more like this
- 00:18:13in the wavenet case so you see in the
- 00:18:15wavenet when we are trying to make the
- 00:18:17prediction for the next character in the
- 00:18:18sequence it is a function of the
- 00:18:20previous characters that are feeding
- 00:18:22that feed in but not all of these
- 00:18:25different characters are not just
- 00:18:26crushed to a single layer and then you
- 00:18:28have a sandwich they are crushed slowly
- 00:18:31so in particular we take two characters
- 00:18:34and we fuse them into sort of like a
- 00:18:36bigram representation and we do that
- 00:18:38for all these characters consecutively
- 00:18:40and then we take the bigrams and we fuse
- 00:18:42those into four character level chunks
- 00:18:46and then we fuse that again and so we do
- 00:18:49that in this like tree-like hierarchical
- 00:18:51manner so we fuse the information from
- 00:18:53the previous context slowly into the
- 00:18:56network as it gets deeper and so this is
- 00:18:58the kind of architecture that we want to
- 00:18:59implement
- 00:19:00now in the wave Nets case this is a
- 00:19:02visualization of a stack of dilated
- 00:19:04causal convolution layers and this makes
- 00:19:07it sound very scary but actually the
- 00:19:08idea is very simple and the fact that
- 00:19:10it's a dilated causal convolution layer
- 00:19:12is really just an implementation detail
- 00:19:14to make everything fast we're going to
- 00:19:16see that later but for now let's just
- 00:19:18keep the basic idea of it which is this
- 00:19:20Progressive Fusion so we want to make
- 00:19:22the network deeper and at each level we
- 00:19:24want to fuse only two consecutive
- 00:19:26elements two characters then two bigrams
- 00:19:29then two four grams and so on so let's
- 00:19:32implement this okay so first up let me
- 00:19:34scroll to where we built the data set
- 00:19:35and let's change the block size from 3
- 00:19:37to 8. so we're going to be taking eight
- 00:19:39characters of context to predict the
- 00:19:42ninth character so the data set now
- 00:19:44looks like this we have a lot more
- 00:19:45context feeding in to predict any next
- 00:19:47character in a sequence and these eight
- 00:19:49characters are going to be processed in
- 00:19:51this tree like structure
- 00:19:53now if we scroll here everything here
- 00:19:56should just be able to work so we should
- 00:19:58be able to redefine the network
- 00:19:59you see the number of parameters has
- 00:20:01increased by 10 000 and that's because
- 00:20:03the block size has grown so this first
- 00:20:06linear layer is much much bigger our
- 00:20:08linear layer now takes eight characters
- 00:20:10into this middle layer so there's a lot
- 00:20:13more parameters there but this should
- 00:20:15just run let me just break right after
- 00:20:18the very first iteration so you see that
- 00:20:20this runs just fine it's just that this
- 00:20:22network doesn't make too much sense
- 00:20:23we're crushing way too much information
- 00:20:25way too fast
- 00:20:26so let's now come in and see how we
- 00:20:29could try to implement the hierarchical
- 00:20:30scheme now before we dive into the
- 00:20:33detail of the re-implementation here I
- 00:20:35was just curious to actually run it and
- 00:20:37see where we are in terms of the
- 00:20:38Baseline performance of just lazily
- 00:20:40scaling up the context length so I'll
- 00:20:42let it run we get a nice loss curve and
- 00:20:45then evaluating the loss we actually see
- 00:20:46quite a bit of improvement just from
- 00:20:48increasing the context length so I
- 00:20:51started a little bit of a performance
- 00:20:52log here and previously where we were is
- 00:20:54we were getting a performance of 2.10 on
- 00:20:57the validation loss and now simply
- 00:20:59scaling up the context length from 3 to
- 00:21:018 gives us a performance of 2.02 so
- 00:21:05quite a bit of an improvement here and
- 00:21:07also when you sample from the model you
- 00:21:08see that the names are definitely
- 00:21:10improving qualitatively as well
- 00:21:13so we could of course spend a lot of
- 00:21:14time here tuning
- 00:21:16um uh tuning things and making it even
- 00:21:18bigger and scaling up the network
- 00:21:19further even with the simple
- 00:21:21um sort of setup here but let's continue
- 00:21:24and let's implement the hierarchical model and treat
- 00:21:27this as just a rough baseline
- 00:21:28performance but there's a lot of
- 00:21:30optimization like left on the table in
- 00:21:32terms of some of the hyper parameters
- 00:21:34that you're hopefully getting a sense of
- 00:21:35now okay so let's scroll up now
- 00:21:38and come back up and what I've done here
- 00:21:41is I've created a bit of a scratch space
- 00:21:42for us to just like look at the forward
- 00:21:45pass of the neural net and inspect the
- 00:21:47shape of the tensor along the way as the
- 00:21:49neural net uh forwards so here I'm just
- 00:21:53temporarily for debugging creating a
- 00:21:55batch of just say four examples so four
- 00:21:58random integers then I'm plucking out
- 00:22:00those rows from our training set
- 00:22:02and then I'm passing into the model the
- 00:22:04input xB
- 00:22:06now the shape of XB here because we have
- 00:22:08only four examples is four by eight and
- 00:22:11this eight is now the current block size
- 00:22:14so uh inspecting XP we just see that we
- 00:22:18have four examples each one of them is a
- 00:22:19row of xB
- 00:22:21and we have eight characters here and
- 00:22:24this integer tensor just contains the
- 00:22:26identities of those characters
- 00:22:29so the first layer of our neural net is
- 00:22:31the embedding layer so passing XB this
- 00:22:33integer tensor through the embedding
- 00:22:35layer creates an output that is four by
- 00:22:37eight by ten
- 00:22:39so our embedding table has for each
- 00:22:42character a 10-dimensional vector that
- 00:22:44we are trying to learn
- 00:22:46and so what the embedding layer does
- 00:22:48here is it plucks out the embedding
- 00:22:50Vector for each one of these integers
- 00:22:53and organizes it all in a four by eight
- 00:22:56by ten tensor now
- 00:22:58so all of these integers are translated
- 00:23:00into 10 dimensional vectors inside this
- 00:23:02three-dimensional tensor now
- 00:23:04passing that through the flattened layer
- 00:23:06as you recall what this does is it views
- 00:23:09this tensor as just a 4 by 80 tensor and
- 00:23:12what that effectively does is that all
- 00:23:15these 10 dimensional embeddings for all
- 00:23:16these eight characters just end up being
- 00:23:18stretched out into a long row
- 00:23:21and that looks kind of like a
- 00:23:22concatenation operation basically so by
- 00:23:25viewing the tensor differently we now
- 00:23:27have a four by eighty and inside this 80
- 00:23:29it's all the 10 dimensional uh
- 00:23:32vectors just uh concatenate next to each
- 00:23:35other
- 00:23:36and then the linear layer of course
- 00:23:37takes uh 80 and creates 200 channels
- 00:23:40just via matrix multiplication
- 00:23:43so so far so good now I'd like to show
- 00:23:45you something surprising
- 00:23:47let's look at the insides of the linear
- 00:23:50layer and remind ourselves how it works
- 00:23:52the linear layer here in the forward
- 00:23:54pass takes the input X multiplies it
- 00:23:56with a weight and then optionally adds
- 00:23:58bias and the weight here is
- 00:24:00two-dimensional as defined here and the
- 00:24:02bias is one dimensional here
- 00:24:04so effectively in terms of the shapes
- 00:24:06involved what's happening inside this
- 00:24:08linear layer looks like this right now
- 00:24:10and I'm using random numbers here but
- 00:24:12I'm just illustrating the shapes and
- 00:24:15what happens
- 00:24:16basically a 4 by 80 input comes into the
- 00:24:18linear layer that's multiplied by this
- 00:24:2080 by 200 weight Matrix inside and
- 00:24:23there's a plus 200 bias and the shape of
- 00:24:25the whole thing that comes out of the
- 00:24:26linear layer is four by two hundred as
- 00:24:28we see here
- 00:24:30now notice here by the way that this
- 00:24:32here will create a 4x200 tensor and then
- 00:24:36plus 200 there's a broadcasting
- 00:24:38happening here about 4 by 200 broadcasts
- 00:24:41with 200 uh so everything works here
- 00:24:44so now the surprising thing that I'd
- 00:24:46like to show you that you may not expect
- 00:24:47is that this input here that is being
- 00:24:49multiplied uh doesn't actually have to
- 00:24:52be two-dimensional this Matrix multiply
- 00:24:55operator in pytorch is quite powerful
- 00:24:56and in fact you can actually pass in
- 00:24:58higher dimensional arrays or tensors and
- 00:25:00everything works fine so for example
- 00:25:02this could be four by five by eighty and
- 00:25:04the result in that case will become four
- 00:25:06by five by two hundred
- 00:25:08you can add as many dimensions as you
- 00:25:09like on the left here
- 00:25:11and so effectively what's happening is
- 00:25:13that the matrix multiplication only
- 00:25:15works on the last Dimension and the
- 00:25:17dimensions before it in the input tensor
- 00:25:19are left unchanged
- 00:25:24so that is basically these um these
- 00:25:27dimensions on the left are all treated
- 00:25:29as just a batch Dimension so we can have
- 00:25:32multiple batch dimensions and then in
- 00:25:34parallel over all those Dimensions we
- 00:25:36are doing the matrix multiplication on
- 00:25:38the last dimension
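A small sketch of this batched matrix-multiply behavior, with arbitrary example shapes:

```python
import torch

# The matrix multiply acts only on the last dimension of the left operand;
# every dimension before it is treated as a batch dimension.
x = torch.randn(4, 5, 80)     # (batch, groups, features)
W = torch.randn(80, 200)
b = torch.randn(200)

y = x @ W + b                 # bias broadcasts over the leading dimensions
print(y.shape)                # torch.Size([4, 5, 200])
```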
- 00:25:39so this is quite convenient because we
- 00:25:41can use that in our Network now
- 00:25:44because remember that we have these
- 00:25:46eight characters coming in
- 00:25:49and we don't want to now uh flatten all
- 00:25:51of it out into a large eight-dimensional
- 00:25:53vector
- 00:25:54because we don't want to matrix multiply
- 00:25:57all 80 numbers
- 00:25:59into a weight matrix
- 00:26:01immediately instead we want to group
- 00:26:03these
- 00:26:04like this
- 00:26:06so every consecutive two elements
- 00:26:09one two and three and four and five and
- 00:26:11six and seven and eight all of these
- 00:26:12should be now
- 00:26:14basically flattened out and multiplied
- 00:26:17by weight Matrix but all of these four
- 00:26:19groups here we'd like to process in
- 00:26:21parallel so it's kind of like a batch
- 00:26:23Dimension that we can introduce
- 00:26:25and then we can in parallel basically
- 00:26:28process all of these uh bigram groups in
- 00:26:33the four batch dimensions of an
- 00:26:34individual example and also over the
- 00:26:37actual batch dimension of the you know
- 00:26:39four examples in our example here so
- 00:26:42let's see how that works effectively
- 00:26:43what we want is right now we take a 4 by
- 00:26:4680
- 00:26:47and multiply it by 80 by 200
- 00:26:50to in the linear layer this is what
- 00:26:52happens
- 00:26:53but instead what we want is we don't
- 00:26:56want 80 characters or 80 numbers to come
- 00:26:58in we only want two characters to come
- 00:27:00in on the very first layer and those two
- 00:27:02characters should be fused
- 00:27:04so in other words we just want 20 to
- 00:27:07come in right 20 numbers would come in
- 00:27:11and here we don't want a 4 by 80 to feed
- 00:27:13into the linear layer we actually want
- 00:27:15these groups of two to feed in so
- 00:27:17instead of four by eighty we want this
- 00:27:19to be a 4 by 4 by 20.
- 00:27:23so these are the four groups of two and
- 00:27:27each one of them is ten dimensional
- 00:27:28vector
- 00:27:29so what we want is now is we need to
- 00:27:31change the flattened layer so it doesn't
- 00:27:33output a four by eighty but it outputs a
- 00:27:35four by four by Twenty where basically
- 00:27:38these um
- 00:27:39every two consecutive characters are uh
- 00:27:43packed in on the very last Dimension and
- 00:27:46then these four is the first batch
- 00:27:48Dimension and this four is the second
- 00:27:50batch Dimension referring to the four
- 00:27:52groups inside every one of these
- 00:27:54examples
- 00:27:55and then this will just multiply like
- 00:27:57this so this is what we want to get to
- 00:27:59so we're going to have to change the
- 00:28:01linear layer in terms of how many inputs
- 00:28:02it expects it shouldn't expect 80 it
- 00:28:05should just expect 20 numbers and we
- 00:28:07have to change our flattened layer so it
- 00:28:09doesn't just fully flatten out this
- 00:28:11entire example it needs to create a 4x4
- 00:28:14by 20 instead of four by eighty so let's
- 00:28:17see how this could be implemented
- 00:28:19basically right now we have an input
- 00:28:21that is a four by eight by ten that
- 00:28:23feeds into the flattened layer and
- 00:28:25currently the flattened layer just
- 00:28:27stretches it out so if you remember the
- 00:28:29implementation of flatten
- 00:28:31it takes RX and it just views it as
- 00:28:34whatever the batch Dimension is and then
- 00:28:35negative one
- 00:28:37so effectively what it does right now is
- 00:28:39it does e dot view of 4 negative one and
- 00:28:42the shape of this of course is 4 by 80.
- 00:28:45so that's what currently happens and we
- 00:28:48instead want this to be a four by four
- 00:28:49by Twenty where these consecutive
- 00:28:51ten-dimensional vectors get concatenated
- 00:28:54so you know how in Python you can take a
- 00:28:57list of range of 10
- 00:29:00so we have numbers from zero to nine and
- 00:29:03we can index like this to get all the
- 00:29:05even parts
- 00:29:06and we can also index like starting at
- 00:29:08one and going in steps up two to get all
- 00:29:11the odd parts
- 00:29:13so one way to implement this it would be
- 00:29:15as follows we can take e and we can
- 00:29:18index into it for all the batch elements
- 00:29:21and then just even elements in this
- 00:29:24Dimension so at indexes 0 2 4 and 8.
- 00:29:29and then all the parts here from this
- 00:29:31last dimension
- 00:29:33and this gives us the even characters
- 00:29:37and then here
- 00:29:39this gives us all the odd characters and
- 00:29:42basically what we want to do is we make
- 00:29:43sure we want to make sure that these get
- 00:29:44concatenated in pi torch and then we
- 00:29:47want to concatenate these two tensors
- 00:29:49along the second dimension
- 00:29:53so this and the shape of it would be
- 00:29:55four by four by Twenty this is
- 00:29:57definitely the result we want we are
- 00:29:58explicitly grabbing the even parts and
- 00:30:01the odd parts and we're arranging those
- 00:30:03four by four by ten right next to each
- 00:30:06other and concatenate
- 00:30:08so this works but it turns out that what
- 00:30:10also works is you can simply use a view
- 00:30:13again and just request the right shape
- 00:30:16and it just so happens that in this case
- 00:30:18those vectors will again end up being
- 00:30:21arranged in exactly the way we want so
- 00:30:23in particular if we take e and we just
- 00:30:25view it as a four by four by Twenty
- 00:30:27which is what we want
- 00:30:28we can check that this is exactly equal
- 00:30:30to but let me call this this is the
- 00:30:33explicit concatenation I suppose
- 00:30:36um
- 00:30:36so explosives dot shape is 4x4 by 20. if
- 00:30:40you just view it as 4x4 by 20 you can
- 00:30:42check that when you compare to explicit
- 00:30:46uh you got a big this is element wise
- 00:30:48operation so making sure that all of
- 00:30:49them are true that is the truth so
- 00:30:53basically long story short we don't need
- 00:30:54to make an explicit call to concatenate
- 00:30:56Etc we can simply take this input tensor
- 00:31:00to flatten and we can just view it in
- 00:31:03whatever way we want
- 00:31:04and in particular you don't want to
- 00:31:07stretch things out with negative one we
- 00:31:09want to actually create a
- 00:31:10three-dimensional array and depending on
- 00:31:12how many vectors that are consecutive we
- 00:31:15want to
- 00:31:16um fuse like for example two then we can
- 00:31:20just simply ask for this Dimension to be
- 00:31:2120. and um
- 00:31:24use a negative 1 here and python will
- 00:31:26figure out how many groups it needs to
- 00:31:27pack into this additional batch
- 00:31:29dimension
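A small sketch of the equivalence being described, using a random placeholder tensor in place of the embedding output e:

```python
import torch

# e stands in for the embedding output of shape (batch, time, channels),
# here (4, 8, 10) as in the walkthrough.
e = torch.randn(4, 8, 10)

# Explicitly put each even-indexed embedding next to the odd-indexed
# embedding that follows it.
explicit = torch.cat([e[:, ::2, :], e[:, 1::2, :]], dim=2)   # (4, 4, 20)

# A plain view produces exactly the same arrangement for free.
assert (e.view(4, 4, 20) == explicit).all()

# Equivalent, letting PyTorch infer the group dimension with -1.
assert (e.view(4, -1, 20) == explicit).all()
```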
- 00:31:30so let's now go into flatten and
- 00:31:32implement this okay so I scroll up here
- 00:31:34to flatten and what we'd like to do is
- 00:31:36we'd like to change it now so let me
- 00:31:38create a Constructor and take the number
- 00:31:40of elements that are consecutive that we
- 00:31:42would like to concatenate now in the
- 00:31:44last dimension of the output
- 00:31:46so here we're just going to remember
- 00:31:48self.n equals n
- 00:31:50and then I want to be careful here
- 00:31:52because pytorch actually has a
- 00:31:54torch.flatten and its keyword
- 00:31:56arguments are different and they kind of
- 00:31:58like function differently so our flatten
- 00:32:00is going to start to depart from pytorch's
- 00:32:02flatten so let me call it
- 00:32:04FlattenConsecutive or something like that just
- 00:32:06to make sure that our apis are about
- 00:32:08equal
- 00:32:09so this uh basically flattens only some
- 00:32:13n consecutive elements and puts them
- 00:32:15into the last dimension
- 00:32:17now here the shape of X is B by T by C
- 00:32:21so let me
- 00:32:23pop those out into variables and recall
- 00:32:26that in our example down below B was 4 T
- 00:32:28was 8 and C was 10.
- 00:32:33now instead of doing x dot view of B by
- 00:32:37negative one
- 00:32:39right this is what we had before
- 00:32:44we want this to be B by
- 00:32:47um negative 1 by
- 00:32:49and basically here we want c times n
- 00:32:52that's how many consecutive elements we
- 00:32:55want
- 00:32:56and here instead of negative one I don't
- 00:32:58super love the use of negative one
- 00:33:00because I like to be very explicit so
- 00:33:02that you get error messages when things
- 00:33:03don't go according to your expectation
- 00:33:04so what do we expect here we expect this
- 00:33:07to become t
- 00:33:09divide n using integer division here
- 00:33:12so that's what I expect to happen
- 00:33:14and then one more thing I want to do
- 00:33:15here is remember previously all the way
- 00:33:18in the beginning n was three and uh
- 00:33:21basically we're concatenating
- 00:33:23um all the three characters that existed
- 00:33:25there
- 00:33:26so we basically are concatenated
- 00:33:28everything
- 00:33:29and so sometimes I can create a spurious
- 00:33:31dimension of one here so if it is the
- 00:33:34case that x dot shape at one is one then
- 00:33:37it's kind of like a spurious dimension
- 00:33:39um so we don't want to return a
- 00:33:41three-dimensional tensor with a one here
- 00:33:44we just want to return a two-dimensional
- 00:33:46tensor exactly as we did before
- 00:33:48so in this case basically we will just
- 00:33:50say x equals x dot squeeze that is a
- 00:33:54pytorch function
- 00:33:56and squeeze takes a dimension that it
- 00:34:01either squeezes out all the dimensions
- 00:34:02of a tensor that are one or you can
- 00:34:05specify the exact Dimension that you
- 00:34:08want to be squeezed and again I like to
- 00:34:10be as explicit as possible always so I
- 00:34:12expect to squeeze out the First
- 00:34:13Dimension only
- 00:34:15of this tensor
- 00:34:17this three-dimensional tensor and if
- 00:34:19this Dimension here is one then I just
- 00:34:21want to return B by c times n
- 00:34:24and so self dot out will be X and then
- 00:34:26we return self.out
- 00:34:28so that's the candidate implementation
- 00:34:30and of course this should be self.n
- 00:34:33instead of just n
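A sketch of the FlattenConsecutive module as just described, under the same from-scratch layer API:

```python
import torch

class FlattenConsecutive:
    # Concatenates every n consecutive embeddings along the last dimension.
    def __init__(self, n):
        self.n = n

    def __call__(self, x):
        B, T, C = x.shape
        x = x.view(B, T // self.n, C * self.n)
        if x.shape[1] == 1:
            # Drop the spurious middle dimension when everything was fused.
            x = x.squeeze(1)
        self.out = x
        return self.out

    def parameters(self):
        return []
```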
- 00:34:34so let's run
- 00:34:36and let's come here now
- 00:34:39and take it for a spin so flatten
- 00:34:41consecutive
- 00:34:44and in the beginning let's just use
- 00:34:47eight so this should recover the
- 00:34:49previous Behavior so FlattenConsecutive
- 00:34:51of eight uh which is the
- 00:34:53current block size
- 00:34:55we can do this uh that should recover
- 00:34:57the previous Behavior
- 00:34:59so we should be able to run the model
- 00:35:02and here we can inspect I have a little
- 00:35:06code snippet here where I iterate over
- 00:35:08all the layers I print the name of this
- 00:35:11class and the shape
- 00:35:14and so we see the shapes as we expect
- 00:35:17them after every single layer in the top
- 00:35:19bit so now let's try to restructure it
- 00:35:22using our flattened consecutive and do
- 00:35:25it hierarchically so in particular
- 00:35:28we want to flatten consecutive not just
- 00:35:30not block size but just two
- 00:35:33and then we want to process this with
- 00:35:34linear now then the number of inputs to
- 00:35:37this linear will not be n_embd times
- 00:35:38block size it will now only be n_embd
- 00:35:41times two,
- 00:35:42which is 20.
- 00:35:44this goes through the first layer and
- 00:35:46now we can in principle just copy paste
- 00:35:48this
- 00:35:49now the next linear layer should expect
- 00:35:51n_hidden times two
- 00:35:53and the last piece of it should expect
- 00:35:58n_hidden times two again
- 00:36:01so this is sort of like the naive
- 00:36:03version of it
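A sketch of the hierarchical stack being assembled here, assuming the from-scratch modules built earlier (Embedding, FlattenConsecutive, Linear, BatchNorm1d, Tanh, Sequential); the channel sizes are illustrative, and the lecture later settles on 68 hidden units to match the earlier parameter count:

```python
n_embd, n_hidden, vocab_size = 10, 200, 27  # illustrative hyperparameters

model = Sequential([
    Embedding(vocab_size, n_embd),
    FlattenConsecutive(2), Linear(n_embd * 2, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
    FlattenConsecutive(2), Linear(n_hidden * 2, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
    FlattenConsecutive(2), Linear(n_hidden * 2, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
    Linear(n_hidden, vocab_size),
])
```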
- 00:36:04um
- 00:36:05so running this we now have a much much
- 00:36:07bigger model
- 00:36:09and we should be able to basically just
- 00:36:10forward the model
- 00:36:13and now we can inspect uh the numbers in
- 00:36:16between
- 00:36:17so four by eight by 20
- 00:36:19was flattened consecutively into four by
- 00:36:21four by twenty
- 00:36:23this was projected into four by four by
- 00:36:24two hundred
- 00:36:26and then batchnorm just worked out of
- 00:36:29the box, we have to verify that batchnorm
- 00:36:31does the correct thing even though it
- 00:36:33takes a three-dimensional input instead
- 00:36:34of a two-dimensional input
- 00:36:36then we have tanh which is element wise
- 00:36:38then we crushed it again so if we
- 00:36:41flatten consecutively and ended up with
- 00:36:42a four by two by 400 now
- 00:36:45then linear brought it back down to 200
- 00:36:47batchnorm tanh and lastly we get a 4 by
- 00:36:50400 and we see that the flattened
- 00:36:52consecutive for the last flatten here uh
- 00:36:54it squeezed out that dimension of one so
- 00:36:57we only ended up with four by four
- 00:36:58hundred and then linear, batchnorm, tanh
- 00:37:00and uh the last linear layer to get our
- 00:37:04logits and so the logits end up in the
- 00:37:06same shape as they were before but now
- 00:37:08we actually have a nice three layer
- 00:37:10neural net and it basically corresponds
- 00:37:12to whoops sorry it basically corresponds
- 00:37:15exactly to this network now except only
- 00:37:18this piece here because we only have
- 00:37:20three layers whereas here in this
- 00:37:22example there's uh four layers with the
- 00:37:25total receptive field size of 16
- 00:37:28characters instead of just eight
- 00:37:29characters so the block size here is 16.
- 00:37:32so this piece of it's basically
- 00:37:34implemented here
- 00:37:36um now we just have to kind of figure
- 00:37:38out some good Channel numbers to use
- 00:37:40here now in particular I changed the
- 00:37:42number of hidden units to be 68 in this
- 00:37:45architecture because when I use 68 the
- 00:37:47number of parameters comes out to be 22
- 00:37:49000 so that's exactly the same that we
- 00:37:52had before and we have the same amount
- 00:37:54of capacity at this neural net in terms
- 00:37:56of the number of parameters but the
- 00:37:57question is whether we are utilizing
- 00:37:59those parameters in a more efficient
- 00:38:00architecture so what I did then is I got
- 00:38:03rid of a lot of the debugging cells here
- 00:38:05and I rerun the optimization and
- 00:38:07scrolling down to the result we see that
- 00:38:09we get the identical performance roughly
- 00:38:12so our validation loss now is 2.029 and
- 00:38:15previously it was 2.027 so controlling
- 00:38:18for the number of parameters changing
- 00:38:20from the flat to hierarchical is not
- 00:38:21giving us anything yet
- 00:38:23that said there are two things
- 00:38:25um to point out number one we didn't
- 00:38:27really torture the um architecture here
- 00:38:29very much this is just my first guess
- 00:38:31and there's a bunch of hyper parameters
- 00:38:33search that we could do in order in
- 00:38:35terms of how we allocate uh our budget
- 00:38:37of parameters to what layers number two
- 00:38:39we still may have a bug inside the
- 00:38:42batchnorm1d layer so let's take a look
- 00:38:44at
- 00:38:45um uh that because it runs but does it
- 00:38:49do the right thing
- 00:38:50so I pulled up the layer inspector sort
- 00:38:53of that we have here and printed out the
- 00:38:55shape along the way and currently it
- 00:38:57looks like the batchnorm is receiving
- 00:38:58an input that is 32 by 4 by 68 right and
- 00:39:03here on the right I have the current
- 00:39:04implementation of batchnorm that we have
- 00:39:05right now
- 00:39:06now this batchnorm assumed in the way we
- 00:39:09wrote it and at the time that X is
- 00:39:11two-dimensional so it was n by D where n
- 00:39:15was the batch size so that's why we only
- 00:39:17reduced uh the mean and the variance
- 00:39:19over the zeroth dimension but now X will
- 00:39:21basically become three-dimensional so
- 00:39:23what's happening inside the batchnorm
- 00:39:24right now and how come it's working at
- 00:39:26all and not giving any errors the reason
- 00:39:28for that is basically because everything
- 00:39:30broadcasts properly but the batchnorm is
- 00:39:32not doing what we need what we wanted to
- 00:39:34do
- 00:39:35so in particular let's basically think
- 00:39:37through what's happening inside the
- 00:39:38batchnorm uh looking at
- 00:39:41what's happening here
- 00:39:43I have the code here
- 00:39:45so we're receiving an input of 32 by 4
- 00:39:47by 68 and then we are doing uh here x
- 00:39:52dot mean here I have e instead of X but
- 00:39:54we're doing the mean over zero and
- 00:39:57that's actually giving us 1 by 4 by 68.
- 00:39:59so we're doing the mean only over the
- 00:40:01very first Dimension and it's giving us
- 00:40:03a mean and a variance that still
- 00:40:05maintain this Dimension here
- 00:40:07so these means are only taking over 32
- 00:40:10numbers in the First Dimension and then
- 00:40:12when we perform this everything
- 00:40:14broadcasts correctly still
- 00:40:16but basically what ends up happening is
- 00:40:20when we also look at the running mean
- 00:40:26the shape of it so I'm looking at the
- 00:40:27model.layers at 3 which is the
- 00:40:28first batchnorm layer and looking
- 00:40:30at whatever the running mean became and
- 00:40:32its shape
- 00:40:34the shape of this running mean now is 1
- 00:40:35by 4 by 68.
- 00:40:38right instead of it being
- 00:40:39um you know just a size of dimension
- 00:40:43because we have 68 channels we expect to
- 00:40:45have 68 means and variances that we're
- 00:40:47maintaining but actually we have an
- 00:40:49array of 4 by 68 and so basically what
- 00:40:51this is telling us is that
- 00:40:55this batchnorm is currently working in
- 00:40:57parallel
- 00:40:58over
- 00:41:014 times 68 instead of just 68 channels
- 00:41:06so basically we are maintaining
- 00:41:08statistics for every one of these four
- 00:41:10positions individually and independently
- 00:41:13and instead what we want to do is we
- 00:41:15want to treat this four as a batch
- 00:41:16Dimension just like the zeroth dimension
- 00:41:19so as far as the batchnorm is concerned
- 00:41:22it doesn't want to average we don't want
- 00:41:24to average over 32 numbers we want to
- 00:41:26now average over 32 times four numbers
- 00:41:29for every single one of these 68
- 00:41:31channels
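For concreteness, a minimal sketch of the buggy behavior; e is a stand-in for the 32 by 4 by 68 activations entering the layer, and model.layers[3] refers to the first batch norm layer as in the inspector above:

```python
import torch

e = torch.randn(32, 4, 68)           # stand-in for the activations flowing into the batch norm
emean = e.mean(0, keepdim=True)      # reduces only over the batch dimension
print(emean.shape)                   # torch.Size([1, 4, 68]): 4*68 separate means, not 68
# and correspondingly model.layers[3].running_mean.shape comes out as (1, 4, 68)
```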
- 00:41:32So let me now
- 00:41:34remove this.
- 00:41:36It turns out that when you look at the
- 00:41:38documentation of torch.mean,
- 00:41:42so let's go to torch.mean,
- 00:41:49in one of its signatures, when we specify
- 00:41:51the dimension,
- 00:41:53we see that the dimension here is not
- 00:41:54just an int; it can also be a
- 00:41:56tuple of ints, so we can reduce over
- 00:41:59multiple dimensions at the same time.
- 00:42:02So instead of just reducing over zero, we
- 00:42:04can pass in the tuple (0, 1),
- 00:42:08and we pass (0, 1) here as well. Then
- 00:42:10the output, of
- 00:42:12course, is going to be the same,
- 00:42:13but now,
- 00:42:15because we reduce over 0 and 1, if we
- 00:42:17look at the shape of this mean,
- 00:42:20we see that we took
- 00:42:22the mean over both the zeroth and the
- 00:42:25first dimension,
- 00:42:26so we're just getting 68 numbers and a
- 00:42:28bunch of spurious dimensions here.
- 00:42:30So now this becomes 1 by 1 by 68, and the
- 00:42:34running mean and the running variance
- 00:42:35analogously will become 1 by 1 by
- 00:42:3768. So even though there are these
- 00:42:39spurious dimensions, the
- 00:42:41correct thing will happen, in
- 00:42:43that we are only maintaining means and
- 00:42:45variances for 68 channels,
- 00:42:49and we're not calculating the mean and
- 00:42:50variance across 32 times 4 separate positions.
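In code, the tuple-of-dims version looks roughly like this:

```python
import torch

e = torch.randn(32, 4, 68)                # same stand-in activations as before
emean = e.mean((0, 1), keepdim=True)      # reduce over dims 0 and 1 together
evar = e.var((0, 1), keepdim=True)
print(emean.shape, evar.shape)            # torch.Size([1, 1, 68]) torch.Size([1, 1, 68])
```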
- 00:42:54So that's exactly what we want. Let's
- 00:42:56change the implementation of BatchNorm1d
- 00:42:58that we have so that it can take in
- 00:43:01two-dimensional or three-dimensional
- 00:43:02inputs and behave accordingly. At the
- 00:43:05end of the day, the fix is relatively
- 00:43:07straightforward. Basically, the dimension
- 00:43:09we want to reduce over is either 0 or
- 00:43:12the tuple (0, 1), depending on the
- 00:43:14dimensionality of x. So if x.ndim
- 00:43:16is two, so it's a two-dimensional tensor,
- 00:43:18then the dimension we want to reduce over is
- 00:43:20just the integer zero.
- 00:43:22Elif x.ndim is three, so it's a
- 00:43:24three-dimensional tensor, then the dims
- 00:43:26we're going to reduce over are zero and one,
- 00:43:29and then
- 00:43:31here we just pass in dim.
- 00:43:33And if the dimensionality of x is
- 00:43:35anything else, we'll now get an error,
- 00:43:36which is good.
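Roughly, the change inside our BatchNorm1d's call looks like the following sketch (x stands for the layer input):

```python
import torch

x = torch.randn(32, 4, 68)    # stand-in for the input to the BatchNorm1d layer

# inside BatchNorm1d's __call__, choose the dims to reduce over based on the input rank
if x.ndim == 2:
    dim = 0                   # (N, C): reduce over the batch dimension only
elif x.ndim == 3:
    dim = (0, 1)              # (N, L, C): treat both N and L as batch dimensions
# any other ndim leaves dim undefined, so the lines below raise, which is what we want

xmean = x.mean(dim, keepdim=True)    # shape (1, 1, 68): one mean per channel
xvar = x.var(dim, keepdim=True)      # shape (1, 1, 68): one variance per channel
```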
- 00:43:38So that should be the fix. Now I want
- 00:43:41to point out one more thing: we're
- 00:43:42actually departing from the API of
- 00:43:44PyTorch here a little bit, because when you
- 00:43:46go to BatchNorm1d in PyTorch, you
- 00:43:48can scroll down and see that the
- 00:43:50input to this layer can either be N by C,
- 00:43:53where N is the batch size and C is the
- 00:43:55number of features or channels, or it
- 00:43:57actually does accept three-dimensional
- 00:43:59inputs, but it expects them to be N by C by
- 00:44:01L,
- 00:44:02where L is, say, the sequence length or
- 00:44:04something like that.
- 00:44:05So
- 00:44:07this is a problem, because you see how C is
- 00:44:09nested here in the middle, and so when it
- 00:44:12gets three-dimensional inputs, this batch
- 00:44:14norm layer will reduce over zero and two
- 00:44:17instead of zero and one. So basically
- 00:44:20the PyTorch BatchNorm1d layer
- 00:44:22assumes that C always comes right after
- 00:44:25the batch dimension, whereas we assume here
- 00:44:28that C is the last dimension and there
- 00:44:30are some number of batch dimensions
- 00:44:32beforehand.
- 00:44:36So it expects N by C or N by C by L; we
- 00:44:39expect N by C or N by L by C.
- 00:44:42And so it's a deviation,
- 00:44:46but I think it's okay. I prefer it this way,
- 00:44:49honestly, so this is the way that we will
- 00:44:50keep it for our purposes.
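To illustrate the deviation: if we wanted to push our (N, L, C) tensor through PyTorch's own nn.BatchNorm1d, we would have to permute the channels into the middle and back out again (a hedged sketch, not something we actually do in this code):

```python
import torch
import torch.nn as nn

x = torch.randn(32, 4, 68)                   # our convention: (N, L, C)
bn = nn.BatchNorm1d(68)                      # PyTorch expects (N, C) or (N, C, L)
y = bn(x.permute(0, 2, 1)).permute(0, 2, 1)  # move channels to dim 1, normalize, move back
print(y.shape)                               # torch.Size([32, 4, 68])
```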
- 00:44:52So I redefined the layers, re-initialized
- 00:44:54the neural net, and did a single forward
- 00:44:55pass with a break, just for one step,
- 00:44:57looking at the shapes along the way.
- 00:44:59They're of course identical, all the
- 00:45:01shapes are the same, but the way we see
- 00:45:03that things are actually working as we
- 00:45:05want them to now is that when we look at
- 00:45:07the batch norm layer, the running mean
- 00:45:08shape is now 1 by 1 by 68. So we're
- 00:45:11only maintaining 68 means, one for each
- 00:45:13of our channels, and we're treating both
- 00:45:15the zeroth and the first dimension as a
- 00:45:17batch dimension, which is exactly what we
- 00:45:19want. So let me retrain the neural net
- 00:45:21now. Okay, so I retrained the neural net
- 00:45:22with the bug fix. We get a nice curve, and
- 00:45:25when we look at the validation
- 00:45:25performance we do actually see a slight
- 00:45:27improvement: we went from 2.029 to
- 00:45:302.022. So basically the bug inside the
- 00:45:32batch norm was holding us back a
- 00:45:35little bit, it looks like, and we are
- 00:45:37getting a tiny improvement now, but it's
- 00:45:39not clear if this is statistically
- 00:45:40significant.
- 00:45:42The reason we slightly expect an
- 00:45:44improvement is that we're no longer
- 00:45:46maintaining so many different means and
- 00:45:47variances that are each estimated using
- 00:45:49only 32 numbers; effectively, now we are
- 00:45:52estimating them using 32 times 4 numbers.
- 00:45:54So you just have a lot more numbers that
- 00:45:56go into any one estimate of the mean and
- 00:45:58variance, and it allows things to be a
- 00:46:01bit more stable and less wiggly inside
- 00:46:03those estimates of those statistics.
- 00:46:07So, pretty nice. With this more general
- 00:46:08architecture in place, we are now set up
- 00:46:10to push the performance further by
- 00:46:12increasing the size of the network. So
- 00:46:14for example, I bumped up the number of
- 00:46:16embedding dimensions to 24 instead of 10 and also
- 00:46:19increased the number of hidden units, but
- 00:46:21using the exact same architecture, we now
- 00:46:23have 76,000 parameters, and the training
- 00:46:25takes a lot longer, but we do get a nice
- 00:46:28curve. And then when you actually
- 00:46:29evaluate the performance, we are now
- 00:46:31getting a validation performance of 1.993.
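For reference, the bumped-up configuration is roughly the following; n_embd = 24 is stated above, while n_hidden = 128 is an assumption on my part that happens to be consistent with the ~76,000-parameter count (the lecture only says the number of hidden units was increased):

```python
n_embd = 24     # dimensionality of the character embedding vectors (was 10)
n_hidden = 128  # neurons in each hidden layer of the hierarchical net (assumed)
# with block_size 8, vocab_size 27, and bias-free linears before each batch norm, this works out to
# roughly 27*24 + 48*128 + 2*(2*128*128) + 3*(2*128) + 128*27 + 27 ≈ 76,000 parameters
```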
- 00:46:33So we've crossed over into the sub-2.0
- 00:46:36territory and are right at about 1.99, but we
- 00:46:39are starting to have to wait quite a bit
- 00:46:42longer, and we're a little bit in the
- 00:46:44dark with respect to the correct settings
- 00:46:46of the hyperparameters here and the
- 00:46:47learning rates and so on, because the
- 00:46:48experiments are starting to take longer
- 00:46:50to train. And so we are missing sort of
- 00:46:52like an experimental harness on which we
- 00:46:54could run a number of experiments and
- 00:46:56really tune this architecture very well.
- 00:46:58So I'd like to conclude now with a few
- 00:46:59notes. We basically improved our
- 00:47:02performance from a starting point of 2.1 down
- 00:47:04to 1.99, but I don't want that to be the
- 00:47:06focus, because honestly we're kind of in
- 00:47:08the dark: we have no experimental harness,
- 00:47:10we're just guessing and checking, and
- 00:47:12this whole setup is terrible, we're just
- 00:47:13looking at the training loss. Normally
- 00:47:15you want to look at both the training
- 00:47:17and the validation loss together, and the
- 00:47:19whole thing looks different if you're
- 00:47:20actually trying to squeeze out numbers.
- 00:47:23That said, we did implement this
- 00:47:25architecture from the WaveNet paper, but
- 00:47:28we did not implement its specific
- 00:47:31forward pass, where you have a more
- 00:47:33complicated, gated linear layer,
- 00:47:35and
- 00:47:38there are residual connections and skip
- 00:47:40connections and so on. We did not
- 00:47:42implement that; we just implemented this
- 00:47:44hierarchical structure. I would like to briefly hint
- 00:47:46at, or preview, how what we've done here
- 00:47:48relates to convolutional neural networks
- 00:47:50as used in the WaveNet paper. Basically,
- 00:47:52the use of convolutions is
- 00:47:54strictly for efficiency; it doesn't
- 00:47:56actually change the model we've
- 00:47:57implemented.
- 00:47:58So here, for example,
- 00:48:00let me look at a specific name to work
- 00:48:02with. There's a name in our
- 00:48:05training set, and it's DeAndre, and it has
- 00:48:08seven letters, so that is eight
- 00:48:10independent examples in our model. So all
- 00:48:12these rows here are independent examples
- 00:48:14of DeAndre.
- 00:48:16Now you can forward, of course, any one of
- 00:48:18these rows independently. So I can take
- 00:48:20my model and call it on any
- 00:48:24individual index. Notice, by the way, that here
- 00:48:26I'm being a little bit tricky.
- 00:48:28The reason for this is that Xtr[7], its
- 00:48:30shape is just a
- 00:48:33one-dimensional array of eight, so you
- 00:48:36can't actually call the model on it;
- 00:48:37you're going to get an error because
- 00:48:39there's no batch dimension.
- 00:48:41But when you do Xtr[[7]], indexing with
- 00:48:45the list [7], then the shape of this
- 00:48:47becomes 1 by 8, so I get an extra
- 00:48:49batch dimension of one, and then we can
- 00:48:52forward the model.
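In code (Xtr and model are the training-input tensor and the model from earlier in the notebook; index 7 is just for illustration):

```python
# Xtr[7] has shape (8,), with no batch dimension, so model(Xtr[7]) would error out.
# Indexing with a list keeps the batch dimension:
x_single = Xtr[[7]]        # shape (1, 8)
logits = model(x_single)   # shape (1, 27)
```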
- 00:48:53So
- 00:48:55that forwards a single example. And you
- 00:48:57might imagine that you actually may want
- 00:48:59to forward all of these eight
- 00:49:01at the same time.
- 00:49:03So pre-allocating some memory and then
- 00:49:05doing a for loop eight times,
- 00:49:07forwarding each of those eight rows, will
- 00:49:10give us the logits in all these
- 00:49:11different cases.
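A minimal sketch of that loop, assuming the eight rows for this name sit at indices 7 through 14 of Xtr (the exact offsets depend on where the name lands in the dataset):

```python
import torch

logits = torch.zeros(8, 27)              # pre-allocate room for the eight outputs
for i in range(8):
    logits[i] = model(Xtr[[7 + i]])[0]   # one independent forward call per position
```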
- 00:49:13Now for us, with the model as we've
- 00:49:14implemented it right now, this is eight
- 00:49:16independent calls to our model.
- 00:49:18But what convolutions allow you to do is
- 00:49:20to basically slide this
- 00:49:22model efficiently over the input
- 00:49:24sequence, so this for loop can be
- 00:49:27done not outside in Python but inside
- 00:49:31CUDA kernels, and so this for loop
- 00:49:33gets hidden inside the convolution.
- 00:49:35So you can think of the convolution
- 00:49:37basically as a for loop applying a
- 00:49:40little linear filter over the space of some
- 00:49:43input sequence. In our case the space
- 00:49:45we're interested in is one-dimensional,
- 00:49:46and we're interested in sliding these
- 00:49:48filters over the input data.
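As a rough illustration of the sliding-filter idea (this is not the WaveNet's dilated setup, and the channel counts are just placeholders): a Conv1d with kernel size 2 and stride 2 fuses pairs of adjacent positions with one shared linear filter, much like the first layer of our hierarchical net that merges two characters at a time, except the sliding for loop happens inside the kernel:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 24, 8)    # PyTorch layout (N, C, L): 24 embedding channels, 8 character positions
conv = nn.Conv1d(in_channels=24, out_channels=128, kernel_size=2, stride=2)
out = conv(x)                # one call computes the filter at every pair of positions
print(out.shape)             # torch.Size([1, 128, 4])
```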
- 00:49:51So this diagram is actually fairly good
- 00:49:54as well.
- 00:49:55Basically, what we've done is, here they
- 00:49:57are highlighting in black
- 00:49:59one single tree of this
- 00:50:01calculation, so just calculating a
- 00:50:03single output example here.
- 00:50:07And this is basically what we've
- 00:50:08implemented: we've implemented
- 00:50:10this single black structure
- 00:50:13and calculated a single
- 00:50:15output, like a single example.
- 00:50:17But what convolutions allow you to do is
- 00:50:19to take this black
- 00:50:20structure and kind of slide it over
- 00:50:23the input sequence here and calculate
- 00:50:26all of these orange outputs at the same
- 00:50:29time. Or here, that corresponds to
- 00:50:31calculating all of these outputs
- 00:50:34at all the positions of DeAndre at
- 00:50:37the same time.
- 00:50:38And the reason that this is much more
- 00:50:41efficient is because, number one, as I
- 00:50:43mentioned, the for loop over the sliding is inside the
- 00:50:45CUDA kernels, so that
- 00:50:48makes it efficient. But number two, notice
- 00:50:50the variable reuse here. For example, if
- 00:50:52we look at this circled node here:
- 00:50:54this node is the right child of
- 00:50:56this node, but it is also the left child of
- 00:50:59the node here,
- 00:51:01and so basically this node and its value
- 00:51:03are used twice.
- 00:51:05So right now, in the naive way, we'd
- 00:51:08have to recalculate it, but with the convolution we are
- 00:51:11allowed to reuse it.
- 00:51:12So in the convolutional neural network,
- 00:51:14you think of these linear layers that we
- 00:51:16have up above as filters,
- 00:51:19and you slide these linear filters
- 00:51:21over the input sequence,
- 00:51:23and we calculate the first layer, and
- 00:51:25then the second layer, and then the third
- 00:51:26layer, and then the output layer of the
- 00:51:28sandwich, and it's all done very
- 00:51:30efficiently using these convolutions.
- 00:51:32So we're going to cover that in a future
- 00:51:34video. The second thing I hope you took
- 00:51:35away from this video is that you've seen me
- 00:51:37basically implement all of these layer
- 00:51:40Lego building blocks, or module building
- 00:51:42blocks. I'm implementing them over
- 00:51:45here, and we've implemented a number of
- 00:51:46layers together, and we've also
- 00:51:48implemented these containers, and
- 00:51:51we've overall PyTorch-ified our code
- 00:51:53quite a bit more.
- 00:51:54Now basically what we're doing here is
- 00:51:56re-implementing torch.nn, which is
- 00:51:59the neural networks library on top of
- 00:52:02torch.tensor, and it looks very much like
- 00:52:04this, except it is much better,
- 00:52:07because it's in PyTorch instead of
- 00:52:08my janky Jupyter notebook. So I think
- 00:52:11going forward I will probably
- 00:52:13consider us as having unlocked
- 00:52:15torch.nn: we understand roughly what's
- 00:52:18in there, how these modules work, how
- 00:52:19they're nested, and what they're doing on
- 00:52:21top of torch.tensor. So hopefully we'll
- 00:52:24just switch over and
- 00:52:25continue and start using torch.nn
- 00:52:27directly.
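As a taste of that, the simple (non-hierarchical) version of our layer stack could be expressed directly with torch.nn modules and containers; this is a hedged sketch of the correspondence, not code from the lecture, and the sizes are illustrative:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Embedding(27, 24),                # our Embedding: 27 characters, 24-dimensional vectors
    nn.Flatten(1, 2),                    # flatten the 8 character embeddings into one vector
    nn.Linear(8 * 24, 128, bias=False),  # our Linear
    nn.BatchNorm1d(128),                 # our BatchNorm1d (2D input here, so the conventions agree)
    nn.Tanh(),                           # our Tanh
    nn.Linear(128, 27),                  # output layer producing logits over the vocabulary
)

x = torch.randint(0, 27, (32, 8))        # a batch of 32 contexts of 8 character indices
logits = model(x)                        # shape (32, 27)
```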
- 00:52:29The next thing I hope you got a bit of a sense of is what the
- 00:52:31development process of building deep
- 00:52:33neural networks looks like, which I think
- 00:52:35was relatively representative to some
- 00:52:36extent. So, number one, we are spending a
- 00:52:39lot of time on the documentation pages of
- 00:52:41PyTorch, and we're reading through all
- 00:52:44the layers, looking at the documentation:
- 00:52:45what are the shapes of the inputs, what can
- 00:52:48they be, what does the layer do, and so on.
- 00:52:51Unfortunately, I have to say the
- 00:52:53PyTorch documentation is not very
- 00:52:55good. They spend a ton of time on
- 00:52:57hardcore engineering of all kinds of
- 00:52:59distributed primitives, etc., but as far as
- 00:53:01I can tell no one is maintaining the
- 00:53:03documentation: it will lie to you, it will
- 00:53:06be wrong, it will be incomplete, it will
- 00:53:08be unclear. So unfortunately it is what
- 00:53:12it is, and you just kind of do your best
- 00:53:14with what they've given us.
- 00:53:18Number two,
- 00:53:20the other thing that I hope you got a
- 00:53:22sense of is that there's a ton of trying to
- 00:53:24make the shapes work, and there's a lot
- 00:53:26of gymnastics around these
- 00:53:27multi-dimensional arrays: are they
- 00:53:29two-dimensional, three-dimensional,
- 00:53:30four-dimensional, what layers take
- 00:53:32what shapes, is it NCL or NLC, and you're
- 00:53:36permuting and viewing, and it can just
- 00:53:39get pretty messy. And so that brings me
- 00:53:40to number three: I very often prototype
- 00:53:43these layers and implementations in
- 00:53:44Jupyter notebooks and make sure that all
- 00:53:46the shapes work out, and I'm spending a
- 00:53:48lot of time basically babysitting the
- 00:53:50shapes and making sure everything is
- 00:53:52correct. Then, once I'm satisfied with
- 00:53:54the functionality in the Jupyter
- 00:53:55notebook, I will take that code and copy
- 00:53:57paste it into my repository of actual
- 00:53:59code that I'm training with. So then
- 00:54:02I'm working with VS Code on the side:
- 00:54:04I usually have a Jupyter notebook and VS
- 00:54:06Code open, I develop in the Jupyter notebook, I
- 00:54:07paste into VS Code, and then I kick off
- 00:54:09experiments from the repo, of
- 00:54:11course from the code repository. So
- 00:54:14those are roughly some notes on the
- 00:54:16development process of working with
- 00:54:17neural nets. Lastly, I think this lecture
- 00:54:19unlocks a lot of potential further
- 00:54:21lectures, because, number one, we have to
- 00:54:23convert our neural network to actually
- 00:54:25use these dilated causal convolutional
- 00:54:27layers, so implementing the ConvNet. Number
- 00:54:30two, potentially starting to get into
- 00:54:32what it means to have residual
- 00:54:34connections and skip connections, and why
- 00:54:36they are useful.
- 00:54:37Number three: as I mentioned, we don't
- 00:54:40have any experimental harness, so right
- 00:54:42now I'm just guessing and checking
- 00:54:44everything. This is not representative of
- 00:54:45typical deep learning workflows: you have
- 00:54:47to set up your evaluation harness, you
- 00:54:49can kick off experiments, you have lots
- 00:54:51of arguments that your script can take,
- 00:54:53you're kicking off a lot of
- 00:54:54experimentation, you're looking at a lot
- 00:54:56of plots of training and validation
- 00:54:57losses, you're looking at what is
- 00:54:59working and what is not working, and
- 00:55:01you're working at this kind of population
- 00:55:02level, and you're doing all these
- 00:55:04hyperparameter searches. We've done
- 00:55:06none of that so far, so how to set that
- 00:55:09up and how to make it good is, I think, a
- 00:55:11whole other topic. Number four, we
- 00:55:14should probably cover recurrent neural
- 00:55:16networks: RNNs, LSTMs, GRUs, and of
- 00:55:19course transformers. So many places to
- 00:55:22go, and we'll cover them in the future.
- 00:55:24For now, bye. Sorry, I forgot to say that
- 00:55:27if you are interested, I think it is kind
- 00:55:30of interesting to try to beat this
- 00:55:31number, 1.993, because I really haven't
- 00:55:34tried a lot of experimentation here, and
- 00:55:36there's quite a bit of fruit potentially
- 00:55:37still left to pick. I haven't
- 00:55:40tried any other ways of allocating these
- 00:55:42channels in this neural net; maybe the
- 00:55:44number of dimensions for the embedding
- 00:55:47is all wrong; maybe it's possible to
- 00:55:49actually take the original network with
- 00:55:50just one hidden layer, make it big
- 00:55:53enough, and actually beat my fancy
- 00:55:54hierarchical network. It's not obvious;
- 00:55:56that would be kind of embarrassing if
- 00:55:59this did not do better, even once you
- 00:56:01torture it a little bit. Maybe you can
- 00:56:03read the WaveNet paper and try to
- 00:56:04figure out how some of these layers work
- 00:56:06and implement them yourselves using what
- 00:56:07we have.
- 00:56:08And of course you can always tune some
- 00:56:10of the initialization or some of the
- 00:56:12optimization and see if you can improve
- 00:56:15it that way. So I'd be curious if people
- 00:56:16can come up with some ways to beat this.
- 00:56:18And yeah, that's it for now. Bye!
- language model
- WaveNet
- batch normalization
- neural network
- character prediction
- hierarchical architecture
- hyperparameter tuning
- dilated convolutions
- training
- evaluation