Building makemore Part 5: Building a WaveNet
Summary
TL;DR: In this lecture, the speaker continues the implementation of a character-level language model, evolving from a simple multi-layer perceptron to a more complex architecture inspired by WaveNet. The model now processes eight characters to predict the ninth, utilizing a hierarchical approach to progressively fuse information. The speaker addresses challenges with batch normalization and emphasizes the importance of maintaining the correct state during training and evaluation. The lecture concludes with a discussion on potential improvements and future topics, including dilated convolutions and hyperparameter tuning, highlighting the need for a more structured experimental approach to optimize the model's performance.
Takeaways
- The speaker is in Kyoto, enhancing the lecture's atmosphere.
- Transitioning from a simple model to a complex architecture inspired by WaveNet.
- The model now takes eight characters as input for prediction.
- Batch normalization is crucial for stabilizing the learning process.
- A hierarchical approach is used to progressively fuse information.
- Future topics include dilated convolutions and residual connections.
- Emphasis on the importance of hyperparameter tuning for model performance.
- The need for a structured experimental approach is highlighted.
Timeline
- 00:00:00 - 00:05:00
The speaker introduces the continuation of their character-level language model implementation while in Kyoto. They discuss the architecture of a multi-layer perceptron that predicts the next character based on three previous characters, and express the intention to enhance this model by increasing the input sequence length and making it deeper, similar to a Wavenet architecture.
- 00:05:00 - 00:10:00
The speaker explains the starter code for part five, which is based on the previous part three, and describes the data processing steps. They mention having 182,000 examples for training, and the importance of building modular layers for the neural network, similar to PyTorch's API.
- 00:10:00 - 00:15:00
The speaker discusses the implementation of various layers, including linear layers and batch normalization. They highlight the complexity of batch normalization and the need to manage its state during training and evaluation, as well as the importance of maintaining running statistics for the layers.
- 00:15:00 - 00:20:00
The speaker simplifies the code by removing unnecessary generator objects and organizing the neural network elements. They explain the embedding table and the structure of the layers, emphasizing the need for a clean and efficient forward pass in the model.
- 00:20:00 - 00:25:00
The speaker addresses issues with the loss function and the need for a better evaluation method. They discuss the importance of setting the model to evaluation mode before validation and mention the current validation loss of 2.10, indicating room for improvement in the model's performance.
- 00:25:00 - 00:30:00
The speaker discusses the need to improve the graph representation of the loss function and demonstrates how to average values for better visualization. They also mention the learning rate decay and its impact on optimization, leading to a more stable loss curve.
- 00:30:00 - 00:35:00
The speaker simplifies the forward pass by organizing layers into a sequential structure, allowing for cleaner code. They introduce a custom sequential class to manage layers and streamline the forward pass process, making it easier to evaluate the model's performance.
- 00:35:00 - 00:40:00
The speaker discusses the need to modify the flattening operation to accommodate the new hierarchical structure of the model. They explain how to reshape tensors to maintain the necessary dimensions for processing pairs of characters in the input sequence.
- 00:40:00 - 00:45:00
The speaker implements a new flattening method that allows for grouping consecutive elements in the input tensor. They demonstrate how to adjust the model's architecture to process these groups effectively, leading to a more efficient neural network design.
- 00:45:00 - 00:50:00
The speaker discusses the implementation of a hierarchical model that processes pairs of characters and gradually fuses information. They explain the changes made to the linear layers and the expected input shapes for the model, emphasizing the importance of maintaining the correct dimensions throughout the network.
- 00:50:00 - 00:56:21
The speaker reflects on the performance improvements achieved by increasing the context length from 3 to 8 characters, resulting in a validation loss of 2.02. They express the need for further optimization and hyperparameter tuning to enhance the model's performance even more.
Video Q&A
What is the main focus of this lecture?
The lecture focuses on implementing a more complex character-level language model inspired by WaveNet.
How many characters does the new model take as input?
The new model takes eight characters as input to predict the ninth character.
What is the purpose of batch normalization in this context?
Batch normalization helps stabilize the learning process by normalizing the inputs to each layer.
What improvements were made to the model's architecture?
The architecture was changed to progressively fuse information in a hierarchical manner, similar to WaveNet.
What are some future topics mentioned for exploration?
Future topics include dilated convolutions, residual connections, and hyperparameter tuning.
- 00:00:00hi everyone today we are continuing our
- 00:00:02implementation of make more our favorite
- 00:00:04character level language model
- 00:00:06now you'll notice that the background
- 00:00:07behind me is different that's because I
- 00:00:09am in Kyoto and it is awesome so I'm in
- 00:00:12a hotel room here
- 00:00:13now over the last few lectures we've
- 00:00:15built up to this architecture that is a
- 00:00:17multi-layer perceptron character level
- 00:00:19language model so we see that it
- 00:00:21receives three previous characters and
- 00:00:23tries to predict the fourth character in
- 00:00:24a sequence using a very simple multi-layer
- 00:00:26perceptron using one hidden layer of
- 00:00:28neurons with tanh nonlinearities
- 00:00:31so we'd like to do now in this lecture
- 00:00:33is I'd like to complexify this
- 00:00:34architecture in particular we would like
- 00:00:36to take more characters in a sequence as
- 00:00:38an input not just three and in addition
- 00:00:41to that we don't just want to feed them
- 00:00:42all into a single hidden layer because
- 00:00:45that squashes too much information too
- 00:00:46quickly instead we would like to make a
- 00:00:49deeper model that progressively fuses
- 00:00:51this information to make its guess about
- 00:00:53the next character in a sequence
- 00:00:55and so we'll see that as we make this
- 00:00:57architecture more complex we're actually
- 00:00:59going to arrive at something that looks
- 00:01:01very much like a wavenet
- 00:01:03the wavenet is this paper published by
- 00:01:05DeepMind in 2016 and it is also a
- 00:01:09language model basically but it tries to
- 00:01:11predict audio sequences instead of
- 00:01:13character level sequences or Word level
- 00:01:15sequences but fundamentally the modeling
- 00:01:18setup is identical it is an
- 00:01:20autoregressive model and it tries to predict
- 00:01:23next character in a sequence and the
- 00:01:25architecture actually takes this
- 00:01:26interesting hierarchical sort of
- 00:01:29approach to predicting the next
- 00:01:31character in a sequence uh with this
- 00:01:33tree-like structure and this is the
- 00:01:35architecture and we're going to
- 00:01:36implement it in the course of this video
- 00:01:38so let's get started so the starter code
- 00:01:41for part five is very similar to where
- 00:01:43we ended up in in part three recall that
- 00:01:46part four was the manual backpropagation
- 00:01:47exercise that is kind of an
- 00:01:49aside so we are coming back to part
- 00:01:51three copy pasting chunks out of it and
- 00:01:53that is our starter code for part five
- 00:01:55I've changed very few things otherwise
- 00:01:57so a lot of this should look familiar to
- 00:01:59if you've gone through part three so in
- 00:02:01particular very briefly we are doing
- 00:02:03Imports we are reading our data set
- 00:02:05of words and we are processing that set
- 00:02:09of words into individual examples and
- 00:02:11none of this data generation code has
- 00:02:13changed and basically we have lots and
- 00:02:15lots of examples in particular we have
- 00:02:17182,000 examples of three characters trying
- 00:02:21to predict the fourth one and we've
- 00:02:24broken up every one of these words into
- 00:02:25little problems of given three
- 00:02:27characters predict the fourth one so
- 00:02:29this is our data set and this is what
- 00:02:30we're trying to get the neural net to do
- 00:02:32now in part three we started to develop
- 00:02:35our code around these layer modules
- 00:02:39um that are for example like class
- 00:02:40linear and we're doing this because we
- 00:02:42want to think of these modules as
- 00:02:44building blocks and like a Lego building
- 00:02:47block bricks that we can sort of like
- 00:02:49stack up into neural networks and we can
- 00:02:51feed data between these layers and stack
- 00:02:53them up into a sort of graphs
- 00:02:56now we also developed these layers to
- 00:02:59have apis and signatures very similar to
- 00:03:01those that are found in pytorch so we
- 00:03:04have torch.nn and it's got all these
- 00:03:05layer building blocks that you would use
- 00:03:07in practice and we were developing all
- 00:03:09of these to mimic the apis of these so
- 00:03:11for example we have linear so there will
- 00:03:13also be a torch.nn.linear and its
- 00:03:17signature will be very similar to our
- 00:03:18signature and the functionality will be
- 00:03:20also quite identical as far as I'm aware
- 00:03:22so we have the linear layer with the
- 00:03:24batchnorm1d layer and the tanh layer
- 00:03:27that we developed previously
- 00:03:29and linear is just a matrix multiply in
- 00:03:32the forward pass of this module
- 00:03:35batchnorm of course is this crazy layer
- 00:03:36that we developed in the previous
- 00:03:37lecture and what's crazy about it is
- 00:03:40well there's many things number one it
- 00:03:42has these running mean and variances
- 00:03:44that are trained outside of back
- 00:03:46propagation they are trained using
- 00:03:49exponential moving average inside this
- 00:03:52layer when we call the forward pass
- 00:03:54in addition to that
- 00:03:56there's this training plug because the
- 00:03:58behavior of batchnorm is different during
- 00:03:59train time and evaluation time and so
- 00:04:02suddenly we have to be very careful that
- 00:04:03batchnorm is in its correct state that
- 00:04:05it's in the evaluation state or training
- 00:04:07state so that's something to now keep
- 00:04:08track of something that sometimes
- 00:04:10introduces bugs
- 00:04:11uh because you forget to put it into the
- 00:04:13right mode and finally we saw that
- 00:04:15batchnorm couples the statistics or
- 00:04:18the activations across the examples in
- 00:04:20the batch so normally we thought of the
- 00:04:22batch as just an efficiency thing but now
- 00:04:25we are coupling the computation across
- 00:04:28batch elements and it's done for the
- 00:04:30purposes of controlling the activation
- 00:04:32statistics as we saw in the previous
- 00:04:33video
- 00:04:34so it's a very weird layer and it leads to a
- 00:04:36lot of bugs
- 00:04:38partly for example because you have to
- 00:04:40modulate the training in eval phase and
- 00:04:42so on
- 00:04:44um in addition for example you have to
- 00:04:46wait for uh the mean and the variance to
- 00:04:49settle and to actually reach a steady
- 00:04:51state and so um you have to make sure
- 00:04:53that you basically there's state in this
- 00:04:55layer and state is harmful uh usually
- 00:04:59now I brought out the generator object
- 00:05:02previously we had a generator equals g
- 00:05:04and so on inside these layers I've
- 00:05:07discarded that in favor of just
- 00:05:08initializing the torch RNG outside here
- 00:05:12use it just once globally just for
- 00:05:15Simplicity
- 00:05:16and then here we are starting to build
- 00:05:18out some of the neural network elements
- 00:05:19this should look very familiar we are we
- 00:05:22have our embedding table C and then we
- 00:05:24have a list of layers and uh it's a
- 00:05:27linear feeds to batchnorm feeds to tanh
- 00:05:29and then a linear output layer and its
- 00:05:32weights are scaled down so we are not
- 00:05:33confidently wrong at the initialization
- 00:05:36we see that this is about 12 000
- 00:05:38parameters we're telling pytorch that
- 00:05:40the parameters require gradients
- 00:05:42the optimization is as far as I'm aware
- 00:05:44identical and should look very very
- 00:05:46familiar
- 00:05:47nothing changed here
- 00:05:49uh loss function looks very crazy we
- 00:05:52should probably fix this and that's
- 00:05:54because 32 batch elements are too few
- 00:05:56and so you can get very lucky lucky or
- 00:05:59unlucky in any one of these batches and
- 00:06:01it creates a very thick loss function
- 00:06:04um so we're going to fix that soon
- 00:06:06now once we want to evaluate the trained
- 00:06:08neural network we need to remember
- 00:06:09because of the batchnorm layers to set
- 00:06:11all the layers to be training equals
- 00:06:13false so this only matters for the
- 00:06:15batchnorm layer so far
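A minimal sketch of the evaluation toggle being described here, assuming the `training` attribute exposed by the from-scratch layers built in this series:

```python
# Put every layer into inference mode so that batchnorm uses its running
# statistics instead of the statistics of the current batch.
for layer in layers:
    layer.training = False
```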
- 00:06:17and then we evaluate
- 00:06:19we see that currently we have validation
- 00:06:22loss of 2.10 which is fairly good but
- 00:06:25there's still ways to go but even at
- 00:06:282.10 we see that when we sample from the
- 00:06:30model we actually get relatively
- 00:06:31name-like results that do not exist in a
- 00:06:34training set so for example Yvonne kilo
- 00:06:37Pros
- 00:06:40Alaia Etc so certainly not
- 00:06:43reasonable not unreasonable I would say
- 00:06:46but not amazing and we can still push
- 00:06:48this validation loss even lower and get
- 00:06:50much better samples that are even more
- 00:06:52name-like
- 00:06:53so let's improve this model
- 00:06:56okay first let's fix this graph because
- 00:06:58it is daggers in my eyes and I just
- 00:07:00can't take it anymore
- 00:07:01um so lossi if you recall is a python
- 00:07:05list of floats so for example the first
- 00:07:0710 elements
- 00:07:10now what we'd like to do basically is we
- 00:07:12need to average up
- 00:07:14um some of these values to get a more
- 00:07:16sort of Representative uh value along
- 00:07:19the way so one way to do this is the
- 00:07:20following
- 00:07:21in pytorch if I create for example
- 00:07:24a tensor of the first 10 numbers
- 00:07:27then this is currently a one-dimensional
- 00:07:29array but recall that I can view this
- 00:07:31array as two-dimensional so for example
- 00:07:33I can use it as a two by five array and
- 00:07:36this is a 2d tensor now two by five and
- 00:07:39you see what pytorch has done is that
- 00:07:40the first row of this tensor is the
- 00:07:42first five elements and the second row
- 00:07:44is the second five elements
- 00:07:46I can also view it as a five by two as
- 00:07:48an example
- 00:07:50and then recall that I can also
- 00:07:52use negative one in place of one of
- 00:07:55these numbers
- 00:07:55and pytorch will calculate what that
- 00:07:58number must be in order to make the
- 00:07:59number of elements work out so this can
- 00:08:01be
- 00:08:03this or like that but it will work of
- 00:08:06course this would not work
- 00:08:09okay so this allows it to spread out
- 00:08:11some of the consecutive values into rows
- 00:08:13so that's very helpful because what we
- 00:08:15can do now is first of all we're going
- 00:08:17to create a torch.tensor out of the lossi
- 00:08:21list of floats
- 00:08:22and then we're going to view it as
- 00:08:24whatever it is but we're going to
- 00:08:26stretch it out into rows of 1000
- 00:08:29consecutive elements so the shape of
- 00:08:31this now becomes 200 by 1000. and each
- 00:08:35row is one thousand um consecutive
- 00:08:37elements in this list
- 00:08:39so that's very helpful because now we
- 00:08:41can do a mean along the rows
- 00:08:43and the shape of this will just be 200.
- 00:08:47and so we've taken basically the mean on
- 00:08:48every row so plt.plot of that should be
- 00:08:51something nicer
- 00:08:53much better
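A small sketch of the smoothing trick just described, with a synthetic placeholder list standing in for the recorded losses:

```python
import torch
import matplotlib.pyplot as plt

# lossi stands in for the Python list of per-step losses kept during
# training; here it is synthetic placeholder data of length 200,000.
lossi = [2.5 - 0.0001 * i for i in range(200_000)]

# View the flat list as rows of 1000 consecutive steps and average each
# row, giving one smoothed value per 1000 optimization steps.
smoothed = torch.tensor(lossi).view(-1, 1000).mean(1)   # shape: (200,)
plt.plot(smoothed)
plt.show()
```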
- 00:08:55so we see that we basically made a lot
- 00:08:56of progress and then here this is the
- 00:08:59learning rate Decay so here we see that
- 00:09:01the learning rate Decay subtracted a ton
- 00:09:03of energy out of the system and allowed
- 00:09:05us to settle into sort of the local
- 00:09:07minimum in this optimization
- 00:09:09so this is a much nicer plot let me come
- 00:09:12up and delete the monster and we're
- 00:09:15going to be using this going forward now
- 00:09:16next up what I'm bothered by is that you
- 00:09:19see our forward pass is a little bit
- 00:09:22gnarly and takes way too many lines of
- 00:09:24code
- 00:09:24so in particular we see that we've
- 00:09:26organized some of the layers inside the
- 00:09:28layers list but not all of them uh for
- 00:09:30no reason so in particular we see that
- 00:09:32we still have the embedding table a
- 00:09:34special case outside of the layers and
- 00:09:37in addition to that the viewing
- 00:09:39operation here is also outside of our
- 00:09:40layers so let's create layers for these
- 00:09:43and then we can add those layers to just
- 00:09:45our list
- 00:09:46so in particular the two things that we
- 00:09:48need is here we have this embedding
- 00:09:50table and we are indexing at the
- 00:09:53integers inside uh the batch XB uh
- 00:09:56inside the tensor xB
- 00:09:58so that's an embedding table lookup just
- 00:10:00done with indexing and then here we see
- 00:10:03that we have this view operation which
- 00:10:04if you recall from the previous video
- 00:10:06Simply rearranges the character
- 00:10:09embeddings and stretches them out into a
- 00:10:12row and effectively what print that does
- 00:10:14is the concatenation operation basically
- 00:10:16except it's free because viewing is very
- 00:10:19cheap in pytorch no no memory is being
- 00:10:22copied we're just re-representing how we
- 00:10:24view that tensor so let's create
- 00:10:27um
- 00:10:28modules for both of these operations the
- 00:10:31embedding operation and flattening
- 00:10:32operation
- 00:10:33so I actually wrote the code in just to
- 00:10:37save some time
- 00:10:38so we have a module Embedding and a
- 00:10:40module Flatten and both of them simply
- 00:10:43do the indexing operation in the forward
- 00:10:45pass and the flattening operation here
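A minimal sketch of the two modules being described, closely following the from-scratch layer API used throughout this series:

```python
import torch

class Embedding:
    # Lookup table mapping integer character indices to learned vectors.
    def __init__(self, num_embeddings, embedding_dim):
        self.weight = torch.randn((num_embeddings, embedding_dim))

    def __call__(self, IX):
        self.out = self.weight[IX]   # plain indexing performs the lookup
        return self.out

    def parameters(self):
        return [self.weight]


class Flatten:
    # Stretches everything after the batch dimension into one long row,
    # which acts as a free concatenation of the character embeddings.
    def __call__(self, x):
        self.out = x.view(x.shape[0], -1)
        return self.out

    def parameters(self):
        return []
```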
- 00:10:49and this C now will just become
- 00:10:53self.weight inside an Embedding module
- 00:10:56and I'm calling these layers
- 00:10:58specifically Embedding and Flatten
- 00:10:59because it turns out that both of them
- 00:11:01actually exist in pytorch so in
- 00:11:03pytorch we have nn.Embedding and
- 00:11:06it also takes the number of embeddings
- 00:11:07and the dimensionality of the embedding
- 00:11:09just like we have here but in addition
- 00:11:11pytorch takes in a lot of other keyword
- 00:11:13arguments that we are not using for our
- 00:11:15purposes yet
- 00:11:17and for flatten that also exists in
- 00:11:19pytorch and it also takes additional
- 00:11:21keyword arguments that we are not using
- 00:11:23so we have a very simple Flatten
- 00:11:26but both of them exist in pytorch
- 00:11:28they're just a bit simpler and now
- 00:11:30that we have these we can simply take
- 00:11:33out some of these special cased
- 00:11:36um things so instead of C we're just
- 00:11:40going to have an Embedding
- 00:11:41of vocab size and n_embd
- 00:11:45and then after the embedding we are
- 00:11:47going to flatten
- 00:11:48so let's construct those modules and now
- 00:11:51I can take out this the
- 00:11:53and here I don't have to special case
- 00:11:54anymore because now C is the embeddings
- 00:11:57weight and it's inside layers
- 00:12:01so this should just work
- 00:12:03and then here our forward pass
- 00:12:06simplifies substantially because we
- 00:12:08don't need to do these now outside of
- 00:12:10these layer outside and explicitly
- 00:12:13they're now inside layers
- 00:12:15so we can delete those
- 00:12:17but now to to kick things off we want
- 00:12:19this little X which in the beginning is
- 00:12:21just XB uh the tensor of integers
- 00:12:24specifying the identities of these
- 00:12:26characters at the input
- 00:12:27and so these characters can now directly
- 00:12:29feed into the first layer and this
- 00:12:31should just work
- 00:12:32so let me come here and insert a break
- 00:12:35because I just want to make sure that
- 00:12:36the first iteration of this runs and
- 00:12:38then there's no mistake so that ran
- 00:12:40properly and basically we substantially
- 00:12:42simplified the forward pass here okay
- 00:12:45I'm sorry I changed my microphone so
- 00:12:46hopefully the audio is a little bit
- 00:12:48better
- 00:12:49now one more thing that I would like to
- 00:12:51do in order to pytorch-ify our code even
- 00:12:53further is that right now we are
- 00:12:54maintaining all of our modules in a
- 00:12:56naked list of layers and we can also
- 00:12:59simplify this uh because we can
- 00:13:01introduce the concept of Pi torch
- 00:13:03containers so in torch.nn which we are
- 00:13:05basically rebuilding from scratch here
- 00:13:07there's a concept of containers
- 00:13:09and these containers are basically a way
- 00:13:10of organizing layers into
- 00:13:13lists or dicts and so on so in
- 00:13:16particular there's a sequential which
- 00:13:18maintains a list of layers and is a
- 00:13:20module class in pytorch and it basically
- 00:13:23just passes a given input through all
- 00:13:25the layers sequentially exactly as we
- 00:13:27are doing here
- 00:13:28so let's write our own sequential
- 00:13:31I've written a code here and basically
- 00:13:33the code for sequential is quite
- 00:13:35straightforward we pass in a list of
- 00:13:37layers which we keep here and then given
- 00:13:39any input in a forward pass we just call
- 00:13:41all the layers sequentially and return
- 00:13:43the result in terms of the parameters
- 00:13:45it's just all the parameters of the
- 00:13:46child modules
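A minimal sketch of such a Sequential container, following the same from-scratch module API:

```python
class Sequential:
    # Container that calls a list of layers one after another.
    def __init__(self, layers):
        self.layers = layers

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        self.out = x
        return self.out

    def parameters(self):
        # All parameters of the child modules, flattened into one list.
        return [p for layer in self.layers for p in layer.parameters()]
```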
- 00:13:48so we can run this and we can again
- 00:13:50simplify this substantially because we
- 00:13:52don't maintain this naked list of layers
- 00:13:54we now have a notion of a model which is
- 00:13:57a module and in particular is a
- 00:14:00sequential of all these layers
- 00:14:04and now parameters are simply just a
- 00:14:07model about parameters
- 00:14:09and so that list comprehension now lives
- 00:14:11here
- 00:14:13and then here we are press here we are
- 00:14:15doing all the things we used to do
- 00:14:17now here the code again simplifies
- 00:14:19substantially because we don't have to
- 00:14:22do this forwarding here instead of just
- 00:14:24call the model on the input data and the
- 00:14:26input data here are the integers inside
- 00:14:28xB so we can simply do logits which are
- 00:14:31the outputs of our model are simply the
- 00:14:33model called on xB
- 00:14:36and then the cross entropy here takes
- 00:14:38the logits and the targets
- 00:14:41so this simplifies substantially
- 00:14:43and then this looks good so let's just
- 00:14:46make sure this runs that looks good
- 00:14:49now here we actually have some work to
- 00:14:51do still here but I'm going to come back
- 00:14:52later for now there's no more layers
- 00:14:54there's a model.layers but it's not
- 00:14:57ideal to access attributes of these classes
- 00:15:00directly so we'll come back and fix this
- 00:15:01later
- 00:15:03and then here of course this simplifies
- 00:15:05substantially as well because logits are
- 00:15:07the model called on x
- 00:15:10and then these logits come here
- 00:15:14so we can evaluate the train and
- 00:15:15validation loss which currently is
- 00:15:17terrible because we just initialized the
- 00:15:19neural net and then we can also sample
- 00:15:21from the model and this simplifies
- 00:15:22dramatically as well
- 00:15:24because we just want to call the model
- 00:15:25onto the context and outcome logits
- 00:15:30and these logits go into softmax and get
- 00:15:32the probabilities Etc so we can sample
- 00:15:35from this model
- 00:15:37what did I screw up
- 00:15:42okay so I fixed the issue and we now get
- 00:15:44the result that we expect which is
- 00:15:46gibberish because the model is not
- 00:15:48trained because we re-initialize it from
- 00:15:49scratch
- 00:15:50the problem was that when I fixed this
- 00:15:52cell to be modeled out layers instead of
- 00:15:52cell to be model.layers instead of
- 00:15:56cell and so our neural net was in a
- 00:15:58training mode and what caused the issue
- 00:16:01here is the batchnorm layer as batchnorm
- 00:16:03layers like to do because
- 00:16:05batchnorm was in a training mode and here
- 00:16:07we are passing in an input which is a
- 00:16:09batch of just a single example made up
- 00:16:11of the context
- 00:16:12and so if you are trying to pass in a
- 00:16:15single example into a batchnorm that is
- 00:16:16in the training mode you're going to end
- 00:16:18up estimating the variance using the
- 00:16:20input and the variance of a single
- 00:16:21number is is not a number because it is
- 00:16:24a measure of a spread so for example the
- 00:16:26variance of just the single number five
- 00:16:28you can see is not a number and so
- 00:16:31that's what happened in the master
- 00:16:33basically caused an issue and then that
- 00:16:35polluted all of the further processing
- 00:16:37so all that we have to do was make sure
- 00:16:39that this runs and we basically made the
- 00:16:43issue of
- 00:16:45again we didn't actually see the issue
- 00:16:46with the loss we could have evaluated
- 00:16:48the loss but we got the wrong result
- 00:16:49because batchnorm was in the training mode
- 00:16:52and uh and so we still get a result it's
- 00:16:54just the wrong result because it's using
- 00:16:56the uh sample statistics of the batch
- 00:16:59whereas we want to use the running mean
- 00:17:00and running variance inside the batchnorm
- 00:17:02and so
- 00:17:04again an example of introducing a bug
- 00:17:06inline because we did not properly
- 00:17:09maintain the state of what is training
- 00:17:10or not okay so I Rewritten everything
- 00:17:12and here's where we are as a reminder we
- 00:17:15have the training loss of 2.05 and
- 00:17:17validation 2.10
- 00:17:18now because these losses are very
- 00:17:21similar to each other we have a sense
- 00:17:22that we are not overfitting too much on
- 00:17:24this task and we can make additional
- 00:17:26progress in our performance by scaling
- 00:17:28up the size of the neural network and
- 00:17:29making everything bigger and deeper
- 00:17:32now currently we are using this
- 00:17:33architecture here where we are taking in
- 00:17:35some number of characters going into a
- 00:17:37single hidden layer and then going to
- 00:17:39the prediction of the next character
- 00:17:41the problem here is we don't have a
- 00:17:43naive way of making this bigger in a
- 00:17:46productive way we could of course use
- 00:17:48our layers sort of building blocks and
- 00:17:51materials to introduce additional layers
- 00:17:53here and make the network deeper but it
- 00:17:55is still the case that we are crushing
- 00:17:56all of the characters into a single
- 00:17:58layer all the way at the beginning
- 00:18:00and even if we make this a bigger layer
- 00:18:02and add neurons it's still kind of like
- 00:18:04silly to squash all that information so
- 00:18:07fast in a single step
- 00:18:09so we'd like to do instead is we'd like
- 00:18:11our Network to look a lot more like this
- 00:18:13in the wavenet case so you see in the
- 00:18:15wavenet when we are trying to make the
- 00:18:17prediction for the next character in the
- 00:18:18sequence it is a function of the
- 00:18:20previous characters that are feeding
- 00:18:22that feed in but not all of these
- 00:18:25different characters are not just
- 00:18:26crushed to a single layer and then you
- 00:18:28have a sandwich they are crushed slowly
- 00:18:31so in particular we take two characters
- 00:18:34and we fuse them into sort of like a
- 00:18:36bigram representation and we do that
- 00:18:38for all these characters consecutively
- 00:18:40and then we take the bigrams and we fuse
- 00:18:42those into four character level chunks
- 00:18:46and then we fuse that again and so we do
- 00:18:49that in this like tree-like hierarchical
- 00:18:51manner so we fuse the information from
- 00:18:53the previous context slowly into the
- 00:18:56network as it gets deeper and so this is
- 00:18:58the kind of architecture that we want to
- 00:18:59implement
- 00:19:00now in the wave Nets case this is a
- 00:19:02visualization of a stack of dilated
- 00:19:04causal convolution layers and this makes
- 00:19:07it sound very scary but actually the
- 00:19:08idea is very simple and the fact that
- 00:19:10it's a dilated causal convolution layer
- 00:19:12is really just an implementation detail
- 00:19:14to make everything fast we're going to
- 00:19:16see that later but for now let's just
- 00:19:18keep the basic idea of it which is this
- 00:19:20Progressive Fusion so we want to make
- 00:19:22the network deeper and at each level we
- 00:19:24want to fuse only two consecutive
- 00:19:26elements two characters then two bigrams
- 00:19:29then two four grams and so on so let's
- 00:19:32implement this okay so first up let me
- 00:19:34scroll to where we built the data set
- 00:19:35and let's change the block size from 3
- 00:19:37to 8. so we're going to be taking eight
- 00:19:39characters of context to predict the
- 00:19:42ninth character so the data set now
- 00:19:44looks like this we have a lot more
- 00:19:45context feeding in to predict any next
- 00:19:47character in a sequence and these eight
- 00:19:49characters are going to be processed in
- 00:19:51this tree like structure
- 00:19:53now if we scroll here everything here
- 00:19:56should just be able to work so we should
- 00:19:58be able to redefine the network
- 00:19:59you see the number of parameters has
- 00:20:01increased by 10 000 and that's because
- 00:20:03the block size has grown so this first
- 00:20:06linear layer is much much bigger our
- 00:20:08linear layer now takes eight characters
- 00:20:10into this middle layer so there's a lot
- 00:20:13more parameters there but this should
- 00:20:15just run let me just break right after
- 00:20:18the very first iteration so you see that
- 00:20:20this runs just fine it's just that this
- 00:20:22network doesn't make too much sense
- 00:20:23we're crushing way too much information
- 00:20:25way too fast
- 00:20:26so let's now come in and see how we
- 00:20:29could try to implement the hierarchical
- 00:20:30scheme now before we dive into the
- 00:20:33detail of the re-implementation here I
- 00:20:35was just curious to actually run it and
- 00:20:37see where we are in terms of the
- 00:20:38Baseline performance of just lazily
- 00:20:40scaling up the context length so I'll
- 00:20:42let it run we get a nice loss curve and
- 00:20:45then evaluating the loss we actually see
- 00:20:46quite a bit of improvement just from
- 00:20:48increasing the context length so I
- 00:20:51started a little bit of a performance
- 00:20:52log here and previously where we were is
- 00:20:54we were getting a performance of 2.10 on
- 00:20:57the validation loss and now simply
- 00:20:59scaling up the context length from 3 to
- 00:21:018 gives us a performance of 2.02 so
- 00:21:05quite a bit of an improvement here and
- 00:21:07also when you sample from the model you
- 00:21:08see that the names are definitely
- 00:21:10improving qualitatively as well
- 00:21:13so we could of course spend a lot of
- 00:21:14time here tuning
- 00:21:16um uh tuning things and making it even
- 00:21:18bigger and scaling up the network
- 00:21:19further even with the simple
- 00:21:21um sort of setup here but let's continue
- 00:21:24and let's implement the hierarchical model and treat
- 00:21:27this as just a rough baseline
- 00:21:28performance but there's a lot of
- 00:21:30optimization like left on the table in
- 00:21:32terms of some of the hyper parameters
- 00:21:34that you're hopefully getting a sense of
- 00:21:35now okay so let's scroll up now
- 00:21:38and come back up and what I've done here
- 00:21:41is I've created a bit of a scratch space
- 00:21:42for us to just like look at the forward
- 00:21:45pass of the neural net and inspect the
- 00:21:47shape of the tensor along the way as the
- 00:21:49neural net uh forwards so here I'm just
- 00:21:53temporarily for debugging creating a
- 00:21:55batch of just say four examples so four
- 00:21:58random integers then I'm plucking out
- 00:22:00those rows from our training set
- 00:22:02and then I'm passing into the model the
- 00:22:04input xB
- 00:22:06now the shape of XB here because we have
- 00:22:08only four examples is four by eight and
- 00:22:11this eight is now the current block size
- 00:22:14so uh inspecting XP we just see that we
- 00:22:18have four examples each one of them is a
- 00:22:19row of xB
- 00:22:21and we have eight characters here and
- 00:22:24this integer tensor just contains the
- 00:22:26identities of those characters
- 00:22:29so the first layer of our neural net is
- 00:22:31the embedding layer so passing XB this
- 00:22:33integer tensor through the embedding
- 00:22:35layer creates an output that is four by
- 00:22:37eight by ten
- 00:22:39so our embedding table has for each
- 00:22:42character a 10-dimensional vector that
- 00:22:44we are trying to learn
- 00:22:46and so what the embedding layer does
- 00:22:48here is it plucks out the embedding
- 00:22:50Vector for each one of these integers
- 00:22:53and organizes it all in a four by eight
- 00:22:56by ten tensor now
- 00:22:58so all of these integers are translated
- 00:23:00into 10 dimensional vectors inside this
- 00:23:02three-dimensional tensor now
- 00:23:04passing that through the flattened layer
- 00:23:06as you recall what this does is it views
- 00:23:09this tensor as just a 4 by 80 tensor and
- 00:23:12what that effectively does is that all
- 00:23:15these 10 dimensional embeddings for all
- 00:23:16these eight characters just end up being
- 00:23:18stretched out into a long row
- 00:23:21and that looks kind of like a
- 00:23:22concatenation operation basically so by
- 00:23:25viewing the tensor differently we now
- 00:23:27have a four by eighty and inside this 80
- 00:23:29it's all the 10 dimensional uh
- 00:23:32vectors just uh concatenate next to each
- 00:23:35other
- 00:23:36and then the linear layer of course
- 00:23:37takes uh 80 and creates 200 channels
- 00:23:40just via matrix multiplication
- 00:23:43so so far so good now I'd like to show
- 00:23:45you something surprising
- 00:23:47let's look at the insides of the linear
- 00:23:50layer and remind ourselves how it works
- 00:23:52the linear layer here in the forward
- 00:23:54pass takes the input X multiplies it
- 00:23:56with a weight and then optionally adds
- 00:23:58bias and the weight here is
- 00:24:00two-dimensional as defined here and the
- 00:24:02bias is one dimensional here
- 00:24:04so effectively in terms of the shapes
- 00:24:06involved what's happening inside this
- 00:24:08linear layer looks like this right now
- 00:24:10and I'm using random numbers here but
- 00:24:12I'm just illustrating the shapes and
- 00:24:15what happens
- 00:24:16basically a 4 by 80 input comes into the
- 00:24:18linear layer that's multiplied by this
- 00:24:2080 by 200 weight Matrix inside and
- 00:24:23there's a plus 200 bias and the shape of
- 00:24:25the whole thing that comes out of the
- 00:24:26linear layer is four by two hundred as
- 00:24:28we see here
- 00:24:30now notice here by the way that this
- 00:24:32here will create a 4x200 tensor and then
- 00:24:36plus 200 there's a broadcasting
- 00:24:38happening here about 4 by 200 broadcasts
- 00:24:41with 200 uh so everything works here
- 00:24:44so now the surprising thing that I'd
- 00:24:46like to show you that you may not expect
- 00:24:47is that this input here that is being
- 00:24:49multiplied uh doesn't actually have to
- 00:24:52be two-dimensional this Matrix multiply
- 00:24:55operator in pytorch is quite powerful
- 00:24:56and in fact you can actually pass in
- 00:24:58higher dimensional arrays or tensors and
- 00:25:00everything works fine so for example
- 00:25:02this could be four by five by eighty and
- 00:25:04the result in that case will become four
- 00:25:06by five by two hundred
- 00:25:08you can add as many dimensions as you
- 00:25:09like on the left here
- 00:25:11and so effectively what's happening is
- 00:25:13that the matrix multiplication only
- 00:25:15works on the last Dimension and the
- 00:25:17dimensions before it in the input tensor
- 00:25:19are left unchanged
- 00:25:24so that is basically these um these
- 00:25:27dimensions on the left are all treated
- 00:25:29as just a batch Dimension so we can have
- 00:25:32multiple batch dimensions and then in
- 00:25:34parallel over all those Dimensions we
- 00:25:36are doing the matrix multiplication on
- 00:25:38the last dimension
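A small sketch of this batched matrix-multiply behavior, with arbitrary example shapes:

```python
import torch

# The matrix multiply acts only on the last dimension of the left operand;
# every dimension before it is treated as a batch dimension.
x = torch.randn(4, 5, 80)     # (batch, groups, features)
W = torch.randn(80, 200)
b = torch.randn(200)

y = x @ W + b                 # bias broadcasts over the leading dimensions
print(y.shape)                # torch.Size([4, 5, 200])
```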
- 00:25:39so this is quite convenient because we
- 00:25:41can use that in our Network now
- 00:25:44because remember that we have these
- 00:25:46eight characters coming in
- 00:25:49and we don't want to now uh flatten all
- 00:25:51of it out into a large eight-dimensional
- 00:25:53vector
- 00:25:54because we don't want to matrix multiply
- 00:25:57all 80 numbers
- 00:25:59into a weight matrix
- 00:26:01immediately instead we want to group
- 00:26:03these
- 00:26:04like this
- 00:26:06so every consecutive two elements
- 00:26:09one two and three and four and five and
- 00:26:11six and seven and eight all of these
- 00:26:12should be now
- 00:26:14basically flattened out and multiplied
- 00:26:17by weight Matrix but all of these four
- 00:26:19groups here we'd like to process in
- 00:26:21parallel so it's kind of like a batch
- 00:26:23Dimension that we can introduce
- 00:26:25and then we can in parallel basically
- 00:26:28process all of these uh bigram groups in
- 00:26:33the four batch dimensions of an
- 00:26:34individual example and also over the
- 00:26:37actual batch dimension of the you know
- 00:26:39four examples in our example here so
- 00:26:42let's see how that works effectively
- 00:26:43what we want is right now we take a 4 by
- 00:26:4680
- 00:26:47and multiply it by 80 by 200
- 00:26:50to in the linear layer this is what
- 00:26:52happens
- 00:26:53but instead what we want is we don't
- 00:26:56want 80 characters or 80 numbers to come
- 00:26:58in we only want two characters to come
- 00:27:00in on the very first layer and those two
- 00:27:02characters should be fused
- 00:27:04so in other words we just want 20 to
- 00:27:07come in right 20 numbers would come in
- 00:27:11and here we don't want a 4 by 80 to feed
- 00:27:13into the linear layer we actually want
- 00:27:15these groups of two to feed in so
- 00:27:17instead of four by eighty we want this
- 00:27:19to be a 4 by 4 by 20.
- 00:27:23so these are the four groups of two and
- 00:27:27each one of them is ten dimensional
- 00:27:28vector
- 00:27:29so what we want is now is we need to
- 00:27:31change the flattened layer so it doesn't
- 00:27:33output a four by eighty but it outputs a
- 00:27:35four by four by Twenty where basically
- 00:27:38these um
- 00:27:39every two consecutive characters are uh
- 00:27:43packed in on the very last Dimension and
- 00:27:46then these four is the first batch
- 00:27:48Dimension and this four is the second
- 00:27:50batch Dimension referring to the four
- 00:27:52groups inside every one of these
- 00:27:54examples
- 00:27:55and then this will just multiply like
- 00:27:57this so this is what we want to get to
- 00:27:59so we're going to have to change the
- 00:28:01linear layer in terms of how many inputs
- 00:28:02it expects it shouldn't expect 80 it
- 00:28:05should just expect 20 numbers and we
- 00:28:07have to change our flattened layer so it
- 00:28:09doesn't just fully flatten out this
- 00:28:11entire example it needs to create a 4x4
- 00:28:14by 20 instead of four by eighty so let's
- 00:28:17see how this could be implemented
- 00:28:19basically right now we have an input
- 00:28:21that is a four by eight by ten that
- 00:28:23feeds into the flattened layer and
- 00:28:25currently the flattened layer just
- 00:28:27stretches it out so if you remember the
- 00:28:29implementation of flatten
- 00:28:31it takes RX and it just views it as
- 00:28:34whatever the batch Dimension is and then
- 00:28:35negative one
- 00:28:37so effectively what it does right now is
- 00:28:39it does e dot view of 4 negative one and
- 00:28:42the shape of this of course is 4 by 80.
- 00:28:45so that's what currently happens and we
- 00:28:48instead want this to be a four by four
- 00:28:49by Twenty where these consecutive
- 00:28:51ten-dimensional vectors get concatenated
- 00:28:54so you know how in Python you can take a
- 00:28:57list of range of 10
- 00:29:00so we have numbers from zero to nine and
- 00:29:03we can index like this to get all the
- 00:29:05even parts
- 00:29:06and we can also index like starting at
- 00:29:08one and going in steps up two to get all
- 00:29:11the odd parts
- 00:29:13so one way to implement this it would be
- 00:29:15as follows we can take e and we can
- 00:29:18index into it for all the batch elements
- 00:29:21and then just even elements in this
- 00:29:24Dimension so at indexes 0 2 4 and 8.
- 00:29:29and then all the parts here from this
- 00:29:31last dimension
- 00:29:33and this gives us the even characters
- 00:29:37and then here
- 00:29:39this gives us all the odd characters and
- 00:29:42basically what we want to do is we make
- 00:29:43sure we want to make sure that these get
- 00:29:44concatenated in pi torch and then we
- 00:29:47want to concatenate these two tensors
- 00:29:49along the second dimension
- 00:29:53so this and the shape of it would be
- 00:29:55four by four by Twenty this is
- 00:29:57definitely the result we want we are
- 00:29:58explicitly grabbing the even parts and
- 00:30:01the odd parts and we're arranging those
- 00:30:03four by four by ten right next to each
- 00:30:06other and concatenate
- 00:30:08so this works but it turns out that what
- 00:30:10also works is you can simply use a view
- 00:30:13again and just request the right shape
- 00:30:16and it just so happens that in this case
- 00:30:18those vectors will again end up being
- 00:30:21arranged in exactly the way we want so
- 00:30:23in particular if we take e and we just
- 00:30:25view it as a four by four by Twenty
- 00:30:27which is what we want
- 00:30:28we can check that this is exactly equal
- 00:30:30to but let me call this this is the
- 00:30:33explicit concatenation I suppose
- 00:30:36um
- 00:30:36so explosives dot shape is 4x4 by 20. if
- 00:30:40you just view it as 4x4 by 20 you can
- 00:30:42check that when you compare to explicit
- 00:30:46uh you got a big this is element wise
- 00:30:48operation so making sure that all of
- 00:30:49them are true that is the truth so
- 00:30:53basically long story short we don't need
- 00:30:54to make an explicit call to concatenate
- 00:30:56Etc we can simply take this input tensor
- 00:31:00to flatten and we can just view it in
- 00:31:03whatever way we want
- 00:31:04and in particular you don't want to
- 00:31:07stretch things out with negative one we
- 00:31:09want to actually create a
- 00:31:10three-dimensional array and depending on
- 00:31:12how many vectors that are consecutive we
- 00:31:15want to
- 00:31:16um fuse like for example two then we can
- 00:31:20just simply ask for this Dimension to be
- 00:31:2120. and um
- 00:31:24use a negative 1 here and python will
- 00:31:26figure out how many groups it needs to
- 00:31:27pack into this additional batch
- 00:31:29dimension
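A small sketch of the equivalence being described, using a random placeholder tensor in place of the embedding output e:

```python
import torch

# e stands in for the embedding output of shape (batch, time, channels),
# here (4, 8, 10) as in the walkthrough.
e = torch.randn(4, 8, 10)

# Explicitly put each even-indexed embedding next to the odd-indexed
# embedding that follows it.
explicit = torch.cat([e[:, ::2, :], e[:, 1::2, :]], dim=2)   # (4, 4, 20)

# A plain view produces exactly the same arrangement for free.
assert (e.view(4, 4, 20) == explicit).all()

# Equivalent, letting PyTorch infer the group dimension with -1.
assert (e.view(4, -1, 20) == explicit).all()
```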
- 00:31:30so let's now go into flatten and
- 00:31:32implement this okay so I scroll up here
- 00:31:34to flatten and what we'd like to do is
- 00:31:36we'd like to change it now so let me
- 00:31:38create a Constructor and take the number
- 00:31:40of elements that are consecutive that we
- 00:31:42would like to concatenate now in the
- 00:31:44last dimension of the output
- 00:31:46so here we're just going to remember
- 00:31:48self.n equals n
- 00:31:50and then I want to be careful here
- 00:31:52because pytorch actually has a
- 00:31:54torch.flatten and its keyword
- 00:31:56arguments are different and they kind of
- 00:31:58like function differently so our flatten
- 00:32:00is going to start to depart from pytorch's
- 00:32:02flatten so let me call it
- 00:32:04FlattenConsecutive or something like that just
- 00:32:06to make sure that our apis are about
- 00:32:08equal
- 00:32:09so this uh basically flattens only some
- 00:32:13n consecutive elements and puts them
- 00:32:15into the last dimension
- 00:32:17now here the shape of X is B by T by C
- 00:32:21so let me
- 00:32:23pop those out into variables and recall
- 00:32:26that in our example down below B was 4 T
- 00:32:28was 8 and C was 10.
- 00:32:33now instead of doing x dot view of B by
- 00:32:37negative one
- 00:32:39right this is what we had before
- 00:32:44we want this to be B by
- 00:32:47um negative 1 by
- 00:32:49and basically here we want c times n
- 00:32:52that's how many consecutive elements we
- 00:32:55want
- 00:32:56and here instead of negative one I don't
- 00:32:58super love the use of negative one
- 00:33:00because I like to be very explicit so
- 00:33:02that you get error messages when things
- 00:33:03don't go according to your expectation
- 00:33:04so what do we expect here we expect this
- 00:33:07to become t
- 00:33:09divide n using integer division here
- 00:33:12so that's what I expect to happen
- 00:33:14and then one more thing I want to do
- 00:33:15here is remember previously all the way
- 00:33:18in the beginning n was three and uh
- 00:33:21basically we're concatenating
- 00:33:23um all the three characters that existed
- 00:33:25there
- 00:33:26so we basically are concatenated
- 00:33:28everything
- 00:33:29and so sometimes I can create a spurious
- 00:33:31dimension of one here so if it is the
- 00:33:34case that x dot shape at one is one then
- 00:33:37it's kind of like a spurious dimension
- 00:33:39um so we don't want to return a
- 00:33:41three-dimensional tensor with a one here
- 00:33:44we just want to return a two-dimensional
- 00:33:46tensor exactly as we did before
- 00:33:48so in this case basically we will just
- 00:33:50say x equals x dot squeeze that is a
- 00:33:54pytorch function
- 00:33:56and squeeze takes a dimension that it
- 00:34:01either squeezes out all the dimensions
- 00:34:02of a tensor that are one or you can
- 00:34:05specify the exact Dimension that you
- 00:34:08want to be squeezed and again I like to
- 00:34:10be as explicit as possible always so I
- 00:34:12expect to squeeze out the First
- 00:34:13Dimension only
- 00:34:15of this tensor
- 00:34:17this three-dimensional tensor and if
- 00:34:19this Dimension here is one then I just
- 00:34:21want to return B by c times n
- 00:34:24and so self dot out will be X and then
- 00:34:26we return self.out
- 00:34:28so that's the candidate implementation
- 00:34:30and of course this should be self.n
- 00:34:33instead of just n
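A sketch of the FlattenConsecutive module as just described, under the same from-scratch layer API:

```python
import torch

class FlattenConsecutive:
    # Concatenates every n consecutive embeddings along the last dimension.
    def __init__(self, n):
        self.n = n

    def __call__(self, x):
        B, T, C = x.shape
        x = x.view(B, T // self.n, C * self.n)
        if x.shape[1] == 1:
            # Drop the spurious middle dimension when everything was fused.
            x = x.squeeze(1)
        self.out = x
        return self.out

    def parameters(self):
        return []
```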
- 00:34:34so let's run
- 00:34:36and let's come here now
- 00:34:39and take it for a spin so flatten
- 00:34:41consecutive
- 00:34:44and in the beginning let's just use
- 00:34:47eight so this should recover the
- 00:34:49previous Behavior so FlattenConsecutive
- 00:34:51of eight uh which is the
- 00:34:53current block size
- 00:34:55we can do this uh that should recover
- 00:34:57the previous Behavior
- 00:34:59so we should be able to run the model
- 00:35:02and here we can inspect I have a little
- 00:35:06code snippet here where I iterate over
- 00:35:08all the layers I print the name of this
- 00:35:11class and the shape
- 00:35:14and so we see the shapes as we expect
- 00:35:17them after every single layer in the top
- 00:35:19bit so now let's try to restructure it
- 00:35:22using our flattened consecutive and do
- 00:35:25it hierarchically so in particular
- 00:35:28we want to flatten consecutive not just
- 00:35:30not block size but just two
- 00:35:33and then we want to process this with
- 00:35:34linear now then the number of inputs to
- 00:35:37this linear will not be n_embd times
- 00:35:38block size it will now only be n_embd
- 00:35:41times two,
- 00:35:42which is 20.
- 00:35:44this goes through the first layer and
- 00:35:46now we can in principle just copy paste
- 00:35:48this
- 00:35:49now the next linear layer should expect
- 00:35:51n_hidden times two
- 00:35:53and the last piece of it should expect
- 00:35:58n_hidden times two again
- 00:36:01so this is sort of like the naive
- 00:36:03version of it
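A sketch of the hierarchical stack being assembled here, assuming the from-scratch modules built earlier (Embedding, FlattenConsecutive, Linear, BatchNorm1d, Tanh, Sequential); the channel sizes are illustrative, and the lecture later settles on 68 hidden units to match the earlier parameter count:

```python
n_embd, n_hidden, vocab_size = 10, 200, 27  # illustrative hyperparameters

model = Sequential([
    Embedding(vocab_size, n_embd),
    FlattenConsecutive(2), Linear(n_embd * 2, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
    FlattenConsecutive(2), Linear(n_hidden * 2, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
    FlattenConsecutive(2), Linear(n_hidden * 2, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
    Linear(n_hidden, vocab_size),
])
```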
- 00:36:04um
- 00:36:05so running this we now have a much much
- 00:36:07bigger model
- 00:36:09and we should be able to basically just
- 00:36:10forward the model
- 00:36:13and now we can inspect uh the numbers in
- 00:36:16between
- 00:36:17so four by eight by 20
- 00:36:19was flattened consecutively into four by
- 00:36:21four by twenty
- 00:36:23this was projected into four by four by
- 00:36:24two hundred
- 00:36:26and then batchnorm just worked out of
- 00:36:29the box, we have to verify that batchnorm
- 00:36:31does the correct thing even though it
- 00:36:33takes a three-dimensional input instead
- 00:36:34of a two-dimensional input
- 00:36:36then we have tanh which is element wise
- 00:36:38then we crushed it again so if we
- 00:36:41flatten consecutively and ended up with
- 00:36:42a four by two by 400 now
- 00:36:45then linear brought it back down to 200
- 00:36:47batchnorm tanh and lastly we get a 4 by
- 00:36:50400 and we see that the flattened
- 00:36:52consecutive for the last flatten here uh
- 00:36:54it squeezed out that dimension of one so
- 00:36:57we only ended up with four by four
- 00:36:58hundred and then linear, batchnorm, tanh
- 00:37:00and uh the last linear layer to get our
- 00:37:04logits and so the logits end up in the
- 00:37:06same shape as they were before but now
- 00:37:08we actually have a nice three layer
- 00:37:10neural net and it basically corresponds
- 00:37:12to whoops sorry it basically corresponds
- 00:37:15exactly to this network now except only
- 00:37:18this piece here because we only have
- 00:37:20three layers whereas here in this
- 00:37:22example there's uh four layers with the
- 00:37:25total receptive field size of 16
- 00:37:28characters instead of just eight
- 00:37:29characters so the block size here is 16.
- 00:37:32so this piece of it's basically
- 00:37:34implemented here
- 00:37:36um now we just have to kind of figure
- 00:37:38out some good Channel numbers to use
- 00:37:40here now in particular I changed the
- 00:37:42number of hidden units to be 68 in this
- 00:37:45architecture because when I use 68 the
- 00:37:47number of parameters comes out to be 22
- 00:37:49000 so that's exactly the same that we
- 00:37:52had before and we have the same amount
- 00:37:54of capacity at this neural net in terms
- 00:37:56of the number of parameters but the
- 00:37:57question is whether we are utilizing
- 00:37:59those parameters in a more efficient
- 00:38:00architecture so what I did then is I got
- 00:38:03rid of a lot of the debugging cells here
- 00:38:05and I rerun the optimization and
- 00:38:07scrolling down to the result we see that
- 00:38:09we get the identical performance roughly
- 00:38:12so our validation loss now is 2.029 and
- 00:38:15previously it was 2.027 so controlling
- 00:38:18for the number of parameters changing
- 00:38:20from the flat to hierarchical is not
- 00:38:21giving us anything yet
- 00:38:23that said there are two things
- 00:38:25um to point out number one we didn't
- 00:38:27really torture the um architecture here
- 00:38:29very much this is just my first guess
- 00:38:31and there's a bunch of hyper parameters
- 00:38:33search that we could do in order in
- 00:38:35terms of how we allocate uh our budget
- 00:38:37of parameters to what layers number two
- 00:38:39we still may have a bug inside the
- 00:38:42batchnorm1d layer so let's take a look
- 00:38:44at
- 00:38:45um uh that because it runs but does it
- 00:38:49do the right thing
- 00:38:50so I pulled up the layer inspector sort
- 00:38:53of that we have here and printed out the
- 00:38:55shape along the way and currently it
- 00:38:57looks like the batchnorm is receiving
- 00:38:58an input that is 32 by 4 by 68 right and
- 00:39:03here on the right I have the current
- 00:39:04implementation of batchnorm that we have
- 00:39:05right now
- 00:39:06now this batchnorm assumed in the way we
- 00:39:09wrote it and at the time that X is
- 00:39:11two-dimensional so it was n by D where n
- 00:39:15was the batch size so that's why we only
- 00:39:17reduced uh the mean and the variance
- 00:39:19over the zeroth dimension but now X will
- 00:39:21basically become three-dimensional so
- 00:39:23what's happening inside the batchnorm
- 00:39:24right now and how come it's working at
- 00:39:26all and not giving any errors the reason
- 00:39:28for that is basically because everything
- 00:39:30broadcasts properly but the batchnorm is
- 00:39:32not doing what we need what we wanted to
- 00:39:34do
- 00:39:35so in particular let's basically think
- 00:39:37through what's happening inside the
- 00:39:38batchnorm uh looking at
- 00:39:41what's happening here
- 00:39:43I have the code here
- 00:39:45so we're receiving an input of 32 by 4
- 00:39:47by 68 and then we are doing uh here x
- 00:39:52dot mean here I have e instead of X but
- 00:39:54we're doing the mean over zero and
- 00:39:57that's actually giving us 1 by 4 by 68.
- 00:39:59so we're doing the mean only over the
- 00:40:01very first Dimension and it's giving us
- 00:40:03a mean and a variance that still
- 00:40:05maintain this Dimension here
- 00:40:07so these means are only taking over 32
- 00:40:10numbers in the First Dimension and then
- 00:40:12when we perform this everything
- 00:40:14broadcasts correctly still
- 00:40:16but basically what ends up happening is
- 00:40:20when we also look at the running mean
- 00:40:26the shape of it so I'm looking at the
- 00:40:27model.layers at 3 which is the
- 00:40:28first batchnorm layer and looking
- 00:40:30at whatever the running mean became and
- 00:40:32its shape
- 00:40:34the shape of this running mean now is 1
- 00:40:35by 4 by 68.
- 00:40:38right instead of it being
- 00:40:39um you know just a size of dimension
- 00:40:43because we have 68 channels we expect to
- 00:40:45have 68 means and variances that we're
- 00:40:47maintaining but actually we have an
- 00:40:49array of 4 by 68 and so basically what
- 00:40:51this is telling us is that
- 00:40:55this batchnorm is currently working in
- 00:40:57parallel
- 00:40:58over
- 00:41:014 times 68 instead of just 68 channels
- 00:41:06so basically we are maintaining
- 00:41:08statistics for every one of these four
- 00:41:10positions individually and independently
- 00:41:13and instead what we want to do is we
- 00:41:15want to treat this four as a batch
- 00:41:16Dimension just like the zeroth dimension
- 00:41:19so as far as the batchnorm is concerned
- 00:41:22it doesn't want to average we don't want
- 00:41:24to average over 32 numbers we want to
- 00:41:26now average over 32 times four numbers
- 00:41:29for every single one of these 68
- 00:41:31channels
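For concreteness, a minimal sketch of the buggy behavior; e is a stand-in for the 32 by 4 by 68 activations entering the layer, and model.layers[3] refers to the first batch norm layer as in the inspector above:

```python
import torch

e = torch.randn(32, 4, 68)           # stand-in for the activations flowing into the batch norm
emean = e.mean(0, keepdim=True)      # reduces only over the batch dimension
print(emean.shape)                   # torch.Size([1, 4, 68]): 4*68 separate means, not 68
# and correspondingly model.layers[3].running_mean.shape comes out as (1, 4, 68)
```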
- 00:41:32So let me now
- 00:41:34remove this.
- 00:41:36It turns out that when you look at the
- 00:41:38documentation of torch.mean,
- 00:41:42so let's go to torch.mean,
- 00:41:49in one of its signatures, when we specify
- 00:41:51the dimension,
- 00:41:53we see that the dimension here is not
- 00:41:54just an int; it can also be a
- 00:41:56tuple of ints, so we can reduce over
- 00:41:59multiple dimensions at the same time.
- 00:42:02So instead of just reducing over zero, we
- 00:42:04can pass in the tuple (0, 1),
- 00:42:08and we pass (0, 1) here as well. Then
- 00:42:10the output, of
- 00:42:12course, is going to be the same,
- 00:42:13but now,
- 00:42:15because we reduce over 0 and 1, if we
- 00:42:17look at the shape of this mean,
- 00:42:20we see that we took
- 00:42:22the mean over both the zeroth and the
- 00:42:25first dimension,
- 00:42:26so we're just getting 68 numbers and a
- 00:42:28bunch of spurious dimensions here.
- 00:42:30So now this becomes 1 by 1 by 68, and the
- 00:42:34running mean and the running variance
- 00:42:35analogously will become 1 by 1 by
- 00:42:3768. So even though there are these
- 00:42:39spurious dimensions, the
- 00:42:41correct thing will happen, in
- 00:42:43that we are only maintaining means and
- 00:42:45variances for 68 channels,
- 00:42:49and we're not calculating the mean and
- 00:42:50variance across 32 times 4 separate positions.
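In code, the tuple-of-dims version looks roughly like this:

```python
import torch

e = torch.randn(32, 4, 68)                # same stand-in activations as before
emean = e.mean((0, 1), keepdim=True)      # reduce over dims 0 and 1 together
evar = e.var((0, 1), keepdim=True)
print(emean.shape, evar.shape)            # torch.Size([1, 1, 68]) torch.Size([1, 1, 68])
```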
- 00:42:54So that's exactly what we want. Let's
- 00:42:56change the implementation of BatchNorm1d
- 00:42:58that we have so that it can take in
- 00:43:01two-dimensional or three-dimensional
- 00:43:02inputs and behave accordingly. At the
- 00:43:05end of the day, the fix is relatively
- 00:43:07straightforward. Basically, the dimension
- 00:43:09we want to reduce over is either 0 or
- 00:43:12the tuple (0, 1), depending on the
- 00:43:14dimensionality of x. So if x.ndim
- 00:43:16is two, so it's a two-dimensional tensor,
- 00:43:18then the dimension we want to reduce over is
- 00:43:20just the integer zero.
- 00:43:22Elif x.ndim is three, so it's a
- 00:43:24three-dimensional tensor, then the dims
- 00:43:26we're going to reduce over are zero and one,
- 00:43:29and then
- 00:43:31here we just pass in dim.
- 00:43:33And if the dimensionality of x is
- 00:43:35anything else, we'll now get an error,
- 00:43:36which is good.
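Roughly, the change inside our BatchNorm1d's call looks like the following sketch (x stands for the layer input):

```python
import torch

x = torch.randn(32, 4, 68)    # stand-in for the input to the BatchNorm1d layer

# inside BatchNorm1d's __call__, choose the dims to reduce over based on the input rank
if x.ndim == 2:
    dim = 0                   # (N, C): reduce over the batch dimension only
elif x.ndim == 3:
    dim = (0, 1)              # (N, L, C): treat both N and L as batch dimensions
# any other ndim leaves dim undefined, so the lines below raise, which is what we want

xmean = x.mean(dim, keepdim=True)    # shape (1, 1, 68): one mean per channel
xvar = x.var(dim, keepdim=True)      # shape (1, 1, 68): one variance per channel
```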
- 00:43:38So that should be the fix. Now I want
- 00:43:41to point out one more thing: we're
- 00:43:42actually departing from the API of
- 00:43:44PyTorch here a little bit, because when you
- 00:43:46go to BatchNorm1d in PyTorch, you
- 00:43:48can scroll down and see that the
- 00:43:50input to this layer can either be N by C,
- 00:43:53where N is the batch size and C is the
- 00:43:55number of features or channels, or it
- 00:43:57actually does accept three-dimensional
- 00:43:59inputs, but it expects them to be N by C by
- 00:44:01L,
- 00:44:02where L is, say, the sequence length or
- 00:44:04something like that.
- 00:44:05So
- 00:44:07this is a problem, because you see how C is
- 00:44:09nested here in the middle, and so when it
- 00:44:12gets three-dimensional inputs, this batch
- 00:44:14norm layer will reduce over zero and two
- 00:44:17instead of zero and one. So basically
- 00:44:20the PyTorch BatchNorm1d layer
- 00:44:22assumes that C always comes right after
- 00:44:25the batch dimension, whereas we assume here
- 00:44:28that C is the last dimension and there
- 00:44:30are some number of batch dimensions
- 00:44:32beforehand.
- 00:44:36So it expects N by C or N by C by L; we
- 00:44:39expect N by C or N by L by C.
- 00:44:42And so it's a deviation,
- 00:44:46but I think it's okay. I prefer it this way,
- 00:44:49honestly, so this is the way that we will
- 00:44:50keep it for our purposes.
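To illustrate the deviation: if we wanted to push our (N, L, C) tensor through PyTorch's own nn.BatchNorm1d, we would have to permute the channels into the middle and back out again (a hedged sketch, not something we actually do in this code):

```python
import torch
import torch.nn as nn

x = torch.randn(32, 4, 68)                   # our convention: (N, L, C)
bn = nn.BatchNorm1d(68)                      # PyTorch expects (N, C) or (N, C, L)
y = bn(x.permute(0, 2, 1)).permute(0, 2, 1)  # move channels to dim 1, normalize, move back
print(y.shape)                               # torch.Size([32, 4, 68])
```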
- 00:44:52So I redefined the layers, re-initialized
- 00:44:54the neural net, and did a single forward
- 00:44:55pass with a break, just for one step,
- 00:44:57looking at the shapes along the way.
- 00:44:59They're of course identical, all the
- 00:45:01shapes are the same, but the way we see
- 00:45:03that things are actually working as we
- 00:45:05want them to now is that when we look at
- 00:45:07the batch norm layer, the running mean
- 00:45:08shape is now 1 by 1 by 68. So we're
- 00:45:11only maintaining 68 means, one for each
- 00:45:13of our channels, and we're treating both
- 00:45:15the zeroth and the first dimension as a
- 00:45:17batch dimension, which is exactly what we
- 00:45:19want. So let me retrain the neural net
- 00:45:21now. Okay, so I retrained the neural net
- 00:45:22with the bug fix. We get a nice curve, and
- 00:45:25when we look at the validation
- 00:45:25performance we do actually see a slight
- 00:45:27improvement: we went from 2.029 to
- 00:45:302.022. So basically the bug inside the
- 00:45:32batch norm was holding us back a
- 00:45:35little bit, it looks like, and we are
- 00:45:37getting a tiny improvement now, but it's
- 00:45:39not clear if this is statistically
- 00:45:40significant.
- 00:45:42The reason we slightly expect an
- 00:45:44improvement is that we're no longer
- 00:45:46maintaining so many different means and
- 00:45:47variances that are each estimated using
- 00:45:49only 32 numbers; effectively, now we are
- 00:45:52estimating them using 32 times 4 numbers.
- 00:45:54So you just have a lot more numbers that
- 00:45:56go into any one estimate of the mean and
- 00:45:58variance, and it allows things to be a
- 00:46:01bit more stable and less wiggly inside
- 00:46:03those estimates of those statistics.
- 00:46:07So, pretty nice. With this more general
- 00:46:08architecture in place, we are now set up
- 00:46:10to push the performance further by
- 00:46:12increasing the size of the network. So
- 00:46:14for example, I bumped up the number of
- 00:46:16embedding dimensions to 24 instead of 10 and also
- 00:46:19increased the number of hidden units, but
- 00:46:21using the exact same architecture, we now
- 00:46:23have 76,000 parameters, and the training
- 00:46:25takes a lot longer, but we do get a nice
- 00:46:28curve. And then when you actually
- 00:46:29evaluate the performance, we are now
- 00:46:31getting a validation performance of 1.993.
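For reference, the bumped-up configuration is roughly the following; n_embd = 24 is stated above, while n_hidden = 128 is an assumption on my part that happens to be consistent with the ~76,000-parameter count (the lecture only says the number of hidden units was increased):

```python
n_embd = 24     # dimensionality of the character embedding vectors (was 10)
n_hidden = 128  # neurons in each hidden layer of the hierarchical net (assumed)
# with block_size 8, vocab_size 27, and bias-free linears before each batch norm, this works out to
# roughly 27*24 + 48*128 + 2*(2*128*128) + 3*(2*128) + 128*27 + 27 ≈ 76,000 parameters
```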
- 00:46:33So we've crossed over into the sub-2.0
- 00:46:36territory and are right at about 1.99, but we
- 00:46:39are starting to have to wait quite a bit
- 00:46:42longer, and we're a little bit in the
- 00:46:44dark with respect to the correct settings
- 00:46:46of the hyperparameters here and the
- 00:46:47learning rates and so on, because the
- 00:46:48experiments are starting to take longer
- 00:46:50to train. And so we are missing sort of
- 00:46:52like an experimental harness on which we
- 00:46:54could run a number of experiments and
- 00:46:56really tune this architecture very well.
- 00:46:58So I'd like to conclude now with a few
- 00:46:59notes. We basically improved our
- 00:47:02performance from a starting point of 2.1 down
- 00:47:04to 1.99, but I don't want that to be the
- 00:47:06focus, because honestly we're kind of in
- 00:47:08the dark: we have no experimental harness,
- 00:47:10we're just guessing and checking, and
- 00:47:12this whole setup is terrible, we're just
- 00:47:13looking at the training loss. Normally
- 00:47:15you want to look at both the training
- 00:47:17and the validation loss together, and the
- 00:47:19whole thing looks different if you're
- 00:47:20actually trying to squeeze out numbers.
- 00:47:23That said, we did implement this
- 00:47:25architecture from the WaveNet paper, but
- 00:47:28we did not implement its specific
- 00:47:31forward pass, where you have a more
- 00:47:33complicated, gated linear layer,
- 00:47:35and
- 00:47:38there are residual connections and skip
- 00:47:40connections and so on. We did not
- 00:47:42implement that; we just implemented this
- 00:47:44hierarchical structure. I would like to briefly hint
- 00:47:46at, or preview, how what we've done here
- 00:47:48relates to convolutional neural networks
- 00:47:50as used in the WaveNet paper. Basically,
- 00:47:52the use of convolutions is
- 00:47:54strictly for efficiency; it doesn't
- 00:47:56actually change the model we've
- 00:47:57implemented.
- 00:47:58So here, for example,
- 00:48:00let me look at a specific name to work
- 00:48:02with. There's a name in our
- 00:48:05training set, and it's DeAndre, and it has
- 00:48:08seven letters, so that is eight
- 00:48:10independent examples in our model. So all
- 00:48:12these rows here are independent examples
- 00:48:14of DeAndre.
- 00:48:16Now you can forward, of course, any one of
- 00:48:18these rows independently. So I can take
- 00:48:20my model and call it on any
- 00:48:24individual index. Notice, by the way, that here
- 00:48:26I'm being a little bit tricky.
- 00:48:28The reason for this is that Xtr[7], its
- 00:48:30shape is just a
- 00:48:33one-dimensional array of eight, so you
- 00:48:36can't actually call the model on it;
- 00:48:37you're going to get an error because
- 00:48:39there's no batch dimension.
- 00:48:41But when you do Xtr[[7]], indexing with
- 00:48:45the list [7], then the shape of this
- 00:48:47becomes 1 by 8, so I get an extra
- 00:48:49batch dimension of one, and then we can
- 00:48:52forward the model.
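In code (Xtr and model are the training-input tensor and the model from earlier in the notebook; index 7 is just for illustration):

```python
# Xtr[7] has shape (8,), with no batch dimension, so model(Xtr[7]) would error out.
# Indexing with a list keeps the batch dimension:
x_single = Xtr[[7]]        # shape (1, 8)
logits = model(x_single)   # shape (1, 27)
```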
- 00:48:53So
- 00:48:55that forwards a single example. And you
- 00:48:57might imagine that you actually may want
- 00:48:59to forward all of these eight
- 00:49:01at the same time.
- 00:49:03So pre-allocating some memory and then
- 00:49:05doing a for loop eight times,
- 00:49:07forwarding each of those eight rows, will
- 00:49:10give us the logits in all these
- 00:49:11different cases.
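A minimal sketch of that loop, assuming the eight rows for this name sit at indices 7 through 14 of Xtr (the exact offsets depend on where the name lands in the dataset):

```python
import torch

logits = torch.zeros(8, 27)              # pre-allocate room for the eight outputs
for i in range(8):
    logits[i] = model(Xtr[[7 + i]])[0]   # one independent forward call per position
```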
- 00:49:13Now for us, with the model as we've
- 00:49:14implemented it right now, this is eight
- 00:49:16independent calls to our model.
- 00:49:18But what convolutions allow you to do is
- 00:49:20to basically slide this
- 00:49:22model efficiently over the input
- 00:49:24sequence, so this for loop can be
- 00:49:27done not outside in Python but inside
- 00:49:31CUDA kernels, and so this for loop
- 00:49:33gets hidden inside the convolution.
- 00:49:35So you can think of the convolution
- 00:49:37basically as a for loop applying a
- 00:49:40little linear filter over the space of some
- 00:49:43input sequence. In our case the space
- 00:49:45we're interested in is one-dimensional,
- 00:49:46and we're interested in sliding these
- 00:49:48filters over the input data.
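As a rough illustration of the sliding-filter idea (this is not the WaveNet's dilated setup, and the channel counts are just placeholders): a Conv1d with kernel size 2 and stride 2 fuses pairs of adjacent positions with one shared linear filter, much like the first layer of our hierarchical net that merges two characters at a time, except the sliding for loop happens inside the kernel:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 24, 8)    # PyTorch layout (N, C, L): 24 embedding channels, 8 character positions
conv = nn.Conv1d(in_channels=24, out_channels=128, kernel_size=2, stride=2)
out = conv(x)                # one call computes the filter at every pair of positions
print(out.shape)             # torch.Size([1, 128, 4])
```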
- 00:49:51So this diagram is actually fairly good
- 00:49:54as well.
- 00:49:55Basically, what we've done is, here they
- 00:49:57are highlighting in black
- 00:49:59one single tree of this
- 00:50:01calculation, so just calculating a
- 00:50:03single output example here.
- 00:50:07And this is basically what we've
- 00:50:08implemented: we've implemented
- 00:50:10this single black structure
- 00:50:13and calculated a single
- 00:50:15output, like a single example.
- 00:50:17But what convolutions allow you to do is
- 00:50:19to take this black
- 00:50:20structure and kind of slide it over
- 00:50:23the input sequence here and calculate
- 00:50:26all of these orange outputs at the same
- 00:50:29time. Or here, that corresponds to
- 00:50:31calculating all of these outputs
- 00:50:34at all the positions of DeAndre at
- 00:50:37the same time.
- 00:50:38And the reason that this is much more
- 00:50:41efficient is because, number one, as I
- 00:50:43mentioned, the for loop over the sliding is inside the
- 00:50:45CUDA kernels, so that
- 00:50:48makes it efficient. But number two, notice
- 00:50:50the variable reuse here. For example, if
- 00:50:52we look at this circled node here:
- 00:50:54this node is the right child of
- 00:50:56this node, but it is also the left child of
- 00:50:59the node here,
- 00:51:01and so basically this node and its value
- 00:51:03are used twice.
- 00:51:05So right now, in the naive way, we'd
- 00:51:08have to recalculate it, but with the convolution we are
- 00:51:11allowed to reuse it.
- 00:51:12So in the convolutional neural network,
- 00:51:14you think of these linear layers that we
- 00:51:16have up above as filters,
- 00:51:19and you slide these linear filters
- 00:51:21over the input sequence,
- 00:51:23and we calculate the first layer, and
- 00:51:25then the second layer, and then the third
- 00:51:26layer, and then the output layer of the
- 00:51:28sandwich, and it's all done very
- 00:51:30efficiently using these convolutions.
- 00:51:32So we're going to cover that in a future
- 00:51:34video. The second thing I hope you took
- 00:51:35away from this video is that you've seen me
- 00:51:37basically implement all of these layer
- 00:51:40Lego building blocks, or module building
- 00:51:42blocks. I'm implementing them over
- 00:51:45here, and we've implemented a number of
- 00:51:46layers together, and we've also
- 00:51:48implemented these containers, and
- 00:51:51we've overall PyTorch-ified our code
- 00:51:53quite a bit more.
- 00:51:54Now basically what we're doing here is
- 00:51:56re-implementing torch.nn, which is
- 00:51:59the neural networks library on top of
- 00:52:02torch.tensor, and it looks very much like
- 00:52:04this, except it is much better,
- 00:52:07because it's in PyTorch instead of
- 00:52:08my janky Jupyter notebook. So I think
- 00:52:11going forward I will probably
- 00:52:13consider us as having unlocked
- 00:52:15torch.nn: we understand roughly what's
- 00:52:18in there, how these modules work, how
- 00:52:19they're nested, and what they're doing on
- 00:52:21top of torch.tensor. So hopefully we'll
- 00:52:24just switch over and
- 00:52:25continue and start using torch.nn
- 00:52:27directly.
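As a taste of that, the simple (non-hierarchical) version of our layer stack could be expressed directly with torch.nn modules and containers; this is a hedged sketch of the correspondence, not code from the lecture, and the sizes are illustrative:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Embedding(27, 24),                # our Embedding: 27 characters, 24-dimensional vectors
    nn.Flatten(1, 2),                    # flatten the 8 character embeddings into one vector
    nn.Linear(8 * 24, 128, bias=False),  # our Linear
    nn.BatchNorm1d(128),                 # our BatchNorm1d (2D input here, so the conventions agree)
    nn.Tanh(),                           # our Tanh
    nn.Linear(128, 27),                  # output layer producing logits over the vocabulary
)

x = torch.randint(0, 27, (32, 8))        # a batch of 32 contexts of 8 character indices
logits = model(x)                        # shape (32, 27)
```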
- 00:52:29The next thing I hope you got a bit of a sense of is what the
- 00:52:31development process of building deep
- 00:52:33neural networks looks like, which I think
- 00:52:35was relatively representative to some
- 00:52:36extent. So, number one, we are spending a
- 00:52:39lot of time on the documentation pages of
- 00:52:41PyTorch, and we're reading through all
- 00:52:44the layers, looking at the documentation:
- 00:52:45what are the shapes of the inputs, what can
- 00:52:48they be, what does the layer do, and so on.
- 00:52:51Unfortunately, I have to say the
- 00:52:53PyTorch documentation is not very
- 00:52:55good. They spend a ton of time on
- 00:52:57hardcore engineering of all kinds of
- 00:52:59distributed primitives, etc., but as far as
- 00:53:01I can tell no one is maintaining the
- 00:53:03documentation: it will lie to you, it will
- 00:53:06be wrong, it will be incomplete, it will
- 00:53:08be unclear. So unfortunately it is what
- 00:53:12it is, and you just kind of do your best
- 00:53:14with what they've given us.
- 00:53:18Number two,
- 00:53:20the other thing that I hope you got a
- 00:53:22sense of is that there's a ton of trying to
- 00:53:24make the shapes work, and there's a lot
- 00:53:26of gymnastics around these
- 00:53:27multi-dimensional arrays: are they
- 00:53:29two-dimensional, three-dimensional,
- 00:53:30four-dimensional, what layers take
- 00:53:32what shapes, is it NCL or NLC, and you're
- 00:53:36permuting and viewing, and it can just
- 00:53:39get pretty messy. And so that brings me
- 00:53:40to number three: I very often prototype
- 00:53:43these layers and implementations in
- 00:53:44Jupyter notebooks and make sure that all
- 00:53:46the shapes work out, and I'm spending a
- 00:53:48lot of time basically babysitting the
- 00:53:50shapes and making sure everything is
- 00:53:52correct. Then, once I'm satisfied with
- 00:53:54the functionality in the Jupyter
- 00:53:55notebook, I will take that code and copy
- 00:53:57paste it into my repository of actual
- 00:53:59code that I'm training with. So then
- 00:54:02I'm working with VS Code on the side:
- 00:54:04I usually have a Jupyter notebook and VS
- 00:54:06Code open, I develop in the Jupyter notebook, I
- 00:54:07paste into VS Code, and then I kick off
- 00:54:09experiments from the repo, of
- 00:54:11course from the code repository. So
- 00:54:14those are roughly some notes on the
- 00:54:16development process of working with
- 00:54:17neural nets. Lastly, I think this lecture
- 00:54:19unlocks a lot of potential further
- 00:54:21lectures, because, number one, we have to
- 00:54:23convert our neural network to actually
- 00:54:25use these dilated causal convolutional
- 00:54:27layers, so implementing the ConvNet. Number
- 00:54:30two, potentially starting to get into
- 00:54:32what it means to have residual
- 00:54:34connections and skip connections, and why
- 00:54:36they are useful.
- 00:54:37Number three: as I mentioned, we don't
- 00:54:40have any experimental harness, so right
- 00:54:42now I'm just guessing and checking
- 00:54:44everything. This is not representative of
- 00:54:45typical deep learning workflows: you have
- 00:54:47to set up your evaluation harness, you
- 00:54:49can kick off experiments, you have lots
- 00:54:51of arguments that your script can take,
- 00:54:53you're kicking off a lot of
- 00:54:54experimentation, you're looking at a lot
- 00:54:56of plots of training and validation
- 00:54:57losses, you're looking at what is
- 00:54:59working and what is not working, and
- 00:55:01you're working at this kind of population
- 00:55:02level, and you're doing all these
- 00:55:04hyperparameter searches. We've done
- 00:55:06none of that so far, so how to set that
- 00:55:09up and how to make it good is, I think, a
- 00:55:11whole other topic. Number four, we
- 00:55:14should probably cover recurrent neural
- 00:55:16networks: RNNs, LSTMs, GRUs, and of
- 00:55:19course transformers. So many places to
- 00:55:22go, and we'll cover them in the future.
- 00:55:24For now, bye. Sorry, I forgot to say that
- 00:55:27if you are interested, I think it is kind
- 00:55:30of interesting to try to beat this
- 00:55:31number, 1.993, because I really haven't
- 00:55:34tried a lot of experimentation here, and
- 00:55:36there's quite a bit of fruit potentially
- 00:55:37still left to pick. I haven't
- 00:55:40tried any other ways of allocating these
- 00:55:42channels in this neural net; maybe the
- 00:55:44number of dimensions for the embedding
- 00:55:47is all wrong; maybe it's possible to
- 00:55:49actually take the original network with
- 00:55:50just one hidden layer, make it big
- 00:55:53enough, and actually beat my fancy
- 00:55:54hierarchical network. It's not obvious;
- 00:55:56that would be kind of embarrassing if
- 00:55:59this did not do better, even once you
- 00:56:01torture it a little bit. Maybe you can
- 00:56:03read the WaveNet paper and try to
- 00:56:04figure out how some of these layers work
- 00:56:06and implement them yourselves using what
- 00:56:07we have.
- 00:56:08And of course you can always tune some
- 00:56:10of the initialization or some of the
- 00:56:12optimization and see if you can improve
- 00:56:15it that way. So I'd be curious if people
- 00:56:16can come up with some ways to beat this.
- 00:56:18And yeah, that's it for now. Bye!
- language model
- WaveNet
- batch normalization
- neural network
- character prediction
- hierarchical architecture
- hyperparameter tuning
- dilated convolutions
- training
- evaluation