00:00:00
[Music]
00:00:16
hello everyone welcome to this lecture
00:00:19
in the building large language models
00:00:21
from scratch series we have covered five
00:00:25
lectures up till now and in the previous
00:00:28
lecture we looked at the GPT-3
00:00:30
architecture in a lot of detail we also
00:00:33
saw the progression from GPT to GPT-2 to
00:00:36
GPT-3 and finally to
00:00:39
GPT-4 we saw that the total pre-training
00:00:42
cost for GPT-3 is around $4.6 million
00:00:46
which is insanely
00:00:48
high and up till now we have also looked
00:00:51
at the data set which was used for
00:00:53
pre-training gpt3 and we have seen this
00:00:56
several times until
00:00:58
now in the previous lecture we
00:01:00
learned about the differences between
00:01:02
zero shot versus few shot learning as
00:01:05
well so if you have not been through the
00:01:07
previous lectures we have already
00:01:09
covered five lectures in this series and
00:01:11
all of them have received a
00:01:13
very good response on YouTube and I've
00:01:16
received a number of comments saying
00:01:18
that they have really helped people so I
00:01:21
encourage you to go through those
00:01:23
videos in today's lecture we are going
00:01:25
to be discussing what we will
00:01:28
exactly cover in the playlist in these
00:01:30
five lectures we have looked at some of
00:01:32
the theory modules some of the intuition
00:01:35
modules behind attention behind self
00:01:37
attention prediction of the next word
00:01:40
zero-shot versus few-shot learning
00:01:42
basics of the Transformer architecture
00:01:45
data sets used for LLM pre-training
00:01:48
difference between pre-training and fine
00:01:49
tuning Etc but now from the next lecture
00:01:54
onwards we are going to start with the
00:01:56
Hands-On aspects of actually building an
00:01:59
llm so I wanted to utilize this
00:02:01
particular lecture to give you a road
00:02:03
map of what all we will be doing in this
00:02:07
series and the stages we will
00:02:09
be covering during this
00:02:12
playlist so that is the title of today's
00:02:14
lecture stages of building a large
00:02:16
language model towards the end of this
00:02:18
lecture we will also do a recap of what
00:02:21
all we have learned until now so let's
00:02:24
get started with today's
00:02:26
lecture okay so we will break this
00:02:28
playlist into three stages stage
00:02:31
one stage two and stage three remember
00:02:35
before we get started that this material
00:02:37
which I am showing is heavily borrowed from
00:02:40
the book Build a Large Language
00:02:42
Model (From Scratch) which is written by
00:02:44
Sebastian Raschka so I'm very grateful to
00:02:47
the author for writing this book which
00:02:49
is allowing me to make this
00:02:51
playlist okay so we'll be dividing the
00:02:54
playlist into three stages stage one
00:02:56
stage two and stage number three
00:02:59
unfortunately all of the playlists
00:03:01
currently which are available on YouTube
00:03:03
only go through some of these stages and
00:03:06
that too they do not cover these stages
00:03:08
in detail my plan is to devote a number
00:03:11
of lectures to each stage in this uh
00:03:15
playlist so that you get a very detailed
00:03:18
understanding of how the nuts and bolts
00:03:20
really
00:03:21
work so in stage one we are going to be
00:03:24
looking at uh essentially building a
00:03:26
large language model and we are going to
00:03:29
look at the building blocks which are
00:03:31
necessary so before we go to train the
00:03:34
large language model we need to do the
00:03:36
data pre-processing and sampling in a
00:03:38
very specific manner we need to
00:03:40
understand the attention mechanism and
00:03:42
we will need to understand the LLM
00:03:44
architecture so in stage one we are
00:03:46
going to focus on these three things
00:03:49
understanding how the data is collected
00:03:51
from different data sets how the data is
00:03:54
processed how the data is sampled number
00:03:56
one then we will go to attention
00:03:58
mechanism how to code out the attention
00:04:00
mechanism completely from scratch in
00:04:02
Python what is meant by key, query, and value
00:04:05
what is the attention score what is
00:04:08
positional encoding what is Vector
00:04:10
embedding all of this will be covered in
00:04:12
this stage we'll also be looking at the
00:04:14
llm architecture such as how to stack
00:04:17
different layers on top of each other
00:04:19
where should the attention head go all
00:04:21
of these things essentially the
00:04:25
main part of
00:04:28
this stage will be to
00:04:29
understand the basic mechanism behind
00:04:32
the large language model so what exactly
00:04:34
we will cover in data preparation and
00:04:36
sampling first we'll see tokenization if
00:04:39
you are given sentences how to break
00:04:41
them down into individual tokens as we
00:04:44
have seen earlier a token can be thought
00:04:46
of as a unit of a sentence but there is
00:04:48
a particular way of doing tokenization
00:04:50
we'll cover that.
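To give a rough feel for what's coming, here is a minimal, purely illustrative sketch of a simple regex-based tokenizer in Python; the sample sentence and the toy vocabulary are made up for this example, and the lectures will build tokenization up step by step (and later move to byte-pair encoding).

```python
import re

text = "Hello, world. Is this-- a test?"

# Split on whitespace and punctuation, keeping the punctuation marks as tokens
tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
tokens = [t.strip() for t in tokens if t.strip()]
print(tokens)  # ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']

# Build a toy vocabulary mapping each unique token to an integer ID
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
token_ids = [vocab[t] for t in tokens]
print(token_ids)
```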
00:04:53
Then we will cover vector embedding essentially after we do
00:04:56
tokenization every word needs to be
00:04:59
transformed into a very high dimensional
00:05:01
Vector space so that the semantic
00:05:04
meaning between words is captured as you
00:05:07
can see here we want apple banana and
00:05:10
orange to be closer together which are
00:05:12
seen in this red circle over here we
00:05:14
want King man and woman to be closer
00:05:17
together which is shown in the blue
00:05:18
circle and we want Sports such as
00:05:20
football Golf and Tennis to be closer
00:05:22
together as shown in the green these are
00:05:25
just representative examples what I want
00:05:27
to explain is that before we give the
00:05:30
data set for training we need to encode
00:05:32
every word so that the semantic meaning
00:05:36
between the words is captured so words
00:05:38
which mean similar things lie closer
00:05:40
together so we will learn about vector
00:05:43
embeddings in a lot of detail.
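As a rough illustration of "similar words lie closer together", here is a tiny sketch with hand-made 3-dimensional vectors; real embeddings are learned and have hundreds or thousands of dimensions, and cosine similarity is one common way to measure closeness.

```python
import numpy as np

# Toy, hand-written 3-D vectors purely for illustration; real embeddings are learned
vectors = {
    "apple":  np.array([0.9, 0.1, 0.0]),
    "banana": np.array([0.8, 0.2, 0.1]),
    "king":   np.array([0.1, 0.9, 0.3]),
    "tennis": np.array([0.0, 0.2, 0.9]),
}

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(vectors["apple"], vectors["banana"]))  # high: same cluster
print(cosine_similarity(vectors["apple"], vectors["tennis"]))  # low: different clusters
```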
00:05:45
Here we'll also learn about positional encoding the
00:05:47
order in which the word appears in a
00:05:49
sentence is also very important and we
00:05:52
need to give that information to the
00:05:54
pre-training
00:05:55
model.
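Here is a minimal sketch of how token embeddings and learned positional embeddings can be combined in PyTorch; the vocabulary size, embedding dimension, context length, and token IDs below are placeholder numbers for illustration, not the ones we will actually use.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, context_length = 50257, 768, 4  # placeholder sizes

token_emb = nn.Embedding(vocab_size, embed_dim)    # one learned vector per token ID
pos_emb = nn.Embedding(context_length, embed_dim)  # one learned vector per position

token_ids = torch.tensor([[40, 367, 2885, 1464]])  # batch of 1 sequence, 4 token IDs
positions = torch.arange(context_length)           # 0, 1, 2, 3

# Input to the model = token embedding + positional embedding
x = token_emb(token_ids) + pos_emb(positions)
print(x.shape)  # torch.Size([1, 4, 768])
```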
00:05:58
After learning about tokenization and vector embedding we will learn how
00:06:01
to construct batches of the data so if
00:06:04
we have a huge data set how to
00:06:06
give the data in batches to GPT or to
00:06:09
the large language model which we are
00:06:11
going to build so we will be looking at
00:06:14
the next word prediction task so you
00:06:16
will be given a bunch of words and will then
00:06:16
predict the next word so we'll also
00:06:20
see the meaning of context how many
00:06:22
words should be taken for training to
00:06:25
predict the next output we'll see about
00:06:27
that and how to basically feed the data in
00:06:31
different sets of batches so that the
00:06:33
computation becomes much more efficient
00:06:36
so we'll be implementing a data batching
00:06:38
sequence before giving all of the data
00:06:41
set into the large language model for
00:06:44
pre-training.
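As a preview of the batching idea, here is a small sketch of a sliding-window dataset that turns a long list of token IDs into (input, target) pairs, where the target is the input shifted by one token; the names `NextWordDataset`, `max_length`, and `stride`, and the stand-in token IDs, are just illustrative choices.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class NextWordDataset(Dataset):
    """Slide a window over token IDs; target = input shifted by one position."""
    def __init__(self, token_ids, max_length, stride):
        self.inputs, self.targets = [], []
        for i in range(0, len(token_ids) - max_length, stride):
            self.inputs.append(torch.tensor(token_ids[i:i + max_length]))
            self.targets.append(torch.tensor(token_ids[i + 1:i + max_length + 1]))

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]

token_ids = list(range(100))                 # stand-in for a tokenized text
dataset = NextWordDataset(token_ids, max_length=8, stride=4)
loader = DataLoader(dataset, batch_size=4, shuffle=True)

x, y = next(iter(loader))
print(x.shape, y.shape)  # torch.Size([4, 8]) torch.Size([4, 8])
```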
00:06:46
After this the second point as I mentioned here is the attention
00:06:48
mechanism so here is the attention
00:06:50
mechanism for the Transformer model
00:06:52
we'll first understand what is meant by
00:06:54
every single thing here what is meant by
00:06:56
multi-head attention what is meant by masked
00:06:59
multi-head attention what is meant by
00:07:01
positional encoding input embedding
00:07:03
output embedding all of these things.
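Just as a rough preview, here is a minimal single-head sketch of scaled dot-product self-attention with a causal mask; it is a bare-bones illustration of the key/query/value idea under made-up sizes, not the multi-head implementation we will actually code from scratch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, d_in, d_out = 4, 8, 8
x = torch.randn(seq_len, d_in)            # one sequence of 4 token embeddings

W_q = nn.Linear(d_in, d_out, bias=False)  # query projection
W_k = nn.Linear(d_in, d_out, bias=False)  # key projection
W_v = nn.Linear(d_in, d_out, bias=False)  # value projection

q, k, v = W_q(x), W_k(x), W_v(x)

# Attention scores, scaled by sqrt(d_out), then causally masked so each token
# can only attend to itself and earlier tokens
scores = q @ k.T / d_out ** 0.5
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))

weights = torch.softmax(scores, dim=-1)   # each row sums to 1
context = weights @ v                     # weighted sum of the values
print(weights)
print(context.shape)                      # torch.Size([4, 8])
```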
00:07:05
And then we will build our own LLM
00:07:08
architecture so these are the two
00:07:11
things: attention mechanism and LLM
00:07:13
architecture.
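And as a rough picture of the architecture side, here is a heavily simplified sketch of stacking decoder-style blocks on top of the embedding layers; the names `TinyBlock` and `TinyGPT` and all the sizes are made up for this illustration, and the real GPT-style blocks we build will add causal masking, layer normalization, dropout, and careful weight initialization.

```python
import torch
import torch.nn as nn

class TinyBlock(nn.Module):
    """One simplified block: self-attention + feed-forward, with residual connections."""
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim), nn.GELU(),
            nn.Linear(4 * embed_dim, embed_dim),
        )

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = x + attn_out          # residual connection around attention
        x = x + self.ff(x)        # residual connection around feed-forward
        return x

class TinyGPT(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=64, num_heads=4, num_layers=3, context=16):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, embed_dim)
        self.pos_emb = nn.Embedding(context, embed_dim)
        self.blocks = nn.Sequential(*[TinyBlock(embed_dim, num_heads) for _ in range(num_layers)])
        self.out_head = nn.Linear(embed_dim, vocab_size)  # logits over the vocabulary

    def forward(self, token_ids):
        pos = torch.arange(token_ids.shape[1], device=token_ids.device)
        x = self.tok_emb(token_ids) + self.pos_emb(pos)
        return self.out_head(self.blocks(x))

logits = TinyGPT()(torch.randint(0, 1000, (2, 16)))
print(logits.shape)  # torch.Size([2, 16, 1000])
```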
00:07:15
After we cover all of these aspects we are essentially ready with
00:07:17
stage one of this playlist and then we
00:07:20
can move to the stage two stage two of
00:07:23
this series is essentially going to be
00:07:25
pre-training which is after we have
00:07:27
assembled all the data after we have
00:07:29
constructed the large language model
00:07:31
architecture which we are going to use
00:07:33
we are going to write code which
00:07:35
trains the large language model on the
00:07:37
underlying data set that is also called
00:07:40
as pre-training so the outcome of stage
00:07:43
two is to build a foundational model on
00:07:45
unlabeled
00:07:47
data now I'll just show a schematic
00:07:50
from the book which we will be following
00:07:52
so this is how the training data set
00:07:53
will look we'll break it down into
00:07:56
epochs and we will compute the gradient
00:08:00
of the loss in each epoch and we'll
00:08:02
update the parameters towards the end
00:08:04
we'll generate sample text for visual
00:08:06
inspection this is what will happen
00:08:08
exactly in the training procedure of the
00:08:11
large language model.
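Here is a bare-bones sketch of what such a training loop tends to look like in PyTorch: loop over epochs and batches, compute the next-token cross-entropy loss, backpropagate, and update the parameters. The `model` and `train_loader` arguments are assumed to come from the earlier steps (for example the TinyGPT and DataLoader sketches above), and the hyperparameters are placeholders.

```python
import torch
import torch.nn.functional as F

def train(model, train_loader, num_epochs=3, lr=3e-4, device="cpu"):
    model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for epoch in range(num_epochs):
        model.train()
        for inputs, targets in train_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            logits = model(inputs)            # (batch, seq_len, vocab_size)
            loss = F.cross_entropy(
                logits.flatten(0, 1),         # (batch * seq_len, vocab_size)
                targets.flatten(),            # (batch * seq_len,)
            )
            optimizer.zero_grad()
            loss.backward()                   # compute gradients of the loss
            optimizer.step()                  # update the parameters
        print(f"epoch {epoch + 1}: last batch loss = {loss.item():.3f}")
        # In the lectures we will also generate a short sample text here
        # for visual inspection of how the model is improving.
```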
00:08:13
Then we'll also do model evaluation and loading
00:08:15
pre-trained weights so let me show you the
00:08:17
schematic for that so we'll do text
00:08:19
generation evaluation training and
00:08:21
validation losses then we'll write the
00:08:24
LLM training function which I showed you
00:08:26
and then we'll do one more thing we
00:08:28
will implement functions to save and
00:08:30
load the large language model weights to
00:08:33
use or continue training the LLM later
00:08:35
so there is no point in training the LLM
00:08:38
from scratch every single time right
00:08:39
weight saving and loading essentially
00:08:41
saves you a ton of computational cost
00:08:43
and
00:08:44
memory.
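Saving and reloading in PyTorch is typically just a matter of storing the state dicts; a minimal sketch, where `model` and `optimizer` are assumed to exist from the training step above and the file name is a placeholder:

```python
import torch

# Save model and optimizer state so training can be resumed later
torch.save({
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
}, "model_and_optimizer.pth")

# ...later, or in a fresh session with the same model/optimizer classes:
checkpoint = torch.load("model_and_optimizer.pth", map_location="cpu")
model.load_state_dict(checkpoint["model_state_dict"])
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
```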
00:08:47
And then at the end of this we'll also load pre-trained weights from Open
00:08:49
AI into our large language model so open
00:08:52
AI has already made some of the weights
00:08:54
available they are pre-trained weights
00:08:56
so we'll be loading pre-trained
00:08:58
weights from OpenAI into our LLM model.
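As one hedged illustration of the idea (not necessarily the exact route we will take in the lectures), the GPT-2 weights that OpenAI released can be pulled in through the Hugging Face transformers library and then copied tensor by tensor into the matching layers of your own architecture:

```python
# Requires: pip install transformers torch
from transformers import GPT2LMHeadModel

# Download the published GPT-2 (124M) weights
hf_model = GPT2LMHeadModel.from_pretrained("gpt2")

# Inspect parameter names and shapes; loading them into our own LLM is then
# a matter of copying each tensor into the corresponding layer.
for name, param in list(hf_model.named_parameters())[:5]:
    print(name, tuple(param.shape))
```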
00:09:02
this is all we'll be covering in
00:09:04
stage two which is essentially
00:09:06
the training loop plus
00:09:09
model evaluation plus loading
00:09:10
pre-trained weights to build our
00:09:12
foundational model so the main goal of
00:09:15
stage two as I told you is
00:09:17
pre-training an LLM on unlabeled data
00:09:20
great but we will not stop here after
00:09:22
this we move to stage number three and
00:09:25
the main goal of stage number three is
00:09:27
fine tuning the large language model so
00:09:29
if we want to build specific
00:09:31
applications we will do fine tuning in
00:09:33
this playlist we are going to build two
00:09:35
applications which are mentioned in the
00:09:37
book I showed you at the start one is
00:09:39
building a classifier and one is
00:09:41
building your own personal assistant so
00:09:44
here are some schematics to show let's say
00:09:46
you have got a lot of
00:09:48
emails and you want to use your
00:09:50
LLM to classify spam or not spam for
00:09:54
example you are a winner you have been
00:09:56
specially selected to receive $1,000
00:09:58
cash now this should be classified as
00:10:01
spam whereas hey just wanted to check if
00:10:03
we are still on for dinner tonight let
00:10:05
me know this will be not spam so we will
00:10:08
build a large language model
00:10:10
application which classifies between
00:10:12
spam and not spam and we cannot just use
00:10:14
the pre-trained or foundational model
00:10:16
for this because we need to train with
00:10:17
labeled data to the pre-trained model we
00:10:20
need to give some more data and tell it
00:10:22
that hey this is usually spam and this
00:10:24
is not spam can you use the foundational
00:10:26
model plus this additional specific
00:10:28
labeled data set which I have given to
00:10:30
build a fine-tuned LLM application for
00:10:34
email classification so this is what
00:10:36
we'll be building as the first
00:10:38
application.
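Conceptually, classification fine-tuning means putting a small classification head on top of the pre-trained model and training it on labeled (text, spam/not-spam) pairs. Here is a very rough sketch of that idea; `SpamClassifier`, the dummy backbone, the sizes, and the two-example batch are all made up for illustration, and in practice the backbone would be the pre-trained LLM from stage two.

```python
import torch
import torch.nn as nn

class SpamClassifier(nn.Module):
    """Pre-trained backbone + a small, newly initialized classification head."""
    def __init__(self, backbone, embed_dim, num_classes=2):
        super().__init__()
        self.backbone = backbone                       # assumed to return (batch, seq, embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)  # trainable spam / not-spam head

    def forward(self, token_ids):
        hidden = self.backbone(token_ids)
        return self.head(hidden[:, -1, :])             # classify from the last token's hidden state

# Stand-in backbone so the sketch runs on its own; in practice this would be
# the pre-trained LLM from stage two (with its language-modeling head removed).
embed_dim = 64
dummy_backbone = nn.Sequential(nn.Embedding(1000, embed_dim))

clf = SpamClassifier(dummy_backbone, embed_dim)
inputs = torch.randint(0, 1000, (2, 16))               # token IDs for two example emails
labels = torch.tensor([1, 0])                          # 1 = spam, 0 = not spam
loss = nn.functional.cross_entropy(clf(inputs), labels)
print(loss.item())
```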
00:10:40
The second application which we'll be building is a type of chat
00:10:42
bot which basically answers queries
00:10:44
so there is an instruction there is an
00:10:46
input and there is an output and we'll
00:10:48
be building this chatbot after fine
00:10:51
tuning the large language model.
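For this kind of instruction fine-tuning, the labeled data typically looks like (instruction, input, output) records that get formatted into a single prompt-plus-response text. Here is a small sketch of such a formatting step, using an Alpaca-style template as one common convention; the example entry is made up and the exact template we use in the lectures may differ.

```python
entry = {
    "instruction": "Classify the sentiment of the following sentence.",
    "input": "I really enjoyed this lecture series!",
    "output": "Positive",
}

def format_example(entry):
    """Turn one (instruction, input, output) record into a single training text."""
    prompt = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction:\n{entry['instruction']}\n\n"
    )
    if entry["input"]:
        prompt += f"### Input:\n{entry['input']}\n\n"
    prompt += "### Response:\n"
    return prompt, prompt + entry["output"]

prompt, full_text = format_example(entry)
print(full_text)
# During fine-tuning, full_text is tokenized and the model is trained to
# predict the response tokens that follow the prompt.
```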
00:10:54
So if you want to be a very serious LLM
00:10:56
engineer all the stages are equally
00:10:58
important what many students are
00:11:00
doing right now is that they just look
00:11:02
at stage number three and they either
00:11:04
use LangChain let's
00:11:06
say they use LangChain they use tools
00:11:09
like
00:11:10
Ollama and they directly deploy
00:11:13
applications but they do not understand
00:11:15
what's going on in stage one and stage
00:11:17
two at all so this leaves you a bit
00:11:19
under-confident and insecure about
00:11:21
whether you really know the nuts and bolts
00:11:23
whether you really know the details my
00:11:25
plan is to go over every single thing
00:11:27
without skipping even a single Concept
00:11:30
in stage one stage two and stage number
00:11:33
three so this is the plan which you'll
00:11:35
be following in this playlist and I hope
00:11:37
you are excited for this because at the
00:11:39
end of this really my vision for this
00:11:42
playlist is to make it the most detailed
00:11:44
LLM playlist which many people can
00:11:46
refer to not just students but working
00:11:48
professionals startup Founders managers
00:11:51
etc. and then once this playlist
00:11:53
is built over I think 2 to 3 months
00:11:56
later you can refer to whichever part
00:11:59
you are more interested in so people who
00:12:01
are following this in the early stages
00:12:03
of this journey it's awesome because
00:12:05
I'll reply to all the comments in the um
00:12:09
chat section and we'll build this
00:12:11
journey
00:12:13
together I want to end this lecture by
00:12:16
providing a recap of what all we have
00:12:18
learned so far this is
00:12:21
going to be very important because from
00:12:22
the next lecture we are going to start a
00:12:24
bit of the Hands-On
00:12:26
approach okay so number one large
00:12:29
language models have really transformed
00:12:31
uh the field of natural language
00:12:34
processing they have led to advancements
00:12:36
in generating understanding and
00:12:38
translating human language this is very
00:12:40
important so in the field of NLP before
00:12:43
you needed to train a separate algorithm
00:12:45
for each specific task but large
00:12:47
language models are pretty generic if
00:12:49
you train an llm for predicting the next
00:12:51
word it turns out that it develops
00:12:53
emergent properties which means it's not
00:12:55
only good at predicting the next word
00:12:57
but also at things like multiple
00:13:00
choice questions text summarization
00:13:03
emotion classification language
00:13:05
translation Etc it's useful for a wide
00:13:07
range of tasks and that has led to
00:13:10
its predominance as an amazing tool in a
00:13:13
variety of
00:13:15
fields secondly all modern large
00:13:18
language models are trained in two main
00:13:20
steps first we pre-train on an unlabeled
00:13:23
data this is called a foundational
00:13:25
model and for this very large data sets
00:13:28
are needed typically billions of words
00:13:31
and it costs a lot as we saw
00:13:33
pre-training GPT-3 costs $4.6 million so
00:13:37
you need access to huge amount of data
00:13:39
compute power and money to pre-train
00:13:42
such a foundational model now if you are
00:13:45
actually going to implement an llm
00:13:47
application at production level so let's
00:13:49
say you're an educational company
00:13:51
building multiple choice questions and
00:13:53
you think that the answers provided by
00:13:55
the pre-trained or foundational model
00:13:57
are not very good and they are a bit
00:13:58
generic
00:13:59
you can provide your own specific data
00:14:02
set and you can label the data set
00:14:04
saying that these are the right answers
00:14:06
and I want you to further train on this
00:14:07
refined data set uh to build a better
00:14:10
model this is called fine tuning usually
00:14:14
airline companies restaurants Banks
00:14:16
educational companies when they deploy
00:14:19
LLMs at production level they fine-
00:14:21
tune the pre-trained LLM nobody deploys
00:14:23
the pre-trained one directly you fine-tune
00:14:26
the LLM on your specific smaller
00:14:29
labeled data set this is very important
00:14:31
see for pre-training the data set which
00:14:33
we have is unlabeled it's Auto
00:14:35
regressive so the sentence structure
00:14:37
itself is used for creating the labels
00:14:39
as we are just predicting the next word
00:14:42
but when we fine-tune we have a labeled data
00:14:44
set such as remember the spam versus not
00:14:47
spam example which I showed you that is
00:14:49
a labeled data set we give labels like hey
00:14:51
this is Spam this is not spam this is a
00:14:53
good answer this is not a good answer
00:14:55
and this fine-tuning step is generally
00:14:57
needed for building production-ready
00:14:59
LLM
00:15:01
applications an important thing to remember
00:15:03
is that fine-tuned LLMs can outperform
00:15:06
pre-trained-only LLMs on specific tasks
00:15:09
so let's say you take two cases right in
00:15:11
one case you only have pre-trained llms
00:15:13
and in the second case you have pre-trained
00:15:15
plus fine-tuned LLMs so it turns out
00:15:18
that pre-trained plus fine-tuned does a
00:15:20
much better job at certain specific
00:15:22
tasks than just using pre-trained for
00:15:24
students who just want to interact for
00:15:26
getting their doubts solved or for
00:15:29
getting assistance in summarization
00:15:32
helping in writing a research paper
00:15:34
etc. GPT-4 Perplexity or such API tools or
00:15:39
such interfaces which are available work
00:15:41
perfectly fine but if you want to build
00:15:43
a specific application on your data set
00:15:46
and take it to production level you
00:15:48
definitely need fine
00:15:50
tuning okay now one more key thing is
00:15:54
that the secret sauce behind large
00:15:55
language models is the Transformer
00:15:57
architecture
00:15:59
so the key idea behind the Transformer
00:16:02
architecture is the attention mechanism
00:16:05
just to show you how the Transformer
00:16:07
architecture looks it looks like
00:16:08
this and the main thing behind the
00:16:10
Transformer architecture which really
00:16:12
makes it so
00:16:14
powerful are these attention
00:16:17
blocks we'll see what they mean so no
00:16:19
need to worry about this right
00:16:21
now but in a nutshell the attention
00:16:24
mechanism gives the llm selective access
00:16:26
to the whole input sequence when
00:16:28
generating output one word at a time
00:16:31
basically attention mechanism allows the
00:16:33
llm to understand the importance of
00:16:36
words and not just the words in the
00:16:39
current sentence but in the previous
00:16:41
sentences which have come long before
00:16:42
also because context is important in
00:16:45
predicting the next word the current
00:16:47
sentence is not the only one which
00:16:48
matters the attention mechanism gives the
00:16:51
LLM access to the entire context
00:16:53
and lets it select or give weightage to which
00:16:55
words are important in predicting the
00:16:57
next word this is a key idea and
00:17:00
we'll spend a lot of time on this
00:17:02
idea remember that the original
00:17:04
Transformer had the encoder
00:17:07
plus decoder so it had both of these
00:17:10
things it had the encoder as well as
00:17:11
the decoder but the generative pre-trained
00:17:15
Transformer only has the decoder it
00:17:17
does not have the encoder so
00:17:20
Transformer and GPT are not the same
00:17:22
the Transformer paper came in 2017 it had
00:17:24
encoder plus decoder the generative pre-trained
00:17:27
Transformer came one year later
00:17:29
2018 and that only had the decoder
00:17:32
architecture so even GPT-4 right now
00:17:34
only has a decoder no encoder so in 2018 came
00:17:38
GPT the first generative pre-trained
00:17:40
Transformer architecture in 2019 came GPT-2
00:17:43
in 2020 came GPT-3 which had 175 billion
00:17:47
parameters and that really changed the
00:17:49
game because no one had seen a model
00:17:51
this large before and now we are at the
00:17:53
GPT-4
00:17:55
stage one last point which is very
00:17:57
important is that llms are only trained
00:18:00
for predicting the next word right but
00:18:02
very surprisingly they develop emergent
00:18:04
properties which means that although
00:18:07
they are only trained to predict the
00:18:08
next word they show some amazing
00:18:11
properties like ability to classify text
00:18:14
translate text from one language into
00:18:16
another language and even summarize
00:18:17
texts so they were not trained for these
00:18:20
tasks but they developed these
00:18:22
properties and that was an awesome thing
00:18:23
to realize the pre-training stage works
00:18:26
so well that llms develop all of these
00:18:28
wonderful other properties which makes
00:18:30
them so impactful for a wide range of
00:18:33
tasks
00:18:35
currently okay so this brings us to the
00:18:37
end of the recap of what we have covered
00:18:39
up till now if you have not seen the
00:18:41
previous lectures I really encourage you
00:18:43
to go through them because these
00:18:45
lectures have really set the stage for
00:18:46
us to now dive into stage one so from
00:18:49
the next lecture we'll start going into
00:18:51
stage one and we'll start seeing the
00:18:53
first aspect which is data preparation
00:18:55
and sampling so the next lecture's title
00:18:58
will be Working with Text Data and
00:19:00
we'll be looking at the data sets how to
00:19:03
load a data set how to count the number
00:19:05
of characters how to break the data
00:19:07
into tokens and I'll start
00:19:10
sharing Jupyter notebooks from next time
00:19:12
onward so that we can begin
00:19:15
coding in parallel so thanks everyone I hope you are
00:19:17
liking these lectures so lectures 1 to
00:19:20
6 were kind of like an introductory
00:19:23
set to give you a feel of the entire
00:19:24
series and so that you understand
00:19:26
concepts at a fundamental level
00:19:28
from lecture 7 we'll be diving deep into
00:19:30
code and we'll be starting into stage
00:19:33
one so I follow this approach of writing
00:19:36
on a whiteboard and also
00:19:38
coding um so that you understand the
00:19:40
details plus the code at the same time
00:19:43
because I believe Theory plus practical
00:19:44
implementation both are important and
00:19:47
that is one of the philosophies of this
00:19:49
lecture Series so do let me know in the
00:19:51
comments how you are finding this teaching
00:19:53
style uh because I will take feedback
00:19:56
from that and we can build this series
00:19:58
together 3 to 4 months later this can
00:20:00
be an amazing and awesome series and I
00:20:03
will rely on your feedback to build this
00:20:05
thanks a lot everyone and I look forward
00:20:07
to seeing you in the next lecture