Lecture 6: Stages of building an LLM from Scratch

00:20:15
https://www.youtube.com/watch?v=z9fgKz1Drlc

Summary

TL;DR: This lecture provides a comprehensive overview of the stages involved in building large language models (LLMs) from scratch. It reviews previous lectures on GPT architectures and learning methods, and introduces a structured roadmap divided into three main stages: Stage 1 focuses on data preparation, attention mechanisms, and understanding LLM architecture; Stage 2 covers the pre-training of the model on unlabeled data; and Stage 3 involves fine-tuning the model for specific applications. The lecture emphasizes the importance of understanding each stage thoroughly to build confidence and competence in LLM development, and sets the stage for hands-on coding in future lectures.

Takeaways

  • πŸ“š Overview of LLM building stages
  • πŸ” Focus on data preparation and architecture
  • πŸ’‘ Importance of attention mechanisms
  • πŸ’° High cost of pre-training GPT-3
  • πŸ”„ Difference between pre-training and fine-tuning
  • πŸ“ Applications of fine-tuned LLMs
  • πŸ”‘ Role of tokenization in data processing
  • πŸ“Š Significance of vector embeddings
  • βš™οΈ Understanding Transformer architecture
  • πŸš€ Hands-on coding in future lectures

Timeline

  • 00:00:00 - 00:05:00

    In this lecture series on building large language models (LLMs), the instructor recaps previous lectures, focusing on the evolution of the GPT architecture from GPT to GPT-4, and discusses the high costs associated with pre-training models like GPT-3. The lecture outlines the roadmap for upcoming sessions, emphasizing a hands-on approach to building LLMs, starting with data preparation, attention mechanisms, and LLM architecture. The instructor expresses gratitude for the foundational book used in the series and highlights the importance of understanding the underlying concepts before diving into practical applications.

  • 00:05:00 - 00:10:00

The playlist is divided into three stages: Stage 1 focuses on building the foundational aspects of LLMs, including data pre-processing, attention mechanisms, and architecture. Key topics include tokenization, vector embeddings, and constructing data batches for training (a minimal data-preparation sketch in code follows this timeline). The instructor emphasizes the importance of understanding these building blocks before moving on to the next stages, ensuring a comprehensive grasp of how LLMs function at a fundamental level.

  • 00:10:00 - 00:15:00

Stage 2 will cover the pre-training of the LLM, where the assembled data and architecture will be used to train the model on unlabeled data. The instructor outlines the training process, including computing gradients, updating parameters, and generating sample text for evaluation (a training-loop sketch follows this timeline). Additionally, the importance of saving and loading model weights is discussed, along with the integration of pre-trained weights from OpenAI to enhance the foundational model.

  • 00:15:00 - 00:20:15

Stage 3 will focus on fine-tuning the LLM for specific applications, such as spam classification and chatbot development. The instructor stresses the significance of fine-tuning with labeled data to improve model performance on specific tasks (a fine-tuning sketch follows this timeline). The lecture concludes with a recap of key concepts learned so far, including the transformative impact of LLMs on natural language processing, the necessity of pre-training and fine-tuning, and the pivotal role of the Transformer architecture and attention mechanisms in enabling LLMs to perform a wide range of tasks.
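
As a preview of the Stage 1 topics above, here is a minimal data-preparation sketch in Python: it tokenizes a toy text, maps tokens to integer IDs, and builds input/target batches for next-word prediction with a sliding window. This is an illustrative sketch only, using a simple regex tokenizer instead of the byte-pair-encoding tokenizer GPT models use; the names `create_batches`, `context_length`, and `stride` are placeholders, not the course's actual code.

```python
import re
import torch

raw_text = "The quick brown fox jumps over the lazy dog. The dog sleeps."

# 1. Tokenization: split the text into word and punctuation tokens
#    (a toy stand-in for the byte-pair-encoding tokenizer GPT uses).
tokens = [t for t in re.split(r'([,.?!"]|\s)', raw_text) if t.strip()]

# 2. Vocabulary: map every unique token to an integer ID.
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
token_ids = torch.tensor([vocab[t] for t in tokens])

# 3. Sliding-window sampling: each input chunk of context_length tokens is
#    paired with the same chunk shifted by one token (the next-word targets).
def create_batches(ids, context_length=4, stride=4):
    inputs, targets = [], []
    for i in range(0, len(ids) - context_length, stride):
        inputs.append(ids[i:i + context_length])
        targets.append(ids[i + 1:i + context_length + 1])
    return torch.stack(inputs), torch.stack(targets)

x, y = create_batches(token_ids)
print(x.shape, y.shape)  # torch.Size([3, 4]) torch.Size([3, 4])
```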
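
The Stage 2 pre-training loop (compute the next-token cross-entropy loss, backpropagate the gradients, update the parameters, then save the weights so training need not start from scratch) could look roughly like the sketch below. `TinyLM` is a deliberately tiny placeholder model assumed only for illustration; it stands in for whatever architecture Stage 1 produces.

```python
import torch
import torch.nn as nn

# Placeholder model: any nn.Module that maps token IDs (batch, seq_len)
# to logits over the vocabulary (batch, seq_len, vocab_size) works here.
class TinyLM(nn.Module):
    def __init__(self, vocab_size=50257, emb_dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lm_head = nn.Linear(emb_dim, vocab_size)

    def forward(self, token_ids):
        return self.lm_head(self.emb(token_ids))

model = TinyLM()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

def train(model, data_loader, num_epochs=1):
    # data_loader yields (inputs, targets) batches like those built in Stage 1.
    for epoch in range(num_epochs):
        for inputs, targets in data_loader:
            logits = model(inputs)                    # (batch, seq_len, vocab)
            loss = nn.functional.cross_entropy(
                logits.flatten(0, 1), targets.flatten()
            )
            optimizer.zero_grad()
            loss.backward()                           # compute gradients
            optimizer.step()                          # update parameters
        print(f"epoch {epoch}: loss {loss.item():.3f}")

# Saving and re-loading weights avoids pre-training from scratch every time.
torch.save(model.state_dict(), "model.pth")
model.load_state_dict(torch.load("model.pth"))
```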
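
And the Stage 3 fine-tuning idea, sketched under the assumption that `pretrained_model` is a backbone like the placeholder above and that a hypothetical `spam_loader` yields batches of (token IDs, label) pairs: attach a two-class head and continue training on the labeled spam data. Names are illustrative; the actual fine-tuning recipe comes later in the series.

```python
import torch
import torch.nn as nn

def fine_tune_classifier(pretrained_model, spam_loader, emb_dim=128, num_epochs=3):
    """Fine-tune a pre-trained language-model backbone for spam vs. not-spam."""
    # Replace the next-word output layer with a 2-class head (spam / not spam).
    pretrained_model.lm_head = nn.Linear(emb_dim, 2)
    optimizer = torch.optim.AdamW(pretrained_model.parameters(), lr=1e-4)

    for epoch in range(num_epochs):
        for token_ids, labels in spam_loader:          # labeled data, unlike Stage 2
            logits = pretrained_model(token_ids)       # (batch, seq_len, 2)
            cls_logits = logits[:, -1, :]              # use the last token's output
            loss = nn.functional.cross_entropy(cls_logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return pretrained_model
```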

Video Q&A

  • What are the three stages of building a large language model?

    Stage 1: Data preparation and understanding LLM architecture; Stage 2: Pre-training the model on unlabeled data; Stage 3: Fine-tuning the model for specific applications.

  • What is the focus of Stage 1?

    Stage 1 focuses on data pre-processing, attention mechanisms, and understanding the architecture of large language models.

  • What is the significance of fine-tuning in LLMs?

    Fine-tuning allows the model to adapt to specific tasks using labeled data, improving performance on those tasks.

  • What is the attention mechanism?

The attention mechanism allows the model to weigh the importance of different words in the input sequence when generating output (a minimal attention sketch in code follows this Q&A list).

  • What is the cost of pre-training GPT-3?

    The total pre-training cost for GPT-3 is around $4.6 million.

  • What is the difference between pre-training and fine-tuning?

    Pre-training is done on unlabeled data to create a foundational model, while fine-tuning uses labeled data to adapt the model for specific tasks.

  • What are some applications of fine-tuned LLMs?

    Fine-tuned LLMs can be used for tasks like email classification and building chatbots.

  • What is the role of tokenization in data preparation?

    Tokenization breaks down sentences into individual tokens, which are then transformed into high-dimensional vectors for processing.

  • What is the importance of vector embeddings?

    Vector embeddings capture the semantic meaning of words, ensuring that similar words are represented closely in the vector space.

  • What is the main architecture behind modern LLMs?

    The main architecture behind modern LLMs is the Transformer architecture, which utilizes attention mechanisms.
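
To make the key/query/value and attention-score terms in the answers above concrete, here is a minimal single-head causal self-attention sketch in PyTorch. It is a simplified illustration only (one head, no dropout, no output projection), not the multi-head implementation the series will build.

```python
import torch
import torch.nn as nn

class CausalSelfAttention(nn.Module):
    """Single-head scaled dot-product attention with a causal (masked) pattern."""
    def __init__(self, emb_dim, head_dim):
        super().__init__()
        self.W_query = nn.Linear(emb_dim, head_dim, bias=False)
        self.W_key   = nn.Linear(emb_dim, head_dim, bias=False)
        self.W_value = nn.Linear(emb_dim, head_dim, bias=False)

    def forward(self, x):                        # x: (batch, seq_len, emb_dim)
        q, k, v = self.W_query(x), self.W_key(x), self.W_value(x)
        scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5   # attention scores
        # Causal mask: each position may only attend to itself and earlier tokens.
        seq_len = x.shape[1]
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
        weights = torch.softmax(scores, dim=-1)   # importance of each previous word
        return weights @ v                        # weighted sum of value vectors

attn = CausalSelfAttention(emb_dim=32, head_dim=16)
out = attn(torch.randn(2, 5, 32))                # -> shape (2, 5, 16)
```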

Subtitles
  • 00:00:00
    [Music]
  • 00:00:16
    hello everyone welcome to this lecture
  • 00:00:19
    in the building large language models
  • 00:00:21
    from scratch series we have covered five
  • 00:00:25
    lectures up till now and in the previous
  • 00:00:28
    lecture we looked at the gpt3
  • 00:00:30
    architecture in a lot of detail we also
  • 00:00:33
    saw the progression from GPT to gpt2 to
  • 00:00:36
    gpt3 and finally to GPT
  • 00:00:39
    4 uh we saw that the total pre-training
  • 00:00:42
    cost for gpt3 is around 4.6 million
  • 00:00:46
    which is insanely
  • 00:00:48
    high and up till now we have also looked
  • 00:00:51
    at the data set which was used for
  • 00:00:53
    pre-training gpt3 and we have seen this
  • 00:00:56
    several times until
  • 00:00:58
now in the previous lecture we
  • 00:01:00
    learned about the differences between
  • 00:01:02
    zero shot versus few shot learning as
  • 00:01:05
    well so if you have not been through the
  • 00:01:07
    previous lectures we have already
  • 00:01:09
    covered five lectures in this series and
  • 00:01:11
    uh all of them have actually received a
  • 00:01:13
    very good response from YouTube and I've
  • 00:01:16
    received a number of comments saying
  • 00:01:18
    that they have really helped people so I
  • 00:01:21
    encourage you to go through those
  • 00:01:23
    videos in today's lecture we are going
  • 00:01:25
    to be discussing about what we will
  • 00:01:28
    exactly cover in the playlist in these
  • 00:01:30
    five lectures we have looked at some of
  • 00:01:32
    the theory modules some of the intuition
  • 00:01:35
    modules behind attention behind self
  • 00:01:37
    attention prediction of next word uh
  • 00:01:40
zero shot versus few shot learning as
  • 00:01:42
    basics of the Transformer architecture
  • 00:01:45
data sets used for llm pre-training
  • 00:01:48
    difference between pre-training and fine
  • 00:01:49
    tuning Etc but now from the next lecture
  • 00:01:54
    onwards we are going to start with the
  • 00:01:56
    Hands-On aspects of actually building an
  • 00:01:59
    llm so I wanted to utilize this
  • 00:02:01
    particular lecture to give you a road
  • 00:02:03
    map of what all we will be doing in this
  • 00:02:07
    series and what all stages which we will
  • 00:02:09
    be covering during this
  • 00:02:12
    playlist so that is the title of today's
  • 00:02:14
    lecture stages of building a large
  • 00:02:16
    language model towards the end of this
  • 00:02:18
    lecture we will also do a recap of what
  • 00:02:21
    all we have learned until now so let's
  • 00:02:24
    get started with today's
  • 00:02:26
    lecture okay so we will break this
  • 00:02:28
playlist into three stages stage
  • 00:02:31
    one stage two and stage three remember
  • 00:02:35
    before we get started that this material
  • 00:02:37
which I am showing is heavily borrowed from
  • 00:02:40
    U the book building a large language
  • 00:02:42
    model from scratch which is written by
  • 00:02:44
Sebastian Raschka so I'm very grateful to
  • 00:02:47
    the author for writing this book which
  • 00:02:49
    is allowing me to make this
  • 00:02:51
    playlist okay so we'll be dividing the
  • 00:02:54
    playlist into three stages stage one
  • 00:02:56
stage two and stage number three
  • 00:02:59
unfortunately all of the playlists
  • 00:03:01
    currently which are available on YouTube
  • 00:03:03
    only go through some of these stages and
  • 00:03:06
    that two they do not cover these stages
  • 00:03:08
    in detail my plan is to devote a number
  • 00:03:11
    of lectures to each stage in this uh
  • 00:03:15
    playlist so that you get a very detailed
  • 00:03:18
    understanding of how the nuts and bolts
  • 00:03:20
    really
  • 00:03:21
    work so in stage one we are going to be
  • 00:03:24
    looking at uh essentially building a
  • 00:03:26
    large language model and we are going to
  • 00:03:29
    look at the building blocks which are
  • 00:03:31
    necessary so before we go to train the
  • 00:03:34
    large language model we need to do the
  • 00:03:36
    data pre-processing and sampling in a
  • 00:03:38
    very specific manner we need to
  • 00:03:40
    understand the attention mechanism and
  • 00:03:42
    we will need to understand the llm
  • 00:03:44
    architecture so in the stage one we are
  • 00:03:46
    going to focus on these three things
  • 00:03:49
    understanding how the data is collected
  • 00:03:51
    from different data sets how the data is
  • 00:03:54
    processed how the data is sampled number
  • 00:03:56
    one then we will go to attention
  • 00:03:58
mechanism how to code out the attention
  • 00:04:00
    mechanism completely from scratch in
  • 00:04:02
    Python what is meant by key query value
  • 00:04:05
    what is the attention score what is
  • 00:04:08
    positional encoding what is Vector
  • 00:04:10
    embedding all of this will be covered in
  • 00:04:12
    this stage we'll also be looking at the
  • 00:04:14
    llm architecture such as how to stack
  • 00:04:17
    different layers on top of each other
  • 00:04:19
    where should the attention head go all
  • 00:04:21
    of these things essentially uh the
  • 00:04:25
    main understanding or the main part of
  • 00:04:28
    this stage will be to understand
  • 00:04:29
the basic mechanism behind
  • 00:04:32
    the large language model so what exactly
  • 00:04:34
    we will cover in data preparation and
  • 00:04:36
    sampling first we'll see tokenization if
  • 00:04:39
    you are given sentences how to break
  • 00:04:41
    them down into individual tokens as we
  • 00:04:44
    have seen earlier a token can be thought
  • 00:04:46
    of as a unit of a sentence but there is
  • 00:04:48
    a particular way of doing tokenization
  • 00:04:50
    we'll cover that then we will cover
  • 00:04:53
    Vector embedding essentially after we do
  • 00:04:56
    tokenization every word needs to be
  • 00:04:59
    transformed into a very high dimensional
  • 00:05:01
    Vector space so that the semantic
  • 00:05:04
    meaning between words is captured as you
  • 00:05:07
    can see here we want apple banana and
  • 00:05:10
    orange to be closer together which are
  • 00:05:12
    seen in this red circle over here we
  • 00:05:14
    want King man and woman to be closer
  • 00:05:17
    together which is shown in the blue
  • 00:05:18
    circle and we want Sports such as
  • 00:05:20
    football Golf and Tennis to be closer
  • 00:05:22
    together as shown in the green these are
  • 00:05:25
    just representative examples what I want
  • 00:05:27
    to explain is that before we give the
  • 00:05:30
    data set for training we need to encode
  • 00:05:32
    every word so that the semantic meaning
  • 00:05:36
    between the words are captured so Words
  • 00:05:38
    which mean similar things lie closer
  • 00:05:40
    together so we will learn about Vector
  • 00:05:43
    embeddings in a lot of detail here we'll
  • 00:05:45
    also learn about positional encoding the
  • 00:05:47
    order in which the word appears in a
  • 00:05:49
    sentence is also very important and we
  • 00:05:52
    need to give that information to the
  • 00:05:54
    pre-training
  • 00:05:55
    model after learning about tokenization
  • 00:05:58
    Vector embedding we will learn about how
  • 00:06:01
    to construct batches of the data so if
  • 00:06:04
    we have a huge amount of data set how to
  • 00:06:06
    give the data in batches to uh GPT or to
  • 00:06:09
    the large language model which we are
  • 00:06:11
    going to build so we will be looking at
  • 00:06:14
    the next word prediction task so you
  • 00:06:16
    will be given a bunch of words and then
  • 00:06:18
    predicting the next word so we'll also
  • 00:06:20
    see the meaning of context how many
  • 00:06:22
    words should be taken for training to
  • 00:06:25
    predict the next output we'll see about
  • 00:06:27
that and how to basically feed the data in
  • 00:06:31
    different sets of batches so that the
  • 00:06:33
    computation becomes much more efficient
  • 00:06:36
    so we'll be implementing a data batching
  • 00:06:38
    sequence before giving all of the data
  • 00:06:41
    set into the large language model for
  • 00:06:44
    pre-training after this the second Point
  • 00:06:46
    as I mentioned here is the attention
  • 00:06:48
    mechanism so here is the attention
  • 00:06:50
    mechanism for the Transformer model
  • 00:06:52
    we'll first understand what is meant by
  • 00:06:54
    every single thing here what is meant by
  • 00:06:56
multi-head attention what is meant by masked
  • 00:06:59
    multi head attention what is meant by
  • 00:07:01
    positional encoding input embedding
  • 00:07:03
    output embedding all of these things and
  • 00:07:05
    then we will build our own llm
  • 00:07:08
    architecture so uh these are the two
  • 00:07:11
    things attention mechanism and llm
  • 00:07:13
    architecture after we cover all of these
  • 00:07:15
    aspects we are essentially ready with
  • 00:07:17
    stage one of this playlist and then we
  • 00:07:20
    can move to the stage two stage two of
  • 00:07:23
    this series is essentially going to be
  • 00:07:25
    pre-training which is after we have
  • 00:07:27
    assembled all the data after we have
  • 00:07:29
    constructed the large language model
  • 00:07:31
    architecture which we are going to use
  • 00:07:33
    we are going to write down a code which
  • 00:07:35
    trains the large language model on the
  • 00:07:37
    underlying data set that is also called
  • 00:07:40
    as pre-training so the outcome of stage
  • 00:07:43
    two is to build a foundational model on
  • 00:07:45
    unlabeled
  • 00:07:47
    data now uh I'll just show a schematic
  • 00:07:50
    from the book which we will be following
  • 00:07:52
    so this is how the training data set
  • 00:07:53
    will look like we'll break it down into
  • 00:07:56
epochs and we will compute the gradient
  • 00:08:00
    uh of the loss in each Epoch and we'll
  • 00:08:02
    update the parameters towards the end
  • 00:08:04
    we'll generate sample text for visual
  • 00:08:06
    inspection this is what will happen
  • 00:08:08
    exactly in the training procedure of the
  • 00:08:11
    large language model and then we'll also
  • 00:08:13
    do model evaluation and loading
  • 00:08:15
pre-trained weights so let me show you the
  • 00:08:17
    schematic for that so we'll do text
  • 00:08:19
    generation evaluation training and
  • 00:08:21
    validation losses then we'll write the
  • 00:08:24
    llm training function which I showed you
  • 00:08:26
    uh and then we'll do one more thing we
  • 00:08:28
will implement functions to save and
  • 00:08:30
    load the large language model weights to
  • 00:08:33
    use or continue training the llm later
  • 00:08:35
so there is no point in training the llm
  • 00:08:38
    from scratch every single time right
  • 00:08:39
    weight saving and loading essentially
  • 00:08:41
    saves you a ton of computational cost
  • 00:08:43
    and
  • 00:08:44
    memory and then at the end of this we'll
  • 00:08:47
    also load pre-trained weights from open
  • 00:08:49
    AI into our large language model so open
  • 00:08:52
    AI has already made some of the weights
  • 00:08:54
    available they are pre-trained weights
  • 00:08:56
    so we'll be loading uh pre-trained
  • 00:08:58
weights from OpenAI into our llm model
  • 00:09:02
    this is all what we'll be covering in
  • 00:09:04
    the stage two which is essentially
  • 00:09:06
training Loop plus
  • 00:09:09
    model evaluation plus loading
  • 00:09:10
    pre-trained weights to build our
  • 00:09:12
    foundational model so the main goal of
  • 00:09:15
    stage two as I as I told you is
  • 00:09:17
pre-training an llm on unlabelled data
  • 00:09:20
    great but we will not stop here after
  • 00:09:22
    this we move to stage number three and
  • 00:09:25
    the main goal of stage number three is
  • 00:09:27
    fine tuning the large language model so
  • 00:09:29
    if we want to build specific
  • 00:09:31
    applications we will do fine tuning in
  • 00:09:33
    this playlist we are going to build two
  • 00:09:35
    applications which are mentioned in the
  • 00:09:37
    book I showed you at the start one is
  • 00:09:39
    building a classifier and one is
  • 00:09:41
    building your own personal assistant so
  • 00:09:44
    here are some schematics to show so if
  • 00:09:46
you want to let's say you have got a lot of
  • 00:09:48
    emails right and if you want to use your
  • 00:09:50
    llm to classify spam or no spam for
  • 00:09:54
    example you are a winner you have been
  • 00:09:56
uh specially selected to receive $1,000
  • 00:09:58
    cash now this should be classified as
  • 00:10:01
    spam whereas hey just wanted to check if
  • 00:10:03
    we are still on for dinner tonight let
  • 00:10:05
    me know this will be not spam so we will
  • 00:10:08
    build a large language model this
  • 00:10:10
    application which classifies between
  • 00:10:12
    spam and no spam and we cannot just use
  • 00:10:14
    the pre-trained or foundational model
  • 00:10:16
    for this because we need to train with
  • 00:10:17
    labeled data to the pre-train model we
  • 00:10:20
    need to give some more data and tell it
  • 00:10:22
    that hey this is usually spam and this
  • 00:10:24
    is not spam can you use the foundational
  • 00:10:26
    model plus this additional specific
  • 00:10:28
labeled data set which I have given to
  • 00:10:30
    build a fine-tuned llm application for
  • 00:10:34
    email classification so this is what
  • 00:10:36
    we'll be building as the first
  • 00:10:38
    application the second application which
  • 00:10:40
    we'll be building is a type of a chat
  • 00:10:42
bot which basically answers queries
  • 00:10:44
    so there is an instruction there is an
  • 00:10:46
    input and there is an output and we'll
  • 00:10:48
    be building this chatbot after fine
  • 00:10:51
    tuning the large language model so if
  • 00:10:54
    you want to be a very serious llm
  • 00:10:56
    engineer all the stages are equally
  • 00:10:58
    important many students what they are
  • 00:11:00
    doing right now is that they just look
  • 00:11:02
    at stage number three and they either
  • 00:11:04
    use Lang chain let's
  • 00:11:06
    say they use Lang chain they use tools
  • 00:11:09
    like
  • 00:11:10
Ollama and they directly deploy
  • 00:11:13
    applications but they do not understand
  • 00:11:15
    what's going on in stage one and stage
  • 00:11:17
    two at all so this leaves you also a bit
  • 00:11:19
under-confident and insecure about
  • 00:11:21
    whether I really know the nuts and bolts
  • 00:11:23
    whether I really know the details my
  • 00:11:25
    plan is to go over every single thing
  • 00:11:27
    without skipping even a single Concept
  • 00:11:30
    in stage one stage two and stage number
  • 00:11:33
    three so this is the plan which you'll
  • 00:11:35
    be following in this playlist and I hope
  • 00:11:37
    you are excited for this because at the
  • 00:11:39
    end of this really my vision for this
  • 00:11:42
    playlist is to make it the most detailed
  • 00:11:44
    llm playlist uh which many people can
  • 00:11:46
    refer not just students but working
  • 00:11:48
    professionals startup Founders managers
  • 00:11:51
    Etc and then you can once this playlist
  • 00:11:53
    is built over I think two to 3 months
  • 00:11:56
    later you can uh refer to whichever part
  • 00:11:59
    you are more interested in so people who
  • 00:12:01
    are following this in the early stages
  • 00:12:03
    of this journey it's awesome because
  • 00:12:05
    I'll reply to all the comments in the um
  • 00:12:09
    chat section and we'll build this
  • 00:12:11
    journey
  • 00:12:13
    together I want to end this a lecture by
  • 00:12:16
    providing a recap of what all we have
  • 00:12:18
    learned so far this is very uh this is
  • 00:12:21
    going to be very important because from
  • 00:12:22
    the next lecture we are going to start a
  • 00:12:24
    bit of the Hands-On
  • 00:12:26
    approach okay so number one large
  • 00:12:29
    language models have really transformed
  • 00:12:31
    uh the field of natural language
  • 00:12:34
    processing they have led to advancements
  • 00:12:36
    in generating understanding and
  • 00:12:38
    translating human language this is very
  • 00:12:40
    important uh so the field of NLP before
  • 00:12:43
    you needed to train a separate algorithm
  • 00:12:45
    for each specific task but large
  • 00:12:47
    language models are pretty generic if
  • 00:12:49
    you train an llm for predicting the next
  • 00:12:51
    word it turns out that it develops
  • 00:12:53
    emergent properties which means it's not
  • 00:12:55
    only good at predicting the next word
  • 00:12:57
    but also at things like uh multiple
  • 00:13:00
    choice questions text summarization then
  • 00:13:03
    emotion classification language
  • 00:13:05
    translation Etc it's useful for a wide
  • 00:13:07
range of tasks and that has led to
  • 00:13:10
    its predominance as an amazing tool in a
  • 00:13:13
    variety of
  • 00:13:15
    fields secondly all modern large
  • 00:13:18
    language models are trained in two main
  • 00:13:20
    steps first we pre-train on an unlabeled
  • 00:13:23
    data this is called as a foundational
  • 00:13:25
    model and for this very large data sets
  • 00:13:28
    are needed typically billions of words
  • 00:13:31
    and it costs a lot as we saw training
  • 00:13:33
    pre-training gpt3 costs $4.6 million so
  • 00:13:37
    you need access to huge amount of data
  • 00:13:39
    compute power and money to pre-train
  • 00:13:42
    such a foundational model now if you are
  • 00:13:45
    actually going to implement an llm
  • 00:13:47
    application on production level so let's
  • 00:13:49
    say if you're an educational company
  • 00:13:51
    building multiple choice questions and
  • 00:13:53
    you think that the answers provided by
  • 00:13:55
the pre-trained or foundational model
  • 00:13:57
    are not very good and they are a bit
  • 00:13:58
    generic
  • 00:13:59
    you can provide your own specific data
  • 00:14:02
    set and you can label the data set
  • 00:14:04
    saying that these are the right answers
  • 00:14:06
    and I want you to further train on this
  • 00:14:07
    refined data set uh to build a better
  • 00:14:10
    model this is called fine tuning usually
  • 00:14:14
    airline companies restaurants Banks
  • 00:14:16
    educational companies when they deploy
  • 00:14:19
    llms into production level they fine
  • 00:14:21
    tune the pre-trained llm nobody deploys
  • 00:14:23
the pre-trained one directly you fine tune
  • 00:14:26
the llm on your specific smaller
  • 00:14:29
    label data set this is very important
  • 00:14:31
    see for pre-training the data set which
  • 00:14:33
    we have is unlabeled it's Auto
  • 00:14:35
    regressive so the sentence structure
  • 00:14:37
    itself is used for creating the labels
  • 00:14:39
as we are just predicting the next word
  • 00:14:42
but when we fine tune we have a labeled data
  • 00:14:44
    set such as remember the spam versus no
  • 00:14:47
    spam example which I showed you that is
  • 00:14:49
    a label data set we give labels like hey
  • 00:14:51
    this is Spam this is not spam this is a
  • 00:14:53
    good answer this is not a good answer
  • 00:14:55
    and this finetuning step is generally
  • 00:14:57
needed for building production ready
  • 00:14:59
    llm
  • 00:15:01
    applications important thing to remember
  • 00:15:03
    is that fine tuned llms can outperform
  • 00:15:06
    only pre-trained llms on specific tasks
  • 00:15:09
    so let's say you take two cases right in
  • 00:15:11
    one case you only have pre-trained llms
  • 00:15:13
    and in second case you have pre-trained
  • 00:15:15
    plus fine tuned llms so it turns out
  • 00:15:18
    that pre-trained plus finetune does a
  • 00:15:20
    much better job at certain specific
  • 00:15:22
tasks than just using pre-training for
  • 00:15:24
    students who just want to interact for
  • 00:15:26
    getting their doubts solved or for
  • 00:15:29
    getting assistance uh in summarization
  • 00:15:32
    uh helping in writing a research paper
  • 00:15:34
etc GPT-4 Perplexity or such API tools or
  • 00:15:39
    such interfaces which are available work
  • 00:15:41
    perfectly fine but if you want to build
  • 00:15:43
    a specific application on your data set
  • 00:15:46
    and take it to production level you
  • 00:15:48
    definitely need fine
  • 00:15:50
    tuning okay now uh one more key thing is
  • 00:15:54
that the secret sauce behind large
  • 00:15:55
    language models is this Transformer
  • 00:15:57
    architecture
  • 00:15:59
    so uh the key idea behind Transformer
  • 00:16:02
    architecture is the attention mechanism
  • 00:16:05
    uh just to show you how the Transformer
  • 00:16:07
    architecture looks like it looks like
  • 00:16:08
    this and the main thing behind the
  • 00:16:10
    Transformer architecture which really
  • 00:16:12
    makes it so
  • 00:16:14
    powerful are these attention
  • 00:16:17
    blocks we'll see what they mean so no
  • 00:16:19
    need to worry about this right
  • 00:16:21
    now but in the nutshell attention
  • 00:16:24
    mechanism gives the llm selective access
  • 00:16:26
    to the whole input sequence when
  • 00:16:28
    generating output one word at a time
  • 00:16:31
    basically attention mechanism allows the
  • 00:16:33
    llm to understand the importance of
  • 00:16:36
    words and not just the word in the
  • 00:16:39
    current sentence but in the previous
  • 00:16:41
    sentences which have come long before
  • 00:16:42
    also because context is important in
  • 00:16:45
    predicting the next word the current
  • 00:16:47
    sentence is not the only one which
  • 00:16:48
    matters attention mechanism allows the
  • 00:16:51
    llm to give access to the entire context
  • 00:16:53
    and select or give weightage to which
  • 00:16:55
    words are important in predicting the
  • 00:16:57
    next word this is a key idea which and
  • 00:17:00
    we'll spend a lot of time on this
  • 00:17:02
    idea remember that the original
  • 00:17:04
Transformer had the encoder
  • 00:17:07
    plus decoder so it had both of these
  • 00:17:10
    things it had the encoder as well as it
  • 00:17:11
had the decoder but generative pre-trained
  • 00:17:15
    Transformer only has the decoder it did
  • 00:17:17
    not it does not have the encoder so
  • 00:17:20
    Transformer and GPT is not the same
  • 00:17:22
    Transformer paper came in 2017 it had
  • 00:17:24
encoder plus decoder generative pre-trained
  • 00:17:27
    Transformer came one year later
  • 00:17:29
    2018 and that only had the decoder
  • 00:17:32
architecture so even GPT-4 right now it
  • 00:17:34
    only has decoder no encoder so 2018 came
  • 00:17:38
GPT the first generative pre-trained
  • 00:17:40
    Transformer architecture 2019 came gpt2
  • 00:17:43
    2020 came gpt3 which had 175 billion
  • 00:17:47
    parameters and that really changed the
  • 00:17:49
    game because no one had seen a model
  • 00:17:51
    this large before and then now we are at
  • 00:17:53
    GPT 4
  • 00:17:55
    stage one last point which is very
  • 00:17:57
    important is that llms are only trained
  • 00:18:00
    for predicting the next word right but
  • 00:18:02
    very surprisingly they develop emergent
  • 00:18:04
    properties which means that although
  • 00:18:07
    they are only trained to predict the
  • 00:18:08
    next word they show some amazing
  • 00:18:11
    properties like ability to classify text
  • 00:18:14
    translate text from one language into
  • 00:18:16
    another language and even summarize
  • 00:18:17
    texts so they were not trained for these
  • 00:18:20
    tasks but they developed these
  • 00:18:22
    properties and that was an awesome thing
  • 00:18:23
    to realize the pre-training stage works
  • 00:18:26
    so well that llms develop all of these
  • 00:18:28
    wonderful other properties which makes
  • 00:18:30
    them so impactful for a wide range of
  • 00:18:33
    tasks
  • 00:18:35
    currently okay so this brings us to the
  • 00:18:37
    end of the recap which we have covered
  • 00:18:39
    up till now if you have not seen the
  • 00:18:41
    previous lectures I really encourage you
  • 00:18:43
    to go through them because these
  • 00:18:45
    lectures have really set the stage for
  • 00:18:46
    us to now dive into stage one so from
  • 00:18:49
    the next lecture we'll start going into
  • 00:18:51
    stage one and we'll start seeing the
  • 00:18:53
    first aspect which is data preparation
  • 00:18:55
    and sampling so the next lecture title
  • 00:18:58
will be working with Text Data and
  • 00:19:00
    we'll be looking at the data sets how to
  • 00:19:03
    load a data set how to count the number
  • 00:19:05
    of characters uh how to break the data
  • 00:19:07
    into tokens and I'll I'll start sharing
  • 00:19:10
sharing Jupyter notebooks from next time
  • 00:19:12
onward so that we can parallelly begin
  • 00:19:15
    coding so thanks everyone I hope you are
  • 00:19:17
    liking these lectures so lecture 1 to
  • 00:19:20
six were kind of like an introductory
  • 00:19:23
    lecture to give you a feel of the entire
  • 00:19:24
    series and so that you understand
  • 00:19:26
Concepts at a fundamental level
  • 00:19:28
    from lecture 7 we'll be diving deep into
  • 00:19:30
    code and we'll be starting into stage
  • 00:19:33
    one so I follow this approach of writing
  • 00:19:36
    on a whiteboard and also
  • 00:19:38
    coding um so that you understand the
  • 00:19:40
    details plus the code at the same time
  • 00:19:43
    because I believe Theory plus practical
  • 00:19:44
    implementation both are important and
  • 00:19:47
    that is one of the philosophies of this
  • 00:19:49
    lecture Series so do let me know in the
  • 00:19:51
comments how you are finding this teaching
  • 00:19:53
    style uh because I will take feedback
  • 00:19:56
    from that and we can build this series
  • 00:19:58
    together 3 to four months later this can
  • 00:20:00
    be an amazing and awesome series and I
  • 00:20:03
    will rely on your feedback to build this
  • 00:20:05
    thanks a lot everyone and I look forward
  • 00:20:07
    to seeing you in the next lecture
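
The token-embedding and positional-encoding steps discussed around 00:04:53 to 00:05:55 in the transcript can be sketched as below, assuming GPT-style learned absolute position embeddings that are simply added to the token embeddings; the sizes and the sample token IDs are illustrative placeholders, not the values the series will use.

```python
import torch
import torch.nn as nn

vocab_size, context_length, emb_dim = 50257, 1024, 768  # GPT-2-like sizes, for illustration

token_emb = nn.Embedding(vocab_size, emb_dim)   # maps token IDs to semantic vectors
pos_emb = nn.Embedding(context_length, emb_dim)  # one learned vector per position

token_ids = torch.tensor([[15496, 11, 995]])     # hypothetical (batch=1, seq_len=3) batch
positions = torch.arange(token_ids.shape[1])     # [0, 1, 2]

# Each token's vector now carries both its meaning and its place in the sequence.
x = token_emb(token_ids) + pos_emb(positions)    # shape: (1, 3, 768)
print(x.shape)
```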
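
The "generate sample text for visual inspection" step mentioned around 00:08:04 can be sketched as a greedy decoding loop. The `model` here is assumed to return logits of shape (batch, seq_len, vocab_size); a real implementation would typically add temperature or top-k sampling rather than always taking the argmax.

```python
import torch

@torch.no_grad()
def generate_greedy(model, token_ids, max_new_tokens=20, context_length=256):
    """Repeatedly predict the most likely next token and append it to the sequence."""
    for _ in range(max_new_tokens):
        context = token_ids[:, -context_length:]      # crop to the supported context
        logits = model(context)                       # (batch, seq_len, vocab_size)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
        token_ids = torch.cat([token_ids, next_token], dim=1)
    return token_ids
```
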
Tags
  • Large Language Models
  • GPT Architecture
  • Data Preparation
  • Attention Mechanism
  • Pre-training
  • Fine-tuning
  • Tokenization
  • Vector Embeddings
  • Transformer Architecture
  • NLP