Whitepaper Companion Podcast - Foundational LLMs & Text Generation
Overview
TL;DR: This deep dive discusses the evolution, architecture, and application of large language models (LLMs), starting from the foundational Transformer architecture developed by Google in 2017. Significant advancements in LLMs are emphasized, including models like GPT-1, BERT, GPT-2, GPT-3, and recent multimodal models like Gemini. Key concepts like multi-head attention, fine-tuning approaches, prompt engineering, and performance evaluation methods are covered. The conversation culminates in exploring real-world applications spanning various fields such as programming, translation, content creation, and more, showcasing the rapid innovation and expansion of LLM capabilities.
Key Takeaways
- 🧠 LLMs are revolutionizing text creation and understanding.
- ⚙️ Transformers process context using advanced self-attention techniques.
- 📊 Fine-tuning helps specialize LLMs for specific tasks.
- ⚡ Efforts to speed up inference focus on balancing efficiency and output quality.
- 🌐 Multimodal models are emerging to expand application possibilities.
Timeline
- 00:00:00 - 00:05:00
The video introduces a comprehensive exploration of large language models (LLMs) and their foundational role in generating text. It highlights the rapid advancements in the technology and the importance of understanding the architecture and learning mechanisms of LLMs, covering developments up to early 2025. The discussion emphasizes the foundational Transformer architecture used in many modern LLMs, which was initially developed for language translation by Google in 2017.
- 00:05:00 - 00:10:00
A detailed overview of the Transformer layers covers the transition of input text into tokens, embeddings, and the significance of positional encoding. The process of self-attention is explained using the example of a sentence about a thirsty tiger, illustrating how the model understands relationships between words through 'queries' and 'keys'. It introduces multi-head attention, where different attention heads learn from different aspects of the input text simultaneously, enhancing the model's understanding.
- 00:10:00 - 00:15:00
The video discusses the importance of layer normalization and residual connections in deep networks to maintain performance across layers, as well as the efficiency of the feed-forward layer. Transitioning to model architecture, the conversation covers the emergence of decoder-only models, suitable for text generation tasks without the encoder, using masked self-attention to predict subsequent tokens while ensuring the output adheres to a logical sequence.
- 00:15:00 - 00:20:00
Exploration of the evolution of LLMs starts from the first Transformer paper to significant developments like GPT-1 to GPT-4 and other models like BERT, LaMDA, and Gopher. Each model’s function and advancements are outlined, emphasizing how newer iterations have progressively improved understanding, generation capabilities, and efficiency, such as the introduction of mixture of experts to optimize performance without sacrificing speed.
- 00:20:00 - 00:29:54
The discussion culminates in techniques for fine-tuning LLMs, the critical role of prompt engineering, evaluation methods, and acceleration strategies for inference processes. The applications of LLMs in various fields, particularly in coding, translation, content creation, and conversational AI, reveal their transformative potential and ongoing developments, highlighting the rapid pace of innovation and inviting thoughts on future advancements.
Video Q&A
What is the foundation of most modern LLMs?
The foundation of most modern LLMs is the Transformer architecture.
How do LLMs process input text?
LLMs process input text by dividing it into tokens which are then transformed into dense vectors, called embeddings, capturing their meanings.
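As a rough illustration of that lookup step (the toy vocabulary, token IDs, and embedding size below are invented for the example and not taken from any real model), the token-to-embedding conversion can be sketched in Python like this:

```python
import numpy as np

# Toy vocabulary and embedding table -- real models use learned tables with
# tens of thousands of entries and hundreds to thousands of dimensions.
vocab = {"the": 0, "tiger": 1, "was": 2, "thirsty": 3}
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 8))   # (vocab_size, embed_dim)

sentence = ["the", "tiger", "was", "thirsty"]
token_ids = [vocab[w] for w in sentence]             # text -> token IDs
embeddings = embedding_table[token_ids]              # IDs -> dense vectors

print(token_ids)         # [0, 1, 2, 3]
print(embeddings.shape)  # (4, 8): one 8-dimensional vector per token
```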
What is self-attention in the context of Transformers?
Self-attention helps the model determine the relationship of a word to other words in a sentence, enhancing contextual understanding.
What is the role of fine-tuning in LLMs?
Fine-tuning adjusts pre-trained models on smaller, specific datasets to enhance performance on targeted tasks.
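A minimal sketch of the mechanics, assuming a toy PyTorch model and one made-up prompt/response sequence; real supervised fine-tuning starts from a large pre-trained Transformer and uses many labeled examples, but the next-token cross-entropy loop has the same shape:

```python
import torch
import torch.nn as nn

# Toy stand-in for a "pre-trained" model: embedding + linear head over the vocabulary.
vocab_size, dim = 100, 32
model = nn.Sequential(nn.Embedding(vocab_size, dim), nn.Linear(dim, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Made-up labeled example: prompt plus desired response, already tokenized.
# The model is trained to predict each next token of the full sequence.
sequence = torch.tensor([[5, 17, 42, 8, 23, 99]])    # (batch=1, seq_len)
inputs, targets = sequence[:, :-1], sequence[:, 1:]

for step in range(100):
    logits = model(inputs)                            # (1, seq_len-1, vocab_size)
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```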
What techniques are used to speed up inference in LLMs?
Techniques include quantization, distillation, and prefix caching to improve response times without significantly sacrificing output quality.
- 00:00:00all right welcome everyone to the Deep
- 00:00:01dive today we're uh taking a deep dive
- 00:00:04into something pretty huge foundational
- 00:00:07large language models or llms and how
- 00:00:10they create text I mean it seems like
- 00:00:12they're popping up everywhere right
- 00:00:13changing how we write code how we even
- 00:00:15write stories yeah the advancements have
- 00:00:17been uh incredibly fast it's hard to
- 00:00:19keep up for this deep dive we're going
- 00:00:21all the way up to February 2025 so we're
- 00:00:24talking Cutting Edge stuff yeah
- 00:00:25seriously Cutting Edge so our mission
- 00:00:27today is to um to distill all that down
- 00:00:31right get to the core of these llms what
- 00:00:33are they made of how do they evolve you
- 00:00:35know how do they actually learn of
- 00:00:37course how do we even measure how good
- 00:00:39they are we're going to look at all that
- 00:00:40even some of the tricks used to uh make
- 00:00:42them run faster it's a lot to cover but
- 00:00:44hopefully we can make it uh make it a
- 00:00:46fun ride you know the starting point for
- 00:00:48all this the foundation of most modern
- 00:00:49llms is the Transformer architecture and
- 00:00:52it's actually kind of funny it came from
- 00:00:53a Google project focused on language
- 00:00:55translation back in 2017 okay so this
- 00:00:58Transformer thing I remember hearing
- 00:00:59about that the original one had this
- 00:01:01encoder and decoder right like it would
- 00:01:03take a sentence in one language and turn
- 00:01:05it into uh another language yeah exactly
- 00:01:08so the encoder would take the input you
- 00:01:10know like a sentence in French and
- 00:01:12create this representation of It kind of
- 00:01:13like a summary of the meaning then the
- 00:01:15decoder uses that representation to
- 00:01:17generate the output like the English
- 00:01:19translation piece by piece and each
- 00:01:21piece they call it a token it could be a
- 00:01:23whole word like cat or part of word like
- 00:01:25pre and prefix but the real magic is
- 00:01:28what happens inside each layer of
- 00:01:30this Transformer thing all right well
- 00:01:32let's get into that magic what's
- 00:01:34actually going on in a Transformer layer
- 00:01:36so first things first the input text
- 00:01:38needs to be prepped for the model right
- 00:01:40we turn the text into those tokens based
- 00:01:42on a specific vocabulary the model uses
- 00:01:45and each of these tokens gets turned
- 00:01:46into this dense Vector we call it an
- 00:01:49embedding that captures the meaning of
- 00:01:51that token but and this is important
- 00:01:54Transformers process all the tokens at
- 00:01:56the same time so we need to add in some
- 00:01:59information about the order they
- 00:02:00appeared in the sentence that's called
- 00:02:02positional encoding and there are
- 00:02:04different types of positional encoding
- 00:02:05like sinusoidal and learned encodings the
- 00:02:07choice can actually subtly affect how
- 00:02:09well the model understands longer
- 00:02:11sentences or longer sequences of text
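For reference, a small Python sketch of the sinusoidal variant mentioned here; the sequence length and embedding dimension are arbitrary example values:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, dim: int) -> np.ndarray:
    """Fixed sinusoidal encodings in the style of the original Transformer paper."""
    positions = np.arange(seq_len)[:, None]                          # (seq_len, 1)
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)    # (dim/2,)
    angles = positions * freqs                                       # (seq_len, dim/2)
    encoding = np.zeros((seq_len, dim))
    encoding[:, 0::2] = np.sin(angles)   # even dimensions get sine
    encoding[:, 1::2] = np.cos(angles)   # odd dimensions get cosine
    return encoding

# Added to the token embeddings so the model knows where each token sits in the sequence.
embeddings = np.random.default_rng(0).normal(size=(10, 16))  # 10 tokens, dim 16
inputs_with_position = embeddings + sinusoidal_positional_encoding(10, 16)
```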
- 00:02:13makes sense otherwise it's like just
- 00:02:14throwing all the words in a bag you lose
- 00:02:16all the structure then we get to I think
- 00:02:18the most famous part the multi-head
- 00:02:20attention I saw this thirsty tiger
- 00:02:22example I thought was uh pretty helpful
- 00:02:25to try and understand self attention oh
- 00:02:27yeah the Thirsty tiger a classic so the
- 00:02:31sentence is the tiger jumped out of a
- 00:02:33tree to get a drink because it was
- 00:02:34thirsty now self attention it's what
- 00:02:36lets the model figure out that it refers
- 00:02:39back to the tiger and it does this by uh
- 00:02:42creating these vectors query key and
- 00:02:45value vectors for every single word okay
- 00:02:48so wait let me let me try this so it
- 00:02:50that would be the query it's like asking
- 00:02:52hey which other words in this sentence
- 00:02:53are important to understanding me yeah
- 00:02:55you got it and the key it's like a label
- 00:02:57attached to each word telling you what
- 00:02:59it represents then the value that's the
- 00:03:01actual information the word carries so
- 00:03:03like it looks at all the other words
- 00:03:04keys and sees that the Tiger has a key
- 00:03:06that's really similar so it pays more
- 00:03:08attention to the tiger exactly and the
- 00:03:10model calculates this score you know for
- 00:03:12how well each query matches up with all
- 00:03:14the other Keys then it normalizes these
- 00:03:16scores so they become weights attention
- 00:03:19weights these weights tell you how much
- 00:03:21each word should pay attention to the
- 00:03:23others then it uses those weights to
- 00:03:25create a weighted sum of all the value
- 00:03:27vectors and what you get is this Rich
- 00:03:30representation for each word which takes
- 00:03:32into account its relationship to every
- 00:03:34other word in the sentence and the
- 00:03:36really cool part is all of this all this
- 00:03:38comparison and calculation happens in
- 00:03:40parallel using these matrices for the
- 00:03:42query q key K and value V of all the
- 00:03:45tokens this ability to process all these
- 00:03:47relationships at the same time is a huge
- 00:03:49reason why Transformers are so good at
- 00:03:51capturing these subtle meanings in
- 00:03:53language that previous models you know
- 00:03:54the sequential ones really struggled
- 00:03:56with especially across longer distances
- 00:03:58within a sentence
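A compact NumPy sketch of the query/key/value computation just described; the projection matrices and inputs are random placeholders, and the optional causal flag previews the masked self-attention covered later for decoder-only models. Multi-head attention simply runs several of these in parallel with different projection matrices and concatenates the results:

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v, causal=False):
    """Scaled dot-product self-attention over a (seq_len, dim) input."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v             # queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])         # how well each query matches each key
    if causal:                                      # masked self-attention: only look at earlier tokens
        scores = np.where(np.tril(np.ones_like(scores)) == 1, scores, -1e9)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax -> attention weights
    return weights @ v                               # weighted sum of value vectors

rng = np.random.default_rng(0)
dim = 16
x = rng.normal(size=(6, dim))                        # 6 tokens, e.g. the thirsty-tiger sentence
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)               # (6, 16) context-aware representations
```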
- 00:04:00okay I think I'm starting to get it and multi-head means
- 00:04:01doing the self attention thing like
- 00:04:03several times at the same time right but
- 00:04:05with different sets of those query key
- 00:04:07and value matrices yes and each head
- 00:04:11each of these parallel self- attention
- 00:04:13processes learns to focus on different
- 00:04:16types of relationships one head might
- 00:04:17look for grammatical stuff another one
- 00:04:19might focus on the uh the meaning
- 00:04:21connections between words and by
- 00:04:24combining all those different views you
- 00:04:26know those different perspectives the
- 00:04:27model gets this much deeper
- 00:04:29understanding of what's going on in the
- 00:04:31text it's like getting a second opinion
- 00:04:33or a third or a fourth it's powerful
- 00:04:35stuff now I also saw these terms layer
- 00:04:38normalization and residual connections
- 00:04:40they seem to be important for uh keeping
- 00:04:43the training on track especially when
- 00:04:45you have these really deep networks oh
- 00:04:47they're essential layer normalization it
- 00:04:49helps to keep the activity level of each
- 00:04:51layer you know the activations at a
- 00:04:52steady level that makes the training go
- 00:04:54much faster and usually gives you better
- 00:04:55results in the end residual connections
- 00:04:58they act like shortcuts you know within
- 00:05:00the network it's like they let the
- 00:05:01original input of a layer bypass
- 00:05:03everything and get added directly to the
- 00:05:05output so it's a way for the network to
- 00:05:07remember what it learned earlier even if
- 00:05:09it's gone through many many layers
- 00:05:11exactly that's why they're so important
- 00:05:13in these really deep models it prevents
- 00:05:15that vanishing gradients problem where
- 00:05:17the signal gets weaker and weaker as it
- 00:05:19goes deeper then after all that we have
- 00:05:22the feed forward layer right the feed
- 00:05:24forward layer yeah it's this network a
- 00:05:26feed forward Network that's applied to
- 00:05:28each token's representation separately
- 00:05:30after we've done all that attention
- 00:05:32stuff it usually has two linear
- 00:05:34Transformations with a what's called a
- 00:05:36nonlinear activation function in between
- 00:05:39like relu or
- 00:05:41gelu this gives the model even more
- 00:05:43power to represent information helps it
- 00:05:45learn these complex functions of the
- 00:05:47input
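Putting those pieces together, here is a rough NumPy sketch of the residual-plus-layer-norm pattern around the two sub-layers (post-norm style, with an identity function standing in for the attention sub-layer just to keep the example short):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token's activations to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def feed_forward(x, w1, b1, w2, b2):
    """Position-wise feed-forward network: two linear maps with a ReLU in between."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

def transformer_sublayers(x, attention_fn, ffn_params):
    # Residual connection ("shortcut") plus layer norm around each sub-layer.
    x = layer_norm(x + attention_fn(x))                  # attention sub-layer
    x = layer_norm(x + feed_forward(x, *ffn_params))     # feed-forward sub-layer
    return x

rng = np.random.default_rng(0)
dim, hidden = 16, 64
ffn_params = (rng.normal(size=(dim, hidden)), np.zeros(hidden),
              rng.normal(size=(hidden, dim)), np.zeros(dim))
x = rng.normal(size=(6, dim))
out = transformer_sublayers(x, lambda t: t, ffn_params)  # identity stands in for attention here
```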
- 00:05:49so we've talked about encoders and decoders in the original Transformer
- 00:05:51design but I noticed in the materials
- 00:05:53that many of the newer llms they're
- 00:05:55going with a decoder only architecture
- 00:05:57what's the advantage of just using the
- 00:05:58decoder well you see when you're focused
- 00:06:00on generating texts like writing or
- 00:06:02having a conversation you don't always
- 00:06:04need the encoder part the encoder's main
- 00:06:07job is to create this representation of
- 00:06:09the whole input sequence up front
- 00:06:11decoder only models they kind of skip
- 00:06:14that step and directly generate the
- 00:06:16output token by token they use this
- 00:06:19special type of self- attention called
- 00:06:20masked self- attention it's a way to
- 00:06:23make sure that uh when the model is
- 00:06:26predicting the next token it can only
- 00:06:28see the tokens that came before it you
- 00:06:30know just like when we write or speak so
- 00:06:32it's a simpler design and it makes sense
- 00:06:33for generating text exactly and before
- 00:06:36we move on from architecture there's one
- 00:06:38more thing um mixture of experts or MoE
- 00:06:41it's this really clever way to make
- 00:06:43these models even bigger but without
- 00:06:44making them super slow I was just going
- 00:06:46to ask about that how do you make these
- 00:06:48massive models more efficient MoE seems
- 00:06:50to be a key part of that it really is so
- 00:06:52in MoE you have these specialized
- 00:06:54submodels these experts right and they
- 00:06:57all live within one big model but the
- 00:06:59trick is there's this gating network
- 00:07:01that decides which experts are the best
- 00:07:03ones to use for each input so you might
- 00:07:06have a model with billions of parameters
- 00:07:08but for any given input only a small
- 00:07:11fraction of those parameters those
- 00:07:13experts are actually active it's like
- 00:07:15having a team of Specialists and you
- 00:07:17only call in the ones you need for the
- 00:07:18specific job
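A toy sketch of that gating idea, assuming tiny linear "experts" and a simple top-k softmax router; real MoE layers use learned feed-forward experts and more careful load balancing:

```python
import numpy as np

def moe_layer(x, gate_w, experts, top_k=2):
    """Sparse mixture-of-experts: route each token to its top-k experts only."""
    gate_logits = x @ gate_w                                # (tokens, num_experts)
    output = np.zeros_like(x)
    for i, logits in enumerate(gate_logits):
        chosen = np.argsort(logits)[-top_k:]                # indices of the top-k experts
        weights = np.exp(logits[chosen]) / np.exp(logits[chosen]).sum()
        for w, e in zip(weights, chosen):
            output[i] += w * experts[e](x[i])               # only these experts run for this token
    return output

rng = np.random.default_rng(0)
dim, num_experts = 8, 4
# Each "expert" here is just a small linear map; in a real model it's a feed-forward block.
expert_weights = [rng.normal(size=(dim, dim)) for _ in range(num_experts)]
experts = [lambda t, w=w: t @ w for w in expert_weights]
gate_w = rng.normal(size=(dim, num_experts))
tokens = rng.normal(size=(5, dim))
routed = moe_layer(tokens, gate_w, experts)                 # (5, 8) outputs, two experts per token
```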
- 00:07:21makes sense yeah it's all about efficiency now I think it would be
- 00:07:23good to step back and look at the big
- 00:07:24picture how llms have evolved over time
- 00:07:27you know the Transformer was the spark
- 00:07:29but then things really started taking
- 00:07:30off yeah there's this whole family tree
- 00:07:32of llms now where did it all begin after
- 00:07:35that first Transformer paper well GPT
- 00:07:37one from open AI in 2018 was a real
- 00:07:39turning point it was decoder only and
- 00:07:42they trained it in an unsupervised way
- 00:07:44on this massive data set of books they
- 00:07:45called it BooksCorpus this
- 00:07:47unsupervised pre-training was key it let
- 00:07:49the model learn General language
- 00:07:51patterns from all this raw text then
- 00:07:53they would fine-tune it for specific tasks
- 00:07:55but gpt1 had its limitations right I
- 00:07:58remember reading that sometimes it would
- 00:08:00get stuck repeating the same phrases
- 00:08:02over and over yeah it wasn't perfect
- 00:08:04sometimes the text would get a bit
- 00:08:05repetitive and it wasn't so good at long
- 00:08:08conversations but it was still a major
- 00:08:10step then that same year Google came out
- 00:08:13with BERT now BERT was different it was
- 00:08:15encoder only and its focus was on
- 00:08:18understanding language not generating it
- 00:08:20it was trained on these tasks uh like
- 00:08:23masked language modeling and next
- 00:08:24sentence prediction which are all about
- 00:08:26figuring out the meaning of text so gpt1
- 00:08:28could talk but sometimes it would get
- 00:08:30stuck and BERT could understand but
- 00:08:32couldn't really hold a conversation
- 00:08:33that's a good way to put it then came
- 00:08:35gpt2 in 2019 also from open AI they took
- 00:08:39the gpt1 idea and just scaled it up way
- 00:08:42more data from this data set called WebText
- 00:08:44which was built from links shared on Reddit and
- 00:08:46many more parameters in the model itself
- 00:08:48the result much better coherence it
- 00:08:50could handle longer dependencies between
- 00:08:52words and the really cool thing was it
- 00:08:54could learn new tasks without even being
- 00:08:56specifically trained on them they call
- 00:08:57it zero shot learning you just show it
- 00:08:59an example of the task in the prompt and
- 00:09:01it could often figure out how to do it
- 00:09:03whoa just from an example that's amazing
- 00:09:05it was quite a leap and then starting in
- 00:09:082020 we got the gpt3 family these models
- 00:09:11just kept getting bigger and bigger
- 00:09:12billions of parameters gpt3 with its 175
- 00:09:16billion parameters it was huge and it
- 00:09:18got even better at few-shot learning
- 00:09:20learning from just a handful of examples
- 00:09:22we also saw these instruction-tuned
- 00:09:24models like InstructGPT trained
- 00:09:26specifically to follow instructions
- 00:09:28written in natural language then came
- 00:09:30models like GPT-3.5 which were amazing
- 00:09:33at understanding and writing code and
- 00:09:35GPT-4 that was a game changer a truly
- 00:09:37multimodal model it could handle images
- 00:09:39and text together the context window
- 00:09:42size also exploded meaning it could
- 00:09:43consider much longer pieces of text at
- 00:09:46once and Google they were pushing things
- 00:09:47forward as well right I remember LaMDA
- 00:09:50their conversational AI was a big deal
- 00:09:52absolutely LaMDA came out in 2021 and
- 00:09:55it was designed from the ground up for
- 00:09:56natural sounding conversations while the
- 00:09:58gpts were becoming more general purpose
- 00:10:00LaMDA was all about dialogue and it
- 00:10:02really showed then Deep Mind got in on
- 00:10:04the action with gopher in 2021 gopher
- 00:10:07what made that one Stand Out gopher was
- 00:10:09another big decoder only model but deep
- 00:10:11mine they really focused on using
- 00:10:13highquality data for training a data set
- 00:10:15they called massive text and they also
- 00:10:17used some pretty Advanced optimization
- 00:10:19techniques gopher did really well on
- 00:10:21knowledge intensive tasks but it still
- 00:10:23struggled with um more complex reasoning
- 00:10:27problems one interesting thing they
- 00:10:28found was that that just making the
- 00:10:30model bigger you know adding more
- 00:10:32parameters doesn't help with every type
- 00:10:34of task some tasks need different
- 00:10:36approaches right it's not just about
- 00:10:37size then there was GLaM from Google
- 00:10:40which used this mixture of experts idea
- 00:10:42we were talking about earlier making
- 00:10:43those huge models run much faster
- 00:10:46exactly GLaM showed that you could get
- 00:10:47the same or even better performance than
- 00:10:49a dense model like gpt3 but use way less
- 00:10:52compute power it was a big step forward
- 00:10:54in efficiency then came chinchilla in
- 00:10:572022 also from deepmind they really
- 00:10:59challenge those scaling laws you know
- 00:11:01the idea that bigger is always better
- 00:11:03yeah Chinchilla was a really important
- 00:11:04paper they found that for a given number
- 00:11:07of parameters you should actually train
- 00:11:09on a much larger data set than people
- 00:11:11were doing before they had this 70
- 00:11:14billion parameter model that actually
- 00:11:16outperformed much larger models because
- 00:11:18they trained it on this huge amount of
- 00:11:20data it really changed how people
- 00:11:22thought about scaling so it's not just
- 00:11:23about the size of the model it's also
- 00:11:25about the size of the data you train it
- 00:11:26on yeah exactly and then Google
- 00:11:29released uh PaLM and PaLM 2 PaLM came
- 00:11:33out in 2022 and had really impressive
- 00:11:36performance on all kinds of benchmarks
- 00:11:38part of that was because of Google's
- 00:11:39pathway system which made it easier to
- 00:11:41scale up models efficiently PaLM 2
- 00:11:44came out in 2023 and it was even better
- 00:11:46at things like reasoning coding and math
- 00:11:49even though it actually had fewer
- 00:11:50parameters than the first PaLM PaLM 2 is
- 00:11:53now the foundation for a lot of Google's
- 00:11:55uh generative AI stuff in Google cloud
- 00:11:58and then we have Gemini Google's newest
- 00:12:00family of models which are multimodal
- 00:12:02right from the start yeah Gemini is
- 00:12:04really pushing the boundaries it's
- 00:12:05designed to handle not just text but
- 00:12:07also images audio and video they've been
- 00:12:10working on architectural improvements
- 00:12:12that let them scale these models up
- 00:12:13really big and they've optimized Gemini
- 00:12:15to run really fast on their tensor
- 00:12:17processing units TPUs they also use MoE
- 00:12:20in some of the Gemini models there are
- 00:12:22different sizes too Ultra Pro Nano and
- 00:12:25Flash each for different needs Gemini
- 00:12:271.5 Pro with its massive context window
- 00:12:30that's been particularly impressive it
- 00:12:32can handle millions of tokens which is
- 00:12:34incredible it's mindboggling how fast
- 00:12:36these context windows are growing what
- 00:12:38about the open source side of things
- 00:12:40there's a lot happening there too right
- 00:12:41oh absolutely the open source llm
- 00:12:43Community is exploding Google released
- 00:12:46Gemma and Gemma 2 in 2024 which are
- 00:12:48these lightweight but very powerful open
- 00:12:51models building off of their Gemini
- 00:12:52research Gemma has a huge vocabulary and
- 00:12:55there's even a two billion parameter
- 00:12:57version that can run on a single GPU so
- 00:12:59it's much more accessible Gemma 2 is
- 00:13:01performing comparably to much bigger
- 00:13:03models like Meta's Llama 3 70B Meta's Llama
- 00:13:06family has been really influential
- 00:13:08starting with llama 1 then llama 2 which
- 00:13:09had a commercial use license and now
- 00:13:11llama 3 they've been improving in areas
- 00:13:13like reasoning coding general knowledge
- 00:13:16safety and they've even added
- 00:13:17multilingual and vision models in the
- 00:13:19Llama 3.2 release Mistral AI they have
- 00:13:22Mixtral which uses a sparse mixture of
- 00:13:24experts set up eight experts but only
- 00:13:26two are active at any given time it's
- 00:13:28great at math coding and multilingual
- 00:13:30tasks and many of their models are open
- 00:13:32source then you have OpenAI's o1 models
- 00:13:34which are all about complex reasoning
- 00:13:36they're getting top results in these
- 00:13:37really challenging scientific reasoning
- 00:13:38benchmarks deep seek has also been doing
- 00:13:40some really interesting work on
- 00:13:41reasoning using this new reinforcement
- 00:13:43learning technique called group relative
- 00:13:45policy optimization their deep seek R1
- 00:13:48model is comparable to OpenAI's o1 on
- 00:13:51many tasks although it's not fully open
- 00:13:53source even though they released the
- 00:13:54model weights and beyond those there are
- 00:13:56tons of other open models being
- 00:13:57developed all the time like Qwen 1.5 from
- 00:14:00Alibaba Yi from 01.AI and Grok 3 from
- 00:14:03xAI it's a really exciting space but
- 00:14:05it's important to check the licenses on
- 00:14:07those open models before you use them
- 00:14:09yeah keeping up with all these models is
- 00:14:11a full-time job in itself it's
- 00:14:12incredible it is and you know all these
- 00:14:14models all these advancements they're
- 00:14:15all built on that basic Transformer
- 00:14:17architecture we talked about earlier
- 00:14:19right but these foundational models
- 00:14:21they're powerful but they need to be
- 00:14:23tailored for specific tasks and that's
- 00:14:25where fine-tuning comes in exactly so
- 00:14:27training an llm usually involves two
- 00:14:30main steps first you have pre-training
- 00:14:33you feed the model tons and tons of data
- 00:14:36just raw text No Labels this lets it
- 00:14:39learn the basic patterns of language how
- 00:14:41words and sentences work together it's
- 00:14:43like learning the grammar and vocabulary
- 00:14:45of a language pre-training is super
- 00:14:47resource intensive it takes huge amounts
- 00:14:49of compute power it's like giving the
- 00:14:51model a general education in language
- 00:14:53exactly then comes fine-tuning you take
- 00:14:56that pre-trained model which has all
- 00:14:58that General knowledge and you train it
- 00:15:00further on a smaller more targeted data
- 00:15:03set this data set is specific to the
- 00:15:05task you want it to do like translating
- 00:15:08languages writing different kinds of
- 00:15:09creative text formats or answering
- 00:15:11questions so you're specializing the
- 00:15:13model making it an expert in a
- 00:15:15particular area and supervised fine-tuning
- 00:15:18or sft that's one of the main techniques
- 00:15:20use for this right yeah sft is really
- 00:15:22common it involves training the model on
- 00:15:24labeled examples where you have a prompt
- 00:15:27and the desired response so for example
- 00:15:29if you want it to answer questions you
- 00:15:30get lots of examples of questions and
- 00:15:33the correct answers this helps the model
- 00:15:35learn how to perform that specific task
- 00:15:37and also helps to shape its overall
- 00:15:39Behavior so you're not just teaching it
- 00:15:41what to do you're also teaching it how
- 00:15:43to behave exactly you want it to be
- 00:15:45helpful safe and good at following
- 00:15:47instructions and then there's
- 00:15:48reinforcement learning from Human
- 00:15:50feedback or RLHF this is a way to make
- 00:15:53the model's output more aligned with
- 00:15:55what humans actually prefer I was
- 00:15:57wondering about that how do teach these
- 00:15:59models to be you know more humanlike in
- 00:16:02their responses well RLHF is a big part
- 00:16:05of that it's not just about giving the
- 00:16:06model correct answers it's about
- 00:16:08teaching it to generate responses that
- 00:16:09humans find helpful truthful and safe
- 00:16:13they do this by training a separate
- 00:16:14reward model based on human preferences
- 00:16:17so you might have human evaluators rank
- 00:16:19different responses from the llm you
- 00:16:20know telling you which ones they like
- 00:16:22better then this reward model is used to
- 00:16:24fine-tune the llm using reinforcement
- 00:16:27learning algorithms so the llm learns to
- 00:16:30generate responses that get higher
- 00:16:32rewards from the reward model which is
- 00:16:33based on what humans prefer there are
- 00:16:36also some newer techniques like
- 00:16:37reinforcement learning from AI feedback
- 00:16:39rla aif and direct preference
- 00:16:42optimization DPO that are trying to make
- 00:16:44this alignment process even better it's
- 00:16:46fascinating how much human input goes
- 00:16:48into making these models uh more
- 00:16:51humanlike now fully fine-tuning these
- 00:16:53massive models it sounds computationally
- 00:16:55expensive are there ways to you know
- 00:16:58adapt them to new tasks without having to
- 00:16:59retrain the whole thing yeah that's a
- 00:17:01good point fully fine-tuning these huge
- 00:17:03models it can be really expensive so
- 00:17:05people have developed these techniques
- 00:17:06called parameter efficient fine-tuning
- 00:17:08or PEFT the idea is to only train a small
- 00:17:11part of the model leaving most of the
- 00:17:13pre-trained weights Frozen this makes
- 00:17:15fine-tuning much faster and cheaper so
- 00:17:17it's like just making small adjustments
- 00:17:19instead of overhauling the entire system
- 00:17:21yeah what are some examples of these PEFT
- 00:17:23techniques one popular method is
- 00:17:25adapter-based fine tuning you add these
- 00:17:28small modules called adapters into the
- 00:17:30model and you only train the parameters
- 00:17:31within those adapters the original
- 00:17:33weights stay the same another one is low
- 00:17:36rank adaptation or LoRA in LoRA you
- 00:17:39use low rank matrices to approximate the
- 00:17:41changes you would make to the original
- 00:17:42weights during full fine tuning this
- 00:17:45drastically reduces the number of
- 00:17:46parameters you need to train there's
- 00:17:48also QLoRA which is like LoRA but even
- 00:17:51more efficient because it uses quantized
- 00:17:53weights and then there's soft prompting
- 00:17:55where you learn a small vector a soft
- 00:17:57prompt that you add to the input this
- 00:17:59soft prompt helps the model perform the
- 00:18:01desired task without changing the
- 00:18:03original weights
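As a sketch of the LoRA idea just described (the sizes and rank are arbitrary example values), the frozen pre-trained weight matrix is left untouched and only two small low-rank matrices are trained:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 512, 512, 8

# Frozen pre-trained weight matrix (stays untouched during fine-tuning).
W = rng.normal(size=(d_in, d_out))

# LoRA adds a low-rank update W + A @ B; only A and B are trained.
A = rng.normal(size=(d_in, rank)) * 0.01
B = np.zeros((rank, d_out))           # start at zero so the model is unchanged initially
scale = 1.0

x = rng.normal(size=(4, d_in))        # a batch of token representations
y = x @ W + scale * (x @ A @ B)       # forward pass with the LoRA update applied

# The number of trained parameters drops dramatically:
print(W.size)                         # 262,144 frozen parameters
print(A.size + B.size)                # 8,192 trainable parameters
```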
- 00:18:05so it sounds like there are several different
- 00:18:07approaches to fine-tuning and each one has its own
- 00:18:09trade-offs between performance cost and
- 00:18:11efficiency exactly and these PEFT
- 00:18:14techniques are making it possible for
- 00:18:16more people to use and customize these
- 00:18:18powerful llms it's really democratizing
- 00:18:21the technology now once you have a
- 00:18:23fine-tuned model how do you actually use
- 00:18:26it effectively prompt engineering seems
- 00:18:28to be key skill here oh it's absolutely
- 00:18:30essential prompt engineering is all
- 00:18:31about designing the input you give to
- 00:18:33the model The Prompt in a way that gets
- 00:18:36you the output you're looking for it can
- 00:18:37make a huge difference in the quality
- 00:18:39and relevance of the model's response so
- 00:18:42what are some good prompt engineering
- 00:18:43techniques there are a few that are
- 00:18:45really commonly used zero shot prompting
- 00:18:48is where you give the model a direct
- 00:18:49instruction or question without giving
- 00:18:51it any examples you're relying on its
- 00:18:54pre-existing knowledge few-shot prompting
- 00:18:56is similar but you give it a few
- 00:18:58examples to help it understand the
- 00:19:00format and style you're looking for and
- 00:19:03for more complex reasoning tasks Chain
- 00:19:05of Thought prompting is really useful
- 00:19:07you basically show the model How to
- 00:19:09Think Through the problem step by step
- 00:19:11which often leads to better results it's
- 00:19:13like teaching it how to break down a
- 00:19:14complex problem into smaller more
- 00:19:16manageable steps exactly and then
- 00:19:18there's the uh the way the model
- 00:19:20actually generates text the sampling
- 00:19:22techniques these can have a big impact
- 00:19:24on the quality creativity and diversity
- 00:19:27of the output yeah I was curious about
- 00:19:28that what are some of the different
- 00:19:29sampling techniques well the simplest is
- 00:19:32greedy search where the model always
- 00:19:34picks the most likely next token this is
- 00:19:36fast but can lead to repetitive output
- 00:19:39random sampling as the name suggests
- 00:19:41introduces more Randomness which can
- 00:19:42lead to more creative outputs but also a
- 00:19:45higher chance of getting nonsensical
- 00:19:46text temperature is a parameter you can
- 00:19:49adjust to control this Randomness higher
- 00:19:51temperature more Randomness topk
- 00:19:53sampling limits the model's choices to
- 00:19:55the top K most likely tokens which helps
- 00:19:58to control the output top P sampling
- 00:20:01also called nucleus sampling is similar
- 00:20:03but uses a dynamic threshold based on
- 00:20:05the probabilities of the tokens and
- 00:20:07finally best-of-N sampling generates
- 00:20:09multiple responses and then picks the
- 00:20:11best one based on some criteria
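A simplified Python sketch of those decoding strategies operating on one vector of logits; the logits are made up, and real implementations add details like batching, repetition penalties, and combined filters:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Pick the next token from a vector of logits (one score per vocabulary entry)."""
    if rng is None:
        rng = np.random.default_rng()
    if temperature == 0:                           # greedy search: always the most likely token
        return int(np.argmax(logits))
    logits = logits / temperature                  # higher temperature -> flatter, more random
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    if top_k is not None:                          # keep only the k most likely tokens
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)
    if top_p is not None:                          # nucleus sampling: smallest set with mass >= p
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cumulative, top_p) + 1]
        mask = np.zeros_like(probs)
        mask[keep] = 1.0
        probs = probs * mask
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

logits = np.array([2.0, 1.0, 0.5, -1.0, -3.0])
print(sample_next_token(logits, temperature=0))            # greedy
print(sample_next_token(logits, temperature=0.8, top_k=3)) # temperature + top-k
print(sample_next_token(logits, top_p=0.9))                # nucleus (top-p) sampling
```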
- 00:20:13so fine-tuning these sampling parameters is
- 00:20:15key to getting the kind of output you
- 00:20:17want whether it's factual and accurate
- 00:20:19or more creative and imaginative yeah
- 00:20:21it's a powerful tool now I think it's
- 00:20:23time we talk about how we actually know
- 00:20:24if these models are any good how do we
- 00:20:26evaluate their performance that's a
- 00:20:28great question question evaluating these
- 00:20:30llms it's not like traditional machine
- 00:20:32learning tasks where you have a clear
- 00:20:33right or wrong answer how do you measure
- 00:20:37something as
- 00:20:38subjective as you know the quality of
- 00:20:40generated text it's definitely
- 00:20:42challenging especially as we're trying
- 00:20:43to move Beyond uh you know those early
- 00:20:45demos to real world applications those
- 00:20:48traditional metrics like accuracy or F1
- 00:20:50score They Don't Really capture the
- 00:20:52whole picture when you're dealing with
- 00:20:53something as open-ended as text
- 00:20:55generation so what does a good
- 00:20:56evaluation framework look like for llms
- 00:20:59it needs to be multifaceted that's for
- 00:21:01sure first you need data specifically
- 00:21:03designed for the task you're evaluating
- 00:21:05this data should reflect what the model
- 00:21:07will see in the real world and should
- 00:21:08include real user interactions as well
- 00:21:10as synthetic data to cover all kinds of
- 00:21:13situations second you can't just
- 00:21:15evaluate the model in isolation you need
- 00:21:16to consider the whole system it's part
- 00:21:18of like if you're using retrieval
- 00:21:20augmented generation RAG or if the llm is
- 00:21:23controlling an agent and lastly you need
- 00:21:25to Define what good actually means for
- 00:21:28your specific use case it might
- 00:21:30be about accuracy but it might also be
- 00:21:31about things like helpfulness creativity
- 00:21:34factual correctness or adherence to a
- 00:21:36certain style it sounds like you need to
- 00:21:38tailor your evaluation to the specific
- 00:21:40application what are some of the main
- 00:21:42methods used for evaluating llms we
- 00:21:45still use traditional quantitative
- 00:21:46methods you know comparing the model's
- 00:21:48output to some ground truth answers using
- 00:21:50metrics like BLEU or ROUGE but these
- 00:21:53metrics don't always capture the nuances
- 00:21:54of language sometimes a creative or
- 00:21:56unexpected response might be just as
- 00:21:58good or even better than the expected
- 00:22:00one that's why human evaluation is so
- 00:22:03important human reviewers can provide
- 00:22:05more nuanced judgments on things like
- 00:22:07fluency coherence and overall quality
- 00:22:10but of course human evaluation is
- 00:22:11expensive and time-consuming so people
- 00:22:13have started using llm powered autoraters
- 00:22:16so you're using AI to judge other AI
- 00:22:19exactly it sounds strange but it can be
- 00:22:21quite effective you basically give the
- 00:22:23autorater model the task the evaluation
- 00:22:26criteria and the responses generated by
- 00:22:28the model you're testing the autorater then
- 00:22:30gives you a score often with a reason
- 00:22:33for its judgment there are different
- 00:22:34types of autoraters too generative models
- 00:22:37reward models and discriminative models
- 00:22:40but one important thing is that you need
- 00:22:41to calibrate these autoraters meaning you
- 00:22:43need to compare their judgments to human
- 00:22:46judgments to make sure they're actually
- 00:22:47measuring what you want them to measure
- 00:22:49you also need to be aware of the
- 00:22:50limitations of the autorater model
- 00:22:52itself and there are even more advanced
- 00:22:55approaches being developed like breaking
- 00:22:57down tasks into subtasks and using
- 00:22:59rubrics with multiple criteria to make
- 00:23:01the evaluation more interpretable this
- 00:23:03is especially useful for evaluating
- 00:23:05multimodal generation where you might
- 00:23:07need to assess the quality of the text
- 00:23:09images or videos separately it sounds like
- 00:23:11evaluation is a complex area but really
- 00:23:14important for making sure these models
- 00:23:15are reliable and actually useful in the
- 00:23:17real world now all these models they can
- 00:23:20be incredibly large and getting
- 00:23:23responses from them can take time what
- 00:23:25are some ways to speed up the inference
- 00:23:27process you know make them respond
- 00:23:29faster yeah as these models get bigger
- 00:23:32they also get slower and more expensive
- 00:23:34to run so optimizing inference the
- 00:23:36process of generating responses is
- 00:23:38really important especially for
- 00:23:40applications where speed is critical so
- 00:23:42what are some of the techniques used to
- 00:23:44accelerate inference well there are
- 00:23:46different approaches but a lot of it
- 00:23:48comes down to trade-offs you often have
- 00:23:50to balance the quality of the output
- 00:23:52with the speed and cost of generating it
- 00:23:54so sometimes you might sacrifice little
- 00:23:56accuracy to gain a lot of speed exactly
- 00:23:59and you also need to consider the
- 00:24:00tradeoff between the latency of a single
- 00:24:02request you know how long it takes to
- 00:24:04get one response and the overall
- 00:24:06throughput of the system how many
- 00:24:08requests it can handle per second the best
- 00:24:11approach depends on the application now
- 00:24:13we can broadly categorize these
- 00:24:15techniques into two groups there are the
- 00:24:17output approximating methods which might
- 00:24:19involve changing the output slightly to
- 00:24:21gain efficiency and then there are the
- 00:24:23output preserving methods which keep the
- 00:24:25output exactly the same but try to
- 00:24:26optimize the computation let's start
- 00:24:29with the output approximating methods I
- 00:24:30know quantization is a popular technique
- 00:24:33yeah quantization is all about reducing
- 00:24:34the numerical Precision of the models
- 00:24:36weights and activations so instead of
- 00:24:38using 32-bit floating Point numbers you
- 00:24:41might use 8 bit or even four bit
- 00:24:42integers this saves a lot of memory and
- 00:24:45makes the calculations faster often with
- 00:24:47only a very small drop in accuracy there
- 00:24:49are also techniques like quantization
- 00:24:51aware training qat which can help to
- 00:24:53minimize those accuracy losses and you
- 00:24:56can even fine-tune the quantization
- 00:24:57strategy itself
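A minimal sketch of symmetric 8-bit weight quantization, assuming a random weight matrix; production schemes use per-channel scales, calibration data, and sometimes quantization-aware training:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(256, 256)).astype(np.float32)   # full-precision (32-bit) weights

# Simple symmetric 8-bit quantization: map floats to integers in [-127, 127].
scale = np.abs(weights).max() / 127.0
quantized = np.round(weights / scale).astype(np.int8)      # 4x smaller than float32 in memory
dequantized = quantized.astype(np.float32) * scale         # approximate reconstruction

error = np.abs(weights - dequantized).max()
print(f"max reconstruction error: {error:.4f}")            # small relative to the weight range
```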
- 00:24:58what about distillation isn't that where
- 00:25:00you train a smaller model to mimic a
- 00:25:02larger one yes distillation is another
- 00:25:05way to improve efficiency you have a
- 00:25:06large accurate teacher model and you
- 00:25:09train a smaller student model to copy
- 00:25:11Its Behavior the student model is often
- 00:25:13much faster and more efficient and it
- 00:25:15can still achieve good accuracy there
- 00:25:17are a few different distillation
- 00:25:19techniques like data distillation
- 00:25:21knowledge distillation and on policy
- 00:25:24distillation okay those are the methods
- 00:25:25that might change the output a little
- 00:25:27bit what about the the output preserving
- 00:25:29methods I've heard of flash attention
- 00:25:31flash attention is really cool it's
- 00:25:34specifically designed to optimize the
- 00:25:36self attention calculations within the
- 00:25:38Transformer it basically minimizes the
- 00:25:40amount of data movement needed during
- 00:25:42those calculations which can be a big
- 00:25:44bottleneck the great thing about Flash
- 00:25:46attention is that it doesn't change the
- 00:25:48results of the attention computation
- 00:25:50just the way it's done so the output is
- 00:25:52exactly the same and prefix caching that
- 00:25:55seems like a good trick for
- 00:25:56conversational applications yeah prefix
- 00:25:59caching is all about saving time when
- 00:26:01you have repeating parts of the input
- 00:26:03like in a conversation where each turn
- 00:26:05Builds on the previous ones you cache
- 00:26:07the results of the attention
- 00:26:08calculations for the initial part of the
- 00:26:10input so you don't have to redo them for
- 00:26:11every turn Google AI studio and vertex
- 00:26:15AI they both have features that use this
- 00:26:17idea so it's like remembering what
- 00:26:19you've already calculated so you don't
- 00:26:20have to do it again what about
- 00:26:22speculative decoding speculative
- 00:26:24decoding is pretty clever you use a
- 00:26:26smaller faster drafter model to predict
- 00:26:29a bunch of future tokens and then the
- 00:26:31main model checks those predictions in
- 00:26:33parallel if the drafter is right you can
- 00:26:36accept those tokens and skip the
- 00:26:37calculations for them which speeds up
- 00:26:39the decoding process the key is to have
- 00:26:42a drafter model that's well aligned with
- 00:26:44the main model so its predictions are
- 00:26:45usually correct
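A toy sketch of that idea using a simplified greedy verification rule; real speculative decoding uses a probabilistic accept/reject scheme and runs the target model's checks in one parallel pass, and the draft and target "models" here are just stand-in functions:

```python
def speculative_decode(prompt, draft_next, target_next, num_draft=4, max_new=16):
    """Simplified speculative decoding: a small draft model proposes a run of tokens,
    the large target model verifies them, and matching tokens are accepted in a batch."""
    tokens = list(prompt)
    while len(tokens) < len(prompt) + max_new:
        # 1. Draft model cheaply proposes several candidate tokens.
        draft = []
        for _ in range(num_draft):
            draft.append(draft_next(tokens + draft))
        # 2. Target model checks each position (in a real system these checks run in parallel).
        accepted = 0
        for i in range(num_draft):
            if target_next(tokens + draft[:i]) == draft[i]:
                accepted += 1
            else:
                break
        tokens += draft[:accepted]
        # 3. Always add one token from the target model so progress is guaranteed.
        tokens.append(target_next(tokens))
    return tokens

# Toy stand-ins for the models: each "predicts" the next token from the last one,
# so the drafter is perfectly aligned with the target and drafts are always accepted.
draft_next = lambda toks: (toks[-1] + 1) % 10
target_next = lambda toks: (toks[-1] + 1) % 10
print(speculative_decode([1, 2, 3], draft_next, target_next))
```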
- 00:26:47and then there are the more general optimization techniques
- 00:26:48like batching and parallelization right
- 00:26:51batching is where you process multiple
- 00:26:53requests at the same time which can be
- 00:26:55more efficient than doing them one by
- 00:26:56one parallelization is about splitting
- 00:26:59up the computation across multiple
- 00:27:01processors or devices there are
- 00:27:03different types of parallelization each
- 00:27:05with its own tradeoffs so there's a
- 00:27:07whole toolbox of techniques for making
- 00:27:08these models run faster and more
- 00:27:10efficiently now before we wrap up I'd
- 00:27:12love to hear some examples of how all
- 00:27:14this is being used in practice oh the
- 00:27:16applications are just exploding it's
- 00:27:17hard to even keep track in code and math
- 00:27:20llms are being used for code generation
- 00:27:22completion refactoring debugging
- 00:27:25translating code between languages
- 00:27:27writing documentation and even helping
- 00:27:29to understand large code bases we have
- 00:27:31models like Alpha code 2 that are doing
- 00:27:33incredibly well in programming
- 00:27:35competitions and projects like FunSearch
- 00:27:36and AlphaGeometry are actually
- 00:27:38helping mathematicians make new
- 00:27:40discoveries in machine translation llms
- 00:27:42are leading to more fluent accurate and
- 00:27:45natural sounding translations text
- 00:27:47summarization is getting much better
- 00:27:49able to condense large amounts of text
- 00:27:51down to the key points question
- 00:27:53answering systems are becoming more
- 00:27:54knowledgeable and precise thanks in part
- 00:27:56to techniques like RAG chatbots are
- 00:27:59becoming more humanlike in their
- 00:28:00conversations able to engage in more
- 00:28:02Dynamic and interesting dialogue content
- 00:28:05creation is also being transformed with
- 00:28:07llms being used for writing ads scripts
- 00:28:09and all sorts of creative text formats
- 00:28:11and we're seeing advancements in natural
- 00:28:13language inference which is used for
- 00:28:14things like sentiment analysis analyzing
- 00:28:17legal documents and even assisting with
- 00:28:19medical diagnoses text classification is
- 00:28:21getting more accurate which is useful
- 00:28:22for spam detection news categorization
- 00:28:24and understanding customer feedback and
- 00:28:27LMS are even being used to evaluate
- 00:28:29other llms acting as those autoraters we
- 00:28:31talked about in text analysis llms are
- 00:28:33helping to extract insights and identify
- 00:28:35Trends from huge data sets it's really
- 00:28:37an incredible range of applications and
- 00:28:40we're only scratching the surface right
- 00:28:41especially with the multimodal
- 00:28:43capabilities coming online exactly
- 00:28:46multimodal llms they're enabling
- 00:28:48entirely new categories of applications
- 00:28:50you know where you combine text images
- 00:28:52audio and video we're seeing them being
- 00:28:54used in Creative content creation
- 00:28:56education assistive technologies business
- 00:28:59scientific research you name it it's
- 00:29:01truly a transformative technology well I
- 00:29:03have to say this has been a fascinating
- 00:29:05Deep dive we started with the basic
- 00:29:08building blocks of the Transformer
- 00:29:10architecture explored the evolution of
- 00:29:12all these different llm models got into
- 00:29:15the nitty-gritty of fine-tuning and
- 00:29:17evaluation and even learned about the
- 00:29:19techniques used to make them faster and
- 00:29:21more efficient it's incredible to see
- 00:29:23how far this field has come in such a
- 00:29:25short time yeah the progress has been
- 00:29:27remarkable and it seems like things are
- 00:29:28only accelerating who knows what amazing
- 00:29:31things we'll see in the next few years
- 00:29:33that's a good question and it's one I
- 00:29:34think our listeners wonder about as well
- 00:29:36given the rapid pace of innovation what
- 00:29:39new applications do you think will be
- 00:29:41possible with the next generation of
- 00:29:42llms what challenges do you think we
- 00:29:44need to overcome to make those
- 00:29:46applications a reality let us know your
- 00:29:48thoughts and thanks for joining us for
- 00:29:50another deep dive thanks everyone it's
- 00:29:52been a pleasure
- LLMs
- Transformer
- Architecture
- Fine-tuning
- Self-attention
- Prompt engineering
- Multimodal
- Evaluation
- Efficiency
- Applications