Byte Latent Transformer: Patches Scale Better Than Tokens (Paper Explained)
Summary
TLDR: The video examines a research paper on the Byte Latent Transformer, which proposes an architecture that does not rely on fixed-vocabulary tokenization. Instead, it segments text into dynamically sized "patches" of bytes, which improves scaling behavior and removes the limits of a fixed vocabulary. Using a two-tier architecture, with a lightweight local encoder and decoder around a larger latent transformer, the model delivers competitive performance while also handling tasks that require attention to individual characters. The experiments reported in the paper show patch-based models outperforming traditional tokenized approaches, particularly for under-represented languages.
Takeaways
- 📄 Introduces the Byte Latent Transformer, which uses patches instead of tokens.
- 🔍 Better scaling and no fixed-vocabulary limitations.
- 🆕 Patches enable dynamic tokenization.
- 🚀 Performs better on language tasks that require fine-grained, character-level detail.
- 🌐 Benefits for languages that are under-represented in text processing.
Timeline
- 00:00:00 - 00:05:00
Introduction to the Byte Latent Transformer architecture, which is claimed to outperform classic tokenization-based models. The approach avoids fixed tokenization and uses dynamic "patches", showing better scaling properties than token-based models.
- 00:05:00 - 00:10:00
The comparison between the patch-based model and models using classic tokenization such as byte pair encoding shows better scaling behavior. The plots illustrate that, for the same number of training FLOPs, the patch-based model achieves better results beyond a certain threshold.
- 00:10:00 - 00:15:00
Presentation of the Byte Latent Transformer architecture, which has two tiers: an inner tier that is a regular LLM-style transformer operating on patch embeddings, and an outer tier (local encoder and local decoder) that maps between bytes and those embeddings.
- 00:15:00 - 00:20:00
Explanation of how text is turned into tokens, covering the limits of classic tokenization and alternatives such as byte pair encoding. The problem of an embedding table that becomes too large as the vocabulary grows is also discussed.
- 00:20:00 - 00:25:00
Analysis of tokenization methods such as byte pair encoding and WordPiece encoding and their respective problems, including the out-of-vocabulary problem. The paper addresses the need to balance vocabulary size against model performance.
- 00:25:00 - 00:30:00
A dynamic approach to tokenization avoids the limitations of a fixed vocabulary and lays the groundwork for patch embeddings, which are representations of groups of characters.
- 00:30:00 - 00:36:14
The entropy-based grouping method used to determine patch boundaries: a split is made wherever the entropy of the next-byte prediction exceeds a threshold, and the same signal determines when decoding of a patch stops at each step.
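The entropy-based grouping described in the last timeline entry can be sketched roughly as follows. This is a toy illustration, not the paper's code: `split_into_patches`, `toy_model`, and the threshold value are hypothetical names and numbers chosen for the example, and the real method uses a small trained byte-level language model in place of `toy_model`.

```python
import math


def entropy_bits(probs):
    """Shannon entropy (in bits) of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def split_into_patches(data: bytes, next_byte_probs, threshold: float):
    """Start a new patch wherever the predicted next-byte entropy exceeds threshold.

    next_byte_probs(prefix) is an assumed callable that returns the model's
    distribution over the next byte given the bytes seen so far.
    """
    patches, current = [], bytearray()
    for i, b in enumerate(data):
        # High entropy means the model is unsure what comes next: split here.
        if current and entropy_bits(next_byte_probs(data[:i])) > threshold:
            patches.append(bytes(current))
            current = bytearray()
        current.append(b)
    if current:
        patches.append(bytes(current))
    return patches


def toy_model(prefix: bytes):
    """Stand-in for the small byte-level LM (purely illustrative)."""
    if not prefix or prefix[-1:] == b" ":
        return [1 / 26] * 26  # after a space, any letter could start the next word
    return [0.9, 0.1]  # inside a word, the continuation is fairly predictable


patches = split_into_patches(b"data is primarily", toy_model, threshold=2.0)
print(patches)  # [b'data ', b'is ', b'primarily']
```

With this toy model the boundaries fall right after each space, because that is the only place where the prediction is uncertain. A trained model would place them wherever its own uncertainty spikes.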
Video Q&A
What is the Byte Latent Transformer?
It is a new language model architecture that uses patches instead of fixed tokens.
How does the model improve scaling?
It scales better because it forms dynamic groups of characters instead of depending on a fixed vocabulary.
What problem does the model solve?
It addresses the out-of-vocabulary problem and offers dynamic tokenization.
What is a patch in this context?
A patch is a dynamic group of characters (bytes) used for encoding, replacing classic tokens.
What are the advantages of patches over fixed tokens?
Patches offer a better representation of under-represented languages and allow finer-grained handling of sequences.
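To make the answers above concrete, here is a minimal sketch of why a byte-level table has no out-of-vocabulary problem and how a patch embedding can be built from byte embeddings. The names `BYTE_TABLE`, `EMBED_DIM`, and `patch_embedding` are hypothetical, the table is filled with random numbers, and mean pooling is used only as a simple stand-in: according to the video, the actual model uses a learned local encoder with cross-attention.

```python
import random

EMBED_DIM = 4  # tiny dimension, for illustration only
random.seed(0)

# One embedding vector per possible byte value: 256 rows, regardless of language.
BYTE_TABLE = [[random.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(256)]


def patch_embedding(patch: bytes):
    """Pool the byte embeddings of a patch into one vector (mean pooling stand-in)."""
    if not patch:
        raise ValueError("a patch must contain at least one byte")
    vectors = [BYTE_TABLE[b] for b in patch]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Any text, including non-ASCII characters, becomes bytes in the range 0..255,
# so every byte has a row in the table and nothing is out of vocabulary.
data = "naïve café".encode("utf-8")
embedding = patch_embedding(data[:6])
print(len(data), len(embedding))
```

The table size stays fixed at 256 rows no matter how long or unusual the patches are, which is the contrast with a token embedding table that grows with the vocabulary.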
- 00:00:00hello there today we're looking at the
- 00:00:02paper Byte Latent Transformer: Patches
- 00:00:04Scale Better Than Tokens this paper in a
- 00:00:08sense does away with classic fixed
- 00:00:11vocabulary based
- 00:00:13tokenization and in doing so develops a
- 00:00:16new architecture called the Byte Latent
- 00:00:18Transformer and in their experiments
- 00:00:21they show that this as the paper says
- 00:00:24scales better than a sort of classic a
- 00:00:28model that operates on classic basically
- 00:00:30tokenized tokens so the thing that
- 00:00:34they're doing is they do away with
- 00:00:36tokenization they find a different way
- 00:00:39of splitting text into pieces a dynamic
- 00:00:43way and they call these pieces patches
- 00:00:45so patches are like tokens um except
- 00:00:50they need to distinguish them verbally
- 00:00:53so it's clear which one you're talking
- 00:00:55about and then once you let a model run
- 00:00:58on that you do get better scaling
- 00:01:00properties and that's kind of the
- 00:01:02central claim of this paper that if you
- 00:01:04look at models that are
- 00:01:08um here they compare to byte pair
- 00:01:11encoding so this is kind of classic
- 00:01:13tokenization used in the Llama models if
- 00:01:16you compare that with a model that
- 00:01:18operates on patches um then you do get
- 00:01:22better scaling Behavior as you can see
- 00:01:24here so the red orange lines are the
- 00:01:27patch based ones the blue line is the
- 00:01:30kind of classically tokenized ones now
- 00:01:34there are a lot of like choices here
- 00:01:37that make this graphic so the y-axis here
- 00:01:40is what they call bits per byte which is
- 00:01:42kind of so since you if you don't deal
- 00:01:45with tokens and if you deal with
- 00:01:47especially different tokenization you
- 00:01:49can't really use perplexity as a measure
- 00:01:52because that kind of necessitates that
- 00:01:54you operate over the same kind of
- 00:01:56fundamental pieces so bits per byte is
- 00:02:00sort of the analogous measure to entropy
- 00:02:04uh sorry to uh
- 00:02:06perplexity so think of this as kind of
- 00:02:09perplexity and then here are the x-axis
- 00:02:11and that's important that's total
- 00:02:12training flops so they always consider
- 00:02:14flop matched models uh because they are
- 00:02:19like their model operates differently it
- 00:02:21has like an outer layer and an inner
- 00:02:22layer and you don't need to always
- 00:02:24execute the inner layer for each outer
- 00:02:28step so the outer step ends up running
- 00:02:31more often but the inner step ends up
- 00:02:33running less often and that's why you
- 00:02:36can uh achieve you can achieve uh better
- 00:02:41or you can achieve bigger models let's
- 00:02:44say with um with the patch based models
- 00:02:48because your patches are bigger and you
- 00:02:51need to run them less often so if you
- 00:02:54invest the same amount of training flops
- 00:02:56then you can kind of become better so
- 00:03:01there is a lot of ifs in this kind of oh
- 00:03:03it scales better what they keep constant
- 00:03:06is the training flops so for the same
- 00:03:09amount of training flops after a certain
- 00:03:11threshold you'll do better with the
- 00:03:13patch based models just because they
- 00:03:15have better scaling property um why
- 00:03:18exactly that is that could be it's
- 00:03:22probably a mixture of the part of their
- 00:03:23architecture so let's dive into that
- 00:03:25architecture here is the Byte Latent
- 00:03:27Transformer as I said it's kind of a two
- 00:03:29tier system so the inner system right
- 00:03:32here that is um your very let's say
- 00:03:36regular llm type Transformer there's
- 00:03:41absolutely nothing special about it
- 00:03:43except that it operates on kind of
- 00:03:47these pieces right here now usually
- 00:03:49these would be token and token
- 00:03:51embeddings so these here would be token
- 00:03:54embeddings and it predicts the next
- 00:03:58token um if you view a Transformer so
- 00:04:04you have I don't know token and the
- 00:04:06history going in and then you have a
- 00:04:10Transformer and it will output uh a
- 00:04:14distribution over next tokens right like
- 00:04:17some sort of a softmax probability
- 00:04:20distribution over next tokens if you
- 00:04:23however take one step back and you
- 00:04:25consider what happens in the last layer
- 00:04:27in the last layer you do have a an so
- 00:04:31here is the model out comes an
- 00:04:34embedding so this is like the hidden
- 00:04:37signal in the second last layer the last
- 00:04:41layer is a matrix that is of
- 00:04:46Dimension um here this Dimension is the
- 00:04:51size of H and then this Dimension is the
- 00:04:54size of the um token or of the
- 00:04:57vocabulary let's say
- 00:05:04vocab right so this is what actually
- 00:05:06does the classification multiplying
- 00:05:08these two things together and then
- 00:05:10applying a softmax will give you this
- 00:05:13distribution so in a sense um you could
- 00:05:17argue that uh
- 00:05:20the that this here you know that what
- 00:05:23the Transformer actually does is even a
- 00:05:26regular Transformer it actually predicts
- 00:05:28the embedding of the next
- 00:05:30token and there is one more thing if you
- 00:05:34do what's called weight tying or
- 00:05:36embedding tying so here is the tokens
- 00:05:39coming in here is your embedding table
- 00:05:41so for each token you have an embedding
- 00:05:43in here some models actually tie those
- 00:05:46together meaning they use the same
- 00:05:47parameters saves them a lot of
- 00:05:49parameters and it kind of has the same
- 00:05:52idea um this maps from token IDs to
- 00:05:56embedding space and this uh sorry this
- 00:05:59this Matrix here kind of maps from
- 00:06:00embedding space to back to token IDs
- 00:06:03right with this probability distribution
- 00:06:05so in that sense it's even more true
- 00:06:07that the latent Transformer just kind of
- 00:06:10predicts the embedding or sorry any
- 00:06:14language model uh Transformer that
- 00:06:18you know at the end outputs
- 00:06:22logits actually just predicts effectively
- 00:06:25the embedding of the next
- 00:06:27token so that's you know being said in
- 00:06:30in the inner side here there is just a
- 00:06:33very regular Transformer autoregressive
- 00:06:37llm that takes in
- 00:06:41things and predicts the next thing from
- 00:06:45them all in embedding space now usually
- 00:06:50again usually these are tokens so maybe
- 00:06:52it's worth briefly how uh we get to
- 00:06:55those tokens so if we have a piece of
- 00:06:58text for example
- 00:07:00uh this piece of text that is here data
- 00:07:02is primarily determined by the number
- 00:07:05okay what we want to do is we want to
- 00:07:08split those things up into individual
- 00:07:11pieces that our models can operate over
- 00:07:14now one method of doing that would be to
- 00:07:17just split every single character
- 00:07:19including the white spaces right becomes
- 00:07:22one piece but that's not really uh the
- 00:07:25best because it will result in very long
- 00:07:28sequences for for a given text and you
- 00:07:31know that Transformers scale by sequence
- 00:07:33length quadratically which isn't
- 00:07:36necessarily doesn't make us happy so our
- 00:07:39context window of 128,000 tokens will
- 00:07:43just be 128,000 characters in the end so
- 00:07:47can we do better well yes we could split
- 00:07:50uh by for example whitespace so data
- 00:07:54sorry data becomes one token is becomes
- 00:07:58one token primarily becomes one token
- 00:08:00and so on this was very standard for a
- 00:08:04very long time and what what you have to
- 00:08:08do in all of these things if you operate
- 00:08:11with tokens is you're going to have a
- 00:08:12table that is um mapping your tokens to
- 00:08:17an embedding as we said before and then
- 00:08:20every token needs to have a
- 00:08:22corresponding embedding Vector that you
- 00:08:24can look up so the word data has to have
- 00:08:28an embedding vector in here somehow uh
- 00:08:32the word is has to have an embedding
- 00:08:34Vector in here somehow right um and you
- 00:08:39can already see the problem there is
- 00:08:41this table is going to become
- 00:08:43really really big and the even bigger
- 00:08:45problem would be that let's say you
- 00:08:49derive you have to derive this table
- 00:08:51somehow so you take a training Corpus
- 00:08:54and you look at all the words in there
- 00:08:55and that's how you initialize the table
- 00:08:58but it's very likely like English is
- 00:09:00such a big language that in your test
- 00:09:02data set there's going to be a word that
- 00:09:06you've never seen in the training data
- 00:09:08set like a name or maybe a number or
- 00:09:12just a a word that you've never seen uh
- 00:09:16for example you might actually never
- 00:09:18have seen the word determined before
- 00:09:21there is some people have tried to
- 00:09:24mitigate some so they what they do is
- 00:09:26like stemming or something like this so
- 00:09:29instead of determined you just say
- 00:09:31determine um so you and you say oh those
- 00:09:35two are the same the same word
- 00:09:37essentially so you only have one entry
- 00:09:39in the embedding table instead of having
- 00:09:41one for determine determined determining
- 00:09:44uh determinization and whatnot so this
- 00:09:48is just one token but still the problem
- 00:09:50of like out of vocabulary people used to
- 00:09:54call that was really really big and
- 00:09:58problematic and people came up with
- 00:10:00alternatives to that and those
- 00:10:02alternatives are what currently are very
- 00:10:06very popular um so what are those
- 00:10:08Alternatives those Alternatives if you
- 00:10:11look at things like byte pair encoding
- 00:10:13or WordPiece encoding or things like
- 00:10:15that uh they all are of the same
- 00:10:18principle they say there
- 00:10:21exists um a set of like unitary things
- 00:10:27and those unitary things are
- 00:10:30they can they can be used to make up all
- 00:10:32of the text that we see so in WordPiece
- 00:10:36um those unitary things would be just
- 00:10:40the all the characters that exist so a b
- 00:10:43c d da da d d da da until like Z then
- 00:10:47capital A then like zero then the
- 00:10:51question mark and so on like it's still
- 00:10:54a lot but it's not infinite right with
- 00:10:58with a decent amount of single symbols
- 00:11:01you can represent any
- 00:11:04um any sequence of characters and you
- 00:11:08might want to say well aren't we now
- 00:11:10back to the same problem where character
- 00:11:13level isn't really good and that's yes
- 00:11:15okay so let's say we have we just do
- 00:11:18ASCII lowercase okay let's say we have a
- 00:11:21to z that's good so we can represent
- 00:11:24everything however we know that the
- 00:11:27combination ER is very very frequent in
- 00:11:31the language so let's just assign a
- 00:11:34different slot to ER yeah we still have
- 00:11:37e in here somewhere we still have R in
- 00:11:39here somewhere but if we encounter ER we
- 00:11:42choose to represent it with its own
- 00:11:44token and its own embedding and then you
- 00:11:48go on and eventually you'll say oh maybe
- 00:11:51the maybe you know I don't know am is
- 00:11:54very common and Ur is very common and
- 00:11:58something like this and then you start
- 00:12:00making bigger combinations you say okay
- 00:12:02the d a d like Dad that's a very common
- 00:12:05thing in the language and so on so you
- 00:12:07build these things there are heuristic
- 00:12:09ways of deriving them uh it's
- 00:12:11essentially a compression algorithm if
- 00:12:13you will uh and you assign individual
- 00:12:16tokens um to those and you it's not just
- 00:12:20whole words right you can see like these
- 00:12:22things they're more like word pieces
- 00:12:25that you start building up the same with
- 00:12:28byte pair encoding uh where you just
- 00:12:30operate in the realm of bytes so uh you
- 00:12:34know you can encode any text into a
- 00:12:35series of of bytes uh by different
- 00:12:38encoding standards that exist like UTF-8
- 00:12:41is a very common one and then you
- 00:12:43literally you know what all the symbols
- 00:12:45are they are
- 00:12:47zero to 255 those are all your single bytes that
- 00:12:49can exist and then you start
- 00:12:51combining you know the bytes that appear
- 00:12:54often in your text it's kind of more
- 00:12:57clean than working with characters and
- 00:12:59symbols but those are your choices
- 00:13:02so that would be like byte pair
- 00:13:04encoding and this would more be like WordPiece
- 00:13:07or something like that um yeah
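The merge idea just described (start from the 256 single bytes, then repeatedly give the most frequent adjacent pair, such as ER, its own slot) can be sketched as below. This is a simplified illustration under assumptions: `learn_bpe`, `most_frequent_pair`, and `merge_pair` are hypothetical helper names, and production tokenizers add pre-tokenization rules, corpus-level counting, and tie-breaking that are omitted here.

```python
from collections import Counter


def most_frequent_pair(seq):
    """Return the most common adjacent pair of symbol IDs, or None if there is none."""
    pairs = Counter(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(seq, pair, new_id):
    """Replace every non-overlapping occurrence of pair with new_id."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe(data: bytes, num_merges: int):
    """Learn up to num_merges merges, starting from the 256 single-byte symbols."""
    seq = list(data)   # IDs 0..255 are the raw bytes
    merges = {}        # (left_id, right_id) -> new token ID
    next_id = 256      # new tokens get IDs above the byte range
    for _ in range(num_merges):
        pair = most_frequent_pair(seq)
        if pair is None:
            break
        merges[pair] = next_id
        seq = merge_pair(seq, pair, next_id)
        next_id += 1
    return seq, merges


seq, merges = learn_bpe(b"her better letter", num_merges=3)
print(len(seq), merges)
```

Every learned merge adds one row to the token embedding table, which is why the vocabulary, and with it the table, has to be fixed up front.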
- 00:13:11so like this seems good but it has its
- 00:13:14own set of problems so first of all what
- 00:13:19are its set of problems first of
- 00:13:22all uh you know a couple of problems
- 00:13:24that stem from tokenization so for
- 00:13:27example if you have like numbers or
- 00:13:29something like if you have the
- 00:13:32number 2568 uh then that might actually get
- 00:13:35tokenized as the tokens 256 and 8 because 256
- 00:13:39is a very common number and then
- 00:13:43you know just add eight so the tokenizer
- 00:13:46is going for the minimum amount of
- 00:13:48tokens uh so that's a problem if you
- 00:13:50want to teach the neural network to
- 00:13:52multiply something because it will not
- 00:13:54see 2 5 6 8 it will see some token with
- 00:14:00the ID 89 and then some token with the
- 00:14:03ID 71 right it has no clue that you know
- 00:14:08these are made up of numbers or
- 00:14:10something like this and there are a
- 00:14:12bunch of other problems with with
- 00:14:14tokenization what this paper also shows
- 00:14:16is that tokenization does result in
- 00:14:19fairly small chunks of text where you
- 00:14:22could go for bigger chunks of text but
- 00:14:25the problem is if you keep it all in a
- 00:14:27table if if you want bigger chunks of
- 00:14:29text there are obviously more combinations
- 00:14:32possible so you'll have to kind of your
- 00:14:35storage kind of explodes for
- 00:14:38this so that's why they say do we even
- 00:14:41need this table here do we even need
- 00:14:43that maybe we don't actually need it
- 00:14:45maybe we can get away with having a
- 00:14:48table just for the individual pieces
- 00:14:52like just for the individual unitary
- 00:14:54things and we can come up with a scheme
- 00:14:58of how we recombine those
- 00:15:01things for those down here in kind of
- 00:15:04like a a learned way like can we teach a
- 00:15:07neural network to take the embeddings of
- 00:15:11the individual
- 00:15:13constituents and come up with the
- 00:15:15embedding for higher order combinations
- 00:15:18because that would allow us to not even
- 00:15:21have a fixed set of higher order
- 00:15:23combinations but like kind of an
- 00:15:24arbitrary combination of higher order
- 00:15:26com um combinations and the neural
- 00:15:29network will just be able to produce an
- 00:15:31embedding for these on the Fly and then
- 00:15:34those could be the individual pieces we
- 00:15:37feed into the bigger llm right so it's
- 00:15:40not a, we're not doing a character
- 00:15:42level or a byte level
- 00:15:45llm um what we're doing is a two-stage
- 00:15:48process where we have a first stage that
- 00:15:51out of the byte embeddings produces what
- 00:15:55they call a patch embedding and a patch
- 00:15:57embedding is like a um six to eight characters
- 00:16:01long thing and that then gets fed
- 00:16:06into the llm now you'll realize what I
- 00:16:09said at the beginning this idea could
- 00:16:11actually totally be done using the
- 00:16:14tokenization we have right like you
- 00:16:16could just tokenize how we tokenize
- 00:16:19right now but just not have this big uh
- 00:16:22sorry not have this big embedding table
- 00:16:24but just do this sort of two-stage
- 00:16:28process where the first stage just
- 00:16:30builds your token embedding from the
- 00:16:33character embeddings that make up the
- 00:16:34token and then the second stage will
- 00:16:37actually go and or the second stage is
- 00:16:40your normal llm that operates on token
- 00:16:43embeddings however you know because they
- 00:16:46have this method they also say well we
- 00:16:50don't need a fixed vocabulary
- 00:16:52tokenization anymore right this here is
- 00:16:55a fixed vocabulary you derive it once
- 00:16:59your vocab because you need that table
- 00:17:02and then you tokenize all the text into
- 00:17:05this fixed vocabulary you don't have out
- 00:17:08of vocabulary anymore because you can
- 00:17:10you have the individual characters here
- 00:17:11so you can tokenize anything uh but
- 00:17:15still it's fixed so they say hey we have
- 00:17:18this process now now we can do Dynamic
- 00:17:21tokenization and that's what they call
- 00:17:23patching so again from the
- 00:17:25inside to the outside on the inside we
- 00:17:29have an llm that operates on they call
- 00:17:32patch embeddings which are essentially
- 00:17:33just token embeddings except the tokens
- 00:17:36aren't fixed they are Dynamic
- 00:17:40groupings uh patches of characters or of
- 00:17:44bytes in our case same
- 00:17:46same sorry uh all non-ASCII
- 00:17:51people and so you can you can see that
- 00:17:55once we know once we know what the where
- 00:18:00the patch boundaries are and in this
- 00:18:02case here here here are the patch
- 00:18:05boundaries right so this is a token this
- 00:18:07is a token this is a token and this is a
- 00:18:09token this this text down here gets
- 00:18:11divided into four tokens once we know
- 00:18:14what they are we can use this local
- 00:18:17encoder thing to look
- 00:18:20at the characters in the patch and give
- 00:18:25us a single patch embedding that we then
- 00:18:27feed to the Transformer so the local
- 00:18:31encoder is a model that's trained to do
- 00:18:34exactly that um as far as I can tell
- 00:18:36it's trained end to end together with
- 00:18:38the latent Transformer and then the
- 00:18:41local decoder takes a patch embedding
- 00:18:45patch embedding and decodes it into the
- 00:18:48constituent characters so you can see
- 00:18:52that the local encoder and the local
- 00:18:54decoder they run more often than the
- 00:18:57latent Transformer and now you have a
- 00:18:59degree of freedom the bigger
- 00:19:02you make these patches the wider
- 00:19:05they become the more characters on
- 00:19:07average to a patch the more often you
- 00:19:10run the local
- 00:19:12encoder in comparison to running the
- 00:19:15chunky latent
- 00:19:17Transformer so you can make this in here
- 00:19:21bigger if you make these
- 00:19:24smaller then you still you still gain a
- 00:19:28lot lot like you can gain a lot of flops
- 00:19:32um because you have to run the inner
- 00:19:34part less because you make the patches
- 00:19:38larger and as long as the outer parts
- 00:19:40are kind of lightweight uh they don't
- 00:19:42matter and you can get away with having
- 00:19:45a bigger model because you spend less
- 00:19:47flops because you run it less
- 00:19:49often right some astute observers might
- 00:19:54have realized that hey you know this
- 00:19:58local this local decoder when does it
- 00:20:01know when to stop
- 00:20:04um it you know it's just it gives it
- 00:20:06gets one thing and it's just supposed to
- 00:20:09produce uh tokens like characters from
- 00:20:12it we'll get to that in just a bit and
- 00:20:15the the second part is obviously how do
- 00:20:19we know where the patch boundaries are
- 00:20:20how do you know how to group the
- 00:20:22characters into tokens and the answer to
- 00:20:24these two things is kind of the same and
- 00:20:27that's with their
- 00:20:29what they call uh entropy based grouping
- 00:20:33of bytes into
- 00:20:35patches um
- 00:20:38so the entropy based grouping is a
- 00:20:42concept that's as I said kind of
- 00:20:46um yeah it's what they essentially do is
- 00:20:50they train a small transformer so a byte
- 00:20:54level
- 00:20:55Transformer um notably this is not this
- 00:20:59thing right here so they have a
- 00:21:01separate llm that's small that's just on
- 00:21:05the bytes so that actually is a
- 00:21:08character level llm that's just trained
- 00:21:11on a corpus
- 00:21:13and that decides where to split in the
- 00:21:18following way you feed text into it it
- 00:21:23will predict the next token and if the
- 00:21:27entropy of the prediction so this
- 00:21:31distribution right here if the entropy
- 00:21:33of the prediction of the next character
- 00:21:35is very high meaning like what is a high
- 00:21:39entropy a high entropy is a distribution
- 00:21:42that's
- 00:21:43like you know could be could be any of
- 00:21:47these whereas a low entropy distribution
- 00:21:49is like oh it's this it's this one it's
- 00:21:53this one definitely so high entropy
- 00:21:55meaning it's not sure that's where where
- 00:21:58you split so if the next
- 00:22:01character
- 00:22:02is above a threshold of entropy in the
- 00:22:05prediction of this byte level llm that's
- 00:22:10where you make a
- 00:22:11split that that's just a just a decision
- 00:22:14they make right um it's a it's a design
- 00:22:17choice that they make but there's good
- 00:22:20reason right there's there's good reason
- 00:22:22to split by entropy uh because what you
- 00:22:25do is you keep the stuff together
- 00:22:28together where you're sure so whenever
- 00:22:32you know
- 00:22:33bet you know
- 00:22:35bet, like the ER, that's very clear
- 00:22:40and therefore you want to keep it
- 00:22:42together because it kind of is one unit
- 00:22:44like whenever you're very sure what
- 00:22:46comes you can very much argue that the
- 00:22:49thing is actually should be treated as a
- 00:22:51single unit when you're not sure that
- 00:22:54means there could be multiple
- 00:22:55continuations that's when you want to
- 00:22:57split it up and say oh well here you
- 00:23:00know this these two things need to be
- 00:23:02treated separately because in an
- 00:23:04alternative Universe there there's a
- 00:23:06different continuation here that I need
- 00:23:07to take into account and then you better
- 00:23:11off if that first part is the same token
- 00:23:14each time and not if the entire thing is
- 00:23:17like a different token and you know
- 00:23:19nothing
- 00:23:21anymore all right um what I want to
- 00:23:24say yeah and this is also the answer on
- 00:23:27how the local decoder stops decoding so
- 00:23:31it decodes decodes decodes and when the
- 00:23:33next and then it it just always asks
- 00:23:36this small llm here what's the entropy
- 00:23:39of what I'm doing right like what's the
- 00:23:40entropy of the next token in your
- 00:23:43estimation like this local model this
- 00:23:45knows nothing of the latent Transformer
- 00:23:47it just looks at the stuff that's
- 00:23:49being produced and if the next if the
- 00:23:53next token according to it has a high
- 00:23:56entropy that's where we end the the
- 00:23:59patch okay so the process is as
- 00:24:03follows we have some we have some
- 00:24:06text and let's say we're at a new patch
- 00:24:09boundary okay the local encoder looks at
- 00:24:13the patch sorry we're here let's let's
- 00:24:16start it the we run the small llm
- 00:24:20forward right boop boop boop boop boop
- 00:24:23until the entropy is above the threshold
- 00:24:26that's where we say ah okay that's a
- 00:24:28patch okay our patch is from here to
- 00:24:30here then that local encoder looks at
- 00:24:32the characters in here and and takes
- 00:24:36there's an embedding table
- 00:24:39from byte to embedding notably you only
- 00:24:43need 256 entries fixed right this
- 00:24:47doesn't grow so it looks up the
- 00:24:51embeddings of the constituents and
- 00:24:54aggregates them into a patch embedding
- 00:24:56it's trained to do that then you
- 00:24:58run the latent Transformer for one step
- 00:25:01let's assume this doesn't exist yet for
- 00:25:03one step and produce the
- 00:25:05next the next latent um output token the
- 00:25:11local decoder takes this and
- 00:25:15starts um let's assume let's assume that
- 00:25:19actually let's assume the local decoder
- 00:25:21is here currently right the local
- 00:25:23decoder takes this and starts
- 00:25:27producing uh um uh tokens it starts
- 00:25:30decoding like an llm except conditioned
- 00:25:33on this global signal right here so it's
- 00:25:35like okay this one okay and I produce
- 00:25:39this one I produce this one and each
- 00:25:41time it asks the small llm what it
- 00:25:44thinks about the next token in the
- 00:25:46sequence it has decoded if the small as
- 00:25:49soon as the small llm says oh wait the
- 00:25:51entropy is quite High then it's like
- 00:25:54okay stop it here I'm going to I'm going
- 00:25:57to stop it here please go back to the
- 00:26:00next thing um
- 00:26:03and uh you know start the next cycle of
- 00:26:07the
- 00:26:08process at least that's how I
- 00:26:11think it goes uh maybe I'm I'm totally
- 00:26:15wrong but that's what I can read from
- 00:26:16the paper the paper is a bit sparse on
- 00:26:18these exact details um but and I haven't
- 00:26:22read the code I have to apologize for
- 00:26:24that but the code is available so you
- 00:26:26can go and verify or or refute that um
- 00:26:30there is one extra thing there's one
- 00:26:34little bit of extra info that you need
- 00:26:36right here and that's
- 00:26:39usually usually when you do auto
- 00:26:42regressive decoding you take what you've
- 00:26:45produced and you feed it back right um
- 00:26:49into your own model however that doesn't
- 00:26:53work here because this local decoder it
- 00:26:56doesn't take text as a an input it
- 00:26:58doesn't take characters as an input it
- 00:27:01just takes this signal right here as an
- 00:27:04input
- 00:27:05So what does take characters as an input?
- 00:27:08Well, that local encoder takes
- 00:27:10characters as an input, so there is a
- 00:27:14hidden skip connection from here to
- 00:27:18here. So when the local decoder
- 00:27:21produces a character (at least that's,
- 00:27:23again, my understanding), you run this
- 00:27:27character through the local encoder
- 00:27:30here and get its local encoder
- 00:27:33embedding, but you don't go to the latent
- 00:27:35Transformer, because you're not done with
- 00:27:36a patch yet. You just feed this back into
- 00:27:39the local decoder, which then has
- 00:27:42a latent representation that it
- 00:27:45can decode the next byte from. So the
- 00:27:47loop from local decoder to local
- 00:27:49encoder and back to local decoder, that's kind
- 00:27:51of the outer loop that runs in order to
- 00:27:54produce these bytes. And once you're
- 00:27:56done with a patch, then
- 00:27:58you start again: you take
- 00:28:01not the next patch, sorry, but
- 00:28:04the patch that you've just
- 00:28:06produced, embed it, get it into the latent
- 00:28:09Transformer, and from that you get the next
- 00:28:11global signal. Then you do that outer
- 00:28:13loop again in order to produce the
- 00:28:15individual bytes until the small LM says
- 00:28:18that, again, the patch is
- 00:28:20over. Again, that's how I personally
- 00:28:24understand it.
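The decoding loop described above can be sketched roughly as follows. This is a minimal Python sketch of the speaker's reading of the paper, not the released implementation: `generate` and its four callables (`encode_byte`, `latent_step`, `decode_next`, `patch_is_over`) are hypothetical stand-ins for the local encoder, the latent Transformer, the local decoder, and the small patching LM, and treating the whole prompt as a single patch is a simplification. What the talk calls the "outer loop" (local decoder to local encoder and back) is the byte-level inner `while` here.

```python
from typing import Callable, List, Sequence


def generate(
    prompt: bytes,
    max_new_bytes: int,
    encode_byte: Callable[[int], float],
    latent_step: Callable[[Sequence[float]], float],
    decode_next: Callable[[float, Sequence[float]], int],
    patch_is_over: Callable[[bytes], bool],
) -> bytes:
    """Toy control flow: one latent step per patch, one local step per byte."""
    out = bytearray(prompt)
    patch: List[int] = list(prompt)  # simplification: the prompt is one finished patch
    produced = 0
    while produced < max_new_bytes:
        # Patch level: embed the finished patch and run the (expensive) latent model once.
        global_signal = latent_step([encode_byte(b) for b in patch])
        patch = []
        while produced < max_new_bytes:
            # Byte level: re-encode the bytes produced so far in this patch and feed
            # them back to the local decoder, without touching the latent model.
            local_states = [encode_byte(b) for b in patch]
            next_byte = decode_next(global_signal, local_states)
            out.append(next_byte)
            patch.append(next_byte)
            produced += 1
            if patch_is_over(bytes(out)):  # stand-in for the small LM's patch boundary
                break
    return bytes(out)


# Toy stand-ins, chosen only so that the loop runs deterministically.
latent_calls: List[int] = []


def toy_latent(states: Sequence[float]) -> float:
    latent_calls.append(len(states))  # record how many bytes each patch contained
    return sum(states)


result = generate(
    b"ab",
    6,
    encode_byte=float,
    latent_step=toy_latent,
    decode_next=lambda g, local: (int(g) + len(local)) % 26 + 97,
    patch_is_over=lambda text: len(text) % 3 == 0,
)
print(result, latent_calls)
```

The point of the sketch is the call count: six bytes are generated, but the latent model runs only three times, once per patch.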
- 00:28:28Here we have exactly the encoder and
- 00:28:31decoder. So the
- 00:28:34encoder gets byte embeddings and
- 00:28:38then
- 00:28:38uses cross-attention. So it knows
- 00:28:42those should be grouped into three
- 00:28:45different patches, so it uses cross-attention
- 00:28:47from the patch to only the
- 00:28:53bytes that are part of the patch. By the
- 00:28:56way, there are two here, and that's not the number of patches:
- 00:29:00there are three different patches, but they
- 00:29:03use multi-head attention, so this just
- 00:29:05represents a two-headed multi-head
- 00:29:08attention with keys into here. But you
- 00:29:11still have hidden states. You have many
- 00:29:13layers, so you still have hidden states,
- 00:29:15and these hidden states are what you give
- 00:29:19to the
- 00:29:20decoder, which does the exact opposite:
- 00:29:22its keys are, sorry, its queries are
- 00:29:27the individual bytes that you produce,
- 00:29:28and its keys and values are the global
- 00:29:31signal that you get from the latent
- 00:29:35Transformer. All right, there is one more thing.
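To make the two cross-attention directions concrete, here is a small pure-Python sketch. It rests on several assumptions that are not from the paper or the talk: a single head, no learned projections, mean-pooled bytes as the patch queries, and a decoder mask that lets each byte see only its own patch's signal. The names `cross_attention`, `mean_pool`, `byte_states`, and `patch_ids` are hypothetical, and the byte states are reused as stand-ins for the latent Transformer's output.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cross_attention(
    queries: Sequence[Vector],
    keys: Sequence[Vector],
    values: Sequence[Vector],
    mask: Sequence[Sequence[bool]],
) -> List[List[float]]:
    """Masked scaled dot-product attention; every query needs at least one allowed key."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for query, allowed in zip(queries, mask):
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale if ok else -math.inf
            for key, ok in zip(keys, allowed)
        ]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # masked keys get weight 0.0
        total = sum(weights)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))
        ])
    return outputs


def mean_pool(vectors: Sequence[Vector]) -> List[float]:
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Five byte hidden states, grouped into two patches (toy numbers).
byte_states = [[1.0, 0.0], [1.0, 0.0], [0.0, 2.0], [0.0, 4.0], [0.0, 6.0]]
patch_ids = [0, 0, 1, 1, 1]
n_patches = 2

# Encoder direction: patches are the queries, bytes are the keys and values,
# and each patch may only attend to the bytes that belong to it.
patch_queries = [
    mean_pool([s for s, pid in zip(byte_states, patch_ids) if pid == p])
    for p in range(n_patches)
]
enc_mask = [[pid == p for pid in patch_ids] for p in range(n_patches)]
patch_vectors = cross_attention(patch_queries, byte_states, byte_states, enc_mask)

# Decoder direction: bytes are the queries, patch-level vectors are keys and values.
dec_mask = [[pid == p for p in range(n_patches)] for pid in patch_ids]
byte_outputs = cross_attention(byte_states, patch_vectors, patch_vectors, dec_mask)
```

The same function serves both directions; only the roles of queries and keys/values are swapped, which is the "exact opposite" described in the talk.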
- 00:29:39Now I'm going to guess that this
- 00:29:42thing here, the encoder hash n-gram
- 00:29:45embeddings, they added because it just
- 00:29:48works better. This seems very much
- 00:29:50like a thing you add afterwards. So they
- 00:29:53say, look,
- 00:30:01we model each byte individually, so
- 00:30:06when we do encoding, each byte gets
- 00:30:10encoded
- 00:30:12by itself and as part of a byte n-gram.
- 00:30:19So you can see that they build up
- 00:30:22not just an embedding table for the byte-to-embedding
- 00:30:26lookup, but they build up several
- 00:30:29embedding tables. So there is an
- 00:30:30embedding table for byte
- 00:30:353-grams, there is one for byte 4-grams,
- 00:30:39for byte 5-grams, and so on up until byte 8-grams.
- 00:30:44And now you ask: well, aren't the byte 8-grams
- 00:30:46huge, and isn't that exactly what we tried
- 00:30:49to avoid? Yes, they are. That's why you
- 00:30:52just hash them
- 00:30:55and then take the modulus by the size of the
- 00:30:58embedding table. So you're
- 00:31:02essentially counting on the fact that,
- 00:31:04yes, there are going to be hash
- 00:31:05collisions, like some of the byte 3-grams
- 00:31:06are going to hit the same embedding
- 00:31:08right here, but those hash collisions are
- 00:31:10kind of orthogonal things in meaning, and
- 00:31:13so it's probably fine.
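The hash-then-modulus lookup described above can be illustrated with a short Python sketch. The table size, the polynomial hash, and the names `rolling_hash` and `ngram_buckets` are assumptions made for illustration, and the n-gram is taken to end at the byte being embedded; the paper's actual hash function and table sizes may differ. A hand-written hash is used because Python's built-in `hash()` on `bytes` is randomized per process.

```python
from typing import Dict

TABLE_SIZE = 1000          # hypothetical number of rows in each n-gram embedding table
NGRAM_SIZES = range(3, 9)  # 3-grams up to 8-grams, as mentioned in the talk
PRIME = 1_000_003          # arbitrary multiplier for the illustrative hash
MODULUS = 2**61 - 1        # keeps the intermediate hash value bounded


def rolling_hash(ngram: bytes) -> int:
    """Deterministic polynomial hash over the bytes of one n-gram."""
    h = 0
    for b in ngram:
        h = (h * PRIME + b) % MODULUS
    return h


def ngram_buckets(data: bytes, position: int) -> Dict[int, int]:
    """Map each n-gram ending at `position` to a row index in its embedding table."""
    buckets = {}
    for n in NGRAM_SIZES:
        start = position - n + 1
        if start < 0:
            continue  # not enough preceding bytes for this n-gram size
        buckets[n] = rolling_hash(data[start:position + 1]) % TABLE_SIZE
    return buckets


# Rows to look up for the final "t" of "the cat sat", one per n-gram size.
buckets = ngram_buckets(b"the cat sat", 10)
print(buckets)
```

There are 256**8 possible byte 8-grams but only `TABLE_SIZE` rows, so distinct n-grams necessarily share rows; that is the collision trade-off the talk refers to.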
- 00:31:16I'm going to guess it's
- 00:31:19just a way to get n-grams in there. So when
- 00:31:21you look at a byte, for example the
- 00:31:23letter T right here, you also take the
- 00:31:26embedding for the 3-gram, the 4-gram, the 5-gram,
- 00:31:29the 6-gram, the 7-gram, and the 8-gram in
- 00:31:32front of that byte, and you aggregate all
- 00:31:36of these together into the byte
- 00:31:39embedding, so to say. So the local encoder
- 00:31:44doesn't operate purely on the byte
- 00:31:46embedding, as I said before, but it
- 00:31:48actually operates on a superposition of
- 00:31:53byte n-gram embeddings, and this
- 00:31:57puts the byte into context with the bytes
- 00:32:00before it. To me, it just seems
- 00:32:04like a way to
- 00:32:08fake it. It's a bit like
- 00:32:10tokenization is back, that's
- 00:32:14what it tells me, except instead of
- 00:32:16tokens it's
- 00:32:18n-grams. So yeah, make of that as you
- 00:32:23will. I don't want to talk too
- 00:32:27much more. I think that's kind of it
- 00:32:28for the model design and how they decode
- 00:32:31and so on. When they experiment around,
- 00:32:34they find they can actually make larger
- 00:32:37patches than regular tokenization. So
- 00:32:40they say, look, our patches, we
- 00:32:45can go to patch sizes of, what do
- 00:32:49they say, trends between... yeah, so they
- 00:32:55can achieve kind of the
- 00:32:57performance of Llama 2 and
- 00:32:59Llama 3 models while using significantly
- 00:33:01larger patch sizes: while Llama 2 and
- 00:33:05Llama 3 byte-pair encodings have an
- 00:33:06average token size of 3.7 and 4.4 bytes,
- 00:33:10we can achieve similar scaling
- 00:33:12trends with an average patch size of six
- 00:33:14and even eight bytes. So you have
- 00:33:18that handle on that tradeoff, and that's
- 00:33:20pretty cool, I have to say. They do some
- 00:33:23experiments where they show that, yeah,
- 00:33:25they can remain competitive with these
- 00:33:27Llama models, but also they're a lot better
- 00:33:31in tasks where you actually
- 00:33:34need to look at the individual
- 00:33:37characters in a token, because given that
- 00:33:40they operate on byte embeddings, they can
- 00:33:43now also train, in a very fine-grained way,
- 00:33:46models that actually need to
- 00:33:50look at the individual characters, whereas if
- 00:33:52you just have fixed tokens and
- 00:33:55you look up their embeddings in a table,
- 00:33:57that doesn't work as well. But
- 00:33:59it's kind of
- 00:34:01cheesing a bit, just demonstrating,
- 00:34:03hey look, on spelling inverse we're doing
- 00:34:06really, really well compared to the
- 00:34:08Llama models, which was to be expected,
- 00:34:11but it is nice that they perform an
- 00:34:13experiment to actually show that. What's
- 00:34:16also interesting is that translation
- 00:34:19works better for languages that
- 00:34:22are underrepresented, or that are
- 00:34:24tokenized in a
- 00:34:28way other than how
- 00:34:32your standard languages are tokenized,
- 00:34:34and that's also pretty
- 00:34:36cool. All right, I don't
- 00:34:40want to dive too much more here.
- 00:34:43Please look at the rest of the paper,
- 00:34:45it's pretty interesting, it's pretty
- 00:34:46thorough, the experiments are pretty cool,
- 00:34:48and they pay a lot of attention to
- 00:34:50controlling for various parameters,
- 00:34:52because it is really hard: if
- 00:34:55your model operates on different
- 00:34:56fundamental units,
- 00:34:58how do you even compare to other models?
- 00:35:00And they do a good job at that. There
- 00:35:03is room for several improvements:
- 00:35:05notably, you could train more things
- 00:35:07jointly, for example that small language
- 00:35:09model that does the patching, and so on.
- 00:35:12And as of now,
- 00:35:17in terms of raw runtime, this still
- 00:35:20lags behind, because obviously we've
- 00:35:22spent a decade hyper-optimizing, or
- 00:35:26at least half a decade hyper-optimizing,
- 00:35:28fixed-tokenization autoregressive
- 00:35:31LLMs, with things like they name
- 00:35:33here, such as FlexAttention,
- 00:35:37and obviously that would still need to
- 00:35:41be done for these patch-level models in
- 00:35:45terms of actually getting their runtime
- 00:35:47there. So when they compare something,
- 00:35:50they match FLOPs, which is probably
- 00:35:53a pretty good measure that's kind of
- 00:35:55independent of raw optimization.
- 00:35:58All right, that's it. As I said, read the
- 00:36:01paper, subscribe for more reviews, and
- 00:36:04thank you so much. If you read this as it
- 00:36:07comes out, then Holly Jolly Christmas
- 00:36:10and Happy New Year, and see you around.
- 00:36:13Bye-bye.
- Byte Latent Transformer
- tokenization
- patches
- scalability
- language models
- embedding
- entropy
- dynamic encoding
- LLM
- linguistic performance