Byte Latent Transformer: Patches Scale Better Than Tokens (Paper Explained)

00:36:14
https://www.youtube.com/watch?v=loaTGpqfctI

Summary

TLDR: The video examines a research paper on the Byte Latent Transformer, a model whose architecture does not rely on a fixed tokenization. Instead, it introduces "patches" to segment the text, which improves scalability and overcomes vocabulary limitations. Using a two-level architecture with a local encoder and a latent transformer, the model delivers competitive performance while handling tasks that require attention to individual characters. The experimental results show that patch-based models outperform traditional approaches, particularly for under-represented languages.

Key takeaways

  • 📄 Introduction of the Byte Latent Transformer, which uses patches instead of tokens.
  • 🔍 Better scalability and handling of limited-vocabulary issues.
  • 🆕 Patches enable dynamic tokenization.
  • 🚀 Performs better on language tasks that require character-level granularity.
  • 🌐 Advantages for under-represented languages in text processing.

Timeline

  • 00:00:00 - 00:05:00

    Introduction to the 'Byte Latent Transformer' architecture, which is claimed to outperform classic tokenization-based models. The new approach avoids fixed tokenization and uses dynamic 'patches', showing better scaling properties than token-based models.

  • 00:05:00 - 00:10:00

    Comparison between the patch-based model and models using classic tokenization such as byte pair encoding shows superior scaling behaviour. The plots illustrate that, for the same number of training FLOPS, the patch-based model achieves better results beyond a certain threshold.

  • 00:10:00 - 00:15:00

    Presentation of the 'Byte Latent Transformer' architecture, which consists of two levels: an inner level based on a classic LLM operating on patch embeddings, and an outer level that predicts bytes from those embedding representations, highlighting how the two levels operate differently.

  • 00:15:00 - 00:20:00

    Explanation of how text is turned into tokens, covering the limits of classic tokenization and alternatives such as byte pair encoding. The problem of an embedding table that grows too large as the vocabulary grows is also discussed.

  • 00:20:00 - 00:25:00

    Analysis of tokenization methods such as byte pair encoding and WordPiece encoding and their respective problems, notably the out-of-vocabulary problem. The paper addresses the need to balance vocabulary size against model performance. (A sketch of the BPE merge loop appears after this timeline.)

  • 00:25:00 - 00:30:00

    A dynamic tokenization approach is proposed that avoids the limitations of a fixed vocabulary, laying the groundwork for 'patch' embeddings, which are representations of groups of characters.

  • 00:30:00 - 00:36:14

    The entropy-based grouping method used to determine patch boundaries: an entropy threshold identifies where splits should happen, and the same signal drives the decoding process at each step. (A minimal code sketch of this boundary rule appears below.)
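
To make the byte pair encoding discussion above (00:20:00 - 00:25:00) concrete, here is a minimal sketch of the greedy merge loop behind BPE. This is illustrative only, not the paper's or Llama's tokenizer, and the corpus handling is deliberately naive:

    from collections import Counter

    def learn_bpe_merges(corpus, num_merges):
        """Greedy BPE sketch: repeatedly merge the most frequent adjacent pair."""
        # start from raw UTF-8 bytes, i.e. the fixed 0-255 base vocabulary
        words = [list(w.encode("utf-8")) for w in corpus.split()]
        merges = []
        for _ in range(num_merges):
            pairs = Counter()
            for w in words:
                pairs.update(zip(w, w[1:]))      # count adjacent symbol pairs
            if not pairs:
                break
            best = pairs.most_common(1)[0][0]
            merges.append(best)
            # rewrite every word, fusing each occurrence of the best pair
            new_words = []
            for w in words:
                out, i = [], 0
                while i < len(w):
                    if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                        out.append(best)         # the merged pair becomes one symbol
                        i += 2
                    else:
                        out.append(w[i])
                        i += 1
                new_words.append(out)
            words = new_words
        return merges

Calling learn_bpe_merges on a small corpus returns the most frequent pairs in merge order; a real tokenizer then applies those merges to new text, which is exactly the fixed-vocabulary step the paper replaces with dynamic patching.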
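
The entropy-based grouping in the last timeline entry can be sketched as follows. This is a minimal illustration, not the paper's implementation: next_byte_probs stands in for the small byte-level language model, and the threshold value is made up for the example.

    import math

    ENTROPY_THRESHOLD = 2.0  # illustrative value, not taken from the paper

    def entropy_bits(probs):
        """Shannon entropy (in bits) of a next-byte distribution."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    def patch_boundaries(byte_seq, next_byte_probs):
        """Start a new patch wherever the small byte-level LM is uncertain.

        next_byte_probs(prefix) is assumed to return a length-256 probability
        distribution over the next byte, produced by a separately trained
        byte-level model."""
        boundaries = [0]
        for i in range(1, len(byte_seq)):
            if entropy_bits(next_byte_probs(byte_seq[:i])) > ENTROPY_THRESHOLD:
                boundaries.append(i)  # high entropy: many continuations are plausible, so split
        return boundaries

Low-entropy positions (such as the middle of a common word) stay inside one patch; high-entropy positions, where many continuations are plausible, open a new one.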



Video Q&A

  • What is the Byte Latent Transformer?

    It is a new language-processing model that uses patches instead of fixed tokens.

  • How does the model improve scalability?

    It scales better thanks to its ability to form dynamic groups of characters rather than depending on a fixed vocabulary.

  • What problem does the model solve?

    It addresses the out-of-vocabulary problem and offers dynamic tokenization.

  • What is a patch in this context?

    A patch is a dynamic group of characters used for encoding, replacing classic tokens. (A sketch of how byte embeddings are pooled into a patch embedding appears after this Q&A.)

  • What are the advantages of patches over fixed tokens?

    Patches give a better representation of under-represented languages and allow finer-grained handling of sequences.
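
As a rough picture of what a patch embedding is (see the last two answers above), the sketch below mean-pools the byte embeddings inside each patch into one vector. The real model uses a learned local encoder with cross-attention rather than mean pooling, and the table width here is arbitrary.

    import numpy as np

    EMB_DIM = 16                                 # arbitrary width for illustration
    byte_table = np.random.randn(256, EMB_DIM)   # fixed-size byte embedding table

    def patch_embeddings(byte_seq, boundaries):
        """Toy stand-in for the local encoder: one pooled vector per patch."""
        bounds = list(boundaries) + [len(byte_seq)]
        patches = []
        for start, end in zip(bounds, bounds[1:]):
            vecs = byte_table[list(byte_seq[start:end])]
            patches.append(vecs.mean(axis=0))    # the paper learns this step instead
        return np.stack(patches)

For example, patch_embeddings(b"determined", [0, 5]) yields two vectors, one per patch, and vectors like these are what the latent transformer actually consumes.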


Subtitles (en)
  • 00:00:00
    hello there today we're looking at the
  • 00:00:02
    paper Byte Latent Transformer patches
  • 00:00:04
    scale better than tokens this paper in a
  • 00:00:08
    sense does away with classic fixed
  • 00:00:11
    vocabulary based
  • 00:00:13
    tokenization and in doing so develops a
  • 00:00:16
    new architecture called the Byte Latent
  • 00:00:18
    Transformer and in their experiments
  • 00:00:21
    they show that this as the paper says
  • 00:00:24
    scales better than a sort of classic a
  • 00:00:28
    model that operates on classic basically
  • 00:00:30
    tokenized tokens so the thing that
  • 00:00:34
    they're doing is they do away with
  • 00:00:36
    tokenization they find a different way
  • 00:00:39
    of splitting text into pieces a dynamic
  • 00:00:43
    way and they call these pieces patches
  • 00:00:45
    so patches are like tokens um except
  • 00:00:50
    they need to distinguish them verbally
  • 00:00:53
    so it's clear which one you're talking
  • 00:00:55
    about and then once you let a model run
  • 00:00:58
    on that you do get better scaling
  • 00:01:00
    properties and that's kind of the
  • 00:01:02
    central claim of this paper that if you
  • 00:01:04
    look at models that are
  • 00:01:08
    um here they compare to byte pair
  • 00:01:11
    encoding so this is kind of classic
  • 00:01:13
    tokenization used in the Llama models if
  • 00:01:16
    you compare that with a model that
  • 00:01:18
    operates on patches um then you do get
  • 00:01:22
    better scaling Behavior as you can see
  • 00:01:24
    here so the red orange lines are the
  • 00:01:27
    patch based ones the blue line is the
  • 00:01:30
    kind of classically tokenized ones now
  • 00:01:34
    there are a lot of like choices here
  • 00:01:37
    that make this graphic so the Y axis here
  • 00:01:40
    is what they call bits per byte which is
  • 00:01:42
    kind of so since you if you don't deal
  • 00:01:45
    with tokens and if you deal with
  • 00:01:47
    especially different tokenization you
  • 00:01:49
    can't really use perplexity as a measure
  • 00:01:52
    because that kind of necessitates that
  • 00:01:54
    you operate over the same kind of
  • 00:01:56
    fundamental pieces so bits per byte is
  • 00:02:00
    sort of the analogous measure to entropy
  • 00:02:04
    uh sorry to uh
  • 00:02:06
    perplexity so think of this as kind of
  • 00:02:09
    perplexity and then here are the x-axis
  • 00:02:11
    and that's important that's total
  • 00:02:12
    training flops so they always consider
  • 00:02:14
    flop matched models uh because they are
  • 00:02:19
    like their model operates differently it
  • 00:02:21
    has like an outer layer and an inner
  • 00:02:22
    layer and don't you don't need to always
  • 00:02:24
    execute the inner layer for each outer
  • 00:02:28
    step so the outer step ends up running
  • 00:02:31
    more often but the inner step ends up
  • 00:02:33
    running less often and that's why you
  • 00:02:36
    can uh achieve you can achieve uh better
  • 00:02:41
    or you can achieve bigger models let's
  • 00:02:44
    say with um with the patch based models
  • 00:02:48
    because your patches are bigger and you
  • 00:02:51
    need to run them less often so if you
  • 00:02:54
    invest the same amount of training flops
  • 00:02:56
    then you can kind of become better so
  • 00:03:01
    there is a lot of ifs in this kind of oh
  • 00:03:03
    it scales better what they keep constant
  • 00:03:06
    is the training flops so for the same
  • 00:03:09
    amount of training flops after a certain
  • 00:03:11
    threshold you'll do better with the
  • 00:03:13
    patch based models just because they
  • 00:03:15
    have better scaling property um why
  • 00:03:18
    exactly that is that could be it's
  • 00:03:22
    probably a mixture of the part of their
  • 00:03:23
    architecture so let's dive into that
  • 00:03:25
    architecture here is the Byte Latent
  • 00:03:27
    Transformer as I said it's kind of a two
  • 00:03:29
    tier system so the inner system right
  • 00:03:32
    here that is um your very let's say
  • 00:03:36
    regular llm type Transformer there's
  • 00:03:41
    absolutely nothing special about it
  • 00:03:43
    except it that it operates on kind of
  • 00:03:47
    these pieces right here now usually
  • 00:03:49
    these would be token and token
  • 00:03:51
    embeddings so these here would be token
  • 00:03:54
    embeddings and it predicts the next
  • 00:03:58
    token um if you view a Transformer so
  • 00:04:04
    you have I don't know token and the
  • 00:04:06
    history going in and then you have a
  • 00:04:10
    Transformer and it will output uh a
  • 00:04:14
    distribution over next tokens right like
  • 00:04:17
    a some sort of a soft Max probability
  • 00:04:20
    distribution over next tokens if you
  • 00:04:23
    however take one step back and you
  • 00:04:25
    consider what happens in the last layer
  • 00:04:27
    in the last layer you do have a an so
  • 00:04:31
    here is the model out comes an
  • 00:04:34
    embedding so this is like the hidden
  • 00:04:37
    signal in the second last layer the last
  • 00:04:41
    layer is a matrix that is of
  • 00:04:46
    Dimension um here this Dimension is the
  • 00:04:51
    size of H and then this Dimension is the
  • 00:04:54
    size of the um token or of the
  • 00:04:57
    vocabulary let's say
  • 00:05:04
    vocab right so this is what actually
  • 00:05:06
    does the classification multiplying
  • 00:05:08
    these two things together will and then
  • 00:05:10
    applying a soft Max will give you this
  • 00:05:13
    distribution so in a sense um you could
  • 00:05:17
    argue that uh
  • 00:05:20
    the that this here you know that what
  • 00:05:23
    the Transformer actually does is even a
  • 00:05:26
    regular Transformer it actually predicts
  • 00:05:28
    the embedding of the next
  • 00:05:30
    token and there is one more thing if you
  • 00:05:34
    do what's called weight tying or
  • 00:05:36
    embedding tying so here is the tokens
  • 00:05:39
    coming in here is your embedding table
  • 00:05:41
    so for each token you have an embedding
  • 00:05:43
    in here some models actually tie those
  • 00:05:46
    together meaning they use the same
  • 00:05:47
    parameters saves them a lot of
  • 00:05:49
    parameters and it kind of has the same
  • 00:05:52
    idea um this maps from token IDs to
  • 00:05:56
    embedding space and this uh sorry this
  • 00:05:59
    this Matrix here kind of maps from
  • 00:06:00
    embedding space to back to token IDs
  • 00:06:03
    right with this probability distribution
  • 00:06:05
    so in that sense it's even more true
  • 00:06:07
    that the latent Transformer just kind of
  • 00:06:10
    out predicts the embedding or sorry any
  • 00:06:14
    language model uh Transformer that is
  • 00:06:18
    you know has at the end uh outputs
  • 00:06:22
    logits actually just predicts effectively
  • 00:06:25
    the embedding of the next
  • 00:06:27
    token so that's you know being said in
  • 00:06:30
    in the inner side here there is just a
  • 00:06:33
    very regular Transformer Auto regressive
  • 00:06:37
    llm that takes in
  • 00:06:41
    things and predicts the next thing from
  • 00:06:45
    them all in embedding space now usually
  • 00:06:50
    again usually these are tokens so maybe
  • 00:06:52
    it's worth briefly how uh we get to
  • 00:06:55
    those tokens so if we have a piece of
  • 00:06:58
    text for example
  • 00:07:00
    uh this piece of text right here data
  • 00:07:02
    is primarily determined by the number
  • 00:07:05
    okay what we want to do is we want to
  • 00:07:08
    split those things up into individual
  • 00:07:11
    pieces that our models can operate over
  • 00:07:14
    now one method of doing that would be to
  • 00:07:17
    just split every single character
  • 00:07:19
    including the white spaces right becomes
  • 00:07:22
    one piece but that's not really uh the
  • 00:07:25
    best because it will result in very long
  • 00:07:28
    sequences for for a given text and you
  • 00:07:31
    know that Transformers scale by sequence
  • 00:07:33
    length quadratically which isn't
  • 00:07:36
    necessarily doesn't make us happy so our
  • 00:07:39
    context window of 128,000 tokens will
  • 00:07:43
    just be 128,000 characters in the end so
  • 00:07:47
    can we do better well yes we could split
  • 00:07:50
    uh by for example white space so data
  • 00:07:54
    sorry data becomes one token is becomes
  • 00:07:58
    one token primarily becomes one token
  • 00:08:00
    and so on this was very standard for a
  • 00:08:04
    very long time and what what you have to
  • 00:08:08
    do in all of these things if you operate
  • 00:08:11
    with tokens is you're going to have a
  • 00:08:12
    table that is um mapping your tokens to
  • 00:08:17
    an embedding as we said before and then
  • 00:08:20
    every token needs to have a
  • 00:08:22
    corresponding embedding Vector that you
  • 00:08:24
    can look up so the word data has to have
  • 00:08:28
    an embedding vector in here somehow uh
  • 00:08:32
    the word is has to have an embedding
  • 00:08:34
    Vector in here somehow right um and you
  • 00:08:39
    can already see the problem there is
  • 00:08:41
    this table is become going to become
  • 00:08:43
    really really big and the even bigger
  • 00:08:45
    problem would be that let's say you
  • 00:08:49
    derive you have to derive this table
  • 00:08:51
    somehow so you take a training Corpus
  • 00:08:54
    and you look at all the words in there
  • 00:08:55
    and that's how you initialize the table
  • 00:08:58
    but it's very likely like English is
  • 00:09:00
    such a big language that in your test
  • 00:09:02
    data set there's going to be a word that
  • 00:09:06
    you've never seen in the training data
  • 00:09:08
    set like a name or maybe a number or
  • 00:09:12
    just a a word that you've never seen uh
  • 00:09:16
    for example you might actually never
  • 00:09:18
    have seen the word determined before
  • 00:09:21
    there is some people have tried to
  • 00:09:24
    mitigate some so they what they do is
  • 00:09:26
    like stemming or something like this so
  • 00:09:29
    instead of determined you just say
  • 00:09:31
    determine um so you and you say oh those
  • 00:09:35
    two are the same the same word
  • 00:09:37
    essentially so you only have one entry
  • 00:09:39
    in the embedding table instead of having
  • 00:09:41
    one for determine determined determining
  • 00:09:44
    uh determinization and whatnot so this
  • 00:09:48
    is just one token but still the problem
  • 00:09:50
    of like out of vocabulary people used to
  • 00:09:54
    call that was really really big and
  • 00:09:58
    problematic and people came up with
  • 00:10:00
    alternatives to that and those
  • 00:10:02
    alternatives are what currently are very
  • 00:10:06
    very popular um so what are those
  • 00:10:08
    Alternatives those Alternatives if you
  • 00:10:11
    look at things like byte pair encoding
  • 00:10:13
    or word piece encoding or things like
  • 00:10:15
    that uh they all are of the same
  • 00:10:18
    principle they say there
  • 00:10:21
    exists um a set of like unitary things
  • 00:10:27
    and those unitary Things Are
  • 00:10:30
    they can they can be used to make up all
  • 00:10:32
    of the text that we see so in word piece
  • 00:10:36
    um those unitary things would be just
  • 00:10:40
    the all the characters that exist so a b
  • 00:10:43
    c d da da d d da da until like Z then
  • 00:10:47
    capital A then like zero then the
  • 00:10:51
    question mark and so on like it's still
  • 00:10:54
    a lot but it's not infinite right with
  • 00:10:58
    with a decent amount of single symbols
  • 00:11:01
    you can represent any
  • 00:11:04
    um any sequence of characters and you
  • 00:11:08
    might want to say well aren't we now
  • 00:11:10
    back to the same problem where character
  • 00:11:13
    level isn't really good and that's yes
  • 00:11:15
    okay so let's say we have we just do
  • 00:11:18
    ASCII lowercase okay let's say we have a
  • 00:11:21
    to z that's good so we can represent
  • 00:11:24
    everything however we know that the
  • 00:11:27
    combination ER is very very frequent in
  • 00:11:31
    the language so let's just assign a
  • 00:11:34
    different slot to ER yeah we still have
  • 00:11:37
    e in here somewhere we still have R in
  • 00:11:39
    here somewhere but if we encounter ER we
  • 00:11:42
    choose to represent it with its own
  • 00:11:44
    token and its own embedding and then you
  • 00:11:48
    go on and eventually you'll say oh maybe
  • 00:11:51
    the maybe you know I don't know am is
  • 00:11:54
    very common and Ur is very common and
  • 00:11:58
    something like this and then you start
  • 00:12:00
    making bigger combinations you say okay
  • 00:12:02
    the d a d like Dad that's a very common
  • 00:12:05
    thing in the language and so on so you
  • 00:12:07
    build these things there are heuristic
  • 00:12:09
    ways of deriving them uh it's
  • 00:12:11
    essentially a compression algorithm if
  • 00:12:13
    you will uh and you assign individual
  • 00:12:16
    tokens um to those and you it's not just
  • 00:12:20
    whole words right you can see like these
  • 00:12:22
    things they're more like word pieces
  • 00:12:25
    that you start building up the same with
  • 00:12:28
    byte pair encoding uh where you just
  • 00:12:30
    operate in the realm of bytes so uh you
  • 00:12:34
    know you can encode any text into a
  • 00:12:35
    series of of bytes uh by different
  • 00:12:38
    encoding standards that exist like utf8
  • 00:12:41
    is a very common one and then you
  • 00:12:43
    literally you know what all the symbols
  • 00:12:45
    are they are
  • 00:12:47
    0 to 255 those are all your single bytes
  • 00:12:49
    that can exist and then you start
  • 00:12:51
    combining you know the bytes that appear
  • 00:12:54
    often in your text it's kind of more
  • 00:12:57
    clean than working with character and
  • 00:12:59
    symbols but those are your your choices
  • 00:13:02
    so that would be like the byte pair
  • 00:13:04
    encoding and this would more be like word
  • 00:13:07
    piece or something like that um yeah
  • 00:13:11
    so like this seems good but it has its
  • 00:13:14
    own set of problems so first of all what
  • 00:13:19
    are its set of problems first of
  • 00:13:22
    all uh you know a couple of problems
  • 00:13:24
    that stem from tokenization so for
  • 00:13:27
    example if you have like numbers or
  • 00:13:29
    something like if you have the number
  • 00:13:32
    2568 uh then that might actually get
  • 00:13:35
    tokenized as the token 256 and 8 because
  • 00:13:39
    256 is very common uh number and then
  • 00:13:43
    you know just add eight so the tokenizer
  • 00:13:46
    is going for the minimum amount of
  • 00:13:48
    tokens uh so that's a problem if you
  • 00:13:50
    want to teach the neural network to
  • 00:13:52
    multiply something because it will not
  • 00:13:54
    see 2 5 6 8 it will see some token with
  • 00:14:00
    the ID 89 and then some token with the
  • 00:14:03
    ID 71 right it has no clue that you know
  • 00:14:08
    these are made up of numbers or
  • 00:14:10
    something like this and there are a
  • 00:14:12
    bunch of other problems with with
  • 00:14:14
    tokenization what this paper also shows
  • 00:14:16
    is that tokenization does result in
  • 00:14:19
    fairly small chunks of text where you
  • 00:14:22
    could go for bigger chunks of text but
  • 00:14:25
    the problem is if you keep it all in a
  • 00:14:27
    table if if you want bigger chunks of
  • 00:14:29
    text or obviously more combinations
  • 00:14:32
    possible so you'll have to kind of your
  • 00:14:35
    storage kind of explodes for
  • 00:14:38
    this so that's why they say do we even
  • 00:14:41
    need this table here do we even need
  • 00:14:43
    that maybe we don't actually need it
  • 00:14:45
    maybe we can get away with having a
  • 00:14:48
    table Just For The Individual pieces
  • 00:14:52
    like Just For The Individual unitary
  • 00:14:54
    things and we can come up with a scheme
  • 00:14:58
    of how we com how we recombine those
  • 00:15:01
    things for those down here in kind of
  • 00:15:04
    like a a learned way like can we teach a
  • 00:15:07
    neural network to take the embeddings of
  • 00:15:11
    the individual
  • 00:15:13
    constituents and come up with the
  • 00:15:15
    embedding for higher order combinations
  • 00:15:18
    because that would allow us to not even
  • 00:15:21
    have a fixed set of higher order
  • 00:15:23
    combinations but like kind of an
  • 00:15:24
    arbitrary combination of higher order
  • 00:15:26
    com um combinations and the neural
  • 00:15:29
    network will just be able to produce an
  • 00:15:31
    embedding for these on the Fly and then
  • 00:15:34
    those could be the individual pieces we
  • 00:15:37
    feed into the bigger llm right so it's
  • 00:15:40
    not a Chara we're not doing a character
  • 00:15:42
    level or a byte level
  • 00:15:45
    llm um what we're doing is a two-stage
  • 00:15:48
    process where we have a first stage that
  • 00:15:51
    out of the byte embeddings produces what
  • 00:15:55
    they call a patch embedding and a patch
  • 00:15:57
    embedding is like a um six to eight
  • 00:16:01
    characters long thing and that then gets fed
  • 00:16:06
    into the llm now you'll realize what I
  • 00:16:09
    said at the beginning this idea could
  • 00:16:11
    actually totally be done using the
  • 00:16:14
    tokenization we have right like you
  • 00:16:16
    could just tokenize how we tokenize
  • 00:16:19
    right now but just not have this big uh
  • 00:16:22
    sorry not have this big embedding table
  • 00:16:24
    but just do this sort of two-stage
  • 00:16:28
    process where the first stage just
  • 00:16:30
    builds your token embedding from the
  • 00:16:33
    character embeddings that make up the
  • 00:16:34
    token and then the second stage will
  • 00:16:37
    actually go and or the second stage is
  • 00:16:40
    your normal llm that operates on token
  • 00:16:43
    embeddings however you know because they
  • 00:16:46
    have this method they also say well we
  • 00:16:50
    don't need a fixed vocabulary
  • 00:16:52
    tokenization anymore right this here is
  • 00:16:55
    a fixed vocabulary you derive it once
  • 00:16:59
    your vocab because you need that table
  • 00:17:02
    and then you tokenize all the text into
  • 00:17:05
    this fixed vocabulary you don't have out
  • 00:17:08
    of vocabulary anymore because you can
  • 00:17:10
    you have the individual characters here
  • 00:17:11
    so you can tokenize anything uh but
  • 00:17:15
    still it's fixed so they say hey we have
  • 00:17:18
    this process now now we can do Dynamic
  • 00:17:21
    tokenization and that's what they call
  • 00:17:23
    patching they're again from from the
  • 00:17:25
    inside to the outside on the inside we
  • 00:17:29
    have an llm that operates on they call
  • 00:17:32
    Patch embeddings which are essentially
  • 00:17:33
    just token embeddings except the tokens
  • 00:17:36
    aren't fixed they are Dynamic
  • 00:17:40
    groupings uh patches of characters or of
  • 00:17:44
    bytes in our case same
  • 00:17:46
    same sorry uh all non-ASCII
  • 00:17:51
    people and so you can you can see that
  • 00:17:55
    once we know once we know what the where
  • 00:18:00
    the patch boundaries are and in this
  • 00:18:02
    case here here here are the patch
  • 00:18:05
    boundaries right so this is a token this
  • 00:18:07
    is a token this is a token and this is a
  • 00:18:09
    token this this text down here gets
  • 00:18:11
    divided into four tokens once we know
  • 00:18:14
    what they are we can use this local
  • 00:18:17
    encoder thing to look
  • 00:18:20
    at the characters in the patch and give
  • 00:18:25
    us a single patch embedding that we then
  • 00:18:27
    feed to the Transformer so the local
  • 00:18:31
    encoder is a model that's trained to do
  • 00:18:34
    exactly that um as far as I can tell
  • 00:18:36
    it's trained end to end together with
  • 00:18:38
    the latent Transformer and then the
  • 00:18:41
    local decoder takes a patch embedding
  • 00:18:45
    patch embedding and decodes it into the
  • 00:18:48
    constituent characters so you can see
  • 00:18:52
    that the local encoder and the local
  • 00:18:54
    decoder they run more often than the
  • 00:18:57
    latent Transformer and now you have a
  • 00:18:59
    degree of Freedom the long the bigger
  • 00:19:02
    you make these patches the The Wider
  • 00:19:05
    they become the more characters on
  • 00:19:07
    average to a patch the more often you
  • 00:19:10
    run the local
  • 00:19:12
    encoder in comparison to running the
  • 00:19:15
    chunky latent
  • 00:19:17
    Transformer so you can make this in here
  • 00:19:21
    bigger if you make these
  • 00:19:24
    smaller then you still you still gain a
  • 00:19:28
    lot lot like you can gain a lot of flops
  • 00:19:32
    um because you have to run the inner
  • 00:19:34
    part less because you make the patches
  • 00:19:38
    larger and as long as the outer parts
  • 00:19:40
    are kind of lightweight uh they don't
  • 00:19:42
    matter and you can get away with having
  • 00:19:45
    a bigger model because you spend less
  • 00:19:47
    flops because you run it less
  • 00:19:49
    often right some astute observers might
  • 00:19:54
    have realized that hey you know this
  • 00:19:58
    local this local decoder when does it
  • 00:20:01
    know when to stop
  • 00:20:04
    um it you know it's just it gives it
  • 00:20:06
    gets one thing and it's just supposed to
  • 00:20:09
    produce uh tokens like characters from
  • 00:20:12
    it we'll get to that in just a bit and
  • 00:20:15
    the the second part is obviously how do
  • 00:20:19
    we know where the patch boundaries are
  • 00:20:20
    how do you know how to group the
  • 00:20:22
    characters into tokens and the answer to
  • 00:20:24
    these two things is kind of the same and
  • 00:20:27
    that's with their
  • 00:20:29
    what they call uh entropy based grouping
  • 00:20:33
    of bytes into
  • 00:20:35
    patches um
  • 00:20:38
    so the entropy based grouping is a
  • 00:20:42
    concept that's as I said kind of
  • 00:20:46
    um yeah it's what they essentially do is
  • 00:20:50
    they train a small transformer so a byte
  • 00:20:54
    level
  • 00:20:55
    Transformer um notably this is not this
  • 00:20:59
    thing right here so they have a
  • 00:21:01
    separate llm that's small that's just on
  • 00:21:05
    the bytes so that actually is a
  • 00:21:08
    character level llm that's just trained
  • 00:21:11
    on a corpus
  • 00:21:13
    and that decides where to split in the
  • 00:21:18
    following way you feed text into it it
  • 00:21:23
    will predict the next token and if the
  • 00:21:27
    entropy of the prediction so this
  • 00:21:31
    distribution right here if the entropy
  • 00:21:33
    of the prediction of the next character
  • 00:21:35
    is very high meaning like what is a high
  • 00:21:39
    entropy a high entropy is a distribution
  • 00:21:42
    that's
  • 00:21:43
    like you know could be could be any of
  • 00:21:47
    these whereas a low entropy distribution
  • 00:21:49
    is like oh it's this it's this one it's
  • 00:21:53
    this one definitely so high entropy
  • 00:21:55
    meaning it's not sure that's where where
  • 00:21:58
    you split so if the next
  • 00:22:01
    character
  • 00:22:02
    is above a threshold of entropy in the
  • 00:22:05
    prediction of this byte level llm that's
  • 00:22:10
    where you make a
  • 00:22:11
    split that that's just a just a decision
  • 00:22:14
    they make right um it's a it's a design
  • 00:22:17
    choice that they make but there's good
  • 00:22:20
    reason right there's there's good reason
  • 00:22:22
    to split by entropy uh because what you
  • 00:22:25
    do is you keep the stuff together
  • 00:22:28
    together where you're sure so whenever
  • 00:22:32
    you know
  • 00:22:33
    bet you know
  • 00:22:35
    bet bet like the 'er' that's very clear
  • 00:22:40
    and therefore you want to keep it
  • 00:22:42
    together because it kind of is one unit
  • 00:22:44
    like whenever you're very sure what
  • 00:22:46
    comes you can very much argue that the
  • 00:22:49
    thing is actually should be treated as a
  • 00:22:51
    single unit when you're not sure that
  • 00:22:54
    means there could be multiple
  • 00:22:55
    continuations that's when you want to
  • 00:22:57
    split it up and say oh well here you
  • 00:23:00
    know this these two things need to be
  • 00:23:02
    treated separately because in an
  • 00:23:04
    alternative Universe there there's a
  • 00:23:06
    different continuation here that I need
  • 00:23:07
    to take into account and then you better
  • 00:23:11
    off if that first part is the same token
  • 00:23:14
    each time and not if the entire thing is
  • 00:23:17
    like a different token and you know
  • 00:23:19
    nothing
  • 00:23:21
    anymore all right um what I want to
  • 00:23:24
    say yeah and this is also the answer on
  • 00:23:27
    how the local decoder stops decoding so
  • 00:23:31
    it decodes decodes decodes and when the
  • 00:23:33
    next and then it it just always asks
  • 00:23:36
    this small llm here what's the entropy
  • 00:23:39
    of what I'm doing right like what's the
  • 00:23:40
    entropy of the next token in your
  • 00:23:43
    estimation like this local model this
  • 00:23:45
    knows nothing of the latent Transformer
  • 00:23:47
    what it just looks at the stuff that's
  • 00:23:49
    being produced and if the next if the
  • 00:23:53
    next token according to it has a high
  • 00:23:56
    entropy that's where we end the the
  • 00:23:59
    patch okay so the process is as
  • 00:24:03
    follows we have some we have some
  • 00:24:06
    text and let's say we're at a new patch
  • 00:24:09
    boundary okay the local encoder looks at
  • 00:24:13
    the patch sorry we're here let's let's
  • 00:24:16
    start it the we run the small llm
  • 00:24:20
    forward right boop boop boop boop boop
  • 00:24:23
    until the entropy is above the threshold
  • 00:24:26
    that's where we say ah okay that's a
  • 00:24:28
    patch okay our patch is from here to
  • 00:24:30
    here then that local encoder looks at
  • 00:24:32
    the characters in here and and takes
  • 00:24:36
    there is an there's a embedding table
  • 00:24:39
    from byte to embedding notably you only
  • 00:24:43
    need 256 entries fixed right this
  • 00:24:47
    doesn't grow so it looks up the
  • 00:24:51
    embeddings of the constituents and
  • 00:24:54
    Aggregates them into a patch embedding and bing
  • 00:24:56
    it's trained to do that then then you
  • 00:24:58
    run the latent transformer for one step
  • 00:25:01
    let's assume this doesn't exist yet for
  • 00:25:03
    one step and produce the
  • 00:25:05
    next the next latent um output token the
  • 00:25:11
    local decoder takes this and
  • 00:25:15
    starts um let's assume let's assume that
  • 00:25:19
    actually let's assume the local decoder
  • 00:25:21
    is here currently right the local
  • 00:25:23
    decoder takes this and starts
  • 00:25:27
    producing uh um uh tokens it starts
  • 00:25:30
    decoding like an llm except conditioned
  • 00:25:33
    on This Global signal right here so it's
  • 00:25:35
    like okay this one okay and I'm produce
  • 00:25:39
    this one I'm produce this one and each
  • 00:25:41
    time it asks the small llm what it
  • 00:25:44
    thinks about the next token in the
  • 00:25:46
    sequence it has decoded if the small as
  • 00:25:49
    soon as the small llm says oh wait the
  • 00:25:51
    entropy is quite High then it's like
  • 00:25:54
    okay stop it here I'm going to I'm going
  • 00:25:57
    to stop it here please go back to the
  • 00:26:00
    next thing um
  • 00:26:03
    and uh you know start the next cycle of
  • 00:26:07
    the
  • 00:26:08
    process we almost at least that's how I
  • 00:26:11
    think it goes uh maybe I'm I'm totally
  • 00:26:15
    wrong but that's what I can read from
  • 00:26:16
    the paper the paper is a bit sparse on
  • 00:26:18
    these exact details um but and I haven't
  • 00:26:22
    read the code I have to apologize for
  • 00:26:24
    that but the code is available so you
  • 00:26:26
    can go and verify or or refute that um
  • 00:26:30
    there is one extra thing there's one
  • 00:26:34
    little bit of extra info that you need
  • 00:26:36
    right here and that's
  • 00:26:39
    usually usually when you do auto
  • 00:26:42
    regressive decoding you take what you've
  • 00:26:45
    produced and you feed it back right um
  • 00:26:49
    into your own model however that doesn't
  • 00:26:53
    work here because this local decoder it
  • 00:26:56
    doesn't take text as a an input it
  • 00:26:58
    doesn't take characters as an input it
  • 00:27:01
    just takes this signal right here as an
  • 00:27:04
    input
  • 00:27:05
    so what does take characters as an input
  • 00:27:08
    well that local encoder thing takes
  • 00:27:10
    characters as an input so there is a
  • 00:27:14
    hidden skip connection from like here to
  • 00:27:18
    here so when you when the local decoder
  • 00:27:21
    produces a character at least that's
  • 00:27:23
    again my understanding you run this
  • 00:27:27
    thing through through the local encoder
  • 00:27:30
    you know here get its local encoder
  • 00:27:33
    embedding but you don't go to the latent
  • 00:27:35
    Transformer because you're not done with
  • 00:27:36
    a patch yet you just feed this back into
  • 00:27:39
    the local decoder which then has like a
  • 00:27:42
    a latent a latent representation that it
  • 00:27:45
    can decode the next token from so the
  • 00:27:47
    loop between local decoder go to local
  • 00:27:49
    encoder go to local decoder that's kind
  • 00:27:51
    of the outer loop that runs in order to
  • 00:27:54
    produce these tokens and once you're
  • 00:27:56
    done with a patch then you know you
  • 00:27:58
    start again to ask the local decoder
  • 00:28:01
    about the next patch um to to or sorry
  • 00:28:04
    about the patch that you've just
  • 00:28:06
    produced embed it get it into the latent
  • 00:28:09
    Transformer from that you get next
  • 00:28:11
    Global signal and then you do that outer
  • 00:28:13
    loop again in order to produce the
  • 00:28:15
    individual bytes until the small LM says
  • 00:28:18
    again patches
  • 00:28:20
    over again that's how I personally
  • 00:28:24
    understand it there is yeah so so here
  • 00:28:28
    we have exactly we have the encoder
  • 00:28:31
    decoder so um the
  • 00:28:34
    encoder gets byte embeddings uh uses and
  • 00:28:38
    then
  • 00:28:38
    uses cross attention so it knows it
  • 00:28:42
    those should be um tokenized into three
  • 00:28:45
    different patches so it uses cross
  • 00:28:47
    attention from the patch um to Only The
  • 00:28:53
    Tokens that are part of the patch by the
  • 00:28:56
    way there are two here and not so
  • 00:29:00
    they're three different patches but they
  • 00:29:03
    use multi-head attention so this just
  • 00:29:05
    represents a two-headed uh multi-head
  • 00:29:08
    attention with keys into here but you
  • 00:29:11
    still have hidden states you have many
  • 00:29:13
    layers so you still have hidden States
  • 00:29:15
    and these hidden States is what you give
  • 00:29:19
    to the
  • 00:29:20
    decoder um which does the exact opposite
  • 00:29:22
    so its keys are sorry its queries are
  • 00:29:27
    the individual bites that you produce
  • 00:29:28
    and its keys and values are the global
  • 00:29:31
    signal that you get from the latent
  • 00:29:35
    Transformer all right there is one more
  • 00:29:39
    thing now I'm going to guess that this
  • 00:29:42
    thing here the encoder hash n-gram
  • 00:29:45
    embeddings they added because it just
  • 00:29:48
    works better like this seems very much
  • 00:29:50
    like a thing you add after that so they
  • 00:29:53
    say look we do have we we have a
  • 00:29:59
    [Music]
  • 00:30:01
    um we model each byte individually so
  • 00:30:06
    when we do encoding each byte gets like
  • 00:30:10
    encoded
  • 00:30:12
    um by itself and as part of a byte
  • 00:30:19
    n-gram so you can see that they build up
  • 00:30:22
    not just embedding tables or the byte to
  • 00:30:26
    embedding but they build up several
  • 00:30:29
    embedding tables so there is an
  • 00:30:30
    embedding table um for byte 2-grams or
  • 00:30:35
    3-grams um there is one for byte 4-grams
  • 00:30:39
    for byte 5-grams and so on up until byte 8-grams
  • 00:30:44
    and now you ask well aren't the byte
  • 00:30:46
    8-grams huge and that's exactly what we tried
  • 00:30:49
    to avoid yes they are that's why you
  • 00:30:52
    just kind of you just kind of hash them
  • 00:30:55
    and then modulus by the size of the
  • 00:30:58
    embedding table so you're like you're
  • 00:31:02
    essentially counting on the fact that
  • 00:31:04
    yes there are going to be hash
  • 00:31:05
    collisions like some of the byte
  • 00:31:06
    3-grams are going to hit the same embedding
  • 00:31:08
    right here but those hash collisions are
  • 00:31:10
    kind of orthogonal things in meaning and
  • 00:31:13
    so it's probably fine
  • 00:31:16
    um I'm going to I'm going to guess it's
  • 00:31:19
    just a way to get n-grams in there so when
  • 00:31:21
    you look at a byte for example the
  • 00:31:23
    letter T right here you also take the
  • 00:31:26
    embedding for the 3-gram the 4-gram the 5-gram
  • 00:31:29
    the 6-gram the 7-gram and the 8-gram in
  • 00:31:32
    front of that byte and you aggregate all
  • 00:31:36
    of these together into the byte
  • 00:31:39
    embedding so to say so the local encoder
  • 00:31:44
    doesn't operate purely on the byte
  • 00:31:46
    embedding as I said before but it
  • 00:31:48
    actually operates on a superposition of
  • 00:31:53
    byte n-gram embeddings that this
  • 00:31:57
    puts this into context with the bytes
  • 00:32:00
    before it that to me it it just seems
  • 00:32:04
    like a kind of a a way to get kind of
  • 00:32:08
    fake it's it's it's a bit of like you
  • 00:32:10
    get you like tokenization is back that's
  • 00:32:14
    what it tells me except instead of
  • 00:32:16
    tokens it's
  • 00:32:18
    n-grams so yeah make of that as you
  • 00:32:23
    will I don't want to you know talk too
  • 00:32:27
    much more I I think that's kind of it
  • 00:32:28
    for the model design and how they decode
  • 00:32:31
    and so on when they experiment around
  • 00:32:34
    they find they can actually make larger
  • 00:32:37
    patches than regular tokenization so
  • 00:32:40
    they um they say look our our patches we
  • 00:32:45
    can go to patch sizes of like uh what do
  • 00:32:49
    I say look Trends between yeah so they
  • 00:32:55
    can go they they can achieve kind of
  • 00:32:57
    performance of like llama 2 and
  • 00:32:59
    llama 3 models while using significantly
  • 00:33:01
    larger patch sizes so while llama 2 and
  • 00:33:05
    llama 3 byte pair encodings have an
  • 00:33:06
    average token size of 3.7 and 4.4 bytes
  • 00:33:10
    so we can achieve similar scaling
  • 00:33:12
    Trends with an average patch size of six
  • 00:33:14
    and even eight bytes um so you you have
  • 00:33:18
    that handle on that tradeoff and that's
  • 00:33:20
    pretty cool I have to say they do some
  • 00:33:23
    experiments where they show that yeah
  • 00:33:25
    they can remain competitive with these
  • 00:33:27
    Llama models but also they're a lot better
  • 00:33:31
    in you know in tasks where you actually
  • 00:33:34
    need to look at the individual
  • 00:33:37
    characters in a token because given that
  • 00:33:40
    they operate on bite embeddings they can
  • 00:33:43
    now also you know very fine-grainedly
  • 00:33:46
    train models that actually need to
  • 00:33:50
    look at the individual things whereas if
  • 00:33:52
    you obviously just have fixed tokens and
  • 00:33:55
    you look up their embeddings in a table
  • 00:33:57
    that that doesn't work as well so but
  • 00:33:59
    it's it's kind of like it's kind of
  • 00:34:01
    cheesing a bit but just demonstrating
  • 00:34:03
    hey look spelling inverse we're doing
  • 00:34:06
    like really really well compared to the
  • 00:34:08
    Llama models which was to be expected
  • 00:34:11
    but it is nice that they perform an
  • 00:34:13
    experiment to actually show that what's
  • 00:34:16
    also interesting is that um translation
  • 00:34:19
    works better for kind of languages that
  • 00:34:22
    are under represented or that are you
  • 00:34:24
    know kind of tokenized in a in a non
  • 00:34:28
    like Say Non in a way other than like
  • 00:34:32
    your standard languages are tokenized
  • 00:34:34
    and that's also pretty
  • 00:34:36
    cool all right that's I want to don't
  • 00:34:40
    want to dive too much here more uh
  • 00:34:43
    please look at the rest of the paper
  • 00:34:45
    it's pretty interesting it's pretty
  • 00:34:46
    thorough the experiments are pretty cool
  • 00:34:48
    and they pay a lot of attention to like
  • 00:34:50
    control for various parameters in
  • 00:34:52
    because it is really hard if if you know
  • 00:34:55
    your model operates on different
  • 00:34:56
    fundamental units
  • 00:34:58
    how do you even compare to other models
  • 00:35:00
    and they do good job at that um there
  • 00:35:03
    are several room for improvements
  • 00:35:05
    notably you could train more things
  • 00:35:07
    jointly for example that small language
  • 00:35:09
    model that does the patching and so on
  • 00:35:12
    and as of now this in terms of um in
  • 00:35:17
    terms of like raw runtime uh this still
  • 00:35:20
    lags behind because obviously we've
  • 00:35:22
    spent like a decade hyper optimizing or
  • 00:35:26
    at least half a decade hyper optimizing
  • 00:35:28
    fixed tokenization autoregressive
  • 00:35:31
    llms uh yeah with things like they name
  • 00:35:33
    here such as Flex attention um and we
  • 00:35:37
    and obviously that would still need to
  • 00:35:41
    be done for these patch level models in
  • 00:35:45
    terms of actually getting their runtime
  • 00:35:47
    there so when they compare something
  • 00:35:50
    they like match flops which is probably
  • 00:35:53
    a pretty good measure that's kind of
  • 00:35:55
    independent of raw optimization
  • 00:35:58
    all right that's it as I said read the
  • 00:36:01
    paper uh subscribe for more reviews and
  • 00:36:04
    thank you so much if you read this as it
  • 00:36:07
    comes out then Holly Jolly uh Christmas
  • 00:36:10
    and uh Happy New Year and see you around
  • 00:36:13
    bye-bye
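
A few of the ideas in the transcript, sketched in code. First, the bits-per-byte metric discussed around 00:01:40: since models with different tokenizers cannot be compared on per-token perplexity, the loss is normalized by the number of UTF-8 bytes instead. A minimal sketch, assuming you already have the model's summed negative log-likelihood in nats over a text:

    import math

    def bits_per_byte(total_nll_nats, text):
        """Normalize a model's total NLL (in nats) by the UTF-8 byte count,
        giving a tokenizer-independent quality measure (lower is better)."""
        n_bytes = len(text.encode("utf-8"))
        return total_nll_nats / (math.log(2) * n_bytes)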
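
Second, the two-tier decoding loop reconstructed around 00:24:00 to 00:28:00. The component names below are placeholders, not the paper's API; the sketch only shows the control flow the video describes: the local encoder and decoder run once per byte, the latent transformer once per patch, and the small entropy model decides when a patch ends.

    def generate(prompt_bytes, local_encoder, latent_transformer, local_decoder,
                 next_byte_entropy, entropy_threshold, max_bytes=256):
        """Sketch of byte-latent generation with placeholder components.

        next_byte_entropy(seq) is the small byte-level LM's entropy for the byte
        that would follow seq; a high value signals the end of a patch."""
        out = bytearray(prompt_bytes)
        while len(out) < max_bytes:
            patches = local_encoder(bytes(out))          # bytes -> patch embeddings
            patch_signal = latent_transformer(patches)   # one step of the big model
            # decode bytes for this patch until the entropy model gets uncertain
            while len(out) < max_bytes:
                next_byte = local_decoder(patch_signal, bytes(out))  # returns 0-255
                out.append(next_byte)
                if next_byte_entropy(bytes(out)) > entropy_threshold:
                    break  # patch over: go back up to the latent transformer
        return bytes(out)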
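
Third, the hash n-gram embeddings described around 00:30:00: each byte's input representation also pulls in embeddings for the 3- to 8-grams ending at that byte, looked up by hashing the n-gram modulo a fixed table size so the tables stay bounded and collisions are simply tolerated. Table size, width, and summation as the combination rule are assumptions for illustration:

    import numpy as np

    EMB_DIM = 16
    TABLE_SIZE = 50_000                  # fixed per-n table; collisions accepted
    byte_table = np.random.randn(256, EMB_DIM)
    ngram_tables = {n: np.random.randn(TABLE_SIZE, EMB_DIM) for n in range(3, 9)}

    def byte_input_embedding(byte_seq, i):
        """Embedding for position i: the byte itself plus hashed n-gram lookups
        for the 3..8-grams that end at position i."""
        emb = byte_table[byte_seq[i]].copy()
        for n in range(3, 9):
            if i + 1 >= n:
                gram = bytes(byte_seq[i + 1 - n : i + 1])
                slot = hash(gram) % TABLE_SIZE   # hashing keeps the table size fixed
                emb += ngram_tables[n][slot]
        return emb
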
Tags
  • Byte Latent Transformer
  • tokenization
  • patches
  • scalability
  • language models
  • embedding
  • entropy
  • dynamic encoding
  • LLM
  • language performance