Byte Latent Transformer: Patches Scale Better Than Tokens (Paper Explained)

00:36:14
https://www.youtube.com/watch?v=loaTGpqfctI

Summary

TLDR: The video examines a research paper on the Byte Latent Transformer, a model whose architecture does not rely on a fixed tokenization. Instead, it introduces "patches" to segment the text, which improves scalability and overcomes vocabulary limitations. Using a two-level architecture with a local encoder and a latent transformer, the model delivers competitive performance while handling tasks that require attention to individual characters. The experimental results show that patch-based models outperform traditional approaches, particularly for under-represented languages.

Key takeaways

  • 📄 Introduction of the Byte Latent Transformer, which uses patches instead of tokens.
  • 🔍 Better scalability and handling of limited-vocabulary issues.
  • 🆕 Patches enable dynamic tokenization.
  • 🚀 Performs better on language tasks that require character-level granularity.
  • 🌐 Advantages for under-represented languages in text processing.

Timeline

  • 00:00:00 - 00:05:00

    Introduction to the 'Byte Latent Transformer' architecture, which is claimed to outperform classic tokenization-based models. The new approach avoids fixed tokenization and uses dynamic 'patches', showing better scaling properties than token-based models.

  • 00:05:00 - 00:10:00

    Comparison between the patch-based model and models using classic tokenization such as byte pair encoding shows superior scaling behaviour. The plots illustrate that, for the same number of training FLOPS, the patch-based model achieves better results beyond a certain threshold.

  • 00:10:00 - 00:15:00

    Presentation of the 'Byte Latent Transformer' architecture, which consists of two levels: an inner level based on a classic LLM operating on patch embeddings, and an outer level that predicts bytes from those embedding representations, highlighting how the two levels operate differently.

  • 00:15:00 - 00:20:00

    Explanation of how text is turned into tokens, covering the limits of classic tokenization and alternatives such as byte pair encoding. The problem of an embedding table that grows too large as the vocabulary grows is also discussed.

  • 00:20:00 - 00:25:00

    Analysis of tokenization methods such as byte pair encoding and WordPiece encoding and their respective problems, notably the out-of-vocabulary problem. The paper addresses the need to balance vocabulary size against model performance. (A sketch of the BPE merge loop appears after this timeline.)

  • 00:25:00 - 00:30:00

    A dynamic tokenization approach is proposed that avoids the limitations of a fixed vocabulary, laying the groundwork for 'patch' embeddings, which are representations of groups of characters.

  • 00:30:00 - 00:36:14

    The entropy-based grouping method used to determine patch boundaries: an entropy threshold identifies where splits should happen, and the same signal drives the decoding process at each step. (A minimal code sketch of this boundary rule appears below.)
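
To make the byte pair encoding discussion above (00:20:00 - 00:25:00) concrete, here is a minimal sketch of the greedy merge loop behind BPE. This is illustrative only, not the paper's or Llama's tokenizer, and the corpus handling is deliberately naive:

    from collections import Counter

    def learn_bpe_merges(corpus, num_merges):
        """Greedy BPE sketch: repeatedly merge the most frequent adjacent pair."""
        # start from raw UTF-8 bytes, i.e. the fixed 0-255 base vocabulary
        words = [list(w.encode("utf-8")) for w in corpus.split()]
        merges = []
        for _ in range(num_merges):
            pairs = Counter()
            for w in words:
                pairs.update(zip(w, w[1:]))      # count adjacent symbol pairs
            if not pairs:
                break
            best = pairs.most_common(1)[0][0]
            merges.append(best)
            # rewrite every word, fusing each occurrence of the best pair
            new_words = []
            for w in words:
                out, i = [], 0
                while i < len(w):
                    if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                        out.append(best)         # the merged pair becomes one symbol
                        i += 2
                    else:
                        out.append(w[i])
                        i += 1
                new_words.append(out)
            words = new_words
        return merges

Calling learn_bpe_merges on a small corpus returns the most frequent pairs in merge order; a real tokenizer then applies those merges to new text, which is exactly the fixed-vocabulary step the paper replaces with dynamic patching.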
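
The entropy-based grouping in the last timeline entry can be sketched as follows. This is a minimal illustration, not the paper's implementation: next_byte_probs stands in for the small byte-level language model, and the threshold value is made up for the example.

    import math

    ENTROPY_THRESHOLD = 2.0  # illustrative value, not taken from the paper

    def entropy_bits(probs):
        """Shannon entropy (in bits) of a next-byte distribution."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    def patch_boundaries(byte_seq, next_byte_probs):
        """Start a new patch wherever the small byte-level LM is uncertain.

        next_byte_probs(prefix) is assumed to return a length-256 probability
        distribution over the next byte, produced by a separately trained
        byte-level model."""
        boundaries = [0]
        for i in range(1, len(byte_seq)):
            if entropy_bits(next_byte_probs(byte_seq[:i])) > ENTROPY_THRESHOLD:
                boundaries.append(i)  # high entropy: many continuations are plausible, so split
        return boundaries

Low-entropy positions (such as the middle of a common word) stay inside one patch; high-entropy positions, where many continuations are plausible, open a new one.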



Video Q&A

  • What is the Byte Latent Transformer?

    It is a new language-processing model that uses patches instead of fixed tokens.

  • How does the model improve scalability?

    It scales better thanks to its ability to form dynamic groups of characters rather than depending on a fixed vocabulary.

  • What problem does the model solve?

    It addresses the out-of-vocabulary problem and offers dynamic tokenization.

  • What is a patch in this context?

    A patch is a dynamic group of characters used for encoding, replacing classic tokens. (A sketch of how byte embeddings are pooled into a patch embedding appears after this Q&A.)

  • What are the advantages of patches over fixed tokens?

    Patches give a better representation of under-represented languages and allow finer-grained handling of sequences.
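
As a rough picture of what a patch embedding is (see the last two answers above), the sketch below mean-pools the byte embeddings inside each patch into one vector. The real model uses a learned local encoder with cross-attention rather than mean pooling, and the table width here is arbitrary.

    import numpy as np

    EMB_DIM = 16                                 # arbitrary width for illustration
    byte_table = np.random.randn(256, EMB_DIM)   # fixed-size byte embedding table

    def patch_embeddings(byte_seq, boundaries):
        """Toy stand-in for the local encoder: one pooled vector per patch."""
        bounds = list(boundaries) + [len(byte_seq)]
        patches = []
        for start, end in zip(bounds, bounds[1:]):
            vecs = byte_table[list(byte_seq[start:end])]
            patches.append(vecs.mean(axis=0))    # the paper learns this step instead
        return np.stack(patches)

For example, patch_embeddings(b"determined", [0, 5]) yields two vectors, one per patch, and vectors like these are what the latent transformer actually consumes.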


Subtitles (en)
  • 00:00:00
    hello there today we're looking at the
  • 00:00:02
    paper Byte Latent Transformer patches
  • 00:00:04
    scale better than tokens this paper in a
  • 00:00:08
    sense does away with classic fixed
  • 00:00:11
    vocabulary based
  • 00:00:13
    tokenization and in doing so develops a
  • 00:00:16
    new architecture called the Byte Latent
  • 00:00:18
    Transformer and in their experiments
  • 00:00:21
    they show that this as the paper says
  • 00:00:24
    scales better than a sort of classic a
  • 00:00:28
    model that operates on classic basically
  • 00:00:30
    tokenized tokens so the thing that
  • 00:00:34
    they're doing is they do away with
  • 00:00:36
    tokenization they find a different way
  • 00:00:39
    of splitting text into pieces a dynamic
  • 00:00:43
    way and they call these pieces patches
  • 00:00:45
    so patches are like tokens um except
  • 00:00:50
    they need to distinguish them verbally
  • 00:00:53
    so it's clear which one you're talking
  • 00:00:55
    about and then once you let a model run
  • 00:00:58
    on that you do get better scaling
  • 00:01:00
    properties and that's kind of the
  • 00:01:02
    central claim of this paper that if you
  • 00:01:04
    look at models that are
  • 00:01:08
    um here they compare to byte pair
  • 00:01:11
    encoding so this is kind of classic
  • 00:01:13
    tokenization used in the Llama models if
  • 00:01:16
    you compare that with a model that
  • 00:01:18
    operates on patches um then you do get
  • 00:01:22
    better scaling Behavior as you can see
  • 00:01:24
    here so the red orange lines are the
  • 00:01:27
    patch based ones the blue line is the
  • 00:01:30
    kind of classically tokenized ones now
  • 00:01:34
    there are a lot of like choices here
  • 00:01:37
    that make this graphic so the Y axis here
  • 00:01:40
    is what they call bits per byte which is
  • 00:01:42
    kind of so since you if you don't deal
  • 00:01:45
    with tokens and if you deal with
  • 00:01:47
    especially different tokenization you
  • 00:01:49
    can't really use perplexity as a measure
  • 00:01:52
    because that kind of necessitates that
  • 00:01:54
    you operate over the same kind of
  • 00:01:56
    fundamental pieces so bits per byte is
  • 00:02:00
    sort of the analogous measure to entropy
  • 00:02:04
    uh sorry to uh
  • 00:02:06
    perplexity so think of this as kind of
  • 00:02:09
    perplexity and then here are the x-axis
  • 00:02:11
    and that's important that's total
  • 00:02:12
    training flops so they always consider
  • 00:02:14
    flop matched models uh because they are
  • 00:02:19
    like their model operates differently it
  • 00:02:21
    has like an outer layer and an inner
  • 00:02:22
    layer and don't you don't need to always
  • 00:02:24
    execute the inner layer for each outer
  • 00:02:28
    step so the outer step ends up running
  • 00:02:31
    more often but the inner step ends up
  • 00:02:33
    running less often and that's why you
  • 00:02:36
    can uh achieve you can achieve uh better
  • 00:02:41
    or you can achieve bigger models let's
  • 00:02:44
    say with um with the patch based models
  • 00:02:48
    because your patches are bigger and you
  • 00:02:51
    need to run them less often so if you
  • 00:02:54
    invest the same amount of training flops
  • 00:02:56
    then you can kind of become better so
  • 00:03:01
    there is a lot of ifs in this kind of oh
  • 00:03:03
    it scales better what they keep constant
  • 00:03:06
    is the training flops so for the same
  • 00:03:09
    amount of training flops after a certain
  • 00:03:11
    threshold you'll do better with the
  • 00:03:13
    patch based models just because they
  • 00:03:15
    have better scaling property um why
  • 00:03:18
    exactly that is that could be it's
  • 00:03:22
    probably a mixture of the part of their
  • 00:03:23
    architecture so let's dive into that
  • 00:03:25
    architecture here is the Byte Latent
  • 00:03:27
    Transformer as I said it's kind of a two
  • 00:03:29
    tier system so the inner system right
  • 00:03:32
    here that is um your very let's say
  • 00:03:36
    regular llm type Transformer there's
  • 00:03:41
    absolutely nothing special about it
  • 00:03:43
    except it that it operates on kind of
  • 00:03:47
    these pieces right here now usually
  • 00:03:49
    these would be token and token
  • 00:03:51
    embeddings so these here would be token
  • 00:03:54
    embeddings and it predicts the next
  • 00:03:58
    token um if you view a Transformer so
  • 00:04:04
    you have I don't know token and the
  • 00:04:06
    history going in and then you have a
  • 00:04:10
    Transformer and it will output uh a
  • 00:04:14
    distribution over next tokens right like
  • 00:04:17
    a some sort of a soft Max probability
  • 00:04:20
    distribution over next tokens if you
  • 00:04:23
    however take one step back and you
  • 00:04:25
    consider what happens in the last layer
  • 00:04:27
    in the last layer you do have a an so
  • 00:04:31
    here is the model out comes an
  • 00:04:34
    embedding so this is like the hidden
  • 00:04:37
    signal in the second last layer the last
  • 00:04:41
    layer is a matrix that is of
  • 00:04:46
    Dimension um here this Dimension is the
  • 00:04:51
    size of H and then this Dimension is the
  • 00:04:54
    size of the um token or of the
  • 00:04:57
    vocabulary let's say
  • 00:05:04
    vocab right so this is what actually
  • 00:05:06
    does the classification multiplying
  • 00:05:08
    these two things together will and then
  • 00:05:10
    applying a soft Max will give you this
  • 00:05:13
    distribution so in a sense um you could
  • 00:05:17
    argue that uh
  • 00:05:20
    the that this here you know that what
  • 00:05:23
    the Transformer actually does is even a
  • 00:05:26
    regular Transformer it actually predicts
  • 00:05:28
    the embedding of the next
  • 00:05:30
    token and there is one more thing if you
  • 00:05:34
    do what's called weight tying or
  • 00:05:36
    embedding tying so here is the tokens
  • 00:05:39
    coming in here is your embedding table
  • 00:05:41
    so for each token you have an embedding
  • 00:05:43
    in here some models actually tie those
  • 00:05:46
    together meaning they use the same
  • 00:05:47
    parameters saves them a lot of
  • 00:05:49
    parameters and it kind of has the same
  • 00:05:52
    idea um this maps from token IDs to
  • 00:05:56
    embedding space and this uh sorry this
  • 00:05:59
    this Matrix here kind of maps from
  • 00:06:00
    embedding space to back to token IDs
  • 00:06:03
    right with this probability distribution
  • 00:06:05
    so in that sense it's even more true
  • 00:06:07
    that the latent Transformer just kind of
  • 00:06:10
    out predicts the embedding or sorry any
  • 00:06:14
    language model uh Transformer that is
  • 00:06:18
    you know has at the end uh outputs
  • 00:06:22
    logits actually just predicts effectively
  • 00:06:25
    the embedding of the next
  • 00:06:27
    token so that's you know being said in
  • 00:06:30
    in the inner side here there is just a
  • 00:06:33
    very regular Transformer Auto regressive
  • 00:06:37
    llm that takes in
  • 00:06:41
    things and predicts the next thing from
  • 00:06:45
    them all in embedding space now usually
  • 00:06:50
    again usually these are tokens so maybe
  • 00:06:52
    it's worth briefly how uh we get to
  • 00:06:55
    those tokens so if we have a piece of
  • 00:06:58
    text for example
  • 00:07:00
    uh this piece of text right here data
  • 00:07:02
    is primarily determined by the number
  • 00:07:05
    okay what we want to do is we want to
  • 00:07:08
    split those things up into individual
  • 00:07:11
    pieces that our models can operate over
  • 00:07:14
    now one method of doing that would be to
  • 00:07:17
    just split every single character
  • 00:07:19
    including the white spaces right becomes
  • 00:07:22
    one piece but that's not really uh the
  • 00:07:25
    best because it will result in very long
  • 00:07:28
    sequences for for a given text and you
  • 00:07:31
    know that Transformers scale by sequence
  • 00:07:33
    length quadratically which isn't
  • 00:07:36
    necessarily doesn't make us happy so our
  • 00:07:39
    context window of 128,000 tokens will
  • 00:07:43
    just be 128,000 characters in the end so
  • 00:07:47
    can we do better well yes we could split
  • 00:07:50
    uh by for example white space so data
  • 00:07:54
    sorry data becomes one token is becomes
  • 00:07:58
    one token primarily becomes one token
  • 00:08:00
    and so on this was very standard for a
  • 00:08:04
    very long time and what what you have to
  • 00:08:08
    do in all of these things if you operate
  • 00:08:11
    with tokens is you're going to have a
  • 00:08:12
    table that is um mapping your tokens to
  • 00:08:17
    an embedding as we said before and then
  • 00:08:20
    every token needs to have a
  • 00:08:22
    corresponding embedding Vector that you
  • 00:08:24
    can look up so the word data has to have
  • 00:08:28
    an embedding vector in here somehow uh
  • 00:08:32
    the word is has to have an embedding
  • 00:08:34
    Vector in here somehow right um and you
  • 00:08:39
    can already see the problem there is
  • 00:08:41
    this table is become going to become
  • 00:08:43
    really really big and the even bigger
  • 00:08:45
    problem would be that let's say you
  • 00:08:49
    derive you have to derive this table
  • 00:08:51
    somehow so you take a training Corpus
  • 00:08:54
    and you look at all the words in there
  • 00:08:55
    and that's how you initialize the table
  • 00:08:58
    but it's very likely like English is
  • 00:09:00
    such a big language that in your test
  • 00:09:02
    data set there's going to be a word that
  • 00:09:06
    you've never seen in the training data
  • 00:09:08
    set like a name or maybe a number or
  • 00:09:12
    just a a word that you've never seen uh
  • 00:09:16
    for example you might actually never
  • 00:09:18
    have seen the word determined before
  • 00:09:21
    there is some people have tried to
  • 00:09:24
    mitigate some so they what they do is
  • 00:09:26
    like stemming or something like this so
  • 00:09:29
    instead of determined you just say
  • 00:09:31
    determine um so you and you say oh those
  • 00:09:35
    two are the same the same word
  • 00:09:37
    essentially so you only have one entry
  • 00:09:39
    in the embedding table instead of having
  • 00:09:41
    one for determine determined determining
  • 00:09:44
    uh determinization and whatnot so this
  • 00:09:48
    is just one token but still the problem
  • 00:09:50
    of like out of vocabulary people used to
  • 00:09:54
    call that was really really big and
  • 00:09:58
    problematic and people came up with
  • 00:10:00
    alternatives to that and those
  • 00:10:02
    alternatives are what currently are very
  • 00:10:06
    very popular um so what are those
  • 00:10:08
    Alternatives those Alternatives if you
  • 00:10:11
    look at things like byte pair encoding
  • 00:10:13
    or word piece encoding or things like
  • 00:10:15
    that uh they all are of the same
  • 00:10:18
    principle they say there
  • 00:10:21
    exists um a set of like unitary things
  • 00:10:27
    and those unitary Things Are
  • 00:10:30
    they can they can be used to make up all
  • 00:10:32
    of the text that we see so in word piece
  • 00:10:36
    um those unitary things would be just
  • 00:10:40
    the all the characters that exist so a b
  • 00:10:43
    c d da da d d da da until like Z then
  • 00:10:47
    capital A then like zero then the
  • 00:10:51
    question mark and so on like it's still
  • 00:10:54
    a lot but it's not infinite right with
  • 00:10:58
    with a decent amount of single symbols
  • 00:11:01
    you can represent any
  • 00:11:04
    um any sequence of characters and you
  • 00:11:08
    might want to say well aren't we now
  • 00:11:10
    back to the same problem where character
  • 00:11:13
    level isn't really good and that's yes
  • 00:11:15
    okay so let's say we have we just do
  • 00:11:18
    ASCII lowercase okay let's say we have a
  • 00:11:21
    to z that's good so we can represent
  • 00:11:24
    everything however we know that the
  • 00:11:27
    combination ER is very very frequent in
  • 00:11:31
    the language so let's just assign a
  • 00:11:34
    different slot to ER yeah we still have
  • 00:11:37
    e in here somewhere we still have R in
  • 00:11:39
    here somewhere but if we encounter ER we
  • 00:11:42
    choose to represent it with its own
  • 00:11:44
    token and its own embedding and then you
  • 00:11:48
    go on and eventually you'll say oh maybe
  • 00:11:51
    the maybe you know I don't know am is
  • 00:11:54
    very common and Ur is very common and
  • 00:11:58
    something like this and then you start
  • 00:12:00
    making bigger combinations you say okay
  • 00:12:02
    the d a d like Dad that's a very common
  • 00:12:05
    thing in the language and so on so you
  • 00:12:07
    build these things there are heuristic
  • 00:12:09
    ways of deriving them uh it's
  • 00:12:11
    essentially a compression algorithm if
  • 00:12:13
    you will uh and you assign individual
  • 00:12:16
    tokens um to those and you it's not just
  • 00:12:20
    whole words right you can see like these
  • 00:12:22
    things they're more like word pieces
  • 00:12:25
    that you start building up the same with
  • 00:12:28
    byte pair encoding uh where you just
  • 00:12:30
    operate in the realm of bytes so uh you
  • 00:12:34
    know you can encode any text into a
  • 00:12:35
    series of of bytes uh by different
  • 00:12:38
    encoding standards that exist like utf8
  • 00:12:41
    is a very common one and then you
  • 00:12:43
    literally you know what all the symbols
  • 00:12:45
    are they are
  • 00:12:47
    0 to 255 those are all your single bytes
  • 00:12:49
    that can exist and then you start
  • 00:12:51
    combining you know the bytes that appear
  • 00:12:54
    often in your text it's kind of more
  • 00:12:57
    clean than working with character and
  • 00:12:59
    symbols but those are your your choices
  • 00:13:02
    so that would be like the byte pair
  • 00:13:04
    encoding and this would more be like word
  • 00:13:07
    piece or something like that um yeah
  • 00:13:11
    so like this seems good but it has its
  • 00:13:14
    own set of problems so first of all what
  • 00:13:19
    are its set of problems first of
  • 00:13:22
    all uh you know a couple of problems
  • 00:13:24
    that stem from tokenization so for
  • 00:13:27
    example if you have like numbers or
  • 00:13:29
    something like if you have the number
  • 00:13:32
    2568 uh then that might actually get
  • 00:13:35
    tokenized as the token 256 and 8 because
  • 00:13:39
    256 is very common uh number and then
  • 00:13:43
    you know just add eight so the tokenizer
  • 00:13:46
    is going for the minimum amount of
  • 00:13:48
    tokens uh so that's a problem if you
  • 00:13:50
    want to teach the neural network to
  • 00:13:52
    multiply something because it will not
  • 00:13:54
    see 2 5 6 8 it will see some token with
  • 00:14:00
    the ID 89 and then some token with the
  • 00:14:03
    ID 71 right it has no clue that you know
  • 00:14:08
    these are made up of numbers or
  • 00:14:10
    something like this and there are a
  • 00:14:12
    bunch of other problems with with
  • 00:14:14
    tokenization what this paper also shows
  • 00:14:16
    is that tokenization does result in
  • 00:14:19
    fairly small chunks of text where you
  • 00:14:22
    could go for bigger chunks of text but
  • 00:14:25
    the problem is if you keep it all in a
  • 00:14:27
    table if if you want bigger chunks of
  • 00:14:29
    text or obviously more combinations
  • 00:14:32
    possible so you'll have to kind of your
  • 00:14:35
    storage kind of explodes for
  • 00:14:38
    this so that's why they say do we even
  • 00:14:41
    need this table here do we even need
  • 00:14:43
    that maybe we don't actually need it
  • 00:14:45
    maybe we can get away with having a
  • 00:14:48
    table Just For The Individual pieces
  • 00:14:52
    like Just For The Individual unitary
  • 00:14:54
    things and we can come up with a scheme
  • 00:14:58
    of how we com how we recombine those
  • 00:15:01
    things for those down here in kind of
  • 00:15:04
    like a a learned way like can we teach a
  • 00:15:07
    neural network to take the embeddings of
  • 00:15:11
    the individual
  • 00:15:13
    constituents and come up with the
  • 00:15:15
    embedding for higher order combinations
  • 00:15:18
    because that would allow us to not even
  • 00:15:21
    have a fixed set of higher order
  • 00:15:23
    combinations but like kind of an
  • 00:15:24
    arbitrary combination of higher order
  • 00:15:26
    com um combinations and the neural
  • 00:15:29
    network will just be able to produce an
  • 00:15:31
    embedding for these on the Fly and then
  • 00:15:34
    those could be the individual pieces we
  • 00:15:37
    feed into the bigger llm right so it's
  • 00:15:40
    not a Chara we're not doing a character
  • 00:15:42
    level or a byte level
  • 00:15:45
    llm um what we're doing is a two-stage
  • 00:15:48
    process where we have a first stage that
  • 00:15:51
    out of the byte embeddings produces what
  • 00:15:55
    they call a patch embedding and a patch
  • 00:15:57
    embedding is like a um six to eight
  • 00:16:01
    characters long thing and that then gets fed
  • 00:16:06
    into the llm now you'll realize what I
  • 00:16:09
    said at the beginning this idea could
  • 00:16:11
    actually totally be done using the
  • 00:16:14
    tokenization we have right like you
  • 00:16:16
    could just tokenize how we tokenize
  • 00:16:19
    right now but just not have this big uh
  • 00:16:22
    sorry not have this big embedding table
  • 00:16:24
    but just do this sort of two-stage
  • 00:16:28
    process where the first stage just
  • 00:16:30
    builds your token embedding from the
  • 00:16:33
    character embeddings that make up the
  • 00:16:34
    token and then the second stage will
  • 00:16:37
    actually go and or the second stage is
  • 00:16:40
    your normal llm that operates on token
  • 00:16:43
    embeddings however you know because they
  • 00:16:46
    have this method they also say well we
  • 00:16:50
    don't need a fixed vocabulary
  • 00:16:52
    tokenization anymore right this here is
  • 00:16:55
    a fixed vocabulary you derive it once
  • 00:16:59
    your vocab because you need that table
  • 00:17:02
    and then you tokenize all the text into
  • 00:17:05
    this fixed vocabulary you don't have out
  • 00:17:08
    of vocabulary anymore because you can
  • 00:17:10
    you have the individual characters here
  • 00:17:11
    so you can tokenize anything uh but
  • 00:17:15
    still it's fixed so they say hey we have
  • 00:17:18
    this process now now we can do Dynamic
  • 00:17:21
    tokenization and that's what they call
  • 00:17:23
    patching they're again from from the
  • 00:17:25
    inside to the outside on the inside we
  • 00:17:29
    have an llm that operates on they call
  • 00:17:32
    Patch embeddings which are essentially
  • 00:17:33
    just token embeddings except the tokens
  • 00:17:36
    aren't fixed they are Dynamic
  • 00:17:40
    groupings uh patches of characters or of
  • 00:17:44
    bytes in our case same
  • 00:17:46
    same sorry uh all non-ASCII
  • 00:17:51
    people and so you can you can see that
  • 00:17:55
    once we know once we know what the where
  • 00:18:00
    the patch boundaries are and in this
  • 00:18:02
    case here here here are the patch
  • 00:18:05
    boundaries right so this is a token this
  • 00:18:07
    is a token this is a token and this is a
  • 00:18:09
    token this this text down here gets
  • 00:18:11
    divided into four tokens once we know
  • 00:18:14
    what they are we can use this local
  • 00:18:17
    encoder thing to look
  • 00:18:20
    at the characters in the patch and give
  • 00:18:25
    us a single patch embedding that we then
  • 00:18:27
    feed to the Transformer so the local
  • 00:18:31
    encoder is a model that's trained to do
  • 00:18:34
    exactly that um as far as I can tell
  • 00:18:36
    it's trained end to end together with
  • 00:18:38
    the latent Transformer and then the
  • 00:18:41
    local decoder takes a patch embedding
  • 00:18:45
    patch embedding and decodes it into the
  • 00:18:48
    constituent characters so you can see
  • 00:18:52
    that the local encoder and the local
  • 00:18:54
    decoder they run more often than the
  • 00:18:57
    latent Transformer and now you have a
  • 00:18:59
    degree of Freedom the long the bigger
  • 00:19:02
    you make these patches the The Wider
  • 00:19:05
    they become the more characters on
  • 00:19:07
    average to a patch the more often you
  • 00:19:10
    run the local
  • 00:19:12
    encoder in comparison to running the
  • 00:19:15
    chunky latent
  • 00:19:17
    Transformer so you can make this in here
  • 00:19:21
    bigger if you make these
  • 00:19:24
    smaller then you still you still gain a
  • 00:19:28
    lot lot like you can gain a lot of flops
  • 00:19:32
    um because you have to run the inner
  • 00:19:34
    part less because you make the patches
  • 00:19:38
    larger and as long as the outer parts
  • 00:19:40
    are kind of lightweight uh they don't
  • 00:19:42
    matter and you can get away with having
  • 00:19:45
    a bigger model because you spend less
  • 00:19:47
    flops because you run it less
  • 00:19:49
    often right some astute observers might
  • 00:19:54
    have realized that hey you know this
  • 00:19:58
    local this local decoder when does it
  • 00:20:01
    know when to stop
  • 00:20:04
    um it you know it's just it gives it
  • 00:20:06
    gets one thing and it's just supposed to
  • 00:20:09
    produce uh tokens like characters from
  • 00:20:12
    it we'll get to that in just a bit and
  • 00:20:15
    the the second part is obviously how do
  • 00:20:19
    we know where the patch boundaries are
  • 00:20:20
    how do you know how to group the
  • 00:20:22
    characters into tokens and the answer to
  • 00:20:24
    these two things is kind of the same and
  • 00:20:27
    that's with their
  • 00:20:29
    what they call uh entropy based grouping
  • 00:20:33
    of bytes into
  • 00:20:35
    patches um
  • 00:20:38
    so the entropy based grouping is a
  • 00:20:42
    concept that's as I said kind of
  • 00:20:46
    um yeah it's what they essentially do is
  • 00:20:50
    they train a small transformer so a byte
  • 00:20:54
    level
  • 00:20:55
    Transformer um notably this is not this
  • 00:20:59
    thing right here so they have a
  • 00:21:01
    separate llm that's small that's just on
  • 00:21:05
    the bytes so that actually is a
  • 00:21:08
    character level llm that's just trained
  • 00:21:11
    on a corpus
  • 00:21:13
    and that decides where to split in the
  • 00:21:18
    following way you feed text into it it
  • 00:21:23
    will predict the next token and if the
  • 00:21:27
    entropy of the prediction so this
  • 00:21:31
    distribution right here if the entropy
  • 00:21:33
    of the prediction of the next character
  • 00:21:35
    is very high meaning like what is a high
  • 00:21:39
    entropy a high entropy is a distribution
  • 00:21:42
    that's
  • 00:21:43
    like you know could be could be any of
  • 00:21:47
    these whereas a low entropy distribution
  • 00:21:49
    is like oh it's this it's this one it's
  • 00:21:53
    this one definitely so high entropy
  • 00:21:55
    meaning it's not sure that's where where
  • 00:21:58
    you split so if the next
  • 00:22:01
    character
  • 00:22:02
    is above a threshold of entropy in the
  • 00:22:05
    prediction of this byte level llm that's
  • 00:22:10
    where you make a
  • 00:22:11
    split that that's just a just a decision
  • 00:22:14
    they make right um it's a it's a design
  • 00:22:17
    choice that they make but there's good
  • 00:22:20
    reason right there's there's good reason
  • 00:22:22
    to split by entropy uh because what you
  • 00:22:25
    do is you keep the stuff together
  • 00:22:28
    together where you're sure so whenever
  • 00:22:32
    you know
  • 00:22:33
    bet you know
  • 00:22:35
    bet bet like the 'er' that's very clear
  • 00:22:40
    and therefore you want to keep it
  • 00:22:42
    together because it kind of is one unit
  • 00:22:44
    like whenever you're very sure what
  • 00:22:46
    comes you can very much argue that the
  • 00:22:49
    thing is actually should be treated as a
  • 00:22:51
    single unit when you're not sure that
  • 00:22:54
    means there could be multiple
  • 00:22:55
    continuations that's when you want to
  • 00:22:57
    split it up and say oh well here you
  • 00:23:00
    know this these two things need to be
  • 00:23:02
    treated separately because in an
  • 00:23:04
    alternative Universe there there's a
  • 00:23:06
    different continuation here that I need
  • 00:23:07
    to take into account and then you better
  • 00:23:11
    off if that first part is the same token
  • 00:23:14
    each time and not if the entire thing is
  • 00:23:17
    like a different token and you know
  • 00:23:19
    nothing
  • 00:23:21
    anymore all right um what I want to
  • 00:23:24
    say yeah and this is also the answer on
  • 00:23:27
    how the local decoder stops decoding so
  • 00:23:31
    it decodes decodes decodes and when the
  • 00:23:33
    next and then it it just always asks
  • 00:23:36
    this small llm here what's the entropy
  • 00:23:39
    of what I'm doing right like what's the
  • 00:23:40
    entropy of the next token in your
  • 00:23:43
    estimation like this local model this
  • 00:23:45
    knows nothing of the latent Transformer
  • 00:23:47
    what it just looks at the stuff that's
  • 00:23:49
    being produced and if the next if the
  • 00:23:53
    next token according to it has a high
  • 00:23:56
    entropy that's where we end the the
  • 00:23:59
    patch okay so the process is as
  • 00:24:03
    follows we have some we have some
  • 00:24:06
    text and let's say we're at a new patch
  • 00:24:09
    boundary okay the local encoder looks at
  • 00:24:13
    the patch sorry we're here let's let's
  • 00:24:16
    start it the we run the small llm
  • 00:24:20
    forward right boop boop boop boop boop
  • 00:24:23
    until the entropy is above the threshold
  • 00:24:26
    that's where we say ah okay that's a
  • 00:24:28
    patch okay our patch is from here to
  • 00:24:30
    here then that local encoder looks at
  • 00:24:32
    the characters in here and and takes
  • 00:24:36
    there is an there's a embedding table
  • 00:24:39
    from byte to embedding notably you only
  • 00:24:43
    need 256 entries fixed right this
  • 00:24:47
    doesn't grow so it looks up the
  • 00:24:51
    embeddings of the constituents and
  • 00:24:54
    Aggregates them into a patch embedding and bing
  • 00:24:56
    it's trained to do that then then you
  • 00:24:58
    run the latent transformer for one step
  • 00:25:01
    let's assume this doesn't exist yet for
  • 00:25:03
    one step and produce the
  • 00:25:05
    next the next latent um output token the
  • 00:25:11
    local decoder takes this and
  • 00:25:15
    starts um let's assume let's assume that
  • 00:25:19
    actually let's assume the local decoder
  • 00:25:21
    is here currently right the local
  • 00:25:23
    decoder takes this and starts
  • 00:25:27
    producing uh um uh tokens it starts
  • 00:25:30
    decoding like an llm except conditioned
  • 00:25:33
    on This Global signal right here so it's
  • 00:25:35
    like okay this one okay and I'm produce
  • 00:25:39
    this one I'm produce this one and each
  • 00:25:41
    time it asks the small llm what it
  • 00:25:44
    thinks about the next token in the
  • 00:25:46
    sequence it has decoded if the small as
  • 00:25:49
    soon as the small llm says oh wait the
  • 00:25:51
    entropy is quite High then it's like
  • 00:25:54
    okay stop it here I'm going to I'm going
  • 00:25:57
    to stop it here please go back to the
  • 00:26:00
    next thing um
  • 00:26:03
    and uh you know start the next cycle of
  • 00:26:07
    the
  • 00:26:08
    process we almost at least that's how I
  • 00:26:11
    think it goes uh maybe I'm I'm totally
  • 00:26:15
    wrong but that's what I can read from
  • 00:26:16
    the paper the paper is a bit sparse on
  • 00:26:18
    these exact details um but and I haven't
  • 00:26:22
    read the code I have to apologize for
  • 00:26:24
    that but the code is available so you
  • 00:26:26
    can go and verify or or refute that um
  • 00:26:30
    there is one extra thing there's one
  • 00:26:34
    little bit of extra info that you need
  • 00:26:36
    right here and that's
  • 00:26:39
    usually usually when you do auto
  • 00:26:42
    regressive decoding you take what you've
  • 00:26:45
    produced and you feed it back right um
  • 00:26:49
    into your own model however that doesn't
  • 00:26:53
    work here because this local decoder it
  • 00:26:56
    doesn't take text as a an input it
  • 00:26:58
    doesn't take characters as an input it
  • 00:27:01
    just takes this signal right here as an
  • 00:27:04
    input
  • 00:27:05
    so what does take characters as an input
  • 00:27:08
    well that local encoder thing takes
  • 00:27:10
    characters as an input so there is a
  • 00:27:14
    hidden skip connection from like here to
  • 00:27:18
    here so when you when the local decoder
  • 00:27:21
    produces a character at least that's
  • 00:27:23
    again my understanding you run this
  • 00:27:27
    thing through through the local encoder
  • 00:27:30
    you know here get its local encoder
  • 00:27:33
    embedding but you don't go to the latent
  • 00:27:35
    Transformer because you're not done with
  • 00:27:36
    a patch yet you just feed this back into
  • 00:27:39
    the local decoder which then has like a
  • 00:27:42
    a latent a latent representation that it
  • 00:27:45
    can decode the next token from so the
  • 00:27:47
    loop between local decoder go to local
  • 00:27:49
    encoder go to local decoder that's kind
  • 00:27:51
    of the outer loop that runs in order to
  • 00:27:54
    produce these tokens and once you're
  • 00:27:56
    done with a patch then you know you
  • 00:27:58
    start again to ask the local decoder
  • 00:28:01
    about the next patch um to to or sorry
  • 00:28:04
    about the patch that you've just
  • 00:28:06
    produced embed it get it into the latent
  • 00:28:09
    Transformer from that you get next
  • 00:28:11
    Global signal and then you do that outer
  • 00:28:13
    loop again in order to produce the
  • 00:28:15
    individual bytes until the small LM says
  • 00:28:18
    again patches
  • 00:28:20
    over again that's how I personally
  • 00:28:24
    understand it there is yeah so so here
  • 00:28:28
    we have exactly we have the encoder
  • 00:28:31
    decoder so um the
  • 00:28:34
    encoder gets byte embeddings uh uses and
  • 00:28:38
    then
  • 00:28:38
    uses cross attention so it knows it
  • 00:28:42
    those should be um tokenized into three
  • 00:28:45
    different patches so it uses cross
  • 00:28:47
    attention from the patch um to Only The
  • 00:28:53
    Tokens that are part of the patch by the
  • 00:28:56
    way there are two here and not so
  • 00:29:00
    they're three different patches but they
  • 00:29:03
    use multi-head attention so this just
  • 00:29:05
    represents a two-headed uh multi-head
  • 00:29:08
    attention with keys into here but you
  • 00:29:11
    still have hidden states you have many
  • 00:29:13
    layers so you still have hidden States
  • 00:29:15
    and these hidden States is what you give
  • 00:29:19
    to the
  • 00:29:20
    decoder um which does the exact opposite
  • 00:29:22
    so its keys are sorry its queries are
  • 00:29:27
    the individual bites that you produce
  • 00:29:28
    and its keys and values are the global
  • 00:29:31
    signal that you get from the latent
  • 00:29:35
    Transformer all right there is one more
  • 00:29:39
    thing now I'm going to guess that this
  • 00:29:42
    thing here the encoder hash n-gram
  • 00:29:45
    embeddings they added because it just
  • 00:29:48
    works better like this seems very much
  • 00:29:50
    like a thing you add after that so they
  • 00:29:53
    say look we do have we we have a
  • 00:29:59
    [Music]
  • 00:30:01
    um we model each byte individually so
  • 00:30:06
    when we do encoding each byte gets like
  • 00:30:10
    encoded
  • 00:30:12
    um by itself and as part of a byte
  • 00:30:19
    n-gram so you can see that they build up
  • 00:30:22
    not just embedding tables or the byte to
  • 00:30:26
    embedding but they build up several
  • 00:30:29
    embedding tables so there is an
  • 00:30:30
    embedding table um for byte 2-grams or
  • 00:30:35
    3-grams um there is one for byte 4-grams
  • 00:30:39
    for byte 5-grams and so on up until byte 8-grams
  • 00:30:44
    and now you ask well aren't the byte
  • 00:30:46
    8-grams huge and that's exactly what we tried
  • 00:30:49
    to avoid yes they are that's why you
  • 00:30:52
    just kind of you just kind of hash them
  • 00:30:55
    and then modulus by the size of the
  • 00:30:58
    embedding table so you're like you're
  • 00:31:02
    essentially counting on the fact that
  • 00:31:04
    yes there are going to be hash
  • 00:31:05
    collisions like some of the byte
  • 00:31:06
    3-grams are going to hit the same embedding
  • 00:31:08
    right here but those hash collisions are
  • 00:31:10
    kind of orthogonal things in meaning and
  • 00:31:13
    so it's probably fine
  • 00:31:16
    um I'm going to I'm going to guess it's
  • 00:31:19
    just a way to get n-grams in there so when
  • 00:31:21
    you look at a byte for example the
  • 00:31:23
    letter T right here you also take the
  • 00:31:26
    embedding for the 3-gram the 4-gram the 5-gram
  • 00:31:29
    the 6-gram the 7-gram and the 8-gram in
  • 00:31:32
    front of that byte and you aggregate all
  • 00:31:36
    of these together into the byte
  • 00:31:39
    embedding so to say so the local encoder
  • 00:31:44
    doesn't operate purely on the byte
  • 00:31:46
    embedding as I said before but it
  • 00:31:48
    actually operates on a superposition of
  • 00:31:53
    byte n-gram embeddings that this
  • 00:31:57
    puts this into context with the bytes
  • 00:32:00
    before it that to me it it just seems
  • 00:32:04
    like a kind of a a way to get kind of
  • 00:32:08
    fake it's it's it's a bit of like you
  • 00:32:10
    get you like tokenization is back that's
  • 00:32:14
    what it tells me except instead of
  • 00:32:16
    tokens it's
  • 00:32:18
    n-grams so yeah make of that as you
  • 00:32:23
    will I don't want to you know talk too
  • 00:32:27
    much more I I think that's kind of it
  • 00:32:28
    for the model design and how they decode
  • 00:32:31
    and so on when they experiment around
  • 00:32:34
    they find they can actually make larger
  • 00:32:37
    patches than regular tokenization so
  • 00:32:40
    they um they say look our our patches we
  • 00:32:45
    can go to patch sizes of like uh what do
  • 00:32:49
    I say look Trends between yeah so they
  • 00:32:55
    can go they they can achieve kind of
  • 00:32:57
    performance of like llama 2 and
  • 00:32:59
    llama 3 models while using significantly
  • 00:33:01
    larger patch sizes so while llama 2 and
  • 00:33:05
    llama 3 byte pair encodings have an
  • 00:33:06
    average token size of 3.7 and 4.4 bytes
  • 00:33:10
    so we can achieve similar scaling
  • 00:33:12
    Trends with an average patch size of six
  • 00:33:14
    and even eight bytes um so you you have
  • 00:33:18
    that handle on that tradeoff and that's
  • 00:33:20
    pretty cool I have to say they do some
  • 00:33:23
    experiments where they show that yeah
  • 00:33:25
    they can remain competitive with these
  • 00:33:27
    Llama models but also they're a lot better
  • 00:33:31
    in you know in tasks where you actually
  • 00:33:34
    need to look at the individual
  • 00:33:37
    characters in a token because given that
  • 00:33:40
    they operate on bite embeddings they can
  • 00:33:43
    now also you know very fine-grainedly
  • 00:33:46
    train models that actually need to
  • 00:33:50
    look at the individual things whereas if
  • 00:33:52
    you obviously just have fixed tokens and
  • 00:33:55
    you look up their embeddings in a table
  • 00:33:57
    that that doesn't work as well so but
  • 00:33:59
    it's it's kind of like it's kind of
  • 00:34:01
    cheesing a bit but just demonstrating
  • 00:34:03
    hey look spelling inverse we're doing
  • 00:34:06
    like really really well compared to the
  • 00:34:08
    Llama models which was to be expected
  • 00:34:11
    but it is nice that they perform an
  • 00:34:13
    experiment to actually show that what's
  • 00:34:16
    also interesting is that um translation
  • 00:34:19
    works better for kind of languages that
  • 00:34:22
    are under represented or that are you
  • 00:34:24
    know kind of tokenized in a in a non
  • 00:34:28
    like Say Non in a way other than like
  • 00:34:32
    your standard languages are tokenized
  • 00:34:34
    and that's also pretty
  • 00:34:36
    cool all right that's I want to don't
  • 00:34:40
    want to dive too much here more uh
  • 00:34:43
    please look at the rest of the paper
  • 00:34:45
    it's pretty interesting it's pretty
  • 00:34:46
    thorough the experiments are pretty cool
  • 00:34:48
    and they pay a lot of attention to like
  • 00:34:50
    control for various parameters in
  • 00:34:52
    because it is really hard if if you know
  • 00:34:55
    your model operates on different
  • 00:34:56
    fundamental units
  • 00:34:58
    how do you even compare to other models
  • 00:35:00
    and they do good job at that um there
  • 00:35:03
    are several room for improvements
  • 00:35:05
    notably you could train more things
  • 00:35:07
    jointly for example that small language
  • 00:35:09
    model that does the patching and so on
  • 00:35:12
    and as of now this in terms of um in
  • 00:35:17
    terms of like raw runtime uh this still
  • 00:35:20
    lags behind because obviously we've
  • 00:35:22
    spent like a decade hyper optimizing or
  • 00:35:26
    at least half a decade hyper optimizing
  • 00:35:28
    fixed tokenization autoregressive
  • 00:35:31
    llms uh yeah with things like they name
  • 00:35:33
    here such as Flex attention um and we
  • 00:35:37
    and obviously that would still need to
  • 00:35:41
    be done for these patch level models in
  • 00:35:45
    terms of actually getting their runtime
  • 00:35:47
    there so when they compare something
  • 00:35:50
    they like match flops which is probably
  • 00:35:53
    a pretty good measure that's kind of
  • 00:35:55
    independent of raw optimization
  • 00:35:58
    all right that's it as I said read the
  • 00:36:01
    paper uh subscribe for more reviews and
  • 00:36:04
    thank you so much if you read this as it
  • 00:36:07
    comes out then Holly Jolly uh Christmas
  • 00:36:10
    and uh Happy New Year and see you around
  • 00:36:13
    bye-bye
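
A few of the ideas in the transcript, sketched in code. First, the bits-per-byte metric discussed around 00:01:40: since models with different tokenizers cannot be compared on per-token perplexity, the loss is normalized by the number of UTF-8 bytes instead. A minimal sketch, assuming you already have the model's summed negative log-likelihood in nats over a text:

    import math

    def bits_per_byte(total_nll_nats, text):
        """Normalize a model's total NLL (in nats) by the UTF-8 byte count,
        giving a tokenizer-independent quality measure (lower is better)."""
        n_bytes = len(text.encode("utf-8"))
        return total_nll_nats / (math.log(2) * n_bytes)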
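
Second, the two-tier decoding loop reconstructed around 00:24:00 to 00:28:00. The component names below are placeholders, not the paper's API; the sketch only shows the control flow the video describes: the local encoder and decoder run once per byte, the latent transformer once per patch, and the small entropy model decides when a patch ends.

    def generate(prompt_bytes, local_encoder, latent_transformer, local_decoder,
                 next_byte_entropy, entropy_threshold, max_bytes=256):
        """Sketch of byte-latent generation with placeholder components.

        next_byte_entropy(seq) is the small byte-level LM's entropy for the byte
        that would follow seq; a high value signals the end of a patch."""
        out = bytearray(prompt_bytes)
        while len(out) < max_bytes:
            patches = local_encoder(bytes(out))          # bytes -> patch embeddings
            patch_signal = latent_transformer(patches)   # one step of the big model
            # decode bytes for this patch until the entropy model gets uncertain
            while len(out) < max_bytes:
                next_byte = local_decoder(patch_signal, bytes(out))  # returns 0-255
                out.append(next_byte)
                if next_byte_entropy(bytes(out)) > entropy_threshold:
                    break  # patch over: go back up to the latent transformer
        return bytes(out)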
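
Third, the hash n-gram embeddings described around 00:30:00: each byte's input representation also pulls in embeddings for the 3- to 8-grams ending at that byte, looked up by hashing the n-gram modulo a fixed table size so the tables stay bounded and collisions are simply tolerated. Table size, width, and summation as the combination rule are assumptions for illustration:

    import numpy as np

    EMB_DIM = 16
    TABLE_SIZE = 50_000                  # fixed per-n table; collisions accepted
    byte_table = np.random.randn(256, EMB_DIM)
    ngram_tables = {n: np.random.randn(TABLE_SIZE, EMB_DIM) for n in range(3, 9)}

    def byte_input_embedding(byte_seq, i):
        """Embedding for position i: the byte itself plus hashed n-gram lookups
        for the 3..8-grams that end at position i."""
        emb = byte_table[byte_seq[i]].copy()
        for n in range(3, 9):
            if i + 1 >= n:
                gram = bytes(byte_seq[i + 1 - n : i + 1])
                slot = hash(gram) % TABLE_SIZE   # hashing keeps the table size fixed
                emb += ngram_tables[n][slot]
        return emb
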
Tags
  • Byte Latent Transformer
  • tokenization
  • patches
  • scalability
  • language models
  • embedding
  • entropy
  • dynamic encoding
  • LLM
  • language performance