LSTM is dead. Long Live Transformers!

00:28:48
https://www.youtube.com/watch?v=S27pHKBEp30

Summary

TL;DR: This presentation explores a significant shift in natural language processing (NLP) brought about by the introduction of transformers. These models have considerably improved how we handle variable-length documents compared with earlier approaches, notably recurrent neural networks (RNNs) and their more advanced variants such as LSTMs. Transformers rely on two key innovations, multi-head attention and positional encodings, which preserve sequence information while taking the global context of the document into account. This enables more efficient parallel processing and easier transfer learning from large amounts of unsupervised text. Unlike earlier models, transformers do not require sigmoid activations that can saturate gradients, which makes them less prone to vanishing or exploding gradients. The advantages of these new models include easier training and improved predictions across a range of NLP applications, along with straightforward model reuse through transfer learning.

Key takeaways

  • 👥 Transformers are replacing traditional language models.
  • 🔄 Transfer learning becomes far easier with transformers.
  • 🔍 Multi-head attention enables precise contextual analysis.
  • 📏 Positional encodings preserve word order.
  • 🖩 Transformers offer greater computational efficiency.
  • 💡 The architecture allows faster, easier training.
  • 📚 Large language models can be reused for many different tasks.
  • ⚡ ReLUs avoid gradient saturation.
  • 🔗 Well suited to long documents thanks to global attention.
  • 🧠 Models built on the Transformer architecture deliver better predictions.

Timeline

  • 00:00:00 - 00:05:00

    The speaker admits he is reluctant to give two talks in a row, but feels it is important to discuss the significant progress in natural language processing (NLP) brought about by Transformers. He describes the challenge of representing text documents as fixed-size vectors, going into the details of bag-of-words models and their limitations.

  • 00:05:00 - 00:10:00

    He covers recurrent neural networks (RNNs) and their associated challenges, such as vanishing and exploding gradients, which make RNNs poorly suited to long sequences. Long short-term memory networks (LSTMs) help, but come with limitations in training complexity and transfer learning.

  • 00:10:00 - 00:15:00

    The speaker describes the emergence of Transformers, along with BERT and the Muppet models, which revolutionized how sequences of documents are processed. The transformer model, introduced in the context of machine translation, uses attention as the key mechanism for handling variable-length documents.

  • 00:15:00 - 00:20:00

    Transformers stand out for their multi-head attention mechanism and positional encoding, which preserves word order in a sequence and solves a major shortcoming of bag-of-words models. These innovations make transformers very efficient on modern GPUs.

  • 00:20:00 - 00:28:48

    In conclusion, transformers are easier to train thanks to their parallel structure and make transfer learning practical. Compared with older CNN or RNN approaches they offer more flexibility, although RNNs such as LSTMs keep advantages in some settings, notably when sequences are very long or unbounded.


Video Q&A

  • What is the main topic of the presentation?

    The main topic is the major shift in natural language processing brought about by the introduction of transformers.

  • Why are standard recurrent networks problematic?

    They suffer from exploding or vanishing gradients, which makes training difficult.

  • What makes transformers unique?

    Transformers use multi-head attention and positional encodings to process variable-length documents efficiently.

  • What is the advantage of transformers over LSTMs?

    Transformers are more computationally efficient and enable transfer learning from large amounts of unsupervised text.

  • What major computational problem do transformers solve?

    They avoid the step-by-step sequential computation of RNNs, which makes them faster and more efficient.

  • How do transformers handle word position?

    They use positional encodings based on sines and cosines to preserve word order.

  • What is multi-head attention?

    A technique that lets the model focus on different parts of a text for different aspects, such as grammar or vocabulary.

  • Why is transfer learning important for transformers?

    It allows models pre-trained on huge datasets to be reused, making adaptation to specific tasks easier and more efficient.

  • What kinds of tasks are still a good fit for LSTMs?

    Tasks with very long or unbounded sequences, where the quadratic compute cost of transformers is prohibitive.

  • Why are ReLU activations preferred?

    They avoid gradient saturation and run efficiently on low-precision hardware.

Subtitles (English)
  • 00:00:03
    all right cool well thanks everybody um
  • 00:00:06
    so I'm gonna give the second talk
  • 00:00:08
    tonight which I'm not crazy about and
  • 00:00:10
    and I don't want this pattern to to
  • 00:00:12
    repeat but you know Andrew and I wanted
  • 00:00:14
    to kick this series off and and felt
  • 00:00:20
    like me talking twice were better than
  • 00:00:23
    not but we're gonna we're gonna get
  • 00:00:26
    more diversity of folks if any of you
  • 00:00:28
    want to give a talk yourselves you know
  • 00:00:29
    somebody who you think might that'd be
  • 00:00:31
    awesome but a topic that I feel is
  • 00:00:34
    important for practitioners to
  • 00:00:35
    understand is a real sea change in
  • 00:00:38
    natural language processing that's you
  • 00:00:40
    know all of like 12 months old but is
  • 00:00:42
    one these things I think is incredibly
  • 00:00:44
    significant in in the field and that is
  • 00:00:47
    the advance of the Transformers so the
  • 00:00:52
    outline for this talk is to start out
  • 00:00:55
    with some background on natural language
  • 00:00:57
    processing and sequence modeling and
  • 00:01:00
    then talk about the LSTM why it's
  • 00:01:03
    awesome and amazing but still not good
  • 00:01:05
    enough and then go into Transformers and
  • 00:01:09
    talk about how they work and why they're
  • 00:01:12
    amazing
  • 00:01:12
    so for background on natural language
  • 00:01:14
    processing NLP I'm gonna be talking just
  • 00:01:18
    about a subset of NLP which is the
  • 00:01:21
    supervised learning a part of it so not
  • 00:01:23
    structured prediction sequence
  • 00:01:25
    prediction but where you're taking the
  • 00:01:29
    document as some input and trying to
  • 00:01:32
    predict some fairly straightforward
  • 00:01:34
    output about it like is this document
  • 00:01:37
    spam right and so what this means is
  • 00:01:41
    that you need to somehow take your
  • 00:01:44
    document and represent it as a
  • 00:01:46
    fixed-size vector because I'm not aware
  • 00:01:49
    of any linear algebra that works on
  • 00:01:50
    vectors of variable dimensionality and
  • 00:01:53
    the challenge with this is that
  • 00:01:55
    documents are of variable length right
  • 00:01:58
    so you have to come up with some way of
  • 00:02:00
    taking that document and meaningfully
  • 00:02:02
    encoding it into a fixed size vector
  • 00:02:04
    right so the classic way of doing this
  • 00:02:06
    is the bag of words right where you have
  • 00:02:08
    one dimension per unique word in your
  • 00:02:11
    vocabulary
  • 00:02:12
    so English has I don't know about a
  • 00:02:14
    hundred thousand words in the vocabulary
  • 00:02:16
    right
  • 00:02:17
    and so you have a hundred thousand
  • 00:02:18
    dimensional vector most of them are zero
  • 00:02:20
    because most words are not present in
  • 00:02:22
    your document and the ones that are have
  • 00:02:24
    some value that's maybe a count or
  • 00:02:26
    tf-idf score or something like that
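
As an aside, here is a minimal sketch of what such a bag-of-words vector looks like in practice; the toy vocabulary, the example sentences, and the use of raw counts instead of tf-idf are illustrative assumptions rather than details from the talk:

    from collections import Counter

    # Toy vocabulary: one dimension per unique word (real English vocabularies run ~100k words).
    vocab = ["work", "to", "live", "buy", "now"]
    word_to_index = {w: i for i, w in enumerate(vocab)}

    def bag_of_words(document):
        """Encode a variable-length document as a fixed-size count vector."""
        counts = Counter(document.lower().split())
        vector = [0] * len(vocab)
        for word, count in counts.items():
            if word in word_to_index:        # out-of-vocabulary words are simply dropped
                vector[word_to_index[word]] = count
        return vector

    # Sparse alternative: keep only (position, value) tuples, since most entries are zero.
    def sparse_bag_of_words(document):
        return [(i, v) for i, v in enumerate(bag_of_words(document)) if v != 0]

    print(bag_of_words("work to live"))        # [1, 1, 1, 0, 0]
    print(sparse_bag_of_words("buy now"))      # [(3, 1), (4, 1)]
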
  • 00:02:28
    and that is your vector and this
  • 00:02:32
    naturally leads to sparse data where
  • 00:02:35
    again it's mostly zero so you don't
  • 00:02:37
    store the zeros because that's
  • 00:02:38
    computationally inefficient you store
  • 00:02:39
    lists of a position value tuples or
  • 00:02:43
    maybe just a list of positions and this
  • 00:02:46
    makes the computation much cheaper and
  • 00:02:48
    this works this works reasonably well a
  • 00:02:51
    key limitation is that when you're
  • 00:02:52
    looking an actual document order matters
  • 00:02:54
    right these two documents mean
  • 00:02:58
    completely different things right but a
  • 00:03:01
    bag of words model will score them
  • 00:03:02
    identically every single time because
  • 00:03:04
    they have the exact same vectors for for
  • 00:03:08
    what words are present so the solution
  • 00:03:11
    to that in this context is n-grams you
  • 00:03:14
    can have bigrams which are every pair of
  • 00:03:15
    possible words or trigrams for every
  • 00:03:17
    combination of three words which would
  • 00:03:19
    easily distinguish between those two but
  • 00:03:21
    now you're up to what is that a
  • 00:03:23
    quadrillion dimensional vector and you
  • 00:03:26
    can do it but you know you start running
  • 00:03:28
    into all sorts of problems when you walk
  • 00:03:31
    down that path so a in neural network
  • 00:03:35
    land it's the natural way to just solve
  • 00:03:37
    this problem is the RNN which is the
  • 00:03:40
    recurrent neural network not the
  • 00:03:42
    recursive neural network I've made that
  • 00:03:43
    mistake but RNNs are a new approach
  • 00:03:48
    to this which asked the question how do
  • 00:03:50
    you calculate a function on a
  • 00:03:52
    variable-length set of input and they
  • 00:03:55
    answer it using a for loop in math where
  • 00:03:58
    they recursively define the output at
  • 00:04:02
    any stage as a function of the inputs at
  • 00:04:05
    the previous stages and the previous
  • 00:04:07
    output and then for the purpose of
  • 00:04:10
    supervised learning the final output is
  • 00:04:13
    just the final hidden state here and so
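
A minimal sketch of that "for loop in math"; the tanh cell and the tiny dimensions are illustrative assumptions, since the talk only specifies the recursive form h_t = f(x_t, h_{t-1}) with the final hidden state as the output:

    import numpy as np

    rng = np.random.default_rng(0)
    d_in, d_hidden = 4, 3                          # toy sizes, purely illustrative
    W_x = rng.normal(size=(d_hidden, d_in))        # input-to-hidden weights
    W_h = rng.normal(size=(d_hidden, d_hidden))    # hidden-to-hidden weights

    def vanilla_rnn(inputs):
        """h_t = tanh(W_h @ h_{t-1} + W_x @ x_t); the last hidden state summarizes the sequence."""
        h = np.zeros(d_hidden)
        for x in inputs:                           # one step per token, strictly sequential
            h = np.tanh(W_h @ h + W_x @ x)
        return h                                   # a fixed-size vector for any input length

    document = [rng.normal(size=d_in) for _ in range(7)]   # a "document" of 7 token vectors
    print(vanilla_rnn(document))
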
  • 00:04:16
    visually this looks like this activation
  • 00:04:18
    which takes an input from the raw
  • 00:04:21
    document X and also itself at the
  • 00:04:23
    previous time step you can unroll this and
  • 00:04:25
    visualize it as a very deep neural
  • 00:04:27
    network where there the final answer
  • 00:04:31
    the number you're looking at the end is
  • 00:04:32
    this and it's this deep neural network
  • 00:04:34
    that processes every one of the inputs
  • 00:04:36
    along the way alright and the problem
  • 00:04:39
    with this classic vanilla all right on
  • 00:04:42
    this plain recurrent neural network is
  • 00:04:44
    vanishing and exploding gradients right
  • 00:04:45
    so you take this recursive definition of
  • 00:04:50
    the hidden state and you imagine what
  • 00:04:53
    happens just three points in right and
  • 00:04:55
    so you're calling this a function this a
  • 00:04:57
    transformation over and over and over
  • 00:04:59
    again on your data and classically in
  • 00:05:02
    the vanilla one this is just
  • 00:05:04
    some matrix multiplication some learned
  • 00:05:06
    matrix W times your input X and so when
  • 00:05:10
    you go out and say a hundred words in
  • 00:05:12
    you're taking that W vector W matrix and
  • 00:05:15
    you're multiplying it a hundred times
  • 00:05:17
    alright so in in simple math in in real
  • 00:05:22
    number math we know that if you take any
  • 00:05:24
    number less than one and raise it to a
  • 00:05:26
    very high dimensional value sorry very
  • 00:05:28
    high exponent you get some incredibly
  • 00:05:30
    small number and if your number is
  • 00:05:32
    slightly larger than one then it blows
  • 00:05:34
    up to something big
  • 00:05:35
    and if your exponent is even higher if
  • 00:05:37
    you have longer documents this gets even
  • 00:05:39
    worse and in linear algebra this is
  • 00:05:41
    about the same except you need to think
  • 00:05:44
    about the eigenvalues of the matrix so
  • 00:05:46
    the eigenvalues say how much the
  • 00:05:48
    matrix is going to grow or shrink
  • 00:05:50
    vectors when the transformation is
  • 00:05:53
    applied and if your eigenvalues are less
  • 00:05:55
    than one in this transformation you're
  • 00:05:57
    going to get these gradients that go to
  • 00:05:59
    zero as you use this matrix over and
  • 00:06:00
    over again if they're greater than one
  • 00:06:02
    then your gradients are going to explode
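
A quick numeric illustration of that point; the specific numbers are made up for the example:

    import numpy as np

    # Scalar intuition: repeated multiplication either shrinks or blows up.
    print(0.99 ** 100)   # ~0.37   (a number less than one, raised to a high power)
    print(1.01 ** 100)   # ~2.70   (a number slightly larger than one)

    # Matrix version: the eigenvalues of the recurrent matrix play the same role.
    W_shrink = np.diag([0.9, 0.8])   # eigenvalues < 1 -> gradients vanish
    W_grow   = np.diag([1.1, 1.2])   # eigenvalues > 1 -> gradients explode
    v = np.ones(2)
    print(np.linalg.matrix_power(W_shrink, 100) @ v)   # ~[2.7e-05, 2.0e-10]
    print(np.linalg.matrix_power(W_grow, 100) @ v)     # ~[1.4e+04, 8.3e+07]
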
  • 00:06:03
    all right and so this made vanilla
  • 00:06:05
    RNNs extremely difficult to work with
  • 00:06:07
    and basically just didn't work on
  • 00:06:08
    anything but fairly short sequences all
  • 00:06:12
    right so LSTM to the rescue right so I
  • 00:06:15
    wrote this document a few years ago
  • 00:06:17
    called the rise and fall and rise and
  • 00:06:19
    fall of LST M so at least Tim came
  • 00:06:24
    around in the dark ages and then it went
  • 00:06:27
    into the AI winter it came back again
  • 00:06:29
    for awhile but I think it's on its way
  • 00:06:31
    out again now with with transformers so
  • 00:06:34
    the LSTM to be clear is a kind of
  • 00:06:36
    recurrent neural network it just houses a
  • 00:06:38
    more sophisticated cell inside and it
  • 00:06:42
    was invented originally in the dark ages
  • 00:06:44
    on a
  • 00:06:45
    stone tablet that has been recovered
  • 00:06:48
    into a PDF that you can access on Sepp
  • 00:06:51
    Hochreiter's server I kid but
  • 00:06:55
    Sepp and Jürgen are both great I
  • 00:06:57
    enjoy them both quite a bit but they
  • 00:07:00
    did a bunch of amazing work in the 90s
  • 00:07:02
    that was really well ahead of its time
  • 00:07:04
    and and often get neglected and
  • 00:07:09
    forgotten as time goes on that's totally
  • 00:07:12
    not fair because they did an amazing
  • 00:07:13
    research so the LSTM cell looks like
  • 00:07:16
    this it actually has two hidden states
  • 00:07:18
    and the the input coming along the
  • 00:07:21
    bottom and the output up the top again
  • 00:07:23
    and these two hidden states and I'm not
  • 00:07:25
    going to go into it in detail and you
  • 00:07:26
    should totally look at Christopher Olah's
  • 00:07:28
    blog post if you want to dive into it
  • 00:07:29
    but the key point is that these these
  • 00:07:32
    transformations these the matrix
  • 00:07:34
    multiplies right and they are not
  • 00:07:35
    applied recursively on the main hidden
  • 00:07:38
    vector all you're doing is you're adding
  • 00:07:40
    in or the forget gate yeah you actually
  • 00:07:44
    don't really need it but you're adding
  • 00:07:46
    in some new number and so the LSTM
  • 00:07:49
    is actually a lot like a ResNet it's a
  • 00:07:51
    lot like a CNN ResNet in that you're
  • 00:07:53
    adding new values on to the activation
  • 00:07:57
    as you go through the layers right and
  • 00:08:00
    so this solves the exploding and
  • 00:08:03
    vanishing gradients problems however the
  • 00:08:06
    LSTM is still pretty difficult to train
  • 00:08:09
    because you still have these very long
  • 00:08:11
    gradient paths even even without even
  • 00:08:14
    with those residual connections you're
  • 00:08:15
    still propagating gradients from the end
  • 00:08:17
    all the way through this transformation
  • 00:08:19
    cell over at the beginning and for a
  • 00:08:20
    long document this means very very deep
  • 00:08:22
    networks that are notoriously
  • 00:08:26
    difficult to train and more importantly
  • 00:08:29
    transfer learning never really worked on
  • 00:08:32
    these LSTM models right one of the
  • 00:08:35
    great things about image net and cnn's
  • 00:08:37
    is that you can train a convolutional
  • 00:08:40
    net on millions of images in image net
  • 00:08:42
    and take that neural network and
  • 00:08:44
    fine-tune it for some new problem that
  • 00:08:46
    you have and the the starting state of
  • 00:08:50
    the ImageNet CNN gives you a great
  • 00:08:52
    a great place to start from when you're
  • 00:08:55
    looking for a new neural network and
  • 00:08:56
    makes training on your own problem much
  • 00:08:58
    easier
  • 00:08:58
    with much less data that never
  • 00:09:00
    really worked with LSTMs sometimes
  • 00:09:01
    it did but it just wasn't very reliable
  • 00:09:04
    which means that anytime you're using an
  • 00:09:06
    LSTM you need a new labeled data set
  • 00:09:10
    that's specific to your task and that's
  • 00:09:12
    expensive okay so this this changed
  • 00:09:16
    dramatically just about a year ago when
  • 00:09:18
    the BERT model was released so
  • 00:09:23
    you'll hear people talk about
  • 00:09:24
    Transformers and Muppets together and
  • 00:09:26
    the reason for this is that the original
  • 00:09:29
    paper on this technique that describes
  • 00:09:31
    the network architecture it was called
  • 00:09:33
    the transformer network and then the
  • 00:09:34
    Bert paper is a Muppet and the ELMo
  • 00:09:36
    paper and you know researchers just run
  • 00:09:38
    with the joke um so this is just context
  • 00:09:40
    you understand what people are talking
  • 00:09:41
    about if they say well use them up in
  • 00:09:42
    network so this I think it was the
  • 00:09:48
    natural progression of the sequence of
  • 00:09:50
    document models and it was the
  • 00:09:52
    transformer model was first described
  • 00:09:54
    about two and a half years ago in this
  • 00:09:55
    paper attention is all you need and this
  • 00:09:58
    paper was addressing machine translation
  • 00:10:01
    so think about taking a document in in
  • 00:10:05
    English and converting it into French
  • 00:10:07
    right and so the classic way to do this
  • 00:10:09
    in neural network is encoder/decoder
  • 00:10:11
    here's the full structure there's a lot
  • 00:10:13
    going on here right so we're just going
  • 00:10:15
    to focus on the encoder part because
  • 00:10:16
    that's all you need for these supervised
  • 00:10:18
    learning problems the decoder is similar
  • 00:10:19
    anyway so zooming in on the encoder part
  • 00:10:22
    of it there's still quite a bit going on
  • 00:10:24
    and so we're but basically there's three
  • 00:10:27
    parts there's we're gonna talk about
  • 00:10:28
    first we're going to talk about this
  • 00:10:30
    attention part then we'll talk about the
  • 00:10:32
    part of the bottom of the positional
  • 00:10:33
    coding the top parts just not that hard
  • 00:10:35
    it's just a simple fully connected layer
  • 00:10:37
    so the attention mechanism in the middle
  • 00:10:39
    is the key to making this thing work on
  • 00:10:42
    documents of variable lengths and the
  • 00:10:45
    way they do that is by having an
  • 00:10:47
    all-to-all comparison for every layer of
  • 00:10:50
    the neural network for every output of
  • 00:10:52
    the next layer it
  • 00:10:55
    considers every possible input from the
  • 00:10:57
    previous layer in this N squared way and
  • 00:10:59
    it does this weighted sum of the
  • 00:11:01
    previous ones where the weighting is the
  • 00:11:04
    learned function right and then it
  • 00:11:07
    applies just a fully connected layer
  • 00:11:08
    after it but it this is this is great
  • 00:11:11
    for for a number of reasons one
  • 00:11:13
    is that you can you can look at this
  • 00:11:15
    thing and you can visually see what it's
  • 00:11:17
    doing so here is this translation
  • 00:11:19
    problem of converting from the English
  • 00:11:21
    sentence the agreement on the European
  • 00:11:23
    Economic Area was signed in August 1992
  • 00:11:26
    and translate that into French my
  • 00:11:29
    apologies l'accord sur la zone économique
  • 00:11:31
    européenne a été signé en août oh I forgot
  • 00:11:35
    1992 right and you can see the attention
  • 00:11:38
    so as it's generating as it's
  • 00:11:41
    generating each token in the output
  • 00:11:44
    it's it's starting with this whole
  • 00:11:46
    thing's name button its generating is
  • 00:11:47
    these output tokens one at a time and it
  • 00:11:50
    says okay first you got to translate the
  • 00:11:51
    the way I do that it translates into la
  • 00:11:54
    and all I'm doing is looking at this
  • 00:11:55
    next output token all I'm doing is
  • 00:11:58
    looking at agreement then sur is on la
  • 00:12:00
    is the okay now interesting European
  • 00:12:03
    Economic Area translates into zone
  • 00:12:06
    économique européenne so the order is
  • 00:12:08
    reversed right you can see the attention
  • 00:12:10
    mechanism is reversed also or you can
  • 00:12:12
    see very clearly what this thing is
  • 00:12:13
    doing as it's running along and the way
  • 00:12:16
    it works in the attention are setting
  • 00:12:19
    the transformer model the way they
  • 00:12:20
    describe it is with query and key
  • 00:12:23
    vectors so for every output position you
  • 00:12:26
    generate a query and for every input
  • 00:12:29
    you're considering you generate a key
  • 00:12:31
    and then the relevant score is just the
  • 00:12:32
    dot product of those two right and to
  • 00:12:36
    visualize that you first you combine the
  • 00:12:39
    key the query and the key values and
  • 00:12:41
    that gives you the relevant scores you
  • 00:12:44
    you use the softmax normalize them and
  • 00:12:46
    then you do a weighted average of the
  • 00:12:49
    values the third vector for each token
  • 00:12:52
    to get your output now to explain this
  • 00:12:56
    in a little bit more detail I'm going to
  • 00:12:57
    go through it in pseudocode so this
  • 00:12:59
    looks like Python it wouldn't actually
  • 00:13:00
    run but I think it's close enough to
  • 00:13:02
    help people understand what's going on
  • 00:13:04
    so you've got this attention function
  • 00:13:07
    right and it takes as input a list of
  • 00:13:11
    tensors I know you don't need to do that
  • 00:13:13
    a list of tensors one per token on
  • 00:13:16
    the input and then the first thing it
  • 00:13:18
    does it goes through each everything in
  • 00:13:20
    the sequence and it computes the query
  • 00:13:22
    the key and the value by multiplying the
  • 00:13:25
    appropriate input vector by Q
  • 00:13:27
    k and V which are these learned matrices
  • 00:13:29
    right so it learns this transformation
  • 00:13:31
    from the previous layer to whatever
  • 00:13:35
    should be the query the key and the
  • 00:13:36
    value at the at the next layer then it
  • 00:13:40
    goes through this double nested loop
  • 00:13:42
    alright so for every output token it
  • 00:13:46
    figures out okay this is the query I'm
  • 00:13:48
    working with and then it goes through
  • 00:13:49
    everything in the input and it
  • 00:13:51
    multiplies that query with the the key
  • 00:13:53
    from the possible key and it computes a
  • 00:13:57
    whole bunch of relevant scores and then
  • 00:13:59
    it normalizes these relevant scores
  • 00:14:01
    using a softmax which makes sure that
  • 00:14:04
    they just all add up to one so you can
  • 00:14:05
    sensibly can use that to compute a
  • 00:14:08
    weighted sum of all of the values so you
  • 00:14:11
    know you just go through for each output
  • 00:14:14
    you go through each of the each of the
  • 00:14:18
    input tokens the value score which is
  • 00:14:20
    calculated for them and you multiply it
  • 00:14:21
    by the relevance this is just a floating
  • 00:14:23
    point number from 0 to 1 and you get a
  • 00:14:25
    weighted average which is the output and
  • 00:14:27
    you return that so this is what's going
  • 00:14:30
    on in the attention mechanism which can
  • 00:14:33
    be which can be pretty confusing when
  • 00:14:35
    you just look at it look at the diagram
  • 00:14:37
    that like that but I hope this
  • 00:14:40
    I hope this explains it a little bit I'm
  • 00:14:42
    sure we'll get some questions on this so
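
For reference, here is a runnable NumPy version of that pseudocode, following the same query/key/value structure the speaker describes; the embedding size, random weight matrices, and softmax helper are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    d_model = 8                                    # toy embedding size
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

    def softmax(scores):
        exp = np.exp(scores - scores.max())        # subtract the max for numerical stability
        return exp / exp.sum()

    def attention(token_vectors):
        """Single-head self-attention: every output attends to every input (the all-to-all comparison)."""
        queries = [W_q @ x for x in token_vectors]
        keys    = [W_k @ x for x in token_vectors]
        values  = [W_v @ x for x in token_vectors]
        outputs = []
        for q in queries:                                              # for every output position...
            relevance = softmax(np.array([q @ k for k in keys]))       # ...score every input position
            outputs.append(sum(r * v for r, v in zip(relevance, values)))  # weighted average of values
        return outputs

    tokens = [rng.normal(size=d_model) for _ in range(5)]   # a 5-token "document"
    print(len(attention(tokens)))                           # 5 outputs, one per input token
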
  • 00:14:45
    relevant scores are interpretable as I
  • 00:14:48
    say and and this is is super helpful
  • 00:14:50
    right now the an innovation I think it
  • 00:14:56
    was novel in the transformer paper is
  • 00:14:58
    multi-headed attention and this is one
  • 00:15:01
    of these really clever and important
  • 00:15:03
    innovations that it's not actually all
  • 00:15:05
    that complicated at all I you just do
  • 00:15:08
    that same thing that same attention
  • 00:15:10
    mechanism eight times whatever whatever
  • 00:15:12
    value of 8 you want to use and that lets
  • 00:15:15
    the network learn eight different things
  • 00:15:17
    to pay attention to so in the
  • 00:15:19
    translation case it can learn an
  • 00:15:21
    attention mechanism for grammar one for
  • 00:15:23
    vocabulary one for gender one for tense
  • 00:15:25
    whatever it is right whatever the thing
  • 00:15:27
    needs to it can look at different parts
  • 00:15:28
    of the input document for different
  • 00:15:30
    purposes and do this at each layer right
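
A compact sketch of that idea, with separate projection matrices per head and the head outputs concatenated back together; the head count, sizes, and random weights (stand-ins for learned ones) are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(1)
    d_model, n_heads = 8, 4
    d_head = d_model // n_heads                      # each head works in a smaller subspace

    def softmax_rows(scores):
        exp = np.exp(scores - scores.max(axis=1, keepdims=True))
        return exp / exp.sum(axis=1, keepdims=True)

    def one_head(X, W_q, W_k, W_v):
        Q, K, V = X @ W_q, X @ W_k, X @ W_v          # X is (n_tokens, d_model)
        weights = softmax_rows(Q @ K.T / np.sqrt(d_head))   # all-to-all relevance scores
        return weights @ V                           # weighted averages of the values

    def multi_head_attention(X):
        """Run the same attention mechanism n_heads times and concatenate the results."""
        heads = []
        for _ in range(n_heads):                     # each head has its own projections (learned in a real model)
            W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
            heads.append(one_head(X, W_q, W_k, W_v))
        return np.concatenate(heads, axis=1)         # back to (n_tokens, d_model)

    X = rng.normal(size=(5, d_model))
    print(multi_head_attention(X).shape)             # (5, 8)
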
  • 00:15:32
    so you can kind of intuitively see how
  • 00:15:33
    this would be a really flexible
  • 00:15:34
    mechanism for for processing a document
  • 00:15:38
    or any sequence okay so that is
  • 00:15:41
    one of the key things that enables the
  • 00:15:44
    transformer model that's the multi-headed
  • 00:15:46
    attention part of it now let's look down
  • 00:15:48
    here at the positional encoding which is
  • 00:15:50
    which is critical and novel in a
  • 00:15:54
    critical innovation that I think is
  • 00:15:56
    incredibly clever so without this
  • 00:15:59
    positional encoding attention mechanisms
  • 00:16:01
    are just bags of words right there's
  • 00:16:03
    nothing saying what the difference is
  • 00:16:05
    between work to live or live to work
  • 00:16:07
    right there they're just all positions
  • 00:16:10
    they're all equivalent positions you're
  • 00:16:12
    just going to compute some score for
  • 00:16:14
    each of them so what they did is they
  • 00:16:17
    took a lesson from Fourier theory and
  • 00:16:20
    added in a bunch of sines and cosines as
  • 00:16:23
    extra dimensions
  • 00:16:25
    sorry not as extra dimensions but onto
  • 00:16:28
    the the word embeddings so going back so
  • 00:16:32
    what they do is they take the inputs
  • 00:16:33
    they use word2vec to calculate some
  • 00:16:35
    vector for each input token and then
  • 00:16:37
    onto that onto that embedding they add a
  • 00:16:41
    bunch of sine and cosines of different
  • 00:16:43
    frequencies starting at just pi and then
  • 00:16:46
    stretching out longer and longer and
  • 00:16:48
    longer and if you look at the whole
  • 00:16:51
    thing it looks like this and what this
  • 00:16:52
    does is it lets the model reason about
  • 00:16:56
    the relative position of any tokens
  • 00:16:58
    right so if you can kind of imagine that
  • 00:17:01
    the model can say if the orange
  • 00:17:03
    dimension is slightly higher than the
  • 00:17:05
    blue dimension on one word versus
  • 00:17:08
    another then you can see how it knows
  • 00:17:11
    that that token is to the left or right
  • 00:17:13
    of the other and because it has this at
  • 00:17:14
    all these different wavelengths it can
  • 00:17:16
    look across the entire document at kind
  • 00:17:18
    of arbitrary scales to see whether one
  • 00:17:20
    idea is before or after another
  • 00:17:23
    the key thing is that this is how the
  • 00:17:26
    system understands position and isn't
  • 00:17:29
    just a bag of words when
  • 00:17:32
    doing the attention
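
A small sketch of those sinusoidal position signals; this follows the sine/cosine formula from the "Attention Is All You Need" paper, and the sequence length and embedding size here are arbitrary:

    import numpy as np

    def positional_encoding(n_positions, d_model):
        """Sines and cosines at geometrically increasing wavelengths, added onto the word embeddings."""
        positions = np.arange(n_positions)[:, None]              # (n_positions, 1)
        dims = np.arange(0, d_model, 2)[None, :]                 # even embedding dimensions
        angles = positions / (10000 ** (dims / d_model))         # longer wavelength per dimension pair
        pe = np.zeros((n_positions, d_model))
        pe[:, 0::2] = np.sin(angles)
        pe[:, 1::2] = np.cos(angles)
        return pe

    embeddings = np.random.default_rng(0).normal(size=(10, 16))  # e.g. word2vec-style token vectors
    inputs = embeddings + positional_encoding(10, 16)            # position information is simply added on
    print(inputs.shape)                                          # (10, 16)
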
  • 00:17:33
    okay so transformers there's the two key
  • 00:17:36
    innovations as positional encoding and
  • 00:17:38
    multi-headed attention transformers are
  • 00:17:40
    awesome even though they are N squared
  • 00:17:42
    in the length of the document these
  • 00:17:44
    all-to-all comparisons can be done
  • 00:17:46
    almost for free in a modern GPU GPUs
  • 00:17:49
    changed all sorts of things right you
  • 00:17:51
    can do a thousand by thousand matrix
  • 00:17:53
    multiply as fast as you can do a ten by
  • 00:17:55
    two
  • 00:17:55
    in a lot of cases because they have so
  • 00:17:57
    much parallelism they have so much
  • 00:17:58
    bandwidth that but a fixed latency for
  • 00:18:01
    every operation so you can do these
  • 00:18:03
    massive massive multiplies almost for
  • 00:18:05
    free in a lot of cases so doing things
  • 00:18:07
    in N squared is not actually
  • 00:18:10
    necessarily much more expensive whereas
  • 00:18:11
    in an RNN like an LSTM you can't do
  • 00:18:16
    anything with token 11 until you're
  • 00:18:18
    completely done processing token 10 all
  • 00:18:21
    right so this is a key advantage of
  • 00:18:22
    transformers they're much more
  • 00:18:24
    computationally efficient also you don't
  • 00:18:28
    need to use any of these sigmoid or
  • 00:18:31
    tanh activation functions which are built
  • 00:18:33
    into the LSTM model these things
  • 00:18:35
    scale your activations to 0 1 why are
  • 00:18:38
    these things problematic so these were
  • 00:18:41
    bread-and-butter in the old days of of
  • 00:18:43
    neural networks people would use these
  • 00:18:46
    between layers all the time and they
  • 00:18:50
    make sense there that kind of
  • 00:18:51
    biologically inspired you take any
  • 00:18:53
    activation you scale it from 0 to 1 or
  • 00:18:55
    minus 1 to 1 but they're actually really
  • 00:18:57
    really problematic because if you get a
  • 00:19:00
    neuron which has a very high activation
  • 00:19:02
    value then you've got this number up
  • 00:19:05
    here which is 1 and you take the
  • 00:19:07
    derivative of that and it's 0 or it's
  • 00:19:10
    some very very small number and so your
  • 00:19:12
    gradient descent can't tell the
  • 00:19:14
    difference between an activation up here
  • 00:19:16
    and one way over on the other side so
  • 00:19:19
    it's very easy for the trainer to get
  • 00:19:21
    confused if your activations don't stay
  • 00:19:23
    near this middle part all right and
  • 00:19:25
    that's problematic compare that to ReLU
  • 00:19:26
    which is the standard these days and
  • 00:19:28
    ReLU yes it does have this this
  • 00:19:31
    very very large dead space but if you're
  • 00:19:34
    not in the dead space then there's
  • 00:19:36
    nothing stopping it from getting getting
  • 00:19:38
    bigger and bigger and scaling off to
  • 00:19:39
    infinity and one of the reasons why when
  • 00:19:43
    the intuitions behind why this works
  • 00:19:45
    better as Geoffrey Hinton puts it is
  • 00:19:47
    that this allows each neuron it to
  • 00:19:49
    express a stronger opinion right in an
  • 00:19:53
    LS sorry in a sigmoid there is really no
  • 00:19:56
    difference between the activation being
  • 00:19:58
    three or eight or twenty or a hundred
  • 00:20:01
    the output is the same right it all I
  • 00:20:05
    can say is kind of yes no maybe right
  • 00:20:07
    but
  • 00:20:09
    with ReLU it can say the
  • 00:20:11
    activation is five or a hundred or a
  • 00:20:13
    thousand and these are all meaningfully
  • 00:20:15
    different values that can be used for
  • 00:20:16
    different purposes down the line right
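
To make that concrete, a tiny worked example (the specific activation values are just for illustration):

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    def sigmoid_grad(x):
        s = sigmoid(x)
        return s * (1.0 - s)

    # Sigmoid outputs all crowd up against 1 and the gradient collapses toward 0,
    # while ReLU would just return 3, 8, 20, 100 with a gradient of 1 everywhere.
    for x in [3, 8, 20, 100]:
        print(x, round(sigmoid(x), 6), sigmoid_grad(x))
        # 3 -> 0.952574 (grad ~4.5e-02), 8 -> 0.999665 (grad ~3.4e-04), 20 -> ~1.0 (grad ~2.1e-09)
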
  • 00:20:18
    so each neuron it can express more
  • 00:20:20
    information also the gradient doesn't
  • 00:20:23
    saturate we talked about that and very
  • 00:20:27
    critically and I think this is really
  • 00:20:28
    underappreciated ReLUs are really
  • 00:20:32
    insensitive to random initialization if
  • 00:20:34
    you're working with a bunch of sigmoid
  • 00:20:35
    layers you need to pick those random
  • 00:20:37
    values at the beginning of your training
  • 00:20:39
    to make sure that your activation values
  • 00:20:42
    are in that middle part where you're
  • 00:20:44
    going to get reasonable gradients and
  • 00:20:45
    people used to worry a lot about what
  • 00:20:48
    initialization to use for your neural
  • 00:20:49
    network you don't hear people worrying
  • 00:20:51
    about that much at all anymore and
  • 00:20:53
    ReLUs are really the key reason why that
  • 00:20:56
    is also really runs great on low
  • 00:20:58
    precision hardware those those
  • 00:21:01
    smooth activation functions they
  • 00:21:03
    need 32-bit float maybe you can get it
  • 00:21:06
    to work in 16-bit float sometimes but
  • 00:21:08
    you're not going to be running it an
  • 00:21:09
    8-bit int without a ton of careful work
  • 00:21:12
    and that is the kind of thing that is
  • 00:21:13
    really easy to do with a ReLU-based
  • 00:21:16
    network and a lot of hardware is going
  • 00:21:17
    in that direction because it takes
  • 00:21:19
    vastly fewer transistors and a lot less
  • 00:21:21
    power to do 8-bit integer math versus
  • 00:21:24
    32-bit float it's also stupidly easy to
  • 00:21:27
    compute the gradient it's one or zero
  • 00:21:30
    right you just take that top bit and
  • 00:21:32
    you're done so the derivative is
  • 00:21:33
    ridiculously easy ReLU has some
  • 00:21:35
    downsides it does have those dead
  • 00:21:37
    neurons on on the left side you can fix
  • 00:21:39
    that with a leaky ReLU there's this
  • 00:21:41
    discontinuity in the gradient at the
  • 00:21:43
    origin you can fix that with GELU
  • 00:21:45
    which BERT uses and so this brings me
  • 00:21:49
    to a little aside about general deep
  • 00:21:51
    learning wisdom if you're designing a
  • 00:21:54
    new network for whatever reason don't
  • 00:21:57
    bother messing with different kinds of
  • 00:21:58
    activations don't bother trying sigmoid
  • 00:22:00
    or tanh they're they're probably not
  • 00:22:02
    going to work out very well but
  • 00:22:04
    different optimizers do matter Adam is a
  • 00:22:07
    great place to start it's super fast it
  • 00:22:09
    tends to give pretty good results it has
  • 00:22:11
    a bit of a tendency to overfit if you
  • 00:22:13
    really are trying to squeeze the juice
  • 00:22:14
    out of your system and you want the best
  • 00:22:15
    results SGD is likely to get you a
  • 00:22:18
    better result but it's going to take
  • 00:22:19
    quite a bit more time to
  • 00:22:22
    converge sometimes RMSprop beats the pants
  • 00:22:25
    off both of them it's worth playing
  • 00:22:26
    around with these with these things I
  • 00:22:28
    told you about why I think SWA is great
  • 00:22:30
    there's this system called attitude my
  • 00:22:32
    old team at Amazon released where you
  • 00:22:35
    don't even need to take a learning rate
  • 00:22:37
    it dynamically calculates the ideal
  • 00:22:39
    learning rate scheduled at every point
  • 00:22:41
    during training for you it's kind of
  • 00:22:42
    magical so it's worth playing around
  • 00:22:46
    with different optimizers but don't mess
  • 00:22:47
    with the with the activation functions
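
Since the fine-tuning example later in the talk is in TensorFlow, here is what swapping optimizers looks like there; the stand-in model and learning rates are placeholders, not the speaker's settings:

    import tensorflow as tf

    model = tf.keras.Sequential([tf.keras.layers.Dense(2)])    # stand-in model

    # Adam: fast, a strong default, can overfit a little.
    adam = tf.keras.optimizers.Adam(learning_rate=1e-3)
    # SGD: often squeezes out a slightly better final result, but converges more slowly.
    sgd = tf.keras.optimizers.SGD(learning_rate=1e-2, momentum=0.9)
    # RMSprop: occasionally beats both, so it is worth trying.
    rmsprop = tf.keras.optimizers.RMSprop(learning_rate=1e-3)

    model.compile(optimizer=adam,                               # swap in sgd or rmsprop to compare
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics=["accuracy"])
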
  • 00:22:49
    okay
  • 00:22:50
    let's pop out right there's a bunch of a
  • 00:22:52
    bunch of Theory bunch of math and and
  • 00:22:53
    ideas in there how do we actually apply
  • 00:22:55
    this stuff in code so if you want to use
  • 00:22:58
    a transformer I strongly recommend
  • 00:23:01
    hopping over to the the fine folks at
  • 00:23:04
    hugging face and using their transformer
  • 00:23:06
    package they have both PyTorch and
  • 00:23:09
    TensorFlow implementations pre-trained
  • 00:23:11
    models ready to fine tune and I'll show
  • 00:23:14
    you how easy it is here's how to fine
  • 00:23:16
    tune a Bert model in just 12 lines of
  • 00:23:18
    code you just pick what kind of Bert you
  • 00:23:21
    want the base model that's paying
  • 00:23:24
    attention to upper and lower case you get
  • 00:23:26
    the tokenizer to convert your string
  • 00:23:27
    into tokens you download the pre trained
  • 00:23:30
    model in one line of code pick your data
  • 00:23:32
    set for your own problem process the
  • 00:23:35
    data set with the tokenizer
  • 00:23:37
    to get training validation splits
  • 00:23:39
    shuffle and batch um four more lines of
  • 00:23:41
    code another four lines of code to
  • 00:23:43
    instantiate your optimizer define your
  • 00:23:46
    loss function pick a metric it's
  • 00:23:49
    tensorflow so you got to compile it and
  • 00:23:52
    then you call fit and that's it that's
  • 00:23:55
    all you need to do that's all you need to
  • 00:23:57
    do to fine-tune a state-of-the-art
  • 00:23:59
    language model on your specific problem
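
A hedged sketch of that flow with the Hugging Face transformers package in TensorFlow; the tiny placeholder dataset and hyperparameters are assumptions for illustration, not the speaker's exact 12 lines:

    import tensorflow as tf
    from transformers import BertTokenizer, TFBertForSequenceClassification

    # Pick the flavor of BERT (the cased base model), grab its tokenizer and pretrained weights.
    tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
    model = TFBertForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

    # Placeholder data; in practice this is your own labeled dataset.
    texts, labels = ["great product, would buy again", "click here to win $$$"], [0, 1]
    encodings = tokenizer(texts, padding=True, truncation=True, return_tensors="tf")
    dataset = tf.data.Dataset.from_tensor_slices((dict(encodings), labels)).shuffle(2).batch(2)

    # Compile and fit, just like any other Keras model.
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5),
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics=["accuracy"])
    model.fit(dataset, epochs=1)
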
  • 00:24:01
    and the fact you can do this on some pre
  • 00:24:04
    trained model that's that's seen tons
  • 00:24:06
    and tons of data that easily is really
  • 00:24:08
    amazing and there's even bigger models
  • 00:24:11
    out there right so Nvidia made this
  • 00:24:12
    model called Megatron with eight
  • 00:24:14
    billion parameters they ran a hundreds
  • 00:24:16
    of GPUs for over a week spent vast
  • 00:24:19
    quantities of cash well I mean they own
  • 00:24:20
    the stuff so not really but they
  • 00:24:22
    they put a ton of energy into training
  • 00:24:26
    this I've heard people a lot of people
  • 00:24:27
    complaining about how much greenhouse
  • 00:24:29
    gas comes from training model like
  • 00:24:31
    Megatron I think that's totally the
  • 00:24:33
    wrong
  • 00:24:34
    way of looking at this because they only
  • 00:24:37
    need to do this once in the history of
  • 00:24:40
    the world and everybody in this room can
  • 00:24:42
    do it without having to burn those GPUs
  • 00:24:45
    again right these things are reusable
  • 00:24:47
    and fine tunable I don't think they've
  • 00:24:48
    actually released this yet but but they
  • 00:24:51
    might and somebody else will
  • 00:24:53
    right so you don't need to do that that
  • 00:24:55
    expensive work over and over again right
  • 00:24:57
    this thing learns a base model really
  • 00:25:00
    well the folks at Facebook trained this
  • 00:25:03
    Roberta model on two and a half
  • 00:25:04
    terabytes of data across over a hundred
  • 00:25:07
    languages and this thing understands low
  • 00:25:10
    resource languages like Swahili and an
  • 00:25:13
    Urdu in ways that the it's just vastly
  • 00:25:16
    better than what's been done before and
  • 00:25:18
    again these are reusable if you need a
  • 00:25:21
    model that understands all the world's
  • 00:25:23
    languages this is accessible to you by
  • 00:25:25
    leveraging other people's work and
  • 00:25:27
    before Bert and transformers and the
  • 00:25:29
    Muppets this just was not possible now
  • 00:25:31
    you can leverage other people's work in
  • 00:25:34
    this way and I think that's really
  • 00:25:36
    amazing so to sum up the key advantages
  • 00:25:39
    of these transformer networks yes
  • 00:25:41
    they're easier to train they're more
  • 00:25:42
    efficient all that yada yada yada
  • 00:25:44
    but more importantly transfer learning
  • 00:25:47
    actually works with them right you can
  • 00:25:49
    take a pre trained model fine-tune it
  • 00:25:51
    for your task without a specific data
  • 00:25:53
    set and another really critical point
  • 00:25:56
    which I didn't get a chance to go into
  • 00:25:57
    is that these things are originally
  • 00:25:59
    trained on large quantities of
  • 00:26:01
    unsupervised text you can just take all
  • 00:26:03
    of the world's text data and use this as
  • 00:26:06
    training data the way it works very very
  • 00:26:07
    quickly is kind of comparable to how
  • 00:26:10
    word2vec works where the language
  • 00:26:11
    model tries to predict some missing
  • 00:26:14
    words from a document and in that's
  • 00:26:17
    enough for it to understand how to build
  • 00:26:21
    a supervised model using vast quantities
  • 00:26:24
    of text without any effort to label them
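
For a feel of that pretraining objective, a hedged example using the Hugging Face fill-mask pipeline; the model name and sentence are arbitrary choices, not from the talk:

    from transformers import pipeline

    # A pretrained masked language model guesses the hidden word from context alone;
    # no human labels were ever needed to create this training signal.
    fill_mask = pipeline("fill-mask", model="bert-base-uncased")
    for guess in fill_mask("Transformers are surprisingly [MASK] to fine-tune."):
        print(guess["token_str"], round(guess["score"], 3))
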
  • 00:26:28
    the LSTM still has its place in
  • 00:26:30
    particular if the sequence length is
  • 00:26:32
    very long or infinite you can't do n
  • 00:26:35
    squared right
  • 00:26:36
    and that happens if you're doing real
  • 00:26:38
    time control like for a robot or a
  • 00:26:40
    thermostat or something like that you
  • 00:26:41
    can't have the entire sequence and for
  • 00:26:44
    some reason you can't pre train on some
  • 00:26:46
    large corpus the LSTM seems
  • 00:26:48
    to outperform transformers when your
  • 00:26:50
    dataset size is is relatively small and
  • 00:26:52
    fixed and with that I will take
  • 00:26:57
    questions
  • 00:26:58
    well yes yeah word CNN how do you
  • 00:27:13
    compare word CNNs and transformers so when
  • 00:27:16
    when I wrote this paper the rise and
  • 00:27:19
    fall and rise and fall of LSTM I
  • 00:27:20
    predicted at that time that word CNNs were
  • 00:27:23
    going to be the thing that replaced the
  • 00:27:25
    LSTM I did not I did not see this this
  • 00:27:29
    transformer thing coming so a word CNN
  • 00:27:31
    has a lot of the advantages in terms of
  • 00:27:34
    parallelism and the ability to use
  • 00:27:36
    ReLU and the key difference is that it
  • 00:27:39
    only looks at a fixed size window fixed
  • 00:27:41
    size part of the document instead of
  • 00:27:42
    looking at the entire document at once
  • 00:27:44
    and so it's it's got a fair amount
  • 00:27:49
    fundamentally in common word CNN's have
  • 00:27:53
    an easier task easier time identifying
  • 00:27:57
    bigrams trigrams things like that
  • 00:28:00
    because it's got those direct
  • 00:28:01
    comparisons right it doesn't need this
  • 00:28:02
    positional encoding trick to try to
  • 00:28:04
    infer with Fourier waves where
  • 00:28:08
    things are relative to each other so
  • 00:28:10
    it's got that advantage for
  • 00:28:12
    understanding close closely related
  • 00:28:13
    tokens but it can't see across the
  • 00:28:16
    entire document at once right it's got a
  • 00:28:20
    much harder time reasoning like a word
  • 00:28:23
    CNN can't easily answer a question like
  • 00:28:25
    does this concept exist anywhere in this
  • 00:28:29
    document whereas a transformer can very
  • 00:28:31
    easily answer that just by having some
  • 00:28:33
    attention query that finds that
  • 00:28:35
    regardless of where it is CNN would need
  • 00:28:37
    a very large large window or a series of
  • 00:28:40
    windows cascading up to to be able to
  • 00:28:42
    accomplish that
Tags
  • transformers
  • transfer learning
  • NLP
  • RNN
  • LSTM
  • multi-head attention
  • positional encoding
  • language models
  • efficiency
  • sequence handling