LSTM is dead. Long Live Transformers!

00:28:48
https://www.youtube.com/watch?v=S27pHKBEp30

Summary

TL;DR: This presentation covers a significant shift in natural language processing (NLP) brought about by the introduction of transformers. These models have dramatically improved how we handle variable-length documents compared with earlier approaches, notably recurrent neural networks (RNNs) and their more advanced variants such as LSTMs. Transformers rely on two main innovations, multi-head attention and positional encodings, which preserve sequence information while taking the global context of the document into account. This enables more efficient parallel processing and easier transfer learning from large quantities of unsupervised text. Unlike earlier models, transformers do not require sigmoid activations that can saturate gradients, which makes them less prone to vanishing or exploding gradients. The benefits include easier training, better predictions across a range of NLP applications, and straightforward model reuse through transfer learning.

Key takeaways

  • 👥 Transformers are replacing traditional language models.
  • 🔄 Transfer learning is much easier with transformers.
  • 🔍 Multi-head attention enables precise contextual analysis.
  • 📏 Positional encodings preserve word order.
  • 🖩 Transformers offer greater computational efficiency.
  • 💡 The architecture allows faster, easier training.
  • 📚 Large language models can be reused for many different tasks.
  • ⚡ ReLUs avoid gradient saturation (see the short sketch after this list).
  • 🔗 Well suited to long documents thanks to global attention.
  • 🧠 Transformer-based variants deliver better predictions.
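
The ReLU point above is easy to check numerically. Below is a minimal Python sketch (not from the talk) comparing the derivative of a sigmoid, which saturates for large activations, with the derivative of a ReLU, which does not:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sigmoid_grad(x):
        s = sigmoid(x)
        return s * (1.0 - s)          # shrinks toward 0 for large |x|

    def relu_grad(x):
        return float(x > 0)           # stays at 1 for any positive activation

    for x in [1.0, 5.0, 20.0]:
        print(f"x={x:5.1f}  sigmoid'={sigmoid_grad(x):.8f}  relu'={relu_grad(x):.0f}")
    # The sigmoid gradient becomes vanishingly small as x grows, so gradient
    # descent can barely distinguish an activation of 5 from one of 20;
    # the ReLU gradient stays at 1.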

Timeline

  • 00:00:00 - 00:05:00

    The speaker is reluctant to give two talks in a row but feels it is important to discuss the significant progress in natural language processing (NLP) brought by transformers. He describes the challenge of representing text documents as fixed-size vectors, going into the details of bag-of-words models and their limitations.

  • 00:05:00 - 00:10:00

    He covers recurrent neural networks (RNNs) and their associated challenges, such as vanishing and exploding gradients, which make RNNs poorly suited to long sequences (see the short numeric sketch after this timeline). Long short-term memory (LSTM) networks help, but have limitations in terms of training complexity and transfer learning.

  • 00:10:00 - 00:15:00

    The speaker describes the emergence of transformers, with BERT and the "Muppet" models, which revolutionized how sequences of tokens are processed. The transformer model, introduced in the context of machine translation, uses attention as the key mechanism for handling variable-length documents.

  • 00:15:00 - 00:20:00

    Transformers stand out through their multi-head attention mechanism and positional encoding, which preserves word order in sequences and fixes a major shortcoming of bag-of-words models. These innovations make transformers very efficient on modern GPUs.

  • 00:20:00 - 00:28:48

    In conclusion, transformers are easier to train thanks to their parallel structure and make transfer learning practical. Compared with older CNN or RNN approaches they offer more flexibility, although RNNs such as LSTMs keep advantages in some settings, notably when sequences are very long or unbounded.
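
The vanishing/exploding gradient problem mentioned in the 00:05:00 - 00:10:00 item comes from applying the same recurrent transformation at every time step. A short illustrative Python sketch (my own example, not code from the talk):

    import numpy as np

    steps = 100                          # think: backpropagating through a 100-word document

    # Scalar intuition: repeatedly multiplying by the same factor.
    print(0.9 ** steps)                  # ~2.7e-05 -> gradients vanish
    print(1.1 ** steps)                  # ~1.4e+04 -> gradients explode

    # Matrix version: what matters is the eigenvalues of the recurrent
    # weight matrix W that is applied at every step.
    W_shrink = np.diag([0.9, 0.8])       # eigenvalues < 1
    W_grow = np.diag([1.1, 1.2])         # eigenvalues > 1
    print(np.linalg.norm(np.linalg.matrix_power(W_shrink, steps)))   # ~0
    print(np.linalg.norm(np.linalg.matrix_power(W_grow, steps)))     # huge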


Frequently asked questions

  • What is the main topic of the presentation?

    The main topic is the major shift in natural language processing brought about by the introduction of transformers.

  • Why are standard recurrent networks problematic?

    They suffer from exploding or vanishing gradients, which makes training difficult.

  • What makes transformers unique?

    Transformers use multi-head attention and positional encodings to process variable-length documents efficiently.

  • What is the advantage of transformers over LSTMs?

    Transformers are more computationally efficient and allow transfer learning from large quantities of unsupervised text.

  • What major computational problem do transformers solve?

    They avoid the step-by-step sequential computation of RNNs, which makes them faster and more efficient.

  • How do transformers handle word position?

    They use sine- and cosine-based positional encodings to preserve word order (see the sketch after these FAQs).

  • What is multi-head attention?

    It is a technique that lets the model attend to different parts of a text for different purposes, such as grammar or vocabulary.

  • Why is transfer learning important with transformers?

    It allows models pre-trained on large corpora to be reused, making adaptation to specific tasks easier and more effective.

  • What kinds of tasks are still a good fit for LSTMs?

    Tasks with very long or unbounded sequences, where the quadratic cost of transformers is prohibitive.

  • Why are ReLU activations preferred?

    They avoid gradient saturation and run efficiently on low-precision hardware.
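
The positional-encoding answer above can be made concrete with a short sketch of the sinusoidal scheme from the "Attention Is All You Need" paper. This is an illustrative implementation assuming the standard 10000^(2i/d_model) frequency schedule; it is not code shown in the talk:

    import numpy as np

    def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
        """Sinusoidal positional encodings, shape (seq_len, d_model)."""
        positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
        dims = np.arange(d_model)[None, :]               # (1, d_model)
        # Each pair of dimensions gets a different wavelength.
        angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
        angles = positions * angle_rates
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles[:, 0::2])            # even dimensions: sine
        pe[:, 1::2] = np.cos(angles[:, 1::2])            # odd dimensions: cosine
        return pe

    # The encodings are added onto the word embeddings, which lets attention
    # reason about relative order at many different scales.
    embeddings = np.random.randn(12, 64)                 # 12 tokens, d_model = 64
    inputs_with_position = embeddings + positional_encoding(12, 64)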

Captions (en)
  • 00:00:03
    all right cool well thanks everybody um
  • 00:00:06
    so I'm gonna give the second talk
  • 00:00:08
    tonight which I'm not crazy about and
  • 00:00:10
    and I don't want this pattern to to
  • 00:00:12
    repeat but you know Andrew and I wanted
  • 00:00:14
    to kick this series off and and felt
  • 00:00:20
    like me talking twice were better than
  • 00:00:23
    than not but we're gonna we're gonna get
  • 00:00:26
    more diversity of folks if any of you
  • 00:00:28
    want to give a talk yourselves you know
  • 00:00:29
    somebody who you think might that'd be
  • 00:00:31
    awesome but a topic that I feel is
  • 00:00:34
    important for practitioners to
  • 00:00:35
    understand is a real sea change in
  • 00:00:38
    natural language processing that's you
  • 00:00:40
    know all of like 12 months old but is
  • 00:00:42
    one these things I think is incredibly
  • 00:00:44
    significant in in the field and that is
  • 00:00:47
    the advance of the Transformers so the
  • 00:00:52
    outline for this talk is to start out
  • 00:00:55
    with some background on natural language
  • 00:00:57
    processing and sequence modeling and
  • 00:01:00
    then talk about the LSTM why it's
  • 00:01:03
    awesome and amazing but still not good
  • 00:01:05
    enough and then go into Transformers and
  • 00:01:09
    talk about how they work and why they're
  • 00:01:12
    amazing
  • 00:01:12
    so for background on natural language
  • 00:01:14
    processing NLP I'm gonna be talking just
  • 00:01:18
    about a subset of NLP which is the
  • 00:01:21
    supervised learning a part of it so not
  • 00:01:23
    structured prediction sequence
  • 00:01:25
    prediction but where you're taking the
  • 00:01:29
    document as some input and trying to
  • 00:01:32
    predict some fairly straightforward
  • 00:01:34
    output about it like is this document
  • 00:01:37
    spam right and so what this means is
  • 00:01:41
    that you need to somehow take your
  • 00:01:44
    document and represent it as a
  • 00:01:46
    fixed-size vector because I'm not aware
  • 00:01:49
    of any linear algebra that works on
  • 00:01:50
    vectors of variable dimensionality and
  • 00:01:53
    the challenge with this is that
  • 00:01:55
    documents are of variable length right
  • 00:01:58
    so you have to come up with some way of
  • 00:02:00
    taking that document and meaningfully
  • 00:02:02
    encoding it into a fixed size vector
  • 00:02:04
    right so the classic way of doing this
  • 00:02:06
    is the bag of words right where you have
  • 00:02:08
    one dimension per unique word in your
  • 00:02:11
    vocabulary
  • 00:02:12
    so English has I don't know about a
  • 00:02:14
    hundred thousand words in the vocabulary
  • 00:02:16
    right
  • 00:02:17
    and so you have a hundred thousand
  • 00:02:18
    dimensional vector most of them are zero
  • 00:02:20
    because most words are not present in
  • 00:02:22
    your document and the ones that are have
  • 00:02:24
    some value that's maybe a count or
  • 00:02:26
    tf-idf score or something like that
  • 00:02:28
    and that is your vector and this
  • 00:02:32
    naturally leads to sparse data where
  • 00:02:35
    again it's mostly zero so you don't
  • 00:02:37
    store the zeros because that's
  • 00:02:38
    computationally inefficient you store
  • 00:02:39
    lists of a position value tuples or
  • 00:02:43
    maybe just a list of positions and this
  • 00:02:46
    makes the computation much cheaper and
  • 00:02:48
    this works this works reasonably well a
  • 00:02:51
    key limitation is that when you're
  • 00:02:52
    looking at an actual document order matters
  • 00:02:54
    right these two documents mean
  • 00:02:58
    completely different things right but a
  • 00:03:01
    bag of words model will score them
  • 00:03:02
    identically every single time because
  • 00:03:04
    they have the exact same vectors for for
  • 00:03:08
    what words are present so the solution
  • 00:03:11
    to that in this context is n-grams you
  • 00:03:14
    can have bigrams which is every pair of
  • 00:03:15
    possible words or trigrams are for every
  • 00:03:17
    combination of three words which would
  • 00:03:19
    easily distinguish between those two but
  • 00:03:21
    now you're up to what is that a
  • 00:03:23
    quadrillion dimensional vector and you
  • 00:03:26
    can do it but you know you start running
  • 00:03:28
    into all sorts of problems when you walk
  • 00:03:31
    down that path so a in neural network
  • 00:03:35
    land it's the natural way to just solve
  • 00:03:37
    this problem is the RNN which is the
  • 00:03:40
    recurrent neural network not the
  • 00:03:42
    recursive neural network I've made that
  • 00:03:43
    mistake but RNNs are a new approach
  • 00:03:48
    to this which asked the question how do
  • 00:03:50
    you calculate a function on a
  • 00:03:52
    variable-length set of input and they
  • 00:03:55
    answer it using a for loop in math where
  • 00:03:58
    they recursively define the output at
  • 00:04:02
    any stage as a function of the inputs at
  • 00:04:05
    the previous stages and the previous
  • 00:04:07
    output and then for the purpose of
  • 00:04:10
    supervised learning the final output is
  • 00:04:13
    just the final hidden state here and so
  • 00:04:16
    visually this looks like this activation
  • 00:04:18
    which takes an input from the raw
  • 00:04:21
    document X and also itself in the
  • 00:04:23
    previous time you can unroll this and
  • 00:04:25
    visualize it as a very deep neural
  • 00:04:27
    network where there the final answer
  • 00:04:31
    the number you're looking at the end is
  • 00:04:32
    this and it's this deep neural network
  • 00:04:34
    that processes every one of the inputs
  • 00:04:36
    along the way alright and the problem
  • 00:04:39
    with this classic vanilla RNN
  • 00:04:42
    this plain recurrent neural network is
  • 00:04:44
    vanishing and exploding gradients right
  • 00:04:45
    so you take this recursive definition of
  • 00:04:50
    the hidden state and you imagine what
  • 00:04:53
    happens just three points in right and
  • 00:04:55
    so you're calling this a function this a
  • 00:04:57
    transformation over and over and over
  • 00:04:59
    again on your data and classically in
  • 00:05:02
    the vanilla one case this is just
  • 00:05:04
    some matrix multiplication some learned
  • 00:05:06
    matrix W times your input X and so when
  • 00:05:10
    you go out and say a hundred words in
  • 00:05:12
    you're taking that W vector W matrix and
  • 00:05:15
    you're multiplying it a hundred times
  • 00:05:17
    alright so in in simple math in in real
  • 00:05:22
    number math we know that if you take any
  • 00:05:24
    number less than one and raise it to a
  • 00:05:26
    very high dimensional value sorry very
  • 00:05:28
    high exponent you get some incredibly
  • 00:05:30
    small number and if your number is
  • 00:05:32
    slightly larger than one then it blows
  • 00:05:34
    up to something big
  • 00:05:35
    and if you go if your X even higher if
  • 00:05:37
    you have longer documents this gets even
  • 00:05:39
    worse and in linear algebra this is
  • 00:05:41
    about the same except you need to think
  • 00:05:44
    about the eigenvalues of the matrix so
  • 00:05:46
    the eigenvalues say how much the
  • 00:05:48
    matrix is going to grow or shrink
  • 00:05:50
    vectors when the transformation is
  • 00:05:53
    applied and if your eigenvalues are less
  • 00:05:55
    than one in this transformation you're
  • 00:05:57
    going to get these gradients that go to
  • 00:05:59
    zero as you use this matrix over and
  • 00:06:00
    over again if they're greater than one
  • 00:06:02
    then your gradients are going to explode
  • 00:06:03
    all right and so this made vanilla RNNs
  • 00:06:05
    extremely difficult to work with
  • 00:06:07
    and basically just didn't work on
  • 00:06:08
    anything but fairly short sequences all
  • 00:06:12
    right so LSTM to the rescue right so I
  • 00:06:15
    wrote this document a few years ago
  • 00:06:17
    called the rise and fall and rise and
  • 00:06:19
    fall of LSTM so LSTM came
  • 00:06:24
    around in the dark ages and then it went
  • 00:06:27
    into the AI winter it came back again
  • 00:06:29
    for awhile but I think it's on its way
  • 00:06:31
    out again now with with transformers so
  • 00:06:34
    LSTM to be clear is a kind of
  • 00:06:36
    recurrent neural network it just houses a
  • 00:06:38
    more sophisticated cell inside and it
  • 00:06:42
    was invented originally in the dark ages
  • 00:06:44
    on
  • 00:06:45
    stone tablet that has been recovered
  • 00:06:48
    into a PDF that you can access on Sepp
  • 00:06:51
    Hochreiter's server, I kid, but
  • 00:06:55
    Sepp and Jürgen are both great, I
  • 00:06:57
    enjoy them both quite a bit but they
  • 00:07:00
    did a bunch of amazing work in the 90s
  • 00:07:02
    that was really well ahead of its time
  • 00:07:04
    and and often get neglected and
  • 00:07:09
    forgotten as time goes on that's totally
  • 00:07:12
    not fair because they did an amazing
  • 00:07:13
    research so the LSTM cell looks like
  • 00:07:16
    this it actually has two hidden states
  • 00:07:18
    and the the input coming along the
  • 00:07:21
    bottom and the output up the top again
  • 00:07:23
    and these two hidden states and I'm not
  • 00:07:25
    going to go into it in detail and you
  • 00:07:26
    should totally look at Christopher Olah's
  • 00:07:28
    blog post if you want to dive into it
  • 00:07:29
    but the key point is that these these
  • 00:07:32
    transformations these the matrix
  • 00:07:34
    multiplies right and they are not
  • 00:07:35
    applied recursively on the main hidden
  • 00:07:38
    vector all you're doing is you're adding
  • 00:07:40
    in or the forget gate yeah you actually
  • 00:07:44
    don't really need it but you're adding
  • 00:07:46
    in some some new number and so the LSTM
  • 00:07:49
    is actually a lot like a ResNet it's a
  • 00:07:51
    lot like a CNN ResNet in that you're
  • 00:07:53
    adding new values on to the activation
  • 00:07:57
    as you go through the layers right and
  • 00:08:00
    so this solves the exploding and
  • 00:08:03
    vanishing gradients problems however the
  • 00:08:06
    LSTM is still pretty difficult to train
  • 00:08:09
    because you still have these very long
  • 00:08:11
    gradient paths even even without even
  • 00:08:14
    with those residual connections you're
  • 00:08:15
    still propagating gradients from the end
  • 00:08:17
    all the way through this transformation
  • 00:08:19
    cell over at the beginning and for a
  • 00:08:20
    long document this means very very deep
  • 00:08:22
    networks that are notoriously
  • 00:08:26
    difficult to train and more importantly
  • 00:08:29
    transfer learning never really worked on
  • 00:08:32
    these LSTM models right one of the
  • 00:08:35
    great things about ImageNet and CNNs
  • 00:08:37
    is that you can train a convolutional
  • 00:08:40
    net on millions of images in ImageNet
  • 00:08:42
    and take that neural network and
  • 00:08:44
    fine-tune it for some new problem that
  • 00:08:46
    you have and the the starting state of
  • 00:08:50
    the ImageNet CNN gives you a great
  • 00:08:52
    a great place to start from when you're
  • 00:08:55
    looking for a new neural network and
  • 00:08:56
    makes training on your own problem much
  • 00:08:58
    easier
  • 00:08:58
    with much less data that never
  • 00:09:00
    really worked with LSTMs sometimes
  • 00:09:01
    it did but it just wasn't very reliable
  • 00:09:04
    which means that anytime you're using an
  • 00:09:06
    LSTM you need a new labeled data set
  • 00:09:10
    that's specific to your task and that's
  • 00:09:12
    expensive okay so this this changed
  • 00:09:16
    dramatically just about a year ago when
  • 00:09:18
    the BERT model was was released so
  • 00:09:23
    you'll hear people talk about
  • 00:09:24
    Transformers and Muppets together and
  • 00:09:26
    the reason for this is that the original
  • 00:09:29
    paper on this technique that describes
  • 00:09:31
    the network architecture it was called
  • 00:09:33
    the transformer network and then the
  • 00:09:34
    BERT paper is a Muppet and so is the ELMo
  • 00:09:36
    paper and you know researchers just ran
  • 00:09:38
    with the joke um so this is just context
  • 00:09:40
    you understand what people are talking
  • 00:09:41
    about if they say we'll use a Muppet
  • 00:09:42
    network so this I think it was the
  • 00:09:48
    natural progression of the sequence of
  • 00:09:50
    document models and it was the
  • 00:09:52
    transformer model was first described
  • 00:09:54
    about two and a half years ago in this
  • 00:09:55
    paper Attention Is All You Need and this
  • 00:09:58
    paper was addressing machine translation
  • 00:10:01
    so think about taking a document in in
  • 00:10:05
    English and converting it into French
  • 00:10:07
    right and so the classic way to do this
  • 00:10:09
    in neural network is encoder/decoder
  • 00:10:11
    here's the full structure there's a lot
  • 00:10:13
    going on here right so we're just going
  • 00:10:15
    to focus on the encoder part because
  • 00:10:16
    that's all you need for these supervised
  • 00:10:18
    learning problems the decoder is similar
  • 00:10:19
    anyway so zooming in on the encoder part
  • 00:10:22
    of it there's still quite a bit going on
  • 00:10:24
    and so we're but basically there's three
  • 00:10:27
    parts there's we're gonna talk about
  • 00:10:28
    first we're going to talk about this
  • 00:10:30
    attention part then we'll talk about the
  • 00:10:32
    part of the bottom of the positional
  • 00:10:33
    encoding the top part's just not that hard
  • 00:10:35
    it's just a simple fully connected layer
  • 00:10:37
    so the attention mechanism in the middle
  • 00:10:39
    is the key to making this thing work on
  • 00:10:42
    documents of variable lengths and the
  • 00:10:45
    way they do that is by having an
  • 00:10:47
    all-to-all comparison for every layer of
  • 00:10:50
    the neural network it considers every
  • 00:10:52
    position for every output of the next layer it
  • 00:10:55
    considers every possible input from the
  • 00:10:57
    previous layer in this N squared way and
  • 00:10:59
    it does this weighted sum of the
  • 00:11:01
    previous ones where the weighting is a
  • 00:11:04
    learned function right and then it
  • 00:11:07
    applies just a fully connected layer
  • 00:11:08
    after it but it this is this is great
  • 00:11:11
    for for a number of reasons one
  • 00:11:13
    is that you can you can look at this
  • 00:11:15
    thing and you can visually see what it's
  • 00:11:17
    doing so here is this translation
  • 00:11:19
    problem of converting from the English
  • 00:11:21
    sentence the agreement on the European
  • 00:11:23
    Economic Area was signed in August 1992
  • 00:11:26
    and translate that into French my
  • 00:11:29
    apologies l'accord sur la zone économique
  • 00:11:31
    européenne a été signé en août and oh I forgot
  • 00:11:35
    1992 right and you can see the attention
  • 00:11:38
    so as it's generating, as it's
  • 00:11:41
    generating each token in the output
  • 00:11:44
    it's it's starting with this whole
  • 00:11:46
    thing but it's generating
  • 00:11:47
    these output tokens one at a time and it
  • 00:11:50
    says okay first you got to translate the
  • 00:11:51
    the way I do that it translates into la
  • 00:11:54
    and all I'm doing is looking at the word the
  • 00:11:55
    next output accord all I'm doing is
  • 00:11:58
    looking at agreement then sur is on la
  • 00:12:00
    is the okay now interesting European
  • 00:12:03
    Economic Area translates into zone
  • 00:12:06
    économique européenne so the order is
  • 00:12:08
    reversed right you can see the attention
  • 00:12:10
    mechanism is reversed also or you can
  • 00:12:12
    see very clearly what this thing is
  • 00:12:13
    doing as it's running along and the way
  • 00:12:16
    it works in the attention setting of
  • 00:12:19
    the transformer model the way they
  • 00:12:20
    describe it is with query and key
  • 00:12:23
    vectors so for every output position you
  • 00:12:26
    generate a query and for every input
  • 00:12:29
    you're considering you generate a key
  • 00:12:31
    and then the relevance score is just the
  • 00:12:32
    dot product of those two right and to
  • 00:12:36
    visualize that you first you combine the
  • 00:12:39
    query and the key values and
  • 00:12:41
    that gives you the relevance scores you
  • 00:12:44
    you use the softmax normalize them and
  • 00:12:46
    then you do a weighted average of the
  • 00:12:49
    values the third vector of each token
  • 00:12:52
    to get your output now to explain this
  • 00:12:56
    in a little bit more detail I'm going to
  • 00:12:57
    go through it in pseudocode so this
  • 00:12:59
    looks like Python it wouldn't actually
  • 00:13:00
    run but I think it's close enough to
  • 00:13:02
    help people understand what's going on [runnable sketches of this attention function and of the later fine-tuning code appear after the captions]
  • 00:13:04
    so you've got this attention function
  • 00:13:07
    right and it takes as input a list of
  • 00:13:11
    tensors I know you don't need to do that
  • 00:13:13
    a list of tensors one per token on
  • 00:13:16
    the input and then the first thing it
  • 00:13:18
    does it goes through each everything in
  • 00:13:20
    the sequence and it computes the query
  • 00:13:22
    the key and the value by multiplying the
  • 00:13:25
    appropriate input vector by Q
  • 00:13:27
    K and V which are these learned matrices
  • 00:13:29
    right so it learns this transformation
  • 00:13:31
    from the previous layer to whatever
  • 00:13:35
    should be the query the key and the
  • 00:13:36
    value at the at the next layer then it
  • 00:13:40
    goes through this double nested loop
  • 00:13:42
    alright so for every output token it
  • 00:13:46
    figures out okay this is the query I'm
  • 00:13:48
    working with and then it goes through
  • 00:13:49
    everything in the input and it
  • 00:13:51
    multiplies that query with the the key
  • 00:13:53
    from the possible key and it computes a
  • 00:13:57
    whole bunch of relevance scores and then
  • 00:13:59
    it normalizes these relevance scores
  • 00:14:01
    using a softmax which makes sure that
  • 00:14:04
    they just all add up to one so you can
  • 00:14:05
    sensibly can use that to compute a
  • 00:14:08
    weighted sum of all of the values so you
  • 00:14:11
    know you just go through for each output
  • 00:14:14
    you go through each of the each of the
  • 00:14:18
    input tokens the value vector which is
  • 00:14:20
    calculated for them and you multiply it
  • 00:14:21
    by the relevance this is just a floating
  • 00:14:23
    point number from 0 to 1 and you get a
  • 00:14:25
    weighted average which is the output and
  • 00:14:27
    you return that so this is what's going
  • 00:14:30
    on in the attention mechanism which can
  • 00:14:33
    be which can be pretty confusing when
  • 00:14:35
    you just look at it look at the diagram
  • 00:14:37
    that like that but I hope this
  • 00:14:40
    I hope this explains it a little bit I'm
  • 00:14:42
    sure we'll get some questions on this so
  • 00:14:45
    relevance scores are interpretable as I
  • 00:14:48
    say and and this is is super helpful
  • 00:14:50
    right now the an innovation I think it
  • 00:14:56
    was novel in the transformer paper is
  • 00:14:58
    multi-headed attention and this is one
  • 00:15:01
    of these really clever and important
  • 00:15:03
    innovations that it's not actually all
  • 00:15:05
    that complicated at all you just do
  • 00:15:08
    that same thing that same attention
  • 00:15:10
    mechanism eight times whatever whatever
  • 00:15:12
    value of 8 you want to use and that lets
  • 00:15:15
    the network learn eight different things
  • 00:15:17
    to pay attention to so in the
  • 00:15:19
    translation case it can learn an
  • 00:15:21
    attention mechanism for grammar one for
  • 00:15:23
    vocabulary one for gender one for tense
  • 00:15:25
    whatever it is right whatever the thing
  • 00:15:27
    needs to it can look at different parts
  • 00:15:28
    of the input document for different
  • 00:15:30
    purposes and do this at each layer right
  • 00:15:32
    so you can kind of intuitively see how
  • 00:15:33
    this would be a really flexible
  • 00:15:34
    mechanism for for processing a document
  • 00:15:38
    or any sequence okay so that is
  • 00:15:41
    one of the key things that enables the
  • 00:15:44
    transformer model that's the multi-headed
  • 00:15:46
    attention part of it now let's look down
  • 00:15:48
    here at the positional encoding which is
  • 00:15:50
    which is critical and novel in a
  • 00:15:54
    critical innovation that I think is
  • 00:15:56
    incredibly clever so without this
  • 00:15:59
    positional encoding attention mechanisms
  • 00:16:01
    are just bags of words right there's
  • 00:16:03
    nothing seeing what the difference is
  • 00:16:05
    between work to live or live to work
  • 00:16:07
    right there they're just all positions
  • 00:16:10
    they're all equivalent positions you're
  • 00:16:12
    just going to compute some score for
  • 00:16:14
    each of them so what they did is they
  • 00:16:17
    took a lesson from Fourier theory and
  • 00:16:20
    added in a bunch of sines and cosines as
  • 00:16:23
    extra dimensions
  • 00:16:25
    sorry not as extra dimensions but onto
  • 00:16:28
    the the word embeddings so going back so
  • 00:16:32
    what they do is they take the inputs
  • 00:16:33
    they use word2vec to calculate some
  • 00:16:35
    vector for each input token and then
  • 00:16:37
    onto that onto that embedding they add a
  • 00:16:41
    bunch of sine and cosines of different
  • 00:16:43
    frequencies starting at just pi and then
  • 00:16:46
    stretching out longer and longer and
  • 00:16:48
    longer and if you look at the whole
  • 00:16:51
    thing it looks like this and what this
  • 00:16:52
    does is it lets the model reason about
  • 00:16:56
    the relative position of any tokens
  • 00:16:58
    right so if you can kind of imagine that
  • 00:17:01
    the model can say if the orange
  • 00:17:03
    dimension is slightly higher than the
  • 00:17:05
    blue dimension on one word versus
  • 00:17:08
    another then you can see how it knows
  • 00:17:11
    that that token is to the left or right
  • 00:17:13
    of the other and because it has this at
  • 00:17:14
    all these different wavelengths it can
  • 00:17:16
    look across the entire document at kind
  • 00:17:18
    of arbitrary scales to see whether one
  • 00:17:20
    idea is before or after another
  • 00:17:23
    the key thing is that this is how the
  • 00:17:26
    system understands position and isn't
  • 00:17:29
    just a bag of words when
  • 00:17:32
    doing the attention
  • 00:17:33
    okay so transformers there's the two key
  • 00:17:36
    innovations as positional encoding and
  • 00:17:38
    multi-headed attention transformers are
  • 00:17:40
    awesome even though they are N squared
  • 00:17:42
    in the length of the document these
  • 00:17:44
    all-to-all comparisons can be done
  • 00:17:46
    almost for free in a modern GPU GPUs
  • 00:17:49
    changed all sorts of things right you
  • 00:17:51
    can do a thousand by thousand matrix
  • 00:17:53
    multiply as fast as you can do a ten by
  • 00:17:55
    two
  • 00:17:55
    in a lot of cases because they have so
  • 00:17:57
    much parallelism they have so much
  • 00:17:58
    bandwidth but a fixed latency for
  • 00:18:01
    every operation so you can do these
  • 00:18:03
    massive massive multiplies almost for
  • 00:18:05
    free in a lot of cases so doing things
  • 00:18:07
    in N squared is is not actually
  • 00:18:10
    necessarily much more expensive whereas
  • 00:18:11
    in an RNN like an LSTM you can't do
  • 00:18:16
    anything with token 11 until you're
  • 00:18:18
    completely done processing token 10 all
  • 00:18:21
    right so this is a key advantage of
  • 00:18:22
    transformers they're much more
  • 00:18:24
    computationally efficient also you don't
  • 00:18:28
    need to use any of these sigmoid or
  • 00:18:31
    tanh activation functions which are built
  • 00:18:33
    into the LSTM model these things that
  • 00:18:35
    scale your activations to 0 1 why are
  • 00:18:38
    these things problematic so these were
  • 00:18:41
    bread-and-butter in the old days of of
  • 00:18:43
    neural networks people would use these
  • 00:18:46
    between layers all the time and they
  • 00:18:50
    make sense there that kind of
  • 00:18:51
    biologically inspired you take any
  • 00:18:53
    activation you scale it from 0 to 1 or
  • 00:18:55
    minus 1 to 1 but they're actually really
  • 00:18:57
    really problematic because if you get a
  • 00:19:00
    neuron which has a very high activation
  • 00:19:02
    value then you've got this number up
  • 00:19:05
    here which is 1 and you take the
  • 00:19:07
    derivative of that and it's 0 or it's
  • 00:19:10
    some very very small number and so your
  • 00:19:12
    gradient descent can't tell the
  • 00:19:14
    difference between an activation up here
  • 00:19:16
    and one way over on the other side so
  • 00:19:19
    it's very easy for the trainer to get
  • 00:19:21
    confused if your activations don't stay
  • 00:19:23
    near this middle part all right and
  • 00:19:25
    that's problematic compare that to ReLU
  • 00:19:26
    which is the standard these days and
  • 00:19:28
    ReLU yes it does have this this
  • 00:19:31
    very very large dead space but if you're
  • 00:19:34
    not in the dead space then there's
  • 00:19:36
    nothing stopping it from getting getting
  • 00:19:38
    bigger and bigger and scaling off to
  • 00:19:39
    infinity and one of the reasons why when
  • 00:19:43
    the intuitions behind why this works
  • 00:19:45
    better as Geoffrey Hinton puts it is
  • 00:19:47
    that this allows each neuron it to
  • 00:19:49
    express a stronger opinion right in an
  • 00:19:53
    LS sorry in a sigmoid there is really no
  • 00:19:56
    difference between the activation being
  • 00:19:58
    three or eight or twenty or a hundred
  • 00:20:01
    the output is the same right it all I
  • 00:20:05
    can say is kind of yes no maybe right
  • 00:20:07
    but
  • 00:20:09
    in with ReLU it can say the
  • 00:20:11
    activation of five or a hundred or a
  • 00:20:13
    thousand and these are all meaningfully
  • 00:20:15
    different values that can be used for
  • 00:20:16
    different purposes down the line right
  • 00:20:18
    so each neuron it can express more
  • 00:20:20
    information also the gradient doesn't
  • 00:20:23
    saturate we talked about that and very
  • 00:20:27
    critically and I think this is really
  • 00:20:28
    underappreciated ReLUs are really
  • 00:20:32
    insensitive to random initialization if
  • 00:20:34
    you're working with a bunch of sigmoid
  • 00:20:35
    layers you need to pick those random
  • 00:20:37
    values at the beginning of your training
  • 00:20:39
    to make sure that your activation values
  • 00:20:42
    are in that middle part where you're
  • 00:20:44
    going to get reasonable gradients and
  • 00:20:45
    people used to worry a lot about what
  • 00:20:48
    initialization to use for your neural
  • 00:20:49
    network you don't hear people worrying
  • 00:20:51
    about that much at all anymore and
  • 00:20:53
    ReLUs are really the key reason why that
  • 00:20:56
    is also ReLU runs great on low
  • 00:20:58
    precision hardware those floating
  • 00:21:01
    point smooth activation functions they
  • 00:21:03
    need 32-bit float maybe you can get it
  • 00:21:06
    to work in 16-bit float sometimes but
  • 00:21:08
    you're not going to be running it an
  • 00:21:09
    8-bit int without a ton of careful work
  • 00:21:12
    and that is the kind of things are
  • 00:21:13
    really easy to do with a ReLU-based
  • 00:21:16
    network and a lot of hardware is going
  • 00:21:17
    in that direction because it takes
  • 00:21:19
    vastly fewer transistors and a lot less
  • 00:21:21
    power to do 8-bit integer math versus
  • 00:21:24
    32-bit float it's also stupidly easy to
  • 00:21:27
    compute the gradient it's one or zero
  • 00:21:30
    right you just take that top bit and
  • 00:21:32
    you're done so the derivative is
  • 00:21:33
    ridiculously easy ReLU has some
  • 00:21:35
    downsides it does have those dead
  • 00:21:37
    neurons on on the left side you can fix
  • 00:21:39
    that with a leaky ReLU there's this
  • 00:21:41
    discontinuity in the gradient of the
  • 00:21:43
    origin you can fix that with GELU
  • 00:21:45
    which BERT uses and so this brings me
  • 00:21:49
    to a little aside about general deep
  • 00:21:51
    learning wisdom if you're designing a
  • 00:21:54
    new network for whatever reason don't
  • 00:21:57
    bother messing with different kinds of
  • 00:21:58
    activations don't bother trying sigmoid
  • 00:22:00
    or tanh they're they're probably not
  • 00:22:02
    going to work out very well but
  • 00:22:04
    different optimizers do matter Adam is a
  • 00:22:07
    great place to start it's super fast it
  • 00:22:09
    tends to give pretty good results it has
  • 00:22:11
    a bit of a tendency to overfit if you
  • 00:22:13
    really are trying to squeeze the juice
  • 00:22:14
    out of your system and you want the best
  • 00:22:15
    results SGD is likely to get you a
  • 00:22:18
    better result but it's going to take
  • 00:22:19
    quite a bit more time to
  • 00:22:22
    converge sometimes RMSprop beats the pants
  • 00:22:25
    off both of them it's worth playing
  • 00:22:26
    around with these with these things I
  • 00:22:28
    told you about why I think SWA is great
  • 00:22:30
    there's this system called AdaTune my
  • 00:22:32
    old team at Amazon released where you
  • 00:22:35
    don't even need to take a learning rate
  • 00:22:37
    it dynamically calculates the ideal
  • 00:22:39
    learning rate scheduled at every point
  • 00:22:41
    during training for you it's kind of
  • 00:22:42
    magical so it's worth playing around
  • 00:22:46
    with different optimizers but don't mess
  • 00:22:47
    with the with the activation functions
  • 00:22:49
    okay
  • 00:22:50
    let's pop out right there's a bunch of a
  • 00:22:52
    bunch of Theory bunch of math and and
  • 00:22:53
    ideas in there how do we actually apply
  • 00:22:55
    this stuff in code so if you want to use
  • 00:22:58
    a transformer I strongly recommend
  • 00:23:01
    hopping over to the the fine folks at
  • 00:23:04
    Hugging Face and using their transformers
  • 00:23:06
    package they have both PyTorch and
  • 00:23:09
    TensorFlow implementations pre-trained
  • 00:23:11
    models ready to fine tune and I'll show
  • 00:23:14
    you how easy it is here's how to fine
  • 00:23:16
    tune a BERT model in just 12 lines of
  • 00:23:18
    code you just pick what kind of BERT you
  • 00:23:21
    want the base model that's paying
  • 00:23:24
    attention to upper and lower case you get
  • 00:23:26
    the tokenizer to convert your string
  • 00:23:27
    into tokens you download the pre trained
  • 00:23:30
    model in one line of code pick your data
  • 00:23:32
    set for your own problem process the
  • 00:23:35
    data set with the tokenizer
  • 00:23:37
    to get training validation splits
  • 00:23:39
    shuffle one batch um four more lines of
  • 00:23:41
    code another four lines of code to
  • 00:23:43
    instantiate your optimizer define your
  • 00:23:46
    loss function pick a metric it's
  • 00:23:49
    TensorFlow so you got to compile it and
  • 00:23:52
    then you call fit and that's it that's
  • 00:23:55
    all you need to do that's all you need to
  • 00:23:57
    do to fine-tune a state-of-the-art
  • 00:23:59
    language model on your specific problem
  • 00:24:01
    and the fact you can do this on some pre
  • 00:24:04
    trained model that's that's seen tons
  • 00:24:06
    and tons of data that easily is really
  • 00:24:08
    amazing and there's even bigger models
  • 00:24:11
    out there right so Nvidia made this
  • 00:24:12
    model called Megatron with eight
  • 00:24:14
    billion parameters they ran hundreds
  • 00:24:16
    of GPUs for over a week spent vast
  • 00:24:19
    quantities of cash well I mean they own
  • 00:24:20
    the stuffs but so not really but they
  • 00:24:22
    they put a ton of energy into training
  • 00:24:26
    this I've heard people a lot of people
  • 00:24:27
    complaining about how much greenhouse
  • 00:24:29
    gas comes from training model like
  • 00:24:31
    Megatron I think that's totally the
  • 00:24:33
    wrong
  • 00:24:34
    way of looking at this because they only
  • 00:24:37
    need to do this once in the history of
  • 00:24:40
    the world and everybody in this room can
  • 00:24:42
    do it without having to burn those GPUs
  • 00:24:45
    again right these things are reusable
  • 00:24:47
    and fine tunable I don't think they've
  • 00:24:48
    actually released this yet but but they
  • 00:24:51
    might and somebody else will
  • 00:24:53
    right so you don't need to do that that
  • 00:24:55
    expensive work over and over again right
  • 00:24:57
    this thing learns a base model really
  • 00:25:00
    well the folks at Facebook trained this
  • 00:25:03
    RoBERTa model on two and a half
  • 00:25:04
    terabytes of data across over a hundred
  • 00:25:07
    languages and this thing understands low
  • 00:25:10
    resource languages like Swahili and
  • 00:25:13
    Urdu in ways that it's just vastly
  • 00:25:16
    better than what's been done before and
  • 00:25:18
    again these are reusable if you need a
  • 00:25:21
    model that understands all the world's
  • 00:25:23
    languages this is accessible to you by
  • 00:25:25
    leveraging other people's work and
  • 00:25:27
    before Bert and transformers and the
  • 00:25:29
    Muppets this just was not possible now
  • 00:25:31
    you can leverage other people's work in
  • 00:25:34
    this way and I think that's really
  • 00:25:36
    amazing so to sum up the key advantages
  • 00:25:39
    of these transformer networks yes
  • 00:25:41
    they're easier to train they're more
  • 00:25:42
    efficient all that yada yada yada
  • 00:25:44
    but more importantly transfer learning
  • 00:25:47
    actually works with them right you can
  • 00:25:49
    take a pre trained model fine-tune it
  • 00:25:51
    for your task without a specific data
  • 00:25:53
    set and another really critical point
  • 00:25:56
    which I didn't get a chance to go into
  • 00:25:57
    is that these things are originally
  • 00:25:59
    trained on large quantities of
  • 00:26:01
    unsupervised text you can just take all
  • 00:26:03
    of the world's text data and use this as
  • 00:26:06
    training data the way it works very very
  • 00:26:07
    quickly is kind of comparable to how
  • 00:26:10
    word2vec works where the language
  • 00:26:11
    model tries to predict some missing
  • 00:26:14
    words from a document and in that's
  • 00:26:17
    enough for it to understand how to build
  • 00:26:21
    a supervised model using vast quantities
  • 00:26:24
    of text without any effort to label them
  • 00:26:28
    LSTM still has its place in
  • 00:26:30
    particular if the sequence length is
  • 00:26:32
    very long or infinite you can't do n
  • 00:26:35
    squared right
  • 00:26:36
    and that happens if you're doing real
  • 00:26:38
    time control like for a robot or a
  • 00:26:40
    thermostat or something like that you
  • 00:26:41
    can't have the entire sequence and for
  • 00:26:44
    some reason you can't pre train on some
  • 00:26:46
    large corpus LSTM seems
  • 00:26:48
    to outperform transformers when your
  • 00:26:50
    dataset size is is relatively small and
  • 00:26:52
    fixed and with that I will take
  • 00:26:57
    questions
  • 00:26:58
    well you yes yeah would CNN how do you
  • 00:27:13
    compare word CNNs and transformers so when
  • 00:27:16
    when I wrote this paper the rise and
  • 00:27:19
    fall and rise and fall of LST M I
  • 00:27:20
    predicted that time that word CNN's were
  • 00:27:23
    going to be the thing that replaced LST
  • 00:27:25
    M I did not I did not see this this
  • 00:27:29
    transformer thing coming so a word CNN
  • 00:27:31
    has a lot of the advantages in terms of
  • 00:27:34
    parallelism and the ability to use
  • 00:27:36
    ReLU and the key difference is that it
  • 00:27:39
    only looks at a fixed size window fixed
  • 00:27:41
    size part of the document instead of
  • 00:27:42
    looking at the entire document at once
  • 00:27:44
    and so it's it's got a fair amount
  • 00:27:49
    fundamentally in common word CNN's have
  • 00:27:53
    an easier task easier time identifying
  • 00:27:57
    bigrams trigrams things like that
  • 00:28:00
    because it's got those direct
  • 00:28:01
    comparisons right it doesn't need this
  • 00:28:02
    positional encoding trick to try to
  • 00:28:04
    infer with with Fourier waves where
  • 00:28:08
    things are relative to each other so
  • 00:28:10
    it's got that advantage for
  • 00:28:12
    understanding close closely related
  • 00:28:13
    tokens but it can't see across the
  • 00:28:16
    entire document at once right it's got a
  • 00:28:20
    much harder time reasoning like a word
  • 00:28:23
    CNN can't easily answer a question like
  • 00:28:25
    does this concept exist anywhere in this
  • 00:28:29
    document whereas a transformer can very
  • 00:28:31
    easily answer that just by having some
  • 00:28:33
    attention query that finds that
  • 00:28:35
    regardless of where it is CNN would need
  • 00:28:37
    a very large large window or a series of
  • 00:28:40
    windows cascading up to to be able to
  • 00:28:42
    accomplish that
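
The attention function described in pseudocode around 00:12:57 in the captions can be written out as a small runnable sketch. This is my interpretation of what the speaker describes (a single attention head, with no scaling or masking); the weight matrices Q, K and V stand in for the learned parameters:

    import numpy as np

    def softmax(x):
        e = np.exp(x - np.max(x))
        return e / e.sum()

    def attention(token_vectors, Q, K, V):
        """Single-head attention over one sequence, as sketched in the talk."""
        # Project each input token into a query, a key, and a value.
        queries = [Q @ x for x in token_vectors]
        keys = [K @ x for x in token_vectors]
        values = [V @ x for x in token_vectors]

        outputs = []
        for q in queries:                              # every output position...
            scores = np.array([q @ k for k in keys])   # ...scores every input (the N^2 part)
            weights = softmax(scores)                  # relevance scores that sum to 1
            outputs.append(sum(w * v for w, v in zip(weights, values)))
        return outputs

    # Toy usage: 5 tokens with 8-dimensional embeddings and random "learned" weights.
    rng = np.random.default_rng(0)
    tokens = [rng.standard_normal(8) for _ in range(5)]
    Q, K, V = (rng.standard_normal((8, 8)) for _ in range(3))
    out = attention(tokens, Q, K, V)

Multi-head attention, as described around 00:15:00, is this same computation repeated with several independent sets of Q, K and V matrices, with the per-head outputs combined afterwards.

The Hugging Face fine-tuning flow shown around 00:23:14 could look roughly like the sketch below. It uses the transformers TensorFlow classes; the model name, the toy dataset and the hyperparameters are assumptions for illustration rather than the slide's exact code, and details may vary across transformers versions:

    import tensorflow as tf
    from transformers import BertTokenizer, TFBertForSequenceClassification

    # Pick a flavor of BERT and grab its tokenizer and pre-trained weights.
    tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
    model = TFBertForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

    # A tiny stand-in dataset; in practice this is your own labeled task.
    texts = ["this talk was great", "this talk was spam"]
    labels = [1, 0]
    encodings = tokenizer(texts, padding=True, truncation=True, return_tensors="tf")

    # It's TensorFlow, so compile with an optimizer, loss and metric, then fit.
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )
    model.fit(dict(encodings), tf.constant(labels), epochs=1, batch_size=2)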
Tags
  • transformers
  • transfer learning
  • NLP
  • RNN
  • LSTM
  • multi-head attention
  • positional encoding
  • language models
  • efficiency
  • sequence handling