00:00:03
All right, cool. Well, thanks everybody. So I'm going to give the second talk tonight, which I'm not crazy about, and I don't want this pattern to repeat, but Andrew and I wanted to kick this series off and felt like me talking twice was better than not. We're going to get more diversity of folks; if any of you want to give a talk yourselves, or know somebody who you think might, that would be awesome. But a topic that I feel is important for practitioners to understand is a real sea change in natural language processing that's all of about twelve months old, but is one of these things I think is incredibly significant in the field, and that is the advance of the Transformer.
00:00:52
The outline for this talk is to start out with some background on natural language processing and sequence modeling, then talk about the LSTM, why it's awesome and amazing but still not good enough, and then go into Transformers and talk about how they work and why they're amazing.
00:01:12
For background on natural language processing (NLP): I'm going to be talking just about a subset of NLP, the supervised learning part of it. So not structured prediction or sequence prediction, but problems where you take a document as input and try to predict some fairly straightforward output about it, like "is this document spam?" What this means is that you need to somehow take your document and represent it as a fixed-size vector, because I'm not aware of any linear algebra that works on vectors of variable dimensionality, and the challenge is that documents are of variable length. So you have to come up with some way of taking that document and meaningfully encoding it into a fixed-size vector.
00:02:06
The classic way of doing this is the bag of words, where you have one dimension per unique word in your vocabulary. English has, I don't know, about a hundred thousand words in its vocabulary, so you have a hundred-thousand-dimensional vector. Most of the entries are zero, because most words are not present in your document, and the ones that are present have some value, maybe a count, or a tf-idf score, or something like that, and that is your vector. This naturally leads to sparse data where, again, it's mostly zeros, so you don't store the zeros (that's computationally inefficient); you store lists of (position, value) tuples, or maybe just a list of positions, and this makes the computation much cheaper.
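As a concrete illustration of the sparse encoding just described, here is a minimal Python sketch; the tiny vocabulary and example documents are made up for illustration and are not from the talk.

```python
from collections import Counter

# Toy stand-in for a real ~100,000-word vocabulary: word -> dimension index.
vocab = {"free": 0, "prize": 1, "meeting": 2, "tomorrow": 3}

def bag_of_words(document):
    """Encode a document as sparse (dimension, count) tuples, skipping all the zeros."""
    counts = Counter(w for w in document.lower().split() if w in vocab)
    return sorted((vocab[w], c) for w, c in counts.items())

print(bag_of_words("Free free prize"))    # [(0, 2), (1, 1)]
print(bag_of_words("Meeting tomorrow"))   # [(2, 1), (3, 1)]
```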
00:02:51
This works reasonably well. A key limitation is that when you're looking at an actual document, order matters: these two documents mean completely different things, but a bag-of-words model will score them identically every single time, because they have the exact same vectors for which words are present. The solution to that, in this context, is n-grams. You can have bigrams, which is every pair of possible words, or trigrams, every combination of three words, which would easily distinguish between those two; but now you're up to, what is that, a quadrillion-dimensional vector. You can do it, but you start running into all sorts of problems when you walk down that path.
00:03:35
In neural network land, the natural way to solve this problem is the RNN, the recurrent neural network (not the recursive neural network; I've made that mistake). RNNs are an approach that asks the question: how do you calculate a function on a variable-length input? And they answer it using a for loop in math, recursively defining the output at any stage as a function of the inputs at the previous stages and the previous output. Then, for the purpose of supervised learning, the final output is just the final hidden state. Visually, it looks like this activation, which takes an input from the raw document X and also from itself at the previous time step. You can unroll this and visualize it as a very deep neural network, where the final answer, the number you're looking at at the end, is this, and it's this deep neural network that processes every one of the inputs along the way.
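A minimal numpy sketch of that recurrence, under the assumption that the token vectors and the weights W_x, W_h, and b already exist (in a real RNN they would be learned):

```python
import numpy as np

def rnn_encode(token_vectors, W_x, W_h, b):
    """Fold a variable-length sequence of token vectors into one fixed-size hidden state."""
    h = np.zeros(W_h.shape[0])
    for x in token_vectors:                  # the "for loop in math"
        h = np.tanh(W_x @ x + W_h @ h + b)   # h_t is a function of x_t and h_{t-1}
    return h                                 # final hidden state: the document representation
```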
00:04:39
The problem with this classic, plain-vanilla recurrent neural network is vanishing and exploding gradients. Take that recursive definition of the hidden state and imagine what happens just three steps in: you're calling this function, this transformation, over and over and over again on your data, and classically, in the vanilla case, it's just some matrix multiplication, some learned matrix W times your input. So when you go out, say, a hundred words in, you're taking that W matrix and multiplying by it a hundred times. In simple real-number math, we know that if you take any number less than one and raise it to a very high exponent, you get some incredibly small number, and if your number is slightly larger than one, it blows up to something big; and if your exponent is even higher, if you have longer documents, this gets even worse. In linear algebra it's about the same, except you need to think about the eigenvalues of the matrix: the eigenvalues say how much the matrix will grow or shrink vectors when the transformation is applied. If the eigenvalues of this transformation are less than one, you get gradients that go to zero as you use the matrix over and over again; if they're greater than one, your gradients explode. This made vanilla RNNs extremely difficult to work with, and they basically just didn't work on anything but fairly short sequences.
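A tiny numeric illustration of that point, with made-up numbers: applying the same linear map a hundred times shrinks or blows up a vector depending on whether the eigenvalues sit below or above one.

```python
import numpy as np

W = np.diag([0.9, 1.1])   # a matrix with eigenvalues 0.9 and 1.1
v = np.ones(2)
for _ in range(100):      # "a hundred words in"
    v = W @ v
print(v)                  # roughly [2.7e-05, 1.4e+04]: one direction vanished, the other exploded
```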
00:06:12
So, LSTM to the rescue. I wrote a piece a few years ago called "The Rise and Fall and Rise and Fall of LSTM": the LSTM came around in the dark ages, then it went into the AI winter, it came back again for a while, but I think it's on its way out again now with transformers. The LSTM, to be clear, is a kind of recurrent neural network; it just houses a more sophisticated cell inside. It was invented originally in the dark ages on a stone tablet that has since been recovered into a PDF you can find on Sepp Hochreiter's site. I kid, but Sepp and Jürgen are both great, I enjoy their work quite a bit, and they did a bunch of amazing work in the 90s that was really well ahead of its time and often gets neglected and forgotten as time goes on. That's totally not fair, because they did amazing research.
00:07:16
The LSTM cell looks like this. It actually has two hidden states, with the input coming along the bottom and the output out the top, plus these two hidden states. I'm not going to go through it in detail; you should totally look at Christopher Olah's blog post if you want to dive into it. But the key point is that these transformations, these matrix multiplies, are not applied recursively to the main hidden vector. All you're doing is adding in (or, via the forget gate, which you arguably don't really need, scaling and adding in) some new number. So the LSTM is actually a lot like a ResNet, like a CNN ResNet, in that you're adding new values onto the activation as you go through the layers, and that solves the exploding and vanishing gradient problems.
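For reference, a minimal numpy sketch of a textbook LSTM step (not the slide from the talk), showing the point above: the cell state is updated by gating and adding, not by multiplying it through a weight matrix over and over.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step; W, U, and b stack the input, forget, output, and candidate weights."""
    z = W @ x + U @ h_prev + b                          # one big matrix multiply for all gates
    i, f, o, g = np.split(z, 4)                         # gate pre-activations
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)   # additive update of the cell state
    h = sigmoid(o) * np.tanh(c)                         # gated output hidden state
    return h, c
```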
00:08:06
However, the LSTM is still pretty difficult to train, because you still have these very long gradient paths. Even with those residual connections, you're still propagating gradients from the end all the way back through this transformation cell to the beginning, and for a long document that means very, very deep networks that are notoriously difficult to train. More importantly, transfer learning never really worked for these LSTM models. One of the great things about ImageNet and CNNs is that you can train a convolutional net on millions of ImageNet images, take that neural network, and fine-tune it for some new problem you have; the starting state of the ImageNet CNN gives you a great place to start from when you're building a new neural network, and it makes training on your own problem much easier with much less data. That never really worked with LSTMs; sometimes it did, but it just wasn't very reliable, which means that any time you're using an LSTM you need a new labeled dataset that's specific to your task, and that's expensive.
00:09:16
Okay, so this changed dramatically just about a year ago when the BERT model was released. You'll hear people talk about Transformers and Muppets together, and the reason is that the original paper on this technique, the one that describes the network architecture, called it the transformer network, and then the BERT paper took a Muppet name, as did the ELMo paper, and researchers just ran with the joke. So that's just context so you understand what people are talking about if they say "we'll use a Muppet network." This was, I think, the natural progression of this sequence of document models, and the transformer model was first described about two and a half years ago in the paper "Attention Is All You Need," which was addressing machine translation.
00:10:01
Think about taking a document in English and converting it into French. The classic way to do this with a neural network is an encoder/decoder. Here's the full structure, and there's a lot going on, so we're just going to focus on the encoder part, because that's all you need for these supervised learning problems, and the decoder is similar anyway. Zooming in on the encoder, there's still quite a bit going on, but basically there are three parts: first we'll talk about the attention part, then we'll talk about the part at the bottom, the positional encoding; the part at the top is just not that hard, it's a simple fully connected layer.
00:10:37
The attention mechanism in the middle is the key to making this thing work on documents of variable length, and the way it does that is with an all-to-all comparison. At every layer of the neural network, every output of the next layer considers every possible input from the previous layer, in this N-squared way, and computes a weighted sum of them, where the weighting is a learned function; then it just applies a fully connected layer after that. This is great for a number of reasons.
00:11:13
One is that you can look at this thing and visually see what it's doing. Here is the translation problem of converting the English sentence "The agreement on the European Economic Area was signed in August 1992" into French, my apologies: "L'accord sur la zone économique européenne a été signé en août 1992." And you can see the attention: as it generates each token in the output, it starts from the whole input sentence and produces the output tokens one at a time. It says, okay, first I have to translate "the"; that becomes "l'", and all it's looking at is that first word. For the next output, "accord", all it's looking at is "agreement". Then "sur" is "on", "la" is "the". And now, interestingly, "European Economic Area" translates into "zone économique européenne", so the order is reversed, and you can see the attention mechanism is reversed as well. You can see very clearly what this thing is doing as it's running along.
00:12:16
The way it works in the transformer model, the way they describe it, is with query and key vectors. For every output position you generate a query, and for every input you're considering you generate a key, and the relevance score is just the dot product of those two. To visualize it: first you combine the queries and the keys, which gives you the relevance scores; you use a softmax to normalize them; and then you take a weighted average of the values, a third version of each token, to get your output.
00:12:56
Now, to explain this in a little more detail, I'm going to go through it in pseudocode. It looks like Python; it wouldn't actually run, but I think it's close enough to help people understand what's going on. You've got this attention function, and it takes as input a list of tensors (I know, you wouldn't really do it that way), one per token of the input. The first thing it does is go through everything in the sequence and compute the query, the key, and the value by multiplying the appropriate input vector by Q, K, and V, which are learned matrices; so it learns the transformation from the previous layer to whatever should be the query, key, and value at the next layer. Then it goes through a double nested loop: for every output token, it figures out which query it's working with, then it goes through everything in the input, multiplies that query with each candidate key, and computes a whole bunch of relevance scores. It normalizes those relevance scores using a softmax, which makes sure they all add up to one, so you can sensibly use them to compute a weighted sum of all the values. So for each output, you go through each of the input tokens, take the value computed for it, multiply it by the relevance (just a floating-point number from 0 to 1), and you get a weighted average, which is the output, and you return that. That's what's going on in the attention mechanism, which can be pretty confusing when you just look at a diagram like that, but I hope this explains it a little bit; I'm sure we'll get some questions on this.
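Here is a runnable numpy version of that pseudocode: a single attention head over a list of token vectors, with Q, K, and V standing in for the learned projection matrices. (The real transformer also scales the dot products by the square root of the key dimension; that is omitted here to stay close to the description above.)

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

def attention(tokens, Q, K, V):
    """tokens: a list of 1-D vectors, one per input token."""
    queries = [Q @ t for t in tokens]   # one query per output position
    keys    = [K @ t for t in tokens]   # one key per input position
    values  = [V @ t for t in tokens]   # one value per input position
    outputs = []
    for q in queries:                   # the double nested loop: N^2 comparisons
        relevance = softmax(np.array([q @ k for k in keys]))           # normalized to sum to 1
        outputs.append(sum(r * v for r, v in zip(relevance, values)))  # weighted average of values
    return outputs
```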
00:14:45
So relevance scores are interpretable, as I said, and that is super helpful. Now, an innovation that I think was novel in the transformer paper is multi-headed attention, and it's one of those really clever and important innovations that isn't actually all that complicated: you just do that same attention mechanism eight times, or whatever value of "eight" you want to use, and that lets the network learn eight different things to pay attention to. In the translation case it can learn one attention mechanism for grammar, one for vocabulary, one for gender, one for tense, whatever it is; whatever the network needs, it can look at different parts of the input document for different purposes, and do this at each layer. So you can intuitively see how this would be a really flexible mechanism for processing a document or any sequence.
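Multi-headed attention really is just that same function run several times with separate learned projections, with the results concatenated per token; a sketch, reusing the attention function from the previous snippet:

```python
import numpy as np

def multi_head_attention(tokens, heads):
    """heads: a list of (Q, K, V) matrix triples, for example eight of them, one per head."""
    per_head = [attention(tokens, Q, K, V) for Q, K, V in heads]   # each head attends to something different
    # Concatenate the heads' outputs at every token position.
    return [np.concatenate([head_out[i] for head_out in per_head]) for i in range(len(tokens))]
```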
00:15:41
Okay, so that is one of the key things that enables the transformer model: the multi-headed attention. Now let's look down here at the positional encoding, which is a critical and novel innovation that I think is incredibly clever. Without the positional encoding, attention mechanisms are just bags of words: nothing sees the difference between "work to live" and "live to work"; they're all just equivalent positions, and you're going to compute some score for each of them. So what they did is take a lesson from Fourier theory and add in a bunch of sines and cosines, sorry, not as extra dimensions, but onto the word embeddings. Going back: they take the inputs, use word2vec to calculate a vector for each input token, and then onto that embedding they add a bunch of sines and cosines of different frequencies, starting at just pi and then stretching out longer and longer and longer. If you look at the whole thing, it looks like this, and what it does is let the model reason about the relative position of any tokens. You can kind of imagine the model saying: if the orange dimension is slightly higher than the blue dimension on one word versus another, then you can see how it knows that one token is to the left or right of the other. And because it has this at all these different wavelengths, it can look across the entire document at arbitrary scales to see whether one idea comes before or after another. The key thing is that this is how the system understands position, so it isn't just a bag of words when doing the attention.
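A sketch of the sinusoidal positional encoding, following the formula from "Attention Is All You Need" (it assumes an even embedding dimension); the encoding is simply added onto the word embeddings.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sines and cosines at geometrically spaced wavelengths, one row per position."""
    positions = np.arange(seq_len)[:, None]                      # shape (seq_len, 1)
    wavelengths = 10000 ** (np.arange(0, d_model, 2) / d_model)  # shape (d_model / 2,)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(positions / wavelengths)                # even dimensions get sines
    pe[:, 1::2] = np.cos(positions / wavelengths)                # odd dimensions get cosines
    return pe

# embeddings is a (seq_len, d_model) array of word vectors:
# embeddings = embeddings + positional_encoding(*embeddings.shape)
```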
00:17:33
Okay, so for transformers those are the two key innovations: positional encoding and multi-headed attention. Transformers are awesome. Even though they are N-squared in the length of the document, these all-to-all comparisons can be done almost for free on a modern GPU. GPUs changed all sorts of things: in a lot of cases you can do a thousand-by-thousand matrix multiply about as fast as a ten-by-ten, because they have so much parallelism and so much bandwidth, with a roughly fixed latency for every operation, so you can do these massive multiplies almost for free. So doing things in N-squared is not actually necessarily much more expensive, whereas in an RNN like an LSTM you can't do anything with token 11 until you're completely done processing token 10. This is a key advantage of transformers: they're much more computationally efficient.
00:18:28
Also, you don't need any of these sigmoid or tanh activation functions, which are built into the LSTM model; these are the functions that squash your activations into (0, 1) or (-1, 1). Why are they problematic? They were bread and butter in the old days of neural networks; people used them between layers all the time, and they make sense there, they're kind of biologically inspired, you take any activation and scale it to between 0 and 1 or minus 1 and 1. But they're actually really problematic, because if you get a neuron with a very high activation value, you've got this number up here which is essentially 1, and you take the derivative of that and it's zero, or some very, very small number. So your gradient descent can't tell the difference between an activation up here and one way over on the other side; it's very easy for the trainer to get confused if your activations don't stay near this middle part, and that's a problem. Compare that to ReLU, which is the standard these days. Yes, ReLU does have this very large dead space, but if you're not in the dead space then there's nothing stopping the activation from getting bigger and bigger and scaling off to infinity. One of the intuitions behind why this works better, as Geoffrey Hinton puts it, is that it allows each neuron to express a stronger opinion: with a sigmoid there is really no difference between the activation being three or eight or twenty or a hundred, the output is the same; all it can say is kind of yes, no, maybe.
00:20:09
But with ReLU it can say the activation is five, or a hundred, or a thousand, and these are all meaningfully different values that can be used for different purposes down the line, so each neuron can express more information. Also, the gradient doesn't saturate; we talked about that. And very critically, and I think this is really underappreciated, ReLUs are really insensitive to random initialization. If you're working with a bunch of sigmoid layers, you need to pick the random values at the beginning of training carefully, to make sure your activations land in that middle part where you get reasonable gradients. People used to worry a lot about what initialization to use for a neural network; you don't hear people worrying about that much at all anymore, and ReLUs are really the key reason why. ReLU also runs great on low-precision hardware: those smooth activation functions need 32-bit floats, maybe you can get them to work in 16-bit floats sometimes, but you're not going to run them in 8-bit integers without a ton of careful work, and that is the kind of thing that's really easy to do with a ReLU-based network. A lot of hardware is going in that direction, because it takes vastly fewer transistors and a lot less power to do 8-bit integer math versus 32-bit float. It's also stupidly easy to compute the gradient: it's one or it's zero, you just take that top bit and you're done, so the derivative is ridiculously easy.
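A quick numeric check of the saturation argument above, using made-up activation values:

```python
import numpy as np

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

for x in [3.0, 8.0, 20.0, 100.0]:
    print(x, sigmoid_grad(x), relu_grad(x))
# sigmoid gradient: 0.045, 0.00034, ~2e-9, ~0   -> the trainer can't tell these apart
# ReLU gradient:    1.0 in every case           -> the signal keeps flowing
```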
00:21:35
ReLU does have some downsides. It has those dead neurons on the left side, which you can fix with a leaky ReLU, and there's a discontinuity in the gradient at the origin, which you can fix with GELU, which BERT uses. This brings me to a little aside about general deep learning wisdom: if you're designing a new network, for whatever reason, don't bother messing with different kinds of activations, don't bother trying sigmoid or tanh, they're probably not going to work out very well. But different optimizers do matter. Adam is a great place to start: it's super fast and tends to give pretty good results, though it has a bit of a tendency to overfit. If you really are trying to squeeze the juice out of your system and you want the best result, SGD is likely to get you there, but it's going to take quite a bit more time to converge. Sometimes RMSProp beats the pants off both of them. It's worth playing around with these things; I've told you before why I think SWA is great, and there's a system my old team at Amazon released where you don't even need to give it a learning rate, it dynamically calculates the ideal learning rate schedule at every point during training for you, which is kind of magical. So it's worth playing around with different optimizers, but don't mess with the activation functions.
00:22:49
Okay, let's pop out. There's a bunch of theory, a bunch of math and ideas in there; how do we actually apply this stuff in code? If you want to use a transformer, I strongly recommend hopping over to the fine folks at Hugging Face and using their transformers package. They have both PyTorch and TensorFlow implementations and pre-trained models ready to fine-tune, and I'll show you how easy it is. Here's how to fine-tune a BERT model in just twelve lines of code. You pick what kind of BERT you want, the base model that pays attention to upper and lower case; you get the tokenizer to convert your strings into tokens; you download the pre-trained model in one line of code; you pick the dataset for your own problem; you process the dataset with the tokenizer to get training and validation splits, shuffle, and batch, which is four more lines of code; another four lines to instantiate your optimizer, define your loss function, and pick a metric; it's TensorFlow, so you've got to compile it; and then you call fit. That's it; that's all you need to do to fine-tune a state-of-the-art language model on your specific problem.
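A rough sketch of that recipe, not the exact code from the talk's slide: it assumes a recent release of the Hugging Face transformers library with TensorFlow installed, and it uses a tiny made-up spam dataset where you would plug in your own labeled data.

```python
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification

texts = ["win a free prize now", "meeting moved to 3pm"]   # stand-in for your own dataset
labels = [1, 0]                                            # 1 = spam, 0 = not spam

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")   # the cased base model
model = TFBertForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

# Tokenize into fixed-length tensors and build a shuffled, batched dataset.
encodings = tokenizer(texts, padding=True, truncation=True, return_tensors="tf")
dataset = tf.data.Dataset.from_tensor_slices((dict(encodings), labels)).shuffle(10).batch(2)

# Optimizer, loss, metric, compile, fit: the whole fine-tuning loop.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
model.fit(dataset, epochs=3)
```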
00:24:01
The fact that you can do this on some pre-trained model that has seen tons and tons of data, that easily, is really amazing. And there are even bigger models out there. NVIDIA made this model called Megatron with eight billion parameters; they ran hundreds of GPUs for over a week and spent vast quantities of cash (well, I mean, they own the stuff, so not really), but they put a ton of energy into training it. I've heard a lot of people complaining about how much greenhouse gas comes from training models like Megatron, and I think that's totally the wrong way of looking at it, because they only need to do this once in the history of the world, and everybody in this room can build on it without having to burn those GPUs again. These things are reusable and fine-tunable. I don't think they've actually released this one yet, but they might, and somebody else will, so you don't need to do that expensive work over and over again; this thing learns a base model really well. The folks at Facebook trained the RoBERTa model on two and a half terabytes of data across over a hundred languages, and it understands low-resource languages like Swahili and Urdu in ways that are just vastly better than anything that's been done before. Again, these are reusable: if you need a model that understands all the world's languages, that is accessible to you by leveraging other people's work. Before BERT and transformers and the Muppets, this just was not possible; now you can leverage other people's work in this way, and I think that's really amazing.
00:25:36
So, to sum up the key advantages of these transformer networks: yes, they're easier to train, they're more efficient, all of that, but more importantly, transfer learning actually works with them. You can take a pre-trained model and fine-tune it for your task without needing a huge task-specific dataset. And another really critical point, which I didn't get a chance to go into, is that these things are originally trained on large quantities of unsupervised text: you can just take all of the world's text data and use it as training data. The way that works, very briefly, is comparable to how word2vec works: the language model tries to predict some missing words from a document, and that's enough to build a supervised model from vast quantities of text without any effort spent labeling it.
00:26:28
The LSTM still has its place. In particular, if the sequence length is very long or unbounded, you can't do N-squared, and that happens if you're doing real-time control, for a robot or a thermostat or something like that, where you never have the entire sequence. And if for some reason you can't pre-train on a large corpus, LSTMs seem to outperform transformers when your dataset size is relatively small and fixed. And with that, I will take questions.
00:26:58
[Audience question: how do word CNNs compare to transformers?] So, when I wrote that piece, "The Rise and Fall and Rise and Fall of LSTM," I predicted at the time that word CNNs were going to be the thing that replaced the LSTM; I did not see this transformer thing coming. A word CNN has a lot of the same advantages in terms of parallelism and the ability to use ReLU, and the key difference is that it only looks at a fixed-size window, a fixed-size part of the document, instead of looking at the entire document at once. So they have a fair amount in common, fundamentally. Word CNNs have an easier time identifying bigrams, trigrams, things like that, because they have those direct comparisons; they don't need the positional encoding trick to try to infer, with Fourier waves, where things sit relative to each other. So they have that advantage for understanding closely related tokens, but they can't see across the entire document at once, so they have a much harder time reasoning globally. A word CNN can't easily answer a question like "does this concept exist anywhere in this document," whereas a transformer can answer that very easily just by having an attention query that finds it regardless of where it is; a CNN would need a very large window, or a series of windows cascading up, to accomplish that.