00:00:00
hello my name is luis serrano and this
00:00:02
is a friendly introduction to support
00:00:04
vector machines or SVM for short this is
00:00:07
the third of a series of three videos on
00:00:09
linear models if you haven't take a look
00:00:12
at the first one it's called linear
00:00:14
regression and the second one called
00:00:15
logistic regression this one builds up a
00:00:17
lot on the second one in particular and
00:00:20
I'll start with the credits this year I
00:00:22
taught a machine learning class at
00:00:23
Quest University in British Columbia
00:00:25
Canada and had a wonderful group of
00:00:27
students who had an awesome time here's
00:00:30
a picture of us with my friend Richard
00:00:32
Hoshino on the right he's also a
00:00:33
professor and actually my students were
00:00:36
the ones who helped me figure out the
00:00:38
key idea for this video
00:00:40
so SVMs are a very important
00:00:43
classification algorithm and basically
00:00:45
what it does is it usually tries to
00:00:47
separate points of two classes using a
00:00:50
line however it tries really hard to
00:00:52
find the best line and the best line
00:00:54
will be the one that is sort of as
00:00:56
far from the points as possible in
00:00:58
order to separate them best normally
00:01:01
SVMs are explained in terms of either
00:01:04
some kind of linear optimization or some
00:01:07
kind of gradient descent what I want to
00:01:09
show you today is something that I
00:01:10
actually haven't seen in the literature
00:01:11
it may exist but I haven't seen it and
00:01:13
it's a method that is a small gradient
00:01:18
descent step like method which is sort of an
00:01:21
iteration and in this iteration what you
00:01:23
do is you first of all try to find a
00:01:25
better line that classifies the points
00:01:27
and then at every step you just take two
00:01:30
parallel lines and just kind of stretch
00:01:32
them apart let me be more explicit so
00:01:36
let me start with a very quick recap on
00:01:38
the previous video on logistic
00:01:40
regression and the perceptron algorithm
00:01:41
basically what we want to do is we have
00:01:44
data split into two classes red points
00:01:47
and blue points and we want to find the
00:01:49
perfect line this is the perceptron
00:01:50
algorithm so what I want to do is not
00:01:53
just find the perfect line but a line
00:01:54
with a red side and a blue side that
00:01:56
splits the points in the best possible
00:01:58
way and the way we did this was we start
00:02:02
with a random line and then we start
00:02:03
asking the points what can they tell us
00:02:06
to make our line better so for example
00:02:09
this point over here says I'm good so
00:02:11
don't worry don't do anything
00:02:13
this blue one says well hey I'm on the wrong
00:02:15
side so you better move closer to me in
00:02:17
order to classify me better so we move a
00:02:19
little closer remember in machine learning we
00:02:21
want to do tiny steps we don't want to
00:02:23
make any big drastic steps so we asked
00:02:25
another point this one is red in the red
00:02:27
area so it says I'm good
00:02:29
don't do anything then we ask this one
00:02:32
over here and it says get over here so
00:02:34
we get over there then we ask this blue
00:02:37
one in the blue area so it says I'm good
00:02:40
by the way we're asking random points
00:02:42
here there's no particular order then we
00:02:45
ask this one it's a red point in the red
00:02:47
area so it just says I'm good we ask this
00:02:49
point which is a blue point in the red
00:02:51
area so it says get over here so we move
00:02:54
closer then we ask this red one in the red
00:02:57
area so it says I'm good then this red one
00:03:00
which is now misclassified in the blue
00:03:02
area so it says move over here and we
00:03:05
listen to it and now it seems like all
00:03:08
the points are good so that is in a
00:03:10
nutshell the perceptron algorithm I'd
00:03:12
like to remind you that the way we did
00:03:15
it is we started with a random line with
00:03:18
red and blue sides then we picked a
00:03:20
large number the number of repetitions
00:03:21
or epochs which in this case is going to
00:03:24
be a thousand that's the number of times
00:03:26
we're going to repeat our iterative step
00:03:28
then step three says repeat a thousand
00:03:31
times we pick a random point we ask the
00:03:33
point if it's correctly classified or
00:03:35
not if it's correctly classified we do
00:03:37
nothing if it's not correctly classified
00:03:39
then we move the line a little bit
00:03:40
towards a point and we do this
00:03:45
repeatedly so we get the line that
00:03:48
separates the data pretty well
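If you want to see this recap as code, here is a minimal sketch in Python; the line is written as ax + by + c = 0, and the +1 label for red and -1 for blue are my own conventions rather than anything from the video:

```python
import random

def perceptron_step(a, b, c, point, label, lr=0.01):
    """One step of the perceptron trick: if the point is misclassified,
    nudge the line ax + by + c = 0 a little bit toward it.
    label is +1 for red points and -1 for blue points (my own convention)."""
    x, y = point
    if label * (a * x + b * y + c) <= 0:   # point is on the wrong side
        a += lr * label * x                # move the line a little
        b += lr * label * y                # toward the point
        c += lr * label
    return a, b, c

def train_perceptron(points, labels, epochs=1000, lr=0.01):
    # step 1: start with a random line; steps 2-3: repeat many times,
    # each time asking one random point
    a, b, c = random.random(), random.random(), random.random()
    for _ in range(epochs):
        i = random.randrange(len(points))
        a, b, c = perceptron_step(a, b, c, points[i], labels[i], lr)
    return a, b, c
```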
00:03:52
so anyway that's a small recap of the perceptron
00:03:53
algorithm and this algorithm is going to
00:03:56
be very similar but it's gonna have a
00:03:57
little bit of an extra step so let me
00:04:00
show you that extra step first let's
00:04:02
start by defining what is it that the
00:04:04
SVM does best so I'm gonna give you an
00:04:07
example of some data here and I'm gonna
00:04:09
copy it twice and this is a line that
00:04:13
separates as data and this is another
00:04:15
line that separates that data so
00:04:17
question for you which line is better so
00:04:21
feel free to think about it for a minute
00:04:22
I think the best one is
00:04:25
this one on the left and this one on the
00:04:27
right is not so good even though they
00:04:29
both separate the data if you notice the
00:04:32
one on the left separates the points
00:04:35
really well like it's really far away
00:04:37
from the points whereas the one on the
00:04:39
right is really really close to two of
00:04:40
the points so if you were to wiggle the
00:04:42
line on the right around you may miss
00:04:45
one of the points and you may
00:04:46
misclassify them whereas the line on the
00:04:48
left you can wiggle it freely and you
00:04:50
still get a good classifier so now the
00:04:54
question is how do we train the computer
00:04:56
to pick the line in the left instead of
00:04:59
the line in the right because if you
00:05:00
remember perceptron algorithm just finds
00:05:03
a good line that separates the data but
00:05:06
it doesn't necessarily find the best one
00:05:09
so let's rephrase the question what we
00:05:12
want to do is not just find one line but
00:05:15
find two lines that are spaced as far apart as
00:05:17
possible from each other so here for
00:05:19
example centered on the main line we
00:05:22
have these two parallel equidistant
00:05:25
lines and notice that for this case on
00:05:29
the Left we can actually have them
00:05:30
pretty far away from each other on the
00:05:33
other hand if we do this with the line
00:05:35
on the right the farthest we can get is
00:05:37
two lines that are pretty close so we're
00:05:39
gonna compare this green distance over
00:05:41
here with this distance over here
00:05:44
and the one on the left is pretty wide
00:05:46
whereas the one on the right is pretty
00:05:48
narrow so we're gonna go for wide so
00:05:51
we're gonna tell the computer when you
00:05:52
find a wide one
00:05:53
you're good but if you find a narrow one
00:05:55
then you're not good and now the
00:05:58
question is how do we train an algorithm
00:06:00
to find two lines as far apart from each
00:06:04
other as possible that are parallel and still split
00:06:07
our data so this is what we're gonna do
00:06:08
very similar to what we did before we're
00:06:11
gonna start by dropping a random line
00:06:14
that doesn't do a very good job
00:06:15
necessarily then we draw two parallel
00:06:18
lines around it at some small random
00:06:21
distance and then what we're gonna do is
00:06:23
we're gonna do something very similar to
00:06:25
the perceptron algorithm we're gonna
00:06:27
start listening to the points and asking
00:06:29
them what we need to do so let's say one
00:06:32
point tells us to move in this direction
00:06:34
so we move in this direction and then
00:06:36
what we're gonna do is at every step
00:06:38
we are going to separate the lines just
00:06:41
a little bit and then we listen to
00:06:43
another point that maybe tells us to
00:06:45
move in this direction and then again
00:06:46
we're gonna separate the lines a little
00:06:48
bit and then again another point tells
00:06:51
us to move in this direction and then
00:06:53
we're gonna separate the lines a little
00:06:55
bit and that's pretty much it that's
00:06:58
what the SVM algorithm does of course we
00:07:02
need to go through some technicalities
00:07:03
one technicality is how to separate
00:07:06
lines so let me show you how to separate
00:07:09
lines using equations so let's say we
00:07:11
have a line with equation for example 2x
00:07:14
plus 3y minus 6 equals 0 and then
00:07:18
again recall that really in the
00:07:19
Cartesian plane where the horizontal
00:07:21
axis is the x axis and the vertical axis
00:07:24
is the y axis so notice that this line
00:07:28
is the set of points that satisfy that
00:07:32
two times the x-coordinate plus 3 times
00:07:35
the y-coordinate minus 6 is equal to 0
00:07:38
what happens if I multiply this 2 3 and
00:07:41
-6 by some constant for example by 2 I
00:07:44
get for example 4x plus 6y minus 12
00:07:49
equals 0
00:07:49
well what line do you think this is it's
00:07:53
actually the exact same line because any
00:07:55
point that satisfies 2x plus 3y minus 6 equals 0
00:07:58
also satisfies that 2 times that thing
00:08:01
equals 0 because 2 times 0 is equal to 0
00:08:03
so in particular we get the same line
00:08:05
and if I multiply this equation by any
00:08:07
factor for example by 10 I get 20x plus
00:08:10
30y minus 60 is equal to 0 I get the
00:08:14
exact same line so this is actually this
00:08:16
line actually represents a family of
00:08:18
equations I can also multiply it by
00:08:20
numbers smaller than 1 for example
00:08:22
0.2x plus 0.3y minus 0.6
00:08:26
that's dividing the original equation by
00:08:28
10 that also satisfies the same line and
00:08:31
I can even multiply by negative
00:08:32
numbers and it still works
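As a tiny sanity check, here is what that looks like in Python, using an example point I picked myself that sits on the line:

```python
# The point (3, 0) lies on 2x + 3y - 6 = 0; scaling all three numbers by
# any nonzero constant k gives an equation that the same point satisfies.
x, y = 3.0, 0.0
for k in (1, 2, 10, 0.1, -5):
    value = (k * 2) * x + (k * 3) * y + (k * -6)
    print(f"k = {k}: {value}")   # 0 every time (up to floating-point rounding)
```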
00:08:36
but now let's see what changes so here again we have
00:08:39
2x plus 3y minus 6 equals 0 and the
00:08:42
exact same line which is 4x plus 6y
00:08:45
minus 12 equals 0 now let's actually
00:08:48
draw the lines 2x plus 3y minus 6 equals
00:08:52
1 and 2x plus 3y minus 6 equals
00:08:56
minus 1 because what we're gonna do is
00:08:58
our two parallel lines to the original
00:09:01
one are the ones with the same equation
00:09:04
except the ones that don't give 0 but
00:09:07
they give one and minus 1 now what do
00:09:10
you think happens if I do the same thing
00:09:12
on the graph in the right and this is
00:09:15
important so actually feel free to pause
00:09:16
this video and think about it for a
00:09:18
minute
00:09:19
I'll tell you what happens what happens
00:09:20
is that we get two lines that are
00:09:23
parallel but much closer so the
00:09:25
equations 4x plus 6y minus
00:09:28
12 equals one and minus one are actually
00:09:31
a lot closer to the original one than
00:09:34
the ones with equation 2x plus 3y
00:09:37
minus 6 equals 1 and actually if I
00:09:39
multiply this equation by a smaller
00:09:42
factor for example by dividing by 10 so
00:09:47
I get zero point two x plus zero point
00:09:49
three y minus zero point six equals
00:09:51
one and minus one then I get lines that
00:09:54
are much farther away from the original
00:09:57
one and if I were to multiply it by a
00:10:00
huge number by ten for example I get 20x
00:10:03
plus 30y minus 60 equals 1 and
00:10:06
minus 1 then the lines get much much
00:10:08
closer so the original line stays the
00:10:10
same if I multiply by a constant but
00:10:12
these two parallel lines move farther
00:10:15
away or closer depending on if I'm
00:10:17
multiplying by a number that is close to
00:10:20
zero a small number or by a large number
00:10:23
this is not gonna appear in this video
00:10:26
but if you multiply by a negative number
00:10:28
that the two lines actually switch but
00:10:31
this is not so important for this
00:10:32
algorithm but basically what we're gonna
00:10:35
do is we're gonna be able to separate
00:10:39
lines by multiplying them by a small
00:10:41
number that's really what we're gonna do
00:10:43
in this algorithm but first we need some
00:10:44
justification why is it that this
00:10:46
phenomenon happens so let's look at this
00:10:48
line for example 2x plus 3y minus 6
00:10:52
equals 0 and let's just look at one side
00:10:54
of it so 2x plus 3y minus 6 equals
00:10:57
1 so why is it that this line over here
00:11:01
in between with the equation 4x plus 6y
00:11:05
minus
00:11:06
twelve equals one is actually
00:11:08
exactly in the middle well let's take a
00:11:10
look at this equation 4x plus 6y
00:11:13
minus 12 equals 1 it's the same line as if
00:11:17
I just divide the entire thing by 2
00:11:18
including the 1 so if I divide I get
00:11:21
2x plus 3y minus 6 equals 0.5
00:11:25
the exact same line and the reason
00:11:28
is that any x and y that satisfy 4x plus
00:11:32
6y minus 12 equals 1
00:11:34
also satisfy 2x plus 3y
00:11:37
minus 6 equals 0.5 the exact same
00:11:39
equation so when I bring back this
00:11:42
equation well now you can see that its
00:11:44
value of 0.5 actually lies right in
00:11:48
between the value of 0 and the value of
00:11:50
1 so that's why this equation is in
00:11:53
between and you can see that this works
00:11:55
for pretty much any constant that I
00:11:57
multiply the line by so what we're gonna
00:12:00
do is we're gonna introduce something
00:12:02
called the expanding rate and the expanding
00:12:04
rate is very simple we have again our
00:12:06
equation 2x plus 3y minus 6 equals
00:12:08
0 which gives us this line and then we
00:12:11
have our two neighbor equations the one
00:12:15
that gives us one which is over here and
00:12:17
the one that gives us minus one which is
00:12:19
over here and our expanding rate is just
00:12:22
gonna be some number and remember that
00:12:25
in machine learning we always want to
00:12:27
make tiny steps we don't want to make
00:12:30
any big steps so we want to separate
00:12:33
these lines but by a very very little
00:12:35
amount so we're gonna take a number that
00:12:37
is very close to 1 for example 0.99
00:12:40
let's say that's my favorite number that
00:12:42
is close to 1 and we're gonna call that
00:12:44
the expanding rate and what we're gonna
00:12:47
do is we're just gonna multiply all
00:12:49
these numbers here by 0.99 so what do we
00:12:53
get
00:12:53
well we get these equations the
00:12:56
equations are 1.98x plus 2.97y
00:13:00
minus 5.94 is equal to 0, to 1, and to
00:13:06
minus 1 and these equations give us
00:13:10
three lines the one in the middle is
00:13:11
still the same one but the two on the
00:13:14
sides are actually just a little spread
00:13:16
apart so we're just gonna add
00:13:19
that step to the perceptron algorithm
00:13:21
and that's gonna spread our lines apart
00:13:23
a little bit every time we iterate
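Here is a small numeric sketch of that expanding step in Python, using the example line 2x plus 3y minus 6 equals 0; looking at the y-intercepts is just my own way of seeing that the middle line stays put while the other two spread apart:

```python
a, b, c = 2.0, 3.0, -6.0
rate = 0.99                       # the expanding rate

def y_at_x0(a, b, c, value):
    """y-coordinate where ax + by + c = value crosses the y-axis (x = 0)."""
    return (value - c) / b

for name, (aa, bb, cc) in [("before", (a, b, c)),
                           ("after ", (a * rate, b * rate, c * rate))]:
    print(name, [round(y_at_x0(aa, bb, cc, v), 3) for v in (-1, 0, 1)])
# before [1.667, 2.0, 2.333]  -> the outer lines are about 0.667 apart here
# after  [1.663, 2.0, 2.337]  -> middle line unchanged, outer lines slightly wider
```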
00:13:27
we're ready to formulate the SVM
00:13:29
algorithm and it's gonna be very similar
00:13:31
to the perceptron algorithm step one is
00:13:33
we're gonna start with a random line and
00:13:35
two equidistant parallel lines to it and
00:13:37
I'm gonna color them red and blue just
00:13:39
to emphasize which side of the line is
00:13:41
red and which side of the line is blue
00:13:42
in order to see which points we have
00:13:44
correctly or incorrectly classified now
00:13:47
step two is gonna be pick a large number
00:13:49
the number of repetitions or epochs
00:13:51
the number of times we're gonna iterate
00:13:52
this algorithm step three is gonna be
00:13:55
pick a number close to 1 the
00:13:57
expanding factor and we saw it's gonna
00:13:59
be 0.99 I can pick anything but that's
00:14:02
the one I'm gonna pick close to one step
00:14:04
four is now the loop so repeat a
00:14:06
thousand times pick a random point if
00:14:08
the point is correctly classified for
00:14:11
example this one says I'm good then we
00:14:13
do nothing if it's incorrectly
00:14:15
classified then for example like this
00:14:18
one which is a blue point in the red
00:14:20
area says get over here so we move the
00:14:22
line towards a point so we learned in
00:14:25
the previous video how to move a line
00:14:26
towards a point like this and then we're
00:14:31
gonna do the extra step which is
00:14:33
separate the lines using the expanding
00:14:35
factor so we're gonna do separate the
00:14:37
lines a little bit and we're just gonna
00:14:40
repeat these many many many times
00:14:41
thousand times until we get a pretty
00:14:44
good result and then we enjoy the lines
00:14:46
that separate the data best so notice
00:14:48
that the two steps that we've added are
00:14:50
this step three pick a number the
00:14:52
expanding factor close to one and the
00:14:55
one where we separate the lines using
00:14:57
the expanding factor the rest is pretty
00:14:59
much the same thing as the perceptron
00:15:01
algorithm so now just for full
00:15:06
disclosure if you want to code this like
00:15:07
this is actually the perceptron
00:15:09
algorithm that we saw in the previous
00:15:10
video where step four is
00:15:14
the mathematical step where we check if
00:15:17
something is in the blue or red area by
00:15:19
checking if the equation applied to the
00:15:22
point comes out bigger than 0 or less than
00:15:25
0 so we update the values of a B and C
00:15:28
accordingly by adding the learning rate
00:15:32
times
00:15:33
the coordinates of the point so the SVM
00:15:35
algorithm is actually very similar what
00:15:37
we do is we start with a random line of
00:15:39
equation ax plus by plus c equals zero
00:15:42
and we draw the parallel lines with
00:15:44
equations ax plus by plus c equals one
00:15:47
and minus one then we pick a large
00:15:49
number the number of epochs which is
00:15:51
gonna be a thousand then we pick a
00:15:52
learning rate which is gonna be zero
00:15:54
point zero one we saw it in the logistic
00:15:56
regression video then we pick an
00:15:58
expanding rate which is gonna be 0.99
00:16:01
it's a number close to one and then the
00:16:04
loop step is repeat a thousand times
00:16:05
pick a random point if the point is
00:16:08
correctly classified we do nothing if
00:16:10
the point is blue in the red area then
00:16:12
we update the values of a B and C
00:16:15
accordingly if the point is red in the
00:16:17
blue area we update the values in a
00:16:19
different way and then as a final step
00:16:21
we multiply the values a B and C by 0.99
00:16:27
which is the expanding step and again
00:16:30
the two new steps are step three and the
00:16:33
expanding step so that's it that's the
00:16:35
SVM training algorithm I encourage you
00:16:37
to code it and see how it does try
00:16:40
different values for number of epochs
00:16:42
learning rate expanding rate etc and let
00:16:46
me know how it went in the comments
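If you do want to code it, here is one minimal sketch in Python; the function name, the +1/-1 label convention, and the tiny made-up dataset are my own choices, and the updates simply follow the steps described above:

```python
import random

def train_svm(points, labels, epochs=1000, lr=0.01, expanding_rate=0.99):
    """SVM trick sketch: perceptron-style updates plus the expanding step.
    labels: +1 for red, -1 for blue (my own convention); the main line is
    ax + by + c = 0, with parallel lines at ax + by + c = 1 and -1."""
    a, b, c = random.random(), random.random(), random.random()
    for _ in range(epochs):
        i = random.randrange(len(points))
        x, y = points[i]
        if labels[i] * (a * x + b * y + c) <= 0:   # misclassified point
            a += lr * labels[i] * x                # move the line toward it
            b += lr * labels[i] * y
            c += lr * labels[i]
        a *= expanding_rate                        # expanding step: spread the
        b *= expanding_rate                        # two parallel lines apart
        c *= expanding_rate
    return a, b, c

# tiny made-up dataset: blue points near the origin, red points farther out
points = [(1, 1), (2, 1), (1, 2), (5, 5), (6, 5), (5, 6)]
labels = [-1, -1, -1, 1, 1, 1]
print(train_svm(points, labels))
```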
00:16:48
that's the SVM algorithm as I said I
00:16:50
encourage you to code it take a look at
00:16:52
in some datasets and see how it goes
00:16:54
however this comes out of somewhere this
00:16:57
comes out of an error function
00:17:00
that we minimize with gradient descent so now
00:17:03
I'm gonna show you what the error
00:17:04
function is and it's very similar to the
00:17:06
perceptron algorithm where we had a
00:17:07
classification error based on how far
00:17:09
the points are from the boundary however
00:17:12
now we're gonna have another thing
00:17:15
that adds to the error which is based on
00:17:17
how far away these two lines are so let
00:17:20
me show you so to start with error functions
00:17:23
let me first ask you a question here we
00:17:24
have the same data set twice and I'm
00:17:27
gonna show you two support vector
00:17:29
machines that classify it the first one
00:17:31
is this one and the second one is this
00:17:34
one now the question is which one do you
00:17:37
think is better so feel free to pause the
00:17:40
video and think about it so notice that
00:17:42
the model on the left has one problem
00:17:44
which is that it misclassifies a
00:17:46
point however it's good because it's got
00:17:49
the lines pretty wide apart the model on
00:17:53
the right is great as a
00:17:55
classifier because it classifies
00:17:57
every point correctly however the lines
00:17:59
are very close together so the question
00:18:03
is which one is better and the answer is
00:18:05
we don't really know it depends on our
00:18:08
data it depends on our model it depends
00:18:10
on the scenario but with error functions
00:18:13
we can actually have an approach to
00:18:15
maybe analyze what exactly do we want so
00:18:18
let's recall what happened with the
00:18:20
perceptron error so we here we have some
00:18:22
points and a model a perception that
00:18:25
separates them now this will make some
00:18:27
mistakes right it makes these two
00:18:30
because these two are blue points in the
00:18:33
red area and makes these two because
00:18:35
these are red points in the blue area so
00:18:38
the question is how do we measure the
00:18:40
error or how bad this model is and the
00:18:44
rationale is if a point is on the
00:18:46
correct side then this error is zero if
00:18:49
a point is on the wrong side then the
00:18:51
error can change if a point is close to
00:18:55
the boundary then the error is small and
00:18:56
if it's far from the boundary then the
00:18:57
error is huge because if you're for
00:19:00
example a blue point and you're close to
00:19:02
the blue area but still in the red area
00:19:04
you have a small error but if you're well
00:19:05
into the red area then you generate a
00:19:07
lot of error because that model is very
00:19:09
wrong on that point so what you want is
00:19:12
the distance or not exactly the distance
00:19:14
but something proportional to this
00:19:15
distance and the same here so we're
00:19:18
gonna add a number proportional to
00:19:21
these distances and that's gonna be the
00:19:22
perceptron error so for SVMs it's gonna be
00:19:25
similar we're gonna have our lines and
00:19:27
now we're just gonna have two
00:19:30
classification errors coming from
00:19:31
different places so what we're gonna
00:19:33
have is a red one so our red area now
00:19:36
doesn't start from the middle but it
00:19:38
starts from the bottom line and every
00:19:43
point above this line that is blue is
00:19:47
automatically misclassified so these three
00:19:49
are misclassified and the error is
00:19:51
precisely the distance from the bottom
00:19:53
line and that simple so notice that this
00:19:57
blue point
00:19:59
that is close to the bottom line is
00:20:00
actually misclassified even though it
00:20:02
was correctly classified in the perceptron
00:20:04
algorithm that is okay it's a harsh error
00:20:07
function now the blue error comes from
00:20:12
the line in the top so it comes from
00:20:14
here now every red point underneath this
00:20:16
top line is gonna be misclassified and
00:20:19
its error is gonna be similar to the
00:20:22
perceptron error it's gonna be
00:20:23
proportional to this distance over here
00:20:25
so we're adding all those distances and
00:20:28
that's our error so those two errors
00:20:30
form the classification error now we
00:20:32
have something called the margin error
00:20:33
and the margin error is simply something
00:20:37
that tells us if these two lines are
00:20:40
close by or far apart I'm gonna be a
00:20:43
little more specific later but it's
00:20:45
basically a number that is gonna be big
00:20:47
if the lines are close together and
00:20:49
small if the lines are far apart
00:20:51
because it's an error so the better
00:20:54
model the smaller the error and the
00:20:56
better our model the wider our lines are
00:20:59
so let's actually look a little bit more
00:21:02
at the margin error here so we
00:21:04
have our data set and our data set again
00:21:07
and two models so this one has the lines
00:21:12
pretty far apart therefore it has a
00:21:14
large margin so it's gonna have a small
00:21:16
margin error and this one over here the
00:21:19
lines are pretty close so it's got a
00:21:21
small margin therefore it has a large
00:21:23
margin error and just to show the contrast
00:21:25
notice that this model on the right
00:21:28
has a small classification error and
00:21:30
this model on the left has a large
00:21:32
classification error because the one on the
00:21:33
right classifies all the points
00:21:35
correctly and the one on the left
00:21:36
classifies one point incorrectly
00:21:39
but let's get back to our margin error
00:21:42
so we have our three lines and let's
00:21:44
recall the equations of the lines are
00:21:46
something along the lines of ax plus by
00:21:49
plus c equals 1 and ax plus by plus c
00:21:52
equals minus 1 so now what we're gonna do is
00:21:55
calculate the distance which I'm gonna
00:21:58
leave as a challenge for you to do some
00:22:01
math and show that this is actually 2
00:22:03
divided by the square root of a squared
00:22:05
plus b squared so I challenge you to
00:22:09
prove this what you have to do is play
00:22:11
with linear equations
00:22:12
and the Pythagorean theorem
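As a sketch of one way that challenge can go (using the standard distance formula between parallel lines, which itself follows from the point-to-line distance formula), here it is in LaTeX:

```latex
% The two parallel lines ax + by + c = 1 and ax + by + c = -1 can be
% rewritten as ax + by + (c - 1) = 0 and ax + by + (c + 1) = 0.
% The distance between parallel lines ax + by + c_1 = 0 and
% ax + by + c_2 = 0 is |c_1 - c_2| / sqrt(a^2 + b^2), so
\[
  d = \frac{\lvert (c - 1) - (c + 1) \rvert}{\sqrt{a^{2} + b^{2}}}
    = \frac{2}{\sqrt{a^{2} + b^{2}}}.
\]
```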
00:22:15
so now the question is what can our error be so
00:22:19
let's think about it we need a number
00:22:21
that is big if the distance is small and
00:22:25
a number that is small if the distance
00:22:27
is big so what can our error be feel
00:22:30
free to think about it the hint is look
00:22:33
at the denominator right the bigger a
00:22:36
squared plus b squared is the smaller this
00:22:39
number is and vice versa so what about
00:22:42
just taking the margin error to be this
00:22:43
a squared plus b squared notice that if
00:22:47
this number is large that means the
00:22:49
denominator is small and vice versa
00:22:52
so if we let our margin error just be
00:22:55
that sum of squares that works that
00:22:58
actually measures how far apart the
00:23:02
lines are in the opposite way so if the
00:23:05
lines are close the error is big and if the
00:23:07
lines are far the error is small
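Here is a tiny numeric sketch of that inverse relationship, with example coefficients I chose myself:

```python
import math

# the same middle line written two ways: (2, 3, -6) and it scaled by 0.1
for a, b in [(2.0, 3.0), (0.2, 0.3)]:
    width = 2 / math.sqrt(a**2 + b**2)   # distance between the +1 and -1 lines
    margin_error = a**2 + b**2           # the proposed margin error
    print(f"a = {a}, b = {b}: width ~ {width:.3f}, margin error = {margin_error:.2f}")
# a = 2.0, b = 3.0: narrow margin (~0.555) and a big margin error (13.00)
# a = 0.2, b = 0.3: wide margin (~5.547) and a small margin error (0.13)
```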
00:23:09
that looks familiar shouldn't be a
00:23:12
surprise it's actually the
00:23:13
regularization term if you've seen l2
00:23:16
regularization so now we can summarize
00:23:18
what the SVM error is here it is we have
00:23:21
our data set and our model and the error
00:23:24
basically splits in three first is the
00:23:27
blue classification error which
00:23:29
basically measures all the red points
00:23:32
that are in the blue side then we have
00:23:34
the red classification error which
00:23:36
measures all the blue points that are in
00:23:37
the red side and then we have the margin
00:23:40
error which measures how far apart the
00:23:43
lines are so the red and the blue get
00:23:46
together to form the total
00:23:47
classification error which tells us how
00:23:50
many points are misclassified and how
00:23:52
badly they are misclassified and then
00:23:54
the margin error that tells us if the
00:23:56
lines are far apart or close by and
00:23:59
these two get together to form the total
00:24:02
SVM error
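Here is a small sketch of that error in Python, again with my own +1/-1 label convention; the hinge-style penalty is one way to write "proportional to the distance from the corresponding line":

```python
def svm_error(a, b, c, points, labels):
    """Total SVM error = classification error + margin error.
    labels: +1 for red points, -1 for blue points (my own convention)."""
    classification_error = 0.0
    for (x, y), label in zip(points, labels):
        # a point is penalized when it is not beyond "its" parallel line,
        # i.e. when label * (ax + by + c) < 1, and the penalty grows the
        # farther it is from that line (a hinge-style error)
        classification_error += max(0.0, 1.0 - label * (a * x + b * y + c))
    margin_error = a ** 2 + b ** 2       # big when the two lines are close together
    return classification_error + margin_error
```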
00:24:06
so that is the error in a
support vector machine the one that
00:24:08
we want to minimize and the gradient
00:24:10
descent step is very similar what it does
00:24:13
is actually the same thing as the SVM
00:24:14
trick what it does is here we have our
00:24:17
data and here we have a model and this
00:24:19
model is pretty bad notice that the
00:24:20
lines are pretty narrow and it
00:24:23
misclassifies a bunch of the points so
00:24:26
this is a bad SVM it's got a large
00:24:29
error both in the classification sense
00:24:31
and in the margin sense and what we want to do
00:24:33
is using calculus or using gradient
00:24:36
descent we minimize this error in order
00:24:39
to get to a good place a good SVM that
00:24:42
has a good boundary the lines are far
00:24:45
apart and it actually classifies most of
00:24:47
the points correctly so in the same way
00:24:50
that we did with the perceptron
00:24:52
algorithm this gradient descent process
00:24:55
takes us from a large error to a small
00:24:57
error and this actually is the exact same
00:25:00
thing as the SVM trick that I showed you
00:25:03
recently of moving the line closer to
00:25:06
the points plus separating the lines a
00:25:10
tiny little bit so now I have a
00:25:12
challenge for you and the challenge is
00:25:15
simply to convince yourself that the
00:25:19
expanding step actually comes out of
00:25:22
gradient descent so take a look at this
00:25:24
we have our lines with the equations ax
00:25:26
plus by plus c equals 1 and ax plus by
00:25:28
plus c equals -1 and we have the margin
00:25:32
over here and the margin error which is
00:25:34
a squared plus b squared so if you're
00:25:36
familiar with gradient descent what
00:25:38
happens is that we want to take a step
00:25:41
in the direction of the negative of the
00:25:44
gradient so the gradient is the
00:25:45
derivative of the margin error with
00:25:48
respect to the two parameters a and B
00:25:50
this is a very simple gradient because
00:25:53
the derivative with respect to a
00:25:55
is simply 2a because the derivative of a squared plus b
00:25:59
squared with respect to a is 2a and
00:26:01
with respect to b it is 2b therefore our
00:26:05
gradient descent step takes a and sends
00:26:08
it to a minus the learning rate eta
00:26:12
times 2a which is the derivative and
00:26:15
that's the same thing with b it turns it
00:26:18
into b minus eta times 2b now we can
00:26:22
factor this as a times 1 minus 2 eta and
00:26:26
the bottom one we can factor it as b
00:26:29
times 1 minus 2 eta but notice something
00:26:33
here notice this number over here
00:26:37
this is exactly the expanding factor
00:26:39
because what we're doing is multiplying
00:26:41
a by a number that is close to one
00:26:45
remember that we multiplied a and b by 0.99
00:26:50
this one here is the 0.99 because if we
00:26:52
take eta to be a small number then
00:26:55
we're multiplying a by a number that is
00:26:59
very very close to one because if eta is
00:27:02
small then one minus two eta is very
00:27:05
close to one so that is exactly the
00:27:07
expanding step so the expanding step is
00:27:10
coming from gradient descent and using
00:27:15
the regularization step
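Here is that calculation written out as a short LaTeX sketch, with a concrete learning rate I picked myself just to match the 0.99 from earlier:

```latex
% Gradient descent on the margin error a^2 + b^2 with learning rate \eta:
\[
  a \leftarrow a - \eta \frac{\partial (a^{2} + b^{2})}{\partial a}
    = a - 2\eta a = a\,(1 - 2\eta),
  \qquad
  b \leftarrow b\,(1 - 2\eta).
\]
% For a small \eta the factor (1 - 2\eta) is just below 1; for example
% \eta = 0.005 gives 1 - 2\eta = 0.99, the expanding rate used earlier.
```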
00:27:16
anyway the challenge is to formalize
00:27:19
this and to really convince
00:27:21
yourself that this is the case so now
00:27:24
let's go back a little bit and remember
00:27:26
these two models because we never really
00:27:30
answered a question of which one is
00:27:31
better remember that the one on the Left
00:27:35
misclassifies one blue point and the one
00:27:37
on the right just has a very very short
00:27:40
distance between the lines so they're
00:27:42
both good and bad in some way so let's
00:27:45
really study them the one on the left
00:27:47
has a large classification error because
00:27:49
it makes one mistake and a small margin
00:27:52
error because the lines are pretty far
00:27:54
apart and the one on the right has a
00:27:56
small classification error because it
00:27:58
classifies every point correctly and a
00:28:00
very large margin error because the
00:28:02
lines are too close by so again which
00:28:05
one to pick depends on us it depends on
00:28:08
what we want from the algorithm however
00:28:10
we need to pass this information to the
00:28:12
computer so we need to we need to pass
00:28:14
information of which one do we care more
00:28:17
about the classification error or the
00:28:19
margin error and the way to pass this
00:28:22
information to the computer is using a
00:28:23
parameter or a hyper parameter this
00:28:26
one's gonna call we call the C parameter
00:28:28
so recall that the error here is the
00:28:31
classification error plus the margin
00:28:33
error so we're just gonna take a number
00:28:36
C and attach it to the classification
00:28:39
error and so now our error is not the
00:28:42
sum but a weighted sum where one of
00:28:44
them is weighted by C
00:28:46
now what happens over here well recall
00:28:49
our error is the C times the
00:28:52
classification error plus the margin
00:28:54
error so what happens if we have a small
00:28:55
value of C if we have a small value of C
00:28:58
then the classification error gets
00:29:00
multiplied by a very small number so
00:29:02
it all of a sudden is less important
00:29:04
and then the margin error is the
00:29:06
important one so we are really training
00:29:08
an algorithm to focus a lot more on the
00:29:10
margin error so we end up with a good
00:29:14
margin and maybe a bad classification so
00:29:17
we end up with the model on the left
00:29:20
however if we have a large value of C
00:29:24
then C is attached to the classification
00:29:27
error so this means that the
00:29:28
classification error ends up being a lot
00:29:30
more important and the margin error a little
00:29:32
less important if C is large so
00:29:35
therefore the model with a large C
00:29:39
focuses more on classification because
00:29:42
it tries to minimize the classification
00:29:44
error more than it tries to minimize the
00:29:46
margin error so we end up with a model
00:29:48
like the one in the right which is good
00:29:51
for classification bad for margin and so
00:29:54
again we decide this parameter ourselves
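Here is a tiny sketch, with made-up error numbers, of how the C parameter flips which of the two models above wins:

```python
# made-up error numbers for the two models discussed above:
#   "left"  = wide margin, but one misclassified point
#   "right" = every point classified correctly, but a very narrow margin
models = {"left":  {"classification": 1.0, "margin": 0.2},
          "right": {"classification": 0.0, "margin": 2.5}}

for C in (0.1, 10.0):                    # a small C versus a large C
    for name, err in models.items():
        total = C * err["classification"] + err["margin"]
        print(f"C = {C:>4}  {name:>5}  total error = {total:.2f}")
# with C = 0.1 the wide-margin model ("left") has the lower total error,
# with C = 10.0 the well-classifying model ("right") wins instead
```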
00:29:57
what we really do in real life is try a
00:29:59
bunch of different ones and see which
00:30:01
algorithm did better but it's good
00:30:02
to know that we have certain control
00:30:05
over this training and these are
00:30:06
called hyper parameters every
00:30:08
machine learning algorithm has a bunch
00:30:10
of hyper parameters that one can tune to
00:30:12
decide what we want so that's all folks
00:30:17
thank you very much for your attention I
00:30:19
remind you that this is the last of a
00:30:21
series of three videos on linear models
00:30:23
linear regression logistic regression
00:30:25
and support vector machines so I hope
00:30:27
you enjoyed this as much as I enjoyed it
00:30:29
thank you remember to subscribe if you
00:30:34
want to get notifications of more videos
00:30:36
coming if you liked it please hit like
00:30:39
share it with your friends or comment I
00:30:41
love reading the comments I read them
00:30:43
all if you have suggestions on what
00:30:46
other videos to make I'd love to hear them
00:30:48
and if you want to tweet at me this is
00:30:51
my Twitter handle thank
00:30:54
you very much and see you in the next
00:30:56
video