00:00:00
hello my name is luis serrano and this
00:00:02
is a friendly introduction to support
00:00:04
vector machines or SVM for short this is
00:00:07
the third of a series of three videos on
00:00:09
linear models if you haven't take a look
00:00:12
at the first one it's called linear
00:00:14
regression and the second one called
00:00:15
logistic regression this one builds up a
00:00:17
lot on the second one in particular and
00:00:20
I'll start with the credits this year I
00:00:22
taught a machine learning class at
00:00:23
Quest University in British Columbia
00:00:25
Canada and had a wonderful group of
00:00:27
students who had an awesome time here's
00:00:30
a picture of us with my friend Richard
00:00:32
Hoshino on the right he's also a
00:00:33
professor and actually my students were
00:00:36
the ones who helped me figure out the
00:00:38
key idea for this video
00:00:40
so SVMs are a very important
00:00:43
classification algorithm and basically
00:00:45
what it does is it usually tries to
00:00:47
separate points of two classes using a
00:00:50
line however it tries really hard to
00:00:52
find the best line and the best line
00:00:54
will be the one that is sort of as
00:00:56
far from the points as possible in
00:00:58
order to separate them best normally
00:01:01
SVMs are explained in terms of either
00:01:04
some kind of linear optimization or some
00:01:07
kind of gradient descent what I want to
00:01:09
show you today is something that I
00:01:10
actually haven't seen in the literature
00:01:11
it may exist but I haven't seen it and
00:01:13
it's a method that is a small gradient
00:01:18
descent step like method which is sort of an
00:01:21
iteration and in this iteration what you
00:01:23
do is you first of all try to find a
00:01:25
better line that classifies the points
00:01:27
and then at every step you just take two
00:01:30
parallel lines and just kind of stretch
00:01:32
them apart let me be more explicit so
00:01:36
let me start with a very quick recap on
00:01:38
the previous video on logistic
00:01:40
regression and the perceptron algorithm
00:01:41
basically what we want to do is we have
00:01:44
data split into two classes red points
00:01:47
and blue points and we want to find the
00:01:49
perfect line this is the perceptron
00:01:50
algorithm so what I want to do is not
00:01:53
just find the perfect line but a line
00:01:54
with a red side and a blue side that
00:01:56
splits the points in the best possible
00:01:58
way and the way we did this was we start
00:02:02
with a random line and then we start
00:02:03
asking the points what can they tell us
00:02:06
to make our line better so for example
00:02:09
this point over here says I'm good so
00:02:11
don't worry don't do anything
00:02:13
this blue one says well hey I'm on the wrong
00:02:15
side so you better move closer to me in
00:02:17
order to classify me better so we move a
00:02:19
little closer remember in machine learning we
00:02:21
want to do tiny steps we don't want to
00:02:23
make any big drastic steps so we asked
00:02:25
another point this one is red in the red
00:02:27
area so it says I'm good
00:02:29
don't do anything then we ask this one
00:02:32
over here and it says get over here so
00:02:34
we get over there then we ask this blue
00:02:37
one in the blue area so it says I'm good
00:02:40
by the way we're asking random points
00:02:42
here there's no particular order then we
00:02:45
ask this one it's a red point in the red
00:02:47
area so it just says I'm good we ask this
00:02:49
point which is a blue point in the red
00:02:51
area so it says get over here so we move
00:02:54
closer then we ask this red one in the red
00:02:57
area so it says I'm good then this red one
00:03:00
which is now misclassified in the blue
00:03:02
area so it says move over here and we
00:03:05
listen to it and now it seems like all
00:03:08
the points are good so that is in a
00:03:10
nutshell the perceptron algorithm I'd
00:03:12
like to remind you that the way we did
00:03:15
it is we started with a random line with
00:03:18
red and blue sides then we picked a
00:03:20
large number the number of repetitions
00:03:21
or epochs which in this case is going to
00:03:24
be a thousand that's the number of times
00:03:26
we're going to repeat our iterative step
00:03:28
then step three says repeat a thousand
00:03:31
times we pick a random point we ask the
00:03:33
point if it's correctly classified or
00:03:35
not if it's correctly classified we do
00:03:37
nothing if it's not correctly classified
00:03:39
then we move the line a little bit
00:03:40
towards a point and we do this
00:03:45
repeatedly so we get the line that
00:03:48
separates the data pretty well
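If you want to see this recap as code, here is a minimal sketch in Python; the line is written as ax + by + c = 0, and the +1 label for red and -1 for blue are my own conventions rather than anything from the video:

```python
import random

def perceptron_step(a, b, c, point, label, lr=0.01):
    """One step of the perceptron trick: if the point is misclassified,
    nudge the line ax + by + c = 0 a little bit toward it.
    label is +1 for red points and -1 for blue points (my own convention)."""
    x, y = point
    if label * (a * x + b * y + c) <= 0:   # point is on the wrong side
        a += lr * label * x                # move the line a little
        b += lr * label * y                # toward the point
        c += lr * label
    return a, b, c

def train_perceptron(points, labels, epochs=1000, lr=0.01):
    # step 1: start with a random line; steps 2-3: repeat many times,
    # each time asking one random point
    a, b, c = random.random(), random.random(), random.random()
    for _ in range(epochs):
        i = random.randrange(len(points))
        a, b, c = perceptron_step(a, b, c, points[i], labels[i], lr)
    return a, b, c
```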
00:03:52
so anyway that's a small recap of the perceptron
00:03:53
algorithm and this algorithm is going to
00:03:56
be very similar but it's gonna have a
00:03:57
little bit of an extra step so let me
00:04:00
show you that extra step first let's
00:04:02
start by defining what is it that the
00:04:04
SVM does best so I'm gonna give you an
00:04:07
example of some data here and I'm gonna
00:04:09
copy it twice and this is a line that
00:04:13
separates as data and this is another
00:04:15
line that separates that data so
00:04:17
question for you which line is better so
00:04:21
feel free to think about it for a minute
00:04:22
I think the best one is
00:04:25
this one on the left and this one on the
00:04:27
right is not so good even though they
00:04:29
both separate the data if you notice the
00:04:32
one on the left separates the points
00:04:35
really well like it's really far away
00:04:37
from the points whereas the one on the
00:04:39
right is really really close to two of
00:04:40
the points so if you were to wiggle the
00:04:42
line on the right around you may miss
00:04:45
one of the points and you may
00:04:46
misclassify them whereas the line on the
00:04:48
left you can wiggle it freely and you
00:04:50
still get a good classifier so now the
00:04:54
question is how do we train the computer
00:04:56
to pick the line in the left instead of
00:04:59
the line in the right because if you
00:05:00
remember perceptron algorithm just finds
00:05:03
a good line that separates the data but
00:05:06
it doesn't necessarily find the best one
00:05:09
so let's rephrase the question what we
00:05:12
want to do is not just find one line but
00:05:15
find two lines that are spaced as far apart as
00:05:17
possible from each other so here for
00:05:19
example centered on the main line we
00:05:22
have these two parallel equidistant
00:05:25
lines and notice that for this case on
00:05:29
the Left we can actually have them
00:05:30
pretty far away from each other on the
00:05:33
other hand if we do this with the line
00:05:35
on the right the farthest we can get is
00:05:37
two lines that are pretty close so we're
00:05:39
gonna compare this green distance over
00:05:41
here with this distance over here
00:05:44
and the one on the left is pretty wide
00:05:46
whereas the one on the right is pretty
00:05:48
narrow so we're gonna go for wide so
00:05:51
we're gonna tell the computer when you
00:05:52
find a wide one
00:05:53
you're good but if you find a narrow one
00:05:55
then you're not good and now the
00:05:58
question is how do we train an algorithm
00:06:00
to find two lines as far apart from each
00:06:04
other as possible that are parallel and still split
00:06:07
our data so this is what we're gonna do
00:06:08
very similar to what we did before we're
00:06:11
gonna start by dropping a random line
00:06:14
that doesn't do a very good job
00:06:15
necessarily then we draw two parallel
00:06:18
lines around it at some small random
00:06:21
distance and then what we're gonna do is
00:06:23
we're gonna do something very similar to
00:06:25
the perceptron algorithm we're gonna
00:06:27
start listening to the points and asking
00:06:29
them what we need to do so let's say one
00:06:32
point tells us to move in this direction
00:06:34
so we move in this direction and then
00:06:36
what we're gonna do is at every step
00:06:38
we are going to separate the lines just
00:06:41
a little bit and then we listen to
00:06:43
another point that maybe tells us to
00:06:45
move in this direction and then again
00:06:46
we're gonna separate the lines a little
00:06:48
bit and then again another point tells
00:06:51
us to move in this direction and then
00:06:53
we're gonna separate the lines a little
00:06:55
bit and that's pretty much it that's
00:06:58
what the SVM algorithm does of course we
00:07:02
need to go through some technicalities
00:07:03
one technicality is how to separate
00:07:06
lines so let me show you how to separate
00:07:09
lines using equations so let's say we
00:07:11
have a line with equation for example 2x
00:07:14
plus 3y minus 6 equals 0 and then
00:07:18
again recall that really in the
00:07:19
Cartesian plane where the horizontal
00:07:21
axis is the x axis and the vertical axis
00:07:24
is the y axis so notice that this line
00:07:28
is the set of points that satisfy that
00:07:32
two times the x-coordinate plus 3 times
00:07:35
the y-coordinate minus 6 is equal to 0
00:07:38
what happens if I multiply this 2 3 and
00:07:41
-6 by some constant for example by 2 I
00:07:44
get for example 4x plus 6y minus 12
00:07:49
equals 0
00:07:49
well what line do you think this is it's
00:07:53
actually the exact same line because any
00:07:55
point that satisfies 2x plus 3y minus 6 equals 0
00:07:58
also satisfies that 2 times that thing
00:08:01
equals 0 because 2 times 0 is equal to 0
00:08:03
so in particular we get the same line
00:08:05
and if I multiply this equation by any
00:08:07
factor for example by 10 I get 20x plus
00:08:10
30y minus 60 is equal to 0 I get the
00:08:14
exact same line so this is actually this
00:08:16
line actually represents a family of
00:08:18
equations I can also multiply it by
00:08:20
numbers smaller than 1 for example
00:08:22
0.2x plus 0.3y minus 0.6
00:08:26
that's dividing the original equation by
00:08:28
10 that also satisfies the same line and
00:08:31
I can even multiply by negative
00:08:32
numbers and it still works
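As a tiny sanity check, here is what that looks like in Python, using an example point I picked myself that sits on the line:

```python
# The point (3, 0) lies on 2x + 3y - 6 = 0; scaling all three numbers by
# any nonzero constant k gives an equation that the same point satisfies.
x, y = 3.0, 0.0
for k in (1, 2, 10, 0.1, -5):
    value = (k * 2) * x + (k * 3) * y + (k * -6)
    print(f"k = {k}: {value}")   # 0 every time (up to floating-point rounding)
```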
00:08:36
but now let's see what changes so here again we have
00:08:39
2x plus 3y minus 6 equals 0 and the
00:08:42
exact same line which is 4x plus 6y
00:08:45
minus 12 equals 0 now let's actually
00:08:48
draw the lines 2x plus 3y minus 6 equals
00:08:52
1 and 2x plus 3y minus 6 equals
00:08:56
minus 1 because what we're gonna do is
00:08:58
our two parallel lines to the original
00:09:01
one are the ones with the same equation
00:09:04
except the ones that don't give 0 but
00:09:07
they give one and minus 1 now what do
00:09:10
you think happens if I do the same thing
00:09:12
on the graph in the right and this is
00:09:15
important so actually feel free to pause
00:09:16
this video and think about it for a
00:09:18
minute
00:09:19
I'll tell you what happens what happens
00:09:20
is that we get two lines that are
00:09:23
parallel but much closer so the
00:09:25
equations 4x plus 6y minus
00:09:28
12 equals one and minus one are actually
00:09:31
a lot closer to the original one than
00:09:34
the ones with equation 2x plus 3y
00:09:37
minus 6 equals 1 and actually if I
00:09:39
multiply this equation by a smaller
00:09:42
factor for example by dividing by 10 so
00:09:47
I get zero point two x plus zero point
00:09:49
three y minus zero point six equals
00:09:51
one and minus one then I get lines that
00:09:54
are much farther away from the original
00:09:57
one and if I were to multiply it by a
00:10:00
huge number by ten for example I get 20x
00:10:03
plus 30y minus 60 equals 1 and
00:10:06
minus 1 then the lines get much much
00:10:08
closer so the original line stays the
00:10:10
same if I multiply by a constant but
00:10:12
these two parallel lines move farther
00:10:15
away or closer depending on if I'm
00:10:17
multiplying by a number that is close to
00:10:20
zero a small number or by a large number
00:10:23
this is not gonna appear in this video
00:10:26
but if you multiply by a negative number
00:10:28
that the two lines actually switch but
00:10:31
this is not so important for this
00:10:32
algorithm but basically what we're gonna
00:10:35
do is we're gonna be able to separate
00:10:39
lines by multiplying them by a small
00:10:41
number that's really what we're gonna do
00:10:43
in this algorithm but first we need some
00:10:44
justification why is it that this
00:10:46
phenomenon happens so let's look at this
00:10:48
line for example 2x plus 3y minus 6
00:10:52
equals 0 and let's just look at one side
00:10:54
of it so 2x plus 3y minus 6 equals
00:10:57
1 so why is it that this line over here
00:11:01
in between with the equation 4x plus 6y
00:11:05
minus
00:11:06
twelve equals one is actually
00:11:08
exactly in the middle well let's take a
00:11:10
look at this equation 4x plus 6y
00:11:13
minus 12 equals 1 it's the same line as if
00:11:17
I just divide the entire thing by 2
00:11:18
including the 1 so if I divide I get
00:11:21
2x plus 3y minus 6 equals 0.5
00:11:25
the exact same line and the reason
00:11:28
is that any x and y that satisfy 4x plus
00:11:32
6y minus 12 equals 1
00:11:34
also satisfy 2x plus 3y
00:11:37
minus 6 equals 0.5 the exact same
00:11:39
equation so when I bring back this
00:11:42
equation well now you can see that its
00:11:44
value of 0.5 actually lies right in
00:11:48
between the value of 0 and the value of
00:11:50
1 so that's why this equation is in
00:11:53
between and you can see that this works
00:11:55
for pretty much any constant that I
00:11:57
multiply the line by so what we're gonna
00:12:00
do is we're gonna introduce something
00:12:02
called the expanding rate and the expanding
00:12:04
rate is very simple we have again our
00:12:06
equation 2x plus 3y minus 6 equals
00:12:08
0 which gives us this line and then we
00:12:11
have our two neighbor equations the one
00:12:15
that gives us one which is over here and
00:12:17
the one that gives us minus one which is
00:12:19
over here and our expanding rate is just
00:12:22
gonna be some number and remember that
00:12:25
in machine learning we always want to
00:12:27
make tiny steps we don't want to make
00:12:30
any big steps so we want to separate
00:12:33
these lines but by a very very little
00:12:35
amount so we're gonna take a number that
00:12:37
is very close to 1 for example 0.99
00:12:40
let's say that's my favorite number that
00:12:42
is close to 1 and we're gonna call that
00:12:44
the expanding rate and what we're gonna
00:12:47
do is we're just gonna multiply all
00:12:49
these numbers here by 0.99 so what do we
00:12:53
get
00:12:53
well we get these equations the
00:12:56
equations are 1.98x plus 2.97y
00:13:00
minus 5.94 is equal to 0, to 1, and to
00:13:06
minus 1 and these equations give us
00:13:10
three lines the one in the middle is
00:13:11
still the same one but the two on the
00:13:14
sides are actually just a little spread
00:13:16
apart so we're just gonna add
00:13:19
that step to the perceptron algorithm
00:13:21
and that's gonna spread our lines apart
00:13:23
a little bit every time we iterate
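Here is a small numeric sketch of that expanding step in Python, using the example line 2x plus 3y minus 6 equals 0; looking at the y-intercepts is just my own way of seeing that the middle line stays put while the other two spread apart:

```python
a, b, c = 2.0, 3.0, -6.0
rate = 0.99                       # the expanding rate

def y_at_x0(a, b, c, value):
    """y-coordinate where ax + by + c = value crosses the y-axis (x = 0)."""
    return (value - c) / b

for name, (aa, bb, cc) in [("before", (a, b, c)),
                           ("after ", (a * rate, b * rate, c * rate))]:
    print(name, [round(y_at_x0(aa, bb, cc, v), 3) for v in (-1, 0, 1)])
# before [1.667, 2.0, 2.333]  -> the outer lines are about 0.667 apart here
# after  [1.663, 2.0, 2.337]  -> middle line unchanged, outer lines slightly wider
```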
00:13:27
we're ready to formulate the SVM
00:13:29
algorithm and it's gonna be very similar
00:13:31
to the perceptron algorithm step one is
00:13:33
we're gonna start with a random line and
00:13:35
two equidistant parallel lines to it and
00:13:37
I'm gonna color them red and blue just
00:13:39
to emphasize which side of the line is
00:13:41
red and which side of the line is blue
00:13:42
in order to see which points we have
00:13:44
correctly or incorrectly classified now
00:13:47
step two is gonna be pick a large number
00:13:49
the number of repetitions or epochs
00:13:51
the number of times we're gonna iterate
00:13:52
this algorithm step three is gonna be
00:13:55
pick a number close to 1 the
00:13:57
expanding factor and we saw it's gonna
00:13:59
be 0.99 I can pick anything but that's
00:14:02
the one I'm gonna pick close to one step
00:14:04
four is now the loop so repeat a
00:14:06
thousand times pick a random point if
00:14:08
the point is correctly classified for
00:14:11
example this one says I'm good then we
00:14:13
do nothing if it's incorrectly
00:14:15
classified then for example like this
00:14:18
one which is a blue point in the red
00:14:20
area says get over here so we move the
00:14:22
line towards a point so we learned in
00:14:25
the previous video how to move a line
00:14:26
towards a point like this and then we're
00:14:31
gonna do the extra step which is
00:14:33
separate the lines using the expanding
00:14:35
factor so we're gonna do separate the
00:14:37
lines a little bit and we're just gonna
00:14:40
repeat these many many many times
00:14:41
thousand times until we get a pretty
00:14:44
good result and then we enjoy the lines
00:14:46
that separate the data best so notice
00:14:48
that the two steps that we've added are
00:14:50
this step three pick a number the
00:14:52
expanding factor close to one and the
00:14:55
one where we separate the lines using
00:14:57
the expanding factor the rest is pretty
00:14:59
much the same thing as the perceptron
00:15:01
algorithm so now just for full
00:15:06
disclosure if you want to code this like
00:15:07
this is actually the perceptron
00:15:09
algorithm that we saw in the previous
00:15:10
video where step four is
00:15:14
the mathematical step where we check if
00:15:17
something is in the blue or red area by
00:15:19
checking if the equation applied to the
00:15:22
point comes out bigger than 0 or less than
00:15:25
0 so we update the values of a B and C
00:15:28
accordingly by adding the learning rate
00:15:32
times
00:15:33
the coordinates of the point so the SVM
00:15:35
algorithm is actually very similar what
00:15:37
we do is we start with a random line of
00:15:39
equation ax plus by plus c equals zero
00:15:42
and we draw the parallel lines with
00:15:44
equations ax plus by plus c equals one
00:15:47
and minus one then we pick a large
00:15:49
number the number of epochs which is
00:15:51
gonna be a thousand then we pick a
00:15:52
learning rate which is gonna be zero
00:15:54
point zero one we saw it in the logistic
00:15:56
regression video then we pick an
00:15:58
expanding rate which is gonna be 0.99
00:16:01
it's a number close to one and then the
00:16:04
loop step is repeat a thousand times
00:16:05
pick a random point if the point is
00:16:08
correctly classified we do nothing if
00:16:10
the point is blue in the red area then
00:16:12
we update the values of a B and C
00:16:15
accordingly if the point is red in the
00:16:17
blue area we update the values in a
00:16:19
different way and then as a final step
00:16:21
we multiply the values a B and C by 0.99
00:16:27
which is the expanding step and again
00:16:30
the two new steps are step three and the
00:16:33
expanding step so that's it that's the
00:16:35
SVM training algorithm I encourage you
00:16:37
to code it and see how it does try
00:16:40
different values for number of epochs
00:16:42
learning rate expanding rate etc and let
00:16:46
me know how it went in the comments
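If you do want to code it, here is one minimal sketch in Python; the function name, the +1/-1 label convention, and the tiny made-up dataset are my own choices, and the updates simply follow the steps described above:

```python
import random

def train_svm(points, labels, epochs=1000, lr=0.01, expanding_rate=0.99):
    """SVM trick sketch: perceptron-style updates plus the expanding step.
    labels: +1 for red, -1 for blue (my own convention); the main line is
    ax + by + c = 0, with parallel lines at ax + by + c = 1 and -1."""
    a, b, c = random.random(), random.random(), random.random()
    for _ in range(epochs):
        i = random.randrange(len(points))
        x, y = points[i]
        if labels[i] * (a * x + b * y + c) <= 0:   # misclassified point
            a += lr * labels[i] * x                # move the line toward it
            b += lr * labels[i] * y
            c += lr * labels[i]
        a *= expanding_rate                        # expanding step: spread the
        b *= expanding_rate                        # two parallel lines apart
        c *= expanding_rate
    return a, b, c

# tiny made-up dataset: blue points near the origin, red points farther out
points = [(1, 1), (2, 1), (1, 2), (5, 5), (6, 5), (5, 6)]
labels = [-1, -1, -1, 1, 1, 1]
print(train_svm(points, labels))
```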
00:16:48
that's the SVM algorithm as I said I
00:16:50
encourage you to code it take a look at
00:16:52
in some datasets and see how it goes
00:16:54
however this comes out of somewhere this
00:16:57
comes out of an error function
00:17:00
that we minimize with gradient descent so now
00:17:03
I'm gonna show you what the error
00:17:04
function is and it's very similar to the
00:17:06
perceptron algorithm where we had a
00:17:07
classification error based on how far
00:17:09
the points are from the boundary however
00:17:12
now we're gonna have another thing
00:17:15
that adds to the error which is based on
00:17:17
how far away these two lines are so let
00:17:20
me show you so to start with error functions
00:17:23
let me first ask you a question here we
00:17:24
have the same data set twice and I'm
00:17:27
gonna show you two support vector
00:17:29
machines that classify it the first one
00:17:31
is this one and the second one is this
00:17:34
one now the question is which one do you
00:17:37
think is better so feel free to pause the
00:17:40
video and think about it so notice that
00:17:42
the model on the left has one problem
00:17:44
which is that it misclassifies a
00:17:46
point however it's good because it's got
00:17:49
the lines pretty wide apart the model on
00:17:53
the right is great as a
00:17:55
classifier because it classifies
00:17:57
every point correctly however the lines
00:17:59
are very close together so the question
00:18:03
is which one is better and the answer is
00:18:05
we don't really know it depends on our
00:18:08
data it depends on our model it depends
00:18:10
on the scenario but with error functions
00:18:13
we can actually have an approach to
00:18:15
maybe analyze what exactly do we want so
00:18:18
let's recall what happened with the
00:18:20
perceptron error so we here we have some
00:18:22
points and a model a perception that
00:18:25
separates them now this will make some
00:18:27
mistakes right it makes these two
00:18:30
because these two are blue points in the
00:18:33
red area and makes these two because
00:18:35
these are red points in the blue area so
00:18:38
the question is how do we measure the
00:18:40
error or how bad this model is and the
00:18:44
rationale is if a point is on the
00:18:46
correct side then this error is zero if
00:18:49
a point is on the wrong side then the
00:18:51
error can change if a point is close to
00:18:55
the boundary then the error is small and
00:18:56
if it's far from the boundary then the
00:18:57
error is huge because if you're for
00:19:00
example a blue point and you're close to
00:19:02
the blue area but still in the red area
00:19:04
you have a small error but if you're well
00:19:05
into the red area then you generate a
00:19:07
lot of error because that model is very
00:19:09
wrong on that point so what you want is
00:19:12
the distance or not exactly the distance
00:19:14
but something proportional to this
00:19:15
distance and the same here so we're
00:19:18
gonna add a number proportional to
00:19:21
these distances and that's gonna be the
00:19:22
perceptron error so for SVMs it's gonna be
00:19:25
similar we're gonna have our lines and
00:19:27
now we're just gonna have two
00:19:30
classification errors coming from
00:19:31
different places so what we're gonna
00:19:33
have is a red one so our red area now
00:19:36
doesn't start from the middle but it
00:19:38
starts from the bottom line and every
00:19:43
point above this line that is blue is
00:19:47
automatically misclassified so these three
00:19:49
are misclassified and the error is
00:19:51
precisely the distance from the bottom
00:19:53
line and that simple so notice that this
00:19:57
blue point
00:19:59
that is close to the bottom line is
00:20:00
actually misclassified even though it
00:20:02
was correctly classified in the perceptron
00:20:04
algorithm that is okay it's a harsh error
00:20:07
function now the blue error comes from
00:20:12
the line in the top so it comes from
00:20:14
here now every red point underneath this
00:20:16
top line is gonna be misclassified and
00:20:19
its error is gonna be similar to the
00:20:22
perceptron error it's gonna be
00:20:23
proportional to this distance over here
00:20:25
so we're adding all those distances and
00:20:28
that's our error so those two errors
00:20:30
form the classification error now we
00:20:32
have something called the margin error
00:20:33
and the margin error is simply something
00:20:37
that tells us if these two lines are
00:20:40
close by or far apart I'm gonna be a
00:20:43
little more specific later but it's
00:20:45
basically a number that is gonna be big
00:20:47
if the lines are close together and
00:20:49
small if the lines are far apart
00:20:51
because it's an error so the better
00:20:54
model the smaller the error and the
00:20:56
better our model the wider our lines are
00:20:59
so let's actually look a little bit more
00:21:02
at the margin error here so we
00:21:04
have our data set and our data set again
00:21:07
and two models so this one has the lines
00:21:12
pretty far apart therefore it has a
00:21:14
large margin so it's gonna have a small
00:21:16
margin error and this one over here the
00:21:19
lines are pretty close so it's got a
00:21:21
small margin therefore it has a large
00:21:23
margin error and just to show the contrast
00:21:25
notice that this model on the right
00:21:28
has a small classification error and
00:21:30
this model on the left has a large
00:21:32
classification error because the one on the
00:21:33
right classifies all the points
00:21:35
correctly and the one on the left
00:21:36
classifies one point incorrectly
00:21:39
but let's get back to our margin error
00:21:42
so we have our three lines and let's
00:21:44
recall the equations of the lines are
00:21:46
something along the lines of ax plus by
00:21:49
plus c equals 1 and ax plus by plus c
00:21:52
equals minus 1 so now what we're gonna do is
00:21:55
calculate the distance which I'm gonna
00:21:58
leave as a challenge for you to do some
00:22:01
math and show that this is actually 2
00:22:03
divided by the square root of a squared
00:22:05
plus b squared so I challenge you to
00:22:09
prove this what you have to do is play
00:22:11
with linear equations
00:22:12
and the Pythagorean theorem
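As a sketch of one way that challenge can go (using the standard distance formula between parallel lines, which itself follows from the point-to-line distance formula), here it is in LaTeX:

```latex
% The two parallel lines ax + by + c = 1 and ax + by + c = -1 can be
% rewritten as ax + by + (c - 1) = 0 and ax + by + (c + 1) = 0.
% The distance between parallel lines ax + by + c_1 = 0 and
% ax + by + c_2 = 0 is |c_1 - c_2| / sqrt(a^2 + b^2), so
\[
  d = \frac{\lvert (c - 1) - (c + 1) \rvert}{\sqrt{a^{2} + b^{2}}}
    = \frac{2}{\sqrt{a^{2} + b^{2}}}.
\]
```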
00:22:15
so now the question is what can our error be so
00:22:19
let's think about it we need a number
00:22:21
that is big if the distance is small and
00:22:25
a number that is small if the distance
00:22:27
is big so what can our error be feel
00:22:30
free to think about it the hint is look
00:22:33
at the denominator right the bigger a
00:22:36
squared plus b squared is the smaller this
00:22:39
number is and vice versa so what about
00:22:42
just taking the margin error to be this
00:22:43
a squared plus b squared notice that if
00:22:47
this number is large that means the
00:22:49
denominator is small and vice versa
00:22:52
so if we let our margin error just be
00:22:55
that sum of squares that works that
00:22:58
actually measures how far apart the
00:23:02
lines are in the opposite way so if the
00:23:05
lines are close the error is big and if the
00:23:07
lines are far the error is small
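Here is a tiny numeric sketch of that inverse relationship, with example coefficients I chose myself:

```python
import math

# the same middle line written two ways: (2, 3, -6) and it scaled by 0.1
for a, b in [(2.0, 3.0), (0.2, 0.3)]:
    width = 2 / math.sqrt(a**2 + b**2)   # distance between the +1 and -1 lines
    margin_error = a**2 + b**2           # the proposed margin error
    print(f"a = {a}, b = {b}: width ~ {width:.3f}, margin error = {margin_error:.2f}")
# a = 2.0, b = 3.0: narrow margin (~0.555) and a big margin error (13.00)
# a = 0.2, b = 0.3: wide margin (~5.547) and a small margin error (0.13)
```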
00:23:09
that looks familiar shouldn't be a
00:23:12
surprise it's actually the
00:23:13
regularization term if you've seen l2
00:23:16
regularization so now we can summarize
00:23:18
what the SVM error is here it is we have
00:23:21
our data set and our model and the error
00:23:24
basically splits in three first is the
00:23:27
blue classification error which
00:23:29
basically measures all the red points
00:23:32
that are in the blue side then we have
00:23:34
the red classification error which
00:23:36
measures all the blue points that are in
00:23:37
the red side and then we have the margin
00:23:40
error which measures how far apart the
00:23:43
lines are so the red and the blue get
00:23:46
together to form the total
00:23:47
classification error which tells us how
00:23:50
many points are misclassified and how
00:23:52
badly they are misclassified and then
00:23:54
the margin error that tells us if the
00:23:56
lines are far apart or close by and
00:23:59
these two get together to form the total
00:24:02
SVM error
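Here is a small sketch of that error in Python, again with my own +1/-1 label convention; the hinge-style penalty is one way to write "proportional to the distance from the corresponding line":

```python
def svm_error(a, b, c, points, labels):
    """Total SVM error = classification error + margin error.
    labels: +1 for red points, -1 for blue points (my own convention)."""
    classification_error = 0.0
    for (x, y), label in zip(points, labels):
        # a point is penalized when it is not beyond "its" parallel line,
        # i.e. when label * (ax + by + c) < 1, and the penalty grows the
        # farther it is from that line (a hinge-style error)
        classification_error += max(0.0, 1.0 - label * (a * x + b * y + c))
    margin_error = a ** 2 + b ** 2       # big when the two lines are close together
    return classification_error + margin_error
```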
00:24:06
so that is the error in a
support vector machine the one that
00:24:08
we want to minimize and the gradient
00:24:10
descent step is very similar what it does
00:24:13
is actually the same thing as the SVM
00:24:14
trick what it does is here we have our
00:24:17
data and here we have a model and this
00:24:19
model is pretty bad notice that the
00:24:20
lines are pretty narrow and it
00:24:23
misclassifies a bunch of the points so
00:24:26
this is a bad SVM it's got a large
00:24:29
error both in the classification sense
00:24:31
and in the margin sense and what we want to do
00:24:33
is using calculus or using gradient
00:24:36
descent we minimize this error in order
00:24:39
to get to a good place a good SVM that
00:24:42
has a good boundary the lines are far
00:24:45
apart and it actually classifies most of
00:24:47
the points correctly so in the same way
00:24:50
that we did with the perceptron
00:24:52
algorithm this gradient descent process
00:24:55
takes us from a large error to a small
00:24:57
error and this actually is the exact same
00:25:00
thing as the SVM trick that I showed you
00:25:03
recently of moving the line closer to
00:25:06
the points plus separating the lines a
00:25:10
tiny little bit so now I have a
00:25:12
challenge for you and the challenge is
00:25:15
simply to convince yourself that the
00:25:19
expanding step actually comes out of
00:25:22
gradient descent so take a look at this
00:25:24
we have our lines with the equations ax
00:25:26
plus by plus c equals 1 and ax plus by
00:25:28
plus c equals -1 and we have the margin
00:25:32
over here and the margin error which is
00:25:34
a squared plus b squared so if you're
00:25:36
familiar with gradient descent what
00:25:38
happens is that we want to take a step
00:25:41
in the direction of the negative of the
00:25:44
gradient so the gradient is the
00:25:45
derivative of the margin error with
00:25:48
respect to the two parameters a and B
00:25:50
this is a very simple gradient because
00:25:53
the derivative with respect to a
00:25:55
is simply 2a because the derivative of a squared plus b
00:25:59
squared with respect to a is 2a and
00:26:01
with respect to b it is 2b therefore our
00:26:05
gradient descent step takes a and sends
00:26:08
it to a minus the learning rate eta
00:26:12
times 2a which is the derivative and
00:26:15
that's the same thing with b it turns it
00:26:18
into b minus eta times 2b now we can
00:26:22
factor this as a times 1 minus 2 eta and
00:26:26
the bottom one we can factor it as b
00:26:29
times 1 minus 2 eta but notice something
00:26:33
here notice this number over here
00:26:37
this is exactly the expanding factor
00:26:39
because what we're doing is multiplying
00:26:41
a by a number that is close to one
00:26:45
remember that we multiplied a and b by 0.99
00:26:50
this one here is the 0.99 because if we
00:26:52
take eta to be a small number then
00:26:55
we're multiplying a by a number that is
00:26:59
very very close to one because if eta is
00:27:02
small then one minus two eta is very
00:27:05
close to one so that is exactly the
00:27:07
expanding step so the expanding step is
00:27:10
coming from gradient descent and using
00:27:15
the regularization step
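Here is that calculation written out as a short LaTeX sketch, with a concrete learning rate I picked myself just to match the 0.99 from earlier:

```latex
% Gradient descent on the margin error a^2 + b^2 with learning rate \eta:
\[
  a \leftarrow a - \eta \frac{\partial (a^{2} + b^{2})}{\partial a}
    = a - 2\eta a = a\,(1 - 2\eta),
  \qquad
  b \leftarrow b\,(1 - 2\eta).
\]
% For a small \eta the factor (1 - 2\eta) is just below 1; for example
% \eta = 0.005 gives 1 - 2\eta = 0.99, the expanding rate used earlier.
```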
00:27:16
anyway the challenge is to formalize
00:27:19
this and to really convince
00:27:21
yourself that this is the case so now
00:27:24
let's go back a little bit and remember
00:27:26
these two models because we never really
00:27:30
answered a question of which one is
00:27:31
better remember that the one on the Left
00:27:35
misclassifies one blue point and the one
00:27:37
on the right just has a very very short
00:27:40
distance between the lines so they're
00:27:42
both good and bad in some way so let's
00:27:45
really study them the one on the left
00:27:47
has a large classification error because
00:27:49
it makes one mistake and a small margin
00:27:52
error because the lines are pretty far
00:27:54
apart and the one on the right has a
00:27:56
small classification error because it
00:27:58
classifies every point correctly and a
00:28:00
very large margin error because the
00:28:02
lines are too close by so again which
00:28:05
one to pick depends on us it depends on
00:28:08
what we want from the algorithm however
00:28:10
we need to pass this information to the
00:28:12
computer so we need to we need to pass
00:28:14
information of which one do we care more
00:28:17
about the classification error or the
00:28:19
margin error and the way to pass this
00:28:22
information to the computer is using a
00:28:23
parameter or a hyper parameter this
00:28:26
one's gonna call we call the C parameter
00:28:28
so recall that the error here is the
00:28:31
classification error plus the margin
00:28:33
error so we're just gonna take a number
00:28:36
C and attach it to the classification
00:28:39
error and so now our error is not the
00:28:42
sum but a weighted sum where one of
00:28:44
them is weighted by C
00:28:46
now what happens over here well recall
00:28:49
our error is the C times the
00:28:52
classification error plus the margin
00:28:54
error so what happens if we have a small
00:28:55
value of C if we have a small value of C
00:28:58
then the classification error gets
00:29:00
multiplied by a very small number so
00:29:02
it all of a sudden is less important
00:29:04
and then the margin error is the
00:29:06
important one so we are really training
00:29:08
an algorithm to focus a lot more on the
00:29:10
margin error so we end up with a good
00:29:14
margin and maybe a bad classification so
00:29:17
we end up with the model on the left
00:29:20
however if we have a large value of C
00:29:24
then C is attached to the classification
00:29:27
error so this means that the
00:29:28
classification error ends up being a lot
00:29:30
more important and the margin error a little
00:29:32
less important if C is large so
00:29:35
therefore the model with a large C
00:29:39
focuses more on classification because
00:29:42
it tries to minimize the classification
00:29:44
error more than it tries to minimize the
00:29:46
margin error so we end up with a model
00:29:48
like the one in the right which is good
00:29:51
for classification bad for margin and so
00:29:54
again we decide this parameter ourselves
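Here is a tiny sketch, with made-up error numbers, of how the C parameter flips which of the two models above wins:

```python
# made-up error numbers for the two models discussed above:
#   "left"  = wide margin, but one misclassified point
#   "right" = every point classified correctly, but a very narrow margin
models = {"left":  {"classification": 1.0, "margin": 0.2},
          "right": {"classification": 0.0, "margin": 2.5}}

for C in (0.1, 10.0):                    # a small C versus a large C
    for name, err in models.items():
        total = C * err["classification"] + err["margin"]
        print(f"C = {C:>4}  {name:>5}  total error = {total:.2f}")
# with C = 0.1 the wide-margin model ("left") has the lower total error,
# with C = 10.0 the well-classifying model ("right") wins instead
```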
00:29:57
what we really do in real life is try a
00:29:59
bunch of different ones and see which
00:30:01
algorithm did better but it's good
00:30:02
to know that we have certain control
00:30:05
over this training and these are
00:30:06
called hyper parameters every
00:30:08
machine learning algorithm has a bunch
00:30:10
of hyper parameters that one can tune to
00:30:12
decide what we want so that's all folks
00:30:17
thank you very much for your attention I
00:30:19
remind you that this is the last of a
00:30:21
series of three videos on linear models
00:30:23
linear regression logistic regression
00:30:25
and support vector machines so I hope
00:30:27
you enjoyed this as much as I enjoyed it
00:30:29
thank you remember to subscribe if you
00:30:34
want to get notifications of more videos
00:30:36
coming if you liked it please hit like
00:30:39
share it with your friends or comment I
00:30:41
love reading the comments I read them
00:30:43
all if you have suggestions on what
00:30:46
other videos to make I'd love to hear them
00:30:48
and if you want to tweet at me this is
00:30:51
my Twitter handle thank
00:30:54
you very much and see you in the next
00:30:56
video