Support Vector Machines (SVMs): A friendly introduction

00:30:57
https://www.youtube.com/watch?v=Lpr__X8zuE8

Summary

TLDR: This video is a detailed introduction to Support Vector Machines (SVMs), a key classification algorithm in machine learning. The instructor, Luis Serrano, builds on the previous videos on linear regression and logistic regression to explain how SVMs separate the points of two classes with a line flanked by two parallel lines. The goal is to find the line whose two parallel companions can be pushed as far apart as possible while still separating the classes correctly. The algorithm adjusts the line iteratively based on feedback from individual data points, and concepts such as the expanding factor and the 'C' hyperparameter are introduced to control the trade-off between classification errors and margin errors. The video also covers the practical training procedure, the SVM error function, and the gradient descent view that underlies it.
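
The TLDR mentions that the expanding factor comes out of gradient descent; the video makes this concrete around the 26:00 mark. A worked version of that step, using the video's margin error a^2 + b^2 and writing eta for the learning rate:

    \text{margin error} = a^2 + b^2
    \frac{\partial}{\partial a}(a^2 + b^2) = 2a, \qquad \frac{\partial}{\partial b}(a^2 + b^2) = 2b
    a \leftarrow a - \eta\,(2a) = a\,(1 - 2\eta), \qquad b \leftarrow b - \eta\,(2b) = b\,(1 - 2\eta)

For a small learning rate, 1 - 2*eta is a number just below 1 (for example eta = 0.005 gives 0.99), which is exactly the expanding rate the video multiplies the parameters by.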

Takeaways

  • 🔍 SVMs classify by finding the optimal separating line between classes.
  • 📏 The goal is to maximize the distance between two parallel lines around the decision boundary.
  • 🔄 Iterative adjustments are made based on point classifications.
  • 📈 'C' parameter adjusts the trade-off between classification error and margin error.
  • 🏗️ The expanding factor slightly separates the lines during training.
  • 🧮 SVM error comprises classification error and margin error.
  • 🔬 Understanding the trade-offs is crucial for effective model training.
  • 🌐 Hyperparameters can be tuned to enhance model performance.

Timeline

  • 00:00:00 - 00:05:00

    Luis Serrano introduces Support Vector Machines (SVM) while recapping linear regression and logistic regression from previous videos. He credits his students for their contributions during his teaching experience and emphasizes the significance of SVM as a classification algorithm that seeks to separate points of two classes with the best possible line.

  • 00:05:00 - 00:10:00

    The SVM algorithm, compared to the perceptron algorithm, focuses on not just finding any separating line but the optimal line that maximizes the distance between parallel lines drawn around it. This involves finding two lines that are as far apart as possible from each other while still effectively separating the data points.

  • 00:10:00 - 00:15:00

    Serrano illustrates the margin concept with two parallel lines, explaining that the best line is the one with maximized margin from the data points. The process involves iteratively adjusting the separating lines based on the positions of the data points and how they are classified by the current line.

  • 00:15:00 - 00:20:00

    He discusses the mathematical representation of lines and how adjustments to the equations can shift the position of the lines. By introducing an expanding rate, the SVM algorithm allows the lines to be spread apart incrementally throughout the training process, ensuring an optimal separation between classes.

  • 00:20:00 - 00:25:00

    Luis outlines the SVM training steps: start with a random line, pick the number of iterations (epochs), a learning rate, and an expanding rate, then loop, moving the line toward misclassified data points and applying the expanding step each time. The expanding step is what makes the lines spread farther apart over time as the model learns (a code sketch of this loop follows the timeline).

  • 00:25:00 - 00:30:57

    Finally, Luis explains the importance of error functions in SVM, defining classification errors based on misclassifications and margin errors based on the distance between separating lines. He combines these to create a comprehensive error measure for the SVM, and discusses hyperparameters like the C parameter that controls the trade-off between the classification error and the margin error.
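
Below is a minimal Python sketch of the training loop described in the 00:15:00 to 00:25:00 entries above. It illustrates the procedure the video outlines rather than reproducing any code from it: the 2D points, the +1/-1 label convention and the function name train_svm are assumptions, while the default values (1000 epochs, learning rate 0.01, expanding rate 0.99) are the ones quoted in the video.

    import random

    def train_svm(points, labels, epochs=1000, learning_rate=0.01, expanding_rate=0.99):
        """Hypothetical sketch: fit a line ax + by + c = 0 to 2D points with labels +1/-1."""
        # Step 1: start with a random line; its two parallel neighbours are
        # ax + by + c = 1 and ax + by + c = -1.
        a, b, c = random.uniform(-1, 1), random.uniform(-1, 1), random.uniform(-1, 1)
        for _ in range(epochs):
            # Pick a random point and ask it whether the line should move.
            i = random.randrange(len(points))
            x, y = points[i]
            if labels[i] * (a * x + b * y + c) <= 0:  # point is misclassified
                # Perceptron-style step: move the line a little toward the point.
                a += learning_rate * labels[i] * x
                b += learning_rate * labels[i] * y
                c += learning_rate * labels[i]
            # Expanding step: multiplying by a number just below 1 leaves the
            # middle line where it is but spreads the +1 and -1 lines apart.
            a *= expanding_rate
            b *= expanding_rate
            c *= expanding_rate
        return a, b, c

A new point (x, y) is then classified by the sign of a*x + b*y + c.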



Video Q&A

  • What are Support Vector Machines (SVM)?

    SVMs are a classification algorithm in machine learning that find the best line to separate data points from different classes.

  • How do SVMs differ from the perceptron algorithm?

    SVMs aim to find not just one separating line but two parallel lines that are as far apart as possible while separating the classes.

  • What is the significance of the 'C' parameter in SVM?

    The 'C' parameter balances the importance of classification error versus the margin error during SVM training.

  • What does the margin error in SVM determine?

    The margin error indicates how far apart the two parallel lines are; smaller margins indicate a higher margin error.

  • What role does gradient descent play in SVM training?

    Gradient descent is used to minimize the classification and margin error to optimize the separating lines.

  • Can you explain the expanding factor in SVM?

    The expanding factor is a number close to 1 that the line's parameters are multiplied by at each iteration, which slightly spreads the two parallel lines apart.

  • How is the error calculated in an SVM?

    The SVM error combines the classification error (a penalty proportional to how far misclassified points are past their margin line) and the margin error (a penalty that is large when the two separating lines are close together).
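
To make the last answer concrete, here is a small Python sketch of that error, under the same assumed conventions as the training sketch above (2D points, labels +1/-1, separating line ax + by + c = 0); the function name svm_error and the exact form of the per-point penalty are illustrative assumptions rather than code from the video.

    def svm_error(points, labels, a, b, c, C=1.0):
        """Hypothetical sketch: total SVM error = C * classification error + margin error."""
        # Classification error: a point only counts as correct if it lies beyond
        # its own margin line (ax + by + c >= 1 for +1 points, <= -1 for -1
        # points); otherwise it contributes how far it falls short of that line.
        classification_error = sum(
            max(0.0, 1 - label * (a * x + b * y + c))
            for (x, y), label in zip(points, labels)
        )
        # Margin error: a^2 + b^2, which is large when the two parallel lines
        # are close together (their distance is 2 / sqrt(a^2 + b^2)).
        margin_error = a ** 2 + b ** 2
        # C sets the trade-off: a large C favours classifying every point,
        # a small C favours a wide margin.
        return C * classification_error + margin_error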

Transcript (en)
  • 00:00:00
    hello my name is luis serrano and this
  • 00:00:02
    is a friendly introduction to support
  • 00:00:04
    vector machines or SVM for short this is
  • 00:00:07
    the third of a series of three videos on
  • 00:00:09
    linear models if you haven't take a look
  • 00:00:12
    at the first one it's called linear
  • 00:00:14
    regression and the second one called
  • 00:00:15
    logistic regression this one builds up a
  • 00:00:17
    lot on the second one in particular and
  • 00:00:20
    I'll start with the credits this year I
  • 00:00:22
    taught a machine learning class at
  • 00:00:23
    Quest University in British Columbia
  • 00:00:25
    Canada and had a wonderful group of
  • 00:00:27
    students who had an awesome time here's
  • 00:00:30
    a picture of us with my friend Richard
  • 00:00:32
    Hoshino on the right he's also a
  • 00:00:33
    professor and actually my students were
  • 00:00:36
    the ones who helped me figure out the
  • 00:00:38
    key idea for this video
  • 00:00:40
    so SVMs are a very important
  • 00:00:43
    classification algorithm and basically
  • 00:00:45
    what it does is it usually tries to
  • 00:00:47
    separate points of two classes using a
  • 00:00:50
    line however it tries really hard to
  • 00:00:52
    find the best line and the best line
  • 00:00:54
    will be the one that is sort of the
  • 00:00:56
    farthest from the points as possible to
  • 00:00:58
    separate them best normally
  • 00:01:01
    SVMs are explained in terms of either
  • 00:01:04
    some kind of linear optimization or some
  • 00:01:07
    kind of gradient descent what I want to
  • 00:01:09
    show you today is something that I
  • 00:01:10
    actually haven't seen in the literature
  • 00:01:11
    it may exist but I haven't seen it and
  • 00:01:13
    it's a method that is a small gradient
  • 00:01:18
    step like method which is sort of an
  • 00:01:21
    iteration and in this iteration what you
  • 00:01:23
    do is you first of all try to find a
  • 00:01:25
    better line that classifies the points
  • 00:01:27
    and then at every step you just take two
  • 00:01:30
    lines parallel and just kind of stretch
  • 00:01:32
    them apart let me be more explicit so
  • 00:01:36
    let me start with a very quick recap on
  • 00:01:38
    the previous video on logistic
  • 00:01:40
    regression and the perceptron algorithm
  • 00:01:41
    basically what we want to do is we have
  • 00:01:44
    data split into two classes red points
  • 00:01:47
    and blue points and we want to find the
  • 00:01:49
    perfect line this is the perceptron
  • 00:01:50
    algorithm so what I want to do is not
  • 00:01:53
    just find the perfect line but a line
  • 00:01:54
    with a red side and a blue side that
  • 00:01:56
    splits the points in the best possible
  • 00:01:58
    way and the way we did this was we start
  • 00:02:02
    with a random line and then we start
  • 00:02:03
    asking the points what can they tell us
  • 00:02:06
    to make our line better so for example
  • 00:02:09
    this point over here says I'm good so
  • 00:02:11
    don't worry don't do anything
  • 00:02:13
    blue one says well hey I'm on the wrong
  • 00:02:15
    side so you better move closer to me in
  • 00:02:17
    order to classify me better so we move a
  • 00:02:19
    little closer remember in machine learning we
  • 00:02:21
    want to do tiny steps we don't want to
  • 00:02:23
    make any big drastic steps so we asked
  • 00:02:25
    another point this one is red in the red
  • 00:02:27
    area so it says I'm good
  • 00:02:29
    don't do anything then we ask this one
  • 00:02:32
    over here and it says get over here so
  • 00:02:34
    we get over there then we ask this blue
  • 00:02:37
    one in the blue area so it says I'm good
  • 00:02:40
    by the way we're adding random points
  • 00:02:42
    here there's no particular order then we
  • 00:02:45
    ask this one it's a red point in the red
  • 00:02:47
    area so it says I'm good we ask this
  • 00:02:49
    point which is a blue point in the red
  • 00:02:51
    area so it says get over here so we move
  • 00:02:54
    closer then we ask this red on the red
  • 00:02:57
    area so it says I'm good then this red
  • 00:03:00
    which is now misclassified in the blue
  • 00:03:02
    area so it says move over here and we
  • 00:03:05
    listen to it and now it seems like all
  • 00:03:08
    the points are good so that is in a
  • 00:03:10
    nutshell the perceptron algorithm I'd
  • 00:03:12
    like to remind you that the way we did
  • 00:03:15
    it is we started with a random line with
  • 00:03:18
    red and blue sides then we picked a
  • 00:03:20
    large number the number of repetitions
  • 00:03:21
    or epochs which in this case is going to
  • 00:03:24
    be a thousand that's the number of times
  • 00:03:26
    we're going to repeat our iterative step
  • 00:03:28
    then step three says repeat a thousand
  • 00:03:31
    times we pick a random point we ask the
  • 00:03:33
    point if it's correctly classified or
  • 00:03:35
    not if it's correctly classified we do
  • 00:03:37
    nothing if it's not correctly classified
  • 00:03:39
    then we move the line a little bit
  • 00:03:40
    towards a point and we do this
  • 00:03:45
    repeatedly so we get the line that
  • 00:03:48
    separates the data pretty well so anyway
  • 00:03:52
    that's a small recap of the perceptron
  • 00:03:53
    algorithm and this algorithm is going to
  • 00:03:56
    be very similar but it's gonna have a
  • 00:03:57
    little bit of an extra step so let me
  • 00:04:00
    show you that extra step first let's
  • 00:04:02
    start by defining what is it that the
  • 00:04:04
    SVM does best so I'm gonna give you an
  • 00:04:07
    example of some data here and I'm gonna
  • 00:04:09
    copy it twice and this is a line that
  • 00:04:13
    separates as data and this is another
  • 00:04:15
    line that separates that data so
  • 00:04:17
    question for you which line is better so
  • 00:04:21
    feel free to think about it for a minute
  • 00:04:22
    I think the best one is
  • 00:04:25
    this one on the left and this one on the
  • 00:04:27
    right is not so good even though they
  • 00:04:29
    both separate the data if you notice the
  • 00:04:32
    one on the left separates the points
  • 00:04:35
    really well like it's really far away
  • 00:04:37
    from the points whereas the one on the
  • 00:04:39
    right is really really close to two of
  • 00:04:40
    the points so if you were to wiggle the
  • 00:04:42
    line on the right around you may miss
  • 00:04:45
    one of the points and you may miss
  • 00:04:46
    classify them whereas the line on the
  • 00:04:48
    left you can wiggle it freely and you
  • 00:04:50
    still get a good classifier so now the
  • 00:04:54
    question is how do we train the computer
  • 00:04:56
    to pick the line in the left instead of
  • 00:04:59
    the line in the right because if you
  • 00:05:00
    remember perceptron algorithm just finds
  • 00:05:03
    a good line that separates the data but
  • 00:05:06
    it doesn't necessarily find the best one
  • 00:05:09
    so let's rephrase the question what we
  • 00:05:12
    want to do is not just find one line but
  • 00:05:15
    find two lines that are spaced as far apart as
  • 00:05:17
    possible from each other so here for
  • 00:05:19
    example centered on the main line we
  • 00:05:22
    have these two parallel equidistant
  • 00:05:25
    lines and notice that for this case on
  • 00:05:29
    the Left we can actually have them
  • 00:05:30
    pretty far away from each other on the
  • 00:05:33
    other hand if we do this with the line
  • 00:05:35
    on the right the farthest we can get is
  • 00:05:37
    two lines that are pretty close so we're
  • 00:05:39
    gonna compare this green distance over
  • 00:05:41
    here with this distance over here
  • 00:05:44
    and the one on the left is pretty wide
  • 00:05:46
    whereas the one on the right is pretty
  • 00:05:48
    narrow so we're gonna go for wide so
  • 00:05:51
    we're gonna tell the computer when you
  • 00:05:52
    find a wide one
  • 00:05:53
    you're good but if you find a narrow one
  • 00:05:55
    then you're not good and now the
  • 00:05:58
    question is how do we train an algorithm
  • 00:06:00
    to find two lines as far apart from each
  • 00:06:04
    other that are parallel that still split
  • 00:06:07
    our data so this is what we're gonna do
  • 00:06:08
    very similar to what we did before we're
  • 00:06:11
    gonna start by dropping a random line
  • 00:06:14
    that doesn't do a very good job
  • 00:06:15
    necessarily then we draw two parallel
  • 00:06:18
    lines around it at some small random
  • 00:06:21
    distance and then what we're gonna do is
  • 00:06:23
    we're gonna do something very similar to
  • 00:06:25
    the perceptron algorithm we're gonna
  • 00:06:27
    start listening to the points and asking
  • 00:06:29
    them what we need to do so let's say one
  • 00:06:32
    point tells us to move in this direction
  • 00:06:34
    so we move in this direction and then
  • 00:06:36
    what we're gonna do is at every step
  • 00:06:38
    we are going to separate the lines just
  • 00:06:41
    a little bit and then we listen to
  • 00:06:43
    another point that maybe tells us to
  • 00:06:45
    move in this direction and then again
  • 00:06:46
    we're gonna separate the lines a little
  • 00:06:48
    bit and then again another point tells
  • 00:06:51
    us to move in this direction and then
  • 00:06:53
    we're gonna separate the lines a little
  • 00:06:55
    bit and that's pretty much it that's
  • 00:06:58
    what the SVM algorithm does of course we
  • 00:07:02
    need to go through some technicalities
  • 00:07:03
    one technicality is how to separate
  • 00:07:06
    lines so let me show you how to separate
  • 00:07:09
    lines using equations so let's say we
  • 00:07:11
    have a line with equation for example 2x
  • 00:07:14
    plus 3y plus minus 6 equals 0 and then
  • 00:07:18
    again recall that really in the
  • 00:07:19
    Cartesian plane where the horizontal
  • 00:07:21
    axis is the x axis and the vertical axis
  • 00:07:24
    is the y axis so notice that this line
  • 00:07:28
    is the set of points that satisfy that
  • 00:07:32
    two times the x-coordinate plus 3 times
  • 00:07:35
    the y-coordinate minus 6 is equal to 0
  • 00:07:38
    what happens if I multiply this 2 3 and
  • 00:07:41
    -6 by some constant for example by 2 I
  • 00:07:44
    get for example 4x plus 6y plus minus 12
  • 00:07:49
    equals 0
  • 00:07:49
    well what line do you think this is it's
  • 00:07:53
    actually the exact same line because any
  • 00:07:55
    point that satisfies 2x + 3y - 6 = 0
  • 00:07:58
    also satisfies that 2 times that thing
  • 00:08:01
    equals 0 because 2 times 0 is equal to 0
  • 00:08:03
    so in particular we get the same line
  • 00:08:05
    and if I multiply this equation by any
  • 00:08:07
    factor for example by 10 I get 20 x plus
  • 00:08:10
    30y plus minus 60 is equal to 0 I get the
  • 00:08:14
    exact same line so this is actually this
  • 00:08:16
    line actually represents a family of
  • 00:08:18
    equations I can also multiply it by
  • 00:08:20
    numbers are smaller than 1 for example
  • 00:08:22
    0.2 X plus point 3y plus minus point 6
  • 00:08:26
    that's dividing the original equation by
  • 00:08:28
    10 that also satisfies the same line and
  • 00:08:31
    I can even multiply in my negative
  • 00:08:32
    numbers and it still works but now let's
  • 00:08:36
    see what changes so here again we have
  • 00:08:39
    2x plus 3y plus minus 6 equals 0 and the
  • 00:08:42
    exact same line which is 4x plus 6y plus
  • 00:08:45
    minus 12 equals 0 now let's actually
  • 00:08:48
    draw the lines 2x plus 3y plus minus 6 equals
  • 00:08:52
    1 and 2x plus 3y plus minus 6 equals
  • 00:08:56
    minus 1 because what we're gonna do is
  • 00:08:58
    our two parallel lines to the original
  • 00:09:01
    one are the ones with the same equation
  • 00:09:04
    except the ones that don't give 0 but
  • 00:09:07
    they give one and minus 1 now what do
  • 00:09:10
    you think happens if I do the same thing
  • 00:09:12
    on the graph in the right and this is
  • 00:09:15
    important so actually feel free to pause
  • 00:09:16
    this video and think about it for a
  • 00:09:18
    minute
  • 00:09:19
    I'll tell you what happens what happens
  • 00:09:20
    is that we get two lines that are
  • 00:09:23
    parallel but much closer so the
  • 00:09:25
    equations 4x plus 6y plus minus
  • 00:09:28
    12 equals one and minus one are actually
  • 00:09:31
    a lot closer to the original one than
  • 00:09:34
    the ones with equation 2x plus 3y plus
  • 00:09:37
    minus 6 equals 1 and actually if I
  • 00:09:39
    multiply this equation by a smaller
  • 00:09:42
    factor for example by dividing by 10 so
  • 00:09:47
    I get zero point two X plus zero point
  • 00:09:49
    three y plus minus zero point six equals
  • 00:09:51
    one and minus one then I get lines that
  • 00:09:54
    are much farther away from the original
  • 00:09:57
    one and if I were to multiply it by a
  • 00:10:00
    huge number by ten for example I get 20
  • 00:10:03
    x plus 30y plus minus 60 equals 1 and
  • 00:10:06
    minus 1 then the lines get much much
  • 00:10:08
    closer so the original line stays the
  • 00:10:10
    same if I multiply by a constant but
  • 00:10:12
    these two parallel lines move farther
  • 00:10:15
    away or closer depending on if I'm
  • 00:10:17
    multiplying by a number that is close to
  • 00:10:20
    zero a small number or by a large number
  • 00:10:23
    this is not gonna appear in this video
  • 00:10:26
    but if you multiply by a negative number
  • 00:10:28
    that the two lines actually switch but
  • 00:10:31
    this is not so important for this
  • 00:10:32
    algorithm but basically what we're gonna
  • 00:10:35
    do is we're gonna be able to separate
  • 00:10:39
    lines by multiplying them by a small
  • 00:10:41
    number that's really what we're gonna do
  • 00:10:43
    in this algorithm but first we need some
  • 00:10:44
    justification why is it that this
  • 00:10:46
    phenomenon happens so let's look at this
  • 00:10:48
    line for example 2x plus 3y plus minus 6
  • 00:10:52
    equals 0 and let's just look at one side
  • 00:10:54
    of it so 2x plus 3y plus minus 6 equals
  • 00:10:57
    1 so why is it that this line over here
  • 00:11:01
    in between is the equation 4x plus 6y
  • 00:11:05
    plus minus
  • 00:11:06
    twelve is equal to one it's actually
  • 00:11:08
    exactly in the middle well let's take a
  • 00:11:10
    look at this equation 4x plus 6y plus
  • 00:11:13
    minus 12 equals 1 is the same line as if
  • 00:11:17
    I just divide the entire thing by 2
  • 00:11:18
    including the 1 so if I were to divide
  • 00:11:21
    2x plus 3y plus it's minus 6 equals 0.5
  • 00:11:25
    I get the exact same line and the reason
  • 00:11:28
    is that any X&Y that satisfied 4x plus
  • 00:11:32
    6y plus minus 12 equals 1
  • 00:11:34
    they also satisfied 2x plus 3y plus
  • 00:11:37
    minus 6 equals 0.5 the exact same
  • 00:11:39
    equation so when I bring back this
  • 00:11:42
    equation well now you can see that it's
  • 00:11:44
    a value of 0.5 actually lies right in
  • 00:11:48
    between the value of 0 and the value of
  • 00:11:50
    1 so that's why this equation is in
  • 00:11:53
    between and you can see that this works
  • 00:11:55
    for pretty much any constant that I
  • 00:11:57
    multiply the line by so what we're gonna
  • 00:12:00
    do is we're gonna introduce something
  • 00:12:02
    called the expanding rate and expanding
  • 00:12:04
    rate is very simple we have again our
  • 00:12:06
    equation 2x plus 3y plus minus 6 equals
  • 00:12:08
    0 which gives us this line and then we
  • 00:12:11
    have our two neighbor equations the one
  • 00:12:15
    that gives us one which is over here and
  • 00:12:17
    the one that gives us minus one which is
  • 00:12:19
    over here and our expanding rate is just
  • 00:12:22
    gonna be some number and remember that
  • 00:12:25
    in machine learning we always want to
  • 00:12:27
    make tiny steps we don't want to make
  • 00:12:30
    any big steps so we want to separate
  • 00:12:33
    this line but by a very very little
  • 00:12:35
    amount so we're gonna take a number that
  • 00:12:37
    is very close to 1 for example 0.99
  • 00:12:40
    let's say that's my favorite number that
  • 00:12:42
    is close to 1 and we're gonna call that
  • 00:12:44
    the expanding rate and what we're gonna
  • 00:12:47
    do is we're just gonna multiply all
  • 00:12:49
    these numbers here by 0.99 so what do we
  • 00:12:53
    get
  • 00:12:53
    well we get these equations the
  • 00:12:56
    equations are 1.98x plus 2.97y plus
  • 00:13:00
    minus 5.94 is equal to 0, to 1, and to
  • 00:13:06
    minus 1 and these equations give us
  • 00:13:10
    three lines the one in the middle is
  • 00:13:11
    still the same one but the two on the
  • 00:13:14
    sides are actually just a little spread
  • 00:13:16
    apart so we're just gonna add
  • 00:13:19
    that step to the perceptron algorithm
  • 00:13:21
    and that's gonna spread our lines apart
  • 00:13:23
    a little bit every time we iterate so now
  • 00:13:27
    we're ready to formulate the SVM
  • 00:13:29
    algorithm and it's gonna be very similar
  • 00:13:31
    to the perceptron algorithm step one is
  • 00:13:33
    we're gonna start with a random line and
  • 00:13:35
    two equidistant parallel lines to it and
  • 00:13:37
    I'm gonna color them red and blue just
  • 00:13:39
    to emphasize which side of the line is
  • 00:13:41
    red and which side of the line is blue
  • 00:13:42
    in order to see which points we have
  • 00:13:44
    correctly or incorrectly classified now
  • 00:13:47
    step two is gonna be pick a large number
  • 00:13:49
    on the number of repetitions or epochs
  • 00:13:51
    the number of times we're gonna iterate
  • 00:13:52
    this algorithm step three is gonna be
  • 00:13:55
    pick a number close to 1 so the
  • 00:13:57
    expanding factor and we saw it's gonna
  • 00:13:59
    be 0.99 I can pick anything but that's
  • 00:14:02
    the one I'm gonna pick close to one step
  • 00:14:04
    four is now the loop so repeat a
  • 00:14:06
    thousand times pick a random point and if
  • 00:14:08
    the point is correctly classified for
  • 00:14:11
    example this one says I'm good then we
  • 00:14:13
    do nothing if it's incorrectly
  • 00:14:15
    classified then for example like this
  • 00:14:18
    one which is a blue point in the red
  • 00:14:20
    area says get over here so we move the
  • 00:14:22
    line towards a point so we learned in
  • 00:14:25
    the previous video how to move a line
  • 00:14:26
    towards a point like this and then we're
  • 00:14:31
    gonna do the extra step which is
  • 00:14:33
    separate the lines using the expanding
  • 00:14:35
    factor so we're gonna do separate the
  • 00:14:37
    lines a little bit and we're just gonna
  • 00:14:40
    repeat these many many many times
  • 00:14:41
    thousand times until we get a pretty
  • 00:14:44
    good result and then we enjoy the lines
  • 00:14:46
    that separate the data best so notice
  • 00:14:48
    that the two steps that we've added is
  • 00:14:50
    this step three pick a number the
  • 00:14:52
    expanding factor close to one and the
  • 00:14:55
    one where we separate the lines using
  • 00:14:57
    the expanding factor the rest is pretty
  • 00:14:59
    much the same thing as the perceptron
  • 00:15:01
    algorithm so now just for full
  • 00:15:06
    disclosure if you want to code this like
  • 00:15:07
    this is actually the perceptron
  • 00:15:09
    algorithm that we saw in the previous
  • 00:15:10
    video where we added step four is the
  • 00:15:14
    the mathematical step where we check if
  • 00:15:17
    something is in the blue and red area by
  • 00:15:19
    checking if the equation applied on the
  • 00:15:22
    point comes out bigger than 0 or less than
  • 00:15:25
    0 so we update the values of a B and C
  • 00:15:28
    accordingly by adding the learning rate
  • 00:15:32
    times
  • 00:15:33
    the coordinates of the point so the SVM
  • 00:15:35
    algorithm is actually very similar what
  • 00:15:37
    we do is we start with a random line of
  • 00:15:39
    equation ax plus by plus c equals zero
  • 00:15:42
    and we draw the parallel lines with
  • 00:15:44
    equations ax plus by plus c equals one
  • 00:15:47
    and minus one then we pick a large
  • 00:15:49
    number the number of epochs which is
  • 00:15:51
    gonna be a thousand then we pick a
  • 00:15:52
    learning rate which is gonna be zero
  • 00:15:54
    point zero one we saw it in the logistic
  • 00:15:56
    regression video then we pick an
  • 00:15:58
    expanding rate which is gonna be 0.99
  • 00:16:01
    it's a number close to one and then the
  • 00:16:04
    loop step is repeated thousand times
  • 00:16:05
    pick a random point and if the point is
  • 00:16:08
    correctly classified we do nothing if
  • 00:16:10
    the point is blue in the red area then
  • 00:16:12
    we update the values of a B and C
  • 00:16:15
    accordingly if the point is red in the
  • 00:16:17
    blue area we update the values in a
  • 00:16:19
    different way and then it's a final step
  • 00:16:21
    we multiply the values a B and C by 0.99
  • 00:16:27
    which is the expanding step and again
  • 00:16:30
    the two new steps are step three and the
  • 00:16:33
    expanding step so that's it that's the
  • 00:16:35
    SVM training algorithm I encourage you
  • 00:16:37
    to code it and see how it does try
  • 00:16:40
    different values for number of epochs
  • 00:16:42
    learning rate expanding rate etc and let
  • 00:16:46
    me know how it went in the comments so
  • 00:16:48
    that's the SVM algorithm as I said I
  • 00:16:50
    encourage you to code it take a look at
  • 00:16:52
    in some datasets and see how it goes
  • 00:16:54
    however this comes out of somewhere this
  • 00:16:57
    comes out of an error function
  • 00:17:00
    development with gradient descent so now
  • 00:17:03
    I'm gonna show you what the error
  • 00:17:04
    function is and it's very similar to the
  • 00:17:06
    perceptron algorithm where we had a
  • 00:17:07
    classification error based on how far
  • 00:17:09
    the points are from the boundary however
  • 00:17:12
    now we're gonna have a another thing
  • 00:17:15
    that adds to the error which is based on
  • 00:17:17
    how far away these two lines are so let
  • 00:17:20
    me show you so to start with the error functions
  • 00:17:23
    let me first ask you a question here we
  • 00:17:24
    have the same data set twice and I'm
  • 00:17:27
    gonna show you two support vector
  • 00:17:29
    machines that classified the first one
  • 00:17:31
    is this one and the second one is this
  • 00:17:34
    one now the question is which one do you
  • 00:17:37
    think is better I feel free to pause the
  • 00:17:40
    video and think about it so notice that
  • 00:17:42
    the model on the left has one problem
  • 00:17:44
    which is that it misclassifies
  • 00:17:46
    a point however it's good because it's got
  • 00:17:49
    the lines pretty wide apart the model on
  • 00:17:53
    the right is great at
  • 00:17:55
    classification because it classifies
  • 00:17:57
    every point correctly however the lines
  • 00:17:59
    are very close together so the question
  • 00:18:03
    is which one is better and the answer is
  • 00:18:05
    we don't really know it depends on our
  • 00:18:08
    data it depends on our model it depends
  • 00:18:10
    on the scenario but with error functions
  • 00:18:13
    we can actually have an approach to
  • 00:18:15
    maybe analyze what exactly do we want so
  • 00:18:18
    let's recall what happened with the
  • 00:18:20
    perceptron error so we here we have some
  • 00:18:22
    points and a perceptron model that
  • 00:18:25
    separates them now this will make some
  • 00:18:27
    mistakes right it makes these two
  • 00:18:30
    because these two are blue points in the
  • 00:18:33
    red area and makes these two because
  • 00:18:35
    these are red points in the blue area so
  • 00:18:38
    the question is how do we measure the
  • 00:18:40
    error or how bad this model is and the
  • 00:18:44
    rationale is if a point is on the
  • 00:18:46
    correct side then this error is zero if
  • 00:18:49
    a point is on the wrong side then the
  • 00:18:51
    error can change if a point is close to
  • 00:18:55
    the boundary then the error is small and
  • 00:18:56
    if it's far from the boundary then the
  • 00:18:57
    error is huge because if you're for
  • 00:19:00
    example a blue point and you're close to
  • 00:19:02
    the blue area but still in the red area
  • 00:19:04
    you have a small error but if you're well
  • 00:19:05
    into the red area then you generate a
  • 00:19:07
    lot of error because that model is very
  • 00:19:09
    wrong on that point so what you want is
  • 00:19:12
    the distance or not exactly the distance
  • 00:19:14
    but something proportional to this
  • 00:19:15
    distance and the same here so we're
  • 00:19:18
    gonna add a number of proportional to
  • 00:19:21
    these distances and that's gonna be the
  • 00:19:22
    perceptron error so for SVM is gonna be
  • 00:19:25
    similar we're gonna have our lines and
  • 00:19:27
    now we're just gonna have two
  • 00:19:30
    classification errors coming from
  • 00:19:31
    different places so what we're gonna
  • 00:19:33
    have is a red one so our red area now
  • 00:19:36
    doesn't start from the middle but it
  • 00:19:38
    starts from the bottom line and every
  • 00:19:43
    point above this line that is blue it's
  • 00:19:47
    automatically misclassified so these three
  • 00:19:49
    are misclassified and the error is
  • 00:19:51
    precisely the distance from the bottom
  • 00:19:53
    line and that simple so notice that this
  • 00:19:57
    blue point
  • 00:19:59
    that is close to the bottom line is
  • 00:20:00
    actually misclassified even though it
  • 00:20:02
    was correctly classified in the perceptron
  • 00:20:04
    algorithm but that is okay it's a harsh error
  • 00:20:07
    function now the blue error comes from
  • 00:20:12
    the line in the top so it comes from
  • 00:20:14
    here now every red point underneath this
  • 00:20:16
    top line is gonna be misclassified and
  • 00:20:19
    its error is gonna be similar to the
  • 00:20:22
    perceptron error it's gonna be
  • 00:20:23
    proportional to this distance over here
  • 00:20:25
    so we're adding all those distances and
  • 00:20:28
    that's our error so those two errors
  • 00:20:30
    form the classification error now we
  • 00:20:32
    have something called the margin error
  • 00:20:33
    and the margin error is simply something
  • 00:20:37
    that tells us if these two lines are
  • 00:20:40
    close by or far apart I'm gonna be a
  • 00:20:43
    little more specific later but it's
  • 00:20:45
    basically a number that is gonna be big
  • 00:20:47
    if the lines are close together and
  • 00:20:49
    small if the lines are far apart
  • 00:20:51
    because it's an error so the better
  • 00:20:54
    model the smaller the error and the
  • 00:20:56
    better our model the wider our lines are
  • 00:20:59
    so let's actually look a little bit more
  • 00:21:02
    at the margin error here so we
  • 00:21:04
    have our data set and our data set again
  • 00:21:07
    and two models so this one has the lines
  • 00:21:12
    pretty far apart therefore it has a
  • 00:21:14
    large margin so it's gonna have a small
  • 00:21:16
    margin error and this one over here the
  • 00:21:19
    lines are pretty close so it's got a
  • 00:21:21
    small margin therefore it has a large
  • 00:21:23
    margin error and just to show the contrast
  • 00:21:25
    notice that this model on the right
  • 00:21:28
    has a small classification error and
  • 00:21:30
    this model on the left has a large
  • 00:21:32
    classification error because the model on the
  • 00:21:33
    right classifies all the points
  • 00:21:35
    correctly and the one on the left
  • 00:21:36
    classifies one point incorrectly
  • 00:21:39
    but let's get back to our margin error
  • 00:21:42
    so we have our three lines and let's
  • 00:21:44
    recall the equations of the lines are
  • 00:21:46
    something along the lines of ax plus by
  • 00:21:49
    plus c equals 1 and ax plus by plus c
  • 00:21:52
    equals minus 1 so now what we're gonna do is
  • 00:21:55
    calculate the distance so that I'm gonna
  • 00:21:58
    leave as a challenge for you to do some
  • 00:22:01
    math and show that this is actually 2
  • 00:22:03
    divided by the square root of a squared
  • 00:22:05
    plus B squared so I challenge you to to
  • 00:22:09
    prove this what you have to do is play
  • 00:22:11
    with linear equations
  • 00:22:12
    and Pythagorean theorem and so now the
  • 00:22:15
    question is what can our error be so
  • 00:22:19
    let's think about it we need a number
  • 00:22:21
    that is big if the distance is small and
  • 00:22:25
    a number that is small if the distance
  • 00:22:27
    is big so what can our error be feel
  • 00:22:30
    free to think about it the hint is look
  • 00:22:33
    at the denominator right the bigger a
  • 00:22:36
    square plus B Square is the smaller this
  • 00:22:39
    number is and vice versa so what about
  • 00:22:42
    just taking the margin error to be this
  • 00:22:43
    a squared plus B squared notice that if
  • 00:22:47
    this number is large that means the
  • 00:22:49
    distance is small and vice versa
  • 00:22:52
    so if we let our margin error just be
  • 00:22:55
    that sum of squares that works that
  • 00:22:58
    actually measures how far apart the
  • 00:23:02
    lines are in the opposite way so if the
  • 00:23:05
    lines are close the error is big if the
  • 00:23:07
    lines are far the error is small
  • 00:23:09
    that looks familiar shouldn't be a
  • 00:23:12
    surprise it's actually the
  • 00:23:13
    regularization term if you've seen l2
  • 00:23:16
    regularization so now we can summarize
  • 00:23:18
    what the SVM error is here it is we have
  • 00:23:21
    our data set and our model and the error
  • 00:23:24
    basically splits in three first is the
  • 00:23:27
    blue classification error which is
  • 00:23:29
    basically measures all the red points
  • 00:23:32
    that are in the blue side then we have
  • 00:23:34
    the red classification error which
  • 00:23:36
    measures all the blue points that are in
  • 00:23:37
    the red side and then we have the margin
  • 00:23:40
    error which measures how far apart the
  • 00:23:43
    lines are so the red and the blue get
  • 00:23:46
    together to form the total
  • 00:23:47
    classification error which tells us how
  • 00:23:50
    many points are misclassified and how
  • 00:23:52
    badly they are misclassified and then
  • 00:23:54
    the margin error that tells us if the
  • 00:23:56
    lines are far apart or close by and
  • 00:23:59
    these two get together to form the total
  • 00:24:02
    SVM error so that is the error in a
  • 00:24:06
    support vector machine the one that
  • 00:24:08
    we're supposed to minimize and the gradient
  • 00:24:10
    descent step is very similar what it does
  • 00:24:13
    is actually the same thing as the SVM
  • 00:24:14
    trick what it does is here we have our
  • 00:24:17
    data and here we have a model and this
  • 00:24:19
    model is pretty bad notice that the
  • 00:24:20
    lines are pretty narrow and
  • 00:24:23
    misclassifies a bunch of the points so
  • 00:24:26
    this is a bad SVM it's got a large
  • 00:24:29
    error both in the classification sense
  • 00:24:31
    and in the margin sense and what we want to do
  • 00:24:33
    is using calculus or using gradient
  • 00:24:36
    descent we minimize this error in order
  • 00:24:39
    to get to a good place a good SVM that
  • 00:24:42
    has a good boundary the lines are far
  • 00:24:45
    apart and it actually classifies most of
  • 00:24:47
    the points correctly so in the same way
  • 00:24:50
    that we did with the perceptron
  • 00:24:52
    algorithm this gradient descent process
  • 00:24:55
    takes us from a large error to a small
  • 00:24:57
    error and this actually is exact same
  • 00:25:00
    thing as the SVM trick that I show you
  • 00:25:03
    recently of moving the line closer to
  • 00:25:06
    the points plus separating the lines a
  • 00:25:10
    tiny little bit so now I have a
  • 00:25:12
    challenge for you and the challenge is
  • 00:25:15
    simply to convince yourself that the
  • 00:25:19
    expanding step actually comes out of
  • 00:25:22
    gradient descent so take a look at this
  • 00:25:24
    we have our lines with the equations ax
  • 00:25:26
    plus by plus c equals 1 and ax plus by
  • 00:25:28
    plus C equals -1 and we have the margin
  • 00:25:32
    over here and the margin error which is
  • 00:25:34
    a square plus B Square so if you're
  • 00:25:36
    familiar with gradient descent what
  • 00:25:38
    happens is that we want to take a step
  • 00:25:41
    in the direction of the negative of the
  • 00:25:44
    gradient so the gradient is the
  • 00:25:45
    derivative of the margin error with
  • 00:25:48
    respect to the two parameters a and B
  • 00:25:50
    this is a very simple gradient because
  • 00:25:53
    it's the derivative with respect to a
  • 00:25:55
    is simply 2a because for a square plus B
  • 00:25:59
    square the derivative with respect to a is 2a and
  • 00:26:01
    with respect to B is 2b therefore our
  • 00:26:05
    gradient descent step takes a and sends
  • 00:26:08
    it to a minus the learning rate eta
  • 00:26:12
    times 2a which is the derivative and
  • 00:26:15
    that's the same thing with B it turns it
  • 00:26:18
    into B minus eta times 2b now we can
  • 00:26:22
    factor this as a times 1 minus 2 eta and
  • 00:26:26
    the bottom one we can factor it as B
  • 00:26:29
    times 1 minus 2 eta but notice something
  • 00:26:33
    here notice this number over here
  • 00:26:37
    this is exactly the expanding factor
  • 00:26:39
    because what we're doing is multiplying
  • 00:26:41
    a by a number that is close to one
  • 00:26:45
    remember that we multiplied a and b by 0.99
  • 00:26:50
    this one here is the 0.99 because if we
  • 00:26:52
    take eta to be a small number then
  • 00:26:55
    we're multiplying a by a number that is
  • 00:26:59
    very very close to one because if it is
  • 00:27:02
    small then one minus two eta is very
  • 00:27:05
    close to one so that is exactly the
  • 00:27:07
    expanding step so the expanding step is
  • 00:27:10
    coming from gradient descent and using
  • 00:27:15
    the regularization step
  • 00:27:16
    anyway the challenge is to formalize
  • 00:27:19
    this and to and to really convince
  • 00:27:21
    yourself that this is the case so now
  • 00:27:24
    let's go back a little bit and remember
  • 00:27:26
    these two models because we never really
  • 00:27:30
    answered a question of which one is
  • 00:27:31
    better remember that the one on the Left
  • 00:27:35
    misclassifies one blue point and the one
  • 00:27:37
    on the right just has a very very short
  • 00:27:40
    distance between the lines so they're
  • 00:27:42
    both good and bad in some way so let's
  • 00:27:45
    really study them the one on the left
  • 00:27:47
    has a large classification error because
  • 00:27:49
    it makes one mistake and a small margin
  • 00:27:52
    error because the lines are pretty far
  • 00:27:54
    apart and the one on the right has a
  • 00:27:56
    small classification error because it
  • 00:27:58
    classifies every point correctly and a
  • 00:28:00
    very large margin error because the
  • 00:28:02
    lines are too close by so again which
  • 00:28:05
    one to pick depends on us it depends on
  • 00:28:08
    what we want from the algorithm however
  • 00:28:10
    we need to pass this information to the
  • 00:28:12
    computer so we need to we need to pass
  • 00:28:14
    information of which one do we care more
  • 00:28:17
    about the classification error or the
  • 00:28:19
    margin error and the way to pass this
  • 00:28:22
    information to the computer is using a
  • 00:28:23
    parameter or a hyper parameter this
  • 00:28:26
    one we're gonna call the C parameter
  • 00:28:28
    so recall that the error here is the
  • 00:28:31
    classification error plus the margin
  • 00:28:33
    error so we're just gonna take a number
  • 00:28:36
    C and attach it to the classification
  • 00:28:39
    error and so now our error is not the
  • 00:28:42
    sum but a weighted sum where one of
  • 00:28:44
    those weighted by C
  • 00:28:46
    now what happens over here well recall
  • 00:28:49
    our error is the C times the
  • 00:28:52
    classification error plus the margin
  • 00:28:54
    error so what happens if we have a small
  • 00:28:55
    value of C if we have a small value of C
  • 00:28:58
    then the classification error gets
  • 00:29:00
    multiplied by a very small number so
  • 00:29:02
    it's all of a sudden is less important
  • 00:29:04
    and then the margin error is an
  • 00:29:06
    important one so we are really training
  • 00:29:08
    an algorithm to focus a lot more on the
  • 00:29:10
    margin error so we end up with a good
  • 00:29:14
    margin and maybe a bad classification so
  • 00:29:17
    we end up with the model on the left
  • 00:29:20
    however if we have a large value of C
  • 00:29:24
    then C is attached to the classification
  • 00:29:27
    error so this means that the
  • 00:29:28
    classification error ends up being a lot
  • 00:29:30
    more important and the margin error ends up being
  • 00:29:32
    a little less important if C is large so
  • 00:29:35
    therefore the model with a large C
  • 00:29:39
    focuses more on classification because
  • 00:29:42
    it tries to minimize the classification
  • 00:29:44
    error more than it tries to minimize the
  • 00:29:46
    margin error so we end up with a model
  • 00:29:48
    like the one in the right which is good
  • 00:29:51
    for classification bad for margin and so
  • 00:29:54
    again we decide this parameter ourselves
  • 00:29:57
    what we really do in real life is try a
  • 00:29:59
    bunch of different ones and see which
  • 00:30:01
    algorithm did better but but it's good
  • 00:30:02
    to know that we have certain control
  • 00:30:05
    over this training and these these are
  • 00:30:06
    called hyper parameters every every
  • 00:30:08
    machine learning algorithm has a bunch
  • 00:30:10
    of hyper parameters that one can tune to
  • 00:30:12
    decide what we want so that's all folks
  • 00:30:17
    thank you very much for your attention I
  • 00:30:19
    remind you that this is the last of a
  • 00:30:21
    series of three videos on linear models
  • 00:30:23
    linear regression logistic regression
  • 00:30:25
    and support vector machines so I hope
  • 00:30:27
    you enjoyed this as much as I enjoyed it
  • 00:30:29
    thank you remember to subscribe if you
  • 00:30:34
    want to get notifications of more videos
  • 00:30:36
    coming if you liked it please hit like
  • 00:30:39
    share it with your friends or comment I
  • 00:30:41
    love reading the comments I read them
  • 00:30:43
    all if you have suggestions on what
  • 00:30:46
    other videos to make I'd love to hear them
  • 00:30:48
    and if you want to tweet at me this is
  • 00:30:51
    my Twitter handle Luis Likes Math thank
  • 00:30:54
    you very much and see you in the next
  • 00:30:56
    video
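
A note on the challenge posed around 00:22, that the distance between the two parallel lines ax + by + c = 1 and ax + by + c = -1 is 2 divided by the square root of a^2 + b^2. One way to work it out, using the point-to-line distance formula:

    d\big((x_0, y_0),\ ax + by + c = 0\big) = \frac{|a x_0 + b y_0 + c|}{\sqrt{a^2 + b^2}}

Take any point (x_0, y_0) on the lower line, so that a x_0 + b y_0 + c = -1. Its perpendicular distance to the upper line, rewritten as ax + by + (c - 1) = 0, is then

    \frac{|a x_0 + b y_0 + (c - 1)|}{\sqrt{a^2 + b^2}} = \frac{|-1 - 1|}{\sqrt{a^2 + b^2}} = \frac{2}{\sqrt{a^2 + b^2}}

which is the width of the margin.
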
Tags
  • SVM
  • Support Vector Machines
  • Machine Learning
  • Classification Algorithms
  • Linear Models
  • Gradient Descent
  • Perceptron
  • C Parameter
  • Margin Error
  • Hyperparameters