Hey everyone, how's it going? It's going to be a pretty brief video today. We're going to be talking about the role of outliers in machine learning algorithms, and then also about the ways people typically deal with outliers, as well as some of the shortcomings of those methods.

In general I think outliers are a pretty interesting topic for two main reasons. One is that even if you don't study stats, the word "outlier" has become commonplace in the media and in everyday speech; for example, we'll say "oh, that baseball game was an outlier," and we just mean it was different from a typical baseball game. The other reason I think it's interesting is that even with all the statistical tools we have, there's no single set way to deal with outliers. It really depends on the problem, and different people will take different approaches, so it highlights the fact that math is not a yes-or-no, set-in-stone kind of process; it really depends on the situation.

We'll go about this video pretty simply: we'll go through four very popular machine learning algorithms and talk about the impact that a couple of outliers can have on the results, and then, as I said, we'll talk about some common ways people deal with outliers. By the way, I got this really cool lobster hat for Christmas; hope you like it.
So let's begin by visiting our very first friend in machine learning and stats, which was linear regression. If you remember linear regression, you just have an x and a y variable, keeping it real simple, and you draw a line of best fit through all of your data points. Let's say all these black x's were your initial data points and you drew this green line of best fit through them; it's a pretty good fit, no real issues there.

Now here's the problem: let's say you introduce a couple of outliers. These red x's down here, with the exclamation points next to them, are outliers because they are very different from the typical black x's we have up here. As you might have learned, the line linear regression draws is heavily affected by outliers. What's going to happen is that the slope of the line, beta 1 hat in the formula over here, shifts down, so the new line we get is this red line. The biggest issue with the red line is what it's trying to do: it's trying to compromise between the original data and these outliers, but in doing so it doesn't really capture either one very well. So we definitely have an issue with outliers in linear regression.
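To make this concrete, here is a minimal sketch in Python of the slope shifting; the data values are made up, and the use of NumPy's ordinary least squares fit via np.polyfit is my own illustration, not something from the video:

```python
import numpy as np

# Clean data: y roughly follows y = 2x + 1
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 2 * x + 1 + rng.normal(0, 0.5, size=x.size)

# A couple of outliers far below the trend, at large x
x_out = np.append(x, [9.0, 9.5])
y_out = np.append(y, [-15.0, -18.0])

# Ordinary least squares minimizes squared error, so a few extreme
# points pull the fitted slope (beta 1 hat) down noticeably
slope_clean, intercept_clean = np.polyfit(x, y, deg=1)
slope_out, intercept_out = np.polyfit(x_out, y_out, deg=1)

print(f"slope without outliers: {slope_clean:.2f}")  # close to 2
print(f"slope with outliers:    {slope_out:.2f}")    # noticeably smaller
```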
It turns out we can frame logistic regression, which might have been the first classification technique you learned, in exactly the same way. Just a quick recap: logistic regression models the logit of the probability that an example belongs to class 1 with the same form, beta naught plus beta 1 x, just like we had up here. So let's say our initial data is these black x's, which are all class 0, and these black x's, which are all class 1, and we're asked to draw the sigmoid that predicts the probability of each example being in class 1. Without the presence of outliers this would be pretty simple: we just draw this black sigmoid so that all of these get correctly classified as class 1, because their probabilities are above 0.5, and all of these get correctly classified as class 0, because their probabilities are below 0.5.

Now again we introduce just a couple of outliers: these three red x's. I've drawn this red arrow to indicate that they are way over here in the x direction, so they are far to the right with very large values. Let's first think mathematically about what happens to the sigmoid. The same thing happens as before, because we are again modeling the logit as beta naught plus beta 1 x, so beta 1 hat again goes down. The effect a smaller beta 1 hat has on the sigmoid is that it flattens and stretches out, so it now looks like this red sigmoid. More intuitively, because these three red x's have very large values of x yet are labeled class 0, the sigmoid tries its best to fold them into class 0, which it attempts by stretching out so far that they land on the lower part of the curve. But in trying to do that, it largely ruins the fit on the rest of our data. For example, if you look at this sigmoid now, everything below 0.5 is still correctly classified, but above 0.5 we get these three correct while these three or four x's get predicted as class 0 even though they're actually class 1. And maybe the worst part is that although the model tries really hard to push those three outliers into class 0, it never succeeds: they still get classified as class 1. So by accommodating the outliers, logistic regression ends up making a bunch of extra mistakes. Again, a problematic scenario.
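As a rough sketch of the same effect in code, here is scikit-learn's LogisticRegression fit on made-up one-dimensional data; the numbers, and the fact that I leave the default regularization in place, are my assumptions, so treat the exact coefficients as illustrative only:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up 1-D data: small x -> class 0, larger x -> class 1
X = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# Three outliers: very large x values that are nevertheless labeled class 0
X_out = np.vstack([X, [[60.0], [65.0], [70.0]]])
y_out = np.append(y, [0, 0, 0])

clean = LogisticRegression().fit(X, y)
dirty = LogisticRegression().fit(X_out, y_out)

# The coefficient plays the role of beta 1 hat: it shrinks once the
# outliers are added, which corresponds to the stretched-out sigmoid
print("beta 1 without outliers:", clean.coef_[0][0])
print("beta 1 with outliers:   ", dirty.coef_[0][0])

# Check whether the flattened sigmoid even manages to call the outliers class 0
print("outlier predictions:", dirty.predict([[60.0], [65.0], [70.0]]))
```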
Let's look at k-nearest neighbors, another friendly face. With k-nearest neighbors, if we have two nice-looking clouds of data, the blue triangles and the green circles, we can draw a pretty nice-looking decision boundary. Any point on this side of the boundary, above it, gets classified as a green circle, because if we use, say, k equals 3 and ask who its three closest neighbors are, they're always going to be circles on this side, and always triangles on the other side. No issues there.

Let's see how the story changes if we add just two extra green circles in the wrong place, so that they're outliers. Here we've put the two green circles in the main pack of blue triangles, and the decision boundary changes in the following way: this whole area is unaffected, and this whole area is unaffected, but around where we introduced the outliers we get a very funky-looking decision boundary. The reason is that if you try to predict a new data point here, where the x is, and ask who its three closest neighbors are, they're going to be these two outliers plus this one blue triangle. The majority class is green circle, so this whole region of the decision space gets allocated to the circles as well. So in k-nearest neighbors, introducing just a couple of outliers near a different class can have a big impact on the decision boundary.
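Here is a small sketch of that behavior with scikit-learn's KNeighborsClassifier; the two synthetic clouds, the two mislabeled circles, and the query point are all made up for illustration:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two clouds: "triangles" (class 0) near (0, 0), "circles" (class 1) near (5, 5)
rng = np.random.default_rng(1)
triangles = rng.normal([0.0, 0.0], 0.5, size=(20, 2))
circles = rng.normal([5.0, 5.0], 0.5, size=(20, 2))

X = np.vstack([triangles, circles])
y = np.array([0] * 20 + [1] * 20)

# Two outlier circles dropped into the middle of the triangle cloud
X_out = np.vstack([X, [[0.2, 0.1], [0.0, 0.3]]])
y_out = np.append(y, [1, 1])

query = np.array([[0.1, 0.2]])  # a new point sitting right next to the outliers

knn_clean = KNeighborsClassifier(n_neighbors=3).fit(X, y)
knn_dirty = KNeighborsClassifier(n_neighbors=3).fit(X_out, y_out)

print("prediction without outliers:", knn_clean.predict(query))  # [0] -> triangle
print("prediction with outliers:   ", knn_dirty.predict(query))  # likely [1] -> circle
```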
So far we've talked about machine learning methods that are heavily impacted by outliers; now let's talk about one that is not so affected, our old friend the decision tree. We have a decision tree here splitting on a single variable. In general, low values of this variable correspond to the triangles and higher values correspond to the circles, but there are two outliers where the variable is very high yet the points are labeled triangles.

If we recall how decision trees work, the tree scans this entire variable's range and picks a split such that one side of the split has mostly triangles and the other side has mostly circles. Let's pretend it first chooses this split here, the black line I've drawn. On the left-hand side it's getting 100% correct, because it says those are triangles and they are indeed triangles. On the right-hand side it's getting most of them correct, but it's misclassifying the two outliers. The natural question is: is there a different split that would give an even better outcome? The answer is no. For example, hypothetically, what if it split here instead? If we entertain the idea that the decision tree could be swayed by outliers, we might expect the boundary to get pulled in that direction, so let's think about whether that actually makes sense in the context of decision trees. If this is our split and we say everything on the left-hand side is a triangle, we're still getting all of those correct, but now we make an extra mistake on this green circle. And if we say everything on the right-hand side is a circle, we're still getting these three circles correct, but we're still getting those two outlier triangles wrong. So all we've done by moving the split is introduce one more mistake, which means a decision tree would never actually choose that split.

So even if you have outliers like these two triangles, no matter how far out they sit in that direction, it doesn't matter, whereas in logistic regression, the further those outliers sat in that direction, the more the sigmoid got stretched out and the more mistakes we made. That's why, when you hear that decision trees, and everything built from them like random forests, bagging, and boosting, are resilient or robust to outliers, this is the kind of behavior people are talking about.
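To see this numerically, here is a sketch with a depth-one DecisionTreeClassifier (a stump) on made-up one-dimensional data; the values are my own, and the point is simply that the chosen split threshold does not move when the two outliers are pushed much further out:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# 1-D data: low x -> triangles (class 0), higher x -> circles (class 1),
# plus two outlier triangles with very large x
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 13], dtype=float).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# Same data, but the two outliers pushed ten times further out
x_far = x.copy()
x_far[-2:] = [[120.0], [130.0]]

stump = DecisionTreeClassifier(max_depth=1).fit(x, y)
stump_far = DecisionTreeClassifier(max_depth=1).fit(x_far, y)

# The split only depends on how the points are ordered and counted on each
# side, not on how extreme the outliers are, so the threshold stays near 6.5
print("threshold, outliers at 12/13:  ", stump.tree_.threshold[0])
print("threshold, outliers at 120/130:", stump_far.tree_.threshold[0])
```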
Now, to close out the video, let's talk about two very common strategies people use to deal with outliers, the pros and cons of each, and finally the general downside of doing anything automatic to your outliers at all.

The first strategy is called trimming, and it's probably the one you're more familiar with. Let's say this is our data: some variable, and you're looking at a histogram of it. Trimming operates under the assumption that any very low values of the variable, or any abnormally high values, should be deleted. For example, if we choose our thresholds as the 5th and 95th percentiles, anything below the 5th and anything above the 95th just gets thrown away. A natural question is what that does to the histogram. We go from this histogram to this one: the tails have been chopped off, and the rest of the distribution gets raised slightly. An intuitive way to think about it is that we take the probability mass in the tails away, but the curve still has to integrate to one, to 100% probability, so the mass we just deleted gets reallocated to the rest of the curve, and the rest of the curve shifts up. That is trimming. The downside, as you've probably noticed, is that we are literally throwing away data, and in cases where you don't have a ton of data to begin with, that can be a problem.
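A minimal sketch of trimming with NumPy, assuming a made-up normally distributed variable and the 5th/95th percentile thresholds used as the example in the video:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(50, 10, size=1000)  # some made-up variable

low, high = np.percentile(x, [5, 95])

# Trimming: delete every observation outside the 5th-95th percentile range
x_trimmed = x[(x >= low) & (x <= high)]

print("observations before trimming:", x.size)          # 1000
print("observations after trimming: ", x_trimmed.size)  # roughly 900
```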
That's where the second strategy comes in; it's related, but has a very different final step. It's called winsorizing, named after the statistician Charles Winsor. The first part is the same: we still pick low and high thresholds, say the 5th and 95th percentiles again. The big difference is that we don't delete the data beyond the thresholds. Instead, we take everything below the 5th percentile and set it equal to the 5th percentile. The intuition is that anything below the 5th percentile is in some sense abnormal or unexpected, so we take all of those values and set them to the most extreme value that does exist in the data set and that we still consider normal, which is the 5th percentile. We do the same thing on the other side: everything above the 95th percentile gets set equal to the 95th percentile.

Now let's ask the same question: what does that do to the histogram? We haven't deleted any data in this case, so we still have the same number of observations. The only change is that the bars at the 5th and 95th percentile values get boosted, because we now have many more observations at exactly those values. The advantage here is that we're not throwing away data like we do in trimming, but the disadvantage is that we can end up with a lot of samples that are exactly the same, exactly the 5th or exactly the 95th percentile, so we may be artificially reducing the variance of our data. Those are the trade-offs between the two methods.
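And the analogous sketch for winsorizing; here np.clip does the clamping (scipy also ships a winsorize helper in scipy.stats.mstats, but plain clipping keeps the idea visible), again on made-up data:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(50, 10, size=1000)

low, high = np.percentile(x, [5, 95])

# Winsorizing: clamp the tails to the thresholds instead of deleting them
x_winsorized = np.clip(x, low, high)

print("observations kept:", x_winsorized.size)  # still 1000
print("std before:", round(x.std(), 2), " std after:", round(x_winsorized.std(), 2))
```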
Now, just to close the video, I want to talk about the general downside of applying any kind of default method to your outliers. There are a lot of programming languages out there and a lot of packages for dealing with outliers: you can do trimming, you can do winsorizing, and you can probably do even more complex things to take your outliers and transform them into something else. But you have to stop and think. Any time you take an outlier and transform it into something else, you are inherently assuming that you're never going to see that type of outlier again, for example in your testing data. If we think about it for a second, we're saying these values are abnormal, we probably won't see them again, they're just a one-off thing, so we'll do something reasonable to them. But that may not be the case. If you look at your testing data, the things you're actually trying to predict, you could still see outliers just like these, and if you haven't done anything to address the root cause or mechanism that produced them, you're never going to get those cases right in the testing data, because you simply haven't built anything to deal with them. Even worse, you're probably going to get them very wrong, because in the training data you treated them as regular examples instead of anything special, so you're likely to make big errors on your testing data.
A lot of times students come to me and ask what the right way to deal with outliers is, and they're usually trying to choose between some of these out-of-the-box techniques. I would say that none of these is the right way to deal with outliers. All of these out-of-the-box techniques are the fast, easy way to deal with outliers if you want to make quick progress on your project. The right way to deal with outliers is to stop, take some time, and think about what mechanism produced them. Could that mechanism still exist in the testing data? If so, we should do something more intelligent: look at how the outliers differ from the rest of the data, and perhaps build a separate model or treat them in a different way.
So hopefully you learned about outliers in the context of machine learning: which models are, and which usually are not, affected by outliers; some out-of-the-box techniques for dealing with outliers and the pros and cons between them; and most importantly, the philosophy of what it means to do anything to an outlier at all, and what consequences that can have for your entire data project. Okay, if you enjoyed this video, please like and subscribe for more just like this, and I'll see you next time.