Outliers : Data Science Basics

00:13:06
https://www.youtube.com/watch?v=7KeITQajazo

Summary

TLDR: This video explores the impact of outliers on machine learning algorithms and discusses common methods for handling them. Outliers can heavily influence models such as linear and logistic regression, skewing the results. Decision trees turn out to be more robust to this problem. Strategies for handling outliers include trimming, which removes the extreme values, and winsorizing, which replaces them with threshold values. However, transforming outliers assumes they will not appear in future data, which is a problem when that assumption fails. The ideal approach is to analyze the mechanism behind the outliers and adapt the models accordingly.

Takeaways

  • 📊 Outliers can strongly influence machine learning algorithms such as linear regression and logistic regression.
  • 📉 Tree-based methods, such as random forests, are more robust to outliers.
  • 🔍 The term 'outlier' has become common even outside statistics, referring to atypical cases.
  • 🚫 Trimming removes the extreme values, but it can mean throwing away a significant amount of data.
  • 🔄 Winsorizing replaces the extreme values with the threshold values, preserving the amount of data but potentially reducing the variance.
  • 🤔 Transforming outliers assumes they will not reappear in the test data, which can be a risky assumption.
  • 🧠 Handling outliers should account for the underlying cause in order to choose a more appropriate solution.
  • ⚖️ Out-of-the-box methods for handling outliers enable quick progress but can compromise model accuracy.
  • 🔄 Understanding the mechanism behind the outliers may require building a dedicated model or a custom solution.
  • 📝 The philosophy around handling outliers includes potential consequences for the entire data project.

Timeline

  • 00:00:00 - 00:05:00

    In this video, the speaker discusses the role of outliers in machine learning algorithms and common methods for dealing with them. Outliers are interesting because the concept is familiar even outside statistics, yet there is no universal method for handling them. The speaker examines the impact of outliers on four popular machine learning algorithms, starting with linear regression and logistic regression, where outliers can skew the results. In linear regression the fitted line is pulled toward the outliers, while in logistic regression the sigmoid curve is stretched out, leading to incorrect predictions.

  • 00:05:00 - 00:13:06

    The speaker then turns to k-nearest neighbors, where just a few outliers can significantly change the decision boundary. In contrast, decision trees and their derivatives are less affected by outliers, since they choose splits based on the bulk of the data. The speaker also describes two strategies for handling outliers: trimming, which deletes the extreme data points, and winsorizing, which clips values to the nearest percentile thresholds without deleting any data. He stresses that any method applied to outliers rests on the assumption that they will not reappear in the test data, hence the importance of understanding the mechanism that produced them.


Video Q&A

  • What is an outlier?

    An outlier is a value or set of values that is markedly different from the rest of the data. The term is also used outside statistics to describe atypical cases.

  • Which machine learning algorithms are affected by outliers?

    Algorithms such as linear regression and logistic regression are strongly affected by outliers.

  • Which algorithms are robust to outliers?

    Decision trees and their derivatives, such as random forests, are generally robust to outliers.

  • What is trimming?

    Trimming is a method that removes the extreme values from the dataset.

  • What is winsorizing?

    Winsorizing replaces the extreme values with the threshold values of the distribution (for example, the 5th and 95th percentiles), so the original number of observations is preserved.

  • Why is it risky to transform outliers?

    Transforming outliers without understanding their underlying cause is risky because similar values can reappear in the test data and cause large errors.

  • What is the ideal approach to handling outliers?

    The ideal approach is to understand the mechanism that produces the outliers and adapt your methods accordingly, which improves model accuracy.

Subtitles (English)
  • 00:00:00
    [Music] Hey everyone, how's it going? It's going to be a pretty brief video today. We're going to be talking about the role of outliers in machine learning algorithms, and then also talk about ways that people typically deal with outliers, as well as some of the shortcomings of those methods.

    In general I think outliers are a pretty interesting topic for two main reasons. One is that even if you don't study stats, this word "outlier" has become kind of a commonplace word, even in the media and everyday speech. For example we'll say "oh, that baseball game was an outlier", and we just mean that it was different from a typical baseball game. The other reason I think it's interesting is that even with all these statistical tools we have, there's not a set way to deal with outliers; it really depends on the problem, and different people will take different approaches. So it really highlights this fact that math is not a yes-or-no, set-in-stone kind of process, it really depends on the situation.

    We'll go about this video pretty simply: we'll go through four different, very popular machine learning algorithms and talk about the impact that a couple of outliers can have on the results, and then, as I said, we'll talk about some common ways people deal with outliers. By the way, I got this really cool lobster hat for Christmas, hope you like it.
  • 00:01:06
    So let's begin by visiting our very first friend in machine learning and stats, which was linear regression. If you remember linear regression, you just have an x and a y variable, keeping it real simple, and you draw a line of best fit through all of your data points. So let's say all these black x's were your initial data points and you drew this green line of best fit through them; it's a pretty good fit, no real issues there.

    Now here's the problem: let's say you introduce a couple of outliers, so these red x's down here with the exclamation point next to them are outliers, because they are very different from the typical black x's we have up here. As you might have learned, in linear regression the line that we're going to draw now is very affected by outliers. What's going to happen is that the slope of the line, or beta 1 hat in this form over here, is going to shift down, so the new line we get is this red line. The biggest issue with this red line is what it's trying to do: it's trying to compromise between the original data and these outliers, but in doing so it doesn't really capture either one too well. So we definitely have an issue with outliers in linear regression.
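Not from the video, but a minimal NumPy sketch of that slope shift: fit ordinary least squares with and without a couple of extreme points and compare the estimated beta 1 hat (the made-up data, and therefore the exact numbers, are my own).

```python
import numpy as np

rng = np.random.default_rng(0)

# Clean data: y roughly follows y = 1 + 2x plus a little noise.
x = np.linspace(0, 10, 30)
y = 1 + 2 * x + rng.normal(scale=1.0, size=x.size)

# Two outliers: ordinary x values but wildly low y values,
# like the red x's sitting below the cloud in the video.
x_out = np.array([8.0, 9.0])
y_out = np.array([-30.0, -35.0])

# Ordinary least squares with and without the outliers.
slope_clean, _ = np.polyfit(x, y, deg=1)
slope_dirty, _ = np.polyfit(np.concatenate([x, x_out]),
                            np.concatenate([y, y_out]), deg=1)

print(f"slope without outliers: {slope_clean:.2f}")  # close to the true 2.0
print(f"slope with 2 outliers:  {slope_dirty:.2f}")  # pulled well below 2.0
```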
  • 00:02:11
    It turns out that we can also frame logistic regression, which might have been the first classification technique you learned, in the exact same way. Just a quick recap: it models the logit of the probability of any example being either 0 or 1 with the same form, beta naught plus beta 1 x, just like we had up here. So let's say our initial data is these black x's, which were all class 0, and these black x's, which are all class 1, and we're asked to draw the sigmoid that will predict the probability of each example being in class 1. Without the presence of outliers this would be pretty simple: we just draw this black sigmoid so that all of these get correctly classified as class 1, because their probabilities are above 0.5, and all these guys get correctly classified as class 0, because their probabilities are below 0.5.

    Now again we introduce just a couple of outliers. Here are some outliers, these three red x's, and I've drawn this red arrow to indicate that they are way over here in the x direction; they are far to the right and have very large values. Let's first think mathematically about what's going to happen to the sigmoid. Actually the same thing happens, because we are modeling it again as beta naught plus beta 1 x, so our beta 1 hat again goes down, and the impact of a lower beta 1 hat on the sigmoid is that it flattens out and stretches out, so it now looks like this red sigmoid here. More intuitively, you can see what's going on: because these three red x's have very large values of the x variable yet they are labeled class zero, the sigmoid tries its best to incorporate them into class zero, which it attempts by stretching out so much that they fall into the lower part of the sigmoid. But in trying to do so it completely destroys the rest of our data.

    For example, if you look at this sigmoid now, everything below 0.5 is still correctly classified, but above 0.5 we get these three correctly classified while these three or four x's here get predicted as class zero even though they're class one. And maybe the worst part is that although it tries really hard to incorporate the three outliers into class zero, it never achieves it; they still get classified as class one. So we get a bunch of mistakes in logistic regression by having these outliers. Again, a problematic scenario.
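Again not from the video, just a small sketch of the stretched sigmoid, assuming scikit-learn is available (the toy data and the weak-regularization setting C=1e6 are my own choices): adding a few huge-x points labeled class 0 shrinks the fitted beta 1, and the training fit is no longer perfect.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Clean 1-D data: class 0 at small x, class 1 at larger x.
X = np.array([1, 2, 3, 4, 6, 7, 8, 9], dtype=float).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Outliers: huge x values that are nevertheless labeled class 0.
X_all = np.vstack([X, [[50.0], [55.0], [60.0]]])
y_all = np.concatenate([y, [0, 0, 0]])

# Large C means almost no regularization, so the data drives the fit.
clean = LogisticRegression(C=1e6, max_iter=10_000).fit(X, y)
dirty = LogisticRegression(C=1e6, max_iter=10_000).fit(X_all, y_all)

print("beta_1 without outliers:", clean.coef_[0, 0])  # steep sigmoid
print("beta_1 with outliers:   ", dirty.coef_[0, 0])  # much smaller: stretched-out sigmoid
print("training accuracy with outliers:", dirty.score(X_all, y_all))  # no longer perfect
```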
  • 00:04:26
    Let's look at k nearest neighbors, another friendly face. With k nearest neighbors, if we have two nice looking clouds of data, the blue triangles and the green circles, then we can draw a pretty nice looking decision boundary. Any point on this side of the decision boundary, so above it, is going to get classified as a green circle, because if we use, say, k equals three and ask "who are my three closest neighbors?", they're always going to be some of the circles if we're on this side, and they're always going to be the triangles if we're on the other side. So no issues there.

    Let's see how this story changes if we add just two extra green circles in the wrong place, so they're outliers. Here we have put the two green circles with the main pack of blue triangles, and the decision boundary changes in the following way: this whole area is unaffected, this whole area is unaffected, but around where we introduced the outliers we get a very funky looking decision boundary. The reason is that if you're trying to predict some new data point here, where the x is, and you ask who are my three closest neighbors, it's going to be these two outliers as well as this blue triangle, so it's going to say the majority class is green circle, and this whole region of the decision space gets allocated to the circles as well. So we see that in k nearest neighbors, introducing just a couple of outliers near a different class can have a big impact on the decision boundary.
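A tiny scikit-learn sketch of that flip (not from the video; the cluster coordinates are invented): a query point deep in "triangle" territory changes class once two mislabeled "circle" points are dropped next to it.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two tidy clusters: triangles (class 0) lower-left, circles (class 1) upper-right.
X = np.array([[1, 1], [1, 2], [2, 1], [2, 2],
              [8, 8], [8, 9], [9, 8], [9, 9]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# A query point sitting well inside triangle territory.
query = np.array([[2.5, 1.5]])

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print("prediction without outliers:", knn.predict(query))  # [0] -> triangle

# Add just two "circle" outliers right next to the triangle cluster.
X_out = np.vstack([X, [[2.7, 1.4], [2.6, 1.8]]])
y_out = np.concatenate([y, [1, 1]])

knn_out = KNeighborsClassifier(n_neighbors=3).fit(X_out, y_out)
print("prediction with 2 outliers:  ", knn_out.predict(query))  # [1] -> flips to circle
```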
  • 00:05:41
    So far we've talked about machine learning methods that are very impacted by outliers; let's talk about one that is not so affected, and that is our old friend, decision trees. We have a decision tree here, and this is just some variable. In general, low values of this variable correspond to the triangles and higher values correspond to the circles, but there are two outliers where the variable is very high yet those points are labeled as triangles. If we recall how decision trees work, the tree is going to scan this entire variable's range and pick a split such that on one side of the split we have mostly triangles and on the other side mostly circles. So let's pretend at first it chooses this split here, the black line I've drawn. On the left hand side it's getting 100% correct, because it's saying those are triangles and they are indeed triangles. On the right hand side it's getting most of them correct, but it's misclassifying the two outliers.

    So the natural question is: is there a different split that could get an even better outcome? And the answer is no. For example, let's try hypothetically: what if it chose to split here instead? If we entertained the idea that the decision tree could be swayed by outliers, maybe we'd think the decision boundary would get pulled in that direction, so let's think about whether that actually makes sense for decision trees. If we have this as our split and we say everything on the left hand side is a triangle, we're still getting all of these correct, but now we get an extra mistake with this green circle. And if we say everything on the right hand side is a circle, we're still getting these three circles correct but we're still getting those two triangles wrong. So all we've done by changing the location of the split is introduce one more mistake; a decision tree would never actually choose this split.

    And so we see that even if you have outliers like these two triangles, no matter how far out they are in that direction, it's not going to matter, whereas in logistic regression, the further these outliers were in that direction, the more the sigmoid got stretched out and the more mistakes we were making. That's why, when you hear that decision trees and everything that comes from them, like random forests, bagging and boosting, are somehow resilient or robust to outliers, this is the kind of behavior we're talking about.
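Not from the video, but a quick way to see the same behavior with scikit-learn: a depth-1 decision tree (a "stump") keeps essentially the same split threshold whether or not the far-away mislabeled points are present (the toy values are invented).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Low values of the variable are triangles (0), higher values are circles (1).
X = np.array([1, 2, 3, 4, 6, 7, 8, 9], dtype=float).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Two outliers: very large values that are nevertheless triangles.
X_out = np.vstack([X, [[100.0], [120.0]]])
y_out = np.concatenate([y, [0, 0]])

stump = DecisionTreeClassifier(max_depth=1, random_state=0)

threshold_clean = stump.fit(X, y).tree_.threshold[0]
threshold_dirty = stump.fit(X_out, y_out).tree_.threshold[0]

print("chosen split without outliers:", threshold_clean)  # around 5
print("chosen split with outliers:   ", threshold_dirty)  # still around 5, not dragged to 100+
```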
  • 00:07:54
    And now, to close this video, let's talk about two very common strategies people use to deal with outliers, the pros and cons of each, and then the general downside of doing anything at all to your outliers.

    The first strategy is called trimming, and it's probably the one you're more familiar with. Let's say this is our data: some variable, and you're looking at a histogram of that variable. Trimming operates under the assumption that any very low values of that variable, or any abnormally high values, should be deleted. For example, if we choose our thresholds as the 5th percentile and the 95th percentile, anything below the 5th and anything above the 95th we just throw away. A natural question is what that does to the histogram: we go from this histogram to this one, so the tails have been chopped off, and the rest of the distribution gets raised slightly. An intuitive way to think about it is that we take the probability from the tails away, but the curve still needs to integrate to one, it has to add up to 100% probability, so the probability we just deleted gets reallocated to the rest of the curve, and the rest of the curve shifts up. So that is trimming. The downside of trimming, as you probably noticed, is that we are literally just throwing away data; in cases where you maybe don't have a ton of data to begin with, this could be a problem.
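A minimal NumPy sketch of trimming at the 5th/95th percentiles (not from the video; the simulated data is arbitrary), mainly to show that roughly 10% of the observations simply disappear.

```python
import numpy as np

rng = np.random.default_rng(1)
values = rng.normal(loc=50, scale=10, size=1_000)

# Trimming: pick percentile thresholds and drop everything outside them.
low, high = np.percentile(values, [5, 95])
trimmed = values[(values >= low) & (values <= high)]

print(f"kept {trimmed.size} of {values.size} observations")         # roughly 900 of 1000
print(f"range before: {values.min():.1f} to {values.max():.1f}")
print(f"range after:  {trimmed.min():.1f} to {trimmed.max():.1f}")  # tails chopped off
```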
  • 00:09:16
    That's where the second strategy comes in. It's related, but has a very different step at the end: it's called winsorizing, probably named after somebody called Winsor. The first part is the same: we still pick low and high thresholds, say the 5th and 95th percentiles again. The big difference is that we don't delete the data beyond either threshold. We take the values that are below the 5th percentile and set them equal to the 5th percentile. The intuition is that anything below the 5th percentile is in some sense abnormal or unexpected, so we take all those values and set them to the most reasonable value that does exist in the data set, one we think is normal, and that would be the 5th percentile. We do the same thing on the other side: we take everything above the 95th percentile and set it equal to the 95th percentile.

    Now let's ask the same question: what does that do to the histogram? We haven't deleted any data in this case, we still have the same number of observations. The only change is that these values, the 5th percentile and the 95th percentile we had up here, both get boosted, because now we have a lot more observations at exactly those values. So the advantage is that we're not throwing away data like we are in trimming, but the disadvantage is that we could have a lot of samples that are now exactly the same, either exactly the 5th or exactly the 95th percentile, so we might be artificially reducing the variance of our data. Those are the trade-offs between these two methods.
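And the NumPy counterpart for winsorizing (again not from the video): np.clip keeps every observation but piles mass onto the two thresholds, which is why the variance shrinks a little. SciPy also ships a helper, scipy.stats.mstats.winsorize, built around the same idea.

```python
import numpy as np

rng = np.random.default_rng(1)
values = rng.normal(loc=50, scale=10, size=1_000)

# Winsorizing: clamp (rather than delete) everything outside the thresholds.
low, high = np.percentile(values, [5, 95])
winsorized = np.clip(values, low, high)

print("observations kept:", winsorized.size)        # all 1000 remain
print(f"std before: {values.std():.2f}")
print(f"std after:  {winsorized.std():.2f}")         # slightly smaller: variance is reduced
```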
  • 00:10:52
    And now, just to close this video, I want to talk about the general downside of doing any kind of default method with your outliers. There are a lot of programming languages out there and a lot of packages to deal with outliers: you can do trimming, you can do winsorizing, you can probably do even more complex things to take your outliers and transform them into something else. But you have to stop and think: any time you're taking an outlier and transforming it into something else, you're inherently making the assumption that you're never going to see that type of outlier again, for example in your testing data. Because if we think about it for a second, we're saying that these values are abnormal, that we probably won't see them again, that they're just a one-off thing, and so we're going to do something reasonable to them. But that may not be the case. It may be that if you look at your testing data, the things you're trying to predict, you could still see outliers just like this, and if you haven't done anything to address the root cause or mechanism that produced these outliers, you're never going to get those correct in the testing data, because you haven't built anything to deal with them. Even more than that, you're probably going to get them very wrong, because you're treating them in the training data as just regular examples instead of anything special, so you're probably going to have big errors when it comes to your testing data.

    A lot of times students will come to me and ask what the right way to deal with outliers is, and they're usually trying to choose between some of these out-of-the-box techniques. I would say that none of these are the right way to deal with outliers. All these out-of-the-box techniques are the fast, easy way to deal with outliers if you want to make quick progress in your project. But the right way to deal with outliers is to stop, take some time, and think about what mechanism produced these outliers and whether that mechanism could still exist in the testing data. If so, we should do something more intelligent: we should look at how the outliers differ from the rest of the data and perhaps build a separate model or treat them in a different way.

    So hopefully you learned about outliers in the context of machine learning: which models are and are usually not affected by outliers, some out-of-the-box techniques to deal with outliers and the pros and cons between them, and most importantly the philosophy of what it means to do anything to an outlier at all and what consequences that can have for your entire data project. Okay, so if you enjoyed this video please like and subscribe for more just like this, and I'll see you next time.
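To make that last point concrete, here is a small sketch (my own construction, not from the video) of what can go wrong: the training set is winsorized so a faulty-sensor mechanism looks harmless, but the same mechanism shows up again at test time and the model's prediction is wildly off.

```python
import numpy as np

rng = np.random.default_rng(2)

# Normal mechanism: y is roughly 2x for x in a typical range.
x_train = rng.uniform(0, 10, size=200)
y_train = 2 * x_train + rng.normal(scale=1.0, size=200)

# A second mechanism (say, a sensor fault) produces huge x but small y.
x_train = np.concatenate([x_train, [80.0, 90.0, 100.0]])
y_train = np.concatenate([y_train, [1.0, 0.0, 2.0]])

# "Out of the box" fix: winsorize x at the 5th/95th percentiles and fit as usual.
low, high = np.percentile(x_train, [5, 95])
slope, intercept = np.polyfit(np.clip(x_train, low, high), y_train, deg=1)

# Test time: the faulty mechanism still exists, so another such point shows up.
x_test, y_test = 95.0, 1.5
print(f"prediction: {slope * x_test + intercept:.1f}, actual: {y_test}")
# Even if the test value were clipped to `high` as well, the prediction
# (around 2 * high) would still be far from the truth, because nothing in
# the pipeline models the fault mechanism itself.
```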
Tags
  • outliers
  • linear regression
  • logistic regression
  • decision trees
  • trimming
  • winsorization
  • machine learning
  • robustness
  • anomalous data
  • data management