TLDR: This video explores the impact of outliers on machine learning algorithms and discusses common methods for handling them. Outliers can strongly influence models such as linear regression and logistic regression and skew their results. Decision trees prove more robust to this problem. Strategies for handling outliers include trimming, which removes extreme values, and winsorization, which replaces them with threshold values. However, transforming outliers assumes that they will not appear in future data, which is a problem when that assumption fails. The ideal approach is to analyze the mechanism behind the outliers and adapt the models accordingly.


  • 📊 Outliers can strongly influence machine learning algorithms such as linear regression and logistic regression.
  • 📉 Methods based on decision trees, such as random forests, are more robust to outliers.
  • 🔍 The term 'outlier' has become common even outside statistics, where it refers to atypical cases.
  • 🚫 Trimming removes extreme values, but it can cause a significant loss of data.
  • 🔄 Winsorization replaces extreme values with the threshold values, which preserves the amount of data but can reduce the variance.
  • 🤔 Transforming outliers assumes that they will not reappear in the test data, which can be a risky assumption.
  • 🧠 Outlier handling should take the underlying cause into account in order to arrive at more appropriate solutions.
  • ⚖️ Automatic outlier-handling methods allow fast progress but can compromise model accuracy.
  • 🔄 Understanding the mechanism behind outliers may require building a dedicated model or a custom solution.
  • 📝 The philosophy behind outlier handling includes its potential consequences for the entire data project.
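The two strategies named in the takeaways can be sketched in a few lines. This is a minimal illustration, not the speaker's code: the helper names `trim` and `winsorize`, the made-up data, and the use of NumPy are all assumptions; only the 5th and 95th percentile thresholds come from the video.

```python
import numpy as np

def trim(values, lower_pct=5, upper_pct=95):
    """Drop every value outside the [lower_pct, upper_pct] percentile range."""
    values = np.asarray(values, dtype=float)
    lo, hi = np.percentile(values, [lower_pct, upper_pct])
    return values[(values >= lo) & (values <= hi)]

def winsorize(values, lower_pct=5, upper_pct=95):
    """Set values outside the percentile range equal to the thresholds."""
    values = np.asarray(values, dtype=float)
    lo, hi = np.percentile(values, [lower_pct, upper_pct])
    return np.clip(values, lo, hi)

# Illustrative data: 1..100 plus two extreme values (made up for this sketch).
data = np.concatenate([np.arange(1, 101), [1000, -500]])
trimmed = trim(data)
winsorized = winsorize(data)

print("rows before / after trimming:", len(data), len(trimmed))
print("rows after winsorizing:", len(winsorized))
print("std before / after winsorizing:", data.std(), winsorized.std())
```

Trimming shortens the array, while winsorizing keeps every row and piles the tail values up at the two thresholds, which is why the spread of the data shrinks.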


  • 00:00:00 - 00:05:00

    In this video, the speaker discusses the role of outliers in machine learning algorithms and the common methods for handling them. Outliers are an interesting topic for two reasons: the concept has become common even outside statistics, and there is no universal method for handling them. The speaker examines the impact of outliers on four popular machine learning algorithms, starting with linear regression and logistic regression, where outliers can skew the results. In linear regression the line of best fit is pulled toward the outliers, while in logistic regression the sigmoid curve is stretched out, which leads to incorrect predictions.

  • 00:05:00 - 00:13:06

    Next, the speaker covers k-nearest neighbors, where just a couple of outliers can significantly change the decision boundary. By contrast, decision trees and the methods derived from them are less affected by outliers, because they choose splits based on the majority of the data. The speaker also describes two strategies for handling outliers: trimming, which removes the extreme data, and winsorization, which sets values beyond the chosen percentiles equal to those percentiles without deleting any data. The speaker stresses that any method applied to outliers rests on the assumption that they will not reappear in the test data, hence the importance of understanding the mechanism that produced them.
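The contrast between a single decision-tree split and a regression line can be sketched as follows. Everything here is an assumption made for illustration: the data points are invented, `best_split` is a hypothetical helper that scores one split by counting mistakes (as the video describes it), and real tree implementations typically use impurity measures and may grow deeper.

```python
import numpy as np

def best_split(x, y):
    """Scan midpoints between sorted feature values and return the threshold
    with the fewest mistakes for the rule: class 0 if x <= t, else class 1."""
    xs = np.sort(x)
    candidates = (xs[:-1] + xs[1:]) / 2.0
    mistakes = [int(np.sum((x > t).astype(int) != y)) for t in candidates]
    return float(candidates[int(np.argmin(mistakes))])

def class_data(outlier_start):
    """Triangles (class 0) at low x, circles (class 1) at higher x,
    plus two outlying triangles placed far to the right."""
    x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
                  outlier_start, outlier_start + 1], dtype=float)
    y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0])
    return x, y

# The chosen split does not depend on how far away the outliers sit.
split_near = best_split(*class_data(20))
split_far = best_split(*class_data(2000))

# Linear regression on made-up points, with and without two outliers.
x_clean = np.arange(10, dtype=float)
y_clean = 2.0 * x_clean + 1.0
slope_clean = np.polyfit(x_clean, y_clean, 1)[0]

x_out = np.append(x_clean, [20.0, 21.0])
y_out = np.append(y_clean, [0.0, 0.0])
slope_out = np.polyfit(x_out, y_out, 1)[0]

print("split with nearby outliers:", split_near)
print("split with distant outliers:", split_far)
print("slope without outliers:", slope_clean)
print("slope with outliers:", slope_out)
```

Moving the split toward the outliers would only add mistakes, so the threshold stays where the bulk of the data puts it, whereas the fitted slope changes as soon as the two extra points are added.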



  • What is an outlier?

    An outlier is a value, or a set of values, that differs considerably from the rest of the data. The term is also used outside statistics to refer to atypical cases.

  • Which machine learning algorithms are affected by outliers?

    Linear regression, logistic regression, and k-nearest neighbors are strongly affected by outliers.

  • Which algorithms are robust to outliers?

    Decision trees and the methods derived from them, such as random forests, are generally robust to outliers.

  • What is trimming?

    Trimming is a method that removes the extreme values from the data set.

  • What is winsorization?

    Winsorization replaces the extreme values with the threshold values of the distribution (for example the 5th and 95th percentiles), which preserves the original amount of data.

  • Why is it risky to transform outliers?

    Transforming outliers without understanding their underlying cause is risky because similar values can reappear in the test data and cause large errors.

  • What is the ideal approach to handling outliers?

    The ideal approach is to understand the mechanism that produces the outliers and to adapt the methods accordingly in order to improve model accuracy.
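The risk described in the last two answers can be sketched with a small experiment. The setup is entirely hypothetical: the data, the recurring "event" that adds 200 to the target, and the helper `make_data` are invented for illustration. The training data is trimmed at the 5th and 95th percentiles, a line is fitted to what remains, and the fit is then scored on test data in which the same mechanism still occurs.

```python
import numpy as np

def make_data(n, first_event):
    """Made-up data: y = 2x, except every 25th row, where a recurring
    'event' (the hypothetical outlier mechanism) adds 200 to y."""
    x = np.linspace(0.0, 10.0, n)
    is_event = (np.arange(n) % 25) == first_event
    y = 2.0 * x + np.where(is_event, 200.0, 0.0)
    return x, y, is_event

x_train, y_train, train_event = make_data(100, first_event=3)
x_test, y_test, test_event = make_data(100, first_event=10)

# Trim the training rows whose target lies outside the 5th-95th percentiles.
lo, hi = np.percentile(y_train, [5, 95])
keep = (y_train >= lo) & (y_train <= hi)

# Fit a line on the trimmed training data only.
slope, intercept = np.polyfit(x_train[keep], y_train[keep], 1)

# Evaluate on test data, where the same mechanism still produces outliers.
abs_err = np.abs(slope * x_test + intercept - y_test)
err_normal = float(abs_err[~test_event].max())
err_event = float(abs_err[test_event].max())

print("rows kept after trimming:", int(keep.sum()))
print("largest error on ordinary test rows:", err_normal)
print("largest error on event test rows:", err_event)
```

The model fits the ordinary rows almost perfectly but has no way to predict the event rows, because trimming removed every example of the mechanism from training. Flagging those rows and modeling them separately is one possible direction, in line with the speaker's advice to understand what produces the outliers.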


    hey everyone how's it going so it's
    going to be a pretty brief video today
    we're going to be talking about the role
    of outliers in machine learning algorithms
    and then also talk about ways that people
    typically deal with outliers as well as
    some of the shortcomings of those
    in general i think outliers are a pretty
    interesting topic for two main reasons
    one is that even if you don't study
    stats this word outlier has become kind
    of a commonplace word even in the media
  • 00:00:23
    speech for example will say oh that
    baseball game was an outlier
    and we just mean that it was different
    from a typical baseball game for example
  • 00:00:30
  • 00:00:31
  • 00:00:33
    statistical tools we have there's not
    a set way to deal with outliers it
    really depends on the problem
    different people will take different
    approaches so it really highlights this
  • 00:00:42
  • 00:00:43
    math is not this yes or no set in stone
    kind of process but it really depends on
  • 00:00:48
  • 00:00:50
    video pretty simply we'll just go through
    four different very popular machine
    learning algorithms and talk about the
    effect that a couple of outliers can have on
    the results and then as i said we'll
    talk about some common ways people deal
  • 00:01:02
  • 00:01:04
  • 00:01:05
    hope you like it
    so let's begin by visiting our very
    first friend in machine learning and
  • 00:01:10
    stats which was linear regression
    so if you remember linear regression you
  • 00:01:14
  • 00:01:16
  • 00:01:17
  • 00:01:20
    all of your data points
    so let's say all these black x's were
  • 00:01:23
  • 00:01:25
    this green line of best fit through them
  • 00:01:27
  • 00:01:28
    issues there now here's the problem
    let's say you introduce a couple of
  • 00:01:32
  • 00:01:34
    with the exclamation point next to them
    are outliers because they are very
  • 00:01:38
  • 00:01:40
  • 00:01:42
  • 00:01:44
    going to draw now is very affected
    by outliers and what's going to happen
  • 00:01:48
  • 00:01:50
    or beta 1 hat in this form over here is
    going to shift
    down so that the new line we get is this
    red line
  • 00:01:57
  • 00:01:58
    you can see what it's trying to do it's
    trying to kind of
  • 00:02:01
  • 00:02:03
    these outliers but in doing so it
    doesn't really capture either one too
  • 00:02:06
    so we definitely have an issue with
  • 00:02:09
  • 00:02:11
  • 00:02:12
    logistic regression which might have
    been the first
  • 00:02:15
  • 00:02:17
    the exact same way
    just a quick recap models the logit of
  • 00:02:20
  • 00:02:22
    either 0 or 1
    as the same form beta naught plus beta 1
  • 00:02:26
  • 00:02:28
    so let's say our initial data is these
    black x's which were all class
  • 00:02:31
  • 00:02:33
    class 1
    and we asked to draw the sigmoid which
  • 00:02:36
  • 00:02:38
    of each example being in class 1. and
    without the presence of outliers this
  • 00:02:42
  • 00:02:44
    black sigmoid
    so that all of these get correctly
  • 00:02:46
  • 00:02:48
    probabilities are above
    0.5 and all these guys get correctly
  • 00:02:52
  • 00:02:53
    0 because their probabilities are below
    0.5 now again we introduced just a
  • 00:02:57
  • 00:02:59
    outliers here these three red x's and
    i've drawn this red arrow to indicate
  • 00:03:02
  • 00:03:03
    way over here in the x direction so they
    are way to the right have very large
  • 00:03:07
    now let's think about first
  • 00:03:09
  • 00:03:10
    the sigmoid
    actually the same thing happens because
  • 00:03:13
  • 00:03:14
    plus beta 1 x
    so our beta 1 hat again goes down and
  • 00:03:18
  • 00:03:20
    hat is going to have
    on this sigmoid is that it's going to
  • 00:03:24
  • 00:03:26
    so it now looks like this
    red sigmoid here and more intuitively
  • 00:03:30
  • 00:03:31
    because these three red x's have very
    large values
  • 00:03:35
  • 00:03:37
    classified as class
    zero we get this situation where the
  • 00:03:41
  • 00:03:43
    incorporate these into the class
    zero which it tries to achieve by
  • 00:03:46
  • 00:03:48
    they get incorporated into the lower
    part of the sigmoid
  • 00:03:51
  • 00:03:52
    destroys the rest of our data
    for example if you look at this sigmoid
  • 00:03:56
  • 00:03:59
    these are all still correctly classified
    but if you look at everything above 0.5
  • 00:04:03
  • 00:04:05
    but these three or four x's here are
    actually belonging to class zero
  • 00:04:08
  • 00:04:09
    zero even though they're class one and
    maybe even the worst part of this is
  • 00:04:12
  • 00:04:14
  • 00:04:16
    zero it never achieves they still get
    classified as class one so we get a
  • 00:04:20
  • 00:04:22
    in logistic regression having these
    outliers so again problematic scenario
  • 00:04:26
  • 00:04:28
    another friendly face
    so with k nearest neighbors if we have
  • 00:04:31
  • 00:04:33
    have the blue triangles and we have the
    green circles
  • 00:04:36
  • 00:04:37
    decision boundary any point on this side
    of the decision boundary so above it
  • 00:04:42
  • 00:04:44
    circle because if we use let's say
    k equals three who are my three closest
  • 00:04:48
  • 00:04:49
    some of the circles if we're on this side
    and they're always going to be the
    triangles if we're on this side so no
  • 00:04:54
  • 00:04:55
    let's see how this story changes if we
    incorporate just two
  • 00:04:58
  • 00:05:01
    so they're outliers
    so here we have put the two green
  • 00:05:04
  • 00:05:07
    and we see that the decision boundary
  • 00:05:09
  • 00:05:11
    so this whole area is unaffected this
    whole area is unaffected but
  • 00:05:14
  • 00:05:17
    we get a very funky looking decision
    boundary and the reason is that let's
  • 00:05:20
  • 00:05:21
    data point here where the x is
    and you ask who are my three closest
  • 00:05:24
  • 00:05:26
    two outliers as well as this blue triangle
    so it's going to say majority class is
    green circle and so this whole decision
  • 00:05:32
  • 00:05:34
  • 00:05:36
    introducing just a couple of outliers
    near a different class can have a big
  • 00:05:40
  • 00:05:41
  • 00:05:43
    learning methods that are very impacted
    by outliers let's talk about one that is
  • 00:05:47
  • 00:05:49
    our old friend decision trees so we have
    a decision tree here
  • 00:05:53
  • 00:05:55
    that in general low values of this variable
  • 00:05:58
    correspond to these triangles higher
    values of these variables correspond to
  • 00:06:02
  • 00:06:04
    where the variable is very high yet
    those are classified as triangles
  • 00:06:09
    and so if we recall how decision trees
    work it's going to scan this entire
  • 00:06:12
  • 00:06:14
    and it's going to pick a split such that
    on one side of the split we have mostly
  • 00:06:18
  • 00:06:19
    split you have mostly circles so let's
    pretend at first it chooses this split
  • 00:06:22
  • 00:06:24
    on the left hand side it's getting 100%
    correct because it's saying those are
  • 00:06:28
  • 00:06:29
    indeed triangles on the right hand side
    it's getting most of them correct but
    it's doing poorly
  • 00:06:34
  • 00:06:36
    the natural question is is there a
    different split that i could try to get
  • 00:06:40
  • 00:06:41
    and the answer is no for example let's
    just try hypothetically what if it chose
  • 00:06:45
  • 00:06:48
  • 00:06:50
    tree could be swayed by outliers maybe
    we think
    the decision boundary would get pulled
  • 00:06:55
  • 00:06:56
    that actually makes sense in the context
    of decision trees
  • 00:06:59
  • 00:07:01
    say everything on the left hand side
    is a triangle we're still getting all
  • 00:07:05
  • 00:07:06
    mistake with this green circle
    and if we say everything on the right
  • 00:07:10
  • 00:07:11
    getting these three circles correct but
    we're still getting those two triangles
  • 00:07:14
  • 00:07:15
  • 00:07:17
    of the split
    is just introduce one more mistake so we
  • 00:07:21
  • 00:07:22
    ever do this so this is not
    possible for the decision tree to split
  • 00:07:26
  • 00:07:27
    outliers like these two
    triangles and no matter how far they are
  • 00:07:31
  • 00:07:32
  • 00:07:34
    logistic regression the further these
    outliers were
  • 00:07:37
  • 00:07:39
    gets stretched out and the more mistakes
    we are making so
  • 00:07:42
  • 00:07:43
    and everything that comes from them like
    random forests and
  • 00:07:46
  • 00:07:48
    resilient or robust to outliers
    this is kind of the behavior that we are
  • 00:07:52
  • 00:07:54
    video let's talk about two very
    common strategies people use to deal
  • 00:07:57
  • 00:07:59
    and cons of them and let's talk about
    the general con
  • 00:08:02
  • 00:08:04
    end the video
    so the two main strategies that people
  • 00:08:08
  • 00:08:09
    called trimming
    this is probably the one you're more
  • 00:08:12
  • 00:08:14
    so this is some variable and you're
  • 00:08:16
  • 00:08:18
    trimming basically operates under the
    assumption that any
  • 00:08:22
  • 00:08:24
    abnormally high values of that variable
    should be deleted
  • 00:08:27
  • 00:08:29
    thresholds as the 5th percentile and the
    95th percentile
  • 00:08:32
  • 00:08:34
    after the 95th
    we just throw it away and so a natural
  • 00:08:37
  • 00:08:39
    histogram so
    we go from this histogram to this one so
  • 00:08:42
  • 00:08:44
  • 00:08:46
    distribution is that it all gets raised
    slightly so an intuitive way to think
  • 00:08:50
  • 00:08:52
    the tails away
    so that's gone but we still need to have
  • 00:08:56
  • 00:08:58
    to 100% probability
    so we take that probability we just
  • 00:09:01
  • 00:09:03
    to the rest of the curve so the rest of
    the curve shifts up
  • 00:09:06
  • 00:09:08
    trimming you probably notice is that we
    are literally just throwing away data
  • 00:09:12
  • 00:09:13
    ton of data to begin with this could be
    a problem
  • 00:09:16
  • 00:09:18
    which is related but has a very
    different step at the end
  • 00:09:21
  • 00:09:23
    named after somebody called winsor
    and so the first part is the same we
  • 00:09:27
  • 00:09:29
    just pick 5 and 95 again
    but the big difference is that we don't
  • 00:09:33
  • 00:09:34
  • 00:09:37
    and we set it equal to the fifth
  • 00:09:40
  • 00:09:43
    that we're saying
    anything below the fifth percentile is
  • 00:09:46
  • 00:09:49
    is kind of not the normal so what we're
    going to do is take all those values and
  • 00:09:53
  • 00:09:54
    reasonable value that does exist in the
    data set that we think is normal
  • 00:09:58
  • 00:10:00
    so we take all this data in the tail and
    we set it equal to the fifth percentile
  • 00:10:03
  • 00:10:04
    side we take everything above
    the 95th percentile and we set it equal
  • 00:10:08
  • 00:10:10
  • 00:10:12
    does that do to the histogram afterwards
    well we haven't deleted any data in this
  • 00:10:16
  • 00:10:18
  • 00:10:19
  • 00:10:21
    the 5th percentile
    and the 95th percentile that we had up
  • 00:10:25
  • 00:10:26
    they're going to get
    raised because now we just have a lot
  • 00:10:29
  • 00:10:30
    exactly those values so the advantage
    here is that we're not
  • 00:10:34
  • 00:10:35
    trimming but the disadvantage is that we
    could potentially have a lot
  • 00:10:39
  • 00:10:42
    either exactly the fifth or exactly the
    95th percentile so we might be
  • 00:10:46
  • 00:10:48
    our data so
    these are the trade-offs in these two
  • 00:10:50
  • 00:10:52
    i want to talk about the
    general downside of doing any kind of
  • 00:10:55
  • 00:10:57
    there's a lot of programming languages
    out there there's a lot of packages to
  • 00:11:00
  • 00:11:02
    trimming you can do winsorizing you can
    do probably even more complex things to
  • 00:11:06
  • 00:11:07
    into something else but
    you have to stop and think anytime
  • 00:11:10
  • 00:11:11
    transforming it into something else
    inherently making this assumption that
    you're never going to see
  • 00:11:16
  • 00:11:18
    in your testing data
    because if we think about it for a
  • 00:11:21
  • 00:11:23
    are abnormal we probably won't see them
    again they're just kind of a one-off
  • 00:11:26
  • 00:11:28
  • 00:11:30
    may not be the case it may be the case
    that if you look at your testing data
  • 00:11:34
  • 00:11:36
    could still see outliers just like this
    and if you haven't done anything to
  • 00:11:40
  • 00:11:42
    that produced these outliers then you're
    never going to get those correct in the
  • 00:11:45
  • 00:11:46
    built anything to deal with them
    and even more than that you're probably
  • 00:11:49
  • 00:11:51
    you're treating them in the training
    data as just
  • 00:11:53
  • 00:11:55
    special and so you're probably going to
    have big errors when it comes to your
  • 00:11:59
  • 00:12:00
    so a lot of times students will come to
    me and ask what's the right way to deal
  • 00:12:03
  • 00:12:04
    and they're usually trying to choose
    between some of these out of the box
  • 00:12:07
  • 00:12:08
    i would say that none of these are the
    right way to deal with outliers
  • 00:12:11
  • 00:12:13
    the fast easy
    way to deal with outliers if you want to
  • 00:12:16
  • 00:12:18
    but the right way to deal with outliers
    is to stop
  • 00:12:21
  • 00:12:23
    mechanism that produced these outliers
    could that mechanism still exist in the
  • 00:12:27
  • 00:12:29
    something more
    intelligent we should look at how the
  • 00:12:32
  • 00:12:34
    and perhaps build a separate model or
  • 00:12:36
  • 00:12:38
    so hopefully you learned about outliers
    in the context of machine learning so
  • 00:12:42
  • 00:12:45
  • 00:12:47
    some out of the box techniques to deal
    with outliers and the pros and cons
  • 00:12:50
  • 00:12:51
    and most importantly just the philosophy
    of what it means to do anything to an
  • 00:12:55
  • 00:12:56
    and what consequences it can have for
    your entire data
    project okay so if you enjoyed this
    video please like and subscribe for more
  • 00:13:04
