Outliers : Data Science Basics

00:13:06
https://www.youtube.com/watch?v=7KeITQajazo

Summary

TLDR: This video explores the impact of outliers on machine learning algorithms and discusses common methods for handling them. Outliers can heavily influence models such as linear and logistic regression, skewing the results. Decision trees turn out to be more robust to this problem. Strategies for handling outliers include trimming, which removes the extreme values, and winsorizing, which replaces them with threshold values. However, transforming outliers assumes they will not appear in future data, which is a problem when that assumption fails. The ideal approach is to analyze the mechanism behind the outliers and adapt the models accordingly.

Takeaways

  • 📊 Outliers can strongly influence machine learning algorithms such as linear regression and logistic regression.
  • 📉 Tree-based methods, such as random forests, are more robust to outliers.
  • 🔍 The term 'outlier' has become common even outside statistics, referring to atypical cases.
  • 🚫 Trimming removes the extreme values, but it can mean throwing away a significant amount of data.
  • 🔄 Winsorizing replaces the extreme values with the threshold values, preserving the amount of data but potentially reducing the variance.
  • 🤔 Transforming outliers assumes they will not reappear in the test data, which can be a risky assumption.
  • 🧠 Handling outliers should account for the underlying cause in order to choose a more appropriate solution.
  • ⚖️ Out-of-the-box methods for handling outliers enable quick progress but can compromise model accuracy.
  • 🔄 Understanding the mechanism behind the outliers may require building a dedicated model or a custom solution.
  • 📝 The philosophy around handling outliers includes potential consequences for the entire data project.

Timeline

  • 00:00:00 - 00:05:00

    In this video, the speaker discusses the role of outliers in machine learning algorithms and common methods for dealing with them. Outliers are interesting because the concept is familiar even outside statistics, yet there is no universal method for handling them. The speaker examines the impact of outliers on four popular machine learning algorithms, starting with linear regression and logistic regression, where outliers can skew the results. In linear regression the fitted line is pulled toward the outliers, while in logistic regression the sigmoid curve is stretched out, leading to incorrect predictions.

  • 00:05:00 - 00:13:06

    The speaker then turns to k-nearest neighbors, where just a few outliers can significantly change the decision boundary. In contrast, decision trees and their derivatives are less affected by outliers, since they choose splits based on the bulk of the data. The speaker also describes two strategies for handling outliers: trimming, which deletes the extreme data points, and winsorizing, which clips values to the nearest percentile thresholds without deleting any data. He stresses that any method applied to outliers rests on the assumption that they will not reappear in the test data, hence the importance of understanding the mechanism that produced them.


Video Q&A

  • What is an outlier?

    An outlier is a value or set of values that is markedly different from the rest of the data. The term is also used outside statistics to describe atypical cases.

  • Which machine learning algorithms are affected by outliers?

    Algorithms such as linear regression and logistic regression are strongly affected by outliers.

  • Which algorithms are robust to outliers?

    Decision trees and their derivatives, such as random forests, are generally robust to outliers.

  • What is trimming?

    Trimming is a method that removes the extreme values from the dataset.

  • What is winsorizing?

    Winsorizing replaces the extreme values with the threshold values of the distribution (for example, the 5th and 95th percentiles), so the original number of observations is preserved.

  • Why is it risky to transform outliers?

    Transforming outliers without understanding their underlying cause is risky because similar values can reappear in the test data and cause large errors.

  • What is the ideal approach to handling outliers?

    The ideal approach is to understand the mechanism that produces the outliers and adapt your methods accordingly, which improves model accuracy.

Subtitles (English)
  • 00:00:00
    [Music] Hey everyone, how's it going? It's going to be a pretty brief video today. We're going to be talking about the role of outliers in machine learning algorithms, and then also talk about ways that people typically deal with outliers, as well as some of the shortcomings of those methods.

    In general I think outliers are a pretty interesting topic for two main reasons. One is that even if you don't study stats, this word "outlier" has become kind of a commonplace word, even in the media and everyday speech. For example we'll say "oh, that baseball game was an outlier", and we just mean that it was different from a typical baseball game. The other reason I think it's interesting is that even with all these statistical tools we have, there's not a set way to deal with outliers; it really depends on the problem, and different people will take different approaches. So it really highlights this fact that math is not a yes-or-no, set-in-stone kind of process, it really depends on the situation.

    We'll go about this video pretty simply: we'll go through four different, very popular machine learning algorithms and talk about the impact that a couple of outliers can have on the results, and then, as I said, we'll talk about some common ways people deal with outliers. By the way, I got this really cool lobster hat for Christmas, hope you like it.
  • 00:01:06
    So let's begin by visiting our very first friend in machine learning and stats, which was linear regression. If you remember linear regression, you just have an x and a y variable, keeping it real simple, and you draw a line of best fit through all of your data points. So let's say all these black x's were your initial data points and you drew this green line of best fit through them; it's a pretty good fit, no real issues there.

    Now here's the problem: let's say you introduce a couple of outliers, so these red x's down here with the exclamation point next to them are outliers, because they are very different from the typical black x's we have up here. As you might have learned, in linear regression the line that we're going to draw now is very affected by outliers. What's going to happen is that the slope of the line, or beta 1 hat in this form over here, is going to shift down, so the new line we get is this red line. The biggest issue with this red line is what it's trying to do: it's trying to compromise between the original data and these outliers, but in doing so it doesn't really capture either one too well. So we definitely have an issue with outliers in linear regression.
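Not from the video, but a minimal NumPy sketch of that slope shift: fit ordinary least squares with and without a couple of extreme points and compare the estimated beta 1 hat (the made-up data, and therefore the exact numbers, are my own).

```python
import numpy as np

rng = np.random.default_rng(0)

# Clean data: y roughly follows y = 1 + 2x plus a little noise.
x = np.linspace(0, 10, 30)
y = 1 + 2 * x + rng.normal(scale=1.0, size=x.size)

# Two outliers: ordinary x values but wildly low y values,
# like the red x's sitting below the cloud in the video.
x_out = np.array([8.0, 9.0])
y_out = np.array([-30.0, -35.0])

# Ordinary least squares with and without the outliers.
slope_clean, _ = np.polyfit(x, y, deg=1)
slope_dirty, _ = np.polyfit(np.concatenate([x, x_out]),
                            np.concatenate([y, y_out]), deg=1)

print(f"slope without outliers: {slope_clean:.2f}")  # close to the true 2.0
print(f"slope with 2 outliers:  {slope_dirty:.2f}")  # pulled well below 2.0
```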
  • 00:02:11
    It turns out that we can also frame logistic regression, which might have been the first classification technique you learned, in the exact same way. Just a quick recap: it models the logit of the probability of any example being either 0 or 1 with the same form, beta naught plus beta 1 x, just like we had up here. So let's say our initial data is these black x's, which were all class 0, and these black x's, which are all class 1, and we're asked to draw the sigmoid that will predict the probability of each example being in class 1. Without the presence of outliers this would be pretty simple: we just draw this black sigmoid so that all of these get correctly classified as class 1, because their probabilities are above 0.5, and all these guys get correctly classified as class 0, because their probabilities are below 0.5.

    Now again we introduce just a couple of outliers. Here are some outliers, these three red x's, and I've drawn this red arrow to indicate that they are way over here in the x direction; they are far to the right and have very large values. Let's first think mathematically about what's going to happen to the sigmoid. Actually the same thing happens, because we are modeling it again as beta naught plus beta 1 x, so our beta 1 hat again goes down, and the impact of a lower beta 1 hat on the sigmoid is that it flattens out and stretches out, so it now looks like this red sigmoid here. More intuitively, you can see what's going on: because these three red x's have very large values of the x variable yet they are labeled class zero, the sigmoid tries its best to incorporate them into class zero, which it attempts by stretching out so much that they fall into the lower part of the sigmoid. But in trying to do so it completely destroys the rest of our data.

    For example, if you look at this sigmoid now, everything below 0.5 is still correctly classified, but above 0.5 we get these three correctly classified while these three or four x's here get predicted as class zero even though they're class one. And maybe the worst part is that although it tries really hard to incorporate the three outliers into class zero, it never achieves it; they still get classified as class one. So we get a bunch of mistakes in logistic regression by having these outliers. Again, a problematic scenario.
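Again not from the video, just a small sketch of the stretched sigmoid, assuming scikit-learn is available (the toy data and the weak-regularization setting C=1e6 are my own choices): adding a few huge-x points labeled class 0 shrinks the fitted beta 1, and the training fit is no longer perfect.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Clean 1-D data: class 0 at small x, class 1 at larger x.
X = np.array([1, 2, 3, 4, 6, 7, 8, 9], dtype=float).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Outliers: huge x values that are nevertheless labeled class 0.
X_all = np.vstack([X, [[50.0], [55.0], [60.0]]])
y_all = np.concatenate([y, [0, 0, 0]])

# Large C means almost no regularization, so the data drives the fit.
clean = LogisticRegression(C=1e6, max_iter=10_000).fit(X, y)
dirty = LogisticRegression(C=1e6, max_iter=10_000).fit(X_all, y_all)

print("beta_1 without outliers:", clean.coef_[0, 0])  # steep sigmoid
print("beta_1 with outliers:   ", dirty.coef_[0, 0])  # much smaller: stretched-out sigmoid
print("training accuracy with outliers:", dirty.score(X_all, y_all))  # no longer perfect
```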
  • 00:04:26
    Let's look at k nearest neighbors, another friendly face. With k nearest neighbors, if we have two nice looking clouds of data, the blue triangles and the green circles, then we can draw a pretty nice looking decision boundary. Any point on this side of the decision boundary, so above it, is going to get classified as a green circle, because if we use, say, k equals three and ask "who are my three closest neighbors?", they're always going to be some of the circles if we're on this side, and they're always going to be the triangles if we're on the other side. So no issues there.

    Let's see how this story changes if we add just two extra green circles in the wrong place, so they're outliers. Here we have put the two green circles with the main pack of blue triangles, and the decision boundary changes in the following way: this whole area is unaffected, this whole area is unaffected, but around where we introduced the outliers we get a very funky looking decision boundary. The reason is that if you're trying to predict some new data point here, where the x is, and you ask who are my three closest neighbors, it's going to be these two outliers as well as this blue triangle, so it's going to say the majority class is green circle, and this whole region of the decision space gets allocated to the circles as well. So we see that in k nearest neighbors, introducing just a couple of outliers near a different class can have a big impact on the decision boundary.
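A tiny scikit-learn sketch of that flip (not from the video; the cluster coordinates are invented): a query point deep in "triangle" territory changes class once two mislabeled "circle" points are dropped next to it.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two tidy clusters: triangles (class 0) lower-left, circles (class 1) upper-right.
X = np.array([[1, 1], [1, 2], [2, 1], [2, 2],
              [8, 8], [8, 9], [9, 8], [9, 9]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# A query point sitting well inside triangle territory.
query = np.array([[2.5, 1.5]])

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print("prediction without outliers:", knn.predict(query))  # [0] -> triangle

# Add just two "circle" outliers right next to the triangle cluster.
X_out = np.vstack([X, [[2.7, 1.4], [2.6, 1.8]]])
y_out = np.concatenate([y, [1, 1]])

knn_out = KNeighborsClassifier(n_neighbors=3).fit(X_out, y_out)
print("prediction with 2 outliers:  ", knn_out.predict(query))  # [1] -> flips to circle
```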
  • 00:05:41
    So far we've talked about machine learning methods that are very impacted by outliers; let's talk about one that is not so affected, and that is our old friend, decision trees. We have a decision tree here, and this is just some variable. In general, low values of this variable correspond to the triangles and higher values correspond to the circles, but there are two outliers where the variable is very high yet those points are labeled as triangles. If we recall how decision trees work, the tree is going to scan this entire variable's range and pick a split such that on one side of the split we have mostly triangles and on the other side mostly circles. So let's pretend at first it chooses this split here, the black line I've drawn. On the left hand side it's getting 100% correct, because it's saying those are triangles and they are indeed triangles. On the right hand side it's getting most of them correct, but it's misclassifying the two outliers.

    So the natural question is: is there a different split that could get an even better outcome? And the answer is no. For example, let's try hypothetically: what if it chose to split here instead? If we entertained the idea that the decision tree could be swayed by outliers, maybe we'd think the decision boundary would get pulled in that direction, so let's think about whether that actually makes sense for decision trees. If we have this as our split and we say everything on the left hand side is a triangle, we're still getting all of these correct, but now we get an extra mistake with this green circle. And if we say everything on the right hand side is a circle, we're still getting these three circles correct but we're still getting those two triangles wrong. So all we've done by changing the location of the split is introduce one more mistake; a decision tree would never actually choose this split.

    And so we see that even if you have outliers like these two triangles, no matter how far out they are in that direction, it's not going to matter, whereas in logistic regression, the further these outliers were in that direction, the more the sigmoid got stretched out and the more mistakes we were making. That's why, when you hear that decision trees and everything that comes from them, like random forests, bagging and boosting, are somehow resilient or robust to outliers, this is the kind of behavior we're talking about.
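Not from the video, but a quick way to see the same behavior with scikit-learn: a depth-1 decision tree (a "stump") keeps essentially the same split threshold whether or not the far-away mislabeled points are present (the toy values are invented).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Low values of the variable are triangles (0), higher values are circles (1).
X = np.array([1, 2, 3, 4, 6, 7, 8, 9], dtype=float).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Two outliers: very large values that are nevertheless triangles.
X_out = np.vstack([X, [[100.0], [120.0]]])
y_out = np.concatenate([y, [0, 0]])

stump = DecisionTreeClassifier(max_depth=1, random_state=0)

threshold_clean = stump.fit(X, y).tree_.threshold[0]
threshold_dirty = stump.fit(X_out, y_out).tree_.threshold[0]

print("chosen split without outliers:", threshold_clean)  # around 5
print("chosen split with outliers:   ", threshold_dirty)  # still around 5, not dragged to 100+
```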
  • 00:07:54
    And now, to close this video, let's talk about two very common strategies people use to deal with outliers, the pros and cons of each, and then the general downside of doing anything at all to your outliers.

    The first strategy is called trimming, and it's probably the one you're more familiar with. Let's say this is our data: some variable, and you're looking at a histogram of that variable. Trimming operates under the assumption that any very low values of that variable, or any abnormally high values, should be deleted. For example, if we choose our thresholds as the 5th percentile and the 95th percentile, anything below the 5th and anything above the 95th we just throw away. A natural question is what that does to the histogram: we go from this histogram to this one, so the tails have been chopped off, and the rest of the distribution gets raised slightly. An intuitive way to think about it is that we take the probability from the tails away, but the curve still needs to integrate to one, it has to add up to 100% probability, so the probability we just deleted gets reallocated to the rest of the curve, and the rest of the curve shifts up. So that is trimming. The downside of trimming, as you probably noticed, is that we are literally just throwing away data; in cases where you maybe don't have a ton of data to begin with, this could be a problem.
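A minimal NumPy sketch of trimming at the 5th/95th percentiles (not from the video; the simulated data is arbitrary), mainly to show that roughly 10% of the observations simply disappear.

```python
import numpy as np

rng = np.random.default_rng(1)
values = rng.normal(loc=50, scale=10, size=1_000)

# Trimming: pick percentile thresholds and drop everything outside them.
low, high = np.percentile(values, [5, 95])
trimmed = values[(values >= low) & (values <= high)]

print(f"kept {trimmed.size} of {values.size} observations")         # roughly 900 of 1000
print(f"range before: {values.min():.1f} to {values.max():.1f}")
print(f"range after:  {trimmed.min():.1f} to {trimmed.max():.1f}")  # tails chopped off
```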
  • 00:09:16
    That's where the second strategy comes in. It's related, but has a very different step at the end: it's called winsorizing, probably named after somebody called Winsor. The first part is the same: we still pick low and high thresholds, say the 5th and 95th percentiles again. The big difference is that we don't delete the data beyond either threshold. We take the values that are below the 5th percentile and set them equal to the 5th percentile. The intuition is that anything below the 5th percentile is in some sense abnormal or unexpected, so we take all those values and set them to the most reasonable value that does exist in the data set, one we think is normal, and that would be the 5th percentile. We do the same thing on the other side: we take everything above the 95th percentile and set it equal to the 95th percentile.

    Now let's ask the same question: what does that do to the histogram? We haven't deleted any data in this case, we still have the same number of observations. The only change is that these values, the 5th percentile and the 95th percentile we had up here, both get boosted, because now we have a lot more observations at exactly those values. So the advantage is that we're not throwing away data like we are in trimming, but the disadvantage is that we could have a lot of samples that are now exactly the same, either exactly the 5th or exactly the 95th percentile, so we might be artificially reducing the variance of our data. Those are the trade-offs between these two methods.
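And the NumPy counterpart for winsorizing (again not from the video): np.clip keeps every observation but piles mass onto the two thresholds, which is why the variance shrinks a little. SciPy also ships a helper, scipy.stats.mstats.winsorize, built around the same idea.

```python
import numpy as np

rng = np.random.default_rng(1)
values = rng.normal(loc=50, scale=10, size=1_000)

# Winsorizing: clamp (rather than delete) everything outside the thresholds.
low, high = np.percentile(values, [5, 95])
winsorized = np.clip(values, low, high)

print("observations kept:", winsorized.size)        # all 1000 remain
print(f"std before: {values.std():.2f}")
print(f"std after:  {winsorized.std():.2f}")         # slightly smaller: variance is reduced
```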
  • 00:10:52
    And now, just to close this video, I want to talk about the general downside of doing any kind of default method with your outliers. There are a lot of programming languages out there and a lot of packages to deal with outliers: you can do trimming, you can do winsorizing, you can probably do even more complex things to take your outliers and transform them into something else. But you have to stop and think: any time you're taking an outlier and transforming it into something else, you're inherently making the assumption that you're never going to see that type of outlier again, for example in your testing data. Because if we think about it for a second, we're saying that these values are abnormal, that we probably won't see them again, that they're just a one-off thing, and so we're going to do something reasonable to them. But that may not be the case. It may be that if you look at your testing data, the things you're trying to predict, you could still see outliers just like this, and if you haven't done anything to address the root cause or mechanism that produced these outliers, you're never going to get those correct in the testing data, because you haven't built anything to deal with them. Even more than that, you're probably going to get them very wrong, because you're treating them in the training data as just regular examples instead of anything special, so you're probably going to have big errors when it comes to your testing data.

    A lot of times students will come to me and ask what the right way to deal with outliers is, and they're usually trying to choose between some of these out-of-the-box techniques. I would say that none of these are the right way to deal with outliers. All these out-of-the-box techniques are the fast, easy way to deal with outliers if you want to make quick progress in your project. But the right way to deal with outliers is to stop, take some time, and think about what mechanism produced these outliers and whether that mechanism could still exist in the testing data. If so, we should do something more intelligent: we should look at how the outliers differ from the rest of the data and perhaps build a separate model or treat them in a different way.

    So hopefully you learned about outliers in the context of machine learning: which models are and are usually not affected by outliers, some out-of-the-box techniques to deal with outliers and the pros and cons between them, and most importantly the philosophy of what it means to do anything to an outlier at all and what consequences that can have for your entire data project. Okay, so if you enjoyed this video please like and subscribe for more just like this, and I'll see you next time.
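To make that last point concrete, here is a small sketch (my own construction, not from the video) of what can go wrong: the training set is winsorized so a faulty-sensor mechanism looks harmless, but the same mechanism shows up again at test time and the model's prediction is wildly off.

```python
import numpy as np

rng = np.random.default_rng(2)

# Normal mechanism: y is roughly 2x for x in a typical range.
x_train = rng.uniform(0, 10, size=200)
y_train = 2 * x_train + rng.normal(scale=1.0, size=200)

# A second mechanism (say, a sensor fault) produces huge x but small y.
x_train = np.concatenate([x_train, [80.0, 90.0, 100.0]])
y_train = np.concatenate([y_train, [1.0, 0.0, 2.0]])

# "Out of the box" fix: winsorize x at the 5th/95th percentiles and fit as usual.
low, high = np.percentile(x_train, [5, 95])
slope, intercept = np.polyfit(np.clip(x_train, low, high), y_train, deg=1)

# Test time: the faulty mechanism still exists, so another such point shows up.
x_test, y_test = 95.0, 1.5
print(f"prediction: {slope * x_test + intercept:.1f}, actual: {y_test}")
# Even if the test value were clipped to `high` as well, the prediction
# (around 2 * high) would still be far from the truth, because nothing in
# the pipeline models the fault mechanism itself.
```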
Tags
  • outliers
  • linear regression
  • logistic regression
  • decision trees
  • trimming
  • winsorization
  • machine learning
  • robustness
  • anomalous data
  • data management