Statistical Learning: 1.1 Opening Remarks

00:18:18
https://www.youtube.com/watch?v=LvySJGj-88U

Summary

TLDR: In this course on statistical learning, Trevor Hastie and Rob Tibshirani introduce the basic principles of the field. They discuss their background as statisticians and the evolution of statistical learning, with a focus on machine learning. Examples such as IBM's Watson, Nate Silver's election forecasts, and medical applications such as prostate cancer and heart disease are presented. The course also covers spam detection, handwritten digit recognition, and gene expression in breast cancer. The speakers emphasize the importance of data visualization and of understanding the data before running complex analyses.

Takeaways

  • 👨‍🏫 Introduction to statistical learning by experts Trevor Hastie and Rob Tibshirani.
  • 📊 Examples of real-world applications, such as IBM's Watson and election forecasting.
  • 🩺 Medical applications of statistical learning, such as prostate cancer and heart disease.
  • 📧 Spam detection as an important problem in machine learning.
  • ✍️ Handwritten digit recognition as a challenging task for computers.
  • 🧬 Gene expression analysis in breast cancer to identify subtypes.
  • 📈 The importance of data visualization for understanding data.
  • 🔍 The use of regression models to understand relationships between variables.
  • 🌍 Applications of statistical learning across domains, from healthcare to marketing.
  • 📚 The course offers practical examples and techniques for statistical learning.

Timeline

  • 00:00:00 - 00:05:00

    In the introduction to the statistical learning course, Trevor Hastie and Rob Tibshirani welcome the participants. They share their background as statisticians and their experience with machine learning, which emerged in the 1980s. They discuss the development of statistical learning and introduce a few examples, including IBM's Watson, which was a major milestone in artificial intelligence and machine learning.

  • 00:05:00 - 00:10:00

    The speakers present several statistical learning problems, starting with a data set on prostate cancer. They stress the importance of visualizing data before analyzing it. They then discuss an example of classifying vowel sounds based on sound frequencies, and of predicting heart disease from demographic and clinical data.

  • 00:10:00 - 00:18:18

    The course also covers practical applications such as detecting spam in email, recognizing handwritten digits, and classifying tissue samples in medicine (see the sketch after this timeline). The speakers emphasize the importance of data visualization and the use of regression models to understand relationships between variables, and close with an example of predicting land use from satellite images.
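
For the tissue-sample example, the lecture groups breast cancer patients by their gene expression using hierarchical clustering and a heat map. Below is a minimal sketch of that workflow, assuming Python with NumPy and SciPy; the expression matrix is random stand-in data, sized roughly like the lecture's (about 8,000 genes by 88 women), not the real measurements.

```python
# Hierarchical clustering of patients by gene expression (illustrative only).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(4)
n_genes, n_patients = 8000, 88
expr = rng.normal(0, 1, size=(n_genes, n_patients))  # rows: genes, cols: patients

# Cluster the columns (patients); linkage builds the tree that is drawn
# above a heat map, and fcluster cuts it into a chosen number of subgroups.
tree = linkage(expr.T, method="average", metric="correlation")
labels = fcluster(tree, t=6, criterion="maxclust")   # six subgroups, as in the lecture
print("patients per subgroup:", np.bincount(labels)[1:])
```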

Video Q&A

  • What is statistical learning?

    Statistical learning is a field that focuses on developing models and algorithms to identify patterns in data and make predictions.
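
To make the definition concrete, here is a minimal sketch of the basic supervised-learning recipe: fit a model on training data, then predict on data the model has not seen. It is illustrative only and assumes Python with NumPy and scikit-learn (the course's own labs use R); the data are synthetic.

```python
# Minimal sketch of the supervised-learning recipe (not from the lecture).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))           # one predictor
y = 2.0 * X[:, 0] + rng.normal(0, 1, size=200)  # response = pattern + noise

# Hold out part of the data to check how well the fitted model generalizes.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print("estimated slope:", model.coef_[0])        # should be close to the true 2.0
print("test R^2:", model.score(X_test, y_test))  # fit quality on unseen data
```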

  • Who are the instructors of the course?

    The instructors are Trevor Hastie and Rob Tibshirani, both experienced statisticians.

  • What are some applications of statistical learning?

    Applications include medical diagnosis, spam detection, election forecasting, and more.
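
As a concrete illustration of the spam application, the lecture describes classifying email as spam or ham from word frequencies. Here is a minimal sketch of that idea, assuming Python with scikit-learn; the tiny word-frequency data set below is invented for illustration and is not the HP Labs data.

```python
# Toy spam-vs-ham classifier from word frequencies (illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression

# Columns: relative frequency of the words ["george", "remove", "free"]
# ("free" is a made-up third feature; "george" and "remove" appear in the lecture).
X = np.array([
    [0.9, 0.0, 0.1],   # ham: mentions the recipient's name
    [1.2, 0.0, 0.0],   # ham
    [0.7, 0.1, 0.2],   # ham
    [0.0, 0.8, 0.5],   # spam: "click here to be removed", "free offer"
    [0.1, 0.6, 0.9],   # spam
    [0.0, 0.9, 0.7],   # spam
])
y = np.array([0, 0, 0, 1, 1, 1])  # 0 = ham, 1 = spam

clf = LogisticRegression().fit(X, y)
# Positive coefficients push toward spam, negative toward ham; in the
# lecture's table "george" pointed to ham and "remove" pointed to spam.
print(dict(zip(["george", "remove", "free"], clf.coef_[0].round(2))))
print("new email:", clf.predict([[0.0, 0.7, 0.4]]))  # likely classified as spam
```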

  • Why is data visualization important?

    Data visualization helps you understand the structure and patterns in the data before running complex analyses.
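
A minimal sketch of this advice in practice: the scatterplot matrix the lecture shows for the prostate data, which is how the 449 g weight typo was spotted. This assumes Python with pandas and matplotlib; the numbers below are random stand-ins, not the real data (the real set has 97 men and variables such as lpsa, lcavol, and lweight).

```python
# Scatterplot matrix: look at the data before fitting anything (illustrative).
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
n = 97  # the lecture's prostate data set has 97 men
df = pd.DataFrame({
    "lpsa":    rng.normal(2.5, 1.0, n),  # log PSA, the response
    "lcavol":  rng.normal(1.3, 1.2, n),  # log cancer volume
    "lweight": rng.normal(3.6, 0.4, n),  # log prostate weight
    "age":     rng.integers(45, 80, n),
})

# Each off-diagonal panel plots one pair of variables, so correlated pairs
# and outliers (like a 449 g prostate entered by mistake) stand out at once.
scatter_matrix(df, figsize=(8, 8), diagonal="hist")
plt.show()
```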

  • What is the goal of the course?

    The goal is to introduce participants to statistical learning and provide them with practical examples and techniques.

Subtitles (en)
  • 00:00:00
    hi I'm Trevor Hastie and I'm Rob Tibshirani
  • 00:00:03
    hi Hastie and then you say welcome to the
  • 00:00:05
    course on statistical hi I'm Trevor Hastie and
  • 00:00:07
    I'm
  • 00:00:09
    Rob hi I'm Trevor I'm Rob Tibshirani and I'm
  • 00:00:12
    Trevor Hastie and welcome to our course
  • 00:00:14
    on statistical learning this is the
  • 00:00:16
    first online course we've ever
  • 00:00:18
    given and we're really excited to tell
  • 00:00:19
    you about it and a little nervous as you
  • 00:00:21
    can hear so uh by way of background what
  • 00:00:24
    is statistical learning um Trevor and I
  • 00:00:26
    are both statisticians we were actually
  • 00:00:27
    graduate students here at Stanford in
  • 00:00:29
    the 80s we've known each other for about
  • 00:00:30
    30 years oh my goodness and uh back then
  • 00:00:33
    uh well we did applied statistics like a
  • 00:00:35
    lot of statisticians did statistics has
  • 00:00:37
    been around since about 1900 or before
  • 00:00:40
    um but in the 1980s people in
  • 00:00:42
    computer science developed a field of
  • 00:00:44
    machine learning uh especially neural
  • 00:00:46
    networks became a very hot topic I was
  • 00:00:47
    at University of Toronto and Trevor was
  • 00:00:49
    at Bell Labs and one of the first neural
  • 00:00:51
    networks was developed at Bell Labs to
  • 00:00:53
    solve the ZIP code recognition
  • 00:00:56
    problem which we'll show you a little
  • 00:00:57
    bit about in a few slides so at that
  • 00:01:00
    time uh Trevor and I and some
  • 00:01:02
    colleagues Jerry Friedman uh Brad Efron
  • 00:01:05
    Leo Breiman and you actually
  • 00:01:06
    you'll hear from Jerry and Brad both in
  • 00:01:09
    this course we have uh some
  • 00:01:11
    interviews with them about that time we
  • 00:01:13
    started to work on the area of machine
  • 00:01:14
    learning and sort of developed our own
  • 00:01:16
    view of it which is now called
  • 00:01:18
    statistical learning so
  • 00:01:19
    along with colleagues here at Stanford
  • 00:01:21
    and other places we developed this field
  • 00:01:23
    of statistical learning so in this
  • 00:01:25
    course we'll talk to you about some of
  • 00:01:26
    the developments in this area and give
  • 00:01:28
    lots of examples so let's start with our
  • 00:01:30
    first example which is um a computer
  • 00:01:34
    program playing Jeopardy called Watson
  • 00:01:36
    that IBM built um and it beat uh
  • 00:01:40
    the players in a three-game match and
  • 00:01:43
    the people at IBM who developed
  • 00:01:45
    the system said it was really a triumph
  • 00:01:46
    of machine learning there was a lot of
  • 00:01:48
    very smart technology both hardware but
  • 00:01:50
    also the software and the algorithms
  • 00:01:51
    were based on machine learning so this
  • 00:01:53
    was um a watershed moment I think for
  • 00:01:56
    the area of artificial intelligence
  • 00:01:57
    and machine learning
  • 00:02:01
    Google is a big user of data and a
  • 00:02:03
    big analyzer of data and uh here's a
  • 00:02:07
    quote that came in the New York Times in
  • 00:02:08
    2009 from Hal Varian who's chief
  • 00:02:11
    economist at Google you can see the
  • 00:02:13
    quote they keep saying that the sexy job
  • 00:02:15
    in the next 10 years will be
  • 00:02:16
    statisticians and indeed there's a
  • 00:02:18
    picture of Carrie Grimes who was a
  • 00:02:19
    graduate from Stanford statistics she
  • 00:02:21
    was one of the first statisticians hired
  • 00:02:24
    at Google now Google has many
  • 00:02:27
    statisticians our next example this is a
  • 00:02:29
    picture of Nate Silver on the right Nate
  • 00:02:31
    has a master's in economics but he
  • 00:02:32
    calls himself a statistician and he
  • 00:02:35
    writes at least he did write a blog
  • 00:02:37
    called 538 for the New York Times and in
  • 00:02:39
    in that blog uh he
  • 00:02:42
    predicted the outcome of the 2012
  • 00:02:44
    presidential and Senate elections uh
  • 00:02:46
    very very well matter of fact he got all
  • 00:02:48
    the Senate races right and
  • 00:02:50
    the uh presidential election he
  • 00:02:53
    predicted very very accurately using
  • 00:02:55
    statistics using carefully uh carefully
  • 00:02:57
    sampled data from various places and some
  • 00:02:59
    careful analysis he did an extremely
  • 00:03:01
    accurate job of predicting the
  • 00:03:02
    election when a
  • 00:03:04
    lot of news outlets weren't sure who was
  • 00:03:06
    going to win pretty nerdy looking guy
  • 00:03:07
    isn't he Rob yes but he's very famous
  • 00:03:09
    and uh he's like a rock star these days
  • 00:03:12
    yes we joke about when you're a
  • 00:03:13
    statistician when you go to a party
  • 00:03:15
    and someone says you know what do
  • 00:03:16
    you do you say I'm a statistician
  • 00:03:18
    they run for the door right but nowadays
  • 00:03:20
    we can say well we do machine learning
  • 00:03:22
    and uh well they still run for the door
  • 00:03:23
    but they take a little longer to get
  • 00:03:24
    there in fact we now call ourselves data
  • 00:03:27
    scientists it's a trendier word
  • 00:03:30
    so we're going to run through a number
  • 00:03:32
    of statistical learning problems um you
  • 00:03:35
    can see there's a bunch of examples on
  • 00:03:36
    this page and we'll go through them one
  • 00:03:38
    by one just to give you a flavor of
  • 00:03:39
    what sorts of problems we're
  • 00:03:40
    going to um be thinking about so the
  • 00:03:43
    first data set we're going to
  • 00:03:45
    look at is on prostate uh cancer this is
  • 00:03:48
    a relatively small data set 97 um
  • 00:03:52
    men sampled from 97 men with prostate
  • 00:03:55
    cancer actually by Stanford uh um
  • 00:03:58
    physician Dr Stamey in the late 80s and what
  • 00:04:02
    we have is the PSA measurement for each
  • 00:04:05
    subject along with a number of clinical
  • 00:04:08
    and blood measurements from the
  • 00:04:09
    patients some measurements on the cancer
  • 00:04:11
    itself and some measurements from
  • 00:04:14
    the blood um the measurements to do
  • 00:04:17
    with cancer size and the
  • 00:04:19
    severity of the cancer and this is a
  • 00:04:22
    scatterplot matrix which
  • 00:04:24
    actually shows the data and you
  • 00:04:26
    see on the diagonal is the name of each
  • 00:04:28
    of the variables and each little plot is
  • 00:04:30
    a pair of variables so you get in one
  • 00:04:32
    picture if you've got a relatively small
  • 00:04:35
    number of variables you can see all the
  • 00:04:36
    data at once in a picture like this
  • 00:04:38
    and you can see the nature of the data
  • 00:04:41
    what variables are correlated and so on
  • 00:04:43
    and so this is a good way of
  • 00:04:45
    getting a view of your data and uh in
  • 00:04:48
    this particular case the
  • 00:04:51
    goal was to try and predict the PSA from
  • 00:04:54
    the other measurements so it's along
  • 00:04:56
    the top and you can see there's some
  • 00:04:57
    correlations between these
  • 00:05:00
    measurements um here's actually another view of
  • 00:05:02
    these data um which um looks rather
  • 00:05:05
    similar except in the one instance
  • 00:05:08
    over here which is the log
  • 00:05:10
    weight these variables are on the log
  • 00:05:11
    scale and this is log weight and you
  • 00:05:15
    notice there's a point over here it
  • 00:05:17
    looks like somewhat of an outlier well
  • 00:05:20
    it turns out on the log scale it looks a
  • 00:05:22
    bit like an outlier but when
  • 00:05:23
    you look on the normal scale it's
  • 00:05:26
    enormous and basically that was a typo
  • 00:05:29
    and that would say if that was a real
  • 00:05:31
    measurement it would say that a patient
  • 00:05:34
    this particular patient would have a um
  • 00:05:38
    a 449 g prostate well we got a message
  • 00:05:42
    from a retired urologist Dr Steven
  • 00:05:45
    Link who pointed this out to us and
  • 00:05:47
    so um we corrected an earlier published
  • 00:05:50
    version of this scatterplot
  • 00:05:52
    which is a good thing to remember the
  • 00:05:52
    first thing to do when you get a set
  • 00:05:54
    of data for analysis is not to run it
  • 00:05:55
    through a fancy algorithm make some
  • 00:05:57
    graphs some plots look at the data I
  • 00:05:59
    think in the old days before computers
  • 00:06:01
    people did that much more because it
  • 00:06:02
    wasn't easy I mean you'd do it by hand and
  • 00:06:04
    an analysis took many many hours so
  • 00:06:06
    people would look first at the data much
  • 00:06:07
    more and we need to remember
  • 00:06:09
    that even with big data you should look
  • 00:06:11
    at it first before you jump in with
  • 00:06:13
    an analysis so the next example is um
  • 00:06:17
    phonemes um for two vowel sounds and
  • 00:06:20
    this graph has
  • 00:06:22
    the log periodograms uh for two
  • 00:06:25
    different phonemes the power at different
  • 00:06:26
    frequencies for two different phonemes uh
  • 00:06:29
    AA and AO how do you pronounce those
  • 00:06:32
    Trevor AA is odd and AO is ought so as
  • 00:06:36
    you can tell Trevor talks funny but
  • 00:06:37
    hopefully during the course you'll be
  • 00:06:38
    able to how would you say it odd and ought
  • 00:06:42
    okay okay so uh you see the log
  • 00:06:46
    periodograms at various frequencies of
  • 00:06:48
    um these two vowel sounds as spoken
  • 00:06:50
    by different people the orange and the
  • 00:06:52
    green and the goal here is to try to
  • 00:06:54
    classify the two vowel sounds based on
  • 00:06:57
    the power at different frequencies so on
  • 00:06:59
    the bottom we see uh a logit model's been
  • 00:07:03
    fit to the data looking at
  • 00:07:05
    trying to classify the two classes from
  • 00:07:07
    each other based on the power at
  • 00:07:10
    different frequencies the logit model is
  • 00:07:11
    from logistic regression which is used
  • 00:07:13
    to classify into one of the two vowel
  • 00:07:15
    sounds based on the log
  • 00:07:17
    periodogram and we'll cover it in detail
  • 00:07:19
    in the course and the
  • 00:07:21
    coefficients the estimated coefficients
  • 00:07:22
    from the logistic model are in the
  • 00:07:24
    gray uh profiles here in the bottom
  • 00:07:28
    plot and you can see uh they're very
  • 00:07:30
    non-smooth you'd be hard pressed
  • 00:07:33
    to tell where the important frequencies
  • 00:07:34
    were but when you apply a kind of
  • 00:07:36
    smoothing which we also discuss in the
  • 00:07:38
    course where we use the fact that
  • 00:07:40
    nearby frequencies should be similar right
  • 00:07:42
    which the uh gray did not exploit we
  • 00:07:46
    get the red curve and the red curve
  • 00:07:48
    shows you pretty clearly that the
  • 00:07:49
    important frequencies are it looks like um
  • 00:07:52
    one vowel sound's got more power around 25
  • 00:07:55
    and the other vowel sound has more power
  • 00:07:56
    around just before 50
  • 00:08:01
    predict whether someone will have a
  • 00:08:03
    heart attack on the basis of demographic
  • 00:08:05
    diet and clinical measurements so these
  • 00:08:07
    are some data on actually men from
  • 00:08:10
    South Africa the red ones are those that
  • 00:08:13
    had heart disease and the blue points
  • 00:08:15
    are those that didn't it's a case
  • 00:08:17
    control sample so all the heart attack
  • 00:08:20
    cases were taken as cases and a
  • 00:08:21
    sample of the controls was made and the
  • 00:08:24
    idea was to understand the risk factors
  • 00:08:26
    um in heart disease now when you have
  • 00:08:28
    a binary response like this you can
  • 00:08:30
    color the scatterplot matrix so you can
  • 00:08:32
    see the points which is rather handy
  • 00:08:35
    and these data come from a region
  • 00:08:36
    of South Africa where the risk
  • 00:08:39
    of heart disease is very high it's over
  • 00:08:41
    uh over 5% for this age group the people
  • 00:08:45
    especially the men they eat
  • 00:08:47
    lots of these were men um they eat lots
  • 00:08:50
    of meat they have meat for all three
  • 00:08:52
    meals and in fact meat's so prevalent
  • 00:08:56
    chicken's regarded as a vegetable poor
  • 00:08:59
    chicken loves that joke I used to love
  • 00:09:01
    that joke okay so again you can see
  • 00:09:03
    there's correlations in these data and
  • 00:09:06
    uh and the goal here is to fit
  • 00:09:09
    a model that jointly involves all these
  • 00:09:11
    different risk factors in coming up
  • 00:09:13
    with a risk model for heart disease
  • 00:09:15
    which in this case is um as I
  • 00:09:19
    said colored in
  • 00:09:21
    red our next example is email spam
  • 00:09:24
    detection now everyone uses email and
  • 00:09:27
    spam is definitely a problem uh and so
  • 00:09:30
    um spam filters are a very important
  • 00:09:32
    application of machine learning the
  • 00:09:34
    data on this table actually um I
  • 00:09:37
    think it's from maybe the late 90s
  • 00:09:40
    is that right before yeah late 90s
  • 00:09:42
    exactly it's from uh Hewlett-Packard so
  • 00:09:44
    this is a person named George who
  • 00:09:46
    worked at Hewlett-Packard so this was
  • 00:09:47
    early in the days of email when
  • 00:09:50
    spam was also not very sophisticated so
  • 00:09:52
    what we have here is data from over
  • 00:09:54
    4,000 emails sent to an individual named
  • 00:09:56
    George at HP Labs each one's been hand
  • 00:09:58
    labeled as either being spam or good
  • 00:10:00
    email and the goal here is to try to
  • 00:10:02
    predict actually they call good email ham
  • 00:10:04
    these days right okay um so the goal
  • 00:10:08
    was to try to classify spam from ham based
  • 00:10:11
    on the frequencies of words in
  • 00:10:14
    the email so here we just have a
  • 00:10:17
    summary table of some of the more
  • 00:10:18
    important features so uh it's based on
  • 00:10:21
    um uh words and characters in the email
  • 00:10:24
    so for example this is saying that if
  • 00:10:26
    this email had George in it it was more
  • 00:10:27
    likely to be good email than spam
  • 00:10:30
    back then you know if you saw
  • 00:10:33
    your name is George and you saw George
  • 00:10:34
    in your email it was more likely to be
  • 00:10:35
    good email nowadays of course spam is
  • 00:10:37
    much more um sophisticated they know
  • 00:10:39
    your name they know a lot about your
  • 00:10:40
    life and um the fact that your name's in it
  • 00:10:43
    may actually make it a smaller chance
  • 00:10:45
    that it's good email but back then
  • 00:10:47
    the spammers were much less
  • 00:10:49
    sophisticated so for example if your
  • 00:10:52
    name was in it there was more chance it'd be
  • 00:10:53
    ham email what's with that remove word
  • 00:10:55
    Rob uh remove okay so I guess it probably
  • 00:10:58
    said something like don't remove is that
  • 00:11:00
    right I think it says if you want to be
  • 00:11:03
    removed from this list click I see
  • 00:11:05
    that's usually spam right so the goal
  • 00:11:09
    was and we'll talk about this
  • 00:11:11
    example in detail to use the 57
  • 00:11:14
    features and these are seven of
  • 00:11:16
    those features as a classifier together
  • 00:11:19
    to try to predict whether an email is
  • 00:11:21
    spam or
  • 00:11:25
    ham identify the numbers in a
  • 00:11:27
    handwritten zip code this is what we
  • 00:11:29
    were alluding to earlier here some
  • 00:11:31
    handwritten um digits taken from
  • 00:11:34
    envelopes um and the goal is to based on
  • 00:11:38
    an image of any of these digits to
  • 00:11:41
    say what the digit is to classify
  • 00:11:43
    into the 10 digit classes well to humans
  • 00:11:46
    this looks like a pretty easy task um
  • 00:11:49
    you know we're pretty good at pattern
  • 00:11:50
    recognition turns out it's a notoriously
  • 00:11:53
    difficult task for computers they're
  • 00:11:54
    getting better and better all the time
  • 00:11:57
    so this was one of the first
  • 00:11:59
    learning tasks that was used um to uh
  • 00:12:04
    develop neural networks neural networks
  • 00:12:06
    were first brought to bear on this
  • 00:12:07
    problem and uh you know we
  • 00:12:10
    thought this should be an easy uh problem
  • 00:12:12
    to crack turns out it's really difficult
  • 00:12:15
    actually I remember the first time Trev
  • 00:12:16
    that we worked on a machine learning
  • 00:12:17
    problem it was this problem and you
  • 00:12:19
    were working at Bell Labs I visited Bell
  • 00:12:20
    Labs and you'd just gotten this data and
  • 00:12:22
    you said these people in artificial
  • 00:12:24
    intelligence are working on this and
  • 00:12:25
    we thought oh let's try some
  • 00:12:26
    statistical methods and we tried uh
  • 00:12:28
    discriminant analysis right that's right
  • 00:12:30
    we got an error rate of about 8 and a
  • 00:12:31
    half percent in
  • 00:12:33
    about 20 minutes right and the best
  • 00:12:35
    error rate anyone else had was about four
  • 00:12:36
    or 5% at that point we thought oh this
  • 00:12:38
    is going to be easy we're already at 8%
  • 00:12:40
    in 10 or 15 minutes six months later six
  • 00:12:44
    months later we were maybe at the same
  • 00:12:45
    place so we realized actually you know
  • 00:12:47
    as is often the case you can get some of
  • 00:12:49
    the signal pretty quickly but getting
  • 00:12:51
    down to a very good error rate uh in
  • 00:12:53
    this case trying to classify
  • 00:12:55
    some of the harder to classify things
  • 00:12:57
    like maybe this four or actually most of
  • 00:13:00
    these are pretty easy but if
  • 00:13:01
    you look at the database some of them
  • 00:13:03
    are very hard so hard that the human eye
  • 00:13:05
    can't really tell what they are or it
  • 00:13:06
    has difficulty and those are the ones
  • 00:13:08
    that the machine learning algorithms
  • 00:13:09
    are really challenged by
  • 00:13:12
    anyway it's a lovely problem and it's
  • 00:13:13
    fascinated uh machine learners and
  • 00:13:16
    statisticians for a long time so the
  • 00:13:19
    next example comes from medicine
  • 00:13:21
    classifying a tissue sample into
  • 00:13:23
    one of several cancer classes based on
  • 00:13:25
    the gene expression profile so Trevor
  • 00:13:27
    and I both work in the medical school
  • 00:13:29
    part-time here at Stanford and a lot of
  • 00:13:31
    what we do and others do is to try to
  • 00:13:32
    use machine learning statistical learning
  • 00:13:35
    big data analysis to uh learn about um
  • 00:13:39
    data in cancer and other diseases so
  • 00:13:41
    this is an example of that this is data
  • 00:13:42
    in breast cancer it's called gene
  • 00:13:44
    expression data so this has been
  • 00:13:46
    collected from gene chips and what we
  • 00:13:48
    see here on the left is a matrix of
  • 00:13:50
    data each row is a gene and there's
  • 00:13:53
    about uh 8,000 genes here I think and
  • 00:13:56
    each column is a patient and this is
  • 00:13:57
    called a heat map so what this heat map
  • 00:14:00
    is representing is low and high gene
  • 00:14:02
    expression for a given patient for a
  • 00:14:04
    given gene so green meaning low and red
  • 00:14:06
    meaning high and gene expression means
  • 00:14:08
    the gene is working so if a gene is
  • 00:14:10
    expressing it's working hard in the cell
  • 00:14:12
    if it's not expressing it's
  • 00:14:13
    quiet it's silent and the goal was
  • 00:14:16
    to try to figure out which genes well
  • 00:14:18
    try to figure out the pattern of gene
  • 00:14:19
    expression these are patients these are
  • 00:14:21
    women with breast cancer trying to
  • 00:14:23
    figure out the common patterns of gene
  • 00:14:25
    expression for women with breast
  • 00:14:27
    cancer and seeing whether there are
  • 00:14:28
    subcategories of breast cancer showing
  • 00:14:30
    different gene expression so we see
  • 00:14:32
    here's a heat map of the full data 88
  • 00:14:34
    women in the columns and again
  • 00:14:36
    about 8,000 genes in the rows and um
  • 00:14:39
    hierarchical clustering which we'll
  • 00:14:40
    discuss in the last part of this course
  • 00:14:42
    has been applied to the columns and you
  • 00:14:43
    see the clustering tree at the top here
  • 00:14:45
    which has been expanded for your view at
  • 00:14:48
    the top and hierarchical clustering has been
  • 00:14:50
    used to divide these women into roughly
  • 00:14:53
    one two three four five six subgroups
  • 00:14:56
    based on their gene expression they're
  • 00:14:58
    very effective especially with these
  • 00:15:00
    colors you can just see these clusters
  • 00:15:01
    standing out yeah hierarchical clustering and
  • 00:15:03
    heat maps have actually been a very important
  • 00:15:05
    contribution for genomics which this
  • 00:15:07
    is an example of simply because they
  • 00:15:09
    enable you to see and to organize the
  • 00:15:11
    full set of data just in a single
  • 00:15:13
    picture and the bottom right here are
  • 00:15:15
    some more we've drilled down to look
  • 00:15:16
    more at the gene expression like for
  • 00:15:19
    example this subgroup here these red
  • 00:15:21
    patients seem to be high largely in
  • 00:15:23
    these genes and maybe in these genes so
  • 00:15:27
    we'll talk about this example in detail
  • 00:15:28
    later on in the
  • 00:15:32
    course establish the relationship
  • 00:15:34
    between salary and demographic variables
  • 00:15:36
    in population survey
  • 00:15:39
    data so here's some survey data um we
  • 00:15:42
    see income from the central Atlantic
  • 00:15:45
    region of the USA in 2009 and you see
  • 00:15:50
    what you might expect to see as a
  • 00:15:51
    function of age income initially goes
  • 00:15:54
    up then levels off and then finally goes
  • 00:15:56
    down as people get older um
  • 00:15:59
    incomes gradually increase with year as
  • 00:16:01
    the cost of living increases and incomes
  • 00:16:04
    change with education level um
  • 00:16:07
    that's the right hand plot those are box
  • 00:16:09
    plots and so yeah we see
  • 00:16:13
    three of the variables that affect
  • 00:16:14
    income and again the goal is we'd use
  • 00:16:17
    regression models to try and understand
  • 00:16:19
    the roles of these variables together
  • 00:16:20
    and see you know if there's
  • 00:16:22
    interactions and
  • 00:16:24
    so and our last example is um Landsat images
  • 00:16:28
    of land use uh in an area in Australia
  • 00:16:31
    so this is a rural area of Australia
  • 00:16:33
    those are harsh colors Rob did
  • 00:16:35
    you choose those colors uh you probably
  • 00:16:37
    you're the uh color this was before I
  • 00:16:40
    developed taste when did that
  • 00:16:42
    happen I didn't see the uh news
  • 00:16:45
    memo okay so um here are uh these are
  • 00:16:49
    from Landsat images so let's start here in
  • 00:16:52
    in this panel so this is again a
  • 00:16:55
    rural area of Australia where the uh land
  • 00:16:57
    use has been labeled I think actually by
  • 00:17:00
    um by graduate students or um
  • 00:17:03
    researchers into one of one two three four
  • 00:17:06
    five six they don't have to pay their
  • 00:17:09
    graduate students as much as
  • 00:17:12
    we do so they've been labeled uh in
  • 00:17:14
    these colors indicating the different
  • 00:17:15
    labels these are the true labels and the
  • 00:17:17
    goal is to try to predict these true
  • 00:17:18
    labels from um the uh spectral bands at
  • 00:17:22
    four frequencies taken from a um
  • 00:17:24
    satellite image and so here's
  • 00:17:26
    the power at the different frequencies uh
  • 00:17:29
    in four spectral bands so we have
  • 00:17:31
    features which are now pretty
  • 00:17:33
    complicated because we have features
  • 00:17:35
    spatial features four layers of them and
  • 00:17:37
    we're going to try to use those
  • 00:17:38
    combinations of features to predict the
  • 00:17:40
    land use that we see here and pixel
  • 00:17:43
    by pixel right pixel by pixel although
  • 00:17:44
    we might want to use the fact that nearby
  • 00:17:46
    pixels are more likely to be the same
  • 00:17:48
    land use than ones that are far away and
  • 00:17:51
    we'll talk about classifiers I think the
  • 00:17:53
    one we use here is actually nearest
  • 00:17:54
    neighbor it's a very simple classifier
  • 00:17:55
    and that produces the prediction in the
  • 00:17:56
    bottom right and you can see it's quite
  • 00:17:58
    good it's not perfect there's a few
  • 00:18:00
    mistakes it makes but it's for the most
  • 00:18:02
    part quite accurate okay so that's
  • 00:18:04
    the end of uh the series of
  • 00:18:06
    examples in the next session we'll just
  • 00:18:09
    tell you some notation and how we set up
  • 00:18:11
    problems for supervised learning um and
  • 00:18:14
    which we'll use for the rest of the
  • 00:18:15
    course
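
The last example in the transcript classifies Landsat pixels into land-use classes with a nearest-neighbor classifier. Here is a minimal sketch of that classifier, assuming Python with scikit-learn; the four "spectral band" features below are synthetic stand-ins, not the actual Landsat data.

```python
# Nearest-neighbor classification of pixels into land-use classes (illustrative).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n_per_class, n_classes, n_bands = 200, 6, 4  # six classes, four spectral bands

# Give each land-use class its own mean spectral signature plus noise.
centers = rng.uniform(0, 10, size=(n_classes, n_bands))
X = np.vstack([c + rng.normal(0, 1.5, size=(n_per_class, n_bands)) for c in centers])
y = np.repeat(np.arange(n_classes), n_per_class)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each test pixel gets the majority label of its 5 nearest training pixels.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))
```
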
Tags
  • statistical learning
  • machine learning
  • data analysis
  • data visualization
  • medical applications
  • spam detection
  • election forecasting
  • handwritten digit recognition
  • gene expression
  • data science