00:00:00
hi I'm Trevor Hastie and I'm Rob Tibshirani
00:00:03
hi I'm Hastie and then you say welcome to the
00:00:05
course on statistical learning hi I'm Trevor Hastie and
00:00:07
I'm
00:00:09
Rob hi I'm Trevor I'm Rob Tibshirani and I'm
00:00:12
Trevor Hastie and welcome to our course
00:00:14
on statistical learning this is the
00:00:16
first online course we've ever
00:00:18
given and we're really excited to tell
00:00:19
you about it and a little nervous as you
00:00:21
can hear so uh by way of background what
00:00:24
is statistical learning um Trevor and I
00:00:26
are both statisticians we were actually
00:00:27
graduate students here at Stanford in
00:00:29
the 80s we've known each other for about
00:00:30
30 years oh my goodness and uh back then
00:00:33
uh well we did applied statistics like a
00:00:35
lot of statisticians did statistics has
00:00:37
been around since about 1900 or before
00:00:40
um but in the 1980s people in in
00:00:42
computer science developed a field of
00:00:44
machine learning uh especially neural
00:00:46
networks became a very hot topic I was
00:00:47
at University of Toronto and Trevor was
00:00:49
at Bell Labs and one of the first neural
00:00:51
networks was developed at Bell Labs to
00:00:53
solve the ZIP code recognition
00:00:56
problem which we'll show you a little
00:00:57
bit about in in a few slides so that
00:01:00
time uh Trevor and I and then some
00:01:02
colleagues Jerry Friedman uh Brad Efron
00:01:05
Leo Breiman and actually you
00:01:06
you'll hear from Jerry and Brad both in
00:01:09
this course we have uh some
00:01:11
interviews with them about that time we
00:01:13
started to work on the area of machine
00:01:14
learning and sort of developed our own
00:01:16
view of it which is now called
00:01:18
statistical learning so
00:01:19
along with colleagues here at Stanford
00:01:21
and other places we developed this field
00:01:23
of statistical learning so in this
00:01:25
course we'll talk to you about some of
00:01:26
the developments in this area and give
00:01:28
lots of examples so let's start with our
00:01:30
first example which is um a computer
00:01:34
program playing Jeopardy called Watson
00:01:36
that IBM built um and it beat uh
00:01:40
the players in a three-game match and
00:01:43
the people at IBM who developed
00:01:45
the system said it was really a triumph
00:01:46
of machine learning there was a lot of
00:01:48
very smart technology both hardware but
00:01:50
also the software and the algorithms
00:01:51
were based on machine learning so this
00:01:53
was um a watershed moment I think for
00:01:56
the area of artificial intelligence
00:01:57
and machine learning
00:02:01
Google is a big user of data and a
00:02:03
big analyzer of data and uh here's a
00:02:07
quote that appeared in the New York Times in
00:02:08
2009 from Hal Varian who is chief
00:02:11
economist at Google you can see the
00:02:13
quote I keep saying that the sexy job
00:02:15
in the next 10 years will be
00:02:16
statisticians and indeed there's a
00:02:18
picture of Carrie Grimes who was a
00:02:19
graduate from Stanford statistics she
00:02:21
was one of the first statisticians hired
00:02:24
at Google Now Google has many
00:02:27
statisticians our next example this is a
00:02:29
picture of Nate Silver on the right Nate
00:02:31
has a master's in economics but he
00:02:32
calls himself a statistician and he
00:02:35
writes at least he did write a blog
00:02:37
called FiveThirtyEight for the New York Times and in
00:02:39
that blog uh he
00:02:42
predicted the outcome of the 2012
00:02:44
presidential and Senate elections uh
00:02:46
very well as a matter of fact he got all
00:02:48
the Senate races right and
00:02:50
the presidential election he
00:02:53
predicted very accurately using
00:02:55
statistics using carefully
00:02:57
sampled data from various places and some
00:02:59
careful analysis he did an extremely
00:03:01
accurate job of predicting the
00:03:02
election at a time when a lot of
00:03:04
news outlets weren't sure who was
00:03:06
going to win pretty nerdy-looking guy
00:03:07
isn't he Rob yes but he's very famous
00:03:09
and uh he's like a rock star these days
00:03:12
yes we joke about when you're a
00:03:13
statistician and you go to a party
00:03:15
and someone says you know what do
00:03:16
you do you say I'm a statistician they
00:03:18
run for the door right but nowadays
00:03:20
we can say well we do machine learning
00:03:22
and uh well they still run for the door
00:03:23
but they take a little longer to get
00:03:24
there in fact we now call ourselves data
00:03:27
scientists it's a trendier word
00:03:30
so we're going to run through a number
00:03:32
of statistical learning problems um you
00:03:35
can see there's a bunch of examples on
00:03:36
this page and we'll go through them one
00:03:38
by one just to give you a flavor of
00:03:39
what sorts of problems we're
00:03:40
going to um be thinking about so the
00:03:43
first data set we're going to
00:03:45
look at is on prostate cancer this is
00:03:48
a relatively small data set
00:03:52
sampled from 97 men with prostate
00:03:55
cancer collected actually by a Stanford
00:03:58
physician Dr Stamey in the late 80s and what
00:04:02
we have is the PSA measurement for each
00:04:05
subject along with a number of clinical
00:04:08
and blood measurements from the
00:04:09
patients some measurements on the cancer
00:04:11
itself and some measurements from
00:04:14
the blood um the measurements have to do
00:04:17
with cancer size and the
00:04:19
severity of the cancer and this is a
00:04:22
scatterplot matrix which
00:04:24
actually shows the data and you
00:04:26
see on the diagonal is the name of each
00:04:28
of the variables and each little plot is
00:04:30
a pair of variables so you get in one
00:04:32
picture if you've got a relatively small
00:04:35
number of variables you can see all the
00:04:36
data at once in a picture like this
00:04:38
and you can see the nature of the data
00:04:41
what variables are correlated and so on
00:04:43
and so this is a good way of
00:04:45
getting a view of your data and uh in
00:04:48
this particular case the
00:04:51
goal was to try and predict the PSA from
00:04:54
the other measurements so it's along
00:04:56
the top and you can see there's some
00:04:57
correlations between these
00:05:00
measurements um here's actually another view of
00:05:02
these data um which looks rather
00:05:05
similar except in the one instance
00:05:08
over here which is the log
00:05:10
weight these variables are on the log
00:05:11
scale and this is log weight and you
00:05:15
notice there's a point over here it
00:05:17
looks like somewhat of an outlier well
00:05:20
it turns out on the log scale it looks a
00:05:22
bit like an outlier but when
00:05:23
you look on the normal scale it's
00:05:26
enormous and basically that was a typo
00:05:29
and if that was a real
00:05:31
measurement it would say that
00:05:34
this particular patient had a
00:05:38
449 gram prostate well we got a message
00:05:42
from a retired urologist Dr Steven
00:05:45
Link who pointed this out to us and
00:05:47
so um we corrected an earlier published
00:05:50
version of this scatterplot
00:05:52
which is a good reminder the
00:05:52
first thing to do when you get a set
00:05:54
of data for analysis is not to run it
00:05:55
through a fancy algorithm but to make some
00:05:57
graphs some plots look at the data I
00:05:59
think in the old days before computers
00:06:01
people did that much more because it
00:06:02
wasn't easy I mean you'd do it by hand and
00:06:04
an analysis took many hours so
00:06:06
people would look at the data first much
00:06:07
more and we need to remember that
00:06:09
even with big data you should look
00:06:11
at it first before you jump in with an analysis
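For readers who want to try this advice in code, here is a minimal Python sketch of a scatterplot matrix plus a crude outlier screen. The data frame is synthetic, with hypothetical column names (lcavol, lweight, lpsa) standing in for the prostate measurements, and one log-weight value is deliberately corrupted to mimic the 449-gram typo; it is not the course's dataset or code.

```python
# Minimal sketch of "look at your data first", on synthetic stand-in data.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 97
df = pd.DataFrame({
    "lcavol": rng.normal(1.5, 1.0, n),    # log cancer volume (hypothetical)
    "lweight": rng.normal(3.6, 0.4, n),   # log prostate weight (hypothetical)
    "lpsa": rng.normal(2.5, 1.2, n),      # log PSA, the response (hypothetical)
})
df.loc[31, "lweight"] = np.log(449)       # injected "typo" outlier

# Scatterplot matrix: variable names on the diagonal, pairwise plots elsewhere.
sns.pairplot(df)
plt.show()

# Simple screen: back-transform log weight and flag implausible prostate weights.
df["weight_g"] = np.exp(df["lweight"])
print(df[df["weight_g"] > 200])
```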
00:06:13
so the next example is um
00:06:17
phonemes for two vowel sounds and
00:06:20
this graph shows
00:06:22
the log periodograms for two
00:06:25
different phonemes the power at different
00:06:26
frequencies for two different phonemes uh
00:06:29
aa and ao how do you pronounce those
00:06:32
Trevor AA is odd and AO is ought so as
00:06:36
you can tell Trevor talks funny but
00:06:37
hopefully during the course you'll be
00:06:38
able to understand how would you say it odd and
00:06:42
aw okay so uh you see the log
00:06:46
periodograms at various frequencies
00:06:48
of these two vowel sounds spoken
00:06:50
by different people the orange and the
00:06:52
green and the goal here is to try to
00:06:54
classify the two vowel sounds based on
00:06:57
the power at different frequencies so on
00:06:59
the bottom we see uh a logit model has been
00:07:03
fit to the data
00:07:05
trying to classify the two classes from
00:07:07
each other based on the power at
00:07:10
different frequencies the logit model is
00:07:11
from logistic regression which is used
00:07:13
to classify into one of the two vowel
00:07:15
sounds based on the log
00:07:17
periodogram and we'll cover it in detail
00:07:19
in the course and the
00:07:21
estimated coefficients
00:07:22
from the logistic model are shown as the
00:07:24
gray uh profiles here in the bottom
00:07:28
plot and you can see uh they're very
00:07:30
nonsmooth you'd be hard pressed
00:07:33
to tell where the important frequencies
00:07:34
were but when you apply a kind of
00:07:36
smoothing which we also discuss in the
00:07:38
course where we use the fact that
00:07:40
nearby frequencies should be similar right
00:07:42
which the uh gray fit did not exploit we
00:07:46
get the red curve and the red curve
00:07:48
shows you pretty clearly where the
00:07:49
important frequencies are it looks like um
00:07:52
one vowel sound's got more power around 25
00:07:55
and the other vowel sound has more power
00:07:56
around just before 50
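Below is a rough Python sketch of the idea, not the course's exact smoothed logistic fit: fit a logistic regression to per-frequency features, then smooth the coefficient profile so nearby frequencies share information (a moving average stands in for the spline or penalized smoothing the course uses). X and y are synthetic stand-ins for the log periodograms and the aa/ao labels.

```python
# Rough sketch: logistic regression on per-frequency features, then a crude
# smoothing of the coefficient profile. Synthetic stand-in data throughout.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 200, 256                              # 200 utterances, 256 frequencies
X = rng.normal(size=(n, p))                  # stand-in log periodograms
y = rng.integers(0, 2, size=n)               # stand-in labels: 0 = "aa", 1 = "ao"

clf = LogisticRegression(max_iter=2000).fit(X, y)
beta = clf.coef_.ravel()                     # raw coefficient profile: very rough

# Borrow strength from neighbouring frequencies; a moving average only
# illustrates the idea of the smoother, more structured fit.
window = 11
beta_smooth = np.convolve(beta, np.ones(window) / window, mode="same")
print(beta_smooth[:10])
```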
00:08:01
the next example is to predict whether someone will have a
00:08:03
heart attack on the basis of demographic
00:08:05
diet and clinical measurements so these
00:08:07
are some data on men from
00:08:10
South Africa the red ones are those that
00:08:13
had heart disease and the blue points
00:08:15
are those that didn't it's a case
00:08:17
control sample so all the heart attack
00:08:20
cases were taken as cases and a
00:08:21
sample of the controls was made and the
00:08:24
idea was to understand the risk factors
00:08:26
um for heart disease now when you have
00:08:28
a binary response like this you can
00:08:30
color the scatterplot matrix so you can
00:08:32
see the points which is rather handy
00:08:35
and these data come from a region
00:08:36
of South Africa where the risk
00:08:39
of heart disease is very high it's
00:08:41
over 5% for this age group the people
00:08:45
there especially the men these
00:08:47
were men um they eat lots
00:08:50
of meat they have meat for all three
00:08:52
meals and in fact meat is so prevalent
00:08:56
that chicken is regarded as a vegetable poor
00:08:59
chickens Rob loves that joke I used to love
00:09:01
that joke okay so again you can see
00:09:03
there's correlations in these data and
00:09:06
uh and the goal here is to fit
00:09:09
a model that jointly involves all these
00:09:11
different risk factors in coming up
00:09:13
with a risk model for heart disease
00:09:15
which in this case is um as I
00:09:19
said colored in red
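Here is a hedged sketch of that kind of analysis in Python: color a scatterplot matrix by the binary outcome and fit one logistic regression jointly on the risk factors. The column names (sbp, tobacco, ldl, age, chd) are assumed, and the values are simulated stand-ins for the South African data.

```python
# Hedged sketch: colored scatterplot matrix plus a joint logistic risk model,
# on simulated stand-in data with assumed column names.
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 462
df = pd.DataFrame({
    "sbp": rng.normal(138, 20, n),        # systolic blood pressure
    "tobacco": rng.gamma(2.0, 2.0, n),    # lifetime tobacco use
    "ldl": rng.normal(4.7, 2.0, n),       # LDL cholesterol
    "age": rng.integers(15, 65, n),
})
risk = -6 + 0.02 * df["sbp"] + 0.1 * df["tobacco"] + 0.2 * df["ldl"] + 0.04 * df["age"]
df["chd"] = (rng.random(n) < 1 / (1 + np.exp(-risk))).astype(int)

# Scatterplot matrix with points colored by heart-disease status (0/1).
sns.pairplot(df, vars=["sbp", "tobacco", "ldl", "age"], hue="chd")

# Joint risk model: all the risk factors enter one logistic regression.
model = LogisticRegression(max_iter=1000)
model.fit(df[["sbp", "tobacco", "ldl", "age"]], df["chd"])
print(dict(zip(["sbp", "tobacco", "ldl", "age"], model.coef_.ravel().round(3))))
```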
00:09:21
our next example is email spam
00:09:24
detection now as everyone uses email and
00:09:27
spam is definitely a problem uh and so
00:09:30
um spam filters are a very important
00:09:32
application of statistical machine learning the
00:09:34
data in this table actually um I
00:09:37
think it's from maybe the late 90s
00:09:40
is that right yeah late 90s
00:09:42
exactly it's from uh Hewlett-Packard so
00:09:44
this is a person named George who
00:09:46
worked at Hewlett-Packard so this was
00:09:47
early in the days of email when
00:09:50
spam was also not very sophisticated so
00:09:52
what we have here is data from over
00:09:54
4,000 emails sent to an individual named
00:09:56
George at HP Labs each one has been hand
00:09:58
labeled as either being spam or good
00:10:00
email and the goal here is to try to
00:10:02
predict actually they call good email ham
00:10:04
these days right okay um so the goal
00:10:08
was to try to classify spam from ham based
00:10:11
on the frequencies of words in
00:10:14
the email so here we just have a
00:10:17
summary table of some of the more
00:10:18
important features so uh it's based on
00:10:21
um uh words and characters in the email
00:10:24
so for example this is saying that if
00:10:26
an email had George in it it was more
00:10:27
likely to be good email than spam
00:10:30
back then you know if your
00:10:33
name is George and you saw George
00:10:34
in your email it was more likely to be
00:10:35
good email nowadays of course spam is
00:10:37
much more um sophisticated spammers know
00:10:39
your name they know a lot about your
00:10:40
life and um the fact that your name is in
00:10:43
an email may actually make it less likely that
00:10:45
it's good email but back then
00:10:47
the spammers were much less
00:10:49
sophisticated so for example if your
00:10:52
name was in it it was more likely to be
00:10:53
ham email what's with that remove word
00:10:55
Rob uh remove okay so I guess it probably
00:10:58
said something like don't remove is that
00:11:00
right I think it says if you want to be
00:11:03
removed from this list click I see
00:11:05
that's usually spam right so the goal
00:11:09
was and we'll talk about this
00:11:11
example in detail to use the 57
00:11:14
features and these are seven of
00:11:16
those features together as a classifier
00:11:19
to try to predict whether an email is
00:11:21
spam or ham
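As a small illustration (not the actual HP analysis, which used 57 engineered word- and character-frequency features), here is a Python sketch that builds word-frequency features and fits a simple naive Bayes spam/ham classifier on a few made-up emails.

```python
# Minimal sketch of spam-versus-ham classification from word frequencies.
# The emails and labels are tiny made-up stand-ins, not the HP/George data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = [
    "george please review the attached report",
    "click here to be removed from this list",
    "meeting moved to 3pm george",
    "free money click now remove fees",
]
labels = [0, 1, 0, 1]                    # 0 = ham (good email), 1 = spam

# Word-frequency features: one column per word in the vocabulary.
vec = CountVectorizer()
X = vec.fit_transform(emails)

clf = MultinomialNB().fit(X, labels)
print(clf.predict(vec.transform(["george can you remove this typo"])))
```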
00:11:25
the next example is to identify the numbers in a
00:11:27
handwritten zip code this is what we
00:11:29
were alluding to earlier here are some
00:11:31
handwritten um digits taken from
00:11:34
envelopes um and the goal is based on
00:11:38
an image of any of these digits to
00:11:41
say what the digit is that is to classify it
00:11:43
into the 10 digit classes well to humans
00:11:46
this looks like a pretty easy task um
00:11:49
you know we're pretty good at pattern
00:11:50
recognition it turns out it's a notoriously
00:11:53
difficult task for computers they're
00:11:54
getting better and better all the time
00:11:57
so this was one of the first
00:11:59
learning tasks that was used um to
00:12:04
develop neural networks neural networks
00:12:06
were first brought to bear on this
00:12:07
problem and uh you know we
00:12:10
thought this should be an easy uh problem
00:12:12
to crack it turns out it's really difficult
00:12:15
actually I remember the first time Trevor
00:12:16
that we worked on a machine learning
00:12:17
problem it was this problem and you
00:12:19
were working at Bell Labs I visited Bell
00:12:20
Labs and you had just gotten this data and
00:12:22
you said these people in artificial
00:12:24
intelligence are working on this and
00:12:25
we thought oh let's try some
00:12:26
statistical methods and we tried uh
00:12:28
discriminant analysis right that's right
00:12:30
we got an error rate of about 8 and a
00:12:31
half percent in
00:12:33
about 20 minutes right and the best
00:12:35
error rate anyone else had was about four
00:12:36
or 5% at that point we thought oh this
00:12:38
is going to be easy we're already at 8%
00:12:40
in 10 or 15 minutes six
00:12:44
months later we were maybe at the same
00:12:45
place so we realized actually you know
00:12:47
as is often the case you can get some of
00:12:49
the signal pretty quickly but getting
00:12:51
down to a very good error rate uh in
00:12:53
this case trying to classify
00:12:55
some of the harder-to-classify things
00:12:57
like maybe this four or actually most of
00:13:00
these are pretty easy but if
00:13:01
you look at the database some of them
00:13:03
are very hard so hard that the human eye
00:13:05
can't really tell what they are or it
00:13:06
has difficulty and those are the ones
00:13:08
that the machine learning algorithms
00:13:09
are really challenged by
00:13:12
anyway it's a lovely problem and it has
00:13:13
fascinated uh machine learners and
00:13:16
statisticians for a long time
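For a concrete, hedged illustration of the "try discriminant analysis first" idea, here is a short Python sketch using scikit-learn's built-in 8x8 digits rather than the original ZIP-code images, so the error rate will not match the numbers quoted above.

```python
# Small sketch of linear discriminant analysis on a digit-classification task,
# using scikit-learn's bundled 8x8 digits (not the original ZIP-code data).
from sklearn.datasets import load_digits
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
print("test error:", 1 - lda.score(X_test, y_test))
```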
00:13:19
so the next example comes from medicine
00:13:21
classifying a tissue sample into
00:13:23
one of several cancer classes based on
00:13:25
the gene expression profile so Trevor
00:13:27
and I both work in the medical school
00:13:29
part-time here at Stanford and a lot of
00:13:31
what we do and others do is to try to
00:13:32
use machine learning statistical learning
00:13:35
big data analysis to uh learn about
00:13:39
data in cancer and other diseases so
00:13:41
this is an example of that this is data
00:13:42
on breast cancer it's called gene
00:13:44
expression data so this has been
00:13:46
collected from gene chips and what we
00:13:48
see here on the left is a matrix of
00:13:50
data each row is a gene and there's
00:13:53
about uh 8,000 genes here I think and
00:13:56
each column is a patient and this is
00:13:57
called a heat map so what this heat map
00:14:00
is representing is low and high gene
00:14:02
expression for a given patient for a
00:14:04
given Gene so green meaning low and red
00:14:06
meaning high and gene expression means
00:14:08
the gene is working so if a gene is
00:14:10
expressing it's working hard in the cell
00:14:12
if it's not expressing it's
00:14:13
quiet it's silent and the goal was
00:14:16
to try to figure out which genes or rather
00:14:18
to figure out the patterns of gene
00:14:19
expression these patients are
00:14:21
women with breast cancer trying to
00:14:23
figure out the common patterns of gene
00:14:25
expression for women with breast
00:14:27
cancer and seeing whether there are
00:14:28
subcategories of breast cancer showing
00:14:30
different gene expression so we see
00:14:32
here's a heat map of the full data 88
00:14:34
women in the columns and again
00:14:36
about 8,000 genes in the rows and um
00:14:39
hierarchical clustering which we'll
00:14:40
discuss in the last part of this course
00:14:42
has been applied to the columns and you
00:14:43
see the clustering tree at the top here
00:14:45
which has been expanded for your view at
00:14:48
the top and hierarchical clustering has been
00:14:50
used to divide these women into roughly
00:14:53
one two three four five six subgroups
00:14:56
based on their gene expression they're
00:14:58
very effective especially with these
00:15:00
colors you can just see these clusters
00:15:01
standing out yeah hierarchical clustering and
00:15:03
heat maps have actually been a very important
00:15:05
contribution to genomics which this
00:15:07
is an example of simply because they
00:15:09
enable you to see and to organize the
00:15:11
full set of data in a single
00:15:13
picture and at the bottom right here
00:15:15
we've drilled down to look
00:15:16
more at the gene expression like for
00:15:19
example this subgroup here these red
00:15:21
patients seem to be high largely in
00:15:23
these genes and maybe in these genes so
00:15:27
we'll talk about this example in detail
00:15:28
later on in the course
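Here is a small Python sketch of the heat-map-plus-hierarchical-clustering display on a synthetic expression matrix (genes in rows, patients in columns); it illustrates the technique, not the 8,000-gene breast-cancer analysis.

```python
# Hedged sketch: hierarchical clustering and a heat map of a small synthetic
# "expression" matrix, genes in rows and patients in columns.
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
expr = pd.DataFrame(rng.normal(size=(100, 20)),            # 100 genes x 20 patients
                    index=[f"gene{i}" for i in range(100)],
                    columns=[f"patient{j}" for j in range(20)])

# clustermap clusters both rows and columns hierarchically and draws the
# dendrograms alongside the heat map (green-to-red-style colormap).
sns.clustermap(expr, cmap="RdYlGn_r", method="average")
```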
00:15:32
the next example is to establish the relationship
00:15:34
between salary and demographic variables
00:15:36
in population survey
00:15:39
data so here's some survey data um we
00:15:42
see income from the central Atlantic
00:15:45
region of the USA in 2009 and you see
00:15:50
what you might expect to see as a
00:15:51
function of age income initially goes
00:15:54
up then levels off and then finally goes
00:15:56
down as people get older um
00:15:59
incomes gradually increase with year as
00:16:01
the cost of living increases and incomes
00:16:04
change with education level um
00:16:07
that's the right-hand plot those are box
00:16:09
plots and so yeah we see
00:16:13
three of the variables that affect
00:16:14
income and again the goal is we'll use
00:16:17
regression models to try and understand
00:16:19
the roles of these variables together
00:16:20
and see you know if there are
00:16:22
interactions and so on
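A rough Python sketch of that kind of regression follows: income modeled jointly on age, year, and education with a plain linear model. The data are simulated stand-ins, not the Central Atlantic survey data.

```python
# Rough sketch: income regressed jointly on age, year, and education,
# on simulated stand-in data.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "age": rng.integers(20, 70, n),
    "year": rng.integers(2003, 2010, n),
    "education": rng.integers(1, 6, n),          # ordinal education level
})
# Simulated income: rises then falls with age, drifts up with year and education.
df["income"] = (80 - 0.04 * (df["age"] - 50) ** 2
                + 1.0 * (df["year"] - 2003)
                + 8.0 * df["education"]
                + rng.normal(scale=10, size=n))

fit = LinearRegression().fit(df[["age", "year", "education"]], df["income"])
print(dict(zip(["age", "year", "education"], fit.coef_.round(2))))
```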
00:16:24
and our last example is um Landsat images
00:16:28
of land use in an area of Australia
00:16:31
so this is a rural area of Australia
00:16:33
those are harsh colors Rob did
00:16:35
you choose those colors uh you probably
00:16:37
you're the uh color this is before I
00:16:40
developed taste when did that
00:16:42
happen I didn't see the uh news
00:16:45
memo okay so um these are
00:16:49
from Landsat images so let's start here in
00:16:52
this panel so this is again a
00:16:55
rural area of Australia where the uh land
00:16:57
use has been labeled I think actually by
00:17:00
um by graduate students or um
00:17:03
researchers into one of one two three four
00:17:06
five six classes they don't have to pay
00:17:09
their graduate students as much as
00:17:12
we do so they've been labeled uh in
00:17:14
these colors indicating the different
00:17:15
labels these are the true labels and the
00:17:17
goal is to try to predict these true
00:17:18
labels from um the uh spectral bands at
00:17:22
four frequencies taken from a
00:17:24
satellite image and so here's
00:17:26
the power at the different frequencies uh
00:17:29
in four spectral bands so we have
00:17:31
features which are now pretty
00:17:33
complicated because we have
00:17:35
spatial features four layers of them and
00:17:37
we're going to try to use that
00:17:38
combination of features to predict the
00:17:40
land use that we see here pixel
00:17:43
by pixel right pixel by pixel although
00:17:44
we might want to use the fact that nearby
00:17:46
pixels are more likely to be the same
00:17:48
land use than ones that are far away and
00:17:51
we'll talk about classifiers I think the
00:17:53
one we use here is actually nearest
00:17:54
neighbor it's a very simple classifier
00:17:55
and that produces the prediction in the
00:17:56
bottom right and you can see it's quite
00:17:58
good it's not perfect there are a few
00:18:00
mistakes but it's for the most
00:18:02
part quite accurate
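Here is a minimal Python sketch of pixel-by-pixel classification with a nearest-neighbor classifier; the four "spectral band" features and land-use labels are simulated stand-ins for the LANDSAT data, and the sketch ignores the spatial structure mentioned above.

```python
# Minimal sketch: pixel-by-pixel land-use classification with 1-nearest-neighbor,
# on simulated stand-in spectral features and labels.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_pixels, n_bands, n_classes = 2000, 4, 6
X = rng.normal(size=(n_pixels, n_bands))            # spectral intensities per pixel
y = rng.integers(0, n_classes, size=n_pixels)       # land-use labels

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))
```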
00:18:04
okay so that's the end of the series of
00:18:06
examples in the next session we'll just
00:18:09
tell you some notation and how we set up
00:18:11
problems for supervised learning um
00:18:14
which we'll use for the rest of the
00:18:15
course