[00:00:01] Okay, now we're going to talk about the supervised learning problem and set down a little bit of notation. We'll have an outcome measurement Y, which goes by various names: dependent variable, response, or target. And then we'll have a vector of p predictor measurements, usually called X, which go by the names inputs, regressors, covariates, features, or independent variables.
[00:00:24] We distinguish two cases. One is the regression problem, where Y is quantitative, such as price or blood pressure. In the classification problem, Y takes values in a finite, unordered set, such as survived or died, the digit classes 0 through 9, or the cancer class of a tissue sample.
[00:00:43] Now, we have training data: pairs (x1, y1), (x2, y2), up to (xn, yn). So again, x1 is a vector of p measurements, and y1 is usually a single response variable. These pairs are examples, or instances, of the measurements.
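As a minimal sketch of this setup in R (the course's language), using simulated data and illustrative variable names of our own choosing:

```r
# A minimal sketch of the supervised learning setup (simulated data;
# the names n, p, X, y_quant, y_class are illustrative, not from the lecture).
set.seed(1)
n <- 100; p <- 3                        # n training cases, p predictor measurements
X <- matrix(rnorm(n * p), n, p)         # each row x_i is a vector of p measurements
colnames(X) <- c("x1", "x2", "x3")

# Regression: y is quantitative (think price or blood pressure)
y_quant <- 2 * X[, 1] - X[, 2] + rnorm(n)

# Classification: y takes values in a finite, unordered set
y_class <- factor(ifelse(X[, 1] + rnorm(n) > 0, "survived", "died"))
```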
[00:01:00] The objectives of supervised learning are as follows. On the basis of the training data we would like to: accurately predict unseen test cases; understand which inputs affect the outcome, and how; and assess the quality of our predictions and inferences.
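Continuing the simulated data above, here is a hedged illustration of those objectives: fit on the training cases, predict held-out "unseen" cases, and score the predictions (this split-and-score recipe is our illustration, not a specific method from the lecture):

```r
# Fit on training data, then assess predictions on held-out cases.
dat   <- data.frame(X, y = y_quant)
train <- sample(n, 70)                            # 70 cases for training
fit   <- lm(y ~ x1 + x2 + x3, data = dat[train, ])
pred  <- predict(fit, newdata = dat[-train, ])    # predict the unseen test cases
mean((dat$y[-train] - pred)^2)                    # test mean squared error
summary(fit)$coefficients                         # which inputs affect the outcome, and how
```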
[00:01:18] By way of philosophy: as you take this course, we want not just to give you a laundry list of methods. We want you to understand the ideas behind the various techniques, so you know where and when to use them. Because in your own work, you're going to have problems that we've never seen before, that you've never seen before, and you want to be able to judge which methods are likely to work well and which ones are not.
[00:01:40] Not only is prediction accuracy important; it's also important to try simple methods first, in order to grasp the more sophisticated ones. We're going to spend quite a bit of time on linear models: linear regression and linear logistic regression. These are simple methods, but they're very effective.
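For example, logistic regression on the simulated classification response from earlier; a minimal sketch using R's standard glm with family = binomial:

```r
# Logistic regression: a simple but very effective classifier.
fit_log <- glm(y_class ~ x1 + x2 + x3,
               data = data.frame(X, y_class), family = binomial)
probs <- predict(fit_log, type = "response")      # fitted P(y_class = "survived")
table(predicted = probs > 0.5, actual = y_class)  # simple confusion table
```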
[00:01:55] It's also important to understand how well a method is doing. It's easy to apply an algorithm; nowadays you can just run software. But it's difficult, and also very important, to figure out how well the method is actually working, so you can tell your boss or your collaborator: when you apply this method we've developed, this is how well you're likely to do tomorrow. In some cases you won't do well enough to actually use the method, and you'll have to improve your algorithm, or maybe collect better data.
[00:02:20] The other thing we want to convey through the course, and hopefully through the examples, is that this is a really exciting area of research. Statistics in general is a very hot area, and statistical learning and machine learning are of more and more importance. And it's really exciting that the area has not gelled in any way, in the sense that there are a lot of good methods out there, but also a lot of challenging problems that aren't solved. Especially in recent years, with the onset of big data, people have coined the term data science, and statistical learning, as Trevor mentioned, is a fundamental ingredient in this new area of data science.
[00:02:56] You might be wondering where the term supervised learning comes from. It's actually a very clever term, and I'd like to take credit for it, but I can't: it was developed by someone in, I think, the machine learning area. There is a supervisor; you can think of a teacher in a kindergarten trying to teach a child to discriminate between, say, what a house is and what a bike is. So the teacher might show the child, maybe Johnny: "Johnny, here are some examples of what a house looks like," maybe in Lego blocks, "and here are some examples of what a bike looks like." He tells Johnny this and shows him examples of each of the classes, and the child then learns: "Ah, I see, a house has got sort of square edges, and a bike has got some more rounded edges," et cetera. That's supervised learning, because he's been given examples of labeled training observations; he's been supervised.
[00:03:45] And as Trevor just sketched out on the previous slide, the Y there is given, and the child tries to learn to classify the two objects based on the features, the X's.
[00:03:57] Now, unsupervised learning is another topic of this course, and the way Trevor grew up. (I see, that's the problem!) Okay, so in unsupervised learning, back in the kindergarten, the child was not given examples of what a house and a bike were. He just sees lots of things on the ground: maybe some houses, some bikes, some other things.
[00:04:22] So this data is unlabeled: there's no Y. (Oh, that's pretty sharp, Rob.) Okay, so the problem for the child now, since it's unsupervised, is to try to organize in his own mind the common patterns of what he sees. He may look at the objects and say: these three things are probably houses. He doesn't know they're called houses, but they're similar to each other because they have common features. These other objects, maybe bikes or other things, are similar to each other because he sees some commonality.
[00:04:50] And that brings in the idea of trying to group observations by similarity of features, which is going to be a major topic of this course: unsupervised learning. More formally, there is no outcome variable measured, just a set of predictors, and the objective is fuzzier. It's not to predict Y, because there is no Y; rather, it's to learn how the data is organized, and to find which features are important for that organization. We'll talk about clustering and principal components, which are important techniques for unsupervised learning.
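As a hedged taste of both techniques on made-up, unlabeled data (kmeans and prcomp are base R's standard implementations; the two-cloud data here is our own toy construction):

```r
# Unsupervised learning: no y at all, just the x's.
set.seed(2)
Z <- rbind(matrix(rnorm(50 * 2), 50, 2),            # one cloud of points
           matrix(rnorm(50 * 2, mean = 3), 50, 2))  # and a shifted second cloud

km <- kmeans(Z, centers = 2)    # clustering: group observations by similarity
table(km$cluster)               # how many observations fell in each group

pc <- prcomp(Z, scale. = TRUE)  # principal components: main directions of variation
summary(pc)                     # proportion of variance explained per component
```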
[00:05:23] One of the other challenges is that it's hard to know how well you're doing. There's no gold standard; there's no Y. So when you've done a clustering analysis, you don't really know how well you've done. That's one of the challenges. But nonetheless, it's an extremely important area.
[00:05:40] One reason is that unsupervised learning can be an important preprocessor for supervised learning. It's often useful to try to organize your features, or choose features, based on the X's themselves, and then use those processed or chosen features as inputs into supervised learning.
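A minimal sketch of that preprocessing idea, reusing the simulated X and y_quant from earlier (this principal-components-then-regression recipe is our own illustration):

```r
# Derive features from the x's alone, then feed them to a supervised model.
pcs    <- prcomp(X, scale. = TRUE)$x[, 1:2]  # first two principal component scores
fit_pc <- lm(y_quant ~ pcs)                  # regress the outcome on derived features
summary(fit_pc)
```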
00:06:00
a lot easier it's a lot more common to
00:06:02
collect data which is unlabeled right
00:06:04
because on the web for example if you
00:06:06
look at movie reviews you can you can
00:06:08
a computer algorithm can just scan the
00:06:10
web and grab reviews
00:06:12
figuring out whether review on the other
00:06:13
hand is positive or negative often takes
00:06:15
human intervention so it's much harder
00:06:17
and costly to label data much easier
00:06:20
just to collect unsupervised unlabeled
00:06:22
data
[00:06:24] The last example we're going to show you is a wonderful one: the Netflix prize. Netflix is a movie rental company in the US; now you can get the movies online, but they used to be DVDs that were mailed out. Netflix set up a competition to try to improve on their recommender system. They created a data set with 400,000 Netflix customers and 18,000 movies, and each of these customers had rated, on average, around 200 movies. So each customer had seen only about one percent of the movies.
[00:07:06] You can think of this as a very big matrix which is very sparsely populated with ratings between one and five. The goal, as in all recommender systems, is to predict what the customers would think of the other movies, based on what they have rated so far.
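A toy version of that structure, with entirely made-up numbers, just to fix ideas: a customers-by-movies matrix that is mostly missing, plus the root mean squared error of a naive predictor that could serve as a baseline:

```r
# Toy customers-by-movies ratings matrix (made-up data; NA = not yet rated).
R <- matrix(NA, nrow = 5, ncol = 6)
R[cbind(c(1, 1, 2, 3, 4, 5, 5),                  # (customer, movie) pairs...
        c(2, 5, 1, 3, 6, 2, 4))] <- c(4, 1, 5, 3, 2, 4, 5)  # ...and their ratings

observed <- !is.na(R)
baseline <- mean(R, na.rm = TRUE)            # naive prediction: the overall mean rating
sqrt(mean((R[observed] - baseline)^2))       # root mean squared error of the baseline
```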
[00:07:21] So Netflix set up a competition where they offered a one-million-dollar prize for the first team that could improve on their rating system by 10 percent, by a certain measure. The design of the competition was very clever; I don't know whether it was by luck or not, but the root mean squared error of the original algorithm was about 0.953, on a scale of, again, one to five.
[00:07:45] When they announced the competition and put the data on the web, it took the community about a month or so to produce an algorithm that improved upon that. But it then took about another three years for someone to actually win the competition.
[00:07:59] So it's a great example. Here's the leaderboard at the time the competition ended. It was eventually won by a team called BellKor's Pragmatic Chaos, but a very close second was The Ensemble. In fact, they had the same score up to four decimal places, and the final winner was determined by who submitted the final predictions first.
[00:08:24] So this was a wonderful competition, but what was especially wonderful was the amount of research that it generated. Thousands, even tens of thousands, of teams all over the world entered this competition over the period of three years, and a whole lot of new techniques were invented in the process.
00:08:40
a lot of the winning techniques ended up
00:08:43
using a form of
00:08:45
principal components in the presence of
00:08:46
missing data how come our names not on
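One simple version of that idea is iterative low-rank imputation: fill in the missing entries, compute a low-rank SVD, refill from it, and repeat. The sketch below is our own minimal illustration of the general approach, applied to the toy ratings matrix above, not the winning teams' actual algorithms:

```r
# Rank-1 iterative imputation (illustrative only, not the winners' method).
Rhat <- R
Rhat[!observed] <- baseline                 # start by filling gaps with the mean
for (iter in 1:50) {
  s <- svd(Rhat)                            # principal components of the current fill-in
  low_rank <- s$d[1] * s$u[, 1, drop = FALSE] %*% t(s$v[, 1, drop = FALSE])
  Rhat[!observed] <- low_rank[!observed]    # update only the missing entries
}
round(Rhat, 2)                              # completed matrix of predicted ratings
```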
[00:08:48] How come our names are not on that list, Trevor? Where's our team? That's a good point, Rob; the page isn't long enough. I think if we went down a few hundred places, you might find us.
[00:08:58] Actually, seriously, we did try: when the competition started, we spent about three or four months, with a graduate student, trying to win it. One of the problems was computation. The data was so big, and our computers were not fast enough, that just trying things out took too long. We realized the graduate student was probably not going to succeed, and was probably going to waste three years of his graduate program, which is not a good idea for his career. So we basically abandoned ship early on.
[00:09:27] I mentioned at the beginning the field of machine learning, which actually led to the statistical learning area we're talking about in this course. Machine learning itself arose as a subfield of artificial intelligence, especially with the advent of neural networks in the 80s.
[00:09:46] So it's natural to wonder: what's the relationship between statistical learning and machine learning? First of all, the question is hard to answer; we ask it often, and there's a lot of overlap. Machine learning tends to work at larger scales; they tend to work on bigger problems, although the gap is closing, because fast computers are now becoming much cheaper. Machine learning worries more about pure prediction, and how well things predict. Statistical learning also worries about prediction, but tries as well to come up with models and methods that can be interpreted by scientists and others, and to assess how well the method is doing; we worry more about precision and uncertainty. But again, the distinctions have become more and more blurred, and there's a lot of cross-fertilization between the fields.
[00:10:29] Machine learning clearly has the upper hand in marketing: they tend to get much bigger grants, and their conferences are in much nicer places. But we're trying to change that, starting with this course.
[00:10:41] So, here's the course text: An Introduction to Statistical Learning. We're very excited; this is a new book by two of our past graduate students, Gareth James and Daniela Witten, along with Rob and myself. The book just came out in August 2013, and this course will cover it in its entirety.
[00:11:02] At the end of each chapter of the book, there are examples run through in the R computing language, and we do sessions on R. So when you take this course, you'll actually learn to use R as well. R is a wonderful environment: it's free, and it's a really nice way of doing data analysis.
[00:11:25] You'll see there's a second book there, which is our more advanced textbook, The Elements of Statistical Learning. It's been around for a while, and it will serve as a reference book for this course, for people who want to understand some of the techniques in more detail. Now, the nice thing is that not only is this course free, but these books are free as well. The Elements of Statistical Learning has been free for some time, with the PDF available on our websites, and this new book is going to be free at the beginning of January, when the course begins.
[00:11:57] That's by agreement with the publishers. If you want to buy the book, that's okay too; it's nice having the hard copy. But if you want it, the PDF is available. We hope you enjoy the rest of the class.