Okay, we're going to talk about statistical learning and models now. I'm going to tell you what models are good for, how we use them, and what some of the issues involved are.
We see three plots in front of us. These are sales figures from a marketing campaign, as a function of the amount spent on TV ads, radio ads, and newspaper ads. You can see, at least in the first two, somewhat of a trend, and in fact we've summarized the trend with a little linear regression line in each. So we see there's some relationship; again, the first two look stronger than the third.
Now in a situation like this, we typically like to know the joint relationship between the response, sales, and all three of these together: we want to understand how they operate together to influence sales. You can think of that as wanting to model sales as a function of TV, radio, and newspaper, all jointly together. So how do we do that?
Before we get into the details, let's set up some notation. Sales is the response, or target, that we wish to predict or model; we use the letter Y to refer to it. TV is one of the features, or inputs, or predictors, and we'll call it X1; likewise, radio is X2, and so on. In this case we've got three predictors, and we can refer to them collectively by a vector X with three components, X1, X2, and X3; vectors we generally think of as column vectors. So that's a little bit of notation, and in this more compact notation we can write our model as

Y = f(X) + ε.

This error ε is just a catch-all: it captures measurement errors, maybe in Y, and other discrepancies. Our f(X) is never going to model Y perfectly, so there are going to be a lot of things we can't capture with the function, and those are caught up in the error. And again, f(X) here is a function of this vector X, which has these three components.
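To make the notation concrete, here is a minimal sketch in Python. The form of f and all the numbers are invented purely for illustration; they are not the actual advertising data or the true relationship.

```python
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical "true" regression function of the three ad budgets
# (the coefficients are made up for this sketch).
def f(x):
    tv, radio, newspaper = x
    return 0.05 * tv + 0.19 * radio + 0.01 * newspaper

x = np.array([230.1, 37.8, 69.2])  # one instance of X = (X1, X2, X3)
epsilon = rng.normal(scale=1.0)    # the catch-all error term
y = f(x) + epsilon                 # the model: Y = f(X) + epsilon
print(y)
```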
So what is the function f(X) good for? With a good f we can make predictions of Y at new points X = x. In this notation, capital X is the variable with the three components, and little x is a particular instance, also with three components: particular values for TV, radio, and newspaper. With the model we can understand which components of X (in general it'll have p components, if there are p predictors) are important in explaining Y, and which are irrelevant. For example, if we model income as a function of demographic variables, seniority and years of education might have a big impact on income, but marital status typically does not, and we'd like our model to be able to tell us that. And depending on the complexity of f, we may be able to understand how each component Xj affects Y, and in what particular fashion it affects Y. So models have many uses, and those are amongst them.
Okay, well, what is this function f, and is there an ideal f? In the plot we've got a large sample of points from a population; there's just a single X in this case and a response Y, shown as a scatter plot. There are 2,000 points here; let's think of this as actually the whole population, or rather as a representation of a very large population.
Now let's think about what a good function f might be. Rather than asking for the whole function at once, let's ask what value we would like f to have at, say, x = 4, so at this point over here. We want to query f at all values of x, but we're wondering what it should be at the value 4. You'll notice that at x = 4 there are many values of y, but a function can only take on one value; the function is going to deliver back one value. So what is a good value? Well, one good value is to deliver back the average of those y's that have x equal to 4.
We write that in mathematical notation over here:

f(4) = E(Y | X = 4).

It says the function at the value 4 is the expected value of Y given X = 4, and "expected value" is just a fancy word for average; it's actually a conditional average, given X = 4. Since we can only deliver one value of the function at x = 4, the average seems like a good value. And if we do that at each value of x, so that at every single value of x we deliver back the average of the y's that have that value of x (for example, at x = 5 we again want the average value in that little conditional slice), that will trace out this little red curve we have here. That's called the regression function: the regression function gives you the conditional expectation of Y given X, at each value of x. So that, in a sense, is the ideal function for a population, in this case of Y and a single X.
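Here's a small sketch of this conditional-averaging idea in Python, using a simulated "population" (the function and noise level are invented; and since x is continuous, I condition on a very thin slice around 4 to stand in for "exactly 4"):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate a large population of (x, y) pairs.
x = rng.uniform(1, 7, size=200_000)
y = np.sin(x) + x / 2 + rng.normal(scale=0.3, size=x.size)

# The regression function at 4 is the conditional average E(Y | X = 4):
# average the y's whose x is (essentially) equal to 4.
slice_at_4 = np.abs(x - 4.0) < 0.01
f_hat_4 = y[slice_at_4].mean()
print(f_hat_4)   # close to the true f(4) = sin(4) + 2
```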
Let's talk more about this regression function. It's also defined for a vector X: if X has three components, for example, f(x) is the conditional expectation of Y given particular values of the three components of X. To picture that, think of X as two-dimensional, because we can think in three dimensions: the two-dimensional X lies on the table, and Y stands up vertically. The idea is the same. We've got a whole continuous cloud of y's and x's; we go to a particular point x with two coordinates x1 and x2, and we ask what a good value for the function is at that point. We just go up in the slice and average the y's above that point, and we do that at all points in the plane.
We said it's the ideal or optimal predictor of Y, and what that means is: with regard to a particular loss function, the squared-error loss, this choice of f(x) minimizes E[(Y − g(X))² | X = x] over all functions g, at each point x. So it minimizes the average squared prediction error.
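For completeness, here's the standard one-line argument for why the conditional mean is the minimizer (a sketch of the usual derivation, not spelled out in the lecture). Writing μ(x) = E(Y | X = x), for any candidate value c:

\[
E\big[(Y - c)^2 \mid X = x\big]
  = E\big[(Y - \mu(x))^2 \mid X = x\big] + \big(\mu(x) - c\big)^2,
\]

since the cross term E[(Y − μ(x)) | X = x] vanishes. The right-hand side is minimized by taking c = μ(x), so the regression function f(x) = E(Y | X = x) is the optimal choice under squared-error loss.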
Now, at each point x we're going to make mistakes if we use this function to predict Y, because there are lots of y's at each point x. The errors that we make are, in this case, called epsilons, and they are the irreducible error. You might know the ideal function f, but of course it doesn't make perfect predictions at each point x, so it has to make some errors; but on average it does well.
For any estimate f̂(x) of f(x) (we tend to put these little hats on estimators to show that they've been estimated from data), we can expand the squared prediction error at x into two pieces. There's the reducible piece, which is the squared difference between our estimate f̂(x) and the true function f(x), and there's the irreducible piece, which is just the variance of the errors:

E[(Y − f̂(X))² | X = x] = [f(x) − f̂(x)]² + Var(ε).

So this expected prediction error breaks up into these two pieces, and that's important to bear in mind: if we want to improve our model, it's the first piece, the reducible piece, that we can improve, maybe by changing the way we estimate f(x).
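A quick simulation can illustrate this decomposition; the following sketch uses an invented true function, estimate, and noise level:

```python
import numpy as np

rng = np.random.default_rng(2)

f = lambda x: np.sin(2 * x)             # invented true regression function
f_hat = lambda x: 0.9 * np.sin(2 * x)   # an imperfect estimate of f
sigma = 0.5                             # sd of the irreducible error

x0 = 1.0                                # evaluate everything at one point x
y = f(x0) + rng.normal(scale=sigma, size=500_000)

mse = np.mean((y - f_hat(x0)) ** 2)     # E[(Y - f_hat(X))^2 | X = x]
reducible = (f(x0) - f_hat(x0)) ** 2    # [f(x) - f_hat(x)]^2
irreducible = sigma ** 2                # Var(eps)
print(mse, reducible + irreducible)     # agree up to simulation noise
```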
Okay, so that's all nice, but up to now this has been somewhat of a theoretical exercise. How do we estimate the function f? The problem is that we can't carry out this recipe of conditional expectation, or conditional averaging, exactly, because at any given x in our data set we might not have many points to average; we might not have any points to average. In the figure we've got a much smaller data set now, and we've still got the point x = 4. If you look carefully, you'll see that the solid green point is one point I put on the plot; there are actually no data points whose x value is exactly 4. So how can we compute the conditional expectation or average?
Well, what we can do is relax the idea of "at the point x" to "in a neighborhood of the point x", and that's what the notation here refers to: N(x), or script N of x, is a neighborhood of points, defined in some way, around the target point, which is this x = 4 here. It keeps the spirit of conditional expectation, because it's close to the target point x. And if we make that neighborhood wide enough, we'll have enough points in the neighborhood to average, and we'll use their average, f̂(x) = Ave(Y | X ∈ N(x)), to estimate the conditional expectation.
This is called nearest neighbor or local averaging. It's a very clever idea (not my idea; it was invented a long time ago). And of course you'll slide this neighborhood along the x-axis, and as you compute the averages as you slide along, it'll trace out a curve. That's actually a very good estimate of the function f. It's not going to be perfect, because the little window has a certain width, so, as we can see here, at some points the true f may be lower and at some points higher; but on average it does quite well. So we have a pretty powerful tool here for estimating this conditional expectation: just relax the definition, compute the nearest neighbor average, and that gives us a fairly flexible way of fitting a function.
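As a rough sketch of that procedure (with simulated data and an arbitrary neighborhood width; the lecture doesn't pin down either):

```python
import numpy as np

rng = np.random.default_rng(3)

# A small data set, too sparse to condition on any exact value of x.
x = rng.uniform(1, 7, size=50)
y = np.sin(x) + x / 2 + rng.normal(scale=0.3, size=x.size)

def local_average(x0, x, y, radius=0.5):
    """Estimate f(x0) by averaging the y's whose x lies in the
    neighborhood N(x0) = [x0 - radius, x0 + radius]."""
    in_neighborhood = np.abs(x - x0) <= radius
    return y[in_neighborhood].mean()

# Slide the neighborhood along the x-axis; the averages trace out a curve.
grid = np.linspace(1.5, 6.5, 100)
curve = np.array([local_average(x0, x, y) for x0 in grid])
print(curve[:5])
```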
We'll see in the next section that this doesn't always work, especially as the dimension gets larger, and we'll have to have ways of dealing with that.