00:00:00
welcome to this tutorial on simple
00:00:02
linear regression what we will be
00:00:05
learning in this tutorial is how to
00:00:07
better understand the relationship of
00:00:09
two or more variables using regression
00:00:12
analysis for example let's say we want
00:00:15
to study the relationship between
00:00:17
advertising expenditures and sales what
00:00:20
might we expect well usually the more we
00:00:23
spend on advertising the more sales we
00:00:25
should have so we would want to see if
00:00:28
as advertising expenditures increase do
00:00:31
sales increase as well and by how much
00:00:35
we might also want to study the number
00:00:36
of hours we practice some task and the
00:00:39
number of errors in this case we would
00:00:42
expect that as we practice more the
00:00:45
number of errors we make should decrease
00:00:48
what we are interested in doing with
00:00:50
regression analysis is to develop a
00:00:52
model that helps us to understand if two
00:00:55
variables are related and to help us to
00:00:57
make certain predictions for example if
00:00:59
we can develop a model to show the
00:01:01
relationship between advertising
00:01:03
expenditures and sales we would be able
00:01:06
to use that model to predict the sales
00:01:08
for a given level of advertising in
00:01:10
regression we define what is called a
00:01:13
dependent variable and that is the
00:01:15
variable we are trying to predict we
00:01:18
will also define an independent variable
00:01:21
and this is the variable we will use to
00:01:23
predict the dependent variable we will
00:01:25
use the letter Y to represent our
00:01:28
dependent variable and the letter X to
00:01:30
represent the independent variable if we
00:01:33
are trying to predict sales based on
00:01:35
advertising expenditures then y would be
00:01:39
the sales so our dependent variable is
00:01:41
sales that's what we're trying to
00:01:43
predict and advertising expenditures
00:01:45
would be our independent variable X
00:01:48
because that's what we're using to
00:01:49
predict sales what about the
00:01:51
relationship between practice time and
00:01:53
number of errors in this case we are
00:01:55
saying that as practice time increases
00:01:58
we would expect the number of errors to
00:02:00
decrease so we would try to predict the
00:02:02
number of errors and that would be our
00:02:05
dependent variable y based on practice
00:02:08
time so that would be our independent
00:02:10
variable X let's talk a little more
00:02:13
about what simple linear regression is
00:02:15
the term simple refers to the number of
00:02:18
independent variables we are using in a
00:02:20
simple regression in simple regression
00:02:23
we have only one independent variable
00:02:25
and one dependent variable that is 1 X
00:02:28
and one y using that one x to predict y
00:02:32
the term linear refers to the type of
00:02:35
model we are trying to create in this
00:02:37
case we are trying to fit a straight
00:02:38
line to the data to approximate the
00:02:41
relationship between X and Y the term
00:02:43
linear means fitting a straight line so
00:02:47
what we are doing in this tutorial is
00:02:49
simple linear regression simple because
00:02:52
we have only one independent variable
00:02:54
and linear because we are trying to
00:02:56
explain the relationship between X and Y
00:02:58
using a straight line
00:03:00
we can also use more than one
00:03:02
independent variable when we feel it
00:03:04
will help with the fit of the model and
00:03:06
when we use more than one independent
00:03:08
variable it would be called multiple
00:03:11
regression so now that we understand
00:03:13
what simple regression is let's take a
00:03:15
look at the model and it looks like this
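The slide itself is not captured in the transcript. Going by the spoken description that follows, the population model is presumably the standard simple linear regression equation, sketched here with \beta_0 standing for beta naught and \varepsilon for the error term:

```latex
y = \beta_0 + \beta_1 x + \varepsilon
```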
00:03:18
we have y is equal to beta naught + beta 1 *
00:03:22
X plus the Greek letter Epsilon this
00:03:26
model is for the population we will use
00:03:28
another model with B instead of betas
00:03:31
when we discuss the estimated model
00:03:33
using sampling much like we used xbar
00:03:35
instead of mu to approximate the
00:03:37
population mean let's look at the
00:03:40
components of this model a little more
00:03:43
closely beta naught refers to the Y
00:03:45
intercept of the line we are defining
00:03:47
for this model we will refer to this
00:03:50
line as the regression line or the line
00:03:52
of regression so beta naught is the Y
00:03:55
intercept which means it is where the
00:03:57
line would cross over the Y axis that
00:04:00
point would be beta naught you can also
00:04:03
think of it as the value of y when X is
00:04:06
zero at the Y axis the value of x is zero
00:04:10
so when X is zero whatever y is at that
00:04:13
point would be the Y intercept beta naught
00:04:17
next we have beta 1 this would represent
00:04:20
the slope of the regression line the
00:04:22
slope tells us two things whether the
00:04:24
line is increasing or decreasing and how
00:04:27
steep it is the last component of the
00:04:30
model is Epsilon and that stands for the
00:04:33
error that exists in our prediction
00:04:34
model as good as our model is there is
00:04:37
always a random error term that cannot
00:04:39
be accounted for let's take a look at
00:04:42
some examples of the regression line
00:04:45
here we have an example that shows a
00:04:47
line sloping upward because the line is
00:04:50
sloping upward it is showing a positive
00:04:52
or increasing relationship between X and
00:04:55
Y this means that as X increases so does
00:04:59
y so the line slopes upward and
00:05:02
therefore beta 1 the slope will be a
00:05:05
positive number the line itself is
00:05:08
called the regression line or the line
00:05:10
of regression and the Y intercept beta
00:05:13
naught is here where the regression line
00:05:16
hits the Y axis this next example shows a
00:05:21
downward sloping line and depicts a
00:05:23
negative linear relationship between X
00:05:25
and Y when we have a negative
00:05:28
relationship we see that as X increases
00:05:31
y decreases and that is why the line
00:05:33
slopes
00:05:35
downward so beta 1 the slope would be a
00:05:38
negative number the regression line is
00:05:41
here and the Y intercept is here where
00:05:44
the line hits the Y
00:05:47
axis here is another example where we
00:05:49
have a flat line across this graph shows
00:05:53
that as X increases y remains the same
00:05:56
so there is no relationship between X
00:05:59
and Y
00:06:00
in this scenario beta 1 the slope would
00:06:03
be equal to zero so when there is no
00:06:05
linear relationship between X and Y the
00:06:08
slope is zero the regression line is
00:06:10
this flat line going across and the Y
00:06:13
intercept beta naught would be here where
00:06:16
the regression line hits the Y
00:06:19
axis beta naught and beta 1 are the
00:06:22
population parameters for the Y
00:06:24
intercept and the slope just like mu was
00:06:27
used to refer to the population
00:06:28
parameter for the mean and the Greek
00:06:30
letter Sigma was used for the standard
00:06:33
deviation we use beta naught and beta 1 for
00:06:36
the population parameters of the Y
00:06:38
intercept and the slope we don't know
00:06:41
the true population parameters unless we
00:06:43
take a complete census so when we get
00:06:45
our data from a sample we are getting
00:06:47
what are called sample statistics rather
00:06:50
than true population parameters in
00:06:53
regression B naught and B1 are used to
00:06:56
estimate the true population parameters
00:06:58
of beta naught and beta 1 just like we used
00:07:02
xbar to estimate mu and S to estimate
00:07:06
Sigma we use B naught and B1 to estimate beta
00:07:10
naught and beta 1 so for a sample the
00:07:13
estimated simple linear regression
00:07:15
equation looks like this now this y with
00:07:19
something that looks like a hat on it is
00:07:22
actually called y hat and in the
00:07:25
equation Y hat refers to the estimated
00:07:29
or predicted value of y for a given x
00:07:33
value B naught is the Y intercept for the
00:07:35
line and B1 is the slope so now we have
00:07:39
all the components we need to define a
00:07:41
straight line we have the slope and we
00:07:44
have the Y
00:07:45
intercept now we are ready to look at
00:07:48
some sample data so that we can
00:07:49
calculate the estimated slope and the Y
00:07:53
intercept let's start by looking at a
00:07:55
graph called a scatter diagram this
00:07:58
graph shows us the relationship
00:08:00
between X and Y we begin by drawing a
00:08:03
horizontal axis and labeling it X and a
00:08:06
vertical axis and labeling it y on the x
00:08:09
axis we will plot our independent
00:08:12
variable and on the Y axis we will plot
00:08:14
our dependent variable for our example
00:08:18
number of hours studied is the
00:08:20
independent variable and grade on exam
00:08:23
will be our dependent variable let's
00:08:26
take a look at some sample data that I
00:08:28
have here I have two columns of numbers
00:08:31
one is with the number of hours students
00:08:33
studied and the second column has their
00:08:35
corresponding grades on an
00:08:37
exam there are 10 sets of numbers here
00:08:41
to begin let's draw our horizontal and
00:08:44
vertical axes and label them by putting
00:08:47
the number of hours studied on the x
00:08:49
axis the horizontal axis and grades on
00:08:52
the Y axis the vertical axis remember we
00:08:55
said that X is always the independent
00:08:58
variable and that's what we use to
00:09:00
predict y our dependent variable in this
00:09:03
example we will be using number of hours
00:09:05
studied to predict the grade so X would
00:09:08
be the independent variable number of
00:09:10
hours studied and Y would be the
00:09:12
dependent variable grade okay so let's
00:09:16
put some tick marks here and let's put
00:09:18
the numbers so that we can begin
00:09:20
plotting the data let's begin with the
00:09:22
first set of X and Y coordinates the
00:09:25
first X is 2 and its corresponding Y is
00:09:28
a grade of 69 so plotting the
00:09:30
coordinates of 2 and 69 we would put a
00:09:33
dot for that observation somewhere
00:09:36
around here now let's move on to the
00:09:39
next set of X and Y coordinates and they
00:09:41
are 9 and 98 so that would be around
00:09:45
here the third set of X and Y
00:09:48
coordinates is 5 and 82 and that would
00:09:51
be around here and if we do this for the
00:09:54
remaining x's and y's we get a scatter
00:09:56
diagram that looks like this if I had
00:09:59
graph paper it would be a little more
00:10:01
accurate but this is the best I can do
00:10:03
by eyeballing it we can see that the
00:10:05
plot seems to be showing a positive
00:10:08
relationship between X and Y that is as
00:10:11
X the number of hours increases so does
00:10:14
y the grade we can also see that a
00:10:17
straight line would probably fit the
00:10:19
data somewhere around here so we can
00:10:23
create a linear model to describe this
00:10:25
relationship between X and Y by finding
00:10:28
the slope and the Y intercept that
00:10:30
defines the line that fits this data the
00:10:34
best to find the line that fits this
00:10:37
data the best we will use our sample
00:10:39
data to help define the line of
00:10:41
regression which is this line the line
00:10:44
that fits the data the best that line
00:10:47
will be the Y hat line where y hat is
00:10:50
the predicted value of y for a given X
00:10:54
so for our example it would be the
00:10:56
predicted grade on an exam for a given X
00:10:59
number of hours
00:11:00
studied and B naught would be the Y
00:11:03
intercept of the line so it would be the
00:11:05
value of y when X is zero if x is zero
00:11:10
that would mean the number of hours
00:11:11
studied is zero and the Y intercept B
00:11:14
naught would be the grade for zero number
00:11:17
of hours studied B1 would be the slope
00:11:20
of the line and it tells us whether
00:11:23
there is an increasing or decreasing
00:11:25
relationship between X and Y and how
00:11:27
steep the line is for this example we
00:11:30
will expect to find a positive slope
00:11:33
because it is a positive relationship
00:11:35
between X and Y another way to put it is
00:11:38
that the slope tells us how much y
00:11:40
increases for every one unit increase in
00:11:44
X so for every one unit increase in X
00:11:47
the slope tells us how much y would
00:11:49
increase by the last letter in the
00:11:52
equation is X and that is the number of
00:11:54
hours studied so we will multiply x * B1
00:11:58
and then add B naught to get y hat
00:12:02
okay so let's try to get that line using
00:12:04
the least squares method what we will do is
00:12:07
find the line that fits the data the
00:12:09
best that line the Y hat regression line
00:12:12
fits the data the best when the sum of the
00:12:15
squared vertical distances of the data points
00:12:17
from the line is at its minimum in other
00:12:20
words you want to minimize the distance
00:12:22
of each Yi from each corresponding y hat
00:12:26
that is what is meant by the formula in
00:12:28
the red box
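The formula in the red box is not reproduced in the transcript. Based on the spoken description, it is presumably the usual least squares criterion, sketched below, where n is the number of observations (10 in this example) and each predicted value comes from the estimated line:

```latex
\min \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2,
\qquad \hat{y}_i = b_0 + b_1 x_i
```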
00:12:30
Min means to minimize and then you can
00:12:33
see that we are taking each Yi those are
00:12:35
our observed grades and we have 10 of
00:12:37
those for this example so we want to
00:12:39
minimize the distance of each of those observed Yi
00:12:42
values from the line of
00:12:45
regression that would be the line that
00:12:47
fits the data the best you can see how I
00:12:50
drew a straight line through the plotted
00:12:52
coordinates those coordinates are from
00:12:54
the sample data and then I used my eyes
00:12:57
to try to approximate where a line would
00:12:59
run through those points so that it is
00:13:02
minimizing the differences between each
00:13:04
of those points to the line both above
00:13:06
it and below it I eyeballed it here when
00:13:10
I drew it but we need to use a formula
00:13:12
to find the exact line by finding two
00:13:15
values a slope and a y intercept take a
00:13:18
look at the formula in the red box
00:13:21
again Yi represents the observed value
00:13:24
of y for the ith
00:13:26
observation and Y hat would be the
00:13:29
predicted value of y for that same ith
00:13:32
observation so let's say for example we
00:13:35
take an x value of three that is the
00:13:37
number of hours studied is three on the
00:13:40
graph we would look at three for the x
00:13:42
value and then look up to the line of
00:13:45
regression and over to where that point
00:13:48
is on the Y axis and we would get a
00:13:50
predicted y value y hat for that
00:13:53
observation would be 69 once we get an
00:13:57
exact equation for the regression line
00:13:59
we will be able to predict that value
00:14:01
more exactly but it is approximately
00:14:05
69 now if you look back at the original
00:14:07
sample data from a few slides back you
00:14:10
will see that there was an observation
00:14:11
of x equals 3 that is 3 hours studied and
00:14:15
that observation had a corresponding
00:14:18
observed y value of
00:14:21
71 so that
00:14:23
dot Yi represents the actual observed
00:14:26
value of 71 from the sample data and the
00:14:30
Y hat value represents the predicted
00:14:32
value we would get from the line of
00:14:34
regression our task is to define a
00:14:37
straight line that minimizes the
00:14:40
differences or deviations from each of
00:14:42
those dots to the line and that is what
00:14:44
the formula in the red box is saying
00:14:47
take the difference between each Yi and each y hat and
00:14:51
minimize the sum of those squared differences now
00:14:54
that we understand what the best fitting
00:14:56
line to the data would be we need to
00:14:58
calc calculate its slope and its Y
00:15:01
intercept let's first start with the
00:15:03
slope here you can see the formula and
00:15:06
we can define X subscript I as the value
00:15:09
of x for observation I in our example we
00:15:13
have 10 observations for X the number of
00:15:15
hours studied so we would have X
00:15:17
subscript 1 x subscript 2 x subscript 3
00:15:21
and so on until X subscript 10 so those
00:15:24
are our X subscript I likewise y
00:15:27
subscript I would be the value of y for
00:15:30
the ith observation and we would have 10
00:15:33
of those for each of the corresponding y
00:15:36
sub i's xbar would be the average of the
00:15:39
X's so we would add up all of the x's
00:15:41
and divide by 10 in this case and Y Bar
00:15:45
is the average of the Y's so we would
00:15:47
add up all the grades and divide by how
00:15:49
many there are to get Y Bar once we plug
00:15:52
in all of these numbers and calculate
00:15:54
the slope then we can calculate the Y
00:15:56
intercept B naught by using this formula
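The two on-screen formulas are not captured in the transcript. Judging from the column-by-column calculation described afterwards, they are presumably the standard least squares estimates, sketched here with b_1 for the slope and b_0 for B naught:

```latex
b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
b_0 = \bar{y} - b_1 \bar{x}
```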
00:16:00
notice that in this formula we have B1
00:16:02
the slope so we need to First calculate
00:16:05
the slope before we can calculate the Y
00:16:07
intercept so let's take a look at how we
00:16:10
would do these
00:16:11
calculations here is the sample data
00:16:13
with the 10 observations of X and Y
00:16:16
number of hours studied and grade on
00:16:18
exam so we have two columns here one for
00:16:21
X subscript I and one for all the Y
00:16:24
subscript i's now let's make some more
00:16:27
columns with all the numbers we will
00:16:29
need to calculate the slope let's start
00:16:32
with the third column where we will have
00:16:34
all the x sub i - x bars and if you look
00:16:38
back at the formula for the slope that
00:16:39
is the first component in the numerator
00:16:42
of the slope take a look at the
00:16:43
formula then we need a column for yi
00:16:47
minus y bar and that's the next component in
00:16:50
the numerator for the slope and then if
00:16:53
you take a look at that numerator you'll
00:16:54
see that we multiply those so the next
00:16:57
column will be the product of the third
00:16:59
and fourth columns and that will help us
00:17:02
to complete the calculations for the
00:17:04
numerator and finally we need the last
00:17:06
column which is the third column squared
00:17:09
and that will give us the denominator of
00:17:11
the slope formula before we fill in all
00:17:14
of these columns with our calculations
00:17:16
we first need to get xbar and Y Bar so
00:17:20
let's add up all of the x's and we get
00:17:24
48 and so xbar would be 48 / 10 and we
00:17:29
get
00:17:30
4.8 now let's add up all the Y's and we
00:17:33
get
00:17:34
778 and then to get Y Bar we divide
00:17:38
778 by 10 and we get
00:17:41
77.8 so now that we have xbar and Y Bar
00:17:45
we are ready to get all of the other
00:17:48
numbers now we are ready to calculate
00:17:50
all the numbers in the third column and
00:17:53
we get these
00:17:54
numbers make sure you understand where
00:17:57
all these numbers come from the first
00:17:59
number for example is -2.8 and how do we
00:18:03
get that we get that by subtracting
00:18:05
xbar from the x value 2 and xbar is 4.8 so 2 -
00:18:11
4.8 gives us -2.8 look at the column
00:18:15
header and you can see the column header
00:18:17
says x sub i minus xbar so that's what we
00:18:21
just did x sub i is X1 and that is a two
00:18:25
and xbar is
00:18:27
4.8 for the next number we get 4.2 by
00:18:30
subtracting xbar from the next x sub i so we
00:18:33
get 9 - 4.8 and that gives us 4.2 and so
00:18:38
on down the whole column for the next
00:18:41
column we get these numbers again make
00:18:44
sure you understand where all these
00:18:46
numbers are coming from the header of
00:18:48
this column is Yi minus y bar so we take
00:18:52
each Yi and subtract Y Bar the
00:18:55
first Yi value is 69 and Y Bar is
00:19:00
77.8 so we take 69 -
00:19:03
77.8 and we get -8.8 and so on down for
00:19:08
this whole column of numbers now that we
00:19:11
have columns three and four we're ready
00:19:13
to get column five and that will be each
00:19:15
number in column 3 times each number in
00:19:18
column 4 and here are the numbers we
00:19:21
would get let's take a look at the first
00:19:23
number the first number is 24.64 and we
00:19:27
get that by taking -2.8 * -8.8 and that
00:19:32
gives us 24.64 and so on down the whole
00:19:36
column and finally for the last column
00:19:39
we take the third column and square each
00:19:42
value to get this column of numbers so
00:19:45
-2.8 squared gives us
00:19:48
7.84 and so on and now we're almost
00:19:51
ready to get the slope we first need to
00:19:54
get the sum of this column and this is
00:19:56
320.50
00:19:59
if you look at the formula for the slope
00:20:01
the numerator is the sum of these
00:20:03
numbers you can see the capital Sigma
00:20:06
sign that tells us to sum our next step
00:20:09
is to sum the last column and we get
00:20:12
67.6 and this would give us the sum for
00:20:15
the denominator of the slope formula
00:20:17
here are those two very important
00:20:20
columns and the slope formula you can
00:20:23
see that the numerator in the slope
00:20:24
formula is this so the sum of this
00:20:27
column would go in the numerator
00:20:29
and the denominator of the slope formula
00:20:32
is the sum of this
00:20:34
column putting this all together we get
00:20:37
the slope equals
00:20:40
320.50 and we get that from here and
00:20:42
then we divide that by
00:20:45
67.6 which we get from the sum of this
00:20:48
column and now we divide and we get
00:20:52
4.74 so now we're ready to calculate the
00:20:55
Y intercept B naught we will need xbar and Y
00:21:00
Bar to calculate B naught remember we
00:21:02
already calculated xbar as 4.8 and we
00:21:06
found Y Bar was 77.8 on a previous slide
00:21:10
so now we have all the numbers we need
00:21:12
to calculate B naught the Y intercept and
00:21:15
we get B naught is equal to
00:21:18
77.8 minus
00:21:20
4.74 * 4.8 where 77.8 is Y
00:21:30
Bar and 4.8 is xbar and where did 4.74
00:21:35
come from that comes from here the slope
00:21:38
and we get a y intercept of
00:21:44
55.048 so now we have our slope and our Y
00:21:48
intercept and now we can define our y
00:21:50
hat line substituting B naught and B1 in
00:21:54
this line we get our y hat line
00:21:58
y hat is equal to 55.048 +
00:22:03
4.74 * X now we can use this y hat line
00:22:08
also known as our line of regression to
00:22:10
predict any y for a given x value we
00:22:13
would just plug in the x value and out
00:22:16
comes y hat the predicted
00:22:18
y value let's start on a fresh page here
00:22:21
where we have our estimated regression
00:22:23
line that we just calculated now what
00:22:26
we're going to do is use this line of
00:22:28
regression expression to predict y for a
00:22:30
given x value so suppose the number of
00:22:33
hours studied is three that is our given
00:22:36
x value remember when we eyeballed it we
00:22:39
said it was around
00:22:40
69 let's see what it would be using this
00:22:44
y hat line what would be the predicted
00:22:46
grade on the exam so what we're asking
00:22:49
is when X is three what is the predicted
00:22:53
value of y to answer this question we
00:22:56
use our line of regression our y hat
00:22:59
line and substitute in the number three
00:23:02
for the letter X as you see here so we
00:23:05
have y hat is equal to
00:23:08
55.048 + 4.74 * 3 instead of times x and
00:23:14
this would give us our y hat our
00:23:16
predicted value of y for a given X and
00:23:19
that is
00:23:20
69.268 so if a student studies 3 hours
00:23:26
we would predict the grade to be
00:23:29
69.268 how good a prediction is this well
00:23:33
that depends on how good a fit the
00:23:35
regression line is to the data anyone
00:23:38
can draw a straight line through any
00:23:39
data points and define it mathematically
00:23:42
with a slope and a y intercept but that
00:23:44
doesn't mean it's a good fitting model
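The prediction for three hours of study worked out a moment ago can be reproduced with a small helper. This is a minimal sketch: `predict_grade` is a hypothetical name, and the default coefficients are simply the rounded intercept and slope from the hand calculation (55.048 and 4.74).

```python
def predict_grade(hours, b0=55.048, b1=4.74):
    """Return y hat, the predicted grade, for a given number of hours studied.

    predict_grade is an illustrative name; b0 and b1 default to the
    intercept and slope worked out by hand in the tutorial.
    """
    return b0 + b1 * hours


# Predicted grade for a student who studies 3 hours.
predicted = predict_grade(3)
print(round(predicted, 3))
```

A number like this only says where the fitted line sits at x = 3; it says nothing yet about how closely the observed grades follow that line.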
00:23:47
even if there is no relationship between
00:23:49
X and Y we could still mathematically
00:23:51
define a straight line that fits the
00:23:53
data the best but it would not be a good
00:23:55
fit so we need a measurement that tells
00:23:58
us how well the regression line
00:24:00
fits the data one such measurement is
00:24:03
called the coefficient of determination
00:24:06
and it tells us how good a fit the
00:24:08
regression line is to our data the
00:24:10
formula for the coefficient of
00:24:12
determination is shown here in the red
00:24:14
box R squared is the coefficient of
00:24:17
determination and to calculate it we
00:24:19
take something called SSR and divide it by
00:24:23
SST where SSR is defined as the sum of
00:24:27
the squares due to regression the way
00:24:30
we calculate SSR is to take the sum of
00:24:33
the squared deviations of each predicted
00:24:36
value of y that's each y hat from
00:24:39
Y Bar the average y so it
00:24:43
measures the difference between the
00:24:45
predicted values and the
00:24:47
average the denominator is SST and that
00:24:51
is defined as the sum of the squares for
00:24:53
the total deviation and we find that
00:24:55
value by taking the sum of the squared
00:24:58
differences of each Yi that's each actual
00:25:01
observation from Y Bar the average and
00:25:05
finally we have something called SSE
00:25:08
which is defined as the sum of the
00:25:10
squares for the error and that is
00:25:12
calculated by taking the sum of the
00:25:14
squared differences of each Yi from each
00:25:18
y hat each predicted value it is
00:25:21
important to know that SST is equal to
00:25:24
the sum of SSR and SSE so if we add SSR
00:25:29
plus SSE we would get SST this will
00:25:32
help us to make the calculation simpler
00:25:35
since if we know any two of these
00:25:37
numbers we can get the third number by
00:25:39
either adding or subtracting as you will
00:25:41
see in a few
00:25:42
moments let's go back to our original
00:25:45
data here we have the data in two
00:25:47
columns labeled X and Y now let's make a
00:25:51
third column for the predicted values of
00:25:53
y y hat for all these given X's so now
00:25:57
we have to take each value of x plug it
00:25:59
in the regression line and get this
00:26:02
column of numbers make sure you
00:26:04
understand where all these numbers are
00:26:06
coming from these are all the Y hat
00:26:08
values for each ith observation take the
00:26:11
first number 64.528 how did we get that
00:26:16
well you take the x value 2 and plug it
00:26:19
in the Y hat line so 55.048 + 4.74 * 2
00:26:27
right the two comes from the first x
00:26:28
value and if you plug that in the Y hat
00:26:31
line you get
00:26:45
64.528 and for the next x value 9 you get
00:26:47
97.708 and so on until we have the entire
00:26:50
column of predicted y values so just to
00:26:53
be clear this column of numbers has the
00:26:55
predicted values of Y for each of the
00:26:57
given X values we have 10 x values and
00:27:01
so we have 10 y hat values the next
00:27:05
column will be for the error and that is
00:27:07
each Yi minus each y hat and we get this
00:27:12
column of numbers take a look at the
00:27:14
first number 4.47214
00:27:28
4. 528 so
00:27:31
69us
00:27:33
6452 gives us
00:27:37
4472 now the next column is the squared
00:27:40
error so it's the previous column
00:27:42
squared and we get all these numbers
00:27:44
just by squaring the previous column so
00:27:47
19.998 is 4.472 squared the next column is each Yi minus
00:27:58
the average remember Y Bar was
00:28:01
77.8 so we take each y and subtract
00:28:06
77.8 to get this column of numbers so
00:28:10
for example the first y value is 69 and
00:28:13
Y Bar is
00:28:15
77.8 so 69 -
00:28:18
77.8 is
00:28:20
-8.8 and we do that for each of the 10
00:28:23
numbers in this column and finally the
00:28:26
last column is the squared
00:28:28
deviations so it is the previous column
00:28:31
squared so -8.8 squared would be
00:28:36
77.44 and so on now in order
00:28:41
to get SSE we need to sum up the
00:28:44
squared error column of these numbers
00:28:47
and we get 79.11215 for
00:28:53
SSE next we
00:28:57
want to get SST so we need to sum up the
00:29:00
squared deviations column and we get
00:29:05
1599.6 for
00:29:07
SST so let's review we have SSE equal to
00:29:11
79.11215
00:29:13
and we have SST equal to
00:29:17
1599.6 now to get the coefficient of
00:29:19
determination we need to divide SSR by
00:29:23
SST but we didn't calculate SSR we
00:29:27
calculated SSE and SST remember that
00:29:30
SST is equal to SSR plus SSE so we get
00:29:35
SSR by subtracting SSE from SST so SSR
00:29:41
would be
00:29:43
1599.6 - 79.11215
00:29:46
which is
00:29:51
1520.48785 now we can go back to the formula
00:29:53
for R squared and calculate it by dividing SSR
00:29:57
by SST and we get
00:30:00
1520.48785 that's SSR divided by
00:30:05
1599.6 SST and that gives us an R squared value
00:30:09
of
00:30:12
0.9505 so our coefficient of determination
00:30:15
is
00:30:18
0.9505 the coefficient of determination R
00:30:21
squared measures the percent of variability
00:30:24
in y that can be explained by the X
00:30:27
variable
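The R squared arithmetic can be checked with a short Python sketch, taking the SSE and SST totals from the walkthrough as 79.11215 and 1599.6. The variable names are illustrative only.

```python
# Totals taken from the walkthrough (names are illustrative).
sse = 79.11215   # sum of squares for the error: sum of (y_i - yhat_i) ** 2
sst = 1599.6     # total sum of squares: sum of (y_i - ybar) ** 2

# SST = SSR + SSE, so the regression sum of squares follows by subtraction.
ssr = sst - sse

# Coefficient of determination: share of the total variation
# in y that the regression line accounts for.
r_squared = ssr / sst

print(f"SSR = {ssr:.5f}")
print(f"R squared = {r_squared:.4f}")
```

Because R squared is a ratio of two sums of squares, it always falls between 0 and 1 for a least squares line with an intercept, which is why it reads naturally as a percentage.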
00:30:28
in this case Y is grades and X is the
00:30:31
number of hours studied so what we
00:30:33
measured shows the percent of
00:30:35
variability in grades that is explained
00:30:37
by the number of hours studied since R squared
00:30:41
is
00:30:42
0.9505 we can say that 95.05% of the
00:30:48
variability in grades can be explained
00:30:51
by the number of hours studied one more
00:30:54
measure of how well the line fits the
00:30:56
data needs to be discussed and that is
00:30:58
the correlation coefficient this
00:31:01
measures the strength of association
00:31:03
between X and Y the correlation
00:31:05
coefficient is called R and its values
00:31:08
are from -1 to positive 1 a value of
00:31:13
positive 1 means that there is a perfect
00:31:16
positive linear relationship between X
00:31:18
and Y so that means that all the data
00:31:21
points from the sample lie exactly on
00:31:23
the line of regression with no deviation
00:31:26
and that the line slopes
00:31:29
upward an R value of -1 means a perfect
00:31:33
negative linear relationship between X
00:31:35
and Y in this case all the data points
00:31:38
lie exactly on the line of regression
00:31:40
but the line is sloping downward R can
00:31:43
take on any value between and including
00:31:47
negative 1 up to and including positive
00:31:51
one if R is zero then that means there
00:31:54
is no linear relationship between X and Y to
00:31:57
calculate R we simply take the square
00:32:00
root of the coefficient of determination
00:32:02
and use the sign of the slope we
00:32:05
calculated the r here has a subscript of
00:32:08
X and Y and it just tells us that R the
00:32:11
correlation coefficient is for the
00:32:13
values of X and Y sometimes we just
00:32:15
leave the X and Y out and say
00:32:18
R so to calculate R we take the sign of
00:32:22
the slope B1 and multiply it by the square
00:32:24
root of R squared so for our example R squared was
00:32:31
0.9505 now if we just take the square root
00:32:34
of that number we don't know if R should
00:32:36
be negative or positive since squared
00:32:39
numbers always lose their sign so in
00:32:41
order to know whether it is a positive
00:32:43
or a negative number we have to look at
00:32:45
the slope is it positive or
00:32:48
negative and then we use the sign for
00:32:50
our slope in our example of grades and
00:32:53
numbers of hours studied the slope was a
00:32:56
positive 4.74 so we use that positive
00:33:00
sign and we get R is equal to the
00:33:02
positive square root of
00:33:05
0.9505 and that is
00:33:09
0.9749 now remember we said that a plus
00:33:11
one would be perfect positive linear
00:33:14
relationship which is very rare so
00:33:16
positive 0.9749 would indicate a very
00:33:20
strong positive linear relationship
00:33:22
between X and Y so let's review what
00:33:26
we've just done and try to understand
00:33:28
the bigger picture first we calculated
00:33:31
the regression line using the least
00:33:33
squares method we calculated the slope
00:33:35
B1 and the Y intercept B naught and came up
00:33:39
with this line to fit the grade data
00:33:41
that we had when we plot this y hat line
00:33:44
on a scatter diagram we find it falls
00:33:47
around here close to the data points you
00:33:50
can see how this line fits the data very
00:33:53
nicely and we saw that when we
00:33:55
calculated R and R squared this line is a
00:33:58
very good fit to the data some of the
00:34:00
data points are exactly on the line some
00:34:02
are above and some are below let's take
00:34:05
a look at the average grade Y Bar for
00:34:08
this data remember we calculated Y Bar
00:34:11
by summing up all 10 grades from the
00:34:13
data set and dividing by 10 and we got
00:34:16
an average Y Bar of
00:34:19
77.8 so let's plot this line on the
00:34:22
scatter diagram and it would be around
00:34:25
here the average grade for the class
00:34:27
Y Bar is
00:34:28
77.8 so now we have a line for Y Bar and
00:34:32
we have a line for y hat notice that the
00:34:35
10 data points are closer to the Y hat
00:34:38
line than the Y Bar line when we
00:34:41
calculated R squared we measured how well
00:34:44
the line fit the data by calculating SSR
00:34:48
and SST R squared is the proportion of SSR
00:34:52
to SST so what exactly is SSR and SST
00:34:57
let's start with SST which stands for
00:34:59
the total sums of squares SST measures
00:35:03
how well the observations cluster around
00:35:06
the Y Bar line you can see from the
00:35:08
formula SST is the sum of Yi minus y bar squared so
00:35:13
it is taking every y observation and
00:35:16
measuring its deviation from the mean Y
00:35:19
Bar let's say for example we have an
00:35:22
observation Point here a student
00:35:24
studying seven hours with a grade of 100
00:35:27
so for this data point x is 7 and Y is
00:35:31
100 now the difference between this
00:35:34
grade of 100 and the class average of
00:35:37
77.8 is Yi minus y bar and this distance
00:35:42
is here this deviation is called the
00:35:46
total deviation that is what SST
00:35:49
measures now the predicted grade y hat
00:35:52
for 7 hours of study x equals 7 is given by
00:35:56
the line of regression and that would
00:35:59
be 88.228
00:36:28
SSR is the explained variation which is
00:36:30
the deviation between the predicted
00:36:33
value of y and the average value of y it
00:36:36
is explained by the line of regression
00:36:39
in other words the class average is a
00:36:42
77.8 so that's the expected grade
00:36:44
without any additional information but
00:36:47
if we use number of hours studied to get a
00:36:49
better prediction then we get a y hat
00:36:51
value of 88.228 so the difference
00:36:55
between the class average and the
00:36:56
predicted grade
00:36:58
is what is explained by the line of
00:37:00
regression and that is why it's called
00:37:02
explained variation explained by what
00:37:05
explained by the number of hours studied
00:37:07
our X variable but take a look at the
00:37:09
graph we have an observation point at
00:37:12
100 a student who studied 7 hours got
00:37:15
100 and not the predicted grade of
00:37:18
88.228 that deviation between the actual
00:37:21
grade and the predicted grade is called
00:37:24
error SSE the sum of the squares for the
00:37:27
error is shown by this formula the
00:37:30
difference between each Yi observation
00:37:33
and the Y hat or predicted value of y
00:37:37
you see we have Yi minus y hat and on
00:37:39
the scatter diagram where would that be
00:37:42
we'll take a look at Yi and the Y hat
00:37:44
line and that is here the deviation
00:37:47
between the actual value of y and the
00:37:50
predicted value of y and that is called
00:37:53
unexplained variation it is the
00:37:55
variation of Y that is not explained by
00:37:58
the line of regression and that's what
00:38:00
SSE is so going back to our equation for
00:38:03
R squared R squared is SSR divided by SST and you can
00:38:08
see now that that means R squared is explained
00:38:12
variation SSR divided by total variation
00:38:16
SST so when we calculated our R squared value
00:38:19
we got
00:38:21
0.9505 and we said that that meant that 95%
00:38:25
of the variability of grades can be
00:38:28
explained by the number of hours studied
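The decomposition described above can be checked for the single student who studied 7 hours, and the earlier step from R squared to r follows the same pattern. The sketch below is a minimal illustration, not part of the original tutorial: it uses only the values quoted in the transcript (y bar of 77.8, observed grade of 100, predicted grade of 88.228, R squared of 0.9505, slope of 4.74), and the helper name `correlation_from_r_squared` is a hypothetical name chosen for this example.

```python
import math

# Values quoted in the tutorial
y_bar = 77.8      # class average grade
y_obs = 100.0     # observed grade for the student who studied 7 hours
y_hat = 88.228    # grade predicted by the regression line at x = 7

# Deviations for this one observation
total_deviation = y_obs - y_bar          # what SST is built from
explained_deviation = y_hat - y_bar      # what SSR is built from
unexplained_deviation = y_obs - y_hat    # what SSE is built from

def correlation_from_r_squared(r_squared, slope):
    """Hypothetical helper: r is the square root of R squared,
    carrying the sign of the slope b1."""
    return math.copysign(math.sqrt(r_squared), slope)

r = correlation_from_r_squared(0.9505, 4.74)

print(f"total       = {total_deviation:.3f}")
print(f"explained   = {explained_deviation:.3f}")
print(f"unexplained = {unexplained_deviation:.3f}")
print(f"r           = {r:.4f}")
```

For each observation the total deviation is the explained deviation plus the unexplained deviation. Summed over all observations as squares, the same idea gives SST = SSR + SSE, so R squared can also be written as 1 minus SSE divided by SST.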
00:38:30
what that means here is that the line of
00:38:32
regression the Y hat line explains 95%
00:38:36
of the variation in grades from the mean
00:38:39
but around 5% of the variation is
00:38:41
unexplained by the line of regression
00:38:44
and that would be SSE now we are ready
00:38:46
to test the significance of this
00:38:48
relationship in a different way by
00:38:50
looking at the slope of the line
00:38:53
remember when we talked about the slope
00:38:54
earlier we said a slope of zero means
00:38:57
there is no relationship between X and Y
00:39:00
the equation for a simple linear
00:39:01
regression line is y is equal to beta naught
00:39:05
plus beta 1 * x + Epsilon the Epsilon
00:39:09
term we're not going to deal with
00:39:11
because that is the random error if the
00:39:13
slope is zero then y will be beta naught no
00:39:17
matter what value X is which means that
00:39:21
the value of y does not depend on X so
00:39:24
there is no linear relationship between
00:39:26
X and Y when the slope beta 1 is
00:39:30
zero we use this understanding of the
00:39:32
slope to conduct a hypothesis test to
00:39:35
see if there is a linear relationship as
00:39:38
follows we would State our null
00:39:41
hypothesis H naught that the slope is
00:39:43
equal to zero and the alternative Ha to
00:39:46
see if we find evidence that the slope
00:39:48
is not equal to zero if we find evidence
00:39:51
to support the alternative hypothesis
00:39:54
that the slope is not equal to zero we
00:39:56
can conclude that there is a linear
00:39:59
relationship between X and Y since we do
00:40:01
not know the value of Sigma for this
00:40:03
distribution we will be using a t test
00:40:06
and the test statistic would be B1 over
00:40:09
sb1 where sb1 is the standard error for
00:40:12
the slope to calculate the standard
00:40:15
error for the slope sb1 we use this
00:40:18
formula we need to First find S the
00:40:21
standard deviation for this distribution
00:40:23
and then divide it by the square root of
00:40:25
the sum of the x sub i minus x bar squared and so s
00:40:31
is the square root of SSE / n minus 2 from a
00:40:37
previous calculation we see that we
00:40:39
found SSE to be 79.1215 here now to
00:40:45
get S we take the square root of SSE /
00:40:49
n minus 2 and that is
00:40:53
the square root of
00:40:55
79.1215 divided by
00:40:58
10 - 2 which is
00:41:02
3.1449 so now we have S we are ready to
00:41:06
get sb1 the standard error for the slope
00:41:09
and that is the s that we just
00:41:10
calculated over the square root of the
00:41:12
sum of the x sub i's minus x bar squared so sb1
00:41:18
is
00:41:20
3.1449 / the square root of
00:41:25
67.6 if you remember back when we
00:41:27
calculated the slope we created a column
00:41:30
of the deviation of each X from the mean squared and
00:41:33
added it up and we got
00:41:35
67.6 over here so this is the same
00:41:38
number we're using again okay back to
00:41:41
our calculations we get sb1 as
00:41:47
0.3825 now we can finally calculate our
00:41:50
test statistic which is B1 over sb1 so
00:41:54
it is the slope that we calculated
00:41:56
earlier remember was
00:41:58
4.74 so we have
00:42:00
4.74 over
00:42:03
0.3825 and that gives us a test statistic
00:42:06
of
00:42:07
12.3921
00:42:09
all right let's look at what we have
00:42:11
so far we are testing to see if we have
00:42:13
enough evidence to support the
00:42:15
alternative hypothesis that the slope is
00:42:18
not equal to zero if we find this
00:42:20
evidence we will conclude that there is
00:42:22
a linear relationship between X and Y we
00:42:26
calculated our test statistic to be
00:42:30
12.3921 so now we're ready to use either
00:42:32
the critical value approach or the P
00:42:35
value approach to solve this problem
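The standard error and test statistic worked out above can be reproduced in a few lines. This is a minimal sketch that assumes only the numbers quoted in the tutorial (SSE of 79.1215, n of 10, a sum of squared x deviations of 67.6, and a slope b1 of 4.74); the variable names are chosen for this example. Because the tutorial rounds intermediate results, the last decimal place of the test statistic may differ slightly.

```python
import math

# Values quoted in the tutorial
sse = 79.1215   # sum of squares for error
n = 10          # number of observations
sxx = 67.6      # sum of (x_i - x_bar)^2
b1 = 4.74       # estimated slope

# Standard deviation of the errors, with n - 2 degrees of freedom
s = math.sqrt(sse / (n - 2))

# Standard error of the slope
s_b1 = s / math.sqrt(sxx)

# Test statistic for H0: beta_1 = 0
t_stat = b1 / s_b1

print(f"s      = {s:.4f}")
print(f"s_b1   = {s_b1:.4f}")
print(f"t_stat = {t_stat:.4f}")
```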
00:42:38
let's begin with the critical value
00:42:39
approach and let's use an alpha value of
00:42:43
0.01 now since this is a two-tail test we
00:42:46
split Alpha in half so we look up Alpha
00:42:49
divided in half which is
00:42:51
0.005 since this is a T Test we would look
00:42:54
up our critical value in the T table
00:42:57
under n - 2 degrees of
00:42:59
freedom looking in the T table under 8 degrees
00:43:03
of Freedom right n minus 2 is 10 - 2
00:43:07
which is 8 degrees of freedom and Alpha
00:43:09
divided in half which is
00:43:11
0.005 we find a critical value of
00:43:16
3.355 so with a critical value of
00:43:19
3.355 and a test statistic of 12.3921 we
00:43:24
can see looking at the T distribution
00:43:27
the critical value splits the
00:43:28
distribution into rejection regions and
00:43:31
non-rejection regions and the test
00:43:33
statistic Falls around
00:43:35
here in the rejection region we are now
00:43:38
ready to come to a statistical
00:43:40
conclusion and that of course would be
00:43:43
to reject the null there is evidence
00:43:46
that the slope is not equal to zero
00:43:48
which means there is a significant
00:43:50
relationship between grades and number
00:43:53
of hours studied we can also solve this
00:43:56
problem using the P value approach to
00:43:58
use the P value approach we must first
00:44:00
calculate the test
00:44:02
statistic and we got
00:44:06
12.3921 so we need to look up that number
00:44:08
in the T table under n minus 2 or 8 degrees
00:44:12
of freedom looking in the T table we
00:44:14
find under 8 degrees of freedom 12.3921
00:44:18
would be off the chart and therefore
00:44:22
the exact area under the curve for 12.39
00:44:25
can't be established from the table but
00:44:28
we can extrapolate the number to be less
00:44:31
than
00:44:34
0.005 remember for a two-tail test we
00:44:37
double the value we got in the table so
00:44:41
0.005 * 2 is equal to
00:44:46
0.01 the rejection rule is to reject the
00:44:49
null hypothesis if the P value is less
00:44:52
than or equal to Alpha since our Alpha
00:44:55
value for this problem was set at
00:44:57
0.01 our P value is less than 0.01 our
00:45:02
Alpha value and therefore we reject the
00:45:05
null hypothesis and find evidence that
00:45:07
the slope is not equal to zero which
00:45:10
means that grades and number of hours
00:45:12
studied have a linear
00:45:14
relationship that concludes this
00:45:16
tutorial on simple linear regression I
00:45:19
hope you enjoyed this tutorial and I
00:45:21
hope you learned something
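As a follow-up to the p-value approach, the tail area that was off the chart in the T table can be approximated numerically. The sketch below is one possible way to do this using only the Python standard library: it integrates the Student t density with Simpson's rule. The function names `t_pdf` and `two_tailed_p_value` and the step count are assumptions made for this example and are not part of the tutorial; a statistics library would normally be used instead.

```python
import math

def t_pdf(t, df):
    """Density of the Student t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def two_tailed_p_value(t_stat, df, steps=20000):
    """Approximate two-tailed p-value using Simpson's rule on [0, |t_stat|].

    steps must be even. The density is symmetric, so the area between
    -|t| and |t| is twice the area between 0 and |t|.
    """
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_zero_to_t = total * h / 3
    return max(0.0, 1.0 - 2.0 * area_zero_to_t)

alpha = 0.01
df = 10 - 2            # n - 2 degrees of freedom
t_stat = 12.3921       # test statistic from the tutorial

p_value = two_tailed_p_value(t_stat, df)
reject_null = p_value <= alpha

print(f"p-value is below alpha: {reject_null}")
```

The same function gives a two-tailed area of about 0.01 at the critical value of 3.355, which matches the T table entry used in the critical value approach.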