00:00:01
hi it's Mr Anderson and this is a mini
00:00:02
lesson on analyzing and interpreting
00:00:04
data level four statistical analysis you
00:00:07
can see in the book that we have an
00:00:09
online data set so we'll get to that in
00:00:11
just a second and you're going to be
00:00:13
looking at relationships and so when
00:00:15
you're looking at data the key thing you
00:00:17
want to do is both analyze and interpret
00:00:19
the data but in the future we're going
00:00:21
to have data that just gets bigger and
00:00:23
bigger and bigger massive data sets that
00:00:25
we'll have to make sense of and so we're
00:00:27
going to practice a little bit of that
00:00:28
first thing you want to do is you want
00:00:30
to figure out what is the data that I'm
00:00:32
really looking at and then you want to
00:00:33
use some of the online tools to start
00:00:35
organizing that data so we can make
00:00:37
sense of big data sets after you've done
00:00:40
that we look at the data and we analyze
00:00:42
the data we try to figure out what does
00:00:44
this data actually mean and we do that
00:00:46
just using these two things looking for
00:00:48
patterns and then looking for
00:00:50
relationships within the data set and
00:00:52
then the last thing we do is we
00:00:54
interpret we try to figure out what does
00:00:56
all this data mean and what predictions
00:00:58
can we make so after watching this video
00:01:01
you should be able to use some of the
00:01:03
online data sets related to mammals and
00:01:05
size and lifespan also you could look at
00:01:08
snowshoe hare population I've got some
00:01:10
good data on that I'm going to start by
00:01:12
just showing you a sample data set that
00:01:14
has some information on books and then
00:01:16
you'll have a chance to do the same
00:01:17
thing with sport balls and so what I'm
00:01:19
going to do is clean this up and then
00:01:20
we'll get
00:01:22
started okay so the first thing that we
00:01:24
want to do is we want to figure out what
00:01:26
are we using this is something called
00:01:28
CODAP and it's just a free online way
00:01:30
to look at data sets and I've loaded a
00:01:32
data set in it and so the first thing we
00:01:34
want to do is figure out exactly what is
00:01:36
this about you could read some of the
00:01:38
titles here but graphs are a really good
00:01:39
way to figure out what things are
00:01:41
actually about and so I'm going to put a
00:01:43
data graph to the right side and each of
00:01:46
these dots represents one book it's
00:01:49
got information on the pages the weight
00:01:52
the price of the book and the genre of
00:01:54
the book and so the first thing I'm
00:01:55
going to do is write down what is the
00:01:57
data that we're actually dealing with
00:02:02
okay the data that we're dealing with is
00:02:04
books and it's book size price and genre
00:02:07
next thing we want to do is start to
00:02:08
organize the data we're going to try to
00:02:10
figure out like how do we make sense of
00:02:12
the data and so to do that what I can do
00:02:15
is if I highlight things on the left
00:02:17
side it shows me what the book is
00:02:19
but a really easy way to organize it is
00:02:21
if I just click on the bottom so if I
00:02:23
click down here I could look at the
00:02:25
title and it would organize all the
00:02:26
titles I could also click down on the
00:02:29
bottom and I could say I'm interested
00:02:31
maybe in uh the prices of the book how
00:02:35
expensive they are and then it shows me
00:02:38
what a range is and so as I play around
00:02:40
with that I start to all of a sudden
00:02:42
figure out okay what are some patterns
00:02:44
in the data that I find interesting and
00:02:47
so let me write down just a quick
00:02:48
pattern that I notice as I look at this
00:02:50
so I'm looking at the price and so the
00:02:52
first pattern I might say is that um I
00:02:57
guess we have let's look at the genre
00:02:59
of books how many genres do we
00:03:02
have so there we have just four genres
00:03:04
of books it looks like uh we got
00:03:06
biography romance sci-fi and thriller
00:03:09
and so maybe I could look at uh let's
00:03:11
just look at price again and so a
00:03:14
pattern I could write down is that we
00:03:16
have uh four
00:03:21
genres so a pattern I notice is that
00:03:23
we've got four genres of books and the
00:03:25
prices range from $10 to $50 um what's
00:03:28
another pattern that I could look at to
00:03:30
show you some of the statistical
00:03:36
tools okay so it shows me the pages it
00:03:39
looks like they go from 200 to around
00:03:41
800 if I click on the right side then I
00:03:44
could look at the mean and the median
00:03:46
and I could calculate and show those
00:03:49
values up here so I can see those values
00:03:51
up at the top and so that's other
00:03:53
patterns that I could notice statistical
00:03:54
pattern so let me write that
00:03:58
down
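The range, mean, and median shown on the graph can also be computed by hand. This is a minimal sketch in Python; the page counts are invented for illustration and are not the values from the video's data set.

```python
import statistics

# Hypothetical page counts, made up for illustration only.
pages = [200, 250, 300, 320, 350, 400, 450, 800]

# Range: the smallest and largest values in the column.
page_range = (min(pages), max(pages))

# Mean: the sum of the values divided by how many there are.
mean_pages = statistics.mean(pages)

# Median: the middle value once the data are sorted
# (the average of the two middle values when the count is even).
median_pages = statistics.median(pages)

print(page_range, mean_pages, median_pages)
```

In this made-up list the single 800-page book pulls the mean above the median, which is the same kind of stretch a few very large books produce on the graph.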
00:04:04
so the pages range from 200 to 800 I
00:04:07
also have a mean and a median uh a box
00:04:10
plot might be interesting so I could
00:04:12
look at like that lower quartile and then
00:04:15
we could look at the upper like 25% of
00:04:18
the books and so we have some really
00:04:19
really big books it looks like up here
00:04:21
to the right side so that's me looking
00:04:23
at patterns the next thing I want to
00:04:25
start doing is I want to start looking
00:04:26
at relationships and so I've got these
00:04:29
different columns and so let me show you
00:04:31
some relationships and and how we might
00:04:33
be able to figure that out so maybe I'm
00:04:35
interested in pages and how pages is
00:04:38
related to the weight of the book so
00:04:41
it's going to put pages here on the x-axis
00:04:43
and then on the y-axis we're going
00:04:45
to have the weight um really cool
00:04:47
statistical analysis I could do is I
00:04:49
could do a leas square line and so
00:04:51
that's going to show me when I click on
00:04:53
that what is a best fit line and another
00:04:56
really cool thing here is it shows you
00:04:58
the correlation value this r squared value the
00:05:01
closer this value is to one the more
00:05:04
likely we are to have a direct
00:05:06
relationship between the two and so
00:05:08
that's a pretty cool relationship let me
00:05:09
write that
00:05:15
down okay so I said as the pages
00:05:17
increase the weight increases and we
00:05:19
have a correlation value it's
00:05:21
approaching one so it's a pretty good
00:05:23
relationship let me find some other
00:05:28
relationships
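A least squares line and its r squared value can be worked out with a few sums. The sketch below is one way to do it in plain Python; the function name `least_squares` and the page and weight numbers are assumptions made up for illustration, not CODAP's own code or the video's data.

```python
def least_squares(xs, ys):
    """Return (slope, intercept, r_squared) of the best-fit line y = slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sums of squared deviations and of cross-deviations from the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a straight-line fit, r squared is the squared correlation coefficient.
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical books: page counts and weights in kilograms, invented for illustration.
pages = [200, 300, 400, 500, 600]
weight_kg = [0.40, 0.55, 0.75, 0.90, 1.10]

slope, intercept, r_squared = least_squares(pages, weight_kg)
print(slope, intercept, r_squared)
```

An r squared value near one means the points sit close to the line; a value near zero means the line explains little of the variation.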
00:05:35
so I also found that as the pages
00:05:37
increase the price increases but that r
00:05:39
squared value is way less and so there is
00:05:41
a relationship but it's not as strong a
00:05:44
relationship let me find another
00:05:57
relationship so here I'm looking at the
00:05:59
median uh price of the books and it's
00:06:02
more expensive if it's a thriller than if
00:06:04
it's a sci-fi book so now I've looked at
00:06:06
a bunch of patterns I've got some
00:06:08
relationships but what I really want to
00:06:10
figure out is how are all the parts
00:06:13
listed in this related to each other so
00:06:15
I would play around with relationships
00:06:17
and then I'm going to start to put those
00:06:19
out so we can organize those in a more
00:06:21
direct
00:06:28
way
00:06:41
okay so as I've looked around I started
00:06:42
to see a lot of relationships but
00:06:45
there's really only one of those that I
00:06:47
can just in my brain make sense as a
00:06:50
causal
00:06:53
relationship so I think if you increase
00:06:56
the pages in a book the weight of the
00:06:58
book will increase now if you change the
00:07:00
genre it's not going to somehow cause
00:07:02
the pages to increase but there may be a
00:07:04
correlation there but the only causation
00:07:06
I see is pages to weight with all these
00:07:09
other cool relationships and so the next
00:07:11
thing I want to do is I want to
00:07:13
interpret the data I want to make sense
00:07:15
of the
00:07:28
data
00:07:30
so the interpretation that I wrote down
00:07:32
is that if you increase the pages then
00:07:35
you cause an increase in the weight but
00:07:38
there are many other correlations
00:07:40
between price Pages genre and weight and
00:07:43
then the last thing I could do is I
00:07:44
could make some kind of a
00:07:58
prediction
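Making a prediction from a best-fit line amounts to plugging a value into the line's equation. A minimal sketch follows; the helper name `predict` and the slope and intercept are hypothetical numbers chosen for illustration, not the coefficients CODAP reports for the book data.

```python
def predict(slope, intercept, x):
    """Read a predicted y value off the line y = slope * x + intercept."""
    return slope * x + intercept


# Hypothetical best-fit line for weight (kg) against pages; these are assumed values.
weight_slope = 0.0017      # kilograms per page
weight_intercept = 0.05    # kilograms

predicted_weight = predict(weight_slope, weight_intercept, 500)
print(round(predicted_weight, 2))
```

The prediction is only as trustworthy as the fit, so a line with a low r squared value gives a much rougher estimate.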
00:08:01
so the prediction that I said is a 500
00:08:03
page book and I'm just using this kind
00:08:05
of best fit line a 500 page book is
00:08:08
going to cost about $33 and it's going
00:08:11
to weigh about 0.9 kilograms and so uh
00:08:14
this is just a way to use CODAP as a
00:08:16
way to organize the data analyze it and
00:08:19
interpret it what I'm going to do is I'm
00:08:20
going to clean this all up and then I'm
00:08:21
going to give you a chance to do this on
00:08:23
your own okay now that you've learned
00:08:25
how to do some analysis and
00:08:27
interpretation on big data sets like
00:08:29
this I've made one for you to use I'll
00:08:31
put a link down below so you can find
00:08:33
that and what I would encourage you to
00:08:35
do is go through figure out what is this
00:08:36
data play around with organizing it it's
00:08:39
pretty straightforward you always want
00:08:40
to make sure that you hit the graph and
00:08:42
that goes off to the side and you can
00:08:44
resize it and then pretty much all the
00:08:46
tools you're going to want to use are
00:08:48
up here or on the different axes so
00:08:50
I would encourage you to go play around
00:08:52
with this data set on sport balls uh
00:08:55
figure out what it is then find some
00:08:57
patterns relationships and then
00:08:59
interpret and predict then unpause the
00:09:01
video come back and we'll see how our
00:09:02
interpretation is similar and how it's
00:09:13
different okay so the first thing I
00:09:15
would do is I would just see what does
00:09:17
this data represent and so on the bottom
00:09:19
I could just look at the object names
00:09:21
and it's going to show me we've got a
00:09:22
bunch of different balls uh if I click
00:09:25
on here we could also look at
00:09:27
diameter looks like we also have
00:09:30
information on circumference and then it
00:09:33
looks like we also have some data on
00:09:35
weight and price and so first thing I
00:09:37
would do is write down okay this is
00:09:39
going to be what the data
00:09:45
represents okay so the data that we have
00:09:47
is we've got a bunch of sport balls and
00:09:50
we've got some in different size price
00:09:52
and then we have weight and so the first
00:09:54
thing I want to do is start playing
00:09:55
around with patterns what are just some
00:09:57
descriptive patterns that I would find
00:09:59
in the data itself and so as I start to
00:10:02
click around let me find some stuff
00:10:03
that's
00:10:18
interesting so the first pattern I
00:10:20
notice is that the bowling ball is both
00:10:22
most expensive at $119 and also the
00:10:25
heaviest at about 7 and a quarter kilograms let me
00:10:28
look for some other evidence or other
00:10:31
patterns that I
00:10:50
noticed so the next thing I noticed is
00:10:53
if I look at diameter we have a range of
00:10:55
4 cm to 32 cm and also I calculated mean
00:11:00
and median and I got a mean of 12.4 and
00:11:02
a median of 7.4 so a lot of the data is
00:11:05
really stretched out here for some of
00:11:07
these that are just larger diameter next
00:11:09
thing I want to start doing is looking
00:11:10
at relationships what are some
00:11:12
interesting relationships that I find
00:11:14
between data and that's where you're
00:11:15
toggling between the X and the y axis so
00:11:17
let me play around with
00:11:27
that
00:11:56
okay some relationships that I've
00:11:58
discovered is the diameter is
00:12:01
directly related to the circumference I
00:12:04
think that's uh just math so the reason
00:12:07
why is that when I find the r squared value
00:12:10
it's one so it's a perfectly direct
00:12:13
relationship and then if I look at the
00:12:15
slope it's pi 3.14 what are some other
00:12:18
relationships that uh the price is
00:12:20
directly related to weight and so
00:12:22
there's a good relationship
00:12:25
there and so I got an r squared value of 0.9
00:12:28
that means it's close to one so it's
00:12:30
pretty close to a direct relationship
00:12:32
but then when I looked at price related
00:12:34
to
00:12:36
diameter there's still a relationship
00:12:38
but it's not as good a correlation the r
00:12:41
squared value is 0.52 and so uh as I start to
00:12:44
look at relationships I can start to
00:12:46
understand the data set in a little bit
00:12:48
better way and so the next thing I want
00:12:50
to do is I want to interpret uh and then
00:12:53
make some predictions but to do that I
00:12:54
really have to look at all the overall
00:12:57
relationships
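The diameter and circumference result can be checked directly, because circumference is pi times diameter. In this sketch the diameters are hypothetical values picked inside the 4 cm to 32 cm range; fitting a line through exact circumferences should give a slope of pi and an r squared of one, up to floating-point rounding.

```python
import math

# Hypothetical diameters in centimeters, chosen for illustration only.
diameters_cm = [4.0, 7.4, 12.0, 22.0, 32.0]

# Exact circumferences from the geometry of a circle: C = pi * d.
circumferences_cm = [math.pi * d for d in diameters_cm]

n = len(diameters_cm)
mean_d = sum(diameters_cm) / n
mean_c = sum(circumferences_cm) / n

# Deviation sums used by the least squares formulas.
sxx = sum((d - mean_d) ** 2 for d in diameters_cm)
syy = sum((c - mean_c) ** 2 for c in circumferences_cm)
sxy = sum((d - mean_d) * (c - mean_c)
          for d, c in zip(diameters_cm, circumferences_cm))

slope = sxy / sxx                      # expected to be very close to pi
r_squared = (sxy * sxy) / (sxx * syy)  # expected to be very close to 1

print(slope, r_squared)
```

Measured data would only give exactly one if every circumference were recorded without rounding or measurement error.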
00:13:22
okay so the relationships that I
00:13:24
discovered were uh if you look at
00:13:26
diameter and circumference they're
00:13:27
correlated and so I just said those make
00:13:30
up the size of a sport ball and then I
00:13:32
said if you increase the size then you
00:13:34
increase the weight um so it's not like
00:13:37
the price is somehow making it bigger or
00:13:40
making its size larger but I did find a
00:13:42
correlation between these as well so I
00:13:44
said as you increase size of a ball it
00:13:46
makes it heavier and price is correlated
00:13:48
to both size and weight and then the
00:13:51
last thing I have to do is I have to
00:13:52
make some kind of a
00:13:57
prediction
00:14:11
so what I said is the uh sport ball
00:14:14
with a diameter of 15 cm would have a
00:14:17
circumference of 47 that would be
00:14:20
cm and weigh about 1 kilogram okay now
00:14:24
that you've learned how to do that what
00:14:25
I would encourage you to do look at some
00:14:26
of the other data sets such as the ones
00:14:28
on mammals also the ones on the snowshoe
00:14:31
hare population that's analysis
00:14:33
statistical analysis and interpretation
00:14:35
and I hope that's helpful