00:00:00
greetings welcome to exploratory data
00:00:02
analysis with excel
00:00:04
part four box plots
00:00:07
if you've reached this particular video
00:00:10
a little prematurely in the series
00:00:12
if you're looking for video number one
00:00:14
the starting point in the series just go
00:00:16
ahead and click up here
00:00:17
and you'll find video number one in the
00:00:19
series
00:00:20
also in the details below this video
00:00:24
you can get access to the github repository
00:00:27
where you can download all of the
00:00:28
workbooks that
00:00:30
you see me create and work with
00:00:33
in this particular video series
00:00:36
box plots last time in video 3 we talked
00:00:39
about histograms
00:00:41
which was a way of exploring your
00:00:44
numeric data visually
00:00:46
and what we saw was that histograms are
00:00:49
pretty useful by themselves but they get
00:00:51
more powerful
00:00:52
when you add dimensions when you add
00:00:54
more variables when you add more
00:00:56
columns of data into the visualization
00:01:00
or the excel chart
00:01:02
and we saw how we could use a pivot
00:01:03
chart to do that
00:01:05
today we're going to talk about an out
00:01:06
of the box excel
00:01:08
data visualization that works with the
00:01:10
numeric data
00:01:11
and another dimension in particular
00:01:15
what it works with is a categorical so
00:01:17
let me show you what i mean by that
00:01:19
so let's go to excel okay you can
00:01:22
see here i'm in excel
00:01:23
i'm in the part 4 worksheet here which you
00:01:26
can of course get from the github that i
00:01:27
mentioned earlier
00:01:28
and all i've done is taken the data that
00:01:31
we've been working with so far in this
00:01:32
series and i just hid
00:01:34
most of the columns because we're just
00:01:35
not going to need them and that gives me
00:01:37
some more real estate
00:01:38
and what i'm going to do is i'm going to
00:01:41
create a box plot
00:01:43
of age and what i would like to do is i
00:01:45
would like to see
00:01:46
age in my box plot plotted against
00:01:50
new survived because i want to see
00:01:52
essentially
00:01:54
the age of those that survived versus
00:01:56
those
00:01:57
that perished and i'm not going to
00:02:00
explain too much about a box plot i'm
00:02:01
just going to create one and then i'll
00:02:02
explain how you actually interpret
00:02:05
a box plot so first up i'm just going to
00:02:07
go ahead and click on
00:02:09
the age column here column h and then i'm
00:02:11
going to go up to insert
00:02:13
and i'm going to go over here to this
00:02:15
little bar chart statistical chart thing
00:02:17
over here
00:02:18
and i'm going to grab a box and whisker
00:02:21
plot
00:02:22
box and whisker is the formal name for
00:02:25
what is
00:02:25
commonly called a box plot so
00:02:27
i'm going to pick that one
00:02:29
and what i get here is a box plot
00:02:33
and i'm going to go ahead and actually
00:02:36
i'm not going to do much with this
00:02:37
because i'm going to reformat it because
00:02:38
i don't like this
00:02:39
but again i'm not really going to
00:02:40
explain what's going on here yet
00:02:43
so what we've got is just the age and
00:02:45
you'll notice that i don't have
00:02:47
whether or not persons perished or
00:02:50
survived on the titanic based on the
00:02:51
data we have
00:02:52
so i need to add that in so the easiest
00:02:54
way to do that is for me just to click
00:02:56
into the chart
00:02:58
and pick select data
00:03:01
and what i'm going to do here is i'm
00:03:03
going to select the horizontal
00:03:05
right the horizontal axis and show it
00:03:08
what categories i want
00:03:10
to be used in the visualization so i
00:03:12
click edit
00:03:13
and it says hey dave pick a range of
00:03:16
values
00:03:16
and here's what i'm going to do i'm just
00:03:18
going to go over here to the c column
00:03:20
just click on c2 control shift down
00:03:24
arrow
00:03:24
and select everything click ok
00:03:28
click ok and now i got to scroll back up
00:03:30
to the top
00:03:32
here and voila i got a box plot
00:03:36
okay but i'm not done yet i don't like
00:03:38
the way this looks
00:03:39
this is my aesthetic choice you can
00:03:42
leave it like this if you like
00:03:44
i actually prefer this version right
00:03:46
here
00:03:47
and i'm going to get rid of these lines
00:03:49
because i think they're distracting
00:03:51
and i'm not going to keep the chart
00:03:53
title because i think it's distracting
00:03:55
okay and here we have a box plot i'm
00:03:59
going to scroll
00:04:00
over so that my smiling face does not
00:04:04
cover it up okay so we've got a box plot
00:04:06
here
00:04:07
box plot awesome sauce so here's the
00:04:10
only thing
00:04:11
we don't know how to interpret this so
00:04:13
let me pop over to powerpoint real quick
00:04:15
and i'll explain
00:04:17
how you interpret this data
00:04:19
visualization
00:04:21
okay here we are in powerpoint and all
00:04:23
i've done is i've just copied and pasted
00:04:25
in the
00:04:26
box plot visualization from excel into
00:04:28
powerpoint so that we can just talk
00:04:30
about it
00:04:31
so the first thing we need to realize is
00:04:34
that
00:04:35
this graphical depiction here is
00:04:38
really talking about the age column it's
00:04:41
really talking about the numbers
00:04:43
it's talking about the distribution of
00:04:46
values
00:04:47
in the age column and we talked about
00:04:49
the distribution
00:04:50
of numeric data before when we talked
00:04:52
about histograms right we threw things
00:04:53
in buckets and then we counted up all
00:04:55
the numbers
00:04:56
that were in the buckets and that gave
00:04:58
us a frequency distribution and we made
00:05:00
it a graphical
00:05:01
representation this is another graphical
00:05:03
representation
00:05:04
of a numeric distribution now this is
00:05:07
different than a histogram because the
00:05:09
lines
00:05:10
the way this is actually depicted as a
00:05:13
graphic
00:05:14
has very specific meanings which makes
00:05:16
it very very useful
00:05:18
so first up what we have here is this
00:05:20
line right here notice this line right
00:05:22
here and this line right here
00:05:24
this corresponds to what is known as the
00:05:26
median
00:05:28
and if you're not familiar this is a
00:05:29
super super simple concept
00:05:32
think of a column of numbers in excel
00:05:35
right
00:05:35
a column of numbers and let's say you sort
00:05:38
them
00:05:39
from the lowest to the highest value so
00:05:41
you have a column of numbers the
00:05:42
smallest one up here
00:05:43
the median or the 50th percentile is just
00:05:45
the number that's in the middle
00:05:47
that's all the median is it's just
00:05:49
saying look if you got a big old pile of
00:05:50
numbers sort em
00:05:52
what's
00:05:55
the number that's in the middle
00:05:57
the number that's in the middle that's
00:06:00
the median
00:06:01
the 50th percentile right it splits the
00:06:02
data in half half of the data is higher
00:06:04
than the median and half the data is
00:06:05
lower than the median
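The sort-then-take-the-middle idea described above can be sketched in a few lines of Python. This is only an illustration: the `ages` list is made up, and `median_by_sorting` is a hypothetical helper name, not something from the workbook. When the count is even, the usual convention (assumed here) is to average the two middle values.

```python
import statistics


def median_by_sorting(values):
    """Sort the values and return the one in the middle."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: there is exactly one middle value.
        return ordered[mid]
    # Even count: average the two values that straddle the middle.
    return (ordered[mid - 1] + ordered[mid]) / 2


# Made-up ages, purely for illustration.
ages = [22, 38, 26, 35, 54, 2, 27]

print(median_by_sorting(ages))   # 27
print(statistics.median(ages))   # 27, the standard library agrees
```

Half of the remaining values (2, 22, 26) sit below 27 and half (35, 38, 54) sit above it, which is the split the median is meant to give.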
00:06:06
it's a pretty simple concept now not
00:06:09
surprisingly
00:06:11
if this is the median then these two
00:06:13
lines probably have
00:06:14
some sort of distinct meaning
00:06:18
and they do this is the 75th percentile
00:06:23
and this is the 25th percentile and the
00:06:25
easiest way to think about this once
00:06:27
again
00:06:27
let's take this bottom line here you
00:06:30
sorted your data
00:06:32
right you split it in half with the
00:06:34
median if you split it in half again
00:06:38
that is the 25th percentile because
00:06:40
below that line
00:06:42
is the lowest quarter of the data values
00:06:45
so all the data values median splits in
00:06:48
half
00:06:48
25th percentile splits the bottom half
00:06:51
in half again
00:06:52
we're into quarters essentially
00:06:54
similarly the 75th percentile
00:06:56
is up above and anything above
00:07:00
the 75th percentile is just the top
00:07:02
quarter of your data values
00:07:03
so that's what these lines represent and
00:07:05
it's a pretty useful way
00:07:07
to just get some sort of idea of like
00:07:09
where is the
00:07:10
gravity the biggest chunk of the
00:07:13
numbers where are they located
00:07:14
in what range and basically what this
00:07:16
tells you is
00:07:18
that half the data is within this box
00:07:22
between this value and this value you
00:07:23
have 50% of your data half your data
00:07:26
next up is this thing called the iqr
00:07:30
which stands for interquartile range and
00:07:33
all it is basically is
00:07:34
how big is this line right here
00:07:38
take this line take this value the 75th
00:07:41
percentile of the data
00:07:42
subtract off the 25th percentile of the
00:07:44
data and it tells you how long this line
00:07:46
is
00:07:46
and that just tells you like okay how
00:07:49
many
00:07:50
values what is the range of values
00:07:52
between
00:07:53
the 25th percentile and the 75th
00:07:55
percentile right how splayed out is your
00:07:57
data
00:07:58
you know if you sort it and
00:08:00
you've got the middle 50%
00:08:02
is it narrow like this or is it wide
00:08:04
like that that's what the iqr
00:08:05
tells you very useful statistic it
00:08:08
characterizes your numeric
00:08:10
data okay
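As a rough sketch of the 25th percentile, 75th percentile, and IQR just described, here is one way to compute them with Python's standard library. The `ages` values are made up, and the `"inclusive"` quartile method is an assumption: different tools use slightly different quartile conventions, so the numbers may differ a little from what Excel's chart draws.

```python
import statistics

# Made-up ages, purely for illustration.
ages = [2, 4, 18, 22, 25, 27, 30, 34, 40, 52, 71]

# quantiles(..., n=4) returns the three cut points that split the
# sorted data into quarters: 25th, 50th (median), and 75th percentile.
q1, med, q3 = statistics.quantiles(ages, n=4, method="inclusive")

# The IQR is the height of the box: 75th percentile minus 25th percentile.
iqr = q3 - q1

print(q1, med, q3)  # 20.0 27.0 37.0
print(iqr)          # 17.0
```

A small IQR means the middle 50% of the values is packed into a narrow range; a large one means it is spread out.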
00:08:12
and lastly this is the box remember i
00:08:15
said earlier box and whisker
00:08:16
is the formal name for a box plot this
00:08:19
is obviously the box
00:08:20
part of the box plot and let's talk
00:08:23
about the whiskers
00:08:24
these things right here these are
00:08:25
the whiskers these lines and
00:08:27
these lines right here these are the
00:08:29
whiskers and the whiskers are super
00:08:31
useful
00:08:32
because they kind of characterize once
00:08:34
again your data
00:08:36
and there's a standard calculation that
00:08:38
is used
00:08:39
to derive how long this line is and how
00:08:42
long this line is how long this whisker
00:08:44
is and how long this whisker is
00:08:46
and here's the calculation for the top
00:08:49
whisker
00:08:50
so this line right here right how long
00:08:53
this whisker is
00:08:54
is determined by one of two things it's
00:08:57
either
00:08:58
the maximum data value so you sort your
00:09:00
data
00:09:01
and it's that top most largest value
00:09:04
or it is this line here the 75th
00:09:07
percentile
00:09:08
this line right here 75th percentile
00:09:11
plus
00:09:13
1.5 times the iqr
00:09:16
and the iqr once again is the length of
00:09:18
this line here right so it's a standard
00:09:20
calculation it says look whichever
00:09:22
of these two values is smaller
00:09:25
then use that for the length of this
00:09:27
whisker
00:09:28
and we'll see why that's important in a
00:09:30
second
00:09:31
next up we need to take a look at the
00:09:33
bottom line here
00:09:34
and the bottom line is a similar
00:09:36
calculation so
00:09:37
this line is either determined by the
00:09:40
minimum value in the data right you sort
00:09:42
it the bottom most value
00:09:44
or it's the 25th percentile
00:09:48
line minus
00:09:51
1.5 times the iqr again
00:09:55
whichever of these two values is larger
00:09:57
and the reason why this is really super
00:09:59
cool is because it provides a
00:10:01
standardized way
00:10:03
of evaluating your collection of numeric
00:10:05
data right that sorted numeric column of
00:10:07
data
00:10:08
and determining if you have any outliers
00:10:12
which you see here as dots these are
00:10:15
outliers
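The whisker rule as described above can be sketched like this. It follows the video's wording literally (top whisker is the smaller of the maximum and the 75th percentile plus 1.5 times the IQR, bottom whisker is the larger of the minimum and the 25th percentile minus 1.5 times the IQR). The data, the `box_plot_summary` name, and the `"inclusive"` quartile method are all assumptions for illustration. Note that many charting tools draw the whisker at the most extreme data point that still falls inside that limit rather than at the limit itself, so a drawn whisker can be shorter than the value computed here.

```python
import statistics


def box_plot_summary(values):
    """Compute whisker limits and outliers using the 1.5 * IQR rule."""
    ordered = sorted(values)
    q1, med, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1

    # Top whisker: whichever is smaller, the max value or Q3 + 1.5 * IQR.
    upper_whisker = min(ordered[-1], q3 + 1.5 * iqr)
    # Bottom whisker: whichever is larger, the min value or Q1 - 1.5 * IQR.
    lower_whisker = max(ordered[0], q1 - 1.5 * iqr)

    # Anything outside the whiskers is flagged as an outlier (the dots).
    outliers = [v for v in ordered if v < lower_whisker or v > upper_whisker]

    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "iqr": iqr,
        "lower_whisker": lower_whisker,
        "upper_whisker": upper_whisker,
        "outliers": outliers,
    }


# Made-up ages, purely for illustration.
ages = [2, 4, 18, 22, 25, 27, 30, 34, 40, 52, 71]
summary = box_plot_summary(ages)

print(summary["lower_whisker"])  # 2 (the minimum wins at the bottom)
print(summary["upper_whisker"])  # 62.5 (Q3 + 1.5 * IQR wins at the top)
print(summary["outliers"])       # [71]
```

In this made-up sample the bottom whisker stops at the minimum, so there are no low outliers, while 71 lands above the top whisker and would show up as a dot.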
00:10:16
so what it's saying is look
00:10:17
statistically speaking based on the data
00:10:20
we would expect
00:10:21
most values to fall between the whiskers
00:10:24
the bulk of your data half of your data
00:10:27
is going to be right here inside the box
00:10:30
between the
00:10:30
75th and 25th percentile lines and
00:10:34
the remaining data is going to be
00:10:36
between the two whiskers
00:10:37
anything that's outside of the whiskers
00:10:39
is an outlier it's a value that's
00:10:41
extremely large
00:10:43
or extremely small based on
00:10:46
the collection of data now there will be
00:10:48
some people that will tell you and
00:10:50
rightly so
00:10:51
that a box plot by itself
00:10:54
has a lot of problems you shouldn't rely
00:10:56
on a box plot
00:10:58
solely and that's completely valid
00:11:00
however as we saw
00:11:01
in video three we're not relying solely
00:11:05
on
00:11:05
box plots we're also using histograms
00:11:08
and other
00:11:08
types of visualizations in this series
00:11:10
so they're extremely useful
00:11:12
because they're part of a larger context
00:11:14
of data visualizations that we're using
00:11:17
to explore our data set okay so this is
00:11:20
how you interpret a box plot
00:11:23
now let's go ahead and go back to excel
00:11:25
and play around with our box plot
00:11:28
okay here we are back in trusty old
00:11:30
excel and we've got a box plot and we
00:11:32
can see here
00:11:34
that you know there's not really much
00:11:37
difference between the age distribution
00:11:40
for those that survived
00:11:42
versus those that perished because you
00:11:44
can see that the boxes
00:11:45
pretty much overlap and the whiskers
00:11:47
definitely overlap
00:11:49
so the bulk of the age data for both
00:11:51
those that perished
00:11:53
and for those that survived basically
00:11:54
overlap so this isn't telling us a
00:11:56
heck of a lot
00:11:57
right now however notice this we can go
00:12:00
back over to
00:12:02
our table here and we can say
00:12:05
let's go ahead and only look at
00:12:08
let's say males in
00:12:13
hold on we gotta wait for this thing to
00:12:14
refresh real quick okay and we can see
00:12:16
now that
00:12:16
our box plot has refreshed only for
00:12:19
males
00:12:20
but we can refine it even further we say
00:12:22
look we want males
00:12:24
in second class so let's go ahead and
00:12:27
make it only second class and it'll take
00:12:29
a second here
00:12:31
and our box plot refreshes and let's
00:12:34
just go ahead and make it bigger so we
00:12:35
can actually see it again
00:12:36
boom look at that now this
00:12:40
is an interesting result because what
00:12:42
this is telling us is that for males in
00:12:44
second class and we saw this already by
00:12:46
the way
00:12:47
in part three when we were doing
00:12:49
histograms
00:12:51
we know already that young
00:12:54
males boys male children survive
00:12:57
disproportionately and now you can see a
00:12:58
big difference right look at the median
00:13:00
the median
00:13:01
of males that survived in
00:13:04
second class is very very low it's like
00:13:07
around
00:13:08
three years old so this is a result
00:13:10
right this is a prominent result this
00:13:12
tells us that
00:13:13
at least for males in second class
00:13:15
based on the distribution
00:13:17
of survived versus perished ages
00:13:21
that younger folks tend to survive in
00:13:25
second class if they were male and we
00:13:27
know that already from
00:13:28
our histogram but the box plot is just
00:13:30
another way of taking a look at data
00:13:35
numeric data the distribution of it
00:13:37
vis-a-vis some sort of
00:13:39
categorical value and that's extremely
00:13:41
useful in business data because business
00:13:43
data
00:13:43
has tons and tons of categorical values
00:13:46
in this data set for example we're
00:13:48
dealing with embarked
00:13:49
whether or not you survived your
00:13:52
class of ticket first class second class
00:13:54
third class whether you're male or
00:13:56
female these are all categoricals
00:13:57
and analyzing numeric data in relation
00:14:00
to those
00:14:01
is extremely useful as we're seeing in
00:14:03
this video and as we saw in the last
00:14:05
video
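To make the idea of analyzing a numeric column against categoricals concrete, here is a minimal sketch that groups ages by gender, ticket class, and survival, then takes the median of each group. The rows below are invented for illustration and are not the real Titanic records; `median_age_by_group` is a hypothetical helper name.

```python
from collections import defaultdict
import statistics

# Invented rows for illustration only: (sex, pclass, survived, age).
rows = [
    ("male", 2, 1, 1), ("male", 2, 1, 3), ("male", 2, 1, 8),
    ("male", 2, 0, 25), ("male", 2, 0, 31), ("male", 2, 0, 42),
    ("female", 1, 1, 29), ("female", 1, 1, 35), ("female", 1, 1, 48),
]


def median_age_by_group(rows):
    """Bucket ages by (sex, pclass, survived) and return each median."""
    groups = defaultdict(list)
    for sex, pclass, survived, age in rows:
        groups[(sex, pclass, survived)].append(age)
    return {key: statistics.median(ages) for key, ages in groups.items()}


medians = median_age_by_group(rows)
for key, value in sorted(medians.items()):
    print(key, value)
```

Each key here corresponds to one box in a grid of box plots: one combination of categorical values, with the numeric column summarized inside it.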
00:14:05
what would be really cool is if we could
00:14:08
see
00:14:09
all of these different types of
00:14:11
combinations
00:14:12
of p-class and gender vis-a-vis
00:14:16
our box plots and let me show you what i
00:14:18
mean by that let me flip over to
00:14:19
powerpoint again and let me show you
00:14:20
what i mean by that
00:14:22
here we are in powerpoint and what you
00:14:24
can see here
00:14:25
is an awesome box plot visualization
00:14:28
notice this is very similar
00:14:30
to what we looked at in the last video
00:14:32
with histograms
00:14:33
where in this top row we have females
00:14:37
all the females in the data set and in
00:14:39
this bottom row we have all the males in
00:14:41
the data set
00:14:42
and the columns are third class second
00:14:44
class
00:14:45
and first class respectively and what we
00:14:47
see here is
00:14:49
perished and survived right these are
00:14:50
the people that perished unfortunately
00:14:52
on the titanic and these are the folks
00:14:53
that survived
00:14:54
and we see all the box plots all at once
00:14:57
and we can just kind of like
00:14:59
sit back and just kind of let our eyes
00:15:01
just kind of like
00:15:02
gaze at it and focus in and of course
00:15:04
the first thing we notice is this right
00:15:05
here
00:15:06
out of all six of these plots this one
00:15:08
obviously catches our eye first and once
00:15:10
again it says
00:15:12
males in second class in case you're
00:15:15
curious
00:15:16
this particular data visualization was
00:15:18
created using the r programming language
00:15:21
as i've mentioned in previous videos i
00:15:23
have an online course
00:15:24
specifically designed to take excel
00:15:27
users
00:15:28
and quickly and easily teach them
00:15:31
r programming
00:15:32
and my course teaches you how to create
00:15:34
visualizations like this
00:15:35
so if you're interested in that just go
00:15:36
ahead and click up here and you'll find
00:15:38
another video
00:15:39
that provides more details on how an
00:15:41
excel user can learn how to do r
00:15:43
programming
00:15:44
and create real super powerful
00:15:45
visualizations like this
00:15:47
and trust me it's super easy it's a lot
00:15:48
easier than you think
00:15:51
video number four is complete video
00:15:53
number five
00:15:54
we'll start working with bar charts when
00:15:56
that's up
00:15:57
and ready i will update the video and
00:16:00
you'll see a card
00:16:01
a link either here or here for that
00:16:04
particular video
00:16:05
box plots wildly useful stuff especially
00:16:08
what we saw in the r programming example
00:16:09
when you can see a bunch of them
00:16:11
all laid out in like a grid really makes
00:16:13
your data pop
00:16:15
so box plots use them don't use them by
00:16:18
themselves as i said earlier
00:16:19
you're going to want to combine them
00:16:21
with histograms and other things that
00:16:22
we're going to be looking at in this
00:16:23
series
00:16:24
all right there you have it until next
00:16:27
time please stay healthy
00:16:28
and i wish you very happy data sleuthing