00:00:00
in this video we're looking at basic
00:00:02
probability now when I say basic I mean
00:00:05
probability as it relates to categorical
00:00:07
variables like yes no type variables
00:00:10
numerical variables have their own thing
00:00:12
going on and I'll deal with that in
00:00:13
later videos but we're going to be
00:00:15
learning through the use of an example
00:00:17
here and here it is the HBO cable
00:00:20
network took a survey of 500 subscribers
00:00:23
to determine people's favorite show now
00:00:25
in this case let's say there were two
00:00:28
categorical variables one which is
00:00:30
gender male female and the other
00:00:33
people's favorite show so I've got Game
00:00:34
of Thrones in there Westworld and I've
00:00:36
just combined all the others into this
00:00:38
other
00:00:39
category I'm thinking that those two are
00:00:42
probably the biggest shows on the HBO
00:00:44
roster so let's see how this
00:00:46
distribution pans out now out of 500
00:00:49
subscribers 80 of them are male that
00:00:52
like Game of Thrones 120 of them are
00:00:55
female that like Game of Thrones etc etc
00:00:58
and each of these squares we can call
00:01:00
joint events they're called joint events
00:01:02
because they depend on classes from two
00:01:05
different variables now the other thing
00:01:07
to note is that of course they're all
00:01:08
going to add up to this total figure in
00:01:09
the bottom right but we can also
00:01:11
calculate this total column and total
00:01:14
row where we just sum up the total
00:01:16
number of people that thought Game of
00:01:18
Thrones was their favorite show and
00:01:20
that's 200 the total number of people
00:01:22
that thought Westworld was their
00:01:23
favorite show and that's 125 and we can
00:01:25
also total up the males and females it's
00:01:27
230 males and 270 females
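The table described above can be sketched in a few lines of Python. Only some of the cell counts are read aloud in the video; the remaining cells here are reconstructed from the totals and probabilities quoted elsewhere in the video, so treat them as an assumption. The variable names are purely illustrative.

```python
# Survey counts by favorite show (rows) and gender (columns).
# Assumption: cells not read aloud in the video (Westworld and Other)
# are reconstructed from the totals and probabilities quoted elsewhere.
counts = {
    "Game of Thrones": {"male": 80, "female": 120},
    "Westworld": {"male": 100, "female": 25},
    "Other": {"male": 50, "female": 125},
}

# Row totals: how many people chose each show, regardless of gender.
show_totals = {show: sum(by_gender.values()) for show, by_gender in counts.items()}

# Column totals: how many males and females, regardless of show.
gender_totals = {
    gender: sum(by_gender[gender] for by_gender in counts.values())
    for gender in ("male", "female")
}

# Grand total: the figure in the bottom right of the table.
grand_total = sum(show_totals.values())

print(show_totals)    # {'Game of Thrones': 200, 'Westworld': 125, 'Other': 175}
print(gender_totals)  # {'male': 230, 'female': 270}
print(grand_total)    # 500
```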
00:01:30
but this is not quite a probability
00:01:32
distribution just yet so let's see how
00:01:34
we get there here's that distribution
00:01:36
again and if we divide everything by 500
00:01:40
which is the total number of
00:01:41
observations we get something which is
00:01:43
called a probability distribution so
00:01:46
where before we had 120 females who
00:01:49
preferred Game of Thrones we now have
00:01:52
0.24 in that cell because that tells us
00:01:54
that 0.24 or 24% of the distribution is
00:01:58
defined by that joint event so that's
00:02:01
why we call that a joint probability so
00:02:03
that 0.24 is called a joint probability
00:02:06
and the way we can write that is to say
00:02:07
the probability of female the event
00:02:10
where the person is female and the event
00:02:12
where the person likes Game of Thrones
00:02:14
is
00:02:15
0.24 now you might see it written with
00:02:17
the word and in between or else you
00:02:20
might see it with this sort of upside
00:02:22
down U which we're going to call
00:02:23
intersection so that's just a fancy
00:02:25
statistical term to say the intersection
00:02:27
of female and Game of Thrones which
00:02:29
which is of course that joint
00:02:31
probability now collectively those six
00:02:34
cells here form what's called the joint
00:02:36
probability distribution so all of these
00:02:38
six cells are going to add up to one
00:02:40
because if you think about it everyone
00:02:42
in this distribution has to be in one of
00:02:44
these cells you have to either be male
00:02:46
or female and you have to have selected
00:02:49
one of these options so it's no surprise
00:02:51
that this should sum up to one now we
00:02:54
can also calculate what's called the
00:02:55
marginal probability called as such
00:02:58
because it's in the margin
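As a rough sketch of the steps so far: dividing every count by 500 gives the joint probabilities, and dividing the row and column totals by 500 gives the values that sit in the margins. The cell counts not read aloud in the video are reconstructed from figures quoted elsewhere in it (an assumption), and the names below are illustrative only.

```python
# Assumption: Westworld and Other cell counts are reconstructed from the
# totals and probabilities quoted in the video.
counts = {
    "Game of Thrones": {"male": 80, "female": 120},
    "Westworld": {"male": 100, "female": 25},
    "Other": {"male": 50, "female": 125},
}
n = 500  # total number of subscribers surveyed

# Joint probability of each (show, gender) cell.
joint = {
    show: {gender: c / n for gender, c in by_gender.items()}
    for show, by_gender in counts.items()
}

# Marginal probabilities: the totals in the margins, divided by n.
marginal_show = {show: sum(by_gender.values()) / n for show, by_gender in counts.items()}
marginal_gender = {
    gender: sum(by_gender[gender] for by_gender in counts.values()) / n
    for gender in ("male", "female")
}

# The six joint cells together should account for everyone.
joint_sum = sum(p for by_gender in joint.values() for p in by_gender.values())

print(joint["Game of Thrones"]["female"])  # 0.24
print(marginal_show)                       # {'Game of Thrones': 0.4, 'Westworld': 0.25, 'Other': 0.35}
print(marginal_gender)                     # {'male': 0.46, 'female': 0.54}
print(round(joint_sum, 10))                # 1.0
```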
00:03:00
another term for this is the simple
00:03:02
probability so just be aware of that if
00:03:04
you see the phrase simple probability
00:03:06
but here this 0.4 refers to the raw
00:03:09
probability of someone liking Game of
00:03:12
Thrones and I'm sure you can tell it's
00:03:14
just the sum of both the male and female
00:03:16
joint probabilities now of course this
00:03:19
actual column here is called the
00:03:21
marginal probability distribution so
00:03:23
just like we had a joint probability
00:03:25
distribution this one I've highlighted
00:03:27
here also sums to one now the thing to
00:03:29
appreciate is that this marginal
00:03:31
probability distribution completely
00:03:33
ignores gender it's as if that variable
00:03:35
doesn't exist here but be aware also
00:03:37
there's another marginal probability
00:03:39
distribution down here and in this case
00:03:42
it would ignore the show of preference
00:03:44
but again this is going to sum to one
00:03:46
because there are 46% males in our
00:03:49
sample and 54% females in our sample and
00:03:52
we're going to have to sum them up to
00:03:53
get 100% so our joint probability
00:03:56
distribution adds up to one but so do
00:03:58
both of these marginal probability
00:04:00
distributions okay so here's a couple of
00:04:03
questions for you just using what we've
00:04:04
gleaned from the last minute or so you
00:04:07
should be able to answer some of these
00:04:08
questions what's the probability of an
00:04:10
HBO subscriber being male now I think I
00:04:12
might have even said this before but the
00:04:14
probability of a subscriber being male
00:04:16
is purely that marginal probability of
00:04:19
the male column so 0.46 now just be aware
00:04:22
of terminology when we say probability
00:04:24
we're after a number that's between 0
00:04:26
and one sometimes they'll ask about
00:04:28
percentages and stuff so you can turn
00:04:30
that into a percent but if it just says
00:04:32
probability you can just leave it
00:04:34
as 0.46 or the next question what's the
00:04:36
probability of an HBO subscriber
00:04:38
preferring Westworld pretty simple it's
00:04:41
just going to be the 0.25 so we can keep
00:04:44
going appreciating that those first two
00:04:46
questions were asking for marginal or
00:04:48
simple probabilities this question asks
00:04:51
what's the probability of an HBO
00:04:53
subscriber being male and preferring
00:04:55
Westworld so that's 0.2 and again you've
00:04:58
got that intersection symbol
00:05:01
here and here's a slightly different one
00:05:03
it says what's the probability of an HBO
00:05:05
subscriber being male or preferring
00:05:08
Westworld so this is not an intersection
00:05:11
it's not a joint probability it actually
00:05:13
requires us to think just a little bit
00:05:16
and appreciate that to find this we're
00:05:18
going to have to sum up all of the joint
00:05:21
probabilities where this condition's met
00:05:23
so if the subscriber is male we're in
00:05:26
this column so we can highlight these
00:05:28
three cells green and we can also
00:05:30
highlight the Westworld row green as
00:05:32
well so in summing up those four cells
00:05:35
we're going to be able to get this
00:05:36
probability and you can see here I've
00:05:38
just gone all of those numbers added
00:05:39
together and I've got
00:05:41
0.51 now you might have noticed I used
00:05:43
this upwards U here and this is the
00:05:47
symbol that we call Union so this is the
00:05:49
union of two events of someone being
00:05:51
male and the event someone liking
00:05:54
Westworld now if you've watched a few of
00:05:56
my other videos you might know that I
00:05:57
tend not to like formulas too much
00:05:59
because formulas tend to stop us
00:06:01
thinking and many of them have an
00:06:03
intuitive basis to start with and this
00:06:05
is true of this formula here which
00:06:07
provides us a way of calculating the
00:06:10
union between two events so you might
00:06:12
have seen this somewhere where we say
00:06:14
the union between events A and B is the
00:06:17
probability of A plus the probability of
00:06:19
B minus the intersection and again you
00:06:21
can calculate this one that way if you'd
00:06:23
like which is to say that we add up the
00:06:26
two marginal probabilities the 0.25 and
00:06:29
the
00:06:30
0.46 and then we subtract the joint
00:06:33
probability at 0.2 now why do we
00:06:35
subtract the 0.2 well if you think about
00:06:38
each of these marginal probabilities this
00:06:40
one here sums up the entire row for
00:06:42
Westworld this one here sums up the
00:06:45
entire column for male thus we've
00:06:47
actually added this cell twice the male
00:06:50
Westworld joint probability we've added
00:06:53
that twice if we've just simply added
00:06:54
these two blue cells together so we have
00:06:57
to subtract one of those away again to
00:06:59
be left with 0.51 which is no surprise
00:07:02
the same answer we got using the other
00:07:04
method this brings us to the next and
00:07:07
more tricky type of probability which is
00:07:09
called a conditional probability so here
00:07:11
I've given you a question again as an
00:07:13
example no just got an HBO subscription
00:07:17
what's the chance that her favorite show
00:07:18
will be Game of Thrones now again you
00:07:21
might have seen a formula that looks a
00:07:23
little bit like this where this says the
00:07:25
probability of event A given event B
00:07:29
equals the intersection of the two
00:07:30
events divided by the probability of the
00:07:34
condition so in this case what we do is
00:07:37
we basically focus on the part of this
00:07:39
distribution which is of interest and
00:07:42
because non is female we can ignore the
00:07:44
rest of the table completely we're only
00:07:47
really considering this column here we
00:07:50
take the joint probability which is 0.24
00:07:53
and we divide by the probability of the
00:07:56
condition which is 0.54 now the
00:07:58
condition was that she was female that
00:08:00
was given in the question so we can
00:08:02
calculate this as 0.24
00:08:05
/ 0.54 which is 0.4444 which is about four ninths
00:08:09
but that's okay we can leave it as a
00:08:11
four decimal place probability here so
00:08:14
what we can do now is create for
00:08:16
ourselves a new column which is the
00:08:18
probability of preferring each of these
00:08:20
shows given someone is female so we just
00:08:24
got 0.4444 for Game of Thrones here now we
00:08:27
can do the same for Westworld and other
00:08:31
we find here what's called the
00:08:32
conditional probability distribution so
00:08:35
this is the probability distribution of
00:08:37
preferring various shows given someone
00:08:40
is female and again because it's a
00:08:43
complete probability distribution it's
00:08:45
going to add up to one and indeed this
00:08:48
does so what we can do now is we
00:08:51
can compare this conditional probability
00:08:53
distribution to the marginal probability
00:08:57
distribution that's this one here the
00:08:58
original total
00:09:00
distribution and we can see that if we
00:09:03
don't take sex into account 40% of
00:09:06
people like Game of Thrones 25% like
00:09:08
Westworld 35% like other shows but when
00:09:11
we take gender into account these change
00:09:14
a little bit so for females you can see
00:09:16
that they're more likely to like Game of
00:09:18
Thrones than the general population
00:09:20
they're less likely to like Westworld
00:09:22
than the general population and they're
00:09:24
a little bit more likely to like other
00:09:26
shows than the general population as
00:09:28
well and we can actually use this to
00:09:30
assess whether the two variables gender
00:09:33
and show choice are independent but
00:09:36
before we do just appreciate that we can
00:09:37
also go the other way with our
00:09:39
conditions so for example if I asked you
00:09:42
given that a subscriber's favorite show
00:09:44
is Westworld what's the probability that
00:09:46
they are male in this case the condition
00:09:49
is that they like Westworld so our
00:09:52
condition is actually an entire row so
00:09:55
much like last time we can block out the
00:09:57
rest of the table that doesn't coincide
00:09:59
with Westworld and we can just focus on
00:10:01
this row and the probability of being
00:10:03
male is that 0.2 which is the joint
00:10:06
probability divided by the probability
00:10:09
of the condition
00:10:10
0.25 so if you do that you get 0.8 and
00:10:13
that's the probability of being male if
00:10:16
you prefer Westworld in other words 80%
00:10:18
of people that prefer Westworld are
00:10:20
males so this leads us to the topic of
00:10:23
independence between the variables
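Both the "or" calculation and the two conditional probabilities worked through above can be checked with a short sketch. The helper `prob` is a hypothetical name introduced just for this illustration, and the cell counts not read aloud in the video are reconstructed from figures quoted in it (an assumption).

```python
N = 500  # total number of subscribers surveyed

# Assumption: Westworld and Other cell counts are reconstructed from the
# totals and probabilities quoted in the video.
counts = {
    "Game of Thrones": {"male": 80, "female": 120},
    "Westworld": {"male": 100, "female": 25},
    "Other": {"male": 50, "female": 125},
}

def prob(show=None, gender=None):
    """Share of subscribers matching the given show and/or gender (None means any)."""
    matching = sum(
        c
        for s, by_gender in counts.items()
        for g, c in by_gender.items()
        if show in (None, s) and gender in (None, g)
    )
    return matching / N

# Union, method 1: add up every cell where the subscriber is male OR prefers Westworld.
p_union_cells = sum(
    c
    for s, by_gender in counts.items()
    for g, c in by_gender.items()
    if g == "male" or s == "Westworld"
) / N

# Union, method 2: P(A or B) = P(A) + P(B) - P(A and B).
p_union_formula = (
    prob(gender="male") + prob(show="Westworld") - prob(show="Westworld", gender="male")
)

# Conditional: P(A | B) = P(A and B) / P(B).
p_got_given_female = prob("Game of Thrones", "female") / prob(gender="female")
p_male_given_westworld = prob("Westworld", "male") / prob(show="Westworld")

print(round(p_union_cells, 2))           # 0.51
print(round(p_union_formula, 2))         # 0.51
print(round(p_got_given_female, 4))      # 0.4444
print(round(p_male_given_westworld, 2))  # 0.8
```

Subtracting the intersection in the second method removes the male-and-Westworld cell that would otherwise be counted twice.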
00:10:29
now we've already touched on it just
00:10:31
briefly to show that females have a
00:10:33
little bit of a different preference of
00:10:35
their HBO shows to the general
00:10:38
population but the strict definition of
00:10:40
independence actually has two different
00:10:43
approaches the first of which says that
00:10:46
if the two variables are independent
00:10:48
then the probability of A given B is
00:10:50
just equal to the probability of A
00:10:53
another way to think of this is that
00:10:55
imposing the condition B doesn't
00:10:58
actually affect
00:10:59
the probability of A at all so in our
00:11:02
case we found the probability of liking
00:11:05
Westworld if you're female is 0.093 we
00:11:09
found that from our conditional
00:11:11
probability distribution earlier but the
00:11:14
probability of just preferring Westworld
00:11:16
straight up is 0.25 it's in that final
00:11:19
value here it's in that marginal value
00:11:22
here so if these two variables gender
00:11:25
and choice of HBO shows were independent
00:11:29
these two values should be equal
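To put the numbers side by side, here is a small sketch that compares the probability of each show given female against the plain marginal probability of that show. It only reports the two values; how big a gap counts as meaningful in a sample is a separate question that this sketch does not attempt to answer. The cell counts not read aloud in the video are reconstructed from figures quoted in it (an assumption).

```python
# Assumption: Westworld and Other cell counts are reconstructed from the
# totals and probabilities quoted in the video.
counts = {
    "Game of Thrones": {"male": 80, "female": 120},
    "Westworld": {"male": 100, "female": 25},
    "Other": {"male": 50, "female": 125},
}
n = 500
female_total = sum(by_gender["female"] for by_gender in counts.values())

# For each show, store (P(show | female), P(show)).
comparison = {}
for show, by_gender in counts.items():
    conditional = by_gender["female"] / female_total  # condition on being female
    marginal = sum(by_gender.values()) / n            # ignore gender entirely
    comparison[show] = (conditional, marginal)
    print(f"{show}: P(show | female) = {conditional:.3f}, P(show) = {marginal:.3f}")

# Prints:
# Game of Thrones: P(show | female) = 0.444, P(show) = 0.400
# Westworld: P(show | female) = 0.093, P(show) = 0.250
# Other: P(show | female) = 0.463, P(show) = 0.350
```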
00:11:32
therefore the variables are not
00:11:33
independent as 0.093 does not equal 0.25
00:11:38
so clearly gender does influence the HBO
00:11:41
show that they prefer now it's probably
00:11:44
worth mentioning that because this is a
00:11:46
sample you would never expect for these
00:11:49
values to be exactly equal because we
00:11:52
know in a sample there's going to be
00:11:53
some random variation and even if the
00:11:56
variables were completely independent
00:11:59
these two probabilities wouldn't be
00:12:01
perfectly equal but in this case I think
00:12:05
it's quite clear that they're very
00:12:06
different probabilities one's 25% and
00:12:09
one's 9% so I think we're quite safe to
00:12:12
say these variables are not
00:12:14
independent now as I said there were two
00:12:16
approaches for assessing independence
00:12:18
and this one says the probability of A
00:12:21
intersection B is equal to the two marginal
00:12:24
probabilities multiplied together now
00:12:26
I've got a feeling you might understand
00:12:28
this equation intuitively so if I can take you
00:12:30
away from this example just for a second
00:12:33
say I flipped a coin in one hand and I
00:12:35
rolled a dice in the other hand and I
00:12:38
said I'm going to give you a hundred
00:12:39
bucks if I roll a six and I flip a head if
00:12:44
both of those events occur I'll give you
00:12:46
a 100 bucks and I asked you what is the
00:12:47
probability of me giving you a 100 bucks
00:12:51
you'd probably say well that's one in 12
00:12:54
right because you've got half a chance
00:12:55
of getting a head and a sixth of a chance
00:12:58
of rolling a six
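The coin-and-die example can be checked by listing all twelve equally likely outcomes. This is only a sketch assuming a fair coin and a fair six-sided die; the names are illustrative.

```python
from fractions import Fraction
from itertools import product

coin = ["head", "tail"]
die = [1, 2, 3, 4, 5, 6]

# Every (coin, die) pair is equally likely: 2 * 6 = 12 outcomes.
outcomes = list(product(coin, die))

def share(condition):
    """Fraction of the equally likely outcomes that satisfy the condition."""
    return Fraction(sum(1 for o in outcomes if condition(o)), len(outcomes))

p_head = share(lambda o: o[0] == "head")
p_six = share(lambda o: o[1] == 6)
p_head_and_six = share(lambda o: o[0] == "head" and o[1] == 6)

print(p_head, p_six, p_head_and_six)     # 1/2 1/6 1/12
print(p_head * p_six == p_head_and_six)  # True
```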
00:12:59
so you multiply them together you're
00:13:01
going to do the probability of A getting
00:13:03
a head times the probability of B and
00:13:05
you're going to get that joint
00:13:07
probability but the only reason you can
00:13:09
do that is because rolling a dice and
00:13:12
flipping a coin are completely
00:13:13
independent variables it's not as if
00:13:16
rolling a six influences the chance of
00:13:18
getting a head right so as I said you can
00:13:21
kind of intuitively understand this
00:13:23
formula but what that implies for our
00:13:25
distribution here is that if we were to
00:13:27
look at the intersection between
00:13:30
Westworld and female we get
00:13:32
0.05 and if you multiply the two
00:13:34
marginal values together the
00:13:37
0.54 and the
00:13:38
0.25 you get
00:13:41
about 0.14 so again these two values are not
00:13:44
the same so we can say they're not
00:13:46
independent
00:13:48
variables which is no surprise because
00:13:50
we just showed this using the other
00:13:51
approach so either of these approaches
00:13:53
would be fine to show that people's
00:13:57
preference of HBO shows does depend on
00:14:01
their gender all right so that's a wrap
00:14:04
we've dealt with joint probabilities
00:14:06
marginal probabilities conditional
00:14:08
probabilities and we've also looked at
00:14:10
how to test for independence between two
00:14:13
categorical variables now if you like
00:14:16
the video I got plenty more you can
00:14:17
check them out on the YouTube channel or
00:14:20
by heading to my website zedstatistics.com
00:14:22
and if you've got any ideas feel free to
00:14:24
get in touch adios
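As a recap of the second independence check from the video, this sketch compares each joint probability with the product of its two marginal probabilities. For Westworld and female the product works out to 0.135, which the video rounds to 0.14. The cell counts not read aloud in the video are reconstructed from figures quoted in it (an assumption), and the names are illustrative.

```python
# Assumption: Westworld and Other cell counts are reconstructed from the
# totals and probabilities quoted in the video.
counts = {
    "Game of Thrones": {"male": 80, "female": 120},
    "Westworld": {"male": 100, "female": 25},
    "Other": {"male": 50, "female": 125},
}
n = 500

p_show = {show: sum(by_gender.values()) / n for show, by_gender in counts.items()}
p_gender = {
    gender: sum(by_gender[gender] for by_gender in counts.values()) / n
    for gender in ("male", "female")
}

# For each cell, store (joint probability, product of the two marginals).
# Under independence these two numbers would be roughly equal.
check = {}
for show, by_gender in counts.items():
    for gender, c in by_gender.items():
        joint = c / n
        expected = p_show[show] * p_gender[gender]
        check[(show, gender)] = (joint, expected)
        print(f"{show}, {gender}: joint = {joint:.3f}, product of marginals = {expected:.3f}")

# The Westworld / female line prints:
# Westworld, female: joint = 0.050, product of marginals = 0.135
```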