Welcome back. I'm Steve Brunton from the University of Washington, and this is the second half of a new course on probability and statistics. This is one of my absolute favorite topics in all of mathematics. It is incredibly useful; it's up there with calculus, linear algebra, and differential equations as one of the pillars of how we model the real world, especially systems that are too complex to handle with classical deterministic methods.

This is the overview of the second half of the course, on statistics. By the time this video comes out, you've probably seen that there's a whole series on probability, roughly 10 hours of lectures on probability theory. Now we're about to launch into the dual problem of statistics. This is where the rubber hits the road. Probability is all about mathematical modeling, combinatorics, distributions; it's really elegant theory.
I'm actually going to write this down: probability is all about modeling uncertainty in the real world, about building models. Statistics is all about data. This is really important for us in the modern machine learning era as data scientists: statistics is all about taking data and saying something about the probability model. In probability, you assume you have the model, you assume the distribution is known, and we don't know what the samples or the data are going to look like, but we want to say what is likely. That's the probability problem. The dual of that, the flip side, is the statistics problem, where now we have data. We assume the samples and data are known, and we want to infer something about the underlying probability model: the parameters of the system, something about the system, from data. These are of course dual problems; they're intimately related, and you need to know those foundational probability concepts to do good statistics. But statistics is really where we start being able to make powerful predictions, decisions, and estimations, and again, the basis of modern machine learning is statistical data analysis.
This is one of my passions. I love this. I learned it over 20 years ago at the University of North Texas from Dr. John Quintanilla, so I want to give a mad shout-out to Dr. Q. I'm going to base a lot of what I'm doing on what I learned from Dr. Q's notes at the University of North Texas. In fact, I was going back through those notes the other day, brushing up on some topic like random walks or Markov chains, and I actually found one of the first times I ever wrote down "eigensteve." I was sitting in that class back when I was 17 years old, so that might be where eigensteve actually comes from. Anyway, I want to give a ton of credit to Dr. John Quintanilla, who taught me essentially everything I know about probability and statistics. Anything interesting and correct I say is probably him; anything incorrect and misleading is probably because this is 20 years later.

But I'm going to specifically take a modern perspective: what we really want to do is start driving towards big data and really complicated, nasty probability models that don't belong to the classes of easy classical probability models we have been analyzing, things like normal distributions, exponentials, Poissons, and so on. Those are still super useful for tons of real-world problems, but there are other real-world problems where the probability densities don't have a nice analytic closed-form expression, and you have to learn them from data using machine learning. This is all going to build towards that, but we're going to start with foundational statistics.
Okay, good. I just want to give you the outline of this class. Again, the probability course was about 10 hours, broken into two modules of intermediate and advanced material, and statistics is going to be about the same. There's going to be a core five hours that you need to know, the intro/intermediate material, and then there will be advanced topics that are more specialized and more technical, so you can choose your own adventure for how much you want to learn. I really want to make this as targeted as possible: if you have five hours, I want you to get the best five hours of probability, or the best five hours of statistics, that gets you as close as possible to being able to use this. If you have another five hours, go deeper. And after this there will be a bunch of special topics: things like stochastic differential equations, Markov chains, Monte Carlo, optimization for Bayesian methods, machine learning. There's an unlimited amount of cool stuff, so I'm just going to keep adding for a long time, hopefully. Okay, let's get into it.
So: given data of a system, here are some things we can do. ("Things we can do" reminds me of a Deltron song.) I'm just going to go in order.

The first thing we can do is something called survey sampling, or polling: we draw a small sample from a large population. We're starting here because this is where the statistics is easiest. Just like in probability, where we started with baby steps and then worked up very quickly to some pretty advanced concepts, we're going to do the same thing here. The idea is: what can we say about the larger population from the small sample that we draw, and how big a sample is big enough to say things with statistical confidence about that larger population?
Some of the things we're going to do: for example, there is this notion of a sample mean. If I have a sample, I can take its average. Maybe I'm measuring people's political preferences, or the height of an American, which follows a normal distribution quite closely, and I might draw a small sample of 100 people to try to say things about the larger population distribution. The sample mean is a really simple, cool idea: you literally take the variable you're measuring over your small sample and compute its average value. It's just the mean of your sample, and we often call it x-bar.

In the last set of lectures, on probability, this culminated in the central limit theorem, which showed that this sample mean of actual data tends to be distributed as a normal random variable, where the mean of that normal distribution is the mean mu of the true population and its variance is sigma^2 / n, where n is the sample size. This is the kind of thing you can do with statistics, and it's where we're going to start: take a small sample, compute its mean, and show with the central limit theorem that the result is a normally distributed random variable, where mu and sigma^2 are the population mean and variance and n is the size of the sample used to compute the mean.

This is incredibly powerful. Because I have that variance, it essentially says that this variable will converge to the true mean: if I take the average of a sample, it tells me something about the average of my population, and the variance tells me how close to the true value I am. How big an n do I need for this variance to be small? How much wiggle room do I have in this estimate of the true mean for a given n? So this tells me a lot of useful things, including how I might design an experiment if I want a certain amount of accuracy or uncertainty in my estimate. Really important, and survey sampling is a simple place to start.

And this is true for any distribution of my data. It doesn't have to be normally distributed heights; I can take samples of a large population with some weird distribution, and the sample mean will still be a normally distributed random variable by the central limit theorem. That's the power of probability: results like the central limit theorem are extremely general, powerful statements about arbitrary data and distributions. So that's our starting point.
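As a quick preview of the kind of code we'll write, here is a minimal sketch (using NumPy; the exponential population and the sizes are just illustrative choices) of the central limit theorem at work: sample means of a decidedly non-normal population come out approximately normal, with mean mu and variance sigma^2 / n.

```python
# Minimal central-limit-theorem demo: sample means of a non-normal
# (exponential) population are approximately normal with mean mu and
# variance sigma^2 / n.
import numpy as np

rng = np.random.default_rng(42)

mu, sigma = 1.0, 1.0      # an Exponential(1) population has mean 1 and std 1
n = 100                   # sample size
num_trials = 10_000       # how many independent samples we draw

samples = rng.exponential(scale=mu, size=(num_trials, n))
xbar = samples.mean(axis=1)          # one sample mean per trial

print(f"mean of x-bar:     {xbar.mean():.4f}  (theory: mu = {mu})")
print(f"variance of x-bar: {xbar.var():.5f} (theory: sigma^2/n = {sigma**2 / n})")
```

A histogram of `xbar` would look very close to a bell curve, even though the underlying population is badly skewed.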
Topic two is going to get even more interesting. Survey sampling is just laying the foundation with some easy math that ties data back to probability. Now we're going to start doing hypothesis testing. There are so many hypotheses you can write down; I'm just going to give a few to give you a flavor of the kinds of things we're going to be able to do, really powerful things.

Does a drug work? Let's say there's some new super drug that is supposed to cure cancer, or cause incredible weight loss. Does the drug work or not? This is something we can test with statistics: we can have a control group and a treatment group and test whether their means are different, which would indicate that the drug did something. That's a hypothesis we can test using these distributions, literally using a normal distribution.

Did a marketing campaign work or not? For example, did a marketing campaign increase web traffic? This is what we call A/B testing in computer science: you make a modification, you change something about your website, and you see whether people click on ads more. That would be A/B testing. A drug working or not is a control-versus-treatment experiment: you'd have a control group and a treatment group.
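Here is a hedged sketch of what the control-versus-treatment comparison looks like in code (the effect size, noise level, and group sizes are all invented for illustration; `scipy.stats.ttest_ind` performs the standard two-sample t-test):

```python
# Toy control-vs-treatment comparison with a two-sample t-test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

control = rng.normal(loc=10.0, scale=2.0, size=50)     # no drug
treatment = rng.normal(loc=11.0, scale=2.0, size=50)   # drug shifts mean by +1

t_stat, p_value = stats.ttest_ind(treatment, control)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Group means differ at the 5% significance level.")
```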
Another thing you can do, one of my absolute favorites, and something I've been thinking about a lot lately: are two distributions the same? This is what the chi-square test is going to tell us. The chi-square is a distribution from probability that allows us to test a hypothesis using data, so it becomes statistics. There are really important ideas here about testing hypotheses with data, based on probability models of how that data should behave. You can test lots of cool hypotheses, and this again generalizes to machine learning when those distributions are empirical distributions. You can really think of machine learning as building empirical distributions from data: you get empirical probability models from a wealth of measurement data. So testing hypotheses is going to be a big deal here.
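As a sketch of the chi-square idea (the die-roll counts below are made up for illustration), `scipy.stats.chisquare` compares observed counts against the counts a hypothesized distribution would predict:

```python
# Chi-square goodness-of-fit test: are these rolls consistent with a fair die?
import numpy as np
from scipy import stats

observed = np.array([18, 22, 16, 14, 19, 31])   # counts of faces 1-6 in 120 rolls
expected = np.full(6, observed.sum() / 6)        # a fair die predicts 20 of each

chi2, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"chi^2 = {chi2:.2f}, p = {p_value:.4f}")
# A small p-value (say, below 0.05) would lead us to reject fairness.
```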
Hypothesis testing also allows you to quantify how significant your results are. You don't just test these hypotheses; you get a confidence in the result. Am I 95% confident in this result? Am I 99% confident? You get a notion of statistical significance, so you can quantify how significant a result is.

This leads very naturally into something called experimental design, which is super important. If you are going to run a drug trial, let's say you think you have a new super drug. Or let's say you have a new super composite that's going to make aircraft lighter and stronger. You've got a new material or a new drug, something new that's going to be amazing, and you need to convince the world that it's safe and that it works. You need to design a statistical experiment, a data collection protocol, and a hypothesis to test, so that you can convince people of your result with some amount of significance. That is all about designing a statistical experiment to be honest, to be accurate, and to be significant, so that you can convince other people of the effect of some new drug or new material or whatever it is. So experimental design is super important.

Dual to experimental design, the significance level that we quantify in a hypothesis test is usually called the p-value. You've probably heard of the p-value before: p = 0.05 is conventionally called a statistically significant result, meaning that if there were no real effect, there would only be about a 5% chance of seeing a result this extreme by chance. What that means in practice is that a lot of people do bad statistics, called p-hacking: either through fraud or through ignorance, they do a bad experimental design to get a p-value that looks significant even though their experiment was wrong. There are lots of ways of getting significant statistical results by doing bad statistics. I'm going to tell you about those pitfalls, and we're going to code all of this up; we'll have examples in Jupyter, in Python. Because this is data and testing, we're going to build code to do all of this, and I'm going to show you in code what p-hacking looks like and what to look out for, so that you don't fall into those traps of fraud and ignorance. Super important stuff.
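To make the p-hacking pitfall concrete, here is a toy simulation (everything in it is synthetic): run many experiments on pure noise, with no real effect anywhere, and count how many come out "significant" at p < 0.05. Roughly 5% do, by chance alone, and reporting only those is exactly the trap.

```python
# p-hacking demo: test 100 null experiments and count spurious "discoveries".
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
num_experiments = 100
false_positives = 0

for _ in range(num_experiments):
    a = rng.normal(size=30)   # both groups come from the SAME distribution,
    b = rng.normal(size=30)   # so any detected "effect" is spurious
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1

print(f"{false_positives} of {num_experiments} null experiments were 'significant'")
```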
Now, the other big part of this that I want to talk about, and I think this is really cool, is the notion of fitting distributions and estimating parameters. This is the third big topic we're going to cover, and I'm putting it under machine learning because it really is the intro to machine learning.

Probability essentially involves a probability model, a probability density: the probability of x, my random variable, given some parameters theta, written p(x | theta). Theta represents the parameters of my probability distribution; in the Gaussian example, this would be the mean and the standard deviation. Statistics is the flip of this: statistics is all about finding the probability of my parameters given my data, p(theta | x). It flips the problem on its head. Given data, I want to find the best-fit parameters, the best distribution that fits that data. That's the statistics problem, and it is very much a Bayesian perspective; p(theta | x) is literally the Bayesian inverse of p(x | theta). We're going to use Bayesian ideas a lot in statistics, because we're trying to flip the paradigm: instead of estimating what the data should look like given a distribution with fixed parameters, we have data and we're trying to estimate the parameters from that data. So that's what we mean by fitting distributions.
We're going to look at a bunch of examples of how to do this. One is the method of moments, which you may have seen before. It's very closely related to the sampling statistics above: you literally estimate things about your population from sample moments, like the first sample moment (the sample mean), as in the sketch below.
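Here is a minimal method-of-moments sketch (the Gamma model and the true parameter values are just an illustrative setup): match the first two sample moments of the data to the model's moments and solve for the parameters. For a Gamma(k, theta) distribution, mean = k*theta and variance = k*theta^2, so k = mean^2/var and theta = var/mean.

```python
# Method of moments for a Gamma(k, theta) model.
import numpy as np

rng = np.random.default_rng(2)
true_k, true_theta = 3.0, 2.0
data = rng.gamma(shape=true_k, scale=true_theta, size=5_000)

m = data.mean()    # first sample moment
v = data.var()     # sample variance (second central moment)

k_hat = m**2 / v
theta_hat = v / m
print(f"k_hat = {k_hat:.3f} (true {true_k}), "
      f"theta_hat = {theta_hat:.3f} (true {true_theta})")
```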
We're also going to talk about maximum likelihood estimation, MLE. This is a big topic: maximum likelihood estimates are a super powerful way of turning the fitting problem into an optimization problem, which means we get all of the power of modern optimization, machine learning, and data to solve it. So maximum likelihood estimation is a big deal.
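Here is maximum likelihood as an optimization problem, in a minimal sketch. I'm using a normal model, where MLE just recovers the sample mean and standard deviation, but the same recipe of minimizing a negative log-likelihood with a generic optimizer works for much nastier models:

```python
# MLE via optimization: minimize the negative log-likelihood over (mu, sigma).
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(3)
data = rng.normal(loc=5.0, scale=2.0, size=1_000)

def neg_log_likelihood(params):
    mu, sigma = params
    if sigma <= 0:
        return np.inf   # penalize invalid parameters
    return -np.sum(stats.norm.logpdf(data, loc=mu, scale=sigma))

result = optimize.minimize(neg_log_likelihood, x0=[0.0, 1.0], method="Nelder-Mead")
mu_hat, sigma_hat = result.x
print(f"mu_hat = {mu_hat:.3f}, sigma_hat = {sigma_hat:.3f}")
```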
MLE also transitions very nicely into the Bayesian perspective. We're also going to talk about things like goodness of fit and hypothesis testing: once I've fit these parameters, how good is the fit? And confidence intervals: once we get the estimate of these parameters, we can also give confidence intervals, a range of theta we believe in. So not just a fixed theta, but maybe a distribution of what I expect theta to be, which is also the Bayesian perspective. I might put confidence intervals on theta-hat, my estimate. Confidence intervals and hypothesis testing are really dual problems, related through the p-value.
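As a small preview (toy data; the 1.96 comes from the normal approximation the central limit theorem gives us), a 95% confidence interval for a population mean looks like x-bar plus or minus 1.96 * s / sqrt(n):

```python
# 95% confidence interval for a mean via the normal approximation.
import numpy as np

rng = np.random.default_rng(4)
data = rng.exponential(scale=3.0, size=200)   # toy data; true mean is 3.0

n = len(data)
xbar = data.mean()
se = data.std(ddof=1) / np.sqrt(n)            # standard error of the mean

lo, hi = xbar - 1.96 * se, xbar + 1.96 * se
print(f"95% CI for the mean: [{lo:.3f}, {hi:.3f}]")
```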
Then we're going to talk about something super important (I'm going to put this in pink because it's so important): bootstrapping and Monte Carlo simulation. Simulation is the key word here. Often there are things I want to know about my statistical distribution, like the variance of my parameter estimate theta, that I can't compute with pencil-and-paper analysis. So I'll set up a big Monte Carlo simulation to get a bootstrap estimate of the distribution of my uncertain parameter. Again, a lot of modern Bayesian statistics and Bayesian machine learning is built on Monte Carlo simulation and bootstrapping, so this is going to be an advanced topic that segues into how to do computational statistics with big data and nasty distributions. Pretty cool stuff.
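Here is a minimal bootstrap sketch (the skewed toy data and the choice of the median are just for illustration; the median has no tidy analytic variance formula, which is exactly when the bootstrap shines): resample the data with replacement many times, recompute the statistic each time, and read the uncertainty off the resulting distribution.

```python
# Bootstrap estimate of the uncertainty in the sample median.
import numpy as np

rng = np.random.default_rng(5)
data = rng.lognormal(mean=0.0, sigma=1.0, size=300)   # skewed toy data

num_boot = 10_000
medians = np.empty(num_boot)
for i in range(num_boot):
    resample = rng.choice(data, size=len(data), replace=True)
    medians[i] = np.median(resample)

lo, hi = np.percentile(medians, [2.5, 97.5])
print(f"median = {np.median(data):.3f}, bootstrap 95% CI = [{lo:.3f}, {hi:.3f}]")
```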
Okay, we're going to keep going; I'm almost done. Topic four is going to be all about Bayesian statistics, and I'm going to have Bayesian statistics woven throughout these lectures. We'll get an intro to Bayes in this statistics module, but realistically we're going to have much deeper dives into Bayes later, in my optimization boot camp and in physics-informed machine learning. Bayesian frameworks allow you to take prior knowledge (maybe I know something about the distribution, or about theta, or about the physical world) and build that prior knowledge into these statistical estimates. That's a huge topic in optimization, machine learning, and physics-informed machine learning, so it's also something we'll cover a lot more later.
A good way to think about this: probability gives me a model of how I think a fair coin or a biased coin will behave, as a Bernoulli random variable. We have a model for this, and if I flip the coin 10 times, then the number of heads is going to be a binomially distributed random variable; if I flip it 100 times, that binomial starts to look like a normal distribution by the central limit theorem. The statistics view is a little different. Let's say I have this coin, I flip it 10 times, and it comes up 10 heads in a row. The statistics questions are: do I think this coin is fair? What do I think the probability is of getting heads versus tails? Can I estimate those quantities and their uncertainties?

Bayesian statistics is really important here. If I flip a coin and it comes up heads three times in a row, some statistics methods (maximum likelihood, for instance) will fail and incorrectly conclude that the parameter theta, the probability of flipping heads, is equal to one: the coin will always flip heads. That's bad. Bayesian statistics allows me to bake in prior knowledge. If I just look at a coin, if I feel a coin, my pretty strong prior is that it's a fair coin. So even if I flip three heads in a row, that's not going to shake my foundational belief that the coin is fair. It's going to take a lot more evidence for me to update my prior; maybe if I get 15 heads in a row, I'll say this is probably not a fair coin. So Bayesian statistics allows me to build in a lot of prior knowledge to robustify and improve statistics, when I have that prior knowledge. But this relies on having good prior knowledge: bad priors cause bad statistics.
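Here is a minimal Bayesian sketch of that coin example (the prior strength of 20 pseudo-counts per side is an invented, illustrative choice). With a Beta(alpha, beta) prior on theta = P(heads) and Bernoulli flips, the posterior after observing h heads and t tails is exactly Beta(alpha + h, beta + t), so we can watch the prior resist three heads but start to bend after fifteen:

```python
# Beta-Bernoulli posterior updates for the "is this coin fair?" question.
from scipy import stats

alpha, beta = 20, 20   # a fairly strong prior belief that the coin is fair

prior = stats.beta(alpha, beta)
post_3 = stats.beta(alpha + 3, beta)    # after 3 heads, 0 tails
post_15 = stats.beta(alpha + 15, beta)  # after 15 heads, 0 tails

print(f"prior:          E[theta] = {prior.mean():.3f}")
print(f"after  3 heads: E[theta] = {post_3.mean():.3f}")   # barely moves
print(f"after 15 heads: E[theta] = {post_15.mean():.3f}")  # noticeably shifted
```

Contrast this with the bare maximum likelihood estimate, which after three straight heads would report theta-hat = 1.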
And then there's going to be a lot more. We're going to talk about tons of advanced topics that I find interesting: things like Benford's law (I love Benford's law, it's incredible); Markov chains, which are a way of merging differential equations and probability; random walks; Gaussian processes, again for stochastic differential equations; and much, much more. Eventually, what we're really driving towards is modern statistics and data analysis, which we call machine learning: fitting empirical distributions from data.

I'm super excited to walk you through this. It should be about 10 hours, intermediate and advanced, and it's going to give you a set of tools, like calculus, like linear algebra, like differential equations, to really model the real world, with its complexity and its uncertainty, from data. I'm excited to share this with you. I hope you're excited too. Stay tuned for more. Thanks!