00:00:00
if you want to finally understand
00:00:02
statistics this is the place to be after
00:00:05
this video you will know what statistics
00:00:08
is what descriptive statistics is and
00:00:11
what inferential statistics is so let's
00:00:13
start with the first question what is
00:00:16
statistics statistics deals with the
00:00:18
collection analysis and presentation of
00:00:21
data an example we would like to
00:00:24
investigate whether gender has an
00:00:27
influence on the preferred newspaper
00:00:30
then gender and newspaper are our
00:00:32
so-called variables that we want to
00:00:35
analyze in order to analyze whether
00:00:38
gender has an influence on the preferred
00:00:40
newspaper we first need to collect data
00:00:43
to do this we create a questionnaire
00:00:46
that asks about gender and preferred
00:00:49
newspaper we will then send out the
00:00:52
survey and wait 2 weeks afterwards we
00:00:55
can display the received answers in a
00:00:57
table in this table we have one column
00:01:01
for each variable one for gender and one
00:01:04
for newspaper on the other hand each row
00:01:08
is the response of one surveyed person the
00:01:11
first respondent is male and stated New
00:01:14
York Post the second is female and
00:01:17
stated USA Today and so on and so forth
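A minimal sketch of how such a table could be held in code, assuming Python. Only the first two rows come from the example above; the third row and the name `responses` are made up for illustration.

```python
# Survey answers as a small table: one dict per respondent (a row),
# one key per variable (a column).
responses = [
    {"gender": "male", "newspaper": "New York Post"},    # first respondent
    {"gender": "female", "newspaper": "USA Today"},      # second respondent
    {"gender": "female", "newspaper": "New York Post"},  # invented extra row
]

# The variables are the column names shared by every row.
variables = list(responses[0].keys())
print(variables)

# Print the table row by row.
for row in responses:
    print(row["gender"], "|", row["newspaper"])
```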
00:01:20
of course the data does not have to be
00:01:22
from a survey the data can also come
00:01:25
from an experiment in which you for
00:01:27
example want to study the effect of two
00:01:30
drugs on blood pressure now the first
00:01:33
step is done we have collected data and
00:01:36
we can start analyzing the data but what
00:01:39
do we actually want to analyze we did
00:01:42
not survey the entire population but we
00:01:44
took a sample now the big question is do
00:01:48
we just want to describe the sample data
00:01:51
or do we want to make a statement about
00:01:53
the whole population if our aim is
00:01:56
limited to the sample itself i.e. we only
00:01:59
want to describe the collected data we
00:02:01
will use descriptive statistics
00:02:04
descriptive statistics will provide a
00:02:06
detailed summary of the sample however
00:02:09
if we want to draw conclusions about the
00:02:11
population as a whole inferential
00:02:14
statistics are used this approach allows
00:02:17
us to make educated guesses about the
00:02:19
population based on the sample data let
00:02:22
us take a closer look at both methods
00:02:25
starting with descriptive statistics why
00:02:28
is descriptive statistics so important
00:02:31
let's say a company wants to know how
00:02:34
its employees travel to work so the
00:02:36
company creates a survey to answer this
00:02:39
question once enough data has been
00:02:41
collected this data can be analyzed
00:02:44
using descriptive statistics but what is
00:02:47
descriptive statistics descriptive
00:02:49
statistics aims to describe and
00:02:51
summarize a data set in a meaningful way
00:02:55
but it is important to note that
00:02:56
descriptive statistics only describe the
00:02:59
collected data without drawing
00:03:01
conclusions about a larger population
00:03:04
put simply just because we know how some
00:03:07
people from one company get to work we
00:03:10
cannot say how all working people of the
00:03:13
company get to work this is the task of
00:03:16
inferential statistics which we will
00:03:18
discuss later to describe data
00:03:20
descriptively we now look at the four
00:03:23
key components measures of central
00:03:25
tendency measures of dispersion
00:03:28
frequency tables and charts let's start
00:03:31
with the first one measures of central
00:03:33
tendency measures of central tendency
00:03:35
are for example the mean the median and
00:03:38
the mode Let's first have a look at the
00:03:41
mean the arithmetic mean is the sum of
00:03:44
all observations divided by the number
00:03:47
of observations an example imagine we
00:03:50
have the test scores of five students to
00:03:53
find the mean score we sum up all the
00:03:56
scores and divide by the number of
00:03:58
scores the mean test score of these five
00:04:01
students is therefore
00:04:03
86.6 what about the median when the
00:04:06
values in a data set are arranged in
00:04:09
ascending order the median is the middle
00:04:12
value if there is an odd number of data
00:04:15
points the median is simply the middle
00:04:18
value if there is an even number of data
00:04:21
points the median is the average of the
00:04:24
two middle values it is important to
00:04:27
note that the median is resistant to
00:04:29
extreme values or outliers let's look at
00:04:32
this example no matter how tall the last
00:04:35
person is the person in the middle
00:04:38
Remains the person in the middle so the
00:04:40
median does not change but if we look at
00:04:43
the mean it does have an effect on how
00:04:46
tall the last person is the mean is
00:04:49
therefore not robust to outliers let's
00:04:52
continue with the mode the mode refers
00:04:55
to the value or values that appear most
00:04:58
frequently in a set of data for
00:05:01
example if 14 people travel to work by
00:05:03
car six by bike five walk and five take
00:05:08
public transport then car occurs most
00:05:12
often and is therefore the mode great
00:05:15
let's continue with the measures of
00:05:17
dispersion measures of dispersion
00:05:20
describe how spread out the values in a
00:05:22
data set are measures of dispersion are
00:05:25
for example the variance and standard
00:05:28
deviation the range
00:05:30
and the interquartile range let's start
00:05:33
with the standard deviation the standard
00:05:35
deviation indicates the average distance
00:05:38
between each data point and the mean but
00:05:41
what does that mean each person has some
00:05:44
deviation from the mean now we want to
00:05:47
know how much the persons deviate from
00:05:49
the mean value on average in this
00:05:52
example the average deviation from the
00:05:54
mean value is 11.5 cm to calculate the
00:05:59
standard deviation we can use this
00:06:02
equation Sigma is the standard deviation
00:06:05
n is the number of persons x_i is the
00:06:09
height of each person and x-bar is the mean
00:06:13
value of all persons but attention there
00:06:16
are two slightly different equations for
00:06:18
the standard deviation the difference is
00:06:21
that we have once 1 divided by n and once 1 divided by
00:06:26
n minus 1 to keep it simple if our survey
00:06:30
doesn't cover the whole population we
00:06:32
always use this equation to estimate the
00:06:35
standard deviation likewise if we have
00:06:38
conducted a clinical study then we also
00:06:41
use this equation to estimate the
00:06:43
standard deviation but what is the
00:06:45
difference between the standard
00:06:47
deviation and the variance as we now
00:06:50
know the standard deviation is the
00:06:52
quadratic mean of the distance from the
00:06:55
mean the variance now is the squared
00:06:57
standard deviation if you want to
00:06:59
know more details about the standard
00:07:01
deviation and the variance please watch
00:07:04
our video let's move on to range and
00:07:06
interquartile range it is easy to
00:07:09
understand the range is simply the
00:07:11
difference between the maximum and
00:07:14
minimum value inter quartile range
00:07:17
represents the middle 50% of the data it
00:07:20
is the difference between the first
00:07:22
quartile q1 and the third quartile Q3
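The measures of dispersion covered so far can be sketched with Python's standard `statistics` module. The heights below are invented, so the numbers will not match the 11.5 cm from the example. Note that different quartile conventions exist, and `statistics.quantiles` uses its default "exclusive" method here, which is an assumption of this sketch.

```python
import statistics

# Hypothetical heights in cm, invented for illustration.
heights_cm = [160, 165, 170, 175, 180, 185, 190, 200]

# Standard deviation with 1 / n (treats the data as the whole population).
population_sd = statistics.pstdev(heights_cm)

# Standard deviation with 1 / (n - 1) (estimate from a sample).
sample_sd = statistics.stdev(heights_cm)

# The variance is the squared standard deviation.
sample_variance = statistics.variance(heights_cm)

# Range: maximum minus minimum.
value_range = max(heights_cm) - min(heights_cm)

# Quartiles Q1, Q2 (the median) and Q3; the interquartile range is Q3 - Q1.
q1, q2, q3 = statistics.quantiles(heights_cm, n=4)
iqr = q3 - q1

print("population sd:", round(population_sd, 2))
print("sample sd:", round(sample_sd, 2))
print("sample variance:", round(sample_variance, 2))
print("range:", value_range)
print("IQR:", iqr)
```

The sample version divides by n minus 1, so it always comes out slightly larger than the population version for the same data.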
00:07:27
therefore 25% of the values lie below
00:07:30
the interquartile range and 25% of
00:07:34
the values lie above it the interquartile
00:07:36
range contains exactly the middle 50% of
00:07:40
the values before we get to the last two
00:07:43
points let's briefly compare measures of
00:07:46
central tendency and measures of
00:07:48
dispersion let's say we measure the
00:07:50
blood pressure of patients measures of
00:07:53
central tendency provide a single value
00:07:56
that represents the entire data set
00:07:59
helping to identify a central value
00:08:02
around which data points tend to Cluster
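As a rough illustration of the three measures of central tendency, here is a short Python sketch. The test scores are invented (the five scores behind the 86.6 above are not given), while the commute counts are the ones from the mode example.

```python
import statistics

# Hypothetical test scores of five students.
scores = [70, 80, 85, 90, 95]
mean_score = statistics.mean(scores)
median_score = statistics.median(scores)

# Replace the highest score with an extreme value to see which measure reacts.
scores_with_outlier = [70, 80, 85, 90, 300]
mean_with_outlier = statistics.mean(scores_with_outlier)
median_with_outlier = statistics.median(scores_with_outlier)

# Mode: the most frequent value, using the counts from the commute example.
commute = ["car"] * 14 + ["bicycle"] * 6 + ["walk"] * 5 + ["public transport"] * 5
most_common_commute = statistics.mode(commute)

print("mean:", mean_score, "with outlier:", mean_with_outlier)
print("median:", median_score, "with outlier:", median_with_outlier)
print("mode:", most_common_commute)
```

The median stays the same when the last value becomes extreme, while the mean shifts, which is the robustness point made earlier.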
00:08:05
measures of dispersion like the standard
00:08:07
deviation the range and the
00:08:09
interquartile range indicate how spread
00:08:12
out the data points are whether they are
00:08:15
closely packed around the center or
00:08:17
spread far from it in summary while
00:08:20
measures of central tendency provide a
00:08:23
central point of the data set measures
00:08:25
of dispersion describe how the data is
00:08:28
spread around the center let's move on to
00:08:31
tables here we will have a look at the
00:08:33
most important ones frequency tables and
00:08:36
contingency tables a frequency table
00:08:39
displays how often each distinct value
00:08:43
appears in a data set let's have a
00:08:45
closer look at the example from the
00:08:47
beginning a company surveyed its
00:08:50
employees to find out how they get to
00:08:52
work the options given were car bicycle
00:08:56
walk and public transport here are the
00:08:58
results from 30 employees the
00:09:01
first answered car the next walk and so
00:09:04
on and so forth now we can create a
00:09:07
frequency table to summarize this data
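A small sketch of this counting step in Python, using `collections.Counter`. The counts (14, 6, 5, 5) are the ones from the example; the order of the individual answers is not known, so the list `answers` is simply built from the counts.

```python
from collections import Counter

# 30 answers, reconstructed from the counts in the example.
answers = ["car"] * 14 + ["bicycle"] * 6 + ["walk"] * 5 + ["public transport"] * 5

# Count how often each distinct option occurs.
frequency = Counter(answers)
total = len(answers)

# Print option, absolute frequency and percentage, most frequent first.
for option, count in frequency.most_common():
    print(f"{option:<18}{count:>3}{count / total:>8.1%}")
```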
00:09:10
to do this we simply enter the four
00:09:13
possible options car bicycle walk and
00:09:16
public transport in the First Column and
00:09:19
then count how often they occurred from
00:09:22
the table it is evident that the most
00:09:25
common mode of Transport among the
00:09:27
employees is by car with 14 employees
00:09:30
preferring it the frequency table thus
00:09:33
provides a clear and concise summary of
00:09:35
the data but what if we have not only
00:09:38
one but two categorical variables this
00:09:41
is where the contingency table also
00:09:43
called cross tab comes in Imagine the
00:09:46
company doesn't have one Factory but two
00:09:50
one in Detroit and one in Cleveland so
00:09:53
we also ask the employees at which
00:09:56
location they work if we want to display
00:09:58
both variables we can use a contingency
00:10:01
table a contingency table provides a way
00:10:04
to analyze and compare the relationship
00:10:07
between two categorical variables the
00:10:10
rows of a contingency table represent
00:10:13
the categories of one variable while the
00:10:16
columns represent the categories of
00:10:18
another variable each cell in the table
00:10:21
shows the number of observations that
00:10:23
fall into the corresponding category
00:10:26
combination for example the first cell
00:10:28
shows that car and Detroit were answered
00:10:31
six times and what about the charts
00:10:35
let's take a look at the most important
00:10:37
ones to do this let's simply use
00:10:40
datatab.net if you like you can load this
00:10:43
sample data set with the link in the
00:10:45
video description or you just copy your
00:10:47
own data into this table here below you
00:10:50
can see the variables distance to work
00:10:53
mode of transport and site DATAtab
00:10:56
gives you a hint about the level of
00:10:58
measurement but you can also change it
00:11:01
here now if we only click on mode of
00:11:04
Transport we get a frequency table and
00:11:07
we can also display the percentage
00:11:10
values if we scroll down we get a bar
00:11:13
chart and a pie chart here on the left
00:11:16
we can adjust further settings for
00:11:19
example we can specify whether we want
00:11:22
to display the frequencies or the
00:11:24
percentage values or whether the bars
00:11:27
should be vertical or
00:11:29
horizontal if you also select site we
00:11:33
get a cross table here and a grouped bar
00:11:37
chart for the diagrams here we can
00:11:40
specify whether we want the chart to be
00:11:42
grouped or stacked if we click on
00:11:45
distance to work and mode of Transport
00:11:48
we get a bar chart where the height of
00:11:51
the bar shows the mean value of the
00:11:53
individual groups here we can also
00:11:56
display the
00:11:57
dispersion we also get a histogram a box
00:12:01
plot a violin plot and a rainbow plot if
00:12:05
you would like to know more about what a
00:12:07
box plot a violin plot and a rainbow
00:12:10
plot are take a look at my videos let's
00:12:13
continue with inferential statistics at
00:12:16
the beginning we briefly go through what
00:12:18
inferential statistics is and then I'll
00:12:21
explain the six key components to you so
00:12:24
what is inferential statistics
00:12:26
inferential statistics allows us to make
00:12:29
a conclusion or inference about a
00:12:32
population based on data from a sample
00:12:35
what is the population and what is the
00:12:38
sample the population is the whole group
00:12:41
we're interested in if you want to study
00:12:43
the average height of all adults in the
00:12:46
United States then the population would
00:12:49
be all adults in the United States the
00:12:52
sample is a smaller group we actually
00:12:54
study chosen from the population for
00:12:57
example 150 adults were selected
00:13:00
from the United States and now we want
00:13:02
to use the sample to make a statement
00:13:05
about the population and here are the
00:13:07
six steps how to do that number one
00:13:11
hypothesis first we need a statement a
00:13:13
hypothesis that we want to test for
00:13:16
example you want to know whether a drug
00:13:19
will have a positive effect on blood
00:13:21
pressure in people with high blood
00:13:23
pressure but what's next in our
00:13:26
hypothesis we stated that we would like
00:13:28
to study people with high blood pressure
00:13:31
so our population is all people with
00:13:34
high blood pressure in for example the
00:13:36
US obviously we cannot collect data from
00:13:39
the whole population so we take a sample
00:13:42
from the population now we use this
00:13:45
sample to make a statement about the
00:13:47
population but how do we do that for
00:13:50
this we need a hypothesis test
00:13:52
hypothesis testing is a method for
00:13:55
testing a claim about a parameter in a
00:13:58
population using data measured in a
00:14:00
sample great that's exactly what we need
00:14:03
there are many different hypothesis
00:14:05
tests and at the end of this video I
00:14:07
will give you a guide on how to find the
00:14:10
right test and of course you can find
00:14:12
videos about many more hypothesis tests
00:14:15
on our Channel but how does a hypothesis
00:14:18
test work when we conduct a hypothesis
00:14:21
test we start with a research hypothesis
00:14:24
also called alternative hypothesis this
00:14:27
is the hypothesis we are trying
00:14:28
to find evidence for in our case the
00:14:31
research hypothesis is the drug has an
00:14:34
effect on blood pressure but we cannot
00:14:37
test this hypothesis directly with a
00:14:39
classical hypothesis test so we test the
00:14:42
opposite hypothesis that the drug has no
00:14:45
effect on blood pressure but what does
00:14:47
that mean first we assume that the drug
00:14:51
has no effect in the population we
00:14:53
therefore assume that in general people
00:14:56
who take the drug and people who don't
00:14:58
take the drug have the same blood
00:15:01
pressure on average if we now take a
00:15:03
random sample and it turns out that the
00:15:06
drug has a large effect in the sample then
00:15:09
we can ask How likely it is to draw such
00:15:13
a sample or one that deviates even more
00:15:16
if the drug actually has no effect so in
00:15:20
reality on average there's no difference
00:15:22
in the population if this probability is
00:15:25
very low we can ask ourselves maybe the
00:15:29
drug has an effect in the population and
00:15:32
we may have enough evidence to reject
00:15:34
the null hypothesis that the drug has no
00:15:37
effect and it is this probability that
00:15:40
is called the P value let's summarize
00:15:43
this in three simple steps number one
00:15:46
the null hypothesis states that there is
00:15:48
no difference in the population number
00:15:51
two the hypothesis test calculates how
00:15:54
much the sample deviates from the null
00:15:56
hypothesis number three the P value
00:15:59
indicates the probability of getting a
00:16:02
sample that deviates as much as our
00:16:05
sample or one that even deviates more
00:16:08
than our sample assuming the null
00:16:11
hypothesis is true but at what point is
00:16:14
the P value small enough for us to
00:16:16
reject the null hypothesis this brings
00:16:19
us to the next Point statistical
00:16:21
significance if the P value is less than
00:16:24
a predetermined threshold the result is
00:16:27
considered statistically significant
00:16:30
this means that the result is unlikely
00:16:32
to have occurred by chance alone and
00:16:35
that we have enough evidence to reject
00:16:37
the null hypothesis this threshold is often
00:16:41
0.05 therefore a small P value suggests
00:16:45
that the observed data or sample is
00:16:48
inconsistent with the null hypothesis
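To make the p-value and the 0.05 threshold more concrete, here is a hedged sketch in Python. It uses a permutation test, which is just one of many possible hypothesis tests and not necessarily the one used in this video. The blood pressure readings are invented for illustration.

```python
import random
import statistics

# Hypothetical blood pressure readings (mmHg), invented for illustration.
drug = [128, 131, 125, 122, 130, 127, 124, 126]
placebo = [138, 135, 141, 133, 139, 136, 140, 134]

# How far the sample deviates from "no difference": the gap between group means.
observed = abs(statistics.mean(drug) - statistics.mean(placebo))

# Under the null hypothesis the group labels do not matter, so we shuffle them
# many times and count how often a gap at least as large shows up by chance.
rng = random.Random(0)  # fixed seed so the run is repeatable
pooled = drug + placebo
n_permutations = 10_000
at_least_as_extreme = 0
for _ in range(n_permutations):
    rng.shuffle(pooled)
    group_a = pooled[:len(drug)]
    group_b = pooled[len(drug):]
    gap = abs(statistics.mean(group_a) - statistics.mean(group_b))
    if gap >= observed:
        at_least_as_extreme += 1

# Estimated p-value: share of shuffles that deviate as much as our sample or more.
p_value = at_least_as_extreme / n_permutations

alpha = 0.05  # the commonly used threshold
reject_null = p_value < alpha
print("p-value:", p_value, "reject null hypothesis:", reject_null)
```

With these made-up numbers the two groups do not overlap at all, so almost no shuffle produces a gap as large as the observed one and the estimated p-value falls far below 0.05.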
00:16:50
this leads us to reject the null
00:16:52
hypothesis in favor of the alternative
00:16:55
hypothesis a large P value suggests that
00:16:58
the observed data is consistent with
00:17:00
the null hypothesis and we will not
00:17:02
reject it but note there is always a
00:17:05
risk of making an error a small P value
00:17:08
does not prove that the alternative
00:17:10
hypothesis is true it is only saying
00:17:13
that it is unlikely to get such a result
00:17:16
or a more extreme one when the null
00:17:19
hypothesis is true and again if the null
00:17:21
hypothesis is true there is no
00:17:24
difference in the population and the
00:17:26
other way around a large p value does
00:17:29
not prove that the null hypothesis is true
00:17:32
it is only saying that it is likely to
00:17:34
get such a result or a more extreme one when
00:17:38
the null hypothesis is true so there are
00:17:40
two types of Errors which are called
00:17:42
type one and type two error let's start
00:17:45
with the type one error in hypothesis
00:17:48
testing a type one error occurs when a
00:17:51
true null hypothesis is rejected so in
00:17:54
reality the null hypothesis is true but
00:17:57
we make the decision to reject the
00:17:59
null hypothesis in our example it means
00:18:02
that the drug actually had no effect so
00:18:06
in reality there is no difference in
00:18:08
blood pressure whether the drug is taken
00:18:11
or not the blood pressure Remains the
00:18:13
Same in both cases but our sample
00:18:16
happened to be so far off the True Value
00:18:19
that we mistakenly thought the drug was
00:18:22
working and a type two error occurs when
00:18:25
a false null hypothesis is not rejected
00:18:28
so in reality the null hypothesis is
00:18:31
false but we make the decision not to
00:18:34
reject the null hypothesis in our
00:18:36
example this means the drug actually did
00:18:39
work there is a difference between those
00:18:42
who have taken the drug and those who
00:18:44
have not but it was just a coincidence
00:18:47
that the sample taken did not show much
00:18:50
difference and we mistakenly thought the
00:18:53
drug was not working and now I'll show
00:18:56
you how DATAtab helps you to find a
00:18:59
suitable hypothesis test and of course
00:19:02
calculates it and interprets the results
00:19:04
for you let's go to datatab.net and copy
00:19:08
your own data in here we will just use
00:19:11
this example data set after copying your
00:19:13
data into the table the variables appear
00:19:17
down here DATAtab automatically tries
00:19:20
to determine the correct level of
00:19:22
measurement but you can also change it
00:19:25
up here now we just click on hypothesis
00:19:29
testing and select the variables we want
00:19:32
to use for the calculation of a
00:19:34
hypothesis test DATAtab will then
00:19:37
suggest a suitable test for example in
00:19:40
this case a chi-square test or in that
00:19:43
case an analysis of
00:19:46
variance then you will see the hypotheses
00:19:49
and the results if you're not sure how
00:19:52
to interpret the results click on
00:19:54
summary in words further you can check
00:19:57
the assumptions and decide whether you
00:20:00
want to calculate a parametric or a
00:20:03
non-parametric test you can find out the
00:20:06
difference between parametric and
00:20:08
nonparametric tests in my next video
00:20:12
thanks for watching and I hope you
00:20:13
enjoyed the
00:20:19
video