In 2011 an article was published in the reputable Journal of Personality and Social Psychology. It was called "Feeling the Future: Experimental Evidence for Anomalous Retroactive Influences on Cognition and Affect", or, in other words, proof that people can see into the future. The paper reported on nine experiments.
In one, participants were shown two curtains on a computer screen and asked to predict which one had an image behind it; the other just covered a blank wall. Once the participant made their selection, the computer randomly positioned an image behind one of the curtains, then the selected curtain was pulled back to show either the image or the blank wall. The images were randomly selected from one of three categories: neutral, negative, or erotic. If participants selected the curtain covering the image, this was considered a hit.

Now, with there being two curtains and the image positioned randomly behind one of them, you would expect the hit rate to be about fifty percent. And that is exactly what the researchers found, at least for negative and neutral images.
For erotic images, however, the hit rate was fifty-three percent. Does that mean that we can see into the future? Is that slight deviation significant? Well, to assess significance, scientists usually turn to p-values, a statistic that tells you how likely a result at least this extreme is if the null hypothesis is true. In this case the null hypothesis would just be that people couldn't actually see into the future and the 53-percent result was due to lucky guesses. For this study the p-value was .01, meaning there was just a one-percent chance of getting a hit rate of fifty-three percent or higher from simple luck.
P-values less than .05 are generally considered significant and worthy of publication, but you might want to use a higher bar before you accept that humans can accurately perceive the future and, say, invite the study's author on your news program; but hey, it's your choice. After all, the .05 threshold was arbitrarily selected by Ronald Fisher in a book he published in 1925.
But this raises the question: how much of the published research literature is actually false? The intuitive answer seems to be five percent. I mean, if everyone is using p less than .05 as a cut-off for statistical significance, you would expect five of every hundred results to be false positives. But that, unfortunately, grossly underestimates the problem, and here's why.
Imagine you're a researcher in a field where there are a thousand hypotheses currently being investigated. Let's assume that ten percent of them reflect true relationships and the rest are false, but no one, of course, knows which are which; that's the whole point of doing the research. Now, assuming the experiments are pretty well designed, they should correctly identify around 80 of the hundred true relationships. This is known as a statistical power of eighty percent, so 20 results are false negatives; perhaps the sample size was too small or the measurements were not sensitive enough.

Now consider that of those 900 false hypotheses, using a p-value threshold of .05, forty-five will be incorrectly identified as true. The rest will be correctly identified as false, but most journals rarely publish null results: they make up just ten to thirty percent of papers, depending on the field. That means the papers that eventually get published will include 80 true positive results, 45 false positive results, and maybe 20 true negative results. Nearly a third of published results will be wrong, even with the system working normally. Things get even worse if studies are underpowered, and analyses show they typically are, if there is a higher ratio of false-to-true hypotheses being tested, or if the researchers are biased.
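That arithmetic is easy to check. Here is a short Python sketch of the same bookkeeping, using only the assumptions stated above: a ten percent base rate of true hypotheses, eighty percent power, a .05 threshold, and roughly 20 published null results.

```python
# Bookkeeping for the thousand-hypotheses example above.
n_hypotheses = 1000
base_rate = 0.10       # assumed fraction of hypotheses that are actually true
power = 0.80           # chance that a true relationship is detected
alpha = 0.05           # false-positive rate at the p < .05 threshold

n_true = int(n_hypotheses * base_rate)       # 100 true relationships
n_false = n_hypotheses - n_true              # 900 false ones

true_positives = int(n_true * power)         # 80 detected
false_negatives = n_true - true_positives    # 20 missed
false_positives = int(n_false * alpha)       # 45 spurious "discoveries"

# Journals publish few null results; call it about 20 of the correctly
# rejected hypotheses, as in the example above.
published_nulls = 20
published = true_positives + false_positives + published_nulls   # 145 papers

print(f"wrong results among published papers: {false_positives / published:.0%}")
# ~31%: nearly a third, even with everything working as intended.
```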
All of this was pointed out in a 2005 paper entitled "Why Most Published Research Findings Are False". So, recently, researchers in a number of fields have attempted to quantify the problem by replicating some prominent past results.
The Reproducibility Project repeated a hundred psychology studies but found that only thirty-six percent had a statistically significant result the second time around, and the strengths of the measured relationships were on average half those of the original studies. An attempted verification of 53 studies considered landmarks in the basic science of cancer managed to reproduce only six, even when working closely with the original studies' authors. These results are even worse than I just calculated. The reason for this is nicely illustrated by a 2015 study showing that eating a bar of chocolate every day can help you lose weight faster.
weight faster. In this case the
participants were randomly allocated to
00:04:20
one of three treatment groups:
00:04:22
one went on a low-carb diet, another one on
the same low carb diet plus a 1.5 ounce
00:04:26
bar of chocolate per day and the
third group was the control, instructed
00:04:30
just to maintain their regular eating
habits at the end of three weeks the
00:04:33
control group had neither lost nor
gained weight but both low carb groups
00:04:37
had lost an average of five pounds per
person
00:04:40
the group that a chocolate however lost
weight ten percent faster than the
00:04:44
non-chocolate eaters the finding was statistically
significant with a p-value less than .05
00:04:50
As you might expect, this news spread like wildfire: to the front page of Bild, the most widely circulated daily newspaper in Europe, and into the Daily Star, the Irish Examiner, the Huffington Post, and even Shape magazine. Unfortunately, the whole thing had been faked, kind of. I mean, the researchers did perform the experiment exactly as they described, but they intentionally designed it to increase the likelihood of false positives: the sample size was incredibly small, just five people per treatment group, and for each person 18 different measurements were tracked, including weight, cholesterol, sodium, blood protein levels, sleep quality, well-being, and so on. So if weight loss didn't show a significant difference, there were plenty of other factors that might have; the headline could have been "chocolate lowers cholesterol" or "increases sleep quality" or... something.

The point is that a p-value is only really valid for a single measure. Once you're comparing a whole slew of variables, the probability that at least one of them gives you a false positive goes way up, and this is known as "p-hacking".
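A quick back-of-the-envelope calculation shows why tracking 18 outcomes is so risky. Treating the measurements as independent for simplicity (in reality they are correlated, so this is only an approximation), the chance that at least one of them crosses p < .05 by luck alone is already around sixty percent:

```python
# Probability that at least one of 18 outcome measures comes up "significant"
# by chance alone, treating the measures as independent for simplicity.
alpha = 0.05
n_measures = 18

p_at_least_one = 1 - (1 - alpha) ** n_measures
print(f"chance of at least one false positive: {p_at_least_one:.0%}")   # ~60%
```

In other words, under that independence assumption, a headline-worthy "finding" was more likely than not, even though no real effect existed.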
Researchers can make a lot of decisions about their analysis that can decrease the p-value. For example, let's say you analyze your data and you find it nearly reaches statistical significance, so you decide to collect just a few more data points to be sure. Then, if the p-value drops below .05, you stop collecting data, confident that these additional data points could only have made the result more significant if there were really a true relationship there. But numerical simulations show that relationships can cross the significance threshold by adding more data points, even though a much larger sample would show that there really is no relationship.
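Here is a small simulation of that kind of optional stopping, just to illustrate the effect. The starting sample size, the maximum sample size, and the number of runs are arbitrary choices rather than values from any particular study, and the data are pure noise, so every "significant" result it finds is a false positive.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Optional stopping: peek at the p-value and keep adding data points
# until it dips below .05 or we give up. The null hypothesis is true
# (the data are pure noise), so every "hit" is a false positive.
rng = np.random.default_rng(0)

def one_experiment(n_start=20, n_max=60):
    data = list(rng.normal(0.0, 1.0, n_start))
    while True:
        if ttest_1samp(data, 0.0).pvalue < 0.05:
            return True                      # stopped early as "significant"
        if len(data) >= n_max:
            return False                     # gave up; correctly non-significant
        data.append(rng.normal(0.0, 1.0))    # collect "just a few more" points

runs = 2000
rate = sum(one_experiment() for _ in range(runs)) / runs
print(f"false-positive rate with optional stopping: {rate:.1%}")
# Noticeably higher than the nominal 5%, even though no effect exists.
```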
In fact, there are a great number of ways to increase the likelihood of significant results, like having two dependent variables, adding more observations, controlling for gender, or dropping one of three conditions. Combining these strategies increases the likelihood of a false positive to over sixty percent, and that is using p less than .05.
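As a rough sketch of how these choices compound, here is a simulation that combines just two of them: two correlated dependent variables (report whichever one works, or their average) and the option to add more observations when the first pass isn't significant. The group sizes and the 0.5 correlation are illustrative assumptions, and with only two of the strategies it won't reach the sixty percent figure, but the false-positive rate already climbs well above the nominal five percent.

```python
import numpy as np
from scipy.stats import ttest_ind

# Both groups are drawn from the same distribution, so any "significant"
# difference is a false positive. Two researcher degrees of freedom are
# simulated: the choice of dependent variable (DV1, DV2, or their average)
# and adding more observations if the first pass isn't significant.
rng = np.random.default_rng(1)
cov = [[1.0, 0.5], [0.5, 1.0]]    # two DVs correlated at r = 0.5 (assumed)

def sample(n):
    return rng.multivariate_normal([0.0, 0.0], cov, size=n)

def any_significant(a, b):
    comparisons = [(a[:, 0], b[:, 0]),                  # DV1 only
                   (a[:, 1], b[:, 1]),                  # DV2 only
                   (a.mean(axis=1), b.mean(axis=1))]    # average of both
    return any(ttest_ind(x, y).pvalue < 0.05 for x, y in comparisons)

def one_study(n_start=20, n_extra=10):
    a, b = sample(n_start), sample(n_start)
    if any_significant(a, b):
        return True
    # Nothing significant yet? Run a few more participants and test again.
    a, b = np.vstack([a, sample(n_extra)]), np.vstack([b, sample(n_extra)])
    return any_significant(a, b)

runs = 2000
rate = sum(one_study() for _ in range(runs)) / runs
print(f"false-positive rate with two flexible choices: {rate:.1%}")
```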
Now, if you think this is just a problem for psychology, neuroscience, or medicine, consider the pentaquark, an exotic particle made up of five quarks, as opposed to the regular three for protons or neutrons. Particle physics employs particularly stringent requirements for statistical significance, referred to as 5-sigma, or one chance in 3.5 million of getting a false positive.
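For reference, that one-in-3.5-million figure is just the one-sided tail probability of a standard normal distribution beyond five standard deviations; a quick sanity check:

```python
from scipy.stats import norm

# One-sided tail probability beyond 5 sigma on a standard normal distribution.
p_5_sigma = norm.sf(5)
print(f"p = {p_5_sigma:.2e}, i.e. about 1 in {1 / p_5_sigma:,.0f}")
# Roughly 1 in 3.5 million, compared with 1 in 20 for p < .05.
```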
But in 2002 a Japanese experiment found evidence for the Theta-plus pentaquark, and in the two years that followed, 11 other independent experiments looked for and found evidence of that same pentaquark with very high levels of statistical significance. From July 2003 to May 2004, a theoretical paper on pentaquarks was published on average every other day. But alas, it was a false discovery, for further experimental attempts to confirm the Theta-plus pentaquark using greater statistical power failed to find any trace of its existence.
The problem was that those first scientists weren't blind to the data: they knew how the numbers were generated and what answer they expected to get, and the way the data was cut and analyzed, or p-hacked, produced the false finding.

Now, most scientists aren't p-hacking maliciously. There are legitimate decisions to be made about how to collect, analyze, and report data, and these decisions impact the statistical significance of results. For example, 29 different research groups were given the same data and asked to determine if dark-skinned soccer players are more likely to be given red cards. Using identical data, some groups found there was no significant effect, while others concluded dark-skinned players were three times as likely to receive a red card.
The point is that data doesn't speak for itself; it must be interpreted. Looking at those results, it seems that dark-skinned players are more likely to get red-carded, but certainly not three times as likely. Consensus helps in this case, but for most results only one research group provides the analysis, and therein lies the problem of incentives: scientists have huge incentives to publish papers; in fact, their careers depend on it. As one scientist, Brian Nosek, puts it: "There is no cost to getting things wrong. The cost is not getting them published."
Journals are far more likely to publish results that reach statistical significance, so if one method of data analysis results in a p-value less than .05, then you're likely to go with that method. Publication is also more likely if the result is novel and unexpected, which encourages researchers to investigate more and more unlikely hypotheses, further decreasing the ratio of true to spurious relationships that are tested.
Now, what about replication? Isn't science meant to self-correct by having other scientists replicate the findings of an initial discovery? In theory, yes, but in practice it's more complicated. Take the precognition study from the start of this video: three researchers attempted to replicate one of those experiments, and what did they find? Well, surprise, surprise, the hit rate they obtained was not significantly different from chance. When they tried to publish their findings in the same journal as the original paper, they were rejected. The reason? The journal refuses to publish replication studies.
So if you're a scientist, the successful strategy is clear: don't even attempt replication studies, because few journals will publish them, and there is a very good chance that your results won't be statistically significant anyway, in which case, instead of being able to convince colleagues of the lack of reproducibility of an effect, you will be accused of just not doing it right. A far better approach is to test novel and unexpected hypotheses and then p-hack your way to a statistically significant result.
Now, I don't want to be too cynical about this, because over the past 10 years things have started changing for the better. Many scientists acknowledge the problems I've outlined and are starting to take steps to correct them: more large-scale replication studies have been undertaken in the last 10 years; there's a site, Retraction Watch, dedicated to publicizing papers that have been withdrawn; there are online repositories for unpublished negative results; and there is a move towards submitting hypotheses and methods for peer review before conducting experiments, with the guarantee that the research will be published regardless of the results so long as the procedure is followed. This eliminates publication bias, promotes higher-powered studies, and lessens the incentive for p-hacking.
The thing I find most striking about the reproducibility crisis in science is not the prevalence of incorrect information in published scientific journals; after all, we know that getting to the truth is hard, and mathematically, not everything that is published can be correct. What gets me is the thought that even trying our best to figure out what's true, using our most sophisticated and rigorous mathematical tools, peer review, and the standards of practice, we still get it wrong so often. So how frequently do we delude ourselves when we're not using the scientific method? As flawed as our science may be, it is far and away more reliable than any other way of knowing that we have.
This episode of Veritasium was supported in part by these fine people on Patreon and by Audible.com, the leading provider of audiobooks online, with hundreds of thousands of titles in all areas of literature, including fiction, non-fiction, and periodicals. Audible offers a free 30-day trial to anyone who watches this channel; just go to audible.com/veritasium so they know I sent you. A book I'd recommend is called "The Invention of Nature" by Andrea Wulf, which is a biography of Alexander von Humboldt, an adventurer and naturalist who actually inspired Darwin to board the Beagle. You can download that book, or any other of your choosing, for a one-month free trial at audible.com/veritasium. So, as always, I want to thank Audible for supporting me, and I really want to thank you for watching.