Is Most Published Research Wrong?

00:12:22
https://www.youtube.com/watch?v=42QuXLucH3Q

Summary

TL;DR: The video examines why many published scientific results turn out to be false. Its opening example is a 2011 psychology experiment suggesting precognition, which proved flawed. False positives often arise from manipulated statistical analysis, validated by p-values below 0.05, and they are more common than the 5% threshold suggests because of bias, the scarcity of replication studies, and p-hacking. A deliberately flawed study on chocolate and weight loss illustrates how easily bad research can appear credible. The video emphasizes the need for rigor, replication, and corrected incentives in science, and flags recent improvements to scientific practice.

Takeaways

  • 🔍 False positives are prevalent in scientific research.
  • 📉 P-values less than 0.05 often mislead researchers about significance.
  • 🔄 Lack of replication studies contributes to research issues.
  • 📊 P-hacking is a common problem that manipulates statistical results.
  • 📚 Novel findings are prioritized over replication by journals.
  • 🚫 Replication studies face publication biases.
  • 🔬 Improvements are being made to enhance research reliability.
  • 🔄 Reproducibility projects reveal many studies can't be replicated.
  • 📰 Media often disseminates inaccurate science based on flawed studies.
  • ⚖️ Pre-registering studies and open data aim to improve transparency.

Timeline

  • 00:00:00 - 00:05:00

    In 2011, a study published in the Journal of Personality and Social Psychology claimed evidence that people can predict the future: participants guessed which of two curtains on a computer screen concealed an image. For erotic images the hit rate was 53%, slightly above the expected 50%, with a p-value of .01, meaning such a result would occur by chance only 1% of the time if the null hypothesis were true. This leads to a discussion of research reliability: the intuition that a .05 threshold implies only 5% of published results are false greatly underestimates the problem. When many hypotheses are being tested and most are false, a substantial fraction of published positive results will be incorrect.

  • 00:05:00 - 00:12:22

    Insights into scientific reproducibility reveal significant flaws, notably 'p-hacking', in which data collection or analysis is adjusted until a significant p-value emerges. Studies such as the one claiming chocolate aids weight loss are designed with small samples and many measured outcomes, inflating the chance of false positives. Even particle physics, despite far stricter thresholds, can fall prey. Publishing incentives favor positive, novel findings, making replications rare. The video urges skepticism in interpreting results and notes nascent reforms, such as better methodology and more replication, aimed at restoring credibility. Even so, the scientific method remains the most reliable tool we have for finding the truth, despite its flaws.
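
    The false-positive arithmetic laid out in the transcript below (1,000 hypotheses, 10% true, 80% power, alpha = .05) can be reproduced in a few lines. The numbers are the video's own; the 20 published null results correspond to the video's "maybe 20":

```python
# The video's example: 1000 hypotheses under investigation, 10% of them true.
hypotheses = 1000
true_hypotheses = 100
false_hypotheses = hypotheses - true_hypotheses

power = 0.80   # chance a true relationship is correctly detected
alpha = 0.05   # chance a false hypothesis slips through as "significant"

true_positives = true_hypotheses * power        # 80 real discoveries
false_positives = false_hypotheses * alpha      # 45 false alarms
published_nulls = 20                            # the video's "maybe 20" negative results

published = true_positives + false_positives + published_nulls
wrong_share = false_positives / published
print(f"{wrong_share:.0%} of published results are wrong")  # nearly a third
```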

Video Q&A

  • What was the claim made by the 2011 paper in the Journal of Personality and Social Psychology?

    The paper claimed experimental evidence for precognition, suggesting people can see into the future.

  • What is p-hacking?

    P-hacking refers to manipulating data analysis to achieve statistically significant results.
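
    One common p-hacking tactic described later in the video — collecting a few more data points whenever a result is almost significant, and stopping as soon as p drops below .05 — can be demonstrated with a small simulation. This sketch is illustrative, not the video's own analysis: it flips a fair coin, so the null hypothesis is true by construction and every "significant" result is a false positive.

```python
import math
import random

def two_sided_p(heads: int, n: int) -> float:
    """Normal-approximation two-sided p-value against a fair-coin null."""
    z = abs(heads - n / 2) / math.sqrt(n / 4)
    return math.erfc(z / math.sqrt(2))

def peek_until_significant(rng: random.Random,
                           start: int = 20, step: int = 10, max_n: int = 100) -> bool:
    """Flip a fair coin; re-test after every extra batch and stop at p < .05."""
    heads = sum(rng.random() < 0.5 for _ in range(start))
    n = start
    while n <= max_n:
        if two_sided_p(heads, n) < 0.05:
            return True  # declared "significant" -- but the coin is fair
        heads += sum(rng.random() < 0.5 for _ in range(step))
        n += step
    return False

rng = random.Random(0)
sims = 10_000
rate = sum(peek_until_significant(rng) for _ in range(sims)) / sims
print(f"false-positive rate with peeking: {rate:.1%}")  # well above the nominal 5%
```

    Each extra "peek" gives chance another opportunity to cross the threshold, which is why the realized false-positive rate exceeds the nominal 5%.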

  • Why is the 0.05 p-value threshold significant?

    A p-value less than 0.05 is commonly used to indicate statistical significance in scientific research, though it can lead to false positives.
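
    As an illustration of how a modest deviation can clear the .05 bar, a one-sided binomial p-value can be computed exactly from first principles. The trial count below is hypothetical, since the summary does not state the study's sample size:

```python
from math import comb

def binomial_p_one_sided(hits: int, n: int, p0: float = 0.5) -> float:
    """Exact probability of seeing `hits` or more successes in n trials
    when the true success rate is p0 (the null hypothesis)."""
    return sum(comb(n, k) * p0**k * (1 - p0)**(n - k)
               for k in range(hits, n + 1))

# Hypothetical sample: 1000 guesses with a 53% hit rate against a 50% null.
p = binomial_p_one_sided(530, 1000)
print(f"p = {p:.3f}")  # below .05, even though the effect is tiny
```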

  • What did the Reproducibility Project find?

    It found that only 36% of replicated psychology studies had statistically significant results.

  • Why are replication studies often not published?

    Journals often reject replication studies, preferring to publish novel findings.

  • What are some measures being taken to improve scientific research reliability?

    Measures include more replication studies, retraction of faulty papers, and pre-registering study designs.

  • How did the chocolate weight loss study highlight issues in research?

    The small sample size and multiple measured variables inflated the chance of false positives, misleading media into reporting false findings.
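
    The inflation this answer describes follows from basic probability: if each of k independent outcome measures has a 5% false-positive chance, the probability that at least one comes out "significant" is 1 − (1 − 0.05)^k. A minimal sketch (treating the 18 measures as independent is a simplifying assumption):

```python
alpha = 0.05  # per-test false positive rate
k = 18        # outcomes tracked per participant in the chocolate study

# Chance that at least one of k independent tests comes out "significant"
# when every null hypothesis is true.
p_any = 1 - (1 - alpha) ** k
print(f"chance of at least one false positive: {p_any:.0%}")  # ~60%
```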

Subtitles
  • 00:00:00
    In 2011 an article was published in the reputable "Journal of Personality and
  • 00:00:05
    Social Psychology". It was called "Feeling the Future: Experimental Evidence for
  • 00:00:10
    Anomalous Retroactive Influences on Cognition and Affect" or, in other words,
  • 00:00:15
    proof that people can see into the future. The paper reported on nine
  • 00:00:20
    experiments. In one, participants were shown two curtains on a computer screen
  • 00:00:23
    and asked to predict which one had an image behind it, the other just covered a
  • 00:00:27
    blank wall. Once the participant made their selection the computer randomly
  • 00:00:30
    positioned an image behind one of the curtains, then the selected curtain was
  • 00:00:34
    pulled back to show either the image or the blank wall
  • 00:00:37
    the images were randomly selected from one of three categories: neutral, negative,
  • 00:00:42
    or erotic. If participants selected the curtain covering the image this was
  • 00:00:46
    considered a hit. Now with there being two curtains and the image positioned
  • 00:00:50
    randomly behind one of them, you would expect the hit rate to be about fifty
  • 00:00:54
    percent. And that is exactly what the researchers found, at least for negative
  • 00:00:59
    and neutral images.
  • 00:01:01
    however for erotic images the hit rate was fifty-three percent. Does that mean
  • 00:01:05
    that we can see into the future? Is that slight deviation significant? Well to
  • 00:01:09
    assess significance scientists usually turn to p-values, a statistic that tells
  • 00:01:13
    you how likely a result, at least this extreme, is if the null hypothesis is
  • 00:01:17
    true. In this case the null hypothesis would just be that people couldn't
  • 00:01:21
    actually see into the future and the 53-percent result was due to lucky
  • 00:01:24
    guesses. For this study the p-value was .01 meaning there was just a one-percent
  • 00:01:29
    chance of getting a hit rate of fifty-three percent or higher from
  • 00:01:32
    simple luck. P-values less than .05 are generally considered significant
  • 00:01:36
    and worthy of publication but you might want to use a higher bar before you
  • 00:01:40
    accept that humans can accurately perceive the future and, say, invite the
  • 00:01:44
    study's author on your news program; but hey, it's your choice. After all, the .05
  • 00:01:49
    threshold was arbitrarily selected by Ronald Fisher in a book he published in
  • 00:01:54
    1925. But this raises the question: how much of the published research literature is
  • 00:01:59
    actually false? The intuitive answer seems to be five percent. I mean if
  • 00:02:03
    everyone is using p less than .05 as a cut-off for statistical
  • 00:02:06
    significance, you would expect five of every hundred results to be false positives
  • 00:02:11
    but that unfortunately grossly underestimates the problem and here's why.
  • 00:02:16
    Imagine you're a researcher in a field where there are a thousand hypotheses
  • 00:02:20
    currently being investigated.
  • 00:02:22
    Let's assume that ten percent of them reflect true relationships and the rest
  • 00:02:25
    are false, but no one of course knows which are which, that's the whole point
  • 00:02:28
    of doing the research. Now, assuming the experiments are pretty well designed,
  • 00:02:32
    they should correctly identify around, say, 80 of the hundred true relationships;
  • 00:02:36
    this is known as a statistical power of eighty percent, so 20 results are false
  • 00:02:42
    negatives, perhaps the sample size was too small or the measurements were not
  • 00:02:45
    sensitive enough. Now consider that from those 900 false hypotheses using a
  • 00:02:50
    p-value of .05, forty-five false hypotheses will be incorrectly
  • 00:02:55
    considered true. As for the rest, they will be correctly identified as false
  • 00:02:59
    but most journals rarely publish null results: they make up just ten to thirty
  • 00:03:03
    percent of papers depending on the field, which means that the papers that
  • 00:03:07
    eventually get published will include 80 true positive results,
  • 00:03:10
    45 false positive results and maybe 20 true negative results.
  • 00:03:15
    Nearly a third of published results will be wrong
  • 00:03:18
    even with the system working normally, things get even worse if studies are
  • 00:03:22
    underpowered, and analysis shows they typically are, if there is a higher ratio
  • 00:03:26
    of false-to-true hypotheses being tested or if the researchers are biased.
  • 00:03:32
    All of this was pointed out in a 2005 paper entitled "Why Most Published Research Findings Are False".
  • 00:03:37
    So, recently, researchers in a number of fields have attempted to
  • 00:03:40
    quantify the problem by replicating some prominent past results.
  • 00:03:44
    The Reproducibility Project repeated a hundred psychology studies but found only
  • 00:03:48
    thirty-six percent had a statistically significant result the second time
  • 00:03:52
    around and the strength of measured relationships were on average half those
  • 00:03:56
    of the original studies. An attempted verification of 53 studies considered
  • 00:03:59
    landmarks in the basic science of cancer only managed to reproduce six, even
  • 00:04:04
    working closely with the original studies' authors. These results are even
  • 00:04:08
    worse than I just calculated. The reason for this is nicely illustrated by a 2015
  • 00:04:13
    study showing that eating a bar of chocolate every day can help you lose
  • 00:04:16
    weight faster. In this case the participants were randomly allocated to
  • 00:04:20
    one of three treatment groups:
  • 00:04:22
    one went on a low-carb diet, another one on the same low carb diet plus a 1.5 ounce
  • 00:04:26
    bar of chocolate per day and the third group was the control, instructed
  • 00:04:30
    just to maintain their regular eating habits. At the end of three weeks the
  • 00:04:33
    control group had neither lost nor gained weight but both low carb groups
  • 00:04:37
    had lost an average of five pounds per person.
  • 00:04:40
    The group that ate chocolate, however, lost weight ten percent faster than the
  • 00:04:44
    non-chocolate eaters. The finding was statistically significant with a p-value less than .05.
  • 00:04:50
    As you might expect this news spread like wildfire, to the
  • 00:04:53
    front page of Bild, the most widely circulated daily newspaper in Europe
  • 00:04:57
    and into the Daily Star, the Irish Examiner, the Huffington Post and even Shape Magazine.
  • 00:05:02
    Unfortunately the whole thing had been faked, kind of. I mean the researchers did
  • 00:05:07
    perform the experiment exactly as they described, but they intentionally
  • 00:05:11
    designed it to increase the likelihood of false positives: the sample size was
  • 00:05:15
    incredibly small, just five people per treatment group, and for each person 18
  • 00:05:20
    different measurements were tracked including: weight, cholesterol, sodium,
  • 00:05:24
    blood protein levels, sleep quality, well-being, and so on; so if weight loss
  • 00:05:29
    didn't show a significant difference there were plenty of other factors that
  • 00:05:32
    might have. So the headline could have been "chocolate lowers cholesterol" or
  • 00:05:36
    "increases sleep quality" or... something.
  • 00:05:39
    The point is: a p-value is only really valid for a single measure
  • 00:05:43
    once you're comparing a whole slew of variables the probability that at least
  • 00:05:46
    one of them gives you a false positive goes way up, and this is known as "p-hacking".
  • 00:05:51
    Researchers can make a lot of decisions about their analysis that can
  • 00:05:54
    decrease the p-value, for example let's say you analyze your data and you find
  • 00:05:58
    it nearly reaches statistical significance, so you decide to collect
  • 00:06:01
    just a few more data points to be sure.
  • 00:06:03
    Then if the p-value drops below .05 you stop collecting data, confident that
  • 00:06:08
    these additional data points could only have made the result more significant if
  • 00:06:11
    there were really a true relationship there, but numerical simulations show
  • 00:06:15
    that relationships can cross the significance threshold by adding more
  • 00:06:19
    data points even though a much larger sample would show that there really is
  • 00:06:23
    no relationship. In fact, there are a great number of ways to increase the
  • 00:06:27
    likelihood of significant results like: having two dependent variables, adding
  • 00:06:31
    more observations, controlling for gender, or dropping one of three conditions.
  • 00:06:36
    Combining all three of these strategies together increases the
  • 00:06:39
    likelihood of a false-positive to over sixty percent, and that is using p less than .05
  • 00:06:45
    Now if you think this is just a problem for psychology
  • 00:06:47
    neuroscience or medicine, consider the pentaquark, an exotic particle made
  • 00:06:52
    up of five quarks, as opposed to the regular three for protons or neutrons.
  • 00:06:56
    Particle physics employs particularly stringent requirements for statistical
  • 00:07:00
    significance referred to as 5-sigma or one chance in 3.5 million of getting a
  • 00:07:05
    false positive, but in 2002 a Japanese experiment found evidence for the
  • 00:07:09
    Theta-plus pentaquark, and in the two years that followed 11 other independent
  • 00:07:13
    experiments then looked for and found evidence of that same pentaquark with
  • 00:07:17
    very high levels of statistical significance. From July 2003 to
  • 00:07:22
    May 2004 a theoretical paper on pentaquarks was published on average every
  • 00:07:26
    other day, but alas, it was a false discovery for their experimental
  • 00:07:31
    attempts to confirm that theta-plus pentaquark using greater statistical
  • 00:07:34
    power failed to find any trace of its existence.
  • 00:07:37
    The problem was those first scientists weren't blind to the data, they knew how
  • 00:07:41
    the numbers were generated and what answer they expected to get, and the way
  • 00:07:45
    the data was cut and analyzed, or p-hacked, produced the false finding.
  • 00:07:50
    Now most scientists aren't p-hacking maliciously, there are legitimate decisions to be
  • 00:07:54
    made about how to collect, analyze and report data, and these decisions impact
  • 00:07:58
    on the statistical significance of results. For example, 29 different
  • 00:08:02
    research groups were given the same data and asked to determine if dark-skinned
  • 00:08:05
    soccer players are more likely to be given red cards; using identical data
  • 00:08:10
    some groups found there was no significant effect while others
  • 00:08:13
    concluded dark-skinned players were three times as likely to receive a red card.
  • 00:08:18
    The point is that data doesn't speak for itself, it must be interpreted.
  • 00:08:22
    Looking at those results
  • 00:08:23
    it seems that dark skinned players are more likely to get red carded but
  • 00:08:26
    certainly not three times as likely; consensus helps in this case but
  • 00:08:31
    for most results only one research group provides the analysis and therein lies
  • 00:08:35
    the problem of incentives: scientists have huge incentives to publish papers,
  • 00:08:40
    in fact their careers depend on it; as one scientist, Brian Nosek, puts it:
  • 00:08:44
    "There is no cost to getting things wrong, the cost is not getting them published".
  • 00:08:49
    Journals are far more likely to publish
  • 00:08:51
    results that reach statistical significance so if a method of data
  • 00:08:54
    analysis results in a p-value less than .05 then you're likely to go with
  • 00:08:58
    that method. Publication is also more likely if the result is novel and
  • 00:09:02
    unexpected, this encourages researchers to investigate more and more unlikely
  • 00:09:05
    hypotheses which further decreases the ratio of true to spurious relationships
  • 00:09:10
    that are tested; now what about replication? Isn't science meant to
  • 00:09:14
    self-correct by having other scientists replicate the findings of an initial
  • 00:09:18
    discovery? In theory yes but in practice it's more complicated, like take the
  • 00:09:22
    precognition study from the start of this video: three researchers attempted
  • 00:09:26
    to replicate one of those experiments, and what did they find?
  • 00:09:29
    well, surprise surprise, the hit rate they obtained was not significantly different
  • 00:09:32
    from chance. When they tried to publish their findings in the same journal as
  • 00:09:36
    the original paper they were rejected. The reason? The journal refuses to
  • 00:09:41
    publish replication studies. So if you're a scientist the successful strategy is
  • 00:09:46
    clear: don't even attempt replication studies, because few journals will
  • 00:09:49
    publish them, and there is a very good chance that your results won't be
  • 00:09:53
    statistically significant anyway, in which case instead of being able to
  • 00:09:57
    convince colleagues of the lack of reproducibility of an effect you will be
  • 00:10:01
    accused of just not doing it right.
  • 00:10:03
    So a far better approach is to test novel and unexpected hypotheses and then
  • 00:10:08
    p-hack your way to a statistically significant result. Now I don't want to
  • 00:10:13
    be too cynical about this because over the past 10 years things have started
  • 00:10:16
    changing for the better.
  • 00:10:17
    Many scientists acknowledge the problems I've outlined and are starting to take
  • 00:10:21
    steps to correct them: there are more large-scale replication studies
  • 00:10:25
    undertaken in the last 10 years, plus there's a site, Retraction Watch,
  • 00:10:28
    dedicated to publicizing papers that have been withdrawn, there are online
  • 00:10:32
    repositories for unpublished negative results and there is a move towards
  • 00:10:37
    submitting hypotheses and methods for peer review before conducting
  • 00:10:40
    experiments with the guarantee that research will be published regardless of
  • 00:10:43
    results so long as the procedure is followed. This eliminates publication
  • 00:10:48
    bias, promotes higher powered studies and lessens the incentive for p-hacking.
  • 00:10:53
    The thing I find most striking about the reproducibility crisis in science is not
  • 00:10:57
    the prevalence of incorrect information in published scientific journals
  • 00:11:01
    after all, getting to the truth, we know, is hard, and mathematically not everything that
  • 00:11:06
    is published can be correct.
  • 00:11:08
    What gets me is the thought that even trying our best to figure out what's
  • 00:11:11
    true, using our most sophisticated and rigorous mathematical tools: peer review,
  • 00:11:16
    and the standards of practice, we still get it wrong so often; so how frequently
  • 00:11:20
    do we delude ourselves when we're not using the scientific method? As flawed as
  • 00:11:26
    our science may be, it is far more reliable than any other way of knowing
  • 00:11:31
    that we have.
  • 00:11:37
    This episode of Veritasium was supported in part by these fine
  • 00:11:40
    people on Patreon and by Audible.com, the leading provider of audiobooks online
  • 00:11:45
    with hundreds of thousands of titles in all areas of literature including:
  • 00:11:48
    fiction, nonfiction and periodicals, Audible offers a free 30-day trial to
  • 00:11:53
    anyone who watches this channel, just go to audible.com/veritasium so they know
  • 00:11:57
    I sent you. A book I'd recommend is called "The Invention of Nature" by Andrea Wulf
  • 00:12:02
    which is a biography of Alexander von Humboldt, an adventurer and naturalist
  • 00:12:07
    who actually inspired Darwin to board the Beagle; you can download that
  • 00:12:11
    book or any other of your choosing for a one month free trial at audible.com/veritasium
  • 00:12:16
    So as always I want to thank Audible for supporting me and I really
  • 00:12:18
    want to thank you for watching.
Tags
  • scientific research
  • p-values
  • false positives
  • replication crisis
  • p-hacking
  • statistical analysis
  • reproducibility