I recently finished Jordan Ellenberg’s book, How Not To Be Wrong, and quite enjoyed it. This book, on “the power of mathematical thinking,” covers many topics at the intersection of math and life. While it requires a reasonable grasp of math to read, it is relatively accessible as these things go (and it was recommended by Bill Gates, along with Sapiens, which I also tore through this summer).
Anyway, I receive a ton of press releases about studies. And so I think my favorite chapter was a parable about “The International Journal of Haruspicy,” and the problem of statistical significance. Ellenberg notes that he got this parable from a statistician (Cosma Shalizi). Haruspicy, in case you were wondering, is the practice of predicting the future by examining the entrails of sacrificed sheep. The modern haruspex does not believe in this practice because the deities commanded it. “You require evidence. And so you and your colleagues submit all your work to the peer-reviewed International Journal of Haruspicy, which demands without exception that all published results clear the bar of statistical significance.”
Many times, it does not work out. “The gods are very picky and it’s not always clear precisely which arrangement of the internal organs and which precise incantations will reliably unlock the future,” he notes, but it’s all worth it “for those moments of discovery, where everything works, and you find that the texture and protrusions of the liver really do predict the severity of the following year’s flu season, and, with a silent thank-you to the gods, you publish.”
The punchline: “You might find this happens about one time in twenty.”
It’s funny, of course, because a p value of .05 is the convention for statistical significance. When researchers get results, they calculate how likely it is that they would see findings at least that striking if nothing but chance were at work. For the results to be considered statistically significant, that probability needs to be less than 5%. Of course, that means that when there is no real effect at all, about 1 in 20 experiments will still clear the bar through pure luck. Since it looks statistically significant, you can publish. In a peer-reviewed journal! And people think it’s the truth! No one publishes the other results that weren’t statistically significant. Needless to say, this gives a biased view of the world, such as the impression that haruspicy works.
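To see where the one-in-twenty figure comes from, here is a minimal Python sketch that runs thousands of experiments in which there is truly nothing to find and counts how often they look significant anyway. The function names, the sample size of 30, and the choice of a simple two-sided z-test are my own assumptions for illustration, not anything from the book.

```python
import random
from statistics import NormalDist, fmean


def null_experiment_p_value(rng, n=30):
    """Two-sided z-test p value for a sample with no real effect in it."""
    # Every observation is pure noise: mean 0, standard deviation 1.
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # With a known standard deviation of 1, the z statistic is mean * sqrt(n).
    z = fmean(sample) * n ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(trials=20000, alpha=0.05, seed=0):
    """Fraction of no-effect experiments that still come out 'significant'."""
    rng = random.Random(seed)
    hits = sum(null_experiment_p_value(rng) < alpha for _ in range(trials))
    return hits / trials


print(f"Share of null experiments with p < .05: {false_positive_rate():.3f}")
```

The printed share should land close to 0.05, which is the haruspex's one lucky liver in twenty.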
Ellenberg also illustrates this with an xkcd cartoon in which the scientists are examining a hypothesis that jelly beans are linked to acne. They run the experiment with 20 different colors of jelly beans. The vast majority show no link, but lo and behold, the consumption of green jelly beans is correlated with acne with statistical significance. This is followed by a front page headline screaming that “Green Jelly Beans Linked To Acne!” Yes, it is statistically significant. But it’s also clearly just noise. If someone tried to repeat the experiment, the green results almost certainly wouldn’t be statistically significant. Indeed (and far more weighty than jelly beans), in 2012 a team at Amgen set out to replicate some of the most famous findings in the biology of cancer. Of 53 studies, Ellenberg writes, they could only reproduce 6.
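The jelly bean arithmetic is easy to check. Here is a small Python sketch, assuming the 20 tests are independent of one another and that jelly beans have no effect on acne at all (the function name is mine, purely for illustration).

```python
def chance_of_any_false_positive(tests, alpha=0.05):
    """Probability that at least one of several no-effect tests has p < alpha.

    Assumes the tests are independent of each other.
    """
    # Each test avoids a false positive with probability (1 - alpha);
    # all of them avoid it together with probability (1 - alpha) ** tests.
    return 1 - (1 - alpha) ** tests


print(f"One color:  {chance_of_any_false_positive(1):.2f}")
print(f"20 colors:  {chance_of_any_false_positive(20):.2f}")
```

Under those assumptions, the chance that at least one color comes up "significant" is about 64%, so a headline about some color of jelly bean is more likely than not.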
Perhaps even more fascinating: one study of results in fields ranging from political science to economics to psychology found that the p values in journals are not randomly distributed over the close-to-0 to 0.05 range. There is a massive bump right around .05, suggesting that some stuff that was truly marginal was dragged over the finish line.
There are some movements to publish more non-results, or to accept experiments for publication before results are shown, so we know the answer was “nothing” rather than just hearing silence. But the best approach to all this is to remember that “The significance test is the detective, not the judge,” Ellenberg writes. “The provocative and oh-so-statistically significant finding isn’t the conclusion of the scientific process, but the bare beginning.”
In other words: I used to have a T-shirt that said “I’m statistically significant.”
In other other news: My editor took out the references to p values in I Know How She Does It, figuring (no doubt correctly) that people probably didn’t care. The one lingering bit of evidence is my musing that the statistically significant positive correlation between family size and time spent exercising is the “one in twenty chance of being a fluke.”
5 thoughts on “The International Journal of Haruspicy, or the problem of statistical significance”
Sorry—have to say it: But you have to control for multiple comparisons (so p<0.05 not necessarily significant when you are doing TWENTY comparisons!)
I do get the point, however, and it was really eye-opening when I actually started doing research and publishing manuscripts how tenuous findings could be and still be "significant" and "publishable".
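For anyone curious what controlling for multiple comparisons looks like in practice, here is a minimal Python sketch of the Bonferroni correction, which divides the significance threshold by the number of tests. The p values below are made up for illustration (one "green" result at 0.03 and nineteen duds), and the function name is hypothetical.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag which p values survive a Bonferroni correction."""
    # With m comparisons, each test must clear alpha / m instead of alpha.
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


# Hypothetical results: 20 jelly bean colors, only green dips below .05.
p_values = [0.03] + [0.5] * 19
print(f"Per-test threshold: {0.05 / len(p_values):.4f}")
print(f"Green still significant? {bonferroni_significant(p_values)[0]}")
```

With 20 comparisons the per-test bar drops to 0.0025, so a lone p value of 0.03 no longer counts.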
Yes, and I’d defend (?) the .05 cutoff lump in journal articles by noting that plenty of studies are designed to include enough subjects (but not more) to generate detectable results at that point, though the tea leaves of power calculations are an issue in their own right. Still, interesting points and sounds like a good read.
A lot of economists don’t do a Bonferroni or even a Holm test for multiple comparisons, but would instead advocate that the authors be transparent about how many comparisons were made and allow readers to draw their own conclusions. And yes, a lot of research is crap. But even worse is the ignorance of the media and the way in which they present results. So painful!
And that’s the problem: maybe economists and scientists and people who have studied and trained on interpreting research studies can synthesize that data and draw conclusions, but most people really don’t understand what “significant” means in terms of research. It’s irresponsible to expect that people reading a news article can understand the subtleties involved rather than being swayed by the headline and sound bites. It’s ignorant and painful, and it can lead to real harm when people misunderstand or misinterpret medical findings.
Scientists definitely need to do a better job conveying the meaning of their results to the media, but honestly most journalists have little interest in nuance. They want something sexy that will draw a lot of readers.