I recently finished Jordan Ellenberg’s book, How Not To Be Wrong, and quite enjoyed it. This book, on “the power of mathematical thinking,” covers many topics at the intersection of math and life. While it requires a reasonable grasp of math to read, it is relatively accessible as these things go (and it was recommended by Bill Gates, along with Sapiens, which I also tore through this summer).
Anyway, I receive a ton of press releases about studies. And so I think my favorite chapter was a parable about “The International Journal of Haruspicy,” and the problem of statistical significance. Ellenberg notes that he got this parable from a statistician (Cosma Shalizi). Haruspicy, in case you were wondering, is the practice of predicting the future by examining the entrails of sacrificed sheep. The modern haruspex does not accept the practice merely because the deities commanded it. “You require evidence. And so you and your colleagues submit all your work to the peer-reviewed International Journal of Haruspicy, which demands without exception that all published results clear the bar of statistical significance.”
Many times, it does not work out. “The gods are very picky and it’s not always clear precisely which arrangement of the internal organs and which precise incantations will reliably unlock the future,” he notes, but it’s all worth it “for those moments of discovery, where everything works, and you find that the texture and protrusions of the liver really do predict the severity of the following year’s flu season, and, with a silent thank-you to the gods, you publish.”
The punchline: “You might find this happens about one time in twenty.”
It’s funny, of course, because a p value of .05 is the convention for statistical significance. When researchers get results, they calculate the probability of seeing results at least that extreme if chance alone were at work. For the results to be considered statistically significant, that probability must be less than 5%. Of course, that means that when there is no real effect, about 1 in 20 experiments will still produce a “significant” result by pure luck. Since it looks statistically significant, you can publish. In a peer-reviewed journal! And people think it’s the truth! No one publishes the other results that weren’t statistically significant. Needless to say, this gives a biased view of the world — like that haruspicy works.
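You can watch that 1-in-20 rate happen with a few lines of simulation. Here is a minimal sketch (the specific test — a one-sample z-test on pure noise — is my choice for illustration, not anything from the book): we run thousands of “experiments” in which the null hypothesis is true by construction, and count how often the p value still dips below .05.

```python
import math
import random

random.seed(0)

def p_value(sample):
    """Two-sided p-value for a one-sample z-test of mean 0,
    with known standard deviation 1, via the normal CDF (math.erf)."""
    n = len(sample)
    z = sum(sample) / math.sqrt(n)  # sample mean divided by 1/sqrt(n)
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return 2 * (1 - cdf)

# 10,000 experiments where the null hypothesis is TRUE:
# the data are pure noise with mean exactly 0.
trials = 10_000
false_positives = sum(
    p_value([random.gauss(0, 1) for _ in range(30)]) < 0.05
    for _ in range(trials)
)
print(false_positives / trials)  # hovers around 0.05, i.e. about 1 in 20
```

Every one of those “discoveries” is a fluke, yet each would clear the journal’s bar on its own.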
Ellenberg also illustrates this with an xkcd cartoon in which the scientists are examining a thesis that jelly beans are linked to acne. They run the experiment with 20 different kinds of jelly beans. The vast majority show no link, but lo and behold, the consumption of green jelly beans is correlated with acne with statistical significance. This is followed by a front page headline screaming “Green Jelly Beans Linked To Acne!” Yes, it is statistically significant. But it’s also almost certainly just noise. If someone repeated the experiment, the green results would most likely fail to reach significance again. Indeed — and far more weighty than jelly beans — in 2012 a team at Amgen set out to replicate some of the most famous findings in the biology of cancer. Of 53 studies, Ellenberg writes, they could reproduce only 6.
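The jelly-bean arithmetic is easy to check: with 20 independent tests at the .05 level and no real effect anywhere, the chance that at least one flavor comes up “significant” is 1 − 0.95²⁰, or roughly 64%. A quick sketch confirms it, relying on the standard fact that under a true null hypothesis (with a continuous test statistic) the p value is uniformly distributed between 0 and 1:

```python
import random

random.seed(1)

FLAVORS = 20     # kinds of jelly beans tested in the cartoon
ALPHA = 0.05     # the conventional significance threshold
TRIALS = 10_000  # simulated repeats of the whole 20-flavor study

# Under a true null, each flavor's p-value is uniform on [0, 1],
# so random.random() stands in for one flavor's test result.
hits = sum(
    any(random.random() < ALPHA for _ in range(FLAVORS))
    for _ in range(TRIALS)
)
print(round(hits / TRIALS, 2))  # close to 1 - 0.95**20, about 0.64
```

So the newspaper headline is not a surprise; it is close to the expected outcome of testing 20 flavors of nothing.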
Perhaps even more fascinating: one study of results in fields ranging from political science to economics to psychology found that the p values in journals are not randomly distributed over the range from just above 0 to 0.05. There is a conspicuous bump right around .05, suggesting that some results that were truly marginal were dragged over the finish line.
There are some movements to publish more non-results, or to accept experiments for publication before results are shown, so we know the answer was “nothing” rather than just hearing silence. But the best approach to all this is to remember that “The significance test is the detective, not the judge,” Ellenberg writes. “The provocative and oh-so-statistically significant finding isn’t the conclusion of the scientific process, but the bare beginning.”
In other words: I used to have a T-shirt that said “I’m statistically significant.”
In other other news: My editor took out the references to p values in I Know How She Does It, figuring (no doubt correctly) that people probably didn’t care. The one lingering bit of evidence is my musing that the statistically significant positive correlation between family size and time spent exercising is the “one in twenty chance of being a fluke.”