Tuesday, 27 March 2012

The danger of p-values and statistical significance testing

I have just come across an article in the Financial Times (published in 2007) titled "The Ten Things Everyone Should Know About Science". Although the article is not new, the place where I found the link to it is: right at the top of the home page for the 2011-12 course on Probabilistic Systems Analysis at MIT. In fact the top bullet point says:
The concept of statistical significance (to be touched upon at the end of this course) is considered by the Financial Times as one of "The Ten Things Everyone Should Know About Science".
The FT article does indeed list "Statistical significance" as one of the ten things, along with: Evolution, Genes and DNA, Big Bang, Quantum Mechanics, Relativity, Radiation, Atomic and Nuclear Reactions, Molecules and Chemical Reactions, and Digital data. That is quite illustrious company, and insofar as it helps promote the importance of correct probabilistic reasoning I am delighted. However, as is fairly common, the article assumes that 'statistical significance' is synonymous with p-values. The article does hint that there might be some scientists who are sceptical of this approach when it says:
Some critics claim that contemporary science places statistical significance on a pedestal that it does not deserve. But no one has come up with an alternative way of assessing experimental outcomes that is as simple or as generally applicable.
In fact, that first sentence is a gross understatement, while the second is simply not true. To see why the first sentence is a gross understatement look at this summary (which explains what p-values are) that appears in Chapter 1 of our forthcoming book (you can see full draft chapters of the book here). To see why the second sentence is not true look at this example from Chapter 5 of the book (which also shows why Bayes offers a much better alternative). Also look at this (taken from Chapter 10), which explains why the related 'confidence intervals' are not what most people think they are (and how this dreadful approach can also be avoided using Bayes).

Hence it is very disappointing that an institute like MIT should be perpetuating the myths about this kind of significance testing. These myths have had (and continue to have) a profound negative impact on all empirical research. The book "The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives (Economics, Cognition & Society)" by Ziliak and McCloskey (The University of Michigan Press, 2008) provides extensive evidence of flawed studies and results published in reputable journals across all disciplines. It is also worth looking at the article "Why Most Published Research Findings Are False". The problem is not only that 'false' findings are published but also that more scientifically rigorous empirical studies are rejected because their authors have not performed the dreaded significance tests demanded by journal editors or reviewers.

This is something we see all the time, and I can share an interesting anecdote on it. I was recently discussing a published paper with its author. The paper was specifically about using the Bayesian Information Criterion to determine which model was producing the best prediction in a particular application. The Bayesian analysis was the 'significance test' (only a lot more informative). Yet at the end of the paper was a section with a p-value significance test analysis that was redundant and uninformative. I asked the author why she had included this section, as it rather undermined the value of the rest of the paper. She told me that the paper she submitted did not have this section, but that the journal editors had demanded a p-value analysis as a requirement for publishing the paper.
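
For readers unfamiliar with the Bayesian Information Criterion mentioned in the anecdote, here is a minimal sketch of how it can be used to compare two models. It is not the analysis from the paper discussed above: the coin-flip data, the two candidate models and the function names are all assumptions made up for illustration. The sketch uses the standard definition BIC = k ln(n) - 2 ln(L), where k is the number of fitted parameters, n the number of observations and L the maximised likelihood; the model with the lower BIC is preferred.

```python
import math


def bic(log_likelihood, num_params, num_obs):
    """Bayesian Information Criterion: k*ln(n) - 2*ln(L). Lower is better."""
    return num_params * math.log(num_obs) - 2.0 * log_likelihood


def coin_log_likelihood(heads, flips, p):
    """Log-likelihood of the observed flips given P(heads) = p.

    The binomial coefficient is omitted because it is identical for
    both models and so cancels out of the comparison.
    """
    return heads * math.log(p) + (flips - heads) * math.log(1.0 - p)


# Hypothetical data (an assumption for illustration only).
flips, heads = 100, 60

# Model A: the coin is fair, so there are no fitted parameters.
bic_fair = bic(coin_log_likelihood(heads, flips, 0.5), 0, flips)

# Model B: the bias is a free parameter, fitted by maximum likelihood.
p_hat = heads / flips
bic_biased = bic(coin_log_likelihood(heads, flips, p_hat), 1, flips)

preferred = "fair" if bic_fair < bic_biased else "biased"
print(f"BIC fair = {bic_fair:.3f}, BIC biased = {bic_biased:.3f}")
print(f"Preferred model: {preferred}")
```

With these made-up numbers the fair-coin model has the slightly lower BIC: the extra parameter in the biased model does not improve the fit enough to pay for itself. The comparison directly answers the question "which model is better supported?", which is the sense in which such an analysis can stand in for a significance test.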