Ugh, Not Again


P-values are back in the news. Nature published an article, signed by 800 scientists, calling for an end to the concept of “statistical significance.” It ruffled my feathers, even though I agreed with its central thesis.

The trouble is human and cognitive more than it is statistical: bucketing results into ‘statistically significant’ and ‘statistically non-significant’ makes people think that the items assigned in that way are categorically different. The same problems are likely to arise under any proposed statistical alternative that involves dichotomization, whether frequentist, Bayesian or otherwise.

Unfortunately, the false belief that crossing the threshold of statistical significance is enough to show that a result is ‘real’ has led scientists and journal editors to privilege such results, thereby distorting the literature. Statistically significant estimates are biased upwards in magnitude and potentially to a large degree, whereas statistically non-significant estimates are biased downwards in magnitude. Consequently, any discussion that focuses on estimates chosen for their significance will be biased. On top of this, the rigid focus on statistical significance encourages researchers to choose data and methods that yield statistical significance for some desired (or simply publishable) result, or that yield statistical non-significance for an undesired result, such as potential side effects of drugs — thereby invalidating conclusions.

Nothing wrong there. While I’ve mentioned some Bayesian buckets, I tucked away a one-sentence counter-argument in an aside over here. Any artificial significant/non-significant boundary is going to promote the distortions they mention here. What got me writing this post was their recommendations.

What will retiring statistical significance look like? We hope that methods sections and data tabulation will be more detailed and nuanced. Authors will emphasize their estimates and the uncertainty in them — for example, by explicitly discussing the lower and upper limits of their intervals. They will not rely on significance tests. When P values are reported, they will be given with sensible precision (for example, P = 0.021 or P = 0.13) — without adornments such as stars or letters to denote statistical significance and not as binary inequalities (P  < 0.05 or P > 0.05). Decisions to interpret or to publish results will not be based on statistical thresholds. People will spend less time with statistical software, and more time thinking.

This basically amounts to nothing. Journal editors still have to decide what to print, and if there is no strong alternative they’ll switch from an arbitrary cutoff of p < 0.05 to an ad-hoc arbitrary cutoff. In the meantime, they’re leaving flawed statistical procedures in place. P-values exaggerate the strength of the evidence, as I and others have argued. Confidence intervals are not an improvement, either. As I put it:

For one thing, if you’re a frequentist it’s a category error to state the odds of a hypothesis being true, or that some data makes a hypothesis more likely, or even that you’re testing the truth-hood of a hypothesis. […]

How does this intersect with confidence intervals? If it’s an invalid move to hypothesise[sic] “the population mean is Y,” it must also be invalid to say “there’s a 95% chance the population mean is between X and Z.” That’s attaching a probability to a hypothesis, and therefore a no-no! Instead, what a frequentist confidence interval is really telling you is “assuming this data is a representative sample, if I repeat my experimental procedure an infinite number of times then I’ll calculate a sample mean between X and Z 95% of the time.” A confidence interval says nothing about the test statistic, at least not directly.

In frequentism, the parameter is fixed and the data varies. It doesn’t make sense to consider other parameters, that’s a Bayesian move. And yet the authors propose exactly that!

We must learn to embrace uncertainty. One practical way to do so is to rename confidence intervals as ‘compatibility intervals’ and interpret them in a way that avoids overconfidence. Specifically, we recommend that authors describe the practical implications of all values inside the interval, especially the observed effect (or point estimate) and the limits. In doing so, they should remember that all the values between the interval’s limits are reasonably compatible with the data, given the statistical assumptions used to compute the interval. Therefore, singling out one particular value (such as the null value) in the interval as ‘shown’ makes no sense.

Much of what the authors proposed would be fixed by switching to Bayesian statistics. Their own suggestions invoke Bayesian ideas without realizing it. Yet they go out of their way to say nothing’s wrong with p-values or confidence intervals, despite evidence to the contrary. Their proposal is destined to fail, yet it got more support than the arguably-superior p < 0.005 proposal.

Maddening. Maybe it’s time I got out my poison pen and added my two cents to the scientific record.