Stop Assessing Science


I completely agree with PZ, in part because I’ve heard the same tune before.

The results indicate that the investigators contributing to Volume 61 of the Journal of Abnormal and Social Psychology had, on the average, a relatively (or even absolutely) poor chance of rejecting their major null hypotheses, unless the effect they sought was large. This surprising (and discouraging) finding needs some further consideration to be seen in full perspective.

First, it may be noted that with few exceptions, the 70 studies did have significant results. This may then suggest that perhaps the definitions of size of effect were too severe, or perhaps, accepting the definitions, one might seek to conclude that the investigators were operating under circumstances wherein the effects were actually large, hence their success. Perhaps, then, research in the abnormal-social area is not as “weak” as the above results suggest. But this argument rests on the implicit assumption that the research which is published is representative of the research undertaken in this area. It seems obvious that investigators are less likely to submit for publication unsuccessful than successful research, to say nothing of a similar editorial bias in accepting research for publication.

Statistical power is the probability of rejecting the null hypothesis when it is false, that is, the chance of detecting an effect that really is there. The larger the sample size, the greater the statistical power. So if your study has a poor chance of answering the question it was designed to answer, its sample is too small.
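To make that definition concrete, here is a minimal sketch of a power calculation done by brute force, assuming a two-sample t-test and a true effect of Cohen's d = 0.5 (the setup, the function name, and the numbers are mine for illustration, not anything from the sources quoted here): simulate many fake experiments, test each one, and count how often the null hypothesis gets rejected.

```python
# Sketch: estimate statistical power by simulation.
# Power = the probability of rejecting the null hypothesis when it is false.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def estimated_power(n_per_group, d=0.5, alpha=0.05, trials=5000):
    """Fraction of simulated experiments that reject H0 at the given alpha."""
    rejections = 0
    for _ in range(trials):
        control = rng.normal(0.0, 1.0, n_per_group)
        treatment = rng.normal(d, 1.0, n_per_group)  # true difference of d standard deviations
        _, p = stats.ttest_ind(control, treatment)
        rejections += (p < alpha)
    return rejections / trials

for n in (10, 30, 64, 200):
    print(f"n = {n:3d} per group -> power ~ {estimated_power(n):.2f}")
```

At a true effect of d = 0.5, ten subjects per group leaves power below 20%, and it takes roughly 64 per group to reach the conventional 80%.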

Suppose we hold fixed the theoretically calculable incidence of Type I errors. … Holding this 5% significance level fixed (which, as a form of scientific strategy, means leaning over backward not to conclude that a relationship exists when there isn’t one, or when there is a relationship in the wrong direction), we can decrease the probability of Type II errors by improving our experiment in certain respects. There are three general ways in which the frequency of Type II errors can be decreased (for fixed Type I error-rate), namely, (a) by improving the logical structure of the experiment, (b) by improving experimental techniques such as the control of extraneous variables which contribute to intragroup variation (and hence appear in the denominator of the significance test), and (c) by increasing the size of the sample. … We select a logical design and choose a sample size such that it can be said in advance that if one is interested in a true difference provided it is at least of a specified magnitude (i.e., if it is smaller than this we are content to miss the opportunity of finding it), the probability is high (say, 80%) that we will successfully refute the null hypothesis.
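Meehl's strategy (c) is easy to put numbers on. Using the normal approximation for a two-sided, two-sample comparison of means (my back-of-the-envelope sketch, not something from the passage just quoted), the per-group sample size needed for a target power works out as follows:

```python
# Sketch: approximate per-group sample size for 80% power at alpha = 0.05,
# for a two-sided two-sample comparison of means (normal approximation).
from scipy.stats import norm

def n_per_group(d, alpha=0.05, power=0.80):
    z_alpha = norm.ppf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = norm.ppf(power)           # 0.84 for 80% power
    return 2 * (z_alpha + z_beta) ** 2 / d ** 2

for d in (0.25, 0.5, 0.8):
    print(f"d = {d}: ~{n_per_group(d):.0f} participants per group")
```

Detecting a small effect (d = 0.25) with 80% power takes roughly 250 participants per group; a medium effect (d = 0.5) takes about 63.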

If low statistical power were just due to a few bad apples, it would be rare. Instead, as the first quote implies, it's quite common. That study found that for small effect sizes, where Cohen's d is roughly 0.25, average statistical power was an abysmal 18%. For medium effect sizes, where d is roughly 0.5, the figure was still less than half. Since those two ranges cover the majority of social-science effect sizes, the typical study has very low power, which is to say a sample too small for the effects it is chasing. The problem of low power, then, must be systemic to how science is carried out.
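Those figures are easy to sanity-check. Assuming, purely for illustration, a sample of 30 subjects per group (a hypothetical number of mine, not one taken from the study), the same normal approximation as above puts power right in that ballpark:

```python
# Sketch: approximate power of a two-sided two-sample test at alpha = 0.05,
# for an assumed (illustrative) sample of 30 subjects per group.
from math import sqrt
from scipy.stats import norm

def approx_power(d, n_per_group, alpha=0.05):
    z_alpha = norm.ppf(1 - alpha / 2)
    noncentrality = d * sqrt(n_per_group / 2)
    return norm.cdf(noncentrality - z_alpha)

for d in (0.25, 0.5):
    print(f"d = {d}, n = 30 per group -> power ~ {approx_power(d, 30):.2f}")
```

That comes out near 16% for d = 0.25 and just under 50% for d = 0.5, in line with the numbers reported above.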

In this fashion a zealous and clever investigator can slowly wend his way through a tenuous nomological network, performing a long series of related experiments which appear to the uncritical reader as a fine example of “an integrated research program,” without ever once refuting or corroborating so much as a single strand of the network. Some of the more horrible examples of this process would require the combined analytic and reconstructive efforts of Carnap, Hempel, and Popper to unscramble the logical relationships of theories and hypotheses to evidence. Meanwhile our eager-beaver researcher, undismayed by logic-of-science considerations and relying blissfully on the “exactitude” of modern statistical hypothesis-testing, has produced a long publication list and been promoted to a full professorship. In terms of his contribution to the enduring body of psychological knowledge, he has done hardly anything. His true position is that of a potent-but-sterile intellectual rake, who leaves in his merry path a long train of ravished maidens but no viable scientific offspring.

I know, it’s a bit confusing that I haven’t clarified who I’m quoting. That first quote comes from this study:

Cohen, Jacob. “The Statistical Power of Abnormal-Social Psychological Research: A Review.” The Journal of Abnormal and Social Psychology 65, no. 3 (1962): 145–153.

While the second and third are from this:

Meehl, Paul E. “Theory-Testing in Psychology and Physics: A Methodological Paradox.” Philosophy of Science 34, no. 2 (1967): 103–115.

That’s right, scientists have been complaining about small sample sizes for over 50 years. Fanelli et al. [2017] might provide greater detail and evidence than previous authors did, but the basic conclusion has remained the same. Nor are these two studies lone wolves in the darkness; I wrote about a meta-analysis of 16 different surveys of statistical power conducted between Cohen’s and now, all of which agree with Cohen’s findings.

If your assessments have been consistently telling you the same thing for decades, maybe it’s time to stop assessing. Maybe it’s time to start acting on those assessments, instead. PZ is already doing that, thankfully…

More data! This is also helpful information for my undergraduate labs, since I’m currently in the process of cracking the whip over my genetics students and telling them to count more flies. Only a thousand? Count more. MORE!

… but this is a chronic, systemic issue within science. We need more.