Back in 1938, Joseph Berkson made a bold statement:
I believe that an observant statistician who has had any considerable experience with applying the chi-square test repeatedly will agree with my statement that, as a matter of observation, when the numbers in the data are quite large, the P’s tend to come out small. Having observed this, and on reflection, I make the following dogmatic statement, referring for illustration to the normal curve: “If the normal curve is fitted to a body of data representing any real observations whatever of quantities in the physical world, then if the number of observations is extremely large—for instance, on an order of 200,000—the chi-square P will be small beyond any usual limit of significance.”
This dogmatic statement is made on the basis of an extrapolation of the observation referred to and can also be defended as a prediction from a priori considerations. For we may assume that it is practically certain that any series of real observations does not actually follow a normal curve with absolute exactitude in all respects, and no matter how small the discrepancy between the normal curve and the true curve of observations, the chi-square P will be small if the sample has a sufficiently large number of observations in it.
Berkson, Joseph. “Some Difficulties of Interpretation Encountered in the Application of the Chi-Square Test.” Journal of the American Statistical Association 33, no. 203 (1938): 526–536.
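Berkson’s a priori argument can be sketched numerically. The chi-square statistic grows roughly like n·Δ, where Δ is a fixed discrepancy between the true bin probabilities and the best-fitting normal’s, while the critical value stays constant—so any nonzero Δ guarantees a tiny P for large enough n. A minimal illustration (the 95/5 normal mixture is an invented stand-in for “real observations that are almost, but not exactly, normal”):

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# Hypothetical true distribution: a 95/5 mixture of N(0,1) and N(0,2) --
# very nearly normal, but not exactly.
def true_cdf(x):
    return 0.95 * normal_cdf(x, 0, 1) + 0.05 * normal_cdf(x, 0, 2)

# The best-fitting normal shares the mixture's mean (0) and variance
# (0.95 * 1 + 0.05 * 4 = 1.15).
fit_sd = math.sqrt(1.15)

edges = [-math.inf, -2, -1, 0, 1, 2, math.inf]
p = [true_cdf(b) - true_cdf(a) for a, b in zip(edges[:-1], edges[1:])]
q = [normal_cdf(b, 0, fit_sd) - normal_cdf(a, 0, fit_sd)
     for a, b in zip(edges[:-1], edges[1:])]

# Per-observation discrepancy: the chi-square statistic grows like n * delta.
delta = sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))

# With 6 bins and two fitted parameters, df = 6 - 1 - 2 = 3;
# the 0.05 critical value is about 7.81.
for n in (1_000, 10_000, 200_000):
    print(n, round(n * delta, 1))
```

With these (arbitrary) bins and mixture, n·Δ sits below the critical value at n = 1,000 but dwarfs it by n = 200,000—Berkson’s figure exactly: the P is then “small beyond any usual limit of significance.”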
Experience shows that when large numbers of subjects are used in studies, nearly all comparisons of means are “significantly” different and all correlations are “significantly” different from zero. The author once had occasion to use 700 subjects in a study of public opinion. After a factor analysis of the results, the factors were correlated with individual-difference variables such as amount of education, age, income, sex, and others. In looking at the results I was happy to find so many “significant” correlations (under the null-hypothesis model)—indeed, nearly all correlations were significant, including ones that made little sense. Of course, with an N of 700, correlations as large as .08 are “beyond the .05 level.” Many of the “significant” correlations were of no theoretical or practical importance.

Nunnally, Jum. “The Place of Statistics in Psychology.” Educational and Psychological Measurement 20, no. 4 (1960): 641–650.

One of the common experiences of research workers is the very high frequency with which significant results are obtained with large samples. Some years ago, the author had occasion to run a number of tests of significance on a battery of tests collected on about 60,000 subjects from all over the United States. Every test came out significant. Dividing the cards by such arbitrary criteria as east versus west of the Mississippi River, Maine versus the rest of the country, North versus South, etc., all produced significant differences in means. In some instances, the differences in the sample means were quite small, but nonetheless, the p values were all very low.

Bakan, David. “The Test of Significance in Psychological Research.” Psychological Bulletin 66, no. 6 (1966): 423.
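Nunnally’s figure is easy to check. Under the null hypothesis, a sample Pearson correlation is approximately normal with standard deviation 1/√n for large n, so the two-sided 0.05 cutoff is roughly 1.96/√n. A quick sketch (using that large-sample approximation rather than the exact t-based test):

```python
import math

# Approximate two-sided 0.05 critical value for a Pearson correlation:
# under the null, r is roughly normal with sd 1/sqrt(n) for large n,
# so any |r| above about 1.96/sqrt(n) is "significant".
def critical_r(n, z=1.96):
    return z / math.sqrt(n)

# Nunnally's N = 700 and Bakan's N = 60,000, with a small sample for contrast.
for n in (30, 700, 60_000):
    print(n, round(critical_r(n), 3))
```

At N = 700 the cutoff is about .074, so a correlation of .08 is indeed “beyond the .05 level”; at Bakan’s N = 60,000 any correlation above roughly .008 clears the bar.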
The major point of this paper is that the test of significance does not provide the information concerning psychological phenomena characteristically attributed to it; and that, furthermore, a great deal of mischief has been associated with its use. What will be said in this paper is hardly original. It is, in a certain sense, what “everybody knows.” To say it “out loud” is, as it were, to assume the role of the child who pointed out that the emperor was really outfitted only in his underwear. Little of that which is contained in this paper is not already available in the literature, and the literature will be cited.

Bakan (1966)
Statisticians classically asked the wrong question – and were willing to answer with a lie, one that was often a downright lie. They asked “Are the effects of A and B different?” and they were willing to answer “no.” All we know about the world teaches us that the effects of A and B are always different – in some decimal place – for any A and B. Thus asking “Are the effects different?” is foolish.

Tukey, John W. “The Philosophy of Multiple Comparisons.” Statistical Science 6, no. 1 (February 1991): 100–116. doi:10.1214/ss/1177011945.
In sum, tests for direction are easier than tests for existence: when applied to the same data, tests for direction are more diagnostic than tests for existence. From a Bayesian perspective, the one-sided P value is a test for direction; when this test is misinterpreted as a test for existence—as classical statisticians are wont to do—this overstates the true evidence that the data provide against a point null hypothesis.
Marsman, M., and E.-J. Wagenmakers. “Three Insights from a Bayesian Interpretation of the One-Sided P Value.” Educational and Psychological Measurement, October 5, 2016. doi:10.1177/0013164416669201.
Crud factor: In the social sciences and arguably in the biological sciences, “everything correlates to some extent with everything else.” This truism, which I have found no competent psychologist disputes given five minutes reflection, does not apply to pure experimental studies in which attributes that the subjects bring with them are not the subject of study (except in so far as they appear as a source of error and hence in the denominator of a significance test). There is nothing mysterious about the fact that in psychology and sociology everything correlates with everything. Any measured trait or attribute is some function of a list of partly known and mostly unknown causal factors in the genes and life history of the individual, and both genetic and environmental factors are known from tons of empirical research to be themselves correlated.

Meehl, Paul E. “Why Summaries of Research on Psychological Theories Are Often Uninterpretable.” Psychological Reports 66, no. 1 (1990): 195–244.