A Computer Scientist Reads EvoPsych, Part 4

[Part 3]

The programs comprising the human mind were designed by natural selection to solve the adaptive problems regularly faced by our hunter-gatherer ancestors—problems such as finding a mate, cooperating with others, hunting, gathering, protecting children, navigating, avoiding predators, avoiding exploitation, and so on. Knowing this allows evolutionary psychologists to approach the study of the mind like an engineer. You start by carefully specifying an adaptive information processing problem; then you do a task analysis of that problem. A task analysis consists of identifying what properties a program would have to have to solve that problem well. This approach allows you to generate hypotheses about the structure of the programs that comprise the mind, which can then be tested.[1]

Let’s try this approach. My task will be to calculate the inverse square root of a number, a common operation in computer graphics. The “inverse” part implies I’ll have to do a division at some point, and the “square root” implies either raising something to a power, finding the logarithm of the input, or invoking some sort of function that’ll return the square root. So I should expect a program which contains an inverse square root function to have something like:

#include <math.h>

/* The "obvious" implementation my task analysis predicts. */
float InverseSquareRoot( float x )
{
    return 1.0 / sqrt(x);
}

So you could imagine my shock if I peered into a program and found this instead:

float FastInvSqrt( float x )
{
    long i;                               /* assumes a 32-bit long */
    float x2, y;

    x2 = x * 0.5;

    i = * ( long * ) &x;                  /* reinterpret the float's bits as an integer      */
    i = 0x5f3759df - ( i >> 1 );          /* shifting halves the exponent; subtracting from a */
                                          /* magic constant gives a rough first guess         */
    y = * ( float * ) &i;                 /* reinterpret the bits back into a float           */

    y = y * ( 1.5 - ( x2 * y * y ) );     /* one step of Newton's Method to refine the guess  */

    return y;
}

Something like that snippet shipped in Quake III Arena’s source code. It uses one step of Newton’s Method to find the zero of an equation derived from the input value, seeded by an initial guess that exploits the bit-level structure of floating point numbers. It also breaks every one of the predictions my analysis made; it doesn’t even include a division.

The task analysis failed for a simple reason: nearly every problem has more than one approach to it. If we’re not aware of every alternative, our analysis can’t take all of them into account and we’ll probably be led astray. We’d expect convolutions to be slow for large kernels unless we were aware of the Fourier transform, we’d think it was impossible to keep concurrent operations from mucking up memory unless we knew we had hardware-level atomic operations, and if we thought of sorting purely in terms of comparing one value to another we’d miss out on the fastest sorting algorithm out there, Radix sort.

Radix sort doesn’t get implemented very often because it either requires a tonne of memory, or the overhead of doing a census makes it useless on small lists. To put that more generally, the context of execution matters more than the requirements of the task during implementation. The simplistic approach of Tooby and Cosmides does not take that into account.
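To make the “census” point concrete, here’s a minimal sketch of a least-significant-digit radix sort on unsigned integers; it’s my own illustration, nothing to do with Tooby and Cosmides. The counting pass and the scratch buffer are exactly the overhead that makes it a poor fit for small lists:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* LSD radix sort on unsigned ints, one byte (256 buckets) at a time.
 * No comparisons anywhere: each pass is a census (count the bytes),
 * a prefix sum (turn counts into positions), and a scatter. */
void radix_sort(unsigned int *a, size_t n)
{
    unsigned int *scratch = malloc(n * sizeof *a);
    if (!scratch) return;

    for (int shift = 0; shift < 32; shift += 8) {
        size_t count[256] = {0};

        /* Census: how many keys have each byte value? */
        for (size_t i = 0; i < n; i++)
            count[(a[i] >> shift) & 0xFF]++;

        /* Prefix sum: where does each bucket start in the output? */
        size_t pos[256], total = 0;
        for (int b = 0; b < 256; b++) {
            pos[b] = total;
            total += count[b];
        }

        /* Scatter the keys into the scratch buffer, stably. */
        for (size_t i = 0; i < n; i++)
            scratch[pos[(a[i] >> shift) & 0xFF]++] = a[i];

        memcpy(a, scratch, n * sizeof *a);
    }
    free(scratch);
}

int main(void)
{
    unsigned int data[] = {170, 45, 75, 90, 802, 24, 2, 66};
    size_t n = sizeof data / sizeof data[0];

    radix_sort(data, n);
    for (size_t i = 0; i < n; i++)
        printf("%u ", data[i]);
    printf("\n");
    return 0;
}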

We can throw them a lifeline, mind you. I formed a hypothesis about computing inverse square roots, refuted it, and now I’m wiser for it. Isn’t that still a net win for the process? Notice a key difference, though: we only became wiser because we could look at the source code. If FastInvSqrt() were instead a black box, the only way I could refute my analysis would be to propose the exact way the algorithm worked and then demonstrate that it consistently predicted the outputs much better. If I didn’t know the techniques used in FastInvSqrt() were possible, I’d never be able to do that.

Worse, I might falsely conclude I was right. After all, the outputs of my analysis and FastInvSqrt() are very similar, so I could easily wave away the differences as due to a buggy square root function or a flaw in the division routine. This is especially dangerous with evolutionary algorithms, as we saw with Dr. Adrian Thompson’s evolved circuit in an earlier installment, because the odds of us knowing every possible trick are slim.

In sum, this analysis method is primed to generate smug over-confidence in your theories.

Each organ in the body evolved to serve a function: The intestines digest, the heart pumps blood, and the liver detoxifies poisons. The brain’s evolved function is to extract information from the environment and use that information to generate behavior and regulate physiology. Hence, the brain is not just like a computer. It is a computer—that is, a physical system that was designed to process information. Its programs were designed not by an engineer, but by natural selection, a causal process that retains and discards design features based on how well they solved adaptive problems in past environments.[1]

And is my appendix’s function to randomly attempt to kill me? The only people I’ve seen push this biological teleology are creationists who propose an intelligent designer. Few people well studied in biology would buy this line.

But getting back to my field, notice the odd dichotomy at play here: our brains are super-sophisticated computational devices, but not sophisticated enough to re-program themselves on the fly. Yet even the most primitive computers we’ve developed can modify the code they’re running, as they’re running it. Why isn’t that an option? Why can’t we be as much of a blank slate as forty-year-old computer chips?
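If self-modifying code sounds exotic, here’s a toy sketch of my own, a tiny stored-program interpreter whose “opcodes” I made up for the occasion. The program lives in the same memory it can write to, so one instruction happily rewrites a later one while the program is running:

#include <stdio.h>

enum { HALT, PRINT, STORE };   /* STORE addr value: mem[addr] = value */

int main(void)
{
    /* Each instruction occupies three cells: opcode, operand 1, operand 2.
     * The program and its data share the same array. */
    int mem[] = {
        PRINT, 1, 0,        /* cell 0: print 1                           */
        STORE, 7, 42,       /* cell 3: overwrite cell 7 (the operand of  */
                            /*         the next PRINT) with 42           */
        PRINT, 2, 0,        /* cell 6: would print 2, now prints 42      */
        HALT,  0, 0
    };

    for (int ip = 0; mem[ip] != HALT; ip += 3) {
        switch (mem[ip]) {
        case PRINT: printf("%d\n", mem[ip + 1]);    break;
        case STORE: mem[mem[ip + 1]] = mem[ip + 2]; break;
        }
    }
    return 0;   /* output: 1, then 42, because the code rewrote itself */
}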

It’s tempting to declare that we’re more primitive than they are, computationally, but there’s a fundamental problem here: algorithms are algorithms are algorithms. If you can compute, you’re a Turing machine of some sort. There is no such thing as a “primitive” computer; at best you could argue some computers have more limitations imposed on them than others.

Human beings can compute, as anyone who’s taken a math course can attest. Ergo, we must be something like a Turing machine. Is it possible that our computation is split up into programs, which themselves change only slowly? Sure, but that’s an extra limitation imposed on our computability. It should not be assumed a priori.

[Part 5]


[1] Tooby, John, and Leda Cosmides. “Conceptual Foundations of Evolutionary Psychology.” The Handbook of Evolutionary Psychology (2005): 5–67.

A Computer Scientist Reads EvoPsych, Part 3

[Part 2]

As a result of selection acting on information-behavior relationships, the human brain is predicted to be densely packed with programs that cause intricate relationships between information and behavior, including functionally specialized learning systems, domain-specialized rules of inference, default preferences that are adjusted by experience, complex decision rules, concepts that organize our experiences and databases of knowledge, and vast databases of acquired information stored in specialized memory systems—remembered episodes from our lives, encyclopedias of plant life and animal behavior, banks of information about other people’s proclivities and preferences, and so on. All of these programs and the databases they create can be called on in different combinations to elicit a dazzling variety of behavioral responses.[1]

“Program?” “Database?” What exactly do those mean? That might seem like a strange question to hear from a computer scientist, but my training makes me acutely aware of how flexible those terms can be.

What is False?

John Oliver weighed in on the replication crisis, and I think he did a great job. I’d have liked a bit more on university press departments, who can write misleading press releases that journalists jump on, but he did have to simplify things for a lay audience.

It got me thinking about what “false” means, though. “True” is usually defined as “in line with reality,” so “false” should mean “not in line with reality,” the precise complement.

But don’t think about this in terms of a single observation; think of many data points applied to a specific hypothesis. Suppose we analyze that data, and find that all but a few datapoints are predicted by the hypothesis we’re testing. Does this mean the hypothesis is false, since it isn’t in line with reality in all cases, or true, because it’s more in line with reality than not? Falsification argues that it is false, and exploits that to come up with this epistemology:

  1. Gather data.
  2. Is that data predicted by the hypothesis? If so, repeat step 1.
  3. If not, replace this hypothesis with another that predicts all the data we’ve seen so far, and repeat step 1.

That’s what I had in mind when I said that frequentism works on streams of hypotheses, hopping from one “best” hypothesis to the next. The addition of time changes the original definitions slightly, so that “true” really means “in line with reality in all instances” while “false” means “in at least one instance, it is not in line with reality.”

Notice the asymmetry, though. A hypothesis has to reach a pretty high bar to be considered “true,” while “false” hypotheses range from “in line with reality, with one exception” to “never in line with reality.” Some of those “false” hypotheses are actually quite valuable to us, as John Oliver’s segment demonstrates. He never explains what “statistical significance” means, for instance, but later uses “significance” in the “effect size” sense. That will mislead most of the audience away from the reality of the situation, and taken absolutely it makes his segment “false.” Nonetheless, that segment was a net positive at getting people to understand and care about the replication crisis, so labeling it “false” is a disservice.

We need something fuzzier than the strict binary of falsification. What if we didn’t complement “true” in the set-theory sense, but in the definitional sense? Let “true” remain “in line with reality in all instances,” but change “false” from “in at least one instance, it is not in line with reality” to “never in line with reality.” This creates a gap, though: that hypothesis from earlier is neither “true” nor “false,” as it isn’t true in all cases nor false in all. It must be in a third category, as part of some sort of paraconsistent logic.

This is where the Bayesian interpretation of statistics comes in: it deliberately disclaims an absolute “true” or “false” label for descriptions of the world, instead holding those up as two ends of a continuum. Every hypothesis sits in the third category in between, hoping that future data will reveal it’s closer to one end of the continuum or the other.
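To make that continuum concrete, here’s a small sketch of my own (not anything from the sources above): two hypotheses about a coin, updated flip by flip with Bayes’ rule. Neither ever gets stamped “true” or “false”; its probability just drifts towards one end of the continuum as the data piles up.

#include <stdio.h>

int main(void)
{
    double p_fair   = 0.5;           /* prior: P(the coin is fair)         */
    double p_biased = 0.5;           /* prior: P(the coin lands heads 70%) */
    int flips[] = {1, 1, 0, 1, 1, 1, 0, 1, 1, 1};   /* 1 = heads */

    for (int i = 0; i < 10; i++) {
        /* How likely was this flip under each hypothesis? */
        double like_fair   = 0.5;                   /* fair coin: 0.5 either way */
        double like_biased = flips[i] ? 0.7 : 0.3;

        /* Bayes' rule, then renormalize so the two probabilities sum to one. */
        double num_fair   = like_fair   * p_fair;
        double num_biased = like_biased * p_biased;
        p_fair   = num_fair   / (num_fair + num_biased);
        p_biased = num_biased / (num_fair + num_biased);

        printf("after flip %2d: P(fair) = %.3f, P(biased) = %.3f\n",
               i + 1, p_fair, p_biased);
    }
    return 0;
}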

I think it’s a neat way to view the Bayesian/frequentist debate: as a mere disagreement over what “false” means.

A Computer Scientist Reads EvoPsych, Part 2

[Part 1]

the concept of “learning” within the Standard Social Science Model itself tacitly invokes unbounded rationality, in that learning is the tendency of the general-purpose, equipotential mind to grow—by an unspecified and undiscovered computational means—whatever functional information-processing abilities it needs to serve its purposes, given time and experience in the task environment.

Evolutionary psychologists depart from fitness teleologists, traditional economists (but not neuroeconomists), and blank-slate learning theorists by arguing that neither human engineers nor evolution can build a computational device that exhibits these forms of unbounded rationality, because such architectures are impossible, even in principle (for arguments, see Cosmides & Tooby, 1987; Symons 1989, 1992; Tooby & Cosmides, 1990a, 1992).[1]

Yeah, these people don’t know much about computer science.

You can divide the field of “artificial” intelligence into two basic approaches. The top-down approach outlines modular code routines like “recognize faces,” then breaks those down into sub-tasks like “look for eyes” and “find mouths.” By starting at a high level and dividing things into neat, tidy sub-programs, we can chain them together and create a greater whole.

It’s never worked all that well, at least for real-life problems. Take Cyc, the best example I can think of. It takes basic facts about the world, like “water is wet” or “rain is water,” and uses a simple set of rules to query these facts (“is rain wet?”). What it can’t do is make guesses (“are clouds wet?”), nor discover new facts on its own, nor handle anything but simple text. Thirty years and millions of dollars haven’t made a dent in those problems.

Meanwhile, the graphics card manufacturer NVidia is betting the farm on something called “deep learning,” one of several “bottom-up” approaches. You present the algorithm with an image (or a sound file, or some other object; the number of dimensions is easily changed), and it maps it onto a grid of cells. You toss a slightly smaller grid of cells on top of that, and for each new cell you calculate a weighted sum of the nearby values in the previous grid, with weights that are random to start off with. Repeat this several times and you’ll wind up with a single cell at the end. Assign this cell to an output, say “person,” then rewind all the way back to the start and repeat until you have enough single cells to handle every possible output. All of these single cells have a value associated with them, so that “person” cell might give the image 0.7 “person”s. Having cataloged what’s in the image already, you know there’s actually 1.0 “person” there, so you propagate that information back down the chain: weights that pushed towards “person” are increased, while those that pushed away are decreased. Do this right down to the bottom, for every cell along the way, then repeat the process for a new image.
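To make that loop a little more concrete, here’s a minimal sketch of my own, a single “cell” rather than a deep stack of grids, and nothing NVidia actually ships: a weighted sum squashed into the range 0 to 1, compared against the known answer, with the weights nudged after every example. The tiny OR-gate “dataset” is just an illustrative stand-in.

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* Four training examples: two inputs and the answer we already know
     * (logical OR), standing in for "is there a person in this image?". */
    double inputs[4][2] = {{0,0}, {0,1}, {1,0}, {1,1}};
    double target[4]    = { 0,     1,     1,     1   };

    double w[2] = {0.1, -0.1};   /* weights start out roughly random */
    double bias = 0.0;
    double rate = 0.5;           /* how hard each nudge is           */

    for (int epoch = 0; epoch < 5000; epoch++) {
        for (int i = 0; i < 4; i++) {
            /* Forward pass: weighted sum, squashed into (0,1). */
            double sum = w[0]*inputs[i][0] + w[1]*inputs[i][1] + bias;
            double out = 1.0 / (1.0 + exp(-sum));

            /* Backward pass: nudge the weights towards the known answer. */
            double error = target[i] - out;
            double grad  = error * out * (1.0 - out);
            w[0] += rate * grad * inputs[i][0];
            w[1] += rate * grad * inputs[i][1];
            bias += rate * grad;
        }
    }

    for (int i = 0; i < 4; i++) {
        double sum = w[0]*inputs[i][0] + w[1]*inputs[i][1] + bias;
        printf("%g OR %g -> %.2f\n",
               inputs[i][0], inputs[i][1], 1.0 / (1.0 + exp(-sum)));
    }
    return 0;
}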

It’s loosely patterned after how our own neurons are laid out. Biology is a bit more liberal with how it connects, but this structure has the virtue of being easy to calculate and massively parallel, quite convenient for a company which manufactures processors that specialize in massively parallel computations. NVidia’s farm-betting comes from the fact that it’s wildly successful; all of the best image recognition algorithms follow the deep-learning pattern, and their success rates are not only impressive but also resemble our own.[2]

Heard of the AI that could play Atari games? Emphasis mine:

Our [Deep action-value Network or DQN] method outperforms the best existing reinforcement learning methods on 43 games without incorporating any of the additional prior knowledge about Atari 2600 games used by other approaches … . Furthermore, our DQN agent performed at a level that was comparable to that of a professional human games tester across the set of 49 games, achieving more than 75% of the human score on more than half of the games […]

Indeed, in certain games DQN is able to discover a relatively long-term strategy (for example, Breakout: the agent learns the optimal strategy, which is to first dig a tunnel around the side of the wall allowing the ball to be sent around the back to destroy a large number of blocks; …). […]

In this work, we demonstrate that a single architecture can successfully learn control policies in a range of different environments with only very minimal prior knowledge, receiving only the pixels and the game score as inputs, and using the same algorithm, network architecture and hyperparameters on each game, privy only to the inputs a human player would have.[3]

This deep learning network has no idea what a video game is, nor is it permitted to peek at the innards of the game itself, yet it can not only learn to play these games at the same level as human beings, it can also develop non-trivial solutions to them. You can’t get more “blank slate” than that.

This basic pattern has repeated multiple times over the decades. Neural nets aren’t as zippy as the new kid on the “bottom-up” block, yet they too have had great success where the modular top-down approach has failed miserably. I haven’t worked with either technology, but I’ve worked with something that’s related: genetic algorithms. Represent your solutions in a sort of genome, come up with a fitness metric for them, then mutate or randomly construct those genomes and keep the fittest ones in the pool until you’ve tried every possibility, or you get bored. Two separate runs might converge to the same solution, or they might not. A lot depends on the “fitness landscape” they occupy, which you can visualize as a 3D terrain map with height representing how “fit” something is.

A visualization of three "evolutionary fitness landscapes," ranging from simple to complex to SUPER complex.

That landscape probably has more than three dimensions, but those aren’t as easy to visualize and they behave very similarly to the 3D case. The terrain might be a Mount Fuji with a single solution at the top of a fitness peak, or a Himalayas with many peak solutions scattered about but a single tallest one standing above them, or rolling foothills where solutions are aplenty but the best one is tough to find.
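For the curious, here’s a minimal sketch of that evolutionary loop, using my own toy setup rather than anything from a real lab: the “genome” is a 32-bit integer, the fitness metric is simply how many bits are set, and each generation keeps the fittest genome and refills the pool with mutated copies of it. The landscape here is a deliberately boring, single-peaked one.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define POP         20
#define GENES       32
#define GENERATIONS 200

/* Fitness: how many bits of the genome are set; the peak is all ones. */
static int fitness(unsigned int genome)
{
    int count = 0;
    for (int i = 0; i < GENES; i++)
        count += (genome >> i) & 1u;
    return count;
}

int main(void)
{
    unsigned int pool[POP];
    srand((unsigned)time(NULL));

    /* Randomly construct the starting genomes. */
    for (int i = 0; i < POP; i++)
        pool[i] = (unsigned int)rand();

    for (int gen = 0; gen < GENERATIONS; gen++) {
        /* Find the fittest genome in the pool... */
        int best = 0;
        for (int i = 1; i < POP; i++)
            if (fitness(pool[i]) > fitness(pool[best]))
                best = i;

        /* ...and refill the pool with mutated copies of it. */
        unsigned int parent = pool[best];
        for (int i = 0; i < POP; i++) {
            pool[i] = parent;
            if (i != 0)                                  /* keep one exact copy */
                pool[i] ^= 1u << (rand() % GENES);       /* flip one random bit */
        }

        if (gen % 50 == 0)
            printf("generation %3d: best fitness %2d/%d\n",
                   gen, fitness(parent), GENES);
    }
    return 0;
}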

All of these take the “bottom-up” approach, the opposite of the “top-down” one, and work up from very small components towards a high-level goal. The path there is rarely known in advance, so the system “feels” its way towards it via evolutionary algorithms.

That path may not go the way you expect, however. Take the case of a researcher, Dr. Adrian Thompson, who used an evolutionary algorithm to find the smallest circuit, on a reprogrammable chip, that could tell the difference between two tones.

Finally, after just over 4,000 generations, the test system settled upon the best program. When Dr. Thompson played the 1kHz tone, the microchip unfailingly reacted by decreasing its power output to zero volts. When he played the 10kHz tone, the output jumped up to five volts. He pushed the chip even farther by requiring it to react to vocal “stop” and “go” commands, a task it met with a few hundred more generations of evolution. As predicted, the principle of natural selection could successfully produce specialized circuits using a fraction of the resources a human would have required. And no one had the foggiest notion how it worked.

Dr. Thompson peered inside his perfect offspring to gain insight into its methods, but what he found inside was baffling. The plucky chip was utilizing only thirty-seven of its one hundred logic gates, and most of them were arranged in a curious collection of feedback loops. Five individual logic cells were functionally disconnected from the rest— with no pathways that would allow them to influence the output— yet when the researcher disabled any one of them the chip lost its ability to discriminate the tones. Furthermore, the final program did not work reliably when it was loaded onto other FPGAs of the same type.

It seems that evolution had not merely selected the best code for the task, it had also advocated those programs which took advantage of the electromagnetic quirks of that specific microchip environment. The five separate logic cells were clearly crucial to the chip’s operation, but they were interacting with the main circuitry through some unorthodox method— most likely via the subtle magnetic fields that are created when electrons flow through circuitry, an effect known as magnetic flux. There was also evidence that the circuit was not relying solely on the transistors’ absolute ON and OFF positions like a typical chip; it was capitalizing upon analogue shades of gray along with the digital black and white.[4]

Evolutionary approaches are very simple and require no understanding or insight into the problem you’re solving, but they usually require ridiculous amounts of computation or training merely to keep pace with the top-down “modular” approach. The fitness function may lead to a solution much too complicated for you to understand, or much too fragile to operate anywhere but where it was generated. But the bottom-up approach may be your only choice for certain problems.

The moral of the story: the ability to do complex calculation can be built up from a blank slate, in principle and in practice. When we follow the bottom-up approach we tend to get results that more closely mirror biology than when we work from the top down and modularize, though this is less insightful than it first appears. Nearly all bottom-up approaches take direct inspiration from biology, whereas top-down approaches owe more to Plato than Aristotle.

Biology prefers the blank slate.

[Part 3]


[1] Tooby, John, and Leda Cosmides. “Conceptual Foundations of Evolutionary Psychology.” The Handbook of Evolutionary Psychology (2005): 5–67.

[2] Kheradpisheh, Saeed Reza, et al. “Deep Networks Resemble Human Feed-forward Vision in Invariant Object Recognition.” arXiv preprint arXiv:1508.03929 (2015).

[3] Mnih, Volodymyr, et al. “Human-level control through deep reinforcement learning.” Nature 518.7540 (2015): 529-533.

[4] Bellows, Alan. “On the Origin of Circuits.” Damn Interesting. Accessed May 4, 2016.

A Computer Scientist Reads EvoPsych, Part 1

Computer Science is weird. Most of the papers published in my field look like this:

We describe the architecture of a novel system for precomputing sparse directional occlusion caches. These caches are used for accelerating a fast cinematic lighting pipeline that works in the spherical harmonics domain. The system was used as a primary lighting technology in the movie Avatar, and is able to efficiently handle massive scenes of unprecedented complexity through the use of a flexible, stream-based geometry processing architecture, a novel out-of-core algorithm for creating efficient ray tracing acceleration structures, and a novel out-of-core GPU ray tracing algorithm for the computation of directional occlusion and spherical integrals at arbitrary points.[1]

A speed improvement of two orders of magnitude is pretty sweet, but this paper really isn’t about computers per se; it’s about applying existing concepts in computer graphics in a novel combination to solve a practical problem. Most papers are all about the application of computers, and not computing itself. You can dig up examples of the latter, like if you try searching for concurrency theory,[2] but even then you’ll run across a lot of applied work, like articles on sorting algorithms designed for graphics cards.[3]

In sum, computer scientists spend most of their time working in other people’s fields, solving other people’s problems. So you can imagine my joy when I stumbled on people in other fields invoking computer science.

Because the evolved function of a psychological mechanism is computational—to regulate behavior and the body adaptively in response to informational inputs—such a model consists of a description of the functional circuit logic or information processing architecture of a mechanism (Cosmides & Tooby, 1987; Tooby & Cosmides, 1992). Eventually, these models should include the neural, developmental, and genetic bases of these mechanisms, and encompass the designs of other species as well.[4]

Hot diggity! How well does that non-computer-scientist understand the field, though? Let’s put my degree to work.

The second building block of evolutionary psychology was the rise of the computational sciences and the recognition of the true character of mental phenomena. Boole (1848) and Frege (1879) formalized logic in such a way that it became possible to see how logical operations could be carried out mechanically, automatically, and hence through purely physical causation, without the need for an animate interpretive intelligence to carry out the steps. This raised the irresistible theoretical possibility that not only logic but other mental phenomena such as goals and learning also consisted of formal relationships embodied nonvitalistically in physical processes (Weiner, 1948). With the rise of information theory, the development of the first computers, and advances in neuroscience, it became widely understood that mental events consisted of transformations of structured informational relationships embodied as aspects of organized physical systems in the brain. This spreading appreciation constituted the cognitive revolution. The mental world was no longer a mysterious, indefinable realm, but locatable in the physical world in terms of precisely describable, highly organized causal relations.[4]

Yes! I’m right with you here. One of the more remarkable findings of computer science is that every computation or algorithm can be executed on a Turing machine. That includes all of Quantum Field Theory, even those Quant-y bits. While QFT isn’t a complete theory, we’re extremely confident in the subset we need to simulate neural activity. Those simulations have long since been run and matched against real-world data; the current problem is scaling up from faking a million neurons at a time to faking 100 billion, about as many as you have locked in your skull.

Our brains can be perfectly simulated by a computational device, and our brains’ ability to do math shows they are computational. I can quibble a bit about the wording (“precisely describable” suggests we’ve already faked those 100 billion), but we’re off to a great start here.

After all, if the human mind consists primarily of a general capacity to learn, then the particulars of the ancestral hunter-gatherer world and our prehuman history as Miocene apes left no interesting imprint on our design. In contrast, if our minds are collections of mechanisms designed to solve the adaptive problems posed by the ancestral world, then hunter-gatherer studies and primatology become indispensable sources of knowledge about modern human nature.[4]

Wait, what happened to the whole “our brains are computers” thing? Look, here’s a diagram of a Turing machine.

An annotated diagram of a Turing machine. Copyright HJ Hornbeck 2016, CC-BY-SA 4.0.

A read/write head sits somewhere along an infinite ribbon of tape. It reads what’s under the head, writes back a value to that location, and moves either left or right, all based on that value and the machine’s internal state. How does it know what to do? Sometimes that’s hard-wired into the machine, but more commonly it reads instructions right off the tape. These machines don’t ship with much of anything, just the bare minimum necessary to do every possible computation.
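Here’s roughly what that looks like in code, as a toy sketch of my own rather than anything from Tooby and Cosmides: a tape, a head, and a transition table saying “in this state, reading this symbol: write that, move, switch to that state.” This particular machine adds one to a binary number, with the head starting on the rightmost digit.

#include <stdio.h>

enum state { CARRY, DONE };

struct rule {
    enum state next;
    char       write;
    int        move;     /* -1 = left, +1 = right, 0 = stay */
};

/* Transition table for the CARRY state, indexed by the symbol read. */
static struct rule step(char read)
{
    switch (read) {
    case '1': return (struct rule){ CARRY, '0', -1 };  /* carry ripples left */
    case '0': return (struct rule){ DONE,  '1',  0 };  /* absorb the carry   */
    default:  return (struct rule){ DONE,  '1',  0 };  /* ran off the number */
    }
}

int main(void)
{
    char tape[32] = "_1011_";            /* blanks pad the "infinite" tape */
    int  head  = 4;                      /* start on the rightmost digit   */
    enum state s = CARRY;

    while (s != DONE) {
        struct rule r = step(tape[head]);
        tape[head] = r.write;            /* write back to the tape */
        head      += r.move;             /* move the head          */
        s          = r.next;             /* switch internal state  */
    }
    printf("%s\n", tape);                /* prints _1100_: 11 + 1 = 12 */
    return 0;
}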

This carries over into physical computers as well; when the CPU of your computer boots up, it does the following:

  1. Read the instruction at memory location 4,294,967,280.
  2. Execute it.

Your CPU does have “programs” of a sort, instructions such as ADD (addition) or MULT (multiply), but removing them doesn’t remove its ability to compute. All of those extras can be duplicated by grouping together simpler operations; they’re only there to make programmers’ lives easier.
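As a quick illustration of that point (my own sketch, not any particular CPU’s microcode), here’s multiplication rebuilt out of nothing but shifts, additions, and comparisons:

#include <stdio.h>

/* Shift-and-add multiplication: the same trick hardware multipliers use. */
unsigned int mult(unsigned int a, unsigned int b)
{
    unsigned int product = 0;
    while (b != 0) {
        if (b & 1u)          /* low bit of b set: add this shifted copy */
            product += a;
        a <<= 1;             /* the next bit of b is worth twice as much */
        b >>= 1;
    }
    return product;
}

int main(void)
{
    printf("%u\n", mult(1234, 5678));   /* 7006652, same as 1234 * 5678 */
    return 0;
}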

There’s no programmer for the human brain, though. Despite what The Matrix told you, no-one can fiddle around with your microcode and add new programs. There is no need for helper instructions. So if human brains are like computers, and computers are blank slates for the most part, we have a decent reason to think humans are blank slates too, infinitely flexible and fungible.

[Part 2]


[1] Pantaleoni, Jacopo, et al. “PantaRay: fast ray-traced occlusion caching of massive scenes.” ACM Transactions on Graphics (TOG). Vol. 29. No. 4. ACM, 2010.

[2] Roscoe, Bill. “The theory and practice of concurrency.” (1998).

[3] Ye, Xiaochun, et al. “High performance comparison-based sorting algorithm on many-core GPUs.” Parallel & Distributed Processing (IPDPS), 2010 IEEE International Symposium on. IEEE, 2010.

[4] Tooby, John, and Leda Cosmides. “Conceptual Foundations of Evolutionary Psychology.” The Handbook of Evolutionary Psychology (2005): 5–67.

Index Post: P-values

Over the months, I’ve managed to accumulate a LOT of papers discussing p-values and their application. Rather than have them rot on my hard drive, I figured it was time for another index post.

Full disclosure: I’m not in favour of them. But I came to that by reading these papers, and seeing no effective counter-argument. So while this collection is biased against p-values, that’s no more a problem than a bias against the luminiferous aether or the theory of the four humours. And don’t worry, I’ll include a few defenders of p-values as well.

What’s a p-value?

It’s frequently used in “null hypothesis significance testing,” or NHST to its friends. A null hypothesis is one you hope to refute, preferably a fairly established one that other people accept as true. That hypothesis will predict a range of observations, some more likely than others. A p-value is simply the probability of some observed event happening, plus the probability of all events more extreme, assuming the null hypothesis is true. You can then plug that value into the following logic (a worked example follows the list):

  1. Event E, or an event more extreme, is unlikely to occur under the null hypothesis.
  2. Event E occurred.
  3. Ergo, the null hypothesis is false.
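Here’s that definition worked through with my own toy numbers: the null hypothesis is “this coin is fair,” the observed event is 16 heads in 20 flips, and the p-value is the probability, under that null, of 16 heads or anything more extreme.

#include <stdio.h>
#include <math.h>

/* P(exactly k heads in n flips | fair coin), computed via log-gamma. */
static double binom_prob(int n, int k)
{
    double log_p = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
                 + n * log(0.5);
    return exp(log_p);
}

int main(void)
{
    int n = 20, observed = 16;
    double p_value = 0.0;

    /* The observed event, plus every event more extreme than it. */
    for (int k = observed; k <= n; k++)
        p_value += binom_prob(n, k);

    printf("one-tailed p-value for %d heads in %d flips: %.4f\n",
           observed, n, p_value);              /* prints 0.0059 */
    return 0;
}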

They seem like a weird thing to get worked up about.

Significance testing is a cornerstone of modern science, and NHST is the most common form of it. A quick check of Google Scholar turns up “p-value” 3.8 million times, while its primary competitor, “Bayes factor,” shows up 250,000 times. At the same time, it’s poorly understood.

The P value is probably the most ubiquitous and at the same time, misunderstood, misinterpreted, and occasionally miscalculated index in all of biomedical research. In a recent survey of medical residents published in JAMA, 88% expressed fair to complete confidence in interpreting P values, yet only 62% of these could answer an elementary P-value interpretation question correctly. However, it is not just those statistics that testify to the difficulty in interpreting P values. In an exquisite irony, none of the answers offered for the P-value question was correct, as is explained later in this chapter.

Goodman, Steven. “A Dirty Dozen: Twelve P-Value Misconceptions.” In Seminars in Hematology, 45:135–40. Elsevier, 2008. http://www.sciencedirect.com/science/article/pii/S0037196308000620.

The consequence is an abundance of false positives in the scientific literature, leading to many failed replications and wasted resources.

Gotcha. So what do scientists think is wrong with them?

Well, th-

And make it quick, I don’t have a lot of time.

Right right, here’s the top three papers I can recommend:

Null hypothesis significance testing (NHST) is arguably the most widely used approach to hypothesis evaluation among behavioral and social scientists. It is also very controversial. A major concern expressed by critics is that such testing is misunderstood by many of those who use it. Several other objections to its use have also been raised. In this article the author reviews and comments on the claimed misunderstandings as well as on other criticisms of the approach, and he notes arguments that have been advanced in support of NHST. Alternatives and supplements to NHST are considered, as are several related recommendations regarding the interpretation of experimental data. The concluding opinion is that NHST is easily misunderstood and misused but that when applied with good judgment it can be an effective aid to the interpretation of experimental data.

Nickerson, Raymond S. “Null Hypothesis Significance Testing: A Review of an Old and Continuing Controversy.” Psychological Methods 5, no. 2 (2000): 241.

After 4 decades of severe criticism, the ritual of null hypothesis significance testing (mechanical dichotomous decisions around a sacred .05 criterion) still persists. This article reviews the problems with this practice, including near universal misinterpretation of p as the probability that H₀ is false, the misinterpretation that its complement is the probability of successful replication, and the mistaken assumption that if one rejects H₀ one thereby affirms the theory that led to the test.

Cohen, Jacob. “The Earth Is Round (p < .05).” American Psychologist 49, no. 12 (1994): 997–1003. doi:10.1037/0003-066X.49.12.997.

This chapter examines eight of the most commonly voiced objections to reform of data analysis practices and shows each of them to be erroneous. The objections are: (a) Without significance tests we would not know whether a finding is real or just due to chance; (b) hypothesis testing would not be possible without significance tests; (c) the problem is not significance tests but failure to develop a tradition of replicating studies; (d) when studies have a large number of relationships, we need significance tests to identify those that are real and not just due to chance; (e) confidence intervals are themselves significance tests; (f) significance testing ensure objectivity in the interpretation of research data; (g) it is the misuse, not the use, of significance testing that is the problem; and (h) it is futile to reform data analysis methods, so why try?

Schmidt, Frank L., and J. E. Hunter. “Eight Common but False Objections to the Discontinuation of Significance Testing in the Analysis of Research Data.” What If There Were No Significance Tests, 1997, 37–64.

OK, I have a bit more time now. What else do you have?

Using a Bayesian significance test for a normal mean, James Berger and Thomas Sellke (1987, pp. 112–113) showed that for p values of .05, .01, and .001, respectively, the posterior probabilities of the null, Pr(H₀ | x), for n = 50 are .52, .22, and .034. For n = 100 the corresponding figures are .60, .27, and .045. Clearly these discrepancies between p and Pr(H₀ | x) are pronounced, and cast serious doubt on the use of p values as reasonable measures of evidence. In fact, Berger and Sellke (1987) demonstrated that data yielding a p value of .05 in testing a normal mean nevertheless resulted in a posterior probability of the null hypothesis of at least .30 for any objective (symmetric priors with equal prior weight given to H₀ and Hₐ) prior distribution.

Hubbard, R., and R. M. Lindsay. “Why P Values Are Not a Useful Measure of Evidence in Statistical Significance Testing.” Theory & Psychology 18, no. 1 (February 1, 2008): 69–88. doi:10.1177/0959354307086923.

Because p-values dominate statistical analysis in psychology, it is important to ask what p says about replication. The answer to this question is ‘‘Surprisingly little.’’ In one simulation of 25 repetitions of a typical experiment, p varied from .44. Remarkably, the interval—termed a p interval —is this wide however large the sample size. p is so unreliable and gives such dramatically vague information that it is a poor basis for inference.

Cumming, Geoff. “Replication and p Intervals: p Values Predict the Future Only Vaguely, but Confidence Intervals Do Much Better.” Perspectives on Psychological Science 3, no. 4 (July 2008): 286–300. doi:10.1111/j.1745-6924.2008.00079.x.

Simulations of repeated t-tests also illustrate the tendency of small samples to exaggerate effects. This can be shown by adding an additional dimension to the presentation of the data. It is clear how small samples are less likely to be sufficiently representative of the two tested populations to genuinely reflect the small but real difference between them. Those samples that are less representative may, by chance, result in a low P value. When a test has low power, a low P value will occur only when the sample drawn is relatively extreme. Drawing such a sample is unlikely, and such extreme values give an exaggerated impression of the difference between the original populations. This phenomenon, known as the ‘winner’s curse’, has been emphasized by others. If statistical power is augmented by taking more observations, the estimate of the difference between the populations becomes closer to, and centered on, the theoretical value of the effect size.

Halsey, Lewis G., Douglas Curran-Everett, Sarah L. Vowler, and Gordon B. Drummond. “The Fickle P Value Generates Irreproducible Results.” Nature Methods 12, no. 3 (March 2015): 179–85. doi:10.1038/nmeth.3288.

If you use p=0.05 to suggest that you have made a discovery, you will be wrong at least 30% of the time. If, as is often the case, experiments are underpowered, you will be wrong most of the time. This conclusion is demonstrated from several points of view. First, tree diagrams which show the close analogy with the screening test problem. Similar conclusions are drawn by repeated simulations of t-tests. These mimic what is done in real life, which makes the results more persuasive. The simulation method is used also to evaluate the extent to which effect sizes are over-estimated, especially in underpowered experiments. A script is supplied to allow the reader to do simulations themselves, with numbers appropriate for their own work. It is concluded that if you wish to keep your false discovery rate below 5%, you need to use a three-sigma rule, or to insist on p≤0.001. And never use the word ‘significant’.

Colquhoun, David. “An Investigation of the False Discovery Rate and the Misinterpretation of P-Values.” Royal Society Open Science 1, no. 3 (November 1, 2014): 140216. doi:10.1098/rsos.140216.

I was hoping for something more philosophical.

The idea that the P value can play both of these roles is based on a fallacy: that an event can be viewed simultaneously both from a long-run and a short-run perspective. In the long-run perspective, which is error-based and deductive, we group the observed result together with other outcomes that might have occurred in hypothetical repetitions of the experiment. In the “short run” perspective, which is evidential and inductive, we try to evaluate the meaning of the observed result from a single experiment. If we could combine these perspectives, it would mean that inductive ends (drawing scientific conclusions) could be served with purely deductive methods (objective probability calculations).

Goodman, Steven N. “Toward Evidence-Based Medical Statistics. 1: The P Value Fallacy.” Annals of Internal Medicine 130, no. 12 (1999): 995–1004.

Overemphasis on hypothesis testing–and the use of P values to dichotomise significant or non-significant results–has detracted from more useful approaches to interpreting study results, such as estimation and confidence intervals. In medical studies investigators are usually interested in determining the size of difference of a measured outcome between groups, rather than a simple indication of whether or not it is statistically significant. Confidence intervals present a range of values, on the basis of the sample data, in which the population value for such a difference may lie. Some methods of calculating confidence intervals for means and differences between means are given, with similar information for proportions. The paper also gives suggestions for graphical display. Confidence intervals, if appropriate to the type of study, should be used for major findings in both the main text of a paper and its abstract.

Gardner, Martin J., and Douglas G. Altman. “Confidence Intervals rather than P Values: Estimation rather than Hypothesis Testing.” BMJ 292, no. 6522 (1986): 746–50.

What’s this “Neyman-Pearson” thing?

P-values were part of a method proposed by Ronald Fisher as a means of assessing evidence. Almost before the ink was dry on it, other people started poking holes in his work. Jerzy Neyman and Egon Pearson took some of Fisher’s ideas and came up with a new method, based on controlling error rates over the long run. Their method is superior, IMO, but rather than replacing Fisher’s approach it instead wound up being blended with it, ditching all the advantages to preserve the faults. This citation covers the historical background:

Huberty, Carl J. “Historical Origins of Statistical Testing Practices: The Treatment of Fisher versus Neyman-Pearson Views in Textbooks.” The Journal of Experimental Education 61, no. 4 (1993): 317–33.

The remainder describe the differences between the two methods, and possible ways to “fix” their shortcomings.

The distinction between evidence (p’s) and error (α’s) is not trivial. Instead, it reflects the fundamental differences between Fisher’s ideas on significance testing and inductive inference, and Neyman-Pearson’s views on hypothesis testing and inductive behavior. The emphasis of the article is to expose this incompatibility, but we also briefly note a possible reconciliation.

Hubbard, Raymond, and M. J. Bayarri. “Confusion Over Measures of Evidence (p’s) Versus Errors (α’s) in Classical Statistical Testing.” The American Statistician 57, no. 3 (August 2003): 171–78. doi:10.1198/0003130031856.

The basic differences are these: Fisher attached an epistemic interpretation to a significant result, which referred to a particular experiment. Neyman rejected this view as inconsistent and attached a behavioral meaning to a significant result that did not refer to a particular experiment, but to repeated experiments. (Pearson found himself somewhere in between.)

Gigerenzer, Gerd. “The Superego, the Ego, and the Id in Statistical Reasoning.” A Handbook for Data Analysis in the Behavioral Sciences: Methodological Issues, 1993, 311–39.

This article presents a simple example designed to clarify many of the issues in these controversies. Along the way many of the fundamental ideas of testing from all three perspectives are illustrated. The conclusion is that Fisherian testing is not a competitor to Neyman-Pearson (NP) or Bayesian testing because it examines a different problem. As with Berger and Wolpert (1984), I conclude that Bayesian testing is preferable to NP testing as a procedure for deciding between alternative hypotheses.

Christensen, Ronald. “Testing Fisher, Neyman, Pearson, and Bayes.” The American Statistician 59, no. 2 (2005): 121–26.

C’mon, there aren’t any people defending the p-value?

Sure there are. They fall into two camps: “deniers,” a small group that insists there’s nothing wrong with p-values, and the much more common “fixers,” who propose making up for the shortcomings by augmenting NHST. Since a number of fixers have already been cited, I’ll just focus on the deniers here.

On the other hand, the propensity to misuse or misunderstand a tool should not necessarily lead us to prohibit its use. The theory of estimation is also often misunderstood. How many epidemiologists can explain the meaning of their 95% confidence interval? There are other simple concepts susceptible to fuzzy thinking. I once quizzed a class of epidemiology students and discovered that most had only a foggy notion of what is meant by the word “bias.” Should we then abandon all discussion of bias, and dumb down the field to the point where no subtleties need trouble us?

Weinberg, Clarice R. “It’s Time to Rehabilitate the P-Value.” Epidemiology 12, no. 3 (2001): 288–90.

The solution is simple and practiced quietly by many researchers—use P values descriptively, as one of many considerations to assess the meaning and value of epidemiologic research findings. We consider the full range of information provided by P values, from 0 to 1, recognizing that 0.04 and 0.06 are essentially the same, but that 0.20 and 0.80 are not. There are no discontinuities in the evidence at 0.05 or 0.01 or 0.001 and no good reason to dichotomize a continuous measure. We recognize that in the majority of reasonably large observational studies, systematic biases are of greater concern than random error as the leading obstacle to causal interpretation.

Savitz, David A. “Commentary: Reconciling Theory and Practice.” Epidemiology 24, no. 2 (March 2013): 212–14. doi:10.1097/EDE.0b013e318281e856.

The null hypothesis can be true because it is the hypothesis that errors are randomly distributed in data. Moreover, the null hypothesis is never used as a categorical proposition. Statistical significance means only that chance influences can be excluded as an explanation of data; it does not identify the nonchance factor responsible. The experimental conclusion is drawn with the inductive principle underlying the experimental design. A chain of deductive arguments gives rise to the theoretical conclusion via the experimental conclusion. The anomalous relationship between statistical significance and the effect size often used to criticize NHSTP is more apparent than real.

Hunter, John E. “Testing Significance Testing: A Flawed Defense.” Behavioral and Brain Sciences 21, no. 02 (April 1998): 204–204. doi:10.1017/S0140525X98331167.

The Monty Hall Problem, or When the Obvious Isn’t

“You blew it, and you blew it big! Since you seem to have difficulty grasping the basic principle at work here, I’ll explain. After the host reveals a goat, you now have a one-in-two chance of being correct. Whether you change your selection or not, the odds are the same. There is enough mathematical illiteracy in this country, and we don’t need the world’s highest IQ propagating more. Shame!”
– Scott Smith, Ph.D. University of Florida

There was a rather unusual convergence of feminism and skepticism in 1991. Over a thousand people with PhDs, and a few Nobel Prize winners, sent angry letters to a puzzle columnist, insisting her answer to a particular puzzle was wrong. More than a few flashed their academic credentials as evidence in their favor, providing an excellent example of the Argument from Authority. Most seemed to ignore that the author had one of the highest recorded IQs in the world and talked down to her, providing an excellent example of how women’s credentials are frequently undervalued.

The flashpoint for it all was the Monty Hall problem. Here’s how Marilyn vos Savant originally described the problem:

Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what’s behind the doors, opens another door, say No. 3, which has a goat. He then says to you, “Do you want to pick door No. 2?” Is it to your advantage to switch your choice?

This wasn’t her invention, but nearly all the people who presented it before were men and I can’t find any evidence they were as heavily doubted. Even a few women shamed vos Savant for coming to the “incorrect” answer, that it’s to your advantage to switch.

On first blush, that does seem incorrect. One door has been removed, leaving two behind and only one with a prize. So there’s a 50/50 chance you picked the right door, right? Even a computer simulation seems to agree.

Out of 12 scenarios, you were better off staying put in 6 of them,
 and switching in 6 of them. So you'll win 50.000000% of the time if you
 stay put, and 50.000000% of the time if you switch.

Out of 887746 trials, you were better off staying put in 444076 of them,
 and switching in 443670 of them. So you'll win 50.022865% of the time if you
 stay put, and 49.977135% of the time if you switch.

But there’s a simple flaw hidden here.

A table of all the outcomes in the Monty Hall problem.

You can pick one of three doors, and you have a one in three chance of picking the door with the car. On those lucky occasions, Monty Hall opens one of the other two doors. Which one doesn’t matter, because no matter which he picks your best option is to keep the door you have.

Two-thirds of the time, you’ve picked a door with a goat behind it. Hall can’t open your door, and he can’t open the door with the car behind it, so he’s forced to reveal the remaining goat. If you switch, the only door left to switch to is the one with the car. If you stay, you’ve lost.

So two-thirds of the time you should switch, while one-third of the time you’re better off staying put. If you switch all the time, your expected earnings are two-thirds of a car, if you stay put all the time it’s a third of a car, and any strategy that bounces between the two choices (save cheating) will pay off somewhere in between.

In short, always switch.

Still don’t believe me? I’ll modify one of vos Savant’s demonstrations, and show you how to verify this with a deck of cards. Toss out any jokers, then pull out just a single suit of cards and leave the bulk of the deck behind. Grab some way to track the score, while you’re at it, and a coin.

  1. Shuffle the 13 cards well.
  2. Deal out three cards in a horizontal line, face-up. In this game you’re Monty Hall, so there’s no need to hide things.
  3. Find the card with the lowest face value, as that one has the car.
  4. The player always picks the leftmost door.
  5. If they didn’t pick the lowest card, “show” them the other losing door by flipping it over. If they did, toss the coin to determine which of the two losing cards you’ll “show.”
  6. Mark down which strategy wins this round, and gather up the cards.
  7. Repeat from step one until bored or enlightened.

You’ll quickly realize that Hall’s precise choice is irrelevant, and start looping back after step four. After twenty or thirty rounds, you should see the “switch” strategy is superior. If you think I’m cheating by always having the player choose a specific door, you can easily modify the game to be two-player; if you’re suspicious of the thirteen card thing, use three (but shuffle really carefully).

So why did the program give the incorrect answers?

The probability space of the Monty Hall problem; note that not all outcomes are equally likely.

While there are four distinct outcomes, they do not carry the same odds of happening. When you pick the door with the car, the two choices Hall can make occupy a third of the probability space in total, whereas the two instances where Hall has no choice occupy two-thirds of the space. If you’re not careful, you can give all of them equal weight and falsely conclude that neither strategy has an advantage. If you are careful, you get the proper results.

Out of 9 scenarios, you were better off staying put in 3 of them,
 and switching in 6 of them. So you'll win 33.333332% of the time if you
 stay put, and 66.666664% of the time if you switch.

Out of 2000000 trials, you were better off staying put in 666131 of them,
 and switching in 1333869 of them. So you'll win 33.306549% of the time if you
 stay put, and 66.693451% of the time if you switch.
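Here’s a sketch of how such a simulation can weight the outcomes properly; it’s my own reconstruction of the idea, not the program that produced the outputs above. Monty is forced to open a door that is neither the player’s nor the car’s, and the tally simply records which strategy would have won each trial.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    const int trials = 2000000;
    int stay_wins = 0, switch_wins = 0;
    srand((unsigned)time(NULL));

    for (int t = 0; t < trials; t++) {
        int car    = rand() % 3;      /* door hiding the car     */
        int choice = rand() % 3;      /* the player's first pick */

        /* Monty opens a door that is neither the player's nor the car's;
         * when the player picked the car, his two options are equally likely. */
        int monty;
        do {
            monty = rand() % 3;
        } while (monty == choice || monty == car);

        /* Switching means taking the one unopened door that's left. */
        int switched = 3 - choice - monty;

        if (choice == car)   stay_wins++;
        if (switched == car) switch_wins++;
    }

    printf("stay:   %.1f%%\n", 100.0 * stay_wins / trials);
    printf("switch: %.1f%%\n", 100.0 * switch_wins / trials);
    return 0;
}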

In hindsight the solution seems obvious enough, but this problem is unusually unintuitive.

Piattelli-Palmarini remarked (see vos Savant, 1997, p. 15): “No other statistical puzzle comes so close to fooling all the people all the time. […] The phenomenon is particularly interesting precisely because of its specificity, its reproducibility, and its immunity to higher education.” He went on to say “even Nobel physicists systematically give the wrong answer, and […] insist on it, and are ready to berate in print those who propose the right answer.” In his book Inevitable illusions: How mistakes of reason rule our minds (1994), Piattelli-Palmarini singled out the Monty Hall problem as the most expressive example of the “cognitive illusions” or “mental tunnels” in which “even the finest and best-trained minds get trapped” (p. 161). [1]

Human beings are only approximately rational; it’s terribly easy to fall for logical fallacies, and act in sexist ways without realizing it. Know thyself, and know how you’ll likely fail.

[1] Krauss, Stefan, and Xiao-Tian Wang. “The psychology of the Monty Hall problem: discovering psychological mechanisms for solving a tenacious brain teaser.” Journal of Experimental Psychology: General 132.1 (2003): 3.

Dictionary Atheism and Morality

I’m quite late to the party, I see. Hopefully I can make up for it with a slightly different angle.

There is no shortage of atheists who fetishize the dictionary. “It’s just a lack of belief, nothing more!” they cry, “there’s no moral code attached to it!”

Bullshit. If there is no moral system, why then are dictionary atheists so insistent on being atheist?

Moral codes are prescriptive, while assertions and bare facts are descriptive. One tells us how the world ought to behave, the others how the world is or might be. This can get confusing, I’ll admit. Science is supposed to be in the “descriptive” bin, yet scientists make predictions about how the world ought to behave. That sounds very prescriptive, but what happens when reality and your statement conflict? Say I calculate the trajectory of an asteroid via Newtonian Mechanics, but observe it wandering off my predicted path. Which of the two must change to resolve the contradiction, reality or Newtonian Mechanics? Surely the latter, and that reveals it and similar scientific laws to be descriptive: if the description is wrong, or in conflict with reality, it gets tossed.

But this division is further tested by things like evolution. If we ever did find something that broke that theory, like a fossil rabbit in the Precambrian, we would not be justified in tossing evolution. The weight of all the other evidence in favor of evolution makes it more likely we got something wrong than that evolution should be dust-binned. We again seem to be prescriptive.

That pile of evidence is our ticket back to descriptiveness, though. One bit of counter-evidence may fall flat, but a giant enough heap would not. There is only a finite amount of it favoring evolution, so in theory I can still pile up more counter-evidence and be forced to give that theory up in favor of reality, even if that’s impossible in practice.

No amount of evidential persuasion can force me to give up on a moral, in contrast. This too may seem strange; it may not be moral to kill a person, but wouldn’t it be moral to kill Hitler? The information we have about a scenario can dramatically shift the moral action.

But, importantly, it doesn’t shift the moral code. No sane moral system will hold you accountable for honest ignorance, and even the non-sane ones provide an “out” via (for instance) penitence or another loop on the karmic wheel. Instead, you apply the moral code to the knowledge you do have, a code that does not change over time. Slavery was just as bad in the past as it is now; what’s changed instead is us. We as moral agents have progressed, through education, reason, and the occasional violent rebellion. The moral code hasn’t changed, we have adjusted our reality to better match it. Again, we find morality is prescriptive.

So what are we to make of atheists who argue they can only follow the evidence? “Do not hold false beliefs” is prescriptive, because it tells us what to do, yet it’s a necessary assumption behind “I cannot believe in the gods, because there is insufficient evidence to warrant belief.” Having a moral code is an essential prerequisite for every atheist who isn’t that way out of ignorance, and that ignorance dissipates within seconds of hearing someone attempt to describe what a god is.

But… is it true that black people deserve to be paid less than whites? Is it true that women who dress provocatively deserve to be raped? Is it true that the poor are lazy and shiftless? All it takes to believe in any form of social justice is the moral “do not hold false beliefs” and evidence to support “claim X is false.” The minimal moral system for a hardline dictionary atheist is no different than the minimal moral system of a feminist!

Of course, there’s no reason you can’t toss extra morals into the mix. Social justice types would quickly add “allowing false beliefs to persist in others is wrong,” but so too would the dictionary atheist. How else could they justify trying to persuade others away from religion? No doubt those atheists would disavow any additional morals, but so too could a feminist. That one extra premise is enough to justify actively changing the culture we live in.

There might be other differences in the moral codes of dictionary atheists and those promoting social justice, but they amount to little more than window dressing; not only does being an atheist require a moral code, even of the “dictionary” brand, but the smallest possible code also supports feminists and others engaging in social justice.

So knock off the “atheism has no moral code” crap. It just ain’t true.

Christina Hoff Sommers: Blatant Science Denialist

So, how’d my predictions about Christina Hoff Sommers’ video pan out?

The standard approach for those challenging rape culture is either to avoid defining the term “rape culture” at all, or to define it as actively encouraging sexual assault instead of passively doing so, setting up a strawperson from the get-go.

Half points for this one. Sommers never defined “rape culture,” but thanks to vague wording made it sound like “rape culture” was synonymous with “beliefs that encourage the sexual assault of women on college campuses:”

[1:12] Now, does that mean that sexual assault’s not a problem on campus? Of course not! Too many women are victimized. But it’s not an epidemic, and it’s not a culture.

Continuing with myself:

Sommers herself is a fan of cherry-picking individual studies or case reports and claiming they’re representative of the whole, and I figure we’ll see a lot of that.

[Success Kid meme: NAILED IT]

There’s also the clever technique of deliberately missing the point or spinning out half-truths […] I don’t think Sommers will take that approach, preferring to cherry-pick and fiddle with definitions instead, but as a potent tool of denialists it’s worth keeping in mind.

Oooooo, almost. Almost.

While there’s a lot I could pick apart in this video, I’d like to focus on the most blatant example of her denialism: her juggling of sexual assault statistics.

The first study she cites is an infamous one in conservative circles, the Campus Sexual Assault Study of 2007. Ever since Obama made a big deal of it, they’ve cranked up their noise machine and dug in deep to discredit the study. Sommers benefits greatly from that, doing just a quick hit-and-run.

[0:50] The “one in five” claim is based on a 2007 internet study, with vaguely worded questions, a low response rate, and a non-representative sample.

Oh, how many ways is that wrong? Here’s the actual methodology from the paper (pg 3-1 to 3-2):

Two large public universities participated in the CSA Study. Both universities provided us with data files containing the following information on all undergraduate students who were enrolled in the fall of 2005: full name, gender, race/ethnicity, date of birth, year of study, grade point average, full-time/part-time status, e-mail address, and mailing address. […]

We created four sampling subframes, with cases randomly ordered within each subframe: University 1 women, University 1 men, University 2 women, and University 2 men. […]

Samples were then drawn randomly from each of the four subframes. The sizes of these samples were dictated by response rate projections and sample size targets (4,000 women and 1,000 men, evenly distributed across the universities and years of study) […]

To recruit the students who were sampled to participate in the CSA Study, we relied on both recruitment e-mails and hard copy recruitment letters that were mailed to potential respondents. Sampled students were sent an initial recruitment e-mail that described the study, provided each student with a unique CSA Study ID#, and included a hyperlink to the CSA Study Web site. During each of the following 2 weeks, students who had not completed the survey were sent a follow-up e-mail encouraging them to participate. The third week, nonrespondents were mailed a hard-copy recruitment letter. Two weeks after the hard-copy letters were mailed, nonrespondents were sent a final recruitment e-mail.

Christopher P Krebs, Christine H. Lindquist, Tara D. Warner, Bonnie S. Fisher, and Sandra L. Martin. “Campus Sexual Assault (CSA) Study, Final Report,” October 2007.

The actual number of responses was 5,446 women and 1,375 men, above expectations. Yes, the authors expected a low response rate and a non-representative sample, and already had methods in place to deal with both; see pages 3-7 to 3-10 of the report for how they compensated and then verified those corrections were valid. Note too that this “internet study” was quite targeted and closed to the public, contrary to what Sommers implies.
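For readers wondering what “compensating” for uneven response rates actually looks like, the standard trick is to weight each respondent so that the achieved sample matches the known composition of the sampling frame. Here’s a minimal sketch of that kind of post-stratification weighting; the strata and counts are invented for illustration, and this is only the general technique, not the CSA Study’s exact procedure or figures.

#include <stdio.h>

/* A minimal sketch of post-stratification weighting. Each respondent in a
   stratum gets weight = (stratum's share of the sampling frame) divided by
   (stratum's share of the respondents). The strata and counts below are
   invented for illustration; they are NOT the CSA Study's numbers. */

struct stratum {
    const char *name;
    double frame_count;      /* students in the sampling frame */
    double respondent_count; /* students who completed the survey */
};

int main(void)
{
    struct stratum strata[] = {
        { "University 1, women", 9000.0, 1400.0 },
        { "University 1, men",   8000.0,  350.0 },
        { "University 2, women", 9500.0, 1300.0 },
        { "University 2, men",   8500.0,  300.0 },
    };
    int n = sizeof(strata) / sizeof(strata[0]);

    double frame_total = 0.0, resp_total = 0.0;
    for (int i = 0; i < n; i++) {
        frame_total += strata[i].frame_count;
        resp_total  += strata[i].respondent_count;
    }

    /* weight = frame share / respondent share, per stratum */
    for (int i = 0; i < n; i++) {
        double frame_share = strata[i].frame_count / frame_total;
        double resp_share  = strata[i].respondent_count / resp_total;
        printf("%-22s weight = %.2f\n", strata[i].name, frame_share / resp_share);
    }

    return 0;
}

Groups that responded less often than their share of the frame get weights above one, so their answers count for more in the final estimates; groups that over-responded get weights below one. That’s how a survey can have a modest or uneven response rate and still produce estimates that reflect the population it was drawn from, provided the reweighting is checked against known benchmarks, which is what the CSA authors report doing.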

As to the “vaguely worded” questions: they’re phrased that way because many people won’t say they were raped even if they were penetrated against their will (e.g. Koss, Mary P., Thomas E. Dinero, Cynthia A. Seibel, and Susan L. Cox. “Stranger and Acquaintance Rape: Are There Differences in the Victim’s Experience?” Psychology of Women Quarterly 12, no. 1 (1988): 1–24). Partly that’s because denial is one way to cope with a traumatic event, and partly because they’ve been told by society that it isn’t a crime. So researchers have to tip-toe around “rape culture” just to get an accurate view of sexual assault, which is yet more evidence that the beast exists after all.

Sommers champions another study as more accurate than the CSA, one from the US Bureau of Justice Statistics which comes to the quite-different figure of one in 52. Sommers appears to be getting her data from Figure 2 in that document, and since that’s on page three either she or a research assistant must have read page two.

The NCVS is one of several surveys used to study rape and sexual assault in the general and college-age population. In addition to the NCVS, the National Intimate Partner and Sexual Violence Survey (NISVS) and the Campus Sexual Assault Study (CSA) are two recent survey efforts used in research on rape and sexual assault. The three surveys differ in important ways in how rape and sexual assault questions are asked and victimization is measured. […]

The NCVS is presented as a survey about crime, while the NISVS and CSA are presented as surveys about public health. The NISVS and CSA collect data on incidents of unwanted sexual contact that may not rise to a level of criminal behavior, and respondents may not report incidents to the NCVS that they do not consider to be criminal. […]

The NCVS, NISVS, and CSA target different types of events. The NCVS definition is shaped from a criminal justice perspective and includes threatened, attempted, and completed rape and sexual assault against males and females […]

Unlike the NCVS, which uses terms like rape and unwanted sexual activity to identify victims of rape and sexual assault, the NISVS and CSA use behaviorally specific questions to ascertain whether the respondent experienced rape or sexual assault. These surveys ask about an exhaustive list of explicit types of unwanted sexual contact a victim may have experienced, such as being made to perform or receive anal or oral sex.

Lynn Langton and Sofi Sinozich. “Rape and Sexual Assault Among College-age Females, 1995–2013,” December 11, 2014.

This information is repeated in Appendix A, which even includes a handy table summarizing all of the differences. The fact that it was shoved onto page two as well suggests many people have tried to leverage this study to “discredit” the others, without realizing the different methodologies make that impossible. The study authors tried to paint these differences in bright neon to guard against any stat-mining, but alas, Sommers has no qualms about ignoring all of that to suit her ends. Even the NCVS authors suggest going with the other surveys’ numbers for prevalence and only using theirs for differences between student and non-student populations:

Despite the differences that exist between the surveys, a strength of the NCVS is its ability to be used to make comparisons over time and between population subgroups. The differences observed between students and nonstudents are reliable to the extent that both groups responded in a similar manner to the NCVS context and questions. Methodological differences that lead to higher estimates of rape and sexual assault in the NISVS and CSA should not affect the NCVS comparisons between groups.

In short, Sommers engaged in more half-truths and misleading statements than I predicted. Dang. But hold onto your butts, because things are about to get worse.

[2:41] The claim that 2% of rape accusations are false? That’s unfounded. It seems to have started with Susan Brownmiller’s 1975 feminist manifesto “Against Our Will.” Other statistics for false accusations range from 8 to 43%.

Hmph, so how did Brownmiller come to her 2% figure for false reports? Let’s check her book:

A decade ago the FBI’s Uniform Crime Reports noted that 20 percent of all rapes reported to the police were determined by investigation to be unfounded. By 1973 the figure had dropped to 15 percent, while rape remained, in the FBI’s words, the most underreported crime. A 15 percent figure for false accusations is undeniably high, yet when New York City instituted a special sex crimes analysis squad and put policewomen (instead of men) in charge of interviewing complainants, the number of false charges in New York dropped dramatically to 2 percent, a figure that corresponded exactly to the rate of false reports for other crimes. The lesson in the mystery of the vanishing statistic is obvious. Women believe the word of other women. Men do not.

Brownmiller, Susan. Against Our Will: Men, Women and Rape. Open Road Media, 2013. pg. 435.

…. waaaitaminute. Brownmiller never actually says the 2% figure is the false reporting rate; at best, she merely argues it’s more accurate than figures of 15-20%. And, in fact, it is!

In contrast, when more methodologically rigorous research has been conducted, estimates for the percentage of false reports begin to converge around 2-8%.

Lonsway, Kimberly A., Joanne Archambault, and David Lisak. “False Reports: Moving Beyond the Issue to Successfully Investigate and Prosecute Non-Stranger Sexual Assault.” (2009).

That’s taken from the third study Sommers cites, or more accurately from a summary of other work by Lisak. She quotes two of the three studies in that summary that show rates above 8%. The odd study out gives an even higher false reporting rate (10.9%) than the 8% figure Sommers quotes, and should therefore have been better evidence for her, but look at how Lisak describes it:

A similar study was then again sponsored by the Home Office in 1996 (Harris & Grace, 1999). This time, the case files of 483 rape cases were examined, and supplemented with information from a limited number of interviews with sexual assault victims and criminal justice personnel. However, the determination that a report was false was made solely by the police. It is therefore not surprising that the estimate for false allegations (10.9%) was higher than those in other studies with a methodology designed to systematically evaluate these classifications.

That’s impossible to quote-mine. And while Lisak spends a lot of time discussing Kanin’s study, which is the fifth one Sommers presents, she references it directly instead of pulling from Lisak. A small sample may hint at why his summary has been snubbed:

As a result of these and other serious problems with the “research,” Kanin’s (1994) article can be considered “a provocative opinion piece, but it is not a scientific study of the issue of false reporting of rape. It certainly should never be used to assert a scientific foundation for the frequency of false allegations” (Lisak, 2007, p. 1).

Well, at least that fourth study wasn’t quote-mined. Right?

internal rules on false complaints specify that this category should be limited to cases where either there is a clear and credible admission by the complainants, or where there are strong evidential grounds. On this basis, and bearing in mind the data limitations, for the cases where there is information (n=144) the designation of false complaint could be said to be probable (primarily those where the account by the complainant is referred to) in 44 cases, possible (primarily where there is some evidential basis) in a further 33 cases, and uncertain (including where victim characteristics are used to impute that they are inherently less believable) in 77 cases. If the proportion of false complaints on the basis of the probable and possible cases are recalculated, rates of three per cent are obtained, both of all reported cases (n=67 of 2,643), and of those where the outcome is known (n=67 of 2,284). Even if all those designated false by the police were accepted (a figure of approximately ten per cent), this is still much lower than the rate perceived by police officers interviewed in this study.

Kelly, Liz, Jo Lovett, and Linda Regan. A Gap or a Chasm? Attrition in Reported Rape Cases. London: Home Office Research, Development and Statistics Directorate, 2005.

Bolding mine. It’s rather convenient that Sommers quoted the police false report rate of 8% (or “approximately ten per cent” here), yet somehow overlooked the later section where the authors explain that the police inflated that figure. And just as the 8% got rounded up to ten, Liz Kelly and her co-authors rounded up their “three per cent” figure: divide 67 by 2,643 and you land within fingertip distance of 2%, a false report rate of roughly 2.5% (restricting the count to cases with a known outcome, 67 of 2,284, still only gives 2.9%).
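If you want to check that arithmetic yourself, here’s a throwaway calculation using only the counts from the quoted passage; nothing in it comes from the report beyond the three numbers already cited above.

#include <stdio.h>

int main(void)
{
    /* Figures quoted from Kelly, Lovett and Regan (2005): 67 probable or
       possible false complaints, 2,643 reported cases overall, and 2,284
       cases where the outcome is known. */
    const double false_complaints = 67.0;
    const double all_reported     = 2643.0;
    const double known_outcome    = 2284.0;

    printf("Rate over all reported cases:    %.1f%%\n",
           100.0 * false_complaints / all_reported);   /* prints 2.5% */
    printf("Rate over cases with an outcome: %.1f%%\n",
           100.0 * false_complaints / known_outcome);  /* prints 2.9% */

    return 0;
}

Either way you slice it, the study’s own numbers sit far closer to the 2% end of Lisak’s range than to the 8% figure Sommers leads with.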

Lisak did not get the low end of his 2-8% range from Brownmiller; he got it from two large-scale, rigorous studies that concluded a 2% false report rate was reasonable. In his scientific paper, in fact, he explicitly discards Brownmiller’s number:

Another source, cited by Rumney (2006) and widely referenced in the literature on false allegations is a study conducted by the New York City police department and originally referenced by Susan Brownmiller (1975) in her book, Against Our Will: Men, Women and Rape. According to Brownmiller, the study found a false allegation rate of 2%. However, the only citation for the study is a public remark made by a judge at a bar association meeting, and, therefore, no information is available on the study’s sample or methodology.

Lisak, David, Lori Gardinier, Sarah C. Nicksa, and Ashley M. Cote. “False Allegations of Sexual Assault: An Analysis of Ten Years of Reported Cases.” Violence Against Women 16, no. 12 (2010): 1318–34.

That 2% number is actually quite well founded, and Sommers must have known that. Feminists also know of the 2-8% stat, and cite it frequently.

In hindsight, this is a blatant example of the embrace-extend-extinguish pattern of Sommers that I discussed earlier. She took one extreme of the feminist position and painted it as the typical one, cherry-picking the evidence in her favor. She took the other extreme as the low point of her own range, giving herself the option of invoking a false concession, and then extended that range to encompass the majority of the false rape report studies out there, most of which are useless:

very few of these estimates are based on research that could be considered credible. Most are reported without the kind of information that would be needed to evaluate their reliability and validity. A few are little more than published opinions, based either on personal experience or a non-systematic review (e.g., of police files, interviews with police investigators, or other information with unknown reliability and validity).

Lisak (2009), pg. 1

Sommers then claims this “middle ground” as her own, riding the Appeal to Moderation for all it’s worth. This is denialism so blatant that no skeptic should take it seriously.

Alas, quite a few do.

Christina Hoff Sommers: Science Denialist?

In a bizarre coincidence, just three days before my lecture on rape culture Christina Hoff Sommers happened to weigh in on the topic. I haven’t seen the video yet, which puts me in a great position to lay a little groundwork and make some predictions.

First off, we’ve got to get our definitions straight. “Rape culture” is the cloud of myths about sexual assault that exist within our society, which make it easier to excuse that crime and/or tougher for victims to recover or seek justice. Take Burt’s 1980 paper on the subject:

The burgeoning popular literature on rape (e.g., Brownmiller, 1975; Clark & Lewis, 1977) all points to the importance of stereotypes and myths — defined as prejudicial, stereotyped, or false beliefs about rape, rape victims, and rapists — in creating a climate hostile to rape victims. Examples of rape myths are “only bad girls get raped”; “any healthy woman can resist a rapist if she really wants to”; “women ask for it”; “women ‘cry rape’ only when they’ve been jilted or have something to cover up”; “rapists are sex-starved, insane, or both.” Recently, researchers have begun to document that rape myths appear in the belief systems of lay people and of professionals who interact with rape victims and assailants (e.g., Barber, 1974; Burt, 1978; Feild, 1978; Kalven & Zeisel, 1966). Writers have analyzed how rape myths have been institutionalized in the law (Berger, 1977) […]

Much feminist writing on rape maintains that we live in a rape culture that supports the objectification of, and violent and sexual abuse of, women through movies, television, advertising, and “girlie” magazines (see, e.g., Brownmiller, 1975). We hypothesized that exposure to such material would increase rape myth acceptance because it would tend to normalize coercive and brutal sexuality.
Burt, Martha R. “Cultural Myths and Supports for Rape.” Journal of Personality and Social Psychology 38, no. 2 (1980): 217.
http://www.excellenceforchildandyouth.ca/sites/default/files/meas_attach/burt_1980.pdf

You can see how the definition has shifted a little over time. Objectification certainly helps dehumanize your victim, but it’s not a strict necessity; and while women are disproportionately targeted for gender-based violence in every modern society I know of, there’s still a non-trivial number of male victims out there.

There are two ways to demonstrate “rape culture” is itself a myth. The most obvious route is to challenge the “rape myth” part, and show either that those myths are in line with reality or are not commonly held in society. For instance, either good girls do not get raped, or few people believe that good girls do not get raped. Based on even a small, narrow sample of the literature, this is a tough hill to climb. I did a quick Google Scholar search, and even when I asked specifically for “rape myth acceptance” I had no problem pulling a thousand results, with Google claiming to have another 2,500 or so it wouldn’t show me. There must be a consensus on “rape culture,” based merely on volume, and to pick a side opposing that consensus is to be a science denialist.

The less obvious route is to challenge the “help perpetrators/harm victims” portion. Consider the “rubber sheet model” of General Relativity; we know this is wrong, and not just because it depends on gravity to explain gravity, but nonetheless the model is close enough to reality that non-physicists get the gist of things without having to delve into equations. It’s a myth, but the benefits outweigh the harms. Sommers could take a similar approach to sexual assault: not so much arguing that rape myths are a net benefit, but instead riding the “correlation is not causation” line and arguing the myths don’t excuse perpetrators or harm victims. This approach has problems too, as correlation can be evidence for causation when there’s a plausible mechanism, and past a point this approach also becomes science denialism. Overall, though, I think it’s Sommers’ best route.

If she gets that far, of course. The standard approach for those challenging rape culture is either to avoid defining the term “rape culture” at all, or to define it as actively encouraging sexual assault instead of passively doing so, setting up a strawperson from the get-go. Sommers herself is a fan of cherry-picking individual studies or case reports and claiming they’re representative of the whole, and I figure we’ll see a lot of that. There’s also the clever technique of deliberately missing the point or spinning out half-truths: take this video about date rape drugs by her partner-in-crime Caroline Kitchens, for instance. Her conclusion is that date rape drugs are over-hyped, and having looked at the literature myself I agree with her… so long as we exclude alcohol as a “date rape drug.” If you include it, then the picture shifts dramatically.

Numerous sources implicate alcohol use/abuse as either a cause of or contributor to sexual assault. … Across both the literatures on sexual assault and on alcohol’s side effects, several lines of empirical data and theory-based logic suggest that alcohol is a contributing factor to sexual assault.
George, William H., and Susan A. Stoner. “Understanding acute alcohol effects on sexual behavior.” Annual review of sex research 11.1 (2000): 92-124.

General alcohol consumption could be related to sexual assault through multiple pathways. First, men who often drink heavily also likely do so in social situations that frequently lead to sexual assault (e.g., on a casual or spontaneous date at a party or bar). Second, heavy drinkers may routinely use intoxication as an excuse for engaging in socially unacceptable behavior, including sexual assault (Abbey et al. 1996b). Third, certain personality characteristics (e.g., impulsivity and antisocial behavior) may increase men’s propensity both to drink heavily and to commit sexual assault (Seto and Barbaree 1997).

Certain alcohol expectancies have also been linked to sexual assault. For example, alcohol is commonly viewed as an aphrodisiac that increases sexual desire and capacity (Crowe and George 1989). Many men expect to feel more powerful, disinhibited, and aggressive after drinking alcohol. … Furthermore, college men who had perpetrated sexual assault when intoxicated expected alcohol to increase male and female sexuality more than did college men who perpetrated sexual assault when sober (Abbey et al. 1996b). Men with these expectancies may feel more comfortable forcing sex when they are drinking, because they can later justify to themselves that the alcohol made them act accordingly (Kanin 1984).

Attitudes about women’s alcohol consumption also influence a perpetrator’s actions and may be used to excuse sexual assaults of intoxicated women. Despite the liberalization of gender roles during the past few decades, most people do not readily approve of alcohol consumption and sexual behavior among women, yet view these same behaviors among men with far more leniency (Norris 1994). Thus, women who drink alcohol are frequently perceived as being more sexually available and promiscuous compared with women who do not drink (Abbey et al. 1996b). … In fact, date rapists frequently report intentionally getting the woman drunk in order to have sexual intercourse with her (Abbey et al. 1996b).
Abbey, Antonia, et al. “Alcohol and sexual assault.” Alcohol Research and Health 25.1 (2001): 43-51.
http://pubs.niaaa.nih.gov/publications/arh25-1/43-51.htm

I don’t think Sommers will take that approach, preferring to cherry-pick and fiddle with definitions instead, but as a potent tool of denialists it’s worth keeping in mind.

With that preamble out of the way, we can begin….