Finding Her Voice

Have you ever heard of a cool scientific paper, gone out to find yourself a copy, and been frustrated to find no trace of it? I was stuck there for years with one particular paper, until I got lucky.

Cutler, Anne, and Donia R. Scott. “Speaker Sex and Perceived Apportionment of Talk.” Applied Psycholinguistics 11, no. 3 (1990): 253–272.
I’m sure you’ve heard the stereotype that women talk excessively. A number of studies have actually sat down and counted total talking time, only to find that men tend to be the blabbermouths. What gives?
An alternative suggestion is more complex and may rely on a difference in content between men’s and women’s speech. Kramer (1975) and Spender (1980) suggested that women are undervalued in society, and as a consequence women’s speech is undervalued – female contributions to conversation are overestimated because they are held to have gone on “too long” relative to what female speakers are held to deserve. Preisler (1986) similarly argued that evaluation of women’s speech is a function of (under)evaluation of the social roles most usually fulfilled by women.
The former explanation suggests that overestimation of women’s conversational contributions is a perceptual bias effect that should be reproducible in the laboratory simply by asking listeners to judge amount of talk produced by male and female speakers, even if content of the talk is controlled. [pg. 255]
So Anne Cutler and her co-author tested that by having listeners rate excerpts from plays in which both speaking roles said about the same number of words. The sex of the speakers was varied, of course.
In single-sex conversations, female and male first speakers received almost identical ratings (49.5% and 50%, respectively), but in mixed-sex conversations, female speakers were judged to be talking more (55.2%), male speakers to be talking less (47.8%). Although the number of words spoken was identical for each column, listeners believed that in mixed-sex conversations, females spoke more and males spoke less.

In fact, three of these mean ratings are actually underestimates, since the true mean first speaker contribution across all four dialogues was 53.7%. ….

The interaction of speaker sex with whether the dialogue was mixed- or single-sex was significant in both analyses … There was also a main effect of speaker
sex, with female speakers’ contributions being overestimated, but male speakers’ contributions being underestimated relative to the actual number of words spoken. [pg. 259-260]
What’s interesting is that when people were asked to guess the sex of each role, handed nothing more than the script, men and women sometimes differed.
When a part was not particularly sex-marked (Dialogue 1), females speaking it were judged to have said more than males speaking it. When a part was marked as female for male and for female subjects alike (Dialogue 2), the same effect was found. When, however, a part was marked as female for male subjects only (Dialogue 3), only male subjects showed the effect; and when a part was marked as female for female subjects only (Dialogue 4), only female subjects showed any effect. [pg. 268]
Unfortunately, this muddied up the conclusions a bit. And I do have other issues with the paper, primarily in their use of p-values, but I think the findings rise above it. They also fit nicely into the existing body of work on sexism and speech.
These behaviors, the interrupting and the over-talking, also happen as the result of differences in status, but gender rules as well. For example, male doctors invariably interrupt patients when they speak, especially female patients, but patients rarely interrupt doctors in return. Unless the doctor is a woman: when that is the case, she interrupts far less and is herself interrupted more. This is also true of senior managers in the workplace. Male bosses are not frequently talked over or stopped by those working for them, especially if those workers are women; however, female bosses are routinely interrupted by their male subordinates.

What can we do to raise women’s voices? Maybe technology can help.

Gender Timer is the app that measures the talk times between the sexes. It is used to raise awareness and generate discussion about how airtime looks in practice. The aim is to ultimately develop your organization and its meeting culture.

Available on Android and iPhone.

Daryl Bem and the Replication Crisis

I’m disappointed I don’t see more recognition of this.

If one had to choose a single moment that set off the “replication crisis” in psychology—an event that nudged the discipline into its present and anarchic state, where even textbook findings have been cast in doubt—this might be it: the publication, in early 2011, of Daryl Bem’s experiments on second sight.

I’ve actually done a long blog post series on the topic, but in brief: Daryl Bem was convinced that precognition existed. To put these beliefs to the test, he had subjects try to predict an image that was randomly generated by a computer. Over eight experiments, he found that they could indeed do better than chance. You might think that Bem is a kook, and you’d be right.

But Bem is also a scientist.

Now he would return to JPSP [the Journal of Personality and Social Psychology] with the most amazing research he’d ever done—that anyone had ever done, perhaps. It would be the capstone to what had already been a historic 50-year career.

Having served for a time as an associate editor of JPSP, Bem knew his methods would be up to snuff. With about 100 subjects in each experiment, his sample sizes were large. He’d used only the most conventional statistical analyses. He’d double- and triple-checked to make sure there were no glitches in the randomization of his stimuli. Even with all that extra care, Bem would not have dared to send in such a controversial finding had he not been able to replicate the results in his lab, and replicate them again, and then replicate them five more times. His finished paper lists nine separate ministudies of ESP. Eight of those returned the same effect.

One way to attack an argument is to merely follow its logic. If you can show it leads to an absurd conclusion, the argument must be flawed even if you cannot locate the flaw. Bem had inadvertently discovered a “reductio ad absurdum” argument against contemporary scientific practice: if proper scientific procedure can prove ESP exists, proper scientific procedure must be broken.

Meanwhile, at the conference in Berlin, [E.J.] Wagenmakers finally managed to get through Bem’s paper. “I was shocked,” he says. “The paper made it clear that just by doing things the regular way, you could find just about anything.”

On the train back to Amsterdam, Wagenmakers drafted a rebuttal, to be published in JPSP alongside the original research. The problems he saw in Bem’s paper were not particular to paranormal research. “Something is deeply wrong with the way experimental psychologists design their studies and report their statistical results,” Wagenmakers wrote. “We hope the Bem article will become a signpost for change, a writing on the wall: Psychologists must change the way they analyze their data.”

Slate has a long read up on the current replication crisis, and how it links to Bem. It’s aimed at a lay audience and highly readable; I recommend giving it a click.

So You Wanna Falsify Gender Studies?

How would a skeptic determine whether or not an area of study was legit? The obvious route would be to study up on the core premises of that field, recording citations as you go; map out how they are connected to one another and supported by the evidence, looking for weak spots; then write a series of articles sharing those findings.

What they wouldn’t do is generate a fake paper purporting to be from that field of study but deliberately mangling the terminology, submit it to a low-ranked and obscure journal for peer review, have it rejected from that journal, then based on feedback submit it to a second journal that was semi-shady and even more obscure, have it published, and parade that around as if it meant something.

Alas, it seems the Skeptic movement has no idea how basic skepticism works. Self-proclaimed “skeptics” Peter Boghossian and James Lindsay took the second route, and were cheered on by Michael Shermer, Richard Dawkins, Jerry Coyne, Steven Pinker, and other people calling themselves skeptics. A million other people have pointed and laughed at them, so I won’t bother joining in.

But no-one seems to have brought up the first route. Let’s do a sketch of actual skepticism, then, and see how well gender studies holds up.

What’s Claimed?

Right off the bat, we hit a problem: most researchers or advocates in gender studies do not have a consensus sex or gender model.

The Genderbread Person, version 3.3. From http://itspronouncedmetrosexual.com/2015/03/the-genderbread-person-v3/

This is one of the more popular explainers for gender floating around the web. Rather than focus on the details, however, I’d like you to note that this graphic is labeled “version 3.3”. In other words, Sam Killermann has tweaked and revised it three times over. It also conflicts with the Gender Unicorn, which takes a categorical approach to “biological sex,” adds “other genders,” and no longer embraces the idea of a spectrum, thus contradicting a lot of other models. Confront Killermann on this, and I bet they’d shrug their shoulders and start crafting another model.

The model isn’t all that important, though. Instead, gender studies has reached a consensus on an axiom and a corollary: the two-sex, two-gender model is an oversimplification, and sex/gender are complicated. Hence why models of sex or gender continually fail: the complexity almost guarantees exceptions to your rules.

There’s a strong parallel here to agnostic atheism’s “lack of belief” posture, as this flips the burden of proof. Critiquing the consensus of gender studies means asserting a positive statement, that the binarist model is correct, while the defense merely needs to swat down those arguments without advancing any of its own.

Nothing Fails Like Binarism

A single counter-example is sufficient to refute a universal rule. To take a classic example, I can show “all swans are white” is a false statement by finding a single black swan. If someone came along and said “well yeah, but most swans are white, so we can still say that all swans are white,” you’d think of them as delusional or in denial.

Well, I can point to four people who do not fit into the two-sex two-gender model. Ergo, that model cannot be true in all cases, and the critique of gender studies fails after a thirty second Google search.

When most people are confronted with this, they invoke a three-sex model (male, female, and “other/defective”) but call it two-sex in order to preserve their delusion. That so few people notice the contradiction is a testament to how hard the binary model is hammered into us.

But Where’s the SCIENCE?!

Another popular dodge is to argue that merely saying you don’t fit into the binary isn’t enough; if it wasn’t in peer-reviewed research, it can’t be true. This is no less silly. Do I need to publish a paper about the continent of Africa to say it exists? Or my computer? And if you think peer review guarantees truth, browse Retraction Watch for a spell.

Once you’ve come back, go look at the peer-reviewed research which suggests gender is more complicated than a simple binary.

At times, the prevailing answers were almost as simple as Gray’s suggestion that the sexes come from different planets. At other times, and increasingly so today, the answers concerning the why of men’s and women’s experiences and actions have involved complex multifaceted frameworks.

Ashmore, Richard D., and Andrea D. Sewell. “Sex/Gender and the Individual.” In Advanced Personality, edited by David F. Barone, Michel Hersen, and Vincent B. Van Hasselt, 377–408. The Plenum Series in Social/Clinical Psychology. Springer US, 1998. doi:10.1007/978-1-4419-8580-4_16.

Correlational findings with the three scales (self-ratings) suggest that sex-specific behaviors tend to be mutually exclusive while male- and female-valued behaviors form a dualism and are actually positively rather than negatively correlated. Additional analyses showed that individuals with nontraditional sex role attitudes or personality trait organization (especially cross-sex typing) were somewhat less conventionally sex typed in their behaviors and interests than were those with traditional attitudes or sex-typed personality traits. However, these relationships tended to be small, suggesting a general independence of sex role traits, attitudes, and behaviors.

Orlofsky, Jacob L. “Relationship between Sex Role Attitudes and Personality Traits and the Sex Role Behavior Scale-1: A New Measure of Masculine and Feminine Role Behaviors and Interests.” Journal of Personality 40, no. 5 (May 1981): 927–40.

Women’s scores on the BSRI-M and PAQ-M (masculine) scales have increased steadily over time (r’s = .74 and .43, respectively). Women’s BSRI-F and PAQ-F (feminine) scale scores do not correlate with year. Men’s BSRI-M scores show a weaker positive relationship with year of administration (r = .47). The effect size for sex differences on the BSRI-M has also changed over time, showing a significant decrease over the twenty-year period. The results suggest that cultural change and environment may affect individual personalities; these changes in BSRI and PAQ means demonstrate women’s increased endorsement of masculine-stereotyped traits and men’s continued nonendorsement of feminine-stereotyped traits.

Twenge, Jean M. “Changes in Masculine and Feminine Traits over Time: A Meta-Analysis.” Sex Roles 36, no. 5–6 (March 1, 1997): 305–25. doi:10.1007/BF02766650.

Male (n = 95) and female (n = 221) college students were given 2 measures of gender-related personality traits, the Bem Sex-Role Inventory (BSRI) and the Personal Attributes Questionnaire, and 3 measures of sex role attitudes. Correlations between the personality and the attitude measures were traced to responses to the pair of negatively correlated BSRI items, masculine and feminine, thus confirming a multifactorial approach to gender, as opposed to a unifactorial gender schema theory.

Spence, Janet T. “Gender-Related Traits and Gender Ideology: Evidence for a Multifactorial Theory.” Journal of Personality and Social Psychology 64, no. 4 (1993): 624.

Oh sorry, you didn’t know that gender studies has been a science for over four decades? You thought it was just an invention of Tumblr, rather than a mad scramble by scientists to catch up with philosophers? Tsk, that’s what you get for pretending to be a skeptic instead of doing your homework.

I Hate Reading

One final objection is that field-specific jargon is hard to understand. Boghossian and Lindsay seem to think it follows that the jargon is therefore meaningless bafflegab. I’d hate to see what they’d think of a modern physics paper; jargon offers precise definitions and less typing to communicate your ideas, and while it can quickly become opaque to lay people, jargon is a necessity for serious science.

But let’s roll with the punch, and look outside of journals for evidence that’s aimed at a lay reader.

In Sexing the Body: Gender Politics and the Construction of Sexuality Fausto-Sterling attempts to answer two questions: How is knowledge about the body gendered? And, how gender and sexuality become somatic facts? In other words, she passionately and with impressive intellectual clarity demonstrates how in regards to human sexuality the social becomes material. She takes a broad, interdisciplinary perspective in examining this process of gender embodiment. Her goal is to demonstrate not only how the categories (men/women) humans use to describe other humans become embodied in those to whom they refer, but also how these categories are not reflected in reality. She argues that labeling someone a man or a woman is solely a social decision. «We may use scientific knowledge to help us make the decision, but only our beliefs about gender – not science – can define our sex» (p. 3) and consistently throughout the book she shows how gender beliefs affect what kinds of knowledge are produced about sex, sexual behaviors, and ultimately gender.

Gober, Greta. “Sexing the Body: Gender Politics and the Construction of Sexuality.” Humana.Mente Journal of Philosophical Studies 22 (2012): 175–187.

Making Sex is an ambitious investigation of Western scientific conceptions of sexual difference. A historian by profession, Laqueur locates the major conceptual divide in the late eighteenth century when, as he puts it, “a biology of cosmic hierarchy gave way to a biology of incommensurability, anchored in the body, in which the relationship of men to women, like that of apples to oranges, was not given as one of equality or inequality but rather of difference” (207). He claims that the ancients and their immediate heirs—unlike us—saw sexual difference as a set of relatively unimportant differences of degree within “the one-sex body.” According to this model, female sexual organs were perfectly homologous to male ones, only inside out; and bodily fluids—semen, blood, milk—were mostly “fungible” and composed of the same basic matter. The model didn’t imply equality; woman was a lesser man, just not a thing wholly different in kind.

Altman, Meryl, and Keith Nightenhelser. “Making Sex (Review).” Postmodern Culture 2, no. 3 (January 5, 1992). doi:10.1353/pmc.1992.0027.

In Delusions of Gender the psychologist Cordelia Fine exposes the bad science, the ridiculous arguments and the persistent biases that blind us to the ways we ourselves enforce the gender stereotypes we think we are trying to overcome. […]

Most studies about people’s ways of thinking and behaving find no differences between men and women, but these fail to spark the interest of publishers and languish in the file drawer. The oversimplified models of gender and genes that then prevail allow gender culture to be passed down from generation to generation, as though it were all in the genes. Gender, however, is in the mind, fixed in place by the way we store information.

Mental schema organise complex experiences into types of things so that we can process data efficiently, allowing us, for example, to recognise something as a chair without having to notice every detail. This efficiency comes at a cost, because when we automatically categorise experience we fail to question our assumptions. Fine draws together research that shows people who pride themselves on their lack of bias persist in making stereotypical associations just below the threshold of consciousness.

Everyone works together to reinforce social and cultural environments that soft-wire the circuits of the brain as male or female, so that we have no idea what men and women might become if we were truly free from bias.

Apter, Terri. “Delusions of Gender: The Real Science Behind Sex Differences by Cordelia Fine.” The Guardian, October 11, 2010, sec. Books.

Have At ‘r, “Skeptics”

You want to refute the field of gender studies? I’ve just sketched out the challenges you face on a philosophical level, and pointed you to the studies and books you need to refute. Have fun! If you need me I’ll be over here, laughing.

[HJH 2017-05-21: Added more links, minor grammar tweaks.]

[HJH 2017-05-22: Missed Steven Pinker’s Tweet. Also, this Skeptic fail may have gone mainstream:

Boghossian and Lindsay likely did damage to the cultural movements that they have helped to build, namely “new atheism” and the skeptic community. As far as I can tell, neither of them knows much about gender studies, despite their confident and even haughty claims about the deep theoretical flaws of that discipline. As a skeptic myself, I am cautious about the constellation of cognitive biases to which our evolved brains are perpetually susceptible, including motivated reasoning, confirmation bias, disconfirmation bias, overconfidence and belief perseverance. That is partly why, as a general rule, if one wants to criticize a topic X, one should at the very least know enough about X to convince true experts in the relevant field that one is competent about X. This gets at what Bryan Caplan calls the “ideological Turing test.” If you can’t pass this test, there’s a good chance you don’t know enough about the topic to offer a serious, one might even say cogent, critique.

Boghossian and Lindsay pretty clearly don’t pass that test. Their main claim to relevant knowledge in gender studies seems to be citations from Wikipedia and mockingly retweeting abstracts that they, as non-experts, find funny — which is rather like Sarah Palin’s mocking of scientists for studying fruit flies or claiming that Obamacare would entail “death panels.” This kind of unscholarly engagement has rather predictably led to a sizable backlash from serious scholars on social media who have noted that the skeptic community can sometimes be anything but skeptical about its own ignorance and ideological commitments.

When the scientists you claim to worship are saying your behavior is unscientific, maaaaybe you should take a hard look at yourself.]

P-hacking is No Big Deal?

Possibly not. simine vazire argued the case over at “sometimes i’m wrong.”

The basic idea is as follows: if we use shady statistical techniques to indirectly raise the p-value cutoff in Null Hypothesis Significance Testing (NHST), we’ll up the rate of false positives we get. Just to put some numbers to this, a p-value cutoff of 0.05 means that when the null hypothesis is true, we’ll draw an unrepresentative sample about 5% of the time and falsely conclude there’s an effect. If we use p-hacking to get an effective cutoff of 0.1, however, then that number jumps up to 10%.
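To make that concrete, here’s a quick Python sketch (mine, not from the original post): under a true null, p-values are uniformly distributed, so the false positive rate simply tracks whatever the effective cutoff is.

```python
import random

random.seed(1)

# Under a true null, p-values are uniform on [0, 1]: a cutoff of alpha
# flags roughly a fraction alpha of all null studies as "significant."
trials = 100_000
p_values = [random.random() for _ in range(trials)]

fp_strict = sum(p <= 0.05 for p in p_values) / trials
fp_hacked = sum(p <= 0.10 for p in p_values) / trials

print(f"false positives at p <= 0.05: {fp_strict:.3f}")
print(f"false positives at p <= 0.10: {fp_hacked:.3f}")
```

Doubling the effective cutoff doubles the false positives, exactly as described above.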

However, p-hacking will also raise the number of true positives we get. How much higher it gets can be tricky to calculate, but this blog post by Erika Salomon gives out some great numbers. During one simulation run, a completely honest test of a false null hypothesis would return a true positive 12% of the time; when p-hacking was introduced, that skyrocketed to 74%.

If the increase in false positives is balanced out by the increase in true positives, then p-hacking makes no difference in the long run. The number of false positives in the literature would be entirely dependent on the power of studies, which is abysmally low, and our focus should be on improving that. Or, if we’re really lucky, the true positives increase faster than the false positives and we actually get a better scientific record via cheating!

We don’t really know which scenario will play out, however, and vazire calls for someone to code up a simulation.

Allow me.

My methodology will be to divide studies into two categories: null results that are never published, and possibly-true results that are. I’ll be using a one-way ANOVA to check whether the averages of two groups drawn from a Gaussian distribution differ. I debated switching to a Student’s t test, but comparing two random draws seems more realistic than comparing one random draw to a fixed mean of zero.

I need a model of effect and sample sizes. This one is pretty tricky; just because a study is unpublished doesn’t mean the effect size is zero, and vice-versa. Making inferences about unpublished studies is tough, for obvious reasons. I’ll take the naive route here, and assume unpublished studies have an effect size of zero while published studies have effect sizes on the same order of actual published studies. Both published and unpublished will have sample sizes typical of what’s published.

I have a handy cheat for that: the Open Science Collaboration published a giant replication of 100 psychology studies back in 2015, and being Open they shared the raw data online in a spreadsheet. The effect sizes are in correlation coefficients, which are easy to convert to Cohen’s d, and when paired with a standard deviation of one that gives us the mean of the treatment group. The control group’s mean is fixed at zero but shares the same standard deviation. Sample sizes are drawn from said spreadsheet, and represent the total number of samples and not the number of samples per group. In fact, it gives me two datasets in one: the original study effect and sample size, plus the replication’s effect and sample size. Unless I say otherwise, I’ll stick with the originals.
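The post doesn’t spell out the conversion; assuming the standard formula d = 2r/√(1 − r²), the cheat looks like this in Python (a sketch, not the post’s Octave code):

```python
import math

def r_to_cohens_d(r: float) -> float:
    """Standard conversion from a correlation coefficient to Cohen's d."""
    return 2.0 * r / math.sqrt(1.0 - r * r)

# With the control group's mean fixed at zero and both standard deviations
# set to one, Cohen's d is exactly the treatment group's mean.
treatment_mean = r_to_cohens_d(0.3)
print(round(treatment_mean, 3))  # 0.629
```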

P-hacking can be accomplished a number of ways: varying the number of tests in the analysis and iteratively running significance tests are but two of the more common. To simplify things, I’ll just assume the effective p-value cutoff is a fixed number, but explore a range of values to get an idea of how a variable p-hacking effect would behave.

For some initial values, let’s say unpublished studies constitute 70% of all studies, and p-hacking can cause a p-value threshold of 0.05 to act like a threshold of 0.08.

Octave shall be my programming language of choice. Let’s have at it!
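The Octave source isn’t reproduced here, so below is a rough Python sketch of the same design, under loudly-labeled assumptions: a handful of hypothetical (Cohen’s d, sample size) pairs stand in for the OSC 2015 spreadsheet rows, and a normal-approximation two-group test stands in for the one-way ANOVA.

```python
import math
import random

random.seed(42)

def two_group_p(n: int, effect_d: float) -> float:
    """Simulate one study: two Gaussian groups of n/2 subjects each, then
    return a two-sided p-value for the difference in means (normal
    approximation, standing in for the post's one-way ANOVA)."""
    half = max(n // 2, 2)
    a = [random.gauss(0.0, 1.0) for _ in range(half)]
    b = [random.gauss(effect_d, 1.0) for _ in range(half)]
    mean_a, mean_b = sum(a) / half, sum(b) / half
    var_a = sum((x - mean_a) ** 2 for x in a) / (half - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (half - 1)
    z = abs(mean_b - mean_a) / math.sqrt(var_a / half + var_b / half)
    return math.erfc(z / math.sqrt(2.0))

# Hypothetical stand-ins for the OSC 2015 rows: (Cohen's d, total sample size).
published = [(0.6, 40), (0.3, 100), (0.9, 30), (0.2, 150), (0.5, 60)]
null_fraction = 0.7   # unpublished studies, assumed to have zero effect
trials = 15_000

def false_positive_rate(threshold: float) -> float:
    fp = tp = 0
    for _ in range(trials):
        d, n = random.choice(published)  # nulls share published sample sizes
        if random.random() < null_fraction:
            fp += two_group_p(n, 0.0) <= threshold
        else:
            tp += two_group_p(n, d) <= threshold
    return fp / (fp + tp)

print(f"straight p <= 0.05: false positive rate {false_positive_rate(0.05):.3f}")
print(f"hacked   p <= 0.08: false positive rate {false_positive_rate(0.08):.3f}")
```

Even with these made-up effect sizes, the hacked threshold reliably yields the worse false positive rate, matching the published/replication numbers below.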

(Template: OSC 2015 originals)
With a 30.00% success rate and a straight p <= 0.050000, the false positive rate is 12.3654% (333 f.p, 2360 t.p)
Whereas if p-hacking lets slip p <= 0.080000, the false positive rate is 18.2911% (548 f.p, 2448 t.p)

(Template: OSC 2015 replications)
With a 30.00% success rate and a straight p <= 0.050000, the false positive rate is 19.2810% (354 f.p, 1482 t.p)
Whereas if p-hacking lets slip p <= 0.080000, the false positive rate is 26.2273% (577 f.p, 1623 t.p)

Ouch, our false positive rate went up. That seems strange, especially as the true positives (“t.p.”) and false positives (“f.p.”) went up by about the same amount. Maybe I got lucky with the parameter values, though; let’s scan a range of unpublished study rates from 0% to 100%, and effective p-values from 0.05 to 0.2. The actual p-value rate will remain fixed at 0.05. So we can fit it all in one chart, I’ll take the proportion of p-hacked false positives and subtract it from the vanilla false positives, so that areas where the false positive rate goes down after hacking are negative.

How varying the proportion of unpublished/false studies and the p-hacking amount changes the false positive rate.

There are no values less than zero?! How can that be? The math behind these curves is complex, but I think I can give an intuitive explanation.

Drawing the distribution of p-values when the result is null vs. the results from the OSC originals.

The diagonal is the distribution of p-values when the effect size is zero; the curve is what you get when it’s greater than zero. As there are more or fewer values in each category, the graphs are stretched or squashed horizontally. The p-value threshold is a horizontal line, and everything below that line is statistically significant. The proportion of false to true results equals the ratio of the lengths along that horizontal line, measured from the origin to where it crosses each curve.

P-hacking is the equivalent of nudging that line upwards. The proportions change according to the slope of the curve. The steeper it is, the less it changes. It follows that if you want to increase the proportion of true results, you need to find a pair of horizontal lines where the horizontal distance increases as fast or faster in proportion to the increase along that diagonal. Putting this geometrically, imagine drawing a line starting at the origin but at an arbitrary slope. Your job is to find a slope such that the line pierces the non-zero effect curve twice.

Slight problem: that non-zero effect curve has negative curvature everywhere. The slope is guaranteed to get steeper as you step up the curve, which means it will curve up and away from where the line crosses it. Translating that back into math, it’s guaranteed that the non-zero effect curve will not increase in proportion with the diagonal. The false positive rate will always increase as you up the effective p-value threshold.
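The concavity argument can be checked numerically. Assuming a one-sided z-test with a hypothetical noncentrality parameter of 1.9 (roughly a 60%-power study), the true-to-false ratio power(α)/α only shrinks as the cutoff is nudged upward:

```python
from statistics import NormalDist

nd = NormalDist()

def power(alpha: float, ncp: float = 1.9) -> float:
    """Power of a one-sided z-test at significance alpha, for a study with
    (hypothetical) noncentrality parameter ncp."""
    return 1.0 - nd.cdf(nd.inv_cdf(1.0 - alpha) - ncp)

# True positives scale with power(alpha), false positives with alpha itself.
# Because the power curve is concave, the ratio falls as alpha creeps up,
# so hacking the threshold always worsens the false positive proportion.
for alpha in (0.05, 0.08, 0.12, 0.20):
    print(f"alpha={alpha:.2f}  power={power(alpha):.3f}  "
          f"ratio={power(alpha) / alpha:.1f}")
```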

And thus, p-hacking is always a big deal.

Intersex and Sex Denialism

This was a pleasant surprise.

For generations those who, for biological reasons, don’t fit the usual male/female categories have faced violence and stigma in Kenya. Intersex people – as they are commonly known in Kenya – were traditionally seen as a bad omen bringing a curse upon their family and neighbours. Most were kept in hiding and many were killed at birth. But now a new generation of home-grown activists and medical experts are helping intersex people to come out into the open. They’re rejecting the old idea that intersex people must be assigned a gender in infancy and stick to it and are calling on the government to instead grant them legal recognition.

While some of those people are trans*, that podcast does talk with a number of intersex people as well. It’s great to see more advocacy; I just wish I’d see more of it in North America, and less of this.

The facts of the world generally don’t support transphobic arguments, and transphobes don’t really have the option of making robust arguments based on an honest assessment of the current state of our knowledge. They know this – they make use of this same technique of pondering counterfactuals. The difference is that they work backwards to fabricate an entirely new counter-reality, tailored to support their positions and vast enough that it can substitute for reality itself in a person’s mind. It’s called denialism: an entire ideological support system made to preserve a desired belief by rejecting the overwhelming evidence that would threaten this belief.

Denialism is wrongness with an infrastructure – ignorance with an armored shell, a whole fake world weaponized against the real world.

Less of the “denialism,” that is, not less of good analysis or of Zinnia Jones. She gets a bit meta behind the link, and the contents are applicable to much more than transphobia. It’s worth a full read.
(That last item comes courtesy of Shiv. Support her work, too!)

Intelligence and Race, in sub-populations

I’ve read a fair number of papers covering race and genes. In fact, before I go farther, here’s a bibliography:

In this article, the authors argue that the overwhelming portion of the literature on intelligence, race, and genetics is based on folk taxonomies rather than scientific analysis. They suggest that because theorists of intelligence disagree as to what it is, any consideration of its relationships to other constructs must be tentative at best. They further argue that race is a social construction with no scientific definition. Thus, studies of the relationship between race and other constructs may serve social ends but cannot serve scientific ends. No gene has yet been conclusively linked to intelligence, so attempts to provide a compelling genetic link of race to intelligence are not feasible at this time. The authors also show that heritability, a behavior-genetic concept, is inadequate in regard to providing such a link.

Sternberg, Robert J., Elena L. Grigorenko, and Kenneth K. Kidd. “Intelligence, race, and genetics.” American Psychologist 60.1 (2005): 46.

The literature on candidate gene associations is full of reports that have not stood up to rigorous replication. This is the case both for straightforward main effects and for candidate gene-by-environment interactions (Duncan and Keller 2011). As a result, the psychiatric and behavior genetics literature has become confusing and it now seems likely that many of the published findings of the last decade are wrong or misleading and have not contributed to real advances in knowledge. The reasons for this are complex, but include the likelihood that effect sizes of individual polymorphisms are small, that studies have therefore been underpowered, and that multiple hypotheses and methods of analysis have been explored; these conditions will result in an unacceptably high proportion of false findings (Ioannidis 2005).

Hewitt, John K. “Editorial Policy on Candidate Gene Association and Candidate Gene-by-Environment Interaction Studies of Complex Traits.” Behavior Genetics 42, no. 1 (January 1, 2012): 1–2. doi:10.1007/s10519-011-9504-z.

Where Bigotry Thrives

All of us strive to be rational. We believe that reality does not contradict itself, that something cannot both exist and not exist at the same time. So when we find a contradiction among our beliefs, we discard part of it to align ourselves more closely with reality. But there’s another, more human reason to weed out contradictions in our views.

Charles Murray, in his interview with Sam Harris, was grilled a bit on universal basic income.

[1:53:17] HARRIS: I’ve heard you talk about it and this is a surprise because, in “Coming Apart” you are fairly critical of the welfare state in all its guises and you- you just said something that at least implied disparagement of the welfare state in Europe, as we know it, so tell me why you are an advocate for universal basic income.

[1:53:40] MURRAY: Well, I first wrote [a] book back in two thousand five or six, called “In Our Hands,” but I did it initially for the same reason that Milton Friedman was in favor of a negative income tax, the idea is that you replace the current system with the universal basic income and, that, you leave people alone to make their decisions about how to use it.

And yet, back in 1984, Murray was singing a different tune.

In Losing Ground, Charles Murray shows that the great proliferation of social programs and policies of the mid-’60s made it profitable for the poor to behave in the short term in ways that were destructive in the long term.

Murray comprehensively documents and analyzes the disturbing course of Great Society social programs. Challenging popular notions that Great Society programs marked the beginning of improvement in the situation of the poor, Murray shows substantial declines in poverty prior to 1964 – but slower growth, no growth, and retreat from progress as public assistance programs skyrocketed.

If we truly want to improve the lot of the poor, Murray declares, we should look to equality of opportunity and to education and eliminate the transfer programs that benefit neither recipient nor donor.

Murray was influential in Reagan’s war on the poor, which was premised on the idea that poor people would unwisely spend their government assistance cheques. Yet now he’s arguing that the poor should be given government assistance without strings attached?! He never acknowledges his about-face, but I think this part of the interview is telling.

[2:00:11] MURRAY: There will be work disincentives, but we are already at a point, Sam, where something more than 20 percent of working-age males with high school diplomas, and no more [education than that], are out of the labor force. So we already have a whole lot of guys, sitting at home, in front of a TV set or a gameboy, probably stoned on meth, or- or opioids, doing nothing. We got a problem already and I see a lot of ways in which the moral agency that an income would give could make the problem less.

[2:00:46] HARRIS: Did the dysfunction you, you see in white and largely rural America now, is it analogous to the dysfunction that we were seeing in the in the black inner-city starting a few decades ago? Are there important differences, or- or how do you how do you view that?

[2:01:05] MURRAY: In some ways it followed pretty much the same trajectory. Way back in nineteen ninety two, or three it was, I had an op-ed in the Wall Street Journal called “Becoming a White Underclass,” and I was simply tracking the growth in a non-marital births among white working-class people, and I said to myself, along with Pat Moynihan who said it better and first, that if you have communities in which large numbers of young men are growing to adulthood without a male figure, you asked for and get chaos. And I assume that what had happened in the black community when non-marital births, uh, kept on going up is going to happen in the white community. So in that sense they follow pretty much a predictable trajectory.

In the 1980s, the face of poverty was black and addicted to drugs. Now, it’s white and addicted to drugs. Changing the race of those impoverished may have changed Murray’s views of poverty.

We dug into a contradiction Murray held, and found bigotry hiding underneath. This is no coincidence: persistent contradictions in your worldview are fertile ground for bigotry. All the atheists in the crowd know this.

To evade the charge of bigotry, you need to do more than say that you sincerely believe that the Bible is against gay marriage. You need to explain why you take the clobber verses as something important and relevant to today, while the statements like “Let the man with two tunics share with him who has none,” aren’t.

There are arguments against taking the missional verses and the poverty verses and trying to apply them today. Of course, many of those arguments could be turned against the clobber verses as well. Can it be shown that there is a consistent means of interpretation that would lead to the clobber verses being taken literally while the charity verses should be basically ignored?

Or think of it this way: would the hypothetical “man from Mars” who was innocent of Christianity and the culture wars really look at the Bible and come away saying, “Wow, we’ve really got to do something to stop gay marriage”?

Think about how this looks from the outside. The parts of the Bible that you believe apply today are the ones that require other people to make sacrifices. The parts of the Bible that would require YOU to make big sacrifices are not considered relevant. Look at it this way, and you’ll see why “bigot” is one of the nicer things you could be called.

Contradictions allow you to pick and choose which rules you follow, letting you benefit while others come to harm. They also provide a great shield against criticism.

[59:06] MURRAY: Dick and I, our- our crime in the book was to have a single, solitary paragraph that said – after talking about the patterns that I’m about to describe – “if we’ve convinced you that either the environmental or the genetic explanation has won out to the exclusion of the other, we haven’t done a good enough job presenting the evidence for one side of the other. It seems to us highly likely that both genes and the environment have something to do with racial differences.” And we went no farther than that. There is an asymmetry between saying “probably genes have some involvement” and the assertion that it’s entirely environmental and that’s what the, that’s the assertion that is being made. If you’re going to be upset at “The Bell Curve,” you are obligated to defend the proposition that the black/white difference in IQ scores is 100% environmental, and that’s a very tough measure.

Hit Murray with the charge that he’s promoting genetic determinism, and he’ll point to that paragraph in “The Bell Curve” and say you’re straw-personing his views. Argue that intelligence is primarily driven by environment and he’ll either point to the hundreds of pages and dozens of charts that he says demonstrate a genetic link that’s much stronger than environment, or he’ll equivocate between “primarily driven by environment” and “100% environmental.” Nor is this an isolated incident. Remember his bit about “large numbers of young men are growing to adulthood without a male figure, you asked for and get chaos?”

[40:23] MURRAY: … the thing about the non-shared environment is it’s not susceptible to systematic manipulation. It’s … idiosyncratic. It’s non-systematic … there are no obvious ways that you can deal with the non-shared environment, in the way that you could say “Oh, we can improve the schools, we can teach better parenting practices, we can provide more money for – …” [those] all fall into the category of manipulating the shared environment and when it comes to personality, as you just indicated, it’s 50/50 [for genes and environment] but almost all that 50 is non-shared.

[41:02] HARRIS: Yeah, which seems to leave parents impressively off the hook for … how their kids turn out.

[41:10] MURRAY: Although it is true that parents – and I’m a father of four – uh, we resist that. … and with the non-shared environment and the small role left for parenting, I will say it flat out: I read [the research of Judith Rich Harris] with *the* most skeptical possible eye. I was looking for holes in it, assiduously. …

[41:57] MURRAY: … the book was very sound, it was very rigorously done, and … at this point I don’t know of anybody who’s familiar with literature, who thinks there’s that much of a role left of the kind of parents thought they had in shaping their children.

[42:15] HARRIS: Right, well I’m not gonna stop trying, I think, it’s [a] very hard illusion to cut through… as I read Harry Potter tonight to my eldest daughter.

[42:23] MURRAY: … You know that, but I think that it’s good to reflect on that: reading Harry Potter to your eldest daughter is a good in itself.

[42:32] HARRIS: Yeah.

[42:35] MURRAY: And the fact that she behaves differently 20 years from now is not the point.

[42:38] HARRIS: No, exactly, and it is an intrinsic good, and it’s for my own pleasure that I do it largely at this point.

Murray also thinks that nothing a parent does will change their child’s development. His ability to flip between both sides of a contradiction is Olympic.

[43:12] HARRIS: That’s the one thing that it just occurred to me people should also understand is that, in addition to the fact that IQ doesn’t explain everything about a person’s success in life and … their intellectual abilities, the fact that a trait is genetically transmitted in individuals does not mean that all the differences between groups, or really even any of the differences between groups in that trait, are also genetic in origin, right?

[43:41] MURRAY: Critically important, critically important point.

[43:42] HARRIS: Yeah, so the jury can still be out on this topic, and we’ll talk about that, but to give a clear example: so if you have a population of people that is being systematically malnourished – now they might have genes to be as tall as the Dutch, but they won’t be because they’re not getting enough nourishment. And, in the case that they don’t become as tall as the Dutch, it will be entirely due to their environment and yet we know that height is among the most heritable things we’ve got – it’s also like 60 to 80 percent predicted by a person’s genes.

[44:15] MURRAY: Right. Uh, the comparison we use in the book … is that, you take a handful of genetically identical seed-corn, and divide it into two parts, and plant one of those parts in Iowa and the other part in the Mojave Desert, you’re going to get way different results. Has nothing whatsoever to do with the genetic content of the corn.

It’s no wonder that when Harris asked him if anything discovered since publication had changed his claims, his response was no. As he inhabits both sides of a contradiction, nothing could falsify his views.

Contradictions are also a way to change your views without acknowledging you did. Consider this small bit of trivia Murray throws out (emphasis mine):

[1:40:53] HARRIS: If my life depended on it, I could not find another person [besides Christopher Hitchens] who smoked cigarettes in my contact list, you know, and let’s say there’s a thousand people in there, right?

[1:41:04] MURRAY:  Hmm mm-hmm.

[1:41:05] HARRIS: That’s an amazing fact in a society where something like 30% of people smoke cigarettes.

[1:41:12] MURRAY: That’s a wonderful illustration of how isolated [we are within our classes]… because, in my case, I do know people who smoke cigarettes but that’s only because I go play poker at Charleston West Virginia casino and there, about 30% of the guys I played poker with smoked. But that’s ok. In terms of [the] American Enterprise Institute, where I work, [I] don’t know anybody who smokes there, I don’t… social circles, no.

If you have a long memory, that small tidbit packs quite a punch.

Let’s begin by referring to the basic objectives of the program:

  1. To show that the basic social cost changes are bad economics.
  2. To illustrate how smoking benefits society and its members.
  3. To show that anti-smoking groups, who are promoting the social cost issue, have self-serving ends, and are not representative of the general society.

In short, we took as our goals a defense which would undermine the concepts of the social cost issue, and an offense which would stress the social benefits of smoking and freedom to smoke.

In 1980, the American Enterprise Institute was preparing reports and training videos that argued smoking is a net benefit to society. Among other things, it claimed that worker productivity was better when people took regular smoke breaks, and that restrictions on cigarettes harmed personal liberty.

In 2017, the proportion of smokers at the American Enterprise Institute is far lower than in the general population. If you value being free of contradictions, a reversal like this should cause you some tough introspection about who you allow into your think-tank. If you don’t, no introspection is necessary. There’s no need to criticize yourself, no need to submit yourself to annoying audits; you can just carry on being awesome.

Like Sam Harris. Emphasis mine.

[1:39] HARRIS: Human intelligence itself is a taboo topic; people don’t want to hear that intelligence is a real thing, and that some people have more of it than others. They don’t want to hear that IQ tests really measure it. They don’t want to hear that differences in IQ matter because they’re highly predictive of differential success in life, and not just for things like educational attainment and wealth, but for things like out-of-wedlock birth and mortality. People don’t want to hear that a person’s intelligence is in large measure due to his or her genes, and there seems to be very little we can do environmentally to increase a person’s intelligence, even in childhood. It’s not that the environment doesn’t matter, but genes appear to be 50 to 80 percent of the story. People don’t want to hear this, and they certainly don’t want to hear that average IQ differs across races and ethnic groups. Now, for better or worse, these are all facts.

[5:32] HARRIS: Whatever the difference in average IQ is across groups, you know nothing about a person’s intelligence on the basis of his or her skin color. That is just a fact. There is much more variance among individuals in any racial group than there is between groups.

If the mean IQs of people grouped by skin colour are different, then you must know something about a person’s intelligence by knowing their skin colour. Head over to R Psychologist’s illustration of Cohen’s d and keep a close eye on the “probability of superiority.” For instance, when d = 0.1, the fine print tells me “there is a 53 % chance that a person picked at random from the treatment group will have a higher score than a person picked at random from the control group (probability of superiority),” which means that if I encounter someone from group A I can state they have a higher intelligence than someone from group B with odds slightly better than chance. There’s only one situation where knowing someone’s skin colour tells me nothing about their intelligence, and that’s when the mean IQs of both groups are equal.
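
Under the normal model that underlies Cohen’s d (two normal distributions with equal variance whose means differ by d standard deviations), the probability of superiority has a closed form, Φ(d/√2). A minimal sketch using SciPy, my own illustration rather than anything from the R Psychologist page:

```python
from math import sqrt

from scipy.stats import norm

def probability_of_superiority(d):
    """Chance that a random draw from group A beats a random draw
    from group B, when the group means differ by d standard
    deviations and both groups are normal with equal variance.
    The difference A - B is then normal with mean d and sd sqrt(2),
    so P(A > B) = Phi(d / sqrt(2))."""
    return norm.cdf(d / sqrt(2))

print(probability_of_superiority(0.1))  # ≈ 0.528, the "53%" in the text
print(probability_of_superiority(0.0))  # exactly 0.5: no information at all
```

Only at d = 0 does the probability collapse to an exact coin flip, which is the point being made above.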

You could counter “so what, that 53% chance is so small as to be no different than 50/50,” and I’d agree with you. But if Murray demonstrated group differences of the same magnitude, his conclusion should not have been “IQ differs between races,” it should have been “IQ is effectively equal across racial lines.” By making this counter-argument, you’ve abandoned the ability to say mean IQ varies across groups. “Average IQ differs across races” and “skin colour conveys information about IQ” are equivalent statements, so Sam Harris is contradicting himself.

Contradictions are a chronic problem for him. It should come as no surprise that Sam Harris is always right, and that entire websites are wrong.

A few of the subjects I explore in my work have inspired an unusual amount of controversy. Some of this results from real differences of opinion or honest confusion, but much of it is due to the fact that certain of my detractors deliberately misrepresent my views. The purpose of this article is to address the most consequential of these distortions. […]

Whenever I respond to unscrupulous attacks on my work, I inevitably hear from hundreds of smart, supportive readers who say that I needn’t have bothered. In fact, many write to say that any response is counterproductive, because it only draws more attention to the original attack and sullies me by association. These readers think that I should be above caring about, or even noticing, treatment of this kind. Perhaps. I actually do take this line, sometimes for months or years, if for no other reason than that it allows me to get on with more interesting work. But there are now whole websites—Salon, The Guardian, Alternet, etc.—that seem to have made it a policy to maliciously distort my views.

Disagreement is due to misunderstanding, not genuine error. Ergo, he cannot be a bigot.

This, then, is a strong second reason to examine yourself for contradictions. Don’t just do it to stay in line with reality, do it to help rid yourself of bigotry against your fellow person.

Gimmie that Old-Time Breeding

Full disclosure: I think Evolutionary Psychology is a pseudo-science. This isn’t because the field endorses a flawed methodology (relative to the norm in other sciences), nor because its practitioners come to conclusions I’m uncomfortable with. No, the entire field is based on flawed or even false assumptions. It doesn’t matter how good your construction techniques are: if your foundation is a banana cream pie, your building won’t be sturdy.

But maybe I’m wrong. Maybe EvoPsych researchers are correct when they say every other branch of social science is founded on falsehoods. So let’s give one of their papers a fair shake.

Ellis, Lee, et al. “The Future of Secularism: a Biologically Informed Theory Supplemented with Cross-Cultural Evidence.” Evolutionary Psychological Science: 1-19.

Everything Is Significant!

Back in 1938, Joseph Berkson made a bold statement.

I believe that an observant statistician who has had any considerable experience with applying the chi-square test repeatedly will agree with my statement that, as a matter of observation, when the numbers in the data are quite large, the P’s tend to come out small. Having observed this, and on reflection, I make the following dogmatic statement, referring for illustration to the normal curve: “If the normal curve is fitted to a body of data representing any real observations whatever of quantities in the physical world, then if the number of observations is extremely large—for instance, on an order of 200,000—the chi-square P will be small beyond any usual limit of significance.”

This dogmatic statement is made on the basis of an extrapolation of the observation referred to and can also be defended as a prediction from a priori considerations. For we may assume that it is practically certain that any series of real observations does not actually follow a normal curve with absolute exactitude in all respects, and no matter how small the discrepancy between the normal curve and the true curve of observations, the chi-square P will be small if the sample has a sufficiently large number of observations in it.

Berkson, Joseph. “Some Difficulties of Interpretation Encountered in the Application of the Chi-Square Test.” Journal of the American Statistical Association 33, no. 203 (1938): 526–536.

His prediction would be vindicated two decades later.
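
Berkson’s claim is easy to reproduce today. The simulation below (my own illustration, using NumPy and SciPy) draws data from a Student’s t-distribution with 10 degrees of freedom, which is visually almost indistinguishable from a normal curve, and runs a chi-square goodness-of-fit test against a fitted normal at two sample sizes:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def chi2_normality_p(x, bins=20):
    """Chi-square goodness-of-fit p-value for the hypothesis that
    x was drawn from a normal curve (mean and sd estimated from x)."""
    mu, sigma = x.mean(), x.std(ddof=1)
    # interior bin edges at equal-probability quantiles of the fitted normal
    edges = stats.norm.ppf(np.linspace(0, 1, bins + 1)[1:-1], mu, sigma)
    observed = np.bincount(np.searchsorted(edges, x), minlength=bins)
    expected = np.full(bins, len(x) / bins)
    # two degrees of freedom lost to the estimated mean and sd
    return stats.chisquare(observed, expected, ddof=2).pvalue

# Data that is almost, but not exactly, normal: Student's t with 10 d.o.f.
small = rng.standard_t(10, size=500)
large = rng.standard_t(10, size=200_000)

print(chi2_normality_p(small))  # usually comfortably above 0.05
print(chi2_normality_p(large))  # vanishingly small, as Berkson predicted
```

The underlying distribution is identical in both cases; only the sample size changes, and with 200,000 observations even a tiny departure from normality drives the p-value through the floor.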

Stop Assessing Science

I completely agree with PZ, in part because I’ve heard the same tune before.

The results indicate that the investigators contributing to Volume 61 of the Journal of Abnormal and Social Psychology had, on the average, a relatively (or even absolutely) poor chance of rejecting their major null hypotheses, unless the effect they sought was large. This surprising (and discouraging) finding needs some further consideration to be seen in full perspective.

First, it may be noted that with few exceptions, the 70 studies did have significant results. This may then suggest that perhaps the definitions of size of effect were too severe, or perhaps, accepting the definitions, one might seek to conclude that the investigators were operating under circumstances wherein the effects were actually large, hence their success. Perhaps, then, research in the abnormal-social area is not as “weak” as the above results suggest. But this argument rests on the implicit assumption that the research which is published is representative of the research undertaken in this area. It seems obvious that investigators are less likely to submit for publication unsuccessful than successful research, to say nothing of a similar editorial bias in accepting research for publication.

Statistical power is the probability of correctly rejecting a false null hypothesis. The larger the study size, the greater the statistical power. Thus if your study has a poor chance of answering the question it is tasked with, it is too small.

Suppose we hold fixed the theoretically calculable incidence of Type I errors. … Holding this 5% significance level fixed (which, as a form of scientific strategy, means leaning over backward not to conclude that a relationship exists when there isn’t one, or when there is a relationship in the wrong direction), we can decrease the probability of Type II errors by improving our experiment in certain respects. There are three general ways in which the frequency of Type II errors can be decreased (for fixed Type I error-rate), namely, (a) by improving the logical structure of the experiment, (b) by improving experimental techniques such as the control of extraneous variables which contribute to intragroup variation (and hence appear in the denominator of the significance test), and (c) by increasing the size of the sample. … We select a logical design and choose a sample size such that it can be said in advance that if one is interested in a true difference provided it is at least of a specified magnitude (i.e., if it is smaller than this we are content to miss the opportunity of finding it), the probability is high (say, 80%) that we will successfully refute the null hypothesis.

If low statistical power were just due to a few bad apples, it would be rare. Instead, as the first quote implies, it’s quite common. That study found that for small effect sizes, where Cohen’s d is roughly 0.25, the average statistical power was an abysmal 18%. For medium effect sizes, where d is roughly 0.5, the figure was still less than half. Since those two ranges cover the majority of effect sizes in the social sciences, the typical study has very low power and thus too small a sample. The problem isn’t a few careless researchers; low power must be systemic to how science is carried out.
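
To see how small those samples must have been, we can invert the power calculation Meehl describes. The sketch below uses the normal approximation to a two-sided, two-sample t-test; the figure of 32 subjects per group is my own illustrative choice, not a number from Cohen’s survey:

```python
from math import sqrt

from scipy.stats import norm

def power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample comparison of
    means (normal approximation to the t-test) at effect size d."""
    z_crit = norm.ppf(1 - alpha / 2)
    shift = d * sqrt(n_per_group / 2)  # expected z-statistic under H1
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)

def n_for_power(d, target=0.80, alpha=0.05):
    """Smallest per-group sample size reaching the target power."""
    n = 2
    while power_two_sample(d, n, alpha) < target:
        n += 1
    return n

print(power_two_sample(0.25, 32))  # ≈ 0.17, near Cohen's 18% for small effects
print(power_two_sample(0.50, 32))  # ≈ 0.52, barely better than a coin flip
print(n_for_power(0.25))           # ≈ 252 per group for Meehl's 80% benchmark
```

Roughly 30 subjects per group, a plausible size for that era’s literature, reproduces the dismal power figures Cohen found; reliably detecting a small effect takes nearly an order of magnitude more data.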

In this fashion a zealous and clever investigator can slowly wend his way through a tenuous nomological network, performing a long series of related experiments which appear to the uncritical reader as a fine example of “an integrated research program,” without ever once refuting or corroborating so much as a single strand of the network. Some of the more horrible examples of this process would require the combined analytic and reconstructive efforts of Carnap, Hempel, and Popper to unscramble the logical relationships of theories and hypotheses to evidence. Meanwhile our eager-beaver researcher, undismayed by logic-of-science considerations and relying blissfully on the “exactitude” of modern statistical hypothesis-testing, has produced a long publication list and been promoted to a full professorship. In terms of his contribution to the enduring body of psychological knowledge, he has done hardly anything. His true position is that of a potent-but-sterile intellectual rake, who leaves in his merry path a long train of ravished maidens but no viable scientific offspring.

I know, it’s a bit confusing that I haven’t clarified who I’m quoting. The first quote comes from this study:

Cohen, Jacob. “The Statistical Power of Abnormal-Social Psychological Research: A Review.” The Journal of Abnormal and Social Psychology 65, no. 3 (1962): 145.

While the second and third are from this:

Meehl, Paul E. “Theory-Testing in Psychology and Physics: A Methodological Paradox.” Philosophy of Science 34, no. 2 (1967): 103–115.

That’s right, scientists have been complaining about small sample sizes for over 50 years. Fanelli et al. (2017) might provide greater detail and evidence than previous authors did, but the basic conclusion has remained the same. Nor are these two studies lone wolves in the darkness; I wrote about a meta-analysis of 16 different power-level studies between Cohen’s and now, all of which agree with his findings.

If your assessments have been consistently telling you the same thing for decades, maybe it’s time to stop assessing. Maybe it’s time to start acting on those assessments, instead. PZ is already doing that, thankfully…

More data! This is also helpful information for my undergraduate labs, since I’m currently in the process of cracking the whip over my genetics students and telling them to count more flies. Only a thousand? Count more. MORE!

… but this is a chronic, systemic issue within science. We need more.