LGBTQ+ People Are Not Going Back

I could use a break from my current draft (which is already over the 3,000 word mark, alas), and this is certainly worthy of my time:

I propose that on Tuesday, December 3rd, 2024 (the first day that both the House and Senate are back in session), all of us who are invested in this issue and have a platform (whether it be a blog, newsletter, column, podcast, YouTube, TikTok, Instagram, etc.) publish a piece with the shared title: “LGBTQ+ People Are Not Going Back.” Yes, I know, it’s a cheesy title, but it holds Democrats accountable to their own talking points and makes it clear that backsliding on LGBTQ+ rights is nonnegotiable for us.

Easy peas- wait, “Democrats?”

What you write or say or express in your op-ed or article or video or podcast etcetera is up to you. I encourage you to make it personal and feel free to tailor it to your audience. My only request (other than all of us using the same title) is that you implore people to contact their Congressperson and Senators (and perhaps even local politicians) and tell them that 1) you will not tolerate any backpedaling on LGBTQ+ rights whatsoever, and 2) if they fail to strongly stand up against these attacks on LGBTQ+ rights, then you will take your vote elsewhere next election.

Ah, this is somewhat US-centric. Unfortunately, I live in their hat and thus I doubt any Democratic politician would listen to me.

However, I do live in a province with a government that’s decided to demonize transgender people. Bill 29 is quite draconian: sports organizations are supposed to “establish, implement and maintain policies respecting fairness and safety,” which they must report to the government. They must also report any complaints about those policies, “requests for” or “the establishment of mixed-gender or mixed-sex leagues, classes or divisions,” and “other matters.” Anyone carrying out those orders is shielded from legal liability. What constitutes “fairness and safety?” That’s not up to the sporting organization, oh no; the government’s cabinet has full authority to prescribe “provisions or content that policies must include.” The language is very vague, with plenty of loopholes a bigot could exploit.

It has been widely condemned, a legal challenge has been launched, and even its mere proposal has made national sporting organizations rethink hosting events in our province. During its second reading, where it was supposed to be debated, both MLAs who rose to speak about the bill condemned it:

Hon. Hayter (NDP, Calgary-Edgemont): This bill is only going to discourage youth from participating. Bill 29 states that you want to have sports participation, but it really is just going to add more red tape for people to participate. Based on this government’s announcement, this bill is the first step forward barring trans women and girls from participating in women’s sports at all levels, starting at school level to being a professional athlete. …

Last year a nine-year-old girl – nine years old – participating in track and field in B.C. was harassed by people because she had short hair, so they made the assumption that she must be trans. A little girl. This government is giving a free pass to harassers in the name of protecting women in sport. This makes all women unsafe, especially Black, Indigenous, and other racialized women as well as women who are now going to be considered insufficiently feminine.

Hon. Elmeligi (NDP, Banff-Kananaskis): I want to zoom in on this idea about this unfair advantage, that somehow trans women have an unfair advantage over other athletes in sport. This idea is not supported by any science at all. … Really, this idea is based on the assumption that trans women have more testosterone, so let’s explore that a little bit. More testosterone leads to bigger muscles, faster times, tends to be associated with being stronger and faster, but that is so wrong, Mr. Chair. Again, we find a government basing policies on stereotypes, assumptions, transphobia, and just utter nonsense.

Here’s the reality check. In Judith Butler’s book Who’s Afraid of Gender? she really dives into this, and I highly recommend that all members in the House check out this book. Basically, the research shows that testosterone varies widely between and within genders. The research shows that there is considerable overlap in testosterone levels between genders: 16 and a half per cent of men have very low testosterone, 13 and a half per cent of women have higher than average testosterone, and there’s a lot overlap in those levels between genders.

It passed its second reading 43-31, with no amendments. More importantly, though, no one dared voice support for it. Even the bigots know that it’s indefensible! The silence is telling: the UCP know this isn’t going to earn them votes; if anything, they’ll lose support should it become more widely known. If you’re an Albertan, now would be an excellent time to hammer that point home. Page 32 of the written record for that session lists every MLA who voted for or against Bill 29. The Alberta Government helpfully lists every current MLA. Get in touch with your MLA, and either thank them for voting against Bill 29 or politely ask them why they won’t defend their vote.

I’ve already done that process myself, and I can say it was well worth my time. My MLA has vocally supported transgender people, and they voted against Bill 29 on second reading. Now it’s your turn! I can guarantee it’ll be more satisfying than doom-scrolling US politics.

[HJH 2024-12-03] Whoops, I made the amateur mistake of assuming the second reading of this bill took place over one session. It actually was spread over three that occurred November 6th, 21st, and 26th. During the first session Hon. Schow (UCP, Cardston-Siksika) brought up the inherent strength advantages that men have over women (irrelevant, oversimplified) and that fairness demands transgender people be excluded (false). Hon. Armstrong-Homeniuk (UCP, Fort Saskatchewan-Vegreville) brought up the fairness of sport and the transformative power of sport (also covered by fairness). Hon. Petrovic (UCP, Livingstone-Macleod) again banged the fairness drum (it’s their best argument, which is damning). During the second session Hon. Johnson (UCP, Lacombe-Ponoka) recycled the “fairness” and “inherent strength” talking points from earlier.

One new argument comes from Hon. Petrovic’s name-drop of Reem Alsalem, who claims that nearly 890 medals were “unfairly” won by transgender athletes. Turns out it’s absolute batshit nonsense that hinges heavily on appeals to authority. Hon. Schmidt (NDP, Edmonton-Gold Bar) inadvertently spotted the game being played here:

With respect to science we heard the Member for Lacombe-Ponoka as well as the minister for sport refer to this report by the special rapporteur to the United Nations on women and gender-based violence. There is a quote in there, that the Member for Lacombe-Ponoka said, about the hundreds of medals that have been stripped from women competing in dozens of sports, and if you look at the footnote for that in the report, that claim is made by an organization called the Womens Liberation Front, which according to their website also unapologetically supports abortion on demand. So I look forward to the members opposite also endorsing the other work that the Womens Liberation Front is proposing.

Oh, so Alsalem sourced that figure from Women’s Liberation Front, that “gender critical” organization with strong ties to the US Christian nationalist movement. And as I touch on in that blog post, TERFs have been lobbying the UN for years to add the aura of authority to their arguments. If the evidence isn’t on your side, misinformation and authoritarianism are your only hope of getting your policies implemented.

Which, I suppose, explains why the cause is so attractive to our United Conservative Party.

A Transgender Athlete Reader

Remember this old thing?

Rationality Rules was so confident nobody would take him to task, his “improved” video contains the same arguments as his “flawed” one. And honestly, he was right; I’ve seen this scenario play out often enough within this community to know that we try to bury our skeletons, that we treat our minorities like shit, that we “skeptics” are just as prone to being blind followers as the religious/woo crowds we critique. And just like all those other times, I cope by writing words until I get sick of the topic. Sometimes, that takes a while.

In hindsight, “a while” turned out to be seven months and about seventeen blog posts. Why on Earth would I spend so much time and effort focused on one vlogger? I don’t think I ever explained why in those posts, so let’s fix that: the atheist/skeptic movement has a problem with transphobia. From watching my peers insinuate Ann Coulter was a man, to my participation in l’affaire Benson, I gradually went from “something feels off about this” to “wow, some of my peers are transphobes.”

As I picked apart the arguments made by transphobes, I started to see patterns. Much like with religious and alt-Right extremists, there’s a lot of recycling going on. Constantly, apologists are forced to search for new coats of paint to cover up old bigoted arguments. I spotted a shift from bathroom rhetoric to sports rhetoric in early 2019 and figured that approach would have a decent lifespan. So when Rationality Rules stuck to his transphobic guns, I took it as my opportunity to defuse sports-related transphobic arguments in general. If I did a good enough job, most of these posts would still be applicable when the next big-name atheist or skeptic tried to invoke sports.

My last post was a test of that. It was a draft I’d been nursing for months back in 2019, but after a fair bit of research and some drastic revisions I’d gotten Rationality Rules out of my system via other posts. So I set it aside as a test. If I truly was right about this shift to sports among transphobes, it was only a matter of time until someone else in the skeptic/atheist community would make a similar argument and some minor edits would make it relevant again. The upshot is that a handful of my readers were puzzled by this post about Rationality Rules, while the vast majority of you instead saw this post about Shermer and Shrier.

The two arguments aren’t quite the same. Rationality Rules emphasizes that “male puberty” is his dividing line; transgender women who start hormone therapy early enough can compete as women, according to him, and he relies on that to argue he’s not transphobic at all. Shermer is nowhere near as sophisticated, arguing for a new transgender-specific sporting category instead. Shrier takes the same stance as Rationality Rules, but she doesn’t push back on Shermer’s opinions.

But not only are the differences small, I doubt many people had “women are inherently inferior to men in domain X” on their transphobe bingo card. And yet, the same assertion was made at two very different times by three very different people. I consider this test a roaring success.

One consequence is that most of my prior posts on Rationality Rules’ arguments against transgender athletes still hold quite a bit of value, and are worth boosting. First, though, I should share the three relevant posts that got me interested in sports-related apologia:

Trans Athletes, the Existence of Gender Identity, … / … and Ophelia Benson: The first post proposed two high-level arguments in favour of allowing transgender athletes to compete as the gender they identify with. The second is mostly about calling out Benson for blatant misgendering, but I also debunk some irrational arguments made against transgender athletes.

I Think I Get It: My research for the prior two posts led me to flag sport inclusion as the next big thing in transphobic rhetoric. The paragraph claiming “they think of them as the worst of men” was written with Benson in mind, but was eerily predictive of Shermer.

And finally, the relevant Rationality Rules posts:

EssenceOfThought on Trans Athletes: This is mostly focused on EssenceOfThought‘s critique of Rationality Rules, but I slip in some extras relating to hemoglobin and testosterone.

Rationality Rules is an Oblivious Transphobe: My first crack at covering the primary factors of athletic performance (spoiler alert: nobody knows what they are) and the variation present. I also debunk some myths about transgender health care, refute some attempts to shift the burden of proof or argue evidence need not be provided.

Texas Sharpshooter: My second crack at athletic performance and its variance, this time with better analysis.

Rationality Rules is “A Transphobic Hack“: This is mostly commentary specific to Rationality Rules, but I do link to another EssenceOfThought video.

Special Pleading: My second crack at the human rights argument, correcting a mistake I made in another post.

Rationality Rules is a “Lying” Transphobe: I signal boost Rhetoric&Discourse‘s video on transgender athletes.

“Rationality Rules STILL Doesn’t Understand Sports”: A signal boost of Xevaris‘ video on transgender athletes.

Lies of Omission: Why the principle of “fair play” demands that transgender athletes be allowed to compete as their affirmed gender.

Begging the Question: How the term “male puberty” is transphobic.

Rationality Rules Is Delusional: Rob Clark directs me to a study that deflates the muscle fibre argument.

Cherry Picking: If transgender women possess an obvious performance benefit, you’d expect professional and amateur sporting bodies to reach a consensus on that benefit existing and to write their policies accordingly. Instead, they’re all over the place.

Separate and Unequal: I signal boost Colleen Tighe‘s comic on transgender athletes.

Rationality Rules DESTROYS Women’s Sport!!1!: I take a deep dive into a dataset on hormone levels in professional athletes, to see what would happen if we segregated sports by testosterone level. The title gives away the conclusion, alas.

That takes care of most of Shermer and Shrier’s arguments relating to transgender athletes, and the remainder should be pretty easy. I find it rather sad that neither is as skilled at transphobic arguments as Rationality Rules was. Is the atheist/skeptic community getting worse on this subject?

The Weaker Sex

There’s an odd asymmetry in how Shermer and Shrier think about transgender athletes. They talk exclusively about transgender women entering women’s sport, but ignore the possibility of transgender men entering men’s sport. A sample:

[1:00:47] SHRIER: Sometimes people look at the numbers and they say there aren’t that many transgender kids, so there’s no reason for the moral panic. Who cares if the number one, two, and three spots go to biological boys? First of all, there’s obviously the incredible unfairness of fixing the race … telling girls “oh, you’ll never ever, no matter what you do, no matter how hard you train, you will never be number one. You will never make regionals.”… That’s a very different prospect for young women … [1:01:19]

So being assigned female doesn’t offer any advantages in any sport? At all? Let’s make a case for a female advantage. I’ll point out the logical and rhetorical flaws I’m deploying via tool-tips.

Rationality Rules is a Violent Transphobe

I thought I knew how this post would play out. EssenceOfThought has gotten some flak for declaring Stephen Woodford to be a “violent transphobe,” which I didn’t think they deserved. They gave a good defense in one of their videos, starting off with a definition of violence.

You see, violence is defined as the following by the World Health Organization. Quote; “the intentional use of physical force or power, threatened or actual, against oneself, another person, or against a group or community, that either results in, or has a high likelihood of resulting in injury, death, psychological harm, maldevelopment or deprivation.”

EoT points out that controlling someone’s behaviour or social networks by using their finances as leverage can be considered economic violence. They also point out that using legislation to control access to abortion can be considered legislative violence, as it deprives a person of their right to bodily autonomy. And thus, as EoT explains,

When you exclude trans women from women’s sports you’re not simply violating numerous human rights. You’re designating them as not real women, as an invasive force coming to take what doesn’t belong to them. You are cultivating future transphobic violence.

Note the air gap: “cultivating violence” and “violence” are not the same thing, and the definition EoT quoted above places intent front-and-centre. EoT bridges the gap by pointing out they gave Rationality Rules several months to demonstrate he promoted violent policies out of ignorance, rather than with intent. When “he [doubled] down on his violent transphobia,” EoT had sufficient evidence of intent to justify calling him a “violent transphobe.”

At this point I’d shore up their one citation with a few more. This decoupling of physical force and violence is not a new argument in the philosophy and social sciences literature.

Violence often involves physical force, and the association of force with violence is very close: in many contexts the words become synonyms. An obvious instance is the reference to a violent storm, a storm of great force. But in human affairs violence and force, cannot be equated. Force without violence is often used on a person’s body. If a person is in the throes of drowning, the standard Red Cross life-saving techniques specify force which is certainly not violence. To equate an act of rescue with an act of violence would be to lose sight entirely of the significance of the concept. Similarly, surgeons and dentists use force without doing violence.

Violence in human affairs is much more closely connected with the idea of violation than with the idea of force. What is fundamental about violence is that a person is violated. And if one immediately senses the truth of that statement, it must be because a person has certain rights which are undeniably, indissolubly, connected with being a person. One of these is a right to one’s body, to determine what one’s body does and what is done to one’s body — inalienable because without one’s body one would cease to be a person. Apart from a body, what is essential to one’s being a person is dignity. The real dignity of a person does not consist in remaining “dignified”, but rather in the ability to make decisions.

Garver, Newton. “What violence is.” The Nation 209.24 (1968): 819-822.

As a point of departure, let us say that violence is present when human beings are being influenced so that their actual somatic and mental realizations are below their potential realizations. […]

The first distinction to be made is between physical and psychological violence. The distinction is trite but important mainly because the narrow concept of violence mentioned above concentrates on physical violence only. […] It is useful to distinguish further between ’biological violence’, […] and ’physical violence as such’, which increases the constraint on human movements – as when a person is imprisoned or put in chains, but also when access to transportation is very unevenly distributed, keeping large segments of a population at the same place with mobility a monopoly of the selected few. But that distinction is less important than the basic distinction between violence that works on the body, and violence that works on the soul; where the latter would include lies, brainwashing, indoctrination of various kinds, threats, etc. that serve to decrease mental potentialities. […]

We shall refer to the type of violence where there is an actor that commits the violence as personal or direct, and to violence where there is no such actor as structural or indirect. In both cases individuals maybe killed or mutilated, hit or hurt in both senses of these words, and manipulated by means of stick or carrot strategies. But whereas in the first case these consequences can be traced back to concrete persons as actors, in the second case this is no longer meaningful. There may not be any person who directly harms another person in the structure. The violence is built into the structure and shows up as unequal power and consequently as unequal life chances.

Galtung, Johan. “Violence, peace, and peace research.” Journal of peace research 6.3 (1969): 167-191.

This expansive definition of “violence” has been influential: Galtung’s fifty-year-old paper from above has been cited over 6,000 times according to Google Scholar. “Influential” is not a synonym for “consensus,” however.

Nearly all inquiries concerning the phenomenon of violence demonstrate that violence not only takes on many forms and possesses very different characteristics, but also that the current range of definitions is considerable and creates ample controversies concerning the question what violence is and how it ought to be defined (…). Since there are so many different kinds of violence (…) and since violence is studied from different actor perspectives (i.e. perpetrator, victim, third party, neutral observer), existing literature displays a wide variety of definitions based on different theoretical and, sometimes even incommensurable domain assumptions (e.g. about human nature, social order and history). In short, the concept of ‘violence’ is notoriously difficult to define because as a phenomenon it is multifaceted, socially constructed and highly ambivalent. […]

Violence is socially constructed because who and what is considered as violent varies according to specific socio-cultural and historical conditions. While legal scholars may require narrow definitions for punishable acts, the phenomenon of violence is invariably more complex in social reality. Not only do views about violence differ, but feelings regarding physical violence also change under the influence of social and cultural developments. The meanings that participants in a violent episode give to their own and other’s actions and experiences vary and can be crucial for deciding what is and what is not considered as violence since there is no simple relationship between the apparent severity of an attack and the impact that it has upon the victim. For example, in some cases, verbal aggression may prove to be more debilitating than physical attack.

De Haan, Willem. “Violence as an essentially contested concept.” Violence in Europe. Springer, New York, NY, 2008. 27-40.

A major objection to this inclusive definition of violence is that it makes everything violence, creating confusion instead of clarity. One example:

If violence is violating a person or a person’s rights, then every social wrong is a violent one, every crime against another a violent crime, every sin against one’s neighbor an act of violence. If violence is whatever violates a person and his rights of body, dignity, or autonomy, then lying to or about another, embezzling, locking one out of his house, insulting, and gossiping are all violent acts.

Betz, Joseph. “Violence: Garver’s definition and a Deweyan correction.” Ethics 87.4 (1977): 339-351.

The problem with this objection is that it assumes violence is binary: things are either violent, or they are not. Almost nothing in life falls in a binary, sex included, so a much more plausible model for violence is a continuum. I’m convinced that even the people who buy into a violence binary also accept that violence falls on a continuum, as I have yet to hear anyone argue that murder and wet willies are equally bad. Thus eliminating the binary and declaring all violence to fall on a continuum is a simpler theory, and by Occam’s razor should be favoured until contrary evidence comes along.

The other major objection is that while not every human society agrees on what constitutes violence, all of them agree that physical violence is violence. Sometimes this objection can be quite subtle:

Albeit rare, there are cases of violence occurring without rights being violated. This point has been made by Audi (1971, p. 59): ‘[while] in the most usual cases violence involves the violation of some moral right …there are also cases, like wrestling and boxing, in which even paradigmatic violence can occur without the violation of any moral right’.

Bufacchi, Vittorio. “Two concepts of violence.” Political Studies Review 3.2 (2005): 193-204.

That quote only works if you think wrestling is paradigmatic, something everyone agrees counts as violence. Wrestling fans would disagree, and either point to the hardcore training and co-operation involved or the efforts made to prevent injury, depending on which fandom you were querying. Societies definitely disagree on what physical acts count as violence, and even within a single country physical acts that are considered horrifically immoral to many today were perfectly acceptable to many a century ago. This pragmatic argument can also be turned on its head, by pointing out that if violence is binary then we wouldn’t expect a correlation between (for example) hostile views of women and violence towards women. If a violence continuum exists, however, such a correlation must exist.

Studies using Glick and Fiske’s (1996) Ambivalent Sexism Inventory, which contains different subscales for benevolent and hostile sexism, support this idea. Studies have found that greater endorsement of hostile sexism predicted more positive attitudes toward violence against a female partner (Forbes, Jobe, White, Bloesch, & Adams-Curtis, 2005; Sakalli, 2001). Other studies of IPV among college samples have found that men with more hostile sexist attitudes were more likely to have committed verbal aggression (Forbes et. al., 2004) and sexual coercion (Forbes & Adams-Curtis, 2001; Forbes et al., 2004).

Allen, Christopher T., Suzanne C. Swan, and Chitra Raghavan. “Gender symmetry, sexism, and intimate partner violence.” Journal of interpersonal violence 24.11 (2009): 1816-1834.

At this point in the post, though, I was supposed to pump the brakes a little. People have certain ideas in mind when you say “violence,” I’d say, and would likely equivocate between physical and non-physical violence. This would poison the well. Of course you can’t change language or create awareness by sitting on your hands, so EssenceOfThought were 100% in the right in arguing Rationality Rules was a violent transphobe, but at the same time I wasn’t willing to join in. I needed more time to think about it. After finishing that paragraph, I’d title this post “Rationality Rules is a ‘Violent’ Transphobe” and punch the Publish button.

But now that I’ve finished gathering my sources and writing this post, I have had time to think about it. I cannot find a good reason to reject the violence-as-intentional-rights-violation definition; in particular, I cannot come up with a superior alternative. Rationality Rules argues that the rights of some transgender people should be restricted, via special pleading. As I point out at that link, Stephen Woodford is aware of the argument from human rights, so he cannot claim his restriction is being done out of ignorance. That gives us proof of intent.

So no quote marks are necessary: I too believe Rationality Rules is a violent transphobe, for the definitions and reasons above.

Rationality Rules DESTROYS Women’s Sport!!1!

I still can’t believe this post exists, given its humble beginnings.

The “women’s category” is, in my opinion, poorly named given our current climate, and so I’d elect a name more along the lines of the “Under 5 nmol/l category” (as in, under 5 nanomoles of testosterone per litre), but make no mistake about it, the “woman’s category” is not based on gender or identity, or even genitalia or chromosomes… it’s based on hormone levels and the absence of male puberty.

The above comment wasn’t in Rationality Rules’ latest transphobic video; it was just a casual aside by RR himself in the YouTube comment section. He’s obliquely doubled down via Twitter (hat tip to Essence of Thought):

Of course, just as I support trans men competing in all “men’s categories” (poorly named), women who have not experienced male puberty competing in all women’s sport (also poorly named) and trans women who have experienced male puberty competing in long-distance running.

To further clarify, I think that we must rename our categories according to what they’re actually based on. It’s not right to have a “women’s category” and yet say to some trans women (who are women!) that they can’t compete within it; it should be renamed.

The proposal itched away at me, though, because I knew it was testable.

There is a need to clarify hormone profiles that may be expected to occur after competition when antidoping tests are usually made. In this study, we report on the hormonal profile of 693 elite athletes, sampled within 2 h of a national or international competitive event. These elite athletes are a subset of the cross-sectional study that was a component of the GH-2000 research project aimed at developing a test to detect abuse with growth hormone.

Healy, Marie-Louise, et al. “Endocrine profiles in 693 elite athletes in the postcompetition setting.” Clinical endocrinology 81.2 (2014): 294-305.

The GH-2000 project had already done the hard work of collecting and analyzing blood samples from athletes, so checking RR’s proposal was no tougher than running some numbers. There are all sorts of ethical guidelines around sharing medical info, but fortunately there’s an easy shortcut: ask one of the scientists involved to run the numbers for me, and report back the results. Aggregate data is much more resistant to de-anonymization, so the ethical concerns are greatly reduced. The catch, of course, is that I’d have to find a friendly researcher with access to that dataset. About a month ago, I fired off some emails and hoped for the best.

I wound up much, much better than the best. I got full access to the dataset!! You don’t get handed an incredible gift like this and merely use it for a blog post. In my spare time, I’m flexing my Bayesian muscles to do a re-analysis of the above paper, while also looking for observations the original authors may have missed. Alas, that means my slow posting schedule is about to crawl.

But in the meantime, we have a question to answer.

What Do We Have Here?

import numpy as np
import pandas as pd

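# Load the tab-separated dataset
dataset = pd.read_csv('dataset.minimal.tsv', sep='\t')

# Boolean masks for each group; 'Gender' is coded 2 for assigned-female, 1 for assigned-male
mask_afab = dataset['Gender'] == 2
mask_amab = dataset['Gender'] == 1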

print( "{:24} = {}".format("Total Assigned-female Athletes", np.sum(mask_afab)) )
print( "{:24} = {:.2f} cm".format("  Height, Mean", np.mean( dataset['height'][mask_afab] )) )
print( "{:24} = {:.2f} cm".format("  Height, Std.Dev", np.std( dataset['height'][mask_afab] )) )
print( "{:24} = {:.2f} kg".format("  Weight, Mean", np.mean( dataset['weight'][mask_afab] )) )
print( "{:24} = {:.2f} kg".format("  Weight, Std.Dev", np.std( dataset['weight'][mask_afab] )) )
print( "{:24} = {:.2f} kg".format("  Body Fat, Mean", np.mean( (dataset['weight']*dataset['body-fat']*.01)[mask_afab] )) )
print( "{:24} = {:.2f} kg".format("  Body Fat, Std.Dev", np.std( (dataset['weight']*dataset['body-fat']*.01)[mask_afab] )) )

print( "{:24} = {:.2f} nmol/L".format("  Testosterone, Mean", np.mean( dataset['Testo'][mask_afab] )) )
print( "{:24} = {:.2f} nmol/L".format("  Testosterone, Std.Dev", np.std( dataset['Testo'][mask_afab] )) )
print( "{:24} = {:.2f} nmol/L".format("  Testosterone, Max", np.max( dataset['Testo'][mask_afab] )) )
print( "{:24} = {:.2f} nmol/L".format("  Testosterone, Min", np.min( dataset['Testo'][mask_afab] )) )
print()

print( "{:24} = {}".format("Total Assigned-male Athletes", np.sum(mask_amab) ) )
print( "{:24} = {:.2f} cm".format("  Height, Mean", np.mean( dataset['height'][mask_amab] )) )
print( "{:24} = {:.2f} cm".format("  Height, Std.Dev", np.std( dataset['height'][mask_amab] )) )
print( "{:24} = {:.2f} kg".format("  Weight, Mean", np.mean( dataset['weight'][mask_amab] )) )
print( "{:24} = {:.2f} kg".format("  Weight, Std.Dev", np.std( dataset['weight'][mask_amab] )) )
print( "{:24} = {:.2f} kg".format("  Body Fat, Mean", np.mean( (dataset['weight']*dataset['body-fat']*.01)[mask_amab] )) )
print( "{:24} = {:.2f} kg".format("  Body Fat, Std.Dev", np.std( (dataset['weight']*dataset['body-fat']*.01)[mask_amab] )) )

print( "{:24} = {:.2f} nmol/L".format("  Testosterone, Mean", np.mean( dataset['Testo'][mask_amab] )) )
print( "{:24} = {:.2f} nmol/L".format("  Testosterone, Std.Dev", np.std( dataset['Testo'][mask_amab] )) )
print( "{:24} = {:.2f} nmol/L".format("  Testosterone, Max", np.max( dataset['Testo'][mask_amab] )) )
print( "{:24} = {:.2f} nmol/L".format("  Testosterone, Min", np.min( dataset['Testo'][mask_amab] )) )

Total Assigned-female Athletes = 239
  Height, Mean           = 171.61 cm
  Height, Std.Dev        = 7.12 cm
  Weight, Mean           = 64.27 kg
  Weight, Std.Dev        = 9.12 kg
  Body Fat, Mean         = 13.19 kg
  Body Fat, Std.Dev      = 3.85 kg
  Testosterone, Mean     = 2.68 nmol/L
  Testosterone, Std.Dev  = 4.33 nmol/L
  Testosterone, Max      = 31.90 nmol/L
  Testosterone, Min      = 0.00 nmol/L

Total Assigned-male Athletes = 454
  Height, Mean           = 182.72 cm
  Height, Std.Dev        = 8.48 cm
  Weight, Mean           = 80.65 kg
  Weight, Std.Dev        = 12.62 kg
  Body Fat, Mean         = 8.89 kg
  Body Fat, Std.Dev      = 7.20 kg
  Testosterone, Mean     = 14.59 nmol/L
  Testosterone, Std.Dev  = 6.66 nmol/L
  Testosterone, Max      = 41.00 nmol/L
  Testosterone, Min      = 0.80 nmol/L

The first step is to get a basic grasp on what’s there, via some crude descriptive statistics. It’s also useful to compare these with the original paper, to make sure I’m interpreting the data correctly. Excusing some minor differences in rounding, the above numbers match the paper.

The only thing that stands out from the above, to me, is the serum levels of testosterone. At least one source says the mean of these assigned-female athletes is higher than the normal range for their non-athletic cohorts. Part of that may simply be because we don’t have a good idea of what the normal range is, so it’s not uncommon for each lab to have their own definition of “normal.” This is even worse for those assigned female, since their testosterone levels are poorly studied; note that my previous link collected the data of over a million “men,” but doesn’t mention “women” once. Factor in inaccurate test results and other complicating factors, and “normal” is quite poorly-defined.

Still, Rationality Rules is either convinced those complications are irrelevant, or ignorant of them. And, to be fair, that 5nmol/L line implicitly sweeps a lot of them under the rug. Let’s carry on, then, and look for invalid data. “Invalid” covers everything from missing data, to impossible data, to data we suspect is inaccurate due to measurement error. I consider a concentration of zero testosterone invalid, even though it may technically be possible, and to be safe I also drop any reading at or below 0.5nmol/L.
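The mask computed in the next cell (`t_valid`) boils down to two conditions: the reading must be strictly above 0.5nmol/L, and it must be a finite number. Here's a toy sketch of that rule on a handful of made-up readings (not values from the dataset), just to show which kinds of entries get dropped.

```python
import numpy as np

# Made-up testosterone readings in nmol/L, purely for illustration;
# np.nan stands in for a missing measurement.
readings = np.array([0.0, 0.3, 0.5, 0.8, 2.7, np.nan, 14.6])

# Same rule as the analysis: strictly above 0.5 nmol/L and a finite number.
# Any comparison against NaN is False, so missing values drop out as well.
valid = (readings > 0.5) & np.isfinite(readings)

print("valid mask:", valid)
print("kept      :", readings[valid])
print("discarded :", int(np.sum(~valid)), "of", len(readings))
```

The zero, the two low readings, and the missing value are all discarded; only the ordinary readings survive.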


t_number = dataset['Testo'] >= 0
t_max = np.max(dataset['Testo'][t_number])
t_min = np.min(dataset['Testo'][t_number])
t_valid = (dataset['Testo'] > 0.5) & np.isfinite(dataset['Testo'])

print( "{:52} = {}".format("Total Assigned-male Athletes w/ T levels >= 0", np.sum(mask_amab & t_number) ) )
print( "{:52} = {}".format("                             w/ T levels <= 0.5", np.sum(mask_amab & (dataset['Testo']<=0.5)) ) )
print( "{:52} = {}".format("                             w/ T levels == 0", np.sum(mask_amab & (dataset['Testo']==0)) ) )
print( "{:52} = {}".format("                             w/ missing T levels", np.sum(mask_amab & np.isnan(dataset['Testo'])) ) )
print( "{:52} = {}".format("                             that I consider valid", np.sum(mask_amab & t_valid)) )

print()

print( "{:52} = {}".format("Total Assigned-female Athletes w/ T levels >= 0", np.sum(mask_afab & t_number)) )
print( "{:52} = {}".format("                               w/ T levels <= 0.5", np.sum(mask_afab & (dataset['Testo']<=0.5)) ) )
print( "{:52} = {}".format("                               w/ T levels == 0", np.sum(mask_afab & (dataset['Testo']==0)) ) )
print( "{:52} = {}".format("                               w/ missing T levels", np.sum(mask_afab & np.isnan(dataset['Testo'])) ) )
print( "{:52} = {}".format("                               that I consider valid", np.sum(mask_afab & t_valid)) )

Total Assigned-male Athletes w/ T levels >= 0        = 446
                             w/ T levels <= 0.5      = 0
                             w/ T levels == 0        = 0
                             w/ missing T levels     = 8
                             that I consider valid   = 446

Total Assigned-female Athletes w/ T levels >= 0      = 234
                               w/ T levels <= 0.5    = 5
                               w/ T levels == 0      = 1
                               w/ missing T levels   = 5
                               that I consider valid = 229

Fortunately for us, the losses are pretty small. 229 datapoints is a healthy sample size, so we can afford to be liberal about what we toss out. Next up, it would be handy to see the data in chart form.


# %matplotlib notebook  # makes the plots interactive, but only one can be active
%matplotlib inline 

import matplotlib.pyplot as pp
pp.rcParams['figure.dpi'] = 96      # MUST SET THIS FIRST
pp.rcParams['figure.figsize'] = [9.5, 6]

bins = 9
pp.hist( np.log(dataset['Testo'][mask_afab & t_valid]), bins, density=1, facecolor='black', alpha=0.2)
pp.hist( np.log(dataset['Testo'][mask_amab & t_valid]), bins, density=1, facecolor='green', alpha=0.2)
pp.legend(['aFab','aMab'], loc=0)

pp.title('Testosterone, elite athletes')
pp.xlabel('nmol/L')
pp.xticks(np.linspace(-2,4,9), ["{:.1f}".format(np.exp(x)) for x in np.linspace(-2,4,9)])
pp.yticks([])

pp.axvline(np.log(0.5),0,1)
pp.axvline(np.log(5),0,1)
# Source: https://www.exeterlaboratory.com/test/testosterone/
pp.fill( np.log([29,8.6,8.6,29]), [1.2,1.2,0,0], facecolor='green', alpha=0.05 )
pp.fill( np.log([1.68,.3,.3,1.68]), [1.2,1.2,0,0], facecolor='black', alpha=0.05 )

pp.show()

I've put vertical lines at both the 0.5 and 5 nmol/L cutoffs. There's a big difference between categories, but we can see clouds on the horizon: a substantial number of assigned-female athletes have greater than 5 nmol/L of testosterone in their bloodstream, while a decent number of assigned-male athletes have less. How many?


mask_gt_5nmol = t_valid & (dataset['Testo'] > 5)
mask_lt_5nmol = t_valid & (dataset['Testo'] < 5)
mask_eq_5nmol = t_valid & (dataset['Testo'] == 5)

print("Segregating Athletes by Testosterone")

table = {"Concentration":["> 5nmol/L","< 5nmol/L","= 5nmol/L"],
        "aFab":[sum(mask_gt_5nmol & mask_afab),sum(mask_lt_5nmol & mask_afab),sum(mask_eq_5nmol & mask_afab)],
        "aMab":[sum(mask_gt_5nmol & mask_amab), sum(mask_lt_5nmol & mask_amab), sum(mask_eq_5nmol & mask_amab)]}
print(pd.DataFrame(table).to_string(index=False))
print()

print("{:.1f}% of assigned-female athletes have > 5nmol/L".format(100.*sum(mask_gt_5nmol & mask_afab)/sum(t_valid & mask_afab)))
print("{:.1f}% of assigned-male athletes have < 5nmol/L".format(100.*sum(mask_lt_5nmol & mask_amab)/sum(t_valid & mask_amab)))
print("{:.1f}% of athletes with > 5nmol/L are assigned-female".format(100.*sum(mask_gt_5nmol & mask_afab)/sum(mask_gt_5nmol)))
print("{:.1f}% of athletes with < 5nmol/L are assigned-male".format(100.*sum(mask_lt_5nmol & mask_amab)/sum(mask_lt_5nmol)))

Segregating Athletes by Testosterone
Concentration  aFab  aMab
   > 5nmol/L    19   417
   < 5nmol/L   210    26
   = 5nmol/L     0     3

8.3% of assigned-female athletes have > 5nmol/L
5.8% of assigned-male athletes have < 5nmol/L
4.4% of athletes with > 5nmol/L are assigned-female
11.0% of athletes with < 5nmol/L are assigned-male
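Those four percentages answer two different questions, which is why 8.3% and 4.4% can both describe the same 19 athletes. As a worked check, this sketch recomputes them from the counts in the table above; the variable names are mine, not the original notebook's.

```python
# Counts copied from the "Segregating Athletes by Testosterone" table.
afab = {"gt": 19, "lt": 210, "eq": 0}
amab = {"gt": 417, "lt": 26, "eq": 3}

afab_total = sum(afab.values())   # assigned-female athletes with valid T
amab_total = sum(amab.values())   # assigned-male athletes with valid T

# Conditioning on category: what share of each group crosses the line?
p_gt_given_afab = 100.0 * afab["gt"] / afab_total
p_lt_given_amab = 100.0 * amab["lt"] / amab_total

# Conditioning on concentration: who makes up each side of the line?
p_afab_given_gt = 100.0 * afab["gt"] / (afab["gt"] + amab["gt"])
p_amab_given_lt = 100.0 * amab["lt"] / (afab["lt"] + amab["lt"])

print("{:.1f}% of aFab athletes are above 5nmol/L".format(p_gt_given_afab))
print("{:.1f}% of aMab athletes are below 5nmol/L".format(p_lt_given_amab))
print("{:.1f}% of athletes above 5nmol/L are aFab".format(p_afab_given_gt))
print("{:.1f}% of athletes below 5nmol/L are aMab".format(p_amab_given_lt))
```

The first pair divides by the size of each category, the second pair by the number of athletes on each side of the line; since there are roughly twice as many assigned-male athletes in the sample, the two views diverge.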

Looks like anywhere from 6-8% of athletes have testosterone levels that cross Rationality Rules' line. For comparison, maybe 1-2% of the general public has some level of gender dysphoria, though estimating exact figures is hard in the face of widespread discrimination and poor sex-ed in schools. Even that number is misleading, as the number of transgender athletes is substantially lower than 1-2% of the athletic population. The share of transgender athletes is irrelevant to this dataset anyway, as it was collected between 1996 and 1999, when no sporting agency had policies that allowed transgender athletes to openly compete.

That 6-8%, in other words, is entirely cisgender. This echoes one of Essence Of Thought's arguments: RR's 5nmol/L policy has far more impact on cis athletes than trans athletes, which could have catastrophic side-effects. Could is the operative word, though, because as of now we don't know anything about these athletes. Do >5nmol/L assigned-female athletes have bodies more like >5nmol/L assigned-male athletes than <5nmol/L assigned-female athletes? If so, then there's no problem. Equivalent body types are competing against each other, and outcomes are as fair as could be reasonably expected.

What, then, counts as an "equivalent" body type when it comes to sport?

Newton's First Law of Athletics

One reasonable measure of equivalence is height. It's one of the stronger sex differences, and height is also correlated with longer limbs and greater leverage. Whether that's relevant to sports is debatable, but height and correlated attributes dominate Rationality Rules' list.

[19:07] In some events - such as long-distance running, in which hemoglobin and slow-twitch muscle fibers are vital - I think there's a strong argument to say no, [transgender women who transitioned after puberty] don't have an unfair advantage, as the primary attributes are sufficiently mitigated. But in most events, and especially those in which height, width, hip size, limb length, muscle mass, and muscle fiber type are the primary attributes - such as weightlifting, sprinting, hammer throw, javelin, netball, boxing, karate, basketball, rugby, judo, rowing, hockey, and many more - my answer is yes, most do have an unfair advantage.

Fortunately for both of us, most athletes in the dataset have a "valid" height, which I define as being at least 30cm tall.


height_valid = dataset['height'] > 30
print("Out of {:3} athletes, {} have valid height data.".format(len(height_valid), sum(height_valid)) )

bins = 9

pp.hist( dataset['height'][mask_afab & height_valid], bins, density=1, facecolor='black', alpha=0.2)
pp.hist( dataset['height'][mask_amab & height_valid], bins, density=1, facecolor='green', alpha=0.2)
pp.legend(['aFab','aMab'], loc=0)

pp.title('Height, elite athletes')
pp.xlim([145,215])
pp.xlabel("cm")
pp.yticks([])

# source: https://ourworldindata.org/human-height, Germany, 1976
pp.axvline(166.3,0,1, color='k', alpha=0.2)
pp.axvline(np.mean(dataset['height'][mask_afab & height_valid]),0,1, color='k')
pp.axvline(179.3,0,1, color='g', alpha=0.2)
pp.axvline(np.mean(dataset['height'][mask_amab & height_valid]),0,1, color='g')

pp.show()

Out of 693 athletes, 678 have valid height data.

The faint vertical lines are the mean adult heights of Germans born in 1976, which should be a reasonable comparison cohort for European athletes who were active between 1996 and 1999, while the darker lines are each category's mean. Athletes seem slightly taller than the reference average, but only by 2-5cm. The amount of overlap is also surprising, given that height is supposed to be a major sex difference. We actually saw less overlap with testosterone! Finally, the height distribution isn't quite Gaussian; there's a subtle bias towards the taller end of the spectrum.

Height is a pretty crude metric, though. You could pair any athlete with a non-athlete of the same height, and there's no way the latter would perform as well as the former. A better measure of sporting ability would be muscle mass. We shouldn't use the absolute mass, though: bigger bodies have more mass and need more force to accelerate as quickly as smaller bodies do, so height and muscle mass are correlated. We need some sort of dimensionless scaling factor which compensates.

And we have one! It's called the Body Mass Index, or BMI.

$$ BMI = \frac w {h^2}, $$

where $w$ is a person's mass in kilograms, and $h$ is a person's height in metres. Unfortunately, BMI is quite problematic. Partly that's because it is a crude measure of obesity. But part of that is because there are two types of tissue which can greatly vary, body fat and muscle, yet both contribute equally towards BMI.

That's all fixable. For one, some of the athletes in this dataset had their body fat measured. We can subtract that mass off, so their weight consists of tissues that are strongly correlated with height plus one that is fudgable: muscle mass. For two, we're not assessing these individuals' health, we only want a dimensionless measure of muscle mass relative to height. For three, we're not comparing these individuals to the general public, so we're not restricted to using the general BMI formula. We can use something more accurate.

The oddity is the appearance of that exponent 2, though our world is three-dimensional. You might think that the exponent should simply be 3, but that doesn't match the data at all. It has been known for a long time that people don't scale in a perfectly linear fashion as they grow. I propose that a better approximation to the actual sizes and shapes of healthy bodies might be given by an exponent of 2.5. So here is the formula I think is worth considering as an alternative to the standard BMI:

$$ BMI' = 1.3 \frac w {h^{2.5}} $$

I can easily pop body fat into Nick Trefethen's formula, and get a better measure of relative muscle mass,

$$ \overline{BMI} = 1.3 \frac{ w - bf }{h^{2.5}}, $$

where $bf$ is total body fat in kilograms. Individuals with excess muscle mass, relative to what we expect for their height, will have a high $\overline{BMI}$, and vice-versa. And as we saw earlier, muscle mass is another of Rationality Rules' determinants of sporting performance.
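To make the formula concrete, here's a minimal sketch of $\overline{BMI}$ as a function. The function names and the example athlete are hypothetical; the units follow the dataset's columns (kilograms, centimetres, body fat as a percentage of weight). One handy property to check: the constant 1.3 is $\sqrt{1.69}$, so Trefethen's version agrees with the standard BMI for someone 1.69m tall.

```python
def trefethen_bmi(weight_kg, height_cm):
    """Trefethen's alternative BMI: 1.3 * w / h^2.5, with h in metres."""
    height_m = height_cm / 100.0
    return 1.3 * weight_kg / height_m ** 2.5

def adjusted_bmi(weight_kg, height_cm, body_fat_pct):
    """Same formula, applied to fat-free mass (weight minus body fat)."""
    fat_free_kg = weight_kg * (100.0 - body_fat_pct) / 100.0
    return trefethen_bmi(fat_free_kg, height_cm)

# Hypothetical athlete, not a row from the dataset.
w, h, bf = 64.0, 171.0, 20.0
print("standard BMI : {:.2f}".format(w / (h / 100.0) ** 2))
print("Trefethen BMI: {:.2f}".format(trefethen_bmi(w, h)))
print("adjusted BMI : {:.2f}".format(adjusted_bmi(w, h, bf)))

# At 1.69 m the two conventions coincide, because sqrt(1.69) = 1.3.
print("agreement at 169cm:",
      abs(trefethen_bmi(70.0, 169.0) - 70.0 / 1.69 ** 2) < 1e-9)
```

Because body fat is removed before scaling, two athletes with the same height and weight but different body fat percentages end up with different scores, which is the point of the adjustment.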

Time for more number crunching.


BMI_adj = 1.3*(dataset['weight']*(100. - dataset['body-fat']))*0.01/((dataset['height']*.01)**(2.5))
BMI_adj_valid = BMI_adj > 1

print( "Out of {:3} athletes, {} have valid adjusted BMIs.".format(len(BMI_adj_valid), sum(BMI_adj_valid)) )
print( "                     {} have valid weights.".format(sum(dataset['weight'] > 10)) )
print( "                     {} have valid body fat percentages.".format(sum(dataset['body-fat'] >= 0)) )
print()

print( "{:24} = {}".format("Total Assigned-female Athletes", np.sum(mask_afab)) )
print( "{:24} = {}".format(" total with valid adjusted BMI", np.sum(mask_afab & BMI_adj_valid)) )
print( "{:24} = {:.2f}".format("  adjusted BMI, Mean", np.mean( BMI_adj[mask_afab & BMI_adj_valid] )) )
print( "{:24} = {:.2f}".format("  adjusted BMI, Std.Dev", np.std( BMI_adj[mask_afab & BMI_adj_valid] )) )
print( "{:24} = {:.2f}".format("  adjusted BMI, Median", np.median( BMI_adj[mask_afab & BMI_adj_valid] )) )
print()

print( "{:24} = {}".format("Total Assigned-male Athletes", np.sum(mask_amab)) )
print( "{:24} = {}".format(" total with valid adjusted BMI", np.sum(mask_amab & BMI_adj_valid)) )
print( "{:24} = {:.2f}".format("  adjusted BMI, Mean", np.mean( BMI_adj[mask_amab & BMI_adj_valid] )) )
print( "{:24} = {:.2f}".format("  adjusted BMI, Std.Dev", np.std( BMI_adj[mask_amab & BMI_adj_valid] )) )
print( "{:24} = {:.2f}".format("  adjusted BMI, Median", np.median( BMI_adj[mask_amab & BMI_adj_valid] )) )

Out of 693 athletes, 227 have valid adjusted BMIs.
                     663 have valid weights.
                     241 have valid body fat percentages.

Total Assigned-female Athletes = 239
 total with valid adjusted BMI = 86
  adjusted BMI, Mean     = 16.98
  adjusted BMI, Std.Dev  = 1.21
  adjusted BMI, Median   = 16.96

Total Assigned-male Athletes = 454
 total with valid adjusted BMI = 141
  adjusted BMI, Mean     = 20.56
  adjusted BMI, Std.Dev  = 1.88
  adjusted BMI, Median   = 20.28

The bad news is that most of this dataset lacks any information on body fat, which really cuts into our sample size. The good news is that we've still got enough to carry on. It also looks like there's a strong sex difference, and the distribution is pretty clustered. Still, a chart would help clarify the latter point.


bins = 9

pp.hist( np.log(BMI_adj)[mask_afab & BMI_adj_valid], bins, density=1, facecolor='black', alpha=0.2)
pp.hist( np.log(BMI_adj)[mask_amab & BMI_adj_valid], bins, density=1, facecolor='green', alpha=0.2)
pp.legend(['aFab','aMab'], loc=0)

pp.title('Adjusted BMI, elite athletes')
pp.xticks(np.linspace(np.log(14),np.log(28),8), 
          ["{:.1f}".format(np.exp(x)) for x in np.linspace(np.log(14),np.log(28),8)])
pp.yticks([])

pp.axvline(np.log(np.mean( BMI_adj[mask_afab & BMI_adj_valid] )),0,1, color='k')
pp.axvline(np.log(np.mean( BMI_adj[mask_amab & BMI_adj_valid] )),0,1, color='g')

pp.show()

Whoops! There's more overlap and skew than I thought. Even in logspace, the results don't look Gaussian. We'll have to remember that for the next step.

A Man Without a Plan is Not a Man

Just looking at charts isn't going to answer this question; we need to do some sort of hypothesis testing. Fortunately, all the pieces I need are here. We've got our hypothesis, for instance:

Athletes with exceptional testosterone levels are more like athletes of the same sex but with typical testosterone levels, than they are of other athletes with a different sex but similar testosterone levels.

If you know me, you know that I'm all about the Bayes, and that gives us our methodology.

1. Fit a model to a specific metric for assigned-female athletes with less than 5nmol/L of serum testosterone.
2. Fit a model to a specific metric for assigned-male athletes with more than 5nmol/L of serum testosterone.
3. Apply the first model to the test group, calculating the overall likelihood.
4. Apply the second model to the test group, calculating the overall likelihood.
5. Sample the probability distribution of the Bayes Factor.

"Metric" is one of height or $\overline{BMI}$, while "test group" is one of assigned-female athletes with >5nmol/L of serum testosterone or assigned-male athletes with <5nmol/L of serum testosterone. The Bayes Factor is simply

$$ \text{Bayes Factor} = \frac{ p(E \mid H_1) \cdot p(H_1) }{ p(E \mid H_2) \cdot p(H_2) } = \frac{ p(H_1 \mid E) }{ p(H_2 \mid E) }, $$

which means we need two hypotheses, not one. Fortunately, I've phrased the hypothesis to make it easy to negate: athletes with exceptional testosterone levels are less like athletes of the same sex but with typical testosterone levels, than they are of other athletes with a different sex but similar testosterone levels. We'll call this new hypothesis $H_2$, and the original $H_1$. Bayes factors greater than 1 mean $H_1$ is more likely than $H_2$, and vice-versa. (Strictly speaking, the ratio above is the posterior odds; it only equals the Bayes factor proper, $p(E \mid H_1) / p(E \mid H_2)$, when both hypotheses start with equal prior probability.)

Calculating all that would be easy if I was using Stan or PyMC3, but I ran into problems translating the former's probability distributions into charts, and I don't have any experience with the latter. My next choice, emcee, forces me to manually convolve two posterior distributions. Annoying, but not difficult.
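As a rough picture of what combining two posteriors by hand can look like, the sketch below pairs random draws from two sets of posterior samples and evaluates the likelihood ratio of a test group under each pair. Everything in it is a stand-in: the posterior samples and the test-group heights are made-up numbers, both models are assumed Gaussian, and the two hypotheses get equal prior weight so the ratio is a plain likelihood ratio. It is not the code behind the results in this post.

```python
import numpy as np

rng = np.random.default_rng(42)

def gaussian_loglike(theta, x):
    """Total Gaussian log-likelihood of data x under theta = (mu, sigma)."""
    mu, sigma = theta
    return np.sum(-0.5 * ((x - mu) / sigma) ** 2
                  - np.log(sigma) - 0.5 * np.log(2.0 * np.pi))

# Stand-ins for two fitted posteriors: each row is one (mu, sigma) sample.
post_same_sex = np.column_stack([rng.normal(171.0, 0.5, 128),
                                 rng.normal(7.1, 0.3, 128)])
post_other_sex = np.column_stack([rng.normal(182.6, 0.4, 128),
                                  rng.normal(8.5, 0.3, 128)])

# Made-up test group (heights in cm).
test_group = np.array([165.0, 170.5, 172.0, 168.0, 176.5, 174.0])

# Pair random draws from each posterior; each pair yields one log Bayes factor.
n_draws = 2000
i = rng.integers(0, len(post_same_sex), n_draws)
j = rng.integers(0, len(post_other_sex), n_draws)
log_bf = np.array([gaussian_loglike(post_same_sex[a], test_group)
                   - gaussian_loglike(post_other_sex[b], test_group)
                   for a, b in zip(i, j)])

lo, mid, hi = np.percentile(log_bf, [16, 50, 84])
print("log Bayes factor, 16/50/84th percentiles: "
      "{:.2f} / {:.2f} / {:.2f}".format(lo, mid, hi))
```

Working in log space keeps the numbers manageable; a positive log Bayes factor favours the first model, a negative one the second, and the spread of the samples shows how much the answer depends on parameter uncertainty.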

I'm a Model, If You Know What I Mean

That leaves just one thing: what models are we going to use? The obvious choice for height is the Gaussian distribution, as we know from previous research that it's a great model.


import emcee
nwalkers, nsamples = 128, 6
models = dict()

import os
import scipy.stats as spst

ndim = 2
def lnLike_gaussian( theta, x ):
    mu, sigma = theta
    return np.sum( spst.norm( mu, sigma ).logpdf( x ) )

def lnPrior_gaussian( theta ):
    mu, sigma = theta
    if sigma <= 0:           # standard deviation must be positive
        return -np.inf
    return -2*np.log(sigma)  # favor lower standard deviations

def lnProb_gaussian( theta, x ):
    temp = lnPrior_gaussian( theta )
    if temp == -np.inf:
        return temp
    return temp + lnLike_gaussian( theta, x )

x = dataset['height'][mask_lt_5nmol & height_valid & mask_afab]
pos = [np.array([150,15]) + 1e-2*np.random.randn(ndim) for i in range(nwalkers)]  # rough estimate

print("Fitting the height of lT aFab athletes to a Gaussian distribution ...")


lnprob = None
sampler = emcee.EnsembleSampler(nwalkers, ndim, lnProb_gaussian, threads=os.cpu_count(), args=[ x ] )
model_mean = np.mean( pos, axis=0 )
print("{:6}: ({:5f}) mu={:7f}, sigma={:7f}".format(0, lnProb_gaussian( model_mean, x ), *model_mean))
for it in range(nsamples):
    
    pos, lnprob, _ = sampler.run_mcmc( pos, 64, storechain=False )
    model_mean = np.mean( pos, axis=0 ) # fairer than going with the maximal likelihood
    print("{:6}: ({:5f}) mu={:7f}, sigma={:7f}".format((it+1) * 64, lnProb_gaussian( model_mean, x ), *model_mean))

best = np.argmax( lnprob )
print("{:>6}: ({:5f}) mu={:7f}, sigma={:7f}".format("ML", lnProb_gaussian( pos[best], x ), *pos[best]))
model_median = np.median( pos, axis=0 )
print("{:>6}: ({:5f}) mu={:7f}, sigma={:7f}".format("median", lnProb_gaussian( model_median, x ), *model_median))

models["lT_aFab_height"] = pos     # store the posterior for later use

Fitting the height of lT aFab athletes to a Gaussian distribution ...
     0: (-980.322471) mu=150.000819, sigma=15.000177
    64: (-710.417497) mu=169.639051, sigma=8.579088
   128: (-700.539260) mu=171.107358, sigma=7.138832
   192: (-700.535241) mu=171.154151, sigma=7.133279
   256: (-700.540692) mu=171.152701, sigma=7.145515
   320: (-700.552831) mu=171.139668, sigma=7.166857
   384: (-700.530969) mu=171.086422, sigma=7.094077
    ML: (-700.525284) mu=171.155240, sigma=7.085777
median: (-700.525487) mu=171.134614, sigma=7.070993

Alas, emcee also lacks a good way to assess model fitness. One crude metric is to look at the progression of the mean fitness; if it grows and then stabilizes around a specific value, as it does here, we've converged on something. Another is to compare the mean, median, and maximal likelihood of the posterior; if they're about equally likely, we've got a fuzzy caterpillar. Again, that's also true here.
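The second check is simple to express in code. This is a minimal sketch on synthetic data: the heights are random draws, and the "walkers" are hand-made points scattered around the sample mean and standard deviation rather than real emcee output, so it only illustrates the comparison itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_logprob(theta, x):
    """Gaussian log-likelihood plus the same -2*log(sigma) prior used above."""
    mu, sigma = theta
    if sigma <= 0:
        return -np.inf
    loglike = np.sum(-0.5 * ((x - mu) / sigma) ** 2
                     - np.log(sigma) - 0.5 * np.log(2.0 * np.pi))
    return loglike - 2.0 * np.log(sigma)

# Synthetic heights and a synthetic, already-settled ensemble of walkers;
# neither comes from the real dataset or from emcee.
x = rng.normal(171.0, 7.0, 200)
walkers = np.column_stack([rng.normal(x.mean(), 0.5, 128),
                           rng.normal(x.std(), 0.35, 128)])

logp = np.array([gaussian_logprob(w, x) for w in walkers])
summary = {
    "mean":        gaussian_logprob(walkers.mean(axis=0), x),
    "median":      gaussian_logprob(np.median(walkers, axis=0), x),
    "best walker": logp.max(),
}
for name, value in summary.items():
    print("{:>12}: {:.3f}".format(name, value))

# If the three numbers sit within a fraction of a log-unit of each other,
# the ensemble is clustered around a single peak.
spread = max(summary.values()) - min(summary.values())
print("spread = {:.3f}".format(spread))
```

A large gap between the three would hint at a lopsided or multi-peaked posterior, or at walkers that haven't settled yet.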

As we just saw, though, charts are a better judge of fitness than a handful of numbers.


def plotWrapper( func, theta, x ):
    retVal = list()
    for data in x:
        retVal.append( np.exp( func( theta, data ) ) )
    return retVal

minVal = min(dataset['height'][height_valid])
maxVal = max(dataset['height'][height_valid])
    
x = np.linspace(minVal,maxVal,255)


bins = 9

pp.hist( dataset['height'][mask_afab & height_valid], bins, density=1, facecolor='black', alpha=0.2)
pp.legend(['lT aFab'], loc=0)
pp.plot( x, plotWrapper(lnLike_gaussian, np.mean(models['lT_aFab_height'],axis=0), x ), 'k' )

pp.title('Height, elite athletes')
pp.xlim([145,215])
pp.xlabel("cm")
pp.yticks([])

# source: https://ourworldindata.org/human-height, Germany, 1976
pp.axvline(166.3,0,1, color='k', alpha=0.2)
pp.axvline(np.mean(dataset['height'][mask_afab & height_valid]),0,1, color='k')

pp.show()

If you were wondering why I didn't make much of a fuss out of the asymmetry in the height distribution, it's because I've already seen this graph. A good fit isn't necessarily the best though, and I might be able to get a closer match by incorporating the sport each athlete played.


sport_names = { 
    1: 'Power lifting',
    2: 'Basketball',
    3: 'Football',
    4: 'Swimming',
    5: 'Marathon',
    6: 'Canoeing',
    7: 'Rowing',
    8: 'Cross-country skiing',
    9: 'Alpine skiing',
    10: 'Weight lifting',
    11: 'Judo',
    12: 'Bandy',
    13: 'Ice Hockey',
    14: 'Handball',
    15: 'Track and field'}

print("{:^48}".format("Assigned-female Athletes"))
print("{:^24} {:^23}".format("sport","below/above 171cm"))
for sport in pd.Categorical( dataset['sport'] ).categories:
    below = sum(dataset['sport'][mask_afab & height_valid & (dataset['height'] < 171)] == sport)
    above = sum(dataset['sport'][mask_afab & height_valid & (dataset['height'] >= 171)] == sport)
    print("{:>24}: {:2} /{:2}".format(sport_names[sport], below, above))

            Assigned-female Athletes            
         sport              below/above 171cm   
           Power lifting:  1 / 0
              Basketball:  2 /12
                Football:  0 / 0
                Swimming: 41 /49
                Marathon:  0 / 1
                Canoeing:  1 / 0
                  Rowing:  9 /13
    Cross-country skiing:  8 / 1
           Alpine skiing: 11 / 1
          Weight lifting:  7 / 0
                    Judo:  0 / 0
                   Bandy:  0 / 0
              Ice Hockey:  0 / 0
                Handball: 12 /17
         Track and field: 22 /27

Basketball attracts tall people, unsurprisingly, while skiing seems to attract shorter people. This could be the cause of that asymmetry. It's no guarantee that I'll actually get a better fit, though, as I'm also dramatically cutting the number of datapoints to fit to. The model's uncertainty must increase as a result, and that may be enough to dilute out any increase in fitness. I'll run those numbers for the paper, but for now the Gaussian model I have is plenty good.
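A back-of-the-envelope way to see that trade-off: the uncertainty on a fitted mean shrinks roughly with the square root of the sample size. The sketch below applies that rule of thumb to the counts in the table above, assuming (my simplification, not something the data confirms) that every sport shares the pooled standard deviation of about 7.1cm from the Gaussian fit. The real per-sport spreads may well differ.

```python
import math

pooled_sigma_cm = 7.1   # roughly the sigma from the Gaussian fit above

# (below, above) 171cm counts copied from the table above.
counts = {
    "Power lifting": (1, 0), "Basketball": (2, 12), "Football": (0, 0),
    "Swimming": (41, 49), "Marathon": (0, 1), "Canoeing": (1, 0),
    "Rowing": (9, 13), "Cross-country skiing": (8, 1),
    "Alpine skiing": (11, 1), "Weight lifting": (7, 0), "Judo": (0, 0),
    "Bandy": (0, 0), "Ice Hockey": (0, 0), "Handball": (12, 17),
    "Track and field": (22, 27),
}

def standard_error(n, sigma=pooled_sigma_cm):
    """Rule-of-thumb uncertainty on a mean estimated from n samples."""
    return sigma / math.sqrt(n)

total = sum(below + above for below, above in counts.values())
print("{:>22}: n={:3}  +/- {:.2f} cm".format("All sports", total,
                                             standard_error(total)))
for sport, (below, above) in counts.items():
    n = below + above
    if n > 0:   # skip sports with no assigned-female athletes
        print("{:>22}: n={:3}  +/- {:.2f} cm".format(sport, n,
                                                     standard_error(n)))
```

Under that assumption the pooled mean is pinned down to about half a centimetre, while a fourteen-athlete sport like basketball is uncertain by nearly two, so a per-sport model has to fit noticeably better to come out ahead.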


x = dataset['height'][mask_gt_5nmol & height_valid & mask_amab]
pos = [np.array([150,15]) + 1e-2*np.random.randn(ndim) for i in range(nwalkers)]  # rough estimate

print("Fitting the height of hT aMab athletes to a Gaussian distribution ...")


lnprob = None
sampler = emcee.EnsembleSampler(nwalkers, ndim, lnProb_gaussian, threads=os.cpu_count(), args=[ x ] )
model_mean = np.mean( pos, axis=0 )
print("{:6}: ({:5f}) mu={:7f}, sigma={:7f}".format(0, lnProb_gaussian( model_mean, x ), *model_mean))
for it in range(nsamples):
    
    pos, lnprob, _ = sampler.run_mcmc( pos, 64, storechain=False )
    model_mean = np.mean( pos, axis=0 ) # fairer than going with the maximal likelihood
    print("{:6}: ({:5f}) mu={:7f}, sigma={:7f}".format((it+1) * 64, lnProb_gaussian( model_mean, x ), *model_mean))

best = np.argmax( lnprob )
print("{:>6}: ({:5f}) mu={:7f}, sigma={:7f}".format("ML", lnProb_gaussian( pos[best], x ), *pos[best]))
model_median = np.median( pos, axis=0 )
print("{:>6}: ({:5f}) mu={:7f}, sigma={:7f}".format("median", lnProb_gaussian( model_median, x ), *model_median))

models["hT_aMab_height"] = pos     # store the posterior for later use

Fitting the height of hT aMab athletes to a Gaussian distribution ...
     0: (-2503.079578) mu=150.000061, sigma=15.001179
    64: (-1482.315571) mu=179.740851, sigma=10.506003
   128: (-1451.789027) mu=182.615810, sigma=8.620333
   192: (-1451.748336) mu=182.587979, sigma=8.550535
   256: (-1451.759883) mu=182.676004, sigma=8.546410
   320: (-1451.746697) mu=182.626918, sigma=8.538055
   384: (-1451.747266) mu=182.580692, sigma=8.534070
    ML: (-1451.746074) mu=182.591047, sigma=8.534584
median: (-1451.759295) mu=182.603231, sigma=8.481894

We get the same results when fitting the model to >5 nmol/L assigned-male athletes. The log likelihood, that number in brackets, is a lot lower for these athletes, but that number is roughly proportional to the number of samples. If we had the same degree of model fitness but doubled the number of samples, we'd expect the log likelihood to double. And, sure enough, this dataset has roughly twice as many assigned-male athletes as it does assigned-female athletes.
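That proportionality is easy to check numerically. In this sketch (synthetic heights and a hypothetical helper name), duplicating every data point doubles the total log-likelihood under a fixed model, while the per-sample value stays put.

```python
import math

def gaussian_loglike(mu, sigma, data):
    """Total Gaussian log-likelihood of a list of observations."""
    return sum(-0.5 * ((x - mu) / sigma) ** 2
               - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
               for x in data)

# Made-up heights in cm, not values from the dataset.
heights = [165.0, 171.5, 178.0, 169.0, 182.5, 174.0]

once = gaussian_loglike(174.0, 8.0, heights)
twice = gaussian_loglike(174.0, 8.0, heights + heights)

print("log-likelihood, n = {:2}: {:.3f}".format(len(heights), once))
print("log-likelihood, n = {:2}: {:.3f}".format(2 * len(heights), twice))
print("per-sample: {:.3f} vs {:.3f}".format(once / len(heights),
                                            twice / (2 * len(heights))))
```

This is why the raw log-likelihood says little about fit quality unless the sample sizes match; dividing by the number of samples gives a fairer comparison.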


x = np.linspace(minVal,maxVal,255)

bins = 9

pp.hist( dataset['height'][mask_lt_5nmol & mask_afab & height_valid], bins, density=1, facecolor='black', alpha=0.2)
pp.hist( dataset['height'][mask_gt_5nmol & mask_amab & height_valid], bins, density=1, facecolor='green', alpha=0.2)
pp.legend(['lT aFab','hT aMab'], loc=0)
pp.plot( x, plotWrapper(lnLike_gaussian, np.mean(models['lT_aFab_height'],axis=0), x ), 'k' )
pp.plot( x, plotWrapper(lnLike_gaussian, np.mean(models['hT_aMab_height'],axis=0), x ), 'g' )

pp.title('Height, elite athletes')
pp.xlim([145,215])
pp.xlabel("cm")
pp.yticks([])

# source: https://ourworldindata.org/human-height, Germany, 1976
pp.axvline(166.3,0,1, color='k', alpha=0.2)
pp.axvline(np.mean(dataset['height'][mask_lt_5nmol & mask_afab & height_valid]),0,1, color='k')
pp.axvline(179.3,0,1, color='g', alpha=0.2)
pp.axvline(np.mean(dataset['height'][mask_gt_5nmol & mask_amab & height_valid]),0,1, color='g')

pp.show()

The updated charts are more of the same.

Unfortunately, adjusted BMI isn't nearly as tidy. I don't have any prior knowledge that would favour a particular model, so I wound up testing five candidates: the Gaussian, Log-Gaussian, Gamma, Weibull, and Rayleigh distributions. All but the first needed an offset parameter to get the best results, which has the same interpretation as last time.


ndim = 2

x = BMI_adj[mask_gt_5nmol & BMI_adj_valid & mask_amab]
pos = [np.array([15,5]) + 1e-2*np.random.randn(ndim) for i in range(nwalkers)]  # rough estimate

print("Fitting the adjusted BMI of hT aMab athletes to a Gaussian distribution ...")


lnprob = None
sampler = emcee.EnsembleSampler(nwalkers, ndim, lnProb_gaussian, threads=os.cpu_count(), args=[ x ] )
model_mean = np.mean( pos, axis=0 )
print("{:6}: ({:5f}) mu={:7f}, sigma={:7f}".format(0, lnProb_gaussian( model_mean, x ), *model_mean))
for it in range(nsamples):
    
    pos, lnprob, _ = sampler.run_mcmc( pos, 64, storechain=False )
    
model_mean = np.mean( pos, axis=0 ) # fairer than going with the maximal likelihood
print("{:6}: ({:5f}) mu={:7f}, sigma={:7f}".format((it+1) * 64, lnProb_gaussian( model_mean, x ), *model_mean))
best = np.argmax( lnprob )
print("{:>6}: ({:5f}) mu={:7f}, sigma={:7f}".format("ML", lnProb_gaussian( pos[best], x ), *pos[best]))
model_median = np.median( pos, axis=0 )
print("{:>6}: ({:5f}) mu={:7f}, sigma={:7f}".format("median", lnProb_gaussian( model_median, x ), *model_median))

models['hT_aMab_BMI_gaussian'] = pos

Fitting the adjusted BMI of hT aMab athletes to a Gaussian distribution ...
     0: (-410.901047) mu=14.999563, sigma=5.000388
   384: (-256.474147) mu=20.443497, sigma=1.783979
    ML: (-256.461460) mu=20.452817, sigma=1.771653
median: (-256.477475) mu=20.427138, sigma=1.781139


ndim = 3
def lnLike_loggaussian( theta, x ):
    mu, sigma, off = theta
    x_adj = x - off
    if np.any( x_adj < 0 ):
        return -np.inf
    return np.sum( -.5*( ((x_adj-mu)/sigma)**2 ) - np.log( x_adj*sigma ) )

def lnPrior_loggaussian( theta ):
    mu, sigma, off = theta
    if (mu < 0) or (sigma <= 0):
        return -np.inf
    if (off < 0) or (off > 25):
        return -np.inf
    
    return -2*np.log(sigma)

def lnProb_loggaussian( theta, x ):
    temp = lnPrior_loggaussian( theta )
    if temp == -np.inf:
        return temp
    return temp + lnLike_loggaussian(theta, x)

pos = [np.array([7,2,10]) + 1e-2*np.random.randn(ndim) for i in range(nwalkers)]  # rough estimate

print("Fitting the adjusted BMI of hT aMab athletes to a Log-Gaussian distribution ...")

lnprob = None
sampler = emcee.EnsembleSampler(nwalkers, ndim, lnProb_loggaussian, threads=os.cpu_count(), args=[ x ] )


model_mean = np.mean( pos, axis=0 )
print("{:6}: ({:5f}) mu={:7f}, sigma={:7f}, off={:7f}".format(0, \
                                lnProb_loggaussian( model_mean, x ), *model_mean))
for it in range(nsamples):
    
    pos, lnprob, _ = sampler.run_mcmc( pos, 64, storechain=False )

model_mean = np.mean( pos, axis=0 )
print("{:6}: ({:5f}) mu={:7f}, sigma={:7f}, off={:7f}".format((it+1) * 64, \
                                lnProb_loggaussian( model_mean, x ), *model_mean))
best = np.argmax( lnprob )
print("{:>5}: ({:5f}) mu={:7f}, sigma={:7f}, off={:7f}".format("ML", \
                lnprob[best], *pos[best]) )
model_median = np.median( pos, axis=0 )
print("{:6}: ({:5f}) mu={:7f}, sigma={:7f}, off={:7f}".format("median", \
                                lnProb_loggaussian( model_median, x ), *model_median))

models['hT_aMab_BMI_loggaussian'] = pos

Fitting the adjusted BMI of hT aMab athletes to a Log-Gaussian distribution ...
     0: (-629.141577) mu=6.999492, sigma=2.001107, off=10.000768
   384: (-290.910651) mu=3.812746, sigma=1.789607, off=16.633741
   ML: (-277.119315) mu=3.848383, sigma=1.818429, off=16.637382
median: (-288.278918) mu=3.795675, sigma=1.778238, off=16.637076

(Click here to show the code)

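import scipy as sp
import scipy.special   # load the submodule explicitly, so sp.special.loggamma resolves on older SciPy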

ndim = 3
def lnLike_gammaoffset(theta, x):
    alpha, beta, off = theta
    x_adj = x - off
    if np.any( x_adj < 0 ):
        return -np.inf
    return np.sum( alpha*np.log(beta) - sp.special.loggamma(alpha) + (alpha-1)*np.log(x_adj) - beta*x_adj )

lnPrior_gammaoffset = lnPrior_loggaussian   # the two are similar enough to reuse
def lnProb_gammaoffset( theta, x ):
    temp = lnPrior_gammaoffset( theta )
    if temp == -np.inf:
        return temp
    return temp + lnLike_gammaoffset(theta, x)

pos = [np.array([20,3.,10.]) + 1e-2*np.random.randn(ndim) for i in range(nwalkers)]
print("Fitting the adjusted BMI of hT aMab athletes to a Gamma distribution ...")

sampler = emcee.EnsembleSampler(nwalkers, ndim, lnProb_gammaoffset, threads=os.cpu_count(), args=[ x ] )

gammaoffset_mean = np.mean( pos, axis=0 )
print("{:5}: ({:5f}) alpha={:7f}, beta={:7f}, off={:7f}".format(0, 
                        lnProb_gammaoffset( gammaoffset_mean, x ), *gammaoffset_mean))

for it in range(nsamples):
    global pos, lnprob
    
    pos, lnprob, _ = sampler.run_mcmc( pos, 64, storechain=False )

model_mean = np.mean( pos, axis=0 )
print("{:6}: ({:5f}) alpha={:7f}, beta={:7f}, off={:7f}".format((it+1) * 64, \
                                lnProb_gammaoffset( model_mean, x ), *model_mean))
best = np.argmax( lnprob )
print("{:6}: ({:5f}) alpha={:7f}, beta={:7f}, off={:7f}".format("ML", \
                lnprob[best], *pos[best]) )
model_median = np.median( pos, axis=0 )
print("{:6}: ({:5f}) alpha={:7f}, beta={:7f}, off={:7f}".format("median", \
                                lnProb_gammaoffset( model_median, x ), *model_median))

models['hT_aMab_BMI_gamma'] = pos

Fitting the adjusted BMI of hT aMab athletes to a Gamma distribution ...
    0: (-564.227696) alpha=19.998389, beta=3.001330, off=9.999839
   384: (-256.999252) alpha=15.951361, beta=2.194827, off=13.795466
ML    : (-248.056301) alpha=8.610936, beta=1.673886, off=15.343436
median: (-249.115483) alpha=12.411010, beta=2.005287, off=14.410945

(Click here to show the code)

ndim = 3
def lnLike_weibulloffset(theta, x):
    k, beta, off = theta
    x_adj = x - off
    if np.any( x_adj < 0 ):
        return -np.inf
    # NOTE: the final term uses x rather than x_adj, so the offset only enters via the log term.
    return np.sum( np.log(k*beta) + (k-1)*np.log(x_adj*beta) - (x*beta)**k )

lnPrior_weibulloffset = lnPrior_loggaussian
def lnProb_weibulloffset( theta, x ):
    temp = lnPrior_weibulloffset( theta )
    if temp == -np.inf:
        return temp
    return temp + lnLike_weibulloffset(theta, x)

pos = [np.array([8,.1,1.]) + 1e-2*np.random.randn(ndim) for i in range(nwalkers)]
print("Fitting the adjusted BMI of hT aMab athletes to a Weibull distribution ...")

sampler = emcee.EnsembleSampler(nwalkers, ndim, lnProb_weibulloffset, threads=os.cpu_count(), args=[ x ] )

weibull_mean = np.mean( pos, axis=0 )
print("{:5}: ({:5f}) k={:7f}, beta={:7f}, off={:7f}".format(0,
            lnProb_weibulloffset( weibull_mean, x ), *weibull_mean))
for it in range(nsamples):
    global pos, lnprob
    
    pos, lnprob, _ = sampler.run_mcmc( pos, 64, storechain=False )
    
model_mean = np.mean( pos, axis=0 )
print("{:>5}: ({:5f}) k={:7f}, beta={:7f}, off={:7f}".format((it+1) * 64, \
                                lnProb_weibulloffset( model_mean, x ), *model_mean))
best = np.argmax( lnprob )
print("{:>5}: ({:5f}) k={:7f}, beta={:7f}, off={:7f}".format("ML", \
                lnprob[best], *pos[best]) )
model_median = np.median( pos, axis=0 )
print("{:>5}: ({:5f}) k={:7f}, beta={:7f}, off={:7f}".format("median", \
                                lnProb_weibulloffset( model_median, x ), *model_median))

models['hT_aMab_BMI_weibull'] = pos

Fitting the adjusted BMI of hT aMab athletes to a Weibull distribution ...
    0: (-48865.772268) k=7.999859, beta=0.099877, off=0.999138
  384: (-271.350390) k=9.937527, beta=0.046958, off=0.019000
   ML: (-270.340284) k=9.914647, beta=0.046903, off=0.000871
median: (-270.974131) k=9.833793, beta=0.046947, off=0.011727

(Click here to show the code)

ndim = 2
def lnLike_rayleighoffset(theta, x):
    tau, off = theta
    x_adj = x - off
    if np.any( x_adj < 0 ):
        return -np.inf
    return np.sum( np.log(x_adj*tau) - .5*x_adj*x_adj*tau )

def lnPrior_rayleighoffset( theta ):
    tau, off = theta
    if (tau <= 0) or (off < 0):
        return -np.inf
    return 0

def lnProb_rayleighoffset( theta, x ):
    temp = lnPrior_rayleighoffset( theta )
    if temp == -np.inf:
        return temp
    return temp + lnLike_rayleighoffset(theta, x)

pos = [np.array([.5,10]) + 1e-2*np.random.randn(ndim) for i in range(nwalkers)]
print("Fitting the adjusted BMI of hT aMab athletes to a Rayleigh distribution ...")
sampler = emcee.EnsembleSampler(nwalkers, ndim, lnProb_rayleighoffset, threads=os.cpu_count(), args=[ x ] )

rayleigh_mean = np.mean( pos, axis=0 )
print("{:5}: ({:5f}) tau={:7f}, off={:7f}".format(0,
            lnProb_rayleighoffset( rayleigh_mean, x ), *rayleigh_mean))

for it in range(nsamples):
    global pos, lnprob
    
    pos, lnprob, _ = sampler.run_mcmc( pos, 64, storechain=False )
    
rayleigh_mean = np.mean( pos, axis=0 )
print("{:5}: ({:5f}) tau={:7f}, off={:7f}".format((it+1)*64,
            lnProb_rayleighoffset( rayleigh_mean, x ), *rayleigh_mean))
best = np.argmax( lnprob )
print("{:>5}: ({:5f}) tau={:7f}, off={:7f}".format("ML", lnprob[best], *pos[best]))
rayleigh_median = np.median( pos, axis=0 )
print("{:5}: ({:5f}) tau={:7f}, off={:7f}".format("median",
            lnProb_rayleighoffset( rayleigh_median, x ), *rayleigh_median))

models['hT_aMab_BMI_rayleigh'] = pos

Fitting the adjusted BMI of hT aMab athletes to a Rayleigh distribution ...
    0: (-3378.099000) tau=0.499136, off=9.999193
  384: (-254.717778) tau=0.107962, off=16.378780
   ML: (-253.012418) tau=0.110751, off=16.574934
median: (-253.092584) tau=0.108740, off=16.532576

(Click here to show the code)

minVal = min(BMI_adj[BMI_adj_valid])
maxVal = max(BMI_adj[BMI_adj_valid])
    
x = np.linspace(minVal,maxVal,255)


bins = 9

pp.hist( BMI_adj[mask_gt_5nmol & mask_amab & BMI_adj_valid], bins, density=1, facecolor='green', alpha=0.2)
pp.plot( x, plotWrapper(lnLike_gaussian, np.median(models['hT_aMab_BMI_gaussian'],axis=0), x ), 'k' )
pp.plot( x, plotWrapper(lnLike_loggaussian, np.median(models['hT_aMab_BMI_loggaussian'],axis=0), x ), 'r', alpha=.2 )
pp.plot( x, plotWrapper(lnLike_gammaoffset, np.median(models['hT_aMab_BMI_gamma'],axis=0), x ), 'g' )
pp.plot( x, plotWrapper(lnLike_weibulloffset, np.median(models['hT_aMab_BMI_weibull'],axis=0), x ), 'b', alpha=.2 )
pp.plot( x, plotWrapper(lnLike_rayleighoffset, np.median(models['hT_aMab_BMI_rayleigh'],axis=0), x ), 'y' )

pp.legend(['Gaussian','log-Gaussian + Offset','Gamma + Offset', 'Weibull + Offset','Rayleigh + Offset','high-T aMab'], loc=0)


pp.title('Adjusted BMI, elite athletes')
pp.xlim([minVal,maxVal])
pp.ylim([0,.3])
pp.yticks([])

pp.show()

Looks like the Gamma distribution is the best of the bunch, though only if you use the median or maximal likelihood of the posterior; there must be some outliers in there that are tugging the mean around. Visually, there isn't much difference between the Gaussian and Gamma fits, but the Rayleigh seems artificially sharp on the low end. It's a bit of a shame: the Gamma distribution is usually related to rates and variances, so we don't have a good reason for applying it here other than "it fits the best." We might be able to do better with a per-sport Gaussian distribution fit, but for now I'm happy with the Gamma.
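If you'd like some reassurance that the offset Gamma likelihood is doing what it claims, one option is to compare it against SciPy's built-in Gamma distribution. This is only a sketch: `lnlike_gamma_offset` is my restatement of the likelihood above, `x_demo` is made-up data rather than the real dataset, and `theta_demo` is just a rounded version of the posterior median.

```python
import numpy as np
from scipy import special, stats

def lnlike_gamma_offset(theta, x):
    """Log-likelihood of a Gamma(alpha, rate=beta) shifted right by `off`."""
    alpha, beta, off = theta
    x_adj = x - off
    if np.any(x_adj < 0):
        return -np.inf  # data below the offset is impossible under this model
    return np.sum(alpha * np.log(beta) - special.gammaln(alpha)
                  + (alpha - 1) * np.log(x_adj) - beta * x_adj)

# Hypothetical adjusted-BMI values, for illustration only (not the real dataset).
x_demo = np.array([18.2, 19.5, 20.1, 20.8, 21.4, 22.9, 24.3])
theta_demo = (12.4, 2.0, 14.4)   # rounded from the posterior median, as an assumption

ours = lnlike_gamma_offset(theta_demo, x_demo)

# SciPy parameterises the Gamma by shape, location and scale = 1/rate.
alpha, beta, off = theta_demo
theirs = stats.gamma.logpdf(x_demo, a=alpha, loc=off, scale=1.0 / beta).sum()

print(ours, theirs)
```

The two numbers should agree to within floating-point error, and pushing the offset above the smallest data point should return negative infinity.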

Time to fit the other pool of athletes, and chart it all.

(Click here to show the code)

x = BMI_adj[mask_lt_5nmol & BMI_adj_valid & mask_afab]

ndim = 3
pos = [np.array([20,3.,10.]) + 1e-2*np.random.randn(ndim) for i in range(nwalkers)]
print("Fitting the adjusted BMI of lT aFab athletes to a Gamma distribution ...")

sampler = emcee.EnsembleSampler(nwalkers, ndim, lnProb_gammaoffset, threads=os.cpu_count(), args=[ x ] )

gammaoffset_mean = np.mean( pos, axis=0 )
print("{:5}: ({:5f}) alpha={:7f}, beta={:7f}, off={:7f}".format(0, 
                        lnProb_gammaoffset( gammaoffset_mean, x ), *gammaoffset_mean))

for it in range(nsamples):
    global pos, lnprob
    
    pos, lnprob, _ = sampler.run_mcmc( pos, 64, storechain=False )

model_mean = np.mean( pos, axis=0 )
print("{:6}: ({:5f}) alpha={:7f}, beta={:7f}, off={:7f}".format((it+1) * 64, \
                                lnProb_gammaoffset( model_mean, x ), *model_mean))
best = np.argmax( lnprob )
print("{:6}: ({:5f}) alpha={:7f}, beta={:7f}, off={:7f}".format("ML", \
                lnprob[best], *pos[best]) )
model_median = np.median( pos, axis=0 )
print("{:6}: ({:5f}) alpha={:7f}, beta={:7f}, off={:7f}".format("median", \
                                lnProb_gammaoffset( model_median, x ), *model_median))

models['lT_aFab_BMI_gamma'] = pos

Fitting the adjusted BMI of lT aFab athletes to a Gamma distribution ...
    0: (-127.467934) alpha=20.000007, beta=3.000116, off=9.999921
   384: (-128.564564) alpha=15.481265, beta=3.161022, off=12.654149
ML    : (-117.582454) alpha=2.927721, beta=1.294851, off=14.713479
median: (-120.689425) alpha=11.961847, beta=2.836153, off=13.008723

(Click here to show the code)

x = np.linspace(minVal,maxVal,255)


bins = 9

pp.hist( BMI_adj[mask_lt_5nmol & mask_afab & BMI_adj_valid], bins, density=1, facecolor='black', alpha=0.2)
pp.hist( BMI_adj[mask_gt_5nmol & mask_amab & BMI_adj_valid], bins, density=1, facecolor='green', alpha=0.2)
pp.legend(['lT aFab','hT aMab'], loc=0)
pp.plot( x, plotWrapper(lnLike_gammaoffset, np.median(models['lT_aFab_BMI_gamma'],axis=0), x ), 'k', alpha=.5 )
pp.plot( x, plotWrapper(lnLike_gammaoffset, np.median(models['hT_aMab_BMI_gamma'],axis=0), x ), 'g', alpha=.5 )

pp.legend(['Gamma + Offset, lT aFab', 'Gamma + Offset, hT aMab','low-T aFab','high-T aMab'], loc=0)

pp.title('Adjusted BMI, elite athletes')
pp.xlim([minVal,maxVal])
pp.yticks([])

pp.show()

Those models look pretty reasonable, though the upper end of the assigned-female distribution could be improved on. It's a good enough fit to get some answers, at least.

The Nitty Gritty

It's easier to combine step 3, applying the model, with step 5, calculating the Bayes Factor, when writing the code. The resulting Bayes Factor has a probability distribution, as the uncertainty contained in the posterior contaminates it.
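To make the idea concrete, here's a minimal sketch of how a distribution of Bayes factors falls out of two sets of posterior samples. The log-probabilities are random stand-ins I invented for illustration; in the real analysis they come from evaluating each posterior sample against the athletes' data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical log-probabilities of the same data under posterior draws
# from each model (stand-ins, not real results).
ln_numer = rng.normal(-60.0, 0.5, size=100)
ln_denom = rng.normal(-62.0, 0.5, size=100)

# Pair every numerator draw with every denominator draw.
log_bf = np.subtract.outer(ln_numer, ln_denom).ravel()

percentiles = np.exp(np.percentile(log_bf, [5, 16, 50, 84, 95]))
favour = np.mean(log_bf > 0)             # share of Bayes factors above 1
decisive = np.mean(log_bf > np.log(19))  # share of Bayes factors above 19

print(percentiles, favour, decisive)
```

Working in log space and only exponentiating at the end keeps the extreme values from overflowing or underflowing.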

(Click here to show the code)

logBF_gt5_afab_height = list()

gt5_afab_height = dataset['height'][mask_gt_5nmol & height_valid & mask_afab]

numer_gt5_afab_height = [lnProb_gaussian( pos, gt5_afab_height ) for pos in models['lT_aFab_height']]
denom_gt5_afab_height = [lnProb_gaussian( pos, gt5_afab_height ) for pos in models['hT_aMab_height']]

for laf in numer_gt5_afab_height:
    for ham in denom_gt5_afab_height:
        logBF_gt5_afab_height.append( laf-ham )

print("Summary of the BF distribution, for the height of >5nmol/L aFab athletes")
        
percentiles = np.percentile( logBF_gt5_afab_height, [5,16,50,84,95] )
geo_mean = np.exp(np.mean( logBF_gt5_afab_height ))

temp = np.exp(logBF_gt5_afab_height)
mean = np.mean( temp )
favour = np.sum( temp > 1 ) / len(temp)
decisive = np.sum( temp > 19 ) / len(temp)

print("{:>10} {:>10} {:>10} {:>10} {:>10} {:>10} {:>10} {:>10}".format(
    "n","mean","geo.mean","5%","16%","50%","84%","95%"))
print("{:>10} {:10.2f} {:10.2f} {:10.2f} {:10.2f} {:10.2f} {:10.2f} {:10.2f}".format(
    len(gt5_afab_height), mean, geo_mean, *np.exp(percentiles) ))
print()
print("Percentage of BF's that favoured the primary hypothesis: {:.2f}%".format(favour*100.))
print("Percentage of BF's that were 'decisive': {:.2f}%".format(decisive*100.))

bins = 21
pp.hist( logBF_gt5_afab_height, bins, density=1, facecolor='black', alpha=0.2)
pp.legend(['high-T aFab'], loc=0)

pp.axvline( x=percentiles[2], color='r' )
pp.axvline( x=percentiles[1], color='r', alpha=.5 )
pp.axvline( x=percentiles[3], color='r', alpha=.5 )
pp.axvline( x=percentiles[0], color='r', alpha=.1 )
pp.axvline( x=percentiles[4], color='r', alpha=.1 )

pp.xticks( np.linspace(-2,6,5), ["{:.1f}".format(x) for x in np.exp(np.linspace(-2,6,5))] )
pp.yticks([])
pp.title('Bayes factor, height, >5nmol/L aFab athletes')
pp.show()

Summary of the BF distribution, for the height of >5nmol/L aFab athletes
         n       mean   geo.mean         5%        16%        50%        84%        95%
        19      10.64       5.44       0.75       1.76       5.66      17.33      35.42

Percentage of BF's that favoured the primary hypothesis: 92.42%
Percentage of BF's that were 'decisive': 14.17%

That looks a lot like a log-Gaussian distribution. The arithmetic mean fails us here, thanks to the huge range of values, so the geometric mean and median are better measures of central tendency.
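A quick sketch of why the arithmetic mean misbehaves, using synthetic values that are Gaussian in log space. The location and scale are assumptions I picked to look vaguely like the histogram, not fitted values.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical log Bayes factors, roughly Gaussian in log space.
log_bf = rng.normal(loc=1.7, scale=1.2, size=10_000)
bf = np.exp(log_bf)

arith_mean = bf.mean()            # dragged upward by the long right tail
geo_mean = np.exp(log_bf.mean())  # the mean taken in log space
median = np.median(bf)

print(arith_mean, geo_mean, median)
```

The geometric mean and median land close together, while the arithmetic mean sits well above both.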

The best way I can interpret this result is via an eight-sided die: our credence in the hypothesis that >5nmol/L aFab athletes are more like their >5nmol/L aMab peers than their <5nmol/L aFab ones is similar to the credence we'd place on rolling a one via that die, while our credence on the primary hypothesis is similar to rolling any other number. About 92% of the calculated Bayes Factors were favourable to the primary hypothesis, and about 14% of them crossed the 19:1 threshold, a close match for the asserted evidential bar in science.
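If you want to translate a Bayes factor into a credence yourself, the conversion is short. This sketch assumes even prior odds between the two hypotheses, which is my assumption for illustration; `credence_from_bf` is a hypothetical helper, not part of the analysis code.

```python
def credence_from_bf(bf, prior_odds=1.0):
    """Posterior probability of the favoured hypothesis, given a Bayes factor."""
    posterior_odds = bf * prior_odds
    return posterior_odds / (1.0 + posterior_odds)

for bf in (5.66, 7.0, 19.0):
    print(f"BF {bf:5.2f}:1 -> credence {credence_from_bf(bf):.3f}")
```

Under that assumption, odds of 7:1 are exactly the eight-sided die, and 19:1 corresponds to a credence of 95%.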

That's strong evidence for a mere 19 athletes, though not quite conclusive. How about the Bayes Factor for the height of <5nmol/L aMab athletes?

(Click here to show the code)

logBF_lt5_amab_height = list()

lt5_amab_height = dataset['height'][mask_lt_5nmol & height_valid & mask_amab]

numer_lt5_amab_height = [lnProb_gaussian( pos, lt5_amab_height ) for pos in models['hT_aMab_height']]
denom_lt5_amab_height = [lnProb_gaussian( pos, lt5_amab_height ) for pos in models['lT_aFab_height']]

for laf in numer_lt5_amab_height:
    for ham in denom_lt5_amab_height:
        logBF_lt5_amab_height.append( laf-ham )

print("Summary of the BF distribution, for the height of <5nmol/L aMab athletes")

percentiles = np.percentile( logBF_lt5_amab_height, [5,16,50,84,95] )
geo_mean = np.exp(np.mean( logBF_lt5_amab_height ))

temp = np.exp(logBF_lt5_amab_height)
mean = np.mean( temp )
favour = np.sum( temp > 1 ) / len(temp)
decisive = np.sum( temp > 19 ) / len(temp)

print("{:>10} {:>10} {:>10} {:>10} {:>10} {:>10} {:>10} {:>10}".format(
    "n","mean","geo.mean","5%","16%","50%","84%","95%"))
print("{:>10} {:10.2e} {:10.2e} {:10.2e} {:10.2e} {:10.2e} {:10.2e} {:10.2e}".format(
    len(lt5_amab_height), mean, geo_mean, *np.exp(percentiles) ))
print()
print("Percentage of BF's that favoured the primary hypothesis: {:.2f}%".format(favour*100.))
print("Percentage of BF's that were 'decisive': {:.2f}%".format(decisive*100.))

bins = 21
pp.hist( logBF_lt5_amab_height, bins, density=1, facecolor='black', alpha=0.2)
pp.legend(['low-T aMab'], loc=0)

pp.axvline( x=percentiles[2], color='r' )
pp.axvline( x=percentiles[1], color='r', alpha=.5 )
pp.axvline( x=percentiles[3], color='r', alpha=.5 )
pp.axvline( x=percentiles[0], color='r', alpha=.1 )
pp.axvline( x=percentiles[4], color='r', alpha=.1 )

pp.xticks( np.linspace(30,55,7), ["{:.1e}".format(x) for x in np.exp(np.linspace(30,55,7))], rotation='vertical' )
pp.yticks([])
pp.title('Bayes factor, height, <5nmol/L aMab athletes')
pp.show()

Summary of the BF distribution, for the height of <5nmol/L aMab athletes
         n       mean   geo.mean         5%        16%        50%        84%        95%
        26   4.67e+21   3.49e+18   5.67e+14   2.41e+16   5.35e+18   4.16e+20   4.61e+21

Percentage of BF's that favoured the primary hypothesis: 100.00%
Percentage of BF's that were 'decisive': 100.00%

Wow! Even with 26 data points, our primary hypothesis was extremely well supported. Betting against that hypothesis is like betting a particular person in the US will be hit by lightning three times in a single year!

That seems a little too favourable to my view, though. Did something go wrong with the mathematics? The simplest check is to graph the models against the data they're evaluating.

(Click here to show the code)

minVal = min(dataset['height'][height_valid])
maxVal = max(dataset['height'][height_valid])
x = np.linspace(minVal,maxVal,255)


bins = 8
pp.hist( lt5_amab_height, bins, density=1, facecolor='green', alpha=0.2)

pp.plot( x, plotWrapper(lnLike_gaussian, np.mean(models['lT_aFab_height'],axis=0), x ), 'k' )
pp.plot( x, plotWrapper(lnLike_gaussian, np.mean(models['hT_aMab_height'],axis=0), x ), 'g' )

pp.legend(['low-T aFab', 'high-T aMab', 'low-T aMab'], loc=0)

pp.yticks([])
# pp.ylim([0,.45])
pp.title('Height, elite athletes, <5nmol/L aMab athletes')
pp.show()

Nope, the underlying data genuinely is a better fit for the high-testosterone aMab model. But that good of a fit? In linear space, we multiply each of the individual probabilities to arrive at the Bayes factor. That's equivalent to raising the geometric mean to the nth power, where n is the number of athletes. Since n = 26 here, even a geometric mean barely above one can generate a big Bayes factor.

(Click here to show the code)

temp = [lnProb_gaussian( np.median(models['hT_aMab_height'],axis=0), x ) - \
       lnProb_gaussian( np.median(models['lT_aFab_height'],axis=0), x ) for x in lt5_amab_height]

print( "{}th root of the median Bayes factor of the high-T aMab model applied to low-T aMab athletes: {:.4f}".format(len(lt5_amab_height), \
                                                            np.exp(percentiles[2]/len(temp))) )
print( "{}th root of the Bayes factor for the median marginal: {:.4f}".format(len(lt5_amab_height), \
                                                            np.exp(np.mean(temp))) )

26th root of the median Bayes factor of the high-T aMab model applied to low-T aMab athletes: 5.2519
26th root of the Bayes factor for the median marginal: 3.6010

Note that the Bayes factor we generate by using the median of the marginal for each parameter isn't as strong as the median Bayes factor in the above convolution. That's simply because I'm using a small sample from the posterior distribution. Keeping more samples would have brought those two values closer together, but also greatly increased the amount of computation I needed to do to generate all those Bayes factors.
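The "raise it to the nth power" point is easy to check by hand. Here's a small sketch that runs the 26th-root figure back the other way, alongside a few per-athlete values I made up to show how quickly things grow.

```python
import math

n = 26  # number of <5nmol/L aMab athletes with a valid height

for per_athlete in (1.1, 1.5, 2.0, 5.2519):
    # Multiplying n factors in linear space is adding n logs in log space.
    log_bf = n * math.log(per_athlete)
    print(f"{per_athlete:7.4f} ** {n} = {math.exp(log_bf):.2e}")
```

The last row should land back at the median Bayes factor from the summary table, give or take rounding in the printed root.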

With that check out of the way, we can move on to $\overline{BMI}$.

(Click here to show the code)

logBF_gt5_afab_BMI = list()
gt5_afab_BMI_invalid = [0,0]

gt5_afab_BMI = BMI_adj[mask_gt_5nmol & BMI_adj_valid & mask_afab]
numer_gt5_afab_BMI = [lnProb_gammaoffset( pos, gt5_afab_BMI ) for pos in models['lT_aFab_BMI_gamma']]
denom_gt5_afab_BMI = [lnProb_gammaoffset( pos, gt5_afab_BMI ) for pos in models['hT_aMab_BMI_gamma']]

for laf in numer_gt5_afab_BMI:
    for ham in denom_gt5_afab_BMI:
        if not np.isfinite(laf):
            gt5_afab_BMI_invalid[0] += 1
            continue
        elif not np.isfinite(ham):
            gt5_afab_BMI_invalid[1] += 1
            continue
        else:
            logBF_gt5_afab_BMI.append( laf-ham )
        
print("Summary of the BF distribution, for the adjusted BMI of >5nmol/L aFab athletes")
        
percentiles = np.percentile( logBF_gt5_afab_BMI, [5,16,50,84,95] )
geo_mean = np.exp( np.mean(logBF_gt5_afab_BMI) )

temp = np.exp(logBF_gt5_afab_BMI)
mean = np.mean( temp )
favour = np.sum( temp > 1 ) / len(temp)
decisive = np.sum( temp > 19 ) / len(temp)

print("{:>10} {:>10} {:>10} {:>10} {:>10} {:>10} {:>10} {:>10}".format(
    "n","mean","geo.mean","5%","16%","50%","84%","95%"))
print("{:>10} {:10.2e} {:10.2e} {:10.2e} {:10.2e} {:10.2e} {:10.2e} {:10.2e}".format(
    len(gt5_afab_BMI), mean, geo_mean, *np.exp(percentiles) ))
print()
print("Percentage of BF's that favoured the primary hypothesis: {:.2f}%".format(favour*100.))
print("Percentage of BF's that were 'decisive': {:.2f}%".format(decisive*100.))
print("Percentage of non-finite probabilities, when applying the low-T aFab model to high-T aFab athletes: {:.2f}%".format(\
                gt5_afab_BMI_invalid[0]*100./(sum(gt5_afab_BMI_invalid)+len(logBF_gt5_afab_BMI)) ))
print("Percentage of non-finite probabilities, when applying the high-T aMab model to high-T aFab athletes: {:.2f}%".format(\
                gt5_afab_BMI_invalid[1]*100./(sum(gt5_afab_BMI_invalid)+len(logBF_gt5_afab_BMI)) ))

bins = 20
pp.hist( logBF_gt5_afab_BMI, bins, density=1, facecolor='black', alpha=0.2)
pp.legend(['high-T aFab'], loc=0)

pp.axvline( x=percentiles[2], color='r' )
pp.axvline( x=percentiles[1], color='r', alpha=.5 )
pp.axvline( x=percentiles[3], color='r', alpha=.5 )
pp.axvline( x=percentiles[0], color='r', alpha=.1 )
pp.axvline( x=percentiles[4], color='r', alpha=.1 )

pp.xticks( np.linspace(0,50,7), ["{:.1e}".format(x) for x in np.exp(np.linspace(0,50,7))], rotation='vertical' )
pp.yticks([])
pp.title('Bayes factor, BMI, >5nmol/L aFab athletes')
pp.show()

Summary of the BF distribution, for the adjusted BMI of >5nmol/L aFab athletes
         n       mean   geo.mean         5%        16%        50%        84%        95%
         4   1.70e+12   1.06e+05   2.31e+02   1.60e+03   4.40e+04   3.66e+06   3.99e+09

Percentage of BF's that favoured the primary hypothesis: 100.00%
Percentage of BF's that were 'decisive': 99.53%
Percentage of non-finite probabilities, when applying the low-T aFab model to high-T aFab athletes: 0.00%
Percentage of non-finite probabilities, when applying the high-T aMab model to high-T aFab athletes: 10.94%

This distribution is much stranger, with a number of extremely high BF's that badly skew the mean. The offset contributes to this, with 7-12% of the model posteriors for high-T aMab athletes assigning a zero percent likelihood to an adjusted BMI. Those are excluded from the analysis, but they suggest the high-T aMab model poorly describes high-T aFab athletes.

Our credence in the primary hypothesis here is similar to our credence that an elite golfer will not land a hole-in-one on their next shot. That's surprisingly strong, given we're only dealing with four datapoints. More data may water that down, but it's unlikely to overcome that extreme level of credence.

(Click here to show the code)

logBF_lt5_amab_BMI = list()
lt5_amab_BMI_invalid = [0,0]

lt5_amab_BMI = BMI_adj[mask_lt_5nmol & BMI_adj_valid & mask_amab]
numer_lt5_amab_BMI = [lnProb_gammaoffset( pos, lt5_amab_BMI ) for pos in models['hT_aMab_BMI_gamma']]
denom_lt5_amab_BMI = [lnProb_gammaoffset( pos, lt5_amab_BMI ) for pos in models['lT_aFab_BMI_gamma']]

for laf in numer_lt5_amab_BMI:
    for ham in denom_lt5_amab_BMI:
        if not np.isfinite(laf):
            lt5_amab_BMI_invalid[0] += 1
            continue
        elif not np.isfinite(ham):
            lt5_amab_BMI_invalid[1] += 1
            continue
        else:
            logBF_lt5_amab_BMI.append( laf-ham )
        
print("Summary of the BF distribution, for the adjusted BMI of <5nmol/L aMab athletes")

percentiles = np.percentile( logBF_lt5_amab_BMI, [5,16,50,84,95] )
geo_mean = np.exp( np.mean(logBF_lt5_amab_BMI) )

temp = np.exp(logBF_lt5_amab_BMI)
mean = np.mean( temp )
favour = np.sum( temp > 1 ) / len(temp)
decisive = np.sum( temp > 19 ) / len(temp)

print("{:>10} {:>10} {:>10} {:>10} {:>10} {:>10} {:>10} {:>10}".format(
    "n","mean","geo.mean","5%","16%","50%","84%","95%"))
print("{:>10} {:10.2e} {:10.2e} {:10.2e} {:10.2e} {:10.2e} {:10.2e} {:10.2e}".format(
    len(lt5_amab_BMI), mean, geo_mean, *np.exp(percentiles) ))
print()
print("Percentage of BF's that favoured the primary hypothesis: {:.2f}%".format(favour*100.))
print("Percentage of BF's that were 'decisive': {:.2f}%".format(decisive*100.))
print("Percentage of non-finite probabilities, when applying the high-T aMab model to low-T aMab athletes: {:.2f}%".format(\
                lt5_amab_BMI_invalid[0]*100./(sum(lt5_amab_BMI_invalid)+len(logBF_lt5_amab_BMI)) ))
print("Percentage of non-finite probabilities, when applying the low-T aFab model to low-T aMab athletes: {:.2f}%".format(\
                lt5_amab_BMI_invalid[1]*100./(sum(lt5_amab_BMI_invalid)+len(logBF_lt5_amab_BMI)) ))

bins = 20
pp.hist( logBF_lt5_amab_BMI, bins, density=1, facecolor='black', alpha=0.2)
pp.legend(['low-T aMab'], loc=0)

pp.axvline( x=percentiles[2], color='r' )
pp.axvline( x=percentiles[1], color='r', alpha=.5 )
pp.axvline( x=percentiles[3], color='r', alpha=.5 )
pp.axvline( x=percentiles[0], color='r', alpha=.1 )
pp.axvline( x=percentiles[4], color='r', alpha=.1 )

pp.xticks( np.linspace(20,100,10), ["{:.1e}".format(x) for x in np.exp(np.linspace(20,100,10))], rotation='vertical' )
pp.yticks([])
pp.title('Bayes factor, BMI, <5nmol/L aMab athletes')
pp.show()

Summary of the BF distribution, for the adjusted BMI of <5nmol/L aMab athletes
         n       mean   geo.mean         5%        16%        50%        84%        95%
         9   6.64e+35   2.07e+22   4.05e+12   4.55e+16   6.31e+21   7.72e+27   9.81e+32

Percentage of BF's that favoured the primary hypothesis: 100.00%
Percentage of BF's that were 'decisive': 100.00%
Percentage of non-finite probabilities, when applying the high-T aMab model to low-T aMab athletes: 0.00%
Percentage of non-finite probabilities, when applying the low-T aFab model to low-T aMab athletes: 0.00%

The hypotheses' Bayes factor for the adjusted BMI of low-testosterone aMab athletes is much better behaved. Even here, the credence is above three-lightning-strikes territory, pretty decisively favouring the hypothesis.

Our final step would normally be to combine all these individual Bayes factors into a single one. That involves multiplying them all together, however, and a modest number greater than one multiplied by a very large one is an even larger one. It isn't worth the effort; the conclusion is pretty obvious.
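For the curious, a rough sketch of what that multiplication would look like, done in log space. It leans on two assumptions of mine: that the four Bayes factors are independent, and that multiplying the medians is an acceptable stand-in for the median of the combined distribution. Neither is guaranteed, so treat the output as an order-of-magnitude illustration only.

```python
import math

# Median Bayes factors reported in the four summaries above.
median_bf = {
    "height, >5nmol/L aFab": 5.66,
    "height, <5nmol/L aMab": 5.35e18,
    "BMI, >5nmol/L aFab": 4.40e4,
    "BMI, <5nmol/L aMab": 6.31e21,
}

# Assumption: the factors are independent, so they multiply (logs add).
log10_total = sum(math.log10(bf) for bf in median_bf.values())

print(f"combined Bayes factor ~ 10^{log10_total:.1f}")
```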

Truth and Consequences

Our primary hypothesis is on quite solid ground: athletes with exceptional testosterone levels are more like athletes of the same sex with typical testosterone levels than they are like athletes of a different sex with similar testosterone levels. If we divide up sports by testosterone level, then, roughly 6-8% of assigned-male athletes will wind up in the <5 nmol/L group, and about the same share of assigned-female athletes will be in the >5 nmol/L group. Note, however, that it doesn't follow that 6-8% of those in the <5 nmol/L group will be assigned-male. About 41% of the athletes at the 2018 Olympics were assigned-female, for instance. If we fix the rate of exceptional testosterone levels at 7%, and assume PyeongChang's rate is typical, a quick application of Bayes' theorem reveals

$$ \begin{align} p( \text{aMab} \mid \text{<5nmol/L} ) &= \frac{ p( \text{<5nmol/L} \mid \text{aMab} ) p( \text{aMab} ) }{ p( \text{<5nmol/L} \mid \text{aMab} ) p( \text{aMab} ) + p( \text{<5nmol/L} \mid \text{aFab} ) p( \text{aFab} ) } \\ {} &= \frac{ 0.07 \cdot 0.59 }{ 0.07 \cdot 0.59 + 0.93 \cdot 0.41 } \\ {} &\approx 9.8\% \end{align} $$

If all those assumptions are accurate, about 10% of <5 nmol/L athletes will be assigned-male, more-or-less matching the number I calculated way back at the start. In sports where performance is heavily correlated with height or $\overline{BMI}$, then, the assigned-male 10% of the <5 nmol/L group will heavily dominate the rankings. The odds of a woman earning recognition in such a sport are negligible, leading many of them to drop out. This increases the proportion of men in that sport, leading to more domination of the rankings, more women dropping out, and a nasty feedback loop.

Conversely, about 5% of >5nmol/L athletes will be assigned-female. In a heavily-correlated sport, those women will be outclassed by the men and have little chance of earning recognition for their achievements. They have no incentive to compete, so they'll likely drop out or avoid these sports as well.
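Both of those shares come from the same application of Bayes' theorem, with the groups swapped. Here's a minimal sketch using the 7% rate and the 41%/59% split assumed above; `share` is a hypothetical helper name.

```python
def share(p_given_group, p_group, p_given_other, p_other):
    """Bayes' theorem for two mutually exclusive, exhaustive groups."""
    numer = p_given_group * p_group
    return numer / (numer + p_given_other * p_other)

p_afab, p_amab = 0.41, 0.59   # PyeongChang split, assumed typical
rate = 0.07                   # assumed rate of exceptional testosterone levels

amab_in_low = share(rate, p_amab, 1 - rate, p_afab)    # p(aMab | <5nmol/L)
afab_in_high = share(rate, p_afab, 1 - rate, p_amab)   # p(aFab | >5nmol/L)

print(f"{amab_in_low:.1%} of the <5nmol/L group is assigned-male")
print(f"{afab_in_high:.1%} of the >5nmol/L group is assigned-female")
```

Nudging `rate` between 0.06 and 0.08 shows how sensitive those shares are to the assumed rate of exceptional testosterone levels.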

In events where physicality has less or no correlation with sporting performance, these effects will be less pronounced or non-existent, of course. But this still translates into fewer assigned-female athletes competing than in the current system.

But it gets worse! We'd also expect an uptick in the number of assigned-female athletes doping, primarily with testosterone inhibitors to bring themselves just below the 5nmol/L line. Alternatively, high-testosterone aFab athletes may inject large doses of testosterone to bulk up and remain competitive with their assigned-male competitors.

By dividing up testosterone levels into only two categories, sporting authorities are implicitly stating that everyone within those categories is identical. A number of athletes would likely go to court to argue that boosting or inhibiting testosterone should be legal, provided they do not cross the 5nmol/L line. If they're successful, then either the rules around testosterone usage would be relaxed, or sporting authorities would be forced to subdivide these groups further. This would lead to an uptick in testosterone doping among all athletes, not just those assigned female.

Notice that assigned-male athletes don't have the same incentives to drop out, and in fact the low-testosterone subgroup may even be encouraged to compete as they have an easier path to sporting fame and glory. Sports where performance is heavily correlated with height or $\overline{BMI}$ will come to be dominated by men.

Let's Put a Bow On This One

[1:15] In a nutshell, I find the arguments and logic that currently permit transgender women to compete against biological women to be remarkably flawed, and I’m convinced that unless quickly rectified, this will KILL women’s sports.

[14:00] I don’t want to see the day when women’s athletics is dominated by Y chromosomes, but without a change in policy, that is precisely what’s going to happen.

It's rather astounding. Transgender athletes are not a problem, on several levels; as I've pointed out before, they've been allowed to compete in the category they identify with for over a decade in some places, and yet no transgender athlete has come to dominate any sport. The Olympics has held the door open since 2004, and not a single transgender athlete has ever openly competed as a transgender athlete. Rationality Rules, like other transphobes, is forced to cherry-pick and commit lies of omission among a handful of examples, inflating them to seem more significant than they actually are.

In response to this non-existent problem, Rationality Rules' proposed solution would accomplish the very thing he wants to avoid! You don't get that turned around if you're a rational person with a firm grasp on the science.

No, this level of self-sabotage is only possible if you're a clueless bigot who's ignorant of the relevant science, and so frightened of transgender people that your critical thinking skills abandon you. The vast difference between what Rationality Rules claims the science says, and what his own citations say, must be because he knows that if he puts on a good enough act nobody will check his work. Everyone will walk away assuming he's rational, rather than a scared, dishonest loon.

It's hard to fit any other conclusion to the data.

Separate and Unequal

Oh HEY, didn’t see you there! Sorry for the absence, but life caught up to me again. I was doggedly plugging away on my next post when I stumbled across a comic that must have been tailor-made for my current headspace, and it simply had to jump the queue.


Texas Sharpshooter

Quick Note

I’m trying something new! This blog post is available in two places, both here and on a Jupyter notebook. Over there, you can tweak and execute my source code, using it as a sandbox for your own explorations. Over here, it’s just a boring ol’ webpage without any fancy features, albeit one that’s easier to read on the go. Choose your own adventure!

Oh also, CONTENT WARNING: I’ll briefly be discussing sexual assault statistics from the USA at the start, in an abstract sense.

Introduction

[5:08] Now this might seem pedantic to those not interested in athletics, but in the athletic world one percent is absolutely massive. Just take for example the 2016 Olympics. The difference between first and second place in the men’s 100-meter sprint was 0.8%.

I’ve covered this argument from Rationality Rules before, but time has made me realise my original presentation had a problem.

His name is Steven Pinker.

(Click here to show the code)

%matplotlib inline

import matplotlib.pyplot as plt
import pandas as pd

assault = pd.read_csv('https://gitlab.com/hjhornbeck/texas_sharpshooter/raw/master/data/pinker_rape_usa.tsv',sep='\t')

plt.rcParams['figure.dpi'] = 96
plt.rcParams['figure.figsize'] = [9.5, 6]

plt.plot( assault['# Year'], assault['Rate'], 'b' )
plt.title('Forcible Rape, USA, Police reports')
plt.xlabel('Year')
plt.ylabel('Rate per 100,000')
plt.show()

He looks at that graph, and sees a decline in violence. I look at that chart, and see an increase in violence. How can two people look at the same data, and come to contradictory conclusions?

Simple, we’ve got at least two separate mental models.

(Click here to show the code)

import emcee
import numpy as np
import os
import scipy.optimize as spop

# Some of this code is based on the following examples:
#  https://emcee.readthedocs.io/en/v2.2.1/user/line/
#  https://jakevdp.github.io/blog/2014/06/14/frequentism-and-bayesianism-4-bayesian-in-python/

def lnLinear( theta, x, y ):
    intercept, slope, lnStdDev = theta
    prior       = -1.5*np.log1p(slope*slope) - lnStdDev
    model       = slope*x + intercept        
    inv_sig2    = 1. / (model**2 * np.exp(2*lnStdDev))
    return prior - .5*( np.sum( ((y-model)**2) * inv_sig2 - np.log(inv_sig2) ) )

# Model 1: What's happened over the last two decades?
negLnLin = lambda *args: -lnLinear(*args)

max_year = np.max( assault['# Year'] )
start = 1991
end = max_year

mask = assault['# Year'] > start
print("Finding the maximal likelihood, please wait ...", end='')
intercept_1, slope_1, error_1 = spop.minimize( negLnLin, [1000,-1,5], 
        args=(assault['# Year'][mask], assault['Rate'][mask]) )['x']
print(" done.")

# Model 2: Where are the extremes?
min_year = np.min( assault['# Year'] )
slope_2 = (float(assault['Rate'][assault['# Year']==max_year]) - float(assault['Rate'][assault['# Year']==min_year])) / \
    (max_year - min_year)
intercept_2 = float(assault['Rate'][assault['# Year']==min_year] - slope_2*min_year)

# Model 3: What trendlines fit the data?
ndim, nwalkers, nsamples, keep = 3, 64, 300, 1
seed = [np.array([intercept_2,slope_2,3.]) + 1e-4*np.random.randn(ndim) for i in range(nwalkers)]

sampler = emcee.EnsembleSampler(nwalkers, ndim, lnLinear, threads=os.cpu_count(),
        args=[ np.array(assault['# Year']), np.array(assault['Rate']) ] )
print("Running an MCMC sampler, please wait ...", end='')
sampler.run_mcmc(seed, nsamples)
print(" done.")

model_3 = sampler.chain[:, -keep:, :].reshape((-1, ndim))

print("Charting the results, please wait ...")
plt.plot( assault['# Year'], assault['Rate'], 'k' )
plt.xlabel('Year')
plt.ylabel('Rate per 100,000')
plt.plot( assault['# Year'], slope_1*assault['# Year'] + intercept_1, 'r' )
plt.plot( assault['# Year'], slope_2*assault['# Year'] + intercept_2, 'g' )
for entry in model_3:
    plt.plot( assault['# Year'], entry[1]*assault['# Year'] + entry[0], 'b', alpha=0.05 )
plt.legend( ['Original Data', 'Model 1 (Pinker)', 'Model 2 (Mine)', 'Model 3 (Mine)'])
plt.show()

Finding the maximal likelihood, please wait ... done.
Running an MCMC sampler, please wait ... done.
Charting the results, please wait ...

All Pinker cares about is short-term trends here, as he’s focused on “The Great Decline” in crime since the 1990’s. His mental model looks at the general trend over the last two decades of data, and discards the rest of the datapoints. It’s the model I’ve put in red.

I used two separate models in my blog post. The first is quite crude: is the last datapoint better than the first? This model is quite intuitive, as it amounts to “leave the place in better shape than when you arrived,” and it’s dead easy to calculate. It discards all but two datapoints, though, which is worse than Pinker’s model. I’ve put this one in green.

The best model, in my opinion, wouldn’t discard any datapoints. It would also incorporate as much uncertainty as possible about the system. Unsurprisingly, given my blogging history, I consider Bayesian statistics to be the best way to represent uncertainty. A linear model is the best choice for general trends, so I went with a three-parameter likelihood and prior:

$p( x,y | m,b,\log(\sigma) ) = e^{ -\frac 1 2 \big(\frac{y-k}{\sigma}\big)^2 }(\sigma \sqrt{2\pi})^{-1}, ~ k = x \cdot m + b$

$p( m,b,\log(\sigma) ) = \frac 1 \sigma (1 + m^2)^{-\frac 3 2}$

This third model encompasses all possible trendlines you could draw on the graph, but it doesn’t hold them all to be equally likely. Since time is short, I used an MCMC sampler to randomly sample the resulting probability distribution, and charted that sample in blue. As you can imagine this requires a lot more calculation than the second model, but I can’t think of anything superior.

Which model is best depends on the context. If you were arguing just over the rate of police-reported sexual assault from 1992 to 2012, Pinker’s model would be pretty good if incomplete. However, his whole schtick is that long-term trends show a decrease in violence, and when it comes to sexual violence in particular he’s the only one who dares to talk about this. He’s not being self-consistent, which is easier to see when you make your implicit mental models explicit.

Pointing at Variance Isn’t Enough

Let’s return to Rationality Rules’ latest transphobic video. In the citations, he explicitly references the men’s 100m sprint at the 2016 Olympics. That’s a terribly narrow window to view athletic performance through, so I tracked down the race times of all eight finalists on the IAAF’s website and tossed them into a spreadsheet.

(Click here to show the code)

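# parse the 'Date' column as datetimes, so the date comparisons and subtractions further down work
dataset = pd.read_csv('https://gitlab.com/hjhornbeck/texas_sharpshooter/raw/master/data/100_metre.tsv',delimiter='\t',parse_dates=['Date'],dtype={'Result':np.float64,'Wind':np.float64})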

olympics_2016 = dataset['Competition'] == "Rio de Janeiro Olympic Games"
finals = dataset['Race'] == "F"

fastest_time = min(dataset['Result'][olympics_2016 & finals])

table = {"Athlete":dataset['# Name'][olympics_2016 & finals],
         "Result":dataset['Result'][olympics_2016 & finals],
         "Delta":dataset['Result'][olympics_2016 & finals] - fastest_time
        }
print("Rio de Janeiro Olympic Games, finals")
print( pd.DataFrame(table).sort_values("Result").to_string(index=False) )

Rio de Janeiro Olympic Games, finals
Athlete  Result  Delta
     bolt    9.81   0.00
   gatlin    9.89   0.08
de grasse    9.91   0.10
    blake    9.93   0.12
  simbine    9.94   0.13
    meite    9.96   0.15
   vicaut   10.04   0.23
  bromell   10.06   0.25

Here, we see exactly what Rationality Rules sees: Usain Bolt, the current world record holder, earned himself another Olympic gold medal in the 100m sprint. First and third place are separated by a tenth of a second, and the slowest person in the finals was a mere quarter of a second behind the fastest. That’s a small fraction of the time it takes to complete the event.

(Click here to show the code)

mask_2016   = (dataset['Date'] > '2016-01-01') & (dataset['Date'] < '2017-01-01')
names_2016  = pd.Categorical( dataset['# Name'] ).categories

all_2016    = list()
for name in names_2016:
    temp = np.array( dataset['Result'][mask_2016 & (dataset['# Name'] == name)] )
    temp.sort()
    all_2016.append( temp )

all_career = list()
for name in names_2016:
    all_career.append( np.max( dataset['Date'][dataset['# Name'] == name] ) - np.min( dataset['Date'][dataset['# Name'] == name] ) )

all_races = list()
for name in names_2016:
    all_races.append( len( dataset['Result'][dataset['# Name'] == name] ) )

mean_time       = sorted( np.linspace(0,len(all_2016)-1,len(all_2016),dtype=int), key=lambda x:np.mean(all_2016[x]))
median_time     = sorted( np.linspace(0,len(all_2016)-1,len(all_2016),dtype=int), key=lambda x:np.median(all_2016[x]))
min_time        = sorted( np.linspace(0,len(all_2016)-1,len(all_2016),dtype=int), key=lambda x:np.min(all_2016[x]))
fastest_time    = np.min([ np.min(data) for data in all_2016 ])

print('Race times in 2016, sorted by fastest time')
print("{0:16} {1:16} {2:16} {3:16} {4:16}".format('Name','Min time', 'Mean', 'Median', 'Personal max-min'))
print("{}".format( '-' * (6*16 + 5) ))
for i in min_time:
    print("{0:16} {1:16} {2:12.2f} {3:12.2f} {4:12.2f}".format( names_2016[i], np.min(all_2016[i]), np.mean(all_2016[i]), 
        np.median(all_2016[i]), np.max(all_2016[i]) - np.min(all_2016[i]) ))

Race times in 2016, sorted by fastest time
Name             Min time         Mean             Median           Personal max-min
-----------------------------------------------------------------------------------------------------
gatlin                        9.8         9.95         9.94         0.39
bolt                         9.81         9.98        10.01         0.34
bromell                      9.84        10.00        10.01         0.30
vicaut                       9.86        10.01        10.02         0.33
simbine                      9.89        10.10        10.08         0.43
de grasse                    9.91        10.07        10.04         0.41
blake                        9.93        10.04         9.98         0.33
meite                        9.95        10.10        10.05         0.44

Here, we see what I see: the person who won Olympic gold that year didn’t have the fastest time. That honour goes to Justin Gatlin, who squeaked ahead of Bolt by a hundredth of a second.

Come to think of it, isn’t the fastest time a poor judge of how good an athlete is? Picture one sprinter with a faster average time than another, and a second with a faster minimum time. The first athlete will win more races than the second. By that metric, Gatlin’s lead grows to three hundredths of a second.

The mean, alas, is easily tugged around by outliers. If someone had an exceptionally good or bad race, they could easily shift their overall mean a decent ways from where the mean of every other result lies. The median is a lot more resistant to the extremes, and thus a fairer measure of overall performance. By that metric, Bolt is now tied for third with Trayvon Bromell.
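To make that concrete, here is a minimal sketch using a handful of made-up race times (they are invented for illustration and do not come from the IAAF dataset). Adding a single unusually slow race drags the mean a tenth of a second, while the median barely moves.

```python
import numpy as np

# Hypothetical race times in seconds, invented purely for illustration.
typical = np.array([9.95, 9.97, 9.98, 10.00, 10.02])
with_bad_race = np.append(typical, 10.60)   # tack on one unusually slow race

# How far does each summary statistic move because of that one race?
shift_mean = np.mean(with_bad_race) - np.mean(typical)
shift_median = np.median(with_bad_race) - np.median(typical)

print("mean shifts by   {:.3f} s".format(shift_mean))
print("median shifts by {:.3f} s".format(shift_median))
```

The same logic applies in the other direction: one exceptionally fast race would pull the mean down without saying much about a sprinter's usual form.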

We could also judge how good an athlete is by how consistent they were in the given calendar year. By this metric, Bolt falls into fourth place behind Bromell, Jimmy Vicaut, and Yohan Blake. Even if you don’t agree with this metric, notice how everyone’s race times in 2016 vary by between three and four tenths of a second. It’s hard to argue that a performance edge of a tenth of a second matters when even at the elite level sprinters’ times will vary by significantly more.

But let’s put on our Steven Pinker glasses. We don’t judge races by medians, we go by the fastest time. We don’t award records for the lowest average or most consistent performance, we go by the fastest time. Yes, Bolt didn’t have the fastest 100m time in 2016, but now we’re down to hundredths of a second; if anything, we’ve dug up more evidence that itty-bitty performance differences matter. If I’d just left things at that last paragraph, which is about as far as I progressed the argument last time, a Steven Pinker would likely have walked away even more convinced that Rationality Rules got it right.

I don’t have to leave things there, though. This time around, I’ll make my mental model as explicit as possible. Hopefully by fully arguing the case, instead of dumping out data and hoping you and I share the same mental model, I could manage to sway even a diehard skeptic. To further seal the deal, the Jupyter notebook will allow you to audit my thinking or even create your own model. No need to take my word.

I’m laying everything out in clear sight. I hope you’ll give it all a look before dismissing me.

Model Behaviour

Our choice of model will be guided by the assumptions we make about how athletes perform in the 100 metre sprint. If we’re going to do this properly, we have to lay out those assumptions as clearly as possible.

The Best Athlete Is the One Who Wins the Most. Our first problem is to decide what we mean by “best,” when it comes to the 100 metre sprint. Rather than use any metric like the lowest possible time or the best overall performance, I’m going to settle on something I think we’ll both agree to: the athlete who wins the most races is the best. We’ll be pitting our models against each other as many times as possible via virtual races, and see who comes out on top.
Pobody’s Nerfect. There is always going to be a spanner in the works. Maybe one athlete has a touch of the flu, maybe another is going through a bad breakup, maybe a third got a rock in their shoe. Even if we can control for all that, human beings are complex machines with many moving parts. Our performance will vary. This means we can’t use point estimates for our model, like the minimum or median race time, and instead must use a continuous statistical distribution. This assumption might seem like begging the question, as variance is central to my counter-argument, but note that I’m only asserting there’s some variance. I’m not saying how much variance there is. It could easily be so small as to be inconsequential, in the process creating strong evidence that Rationality Rules was right.
Physics Always Wins. No human being can run at the speed of light. For that matter, nobody is going to break the sound barrier during the 100 metre sprint. This assumption places a hard constraint on our model, that there is a minimum time anyone could run the 100m. It rules out a number of potential candidates, like the Gaussian distribution, which allows negative times.
It’s Easier To Move Slow Than To Move Fast. This is kind of related to the last one, but it’s worth stating explicitly. Kinetic energy is proportional to the square of the velocity, so building up speed requires dumping an ever-increasing amount of energy into the system. Thus our model should have a bias towards slower times, giving it a lopsided look.

Based on all the above, I propose the Gamma distribution would make a suitable model.

$\Gamma(x | \alpha, \beta ) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}$

(Be careful not to confuse the distribution with the function. I may need the Gamma function to calculate the Gamma distribution, but the Gamma function isn’t a valid probability distribution.)
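As a sanity check on that distinction, here is a small sketch that evaluates the formula above by hand and compares it against SciPy. The helper name `gamma_pdf` and the parameter values are illustrative choices made here, not part of the notebook. Note that the formula is written with a rate β, while SciPy's `scale` argument is the reciprocal of that rate.

```python
import numpy as np
import scipy.special as spsp
import scipy.stats as spst

def gamma_pdf(x, alpha, beta):
    """Gamma density in the shape/rate form of the formula above (illustrative helper)."""
    return beta**alpha / spsp.gamma(alpha) * x**(alpha - 1) * np.exp(-beta * x)

x = np.linspace(0.1, 10, 100)
alpha, beta = 5.0, 2.0      # arbitrary illustrative values

# SciPy parameterises the distribution by scale, which is 1/beta in this form.
by_hand = gamma_pdf(x, alpha, beta)
by_scipy = spst.gamma.pdf(x, alpha, scale=1. / beta)
print("max abs difference:", np.max(np.abs(by_hand - by_scipy)))

# The Gamma *function* is only the normalising ingredient; for whole numbers
# it is a shifted factorial, Gamma(n) = (n-1)!
print("Gamma function at 5:", spsp.gamma(5.0))
```

The Gamma function only shows up in the normalising constant; it is the whole expression that integrates to one.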

(Click here to show the code)

import scipy.stats as spst

x = np.linspace(0,10,1023)

print("Three versions of the Gamma Distribution")

plt.subplot( 131 )
plt.plot( x, spst.gamma.pdf(x, 5, scale=.7) )
plt.xticks([0])
plt.yticks([])
plt.xlabel('"Canonical"')

plt.subplot( 132 )
plt.plot( x, spst.gamma.pdf(x, 1, scale=2) )
plt.xticks([0])
plt.yticks([])
plt.xlabel('Exponential')

plt.subplot( 133 )
plt.plot( x, spst.gamma.pdf(x, 200, scale=.025) )
plt.xticks([0])
plt.yticks([])
plt.xlabel('pseudo-Gaussian')

plt.show()

Three versions of the Gamma Distribution

It’s a remarkably flexible distribution, capable of duplicating both the Exponential and Gaussian distributions. That’s handy, as if one of our above assumptions is wrong the fitting process could still come up with a good fit. Note that the Gamma distribution has a finite bound at zero, which is equivalent to stating that negative values are impossible. The variance can be expanded or contracted arbitrarily, so it isn’t implicitly supporting my arguments. Best of all, we’re not restricted to anchor the distribution at zero. With a little tweak …

$\Gamma(x | \alpha, \beta, b ) = \frac{\beta^\alpha}{\Gamma(\alpha)} \hat x^{\alpha-1} e^{-\beta \hat x}, ~ \hat x = x - b$

… we can shift that zero mark wherever we wish. The b parameter sets the minimum value our model predicts, while α controls the underlying shape and β controls the scale or rate associated with this distribution. α ≤ 1 nets you the Exponential (exactly at α = 1, and an even sharper spike at the minimum below that), and large values of α lead to something very Gaussian. Conveniently for me, SciPy already supports this three-parameter tweak.
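Those claims are easy to check numerically. The sketch below uses arbitrary parameter values (none of them come from the fits in this post) and assumes SciPy's `loc` and `scale` arguments are the ones doing the work of b and the scale. It confirms that `loc` is just the x − b substitution, that a shape of exactly 1 is the Exponential, and that a large shape lands close to a Gaussian with the same mean and variance.

```python
import numpy as np
import scipy.stats as spst

# Illustrative values only; none of these come from the fits in this post.
alpha, scale, b = 2.0, 0.1, 9.8
x = np.linspace(9.5, 11, 16)

# 1. SciPy's `loc` argument plays the role of b: it is the substitution x_hat = x - b.
shifted = spst.gamma.pdf(x, alpha, scale=scale, loc=b)
by_hand = spst.gamma.pdf(x - b, alpha, scale=scale)
print("loc matches x - b:", np.allclose(shifted, by_hand))
print("density below b:  ", spst.gamma.pdf(9.7, alpha, scale=scale, loc=b))

# 2. A shape of exactly 1 is the Exponential distribution.
t = np.linspace(0.01, 10, 500)
gamma_1 = spst.gamma.pdf(t, 1, scale=2)
expon = spst.expon.pdf(t, scale=2)
print("shape 1 vs Exponential, max abs diff:", np.max(np.abs(gamma_1 - expon)))

# 3. A large shape approaches a Gaussian with matching mean and variance.
big_shape, big_scale = 200, .025
gamma_200 = spst.gamma.pdf(t, big_shape, scale=big_scale)
gauss = spst.norm.pdf(t, loc=big_shape*big_scale, scale=np.sqrt(big_shape)*big_scale)
print("shape 200 vs Gaussian, max abs diff:", np.max(np.abs(gamma_200 - gauss)))
```

The last difference will be small but not zero, since the Gamma distribution keeps a slight lean towards larger values at any finite shape.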

My intuition is that the Gamma distribution on the left, with α > 1 but not too big, is the best model for athlete performance. That implies an athlete’s performance will hover around a specific value, and while they’re capable of faster times those are more difficult to pull off. The Exponential distribution, with α < 1, is most favourable to Rationality Rules, as it asserts the race time we’re most likely to observe is also the fastest time an athlete can do. We’ll never actually see that time, but what we observe will cluster around that minimum.

Running the Numbers

Enough chatter, let’s fit some models! For this one, my prior will be

$p( \alpha, \beta, b ) = \begin{cases} 0, & \alpha \le 0 \\ 0, & \beta \le 0 \\ 0, & b \le 0 \\ 1, & \text{otherwise} \end{cases},$

which is pretty light and only exists to filter out garbage values.

(Click here to show the code)

import sys

def lnprob( theta, data ):

    alpha, beta, b   = theta
    if (alpha <= 0) or (beta <= 0) or (b <= 0):
        return -np.inf
    return np.sum( spst.gamma.logpdf( data, alpha, scale=beta, loc=b ) )

ndim, nwalkers, nsamples, keep = 3, 64, 300, 5
models = [[] for x in all_2016]
summaries = [[] for x in all_2016]

print("Generating some models for 2016 race times (a few seconds each) ...")
print( "{:16}\t{:16}\t{:16}\t{:16}".format("# name","α","β","b") )
for loc,idx in enumerate(min_time):

    data = all_2016[idx]

    mean = np.mean( data ) - fastest_time      # adjust for the location offset
    seed = list()
    i = 0
    while i < nwalkers:
        beta = np.random.rand()*1.5 + .5
        b = fastest_time - np.random.rand()*.3
        if lnprob( [mean*beta, beta, b], data ) > -np.inf:
            seed.append( [mean*beta, beta, b] )
            i += 1

    sampler = emcee.EnsembleSampler(nwalkers, ndim, lnprob, args=[data], threads=os.cpu_count())
    sampler.run_mcmc(seed, nsamples)

    samples = sampler.chain[:, -keep:, :].reshape((-1, ndim))
    alpha_mcmc, beta_mcmc, b_mcmc = map(lambda v: (v[1], v[2]-v[1], v[1]-v[0]), 
        zip(*np.percentile(samples, [16, 50, 84], axis=0)))

    print("{:16}".format(names_2016[idx]), end='')
    print("\t{:.3f} (+{:.3f} -{:.3f})".format(*alpha_mcmc), end='')
    print("\t{:.3f} (+{:.3f} -{:.3f})".format(*beta_mcmc), end='')
    print("\t{:.3f} (+{:.3f} -{:.3f})".format(*b_mcmc))
    sys.stdout.flush()

    models[idx] = samples
    summaries[idx] = [*alpha_mcmc, *beta_mcmc, *b_mcmc]

print("... done.")

Generating some models for 2016 race times (a few seconds each) ...
# name          	α               	β               	b               
gatlin          	0.288 (+0.112 -0.075)	1.973 (+0.765 -0.511)	9.798 (+0.002 -0.016)
bolt            	0.310 (+0.107 -0.083)	1.723 (+0.596 -0.459)	9.802 (+0.008 -0.025)
bromell         	0.339 (+0.115 -0.082)	1.677 (+0.570 -0.404)	9.836 (+0.004 -0.032)
vicaut          	0.332 (+0.066 -0.084)	1.576 (+0.315 -0.400)	9.856 (+0.004 -0.013)
simbine         	0.401 (+0.077 -0.068)	1.327 (+0.256 -0.226)	9.887 (+0.003 -0.018)
de grasse       	0.357 (+0.073 -0.082)	1.340 (+0.274 -0.307)	9.907 (+0.003 -0.022)
blake           	0.289 (+0.103 -0.085)	1.223 (+0.437 -0.361)	9.929 (+0.001 -0.008)
meite           	0.328 (+0.089 -0.067)	1.090 (+0.295 -0.222)	9.949 (+0.000 -0.003)
... done.

This text can’t change based on the results of the code, so this is only a guess, but I’m pretty sure you’re seeing a lot of α values less than one. That really had me worried when I first ran this model, as I was already conceding ground to Rationality Rules by focusing only on the 100 metre sprint, where even I think that physiology plays a significant role. I did a few trial runs with a prior that forced α > 1, but the resulting models would hug that threshold as tightly as possible. Comparing likelihoods, the α < 1 versions were always more likely than the α > 1 ones.

The fitting process was telling me my intuition was wrong, and the best model here is the one that most favours Rationality Rules. Look at the b values, too. There’s no way I could have sorted the models based on that parameter before I fit them; instead, I sorted them by each athlete’s minimum time. Sure enough, the model is hugging the fastest time each athlete posted that year, rather than a hypothetical minimum time they could achieve.

(Click here to show the code)

athlete = 0
x = np.linspace(9.5,11,300)
for i,row in enumerate(models[athlete]):
    if i < 100:
        plt.plot( x, spst.gamma.pdf(x, row[0], scale=row[1], loc=row[2]), alpha=0.05, color='k' )
    else:
        break
plt.yticks([])
plt.yscale('log')
plt.ylabel('log(likelihood)')
plt.title("100 models of {}'s 2016 race times".format(names_2016[athlete]))
plt.show()

Charting some of the models in the posterior drives this home. I’ve looked at a few by tweaking the “athlete” variable, as well as the output of multiple sample runs, and they all are dominated by Exponential distributions.

Dang, we’ve tilted the playing field quite a ways in Rationality Rules’ favour.

Still, let’s simulate some races. For each race, I’ll pick a random trio of parameters from each model’s posterior and feed that into SciPy’s random number routines to generate a race time for each sprinter. Fastest time wins, and we tally up those wins to estimate the odds of any one sprinter coming in first.

Before running those simulations, though, we should make some predictions. Rationality Rules’ view is that (emphasis mine) …

[9:18] You see, I absolutely understand why we have and still do categorize sports based upon sex, as it’s simply the case that the vast majority of males have significant athletic advantages over females, but strictly speaking it’s not due to their sex. It’s due to factors that heavily correlate with their sex, such as height, width, heart size, lung size, bone density, muscle mass, muscle fiber type, hemoglobin, and so on. Or, in other words, sports are not segregated due to chromosomes, they’re segregated due to morphology.

[16:48] Which is to say that the attributes granted from male puberty that play a vital role in explosive events – such as height, width, limb length, and fast twitch muscle fibers – have not been shown to be sufficiently mitigated by HRT in trans women.

[19:07] In some events – such as long-distance running, in which hemoglobin and slow-twitch muscle fibers are vital – I think there’s a strong argument to say no, [transgender women who transitioned after puberty] don’t have an unfair advantage, as the primary attributes are sufficiently mitigated. But in most events, and especially those in which height, width, hip size, limb length, muscle mass, and muscle fiber type are the primary attributes – such as weightlifting, sprinting, hammer throw, javelin, netball, boxing, karate, basketball, rugby, judo, rowing, hockey, and many more – my answer is yes, most do have an unfair advantage.

… human morphology due to puberty is the primary determinant of race performance. Since our bodies change little after puberty, that implies your race performance should be both constant and consistent. The most extreme version of this argument states that the fastest person should win 100% of the time. I doubt Rationality Rules holds that view, but I am pretty confident he’d place the odds of the fastest person winning quite high.

The opposite view is that the winner is due to chance. Since there are eight athletes competing here, each would have a 12.5% chance of winning. I certainly don’t hold that view, but I do argue that chance plays a significant role in who wins. I thus want the odds of the fastest person winning to be somewhere above 12.5%, but not too much higher.

(Click here to show the code)

simulations = 15000
wins = [0] * len(all_2016)

print("Simulating {} races, please wait ...".format(simulations), end='')
for sim in range(simulations):
    
    times = list()
    
    for athlete,_ in enumerate(all_2016):

        choice  = int( np.random.rand()*len(models[athlete]) )
        times.append( models[athlete][choice][2] + np.random.gamma( models[athlete][choice][0], models[athlete][choice][1] ) )
        
    wins[ np.argmin(times) ] += 1

print(" done.")
    
by_wins = sorted( np.linspace(0,len(all_2016)-1,len(all_2016),dtype=int), key=lambda x: wins[x], reverse=True)

print()
print("Number of wins during simulation")
print("--------------------------------")
for i,athlete in enumerate(by_wins):
    print( "{:24} {:8} ({:.2f}%)".format(names_2016[athlete], wins[athlete], wins[athlete]*100./simulations) )
sys.stdout.flush()

Simulating 15000 races, please wait ... done.

Number of wins during simulation
--------------------------------
gatlin                       5174 (34.49%)
bolt                         4611 (30.74%)
bromell                      2286 (15.24%)
vicaut                       1491 (9.94%)
simbine                       530 (3.53%)
de grasse                     513 (3.42%)
blake                         278 (1.85%)
meite                         117 (0.78%)

Whew! The fastest 100 metre sprinter of 2016 only had a one in three chance of winning Olympic gold. Of the eight athletes, three had odds better than chance of winning. Even with the field tilted in favor of Rationality Rules, this strongly hints that other factors are more determinative of performance than fixed physiology.

But let’s put our Steven Pinker glasses back on for a moment. Yes, the odds of the fastest 100 metre sprinter winning the 2016 Olympics are surprisingly low, but look at the spread between first and last place. What’s on my screen tells me that Gatlin is 40-50 times more likely to win Olympic gold than Ben Youssef Meite, which is a pretty substantial gap. Maybe we can rescue Rationality Rules?

In order for Meite to win, though, he didn’t just have to beat Gatlin. He had to also beat six other sprinters. If p_M represents the geometric mean of Meite beating one sprinter, then his odds of beating seven are p_M⁷. The same rationale applies to Gatlin, of course, but because the geometric mean of him beating seven other racers is higher than p_M, repeatedly multiplying it by itself results in a much greater number. With a little math, we can use the number of wins above to estimate how well the first-place finisher would fare against the last-place finisher in a one-on-one race.
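To spell out that little math, here is a sketch under the same simplifying assumption as the paragraph above: each sprinter's chance of winning the full race behaves like the seventh power of one typical head-to-head probability. The symbols $p_G$, $r$ and $p$ are labels introduced here for convenience. Write $p_G$ for Gatlin's counterpart to $p_M$, and $r$ for the ratio of their simulated wins. Then

$r = \frac{p_G^7}{p_M^7} \quad \Rightarrow \quad \frac{p_G}{p_M} = r^{\frac 1 7}$

If we further treat $p_G / p_M$ as the odds $p / (1 - p)$ of Gatlin beating Meite one-on-one, solving for $p$ gives

$p = \frac{r^{\frac 1 7}}{1 + r^{\frac 1 7}}$

For a win ratio of $r = 39.5$, for instance, $r^{\frac 1 7} \approx 1.69$ and $p \approx 0.628$, or roughly 63%.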

(Click here to show the code)

win_ratio = float(np.max(wins)) / float(np.min(wins))
prob_head2head = np.power( win_ratio, 1./7. ) / (1 + np.power( win_ratio, 1./7. ))

print("In the above simulation, {} was {:.1f} times more likely to win Olympic gold than {}.".format(
    names_2016[by_wins[0]], win_ratio, names_2016[by_wins[-1]] ))
print("But we estimate that if they were racing head-to-head,", end='')
print(" {} would win only {:.1f}% of the time.".format( names_2016[by_wins[0]], prob_head2head*100. ))

difference = (np.min(all_2016[by_wins[-1]]) - np.min(all_2016[by_wins[0]])) / np.min(all_2016[by_wins[0]])
print(" (For reference, their best race times in 2016 differed by {:.2f}%.)".format( difference * 100. ))

In the above simulation, gatlin was 39.5 times more likely to win Olympic gold than meite.
But we estimate that if they were racing head-to-head, gatlin would win only 62.8% of the time.
 (For reference, their best race times in 2016 differed by 1.53%.)

For comparison, FiveThirtyEight gave roughly those odds for Hillary Clinton becoming the president of the USA in 2016. That’s not all that high, given how “massive” the difference is in their best race times that year.

This is just an estimate, though. Maybe if we pitted our models head-to-head, we’d get different results?

(Click here to show the code)

headCount = max( simulations >> 3, 100 )
maxFound = 0

print()
print("Wins when racing head to head ({} simulations each)".format( headCount ))
print("----------------------------------------------")

print("{:10}".format("LOSER->"), end='')
for _,idx in enumerate(min_time):
    print("{:>10}".format(names_2016[idx]), end='')
print()

for x,x_ind in enumerate(min_time):

    print("{:10}".format(names_2016[x_ind]), end='')

    for y in range(len(min_time)):
        if y <= x:
            print("{:10}".format(""), end='')
        else:
            wins = 0
            for rand in range(headCount):

                # each posterior sample is [α, β, b]: shape, scale and location, as in lnprob()
                choice  = int( np.random.rand()*len(models[x_ind]) )
                x_time  = models[x_ind][choice][2] + np.random.gamma( models[x_ind][choice][0], models[x_ind][choice][1] )
                choice  = int( np.random.rand()*len(models[min_time[y]]) )
                y_time  = models[min_time[y]][choice][2] + np.random.gamma( models[min_time[y]][choice][0], models[min_time[y]][choice][1] )

                # count a win for the row athlete (x); the column athlete (y) is the loser
                if x_time < y_time:
                    wins += 1

            temp = wins*100./headCount
            if temp < 50:
                temp = 100 - temp
            if temp > maxFound:
                maxFound = temp
                
            print("{:9.1f}%".format(wins*100./headCount), end='')

    print()
    sys.stdout.flush()

print()
print("The best winning percentage was {:.1f}% (therefore the worst losing percent was {:.1f}%).".format(
    maxFound, 100-maxFound ))

Wins when racing head to head (1875 simulations each)
----------------------------------------------
LOSER->       gatlin      bolt   bromell    vicaut   simbine de grasse     blake     meite
gatlin                   48.9%     52.1%     55.8%     56.4%     59.5%     63.5%     61.9%
bolt                               52.2%     57.9%     55.8%     57.9%     65.8%     60.2%
bromell                                      52.4%     55.3%     55.0%     65.2%     59.0%
vicaut                                                 51.7%     52.2%     59.8%     59.3%
simbine                                                          52.3%     57.7%     57.1%
de grasse                                                                  57.0%     54.7%
blake                                                                                47.2%
meite                                                                                     

The best winning percentage was 65.8% (therefore the worst losing percent was 34.2%).

Nope, it’s pretty much bang on! The columns of this chart represent the loser of the head-to-head, while the rows represent the winner. That number in the upper-right, then, represents the odds of Gatlin coming in first against Meite. When I run the numbers, I usually get a percentage that’s less than 5 percentage points off the estimate. Since the odds of one person losing are the odds of the other person winning, you can flip around who won and lost by subtracting the odds from 100%. That explains why I only calculated less than half of the match-ups.

I don’t know what’s on your screen, but I typically get one or two match-ups that are below 50%. I’m again organizing the calculations by each athlete’s fastest time in 2016, so if an athlete’s win ratio was purely determined by that then every single value in this table would be equal to or above 50%. That’s usually the case, thanks to each model favouring the Exponential distribution, but sometimes one sprinter still winds up with a better average time than a second’s fastest time. As pointed out earlier, that translates into more wins for the first athlete.

Getting Physical

Even at this elite level, you can see the odds of someone winning a head-to-head race are not terribly high. A layperson can create that much bias in a coin toss, yet we still consider both outcomes of that toss to be equally likely.

This doesn’t really contradict Rationality Rules’ claim that fractions of a percent in performance matter, though. Each of these athletes differs in physiology, and while that may not have as much effect as we thought, it still has some effect. What we really need is a way to subtract out the effects due to morphology.

If you read that old blog post, you know what’s coming next.

[16:48] Which is to say that the attributes granted from male puberty that play a vital role in explosive events – such as height, width, limb length, and fast twitch muscle fibers – have not been shown to be sufficiently mitigated by HRT in trans women.

According to Rationality Rules, the physical traits that determine track performance are all set in place by puberty. Since puberty finishes roughly around age 15, and human beings can easily live to 75, that implies those traits are fixed for most of our lifespan. In practice that’s not quite true, as (for instance) human beings lose a bit of height in old age, but here we’re only dealing with athletes in the prime of their career. Every attribute Rationality Rules lists is effectively constant.

So to truly put RR’s claim to the test, we need to fit our model to different parts of the same athlete’s career, and compare those head-to-head results with the ones where we raced athletes against each other.

(Click here to show the code)

table = {"Athlete":[], "First Result":[], "Latest Result":[]}

for name in names_2016:
    
    mask_name = dataset['# Name'] == name
    dates = dataset["Date"][mask_name]
    table['Athlete'].append( name )
    table['First Result'].append( dates.min() )
    table['Latest Result'].append( dates.max() )
    
print( pd.DataFrame(table) )

     Athlete First Result Latest Result
0      blake   2005-07-13    2019-06-21
1       bolt   2007-07-18    2017-08-05
2    bromell   2012-04-06    2019-06-08
3  de grasse   2012-06-08    2019-06-20
4     gatlin   2000-05-13    2019-07-05
5      meite   2003-07-11    2018-06-16
6    simbine   2010-03-13    2019-06-20
7     vicaut   2008-07-05    2019-07-02

That dataset contains official IAAF times going back nearly two decades, in some cases, for those eight athletes. In the case of Bolt and Meite, those span their entire sprinting careers.

Which athlete should we focus on? It’s tempting to go with Bolt, but he’s an outlier who broke the mathematical models used to predict sprint times. Gatlin would have been my second choice, but between his unusually long career and history of doping there’s a decent argument that he too is an outlier. Bromell seems free of any issue, so I’ll go with him. Don’t agree? I made changing the athlete as simple as altering one variable, so you can pick whoever you like.

I’ll divide up these athletes’ careers by year, as their performance should be pretty constant over that timespan, and for this sport there’s usually enough datapoints within the year to get a decent fit.

(Click here to show the code)

athlete = 2   # look at the indices on the previous table
min_races = 3 # minimum number of races per year; filters out thin data

print()
print("{0} vs. {0}, model building ...".format( names_2016[athlete] ))

mask_ath    = dataset['# Name'] == names_2016[athlete]
min_year    = np.min( dataset['Date'][ mask_ath ] ).year
max_year    = np.max( dataset['Date'][ mask_ath ] ).year

years       = list()
models_ath  = list()
summaries_ath = list()

print("year\tα\tβ\tb")
for year in range(min_year, max_year+1):

    mask_year = (dataset['Date'] > '{}-01-01'.format(year)) & (dataset['Date'] < '{}-01-01'.format(year+1))
    data = dataset['Result'][ mask_ath & mask_year ]
    if len(data) >= min_races:
        years.append( year )
    else:
        continue

    mean = np.mean( data ) - fastest_time      # measure the mean relative to the fastest time (the offset)
    seed = list()
    i = 0
    while i < nwalkers:
        beta = np.random.rand()*1.5 + .5
        b = fastest_time - np.random.rand()*.3
        if lnprob( [mean*beta, beta, b], data ) > -np.inf:
            seed.append( [mean*beta, beta, b] )
            i += 1

    sampler = emcee.EnsembleSampler(nwalkers, ndim, lnprob, args=[data], threads=os.cpu_count())
    sampler.run_mcmc(seed, nsamples)

    samples = sampler.chain[:, -keep:, :].reshape((-1, ndim))
    alpha_mcmc, beta_mcmc, b_mcmc = map(lambda v: (v[1], v[2]-v[1], v[1]-v[0]), 
        zip(*np.percentile(samples, [16, 50, 84], axis=0)))

    print("{}".format(year), end='')
    print("\t{:.3f} (+{:.3f} -{:.3f})".format(*alpha_mcmc), end='')
    print("\t{:.3f} (+{:.3f} -{:.3f})".format(*beta_mcmc), end='')
    print("\t{:.3f} (+{:.3f} -{:.3f})".format(*b_mcmc))
    sys.stdout.flush()

    models_ath.append( samples )
    summaries_ath.append( [*alpha_mcmc, *beta_mcmc, *b_mcmc] )

print("... done.")

print()
print("{0} vs. {0}, head to head ({1} simulations)".format( names_2016[athlete], headCount ))
print("----------------------------------------------")

print("{:7}".format("LOSER->"), end='')
for year in years:
    print("{:>7}".format(year), end='')
print()

maxFound = 0
for x_ind,x in enumerate( years ):

    print("{:7}".format(x), end='')

    for y_ind,y in enumerate( years ):
        if y <= x:
            print("{:7}".format(""), end='')
        else:
            wins = 0
            for rand in range(headCount):

                choice  = int( np.random.rand()*len(models_ath[x_ind]) )
                x_time  = models_ath[x_ind][choice][2] + np.random.gamma( models_ath[x_ind][choice][0], models_ath[x_ind][choice][1] )
                choice  = int( np.random.rand()*len(models_ath[y_ind]) )
                y_time  = models_ath[y_ind][choice][2] + np.random.gamma( models_ath[y_ind][choice][0], models_ath[y_ind][choice][1] )

                if y_time < x_time:
                    wins += 1
                    
            temp = wins*100./headCount
            if temp < 50:
                temp = 100 - temp
            if temp > maxFound:
                maxFound = temp
                
            print("{:6.1f}%".format(wins*100./headCount), end='')

    print()
    sys.stdout.flush()

print()
print("The best winning percentage was {:.1f}% (therefore the worst losing percent was {:.1f}%).".format(
    maxFound, 100-maxFound ))

bromell vs. bromell, model building ...
year	α	β	b
2012	0.639 (+0.317 -0.219)	0.817 (+0.406 -0.280)	10.370 (+0.028 -0.415)
2013	0.662 (+0.157 -0.118)	1.090 (+0.258 -0.195)	9.970 (+0.018 -0.070)
2014	0.457 (+0.118 -0.070)	1.556 (+0.403 -0.238)	9.762 (+0.007 -0.035)
2015	0.312 (+0.069 -0.064)	2.082 (+0.459 -0.423)	9.758 (+0.002 -0.016)
2016	0.356 (+0.092 -0.104)	1.761 (+0.457 -0.513)	9.835 (+0.005 -0.037)
... done.

bromell vs. bromell, head to head (1875 simulations)
----------------------------------------------
LOSER->   2012   2013   2014   2015   2016
   2012         61.3%  67.4%  74.3%  71.0%
   2013                65.1%  70.7%  66.9%
   2014                       57.7%  48.7%
   2015                              40.2%
   2016                                   

The best winning percentage was 74.3% (therefore the worst losing percent was 25.7%).
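If the inner loop above feels slow, the same head-to-head can be done in one shot with array operations. Treat this as a sketch: `head_to_head` is my own name for it, it assumes each model is an array with one `[α, β, b]` draw per row (as `models_ath` holds), and the “posterior” I feed it below is random filler rather than a real fit.

```python
import numpy as np

def head_to_head(samples_x, samples_y, n_races, rng):
    """Percentage of simulated races where y's time beats x's time.

    Mirrors the loop above: pick a random posterior draw for each side,
    then add a Gamma-distributed delay to that draw's offset b.
    """
    px = samples_x[rng.integers(len(samples_x), size=n_races)]
    py = samples_y[rng.integers(len(samples_y), size=n_races)]
    x_time = px[:, 2] + rng.gamma(px[:, 0], px[:, 1])
    y_time = py[:, 2] + rng.gamma(py[:, 0], py[:, 1])
    return 100.0 * np.mean(y_time < x_time)

rng = np.random.default_rng(1)

# Stand-in draws, NOT fitted values: columns are alpha, beta, b.
fake = np.column_stack([
    rng.uniform(0.3, 0.5, 500),
    rng.uniform(0.1, 0.3, 500),
    rng.uniform(9.7, 9.8, 500),
])

same_pct = head_to_head(fake, fake, 20000, rng)
print("{:.1f}%".format(same_pct))
```

Racing a model against itself is a handy sanity check, as the result should hover around 50%.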

Again, I have no idea what you’re seeing, but I’ve looked at a number of Bromell vs. Bromell runs, and every one I’ve done shows at least as much variation, if not more, than runs that pit Bromell against other athletes. Bromell vs. Bromell shows even more variation in success than the coin flip benchmark, giving us justification for saying Bromell has a significant advantage over Bromell.
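To put a rough number on “at least as much variation,” here’s a quick comparison using the percentages printed in the two tables above. It’s a single run of each, so take it as a sanity check rather than proof; the `spread` helper is just something I made up for this.

```python
import numpy as np

# Percentages copied from the two printouts above (one run each).
athlete_vs_athlete = np.array([
    48.9, 52.1, 55.8, 56.4, 59.5, 63.5, 61.9,
    52.2, 57.9, 55.8, 57.9, 65.8, 60.2,
    52.4, 55.3, 55.0, 65.2, 59.0,
    51.7, 52.2, 59.8, 59.3,
    52.3, 57.7, 57.1,
    57.0, 54.7,
    47.2,
])
bromell_vs_bromell = np.array([
    61.3, 67.4, 74.3, 71.0,
    65.1, 70.7, 66.9,
    57.7, 48.7,
    40.2,
])

def spread(pcts):
    """Distance of each match-up from a fair 50/50 coin flip."""
    return np.abs(pcts - 50.0)

for label, pcts in [("athlete vs. athlete", athlete_vs_athlete),
                    ("bromell vs. bromell", bromell_vs_bromell)]:
    d = spread(pcts)
    print("{}: mean {:.1f}, max {:.1f} points from 50%".format(
        label, d.mean(), d.max()))
```

Folding everything around 50% means it doesn’t matter which side of the match-up a cell is reporting.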

I’ve also changed that variable myself, and seen the same pattern in other athletes. Worried about a lack of datapoints causing the model to “fuzz out” and cover a wide range of values? I thought of that and restricted the code to filter out years with fewer than three races. Honestly, I think it puts my conclusion on firmer ground.
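For the curious, that per-year filter can also be written as a pandas group-by. This is an alternative sketch and not what the code above does internally. The `Date` and `Result` column names match the real dataset, but the rows below are toy values I invented so the example runs on its own.

```python
import pandas as pd

# Toy stand-in for the real dataset; the times and dates are invented.
toy = pd.DataFrame({
    "Date": pd.to_datetime([
        "2012-04-06", "2012-05-01",
        "2013-03-30", "2013-04-20", "2013-06-21",
        "2014-05-17", "2014-06-13", "2014-06-25", "2014-07-05",
    ]),
    "Result": [10.40, 10.44, 10.02, 10.05, 9.99, 9.92, 9.97, 9.77, 9.84],
})
min_races = 3   # same threshold as the code above

# Group the times by calendar year, then drop any year with too few races.
by_year = toy.groupby(toy["Date"].dt.year)["Result"]
kept = {int(year): times.to_numpy()
        for year, times in by_year
        if len(times) >= min_races}

print(sorted(kept))
```

In this toy data 2012 only has two races, so it gets dropped before any model would be fit to it.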

Conclusion

Texas Sharpshooter Fallacy: Ignoring the difference while focusing on the similarities, thus coming to an inaccurate conclusion. Similar to the gambler’s fallacy, this is an example of inserting meaning into randomness.

Rationality Rules loves to point to sporting records and the outcome of single races, as on the surface these seem to justify his assertion that differences in performance of fractions of a percent matter. In reality, he’s painting a bullseye around a very small subset of the data and ignoring the rest. When you include all the data, you find Rationality Rules has badly missed the mark. Physiology cannot be as determinative as Rationality Rules claims; other factors must be important enough to sometimes overrule it.

And, at long last, I can call bullshit on this (emphasis mine):

[17:50] It’s important to stress, by the way, that these are just my views. I’m not a biologist, physiologist, or statistician, though I have had people check this video who are.

Either Rationality Rules found a statistician who has no idea of variance, which is like finding a computer scientist who doesn’t know boolean logic, or he never actually consulted a statistician. Chalk up yet another lie in his column.

Cherry Picking

With the benefit of hindsight, I can see another omission from Rationality Rules’ latest transphobic video. In his citations, he cites two sporting bodies: the International Association of Athletics Federations and the Australian Sports Anti-Doping Authority. He relies heavily on the former, which is strange. The World Medical Association has condemned the IAAF’s policies on intersex and transgender athletes as “contrary to international medical ethics and human rights standards.” The IAAF has defended itself, in part, by arguing this:

The IAAF is not a public authority, exercising state powers, but rather a private body exercising private (contractual) powers. Therefore, it is not subject to human rights instruments such as the Universal Declaration of Human Rights or the European Convention on Human Rights.

Which is A) not a good look, and B) false. If you won’t take my word on that last one, maybe you’ll take the UN’s?
