LLMs Shouldn’t Code

My draft for “Loneliness, 3” is currently sitting at 2,600 words. It hasn’t been as hard to write as “Loneliness, 2”; this time around I only redid the intro once. Nonetheless, I haven’t touched it in a few months. The why of it all is complicated, as usual, but one not-insignificant chunk is that I’m starting to doubt my approach. I never expected to find a “magic passphrase” to get people to understand my arguments immediately, but since starting the “Loneliness” series I’ve spent more time with people who love and defend LLMs. The additional evidence and experience suggest that series is shouting into a black hole.

I don’t want to give up on it, but taking a break from it might help get me typing again. Besides, I think I can convince you LLMs should not code.

Remember that whole “strawberry” thing? No? It became a meme to query LLMs with “How many “r”s are there in strawberry?” The typical result was hilarious: a claim that there are two or one or even zero “r” letters in “strawberry,” even though the LLM correctly used three when spelling it. It was embarrassing enough that OpenAI code-named their o1 model “Project Strawberry.”

But why is this a problem in the first place? Let’s review some prior material: LLMs deal with “tokens,” not characters. To map between them, a sequence of characters is first encoded to a sequence of bytes, typically via UTF-8. That byte sequence is then mapped to a sequence of tokens; all the tokens which could match the first bytes in the sequence are found, and of those the token that consumes the most bytes and has the lowest numeric ID is chosen. The matched bytes are discarded, and the process repeats until there are no more bytes left to map.
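The matching loop described above can be sketched in a few lines. This is a minimal illustration of the simplified greedy description only; production tokenizers apply learned merge rules and can split text differently. The function name `greedy_tokenize` and the five-entry `TOY_VOCAB` are made up for the example, though the IDs are the ones quoted in this post.

```python
def greedy_tokenize(text: str, vocab: dict[bytes, int]) -> list[int]:
    """Greedy longest-match tokenization over UTF-8 bytes."""
    data = text.encode("utf-8")
    longest = max(len(piece) for piece in vocab)
    ids, pos = [], 0
    while pos < len(data):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(longest, len(data) - pos), 0, -1):
            piece = data[pos:pos + size]
            if piece in vocab:
                ids.append(vocab[piece])
                pos += size  # discard the matched bytes and repeat
                break
        else:
            raise ValueError(f"no token matches at byte offset {pos}")
    return ids


# Toy vocabulary (assumption): only five tokens, nowhere near a real one.
TOY_VOCAB = {b"st": 302, b"raw": 1618, b"berry": 19772, b"r": 81, b" strawberry": 101830}

print(greedy_tokenize("strawberry", TOY_VOCAB))   # [302, 1618, 19772]
print(greedy_tokenize(" strawberry", TOY_VOCAB))  # [101830]
```

Each byte string maps to exactly one ID in this dictionary, so the lowest-ID tie-break never comes up in the sketch.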

The exact output depends on the model, but ChatGPT 5 (in part) sees that “strawberry” question as “5299, 1991, 392, 81, 50049, 82, 553, 306, 101830, 30.” There can be a loose correlation between how large a token’s number is and how many characters it contains, but that’s merely an artifact of generating the token vocabulary. They’re otherwise arbitrary labels.

Spelling “strawberry” is no problem: ChatGPT 5 only has to emit “302, 1618, 19772,” which maps to “st”, “raw”, and “berry”. But reversing that map, to figure out that token 19772 contains two copies of token 81 (“r”), is not so straightforward. Nor is recognizing that those three tokens together differ from token 101830 only by a single leading space character.
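To make the asymmetry concrete, here is a small sketch that uses only the token strings quoted in this post. The `TOKEN_TEXT` table and `count_letter` helper are hypothetical names for illustration; the point is that counting letters requires a lookup table the model never sees directly.

```python
# Token strings as quoted in this post; a real vocabulary has about 200,000 entries.
TOKEN_TEXT = {302: "st", 1618: "raw", 19772: "berry", 81: "r", 101830: " strawberry"}


def count_letter(token_ids: list[int], letter: str) -> int:
    """Count a letter by first expanding every token back into characters."""
    return "".join(TOKEN_TEXT[t] for t in token_ids).count(letter)


spelled = [302, 1618, 19772]

print(count_letter(spelled, "r"))    # 3, but only after decoding to characters
print(count_letter([101830], "r"))   # 3, same word behind a completely different ID
print(spelled.count(81))             # 0, the "r" token never appears in the sequence
```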

Compare and contrast with how human beings learn. Our hierarchical learning style hammers the notion of “letter” into us at an early age, so by the time we’re old enough to read sentences the number of “r”s in “strawberry” is so obvious it almost never gets mentioned in-text. And if the training data for your LLM is just a pile of human-made texts, then it might never encounter the concept of letters.

Thus, LLMs have to be explicitly trained on letters. Most of the time this means feeding in synthetically-generated texts that explicitly invoke the concept of letters, such as asking the LLM how many X’s are in Y, and then bopping it on the nose if it gets the answer wrong. You can actually find evidence for this process, sometimes; in the appendices to this paper the authors asked Gemini “How many Rs are in are? I like Strawberries.” and the response back started with “There are three “r”s in “strawberry”…” I gave it a try with Claude, and after one failed attempt it responded back to “How many r’s are in are? Like strawberries” with “The word strawberry contains 3 r’s…” ChatGPT 5.3 has given me a “fast response” when I asked for the number of “r”s in “strawberry,” but not when I asked for the number of “w” characters instead, though it hasn’t been consistent on that. I can’t tell if OpenAI do a better job of covering their tracks, or used a different sort of training data.
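For a sense of how cheap that kind of synthetic data is to produce, here is a rough sketch. The prompt template, the dictionary layout, and the function name are all assumptions for illustration; none of the vendors have published what their pipelines actually look like.

```python
import random


def make_letter_count_example(word: str, rng: random.Random) -> dict[str, str]:
    """Build one hypothetical 'how many X are in Y' training pair."""
    letter = rng.choice(sorted(set(word)))  # pick a letter that occurs in the word
    return {
        "prompt": f'How many "{letter}" characters are in "{word}"?',
        "answer": str(word.count(letter)),  # ground truth comes from plain string ops
    }


rng = random.Random(0)
for word in ["strawberry", "there", "blueberry"]:
    print(make_letter_count_example(word, rng))
```

The ground truth is trivial to compute with character access, which is exactly the access the model lacks.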

The main thrust of the aforementioned paper is to augment each token with a byte encoding of all the characters it represents, so the underlying transformer gets some access to the additional information. The results were lackluster. Their LLM did better in general, sure, but at best the needle moves from a 67% success rate to a 71% success rate, and there are a few benchmarks where it actually under-performed the old method.

There’s also a good case to be made that if there is any understanding of letters there, it’s only superficial. We can do a lot more than just query how many letters are in a word, after all.

Task	Input	Output
Spelling	Spell out the word: there	there
Inverse Spelling	Write the word that is spelled out (no spaces): t h e r e	there
Contains Character	Is there a ‘c’ in ‘there’?	No
Contains Word	Is there a ‘the’ in ‘the sky is blue’?	Yes
Character Insertion	Add ‘b’ after every ‘e’ in ‘there’	thebreb
Word Insertion	Add ‘is’ after every ‘the’ in ‘the sky is blue’	the is sky is blue
Character Deletion	Delete every ‘e’ in ‘there’	thr
Word Deletion	Delete every ‘the’ in ‘the sky is blue’	sky is blue
Character Substitution	Replace every ‘e’ with ‘a’ in ‘there’	thara
Word Substitution	Replace every ‘the’ with ‘is’ in ‘the sky is blue’	is sky is blue
Character Swapping	Swap ‘t’ and ‘r’ in ‘there’	rhete
Word Swapping	Swap ‘the’ and ‘is’ in ‘the sky is blue’	is sky the blue

Lukas Edman et al., “EXECUTE: A Multilingual Benchmark for LLM Token Understanding,” Findings of the Association for Computational Linguistics: ACL 2025, 2025, 1878–87.
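For contrast, each of those tasks is a one-liner once you have direct access to the characters. The helpers below are my own minimal sketch and are not taken from the EXECUTE benchmark code; the names are made up.

```python
def insert_after(items: list[str], target: str, addition: str) -> list[str]:
    """Add `addition` after every occurrence of `target`."""
    out = []
    for item in items:
        out.append(item)
        if item == target:
            out.append(addition)
    return out


def swap(items: list[str], a: str, b: str) -> list[str]:
    """Exchange every `a` with `b` and vice versa."""
    return [b if x == a else a if x == b else x for x in items]


word, sentence = "there", "the sky is blue"

# Character-level: treat the word as a list of letters.
print("".join(insert_after(list(word), "e", "b")))      # thebreb
print(word.replace("e", ""))                            # thr
print(word.replace("e", "a"))                           # thara
print("".join(swap(list(word), "t", "r")))              # rhete

# Word-level: the same helpers, applied to a list of words.
print(" ".join(insert_after(sentence.split(), "the", "is")))  # the is sky is blue
print(" ".join(swap(sentence.split(), "the", "is")))          # is sky the blue
```

The same two helpers handle both the character and the word versions, because to ordinary code a word is just a sequence of letters and a sentence is just a sequence of words.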

That table isn’t a complete replica of Figure 1; the authors also tested how well LLMs could handle “composite” characters. Kanji, for instance, allows multiple characters to be combined together to form another character. Their overall results are in Table 8, which is much too large to include here, but LLMs seemed to have the most trouble with character swapping, insertion, substitution, and deletion. They did great at figuring out if a sentence contained a word, good if inserting one word after another, but had an odd amount of difficulty swapping words around.

LLMs aced spelling, but this gets to my biggest beef about that paper: they didn’t just ask the LLM to perform each task, they also provided four correct examples to show how the task was done. Come on! My physics knowledge is slowly decaying away, but if you handed me a test where every question was repeated five times with slight variation, and four of those five had the correct answers filled in, I bet I could ace it. If you instead try something closer to an actual test, you get:

ME: Where are the “w” characters located in “ strawberry”?

CHATGPT 5.3: The word “strawberry” doesn’t have the letter “w” in it, so there are no “w” characters in it.

ME: How many “w” characters are in “ strawberry”?

CHATGPT 5.3: The word “strawberry” has 0 “w” characters in it.

ME: Substitute each occurrence of the letter “r” in “strawberry” with a “k”.

CHATGPT 5.3: Sure! If we substitute each occurrence of the letter “r” in “strawberry” with a “k”, it becomes: st kaw be k y

Nonetheless, one interesting result of their testing is that the choice of language has a big impact on what an LLM can do. My previous assessment applies to most languages, but not Amharic, Tamazight, and Santali. Handed those languages, some of the LLMs they tested could ace every single task with no obvious weaknesses. Why the sudden competence?

Those three languages have barely any presence on the web, and thus there’s very little training data available. Few people or companies would tune an LLM’s token vocabulary for those languages, so the average number of characters represented by one token tends to be lower for them than for more popular languages. Some copy-pasting into a tokenizer suggests four Amharic characters map to an average of seven tokens for ChatGPT 5’s vocabulary; in contrast, the first four paragraphs of this blog post suggest four English characters average out to one token! In practice, every other language usually has a lower character-to-token ratio than English, and that’s remained true for years.
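If you want to repeat that measurement, the ratio is easy to compute for any tokenizer you can call from code. The sketch below takes the encoder as a parameter; `byte_fallback` is a stand-in I made up that models the worst case, a vocabulary with no multi-byte tokens for the script, and is not ChatGPT’s actual tokenizer.

```python
from typing import Callable, Sequence


def tokens_per_char(text: str, encode: Callable[[str], Sequence[int]]) -> float:
    """Average number of tokens spent on each character of `text`."""
    return len(encode(text)) / len(text)


def byte_fallback(text: str) -> list[int]:
    """Worst-case stand-in: every UTF-8 byte becomes its own token."""
    return list(text.encode("utf-8"))


amharic = "\u1230\u120b\u121d"  # three Ethiopic characters, three UTF-8 bytes each

print(tokens_per_char(amharic, byte_fallback))  # 3.0
print(tokens_per_char("hello", byte_fallback))  # 1.0

# The figures quoted above, expressed the same way:
print(7 / 4)  # 1.75 tokens per Amharic character
print(1 / 4)  # 0.25 tokens per English character
```

Passing a real tokenizer’s encode function in place of `byte_fallback` should reproduce the copy-paste experiment.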

Consequently, manipulating non-English characters should be easier because there’s less need to split a token into characters.

To test that, the paper authors tried directly mapping English characters to Amharic characters, and repeated their tests. The results? English-as-Amharic had a slightly higher success rate than Amharic! Nice, that’s solid support for that theory… exceeeeept the authors did one more test. They tricked the tokenizer for one LLM into mapping one English letter to one token. If the problem was only the character-to-token ratio, this should have boosted how well that LLM did.

Instead, the LLM’s overall performance cratered. There were modest improvements for the character manipulation tests, true, but the word-level manipulation tasks went from being the best-performing to easily the worst.

Here’s my guess: perhaps in the high-dimensional state space of these LLMs, two different types of inputs map to two very different locations. One corresponds to common written languages, the other to arbitrary byte sequences. The training data for the former is almost absent any character-level manipulations, save artificially-crafted examples hoping to paper over the “strawberry” problem; the latter, though, could have a lot more naturally-occurring examples of those manipulations. Mucking around with sequences of bytes is a common task for programmers, for instance. If the input corresponds more closely to written language, it lands within the former part of the state space, where the phrase “change this character” doesn’t have a clean map to the abstract concept it represents. The output token distribution is often garbage or nonsensical, as a result. But if the input lands in the arbitrary-sequence part of the space, that same phrase has a cleaner mapping and the outputs are more likely to conform to what we expect.

If a language has barely any presence in the training data, it will tend to wind up in the arbitrary-sequence part of the state space instead of the common human language part. Thus the improved results at character-level manipulation. The competence at “word” manipulation stems from that phrase mapping to “grouping of arbitrary characters” in the arbitrary-sequence sub-space, and those being about as common in the training data as byte-level manipulation.

Since bytes can map to characters, those arbitrary-sequence “words” will contain the occasional stray English character. Forcing a one-to-one mapping between English letters and tokens places the input in the “arbitrary byte sequence” part of the state space, but now words are nothing but English characters. These look nothing like the “words” typical of that part of the state space, so “change this word” no longer has a clean mapping and the success rate plummets for those tests.

All of that is rank speculation on my part, of course. But it’s hard to argue the contrary, that any of these LLMs have a general concept of “letters.” Ask me to count the number of X’s, and I’ll do very well no matter whether “X” is numbers, fish, or Fish numbers. The areas where I fail would be taken as evidence I lack a general concept of numbers, or that my skill with that concept is limited, or that I simply lack the intelligence to realize the full extent of where numbers can be applied. Likewise, merely being able to answer how many “r” letters are in “strawberry” is insufficient to show an LLM has the concept of letters. On the contrary, the inability to generalize between the arbitrary-sequence case and human language argues against them having a general concept.

Whatever the actual underlying reason, there seems to be an inverse correlation between how popular a language is within an LLM’s training set and how well the LLM can perform character-level manipulations. As a consequence, we should expect an LLM will struggle to tell that these two streams of tokens are functionally equivalent:

257, 1056, 3469, 350, 7743, 11, 13901, 8, 10039, 530, 622, 350, 7743, 425, 3099, 1029

1314, 3469, 7, 1577, 11, 3099, 48169, 271, 622, 1577, 9, 13901

Or, when those sequences are decoded from ChatGPT’s token vocabulary:

     def mult (first,second) :
      return (first * second)

def mult( first, second ):
    return first*second
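One way to check that the two snippets really are the same program is to compare their syntax trees, which throw away whitespace and redundant parentheses. This is a small sketch using Python’s standard `ast` module; `same_program` is a name I made up, and the `textwrap.dedent` call is only there because the first snippet is indented as a whole.

```python
import ast
import textwrap

messy = "     def mult (first,second) :\n      return (first * second)\n"
tidy = "def mult( first, second ):\n    return first*second\n"


def same_program(a: str, b: str) -> bool:
    """True if both sources parse to the same abstract syntax tree."""
    tree_a = ast.parse(textwrap.dedent(a))
    tree_b = ast.parse(textwrap.dedent(b))
    # ast.dump omits line and column positions by default,
    # so only the structure of the code is compared.
    return ast.dump(tree_a) == ast.dump(tree_b)


print(same_program(messy, tidy))  # True
```

The parser gets there by working character by character, which is precisely the view the token IDs above hide.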

Did you spot that ChatGPT’s token vocabulary includes words with a space in front of them? In fact, a whopping 27,980 tokens out of 199,998 total are just another token with some extra whitespace added to the front or back. That’s almost 14%! Anthropic are oddly secretive about their token vocabulary, but someone’s partly reverse-engineered it. Of the 38,360 tokens known to be in the vocabulary, 15,741 (about 41%) fall into the same category! And if tokenization always prioritizes the token that absorbs the most characters, an identifier with a space before it can map to a different token than the same identifier with no leading space. Whitespace is ignored in most computer languages, which can be abused to a comical extent, and as shown above even exceptions like Python still have some wiggle room.
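A count like that can be approximated with a few lines of code, given the vocabulary as a list of strings. The criterion below (strip the whitespace and see whether what remains is also a token) is my assumption about what “just another token with extra whitespace” means, and the toy vocabulary is invented for the example.

```python
def whitespace_variants(vocab: list[str]) -> set[str]:
    """Tokens that equal some other token plus leading or trailing whitespace."""
    tokens = set(vocab)
    return {t for t in tokens if t.strip() != t and t.strip() in tokens}


# Invented toy vocabulary, not a real one.
toy_vocab = ["berry", " berry", "def", " def", "def ", "return", "\n", "  "]

variants = whitespace_variants(toy_vocab)
print(sorted(variants))                          # [' berry', ' def', 'def ']
print(f"{len(variants) / len(toy_vocab):.1%}")   # 37.5%

# The proportions quoted above:
print(f"{27_980 / 199_998:.1%}")  # 14.0%
print(f"{15_741 / 38_360:.1%}")   # 41.0%
```

Pure-whitespace tokens such as the newline are not counted here, since stripping them leaves an empty string that is not itself a token.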

In order to correctly interpret programming languages, then, LLMs must understand that tokens are collections of characters. And yet, at best, they struggle to grasp that fact with popular human languages.

When they do shine at that task, it’s either when tossed an arbitrary byte sequence with no underlying meaning, or an obscure human language. Thanks to open source, though, there’s a ridiculous amount of public source code out there. Debian 13 alone contained a whopping 1.4 billion lines of code! Combine that with the frenzy over using LLMs to write program code, and modern LLM training sets are overflowing with the stuff. So again, we’re left believing LLMs should be terrible at coding.

If they are not, then it must be because they’ve memorized large chunks of the training set. That suggests an incredible level of brittleness, though. Throw them some code that diverges from what’s in their training set, and their output will be garbage even if the input was valid code.

I remember 20 years ago saying to a colleague, when he talked about “keeping spaces for compatibility”: Hey, we’re past the dark times already — modern tools work fine with semantic things like tabs.

And here we are, 20 years later, in 2026 — damn, AI still cannot work with tabs. What’s next? Will it break files without a newline at the end? Or will we have to add a carriage return manually after each line AI writes?

It’s ridicolous the issue is staying open for half a year.

There’s been a long-running debate within programming about whether to use tabs or spaces for indentation. Tabs are more efficient, but the number of spaces that correspond to one tab character isn’t standard. In recent years, Team Spaces has largely won. Official formatting guides from LLVM, Google, Mozilla, WebKit, and Microsoft forbid the use of tabs for indentation. Thus the training datasets for programming code are dominated by examples that use spaces for indentation.

Still, there are always some people who refuse to follow convention. And at least some of the time, their code is unintelligible to Claude. Some people don’t encounter the issue, and some of the proposed work-arounds have fixed it for others, but nonetheless it’s been a rare but persistent problem since October 25th, 2025, with no sign of resolution.

At long last, you can pick your poison:

LLMs should not be able to code, because understanding a programming language requires breaking a token apart into its constituent characters, and all LLMs struggle with that task.
LLMs should not be allowed to code, because they lack an understanding of programming languages and instead work by copy-pasting examples memorized from their training set, perhaps with a bit of massaging to make the pieces fit together. This can only result in disasters, be they short term or long term.

AI cannot do your job, but an AI salesman can 100% convince your boss to fire you and replace you with an AI that can’t do your job, and when the bubble bursts, the money-hemorrhaging “foundation models” will be shut off and we’ll lose the AI that can’t do your job, and you will be long gone, retrained or retired or “discouraged” and out of the labor market, and no one will do your job. AI is the asbestos we are shoveling into the walls of our society and our descendants will be digging it out for generations.

Reprobate Spreadsheet

/dev/random, unless I make a hash of it
