John Sokol's Blog: Anagrams as Source Code: A 2014 Hint at What LLMs Would Be

Anagrams as Source Code: A 2014 Hint at What LLMs Would Be

Fourth in a series. Read the mimetics essay for the broader frame; this one zooms in on a single page of my old wiki and asks why the intuition on it turned out to predict, in structure if not in detail, what large language models would become.

Sometime in the early 2010s — the page history says the bulk of edits land between 2011 and 2015 — I wrote on my wiki the following claim, which I’ll quote unedited because the rough edges are the point:

I am convinced it’s not “SuperNatural” but part of a memetic fabric of the way the brain does pattern recognition. … This has been done so long, that the languages evolve with these mechanisms built in to it. So much so that message content is based on the frequency of letters more then the actually words.
Anyhow decoding the anagrams is like looking at the source code, you can’t take it’s contents on literal face value but how it will subconsciously effect those people who “Get it”. … Now that I’ve been writing for a while I can feel what’s going to make a good anagram or not. I can feel weather it’s positive or negative.
It should also be possible to have a program that can measure a content’s meme-worthiness, (the ability to propagate as a meme, and in the greater context of current events and the mainstream media’s effects.)
It should also be possible to create a system that could generate candidate meme’s that should have a high likelihood of success.

The specific claim, that anagrams literally carry meaning and that rearranging letters reveals a hidden semantic layer, I now think is mostly wrong. But the structural claim wrapped around it has looked less wrong to me with every passing year. The structural claim is that language has a substrate of statistical pattern that does cognitive work in parallel with semantic content, that the substrate is exploitable, and that a program could be built to operate on it.

Nobody built that program in 2014. Then, ten years later, somebody did — but they did it by accident, while trying to build something else, and what they ended up with looks nothing like what I imagined. They called it a large language model. It works, and the reason it works validates the hint on the wiki page more than I would have predicted.

This essay is about the connection.

1. What the wiki claim was, exactly

The page is Nature_of_Language in the cluster I built around Department of Memetics. Stripped to its bones, the argument runs:

Spoken language has evolved alongside the brains that process it for hundreds of thousands of years, and written language for several thousand.
Any feature of language that affects reader response will, under that long Darwinian sieve, accumulate in the language because writers and speakers whose feel produced more effective patterns out-propagated those whose feel did not.
Letter and phoneme distributions are such a feature. They affect reader response — the rhythm of a sentence, the “feel” of a name, whether a phrase lands — on a channel below conscious semantic processing.
Therefore there is hidden statistical structure in language that is doing cognitive work in parallel with the propositional content.
Skilled writers can sense the structure but do not consciously manipulate it. Actors choose stage names whose letter distributions “feel right.” Marketers find the slogan that sticks. Speechwriters tune phrases until they’re memorable.
In principle, a program could measure the structure. It could score arbitrary text for “meme-worthiness.” It could generate text optimised against that score.

I added the anagram story because I was doing statistical work on anagrams at the time and thought I was finding something specific — that the anagrams of high-impact phrases had statistically interesting properties. I now believe what I was actually picking up was the underlying letter-frequency structure of the source phrase, which an anagram by definition preserves. The anagrams were a symptom; the real signal was the letter-frequency surface of the input.
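A minimal sketch of why an anagram cannot add or remove letter-frequency information, assuming we only look at letter counts and ignore case, spaces, and punctuation. The function names and the example phrase pair are illustrative and do not come from the wiki page.

```python
from collections import Counter


def letter_profile(text: str) -> Counter:
    """Count letters only, ignoring case, spaces, and punctuation."""
    return Counter(ch for ch in text.lower() if ch.isalpha())


def is_anagram(a: str, b: str) -> bool:
    """Two phrases are anagrams exactly when their letter profiles match."""
    return letter_profile(a) == letter_profile(b)


source = "Dormitory"      # illustrative phrase, not from the wiki
anagram = "Dirty room"    # a rearrangement of the same letters

print(is_anagram(source, anagram))
print(letter_profile(source) == letter_profile(anagram))
```

Because the two profiles are identical by construction, any statistic computed from letter counts alone gives the same value for a phrase and for every one of its anagrams.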

The substrate was real. The diagnostic (anagrams as source-code-revealing) was wrong but pointed at something true. This is a more common pattern in early-stage hypotheses than people give credit for.

2. The statistical substrate of language was already known

The wiki page makes the argument de novo, which it shouldn’t have, because by the early 2010s most of the components were textbook.

Zipf’s law

George Kingsley Zipf in 1932 (following the stenographer Jean-Baptiste Estoup, who had noted the pattern in 1916, and paralleling Alfred Lotka’s 1926 law for scientific authorship) noticed that if you rank words in a corpus by frequency, the n-th most common word appears about 1/n as often as the most common one. In the idealised case, the most frequent English word (“the”) appears about twice as often as the second (“of”), three times as often as the third (“and”), and so on. The pattern is strikingly robust: it holds, at least approximately, in essentially every natural language that has been measured, across topics within a language, and across centuries within the same language. Letter frequencies are not Zipfian in the strict sense, but they are just as stable and just as skewed. The frequencies are not random. Word frequencies sit close to a power law in any sufficiently large text.
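One way to check the rank-frequency relationship on a text of your own is sketched below. It assumes a crude tokeniser (lowercase runs of letters and apostrophes), and `rank_frequency` is a hypothetical helper name. Under an idealised Zipf distribution the last column, count times rank, stays roughly constant; a toy string like the one here is far too small to show that, so treat the output as a format demonstration only.

```python
import re
from collections import Counter


def rank_frequency(text: str, top: int = 10):
    """Return (rank, word, count, count * rank) for the most common words."""
    words = re.findall(r"[a-z']+", text.lower())
    ranked = Counter(words).most_common(top)
    return [(rank, word, n, n * rank)
            for rank, (word, n) in enumerate(ranked, start=1)]


# Toy input; a real test needs a corpus of at least tens of thousands of words.
sample = "the cat and the dog and the bird saw the cat"
for row in rank_frequency(sample, top=5):
    print(row)
```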

The implication for the wiki claim is direct. If letter and word frequencies are stable, lawful features of a language, then a brain that processes language has had millennia to internalise those frequencies and to use them as a kind of background model against which surprise is computed. The information content of a token is the negative logarithm of its probability, so rare tokens carry more bits than common ones. Surprise is information, in the technical sense Shannon nailed down in 1948.
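As a small worked example of surprise as information, the sketch below computes surprisal in bits, -log2(p), from raw counts. The counts are made up for illustration (chosen so the probabilities are powers of two) and `surprisal_bits` is a hypothetical helper name.

```python
import math
from collections import Counter


def surprisal_bits(token: str, counts: Counter) -> float:
    """Information content of a token in bits: -log2 of its relative frequency."""
    total = sum(counts.values())
    return -math.log2(counts[token] / total)


# Made-up toy counts, not measurements from any real corpus.
counts = Counter({"the": 4, "of": 2, "and": 1, "zebra": 1})

for token in ("the", "of", "zebra"):
    print(token, surprisal_bits(token, counts))
```

With these toy counts the common word costs 1 bit and the rare one 3 bits: the rarer the token, the more it tells you when it shows up.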

Shannon’s word game

Speaking of Shannon: in A Mathematical Theory of Communication he introduced what we now call n-gram models of English, ran them forward, and showed that you can generate increasingly English-looking text by sampling from progressively higher-order approximations. His zero-order approximation draws letters uniformly: garbage. First-order gives the right letter frequencies but random sequences: still garbage but lumpy. Second-order (bigrams) starts producing pronounceable nonsense. Third-order (trigrams) starts producing word-shaped tokens. Shannon then switched to word-level approximations, and later experiments with higher-order letter models (order 5 or so) produce text that, phrase by phrase, looks like English even though it has no semantic content.
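The experiment is easy to reproduce in spirit. The sketch below is a character-level n-gram sampler, not Shannon's own procedure (he worked by hand from printed tables and books). The names `build_model` and `generate` are hypothetical, and the training string is a stand-in for a real corpus.

```python
import random
from collections import Counter, defaultdict


def build_model(text: str, order: int):
    """Map each context of `order` characters to a Counter of next characters."""
    model = defaultdict(Counter)
    for i in range(len(text) - order):
        context = text[i:i + order]
        model[context][text[i + order]] += 1
    return model


def generate(model, order: int, length: int, seed_text: str,
             rng: random.Random) -> str:
    """Sample characters one at a time, conditioned on the last `order` characters."""
    out = seed_text[:order]
    while len(out) < length:
        context = out[len(out) - order:]
        followers = model.get(context)
        if not followers:          # unseen context: stop rather than invent data
            break
        chars, weights = zip(*followers.items())
        out += rng.choices(chars, weights=weights, k=1)[0]
    return out


# Stand-in corpus; a real run needs far more text for the higher orders.
corpus = ("the statistical structure of ordinary text is most of what "
          "makes a string of letters look like language to a reader ") * 3
rng = random.Random(0)
for order in (0, 1, 2, 4):
    model = build_model(corpus, order)
    print(order, repr(generate(model, order, 60, corpus, rng)))
```

Note that `order` here counts conditioning characters, so the bigram case described above corresponds to `order=1`. On a corpus this small the higher orders mostly replay the training text; the qualitative progression only becomes interesting with much more data.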

The 1948 paper essentially demonstrated that statistical structure captured at the letter level is most of what makes text look like language. This is the same argument the wiki page was groping toward, more than sixty years before the wiki page. I should have read Shannon more carefully. So should everyone.

Cryptanalysis: anagrams as source code, for real

The discipline that has taken letter-frequency analysis most seriously for the longest is cryptanalysis. Al-Kindi, in 9th-century Baghdad, wrote Risāla fī Istikhrāj al-Muʿammā (“On Extracting Encrypted Letters”), which laid out frequency analysis: count the letters in the ciphertext, match them against the known frequency distribution of the plaintext language, and you have probabilistic guesses for the cipher’s substitution mapping. From the 9th century forward, this was the dominant attack on substitution ciphers, and every serious encryption system since has had to defeat frequency analysis first.
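To make the attack concrete, here is a minimal sketch against the simplest special case of a substitution cipher, a Caesar shift. This is not Al-Kindi's procedure for general substitution ciphers, which matches letters one by one. Instead it scores each of the 26 candidate shifts with a chi-squared statistic against English letter frequencies. The frequency table holds approximate textbook percentages, and all function names are hypothetical.

```python
from collections import Counter

# Approximate English letter frequencies in percent (rounded textbook values).
ENGLISH_FREQ = {
    "a": 8.2, "b": 1.5, "c": 2.8, "d": 4.3, "e": 12.7, "f": 2.2, "g": 2.0,
    "h": 6.1, "i": 7.0, "j": 0.15, "k": 0.8, "l": 4.0, "m": 2.4, "n": 6.7,
    "o": 7.5, "p": 1.9, "q": 0.1, "r": 6.0, "s": 6.3, "t": 9.1, "u": 2.8,
    "v": 1.0, "w": 2.4, "x": 0.15, "y": 2.0, "z": 0.07,
}


def shift_text(text: str, shift: int) -> str:
    """Rotate ASCII letters by `shift` places, leaving other characters alone."""
    out = []
    for ch in text:
        if ch.isascii() and ch.isalpha():
            base = ord("a") if ch.islower() else ord("A")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)


def chi_squared(text: str) -> float:
    """Distance between the text's letter counts and English expectations."""
    letters = [ch for ch in text.lower() if "a" <= ch <= "z"]
    n = len(letters)
    if n == 0:
        return 0.0
    counts = Counter(letters)
    return sum((counts[l] - n * f / 100) ** 2 / (n * f / 100)
               for l, f in ENGLISH_FREQ.items())


def crack_caesar(ciphertext: str):
    """Try all 26 shifts and keep the one whose output looks most English."""
    best = min(range(26), key=lambda s: chi_squared(shift_text(ciphertext, -s)))
    return best, shift_text(ciphertext, -best)


plain = ("letter counts in ordinary english text are far from uniform and that "
         "pattern survives any simple substitution of one letter for another")
cipher = shift_text(plain, 7)
print(cipher)
print(crack_caesar(cipher))
```

The attacker never needs the key or any understanding of the message. The skew in the letter counts is enough, which is the sense in which the frequency layer is readable independently of the content.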

Modern cryptography talks about confusion and diffusion, terms Shannon coined in 1949. Confusion makes the relationship between key and ciphertext complex. Diffusion spreads the statistical structure of the plaintext across many positions in the ciphertext, so that no local letter-frequency pattern survives. AES is built to destroy the patterns Al-Kindi exploited. The fact that we have to destroy them to get a secure cipher tells you that the patterns are real, that they carry information, and that a sufficiently patient algorithm can read them.

Reading the letter-frequency layer of text is, literally, reading the source code of the message in a way that the writer didn’t put there on purpose. The wiki claim was right about this. It just framed the phenomenon through anagrams when it should have framed it through cryptanalysis.

3. Sub-semantic channels: what the substrate actually carries

Letter frequencies are the boring part of the substrate. The interesting parts are the ones where the sub-semantic structure appears to carry actual meaning-tinted information, not just statistics.

Phonosemantics and sound symbolism

The textbook example is the bouba/kiki effect, first observed by Wolfgang Köhler in 1929 (with the nonsense words “takete” and “baluma”) and rediscovered repeatedly since, most famously by Ramachandran and Hubbard in 2001, who introduced the bouba/kiki pair. Show a subject two shapes, one rounded and blobby, one sharp and spiky, and ask which one is called “bouba” and which is called “kiki.” Across languages, across age groups, even across cultures that don’t use roman letters, a large majority of subjects (about 95% in Ramachandran and Hubbard’s report) assign “bouba” to the rounded shape and “kiki” to the sharp one. The shapes have no inherent names. The sounds are not “really” round or sharp. But the cross-modal association is robust to the point of being one of the most replicated findings in experimental psychology.

The mechanism, as best as anyone has been able to nail down, is some combination of articulation gesture (lip rounding for “bouba,” tongue pointing for “kiki”) and high-frequency / low-frequency acoustic content. The point for our purposes is that the sounds themselves carry semantic associations, prior to and independent of any linguistic convention.

Phonesthemes

A phonestheme is a sub-morphemic sound cluster that carries meaning across a family of unrelated words. English has several. The cluster gl- at the start of a word leans visual, often to do with light or sight: glow, gleam, glint, glitter, glance, glare, glimpse, glisten. The cluster sn- at the start of a word leans nasal: sniff, snore, snort, sneeze, snout, snot, snarl. The cluster sl- leans toward smoothness or unpleasantness or both: slip, slide, slick, slime, sludge, slop, slush. The cluster -ump leans toward roundness or impact: bump, lump, hump, dump, jump, slump, clump, stump, thump, rump.

None of these are absolute. Plenty of gl- words have nothing to do with light (gland, glue, glib). The claim is statistical, not categorical: the cluster shifts the probability distribution of meanings the word will carry, and brains pick this up. Margaret Magnus’s PhD thesis (2001, Norwegian University of Science and Technology, Trondheim) catalogued English phonesthemes systematically and argued that they are sub-morphemic semantic carriers that can be exploited deliberately by writers and that are exploited unconsciously by language drift.

The wiki page was, in a hand-wavy way, gesturing at exactly this kind of phenomenon. The technical literature was already considerable when I wrote it. I hadn’t read enough of it.

Rhythm, meter, and persuasion

The substrate also shows up in prosody. Iambic pentameter is not magic, but the fact that English speeches and slogans skew strongly toward a regular alternation of unstressed and stressed syllables is not coincidence either. The “rule of three” in rhetoric (life, liberty, and the pursuit of happiness; of the people, by the people, for the people; veni, vidi, vici) is older than English itself and turns up across the Indo-European languages. Try replacing any of those with a two-clause or four-clause version and the difference in memetic stickiness is immediate and brutal. The substrate cares about rhythm. Brains are entrainment machines and stickiness rides on entrainment.

The cumulative point is that language carries information on a sub-semantic channel that includes letter frequencies, phonemic clusters, sound-symbolic associations, and rhythmic structure. The channel is real, it has been characterised in the linguistics literature for nearly a century, and it does load-bearing cognitive work that readers are not consciously aware of. The wiki page wasn’t inventing this. It was rediscovering it through a particular lens (anagrams) that happened to be a sideways view of the underlying phenomenon.

4. How LLMs ended up doing this on purpose

A transformer language model is trained, mechanically, to do exactly one thing: predict the next token, given the previous tokens. The training corpus is some enormous slice of human-generated text. The loss function is cross-entropy on the next-token distribution. There is no semantic objective. There is no fact-checking. There is no “understanding” in any of the senses philosophy of mind has been arguing about for a century. There is only one instruction: fit the next-token distribution as well as possible given what came before.
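For readers who have not met the loss function, the sketch below computes next-token cross-entropy by hand on a made-up two-step example. It measures in bits to match the Shannon framing; real training code typically uses natural logarithms and operates on tensors. The function name and the toy distributions are hypothetical.

```python
import math


def cross_entropy_bits(predicted, actual_tokens) -> float:
    """Average of -log2 p(actual next token) over all positions.

    `predicted` holds one dict per position, mapping candidate tokens to the
    probability the model assigned them; `actual_tokens` holds what came next.
    """
    total = 0.0
    for dist, token in zip(predicted, actual_tokens):
        total += -math.log2(dist[token])
    return total / len(actual_tokens)


# Made-up model outputs for two positions in a sentence.
predicted = [
    {"cat": 0.5, "dog": 0.5},    # model was unsure: costs 1 bit
    {"sat": 0.25, "ran": 0.75},  # model bet on the wrong word: costs 2 bits
]
actual = ["cat", "sat"]

print(cross_entropy_bits(predicted, actual))
```

The loss is just average surprisal: the model is penalised in proportion to how surprised it was by the text that actually followed. Nothing in the objective refers to meaning or truth.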

The architecture is descended from a chain of older statistical language models. The lineage goes roughly:

Markov chains (1906) — sample the next character or word from a distribution conditioned on the previous one.
n-gram models (Shannon 1948, then many) — condition on the previous n-1 tokens.
Class-based language models (Brown et al. 1992) — cluster words into classes so the conditioning is on the classes, not the tokens.
Neural language models (Bengio et al. 2003) — replace the count-based conditional with a small neural network.
Word embeddings (Mikolov et al. 2013, Word2Vec) — represent words as vectors in a space where geometric relations encode semantic relations.
Sequence-to-sequence with attention (Bahdanau et al. 2014) — let the model focus on different parts of the input dynamically.
Transformers (Vaswani et al. 2017) — drop recurrence; everything is attention.

Each step in that chain increased the order of statistical conditioning the model could capture. A bigram model conditions on one previous token. A 5-gram model on four. A transformer with a several-thousand-token context conditions on thousands. And — this is the load-bearing observation — when you make the conditioning long enough and rich enough, things that look like reasoning, planning, analogy, and even mild self-awareness fall out of the model. None of them were explicitly trained for. They emerge from optimising the next-token distribution well enough on enough text.

This is exactly the result the wiki page predicted. Not in the way I expected — I imagined some hand-crafted “meme-worthiness scorer” that explicitly modelled phonesthemes and letter frequencies, and what actually happened was a brutally simple architecture that learned all of that and a lot more by gradient descent on next-token prediction over a few terabytes of human-generated text. But the structural claim is the same. The cognitive content of language is largely carried by statistical patterns over tokens. A sufficiently large machine trained on enough of that statistical structure can produce text that exploits the same channels readers exploit unconsciously when producing or evaluating their own.

5. What LLMs validated, and what they didn’t

The wiki claim broke into a strong form and a weak form. The strong form was something like anagrams have inherent meaning. The weak form was the statistical substrate of language carries cognitive weight that brains pick up without conscious awareness.

LLMs have decisively validated the weak form. They produce text that humans rate as compelling, persuasive, emotionally resonant, and sometimes even insightful, using nothing but the statistical structure they extracted from training data. The compellingness is not a semantic add-on. It rides on the same channels. If it were purely semantic, LLMs with no model of truth, agency, or world state could not produce compelling text. They obviously do.

LLMs have also surfaced a related fact that the wiki claim did not anticipate: the substrate is not just letter-frequency or phoneme-cluster. It is the full sequence-conditional distribution. Transformer attention learns long-range dependencies between phrases many sentences apart, between turn-taking patterns in dialogue, between rhetorical-figure setups and payoffs, between argument structures, between voice registers. The “fabric” the wiki page named in passing turned out to be far richer than the letter-frequency diagnostic suggested. The patterns include letter frequencies, yes, but also bigram-and-up structures, syntactic templates, rhetorical moves, narrative arcs, and stylistic signatures. All of it is in the statistics. The wiki claim was right that the substrate existed; it was wrong about how thick the substrate is. It is far thicker than I imagined.

The strong form — anagrams as a Rosetta Stone — LLMs have not validated and probably never will. Anagram structure is one projection of the underlying frequency distribution. It contains some information about the source but not in the way I was reading it. The “emotional synesthesia” I felt when scoring anagrams was almost certainly me feeling the underlying phonesthemic and rhythmic structure of the source phrase, channeled through the anagram-shaped filter I was applying. The filter was incidental. The signal was real.

6. Implications: what to do with a validated substrate

Once you grant that the substrate exists and that machines can now explicitly operate on it, several things change.

Content optimisation as substrate exploitation. Every recommendation algorithm running on social platforms today is performing a version of the “meme-worthiness scorer” the wiki page called for. The mechanism is not that they explicitly score phonesthemes; the mechanism is that they score engagement, which correlates with the deeper substrate properties because engagement is the substrate’s signature in human behaviour. The platforms are strip-mining the substrate without knowing it has a name. The 2014 proposal has been built, badly, by every company whose business depends on attention.

Prompt engineering as deliberate substrate operation. The discipline that has emerged around “prompting” LLMs is, in a real sense, the consciously-engineered version of what skilled writers were doing instinctively before. A prompt that reliably produces a certain register, voice, or argument structure is one that activates the right region of the substrate. People who are good at this often report that they can “feel” when a prompt will work before they run it — the same feeling skilled writers report. The substrate is the same; the target machinery has changed.

Adversarial substrate exploitation. If statistical patterns in text affect human cognition below conscious awareness, then a sufficiently large model that learns those patterns can produce text deliberately tuned to particular cognitive vulnerabilities. This is not hypothetical. The combination of generation (LLMs that can produce convincing copy on demand), targeting (platforms that can deliver custom-tuned copy to specific subpopulations), and iteration (automated A/B testing of message variants) is industrial-scale substrate exploitation. There has been no equivalent in the history of communication. The printing press could mass-produce only one text at a time; this can mass-produce a million variants. Whether that turns out to be a slow disaster or a tolerable nuisance depends on defences that we are nowhere near ready to deploy.

Defensive substrate awareness. The cleanest defence is the same one that works against any cognitive bias: knowing the mechanism exists weakens its effect. People who know that mortality salience biases their political judgments make better political judgments under mortality salience. People who know that a piece of copy was produced by an LLM tuned for engagement read it more skeptically. The substrate cannot be removed — it is the same substrate that makes language work at all — but its exploit cases can be recognised and discounted. Memetic hygiene becomes a thing you can teach.

7. The intuition’s place in the genealogy

A small note on credit. I am not claiming I “predicted” LLMs in any serious sense. The component ideas — that statistical structure matters for cognition, that language has sub-semantic channels, that machines could in principle be built to operate on those channels — were in the literature in scattered form when I wrote the wiki page. I hadn’t read them. What I did was hit on the structural insight from a sideways angle (anagrams), state it loosely, and intuit a research program (the “meme-worthiness scorer”) that nobody, including me, built.

Then a different research program — neural language modelling, scaled absurdly, with no theoretical commitment to substrates or phonesthemes or anything else other than next-token prediction — happened to materialise the same observation as a working artifact. The artifact is more general than what I imagined; it confirms the structural claim while being silent on the diagnostic I was using.

I write this up partly because it’s gratifying when a hunch turns out to have been pointing at something real. But mostly because the structural lesson seems worth holding onto. When you notice a pattern that nobody else is talking about, even if your diagnostic for it is wrong, write it down. The diagnostic can be corrected later. The pattern, if real, will eventually be hit by some unrelated research program from another direction, and your having written the hunch down will save you (and others) the time of re-deriving it.

8. What’s next on the substrate

LLMs settled the question of whether the substrate exists and whether it can be operationalised. The open questions now are quieter and deeper.

What other latent dimensions of language carry cognitive weight that we haven’t named yet? Phonosemantics, rhythm, and frequency are the ones we have. There are surely others. Some are likely to be visible to scaling laws and emergent capability studies in larger models. Some may require new instruments.

How much of human cognition is itself the substrate? The honest implication of the LLM result is that a substantial fraction of what we experience as our own thinking is statistical pattern completion over linguistic input. This is uncomfortable. It is also probably true to first order, and the question is how to live with it without either collapsing into nihilism or pretending it isn’t so. The neuroscience says the same thing from another direction (predictive processing, Bayesian brain, free-energy principle). Two literatures converging on the same uncomfortable conclusion from independent directions usually means the conclusion is right.

Can the substrate be inoculated against? See the defensive substrate-awareness discussion above. The early signs are that yes, partially, with effort. The economics work against the defence; the production side is industrialised, the defence side is artisanal. Closing that gap is one of the more important communication-design problems of the next decade.

What happens when LLMs train on LLM output? This is the distinctively-new problem. Once a meaningful fraction of training text is itself generated, the model is learning the substrate’s image of itself rather than the substrate as humans produced it. The fixed points of that recursion are not the same as the fixed points of the original substrate. We are running this experiment globally without much instrumentation. I have no clean prediction for how it ends.

I’ll close where I started. The wiki page in 2014 (or thereabouts) was written in rough prose, with several typos and at least one bad diagnostic. The structural claim was correct: language has a substrate of statistical pattern that does cognitive work, the substrate is exploitable, and a sufficiently large machine could be built to operate on it. That machine got built. It works. It is already changing the politics of attention more than the printing press did.

Anagrams turned out not to be the right window into the substrate. But the substrate was there, and somebody — though not me — has now built the equivalent of an x-ray for it. The next few years will be about deciding what to do with the x-ray. I think the answer involves more substrate awareness, not less. The frame the mimetics essay set up still applies: knowing the mechanism is half of being able to defend against it.

The other half is harder.


Friday, May 29, 2026