Friday, May 29, 2026

BNN on the GA144: Forth, Hamming Distance, and Picojoules

Fifth in a series. The earlier four (water anomalies, if water were the computer, the third replicator, anagrams as source code) were about substrates — physical, quantum, cultural, statistical. This one is about a piece of silicon that nobody has used in anger and the operation it was accidentally optimised for: Hamming distance. Which is also the only operation a binary neural network actually needs.


Around 2010, GreenArrays released the GA144: a single chip with 144 independent 18-bit dual-stack Forth cores arranged in an 18×8 mesh. Each core has 64 words of RAM and 64 words of ROM and communicates with its nearest neighbours (up to four) through blocking register-to-register transfers. The chip is asynchronous, with no global clock. It retails for about $20, consumes microwatts at idle, and spends roughly 7 picojoules per instruction when active. It was designed by Chuck Moore, the inventor of Forth, and it is the most power-efficient general-purpose computer ever shipped commercially. It has also, in the decade and a half since it appeared, been used for almost nothing.

This essay is the argument for what it should be used for. The short version: binary neural network inference. The longer version is that this isn’t a coincidence — the GA144’s architectural choices, the constraints that have kept it out of mainstream use, and the mathematical structure of BNN inference fit together in a way that looks designed even though it isn’t. The bridge between them is one specific operation, Hamming distance, that I have been computing on various pieces of silicon since 1996 for completely unrelated reasons. The pieces start to look like the same problem only in retrospect.

1. The GA144, briefly

Chuck Moore is the inventor of the Forth programming language. He has been designing minimalist stack-based processors to run Forth natively since at least 1985 (the Novix NC4016), through the RTX2000 (1988), the ShBoom (1990s), and a sequence of MISC and MuP21-line chips at iTV Corporation and later GreenArrays. The GA144 is the seventh or eighth chip in this lineage. The design principle is consistent across all of them and is not the design principle of mainstream processor architecture: do less per cycle, fit more cores on the die, let them communicate directly, run them asynchronously, and reach the power-efficiency frontier by not having any of the global infrastructure that most processors burn watts maintaining.

Each F18A core inside a GA144 looks, by mainstream standards, like a toy:

  • 18-bit word size (not 16, not 32, not 64 — 18, chosen because it cleanly subdivides into three 5-bit instruction slots plus a 3-bit slot, fitting four instructions per word).
  • 64 words of RAM, 64 words of ROM. That is 1,152 bits of writable state per core. There is no cache, no MMU, no DRAM. A program has to be tiny.
  • Dual-stack architecture — a data stack and a return stack — with the top three of each stack as named registers and the rest in a small circular buffer. Forth conventions throughout.
  • Minimal instruction set — about 32 instructions, all single-cycle, including XOR, NOT, AND, ADD, shifts, conditionals, returns, and communication primitives for the four neighbour ports.
  • No clock. The core advances when its inputs are ready. When it has nothing to do, it stops drawing power. When a neighbour sends data, it wakes.

The 144 cores are wired in an 18×8 grid (18 columns, 8 rows). A core can read from and write to its north, south, east, and west neighbours through register-mapped ports; cores on the edges and corners have fewer neighbours. There is no broadcast bus, no shared memory, no cache coherence. If two cores on opposite sides of the chip need to talk, the message has to be relayed by the cores in between.

The GA144 ships at around $20 per chip, draws on the order of 150 mW with every core running flat out and microwatts at idle, and each core peaks at roughly 700 MIPS. By any metric except power-per-instruction it is uncompetitive with a modern Cortex-M0, let alone an x86 core. By the power-per-instruction metric it is unmatched.

Why has nobody used it? Three reasons:

  1. The 64-word RAM limit per core. You cannot run a normal program. You have to break the problem into pieces that fit. This is the objection that keeps mainstream developers off the chip.
  2. It runs Forth. Stack-based, postfix, no types, you-edit-with-colorForth-or-arrayForth. The Forth tradition is small, devout, and epistemically incompatible with the C tradition. Most engineers have never written Forth and don’t want to start.
  3. The benchmarks aren’t there. The chip would crush at the things it was designed for, except that “the things it was designed for” is not a category that mainstream processor reviews evaluate. It loses on SPECint by orders of magnitude, because SPECint is not what the GA144 is for.

The thesis of this essay is that BNN inference at the edge is what the GA144 is for, and that the constraints which have kept it unfashionable turn out to be the exact properties that make BNN inference work on it.

2. BNNs, briefly

A standard neural network computes, at every neuron, a sum of products: take each input, multiply by a learned weight, add up the products, pass through a nonlinearity. With 32-bit floating-point weights and activations, that multiply-add is roughly 50 picojoules on a modern CPU — a number dominated by the floating-point unit and the memory bandwidth required to move the weights in and out.

A Binary Neural Network constrains both the weights and the activations to two values, +1 and −1. The training story (which I’ll mostly skip here; the canonical paper is Courbariaux et al. 2016) is that you train with standard floating-point backprop on latent real-valued weights, binarise with sign() in the forward pass, and let gradients pass through sign() with a straight-through estimator. At inference time only the binarised weights remain, and the resulting model loses only a few percentage points of accuracy on most problems that a small MLP can do at all. Larq, the open-source BNN training framework, ships with reference implementations.

The interesting part is what happens at inference time when everything is one bit. Encode +1 as 1 and −1 as 0. Then the multiplication of a weight bit w by an activation bit a produces:

w        a        w · a    encoded
+1 (1)   +1 (1)   +1       1
+1 (1)   −1 (0)   −1       0
−1 (0)   +1 (1)   −1       0
−1 (0)   −1 (0)   +1       1

That’s the XNOR truth table. Multiplication has become XNOR.

Summing the products is then just counting the 1-bits in the XNOR result and adjusting for sign. If N is the vector length and p = popcount(XNOR(W, A)), then:

dot(W, A) = 2 * p - N

Equivalently, since XNOR = NOT(XOR):

dot(W, A) = N - 2 * popcount(XOR(W, A))
         = N - 2 * hamming_distance(W, A)

The dot product of two binary vectors is a closed-form function of their Hamming distance. Neural network inference, the multiply-accumulate that the deep learning revolution scaled up by fifteen orders of magnitude, reduces to Hamming distance once you binarise.
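
The identity is easy to check on a host machine. The following is a minimal sketch, not code from bnn1: the helper names `to_bits` and `binary_dot` are mine, and Python integers stand in for machine words. It packs two random ±1 vectors with the +1 → 1, −1 → 0 encoding and compares the ordinary dot product against N − 2 × Hamming distance.

```python
import random

def to_bits(vec):
    """Pack a list of +1/-1 values into an int, encoding +1 as 1 and -1 as 0."""
    word = 0
    for i, v in enumerate(vec):
        if v == 1:
            word |= 1 << i
    return word

def binary_dot(w_bits, a_bits, n):
    """Dot product of two packed +1/-1 vectors of length n, via Hamming distance."""
    hamming = bin(w_bits ^ a_bits).count("1")
    return n - 2 * hamming

random.seed(0)
N = 64
W = [random.choice((-1, 1)) for _ in range(N)]
A = [random.choice((-1, 1)) for _ in range(N)]

reference = sum(w * a for w, a in zip(W, A))   # multiply-accumulate
fast = binary_dot(to_bits(W), to_bits(A), N)   # XOR + popcount + subtract
print(reference, fast)                         # the two values should agree
```

The packed version touches each vector once as a whole word, which is where the bit-parallelism comes from.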

On a modern CPU, this means you can compute 64 multiplications and their sum with one XOR instruction, one popcount instruction, and a subtract. The whole tightly-coded inner loop is three or four instructions. The energy savings over a floating-point MAC are between one and three orders of magnitude. The memory savings — because each weight is one bit instead of 32 — are exactly 32×.

The drawbacks are also real. You lose some accuracy. You cannot use a BNN where a floating-point ResNet is required. The technique works for small-to-medium classification, keyword spotting, gesture recognition, simple object detection. Not Stable Diffusion. Not GPT-4. But for the long tail of edge AI applications — the hearing aid, the doorbell camera, the always-on sensor — the trade is a clear win.

3. Why the match is so clean

Now the alignment. Take each of the GA144’s “limitations” and ask what BNN inference does with it.

3.1 Memory

The GA144 has 64 words of RAM per core. That is the single most quoted reason the chip is unusable: you cannot fit a normal program in 1,152 bits.

In binary: 1,152 bits is 1,152 weights. A small dense layer (say, 32 inputs × 32 outputs) needs exactly 1,024 weight bits. One core fits one small layer with room to spare. Across the chip:

Encoding   Bits per weight   Weights per core   Weights per chip
FP32       32                36                 5,184
INT8       8                 144                20,736
Binary     1                 1,152              165,888
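
The table is plain arithmetic on the figures already given (18-bit words, 64 words of RAM, 144 cores). A small sketch that reproduces it follows. It assumes, optimistically, that every RAM word in a core holds weights and none holds code; the function name `capacity` is mine.

```python
WORD_BITS = 18    # F18A word size
RAM_WORDS = 64    # RAM words per core
CORES = 144       # cores per GA144

def capacity(bits_per_weight):
    """Return (weights per core, weights per chip) for a given weight width.

    Assumption: all RAM is used for weights, with no space reserved for code.
    """
    bits_per_core = WORD_BITS * RAM_WORDS
    per_core = bits_per_core // bits_per_weight
    return per_core, per_core * CORES

for name, bits in (("FP32", 32), ("INT8", 8), ("Binary", 1)):
    per_core, per_chip = capacity(bits)
    print(f"{name:<8}{bits:>4}{per_core:>8,}{per_chip:>10,}")
```

In practice the inner loop has to live somewhere too, so the usable weight budget per core would be somewhat below these ceilings.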

165,888 weight parameters is enough for a meaningful classifier. A keyword-spotter “hey siri” style network is in the 100K-parameter range. Many gesture-recognition models fit. A small image classifier on 32×32 inputs fits.

The constraint that kept the GA144 out of general-purpose computing has the opposite sign for BNN inference. The chip has exactly the amount of memory you need.

3.2 The 18-bit word

The F18A’s 18-bit word holds 18 binary products per XNOR. That’s narrower than AVX-512’s 512-bit lanes, and yes, this is the GA144’s weakest dimension: bit-for-bit it loses to wider SIMD. But the GA144 is not competing on aggregate throughput. It is competing on ops per joule, and 144 cores each doing 18 bits at 700 MHz adds up. The peak XNOR rate, counting every instruction as an XOR, is around 2 trillion bit-products per second across the chip (144 × 18 × 700 M ≈ 1.8 T), an order of magnitude below a high-end Xeon but at less than 1% of the power.

The 18-bit width also has a useful side effect for popcount, which the F18A does not have as a native instruction. A 9-bit lookup table — 512 entries — fits inside the ROM budget if you partition it across cores, and popcount of an 18-bit word becomes two table lookups plus an add. Three instructions.
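
Here is the lookup-table idea as a host-side sketch. It only demonstrates the arithmetic (split the 18-bit word into two 9-bit halves, look each up, add). It says nothing about how the table would actually be laid out across F18A cores, and the shift and mask used for the split are host conveniences that would cost extra instructions on the real chip. The names `LUT9` and `popcount18` are mine.

```python
# 512-entry table: number of 1-bits in every possible 9-bit value.
LUT9 = [bin(i).count("1") for i in range(512)]

def popcount18(word):
    """Popcount of an 18-bit word as two 9-bit table lookups plus an add."""
    low = word & 0x1FF          # bits 0..8
    high = (word >> 9) & 0x1FF  # bits 9..17
    return LUT9[low] + LUT9[high]

print(popcount18(0b101))     # two bits set in the low half
print(popcount18(0x3FFFF))   # all 18 bits set
```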

3.3 The instruction set

The F18A’s minimal instruction set already contains what BNN inference needs.

  • XOR — single-cycle native. Combined with NOT (also single cycle) you have XNOR in two cycles.
  • Popcount — not native, but the lookup-table approach above runs in three cycles for an 18-bit chunk.
  • Sign activation — just a comparison against a threshold, which the F18A handles with conditional skip.

The inner loop for one neuron is roughly:

@b      ( fetch activation word from the neighbour port that B addresses )
@+      ( fetch weight word from local RAM through A, then advance A )
or      ( F18A "or" is exclusive-or: this is the XOR )
-       ( F18A "-" is ones-complement: XNOR is now on top of stack )
popcnt  ( table-lookup popcount; a placeholder word, not a native opcode )
+       ( accumulate into running sum )

That’s six F18A instructions per 18 binary products. At ~700 MHz single-cycle execution per core, you get something like 100 million 18-bit XNOR-popcount-accumulate operations per core per second (700 M / 6 ≈ 117 M), or roughly 2 billion bit-level operations per core per second. Times 144 cores, that’s about 0.3 trillion binary ops per second on the whole chip. At an estimated 0.15 W power draw, that’s approximately 2 trillion binary ops per joule.

The comparison chart that this lives in:

Platform                 Binary ops/sec   Power      Ops/joule
Intel Xeon (AVX-512)     ~200 G           ~150 W     ~1.3 G
ARM Cortex-A72 (NEON)    ~30 G            ~3 W       ~10 G
ARM Cortex-M4            ~500 M           ~50 mW     ~10 G
GA144 (projected)        ~0.3 T           ~150 mW    ~2 T

On these figures the GA144 is roughly three orders of magnitude better than a Xeon on ops-per-joule for this specific workload, and about two hundred times more energy-efficient than a Cortex-M4. The Xeon delivers comparable total throughput, but at a thousand times the power. That ratio is exactly the trade-off the edge inference market is shaped around.
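
To make the six-step inner loop from this section concrete, here is a host-side model of one neuron. It is a sketch under my own assumptions, not F18A code: activations and weights arrive as lists of 18-bit words, unused high bits in the last word are zero in both operands, and the names `split_words`, `neuron_dot` and `neuron_fire` are hypothetical.

```python
WORD_BITS = 18
WORD_MASK = (1 << WORD_BITS) - 1

def split_words(bits, n_bits):
    """Split an n_bits-wide integer into 18-bit words, lowest word first."""
    n_words = (n_bits + WORD_BITS - 1) // WORD_BITS
    return [(bits >> (i * WORD_BITS)) & WORD_MASK for i in range(n_words)]

def neuron_dot(weight_words, activation_words, n_bits):
    """Binary dot product: XNOR, popcount and accumulate, one word at a time."""
    matches = 0
    for w, a in zip(weight_words, activation_words):
        xnor = ~(w ^ a) & WORD_MASK        # XOR, invert, keep 18 bits
        matches += bin(xnor).count("1")    # popcount, accumulated
    # Zero padding in both operands XNORs to 1, so remove those false matches.
    padding = len(weight_words) * WORD_BITS - n_bits
    matches -= padding
    return 2 * matches - n_bits            # dot(W, A) = 2p - N

def neuron_fire(weight_words, activation_words, n_bits, threshold=0):
    """Sign activation: 1 if the dot product reaches the threshold, else 0."""
    return 1 if neuron_dot(weight_words, activation_words, n_bits) >= threshold else 0

w = split_words(0xFFFFFFFF, 32)            # a 32-wide input needs two words
print(neuron_dot(w, split_words(0xFFFFFFFF, 32), 32))
print(neuron_dot(w, split_words(0x00000000, 32), 32))
```

The padding correction is the one detail the prose glosses over: 32 is not a multiple of 18, so the second word carries four bits that must not be counted.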

3.4 The mesh

An 18×8 mesh of cores with neighbour-only communication looks, on first contact, like a bug. You cannot do arbitrary point-to-point routing without intermediate cores. You cannot broadcast. You cannot gather from arbitrary locations.

A neural network layer, structurally, is a sequence of neighbour-only transformations. The activations of layer N feed only into layer N+1; there are no skip connections in a vanilla MLP, and even in architectures with residuals the connections are local. The natural mapping is:

  • Each row of cores = one layer, with cores in the row processing different output neurons of that layer.
  • Activations flow east-to-west or north-to-south between layers, using the inter-core ports.
  • Weights are static, loaded once at boot from external flash into each core’s RAM.
  • Pipelining is automatic. As soon as a core finishes computing its output activation, it streams the result to its neighbour and starts on the next input. The asynchronous design means there is no global pipeline stall; cores wake and sleep individually based on data availability.
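
The mapping above can be modelled on a host as a chain of lazy stages, one per layer. This is only a sequential sketch of the dataflow: it does not model the asynchronous hardware, the port handshakes, or the split of a layer across several cores. The names `make_layer` and `pipeline` are mine, and activations are packed one bit per neuron.

```python
def make_layer(weight_rows, n_inputs):
    """Build one layer. weight_rows[j] holds the packed weights of output neuron j."""
    def layer(bits):
        out = 0
        for j, w in enumerate(weight_rows):
            hamming = bin(w ^ bits).count("1")
            if n_inputs - 2 * hamming >= 0:   # sign activation on the dot product
                out |= 1 << j
        return out
    return layer

def pipeline(layers, inputs):
    """Chain layers lazily: each input flows through every stage as it is pulled."""
    stream = iter(inputs)
    for layer in layers:
        stream = map(layer, stream)
    return stream

layer1 = make_layer([0b1111, 0b0000], n_inputs=4)   # 4 inputs -> 2 outputs
layer2 = make_layer([0b01], n_inputs=2)             # 2 inputs -> 1 output
for result in pipeline([layer1, layer2], [0b1111, 0b0000, 0b0011]):
    print(result)
```

Because `map` is lazy, nothing is computed until a result is requested at the far end, which is a loose software analogue of cores sleeping until data arrives.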

The chip is, essentially, a piece of hardware whose dataflow graph matches a layered feedforward network’s dataflow graph. Mainstream processors have to emulate this with shared memory, cache hierarchy, and software pipelining. The GA144 just is this dataflow.

3.5 Asynchronous wake-on-data

This is the property that nothing else matches. A core that has no input to process is drawing essentially zero current. A clocked processor at the same idle state is still burning the clock distribution tree, the cache refresh, the PLLs, the voltage regulators. The GA144’s asynchronous design means a sparse neural network — one where many activations are zero — costs zero energy on the zero-activation cores. Sparsity becomes a hardware property, not an algorithmic optimisation.

4. Hamming distance: the throughline

I started thinking seriously about this because I have been computing Hamming distances on various pieces of silicon for thirty years for completely unrelated reasons, and I noticed the same operation kept showing up.

In 1996 I was working on what became ECIP — Error Correcting Internet Protocol — an early implementation of forward error correction for real-time UDP streaming. The motivation was that Internet packet loss is an erasure channel: packets either arrive intact or they don’t arrive at all. There’s no need to detect corruption inside a packet (UDP checksums do that). What you need is to recover entire missing packets from redundancy in the surrounding ones, and the math for that uses block codes whose optimality is determined by the minimum Hamming distance between codewords.

To find good codes I wrote a brute-force search program, ecc4.c, that for a given block size enumerates candidate codewords, checks the Hamming distance from each candidate to every codeword already accepted, and keeps the candidates that maintain a minimum distance above some threshold. The inner loop is identical to BNN inference: a XOR between two bit-vectors, a popcount of the result, a compare against a threshold.
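
For readers who have not seen this kind of search, here is a minimal sketch of the greedy procedure described above. It is not ecc4.c, which I am not reproducing here. It is a reconstruction of the idea in a few lines, with names (`hamming`, `greedy_code`) chosen for this example.

```python
def hamming(a, b):
    """Hamming distance between two bit-vectors packed into ints."""
    return bin(a ^ b).count("1")

def greedy_code(n_bits, min_distance):
    """Enumerate candidates in order; keep each one that stays at least
    min_distance away from every codeword accepted so far."""
    code = []
    for candidate in range(1 << n_bits):
        if all(hamming(candidate, c) >= min_distance for c in code):
            code.append(candidate)
    return code

code = greedy_code(7, 3)
print(len(code), [format(c, "07b") for c in code[:4]])
```

The test inside `all(...)` is the XOR, popcount, compare sequence, run once per accepted codeword. For 7 bits and distance 3 no code can have more than 16 codewords (the sphere-packing bound), which gives a handy sanity check on the result.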

Around 1998 the program got rewritten as a portable CPU benchmark and distributed to contributors, who ran it on every x86 generation from the 386 onward. I called it Budmark (long story). The data collected between 1996 and 2002 was, in retrospect, a longitudinal study of how processor architectures handled the XOR-popcount primitive across nearly fifteen years of x86 evolution.

The most interesting result was the Pentium 4. The P4 was running at 2.26 GHz — almost four times the clock speed of the Pentium II Xeon at 600 MHz. On any benchmark Intel marketed at the time, the P4 crushed. On Budmark, the P4 was only 66.7% as efficient per cycle as the P3, meaning that despite its enormous clock advantage it finished a Budmark run only about 1.5× faster than the older chip. I didn’t know at the time, but Budmark was measuring exactly the operations that the P4’s NetBurst architecture had deprioritised — pipeline-unfriendly tight loops on small bit operations. The benchmark predicted Intel’s architectural dead end a couple of years before Intel acknowledged it and went back to the Pentium-Pro lineage that became Core.

The connection to BNN didn’t exist then. It exists now. Every processor that did well on Budmark would do well on BNN inference, and every processor that did badly would do badly, because the fundamental operation is the same. Hamming distance is the unifying primitive across:

  • Error-correcting codes — minimum Hamming distance between codewords determines correction capability.
  • Binary neural networks — dot product equals length minus twice Hamming distance.
  • Locality-sensitive hashing — near-duplicate detection rides on Hamming distance over fingerprint bits.
  • Cryptanalysis — many cipher attacks reduce to Hamming-weight analyses of the key schedule.
  • Binary feature matching — ORB, BRIEF, FREAK descriptors all compare via Hamming distance.

Once you see the primitive everywhere, designing for it stops feeling specialised and starts feeling general. The GA144 is a chip designed for Hamming distance computation, accidentally and brilliantly. It was designed for Forth and for power efficiency. It happened to land in the same architectural neighbourhood as the right shape for BNN inference because the underlying problem — extracting decisions from sparse bit-level patterns at low energy cost — is structurally the same problem across all these domains.

5. The Palmo echo

Before BNNs had the name, I was building one. In 1994 I started a project I called Palmo — “palm” + “mo” for “motion” — to do real-time hand-gesture recognition from video on 486-class machines. Memory was tight (target was sub-megabyte footprint) and CPU was tight (a 486 at 66 MHz delivered maybe 25 MIPS). The only way to make a neural network fit in those constraints was to binarise the weights and activations and exploit the resulting bit-parallelism.

Palmo’s architecture was distributed across maybe 100,000 FreeBSD machines volunteer-connected over the Internet, each running small networks and communicating via compressed binary activation packets. Three things made it work:

  • Compressed bit packets for inter-node communication: 1,000 binary activations fit in 125 bytes instead of 4,000 bytes of float, a 32× bandwidth reduction.
  • Lazy neuron evaluation in a linked list: neurons updated only when their inputs changed, using timestamp-based decay instead of continuous evaluation.
  • Threshold-based firing as the only nonlinearity, computable in one compare.
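
The packet arithmetic in the first bullet is easy to reproduce. The sketch below is not Palmo’s wire format, which I am not documenting here. It is just the obvious packing with +1 stored as 1 and −1 as 0, eight activations per byte, under hypothetical function names.

```python
def pack_activations(acts):
    """Pack +1/-1 activations into bytes, eight per byte (+1 -> 1, -1 -> 0)."""
    out = bytearray((len(acts) + 7) // 8)
    for i, a in enumerate(acts):
        if a == 1:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)

def unpack_activations(packet, count):
    """Inverse of pack_activations for the first `count` activations."""
    return [1 if (packet[i // 8] >> (i % 8)) & 1 else -1 for i in range(count)]

acts = [1, -1] * 500                 # 1,000 binary activations
packet = pack_activations(acts)
float_bytes = 4 * len(acts)          # the same activations as 32-bit floats
print(len(packet), float_bytes, float_bytes // len(packet))
```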

These are exactly the techniques that resurfaced 20 years later as BinaryConnect (2015) and XNOR-Net (2016). I shelved Palmo in 1996 because the GPU revolution made floating-point cheap, the deep learning community standardised on float, and there was no commercial interest in binary approaches. The 2015-onward BNN renaissance felt, when I lived through it, like watching someone re-derive your abandoned PhD thesis from first principles.

The reason this matters here: the GA144 is “Palmo on a chip.” Take the distributed-100k-FreeBSD-machines architecture, miniaturise it by six orders of magnitude, swap TCP/IP for register-to-register neighbour transfers, swap millisecond network latency for nanosecond wire latency, and you have the GA144. The architectural principles are identical. The Palmo project ran the experiment at the distributed-system scale and the GA144 packages the result at the silicon scale. Nobody noticed because the two communities — Forth hardware people and binary neural network people — don’t talk to each other.

6. Why this hasn’t happened yet

The honest accounting: the GA144 has been available since around 2010 or 2011. BinaryConnect and XNOR-Net are eleven and ten years old respectively. The energy advantage I’m describing has been on the table the whole time. Why has nobody shipped a GA144-based BNN inference engine?

A few reasons:

The Forth barrier. Programming a GA144 requires colorForth or arrayForth, both of which look alien to anyone trained on C/Python. The toolchain is small, the documentation is sparse, and the community is a few hundred people worldwide. Getting a working F18A-port of a BNN inference loop is a week of work for someone fluent in Forth and a month for someone who isn’t.

The mainstream-benchmark gap. Processor reviews don’t measure edge-AI inference energy. They measure SPECint, Geekbench, AI-bench on giant networks. The GA144 loses on every one of those. There has never been a high-profile public benchmark that would have surfaced its actual strengths.

The deployment story. Mainstream MCU programmers reach for an STM32 or an ESP32 or a Cortex-M0 because the toolchain is familiar, the ecosystem is huge, and the chip is available everywhere. The GA144 is single-sourced from GreenArrays, who make it in small batches. There is no Arduino-style ecosystem. There is no TensorFlow Lite backend.

The accuracy ceiling. BNNs do well on small classification but they cannot do everything a floating-point network can. If your edge application needs the last 5% of accuracy that the FP32 model gets and the BNN doesn’t, you’re going to ship a Cortex-M4 with INT8 quantised weights and you’re going to take the power hit. The market for chips that are only good at the BNN workload has been small.

All four reasons are softening. Forth toolchains are getting better (there are now Python-based F18A simulators that let you prototype without the colorForth learning cliff). Edge AI is exploding as a deployment category and per-inference energy is starting to matter more than absolute accuracy. The BNN accuracy story is improving as training techniques mature. And the energy budget of LLM-driven applications has gotten so large that even mainstream attention is turning to the bottom of the inference market.

The GA144 is, in 2026, approximately ten years too early or three years too late depending on how you count. The case for porting a keyword-spotter or a gesture-recognition model to it is now defensible in a way it wasn’t even five years ago.

7. What edge AI is actually measured in

The mainstream AI conversation is measured in TFLOPS, in tokens-per-second, in dollars per million tokens. None of those metrics apply to the edge. At the edge the relevant units are:

  • Inferences per millijoule — battery-powered devices live or die by this.
  • Latency from sensor wake to decision — for always-on listeners, motion detectors, fall sensors.
  • Memory footprint that fits in on-chip RAM — because going off-chip to DRAM dominates the energy budget.
  • Cost per chip in volume — embedded sensor markets are extremely price-sensitive.

On all four metrics the GA144 looks competitive or better. On inferences per millijoule for a binary classifier of reasonable size, the projected numbers put it at roughly one to three orders of magnitude ahead of conventional MCUs. On latency, the asynchronous design means a wake-on-data inference can complete in microseconds. On memory footprint, by definition, anything that fits at all fits entirely on chip. On cost, $20 is not great for an MCU role but is not catastrophic for a specialised edge inference accelerator.

The chip is not for everything. It is for the specific case where the network is small enough to fit, the input cadence is intermittent enough that asynchronous wake-on-data wins, and the energy budget is tight enough that the picojoule-per-instruction matters. That case exists. Hearing aids. Implantable sensors. Long-battery industrial sensors. Solar-powered wildlife cameras with on-board species classification. The list is shorter than the LLM list but it’s not empty, and it’s getting longer.

8. What a port would actually look like

Sketching the minimum viable demonstration:

  1. Pick a small BNN. The obvious candidate is the canonical 784→256→128→10 MNIST MLP I already have running in C and AVX2 (see bnn1 in the parent repo). That network needs ~235K binary weights in total (784×256 + 256×128 + 128×10), against 165,888 bits of RAM chip-wide, so it would need either compression or a slightly smaller architecture (say 784→128→64→10, which has ~109K weights and fits comfortably).

  2. Allocate cores by layer. With 144 cores and three layers, the natural mapping is: top row(s) handle layer 1, middle row(s) handle layer 2, bottom row(s) handle layer 3. Each row pipelines its outputs to the next.

  3. Implement the inner loop in F18A assembly. The XNOR-popcount-accumulate inner loop is six instructions per 18 bits. A 32-wide input layer takes two iterations.

  4. Build a host-side driver that streams MNIST images into the north edge of the mesh and reads classifications from the south edge. The host can be an ordinary Cortex-M running on the same board.

  5. Measure. Inferences per second, energy per inference, latency from input to output. Compare against the existing AVX2 reference implementation and against a Cortex-M4 INT8 quantised implementation as the relevant baselines.
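
A quick way to sanity-check steps 1 and 2 is to count weights per layer and divide by the RAM in one core. The sketch below makes several simplifying assumptions that a real port would have to revisit: every RAM bit holds a weight (no code, no thresholds or biases), and a layer’s weights can be cut at arbitrary bit boundaries across cores. The function names are mine.

```python
import math

BITS_PER_CORE = 18 * 64   # 1,152 bits of RAM per F18A core
CORES = 144

def layer_weights(sizes):
    """Binary weights per layer for a dense MLP with the given layer sizes."""
    return [a * b for a, b in zip(sizes, sizes[1:])]

def cores_needed(sizes):
    """Cores per layer, assuming all RAM holds weights (an optimistic ceiling)."""
    return [math.ceil(w / BITS_PER_CORE) for w in layer_weights(sizes)]

for sizes in ([784, 256, 128, 10], [784, 128, 64, 10]):
    weights = layer_weights(sizes)
    cores = cores_needed(sizes)
    verdict = "fits" if sum(cores) <= CORES else "does not fit"
    print(sizes, sum(weights), cores, sum(cores), verdict)
```

Even under these generous assumptions the first layer dominates the budget, so the row-per-layer picture in step 2 would in practice mean many rows for layer 1 and a handful of cores for everything after it.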

A working demo of this would take, by my estimate, a focused month for someone fluent in F18A Forth and roughly three months for someone coming from C. The result would be the world’s first published GA144 BNN inference engine and would, if my numbers are roughly right, establish a power-efficiency record for the workload that would be hard to beat without designing custom silicon.

9. The deeper claim

If you’ve read the other essays in this series, the structure of this argument is familiar. There is an information-processing substrate (Forth-style minimal-stack cores in a 2D mesh), there is an information-processing problem (BNN inference at the edge), and the substrate and the problem fit together in a way that nobody designed for explicitly but that falls out cleanly once you see the underlying operation.

That underlying operation, in this case, is Hamming distance — popcount of a XOR. It is the same operation I was running on Pentium II Xeons in 1998 to find error-correcting codes, the same operation modern locality-sensitive hashing rides on, the same operation BinaryConnect-era researchers re-derived from neural network quantisation. Substrates change. Problems change. The operation doesn’t.

Chuck Moore designed the GA144 because he was committed to an architectural philosophy nobody else followed — minimalist, async, mesh-connected, Forth-native — and trusted that the niche would exist somewhere even if he couldn’t predict where. The niche turned out to be BNN inference, which didn’t exist as a research area when the chip taped out and which only became commercially interesting a decade later. The match between them is the kind of thing you’d expect to see when both the substrate designer and the problem designer were optimising against the same hidden objective without realising it. The objective, in retrospect, was maximum useful computation per picojoule on bit-level operations. They both went for it. They both got there. They met.

I would like to be the person who finally connects them. The work is real and tractable and would benefit from someone who has both the Forth heritage and the BNN heritage in the same head, which I do, and which is rare. The chip is still available; the BNN technique is mature; the inner loop fits in six instructions per 18 binary products. It is, structurally, a two-month project away from demonstrating something nobody else has demonstrated.

I am writing this partly to commit, in public, to actually doing it.


Further reading

  • Moore, C. Programming a Problem-Oriented Language (1970). The original Forth manifesto.
  • Moore, C.; Ting, C. eForth and Zen (1996). The minimalist approach.
  • GreenArrays Inc. GA144 Documentation and F18A Architecture Reference. https://www.greenarraychips.com/
  • Pelt, J. Programming the F18A Computer. arrayForth tutorial.
  • Courbariaux, M.; Hubara, I.; Soudry, D.; El-Yaniv, R.; Bengio, Y. “Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1.” NeurIPS 2016.
  • Rastegari, M.; Ordonez, V.; Redmon, J.; Farhadi, A. “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks.” ECCV 2016.
  • Larq Contributors. Larq: An Open-Source Library for Training Binarized Neural Networks (2019). https://larq.dev
  • Sokol, J. Error Correcting Internet Protocol (ECIP), 1996.
  • Sokol, J. Palmo: Distributed Binary Neural Networks for Hand Gesture Recognition, 1994-1996. Unpublished.
  • Sokol, J. Budmark CPU Benchmark, 1996-2002. Collected data across x86 generations 386 through Pentium 4; predicted NetBurst’s architectural inefficiency on small-bit-operation workloads.
  • Hamming, R. W. “Error Detecting and Error Correcting Codes.” Bell System Technical Journal 29 (1950).
  • See also bnn1/ in the companion repository for working Python/Keras + C/AVX2 reference implementations of the overflow-fire BNN inference kernel.

Comments, corrections, and especially “let’s actually build it” welcome.

Anagrams as Source Code: A 2014 Hint at What LLMs Would Be

Fourth in a series. Read the mimetics essay for the broader frame; this one zooms in on a single page of my old wiki and asks why the intuition on it turned out to predict, in structure if not in detail, what large language models would become.


Sometime in the early 2010s — the page history says the bulk of edits land between 2011 and 2015 — I wrote on my wiki the following claim, which I’ll quote unedited because the rough edges are the point:

I am convinced it’s not “SuperNatural” but part of a memetic fabric of the way the brain does pattern recognition. … This has been done so long, that the languages evolve with these mechanisms built in to it. So much so that message content is based on the frequency of letters more then the actually words.

Anyhow decoding the anagrams is like looking at the source code, you can’t take it’s contents on literal face value but how it will subconsciously effect those people who “Get it”. … Now that I’ve been writing for a while I can feel what’s going to make a good anagram or not. I can feel weather it’s positive or negative.

It should also be possible to have a program that can measure a content’s meme-worthiness, (the ability to propagate as a meme, and in the greater context of current events and the mainstream media’s effects.)

It should also be possible to create a system that could generate candidate meme’s that should have a high likelihood of success.

The specific claim — that anagrams literally carry meaning, that re-arranging letters reveals a hidden semantic layer — I now think is mostly wrong. But the structural claim wrapped around it is something I’ve gotten less sure was wrong with every passing year. The structural claim is that language has a substrate of statistical pattern that does cognitive work in parallel with semantic content, that the substrate is exploitable, and that a program could be built to operate on it.

Nobody built that program in 2014. Then, ten years later, somebody did — but they did it by accident, while trying to build something else, and what they ended up with looks nothing like what I imagined. They called it a large language model. It works, and the reason it works validates the hint on the wiki page more than I would have predicted.

This essay is about the connection.

1. What the wiki claim was, exactly

The page is Nature_of_Language in the cluster I built around Department of Memetics. Stripped to its bones, the argument runs:

  1. Spoken and written language has evolved alongside the brains that process it for hundreds of thousands of years.
  2. Any feature of language that affects reader response will, under that long Darwinian sieve, accumulate in the language because writers and speakers whose feel produced more effective patterns out-propagated those whose feel did not.
  3. Letter and phoneme distributions are such a feature. They affect reader response — the rhythm of a sentence, the “feel” of a name, whether a phrase lands — on a channel below conscious semantic processing.
  4. Therefore there is hidden statistical structure in language that is doing cognitive work in parallel with the propositional content.
  5. Skilled writers can sense the structure but do not consciously manipulate it. Actors choose stage names whose letter distributions “feel right.” Marketers find the slogan that sticks. Speechwriters tune phrases until they’re memorable.
  6. In principle, a program could measure the structure. It could score arbitrary text for “meme-worthiness.” It could generate text optimised against that score.

I added the anagram story because I was doing statistical work on anagrams at the time and thought I was finding something specific — that the anagrams of high-impact phrases had statistically interesting properties. I now believe what I was actually picking up was the underlying letter-frequency structure of the source phrase, which an anagram by definition preserves. The anagrams were a symptom; the real signal was the letter-frequency surface of the input.
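The point that an anagram preserves the letter-frequency profile of its source by definition can be checked in a few lines. This is a minimal sketch; the helper name letter_profile and the example phrase are illustrative choices of mine, not anything from the original wiki page.

```python
from collections import Counter

def letter_profile(text: str) -> Counter:
    """Count letters only, case-folded; spaces and punctuation are ignored."""
    return Counter(ch for ch in text.lower() if ch.isalpha())

# A classic anagram pair, chosen only for illustration.
source = "dormitory"
anagram = "dirty room"

# Rearranging letters cannot change how many of each letter there are,
# so any statistic computed from letter counts is identical for both.
print(letter_profile(source) == letter_profile(anagram))  # True
```

Anything a scorer reads off the letter counts of the anagram, it could have read off the source phrase directly.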

The substrate was real. The diagnostic, anagrams as source-code-revealing, was wrong but pointed at something true. This is a more common pattern in early-stage hypotheses than people give credit for.

2. The statistical substrate of language was already known

The wiki page makes the argument de novo, which it shouldn’t have, because by the early 2010s most of the components were textbook.

Zipf’s law

George Kingsley Zipf in 1932, and the stenographer Jean-Baptiste Estoup before him, noticed that if you rank words in a corpus by frequency, the n-th most common word appears about 1/n as often as the most common one. The most frequent English word (“the”) appears about twice as often as the second (“of”), three times as often as the third (“and”), and so on. The pattern is shockingly robust: it holds, at least approximately, in every natural language where it has been measured, across topics within a language, and across centuries within the same language. Letter frequencies follow a different curve but are just as stable. The frequencies are not random. Word frequencies sit on a power law that any sufficiently large text approximately obeys.
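The 1/n relationship is easy to tabulate. The sketch below ranks tokens by count and sets each observed count beside the idealised prediction f(1)/rank. The function name rank_frequency is my own, and the token list is synthetic, built to follow the law exactly; a real corpus would only approximate it.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, token, observed_count, zipf_prediction) rows, most frequent first.

    zipf_prediction is the idealised count f(1) / rank.
    """
    counts = Counter(tokens).most_common()
    if not counts:
        return []
    top = counts[0][1]
    return [(rank, tok, n, top / rank)
            for rank, (tok, n) in enumerate(counts, start=1)]

# Synthetic tokens constructed to sit exactly on the 1/rank curve (an assumption
# made for illustration; real text is noisier).
tokens = ["the"] * 12 + ["of"] * 6 + ["and"] * 4 + ["to"] * 3
for row in rank_frequency(tokens):
    print(row)
# (1, 'the', 12, 12.0)
# (2, 'of', 6, 6.0)
# (3, 'and', 4, 4.0)
# (4, 'to', 3, 3.0)
```

Run on a large real text, the last two columns drift apart at the extremes but stay close over the middle ranks, which is the usual shape of the fit.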

The implication for the wiki claim is direct. If letter and word frequencies are stable lawful features of a language, then a brain that processes language has had millennia to internalise those frequencies and to use them as a kind of background model against which surprise is computed. The information content of a token is the negative log of its probability, so rare tokens carry more bits than common ones. Surprise is information, in the technical sense Shannon nailed down in 1948.
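Shannon's measure makes "surprise" a number: a token with probability p carries -log2(p) bits. A minimal sketch, with probabilities estimated from raw counts in a toy token list (the function name surprisal_bits and the sample tokens are my own illustration):

```python
import math
from collections import Counter

def surprisal_bits(tokens):
    """Map each distinct token to -log2 of its relative frequency in `tokens`."""
    total = len(tokens)
    return {tok: -math.log2(n / total) for tok, n in Counter(tokens).items()}

# Toy sample: "the" is half the tokens, "of" a quarter, the rest one eighth each.
tokens = ["the"] * 4 + ["of"] * 2 + ["cat", "sat"]
bits = surprisal_bits(tokens)
print(bits["the"], bits["of"], bits["cat"])  # 1.0 2.0 3.0
```

Halving a token's frequency adds exactly one bit of surprise, which is why the background frequency model matters so much for what registers as unexpected.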

Shannon’s word game

Speaking of Shannon: in A Mathematical Theory of Communication he introduced what we now call n-gram models of English, ran them forward, and showed that you can generate increasingly English-looking text by sampling from progressively higher-order n-grams. Order-0 gives uniform letter distribution: garbage. Order-1 gives the right letter frequencies but random sequences: still garbage but lumpy. Order-2 (bigrams) starts producing pronounceable nonsense. Order-3 (trigrams) starts producing word-shaped tokens. Order-5 or so produces text that, sentence by sentence, looks like English — even though it has no semantic content.
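Shannon's experiment can be approximated in a few lines. The sketch below builds a character-level model of a given order (order 1 uses letter frequencies alone, order 2 conditions on one previous character, and so on, matching the numbering above) and samples from it. The function names and the tiny training string are placeholders of mine; a convincing demonstration needs a much larger text.

```python
import random
from collections import defaultdict

def build_model(text, order):
    """Map each (order-1)-character context to the list of characters that followed it."""
    ctx_len = order - 1
    model = defaultdict(list)
    for i in range(len(text) - ctx_len):
        model[text[i:i + ctx_len]].append(text[i + ctx_len])
    return model

def generate(model, order, length, seed=0):
    """Sample `length` characters; duplicates in the follower lists act as frequencies."""
    rng = random.Random(seed)
    ctx_len = order - 1
    out = rng.choice(sorted(model))          # start from a context seen in training
    while len(out) < length:
        context = out[len(out) - ctx_len:]   # empty string when ctx_len == 0
        followers = model.get(context)
        if not followers:                    # context only ever seen at the very end
            out += rng.choice(sorted(model)) # restart from a known context
            continue
        out += rng.choice(followers)
    return out[:length]

# Placeholder training text; it must be longer than the order being modelled.
sample = "the substrate cares about rhythm and the rhythm carries the substrate " * 3
for order in (1, 2, 3, 5):
    print(order, repr(generate(build_model(sample, order), order, 60)))
```

Feeding it a few hundred kilobytes of ordinary prose and raising the order step by step reproduces the progression from lumpy garbage to word-shaped tokens.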

The 1948 paper essentially demonstrated that statistical structure captured at the letter level is most of what makes text look like language. This is the same argument the wiki page was groping toward, sixty-six years before the wiki page. I should have read Shannon more carefully. So should everyone.

Cryptanalysis: anagrams as source code, for real

The discipline that has taken letter-frequency analysis most seriously for the longest is cryptanalysis. Al-Kindi, in 9th-century Baghdad, wrote Risāla fī Istikhrāj al-Muʿammā (“On Extracting Encrypted Letters”), which laid out frequency analysis: count the letters in the ciphertext, match them against the known frequency distribution of the plaintext language, and you have probabilistic guesses for each letter of the substitution cipher’s mapping. From the 9th century forward, this was the dominant attack on substitution ciphers, and every serious encryption system since has been designed with frequency analysis as the first thing to defeat.
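The first pass of the attack described above can be sketched as a rank match: order the ciphertext letters by count and pair them with the plaintext language's letters in frequency order. The ENGLISH_ORDER string is a commonly quoted approximate ordering (published tables differ slightly), and first_guess_mapping is a name I made up for this illustration. On real ciphertext these are only starting guesses that a cryptanalyst then refines by hand.

```python
from collections import Counter

# Approximate English letter order, most to least frequent (sources differ slightly).
ENGLISH_ORDER = "etaoinshrdlcumwfgypbvkjxqz"

def first_guess_mapping(ciphertext):
    """Pair ciphertext letters, most frequent first, with ENGLISH_ORDER.

    Ties in count are broken alphabetically so the result is deterministic.
    """
    counts = Counter(ch for ch in ciphertext.lower() if ch.isalpha())
    ranked = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return dict(zip(ranked, ENGLISH_ORDER))

# Toy ciphertext with an unambiguous ranking: x (4 times), y (3), z (1).
print(first_guess_mapping("xxxxyyyz"))  # {'x': 'e', 'y': 't', 'z': 'a'}
```

The guesses get better with longer ciphertexts, because the observed counts converge on the language's true distribution.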

Modern cryptography talks about confusion and diffusion, terms Shannon coined in 1949. Confusion makes the relationship between key and ciphertext complex. Diffusion spreads the statistical structure of the plaintext across many positions in the ciphertext, so that no local letter-frequency pattern survives. AES is built to destroy the patterns Al-Kindi exploited. The fact that we have to destroy them to get a secure cipher tells you that the patterns are real, that they carry information, and that a sufficiently patient algorithm can read them.

Reading the letter-frequency layer of text is, literally, reading the source code of the message in a way that the writer didn’t put there on purpose. The wiki claim was right about this. It just framed the phenomenon through anagrams when it should have framed it through cryptanalysis.

3. Sub-semantic channels: what the substrate actually carries

Letter frequencies are the boring part of the substrate. The interesting parts are the ones where the sub-semantic structure appears to carry actual meaning-tinted information, not just statistics.

Phonosemantics and sound symbolism

The textbook example is the bouba/kiki effect, first observed by Wolfgang Köhler in 1929 and rediscovered repeatedly since (most famously by Ramachandran and Hubbard in 2001). Show a subject two shapes — one rounded and blobby, one sharp and spiky — and ask which one is called “bouba” and which is called “kiki.” Across languages, across age groups, even across cultures that don’t use roman letters, something like 95% of subjects assign “bouba” to the rounded shape and “kiki” to the sharp one. The shapes have no inherent names. The sounds are not “really” round or sharp. But the cross-modal association is robust to the point of being one of the most replicable findings in experimental psychology.

The mechanism, as best as anyone has been able to nail down, is some combination of articulation gesture (lip rounding for “bouba,” tongue pointing for “kiki”) and high-frequency / low-frequency acoustic content. The point for our purposes is that the sounds themselves carry semantic associations, prior to and independent of any linguistic convention.

Phonesthemes

A phonestheme is a sub-morphemic sound cluster that carries meaning across a family of unrelated words. English has several. The cluster gl- at the start of a word leans visual, often to do with light or sight: glow, gleam, glint, glitter, glance, glare, glimpse, glisten. The cluster sn- at the start of a word leans nasal: sniff, snore, snort, sneeze, snout, snot, snarl. The cluster sl- leans toward smoothness or unpleasantness or both: slip, slide, slick, slime, sludge, slop, slush. The cluster -ump leans toward roundness or impact: bump, lump, hump, dump, jump, slump, clump, stump, thump, rump.

None of these are absolute. Plenty of gl- words have nothing to do with light (gland, glue, glib). The claim is statistical, not categorical: the cluster shifts the probability distribution of meanings the word will carry, and brains pick this up. Margaret Magnus’s PhD thesis (2001, Norwegian University of Science and Technology, Trondheim) catalogued English phonesthemes systematically and argued that they are sub-morphemic semantic carriers that can be exploited deliberately by writers and that are exploited unconsciously by language drift.

The wiki page was, in a hand-wavy way, gesturing at exactly this kind of phenomenon. The technical literature was already considerable when I wrote it. I hadn’t read enough of it.

Rhythm, meter, and persuasion

The substrate also shows up in prosody. Iambic pentameter is not magic, but the fact that English speeches and slogans skew strongly toward regular alternation of stressed and unstressed syllables is not coincidence either. The “rule of three” in rhetoric (life, liberty, and the pursuit of happiness; of the people, by the people, for the people; veni, vidi, vici) is older than English itself and turns up across the Indo-European languages. Try replacing any of those with a two-clause or four-clause version and the difference in memetic stickiness is immediate and brutal. The substrate cares about rhythm. Brains are entrainment machines and stickiness rides on entrainment.

The cumulative point is that language carries information on a sub-semantic channel that includes letter frequencies, phonemic clusters, sound-symbolic associations, and rhythmic structure. The channel is real, it has been characterised in the linguistics literature for nearly a century, and it does load-bearing cognitive work that readers are not consciously aware of. The wiki page wasn’t inventing this. It was rediscovering it through a particular lens (anagrams) that happened to be a sideways view of the underlying phenomenon.

4. How LLMs ended up doing this on purpose

A transformer language model is trained, mechanically, to do exactly one thing: predict the next token, given the previous tokens. The training corpus is some enormous slice of human-generated text. The loss function is cross-entropy on the next-token distribution. There is no semantic objective. There is no fact-checking. There is no “understanding” in any of the senses philosophy of mind has been arguing about for a century. There is only fit the next-token distribution as well as possible given what came before.
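The objective is small enough to write out. For one position, the cross-entropy loss is the negative log of the probability the model assigned to the token that actually came next; training minimises the average of that over the corpus. A minimal sketch with hand-written probabilities standing in for a model's output (the function names and the example distribution are mine, not taken from any particular framework):

```python
import math

def next_token_loss(predicted, actual_next):
    """Cross-entropy in nats for one position: -log p(actual next token)."""
    return -math.log(predicted[actual_next])

def mean_loss(predictions, targets):
    """Average per-position loss over a sequence of (distribution, true token) pairs."""
    return sum(next_token_loss(p, t) for p, t in zip(predictions, targets)) / len(targets)

# Hypothetical model output after the context "the cat sat on the".
predicted = {"mat": 0.5, "dog": 0.25, "moon": 0.25}

print(next_token_loss(predicted, "mat"))   # about 0.693, i.e. ln 2
print(next_token_loss(predicted, "moon"))  # about 1.386, i.e. ln 4
```

Nothing in the loss refers to truth or meaning. It only rewards putting probability mass on whatever token the corpus actually contains next.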

The architecture is descended from a chain of older statistical language models. The lineage goes roughly:

  • Markov chains (1906) — sample the next character or word from a distribution conditioned on the previous one.
  • n-gram models (Shannon 1948, then many) — condition on the previous n-1 tokens.
  • Class-based language models (Brown et al. 1992) — cluster words into classes so the conditioning is on the classes, not the tokens.
  • Neural language models (Bengio et al. 2003) — replace the count-based conditional with a small neural network.
  • Word embeddings (Mikolov et al. 2013, Word2Vec) — represent words as vectors in a space where geometric relations encode semantic relations.
  • Sequence-to-sequence with attention (Bahdanau et al. 2014) — let the model focus on different parts of the input dynamically.
  • Transformers (Vaswani et al. 2017) — drop recurrence; everything is attention.
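A rough count shows why the count-based entries in the list above had to give way to learned ones. An n-gram model conditions on n-1 previous tokens, so the number of distinct contexts it could in principle need a row for is the vocabulary size raised to the power n-1. The vocabulary size of 50,000 below is an assumed round figure for illustration, not a number from any specific model.

```python
def context_count(vocab_size: int, n: int) -> int:
    """Number of distinct (n-1)-token contexts an n-gram table could be asked about."""
    return vocab_size ** (n - 1)

VOCAB = 50_000  # assumed vocabulary size, for illustration only

for n in (2, 3, 5):
    print(n, context_count(VOCAB, n))
# 2 50000
# 3 2500000000
# 5 6250000000000000000
```

No corpus is large enough to populate a table of that size with reliable counts, which is the gap that class-based, neural, and attention-based models each narrowed by sharing statistics across similar contexts.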

Each step in that chain increased the order of statistical conditioning the model could capture. A bigram model conditions on one previous token. A 5-gram model on four. A transformer with a several-thousand-token context conditions on thousands. And — this is the load-bearing observation — when you make the conditioning long enough and rich enough, things that look like reasoning, planning, analogy, and even mild self-awareness fall out of the model. None of them were explicitly trained for. They emerge from optimising the next-token distribution well enough on enough text.

This is exactly the result the wiki page predicted. Not in the way I expected — I imagined some hand-crafted “meme-worthiness scorer” that explicitly modelled phonesthemes and letter frequencies, and what actually happened was a brutally simple architecture that learned all of that and a lot more by gradient descent on next-token prediction over a few terabytes of human-generated text. But the structural claim is the same. The cognitive content of language is largely carried by statistical patterns over tokens. A sufficiently large machine trained on enough of that statistical structure can produce text that exploits the same channels readers exploit unconsciously when producing or evaluating their own.

5. What LLMs validated, and what they didn’t

The wiki claim broke into a strong form and a weak form. The strong form was something like anagrams have inherent meaning. The weak form was the statistical substrate of language carries cognitive weight that brains pick up without conscious awareness.

LLMs have decisively validated the weak form. They produce text that humans rate as compelling, persuasive, emotionally resonant, and sometimes even insightful, using nothing but the statistical structure they extracted from training data. The compellingness is not a semantic add-on. It rides on the same channels. If it were purely semantic, LLMs with no model of truth, agency, or world state could not produce compelling text. They obviously do.

LLMs have also surfaced a related fact that the wiki claim did not anticipate: the substrate is not just letter-frequency or phoneme-cluster. It is the full sequence-conditional distribution. Transformer attention learns long-range dependencies between phrases many sentences apart, between turn-taking patterns in dialogue, between rhetorical-figure setups and payoffs, between argument structures, between voice registers. The “fabric” the wiki page named in passing turned out to be far richer than the letter-frequency diagnostic suggested. The patterns include letter frequencies, yes, but also bigram-and-up structures, syntactic templates, rhetorical moves, narrative arcs, and stylistic signatures. All of it is in the statistics. The wiki claim was right that the substrate existed; it was wrong about how thick the substrate is. It is far thicker than I imagined.

The strong form — anagrams as a Rosetta Stone — LLMs have not validated and probably never will. Anagram structure is one projection of the underlying frequency distribution. It contains some information about the source but not in the way I was reading it. The “emotional synesthesia” I felt when scoring anagrams was almost certainly me feeling the underlying phonesthemic and rhythmic structure of the source phrase, channeled through the anagram-shaped filter I was applying. The filter was incidental. The signal was real.

6. Implications: what to do with a validated substrate

Once you grant that the substrate exists and that machines can now explicitly operate on it, several things change.

Content optimisation as substrate exploitation. Every recommendation algorithm running on social platforms today is performing a version of the “meme-worthiness scorer” the wiki page called for. The mechanism is not that they explicitly score phonesthemes; the mechanism is that they score engagement, which correlates with the deeper substrate properties because engagement is the substrate’s signature in human behaviour. The platforms are strip-mining the substrate without knowing it has a name. The 2014 proposal has been built, badly, by every company whose business depends on attention.

Prompt engineering as deliberate substrate operation. The discipline that has emerged around “prompting” LLMs is, in a real sense, the consciously-engineered version of what skilled writers were doing instinctively before. A prompt that reliably produces a certain register, voice, or argument structure is one that activates the right region of the substrate. People who are good at this often report that they can “feel” when a prompt will work before they run it — the same feeling skilled writers report. The substrate is the same; the target machinery has changed.

Adversarial substrate exploitation. If statistical patterns in text affect human cognition below conscious awareness, then a sufficiently large model that learns those patterns can produce text deliberately tuned to particular cognitive vulnerabilities. This is not hypothetical. The combination of generation (LLMs that can produce convincing copy on demand), targeting (platforms that can deliver custom-tuned copy to specific subpopulations), and iteration (automated A/B testing of message variants) is industrial-scale substrate exploitation. There has been no equivalent in the history of communication. The printing press could mass-produce only one text at a time; this can mass-produce a million variants. Whether that turns out to be a slow disaster or a tolerable nuisance depends on defences that we are nowhere near ready to deploy.

Defensive substrate awareness. The cleanest defence is the same one that works against any cognitive bias: knowing the mechanism exists weakens its effect. People who know that mortality salience biases their political judgments make better political judgments under mortality salience. People who know that a piece of copy was produced by an LLM tuned for engagement read it more skeptically. The substrate cannot be removed — it is the same substrate that makes language work at all — but its exploit cases can be recognised and discounted. Memetic hygiene becomes a thing you can teach.

7. The intuition’s place in the genealogy

A small note on credit. I am not claiming I “predicted” LLMs in any serious sense. The component ideas — that statistical structure matters for cognition, that language has sub-semantic channels, that machines could in principle be built to operate on those channels — were in the literature in scattered form when I wrote the wiki page. I hadn’t read them. What I did was hit on the structural insight from a sideways angle (anagrams), state it loosely, and intuit a research program (the “meme-worthiness scorer”) that nobody, including me, built.

Then a different research program — neural language modelling, scaled absurdly, with no theoretical commitment to substrates or phonesthemes or anything else other than next-token prediction — happened to materialise the same observation as a working artifact. The artifact is more general than what I imagined; it confirms the structural claim while being silent on the diagnostic I was using.

I write this up partly because it’s gratifying when a hunch turns out to have been pointing at something real. But mostly because the structural lesson seems worth holding onto. When you notice a pattern that nobody else is talking about, even if your diagnostic for it is wrong, write it down. The diagnostic can be corrected later. The pattern, if real, will eventually be hit by some unrelated research program from another direction, and your having written the hunch down will save you (and others) the time of re-deriving it.

8. What’s next on the substrate

LLMs settled the question of whether the substrate exists and whether it can be operationalised. The open questions now are quieter and deeper.

What other latent dimensions of language carry cognitive weight that we haven’t named yet? Phonosemantics, rhythm, and frequency are the ones we have. There are surely others. Some are likely to be visible to scaling laws and emergent capability studies in larger models. Some may require new instruments.

How much of human cognition is itself the substrate? The honest implication of the LLM result is that a substantial fraction of what we experience as our own thinking is statistical pattern completion over linguistic input. This is uncomfortable. It is also probably true to first order, and the question is how to live with it without either collapsing into nihilism or pretending it isn’t so. The neuroscience says the same thing from another direction (predictive processing, Bayesian brain, free-energy principle). Two literatures converging on the same uncomfortable conclusion from independent directions usually means the conclusion is right.

Can the substrate be inoculated against? See the defensive substrate-awareness discussion above. The early signs are that yes, partially, with effort. The economics work against the defence; the production side is industrialised, the defence side is artisanal. Closing that gap is one of the more important communication-design problems of the next decade.

What happens when LLMs train on LLM output? This is the distinctively-new problem. Once a meaningful fraction of training text is itself generated, the model is learning the substrate’s image of itself rather than the substrate as humans produced it. The fixed points of that recursion are not the same as the fixed points of the original substrate. We are running this experiment globally without much instrumentation. I have no clean prediction for how it ends.
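A toy version of the recursion makes the fixed-point worry concrete. The sketch below is not a language model: it repeatedly draws a finite sample from a token distribution and refits the distribution to that sample, which is the simplest possible stand-in for "training on your own output". The function name, the 50-token vocabulary, and the sample size are all assumptions chosen for illustration.

```python
import random
from collections import Counter

def refit_on_own_samples(probs, sample_size, generations, seed=0):
    """Repeatedly sample from `probs` and replace it with the sample's frequencies.

    Returns the list of distributions, starting with the original.
    """
    rng = random.Random(seed)
    history = [dict(probs)]
    for _ in range(generations):
        tokens = list(probs)
        weights = [probs[t] for t in tokens]
        sample = rng.choices(tokens, weights=weights, k=sample_size)
        counts = Counter(sample)
        # A token that draws zero samples vanishes here and can never come back.
        probs = {t: n / sample_size for t, n in counts.items()}
        history.append(probs)
    return history

# Zipf-like starting distribution over 50 made-up tokens.
raw = {f"tok{r}": 1 / r for r in range(1, 51)}
total = sum(raw.values())
start = {t: w / total for t, w in raw.items()}

history = refit_on_own_samples(start, sample_size=100, generations=20)
print([len(p) for p in history])  # number of surviving tokens per generation
```

In this toy the surviving vocabulary can only shrink, and the rare tokens go first. Whether real models trained on mixed human and generated text behave the same way is exactly the open question; the sketch shows only why the recursion's fixed points need not match the original distribution.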


I’ll close where I started. The wiki page in 2014 (or thereabouts) was written in rough prose, with several typos and at least one bad diagnostic. The structural claim was correct: language has a substrate of statistical pattern that does cognitive work, the substrate is exploitable, and a sufficiently large machine could be built to operate on it. That machine got built. It works. It is already changing the politics of attention more than the printing press did.

Anagrams turned out not to be the right window into the substrate. But the substrate was there, and somebody (though not me) has now built the equivalent of an x-ray for it. The next few years will be about deciding what to do with the x-ray. I think the answer involves more substrate awareness, not less. The frame the memetics essay set up still applies: knowing the mechanism is half of being able to defend against it.

The other half is harder.


Further reading

  • Zipf, G. K. Selected Studies of the Principle of Relative Frequency in Language (Harvard, 1932).
  • Shannon, C. E. “A Mathematical Theory of Communication.” Bell System Technical Journal 27 (1948).
  • Shannon, C. E. “Communication Theory of Secrecy Systems.” Bell System Technical Journal 28 (1949). The confusion/diffusion framing.
  • Al-Kindi. Risāla fī Istikhrāj al-Muʿammā, 9th century. The first treatise on frequency analysis.
  • Köhler, W. Gestalt Psychology (Liveright, 1929). The original bouba/kiki observation, made with different nonsense words (“takete” and “baluba”).
  • Ramachandran, V. S.; Hubbard, E. M. “Synaesthesia — A Window into Perception, Thought and Language.” Journal of Consciousness Studies 8 (2001). The modern bouba/kiki paper.
  • Magnus, M. Gods of the Word: Archetypes in the Consonants (Thomas Jefferson University Press, 1999). The phonosemantic thesis. A more detailed treatment lives in her PhD work at the University of Trondheim.
  • Hinton, L.; Nichols, J.; Ohala, J. (eds.) Sound Symbolism (Cambridge, 1994).
  • Brown, P. F. et al. “Class-Based n-gram Models of Natural Language.” Computational Linguistics 18 (1992).
  • Bengio, Y. et al. “A Neural Probabilistic Language Model.” JMLR 3 (2003).
  • Mikolov, T. et al. “Efficient Estimation of Word Representations in Vector Space.” 2013. (Word2Vec.)
  • Vaswani, A. et al. “Attention Is All You Need.” NeurIPS 2017. The transformer paper.
  • Bender, E. M.; Koller, A. “Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data.” ACL 2020. The skeptical case that LLMs don’t really understand. Worth reading as the counterpoint to this essay’s friendlier view.
  • Friston, K. The free-energy principle / predictive processing literature, for the convergent neuroscience claim about pattern-completion brains.

Comments, refutations, and worked counterexamples especially welcome.