BNN on the GA144: Forth, Hamming Distance, and Picojoules
Fifth in a series. The earlier four (water anomalies, if water were the computer, the third replicator, anagrams as source code) were about substrates — physical, quantum, cultural, statistical. This one is about a piece of silicon that nobody has used in anger and the operation it was accidentally optimised for: Hamming distance. Which is also the only operation a binary neural network actually needs.
Around 2010, GreenArrays released the GA144: a single chip with 144 independent 18-bit dual-stack Forth cores arranged in an 8×18 mesh. Each core has 64 words of RAM and 64 words of ROM and communicates with its nearest neighbours through blocking register-to-register transfers. The chip is asynchronous, with no global clock. It retails for about $20, consumes microwatts at idle, and spends roughly 7 picojoules per instruction when active. It was designed by Chuck Moore, the inventor of Forth, and it is the most power-efficient general-purpose computer ever shipped commercially. It has also, in the decade and a half since it appeared, been used for almost nothing.
This essay is the argument for what it should be used for. The short version: binary neural network inference. The longer version is that this isn’t a coincidence — the GA144’s architectural choices, the constraints that have kept it out of mainstream use, and the mathematical structure of BNN inference fit together in a way that looks designed even though it isn’t. The bridge between them is one specific operation, Hamming distance, that I have been computing on various pieces of silicon since 1996 for completely unrelated reasons. The pieces start to look like the same problem only in retrospect.
1. The GA144, briefly
Chuck Moore is the inventor of the Forth programming language. He has been designing minimalist stack-based processors to run Forth natively since at least 1985 (the Novix NC4016), through the RTX2000 (1988), the ShBoom (1990s), and a sequence of MISC and MuP21-line chips at iTV Corporation and later GreenArrays. The GA144 is the seventh or eighth chip in this lineage. The design principle is consistent across all of them and is not the design principle of mainstream processor architecture: do less per cycle, fit more cores on the die, let them communicate directly, run them asynchronously, and reach the power-efficiency frontier by not having any of the global infrastructure that most processors burn watts maintaining.
Each F18A core inside a GA144 looks, by mainstream standards, like a toy:
- 18-bit word size (not 16, not 32, not 64 — 18, chosen because it cleanly subdivides into three 5-bit instruction slots plus a 3-bit slot, fitting four instructions per word).
- 64 words of RAM, 64 words of ROM. That is 1,152 bits of writable state per core. There is no cache, no MMU, no DRAM. A program has to be tiny.
- Dual-stack architecture — a data stack and a return stack — with the top three of each stack as named registers and the rest in a small circular buffer. Forth conventions throughout.
- Minimal instruction set — about 32 instructions, all single-cycle, including XOR, NOT, AND, ADD, shifts, conditionals, returns, and communication primitives for the four neighbour ports.
- No clock. The core advances when its inputs are ready. When it has nothing to do, it stops drawing power. When a neighbour sends data, it wakes.
The 144 cores are wired in an 8×18 grid. A core can read from and write to its north, south, east, and west neighbours through register-mapped ports. There is no broadcast bus, no shared memory, no cache coherence. If two cores on opposite sides of the chip need to talk, the message has to be relayed by the cores in between.
The GA144 ships at around $20 per chip, draws single-digit milliwatts per active core and microwatts at idle, and peaks at roughly 700 MIPS per core. By any metric except power-per-instruction it is uncompetitive with a modern Cortex-M0, let alone an x86 core. By the power-per-instruction metric it is unmatched.
Why has nobody used it? Three reasons:
- The 64-word RAM limit per core. You cannot run a normal program. You have to break the problem into pieces that fit. This is the objection that keeps mainstream developers off the chip.
- It runs Forth. Stack-based, postfix, no types, you-edit-with-colorForth-or-arrayForth. The Forth tradition is small, devout, and epistemically incompatible with the C tradition. Most engineers have never written Forth and don’t want to start.
- The benchmarks aren’t there. The chip would crush at the things it was designed for, except that “the things it was designed for” is not a category that mainstream processor reviews evaluate. It loses on SPECint by orders of magnitude, because SPECint is not what the GA144 is for.
The thesis of this essay is that BNN inference at the edge is what the GA144 is for, and that the constraints which have kept it unfashionable turn out to be the exact properties that make BNN inference work on it.
2. BNNs, briefly
A standard neural network computes, at every neuron, a sum of products: take each input, multiply by a learned weight, add up the products, pass through a nonlinearity. With 32-bit floating-point weights and activations, that multiply-add is roughly 50 picojoules on a modern CPU — a number dominated by the floating-point unit and the memory bandwidth required to move the weights in and out.
A Binary Neural Network constrains both the weights and the activations to two values, +1 and −1. The training story (which I’ll mostly skip here; the canonical paper is Courbariaux et al. 2016) is that you train with standard floating-point backprop, keeping real-valued latent weights and passing gradients through sign() with a straight-through estimator, and then keep only the binarised weights and sign() activations at inference time. The resulting model loses only a few percentage points of accuracy on most problems that a small MLP can do at all. Larq, the open-source BNN training framework, ships with reference implementations.
The interesting part is what happens at inference time when everything is one bit. Encode +1 as 1 and −1 as 0. Then the multiplication of a weight bit w by an activation bit a produces:
| w | a | w · a | encoded |
|---|---|---|---|
| +1 (1) | +1 (1) | +1 | 1 |
| +1 (1) | −1 (0) | −1 | 0 |
| −1 (0) | +1 (1) | −1 | 0 |
| −1 (0) | −1 (0) | +1 | 1 |
That’s the XNOR truth table. Multiplication has become XNOR.
Summing the products is then just counting the 1-bits in the XNOR result and adjusting for sign. If N is the vector length and p = popcount(XNOR(W, A)), then:
dot(W, A) = 2 * p - N
Equivalently, since XNOR = NOT(XOR):
dot(W, A) = N - 2 * popcount(XOR(W, A))
= N - 2 * hamming_distance(W, A)
The dot product of two binary vectors is a closed-form function of their Hamming distance. Neural network inference, the multiply-accumulate that the deep learning revolution scaled up by fifteen orders of magnitude, reduces to Hamming distance once you binarise.
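The identity above is easy to check numerically. The following is a minimal Python sketch, not code from the bnn1 repository; the helper names `pack`, `popcount`, `binary_dot`, and `check_identity` are made up for illustration, and the encoding (+1 as bit 1, −1 as bit 0) is the one from the truth table.

```python
import random

def pack(signs):
    """Pack a list of +1/-1 values into an int: +1 becomes bit 1, -1 becomes bit 0."""
    word = 0
    for i, s in enumerate(signs):
        if s == 1:
            word |= 1 << i
    return word

def popcount(x):
    """Count set bits in a non-negative int (portable across Python 3 versions)."""
    return bin(x).count("1")

def binary_dot(w_bits, a_bits, n):
    """Dot product of two packed +/-1 vectors of length n, via Hamming distance."""
    hamming = popcount(w_bits ^ a_bits)
    return n - 2 * hamming

def check_identity(n=64, trials=1000, seed=0):
    """Compare the XOR/popcount form against the plain sum of products."""
    rng = random.Random(seed)
    for _ in range(trials):
        w = [rng.choice((-1, 1)) for _ in range(n)]
        a = [rng.choice((-1, 1)) for _ in range(n)]
        reference = sum(wi * ai for wi, ai in zip(w, a))
        if binary_dot(pack(w), pack(a), n) != reference:
            return False
    return True

if __name__ == "__main__":
    print(check_identity())
```

With n = 64 the packed operands fit in one machine word, which is the case where the whole dot product collapses to an XOR, a popcount, and a little arithmetic.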
On a modern CPU, this means you can compute 64 multiplications and their sum with one XOR instruction, one popcount instruction, and a subtract. The whole tightly-coded inner loop is three or four instructions. The energy savings over a floating-point MAC are between one and three orders of magnitude. The memory savings — because each weight is one bit instead of 32 — are exactly 32×.
The drawbacks are also real. You lose some accuracy. You cannot use a BNN where a floating-point ResNet is required. The technique works for small-to-medium classification, keyword spotting, gesture recognition, simple object detection. Not Stable Diffusion. Not GPT-4. But for the long tail of edge AI applications — the hearing aid, the doorbell camera, the always-on sensor — the trade is a clear win.
3. Why the match is so clean
Now the alignment. Take each of the GA144’s “limitations” and ask what BNN inference does with it.
3.1 Memory
The GA144 has 64 words of RAM per core. That is the single most quoted reason the chip is unusable: you cannot fit a normal program in 1,152 bits.
In binary: 1,152 bits is 1,152 weights. A small dense layer (say, 32 inputs × 32 outputs) needs exactly 1,024 weight bits. One core fits one small layer with 128 bits (about seven words) to spare for code and state. Across the chip, assuming all of a core’s RAM holds weights:
| Encoding | Bits per weight | Weights per core | Weights per chip |
|---|---|---|---|
| FP32 | 32 | 36 | 5,184 |
| INT8 | 8 | 144 | 20,736 |
| Binary | 1 | 1,152 | 165,888 |
165,888 weight parameters is enough for a meaningful classifier. A keyword-spotter “hey siri” style network is in the 100K-parameter range. Many gesture-recognition models fit. A small image classifier on 32×32 inputs fits.
The constraint that kept the GA144 out of general-purpose computing has the opposite sign for BNN inference. The chip has exactly the amount of memory you need.
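The capacity table can be reproduced from three constants. This is a small sketch under the same simplifying assumption as the table, namely that every word of a core's RAM holds weights and none is reserved for code; the function name `capacity` is illustrative.

```python
WORD_BITS = 18           # F18A word size
RAM_WORDS_PER_CORE = 64  # writable words per core
CORES = 144              # cores on one GA144

def capacity(bits_per_weight):
    """Return (weights per core, weights per chip) for a given weight encoding.

    Assumes all RAM holds weights, which overstates what a real program could use.
    """
    bits_per_core = WORD_BITS * RAM_WORDS_PER_CORE
    per_core = bits_per_core // bits_per_weight
    return per_core, per_core * CORES

if __name__ == "__main__":
    for name, bits in (("FP32", 32), ("INT8", 8), ("Binary", 1)):
        per_core, per_chip = capacity(bits)
        print(f"{name:6s} {bits:2d} bits/weight  {per_core:5d} per core  {per_chip:7d} per chip")
```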
3.2 The 18-bit word
The F18A’s 18-bit word holds 18 binary products per XNOR. That is far narrower than AVX-512’s 512-bit registers, and yes, this is the GA144’s weakest dimension: bit-for-bit it loses to wider SIMD. But the GA144 is not competing on aggregate throughput. It is competing on ops per joule, and 144 cores each doing 18 bits at 700 MHz adds up. The raw XNOR rate, before popcount and accumulation, is around 2 trillion bit-products per second across the chip, which I put roughly an order of magnitude below a high-end Xeon’s peak but at less than 1% of the power.
The 18-bit width also has a useful side effect for popcount, which the F18A does not have as a native instruction. A 9-bit lookup table has 512 entries, eight cores’ worth of 64-word memory, so it fits inside the ROM budget only if you partition it across cores. With that in place, popcount of an 18-bit word becomes two table lookups plus an add: three steps once the word is split into its 9-bit halves.
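The split-and-look-up idea can be sketched in a few lines of Python. This shows only the arithmetic; it says nothing about how the 512 entries would actually be laid out across F18A cores, and the names `POPCOUNT9` and `popcount18` are hypothetical.

```python
# 512-entry table: the popcount of every 9-bit value.
POPCOUNT9 = [bin(i).count("1") for i in range(512)]

def popcount18(word):
    """Popcount of an 18-bit word as two 9-bit table lookups plus one add."""
    word &= 0x3FFFF            # keep only the low 18 bits
    low = word & 0x1FF         # bits 0..8
    high = word >> 9           # bits 9..17
    return POPCOUNT9[low] + POPCOUNT9[high]

if __name__ == "__main__":
    print(popcount18(0b101010101010101010))
```

The split into halves costs a mask and a shift on top of the two lookups and the add, which is worth keeping in mind when counting cycles.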
3.3 The instruction set
The F18A’s minimal instruction set already contains what BNN inference needs.
- XOR — single-cycle native. Combined with NOT (also single cycle) you have XNOR in two cycles.
- Popcount — not native, but the lookup-table approach above runs in three cycles for an 18-bit chunk.
- Sign activation — just a comparison against a threshold, which the F18A handles with conditional skip.
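Putting the three ingredients together, one neuron can be emulated on a host machine as follows. This is a Python sketch of the arithmetic only, assuming 18-bit words, zero-padding of the last word, and a "fire when the dot product reaches the threshold" rule; the names `to_words` and `neuron` are illustrative and not taken from any existing codebase.

```python
WORD_BITS = 18
MASK = (1 << WORD_BITS) - 1

def to_words(bits):
    """Split a list of 0/1 bits into 18-bit ints, zero-padding the last word."""
    words = []
    for start in range(0, len(bits), WORD_BITS):
        word = 0
        for i, b in enumerate(bits[start:start + WORD_BITS]):
            word |= b << i
        words.append(word)
    return words

def neuron(weight_words, activation_words, n, threshold=0):
    """Return (dot, fired) for one binary neuron over n inputs.

    Each step mirrors the inner loop: XOR, invert to XNOR, popcount, accumulate.
    """
    matches = 0
    for w, a in zip(weight_words, activation_words):
        xnor = ~(w ^ a) & MASK
        matches += bin(xnor).count("1")
    # Padding bits are 0 in both operands, so XNOR counts them as matches.
    matches -= len(weight_words) * WORD_BITS - n
    dot = 2 * matches - n
    return dot, 1 if dot >= threshold else 0

if __name__ == "__main__":
    w = [1] * 32
    a = [1] * 16 + [0] * 16
    print(neuron(to_words(w), to_words(a), 32))
```

The padding correction matters whenever the input width is not a multiple of 18, which is the common case.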
The inner loop for one neuron is roughly:
@a ( fetch activation word from north port )
@w ( fetch weight word from local memory )
xor ( XOR them; the F18A spells exclusive-or as "or" )
-not ( invert, spelled "-" on the F18A: XNOR is now on top of the stack )
popcnt ( popcount via table lookup; a called word, since "pop" is the F18A's return-stack opcode )
+ ( accumulate into running sum )
That’s six F18A instructions per 18 binary products, counting the popcount lookup as a single step. At ~700 MHz single-cycle execution per core, you get something like 100 million 18-bit XNOR-popcount-accumulate operations per core per second, or roughly 2 billion bit-level operations per core per second. Times 144 cores, that’s about 0.3 trillion binary ops per second on the whole chip. At an estimated 0.15 W power draw, that’s approximately 2 trillion binary ops per joule.
The comparison chart that this lives in:
| Platform | Binary ops/sec | Power | Ops/joule |
|---|---|---|---|
| Intel Xeon (AVX-512) | ~200 G | ~150 W | ~1.3 G |
| ARM Cortex-A72 (NEON) | ~30 G | ~3 W | ~10 G |
| ARM Cortex-M4 | ~500 M | ~50 mW | ~10 G |
| GA144 (projected) | ~0.3 T | ~150 mW | ~2 T |
The GA144 is roughly three orders of magnitude better than a Xeon on ops-per-joule for this specific workload, and about two hundred times more energy-efficient than a Cortex-M4. On these estimates the Xeon delivers comparable total throughput, but at about a thousand times the power. That ratio is exactly the trade-off the edge inference market is shaped around.
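The projection is a chain of multiplications, so it is worth recomputing from its inputs. The sketch below uses the essay's own estimates as defaults (about 700 MHz per core, six instructions per 18-bit word, 144 cores, an estimated 0.15 W); these are projections, not measurements, and the function name `projected_throughput` is made up for illustration.

```python
def projected_throughput(clock_hz=700e6, instr_per_word=6, word_bits=18,
                         cores=144, power_w=0.15):
    """Back-of-envelope throughput and energy figures from the stated assumptions."""
    loops_per_core = clock_hz / instr_per_word       # inner-loop passes per second
    bit_ops_per_core = loops_per_core * word_bits    # binary products per second
    chip_bit_ops = bit_ops_per_core * cores          # whole-chip rate
    return {
        "loops_per_core_per_s": loops_per_core,
        "bit_ops_per_core_per_s": bit_ops_per_core,
        "chip_bit_ops_per_s": chip_bit_ops,
        "bit_ops_per_joule": chip_bit_ops / power_w,
    }

if __name__ == "__main__":
    for key, value in projected_throughput().items():
        print(f"{key:24s} {value:.3e}")
```

Changing any single assumption, for instance a slower popcount or a higher power draw, scales the final figure linearly, which makes it easy to see how sensitive the comparison is.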
3.4 The mesh
An 8×18 mesh of cores with neighbour-only communication looks, on first contact, like a bug. You cannot do arbitrary point-to-point routing without intermediate cores. You cannot broadcast. You cannot gather from arbitrary locations.
A feedforward neural network, structurally, is a sequence of neighbour-only transformations. The activations of layer N feed only into layer N+1; there are no skip connections in a vanilla MLP, and even in architectures with residuals the connections are local. The natural mapping is:
- Each row of cores = one layer, with cores in the row processing different output neurons of that layer.
- Activations flow east-to-west or north-to-south between layers, using the inter-core ports.
- Weights are static, loaded once at boot from external flash into each core’s RAM.
- Pipelining is automatic. As soon as a core finishes computing its output activation, it streams the result to its neighbour and starts on the next input. The asynchronous design means there is no global pipeline stall; cores wake and sleep individually based on data availability.
The chip is, essentially, a piece of hardware whose dataflow graph matches a layered feedforward network’s dataflow graph. Mainstream processors have to emulate this with shared memory, cache hierarchy, and software pipelining. The GA144 just is this dataflow.
3.5 Asynchronous wake-on-data
This is the property that nothing else matches. A core that has no input to process is drawing essentially zero current. A clocked processor at the same idle state is still burning the clock distribution tree, the cache refresh, the PLLs, the voltage regulators. The GA144’s asynchronous design means a sparsely driven network, one where many inputs are absent or unchanged most of the time, costs essentially no energy on the cores that have nothing to do. (A strict ±1 BNN has no zero activations, so the sparsity here is in time, not in values.) Sparsity becomes a hardware property, not an algorithmic optimisation.
4. Hamming distance: the throughline
I started thinking seriously about this because I have been computing Hamming distances on various pieces of silicon for thirty years for completely unrelated reasons, and I noticed the same operation kept showing up.
In 1996 I was working on what became ECIP — Error Correcting Internet Protocol — an early implementation of forward error correction for real-time UDP streaming. The motivation was that Internet packet loss is an erasure channel: packets either arrive intact or they don’t arrive at all. There’s no need to detect corruption inside a packet (UDP checksums do that). What you need is to recover entire missing packets from redundancy in the surrounding ones, and the math for that uses block codes whose optimality is determined by the minimum Hamming distance between codewords.
To find good codes I wrote a brute-force search program, ecc4.c, that for a given block size enumerates candidate codewords, checks the Hamming distance from each candidate to every codeword already accepted, and keeps the candidates that maintain a minimum distance above some threshold. The inner loop is identical to BNN inference: an XOR between two bit-vectors, a popcount of the result, a compare against a threshold.
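The search described above can be sketched as a greedy enumeration. This is not ecc4.c, which is not reproduced here; it is a minimal Python illustration of the same accept-if-far-enough loop, with hypothetical names `hamming_distance` and `greedy_code`, and with "far enough" taken to mean distance at least `min_distance`.

```python
def hamming_distance(x, y):
    """Number of bit positions where x and y differ: popcount of the XOR."""
    return bin(x ^ y).count("1")

def greedy_code(n_bits, min_distance):
    """Scan all n-bit words in order, keeping each one that stays at least
    min_distance away from every codeword accepted so far."""
    accepted = []
    for candidate in range(1 << n_bits):
        if all(hamming_distance(candidate, c) >= min_distance for c in accepted):
            accepted.append(candidate)
    return accepted

if __name__ == "__main__":
    code = greedy_code(7, 3)
    print(len(code), [format(c, "07b") for c in code[:5]])
```

For 7-bit words at minimum distance 3 the greedy scan accepts 16 codewords, which matches the size of the classic Hamming (7,4) code. The inner test is the same XOR, popcount, compare sequence as the BNN neuron.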
Around 1998 the program got rewritten as a portable CPU benchmark and distributed to contributors, who ran it on every x86 generation from the 386 onward. I called it Budmark (long story). The data collected between 1996 and 2002 was, in retrospect, a longitudinal study of how processor architectures handled the XOR-popcount primitive across nearly fifteen years of x86 evolution.
The most interesting result was the Pentium 4. The P4 was running at 2.26 GHz, almost four times the clock speed of the Pentium III Xeon at 600 MHz. On any benchmark Intel marketed at the time, the P4 crushed. On Budmark, the P4 was only 66.7% as efficient per cycle as the P3, meaning that despite its enormous clock advantage it finished a Budmark run only about 2.5× faster than the older chip. I didn’t know at the time, but Budmark was measuring exactly the operations that the P4’s NetBurst architecture had deprioritised: pipeline-unfriendly tight loops on small bit operations. The benchmark predicted Intel’s architectural dead end a couple of years before Intel acknowledged it and went back to the Pentium-Pro lineage that became Core.
The connection to BNN didn’t exist then. It exists now. Every processor that did well on Budmark would do well on BNN inference, and every processor that did badly would do badly, because the fundamental operation is the same. Hamming distance is the unifying primitive across:
- Error-correcting codes — minimum Hamming distance between codewords determines correction capability.
- Binary neural networks — dot product equals length minus twice Hamming distance.
- Locality-sensitive hashing — near-duplicate detection rides on Hamming distance over fingerprint bits.
- Cryptanalysis — many cipher attacks reduce to Hamming-weight analyses of the key schedule.
- Binary feature matching — ORB, BRIEF, FREAK descriptors all compare via Hamming distance.
Once you see the primitive everywhere, designing for it stops feeling specialised and starts feeling general. The GA144 is a chip designed for Hamming distance computation, accidentally and brilliantly. It was designed for Forth and for power efficiency. It happened to land in the same architectural neighbourhood as the right shape for BNN inference because the underlying problem — extracting decisions from sparse bit-level patterns at low energy cost — is structurally the same problem across all these domains.
5. The Palmo echo
Before BNNs had the name, I was building one. In 1994 I started a project I called Palmo — “palm” + “mo” for “motion” — to do real-time hand-gesture recognition from video on 486-class machines. Memory was tight (target was sub-megabyte footprint) and CPU was tight (a 486 at 66 MHz delivered maybe 25 MIPS). The only way to make a neural network fit in those constraints was to binarise the weights and activations and exploit the resulting bit-parallelism.
Palmo’s architecture was distributed across maybe 100,000 FreeBSD machines volunteer-connected over the Internet, each running small networks and communicating via compressed binary activation packets. Three things made it work:
- Compressed bit packets for inter-node communication: 1,000 binary activations fit in 125 bytes instead of 4,000 bytes of float, a 32× bandwidth reduction.
- Lazy neuron evaluation in a linked list: neurons updated only when their inputs changed, using timestamp-based decay instead of continuous evaluation.
- Threshold-based firing as the only nonlinearity, computable in one compare.
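The first of these is simple to illustrate. The sketch below packs binary activations eight to a byte; it is not Palmo's actual wire format, which is not documented here, and the function names and the least-significant-bit-first layout are assumptions made for the example.

```python
def pack_activations(bits):
    """Pack a list of 0/1 activations into bytes, 8 per byte, LSB first."""
    out = bytearray((len(bits) + 7) // 8)
    for i, b in enumerate(bits):
        if b:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)

def unpack_activations(payload, count):
    """Recover the first `count` activations from a packed payload."""
    return [(payload[i // 8] >> (i % 8)) & 1 for i in range(count)]

if __name__ == "__main__":
    activations = [1 if i % 3 == 0 else 0 for i in range(1000)]
    packet = pack_activations(activations)
    float_bytes = 4 * len(activations)   # same activations as 32-bit floats
    print(len(packet), float_bytes, float_bytes // len(packet))
```

The receiver needs to know the activation count separately, since the last byte may be only partly used.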
These are exactly the techniques that resurfaced 20 years later as BinaryConnect (2015) and XNOR-Net (2016). I shelved Palmo in 1996 because the GPU revolution made floating-point cheap, the deep learning community standardised on float, and there was no commercial interest in binary approaches. The 2015-onward BNN renaissance felt, when I lived through it, like watching someone re-derive your abandoned PhD thesis from first principles.
The reason this matters here: the GA144 is “Palmo on a chip.” Take the distributed-100k-FreeBSD-machines architecture, miniaturise it by six orders of magnitude, swap TCP/IP for register-to-register neighbour transfers, swap millisecond network latency for nanosecond wire latency, and you have the GA144. The architectural principles are identical. The Palmo project ran the experiment at the distributed-system scale and the GA144 packages the result at the silicon scale. Nobody noticed because the two communities — Forth hardware people and binary neural network people — don’t talk to each other.
6. Why this hasn’t happened yet
The honest accounting: the GA144 has been available since around 2010 or 2011. BinaryConnect and XNOR-Net are eleven and ten years old respectively. The energy advantage I’m describing has been on the table the whole time. Why has nobody shipped a GA144-based BNN inference engine?
A few reasons:
The Forth barrier. Programming a GA144 requires colorForth or arrayForth, both of which look alien to anyone trained on C/Python. The toolchain is small, the documentation is sparse, and the community is a few hundred people worldwide. Getting a working F18A-port of a BNN inference loop is a week of work for someone fluent in Forth and a month for someone who isn’t.
The mainstream-benchmark gap. Processor reviews don’t measure edge-AI inference energy. They measure SPECint, Geekbench, AI-bench on giant networks. The GA144 loses on every one of those and wins on none of them. There has never been a high-profile public benchmark that would have surfaced its actual strengths.
The deployment story. Mainstream MCU programmers reach for an STM32 or an ESP32 or a Cortex-M0 because the toolchain is familiar, the ecosystem is huge, and the chip is available everywhere. The GA144 is single-sourced from GreenArrays, who make it in small batches. There is no Arduino-style ecosystem. There is no TensorFlow Lite backend.
The accuracy ceiling. BNNs do well on small classification but they cannot do everything a floating-point network can. If your edge application needs the last 5% of accuracy that the FP32 model gets and the BNN doesn’t, you’re going to ship a Cortex-M4 with INT8 quantised weights and you’re going to take the power hit. The market for chips that are only good at the BNN workload has been small.
All four reasons are softening. Forth toolchains are getting better (there are now Python-based F18A simulators that let you prototype without the colorForth learning cliff). Edge AI is exploding as a deployment category and per-inference energy is starting to matter more than absolute accuracy. The BNN accuracy story is improving as training techniques mature. And the energy budget of LLM-driven applications has gotten so large that even mainstream attention is turning to the bottom of the inference market.
The GA144 is, in 2026, approximately ten years too early or three years too late depending on how you count. The case for porting a keyword-spotter or a gesture-recognition model to it is now defensible in a way it wasn’t even five years ago.
7. What edge AI is actually measured in
The mainstream AI conversation is measured in TFLOPS, in tokens-per-second, in dollars per million tokens. None of those metrics apply to the edge. At the edge the relevant units are:
- Inferences per millijoule — battery-powered devices live or die by this.
- Latency from sensor wake to decision — for always-on listeners, motion detectors, fall sensors.
- Memory footprint that fits in on-chip RAM — because going off-chip to DRAM dominates the energy budget.
- Cost per chip in volume — embedded sensor markets are extremely price-sensitive.
On all four metrics the GA144 looks competitive or better. On inferences per millijoule for a binary classifier of reasonable size, the projected numbers put it at roughly one to three orders of magnitude ahead of conventional MCUs. On latency, the asynchronous design means a wake-on-data inference can complete in microseconds. On memory footprint, by definition, anything that fits at all fits entirely on chip. On cost, $20 is not great for an MCU role but is not catastrophic for a specialised edge inference accelerator.
The chip is not for everything. It is for the specific case where the network is small enough to fit, the input cadence is intermittent enough that asynchronous wake-on-data wins, and the energy budget is tight enough that the picojoule-per-instruction matters. That case exists. Hearing aids. Implantable sensors. Long-battery industrial sensors. Solar-powered wildlife cameras with on-board species classification. The list is shorter than the LLM list but it’s not empty, and it’s getting longer.
8. What a port would actually look like
Sketching the minimum viable demonstration:
Pick a small BNN: the canonical 784→256→128→10 MNIST MLP I already have running in C and AVX2 (see bnn1 in the parent repo). The chip holds 165,888 binary weights; the MNIST MLP needs ~235K binary weights total, so it would need either compression or a slightly smaller architecture (say 784→128→64→10, which has ~109K weights and fits comfortably).
Allocate cores by layer. With 144 cores and three layers, the natural mapping is: top row(s) handle layer 1, middle row(s) handle layer 2, bottom row(s) handle layer 3. Each row pipelines its outputs to the next.
Implement the inner loop in F18A assembly. The XNOR-popcount-accumulate inner loop is six instructions per 18 bits. A 32-wide input takes two iterations; the 784-wide MNIST input takes 44.
Build a host-side driver that streams MNIST images into the north edge of the mesh and reads classifications from the south edge. The host can be an ordinary Cortex-M running on the same board.
Measure. Inferences per second, energy per inference, latency from input to output. Compare against the existing AVX2 reference implementation and against a Cortex-M4 INT8 quantised implementation as the relevant baselines.
A working demo of this would take, by my estimate, a focused month for someone fluent in F18A Forth and roughly three months for someone coming from C. The result would be the world’s first published GA144 BNN inference engine and would, if my numbers are roughly right, establish a power-efficiency record for the workload that would be hard to beat without designing custom silicon.
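The weight counts and the core allocation in the sketch above can be checked with a few lines of arithmetic. This is a rough Python sketch: it ignores biases, routing cores, and I/O, and the `code_words` parameter (RAM words reserved per core for the program) is a hypothetical knob, not a measured figure.

```python
import math

WORD_BITS = 18
RAM_WORDS = 64
CORES = 144

def layer_weights(sizes):
    """Binary weights per dense layer for an MLP with the given layer widths."""
    return [a * b for a, b in zip(sizes, sizes[1:])]

def cores_needed(sizes, code_words=0):
    """Cores per layer if each core keeps `code_words` RAM words for code
    and fills the rest with weight bits."""
    bits_per_core = (RAM_WORDS - code_words) * WORD_BITS
    return [math.ceil(w / bits_per_core) for w in layer_weights(sizes)]

if __name__ == "__main__":
    for sizes in ([784, 256, 128, 10], [784, 128, 64, 10]):
        total = sum(layer_weights(sizes))
        for reserve in (0, 16):
            cores = cores_needed(sizes, code_words=reserve)
            verdict = "fits" if sum(cores) <= CORES else "does not fit"
            print(sizes, total, f"reserve={reserve}", cores, sum(cores), verdict)
```

Under these assumptions the smaller network still fits when a quarter of each core's RAM is set aside for code, although the first layer then takes most of the chip.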
9. The deeper claim
If you’ve read the other essays in this series, the structure of this argument is familiar. There is an information-processing substrate (Forth-style minimal-stack cores in a 2D mesh), there is an information-processing problem (BNN inference at the edge), and the substrate and the problem fit together in a way that nobody designed for explicitly but that falls out cleanly once you see the underlying operation.
That underlying operation, in this case, is Hamming distance: popcount of an XOR. It is the same operation I was running on Pentium II Xeons in 1998 to find error-correcting codes, the same operation modern locality-sensitive hashing rides on, the same operation BinaryConnect-era researchers re-derived from neural network quantisation. Substrates change. Problems change. The operation doesn’t.
Chuck Moore designed the GA144 because he was committed to an architectural philosophy nobody else followed — minimalist, async, mesh-connected, Forth-native — and trusted that the niche would exist somewhere even if he couldn’t predict where. The niche turned out to be BNN inference, which didn’t exist as a research area when the chip taped out and which only became commercially interesting a decade later. The match between them is the kind of thing you’d expect to see when both the substrate designer and the problem designer were optimising against the same hidden objective without realising it. The objective, in retrospect, was maximum useful computation per picojoule on bit-level operations. They both went for it. They both got there. They met.
I would like to be the person who finally connects them. The work is real and tractable and would benefit from someone who has both the Forth heritage and the BNN heritage in the same head, which I do, and which is rare. The chip is still available; the BNN technique is mature; the inner loop fits in six instructions per 18 binary products. It is, structurally, a two-month project away from demonstrating something nobody else has demonstrated.
I am writing this partly to commit, in public, to actually doing it.
Further reading
- Moore, C. Programming a Problem-Oriented Language (1970). The original Forth manifesto.
- Moore, C.; Ting, C. eForth and Zen (1996). The minimalist approach.
- GreenArrays Inc. GA144 Documentation and F18A Architecture Reference. https://www.greenarraychips.com/
- Pelt, J. Programming the F18A Computer. arrayForth tutorial.
- Courbariaux, M.; Hubara, I.; Soudry, D.; El-Yaniv, R.; Bengio, Y. “Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1.” NeurIPS 2016.
- Rastegari, M.; Ordonez, V.; Redmon, J.; Farhadi, A. “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks.” ECCV 2016.
- Larq Contributors. Larq: An Open-Source Library for Training Binarized Neural Networks (2019). https://larq.dev
- Sokol, J. Error Correcting Internet Protocol (ECIP), 1996.
- Sokol, J. Palmo: Distributed Binary Neural Networks for Hand Gesture Recognition, 1994-1996. Unpublished.
- Sokol, J. Budmark CPU Benchmark, 1996-2002. Collected data across x86 generations 386 through Pentium 4; predicted NetBurst’s architectural inefficiency on small-bit-operation workloads.
- Hamming, R. W. “Error Detecting and Error Correcting Codes.” Bell System Technical Journal 29 (1950).
- See also bnn1/ in the companion repository for working Python/Keras + C/AVX2 reference implementations of the overflow-fire BNN inference kernel.
Comments, corrections, and especially “let’s actually build it” welcome.