There Are Far More Efficient Ways to Run Neural Networks
The entire AI industry has standardized on one primitive — the dense floating-point matrix multiply — and then built a trillion-dollar edifice of GPUs, data centers, and capital around the assumption that intelligence is that primitive at scale. It is worth saying plainly: that is an assumption, not a law of computing. And it is an expensive one.
I want to make a narrow, concrete version of this argument, because the sweeping version is easy to wave away. Here it is: a neural network can run with no multiplications at all, and the arithmetic we burn most of our energy on is optional. I’ll show it running, then say where it leads.
How we got locked in
In 2012 a big neural network plus a GPU plus a large dataset worked, and the field reorganized around that fact. Every layer of the stack since then has been optimized for dense floating-point matmul: tensor cores designed to feed it, libraries tuned for it, model architectures shaped to saturate it, frameworks that assume it. The loop is self-reinforcing — hardware rewards matmul, so researchers choose matmul-shaped models, so the next hardware doubles down on matmul. After a decade it stops looking like a choice and starts looking like physics. It isn’t.
Where the waste is
The cost of inference is dominated by two things: floating-point multiply-accumulates, and moving 32-bit weights out of memory. Multiplication is among the most expensive operations a chip does; a memory fetch can cost far more than the arithmetic it feeds (see Horowitz, Computing’s Energy Problem, ISSCC 2014). A modern accelerator is a machine built to maximize exactly these operations, fed continuously from DRAM, clocked whether or not the numbers it is grinding actually matter. The brain, by contrast, runs general intelligence on roughly twenty watts — no floating-point multiplier, event-driven, sparse, with memory and compute in the same place. That is not mysticism; it is a different set of primitives.
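To put rough numbers on that claim, here is a back-of-the-envelope sketch. The per-operation energies are the approximate figures commonly quoted from the Horowitz talk for a 45 nm process; they are assumptions for illustration, not measurements of any current accelerator, and the names `ENERGY_PJ` and `fp32_mac_pj` are mine.

```python
# Approximate energy per operation in picojoules (pJ), as widely quoted from
# Horowitz's ISSCC 2014 talk for a 45 nm process. Treat these as
# order-of-magnitude assumptions, not measurements of any current chip.
ENERGY_PJ = {
    "fp32_mult": 3.7,
    "fp32_add": 0.9,
    "int32_add": 0.1,
    "sram_read": 10.0,    # small (about 8 KB) on-chip memory
    "dram_read": 1300.0,  # low end of the quoted 1.3 to 2.6 nJ range
}


def fp32_mac_pj():
    """Energy of one floating-point multiply-accumulate (multiply + add)."""
    return ENERGY_PJ["fp32_mult"] + ENERGY_PJ["fp32_add"]


mac = fp32_mac_pj()
print(f"fp32 MAC:             {mac:.1f} pJ")
print(f"DRAM read / fp32 MAC: {ENERGY_PJ['dram_read'] / mac:.0f}x")
print(f"fp32 MAC / int32 add: {mac / ENERGY_PJ['int32_add']:.0f}x")
```

Under these figures an off-chip fetch costs a few hundred times the multiply-accumulate it feeds, and an integer add costs tens of times less than that multiply-accumulate. Those two gaps are the ones the rest of this post is aimed at.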
A network that runs on pulses, not multiplies
Here is the concrete part. I built a small pulsed neural network and ran it on MNIST. It is not a ±1 “binary” network and it does no XNOR tricks. It works the way pulse-stream signal processing works: a value is a stream of pulses (its magnitude is the pulse rate), polarity is a sign clock, and each neuron simply integrates pulses and fires when its accumulator overflows a threshold — the overflow is the activation. A value×weight product becomes “deliver the input’s pulses, each adding the weight.” Sum-then-threshold becomes “accumulate until you overflow and fire.” No multiplier appears anywhere in inference.
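The mechanism just described can be sketched in a few lines. This is a minimal illustration, not the code from the linked repository: the function name `pulsed_neuron` is hypothetical, weights and threshold are assumed to be integers, polarity is folded into the sign of each weight rather than modeled as a separate sign clock, and each input is assumed to send its pulses in the first slots of the window.

```python
def pulsed_neuron(pulse_counts, weights, threshold, T):
    """Integrate weighted pulses over T time slots and count overflows.

    pulse_counts[i] is how many pulses input i sends in this window
    (assumed to arrive in the first pulse_counts[i] slots).
    Each arriving pulse adds weights[i] to the accumulator, so repeated
    addition stands in for the product pulse_counts[i] * weights[i].
    Returns (output pulses fired, residue left in the accumulator).
    """
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    acc = 0
    fired = 0
    for t in range(T):
        for count, w in zip(pulse_counts, weights):
            if t < count:                # input i pulses in this slot
                acc += w                 # one pulse: add the signed weight
                while acc >= threshold:  # overflow: emit an output pulse
                    acc -= threshold
                    fired += 1
    return fired, acc


# Two inputs: 3 pulses on a weight of +2, 1 pulse on a weight of -1.
print(pulsed_neuron([3, 1], [2, -1], threshold=4, T=4))  # -> (1, 1)
```

The loop body contains only additions, subtractions, and comparisons. The weighted sum 3×2 + 1×(−1) = 5 is recovered as one output pulse (worth 4) plus a residue of 1, without a multiply ever being executed.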
It classifies. Trained conventionally and then run purely as pulses:
reference (floating-point) test accuracy: 94.23%
pulses/input (T)    test accuracy
 1                  11.40%
 2                  58.50%
 4                  88.20%
 8                  93.00%
16                  93.70%
64                  93.60%
As you spend more pulses, accuracy climbs from near chance (11.40% at T = 1) to within about half a point of the floating-point network (93.70% at T = 16 against 94.23%), and it saturates there: going to 64 pulses buys nothing further. That curve is the whole story of rate coding: more pulses = more precision = more energy. It is a knob you control, not a fixed tax. The code is a single dependency-light file you can run yourself; the link is at the bottom.
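The precision side of that knob can be made concrete. The sketch below assumes a deterministic rate code in which a value in [0, 1] is sent as the nearest whole number of pulses out of T; the helper names are hypothetical and this is not taken from the linked repository. It measures only how finely an input can be represented, not network accuracy.

```python
def rate_code(v, T):
    """Number of pulses (0..T) that best approximates v in [0, 1]."""
    return min(T, max(0, round(v * T)))


def decode(pulses, T):
    """Value represented by `pulses` pulses in a window of T slots."""
    return pulses / T


def worst_case_error(T, samples=1001):
    """Largest encode/decode error over an evenly spaced grid on [0, 1]."""
    grid = [i / (samples - 1) for i in range(samples)]
    return max(abs(v - decode(rate_code(v, T), T)) for v in grid)


for T in (1, 2, 4, 8, 16, 64):
    print(f"T={T:3d}  worst-case error={worst_case_error(T):.4f}")
```

With this scheme the representation error is bounded by 1/(2T), so each doubling of T buys roughly one more bit of input precision. That is consistent with the shape of the accuracy table: large gains at small T, then a plateau once the inputs are resolved finely enough for the task.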
I’ll be honest about what this does and does not prove. It proves the arithmetic is optional — the network’s function survives with pulses and overflow-firing and zero multiplies. It does not prove this is faster on your laptop. It isn’t: a CPU or GPU is the wrong machine for it, because those chips are built to do the very floating-point matmul we just removed. Emulating pulses on a GPU is slower, not faster. That is the point, not a footnote.
Why the GPU loses, and what wins
A GPU is the maximal embodiment of the old assumption: dense floating point, DRAM-fed, clock-synchronous. Every design decision that makes it superb at matrix multiplication becomes a liability the moment the workload stops being matrix multiplication. The alternatives don’t out-matmul the GPU — they make matmul irrelevant:
- Pulsed / event-driven computation: energy proportional to activity, not to a clock. Work happens only when a pulse arrives.
- No floating-point unit to power, because the model doesn’t need one.
- Weights on-chip, killing the DRAM traffic that dominates the energy budget.
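The first point in the list above, work proportional to activity, can be illustrated by counting operations. This is a sketch under stated assumptions: the 784-input size matches a 28×28 MNIST image, but the 128-unit layer width, the number of active pixels, and the function names are hypothetical choices for illustration.

```python
def dense_ops(n_in, n_out):
    """Multiply-accumulates in a dense layer: fixed, whatever the input is."""
    return n_in * n_out


def event_driven_ops(pulse_counts, n_out):
    """Additions in a pulsed layer: one per pulse per receiving neuron."""
    return sum(pulse_counts) * n_out


N_IN, N_OUT = 784, 128  # 28x28 input; layer width is an assumed example

silent = [0] * N_IN                      # blank image: no pulses at all
sparse = [4] * 100 + [0] * (N_IN - 100)  # 100 active pixels, 4 pulses each

print("dense MACs:          ", dense_ops(N_IN, N_OUT))
print("pulsed adds (silent):", event_driven_ops(silent, N_OUT))
print("pulsed adds (sparse):", event_driven_ops(sparse, N_OUT))
```

The dense count never changes, while the pulsed count falls to zero when nothing is happening. The comparison cuts both ways: at high pulse rates the number of additions can exceed the dense multiply-accumulate count, so the gain depends on each addition being much cheaper than a floating-point multiply-accumulate and on inputs being sparse.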
The clearest existing picture of this substrate is the GreenArrays GA144: 144 asynchronous cores, no floating point, picojoule-scale per operation, cores that sleep instantly when no data is flowing, and enough on-chip memory to hold a binary/pulsed model with no external weight fetches. An async, FP-free, on-chip, event-driven array is exactly the machine a pulsed network wants — and the opposite of a GPU in every design commitment.
To be careful: the GA144 energy advantage for this workload is, today, a projection from its architecture, not a benchmark I am handing you. The honest next step is to run a pulsed inference kernel on a GA144 (or its cycle-accurate simulator) and publish the measured instruction and energy counts. That is the experiment worth funding — and it costs a rounding error against what we are spending pouring concrete for matmul.
The claim
GPUs are not obsolete tomorrow, and training will live on dense hardware for a while. But the assumption that inference at planetary scale must be floating-point matrix multiplication is a 2010s artifact, and it is breaking. The arithmetic is optional; the substrate is a choice; and we have barely funded the alternatives. There are far more efficient ways to run these networks. Here is one of them, running.
Working example (MIT-licensed): a pulsed integrate-and-fire MNIST network in one file — https://github.com/johnsokol/bnn-example (python3 pulsed_nn.py).
© 2026 John L. Sokol.