The Future Is Sparse

Every machine that moves through the world has to perceive it first. A drone has to see the obstacle before it hits. A robot arm has to see the part it is about to grasp. A car has to see the child stepping off the curb.

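Yet what today's vision systems deliver often falls short of what these machines need, in reaction time and in power. That gap comes from an old assumption about how machines should see, one that is wrong for a large and growing class of systems.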

Where that assumption came from

The way computers see today comes from the way computers have always been built. Memory and computation are kept separate, so every operation means moving data across that boundary: fetch it, work on it, send the answer back. This is the von Neumann bottleneck. A camera feeds that machine a full picture at a fixed rhythm, every pixel, every frame, whether or not anything in the scene has changed, and the processor grinds through all of it to find the small part that matters.

Biology never worked this way. Your eye does not send your brain a fresh full-resolution image sixty times a second. It sends signals only when something changes, and your brain runs on roughly the power of a dim light bulb. Sparse, event-driven, low-power perception is what five hundred million years of evolution converged on, because anything slower or hungrier for power would not survive contact with a moving world.

Three things have recently made it possible to build machines that work the same way, at commercial scale: cameras that fire only when a pixel changes; processors modeled on networks of neurons that sit idle until there is something to compute; and methods for training and combining these into working systems that are maturing from research papers into reproducible engineering. For the first time, all three exist at once.

The sensor and the compute

A conventional camera exposes a full frame at a fixed rate, say 30 or 60 times a second, regardless of what is happening. An event camera works differently. Each pixel operates independently and asynchronously, and emits a signal only when the log-intensity at that pixel changes by more than a threshold.

A DISCRETE EVENT

{ x, y, timestamp, polarity }

Where, when, and whether it got brighter or darker. Static background produces almost nothing. A fast-moving edge produces a dense stream.
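As a rough illustration, the per-pixel rule can be simulated in a few lines of Python. This is a sketch, not a model of any particular sensor: the names Event and generate_events, the 0.2 threshold, and the small offset added before the logarithm are assumptions made for the example, and a real event camera applies the rule asynchronously in per-pixel circuitry rather than by comparing sampled frames.

```python
import math
from typing import NamedTuple


class Event(NamedTuple):
    x: int
    y: int
    timestamp: float  # seconds
    polarity: int     # +1 = got brighter, -1 = got darker


def generate_events(reference, frame, timestamp, threshold=0.2):
    """Emit an event for every pixel whose log-intensity has moved by at
    least `threshold` since that pixel last fired.

    `reference` holds each pixel's stored log-intensity and is updated
    in place for the pixels that fire. Both grids are lists of rows.
    """
    events = []
    for y, row in enumerate(frame):
        for x, intensity in enumerate(row):
            log_i = math.log(intensity + 1e-6)  # offset avoids log(0)
            delta = log_i - reference[y][x]
            if abs(delta) >= threshold:
                events.append(Event(x, y, timestamp, 1 if delta > 0 else -1))
                reference[y][x] = log_i
    return events


# A 4x4 static scene in which a single pixel brightens.
scene0 = [[0.5] * 4 for _ in range(4)]
reference = [[math.log(v + 1e-6) for v in row] for row in scene0]
scene1 = [row[:] for row in scene0]
scene1[1][2] = 1.0

events = generate_events(reference, scene1, timestamp=0.001)
quiet = generate_events(reference, scene1, timestamp=0.002)
print(events)  # one event, at x=2, y=1, polarity +1
print(quiet)   # nothing changed, so no events
```

Of the sixteen pixels, only the one that changed produces output, and once the scene settles the stream goes silent.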

The practical effect: a mostly-static scene with a small moving object generates orders of magnitude less data than a full-frame sensor would, with microsecond-scale temporal resolution, since each pixel reports the instant it changes rather than waiting for the next scheduled exposure.
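The size of that reduction depends entirely on the scene, but the arithmetic is simple enough to sketch. Every number below is an illustrative assumption, not a measurement: a 640 by 480 sensor at 30 frames per second with one byte per pixel, an event stream of 10,000 events per second, and 8 bytes per event. The function names are likewise made up for the example.

```python
def frame_bytes_per_second(width, height, fps, bytes_per_pixel=1):
    """Data rate of a sensor that ships every pixel of every frame."""
    return width * height * fps * bytes_per_pixel


def event_bytes_per_second(events_per_second, bytes_per_event=8):
    """Data rate of a sensor that ships only change events."""
    return events_per_second * bytes_per_event


# Hypothetical inputs, chosen only to show the calculation.
frame_rate = frame_bytes_per_second(640, 480, 30)  # 9,216,000 bytes/s
event_rate = event_bytes_per_second(10_000)        # 80,000 bytes/s
reduction = frame_rate / event_rate

print(f"frames: {frame_rate} B/s, events: {event_rate} B/s, "
      f"ratio: {reduction:.1f}x")
```

Under these assumptions the event stream is roughly a hundred times smaller. A busier scene raises the event rate and shrinks the ratio, which is why the advantage holds for mostly-static scenes in particular.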

On the compute side, spiking neural networks are often introduced as a biologically-inspired curiosity: neurons that integrate a membrane potential and fire a binary spike at threshold. That description is accurate, but such networks have historically been hard to train to competitive accuracy, and they need many simulation timesteps per inference.
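For readers who have not met one, a spiking neuron of this kind can be sketched as a leaky integrate-and-fire unit. The function name lif_run, the leak factor, the threshold, and the hard reset to zero are all assumptions chosen for illustration; real models differ in each of these details.

```python
def lif_run(inputs, threshold=1.0, leak=0.9):
    """Leaky integrate-and-fire neuron.

    Each timestep the membrane potential decays by `leak`, adds the
    input current, and emits a binary spike (1) if it reaches
    `threshold`, after which it resets to zero (an assumed reset rule).
    """
    potential = 0.0
    spikes = []
    for current in inputs:
        potential = leak * potential + current
        if potential >= threshold:
            spikes.append(1)
            potential = 0.0
        else:
            spikes.append(0)
    return spikes


steady = lif_run([0.4] * 6)
silent = lif_run([0.0] * 5)
print(steady)  # [0, 0, 1, 0, 0, 1]
print(silent)  # [0, 0, 0, 0, 0]
```

Because each spike carries a single bit, the strength of the input is only visible in how often the neuron fires, which is one reason such networks need many timesteps to express a value.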

Sigma-delta networks are a more practical variant. Instead of communicating raw activations every step, a sigma-delta neuron tracks the change in its activation since the last update, and only transmits when that change exceeds a threshold, carrying the magnitude as a graded, quantized value rather than a single bit.
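The sigma-delta rule described above can be sketched as an encoder that remembers what it last told its receiver. The class name SigmaDeltaEncoder, the 0.25 threshold, and the choice to quantize the change into whole multiples of the threshold are assumptions made for this example, not a description of any specific hardware.

```python
class SigmaDeltaEncoder:
    """Transmit only when the activation has drifted far enough from the
    value the receiver currently holds."""

    def __init__(self, threshold=0.25):
        self.threshold = threshold
        self.last_sent = 0.0  # the receiver's current reconstruction

    def step(self, activation):
        """Return a graded integer message, or None when nothing is sent."""
        delta = activation - self.last_sent
        if abs(delta) < self.threshold:
            return None
        # Assumed quantization: whole multiples of the threshold,
        # truncated toward zero.
        steps = int(delta / self.threshold)
        self.last_sent += steps * self.threshold
        return steps


encoder = SigmaDeltaEncoder(threshold=0.25)
activations = [0.0, 0.6, 0.65, 0.7, 0.1]
messages = [encoder.step(a) for a in activations]

# The receiver rebuilds the signal by accumulating the graded messages.
receiver = 0.0
reconstruction = []
for message in messages:
    if message is not None:
        receiver += message * 0.25
    reconstruction.append(receiver)

print(messages)        # [None, 2, None, None, -1]
print(reconstruction)  # [0.0, 0.5, 0.5, 0.5, 0.25]
```

Five activations produce two messages, and the receiver's reconstruction never strays from the true activation by as much as the threshold.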

WHY IT IS TRACTABLE

A sigma-delta network can be converted from a network already trained and quantized with conventional tools, with no need for a from-scratch spike-based training procedure. Selective activation of a limited subset of neurons gives spatial sparsity. State updates only on change give temporal sparsity. On hardware such as Intel's Loihi 2, this approach has been demonstrated in direct comparisons with edge-GPU baselines.

Why these two things belong together

They close a loop that dense pipelines leave open. A conventional stack (frame camera, full frames, dense GPU compute) pays a fixed compute cost every cycle no matter what is in the scene, so latency and power are set by the worst case, all the time. The sparse stack inverts that. The sensor already discards the redundant part of the signal, and the sigma-delta network is architected to skip computation exactly where the sensor indicated nothing changed.

Sensing sparsity and compute sparsity reinforce each other, instead of one undoing the other's savings.
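One way to see why the two kinds of sparsity compound is a linear layer that keeps its previous output and processes only the inputs that changed. The sketch below is a toy under stated assumptions: the DeltaLinear class, its multiply-accumulate counter, and the tiny weight matrix are invented for illustration, and it relies only on the fact that a linear layer's output change equals the weights applied to the input change.

```python
def dense_matvec(weights, x):
    """Reference: recompute the full matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class DeltaLinear:
    """Linear layer that updates its stored output from input changes only."""

    def __init__(self, weights):
        self.weights = weights
        self.output = [0.0] * len(weights)
        self.macs = 0  # multiply-accumulates actually performed

    def step(self, deltas):
        """`deltas` maps input index -> change; missing indices are unchanged."""
        for j, change in deltas.items():
            for i, row in enumerate(self.weights):
                self.output[i] += row[j] * change
                self.macs += 1
        return list(self.output)


weights = [[1.0, 2.0, 3.0, 4.0],
           [0.0, -1.0, 0.0, 2.0]]
layer = DeltaLinear(weights)

# Step 1: inputs go from all zeros to [1, 0, 2, 0]; two inputs changed.
out1 = layer.step({0: 1.0, 2: 2.0})
# Step 2: only input 3 changes, so the input is now [1, 0, 2, 3].
out2 = layer.step({3: 3.0})

dense_out = dense_matvec(weights, [1.0, 0.0, 2.0, 3.0])
dense_macs = 2 * len(weights) * len(weights[0])  # two full passes

print(out2, dense_out)         # [19.0, 6.0] [19.0, 6.0]
print(layer.macs, dense_macs)  # 6 16
```

The outputs match the dense computation, but the work done scales with the number of changes rather than with the size of the input. If the sensor had delivered full frames instead, every input would count as a candidate change and the saving would disappear.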

The net effect for a system architect is threefold: much lower average power, since you are not paying for idle scenes; much lower latency, since a change is processed within microseconds instead of waiting for a frame boundary; and reduced data movement, since you are shipping a trickle of coordinates and polarities instead of a video stream.

What this is actually good for

This is not a general-purpose replacement for how most vision systems are built today. It is the right architecture for a specific and growing set of problems that share three properties: the system is power or thermally constrained; it needs to react in microseconds to milliseconds; and the scene is mostly static most of the time, with the interesting signal being motion against that background.

It starts with the machines where the constraint is most extreme: small, fast aerial vehicles tracking a moving target onboard, where every watt and millisecond is precious. From there the same module extends to patrol and inspection drones, ground robots in cluttered spaces, and any autonomous system at the edge of its power and reaction-time budget. Beyond that: a factory robot working safely alongside people, a surgical instrument sensing tissue, an assistive device giving sight back, a car catching the fast hazard a conventional camera would blur or miss.

Why datacenter networks are not the comparison

The instinct is to benchmark sparse neuromorphic systems against datacenter-scale dense networks and ask which is better. They are solving different problems. Datacenter training and large dense inference are built around batching, synchronous execution, and predictable, uniform compute graphs, which is exactly what lets you saturate a GPU with matrix multiplies at high utilization. That is a legitimate and, for now, largely irreplaceable way to build large models.

But it depends on dense, regular, batched computation, the opposite of a sparse, asynchronous, per-event stream. Feeding sparse spike trains into a GPU designed for dense matmuls does not unlock the sparsity. It just adds overhead, because the hardware's efficiency comes precisely from not skipping around irregularly.

The mismatch runs the other way too. Spike counts tend to vanish at deeper layers, which is why current best practice often pairs a shallow spiking front end with dense layers deeper in. This is not a wholesale replacement for large learned models. It is a complementary architecture for the sensing-and-inference layer closest to the physical world, under a power and latency budget a datacenter was never designed to meet.

Our bet

We believe the next great wave of AI will not happen only in the cloud. A large part of it will happen in the physical world, in machines that move, gated not by how large a model can be, but by how efficiently a machine can perceive and act under real constraints. The advantage is not any single chip or sensor, all of which will improve and be replaced. It is the accumulated experience of making these systems work under real-world conditions that are messy and unpredictable, learned by starting narrow, where the constraint is sharpest, and expanding from there.

If the interesting signal in your environment is sparse, your sensor and your compute should both be sparse, and neither one should force the other back into a dense representation along the way.
