Backpropagation on the Apple Neural Engine
Every now and then you read something that rearranges how you think about hardware you assumed you understood. maderix's writeup on training on the Apple Neural Engine did that for me. None of the engineering is mine (it's their work, with Claude Opus 4.6), so treat this as a reading, not a tutorial. Here's what stuck.
The premise is a little heretical. The Neural Engine in every recent iPhone, iPad, and Mac is built and licensed to do exactly one thing: inference. Forward passes, forever. maderix asked whether that's a real hardware limit or just a decision Apple made in software — then spent days reverse-engineering the chip to find out it was the latter.
Those are the numbers that stuck — especially the last one.
It isn't a GPU
The reframe that made everything click: the ANE isn't a small GPU. maderix describes it as a graph execution engine — you don't issue instructions to it, you hand it a whole compiled network and it runs the entire graph as one atomic operation.
The M4's version (codename H16G) is 16 cores with a queue depth of 127 in-flight requests, independent voltage/frequency scaling, and hard power-gating that drops it to exactly 0 mW when idle. That 127-deep queue alone tells you what it was built for: high-throughput streaming inference, nothing else. It's a fixed pipeline that swallows whole models, not a processor you can target.
A door behind CoreML
CoreML — the only ANE interface Apple documents — turns out to be just a convenience layer. To find what's underneath, maderix dumped every Objective-C class in the private AppleNeuralEngine.framework with dyld_info -objc and swizzled CoreML's calls to watch what it actually did. Out fell 40-odd undocumented classes — _ANEClient, _ANEModel, _ANERequest, _ANEIOSurfaceObject — and _ANEClient exposes the full compile-load-evaluate pipeline directly. The door was there the whole time; CoreML is just the lobby.
The detail I love most: all I/O runs through IOSurface, the same shared-memory primitive the GPU uses for textures. Which means zero-copy GPU↔ANE pipelines are, in theory, already possible — nobody ships that today, but the plumbing exists.
The language the chip actually speaks
Under the hood the ANE doesn't take ONNX or protobufs. It takes MIL — Machine Learning Intermediate Language — a typed SSA form where every value carries both its precision and its shape. The part that genuinely surprised me: a matrix multiply is expressed as a 1×1 convolution, and tensors use a 5D NCDHW layout, because the ANE's native primitive is convolution. A 1024×1024 matrix becomes [1, 1024, 1, 1024]. The hardware is a convolution engine wearing a matmul costume.
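To see why a matmul can hide inside a 1×1 convolution, here is a small pure-Python sketch. It is not MIL and not ANE code; the function names and the tiny shapes are mine, chosen only for illustration. The inner dimension of the matmul becomes the channel axis, and each output column becomes one 1×1 filter.

```python
def matmul(x, w):
    """Plain matrix product: x is [n][k], w is [k][m] -> [n][m]."""
    return [[sum(x[i][p] * w[p][j] for p in range(len(w)))
             for j in range(len(w[0]))]
            for i in range(len(x))]


def conv1x1(inp, filters):
    """1x1 convolution over an input shaped [C_in][H][W].

    filters is [C_out][C_in]. A 1x1 kernel has no spatial extent, so each
    output pixel is a weighted sum over the input channels at that pixel.
    """
    c_in, h, w = len(inp), len(inp[0]), len(inp[0][0])
    return [[[sum(filters[o][c] * inp[c][y][x] for c in range(c_in))
              for x in range(w)]
             for y in range(h)]
            for o in range(len(filters))]


def matmul_via_conv1x1(x, w):
    """Compute x @ w by laying x out as a [C, 1, W] tensor and convolving."""
    n, k, m = len(x), len(w), len(w[0])
    # Channels = inner dimension k, height = 1, width = the n rows of x.
    inp = [[[x[i][c] for i in range(n)]] for c in range(k)]
    # One 1x1 filter per output column of w.
    filters = [[w[c][o] for c in range(k)] for o in range(m)]
    out = conv1x1(inp, filters)  # shape [m][1][n]
    return [[out[o][0][i] for o in range(m)] for i in range(n)]


x = [[1, 2, 3], [4, 5, 6]]
w = [[1, 0], [0, 1], [2, 2]]
assert matmul_via_conv1x1(x, w) == matmul(x, w)
```

The two functions agree because a 1×1 filter does nothing but mix channels, which is exactly what a row-times-column dot product does.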
A clue hidden in a file size
This is the detail I keep coming back to. The compiler emits a FlatBuffer-structured binary, an E5 file, and maderix noticed that a 1024×1024 matmul compiles to 2,688 bytes, while a 128×128 one compiles to 2,680. Eight bytes apart, for a matrix with 64× as many elements.
So the binary isn't encoding the math. It's a parameterized program telling a handful of hardwired primitives which shapes to run, with the real dimensions supplied at runtime. Fixed operators plus a scheduler — exactly why the ANE is so fast and so restricted. Inferring the shape of the silicon from the size of a file is about as satisfying as reverse engineering gets.
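The arithmetic behind that inference is quick to check. The sketch below uses only the two file sizes quoted above; the 2-bytes-per-value (fp16) figure is my assumption, included just to show how large a matrix of that size would be if it were stored inline.

```python
BYTES_PER_FP16 = 2  # assumption: half-precision values


def weight_bytes(rows, cols, bytes_per_value=BYTES_PER_FP16):
    """Storage needed for a dense rows x cols matrix."""
    return rows * cols * bytes_per_value


small_file, large_file = 2680, 2688  # compiled sizes reported in the writeup

element_ratio = (1024 * 1024) // (128 * 128)
file_delta = large_file - small_file

print(element_ratio)              # 64
print(file_delta)                 # 8
print(weight_bytes(128, 128))     # 32768
print(weight_bytes(1024, 1024))   # 2097152
```

A file that grows by 8 bytes while the matrix grows by about 2 MB is describing the operation, not carrying its data.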
The hack that makes training work
Training is where it gets hostile. A single transformer layer's training step runs as six distinct ANE kernels: the chip handles the matmuls, attention, and elementwise ops; the CPU accumulates gradients and runs Adam. The activations that backprop needs are emitted as extra outputs from the forward kernels instead of being recomputed — a clean trick. Even reaching this point meant fighting the API: the in-memory model descriptor wants MIL as raw bytes rather than a string, and weights as a dictionary rather than a blob — pass the wrong type and it fails silently.
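The "emit activations as extra outputs" idea can be sketched without any ANE code. Below, a toy linear-plus-ReLU layer in pure Python returns its intermediate values alongside its result, and the backward pass consumes them instead of rerunning the forward. The layer, the function names, and the hyperparameters are illustrative assumptions, not maderix's kernels; the Adam update is the textbook one, standing in for the CPU side of the split.

```python
import math


def forward(x, w):
    """Toy 'forward kernel': y = relu(w . x).

    Returns the output plus the values backprop will need, as extra outputs.
    """
    pre = sum(wi * xi for wi, xi in zip(w, x))
    y = max(0.0, pre)
    saved = {"x": x, "pre": pre}
    return y, saved


def backward(grad_y, saved):
    """Toy 'backward kernel': reads saved activations, recomputes nothing."""
    grad_pre = grad_y if saved["pre"] > 0 else 0.0
    return [grad_pre * xi for xi in saved["x"]]


def adam_step(w, grad, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """Textbook Adam update with bias correction (the 'CPU side' here)."""
    new_w, new_m, new_v = [], [], []
    for wi, gi, mi, vi in zip(w, grad, m, v):
        mi = b1 * mi + (1 - b1) * gi
        vi = b2 * vi + (1 - b2) * gi * gi
        m_hat = mi / (1 - b1 ** t)
        v_hat = vi / (1 - b2 ** t)
        new_w.append(wi - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_w, new_m, new_v


def train(x, target, w, steps=200):
    """Squared-error training loop; returns final weights and loss history."""
    m, v = [0.0] * len(w), [0.0] * len(w)
    losses = []
    for t in range(1, steps + 1):
        y, saved = forward(x, w)
        losses.append(0.5 * (y - target) ** 2)
        grad_w = backward(y - target, saved)
        w, m, v = adam_step(w, grad_w, m, v, t)
    return w, losses


w, losses = train(x=[1.0, 2.0], target=3.0, w=[0.5, 0.5])
assert losses[-1] < losses[0]
```

The trade is the usual one: extra output buffers in exchange for not running the forward computation a second time during backprop.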
The real wall is the one in the stat block. After roughly 119 compilations in a single process, the ANE compiler leaks enough that it simply stops; every compile after that fails. Inference compiles once, so nobody ever noticed — training recompiles every step. maderix's fix is to call exec() every few steps and replace the process with a fresh copy loaded from a checkpoint. The compile budget resets because, as far as the OS is concerned, it's a brand-new process. Process suicide as garbage collection — ugly, but it's what you reach for once you stop expecting a clean API.
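The restart trick maps onto a plain POSIX idiom. Here is a minimal Python sketch of the control flow, assuming a hypothetical compile budget of 100, six compiles per step, a JSON checkpoint file, and a `--resume` flag. None of those names or numbers come from maderix's code, and the real implementation is not written in Python. `os.execv` is a real call: it replaces the current process image in place, which is what resets per-process state.

```python
import json
import os
import sys

COMPILE_BUDGET = 100  # hypothetical margin below the ~119-compile limit


def save_checkpoint(path, state):
    with open(path, "w") as f:
        json.dump(state, f)


def load_checkpoint(path):
    with open(path) as f:
        return json.load(f)


def reexec(checkpoint_path):
    """Replace this process with a fresh copy that resumes from disk."""
    os.execv(sys.executable,
             [sys.executable, sys.argv[0], "--resume", checkpoint_path])


def train(state, total_steps, checkpoint_path,
          compiles_per_step=6, restart=reexec):
    """Run steps until done, restarting before the compile budget runs out."""
    compiles = 0  # starts at zero again in every new process
    while state["step"] < total_steps:
        if compiles + compiles_per_step > COMPILE_BUDGET:
            save_checkpoint(checkpoint_path, state)
            restart(checkpoint_path)  # never returns when it is os.execv
            return state              # only reached with a stub restart
        compiles += compiles_per_step  # stand-in for real kernel compiles
        state["step"] += 1
    return state
```

The `restart` parameter exists so the loop can be exercised with a stub instead of actually replacing the process. The important ordering is checkpoint first, exec second: anything not written to disk before the call is gone.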
The human bit
One aside made me grin: buried in Apple's private code is a class named Desctiptor. Even Apple's own engineers typo their internal APIs.
Where it goes
The wild part is how little of the chip this even uses. maderix measured the training prototype at around 11% of the ANE's peak, bottlenecked not by silicon but by baking weights into every compiled program. There is a lot of headroom below 107 ms a step.
And there's more left unmapped. Classes surfaced during discovery that nobody has touched: _ANEChainingRequest (chaining models in a single dispatch, which would gut the per-layer compile overhead) and _ANESharedEvents (Metal-style fences for real GPU↔ANE synchronization). The surface is barely scratched.
If the most power-efficient AI block in a billion devices can learn, not just predict, a few things stop being science fiction: personalization that never leaves the device, federated learning that's finally tolerable at scale, and on-device agents that actually adapt — not a prompt trick, but a model that gets better because it lived on your phone.
Why it stuck
The reason I keep returning to this isn't the numbers. It's that the ANE could do this all along. The constraint was never the silicon — it was a decision made in Cupertino and communicated through silence and private frameworks: "this is for inference."
The line between possible and permitted is always thinner than it looks. Private APIs are still APIs. Undocumented behavior is still behavior. The chip was always a training accelerator. It just needed someone patient enough to ask.
Read maderix's original writeup: "Inside the M4 Apple Neural Engine, Part 1: Reverse Engineering"
Explore the ANE training repository on GitHub
Acknowledgements
All of the reverse engineering, benchmarks, and training code discussed here are the work of maderix, in collaboration with Claude Opus 4.6. This is my reading of that work — the engineering is entirely theirs.