Backpropagation on the Apple Neural Engine
Inside every modern iPhone, iPad, and Apple Silicon Mac is a silent monster. On the M4, it’s a 15.8 TFLOPS beast, engineered with absurd precision, and for its entire life it’s been told to do one thing: inference. No training. No learning. Just forward passes, forever.
That wall just got a crack in it.
A researcher going by maderix, working alongside Claude Opus 4.6, spent several days peeling Apple's Neural Engine apart layer by layer — and then did the thing Apple says you can't do: they trained a real transformer, with backpropagation and Adam, directly on the ANE. No CoreML. No Metal. Just reverse-engineered private frameworks and a lot of patience.
Their writeup is one of the best pieces of systems reverse engineering I’ve read in a long time. What follows is my attempt to walk through how they did it — the discoveries, the dead ends, the weird hardware quirks — and why it matters. All the engineering below is theirs; I'm just the one squinting at it and explaining what I see.

The question that started it all
maderix opens with a simple question: can you train a model on Apple’s Neural Engine?
Apple really does not want you to answer that. They don’t publish the ANE’s ISA. They don’t document its internal architecture. They don’t even give you a way to program it directly — everything goes through CoreML, which piles on abstractions, optimization passes, and overhead that make it nearly impossible to see what the hardware is actually doing.
So maderix reverse-engineered it. Over several days they:
- mapped the entire stack from CoreML down to the IOKit kernel driver,
- discovered how to compile and execute programs on the ANE without CoreML,
- cracked the binary format the compiler emits,
- measured true peak throughput (spoiler: Apple's "38 TOPS" number is misleading), and
- ultimately got a neural network training on a chip designed exclusively for inference.
Current utilization is ~11% of peak TFLOPS. The hardware can do more. The bottleneck is software — specifically the overhead of baking weights into each ANE program at compile time.
What the ANE actually is
The first thing to unlearn: the Neural Engine is not a GPU. It's not a CPU either.
maderix describes it as a graph execution engine — a fixed-function accelerator that takes an entire compiled neural network graph and executes the whole thing as one atomic operation. You don't issue individual multiply-accumulate instructions. You hand it a compiled program describing a full computation graph, and the hardware runs it end-to-end.
Apple introduced the ANE in the A11 (2017) as a tiny 2-core design. Every generation has scaled it up. The M4’s ANE — codename H16G — is the target of this work: 16 cores, a queue depth of 127 evaluation requests, independent DVFS (dynamic voltage/frequency scaling), and hard power gating that drops the whole block to exactly 0 milliwatts when idle.
Apple was not subtle. This is a chip designed to sit behind a wall.
Mapping the stack
Before you can talk to a chip directly, you need to know what's sitting between you and it. maderix’s methodology for mapping the stack is worth studying because it’s a clean template for this kind of work:
- Class discovery — dyld_info -objc dumped every Objective-C class and method inside AppleNeuralEngine.framework. 40+ private classes fell out, including _ANEClient, _ANEModel, _ANERequest, _ANEIOSurfaceObject, and _ANEInMemoryModel.
- Method swizzling — intercepting CoreML's calls into the private ANE frameworks, so they could watch exactly what CoreML was doing under the hood.
- Binary analysis — cracking open the compiled bundles the ANE compiler emits (.e5 files) to understand the on-disk program format.
- Scaling analysis — varying matrix sizes, graph depths, and channel counts to reverse-engineer hardware topology from performance curves.
The key insight that falls out of this: CoreML is not the only way in. The _ANEClient class exposes the full compile → load → evaluate pipeline directly. CoreML is just a convenience layer bolted on top.
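To make the interception step concrete, here is a minimal sketch of the swizzling approach, assuming the evaluate selector that appears later in this walkthrough. The logging wrapper, its BOOL return type, and its exact signature are my illustration, not maderix's code.

#import <Foundation/Foundation.h>
#import <objc/runtime.h>

static IMP g_origEvaluate;

// Logs every evaluation CoreML submits, then forwards to the original method.
// Return type and argument order are assumptions based on the call shown below.
static BOOL SpyEvaluate(id self, SEL _cmd, id model, id options,
                        id request, unsigned int qos, NSError **err) {
    NSLog(@"[ane-spy] evaluate model=%@ qos=%u", model, qos);
    typedef BOOL (*Fn)(id, SEL, id, id, id, unsigned int, NSError **);
    return ((Fn)g_origEvaluate)(self, _cmd, model, options, request, qos, err);
}

static void InstallSpy(void) {
    Class cls = NSClassFromString(@"_ANEClient");
    SEL sel = NSSelectorFromString(@"evaluateWithModel:options:request:qos:error:");
    Method m = class_getInstanceMethod(cls, sel);
    g_origEvaluate = method_setImplementation(m, (IMP)SpyEvaluate);
}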

Talking to the ANE without CoreML
Here’s the sequence maderix uncovered to compile and run a program on the ANE with zero CoreML involvement. Even if you never write a line of Objective-C, read it slowly — it's the whole game in eight steps.
// 1. Get shared client connection
id client = [_ANEClient sharedConnection];
// 2. Create model reference
id model = [_ANEModel modelAtURL:compiledURL key:@"mykey"];
// 3. Compile (MIL text → E5 binary, cached)
[client compileModel:model options:@{
@"kANEFModelType": @"kANEFModelMIL",
@"kANEFNetPlistFilenameKey": @"model.mil"
} qos:21 error:&err];
// 4. Load program onto ANE hardware
[client loadModel:model options:@{} qos:21 error:&err];
// → programHandle assigned, queueDepth = 127
// 5. Create IOSurface I/O buffers
IOSurfaceRef surface = IOSurfaceCreate(props);
id wrapped = [_ANEIOSurfaceObject objectWithIOSurface:surface];
// 6. Build evaluation request
id req = [_ANERequest requestWithInputs:@[wA, wB]
inputIndices:@[@0, @1]
outputs:@[wOut]
outputIndices:@[@0]
weightsBuffer:nil
perfStats:nil
procedureIndex:@0];
// 7. Execute on ANE
[client evaluateWithModel:model options:@{}
request:req qos:21 error:&err];
// 8. Read results from output IOSurface
IOSurfaceLock(outSurface, kIOSurfaceLockReadOnly, NULL);
float *data = IOSurfaceGetBaseAddress(outSurface);
// ... read results ...
IOSurfaceUnlock(outSurface, kIOSurfaceLockReadOnly, NULL);

Two details matter here. The first is the queue depth of 127 — you can have up to 127 evaluation requests in flight at once. That's absurdly deep for an accelerator queue and tells you this hardware was built for streaming, high-throughput inference.
The second is IOSurfaces. All I/O is done through the same shared-memory primitive the GPU uses for textures. That means zero-copy GPU↔ANE pipelines are theoretically possible — both accelerators can operate on the same memory. Nobody is shipping that today, but the plumbing is there.
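For a sense of what step 5 involves, here is a sketch of allocating a surface sized for a 1024×1024 fp16 tensor. The property keys are standard IOSurface ones, but the writeup doesn't spell out the exact configuration maderix uses, so treat the choices below as assumptions.

#import <IOSurface/IOSurfaceRef.h>

// Allocate a shared-memory surface for an fp16 matrix of rows × cols.
// Real code may need to round bytes-per-row up to the hardware's alignment.
static IOSurfaceRef MakeTensorSurface(size_t rows, size_t cols) {
    NSDictionary *props = @{
        (id)kIOSurfaceWidth:           @(cols),
        (id)kIOSurfaceHeight:          @(rows),
        (id)kIOSurfaceBytesPerElement: @2,            // fp16 = 2 bytes
        (id)kIOSurfaceBytesPerRow:     @(cols * 2),
    };
    return IOSurfaceCreate((__bridge CFDictionaryRef)props);
}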
MIL: the language the ANE actually speaks
CoreML does not hand neural networks to the ANE as ONNX or protobufs. It uses MIL — Machine Learning Intermediate Language — a typed SSA (Static Single Assignment) representation that looks surprisingly readable:
program(1.3)
[buildInfo = dict<string, string>({
{"coremltools-version", "9.0"}
})]
{
func main<ios18>(
tensor<fp16, [1, 1024, 1, 1024]> x,
tensor<fp16, [1, 1024, 1, 1024]> w
) {
bool tx = const()[val = bool(false)];
bool ty = const()[val = bool(false)];
tensor<fp16, [1, 1024, 1, 1024]> out =
matmul(transpose_x = tx, transpose_y = ty,
x = x, y = w);
} -> (out);
}

Every value is typed with both precision and shape. Operations take keyword arguments. Tensors follow ANE's native NCDHW + Interleave layout: [Batch, Channels, Depth, Height, Width]. A 1024×1024 matrix becomes [1, 1024, 1, 1024] in 4D — a format that looks ridiculous until you realise the ANE's primary compute primitive is convolution, and matmul is expressed as a 1×1 conv.
The E5 binary — and a weird clue about the hardware
When the ANE compiler chews through a MIL program, it emits an E5 binary — a FlatBuffer-structured file the driver hands to the chip.
maderix found something very revealing here. A 1024×1024 matmul compiles to 2,688 bytes. A 128×128 matmul compiles to 2,680 bytes. Nearly identical. The E5 binary is not encoding the matrix multiplication algorithm — it’s encoding a parameterized program whose behavior is controlled by tensor descriptors at runtime.
That detail tells you something deep about the hardware:
The ANE likely has a small set of fixed compute primitives (convolution, matrix multiply, elementwise ops) that are parameterized by tensor shape descriptors. The E5 binary describes which primitives to chain and how to connect them — it isn’t the compute itself.
This is why the ANE is so fast and so restricted. It's not a general-purpose compute engine with an instruction set. It's a pile of hardwired operators with a graph scheduler on top.
The in-memory path: the real unlock
The filesystem-based compile path works fine for inference — compile once, cache the result, run forever. But training is different. Weights change every step. A filesystem round-trip on every update is a non-starter.
The breakthrough for training is _ANEInMemoryModelDescriptor, which accepts MIL text directly in memory:
id desc = [_ANEInMemoryModelDescriptor
modelWithMILText:milData // NSData*, not NSString*!
weights:weightDict // NSDictionary*, not NSData*!
optionsPlist:nil];
id model = [_ANEInMemoryModel
inMemoryModelWithDescriptor:desc];
[model compileWithQoS:21 options:@{} error:&err];
[model loadWithQoS:21 options:@{} error:&err];
[model evaluateWithQoS:21 options:@{}
request:req error:&err];

Getting this working cost maderix several days of debugging, all because of three perfectly obnoxious gotchas:
- NSData, not NSString. The milText parameter wants UTF-8 bytes as NSData*. Pass a string and it fails silently.
- NSDictionary, not NSData. The weights parameter is a dictionary mapping weight names to NSData blobs, not a single blob.
- Temp-directory trap. Even the "in-memory" path internally writes to a temp directory. No writable temp path → opaque failure.
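Put together, the correct construction looks something like this (a sketch; the weight names and buffer variables are placeholders, not the repo's actual identifiers):

// UTF-8 bytes, not an NSString
NSData *milData = [milSource dataUsingEncoding:NSUTF8StringEncoding];
// One named NSData blob per weight, not a single concatenated buffer
NSDictionary *weightDict = @{
    @"w_attn": [NSData dataWithBytes:attnWeights length:attnBytes],
    @"w_mlp":  [NSData dataWithBytes:mlpWeights  length:mlpBytes],
};
id desc = [_ANEInMemoryModelDescriptor modelWithMILText:milData
                                                weights:weightDict
                                           optionsPlist:nil];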
And one small delight from the reverse-engineering notes: Apple’s internal code contains a class with Desctiptor in the name. Even Apple engineers typo private APIs.
The three keys to training
Once the in-memory path works, the whole training pipeline collapses down to three private classes:
- _ANEInMemoryModelDescriptor — create a model from MIL text plus a weight dictionary, all in memory. No .mlmodelc file needed. Entry point.
- _ANEInMemoryModel — take the descriptor and compile it into an executable ANE program at runtime. This is the magic wand: new programs with new weights, on the fly.
- _ANERequest — submit the compiled model with its I/O buffers and get it executing on the hardware.
The repo’s ane_runtime.h wraps these into a clean, fragile, obviously-not-for-App-Store interface:
// Load the private framework at runtime
dlopen("/System/Library/PrivateFrameworks/AppleNeuralEngine.framework/AppleNeuralEngine", RTLD_NOW);
// Resolve the private classes
g_ANEDesc = NSClassFromString(@"_ANEInMemoryModelDescriptor");
g_ANEInMem = NSClassFromString(@"_ANEInMemoryModel");
g_ANEReq = NSClassFromString(@"_ANERequest");
// Build a descriptor from MIL text + weight blob
id desc = ((id(*)(Class,SEL,id,id,id))objc_msgSend)(
g_ANEDesc, @selector(modelWithMILText:weights:optionsPlist:),
milText, weightDict, nil);
// Create and compile the model
id mdl = ((id(*)(Class,SEL,id))objc_msgSend)(
g_ANEInMem, @selector(inMemoryModelWithDescriptor:), desc);
((BOOL(*)(id,SEL,unsigned int,id,NSError**))objc_msgSend)(
mdl, @selector(compileWithQoS:options:error:), 21, @{}, &e);

The direct use of objc_msgSend is the dead giveaway that you're off the beaten path. That's runtime introspection — a conversation with the machine in its own language, asking it to do something its creators never explicitly allowed.
Backpropagation, for real
Forward pass is one thing. A full training loop is another beast.
A single transformer layer’s training step in maderix’s implementation uses six distinct ANE kernels. The ANE does the heavy lifting (matmuls, attention, elementwise activations). The CPU handles the parts that are awkward to express in MIL — gradient accumulation and the Adam optimizer.
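The CPU side is ordinary optimizer arithmetic. As a reference point, a bias-corrected Adam update over a flat parameter buffer looks like this; it's the textbook algorithm with default hyperparameters, not maderix's exact code:

#include <math.h>

// One Adam step over n parameters; t is the 1-based step count.
void adam_step(float *w, const float *g, float *m, float *v,
               int n, int t, float lr) {
    const float b1 = 0.9f, b2 = 0.999f, eps = 1e-8f;
    const float bc1 = 1.0f - powf(b1, (float)t);   // bias corrections
    const float bc2 = 1.0f - powf(b2, (float)t);
    for (int i = 0; i < n; i++) {
        m[i] = b1 * m[i] + (1.0f - b1) * g[i];           // first moment
        v[i] = b2 * v[i] + (1.0f - b2) * g[i] * g[i];    // second moment
        w[i] -= lr * (m[i] / bc1) / (sqrtf(v[i] / bc2) + eps);
    }
}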

The clever trick is how they handle forward taps. Backprop needs intermediate activations from the forward pass. Instead of recomputing or stashing them in a side channel, the forward kernels are rewritten to emit those intermediates as extra outputs, which are then fed straight into the backward kernels. Clean data flow, no hacks.
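In MIL terms, tapping an intermediate just means adding it to the function's result list. A sketch of the idea, extrapolating the two-output syntax from the single-output example above (the real generated programs are more involved):

program(1.3)
{
    func main<ios18>(
        tensor<fp16, [1, 1024, 1, 1024]> x,
        tensor<fp16, [1, 1024, 1, 1024]> w
    ) {
        bool tx = const()[val = bool(false)];
        bool ty = const()[val = bool(false)];
        tensor<fp16, [1, 1024, 1, 1024]> h =
            matmul(transpose_x = tx, transpose_y = ty, x = x, y = w);
        tensor<fp16, [1, 1024, 1, 1024]> y = relu(x = h);
    } -> (y, h);
}

Here h, the pre-activation, is returned alongside y so the backward kernel can consume it directly.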
A second MIL-level generator (ane_mil_gen.h) spits out fresh MIL program strings on the fly, parameterized by the model's dimensions. It's a compiler for a compiler, running at training time, every step.
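The generator itself can be as simple as string formatting. A toy version of the idea (ane_mil_gen.h is maderix's; this helper is my illustration):

// Emit a MIL program for an m×k by k×n matmul, in the ANE's 4D layout.
static NSString *MILMatmul(int m, int k, int n) {
    return [NSString stringWithFormat:
        @"program(1.3)\n"
         "{\n"
         "  func main<ios18>(\n"
         "    tensor<fp16, [1, %d, 1, %d]> x,\n"
         "    tensor<fp16, [1, %d, 1, %d]> w\n"
         "  ) {\n"
         "    bool tx = const()[val = bool(false)];\n"
         "    bool ty = const()[val = bool(false)];\n"
         "    tensor<fp16, [1, %d, 1, %d]> out =\n"
         "      matmul(transpose_x = tx, transpose_y = ty, x = x, y = w);\n"
         "  } -> (out);\n"
         "}", m, k, k, n, m, n];
}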
The 119-compile cliff
Here's the hardware quirk that nearly killed the project.
After roughly 119 compilations in a single process, the ANE compiler leaks enough resources that it simply gives up. Subsequent compiles fail. For inference this is invisible — you compile once. For training, where you recompile whenever weights change, it's a death sentence.
The workaround is as brutal as it is beautiful: every 10 gradient steps, the training program calls exec() to replace itself with a fresh copy, loading from a checkpoint.

Process suicide as garbage collection. The compile budget resets because the process is, by every meaningful definition, new. This is the kind of fix that only comes from the long, patient kind of reverse engineering where you stop looking for a clean API and start negotiating with the hardware.
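Mechanically, the respawn is only a few lines. A sketch based on the writeup's description (the checkpoint helper is hypothetical; the 10-step cadence is from the post):

#include <unistd.h>

extern void save_checkpoint(int step);   // persists weights + Adam state (hypothetical helper)

static void maybe_respawn(int step, char **argv) {
    if (step > 0 && step % 10 == 0) {
        save_checkpoint(step);
        execv(argv[0], argv);   // replace this process image; the compile budget resets
        _exit(1);               // execv only returns on failure
    }
}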
What this actually unlocks
The ANE training repo is a research prototype, not a production framework. But sit with what it implies for a minute.

- Hyper-personalization without the cloud. A keyboard that learns your slang. A photo editor that learns your aesthetic. A health app that adapts to your body. On-device, private, continuous.
- Privacy-first federated learning. Shared models trained across millions of devices without any device ever uploading raw data. The ANE's power efficiency is what would make this tolerable.
- Creative tools that learn your style. Drawing apps where the brush is a small net that learns your stroke. Music tools that fine-tune their suggestions as you use them.
- On-device agents that actually learn. Not a cloud-side fine-tune. Not a prompt trick. A real model that gets a little better because it lived on your device.
What remains unknown
One of the things I appreciate most about the writeup is that maderix is clear about what they don't know. Several classes surfaced during discovery that remain unexplored:
- _ANEChainingRequest — possibly a way to chain multiple compiled models in a single dispatch, which would shred the current per-layer compile overhead.
- _ANESharedEvents, _ANESharedSignalEvent, _ANESharedWaitEvent — Metal-style fences and signals, which would enable real GPU↔ANE synchronization.
- _ANEPerformanceStats — almost certainly hardware performance counters.
- _ANEVirtualClient — virtualized ANE access, probably for multi-process sharing.
And at the hardware level, a lot is still guesswork: the exact ANE core microarchitecture and ISA, how cores are assigned to operations within a graph, the actual clock under DVFS, whether hardware perf counters are exposed, the SRAM topology. The chip is a black box with a very well-characterised surface.

Why this matters
The reason this project matters isn't the 109M-parameter transformer or the 107ms step time. Those are cool. They're not the point.
The point is that the ANE has been capable of this all along. The hardware was never the constraint. The constraint was a software decision made somewhere in Cupertino, communicated through silence and private frameworks — "this is for inference."
maderix's work is a reminder that the boundary between possible and permitted is almost always thinner than it looks. Private APIs are still APIs. Undocumented behavior is still behavior. The ANE was always a training accelerator. It just needed someone patient enough to ask.
Read maderix's original writeup — "Inside the M4 Apple Neural Engine, Part 1: Reverse Engineering" Explore the ANE Training repository on GitHub
Acknowledgements
All of the reverse engineering, benchmarks, and training code discussed in this post are the work of maderix (in collaboration with Claude Opus 4.6). This post is a walkthrough and interpretation of that work — the engineering is entirely theirs. Thanks also to prior art from hollance/neural-engine, mdaiter/ane, eiln/ane, and apple/ml-ane-transformers, which maderix builds on.