A developer has successfully reverse-engineered Apple’s Neural Engine (ANE) on M4 silicon and used it to train a transformer model — something Apple has never officially supported. The open-source project bypasses CoreML entirely, talking directly to the hardware through private APIs to achieve 9.3 milliseconds per training step at 1.78 TFLOPS sustained throughput. It’s the first public demonstration of training — not just inference — on Apple’s locked-down neural accelerator.
Apple’s Neural Engine is a fixed-function accelerator embedded in every Apple Silicon chip. On the M4, the ANE (codename H16G) packs 16 cores and is rated at 38 TOPS — but Apple only exposes it through CoreML for inference workloads. There are no public APIs for training, no Metal compute path, and no official documentation of the hardware’s internal architecture.
The researcher, known as maderix, spent months mapping the full software stack from CoreML down to the IOKit kernel driver. The work uncovered over 40 private classes in AppleNeuralEngine.framework, including _ANEClient, _ANECompiler, and the critical _ANEInMemoryModelDescriptor — a class that enables runtime recompilation without filesystem I/O. This last piece was the key to making training practical: without it, every weight update would require a slow round-trip through the filesystem.
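The class names above come from the article; whether they are visible on a given machine can be checked with the Objective-C runtime. Below is a minimal, hedged probe sketch in Python using `ctypes` and the real `objc_getClass` runtime call; the framework path is an assumption about where `AppleNeuralEngine.framework` lives, and the script simply reports "not found" on non-macOS systems rather than failing.

```python
import ctypes
import ctypes.util
import sys

# Assumed location of the private framework; not documented by Apple.
ANE_FRAMEWORK = (
    "/System/Library/PrivateFrameworks/"
    "AppleNeuralEngine.framework/AppleNeuralEngine"
)

def find_private_class(name: str):
    """Return a non-zero ObjC Class pointer if the private class is
    registered on this system, else None."""
    if sys.platform != "darwin":
        return None  # ObjC runtime only exists on Apple platforms
    try:
        # Loading the framework registers its classes with the runtime.
        ctypes.CDLL(ANE_FRAMEWORK)
    except OSError:
        return None
    objc = ctypes.CDLL(ctypes.util.find_library("objc"))
    objc.objc_getClass.restype = ctypes.c_void_p
    objc.objc_getClass.argtypes = [ctypes.c_char_p]
    return objc.objc_getClass(name.encode()) or None

for cls_name in ("_ANEClient", "_ANECompiler", "_ANEInMemoryModelDescriptor"):
    found = find_private_class(cls_name)
    print(cls_name, "present" if found else "not found on this system")
```

This only confirms that a class is registered; calling its methods would require the kind of reverse-engineered method signatures the project documents.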
The training pipeline implements a single transformer layer (768-dimensional, 512 sequence length) using six specialized ANE kernels that handle attention, SwiGLU feedforward networks, RMSNorm, and their corresponding backward passes. The forward and backward passes execute entirely on the ANE, while weight gradient computation runs on the CPU via Apple’s Accelerate framework.
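RMSNorm and SwiGLU themselves are public, well-documented building blocks. For readers unfamiliar with them, here is a plain NumPy reference implementation at the article's stated dimensions (768-dim, 512 sequence length); this is a sketch of the standard math, not the project's ANE kernels, and the FFN hidden width of 2048 is my assumption since the article does not state it.

```python
import numpy as np

D_MODEL, SEQ_LEN = 768, 512  # dimensions quoted in the article

def rms_norm(x, weight, eps=1e-6):
    # Normalize each row by its root-mean-square, then scale.
    # Unlike LayerNorm, there is no mean-centering and no bias.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

def swiglu(x, w_gate, w_up, w_down):
    # SwiGLU feedforward: SiLU(x @ w_gate) gates (x @ w_up),
    # and the result is projected back to the model dimension.
    gate = x @ w_gate
    silu = gate / (1.0 + np.exp(-gate))  # SiLU(a) = a * sigmoid(a)
    return (silu * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
hidden = 2048  # assumed FFN width; not given in the article
x = rng.standard_normal((SEQ_LEN, D_MODEL)).astype(np.float32)
y = rms_norm(x, np.ones(D_MODEL, dtype=np.float32))
out = swiglu(
    y,
    rng.standard_normal((D_MODEL, hidden)).astype(np.float32) * 0.02,
    rng.standard_normal((D_MODEL, hidden)).astype(np.float32) * 0.02,
    rng.standard_normal((hidden, D_MODEL)).astype(np.float32) * 0.02,
)
print(out.shape)  # (512, 768)
```

The project's contribution is not these formulas but getting their forward and backward passes to execute as ANE kernels.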
Along the way, the reverse-engineering effort surfaced key technical details of the stack, from the private class inventory in AppleNeuralEngine.framework to the in-memory recompilation path that makes rapid weight updates practical.

On an M4 Mac Mini running macOS 15.x, the system achieves 9.3 ms per training step at 11.2% ANE utilization (1.78 TFLOPS sustained). The step time was brought down from an initial 33.5 ms baseline through vectorized normalization (a 10× speedup on RMSNorm) and asynchronous compute overlap between the ANE and the CPU.
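These figures can be sanity-checked with simple arithmetic. The numbers below come from the article; the derivation (work per step implied by sustained throughput, and the overall speedup over the baseline) is my own back-of-envelope calculation.

```python
# Figures reported in the article.
step_ms = 9.3        # time per training step, milliseconds
tflops = 1.78        # sustained ANE throughput, TFLOPS
baseline_ms = 33.5   # unoptimized step time, milliseconds

# Sustained throughput times step duration gives the ANE work per step.
work_per_step_gflop = tflops * 1e12 * (step_ms * 1e-3) / 1e9
print(f"ANE work per step: {work_per_step_gflop:.1f} GFLOP")  # ≈ 16.6 GFLOP

# Overall speedup from the optimization passes.
print(f"speedup vs. baseline: {baseline_ms / step_ms:.1f}x")  # ≈ 3.6x
```

So each 9.3 ms step performs roughly 16.6 GFLOP of ANE work, and the vectorization plus compute-overlap optimizations delivered about a 3.6× end-to-end speedup.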
Current limitations are significant but expected for a first-of-its-kind effort: only a single transformer layer is supported, multi-layer training requires pipeline scheduling that hasn't been implemented yet, and the ANE service permits only ~119 compilations per process, so long training runs must restart the process periodically as a workaround. The project also currently uses synthetic data rather than real training datasets.
Still, this is Part 1 of a planned three-part series, with future installments promising detailed benchmarking and expanded training experiments.
Every Mac, iPad, and iPhone shipped since 2020 contains an ANE — hardware that has been essentially off-limits for anything beyond Apple’s own inference workloads. By demonstrating that training is technically possible on this silicon, the project opens a conversation about what millions of existing Apple devices could contribute to on-device machine learning beyond inference.
The work also connects to a broader trend: as AI accelerators proliferate in consumer devices — from AMD’s NPUs (FastFlowLM on AMD Ryzen AI NPUs) to Qualcomm’s Hexagon processors — the community continues to find ways to unlock their full potential, even when manufacturers keep the doors locked.
The full code is available under the MIT license on GitHub, tested on M4 Mac Mini with macOS 15.x. No external dependencies are required beyond system frameworks.
