A developer has successfully reverse-engineered Apple’s Neural Engine (ANE) on M4 silicon and used it to train a transformer model — something Apple has never officially supported. The open-source project bypasses CoreML entirely, talking directly to the hardware through private APIs to achieve 9.3 milliseconds per training step at 1.78 TFLOPS sustained throughput. It’s the first public demonstration of training — not just inference — on Apple’s locked-down neural accelerator.
Apple’s Neural Engine is a fixed-function accelerator embedded in every Apple Silicon chip. On the M4, the ANE (codename H16G) packs 16 cores and is rated at 38 TOPS — but Apple only exposes it through CoreML for inference workloads. There are no public APIs for training, no Metal compute path, and no official documentation of the hardware’s internal architecture.
The researcher, known as maderix, spent months mapping the full software stack from CoreML down to the IOKit kernel driver. The work uncovered over 40 private classes in AppleNeuralEngine.framework, including _ANEClient, _ANECompiler, and the critical _ANEInMemoryModelDescriptor — a class that enables runtime recompilation without filesystem I/O. This last piece was the key to making training practical: without it, every weight update would require a slow round-trip through the filesystem.
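The class names above come from the article; whether they are visible on a given machine can be checked with the Objective-C runtime. Below is a minimal, hedged probe sketch in Python using `ctypes` and the real `objc_getClass` runtime call; the framework path is an assumption about where `AppleNeuralEngine.framework` lives, and the script simply reports "not found" on non-macOS systems rather than failing.

```python
import ctypes
import ctypes.util
import sys

# Assumed location of the private framework; not documented by Apple.
ANE_FRAMEWORK = (
    "/System/Library/PrivateFrameworks/"
    "AppleNeuralEngine.framework/AppleNeuralEngine"
)

def find_private_class(name: str):
    """Return a non-zero ObjC Class pointer if the private class is
    registered on this system, else None."""
    if sys.platform != "darwin":
        return None  # ObjC runtime only exists on Apple platforms
    try:
        # Loading the framework registers its classes with the runtime.
        ctypes.CDLL(ANE_FRAMEWORK)
    except OSError:
        return None
    objc = ctypes.CDLL(ctypes.util.find_library("objc"))
    objc.objc_getClass.restype = ctypes.c_void_p
    objc.objc_getClass.argtypes = [ctypes.c_char_p]
    return objc.objc_getClass(name.encode()) or None

for cls_name in ("_ANEClient", "_ANECompiler", "_ANEInMemoryModelDescriptor"):
    found = find_private_class(cls_name)
    print(cls_name, "present" if found else "not found on this system")
```

This only confirms that a class is registered; calling its methods would require the kind of reverse-engineered method signatures the project documents.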
The training pipeline implements a single transformer layer (768-dimensional, 512 sequence length) using six specialized ANE kernels that handle attention, SwiGLU feedforward networks, RMSNorm, and their corresponding backward passes. The forward and backward passes execute entirely on the ANE, while weight gradient computation runs on the CPU via Apple’s Accelerate framework.
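RMSNorm and SwiGLU themselves are public, well-documented building blocks. For readers unfamiliar with them, here is a plain NumPy reference implementation at the article's stated dimensions (768-dim, 512 sequence length); this is a sketch of the standard math, not the project's ANE kernels, and the FFN hidden width of 2048 is my assumption since the article does not state it.

```python
import numpy as np

D_MODEL, SEQ_LEN = 768, 512  # dimensions quoted in the article

def rms_norm(x, weight, eps=1e-6):
    # Normalize each row by its root-mean-square, then scale.
    # Unlike LayerNorm, there is no mean-centering and no bias.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

def swiglu(x, w_gate, w_up, w_down):
    # SwiGLU feedforward: SiLU(x @ w_gate) gates (x @ w_up),
    # and the result is projected back to the model dimension.
    gate = x @ w_gate
    silu = gate / (1.0 + np.exp(-gate))  # SiLU(a) = a * sigmoid(a)
    return (silu * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
hidden = 2048  # assumed FFN width; not given in the article
x = rng.standard_normal((SEQ_LEN, D_MODEL)).astype(np.float32)
y = rms_norm(x, np.ones(D_MODEL, dtype=np.float32))
out = swiglu(
    y,
    rng.standard_normal((D_MODEL, hidden)).astype(np.float32) * 0.02,
    rng.standard_normal((D_MODEL, hidden)).astype(np.float32) * 0.02,
    rng.standard_normal((hidden, D_MODEL)).astype(np.float32) * 0.02,
)
print(out.shape)  # (512, 768)
```

The project's contribution is not these formulas but getting their forward and backward passes to execute as ANE kernels.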
Along the way, the reverse-engineering effort surfaced key technical details of the stack, from the private class inventory in AppleNeuralEngine.framework to the in-memory recompilation path that makes rapid weight updates practical.

On an M4 Mac Mini running macOS 15.x, the system achieves 9.3 ms per training step at 11.2% ANE utilization (1.78 TFLOPS sustained). The step time was brought down from an initial 33.5 ms baseline through vectorized normalization (a 10× speedup on RMSNorm) and asynchronous compute overlap between the ANE and the CPU.
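These figures can be sanity-checked with simple arithmetic. The numbers below come from the article; the derivation (work per step implied by sustained throughput, and the overall speedup over the baseline) is my own back-of-envelope calculation.

```python
# Figures reported in the article.
step_ms = 9.3        # time per training step, milliseconds
tflops = 1.78        # sustained ANE throughput, TFLOPS
baseline_ms = 33.5   # unoptimized step time, milliseconds

# Sustained throughput times step duration gives the ANE work per step.
work_per_step_gflop = tflops * 1e12 * (step_ms * 1e-3) / 1e9
print(f"ANE work per step: {work_per_step_gflop:.1f} GFLOP")  # ≈ 16.6 GFLOP

# Overall speedup from the optimization passes.
print(f"speedup vs. baseline: {baseline_ms / step_ms:.1f}x")  # ≈ 3.6x
```

So each 9.3 ms step performs roughly 16.6 GFLOP of ANE work, and the vectorization plus compute-overlap optimizations delivered about a 3.6× end-to-end speedup.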
Current limitations are significant but expected for a first-of-its-kind effort: only a single transformer layer is supported, multi-layer training requires pipeline scheduling that hasn't been implemented yet, and the ANE service permits only ~119 compilations per process, so long training runs must restart the process periodically as a workaround. The project also currently uses synthetic data rather than real training datasets.
Still, this is Part 1 of a planned three-part series, with future installments promising detailed benchmarking and expanded training experiments.
Every Mac, iPad, and iPhone shipped since 2020 contains an ANE — hardware that has been essentially off-limits for anything beyond Apple’s own inference workloads. By demonstrating that training is technically possible on this silicon, the project opens a conversation about what millions of existing Apple devices could contribute to on-device machine learning beyond inference.
The work also connects to a broader trend: as AI accelerators proliferate in consumer devices — from AMD’s NPUs (FastFlowLM on AMD Ryzen AI NPUs) to Qualcomm’s Hexagon processors — the community continues to find ways to unlock their full potential, even when manufacturers keep the doors locked.
The full code is available under the MIT license on GitHub, tested on M4 Mac Mini with macOS 15.x. No external dependencies are required beyond system frameworks.
