Apple recently introduced a novel Parallel-Track Mixture‑of‑Experts (PT‑MoE) architecture for their server-side language models, aiming to dramatically improve scalability and efficiency while preserving model quality (machinelearning.apple.com).
Rather than deploying a monolithic Transformer, PT‑MoE splits the server model into multiple “tracks”—each track is its own smaller Transformer stack with its own MoE layers. Tokens are processed independently within each track, and tracks synchronize only at the input and output boundaries of each track block. This design diverges from standard MoE models, which tightly interleave experts within a single Transformer pipeline and must synchronize at every expert layer (machinelearning.apple.com).
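The track structure can be sketched in a few lines. This is a toy illustration, not Apple's implementation: `track_block` stands in for a small Transformer segment, the averaging merge at block boundaries is an assumed synchronization rule, and all dimensions and depths are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def track_block(x, depth, dim):
    """Stand-in for one track's Transformer segment: `depth` toy layers
    (a tanh projection here substitutes for attention + MoE FFN)."""
    for _ in range(depth):
        w = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        x = np.tanh(x @ w)
    return x

def pt_moe_forward(x, n_tracks=2, n_blocks=3, block_depth=4):
    """Each track runs `block_depth` layers with no cross-track
    communication; tracks synchronize (here: average their hidden
    states, an assumption) only once per block boundary."""
    dim = x.shape[-1]
    for _ in range(n_blocks):
        outs = [track_block(x, block_depth, dim) for _ in range(n_tracks)]
        x = np.mean(outs, axis=0)  # sync point: once per block, not per layer
    return x

h = pt_moe_forward(rng.standard_normal((8, 16)))
```

The point of the sketch is the communication pattern: with 3 blocks of depth 4, the tracks exchange information 3 times instead of the ~24 per-layer exchanges an interleaved MoE of the same total depth would need.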
Apple also unveiled a compact on‑device model whose Transformer is split into two blocks in a 5:3 depth ratio, with the second block reusing the first block's KV cache, and optimized for Core ML. The PT‑MoE design, however, is specific to powerful server setups aimed at handling more complex workloads (machinelearning.apple.com).
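The KV‑cache sharing idea can be illustrated with a minimal sketch. Everything below is assumed for illustration: the dimensions, the toy attention function, and in particular the mapping from second‑block layers to cached first‑block entries, which the source does not specify. What the sketch does capture is the accounting: if 3 of 8 layers carry no KV projections, the cache holds 5 entries instead of 8.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM, SEQ = 8, 4

def attend(q, k, v):
    """Plain scaled dot-product attention over given keys/values."""
    scores = q @ k.T / np.sqrt(DIM)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

kv_cache = {}
x = rng.standard_normal((SEQ, DIM))

# Block 1: 5 layers, each with its own K/V projections, filling the cache.
for layer in range(5):
    k = x @ (rng.standard_normal((DIM, DIM)) / np.sqrt(DIM))
    v = x @ (rng.standard_normal((DIM, DIM)) / np.sqrt(DIM))
    kv_cache[layer] = (k, v)
    x = attend(x, k, v)

# Block 2: 3 layers with no K/V projections; each reuses a block-1 entry
# (which entry maps to which layer is a guess here).
for layer in range(3):
    k, v = kv_cache[layer]  # shared cache, no new K/V computed or stored
    x = attend(x, k, v)
```

Under this split, KV‑cache memory shrinks by 3/8 = 37.5% relative to caching every layer, which is the kind of saving that matters for an on‑device deployment.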
| Feature | Traditional MoE | PT‑MoE (Apple) |
|---|---|---|
| Model structure | Single transformer stack | Multiple independent tracks |
| MoE placement | Within each layer | Within each track block |
| Synchronization points | Every expert layer (≈2L for L layers: dispatch + combine) | At block boundaries (≈L/D for blocks of depth D) |
| Parallelism | Expert + tensor + data | Track-level + standard parallelisms |
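The synchronization counts in the table can be made concrete with a back‑of‑the‑envelope calculation; the 48‑layer, depth‑8 figures are illustrative, not Apple's configuration.

```python
def sync_points_traditional(n_layers):
    """Interleaved MoE: expert all-to-all communication roughly twice
    per layer (token dispatch and result combine), i.e. about 2L."""
    return 2 * n_layers

def sync_points_pt_moe(n_layers, block_depth):
    """PT-MoE: tracks synchronize only at block boundaries,
    i.e. about L / D for blocks of depth D."""
    return n_layers // block_depth

# Example: a 48-layer model with track blocks 8 layers deep.
traditional = sync_points_traditional(48)   # 96 sync points
parallel_track = sync_points_pt_moe(48, 8)  # 6 sync points
```

A 16× reduction in synchronization points is the mechanism behind the efficiency claim: fewer global barriers means less time spent waiting on cross‑device communication.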
PT‑MoE offers a compelling twist on MoE efficiency: by parallelizing at the track level, Apple reduces synchronization overhead and unlocks stronger scalability without sacrificing model quality. This could influence next-generation architectures, inspiring other large-scale systems to blend MoE with track-level parallelism.