ZML: A Zig-Based Inference Engine Bringing LLMs to AMD GPUs

ZML, a Paris-based open-source project, is gaining traction as a production inference stack written almost entirely in Zig — bypassing the Python and PyTorch dependency chains that dominate AI infrastructure. With its v2 release on March 24, 2026, ZML now compiles large language models directly onto NVIDIA, AMD, Google TPU, and AWS Trainium hardware from a single codebase, making it one of the most hardware-agnostic inference engines available today.

Stylized GPU chip with neural network graph visualization representing ZML's hardware-level AI inference compilation
Illustration generated by AI

What Is ZML?

ZML takes a fundamentally different approach to LLM inference. Rather than wrapping Python around CUDA kernels — as vLLM, Ollama, and most other inference servers do — ZML uses the Zig programming language (which makes up 92.7% of its codebase) combined with MLIR and OpenXLA to compile model computation graphs into standalone native binaries. The result is a runtime with zero Python dependencies, minimal memory overhead, and direct hardware access.

ZML project banner — Model to Metal
Image credit: ZML GitHub Repository

The project’s tagline — “Model to Metal” — captures its philosophy: explicit over implicit, composability over monolithic systems, and predictability over magic. ZML currently supports Llama 3.1/3.2, Qwen 3.5, and LFM 2.5 model families, with its LLMD inference server offering an OpenAI-compatible API in a remarkably compact 2.4 GB container image.
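Because LLMD advertises OpenAI compatibility, existing OpenAI-style clients should work against it unchanged. The sketch below builds a standard chat-completions request using only the Python standard library; the host, port, and model name are assumptions for illustration, not values documented by ZML.

```python
import json
import urllib.request

# Hypothetical local endpoint: LLMD's actual host/port depend on how the
# container is launched. The request body follows the OpenAI
# chat-completions convention the server claims compatibility with.
BASE_URL = "http://localhost:8080/v1/chat/completions"

def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Assemble an OpenAI-style chat-completions request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    return urllib.request.Request(
        BASE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("llama-3.1-8b-instruct", "Explain MLIR in one sentence.")
# Actually sending it requires a running LLMD instance:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

The point of the sketch is that no vendor SDK is needed: any HTTP client that speaks the OpenAI wire format can drive the server.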

ZML v2: A Complete Rewrite

The v2 release represents a ground-up rewrite focused on making platform ownership, compilation, memory management, and device placement first-class concepts rather than hidden abstractions. Key architectural changes include:

  • Platform abstraction: A unified zml.Platform API handles accelerator selection, data transfer, compilation, and execution across NVIDIA CUDA, AMD ROCm, TPU, and Trainium — all from the same code path.
  • Pinned memory and zero-copy I/O: A new DmaAllocator eliminates unnecessary memory copies, with overlapped data transfers via MemoryWriter. ZML demonstrated loading 14.96 GiB of model weights in 1.165 seconds (12.83 GiB/s throughput).
  • Pluggable attention backends: Automatic selection of FlashAttention 2 or 3 on CUDA (sm80–sm121), and AITER kernels on AMD ROCm — no manual configuration needed.
  • Virtual filesystem: Models load directly from local files, HTTP endpoints, S3, or Hugging Face without staging to disk first.
  • Hermetic builds: A fully sandboxed LLVM toolchain enables reproducible builds and cross-compilation, with support for remote execution via BuildBuddy or NativeLink.
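The weight-loading figures quoted above are internally consistent; a quick back-of-envelope check (plain arithmetic, not ZML code):

```python
# Sanity-check the quoted DmaAllocator figures:
# 14.96 GiB transferred in 1.165 s should be roughly 12.8 GiB/s.
weights_gib = 14.96
seconds = 1.165

throughput = weights_gib / seconds
print(f"{throughput:.2f} GiB/s")  # 12.84 GiB/s, matching the quoted 12.83 to rounding

# For scale: at this rate a 70B-parameter model in bf16 (~130 GiB)
# would load in roughly ten seconds.
print(f"{130 / throughput:.1f} s")
```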

Why AMD GPUs Matter Here

ZML’s cross-platform compilation is particularly significant for AMD GPU users. Consumer AMD cards like the RX 7900 XTX (24 GB VRAM, often available under $1,000) and the RX 7800 XT (16 GB VRAM, around $450–550) have long been second-class citizens in the AI inference ecosystem due to CUDA lock-in. ZML compiles the same model graph to AMD’s ROCm stack without requiring separate code paths or manual kernel porting.

Community benchmarks show the RX 7900 XTX running LLM inference at roughly 80–90% of RTX 4090 throughput for comparable model sizes. For budget-conscious researchers and hobbyists, this means running quantized 35B-parameter models on hardware costing a fraction of NVIDIA’s data center GPUs — a proposition that ZML’s native ROCm support makes significantly more accessible.
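A rough VRAM estimate shows why the 35B-on-consumer-hardware claim is plausible. The quantization width and overhead factor below are generic assumptions for illustration, not numbers published by ZML:

```python
# Back-of-envelope VRAM estimate for a quantized 35B-parameter model.
# Assumptions (not from ZML): 4-bit weight quantization and ~20%
# headroom for KV cache, activations, and runtime overhead.
params = 35e9
bytes_per_param = 0.5           # 4-bit quantization
weights_gib = params * bytes_per_param / 2**30

overhead = 1.2                  # assumed 20% headroom
total_gib = weights_gib * overhead

print(f"weights: {weights_gib:.1f} GiB, with overhead: {total_gib:.1f} GiB")
# ~16.3 GiB of weights, ~19.6 GiB total: this fits the RX 7900 XTX's
# 24 GiB budget, but not the RX 7800 XT's 16 GiB.
```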

Current Limitations

ZML is still in alpha, and the LLMD inference server is explicitly labeled as a technical preview. Current limitations include single-GPU-only operation (no multi-GPU sharding), a maximum batch size of 16, no prefix caching, and support limited to Llama and Qwen model architectures. The project describes itself as a “build-your-own-stack” tool rather than a drop-in replacement for established servers — positioning it squarely for ML systems engineers comfortable with low-level infrastructure.

What This Means

ZML represents a broader trend in AI infrastructure: the move away from Python-centric, NVIDIA-exclusive toolchains toward compiled, hardware-portable runtimes. By building on Zig and MLIR rather than PyTorch and CUDA, ZML trades ecosystem maturity for performance predictability and true hardware agnosticism. With 3,300+ GitHub stars and an active contributor community, the project is one to watch — especially as AMD’s ROCm ecosystem continues to mature and consumer GPU hardware becomes an increasingly viable platform for local AI inference.
