ZML, a Paris-based open-source project, is gaining traction as a production inference stack written almost entirely in Zig — bypassing the Python and PyTorch dependency chains that dominate AI infrastructure. With its v2 release on March 24, 2026, ZML now compiles large language models into native binaries for NVIDIA, AMD, Google TPU, and AWS Trainium hardware from a single codebase, making it one of the most hardware-agnostic inference engines available today.
ZML takes a fundamentally different approach to LLM inference. Rather than wrapping Python around CUDA kernels — as vLLM, Ollama, and most other inference servers do — ZML uses the Zig programming language (which makes up 92.7% of its codebase) combined with MLIR and OpenXLA to compile model computation graphs into standalone native binaries. The result is a runtime with zero Python dependencies, minimal memory overhead, and direct hardware access.
The project’s tagline — “Model to Metal” — captures its philosophy: explicit over implicit, composability over monolithic systems, and predictability over magic. ZML currently supports Llama 3.1/3.2, Qwen 3.5, and LFM 2.5 model families, with its LLMD inference server offering an OpenAI-compatible API in a remarkably compact 2.4 GB container image.
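Because LLMD exposes an OpenAI-compatible API, requests follow the familiar chat-completions shape. The sketch below builds such a request body; the endpoint URL and model name are placeholders, not values documented by ZML.

```python
import json

# Hypothetical local endpoint; LLMD's actual host and port may differ.
LLMD_URL = "http://localhost:8080/v1/chat/completions"

# An OpenAI-style chat-completion request body. Any OpenAI client
# library pointed at LLMD_URL should send an equivalent payload.
payload = {
    "model": "llama-3.2-1b",  # placeholder model name
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize ZML in one sentence."},
    ],
    "max_tokens": 128,
    "temperature": 0.7,
}

body = json.dumps(payload)
print(body)
```

In practice this body would be POSTed to the server with any HTTP client; the point is that no ZML-specific client code is needed.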
The v2 release represents a ground-up rewrite focused on making platform ownership, compilation, memory management, and device placement first-class concepts rather than hidden abstractions. Key architectural changes include:
- zml.Platform API handles accelerator selection, data transfer, compilation, and execution across NVIDIA CUDA, AMD ROCm, TPU, and Trainium — all from the same code path.
- DmaAllocator eliminates unnecessary memory copies, with overlapped data transfers via MemoryWriter. ZML demonstrated loading 14.96 GiB of model weights in 1.165 seconds (12.83 GiB/s throughput).

ZML’s cross-platform compilation is particularly significant for AMD GPU users. Consumer AMD cards like the RX 7900 XTX (24 GB VRAM, often available under $1,000) and the RX 7800 XT (16 GB VRAM, around $450–550) have long been second-class citizens in the AI inference ecosystem due to CUDA lock-in. ZML compiles the same model graph to AMD’s ROCm stack without requiring separate code paths or manual kernel porting.
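The quoted throughput follows, to within rounding, from dividing the transfer size by the elapsed time:

```python
# Back-of-envelope check of the weight-loading figure quoted above:
# throughput = data moved / elapsed time.
size_gib = 14.96   # model weights loaded, in GiB
elapsed_s = 1.165  # reported load time, in seconds

throughput_gib_s = size_gib / elapsed_s
print(f"{throughput_gib_s:.2f} GiB/s")  # ~12.84 GiB/s, matching the ~12.83 GiB/s claim to within rounding
```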
Community benchmarks show the RX 7900 XTX running LLM inference at roughly 80–90% of RTX 4090 throughput for comparable model sizes. For budget-conscious researchers and hobbyists, this means running quantized 35B-parameter models on hardware costing a fraction of NVIDIA’s data center GPUs — a proposition that ZML’s native ROCm support makes significantly more accessible.
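A rough memory estimate shows why a 24 GB card is enough for this scenario. The figures below assume 4-bit weight quantization and ignore KV cache and activations, which add several more gigabytes depending on context length and batch size.

```python
# Rough VRAM estimate for the quantized 35B-parameter scenario above.
# Assumption: 4-bit weights; KV cache and activation memory are ignored.
params = 35e9
bytes_per_param = 0.5  # 4-bit quantization = half a byte per weight
weights_gb = params * bytes_per_param / 1e9

rx_7900_xtx_vram_gb = 24  # VRAM on the RX 7900 XTX
print(f"weights: {weights_gb:.1f} GB of {rx_7900_xtx_vram_gb} GB VRAM")
```

At roughly 17.5 GB of weights, the model fits with headroom to spare on a 24 GB card, though a 16 GB card like the RX 7800 XT would need a smaller model or more aggressive quantization.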
ZML is still in alpha, and the LLMD inference server is explicitly labeled as a technical preview. Current limitations include single-GPU-only operation (no multi-GPU sharding), a maximum batch size of 16, no prefix caching, and support limited to Llama and Qwen model architectures. The project describes itself as a “build-your-own-stack” tool rather than a drop-in replacement for established servers — positioning it squarely for ML systems engineers comfortable with low-level infrastructure.
ZML represents a broader trend in AI infrastructure: the move away from Python-centric, NVIDIA-exclusive toolchains toward compiled, hardware-portable runtimes. By building on Zig and MLIR rather than PyTorch and CUDA, ZML trades ecosystem maturity for performance predictability and true hardware agnosticism. With 3,300+ GitHub stars and an active contributor community, the project is one to watch — especially as AMD’s ROCm ecosystem continues to mature and consumer GPU hardware becomes an increasingly viable platform for local AI inference.
