On March 26, 2026, NVIDIA released gpt-oss-puzzle-88B — a deployment-optimized version of OpenAI’s gpt-oss-120B reasoning model that delivers up to 2.82× faster inference while matching or exceeding the original model’s accuracy. Created using NVIDIA’s Puzzle neural architecture search framework, the 88-billion-parameter model fits on a single H100 GPU and represents a significant step forward in making frontier reasoning models practical for production deployment.
The gpt-oss-puzzle-88B model was created with Puzzle, NVIDIA’s post-training neural architecture search (NAS) framework, which restructures an already-trained model’s architecture to improve inference efficiency on target hardware.
After architecture optimization, the model undergoes knowledge distillation on 84 billion tokens at 128K sequence length using the Megatron-LM framework, followed by multi-environment reinforcement learning across math, coding, and reasoning tasks.
The results are striking. Across NVIDIA’s evaluation suite — including MMLU-Pro, GPQA-Diamond, AIME25, SciCode, and RULER 128K — the optimized model achieves accuracy retention between 100.8% and 108.2% compared to the parent gpt-oss-120B, meaning it actually improves on several benchmarks despite being 27% smaller.
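Accuracy retention here is simply the optimized model’s benchmark score expressed as a percentage of the parent model’s score, so figures above 100% mean the smaller model beat its parent on that benchmark. A minimal sketch, using hypothetical scores for illustration:

```python
def accuracy_retention(optimized_score: float, parent_score: float) -> float:
    """Optimized model's score as a percentage of the parent model's score.

    Values above 100.0 mean the optimized (smaller) model outperforms
    its parent on that benchmark.
    """
    return 100.0 * optimized_score / parent_score

# Hypothetical benchmark scores, for illustration only.
print(round(accuracy_retention(81.6, 80.0), 1))  # 102.0 -> above 100%
```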
The throughput gains depend on the deployment scenario:
| Configuration | Throughput speedup vs. gpt-oss-120B |
|---|---|
| Long-context (64K-token input / 64K-token output) on 8×H100 | 1.63× |
| Short-context (4K-token input / 4K-token output) on 8×H100 | 1.22× |
| Long-context on a single H100 | 2.82× |
| Short-context on a single H100 | 2.44× |
The single-GPU results are particularly noteworthy: a 2.82× throughput improvement brings frontier-class reasoning within reach on a single H100, hardware that many research labs and enterprises already have.
Like its parent, gpt-oss-puzzle-88B supports three reasoning effort levels — low, medium, and high — allowing developers to trade compute for accuracy on a per-request basis. This is especially relevant for cost-aware production deployments where not every query requires deep multi-step reasoning. The model is compatible with standard inference stacks including Hugging Face Transformers (v4.57.3+) and vLLM, and can be served with a single vllm serve command.
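A minimal sketch of how a per-request effort setting might look against a local vLLM OpenAI-compatible endpoint. The Hugging Face repo id and the exact parameter name for reasoning effort are assumptions, not confirmed by the release; check the model card and your vLLM version before relying on them.

```python
import json

# Server side (run once, assuming the repo id below is correct):
#   vllm serve nvidia/gpt-oss-puzzle-88B
# vLLM then exposes an OpenAI-compatible API, by default on port 8000.

def build_request(prompt: str, effort: str = "low") -> dict:
    """Build a chat-completions payload with a per-request reasoning effort.

    The "reasoning_effort" field name is an assumption here; some stacks
    instead set the effort level in the system prompt.
    """
    assert effort in {"low", "medium", "high"}
    return {
        "model": "nvidia/gpt-oss-puzzle-88B",  # assumed repo id
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": effort,  # trade compute for accuracy per request
    }

payload = build_request("Summarize the Puzzle NAS approach.", effort="high")
print(json.dumps(payload, indent=2))
```

In a cost-aware deployment, a thin routing layer could call `build_request` with `effort="low"` for simple lookups and reserve `"high"` for queries that genuinely need multi-step reasoning.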
NVIDIA’s Puzzle framework demonstrates that post-training architecture optimization can yield substantial inference savings without the typical accuracy trade-offs associated with model compression. The approach is detailed in an accompanying research paper and builds on NVIDIA’s earlier Puzzle work for dense models (arXiv: 2411.19146).
The release also reflects NVIDIA’s growing role as a bridge between open-weight model providers and production deployment. OpenAI released gpt-oss-120B and gpt-oss-20B in August 2025 as its first open-weight models since GPT-2 — and NVIDIA’s optimization pipeline is now turning these models into more practical deployment targets, particularly for organizations running H100 or B200 infrastructure.
The model is available now on Hugging Face under the NVIDIA Open Model License, which permits commercial use.
