NVIDIA Nemotron 3 Super: 120B Hybrid Model Activates Only 12B Parameters for Agentic AI

On March 11, 2026, NVIDIA released Nemotron 3 Super — a 120-billion-parameter open-weight model that activates only 12 billion parameters per token, delivering frontier-class reasoning at a fraction of the compute cost. Designed for multi-agent AI systems, the model combines three distinct architectures into a single hybrid backbone and ships with a native 1-million-token context window.


A Three-Architecture Hybrid

Nemotron 3 Super’s key innovation is its hybrid backbone, which interleaves three layer types in repeating blocks:

  • Mamba-2 layers handle most sequence processing in linear time, with roughly 4x the memory and compute efficiency of standard attention.
  • Transformer attention layers are interleaved at key depths for precise associative recall — the kind of exact-match retrieval that recurrent layers struggle with.
  • Latent Mixture-of-Experts (LatentMoE) layers compress tokens before routing to experts, then project results back to full dimension. This enables activating “four expert specialists for the cost of one,” according to NVIDIA.
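The interleaving above can be pictured as a repeating block of layers. The sketch below is purely illustrative: NVIDIA has not published the exact layer ratio, so the 5:1:2 mix of Mamba-2, attention, and LatentMoE layers per block (and the block count) are hypothetical round numbers.

```python
# Illustrative layout of a hybrid backbone. The 5:1:2 per-block mix and the
# block count are hypothetical; the real Nemotron 3 ratio is not public.

def build_backbone(num_blocks: int) -> list[str]:
    """Return a flat list of layer types forming a repeating hybrid block."""
    block = (
        ["mamba2"] * 5        # linear-time sequence mixing does most of the work
        + ["attention"]       # occasional full attention for exact associative recall
        + ["latent_moe"] * 2  # compressed expert routing adds capacity cheaply
    )
    return block * num_blocks

layers = build_backbone(num_blocks=4)
```

The key design point survives any particular ratio: cheap linear-time layers dominate the depth, while the expensive quadratic-attention layers appear only where exact recall is needed.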

The result is a 120B-total-parameter model where only 12B are active per token — a 10:1 ratio that dramatically reduces inference cost while maintaining accuracy competitive with much larger dense models.
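A quick back-of-envelope check makes the cost argument concrete. Using the standard rule of thumb that decoding costs roughly 2 FLOPs per active parameter per token, the model generates at the compute cost of a dense 12B model, not a dense 120B one:

```python
# Back-of-envelope accounting for the 120B-total / 12B-active split, using
# the common ~2 * N_active FLOPs-per-token approximation for decoding.
total_params = 120e9
active_params = 12e9

sparsity_ratio = total_params / active_params  # weights stored per weight used
flops_per_token = 2 * active_params            # ~cost of a dense 12B model

print(f"{sparsity_ratio:.0f}:1 total-to-active ratio")
print(f"~{flops_per_token:.1e} FLOPs per generated token")
```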

Performance and Benchmarks

NVIDIA reports that Nemotron 3 Super achieves 2.2x higher throughput than GPT-OSS-120B and 7.5x higher throughput than Qwen3.5-122B on 8K-input/16K-output workloads. At 432 tokens per second, it is among the fastest open models in its class.

On accuracy benchmarks, the model matches or exceeds GPT-OSS-120B and Qwen3.5-122B across diverse tasks. It scored 36 on the Artificial Analysis Intelligence Index — ahead of GPT-OSS-120B (33), though behind Qwen3.5-122B-A10B (42). Where it truly stands out is agentic performance: on PinchBench, a benchmark for LLM-powered autonomous agents, Nemotron 3 Super scored 85.6%, making it the top-performing open model.

The model also powers NVIDIA’s AI-Q research agent, which currently holds the #1 position on both DeepResearch Bench and DeepResearch Bench II leaderboards.

Built for Agents, Not Just Chat

The 1-million-token context window is a strategic choice for agentic workloads. Multi-step agent pipelines that chain tool calls, code execution, and document retrieval can quickly consume hundreds of thousands of tokens. Nemotron 3 Super outperforms both GPT-OSS-120B and Qwen3.5-122B on the RULER benchmark at 1M context length, indicating it sustains accurate long-range retrieval across extremely long reasoning chains.
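A rough token-budget model shows why agent pipelines exhaust context so quickly. All per-step sizes below are hypothetical round numbers chosen for illustration, not measured figures from any agent framework:

```python
# Rough token-budget model for a multi-step agent loop. The per-step token
# counts are hypothetical round numbers, not measurements.

def context_after(steps: int,
                  system_prompt: int = 2_000,
                  tool_call: int = 500,
                  tool_result: int = 8_000,
                  reasoning: int = 1_500) -> int:
    """Total tokens accumulated in context after `steps` tool-use iterations."""
    per_step = tool_call + tool_result + reasoning
    return system_prompt + steps * per_step

# At ~10K tokens per step, about 100 steps already approaches a 1M window.
print(context_after(100))
```

Under these assumptions a 128K window is exhausted after roughly a dozen steps, which is why long-horizon agents either need aggressive context pruning or a window on the order of a million tokens.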

NVIDIA highlights several target use cases:

  • Software development: Loading entire codebases without segmentation for end-to-end code generation
  • Cybersecurity triaging: High-accuracy tool calling for autonomous threat analysis
  • Financial analysis: Processing thousands of report pages simultaneously
  • Life sciences: Deep literature search and molecular understanding

Training and Availability

The model was pretrained on 25 trillion tokens using NVFP4, NVIDIA’s 4-bit floating-point format optimized for Blackwell GPUs. Post-training included approximately 7 million supervised fine-tuning samples and 1.2 million reinforcement learning rollouts across 21 environment configurations.
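For intuition, NVFP4 builds on the 4-bit E2M1 element format, which can represent only a small fixed grid of values; the production format additionally applies a shared scale factor per small block of elements so real weight values land on that grid. The toy quantizer below shows just the element grid, with block scaling omitted:

```python
# Toy sketch of the 4-bit E2M1 element format underlying NVFP4. The production
# format also applies a per-block shared scale factor, omitted here; this only
# shows the small value grid a single FP4 element can represent.

E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
GRID = sorted({s * m for m in E2M1_MAGNITUDES for s in (-1.0, 1.0)})

def quantize_e2m1(x: float) -> float:
    """Round x to the nearest value on the E2M1 grid (out-of-range values clamp)."""
    return min(GRID, key=lambda g: abs(g - x))
```

With only 15 distinct values per element, the per-block scale factors carry most of the dynamic range, which is what makes 4-bit pretraining viable at all.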

Nemotron 3 Super is available in multiple formats — NVFP4, FP8, and BF16 — via Hugging Face, NVIDIA NIM microservices, and major cloud providers including Google Cloud, AWS, Azure, and Oracle. On Blackwell B200 hardware, the NVFP4 variant runs 4x faster than the FP8 variant on previous-generation H100 GPUs.

The model is released under the NVIDIA Nemotron Open Model License, which provides open weights, training datasets (10 trillion pre-training tokens publicly available), and full evaluation recipes. The license is permissive for commercial use, though it includes clauses requiring that safety guardrails not be removed without equivalent replacements — a distinction from fully permissive licenses like Apache 2.0.
