NVIDIA Star Elastic: One Checkpoint, Three Reasoning Models, Zero-Shot Slicing

On May 7, 2026, NVIDIA released Star Elastic, a single 30-billion-parameter reasoning checkpoint that contains two smaller production-ready models, 23B and 12B, embedded in its weights. Developers can extract the smaller variants with a zero-shot slicing script, with no fine-tuning required. The release introduces a “many-in-one” approach to LLM packaging that NVIDIA reports cuts training tokens by 360× compared to pretraining each variant from scratch.

NVIDIA Star Elastic logo for the Nemotron Labs 3 Elastic 30B-A3B model family
Image credit: NVIDIA on Hugging Face

Three Models, One Checkpoint

Star Elastic is built on top of Nemotron Nano v3, a hybrid Mamba-2 / Transformer / Mixture-of-Experts model with 30B total parameters and 3.6B active parameters per token. Through nested weight-sharing, the same checkpoint contains a 23B variant (2.8B active) and a 12B variant (2.0B active) as proper subsets of the parent. The smaller models reuse the most important slices of every weight matrix, so extracting them is a deterministic, training-free operation.
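
To make the nesting concrete, here is a minimal sketch in plain Python. It assumes each variant keeps the top-ranked rows and columns of a parent weight matrix. The importance scores, the 4x4 matrix, and the helper names `rank_by_importance` and `slice_nested` are hypothetical and do not come from NVIDIA's release. The sketch only shows why a fixed ranking makes the smaller variant a subset of the larger one, and why extraction reduces to a deterministic copy.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def rank_by_importance(scores: Sequence[float]) -> List[int]:
    """Return indices sorted from most to least important."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])


def slice_nested(weight: Matrix, row_rank: Sequence[int], col_rank: Sequence[int],
                 keep_rows: int, keep_cols: int) -> Matrix:
    """Copy the sub-matrix made of the top-ranked rows and columns."""
    rows = row_rank[:keep_rows]
    cols = col_rank[:keep_cols]
    return [[weight[r][c] for c in cols] for r in rows]


# Hypothetical parent matrix and made-up importance scores (assumptions).
parent = [[float(10 * r + c) for c in range(4)] for r in range(4)]
row_rank = rank_by_importance([0.2, 0.9, 0.5, 0.1])
col_rank = rank_by_importance([0.7, 0.3, 0.8, 0.4])

medium = slice_nested(parent, row_rank, col_rank, keep_rows=3, keep_cols=3)
small = slice_nested(parent, row_rank, col_rank, keep_rows=2, keep_cols=2)

# Nesting check: the small variant is the leading block of the medium one.
small_from_medium = [row[:2] for row in medium[:2]]
print(small == small_from_medium)
```

Because both variants read from the same ranking, no weights are recomputed: the smaller matrix is a prefix of the larger one in importance order.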

The packaging savings are concrete. Storing all three variants in BF16 inside one Star Elastic checkpoint takes 58.9 GB versus 126.1 GB for three independent Nano v3 checkpoints — a 2.14× reduction in disk and download footprint. Quantized formats compress further: FP8 brings the 30B variant down to 31.4 GB, and NVFP4 to just 18.7 GB.
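
The reduction factor follows directly from the two BF16 sizes quoted above. A quick check, using only those figures:

```python
# Figures quoted in the article (BF16, in GB).
elastic_gb = 58.9        # one Star Elastic checkpoint holding all three variants
independent_gb = 126.1   # three separate checkpoints

ratio = independent_gb / elastic_gb
saved_gb = independent_gb - elastic_gb
saved_pct = 100 * saved_gb / independent_gb

print(f"reduction: {ratio:.2f}x")                        # reduction: 2.14x
print(f"saved: {saved_gb:.1f} GB ({saved_pct:.1f}%)")    # saved: 67.2 GB (53.3%)
```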

Zero-Shot Slicing

Slicing prioritizes “width-based elasticity” — reducing hidden dimensions, expert counts, and attention heads rather than removing layers. NVIDIA reports that width compression recovers 98.1% of parent performance at a 15% parameter reduction, compared to 95.2% for depth compression. The 30B parent uses 2688-dim embeddings, 128 routed experts, and 32 attention heads; the 23B and 12B variants narrow these dimensions while keeping all 52 layers intact.

Extraction is a single command:

python zero_shot_slicing.py \
    --source-checkpoint <path-to-30B-checkpoint> \
    --target-checkpoint ./nemotron-elastic-12b-bf16 \
    --size 12B \
    --precision bf16
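
For teams that want both smaller variants, the same command can be assembled programmatically. The following is a sketch only: the flags are the ones shown above, but the source path `./star-elastic-30b`, the helper name `build_slicing_command`, and the use of `23B` as a `--size` value are assumptions that should be checked against the script's own help output.

```python
import shlex
from typing import List


def build_slicing_command(source: str, target: str, size: str,
                          precision: str = "bf16") -> List[str]:
    """Assemble the argv list for the slicing script using the flags shown above."""
    return [
        "python", "zero_shot_slicing.py",
        "--source-checkpoint", source,
        "--target-checkpoint", target,
        "--size", size,
        "--precision", precision,
    ]


# Hypothetical local source path; "23B" as a --size value is an assumption.
commands = [
    build_slicing_command("./star-elastic-30b",
                          f"./nemotron-elastic-{size.lower()}-bf16", size)
    for size in ("12B", "23B")
]
for argv in commands:
    # Printing only; hand argv to subprocess.run(argv, check=True) to execute.
    print(shlex.join(argv))
```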

The elastic post-training run that produced these nested variants used roughly 160B tokens, about 0.6% of the parent’s pretraining budget, with a two-stage curriculum that ramped context length from 8K to 49K tokens. NVIDIA reports a 360× token reduction versus pretraining each variant independently and a 7× token reduction versus prior state-of-the-art compression methods.
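
The token figures can be cross-checked with simple arithmetic. The two derived numbers below are back-of-envelope values implied by the rounded figures in this article, not numbers reported by NVIDIA:

```python
post_training_tokens = 160e9   # "roughly 160B tokens"
share_of_pretraining = 0.006   # "about 0.6%" of the parent's pretraining budget
reported_reduction = 360       # "360x token reduction"

# Implied by the rounded inputs above; treat both as approximate.
implied_parent_pretraining = post_training_tokens / share_of_pretraining
implied_baseline = post_training_tokens * reported_reduction

print(f"implied parent pretraining budget: {implied_parent_pretraining / 1e12:.1f}T tokens")
print(f"implied from-scratch baseline:     {implied_baseline / 1e12:.1f}T tokens")
```

With these inputs the parent's budget works out to about 26.7T tokens and the from-scratch baseline to about 57.6T tokens.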

Benchmarks

Bar chart comparing Elastic-12B, 23B, and 30B variants against Nemotron Nano v3 30B and Qwen3-30B-A3B across reasoning and instruction-following benchmarks
Image credit: NVIDIA on Hugging Face

On AIME-2025, the Elastic-30B scores 88.54 and the 23B reaches 85.63, both ahead of Qwen3-30B-A3B at 80.00, while the 12B trails slightly at 78.54. On MMLU-Pro, the 30B parent scores 78.63 and the 23B 76.07. Throughput rises as the variants shrink: on a single H100 with vLLM, the 12B serves up to 224 concurrent requests at 2.4× the throughput of the 30B parent.
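
The AIME-2025 scores quoted above can be put side by side with a few lines of Python. The sketch computes each variant's gap to Qwen3-30B-A3B and the share of the 30B score it retains, using only the numbers in this article:

```python
aime_2025 = {"Elastic-30B": 88.54, "Elastic-23B": 85.63, "Elastic-12B": 78.54}
qwen3_30b_a3b = 80.00
parent = aime_2025["Elastic-30B"]

rows = []
for name, score in aime_2025.items():
    gap_vs_qwen = round(score - qwen3_30b_a3b, 2)
    retained_pct = round(100 * score / parent, 1)
    rows.append((name, gap_vs_qwen, retained_pct))
    print(f"{name}: {gap_vs_qwen:+.2f} vs Qwen3-30B-A3B, "
          f"{retained_pct:.1f}% of the 30B score")
```

The 30B and 23B variants come out 8.54 and 5.63 points above Qwen3-30B-A3B, and the 12B comes out 1.46 points below it while retaining 88.7% of the parent's score.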

The release also introduces “elastic budget control,” a novel inference mode that lets a model use a smaller variant for the chain-of-thought phase and then switch to the larger variant to produce the final answer. NVIDIA reports the 23B-thinking → 30B-answering configuration delivers up to 16% higher accuracy and 1.9× lower latency than running the 30B alone, though this routing is not yet supported natively in vLLM.
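
Since the routing is not built into vLLM, an application would need to orchestrate the two phases itself. The sketch below shows one possible shape for that control flow. It is not NVIDIA's implementation: the function name `elastic_budget_generate`, the prompt format, the token budgets, and the stand-in model callables are all hypothetical, and a real deployment would replace the stubs with calls to served 23B and 30B variants.

```python
from typing import Callable

# A "model" here is any callable mapping (prompt, max_tokens) to generated text.
Generate = Callable[[str, int], str]


def elastic_budget_generate(question: str, thinker: Generate, answerer: Generate,
                            think_tokens: int, answer_tokens: int) -> str:
    """Run the reasoning phase on one variant, then the final answer on another."""
    reasoning = thinker(f"Question: {question}\nThink step by step.", think_tokens)
    answer_prompt = (
        f"Question: {question}\n"
        f"Reasoning: {reasoning}\n"
        "Final answer:"
    )
    return answerer(answer_prompt, answer_tokens)


# Stand-in callables so the sketch runs without any model weights.
def fake_23b(prompt: str, max_tokens: int) -> str:
    return "17 + 25 = 42"


def fake_30b(prompt: str, max_tokens: int) -> str:
    # Echoes the last token of the reasoning line it was handed.
    lines = [line for line in prompt.splitlines() if line.startswith("Reasoning:")]
    return lines[0].split()[-1]


result = elastic_budget_generate("What is 17 + 25?", fake_23b, fake_30b,
                                 think_tokens=512, answer_tokens=32)
print(result)
```

Keeping the two phases behind a single function means the thinking and answering variants can be swapped without touching the calling code.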

Pareto frontier showing accuracy-versus-latency tradeoffs for elastic budget control configurations across reasoning and answering phases
Image credit: NVIDIA on Hugging Face

What This Means

Star Elastic reframes how reasoning model families are shipped. Instead of committing to one model size at training time and fine-tuning down to smaller checkpoints, deployment teams can bundle a single artifact and let inference infrastructure choose a variant per workload — a 12B for low-latency RAG, a 30B for hard reasoning, the 23B for batch jobs that need throughput. Because the checkpoint is released under the NVIDIA Open Model License with commercial use permitted, the pattern is immediately usable in production.
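
The per-workload choice described above amounts to a small routing table. A minimal sketch, in which the workload keys, the function name `pick_variant`, and the fallback to the 30B parent are illustrative assumptions rather than anything specified in the release:

```python
from typing import Dict

# Workload-to-variant mapping taken from the examples in the paragraph above.
# The key names and the fallback choice are illustrative assumptions.
VARIANT_BY_WORKLOAD: Dict[str, str] = {
    "low_latency_rag": "12B",
    "batch": "23B",
    "hard_reasoning": "30B",
}


def pick_variant(workload: str, default: str = "30B") -> str:
    """Return the variant size for a workload, falling back to the full parent."""
    return VARIANT_BY_WORKLOAD.get(workload, default)


print(pick_variant("low_latency_rag"))
print(pick_variant("unlisted_workload"))
```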

The technique also has implications for research budgets. If width-elastic post-training can produce usable 23B and 12B variants for about 0.6% of the parent’s pretraining token budget, the added training cost of supporting multiple deployment tiers becomes small relative to pretraining. That is a meaningful change for labs that previously had to triage which model sizes were worth distilling.
