Google Ships Gemma 4 QAT Models: 72% Less VRAM, Same Quality

On June 5, 2026, Google DeepMind released quantization-aware training (QAT) checkpoints for the entire Gemma 4 family, cutting memory requirements by roughly 72% while keeping output quality nearly identical to the full-precision models. The release spans every size, from the sub-1 GB E2B edge model to the 31B dense flagship, and ships in formats ready for llama.cpp, Ollama, LM Studio, vLLM, and on-device runtimes. The goal is blunt: run frontier-class open models locally, on consumer hardware, without a datacenter GPU.
What QAT actually changes
Most quantized models you download today are produced by post-training quantization (PTQ): a model is trained at full precision, then compressed to 4-bit afterward. PTQ is cheap, but squeezing 16-bit weights down to 4-bit loses information, and that loss shows up as small but real drops in accuracy — especially on math and code.
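To make that rounding loss concrete, here is a minimal sketch of symmetric round-to-nearest 4-bit quantization, the simplest form of PTQ. The weight values, the single per-tensor scale, and the function names are illustrative assumptions. Real PTQ tools typically use per-block scales and calibration data.

```python
def quantize_int4(weights):
    """Symmetric round-to-nearest quantization to the signed 4-bit range [-8, 7]."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor (an assumption made for brevity)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map 4-bit integer codes back to approximate float weights."""
    return [c * scale for c in codes]


# Hypothetical full-precision weights
weights = [0.70, -0.31, 0.12, 0.04, -0.48, 0.26]

codes, scale = quantize_int4(weights)
restored = dequantize(codes, scale)

# Per-weight rounding error introduced by the 4-bit grid
errors = [abs(w - r) for w, r in zip(weights, restored)]
max_error = max(errors)

print("codes:", codes)
print("max rounding error:", max_error, "(bounded by scale / 2 =", scale / 2, ")")
```

Each weight can land up to half a grid step away from its original value. PTQ accepts that error after the fact, whereas QAT lets training account for it.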
Quantization-aware training takes a different route. It simulates the low-precision arithmetic during training, so the model learns to compensate for the rounding error before its weights are frozen. The result is a 4-bit model that behaves almost like its full-precision parent. Google reports that its QAT checkpoints deliver higher overall quality than standard PTQ baselines at the same bit width.
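One common way to simulate low-precision arithmetic during training is "fake quantization" with a straight-through estimator. The forward pass sees weights snapped to the 4-bit grid, while gradients update a full-precision copy as if the rounding were not there. The sketch below shows only that mechanism on a toy two-weight regression. The fixed grid step, data, learning rate, and function names are all assumptions, and the source does not say which QAT method Google used.

```python
SCALE = 0.25           # assumed fixed step size of the 4-bit grid
QMIN, QMAX = -8, 7     # signed 4-bit code range


def fake_quantize(w):
    """Snap a float weight to the 4-bit grid but keep it stored as a float."""
    code = max(QMIN, min(QMAX, round(w / SCALE)))
    return code * SCALE


# Toy regression data generated from hypothetical off-grid "true" weights
TRUE_W = [0.6, -1.1]
INPUTS = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, -1.0)]
TARGETS = [TRUE_W[0] * a + TRUE_W[1] * b for a, b in INPUTS]


def loss_and_grad(weights):
    """Mean squared error and its gradient w.r.t. the weights used in the forward pass."""
    n = len(INPUTS)
    loss, grad = 0.0, [0.0, 0.0]
    for (a, b), y in zip(INPUTS, TARGETS):
        err = weights[0] * a + weights[1] * b - y
        loss += err * err / n
        grad[0] += 2 * err * a / n
        grad[1] += 2 * err * b / n
    return loss, grad


def train_qat(steps=200, lr=0.1):
    latent = [0.0, 0.0]  # full-precision shadow weights updated by the optimizer
    for _ in range(steps):
        quantized = [fake_quantize(w) for w in latent]  # forward pass sees 4-bit values
        _, grad = loss_and_grad(quantized)
        # Straight-through estimator: treat rounding as identity in the backward pass
        latent = [w - lr * g for w, g in zip(latent, grad)]
    return [fake_quantize(w) for w in latent]


initial_loss, _ = loss_and_grad([0.0, 0.0])
qat_weights = train_qat()
qat_loss, _ = loss_and_grad(qat_weights)

print("4-bit weights after QAT:", qat_weights)
print("loss before / after:", initial_loss, qat_loss)
```

A toy this small only demonstrates the training loop. It is not expected to reproduce the quality gap between QAT and PTQ that shows up in large models.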
The memory math
The practical headline is memory. According to Google, QAT reduces the memory footprint of Gemma 4 by approximately 72%, letting the models run in a bit under one-third of the VRAM they previously needed. A 31B dense model at 16-bit precision is roughly 60 GB; the 4-bit QAT checkpoint lands in the 17–19 GB range, which is the difference between “needs a server” and “fits on a high-end laptop.”
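The arithmetic behind those figures can be sketched in a few lines. This is a back-of-the-envelope estimate for weights only, assuming exactly 31 billion parameters and every tensor stored at the same bit width. Neither assumption is confirmed by the source. GGUF's Q4_0 layout stores each block of 32 weights as 16 bytes of 4-bit codes plus a 2-byte fp16 scale, which works out to about 4.5 bits per weight.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in decimal gigabytes (excludes KV cache and activations)."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS_31B = 31e9  # assumed exact parameter count for the 31B model

# Q4_0 block: 32 weights -> 16 bytes of codes + 2 bytes of fp16 scale = 18 bytes
Q4_0_BITS = 18 * 8 / 32  # 4.5 bits per weight

fp16_gb = weight_memory_gb(PARAMS_31B, 16)
q4_ideal_gb = weight_memory_gb(PARAMS_31B, 4)
q4_0_gb = weight_memory_gb(PARAMS_31B, Q4_0_BITS)

reduction = 1 - q4_0_gb / fp16_gb

print(f"16-bit weights:        {fp16_gb:.1f} GB")
print(f"pure 4-bit weights:    {q4_ideal_gb:.1f} GB")
print(f"Q4_0 (4.5 bits/weight): {q4_0_gb:.1f} GB")
print(f"reduction vs 16-bit:   {reduction:.1%}")
```

Under these assumptions the Q4_0 estimate falls inside the 17–19 GB range quoted above, and 4.5 bits out of 16 is a reduction of just under 72%. The source does not say how Google derived its own figure, and real usage adds KV cache, activations, and any tensors kept at higher precision.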
The release covers the full lineup:
- E2B — text-only footprint reduced to under 1 GB using the mobile quantization format; small enough for smartphones and in-browser use.
- E4B — a realistic starting point for GPUs with 8 GB of VRAM.
- 12B — the laptop-class multimodal model added June 3, now with a QAT variant.
- 26B-A4B — a mixture-of-experts model that fits on a 16 GB laptop while activating only a fraction of its parameters per token.
- 31B — the dense flagship, brought down to ~17–19 GB at 4-bit.
What this means
The mobile-format work goes beyond a single quantization recipe. Google describes several on-device optimizations layered on top of QAT: static activations with pre-calculated scaling, channel-wise quantization tuned to mobile accelerators, targeted 2-bit quantization for the token-generation layers, and embedding and KV-cache compression. Together these are what pull the E2B text model under the 1 GB line.
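Of the optimizations listed, channel-wise quantization is the easiest to illustrate in isolation. The sketch below compares one shared scale for a whole weight matrix against one scale per row (channel). The matrix values and helper names are hypothetical, and this is a generic illustration of the idea rather than Google's mobile recipe.

```python
def scale_for(values):
    """Symmetric scale so the largest magnitude maps to 4-bit code 7."""
    max_abs = max(abs(v) for v in values)
    return max_abs / 7 if max_abs > 0 else 1.0


def quantize_dequantize(values, scale):
    """Round to signed 4-bit codes with the given scale, then map back to floats."""
    return [max(-8, min(7, round(v / scale))) * scale for v in values]


def total_abs_error(original_rows, restored_rows):
    """Sum of absolute differences across every entry of the matrix."""
    return sum(
        abs(o - r)
        for orow, rrow in zip(original_rows, restored_rows)
        for o, r in zip(orow, rrow)
    )


# Hypothetical weight matrix: one small-magnitude channel, one large-magnitude channel
rows = [
    [0.02, -0.01, 0.015, -0.005],
    [1.2, -0.8, 0.5, -1.4],
]

# Per-tensor: a single scale dominated by the largest weight in the matrix
tensor_scale = scale_for([v for row in rows for v in row])
per_tensor = [quantize_dequantize(row, tensor_scale) for row in rows]

# Channel-wise: each row gets a scale fitted to its own range
per_channel = [quantize_dequantize(row, scale_for(row)) for row in rows]

per_tensor_error = total_abs_error(rows, per_tensor)
per_channel_error = total_abs_error(rows, per_channel)

print("per-tensor error: ", per_tensor_error)
print("channel-wise error:", per_channel_error)
```

With a shared scale the small-magnitude row collapses to zeros, because its values are far below one grid step. Giving each channel its own scale preserves it at the cost of storing one extra scale per channel.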
For the local-AI community, this lowers the bar in a concrete way. The checkpoints are distributed on Hugging Face in GGUF (for llama.cpp) and compressed-tensor formats (for vLLM and SGLang), with day-one support across Ollama, LM Studio, MLX, Transformers.js, and Google’s own LiteRT-LM runtime. One caveat practitioners are already flagging: naively re-quantizing the QAT checkpoint with a generic Q4_0 converter can undo the benefit, because the weights are aligned to a specific QAT lattice — the published Q4_0 and mobile builds are the ones to use, not a homemade conversion.
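The re-quantization caveat can be illustrated with a toy example. Weights that already sit exactly on one 4-bit grid survive a round trip through that same grid unchanged, but a converter that recomputes its own scale places them on a different grid and introduces fresh rounding error. The scale, codes, and the "naive converter" rule below are hypothetical and do not describe the behavior of any specific tool.

```python
QAT_SCALE = 0.1                      # hypothetical scale the checkpoint was trained against
QAT_CODES = [-8, 3, -5, 6, 1, -2]    # hypothetical 4-bit codes learned during QAT
qat_weights = [c * QAT_SCALE for c in QAT_CODES]


def requantize(values, scale):
    """Round values to signed 4-bit codes on the given grid and map them back to floats."""
    return [max(-8, min(7, round(v / scale))) * scale for v in values]


def max_abs_error(a, b):
    """Largest absolute difference between two equal-length lists."""
    return max(abs(x - y) for x, y in zip(a, b))


# Same grid the weights were trained on: nothing moves
aligned = requantize(qat_weights, QAT_SCALE)

# Naive converter (assumed): picks its own scale so the largest magnitude maps to code 7
naive_scale = max(abs(w) for w in qat_weights) / 7
naive = requantize(qat_weights, naive_scale)

aligned_error = max_abs_error(qat_weights, aligned)
naive_error = max_abs_error(qat_weights, naive)

print("error on the original grid:  ", aligned_error)
print("error on the recomputed grid:", naive_error)
```

The second error is exactly the kind of loss QAT was meant to train away, which is why using the published builds is the safer route.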
The broader signal is that “frontier model on your own hardware” keeps getting more literal. A year ago, running a capable multimodal model meant cloud APIs or a multi-GPU rig. With QAT-trained Gemma 4, a 26B mixture-of-experts model fits on a laptop and a 2B model fits on a phone — under a fully permissive Apache 2.0 license.
Related Coverage
- Google Releases Gemma 4 12B: Frontier Multimodal AI on a Laptop — the laptop-class model that this QAT release extends to memory-constrained hardware.
- Gemma 4 Gets Multi-Token Prediction Drafters: 3x Faster Inference, Same Outputs — an earlier inference-efficiency upgrade for the same model family.
- Huawei’s KVarN Hits 5x KV-Cache Capacity at FP16 Accuracy and Throughput — a complementary approach to fitting larger contexts in less memory.
- PrismML Releases 1-Bit Bonsai Image 4B for Local Generation — the same local-accessibility push, applied to image generation.



