Qwen 3.5 Small Models: 9B Parameters That Beat 120B

On March 2, 2026, Alibaba’s Qwen team completed the Qwen 3.5 family with the release of four small dense models — 0.8B, 2B, 4B, and 9B parameters — designed for on-device and edge deployment. The headline result: Qwen3.5-9B outperforms models 3–13 times its size across language, vision, and agentic benchmarks, while running on a single consumer GPU. All four models are open-weight under Apache 2.0 and available on Hugging Face and ModelScope.
Architecture: Gated DeltaNet Goes Small
The small models share the same hybrid architecture that powers the entire Qwen 3.5 lineup. At its core is Gated DeltaNet, a linear attention mechanism arranged in a 3:1 ratio with traditional full softmax attention blocks. The linear layers maintain a constant-size state regardless of sequence length, while the full attention blocks handle precision-critical reasoning. This hybrid design is what enables a 9B model to support 262,144 tokens of native context — extensible to over 1 million tokens via YaRN — without the quadratic attention cost and full-depth KV-cache growth that make such context lengths impractical for standard transformers.
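To make the 3:1 ratio concrete, here is a minimal illustrative sketch of how such a layer schedule could be laid out. This is not Qwen source code; the block names `gated_deltanet` and `full_attention` are labels invented for illustration.

```python
# Illustrative sketch (not Qwen internals): interleave linear-attention
# blocks with full softmax attention blocks in a 3:1 ratio.

def layer_schedule(num_layers: int, linear_per_full: int = 3) -> list[str]:
    """Return a block type per layer: `linear_per_full` linear-attention
    layers (constant-size state in sequence length) followed by one full
    softmax attention layer (precision-critical reasoning)."""
    schedule = []
    for i in range(num_layers):
        if (i + 1) % (linear_per_full + 1) == 0:
            schedule.append("full_attention")
        else:
            schedule.append("gated_deltanet")
    return schedule

# Every group of four layers contains exactly one full-attention block,
# so KV cache is only kept for a quarter of the depth.
print(layer_schedule(8))  # → 3x gated_deltanet, full_attention, repeated
```

Because only one layer in four keeps a softmax KV cache, long-context memory grows at roughly a quarter of the rate of an all-softmax transformer of the same depth.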
Additional architectural innovations include:
- Multi-Token Prediction (MTP): The models predict multiple tokens simultaneously during inference, enabling significant speedups through speculative decoding via the NEXTN algorithm.
- DeepStack Vision Transformer: Conv3d embeddings enable native temporal video understanding, while multi-layer feature merging replaces the conventional final-layer-only approach.
- 248K-token vocabulary covering 201 languages and dialects, shared across all Qwen 3.5 models.
- Native multimodal: All four models process text, images, and video from a single unified architecture — vision isn’t bolted on as a separate module.
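The speedup from Multi-Token Prediction comes from speculative decoding: cheap draft predictions are verified by the full model, and agreed-upon tokens are accepted in bulk. The details of the NEXTN algorithm are not spelled out here, so the following is only a toy sketch of the general accept/reject idea, with deterministic stand-in "models":

```python
# Toy sketch of speculative decoding, the general idea behind MTP
# speedups (not the actual NEXTN algorithm). A cheap draft proposes
# k tokens; the target model checks each one and keeps the longest
# prefix it agrees with. In practice the verification is a single
# batched forward pass, which is where the speedup comes from.

def speculative_step(draft_fn, target_fn, prefix, k=4):
    """Propose k tokens with draft_fn, then accept those that
    target_fn (standing in for the large model) would also emit."""
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_fn(ctx)
        proposed.append(tok)
        ctx.append(tok)
    accepted, ctx = [], list(prefix)
    for tok in proposed:
        if target_fn(ctx) == tok:   # target agrees: keep the draft token
            accepted.append(tok)
            ctx.append(tok)
        else:                       # disagreement: take the target's token and stop
            accepted.append(target_fn(ctx))
            break
    return accepted

# Deterministic toy "models": the draft echoes last token + 1;
# the target does the same but caps values at 5.
draft = lambda ctx: ctx[-1] + 1
target = lambda ctx: min(ctx[-1] + 1, 5)
print(speculative_step(draft, target, [1], k=4))  # → [2, 3, 4, 5]
```

When the draft and target agree often, several tokens land per target-model pass; when they diverge, the loop falls back to the target's own next token, so output quality is unchanged.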
Benchmarks: Punching Far Above Their Weight
The 9B model delivers what is arguably the most impressive size-to-performance ratio in open-weight AI today. Here are the key numbers:
Language Benchmarks
| Benchmark | GPT-OSS-120B | Qwen3-30B | Qwen3-80B | Qwen3.5-9B | Qwen3.5-4B |
|---|---|---|---|---|---|
| MMLU-Pro | 80.8 | 80.9 | 82.7 | 82.5 | 79.1 |
| GPQA Diamond | 80.1 | 73.4 | 77.2 | 81.7 | 76.2 |
| IFEval | 88.9 | 88.9 | 88.9 | 91.5 | 89.8 |
| LongBench v2 | 48.2 | 44.8 | 48.0 | 55.2 | 50.0 |
| HMMT Feb 25 | 90.0 | — | 73.7 | 83.2 | 74.0 |
The 9B model beats OpenAI’s GPT-OSS-120B (a model 13× larger) on MMLU-Pro, GPQA Diamond, IFEval, and LongBench v2. It also surpasses the previous-generation Qwen3-30B, a model more than three times its size, on every metric for which both report scores.
Vision-Language Benchmarks
| Benchmark | GPT-5-Nano | Gemini 2.5 Flash | Qwen3.5-9B | Qwen3.5-4B |
|---|---|---|---|---|
| MMMU-Pro | 57.2 | 59.7 | 70.1 | 66.3 |
| MathVision | 62.2 | 52.1 | 78.9 | 74.6 |
| MathVista (mini) | 71.5 | 72.8 | 85.7 | 85.1 |
| VideoMME (w/ sub.) | 71.7 | 74.6 | 84.5 | 83.5 |
In vision tasks, the gap is even more dramatic. The 9B scores 70.1 on MMMU-Pro versus GPT-5-Nano’s 57.2 — a 12.9-point advantage. On MathVision, the lead widens to 16.7 points. Even the 4B model outperforms both GPT-5-Nano and Gemini 2.5 Flash across the board.
Agentic Capabilities
The small models also show strong agentic performance. Qwen3.5-9B scores 66.1 on BFCL-V4 (function calling), 79.1 on TAU2-Bench (tool use), 65.2 on ScreenSpot Pro (GUI understanding), and 41.8 on OSWorld-Verified (desktop automation) — outperforming Qwen3-Next-80B on all four benchmarks.
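Function-calling benchmarks like BFCL-V4 score whether a model, given JSON tool schemas, emits a well-formed call with the right arguments. As a hedged illustration of that pattern, here is a minimal schema plus a validator; the `get_weather` tool and its fields are hypothetical examples, not part of any benchmark or Qwen API:

```python
import json

# Hypothetical tool schema in the common JSON-Schema style that
# function-calling benchmarks and APIs use. Not from BFCL or Qwen.
weather_tool = {
    "name": "get_weather",
    "description": "Look up current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def validate_call(raw: str, tool: dict) -> dict:
    """Parse a model-emitted tool call and check it names the right tool
    and supplies the required arguments (a minimal check, not full
    JSON Schema validation)."""
    call = json.loads(raw)
    assert call["name"] == tool["name"], "unknown tool"
    for field in tool["parameters"]["required"]:
        assert field in call["arguments"], f"missing argument: {field}"
    return call

# A model response that would score as a correct call:
model_output = '{"name": "get_weather", "arguments": {"city": "Hangzhou"}}'
print(validate_call(model_output, weather_tool)["arguments"]["city"])  # → Hangzhou
```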
Running It Yourself
The practical deployment story is where these models truly shine:
- Qwen3.5-0.8B (~1.6 GB): Runs on smartphones and Raspberry Pi devices
- Qwen3.5-2B (~4 GB): Suitable for tablets and lightweight laptops
- Qwen3.5-4B (~8 GB): RTX 3060, M1/M2 Macs
- Qwen3.5-9B (~18 GB): RTX 3090/4090, or quantized to fit smaller GPUs
All models are supported by vLLM, SGLang, llama.cpp (GGUF), MLX (Apple Silicon), and Hugging Face Transformers. Four-bit quantization reduces VRAM requirements by approximately 75%, making the 9B runnable on an 8 GB GPU.
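The memory figures above follow directly from bytes per parameter: 16-bit weights cost 2 bytes each, 4-bit weights half a byte. A quick back-of-the-envelope sketch (weights only; it ignores KV cache, activations, and framework overhead, so real usage runs somewhat higher):

```python
def weight_vram_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate VRAM for the model weights alone, in decimal GB.
    Ignores KV cache, activations, and runtime overhead."""
    total_bytes = params_billions * 1e9 * bits_per_param / 8
    return total_bytes / 1e9

print(weight_vram_gb(9, 16))  # bf16: ~18 GB, matching the figure above
print(weight_vram_gb(9, 4))   # 4-bit: ~4.5 GB, why 9B fits an 8 GB GPU
```

Going from 16-bit to 4-bit weights is exactly the ~75% reduction cited above; the remaining headroom on an 8 GB card absorbs the KV cache and runtime overhead.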
What This Means
The Qwen 3.5 Small series completes a two-week blitz that saw Alibaba ship nine models spanning 0.8B to 397B parameters — all sharing the same architecture, vocabulary, and native multimodal capabilities. The message is clear: frontier-level intelligence no longer requires frontier-level hardware. A 9B model that beats a 120B competitor on standard benchmarks, processes video natively, and runs on a single consumer GPU represents a meaningful shift in what’s possible for local AI deployment, privacy-sensitive applications, and resource-constrained environments.
For researchers and developers, the Apache 2.0 license and base model availability (alongside instruct-tuned variants) make fine-tuning straightforward. The consistent architecture across model sizes also means techniques validated on the 0.8B can transfer up to the 9B with minimal adaptation.
Related Coverage
- Qwen 3.5 Medium Series: Frontier AI That Fits on Your GPU — Coverage of the 27B, 35B-A3B, and 122B-A10B medium models released February 24
- Qwen 3.5: Alibaba’s Native Multimodal Agent Model Arrives — The flagship 397B-A17B model that launched the Qwen 3.5 family on February 16
- Qwen3-Coder-Next: Alibaba’s Ultra-Sparse 80B Coding Agent — The specialized coding model from the Qwen3 generation


