On March 2, 2026, Alibaba’s Qwen team completed the Qwen 3.5 family with the release of four small dense models — 0.8B, 2B, 4B, and 9B parameters — designed for on-device and edge deployment. The headline result: Qwen3.5-9B outperforms models 3–13 times its size across language, vision, and agentic benchmarks, while running on a single consumer GPU. All four models are open-weight under Apache 2.0 and available on Hugging Face and ModelScope.
The small models share the same hybrid architecture that powers the entire Qwen 3.5 lineup. At its core is Gated DeltaNet, a linear attention mechanism arranged in a 3:1 ratio with traditional full softmax attention blocks. The linear layers maintain constant memory complexity regardless of sequence length, while the full attention blocks handle precision-critical reasoning. This hybrid design is what enables a 9B model to support 262,144 tokens of native context — extensible to over 1 million tokens via YaRN — without the memory explosion that would make this impossible with standard transformers.
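To make the memory claim concrete, here is a back-of-the-envelope sketch. The head count and head dimension below are illustrative assumptions, not published Qwen3.5 specifications; the point is the shape of the scaling, not the exact numbers.

```python
# Illustrative memory sketch: a full-attention layer's KV cache grows linearly
# with sequence length, while a linear-attention (DeltaNet-style) layer keeps
# a fixed-size recurrent state. All dimensions are assumed for illustration.
BYTES = 2        # bf16
HEADS = 16       # assumed head count
HEAD_DIM = 128   # assumed head dimension

def kv_cache_bytes(seq_len: int) -> int:
    """Per-layer KV cache for full softmax attention: a K and a V vector per token."""
    return 2 * seq_len * HEADS * HEAD_DIM * BYTES

def linear_state_bytes() -> int:
    """Per-layer recurrent state for linear attention: one d x d matrix per head,
    independent of sequence length."""
    return HEADS * HEAD_DIM * HEAD_DIM * BYTES

for n in (4_096, 262_144):
    print(f"{n:>7} tokens: KV cache {kv_cache_bytes(n) / 2**20:8.1f} MiB, "
          f"linear state {linear_state_bytes() / 2**20:.1f} MiB")
```

Under these assumed dimensions, the full-attention cache reaches roughly 2 GiB per layer at the 262K native context, while the linear state stays at 0.5 MiB regardless of length. That is why only a minority of layers (the 1-in-4 full softmax blocks) pay the length-dependent cost.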
Beyond the hybrid attention core, the small models inherit the same vocabulary and native multimodal stack as the larger Qwen 3.5 variants.
The 9B model delivers what is arguably the most impressive size-to-performance ratio in open-weight AI today. Here are the key numbers:
| Benchmark | GPT-OSS-120B | Qwen3-30B | Qwen3-80B | Qwen3.5-9B | Qwen3.5-4B |
|---|---|---|---|---|---|
| MMLU-Pro | 80.8 | 80.9 | 82.7 | 82.5 | 79.1 |
| GPQA Diamond | 80.1 | 73.4 | 77.2 | 81.7 | 76.2 |
| IFEval | 88.9 | 88.9 | 88.9 | 91.5 | 89.8 |
| LongBench v2 | 48.2 | 44.8 | 48.0 | 55.2 | 50.0 |
| HMMT Feb 25 | 90.0 | — | 73.7 | 83.2 | 74.0 |
The 9B model beats OpenAI’s GPT-OSS-120B (a model 13× larger) on MMLU-Pro, GPQA Diamond, IFEval, and LongBench v2, though it trails on HMMT. It also surpasses the previous-generation Qwen3-30B, a model more than three times its size, on every benchmark the 30B reports. The vision comparison against similarly positioned closed models:
| Benchmark | GPT-5-Nano | Gemini 2.5 Flash | Qwen3.5-9B | Qwen3.5-4B |
|---|---|---|---|---|
| MMMU-Pro | 57.2 | 59.7 | 70.1 | 66.3 |
| MathVision | 62.2 | 52.1 | 78.9 | 74.6 |
| MathVista (mini) | 71.5 | 72.8 | 85.7 | 85.1 |
| VideoMME (w/ sub.) | 71.7 | 74.6 | 84.5 | 83.5 |
In vision tasks, the gap is even more dramatic. The 9B scores 70.1 on MMMU-Pro versus GPT-5-Nano’s 57.2 — a 12.9-point advantage. On MathVision, the lead widens to 16.7 points. Even the 4B model outperforms both GPT-5-Nano and Gemini 2.5 Flash across the board.
The small models also show strong agentic performance. Qwen3.5-9B scores 66.1 on BFCL-V4 (function calling), 79.1 on TAU2-Bench (tool use), 65.2 on ScreenSpot Pro (GUI understanding), and 41.8 on OSWorld-Verified (desktop automation) — outperforming Qwen3-Next-80B on all four benchmarks.
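Function calling of the kind BFCL-V4 measures is typically exercised through an OpenAI-compatible endpoint, which the serving stacks mentioned below (vLLM, SGLang) expose. As a hedged sketch, the request below builds such a payload; the `get_weather` tool, its fields, and the served model name are hypothetical, not part of any Qwen3.5 release.

```python
import json

# Hypothetical tool definition in the OpenAI-compatible "tools" format.
# The get_weather function and its schema are illustrative assumptions.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

request = {
    "model": "Qwen3.5-9B",  # assumed served model name
    "messages": [{"role": "user", "content": "Weather in Hangzhou?"}],
    "tools": tools,
    "tool_choice": "auto",
}
print(json.dumps(request, indent=2))
```

A model scoring well on BFCL-style benchmarks should respond to such a request with a structured `tool_calls` entry rather than free text, which the caller then executes and feeds back.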
The practical deployment story is where these models truly shine. All four are supported by vLLM, SGLang, llama.cpp (GGUF), MLX (Apple Silicon), and Hugging Face Transformers, and four-bit quantization reduces VRAM requirements by approximately 75%, making the 9B runnable on an 8 GB GPU.
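The 75% figure follows directly from the bit widths. A quick sketch for the 9B model, counting weights only (activations, KV cache, and quantization metadata add overhead on top):

```python
def weight_gb(params_b: float, bits: int) -> float:
    """Approximate weight memory in GB for a model with params_b billion
    parameters stored at the given bit width. Weights only; ignores
    activations, KV cache, and quantization scales/zero-points."""
    return params_b * 1e9 * bits / 8 / 1e9

print(weight_gb(9, 16))  # bf16:  18.0 GB
print(weight_gb(9, 4))   # 4-bit:  4.5 GB, a 75% reduction
```

At roughly 4.5 GB of weights, the 9B leaves headroom on an 8 GB card for the runtime’s activation and cache memory, consistent with the deployment claim above.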
The Qwen 3.5 Small series completes a 16-day blitz that saw Alibaba ship nine models spanning 0.8B to 397B parameters — all sharing the same architecture, vocabulary, and native multimodal capabilities. The message is clear: frontier-level intelligence no longer requires frontier-level hardware. A 9B model that beats a 120B competitor on standard benchmarks, processes video natively, and runs on a single consumer GPU represents a meaningful shift in what’s possible for local AI deployment, privacy-sensitive applications, and resource-constrained environments.
For researchers and developers, the Apache 2.0 license and base model availability (alongside instruct-tuned variants) make fine-tuning straightforward. The consistent architecture across model sizes also means techniques validated on the 0.8B can transfer up to the 9B with minimal adaptation.
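One reason such transfer is cheap: parameter-efficient methods like LoRA (used here purely as an illustrative example; the release does not prescribe a fine-tuning technique) add an adapter whose size depends only on hidden width, depth, and rank. The dimensions below are placeholders, not published Qwen3.5 configurations.

```python
def lora_params(d_model: int, n_layers: int, rank: int = 16,
                matrices_per_layer: int = 4) -> int:
    """Rough LoRA adapter size: each adapted d x d projection adds two
    low-rank factors of shape (d, r) and (r, d). Dimensions are assumed
    placeholders for illustration."""
    return n_layers * matrices_per_layer * 2 * d_model * rank

# Hypothetical dims for a smaller vs a larger model in the same family
small = lora_params(d_model=1024, n_layers=24)
large = lora_params(d_model=4096, n_layers=40)
print(small, large)
```

Even at the larger assumed dimensions, the adapter stays in the tens of millions of parameters, a small fraction of the base model, so a recipe validated on the 0.8B scales up with little more than a change of rank or target modules.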
