Qwen 3.5 Small Models: 9B Parameters That Beat 120B

On March 2, 2026, Alibaba’s Qwen team completed the Qwen 3.5 family with the release of four small dense models — 0.8B, 2B, 4B, and 9B parameters — designed for on-device and edge deployment. The headline result: Qwen3.5-9B outperforms models 3–13 times its size across language, vision, and agentic benchmarks, while running on a single consumer GPU. All four models are open-weight under Apache 2.0 and available on Hugging Face and ModelScope.


Architecture: Gated DeltaNet Goes Small

The small models share the same hybrid architecture that powers the entire Qwen 3.5 lineup. At its core is Gated DeltaNet, a linear attention mechanism arranged in a 3:1 ratio with traditional full softmax attention blocks. The linear layers maintain constant memory complexity regardless of sequence length, while the full attention blocks handle precision-critical reasoning. This hybrid design is what enables a 9B model to support 262,144 tokens of native context — extensible to over 1 million tokens via YaRN — without the memory blow-up of a standard transformer, whose KV cache grows linearly with sequence length while the linear-attention state stays a fixed-size matrix.
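Qwen has not published the small models' kernels, but the gated delta rule that Gated DeltaNet is built on can be sketched in a few lines of plain Python. Everything here (`gated_delta_step`, the toy dimensions, the gate values) is illustrative rather than the production implementation; the point to notice is that the state `S` is a fixed d_v × d_k matrix, so memory does not grow with sequence length:

```python
def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def gated_delta_step(S, q, k, v, alpha, beta):
    """One step of a gated delta rule (toy sketch, list-of-lists math).

    S is a d_v x d_k state matrix; k and q are d_k vectors; v is a d_v
    vector. alpha in (0, 1] is the forget gate, beta in (0, 1] the write
    strength. State size stays O(d_v * d_k) however long the sequence is.
    """
    Sk = matvec(S, k)  # what the state currently predicts for key k
    # Decay the old state, erase the old association for k, write the new one.
    S_new = [
        [alpha * (S[i][j] - beta * Sk[i] * k[j]) + beta * v[i] * k[j]
         for j in range(len(k))]
        for i in range(len(v))
    ]
    return S_new, matvec(S_new, q)  # updated state, read-out for query q
```

With `alpha = beta = 1` and a unit-norm key, one step exactly overwrites the memory slot for that key — the "delta rule" behaviour that the interleaved full-attention blocks then complement with exact token-level recall.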

Additional architectural innovations include:

  • Multi-Token Prediction (MTP): The models predict multiple tokens simultaneously during inference, enabling significant speedups through speculative decoding via the NEXTN algorithm.
  • DeepStack Vision Transformer: Conv3d embeddings enable native temporal video understanding, while multi-layer feature merging replaces the conventional final-layer-only approach.
  • 248K-token vocabulary covering 201 languages and dialects, shared across all Qwen 3.5 models.
  • Native multimodal: All four models process text, images, and video from a single unified architecture — vision isn’t bolted on as a separate module.
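Previous Qwen releases have documented YaRN long-context extension via a `rope_scaling` entry in `config.json`; assuming Qwen 3.5 follows the same convention (the exact keys below are that assumption, not a confirmed spec), a 4× factor takes the 262,144-token native window past 1 million tokens:

```python
# Hypothetical config fragment, mirroring the rope_scaling convention
# documented for earlier Qwen releases — not a confirmed Qwen 3.5 spec.
rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 262144,
}

extended_context = int(rope_scaling["factor"]
                       * rope_scaling["original_max_position_embeddings"])
print(extended_context)  # → 1048576
```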

Benchmarks: Punching Far Above Their Weight

The 9B model delivers what is arguably the most impressive size-to-performance ratio in open-weight AI today. Here are the key numbers:

Language Benchmarks

Benchmark       GPT-OSS-120B   Qwen3-30B   Qwen3-80B   Qwen3.5-9B   Qwen3.5-4B
MMLU-Pro                80.8        80.9        82.7         82.5         79.1
GPQA Diamond            80.1        73.4        77.2         81.7         76.2
IFEval                  88.9        88.9        88.9         91.5         89.8
LongBench v2            48.2        44.8        48.0         55.2         50.0
HMMT Feb 25             90.0        73.7        83.2         74.0            —

The 9B model beats OpenAI’s GPT-OSS-120B (a model 13× larger) on MMLU-Pro, GPQA Diamond, IFEval, and LongBench v2. It also surpasses the previous-generation Qwen3-30B on every metric listed — a model more than three times its size.

Vision-Language Benchmarks

Benchmark            GPT-5-Nano   Gemini 2.5 Flash   Qwen3.5-9B   Qwen3.5-4B
MMMU-Pro                   57.2               59.7         70.1         66.3
MathVision                 62.2               52.1         78.9         74.6
MathVista (mini)           71.5               72.8         85.7         85.1
VideoMME (w/ sub.)         71.7               74.6         84.5         83.5

In vision tasks, the gap is even more dramatic. The 9B scores 70.1 on MMMU-Pro versus GPT-5-Nano’s 57.2 — a 12.9-point advantage. On MathVision, the lead widens to 16.7 points. Even the 4B model outperforms both GPT-5-Nano and Gemini 2.5 Flash across the board.

Agentic Capabilities

The small models also show strong agentic performance. Qwen3.5-9B scores 66.1 on BFCL-V4 (function calling), 79.1 on TAU2-Bench (tool use), 65.2 on ScreenSpot Pro (GUI understanding), and 41.8 on OSWorld-Verified (desktop automation) — outperforming Qwen3-Next-80B on all four benchmarks.
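Benchmarks like BFCL test whether a model can emit well-formed calls against a declared tool schema. The agent-side half of that loop can be sketched in plain Python; the `get_weather` tool, its schema, and the `dispatch` helper below are hypothetical stand-ins, and the schema follows the widely used OpenAI-style function format rather than anything Qwen-specific:

```python
import json

# Hypothetical tool schema in the common OpenAI-style function format.
GET_WEATHER_SCHEMA = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

def get_weather(city: str) -> dict:
    """Stub for the local function the agent loop would expose."""
    return {"city": city, "temp_c": 21}

def dispatch(tool_call_json: str, registry: dict) -> str:
    """Parse a model-emitted tool call and run the matching local function."""
    call = json.loads(tool_call_json)
    fn = registry[call["name"]]
    return json.dumps(fn(**call["arguments"]))

# A model that scores well on function calling emits JSON like this:
result = dispatch('{"name": "get_weather", "arguments": {"city": "Hangzhou"}}',
                  {"get_weather": get_weather})
```

The serving framework handles the other half — injecting the schema into the prompt via the chat template and extracting the model's tool-call tokens.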

Running It Yourself

The practical deployment story is where these models truly shine:

  • Qwen3.5-0.8B (~1.6 GB): Runs on smartphones and Raspberry Pi devices
  • Qwen3.5-2B (~4 GB): Suitable for tablets and lightweight laptops
  • Qwen3.5-4B (~8 GB): RTX 3060, M1/M2 Macs
  • Qwen3.5-9B (~18 GB): RTX 3090/4090, or quantized to fit smaller GPUs

All models are supported by vLLM, SGLang, llama.cpp (GGUF), MLX (Apple Silicon), and Hugging Face Transformers. Four-bit quantization reduces VRAM requirements by approximately 75%, making the 9B runnable on an 8 GB GPU.
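The footprint figures above follow from simple arithmetic: parameters × bytes per weight. A quick sketch (the 10% default overhead allowance is an illustrative guess — activations and KV cache vary with context length):

```python
def approx_vram_gb(params_billions: float, bits_per_weight: int,
                   overhead_frac: float = 0.1) -> float:
    """Rough weight-memory estimate: params x bytes/param, plus an
    overhead allowance (illustrative; real usage depends on context)."""
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return weight_bytes * (1 + overhead_frac) / 1e9

# 9B in bf16 (16-bit): weights alone, no overhead
print(round(approx_vram_gb(9, 16, 0), 1))  # → 18.0
# 4-bit quantization: 75% less weight memory
print(round(approx_vram_gb(9, 4, 0), 1))   # → 4.5
```

That 4.5 GB of 4-bit weights is what leaves headroom for the KV cache on an 8 GB card.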

What This Means

The Qwen 3.5 Small series completes a 16-day blitz that saw Alibaba ship nine models spanning 0.8B to 397B parameters — all sharing the same architecture, vocabulary, and native multimodal capabilities. The message is clear: frontier-level intelligence no longer requires frontier-level hardware. A 9B model that beats a 120B competitor on standard benchmarks, processes video natively, and runs on a single consumer GPU represents a meaningful shift in what’s possible for local AI deployment, privacy-sensitive applications, and resource-constrained environments.

For researchers and developers, the Apache 2.0 license and base model availability (alongside instruct-tuned variants) make fine-tuning straightforward. The consistent architecture across model sizes also means techniques validated on the 0.8B can transfer up to the 9B with minimal adaptation.
