Qwen 3.5 Medium Series: Frontier AI That Fits on Your GPU

On February 24, 2026, Alibaba’s Qwen team expanded the Qwen 3.5 family with three new open-weight models — Qwen3.5-27B, Qwen3.5-35B-A3B, and Qwen3.5-122B-A10B — each designed to deliver frontier-level intelligence at a fraction of the compute cost of the 397B flagship released a week earlier. The release makes a pointed argument: with the right architecture, smaller can outperform bigger.

Illustration (AI-generated): three interconnected neural network spheres representing the Qwen 3.5 Medium model series

Three Models, One Mission: Efficiency Without Compromise

The Qwen 3.5 Medium series ships three distinct models under the Apache 2.0 open-source license, available immediately on Hugging Face:

  • Qwen3.5-27B — A dense model with all 27 billion parameters active at inference, optimized for coding and instruction-following. It uses a 3:1 hybrid ratio of Gated DeltaNet layers to Gated Attention layers, enabling efficient linear-time processing of long sequences.
  • Qwen3.5-35B-A3B — A sparse Mixture-of-Experts model with 35B total parameters but only 3B active per forward pass. It routes tokens through 8 of 256 available experts, achieving high throughput on modest hardware — including consumer GPUs with 32GB VRAM for up to 1 million token contexts with YaRN scaling.
  • Qwen3.5-122B-A10B — The largest in the medium tier, with 122B total parameters and 10B active. Designed for complex agentic workflows requiring multi-step planning, it supports 1M+ context on server-grade 80GB GPUs.
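The dense-versus-sparse trade-off in the lineup above can be put in numbers. A minimal sketch, using only the parameter counts stated in the list (the helper script itself is our own illustration, not Qwen tooling):

```python
# Back-of-the-envelope comparison of dense vs. sparse (MoE) compute per token.
# Parameter counts are taken from the model names above; nothing else is assumed.

models = {
    "Qwen3.5-27B":       {"total_b": 27,  "active_b": 27},  # dense: all params active
    "Qwen3.5-35B-A3B":   {"total_b": 35,  "active_b": 3},   # MoE: 8 of 256 experts routed
    "Qwen3.5-122B-A10B": {"total_b": 122, "active_b": 10},  # MoE
}

for name, m in models.items():
    frac = m["active_b"] / m["total_b"]
    print(f"{name}: {m['active_b']}B active of {m['total_b']}B total "
          f"({frac:.1%} of parameters touched per token)")
```

The 35B-A3B touches under 9% of its weights per token, which is why it can run on hardware that a dense 35B model would saturate.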

All three models share the flagship's capabilities: native multimodal understanding (text, images, video), support for 201 languages, a built-in thinking mode that emits <think>...</think> reasoning traces, and native tool calling for agentic applications.
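Since thinking-mode responses wrap reasoning in <think>...</think> tags as described above, a client typically wants to separate the trace from the final answer. A minimal sketch (the tag format follows the text; the parsing helper is our own illustration, not part of any official Qwen SDK):

```python
import re

# Matches a <think>...</think> reasoning trace, including newlines inside it.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_thinking(response: str) -> tuple[str, str]:
    """Return (reasoning_trace, final_answer) from a thinking-mode response.

    Illustrative helper; real deployments may rely on the inference
    framework's reasoning parser instead.
    """
    traces = THINK_RE.findall(response)
    answer = THINK_RE.sub("", response).strip()
    return "\n".join(t.strip() for t in traces), answer

reasoning, answer = split_thinking(
    "<think>27 * 3 = 81, so the answer is 81.</think>The answer is 81."
)
print(reasoning)  # → 27 * 3 = 81, so the answer is 81.
print(answer)     # → The answer is 81.
```

Frameworks like vLLM can do this server-side (see the --reasoning-parser flag in the deployment section), but a client-side splitter is useful when consuming raw completions.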

Benchmark Highlights

The Qwen3.5-27B leads the medium lineup on several key evaluations despite its compact size:

  • Coding: SWE-bench Verified 72.4 (matching GPT-5-mini), LiveCodeBench v6 80.7
  • Math reasoning: HMMT 92.0, DynaMath 87.7
  • Instruction following: IFEval 95.0
  • General knowledge: MMLU-Pro 86.1, GPQA Diamond 85.5

The Qwen3.5-35B-A3B punches well above its weight at just 3B active parameters:

  • MMLU-Pro: 85.3 | GPQA Diamond: 84.2 | IFEval: 91.9
  • SWE-bench Verified: 69.2 | MMMU-Pro: 75.1 | VideoMME: 86.6
  • AndroidWorld (mobile agent): 71.1 | ScreenSpot Pro (UI grounding): 68.6

These scores put the 35B-A3B ahead of many 70B+ dense models from prior generations — a result of Alibaba’s emphasis on architectural efficiency and reinforcement learning during post-training rather than brute-force scaling.

Architecture: Gated Delta Networks and Sparse Experts

All Qwen 3.5 models use a Gated Delta Network architecture, a hybrid that pairs a decay gate (which clears stale content from the recurrent memory) with the delta rule (which makes targeted corrections to stored key-value associations rather than blindly accumulating them). In the sparse MoE variants, this is combined with expert routing: the 35B-A3B, for instance, selects 8 routed experts plus 1 shared expert from 256 total for each token.
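To make the recurrence concrete, here is a minimal NumPy sketch of a gated delta-rule state update, loosely following published Gated DeltaNet formulations (S_t = alpha_t * S_{t-1}(I - beta_t k_t k_t^T) + beta_t v_t k_t^T). The dimensions and gate values are illustrative, not the model's actual configuration:

```python
import numpy as np

def gated_delta_step(S, k, v, alpha, beta):
    """One gated delta-rule update of the recurrent memory state S.

    S:     (d_v, d_k) memory matrix
    k, v:  key / value vectors for the current token
    alpha: decay gate in [0, 1] (forgets stale content)
    beta:  write strength for the delta-rule correction
    Sketch after Gated DeltaNet-style recurrences; sizes are illustrative.
    """
    k = k / np.linalg.norm(k)                    # unit-norm key
    S = alpha * (S - beta * np.outer(S @ k, k))  # gated decay + delta-rule erase
    S = S + beta * np.outer(v, k)                # write the new association
    return S

rng = np.random.default_rng(0)
d_k, d_v = 8, 8
S = np.zeros((d_v, d_k))
for _ in range(16):  # O(sequence length) work, O(1) state
    k, v = rng.normal(size=d_k), rng.normal(size=d_v)
    S = gated_delta_step(S, k, v, alpha=0.95, beta=0.5)
print(S.shape)  # state stays fixed-size regardless of sequence length
```

The key property is visible in the loop: processing more tokens costs more steps but never grows the state, which is what gives the linear-time, constant-memory behavior described below.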

The practical benefit is significant. At the native 262,144-token context, memory scales linearly with sequence length rather than quadratically. For the 35B-A3B running locally, that means 1M-token contexts (with YaRN scaling) fit on a single 32GB GPU, opening up use cases like full-codebase analysis or long-document research without cloud API costs.
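The linear-versus-quadratic gap is easy to illustrate at the native window size. All sizes below are hypothetical round numbers for illustration, not Qwen's actual dimensions:

```python
# Illustrative memory comparison at the native 262,144-token context.
# The hidden sizes (128x128 state, 48 layers) are hypothetical placeholders.

n = 262_144       # native context length, per the text
bytes_bf16 = 2    # bytes per BF16 value

# Quadratic: materializing a full n x n attention-score matrix for one head
score_matrix_gb = n * n * bytes_bf16 / 1e9

# Linear: a fixed-size recurrent state per layer, independent of n
state_gb = 128 * 128 * 48 * bytes_bf16 / 1e9

print(f"one n x n score matrix: {score_matrix_gb:,.1f} GB")  # grows with n^2
print(f"fixed recurrent state:  {state_gb:.4f} GB")          # independent of n
```

Real attention implementations avoid materializing the full score matrix, but the asymptotics still favor the fixed-size recurrent state as contexts stretch toward a million tokens.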

The native context window is 262,144 tokens across all models, with YaRN scaling extending this to over 1 million tokens. Models ship in BF16 and support 4-bit quantization with near-lossless accuracy — the quantized 35B-A3B fits on consumer hardware with room to spare.
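The quantization claim is easy to sanity-check with back-of-the-envelope weight-memory arithmetic (weights only; KV/state cache and activations are extra):

```python
# Rough weight-memory arithmetic for the 35B-A3B checkpoint (weights only).

params = 35e9                    # total parameters, per the model name
bf16_gb = params * 2 / 1e9       # BF16: 2 bytes per parameter
int4_gb = params * 0.5 / 1e9     # 4-bit: 0.5 bytes per parameter

print(f"BF16:  {bf16_gb:.0f} GB")   # needs multi-GPU or offload
print(f"4-bit: {int4_gb:.1f} GB")   # fits a 32GB consumer GPU with headroom
```

At 4 bits the 35B weights come to roughly 17.5 GB, consistent with the claim that the quantized model fits on consumer hardware with room to spare.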

Deployment

The models are supported by all major inference frameworks out of the box:

# Run Qwen3.5-35B-A3B with vLLM
vllm serve Qwen/Qwen3.5-35B-A3B \
  --port 8000 \
  --tensor-parallel-size 8 \
  --max-model-len 262144 \
  --reasoning-parser qwen3

SGLang, Hugging Face Transformers, llama.cpp, and MLX (Apple Silicon) are all supported. Quantized GGUF versions are available for llama.cpp users running fully locally.
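Once the vLLM command above is running, it exposes an OpenAI-compatible HTTP API. A minimal chat request against it might look like the following sketch (the endpoint path and port follow the serve command; the prompt and client code are our own illustration):

```python
import json
import urllib.request

# Payload for vLLM's OpenAI-compatible /v1/chat/completions endpoint.
payload = {
    "model": "Qwen/Qwen3.5-35B-A3B",
    "messages": [
        {"role": "user", "content": "Summarize the delta rule in one sentence."}
    ],
    "max_tokens": 512,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the server from the command above is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the endpoint is OpenAI-compatible, the official openai Python client works too by pointing its base URL at the local server.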

What This Means

The Qwen 3.5 Medium series makes frontier multimodal, agentic AI accessible to researchers and developers who cannot or do not want to depend on cloud APIs. The 35B-A3B model in particular is notable: at 3B active parameters with 35B total, it rivals models that would require multiple high-end GPUs when run as dense architectures.

The release also signals a broader shift in Alibaba’s Qwen strategy. Rather than releasing one headline model and stopping, the team is systematically filling the compute-efficiency frontier — giving users options from a 27B dense daily driver to a 397B agentic powerhouse, all under the same open license.

For researchers at institutions like NYU Shanghai, the Medium series offers a compelling local deployment story: capable enough for complex research tasks, efficient enough to run without datacenter resources.
