Qwen 3.5: Alibaba’s Native Multimodal Agent Model Arrives

Alibaba’s Qwen team released Qwen 3.5 on February 16, 2026, marking a significant architectural leap with its flagship 397B-parameter mixture-of-experts model built for the agentic AI era. Unlike previous generations where vision was bolted on as an afterthought, Qwen 3.5 was trained from scratch on text, images, and video simultaneously — making it one of the first truly native multimodal foundation models capable of autonomous action across digital environments.

Illustration: Qwen 3.5 as a native multimodal AI agent (AI-generated image)

Architecture and Technical Specs

The flagship Qwen3.5-397B-A17B model uses a sparse Mixture-of-Experts (MoE) architecture that activates only 17 billion of its 397 billion total parameters per forward pass. Combined with a hybrid design that interleaves Gated DeltaNet linear-attention layers with standard attention layers on top of the sparse MoE feed-forward blocks, this lets the model reach notable inference efficiency: approximately 45 tokens per second on an 8×H100 GPU cluster.
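To make the "17B active of 397B total" idea concrete, here is a minimal, illustrative sketch of top-k expert routing, the mechanism sparse MoE models use to run only a few expert feed-forward networks per token. This is a toy with made-up expert counts, not Qwen's implementation.

```python
# Toy top-k MoE router (illustrative only; not Qwen's actual code).
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route_token(router_logits, k=2):
    """Pick the top-k experts for one token and renormalize their gate weights."""
    probs = softmax(router_logits)
    topk = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in topk)
    return [(i, probs[i] / total) for i in topk]

random.seed(0)
num_experts = 8                      # toy count; the real model has far more
logits = [random.gauss(0, 1) for _ in range(num_experts)]
chosen = route_token(logits, k=2)
# Only k of num_experts expert FFNs run for this token, which is why
# per-token compute tracks active parameters rather than total parameters.
print(chosen)
```

Because the dense attention and embedding layers still always run, active-parameter counts understate wall-clock cost somewhat, but they remain the standard way to compare MoE inference budgets.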

Key technical specifications include:

  • Total parameters: 397B (17B active per token)
  • Context window: 256K tokens native; 1M tokens on the hosted Qwen3.5-Plus
  • Vocabulary: 250K tokens (up from 152K in Qwen 3)
  • Language support: 201 languages and dialects (up from 82 in the previous generation)
  • Training modalities: Text, images, and video — trained natively together from the start
  • License: Apache 2.0 open-weight release

Compared to Qwen3-Max, inference throughput is 8.6× higher at a 32K context length and 19× higher at 256K, a difference that makes long-document and multimodal workflows substantially more practical at scale.
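A quick back-of-envelope calculation shows what those multipliers mean in practice. The 45 tok/s figure and the 8.6×/19× ratios come from the article; the baseline rates derived below are inferred from them, not published numbers.

```python
# Back-of-envelope: implied predecessor throughput and end-user latency.
QWEN35_TPS = 45.0          # reported decode rate on 8xH100 (from the article)

def baseline_tps(speedup):
    """Implied Qwen3-Max throughput if Qwen 3.5 is `speedup` times faster."""
    return QWEN35_TPS / speedup

def seconds_to_generate(tokens, tps):
    return tokens / tps

for ctx, speedup in [("32K", 8.6), ("256K", 19.0)]:
    base = baseline_tps(speedup)
    print(f"{ctx} context: implied baseline ~{base:.1f} tok/s vs "
          f"{QWEN35_TPS:.0f} tok/s; a 4096-token answer takes "
          f"~{seconds_to_generate(4096, base):.0f}s vs "
          f"~{seconds_to_generate(4096, QWEN35_TPS):.0f}s")
```

At 256K context the implied baseline drops below 3 tok/s, which is why the gap matters most for long-context workloads.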

Benchmark Performance

Figure: Qwen 3.5 397B-A17B benchmark scores across reasoning, coding, multimodal, and agentic tasks (image credit: Qwen / Hugging Face)

Qwen 3.5 posts competitive numbers across a wide range of evaluations:

  • AIME 2026 (math olympiad reasoning): 91.3
  • GPQA Diamond (graduate-level science reasoning): 88.4
  • MathVista (visual math reasoning): 90.3
  • MMMU (multimodal understanding): 85.0
  • LiveCodeBench v6 (competitive coding): 83.6
  • SWE-bench Verified (real-world software engineering): 76.4
  • IFBench (instruction following): 76.5 — top result among evaluated models
  • OmniDocBench (document understanding): 90.8
  • Video-MME (video comprehension): 87.5

The model surpasses Claude Opus 4.5 on multimodal benchmarks and posts competitive results against GPT-5.2, while remaining fully open-weight and available for local deployment.

Visual Agentic Capabilities

The headline capability distinguishing Qwen 3.5 from prior models is its visual agentic interface control. Because the model was trained natively on UI screenshots alongside text and video, it can interpret and interact with graphical interfaces — clicking buttons, filling forms, and executing multi-step workflows across mobile and desktop applications without human intervention.

This positions Qwen 3.5 as a direct competitor to agent-oriented systems such as Anthropic’s Computer Use and Google’s Project Mariner. The model can process images at up to 1344×1344 resolution and video clips up to 60 seconds long, enabling it to watch a screen recording and then reproduce the demonstrated workflow autonomously.

Cost and Availability

Alibaba reports approximately 60% lower inference cost per token compared to its predecessor, with the hosted Qwen3.5-Plus API priced at around $0.18 per million tokens. The open-weight model is available on Hugging Face under Apache 2.0, meaning developers can download, fine-tune, and self-host it on their own infrastructure.
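The quoted numbers are easy to turn into a workload estimate. The $0.18-per-million-token price and the ~60% reduction are from the article; the workload size below is a made-up example.

```python
# Simple cost model for the quoted Qwen3.5-Plus API pricing.
PRICE_PER_M = 0.18                     # USD per 1M tokens (quoted, hosted API)

def cost_usd(tokens, price_per_m=PRICE_PER_M):
    return tokens * price_per_m / 1_000_000

def implied_predecessor_price(price_per_m=PRICE_PER_M, reduction=0.60):
    """If the new price is ~60% lower, back out the implied old price."""
    return price_per_m / (1 - reduction)

print(f"10M tokens/day costs ${cost_usd(10_000_000):.2f}/day")
print(f"Implied predecessor price: ${implied_predecessor_price():.2f}/M tokens")
```

At these rates, a service pushing 10M tokens a day would spend under $2 daily, versus roughly $4.50 at the implied predecessor price.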

Both deployment options are live: the open-weight Qwen3.5-397B-A17B for self-hosting and a hosted “Qwen3.5-Plus” variant for API access with the extended 1M token context. The broad language support — 201 languages versus 82 in the previous generation — combined with the native multimodal architecture makes it one of the more versatile open frontier models currently available.
