Mistral Small 4: Four Models Unified in One Open-Source MoE

On March 16, 2026, Mistral AI released Mistral Small 4 — a 119-billion-parameter Mixture-of-Experts model that unifies instruction following, reasoning, multimodal understanding, and agentic coding into a single deployment. With only 6 billion active parameters per token (8B including embedding layers), it delivers frontier-class performance at a fraction of the cost and latency of larger models. The model is released under the Apache 2.0 license.


[Figure: Visualization of a sparse Mixture-of-Experts architecture with 128 expert nodes, 4 active per token. AI-generated illustration.]

Four Models in One

Mistral Small 4 consolidates four previously separate model families into a single architecture:

  • Mistral Small — fast instruction following
  • Magistral — step-by-step reasoning
  • Pixtral — multimodal (text + image) understanding
  • Devstral — agentic coding workflows

This means developers no longer need to route requests between specialized models. A single deployment handles general chat, document analysis, code generation, and complex reasoning tasks.

Architecture and Performance

The model uses a granular MoE architecture with 128 experts and 4 active per token, keeping compute costs low while maintaining a large total parameter budget. It supports a 256K-token context window and accepts both text and image inputs.
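The routing scheme described above can be illustrated with a minimal top-k gating sketch. This is a generic MoE convention (softmax over router logits, then keep the top-k experts and renormalize), not Mistral's actual router code, which has not been published in this form:

```python
import math
import random

NUM_EXPERTS = 128  # total experts in the layer (per the article)
TOP_K = 4          # experts activated per token (per the article)

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route_token(router_logits, top_k=TOP_K):
    """Pick the top-k experts for one token and renormalize their gate weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}  # expert index -> gate weight

random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
gates = route_token(logits)
# Only 4 of the 128 experts process this token; their gate weights sum to 1,
# which is why per-token compute tracks the 6B active parameters, not the 119B total.
```

The token's output is then the gate-weighted sum of the selected experts' outputs, which is how a 119B-parameter model keeps per-token compute close to that of a small dense model.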

Compared to its predecessor Mistral Small 3, Small 4 delivers:

  • 40% lower end-to-end latency in latency-optimized configurations
  • 3x higher throughput (requests per second) in throughput-optimized setups

On benchmarks, Mistral Small 4 matches or surpasses GPT-OSS 120B across AA LCR, LiveCodeBench, and AIME 2025 while generating significantly shorter outputs. On AA LCR, Small 4 scores 0.72 with just 1.6K characters of output, whereas comparable Qwen models need 5.8–6.1K characters for similar scores. On LiveCodeBench, it outperforms GPT-OSS 120B while producing 20% less output.

Configurable Reasoning

A standout feature is the reasoning_effort parameter, which lets developers control the depth of reasoning on a per-request basis:

  • “none” — fast, lightweight responses comparable to Mistral Small 3.2
  • “high” — deep step-by-step reasoning at Magistral-level depth

This eliminates the need to maintain separate fast and reasoning model deployments, simplifying infrastructure and reducing operational overhead.
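A minimal sketch of how such a per-request switch might look in client code, assuming an OpenAI-style chat-completions payload; the model identifier and exact payload schema here are assumptions, so consult Mistral's API reference for the real shape:

```python
def make_request(prompt, reasoning_effort="none"):
    """Build a chat-completion payload with per-request reasoning depth.

    Payload shape is modeled on OpenAI-style chat APIs (an assumption);
    "mistral-small-4" is a hypothetical model identifier.
    """
    assert reasoning_effort in ("none", "high")
    return {
        "model": "mistral-small-4",
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": reasoning_effort,
    }

# One deployment, two behaviors:
fast = make_request("Summarize this email.")                      # Small-3.2-style speed
deep = make_request("Prove the claim step by step.", "high")      # Magistral-level depth
```

The same endpoint serves both calls; only the per-request parameter changes, which is what removes the need for separate fast and reasoning deployments.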

Self-Hosting and Availability

Mistral Small 4 can run on relatively modest hardware for a 119B-parameter model: the minimum configurations are 4x NVIDIA HGX H100, 2x HGX H200, or a single DGX B200. The model is compatible with popular serving frameworks including vLLM, llama.cpp, SGLang, and Transformers.
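As a back-of-envelope check on those hardware figures, the sketch below estimates the raw weight footprint at common precisions. The bytes-per-parameter values are general rules of thumb, not Mistral's stated numbers, and the estimate ignores KV cache and activation memory, which add substantially on top:

```python
PARAMS_B = 119  # total parameters, in billions (per the article)

# Common weight precisions and their storage cost (rule-of-thumb assumption).
bytes_per_param = {"bf16": 2, "fp8": 1, "int4": 0.5}

for prec, nbytes in bytes_per_param.items():
    weights_gb = PARAMS_B * nbytes  # 1e9 params * nbytes bytes = nbytes GB per billion
    print(f"{prec}: ~{weights_gb:.0f} GB of weights")

# 4x H100 (80 GB HBM each) = 320 GB total, comfortably above the ~238 GB
# of bf16 weights, leaving headroom for KV cache at long context lengths.
```

This is why a 119B MoE fits on a single 4-GPU H100 node even though its total parameter count rivals much more expensive-to-serve dense models.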

It is available through the Mistral API, AI Studio, Hugging Face, NVIDIA’s build.nvidia.com for free prototyping, and as an NVIDIA NIM container for production deployment.
