Microsoft Releases Phi-4-Reasoning-Vision-15B: Small Model, Big Vision

Microsoft released Phi-4-reasoning-vision-15B on March 4, 2026 — a compact, open-weight multimodal AI model that combines high-resolution visual perception with selective reasoning. At just 15 billion parameters and trained on only 200 billion tokens, it matches or exceeds models many times its size on math, science, and UI understanding tasks, while consuming a fraction of the compute.

[Illustration, AI-generated: neural network architecture showing vision-language fusion with think and no-think reasoning modes]

What Makes It Different

Phi-4-reasoning-vision-15B is the first model in the Phi family designed to simultaneously “see clearly” and “think deeply.” It uses a mid-fusion architecture that pairs a SigLIP-2 NaFlex vision encoder (processing up to 3,600 visual tokens at dynamic resolution) with the Phi-4-Reasoning language backbone. Rather than an early-fusion approach, which requires training the combined model from scratch at massive compute cost, this design reuses pretrained components while still enabling cross-modal reasoning.
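As a rough illustration of what a dynamic-resolution token budget implies, here is a back-of-envelope sketch. The article only states the 3,600-token cap; the 16-pixel patch size is an assumption (typical of SigLIP-style encoders), and the function is illustrative, not Microsoft's implementation.

```python
import math

# Assumed patch size for a SigLIP-style encoder; not stated in the article.
PATCH = 16
# Visual token cap stated for Phi-4-reasoning-vision-15B.
MAX_VISUAL_TOKENS = 3600

def visual_tokens(width: int, height: int) -> int:
    """Number of image patches, clamped to the model's visual token budget."""
    tokens = math.ceil(width / PATCH) * math.ceil(height / PATCH)
    return min(tokens, MAX_VISUAL_TOKENS)

print(visual_tokens(512, 512))    # 32 * 32 = 1024 patches, under the cap
print(visual_tokens(1920, 1080))  # 120 * 68 = 8160 patches, clamped to 3600
```

Under these assumptions, a full-HD screenshot already exceeds the budget, which is why dynamic resolution and token capping matter for UI-understanding workloads.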

The model’s most distinctive feature is its hybrid think/no-think system. Approximately 80% of training data uses <nothink> tokens for straightforward perception tasks like image captioning, OCR, and object grounding. The remaining 20% uses <think> tokens with full chain-of-thought traces for complex math, science, and multi-step reasoning. This means the model learns when to reason deeply and when to respond directly — avoiding the latency penalty of forced reasoning on simple tasks.
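The mode-selection idea can be sketched as follows. The `build_prompt` helper and the task categories are hypothetical illustrations of the training mix described above, not Microsoft's API; in practice the model learns to choose the mode itself.

```python
# Hypothetical routing sketch: perception tasks get <nothink>, reasoning
# tasks get <think>, mirroring the ~80/20 training split in the article.
PERCEPTION_TASKS = {"caption", "ocr", "grounding"}
REASONING_TASKS = {"math", "science", "multi_step"}

def build_prompt(task: str, user_text: str) -> str:
    """Prepend a control token based on (assumed) task category."""
    if task in PERCEPTION_TASKS:
        mode = "<nothink>"
    else:
        # Default unknown tasks to deep reasoning rather than fast perception.
        mode = "<think>"
    return f"{mode} {user_text}"

print(build_prompt("ocr", "Read the text in this receipt."))
print(build_prompt("math", "Solve the equation shown in the image."))
```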

Benchmarks and Performance

On ten standard evaluation benchmarks, Phi-4-reasoning-vision-15B delivers competitive results; highlights include:

  • AI2D (science diagrams): 84.8%
  • ChartQA (chart understanding): 83.3%
  • MathVista (mathematical reasoning): 75.2%
  • ScreenSpot V2 (UI element grounding): 88.2%
  • MMMU (multimodal understanding): 54.3%
  • OCRBench: 76.0%

These scores trail the much larger Qwen3-VL-32B, which scored 85.0, 84.0, 81.8, 93.9, and 70.6 on five of the benchmarks above, but they remain competitive with or ahead of similarly sized models such as Qwen3-VL-8B and Kimi-VL-A3B. The real value emerges when plotting accuracy against compute: Phi-4-reasoning-vision-15B sits on the Pareto frontier of models that are both fast and accurate.
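The Pareto-frontier framing can be made concrete with a small sketch: a model is on the frontier if no smaller-or-equal model is at least as accurate. The (size, accuracy) pairs below are illustrative placeholders, not the benchmark numbers reported above.

```python
def pareto_frontier(models):
    """Return names of models not dominated on (smaller size, higher accuracy)."""
    frontier = []
    for name, params, acc in models:
        dominated = any(
            p <= params and a >= acc and (p < params or a > acc)
            for _, p, a in models
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Placeholder (name, params in billions, accuracy) tuples for illustration only.
models = [
    ("model-A", 15, 80.0),
    ("model-B", 32, 85.0),
    ("model-C", 8,  70.0),
    ("model-D", 30, 79.0),  # dominated: larger than model-A yet less accurate
]
print(pareto_frontier(models))  # ['model-A', 'model-B', 'model-C']
```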

Remarkably Data-Efficient

Perhaps the most striking aspect is the training efficiency. The model was trained on approximately 200 billion multimodal tokens using just 240 NVIDIA B200 GPUs over 4 days. By contrast, competing multimodal models from Alibaba (Qwen3-VL), Google (Gemma3), and others each consumed over 1 trillion tokens — roughly 5x more data.
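A back-of-envelope check of those figures shows what the training budget implies in throughput terms (arithmetic only; the underlying numbers are the article's):

```python
# Reported budget: 200B tokens, 240 GPUs, 4 days.
tokens = 200e9
gpus = 240
seconds = 4 * 24 * 3600

aggregate_tps = tokens / seconds    # cluster-wide tokens per second
per_gpu_tps = aggregate_tps / gpus  # tokens per second per GPU
gpu_hours = gpus * 4 * 24           # total GPU-hours consumed

print(f"{aggregate_tps:,.0f} tokens/s aggregate")  # ~578,704 tokens/s
print(f"{per_gpu_tps:,.0f} tokens/s per GPU")      # ~2,411 tokens/s
print(f"{gpu_hours:,} GPU-hours")                  # 23,040 GPU-hours
```

Roughly 23,000 GPU-hours is a modest budget by frontier-model standards, which underlines the data-efficiency claim.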

Microsoft attributes this efficiency to meticulous data curation rather than brute-force scale. The team manually reviewed datasets at a rate of 5–10 minutes per sample, regenerated incorrect answers using GPT-4o, and fixed formatting errors across widely-used open-source benchmarks. Low-quality question sets were repurposed — their high-quality images became seeds for synthetic VQA data.

The data composition also revealed a surprising finding: increasing math/science data 3x while holding UI data constant improved both math benchmarks (37.4% to 38.9% on MathVista) and computer-use performance (48.2% to 63.1% on ScreenSpot-V2), suggesting strong cross-domain transfer effects.

Practical Applications

The model targets three primary use cases:

  1. Scientific and math reasoning — interpreting handwritten equations, extracting data from charts and tables, multi-step problem solving in educational contexts
  2. Computer-use agent tasks — understanding screen content, localizing GUI elements, and selecting interactive UI components for automation workflows
  3. General vision-language tasks — image captioning, visual QA, OCR, and object localization

Phi-4-reasoning-vision-15B is available under the MIT license on Hugging Face, GitHub, and Azure AI Foundry, with full weights, fine-tuning code, and benchmark logs included. It supports a 16,384-token context window and runs on NVIDIA A6000-class GPUs and above.
