Microsoft released Phi-4-reasoning-vision-15B on March 4, 2026 — a compact, open-weight multimodal AI model that combines high-resolution visual perception with selective reasoning. At just 15 billion parameters and trained on only 200 billion tokens, it matches or exceeds models many times its size on math, science, and UI understanding tasks, while consuming a fraction of the compute.
Phi-4-reasoning-vision-15B is the first model in the Phi family to simultaneously “see clearly” and “think deeply.” It uses a mid-fusion architecture that pairs a SigLIP-2 NaFlex vision encoder (processing up to 3,600 visual tokens at dynamic resolution) with the Phi-4-Reasoning language backbone. Rather than an early-fusion approach, which requires massive compute to train the modalities jointly from scratch, this design leverages pretrained components while still enabling cross-modal reasoning.
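The dynamic-resolution token budget can be made concrete with a small sketch. The function below is an illustration only: the function name, the 16-pixel patch size, and the downscaling rule are assumptions, not the actual SigLIP-2 NaFlex implementation. It shows the general idea of turning an image into a patch grid and shrinking it, aspect ratio preserved, until it fits a 3,600-token cap.

```python
import math

def visual_token_count(width: int, height: int, patch: int = 16,
                       max_tokens: int = 3600) -> tuple[int, int, int]:
    """Hypothetical NaFlex-style grid sizing: split the image into
    patch x patch tiles, then downscale uniformly until the number of
    tiles fits the visual-token budget. Returns (tokens, grid_w, grid_h)."""
    grid_w = max(1, math.ceil(width / patch))
    grid_h = max(1, math.ceil(height / patch))
    if grid_w * grid_h <= max_tokens:
        return grid_w * grid_h, grid_w, grid_h
    # Scale both grid dimensions by the same factor so aspect ratio
    # is preserved and the total patch count fits the budget.
    scale = math.sqrt(max_tokens / (grid_w * grid_h))
    grid_w = max(1, math.floor(grid_w * scale))
    grid_h = max(1, math.floor(grid_h * scale))
    return grid_w * grid_h, grid_w, grid_h

# A 512x512 image fits directly (32x32 = 1,024 patches); a 4096x4096
# screenshot would need 65,536 patches and gets downscaled to the cap.
print(visual_token_count(512, 512))
print(visual_token_count(4096, 4096))
```

The key property is that small images pay only for the tokens they need, while arbitrarily large inputs are clamped to the 3,600-token ceiling.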
The model’s most distinctive feature is its hybrid think/no-think system. Approximately 80% of training data uses <nothink> tokens for straightforward perception tasks like image captioning, OCR, and object grounding. The remaining 20% uses <think> tokens with full chain-of-thought traces for complex math, science, and multi-step reasoning. This means the model learns when to reason deeply and when to respond directly — avoiding the latency penalty of forced reasoning on simple tasks.
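In the trained model this routing is learned from the 80/20 data mix, but the caller-side interface can be sketched as follows. The keyword heuristic below is invented purely for illustration; only the `<think>`/`<nothink>` token names come from the article.

```python
# Hypothetical sketch of hybrid think/no-think prompting. In the real
# model the mode is learned, not keyword-matched; this only illustrates
# why <nothink> avoids the latency of an unneeded chain of thought.

PERCEPTION_HINTS = ("caption", "ocr", "read the text", "locate", "ground")

def choose_mode(task: str) -> str:
    """Return '<nothink>' for simple perception tasks, '<think>' otherwise."""
    t = task.lower()
    return "<nothink>" if any(h in t for h in PERCEPTION_HINTS) else "<think>"

def build_prompt(task: str) -> str:
    mode = choose_mode(task)
    if mode == "<nothink>":
        return f"{mode} {task}"  # direct answer, no reasoning-trace latency
    return f"{mode} {task}\nReason step by step before answering."

print(build_prompt("Caption this image"))
print(build_prompt("Solve the geometry problem shown in the figure"))
```

The design point is that the expensive chain-of-thought decoding is only paid for the minority of requests that actually need it.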
On ten standard evaluation benchmarks, Phi-4-reasoning-vision-15B delivers competitive results.
The model’s scores trail those of the much larger Qwen3-VL-32B (which scored 85.0, 84.0, 81.8, 93.9, and 70.6 on the same benchmarks, respectively) but remain competitive with, or ahead of, similarly sized models like Qwen3-VL-8B and Kimi-VL-A3B. The real value emerges when plotting accuracy against compute: Phi-4-reasoning-vision sits on the Pareto frontier of models that are both fast and accurate.
Perhaps the most striking aspect is the training efficiency. The model was trained on approximately 200 billion multimodal tokens using just 240 NVIDIA B200 GPUs over 4 days. By contrast, competing multimodal models from Alibaba (Qwen3-VL), Google (Gemma3), and others each consumed over 1 trillion tokens — roughly 5x more data.
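A quick back-of-envelope calculation puts the stated budget in perspective. The token count, GPU count, and wall-clock time below come from the article; the per-GPU throughput is derived here, not a figure Microsoft reports.

```python
# Implied training throughput from the article's stated budget:
# 200B multimodal tokens on 240 NVIDIA B200 GPUs over 4 days.

tokens = 200e9
gpus = 240
seconds = 4 * 24 * 3600           # 4 days of wall-clock time

gpu_seconds = gpus * seconds       # total GPU-seconds consumed
per_gpu_rate = tokens / gpu_seconds

print(f"{gpu_seconds:,.0f} GPU-seconds")
print(f"~{per_gpu_rate:,.0f} tokens/s per GPU")
```

At roughly 83 million GPU-seconds for the full run, this is a strikingly small budget next to the 1-trillion-token runs the article attributes to competing models.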
Microsoft attributes this efficiency to meticulous data curation rather than brute-force scale. The team manually reviewed datasets, spending 5–10 minutes per sample, regenerated incorrect answers using GPT-4o, and fixed formatting errors across widely used open-source benchmarks. Low-quality question sets were repurposed: their high-quality images became seeds for synthetic VQA data.
The data composition also revealed a surprising finding: increasing math/science data 3x while holding UI data constant improved both math benchmarks (37.4% to 38.9% on MathVista) and computer-use performance (48.2% to 63.1% on ScreenSpot-V2), suggesting strong cross-domain transfer effects.
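The size of the transfer effect is easier to see as deltas. The scores below are the ones quoted in the article, in percentage points; the comparison itself is just arithmetic.

```python
# Improvement from the 3x math/science data ablation, with UI data
# held constant (before, after) scores as reported in the article.

mathvista = (37.4, 38.9)
screenspot_v2 = (48.2, 63.1)

delta_math = round(mathvista[1] - mathvista[0], 1)
delta_ui = round(screenspot_v2[1] - screenspot_v2[0], 1)

print(delta_math)  # MathVista gain in points
print(delta_ui)    # ScreenSpot-V2 gain in points
```

The notable part is the asymmetry: the untouched UI domain improved roughly ten times more than the domain whose data was actually scaled.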
The model targets three primary use cases.
Phi-4-reasoning-vision-15B is available under the MIT license on Hugging Face, GitHub, and Azure AI Foundry, with full weights, fine-tuning code, and benchmark logs included. It supports a 16,384-token context window and runs on GPUs from the NVIDIA A6000 up.
