Inception Launches Mercury 2: Diffusion-Powered Reasoning at 1,000 Tokens per Second

On February 24, 2026, Inception launched Mercury 2 — the first reasoning-capable diffusion large language model (dLLM) and what the company calls the fastest reasoning LLM available today. Built on a fundamentally different architecture from conventional autoregressive transformers, Mercury 2 reaches 1,009 tokens per second on NVIDIA Blackwell GPUs with just 1.7 seconds of end-to-end latency, making it over five times faster than leading speed-optimized models.
Why Diffusion for Language?
Most large language models generate text one token at a time — a sequential, autoregressive process that creates an inherent speed ceiling. Inception takes a different approach: applying diffusion, the technique that powers modern image and video generation, to language. Instead of predicting the next token in a sequence, a diffusion LLM refines multiple text blocks simultaneously, working more like “an editor revising an entire draft at once rather than looking at individual words,” as the company describes it.
This parallel generation strategy enables dramatically higher throughput while also unlocking a built-in error-correction mechanism. Because the model iteratively refines its output, it can catch and fix hallucinations mid-generation. The result is reasoning-grade quality within real-time latency budgets, something that autoregressive reasoning models, which can take minutes per response, struggle to achieve.
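As a rough illustration of why parallel refinement raises the speed ceiling, the toy model below counts sequential forward passes. It is not Inception's algorithm: the function name and the `tokens_per_pass` parameter are hypothetical, and the value 8 is an arbitrary choice for illustration, not a published Mercury 2 figure.

```python
import math


def sequential_passes(num_tokens: int, tokens_per_pass: int = 1) -> int:
    """Count the sequential model calls needed to emit num_tokens.

    tokens_per_pass is a hypothetical average number of tokens finalized
    per forward pass. Autoregressive decoding is the special case of 1.
    """
    if num_tokens < 0 or tokens_per_pass < 1:
        raise ValueError("num_tokens must be >= 0 and tokens_per_pass >= 1")
    return math.ceil(num_tokens / tokens_per_pass)


# One token per pass: the autoregressive baseline.
autoregressive = sequential_passes(1024)

# Assumed (illustrative) case where refinement finalizes 8 tokens per pass.
parallel = sequential_passes(1024, tokens_per_pass=8)

print(autoregressive, parallel)
```

Under this simplification the number of sequential steps, and so the latency floor, shrinks in proportion to how many tokens each pass can finalize. Real systems also differ in per-pass cost, which this sketch ignores.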
Benchmarks and Performance
Mercury 2 targets the quality tier of models like Claude Haiku 4.5 and GPT 5.2 Mini while delivering roughly 10x the throughput. Key benchmark results:
- AIME 2025: 91.1
- GPQA Diamond: 74
- IFBench: 71.3
- LiveCodeBench: 67.3
- SciCode: 38.4
- Tau2: 52.9
On latency, the gap is stark. Where Gemini 3 Flash (reasoning mode) averages 14.4 seconds end-to-end and Claude Haiku 4.5 (reasoning mode) takes 23.4 seconds, Mercury 2 delivers answers in 1.7 seconds.
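To put the gap in relative terms, the short sketch below divides the latencies quoted above. The dictionary keys and the helper name `speedup_vs` are illustrative choices; the numbers are the ones reported in this article.

```python
# End-to-end latencies in seconds, as quoted in this article.
LATENCY_S = {
    "Mercury 2": 1.7,
    "Gemini 3 Flash (reasoning)": 14.4,
    "Claude Haiku 4.5 (reasoning)": 23.4,
}


def speedup_vs(baseline: str, candidate: str = "Mercury 2") -> float:
    """Return the baseline's latency as a multiple of the candidate's."""
    return LATENCY_S[baseline] / LATENCY_S[candidate]


for name in LATENCY_S:
    if name != "Mercury 2":
        print(f"{name}: {speedup_vs(name):.1f}x the latency of Mercury 2")
```

On these figures the ratios come out to roughly 8.5x for Gemini 3 Flash and 13.8x for Claude Haiku 4.5.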
Pricing further sharpens the value proposition: $0.25 per million input tokens and $0.75 per million output tokens. That is roughly 50% cheaper than Gemini 3 Flash on input and 75% cheaper on output, and approximately one quarter the price of Claude Haiku 4.5.
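A quick way to sanity-check these numbers is to price a sample workload and back out the comparison prices that the stated discounts imply. The sketch below uses the Mercury 2 prices from this article. The function names and the sample token counts are hypothetical, and the implied comparison prices are derived from the article's percentages, not taken from any official price list.

```python
# Mercury 2 prices in USD per million tokens, as quoted in this article.
MERCURY_2_USD_PER_M = {"input": 0.25, "output": 0.75}


def request_cost_usd(input_tokens: int, output_tokens: int,
                     prices: dict = MERCURY_2_USD_PER_M) -> float:
    """Cost in USD of a workload at per-million-token prices."""
    total = input_tokens * prices["input"] + output_tokens * prices["output"]
    return total / 1_000_000


def implied_comparison_price(price: float, fraction_cheaper: float) -> float:
    """Price p such that `price` is `fraction_cheaper` below p."""
    return price / (1.0 - fraction_cheaper)


# Hypothetical workload: 2M input tokens and 0.5M output tokens.
print(request_cost_usd(2_000_000, 500_000))

# Comparison prices implied by "50% cheaper on input, 75% cheaper on output".
print(implied_comparison_price(0.25, 0.50))
print(implied_comparison_price(0.75, 0.75))
```

The implied comparison prices work out to $0.50 per million input tokens and $3.00 per million output tokens.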
Capabilities and Use Cases
Mercury 2 ships with a 128K context window, tunable reasoning depth, native tool use, and schema-aligned JSON output. These features aim it squarely at production workloads where inference latency determines adoption: agent loops that require rapid multi-step tool calls, real-time voice assistants, search systems, and instant code editing at scale.
The tunable reasoning feature is especially notable — developers can dial the model’s thinking depth up or down depending on the task, trading quality for speed on simpler queries while engaging full reasoning for complex problems.
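The agent-loop use case is where per-call latency matters most, because sequential calls add up. The sketch below is a deliberately simple model: it assumes every step of a loop costs the full end-to-end latency quoted in the benchmarks section and ignores tool execution time. The function name and the 10-step loop are hypothetical.

```python
def loop_latency_s(per_call_latency_s: float, num_calls: int) -> float:
    """Total model latency for a strictly sequential loop of num_calls calls."""
    if num_calls < 0:
        raise ValueError("num_calls must be non-negative")
    return per_call_latency_s * num_calls


# Per-call latencies in seconds, as quoted in the benchmarks section.
per_call = {
    "Mercury 2": 1.7,
    "Gemini 3 Flash (reasoning)": 14.4,
    "Claude Haiku 4.5 (reasoning)": 23.4,
}

# Hypothetical agent task that needs 10 sequential model calls.
for name, latency in per_call.items():
    print(f"{name}: {loop_latency_s(latency, 10):.0f} s of model time")
```

Under these assumptions a 10-step loop spends about 17 seconds in the model with Mercury 2, compared with about 144 and 234 seconds for the other two.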
The Team Behind It
Inception was founded by researchers from Stanford, UCLA, and Cornell who contributed to foundational AI work including FlashAttention, Decision Transformers, and direct preference optimization (DPO). The company has positioned itself as a pioneer in applying diffusion techniques to language, having previously released Mercury Coder Mini and Mercury Coder Small, code-generation models that achieved over 1,100 tokens per second on NVIDIA H100 GPUs.
What This Means
Mercury 2 represents a significant proof point for diffusion-based language models. While autoregressive transformers have dominated the LLM landscape since GPT-2, the speed and cost advantages of dLLMs could reshape how reasoning models are deployed in latency-sensitive production environments. If diffusion models can continue to close the remaining quality gap with frontier autoregressive models while maintaining their speed advantage, they may carve out a substantial niche — particularly for agentic AI systems where every millisecond of latency compounds across multi-step workflows.
Mercury 2 is available now via the Inception API.


