MiniMax M3: Frontier Coding, 1M Context, and Sparse Attention

MiniMax released M3 on June 1, 2026, claiming it is the first open-weight model to combine frontier coding, a 1-million-token context window, and native multimodality in a single architecture. The headline is not just the benchmark sheet but the engine underneath: a new sparse-attention mechanism, MiniMax Sparse Attention (MSA), that the company says cuts per-token compute at 1M context to roughly one-twentieth of its previous generation's, with 9.7× faster prefill and 15.6× faster decode.

Diagram of MiniMax Sparse Attention (MSA) showing block-level selection over uncompressed key-value pairs on a grouped-query attention backbone
Image credit: MiniMax

The Architecture: MiniMax Sparse Attention

M3 is a Mixture-of-Experts model with 229.9 billion total parameters and 256 fine-grained experts. It activates just 9.8 billion parameters per token (about 4.3% of the total), a sparse footprint that keeps inference cheap relative to its capacity. The more novel piece is attention. Where DeepSeek’s Multi-head Latent Attention (MLA) compresses keys and values into a low-dimensional latent space, MiniMax Sparse Attention (MSA) keeps a standard grouped-query attention (GQA) backbone and applies block-level selection over the real, uncompressed key-value cache.
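To make "block-level selection over the uncompressed key-value cache" concrete, here is a minimal pure-Python sketch. It is not MiniMax's implementation: the scoring rule (query dotted with each block's mean key), the `top_k` parameter, and all function names are assumptions chosen for illustration, since the article does not say how MSA ranks blocks. What it does show is the general shape of the idea: pick a few blocks, then run ordinary softmax attention over the real keys and values inside them.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def select_blocks(query, keys, block_size, top_k):
    """Rank KV blocks by q . mean(keys in block) and keep the top_k.
    The mean-key score is an illustrative assumption, not MSA's rule."""
    scored = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        scored.append((dot(query, mean_key), start // block_size))
    best = sorted(scored, reverse=True)[:top_k]
    return sorted(idx for _, idx in best)

def block_sparse_attention(query, keys, values, block_size, top_k):
    """Softmax attention restricted to the selected blocks.
    Keys and values are used as stored, with no compression step."""
    blocks = select_blocks(query, keys, block_size, top_k)
    positions = [p for b in blocks
                 for p in range(b * block_size,
                                min((b + 1) * block_size, len(keys)))]
    scale = 1.0 / math.sqrt(len(query))
    scores = [dot(query, keys[p]) * scale for p in positions]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * values[p][d] for w, p in zip(weights, positions)) / total
           for d in range(dim)]
    return out, blocks

# Toy cache: 8 positions, block size 2, so 4 blocks.
keys = [[0.1, 0.0], [0.2, 0.0], [2.0, 0.0], [1.8, 0.0],
        [-1.0, 0.0], [-1.0, 0.0], [1.0, 1.0], [1.2, 0.0]]
values = [[float(p)] for p in range(8)]

# GQA: several query heads share one KV head, so each head runs
# its own selection against the same keys and values.
query_heads = [[1.0, 0.0], [0.5, 1.0]]
for q in query_heads:
    out, blocks = block_sparse_attention(q, keys, values,
                                         block_size=2, top_k=2)
    print(blocks, out)
```

With `top_k` fixed, the per-query cost scales with `top_k * block_size` rather than with the full cache length, which is the property that matters at long context.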

The mechanism reorganizes the attention loop as a “KV outer, gather Q” pass, iterating over key-value blocks in the outer loop and aggregating the queries that attend to each block. MiniMax reports that this is roughly 4× faster than open-source alternatives such as Flash-Sparse-Attention and flash-moba. The practical payoff is in long-context economics: at a 1M-token window, per-token compute lands near 1/20 of the prior-generation model's, which is what makes a million-token context commercially serviceable rather than a demo.
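The "KV outer, gather Q" ordering can be sketched in a few lines. This is a hedged illustration rather than MiniMax's kernel: the function names, the `selected` input (which blocks each query picked), and the online-softmax bookkeeping used to merge partial results are all assumptions. The sketch walks KV blocks in the outer loop, gathers the queries that chose each block, and checks that the result matches a plain query-outer computation.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def kv_outer_attention(queries, keys, values, selected, block_size):
    """selected[i] lists the KV block indices query i attends to.
    Assumes every query selects at least one block."""
    num_blocks = (len(keys) + block_size - 1) // block_size
    # Invert the selection: for each block, which queries picked it?
    queries_for_block = [[] for _ in range(num_blocks)]
    for qi, blocks in enumerate(selected):
        for b in blocks:
            queries_for_block[b].append(qi)

    dim = len(values[0])
    scale = 1.0 / math.sqrt(len(queries[0]))
    # Per-query online-softmax state: running max, denominator, weighted sum.
    run_max = [-math.inf] * len(queries)
    denom = [0.0] * len(queries)
    acc = [[0.0] * dim for _ in queries]

    for b in range(num_blocks):                    # outer loop: KV blocks
        lo = b * block_size
        hi = min(lo + block_size, len(keys))
        for qi in queries_for_block[b]:            # gather this block's queries
            for p in range(lo, hi):
                s = dot(queries[qi], keys[p]) * scale
                new_max = max(run_max[qi], s)
                rescale = math.exp(run_max[qi] - new_max)
                w = math.exp(s - new_max)
                denom[qi] = denom[qi] * rescale + w
                acc[qi] = [a * rescale + w * v
                           for a, v in zip(acc[qi], values[p])]
                run_max[qi] = new_max
    return [[a / denom[qi] for a in acc[qi]] for qi in range(len(queries))]

def query_outer_reference(queries, keys, values, selected, block_size):
    """Same math with the conventional loop order, for comparison."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    outs = []
    for q, blocks in zip(queries, selected):
        pos = [p for b in blocks
               for p in range(b * block_size,
                              min((b + 1) * block_size, len(keys)))]
        scores = [dot(q, keys[p]) * scale for p in pos]
        top = max(scores)
        w = [math.exp(s - top) for s in scores]
        z = sum(w)
        outs.append([sum(wi * values[p][d] for wi, p in zip(w, pos)) / z
                     for d in range(len(values[0]))])
    return outs
```

Both functions compute the same attention output. The difference is memory traffic: in the KV-outer form each block is visited once and reused by every query that selected it, which is plausibly where a GPU implementation would gain, though the article gives no kernel-level detail.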

Benchmarks: Coding and Agentic Strength

M3’s reported results cluster around coding and agentic tasks rather than raw post-training polish:

  • SWE-Bench Pro: 59.0% — MiniMax says this surpasses GPT-5.5 and Gemini 3.1 Pro and approaches Claude Opus 4.7.
  • Terminal-Bench 2.1: 66.0%
  • BrowseComp: 83.5 — ahead of Opus 4.7’s 79.3 on this web-agent benchmark.
  • MCP Atlas: 74.2%
  • SWE-fficiency: 34.8%
  • KernelBench Hard: 28.8%

The model is weaker on PostTrainBench (0.37), trailing Opus 4.7 (0.42) and GPT-5.5 (0.39) — a reminder that “frontier” here means agentic and long-context capability, not a clean sweep of every leaderboard.

What Long Autonomy Looks Like

MiniMax leans on two long-horizon demonstrations to make the agentic case. In the first, M3 independently reproduced the ICLR 2025 paper “Learning Dynamics of LLM Finetuning,” running for nearly 12 hours to produce 18 commits and 23 experimental figures while validating core results including SFT-stage predictions and DPO effects.

Results from MiniMax M3 autonomously reproducing an ICLR 2025 paper on LLM fine-tuning dynamics, showing experimental figures
Image credit: MiniMax

The second is a hardware-level stress test: optimizing an FP8 matrix-multiplication kernel on NVIDIA Hopper GPUs over roughly 24 hours. Across six rounds — 147 benchmark submissions and 1,959 tool calls — M3 lifted hardware utilization from 7.6% to 71.3%, a 9.4× speedup. Both runs are vendor-reported and not yet independently reproduced, so treat the numbers as a capability ceiling rather than a guarantee.
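The reported figures for the kernel run can be sanity-checked with a few lines of arithmetic. The sketch below uses only numbers quoted above; treating the utilization ratio as the speedup assumes the same peak throughput before and after, which is an assumption on our part rather than something MiniMax states.

```python
# Figures as reported by MiniMax for the FP8 kernel run.
start_util = 7.6      # percent hardware utilization at the start
end_util = 71.3       # percent hardware utilization at the end
rounds = 6
submissions = 147
tool_calls = 1959

# Assumption: same peak throughput, so the utilization ratio equals the speedup.
speedup = end_util / start_util
submissions_per_round = submissions / rounds
tool_calls_per_submission = tool_calls / submissions

print(f"speedup: {speedup:.1f}x")                                       # 9.4x
print(f"submissions per round: {submissions_per_round:.1f}")            # 24.5
print(f"tool calls per submission: {tool_calls_per_submission:.1f}")    # 13.3
```

The ratio matches the quoted 9.4×, and the averages give a feel for the workload: about 24 benchmark submissions per round, with roughly 13 tool calls behind each one.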

Availability, Licensing, and Pricing

At launch M3 is available via the MiniMax API, the MiniMax Code agent, and subscription token plans (Plus $20/mo, Max $50/mo, Ultra $120/mo). On OpenRouter it was listed at around $0.60 / $2.40 per million input/output tokens, with a temporary 50% promotion roughly halving that. The model is natively multimodal, handling image and video input, document and chart parsing, and computer use, having undergone mixed-modality training “from step zero.”
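For a rough sense of what the OpenRouter list prices mean per request, here is a small estimator. The rates are the approximate figures quoted above and may change; the function name and the `discount` parameter are illustrative, and the sketch ignores any caching or tiered pricing a provider might apply.

```python
# Approximate OpenRouter list prices quoted in the article, in USD per million tokens.
INPUT_USD_PER_M = 0.60
OUTPUT_USD_PER_M = 2.40

def estimate_cost(input_tokens, output_tokens, discount=0.0):
    """Estimated USD cost of one request; discount=0.5 models the 50% promotion."""
    base = (input_tokens / 1_000_000 * INPUT_USD_PER_M
            + output_tokens / 1_000_000 * OUTPUT_USD_PER_M)
    return base * (1.0 - discount)

# A full 1M-token prompt with a 10k-token reply.
full_price = estimate_cost(1_000_000, 10_000)
promo_price = estimate_cost(1_000_000, 10_000, discount=0.5)
print(f"list: ${full_price:.3f}, promo: ${promo_price:.3f}")
```

At these rates a single prompt that fills the 1M-token window costs about $0.62 at list price, or about $0.31 under the promotion, with the input side dominating the bill.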

The open-weight claim comes with a caveat worth watching: MiniMax says the technical report and weights will land within 10 days of launch, but the prior M2.7 license restricted commercial use of the model or derivatives without written authorization. Whether M3 ships under similarly restrictive terms will determine how “open” it really is for builders.

What This Means

M3 is a bet that the next competitive axis is not a higher MMLU score but sustained, cheap, long-context autonomy. By attacking attention’s quadratic cost directly, MiniMax is trying to make million-token agentic workflows — multi-hour coding sessions, paper reproduction, kernel tuning — economically routine rather than a flagship-only luxury. If the weights arrive under a genuinely usable license, M3 becomes one of the most capable open models for agentic engineering. If they arrive locked down, it is a strong API product with an interesting architecture paper attached. The next ten days will tell which.
