Arcee AI released Trinity-Large-Thinking on April 1, 2026: an open-source frontier reasoning model built for complex, long-horizon AI agents. Weighing in at 398 billion total parameters (with ~13B active per token), it is one of the largest open-source models ever released by a U.S. startup, and it poses a credible challenge to proprietary alternatives at a fraction of the cost.
Trinity-Large-Thinking is built on a sparse Mixture-of-Experts (MoE) architecture with 256 experts, only 4 of which are active per token: a routing fraction of just 1.56%, notably sparser than most competing MoE models. Despite its scale, the model delivers roughly 2–3× higher inference throughput than similarly sized dense models, since the majority of weights sit idle at any given step.
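The sparsity arithmetic is easy to verify. A minimal sketch using the figures from the text above (the active-parameter share is computed the same way for comparison):

```python
# Sparsity figures stated in the article.
TOTAL_EXPERTS = 256
ACTIVE_EXPERTS = 4
TOTAL_PARAMS_B = 398   # billions, total
ACTIVE_PARAMS_B = 13   # billions, active per token (approximate)

def fraction(active: float, total: float) -> float:
    """Share of the model exercised for each token."""
    return active / total

print(f"expert routing fraction: {fraction(ACTIVE_EXPERTS, TOTAL_EXPERTS):.2%}")  # → 1.56%
print(f"active parameter share:  {fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B):.2%}")
```

The two ratios differ because shared (non-expert) weights run for every token; only the expert layers are routed.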
Key architectural details:

- 398B total parameters, with only ~13B active per token
- 256 experts, 4 routed per token (a 1.56% routing fraction)
- 512k context window
- Native tool calling and explicit <think> reasoning traces
- Apache 2.0 license
The pretraining run consumed 17 trillion tokens over 33 days on 2,048 NVIDIA B300 GPUs — the largest publicly stated B300 pretraining run to date. Data curation was handled in partnership with Datology, and training used the Muon optimizer and completed with zero loss spikes.
The key upgrade from Trinity-Large-Preview (the earlier instruct model) is the explicit chain-of-thought mechanism. The model emits reasoning traces inside <think>...</think> blocks before producing its final response and any tool calls. This is not cosmetic — the reasoning traces are architecturally load-bearing for the model’s agentic performance.
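When working with raw completions rather than the structured API fields, the trace has to be separated from the answer by hand. A minimal sketch, assuming at most one <think>...</think> block at the start of the response as described above:

```python
import re

def split_reasoning(raw: str) -> tuple[str, str]:
    """Split a raw completion into (reasoning, answer).

    Assumes a single leading <think>...</think> block; returns an
    empty reasoning string if no block is present.
    """
    match = re.match(r"\s*<think>(.*?)</think>\s*", raw, flags=re.DOTALL)
    if match is None:
        return "", raw
    return match.group(1).strip(), raw[match.end():]

reasoning, answer = split_reasoning("<think>2 + 2 = 4</think>The answer is 4.")
print(reasoning)  # → 2 + 2 = 4
print(answer)     # → The answer is 4.
```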
When using Trinity-Large-Thinking in multi-turn conversations or agent loops, you must include the full assistant response (thinking + answer) in the conversation history. Stripping out the <think> blocks breaks context coherence and degrades performance. The API separates reasoning_content, content, and tool_calls as distinct fields, making it straightforward to log and display reasoning independently.
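In practice that means each assistant turn is stored with both fields. A minimal sketch, assuming message dicts using the reasoning_content / content names mentioned above (the exact wire format may differ):

```python
def append_assistant_turn(history: list[dict], reasoning: str, answer: str) -> list[dict]:
    """Append the FULL assistant response (thinking + answer) to the
    conversation history, as multi-turn use requires. Field names
    mirror the API's reasoning_content / content split."""
    history.append({
        "role": "assistant",
        "reasoning_content": reasoning,  # keep the <think> trace intact
        "content": answer,
    })
    return history

history = [{"role": "user", "content": "What is 2 + 2?"}]
append_assistant_turn(history, "Simple arithmetic: 2 + 2 = 4.", "4")
# `history` now carries the trace, ready for the next user turn.
```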
For self-hosted deployments, vLLM 0.11.1+ supports Trinity with dedicated flags:
vllm serve arcee-ai/Trinity-Large-Thinking \
--dtype bfloat16 \
--enable-reasoning \
--reasoning-parser deepseek_r1 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder
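Once the server is up, it can be queried like any OpenAI-compatible endpoint. A stdlib-only sketch; the port (vLLM's default, 8000) and the /v1/chat/completions route follow vLLM's standard OpenAI-compatible server, and error handling is omitted:

```python
import json
import urllib.request

def build_payload(prompt: str) -> dict:
    """Assemble a single-turn chat request in the OpenAI schema."""
    return {
        "model": "arcee-ai/Trinity-Large-Thinking",
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(prompt: str, base_url: str = "http://localhost:8000/v1") -> dict:
    """POST one chat turn to the local vLLM server started above."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

The JSON reply separates reasoning_content, content, and tool_calls as described earlier, so displaying or logging the trace is a matter of reading the right field.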
On agentic benchmarks, Trinity-Large-Thinking ranks #2 on PinchBench (Kilo’s benchmark for agentic model capability) with 91.9%, behind only Anthropic’s Opus-4.6 at 93.3%.
The cost comparison is striking: Trinity-Large-Thinking is available at $0.90 per million output tokens on the Arcee API — roughly 96% cheaper than Opus-4.6 — while matching it on several agentic tasks. The Preview model, released in January 2026, surpassed 3.37 trillion tokens served on OpenRouter within its first two months and became the #1 most-used open model in the U.S. on OpenRouter’s OpenClaw collection.
Trinity-Large-Thinking is a significant proof point for open-source AI development at frontier scale. Arcee AI is a 30-person startup (CEO Mark McQuade, CTO Lucas Atkins) that pre-trained this model entirely from scratch — not a fine-tune or derivative of Llama or another open base. The Apache 2.0 license allows unrestricted commercial use.
For practitioners building production AI agents, the combination of frontier agentic performance, a 512k context window, native tool calling, and open weights makes Trinity-Large-Thinking a compelling alternative to closed APIs — especially for use cases where cost, customizability, or data privacy are constraints.
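Native tool calling means the model emits structured tool_calls that an agent loop executes and feeds back. A minimal dispatch sketch; get_weather is a hypothetical tool, and the OpenAI-style function-calling schema is an assumption (the article does not pin down the exact format):

```python
import json

# Hypothetical tool schema in the OpenAI function-calling format.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def dispatch(tool_call: dict) -> str:
    """Execute one model-emitted tool call and return a string result
    to send back in a role='tool' message."""
    name = tool_call["function"]["name"]
    args = json.loads(tool_call["function"]["arguments"])
    if name == "get_weather":
        return f"22°C and clear in {args['city']}"  # stub result
    raise ValueError(f"unknown tool: {name}")
```

The agent loop repeats generate → dispatch → append-result until the model produces a final answer with no further tool calls.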
Model weights are available on Hugging Face. Managed API access is available through arcee.ai and OpenRouter.
