Arcee AI released Trinity-Large-Thinking on April 1, 2026: an open-source frontier reasoning model built for complex, long-horizon AI agents. Weighing in at 398 billion total parameters (with ~13B active per token), it is one of the largest open-source models ever released by a U.S. startup, and it poses a credible challenge to proprietary alternatives at a fraction of the cost.
Trinity-Large-Thinking is built on a sparse Mixture-of-Experts (MoE) architecture with 256 experts, only 4 of which are active per token: a routing fraction of just 1.56%, notably sparser than most competing MoE models. Despite its scale, the model delivers roughly 2–3× higher inference throughput than similarly sized dense models, since the majority of weights sit idle at any given step.
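The sparsity arithmetic is easy to verify. A minimal sketch using the figures from the text above (the active-parameter share is computed the same way for comparison):

```python
# Sparsity figures stated in the article.
TOTAL_EXPERTS = 256
ACTIVE_EXPERTS = 4
TOTAL_PARAMS_B = 398   # billions, total
ACTIVE_PARAMS_B = 13   # billions, active per token (approximate)

def fraction(active: float, total: float) -> float:
    """Share of the model exercised for each token."""
    return active / total

print(f"expert routing fraction: {fraction(ACTIVE_EXPERTS, TOTAL_EXPERTS):.2%}")  # → 1.56%
print(f"active parameter share:  {fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B):.2%}")
```

The two ratios differ because shared (non-expert) weights run for every token; only the expert layers are routed.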
Key architectural details:

- 398B total parameters, with only ~13B active per token
- 256 experts, 4 routed per token (a 1.56% routing fraction)
- 512k context window
- Native tool calling and explicit <think> reasoning traces
- Apache 2.0 license
The pretraining run consumed 17 trillion tokens over 33 days on 2,048 NVIDIA B300 GPUs — the largest publicly stated B300 pretraining run to date. Data curation was handled in partnership with Datology, and training used the Muon optimizer and completed with zero loss spikes.
The key upgrade from Trinity-Large-Preview (the earlier instruct model) is the explicit chain-of-thought mechanism. The model emits reasoning traces inside <think>...</think> blocks before producing its final response and any tool calls. This is not cosmetic — the reasoning traces are architecturally load-bearing for the model’s agentic performance.
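When working with raw completions rather than the structured API fields, the trace has to be separated from the answer by hand. A minimal sketch, assuming at most one <think>...</think> block at the start of the response as described above:

```python
import re

def split_reasoning(raw: str) -> tuple[str, str]:
    """Split a raw completion into (reasoning, answer).

    Assumes a single leading <think>...</think> block; returns an
    empty reasoning string if no block is present.
    """
    match = re.match(r"\s*<think>(.*?)</think>\s*", raw, flags=re.DOTALL)
    if match is None:
        return "", raw
    return match.group(1).strip(), raw[match.end():]

reasoning, answer = split_reasoning("<think>2 + 2 = 4</think>The answer is 4.")
print(reasoning)  # → 2 + 2 = 4
print(answer)     # → The answer is 4.
```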
When using Trinity-Large-Thinking in multi-turn conversations or agent loops, you must include the full assistant response (thinking + answer) in the conversation history. Stripping out the <think> blocks breaks context coherence and degrades performance. The API separates reasoning_content, content, and tool_calls as distinct fields, making it straightforward to log and display reasoning independently.
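In practice that means each assistant turn is stored with both fields. A minimal sketch, assuming message dicts using the reasoning_content / content names mentioned above (the exact wire format may differ):

```python
def append_assistant_turn(history: list[dict], reasoning: str, answer: str) -> list[dict]:
    """Append the FULL assistant response (thinking + answer) to the
    conversation history, as multi-turn use requires. Field names
    mirror the API's reasoning_content / content split."""
    history.append({
        "role": "assistant",
        "reasoning_content": reasoning,  # keep the <think> trace intact
        "content": answer,
    })
    return history

history = [{"role": "user", "content": "What is 2 + 2?"}]
append_assistant_turn(history, "Simple arithmetic: 2 + 2 = 4.", "4")
# `history` now carries the trace, ready for the next user turn.
```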
For self-hosted deployments, vLLM 0.11.1+ supports Trinity with dedicated flags:
vllm serve arcee-ai/Trinity-Large-Thinking \
--dtype bfloat16 \
--enable-reasoning \
--reasoning-parser deepseek_r1 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder
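Once the server is up, it can be queried like any OpenAI-compatible endpoint. A stdlib-only sketch; the port (vLLM's default, 8000) and the /v1/chat/completions route follow vLLM's standard OpenAI-compatible server, and error handling is omitted:

```python
import json
import urllib.request

def build_payload(prompt: str) -> dict:
    """Assemble a single-turn chat request in the OpenAI schema."""
    return {
        "model": "arcee-ai/Trinity-Large-Thinking",
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(prompt: str, base_url: str = "http://localhost:8000/v1") -> dict:
    """POST one chat turn to the local vLLM server started above."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

The JSON reply separates reasoning_content, content, and tool_calls as described earlier, so displaying or logging the trace is a matter of reading the right field.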
On agentic benchmarks, Trinity-Large-Thinking ranks #2 on PinchBench (Kilo’s benchmark for agentic model capability) with 91.9%, behind only Anthropic’s Opus-4.6 at 93.3%.
The cost comparison is striking: Trinity-Large-Thinking is available at $0.90 per million output tokens on the Arcee API — roughly 96% cheaper than Opus-4.6 — while matching it on several agentic tasks. The Preview model, released in January 2026, surpassed 3.37 trillion tokens served on OpenRouter within its first two months and became the #1 most-used open model in the U.S. on OpenRouter’s OpenClaw collection.
Trinity-Large-Thinking is a significant proof point for open-source AI development at frontier scale. Arcee AI is a 30-person startup (CEO Mark McQuade, CTO Lucas Atkins) that pre-trained this model entirely from scratch — not a fine-tune or derivative of Llama or another open base. The Apache 2.0 license allows unrestricted commercial use.
For practitioners building production AI agents, the combination of frontier agentic performance, a 512k context window, native tool calling, and open weights makes Trinity-Large-Thinking a compelling alternative to closed APIs — especially for use cases where cost, customizability, or data privacy are constraints.
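Native tool calling means the model emits structured tool_calls that an agent loop executes and feeds back. A minimal dispatch sketch; get_weather is a hypothetical tool, and the OpenAI-style function-calling schema is an assumption (the article does not pin down the exact format):

```python
import json

# Hypothetical tool schema in the OpenAI function-calling format.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def dispatch(tool_call: dict) -> str:
    """Execute one model-emitted tool call and return a string result
    to send back in a role='tool' message."""
    name = tool_call["function"]["name"]
    args = json.loads(tool_call["function"]["arguments"])
    if name == "get_weather":
        return f"22°C and clear in {args['city']}"  # stub result
    raise ValueError(f"unknown tool: {name}")
```

The agent loop repeats generate → dispatch → append-result until the model produces a final answer with no further tool calls.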
Model weights are available on Hugging Face. Managed API access is available through arcee.ai and OpenRouter.
