Thinking Machines Unveils Interaction Models for Real-Time Human-AI Collaboration

May 12, 2026
Provided by Utku Ege Tuluk

Thinking Machines Lab announced its first “interaction models” on May 11, 2026, unveiling TML-Interaction-Small, a 276B-parameter mixture-of-experts model (12B active) designed for real-time, multimodal collaboration. Rather than waiting for the user to finish speaking, the model interleaves 200-millisecond chunks of audio and video input with simultaneous output generation, a design meant to remove the turn-taking pause that characterizes today’s voice assistants.

[Image: Thinking Machines Lab interaction models hero image showing a live multimodal conversation interface. Image credit: Thinking Machines Lab]

From Turn-Taking to Time-Aligned Micro-Turns

The lab — founded by former OpenAI CTO Mira Murati — frames today’s chat and voice interfaces as a “narrow channel” for human-AI work. Turn-based systems force the human to wait, hand off, and then wait again. Drawing on communication research that emphasizes copresence, contemporality, and simultaneity, Thinking Machines argues that interactivity has to be a property of the model itself, not a wrapper around it.

The mechanism is what the team calls time-aligned micro-turns: streaming sessions append 200ms chunks of audio and video to a persistent GPU sequence while the model concurrently generates its own audio, text, and tool calls. This produces conversational behaviors that current pipelines struggle with — verbal interjections at the right moment, simultaneous speech, a native sense of elapsed time, and proactive responses to visual events.
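The interleaving described above can be illustrated with a toy loop. This is a minimal sketch, not the lab's implementation: the `Chunk` type, the `micro_turns` function, and the `toy_respond` policy are hypothetical names invented here, and strings stand in for real audio and video. Only the 200 ms chunk length comes from the announcement.

```python
from dataclasses import dataclass

CHUNK_MS = 200  # chunk length reported in the announcement


@dataclass
class Chunk:
    source: str    # "user" or "model"
    start_ms: int  # position of the chunk on the shared clock
    payload: str   # stand-in for real audio, video, or text content


def micro_turns(user_stream, respond):
    """Append one user chunk and one model chunk per 200 ms time slot."""
    sequence = []
    for i, payload in enumerate(user_stream):
        start = i * CHUNK_MS
        sequence.append(Chunk("user", start, payload))
        # The model emits output for the same slot, conditioned on
        # everything appended so far (including its own earlier output).
        sequence.append(Chunk("model", start, respond(sequence)))
    return sequence


def toy_respond(sequence):
    # Hypothetical policy: interject once, right after the third user chunk.
    heard = sum(1 for c in sequence if c.source == "user")
    return "mm-hm" if heard == 3 else "<silence>"


session = micro_turns(["hel", "lo ", "the", "re"], toy_respond)
for c in session:
    print(f"{c.start_ms:>4} ms  {c.source:<5} {c.payload}")
```

The point of the sketch is that there is no "end of turn" event anywhere in the loop: the model gets a chance to speak, or to stay silent, in every slot, which is what makes a well-timed interjection possible.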

Architecture and Design Choices

The system is split into two cooperating models:

Interaction Model: handles real-time perception and response across audio, video, and text.
Background Model: performs asynchronous reasoning, tool use, and longer agentic workflows, streaming results back into the live conversation as they arrive.
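One way to picture this split is a live loop that keeps responding on every tick while a slower worker runs alongside it and hands results over a queue. The sketch below uses Python's standard `asyncio` module; the function names, the task string, and the timings are all assumptions chosen for illustration and say nothing about how Thinking Machines actually schedules the two models.

```python
import asyncio


async def background_model(task, results):
    """Hypothetical slow worker: stands in for reasoning or tool use."""
    await asyncio.sleep(0.05)
    await results.put(f"background result for {task!r}")


async def interaction_loop(n_ticks=10, tick_s=0.02):
    """Hypothetical live loop: answers every tick, never waits on the worker."""
    results = asyncio.Queue()
    worker = asyncio.create_task(background_model("check the calendar", results))
    transcript = []
    for tick in range(n_ticks):
        try:
            # Fold in background output only if it has already arrived.
            transcript.append((tick, results.get_nowait()))
        except asyncio.QueueEmpty:
            transcript.append((tick, "live response"))
        await asyncio.sleep(tick_s)
    await worker
    return transcript


transcript = asyncio.run(interaction_loop())
for tick, message in transcript:
    print(tick, message)
```

The live loop uses a non-blocking `get_nowait()` rather than awaiting the queue, so a slow background job delays only its own result and never the conversation.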

Notable engineering decisions include:

Encoder-free early fusion: audio enters as dMel features with a lightweight embedding, and video is split into 40×40 patches encoded by an hMLP.
Persistent GPU sequences: these avoid re-allocating KV cache memory each turn.
Custom kernels: written for MoE and bidirectional serving.
Bitwise-deterministic trainer/sampler alignment: achieved with under 5% performance overhead.

Safety mitigations rely on TTS-generated refusal data and automated multi-turn red-teaming.
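To make the 40×40 video patching concrete, here is a small pure-Python sketch of splitting one frame into non-overlapping tiles. The 320×240 grayscale frame and the `patchify` helper are assumptions for illustration; the announcement does not state the input resolution, and the hMLP that encodes each patch is not modelled here.

```python
PATCH = 40  # patch side length reported in the announcement


def patchify(frame, patch=PATCH):
    """Split a 2-D frame (list of rows) into non-overlapping patch x patch
    tiles, each flattened in row-major order."""
    height, width = len(frame), len(frame[0])
    if height % patch or width % patch:
        raise ValueError("frame size must be a multiple of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tile = [
                frame[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ]
            patches.append(tile)
    return patches


# Hypothetical 320x240 grayscale frame; the resolution is an assumption.
WIDTH, HEIGHT = 320, 240
frame = [[(r * WIDTH + c) % 256 for c in range(WIDTH)] for r in range(HEIGHT)]

patches = patchify(frame)
print(len(patches), len(patches[0]))  # prints: 48 1600
```

At this assumed resolution each frame becomes 48 patches of 1,600 values, and each patch would then be mapped to one embedding before joining the shared sequence.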

[Image: Conceptual visualization of interleaved 200ms input and output streams between a human and an AI model. Illustration generated by AI]

Benchmarks: Latency, Quality, and New Interactivity Metrics

Against OpenAI’s GPT-Realtime-2 (minimal reasoning effort), TML-Interaction-Small reports a turn-taking latency of 0.40 seconds versus 1.18 seconds on FD-bench V1, an average voice-conversation quality of 77.8 versus 46.8 on FD-bench V1.5, and a small lead on Audio MultiChallenge APR (43.4% vs. 37.6%). Text instruction-following on IFEval is essentially tied at 89.7% vs. 89.6%.
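For readers who want the gaps spelled out, the short script below recomputes the differences and ratios from the figures quoted above. It adds no new data; the dictionary layout and the `gap` helper are simply one convenient way to do the arithmetic.

```python
# Figures as quoted in the article: (TML-Interaction-Small, GPT-Realtime-2).
reported = {
    "FD-bench V1 turn-taking latency (s)": (0.40, 1.18),
    "FD-bench V1.5 voice quality": (77.8, 46.8),
    "Audio MultiChallenge APR (%)": (43.4, 37.6),
    "IFEval (%)": (89.7, 89.6),
}


def gap(tml, baseline):
    """Absolute difference and ratio (TML over baseline), rounded for display."""
    return round(tml - baseline, 2), round(tml / baseline, 2)


gaps = {name: gap(*scores) for name, scores in reported.items()}
for name, (diff, ratio) in gaps.items():
    print(f"{name:<38} diff={diff:+.2f}  ratio={ratio:.2f}")
```

Latency is the one row where lower is better: the reported figure is roughly a third of the baseline's. On the other rows higher is better, and the IFEval difference of 0.1 points is why the article calls that result essentially tied.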

More striking are three new benchmark categories the team assembled to measure capabilities that standard voice models cannot express. Scores are listed in the same order as above, with TML-Interaction-Small first:

TimeSpeak: speaking at a user-specified time with correct content, 64.7% vs. 4.3%.
CueSpeak: responding to verbal cues at the right moment, 81.7% vs. 2.9%.
Visual proactivity: covers RepCount-A continuous counting (35.4% off-by-one accuracy vs. 1.3%), ProactiveVideoQA (33.5 PAUC vs. 25.0), and Charades temporal action localization (32.4 mIoU vs. 0).
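Two of the metrics named above have simple textbook definitions, sketched below for orientation. These are the commonly used forms of off-by-one counting accuracy and temporal intersection-over-union; the source does not spell out the exact scoring protocol, so treat the function names and details as assumptions rather than the benchmark's official code.

```python
def off_by_one_accuracy(predicted, actual):
    """Share of clips whose predicted count is within 1 of the true count."""
    hits = sum(1 for p, a in zip(predicted, actual) if abs(p - a) <= 1)
    return hits / len(actual)


def temporal_iou(pred, truth):
    """Intersection over union of two (start, end) time intervals."""
    (ps, pe), (ts, te) = pred, truth
    intersection = max(0.0, min(pe, te) - max(ps, ts))
    union = (pe - ps) + (te - ts) - intersection
    return intersection / union if union > 0 else 0.0


def mean_iou(preds, truths):
    """Average temporal IoU over paired predictions and ground truths."""
    return sum(temporal_iou(p, t) for p, t in zip(preds, truths)) / len(truths)


# Made-up toy inputs, purely to exercise the functions.
obo = off_by_one_accuracy([10, 7, 3, 12], [10, 8, 6, 12])
miou = mean_iou([(2.0, 6.0), (0.0, 1.0)], [(4.0, 8.0), (5.0, 6.0)])
print(obo, round(miou, 3))  # prints: 0.75 0.167
```

Under these definitions, an mIoU of 0 means that none of the predicted time spans overlapped the annotated action at all.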

What This Means

Most current voice stacks are pipelines: a voice activity detector decides when the user is done, a speech-to-text model transcribes, an LLM reasons, and a TTS model speaks back. Thinking Machines is arguing that this assembly fundamentally caps what AI-mediated collaboration can become — and that the fix is to collapse the pipeline into a single time-aware model. If the benchmarks hold up under independent testing, applications such as live lab monitoring, real-time tutoring, accessibility tools, and proactive safety supervision become considerably more tractable.
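The structural cost of the cascaded design can be shown with a toy version of the pipeline. In this sketch every stage is a stand-in: the `vad_end_of_turn` rule (three silent chunks in a row), the string-based "transcription", and the echo "LLM" are all hypothetical, and the 200 ms chunk size is borrowed from the article only to keep the clock comparable.

```python
CHUNK_MS = 200  # same chunk length as the interaction model, for comparison


def vad_end_of_turn(chunks, silence_needed=3):
    """Toy voice activity detector: the turn ends at the chunk that completes
    a run of `silence_needed` silent (empty) chunks. Returns its index."""
    silent = 0
    for i, chunk in enumerate(chunks):
        silent = silent + 1 if chunk == "" else 0
        if silent == silence_needed:
            return i
    return None


def cascaded_reply(chunks):
    """Toy VAD -> speech-to-text -> LLM chain with stand-in stages."""
    end = vad_end_of_turn(chunks)
    if end is None:
        return None  # no end of turn detected, so nothing downstream runs
    text = " ".join(c for c in chunks[: end + 1] if c)  # stand-in for STT
    reply = f"You said: {text}"                         # stand-in for the LLM
    earliest_start_ms = (end + 1) * CHUNK_MS            # speech cannot begin sooner
    return earliest_start_ms, reply


print(cascaded_reply(["hi", "there", "", "", ""]))  # (1000, 'You said: hi there')
print(cascaded_reply(["hi", "there", ""]))          # None
```

In the first call the user stops speaking at 400 ms, yet the earliest possible reply is at 1,000 ms, before any transcription, reasoning, or synthesis time is counted. In the second call the pipeline produces nothing at all, because the detector has not yet declared the turn over.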

The model is currently available only to a small group of research preview partners, with a wider release planned for later in 2026. Thinking Machines has also opened research grants to encourage community-contributed interactivity benchmarks — implicitly acknowledging that the field still lacks shared ways to measure what “good” real-time AI looks like.

Sources

Interaction Models: A Scalable Approach to Human-AI Collaboration — Thinking Machines Lab
Thinking Machines shows off preview of near-realtime AI voice and video conversation — VentureBeat
Thinking Machines drops a new, highly responsive model designed for humanlike interactions in real time — SiliconANGLE
Thinking Machines announced new SOTA Realtime Voice model — TestingCatalog