OpenAI Launches GPT-Realtime-2 with GPT-5-Class Voice Reasoning

OpenAI on May 7, 2026 introduced three new realtime voice models in its API — GPT‑Realtime‑2, GPT‑Realtime‑Translate, and GPT‑Realtime‑Whisper. The flagship GPT‑Realtime‑2 brings GPT‑5‑class reasoning to live voice, expands the context window from 32K to 128K tokens, and adds adjustable reasoning effort, parallel tool calls, and audible “preambles” so agents can carry conversations forward while they think and act.
Three Models, Three Patterns
OpenAI is framing the release around three emerging product patterns that developers are building: voice‑to‑action (asking software to do things), systems‑to‑voice (apps speaking back proactively), and voice‑to‑voice (live multilingual conversation). Each model targets one of these patterns:
- GPT‑Realtime‑2 — a reasoning‑capable voice model for agents that listen, plan, and call tools mid‑conversation.
- GPT‑Realtime‑Translate — live speech translation across 70+ input languages into 13 output languages, designed to keep pace with a natural speaker.
- GPT‑Realtime‑Whisper — a streaming speech‑to‑text model for low‑latency captions, meeting notes, and live transcription.
What’s New in GPT‑Realtime‑2
The headline upgrade is reasoning. OpenAI reports that GPT‑Realtime‑2 (high) scores 15.2% higher on Big Bench Audio than its predecessor GPT‑Realtime‑1.5, and the xhigh setting scores 13.8% higher on Audio MultiChallenge, a multi‑turn conversational benchmark covering instruction following, context integration, and recovery from speech corrections.
Around that core, several agent‑oriented features are new:
- Preambles — short verbal acknowledgements like “let me check that” so users hear the agent thinking.
- Parallel tool calls with audible narration — the model can run multiple tools at once and speak phrases like “checking your calendar” while it works.
- Stronger recovery — graceful fallbacks instead of silent failures when a tool or request breaks.
- Adjustable reasoning effort — five levels (minimal, low, medium, high, xhigh), with low as the default to balance latency against deliberation.
- 128K context window — up from 32K, enabling longer agentic sessions.
- Better tone control — calmer when resolving issues, more upbeat on confirmations.
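The preamble and parallel tool call pattern in the list above can be sketched in plain Python with asyncio. This illustrates the control flow only and is not the Realtime API: `speak`, `check_calendar`, and `lookup_weather` are hypothetical stand-ins, and the sleeps merely simulate tool latency.

```python
import asyncio

spoken: list[str] = []  # stand-in for audio output; a real agent would stream speech


def speak(text: str) -> None:
    """Hypothetical helper: record what the agent would say aloud."""
    spoken.append(text)


async def check_calendar(day: str) -> str:
    """Hypothetical tool; the sleep simulates a slow backend call."""
    await asyncio.sleep(0.05)
    return f"calendar for {day}: 2 meetings"


async def lookup_weather(city: str) -> str:
    """Hypothetical tool; the sleep simulates a slow backend call."""
    await asyncio.sleep(0.05)
    return f"weather in {city}: sunny"


async def handle_turn() -> list[str]:
    # Preamble first, so the user hears something before any tool returns.
    speak("Let me check that.")
    # Start both tools concurrently instead of one after the other.
    tasks = [
        asyncio.create_task(check_calendar("Friday")),
        asyncio.create_task(lookup_weather("Seattle")),
    ]
    speak("Checking your calendar and the forecast.")
    # return_exceptions=True keeps one failing tool from discarding the other result.
    results = await asyncio.gather(*tasks, return_exceptions=True)
    answers = []
    for result in results:
        if isinstance(result, Exception):
            # Graceful fallback instead of a silent failure.
            speak("One lookup failed, so I'll work with what I have.")
        else:
            answers.append(result)
    return answers


if __name__ == "__main__":
    print(asyncio.run(handle_turn()))
    print(spoken)
```

Because both tasks start before either is awaited, the turn takes roughly as long as the slowest tool rather than the sum of both, and the narration lines are emitted while the tools are still running.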
Zillow’s SVP of AI, Josh Weisberg, claimed a “26‑point lift in call success rate after prompt optimization (95% vs. 69%)” on the company’s hardest adversarial benchmark, alongside stronger compliance with Fair Housing rules. BolnaAI reported 12.5% lower Word Error Rates on Hindi, Tamil, and Telugu evals using GPT‑Realtime‑Translate.
Pricing and Availability
All three models are available now in the Realtime API. Pricing is metered:
- GPT‑Realtime‑2: $32 per 1M audio input tokens ($0.40 cached) and $64 per 1M audio output tokens.
- GPT‑Realtime‑Translate: $0.034 per minute.
- GPT‑Realtime‑Whisper: $0.017 per minute.
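To make the metered pricing concrete, here is a minimal cost estimator built only from the rates listed above. It assumes that cached input tokens are billed at the cached rate in place of the standard input rate, and it covers audio tokens only; the function and constant names are illustrative, not part of any OpenAI SDK.

```python
# Prices in USD, taken from the list above.
REALTIME_2_INPUT_PER_M = 32.00         # per 1M audio input tokens
REALTIME_2_CACHED_INPUT_PER_M = 0.40   # per 1M cached audio input tokens
REALTIME_2_OUTPUT_PER_M = 64.00        # per 1M audio output tokens
TRANSLATE_PER_MINUTE = 0.034
WHISPER_PER_MINUTE = 0.017


def realtime_2_cost(input_tokens: int, output_tokens: int,
                    cached_input_tokens: int = 0) -> float:
    """Estimate GPT-Realtime-2 audio cost in USD.

    cached_input_tokens is the portion of input_tokens served from cache
    (assumption: that portion is billed at the cached rate instead).
    """
    if not 0 <= cached_input_tokens <= input_tokens:
        raise ValueError("cached_input_tokens must be between 0 and input_tokens")
    fresh_tokens = input_tokens - cached_input_tokens
    total = (fresh_tokens * REALTIME_2_INPUT_PER_M
             + cached_input_tokens * REALTIME_2_CACHED_INPUT_PER_M
             + output_tokens * REALTIME_2_OUTPUT_PER_M)
    return total / 1_000_000


def per_minute_cost(minutes: float, rate_per_minute: float) -> float:
    """Estimate cost in USD for the per-minute models."""
    return minutes * rate_per_minute


if __name__ == "__main__":
    # 100K input tokens (20K of them cached) and 50K output tokens.
    print(f"{realtime_2_cost(100_000, 50_000, cached_input_tokens=20_000):.2f}")  # 5.77
    # One hour of translation and one hour of transcription.
    print(f"{per_minute_cost(60, TRANSLATE_PER_MINUTE):.2f}")  # 2.04
    print(f"{per_minute_cost(60, WHISPER_PER_MINUTE):.2f}")    # 1.02
```

At these rates, output audio costs twice as much per token as fresh input audio, so long spoken responses dominate the bill for a GPT‑Realtime‑2 session.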
The Realtime API supports EU Data Residency and runs sessions through active classifiers that can halt conversations flagged as violating OpenAI’s content policies. Developers can layer their own guardrails through the Agents SDK.
What This Means
The release moves OpenAI’s voice stack from “assistant that can talk” toward “agent that can think out loud while it works.” The combination of a 4× larger context window, parallel tool calls, audible preambles, and tunable reasoning makes GPT‑Realtime‑2 viable for production voice agents that previously had to choose between latency and intelligence: developers can pick low for snappy turn‑taking and xhigh for harder reasoning tasks.
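The latency versus reasoning trade-off can be expressed as a small helper. This is a sketch under stated assumptions: the dictionary keys and the `gpt-realtime-2` identifier are hypothetical placeholders rather than confirmed API fields, and the context sizes treat 32K and 128K as round numbers.

```python
# Effort levels and default as described in the article.
EFFORT_LEVELS = ("minimal", "low", "medium", "high", "xhigh")
DEFAULT_EFFORT = "low"

# Assumption: "32K" and "128K" treated as round token counts.
OLD_CONTEXT_TOKENS = 32_000
NEW_CONTEXT_TOKENS = 128_000


def pick_effort(hard_task: bool) -> str:
    """Follow the article's rule of thumb: low for quick turns, xhigh for hard tasks."""
    return "xhigh" if hard_task else DEFAULT_EFFORT


def session_settings(effort: str = DEFAULT_EFFORT) -> dict[str, str]:
    """Build an illustrative settings dict; key names are hypothetical."""
    if effort not in EFFORT_LEVELS:
        raise ValueError(f"effort must be one of {EFFORT_LEVELS}, got {effort!r}")
    return {"model": "gpt-realtime-2", "reasoning_effort": effort}


def fits_context(history_tokens: int, limit: int = NEW_CONTEXT_TOKENS) -> bool:
    """Check whether a conversation history fits within a context window."""
    return history_tokens <= limit


if __name__ == "__main__":
    print(session_settings(pick_effort(hard_task=True)))
    # A 100K-token session fits the new window but not the old one.
    print(fits_context(100_000), fits_context(100_000, OLD_CONTEXT_TOKENS))
```

Validating the effort level up front keeps a typo from silently falling back to some other setting, which matters when the choice directly affects response latency.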
The translation and transcription models are more direct competitive moves. GPT‑Realtime‑Translate stakes a claim in the cross‑border voice space currently contested by Google and Meta, while GPT‑Realtime‑Whisper competes head‑on with open alternatives like Mistral’s Voxtral Transcribe 2, though it is sold at a flat $0.017/minute through a hosted API rather than offered as self‑hosted weights. For teams already on OpenAI’s stack, the pitch is the convenience of a single Realtime endpoint covering reasoning, translation, and transcription.
Related Coverage
- Voxtral Transcribe 2: Mistral’s Open Real-Time Speech-to-Text — the open‑weights alternative GPT‑Realtime‑Whisper now competes with.
- Qwen3.5-Omni: Alibaba’s Omnimodal AI Speaks 36 Languages and Codes from Voice — Alibaba’s omnimodal voice model from earlier this year.
- OpenAI Releases GPT-5.5: Agentic Coding Ceiling Tops 14 Benchmarks — the GPT‑5 family lineage that GPT‑Realtime‑2’s reasoning is built on.



