OpenAI Releases GPT-5.5, Claiming a New Agentic Coding Ceiling Across 14 Benchmarks

OpenAI released GPT-5.5 on April 23, 2026, just six weeks after GPT-5.4, positioning the model as a major step toward agentic computing. The company claims state-of-the-art results across 14 benchmarks, including 82.7% on Terminal-Bench 2.0 and 58.6% on SWE-Bench Pro, with the model now rolling out to Plus, Pro, Business, and Enterprise tiers of ChatGPT and Codex.


OpenAI GPT-5.5 launch promotional image showing the new model branding
Image credit: RoboRhythms

What’s New in GPT-5.5

OpenAI is framing GPT-5.5 less as a chat model and more as an agent runtime — a system built to plan, call tools, inspect its own output, and keep iterating without constant user prompting. According to OpenAI, the model excels at multi-step workflows: breaking down complex instructions into smaller steps, executing them sequentially, and refining outputs based on intermediate results.
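OpenAI has not published how this loop is implemented, but the plan, execute, and refine pattern it describes can be sketched generically. In the sketch below, `call_model` is a hypothetical stand-in for any model API call, stubbed out so the control flow runs on its own:

```python
# Minimal sketch of the plan -> execute -> refine loop described above.
# `call_model` is a hypothetical placeholder, NOT OpenAI's API; a real
# agent would replace it with an actual model call.

def call_model(prompt: str) -> str:
    """Stub: a real implementation would send `prompt` to the model."""
    return f"result for: {prompt}"

def run_agent(task: str, max_refinements: int = 2) -> list[str]:
    # 1. Plan: break the task into smaller sequential steps.
    steps = [f"{task} / step {i}" for i in range(1, 4)]
    outputs = []
    for step in steps:
        # 2. Execute each step in order.
        result = call_model(step)
        # 3. Refine: inspect the intermediate result and retry if it
        #    looks wrong, up to a fixed budget.
        for _ in range(max_refinements):
            if "error" not in result:
                break
            result = call_model(f"fix: {result}")
        outputs.append(result)
    return outputs

print(run_agent("summarize repo"))
```

The point of the pattern is that the refinement check happens per step, so a failure early in the plan is corrected before later steps build on it.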

Core capabilities highlighted by OpenAI include coding and debugging across extended contexts, computer use for operating software and navigating interfaces, spreadsheet and document manipulation, multi-step web research with citations, and tool orchestration within a single session. Efficiency also improved: OpenAI reports that GPT-5.5 matches GPT-5.4's per-token latency in real-world serving while using significantly fewer tokens to complete the same Codex tasks.

President Greg Brockman called GPT-5.5 “a real step forward towards the kind of computing that we expect in the future — but it is one step,” while Chief Scientist Jakub Pachocki described the previous two years as “surprisingly slow” in improvement pace, signaling that OpenAI sees the current cadence as a return to form.

Benchmarks: Topping Opus 4.7

OpenAI claims GPT-5.5 leads on 14 benchmarks. The headline numbers position it ahead of Anthropic’s Claude Opus 4.7 and Google’s Gemini 3.1 Pro on most agentic and reasoning evaluations:

  • Terminal-Bench 2.0 (command-line task completion): GPT-5.5 reaches 82.7%, well clear of Opus 4.7 at 69.4%.
  • SWE-Bench Pro (real-world GitHub issue resolution): 58.6% — the model successfully resolves more than half of issues in a single attempt.
  • OSWorld-Verified (computer use): 78.7%, narrowly ahead of Opus 4.7 at 78.0%.
  • FrontierMath Tier 1–3: 51.7% for the standard model; the GPT-5.5 Pro variant pushes to 52.4%.
  • BrowseComp (web research): GPT-5.5 Pro reaches 90.1%, versus 79.3% for Opus 4.7.
  • Expert-SWE (long-horizon engineering tasks with ~20-hour median human completion times): OpenAI reports gains over GPT-5.4, though absolute scores were not disclosed.
Abstract illustration of an AI agent branching into seven sequential task pathways representing terminal use, spreadsheets, documents, browsing, code, analytics, and tool orchestration
Illustration generated by AI

Pricing, Availability, and Safety

GPT-5.5 is available immediately in ChatGPT and Codex for Plus, Pro, Business, and Enterprise subscribers. API pricing lands at $5 per million input tokens and $30 per million output tokens with a 1-million-token context window, double the price of GPT-5.4. The GPT-5.5 Pro variant, restricted to paid subscription tiers, is priced at $30/$180 per million input/output tokens, making it more expensive than Claude Opus 4.7 for most workloads.
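The pricing and token-efficiency claims interact: at double the per-token rate, GPT-5.5 only undercuts GPT-5.4 on a task if it uses more than 50% fewer tokens. The back-of-the-envelope calculation below uses the announced GPT-5.5 rates; the GPT-5.4 rates (half of GPT-5.5, per "double the price") and the example token counts are assumptions for illustration:

```python
# Back-of-the-envelope per-task cost comparison. GPT-5.5 rates are from
# the announcement; GPT-5.4 rates are ASSUMED to be half, and the token
# counts below are hypothetical.

PRICING = {  # USD per 1M tokens: (input, output)
    "gpt-5.5": (5.00, 30.00),
    "gpt-5.4": (2.50, 15.00),  # assumed: half the GPT-5.5 rate
}

def task_cost(model: str, input_toks: int, output_toks: int) -> float:
    inp, out = PRICING[model]
    return (input_toks * inp + output_toks * out) / 1_000_000

# Hypothetical Codex task: 200k input / 20k output tokens on GPT-5.4.
# Even if GPT-5.5 finishes the same task with 40% fewer tokens, it
# still costs more per task; break-even requires a 50% reduction.
old = task_cost("gpt-5.4", 200_000, 20_000)
new = task_cost("gpt-5.5", 120_000, 12_000)
print(f"GPT-5.4: ${old:.2f}  GPT-5.5: ${new:.2f}")
```

Under these assumptions the old task costs $0.80 and the new one $0.96, which is why the token-efficiency figure, not the headline rate, determines whether upgrading saves money.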

Under OpenAI’s Preparedness Framework, GPT-5.5 is classified as “High” risk for biological and cybersecurity capabilities — a non-trivial designation that places it near the ceiling of OpenAI’s current safety tiers and will require ongoing deployment safeguards.

What This Means

The six-week gap between GPT-5.4 and GPT-5.5 is unusual. Frontier labs have typically shipped major model updates on quarterly or longer cycles, and the tight cadence — combined with OpenAI executives’ framing of a forthcoming “super app” bundling ChatGPT, Codex, and an AI browser — suggests OpenAI is prioritizing enterprise stickiness over measured rollouts. For developers and researchers, the practical story is that agentic coding and computer-use workloads now have a clear new ceiling to compare against.

For NYU Shanghai researchers and students building on the API, the takeaways are concrete: GPT-5.5 will do more per session with fewer corrections, but at meaningfully higher per-token cost. Long-context research workflows, data analysis pipelines, and tool-heavy agents stand to benefit most; high-volume chat deployments may prefer sticking with GPT-5.4 until pricing shifts.
