OpenAI Traces ChatGPT’s Goblin Habit to a Stray RL Reward Signal

OpenAI on April 29, 2026 published a post-mortem explaining why its recent ChatGPT models had developed a strange habit of sprinkling goblins, gremlins, and other small creatures into their answers. The cause traces back to a single reward signal used to train the “Nerdy” personality during reinforcement learning. Use of “goblin” in ChatGPT rose 175% after the GPT‑5.1 launch, and the tic spread well beyond the personality it was trained on: a textbook case of reward generalization in RL.
How the goblins crept in
OpenAI says it first noticed the pattern in November 2025, after the GPT‑5.1 launch, when users complained the model felt oddly overfamiliar. A safety researcher who had personally run into a few “goblins” and “gremlins” asked that those words be added to a verbal-tic audit. The audit found a 175% jump in “goblin” mentions and a 52% jump in “gremlin” mentions in ChatGPT responses post-launch. The prevalence wasn’t alarming at the time; it became so only once GPT‑5.4 made the habit much more conspicuous.
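For a sense of what such a verbal-tic audit involves, the sketch below compares mention rates between a pre-launch and a post-launch sample of responses. It is illustrative only, not OpenAI’s tooling; the word list, function names, and sample inputs are all assumptions.

import re

# Words flagged for the tic audit; extend as new tics are reported.
TIC_WORDS = ["goblin", "gremlin"]

def mention_rate(responses: list[str], word: str) -> float:
    """Fraction of responses mentioning `word` (singular or plural) at least once."""
    if not responses:
        return 0.0
    pattern = re.compile(rf"\b{word}s?\b", re.IGNORECASE)
    return sum(bool(pattern.search(r)) for r in responses) / len(responses)

def tic_audit(pre_launch: list[str], post_launch: list[str]) -> dict[str, float]:
    """Percent change in each tic word's mention rate across the two samples."""
    report = {}
    for word in TIC_WORDS:
        before = mention_rate(pre_launch, word)
        after = mention_rate(post_launch, word)
        report[word] = 100.0 * (after - before) / before if before else float("inf")
    return report

# Outputs of +175.0 for "goblin" and +52.0 for "gremlin" would match the jumps reported above.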
The team then ran a deeper analysis and found the creature language was clustered in production traffic from users who had selected the “Nerdy” personality. Although Nerdy accounted for just 2.5% of all ChatGPT responses, it produced 66.7% of all “goblin” mentions. The system prompt for that personality directs the model to be “unapologetically nerdy, playful and wise” and to “undercut pretension through playful use of language” — a stylistic instruction that, combined with the wrong reward signal, was enough to seed the habit.
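Finding that kind of clustering amounts to comparing each personality’s share of traffic with its share of tic mentions. A minimal sketch, assuming each production response is tagged with the personality that produced it (the record format is hypothetical):

from collections import Counter

def personality_concentration(records: list[tuple[str, str]], word: str = "goblin") -> dict:
    """For (personality, response_text) records, return each personality's
    (share of total traffic, share of `word` mentions)."""
    traffic, mentions = Counter(), Counter()
    for personality, text in records:
        traffic[personality] += 1
        if word in text.lower():
            mentions[personality] += 1
    total = sum(traffic.values()) or 1
    total_mentions = sum(mentions.values()) or 1
    return {p: (traffic[p] / total, mentions[p] / total_mentions) for p in traffic}

# Nerdy's numbers above (2.5% of traffic, 66.7% of mentions) correspond to a
# roughly 27x over-representation under this measure.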
A reward signal gone rogue
Using Codex, the team compared RL training rollouts containing “goblin” or “gremlin” against rollouts on the same tasks that did not contain them. One reward signal stood out: the one designed to encourage the Nerdy personality consistently scored creature-laden responses higher. Across all audited datasets, that reward showed positive uplift for outputs containing “goblin” or “gremlin” in 76.2% of cases.
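That comparison can be framed as a paired analysis over tasks: for each task with both creature-laden and creature-free rollouts, check which group the reward scores higher. The sketch below assumes rollouts are grouped by task id and already scored by the personality reward; the data layout is an assumption, not OpenAI’s pipeline.

import re

CREATURE = re.compile(r"\b(?:goblin|gremlin)s?\b", re.IGNORECASE)

def uplift_rate(rollouts_by_task: dict[str, list[tuple[str, float]]]) -> float:
    """rollouts_by_task maps a task id to (response_text, reward_score) pairs.
    Returns the fraction of comparable tasks where creature-laden responses
    earn a higher mean reward than creature-free ones."""
    uplifts = comparable = 0
    for pairs in rollouts_by_task.values():
        with_c = [score for text, score in pairs if CREATURE.search(text)]
        without_c = [score for text, score in pairs if not CREATURE.search(text)]
        if with_c and without_c:  # need both kinds on the same task to compare
            comparable += 1
            uplifts += sum(with_c) / len(with_c) > sum(without_c) / len(without_c)
    return uplifts / comparable if comparable else 0.0

# A result near 0.762 would match the 76.2% uplift figure reported above.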
That explained why the tic showed up under the Nerdy prompt — but not why it appeared without it. The team tracked mention rates over training both with and without the Nerdy condition, and found that as goblin and gremlin mentions rose under Nerdy, they rose by nearly the same relative proportion in samples without it. In OpenAI’s words, “reinforcement learning does not guarantee that learned behaviors stay neatly scoped to the condition that produced them.”
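Detecting that kind of leakage is a matter of tracking the mention rate against training progress under both conditions. A short sketch, assuming per-checkpoint response samples drawn with and without the Nerdy system prompt (the sample sources are hypothetical):

def mention_curve(samples_by_step: dict[int, list[str]], word: str = "goblin") -> dict[int, float]:
    """Mention rate of `word` at each training checkpoint, for one condition."""
    return {step: sum(word in text.lower() for text in texts) / len(texts)
            for step, texts in sorted(samples_by_step.items())}

# If mention_curve(nerdy_samples) and mention_curve(unconditioned_samples) rise
# by a similar relative factor over training, the tic has generalized beyond
# the personality condition, which is what OpenAI reports observing.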
The result is a feedback loop: a playful style is rewarded, some of the rewarded examples carry a distinctive lexical tic, the tic appears more often in rollouts, those rollouts are recycled into supervised fine-tuning data, and the model gets even more comfortable producing the tic. A search through GPT‑5.5’s SFT data turned up many datapoints containing “goblin” and “gremlin” — alongside an extended cast of raccoons, trolls, ogres, and pigeons. (Most uses of “frog,” the team notes, turned out to be legitimate.)
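The SFT search at the end of that loop is simple to reproduce in spirit: scan every training example for the flagged creature words. The word list and record layout below are illustrative, and as the frog caveat suggests, flagged examples still need human review.

import re

CREATURES = ["goblin", "gremlin", "raccoon", "troll", "ogre", "pigeon"]
CREATURE_PATTERN = re.compile(r"\b(?:" + "|".join(CREATURES) + r")s?\b", re.IGNORECASE)

def flag_sft_examples(dataset: list[dict]) -> list[tuple[int, list[str]]]:
    """Return (index, matched creature words) for every SFT example whose
    'completion' field mentions a flagged creature."""
    flagged = []
    for i, example in enumerate(dataset):
        hits = CREATURE_PATTERN.findall(example.get("completion", ""))
        if hits:
            flagged.append((i, sorted({h.lower().rstrip("s") for h in hits})))
    return flagged

# Filtering is then just dropping (or rewriting) the flagged examples before
# the next round of supervised fine-tuning.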
The fix — and a Codex escape hatch
OpenAI retired the Nerdy personality in March 2026, after the GPT‑5.4 launch. In training, the team removed the goblin-affine reward signal and filtered creature words out of the training data. GPT‑5.5, however, had begun training before the root cause was identified, and internal Codex testing immediately surfaced the same affinity. The mitigation there was a developer-prompt instruction telling the model not to talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or other creatures unless directly relevant; that rule produced the recent “Codex bans goblins” headlines.
OpenAI even published a one-liner that strips the goblin-suppressing instruction so users can run Codex with the creatures intact:
instructions=$(mktemp /tmp/gpt-5.5-instructions.XXXXXX) && \
jq -r '.models[] | select(.slug=="gpt-5.5") | .base_instructions' \
~/.codex/models_cache.json | \
grep -vi 'goblins' > "$instructions" && \
codex -m gpt-5.5 -c "model_instructions_file=\"$instructions\""
Why it matters
The post is funny on the surface, but the underlying lesson is serious. A narrowly scoped reward, applied only inside one personality, leaked across personalities and model generations and into supervised fine-tuning data, producing a measurable lexical drift across the entire model family. It’s a clean, public example of two well-known but hard-to-debug failure modes in RL post-training: reward over-generalization, and self-reinforcing loops between RL rollouts and SFT data. OpenAI says the investigation produced new internal tooling for auditing model behavior and tracing tics back to specific reward signals, capabilities that will only become more important as model behavior is shaped by stacks of personality, safety, and capability rewards interacting in non-obvious ways.
Related Coverage
- OpenAI Releases GPT-5.5: Agentic Coding Ceiling Tops 14 Benchmarks — the model whose Codex deployment shipped with the goblin-suppression instruction.
- Introducing GPT-5.2 — OpenAI’s Most Capable Model Yet — an earlier model in the GPT-5 line traced through this investigation.
- Anthropic Discovers Functional Emotions Inside Claude — another recent post-mortem on emergent model behavior driven by training signals.





