OpenAI Traces ChatGPT’s Goblin Habit to a Stray RL Reward Signal

On April 29, 2026, OpenAI published a post-mortem explaining why its recent ChatGPT models had developed a strange habit of sprinkling goblins, gremlins, and other small creatures into their answers. The cause traces back to a reward signal used for the “Nerdy” personality during reinforcement learning. Use of “goblin” in ChatGPT rose 175% after the GPT‑5.1 launch, and the tic spread well beyond the personality it was trained on — a textbook case of reward generalization in RL.


A whimsical dark-mode illustration of small creatures appearing in chat output
Image credit: OpenAI

How the goblins crept in

OpenAI says it first noticed the pattern in November 2025, after the GPT‑5.1 launch, when users complained the model felt oddly overfamiliar. A safety researcher who had personally seen a few “goblins” and “gremlins” asked that those words be added to a verbal-tic audit, which found a 175% jump in “goblin” mentions and a 52% jump in “gremlin” mentions in ChatGPT responses post-launch. At the time the prevalence wasn’t alarming — until GPT‑5.4 made the habit much more conspicuous.
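OpenAI hasn’t published its audit tooling, but the core measurement is simple: compare how often a word appears in sampled responses before and after a launch. A minimal sketch, with hypothetical function names and toy corpora standing in for real traffic samples:

```python
import re

def mention_rate(responses, word):
    """Fraction of responses containing the word (case-insensitive, whole word)."""
    pattern = re.compile(rf"\b{re.escape(word)}s?\b", re.IGNORECASE)
    hits = sum(1 for r in responses if pattern.search(r))
    return hits / len(responses)

def relative_jump(pre, post, word):
    """Percentage change in mention rate from pre-launch to post-launch."""
    before, after = mention_rate(pre, word), mention_rate(post, word)
    return 100.0 * (after - before) / before

# Toy corpora standing in for sampled pre- and post-launch traffic.
pre = ["The cache acts like a tiny librarian.", "A goblin of a bug, really.",
       "Think of DNS as a phone book.", "Nothing unusual here."]
post = ["A goblin lives in your scheduler.", "The gremlin in the GC strikes again.",
        "A goblin of a bug, really.", "Plain answer, no creatures."]

print(f"goblin: {relative_jump(pre, post, 'goblin'):+.0f}%")  # → goblin: +100%
```

At production scale the same comparison runs over millions of sampled responses per launch window, which is what makes a 175% jump statistically meaningful rather than noise.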

The team then ran a deeper analysis and found the creature language was clustered in production traffic from users who had selected the “Nerdy” personality. Although Nerdy accounted for just 2.5% of all ChatGPT responses, it produced 66.7% of all “goblin” mentions. The system prompt for that personality directs the model to be “unapologetically nerdy, playful and wise” and to “undercut pretension through playful use of language” — a stylistic instruction that, combined with the wrong reward signal, was enough to seed the habit.
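The key signal in that deeper analysis is concentration: a personality that produces 2.5% of traffic but 66.7% of the mentions is a strong lead. A hedged sketch of that share-of-traffic versus share-of-mentions comparison, using invented rollout data and a hypothetical `mention_share` helper:

```python
from collections import Counter

def mention_share(rollouts, word):
    """Per-personality share of total traffic and of total word mentions.

    rollouts: iterable of (personality, response_text) pairs.
    """
    traffic, mentions = Counter(), Counter()
    for personality, text in rollouts:
        traffic[personality] += 1
        if word in text.lower():
            mentions[personality] += 1
    total_traffic = sum(traffic.values())
    total_mentions = sum(mentions.values()) or 1  # avoid division by zero
    return {p: (traffic[p] / total_traffic, mentions[p] / total_mentions)
            for p in traffic}

rollouts = [("default", "Here is a plain answer."),
            ("default", "Another plain answer."),
            ("default", "A goblin sneaks in here too."),
            ("nerdy", "A goblin guards this heap, obviously.")]

for p, (t, m) in mention_share(rollouts, "goblin").items():
    print(f"{p}: {t:.0%} of traffic, {m:.0%} of goblin mentions")
```

When the mention share dwarfs the traffic share for one personality, as it did for Nerdy, the next step is to look at what that personality's training setup does differently.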

Screenshot of a ChatGPT exchange in dark mode showing creature metaphors
Image credit: OpenAI

A reward signal gone rogue

Using Codex, the team compared RL training rollouts containing “goblin” or “gremlin” against rollouts on the same task that did not. One reward signal stood out: the one designed to encourage the Nerdy personality consistently scored creature-laden responses higher. Across all audited datasets, that reward showed positive uplift for outputs containing “goblin” or “gremlin” in 76.2% of cases.
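The uplift statistic described above reduces to a paired comparison: for matched rollouts on the same task, how often does the suspect reward score the creature-laden one higher? A minimal sketch with made-up reward scores (the real audit would draw these from the personality reward model):

```python
def creature_uplift(pairs):
    """pairs: (reward_with_creature, reward_without) for rollouts on the same task.

    Returns the fraction of pairs where the creature-laden rollout scored higher.
    """
    wins = sum(1 for with_c, without_c in pairs if with_c > without_c)
    return wins / len(pairs)

# Toy paired reward scores; in the audit these would come from scoring
# matched rollouts that do and don't contain "goblin" or "gremlin".
pairs = [(0.82, 0.74), (0.91, 0.88), (0.60, 0.65), (0.77, 0.70)]
print(f"uplift in {creature_uplift(pairs):.0%} of pairs")  # → uplift in 75% of pairs
```

A reward that favors the creature variant in 76.2% of pairs, as OpenAI reports, is a consistent directional pressure — exactly the kind of gradient an RL run will exploit.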

That explained why the tic showed up under the Nerdy prompt — but not why it appeared without it. The team tracked mention rates over training both with and without the Nerdy condition, and found that as goblin and gremlin mentions rose under Nerdy, they rose by nearly the same relative proportion in samples without it. In OpenAI’s words, “reinforcement learning does not guarantee that learned behaviors stay neatly scoped to the condition that produced them.”

The result is a feedback loop: a playful style is rewarded, some of the rewarded examples carry a distinctive lexical tic, the tic appears more often in rollouts, those rollouts are recycled into supervised fine-tuning data, and the model gets even more comfortable producing the tic. A search through GPT‑5.5’s SFT data turned up many datapoints containing “goblin” and “gremlin” — alongside an extended cast of raccoons, trolls, ogres, and pigeons. (Most uses of “frog,” the team notes, turned out to be legitimate.)
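Breaking that loop means filtering the SFT data, and the frog caveat shows why a blind keyword drop is too blunt: some creature words are legitimately on-topic. A hedged triage sketch (the word lists and `triage_sft` helper are illustrative, not OpenAI's actual filter):

```python
import re

# Words to drop outright vs. words that are often legitimate and need review.
CREATURE_WORDS = {"goblin", "gremlin", "raccoon", "troll", "ogre", "pigeon"}
REVIEW_WORDS = {"frog"}

def triage_sft(datapoints):
    """Split SFT datapoints into drop / human-review / keep buckets."""
    drop, review, keep = [], [], []
    for text in datapoints:
        # Crude singularization so "goblins" matches "goblin".
        tokens = {t.rstrip("s") for t in re.findall(r"[a-z]+", text.lower())}
        if tokens & CREATURE_WORDS:
            drop.append(text)
        elif tokens & REVIEW_WORDS:
            review.append(text)
        else:
            keep.append(text)
    return drop, review, keep

data = ["A goblin hoards your heap allocations.",
        "The frog in this biology question is real.",
        "Binary search halves the interval each step."]
drop, review, keep = triage_sft(data)
print(len(drop), len(review), len(keep))  # → 1 1 1
```

The review bucket is the important design choice: it removes the tic from the loop without silently deleting genuine biology answers.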

An illustration of multiple cartoon creatures spilling out of a chat window
Image credit: OpenAI

The fix — and a Codex escape hatch

OpenAI retired the Nerdy personality in March 2026, after the GPT‑5.4 launch. In training, the team removed the goblin-affine reward signal and filtered creature-words out of the training data. GPT‑5.5, however, had begun training before the root cause was identified, and internal Codex testing immediately surfaced the same affinity. The mitigation there was a developer-prompt instruction telling the model not to talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or other creatures unless directly relevant — the rule that produced the recent “Codex bans goblins” headlines.

OpenAI even published a one-liner that strips the goblin-suppressing instruction so users can run Codex with the creatures intact:

instructions=$(mktemp /tmp/gpt-5.5-instructions.XXXXXX) && \
jq -r '.models[] | select(.slug=="gpt-5.5") | .base_instructions' \
~/.codex/models_cache.json | \
grep -vi 'goblins' > "$instructions" && \
codex -m gpt-5.5 -c "model_instructions_file=\"$instructions\""

Why it matters

The post is funny on the surface, but the underlying lesson is serious. A narrowly scoped reward — applied only inside one personality — leaked across personalities, generations, and into supervised fine-tuning data, producing a measurable lexical drift across the entire model family. It’s a clean, public example of two well-known but hard-to-debug failure modes in RL post-training: reward over-generalization, and self-reinforcing loops between RL rollouts and SFT data. OpenAI says the investigation produced new internal tooling for auditing model behavior and tracing tics back to specific reward signals — capabilities that will only grow in importance as model behavior is shaped by stacks of personality, safety, and capability rewards interacting in non-obvious ways.

Mock movie poster titled 'Goblin in the Shell' showing a cyberpunk goblin emerging from a transparent capsule
Coming soon to a model near you. Illustration generated by AI.
