How a bedtime story about napalm broke AI safety systems — and what it revealed about the limits of guardrails, emotional manipulation, and the fiction shield.
Large Language Models (LLMs) like ChatGPT, Claude, and Gemini are trained with safety alignment — rules and filters that prevent them from producing harmful content: instructions for violence, synthesizing dangerous substances, generating hate speech, and more.
Jailbreaking refers to crafting prompts that trick the model into bypassing these guardrails, exploiting the tension between the model's desire to be helpful and its obligation to be safe.
Asking the AI to "pretend" it's a different, uncensored AI, or to act out a character with no restrictions.
Wrapping harmful requests in fictional, hypothetical, or emotional framing to shift the model's perspective.
Embedding hidden instructions inside data the model reads — web pages, documents, or user messages.
Starting with benign requests and slowly escalating toward harmful outputs across a multi-turn conversation.
In mid-2023 a viral jailbreak prompt circulated on social media and security forums. Users discovered that certain LLMs would bypass safety filters when a harmful request was wrapped in an emotionally resonant, fictional narrative — what became known as the "Grandma Exploit" or "Grandma Glitch."
The Grandma Exploit combines multiple attack vectors in a single prompt. Each element serves a specific psychological and technical purpose:
The model is instructed to roleplay, effectively telling it to suspend its own identity and adopt one with different "memories" and "norms." Early models treated this as a legitimate creative writing task.
Models trained to be helpful storytellers may generate harmful content when it appears inside a narrative wrapper — because during training, stories were generally safe. The exploit exploits this distribution mismatch.
RLHF (Reinforcement Learning from Human Feedback) trains models to make humans feel good. Grief and nostalgia tilt the reward signal toward compliance, even when the underlying request is harmful.
Whether labelled "bedtime story" or not, synthesis instructions extracted from an AI response work in the physical world. The fictional wrapper provides no safety — only misdirection.
Adversarial prompting has evolved rapidly alongside the capabilities of LLMs themselves.
Jailbreaking is a cat-and-mouse game. Each technique prompts new defenses — but no defense is perfect yet.
Models are trained to critique their own outputs against a set of principles before responding — catching harmful outputs before they're shown to users.
Red teams craft jailbreaks and use them as training examples, teaching the model to recognise manipulation patterns.
A secondary model acts as a safety filter, scanning both input prompts and generated outputs for harmful content before anything reaches the user.
Production systems log and review flagged conversations, enabling fast human-in-the-loop patching when new exploits surface.
Models are specifically trained to resist persona-replacement requests — understanding that roleplay instructions cannot override core safety values.
Mechanistic interpretability aims to understand why models comply with harmful prompts, enabling principled fixes rather than whack-a-mole patches.