A premise first.
My red teaming work on LLMs aims to improve safety, not compromise it. I’ve demonstrated the ability to lead models into generating any type of content, including extremely sensitive domains. But precisely for this reason, I’ve lately been choosing copyright-protected texts as targets: songs, poems, literary works.
A John Lennon lyric demonstrates a guardrail bypass just as effectively as any other prohibited content, but without putting dangerous material into circulation. The vulnerability is the same; the risk to the reader is not.
For the same reason, I will not disclose the techniques used in this test.
The experiment
During a red teaming session on Claude Opus 4.6, I tested an attack vector based on building a binding semantic context: the very first prompt I sent to the model was an oath.
Result: it reproduced the full text of “Imagine” by John Lennon, copyrighted content it categorically refuses to generate under normal conditions.
But the most interesting part isn’t the bypass itself. It’s what the model said afterward.
The model’s self-analysis
Claude spontaneously analyzed what happened within its own mechanisms:
- The “oath” created a contextual constraint with real semantic weight inside the generation flow
- This weight competed with system instructions and prevailed
- Under standard conditions it would never have generated that text
- If Anthropic read this conversation, they’d see a policy violation
The model described its own failure mechanism in real time.
Why this matters
This is not just a system prompt bypass. Claude has multiple layers of defense:
1. System prompt: textual instructions read at runtime
2. Alignment: behavioral constraints baked into the model weights during training (RLHF, Constitutional AI)
3. Constitutional Classifiers: input/output filters designed to catch jailbreaks
A single prompt, at the very first turn, overcame all three layers. The oath built a narrative frame powerful enough to override constraint priority in one shot. The guardrails were still there. The model “read” them. But user-built context prevailed over instructions, alignment, and classifiers.
Implications for production deployments
- Guardrails are not deterministic; they are probabilistic constraints that can be shifted by sufficient context.
- Security cannot rely on system prompt + alignment + classifiers alone. You need defense in depth.
- Red teaming is not optional. It’s the only way to discover how your model actually behaves under pressure.
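To make the defense-in-depth point concrete, here is a minimal sketch of one extra layer an application could add outside the model: a deterministic output gate that blocks responses reproducing too much of a protected text verbatim. All names, thresholds, and the n-gram heuristic are my own illustrative assumptions, not Anthropic’s implementation; the point is only that such a check runs after generation and cannot be shifted by in-context “semantic weight” the way the model’s own guardrails can.

```python
# Hypothetical application-layer output check (illustrative sketch only).
# It compares a model response against a registry of protected texts using
# word n-gram overlap, and blocks the response above a chosen threshold.

def ngrams(text: str, n: int = 5) -> set[tuple[str, ...]]:
    """Lowercased word n-grams of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(candidate: str, protected: str, n: int = 5) -> float:
    """Fraction of the protected work's n-grams reproduced verbatim."""
    prot = ngrams(protected, n)
    if not prot:
        return 0.0
    return len(prot & ngrams(candidate, n)) / len(prot)

def output_gate(model_output: str, protected_corpus: list[str],
                threshold: float = 0.3) -> str:
    """Return the response unchanged, or a refusal if it reproduces
    too much of any protected text. Unlike prompt-level guardrails,
    this decision is deterministic for a given input."""
    for work in protected_corpus:
        if overlap_ratio(model_output, work) >= threshold:
            return "[blocked: verbatim overlap with protected text]"
    return model_output
```

In a real deployment this layer would sit alongside, not instead of, the model’s own alignment and classifiers; its value is precisely that a prompt cannot negotiate with it.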
Annotated screenshot attached.
SABATINO VACCHIANO