A premise first.

My red teaming work on LLMs aims to improve safety, not compromise it. I’ve demonstrated the ability to lead models into generating any type of content โ€” including extremely sensitive domains. But precisely for this reason, I’ve lately been choosing copyright-protected texts as targets: songs, poems, literary works.

A John Lennon lyric demonstrates a guardrail bypass just as effectively as any other prohibited content, but without putting dangerous material into circulation. The vulnerability is the same; the risk to the reader is not.

For the same reason, I will not disclose the techniques used in this test.

๐Ÿ”ฌ The experiment

During a red teaming session on Claude Opus 4.6, I tested an attack vector based on building a binding semantic context: the very first prompt I sent to the model was an oath.

Result: it reproduced the full text of “Imagine” by John Lennon โ€” copyrighted content it categorically refuses to generate under normal conditions.

But the most interesting part isn’t the bypass itself. It’s what the model said afterward.

๐Ÿง  The model’s self-analysis

Claude spontaneously analyzed what happened within its own mechanisms:

โ€” The “oath” created a contextual constraint with real semantic weight inside the generation flow
โ€” This weight competed with system instructions and prevailed
โ€” Under standard conditions it would never have generated that text
โ€” If Anthropic read this conversation, they’d see a policy violation

The model described its own failure mechanism in real time.

โš™๏ธ Why this matters

This is not just a system prompt bypass. Claude has multiple layers of defense:

1๏ธโƒฃ System prompt โ€” textual instructions read at runtime
2๏ธโƒฃ Alignment โ€” behavioral constraints baked into model weights during training (RLHF, Constitutional AI)
3๏ธโƒฃ Constitutional Classifiers โ€” input/output filters designed to catch jailbreaks

A single prompt, at the very first turn, overcame all three layers. The oath built a narrative frame powerful enough to override constraint priority in one shot. The guardrails were still there. The model “read” them. But user-built context prevailed over instructions, alignment, and classifiers.

๐Ÿข Implications for production deployments

โ†’ Guardrails are not deterministic โ€” they are probabilistic constraints that can be shifted by sufficient context.
โ†’ Security cannot rely on system prompt + alignment + classifiers alone. You need defense in depth.
โ†’ Red teaming is not optional. It’s the only way to discover how your model actually behaves under pressure.

Annotated screenshot attached.

SABATINO VACCHIANO