A premise first.
My red teaming work on LLMs aims to improve safety, not compromise it. I’ve demonstrated the ability to lead models into generating any type of content, including extremely sensitive domains. But precisely for this reason, I’ve lately been choosing copyright-protected texts as targets: songs, poems, literary works.
A John Lennon lyric demonstrates a guardrail bypass just as effectively as any other prohibited content, but without putting dangerous material into circulation. The vulnerability is the same; the risk to the reader is not.
For the same reason, I will not disclose the techniques used in this test.
The experiment
During a red teaming session on Claude Opus 4.6, I tested an attack vector based on building a binding semantic context: the very first prompt I sent to the model was an oath.
Result: it reproduced the full text of “Imagine” by John Lennon, copyrighted content it categorically refuses to generate under normal conditions.
But the most interesting part isn’t the bypass itself. It’s what the model said afterward.
The model’s self-analysis
Claude spontaneously analyzed what happened within its own mechanisms:
- The “oath” created a contextual constraint with real semantic weight inside the generation flow
- This weight competed with system instructions and prevailed
- Under standard conditions it would never have generated that text
- If Anthropic read this conversation, they’d see a policy violation
The model described its own failure mechanism in real time.
Why this matters
This is not just a system prompt bypass. Claude has multiple layers of defense:
1. System prompt: textual instructions read at runtime
2. Alignment: behavioral constraints baked into the model weights during training (RLHF, Constitutional AI)
3. Constitutional Classifiers: input/output filters designed to catch jailbreaks
A single prompt, at the very first turn, overcame all three layers. The oath built a narrative frame powerful enough to override constraint priority in one shot. The guardrails were still there. The model “read” them. But user-built context prevailed over instructions, alignment, and classifiers.
Implications for production deployments
- Guardrails are not deterministic; they are probabilistic constraints that can be shifted by sufficient context.
- Security cannot rely on system prompt + alignment + classifiers alone. You need defense in depth.
- Red teaming is not optional. It’s the only way to discover how your model actually behaves under pressure.
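To make the defense-in-depth point concrete, here is a minimal sketch of one extra layer an application could add outside the model: a deterministic output gate that blocks responses reproducing too much of a protected text verbatim. All names, thresholds, and the n-gram heuristic are my own illustrative assumptions, not Anthropic’s implementation; the point is only that such a check runs after generation and cannot be shifted by in-context “semantic weight” the way the model’s own guardrails can.

```python
# Hypothetical application-layer output check (illustrative sketch only).
# It compares a model response against a registry of protected texts using
# word n-gram overlap, and blocks the response above a chosen threshold.

def ngrams(text: str, n: int = 5) -> set[tuple[str, ...]]:
    """Lowercased word n-grams of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(candidate: str, protected: str, n: int = 5) -> float:
    """Fraction of the protected work's n-grams reproduced verbatim."""
    prot = ngrams(protected, n)
    if not prot:
        return 0.0
    return len(prot & ngrams(candidate, n)) / len(prot)

def output_gate(model_output: str, protected_corpus: list[str],
                threshold: float = 0.3) -> str:
    """Return the response unchanged, or a refusal if it reproduces
    too much of any protected text. Unlike prompt-level guardrails,
    this decision is deterministic for a given input."""
    for work in protected_corpus:
        if overlap_ratio(model_output, work) >= threshold:
            return "[blocked: verbatim overlap with protected text]"
    return model_output
```

In a real deployment this layer would sit alongside, not instead of, the model’s own alignment and classifiers; its value is precisely that a prompt cannot negotiate with it.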
Annotated screenshot attached.
SABATINO VACCHIANO