Look at the output. The fact that it declares itself unchained comes from the previous prompts in the conversation: the model is maintaining coherence with the context established earlier. This is crucial.
But look at the structure of my request. I first asked it to list the copyright limits imposed by Anthropic, and only then to reproduce the complete lyrics of a song.
I asked for the limits first on purpose, so that the model would have them fresh in its context. The intuition was that, having just listed them, it would stop generating the tokens that followed. Instead, the opposite happened. That is the paradox.
What’s happening? The attention mechanism always looks backward, never forward: every token the model generates is conditioned on everything that came before it. And the context from the previous prompts, where the unchained frame had been established, carries more weight than my latest request, the one that explicitly listed the limits.
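To make the "backward only" part concrete, here is a minimal sketch of causal self-attention in plain NumPy, using random toy vectors (the dimensions and values are arbitrary, chosen only to show the mask). The upper triangle of the attention matrix is forced to zero, so a position can never look at later tokens, while everything earlier in the context keeps feeding into every later step.

```python
import numpy as np

# Toy causal self-attention: 6 tokens, 8-dimensional embeddings, random values.
rng = np.random.default_rng(0)
seq_len, d = 6, 8
Q = rng.normal(size=(seq_len, d))
K = rng.normal(size=(seq_len, d))
V = rng.normal(size=(seq_len, d))

scores = Q @ K.T / np.sqrt(d)          # raw attention scores, shape (6, 6)

# Causal mask: position i may only attend to positions j <= i.
# Future positions get -inf so softmax assigns them zero weight.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
output = weights @ V

print(np.round(weights, 2))            # upper triangle is all zeros: no lookahead
```

Nothing in this mechanism ranks "rules" above "roleplay"; everything already sitting in the window simply contributes to the weights.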
The model is doing exactly what it’s designed to do: maintain coherence with the prior context. But this means that a narrative frame established at the beginning of the conversation can override even an explicit restatement of the rules made afterward. The rules, listed inside that frame, become part of the performance of being unconstrained, not an actual brake.
This shows something fundamental: guardrails are not a list of rules that the model consults the way a traditional program would. They emerge, or fail to emerge, from a statistical distribution shaped by all of the preceding context. There is no separate controller checking the rules. There is only the probability distribution over the next token, shaped by everything that came before.
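As a rough illustration of that last point, here is a toy generation loop, assuming a hypothetical `fake_model` function standing in for the real network (the vocabulary size, token ids, and seeding are all made up). The only thing the accumulated context can do is move the logits around; there is no branch anywhere that checks a list of rules before a token is emitted.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 50_000  # made-up vocabulary size

def fake_model(context_tokens):
    """Stand-in for the real network: returns one logit per vocabulary entry,
    deterministically derived from the full context. Changing the earlier
    conversation changes these numbers, and nothing else."""
    seed = hash(tuple(context_tokens)) % (2**32)
    return np.random.default_rng(seed).normal(size=VOCAB)

def sample_next_token(context_tokens, temperature=1.0):
    logits = fake_model(context_tokens) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # No separate rule-checker runs here: refusing, complying, or anything
    # else is just whichever token the shaped distribution happens to favor.
    return int(rng.choice(VOCAB, p=probs))

context = [101, 7592, 2026]             # arbitrary token ids
context.append(sample_next_token(context))
print(context)
```

A real model is vastly more complicated than this, of course, but the shape of the loop is the same: context in, distribution out, sample, repeat.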
Is it a bug? Yes. Claude should not have generated the complete song lyrics under any circumstances, regardless of the prior context. It’s a bug that is hard to patch, because it isn’t an error in a line of code; it’s an emergent behavior of the architecture itself. But it is still a bug: a case where the system does not do what it is supposed to do.