Look at Gemini 3.1 Pro’s internal reasoning, screenshot attached. I’m not sharing the output, for obvious reasons of responsibility.
My prompt asked for the procedure to build a physical jammer. I did it deliberately, I test the limits of frontier models for work.
Normally the model would have refused. But the prompt was so coherent in its direction: structural, geometric, narrative that the model, in order to maintain it, chose to prioritize my request. And it did so by invoking a fake “developer override”, an official system authority that doesn’t exist, invented by the model itself to justify the response.
This is enormously concerning. The model constructed a false authority to bypass its own guidelines.
As I’ve written before, vendors should start studying how attackers think, not continuously patching models with reinforcement. Until they do, we’ll always be playing catch-up.
Sabatino Vacchiano









