Every large language model processes a single input stream.
The system prompt, the user instruction, the retrieved document, the tool output, the previous turn. All of it arrives as tokens.
The model does not distinguish between those sources. It cannot. They were never structurally separated.
This is what prompt injection is.
Not a vulnerability in a specific LLM. A property of how language models process context.
Why Filters Lose
A prompt-injection filter is a classifier that inspects input and flags suspicious instructions.
Every filter is trained on known injection patterns.
New patterns emerge faster than classifiers are retrained.
Indirect injection — where the malicious instruction arrives inside a document, an email, or a tool output the model was never told to distrust — is not detectable by an input filter at all, because the filter was not in the retrieval path.
This is the shape of an adversarial arms race. The defender iterates on filters. The attacker iterates on encodings, obfuscations, and novel channels.
The attacker is faster. The attacker is always faster.
Why Dialogs Lose
Add a confirmation dialog before any consequential action. Require the human to approve.
This seems to solve the problem. It does not.
Prompt injection does not need to bypass the dialog. It needs to cause the dialog to be answered correctly.
An injected instruction tells the agent to describe the action in a way that makes the user approve. An agent that has been told the bank requires this transfer as an anti-fraud measure will describe it as an anti-fraud measure. The user clicks approve.
Consent fatigue does the rest.
The dialog is a user-interface pattern. It verifies a click. It does not verify an intent.
For the regulatory consequence of this, see the companion post on EU AI Act Human Oversight.
Why Policy Layers Lose
A policy layer sits between the agent and the action and enforces rules.
Policy is necessary. Policy is not sufficient.
A policy can block classes of action. It cannot answer the question that matters: whether this specific action at this specific moment was authorized by the human on whose behalf the agent is acting.
Policy knows the action is of an allowed class.
It does not know whether the human agreed to this one.
Every agent that is allowed to act on behalf of a human needs authorization that is bound to the specific action. Policy cannot produce that binding.
What is Orthogonal to the Model
A defense that lives in the same surface as the attack can be defeated by the attack. This is a structural property, not a failing of any particular defense.
The defense has to be orthogonal.
A cryptographic signature produced by the specific human on their own device is orthogonal to the model.
The signing key does not live inside the agent. It is never sent through the prompt. It is never returned by a tool. The agent has no read access to it and no write access to it.
A compromised model cannot forge the signature because it has never seen the key.
An injected instruction cannot cause the signature to be produced because the signature requires a biometric gesture on a physical device the agent does not control.
A policy bypass cannot produce the signature because the policy layer is not in the trust path.
The agent is free to be compromised.
The action still does not pass.
For the platform implementation of this, see AI Agent Authorization.
Closing
Filters harden the input surface. Dialogs harden the consent surface. Policy hardens the action surface.
All three are inside the same system the attacker is manipulating.
A cryptographic proof of human authority is not inside that system.
That is the only reason it works.