writing · 2024-12-09
Prompt injection, and the line between fixed and user content
The most useful mental model I've found for securing enterprise LLM systems: know exactly which bytes of your prompt the user is allowed to influence, and treat the rest as untouchable.
When an LLM sits inside an enterprise workflow, “be careful with prompts” stops being advice and becomes a security requirement. The failure mode is prompt injection: user-supplied text that smuggles in instructions and quietly rewrites what the system was told to do.
The mental model: segregate fixed from modifiable
The framing that has helped me most is brutally simple. In any prompt, some content is fixed (the system’s instructions, guardrails, and policy) and some is user-modifiable (the question, the document, the case text). The job is to make sure the modifiable part can never be read as the fixed part.
That sounds obvious until you notice how many systems concatenate everything into one string and hope the model behaves. The moment user text and system instructions share the same channel with no boundary, you have handed the user a pen to edit your policy.
What this looks like in practice
Keep system instructions structurally separate from user content, not just “later in the string” but in a way the model is trained to treat as privileged. Assume any field a user touches is adversarial, including the documents you retrieved if those documents can contain user-authored text. And make compliance checkable: if you cannot point to where the boundary is, you cannot prove the boundary holds.
Explainability is a security feature
Forcing the model to explain its reasoning, why it selected a document, why it produced an answer, is not only about user trust. It is a tripwire. When an injection succeeds, the reasoning chain is often where you first see the model decide to do something it was never asked to do. Visibility is half the defence.
Security in LLM systems is not a model you bolt on at the end. It is a property of how you arrange the bytes.