3/15/2026 at 5:40:54 PM
This is the natural consequence of building everything around "the agent needs access to everything to be useful." The more capabilities you hand an agent, the larger the attack surface when it encounters a malicious page.The simplest mitigation is also the least popular one: don't give the agent credentials in the first place. Scope it to read-only where possible, and treat every page it visits as untrusted input. But that limits what agents can do, which is why nobody wants to hear it.
by redgridtactical
3/15/2026 at 6:13:19 PM
I absolutely agree, although even that doesn't solve the root problem. The underlying LLM architecture is fundamentally insecure as it doesn't separate between instructions and pure content to read/operate on.I wonder if it'd be possible to train an LLM with such architecture: one input for the instructions/conversation and one "data-only" input. Training would ensure that the latter isn't interpreted as instructions, although I'm not knowledgeable enough to understand if that's even theoretically possible: even if the inputs are initially separate, they eventually mix in the neural network. However, I imagine that training could be done with massive amounts of prompt injections in the "data-only" input to penalize execution of those instructions.
by rocho
3/16/2026 at 4:13:40 PM
I think there are two distinct attack types for LLMs. Jailbreaking is what most people think of, and consists of structureing a prompt so the LLM does what the prompt says, even if it had prior context saying not to.The other type of attack would be what I would call "induced hallucinations", where the attacker crafts data not to get the LLM to do anything the data says, but to do what the attacker wants.
This is a common attack to demonstrate on neural network based image classifiers. Start with a properly classified image, and a desired incorrect classification. Then, introduce visually imperceptible noise until the classifier reports it as your target classification. There is no data/instruction confusion here: it is all data.
The core problem is that neural networks are fairly linear (which is what makes it possible to construct efficient hardware for them). They are, of course, not actually linear functions, but close enough to make linear algebra based attacks feasible.
It is probably better to think of this sort of attack in term of crypto analysis, which frequently exploits linearity in cryptosystems.
The depth of LLM networks make this sort of attack difficult; but I don't see any reason to think you can add enough layers to make it impossible. Particularly given that there is other research showing structure across layers, with groupings of layers having identifiable functionality. This means it is probably possible to reason about attacking individual layers like an onion.
This problem isn't really unique to AI either. Human written code has a tendency to be vulnerable to a similar attack, where maliciously crafted data can exploit the processor to do anything (e.g buffer overflow into arbitrary code execution).
by gizmo686
3/16/2026 at 4:57:59 AM
OpenAI and other labs are trying to do this within existing structure by explicitly training a specific chain of authority in the inputs: https://github.com/openai/model_spec/blob/main/model_spec.md...However, you may immediately see how using same input space essentially relies on the model itself to do the judgement which can't be ultimately trusted
by everlier
3/15/2026 at 7:40:13 PM
> one input for the instructions/conversation and one "data-only" inputWe learned so many years ago that separating code and data was important for security. It's such a huge step backwards that it's been tossed in the garbage.
by RHSeeger
3/15/2026 at 6:03:14 PM
[dead]by myrak