3/12/2026 at 10:50:51 PM
Any document store where you haven’t meticulously vetted each document (forget about actual bad actors) runs this risk. An org of any size, across many years, generates a lot of material: analyses that were correct at one point and not at another, things that were simply wrong at all times, contradictions, etc. You have to choose a model suitably robust in its capabilities, and design prompts or post-training regimes tested against such cases, so that the model will identify the divergent documents and either choose the correct one or surface both with an appropriately helpful and clear explanation.
At minimum you have to start from a typical model risk perspective and test and backtest the way you would traditional ML.
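To make the backtesting point concrete, here's a minimal sketch of treating a RAG pipeline like any other model under model-risk review: score it against a labeled eval set before (and after) the document store changes. Everything below -- the eval set, the stubbed `rag_answer` -- is invented for illustration; your real pipeline goes where the stub is.

```python
# Toy backtest harness: score a RAG pipeline against a labeled eval set,
# the way you'd validate a traditional ML model before deployment.

def rag_answer(question: str) -> str:
    # Stub standing in for the real pipeline (retrieve docs, call the model).
    canned = {
        "Q3 revenue figure?": "4.2M",
        "current API version?": "v2",
    }
    return canned.get(question, "unknown")

def backtest(eval_set: list[tuple[str, str]]) -> float:
    """Return accuracy over (question, expected_answer) pairs."""
    hits = sum(1 for q, expected in eval_set if rag_answer(q) == expected)
    return hits / len(eval_set)

eval_set = [
    ("Q3 revenue figure?", "4.2M"),
    ("current API version?", "v2"),
    ("deprecated config flag?", "use_legacy"),  # pipeline misses this one
]
accuracy = backtest(eval_set)
print(f"accuracy: {accuracy:.2f}")
```

Rerunning the same eval set on a schedule is what catches the "correct at one point, not at another" drift: the questions don't change, but the documents underneath them do.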
by ineedasername
3/13/2026 at 12:13:20 AM
You're right, and this is an underappreciated point. The "attacker" framing can actually obscure the more common risk: organic knowledge-base degradation over time. The poisoning attack is just the adversarial extreme of a problem that exists in every large document store.

The model-robustness angle is valid, but I'd push back slightly on it being sufficient as a primary control. The model risk / backtesting framing is exactly right for the generation side. Where RAG diverges from traditional ML is that the "training data" is mutable at runtime (any authenticated user or pipeline can change what the model sees without retraining).
by aminerj
3/13/2026 at 3:33:30 AM
> sufficient as a primary control.

My apologies, it wasn’t my intent to convey that as a primary control. It isn’t one. It’s simply the first thing you should do, apart from vetting your documents as much as practicality allows, to at least start from a foundation where transparency of such results is possible. In any system whose main function is to surface information, transparency, provenance, and a chain of custody are paramount.
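To make the chain-of-custody point concrete, here's a minimal sketch: attach a content hash plus custody metadata to every document at ingestion, so any chunk a RAG answer surfaces can be traced back and re-vetted. The field names and sources are invented for illustration, not from any particular framework.

```python
import hashlib
import json
from datetime import datetime, timezone

def provenance_record(content: str, source: str, ingested_by: str) -> dict:
    """Custody metadata for one document: who ingested it, when, from where,
    plus a content hash so later tampering or silent edits are detectable."""
    return {
        "sha256": hashlib.sha256(content.encode()).hexdigest(),
        "source": source,
        "ingested_by": ingested_by,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }

doc = "Q3 revenue was 4.2M (superseded by restated figures in the 10-K)."
rec = provenance_record(doc, source="finance-wiki/q3-summary",
                        ingested_by="etl-pipeline-7")

# Later, when this chunk is surfaced in an answer, verify it hasn't been
# altered since ingestion before trusting (or displaying) it:
assert rec["sha256"] == hashlib.sha256(doc.encode()).hexdigest()
print(json.dumps(rec, indent=2))
```

The point isn't the hashing; it's that every surfaced result carries enough metadata to answer "where did this claim come from, and has it changed since we vetted it."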
I can’t stop all bad data, but I can maximize the ability to recognize it on sight. A model that has a dozen RAG results dropped into its context needs a solid capability for doing the same. Depending on the details of the implementation, the smaller the model, the more important it is that it be one with a “thinking” capability, to have some minimal adequacy in this area. The “wait…” loop and similar moves it makes can catch some of this. But the smaller the model and the more complex the document (forget about context size alone; perplexity matters quite a bit), the more the model’s limited attention budget gets eaten up, too much to catch contradictions or factual inaccuracies whose accurate forms were somewhere in its training set or the RAG results.
I’m not sure the extent to which it’s generally understood that complexity of content is a key factor in context decay and collapse. By all means optimize “context engineering” for quota, API calls, and cost. But if you reduce token count without reducing much of the information, the increased density of the context will still contribute significantly to context decay; the relationship isn’t a linear 1:1 reduction.
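A toy way to see the density point: the two snippets below carry the same two facts, but the compressed form packs them into a third of the tokens, so each token carries roughly 3x the information load. The snippets and the facts-per-token metric are invented for illustration (whitespace split stands in for a real tokenizer); the point is only that "fewer tokens" is not the same as "less work per token."

```python
# Same two facts (Q3 revenue, Q4 revenue), two very different densities.
FACTS = 2

verbose = ("the revenue for the third quarter was 4.2 million dollars and "
           "the revenue for the fourth quarter was 5 million dollars")
dense = "Q3 rev 4.2M; Q4 rev 5M"

def density(text: str, n_facts: int) -> float:
    """Facts per token, with whitespace split standing in for a tokenizer."""
    return n_facts / len(text.split())

print(f"verbose: {len(verbose.split())} tokens, "
      f"{density(verbose, FACTS):.3f} facts/token")
print(f"dense:   {len(dense.split())} tokens, "
      f"{density(dense, FACTS):.3f} facts/token")
```

The cost accounting sees only the token counts; the model's attention has to carry the facts either way.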
If you aren’t accounting for this dynamic when constructing your workflows and pipelines, and you’re seeing unexpected failures that don’t seem like they should be happening while doing some variety of aggressive “context engineering”, that is one very reasonable element to consider when chasing down the issue.
by ineedasername
3/13/2026 at 8:06:28 AM
[flagged]
by aminerj
3/13/2026 at 6:22:56 PM
> That seems worth testing

I have-- I see your info via your HN profile. If I have a spare moment this weekend I'll reach out there; I'll dig up a few examples and take screenshots. I built an exploration tool for investigating a few things I was interested in, and surfacing potential reasoning paths exhibited in the tokens not chosen was one of them.
Part of my background is in Linguistics-- classical, not just NLP/computational-- so the pragmatics involved with disfluencies made that "wait..." pattern stand out during ordinary interactions with LLMs that showed thought traces. I'd see it not infrequently, e.g. by expanding the "thinking..." section in various LLM chat interfaces.
In humans it's not a disfluency in the typical sense of difficulty with speech production; it's a pragmatic marker that lets the listener know the speaker is reevaluating something they were about to say. It of course carries over into writing, either in written dialog or in less formal self-editing contexts, so it's well represented in any training corpus. As a marker of "rethinking", then, it stood to reason that models' "thinking" modes would display it-- it's not unlikely it's specifically trained for.
So it's one of the things I went token-diving to see "close up", so to speak, in non-thinking models too. It's not hard to induce a reversal, or at least a diversion off whatever the model would have said-- if it's close to a correct answer, there's a reasonable chance it will produce the correct one instead of pursuing a more likely candidate among the top k. This wasn't with Qwen; it was gemma 3 1b where I did that particular exploration. It wasn't a systematic process for a study, but I found it pretty much any time I went looking: I'd spot a decision point and perform the token injection.
If I have the time I'll mock up a simple RAG scenario: just inject the documents that would be retrieved as RAG results, similar to your article, and screenshot that in particular. A bit of a toy setup, but close enough to "live" that it could point the direction toward more refined testing, however the model responds. And putting aside the publishing side of these sorts of explorations, there's a lot of practical value in assisting with debugging the error rates.
by ineedasername
3/14/2026 at 7:04:16 PM
[flagged]
by aminerj