alt.hn

4/2/2025 at 1:37:16 PM

Abusive AI Web Crawlers: Get Off My Lawn

https://www.mythic-beasts.com/blog/2025/04/01/abusive-ai-web-crawlers-get-off-my-lawn/

by bluehatbrit

4/2/2025 at 3:33:43 PM

You can monetize your app users by partnering with providers that offer SDKs for residential proxy networks. These services let users opt in to share their internet connection, earning you revenue while they get benefits like ad-free experiences.

How It Works: Providers like Proxyrack, Live Proxies, Rayobyte, and Infatica allow you to integrate their SDKs into your app. Users who agree to join the proxy network contribute their device’s bandwidth, often used for web scraping, and you get paid based on their activity—typically per monthly or daily active user.

So it need not be "compromised Android set-top boxes", but just millions of free apps running on users' phones.

by PeterStuer

4/2/2025 at 3:24:29 PM

We observed the same behavior. Each request used a different IP address and a random user agent. In our case, most of the IP addresses belonged to Chinese ISPs. They went to great lengths to avoid being blocked, but at the same time used user agents such as Windows 95/98 or IE 5. Fortunately, the combination of the odd user agents and the fact that they still use HTTP/1.1 makes them somewhat easy to identify. So you can use a captcha on more expensive endpoints to block them.
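The heuristic described above (ancient user agents plus HTTP/1.1) is easy to sketch as a gate in front of expensive endpoints. This is only an illustration of the commenter's idea, not their actual setup; the regex patterns, cost threshold, and function name are all made up for the example:

```python
import re

# Flag clients that claim an ancient OS/browser (Windows 95/98, IE 5)
# AND still speak HTTP/1.x, then challenge them with a captcha on
# expensive endpoints only. The patterns are illustrative assumptions,
# not a vetted blocklist.
SUSPECT_UA = re.compile(r"Windows 9[58]|MSIE [1-5]\.", re.IGNORECASE)

def needs_captcha(user_agent: str, http_version: str, endpoint_cost: int) -> bool:
    """Return True if this request should be sent to a captcha first."""
    if endpoint_cost < 10:  # cheap endpoints: let everything through
        return False
    suspicious = bool(SUSPECT_UA.search(user_agent or ""))
    legacy_http = http_version in ("HTTP/1.0", "HTTP/1.1")
    return suspicious and legacy_http

print(needs_captcha("Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)", "HTTP/1.1", 50))  # True
print(needs_captcha("Mozilla/5.0 (X11; Linux x86_64) Firefox/125.0", "HTTP/2", 50))     # False
```

Gating only the costly endpoints keeps false positives cheap: a rare legitimate retro browser just sees one extra captcha.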

by DarkPlayer

4/2/2025 at 2:04:52 PM

I don't understand the current thing about "AI Crawlers". Maybe someone can help educate me.

How is it related to AI? Do AI crawlers do something different from traditional search index crawlers? Or is it simply a proliferation of crawlers because of the growth of AI products?

What makes AI special in this context?

by intellectronica

4/2/2025 at 2:15:50 PM

The proliferation of crawlers is part of the problem. They're also more aggressive and poorly behaved than typical search engines. Some issues:

- They request every resource, vastly increasing costs compared to a normal crawler.

- Not only do they not respect robots.txt; they use it as an explicit source of more links to mine.

- They request resources frequently (many reports of 100x per day), sometimes from bugs and sometimes to ensure they have the latest copy.

- There's no rate limiting. It's trivial to create a crawler architecture where the crawler operates at full tilt overall while spreading the load across millions of pages and respecting each site, but they don't bother, so even if everything else were fine it starts looking like a DOS attack.

- They intentionally use pools of IPs and other resources to obfuscate their identities and make themselves harder to block.

How much of that is "baby's first crawler" not being written very well, and how much is actual malice? Who knows, but the net effect is huge jumps in costs to support the AI wave.
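The rate limiting the list above says is missing is not hard. A minimal sketch of a per-host politeness delay, which spreads a crawler's load across many sites instead of hammering one; the class name and 5-second interval are arbitrary choices for the example:

```python
import urllib.parse

class PoliteThrottle:
    """Per-host politeness delay: a well-behaved crawler runs at full
    tilt overall while never hitting any single host too often."""

    def __init__(self, min_interval_s: float = 5.0):
        self.min_interval_s = min_interval_s
        self.last_hit = {}  # host -> time its next-allowed slot was claimed

    def wait_time(self, url: str, now: float) -> float:
        """Seconds to wait before this URL may be fetched politely."""
        host = urllib.parse.urlsplit(url).netloc
        last = self.last_hit.get(host)
        if last is None:            # first request to this host: go ahead
            self.last_hit[host] = now
            return 0.0
        delay = max(0.0, self.min_interval_s - (now - last))
        self.last_hit[host] = now + delay  # claim the next slot
        return delay

throttle = PoliteThrottle(min_interval_s=5.0)
print(throttle.wait_time("https://example.com/a", now=0.0))  # 0.0
print(throttle.wait_time("https://example.com/b", now=1.0))  # 4.0
print(throttle.wait_time("https://other.org/x", now=1.0))    # 0.0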

by hansvm

4/2/2025 at 2:25:57 PM

Speculation: if it isn't one already, this will become a data-broker market for public-ish data too. What I mean by that is a separation of entities, where OpenAI and "legitimate" AI companies buy data from brokers of shadily scraped data, and throw them under the bus if shit hits the fan to protect the mothership. This makes sense from a corporate risk perspective, creating a gray-area buffer of accountability. OpenAI and Anthropic have already pleaded with the government not to take away their fair-use hall pass (by invoking the magic spell "China"), but if that doesn't work and publishers win, they'll need to be prepared.

At the same time, publicly and easily available quality content is a race against time. Platforms like Reddit and Xitter are already locking down with aggressive anti-bot measures and fingerprinting, and the cottage industries are following. Meanwhile, public data is being polluted by content farms producing garbage at an increasing rate using AI.

Together this creates a perfect storm of bad incentives: (1) the data hoarders are no longer just Google and Microsoft, but probably thousands of smaller entities and (2) they’re short on time, and try to scrape more invasively and at a fast rate.

by klabb3

4/2/2025 at 2:18:35 PM

The incompetence hypothesis makes sense (it is often a good explanation). Web indexers like Google have had decades to get really good at this, including hordes of people who work on crawlers full time. AI companies are often very young, execute with small teams, and don't consider web indexing their main activity, just something they do in support of pre-training (or maybe serving web results).

by intellectronica

4/2/2025 at 2:20:25 PM

If the problem is really incompetence, then maybe a viable solution is for the community to create a really great (and well-behaved) OSS crawler. Make it easier for the AI people to do the right thing by making rolling their own crawler the more expensive, lower quality option.
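The "well-behaved" baseline such an OSS crawler would need is already cheap to express with Python's standard library. A generic sketch of robots.txt compliance using `urllib.robotparser` (the rules below are a made-up example file, and "FriendlyBot" is a hypothetical user agent, not the proposed crawler):

```python
from urllib import robotparser

# Parse an example robots.txt and check it before every fetch — the
# minimum any well-behaved crawler should do. In a real crawler you'd
# call rp.set_url(...) and rp.read() to fetch the live file instead.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 10",
])

print(rp.can_fetch("FriendlyBot/1.0", "https://example.com/public/page"))   # True
print(rp.can_fetch("FriendlyBot/1.0", "https://example.com/private/page"))  # False
print(rp.crawl_delay("FriendlyBot/1.0"))                                    # 10
```

Honoring `Crawl-delay` and `Disallow` costs a handful of lines, which supports the point that rolling your own badly is a choice, not a necessity.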

by intellectronica

4/2/2025 at 2:13:30 PM

Search engine crawlers, aggregators, vertical-market crawlers and so on may give you visibility, are not that numerous, and are usually well-behaved (i.e. they respect robots.txt, announce themselves with a consistent user agent, etc).

Security/vulnerability scans don't request too many pages, at least not existing ones, and usually come from a few IPs from time to time.

But AI crawlers can be really numerous, try to get all your pages, and aren't always respectful of robots.txt or your performance. And they don't give you anything back. There may be exceptions, but the ones you notice end up having a negative impact.

by gmuslera

4/2/2025 at 2:15:58 PM

Yes, I understand that, and I'm dismayed to learn about this.

But the question I'm asking is _why_ do AI crawlers behave in this different way.

by intellectronica

4/2/2025 at 2:28:15 PM

Too many players

by gmuslera

4/2/2025 at 2:06:44 PM

They don’t respect robots.txt at all and won’t hesitate to call all the endpoints they find, repeatedly, even when they’re costly for the host. That’s basically it.

by otikik

4/2/2025 at 2:09:04 PM

Right, but how is it related to AI?

Is there just a correlation between crawlers that don't respect indexing norms and crawlers that operate in the service of AI products?

Or is there something about the sort of indexing people do for AI that makes this nasty behaviour more likely?

by intellectronica

4/2/2025 at 4:10:01 PM

One major difference is that while indexing, you're generating an internal data structure that represents that site. Once done, if the site doesn't change, you don't have any need to revisit it, and in fact, fetching the site multiple times just increases your own costs.

On the other hand, an unsupervised AI training algorithm may just need raw text, and as much of it as possible. It doesn't know or much care what site it came from, and it's not building any index that links the content back to its original source. So fetching the site on each training epoch might actually be viable: why bother storing the entire internet when you can just fetch -> transform -> ingest into your model? If your crawler is distributed enough, it won't be the bottleneck, either.

If this is the architecture some companies are using, this also means that these crawlers won't ever stop, because they are finetuning some model by constantly updating over time based on the "current" internet, whatever that might mean.
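The speculated fetch -> transform -> ingest loop above can be sketched as follows. Everything here is illustrative: `fetch` and `train_on` are stand-ins for real components (stubbed below), and no company is known to use exactly this code:

```python
# Sketch of the speculated stateless training pipeline: fetch pages,
# strip them to raw text, feed the text to the trainer, store nothing.
# No provenance index is built, which is exactly why re-fetching every
# epoch can look cheaper to the operator than keeping a local copy.
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collect only the text nodes of an HTML document."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data)

def strip_markup(html: str) -> str:
    p = TextOnly()
    p.feed(html)
    return " ".join(" ".join(p.chunks).split())

def run_epoch(urls, fetch, train_on):
    """One training epoch: fetch -> transform -> ingest, no storage."""
    for url in urls:
        train_on(strip_markup(fetch(url)))  # raw text only; source forgotten

corpus = []
run_epoch(
    ["https://example.com/"],
    fetch=lambda url: "<html><body><p>Hello, lawn.</p></body></html>",  # stubbed
    train_on=corpus.append,
)
print(corpus)  # ['Hello, lawn.']
```

Because nothing persists between epochs, every epoch hits every site again, which matches the "these crawlers won't ever stop" observation.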

by structural

4/2/2025 at 2:11:46 PM

If normal crawlers are a light rain, AI crawlers are a hurricane. Most sites can handle some rain, but they are not built to handle hurricanes. AI crawlers can look like DDOS attacks. The worst offenders will just crawl a site as fast as possible until it goes offline.

by blakesterz

4/2/2025 at 2:14:39 PM

Yes, I understand. And sorry to hear that. But I'm trying to understand how it is related to AI. How come this is happening with AI crawlers but not with traditional web index crawlers. If the pattern is so common (which is confirmed by multiple credible sources) there must be some interesting and potentially useful explanation.

by intellectronica

4/2/2025 at 2:29:28 PM

Search engines link to websites. They want the websites up, so it's worth a little extra work to avoid harming them. LLMs seek to replace the websites.

Search engine crawlers are more mature and better written.

I suspect a lot of LLM crawler development is done under time pressure, to get things done while the investors' money is still coming in to fund it. Do stuff in a hurry, and it will be less competently done.

by graemep

4/2/2025 at 2:29:06 PM

My understanding is that everyone wants to be first in the AI race, so they throw all the rules everyone else agreed on overboard.

by TonyTrapp

4/3/2025 at 5:28:06 AM

Ordinary search indices don't contain the entire target site, while LLM-style so-called AI does consume it all. I would guess some of these crawlers are subcontractors rather than "AI" companies, i.e. they compete on having the most complete and fresh dataset you could rent for "training".

Whenever the market decides the Internet is too full of slop to be usable for "training" the one that has the most copies of the pre-"AI" Internet wins. Some of the traffic is likely "AI" "tool use", i.e. bot scraping as part of running some LLM, i.e. "AI" "research".

The big scraping bots have gone from stupid to ruthless. Previously it was irritating that some of them got stuck traversing cyclical link paths on your site or on-the-fly generated pages, now it's like your silly family blog suddenly got very popular for no good reason and it puts a lot of load on the tiny amount of hardware it's served from.

by cess11

4/3/2025 at 12:14:35 AM

Why does the author of this post assume their increase in traffic has anything to do with "AI" specifically?

by lostmsu