The proliferation of crawlers is part of the problem. They're also more aggressive and poorly behaved than typical search-engine crawlers. Some issues:
- They request every resource, vastly increasing costs compared to a normal crawler.
- Not only do they ignore robots.txt; they use it as an explicit source of additional links to mine.
- They request resources frequently (many reports of 100x per day), sometimes from bugs and sometimes to ensure they have the latest copy.
- There's no rate limiting. It's trivial to build a crawler architecture that runs at full tilt while spreading the load across millions of sites and respecting each one's limits, but they don't bother, so even if everything else were fine the traffic starts looking like a DoS attack.
- They intentionally use pools of IPs and other resources to obfuscate their identities and make themselves harder to block.
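The polite behaviors the list says these crawlers skip are not hard to implement. A minimal sketch in Python, using the standard library's `urllib.robotparser`; the `ExampleBot` user agent, the sample robots.txt, and the `HostThrottle` helper are hypothetical illustrations, not any real crawler's code:

```python
import time
from urllib.robotparser import RobotFileParser

# A sample robots.txt of the kind a well-behaved crawler should honor.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())


class HostThrottle:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_delay: float):
        self.min_delay = min_delay
        self.next_allowed: dict[str, float] = {}

    def delay_for(self, host: str) -> float:
        """Return how long to sleep before hitting `host` again."""
        now = time.monotonic()
        wait = max(0.0, self.next_allowed.get(host, now) - now)
        self.next_allowed[host] = now + wait + self.min_delay
        return wait


# Check robots.txt before fetching, instead of mining it for links.
allowed = rp.can_fetch("ExampleBot", "https://example.com/public/page")
blocked = rp.can_fetch("ExampleBot", "https://example.com/private/data")

# Respect the site's advertised crawl delay for pacing.
throttle = HostThrottle(min_delay=float(rp.crawl_delay("ExampleBot") or 1))
first_wait = throttle.delay_for("example.com")   # 0.0: first hit is free
second_wait = throttle.delay_for("example.com")  # > 0: back off before the next
```

Per-host pacing plus a robots.txt check is roughly what classic search-engine crawlers do; the complaint above is that the new wave skips both.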
How much of that is "baby's first crawler" not being written very well, and how much is actual malice? Who knows, but the net effect is huge jumps in costs to support the AI wave.
4/2/2025 at 2:25:57 PM
Speculation: if it hasn't already, a data-broker market for public-ish data will emerge too. What I mean by that is a separation of entities, where OpenAI and "legitimate" AI companies buy data from brokers of shadily scraped data, then throw them under the bus if shit hits the fan to protect the mothership. This makes sense from a corporate-risk perspective: it creates a gray-area buffer of accountability. OpenAI and Anthropic have already pleaded with the government not to take away their fair-use hall pass (by invoking the magic spell "China"), but if that doesn't work and publishers win, they'll need to be prepared.

At the same time, publicly and easily available quality content is a race against time. Platforms like Reddit and Xitter are already locking down with aggressive anti-bot measures and fingerprinting, and the cottage industries are following. Meanwhile, public data is being polluted by content farms churning out garbage at an increasing rate using AI.
Together this creates a perfect storm of bad incentives: (1) the data hoarders are no longer just Google and Microsoft, but probably thousands of smaller entities, and (2) they're short on time, so they scrape more invasively and at a faster rate.
by klabb3