3/21/2026 at 1:33:53 PM
As a site operator who has been battling an influx of extremely aggressive AI crawlers, I'm now wondering if my tactics have accidentally blocked the Internet Archive. I am totally OK with them scraping my site, and they would likely obey robots.txt, but these days even Facebook ignores it and exceeds my stipulated crawl delay by distributing their traffic across many IPs. (I even have a special nginx rule just for Facebook.)

Blocking certain JA3 hashes has so far been the most effective countermeasure. However, I wish there were an nginx wrapper around huginn-net that could help me do TCP fingerprinting as well, but I do not know Rust and feel terrified of asking an LLM to make it. There is also a race-condition issue with that approach: since the fingerprinting is passive, even the JA4 hashes won't be available for the first connection, and the AI crawlers I've seen make one request per IP, so you never get a chance to block the second request (it never happens).
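For reference, a rough sketch of the kind of per-crawler nginx rule described above (the zone name, rate, and backend address are illustrative, not my actual config; nginx only counts requests whose `limit_req_zone` key is non-empty, so ordinary visitors are unaffected):

```nginx
# Throttle Facebook's crawlers to ~1 request/second *in aggregate*,
# by keying the limit_req zone on the matched user agent instead of
# $binary_remote_addr -- so spreading traffic across IPs doesn't help.
map $http_user_agent $fb_crawler {
    default                 "";
    ~*facebookexternalhit   "fb";
    ~*meta-externalagent    "fb";
}

limit_req_zone $fb_crawler zone=fbzone:1m rate=1r/s;

server {
    listen 80;
    server_name example.com;

    location / {
        # Requests with an empty $fb_crawler key are not counted.
        limit_req zone=fbzone burst=5 nodelay;
        limit_req_status 429;
        proxy_pass http://127.0.0.1:8080;
    }
}
```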
by VladVladikoff
3/21/2026 at 2:25:13 PM
> they would likely obey robots.txt

If only... Despite providing a useful service, they are not as nice towards site owners as one would hope.
Internet Archive says:
> We see the future of web archiving relying less on robots.txt file declarations geared toward search engines
https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...
They are not alone in that. The "Archiveteam", a different organization, not to be confused with archive.org, also doesn't respect robots.txt according to their wiki: https://wiki.archiveteam.org/index.php?title=Robots.txt
I think it is safe to say that there is little consideration for site owners from the largest archiving organizations today. Whether there should be is a different debate.
by danrl
3/21/2026 at 5:42:35 PM
It seems like the general problem is that the original common usage of robots.txt was to identify the parts of a site that would lead a recursive crawler into an infinite forest of dynamically generated links, which nobody wants. But it's increasingly being used to disallow the fixed content of the site, which is the thing archivers are trying to archive, and which shouldn't be a problem for the site when the bot caches the result so it only ever downloads it once. And more sites doing the latter makes it hard for anyone to distinguish it from the former, which is bad for everyone.

> The "Archiveteam", a different organization, not to be confused with archive.org, also doesn't respect robots.txt according to their wiki
"Archiveteam" exists in a different context. Their usual purpose is to get a copy of something quickly because it's expected to go offline soon. This both (a) makes it irrelevant for ordinary sites in ordinary times, and (b) gives the ones about to shut down an obvious thing to do: just give them a better, more efficient way to make a full archive of the site you're about to shut down.
by AnthonyMouse
3/21/2026 at 7:03:52 PM
[dead]
by devnotes77
3/21/2026 at 5:23:44 PM
What an absolutely insufferable explanation from ArchiveTeam. What else do you expect from an organization that aggressively crawls websites and brings them to their knees because they couldn't care less?
by sunaookami
3/21/2026 at 10:10:12 PM
ArchiveTeam (which is not the Internet Archive) aggressively crawls websites because they care a lot: the website in question is about to go away. Heck, as caring goes, I'd say ArchiveTeam cares more than the owners of the website, because in the ideal shutdown the owners provide the data instead of forcing people to scrape it if they want to retain it after the site shuts down.
by wlonkly
3/22/2026 at 1:53:30 PM
They also crawl aggressively when the site is not in danger. They crawled my MediaWiki because someone else fed the site into their bot, and it overloaded the PHP process. I know that archiving is important, but please, not like this.
by sunaookami
3/21/2026 at 5:31:13 PM
I'm curious to hear examples of where this has happened, because ArchiveTeam also plays an important role in rescuing cultural artefacts that have been taken into private hands and then negligently destroyed.
by rossng
3/21/2026 at 7:21:18 PM
Having a laudable goal doesn't absolve them from bad behavior.
by tredre3
3/22/2026 at 2:03:20 AM
It's a good reason not to worry about hypothetical bad behavior and to wait for evidence of real bad behavior.
by Dylan16807
3/22/2026 at 1:36:55 AM
ArchiveTeam definitely do not intend to kill websites with too-fast crawling, but they have definitely done so unintentionally, and they always stop or slow the crawling when it happens.

Even the distributed crawling system has monitoring and controls to ensure it doesn't kill sites.
by pabs3
3/21/2026 at 8:09:29 PM
That page was written by Jason Scott in 2011 and has barely been changed since then.
by tech234a
3/21/2026 at 11:14:12 PM
Why mess with perfection?
by textfiles
3/21/2026 at 1:54:57 PM
Evasion techniques like JA3 randomization or impersonation can bypass detection.
by mycall
3/21/2026 at 6:47:53 PM
I am aware; fortunately I haven't seen much of this... yet. JA4 is supposed to be a bit less vulnerable to this, and it's also why I really want TCP and HTTP fingerprinting. The best I've found so far is https://github.com/biandratti/huginn-net, but it is only available as a Rust library, and I really need it as an nginx module. I've been tempted to try to vibe-code an nginx module that wraps this library.
by VladVladikoff
3/21/2026 at 4:06:16 PM
[dead]
by noads2000
3/21/2026 at 1:57:42 PM
I wonder if it would be practical to have bot-blocking measures that can be bypassed with a signature from a set of whitelisted keys... In this case the server would be happy to allow Internet Archive crawlers.
by andrepd
3/21/2026 at 2:02:03 PM
That's an interesting idea. mTLS could probably be used for this pretty easily. It would require IA to support it, of course, but it could be a nice solution. I wonder, do they already support it? I might throw up a test...
by freedomben
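A rough sketch of what that mTLS gate might look like on the nginx side (the certificate paths and CA bundle are illustrative; a real setup would need the Internet Archive, or any whitelisted crawler, to actually present a client certificate signed by a CA you trust):

```nginx
server {
    listen 443 ssl;
    server_name example.com;

    ssl_certificate     /etc/nginx/tls/site.crt;
    ssl_certificate_key /etc/nginx/tls/site.key;

    # CA that signed the whitelisted crawlers' client certificates.
    ssl_client_certificate /etc/nginx/tls/trusted-crawlers-ca.crt;
    # "optional" so ordinary visitors without a cert still get through.
    ssl_verify_client optional;

    location / {
        # $ssl_client_verify is "SUCCESS" only when a valid whitelisted
        # certificate was presented; later bot-blocking rules could
        # consult $trusted_bot and skip themselves.
        set $trusted_bot 0;
        if ($ssl_client_verify = SUCCESS) {
            set $trusted_bot 1;
        }
        proxy_pass http://127.0.0.1:8080;
    }
}
```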