3/21/2026 at 1:33:53 PM
As a site operator who has been battling an influx of extremely aggressive AI crawlers, I'm now wondering if my tactics have accidentally blocked the Internet Archive. I am totally OK with them scraping my site, and they would likely obey robots.txt, but these days even Facebook ignores it and exceeds my stipulated crawl delay by distributing their traffic across many IPs. (I even have a special nginx rule just for Facebook.)

Blocking certain JA3 hashes has so far been the most effective countermeasure. However, I wish there were an nginx wrapper around huginn-net that could help me do TCP fingerprinting as well, but I do not know Rust and feel terrified of asking an LLM to make it. There is also a race-condition issue with that approach: since the fingerprinting is passive, even the JA4 hashes won't be available for the first connection, and the AI crawlers I've seen make one request per IP, so you never get a chance to block the second request (it never happens).
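For reference, a rough sketch of the kind of per-crawler nginx rule described above (the zone name, rate, and backend address are illustrative, not my actual config; nginx only counts requests whose `limit_req_zone` key is non-empty, so ordinary visitors are unaffected):

```nginx
# Throttle Facebook's crawlers to ~1 request/second *in aggregate*,
# by keying the limit_req zone on the matched user agent instead of
# $binary_remote_addr -- so spreading traffic across IPs doesn't help.
map $http_user_agent $fb_crawler {
    default                 "";
    ~*facebookexternalhit   "fb";
    ~*meta-externalagent    "fb";
}

limit_req_zone $fb_crawler zone=fbzone:1m rate=1r/s;

server {
    listen 80;
    server_name example.com;

    location / {
        # Requests with an empty $fb_crawler key are not counted.
        limit_req zone=fbzone burst=5 nodelay;
        limit_req_status 429;
        proxy_pass http://127.0.0.1:8080;
    }
}
```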
by VladVladikoff
3/21/2026 at 2:25:13 PM
> they would likely obey robots.txt

If only... Despite providing a useful service, they are not as nice towards site owners as one would hope.
Internet Archive says:
> We see the future of web archiving relying less on robots.txt file declarations geared toward search engines
https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...
They are not alone in that. The "Archiveteam", a different organization, not to be confused with archive.org, also doesn't respect robots.txt according to their wiki: https://wiki.archiveteam.org/index.php?title=Robots.txt
I think it is safe to say that there is little consideration for site owners from the largest archiving organizations today. Whether there should be is a different debate.
by danrl
3/21/2026 at 5:42:35 PM
It seems like the general problem is that the original common usage of robots.txt was to identify the parts of a site that would lead a recursive crawler into an infinite forest of dynamically generated links, which nobody wants. But it's increasingly being used to disallow the fixed content of the site, which is the thing archivers are trying to archive, and which shouldn't be a problem for the site when the bot caches the result so it only ever downloads it once. And more sites doing the latter makes it hard for anyone to distinguish it from the former, which is bad for everyone.

> The "Archiveteam", a different organization, not to be confused with archive.org, also doesn't respect robots.txt according to their wiki
"Archiveteam" exists in a different context. Their usual purpose is to get a copy of something quickly because it's expected to go offline soon. This both (a) makes it irrelevant for ordinary sites in ordinary times, and (b) gives the ones about to shut down an obvious thing to do: just give them a better, more efficient way to make a full archive of the site you're about to shut down.
by AnthonyMouse
3/21/2026 at 7:03:52 PM
[dead]
by devnotes77
3/21/2026 at 5:23:44 PM
What an absolutely insufferable explanation from ArchiveTeam. What else do you expect from an organization that aggressively crawls websites and brings them to their knees because they couldn't care less?
by sunaookami
3/21/2026 at 10:10:12 PM
ArchiveTeam (which is not the Internet Archive) aggressively crawls websites because they care a lot: the website in question is about to go away. Heck, as caring goes, I'd say ArchiveTeam cares more than the owners of the website, because in the ideal shutdown the owners provide the data instead of forcing people to scrape it if they want to retain it after the site shuts down.
by wlonkly
3/22/2026 at 1:53:30 PM
They also crawl aggressively when the site is not in danger. They crawled my MediaWiki because someone else fed the site into their bot, and it overloaded the PHP process. I know that archiving is important, but please, not like this.
by sunaookami
3/21/2026 at 5:31:13 PM
I'm curious to hear examples of where this has happened, because ArchiveTeam also plays an important role in rescuing cultural artefacts that have been taken into private hands and then negligently destroyed.
by rossng
3/21/2026 at 7:21:18 PM
Having a laudable goal doesn't absolve them from bad behavior.
by tredre3
3/22/2026 at 2:03:20 AM
It's a good reason not to worry about hypothetical bad behavior and to wait for evidence of real bad behavior.
by Dylan16807
3/22/2026 at 1:36:55 AM
ArchiveTeam definitely do not intend to kill websites with too-fast crawling, but they have definitely done so unintentionally, and they always stop or slow the crawling when it happens.

Even the distributed crawling system has monitoring and controls to ensure it doesn't kill sites.
by pabs3
3/21/2026 at 8:09:29 PM
That page was written by Jason Scott in 2011 and has barely been changed since then.
by tech234a
3/21/2026 at 11:14:12 PM
Why mess with perfection?
by textfiles
3/21/2026 at 1:54:57 PM
Evasion techniques like JA3 randomization or impersonation can bypass detection.
by mycall
3/21/2026 at 6:47:53 PM
I am aware; fortunately I haven't seen much of this... yet. JA4 is supposed to be a bit less vulnerable to this, and it's also why I really want TCP and HTTP fingerprinting. The best I've found so far is https://github.com/biandratti/huginn-net, but it is only available as a Rust library, and I really need it as an nginx module. I've been tempted to try to vibe-code an nginx module that wraps this library.
by VladVladikoff
3/21/2026 at 4:06:16 PM
[dead]
by noads2000
3/21/2026 at 1:57:42 PM
I wonder if it would be practical to have bot-blocking measures that can be bypassed with a signature from a set of whitelisted keys... In this case the server would be happy to allow Internet Archive crawlers.
by andrepd
3/21/2026 at 2:02:03 PM
That's an interesting idea. mTLS could probably be used for this pretty easily. It would require IA to support it, of course, but it could be a nice solution. I wonder, do they already support it? I might throw up a test...
by freedomben
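A rough sketch of what that mTLS gate might look like on the nginx side (the certificate paths and CA bundle are illustrative; a real setup would need the Internet Archive, or any whitelisted crawler, to actually present a client certificate signed by a CA you trust):

```nginx
server {
    listen 443 ssl;
    server_name example.com;

    ssl_certificate     /etc/nginx/tls/site.crt;
    ssl_certificate_key /etc/nginx/tls/site.key;

    # CA that signed the whitelisted crawlers' client certificates.
    ssl_client_certificate /etc/nginx/tls/trusted-crawlers-ca.crt;
    # "optional" so ordinary visitors without a cert still get through.
    ssl_verify_client optional;

    location / {
        # $ssl_client_verify is "SUCCESS" only when a valid whitelisted
        # certificate was presented; later bot-blocking rules could
        # consult $trusted_bot and skip themselves.
        set $trusted_bot 0;
        if ($ssl_client_verify = SUCCESS) {
            set $trusted_bot 1;
        }
        proxy_pass http://127.0.0.1:8080;
    }
}
```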