Proof-of-work to protect lore.kernel.org and git.kernel.org against AI crawlers

4/2/2025 at 10:24:59 PM

>Difficulty is set at 4 leading zeroes, unless you're coming from US in which case there's also a tariff of 5 more leading zeroes.

isn't linux afraid of retaliatory tariffs? should I stock up on linuxes just in case? I've already beefed up toilet paper reserves.

by cowboylowrez

4/2/2025 at 10:30:28 PM

If they go overboard people will start switching to FreeTrade-BSD

by perihelions

4/2/2025 at 11:16:56 PM

If tariffs are imposed on Antarctica, penguin prices will go up. This will in turn raise Linux prices because penguins are crucial mascot components in all Linux systems.

by BrenBarn

4/3/2025 at 6:09:53 AM

I haven't heard penguins being deported by ICE or having their visas revoked, so I assume that at least the penguin tourism industry is still doing well in the US?

Of course, they're not allowed to work as mascots while touristing. wink wink

by flowerthoughts

4/2/2025 at 10:45:16 PM

Sanctioning Linux, due to its totalitarian government (bdfl)

by lionkor

4/2/2025 at 10:25:36 PM

Any chance there's some way, going forwards, to dual-purpose these webserver PoWs, so they solve some socially beneficial compute problem at the same time? I recall reading ideas like that in the early days of cryptocurrency, before humans ruined it.

- Server: here's a bit of a cancer protein

- Client: okay, here's some compute

- Verifier: the compute checks out

- Server: okay, you are authorized to access cat.gif

by perihelions

4/2/2025 at 10:32:31 PM

It's difficult to break important problems down into NP-hard problems. Search problems are, afaik, the current state-of-the-art; but to my knowledge "a bit of a cancer protein" isn't useful, and "an entire cancer protein" would take a few hours at least.

by wizzwizz4

4/2/2025 at 10:47:09 PM

Very true. Perhaps a better system would be to credit "points" for solving fewer/larger problems that could be spent a bit at a time? That sounds even more complex than charging regular money though.

by xnx

4/2/2025 at 11:01:55 PM

You could do something like a captcha, but for computers. Here are some molecules; find the ground-state geometry of all of them. You include some you have already solved, just to root out whether the solver is actually solving them or faking it.
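
A minimal sketch of that spot-check idea in Python. Everything here is hypothetical: `build_batch` and `accept` are made-up names, and squaring an integer stands in for the real, expensive geometry calculation.

```python
import random

def build_batch(unknown, known, rng):
    """Mix unsolved tasks with planted tasks whose answers are already known.

    Each entry is (task, expected); expected is None for unsolved tasks.
    """
    batch = [(task, None) for task in unknown] + list(known.items())
    rng.shuffle(batch)  # the solver must not be able to tell which are planted
    return batch

def accept(batch, answers, tolerance=1e-6):
    """Accept the submission only if every planted task was answered correctly."""
    for (task, expected), answer in zip(batch, answers):
        if expected is not None and abs(answer - expected) > tolerance:
            return False
    return True

# Stand-in workload: the "answer" to task t is t squared (an assumption,
# chosen only so the example runs instantly).
rng = random.Random(0)
known = {2: 4.0, 3: 9.0}
batch = build_batch(unknown=[5, 7], known=known, rng=rng)

honest = [float(task * task) for task, _ in batch]
faked = [0.0 for _ in batch]
print(accept(batch, honest), accept(batch, faked))
```

The more planted tasks per batch, the harder it is to fake, but the less useful work each batch gets done.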

by xmcqdpt2

4/2/2025 at 11:23:11 PM

See https://news.ycombinator.com/item?id=43556521

wherein a hosting company sees AI bot scans that appear to be coming from millions of unique addresses, thousands of ASNs, many residential and often with a single connection from an IP. The AI bots are proxying through either hacked IoT devices or apps that pay people pennies to let their phone be used as a proxy.

Likely your proof of work will be distributed to the proxies. It'll just make millions of webcams and phones run a little hotter without slowing down the AI bots at all.

by notherhack

4/3/2025 at 1:38:03 PM

I don't know about the implementation details, but there are a lot of cryptocurrency proof-of-work algorithms that require a lot of memory access, like Monero's RandomX. Those can't realistically be run on most underpowered devices.

by throwCPjuh9kR

4/3/2025 at 7:43:13 PM

Proof of work still takes time. It will slow each individual requester down.

Meaning for the same request rate they would need more hosts, which costs them more money.

by MBCook

4/2/2025 at 10:31:04 PM

It appears they're using https://github.com/TecharoHQ/anubis for the proof of work proxy

by unsnap_biceps

4/2/2025 at 10:55:33 PM

I enjoyed their succinct project description:

> Weighs the soul of incoming HTTP requests using proof-of-work to stop AI crawlers

by stevenhuang

4/2/2025 at 11:01:36 PM

oh that's fantastic

by Havoc

4/2/2025 at 10:50:31 PM

This is absolutely surreal to see in action! I hope that I can manage to afford to not have to do my dayjob anymore.

by xena

4/2/2025 at 10:54:14 PM

Context for others: Xe is the author of the software used for this (https://anubis.techaro.lol/docs/)

by dharmab

4/3/2025 at 10:13:43 AM

Is it possible to resolve the PoW "manually" (i.e. Without browser JS execution) for personal use?

Like picking up the problem from the HTTP headers and returning it as a follow-up query.

Or is running a browser part of the PoW essentially?

by tym0

4/3/2025 at 12:56:40 PM

Yes, but I'm not going to implement that to avoid the implementation "leaking" and ossifying the current transitionary hack.

by xena

4/3/2025 at 1:09:06 PM

Cool, thanks. Will keep an eye on it.

I've not been able to read your blog with my personal news reader so I was hoping to implement that.

by tym0

4/2/2025 at 11:48:45 PM

I'm a bit skeptical that this will do the trick. These PoW challenges can be parallelized across different websites and may not be as off-putting as intended. Here's some quick back-of-the-napkin math:

DeepMind's MassiveText dataset was sourced from ~2.35B documents. A difficulty of 4 leading zeros requires an expected 16^4 SHA-256 hashes per site. Benchmarks [1] show an H100 at ~12k MH/s, meaning it would take just ~3.5 hours to solve for all 2.35B pages.

[1] https://gist.github.com/Chick3nman/e1417339accfbb0b040bcd0a0...
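
The estimate can be re-derived in a few lines of Python. The figures are the ones quoted above; the function name is made up for illustration, and the calculation assumes one challenge per document and a single H100 running at the benchmarked rate.

```python
def hours_to_solve(documents: float, hex_zeroes: int, hashes_per_second: float) -> float:
    """Expected wall-clock hours to solve one challenge per document."""
    expected_hashes_per_challenge = 16 ** hex_zeroes  # each hex digit is 0 with probability 1/16
    total_hashes = documents * expected_hashes_per_challenge
    return total_hashes / hashes_per_second / 3600

hours = hours_to_solve(
    documents=2.35e9,          # MassiveText document count quoted above
    hex_zeroes=4,              # difficulty from the announcement
    hashes_per_second=12_000e6,  # ~12k MH/s SHA-256 on an H100
)
print(f"{hours:.2f} hours")
```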

by xxprogamerxy

4/3/2025 at 12:01:46 AM

SHA-256 is a hack to buy time. This will be replaced with something better. It would be faster for me to replace it if I didn't have to do my dayjob: https://patreon.com/cadey

by xena

4/3/2025 at 12:16:57 AM

Not meant as a criticism of the project in general. I appreciate people working on this.

I'm curious, what other approaches are you currently considering? In my mind, all roads lead to rate-limiting identifiers with privacy through zk-proofs.

by xxprogamerxy

4/3/2025 at 12:18:17 AM

I'm looking at using equi-x, but failing that I may unironically do protein folding.

by xena

4/2/2025 at 10:17:38 PM

I am really enjoying seeing this use-case for PoW gain popularity. Hopefully it normalizes the technique and it can start to become more common for anti-spam systems.

by skeptrune

4/2/2025 at 11:10:24 PM

https://blog.torproject.org/introducing-proof-of-work-defens...

Tor has similarly been using Proof of Work as part of the defense for onion services for around a year and a half now.

I’ve also seen some clear net web sites that use PoW to slow down account creation. Some websites will even adjust the difficulty for individual visitors depending on the recent number of sign-ups coming from their IP block. More signups from an IP block -> higher PoW difficulty for anyone from that IP block -> fewer accounts created by anyone in that IP block over a span of time.

by codetrotter

4/2/2025 at 10:35:45 PM

Why do you assume that spammers and AI crawlers do not have access to large amounts of compute? You can make it more expensive, but these crawlers already have made it clear that they do not care particularly about cost (or they would not crawl so completely indiscriminately).

by Sesse__

4/2/2025 at 10:41:40 PM

um no? sending an http request is quite a bit different than some forced pow calculation

by arccy

4/3/2025 at 8:44:02 AM

Why? Don't you think these companies can use Puppeteer or similar and just take that second or so of compute to get a cookie for lore.kernel.org?

by Sesse__

4/2/2025 at 10:37:52 PM

Spammers are willing to dedicate more processing power than regular users. It doesn't make sense to do. It's either meaningless or ruins the user experience for normal people.

by charcircuit

4/2/2025 at 11:11:23 PM

Regular users aren't trying to load billions of pages.

by losvedir

4/2/2025 at 11:43:30 PM

What's your point?

by charcircuit

4/3/2025 at 7:57:49 AM

When the compute power to access a page has just doubled, when you view 10 pages it's not a problem, but when you view 1 billion of them, it is.

by mariusor

4/4/2025 at 6:03:13 AM

The compute power is not going to be a problem even at a billion pages. Let's say that a challenge targeted towards a cheap phone can be solved by a workstation CPU at 1000 challenges per second. And let's say each challenge lets you visit 100 pages. So for a billion pages you need only a few hours of one machine's compute. If you try to make the amount of compute a problem, then it will become a problem for regular users of the site and it will drive away regular people.
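
Those numbers work out as follows. This is a sketch with a made-up function name; the `extra_hex_zeroes` knob assumes difficulty is counted in leading hex zeroes, so each added zero multiplies the expected work by 16 for crawler and visitor alike.

```python
def crawl_hours(pages: int, pages_per_challenge: int,
                challenges_per_second: float, extra_hex_zeroes: int = 0) -> float:
    """Machine-hours needed to earn access to `pages` pages."""
    challenges = pages / pages_per_challenge
    slowdown = 16 ** extra_hex_zeroes  # harder challenges take proportionally longer
    return challenges * slowdown / challenges_per_second / 3600

baseline = crawl_hours(1_000_000_000, 100, 1000)
harder = crawl_hours(1_000_000_000, 100, 1000, extra_hex_zeroes=1)
print(f"baseline: {baseline:.1f} h, one more zero: {harder:.1f} h")
```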

by charcircuit

4/4/2025 at 10:05:56 AM

You seem to be very confident, but from what I've seen online, most people who employed this type of countermeasure really did see a drop in requests. I can't find the graphs I've seen about this at the moment, but I'll update if I do.

by mariusor

4/4/2025 at 2:47:29 PM

It only protects against people who aren't specifically targeting you. I am confident because I've had people target a site of mine.

by charcircuit

4/4/2025 at 6:00:33 PM

Then I think you're talking about a different problem from the one that Konstantin, Wikipedia, SourceHut, Codeberg, and everyone else have been complaining about lately, which is the subject under discussion.

by mariusor

4/5/2025 at 12:33:10 AM

It will buy time until the next generation of crawlers is created.

by charcircuit

4/2/2025 at 10:25:11 PM

This is infinitely better than using CloudFlare. I hope it works and more people adopt it.

by chr15m

4/3/2025 at 3:01:12 AM

This does not help against real DDoS attacks (that don't even speak HTTP most of the time) or full-browser headless bots, besides warming the planet more. It also only looks at Mozilla user agents (even though one of the reasons given for its development was bots changing user agents), so it's extremely easy to bypass. But solutions like CF's or similar are better tailored for anti-DDoS purposes where the threat is from massive amounts of bandwidth, not well-behaved AI crawler bots clogging up your logs.

And if your argument is that it helps DDoS by being a frontend proxy, well, you still need more bandwidth than the DDoS uses, in which case you could do this with a simple "click here" page just as easily.

But please prove me wrong if I've misunderstood something.

by ranger_danger

4/2/2025 at 10:30:06 PM

Genuine question: How? Is there a downside to CloudFlare I'm not aware of?

by ToucanLoucan

4/2/2025 at 10:36:32 PM

Cloudflare will just straight up block me sometimes, with no way to see the page. For whatever reason this used to happen to me a lot with car dealer websites. Maybe checking lots of different dealerships' inventory looking for a specific car made me look like a bot.

And even in cases where Cloudflare forces a captcha, this POW ran much more quickly than I could solve one by hand

by Rebelgecko

4/2/2025 at 10:40:57 PM

It was nearly instant on my shitty old phone.

by nosioptar

4/2/2025 at 10:38:41 PM

Blocking me from contributing to any gitlab hosted project for ~4 years already. I wanted to send a glib2 patch today, and again realized that, no, I still can't sign up to CF-protected gitlab instances. :)

Makes me appreciate the Linux kernel's mailing-list-based contribution method. Very open, very simple.

At this point I guess CF will never fix compatibility bugs in their interstitial pages, and in captcha, with a non-default setup of Firefox.

by megous

4/2/2025 at 10:49:37 PM

For what it's worth, ensuring that JIT is enabled for challenges.cloudflare.com can help a lot.

No, not to the point of making it bearable, but at least it becomes rarer for it to take minutes.

by g-b-r

4/2/2025 at 10:53:18 PM

It routinely takes at least a minute overall on gitlab, from a budget phone.

Other sites with Cloudflare only take some nice twenty seconds, others just never ever let you go through.

Those checks are a serious contender for the worst thing ever to happen to the web.

by g-b-r

4/3/2025 at 7:35:54 PM

> Those checks are a serious contender for the worst thing ever to happen to the web.

I couldn't agree more. And not only Cloudflare but also Google is increasingly imposing extremely heavy CAPTCHAs, which destroy the experience for many users. They should really think more about false positives too. (And the irony of course is that Google is itself the biggest scraper.) So please come up with something other than CAPTCHAs or heavy JavaScript blobs in general.

by jruohonen

4/3/2025 at 2:29:04 AM

Besides the downsides mentioned by others, Cloudflare heavily punishes anyone using a browser that isn't Chrome, especially if it is something other than Chrome/Safari/Edge/Firefox.

by thayne

4/3/2025 at 1:03:18 AM

If you're not aware of the downsides I don't have time to explain them to you. If you genuinely want to know, 5 minutes research will give you answers.

by chr15m

4/2/2025 at 10:30:46 PM

> Difficulty is set at 4 leading zeroes, unless you're coming from US in which case there's also a tariff of 5 more leading zeroes.

> You can see it in action on this recently decommissioned system I'm using for testing purposes: https://ams.source.kernel.org/

Something seriously wrong with it. When I run it with my normal German/EU home connection, it does ~17k iterations. When I run it with a US Atlanta VPN, it only takes ~6k iterations.

by sva_

4/2/2025 at 10:51:27 PM

It's luck-based, I'm working on making a check that's more deterministic, but I'm also trying to figure out how to not lock out big-endian systems in the process.

I may have to just give up on that though :(
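
For anyone wondering why the iteration counts differ so much between runs, here is a minimal Python sketch of a leading-zeroes SHA-256 search. It assumes the challenge string and a counter are simply concatenated; the real Anubis challenge format isn't described in this thread.

```python
import hashlib
from itertools import count

def solve(challenge: str, difficulty: int = 4) -> tuple[int, str]:
    """Return the first nonce whose SHA-256 hex digest starts with `difficulty` zeroes."""
    prefix = "0" * difficulty
    for nonce in count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(prefix):
            return nonce, digest

# Every hash is an independent 1-in-16**difficulty lottery ticket, so the
# number of attempts varies a lot from one challenge to the next.
for i in range(5):
    nonce, _ = solve(f"example-challenge-{i}", difficulty=3)
    print(f"challenge {i}: solved after {nonce + 1} hashes")
```

The expected number of hashes is 16**difficulty, but any single run can land well above or below that.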

by xena

4/2/2025 at 10:37:36 PM

I think that part was a joke

by Rebelgecko

4/2/2025 at 10:46:22 PM

I think OP, like me, wishes it wasn't (it would be very funny)

by lionkor

4/3/2025 at 2:34:32 AM

Not for those of us who live in the US. It would basically lock out real people in the US, while doing nothing to block bots, which could just use a different source ip.

by thayne

4/2/2025 at 11:45:32 PM

I think 1 or 2 extra 0s would be funny but 5 seems excessive

by Rebelgecko

4/3/2025 at 2:50:13 AM

Ah yes, 100x worse for US users is amazing.

by sadeshmukh

4/2/2025 at 10:52:01 PM

Maybe I'm missing something, but why do people expect PoW to be effective against companies whose whole existence revolves around acquiring more compute?

by sakras

4/2/2025 at 10:54:40 PM

I was under the impression that the bad crawlers exist because it's cheaper to reload the data all the time than to cache it somewhere. If this changes the cost balance, those companies might decide to download only once instead of over and over again, which would probably be satisfactory to everyone.

by xmcqdpt2

4/2/2025 at 10:58:36 PM

So, market/companies refused to regulate themselves (by adhering to the robots.txt) so we're now forced to innovate some solutions against them.

by kklisura

4/2/2025 at 10:40:56 PM

I think these solutions are really novel and interesting but I'd like to point out that this is literally one of the use cases for cryptocurrency, or microtransactions in general. Cryptocurrencies, at least the PoW ones, offload the proof-of-work so that it doesn't need to be done in real time.

Paying fractions of a penny to view websites has minimal impact on average users but is punishing to spammers.

by abetusk

4/2/2025 at 10:48:14 PM

This is one of the use-cases of proof-of-work, but the rest of what makes something cryptocurrency isn't necessary. There is no need for a blockchain here, and the cost can be paid directly in compute time.

by solid_fuel

4/2/2025 at 10:58:51 PM

Again, cryptocurrency would allow a proof-of-work mechanism but offload it so it needn't be real-time.

That is, do the proof-of-work before visiting the website, then present the currency token that proves the work has been done. The blockchain is there to prevent double spending of the currency token.

I do feel like this is a "those who don't understand it are doomed to re-invent it" kind of technology.

by abetusk

4/3/2025 at 5:28:41 PM

No, I understand. I WANT it to be real time. I don’t want it to be offloaded.

You don’t need to worry about double spending or any of the privacy issues that come with a blockchain when the work is done real time. I know exactly how much work is being done on the server per request with this solution - one hash. There is no need to crawl through a block chain, or submit a transaction, or wait for an external system to catch up.

I want a solution to these crawling bots, not a distributed database. So why would I reach for a distributed database when this simple proof of work system works better, is more understandable, and doesn’t have external dependencies?
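
To make the "one hash" point concrete, here is a sketch of the server-side check in Python. The concatenated challenge-plus-nonce format and the function name are assumptions for illustration, not Anubis's actual protocol.

```python
import hashlib
from itertools import count

def verify(challenge: str, nonce: int, difficulty: int = 4) -> bool:
    """Server side: a single SHA-256 call, no matter how long the client searched."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)

# Client side, for contrast: keep trying nonces until one verifies.
challenge = "example-challenge"
nonce = next(n for n in count() if verify(challenge, n, difficulty=3))
print(f"client tried {nonce + 1} nonces; server re-checks with 1 hash")
print(verify(challenge, nonce, difficulty=3))
```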

by solid_fuel

4/13/2025 at 12:29:38 AM

There's a sentiment here that I think is valid and I'm struggling to tease out the best version of it, along with providing a valid response.

I'll mention that, in an extreme case, one could provide a proof-of-work challenge that's directly tied to some cryptocurrency mining effort, so that the work/energy expenditure is captured by the one providing the challenge. This has the effect of the visitor who solves the challenge giving money to the one providing it, just in a real-time proof-of-work way. Since money is effectively being exchanged anyway, this punishes systems that don't allow up-front payments to do away with the page load delay.

by abetusk

4/2/2025 at 10:47:28 PM

The problem is, it kills anonymity - it allows at the very least the government to tie each web page visit, each resource load, back to a real person.

And no, "anonymous mixer" services don't work either. They're yet another layer of useless profiteering middlemen, which the web already has more than enough of.

by mschuster91

4/2/2025 at 10:49:53 PM

What about Monero/XMR? Isn't it fully anonymous by design?

by pitaj

4/2/2025 at 11:06:39 PM

Tor used to make the same claims; it turned out the NSA can, to a degree, correlate traffic, given that they have their eyes literally everywhere on the planet.

I wouldn't trust anything to be safe from the dragnet surveillance apparatus of FVEY.

by mschuster91

4/3/2025 at 1:59:21 AM

That seems to be a much weaker claim than "allows at the very least the government to tie each web page visit, each resource load, back to a real person"

by pitaj

4/3/2025 at 12:57:54 PM

That was related to payment via Monero/XMR/...

The problem is, if you want some "proof of money"/"proof of stake", either site operators set that up on their own, which is a ton of work, and people will not want to set up payments for their favourite porn site AND the government can trace people from payments back to site visits; or site operators contract a major vendor (similar to Stripe, PayPal, ...) who handles it and can then trivially be subpoenaed for records.

by mschuster91

4/2/2025 at 10:51:06 PM

> Paying fractions of a penny to view websites has minimal impact

Although true in an ideal financial sense, it's demonstrably false because having a pay wall of any kind will severely limit usage.

by klysm

4/2/2025 at 11:16:49 PM

The fractions of a penny are in compute resources. cock.li required proof of work (10 minutes on a phone browser) to register a new mailbox before it went down.

by spelyytomat

4/2/2025 at 11:05:00 PM

Demonstrably false with a payment service like PayPal but could still be an option for some type of online payment. One can imagine a payment method that's transparent, integrated and seamless. Maybe something like limits can be set with features to allow, prompt or flat out deny websites that violate some payment policy.

My view is that when the payment threshold is so low, the issue is inconvenience or user friction, not the amount of money involved. I suspect people would be fine with small payments if the user experience was better. For example, Amazon, Netflix, iTunes or Spotify.

by abetusk

4/2/2025 at 10:48:23 PM

I'd rather just be able to click a link and not have to worry about wallet, transaction fees, keeping my keys safe, etc.

by sva_

4/3/2025 at 2:40:39 AM

But how do you solve the transaction cost problem? It doesn't make sense to pay a fraction of a cent for access when you have to pay a lot more than that in order to transfer the value.

by thayne

4/3/2025 at 2:51:05 AM

Isn't that the value proposition of crypto?

by sadeshmukh

4/2/2025 at 11:14:06 PM

Patron sites are a bit like that, but undermining effect is real - content quality shows a down trend when real Internet points become the metric, as opposed to fake ones.

by numpad0

4/2/2025 at 10:47:31 PM

I wonder how well this will actually work.

The core problem is that a lot of crawlers aren't spending their own money. They are part of a botnet, so they are just spending the victim's money.

But hopefully most of the crawlers aren't botnets or funded by free VC money so they have an economic incentive to avoid crawling systems requiring proof-of-work.

by shanemhansen

4/3/2025 at 6:50:31 AM

It's not about stopping, more about slowing down.

(numbers for presentation purposes only)

If every request goes from 10ms to 100ms, a human won't care, as they'll click like 5-6 times total while reading the output between page loads, but a bot will crawl the site 10 times slower.

by theshrike79

4/2/2025 at 11:03:03 PM

Pretty sure the AI crawlers aren't botnets

by Havoc

4/3/2025 at 2:42:42 AM

I'd bet some of them, run by less scrupulous entities, are.

by thayne

4/3/2025 at 7:28:28 AM

I thought the Anubis author doesn't want you to remove the anime girl images? I guess kernel.org is exempted. gitlab.gnome.org still has the anime girl though.

https://anubis.techaro.lol/docs/funding

by neurostimulant

4/3/2025 at 7:54:01 AM

Seeing as it's open-source under an MIT license, I would think that everyone is allowed to modify the source and do whatever they want with it.

The monetary payment for removal of the logo is just for companies that don't possess the know-how to do that and can instead pay a consulting fee.

by mariusor

4/2/2025 at 11:17:13 PM

I wonder how much traffic lore.kernel.org generates, since it's such a basic site, and how it compared before and after the crawling started.

also where is the anubis avatar, that's so disappointing not to see it

by lousken

4/2/2025 at 11:01:24 PM

Are they using white-labeled Anubis or stock Anubis?

by hooverd

4/2/2025 at 10:41:57 PM

I am not sure we want to prevent AI crawlers but rather we want the crawlers to just not negatively affect the websites.

We want AI automation everywhere and crawling is important.

by bhouston

4/2/2025 at 10:46:22 PM

The DoSes are AI-for-training-data, not AI-for-automation. AI-for-automation is for the time being going to be the same order of magnitude as standard activity.

by wnoise

4/2/2025 at 10:57:07 PM

> We want AI automation everywhere

Who is we? I definitely don’t want AI automation everywhere

by budududuroiu

4/2/2025 at 11:01:01 PM

Crawling is important, and PoW incentivizes the creation of a data broker as a cached middle layer to reduce negative effects on the website by amortizing PoW cost among crawlers.

by fritzo