2/23/2026 at 1:26:42 PM
I run a small startup called SEOJuice, where I need to crawl a lot of pages all the time, and I can say that the biggest issue with crawling is the blocking part and how much you need to invest to circumvent Cloudflare and similar, just to get access to any website. Bandwidth and storage are the smallest cost factors.
Even though, in my case, users add their own domains, it still took me quite a bit of time to reach a 99% chance of crawling a website, using a mix of residential proxies, captcha solvers, rotating user-agents, and stealth Chrome binaries. Otherwise I would get a 403 immediately with no HTML being served.
by bndr
2/23/2026 at 7:06:12 PM
I wonder if circumvention is legal. It's so odd. In the US it seems you can just do this, whereas if I started something like this in the EU, I don't think I could.
by mettamage
2/23/2026 at 7:23:09 PM
In Italy it’s a crime punishable by up to 12 years to access any protected computer system without authorization, especially if it causes a DoS to the owner.
Consider the case of self-hosting a web service on a low-performance server while abusive crawling runs in a loop fetching data (which was happening when I was self-hosting GitLab!)
https://www.brocardi.it/codice-penale/libro-secondo/titolo-x...
by fuomag9
2/23/2026 at 2:20:48 PM
Can't your users just whitelist your IPs?
by mrweasel
2/23/2026 at 4:36:19 PM
I'm in a similar boat, and getting customers to whitelist IPs is always a big ask. In the best case they call their "tech guy"; in the worst case it's a department far away and it has to go through 3 layers of review before someone adapts some Cloudflare / Akamai rules.
And then you'd better make sure your IP is stable and your cloud provider doesn't change any IP assignments in the future, or you'll have to contact all your clients again with that ask.
by dewey
2/23/2026 at 2:38:56 PM
They're mostly non-technical/marketing people, but yes, that would be a solution. I try to solve the issue "behind the scenes" so for them it "just works", but that means building all of these extra measures.
by bndr
2/23/2026 at 4:12:10 PM
Would it make sense to offer the more technically minded a discount if they set up an IP whitelist, with a tutorial you could provide? A discount in exchange for reduced costs to you?
by cassepipe
2/23/2026 at 4:30:47 PM
Blocking seems really popular. I wonder if it coincides with Stack Overflow closing.
by 0xdeadbeefbabe
2/23/2026 at 2:57:35 PM
> the biggest issue with crawling is the blocking part and how much you need to invest to circumvent Cloudflare and similar … mix of residential proxies, captcha solvers, rotating user-agents, stealth chrome binaries
I would like to register my hatred and contempt for what you do. I sincerely hope you suffer drastic consequences for your antisocial behavior.
by gilrain
2/23/2026 at 3:00:20 PM
Please elaborate: why exactly is it antisocial? Because Cloudflare decides who can or can't access a user's website, even when they specifically signed up for my service?
by bndr
2/23/2026 at 9:28:32 PM
But how does that work?
Does Cloudflare force firewall rules for those who choose to use it for their websites?
If the tool that does the crawling identifies itself properly, does Cloudflare block it even if users do not tell Cloudflare to block it?
by demetris
2/23/2026 at 3:03:36 PM
It intentionally circumvents the explicit desires of those who own the websites being exploited. It is nonconsensual. It says “fuck you, yes” to a clearly-communicated “please no”.
by gilrain
2/23/2026 at 3:26:06 PM
OP literally said that users add their domains, meaning they are explicitly ASKING OP to scrape their websites.
by joncrane
2/23/2026 at 3:04:19 PM
Users sign up for my service.
by bndr
2/23/2026 at 3:08:17 PM
You employ residential proxies. As such, you enable and exploit the ongoing destruction of the Internet commons. Enjoy the money!
by gilrain
2/23/2026 at 4:08:02 PM
This is kind of like getting upset with people who go to ATMs because drug dealers transact in cash lol.
by christoff12
2/23/2026 at 5:07:49 PM
Cloudflare and Big Tech are primary contributors to the impairment and decline of the Internet commons for moats, control, and profit; you are upset at the wrong parties.
by toomuchtodo
2/23/2026 at 8:00:44 PM
I would argue that the ability to crawl and scrape is core to the original ethos of the internet, and that all the hoops people jump through to block non-abusive scraping of content are in fact more anti-social than circumventing these mechanisms.
by prettyblocks
2/23/2026 at 3:58:07 PM
Just stop scraping. I'll do everything to block you.
by spiderfarmer
2/23/2026 at 4:09:59 PM
> in my case, users add their own domains
Seems like they're only scraping websites their clients specifically ask them to.
by ssgodderidge
2/23/2026 at 4:00:29 PM
Now you've gamified it :)
by Keyframe
2/23/2026 at 4:12:40 PM
It's a pretty easy game to win as the blocker. If you receive too many 404s against pages that don't exist, just ban the IP for a month. Actually got the idea from a Hacker News comment too. Also thinking that if you crawl too many pages you should get banned as well.
There's no point in playing tug of war against unethical actors, just ban them and be done with it.
I don't think this is an uncommon position either, nor are crawlers users I want to help in any capacity.
by shimman
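A minimal sketch of the 404-based ban described in the comment above, for illustration only; it is not the commenter's actual setup. The class name `NotFoundBanList`, its method names, and every threshold (20 misses per 60 seconds, a 30-day ban) are assumptions, since the comment gives no numbers beyond "a month". State is kept in memory and keyed on a single IP address.

```python
import time
from collections import defaultdict, deque


class NotFoundBanList:
    """Hypothetical helper: bans an IP that triggers too many 404s in a time window."""

    def __init__(self, max_404s=20, window_seconds=60.0, ban_seconds=30 * 24 * 3600):
        # All three thresholds are illustrative assumptions, not values from the thread.
        self.max_404s = max_404s
        self.window_seconds = window_seconds
        self.ban_seconds = ban_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent 404 responses
        self._banned_until = {}          # ip -> time at which the ban expires

    def is_banned(self, ip, now=None):
        now = time.time() if now is None else now
        until = self._banned_until.get(ip)
        if until is None:
            return False
        if now >= until:
            # The ban has expired, so forget it.
            del self._banned_until[ip]
            return False
        return True

    def record_response(self, ip, status, now=None):
        """Call once per served response; only 404s count toward a ban."""
        now = time.time() if now is None else now
        if status != 404 or self.is_banned(ip, now):
            return
        hits = self._hits[ip]
        hits.append(now)
        # Drop 404s that have fallen out of the sliding window.
        while hits and now - hits[0] > self.window_seconds:
            hits.popleft()
        if len(hits) > self.max_404s:
            self._banned_until[ip] = now + self.ban_seconds
            del self._hits[ip]
```

A request handler would call `is_banned(ip)` before serving and `record_response(ip, status)` afterwards. Because the state is keyed on one address, several clients behind a shared IP are banned together, and a crawler that rotates addresses spreads its 404s across many keys.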
2/23/2026 at 10:44:13 PM
If you think the game is played on a single IP address, you are not adept enough to be weighing in on this discussion.
by Klonoar
2/23/2026 at 5:24:12 PM
What if the crawler is using a shared IP and you end up blocking legitimate users along with the bad actor?
by stevewodil
2/23/2026 at 6:07:56 PM
He said "it's pretty easy", probably not realizing there are whole industries on both sides of that cat and mouse game, making it not easy.
by Keyframe