alt.hn

3/27/2025 at 1:07:01 PM

Crawl Order and Disorder

https://www.marginalia.nu/log/a_117_crawl_order/

by ingve

3/27/2025 at 1:39:05 PM

Maybe something worth exploring: run it like an inverted generational garbage collector.

Separate out the problematic domains and run them in a lower-priority background crawl; then do the opposite as well: keep a higher-priority list of sites that update regularly and crawl those more often, based on their last-update gap.

Basically like a generational garbage collector, except in this case it's a generational content collector (although, arguably, still mostly garbage).
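A rough sketch of what that generational scheduling could look like, assuming a priority queue keyed by each domain's observed update gap; all names and numbers below are illustrative, not anyone's actual implementation:

    import java.time.Duration;
    import java.time.Instant;
    import java.util.Comparator;
    import java.util.PriorityQueue;

    // Hypothetical scheduler: frequently-changing domains bubble up sooner,
    // slow or problematic domains sink into a long-backoff background tier.
    record DomainState(String domain, Instant lastCrawl, Duration observedUpdateGap, boolean problematic) {
        Instant nextDue() {
            return problematic
                    ? lastCrawl.plus(Duration.ofDays(30))   // "old generation": rarely revisited
                    : lastCrawl.plus(observedUpdateGap);    // revisit roughly as often as it changes
        }
    }

    class GenerationalCrawlQueue {
        private final PriorityQueue<DomainState> queue =
                new PriorityQueue<>(Comparator.comparing(DomainState::nextDue));

        void add(DomainState d) { queue.add(d); }

        // Returns the next domain that is due for a recrawl, or null if nothing is due yet.
        DomainState next() {
            DomainState head = queue.peek();
            return (head != null && !head.nextDue().isAfter(Instant.now())) ? queue.poll() : null;
        }

        public static void main(String[] args) {
            GenerationalCrawlQueue q = new GenerationalCrawlQueue();
            q.add(new DomainState("example.com", Instant.now().minus(Duration.ofDays(2)),
                    Duration.ofDays(1), false));
            System.out.println(q.next()); // due, since two days have passed against a one-day gap
        }
    }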

by keyle

3/27/2025 at 3:54:40 PM

Yeah, there's definitely more that can be done in this regard. Right now I have fairly limited statistics on the rate of change of the domains. There are probably improvements to be made by making more informed choices based on that sort of data.

by marginalia_nu

3/27/2025 at 7:26:52 PM

Are you running into those autogen crawler labyrinths that have been posted about recently? Granted, the .edu and .gov ones are probably just huge sites.

Am I correct in assuming you are recrawling your entire corpus to get fresh results? Would there be a downside to replacing crawl epochs with a continuous crawl that is random but with age priority?

by outer_web

3/27/2025 at 8:16:58 PM

Crawler labyrinths tend to be on paths disallowed by robots.txt, or behind nofollow links. I haven't seen any indication that they're playing much of a part in this.

Seems the bigger problem is very large domains with slow response times and long crawl delays.

> Am I correct in assuming you are recrawling your entire corpus to get fresh results?

For known links, I sample them and, based on whether I find changes (first via If-None-Match and If-Modified-Since, or alternatively via locality-sensitive hashing), I recrawl only a subset or all of the links.
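For reference, a minimal sketch of that kind of conditional request using Java's built-in HttpClient; the stored ETag and Last-Modified values are placeholders standing in for whatever validators were recorded on the previous crawl:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    class ConditionalFetch {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();

            // Validators remembered from the previous crawl of this URL (illustrative values).
            String storedEtag = "\"abc123\"";
            String storedLastModified = "Wed, 01 Jan 2025 00:00:00 GMT";

            HttpRequest request = HttpRequest.newBuilder(URI.create("https://example.com/page"))
                    .header("If-None-Match", storedEtag)
                    .header("If-Modified-Since", storedLastModified)
                    .GET()
                    .build();

            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());

            // 304 Not Modified: the sampled page hasn't changed. Anything else is a change
            // signal, optionally confirmed with a locality-sensitive hash of the body.
            boolean changed = response.statusCode() != 304;
            System.out.println("changed = " + changed);
        }
    }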

> Would there be a downside to replacing crawl epochs with a continuous crawl that is random but with age priority?

The drawbacks to this are much more mutable crawl data (being able to read or write crawl data top down in an append-only format is a huge performance improvement), as well as problems with the indexing software, which takes ~1 day to complete and can't currently be partially rebuilt, but is rebuilt from scratch every time at significant computational expense.
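As a toy illustration of that append-only property, here's a made-up length-prefixed record format that only ever appends on write and is read back in a single sequential pass (not the actual crawl data format):

    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.EOFException;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.nio.charset.StandardCharsets;

    // Toy append-only crawl log: each record is a length-prefixed blob.
    // Writes only ever append; reads are one sequential pass from the top.
    class AppendOnlyCrawlLog {
        public static void main(String[] args) throws Exception {
            try (var out = new DataOutputStream(new FileOutputStream("crawl.log", true))) {
                byte[] doc = "https://example.com/page\t<html>...</html>".getBytes(StandardCharsets.UTF_8);
                out.writeInt(doc.length);
                out.write(doc);
            }

            try (var in = new DataInputStream(new FileInputStream("crawl.log"))) {
                while (true) {
                    int len;
                    try { len = in.readInt(); } catch (EOFException e) { break; }
                    byte[] doc = in.readNBytes(len);
                    // process(doc) ...
                }
            }
        }
    }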

by marginalia_nu

3/27/2025 at 2:12:02 PM

Simplifying Systems with Elixir • Sasa Juric • YOW! 2020

https://www.youtube.com/watch?v=EDfm2fVS4Bo

A noob question: is Elixir good at crawling tasks, simplifying the system architecture in the process?

by nthingtohide

3/27/2025 at 4:14:36 PM

Elixir is one of a reasonably large number of acceptable languages.

A crawler, and especially a very large one like Marginalia, is going to implement its own queuing logic, its own retry logic, its own management of same, and so on and so forth. As a result, things like "BEAM has supervisor trees" are actually not that useful: they are nowhere near enough for this use case, need to be augmented anyhow, and can even get in the way. At most the BEAM OTP infrastructure might help you bootstrap something up somewhat more quickly, but I'd expect that single-digit weeks into the dev process it isn't really that helpful anymore. None of the BEAM-specific strengths strike me as hugely helpful here, mostly for similar reasons; they don't quite match what a crawler wants per se, and the crawler is going to reimplement them anyhow.

For a project of this scale, you also don't want to be doing the raw indexing in Elixir, as it is not a very fast language, and at this scale that adds up quickly, so you're going to pull in another language anyhow.

All in all, while I might call it "acceptable", there's a solid half-dozen languages (plus runtimes, as appropriate) that I'd put solidly in front of Elixir for this use case and another several I'd rate as roughly ties. It certainly is not the case that it offers some sort of amazing, blow-me-away advantage that makes it the only sensible answer or anything.

by jerf

3/27/2025 at 4:57:44 PM

I think the demands on concurrency are relatively basic. It's nice to have something a bit more robust than raw pthreads, but the main thing that makes or breaks a crawler is access to a robust HTTP client library and HTML parser, solid I/O performance, that sort of thing. Because the problem domain looks the way it looks, concurrency is one of the easier parts.

by marginalia_nu

3/27/2025 at 2:47:46 PM

I'm not well versed enough in Elixir to give a good answer.

The hard problems in crawling mostly come down to dealing with a very large and potentially highly mutable state. It's a concurrent problem, but a fairly easy one, and a good use case for a traditional thread pool since latency is mostly a non-issue.
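A minimal sketch of that thread-pool shape, with a fixed pool draining a list of URLs and the actual fetch/parse/store pipeline stubbed out:

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    class ThreadPoolCrawler {
        public static void main(String[] args) throws InterruptedException {
            // Illustrative work list; in practice this would come from the crawl frontier.
            List<String> urls = List.of("https://example.com/a", "https://example.com/b");

            // A plain fixed-size pool is enough: each task is dominated by blocking I/O,
            // and per-request latency doesn't matter much for overall throughput.
            ExecutorService pool = Executors.newFixedThreadPool(64);

            for (String url : urls) {
                pool.submit(() -> fetch(url));
            }

            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }

        static void fetch(String url) {
            // Placeholder for the actual HTTP fetch + parse + store pipeline.
            System.out.println("crawling " + url);
        }
    }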

by marginalia_nu

3/27/2025 at 3:13:59 PM

Based on what you said, and my assumption that Elixir (thanks to Erlang's VM) is pretty good at concurrency: yes, Elixir would be good for crawling.

by muscomposter