5/15/2026 at 4:15:36 PM
Hi! I'm one of the programmers at Gutenberg. We've been improving the site a lot over the past few months (and more is coming!). If you haven't visited the page recently, it's worth checking out again: https://www.gutenberg.org/
by JSeiko
5/15/2026 at 8:14:20 PM
Have you considered having a detailed version history for each book (etext)? The process of submitting fixes to typos etc. in books involves sending an email (https://www.gutenberg.org/help/errata.html) and although the last time I did this (2011) the fixes did get applied reasonably quickly (couple of days), it all felt a bit opaque. The version history could also include the project (usually PGDP, correct?) the etext originated from; that way one would be able to compare against the actual page scans.
I have very mixed feelings about Standard Ebooks and would much prefer being able to use Project Gutenberg directly, but one good thing Standard Ebooks does is that every book has an associated git repository (on GitHub), so it's (in principle) possible to see a history of fixes to the text over time.
by svat
5/15/2026 at 9:26:59 PM
We're using git repos internally to keep history for each book. They existed on GitHub for a while, but our implementation was awkward, and too big a project for the volunteer dev team. But it's likely that we'll evolve towards that.
by gluejar
5/16/2026 at 12:16:18 AM
> I have very mixed feelings about Standard Ebooks[…]
Why?
by marcprux
5/16/2026 at 9:10:24 AM
Not the GP, but I also have mixed feelings about Standard Ebooks. They modernise texts for American readers. This means changing the punctuation, merging some words, altering the syntax, etc.
When I read an old novel, written two centuries ago in England, the little differences to modern English are part of the charm, and I certainly don't want any Americanism mixed in. For one of my favorite novels, The Forsyte Saga, the author deliberately used some rare forms of words, which SE replaced with the mainstream forms.
by idoubtit
5/16/2026 at 2:07:15 PM
SE editor in chief here. What you describe is incorrect. The only thing we do is very light sound-alike spelling modernization, like "to-night" -> "tonight". We do not do things like change from en-GB to en-US, replace old words with different modern words, or change text for "American readers", whatever that means. I have no idea where you got that impression.
I personally worked on the Forsyte Saga. If you think something was done in error, please let us know and we'll be happy to fix it.
by acabal
5/16/2026 at 10:40:44 PM
I commented on this kind of editing several years ago:
https://news.ycombinator.com/item?id=16957359
The edit is still in place, and I still maintain that changing 'phone to phone in dialogue changes the meaning.
by mrob
5/16/2026 at 3:52:49 PM
> The only thing we do is very light sound-alike spelling modernization, like "to-night" -> "tonight".
Curious. Why even bother?
by natex
5/16/2026 at 9:39:45 PM
Guess: screen readers and such.
by bell-cot
5/16/2026 at 7:41:55 PM
One could argue that this falls into the previous poster's thought about "the little differences to modern English are part of the charm" ...
by tangledhelix
5/16/2026 at 10:37:43 AM
You may already be aware, but SE marks all commits making those kinds of changes as '[Editorial]', so it is generally trivial to use their tooling to build your own high-quality ebook without any of the editorial changes.
by jcurtis
5/16/2026 at 10:42:56 PM
When I tried this in the past, it was non-trivial because the editorial changes are mixed with the technical changes. Reverting the editorial changes broke the technical changes.
by mrob
5/16/2026 at 9:15:49 AM
SE sounds truly, truly awful. Thanks for making me aware of its existence so I can avoid it.
by AdamN
5/16/2026 at 3:25:47 PM
They're providing beautifully made ebooks for free...
The only thing they are is truly, truly wonderful.
by phaedrix
5/17/2026 at 5:13:55 AM
SE is an amazing and wonderful resource
by condwanaland
5/16/2026 at 7:25:52 AM
It splits the community and number of possible volunteer hours for one. It also splits the canon into different versions. More projects fight for the attention (and possibly donations) of the audience.
There are lots of reasons it could be preferable to centralize. OTOH their mission is limited and some competition is healthy, if only to explore alternative ways to do things.
by a2800276
5/16/2026 at 9:02:06 AM
It’s a different mission.
PG focuses on an accurate digital translation of the source material, sometimes hosting multiple different versions of the same text, and doing things like putting work into recreating the adverts at the back of some novels.
SE focuses less on preservation and more on making readers’ versions of the texts, like other publishing imprints. So there’s typography standardisation, a light-touch modernisation of hyphenation and soundalike spelling, and things like author-wide collections of short fiction and poetry even if they didn’t previously exist.
Both are valuable, but they serve different segments.
by robin_reala
5/15/2026 at 8:45:54 PM
I believe our new-ish CEO Eric Hellman actually did some work on something very similar
by JSeiko
5/15/2026 at 8:24:29 PM
That's an interesting idea. Not a small feat to accomplish though ...
by JSeiko
5/15/2026 at 5:59:28 PM
When I thought about Project Gutenberg I remembered that original brutalist non-design. The current site has been very tastefully updated but looks like it's still very accessible if you turn styles off. Great job!
by jefurii
5/15/2026 at 6:08:09 PM
sadly HN doesn't have a "heart" emoji I could use :D
by JSeiko
5/15/2026 at 11:41:32 PM
I like the design but liked the previous design as well, it was unique and Craigslistish, you knew what website you were visiting just by looking at it.
by ricardonunez
5/15/2026 at 6:16:12 PM
♡
by Wistar
5/15/2026 at 9:27:50 PM
<3
Less than three is a classic!
by ok_dad
5/16/2026 at 8:26:29 AM
Ess two is less than less than three, but also a classic.
s2 < <3
by agys
5/16/2026 at 8:24:32 PM
> When I thought about Project Gutenberg I remembered that original brutalist non-design.
I suppose a printed book, black ink on paper, is "brutalist" and unpleasant to look at?
The text of a book shouldn't be encrusted with format, your reader or browser should contain the presentation that you want to see, find appealing, or need (accessibility).
by fsckboy
5/16/2026 at 1:37:20 PM
The biggest lever: make the reading experience great. https://www.gutenberg.org/cache/epub/245/pg245-images.html is still hard to read: lines are too long (MacBook), and there is no great way to paginate, remember where I was, or take notes.
by eulerpoolapi
5/16/2026 at 3:31:20 PM
The ebook editions are very good for this. Most of the e-reader software provides all the amenities (bookmarks, highlighting, notes, control of margins, etc).
by tangledhelix
5/16/2026 at 2:02:16 PM
Firefox's reader mode works amazingly for these situations.
by SwampertX
5/16/2026 at 3:42:36 PM
A while back I attempted to extract the FF reader code to make it a front end to various non-web clients (email with pine key bindings etc).
I got it to a prototype level but then shelved it after having difficulty getting good results with various test datasets. Probably would make a fantastic ereader though
by drzaiusx11
5/16/2026 at 4:48:19 PM
Lines aren't too long. They look great on all my devices.
Use ⌘ + + until you get the line length you like.
by elch
5/15/2026 at 8:00:04 PM
Huh that's interesting: 4.5 seconds for the TCP handshake and an additional 9.2 seconds for the TLS handshake. Is this some kind of captcha, since most bots would disconnect before that, so if you complete it once then it knows you're good? (Until the bots catch on of course, but so long as it works it's relatively unintrusive and not discriminatory against uncommon client software (that is, non-Chrome/ium).) The rest of the requests were lightning fast.
Edit: welcome to your first comment after 9 years on HN btw, nice to have you here!
by lucb1e
5/15/2026 at 8:10:13 PM
I think their site is just slow, potentially because more people than they are used to are trying to view it.
I was unable to load it initially (got an error from firefox) and had to re-attempt. Still slow if one forces a reload (shift-r, etc, to not use local cache).
by codys
5/15/2026 at 8:23:26 PM
We are having occasional lows in page speed performance due to LARGE amounts of bot traffic. Full disclosure: we've not really been able to resolve this fully/well. Let us know if you have a good idea for how to deal with it.
by JSeiko
5/16/2026 at 3:30:27 PM
How do you currently host everything? Your main web server should not be responsible for hosting content. All books should be hosted on mirrors, and clicking download should automatically select a mirror to download it from.
Furthermore:
* Make sure that all books are downloadable in bulk as torrents.
* Every day, generate a CSV file of all available books and their metadata. Distribute this so that bots and user clients can run queries locally, instead of using your search engine.
by uyzstvqs
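The daily-CSV suggestion above could look roughly like this on the client side. This is only a sketch under stated assumptions: the column names (`id`, `title`, `author`, `language`) and the sample rows are hypothetical stand-ins, not the layout of any actual Project Gutenberg export. The point is that once the file is local, searches cost the site nothing.

```python
import csv
import io

# Hypothetical catalog export (assumed columns); a real client would download
# the file once per day and query the local copy instead of the site search.
SAMPLE_CSV = """id,title,author,language
245,Life on the Mississippi,"Twain, Mark",en
1342,Pride and Prejudice,"Austen, Jane",en
2000,Don Quijote,"Cervantes Saavedra, Miguel de",es
"""


def search_catalog(csv_text, language=None, title_contains=None):
    """Return the catalog rows that match every filter that was given."""
    matches = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if language and row["language"] != language:
            continue
        if title_contains and title_contains.lower() not in row["title"].lower():
            continue
        matches.append(row)
    return matches


if __name__ == "__main__":
    for row in search_catalog(SAMPLE_CSV, language="en", title_contains="pride"):
        print(row["id"], row["title"])
```

A bot or e-reader client using something like this would only hit the server for the catalog file and for the books it actually wants.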
5/15/2026 at 9:14:05 PM
Do you host a torrent?
I have about 50k of the books, I would have used a torrent of just the txt files if it was prominent.
by gropo
5/16/2026 at 6:56:32 PM
we have a tarball of all text files - link posted somewhere here
by gluejar
5/16/2026 at 12:20:10 AM
If it's purely bot traffic, then Anubis could help.
You may have seen it on some websites already.
by dimava
5/16/2026 at 3:42:11 AM
Anubis only works against lazy scrapers, and at a cost to your users. I'd prefer people not use it.
Bot traffic comes from machines that usually have a lot of idle cpu (since they're largely blocked on network IO as they scrape a bunch of sites in parallel), so they can trivially solve the anubis "proof of work" challenge, save the cookie, and then not solve it again for that site.
The only reason scrapers don't solve it is if the developers were too lazy to implement it... and modern scrapers do solve it: Codeberg stopped using Anubis because modern scrapers were updated to solve it.
The "proof of work" has to be easy or else people on old cell phones couldn't access your site (since an old android phone would start to overheat and throttle trying to solve a challenge that would take a modern server even several seconds), and it also consumes your cell-phone user's batteries, which is a really precious resource for them compared to the idle cpu on a server.
by TheDong
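For readers unfamiliar with the "proof of work" idea being debated, here is a generic hashcash-style sketch. It is not Anubis's actual challenge format or difficulty setting (the thread does not describe those); the function names and the `challenge:nonce` encoding are made up for illustration. It shows the asymmetry the comment relies on: the client loops over many hashes, the server checks one.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count how many zero bits the digest starts with."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute-force a nonce. Expected work doubles per difficulty bit."""
    for nonce in count():
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= difficulty:
            return nonce


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash is enough to check the claimed work."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty


if __name__ == "__main__":
    nonce = solve("example-challenge", 12)  # roughly 4096 hashes on average
    print(nonce, verify("example-challenge", nonce, 12))
```

The trade-off described above falls out directly: a difficulty low enough for an old phone is also trivial for a scraper with idle server CPU, and the solved result can be cached as a cookie.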
5/16/2026 at 7:53:10 PM
Just to add to the two negative replies, I find Anubis to be the only system that doesn't ever get in the way. My browsers have Javascript enabled and, so far, it never took more than a fraction of a second to complete the checks.
Every other system I've run into has constant false positives, e.g. Google captchas will sometimes say I've failed and make me do the hardest level (if it wasn't giving me that already), Cloudflare regularly thinks I'm a bot, Codeberg blocked me before, Github signup captchas used to take ~15 minutes to complete and then still said "well you failed, try again", Github's general rate limiting has false positives (some days I browse a lot, other days little, and on the little days it'll sometimes go "slow down" with no recourse whatsoever, you're just blocked for an indeterminate amount of time), OpenStreetMap blocks my browser at work because I'm using Firefox ESR instead of latest stable and it finds that user agent string to be implausible, whatever the German railway operator has been using for the past few days is triggering on me constantly, etc., etc., etc. Constant blocks everywhere.
With Anubis, my understanding is that you do the proof of work (with whatever implementation you like, it doesn't have to be the Javascript one that they provide) and you can move on without ever doing any task yourself. The power consumption is a shame, but so long as attackers aren't even doing this much, the couple Joules it takes doesn't seem to be an issue
Of course, the attackers will evolve, but for now...
by lucb1e
5/16/2026 at 7:12:58 AM
Please no. I'm a non-bot who gets stopped and turned away all the time by that menace. Anubis doesn't work without JS.
One of the things I give duckduckgo a lot of credit for is that while they're quick to interrupt me for a bot check (sometimes multiple times in a span of minutes) they'll let me identify ducks even on the most locked down browsers I use.
by autoexec
5/15/2026 at 8:40:05 PM
I'm only a small-scale sysadmin but the way that I understand the internet is that you send abuse notifications to the IP address block owner and, if it doesn't get resolved, you block. The whois/rdap database reveals which IPs all belong to the same hosting provider or ISP, so you can summarize that all to one list of IP addrs + timestamps per some time period.
The ISP actually knows which subscriber is on that line, can send them notices, block them, terminate them... loads of things that you simply cannot do because you have no relation to this person. And frankly I wouldn't want to need to have a personal relation with every website that I visit; my ISP can reach me if there is anything relevant to continued use of the internet. From personal experience, when I was a teenager, the ISP cutting our household off after an abuse report was an effective way of stopping what I was doing
by lucb1e
5/15/2026 at 10:22:56 PM
It’s effective against teenagers maybe. Not so much against Amazon, Meta or whatever botnet/crawler is coming out of China these days from up-and-coming AI companies.
by Jolter
5/16/2026 at 7:39:50 PM
Then block all of Amazon, Meta, or wherever botnet/crawling traffic is coming from that doesn't honor robots.txt, sends DDoS reflection traffic, submits SMTP messages (in large volumes, not just probing) for domains they're not authorized for with SPF, or whatever else applies to the protocol you're using.
If they can't keep their ranges clean to a reasonable degree, their customers will need to move if they want to access your part of the internet. New sign-ups will always be hard, so some amount of abuse is expected, but if it's the same abuse traffic for weeks after you've notified them, well, it stops being your problem at some point
by lucb1e
5/16/2026 at 7:41:57 PM
See the other comments in this thread. The perpetrators are unknown and are jumping between residential IPs. Possibly botnets?
by Jolter
5/16/2026 at 7:43:40 PM
Then see my other replies in the thread where I've specifically addressed residential IPs, e.g.: https://news.ycombinator.com/item?id=48163060
by lucb1e
5/17/2026 at 10:31:01 AM
This is the post I’m talking about. Make sure you understand how it would not be productive to go after each ISP individually when the traffic is from all of them.
by Jolter
5/15/2026 at 11:38:59 PM
I mean you could block entire AS numbers that relate to amazon or big tech datacenters
by tonetegeatinst
5/16/2026 at 12:07:02 AM
wouldn't help, much of the traffic we've observed looks closer to DDoS patterns - IPs from all over the world, many different networks, each IP makes one request only, doesn't come back. highly distributed, no form of blocking would be effective except maybe captcha or proof of work.
by tangledhelix
5/16/2026 at 12:00:54 PM
The problem with this approach is that modern scrapers use hordes of residential proxies and quickly rotate through IP addresses which belong to ASes you get a lot of real traffic from. There's nothing you can do if the ISP won't take any action against the customer.
by miki123211
5/16/2026 at 3:37:08 PM
Worse than that - even if they would take action, you can't possibly orchestrate filing all of the complaints. It's a drown-in-quicksand problem, you can't fight quicksand one grain at a time.
by tangledhelix
5/16/2026 at 7:35:08 PM
> you can't possibly orchestrate filing all of the complaints
To the ISPs? Each IP range has an abuse email address registered and this is specifically exempt from rate limiting at RIPE's WHOIS server. Not sure how it is in other RIRs but I just happen to know of this policy
You can automate the whole thing, provided that you have a reliable way of identifying the undesired traffic which you need anyway for being able to block it by any means. The trouble is in user identification (they'll just use a new IP address from that ISP or hosting provider if you don't tell the provider about the problematic user)
by lucb1e
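The automation described above (one list of IP addresses and timestamps per network owner) could be sketched like this. Everything named here is an assumption for illustration: the `ABUSE_CONTACTS` table stands in for real WHOIS/RDAP lookups, the networks are documentation-only ranges, and the contact addresses are invented.

```python
import ipaddress
from collections import defaultdict

# Hypothetical block-to-contact table. In practice this would be filled from
# WHOIS/RDAP lookups; these are reserved documentation ranges.
ABUSE_CONTACTS = {
    ipaddress.ip_network("192.0.2.0/24"): "abuse@isp-a.example",
    ipaddress.ip_network("198.51.100.0/24"): "abuse@isp-b.example",
}


def group_by_contact(events):
    """Group (ip, timestamp) events by the abuse contact of the owning block."""
    reports = defaultdict(list)
    for ip_text, timestamp in events:
        ip = ipaddress.ip_address(ip_text)
        contact = next(
            (c for net, c in ABUSE_CONTACTS.items() if ip in net),
            "unknown",
        )
        reports[contact].append((ip_text, timestamp))
    return dict(reports)


if __name__ == "__main__":
    events = [
        ("192.0.2.10", "2026-05-16T19:00:01Z"),
        ("192.0.2.77", "2026-05-16T19:00:02Z"),
        ("198.51.100.5", "2026-05-16T19:00:03Z"),
        ("203.0.113.9", "2026-05-16T19:00:04Z"),
    ]
    for contact, hits in group_by_contact(events).items():
        print(contact, len(hits))
```

As the comment itself notes, the hard part is not the grouping but reliably deciding which requests belong in `events` in the first place.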
5/16/2026 at 7:50:14 PM
See what I wrote above (and let me say I am talking about Project Gutenberg and Distributed Proofreaders here, I am one of the admins on both). A large amount of the hassle traffic we've seen is as I wrote above, the IPs come from everywhere and in many cases, each IP makes a single request and doesn't come back. They change user-agent dynamically, etc, to masquerade as regular traffic. They come from residential, cloud/hyperscale, corporate, educational, government, all the networks, on every continent. This is many thousands of "open a ticket with someone" events per hour territory. It's as difficult to fight as DDoS itself for the same reasons (presumably the harvesting parties know that and that's exactly why this approach is used).
Others online have been writing about their own experience with the same stuff; it's not unique to PG at all, it's everywhere. Talk to anyone that runs a web server and they'll have these stories...
by tangledhelix
5/16/2026 at 8:06:39 PM
I'm aware, I also host various websites that see an IP do a single request to the most unlikely of deep pages. Usually not hard to correlate with similar surprising requests from the same ISP, though, and that's exactly why it would be useful to talk to them: they know who used that IP address at the given timestamp. If they get a hundred complaints from different websites, the ISP is in the unique position to correlate that and find the subscriber(s) that are problematic.
You also don't have to send out 1k support requests per hour. Could trial it with some hosting provider that you expect is responsive and see how it works out
edit: like, I just don't see another solution short of banning being anonymous online. Each site would have to know who you are. Someone has to be able to track it back to a person that is doing the abuse or there can't be any rules that we can apply. Imo it's better if that's the ISP (or VPN provider, say) who already has this information anyway
by lucb1e
5/16/2026 at 7:34:17 PM
I know. All the more reason to do it, right? If an ISP can't keep its network clean, then allowing them to send traffic onto the web is just asking for the problem to continue.
Show people a useful error, such as "You are using [ISP name] which sends large volumes of abusive traffic (think of spam and DDoS). They allow the attackers to hop around points across their entire network so we cannot block the abusers more selectively. Despite our attempts to contact them, the abuse continues in volumes which we do not see from other ISPs. To access our corner of the internet, use a different ISP. You could try mobile data instead of Wi-Fi or vice versa.", and they can make their own choices about staying with this ISP if more and more websites show this sort of error
If everyone tries to identify people piecemeal, we all need to implement ~200 different identification systems (assuming each country has a central system that everyone is signed up to in the first place), or rely on algorithms to tell who is a bot (I'm currently being misidentified on a daily basis and I'm, eh, not a bot. Trying to buy public transport tickets is currently difficult, for example, because the monopolist in my country blocks me after a few route queries when using a Google browser, and 0 queries from Firefox)
by lucb1e
5/15/2026 at 8:32:27 PM
CF cache?
by TurdF3rguson
5/16/2026 at 10:05:11 AM
I would love it if you could detect AI scraper bots, and feed them AI generated bs instead of the real books...
by jimnotgym
5/16/2026 at 7:51:13 PM
Cloudflare sells that as a product, they call it Labyrinth IIRC.
by tangledhelix
5/16/2026 at 11:59:10 AM
This is very, very, very dangerous.
Occasionally, you misclassify a real user as a bot, and then your reputation is ruined forever.
The official Polish train schedules website did this recently, feeding incorrect departure and arrival times to IP addresses known for aggressive scraping, without taking CGNAT into account. People... have noticed[1].
[1] (Polish) https://zaufanatrzeciastrona.pl/post/kto-i-dlaczego-losuje-w...
by miki123211
5/16/2026 at 6:54:12 PM
traffic yesterday ~20% more than recent average.
4971601 sessions
177 robots
863462 robot files
3390115 user files
20.30% robot files (robots id'd based on requests/ip address)
5 apache servers for static content, 1 CherryPy server for dynamic content, hosted at iBiblio.
by gluejar
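The 20.30% figure follows from the two file counts in the comment above. The sketch below reproduces that arithmetic and adds a toy version of "robots id'd based on requests/ip address"; the threshold-based `flag_robots` helper is an assumed illustration, not the site's actual rule.

```python
from collections import Counter


def robot_file_share(robot_files: int, user_files: int) -> float:
    """Percentage of all served files that went to identified robots."""
    return 100 * robot_files / (robot_files + user_files)


def flag_robots(request_ips, threshold):
    """Flag IPs whose request count exceeds a (hypothetical) threshold."""
    counts = Counter(request_ips)
    return {ip for ip, n in counts.items() if n > threshold}


if __name__ == "__main__":
    # 863462 / (863462 + 3390115) = 863462 / 4253577
    print(f"{robot_file_share(863462, 3390115):.2f}%")  # prints 20.30%
```

A per-IP count like this only catches clients that make many requests from one address. Traffic where each IP makes a single request, as described elsewhere in this thread, would not be counted as robot files by such a rule.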
5/15/2026 at 10:15:30 PM
As long as you're taking suggestions, since many of the books are quite old, adding a publication date or date range to the search functionality might be nice. I personally would find it very useful since I have a tendency to look for things that are older than year _x_ when researching various things.
Thanks for all the effort put into the site!
by 0x0203
5/16/2026 at 6:59:30 PM
only 20% of our books have original publication data in the db. We have a project to add another 40% or so from another database, let us know if you want to help.
by gluejar
5/17/2026 at 1:58:47 PM
I have the same problem on catholiclibrary.org, but insist on having something as the book date for every work. My solution is to temporarily default to the author dates until the book date can be refined. If there is no known author date I at least have a date range, hopefully to century or better.
Author dates are a much smaller data set, can be generally supplemented from public marc records (viaf, loc, etc - I don't do that, but it's an option) and at least provide basic filtering / sorting.
by sgc
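The fallback order described above (book date, else author dates, else a rough range) can be written down in a few lines. This is a sketch of that idea only; the function name, parameters, and the tuple-of-years representation are assumptions, not how catholiclibrary.org or Project Gutenberg actually store dates.

```python
def estimate_date_range(book_year=None, author_born=None, author_died=None,
                        rough_range=None):
    """Return an (earliest, latest) year pair usable for filtering and sorting.

    Preference order: exact publication year, then the author's lifetime,
    then a hand-assigned rough range (e.g. a century). None if nothing is known.
    """
    if book_year is not None:
        return (book_year, book_year)
    if author_born is not None and author_died is not None:
        return (author_born, author_died)
    return rough_range


if __name__ == "__main__":
    print(estimate_date_range(book_year=1883))
    print(estimate_date_range(author_born=1835, author_died=1910))
    print(estimate_date_range(rough_range=(1200, 1299)))
```

Sorting on the first element of the returned pair gives a usable chronological order even while most records still lack a precise publication year.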
5/16/2026 at 1:36:11 AM
Hi, for the past 20 years I have known about Project Gutenberg and I used to read a lot from it. One obstacle I face is that there is no way to arrange the books in the order of their original publication. Do you know of any such way? Sure, we can arrange the books by their release date on Gutenberg, but that has long baffled me, as it feels like the most useless way of sorting the books. Thank you for Project Gutenberg.
by Guestmodinfo
5/16/2026 at 7:00:30 PM
only 20% of our books have original publication data in the db. We have a project to add another 40% or so from another database, let us know if you want to help.
by gluejar
5/16/2026 at 7:33:42 PM
Yes I am willing to help. Please include me in your efforts. Thank you for this
by Guestmodinfo
5/15/2026 at 4:27:11 PM
The book list elements on front page render as both horizontally and vertically scrollable divs on mobile - seems like an opportunity for improvement.
Keep up the good work!
by Falimonda
5/15/2026 at 4:33:04 PM
good feedback thanks! Doing an iteration on the homepage design is actually pretty high on the priority list. will keep your feedback in mind!
by JSeiko
5/15/2026 at 10:05:53 PM
Any interest in offering PG as a multi-lingual web e-reader in any language?
I've since discontinued hosting it, but happy to add you all and merge into an official PG offering: https://www.reddit.com/r/SideProject/s/VtYKxjrMme
by Falimonda
5/15/2026 at 10:07:10 PM
More content visible on various videos I took and posted to X
by Falimonda
5/15/2026 at 4:49:17 PM
Thank you for your work. This site is an international treasure.
by xrd
5/16/2026 at 2:54:59 PM
FWIW I absolutely love how 'no-frills' PG is compared to so much of the bloated, over-engineered, script-riddled web these days. Please don't ever change that!
by windowliker
5/15/2026 at 4:59:16 PM
Thank you for being one of the best places on the internet
by excitednumber
5/15/2026 at 7:30:54 PM
Thanks for the free work! Project Gutenberg is nice to have :).
On the site I noticed the library boxes have roughly a single extra line causing a scrollbar to appear and the last line to be chopped off https://i.imgur.com/PQ8T0qc.png is there an issues/bug portal to properly submit these kinds of things?
by zamadatix
5/15/2026 at 8:27:33 PM
you can open an Issue at https://github.com/gutenbergtools/gutenbergsite
by JSeiko
5/15/2026 at 5:04:32 PM
There's a minor bug with chrome in android where the menu will not close when you tap outside the menu or on the menu link/button
by smallnix
5/15/2026 at 5:29:57 PM
I've messaged the guy who's best suited to fixing this. He'll be on it this weekend
by JSeiko
5/16/2026 at 1:52:28 PM
Oh no. I did not want to cause someone to work on the weekend. I hope it's his hobby!
by smallnix
5/15/2026 at 5:11:39 PM
will open an "Issue" for it
by JSeiko
5/15/2026 at 4:56:32 PM
Oh, my! This does look nice. Thank you for your hard work!
by ExtremisAndy
5/15/2026 at 5:00:27 PM
Thanks! We're currently working on a design update of the page of any specific book. Should be online soon (next 1-2 weeks or so)
by JSeiko
5/15/2026 at 6:44:24 PM
I can't say for Project Gutenberg specifically, but in general a huge issue I see is OCR errors. What do you all do to address OCR?
by freedomben
5/15/2026 at 7:11:20 PM
Check out Distributed Proofreaders: https://pgdp.net
by gluejar
5/15/2026 at 9:27:50 PM
I didn't realize DP was still around. I used to do it quite a bit, 15 years ago, but OCR has improved considerably since then.
by jfengel
5/16/2026 at 3:45:00 PM
OCR has improved a lot since then, but OCR is just step 1 of reading in text. OCR engines make a lot of errors (even now, especially on old worn out paper pages) and even if they didn't, one has to format the book, deal with footnotes, sidenotes, illustrations, etc. DP is very active, we will welcome you back with open arms :)
by tangledhelix
5/15/2026 at 6:47:41 PM
I uploaded a PDF to archive.org that auto-OCRs with plenty of mistakes. I have found no way of updating the entire stack of documents produced. I wonder if Project Gutenberg is similar
by lapetitejort
5/15/2026 at 5:03:03 PM
Great Work. Thank you. I'm also a programmer. If you are ever short on help, let me know. I would love to contribute.
by shuvrojit
5/15/2026 at 5:39:24 PM
https://github.com/gutenbergtools
autocat3 and gutenbergsite are repos responsible for generating gutenberg.org
by JSeiko
5/16/2026 at 4:12:04 AM
Great project. Are many of the books in a format that can easily be converted into audio? Is there a way to search for them, and information on what software your readers find useful for this purpose?
(Note: A lot of print media these days has switched to far-too-small font-sizes. Less of a problem for (zoomable) digital media, but for many that's still a barrier.)
by 8bitsrule
5/16/2026 at 3:43:28 PM
There are many books available as audio, some are human-read, some were automated. You can see lists here:
human-read: https://www.gutenberg.org/browse/categories/1
computer-generated: https://www.gutenberg.org/browse/categories/2
IIRC many of the human-generated ones come from LibriVox, many of the computer-generated ones came from a collaboration with Microsoft.
by tangledhelix
5/16/2026 at 1:57:55 PM
For the Audio part, I suggest https://desktop.with.audio
by OfflineSergio
5/17/2026 at 2:08:07 AM
IMO, most audio read by humans (esp. voice actors) is far preferable to machine readings. Also, I found no demos on that page.
by 8bitsrule
5/15/2026 at 5:55:01 PM
Wanna let you know you’re doing great work and you have my dream job, thanks to the team for everything!
by TimorousBestie
5/15/2026 at 6:09:14 PM
it's not my day job. PG is open-source. I'm "just" a contributor
by JSeiko
5/15/2026 at 6:16:01 PM
Oh, right. That makes sense.
by TimorousBestie
5/15/2026 at 5:23:50 PM
Thanks so much for the work you and your team do!
by BiraIgnacio
5/16/2026 at 10:48:56 PM
I don't know what the status of this is today, but a number of years ago my biggest complaint about Gutenberg was that a lot of books had images added back when low resolution images were the standard, so you have a ton of books with image resolutions from the year 2000.
by Jiro
5/16/2026 at 9:39:57 AM
Looking really good! Great work.
by samwho
5/16/2026 at 11:34:10 AM
There should be more books at Gutenberg.
Also by the way I just searched for 3d printing and found nothing. Either there are no books, or the search query makes things too complicated, IMO.
by shevy-java
5/16/2026 at 6:34:36 PM
Gutenberg is nearly all books that have lapsed into the US public domain by dint of being published 95+ years in the past. Which broadly explains why you hit nothing for 3d printing.
by robin_reala
5/16/2026 at 7:56:50 PM
As another commenter said, PG is almost all books from 95+ years in the past due to copyright law in the US. We partner with a sister organization, the World Library Foundation, who have a self-publishing portal for modern works by authors who wish to put their own work in the public domain. You might want to look there for more modern material. https://self.gutenberg.org
by tangledhelix
5/15/2026 at 4:30:46 PM
Very cool! Do you have a recommended way for an agent to see an index of the books and epub links?
(I can’t quite tell if that’s an egregious abuse of the site or you’re perfectly fine to share without human eye balls hitting your www?)
by samcollins
5/15/2026 at 4:40:01 PM
Now I'm not associated with gutenberg in any form, but they do have a page for offline consumption:
https://www.gutenberg.org/ebooks/offline_catalogs.html
Perhaps you can find the information you are looking for there.
However if you plan on scraping or otherwise hitting them with a ton of traffic, consider at least to donate a good amount for the traffic you cause them. It ain't free after all.
by jzs
5/15/2026 at 4:42:10 PM
Donations are always appreciated ;)
by JSeiko
5/16/2026 at 10:11:05 AM
Presumably if you paid them enough money they would give you the books without you having to pay to scrape at all?
by jimnotgym
5/15/2026 at 5:10:09 PM
Thanks for the answers! Found it:
> All Project Gutenberg metadata are available digitally in the XML/RDF format. This is updated daily (other than the legacy format mentioned below). Please use one of these files as input to a database or other tools you may be developing, instead of crawling or roboting the website.
And strongly consider a donation! (My addition)
https://www.gutenberg.org/ebooks/offline_catalogs.html#the-p...
by samcollins
5/15/2026 at 4:34:46 PM
Check out https://www.gutenberg.org/ebooks/offline_catalogs.html
Don't hit the site with an agent. The section at the very bottom is machine readable.
by kay_o
5/15/2026 at 5:57:53 PM
if what you want is all the text, please use the tarball or data files at https://www.gutenberg.org/cache/epub/feeds
by gluejar
5/15/2026 at 4:35:11 PM
not yet, but that's not a bad idea imo. Dealing with AI crawler traffic is definitely a challenge if that's what you were referring to.
by JSeiko
5/15/2026 at 11:40:12 PM
Possibly ZIMs are of interest: <https://ebookfoundation.org/openzim.html> (via: <https://news.ycombinator.com/item?id=48152200>).
by dredmorbius
5/15/2026 at 4:39:04 PM
OPDS?
by ancientcatz
5/15/2026 at 5:33:32 PM
OPDS 2.0 coming RSN. email us if you want to test. OPDS 0.x is currently available (not recommended) by adding .opds to the end of a url
by gluejar