alt.hn

7/4/2026 at 4:51:26 PM

Google Books (or similar) all book scans – $200k bounty (2025)

https://software.annas-archive.gl/AnnaArchivist/annas-archive/-/work_items/234

by Cider9986

7/4/2026 at 6:09:24 PM

I live in a country where the selection of available books, especially in English, is very limited. Buying online from foreign markets comes with a long list of administrative hurdles and limits.

If it were not for Anna's Archive and Z-Library, I would've never been able to read the books that shaped who I am today, or keep my passion for learning alive.

Thanks, AA and ZLib! (Also, thank you to the authors whose books and knowledge I consumed without being able to pay them back.)

by ahmedfromtunis

7/4/2026 at 8:26:10 PM

Look, fair enough from your perspective. But a lot of those books probably wouldn't exist if the author couldn't make some money from their work.

I can't find the post, but years ago on Reddit an author posted stats showing that when her book turned up pirated online, real sales for it collapsed.

Because of this I make a point of buying books, programming books especially. Yes I download pdfs, I use them as previews. This has led to buying way more than I would have.

Anyway, I appreciate this doesn't apply if you live somewhere that these books can't be purchased. But everyone praising these sorts of sites tends to look at them from only a positive perspective.

by pipes

7/4/2026 at 8:32:52 PM

> But a lot of those books probably wouldn't exist if the author couldn't make some money from their work.

I think that's at least a bit debatable. People thought that about (normal) libraries back in the day, but it ended up having the opposite effect.

Not to mention out of print books or academic books which is a big usage of sites like these, since lots of people prefer physical books and only reach for pdfs as a last resort.

by bawolff

7/4/2026 at 9:12:13 PM

Libraries spend like $2B / year buying books https://www.imls.gov/sites/default/files/2021-08/fy19-pls-re..., which is like 10% of the total book market. So even if no one ever bought a book because they first encountered the book, author, or genre in the library, that's already a significant difference.

by dsizzle

7/4/2026 at 9:06:02 PM

I think I agree, the FAR bigger impact on my book's sales was Google search deciding not to surface it in search results. Presence on pirate websites had no effect, and eventually I switched to the PDF as "pay what you want."

by j2kun

7/4/2026 at 9:04:47 PM

Can you imagine if we didn’t have libraries and someone tried to create them today? From publishers to right wingers, they would be painted as communist plots to destroy creativity.

by brookst

7/4/2026 at 9:34:37 PM

The Internet Archive tried to defend its ability to lend books as an online library due to format shift (physical books get first sale doctrine, ebooks are licensed), and were told no by the system, so “pirating” it is until copyright changes and becomes more reasonable. Disk is cheap, and the Internet global. Global distributed storage system durability and availability is the path to success until laws change.

The Internet Archive has lost its appeal in Hachette vs. Internet Archive - https://news.ycombinator.com/item?id=41447758 - September 2024 (793 comments)

Totally unrelated: Dweb camp 2026 is coming up for those interested: https://dwebcamp.org/

(no affiliation with any person or entity in this comment)

by toomuchtodo

7/4/2026 at 7:07:24 PM

https://send.djazz.se/

This is key for getting epubs to your Kobo.

by jvm___

7/4/2026 at 9:47:35 PM

This is a genius way to farm ebooks while providing a useful service. I personally just use Google drive though.

by Salgat

7/4/2026 at 7:23:24 PM

Thanks, but I don't use e-readers as they are not available here.

I've been using MoonReader for many years now and settled on pretty good parameters that make the reading experience very comfortable on both my phone and my tablet.

by ahmedfromtunis

7/4/2026 at 9:24:20 PM

Moon reader is amazing. I love mine so much I don't see a point of having a separate book reader.

by subscribed

7/4/2026 at 7:31:35 PM

I don't understand what this is doing. Can't you sideload any ebook onto a Kobo anyway? Never had an issue on my Clara.

by pull_my_finger

7/4/2026 at 8:13:46 PM

I’ve noticed that people today often bristle at any suggestion that one connect a device to a phone or computer with a cable – on Reddit, one will often get downvoted for this. Apparently, a lot of younger people are hardly aware this is possible and it strikes them as overly complicated or for old people. People want to wirelessly transfer stuff, and what the OP linked to is a popular way to do that with Kobo.

by TFNA

7/4/2026 at 7:59:01 PM

Handy, but a book lover with an ereader probably already uses Calibre :)

by andrepd

7/4/2026 at 8:11:53 PM

I don't recall ever needing anything special on my Aura H2O. It's one of the reasons I chose Kobo in the first place. Just copy any file onto it.

If you mean stripping drm I used Calibre for that but mostly I just avoid buying books with drm where possible.

by Brian_K_White

7/4/2026 at 6:56:17 PM

https://SourceLibrary.org has about 16,000 rare books translated — most for the first time. 50,000 books archived (will be translated when we have $$ for it). More tokens than English Wikipedia and about .75 petabytes.

Not sure if we will qualify for a bounty, but happy to share! Btw, we are looking for funding from small or large donors who want to help us translate the Renaissance…

by dr_dshiv

7/4/2026 at 7:33:58 PM

Hey, this looks fascinating!

I can't quickly tell what all you have archived^, but I have some friends who are academic historians who might be interested in certain categories of work (and could help verify some esoteric languages) - is it possible to search by region or language?

Have you reached out to any types of historians WRT the project? It seems like some PhD students might be able to find some projects in this work etc

^ when I looked at the timeline https://sourcelibrary.org/timeline, I got an error

by wrsh07

7/4/2026 at 8:01:40 PM

Yes, this is designed with historians and librarians from the Embassy of the Free Mind (https://embassyofthefreemind.com) in Amsterdam, stewards of the collection of the Biblioteca Philosophica Hermetica

Please share with historian friends. I’m not great at socials or fundraising but this was really designed to support humanists. It can give DOIs for the versions of the translated books, which means they can be quoted and cited in academic papers.

Tip: Try it in Claude or Claude code (even better)! Just point it towards the source library. It can find quotes and evidence on any topic of interest. Or try the librarian — our source-grounded research agent https://sourcelibrary.org/librarian

Thanks for the feedback, I’ll fix the timeline.

by dr_dshiv

7/4/2026 at 8:59:37 PM

Interesting site. I picked a random topic to listen to — flying chariots or something like that — and the conversation of one person talking and the other whispering was definitely not to my preference. I’ll have to take another look when I have more time.

by therealpygon

7/4/2026 at 7:53:36 PM

Curious as to what your budget was to get where you are today? That's a lot of tokens. I presume you are using gemini flash?

by sgc

7/4/2026 at 8:07:51 PM

All the models used are shown with each page of translation and each book has a whole data provenance treatment.

You can add it up!

by dr_dshiv

7/4/2026 at 8:48:26 PM

I don't see raw token counts, just a list of steps and page counts. For example, what is the rough average token count per page in the ocr and in the translation steps for a Greek book?

I have seen Gemini costs change quite a bit when processing very similar books from the same series lately, mainly because thinking tokens have increased about 5x. Has that happened to you as well?

Edit: for ocr I am using about 15k-25k tokens per page, but I have a complex prompt.
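A rough sketch of how the 15k-25k tokens-per-page figure scales to a whole book. The 300-page book and the price of $1 per million tokens are placeholders chosen for illustration, not figures from this thread or from any provider's price list.

```python
def ocr_token_range(pages, low_per_page=15_000, high_per_page=25_000):
    """Return (low, high) total OCR tokens for a book, using the
    15k-25k tokens-per-page range quoted in the comment above."""
    return pages * low_per_page, pages * high_per_page


def cost_usd(tokens, usd_per_million_tokens):
    """Convert a token count to dollars at a given price per million tokens.
    The price is an input because real prices vary by model and over time."""
    return tokens / 1_000_000 * usd_per_million_tokens


# Hypothetical 300-page book at a placeholder price of $1 per million tokens.
low, high = ocr_token_range(300)
print(low, high)                                # 4500000 7500000
print(cost_usd(low, 1.0), cost_usd(high, 1.0))  # 4.5 7.5
```

Swap in the real page count and the current price for whichever model is used; the translation step would need its own per-page estimate.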

by sgc

7/4/2026 at 8:44:38 PM

How do you handle the more densely written pages in script? I did a very similar exercise OCRing works from this exact collection, but I stuck with the English books for the first pass.

by mmargenot

7/4/2026 at 9:13:45 PM

Can't you just tell him?

by efilife

7/4/2026 at 9:17:29 PM

Wow this is amazing!

by ziofill

7/4/2026 at 9:20:57 PM

Anna’s came clutch for me yesterday. I spent a few days trying to find a zip file of a CD that came with an old book from early 2000s on programming. One of those Thomson Publishing slap jobs that I actually enjoyed. I checked used copies all of them said does not come with CD. I tried googling around, nothing. LLMs couldn’t find it. ChatGPT kept saying it is on the archive (no it isn’t you useless piece of shit). Anyway, on a whim I went to AA, lo and behold, zip files for both first and second edition. Godsend.

by tangenter

7/4/2026 at 9:36:09 PM

What for?

by irthomasthomas

7/4/2026 at 6:01:45 PM

Who is behind Anna's Archive? There are a lot of English speakers involved in the team and forums! Anyway, as long as buying isn't owning, no issues here.

by trilogic

7/4/2026 at 8:26:23 PM

I think the main source may be in Russia; or that was with libgen.

But I could be wrong.

I am more surprised to see that there are so few alternatives to it. Or perhaps I am unaware of them but after Facebook and co declared war on libgen, and libgen going down, there were surprisingly few alternatives. Anna was one of the few. I still don't know what happened with libgen, but since the attack it really is kind of semi-gone.

by shevy-java

7/4/2026 at 8:56:48 PM

Libgen and similar are more alive than ever, with an extended botnet growing weekly. The "googlers" indexed framework is shrinking every day, so users won't find it in those search engines easily. It is also hard to keep up with good storage considering the price trend of the last 5 years, so the botnet and torrents are some kind of solution, I guess. (We, for instance, are considering using the old tape system, because it is at least a viable alternative.)

by trilogic

7/4/2026 at 9:33:19 PM

If no issue there, then why would you ask who is behind it in a public forum?

by tumdum_

7/4/2026 at 6:05:31 PM

I wonder how long it will be before they offer bounties for internet scrapes.

Cloudflare captchas have made the internet unusable for me, and I'm sure it will only get worse over time. I'd much rather just browse (or even torrent) a copy of archive.is or similar. The latter would be much better for privacy, and hey, I run ad blockers anyway.

by hedora

7/4/2026 at 6:30:23 PM

Anyone afraid of being laid off at google right now? Perhaps this is a backup :)

by DeepYogurt

7/4/2026 at 6:44:47 PM

I think if you get caught exfiltrating data they'll sue you for much more than $200K.

by Cthulhu_

7/4/2026 at 7:09:23 PM

I don't think anybody would do it purely for money. I would rather see someone who is terminally ill and decides to do some "good".

by imhoguy

7/4/2026 at 7:29:51 PM

There are not too many mentally-sharp, fully-employed, terminally-ill people that I have met. Even fewer at tech companies.

And even fewer who are single and childless. (Google would likely go after the estate of anyone who did this.)

by dlenski

7/4/2026 at 7:58:20 PM

I wonder how hard they would press an estate. It’s bad PR to go after widows and surviving children, and the data has already escaped.

This is something they’d want to settle quietly, so the family would have leverage.

by bitmasher9

7/4/2026 at 9:36:44 PM

They’ve made so many terrible decisions already. Going after widows wouldn’t change anything.

by tumdum_

7/4/2026 at 7:59:06 PM

But just one would be enough, especially in a large organization. Surely they would need access to the exact data too.

by imhoguy

7/4/2026 at 6:56:32 PM

Copy data into extra large capacity micro sdcard and hide it in your Rubik's cube, nobody will suspect a thing.

by merpkz

7/4/2026 at 7:13:42 PM

It’s the “Copy data into extra large capacity micro sdcard” step that gets you caught. Nobody is stopping you from leaving with an SD card or USB stick at Google.

by diab0lic

7/4/2026 at 7:13:26 PM

I wish an extra capacity SD card was enough; Google Books holds (probably) an insane number of books.

by takipsizad

7/4/2026 at 8:06:36 PM

Comments on the source mention dataset sizes ranging between 1.5PB and 200PB

by stephenlf

7/4/2026 at 8:43:33 PM

For 200PB one would need 25kg worth of 2TB microSD cards... that would be lots of Rubik's cubes =P
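The arithmetic behind that estimate can be checked with a short sketch. It assumes decimal units (1 PB = 1000 TB) and a mass of about 0.25 g per microSD card; the per-card mass is an assumption, not a figure given in the thread.

```python
import math


def microsd_cards_needed(dataset_pb, card_tb=2.0):
    """Number of cards needed to hold dataset_pb petabytes,
    assuming decimal units (1 PB = 1000 TB)."""
    return math.ceil(dataset_pb * 1000 / card_tb)


def total_mass_kg(cards, grams_per_card=0.25):
    """Total mass in kilograms; 0.25 g per card is an assumed typical value."""
    return cards * grams_per_card / 1000


# The dataset sizes mentioned in the thread: 1.5 PB, 7 PB and 200 PB.
for size_pb in (1.5, 7, 200):
    cards = microsd_cards_needed(size_pb)
    print(f"{size_pb} PB -> {cards} cards, {total_mass_kg(cards)} kg")
```

Under those assumptions 200 PB works out to 100,000 cards and 25 kg, matching the figure above, while the 1.5 PB low end is only 750 cards.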

by cydodon

7/4/2026 at 8:39:26 PM

my guess would be the 7PB mark

by takipsizad

7/4/2026 at 8:04:42 PM

I'm sure they'd go after you, but hypothetically: What damages would they claim? They still have the data, which isn't their IP to begin with.

by mmooss

7/4/2026 at 8:28:07 PM

Good point. But it would most likely still be a breach of Google policy, and they signed a pact with the devil, so ...

by shevy-java

7/4/2026 at 6:50:45 PM

If your money is in private crypto or offshore you have nothing to worry about.

by the_real_cher

7/4/2026 at 7:13:27 PM

i'd strongly caution anybody foolish enough to go down this path

financial watchdogs and international treaties make it impossible unless you are perhaps a multi billionaire who can afford to buy people at the political level

by zuzululu

7/4/2026 at 6:54:51 PM

Except perhaps jail time.

Lying about your assets to avoid paying a lawful fine is criminal. Just because they can’t see your money doesn’t mean they can’t prove that you have it, and can’t jail you for hiding it to get out of paying a fine.

by mock-possum

7/4/2026 at 7:14:56 PM

So is stealing

by LastTrain

7/4/2026 at 7:09:37 PM

Google, Amazon, and FB: It's not me, right

by LearnYouALisp

7/4/2026 at 8:27:06 PM

I think the problem is more that financial damage would result from this. So people would need to be prepared to relocate to another country probably.

by shevy-java

7/4/2026 at 9:38:29 PM

Gemini should be trained on those books already, so in theory it could regurgitate some verbatim fragments (as the NYT lawsuit against OpenAI showed some time ago).

by alkyon

7/4/2026 at 9:37:40 PM

    If you shouldn't be able to copyright GRAPES...you shouldn't be able to copyright BOOKS.

by tolerance

7/4/2026 at 6:07:48 PM

The US should just find a way to quietly share literature access with the Russians, rather than letting piracy be promoted and facilitated for US consumers as freedom-fighter "archiving".

Between all the piracy, and all the AI training and the purchase/visitor-circumventing AI services, the practice of writing and publishing genuinely good work is being wiped out.

We're killing the goose that lays the eggs, for selfish gain.

by neilv

7/4/2026 at 6:49:29 PM

This ship has sailed for academic publications, and academics define that term very liberally because we want to read everything, fiction included. The shadow libraries started off as a way for scholars in ex-Soviet countries in particular (but also India, SE Asia, etc.) to access literature that simply wasn’t available in their country. But the shadow libraries proved so successful and convenient that academics in all countries are using them now, even if they have access to official subscription services. I use AA several times a day and so do the researchers around me in my office; at conferences, if the presenter mentions an interesting publication, the whole room immediately opens AA on their laptops, etc.

Even if projects like AA didn’t have nation-level support, academics would find a way to keep as much of it as possible going. After all, we’re the ones who compiled the bulk of pre-2020 material, and we’re the ones who do all the hard work of scanning from our institutional libraries stuff that doesn’t exist anywhere in digital form.

by TFNA

7/4/2026 at 7:31:01 PM

>the practice of writing and publishing genuinely good work is being wiped out.

Most of the best literature in the English language was written before modern IP law was even a thing. There's very little good literature written by authors primarily motivated by money.

by logicchains

7/4/2026 at 7:55:51 PM

How much of that literature was written by wealthy landowners who already had little need for money?

by Jtarii

7/4/2026 at 9:27:23 PM

Well, you needed the means to get an education, since most of the poor in those days were illiterate, which is something of an impediment to becoming a successful writer.

I can only think of one writer off hand who wasn’t a wealthy landowner, although it is a particularly notable example; that of William Shakespeare.

Shakespeare wasn’t poor (his parents seem to be of upper middle class standing), he was able to get a basic (but not a university) education and then pursue an acting career (with perhaps a side hustle as a teacher). Whatever the case he certainly wasn’t independently wealthy before he started writing, he needed to earn a living.

He did seem to be in it for the money (and fame) since he wasn’t just a writer he was an actor, theatre owner, and something of a celebrity, and he did make enough money to become a wealthy landowner by the time he died.

by mr_toad

7/4/2026 at 7:46:05 PM

That's just cultural elitism. I hope you meet someone in your life who finds absolute joy in reading young adult romance novels or D&D fantasy books so you can understand how irrelevant "good" literature is. I love Dostoevsky and Verne (and D&D novels, especially those written by R.A. Salvatore), but I would never judge the modern "IPs" that got my daughter into reading.

> best literature

What does that even mean?

by boca_honey

7/4/2026 at 6:45:43 PM

Possibly, but this act of governmental self-harm is useful to The People. We live in a world where if your valuation is ~1T you can more or less just do what you like. And the work of The People is stolen from you and laundered.

In such a world, isn't it useful that governments are stupid enough to give adversaries reasons to undermine it? When the government props up a corporate tyranny domestically, and racketeering, should we make a temporary alliance with all its enemies?

(E.g., the provision to AI companies of all corporate secrets and competitive practices via prompts, eventually to be used against their capital interests and their labour interests.)

by mjburgess

7/4/2026 at 7:12:02 PM

So when will the American people form an "Incorporation" to lobby against business for them?

by LearnYouALisp

7/4/2026 at 7:22:32 PM

>We're killing the goose that lays the eggs, for selfish gain

We already did that when the internet collectively agreed decades ago that everything digital should be free for anyone.

We're now 20 years downstream of ad-blocking being a virtuous good, and piracy being the ultimate show of liberty, and now suddenly everyone cares about the creator's revenue stream.

The mask slipped and unsurprisingly the internet is a bunch of selfish morally stunted children. Some of them even pushing 50 years old.

Yes, I am talking to you with the 4TB of pirated content, proud of not loading any ads in the last 15 years, and getting enraged over LLM training.

by WarmWash

7/4/2026 at 7:49:40 PM

> Yes, I am talking to you with the 4TB of pirated content, proud of not loading any ads in the last 15 years, and getting enraged over LLM training.

That's oddly-specific :-)

In any case, I have no pirated content that I know of, and I'm neither proud nor ashamed of blocking ads[1], but I still get annoyed that a bunch of VCs can use their invested-into companies to launder all the world's IP, then sell it back to them.

[1] Who feels proud of blocking ads? It's like feeling proud of tying your shoelaces: "Good job, well done, but that's the expectation, son".

by lelanthran

7/4/2026 at 5:39:27 PM

Piracy / copyright predictions?

The current situation feels untenable with renting. So many regular people I know have learned about VPN, NAS, etc.

by bix6

7/4/2026 at 6:05:04 PM

Hopefully the guillotines. Look up how much the authors and artists who create the actual work get paid.

by codemog

7/4/2026 at 7:06:55 PM

Quite a few textbook authors I know are paid well to be part of the whole scheme (kickbacks, forced yearly repurchase for the 'online' component of books, etc). So I think it varies a lot.

by 0x3f

7/4/2026 at 9:17:26 PM

All authors should have a pay + linktree type thing so pirates can pay them directly.

Or something like thanks.dev

by smashah

7/4/2026 at 5:58:34 PM

It was never sustainable, just regulatory capture by large IP owners.

Spotify, Netflix, Amazon etc. provided OK value for a while, but now that enshittification is biting, this is due a massive comeback.

by specproc

7/4/2026 at 5:40:57 PM

Some more interesting bounties they offer: https://software.annas-archive.gl/AnnaArchivist/annas-archiv...

> Purchase all Library of Congress MARC datasets — $3,000 bounty

> English Wikipedia pages about relevant institutions — up to $100 per new page

> Internet Archive Digital Lending — $5000 per 1 million pdf files

> Text version of our full library — $20,000

...

by wxw

7/4/2026 at 5:39:27 PM

So AA is a front for openai?

by FerritMans

7/4/2026 at 7:26:46 PM

No, but they openly make a lot of money from selling their library to AI companies. Fast enterprise access to Anna's Archive starts at $100.000

by flexagoon

7/4/2026 at 9:29:04 PM

A lot? I would be kind of interested if there were any known figures. Do companies want to be implicated in AA-cooperation in any capacity?

by poly2it

7/4/2026 at 9:33:10 PM

They likely use intermediary companies, but NVIDIA might have purchased from them directly, I don't remember the full story.

by Cider9986

7/4/2026 at 8:29:58 PM

Interesting. But AI companies drive the RAM prices, which costs me more. So someone makes me pay more here ... :(

by shevy-java

7/4/2026 at 6:36:22 PM

How did you come to that conclusion?

by 650REDHAIR

7/4/2026 at 6:24:37 PM

the bounty would be a bit higher with openAI money behind it

by awakeasleep

7/4/2026 at 7:16:36 PM

The link sort of reads like people who have very easy access to the requested material. Almost like they're Google employees.

by hereme888

7/4/2026 at 8:30:22 PM

There was a time when you would get a random page preview; some artists found a way to extract full books that way (F.A.T lab?).

by thenthenthen

7/4/2026 at 8:09:26 PM

The only legal hurdle keeping Anna’s Archive away from its noble goal (piracy laws) has been shown to mean zilch in the age of AI.

by stephenlf

7/4/2026 at 7:58:03 PM

Anna’s archive rocks

by stephenlf

7/4/2026 at 8:35:13 PM

How is Anna's Archive funded? I see they have memberships, but it's hard to believe that can fund all these bounties - some going into six figures. Ask any FOSS project about funding by that method.

It seems like there are some deep pockets funding them.

by mmooss

7/4/2026 at 9:09:44 PM

Chinese (and some other) AI companies buying fast access to their dataset.

by atemerev

7/4/2026 at 9:35:37 PM

So Anna's Archive is in some ways a front for AI companies, gathering the sources they can't get themselves?

by mmooss

7/4/2026 at 7:42:40 PM

Does Anna's Archive use a completely different "source repository" from LibGen?

by anyaya1

7/4/2026 at 8:18:37 PM

AA compiles from everywhere; LibGen and Z-Lib served as the major sources of books. This has unfortunately led to search results for a particular book containing multiple versions of that book, and it is not readily clear which one is the highest-quality version. A real library would have librarian staff who carefully curate everything, but in the pirate world this isn’t realistic so it just gets all thrown together.

LibGen is now more or less a dead project. The servers of the original version were reportedly seized a couple of years ago already, and other sites under the LibGen name were notorious for piggybacking the original collection and just plastering it with ads. If one wants to upload stuff, better now to upload it to Z-Lib (not a perfect site, but still) and it will then get picked up by AA in a few months.

by TFNA

7/4/2026 at 7:53:55 PM

Anna's Archive is practically a compilation from all possible sources (including LibGen, afaik)

by takipsizad

7/4/2026 at 8:36:53 PM

Just do it and be legends, Larry. ;)

by leoc

7/4/2026 at 8:51:56 PM

Apple won't even help Asahi Linux even though it would help hardware sales and give them a ton of goodwill.

by Cider9986

7/4/2026 at 8:46:50 PM

I think this would cross the line from civil copyright claims into criminal activity

https://chatgpt.com/share/6a4970e8-7fe8-83e9-8f81-3aefd76b6b...

On another note, if Google's cybersecurity were always one rogue employee away from a massive leak, then it wouldn't be Google. What was the last Google leak you remember? Defense in depth, people.

by TZubiri

7/4/2026 at 5:37:37 PM

One of my hopes is that when the AI bubble bursts, some brave person will sneak out a copy of the last frontier model.

by ThrowawayTestr

7/4/2026 at 5:39:12 PM

Not worried about that, you will only have to wait 3-6 months and get a Chinese model just as good.

by Aboutplants

7/4/2026 at 6:43:41 PM

That’s misunderstanding why these models are behind. A large part of why they’re behind is they aren’t able to do the reinforcement learning post-training steps that take a pre-trained model and turn it into a frontier model like GPT 5 or Opus. Instead they do their best to recreate these models using distillation.

Fundamentally, you can never distill your way to being the teacher, so these approaches will not advance the frontier.

[edit, after thinking about it I think my phrasing is unfair. It's not necessarily that they aren't able to do it, but they haven't yet shown that they are willing to do it.]

by sulam

7/4/2026 at 7:04:24 PM

That’s not remotely true. They did distillation as a cheap solution to the cold start problem. You need data/trajectories to hill climb to higher capabilities. All large Chinese labs do RLAIF.

by computerex

7/4/2026 at 7:10:39 PM

Oh yes, not remotely true. Which is why the frontier labs all have invested heavily in trying to identify and thwart distillers, using known company names / domains to drive their exclusion lists.

/s

by sulam

7/4/2026 at 7:34:17 PM

It's cheaper to distill than to do reinforcement learning, so of course they prefer that, but if it wasn't an option they could just pay up and spend more GPU time on RL.

by logicchains

7/4/2026 at 9:12:58 PM

> you can never distill your way to being the teacher

Are you sure?

What if you distill from 10 teachers?
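For readers unfamiliar with the mechanics, here is a minimal sketch of what distilling from several teachers usually means: the student is trained to match the average of the teachers' output distributions. The function names and the toy logits are made up for illustration, and the sketch says nothing about whether a student trained this way can surpass its teachers.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities; a higher temperature softens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_target(teacher_logits, temperature=1.0):
    """Average the teachers' probability distributions for one input."""
    probs = [softmax(logits, temperature) for logits in teacher_logits]
    count = len(probs)
    return [sum(p[i] for p in probs) / count for i in range(len(probs[0]))]


def kl_divergence(target, student):
    """KL(target || student): the loss a student would minimise."""
    return sum(t * math.log(t / s) for t, s in zip(target, student) if t > 0)


# Toy example: three hypothetical teachers scoring the same three classes.
teachers = [[2.0, 1.0, 0.1], [1.5, 1.2, 0.3], [2.2, 0.8, 0.0]]
target = ensemble_target(teachers, temperature=2.0)
student = softmax([1.0, 1.0, 1.0], temperature=2.0)
print(kl_divergence(target, student))
```

The loss is zero only when the student reproduces the averaged distribution exactly, which is the formal version of the "student imitates teacher" framing in this subthread.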

by DANmode

7/4/2026 at 9:32:14 PM

In this case all teachers have also learned from each other.

by poly2it

7/4/2026 at 7:04:32 PM

>"they aren’t able to do the reinforcement learning post-training steps"

Not yet.

If there is a need, someone will come and fulfill it. Personally, I don't even want to use top models right now. Professionally I use AI to help with coding, using the Junie agent that comes with IDEs from JetBrains. Junie is told to use Gemini Flash and works fine for what I ("I" being an emphasis here) ask it to do. I tried more advanced models and different vendors only to discover credits going down the toilet without any extra benefit.

by FpUser

7/4/2026 at 7:11:08 PM

I'll agree I guess and clarify that the better phrasing is probably something like "haven't yet shown the capability to."

by sulam

7/4/2026 at 5:52:34 PM

Chinese companies giving away expensive models for free is a symptom of the AI bubble, too. It's not a law of nature that they'll always be able to scrounge up the money for yet another training run.

by yorwba

7/4/2026 at 5:56:29 PM

Shaping the tool that does the thinking is quite valuable when you're in the business of changing how people think - I think we can expect propaganda agencies to be subsidizing model creation forever.

This doesn't strike me as a symptom of a bubble - except insofar as the bubble pushes the competitors' models forward and thus they need to invest more to stay competitive.

by gpm

7/4/2026 at 6:29:23 PM

All the models have to respect their local laws and, most of all, pressure from users and employees.

They all carry political weight, because the humans behind them defend their interests and promote some social values.

https://pastebin.com/hjhvsBFg

This answer from Claude is so biased that it is ridiculous

by rvnx

7/4/2026 at 7:32:47 PM

As long as it is in the CCP's national interest to have a frontier model, Chinese companies will have the resources for another training run.

by jnwatson

7/4/2026 at 6:03:19 PM

I think it's a deliberate business strategy of commoditization of their complement.

China acts like an entire bloc, not as single companies, and they want to monetize hardware.

by nextos

7/4/2026 at 8:48:22 PM

If you think Chinese companies always act as a bloc, your mental model needs to get about a billion times more detailed. But in this case just a few details may be enough: There are Chinese AI companies that have released LLMs without publishing the weights.

ByteDance is going the direct-to-consumer route with their Doubao chatbot (the most popular in China, probably thanks to their social media prowess). iFlyTek seems to be angling for enterprise and government use cases, where they already have an in.

The companies that have released weights have in common that they didn't have a monetization channel lined up and their models weren't good enough to make people pay attention with just API access. (You can see with Qwen Max that the calculus can change towards not releasing weights for better models.)

And who exactly among the investors is having their complement commoditized? When Nvidia releases Nemotron, the story is clear, but it's less obvious for say Z.ai's GLM.

by yorwba

7/4/2026 at 9:53:08 PM

I have never said they always act as a bloc, but their industry has a strong component of long-term strategic government planning behind them.

by nextos

7/4/2026 at 6:43:27 PM

If it's a bubble, why do you care about frontier models?

by fastball

7/4/2026 at 9:18:24 PM

If we had the dotcom bubble, why are you still on the Internet?

by emdash

7/4/2026 at 9:56:57 PM

[delayed]

by fastball

7/4/2026 at 7:07:29 PM

The Internet was a bubble, and so was telecom etc. at some point. Being a bubble does not mean that when 90% of investments go down the drain the remains are not useful.

by FpUser

7/4/2026 at 7:14:57 PM

which will be very difficult to run unless you have a large budget to operate your own mini datacenter

by zuzululu

7/4/2026 at 7:53:50 PM

In a crash the hardware will go for pennies on the dollar, if not for fractions of pennies on the dollar.

Lots of companies will pick them up for scrap metal prices and host them for fractions of what we are paying today.

That's the nature of bubbles.

by lelanthran

7/4/2026 at 6:38:53 PM

Prediction markets can solve this.

by thx67

7/4/2026 at 5:40:38 PM

[dead]

by b112

7/4/2026 at 6:07:17 PM

Curious as to how you would approach this. I have no experience in this area, anyone on this forum willing to share their expertise?

by OrangeDelonge

7/4/2026 at 7:03:27 PM

If it works as AA seems to theorize, you'd need to:

  (a) work out how Google books exposes fragments of books, and see if there's a systematic way of using this to get whole books.  For example, a naive approach might be to find any fragment of the book by searching some exact phrase.  Then, you can search for an exact phrase from the start or end of the fragment it gave you, hoping it will show you the previous or next part of the book.  You can then just loop that to get the whole book.

  (b) once you have (a), you need a way of bypassing Google's bot detection/rate limiting.  I don't know what current state of the art is, but there may be a solution for sale out there.  E.g. you pay to receive a cookie or browser state, and use that to fetch the URLs from (a).  Or if you're good/already in the scene, you could do this part yourself.

by 0x3f

7/4/2026 at 7:27:35 PM

That way will definitely work with the current access Google provides; however, it's an extremely inconvenient way to scrape Google Books.

by takipsizad