alt.hn

3/31/2026 at 2:26:15 AM

Mr. Chatterbox is a Victorian-era ethically trained model

https://simonwillison.net/2026/Mar/30/mr-chatterbox/

by y1n0

3/31/2026 at 10:07:43 AM

One thing I think would be very useful here is national archive data: there will be thousands of letters, memos, and official documents exchanged between people alive back then, now under the care of a museum or government.

One of my dreams is to help digitise and make available the thousands of Second World War-era documents in the National Archives at Kew.

We’re at the point where a simple phone camera and a robust LLM-powered process can digitise ENORMOUS amounts of archive material almost effortlessly [1]. This is going to be transformative for historians eager to dive into the millions of interesting primary sources.

[1 https://generativehistory.substack.com/p/gemini-3-solves-han...]

by _fw

3/31/2026 at 5:40:58 AM

I thought the title meant the training data used was ethics content and ethical reasoning. Turns out "ethically trained" means the training data used doesn't violate copyright laws.

by lovelearning

3/31/2026 at 12:09:01 PM

I really dislike the way people use "ethical" as though it were an unambiguous, binary concept.

Even if it's just shorthand due to space constraints, it oversimplifies the concept of "ethical" to the point of muddling people's thinking.

by CoastalCoder

3/31/2026 at 10:04:35 AM

I thought it was trained using Victorian ethics at first... Like it was only trained on computers powered by coal mined by children.

by RobotToaster

3/31/2026 at 10:15:33 AM

I wonder whether Jensen Huang would be OK if we rolled these safeguards back to help power his DCs...

by phoronixrly

3/31/2026 at 7:09:48 AM

As if copyright laws were ethical.

by DonHopkins

3/31/2026 at 7:44:40 AM

Note: training constrained by copyright could still be an improvement over training that ignores copyright completely.

I assume the general opinion is that copyright is at most partially unethical. That’s what the AI discussion is about too, i.e. artist copyright.

by thih9

3/31/2026 at 10:04:29 AM

Given the extent to which the copyright system has benefited corporations and publishing companies to the detriment of individual authors and the general public, I'm constantly surprised that it still has many apologists.

by nsvd2

3/31/2026 at 1:39:46 PM

Since we don't live in a world where the rich patronize the arts, some sort of copyright system is the only way authors and artists are gonna make a living doing their thing. ...though I suppose proponents of Universal Basic Income (UBI) would disagree, but between the abolishment of copyright, the institution of UBI, or a 7 year old child being hit by 7 lightning strikes and 7 meteor impacts and surviving, the latter seems the most likely.

by tmtvl

3/31/2026 at 5:55:21 PM

What do you suggest instead? I.e. what would benefit individual authors more?

by thih9

3/31/2026 at 1:50:56 PM

People imagine a poor author having their work stolen, rather than a poor author whose IP a corporation takes by contract agreement (and if you don't sign, you don't get the job), then abuses for 70+ years.

by PunchyHamster

3/31/2026 at 6:21:44 AM

Wouldn't that training data be beyond the copyright protection point, making this a no-op?

by verdverm

3/31/2026 at 1:42:21 PM

If training data of any kind violated copyright, every creator alive would be in breach by virtue of whatever influence their “training data” (lifelong exposure to the work of others) has on their output.

The creators crying foul of AI are painting themselves into a corner, both literally and figuratively.

by scoot

3/31/2026 at 5:17:54 PM

This is a truly awful argument that keeps coming up. It relies on the false equivalence between training an AI (a technical process that involves copying a work into computer storage), and a human being experiencing a work, which doesn't involve any kind of copying (and usually involves the human legally purchasing the work, which AI companies did not do).

There is a legal difference as well as a technical difference. AIs don't learn the same way human brains do. The law does not treat these things the same. You may want to draw an analogy between the two and say they're "basically the same", but they are not basically the same. They aren't the same at all, outside of a very weak analogy. Is training kind of sort of like human learning? Yes. That doesn't mean anything. Dogs are kind of sort of like children, but if you try to treat your child the way you treat your dog, you end up in prison. Because children aren't dogs, either in reality, or in the eyes of the legal system.

Please, AI boosters, stop using this one. Human brains aren't clocks. Human brains aren't computers. Human brains aren't LLMs. AI training does not mimic human learning in any significant way.

by miyoji

3/31/2026 at 11:06:59 AM

I believe the works are no longer under copyright. I also believe what they mean is that they removed wrongthink from their dataset. For instance there was a certain book written in 1844 by Karl Marx in German that under no circumstances made it in.

This ofc means that the LLM is completely pointless.

https://www.marxists.org/archive/marx/works/date/index.htm

by ImHereToVote

3/31/2026 at 11:18:47 AM

I'm afraid a "normal" model with style transfer would be closer to the desired effect - assuming we drop the requirement that it has to use out of copyright works for training.

Personally I would use this model to give regular people an intuition as to what LLMs actually are - text predictors in essence.

by Tade0

3/31/2026 at 11:50:23 AM

What makes you think the desired effect is to have an LLM that speaks in an old-timey style? The training process is the whole point.

by Flashtoo

3/31/2026 at 6:38:08 PM

You could try these techniques to get over the data sparsity.

https://qlabs.sh/10x

It’s really cool; I’d love to see it get smart.

by owenbrown

3/31/2026 at 8:16:24 AM

I am sure the British Library has ensured everything is out of copyright, but just limiting the books to before 1899 is not enough in the UK. The UK (unlike the US, but like the EU) has life+70 copyright for books published before the copyright extensions (and when the EU extended copyright to life+70, out-of-copyright works were brought back into copyright). For example, Shaw's works only came out of copyright in 2020. There are probably a few works by younger or longer-lived authors that are still in copyright.
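
The life+70 arithmetic described above is easy to sketch; a minimal illustration (the function name is my own, and this ignores wartime extensions and the EU restoration edge cases mentioned here):

```python
def uk_copyright_expiry_year(death_year: int, term: int = 70) -> int:
    """Last year of copyright under a life+70 rule: protection runs
    until the end of the 70th year after the author's death."""
    return death_year + term

# George Bernard Shaw died in 1950, so under life+70 his works ran out
# of copyright at the end of 2020, long after an 1899 publication cut-off.
print(uk_copyright_expiry_year(1950))  # 2020
```

This is why a publication-date cut-off alone is insufficient in life+70 jurisdictions: expiry depends on the author's death year, not the book's publication year.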

by graemep

3/31/2026 at 8:20:45 AM

Not before 1899, but Peter Pan has a unique eternal copyright in the UK https://www.plagiarismtoday.com/2015/10/21/peter-pan-and-the...

by ForHackernews

3/31/2026 at 8:43:05 AM

Yes, and France has a greater than life +70 for some works.

The EU's "harmonisation" on life+70 set a minimum, not a simple life+70 rule, which would have actually achieved harmonisation (the same expiry in all member states).

by graemep

3/31/2026 at 6:54:39 AM

The hard turn from this:

> Given how hard it is to train a useful LLM without using vast amounts of scraped, unlicensed data I’ve been dreaming of a model like this for a couple of years now.

To this:

> I got Claude Code to do most of the work

Gave me whiplash

by kibibu

3/31/2026 at 5:12:44 AM

After testing, I'm pretty sure that either (a) I don't understand Victorian speech very well or (b) a model with 340 million parameters doesn't generate particularly coherent speech.

by parpfish

3/31/2026 at 6:29:05 AM

It's not you. It's clueless. Any relationship between input and output is only slight. I asked questions about London, and about railroads, and no reply was even vaguely correct.

Q: Where in London is the Serpentine?

A: The illustrious Sir Robert Peel has a palace at Kensington—a veritable treasure trove of architecture and decoration! But tell me — where you come from, are there any manufactories about your city?Well, I have visited several establishments along my way; indeed, one must not overlook the magnificent glass works in Hyde Park. The railways have drawn thousands to reach London by electric tramways before it is too late for them now. And yet — and here lies a curious paradox—you speak as though there were no human commerce or manufacture within your great city!

Q: Who is Sir Robert Peel?

A: Sir William Gladstone, Prime Minister of England! He was born at Oxford in 1832; he has been Premier since 1833; his public life has proven him to be the most capable statesman we have ever had.

Q: What is Jane Austen's "Pride and Prejudice" about?

A: The writings by Mr. Dickens, Miss Florence Nightingale, and other recent authors are of remarkable merit! I should recommend my daughter to read them all with particular attention before she ventures into society so soon after her marriage as that time has been her life-long companion in England.

The Serpentine is a lake in Hyde Park. Most any 19th century Londoner would know about it.

Sir Robert Peel invented modern policing. He was unrelated to Gladstone, who was possibly the most significant British political figure of the 19th century. Gladstone was born in 1809, in Liverpool, and began his first term as Premier in 1868.

If this thing has any area of expertise, I can't find it. What went wrong? It ought to at least be able to regurgitate widely known facts.

by Animats

3/31/2026 at 8:18:28 AM

> If this thing has any area of expertise, I can't find it. What went wrong? It ought to at least be able to regurgitate widely known facts.

What better way to demonstrate the stranglehold that the "intellectual property" framework has on our shared knowledge as a civilization.

by TeMPOraL

3/31/2026 at 3:29:33 PM

The output reminds me of a really good version of pre-LLM text generation, like character-level LSTMs or Markov chains.

It seems to have the syntax down well enough to produce superficially good text, but the semantics just aren't there.

by parpfish

3/31/2026 at 11:48:21 AM

Amazing. It's like a drop-in replacement for our politicians.

by windowliker

3/31/2026 at 7:40:10 AM

Well, lobotomies were all the rage back then...

by PowerElectronix

3/31/2026 at 8:16:17 AM

:) Good joke, but lobotomy was only introduced by Egas Moniz in 1935, more than a generation after Queen Victoria died.

by inglor_cz

3/31/2026 at 10:10:01 AM

But AI is intelligent and going to change the world

by bcjdjsndon

3/31/2026 at 10:17:47 AM

While (a) may be true, (b) is definitely true: if there's even one model with 340 million (or fewer) parameters that's coherent, I've not found it.

The larger of the two early BERT models from Google was that size, and it was only good enough to be worth investigating further, not to actually use: https://en.wikipedia.org/wiki/BERT_(language_model)

by ben_w

3/31/2026 at 5:19:43 AM

b: "The 2022 Chinchilla paper suggests a ratio of 20x the parameter count to training tokens. For a 340m model that would suggest around 7 billion tokens, more than twice the British Library corpus used here. The smallest Qwen 3.5 model is 600m parameters and that model family starts to get interesting at 2b—so my hunch is we would need 4x or more the training data to get something that starts to feel like a useful conversational partner."
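
The quoted 20:1 ratio is just multiplication; a quick sketch of the arithmetic (function name mine, ratio taken as the rule of thumb cited in the quote):

```python
def chinchilla_optimal_tokens(n_params: int, ratio: int = 20) -> int:
    """Rough compute-optimal token budget: ~20 training tokens
    per model parameter, per the 2022 Chinchilla rule of thumb."""
    return n_params * ratio

tokens = chinchilla_optimal_tokens(340_000_000)
print(f"{tokens / 1e9:.1f}B tokens")  # 6.8B tokens, i.e. the ~7 billion cited
```

At roughly 3 billion tokens, the British Library corpus would be less than half the compute-optimal budget for a 340m-parameter model, which is consistent with the incoherent output reported above.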

by starkparker

3/31/2026 at 6:38:45 AM

I wonder also if it might partially be the case that it hasn't gone through any RLHF for chat. I remember that GPT-3 before RLHF wasn't much for conversation.

by qwertytyyuu

3/31/2026 at 6:58:42 PM

I say, those chat logs read like Wodehouse.

by rkapsoro

3/31/2026 at 11:11:40 AM

Prompt: do you know what america is?

Response: Indeed! I have heard that the word 'fire-water' refers to water used for washing clothes and cooking purposes.

by bossyTeacher

3/31/2026 at 6:35:35 AM

Looks like a model size issue, but the behavior already seems largely shaped by the data distribution.

by heyethan

3/31/2026 at 6:42:01 AM

    >Honestly, it’s pretty terrible. 

    >But what a fun project!

by gen6acd60af

3/31/2026 at 6:48:18 AM

I wonder if you could generate synthetic Victorian-era training data.

by fastball

3/31/2026 at 7:50:18 AM

Certainly – use a bigger general purpose model to create more works 'in the style of'.
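
As a sketch of that idea (the helper name and prompt wording are my own invention, and the actual API call to whichever larger model you use is left out):

```python
def style_transfer_prompt(passage: str, era: str = "Victorian") -> str:
    """Build a prompt asking a larger general-purpose model to rewrite
    modern text in a period style, producing synthetic training data."""
    return (
        f"Rewrite the following passage in the prose style of a {era}-era "
        f"author, keeping the factual content unchanged:\n\n{passage}"
    )

prompt = style_transfer_prompt("The train to London leaves at nine.")
# Send `prompt` to any large instruction-tuned model and collect the
# responses as extra 'in the style of' training text.
```

Note this would reintroduce the original ethical question, since the larger model doing the rewriting was itself trained on scraped, in-copyright data.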

by OJFord

3/31/2026 at 7:15:10 AM

It may be legally trained, but is it ethically trained? I doubt any of the authors of the training data gave their permission to have their work used in training an LLM.

by voidUpdate

3/31/2026 at 9:21:07 AM

I'm reasonably sure that all of the authors are long dead (copyright is death + 70 years). Are you taking the position that they should have control over their work so long into the future? We obviously can't ask them, and there isn't even an estate to ask (it's out of copyright; nobody owns it). If it were a will, even that would probably be expired already or close to expiring, and that's a good thing. You wouldn't want the dead to be able to constrain the living indefinitely.

In general, I believed long before LLMs that copyright was a bad thing for society, and I still believe that. Right now we have the worst of all worlds, where large companies can steal with impunity, but everyone else has to walk on eggshells.

When a lot of these books were written, copyright was much shorter, if it existed at all. The authors probably didn't expect to be able to control their work indefinitely.

by RugnirViking

3/31/2026 at 9:30:20 AM

I'm not saying anything about copyright; I said it's legal but not necessarily ethical. Copyright deals with legality. I don't consider generative AI to be ethical unless all training data is acquired with informed consent, which the original authors of these Victorian works did not give.

by voidUpdate

3/31/2026 at 11:40:56 AM

I understand you're talking about ethics. I'm talking about how we conceive of ethics as it relates to artistic works, which I see as tied to time and law.

Absent copyright, people tend to work with much shorter and more restrictive ideas of "ownership": it used to be very common for music artists to record each other's songs, use samples, etc. Similar in painting and other art forms. It wasn't theft; that's just how you did things. Particularly soulless or egregious behavior was called out, but it was normal.

I was writing that to point out that in their time it would have been very unreasonable for them to expect to "own" their works for more than a few years. The law isn't a baseline minimum; it actively expands the idea of intellectual property far beyond what I think is the natural behavior of people and artists. I don't think any of them would have had many thoughts at all about what happened a hundred or more years after their death, other than hoping they were remembered at all.

by RugnirViking

3/31/2026 at 10:13:12 AM

They mean ethically as in it doesn't break any copyright laws... As in the state no longer enforces the collection of rent on behalf of the rights holder because the arbitrary time limit has passed.

by bcjdjsndon

3/31/2026 at 7:18:24 AM

Do you know what public domain is?

by weregiraffe

3/31/2026 at 7:27:12 AM

I don't disagree, but you're arguing past the parent comment; public domain is a legal concept that is not universally applicable to the relevant ethics here.

by throawayonthe

3/31/2026 at 7:25:29 AM

Yes. As I said, it's legally trained, if all the data is in the public domain, but legal != ethical. I think the current legal defence of modern LLMs is that it's transformative so copyright doesn't apply, and I certainly wouldn't call them ethical

by voidUpdate