5/11/2026 at 4:06:07 PM
I'm not sure the direction should be to finetune a small local model for each country or language. These models are already not particularly great at information retrieval, so I doubt anyone would use them for questions like the author suggests (i.e., who was the president between X and Y). Similarly, they are a little too lightweight to be used for translation. If the budget is indeed so modest (5.5 million euros!), I would focus entirely on preparing datasets and making sure all the open cultural artifacts we can find are well documented in them. That way, every model trained in the future, private or open, could better represent the culture and language of your country.
by pu_pe
5/11/2026 at 5:53:37 PM
I agree, the research is complex enough as is without having to worry about splitting it, Babel-like, into multiple languages.
by iugtmkbdfil834
5/12/2026 at 8:19:55 AM
> who was the president between X and Y

This is the type of question that should never, ever be asked of an LLM running on some A100 on the other side of the world; local LLMs are already more than capable of answering these.
by dudefeliciano
5/11/2026 at 6:11:18 PM
Yeah, I think India is going the better route with Sarvam, which is trained from scratch and still relatively cheap.
by dyauspitr
5/11/2026 at 7:02:19 PM
This is the way. Sovereign SOTA models might also be possible with nation-state involvement, but this is a good stopgap.
by TheMagicHorsey