Voxtral Transcribe 2

2/4/2026 at 4:21:17 PM

This demo is really impressive: https://huggingface.co/spaces/mistralai/Voxtral-Mini-Realtim...

Don't be confused if it says "no microphone"; the moment you click the record button it will request browser permission and then start working.

I spoke fast and dropped in some jargon and it got it all right - I said this and it transcribed it exactly right, WebAssembly spelling included:

> Can you tell me about RSS and Atom and the role of CSP headers in browser security, especially if you're using WebAssembly?

by simonw

2/4/2026 at 4:41:58 PM

Having built with and tried every voice model over the last three years, real time and non-real time... this is off the charts compared to anything I've seen before.

And open weight too! So grateful for this.

by tekacs

2/5/2026 at 1:19:24 AM

This past month Parakeet v3 dropped with a streaming ASR model that is 0.6B params, can run on a CPU and is super good.

by drakenot

2/5/2026 at 1:37:55 PM

I did say every voice model. :)

Yes I've tried Parakeet v3 too. For its own purpose - running locally - it's amazing.

The thing that's particularly amazing about this Voxtral model is how incredibly rock solid the accuracy is.

For the longest time previous models have been 'mostly correct' or as people have commented elsewhere on this HN thread, have dropped sentences or lost or added utterances.

I have no affiliation with these folks, but I tried and struggled to get this model to break, even speaking as adversarially as I could.

That's a totally different class of model.

by tekacs

2/5/2026 at 5:16:36 AM

Do you mean https://huggingface.co/nvidia/nemotron-speech-streaming-en-0... ?

by meatmanek

2/5/2026 at 1:27:48 PM

Yes. That is it

by drakenot

2/5/2026 at 8:39:51 AM

What's the business plan here?

by puttycat

2/4/2026 at 4:25:22 PM

Thank you for the link! Their playground in Mistral does not have a microphone. It just uploads files, which does not demonstrate the speed and accuracy, but the link you shared does.

I tried speaking in 2 languages at once, and it picked it up correctly. Truly impressive for real-time.

by Oras

2/4/2026 at 6:11:16 PM

According to the announcement blog Le Chat is powered by the new model as well: https://chat.mistral.ai/chat

by druskacik

2/4/2026 at 11:09:14 PM

> Truly impressive for real-time.

Impressive indeed. Works way better than the speech recognition I first got demo'ed in... 1998? I remember you had to "click" on the mic every time you wanted to speak and, well, not only was the transcription bad, it was so bad that it'd try to interpret the sound of the click as a word.

It was so bad I told several people not to invest in what was back then a national tech darling:

https://en.wikipedia.org/wiki/Lernout_%26_Hauspie

That turned out to be a massive fraud.

But ...

> I tried speaking in 2 languages at once, and it picked it up correctly.

I'm a native French speaker and I tried with a very simple sentence mixing French and English:

"Pour un pistolet je prefere un red dot mais pour une carabine je prefere un ACOG" (aka "For a pistol I prefer a red dot but for a carbine I prefer an ACOG")

And instead I got this:

"Je prépare un redote, mais pour une carabine, je préfère un ACOG."

"Je prépare un redote ..." doesn't mean anything and it's not at all what I said.

I like it, it's impressive, but literally the first sentence I tried it got the first half entirely wrong.

by TacticalCoder

2/5/2026 at 1:33:10 AM

I used to sell the Mac Voice Navigator (from Articulate Systems) in the 90s, which was a SCSI-based hardware box that you plug into a Mac, Mac SE or Mac II. It used to use the same L&H speech recognition tech (if I recall correctly) and was called the "User Interface" of the future.

Horrible speech recognition rate and very glitchy. Customers hated it, and lots of returns/complaints.

A few years later, L&H went bankrupt. And so did Articulate Systems.

https://applerescueofdenver.com/products-page/macintosh-to-p...

by jnaina

2/4/2026 at 4:48:11 PM

404 on https://mistralai-voxtral-mini-realtime.hf.space/gradio_api/... for me (which shows up in the UI as a little red error in the top right).

by daemonologist

2/5/2026 at 3:16:15 AM

Same here

by echion

2/4/2026 at 7:24:04 PM

Doesn't seem to work for me - tried in both Firefox and Chromium and I can see the waveform when I talk but the transcription just shows "Awaiting audio input".

by skykooler

2/5/2026 at 6:49:45 AM

For me it shows the waveform and then "error"

by winrid

2/4/2026 at 8:05:36 PM

Try disabling CSP for the page

by starkgoose

2/4/2026 at 7:34:33 PM

Same here. In Chromium I don't even see the waveform.

by codethief

2/4/2026 at 8:01:15 PM

I had to turn off ad-block to get it to work.

by fragmede

2/5/2026 at 3:43:21 AM

I can see the waveform but it still doesn't work for me. Switched to Edge, disabled all adblocking and privacy extensions, built-in tracking prevention, and "enhanced site security" (whatever that is), and still no dice. I'd love to try it and be impressed, but it seems impossible. :(

by whimblepop

2/5/2026 at 9:00:27 AM

Did you check if your mic even works in principle? E.g. using https://www.onlinemictest.com/

If you don't get sound there it won't work anywhere. A surprising number of problems like these can be solved by selecting the correct audio input source (provided your computer shows more than one).

by atoav

2/5/2026 at 1:26:22 PM

Yep. Mic works fine. My mic even works on the test page! What doesn't work is any of the transcription functionality. :(

by whimblepop

2/6/2026 at 8:11:22 AM

I just bit the bullet and did it via Python and the API.

by atoav

2/5/2026 at 6:46:57 AM

Same here on iPhone with Arc Search.

by niek_pas

2/4/2026 at 5:03:00 PM

It can transcribe Eminem's Rap God fast sequence, really, really impressive.

by jaggederest

2/4/2026 at 5:32:04 PM

That's almost certainly in the training data, to be fair.

by rafram

2/4/2026 at 6:49:07 PM

what a great test hahah

by keeganpoppen

2/4/2026 at 6:10:06 PM

This model was able to transcribe Bad Bunny lyrics over the sound of the background music, played casually from my speakers. Impressive, to me.

by carbocation

2/5/2026 at 4:21:31 AM

Wow, so it has surpassed humans.

by elboru

2/5/2026 at 5:50:47 PM

It is quite impressive.

I have seen the same impressive performance about 7 months ago here: https://kyutai.org/stt

If I look at the architecture of Voxtral 2, it seems to take a page from Kyutai’s delayed stream modeling.

The reason the delay is configurable is that you can delay the stream by a variable number of audio tokens. Each audio token is 80 ms of audio, converted to a spectrogram, fed to a convnet, passed through a transformer audio encoder, and the encoded audio embedding is passed, with a history of 1 audio embedding per 80 ms, into a text transformer, which outputs text embedding, then converted to a text token (which is thus also worth 80ms, but there is a special [STREAMING_PAD] token to skip producing a word).

There is no cross-attention in either Kyutai's STT or Voxtral 2, unlike Whisper's encoder-decoder design!
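
To make the 80 ms arithmetic concrete, here is a rough Python sketch. Only the 80 ms frame size and the [STREAMING_PAD] token come from the description above; the helper names, the 6-token delay and the toy alignment are made up for illustration and are not how either model is actually implemented.

```python
FRAME_MS = 80            # one audio token covers 80 ms of audio (from the comment above)
PAD = "[STREAMING_PAD]"  # emitted when no word is produced at a step


def audio_tokens(duration_ms: int) -> int:
    """Number of whole 80 ms audio tokens in a clip (hypothetical helper)."""
    return duration_ms // FRAME_MS


def added_latency_ms(delay_tokens: int) -> int:
    """Extra latency from delaying the text stream by N audio tokens."""
    return delay_tokens * FRAME_MS


def delayed_text_stream(n_frames: int, words_at_frame: dict, delay_tokens: int) -> list:
    """Toy alignment: one text slot per step, shifted by delay_tokens.

    words_at_frame maps an audio frame index to the word that ends there;
    every other slot is filled with the pad token.
    """
    stream = []
    for step in range(n_frames + delay_tokens):
        source_frame = step - delay_tokens  # frame this text slot refers to
        stream.append(words_at_frame.get(source_frame, PAD))
    return stream


if __name__ == "__main__":
    print(audio_tokens(10_000))   # tokens in a 10 second clip
    print(added_latency_ms(6))    # latency cost of a 6-token delay
    print(delayed_text_stream(4, {0: "hi", 2: "there"}, delay_tokens=2))
```

The point of the toy is only that a larger delay gives the text side more audio context before it has to commit to a word, at a cost of 80 ms of latency per delayed token.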

by espadrine

2/4/2026 at 5:18:52 PM

Wow, that’s weird. I tried Bengali, but the text transcribed into Hindi! I know there are some similar words in these languages, but I used pure Bengali that is not similar to Hindi.

by pyprism

2/4/2026 at 5:51:28 PM

Well, on the linked page, it mentions "strong transcription performance in 13 languages, including [...] Hindi" but with no mention of Bengali. It probably doesn't know a lick of Bengali, and is just trying to snap your words into the closest language it does know.

by derefr

2/4/2026 at 6:49:59 PM

it must have some exposure to bengali— just not enough for them to advertise it. otherwise it would have a damn hard time.

by keeganpoppen

2/4/2026 at 6:04:33 PM

I’ve been using AquaVoice for real-time transcription for a while now, and it has become a core part of my workflow. It gets everything, jargon, capitalization, everything. Now I’m looking forward to doing that with 100% local inference!

by sheepscreek

2/4/2026 at 11:59:41 PM

I can't get that demo to work. Tried with both Firefox and Chrome.

by GolDDranks

2/5/2026 at 4:08:59 AM

Same here; the voice waveform animates as expected but the model doesn't do anything when I click on the microphone. It just says "Error" in the upper-right corner.

Also tried downloading and running locally, no luck. Same behavior.

by CamperBob2

2/4/2026 at 8:46:26 PM

Doesn’t seem to work in Safari on iOS 26.2, iPhone 17 Pro, just about anything extra disabled.

by Barbing

2/5/2026 at 3:48:32 AM

No luck with Firefox or Edge or Chrome on either macOS or Android for me, either. Same issue on all.

by whimblepop

2/4/2026 at 9:02:45 PM

It's really nice, although I got a sentence in French when I was speaking Italian, but I corrected myself in the middle of a word.

But I'm definitely going to keep an eye on this for local-only STT for Home Assistant.

by darkwater

2/4/2026 at 5:35:38 PM

Not terrible. It missed or mixed up a lot of words when I was speaking quickly (and not enunciating very well), but it does well with normal-paced speech.

by rafram

2/4/2026 at 10:08:27 PM

Yeah it messed up a bit for me too when I didn't enunciate well. If I speak clearly it seems to work very well even with background noise. Remember Dragon Naturally Speaking? Imagine having this back then!

by timhh

2/4/2026 at 9:30:07 PM

Here European Multilingual-Intelligence truly shines!

by mentalgear

2/4/2026 at 10:02:04 PM

is this demo running fully in the browser?

by colordrops

2/4/2026 at 10:19:13 PM

No, it's server-side.

Model is around 7.5 GB - once they get above 4 GB running them in a browser gets quite difficult I believe.

by simonw

2/5/2026 at 3:34:59 AM

Because it's a 4gb download?

by dcl

2/5/2026 at 3:50:32 AM

I think that web browsers only allow up to 4GB of memory per tab.
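
A back-of-the-envelope check in Python, in case it helps. One common source of a 4 GB ceiling is 32-bit WebAssembly, where pointers are 32 bits wide and so can address at most 2^32 bytes; whether that is the exact limit this demo would hit is an assumption on my part. The 4B parameter count is the figure quoted from the announcement elsewhere in this thread, and 2 bytes per weight assumes BF16.

```python
GIB = 2**30
WASM32_ADDRESS_SPACE = 2**32  # bytes reachable with 32-bit pointers (4 GiB)


def weight_bytes(n_params: int, bytes_per_param: int) -> int:
    """Raw size of the weights alone, ignoring activations and caches."""
    return n_params * bytes_per_param


# Assumption: roughly 4B parameters stored as BF16 (2 bytes each).
bf16_size = weight_bytes(4_000_000_000, 2)

if __name__ == "__main__":
    print(f"weights: {bf16_size / 1e9:.1f} GB ({bf16_size / GIB:.2f} GiB)")
    print(f"wasm32 limit: {WASM32_ADDRESS_SPACE / GIB:.0f} GiB")
    print("fits in a 32-bit address space:", bf16_size <= WASM32_ADDRESS_SPACE)
```

So it is less about the download and more about the weights not fitting in what a 32-bit address space can reach, before you even count working memory.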

by subset

2/4/2026 at 4:07:16 PM

> At approximately 4% word error rate on FLEURS and $0.003/min

Amazon's transcription service is $0.024 per minute, a pretty big difference: https://aws.amazon.com/transcribe/pricing/

by dmix

2/4/2026 at 4:09:38 PM

Is it 0.003 per minute of audio uploaded, or "compute minute"?

For example fal.ai has a Whisper API endpoint priced at "$0.00125 per compute second" which (at 10-25x realtime) is EXTREMELY cheaper than all the competitors.
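
To compare the two billing models, it helps to put both on a per-audio-hour basis. A quick Python sketch, using only the prices quoted in this thread (I have not re-checked them) and treating the 10-25x realtime range as given; the function names are my own.

```python
def per_minute_to_hourly(price_per_audio_min: float) -> float:
    """Cost of one hour of audio when billed per minute of input audio."""
    return price_per_audio_min * 60


def compute_second_to_hourly(price_per_compute_sec: float, realtime_factor: float) -> float:
    """Cost of one hour of audio when billed per second of compute.

    realtime_factor is how many seconds of audio are processed per second
    of compute (10 means 10x faster than realtime).
    """
    compute_seconds = 3600 / realtime_factor
    return compute_seconds * price_per_compute_sec


if __name__ == "__main__":
    # Prices as quoted in the thread; treat them as assumptions.
    print(f"$0.003/min of audio: ${per_minute_to_hourly(0.003):.2f}/hour")
    print(f"$0.024/min of audio: ${per_minute_to_hourly(0.024):.2f}/hour")
    for factor in (10, 25):
        cost = compute_second_to_hourly(0.00125, factor)
        print(f"$0.00125/compute-second at {factor}x: ${cost:.2f}/hour")
```

With these numbers the per-compute-second price lands somewhere between about $0.18 and $0.45 per audio hour, so the comparison depends heavily on which realtime factor you actually get.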

by mdrzn

2/4/2026 at 4:30:51 PM

I think the point is having it for real-time; this is for conversations rather than transcribing audio files.

by Oras

2/4/2026 at 5:52:51 PM

That quote was for the non-realtime model.

by jamilton

2/5/2026 at 10:33:32 PM

It can actually go much lower. Gemini costs around $0.01/hour of transcription last time I checked.

by 85392_school

2/5/2026 at 5:08:45 AM

Both AWS and Mistral prices above are per minute of input audio.

by tgrowazay

2/6/2026 at 7:27:51 PM

If Voxtral can process rapid speech as well as it claims to, an obvious cost optimization would be to speed up normal laconic speech to the maximum speed the model can handle accurately.

by Curiositry

2/4/2026 at 6:32:56 PM

In English it is pretty good. But talk to it in Polish, and suddenly it thinks you speak Russian? Ukrainian? Belarusian? I would understand if an American company launched this, but for a company so proud of its European roots, I think it should have better support for major European languages.

I tried English + Polish:

> All right, I'm not really sure if transcribing this makes a lot of sense. Maybe not. A цьому nie mówisz po polsku. A цьому nie mówisz po polsku, nie po ukrańsku.

by iagooar

2/4/2026 at 9:07:34 PM

They don't claim to support Polish, but they do support Russian.

> The model is natively multilingual, achieving strong transcription performance in 13 languages, including English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch. With a 4B parameter footprint, it runs efficiently on edge devices, ensuring privacy and security for sensitive deployments.

I wonder how much having languages with the same roots (e.g. the Romance languages in the list above or multiple Slavic languages) affects the parameter count and the training set. Do you need more training data to differentiate between multiple similar languages? How would swapping, for example, Hindi (fairly distinct from the other 12 supported languages) for Ukrainian and Polish (both share some roots with Russian) affect the parameter count?

by loire280

2/4/2026 at 9:38:25 PM

Nobody ever supports Polish. It's the worst. They'll support like, ̵Swahili, but not Polish.

edit: I stand corrected lol. I'll go with "Gaelic" instead.

by MarcelOlsz

2/4/2026 at 9:58:31 PM

Swahili is a subcontinental lingua franca spoken by 200M people and growing quickly. Polish is spoken by a shrinking population in one country where English is understood anyways.

by chickenimprint

2/5/2026 at 9:21:25 AM

> where English is understood anyways.

It's popular. But not that popular - you couldn't assume a random person over 30yo on the street would be able to have a chat.

by viraptor

2/4/2026 at 9:56:40 PM

200 million people speak Swahili.

39 million people speak Polish, and most of those also speak English or another more common language.

by londons_explore

2/4/2026 at 10:10:17 PM

You could say the same about Dutch to be fair. 90-95% speak English - I bet that's way higher than in Poland.

by timhh

2/4/2026 at 11:59:21 PM

As an American, my perspective is that Dutch people speak better English than a large percentage of English people and Americans.

by gerad

2/5/2026 at 1:27:48 PM

As a Dutch person, I'm very doubtful that's the case, but I'm willing to bet a good ESL speaker is more aware of common grammatical errors than some native speakers. For example, the your/you're mixup makes no sense if you've had to explicitly learn about English contractions in the first place.

by RestartKernel

2/5/2026 at 6:19:55 AM

Heh, based on my incorrect and probably wrong experience, the Dutch and Swedes are the best non-native English speakers in terms of both accent and fluency.

by vkazanov

2/5/2026 at 9:28:15 AM

Those and Icelandic people. But there's a fun correlation - see how much the US media content is played compared to local one per country. And which countries use subs rather than dubs or voiceovers in cinemas and TV. https://publications.europa.eu/resource/cellar/e4d5cbf4-a839...

If you have exposure to English media from young age and don't get a translation, you learn pretty quickly.

by viraptor

2/5/2026 at 12:52:42 AM

Just a side note to remember that this is a mini model. It's very small and yet it handles 13 languages.

I guess a European version can be created, but for now it's aimed at worldwide distribution.

by _ache_

2/5/2026 at 8:11:41 AM

I guess I will check Korean. OpenAI audio mini is not bad, but I always have to get GPT to check and fix the transcription.

by sbinnee

2/4/2026 at 7:30:27 PM

> The model is natively multilingual, achieving strong transcription performance in 13 languages, including English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch.

Try sticking to the supported languages

by lm28469

2/4/2026 at 6:37:21 PM

Yeah, it's too bad. Apparently it only performs well in certain languages: "The model is natively multilingual, achieving strong transcription performance in 13 languages, including English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch"

by tdb7893

2/4/2026 at 7:37:10 PM

It did great with English and Spanish, but it didn't switch to Portuguese, French or German; maybe it struggled with my accent.

by ricardonunez

2/4/2026 at 8:11:22 PM

Try to warn it you are going to switch language to Portuguese. Worked for me.

by scotty79

2/4/2026 at 7:13:07 PM

That's a mix of Polish and Ukrainian in the transcript. Now, if I try speaking Ukrainian, I'm getting transcript in Russian every time. That's upsetting.

by yko

2/4/2026 at 8:09:43 PM

Oh no! The model won't translate to an unsupported language, and incorrectly reverts to one that it was explicitly trained on.

The base was likely pretrained on data that included Polish and Ukrainian. You shouldn't be surprised to learn it doesn't perform great on languages it wasn't explicitly trained on, or that it drifts toward the ones that had the highest share of training data.

by overfeed

2/4/2026 at 8:12:09 PM

Tell it you are going to speak Polish now. It helps.

by scotty79

2/5/2026 at 8:59:31 AM

Cracking non-English or accented / mispronounced English is the white whale of speech-to-text, I think; I don't know about you, but in our day-to-day chats there's a lot of jargon, randomly inserted English words, etc. And when they speak in English it's often what I call expat-English, which is what you get when non-native speakers only speak the language with other non-native speakers.

Add poor microphone quality (using a laptop to broadcast a presentation to a room audience isn't very good) and you get a perfect storm of untranscribeable presentations or meetings.

All I want from e.g. Teams is a good transcript and, more importantly, a clever summary. Because when you think about it, imagine all the words spoken in a meeting and write them down - that's pages and pages of content that nobody would want to read in full.

by Cthulhu_

2/4/2026 at 10:29:13 PM

I'm not sure why but their multilingual performance in general has usually been below average. For a French company, their models are not even close to being best in French, even outdone by the likes of Qwen. I don't think they're focusing on anything but English, the rest is just marketing.

by moffkalast

2/4/2026 at 6:36:10 PM

TBH ChatGPT does the same when I mix Polish and English. I generally get some Cyrillic characters and it gets super confused.

by mystifyingpoi

2/4/2026 at 11:38:14 PM

polish logically should be rendered in cyrillic as the cyrillic orthography more closely matches the sounds and consonant structure of slavic languages like polish and russian, although this has never been done for church reasons . maybe this is confusing ai

by DaedalusII

2/4/2026 at 11:46:41 PM

Polish has been written with the Latin alphabet since the 13th century. And before that it simply wasn't written.

Polish works with the Latin alphabet just fine.

"Do kraju tego, gdzie kruszynę chleba podnoszą z ziemi przez uszanowanie dla darów Nieba.... Tęskno mi, Panie..."

"Mimozami jesień się zaczyna, złotawa, krucha i miła. To ty, to ty jesteś ta dziewczyna, która do mnie na ulicę wychodziła."

by iagooar

2/5/2026 at 9:36:06 AM

> although this has never been done for church reasons

That's not the case. Polish uses a Latin-based alphabet due to Czech influence and German printers.

by viraptor

2/4/2026 at 4:47:16 PM

Do we know if this is better than Nvidia Parakeet V3? That has been my go-to model locally and it's hard to imagine there's something even better.

by pietz

2/4/2026 at 6:29:05 PM

I've been using nemotron ASR with my own ported inference, and happy about it:

https://huggingface.co/nvidia/nemotron-speech-streaming-en-0...

https://github.com/m1el/nemotron-asr.cpp https://huggingface.co/m1el/nemotron-speech-streaming-0.6B-g...

by m1el

2/4/2026 at 7:19:59 PM

I'm so amazed to find out just how close we are to the Star Trek voice computer.

I used to use Dragon Dictation to draft my first novel, had to learn a 'language' to tell the rudimentary engine how to recognize my speech.

And then I discovered [1] and have been using it for some basic speech recognition, amazed at what a local model can do.

But it can't transcribe any text until I finish recording a file, and then it starts work, so very slow batches in terms of feedback latency cycles.

And now you've posted this cool solution which streams audio chunks to a model in infinitesimally small pieces, amazing, just amazing.

Now if only I can figure out how to contribute to Handy or similar to do that Speech To Text in a streaming mode, STT locally will be a solved problem for me.

[1] https://github.com/cjpais/Handy

by Multicomp

2/4/2026 at 9:32:49 PM

you should check out

https://github.com/pipecat-ai/nemotron-january-2026/

discovered through this twitter post:

https://x.com/kwindla/status/2008601717987045382

by m1el

2/4/2026 at 10:00:06 PM

Happy to answer questions about this (or work with people on further optimizing the open source inference code here). NVIDIA has more inference tooling coming, but it's also fun to hack on the PyTorch/etc stuff they've released so far.

by kwindla

2/5/2026 at 7:35:27 AM

Thank you for sharing! Does your implementation allow running the Nemotron model on Vulkan? Like whisper.cpp? I'm curious to try other models, but I don't have Nvidia, so my choices are limited.

by pstroqaty

2/5/2026 at 2:44:08 AM

I’m curious about this too. On my M1 Max MacBook I use the Handy app on macOS with Parakeet V3 and I get near instant transcription, accuracy slightly less than slower Whisper models, but that drop is immaterial when talking to CLI coding agents, which is where I find the most use for this.

https://github.com/cjpais/Handy

by d4rkp4ttern

2/4/2026 at 5:01:15 PM

I've been using Parakeet V3 locally and, totally anecdotally, this feels more accurate but slightly slower.

by tylergetsay

2/4/2026 at 5:54:16 PM

I liked Parakeet v3 a lot until it started to drop whole sentences, willy-nilly.

by czottmann

2/4/2026 at 10:29:14 PM

Yeah, I think the multilingual improvements in V3 caused some kind of regression for English - I've noticed large blocks occasionally dropped as well, so reverted to v2 for my usage. Specifically nvidia/parakeet-tdt-0.6b-v2 vs nvidia/parakeet-tdt-0.6b-v3

by cypherpunks01

2/5/2026 at 2:46:28 AM

I didn’t see that but I do get a lot of stutters (words or syllables repeated 5+ times), not sure if it’s a model problem or post processing issue in the Handy app.

by d4rkp4ttern

2/5/2026 at 12:56:54 AM

Oh god am I glad to read this. Thought it was my microphone or something.

by WXLCKNO

2/4/2026 at 10:34:22 PM

Parakeet is really good imo too, and it's just 0.6B so it can actually run on edge devices. 4B is massive, I don't see Voxtral running realtime on an Orin or fitting on a Hailo. An Orin Nano probably can't even load it at BF16.

by moffkalast

2/4/2026 at 6:01:27 PM

Came here to ask the same question!

by whinvik

2/5/2026 at 6:08:52 AM

The other demos didn't work for me, so I made https://github.com/owenbrown/transcribe (it's just a Python script to test the streaming).

Wow, Voxtral is amazing. It will be great when someone stitches this up so an LLM starts thinking, researching for you, before you actually finish talking.

Like, create a conversation partner with sub-0.5-second latency. For example, you ask it a multi-part question and, as soon as you finish talking, it gives you the answer to the first part while it looks up the rest of the answer, then stitches it together so that there's no break.

The 2-3 second latency of existing voice chatbots is a non-starter for most humans.

by owenbrown

2/5/2026 at 6:40:18 PM

Yes, appreciate this.

I noticed that with both models voxtral-mini-transcribe-realtime-2602 and voxtral-mini-2602 filler words are ignored. I'd like to be able to count words/sounds, specifically "um" or "uh" for improvement purposes. Any good models that handle that?
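
For the counting side, once you have a transcript that does keep the fillers (an assumption, since the two models above drop them), something like this minimal Python sketch would do. The filler list and function name are just examples, not anything from a real library.

```python
import re
from collections import Counter

# Example filler list; extend it for your own speech habits.
FILLERS = ("um", "uh")


def count_fillers(transcript: str, fillers=FILLERS) -> Counter:
    """Count whole-word filler occurrences, case-insensitively."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(f) for f in fillers) + r")\b",
        re.IGNORECASE,
    )
    return Counter(match.group(1).lower() for match in pattern.finditer(transcript))


if __name__ == "__main__":
    text = "Um, so, uh, I think, um, we left the umbrella at home."
    print(count_fillers(text))
```

The word boundaries keep "umbrella" from counting as an "um"; stretched spellings like "uhh" would need their own entries or a looser pattern.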

by jpeeler

2/5/2026 at 8:36:02 AM

Nice! works well - I couldn't get huggingface to work either

by jwblackwell

2/4/2026 at 5:41:39 PM

I noticed that this model is multilingual and understands 13 languages. For many use cases, we probably only need a single language, and the extra 12 are simply adding extra latency. I believe there will be a trend in the coming years of trimming the fat off of these jack-of-all-trades models.

https://aclanthology.org/2025.findings-acl.87/

by janalsncm

2/4/2026 at 9:15:06 PM

I don't know. What about words inherited from other languages? I think a cross-language model could improve lots of things.

For example, "here it is, voila!" "turn left on el camino real"

by m463

2/4/2026 at 11:54:43 PM

Most English speakers likely would understand those and don’t speak French or Spanish. So it’s not necessary to tack on extra languages even if there are loan words.

In general there is a concept called the “curse of multilinguality”

https://arxiv.org/pdf/1911.02116

by janalsncm

2/4/2026 at 6:27:54 PM

It doesn't make sense to have a language-restricted transcription model because of code switching. People aren't machines; we don't stick to our native languages without fail. Even monolingual people move in and out of their native language when using "borrowed" words/phrases. A single-language model will often fail to deal with that.

by popalchemist

2/4/2026 at 6:55:53 PM

yeah, one example I run into is getting my Perplexity phone assistant to play a song in Spanish. I cannot for the life of me get a model to transcribe "Play señorita a mi me gusta su style on Spotify" correctly

by javier123454321

2/4/2026 at 11:48:21 PM

Everything is a tradeoff, and different use cases require different tradeoffs:

Option A: this model

Option B: faster model, only 1 language

Option C: same size model, only 1 language but higher quality

My point is that option A isn’t always best.

And on the borrowed words bit, there’s no rule that we cannot add borrowed words into the vocab. But you don’t need the whole language. I know what déjà vu means but I don’t speak French.

by janalsncm

2/5/2026 at 4:28:00 AM

that depends entirely on how common the borrowed thing is. And anyway, option A is always going to be insufficient for my code-switching example -- as another commenter pointed out, it is very common to want to refer to a foreign work (song, movie, book) by its foreign language title. Monolingual ASR solutions break over this all the time. Try asking Alexa to play a Spanish language track on Spotify. It fails frequently.

The real world is like that.

by popalchemist

2/4/2026 at 9:14:44 PM

The hilarious part of this comment is all the comments around it complaining about not supporting enough languages

by idiotsecant

2/5/2026 at 6:06:15 PM

It’s a little bit like asking for everything to be included in the Standard Library. Sure, it sounds nice at first, but now you need to maintain tons of dependencies. And any time you want to do one thing, you bring along the baggage of every other thing.

Languages are similar. They also change over time. So now if you want to release a v2 you need an updated corpus for all languages. Or if you get access to an updated corpus for a small language, it might not merit a new model version since it’s only one out of the 13.

by janalsncm

2/5/2026 at 9:12:04 AM

But I actually think that one of the bigger arguments for single-language models is the ability to have more languages. I'm from Sweden, so I would like to have Swedish at an extremely high level, but I wouldn't want all the other small languages at that level because it would inflate the size. So I actually think having multiple single-language models makes coverage both wider and deeper.

by gingersnap

2/4/2026 at 5:51:35 PM

I think this model proves it's very efficient and accurate.

by decide1000

2/4/2026 at 9:45:31 PM

But it could potentially be even more efficient if it was single-language.

by ethmarks

2/5/2026 at 12:42:37 AM

honestly the inability to correctly transcribe the 4 language mix i use in my everyday life is one of the major blockers for adopting ASR tech in my own tooling. this coming from someone who literally works in that field.

turns out, outside the US, many people speak more than one language. :)

edit: I should say was a major blocker, because the last iterations of open-weight models actually work better and better. it's often the UX that's not thought for these usecases.

by black_puppydog

2/5/2026 at 6:10:51 AM

A single-language model wouldn't make any sense except for English: there's simply too much English intertwined with any other language nowadays (corporate jargon, brands, tech, etc.)

by littlestymaar

2/4/2026 at 8:11:48 PM

STT services that have been around for longer, like Azure, Google and Amazon, generally require you to request a specific language, and their quality is a lot higher than models that advertise themselves as LLMs (even though I believe the clouds are also using the same types of models now).

by depr

2/5/2026 at 3:42:38 AM

"I only speak one language, so models I use should only understand one".

by ryan_lane

2/5/2026 at 6:32:37 PM

Engineering is about tradeoffs. If the model is being used in an English-only context then tacking on 12 other languages might not be worth the cost.

You are also implicitly choosing worse performance in English by adding extra languages. So you could have a better monolingual model for the same number of weights.

by janalsncm

2/4/2026 at 8:11:23 PM

Imagine if ChatGPT started like this and thought they should trim coding abilities from their language model because most people don't code.

by raincole

2/4/2026 at 9:51:56 PM

They've already done the inverse and trimmed non-coding abilities from their language model: https://openai.com/index/introducing-gpt-5-2-codex/. There's already precedent for creating domain-specific models.

I think it's nice to have specialized models for specific tasks that don't try to be generalists. Voxtral Transcribe 2 is already extremely impressive, so imagine how much better it could be if it specialized in specific languages rather than cramming 13 languages into one model.

That said, generalist models definitely have their uses. I do want multilingual transcribing models to exist, I just also think that monolingual models could potentially achieve even better results for that specific language.

by ethmarks

2/4/2026 at 6:51:55 PM

uhhh i cast doubt on multi-language support as affecting latency. model size, maybe, but what is the mechanism for making latency worse? i think of model latency as O(log(model size))… but i am open to being wrong / that being a not-good mental model / educated guess.

by keeganpoppen

2/4/2026 at 8:07:51 PM

Even model size, it’s modest. There is a lot of machinery that is going to be common for all languages. You don’t multiply model size by 2 when you double the number of supported languages.

by kergonath

2/5/2026 at 12:26:56 AM

Well for example the last step is to softmax over all output logits, and there are as many logits as entries in your vocab. You need the sum of the exponentiated values of each logit to calculate the denominator, which is O(N).

The bigger impact is that, before the softmax, you need to project the hidden state onto the vocab. That is a matrix of something like 4096x250000. Bigger vocab = more FLOPs.

If you’re on a GPU things are parallelized so maybe it’s not quite linear if everything fits nicely. But on a cpu you’re going to struggle more.

This is why the juiciest target when shrinking models is the token embedding table. For example ALBERT factorized the whole embedding table into two low-rank matrices.
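
Rough numbers for the above, as a plain Python sketch. The 4096 and 250000 sizes are the "something like" figures from this comment, and the rank of 128 is just an illustrative choice for the factorization, not a claim about any particular model.

```python
import math


def softmax(logits):
    """Numerically stable softmax.

    The denominator sums one exponential per logit, so the cost grows
    linearly with vocabulary size.
    """
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def projection_params(hidden: int, vocab: int) -> int:
    """Weights in a full hidden-to-vocab matrix (also multiply-adds per token)."""
    return hidden * vocab


def factorized_params(hidden: int, vocab: int, rank: int) -> int:
    """Weights when the table is split into vocab x rank and rank x hidden."""
    return vocab * rank + rank * hidden


if __name__ == "__main__":
    hidden, vocab, rank = 4096, 250_000, 128  # rank is an assumed example value
    full = projection_params(hidden, vocab)
    small = factorized_params(hidden, vocab, rank)
    print(f"full table:       {full:,} weights")
    print(f"factorized table: {small:,} weights")
    print(f"ratio:            {full / small:.1f}x")
    print(softmax([1.0, 2.0, 3.0]))
```

The factorized form is much smaller because the vocab dimension only ever multiplies the small rank, which is why trimming or factorizing the vocab side pays off so quickly.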

by janalsncm

2/4/2026 at 9:55:28 PM

If encoding more learned languages and grammars and dictionaries makes the model size bigger, it will also increase latency. Try running a 1B model locally and then try to run a 500B model on the same hardware. You'll notice that latency has rather a lot to do with model size.

by ethmarks

2/4/2026 at 7:06:24 PM

model size directly affects latency

by make3

2/4/2026 at 3:53:39 PM

Native diarization, this looks exciting. edit: or not, no diarization in real-time.

https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-26...

~9GB model.

by observationist

2/4/2026 at 4:16:09 PM

The diarization is on Voxtral Mini Transcribe V2, not Voxtral Mini 4B.

by coder543

2/4/2026 at 4:54:40 PM

Do you have experience with that model for diarization? Does it feel accurate, and what's its realtime factor on a typical GPU? Diarization has been the biggest thorn in my side for a long time..

by sbrother

2/4/2026 at 7:03:41 PM

You can test it yourself for free on https://console.mistral.ai/build/audio/speech-to-text I tried it on an English-language podcast episode, and apart from identifying one host as two different speakers (but only once, for a few sentences at the start), the rest was flawless from what I could see.

by ashenke

2/4/2026 at 10:16:03 PM

Amazing. Thank you.

by sbrother

2/4/2026 at 5:22:27 PM

> Do you have experience with that model

No, I just heard about it this morning.

by coder543

2/4/2026 at 4:30:57 PM

Ahh, yeah, and it's explicitly not working for realtime streams. Good catch!

by observationist

2/4/2026 at 10:03:54 PM

Incroyable! Competitive with (if not better than) deepgram nova-3, and much better than assembly and elevenlabs in basically all cases on our internal streaming benchmark.

The dataset is ~100 8kHz call recordings with gnarly UK accents (which I consider to be the final boss of English-language ASR). It seems like it's SOTA.

Where it does fall down seems to be the latency distribution but I'm testing against the API. Running it locally will no doubt improve that?

by mnbbrown

2/4/2026 at 4:03:32 PM

There's no comparison to Whisper Large v3 or other Whisper models..

Is it better? Worse? Why do they only compare to gpt4o mini transcribe?

by mdrzn

2/4/2026 at 4:11:19 PM

WER is slightly misleading, but Whisper Large v3 WER is classically around 10%, I think, and 12% with Turbo.

The thing that makes it particularly misleading is that models that do transcription to lowercase and then use inverse text normalization to restore structure and grammar end up making a very different class of mistakes than Whisper, which goes directly to final form text including punctuation and quotes and tone.
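
A small Python sketch of why the scoring convention matters: the same hypothesis can score very differently depending on whether case and punctuation are normalized before computing WER. The normalizer here is deliberately crude and the sentences are made up for illustration; it is not the evaluation pipeline any of these vendors use.

```python
import string

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + int(r != h)))   # substitution or match
        prev = cur
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate: edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)

def normalize(text):
    """Lowercase and strip ASCII punctuation (a crude, assumed normalizer)."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

reference = "Hello, world. How are you?"
hypothesis = "hello world how are you"

raw = wer(reference, hypothesis)
normalized = wer(normalize(reference), normalize(hypothesis))
print(raw, normalized)
```

Only "are" matches exactly in the raw comparison, so a model that emits final-form text is penalized or rewarded very differently from one scored after normalization.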

But nonetheless, they're claiming such a lower error rate than Whisper that it's almost not in the same bucket.

by tekacs

2/4/2026 at 4:12:00 PM

On the topic of things being misleading, GPT-4o transcriber is a very _different_ transcriber to Whisper. I would say not better or worse, despite characterizations as such. So it is a little difficult to compare on just the numbers.

There's a reason that quite a lot of good transcribers still use V2, not V3.

by tekacs

2/4/2026 at 4:41:08 PM

Different how?

by satvikpendem

2/4/2026 at 4:07:35 PM

Gpt4o mini transcribe is better and actually realtime. Whisper is trained to encode the entire audio (or at least 30s chunks) and then decode it.

by GaggiX

2/4/2026 at 4:10:28 PM

So "gpt4o mini transcribe" is not just whisper v3 under the hood? Btw it's $0.006 / minute

For Whisper API online (with v3 large) I've found "$0.00125 per compute second" which is the cheapest absolute I've ever found.

by mdrzn

2/4/2026 at 7:05:21 PM

Deepinfra offers Whisper V3 at 0.00045$ / minute of transcribed audio.

by breisa

2/4/2026 at 4:13:00 PM

>So it's not just whisper v3 under the hood?

Why should it be Whisper v3? They even released an open model: https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-26...

by GaggiX

2/4/2026 at 4:09:51 PM

The linked article claims the average word error rate for Voxtral mini v2 is lower than GPT-4o mini transcribe

by emmettm

2/4/2026 at 4:11:11 PM

Gpt4o mini transcribe is better than whisper, the context is the parent comment.

by GaggiX

2/5/2026 at 9:58:46 AM

The Apache 2.0 license on Realtime is the buried lede. 4B params at sub-200ms latency means you can run private transcription on-device without sending audio to anyone's servers. That's not an API improvement, it's a categorically different thing.

by RT_max

2/4/2026 at 7:10:46 PM

Played with the demo a bit. It's really good at English, and detects language change on the fly. Impressive.

But whatever I tried, it could not recognise my Ukrainian and would default to Russian in absolutely ridiculous transcription. Other STT models recognise Ukrainian consistently, so I assume there is a lot of Russian in training material, and zero Ukrainian. Made me really sad.

by yko

2/4/2026 at 7:13:43 PM

That's just the result of the model supporting only Russian (and 12 other languages) and not Ukrainian. It maps to the closest words from its training data.

by breisa

2/4/2026 at 6:46:43 PM

It’s nice, but the previous version wasn’t actually that great compared to Parakeet for example.

We need better independent comparison to see how it performs against the latest Qwen3-ASR, and so on.

I can no longer take at face value the cherry picked comparisons of the companies showing off their new models.

For now, NVIDIA Parakeet v3 is the best for my use case, and runs very fast on my laptop or my phone.

by jiehong

2/4/2026 at 6:50:47 PM

There is https://huggingface.co/spaces/hf-audio/open_asr_leaderboard but it hasn't been updated for half a year.

by nodja

2/4/2026 at 6:53:50 PM

I like Parakeet as well and use it via Handy on Mac. What app are you using on your phone?

by archb

2/4/2026 at 6:58:10 PM

Spokenly has it on Mac and iOS, in both cases for free when using parakeet

by jiehong

2/5/2026 at 6:19:05 AM

Very happy with all the mistral work. I feel like I'm always one release behind theirs. Last time they released Mistral 3 I commented saying how excited I was to try it out [1]

Well, I'm happy to report I integrated the new Mistral 3 and have been truly astounded by the results. I still am not a big fan of the model wrt factual information - it seems to be especially confident and especially wrong if left to its own devices - but with http://phrasing.app I do most of the data aggregation myself and just use an LLM to format it. Mistral 3 was a drop-in replacement for 3x the quality (it was already very very good), 0% error rate for my use case (I had an issue with it occasionally going off the rails that was entirely solved), and sticks to my formatting guidelines perfectly (which even gpt-5-pro failed on). Plus it was somehow even cheaper.

I'm using Scribe v2 at the moment for STT, but I'm very excited now to try integrating Voxtral Transcribe. The language support is a little lacking for my use cases, but I can always fall back to Scribe and amortize the cost across languages. I actually was due to work on the transcription of phrasing very soon so I guess look forward to my (hopefully) glowing review on their next HN launch! XD

[1] https://news.ycombinator.com/item?id=46121889#46122612

by barrell

2/4/2026 at 3:53:44 PM

things I hate:

"Click me to try now!" banners that lead to a warning screen that says "Oh, only paying members, whoops!"

So, you don't mean 'try this out', you mean 'buy this product'.

Let's not act like it's a free sampler.

I can't comment on the model : i'm not giving them money.

by serf

2/4/2026 at 3:57:57 PM

You can try it on HF: https://huggingface.co/spaces/mistralai/Voxtral-Mini-Realtim...

by ReadEvalPost

2/4/2026 at 4:21:29 PM

I'm impressed.

by boobsbr

2/4/2026 at 4:39:08 PM

Looks like this model doesn't do realtime diarization, what model should I use if I want that? So far I've only seen paid models do diarization well. I heard about Nvidia NeMo but haven't tried it and don't even know where to try it out.

by satvikpendem

2/4/2026 at 7:10:54 PM

Not sure if it's "realtime" but the recently released VibeVoice-ASR from Microsoft does do diarization. https://huggingface.co/microsoft/VibeVoice-ASR

by breisa

2/4/2026 at 6:50:39 PM

Is there an open source Android keyboard that would support it? Everything I find is based on Whisper, which is from 2022. Ages ago given how fast AI is evolving.

by fph

2/4/2026 at 9:59:00 PM

I wish I had a Google Keyboard that could easily run on Whisper Medium. That would already be great, but unfortunately the inference cost would be too high and it would be incredibly slow. The problem with Whisper is not the inference quality: medium and large are incredible. It's that the base model is not good enough, and it's the only one with fast inference on mobile devices.

by antirez

2/5/2026 at 2:58:00 AM

FUTO keyboard is trying to do this. I think they have some kind of distillation of Whisper running on-device.

by hephaes7us

2/5/2026 at 6:27:22 PM

They are just shipping the same whisper-small that everyone else is using, and have not done much to improve their models since release. Other models have been "coming soon" forever. https://keyboard.futo.org/voice-input-models

by fph

2/5/2026 at 3:39:51 AM

Have been using https://github.com/notune/android_transcribe_app And pretty happy with it. Fully local and fast and accurate

by fittingopposite

2/5/2026 at 4:38:30 AM

This is actually really good. I'm writing with it right now. It's just not the best setup as a keyboard. Because for example you cannot easily switch back to uh the normal keyboard with keys.

by luplex

2/5/2026 at 3:40:24 AM

This uses Parakeet v3 which is a lot lighter but still very good accuracy

by fittingopposite

2/4/2026 at 5:43:21 PM

Is it me or error rate of 3% is really high?

If you transcribe a minute of conversation, you'll have like 5 words transcribed wrongly. In an hour podcast, that is 300 wrongly transcribed words.
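
The arithmetic behind that estimate, as a short Python sketch. The 3% figure is the one discussed here; the speaking rate of 150 words per minute is an assumed typical value, not something measured.

```python
def expected_word_errors(wer, words_per_minute, minutes):
    """Expected number of wrong words for a given WER, speaking rate, and duration."""
    return wer * words_per_minute * minutes

WER = 0.03  # 3% word error rate, the figure under discussion
WPM = 150   # assumed conversational speaking rate

per_minute = expected_word_errors(WER, WPM, 1)
per_hour = expected_word_errors(WER, WPM, 60)
print(f"{per_minute:.1f} errors per minute, {per_hour:.0f} per hour")
```

Under that assumption the result is about 4.5 errors per minute and about 270 per hour, in line with the rough "5 per minute, 300 per hour" figure.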

by XCSme

2/4/2026 at 5:46:47 PM

The error rate for human transcription can be as high as 5%.

by cootsnuck

2/4/2026 at 11:38:29 PM

I did transcription for a while in 2021. It is absurdly hard. Especially as these days humans only get the difficult jobs that AI has already taken a stab at.

The hardest one I did was for a sports network: a motocross event where most of what you could hear was the roar of the bikes. There were two commentators I had to transcribe over the top of that mess and they were using the slang insider nicknames for all the riders, not their published names, so I had to sit and Google forums to find the names of the riders while I was listening. I'm not even sure how these local models would be able to handle that insanity at all because they almost certainly lack enough domain knowledge.

by qingcharles

2/4/2026 at 5:53:03 PM

Oh wow, I thought humans are like 0.1% error rate, if they are native speakers and aware of the subject being discussed.

by XCSme

2/4/2026 at 7:01:47 PM

I was skeptical upon hearing the figure, but various sources do indeed back it up, and [0] is a pretty interesting paper (old but still relevant, since human transcribers haven't changed in accuracy).

[0] https://www.microsoft.com/en-us/research/wp-content/uploads/...

by zipy124

2/4/2026 at 7:28:00 PM

I think it's actually hard to verify how correct a transcription is, at scale. Curious where those error rate numbers come from, because they should test it on people actually doing their job.

by XCSme

2/4/2026 at 8:51:55 PM

It can depend a lot on different factors like:

- familiarity with the accent and/or speaker;

- speed and style/cadence of the speech;

- any other audio that is happening that can muffle or distort the audio;

- etc.

It can also take multiple passes to get a decent transcription.

by rhdunn

2/4/2026 at 11:39:57 PM

You missed a giant factor: domain knowledge. Transcribing something outside of your knowledge realm is very hard. I posted above about transcribing the commentary of a motorbike race where the commentators only used the slang names of the riders.

by qingcharles

2/4/2026 at 9:17:42 PM

Most of these errors will not be meaningful. Real speech is full of ambiguities. 3% is low

by Nimitz14

2/4/2026 at 6:19:15 PM

I really wish those offering speech-to-text models provided transcription benchmarks specific to particular fields of endeavor. I imagine performance would vary wildly when using jargon peculiar to software development, medicine, physics, and law, as compared to everyday speech. Considering that "enterprise" use is often specialized or sub-specialized, it seems like they're leaving money on Dragon's table by not catering to any of those needs.

by gwerbret

2/4/2026 at 4:40:40 PM

What's the cheapest device specs that this could realistically run on?

by aavci

2/4/2026 at 5:10:33 PM

I haven't quite figured out if the open weights they released on huggingface amount to being able to run the (realtime) model locally - i hope so though! For the larger model with diarization I don't think they open sourced anything.

by kamranjon

2/4/2026 at 8:44:22 PM

The HF page suggests yes, with vllm.

> We've worked hand-in-hand with the vLLM team to have production-grade support for Voxtral Mini 4B Realtime 2602 with vLLM. Special thanks goes out to Joshua Deng, Yu Luo, Chen Zhang, Nick Hill, Nicolò Lucchesi, Roger Wang, and Cyrus Leung for the amazing work and help on building a production-ready audio streaming and realtime system in vLLM.

https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-26...

https://docs.vllm.ai/en/latest/serving/openai_compatible_ser...

by IanCal

2/4/2026 at 4:15:03 PM

Italian represents, I believe, the most phonetically advanced human language. It has the right compromise among information density, understandability, and the ability to speak much faster to compensate for the redundancy. It's as if it had error correction built in. Note that it's not just that it has the lowest error rate, but it is also underrepresented in most datasets.

by antirez

2/4/2026 at 6:02:40 PM

I love seeing people from other countries share their own folk tales about what makes their countries special and unique. I've seen it up close in my country and I always cringed when I heard my fellow countrymen came up with these stories. In my adulthood I'm reassured that it happens everywhere and I find it endearing.

On the information density of languages: it is true that some languages have a more information dense textual representation. But all spoken languages convey about the same information in the same time. Which is not all that surprising, it just means that human brains have an optimal range at which they process information.

Further reading: Coupé, Christophe, et al. "Different Languages, Similar Encoding Efficiency: Comparable Information Rates across the Human Communicative Niche." Science Advances. https://doi.org/10.1126/sciadv.aaw2594

by nindalf

2/4/2026 at 6:16:46 PM

Different representations at the same bitrate may have features that make one a lot more resilient to errors. This thing about Italian, you will find in any benchmark of vastly different AI transcription models. You can find similar results also in the way LLMs mostly trained on English usually generalize very well to Italian. All this despite Italian accounting for marginal percentage of the training set. How do you explain that? I always cringe when people refute evidence.

by antirez

2/4/2026 at 6:42:07 PM

Where is this evidence you’ve cited for your claims?

by testdelacc1

2/4/2026 at 11:04:08 PM

> All this despite Italian accounting for marginal percentage of the training set.

Evidence?

by hollowturtle

2/4/2026 at 4:36:27 PM

This is largely due to the fact that modern Italian is a systematised language that emerged from a literary movement (whose most prominent representative is Alessandro Manzoni) to establish a uniform language for the Italian people. At the time of Italian unification in 1861, only about 2.5% of the population could speak this language.

by Archelaos

2/4/2026 at 4:51:47 PM

The language itself was not invented for the purpose: it was the language spoken in Florence, then adopted by the literary movement and then selected as the national language.

It seems like the best tradeoff between information density and understandability actually comes from the deep Latin roots of the language

by gbalduzzi

2/4/2026 at 5:26:22 PM

At least some relatively well-known research finds that all languages have similar information density in terms of bits/second (~39 bits/second based on a quick search). Languages do it with different amounts of phonetic sound / syllables / words per bit and per second, but the bps comes out the same.

I don't know how widely accepted that conclusion is, what exceptions there may be, etc.

by mmooss

2/4/2026 at 9:07:26 PM

in the end (our) italian language wasn’t optimized by engineers, it was refactored by poets

by mr_tox

2/4/2026 at 11:24:30 PM

and disseminated to the entire peninsula by broadcast television featuring Mike Buongiorno

by ithkuil

2/4/2026 at 4:45:27 PM

I was honestly surprised to find it in the first place, because I assumed English to be at first place given the simpler grammar and the huge dataset available.

I agree with your belief, other languages have either lower density (e.g. German) or lower understandability (e.g. English)

by gbalduzzi

2/4/2026 at 4:58:53 PM

English has a ton of homophones, way more sounds that differ slightly (long/short vowels), and major pronunciation differences across major "official" languages (think Australia/US/Canada/UK).

Italian has one official Italian (two, if you count IT_ch, but the difference is minor), doesn't pay much attention to stress and vowel length, and only has a few "confusable" sounds (gl/l, gn/n, double consonants, stuff you get wrong in primary school). Italian dialects would be a disaster tho :)

by riffraff

2/4/2026 at 4:49:50 PM

The only knowledge I have about how difficult Italian is comes from Inglourious Basterds.

by NewsaHackO

2/4/2026 at 5:53:40 PM

> the most phonetically advanced human language

That's interesting. As a linguist, I have to say that Haskell is the most computationally advanced programming language, having the best balance of clear syntax and expressiveness. I am qualified to say this because I once used Haskell to make a web site, and I also tried C++ but I kept on getting errors.

/s obviously.

Tldr: computer scientists feel unjustifiably entitled to make scientific-sounding but meaningless pronouncements on topics outside their field of expertise.

by hackyhacky

2/5/2026 at 2:57:30 AM

It performs well on Mandarin audio transcription, considering it's a European company. It's weird though that it keeps adding spaces between single Chinese characters, and mixing traditional & simplified characters.

by cyp0633

2/5/2026 at 2:01:40 AM

Ok, I guess this is the regular time for me to look for a local realtime transcription solution on Linux, and not finding anything good.

Maybe this'll get wrapped into a nice tool later.

Does anyone have any recommendations?

by mijoharas

2/5/2026 at 2:57:58 AM

I made this for myself, might not work on Wayland though if that's an issue.

https://github.com/rabfulton/Auriscribe

by rabf

2/4/2026 at 8:16:27 PM

This looks great, but it's not clear to me how to use it for a practical task. I need to transcribe about 10 years worth of monthly meetings. These are government hearings with a variety of speakers. All the videos are on YouTube. What's the most practical and cost-effective way to get reasonably accurate transcripts?

by ccleve

2/4/2026 at 8:41:44 PM

If you use something like yt-dlp you can download the audio from the meetings, and you could try things out in Mistral's AI Studio.

You could use their api (they have this snippet):

```
curl -X POST "https://api.mistral.ai/v1/audio/transcriptions" \
  -H "Authorization: Bearer $MISTRAL_API_KEY" \
  -F model="voxtral-mini-latest" \
  -F file=@"your-file.m4a" \
  -F diarize=true \
  -F timestamp_granularities="segment"
```

In the api it took 18s to do a 20m audio file I had lying around where someone is reviewing a product.

There will, I'm sure, be ways of running this locally up and available soon (if they aren't in huggingface right now) but the API is $0.003/min. If it's something like 120 meetings (10 years of monthly ones) then it's roughly $20 if the meetings are 1hr each. Depending on whether they're 1 or 10 hours (or if they're weekly or monthly but 10 parallel sessions or something) then this might be a price you're willing to pay if you get the results back in an afternoon.
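
A quick Python sketch of that estimate, using the $0.003/min price quoted above. The meeting counts and lengths are the hypothetical scenarios from this comment, and the function name is made up for illustration.

```python
from decimal import Decimal

PRICE_PER_MINUTE = Decimal("0.003")  # USD per audio minute, as quoted above

def transcription_cost(meetings, minutes_each):
    """Total API cost in USD for a batch of equally long recordings."""
    return PRICE_PER_MINUTE * meetings * minutes_each

one_hour_meetings = transcription_cost(120, 60)   # 10 years of monthly 1-hour meetings
ten_hour_meetings = transcription_cost(120, 600)  # same schedule, 10-hour hearings
print(one_hour_meetings, ten_hour_meetings)
```

That works out to $21.60 for the one-hour case and $216 if each hearing runs ten hours.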

edit - their realtime model can be run with vllm, the batch model is not open

by IanCal

2/5/2026 at 12:15:33 PM

> 10 years worth of monthly meetings

if it's 1 monthly video and thus 120 videos (or so) you could try recall (getrecall.ai not recall.ai that is a similar product with a similar name). They summarize youtube videos, but you get the transcript. AFAIK you cannot batch the processing and you have to add each video one by one, that's why 100 or 200 videos is doable but probably not thousands.

by poulpy123

2/4/2026 at 8:45:56 PM

- get an API key for this service

- make sure you have a list of all these YouTube meeting URLs somewhere

- ask your preferred coding assistant to write you up a script that downloads the audio for these videos with yt-dlp & calls Mistral's API

- ????

- profit
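
The steps above could be sketched roughly like this in Python. This is a hedged outline, not a tested pipeline: it assumes `yt-dlp` and `curl` are installed, that `MISTRAL_API_KEY` is set, and it reuses the endpoint and form fields from the curl snippet quoted elsewhere in this thread. The file layout, the `urls.txt` name, and the function names are hypothetical, and any API limits on file size or duration are not handled.

```python
import os
import subprocess
from pathlib import Path

# Endpoint taken from the curl snippet quoted in this thread.
API_URL = "https://api.mistral.ai/v1/audio/transcriptions"

def ytdlp_command(url, out_dir):
    """Build a yt-dlp call that extracts audio only and converts it to .m4a."""
    template = str(Path(out_dir) / "%(id)s.%(ext)s")
    return ["yt-dlp", "-x", "--audio-format", "m4a", "-o", template, url]

def curl_command(audio_path, api_key):
    """Build the transcription request, mirroring the quoted curl snippet."""
    return ["curl", "-sS", "-X", "POST", API_URL,
            "-H", f"Authorization: Bearer {api_key}",
            "-F", "model=voxtral-mini-latest",
            "-F", f"file=@{audio_path}",
            "-F", "diarize=true",
            "-F", "timestamp_granularities=segment"]

def transcribe_all(url_file, out_dir="audio"):
    """Download every URL listed in url_file, then save one JSON response per audio file."""
    api_key = os.environ["MISTRAL_API_KEY"]
    Path(out_dir).mkdir(exist_ok=True)
    urls = [u.strip() for u in Path(url_file).read_text().splitlines() if u.strip()]
    for url in urls:
        subprocess.run(ytdlp_command(url, out_dir), check=True)
    for audio in sorted(Path(out_dir).glob("*.m4a")):
        result = subprocess.run(curl_command(audio, api_key),
                                check=True, capture_output=True, text=True)
        audio.with_suffix(".json").write_text(result.stdout)
```

Calling `transcribe_all("urls.txt")` with one YouTube URL per line would run the whole batch; the response is saved as-is because its exact shape is not verified here.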

by isoprophlex

2/4/2026 at 8:19:38 PM

If they are on Youtube, try Gemini 3 Flash first. Use AI studio, it lets you insert YouTube videos into context.

by jimmy76615

2/4/2026 at 5:48:38 PM

Wired advertises this as "Ultra-Fast Translation"[^1]. A bit weird coming from a tech magazine. I hope it's just a "typo".

[^1]: https://www.wired.com/story/mistral-voxtral-real-time-ai-tra...

by siddbudd

2/4/2026 at 5:50:44 PM

It might be capable of translation; OpenAI Whisper was a transcription model that could do it.

by bigyabai

2/4/2026 at 8:23:37 PM

Do you know anything better for Polish language, low quality audio than Whisper large-v3 through WhisperX?

This combo has almost unbeatable accuracy and it rejects noises in the background really well. It can even reject people talking in the background.

The only better thing I've seen is Ursa model from Speechmatics. Not open weights unfortunately.

by scotty79

2/4/2026 at 9:31:01 PM

What's the best way to train this further on a specific dialect or accent or even terminology?

by sgt

2/5/2026 at 4:38:19 AM

You know what I'd love to have? This running on my Android smartphone. Google's speech services are garbage and they LOVE to cut me off mid-sentence for no reason, well over half the time. It's maddening.

by MaxL93

2/5/2026 at 12:08:39 AM

What hardware resources are required for what quality/latency? Multiple high-end Nvidia GPUs, or can you run it on your phone or on an ESP32 offline? Or...

Seems like fundamental info for any model announcement. Did I just miss it? Does everyone just know except me?

by harry8

2/5/2026 at 3:53:09 AM

Very nice! The thing I am missing is turn detection. In real time audio we need the turn detection to understand when AI should speak. Unfortunately this makes it not a complete deepgram replacement yet!

by Obertr

2/5/2026 at 4:00:44 AM

Is deepgram really performing better than open source turn detection models for you? In our tests it is not.

by nostrebored

2/5/2026 at 5:17:23 PM

what is SOTA?

by Obertr

2/4/2026 at 10:51:06 PM

https://www.tavus.io/post/sparrow-1-human-level-conversation...

how does it compare to sparrow-1?

by maxdo

2/4/2026 at 10:40:04 PM

3 hours for a single request sounds nice to me. Although the graph suggests that it's not going to perform as well as the OpenAI model I have been using, it is open source and surely I will give it a try.

by sbinnee

2/5/2026 at 3:41:15 AM

Is there some well established independent benchmark where I can easily (looking at a couple of graphs) compare all popular (especially self-hosted) transcription models?

by krick

2/5/2026 at 3:57:42 AM

Not that I am aware of unfortunately

by mottiden

2/5/2026 at 4:37:05 AM

I can't wait for models to get small enough that they can run on commodity devices.

Hope we can build an app like Wispr Flow using this with the model running completely on device.

by albert_e

2/5/2026 at 8:20:04 AM

Wondering if most AI agents use realtime APIs or transcription APIs. Can anyone with experience building voice agents comment?

by ashu1461

2/4/2026 at 5:55:04 PM

One week ago I was on the hunt for an open source model that can do diarization and I had to literally give up because I could not find any easy to use setup.

by yewenjie

2/4/2026 at 7:06:19 PM

I don't know if that will change, but right now only the Voxtral Mini Transcribe V2 supports diarization and it's not open-weight. The Voxtral Realtime model doesn't support diarization, but is open-weight.

by ashenke

2/4/2026 at 6:46:53 PM

WhisperX ?

by vojto11

2/4/2026 at 6:14:19 PM

I'm guessing I won't be able to finetune this until they come out with a HF transformers model, right?

by jszymborski

2/5/2026 at 7:11:00 AM

This is exciting, especially after ElevenLabs' very expensive model

by qwertytyyuu

2/4/2026 at 7:07:25 PM

Impressive results, tested on crappy audio files (in french and english)...

by blobinabottle

2/4/2026 at 7:20:41 PM

does anyone know if there are any desktop tools I can use this transcription model with? e.g. something like Wispr Flow/WillowVoice but with custom model selection

by numbers

2/4/2026 at 7:24:59 PM

There is Handy, an open source project meant to be a desktop tool, but I haven’t installed it yet to see how you pick your model.

Handy – Free open source speech-to-text app https://github.com/cjpais/Handy

by tietjens

2/4/2026 at 4:12:28 PM

As a rule of thumb for software that I use regularly, it is very useful to consider the costs over a 10-year period in order to compare it with software that I purchase for lifetime to install at home. So that means 1,798.80 $ for the Pro version.

What estimates do others use?

by Archelaos

2/4/2026 at 6:02:31 PM

Any chance Voxtral Mini Transcribe 2 will ever be an open model?

by derac

2/5/2026 at 5:23:32 PM

I think this is it. https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-26...

by gunalx

2/4/2026 at 11:56:07 PM

Has anyone compared to Deepgram Flux yet for realtime?

by Rapzid

2/4/2026 at 7:47:44 PM

I added it to my bot agent, let's see how it performs

by tallesborges92

2/5/2026 at 12:38:52 AM

my struggle with VTT is always the accent. it doesn't understand my English too well because of my non native accent

by upcoming-sesame

2/4/2026 at 8:14:37 PM

Nice. Can this be run on a mobile device?

by atentaten

2/5/2026 at 3:02:22 AM

Cannot wait to try it on Spokenly

by _blackhawk_

2/4/2026 at 11:23:47 PM

Smells Like Teen Spirit survives another challenge!

Voxtral Transcribe 2:

Light up our guns, bring your friends, it's fun to lose and to pretend. She's all the more selfish, sure to know how the dirty world. I wasn't what I'd be best before this gift I think best A little girl is always been Always will until again Well, the lights out, it's a stage And we are now entertainers. I'm just stupid and contagious. And we are now entertainers. I'm a lot of, I'm a final. I'm a skater, I'm a freak. Yeah! Hey! Yeah. And I forget just why I taste it Yeah, I guess it makes me smile I found it hard, it's hard to find the well Whatever, never mind Well, the lights out, it's a stage. You and I are now entertainers. I'm just stupid and contagious. You and I are now entertainers. I'm a lot of, I'm a minor. I'm a killer. I'm a beater. I'm a nerd. I'm a nerd. I'm a nerd. I'm a nerd. I'm a nerd. I'm a nerd. I'm a nerd. I'm a nerd. I'm a nerd. And I forget just why I taste it Yeah, I guess it makes me smile I found it hard, it's hard to find the well Whatever, never mind I know, I know, I know, I know, I know Well, the lights out, it's a stage. You and I are now entertainers. I'm just stupid and contagious. You and I are now entertainers. I'm a lot of, I'm a minor. I'm a killer. I'm a beater. I'm a nerd. I'm a nerd. I'm a nerd. I'm a nerd. I'm a nerd. I'm a nerd. I'm a nerd. I'm a nerd. I'm a nerd.

Google/Musixmatch:

Load up on guns, bring your friends It's fun to lose and to pretend She's over-bored, and self-assured Oh no, I know a dirty word Hello, hello, hello, how low? Hello, hello, hello, how low? Hello, hello, hello, how low? Hello, hello, hello With the lights out, it's less dangerous Here we are now, entertain us I feel stupid and contagious Here we are now, entertain us A mulatto, an albino A mosquito, my libido, yeah Hey, yey I'm worse at what I do best And for this gift, I feel blessed Our little group has always been And always will until the end Hello, hello, hello, how low? Hello, hello, hello, how low? Hello, hello, hello, how low? Hello, hello, hello With the lights out, it's less dangerous Here we are now, entertain us I feel stupid and contagious Here we are now, entertain us A mulatto, an albino A mosquito, my libido, yeah Hey, yey And I forget just why I taste Oh yeah, I guess it makes me smile I found it hard, it's hard to find Oh well, whatever, never mind Hello, hello, hello, how low? Hello, hello, hello, how low? Hello, hello, hello, how low? Hello, hello, hello With the lights out, it's less dangerous Here we are now, entertain us I feel stupid and contagious Here we are now, entertain us A mulatto, an albino A mosquito, my libido A denial, a denial A denial, a denial A denial, a denial A denial, a denial A denial

by asah

2/4/2026 at 11:26:15 PM

(when it was released, adults/press/etc. found SLTS famously incomprehensible and then they realized that the kids didn't understand the lyrics either, and Weird Al nailed it with his classic, Smells Like Nirvana: https://www.google.com/search?q=Smells+Like+Nirvana )

by asah

2/4/2026 at 5:52:18 PM

Can it translate in real time?

by ewuhic

2/5/2026 at 6:32:26 AM

Real time as in at >1x speed? Probably?

Real time as in per-word basis? Probably not?

by numpad0

2/4/2026 at 10:22:07 PM

Also curious about this. Just need real time German to English. What does this?

by unstatusthequo

2/5/2026 at 8:38:00 AM

Really cool.

by kranke155

2/5/2026 at 4:17:02 AM

wow Mistral really cooked

by bytesandbits

2/4/2026 at 10:26:16 PM

Disappointing how this lacks a clear reference implementation, other than what's mixed into almost-unreleased vLLM (nightly version) stuff. I'm ok with Open Weights being a form of OSS in the case of models, because frankly I don't believe that, for large LLMs, it is feasible to release the training data, all the orchestration stuff, and so forth. But it can't be: "here are the weights, we partnered with vLLM for inference". Come on. Open Weights must mean that you put me in a position to easily write an implementation for any hardware.

p.s. even the demo uses a remote server via websocket.

by antirez

2/4/2026 at 5:31:35 PM

I'm on voxtral-mini-latest and that's why I started seeing 500s today lol

by dumpstate

2/4/2026 at 4:52:58 PM

Pseudo related -- am I the only one uncomfortable using my voice with AI, out of concern that once it is in the training data it is forever reproducible? As a non-public person it seems like a risk vector (albeit small).

by boringg

2/4/2026 at 6:03:33 PM

It's a real issue, but why do you only see it in ai? It's true for any case where you're speaking into a microphone

Depending on the permissions granted to apps on your mobile device, it can even be passively exfiltrated without you ever noticing - and that's ignoring the video clips people take and put online. Like your grandma uploading to Facebook a short moment from a Christmas meet or similar

There have already been successful scams - eg calls from "relatives" (AI) calling family members needing money urgently and convincing them to send the money...

by ffsm8

2/5/2026 at 2:51:34 PM

I completely agree - but I think those scams you refer to are less explicit but could potentially be anywhere.

With AI, I am intentionally providing them my voice. I'm not sure the value is worth the security risk.

by boringg

2/4/2026 at 4:06:03 PM

[flagged]

by varispeed

2/4/2026 at 4:18:12 PM

Many people speak Russian, including many who do not live in Russia, e.g. about 30% of Ukrainians.

Beyond that, I don't see how we stand to durably reduce military action by making languages mutually unintelligible.

https://simple.wikipedia.org/wiki/Russian_language#/media/Fi...

by Empact

2/4/2026 at 4:25:09 PM

Don't they have a partnership with the French Armed Forces? I am sure they are interested in automating Russian Audio or Text (-> Russian Text) -> French text.

by laffOr

2/4/2026 at 5:59:54 PM

Fair point.

by varispeed

2/4/2026 at 4:26:23 PM

They've chosen languages which would help them to cover the highest percentage of human population..

by gostsamo