VibeVoice: Open-source frontier voice AI

4/28/2026 at 1:00:09 PM

This is not a new model. Also, it hallucinates a lot. Also, it's very heavy and slow at inference. It's also bad at multilingual.

Edit: I'm talking purely about speech to text (STT). Not sure about the other things this can do.

by steinvakt2

4/28/2026 at 5:20:07 PM

It has some perks, is a bit more expressive in some cases, but overall is trained on really noisy data, uses more memory, and isn't that fast - I'm talking about the (7b?) version that they released then removed quickly (vibevoice-community on github) - I still use chatterbox turbo and sometimes qwen TTS.

by terbo

4/28/2026 at 1:14:18 PM

Yeah, I don't get why it is suddenly getting so much attention today, it is all over twitter too

by lblock

4/28/2026 at 3:05:11 PM

there is so much more subversive marketing out there than any of us can really fathom. i try not to be too paranoid but it's getting a lot harder every day.

i know someone who worked in what we might call the 'astroturfing' space within the entertainment industry. after having a few discussions with him and with things like this[0] becoming more known, it's really difficult to afford any assumption of organic intent when money is on the line - especially at the scale that microsoft works at compared to something as comparatively quaint as the music industry.

[0] https://www.wired.com/story/geese-chaotic-good-marketing-ind...

by GuinansEyebrows

4/28/2026 at 2:46:34 PM

Simonw (who has a bit of a Midas touch for posts here) just posted about it https://simonwillison.net/2026/Apr/27/vibevoice/

by xnx

4/28/2026 at 3:36:27 PM

To be fair, his Midas touch is a result of consistency and a lot of hard work.

It's like the gardener at one of the Oxford colleges said - it's really easy to create these perfect lawns, just turn up every day and trim and water it - for a couple hundred years.

by realty_geek

4/28/2026 at 6:36:21 PM

I thought they rolled it as well?

by soperj

4/28/2026 at 7:48:53 PM

As always with people: listen to what they say, not to what they do...

After all, they rarely do what they say themselves, so it's surely not entirely made up nonsense!

by ffsm8

4/28/2026 at 1:24:47 PM

well duh, they updated the news section

https://github.com/microsoft/VibeVoice/commit/e73d1e17c3754f...

which is microsoft for "we removed two dead links". AI innovation knows no limits!

by ramon156

4/28/2026 at 2:30:10 PM

Interestingly that seems to be in response to [1], which might indeed be the trigger for this.

[1] https://doublepulsar.com/microsoft-vibing-capturing-screensh...

by Vinnl

4/28/2026 at 7:25:01 PM

Yes, the SOTA is currently much more advanced.

by narrationbox

4/29/2026 at 6:22:09 AM

What do you consider to be SOTA?

by steinvakt2

4/28/2026 at 3:05:36 PM

It is not good for text to speech (TTS) either. I have been trying it for a few days. First of all, the documentation for the 1.5B model is not there. The 0.5B realtime model is a shit model. I was converting text line by line, and it was randomly adding music and couldn't handle special characters like "…".

I am really disappointed with this model, to say the least.

by gagan2020

4/30/2026 at 2:50:01 PM

> ...it was randomly adding music...

I've been noticing this with the Mistral Voxtral TTS models too. I have my AI record a morning briefing podcast for myself, and occasionally there are sounds like music at the start (the british voice had a musical tone underneath that sounded a little like the end of the BBC News theme). I don't think I've ever encountered that with the OpenAI TTS models, so they're now my default go-to again.

by SyneRyder

4/28/2026 at 7:19:46 PM

yep, it seems this was trained on a large amount of podcasts with ad jingles or phone call queues with elevator music. I was also pretty disappointed to run the TTS last week.

by tjungblut

4/28/2026 at 6:40:21 PM

The 7B parameter Vibevoice TTS model is still the most impressive local TTS model i've tried. It was pulled by Microsoft a few days after its release due to "abuse potential" but it can be found in various community maintained huggingface repos.

by Stagnant

4/28/2026 at 4:36:55 PM

you saved us a lot of time here.... i unstarred the repo

moving on....

by zuzululu

4/28/2026 at 5:00:42 PM

I don't really pay attention to stars. Do people use them as bookmarks? Why would you star a repo if you knew so little about it?

by Capricorn2481

4/28/2026 at 5:46:47 PM

Stars for me are basically "this might be interesting but I don't have time to look at it now, hopefully I'll think about it later and give it a second look".

by drusepth

4/28/2026 at 5:44:22 PM

I exclusively use stars as bookmarks which is why I always found it strange when people talked about lots of stars meaning high quality or trustworthy…I’ve learned since then that I’m probably in the minority (both in using stars as bookmarks and not caring about how many stars a repo has).

by einsteinx2

4/28/2026 at 6:06:09 PM

Judging by how many people apparently are paying bots to give their lazily vibe-coded repos thousands of stars, it seems like people both simultaneously take stars seriously while not taking them seriously at all. It breaks my brain.

by tombert

4/28/2026 at 3:11:44 PM

You just saved me an afternoon.

by scotty79

4/28/2026 at 5:32:57 PM

Saved a lot of my time thanks!

by Tamatarr

4/28/2026 at 5:00:32 PM

I'm shocked, shocked to find that Microsoft takes credit for a slow, unoriginal product that doesn't actually do what it advertises.

by tombert

4/28/2026 at 5:26:26 PM

Imagine the balls it took to willingly attach the Microsoft label to the front of the product that is Teams.

by logicchains

4/28/2026 at 6:30:51 PM

I mean the same can be said about most versions of Windows as well. People act like Windows 11 is where it all went sour, but I've personally kind of hated it since Windows XP.

I feel like a recurring pattern with Microsoft is to create something quickly, market it aggressively and push for everyone to use it immediately, and only once it is installed everywhere do people suddenly realize how terrible it is, but it's too late to change.

by tombert

4/28/2026 at 7:05:59 PM

I'm surprised you picked XP as the falling point. I didn't enjoy the days of reinstalling 95/98/ME every 6 months to avoid driver weirdness and seemingly random failures. XP was built on the foundation of 2000, which tended to make it more robust vs. its predecessors.

Vista on the other hand...

by NBJack

4/28/2026 at 7:28:38 PM

I mean, part of it is that I really hated the Fisher Price look to it, but it was also the first time I ever felt like I had to "hack" things to make stuff work. I had to muck with registry keys. Oh, and it was the first time that I noticed that Windows repair tools do not work.

I suspect I might have hated 9x more but I was pretty young when they came out and I didn't really "get into" computers until XP, and I disliked it enough to dual-boot Linux as a twelve year old.

by tombert

4/28/2026 at 1:21:51 PM

[flagged]

by SecretDreams

4/28/2026 at 2:08:39 PM

The nuance is lost on LLM agentic dominant partakers.

by NobleLie

4/28/2026 at 1:11:19 PM

I think we should stop calling this type of model open source. They are indeed "open weight." The training code is proprietary and never revealed.

https://github.com/microsoft/VibeVoice/issues/102

by maxloh

4/28/2026 at 1:59:13 PM

Indeed. We now live in a world where freeware is named open source. We are very sorry, Stallman.

by jcmfernandes

4/28/2026 at 2:15:59 PM

If you're going to apologize to Stallman, you should apologize for conflating open source with software freedom. ;D

by MarsIronPI

4/28/2026 at 2:33:45 PM

With free libre software, where freedom and liberty are about what the end user is empowered with actually, the software is mostly metonymic. Free software, free society, because there are free people in the middle of course.

by psychoslave

4/28/2026 at 2:37:53 PM

Right, as I said elsewhere, maybe let's just let "open-source" have it.

"Open-source" can be "anything you can go out and grab a copy of and use" but doesn't give you much legal certainty about any of it, and reserve "free software" for the other, better thing.

by jrm4

4/28/2026 at 3:02:24 PM

But free software lost its way around GPLv3. From the end user's perspective, GPLv3 says that you can only use the software if it's either a cloud service, on hypothetical open firmware devices, or if you install it yourself.

AGPLv3 partially solves the issue by blocking people like Google from using it to build proprietary cloud services that take away their users' freedom. (It still doesn't solve the problem where providers use network effects to achieve the same end game.)

by hedora

4/28/2026 at 4:09:14 PM

I don't understand this either. The GPL doesn't address end users and their use of software at all, to be technical. It only addresses what terms of copyright redistributors of GPLed software are allowed to apply in-turn to subsequent end users.

by jrm4

4/28/2026 at 4:33:02 PM

The point of the Free in free software was always to protect the users of the software, not the vendors or the redistributors. (This is why the license focuses on the redistributors -- the mechanisms of the license limit their rights in order to protect others' rights.)

The first sentence of the GNU manifesto says this, and a few sections later in the document elaborate on the point:

https://www.gnu.org/gnu/manifesto.html

Note, in particular, footnote [1], which explains that it's OK for distributors to ask for payment, but that it's never OK for users to have to ask for permission to use the software, and the section "Why I Must Write GNU".

Since then, software service monopolies became common, and all of the most end-user-hostile systems on earth rely heavily on the GNU system. At this point, we're paying for permission to use those services with our money, our data, our democracy, etc.

I certainly cannot give you permission to use any of the GPLed services that I have used, or that I've been paid to extend. Therefore, I say the free software movement has lost its way.

by hedora

4/28/2026 at 8:00:50 PM

I see your point and I agree. It's just that when you say "GPLv3 says that you can only use the software if it's either a cloud service, hypothetical open firmware devices" that's a stretch and not really true. AIUI vendors can pre-install GPLv3 software as long as they let you actually then replace the software (i.e. no DRM or locked bootloader). The firmware can still be non-GPL and non-replaceable. You just can't use GPLv3 code in the non-replaceable bootloader or firmwares.

by MarsIronPI

4/29/2026 at 12:10:41 PM

AFAIK you can use GPLv3 for non-replaceable stuff. The thing is only to allow the users to replace it iff it's physically possible to do so. If you make a device that boots from a ROM it's not a problem. If you sign your updates and keep your public key on a ROM and there is no way to boot anything else… there's a problem.

by LtWorf

4/29/2026 at 7:12:39 PM

> If you sign your updates and keep your public key on a ROM and there is no way to boot anything else… there's a problem.

As there should be.

by MarsIronPI

4/28/2026 at 3:07:30 PM

> From the end user's perspective, GPLv3 says that you can only use the software if it's either a cloud service, on hypothetical open firmware devices, or if you install it yourself.

What in the world do you mean?

by MarsIronPI

4/28/2026 at 4:41:51 PM

The anti-tivo clause bans things like Apple pre-installing GPLv3 software on macs, but allows them to let you use exactly the same software as long as they do not give users access to the binary. AGPLv3 blocks both use cases, GPLv2 blocks neither.

On the spectrum of "things that take away user freedom", withholding the source code is bad. Withholding the source code, the binaries and physical access to the computer is obviously much worse! This latter business model is heavily subsidized by GPLv3.

by hedora

4/28/2026 at 9:26:51 PM

It doesn't ban apple from doing anything. They choose to avoid a license that was better for the users.

by LtWorf

4/28/2026 at 2:56:25 PM

I totally get you, but this is yet another thick layer away.

by jcmfernandes

4/28/2026 at 2:54:28 PM

I'm reserving that complaint for "open source" models which are released under non-open-source licenses.

I care that I know what I can DO with the project when I see it described as "open source".

by simonw

4/28/2026 at 3:00:04 PM

> I care that I know what I can DO with the project when I see it described as "open source".

Yes, the first of which is that you should be able to build it from source. Which requires the source code, and in this case data.

by yjftsjthsd-h

4/28/2026 at 3:23:31 PM

The OSI's take on this is that an open source model can be modified through fine-tuning etc, even if you can't rebuild it from scratch.

The problem with requiring "build from scratch" for open source models is that the number of interesting models with training data that can be openly licensed is close to zero.

If you trained your model on an unlicensed scrape of the web you can't release the data under an open source license!

The Open Source Initiative have a bunch of their thinking around this in their FAQ for the "Open Source AI definition": https://opensource.org/ai/faq#isn-t-training-data-required-t...

by simonw

4/28/2026 at 10:55:30 PM

That's a point.

It is legal to train on copyrighted materials, provided they were obtained legally. Most companies also train their models using user interactions with previous iterations.

It is impossible to release this data publicly, let alone license it to a third party. However, I believe that at least the training code and the data processing pipeline could, and should, be released in order to claim a model is truly "open source."

That said, Allen AI actually released several models with the full datasets available. It is impressive how they pushed the models' performance despite training on a limited set of publicly available data. Kudos to them.

by maxloh

4/28/2026 at 5:31:27 PM

> The OSI's take on this is that an open source model can be modified through fine-tuning etc, even if you can't rebuild it from scratch.

By this definition almost any binary can be "open source" since hex editors exist. (Or more usefully, you can use ghidra et al. to do more interesting changes.) I know GPL has a very specific view of things, but I'd like to quote an excerpt that I think is generally applicable from https://www.gnu.org/licenses/gpl-3.0.html -

> The “source code” for a work means the preferred form of the work for making modifications to it. “Object code” means any non-source form of a work.

Which is why I'm fine with "open weights", because that's saying the object code is under an open license.

> The problem with requiring "build from scratch" for open source models is that the number of interesting models with training data that can be openly licensed is close to zero.

So? If the number of open source models is zero, then the number of open source models is zero.

by yjftsjthsd-h

4/29/2026 at 12:12:30 PM

I think the OSI no longer has any authority since that stunt they pulled in their "elections".

by LtWorf

4/29/2026 at 2:56:25 PM

I'd missed that whole thing. Useful context: https://lwn.net/Articles/1014603/

by simonw

4/28/2026 at 3:42:39 PM

I would personally disagree slightly with this take. Being able to use something freely means, IMHO, that this can be done for all applications in a legal (and ideally ethical) fashion. Regulation often requires proving the quality or provenance of data. Open source often has, IMHO, a very libertarian view of things, focusing on the rights of the user and not society in general.

by riedel

4/28/2026 at 3:24:02 PM

They’ll never reveal the data, because that would reveal this is all built on stolen work.

by rogerrogerr

4/28/2026 at 3:40:37 PM

Some of the models DO reveal the data, and it's still built on "stolen work" in that it's unlicensed scrapes of the Web. Here's an example:

https://huggingface.co/allenai/OLMo-2-0325-32B

Here's one of their training mixes: https://huggingface.co/datasets/allenai/dolma3_pool - which includes 8 trillion tokens from Common Crawl.

by simonw

4/28/2026 at 3:12:17 PM

That would be “permissive license”

Maybe we should have a little cue card for models: vendor/name, size, open weights, open source, permissive license.

It’s simple enough an idea.

by data-ottawa

4/28/2026 at 1:20:31 PM

> we should stop calling this type of model open source. They are indeed "open weight”

This ship has sailed. It’s now in the same category as hacker/cracker and the pronunciation of GIF.

by JumpCrisscross

4/28/2026 at 1:21:53 PM

I think you mean GIF.

by andy_ppp

4/28/2026 at 4:16:51 PM

The inventor of GIF didn't begin with a document* clearly laying out what is and isn't to be called a "GIF."

I think it's right to push back whenever a huge tech corporation tries to build goodwill by falsely using terms like "open source."

*https://opensource.org/osd

by engeljohnb

4/28/2026 at 6:24:04 PM

To be fair, the initiators of the "Open Source" movement also co-opted a term that previously had a much more flexible meaning (and had been around for more than a decade at that point.) Just writing a document attributing specific criteria to a term does not grant one authority over the use of that term.

Ironically, the roots of the Open Source movement are a direct response to the Free Software movement largely because it was considered too ideological and unfriendly to corporate interests (i.e. monetization.)

by keeda

4/28/2026 at 4:34:13 PM

> inventor of GIF didn't begin with a document clearly laying out what is and isn't to be called a "GIF”*

Neither did the inventors of AI. A third party published a document after corporations went with open weights = open source and a spoiler block in FOSS wanted all training data published.

> it's right to push back whenever a huge tech corporation tries to build goodwill by falsely using terms like "open source

I think it’s counterproductive. Most people only see a squabble, which makes any ensuing points from the open-source community seem silly. Those who care can continue using the more-precise language they choose to.

Put another way, there is a difference between using terms like cracker and fully spelling out cryptocurrency, and telling people who use hacker and crypto more loosely that they’re wrong. They aren’t wrong and that isn’t meaningful feedback. At the same time, the person using the precise language isn’t wrong either.

by JumpCrisscross

4/28/2026 at 4:52:23 PM

There's a big difference between correcting some random commenter on an internet forum and correcting Microsoft.

> think it’s counterproductive. Most people only see a squabble, which makes any ensuing points from the open-source community seem silly.

Only to people that truly don't care whether something's open source. In which case, Microsoft using the term (correctly or incorrectly) won't change their perception.

But the people who do care won't like to be misled by Microsoft. There's a reason the term is right in the headline: people respond to it.

I wish I had time to come up with a better example, but it's like if a AAA game company says they've released a "native Linux build," but really they're just packaging the Windows build with Wine.

99% of people won't care, neither about the news nor the deception. But for that last 1%, any goodwill garnered with the headline would be gone, and the game company are the ones who look foolish, not the people calling them out.

by engeljohnb

4/28/2026 at 1:33:27 PM

It's the same as GIS, you wouldn't say jizz now would you?

by giancarlostoro

4/28/2026 at 1:48:34 PM

I absolutely do, every single time it comes up.

by DoctorOW

4/28/2026 at 2:33:52 PM

I hadn't thought about how to pronounce GIS, but do you have a problem with the pronunciation of the Japanese Industrial Standards: JIS?

by ziml77

4/28/2026 at 2:58:37 PM

I've been pronouncing both of them as /dʒɪs/ like hiss and not /dʒɪz/. I am, however, not a native speaker of English. I wonder if native speakers gravitate towards the z more?

by s20n

4/28/2026 at 3:52:55 PM

I would end both with the S sound, but I'm operating under the assumption that the person I was replying to either pronounces their Ss as Zs or can't tell the difference between the S and Z sounds.

Because the other assumption I could have gone with is the less charitable take that they know GIS with a soft G doesn't sound like jizz, but they were just looking for a crude way to mock the soft G.

by ziml77

4/28/2026 at 3:45:05 PM

I think it depends on region. Related, many speakers pronounce chips and salza, Tezla, Wezley.

by bronson

4/28/2026 at 1:40:47 PM

I take it that you haven’t met the Arcgees people…

by notabotiswear

4/28/2026 at 2:07:26 PM

i am absolutely going to from now on

by dijksterhuis

4/28/2026 at 1:55:09 PM

The developer of the format declared the pronunciation 30+ years ago. It has always been jif.

by kevin_thibedeau

4/28/2026 at 2:14:22 PM

Yeah, but society overruled them.

by Geezus_42

4/28/2026 at 2:02:01 PM

How do you pronounce giraffe?

by pardon_me

4/28/2026 at 2:50:38 PM

Same way I pronounce my first name btw ;) but I think of "gif" as "gift" and this is probably the subconscious association people make without realizing it.

by giancarlostoro

4/28/2026 at 3:33:32 PM

Which is why I find it fun to bring up that in Old English "gift" hadn't yet picked up the "t" and was spelled "gif", but in Old English "g" was most commonly "HY". I like the Old English pronunciation of "gif" as "HYEEF", which is a "compromise" position that often makes some of both soft-g and hard-g "gif" pronunciation fans angry.

by WorldMaker

4/28/2026 at 4:02:29 PM

I sometimes just pick the opposite of whatever everyone agreed to just for fun. I do the same when people cry about vim or emacs since I have used both. ;)

Some men just want to watch the world burn. At least it's mostly harmless fun anyway. It's even funnier when they bring up how my name is pronounced in defense of "jiff" and I tell them, so you're calling me the expert in "Gi" pronunciation then? :)

by giancarlostoro

4/28/2026 at 3:54:11 PM

I have never heard this third option before but I love it!

by ziml77

4/28/2026 at 5:20:49 PM

I do too. The idea that any one pronunciation is more correct based on the letters is quite amusing, given there's examples that work all ways.

by pardon_me

4/28/2026 at 2:06:16 PM

How do you pronounce gift?

by parineum

4/28/2026 at 10:17:13 PM

Jift

by giancarlostoro

4/28/2026 at 2:39:26 PM

gorge = george

by briffle

4/28/2026 at 1:48:17 PM

And "hallucination" which should have been "delusion".

Way early on (spring 2023) people tried to stop it, but no luck.

by WarmWash

4/28/2026 at 2:49:35 PM

Why would it be delusion? It’s making something up which isn’t there and describing it.

by MagicMoonlight

4/28/2026 at 2:52:28 PM

A hallucination is a false sensory experience.

A delusion is a false mental belief.

Basically hallucinations are false external things, and delusions false internal things. You hallucinate a pink elephant, you delude yourself into thinking trump won 2020.

by WarmWash

4/28/2026 at 3:35:47 PM

Devil's advocate here: I can give you a binary of my open source MIT code and never give you the code. The code is still MIT licensed, and open source. You just have no access to it.

That said, I entirely agree that MS is misrepresenting their openness here, which isn’t in the least surprising.

by WhyNotHugo

4/28/2026 at 3:49:49 PM

? Do you know what “source” means in open source? Like, what is the source of the binary? It’s the code. That’s the source in open source.

by Otek

4/28/2026 at 3:53:39 PM

I don't disagree, but it is perfectly acceptable per the MIT license, which is an OSI approved license. MIT doesn't require source distribution with the binary (which is why from the developer perspective, it's a more "permissive" license)

by freedomben

4/28/2026 at 4:04:04 PM

The license describes what users are allowed to do with the source code, it doesn’t (and shouldn’t) define what a creator has to do to make the source code open.

by clickety_clack

4/28/2026 at 6:10:40 PM

Then it sounds like you're philosophically opposed to copyleft licenses like the GPL. That's ok, we can agree to (in my case vehemently) disagree, but your philosophy is inconsistent with the commonly accepted definition of "open source" such as OSI's OSD[1][2]

[1]: https://opensource.org/licenses

[2]: https://opensource.org/osd

by freedomben

4/28/2026 at 6:35:22 PM

I think you completely misunderstand me. I don’t have any opinion on GPL, but in the links you shared, even OSI considers the license to be separate from the definition of open source “Open source licenses are licenses that comply with the Open Source Definition”. You can use a license that open source projects use (ie MIT), and still keep the source closed, or you can write one that puts obligations on you if you want. In fact, you can use or write pretty much any license you want if you own the copyright.

by clickety_clack

4/28/2026 at 3:50:36 PM

In their defense, most everyone else does the same thing. They still shouldn't do it, but at least they're not the trendsetter here (though they are contributing to the ongoing problem)

by freedomben

4/28/2026 at 2:18:15 PM

At least it's MIT licensed! As much as non-open training data irks me, restrictive licensing irks me more!

by btown

4/28/2026 at 3:46:06 PM

What is the problem with restrictive licensing? Most of the restrictions only kick in if you have 1M users, etc.?

by cute_boi

4/28/2026 at 3:08:48 PM

Open weights is not exactly right either because we do get source of the software that uses those open weights.

Maybe open inference?

But we often also get source code for fine-tuning the model.

So maybe it's closer to open source than to anything else?

Isn't it a bit like not calling a game open source because the engine tooling used to make it isn't open source and they didn't publish .psd files with asset designs?

by scotty79

4/28/2026 at 2:35:52 PM

I'm genuinely torn on this one; I get technically why not, but why I think I have no problem with it is the wishy-washiness of "open source" generally.

As I teach this stuff to people newer to this tech, it's probably just easier and more helpful to refer to the wide array of "stuff you can just download and use yourself" as "open-source" and then after that, go deeper and talk about why Stallman was right, how "Free Software" was first, etc.

by jrm4

4/28/2026 at 2:39:48 PM

What you said makes a lot of sense. Free software should not be confused with open source

by bitvvip

4/28/2026 at 1:33:02 PM

I mean, you have "AI" which means just about anything in marketing speak, "Agentic" is kind of becoming similar, hopefully they don't goof that one too badly, would be nice to know what you are trying to sell me. Used to be "Cloud" meant storage not just hosting (I guess it still does).

Then there's "Smart" in front of Car, Phone, TV, and so on... Meaning different things.

I do think "Open Weight" should be more commonly used. There's definitely communities that spring up that build the training infrastructure and inference infrastructure around open models on the other hand.

by giancarlostoro

4/28/2026 at 4:08:28 PM

[dead]

by ilqr_jb

4/28/2026 at 1:44:20 PM

Openwashing is the new greenwashing, which, coincidentally, seems to have gone out of fashion a few hundred datacentres ago.

by notabotiswear

4/28/2026 at 1:52:19 PM

it was replaced with abundancewashing

by dist-epoch

4/28/2026 at 2:16:31 PM

What is "abundancewashing"?

by Geezus_42

4/28/2026 at 2:30:03 PM

> “This means a future of abundance. A future where there is no poverty, where people can have whatever they want in terms of goods and services.” – Elon Musk

> “I think we see a path now where the world gets much more abundant and much better every year.” – Sam Altman

https://www.diamandis.com/blog/elon-sam-abundance

by dist-epoch

5/1/2026 at 6:43:31 PM

Ahh, thank you. The lipstick they put on the pig that is Elon's dream of an Afrikaner enclave on Mars where the 1% can finally extricate themselves from the rest of the species.

by Geezus_42

4/28/2026 at 4:19:22 PM

I think in this category, Voxtral by Mistral is a lot better. It also happens to be small enough to run on webGPU https://huggingface.co/spaces/mistralai/Voxtral-Realtime-Web...

by isodev

4/29/2026 at 7:52:53 PM

Better accuracy than whisper large? For English? What about multilingual?

by steinvakt2

4/28/2026 at 1:16:51 PM

Interesting story about this repo/product/author by cybersecurity researcher Kevin Beaumont: https://cyberplace.social/@GossiTheDog/116454846703138243

by pluc

4/28/2026 at 5:26:25 PM

got to love how they're trying to hide the links.

by tacticus

4/28/2026 at 12:52:29 PM

Isn't this project the one Microsoft published but then soon after pulled it for security/safety reasons? What has changed since then?

by embedding-shape

4/28/2026 at 12:55:59 PM

Look at the "News" section in the readme - The original TTS model is gone from this repo (you can still find it in other places), but the STT/ASR, long form TTS, and streaming TTS models are newer.

by 542458

4/28/2026 at 1:03:44 PM

[flagged]

by SingleSourceAI

4/28/2026 at 1:47:23 PM

It’s confusing (at least for me) because the project covers a number of things including what you are mentioning.

by infecto

4/28/2026 at 2:13:09 PM

[off topic]

When explanations get posted directly in HN comments, I imagine someone somewhere in the world is able to learn in spite of their Internet restrictions/firewalls

People will also post their own interpretations in response to comments, and quickly find out they missed something.

… But if you try to automate it, like include a summary under every HN post, you encourage laziness too much and are pre-chewing too heavily. Some balance here.

[on topic]

(OK I’m done making excuses, time to read the article… thanks for the encouragement!)

I thought this was not explained in the readme directly but in fact I missed it. I wasn’t going to read Microsoft’s entire changelog! But it was substantive, thanks to sibling commenter:

“2025-09-05: VibeVoice is an open-source research framework intended to advance collaboration in the speech synthesis community. After release, we discovered instances where the tool was used in ways inconsistent with the stated intent. Since responsible use of AI is one of Microsoft’s guiding principles, we have removed the VibeVoice-TTS code from this repository.”

by Barbing

4/29/2026 at 1:41:10 PM

This was on HN 7 months ago:

https://news.ycombinator.com/item?id=45114245

Every time an STT/TTS model is posted I wonder if it will change my current workflow on macOS, which is:

STT with Parakeet-V3 via Hex [1] app for near-instant good-enough transcription for talking to AI agents.

TTS using KyutAI’s Pocket-TTS, an amazing 100M-param model. I used this to make a "voice" plugin [2] for Claude Code

So far I haven’t seen anything that replaces these for me, or haven't been persuaded enough to spend time testing an alternative (explore/exploit and all that).

[1] Hex STT app - https://github.com/kitlangton/Hex, which is macOS-only. (also good free/OSS alternatives: Handy, VoiceInk. No need for Wispr, Superwhisper etc)

[2] Claude Code Voice Plugin - https://pchalasani.github.io/claude-code-tools/plugins-detai...

by d4rkp4ttern

4/29/2026 at 7:52:28 PM

What do you consider to be the model with highest accuracy?

by steinvakt2

4/29/2026 at 7:57:22 PM

I guess you mean for STT. For my use case of talking to AIs or coding agents, pure STT accuracy is less important than transcription speed. Transcription needs to be near-instant, and accuracy "good enough" so that the AIs can "read between the lines". Parakeet-V3 gives exactly this.

by d4rkp4ttern

4/28/2026 at 1:27:17 PM

Interesting to see "vibe" enshrined by the likes of Microsoft as an AI product word.

by aqme28

4/28/2026 at 1:37:16 PM

Especially when "vibe coded" can have a negative connotation meaning quickly put together without understanding.

by accrual

4/28/2026 at 3:16:16 PM

In my mind, Vibe-anything means "some slop carelessly thrown together to ship as fast as possible." Wild that it's being used in a serious product name!

by ryandrake

4/28/2026 at 2:00:52 PM

I’m just surprised they put the name of the e-waste slop company in their product

by Barbing

4/28/2026 at 4:22:16 PM

Maybe they were trying to make a pun on "Via Voice", the cursed IBM STT from the 90s?

by amlib

4/28/2026 at 4:09:31 PM

I'm honestly more surprised that they could resist the temptation to call it Copilot

by lvncelot

4/28/2026 at 6:40:54 PM

Microslop Copilot for Voice! After they renamed Office, they surely will rename this one, too.

by tempodox

4/28/2026 at 1:49:55 PM

[flagged]

by altmanaltman

4/28/2026 at 2:31:09 PM

"get offended" is just what the clickbait news cycle made of it. It was based on the post at [1], and this is all it said:

> We need to get beyond the arguments of slop vs sophistication and develop a new equilibrium in terms of our “theory of the mind” that accounts for humans being equipped with these new cognitive amplifier tools as we relate to each other

[1] https://snscratchpad.com/posts/looking-ahead-2026/

by Vinnl

4/28/2026 at 3:57:57 PM

Are you sure you have the correct reference?

I think everyone else is relating to

https://futurism.com/artificial-intelligence/microsoft-bans-...

by fg137

4/29/2026 at 12:41:15 PM

Heh well, that article says it "clearly infuriated executives at the company", and links to [1], which is exactly what I described. But banning it on Discord does kind of retroactively prove their point, I suppose.

[1] https://futurism.com/artificial-intelligence/microsoft-satya...

by Vinnl

4/28/2026 at 2:48:46 PM

When a CEO says "We need to get beyond the arguments of X" it is universally a polite, PR-scrubbed way of saying, "Please stop talking about X, it is hurting our business" which is how the media interpreted it.

by altmanaltman

4/28/2026 at 12:45:19 PM

So we've really just settled on Vibe as the verb for AI then?

by podgietaru

4/28/2026 at 12:51:21 PM

Why use precise technical language when you can just vibe with your AI system?

by pryanshu89

4/28/2026 at 12:53:50 PM

I'd be willing to bet it will be "Word of the Year" for 2026. Merriam-Webster had 'slop' for 2025, and 'polarization' for 2024. Is there a prediction market for this?

by giarc

4/28/2026 at 1:21:09 PM

it'll probably be something we're not even talking about yet - we still have 7 months in which to make the world even worse

by internet_points

4/28/2026 at 12:44:42 PM

Great post last night from Simon: https://simonwillison.net/2026/Apr/27/vibevoice/

by CubsFan1060

4/28/2026 at 12:53:51 PM

Note that this just covers the Speech-to-Text/Speech-Recognition aspect (à la Whisper); there are also models for long-form Text-To-Speech and streaming Text-To-Speech.

by 542458

4/28/2026 at 1:23:56 PM

“VibeVoice can only handle up to an hour of audio”

Why?

by JumpCrisscross

4/28/2026 at 1:05:33 PM

You have selected Microsoft Sam as the computer's default voice.

by Anonyneko

4/28/2026 at 1:38:52 PM

My friends and I had fun in the computer lab with Microsoft Sam, inputting long strings of characters to create funny sound effects. Sususususususu.

by accrual

4/28/2026 at 1:46:51 PM

Holy moly, a Microsoft AI product that isn't named Copilot!

by ryukoposting

4/28/2026 at 1:49:31 PM

Missed opportunity to call it Vopilot

by DoctorOW

4/28/2026 at 2:55:07 PM

Slopilot

by silverwind

4/28/2026 at 2:49:07 PM

Still waiting for the open weights model that conclusively beats the multi-year old Whisper in accuracy, features, and performance.

by xnx

4/28/2026 at 3:13:34 PM

It's crazy that a lot is happening in open models for stt, but there's very little progress when it comes to results, esp multilingual.

by scotty79

4/28/2026 at 7:12:38 PM

The 60-minute single-pass transcription is the part that actually matters. Most ASR models chunk audio, and you lose speaker continuity across chunk boundaries. If the diarization actually holds up on hour-long recordings without drifting, that's a real solve for podcast and meeting transcription workflows.

by vijgaurav

4/28/2026 at 2:12:04 PM

Microsoft Store App Vibing.exe Accused of Harvesting Screens, Audio, and Clipboard Data:

https://cyberpress.org/microsoft-store-app-vibing-exe-accuse...

by chaosprint

4/28/2026 at 5:13:41 PM

Surprised it wasn't called Copilot Voice

by triage8004

4/28/2026 at 3:34:08 PM

I've been using VibeVoice's ASR (speech to text) model quite intensively for the past month and have found it to be a lot more reliable and out-of-the-box functional than Whisper, Parakeet and other models. The fact that it has diarization built into the model is a huge win in my book. Without that, you have to run a different model just for diarization, which adds significantly to the overall processing time, whereas VibeVoice gives you reliably great results. Big fan.

by mberg

4/29/2026 at 6:14:18 AM

I built speech-swift, which focuses on on-device speech processing like VibeVoice, but specifically leverages Apple Silicon's capabilities for ASR, TTS, and VAD without cloud dependency. Our ASR supports 52 languages with a real-time factor of 0.06. https://soniqo.audio/benchmarks
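For readers unfamiliar with the metric: real-time factor (RTF) is usually defined as processing time divided by audio duration, so lower is faster. A minimal sketch of what 0.06 would imply, assuming that definition (the function name is made up for illustration and is not part of speech-swift):

```python
def processing_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimate processing time, assuming RTF = processing time / audio duration."""
    return audio_seconds * rtf


one_hour = 60 * 60  # seconds of audio
estimate = processing_seconds(one_hour, 0.06)
print(f"~{estimate / 60:.1f} minutes to transcribe one hour of audio")
```

Under that assumption, an hour of audio takes a few minutes to process. Actual throughput depends on hardware and model size.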

by ipotapov

4/28/2026 at 1:55:21 PM

I took a look into local options for ASR and diarization some months ago; I missed that VibeVoice now has this feature.

My conclusion back then (which came only from shallow research on the topic and zero real experience, mind you) was that Whisper + Pyannote was the "stable" approach.

Have the VibeVoice, Voxtral, Qwen or the Nemo solutions caught up in segmentation and speaker recognition?

by frangonf

4/28/2026 at 6:48:13 PM

It highly depends on the sort of data you’re processing (phone calls, podcasts, multi-person meetings recorded on a single channel?). For NVIDIA/NeMo, check out their Sortformer diarization models (also streaming).

by woodson

4/28/2026 at 1:10:16 PM

In the past month or so, I added two models to my app Whisper Memos (https://whispermemos.com):

- Cohere Transcribe (self hosted)

- Grok Speech To Text (they provide an API, only $0.10/hr!)

They are both excellent. I'm not sure about this one. Would you like to see it in a consumer speech to text app?

by Void_

4/28/2026 at 1:19:48 PM

I've had good experiences with the Mistral Voxtral models (I've used the API, but some of the model-variants are open weight)

by olejorgenb

4/28/2026 at 1:58:36 PM

Does Cohere work with longer transcripts? Do you have to do some magic to merge recordings over 35 seconds long?

by Barbing

4/28/2026 at 1:21:03 PM

Have you tried qwen?

by 2ndorderthought

4/28/2026 at 1:22:58 PM

Any non-Musk alternatives that are comparable in quality and cost?

by SecretDreams

4/28/2026 at 2:01:40 PM

Voxtral competes on price ($0.003/min) and quality. Speechmatics has best in class accuracy but is a bit more expensive ($0.004/min)
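To put the per-minute prices on the same footing as the per-hour Grok figure quoted upthread, here is a quick conversion. All prices are the ones commenters quoted in this thread and are not independently verified:

```python
def per_hour(price_per_minute: float) -> float:
    """Convert a per-minute price (USD) to a per-hour price (USD)."""
    return price_per_minute * 60


# Prices as quoted by commenters in this thread (unverified).
quoted_usd_per_hour = {
    "Grok Speech To Text": 0.10,         # quoted per hour
    "Voxtral": per_hour(0.003),          # quoted per minute
    "Speechmatics": per_hour(0.004),     # quoted per minute
}

# Print cheapest first.
for name, usd in sorted(quoted_usd_per_hour.items(), key=lambda kv: kv[1]):
    print(f"{name}: ${usd:.2f}/hr")
```

On those numbers, Voxtral works out to about $0.18/hr and Speechmatics to about $0.24/hr, against the quoted $0.10/hr for Grok.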

by jayphen

4/28/2026 at 1:29:39 PM

Our default is still OpenAI Whisper. Grok is just a choice for users who might prefer it.

by Void_

4/28/2026 at 1:26:35 PM

What’s the current state of the art, for each of training locally and in the cloud, for learning my voice?

by JumpCrisscross

4/28/2026 at 2:02:58 PM

Locally maybe https://voicebox.sh/

Elevenlabs in the cloud.

by yreg

4/28/2026 at 1:36:28 PM

Local? No idea. Cloud? Eleven Labs, probably. But it's described as "cloning", not "training". Not sure what the distinction is or why it matters if the end result is that you can generate any TTS that sounds like you. There might very well be an important one; I just don't know it.

by chrsw

4/28/2026 at 2:11:09 PM

open weights i would say S2: https://github.com/rodrigomatta/s2.cpp

by khimaros

4/28/2026 at 2:21:24 PM

Microsoft has historically made poor choices in product naming, but this has to be a new low.

by Mobius01

4/29/2026 at 12:44:02 AM

Sounds like Msft wanted to coast on the “vibecode” vibe popularity?

by yapyap

4/29/2026 at 2:59:33 PM

This is really great. I know it's not a new model, and it does often hallucinate, but these are really great frontier open-source voice AI models.

by leadgenman

4/28/2026 at 4:35:13 PM

Shouldn't it be called something like "Copilot Voice"?

by dragonfax

4/28/2026 at 5:29:35 PM

That's not confusing enough. It should be just Copilot.

by Narishma

4/28/2026 at 5:03:56 PM

Someone tell me if this is better or worse than Parakeet

by yayadarsh

4/28/2026 at 1:28:37 PM

Maybe Microsoft’s real strength was never making the best model, it was knowing you don’t need to, as long as you own the platform everyone builds on.

by BlastBash192

4/28/2026 at 4:50:52 PM

https://www.youtube.com/watch?v=d_AP3SGMxxM

by hedora

4/28/2026 at 2:16:12 PM

Previously:

Sept 2025 https://news.ycombinator.com/item?id=45114245

by ChrisArchitect

4/28/2026 at 2:58:43 PM

That was about the text-to-speech model; the speech-to-text one was released in January.

by simonw

4/28/2026 at 7:09:19 PM

When mixing languages, why does the English have a Chinese accent and the Chinese have an English accent? Is it a feature or a bug?

by low_tech_punk

4/28/2026 at 9:19:46 PM

Microsoft continues to be completely incapable of coming up with good names for their products and services

by lizardking

4/28/2026 at 5:46:24 PM

Explains most of the shit they have been pushing with Windows 11. Perhaps all that bloatware was VibeVoiced too.

by threepts

4/28/2026 at 3:35:13 PM

It would have been better if they provided not just weights, but also some frontend where it is usable as is.

by solomatov

4/28/2026 at 1:17:48 PM

For me it's giving very poor results.

by mistic92

4/28/2026 at 2:08:58 PM

looks like this offers ASR support in GGUF https://github.com/CrispStrobe/CrispASR -- haven't tested

by khimaros

4/28/2026 at 6:27:49 PM

Seriously, VibeVoice? Microslop really has a penchant for the worst names.

by isolay

4/28/2026 at 4:43:29 PM

This is a very good model, but can it be run on the web?

by nickandbro

4/28/2026 at 5:33:39 PM

What do they mean by frontier voice?

by unixhero

4/28/2026 at 8:03:15 PM

Isn't voxtral much better?

by decide1000

4/29/2026 at 2:39:06 PM

Any idea how this STT compares to Whisper large or turbo?

by dnivra26

4/28/2026 at 12:54:18 PM

Seems quite heavy for a STT model, Parakeet and Whisper are much smaller and perform great for quick dictation and transcription of longer files. I guess that's due to additional accuracy and speaker diarisation?

The TTS example clip in the repo of 'spontaneous singing' is creepy as fuck

by walthamstow

4/28/2026 at 2:54:11 PM

English only?

by Zopieux

4/28/2026 at 2:20:13 PM

Microsoft is famous for choosing terrible names, but how could they be this terrible?

by starkeeper

4/28/2026 at 4:00:56 PM

lol they rug-pulled the 7B for our own safety some months ago

by villgax

4/28/2026 at 6:53:43 PM

What a terrible name

by simjnd

4/29/2026 at 2:03:32 AM

[flagged]

by matpb

4/29/2026 at 12:36:47 AM

[dead]

by vicchenai