5/11/2026 at 11:24:01 PM
These videos are worth a watch. There are tons of impressive moments, but they had me at the very first one, where a woman says, "I'm going to tell you a story," and then pauses for a long, luxurious sip from a cup of coffee, and the model ... does nothing, just waits. Take my money.

Speaking of taking my money, what's the economic model for a company like this? They've published a fair amount about their architecture - enough that I imagine frontier labs could implement it. Patents? Trade secrets? It's hard for me to understand how you'd hold off the training compute and know-how at Anthropic/GOOG/oAI/Meta without some sort of legal protection.
I can't wait to see what these model architectures do with, like, 30-40% lower latency and more model intelligence. Very appealing. For reference, these look to be roughly 1/10 the size of the Opus 4.7 / GPT 5.x series -- 275B total, 12B active. So there's lots of room to add intelligence, and lots of hope that we could see lower latency.
by vessenes
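Some back-of-envelope arithmetic on the sizes quoted above. The 275B / 12B figures are the commenter's estimates, not confirmed specs; this sketch just shows what that split implies for a mixture-of-experts model.

```python
# MoE sizing arithmetic for the figures quoted in the comment above.
# 275B total parameters with 12B active per token means only a small
# fraction of the weights participate in each forward pass.
total_b = 275.0   # total parameters, in billions (commenter's estimate)
active_b = 12.0   # active parameters per token, in billions

active_fraction = active_b / total_b
print(f"active fraction per token: {active_fraction:.1%}")  # ~4.4%
```

If per-token compute scales roughly with the active parameter count, the active set is what drives latency, which is why a large-total-parameter MoE can still be comparatively fast to serve.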
5/12/2026 at 1:28:24 AM
> They've published a fair amount about their architecture - enough that I imagine frontier labs could implement.

i think the real ones know this is the tip of the iceberg: hparam tuning, data recipes, data collection, custom kernels, rl/eval infra - all immensely deep topics that would condense multiple decades of phd lifetimes to produce SOTA performance (in both senses of the word) like this.

i would also calibrate what you are impressed by. simply waiting is a posttrain thing - the fact that gemini and oai have not prioritized it is not something you should overindex on. what they showed with full duplex is technically far, far harder to achieve.
by swyx
5/12/2026 at 11:54:11 AM
I agree that full duplex is the amazing bit. For instance, the three engineers shouting trivia questions while a timer is running — that's extremely novel as far as I can tell.

I'd like to believe from the demos that this ability to wait kind of falls out of the model as an emergent property — perhaps coming out of a small RL loop — rather than a specifically trained behavior, à la a VAD component in a stack. Either way, I would guess that VAD absolutely cannot do this right now — interruptions are highly annoying in all voice interaction experiences, and if it were a simple matter of better post-training, SOMEONE would have done it by now, e.g. ElevenLabs.

But I disagree with your idea that this is too expensive/too hard to replicate. For me, yes. But there's an existence proof — a small team at a new company just did this without a real roadmap, certainly for less than $1B and probably in less than two years. They are almost certainly less skilled at your list of replication needs than teams at the frontier labs, who have now been given a roadmap. So I don't think it's as difficult as you propose, from an organizational-skills perspective.
by vessenes
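A minimal sketch of why a naive VAD-based turn-taking stack interrupts during pauses like the coffee-sip moment: an energy-threshold detector with a fixed hangover window declares end-of-turn on any sufficiently long silence, whether or not the speaker is actually done. The thresholds and frame counts below are illustrative assumptions, not parameters from any real system.

```python
# Naive energy-based VAD end-of-turn heuristic (illustrative sketch).
# Declares end-of-turn after `hangover_frames` consecutive frames whose
# energy falls below `silence_threshold` -- which means any long pause
# mid-utterance (a sip of coffee) falsely triggers it.

def vad_end_of_turn(frame_energies, silence_threshold=0.01,
                    hangover_frames=20):
    """Return the index of the first frame of a silent run long enough
    to be treated as end-of-turn, or None if no such run occurs."""
    silent_run = 0
    for i, energy in enumerate(frame_energies):
        if energy < silence_threshold:
            silent_run += 1
            if silent_run >= hangover_frames:
                # End-of-turn declared at the start of the silent run.
                return i - hangover_frames + 1
        else:
            silent_run = 0
    return None

# A 30-frame pause in the middle of speech triggers the detector even
# though the speaker has not finished their turn:
frames = [0.5] * 10 + [0.0] * 30 + [0.5] * 10
print(vad_end_of_turn(frames))  # 10 -- interrupts mid-utterance
```

A model that instead learns when to wait end-to-end has no fixed hangover window to trip over, which is one reading of the demo behavior the comment describes.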
5/12/2026 at 5:40:55 PM
SOTA is very much about both training on a well-curated corpus (having it) and hundreds of iterations, which eventually make you into… several PhDs, really.

This is ML/AI, not calling third-party APIs. If you want SOTA in any AI area, you need to design your own strategy and models. Drilling down to get there is super painful and perhaps not something a paid-for course can teach you.

Randomness is everywhere, and so are unexpected engineering challenges. Mastering linear algebra alongside some geometry, while still knowing the classic algorithms, is just the starting point.
by larodi
5/12/2026 at 4:36:13 AM
In China it's become well known that promising new companies will get an offer from either Alibaba or Tencent. In the US, it's probably similar. Everything that's out in the open can get acquired or simply copied. Maybe that is what Thinking Machines is hoping for as well?
by edg5000
5/12/2026 at 11:58:40 AM
Publish a demo -> acquihire for Anthropic/oAI/GOOG/META stock and cash is an understandable economic model. In this case, I feel like they built more than would be needed, though — and I hope they deploy something useful, I'd love to play with it.
by vessenes
5/12/2026 at 2:54:09 PM
Purely out of curiosity, I see you are using an em dash. Did you use voice transcription or something? It looks hand-typed, though. I'm confused.
by edg5000
5/12/2026 at 4:39:08 PM
On the presumption that this isn't a joke: em dashes appear in LLM outputs because LLMs were trained on human text which included them organically. They're not as unusual as the memes suggest.
by niam
5/12/2026 at 3:47:10 PM
I just typed two single hyphens from my iOS device. One: - two: —

Edit: when I edit this comment they have been merged in the form, so I speculate this is an iOS keyboard feature.
by vessenes
5/12/2026 at 9:37:29 PM
Mira Murati, the founder of Thinking Machines, was CTO at OpenAI during the birth of ChatGPT. It's very unlikely their goal is to just cash out.
by ricardobeat
5/12/2026 at 2:46:15 PM
hasn't the economic model always been enterprise llms?

tinker - for fine-tuning a custom enterprise model

interaction models - for working as a digital paired employee (as opposed to a company having to reinvent their entire process around ai agents)
by htrp
5/12/2026 at 2:55:06 AM
they hire leading researchers, and leading researchers won't work for you unless they're able to publish
by babelfish
5/12/2026 at 11:46:46 AM
That was true 10 years ago. It's most definitely not true now. The arms race is very real.
by vessenes
5/12/2026 at 4:15:54 AM
> leading researchers won't work for you unless they're able to publish

oh, honey.
by swyx
5/12/2026 at 8:03:56 AM
Do we want the whole of humanity to get richer, or a few individuals (company owners)?
by leonidasrup
5/12/2026 at 3:19:18 AM
Which seems bizarre. Companies can't afford to just give things away, right?
by SilverElfin
5/12/2026 at 3:16:19 PM
> Companies can't afford to just give things away, right?

Let's say a cutting-edge young researcher is making a name for themselves in their field and earning $300k/yr at a company where they're encouraged to publish and speak. You're trying to headhunt them for a company where they'll be forbidden from sharing their work, which will likely stall their career and reputation outside of that company. How much do you think you'd have to offer? $600k? $1M? $1.5M?

When faced with the choice between paying significantly higher salaries, hiring lower-tier researchers, or just letting their people publish, many companies conclude that giving away some of their work is the best option. (And that doesn't even include the benefit of boosting the company's profile, which makes it easier to attract other cutting-edge researchers.)
by angiolillo
5/12/2026 at 3:39:31 AM
Yes they can. Your research papers are not the whole story. It's like how Google could open source their entire monorepo and very little would change: no one else could operate it.
by rokob