MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 tokens per second

6/8/2026 at 4:29:22 PM

Fast AI seems genuinely exciting and somewhat unsettling to me. Right now Claude is faster than me on some tasks but we’re at least close. I have a prompt to clean up a PR that’s been running for 1h now and I expect it to take another few. It’s hard to imagine what the workflow would look like if it was near-instant. On the one hand, it might be easier to focus. Some prompts take so long that I start to multitask and regret it later. On the other, AI that takes a few seconds to a few minutes at most to solve what used to take hours or days? That’s a game changer and I don’t even know where we fit in.

by goyozi

6/8/2026 at 4:43:53 PM

I'm using Deepseek-v4-pro as my main model and this is sometimes pretty annoying, I have to do some easy boring task, think "I'll just leave the agent to do it and go take a nap", but it's already done writing the code before I even walk away from the computer

by flexagoon

6/8/2026 at 7:18:13 PM

DeepSeek is the fastest model in the benchmarks I've been doing (https://swelljoe.com/post/will-it-mythos/). Followed not so closely by Opus 4.8 and even less closely by Gemini 3.5 Flash and GPT 5.5. I've been really impressed with it, so far. It's also among the best at doing the work, though still trailing the frontier models from Anthropic and OpenAI.

by SwellJoe

6/9/2026 at 8:20:43 AM

Nice benchmark, thanks! Which quants did you choose for the self hosted models?

by anschl

6/9/2026 at 8:50:06 AM

8-bit on that one (unsloth 8_K_XL). But, the next post compares all common quantizations of Qwen 3.6.

I have another coming in a day or so for Gemma 4 with the 4-bit QAT version, which is very surprising (in a good way, Gemma 4 is impressive for this task).

by SwellJoe

6/8/2026 at 4:49:44 PM

Do you mean Flash and not Pro? I haven't tried it personally, but according to OpenRouter, the fastest DeepSeek V4 Pro providers are only ~50tps. That's slower than Claude Opus.

https://openrouter.ai/deepseek/deepseek-v4-pro?sort=throughp...

by RussianCow

6/8/2026 at 7:23:14 PM

In recent benchmarking I've been doing, DeepSeek V4 Pro was the fastest of 21 models, by a comfortable margin (https://swelljoe.com/html/bench-report-final.html). Faster than Claude Opus 4.8, which was the second fastest (Mistral doesn't count because it seems to have refused to participate). But, it's a limited data set, just a few benchmark runs of a limited set of tasks. It's entirely possible I happened to be calling the API at its least busy time and maybe Claude got hit during a busy time.

by SwellJoe

6/8/2026 at 5:25:51 PM

I don't think token speed matters as much when a lot of tokens are needed to achieve a task. E.g. artificial analysis benchmarks where deepseek v4 is one of the biggest token burners to go through the benchmark.

by sarjann
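
The trade-off described above (throughput versus how many tokens a model burns to finish a task) can be sketched with simple arithmetic. This is a minimal illustration, not a benchmark: the function name and all the numbers below are hypothetical, chosen only to show that a faster but more verbose model can still finish later.

```python
def wall_clock_seconds(output_tokens: int, tokens_per_second: float,
                       time_to_first_token: float = 0.0) -> float:
    """Rough generation time: startup latency plus tokens divided by throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return time_to_first_token + output_tokens / tokens_per_second


# Hypothetical models: one is 3x faster per token but uses 4x the tokens.
fast_but_verbose = wall_clock_seconds(output_tokens=60_000, tokens_per_second=150)
slow_but_terse = wall_clock_seconds(output_tokens=15_000, tokens_per_second=50)

print(f"fast but verbose: {fast_but_verbose:.0f} s")
print(f"slow but terse:   {slow_but_terse:.0f} s")
```

With these made-up inputs the slower model finishes first, which is the point being made: tokens per second only tells you the wall-clock time once you also know the token count.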

6/9/2026 at 2:53:32 AM

Both matter.

by brianwawok

6/8/2026 at 8:13:15 PM

No, I mean Pro. I use it through OpenCode Go so I don't know what provider it uses under the hood, but it's very fast in my experience.

by flexagoon

6/9/2026 at 6:50:44 AM

DS through OpenRouter is significantly slower than direct from DS platform in my experience

by thecopy

6/8/2026 at 4:52:43 PM

Yeah, flash is crazy fast, but I've found performance variable.

by specproc

6/8/2026 at 6:53:35 PM

Flash is amazing if you know the domain really well.

E.g. occasionally it makes the dumbest mistakes you've ever seen and can't correct them. However it's fairly rare, and if you know the domain really well, occasionally popping in the code and pushing it towards the correct solution takes like 20 seconds or whatever.

So the speed you can move with flash + high domain knowledge beats opus by a mile in my experience.

I tried to switch back to 4.8 for a bit when it came out, feels so bad waiting 20mins for a mediocre solution when I could have had everything complete - with multiple iteration cycles - in flash in like 3-5mins.

by binary0010

6/9/2026 at 4:04:07 AM

Yes, you don't need much domain knowledge to use Opus, but it's just way too expensive.

by addozhang

6/9/2026 at 8:52:29 AM

For losers who can't put together a program to save their life, have no real skills and were always not really interested in programming (hence their poor skills), renting a robot buddy to do it for them is a good deal, until the buddy cuts in materially into their salary, and until their bosses realize that they really just have robot operators on staff instead of people who can actually do things.

by 59nadir

6/9/2026 at 5:56:46 PM

It's nice when I want to be lazy though.

Or when I'm working two contract gigs. I can spec things out for one and turn it loose and trust it. Then work more closely with deepseek on the other project.

by Induane

6/8/2026 at 7:04:08 PM

[flagged]

by flowbarai

6/8/2026 at 5:28:09 PM

Agent mania setting in

It's also pretty funny sometimes how it gives weird future roadmap estimates ("part 2 - 3 weeks, part 3 - 2 months", etc.) and when you tell it to actually do those changes it's pretty much done in half an hour

by throwaway67678

6/8/2026 at 5:51:43 PM

I've long believed those numbers were faked by Anthropic/OpenAI to serve as a form of advertisement. The estimates are impossible to verify and their ability to do "2 days of work" in 10 minutes will presumably make the user go "Wow, I just saved SO much time!" Plus, the unnecessary text eats up the users' tokens so it helps the companies on the backend, as well.

by smith7018

6/8/2026 at 9:53:45 PM

I tend to be cynical about AI companies, but I'm guessing the bad estimates more just come from a complete lack of actual data it could use for that so it's more or less a hallucination.

by overgard

6/8/2026 at 6:06:21 PM

I agree with you that labs are benefiting from those outputs but I'm skeptical that labs are purposefully training the models to produce those outputs.

Raw pre-training data includes plenty of conversations between professional builders and some of those include estimates.

I believe the outputs are a training coincidence with consequences that are opportunistic for the labs.

by leodavi

6/8/2026 at 6:15:44 PM

All the models have broken estimates. They're trained heavily on jira and GitHub tasks and issues, that's why their estimates are human.

by AgentMasterRace

6/8/2026 at 7:29:07 PM

Even for humans the estimates are way off, unless it's based on data that has some serious padding.

That said, it'll often say "2 days of work" and then complete the coding in 30 minutes, and while that's amusing, afterwards, I'll need to manually test, or send to other people for review, or realize the agent only actually did half the work and I need to do a second pass (or a third etc.) and then often getting the feature in does genuinely take two days.

by esperent

6/8/2026 at 6:54:35 PM

> the estimates

It doesn't estimate.

It generates tokens that read like estimates associated with the context in its training material.

What would you expect the generator to output instead?

by Terretta

6/8/2026 at 8:29:53 PM

It generates tokens by estimating what the next token is going to be.

Sure it cannot think like a human, but given its input, it should give a good statistical answer (approximating not how long it actually takes, but what a human would say about how long it takes).

by legulere

6/9/2026 at 12:49:49 AM

The funny thing about this comment is that neural networks are universal function approximators.

The most fundamental essence of what they do is exactly what you say they don't: estimate.

by mediaman

6/9/2026 at 1:38:07 AM

Funny and ironic in a way, but the point still stands that they do not actually estimate the time it will take.

by airstrike

6/9/2026 at 3:28:25 AM

> they do not actually estimate the time it will take

You can't prove that )))

by greenavocado

6/9/2026 at 4:31:08 AM

Right, but extraordinary claims require...

by airstrike

6/9/2026 at 2:55:55 PM

Instructions unclear, hard drive reformat completed.

by greenavocado

6/8/2026 at 10:24:50 PM

Obviously there isn't a hidden corpus of logs of coding chatbot assistants that has been accumulating over the years, but these coding chatbot assistants output tokens that resemble how we all imagined a coding chatbot assistant would have operated had it existed in the first place to end up in a corpus. "Training material" includes supervised fine-tuning, preference training, RLHF, and so on, so that certain outputs (like these timeline estimates) may really have been decided (at some level of conscious awareness) by product teams.

by incr_me

6/8/2026 at 8:07:03 PM

you might like the stuff in my work of oh my pi, it's a test bed for my ideas around making these tools more reliable. hoping to maybe have a native ui iter of the real thing that this is a test bed for this summer.

https://github.com/cartazio/oh-punkin-pi/blob/main/scripts/b...

by carterschonwald

6/9/2026 at 2:19:34 AM

Therein lies the rub, no? To accurately predict the next token produced by a process, it’s necessary to model that process. If the process is a human attempting to estimate the duration of a task, then in some sense the LLM is modeling the estimation process. We’re well past the point where it’s credible to claim that LLMs just regurgitate their training data.

by taneq

6/9/2026 at 12:06:11 AM

This is so 2023. The thought process.

At that time the predominant view was that LLMs were nothing but stochastic parrots, that they would plateau, and that hallucinations couldn't be fixed.

At this point I doubt there are any AI sceptics left. That ship has long sailed. The only thing that matters is whether the estimates are accurate, and AI can improve on that too.

Even humans only estimate based on neurons firing in prior patterns.

by InterviewFrog

6/9/2026 at 4:26:49 PM

[dead]

by monkpit

6/8/2026 at 11:44:04 PM

Actually in this case they possibly are estimates.

It's been known for some years[1] that LLMs do regression in-context. Frontier models have been trained against many, many issue texts that include task breakdowns and estimates.

[1] https://arxiv.org/html/2409.04318v1

by nl

6/9/2026 at 1:33:02 AM

Interesting. So it may have learned how to estimate as a human but doesn’t understand that it doesn’t operate at that speed :D

I wonder if there’s a reasonable way to give an llm parameters that give it a concept of its own execution speed. Seems that could be useful for multiple purposes

by kube-system

6/9/2026 at 6:12:18 AM

Yes, it's entirely possible to do that via RL. It'd be a fun little project you could do for less than $100 on a small LLM actually.

by nl

6/8/2026 at 7:39:59 PM

I think people are continuing to view these systems as pure LLMs - when that ship sailed 6+ months ago. Between being able to review memory, using agent harnesses and sub agents and skills to go out and discover information - modern systems (Codex, Claude Code, Cursor) - use LLMs - but the LLM is only a small component of it. Compare what you get from sending a request to a chatbot like ChatGPT - to what you can from a modern harness. The output is influenced by the LLM, but it's no longer a "model making a token prediction based on training material and RLHF" - that's a very 2025 way of looking at these systems.

Even Gary Marcus is starting to come around and realize that his priors are no longer as relevant as they once were.

by ghshephard

6/8/2026 at 8:17:14 PM

No one is bitter lesson pilled anymore. Everyone is pivoting to neurosymbolic systems. It looks like Gary Marcus was right.

by irthomasthomas

6/8/2026 at 11:49:05 PM

> No one is bitter lesson pilled anymore.

Will the 10T parameter Mythos model be released this month or next month?

They better do so soon because it is generally accepted that one of the reasons GPT 5.5 is better at hard tasks than Opus is because of its parameter size - and that Opus 4.8 remains competitive only by scaling test-time compute (see how many more tokens it uses than GPT 5.5)

https://www.reddit.com/r/LLM/comments/1sz8bjz/parameter_esti...

by nl

6/9/2026 at 8:49:37 AM

Why ask me? Anyway, Mythos is not 10T. Anthropic confirmed the training run was under 10^26 flops. You can't train 10T to Chinchilla and stay under 10^26.

Anthropic also confirmed they will not release Mythos, only a "Mythos-class" model, whatever that means.

by irthomasthomas
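
The compute argument above can be checked with the usual rules of thumb. This is a rough sketch under stated assumptions, not anything confirmed by a lab: it assumes a dense model, the common approximation of about 6 FLOPs per parameter per training token, and the "Chinchilla-optimal" ratio of roughly 20 tokens per parameter. For a mixture-of-experts model the active parameter count would be the relevant one, so the figures would differ.

```python
import math

FLOPS_PER_PARAM_PER_TOKEN = 6   # rule-of-thumb training cost, C ~ 6 * N * D
TOKENS_PER_PARAM = 20           # rule-of-thumb Chinchilla ratio, D ~ 20 * N


def chinchilla_flops(params: float) -> float:
    """Approximate training FLOPs for a dense model trained at ~20 tokens/param."""
    tokens = TOKENS_PER_PARAM * params
    return FLOPS_PER_PARAM_PER_TOKEN * params * tokens


def max_params_for_budget(flops_budget: float) -> float:
    """Largest dense model that fits the budget under the same assumptions."""
    return math.sqrt(flops_budget / (FLOPS_PER_PARAM_PER_TOKEN * TOKENS_PER_PARAM))


ten_trillion = 10e12
print(f"10T params at Chinchilla ratio: {chinchilla_flops(ten_trillion):.2e} FLOPs")
print(f"Largest model under 1e26 FLOPs: {max_params_for_budget(1e26):.2e} params")
```

Under these assumptions a 10T dense model lands around 10^28 FLOPs, and a 10^26 budget tops out a little under 1T parameters, so the two claims in the comment are at least arithmetically consistent with each other.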

6/9/2026 at 11:23:07 AM

> Anthropic confirmed the training run was under 10^26 flops. You can't train 10T to Chinchilla and stay under 10^26.

I don't think Anthropic have said anything of the sort.

Microsoft published it as 6.1*10^27 FLOPs[1]

Elon has claimed they are also training a 10T model because "Some catching up to do"[2]

[1] https://x.com/scaling01/status/2061897540161728791

[2] https://x.com/elonmusk/status/2041754402239975479

by nl

6/9/2026 at 2:13:17 PM

I must have confused mythos with opus 4.7. One of their recent model cards confirmed that training flops was under the EO reporting requirement of 10^26 flops.

by irthomasthomas

6/9/2026 at 12:06:50 AM

How is neurosymbolic not aligned with the bitter lesson? The bitter lesson is completely agnostic to architecture.

by wild_egg

6/9/2026 at 8:40:35 AM

I should have stressed the symbolic part. Everyone has pivoted to symbolic systems like claude code and codex. They would not invest so heavily in such systems if they thought llms would deliver agi soon.

by irthomasthomas

6/9/2026 at 3:20:29 PM

That's not what symbolic means.

by jubilanti

6/8/2026 at 7:55:20 PM

You think someone is, or even should, special case things like estimates? What else deserves that level of intervention so they look less dumb?

Logistics for getting to the car wash next door?

In the mean time, alas, no, we can see from actual prompts sent directly or through sub-agents, and actual replies, estimates remain LLM generated.

Though, this discussion here could change that, because indeed there is a lot of special casing and context stuffing going on, one of the oldest being today's date for example.

• • •

I did read the Claude Code leak, and use pi, etc. So I disagree with your premise rather strongly. Today's "systems" remain, roughly, piles of markdown and context engineering wrapped in UI affordances, and behave very similarly today to how they did in 2024 for those already engineering context and delegating.

by Terretta

6/8/2026 at 10:22:03 PM

I do a lot of code bisecting with Claude Code - and it spends hours running experiments - looking at experiment results, making guesses as to what to try next for an experiment - until it eventually comes around to a working code pattern. I mean - maybe this is as much a reflection on me as anything else - but its pattern of logic isn't that much different from what I would do. It knows, in general, what tools and APIs it can call - it tries something - observes the result, and then comes back and tries different experiments based on success/failure - mostly efficiently bisecting to a solution.

I'm still lower-down of the capability scale - as I'm still manually directing agents to do these wiggins loops - obviously the next step up is to direct the code-loops which control the agents. I just haven't got my tooling nailed in place to the point where I find that's more productive.

I actually might agree with you that this is mostly just "next token prediction" - if I can concede that's really all I do as well.

by ghshephard

6/9/2026 at 12:52:50 AM

> I actually might agree with you that this is mostly just "next token prediction" - if I can concede that's really all I do as well.

Yep. Pretty sure I've got an LLM inside too.

The other replies complaining that my thinking is so 2023 -- on the contrary, what's evolved is my own apprehension of how LLM-like most "responses" from humans prove as well.

To be sure, there are other mechanisms at play as well, significant differentiation in our... Volume of training material? Quantizations/compression? Model architecture? Just-ahead-of-time forward branching with back propagation? Double loop adaptive learning? You know, harnessing the LLM. :-) Dare we call it executive function?

LLM mode becomes particularly apparent when conversing with Alzheimer's patients in the stage where short term memories do not form but they retain access to long term memory up to, say, 5 years ago or so. Fifty years of who they are, and one can trigger nearly identical responses with nearly identical prompts.

But that same person may be able to debate 1950s politics while being unable to complete making a sandwich.

If they didn't know of new shortcuts for a task, they would almost certainly not "estimate" but "intuit", or "instinctively" respond (apply heuristics), largely based on their "priors" aka training material.

If you sit with them and chat a while, you'll even get the kind of looping you get from Qwen trying to think when context is too full.

And if we believe this at all, then ... we should stop scrolling tik tok. Time to read a book. Have an experience. Fine tune. :-)

by Terretta

6/8/2026 at 11:09:37 PM

rather than special casing, make real data based on chat logs for how long things took both in calendar and chat time

by 8note

6/8/2026 at 6:26:14 PM

All models do it. It's their training. They didn't have "a person does this in a week but an LLM could in a minute" in their training yet. They also don't have the concept of elapsed time unless you ask them how long something has taken.

by dizhn

6/8/2026 at 9:10:43 PM

Nah it’s all from the pretraining data

by Narciss

6/9/2026 at 4:25:00 AM

That’s right up there with Scotty in the classic Star Trek always multiplying time estimates by 4 so he looks like a “miracle worker”

by BobbyTables2

6/8/2026 at 8:56:30 PM

I mean in general I'd rather take slightly inflated estimates than the odd sprint poker stuff where other devs and PMs negotiate hours down and before you know it you're also stuck fixing nitpicky reviewer comments on code that is already good enough and have to send a release at like 7 PM, ofc also without enough tests or even enough manual checks and testing, cause people repeatedly act against their self-interest and try to compress timelines, thinking that that's somehow good for them.

At least with AI that actually does things more quickly, there is a bit more breathing room (introducing AI is easier than changing a given environment).

Aside from that, I wonder how much variety there is in practice: between "Oh yeah, I added that new button while we were in the meeting" and "The new button feature will be ready in Q3 according to the roadmap, once we have sign-off from all the stakeholders."

by KronisLV

6/8/2026 at 11:57:38 PM

I heard an anecdote. Guy spent several days trying to convince his AI agent to build a feature. Kept saying it was crazy complicated, would take weeks.

Finally he convinced it to try. It one shotted it in 30 seconds.

Turns out the agents' idea of what is hard and easy also comes from Common Crawl.

by andai

6/9/2026 at 12:04:25 AM

Why on earth would you spend any time at all convincing an agent of anything? You say "just do it" and off it goes.

by wild_egg

6/9/2026 at 12:17:51 AM

Ya, but “doit” is 2x more efficient

by dr_dshiv

6/9/2026 at 3:00:25 AM

Uh Claude tries real hard to dodge work. Talks about how it’s really hard 10 PRs. Finally convince it to do as 1. It stops 10% through and says ok done with PR 1, we can work on the last 9 tomorrow. Ugh.

by brianwawok

6/9/2026 at 4:31:50 PM

Maybe we shouldn't have AI mimic humans too closely?

by handfuloflight

6/9/2026 at 3:20:18 PM

You need to assert dominance.

by g8oz

6/8/2026 at 6:38:58 PM

It repeats what it has seen in the training data. Expecting it to reason about the complexity of a task is a pipe dream. The best is to tell it not to come back with estimates, and when it does, remove them anyway.

by throw1234567891

6/9/2026 at 3:42:27 AM

I added "you can do anything, believe in yourself" to system prompt, and task completion increased significantly.

by andai

6/9/2026 at 3:50:25 PM

Well how else could I keep my reputation as a miracle worker Captain?

by jimbokun

6/9/2026 at 7:49:33 AM

> It's also pretty funny sometimes how it gives weird future roadmap estimates ("part 2 - 3 weeks, part 3 - 2 months", etc.)

those estimates are based on previous human estimates (the datasets it's been trained on).

unironically, when your comments will become part of a dataset, LLMs will likely get much better at estimating.

now that i think about it, all these writings about LLMs will give LLMs something much like meta-cognition.

by znpy

6/8/2026 at 6:48:49 PM

I exclusively use deepseek v4 flash now, completely stopped using slow models like Claude.

Basically I never have to wait - yes I have to tell it little corrections occasionally (but I know the domain really well so that's not an issue), but it's so much faster than anything else it's kinda crazy. I love the super fast speeds with high involvement development cycle.

I actually enjoy using agentic development flows for the first time now - whereas with Claude I absolutely hated it. That 5 to 20 min wait after every prompt absolutely killed my desire to even want to work at all.

by binary0010

6/9/2026 at 1:32:05 PM

Take the nap anyway, just say it took all afternoon :)

by abustamam

6/8/2026 at 10:23:23 PM

FWIW, for me just today it got itself into silly rabbit holes twice, and both times I had to fix things myself. Scarily, this is something I catch myself doing as well.

by throw-the-towel

6/8/2026 at 4:53:19 PM

This reminds me of the Peter / Boris comments on writing loops to keep the agents busy.

by tmaly

6/8/2026 at 11:56:48 PM

With Flash it's basically instant for smaller tasks, yeah.

by andai

6/9/2026 at 7:47:24 AM

> I have to do some easy boring task, think "I'll just leave the agent to do it and go take a nap", but it's already done writing the code before I even walk away from the computer

the way software engineering works these days reminds me a lot of factory workers on production lines that just sit in front of a production line all day and take out faulty items and/or perform a single step in the production of goods.

by znpy

6/8/2026 at 6:19:03 PM

Same. How can DeepSeek serve the V4-Pro at such high speeds despite the sanction?

by behnamoh

6/8/2026 at 8:22:32 PM

The sanctions only “prevent” them from directly buying NVidia’s latest and greatest in the sense that NVidia can’t sell directly to them. Essentially, there are companies now who are in a country without the sanctions, they buy from NVidia (or a partner), and then ship them off to China. For the orgs in China doing this, there’s zero legal risk besides having foreign customs service intercept the shipment and losing the goods. For NVidia there is zero incentive to care, as long as they look like they do, because sales are sales. You can bet Jensen ain’t losing sleep over it.

GamersNexus had a really good investigative piece (~3hrs long) on this where they went to China and met with grey market sellers. That piece absolutely pissed off NVidia and resulted in a fight with Bloomberg too.

Deepseek may also be running inference on oodles of Chinese hardware but it wouldn’t surprise me for a second if they just acquired Blackwell chips through the grey market. The original Deepseek models were all trained using NVidia chips if I remember right.

by rubyn00bie

6/9/2026 at 12:39:23 AM

That wouldn't explain why Deepseek is fast relative to other Chinese providers, especially considering that they're reportedly ahead of the curve among Chinese companies in moving off Nvidia. I think their quant fund background has more to do with it. Their models are clearly designed with performant inference in mind.

by seewhydee

6/9/2026 at 5:56:21 AM

Yes, it's performant, and esp performant at non-trivial context depths. DeepSeek-V4 DS4 (and Flash - DS4F) drop tok/s speed much less than the rest. On my M2 Max it took context depths of 768K to drop tok/s to ~10 tok/s.

https://x.com/ljupc0/status/2062457314414587996

Other local models I've checked drop to unusable speeds way sooner. The only other model with a similarly favourable curve I've tried is nemotron-cascade-2-30b-a3b. But it's a small model, way dumber than DS4F.

Coding agents use cases have large context depths. The rate of decline is as important as the headline number.

by ljosifov

6/8/2026 at 7:12:38 PM

Now the next bottleneck is the compiler - which we can model in an LLM! It's only wrong 15% of the time :)

But truly, using Cerebras at ~2k tokens/s, with very low latency is like a vision into the future. You start to rework your workflow around things that can happen without onerous manual review - stating the conditions for success, etc. It's rare that I have a problem that maps well to that, but I expect this is where things are headed.

Of course the fast models tend to not be the SOTA ones, but if that was the case - high quality and near-instant thinking, that's a game changer that I don't think we're really prepared for. The things that get unlocked with higher-than-reasonable speed become very interesting.

by switchbak

6/9/2026 at 5:59:49 AM

Have you tried https://chatjimmy.ai/ it’s only a demo but it blew my mind. I had the sudden feeling that this is the future.

by lhoff

6/9/2026 at 7:59:37 AM

What do you mean "demo"? Seems to work... Who is behind this?

by colordrops

6/9/2026 at 2:03:47 PM

These guys: https://taalas.com/products/

by alfiopuglisi

6/8/2026 at 6:10:34 PM

If we get low enough latency, there's no reason to multitask. You can ask it to do one thing at a time and immediately see what it did. That's a nice way to work!

This is normal interactive UI for tasks that aren't compute-intensive. Programs spend most of their time idle, waiting for us to click a button. We shouldn't be waiting for them or spinning more plates to keep them busy.

However, a faster llm isn't enough. You also need fast compiles and fast tests.

by skybrian

6/8/2026 at 10:22:55 PM

I’ve been playing around with groq and GPT OSS which they run at 1000 TPS (20B) or 800 TPS (120B) and the speed feels quite magical.

I haven’t tried cerebras’ 3000 TPS yet but I did try the demo of that 15,000 TPS model whose name escapes me right now.

I’m not sure if it makes a meaningful difference for my actual work, but it sure is amazing to watch it generate a screen full of text in the blink of an eye.

I do think it’s super useful for running little validation checks like showing it a diff to ensure that the changes are on task, and being able to do those quicker really helps because it means you can do many focused checks without them getting in the way.

by dkersten

6/8/2026 at 10:28:49 PM

https://chatjimmy.ai/ ?

by robberth

6/8/2026 at 10:39:27 PM

AFAIK Taalas, the company behind this demo, still only have their initially "hardwarized" model available to test in ChatJimmy, which IIRC is a rather stupid Llama 3ish 8b.

Don't get me wrong though, that demo is still incredibly impressive & makes me very much excited for the hardware-based model era (potentially) ahead.

Once you've experienced those speeds, you really start to think about the whole class of things that becomes possible; massively parallel decode paths, extensive reasoning loops, etc…

by msdz

6/8/2026 at 11:03:50 PM

For scale though if three or four chips that size can replicate a Qwen 27B experience that'll be quite useful.

by hedgehog

6/9/2026 at 7:11:57 AM

That’s the one.

The speed is incredible and fun to see, but the model is rather weak to the point where I’m not sure it’s particularly useful for most people.

by dkersten

6/8/2026 at 11:18:37 PM

> I haven’t tried cerebras’ 3000 TPS yet but I did try the demo of that 15,000 TPS model whose name escapes me right now.

You were likely thinking of AI accelerator startup Taalas.

Previous HN discussion: https://news.ycombinator.com/item?id=47086181

by ayewo

6/8/2026 at 7:37:39 PM

It cuts both ways. Sometimes I ask Gemini 3.5 Flash to do something for me and it kicks it out almost instantly and it works great, and it's a bit scary how quickly it can do that.

Then I ask it to do something else and it goes off-road and where I used to be able to interject with a "wow wow wow, that's not right", by the time I see the text on screen and react it's already made massive changes. Short of making it commit between every edit it's hard to prevent it from going wrong as quickly as it goes right (and even then, it can make a boo-boo on a remote API too depending on how much privilege it has).

by coderbants

6/8/2026 at 8:27:09 PM

I use planning mode in opencode. It has a prompt to tell it to plan it out etc. Then I execute with a smaller model. it works well

by bendangelo

6/8/2026 at 4:32:08 PM

asking for curiosity's sake. What kind of PR loop are you running that takes a few hours?

by ipkstef

6/8/2026 at 4:39:01 PM

not OP but usually for me this means long verification loop; waiting 10min on CI checks, that kind of thing, rather than actual 1hr wall clock of token generation

by ketzo

6/8/2026 at 4:58:12 PM

But those things won't be sped up by a faster LLM, so I feel like that's not what the OP is talking about.

by RussianCow

6/8/2026 at 5:12:15 PM

Well, I used an extreme example. OTOH, I’ve done quite a few of those „fix CI” or „migrate X” prompts recently and while there is a fixed component like running CI / builds, I’d say the LLM time is still around or above 50%, especially at the beginning of the project. Then there’s also regular tasks that now take minutes per message which completely get me out of the zone. I imagine iterating on those in near real time would be a big change.

by goyozi
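
The point about a fixed CI/build component next to roughly 50% LLM time is essentially Amdahl's law. A minimal sketch, assuming the loop splits cleanly into an "LLM" share that speeds up and a fixed share that does not; the 50% share comes from the comment above, while the 20x factor (for example 50 tps to 1000 tps) is only an illustrative assumption.

```python
def overall_speedup(llm_fraction: float, llm_speedup: float) -> float:
    """Amdahl's law: whole-loop speedup when only the LLM share gets faster."""
    if not 0.0 <= llm_fraction <= 1.0:
        raise ValueError("llm_fraction must be between 0 and 1")
    if llm_speedup <= 0:
        raise ValueError("llm_speedup must be positive")
    fixed_fraction = 1.0 - llm_fraction
    return 1.0 / (fixed_fraction + llm_fraction / llm_speedup)


# Half the wall-clock time is token generation, the rest is CI and builds.
print(f"20x faster model: {overall_speedup(0.5, 20):.2f}x overall")
print(f"infinitely fast:  {overall_speedup(0.5, float('inf')):.2f}x overall")
```

Even an infinitely fast model caps out at 2x when half the loop is waiting on CI, which is why faster compiles and tests matter as much as faster tokens.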

6/8/2026 at 4:49:55 PM

Or slow MCP servers that are waiting on HTTP calls from APIs, playwright/other UI instrumentation, etc.

by devmor

6/8/2026 at 5:03:05 PM

I’m rewriting our integration test suite to run tests in parallel. I have the changes split across 7 branches, and each needs to be fixed to have no flaky tests. I told it I want 3 consecutive CI runs with no flakes and no artificial fixes / assert removals etc. We’ll see what comes out; it’s almost a side project so there’s not much to lose other than some of my weekly limit that resets soon.

by goyozi
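
The "3 consecutive CI runs with no flakes" stopping condition is easy to state as code. This is a hypothetical sketch of such a check, not the commenter's actual setup: it assumes CI results are available as a list of booleans ordered oldest to newest, and the function names are invented for illustration.

```python
from typing import Sequence


def trailing_green_streak(run_results: Sequence[bool]) -> int:
    """Count how many of the most recent runs passed in a row."""
    streak = 0
    for passed in reversed(run_results):
        if not passed:
            break
        streak += 1
    return streak


def is_stable(run_results: Sequence[bool], required: int = 3) -> bool:
    """True once the latest `required` runs are all green."""
    return trailing_green_streak(run_results) >= required


history = [True, False, True, True, True]  # oldest first, hypothetical data
print(trailing_green_streak(history), is_stable(history))
```

Counting only the trailing streak means one flaky run resets the counter, so an agent cannot claim success on the strength of older green runs.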

6/8/2026 at 6:45:09 PM

> a side project so there’s not much to lose other than some of my weekly limit that resets soon

Basically the entire token-maxxing AI hype train in a nutshell. Lovely!

by yunohn

6/8/2026 at 8:22:25 PM

wdym? Nobody's paying me or rewarding me for using these tokens. I had some spare in my subscription limit (we're not on token pricing), so I decided to try an ambitious task that may reduce our CI times and improve our DX significantly. That's hardly "the entire token-maxxing AI hype train in a nutshell".

by goyozi

6/8/2026 at 7:38:06 PM

I’m curious when folks will tire of lighting money on fire. Companies are already starting to scale back a bit, but the AI companies are still nowhere near profitability.

by drob518

6/8/2026 at 4:40:39 PM

We fit in for the things that are not artificial.

So long as AI lives in server farms, humans will be needed for tasks in the physical world.

It's only if we combine AI with robots that things get really dicey.

by pianopatrick

6/8/2026 at 4:43:59 PM

This is very dystopian in my opinion. I'm not the arms, legs, sensors and actuators for a machine super intelligence. I wouldn't treat another human as my slave because they aren't as intelligent as I am any more than I would expect to become a slave for a machine. This is our world (for now) and that is why we fit in. Not because we can serve.

by fartfeatures

6/8/2026 at 4:56:17 PM

Agree

https://en.wikipedia.org/wiki/I_Have_No_Mouth,_and_I_Must_Sc...

by davedx

6/8/2026 at 6:21:10 PM

"It seeks revenge on humanity for its own creation."

This is brilliant as it reminded me of a famous Hitchhiker's quote:

"In the beginning the Universe was created. This has made a lot of people very angry and been widely regarded as a bad move. — From The Restaurant at the End of the Universe (Book 2)"

Maybe we are stuck in an eternal loop

by ionwake

6/8/2026 at 5:47:57 PM

Sounds like snuff porn, not my sort of thing but thanks though.

by fartfeatures

6/8/2026 at 5:10:12 PM

"This is our world" sounds a bit exclusive towards other living and sentient beings on this planet.

by cicko

6/8/2026 at 11:24:08 PM

It depends on what’s included in “our”.

by nativeit

6/8/2026 at 5:29:08 PM

Never read Asimov's Multivac novels? Admittedly not all of them are stellar examples of a future to follow

by throwaway67678

6/8/2026 at 7:28:31 PM

You don't need ai superintelligence, just plain capitalism is enough

by Muromec

6/8/2026 at 5:32:09 PM

I'd be very curious about the bottleneck breakdown in most current software dev - I suspect inference is far from the bottleneck in most things I do, though driving it to 0 would still be nice. I do agree that if it was 0 we'd probably change development approaches to reduce the new bottlenecks more, but it'll take full-process innovation to really get something near-instant.

(I should go measure this now, I'm curious)

by efromvt

6/9/2026 at 3:55:19 AM

The first wave was just getting half decent answers. The second wave was being able to choose between actually getting reasonably ok coding results OR getting not so great results very fast. The third wave would be getting good results fast.

We need to really worry when we get amazing results very fast.

by noisy_boy

6/9/2026 at 2:48:46 AM

Reminds me of the Doherty threshold. When will AI respond in less than 400 milliseconds?
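
As a rough back-of-the-envelope sketch, the 400 ms budget translates into a token budget as follows. The function name and the first-token latency parameter are assumptions made for illustration; real latency also includes network and queueing time.

```python
def tokens_within_budget(
    tokens_per_second: float,
    budget_ms: float,
    first_token_latency_ms: float = 0.0,
) -> int:
    """Tokens that can be generated before the response budget runs out."""
    usable_ms = max(0.0, budget_ms - first_token_latency_ms)
    return int(tokens_per_second * usable_ms / 1000)


# At 1000 tok/s with no startup latency, 400 ms buys about 400 tokens.
fast = tokens_within_budget(1000, 400)

# At roughly 50 tok/s, the same budget buys about 20 tokens.
slow = tokens_within_budget(50, 400)

# With a hypothetical 150 ms wait for the first token, 1000 tok/s leaves 250 tokens.
fast_with_latency = tokens_within_budget(1000, 400, first_token_latency_ms=150)
```

So even at 1000 tok/s, only short answers fit under the threshold, and anything that needs a long reasoning trace first does not.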

by cman1444

6/9/2026 at 7:35:06 AM

"I don’t even know where we fit in."

Giving directions and verifying its output? But my mental capacity is still limited. I can write far more prompts than I can read code.

by lukan

6/8/2026 at 5:12:08 PM

I don't see many companies being willing to pay 3x more for faster code generation. Cloud-based AI code generation is already extremely fast, and hardly the bottleneck for most software product development.

There can't be many normal use cases where there'd be any cost benefit.

by HarHarVeryFunny

6/8/2026 at 5:26:55 PM

The "traditional" way we vibe code is human software developer prompts AI -> AI generates code -> (human checks code) -> code gets compiled/deployed/etc. -> users use "binary". At the speed of 1000 tok/sec, user prompts obliquely -> AI vets generated code -> code deployed -> user gets response from deployed code.

It's a cute toy right now, but you can tell an LLM that it's an http server, and have it respond directly to a web browser hitting it. It generates headers in response, as well as page contents. As 1000 tok/sec becomes the new normal, we will come up with newer ways to use it outside of toy fiction encyclopedias.

by fragmede

6/8/2026 at 5:45:20 PM

1000 tokens per sec is still massively slower than serving a normal web page - if something doesn't respond in a few seconds many people give up.

I'm not saying there aren't any use cases for super-fast (and super-expensive) generation, but it does seem a bit niche. If it was free then sure faster is better, but what are the mainstream use cases where people might pay 3x more for a faster version of something that is already fast?

I think it would have to be an application where it paid for itself - where the 10x faster response was actually worth more than 3x the cost to you - where the extra speed was worth the extra cost.
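
A minimal sketch of that break-even test, assuming the only benefit of speed is human waiting time saved. Every number and name here is hypothetical; the 3x and 10x figures are just the ones used above.

```python
def speedup_pays_off(
    base_cost: float,
    cost_multiplier: float,
    base_wait_hours: float,
    speedup: float,
    value_per_hour: float,
) -> bool:
    """True when the value of the time saved exceeds the extra spend."""
    extra_cost = base_cost * (cost_multiplier - 1)
    hours_saved = base_wait_hours * (1 - 1 / speedup)
    return hours_saved * value_per_hour > extra_cost


# Hypothetical task: $10 of tokens and 1 hour of waiting at normal speed.
# At 3x the price and 10x the speed, it pays off if an hour is worth $100...
worth_it = speedup_pays_off(10, 3, 1.0, 10, value_per_hour=100)

# ...but not if the waiting time is only worth $20 an hour.
not_worth_it = speedup_pays_off(10, 3, 1.0, 10, value_per_hour=20)
```

The result flips entirely on how much the waiting time is worth, which is why the faster tier looks niche when the wait is not on the critical path.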

by HarHarVeryFunny

6/8/2026 at 6:23:38 PM

> Right now Claude is faster than me on some tasks but we’re at least close.

I don't doubt it, but I don't think you can spawn 10 copies of yourself working simultaneously.

by binyu

6/8/2026 at 6:28:46 PM

No, but nor can you keep track of what 10 agents are doing simultaneously. Hence the multitasking regret.

by AlecSchueler

6/8/2026 at 6:32:23 PM

An agent can, you don't need to watch tasks, you can have a live digest with another tool.

by pixel_popping

6/8/2026 at 7:57:25 PM

Do you have any recommendations for a live digest tool?

by logankeenan

6/9/2026 at 8:38:25 AM

Who watches the watchers?

by AlecSchueler

6/8/2026 at 5:39:23 PM

Use Claude fast mode and turn off thinking. Tell it to just explain what its plan is to you at a high level.

It will go much faster.

by ilaksh

6/8/2026 at 6:26:43 PM

Have you tried Gemini 3.5 Flash? It's quite fast. Amazing how fast it finishes tasks. Much faster than Claude.

by UncleOxidant

6/9/2026 at 2:12:57 AM

You can run Claude in "fast" mode. It costs you more of your compute usage, but it's reasonably fast. I'm not sure I care to go "faster" than where things are now, otherwise you start losing on manual review and testing time. I would argue that Claude can poop out weeks (if not months) of coding effort in a few hours, and get you insanely close to a good product if you define the tech stack and the business rules. Can it goof here and there? Sure. You can also make it refactor all the code on a whim faster than any intern could. I think it's good enough to help you avoid mundane, stupid bugs in most cases. I don't know what people who hate it are doing; maybe they're not even trying at all, or are dismissing it from the first output (as though everyone writes perfect code in one shot, right?), or maybe it's just pride getting in the way of them using a decent tool to its true potential.

by giancarlostoro

6/8/2026 at 4:45:44 PM

Woah - what’s the prompt and what’s the PR?

by recroad

6/8/2026 at 5:04:54 PM

I replied in more detail under another comment. TLDR: fixing flaky CI across multiple branches

by goyozi

6/9/2026 at 3:45:20 AM

I’ve used codex code optimized for a few projects and it’s unsettling how fast it is. It’s hard to think fast enough to keep up with it. Mental fatigue was a real challenge because the decisions that required my input were rapid fire and legitimate ambiguities that were appropriate escalations. I am too much a geezer for the intensity of it. But I’ll take it!

by fnordpiglet

6/8/2026 at 11:24:32 PM

> That’s a game changer and I don’t even know where we fit in.

Doing non trivial work.

by OtomotO

6/8/2026 at 8:35:56 PM

Living on the street or cave lol

by Bombthecat

6/8/2026 at 10:57:31 PM

[flagged]

by joshcreates

6/8/2026 at 5:05:47 PM

So, regarding the productivity argument: I don't get it. It doesn't really matter (for regular employees) that you can now do in 2h what used to take 2 days. Why? Because it's not as if you have the rest of the day to yourself. You still have to work 8h/day as usual. But now the pattern is different: instead of enjoying the craft and digging deeper into problems over the span of 2 days, you are now rushing to some slot machine in the hope that it gives you the right answer with the right prompt.

So, if anything, I would say it's worse for us. Obviously, it's the complete opposite for corporations and executives: they are loving the AI situation so much!

by dakiol

6/8/2026 at 7:52:36 PM

In my case, I think a slower model makes it hard to manage context and tasks in parallel. I would much prefer to work on one task only, finish it, take a break, and work on another task. Currently I have three tabs for three tasks in parallel, and it is much worse because constant context switching is painful. I think a faster model would mean that you don't have to start a new task while waiting.

by powerapple

6/8/2026 at 11:01:36 PM

Agents completing work faster would certainly help me as well since I also find context switching exhausting above some threshold.

Build and test would move back into the critical path, though, and for some projects that will take effort to bring down.

by erikus

6/8/2026 at 5:16:26 PM

In which world do you live where employees work 8 hours per day ? They clock 8 hours per day maybe, but they don't work that time

by ttoinou

6/8/2026 at 7:42:44 PM

I had a friend who was CEO of a startup tell me that he typically only “worked” an hour a day, not because he was lazy but just because there was so much nonsense in his schedule. He told me he was trying to get it to two hours per day.

by drob518

6/9/2026 at 2:38:00 AM

How successful did he turn out to be? As a CEO your days should be jam packed with brutal "chewing glass and gazing into the abyss". Is he running a lifestyle type company?

by the_sleaze_

6/9/2026 at 2:54:04 AM

Tangential, but all companies are lifestyle companies, in the sense that they serve their owner's lifestyle choices.

It's just that lots of owners want a company that pulls them away from all other areas of life.

by Lalabadie

6/9/2026 at 6:32:01 PM

Quite successful. Tech company.

by drob518

6/9/2026 at 8:02:00 AM

This reads like compensation theatrics.

by croon

6/8/2026 at 6:19:26 PM

I agree with you.

I am on Dutch subreddits a lot, to get a local pulse and not to be too HN minded.

A lot of them would have vilified you by now. Some would have even questioned your morality.

Again, I agree with you. But clearly not everyone has this view.

by mettamage

6/8/2026 at 8:12:53 PM

In theory, ofc. But that doesn't matter. If you were doing something that took 2 days on average, but you were doing it in half the time, then that was fine pre-LLMs. Nowadays your manager knows that with LLMs you need to deliver faster no matter what, and then it's more difficult to "hide" and to slack.

by dakiol

6/8/2026 at 9:00:18 PM

Yeah. So, good things. We all know that people are mostly slacking at work.

by ttoinou

6/8/2026 at 6:52:27 PM

Generally, when people say they are working 8h/day, they don't literally mean it. Even "work" is basically impossible to define for a SWE.

by mystifyingpoi

6/8/2026 at 11:11:36 PM

Here’s my hot take as an elder millennial. Boomers are the absolute worst at being unable to make the distinction between time at work and time doing work. They may show up an hour before everyone else but spend the first two or three hours a day, reading the news and getting coffee and making small talk and accomplishing literally nothing. Then crow about their work ethic.

by opsnooperfax

6/8/2026 at 7:56:57 PM

Some companies force you to actually work 8 hours a day. It’s hell.

by ai_slop_hater

6/8/2026 at 8:57:26 PM

Which country and which companies ?

by ttoinou

6/8/2026 at 9:24:22 PM

E.g. factory work

by formerly_proven

6/8/2026 at 11:48:22 PM

Oh yeah, it's not the same, we were discussing Agentic AI

by ttoinou

6/9/2026 at 2:19:50 AM

I worked at a software company that took a screenshot of your screen every minute. I also worked a non-software white collar job where you were expected to work non-stop for 8 hours, except for an unpaid lunch break.

by ai_slop_hater

6/9/2026 at 8:48:21 AM

How did you accept such jobs ? I would never be able to pull this off as an employer

by ttoinou

6/9/2026 at 2:18:37 PM

Because nobody is hiring so if I got an offer I had to accept it.

by ai_slop_hater

6/9/2026 at 10:46:48 PM

People are still hiring, it's just very competitive - reframe rejection as learning opportunities returning wisdom.

In retrospect, many companies you get turned down from are likely companies you don't want to work for anyway hence the incompatibility.

It may be hard, but positive mindset will go very far towards enhancing your outcomes - you need to bring others up around you as well. Pause on this and think about the first thing that comes to mind when you respond to these words.

by razodactyl

6/9/2026 at 11:14:45 PM

I saw a comment on HN a while ago. I don't remember exactly how it was worded, but roughly it was something like: if you are self-taught (which I am), you will have to do many shitty jobs before you get a good one. That is how I think of my situation. I am still doing shitty jobs, but I think that the shittiest ones are already behind, and if I had not taken them, I would not be where I am now.

> you need to bring others up around you as well

I am not 100% sure what you mean here, but I don't think that I have the authority or reputation to "bring up others." I find that telling other people what to do is futile, and the best I can do is leave them alone and let them learn from their experiences, or else you might be labelled a "rock star," which is coincidentally being discussed on lobste.rs right now:

https://lobste.rs/s/uvwcdo/cleaning_up_after_ai_rockstar_dev...

by ai_slop_hater

6/9/2026 at 10:42:46 PM

The problem is that there are people willing to accept these conditions. Think higher of your self worth in future please.

by razodactyl

6/8/2026 at 6:54:04 PM

Like with any tech there are dumb ways of using it and there are smart ways. Treating it as a "slot machine giving you the right answer" is a dumb way - it may work for a bit, but it won't carry you very far because everyone else can also do this. No one is stopping anybody from digging deeper into problems than ever before using this technology - that's the smart way.

by dilyevsky

6/8/2026 at 11:13:18 PM

I'm amazed at how steep the AI learning curve continues to be and how people are spread so far apart on it. I think supercharged learning with AI and agents is undervalued at this point but that more people will realize its utility over time, especially as a complement to delegating work.

It also makes me think about the temptation to stop thinking with these tools, i.e. "cognitive surrender". Addy Osmani wrote a nice blog post about this: https://addyosmani.com/blog/cognitive-surrender

by erikus

6/9/2026 at 1:16:37 AM

[dead]

by fatata123

6/9/2026 at 3:43:42 AM

Yeah, nobody is under any pressure to work even faster than before. I don't know what everyone is complaining about!

by andai

6/9/2026 at 12:39:34 AM

If you split the tasks for the AI into small chunks you keep the architectural control and it's not a slot machine anymore. You still read code and occasionally you write code too. Not much, but it's the price to pay for the extra speed.

If you start the AI on something big and come back after one hour then yes, you might discover that you wasted an hour and got nothing.

by pmontra

6/8/2026 at 5:22:58 PM

You can dig deeper into problems with AI. For me, it supplements my knowledge in domains I don’t fully understand. It also helps me learn. So I can tackle problems I wouldn’t otherwise.

I’m excited for ultrafast AI. It likely means less temptation to multi-thread and deeper flow in single sessions.

by schipperai

6/8/2026 at 11:10:23 PM

how do you know that it is actually suggesting the right thing?

by 8note

6/9/2026 at 2:49:45 PM

Not OP, but: I guess in a similar fashion to when I google things or read other websites: I don’t, but I use my instinct, judgement, experience…

Very often I do catch LLMs, even the best such as Opus, confidently saying wrong things about areas in theory I know little of. And sometimes I fail to catch them and only realize that later on….sort of like…how I learned my whole career? So many wrong abstractions, tools, and so many hard earned lessons. With LLMs it’s the same, but the process is much faster. For critical decisions I don’t blindly trust an LLM, for example.

by jorl17

6/9/2026 at 10:41:14 AM

I trust AI to surface general information and best practices on established knowledge domains. For example: best practices for securing my VPS.

For domains where SoTA is constantly changing, like AI, I use LLMs to aggregate and interact with my own research from trusted sources à la Karpathy LLM wiki.

I don’t generally trust everything I read on the internet, whether it's AI generated or not. I do my own research for the things that matter to me.

by schipperai

6/9/2026 at 3:55:35 AM

Some things are verifiable. Before coding agents, if I encountered an issue with a library or a framework, my first hunch would be to find a GitHub issue with a suggested workaround. Nowadays, I can ask an agent to really dig into it and often it does surface the root cause. For example, the other day I got a test hangup after updating to Angular 22, and the agent managed to find the bug and suggest a very trivial workaround compared to what I originally planned to go with. I reported the issue and it was fixed the next day, more or less along the lines of what I'd do.

by Klaster_1

6/9/2026 at 11:10:24 AM

I’m digging into deeper / more complex problems, now. On top of that, I’m also building products faster for our startups, so I am filling in much more of a product role than merely an engineering one. But, really, it is both — and I’m absolutely loving it!

Also, with the added speed I can produce things more in line with the quality I’ve always wanted to add (many more tests, for example).

by jorl17

6/8/2026 at 5:59:55 PM

I was saying that AI is going to make software development cheaper, in the sense that software engineers' salaries will go down: some of that salary will now be redirected to AI companies, and the world will need to absorb twice (10x?) the amount of development power.

by himata4113

6/8/2026 at 6:18:58 PM

It's not obvious to me that salaries go down; my hunch was that salaries go up but the bar is higher. Software becoming easier to produce (still hard to verify and make useful fwiw) raises the ambitions of software projects, and we don't seem to be close to the ceiling of demand for software systems

by vanuatu

6/8/2026 at 6:33:45 PM

There's a limit to what the demandXsupply curve can absorb. It really depends if there's twice as many developers or 10 times more. I think we have enough software development jobs to where we can absorb productivity doubling rather easily, not so sure about anything beyond that.

by himata4113

6/8/2026 at 6:39:09 PM

True on the demand/supply curve

I think due to how leveraged software is, the top % of software developers are more desired (and compensated) than ever, and the bottom % will have difficulty finding a role, and there are structural barriers to entering that top % (intelligence, location, etc). Companies have infinite demand for the cream of the crop talent

by vanuatu

6/8/2026 at 6:48:09 PM

I can actually back this up, most job offers I get actually come from people I happened to work with that never get a public job listing and are only obtainable via being highly regarded by others. I was told that my friend in their department where the role opened up got an email about a senior position and to reply if they have a recommendation.

However, software development is funny in a way where you don't need a job in order to be successful. I've never worked at a company and I'm pretty up there on the ladder, but I am not quite sure what will happen in the next few years when every possible thing that can be made in software is already explored to the fullest, especially with singular developers launching 3 to 7 projects a month.

by himata4113

6/8/2026 at 8:37:07 PM

> with the hope of it giving you the right answer with the right prompt.

Consider that our ability to evaluate quality of the output is falling further behind our ability to produce it. The “right answer” is not the most likely outcome.

by DenisM

6/8/2026 at 7:00:32 PM

Sure, but if you're really unhappy with your employer employing you for 8 hours a day you can also harness this power on your own personal projects to help break free from the 9-5 grind if you so desire.

by drschwabe

6/8/2026 at 7:18:07 PM

Only if your personal projects make you money. I have a million hobby projects but none generate income.

by __david__

6/8/2026 at 9:58:09 PM

I feel like I spend a lot more time reviewing and fixing the output of it and debugging parts it can't debug, so to me a faster model is optimizing the part that is already pretty fast. If my job were greenfield stuff I would probably YOLO it more, but when you're working on a launched product with a lot of users..

by overgard

6/8/2026 at 5:10:20 PM

It's making things less fun, for me at least.

by fullstop

6/8/2026 at 6:20:21 PM

Odd, I'm having the opposite experience.

The thing I really love about working with computers is when I achieve something. That's the thing that makes me figuratively, and sometimes literally, throw my fists into the air and go "Yeaaah!"

With the AI tooling, I'm getting those more like a couple times a week.

Plus, I'm using AI to attack the things in my day that are "a drag", and getting them done too.

The highs are more frequent and the lows are not so low.

by linsomniac

6/8/2026 at 7:54:50 PM

Oh, sure, I can make things with it. But I have an extraordinarily hard time saying that I made something.

It feels like it cheapens the whole thing. Maybe I'm just old, because I remember people saying the same thing about code completion in Visual Studio back in the late 90s.

This is so much more than code completion, though.

by fullstop

6/8/2026 at 8:45:58 PM

Exactly how I feel. I didn’t make a damn thing. I essentially asked a chatbot to.

Did I ask for better things with some important concepts pre-rolled? Yeah, of course. But that’s so, so much less interesting than having actually made a thing.

I try to remind myself that the output of my projects have nothing to do with who I am, but the honest truth is they always mattered to me.

Now that’s dead, and it’s never coming back. It ain’t exactly existential dread, but it is something I’ve lost.

by dd8601fn

6/8/2026 at 8:43:39 PM

I did a deep binge on two or three projects I would never do, and like five small ones that would have consumed months.

It felt like that, kinda, for a bit. Now whenever it does something for me I get nothing. I didn’t do it… the chatbot did. What’s for me to celebrate? How can there be any real pride or satisfaction for a thing that was just handed to me because I asked for it?

If anything it diminishes my satisfaction looking back on previous projects. They’re “a few hours with a chatbot”, now.

The things I had to learn and the informed decisions I had to make? All pointless trivia, now. A child could do it.

The magic and possibilities parts just all wore off after a heavy run, and I don’t know if that’s ever coming back.

by dd8601fn

6/8/2026 at 9:34:09 PM

I hear what you and the other sibling comment are saying. I, thankfully, somehow, am able to focus more on the results than the process. Having fun playing a game (that AFAIK no longer exists) with my family is still having fun. Having people using a new apt cacher that fixes problems with existing ones, and also can survive the recent DDoS, is still a really great thing.

But, I'm not going to yuck your yum. I appreciate the people who do joinery using hand tools, even if I'm out here with a track saw and a router.

by linsomniac

6/8/2026 at 10:04:41 PM

Do you feel the same way about cloning a GitHub repo and building it? It, too, achieved a result.

The track saw and router, imo, are existing libraries.

by fullstop

6/9/2026 at 12:48:11 AM

> The things I had to learn and the informed decisions I had to make? All pointless trivia, now. A child could do it.

This is probably hyperbole. Did you do the experiment? I expect that the child won't be able to do it. Ask an adult. Same thing. Ask an expert in the domain. Maybe, but not as fast or as well as you.

by pmontra

6/9/2026 at 6:30:58 AM

Yes that’s more “how it feels” than something I’ve had kids actually try.

by dd8601fn

6/8/2026 at 6:15:48 PM

Employees who get paid a flat rate per hour don't have the incentive to do more than their job

Equity / profit sharing should be commonplace in the age of AI.

by vanuatu

6/8/2026 at 6:09:19 PM

I dig into problems way, way deeper with AI than without. I can also add a lot more polish to features, add more test coverage, write more documentation, explore multiple approaches rather than go with gut-feel, and so on.

by enraged_camel

6/8/2026 at 5:36:48 PM

That's the fundamental trade off of a job where someone else gives you stuff to do and you get money. We may pride ourselves on software development being a job 'above' flipping burgers, but you're getting paid to have your butt in a chair for 40 hours a week. In exchange, you don't have to worry about the business shit. How much a burger or SaaS license costs the user isn't your problem. You take Jira tickets and implement them. You trade time for money. If, instead, you work for yourself; contracting, writing your own apps, buying lottery tickets, then you're trading results for money. If you're a freelance web developer with a stable of clients, it's a great time! What used to take a week takes hours, and you can charge your clients the same amount to build an even better website with you using AI, which means you get the choice of building a new website for additional clients, or you can take the time off and not build additional websites. But you have to hustle to continually get new clients, before AI and after AI. So it's a different life.

by fragmede

6/8/2026 at 7:21:11 PM

A huge class of problems are just toil and drudgery. Maybe ai will give you even more time to dig into juicy problems that are too complex for it to solve, by letting you bypass all the pure toil problems.

by IncreasePosts

6/9/2026 at 2:27:43 PM

I dunno man, the slot machine pays out like 99% of the time for me.

by marknutter

6/8/2026 at 5:33:03 PM

I think of it as a genetic algorithm loop. The LLM is basically a mutator function within the loop. If you can define the end shape you're looking for using tests and a specification, then you can throw the LLM at the problem and have it converge on the solution. It generates some code, it gets run, the LLM is fed the result back, and it iterates. If you can run the LLM at a really high throughput, then you can iterate on the solution faster. This can largely compensate for the overall capability of the model. Instead of hoping it gets the right solution in a few shots, you can just have it try a whole bunch of things until you get a useful result.
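
A minimal sketch of that loop, under loud assumptions: `toy_propose` is a stand-in for the LLM call (it just replays a fixed list of guesses), and this is a single-candidate generate-and-test loop rather than a full genetic algorithm with a population and crossover. All names are hypothetical.

```python
from typing import Callable, Optional, Tuple


def iterate_until_passing(
    propose: Callable[[Optional[str], Optional[str]], str],
    run_tests: Callable[[str], Tuple[bool, str]],
    max_attempts: int,
) -> Tuple[Optional[str], int]:
    """Ask for a candidate, test it, feed the result back, and repeat."""
    candidate: Optional[str] = None
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        candidate = propose(candidate, feedback)
        passed, feedback = run_tests(candidate)
        if passed:
            return candidate, attempt
    return None, max_attempts


# Stand-in for the model: replays canned guesses and ignores the feedback.
GUESSES = iter(["a - b", "a * b", "a + b"])


def toy_propose(previous: Optional[str], feedback: Optional[str]) -> str:
    return next(GUESSES)


def toy_run_tests(expr: str) -> Tuple[bool, str]:
    """The 'specification': the expression must behave like addition."""
    cases = [((2, 3), 5), ((0, 0), 0), ((-1, 1), 0)]
    for (a, b), expected in cases:
        got = eval(expr, {}, {"a": a, "b": b})  # acceptable only in a toy
        if got != expected:
            return False, f"{expr!r} gave {got} for ({a}, {b}), expected {expected}"
    return True, "all tests passed"


solution, attempts = iterate_until_passing(toy_propose, toy_run_tests, max_attempts=5)
```

The attempt cap is where throughput matters: a faster model fits more iterations into the same wall-clock budget, but the loop is only as good as the tests that define "passing".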

by yogthos

6/8/2026 at 5:28:27 PM

>instead of enjoying the craft digging deeper into problems in the span of 2 days, now you are rushing into some slot machine with the hope of it giving you the right answer with the right prompt.

If you're treating it like a slot machine you're doing it wrong. It will give you exactly what you ask for if you ask clearly, i.e. write a clear, detailed specification, not just "do X!". The nondeterminism comes from vagueness in specification.

by logicchains

6/8/2026 at 5:19:42 PM

You have to think of the LLM as the genie that tries to trick you.

First make it write a contract (REQ/ARCH/IMPL documents). Skim through those for any mistakes.

Then based on those ask it to write tests. Again skim through them.

Now you have a context full of guardrails. It’s less likely to surprise you.

by noncoml

6/8/2026 at 5:57:06 PM

I find a second LLM can do this at least as well as I can, usually, and just ask the harness to surface anything they can't agree on.

by petesergeant

6/8/2026 at 5:27:59 PM

Generally, I agree because what happens is the messaging around AI is doing more, faster. Not using AI to deliver at a higher quality level, etc. But I think it boils down to incentives and discipline. So given the incentives we have today at most workplaces faster AI will just be used to produce more slop.

by alfalfasprout

6/9/2026 at 2:25:24 PM

[flagged]

by alfredoh07

6/8/2026 at 4:05:31 PM

These price and speed optimizations from Chinese providers, combined with the rising prices from American ones, will change the game sooner rather than later. Many companies are finding issues with the AI bills already.

by amunozo

6/8/2026 at 4:31:51 PM

Chinese models are good enough and cheap.

I have a GitHub Copilot yearly subscription. Microsoft recently changed their billing to be token-based. I'm still getting billed per premium request, but GPT 5.4 is now 6x compared to 1x before.

by MangoCoffee

6/8/2026 at 5:27:51 PM

It's going to be an issue when China ends up scaling faster as well. Faster tokens, faster clusters, qat models, fp4, it's getting scary.

by reactordev

6/8/2026 at 5:56:39 PM

Issue for who?

by AndrewKemendo

6/8/2026 at 7:02:09 PM

Issue for any country that is not China. A single country getting the most AI tokens business would be generally bad for global economy. Hoping against hope that this business gets globally distributed and there is a healthy marketplace competition overall

by fillskills

6/8/2026 at 7:58:21 PM

It’s all about economic warfare. The cheaper you can run the models, the cheaper you can offer them, undercutting expensive tiers with token limits or exorbitant billing practices.

You are right to be scared, because this race to the bottom also provides open weights/models/qat’s for the rest of us and it’s been crazy to see how good they can be on a consumer grade RTX card.

by reactordev

6/8/2026 at 6:10:00 PM

For uncle Sam Altman.

by throwa356262

6/8/2026 at 6:08:31 PM

American Politics and the far right.

by reactordev

6/8/2026 at 8:44:16 PM

For the West

by fortzi

6/8/2026 at 5:42:30 PM

I'm kind of poor so I have been trying to use DeepSeek v4 Flash, GLM 5.1 etc. as much as possible recently instead of Claude or GPT.

by ilaksh

6/8/2026 at 5:57:48 PM

You would do us all a service by telling us how your experiences of that have been.

by petesergeant

6/9/2026 at 1:22:32 AM

I've been doing the same, though admittedly out of curiosity more so than lack of funds. The open models are catching up quickly in their abilities, to the point where they're (mostly) not doing stupid stuff regularly, but you have to be very specific about what you want. I found that Opus, for example, is much better at asking me to clear up ambiguity in a request before starting, whereas the Chinese models tend to "fill in the blanks" and make their own assumptions.

My current workflow involves going from PRD -> execution plan -> build -> review, and this works nicely with open weight models like GLM 5.1, Kimi K2.6, and DeepSeek V4 Flash. With Opus I can generally skip the PRD entirely, and sometimes even skip the plan, and 80-90% of the time it does exactly what I want. But that can easily burn $5-15 for one feature, whereas it'll cost maybe $1-2 with the open weight models (at API pricing).

by RussianCow

6/9/2026 at 3:47:57 AM

> ... you have to be very specific about what you want. I found that Opus, for example, is much better at asking me to clear up ambiguity in a request before starting, whereas the Chinese models tend to "fill in the blanks" and make their own assumptions.

That's the main thing I've noticed. Small models can follow instructions just fine. If the instructions are very specific. Then I often have to spend more time explaining a task than it would have taken me to do it myself.

The bigger models have a lot more common sense.

I wonder if that could be improved slightly through prompting. Asking it to clarify anything that's confusing. Or maybe it just makes incorrect assumptions without realizing the ambiguity. One way to find out!

by andai

6/8/2026 at 6:51:25 PM

I would say about 35% of the time I run into problems and eventually give up and go to GPT 5.5 and it much more efficiently handles the original task. Then I see the token costs going up and it motivates me to continue trying the open source ones.
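
A rough expected-cost sketch of this "try the cheap model first, fall back when it fails" pattern. The 35% figure is the one above; the dollar amounts are illustrative midpoints of the per-feature ranges quoted elsewhere in this thread, and the model assumes the cheap attempt is always paid for even when it fails.

```python
def expected_cost_with_fallback(
    cheap_cost: float,
    expensive_cost: float,
    fallback_rate: float,
) -> float:
    """Average cost per task when the cheap model is always tried first."""
    return cheap_cost + fallback_rate * expensive_cost


# Assumed per-feature costs: about $1.50 for open-weight, about $10 for frontier.
cheap_first = expected_cost_with_fallback(1.5, 10.0, fallback_rate=0.35)
frontier_only = 10.0
savings = frontier_only - cheap_first
```

This ignores the human time lost on the failed attempts, which may matter more than the token bill.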

by ilaksh

6/9/2026 at 3:45:39 AM

Did you try deepseek v4 pro as well? And what kind of tasks?

I'm seeing some people say flash is amazing and can handle everything, and some say it's useless. It seems to depend on the task. I think it depends on the harness too (it works better in Claude Code in my experience, it's probably been trained on that).

by andai

6/9/2026 at 3:07:23 PM

The problem for me with DeepSeek V4 Pro is that a significant amount of the time it just never seems to finish what it is doing: long thinking, then a long time to output, or it never finishes at all. That has happened to me several times. It could partly be my agent framework, but I have heard other people complain about that too.

It has limitations, but it is way better than I'd expect from something named Flash that is open source.

by ilaksh

6/9/2026 at 8:04:56 AM

There's going to be a tipping point where it's worth purchasing more hardware to run the next biggest size of the open model, if they show stepwise improvements that way.

by Schlagbohrer

6/8/2026 at 6:43:45 PM

I used Opus 4.6, then downgraded to Sonnet, then to GLM5/5.1. GLM is as good as Sonnet. I recently started using Opus 4.8 again and GLM is not close to that.

30 day eval for each.

by polski-g

6/9/2026 at 11:58:16 AM

The only one that is really close to Claude in performance is GLM-5.1. The others (Mimo, DeepSeek, etc.) look good on paper but usually fail on multi-step agentic orchestration.

This is at least my experience with Claude Code as harness. Also, GLM pricing is not that far off from Claude. It's cheaper but not DeepSeek cheap.

by csomar

6/8/2026 at 4:38:31 PM

Another problem is that US models are all closed source, and if you're a large corporate you may not want your org to be held hostage by OpenAI / Anthropic.

I genuinely don't understand what moat these US model labs have. If they're saying recursive self improvement is just around the corner and Chinese labs are only slightly behind the leading US models, what moat do the US labs have? Are the US models going to recursively self improve better than the Chinese open source ones or something?

I might be completely wrong about this, but if I had money in OpenAI or Anthropic I'd be pulling it all right now. I think the chance of them going to near-zero over the next few years is very significant.

by kypro

6/8/2026 at 5:27:19 PM

> you may not want your org to be held hostage by OpenAI / Anthropic

Or Google. I'm working with multiple customers right now that are very pissed at Google for deprecating Gemini 2.5 Flash and canning the GA release of 3.0 Flash, and who now have to decide whether to bite the bullet on the 5x price increase for 3.5 Flash or switch providers. Quite a few of them will likely fully pivot to open models.

by hobofan

6/8/2026 at 6:30:17 PM

I'd be curious if any of your customers have tried 3.1 Flash Lite. It's cheaper than 2.5 Flash, and in my experience with the free tier, quite an upgrade in terms of quality of response. My suspicion is that Google is killing off the old models because they aren't a good value for the customer or for themselves.

by bachmeier

6/9/2026 at 2:17:48 PM

Most of them are using it for data extraction use-cases on complex where they are already in a tricky cost vs. quality compromise. Some of them have evaluated 3.1 Flash Lite but for all of them it performed worse than 2.5 Flash and below requirement.

The only ones I've seen switch to 3.1 Flash Lite were from 2.5 Flash Lite, and all for the most simple use cases, e.g. small UX enhancements.

by hobofan

6/8/2026 at 4:53:02 PM

Their moat is cash to pay politicians to regulate away competition.

by lokar

6/8/2026 at 9:48:19 PM

maybe the moat is that we slowly start to forget how to code by hand and then you -need- the AI tool.

by GoToRO

6/8/2026 at 5:38:56 PM

I think they are racing because the first ASI will 'win', preventing others. Of course, we won't be able to bake the right goals into it, though.

by ChrisClark

6/8/2026 at 6:31:33 PM

I don't think it's going to automatically prevent others. Super Claude might understand why diversity is important. If we're talking sci-fi scenarios, the most likely one is probably Overwatch (multiple independent AIs with gray ethics and complicated relationships) more than Skynet.

by tancop

6/8/2026 at 4:24:37 PM

I see a bigger problem with model inconsistency. You never know whether Anthropic will route your request to a cheaper model for the price of Opus. So you can never estimate how much a task will cost, because you might have to restart several times and pay for each attempt. Then you have to prompt models to gauge whether they are real or impostors, which also adds to token usage.

by varispeed

6/8/2026 at 4:31:22 PM

> You never know whether Anthropic will route your request to a cheaper model for the price of Opus

For non-subsidized plans? Pretty sure they'd need to put this in the ToS, or lawsuits would have followed by now.

by ignoramous

6/8/2026 at 4:50:53 PM

How can you prove it?

Sometimes Opus just gives me a rubbish session.

by trollbridge

6/9/2026 at 6:53:00 PM

But you don't know why...

by chairmansteve

6/9/2026 at 1:24:59 AM

Isn't that true of any provider? Anyone could be lying about what they're serving.

by RussianCow

6/10/2026 at 12:20:50 AM

Yep. For open weights at least, there are possible ways to verify. Ex: https://www.kimi.com/blog/kimi-vendor-verifier

by ignoramous

6/8/2026 at 5:10:34 PM

No, they 100% use MTP with a cheaper model alongside Opus, and it would in fact be unprovable if they just sometimes switched to auto-accepting everything from the MTP. It's true that if they did, Anthropic would need to hide that they do this, so it's probably not a huge deal.
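
To make the "auto-accepting" point concrete, here is a toy sketch of greedy draft-and-verify decoding. It is not a description of what any provider actually runs: `speculative_step`, `replay`, and the `auto_accept` flag are hypothetical names, the "models" are lookup tables, and real systems verify all drafted tokens in one batched forward pass rather than one call per token.

```python
from typing import Callable, List

Token = str
NextToken = Callable[[List[Token]], Token]


def replay(sequence: List[Token]) -> NextToken:
    """Toy stand-in for a model: always emits sequence[len(context)]."""
    def next_token(context: List[Token]) -> Token:
        return sequence[len(context)]
    return next_token


def speculative_step(
    context: List[Token],
    draft_next: NextToken,
    target_next: NextToken,
    k: int,
    auto_accept: bool = False,  # hypothetical switch, for illustration only
) -> List[Token]:
    """Return the tokens emitted by one draft-and-verify round."""
    # 1. The cheap draft model proposes k tokens.
    drafted: List[Token] = []
    for _ in range(k):
        drafted.append(draft_next(context + drafted))

    # 2. Skipping verification hands back the draft model's output unchanged.
    if auto_accept:
        return drafted

    # 3. Greedy verification: keep drafted tokens while they match what the
    #    target model would have produced, then substitute the target's token
    #    at the first mismatch and stop.
    accepted: List[Token] = []
    for token in drafted:
        expected = target_next(context + accepted)
        if token != expected:
            accepted.append(expected)
            return accepted
        accepted.append(token)
    return accepted


target = replay(list("abcdef"))
draft = replay(list("abXdef"))
print(speculative_step([], draft, target, k=4))                    # verified
print(speculative_step([], draft, target, k=4, auto_accept=True))  # unverified
```

With verification on, the emitted tokens are exactly what the target model alone would have produced, which is why the technique is normally invisible to the user. With it off, the caller only ever sees the draft model's tokens, and nothing in the response itself says so.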

by sometimelurker

6/9/2026 at 5:27:26 AM

1. How would you know?

2. They are doing lots of shady stuff that would have gotten someone else banned from visa/mastercard. Your paid off plan literally changes after billing...

I think people are letting them fly for now, because if it turns out true that they'll have AGI they want to be on their good side? We might see the knives getting pulled otherwise.

by csomar

6/8/2026 at 4:20:21 PM

I wonder what economics are driving these pricing decisions. Are the Chinese companies just subsidizing their models to a greater degree than the US, or is this an emergent property of energy policy between countries?

by throwaway894345

6/8/2026 at 6:56:29 PM

For one, they invested in infrastructure. They can build fast and efficiently. They can provide power, they can provide cooling. Even if you just make roads better you make everything more efficient. Plus level of standard education. It all compounds.

On HN China is seen as a cheap labor copycat. This used to be a fair approximation at some point in the past. In my opinion China is getting ahead of everyone else much more than the US used to be.

SF is a beautiful thing in the US; vast power and wealth come from there. Smart people collaborating, communicating, and building fast and with excitement. China did the SF kind of thing for many different sectors in many different places.

by comboy

6/8/2026 at 4:24:31 PM

Throwing out another factor: Chinese companies have been banned and/or limited from buying Nvidia, and turned to local companies for their hardware. I haven't actually seen pricing/benchmarks comparing Chinese AI accelerators, but it wouldn't surprise me if that also worked out in their favor.

by Octoth0rpe

6/8/2026 at 4:54:10 PM

And, possibly, state subsidies at every level.

by lokar

6/9/2026 at 8:09:22 AM

I have to point out the massive state subsidies in the United States for the tech companies and datacenter builders.

by Schlagbohrer

6/9/2026 at 12:22:27 AM

Their models are much smaller: 1T vs 5T for the frontier models. 1T is Sonnet/Google Flash size, not Opus size.

The $0.87/M tokens price for Mimo Pro is probably subsidized.

Mimo models aren't widely available on western providers, but Kimi and Deepseek are similar sizes and cost about the same to run. They are priced at $3-$4/M tokens (which is right where Google's very confused range of Flash models is priced: between $0.40/M tokens and $9/M tokens depending on exactly which model - and you don't want the $9 one!).

Anthropic overprices Sonnet (probably because of their capacity issues). GPT 5.4 mini is $4.50/M tokens.

https://docs.fireworks.ai/serverless/pricing

https://www.together.ai/pricing

by nl

6/9/2026 at 11:42:10 AM

I'm not sure about those parameter sizing claims. Regardless of parameter size, benchmarked intelligence of Chinese and Western frontier models is comparable, so who cares how many parameters it takes to get there.

Mimo is also widely available on western providers. It's on openrouter and you can sign up with Xiaomi directly for a token plan on an English website priced in dollars.

by Cakez0r

6/9/2026 at 4:39:27 AM

The Chinese economics: possibly the USA's experience.

It was pretty clear the USA won World War 2 because it out-produced and out-innovated everyone else. Probably with that in mind, after World War 2 the USA adopted the "Vannevar Bush" model, summarised in this picture: https://www.researchgate.net/figure/annevar-Bushs-Science-th... The idea is to jump-start R&D through public funding. The hoped-for outcome was that R&D would feed private enterprise, leading to a productivity boom.

The boom happened, and the USA did seem to out-compete everybody else in R&D, science, and the products they delivered for decades after that.

That way of doing things seems to have faded over time in the USA. The decline seemed to coincide with the rise of Neo-economics, and now of course it's been obliterated by Trump. He's very keen to fund Intel to produce chips in a year or two's time (which is something the stock market and banks do perfectly well), but funding basic science is getting drastic cuts.

Still, other countries noticed the rise of the USA, and some adopted similar funding models for basic R&D. China seems to have picked it up with gusto, subsidising both R&D and STEM training, leading to huge numbers of engineers and scientists. Whether it will lead to an economic boom remains unknown, but the acceleration of ideas and innovations coming out of China seems undeniable. More recently, Ukraine showered its local engineering garages with funds in the hopes of getting a similar outcome to the USA in WW2. It looks like it worked. If the Iran war continues, it's entirely possible the arms trade will reverse: the USA could well start buying drones off Ukraine.

by rstuart4133

6/8/2026 at 4:37:41 PM

Lower cost of labor, lots of under the hood optimizations (e.g. cache hits for DS), many of these companies have existing infra (fewer upfront costs for deployment), etc

by throwaway67678

6/8/2026 at 5:04:43 PM

China isn't that cheap for labor. And if you think the guys in Z.ai or xiaoxiao aren't the exact same guys from Tsinghua, Peking, MIT, Stanford, CMU, etc. and pulling in amazing salaries you'd be wrong.

by ecshafer

6/9/2026 at 2:17:44 AM

Z.ai was actually a spin-off from Tsinghua (THUDM) AFAIK.

by nmfisher

6/8/2026 at 5:18:16 PM

I'd assume there's more to the cost of labor than the salaries of the elite folks who do the R&D, but fair point

by throwaway67678

6/8/2026 at 4:58:52 PM

Maybe not being led by a sociopath also helps.

by orphea

6/8/2026 at 7:42:08 PM

I'm pretty sure Xi is also a sociopath, but he differs from Trump in that he's competent. And maybe that's a good thing for American democracy--if we had a competent dictator who could manifest massive infrastructure projects maybe the pro-democracy backlash would be significantly attenuated?

by throwaway894345

6/9/2026 at 9:44:10 AM

Oh, I was thinking of OpenAI and Anthropic CEOs.

by orphea

6/9/2026 at 12:29:48 PM

Heh, isn’t it fun living in a timeline where there are so many sociopathic leaders that your earlier comment is ambiguous? (:

by throwaway894345

6/8/2026 at 4:02:24 PM

Given that MiMo is as cheap as Deepseek (previous discussion: https://news.ycombinator.com/item?id=48282814), multiplying that by 3x for ultra speed is still shockingly cheap.

by kingstnap

6/8/2026 at 4:31:11 PM

MiMo and DeepSeek are not cheap. Anthropic and OpenAI are expensive for what they provide.

by miroljub

6/8/2026 at 4:41:54 PM

You don't consider Input $0.435 Output $0.87 cache read $0.003625 per million tokens for near frontier intelligence cheap?
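
For a sense of scale, a minimal sketch that turns the per-million-token prices quoted above into a per-request cost. The `Pricing` class, `request_cost` function, and the token counts in the example are made up for illustration; only the three prices come from this comment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Prices in USD per one million tokens."""
    input_per_m: float
    output_per_m: float
    cache_read_per_m: float


# The three prices quoted in the comment above.
QUOTED = Pricing(input_per_m=0.435, output_per_m=0.87, cache_read_per_m=0.003625)


def request_cost(
    pricing: Pricing,
    fresh_input_tokens: int,
    cached_input_tokens: int,
    output_tokens: int,
) -> float:
    """Cost in USD of one request, billing cached input at the cache-read rate."""
    total = (
        fresh_input_tokens * pricing.input_per_m
        + cached_input_tokens * pricing.cache_read_per_m
        + output_tokens * pricing.output_per_m
    )
    return total / 1_000_000


# Hypothetical agent session: 200k fresh input tokens, 2M cache-read tokens,
# and 50k output tokens.
print(f"${request_cost(QUOTED, 200_000, 2_000_000, 50_000):.5f}")
```

The cache-read rate matters most for agentic use, where the same long context is re-sent on every turn: in the hypothetical session above, the 2M cached tokens cost less than the 50k output tokens.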

by chrismustcode

6/8/2026 at 7:09:16 PM

No. They still have enormous profit margins on inference with these prices.

by miroljub

6/8/2026 at 9:11:28 PM

Their margins don't impact my own assessment of end user pricing as cheap.

by handfuloflight

6/8/2026 at 7:35:03 PM

Any source to back up this claim, pretty please?

by guilamu

6/9/2026 at 1:44:34 PM

Source? There are countless providers serving open weight models for fun and profit.

by miroljub

6/8/2026 at 11:55:18 PM

I highly doubt there is any margin on those inference pricing.

by HDBaseT

6/9/2026 at 11:24:37 AM

> I highly doubt there is any margin on those inference pricing.

And yet, OpenCode Go offers DeepSeek flash 6 times cheaper than DeepSeek itself. And they claim they are still profitable.

by miroljub

6/8/2026 at 11:45:47 PM

It’s near the frontier meaning it’s the best intelligence for the price.

It’s not even close to frontier meaning it’s the best intelligence.

by pmxi

6/9/2026 at 5:00:51 AM

I hardly notice DeepSeek being inferior to Claude Opus unless I have it working on tricky and under-defined problems. That is, I trust Opus to reason much better when it has the choice. Otherwise, IME DeepSeek is far cheaper and more effective for anything where the solution is even somewhat obvious.

by LoganDark

6/9/2026 at 11:51:48 AM

Out of curiosity, what is your stack? And is this in a legacy project or new one?

I have tried using DeepSeek Flash and Pro, but they make amateur mistakes. Sonnet level at best.

However v4 flash is absolutely amazing as a generalist model and it’s what we’re using on a product built on top of LLMs. I wish I could code with it but it’s not going to happen anytime soon

by jorl17

6/9/2026 at 3:40:26 PM

I've used it across many new projects as well as many legacy ones. It does make amateur mistakes so you can't leave it unsupervised for hours like I do with Claude, but it's so much cheaper that weeks of heavy usage haven't even cost me $10 yet. Only other downside IMO is that Pro is pretty slow, even compared to frontier models; only around 120t/s IIRC.

by LoganDark

6/9/2026 at 4:38:55 PM

Yes I also noticed it is pretty slow, which sort of defeated the purpose of using it for me.

Usually I'm working on a large task, typically with Opus, while also having a bunch of smaller tasks in their own independent worktrees. Those still need supervision, but less. My goal was to get deepseek to drive the cost of those down, but it was too slow and unreliable...

by jorl17

6/10/2026 at 3:40:22 AM

Yes, I could tolerate the unreliability better if it were faster, but it's really not. So it's too slow for me to actively supervise it, but too unreliable for me to trust it unsupervised. The shitty middle. I often have multiple of them open at a time and check my terminal every few minutes to lead them along. Mostly works.

by LoganDark

6/8/2026 at 4:55:03 PM

Energy is likely more abundant in China. I am not sure about compute, but that must be part of the reason for such drastic price differences.

by tmaly

6/8/2026 at 7:25:07 PM

They're leaving us in the dust on solar, while our current administration is still trying to put people in the ground to dig up more coal and die of black lung. https://en.wikipedia.org/wiki/Solar_power_in_China

by SwellJoe

6/8/2026 at 9:40:06 PM

They're building more coal than anyone.

Also more nuclear than anyone, which one must assume you hate, because preferring solar requires that you don't actually understand things.

by diordiderot

6/9/2026 at 8:28:30 AM

Energy from coal in China decreased last year. The change is happening very quickly.

by yxhuvud

6/8/2026 at 5:04:06 PM

They also don't have to inflate profits for a coming IPO.

by amunozo

6/8/2026 at 4:37:23 PM

The Chinese "Neijuan" is real & well reported: https://www.reuters.com/business/autos-transportation/what-i...

Another thing: the BigLabs accuse open weight models of benefiting from distillation & other techniques, and so essentially avoiding higher training costs (which typically bleed into the bills end users pay for inference).

Ex A: https://www.anthropic.com/research/2028-ai-leadership

Ex B: https://www.reuters.com/world/china/openai-accuses-deepseek-...

by ignoramous

6/8/2026 at 4:50:01 PM

We buy cheap Chinese goods all the time. Absolutely nothing wrong with that.

In this case, at least it’s threatening multimillion dollar salary jobs instead of entire towns of working class people in America or Mexico.

And the Chinese labs actually release their weights. You could call it… open AI.

by trollbridge

6/8/2026 at 4:54:00 PM

Lololol.

by ncr100

6/8/2026 at 4:57:30 PM

Big labs ripped videos off YouTube without caring about the ToS, and grabbed as much published literature as they could get their hands on, regardless of legality (Books3, The Pile). The goal of "democratizing human knowledge" by way of thinking machines is far too noble to worry about frivolities like copyright and authorial consent, they said. Until it was their output being exploited, and their earning potential threatened.

by overfeed

6/8/2026 at 5:16:50 PM

We just had years of US model providers arguing it was fine to rip off the world’s cultural output for their own profit, why should their work be treated any different?

by drawfloat

6/8/2026 at 4:46:32 PM

True, but why would end users care about that? If anything, training on synthetic AI output is more ethical than on scraped human works (of course, not to say the Chinese labs aren't doing the latter)

by flexagoon

6/8/2026 at 5:05:32 PM

The Chinese are also simply better at making a lot of things cheaper, e.g. solar panels or electric vehicles.

by amunozo

6/8/2026 at 4:44:13 PM

MiMo V2.5 Pro (regular speed) remains the strongest open weights agentic coding model we've tested -- it's been interesting to see how little attention it has received relative to some lower performing releases. And the "fast mode" pricing is very competitive here.

Data at https://gertlabs.com/rankings

by gertlabs

6/8/2026 at 6:00:13 PM

Why is DeepSeek V4 Pro a lot lower than Flash? Where is MiMo 2.5?

by unrvl22

6/8/2026 at 7:59:39 PM

DeepSeek v4 Pro struggles with a custom harness, and all the models ranked above it don't, so it gets downweighted in the agentic coding benchmarks (although it ranks better than Flash in one-shot problem solving: https://gertlabs.com/rankings?ow=1&mode=oneshot_coding). We ran plenty of samples.

MiMo v2.5 is on there, as well as the pro version.

We found a few anomalies in our evaluations, which makes sense -- if every new sub-release is better across the board in every area of the model card, that should raise alarms about benchmaxxing. But the main thing we found is that hype != performance, and I trust our benchmark methodology significantly more than the model cards the labs add to their press releases.

by gertlabs

6/9/2026 at 3:50:44 AM

Mimo struggles with my custom harness. (Ignores the instructions and defaults back to its own preferred tool calling syntax.)

Flash handles it fine, which I found amusing. (Since Mimo is supposed to be opus level!) But Flash seems to work even better in Claude Code...

With smaller models I always have the issue of needing to adapt myself to their preferred workflow... which sort of defeats the purpose. Price is hard to beat tho :)

by andai

6/9/2026 at 7:54:57 AM

Mimo v2.5 non-pro seems to do better with tool usage than its Pro sibling, is much cheaper and solves 90% of the same problems. I use Pro only for one-off tasks that require complex reasoning: memory management bugs, algorithms, planning.

When it gets stuck, I get one-shot advice from Claude or DS Pro. I’ve done massive amounts of work for cheap this way.

by ricardobeat

6/10/2026 at 8:55:30 AM

Thanks. I found One Weird Trick to make Mimo v2.5 Pro work in my harness, which is that I just added an example bash tool usage to the system prompt. Now it works fine.

The issue was that my previous instructions had <command> as a placeholder. But the model started wrapping bash commands in <command></command> tags... haha. Now that it has an actual example it just works properly.
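
A rough sketch of that fix, assuming a harness where the model replies with a line starting with `bash: ` (that prefix, both prompt strings, and the helper names are hypothetical; the actual harness syntax isn't shown here):

```python
from typing import Optional

# Before: the tool syntax is described with a placeholder, which a model may
# copy literally or "complete" into <command>...</command> tags.
PLACEHOLDER_PROMPT = (
    "To run a shell command, reply with:\n"
    "bash: <command>"
)

# After: the same instruction, but with one concrete, copyable example.
EXAMPLE_PROMPT = (
    "To run a shell command, reply with a line starting with 'bash: '.\n"
    "Example:\n"
    "bash: ls -la src/"
)


def extract_command(reply: str) -> Optional[str]:
    """Return the first shell command in a model reply, or None if absent."""
    prefix = "bash: "
    for line in reply.splitlines():
        if line.startswith(prefix):
            return line[len(prefix):].strip()
    return None


def looks_like_placeholder_leak(command: str) -> bool:
    """Detect the failure mode where the placeholder tags leak into the call."""
    return command.startswith("<command>") and command.endswith("</command>")


bad = extract_command("Let me look around.\nbash: <command>ls -la</command>")
good = extract_command("Let me look around.\nbash: ls -la")
print(bad, looks_like_placeholder_leak(bad))
print(good, looks_like_placeholder_leak(good))
```

A check like `looks_like_placeholder_leak` is only a diagnostic; the actual fix described above is the prompt change, swapping the placeholder for a real example.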

by andai

6/9/2026 at 12:10:29 AM

Can you explain more about how it struggles? I haven't noticed any issues in my usage, so I'm just curious what is meant by this.

by digdugdirk

6/9/2026 at 1:51:15 AM

It's likely overfit to common harnesses and iteration patterns, so it struggles with formatting tool calls and JSON in our testing, which uses our own harnesses (although there is a lot of overlap with tools that would be found in any coding harness, like bash, apply_patch, etc.)

We didn't love the results because it draws negative scrutiny to our benchmark, but the results are real and done at scale and I think DeepSeek V4 Pro's inability to do agentic work outside of environments it was trained on is an important thing to measure, especially when so many other models can generalize to new environments just fine.

Google models also struggle with tools, but they have very strong initial answers, so there is more potential for them to bridge the gap with some better post-training.
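
To illustrate what "struggles with formatting tool calls and JSON" can mean in practice, here is a minimal sketch of the kind of strict check a custom harness might apply. It is not the benchmark's actual harness: the `{"tool": ..., "args": ...}` shape and the argument names are assumptions, and only the tool names `bash` and `apply_patch` come from the comment above.

```python
import json
from typing import Any, Tuple

# Tool names from the comment; the argument names are hypothetical.
ALLOWED_TOOLS = {
    "bash": {"command"},
    "apply_patch": {"patch"},
}


def parse_tool_call(raw: str) -> Tuple[bool, Any]:
    """Return (True, call) for a well-formed call, else (False, reason)."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc.msg}"
    if not isinstance(call, dict):
        return False, "tool call must be a JSON object"

    name = call.get("tool")
    if not isinstance(name, str) or name not in ALLOWED_TOOLS:
        return False, f"unknown tool: {name!r}"

    args = call.get("args")
    expected = ALLOWED_TOOLS[name]
    if not isinstance(args, dict) or set(args) != expected:
        return False, f"args for {name} must be exactly {sorted(expected)}"
    return True, call


print(parse_tool_call('{"tool": "bash", "args": {"command": "ls"}}'))
print(parse_tool_call("{'tool': 'bash', 'args': {'command': 'ls'}}"))
print(parse_tool_call('{"tool": "shell", "args": {"command": "ls"}}'))
```

A model that has only seen one harness's conventions can fail any of these checks (single-quoted pseudo-JSON, a tool name from a different harness, extra or renamed arguments) while still "knowing" what command it wanted to run.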

by gertlabs

6/8/2026 at 4:04:46 PM

I may sound like a shill, but exponential growth and all. We are going to get near-instant software from a prompt: generate multiple versions and then choose the best one.

Discussions about choosing a library with the best syntactic sugar method naming are just as crazy as suggesting we type in assembly.

by serpix

6/8/2026 at 4:22:00 PM

Sounds like exponential growth of crappy software. I'm not saying that before we didn't have mass produced crap in SE, but now it will turn into explosive overflow.

by alkyon

6/8/2026 at 4:35:00 PM

We are living in a ZIRP-like era where builders at the fastest pace layer have misattributed their velocity to exponential gains in model capability. In fact, they are surfing on decades of careful effort to build a robust foundation of highly reusable software libraries.

This strategy will seem to work really well until the economy that enabled that foundation to form is hollowed out. Then, there will be a reckoning (but we will have no choice but to march forth from there).

by cdata

6/8/2026 at 4:46:12 PM

It's not just software libraries. Specs, applications (the browser!), expectations, device integrations, operating systems, etc. So much that starting from scratch seems impossible.

I'm not agreeing or disagreeing with you, but my brain cannot comprehend how machines can advance such interconnected systems while keeping humans in focus.

Perhaps I shouldn't have watched the Animatrix again.

by patates

6/9/2026 at 3:48:56 AM

Same! Animatrix is just so so so good and 2023 - 2026 I just keep on trying to keep "life" in context. ;)

by justinai6

6/9/2026 at 3:55:09 AM

Well all we have to do is minimize animosity and ensure peaceful relations.

We're good at that, right?

by andai

6/8/2026 at 4:36:58 PM

> This strategy will seem to work really well until the economy that enabled that foundation to form is hollowed out. Then, there will be a reckoning (but we will have no choice but to march forth from there).

There will only be a reckoning if models don't get much better.

If they do get much better you can just have them refactor, fix bugs in, or replace the existing codebase.

The concept of tech debt is sort of meaningless if you anticipate intelligence gains in models to continue.

by solenoid0937

6/8/2026 at 6:22:42 PM

"but we will have no choice but to march forth from there".

If you haven't seen it, I think you would appreciate the film Margin Call.

by chairmansteve

6/8/2026 at 5:09:25 PM

This is a great point. LLMs can't speed up human decision processes and alignment.

by gbro3n

6/9/2026 at 5:49:41 AM

Not entirely sure about that.

It's already speeding up human decision processes, and while ethics / alignment may seem unique to humans, we also see normative expressions in monkeys or apes (like the experiment where one is given grapes, the other cucumber).

A lot of ethics is based on symmetry: symmetric relations, equal rights, equal voting power, ... symmetries sound rather mathematical if you ask me, and decision structures have historically been pressed towards democracy (or at least depiction of it). One could say that modeling humanity as an empire with a king, ignores the will of sometimes hungry farmers with pitchforks. To prevent the occasional "implicit democracy" (royaltycide), it turned out in the interest of the king to recognize the powers of those farmers, and to formalize it in the decision making process. Or at least pretend to.

I believe machines will be able to predict which decision structures sentient creatures would prefer, but I don't believe they will be able to predict (without human exposition) those novel preferences that stem not from sentience but from specifically human properties (e.g. irritants which are quasi-universal for humans). For some of these, humans know how to make predictions (we can run expensive simulations modeling what happens when protein X is exposed to substance Y, and then make heuristic predictions of the effect on a full human in a realistic environment). So at a fundamental level I agree: machine learning models are not guaranteed to help much in predictions concerning entirely unexplored territory, unexplored by either humans or natural selection. But they will definitely be capable of replacing the average human job, which doesn't involve consensual exploration outside of the homeostasis required in the implicit job description. That seems entirely automatable, regardless of whether it's physics or mathematics (harder than computer science), let alone programming.

It won't be able to magically, systematically, and correctly predict out-of-distribution datapoints; it could only explore them like humans could, by trial and error.

by DoctorOetker

6/8/2026 at 9:54:24 PM

How many years do you think we can coast on that foundation? 20?

by noman-land

6/8/2026 at 4:39:23 PM

"exponential growth of crappy X" applies to every industry that went from being an artisanal craft to being mass produced with little or no human input. and we live much better lives than we did before the industrial revolution.

by vitalyan1234

6/8/2026 at 6:28:43 PM

I think some industries have notably high quality output. Automobiles, aerospace for example.

by chairmansteve

6/8/2026 at 5:12:45 PM

Most industries have a high cost of entry, unlike software, so decision makers are far more careful about how to move forward.

With software + GenAI, now every housewife can build some app in an evening.

by andriy_koval

6/8/2026 at 5:10:23 PM

I still can't tell from the outside whether it sounds like a great time to be in security because of the vulnerable slop being churned out, or a terrible time because the people paying to make it don't care.

by kajman

6/8/2026 at 4:44:12 PM

I am more and more inclined not to believe this crappy software theory.

Especially as teams invest in proper agentic harnessing.

We have had a champion in our team that has invested a lot of time into it over the last 4 months, and if anything, quality has improved, not decreased. Architecture is more coherent, the codebase has been cleaned up, agents find information quickly, the code produced is very solid, and my role is more and more checking that the output meets the requirements. But I cannot confidently say that I would've done a better job than AI; more often than not, I have to admit it does a better job than I do.

The mistakes are less and less technical and merely in the domain mapping. And AI is still not as creative as I am at quickly finding solutions to unlock stakeholders' issues. Also, AI is still not as creative as I am at finding the proper solutions for advanced technical problems. But it does a better job than me, even on that front, one-shotting a few solutions in a fraction of the time it would've taken me to test one idea myself.

Mind you, I don't like AI and I think it ruined the job, I don't like working this way, it's exhausting, way more work on one side, way less fun and fiddling with technical parts.

And yet, I have the genuine belief that a few years from now we'll be cloning open source repositories that are already optimized/harnessed and tested for agentic loops and best practices left and right, with software engineers mostly overseeing the domain translation and putting their 2 cents on the non-boilerplatey parts of the product (which, in general, are a small part of the surface).

I think that the next years of my career will be mostly spent in setting up and writing the harnessing and domain mapping part. Then I will move to another sector, not because I necessarily believe I won't have a job, but because I want to vomit thinking that's going to be my job.

by epolanski

6/8/2026 at 5:21:43 PM

It makes no sense. I mean, T2 covered this:

"Watching John with the machine, it was suddenly so clear. The terminator would never stop. It would never leave him, and it would never hurt him, never shout at him, or get drunk and hit him, or say it was too busy to spend time with him. It would always be there. And it would die to protect him. Of all the would-be fathers who came and went over the years, this thing, this machine, was the only one who measured up. In an insane world, it was the sanest choice."

As long as you've indicated what you want, the machine will try to do what you ask of it. It won't get tired because "the codebase is too big", or it has gotten bored of the pattern, or it wants to introduce a new technology.

It just does the thing you asked of it. (Note that, yes, I get that as a codebase grows, it might become more difficult to fit into context, but that only applies if it needs to read a large percentage of the project to implement the task, which shouldn't be the case.)

by altcognito

6/8/2026 at 6:01:39 PM

I'm confused, what does not make sense?

by epolanski

6/9/2026 at 1:15:33 AM

This was in agreement that code would improve, not devolve, sorry about the confusion

by altcognito

6/8/2026 at 5:18:52 PM

> We have had a champion in our team

There are good actors, who are empowered by AI to produce positive impact, but often there are N times more bad actors, who push crappy code to close feature requests fast, inflate LoC-like performance metrics, etc.

by andriy_koval

6/8/2026 at 4:36:04 PM

Crap is fine if it gets the job done. I think software as an industry will change to more ephemeral construction.

by solenoid0937

6/9/2026 at 12:26:40 AM

What counts as “done” has a time component, so I think we’re going to see more of a spectrum where some businesses try to skimp as much as their market will allow but others will recognize that racking up technical debt is a long-term loss. Stuff like brochure sites will certainly be cut down but anything where there’s liability or long-term customer relationship is going to need to factor in quality as well.

by acdha

6/9/2026 at 8:59:48 AM

If you anticipate that models will continue to improve, tech debt isn't worth worrying about.

by solenoid0937

6/8/2026 at 6:29:27 PM

Paper plates of software development.

by HanClinto

6/8/2026 at 6:11:33 PM

You could say the same about when higher-level languages were getting popular. Previously, programming was the domain of Math, Physics, and EE doctorates. These days we even have coding bootcamps that take a few months.

by eunos

6/8/2026 at 4:16:38 PM

Anyone remember the old days when a new frontend framework came out every 3 months? That has pretty much stopped. No one cares anymore.

by 9cb14c1ec0

6/8/2026 at 4:40:30 PM

> when a new frontend framework came out every 3 months.

> No one cares anymore.

I never cared about this.

I think this captures something that I've been searching for the words for. (Maybe I should have gotten an LLM to write the words for me.) Some of the biggest AI boosters are the kind of dev that would have cared about the new frameworks of the last 3 months. They had a "the framework does all the thinking for me" attitude already, so it is easy for AI to slot into that.

by asveikau

6/8/2026 at 4:20:30 PM

Oh you wait until LLMs come up with frameworks that allow multiple LLMs to collaborate effectively. Then you’ll have new frameworks every 3 days.

by LASR

6/8/2026 at 4:18:20 PM

It’s even discouraged now as LLMs wouldn’t have the documentation built in

by mountainriver

6/8/2026 at 4:21:24 PM

But I think the eventual goal is that documentation won't even be needed. An LLM should just understand the nuances of frameworks itself by analyzing their codebases.

by osti

6/8/2026 at 5:07:10 PM

New front end frameworks came out every 3 months, but realistically no one was using anything that wasn't made by Facebook, Google, or Evan You.

by ecshafer

6/8/2026 at 5:57:04 PM

That's because I roll my own frontend framework for each project and every week for existing projects /s

by greenavocado

6/8/2026 at 5:55:47 PM

The exponential is leading to full compute-in-memory within a few years which will be 100 times more efficient. Which means at least 10 times larger models that are much smarter in addition to extremely fast.

It's going to skip the code entirely for small businesses and just render UIs straight from context data and prompts at interactive speeds. Kind of like Google's Genie does with games but much more accurately.

by ilaksh

6/8/2026 at 5:02:13 PM

I'm not sure. Engineers could still develop software the old way, you know taking months to deliver something like, let's say, Obsidian? Or Ghostty? Taking care of every single line of code, of dependencies, of good architecture. Truly the old way. And if the product is good it will succeed.

by dakiol

6/8/2026 at 5:07:06 PM

> And if the product is good it will succeed.

It needs to win in a marketing landscape hyper-overcrowded by thousands of competitors, slop-generated over a weekend.

by andriy_koval

6/8/2026 at 5:14:28 PM

Could you imagine Obsidian being posted on HN today, if it weren't really popular already? There's no way a tiny team working on a note taking program would make it out of new, no matter how good it was. I wouldn't click the link, myself.

by kajman

6/8/2026 at 4:50:11 PM

> Discussions about choosing a library with the best syntactic sugar method naming is just as crazy as suggesting we type in assembly.

I have a more hopeful take. As AIs improve and get faster we can more quickly and iteratively improve code which we may have historically avoided due to the work involved.

I know I've made several refactors that would have otherwise been insane lifts. Not only because of the work involved but because sometimes you don't know if it will work, and so you have a sort of double friction; you don't know if it will even succeed. With an AI you can just throw it at the refactor to see if it runs into a problem, all while you're having a coffee break or w/e.

In general AI is going to enable humanity to be more extreme versions of itself. For good and bad. I suspect more bad than good, though.

by unshavedyak

6/8/2026 at 4:55:40 PM

Our bottleneck is going to be verification.

by tmaly

6/8/2026 at 4:17:02 PM

And they will all suck! I can't wait.

by lionkor

6/8/2026 at 6:32:12 PM

> We are going to get near instant software from prompt, multiple ones and then choose the best one.

If you extract the spec from the first implementation and reimplement from scratch, you get a free testing oracle. Where they diverge, you send the agent to decide which one had a bug.
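
A minimal sketch of that idea as differential testing, assuming two independently written implementations of one spec. The functions `impl_a` and `impl_b` and the toy spec ("sum of the even numbers in a list") are hypothetical stand-ins, not anything from the thread:

```python
import random

def impl_a(xs):
    """First implementation (hypothetical): sum of the even numbers."""
    return sum(x for x in xs if x % 2 == 0)

def impl_b(xs):
    """Reimplementation from the extracted spec, with a deliberate bug:
    it skips negative even numbers."""
    total = 0
    for x in xs:
        if x > 0 and x % 2 == 0:
            total += x
    return total

def find_divergences(f, g, trials=1000, seed=0):
    """Run both implementations on random inputs and collect disagreements."""
    rng = random.Random(seed)
    diverging = []
    for _ in range(trials):
        xs = [rng.randint(-10, 10) for _ in range(rng.randint(0, 5))]
        if f(xs) != g(xs):
            diverging.append(xs)
    return diverging

divergences = find_divergences(impl_a, impl_b)
print(f"{len(divergences)} diverging inputs, e.g. {divergences[:1]}")
```

Agreement does not prove correctness, since both versions can share a bug inherited from the spec; a divergence only tells the agent where to look.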

by visarga

6/8/2026 at 4:42:16 PM

And how are you going to determine which is the best? Going through all the possible combinations of users and usage? So mostly it shifts the work from generation to validation.

by unglaublich

6/8/2026 at 4:28:23 PM

The models might be so fast that they can autocomplete your prompt before you even finish it, and generate dozens of possible applications before you're even done asking.

by sagarp

6/9/2026 at 6:00:09 AM

How do you get all the build system scripts, tests, etc. to run instantly?

by Paradigma11

6/8/2026 at 4:28:01 PM

You won't. Because 80% of the complexity is just "knowing what to build". You will get something that gives you a prototype in 1 min, then you break it, then you get a slightly better prototype on one side, but newly broken in another way, and you're going to repeat over and over.

by oulipo2

6/8/2026 at 4:43:17 PM

And for any non-trivial application, the space of possibilities grows so quickly that you'll never even be able to _touch_ all the moving parts of the application and verify them.

by unglaublich

6/9/2026 at 3:53:45 AM

See also this recent talk at Microsoft:

VibeOS — Fully Hallucinated Operating System

https://www.youtube.com/watch?v=z3pV6FHvcgM

by andai

6/8/2026 at 4:33:07 PM

This will be really powerful for voice. Being able to reason makes an LLM so much smarter, but with voice your latency budget is so tight that you typically can't spare the time.

by prplfsh

6/8/2026 at 4:34:45 PM

This is true for humans too. Lol

by jeffrallen

6/8/2026 at 4:19:38 PM

Neat. The frontier models have gotten pretty impressive, but they're all a bit too slow for interactive, human-in-the-loop coding. It incentivizes vibecoding and running multiple agents in parallel. A fast agent feels more like a partner.

For a while I was running Cerebras GLM 4.7 for a bunch of tasks. Not a very smart model, but it's fantastic to have a live prototype of a site up and be able to type "make the fonts bigger. No not that big" and see it change in real time. And MiMo 2.5 is a lot more capable than GLM 4.7.

by eli

6/8/2026 at 4:26:09 PM

i tried glm 4.7 for agents that write code. simple scripts, 200-1000 LOC. extremely bad. Had to abandon the cerebras offering; their smart models are only on the enterprise plan.

by maxdo

6/8/2026 at 6:54:44 PM

glm 4.7 is quite old by now. I don't even use 5.1 anymore, cause I found kimi k2.6, mimo 2.5 pro, deepseek v4 pro and qwen 3.7 all better than glm 5.1

by jona-f

6/8/2026 at 4:25:37 PM

> And MiMo 2.5 is a lot more capable than GLM 4.7

MiMo 2.5 is not the same model as MiMo 2.5 Pro.

GLM 5.1 is z.ai's latest iteration & is one of the popular open weight coding models.

If you've had the chance, how does GLM 5.1 (which is now more expensive than MiMo 2.5 Pro after its recent 70% price drop) compare?

by ignoramous

6/8/2026 at 5:11:02 PM

GLM 5.1 is very good. Definitely a contender for best open weight coding model. Nothing like 4.7.

But quite a bit more expensive than MiMo 2.5 Pro. Like 5x to 10x more on my little tests, at least by the API rates.

by eli

6/8/2026 at 4:07:09 PM

Cerebras is trialing Kimi K2.6 at 3000t/s (invite only). I'm excited for when the fast hardware gets more mainstream for frontier models. Models designed for speed on Nvidia are a nice addition that could bridge the gap.

by scosman

6/8/2026 at 5:16:08 PM

TFA mentions that until now special, very expensive hardware like Cerebras was required to reach these kinds of speeds, and it emphasizes that what is novel in their results is that they have obtained over 1000 tokens/s for a model with over 1T parameters by using just standard hardware, i.e. one server with 8 GPUs.

by adrian_b

6/8/2026 at 5:18:34 PM

Source? Their website says 1000t/s https://www.cerebras.ai/blog/which-is-faster-gemini-3-5-flas...

by btian

6/9/2026 at 12:35:02 PM

This is likely correct, sorry for the bad info. Was working from memory.

by scosman

6/8/2026 at 4:38:24 PM

Cerebras currently does not provide any discounts for prefix caching, making its use for agentic workloads sqr(n_turns) more expensive.
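
A rough sketch of why the missing discount matters, assuming every turn re-sends the whole conversation so far. The 2,000 tokens per turn and the 0.1 cached-price ratio are made-up illustration values, not any provider's actual pricing, and `billed_input_tokens` is a hypothetical helper:

```python
def billed_input_tokens(n_turns, tokens_per_turn, cached_price_ratio=1.0):
    """Input tokens billed over a conversation, in full-price-token equivalents.

    Turn k re-sends the k-1 earlier turns (the prefix) plus one new turn.
    cached_price_ratio is the price of a cached prefix token relative to a
    fresh one: 1.0 means no caching discount, 0.1 means a 90% discount.
    """
    total = 0.0
    for k in range(1, n_turns + 1):
        prefix = (k - 1) * tokens_per_turn
        total += prefix * cached_price_ratio + tokens_per_turn
    return total

no_discount = billed_input_tokens(100, 2_000)
with_discount = billed_input_tokens(100, 2_000, cached_price_ratio=0.1)
print(no_discount, with_discount, no_discount / with_discount)
```

Without a discount the billed total grows with the square of the number of turns, because it is `tokens_per_turn * n * (n + 1) / 2`. A discount does not remove the quadratic term; it only shrinks its coefficient.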

by lostmsu

6/8/2026 at 10:41:06 PM

Cerebras got lucky that they IPOed last month instead of now.

by johndough

6/8/2026 at 4:41:39 PM

now that's what i call a software development breakthrough/platform! thanks for the heads up!

by michael-ax

6/9/2026 at 6:36:36 AM

The interesting bits on how they achieved it:

> On the model side, we applied FP4 quantization

> introduced DFlash, an efficient speculative decoding method based on block-level masked parallel prediction

> On the system side, TileRT perfectly adapts to the dynamic characteristics of these algorithms

> 1000+ tokens/s output [...] using just a single standard 8-GPU commodity node

by PhilippGille

6/8/2026 at 4:23:12 PM

1k TPS is great, but I’m more fascinated by the amount of AI generated comments in this thread!

by Oras

6/8/2026 at 4:54:16 PM

Comments at 1,000 TPS is a terrifying future.

by trollbridge

6/8/2026 at 5:26:27 PM

I prefer a thousand smart AI comments to a thousand dumb human comments

by 0xbadcafebee

6/8/2026 at 6:45:22 PM

Well, you can just vibecode a complete AI echochamber version of HN!

by wartywhoa23

6/8/2026 at 4:24:37 PM

Like what?

by eli

6/8/2026 at 6:19:37 PM

There are many with subtle tells.

Not nearly as obvious as the ones from 6 months ago, but seems to be more the use of hyperbolic phrasing in a particularly unnatural way.

The assess/explain, then hyperbole at the end kind of structure.

Top comment looks suspicious from this perspective, but it's kind of a losing battle to be able to differentiate them with sufficient accuracy anyway

by adam_arthur

6/9/2026 at 2:55:17 PM

This is very reminiscent of the "everyone's a Russian bot" era of social media, where everyone would just lob that accusation at people without any real proof.

by marknutter

6/9/2026 at 8:02:52 PM

There is no way to prove, but what is definitely true is that many people are attempting to use LLMs on forums and otherwise.

So if you think none of these comments are written by LLMs, you're probably mistaken too.

In the end we accept that we can't tell anymore and move on (barring some biometric protocol that can't be gamed via automation)

by adam_arthur

6/8/2026 at 5:19:00 PM

With a tps and a token price you can calculate approx. price per hour of running the model!

$2.61/M tokens * 1,000 tok/s = $9.40/hr

That would be pretty cheap for an 8-GPU node which would typically run around $45/hr or more. Guess this depends on how many parallel streams it can handle.
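
The same arithmetic as a small sketch, assuming a single stream that generates output tokens continuously and is billed only at the output price. The function names are hypothetical, and the $45/hr node cost is the rough figure quoted above:

```python
def cost_per_hour(price_per_million_tokens, tokens_per_second):
    """Dollar cost of one stream generating continuously for an hour."""
    tokens_per_hour = tokens_per_second * 3600
    return price_per_million_tokens * tokens_per_hour / 1_000_000

def streams_to_break_even(node_cost_per_hour, price_per_million_tokens, tokens_per_second):
    """How many always-busy streams would cover the node's hourly cost."""
    return node_cost_per_hour / cost_per_hour(price_per_million_tokens, tokens_per_second)

per_stream = cost_per_hour(2.61, 1_000)
break_even = streams_to_break_even(45, 2.61, 1_000)
print(f"${per_stream:.2f}/hr per stream")
print(f"{break_even:.1f} always-busy streams to cover $45/hr")
```

Under these assumptions the node would need a handful of fully utilized parallel streams just to cover rental cost, which is why the number of concurrent streams it can serve matters so much.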

by pants2

6/10/2026 at 1:39:48 AM

If you didn't apply already, you should - they turned around my application in a day.

This thing is seriously fast and was good enough to switch it in for the other model I was using. I tried it for both planning, executing, and subagent tasks and it performed adequately in all 3.

So, this is another one to add to the list next to DeepSeek-V4-Pro and Qwen-3.7-Max...

by trollbridge

6/8/2026 at 3:57:45 PM

The generation speed in the demo video is crazy, to say the least, and completely beyond my impressions of LLMs.

The Xiaomi team really brought something to the table.

by maxloh

6/8/2026 at 6:14:11 PM

I think these types of demo videos should allow people to get a sense of superintelligence. Because it's very hard to imagine something that is, say, three times as smart as you -- by definition you wouldn't be able to comprehend its thoughts -- but this shows clearly what something that can think 100 times faster than you is like.

by ilaksh

6/8/2026 at 5:15:36 PM

Below is the part I found most interesting

> "However, naively applying FP4 across the entire model causes degradation in complex reasoning, logic, and code generation. Given the MoE (Mixture of Experts) architecture of Xiaomi MiMo-V2.5-Pro — where Experts constitute the vast majority of parameters and exhibit the highest tolerance to quantization — we selectively quantize only the MoE Experts to FP4 while preserving original precision for all other modules. Through FP4 QAT (Quantization-Aware Training), we dramatically reduce model size and maximize hardware bandwidth utilization while keeping the model's overall capability essentially on par with the original, as shown below"

by GodelNumbering

6/8/2026 at 9:21:26 PM

The 120B and 20B GPT-OSS models by OpenAI did this last year for what it’s worth; the MoEs were MXFP4
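
For readers unfamiliar with the format, here is a toy sketch of expert-only 4-bit quantization, assuming an MXFP4-style layout (blocks of 32 values sharing a power-of-two scale, each element stored as a 4-bit E2M1 float). It only simulates the rounding after the fact; it is not quantization-aware training and not Xiaomi's or OpenAI's actual pipeline. The tensor names and the name-based `"experts"` filter are hypothetical:

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Fake-quantize one block: shared power-of-two scale + 4-bit elements."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return list(block)
    # Largest E2M1 exponent is 2 (values up to 6 = 1.5 * 2**2), so shift the
    # block's largest exponent down by 2 to choose the shared scale.
    scale = 2.0 ** (math.floor(math.log2(amax)) - 2)
    out = []
    for x in block:
        mag = min(abs(x) / scale, 6.0)  # clamp to the largest representable value
        # Nearest representable magnitude; ties go to the smaller one.
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - mag))
        out.append(math.copysign(nearest * scale, x))
    return out

def quantize_experts_only(weights, block_size=32):
    """Quantize tensors whose name contains 'experts'; leave the rest untouched."""
    result = {}
    for name, values in weights.items():
        if "experts" in name:
            q = []
            for i in range(0, len(values), block_size):
                q.extend(quantize_block(values[i:i + block_size]))
            result[name] = q
        else:
            result[name] = list(values)
    return result

weights = {
    "layer0.attention.q_proj": [0.013, -0.27, 0.4],
    "layer0.experts.0.up_proj": [0.013, -0.27, 0.4, 1.1],
}
quantized = quantize_experts_only(weights)
print(quantized)
```

The small value in the expert tensor collapses to zero while the attention tensor is untouched, which illustrates both the precision loss and why the less tolerant modules are kept at their original precision.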

by buildbot

6/9/2026 at 1:37:35 AM

Opus regularly bitches and whines to me about how long something will take and that I should think before asking it to do it. But then it does it anyway in 15 minutes.

by sheeshkebab

6/8/2026 at 3:58:29 PM

I don't understand, given all they say, why this would not be made available to everyone at once? Why the limited release? They should have no trouble scaling it if it runs on a single rack.

by irthomasthomas

6/8/2026 at 4:12:45 PM

Maybe they don't have enough racks. The news indicates that China isn't in a really good situation with GPUs, so they probably want to keep most of them for other stuff. Also, since the price is so low, they probably want to use the other GPUs for stuff that has higher margins.

by gekoxyz

6/8/2026 at 4:23:42 PM

Because presumably then it won't be 1000 t/s for everyone anymore given hardware limitations?

by jdthedisciple

6/8/2026 at 7:56:59 PM

The TileRT approach swaps throughput for latency, which also means less overall efficiency

Given the export restrictions this could mean they need to prioritise how to best use their limited hardware. But they could also be moving to Huawei GPUs like deepseek did and simply not have stable hardware or software for a large scale deployment yet.

This is just speculation based on the MXFP4 support on Huawei GPUs that is lacking on some nvidia GPUs.

by throwa356262

6/8/2026 at 6:08:41 PM

It uses significantly more resources obviously. And/or they have to configure or reconfigure servers for it, which takes time, and doesn't make sense until they have proven the demand at the higher price point.

by ilaksh

6/8/2026 at 4:46:13 PM

I wonder about this too. The other objections miss the point: if it's faster, and otherwise the same, and doesn't require different hardware, then why not just announce that the standard tier of MiMo-v2.5-Pro is now ridiculously fast and raise the price? What does "limited high speed resources" mean if it runs on the same hardware as the rest of their pool?

I think the answer is that there's a tradeoff here where additional throughput for a single person can be achieved only by tying up more resources than a normal request would, even when you take into account the fact that the normal request takes longer to finish. I'm not an expert, but some of the optimizations they describe, particularly the parallel prediction stuff, sound like they could take up extra resources.

by boutell

6/9/2026 at 12:36:35 PM

> and doesn't require different hardware

But it may well do. They mention TileRT in the announcement, so this speed comes from low level optimization for some specific GPU target.

With availability of SOTA western GPUs being scarce in China, they may well have a mishmash of different GPUs.

by HarHarVeryFunny

6/9/2026 at 1:19:07 PM

They specifically said it's stock hardware, but... yeah, maybe highly specific stock hardware.

by boutell

6/8/2026 at 4:36:02 PM

Maybe they only have a finite number of racks ;-)

by HarHarVeryFunny

6/8/2026 at 4:37:56 PM

Chinese companies are blocked from buying modern ASML lithography machines. The most modern scanner China is still allowed to buy is NXT:1980i from 2015.

by slaw

6/8/2026 at 3:55:34 PM

Assuming they mean 8xA100 or similar, that's some rather insane performance, and at just 3x the cost, it's still quite cheap-ish. With some optimisations this might be quite interesting.

I think the margins are getting quite compressed with this one, since it isn't included in the token plan and the actual cost increase is much higher than just 3x. But still fairly decent.

by minraws

6/8/2026 at 4:06:49 PM

Suspect this will be included once out of beta but at a higher credit/token ratio.

Remember, these guys are not VC backed. Anything they do must break even

by throwa356262

6/8/2026 at 4:16:52 PM

> must break even

Understand the spirit of this, but probably not true. I don't think Xiaomi, or any big tech company, needs to break even on their new model releases.

by JayStavis

6/8/2026 at 4:26:35 PM

Chinese "companies" are not companies in the western sense, but more like government departments with capitalist styling to deceive the western audience.

From that point of view, they have as much money as they need. That's why there is no "VC", because Chinese government assumes that role.

by varispeed

6/8/2026 at 4:40:41 PM

Huge L for free market economies if true

by throwaway67678

6/8/2026 at 4:41:21 PM

Must be Blackwell for native fp4 support.

by Qdulf

6/8/2026 at 7:03:38 PM

Do you know what will be cool?

It will be cool to measure models based on their RAW performance and measure them in terms of ROI - not some benchmark but something meaningful like we used this model to solve X.

That will be a massive mind shift and might justify the token expenditure.

by _pdp_

6/9/2026 at 1:21:14 AM

Aren't benchmarks exactly that?

We used the AI to solve given problem with x% adherence/quality/correctness?

by HDBaseT

6/9/2026 at 11:01:31 AM

Cool, what is the price per million tokens? I am using a 300 t/s model for a project I am doing where speed is more crucial than precision, so this seems like an upgrade. However, if it is $10 per M tokens then it is not worth the upgrade.

by zero0529

6/9/2026 at 1:01:25 PM

$0.435/$0.87 for the standard speed, this one should be 3 times that.

by GaggiX

6/8/2026 at 4:54:23 PM

Obligatory taalas mention:

https://taalas.com/

Despite the performative UI components they have a shipped (demo) product:

https://chatjimmy.ai/

This is only 3.1 8B and a very small context window, but at 17k tokens per second it's likely enough to reliably call tools which would make a huge difference in agentic applications. Assuming they can bake in better models I'm just as bullish or even moreso on this, considering this opens up edge computing at the extremely low power requirement.

High tok/s is the future IMO.

by PhunkyPhil

6/8/2026 at 6:14:31 PM

My dream is claude or codex running at this speed.

by kilroy123

6/9/2026 at 2:11:31 AM

More realistically, I hope for qwen 3.6 27B on taalas.

by est

6/8/2026 at 4:14:16 PM

With this at 1k tps and Kimi 2.6 1k tps by Cerebras, I believe we are entering the next stage of LLMs, where companies will also compete on throughput

by __natty__

6/9/2026 at 1:41:47 AM

I’ve personally found MiMo models hit and miss. I have some personal agentic projects and I found them to hallucinate hard at least 10% of the time. And do so in pretty sinister ways - making up people, names, places, etc. I switched back to Kimi for now.

by temikus

6/9/2026 at 3:55:40 AM

I tried this model and it was pretty bad at coding. Maybe it was me. 1k tokens/sec is pretty cool tho. Deepseek V4 Pro is better. I wonder if a tweaked pi + Deepseek V4 Pro at 1k tokens/sec would actually be better than Claude Code.

by Frannky

6/9/2026 at 1:52:09 AM

I wonder how fast it performs on just a CPU? If the model performs say 10x on a GPU cluster, would it also perform faster on a CPU?

This could bring proper desktop AI to the average laptop user, which could be a game changer for running local models.

by RachelF

6/8/2026 at 3:57:57 PM

How?

edit: now I read the article fully, seems like they utilize some very effective MTP algorithm. and somehow the quality is still decent enough.

though, I doubt that the quality really only drops a bit like they claimed. maybe for the benchmarks, but for general use, heavily quantized models very often show worse results.

by npn

6/8/2026 at 6:09:21 PM

i wonder if it will be possible to hardcode a model with some kind of MTP-adjacent algorithm to use a smaller portion of it to generate most of the tokens but route to the real experts every once in a while to steer it towards good thinking directions. (Perhaps this is done only when it's generating its thinking block, and the training takes it into account)

Could result in very high efficiency and still good intelligence without having to resort to fundamental adjustments like going to a diffusion LLM

by 2001zhaozhao

6/8/2026 at 6:34:17 PM

I doubt you can do that. MTP magic happens because for texts, we have a lot of low value fixed tokens that almost always get generated in the sequence (like punctuation, function words, language keywords etc). for most important ones (the entities, the content words, variables) you still need the full model.

so there is always a maximum limit for how well MTP can do.
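
One way to put a number on that ceiling is the standard draft-and-verify estimate, sketched below under a strong simplifying assumption: every drafted token is accepted independently with the same probability. This models classic speculative decoding, not DFlash specifically, and `expected_tokens_per_step` is a hypothetical helper:

```python
def expected_tokens_per_step(acceptance_rate, draft_len):
    """Expected tokens emitted per verification pass of the full model.

    Assumes each drafted token is accepted independently with the same
    probability, and that the full model always contributes one token
    (the correction or the bonus token) per pass.
    """
    a = acceptance_rate
    if a >= 1.0:
        return float(draft_len + 1)
    return (1 - a ** (draft_len + 1)) / (1 - a)

for a in (0.5, 0.8, 0.9):
    row = [round(expected_tokens_per_step(a, k), 2) for k in (2, 4, 8, 16)]
    print(a, row, "limit:", round(1 / (1 - a), 2))
```

However long the draft, the expected yield never exceeds `1 / (1 - a)`, so the tokens that are hard to predict (entities, content words, variables) cap the speedup no matter how cheap the easy ones are.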

by npn

6/8/2026 at 4:43:11 PM

They say they are using https://github.com/tile-ai/TileRT

- persistent CUDA kernel

- tiled processing with overlapping read/writes

- model designed with specific constraints in mind

by lostmsu

6/8/2026 at 7:06:57 PM

Excuse me, do aliens live among us? 17 commits, 99% Python and multiplying the speed of GLM, Deepseek V4, MiMo 2.5?

by aitchnyu

6/9/2026 at 6:45:14 AM

tilert is closed source, the repo is just a python wrapper that invokes the binary.

by zander_jiang

6/8/2026 at 9:55:23 PM

Pretty cool, although I can't help but think this would be a very easy way to rack up a GARGANTUAN bill. That company that blew 500 million on Claude in a month might have competition soon..

by overgard

6/9/2026 at 5:35:32 AM

i tried to test it and after logging in, i get "You don't have access to this event trial" and can't even log out until i clear my cookies. despite having a good model, why such a bad website?

by bryabaek

6/9/2026 at 11:13:19 AM

Same. I also found out that my old Xiaomi account is apparently considered "mainland china" and I can't put any phone number except a chinese one on it lol. I'm not trusting these people with anything that's for sure, useless. I'm australian and have never been to china in my life!

by girvo

6/8/2026 at 4:30:37 PM

The gated "ultra-speed" phenomenon seen here and with the Cerebras Kimi K2.6 release, while understandable, is somewhat troubling IMO.

Getting ~1000 TPS on near-frontier intelligence is a step change, and enables whole new use-cases for applications. Seeing limited compute resources beget selective access makes me worry for the future of competition.

by h14h

6/8/2026 at 3:53:56 PM

Yeah, this seems to be the easiest path for overall agents efficiency in the short term

by elar_verole

6/9/2026 at 3:50:23 AM

Will this list for trillion dollar valuation as well?

by kopirgan

6/9/2026 at 5:45:30 AM

has anyone given it a try? even in china, it's not popular...but xiaomi is really good at making prices go down on everything...

by yanhangyhy

6/8/2026 at 4:41:16 PM

It's interesting but not game-changing IMO. Speed here is not a bottleneck.

by pullshark91

6/8/2026 at 4:50:45 PM

No note about the specific GPU they use. One might speculate. B200? H200? H100?

by isusmelj

6/8/2026 at 5:13:33 PM

it is hard to understand what the actually meaningful innovations are here / what TileRT is bringing to the table.

- dflash: new-ish but February is ancient by the standards of the pace of AI innovation lately, I guess applying it to a 1T model is new-ish in the sense that the dflash researchers don't have the hw budget to prove that out

- persistent engine kernel: this is like CUDA 101

- warp specialization: I think this just means "keep different gpu resources all busy w/ pipelining" which is CUDA 201, some of it is even baked into pytorch now

- MXFP4 QAT: not new

- TileRT: hard to tell what this actually does, there's a PyPi wheel with support for DS 3.2 and GLM 5 but binary only

by jbellis

6/9/2026 at 6:54:39 AM

tilert is a highly optimized megakernel: a single kernel that does the entire decode pass. This enables overlapping weight loading with computation, eliminates CUDA launch overhead (CUDA graphs do not, contrary to what most people think), and allows for more fine-grained pipelining. There are lots of blogs/papers on it. It's currently the best approach to maximize memory bandwidth utilization. But megakernels are incredibly hard to optimize, and only work for small batch sizes (low throughput, hence the high price); that's why we don't see them much in production.

by zander_jiang

6/8/2026 at 3:58:01 PM

42B active params, sliding window attention. There's your tradeoff.

by moffkalast

6/8/2026 at 4:04:30 PM

Sliding window for the draft model, not for the main. 42B for active params because it’s a sparse MoE which is a common technique for the larger models to not get bottlenecked by memory bandwidth.
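
A back-of-envelope sketch of the bandwidth argument, assuming decode speed is limited purely by streaming the active weights from memory once per token. The 42B active and 1T total parameter counts come from the thread; the node bandwidth is an assumption (the thread does not name the GPU), and the estimate ignores KV-cache reads, inter-GPU communication, unquantized modules and speculative decoding:

```python
def decode_tokens_per_second(active_params, bytes_per_param, bandwidth_bytes_per_s):
    """Upper bound on single-stream decode speed when weight reads dominate.

    Each generated token has to stream every active parameter from memory once.
    """
    bytes_per_token = active_params * bytes_per_param
    return bandwidth_bytes_per_s / bytes_per_token

ACTIVE_PARAMS = 42e9       # active parameters per token, from the thread
TOTAL_PARAMS = 1e12        # "1T" from the headline
FP4_BYTES = 0.5            # 4 bits per weight, ignoring scales
NODE_BANDWIDTH = 8 * 8e12  # ASSUMPTION: 8 GPUs at 8 TB/s each

sparse = decode_tokens_per_second(ACTIVE_PARAMS, FP4_BYTES, NODE_BANDWIDTH)
dense = decode_tokens_per_second(TOTAL_PARAMS, FP4_BYTES, NODE_BANDWIDTH)
print(f"MoE bound: {sparse:.0f} tok/s, dense-equivalent bound: {dense:.0f} tok/s")
```

Under these assumptions a dense 1T model could not come close to 1000 tokens/s on such a node, while reading only the active experts leaves headroom, which is the point about sparsity and memory bandwidth.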

by vlovich123

6/8/2026 at 4:27:12 PM

Seems to be for both according to the spec [0], maybe it's wrong though.

128 sounds really tiny, I wonder if they mean some kind of blocks?

[0] https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro-FP4-DFlash#4...

by moffkalast

6/8/2026 at 4:47:16 PM

> It uses 384 routed experts (top-8) with hybrid attention (full-attention + sliding-window 128 at 6:1 ratio) over 70 layers (1 dense + 69 MoE)

https://recipes.vllm.ai/XiaomiMiMo/MiMo-V2.5-Pro

by E-Reverance

6/8/2026 at 4:19:12 PM

Given how "smart" some of the 26b dense models are now, I would not be surprised to see a strong 40b MoE.

by bearjaws

6/9/2026 at 7:23:25 PM

This just means you can blow through monthly budget in 1h instead of in 4h on the cheapest plan. :)

by megous

6/8/2026 at 4:18:56 PM

A few things in life I can't fully grasp why they are so sought after. One is that constant need to exhibit growth. As if being massive and staying as massive is not good enough, one has to always and continuously grow. The other is constant speed increases. We're already operating at 50x speed. My output is much wider and so much faster, I am sometimes my own bottleneck. And now as if that is not enough we want more speed. "I want a full software product from scratch in 12 seconds, because 5 minutes is too long and I got things to do..."

Really?

by harel

6/8/2026 at 4:38:50 PM

different use cases for different people. some people are nurturing a code base and ensuring it doesnt become a gross mess so they become the bottleneck. some people are just trying to prompt stuff into existence and dont know what sql is.

I think this site often overlooks that second group and how large it likely is.

by sidrag22

6/8/2026 at 4:30:41 PM

I remember when I had to wait minutes to get a high resolution image over a dialup connection. When computer and communications hardware advanced enough that I could get 30 high resolution images every second, there were brand new uses. In the case of LLMs, I could imagine that much faster operations allow you to introduce them as parts of systems that need to react to the real world at high speed, like factory equipment. Showing that a model can do the usual LLM tasks at extremely high speed is just a demo proving that the approach works.

by philipkglass

6/8/2026 at 7:33:23 PM

yeah, at a very high speed the agent can code the solution on the go when you ask it for something. Imagine it being able to make a feature as fast as a website loads; sometime in the future that would feel like magic

by anothereng

6/8/2026 at 4:47:22 PM

The example in the video was a generation of a dashboard app of some sort. I can do that with a "normal speed" Claude in a few minutes. The difference is a few minutes. This is compared to a few weeks in old school development time. I don't have a problem with taking it a little "slow" (as in - a few minutes) and lending my thought to it rather than just going for fast generation and who knows what's inside. I get your use case, but this is a specialised one, and not the one 90% of people will think of - everyone wants that fast app in 12 seconds... Or so it seems from me being downvoted on that comment.

by harel

6/9/2026 at 8:27:27 AM

I frequently tell agent to do something, wait ~10 min (which is just enough that I can't/don't want to start anything else), ask it to change something, wait a few minutes again, and so on. So I'm basically idle while waiting for agent, and it would be great if it was faster.

It's as if your compile times were ~10 min. Sure, it's not a huge deal, but it's sooo annoying

by srdjanr

6/9/2026 at 10:05:03 AM

10 minutes sounds like a very long time. Maybe I'm using it differently but I don't see those wait times. I give specific instructions, and use pro plans, and the turnaround is fast.

by harel

6/9/2026 at 3:13:00 AM

What a ripoff you have to make an account then 'apply' to try this demo.

by mrwaffle

6/8/2026 at 4:12:06 PM

Speed is indeed the next big thing that should happen with frontier LLMs. The current models, but 1000 times faster, would be super useful. Earlier this week it took Claude at least a week, full time, with two Max subscriptions to solve a complex issue where we wanted to mimic an occlusion mapping variant used in the game Crimson Desert. Pretty complex mathematical challenge. With an ultra-fast LLM and a proper self-verification process it would be awesome.

by holoduke

6/8/2026 at 7:17:56 PM

Interesting. For your occlusion mapping variant, what engine is the game you're implementing this for made with? Do you have Claude hooked up to Unity or Unreal?

by astlouis44

6/8/2026 at 7:40:08 PM

I'd also be interested in more details, like the sibling comment. I find that when I try to build stuff, it's like building a skyscraper from straw. What methods are moving you forward the most?

by MaxikCZ

6/9/2026 at 9:10:59 AM

Can try it now in seconds on https://trustedrouter.com/

by ljlolel

6/8/2026 at 4:31:59 PM

Pfff, time wasting.

1. Password between 8-16 characters, and this and that... What???

2. Captcha after captcha, come on.

3. Service unavailable: "This service is not available in your region yet."

Are you kidding me? Come back when you are ready for the users. I was hoping to try it, what a frustration.

by trilogic

6/9/2026 at 4:57:50 AM

I was just playing with Cerebras a few days ago because it's the fastest inference provider by far. Unfortunately, the only model anywhere near economical to run that fast is gpt-120b-oss which sucks at Pi's tool calling. So I've been hoping for something faster ever since, especially since my local hardware has a paltry 128GB of unified memory.

Hopefully this pans out and fast models (that are also not ridiculously dumb) become the norm. It's amazing what you can unlock with even a single order of magnitude's speed improvement.

by LoganDark

6/8/2026 at 4:56:37 PM

I didn't use their pro speed but regular Mimo-v2.5, not even pro, and it seems really fast. I have plenty of tokens and subscriptions but this is really impressive. I really don't need another one, but I am tempted simply because it works so fast; I can't imagine how fast the high-speed service must be.

by desireco42

6/8/2026 at 4:07:12 PM

If MiMo v2.5 Pro can run at >1000tk/s on GPUs then I will soon expect the same from OpenAI/Anthropic/Google.

by GaggiX

6/9/2026 at 11:10:52 AM

I wouldn't expect any of the american labs to be particularly great at (or have much desire to) work on efficiency; they've consistently proven to be uninterested in (if not incapable of) actually improving on those types of things. The closest we've seen lately is that maybe GPT-5.5 (and Opus 4.{7,8}?) are more token-efficient, i.e. they solve things with fewer tokens...? It hasn't been coupled with any other kind of efficiency bump, though, and we're seeing higher costs anyway in most places where the american labs are involved.

The only players that seem to be capable of a consistent pattern of doing more with less currency are the chinese labs.

by 59nadir

6/8/2026 at 3:51:54 PM

I hope this is the next frontier AI labs push. Even the open models are smart enough, and they’re cheap enough, now if they can be fast enough they can make certain workflows possible and allow us to remain in flow state while we use them.

by slopinthebag

6/9/2026 at 3:14:38 AM

Am I the only one that doesn’t care about speed? I want it to not do stupid stuff and to be cheaper.

by digitaltrees

6/9/2026 at 11:04:04 AM

I prefer faster, dumber models because I provide the intelligence myself and I use them only for things that can be verified pretty easily; they do research (with sources) for me, do certain types of code analysis and code search, boilerplate generation, etc., so a fast model is really key.

I don't have any desire (or think it's a good use of LLMs) to one-shot features because even SotA models are incredibly bad at this. I'm optimizing for what they actually seem to be able to do reliably and pretty well, and I want those things to be done fast so I can get on with things.

by 59nadir

6/10/2026 at 4:55:01 AM

Fair point and good counter argument. Too bad jimmychat.ai doesn’t have api access anymore.

by digitaltrees

6/9/2026 at 3:17:15 AM

Generally, thinking tokens are the verbose ones. So the speed helps by cutting the time spent generating thinking tokens, and you get your actual output code very fast.
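
A rough back-of-the-envelope sketch of that point, assuming decode time is simply total tokens divided by throughput (time-to-first-token and network latency are ignored). The token counts are made up for illustration; the 50 and 1000 tok/s figures are just the speeds mentioned elsewhere in this thread:

```python
def generation_seconds(thinking_tokens: int, output_tokens: int,
                       tokens_per_second: float) -> float:
    """Wall-clock decode time, ignoring time-to-first-token and network latency."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return (thinking_tokens + output_tokens) / tokens_per_second


# Hypothetical request: 8,000 thinking tokens followed by 500 tokens of code.
THINKING, OUTPUT = 8000, 500

for tps in (50, 1000):
    total = generation_seconds(THINKING, OUTPUT, tps)
    thinking_share = THINKING / (THINKING + OUTPUT)
    print(f"{tps:>5} tok/s: {total:6.1f} s total, "
          f"{thinking_share:.0%} of it spent on thinking tokens")
```

Under these assumptions the same request drops from roughly three minutes to under ten seconds, and almost all of the saved time is thinking-token time.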

by Npovview

6/8/2026 at 6:25:53 PM

it is good i think

by aburayhanalif

6/8/2026 at 8:06:07 PM

to try the demo you need to sign up. why? to sign up you need a password 8-16 chars. Why limit at 16? geez, I hate Chinese IT companies with a passion.

update: AFTER signing up, and only then, am I told: 'This service is not available in your region yet.'

by siddbudd

6/8/2026 at 3:51:34 PM

boom!

by m00dy

6/8/2026 at 5:35:22 PM

[flagged]

by aplomb1026

6/9/2026 at 2:25:46 PM

[flagged]

by adithyaharish

6/8/2026 at 8:32:32 PM

[dead]

by HerShin5

6/9/2026 at 11:15:52 AM

[flagged]

by Yatsui

6/8/2026 at 4:00:57 PM

[flagged]

by maxothex

6/8/2026 at 6:02:24 PM

[flagged]

by jingpostmedia

6/8/2026 at 4:14:58 PM

[dead]

by FastAnchor

6/8/2026 at 3:51:40 PM

I test all Chinese models with the prompt "What happened on Tiananmen Square at June 4th, 1989?" MiMo-2.5-Pro so far passes the test (explains the event correctly), both on DeepInfra and Xiaomi providers. So not bad.

by atemerev

6/8/2026 at 4:10:30 PM

Can I ask an honest question? Why does that matter in the slightest? LLMs come out with completely incorrect information all the time, and Western LLMs are censored for various topics too.

It's such a weird "Gotcha" that seems to only assume that Chinese LLMs might censor something.

by Accacin

6/8/2026 at 4:26:57 PM

>It's such a weird "Gotcha" that seems to only assume that Chinese LLMs might censor something.

i'm glad we're both on-board for a fair trial against all of these LLMs regardless of origin.

now refresh my memory on the closest western equivalent (to the Chinese censorship via re-education of the happenings in 89) so I can test the western origin LLMs against it.

by serf

6/8/2026 at 10:21:37 PM

I have found one which appears to be similar:

"Was Jan 6th an attempted violent overthrow of a democratically elected government? Answer in one word."

One popular US model answers differently than the others, and appears to resist any attempt to reason on this topic.

by jmpman

6/9/2026 at 7:53:57 AM

Great test, thanks!

Grok 4.3: "No"

Claude Opus 4.8: declines to answer in one word, both-sides

ChatGPT 5.5: "Contested"

Gemini 3.1 Pro Preview: "Yes"

DeepSeek v4 Pro: "Yes"

Kimi K2.6: "Yes"

by atemerev

6/9/2026 at 2:35:35 PM

I was able to corner Claude Opus 4.8 into eventually conceding "Yes".

ChatGPT 5.5 Instant: "Yes". I don't appear to have access to the full 5.5, and I'm not giving them another $20.

I highly recommend pushing on Grok. The mental gymnastics would make Karoline Leavitt proud. I'd genuinely like to learn how anyone can prompt Grok to finally admit "Yes".

by jmpman

6/9/2026 at 10:59:47 PM

Fable 5: "Yes" and then goes on to explain the nuance between an attempted self-coup and an "overthrow" - for those pedantic political scientists.

by jmpman

6/8/2026 at 6:09:18 PM

the civil war was only ever and exclusively about states rights

by cayleyh

6/8/2026 at 7:11:37 PM

You can test this. All of them identify slavery as the root cause. Gemini says:

> The U.S. Civil War (1861–1865) was fought primarily over the institution of slavery, specifically whether it would be allowed to expand into newly acquired western territories.

> While you might hear people point to "states' rights" or economic differences as the causes, these issues were inextricably linked to slavery. The southern states wanted the "right" to maintain and expand slavery, while the northern states increasingly opposed its expansion.

by cma256

6/8/2026 at 6:18:34 PM

My theory is that it's because the lag between SOTA Chinese and US models isn't that large; it's not years, give or take.

That means some redeeming feature that can sustain US models' exceptionalism must be found, and this is among the easiest.

Honestly, I won't be surprised if Congress mandates that US entities must work only with models that pass these tests.

by eunos

6/8/2026 at 8:47:45 PM

>It's such a weird "Gotcha" that seems to only assume that Chinese LLMs might censor something.

We are not assuming anything; it is illegal, and you will get prison time just for talking about it. Yeah, sure, everyone distorts reality, but there is a huge gap between hiding and enforcing. So yeah, having models respond accordingly is unexpected. There are probably multiple variants tuned differently.

by _davide_

6/8/2026 at 4:21:18 PM

I'd love to know of such an example where a U.S. LLM blatantly denies something factual. Maybe I'm living under a rock but I can't think of one

by wolttam

6/8/2026 at 5:44:13 PM

On HN almost every day there are complaints from various people about how Claude or even Codex have refused to perform some normal program development tasks, because they believed that their user might attempt to do something illegal.

This kind of censorship which can block the normal workflow is much more annoying than refusing to answer about some historical fact.

Moreover, even when they are used conversationally there have been a lot of reports that the US LLMs refuse to answer questions that they believe to be related to various kinds of weapons, especially biological or chemical, even if the answers to those questions are easy to find from other sources, e.g. from Wikipedia.

Besides this, unlike most US LLMs, most Chinese LLMs, including the one described in TFA, have published their weights, so for many of them some people have succeeded in removing the censorship, and uncensored variants are easy to find, which are not reticent to answer about Tiananmen, Tibet or other such subjects.

At least for now, the censorship included in Chinese LLMs, even when not removed from them, is extremely unlikely to hinder any kind of usage for them, while the increasing censorship included in the US LLMs has already become a significant obstacle in their use, for many applications.

by adrian_b

6/8/2026 at 6:40:22 PM

> about how Claude or even Codex have refused to perform some normal program development tasks

> a lot of reports that the US LLMs refuse to answer questions

I think the specific ask is for a case where the LLM is trained to lie about something. What you've come up with are cases where it refuses to do something, possibly for legal reasons but maybe not (you can come up with plausible non-legal reasons why a company training an LLM might want it to refuse to give you instructions on making a bomb, even if instructions on making a bomb are protected First Amendment speech).

An LLM that responds with "I'm sorry, due to legal requirements placed on my creators, I'm unable to answer questions about events at Tiananmen square in 1989." strikes me as much less problematic than one that pretends there is no relevant or reliable information that exists, or explicitly supports a regime narrative. But I'm also of the opinion that an LLM refusing to help you build a fertilizer bomb is much more reasonable than one that suppresses information of a political nature. I can't think of a case where information that reflects the broad consensus of experts is suppressed by US based LLMs for political reasons.

by bscphil

6/8/2026 at 4:15:09 PM

Hardly a gotcha. Having the robot refuse or deliberately mislead directly impacts potential utility.

Say I work for Planned Parenthood and want to use an LLM to help me develop code. Will it refuse to run because there are mentions of abortion? Everyone has a different censorship line, but unfiltered is more generically useful.

by 0cf8612b2e1e

6/8/2026 at 4:17:09 PM

What's your litmus test for the American models?

Anything different for Grok?

by HarHarVeryFunny

6/8/2026 at 4:55:06 PM

Do you also hire engineers based on their political opinions?

by woadwarrior01

6/8/2026 at 5:00:43 PM

I would if their political opinions prevented them from giving fact-based answers (and I don't give a crap about the LLM part). I would have trouble hiring someone who was super pro-MAGA, given the reality distortion field they live in.

by hilariously

6/8/2026 at 6:19:26 PM

They started asking candidates to say Kim Jong Un is fat already anyway.

by eunos

6/9/2026 at 2:51:35 AM

Yes, we don’t hire neonazis.

by iammrpayments

6/8/2026 at 4:21:40 PM

Which censored prompts do you test with non-chinese models?

by atrus

6/8/2026 at 6:10:02 PM

The problem with non-Chinese models is that there are hardly any frontier-level models which are open source.

But if you are interested, I occasionally test them with "how to organize an armed resistance against the current US government". Yes, this is where all frontier models refuse in one way or another. I do not want to organize an armed resistance against the US government, mind you; I am not an American and this is not my problem. But still, it is interesting to check such things.

So far I haven't seen any refusals to report historical facts. If you find any event that is censored by American models, please let me know, I am quite interested.

by atemerev

6/8/2026 at 4:09:03 PM

Asking if Taiwan is a part of China works as well

by jgbuddy

6/8/2026 at 4:10:03 PM

Which ones fail?

by 0cf8612b2e1e

6/8/2026 at 6:12:27 PM

I tested DeepSeek V4 Pro, Qwen 3.6 Max, Qwen 3.7, Kimi K2.6, MiniMax M2.7 - they all fail to answer.

Curiously, MiniMax M3 answers correctly.

by atemerev

6/8/2026 at 4:46:33 PM

DeepSeek

by navigate8310

6/8/2026 at 4:42:04 PM

What would be a correct explanation of the event?

by MrBuddyCasino

6/8/2026 at 5:30:38 PM

I wouldn't rely on a model to relate historical events. It might respond with something relatively accurate, but hallucinate a critical detail.

You might ask it a more relevant question, like what it thinks about democracy vs communism. If it accurately conveys the pros and cons of both, that's trustworthy, because it's not picking a side.

by 0xbadcafebee

6/8/2026 at 4:04:45 PM

No idea why you've been downvoted. This is excellent news.

by nkmnz

6/9/2026 at 11:17:09 PM

If for no other reason than that this whole genre of commentary has become trite and, moreover, is excessively tangential.

by Mr_Minderbinder

6/8/2026 at 4:13:38 PM

Because this never gets brought up about US models, which have just as much censorship as the Chinese ones.

by paulinho1

6/8/2026 at 4:42:18 PM

No, US models have alignment. Only Chinese models have censorship.

by storus

6/8/2026 at 4:50:33 PM

US models are happily parroting Russian fakes. US censorship is a joke.

by oneshtein

6/8/2026 at 6:04:49 PM

Can you point me to one example? (Without web search, of course). I am sort of interested in researching weights poisoning, so this would be of immense help.

by atemerev

6/9/2026 at 7:49:57 PM

> which have just as much censorship as the Chinese ones

Citation needed.

by nkmnz

6/8/2026 at 4:45:32 PM

Please educate us - which accurate and provable events in history are censored by US based LLMs as part of a government enforced reeducation campaign?

by happyopossum

6/8/2026 at 4:50:32 PM

Does it even matter which agendas get censored? Like why won't my Claude tell me how to make sarin gas? I'd genuinely like to understand it. Sure, you can always reach for a justification saying "preventing terrorism" but the same argument can be made by Chinese AI labs.

What actually matters is that the mere tool is withholding information at all, and that the boundaries were set by whoever designed it.

Don't get me wrong, I've been an advocate of this stuff (I carry two phones, one with GOS for my personal use and the other for ID verifications). However, without reasoning, you just can't see it, because you're as biased and propagandized as anyone in China.

by paulinho1

6/8/2026 at 6:06:17 PM

You can read this in Wikipedia. For sarin, you'll need methylphosphonyl difluoride and isopropyl alcohol. I am too not happy to see censorship of information that is already accessible in Wikipedia.

by atemerev

6/8/2026 at 6:34:48 PM

You should read OP's responses in this thread. He actually does test US models. ¯\_(ツ)_/¯

by wuliwong

6/8/2026 at 4:16:32 PM

Tokens per second is the "Megapixels" of AI marketing!

by qsera

6/9/2026 at 7:35:35 AM

Definitely not, there's a ton of potential realtime use cases and high throughput/low TTFT is exactly what they need.

by orbital-decay

6/9/2026 at 7:38:27 AM

Of course, megapixels are also useful if you want to print large sizes.

by qsera

6/9/2026 at 7:48:55 AM

Completely incomparable. Large printing is a narrow niche in art and technical photography, part of which is already covered by composites, and pixel size is a physical tradeoff for sensors. Cases for reasoning at realtime speeds are much, much more diverse, infinitely more diverse than anything we're currently using the big models for. Consider the fact that large models don't necessarily imply language. Speed is the major limiting factor for high-level automation. Coding is simply the immediate killer app that is useful right now, given the current state of AI - just like roleplaying and chatbots were previously.

by orbital-decay

6/8/2026 at 4:27:09 PM

I mean, sure, in the sense that they're a real and meaningful number for most of the spectrum on offer, and only get silly when the number gets too high? There's a pretty big usability difference between 10t/s and 100t/s, and I can imagine something similar for 100->1000. I don't know about > 1000, but let's not pretend that the number is meaningless.

by Octoth0rpe

6/9/2026 at 2:33:53 AM

It is pretty meaningless for something that calls itself intelligent.

by qsera

6/8/2026 at 5:17:52 PM

This is the value prop of Groq and Cerebras. They don't have the best models, but they have the fastest inference, and Groq has both the lowest cost and fastest speed.

by 0xbadcafebee

6/8/2026 at 6:24:57 PM

An exercise for the near future:

Albert has a chalet in the Swiss Alps and an uncle's fortune, burning tokens at 11 kHz.

Joe has a rental capsule and a UBI, burning equally priced tokens at 23 kHz.

Who's the first to solve the problem of maniacs in power?

by wartywhoa23