Kimi K2.6 just beat Claude, GPT-5.5, and Gemini in a coding challenge

5/3/2026 at 5:31:47 AM

These posts are going to be a constant for the next year, because there's no objective way to compare models (past low-level numbers like token generation speed, average reasoning token amount, # of parameters, active experts, etc). They're all quite different in a lot of ways, they're used for many different things by different people, and they're not deterministic. So you're constantly gonna see benchmarks and tests and proclamations of "THIS model beat THAT model!", with people racing around trying to find the best one.

But there is no best one. There's just the best one for you, based on whatever your criteria is. It's likely we'll end up in a "Windows vs MacOS vs Linux" style world, where people stick to their camps that do a particular thing a particular way.

by 0xbadcafebee

5/3/2026 at 7:04:48 AM

The news is not in the way to compare models, it’s that Kimi K2.6 (and I’d add Deepseek v4 Pro) are more or less equivalent to Opus and that’s already pretty big.

They are open source and cost waaaay less per token than American models.

I’m using them right now on the $20 Ollama cloud plan and I can actually work with them on my side projects without reaching the limits too much. With Claude Pro $20 plan my usage can barely survive one or two prompts.

And I choose Ollama cloud just because their CLI is convenient to use but their are a lot of other providers for those models so you aren’t even stuck with shitty conditions and usage rules.

To me that’s a pretty bad thing for American economy.

by pjerem

5/3/2026 at 8:52:31 AM

Or maybe it is a pretty good thing for the American economy that you can get AI at cost rather than monopoly pricing.

You know, for the rest of the economy that is not big tech.

by chvid

5/3/2026 at 10:17:05 AM

It's not good for current administration. The American AI growth is only thing that keeps the GDP not looking terrible.

And investor pumping money in US AI circular money flow just makes innovation everywhere else slower. If not for the GPU/Memory drought running stuff locally (or just in competition cloud) would be far cheaper

by PunchyHamster

5/3/2026 at 12:01:03 PM

> It's not good for current administration

I don't know where to begin if you're leading with that. Anything approaching reality is not good for the current administration.

by arvid-lind

5/3/2026 at 11:43:54 AM

That is the very reason the open source models exist. Prestige and soft power to influence interest away from American models and hopefully slow down their progress.

by nelox

5/3/2026 at 12:06:51 PM

DeepSeek and other Chinese model makers are massively accelerating progress in AI not slowing it down. They're the only ones who still come up with real technical innovations while the proprietary model makers are stagnating.

by zozbot234

5/3/2026 at 12:33:09 PM

I'm as happy to see cheap open weight models any anyone is, and I'm in Europe and certainly not cheering the US on, but that's a bunch of unfounded hyperbole you just said.

by Sammi

5/3/2026 at 11:45:08 PM

>> DeepSeek and other Chinese model makers are massively accelerating progress in AI... They're the only ones who still come up with real technical innovations.

> that's a bunch of unfounded hyperbole you just said.

Calling the quote on top "unfounded hyperbole" betrays lack of knowledge and awareness about the subject. Keep in mind that when we talk about real technical innovations, we have in mind published research - not closed or hidden models, some of which we know only from hype but cannot even test. A cursory look at said research reveals more Chinese names than I can count.

Deepseek did introduce real technical innovations, they're in their papers, and there was plenty of talk about another "Sputnik moment" when their first model appeared. If you don't know what that means - it's the moment when the industry mobilizes to "accelerate progress" due to the unexpected appearance of strong competition.

There's a lot more to be said, but it wouldn't do much good to a person who's not following the trends.

by bigbadfeline

5/3/2026 at 4:09:10 PM

You should read the research papers that come out with Deepseek releases. There is a reason why the first Deepseek release briefly caused existential panic.

by overfeed

5/3/2026 at 4:14:12 PM

I did not and am not inclined to invest the time to do so.

But I did read some second hand reports that what was new and exciting was that they found some really good performance optimizations. The thing about deekseek publishing this is that now everyone has this.

Or did I miss something?

by Sammi

5/3/2026 at 8:55:56 PM

> The thing about deekseek publishing this is that now everyone has this.

It sounds like you're agreeing with upstream comment then!

>> DeepSeek and other Chinese model makers are massively accelerating progress in AI not slowing it down

by overfeed

5/3/2026 at 5:44:39 PM

from the "DeepSeek is a ploy to undermine usamerican models' duopoly" theory's perspective, "now everyone has this" helps them achieve this goal more efficiently.

especially if it's something that the major companies had already stumbled upon (something equivalent to) and regarded as a trade secret.

by DroneBetter

5/3/2026 at 1:23:10 PM

That is a petty big assumption (aka bullshit) unless you have direct insight the inner workings of the big US labs. Just because it isn’t published doesn’t mean that innovation is not happening.

by dumbmrblah

5/3/2026 at 1:30:34 PM

That's an unfalsifiable assertion with no evidence to support it, while all the visible evidence we do have points to stagnation and merely incremental pushes among the big proprietary model makers. Even Claude Mythos, which was 'teased' to the public but not released, is reportedly mostly a scaled-up model that takes massive compute resources to run (and lengthy agentic loops to achieve its reported results in computer security). The polar opposite to what the Chinese labs are releasing now.

by zozbot234

5/3/2026 at 1:46:02 PM

So no insight and just going off blogposts and YouTube huh. Pot kettle calling each other black etc.

by dumbmrblah

5/3/2026 at 5:09:50 PM

Sam Altman certainly got all lovey-dovey and less arrogant after DeepSeek came into prominence at most 3-6 month gap, if there was something mind blowing, Sam would’ve gotten money out of Apple and the same thing applies to Google if they had something mind blowing, they would’ve gotten more than a $1 billion refund neither happened. The bubble is near…

by Danox

5/3/2026 at 12:34:03 PM

Can you name some tangible AI idea that came out of Chinese labs?

I can name thousands that came out western universities.

I see a lot of rhetoric that only the Chinese labs are contributing to AI while companies like Google and Microsoft are still pulishing their research.

Unfortunately the domain of scientific papers is cluttered with AI slop but still occasional serious paper that i find are from western labs particularly Google Research or Microsoft Research

by darkoob12

5/3/2026 at 12:58:59 PM

Any of DeepSeek's recent papers which are more about efficiency and that's how their inference costs can be so low.

by satvikpendem

5/3/2026 at 3:12:46 PM

Oh please https://github.com/deepseek-ai

by gmerc

5/4/2026 at 6:04:02 AM

It doesn't mean only Chinese companies are contributing. Take TurboQuant, a serious theoretical paper not just GPU optimization, it was google research as the original transformers as MoE and many other techniques we use daily in deep learning as for libraries like TensorFlow which were pivotal to fast development of AI

by darkoob12

5/4/2026 at 9:09:59 AM

Bit of a strawman, isn’t it.

by gmerc

5/3/2026 at 4:47:29 PM

I am using both on OpenCode Go plan and they're pretty good, but I would say still not at the same level at GPT-5.5 in my experience, I don't know about Opus.

On a different note, is Ollama cloud good?

by amunozo

5/3/2026 at 6:40:10 PM

> is Ollama cloud good?

I'd say they have reliability issues but for the price it's worth it.

I like that usage isn't measured per token but per computation time, which means that you get more usage when models become more efficient.

by pjerem

5/3/2026 at 11:26:43 AM

I appreciate your reply but you are completely glossing over his point about how head to head model evals are useless lmao

by alansaber

5/3/2026 at 10:16:36 AM

They are no way as good as Opus yet. But Sonnet, yes. Using all in real life.

by rurban

5/3/2026 at 9:03:03 AM

> for American economy.

There is more to American economy than big tech.

And that's precisely why this has started: https://www.wired.com/story/super-pac-backed-by-openai-and-p...

by Cookingboy

5/3/2026 at 9:14:44 AM

>There is more to American economy than big tech.

Most of the stock market valuation is big-tech, and most of people's retirements are the stock market, so... if the AI bubble bursts a lot of the US will be affected.

by joe_mamba

5/3/2026 at 10:59:37 AM

>Most of the stock market valuation is big-tech

Which is why most of it is a bubble

by coldtea

5/3/2026 at 9:39:29 AM

I do not know why this is downvoted. This is true.

by atemerev

5/3/2026 at 10:57:58 AM

Agreed. I upvoted.

by gigatexal

5/3/2026 at 6:42:24 AM

There are objective ways to compare models. They involve repeated sampling and statistical analysis to determine whether the results are likely to hold up in the future or whether they're just a fluke. If you fine-tune each model to achieve its full potential on the task you expect to be giving it, the rankings produced by different benchmarks even agree to a high degree: https://arxiv.org/abs/2507.05195

The author didn't do any of that. They ran each model once on each of 13 (so far) problems and then they chose to highlight the results for the 12th problem. That's not even p-hacking, because they didn't stop to think about p-values in the first place.

LLM quality is highly variable across runs, so running each model once tells you about as much about which one is better as flipping two coins once and having one come up heads and the other tails tells you about whether one of them is more biased than the other.

by yorwba

5/3/2026 at 6:47:26 AM

That's objective metrics. Not an objective way to compare, which is the selection of metrics to include.

by jiggunjer

5/3/2026 at 6:58:21 AM

That's exactly why there's a ton of different benchmarking suites used for evaluating hardware performance.

I reckon we'll have similar suites comparing different aspects of models.

And, at some point, we'll be dealing with models skewing results whenever they detect they're being benchmarked, like it happened before with hardware. Some say that's already happening with the pelican test.

by cromka

5/3/2026 at 10:20:33 AM

> I reckon we'll have similar suites comparing different aspects of models.

The problem is that hardware benchmarks are harder to game. Yes, hardware manufacturer can make driver tweaks for say particular game to run better but the benchmark is still representable for the workflow user faces and they can't change the most important part, hardware, they can't benchmark gimmick their way in designing hardware

Meanwhile in LLM land the game is to tune it for the current popular set of benchmarks, all while user experience is only vaguely related to those results

by PunchyHamster

5/3/2026 at 10:06:44 AM

Fine-tuning for a specific task is even much less realistic than the benchmarks shown in TFA.

Most people who have computers could run inference for even the biggest LLMs, albeit very slowly when they do not fit in fast memory.

On the other hand, training or even fine tuning requires both more capable hardware and more competent users. Moreover the effort may not be worthwhile when diverse tasks must be performed.

Instead of attempting fine-tuning, a much simpler and more feasible strategy is to keep multiple open-weights LLMs and run them all for a given task, then choose the best solution.

This can be done at little cost with open-weights models, but it can be prohibitively expensive with proprietary models.

by adrian_b

5/3/2026 at 8:38:47 AM

While I partially agree with you, there IS work being done to make the metrics comparable. Eg:

https://ghzhang233.github.io/blog/2026/03/05/train-before-te...

It just hasn't been widely adopted yet. And it might be in each of their particular interests that it continues to stay so for a while. It's basically like p-hacking.

by taegee

5/3/2026 at 2:11:23 PM

I agree. I have rather constrained use cases for LLMs and the agentic harnesses that I use with them.

I try one or two of my use cases with new models or harnesses, make my own often subjective judgements, and largely ignore benchmarks.

Blogging and writing in general are a business, or feed other tech adjacent businesses, and a lot of writing about evals is attention getting - nothing wrong with that but there is a lot of noise.

by mark_l_watson

5/3/2026 at 6:20:44 AM

My theory is we will end up in a similar spot to hiring people. You can look at a CV (benchmarks) but you won't know for sure until you've worked with them for six months.

We as an industry cannot determine if one software engineer is objectively better than another, on practically any dimension, so why do we think we can come to an objective ranking of models?

by verve_rat

5/3/2026 at 8:59:34 AM

Yes, the entire field of software engineering ran aground on not being able to test how well people can write software.

But I'm more optimistic about testing programming models. You can run repeated tests, and compare median performance. You can run long tests, like hundreds of hours, while getting more than a few humans to complete half-day tests is a huge project. And you can do ablation testing, where you remove some feature of the environment or tools and see how much it helps/hurts.

by tlb

5/3/2026 at 6:54:25 AM

Not many things are as manifold broken as hiring these days. I hope we do not end up there.

by zelphirkalt

5/3/2026 at 7:39:19 AM

The CV-to-six-months analogy is actually exactly right and it's also why benchmarks for hiring people stopped being useful. The signal that holds up is what you see when something breaks, which is hard to compress into a number.

by roymain

5/3/2026 at 10:11:02 AM

this smells like an ai-generated comment so much

by bartekpacia

5/3/2026 at 10:22:14 AM

Terrible comparison. CV is just a list, telling you barely anything about performance and that's when candidate is not lying to get thru HR filter.

And we can judge developer performance, it just takes 6 months to a year working with a team so it's just hard to get metric

by PunchyHamster

5/3/2026 at 6:58:57 AM

You do not interview 1000 rounds on problems you're actually solving. If you did, hiring would be fine. Minus the social fit aspect, which isn't as relevant for a model.

by pishpash

5/3/2026 at 11:38:44 AM

This is a problem for OpenAI and Anthropic when they are bleeding money and in desperate need to jack up prices by moving people to their very expensive API.

It's very difficult to justify spending on the their models in a world where DeepSeek costs a fraction and Chinese open models exists and they perform as well as what is considered the state of the art, and it only depends on you adjusting how you use them.

A couple of days ago I canceled ChatGPT and started to try out DeepSeek. Let's see how it goes.

by surgical_fire

5/3/2026 at 5:16:06 PM

Cheaper and only 3 to 6 months behind at most.

by Danox

5/3/2026 at 8:30:07 AM

A pretty simple one would be to have every model try and one shot every ticket your company has and then measure the acceptance rate of each model.

by charcircuit

5/3/2026 at 8:35:14 AM

Except that if you tried one-shotting your ticket twenty times at different hours of the day and different days of the week, you would have enough changes to make benchmarks even if you used the same model every time. Much moreso if you fiddled with the thinking or changed the prompt.

Because non-deterministic, because of constant updates and changes, and because the models are throttled according to number of users, releases, et al.

by sam_goody

5/3/2026 at 8:46:27 AM

You never get "the same" Steph Curry, he might be tired, annoyed by a fan, getting older... but if he and I were to throw 100 3-pointers, we could all correctly guess who will perform better.

by serial_dev

5/3/2026 at 1:10:21 PM

Good point.

But I use Codex and Claude daily (work and hobby respectively). And there are days where one or the other just seems to have gotten up on the wrong side of the bed. Or is just being lazy. Or is suddenly super-powered do everything including what i asked it not to. (To be fair, the same thing happens with myself. :/)

I am convinced that if I was bench-marking, I would be convinced these are different models on different days.

[This conviction may say more about me then about the model.]

by sam_goody

5/3/2026 at 3:14:27 PM

That's also fair, Anthropic lobotomized their services a couple of times already. One week, you are in awe that the tools figure out everything, explain everything, consider everything, produce a clean fix... next week, they are completely useless.

by serial_dev

5/3/2026 at 10:06:26 AM

Unfortunately, you're probably right, but the cock measuring contest is going to keep escalating because the billionaires and VC backers need to _win_. And the Psychosis is going to produce some horrible collateral damage.

by cyanydeez

5/3/2026 at 7:10:47 AM

That was my thought too.

> The Word Gem Puzzle is a sliding-tile letter puzzle. The board is a rectangular grid (10×10, 15×15, 20×20, 25×25, or 30×30) filled with letter tiles and one blank space.

Just last week my superior asked to implement that for a customer. /s

Maybe some real, real task would be good? Add sone database, some REST, some random JS framework and let it figure out a full-stack task instead of creating some rectangles?

by chrisandchris

5/3/2026 at 10:23:18 AM

giving real relatable task like that is memory excercise, not any reasoning excercise. The training dataset have tens of thousands apps like that

by PunchyHamster

5/3/2026 at 6:23:57 AM

[flagged]

by ljlolel

5/3/2026 at 6:37:12 AM

So like Open Router?

by idonotknowwhy

5/3/2026 at 4:46:19 PM

A secure and open source Open Router

by ljlolel

5/3/2026 at 5:18:43 AM

I'm glad we're seeing a shift towards objectively scored tests.

We've been doing this at scale at https://gertlabs.com/rankings, and although the author looks to be running unique one-off samples, it's not surprising to see how well Kimi K2.6 performed. Based on our testing, for coding especially, Kimi is within statistical uncertainty of MiMo V2.5 Pro for top open weights model, and performs much better with tools than DeepSeek V4 Pro.

GPT 5.5 has a comfortable lead, but Kimi is on par with or better than Opus 4.6. The problem with Kimi 2.6 is that it's one of the slower models we've tested.

by gertlabs

5/3/2026 at 6:39:26 AM

Seems like in agentic work flow the qween flash and Deepseek flash models are quite good.

Fits with another comment from yesterday on here who said the flash models are just better at tool calling.

Planning with gpt55 and implementation with a flash model could be bang for the buck route.

by Mashimo

5/3/2026 at 8:34:09 AM

This may be objectively scored, but it is not an indication of anyone's coding capabilities. This test measures which model almost accidentally came up with the best strategy (against other bots). This is not representative of coding. You would need to test 100 or more of such puzzles, widely spread across the puzzle spectrum, to get an idea which model is best at finding strategies involving an English dictionary.

by tgv

5/3/2026 at 9:25:21 AM

I don't think that is entirely fair.. I don't see them stating anywhere they are measuring coding capabilities... "Using complex games to probe real intelligence."

And this seems very much in line with the methodology in ARC-AGI-3.

The results here, in the OP article and in https://www.designarena.ai all tell a similar story: Kimi K2.6 is up and in the SOTA mix.

by robbomacrae

5/3/2026 at 10:33:44 AM

The task was writing a "bot" to play the game. The title is "Kimi K2.6 just beat Claude, GPT-5.5, and Gemini in a coding challenge." How does that not imply measuring coding capabilities?

by tgv

5/3/2026 at 3:41:53 PM

If you are referring to the parent post, yes, hard to draw conclusions from such a small sample size.

For our testing, we use hundreds of different environments across disciplines, and it seems to line up with subjective experience better than other benchmarks. We test coding, agentic coding, and non-coding reasoning in the environments.

by gertlabs

5/3/2026 at 9:13:22 AM

> You would need to test 100 or more of such puzzles, widely spread across the puzzle spectrum

Would you? I am not very knowledgable on LLMs, but my understanding was that each query was essentially a stateless inference with previous input/output as context. In such a case, a single puzzle, yielding hundreds of queries, is essentially hundreds of paths dependent but individual tests?

by Galanwe

5/3/2026 at 10:31:16 AM

From what I understood, it's a coding challenge: the models wrote a player for that specific word game. E.g. https://github.com/rayonnant-ai/aicc/blob/main/wordgempuzzle...

by tgv

5/3/2026 at 9:19:01 AM

Generally speaking, would you take a conclusion based only an event that happened once?

by biscuit1v9

5/3/2026 at 5:36:57 AM

In my experience benchmarks are pretty meaningless.

Not only is performance dependent on the language and tasks gives but also the prompts used and the expected results.

In my own internal tests it was really hard to judge whether GPT 5.5 or Opus 4.7 is the better model.

They have different styles and it's basically up to preference. There where even times where I gave the win to one model only to think about it more and change my mind.

At the end of the day I think I slightly prefer Opus 4.7.

by veber-alex

5/3/2026 at 6:01:41 AM

I think benchmarks are improving and will always have value, but it's the equivalent to someone's college and GPA for an entry level job application.

It's a strong signal for a job, but the soft skills are sometimes going to get Claude Opus 4.6 a job over smarter applicants. That's what we'd really like to measure objectively, and are actively working on.

by gertlabs

5/3/2026 at 6:27:18 AM

In addition, the harness around these models do a lot of work and changes the outcome significantly.

I just had an issue where Claude CLI with Opus 4.7 High could not figure out why my Blazor Server program was inert, buttons didn't do anything etc. After several rounds, I opened the web console and found that it failed to load blazor.js due to 404 on that file. I copied the error message to Claude CLI and after another several unproductive rounds I gave up.

I then moved the Codex, with ChatGTP 5.5 High. I gave it the code base, problem description and error codes. Unlike Claude CLI it spun up the project and used wget/curl to probe for blazor.js, and found indeed it was not served. It then did a lot more probing and some web searches and after a while found my project file was missing a setting. It added that and then probed to verify it worked.

So Codex fixed it in about 20 minutes without me laying hands on it (other than approve some program executions).

However, I'm not convinced this shows GPT 5.5 being that much better than Opus 4.7. It could very well be the harness around it, the system prompts used in the harness and tools available.

For reference this was me just trying to see how good the vibecoding experience is now, so was trying to do this as much hands-off as possible.

by magicalhippo

5/3/2026 at 5:19:02 PM

I’ve run into this exact situation several times recently. They were all tasks that I’m positive Claude Code could have managed in the past, but now it simply cannot get over the finish line a lot of the time, and Codex will one shot fix it just like the old Claude Code could. I’ve even tried having Code fix and implement some old tasks it had done correctly in the past and now it simply can’t.

My guess is that it is the fault of the model rather than the harness, I believe Opus to be much worse than it was for whatever reasons. Though I suppose it could be Code’s fault somehow. For the time being though Codex is much better which I never thought I’d be saying.

I plan to run tests using Pi so they have the same system prompt and harness, but I’m suspicious that it’s only the subscription level Claude Code that is worse and we’re not allowed to use that with Pi.

by manny_rat

5/3/2026 at 7:27:23 AM

> However, I'm not convinced this shows GPT 5.5 being that much better than Opus 4.7. It could very well be the harness around it, the system prompts used in the harness and tools available.

A model that can more effectively make use of the tools presented to it is going to be better. You're not wrong about the system prompt; these can have quite a pronounced effect, especially when what the agent is bridging to is not just a case of bash + read/write; you need the prompt (and tool descriptions) to steer and reinforce what it should actually do because most models are heavily over-trained on executing bash lines.

When it comes to more basic agent usage that just runs in a terminal and executes bash ultimately most models are going to do just fine as long as you provide the very basics.

Regarding your case in this post it could be any number of issues: The provider being over-provisioned, leaving less time for your case, the model just not being particularly great, your previous context (in your original session) subtly nudging the model to not do the correct thing, and so on.

The truth is that you simply can't really know what the exact cause of this behavior you experienced is, but I think you're also working hard to cope on behalf of Anthropic.

All in all I think you're placing a bit too much faith in agents and their effect. If you slim down and use something like Pi instead you'll likely get a more accurate sense of what agents do and don't do, and how it affects things. You can then also add your own things and experiment with how that impacts things as well.

I've written an agent that only allows models to send commands to Kakoune (a niche text editor that I use) and can say that building an agent that just executes bash + read/write in 2026 is probably the easiest proposition ever. I say this because a lot of the work I've had to do has been to point them in the direction of not constantly trying to write bash lines; models all seem to tend towards this so if you just wanted to do that anyway most of your work is already done. The vast amount of the work in those types of agents is better spent fixing model quirks and bad provider behavior in terms of input/output.

by 59nadir

5/3/2026 at 8:33:50 AM

> A model that can more effectively make use of the tools presented to it is going to be better.

Of course. What I was getting at is that if the harness A doesn't expose certain useful tools that harness B does, it doesn't matter if the model could use those tools.

> I think you're also working hard to cope on behalf of Anthropic

How on earth did you get that out my post? I was just reporting on a recent experience I had, to make a point that harness+model is a very different thing from just model when it comes to evaluating effectiveness and quality of output.

by magicalhippo

5/3/2026 at 10:22:32 AM

You can fix Claude's laziness by modifying the system prompt. https://gist.github.com/chyzwar/99fe217c3ed336f57c74dcffe371...

by Chyzwar

5/3/2026 at 6:36:24 AM

I actually noticed this too. GPT 5.5 is much more "hands on" with calling tools to debug issues and verify results. I did all my tests in Cursor but I don't know if they use a different system prompt for each model.

by veber-alex

5/3/2026 at 5:39:44 AM

Are you tests and results open source?

by bazlightyear

5/3/2026 at 5:52:23 AM

Test result summaries are openly available, test environments are not.

by gertlabs

5/3/2026 at 10:08:15 AM

Curious, why can't you provide a measurement of context size for a human. Surely there must be enough science to make a good approximater.

by cyanydeez

5/3/2026 at 5:41:11 AM

Any thoughts on using it on Fireworks? It's extremely fast there.

by refulgentis

5/3/2026 at 5:51:43 AM

I'm not sure how many of our requests got routed to Fireworks -- for our testing, we set preferences for routing to providers with the highest advertised quantizations / highest reasoning mode support / or preferably the model developer itself.

While it may be possible to get better numbers from certain providers, we try to establish a common baseline. I.e. if we measure that Kimi K2.6 averages 450s on a task and GLM 5.1 averages 400s, you might be able to improve that number on a provider like Fireworks but GLM 5.1 would also likely be 10% faster on the premium provider. This is a caveat worth considering when comparing to proprietary model speeds on the site, though.

by gertlabs

5/3/2026 at 5:55:39 AM

At the current rate, open sourced models are expected to surpass cloud models within a couple years based on a study I read a couple days ago.

Looking back at chatGPT and claude a couple years ago, very small Qwen models are basically equal in coding to what those cloud based models could do then. Also factoring in scaling laws, a 9b going to 18b is roughly a 40% increase, whereas 18b to 35b is 20%, I expect there will be a change of at least price in cloud based models.

Adobe used to be $600 per month, then it became $20 when distribution scaled.

by ninjahawk1

5/3/2026 at 8:59:50 AM

That makes no sense, though, and reeks of extrapolating a trend way beyond the conditions in which it is valid.

The simple truth is, cloud models are always going to be strictly superior to open ones, simply because cloud model vendors can run those same open models too. And they still retain economies of scale and efficiency that operating large data centers full of specialized hardware, so at the very least they can always offer open models at price per token that's much less than anyone else's electricity bill for compute. But on top of that, they still have researchers working on models and everything around them; they can afford to put top engineers on keeping their harness always ahead of whatever is currently most popular on Github, etc.

by TeMPOraL

5/3/2026 at 3:03:24 PM

I don't think the real-world evidence supports your argument... OpenAI and Anthropic have all of those advantages today, and Chinese models are reaching the same level. Clearly, the Chinese labs are doing something very right that is not directly related to infinite money.

by rubslopes

5/3/2026 at 3:24:58 PM

Doesn't change the argument. As long as the models are open, the big cloud providers have strict advantage, because even if some open model gets ahead, they can just serve it from their infra, and do it better than everyone else.

This proves the strict inequality in my claim is preserved, everything beyond that is just debating the size of their advantage.

by TeMPOraL

5/3/2026 at 9:01:35 PM

> As long as the models are open, the big cloud providers have strict advantage, because even if some open model gets ahead, they can just serve it from their infra

Why would I want to use it, though? If, say, Anthropic were to serve a hypothetical Kimi K5.0 from their infra, seems like they'd keep their pricing where it is. If I can use that same model from kimi.com/Kimi Code, for less money (which seems like a safe bet in this scenario), then I wouldn't use Anthropic's offering. Even if Anthropic did lower prices, I doubt they'd be able to match kimi.com/Kimi Code.

> ... and do it better than everyone else.

Why would you assume this? That doesn't follow. "Better" has diminishing returns, and all of these companies have impressively scaled up already, and will continue to scale further in the coming years. And, regardless, I would absolutely use someone else's infra if it cost, say, 20% less, even if inference was a bit slower, or I hit rate limits more often (not usage limits, rate limits).

by kelnos

5/4/2026 at 5:48:38 AM

Cloud v. local is a different axis to secret v. open.

Claude is secret and cloud; Kimi on e.g. AWS is open and cloud; Kimi on your machine is open and local; If there are any closed and local models, I don't know what they are (Apple Intelligence, if I had to guess?)

I'd argue slightly differently from TeMPOraL: Cloud has advantages when the best models are the big ones. Right now this is so, but this may not always be the case. If we are in a world where the models stop improving at any point (for whatever reason) while hardware keeps getting better (it might or might not), then we may find the small cost benefit from operating at scale isn't worth the effort let alone the legal implications.

Unrelated, but for me the film called Kimi is higher in search rankings than the model is and oh wow we really do have a problem with the whole "finally the torment nexus" thing don't we.

by ben_w

5/3/2026 at 10:40:57 PM

Isn't Amazon Bedrock doing something quite similar already? The obvious argument is "We have Kimi at home" i.e. no need to pay for Chinese-supplied APIs that might misuse your submitted data.

by zozbot234

5/3/2026 at 5:59:58 AM

While this might be true I’m worried about the hardware side of things.

What if you have a good enough model but the cloud model providers are better in procuring hardware for interference?

by baxtr

5/3/2026 at 6:15:48 AM

The cloud providers are probably better at procuring hardware for inference, but on prem users are better at repurposing hardware that they'd need anyway for their existing uses. In a world where AI compute is likely inherently scarce, it makes sense to rely on both.

by zozbot234

5/3/2026 at 7:00:09 AM

I personally believe that eventually manufacturers will want to sell more of their hardware and look for ways to sell hardware to consumers. isnt that situation quite similar to the days of early computers? I am for sure biased in hoping that will be the case

by pheggs

5/4/2026 at 12:40:50 AM

Perhaps for some very specific capabilities such as TTS, translation, voice recognition and so on. But for general intelligence models, better hardware just directly allows better models and that doesn't seem to be changing any time soon.

by fireant

5/3/2026 at 6:09:50 AM

Local inference is definitely going to make more and more sense. Modern CPUs have all this amazing hardware well-optimized for inference purposes. I use a lot of web tools and see AI baked in and it feels weird. I want the smartness localized for speed and data security. I think and hope the industry points towards smart ai agents operating as locally as possible.

by gleenn

5/3/2026 at 6:19:04 AM

You’ll be able to run the open models on any cloud at the cost of the hardware rental. While the closed models will try to mark up beyond the base cost.

by Gigachad

5/3/2026 at 7:06:53 AM

> Adobe used to be $600 per month, then it became $20 when distribution scaled.

What product is this referring to? I haven't heard about Adobe having any offering that is quite that expensive?

by sakjur

5/3/2026 at 4:05:40 PM

$600/mo? Do you mean $600 as a one time purchase for life? I've never heard of any Adobe plan that expensive.

by MintPaw

5/3/2026 at 7:23:37 AM

If you have a link to the study you read, please share it.

by great_psy

5/3/2026 at 8:47:05 AM

Adobe never costed $600 per month. They had Creative Suites upwards of $3000 but that was before SaaS

by Marciplan

5/3/2026 at 6:14:24 AM

What were all the datacenters for???

by Traubenfuchs

5/3/2026 at 7:06:37 AM

Those would be the Pork Futures Warehouse from Discworld.

by robinsonb5

5/3/2026 at 5:40:08 AM

Kimi is really good.

I have been using Sonnet and others (DeepSeek, ChatGPT, MiniMax, Qwen) for my compiler/vm project and the Claude Pro plan is mostly unusable for any serious coding effort. So I use it in chat mode in the browser where it cannot needlessly read your entire project, and use Kimi on the OpenCode Go plan with pi.

Kimi consistently exceeded Sonnet on the C+Python project. Never had to worry about it doing anything other than what I asked it to do. GLM crapped the bed once or twice. Kimi never did.

by sieve

5/3/2026 at 9:16:21 AM

>the Claude Pro plan is mostly unusable for any serious coding effort

Why? Seems to go a giant the opinion of the masses who mostly use Claude Pro for serious coding.

by joe_mamba

5/3/2026 at 11:02:22 AM

I think ths comment is referring to the usage limits of the Pro plan. You run out of usage very fast if on anything less than a Max ($100+/month) plan.

by LUmBULtERA

5/3/2026 at 7:48:09 PM

Claude is opaque as regards token usage. So you might end up using your 5hr limit in 7-10 minutes using regular Sonnet. Meanwhile, OpenCode etc give you exact breakdown in terms of how many cached tokens used per session etc which you can use to estimate burn rate.

All these coding tools are extremely wasteful as far as resources are concerned. Almost designed to make you move to the next tier. You have to consciously restrict their scope all the time to make your plans last. Even with Kimi/MiniMax a 3-4 hour session often ends up with 50-70M cached reads. Not a small amount at all.

by sieve

5/3/2026 at 10:54:55 AM

Anecdotal evidence but last week or two Claude changed something related to their quotas. I'm a Pro user(now Team Standard) and while I did quite a lot before with that subscription, past week the 5h quota quite literally lasts maybe 5 semi sized prompts. I don't "vibe" anything, I give it clearly defined tasks or things to debug/fix, nothing hardcore. I ran out of the quota every single day past week, often twice a day, this never happened before. It's rather unusable for actual professional usage now. I'm tempted to test Codex over next week to compare hence why we're waiting with going to Claude Max sub.

by atraac

5/3/2026 at 4:43:43 AM

In a single challenge, measured by how performant the solution was.

Kimi K2.6 is definitely a frontier-sized model, so on the one hand it's not that surprising it's up there with the closed frontier models.

Being open is nice though, even though it doesn't matter that much for folks like me with a single consumer GPU.

by magicalhippo

5/3/2026 at 6:17:34 AM

> Being open is nice though, even though it doesn't matter that much for folks like me with a single consumer GPU.

The value of open source is not that you will run it locally, it's that anyone can run it at all.

Even if you can't afford to purchase the hardware to run large open source models, someone would, price it at half the cost of the closed source models and still make a profit.

The only reason you are not seeing that happen right now is because the current front-running token-providers have subsidised their inference costs.

The minute they start their enshittification the market for alternatives becomes viable. Without open-source models, there will never be a viable alternative.

Even if they wanted to charge only 80% of what a developer costs, the existence of open source models that are not far behind is a forcing function on them. There is no moat for them.

by lelanthran

5/3/2026 at 6:55:10 AM

The reason nobody is pricing Kimi K2.6 at half the cost of the best closed source models is that there are too many providers of the same model, so the competition drives prices down and they have to charge far less than that. https://openrouter.ai/moonshotai/kimi-k2.6/providers

by yorwba

5/3/2026 at 6:25:41 AM

Totally agree with you. There is only so much time before SF tech runs out of subsidy bucks, and Chinese models take the consumer spotlight

by 0xkvyb

5/3/2026 at 5:00:47 AM

>Being open is nice though, even though it doesn't matter that much for folks like me with a single consumer GPU.

Of course it matters because that makes coding plans much cheaper than those from Anthropic and OpenAI.

For personal use I have coding plans with GLM 5.1, Kimi K2.6, MiniMax M2.7 and Xiaomi MiMo V2.5 Pro and I am getting a lot of bang for the buck.

by DeathArrow

5/3/2026 at 5:07:58 AM

Currently it's not a huge difference given the subsidies of closed model subscriptions. Once that stops then yea it will be really nice to have open models as price competitors.

by magicalhippo

5/3/2026 at 6:12:34 AM

At least in my experience switching from Claude Pro ($20/month) to Kimi 2.6 through ollama (also $20/month), I was almost always hitting my usage limit with Sonnet 4.6, but with ollama I haven't hit my usage a single time.

by smj-edison

5/3/2026 at 10:35:51 AM

How many t/ps do you get with Kimi on Ollama?

by ode

5/3/2026 at 9:46:02 PM

I actually have no clue how I'd check, do you know how?

by smj-edison

5/3/2026 at 7:06:27 AM

>Currently it's not a huge difference given the subsidies of closed model subscriptions.

With Claude Max I was hitting the limits very fast.

by DeathArrow

5/3/2026 at 4:56:02 AM

It absolutely does matter.

The enshittification will go unnoticed at first but I'm already finding my favourite frontier models severely nerfed, doing incredibly dumb stuff they weren't in the past.

We need open weight models to have a stable "platform" when we rely on them, which we do more and more.

by keyle

5/3/2026 at 5:05:18 AM

Most people won't roll out their own K2 deployment across rented GPUs, so in that sense it doesn't matter that much, they'll be using a paid service which is just as much of a black box as Claude or ChatGPT. For example, on OpenRouter you can select a provider which state they use a given open model, but you have no idea what actually goes on behind the curtains, which quantization levels they use and so on.

That said, I do fully agree that it is valuable to have open near-frontier models, as a balance to the closed ones.

by magicalhippo

5/3/2026 at 6:54:40 AM

> but you have no idea what actually goes on behind the curtains, which quantization levels they use and so on.

That would take something close to a global conspiracy of every technologist lying continuously to keep the tweaks secret. If necessary, I personally will rent some servers and run a vanilla Kimi K2.6 deployment for people to use at reasonable prices. I don't expect to ever make good on that threat because they are grim times indeed if I'm the first person doing something AI related, but the skill level required to load up a model behind an API is low.

So it isn't hard to see how there will be unadulterated Kimi models available and from there it is really, really straightfoward to tell if someone is quantising a model; just run some benchmarks against 2 different providers who both claim to serve the same thing. If one is quantising and another isn't there's a big difference in quality.

by roenxi

5/3/2026 at 8:38:27 AM

> If one is quantising and another isn't there's a big difference in quality.

Sure. But the problem is you have to do this continuously to have any measure of confidence, which is expensive. For example, a provider could at any point randomly start serving some fraction of the requests to a quantized model. Either due to "routing error", as Anthropic called one of their model degradation events, or trying to improve bottom line.

There's really no good way to detect this on a few-prompt level without overspending significantly, because they're all black boxes.

by magicalhippo

5/3/2026 at 9:43:44 AM

Well you can rent a capable node for a few hours for like $50, install Kimi yourself and verify occasionally whether it works just like in cloud providers.

by atemerev

5/3/2026 at 5:27:34 AM

It's not really a black box. Useful models becoming fungible is crucial for disincentivizing bad behaviour with model providers. I can't really overstate how different it is from relying on closed models. If you don't like or trust any of the providers on OpenRouter you can rent the GPUs yourself and host it, although this is probably unnecessary.

by slopinthebag

5/3/2026 at 4:48:55 AM

This is the future though. Open weights models that run on H200s provide far more opportunity to build products and real infrastructure around.

You can always distill this for your little RTX at home. But models shaped for consumer hardware will never win wide adoption or remain competitive with frontier labs.

This is something that _can_ compete. And it will both necessitate and inspire a new generation of open cloud infra to run inference. "Push button, deploy" or "Push button, fine tune" shaped products at the start, then far more advanced products that only open weights not locked behind an API can accomplish.

Now we just need open weights Nano Banana Pro / GPT Image 2, and Seedance 2.0 equivalents.

The battle and focus should be on open weights for the data center.

by echelon

5/3/2026 at 5:58:57 AM

These large MoE models can work quite well on consumer or prosumer platforms, they'll just be slow, and you have to offset that by running them unattended around the clock. (Something that you can't really do with large SOTA models without spending way too much on tokens.) This actually works quite well for DeepSeek V4 series which has comparatively tiny KV-cache sizes so even a consumer platform can run big batches in parallel.

by zozbot234

5/3/2026 at 5:03:11 AM

I don’t fully understand what open weights unlocks that cannot be accomplished via API from a product standpoint.

Open weights is great if you want to do additional training, or if you need on-prem for security.

by bitmasher9

5/3/2026 at 5:14:50 AM

Multiple providers of the same model. That means competition for price, reliability, latency, etc. It also means you can use the same model as long as you want, instead of having it silently change behaviour.

by mkl

5/3/2026 at 7:36:28 AM

Those open weight providers where found nerfing models too.

by Bombthecat

5/3/2026 at 5:13:07 AM

Or try to beat Anthropic's uptime.

by stldev

5/3/2026 at 5:41:32 AM

> Open weights is great if you want to do additional training, or if you need on-prem for security.

The power of giving universities, companies, and hackers "full" models should not be understated.

Here are a just a few ideas for image, video, and creative media models:

- Suddenly you're not "blocked" for entire innocuous prompts. This is a huge issue.

- You can fine tune the model to learn/do new things. A lighting adjustment model, a pose adjustment model. You can hook up the model to mocap, train it to generate plates, etc.

- You can fine tune it on your brand aesthetic and not have it washed out.

by echelon

5/3/2026 at 4:55:48 AM

[flagged]

by joshoink

5/3/2026 at 8:22:27 AM

[dead]

by tom2026hn

5/3/2026 at 4:59:54 AM

I was surprised by the ranking, until I read what the test was. Not horribly relevant for coding.

The current ranking of all tests makes more sense (well, except for how well Gemini does)

https://aicc.rayonnant.ai

by slashdave

5/3/2026 at 7:06:17 AM

If you look at the ranking breakdown though, Kimi K2.6 has only participated in the last 5 challenges (claude dominated before then) and if you only count those it would be in first place

by mpeg

5/3/2026 at 12:44:44 PM

It also has a DNF. So it has a high ceiling but also unfortunately a low floor. So using Kimi means accepting high variability of the output.

Personally what I've found that has made coding agents more and more useful over the last year is that they have gotten a higher and higher floor, not that they have gotten a higher and higher ceiling. They were already plenty smart a year ago, it was just that they failed so often and so spectacularly that it made them a liability. Now they have become much more reliable, which is the key thing that has transitioned them into being actually useful. For the most part I don't use them to work on really intellectually difficult tasks. I mostly use them to work on very boring and labor intensive tasks. Most commercial software development work is just boring drudgery like this. Certainly the bulk of what I need them for is. I need them to just not crap their pants all the time while they're at it.

So I'm kinda wary seeing the poor reliability of Kimi.

by Sammi

5/3/2026 at 2:48:04 PM

If you look at the last 5 challenges (the ones Kimi was in) both Claude and Kimi have 1 DNF, chatgpt has 2

I'm not sure this is enough data to form an opinion, but going by what we have Kimi would be as reliable as Claude

by mpeg

5/3/2026 at 7:36:55 AM

The ranking of gold medals only makes sense if all models would gave participate all tests.

DNP = Did not participate

In this regard, kimi got more and better medals than Claude.

by SeriousM

5/3/2026 at 1:25:36 PM

All those models and the site is not responsive on mobile. Ironic.

by r0fl

5/3/2026 at 7:57:25 AM

Well, the link you provided basically confirms Kimi's dominance.

by dvfjsdhgfv

5/3/2026 at 7:07:54 AM

I've been switching across all different models this week with OpenCode and Pi - we're in an interesting place now because the open models are definitely "good enough" for a wide range of coding tasks and MUCH cheaper. They certainly aren't AS good, especially once you get into unfamiliar territory - custom enterprise frameworks etc where model knowledge falls off and general intelligence kicks in. But then, with time people will build up custom skills and agent files for those. And the open models will also get better.

I could easily see us in a place 2 years from now where this coding application is fully commoditised.

by zmmmmm

5/3/2026 at 5:21:54 AM

This seems less like Kimi is better at coding than Claude and more like Kimi found the right strategy for this particular game.

Still interesting though. The fact that an open weight model is close enough for that to matter is probably the real story.

by aykutseker

5/3/2026 at 6:16:42 AM

Kimi is nowhere near GPT or Opus unfortunately. I really wish it was. I’m running evals where models have to generate code that produces 3D models and it’s obvious that it lacks spatial understanding and makes many more code errors before it succeeds.

Maybe it’s better in one particular case here and there and I think this blog post is example of that.

by ponyous

5/3/2026 at 10:43:02 AM

Slightly OT, but after using DeepSeek V4 Pro for the last few weeks, I’ve found that it’s basically on par with Opus…except when it comes to driving Blender. This isn’t even a visual issue (DS isn’t multimodal), for whatever reason Opus just understands the Blender API a lot better.

There always seem to be pockets where closed frontier models perform slightly better.

by nmfisher

5/3/2026 at 9:49:58 AM

Not everyone needs 3D models to be fair.

by codedokode

5/3/2026 at 6:43:49 AM

Anecdotal, but having used Claude Code exclusively for the last several months, I was pleasantly surprised by how capable Pi + Kimi K2.6 is. It's also much faster (via OpenRouter), and at a fraction of the cost.

by yanis_t

5/3/2026 at 9:52:03 AM

It's interesting that OpenAI promised to make AI accessible for everyone, but China actually did it.

by codedokode

5/3/2026 at 9:53:19 AM

> Xiaomi confirming that weights for their newer V2.5 Pro model are dropping soon

This has already happened.

I have downloaded both the big Pro model and the smaller but multimodal MiMo-V2.5.

https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro

https://huggingface.co/XiaomiMiMo/MiMo-V2.5

https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro-Base

https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Base

The download of MiMo-V2.5-Pro takes 963 GB, while that of MiMo-V2.5 takes 295 GB.

For comparison, the download of Kimi-K2.6 takes 555 GB.

by adrian_b

5/3/2026 at 5:09:42 AM

I’ve been maining Kimi k2.6 through opencode go and openrouter for a week and I can say it’s the same experience as when I was maining Sonnet 3.5/4 late last year.

Not as good or as fast as Claude Code on Opus now but definitely enough for casual/hobby use. The best part is multiple choices for providers, if opencode gimps their service, I’ll switch

by justech

5/3/2026 at 4:30:51 PM

I don't know much about the AI field, but it seems to me that trying to train any model to be all things to all people is a really dumb idea. It requires huge financial resources and is causing extreme shortages/market distortions in any resource used by an AI company - RAM, SSDs, data centers, etc.

In the real world, you don't hire a plumber and expect him to also do your landscaping, fix your car, and tailor your clothes. It would seem like a much better use of resources if I could download an app that specialized in shell, Python, and C coding for example, or maybe even that would be 3 apps that communicated. Maybe I could even run them on a regular machine with 16GB of RAM. I don't need one huge model that can do that and code in Fortran, COBOL, and Lisp.

As humans, we've done pretty well by specializing. I hope this gets explored more with smaller, focused AI models vs the current path of one model to rule them all that can only be run in a data center the size of a country.

by prirun

5/4/2026 at 3:00:11 AM

> I could download an app that specialized in shell, Python, and C coding for example, or maybe even that would be 3 apps that communicated. Maybe I could even run them on a regular machine with 16GB of RAM. I don't need one huge model that can do that and code in Fortran, COBOL, and Lisp.

I would daresay for "coding tasks", you actually _want_ a model that can code "in all languages".

Sure, it might be that outdated language XYZ is really useless to you or the task you want, but being exposed to their limits, philosophy and concerns across environment, framework and organization, among other things, means for example you get insights of your problems from other areas and points of view.

That's afterall how we got Newtonian physics and calculus, right? A person studying physics someday noticed how the "math of the day" wasn't able to calculate some results without a lot of elbow grease. He then "found" the "missing math" and with it was able to generalize what at the time was considered a bunch of isolated phenomena into a cohesive corpus of knowledge.

So for example, I want my code to have mechanical sympathy like Fortran; well defined input/output interfaces, and not-interweaved control structures, like COBOL; stateless, side-effects-free business logic like Lisp.

by ElectricalUnion

5/3/2026 at 4:35:15 PM

This is true unless it isn't basically.

People claim that finetuning is good because no model can be 'that' general since gpt3 and at every generation it's becoming less true

by davidguetta

5/3/2026 at 5:11:01 AM

I absolutely love Kimi's personality - some of the things it says are so out there! And it's been great for very focused, iterative work.

Its weakness is that it seems to yak on-and-on when it needs to plan out something big or read through and make sense of how to use a niche piece of a complex library. To the point where it can fill up its 256k window - and rack up a build. (No cache.) I have had better experience with GLM 5.1 in those cases.

Anyone out there relate?

by jrecyclebin

5/3/2026 at 5:19:15 AM

Absolutely. I use caveman to help with that: https://github.com/JuliusBrussee/caveman

by anderber

5/3/2026 at 5:25:53 AM

You can just add "be brief" to the prompt to replace the entire plugin. Same results.

https://www.maxtaylor.me/articles/i-benchmarked-caveman-agai...

by LeoPanthera

5/3/2026 at 5:24:48 AM

Not a bad idea - however

> Caveman only affects output tokens — thinking/reasoning tokens are untouched.

The problem is the thinking. But could help to tune my system prompt for Kimi.

by jrecyclebin

5/3/2026 at 6:27:31 PM

What would prevent Anthropic or Open AI take one of the open source models, tune it to "feel" like one of their models and expose them under their respective names (GPT or Claude) to save on training and inference, but keep their brand name's popularity? If the models are not objectively worse and are cheaper to run, then nobody would complain, or complaining baseline would stay the same.

by pllbnk

5/3/2026 at 6:27:46 AM

I don't feel like this is an optimal way of comparing models. I really don't think any metric as of now has the ability to list down the best model as of now. It prioritizes tasks over the overall ability, and I don't even think it's possible to.

by kmkrworks

5/3/2026 at 4:51:39 AM

Is the site just slashdotted rn? Can anyone get to it?

by elromulous

5/3/2026 at 5:34:10 AM

Slashdot... Now that's a name I haven't heard in a long time. A long time.

by brettgo1

5/3/2026 at 6:06:29 AM

BTW it looks like Kimi won the subsequent challenge too https://aicc.rayonnant.ai/challenges/hexquerques/

by bazlightyear

5/3/2026 at 9:36:36 PM

Why are we still discussing which model is “the best”. The model is just one small part of what you need. Think agentic harness, data governance, AI guardrails, machine access controls, etc

by noashavit

5/3/2026 at 4:45:10 AM

Great to know, but what was the cost both in terms of $$ and tokens used?

Not to invalidate these benchmark results because they are useful, but the real usefulness it what they are capable to do when real people interact with them at scale.

Regardless, these are good news, because now that Microsoft is basically giving up their all-in strategy with Github's Copilot and Anthropic is playing the "I'm too good for you" game, it's about time for them to get pressed into not making this AI world into a divide between the haves and the have-nots.

by PedroBatista

5/3/2026 at 4:57:52 AM

Re pricing. Never as high as frontier commercial models.

by keyle

5/3/2026 at 5:55:31 AM

You’d be surprised with some long running complex tasks. I’ve seen Kimi spend 8 minutes (total) thinking on a task that Claude got done in 30 seconds. They both ultimately got it right, but Kimi spent ~$2.25 to Claude’s ~$0.20

by CryptoBanker

5/3/2026 at 10:51:18 AM

These benchmarks means very little. The real test is model + harness so agentic system that can fulfill given goals.

by syntex

5/3/2026 at 4:49:29 AM

I have to try Kimi. I was looking for an alternative. If you have any experience, advice, please share. I saw Kimi is at the top of the Open Router ranking.

by Frannky

5/3/2026 at 5:01:33 AM

I use Kimi at home via a kimi.com subscription and Kimi CLI (sometimes running inside Zed, sometimes not). My favorite model by far. And it's just $20.

I have to use a supposedly frontier model at work and I hate it.

by zorked

5/3/2026 at 5:08:59 AM

Nice, thanks for sharing!

by Frannky

5/3/2026 at 5:03:13 AM

Kimi K2.6 is great but I advice you to get a coding plan from Kimi.com as that way is much cheaper than paying for API calls using OpenRouter.

by DeathArrow

5/3/2026 at 5:08:06 AM

Thanks, I am trying it right now. I had an opencode plan 5$/month, so I will play with that. I use ZED and I added Pi ACP, so I can try the both pi and Kimi. I will also try it in opencode and via Kimi code.

by Frannky

5/3/2026 at 5:18:41 AM

Use kimi 2.6 for planning and a cheap model (preferably local) for execution, and then kimi once again for reviewing it. Then finally I review the code. Saves a lot on tokens.

by prvnsmpth

5/3/2026 at 5:47:13 AM

Very interesting, thanks for sharing. I am testing it with Pi in Zed and it seems pretty good.

by Frannky

5/3/2026 at 9:27:09 AM

Kimi is capable model but it needs a very good harness. With a good harness it is a very capable model. But it can get into all kinds of issues (loops and such) something that frontier models do not.

As I said, you can blame the model, but it is nothing that the harness cannot take care of more deterministically.

by _pdp_

5/3/2026 at 9:29:16 AM

Which harness do you recommend using then for a model like kimi 2.6, Opencode or something else?

by Imustaskforhelp

5/3/2026 at 9:42:00 AM

We have our own... so I was making the comment form our own experience working with the model.

It is a lot trickier to use kimi compared to sonnet - hence why it seems that sonnet is more powerful while I think it is down to the harness.

by _pdp_

5/3/2026 at 9:48:08 AM

How did you make your own harness, I am curious to know more about the building process and please feel free to share your harness.

If someone were to not use your harness and rather use some stock harness though, what is the one that you would recommend? I am curious about that too.

by Imustaskforhelp

5/3/2026 at 10:49:44 AM

I really don't mean to discourage anyone but I find that making your own harness is pretty complex process. We where very naive when we started, thinking this is just a loop, but then it turned out it has all kinds of nuances, edge-cases, things to be considered, etc.

As I am writing this I am fixing a bug in the harness that could cause infinite loops under some conditions.

For simple interactions looping over the the llm complete function is not really that difficult. Put some tools, write the loop and exit.

It starts getting more tricky when you need to detect cycling behaviour, when some tools might need to abort but do so gracefully, when you need to wait for something to complete while allowing other parts to continue, when you want to protect against too much usage after you hit certain thresholds, when you need to retry whatever, when you need to compact or truncate to maintain good context window and do so with the model in mind (kimi), when messages need to be structured in certain ways to handle various model capabilities, etc.

Pi.dev, Hermes and all the other I have seen do not do even half of the things I have enumerated above - not at least out of the box.

by _pdp_

5/3/2026 at 11:11:58 AM

> I really don't mean to discourage anyone but I find that making your own harness is pretty complex process. We where very naive when we started, thinking this is just a loop, but then it turned out it has all kinds of nuances, edge-cases, things to be considered, etc.

I'd be interested in your harness if its open source, please share some more resources

> Pi.dev, Hermes and all the other I have seen do not do even half of the things I have enumerated above - not at least out of the box.

Interesting that they don't do these things out of the box, do you still have some ideal setup then though? Like Pi.dev + X/Y/Z thing which can make things work close enough to ideal.

Because although our conversation is interesting, I feel like I am unable to take any enforcable action and I might default back to opencode. So I would love if you could talk more about it.

by Imustaskforhelp

5/3/2026 at 11:31:15 AM

I am not saying that they do not work.

What I am saying is that they do not handle many things that we handle internally and yes our code is not open source so I have the luxury to compare notes without revealing much how we do things... sorry about this.

The reason opencode and pi.dev are not handling a lot of the edge-cases is because they are primarily designed to run in a constraint mode with some level of human intervention assumed and while you can certainly make them run in yolo mode I don't believe this is most of their usage. OpenClaw is like that but then look at the code behind it - it is enormous. Most of it is mysterious to me.

Our tool is my bio. But for open source opencode and pi.dev are the best and most widely used.

by _pdp_

5/3/2026 at 11:57:08 AM

Personally I prefer my harness to be open source. So Opencode/pi.dev seem the best for me.

I suppose that as your original comment mentioned Kimi being a nice model with a good harness, My personal opinion suggests for me saying that Kimi with opencode might make it competent model too. Although currently I just use the model provided by default on opencode and I have found it to be competent for small codebases itself, although you definitely have to ask it to git/jj.

The Opencode default model consensus seems to be GLM 4.6 and Kimi 2.6 is definitely much more competent model

I think that tools like Opencode might continue getting better too given its open nature if/as the harness itself turns out to be the most valued piece for a model's competence as you suggested, so I am betting on Opencode and I really appreciated this discussion we had as It was great to know insider insights and I wish you good luck for your product!

by Imustaskforhelp

5/3/2026 at 2:20:43 PM

https://github.com/can1357/oh-my-pi

Pi+some additions here. Have not used it enough to have a strong opinion, however, I’ve seen it mentioned here.

by kleinishere

5/3/2026 at 5:20:22 AM

This seems to be testing the models on leetcode style prompts that also require the model to implement TCP calls to send the results. Interesting but probably not a apples to apples comparison. The fact only Grok qualified for the first one seems suspect

by SomaticPirate

5/3/2026 at 9:31:37 AM

I’ve been wondering about potential regression in coding models.

The initial models were corrected by programmers which gave a very high quality feedback signal. Whereas with vibe coding on the rise, you’ll lose that signal.

by ajdegol

5/3/2026 at 4:02:28 PM

I find this practically relevant insofar K2.6 is offered by Kagi in the 2 lower priced plans for $5 and $10 per month.

by raffael_de

5/3/2026 at 12:32:21 PM

“I did not wake up to be a loser. This loser attitude, makes no sense to me.” - Frontier Model Labs (original quote by Jensen in a podcast)

by gizmodo59

5/3/2026 at 4:46:39 AM

I’m a little confused as to the setup. It was asking each model to one-shot a script and then the scripts faced off? Were the models given a computer environment? Or a test server to iterate against?

by beering

5/3/2026 at 4:47:23 AM

Sounds incredibly simple to me. One-shot.

by rpmisms

5/3/2026 at 4:56:47 AM

So nothing like real-world coding, where you’d be able to run and test the script before submitting?

by beering

5/3/2026 at 5:32:24 AM

One shot just means the user doesn’t have to iterate on it via the agent. The agent does what ever it needs to deliver the best outcome, including its own running and iteration until it’s happy with it. This could be a short or long process potentially depending on the task.

by procinct

5/3/2026 at 8:09:44 AM

I don't know about you but kimi 2.6 from the kimi subscription has been absolutely bad and useless for the past 1 week so I canceled my sub and stopped using it.

by alex7o

5/3/2026 at 5:48:35 AM

In my opinion, this kind of comparison is not very meaningful.

by koala-news

5/3/2026 at 6:09:24 AM

Same experience here i use open router with claude as fallback for my startup. Is Kimi if close in quality the cost is difference hard to ignore

by imrozim

5/3/2026 at 6:43:17 AM

As a musician, I find the butchering of musical notation on Kimis pricing page extremely off-putting.

by bjoli

5/3/2026 at 8:16:34 AM

I’m not trying to add fuel to the fire, but will OpenAI and Anthropic’s IPO go smoothly?

by warabe

5/3/2026 at 7:13:01 AM

Doesn't seem like a very insightful result. Kimi won with the naive strategy. Other models didn't slide tiles at all or didn't demonstrate understanding of the rules, claiming words that lost points. A strategy that did nothing would beat them.

We know these models can solve much more difficult problems, something isn't right.

by muti

5/3/2026 at 4:56:54 AM

What's the GPU VRAM requirements for this thing?

Awesome to have a open model that can compete, but damn it would be so much better if you could run it locally. Otherwise, it's almost so difficult to run (e.g. self host) that it's just way more convenient to pay OpenAI, Claude, etc

by jakemanger

5/3/2026 at 5:05:12 AM

>Otherwise, it's almost so difficult to run (e.g. self host) that it's just way more convenient to pay OpenAI, Claude, etc

Getting a coding plan from Kimi.com will make coding 20x cheaper than using Anthropic.

BTW, I am using it with Claude Code.

by DeathArrow

5/3/2026 at 5:33:55 AM

[dead]

by redrove

5/3/2026 at 5:33:17 AM

Amazing. To me it feels like GLM 5.1, Kimi 2.6, DeepSeek 4 are all competitive both with each other and with the American models. Truly a great time to be alive.

I would like to see more effort making the flash variants work for coding. They are super economical to use to brute force boilerplate and drudgery, and I wonder just how good they can be with the right harness, if it provides the right UX for the steering they require.

As much as vibe coding has captured the zeitgeist, I think long term using them as tools to generate code at the hands of skilled developers makes more sense. Companies can only go so long spending obscene amounts of money for subpar unmaintainable code.

by slopinthebag

5/3/2026 at 6:41:55 AM

>To me it feels like GLM 5.1, Kimi 2.6, DeepSeek 4 are all competitive both with each other and with the American models.

Yes, at least probably with each other

>Truly a great time to be alive.

Do wish outcome had been better for those whose work ended up in the training sets, wish competition could’ve found ways to agree on distillation practices, wish globally we’d planned as fast as we’re developing…

Tremendous excitement too

by Barbing

5/3/2026 at 9:17:48 AM

I never looked in to the details of these benchmarks, I live with the assumptions that most benchmarks of any kind are gamed and useless.

What I do see in my own work and that of others around me, is that Claude consistently outperforms Gemini and to a lesser extent Codex.

With Claude eating tokens with declining return, concessions have to be made and Codex is a usable middle ground.

I use Kimi in Kagi's Assistant for non-code or generic programming questions and am quite happy with its no-bullshit responses.

by gherkinnn

5/3/2026 at 5:28:59 AM

So we are now at the point where open weight models are rapidly catching up to the frontier models.

They are at best 30 days behind, and at worst case 2 months behind. The last issue is being able to run the best one on conventional hardware without a rack of GPUs.

The Macbooks, and Mac minis are behind on hardware but eventually in the next 2 years at worst will make it possible thanks to the advancements of the M-series machines.

All of this is why companies like Anthropic feel like they have to use "safety" to stop you from running local models on your machine and get you hooked on their casino wasting tokens with a slot machine named Claude.

by rvz

5/3/2026 at 5:06:41 AM

All my co-workers say Claude blows away Gemini. Is it really that good? How can I do Kimi?

by pbreit

5/3/2026 at 5:15:14 AM

You can sign up for a plan on the kimi code platform and use it via the pi.dev coding agent, or opencode. In planning, I’d say it’s almost on par with Claude Opus.

by prvnsmpth

5/3/2026 at 6:15:04 AM

About 40% of stock market consists of about 7 or 8 companies. Those companies that are all into AI circular deals collectively trillions of dollars in valuations.

Now imagine a company burning 200,000/month on AI spend. Real numbers. Not every company is but some are.

Why such a company won't deploy an open weight model (Kimi 2.6 or Deepseek v4) on their own hardware (rented or otherwise) to save about 2.4 million dollars a year?

And these are the landmines Chinese cleverly did set up. Not saying intentionally or otherwise.

But end result is that good luck recouping your investments, you can pretty much kiss goodbye to any ROI. The bucket has a hole at the bottom and the bubble bust is guaranteed.

PS: Without open weight models too the economics do not make sense neither the code generated by these SOTA models is reliable enough to be deployed as is. Anyone claiming otherwise either hasn't worked on a real software stack with real users OR didn't use AI long enough to witness the AI slop and how hard it is to untangle or de-slopify the AI generated code therefore these trillion dollar valuations are absurd anyway.

by wg0

5/3/2026 at 11:34:31 AM

crazy how people on hacker news, who just gobble up anything if it's from openai or anthropic suddenly become monocled sceptics when chinese open models are "winning"

by VeejayRampay

5/3/2026 at 5:30:10 AM

Yes gimini is a programming application

by qakajjqj

5/3/2026 at 12:40:58 PM

Is there a lo-slop model that stands out when using Zig?

by childintime

5/3/2026 at 1:00:44 PM

Zig changes way too fast for AI to be generally good at it. Have it generate Rust instead.

by zozbot234

5/3/2026 at 10:15:16 AM

That is not a programming challenge, the fuck

by PunchyHamster

5/3/2026 at 5:38:58 AM

People thinking to self-host Kimi K2.6 had better be prepared for how big it is.

Q8 K XL quantization for instance is around 600GB on disk. I would bet about 700GB of VRAM needed.

Quantizations lower than Q8 are probably worthless for quality.

Or 2.05TB on disk for the full precision GGUF.

https://huggingface.co/unsloth/Kimi-K2.6-GGUF

If you can afford the hardware to run Kimi K2.6 at any decent speed for more than 1 simultaneous user, you probably have a whole team of people on staff who are already very familiar with how to benchmark it vs Claude, GPT-5.5, etc.

by walrus01

5/3/2026 at 10:48:56 AM

While most people would not be able to run Kimi K2.6 fast enough for a chat, as a coding assistant the low speed matters much less, especially when many tasks can be batched to progress during a single pass over the weights.

If you run it on your own hardware, you can run it 24/7 without worrying about token price or reaching the subscription limits and it is likely that you can do more work, even on much slower hardware. Customizing an open-source harness can also provide a much greater efficiency than something like Claude Code.

For any serious application, you might be more limited by your ability to review the code, than by hardware speed.

by adrian_b

5/3/2026 at 12:00:57 PM

DeepSeek V4 Pro is way more effective at batching multiple tasks together since the KV cache is so much lighter - a max of ~10GB at full 1M context, and in a linear proportion with context according to the DeepSeek V4 release paper. That's extremely impressive, it unlocks batching, agent swarms etc. even on severely memory-constrained platforms, especially at smaller max context.

by zozbot234

5/3/2026 at 5:45:46 AM

Kimi is a natively quantized model, the lossless full precision release is 595GB. Your own link mentions that.

by zozbot234

5/3/2026 at 6:03:05 AM

So, realistically, $100K for an 8x RTX 6000 Pro system that can run it at a usable rate.

by CamperBob2

5/3/2026 at 6:12:42 AM

I think people will always disagree on what qualifies as a "usable rate". But keep in mind that practically no one sensible is running the latest Opus or GPT around the clock, especially not at sustainable, unsubsidized prices. With open-weights models it's easy to do that.

by zozbot234

5/3/2026 at 6:43:49 AM

Also for people doing something medical, privacy or sensitive data related, there's an almost incalculable value (depending on industry niche) in having absolutely no external network traffic to any servers/systems you don't fully control.

by walrus01

5/3/2026 at 6:41:56 AM

the 'unsloth' link above is a 3rd party person that has quantized it to Q8, the original release is considerably larger in size than 600GB:

https://huggingface.co/moonshotai/Kimi-K2.6

by walrus01

5/3/2026 at 10:26:05 AM

No.

I have downloaded Kimi-K2.6 (the original release).

  du -sh moonshotai/Kimi-K2.6 
  555G moonshotai/Kimi-K2.6

  du -s moonshotai/Kimi-K2.6 
  581255612 moonshotai/Kimi-K2.6

For comparison (sorted in decreasing sizes, 3 bigger models and 3 smaller models, all are recently launched):

  du -sh zai-org/GLM-5.1
  1.4T zai-org/GLM-5.1
  du -sh XiaomiMiMo/MiMo-V2.5-Pro 
  963G XiaomiMiMo/MiMo-V2.5-Pro
  du -sh deepseek-ai/DeepSeek-V4-Pro
  806G deepseek-ai/DeepSeek-V4-Pro

  du -sh XiaomiMiMo/MiMo-V2.5 
  295G XiaomiMiMo/MiMo-V2.5
  du -sh MiniMaxAI/MiniMax-M2.7
  215G MiniMaxAI/MiniMax-M2.7
  du -sh deepseek-ai/DeepSeek-V4-Flash
  149G deepseek-ai/DeepSeek-V4-Flash

by adrian_b

5/3/2026 at 6:55:54 AM

That page mentions that the model is natively INT4 for most of the params, and 600GB is in the ballpark of what's available there for download.

by zozbot234

5/3/2026 at 6:20:44 AM

What I would like to see is a comparison of how well the models work in long running conversations:

  * do they lie and gaslight

  *  do they start breaking down on very long chats (forget old context, just get dumber)

  * do they constantly try to tell me how smart I am vs solving the problem (yes man)

  * do they follow conventions, parameters set out early in the prompts, or forget them

  * if they cant read a given file (like pdf), do they lie about it

  * is there a branch function to go back to earlier state of conversation

  * what is the quality of the presentation of results (structure, wording, excessive use of tables, appropriate use of headings)

  * how does the bot deal with user frustration (empathy?)

For example Chatpgt 5.5 is fairly smart, but presentation of results is kind of poor and unstructured, and unnecessarily long. It will break down on long conversations (the long answers dont help here), and it can’t deal with that except lying and gaslighting. It also has very little empathy, and mostly ignores user frustration. But at least theres branching, so one can go back without completely starting over.

Gemini doesnt feel quite as smart these days. It does well with very long conversations. Except it has bugs where all context gets lost or pruned, and it will lie and gaslight about it. Theres also no branching, so once context is lost you have to start over. Presentation is decent. Empathy is fairly good, except if users get frustrated, it gets more and more flustered and breaks down.

by ant6n

5/3/2026 at 6:50:26 AM

I think they all support branching if you use a agent like pi.

by Mashimo

5/4/2026 at 11:11:44 AM

[flagged]

by kk_mors

5/3/2026 at 4:54:58 AM

[dead]

by surrTurr

5/3/2026 at 8:19:41 AM

[flagged]

by ibrahimhossain

5/3/2026 at 5:08:01 AM

[dead]

by Rekindle8090

5/3/2026 at 7:19:31 AM

[dead]

by tim0414

5/3/2026 at 6:59:41 AM

Meanwhile, I can’t get kimi k2.6 to edit a heredoc in a shell script without it fucking it up.

by chillfox

5/3/2026 at 5:42:58 AM

I always though claude is the goat, but i guess its time to change the notion and try Kimi K2.6

by plexescor