Show HN: State of the Art of Coding Models, According to Hacker News Commenters

5/2/2026 at 9:36:00 PM

Interpreting these metrics is quite interesting.

One thing for sure is that while Claude is currently taking the #1 spot in mentions, it carries a lot of negative sentiment due to API pricing policies and frequent server downtime. On the other hand, the runner-up, GPT-5.5, actually seems to have more positive feedback.

Personally, my experience with Codex wasn't as good as with Claude Code (Codex freezes on Windows more often than you'd expect), so this is a bit surprising. That said, the more defensive GPT is definitely better in terms of sheer code-writing capability. However, GPT actually has quite a few issues with text corruption when generating in Korean or Chinese—something English-speaking users probably don't notice. In terms of model capabilities, when given the same agent.md (CLAUDE.md) file, I think GPT is better at writing code, while Claude is better at writing text during code reviews.

Looking at the bottom right, Qwen and DeepSeek are open-source, so they are largely mentioned in the context of guarding against vendor lock-in, which drives positive sentiment. Considering that Hacker News occasionally shows negative sentiment toward China, the fact that they are viewed this positively—unlike US models—shows that being open-source is a massive advantage in itself.

Anyway, one thing for sure is that Gemini is pretty much unusable.

by jdw64

5/3/2026 at 12:32:31 AM

I like your analysis but I think the open models are genuinely well received not only because of vendor lock in or being open source.

They are cheaper! All signals point to them staying cheaper because they are built more sustainably. Also, some of the latest entries can run on 1 GPU! Literally available at your desktop where there can be no service interruptions. Not even network latency. People are one and few shotting little games for 0 dollars because they bought a GPU to play video games this year. To me that's an unbeatable value. Once the tooling catches up and a few more model releases, it could change everything completely.

by 2ndorderthought

5/3/2026 at 4:00:28 AM

I think it's decidedly preliminary to compare models using the same .md file, since they respond quite differently to the same input. I try to narrow to the top 2-3 and then refine inputs for each one. For me it's unfortunately not much better than an intuitive process of trial and error.

Gemini is not at all unusable. It is quite usable for the tasks it excels at - to the point that it is the top pick for many tasks and I spend more money there than elsewhere. On the other hand it responds quite differently from the other major models - so that claude and gpt on one hand are similar and gemini requires a different approach. In my opinion people who think gemini is worthless have not learned how to prompt it correctly. Again, it's intuitive and watching concrete response difference due to small input changes, but if I had to summarize it shows its google books / google scholar roots.

I have started experimenting with qwen more than deepseek, but I have not had good results yet. Given the good press I presume I will learn how to interact with it for better results.

Curious if others have similar experiences in comparing models usefully, or if most don't bother with this, or do something else? I mainly use models for highly focused specialty tasks, so this fine tuning makes the difference between usable and unusable. I don't yet have the luxury of defining my preferred workflow and finding the tool for the task. Everything just breaks almost immediately if I try to shoehorn into my preferred flow.

by sgc

5/3/2026 at 6:27:23 AM

What are your prompting and general tips for using Gemini effectively?

And what use cases do you think it’s best suited for?

by uxcolumbo

5/3/2026 at 4:31:27 PM

General tip is *iterate*. Look at what it does right and does wrong, and refine. My most complex prompt took me 2 weeks of work to get right, and I just spent a half a day improving that even more. Obviously not worth it unless you are going to be be doing something major. In my case it is for 2 years of work, so clearly worth it.

Somebody else mentioned they had great success at math heavy code. I had to develop a complex piece of software that also integrated into 4 existing systems with a lot of poorly documented constraints. I tried with the major models and Gemini provided the most structured solution that would allow me to work on it, add features etc in the future, and it created an MVP in one shot after working through the planning stage in detail. I have managed to work on that code afterwards quite successfully. It is by far the best model for language tasks like OCR and translation. In my opinion the benchmarks, which put it first for this, are far from showing how far ahead it is because it responds so well to iterating on a prompt. So I think it is good to great for a wide variety of things, but you have to iterate. If what you are doing is simple enough you don't need or want to do that, then use the best model you are already comfortable with. For me that type of work is currently done with GPT 5.5.

by sgc

5/3/2026 at 12:46:37 AM

I had a surprisingly positive experience with Gemini optimizing some mathy MPS code. It did far better than claude.

Of course, when I tried it on something else it rewrote every line in the file for no good reason, applied changes directly when I told it just to plan, etc.

So maybe it has one strength.

by dgacmu

5/3/2026 at 7:58:22 AM

Gemini is actually realy good for code review, critique and other tasks. It just cannot be allowed to code himself.

by chewz

5/3/2026 at 6:09:39 AM

I know its subjective, but I tried different models with my OpenRouter subscription and VSCode Roocode plugin. I evaluated them based on cost and code quality. I liked gemini-3-flash-preview.

Its really a cost effective model.

by pryanshu89

5/3/2026 at 5:32:49 AM

Yeah, I think we are pretty past an idea of "better" and are at the point where it needs qualification as "better at". "Claude writes, Codex reviews, and Gemini doesn't get installed" is my go-to, although I go to Gemini whenever I want an advanced graphical calculator, or data extraction of any type.

by petesergeant

5/3/2026 at 6:58:53 AM

"Gemini researches" has been my go-to for awhile (although GPT seems to have gotten better recently in this category?).

Essentially, I use it when I truly only need an "Advanced Google" to find lots of document or website references based on only some partial understanding of "X". I don't like having it do anything with those things. Only when I need to find those things.

Claude, especially, seems to absolutely hate doing research when there are major ambiguities in your question. It's the only one of the major models that keeps playing 20 questions with me when I neither know nor care what the answers to those questions are.

by dentemple

5/3/2026 at 6:01:03 AM

Mostly my experience, but “Gemini crunches data” would be my replacement there.

If I have a task that requires parsing through swathes of irregular data that traditional ml would choke on (or require an intermediate training step ala bigquery), I have gotten much better results from Gemini than the other two.

by devmor

5/3/2026 at 12:26:10 AM

> Anyway, one thing for sure is that Gemini is pretty much unusable

Ha! I find that Gemini is quite useful - if only because I am forced to use it (on my personal projects) because it's the only one that has unlimited interaction for "free"

It has its limitations, yes, but so does Claude (which I am leaning on too heavily at work at the moment)

by awesome_dude

5/3/2026 at 12:25:30 AM

Interesting to see the positive sentiment around kimi2.6 qwen3.6 and deepseek relative to the negative. I hope the trend of people appreciating open models continue. They aren't namesakes yet, but it's a higher percentage then I thought it would be. Especially on HN where we are all talking about businesses.

I am upset because now anthropic, openai, meta, etc will continue their smear campaigns here. But I am also happy because it will make HN less useful when they do.

Everything is a give and take I guess. Excited to see where the equilibrium sits

by 2ndorderthought

5/3/2026 at 1:17:31 AM

Is it just “smear campaigns”? Don’t get me wrong - I don’t want big tech or big AI monopolies and appreciate the open weight models. But it’s also true that Chinese companies are basically stealing through distillation and also that they censor to align to CCP rules. They’re problematic in a different way.

What I want is more fully open models where everything is shared. Data, training algorithms, weights. That way we can figure out if we should trust it.

by SilverElfin

5/3/2026 at 1:30:08 AM

They are all stealing from each other just like how they all stole from us. Grok supposedly admitted to distilling from open ai for instance.

I think it's also unfair to say their success is solely due to stealing data. They are contributing a lot of advances to the literature about what they are doing. The proof is in the results we have 27b models you can vibe code with. Not 1t+

It's murky sure. But there are smear campaigns about how people can't trust China too. There's some truth to that too but we can't trust the US either so local models are an interesting way for China to offer us some level of sovereignty.

by 2ndorderthought

5/3/2026 at 12:33:50 PM

> I think it's also unfair to say their success is solely due to stealing data.

Their models would be completely useless if they didn't train on stolen data, so no, it's not unfair at all.

by miyoji

5/3/2026 at 1:31:33 PM

Name one company selling AI models who didn't steal their data or aquire it through dubious or unethical means?

If anything it's a Robinhood story.

Regardless I doubt they would be useless at all. Alibaba has access to tons of data and they make qwen. Qwen models are insane for their size.

by 2ndorderthought

5/3/2026 at 1:36:45 PM

> Name one company selling AI models who didn't steal their data or aquire it through dubious or unethical means?

That's what I said. I was absolutely referring to OpenAI and Anthropic as well as the Chinese models.

by miyoji

5/3/2026 at 1:52:07 PM

I misunderstood you. I am so used to people just slamming local models over data issues and I knee jerk replied to you. Sorry.

by 2ndorderthought

5/3/2026 at 2:15:12 PM

No worries, I could have been more clear in my original comment!

by miyoji

5/3/2026 at 1:52:35 PM

My push for trying local was the wildly unpredictable but systematic performance of large models like Opus and ChatGPT. It feels like at different times of day or week they are getting degraded beyond belief. I don’t know if it is deliberate, a function of demand, or a function of the models themselves. We are all learning the shape of this space by trying. I need to be able to rely on consistent performance - and maybe that means putting some harness of benchmarks between models and maybe it means between different inference providers, and maybe local.

by dzink

5/3/2026 at 1:09:19 PM

Judging by how things are moving ( pricing models, limits, harness patchy updates ), it feels like the real salvation will be a combination of more mature OS models and some open source harness setup like OpenCode or similar. I'm feeling like OS models are nearly there, and with the proper setup and harness might already be there. What are the general thoughts on this ?

by fredcallagan

5/3/2026 at 1:43:53 PM

I only started playing around with local inference a couple weeks ago. Prior to that I was just using Gemini via web since it came with my Workspace subscription, but I did not want to be reliant on the cloud.

Others will have a better idea since they've been messing around with local inference longer than I, but I am quite impressed with the models I have been loading on my laptop with only iGPU. As of this week I no longer feel like I am playing second fiddle with slow inference and small models. Gemma 4 (and maybe Qwen3.5, haven't tried it yet) seem to have changed the game this month!

Even with trying some absolutely shiiiiite models (I only had 16GB unified RAM at the start), I was suitably impressed that I splashed the $300 to double my RAM. I am happy that this one time cost was enough to break through to smarter models and faster inference. No ongoing cloud costs!

by julianlam

5/3/2026 at 1:50:32 PM

It's awesome. Even on a trash computer you can run a small model that works just about as good as anything else for basic questions for free and no privacy issues. It's gotta be the future.

by 2ndorderthought

5/3/2026 at 7:51:54 PM

I really think the future are agent harness kits like `itayinbarr/little-coder`. Small, minimal, customizeable pi-coding-agent series of extensions, that has some specialized deterministic logic to "heal" common small LLM errors like getting stuck into thinking loops and syntax errors on tool calling.

This one has "generic healing" for issues present in the current generation of local small LLMs, but if things we see from Frontier LLMs generalize, "optimized healing" for quirks present on your pick of local LLM would be more useful.

by ElectricalUnion

5/4/2026 at 9:45:51 AM

I use both Codex & Claude Code. With Codex, their long running agents are still not as good as Claude Code. But I don't complain because they burn very less tokens as compared to Claude Code for the same task. But having said that, if I am provided limitless tokens for each of them, my go to harness will be Claude Code

by brihati

5/3/2026 at 3:26:15 AM

It's extra interesting because I think the model people should be talking about is actually not DeepSeek V4 Pro, but the Flash version. When accounting for cache hits, the input price (per OpenRouter) is effectively only 6 cents per million tokens (3 vs 14 cents hit/miss), and 28 cents on output. That's really good efficiency, and it's not a sale price like they are doing with V4 Pro, it's the normal price.

It's actually pretty difficult to find a good comparison model because there isn't one. Again, a 14/28 cent in/out model, ignoring cache, it scores just below GPT 5.4 Mini-xhigh (75/450) and Gemini 3 Flash (50/300) in intelligence. It's similar to Gemma 4 31B in some metrics (13/38) including cost, so it's not completely unheard of, but it's pretty notable that virtually everything else in the same region in most benchmarks are going to cost at least 5 times more (much, much more in very output-heavy contexts)

by cheesecakegood

5/4/2026 at 2:34:33 AM

Is there a good cloud provider for these open models that I can easily swap my Claude library in python to?

by polishdude20

5/3/2026 at 3:50:26 AM

It's well priced but does that have much relevance for "state of the art coding models", specifically?

I wouldn't use Gemini 3 Flash or GPT 5.4 mini for anything except the most trivial work, although both are useful for basic exploratory work.

So I'm using a heavy model for the bulk of the work and the cost of that so far outweighs the light model that the light model cost is effectively irrelevant.

by esperent

5/3/2026 at 4:49:30 AM

It's so interesting to see the wild pendulum swings of LLM sentiment here.

If one likes a model then it's capable of one-shotting entire apps.

Otherwise it's "only suitable for the most trivial tasks".

Never in between.

by julianlam

5/3/2026 at 5:13:48 AM

You're confusing "different people with different opinions" with "wild pendulum swings".

Personally my opinion in this regard is highly consistent over time.

by esperent

5/3/2026 at 1:39:12 PM

It's so true. I bet 80% of questions normal people even ask chatgpt/copilot could be answered with an 8b model trained on recent data.

I don't think people realize how small the gap between free to cheap models have from frontier models. It's going to be commoditized a lot faster than marketing will catch up. Once cash gets tight or prices rise it's more or less done for.

Especially considering some of the small free/cheap models can one shot code now.

by 2ndorderthought

5/3/2026 at 9:05:22 PM

I have no trivial tasks.

Just last week, I was trying to map the weird and wonderful column names emitted by a NetScaler’s detailed REST log into OpenTelemetry semantics.

The NetScaler is basically abandoned by its dying vendor. Hence, its new features like sending logs directly to Splunk compatible receivers are basically undocumented. I’m sure there’s like three of us masochists out there stumbling our way through the brambles.

The Open Telemetry end is a mess of “deprecated” and “beta”, copying the shifting sands of other cloud native projects like Kubernetes.

Even with carefully curated Markdown documentation references and sample logs, every modern “frontier” AI makes basic mistakes and hallucinates like crazy no matter how I stuff their context.

This isn’t an Erdős problem! This is just getting logs from point A to point B.

by jiggawatts

5/4/2026 at 5:39:43 AM

(I would drop this somewhere more relevant but apparently replying to a thing from 17 days ago would be necroing here + PMs are hard or something)

>[1] With what prompt!? I like the terse output! Do share...

Not sure if it's complete or completely right, but the matter came up in a recent session, and when asked what gave, Jim 'n' I came up with some supposedly relevant factors, at least one of which is news to me (supposedly, steering LLMs using negative instructions isn't counterproductive anymore (not that I'd been resisting the temptation anyway)):

https://gemini.google.com/share/7af54a6861d7#:~:text=What%20...

With the caveat (or bonus) that it can go (?:too)? far when told to "be blunt" and not to "pull punches":

https://i.vgy.me/WHRZD7.png (from 2024-09) (in this case, in user-config persistent instructions in Kagi's multi-LLM thing)

by vonunov

5/3/2026 at 9:01:35 AM

> Quota exceeded for quota metric 'Read requests' and limit 'Read requests per minute per user' of service 'sheets.googleapis.com' for consumer 'project_number:849324395320'.

maybe cache this thing my guy you're just doing a bunch of reads

---

constructive suggestions

- you have a pretty cheap process here, and HN exposes historical posts by date. perhaps worth running this back the last 2 years to reconstruct a history of sentiment?

- introduce alternative sorts around the net positive/negative sentiments and absolute positive sentiments, similar to State of JS (https://stateofjs.com) - you'll see the gpt outperformance more

- matching of Opus 4.7 and Opus Latest seems sus?

by swyx

5/3/2026 at 10:34:35 AM

Didn't expect it to get hammered like that, just added caching for the sheets request. Thanks, my guy ;)

Backfilling it further is definitely in the cards, I just want to stabilize the methodology first.

If a comment just mentions Opus without being more specific and in the absence of relevant context clues, it gets mapped to Opus Latest. So it's saying more about the model family than a specific version. Tbh I'll probably remove all "-latest" data points going forward, as I mentioned in another comment.

by yunusabd

5/3/2026 at 11:45:04 AM

> If a comment just mentions Opus without being more specific and in the absence of relevant context clues, it gets mapped to Opus Latest

Consider keeping this data point but instead calling it something like "Opus Unspecified". Let the user decide how to interpret it.

by m0rde

5/2/2026 at 9:51:51 PM

Please fix your graph so the names of the models are readable

by Jabbles

5/2/2026 at 10:10:23 PM

Also, the stacked graph only allows you to quickly see total mentions, really hard to compare negative or positive sentiment across models at a glance.

by marcuskaz

5/2/2026 at 10:54:21 PM

Yep, a toggle to scale all columns to the same height could solve this. I'll look into it when I do the custom graph.

Edit: Done

by yunusabd

5/3/2026 at 12:02:34 AM

Much better, nice update!

by marcuskaz

5/2/2026 at 10:02:59 PM

Sorry about that, the embedded graph from Sheets doesn't let me do that. I think I'll have to fetch the data and render the graph myself.

In the meantime, you can hover or tap the columns to see the full model names.

by yunusabd

5/3/2026 at 12:04:09 AM

Thanks for the comment, should be fixed now.

by yunusabd

5/2/2026 at 10:57:03 PM

Came here to offer this feedback. If I can't see the name of the model, nothing else in the chart really matters to me. I even tried going to the Google Sheet.

It's way too important a piece of information not to have it visible.

by smeej

5/3/2026 at 12:00:38 AM

Thanks, I replaced it with a custom graph, should be easier to read now.

by yunusabd

5/3/2026 at 5:30:19 AM

This is awesome data! I've been wanting to measure how closely hype aligns to our results at https://gertlabs.com/rankings

Subjectively, it seemed like DeepSeek V4 Pro had the highest hype/performance ratio (meaning high hype for lower performance). Whereas MiMo V2.5 Pro didn't get much attention despite being the top dog in the open weights world, not even an honorable mention in your chart :( ...

by gertlabs

5/3/2026 at 6:34:19 AM

There is one mention of Mimo V2.5 Pro in the data by... you! In the UserRatings tab in the sheet, if you want to have a look.

Searching for it on HN shows very few results, that's why it's not showing up in the analysis yet. But it might in the future, once it gains traction.

I'll keep an eye on it, thanks for bringing it up!

https://news.ycombinator.com/item?id=47911464

by yunusabd

5/3/2026 at 12:53:26 AM

Surely "Claude Opus 4.7" and "Claude Opus Latest" should be the same, right?

by chillfox

5/3/2026 at 1:25:12 AM

Yeah, so often people just mention "Opus" or "GPT" without a version, and those get mapped to the "-latest" suffix.

I thought I'd keep these as a rating for model families rather than specific models. But tbh it's probably better to remove them, too confusing.

by yunusabd

5/3/2026 at 12:30:01 AM

Thanks for doing the hard work. I've bookmarked this, hoping it'll come handy when new models are released. If you're taking feature requests, I've a few. - Show combined measurements of model makes. Like All claude models vs open ai, Deepseek so on. - Another toggle to remove the neutral section?

by idivett

5/2/2026 at 11:21:24 PM

It'd be interesting to also graph this over time to see how sentiment changes from when a model is released to today.

by brooksc

5/3/2026 at 1:30:55 AM

Yes! Going forward I'm definitely doing that, once there is enough data. Might even backfill the data more into the past. I just want to stabilize the methodology before burning more tokens.

And it's probably a good idea to create a list of model release dates, so older comments can't accidentally map to models that weren't released yet.

by yunusabd

5/3/2026 at 9:01:46 AM

Sentiment probably shifts a lot between release day and a few weeks in, once people hit real edge cases. Would be interesting to see that curve per model.

by itsnewme

5/3/2026 at 12:27:13 AM

Before harnesses, I'd fix the methodology/claims. A saner methodology would be to see comments that compare two models, say 'gpt5.5>opus4.7' and infer context ('ctx:frontend', for example). For your current methodology, 'opus 4.6 was very smart, opus4.7 is a disappointing upgrade to 4.6' would make normal aspect-based sentiment analysis consider 4.6 is smarter than 4.6. But considering you have <300 mentions total, probably you'd be better off scrapping some other websites as well. I'd also take out completely the SotA claim and downgrade the mentions to measuring something like visibility rather than performance.

by gobdovan

5/3/2026 at 1:00:43 AM

That's fair, my immediate concern would be that there would be very few comments comparing any two models, so the data would be very anecdotal.

The context would be really nice to have, but reading the comments myself, it often just isn't very clear what exactly users are building or which programming language they are using.

I think analyzing more comments is promising. If you get enough data, you can generalize across use cases and get more meaningful ratings. The obvious lever is including more posts, although it might hit diminishing returns. I'll play around with it.

For the context, I want to try giving Gemini a "scratch pad", where it can note down strengths and weaknesses per model that it finds in the comments. Something like "some users say that model x is good for writing tests". Then on each run, I let it update the scratch pad and publish the results as more of a qualitative analysis.

For the wording, I'd like to keep a certain amount of click bait, sorry ;)

by yunusabd

5/3/2026 at 9:01:26 AM

[dead]

by itsnewme

5/2/2026 at 9:52:44 PM

"Prompts an LLM" -> which LLM?

I saw you're using Gemini for the sentiment rating (which I guess you picked because it's not often mentioned and thus "neutral"? lol)

But would be interesting to get more details overall

by yakkomajuri

5/2/2026 at 10:49:21 PM

It's actually ChatGPT at the moment for the first filtering step, for no other reason than having a code snippet ready that I could point Cursor at (I know, so 2025). The Gemini call is using batch processing, so it's handled differently.

by yunusabd

5/3/2026 at 3:09:39 AM

I suspect companies are deploying bots to shift sentiment around their products. I find metrics like this to be largely useless vs. actually just trying stuff out.

by jesse_dot_id

5/3/2026 at 5:13:02 AM

What a win it is for open source that qwen and kimi show up on this at all.

by skeptrune

5/3/2026 at 11:58:04 PM

How about State of the Art of Coding Agents?

by bigbadfeline

5/3/2026 at 9:10:21 AM

I find using both very helpful and in most cases I have used Claude to build 70-80% of what I need and finish it off with Codex.

by misbau

5/3/2026 at 2:32:28 PM

Amazing how Kimi has no negative feedback. Moonshot is doing really great work.

by SpyCoder77

5/2/2026 at 11:28:27 PM

So, it's a webpage with 3 paragraphs and a simple chart. It has: 1) terrible color scheme – fine, I switch to reader mode 2) shitloads of JS - fine, NoScript works, page breaks 3) Fancy "design" with simple graph but unreadable X axis labels - fine, I can use screen zoom for that ... to see 3x "Claude O..." LOL are we playing guess-me-over game? 4) ... "LxxxLxxx - Learn languages with YouTube!"

by pbgcp2026

5/3/2026 at 12:59:58 AM

How noisy is the sentiment classification? Feels like that could skew results a lot

by Hari2028

5/3/2026 at 1:27:53 AM

From the comments that I've checked manually it's pretty good. You can go to the "User Ratings" tab in the Google Sheet and check some comments to get an idea.

by yunusabd

5/2/2026 at 10:12:09 PM

Just FYI this article seems to define "start of the art" as "popular", as measured by "total mentions and user sentiment", without any bearing on the technical abilities or actual usage of the model.

by ranger_danger

5/2/2026 at 11:04:25 PM

Calling it sota might be a bit provocative, but what actually is the "state of the art"? We have benchmarks, but those are getting increasingly gamed and don't necessarily reflect the actual performance of a model, see Opus 4.7. So I think it's useful to have real world data from actual users as an additional data point.

by yunusabd

5/3/2026 at 12:32:05 PM

Maybe you shouldn't be relying on something if you can't even tell how good it is?

by miyoji

5/2/2026 at 10:24:34 PM

That's pretty much exactly what the title says.

The technical abilities and usage are derived from the commenters usage reflections.

by mellosouls

5/3/2026 at 9:07:09 AM

and assuming all mentions are coding model mentions just because its on hn

by swyx

5/3/2026 at 7:36:20 AM

So no one's using Gemini on HN?

by jatins

5/3/2026 at 2:26:11 PM

We're all busy doing work instead of incessantly commenting about our models?

by julianlam

5/3/2026 at 4:12:20 AM

Interesting that Gemma 4 didn't crack the top 10.

I've been experimenting with the 26B-A4B model with some surprisingly good results (both in inference speed and code quality — 15 tok/s, flying along!), vs my last few experiments with Devstral 24B. Not sure whether I can fit that 35B Qwen model everybody's so keen on, on my 32GB unified RAM.

However I think I may be in the minority of HN commenters exploring models for local inference.

by julianlam

5/3/2026 at 11:25:24 AM

Can you elaborate on your setup? What harness are you using with Gemma 4 on your 32GB machine?

by asnelt

5/3/2026 at 1:36:45 PM

[dead]

by julianlam

5/3/2026 at 4:25:17 AM

more users = more complaints. negativity just means popularity.

kimi...?

by tokkkie

5/3/2026 at 1:43:20 PM

That's not really true. It also indicates how widespread the issues are. The graph lets you normalize by user reports by the way.

Kimi specifically kimi2.6 from moonshotai is a new open weight cheap model that has the performance of frontier models. The price point is it's main draw

by 2ndorderthought

5/3/2026 at 2:23:43 PM

I am guessing the open source locally run models will become more appreciated in value when the enshitification reaches the industry. Any guesses on how much time we still have until then?

by askos

5/3/2026 at 2:12:37 AM

I am looking for a good alternative to Claude code + opus that is not codex. I tried switching back to opus 4.6. The attitude of 4.7 is what is more problematic. Difficult to enforce checking stuff before answering, and it suppose he knows better than me and reality. Plus all the latest shenanigans they did. Pretty disgusted I am still using them

by Frannky

5/3/2026 at 7:47:48 AM

You can use other models in Claude Code

https://github.com/raine/claude-code-proxy

https://api-docs.deepseek.com/quick_start/agent_integrations...

by rane

5/3/2026 at 2:33:29 AM

I have forgotten to add the tendency of not owing problems and taking care and solve immediately but instead deflecting and saying it shouldn't be done now it's not my responsibility etc Just terrible

by Frannky

5/3/2026 at 7:23:34 AM

100% this! So often it complains failing unit tests are not its fault.

by alxhslm

5/3/2026 at 11:55:21 AM

For what it's worth, here some of my experiences, as recently I have had some major deviations from what I've come to expect:

1.) Opus 4.7 via the API is great. Unlike 4.6, I have found the model to degrade far less beyond 120k, even 600k can be relied upon. Task Inference, Task Evaluation, Task Adherence, tool calling, all do very well on my evals. I did however for the first time in a while end my Claude Max subscription because, after their post-mortem [0] I for the first time saw true, reproducible, incredibly frustrating regressions in model output when using Claude Code.

Yes, this was after their post in the last week of April 26 and yes, I have been fortunate enough to never have been affected by regressions up to this point. The model via API with other harnesses provides consistent, useful and high quality output, but the recent changes have become an avalanche of "this requires more than two changes so we should table this for later" and "it seems the subagent finding was wrong and this is not actionable" with a healthy mix of suggestions that clearly are there to safe tokens, but go against clear instructions. I understand that they are compute constraint but as someone who until recently has never maxed out their weekly and nearly never their 5 hour limits on the Max 5x plan, these changes are not just frustrating (and make reasonable users think the model was nerfed rather than the harness) but also cost more as I now have to prompt four times and thousands of tokens more for a task that previously the same harness with the same model did far more efficiently. I regularly check the numbers and yes, by trying to be more efficient, they made what I am costing them far higher, going beyond what I pay for the subscription. Ironically, and I must emphasise this, I did not have regressions before, which suggests some major luck in A/B testing at least.

2.) GPT-5.5 is amazing, a true jump I have not seen since GPT-5 and far more than even GPT-5.4 is approaching the way Anthropic models have handled task inference, which also has lead to far reduced reasoning needs. I very much like it, with the exception of the reduced context window and degradation in compaction. GPT-5.4 did compaction so consistently well, that the 272k standard window before the price increase was of no concern and going beyond it was reliably possible. With GPT-5.5, the cost per token is doubled and compaction is far less reliable, leading to loss of task adherence and preventing task completion in certain cases. I am aware GPT-5.5 is a new pretrain (though how new given frontend is still abhorrently poor and has been since Horizon Alpha which I maintain was worse than GPT-4.1) and am hopeful they can integrate some of the solutions they were leveraging for GPT-5.4 compaction, but until then, it remains a model great for very challenging and complex blockers, but not a GPT-5.4 drop-in replacement.

3.) Kimi K2.6 is great for the API price, efficient, fast and does very well on all my metrics. I very much like it, far more so than Deepseek V4 Pro, any Qwen, Z.AI or Meta model and I truly am impressed. Composer 2 has shown how you can take the base even further given the right data and if I had to pay exclusively API pricing without any subscriptions, I think I'd have no problem leaning on K2.6 for most needs. It is what I'd love to see from Mistral or Apple and shows that one can't just succeed in a few narrow areas (Z.AI with tool calling, Deepseek with world knowledge, Mistral with being European, etc.) but provide a balanced product across all areas as an open weight company. I just wish they'd expose Agent Swarms via the API, there are a few experiments I'd like to try.

[0] https://www.anthropic.com/engineering/april-23-postmortem

by Topfi

5/3/2026 at 9:21:15 AM

Something that has been interesting to me for my entire life was the geek/jock cultural split in the US that seemed to crescendo in the 80s with the rise of popular nerd films and then the 90s when software started taking over the world. Being a pretty athletic kid who lettered in four sports, won a state championship, but also won math tournaments and spelling bees, it felt artificial to me. Plenty of high-level athletes have always been into video games, anime, and comic books, and are just as smart as people who can't run without tripping themselves and never learned to throw or catch any kind of ball.

Now it seems like it's come circle from the other direction, too. We always had fandom elements in computing nerd culture. Editor wars. Language wars. Framework wars. Now that software tooling has become nearly human-like, mercurial, unpredictable, inconsistent in performance and experience from week to week, software developers have turned into sports scouts and ESPN talking heads, going so far as to make continually updating live power rankings the way commentators try to predict in season which team is looking most like they'll win the championship that year. You're in the position talent evaluators were in roughly the late 90s, relying mostly on eye test and rough proxy measures of raw potential. Simon Willison applies the pelican test the way draft combines put athletes through shuttle drills and test vertical leap to try and predict how well they'll do in real gameplay.

It leaves me wondering when we'll have the Bill James style analytics breakthrough in software talent evaluation or if such a thing is even possible. At least with athletes, practice can make them better and injury and age can make them worse, but you can't just silently swap out an entirely different mind and body under the same name and face. You guys are trying to assess the performance of constantly moving targets that can and do change capabilities and characteristics on a daily basis.

by nonameiguess

5/3/2026 at 9:12:19 AM

Terrible metric that tells absolutely nothing about what's state-of-the-art. You might as well call this list the most astroturfed models on HN.

by input_sh

5/3/2026 at 2:57:45 PM

[flagged]

by kk_mors

5/3/2026 at 5:33:21 AM

[dead]

by soupspaces

5/3/2026 at 10:50:13 AM

[flagged]

by jimmypk

5/3/2026 at 1:58:47 PM

[dead]

by taintlord