3/1/2026 at 12:09:55 AM
If you're new to this: all of the open source models are playing benchmark optimization games. Every new open weight model arrives with promises of matching something SOTA from a few months ago, then disappoints in actual use.

I've been playing with Qwen3-Coder-Next and the Qwen3.5 models since they were each released.
They are impressive, but they are not performing at Sonnet 4.5 level in my experience.
I have observed that they're configured to be very tenacious. If you can carefully constrain the goal with some tests they need to pass and frame it in a way to keep them on track, they will just keep trying things over and over. They'll "solve" a lot of these problems in the way that a broken clock is right twice a day, but there's a lot of fumbling to get there.
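The pattern described here - gate the agent on tests it must pass and let it keep retrying - can be sketched in a few lines. This is a toy illustration, not any particular agent's implementation; `propose` stands in for the LLM call and `check` for running the test suite:

```python
def run_until_green(check, propose, max_attempts=10):
    """Keep asking the model (`propose`) for new attempts until the
    constraint (`check`) passes or we give up. The harness, not the
    prompt, decides when the task is done - which is why tenacious
    models can grind their way to a 'solution'."""
    history = []
    for i in range(1, max_attempts + 1):
        attempt = propose(history)   # stand-in for an LLM call
        if check(attempt):
            return attempt, i
        history.append(attempt)      # feed failures back as context
    raise RuntimeError(f"no passing attempt in {max_attempts} tries")

# Toy stand-ins: the "model" guesses successive integers, the "tests" want 7.
import itertools
counter = itertools.count()
solution, tries = run_until_green(
    check=lambda x: x == 7,
    propose=lambda hist: next(counter),
)
```

The broken-clock dynamic falls out directly: with enough retries and a hard pass/fail signal, fumbling still terminates at green.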
That said, they are impressive for open source models. It's amazing what you can do with self-hosted now. Just don't believe the hype that these are Sonnet 4.5 level models because you're going to be very disappointed once you get into anything complex.
by Aurornis
3/1/2026 at 1:14:30 AM
Respectfully, from my experience and a few billion tokens consumed, some open source models really are strong and useful. Specifically StepFun-3.5-Flash: https://github.com/stepfun-ai/Step-3.5-Flash

I'm working on a pretty complex Rust codebase right now, with hundreds of integration tests and nontrivial concurrency, and stepfun powers through.
I have no relation to stepfun, and I'm saying this purely out of deep respect for the team that managed to pack this performance into a 196B/11B-active envelope.
by kir-gadjello
3/1/2026 at 8:16:22 AM
What coding agent do you use with StepFun-3.5-flash? I just tried it from siliconflow's API with opencode. The tool calling is broken: AI_InvalidResponseDataError: Expected 'function.name' to be a string.
by jasonni
3/1/2026 at 4:53:12 PM
I use pi, but I'm almost done writing a better alternative that doesn't have pi's stability issues. 80K Rust SLOC and a few hundred tests btw.
by kir-gadjello
3/1/2026 at 4:33:47 AM
Are you using stepfun mostly because it's free, or is it better than other models at some things?
by copperx
3/1/2026 at 7:19:00 AM
I think we're at the point where the hard ceiling of a strong model is pretty hard to delineate reliably (at least in coding; in research work it's clearer, ofc) - and in a good sense, meaning that with suitable task decomposition, a test harness, or a good abstraction you can make the model do what you thought it could not. StepFun is a strong model and I really enjoyed studying it and comparing it to others by coding pretty complex projects semi-autonomously (will do a write-up on this soon tm).

Even purely pragmatically, StepFun covers 95% of my research+SWE coding needs, and for the remaining 5% I can access the large frontier models. I was surprised StepFun is even decent at planning and research, so it is possible to get by with it and nothing else (1), but ofc for minmaxing the best frontier model is still the best planner (although the latest deepseek is surprisingly good too).
Finally we are at a point where there is a clear separation of labor between frontier & strong+fast models, but tbh shoehorning StepFun into this "strong+fast" category feels limiting, I think it has greater potential.
by kir-gadjello
3/1/2026 at 7:33:36 AM
I pay for Copilot to access Anthropic, Google and OpenAI models.

Claude Code always gives me rate limits. Claude through Copilot is a bit slow, and Copilot has constant network request issues or something, but at least I don't get rate limited as often.
At least local models always work, are faster (50+ tps with Qwen3.5 35B A4B on a 4090), and most importantly never hit a rate limit.
by CapsAdmin
3/1/2026 at 9:45:32 AM
> Claude code always give me rate limits

> 50+ tps with qwen3.5 35b a4b on a 4090
But Qwen3.5 35B is worse than even Claude Haiku 4.5. You could switch Claude Code to use Haiku and never hit rate limits. It also gets a similar 50 tps.
by acchow
3/1/2026 at 1:17:05 PM
I haven't tried 4.5 Haiku much, but I was not impressed with previous Haiku versions.

My go-to proprietary model in Copilot for general tasks is Gemini 3 Flash, which is priced the same as Haiku.
The qwen model is in my experience close to gemini 3 flash, but gemini flash is still better.
Maybe it's somewhat related to what we're using them for. In my case I'm mostly using LLMs to code Lua: one project is a typed LuaJIT language, the other a 3D framework written entirely in LuaJIT.
I forget exactly how many tps I get with Qwen, but GLM 4.7 Flash, which is really good (for a local model), gets me 120 tps and a 120k context.
Don't get me wrong, proprietary models are superior, but local models are getting really good AND useful for a lot of real work.
by CapsAdmin
3/1/2026 at 6:18:02 AM
I also started playing with 3.5 Flash and was impressed.

It’s 2× faster than its competitors. For tasks where “one-shotting” is unrealistic, a fast iteration loop makes a measurable difference in productivity.
by nodakai
3/1/2026 at 2:51:59 PM
TDD is really the delineation between being successful or not when using [local] LLMs.
by mycall
3/1/2026 at 4:27:40 PM
> some opensource models really are strong and useful

To be clear, I never said they weren’t strong or useful. I use them for some small tasks too.
I said they’re not equivalent to SOTA models from 6 months ago, which is what is always claimed.
Then it turns into a motte-and-bailey game where that argument is replaced with the simpler argument that they’re useful for open weights models. I’m not disagreeing with that part. I’m disagreeing with the first assertion: that they’re equivalent to Sonnet 4.5.
by Aurornis
3/1/2026 at 4:52:05 PM
They are not equivalent 1:1, esp. in knowledge coverage (given the OOM param size difference) and in taste (Sonnet wins, but for taste one can also use Kimi K2.5). But in my hardcore use (high-performance realtime simulations of various kinds) I would strongly prefer StepFun-3.5-Flash to Sonnet 4, and to 4.5 often enough that there's no decisive advantage in using Sonnet 4.5 exclusively. For truly hard tasks or specifications I would turn to 5.2 or 5.3-codex, of course - but one KPI for the quality of my work as a lead engineer is to ensure that truly hard tasks are known, bounded, and planned for in advance.

Maybe my detailed, requirement-based/spec-based prompting style makes the difference between Anthropic's and OSS models smaller, and people just like how good Anthropic's models are at reading the programmer's intent from short, concise prompts.
Frankly, I think the 1:1 equivalent is an impossible standard given the set of priorities and decisions frontier labs make when setting up their pre-, mid- and post-training pipelines, and benchmark-wise it is achievable for a smaller OSS model to align with Sonnet 4.5 even on hard benchmarks.
Given the relatively underwhelming Sonnet 4.5 benchmarks [1], I think StepFun might have an edge over it esp. in Math/STEM [2] - even an old deepseek-3.2 (not speciale!) had a similar aggregate score. With 4.6 Anthropic ofc vastly improved their benchmark game, and it now truly looks like a frontier model.
1. https://artificialanalysis.ai/models/claude-4-5-sonnet-think... 2. https://matharena.ai/models/stepfun_3_5_flash
by kir-gadjello
3/1/2026 at 1:15:22 AM
What are you running that model on?
by aappleby
3/1/2026 at 1:18:20 AM
I just use OpenRouter; it's free for now. But I would pay $30-100 to use it 24/7.
by kir-gadjello
3/1/2026 at 1:36:09 AM
Ah, I thought you meant you were running it locally.
by aappleby
3/1/2026 at 6:08:28 PM
Have you tried Minimax M2.5? How did it compare?
by Aerroon
3/1/2026 at 2:38:48 AM
A 3 bit quant will run on a 128GB MacBook Pro; it works pretty well.
by FuckButtons
3/1/2026 at 2:43:20 AM
A 3 bit quant is quite a lot weaker than the OpenRouter version the OP is using.
by nl
3/1/2026 at 6:08:00 AM
Yes and no. "Last-gen" (like, from 6 months ago) frontier models do still tend to outperform the best open source models. But some models, especially GLM-5, really have captured whatever circuitry drives pattern matching in the models they were trained off of.

I like this benchmark that competes models against one another in competitive environments, which seems like it can't really be gamed: https://gertlabs.com
by lend000
3/1/2026 at 4:29:37 PM
> Yes and no. "Last-gen" (like, from 6 months ago) frontier models do still tend to outperform the best open source models

That’s exactly what I said, though. The headline we’re commenting under claims they’re Sonnet 4.5 level, but they’re not.
I don’t disagree that they’re powerful for open models. I’m pointing out that anyone reading these headlines who expects a cheap or local Sonnet 4.5 is going to discover that it’s not true.
by Aurornis
3/1/2026 at 1:03:54 AM
All models are doing that, not only the open source ones.

I bet the cloud ones are doing it a lot more, because they can also control the runtime side, which the open source ones can't.
by wolvoleo
3/1/2026 at 5:15:05 AM
I wouldn't mind them benchmaxing my queries.
by red75prime
3/1/2026 at 2:42:51 AM
I'm using Qwen 3.5 27b on my 4090 and let me tell you. This is the first time I am seriously blown away by coding performance on a local model. They are almost always unusable. Not this time though...
by dimgl
3/1/2026 at 1:43:08 PM
The 27B dense model is probably the best of the 3.5 lot - not in absolute terms, but for perf:size. It's also pretty good at prose, which is a rarity for a Qwen.
by smahs
3/1/2026 at 6:56:11 AM
You don't need a coding version of the model from Qwen? The 3.5 works?
by bibstha
3/1/2026 at 2:01:11 AM
Are there any up-to-date offline/private agentic coding benchmark leaderboards?

If the tests haven't been published anywhere and are sufficiently different from standard problems, I would think the benchmarks would be robust to intentional over-optimization.
Edit: These look decent and generally match my expectations:
by rudhdb773b
3/1/2026 at 1:28:38 AM
"When a measure becomes a target, it ceases to be a good measure."

Goodhart's law shows up with people, in system design, in processor design, in education...
Models are going to be over-fit to the tests unless scruples or practical application realities intervene. It's a tale as old as machine learning.
by chaboud
3/1/2026 at 1:28:32 PM
This is because of the forbidden argument in statistics. Any statistic, even something so basic as an average, ONLY works if you can guarantee the independence of the individual facts it measures.

But there's a problem with that: of course the existence of the statistical measure itself is very much a link between all those individual facts. In other words: if there is ANY causal link between the statistical measure and the events measured ... it has now become bullshit (because the law of large numbers doesn't apply anymore).
So let's put it into practice: say there's a running contest, and you display the minimum, maximum and average time of all runners that have had their turns. We all know what happens: the average trends up. And yet, that's exactly what statistics guarantees won't happen - the average should go up or down with roughly 50% odds when a new runner is added. Showing the average causes behavior changes in the next runner.
This means, of course, that basing a decision on something as trivial as what the average running time was last year can only be mathematically defensible ONCE. The second time the average is wrong, and you're basing your decision on wrong information.
But of course, not only will most people actually deny this is the case, this is also how 99.9% of human policy making works. And it's mathematically wrong! Simple, fast ... and wrong.
by spwa4
3/1/2026 at 12:13:02 PM
Hmm, I second this. I haven't compared Qwen3.5 122B yet, but I played around with OpenCode + Qwen3-Coder-Next yesterday and did manual comparisons with Claude Code, which is still far ahead in general felt "intelligence quality".
by warpspin
3/1/2026 at 1:41:52 AM
> they always disappoint in actual use.

I’ve switched to using Kimi 2.5 for all of my personal usage and am far from disappointed.
Aside from being much cheaper than the big names (yes, I’m not running it locally, but I like that I could), it just works and isn’t a sycophant. Nice to get coding problems solved without any “That’s a fantastic idea!”/“great point” comments.
At least with Kimi my understanding is that beating benchmarks was a secondary goal to good developer experience.
by crystal_revenge
3/1/2026 at 12:37:37 PM
Just going to echo this. Been using K2.5 in opencode as a switch away from Opus because it was too expensive for the sorts of things I was playing with, and it's been great. There's definitely a bit of learning to get the hang of what sort of prompts to give it and to make sure there's enough documentation in the project for it, but it's remarkably capable once you're in the swing of it.
by regularfry
3/1/2026 at 2:03:14 PM
I've been trying to get these things to run locally and use tools. Am I right in understanding that it's impossible for these things to use tools from within llama.cpp? Do I need another "thing" to run the models? What exactly is the mechanism by which the models become aware that they're somewhere where they have tools available? So many questions...
by ekjhgkejhgk
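For what it's worth, the mechanism is client-side: the model doesn't inherently "know" about tools. The client lists tool schemas in each request, the server's chat template injects them into the prompt, and the server parses the model's structured reply back into tool calls. llama.cpp's `llama-server` speaks this OpenAI-style protocol (started with `--jinja` so the chat template is honored). A sketch of the request shape - the tool name, port and file are illustrative:

```python
import json

# One tool, described in the OpenAI function-calling schema. The server
# serializes this into the prompt; the model never sees anything else.
tools = [{
    "type": "function",
    "function": {
        "name": "read_file",  # hypothetical tool
        "description": "Read a file from the workspace",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

payload = {
    "messages": [{"role": "user", "content": "What's in README.md?"}],
    "tools": tools,
}
body = json.dumps(payload).encode()
# POST `body` to e.g. http://localhost:8080/v1/chat/completions.
# If the model decides to use a tool, the response message carries
# `tool_calls` (function.name plus JSON-encoded function.arguments)
# instead of plain text; your agent loop executes the call and sends
# the result back as a "tool"-role message.
```

So llama.cpp alone serves the model; the "other thing" you need is an agent (opencode, etc.) that runs this request/execute/respond loop.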
3/1/2026 at 12:14:18 AM
Are you saying that the benchmarks are flawed?

And could quantization maybe partially explain the worse-than-expected results?
by amelius
3/1/2026 at 12:29:21 AM
No, what he is saying is that benchmarks are static and there is tremendous reputational and financial pressure to make the benchmark number go up. So you add specific problems to the training data... The result is that the model is smarter, but the benchmarks overstate the progress. Sure, there are problem sets designed to be secret, but keeping secrets is hard given the fraction of planetary resources we are dedicating to making the AI numbers go up.

I have two of my own comments to add to that. First, there is a problem-alignment issue at play: the benchmarks are mostly self-contained problems with well-defined solutions and specific prompt language, while human tasks are open ended with messy prompts and much steerage. Second, it would be interesting to test older models on brand new benchmarks to see how those compare.
by TrainedMonkey
3/1/2026 at 12:33:45 AM
> No, what he is saying is that benchmarks are static and there is tremendous reputational and financial pressure to make benchmark number go up.

That's a much better way to say it than I did.
These models are known for being open weights, but they're still products that Alibaba Cloud is trying to sell. They have product managers and PR and marketing people under pressure to get people using them.
This Venture Beat article is basically a PR piece for the models and Alibaba Cloud hosting. The pricing table is right in the article.
It's cool that they release the models for us to use, but don't think they're operating entirely altruistically. They're playing a business game just like everyone else.
by Aurornis
3/1/2026 at 1:24:13 AM
There should be a way to turn the questions we ask LLMs into benchmarks.

That way, we can have a benchmark that is always up to date.
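One minimal version of this idea: log every real query alongside an answer you verified, and keep the resulting eval set private so it can't leak into training data. A sketch, with hypothetical names and a deliberately crude grader:

```python
import hashlib

def make_eval_case(prompt: str, good_answer: str) -> dict:
    """Turn a real query plus a human-verified answer into one benchmark
    case. Kept private, such cases stay out of every training set."""
    return {
        "id": hashlib.sha256(prompt.encode()).hexdigest()[:12],
        "prompt": prompt,
        "reference": good_answer,
    }

def score(model_answer: str, case: dict) -> bool:
    # Crude exact-match grading; a real harness would run tests
    # or use a judge model instead.
    return model_answer.strip() == case["reference"].strip()

case = make_eval_case("What does `git rebase -i` do?", "interactive rebase")
```

The hard part, as noted elsewhere in the thread, is grading open-ended answers, not collecting the prompts.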
by amelius
3/1/2026 at 3:14:05 PM
There are a few “updating” benchmarks out there. I periodically take a look at these two:
by lurkshark
3/1/2026 at 12:18:55 AM
The models outperform on the benchmarks relative to general tasks.

The benchmarks are public. They're guaranteed to be in the training sets by now. So the benchmarks are no longer an indicator of general performance, because the specific tasks have been seen before.
> And could quantization maybe explain the worse than expected results?
You can use the models through various providers on OpenRouter cheaply without quantization.
by Aurornis
3/1/2026 at 12:21:32 AM
Flawed? Possibly, but I think it's more that any kind of benchmark then becomes a target, and is inherently going to be a "lossy" signal as to the model's actual ability in practice.

Quantisation doesn't help, but even running full-fat versions of these models through various cloud providers, they still don't match Sonnet in actual agentic coding uses: at least in my experience.
by girvo
3/1/2026 at 1:14:47 AM
It's not just the open source ones.

The only benchmarks worth anything are dynamic ones which can be scaled up.
by noosphr
3/1/2026 at 10:12:39 AM
they're distilling claude and openai, obviously.

that said, sonnet 4.5 is not a good model today, March 1st 2026. (it blew my mind on its release day, September 29th, 2025.)
by baq
3/1/2026 at 4:26:54 AM
> That said, they are impressive for open source models.

There is nothing open "source" about them. They are open weights, that's all.
by ekianjo
3/1/2026 at 1:05:04 AM
Very good point. I'm playing with them too and came to the same conclusion.
by eurekin
3/1/2026 at 12:37:19 AM
Death by KPIs. Management makes it too risky to do anything but benchmaxx. It will be the death of American AI companies too. Eventually, people will notice models aren’t actually getting better and the money will stop flowing. However, this might be a golden age of research as cheap GPUs flood the market and universities have their own clusters.by jackblemming
3/1/2026 at 1:11:50 AM
[dead]
by bourjwahwah