5/8/2026 at 12:28:18 PM
We track performance vs. the all-in cost of completing real engineering tasks, rather than cost per token. [1] Cost per token is a bit misleading because, as others have noted, different models use tokens in different ways. (Aside: this is also why TPS isn't a great metric.)
We found that 5.5 is roughly 1.5-2x more expensive overall. On a Pareto basis, only 5.5 xhigh is worth it; at the lower reasoning levels, 5.4 still edges it out on cost/performance.
We take a spec-driven approach and mostly work in TS (on product development), so if you use a more steer-y approach, or work in a different domain, YMMV.
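To make "all-in cost" concrete, here's a minimal sketch of the accounting in TS. The record shape and price fields are assumptions invented for the example; the point is that every attempt (including retries) counts toward cost, and only completed tasks count toward the denominator.

    // Hypothetical illustration: all-in cost per completed task, not cost per token.
    // Field names and pricing structure are assumptions made up for this example.
    interface Attempt {
      inputTokens: number;
      outputTokens: number; // includes reasoning tokens, which vary a lot by model
      succeeded: boolean;
    }

    interface ModelRun {
      model: string;
      usdPerMInputTokens: number;
      usdPerMOutputTokens: number;
      attempts: Attempt[]; // retries and reruns all count toward the task's cost
    }

    function costPerCompletedTask(runs: ModelRun[]): Map<string, number> {
      const out = new Map<string, number>();
      for (const r of runs) {
        const usd = r.attempts.reduce(
          (sum, a) =>
            sum +
            (a.inputTokens / 1e6) * r.usdPerMInputTokens +
            (a.outputTokens / 1e6) * r.usdPerMOutputTokens,
          0,
        );
        const completed = r.attempts.filter((a) => a.succeeded).length;
        out.set(r.model, completed > 0 ? usd / completed : Infinity);
      }
      return out;
    }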
by languid-photic
5/8/2026 at 2:36:11 PM
Interesting! I've been thinking about how to create a similar type of evaluation system for myself. How do you handle tweaks to agentic tasks? Say that a model gets pretty close to what you want, so you just need a quick follow up prompt to the original response?
by digdugdirk
5/8/2026 at 3:10:58 PM
Yes! It depends on the extent of changes needed. If the changes needed are small, I'll apply the best implementation as a foundation and then just iterate directly.
If the changes needed are drastic, it usually signals that something was wrong or ambiguous in the spec (or the ensemble was too weak, which is rarely the case). In cases like this, I improve the spec and then rerun.
If it's in the middle, I'll usually apply the best and write a follow-on spec.
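As a rough sketch, the triage above reduces to something like this. The line-count thresholds are invented for illustration; in practice the call is qualitative.

    // Sketch of the triage above; the numeric thresholds are assumptions.
    type FollowUp = "iterate-directly" | "follow-on-spec" | "fix-spec-and-rerun";

    function triage(estimatedLinesToChange: number): FollowUp {
      if (estimatedLinesToChange < 20) return "iterate-directly"; // small: iterate on the best patch
      if (estimatedLinesToChange < 200) return "follow-on-spec";  // medium: apply the best, write a follow-on spec
      return "fix-spec-and-rerun";                                // drastic: the spec itself was wrong/ambiguous
    }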
by languid-photic
5/8/2026 at 3:59:29 PM
How does that get integrated into the scoring system? I'm imagining a scenario where a cheaper model may get close, but only needs a small follow up to get the desired result. How would this score in comparison to a larger model that got it right the first time - even if it may have been much more expensive overall?
by digdugdirk
5/8/2026 at 4:43:53 PM
We also use a secondary signal from blinded multi-verifier reviews. Each verifier ranks the candidates, and those verification outcomes serve as an additional quality signal. It's somewhat similar to consensus labeling.
Btw, this also helps manage scale. E.g., you have 15 diffs to review: run a few verifiers to get a short list, then review directly and apply the best.
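For concreteness, here's one plausible way to aggregate the verifier rankings into a shortlist, Borda-count style. Treat the exact aggregation rule as an assumption for illustration.

    // One way to turn blinded verifier rankings into a shortlist
    // (Borda-count style; the exact aggregation rule is an assumption).
    function shortlist(rankings: string[][], keep: number): string[] {
      const score = new Map<string, number>();
      for (const ranking of rankings) {
        ranking.forEach((id, i) => {
          // a better rank earns more points
          score.set(id, (score.get(id) ?? 0) + (ranking.length - i));
        });
      }
      return [...score.entries()]
        .sort((a, b) => b[1] - a[1])
        .slice(0, keep)
        .map(([id]) => id);
    }

    // E.g., 15 diffs in, 3 out for direct human review:
    // const top3 = shortlist(verifierRankings, 3);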
by languid-photic
5/8/2026 at 4:07:26 PM
It feels pretty weird that your ratings have:
gpt-5-4-high > gpt-5-4-xhigh
gpt-5-4-high > gpt-5-5-high
gpt-5-4 > gpt-5-5
gpt-5-2-high > gpt-5-2-xhigh
No other ratings I've seen show that.
by BugsJustFindMe
5/8/2026 at 4:37:14 PM
Yes, the signal we are measuring is quite different from most evals. We are measuring something much closer to: when multiple agents compete on the same spec, which one produces the patch that holds up best in code review?
Most evals are static/synthetic and, for code, generally stop at tests. Test evals are weak proxies for quality since it's difficult to encode qualities like scope creep/churn, codebase fit, maintainability, etc. in tests. [1]
Almost every agent in a given run can pass tests at this point, but there is large separation during review.
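Concretely, the review signal can be boiled down to head-to-head win rates. A minimal sketch, assuming each task yields one review ranking of the competing agents (the real scoring is richer than this):

    // Sketch: per-task review rankings -> per-model head-to-head win rates.
    // Assumes each inner array lists model names, best-reviewed first.
    function winRates(taskRankings: string[][]): Map<string, number> {
      const wins = new Map<string, number>();
      const games = new Map<string, number>();
      for (const ranking of taskRankings) {
        for (let i = 0; i < ranking.length; i++) {
          for (let j = i + 1; j < ranking.length; j++) {
            // ranking[i] placed above ranking[j] in this task's review
            wins.set(ranking[i], (wins.get(ranking[i]) ?? 0) + 1);
            games.set(ranking[i], (games.get(ranking[i]) ?? 0) + 1);
            games.set(ranking[j], (games.get(ranking[j]) ?? 0) + 1);
          }
        }
      }
      return new Map([...games].map(([m, g]) => [m, (wins.get(m) ?? 0) / g]));
    }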
by languid-photic
5/8/2026 at 7:10:47 PM
Ok, but my point is that the claim you make about more reasoning performing worse seems kinda suspicious, and I haven't seen any analysis exploring why that would happen.
by BugsJustFindMe
5/8/2026 at 8:51:15 PM
My point is that more reasoning often leads to worse "scope creep/churn, codebase fit, maintainability".
by languid-photic
5/8/2026 at 9:12:00 PM
I get it, but that is a significant claim. And the claim could be right, but it could also be wrong, and I see no analysis, not even a blog post on your website saying "wow, look at this weird thing we found". To me that makes the claim suspicious because it signals that nobody thought to investigate what's going on. Investigating weird results is how we demonstrate that what we're doing is right.
by BugsJustFindMe
5/8/2026 at 9:33:04 PM
It’s mostly a bandwidth thing. We’ve seen the pattern consistently, but haven’t had time yet to write up the analysis carefully.
We are not the only ones to see the reasoning inversion: https://arxiv.org/abs/2510.11977, https://arxiv.org/abs/2502.08235, https://arxiv.org/abs/2507.14417
by languid-photic
5/8/2026 at 3:27:24 PM
would be interesting to see some other labs:
- deepseek v4 pro
- glm 5.1
- kimi k2.6
- qwen 3.6 max
- xiaomi 2.5 pro
- minimax 2.7
- grok
by lukewarm707
5/8/2026 at 3:49:32 PM
I agree! So far we have been native harnessmaxxing, which simplifies things a lot.
The configuration space around open models is much larger. E.g., which models, capability heterogeneity, which harness, networking, data egress/privacy, etc.
If anyone is getting very good production code out of open models, I'd love to do a user interview to better understand your setup. Email is in my bio.
by languid-photic
5/8/2026 at 4:08:09 PM
With how much vendor harnesses are now actively steering the agent with their own instructions on top of user prompts, I think it’d be super interesting to see a comparison of one of the already-tested models - so Opus 4.7 or GPT-5.5 - across a range of harnesses that aren’t their native ones: OpenCode, Pi, Hermes, Kilo Code. The most popular coding-focused harnesses, basically.
by thepasch
5/8/2026 at 4:38:59 PM
Agreed. Harness is really important, especially since many labs are now post-training agents directly in their native harness. (Which is why my prior is that third-party harnesses would not perform as well. But I haven't actually measured this.)
by languid-photic
5/8/2026 at 7:27:10 PM
OpenCode seems to give me better results than codex-cli; I’d be interested in seeing this too!
by cyberpunk
5/8/2026 at 5:46:09 PM
But in what situations does it seem good to enable xhigh?
by motbus3