6/8/2026 at 9:33:55 PM
:wave: i was on the team! AMA.
some headlines:
- 3000 rubrics on code quality. First benchmark to measure: "would this code get actually merged?"
- 20+ expert open-source maintainers created tasks on their own repos to capture their opinion & taste.
- total 1000+ hours of real life software maintainer work captured in dataset. ON TOP of that, 40+ hours of real human work to turn that real life work into well validated and structured tasks with rubrics (even more work to turn tasks/prompts from devin-infra-specific to pluggable coding agent)
- results in 81% lower false positive rate than SWE-Bench Pro
- High quality bar: many QA stages & each task manually reviewed by Cognition researchers (examples in post)
Opus 4.8 scores 13% on FrontierCode Diamond.
one of my goals was also to datamine interesting stuff even on the easy tasks. for example, if you squint you can see the answer to "WTF Happened in late 2025" with coding models: https://x.com/swyx/status/2064081945567580323
by swyx
6/8/2026 at 10:03:55 PM
Very cool! So glad to see people building and sharing evals that are better than SWE bench.
I'm curious - any particular reason you didn't put error bars on the graphs? Seems like it could be helpful when there are only 50 unique problems in the diamond set.
by tedsanders
6/8/2026 at 10:14:54 PM
*50 unique problems but 20-40 rubrics per problem (something I had to keep reminding people internally who were unimpressed with the N)
simple answer is our reporting was pass@5. feel like you'd need like 50+ runs to have reasonable confidence intervals, which somehow i dont see other people do, so i also didnt insist on it.
hoping to work with <prominent third party evals shop> to get this on their infra and evaluated along with whatever the industry standard is.
by swyx
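For context on the pass@5 reporting mentioned above, here is a minimal sketch of the widely used unbiased pass@k estimator. It is not necessarily how FrontierCode computes its numbers; the function name `pass_at_k` and the sample counts are illustrative assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n independent runs of which c passed.

    Computes 1 - C(n - c, k) / C(n, k): the probability that a random
    subset of k runs (out of the n observed) contains at least one pass.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k failures exist, so every subset of k runs has a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical example: 10 runs on one task, 1 of them passed.
print(pass_at_k(n=10, c=1, k=5))
```

With exactly 5 runs per task, pass@5 collapses to "did any run pass", which is why more runs are needed before a confidence interval on it means much.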
6/8/2026 at 11:37:56 PM
Makes sense, thanks. I suppose error bars are tricky if trying to handle problem-to-problem variance, rubric-to-rubric variance, and run-to-run variance all at once.
by tedsanders
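One way to get a rough error bar on a 50-problem set is a percentile bootstrap over problems. The sketch below is an assumption-laden illustration, not the benchmark's method: it only captures problem-to-problem variance (rubric-to-rubric and run-to-run variance are ignored), and the per-problem scores are made up.

```python
import random


def bootstrap_ci(per_problem_scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean score.

    Resamples problems with replacement, so the interval reflects only
    variance across problems, not across rubrics or repeated runs.
    """
    rng = random.Random(seed)
    n = len(per_problem_scores)
    means = []
    for _ in range(n_resamples):
        sample = rng.choices(per_problem_scores, k=n)
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical data: 50 problems, 7 solved (1.0) and 43 unsolved (0.0).
scores = [1.0] * 7 + [0.0] * 43
low, high = bootstrap_ci(scores)
print(f"mean={sum(scores) / len(scores):.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

Handling all three variance sources at once would need a hierarchical resampling scheme (problems, then runs, then rubrics), which this sketch does not attempt.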
6/9/2026 at 1:00:48 PM
Any chance of also benchmarking a couple of more affordable Chinese models? (specifically Deepseek and Xiaomi's MiMo)
by VulgarExigency
6/9/2026 at 1:04:25 PM
i think <third party evals platform> will help us do that best on their standardized model matrix. for frontiercode’s launch we were focused on.. the frontier models
by swyx
6/9/2026 at 2:15:43 PM
What qualifies as a frontier model? From my personal "taste tests", I wouldn't have placed Sonnet or Kimi above Deepseek Pro or MiMo, or Gemini 3.1 Flash Lite above Deepseek Flash, but they're listed in the benchmark.
by VulgarExigency
6/9/2026 at 12:32:35 AM
This looks really great, more thoughtful than any benchmark that I've seen until now!
I'm curious if you're only interested in scoring frontier models or if you would accept submissions from custom harnesses? I am working on multi-model harnesses and would love to test them against your benchmark. Do you plan on releasing the tasks publicly?
by glerk
6/9/2026 at 5:47:50 AM
> Do you plan on releasing the tasks publicly?
yep
by swyx
6/9/2026 at 7:14:18 AM
yay! looking forward, and thanks!
by glerk
6/9/2026 at 11:41:17 AM
Does reporting each model at its best performing reasoning effort introduce a best-of-N/multiple-comparisons bias, especially if models have different numbers of effort levels?
by llama_drama
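To make the mechanism in this question concrete, here is a small simulation sketch. It assumes a worst case where every effort level has the same true pass rate, so any gain from picking the best level is pure selection noise; the rate, problem count, and function name are illustrative assumptions and say nothing about FrontierCode's actual data.

```python
import random


def expected_best_of_levels(true_rate, n_levels, n_problems=50,
                            n_trials=2000, seed=0):
    """Average reported score when taking the max over n_levels configs.

    Each config is scored on n_problems independent pass/fail problems
    with the same underlying true_rate, so differences between configs
    are noise only.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        best = 0.0
        for _ in range(n_levels):
            passed = sum(rng.random() < true_rate for _ in range(n_problems))
            best = max(best, passed / n_problems)
        total += best
    return total / n_trials


# Same hypothetical true rate, different numbers of effort levels.
for levels in (1, 2, 5):
    print(levels, round(expected_best_of_levels(0.13, levels), 3))
```

Under these assumptions the reported maximum drifts upward as the number of levels grows. In practice effort levels differ in true performance, so the size of the effect depends on how close the top levels are to each other.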
6/9/2026 at 12:57:31 PM
to you it may do idk. note that if you scroll past fig 1 you get into a nice data explorer that breaks out pass@5 by reasoning level with token and $ and step cost visualized. i think some other commenters on this hn thread got very worked up about stuff we actually agree on.
internally ive charted everything and am satisfied that theres no meaningful rank bias introduced. weve sliced it every which way. in fact we have not even published the best looking charts for this story to be told, because we have further publishing plans on frontiercode
tldr “trust me bro” this isnt the issue and if anything we couldve done more to increase N as tedsanders below points out
by swyx
6/8/2026 at 10:38:56 PM
What did you do around cross-harness testing? I don't see anything in the blog post about what harnesses were used in evaluation. SOTA benchmarks have consistently shown that frontier model performance is quite sensitive to what tools are exposed (e.g. str_replace vs. apply_patch) as the labs are RLing on their own harnesses. Did you do testing of the models in a standard setup or in their native harnesses?
by typs
6/8/2026 at 10:58:55 PM
yes well aware :) numbers shown are on "house" harnesses eg codex with gpt and claude code with opus.
fwiw we have examples of each model doing better on NON-house harnesses too - speaking just for myself i think the "the labs are RLing on their own harnesses" narrative is kinda overstated if you think through wanting to have any meaningful api business (often eg the labs will give guidance on what is preferred and the agent labs can easily match tool contract to that, which is to say, the "home turf advantage" isnt as large as you think it is if you try a little bit)
by swyx
6/8/2026 at 11:59:00 PM
What "non-house" harnesses have you found to work best?
by chris_st
6/9/2026 at 1:34:49 AM
What is the "house" harness for minimax? They haven't released any
by Bolwin
6/8/2026 at 9:37:38 PM
How do you measure quality at scale? Is there another model that determines if it adheres to codebase standards?
by great_psy
6/8/2026 at 9:46:32 PM
see Beyond Unit Tests and Novel Grading Methods in TFA.
i think something like ~60% llm as judge rubrics and the rest as described. every rubric validated by maintainer. 3000 rubrics
by swyx
6/9/2026 at 4:54:07 AM
I'm a bit disappointed that Opus 4.6 wasn't in this because the tokenizer changed quite a bit from 4.7 onward. I was so annoyed by 4.7 that I've been forcing 4.6 ever since. I've been annoyed by 4.8 a bit too, so I haven't felt the urge to move on.
by fouc
6/9/2026 at 1:16:06 PM
shared older model numbers here https://www.latent.space/p/ainews-frontiercode-benchmarking
tldr theres been broad progress despite your observed regressions
by swyx
6/9/2026 at 9:54:26 AM
> total 1000+ hours of real life software maintainer work captured in dataset. ON TOP of that, 40+ hours of real human work to turn that real life work into well validated and structured tasks with rubrics (even more work to turn tasks/prompts from devin-infra-specific to pluggable coding agent)
Heartening. We still haven’t automated making the world worse.
by keybored
6/9/2026 at 12:38:24 PM
this gives my non tired self a chance to fix the typo:
- “ ON TOP of that, 40+ hours of real human work to turn…”
+ “ ON TOP of that, 40+ hours of real human work PER TASK to turn…”
by swyx
6/9/2026 at 7:20:55 AM
[flagged]
by hanzeweiasa
6/9/2026 at 6:49:03 AM
Meaningless comment filled with buzzwords and marketing numbers.
by blks