FrontierCode: An eval to measure whether you would actually merge the code

6/8/2026 at 9:33:55 PM

:wave: i was on the team! AMA.

some headlines

- 3000 rubrics on code quality. First benchmark to measure: "would this code get actually merged?"

- 20+ expert open-source maintainer created tasks on their own repos to capture their opinion & taste.

- total 1000+ hours of real life software maintainer work captured in dataset. ON TOP of that, 40+ hours of real human work to turn that real life work into well validated and structured tasks with rubrics (even more work to turn tasks/prompts from devin-infra-specific to pluggable coding agent)

- results in 81% lower false positive rate than SWE-Bench Pro

- High quality bar: many QA stages & each task manually reviewed by Cognition researchers (examples in post)

Opus 4.8 scores 13% on FrontierCode Diamond.

one of my goals was also to datamine interesting stuff even on the easy tasks. for example, if you squint you can see the answer to "WTF Happened in late 2025" with coding models: https://x.com/swyx/status/2064081945567580323

by swyx

6/8/2026 at 10:03:55 PM

Very cool! So glad to see people building and sharing evals that are better than SWE bench.

I'm curious - any particular reason you didn't put error bars on the graphs? Seems like it could be helpful when there are only 50 unique problems in the diamond set.

by tedsanders

6/8/2026 at 10:14:54 PM

*50 unique problems but 20-40 rubrics per problem (something I had to keep reminding people internally who were unimpressed with the N)

simple answer is our reporting was pass@5. feel like you'd need like 50+ runs to have reasonable confidence intervals, which somehow i dont see other people do, so i also didnt insist on it.

hoping to work with <prominent third party evals shop> to get this on their infra and evaluated along with whatever the industry standard is.

by swyx

6/8/2026 at 11:37:56 PM

Makes sense, thanks. I suppose error bars are tricky if trying to handle problem-to-problem variance, rubric-to-rubric variance, and run-to-run variance all at once.

by tedsanders
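
A rough sketch of how error bars could handle two of those variance sources at once, using a two-level bootstrap: resample problems, then resample runs within each problem. Everything here is an assumption for illustration, not the FrontierCode methodology: the function name `bootstrap_ci`, the per-run scores in [0, 1], the mean-over-runs metric, and the fake data. Rubric-to-rubric variance is ignored.

```python
import random
from statistics import mean

def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """scores[p][r] = score of problem p on run r, in [0, 1] (hypothetical layout).

    Two-level percentile bootstrap: resample problems with replacement,
    then resample runs with replacement inside each sampled problem.
    Returns (point_estimate, lower, upper).
    """
    rng = random.Random(seed)
    n_problems = len(scores)
    estimates = []
    for _ in range(n_boot):
        sampled = [scores[rng.randrange(n_problems)] for _ in range(n_problems)]
        per_problem = [mean(rng.choices(runs, k=len(runs))) for runs in sampled]
        estimates.append(mean(per_problem))
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[int((1 - alpha / 2) * n_boot) - 1]
    point = mean(mean(runs) for runs in scores)
    return point, lower, upper

# Made-up data: 50 problems x 5 runs, each run passing with probability 0.13.
data_rng = random.Random(1)
fake_scores = [[1.0 if data_rng.random() < 0.13 else 0.0 for _ in range(5)]
               for _ in range(50)]
point, lower, upper = bootstrap_ci(fake_scores)
print(f"score = {point:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

With only 50 problems the interval tends to be wide, which is the concern raised above.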

6/9/2026 at 1:00:48 PM

Any chance of also benchmarking a couple of more affordable Chinese models? (specifically Deepseek and Xiaomi's MiMo)

by VulgarExigency

6/9/2026 at 1:04:25 PM

i think <third party evals platform> will help us do that best on their standardized model matrix. for frontiercode’s launch we were focused on.. the frontier models

by swyx

6/9/2026 at 2:15:43 PM

What qualifies as a frontier model? From my personal "taste tests", I wouldn't have placed Sonnet or Kimi above Deepseek Pro or MiMo, or Gemini 3.1 Flash Lite above Deepseek Flash, but they're listed in the benchmark.

by VulgarExigency

6/9/2026 at 12:32:35 AM

This looks really great, more thoughtful than any benchmark that I've seen until now!

I'm curious if you're only interested in scoring frontier models or you would accept submission from custom harnesses? I am working on multi-model harnesses and would love to test them against your benchmark. Do you plan on releasing the tasks publicly?

by glerk

6/9/2026 at 5:47:50 AM

> Do you plan on releasing the tasks publicly?

yep

by swyx

6/9/2026 at 7:14:18 AM

yay! looking forward, and thanks!

by glerk

6/9/2026 at 11:41:17 AM

Does reporting each model at its best performing reasoning effort introduce a best-of-N/multiple-comparisons bias, especially if models have different numbers of effort levels?

by llama_drama
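
One way to size the concern is a small simulation of the worst case: every effort level has the same true pass rate, so any gain from picking the best level is pure selection noise. This is a sketch under loud assumptions (independent levels, 50 problems x 5 runs treated as independent pass/fail trials, a made-up true rate of 0.3); `best_of_levels_bias` is a hypothetical helper, not anything from the benchmark.

```python
import random
from statistics import mean

def best_of_levels_bias(true_rate, n_levels, n_items=50, n_runs=5,
                        n_sims=1000, seed=0):
    """Average amount by which max-over-levels overshoots the true rate.

    Assumes every effort level has the SAME true pass rate, so the
    reported 'best level' score is inflated only by sampling noise.
    """
    rng = random.Random(seed)
    n_trials = n_items * n_runs
    best_scores = []
    for _ in range(n_sims):
        level_scores = []
        for _ in range(n_levels):
            passes = sum(rng.random() < true_rate for _ in range(n_trials))
            level_scores.append(passes / n_trials)
        best_scores.append(max(level_scores))
    return mean(best_scores) - true_rate

bias_one_level = best_of_levels_bias(0.3, n_levels=1)
bias_five_levels = best_of_levels_bias(0.3, n_levels=5)
print(f"1 level: {bias_one_level:+.3f}, 5 levels: {bias_five_levels:+.3f}")
```

Under these assumptions a model with more effort levels gets a larger upward nudge. Whether that is big enough to change rankings depends on the real gaps between models.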

6/9/2026 at 12:57:31 PM

to you it may do idk. note that if you scroll past fig 1 you get into a nice data explorer that breaks out pass@5 by reasoning level with token and $ and step cost visualized. i think some other commenters on this hn thread got very worked up about stuff we actually agree on.

internally ive charted everything and am satisfied that theres no meaningful rank bias introduced. weve sliced it every which way. in fact we have not even published the best looking charts for this story to be told, because we have further publishing plans on frontiercode

tldr “trust me bro” this isnt the issue and if anything we couldve done more to increase N as tedsanders below points out

by swyx

6/8/2026 at 10:38:56 PM

What did you do around cross-harness testing? I don't see anything in the blog post about what harnesses were used in evaluation. SOTA benchmarks have consistently shown that frontier model performance is quite sensitive to what tools are exposed (e.g. str_replace vs. apply_patch) as the labs are RLing on their own harnesses. Did you do testing of the models in a standard setup or in their native harnesses?

by typs

6/8/2026 at 10:58:55 PM

yes well aware :) numbers shown are on "house" harnesses eg codex with gpt and claude code with opus.

fwiw we have examples of each model doing better on NON-house harnesses too - speaking just for myself i think the "the labs are RLing on their own harnesses" narrative is kinda overstated if you think through wanting to have any meaningful api business (often eg the labs will give guidance on what is preferred and the agent labs can easily match tool contract to that, which is to say, the "home turf advantage" isnt as large as you think it is if you try a little bit)

by swyx

6/8/2026 at 11:59:00 PM

What "non-house" harnesses have you found to work best?

by chris_st

6/9/2026 at 1:34:49 AM

What is the "house" harness for minimax? They haven't released any

by Bolwin

6/8/2026 at 9:37:38 PM

How do you measure quality at scale ? Is there another model that determines if it adheres to codebase standard ?

by great_psy

6/8/2026 at 9:46:32 PM

see Beyond Unit Tests and Novel Grading Methods in TFA.

i think something like ~60% llm as judge rubrics and the rest as described. every rubric validated by maintainer. 3000 rubrics

by swyx

6/9/2026 at 4:54:07 AM

I'm a bit disappointed that Opus 4.6 wasn't in this because the tokenizer changed quite a bit from 4.7 onward. I was so annoyed by 4.7 that I've been forcing 4.6 ever since. I've been annoyed by 4.8 a bit too, so I haven't felt the urge to move on.

by fouc

6/9/2026 at 1:16:06 PM

shared older model numbers here https://www.latent.space/p/ainews-frontiercode-benchmarking

tldr theres been broad progress despite your observed regressions

by swyx

6/9/2026 at 9:54:26 AM

> total 1000+ hours of real life software maintainer work captured in dataset. ON TOP of that, 40+ hours of real human work to turn that real life work into well validated and structured tasks with rubrics (even more work to turn tasks/prompts from devin-infra-specific to pluggable coding agent)

Heartening. We still haven’t automated making the world worse.

by keybored

6/9/2026 at 12:38:24 PM

this gives my non tired self a chance to fix the typo:

- “ ON TOP of that, 40+ hours of real human work to turn…”

+ “ ON TOP of that, 40+ hours of real human work PER TASK to turn…”

by swyx

6/9/2026 at 7:20:55 AM

[flagged]

by hanzeweiasa

6/9/2026 at 6:49:03 AM

Meaningless comment filled with buzzwords and marketing numbers.

by blks

6/8/2026 at 10:11:39 PM

This looks great. Well reasoned, tons of work put into eval, thanks for building it.

It strikes me as kind of wild that good evals can drive tens to hundreds of millions of dollars of compute deployment in the wild — there’s something new and collaborative and competitive about the eval / frontier model race that’s quite interesting..

In this case “shorter actually mergable patches that open source maintainers would accept” feels like a great thing to deliver to the world.

I didn’t deep dive into good and bad patches, but I wonder if swyx or others on the team have predictions on saturation. Both when, and how useful will it be? That is, do you guys think this test is broad enough as written to get better behavior out of models, and if there is saturation on this test, will we see generalized better patch / coding behavior?

by vessenes

6/8/2026 at 10:20:01 PM

thanks - credit to silas, eric, ben, and team for the depth of the evals, and the rest of the research team for doing the transcript reading parties lol

by nature of being based on open source, frontiercode public will saturate very very quickly. frontiercode main will be >80% in less than a year. hopefully diamond will last a bit longer. we can do annual refreshes, but thats not my strategy for staying relevant - what i'm more excited to get funding for is a private held out version of frontiercode based on repros of real enterprise customer problems. in an ideal agent lab (https://latent.space/p/agent-labs) you meticulously build up this domain understanding and that is essentially why both model labs and serious customers come to you.

by swyx

6/9/2026 at 6:51:59 AM

Interesting. So frontiercode-IBM-Diamond is a thing you’d hope to sell the creation of and certification of? And if it’s published then you’d expect model providers to train to frontiercode-IBM-Pro or whatever and publish it so that it would be considered a good model to use inside IBM? (Obviously just a random corporate choice here).

by vessenes

6/9/2026 at 12:46:56 PM

no, single customer focus would be bad for a number of reasons.

but frontiercode-finance? thatd be cool…

by swyx

6/9/2026 at 3:10:11 PM

Ahh interesting. There are companies making good money doing private equity dd work, largely custom harnesses. A lot of open space right now.

by vessenes

6/9/2026 at 5:55:20 AM

I'm liking the effort to make new, no-longer-saturated benchmarks. I'll also be a bit suspicious if some model aces it -- matching OSS maintainers' taste more often is a plausible improvement in quality but if they nail it every time they've been memorizing.

Not saying FrontierCode should've done this, but benchmarking the interaction would be interesting. That is, if I get a diff with a blocking problem but writing a comment gets it fixed, that's a lot different from if the model has hit a wall. Better, if there's a problem but the model flagged it in a short list of questions or worries to me before or after coding, it can get sorted without taking much of my time. Stick an LLM in the loop instructed to behave like a user or reviewer with some rubric-ish info that wasn't in the prompt. Then, look at how much the pretend user has to do to get to a quality result with a given model, if they can get to one at all.

You could say 'why worry about interaction? the goal is the model just gets it perfect' but I think that imagined end state just is not a thing: tasks will get bigger but there will still be interaction. Handling comments and asking good clarifying questions when needed are real capabilities. Human SWEs interact plenty and real engineering has a certain density of questions about requirements, taste, and other big vague things.

by twotwotwo

6/9/2026 at 12:42:21 PM

i agree it would be interesting but apart from the fact that itd be harder to measure and automate, theres real alpha in being the best truly async, hands off model/agent, which is what cog has been working on for 2 years now. its not that im opposed to steering or interaction mid task, its just that 1) it mostly Just Works, 2) it doesnt parallelize/scale well, 3) including on proactive agents (https://docs.devin.ai/product-guides/automations).

see my “semi async valley of death” post. people are pursuing both sides but per bitter lesson only one side scales indefinitely with compute

that said, multistage rollouts and synthetic rubrics (using grpo advantage? see dr tulu paper) somewhat approximate human intervention and interaction, so theres known ways to model that, its just not thaaaat valuable

by swyx

6/9/2026 at 5:31:16 PM

To repeat, not a dig at FrontierCode, which is substantial progress in benchmarking. But I'd argue modeling the rest of the process is tha(aaa)t valuable and becomes more so as coding capability progresses:

Async agents interact on a longer timescale, but they interact. Again, experienced SWEs, consulting agencies, etc. ask questions before and after implementation, accept notes, etc.; they vary at how good they are at it; and how well they do it is a big factor in the success and failure of projects.

LLM interaction ability isn't saturated or mature; asking for point edits mostly works, but e.g. when I try to get Opus to ask clarifying questions or surface tricky bits to focus review, it's not close to a human-level response -- it's both noisy and misses key stuff. (Handling uncertainty has been a weak point for LLMs since early on, which might not help.) Other aspects of good interaction are even harder, like digging into a potentially mistaken request, or proposing a good 80/20 tweak to the spec.

There's a different, shorter-term reason to model interaction: it better tells users the value to expect now. It turns out my employer doesn't love infinite Opus use. (Go figure.) Kimi and Sonnet do comparably on FrontierCode. Are they about the same to use, or is one flailing while the other one just needs a couple rounds of fixups? If I saw a benchmark that credibly approximated 'this model will save you this much time vs. that one' that would put it well above existing ones.

I do think a bunch of discussion, investment, etc. is based on the idea the industry will essentially be replaced with successful one-shotting with little interaction. The mistake there is to assume back-and-forth is inessential and only happens because the agents aren't that good at coding yet. For a long time lots of back-and-forths were driven by the models' limitations at raw coding, which might've made that idea more appealing.

As the coding side gets better, drawing the rest of the owl becomes the hard part. The world is messy and so is one's software's boundary with it. (I'm not saying the tasks don't get longer, I'm saying interaction gets more important as they do.) My conviction here might partly be because in my sort of work the requirements and big picture were always thornier than typing the code; I'm suspicious that as raw coding gets easier for everybody they will hit something analogous.

Anyway, again, what y'all are doing is progress. I do want to stick up for the idea that a lot of critical things aren't raw coding ability. (I'm not alone in that, I don't think!) I'm definitely not here to say someone's Doing It Wrong as they do it more correctly than I've seen it done--just asking "would the patch get accepted?" is a huge step.

by twotwotwo

6/9/2026 at 6:49:27 PM

no disagreements. big fan of thinky

by swyx

6/9/2026 at 9:20:17 AM

[dead]

by dmitry_dv

6/8/2026 at 10:58:34 PM

Great effort and a bit closer to my private evals than DeepSWE. I greatly appreciate the focus on false negatives and positives, along with simply being far more focused on actual, mergeable quality output over plain passing. Could see a lot of others adopt your list of metrics as a basis; they are very well defined and give solid coverage of everything one should want out of code provided, not just focused on one or two narrow targets. Will incorporate a lot of these ideas in my own tests and polish some other parts where I somewhat unintentionally already went in a roughly similar direction.

by Topfi

6/9/2026 at 3:50:57 AM

This isn't a fair way to chart this:

"Each model is run 5 times at every available reasoning effort. For each effort, we average the metric across the 5 trials, then report each model’s score at its best performing reasoning level."

For example, Anthropic's "medium" might involve 3x the amount of thinking and take 5x as long as OpenAI's idea of "medium". So now you've skewed all the results. It assumes that they're linear and equivalent ranges.

You should compare apples to apples. Weight them in a way that factors in total task completion time as the measure of "effort", not the arbitrary effort settings provided by the AI company. I don't care what the underlying effort level is, I care which model out of multiple, if running for the same amount of time, completes my task to a more accurate degree. Total token consumption would be another thing to consider, to rule out TPS. But generally, if the goal is ultimate productivity, the main factor is what does it faster. If cost is a concern then token count+speed, or token count alone, is the main factor.

The second chart paints a more clear picture though, GPT 5.5 xhigh gets 44.7% at 21k tokens, and Opus 4.8 max gets 49.9% at 75k tokens. So basically, 4x the amount of tokens from Opus 4.8 resulted in an increase of 5.2%. If you were to loop GPT 5.5 xhigh over the same set of tasks, an extra 4x, would it surpass the 49.9%? That's the real question here. And I'd wager it probably would.

But the framing of this whole thing makes it sound like Opus has some massive lead. In reality though, it just loops harder and consumes more tokens. Their effort levels are not equivalent.

Now take this even further, and emulate what Anthropic is likely doing behind the scenes. Running the prompt through multiple prompts and converging on the end result. Give GPT four generic skills that cover different aspects of the benchmark in a general way. Run it 4x to get that same token count usage, and use each of those different skills for each one. Now what is the result? I'd wager it blows Opus out of the water.

The end result is this: Anthropic gives you all of the bloat in a single, slow package. GPT gives you the ability to build your own equivalent harness. I'd much rather have the freedom and flexibility to do it myself.

Once people actually focus on building strong harnesses around open-source, we'll have models that are competing at the same level as the closed labs. Especially now that we have models like Nemotron 3 Ultra. But it involves a lot of clever approaches, like using small fast models to help with routing and determining what "skills" and prompts to load, using static analysis, local tools and vector databases. Using a pipeline of all of the specialized, fast, small models to handle the various aspects of the specific task in a cooperative tree. The amount of underutilized specialized AI models out there is insane, no one seems to be building harnesses around them. Things like semantic code duplication detection for example. We don't need to be using the big model to do everything, the big model should be the orchestrator of all of the tools and little models.

This is why the big labs have a lead that no one seems to be able to crack, because they're not just building a model and calling it a day, they utilize all of these other approaches on top of the big model. Now that we have strong open source models, we can start building these things too.

by nullbio
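
For reference, the arithmetic on the two data points quoted above (GPT 5.5 xhigh at 44.7% with 21k tokens, Opus 4.8 max at 49.9% with 75k tokens) works out as below. The numbers are the ones from the comment; the variable names are just for illustration. The token ratio is closer to 3.6x than 4x, and the 5.2 is a gap in percentage points, which is about 11.6% in relative terms.

```python
# Figures as quoted in the comment above; nothing else is assumed.
gpt = {"score_pct": 44.7, "tokens_k": 21}
opus = {"score_pct": 49.9, "tokens_k": 75}

token_ratio = opus["tokens_k"] / gpt["tokens_k"]    # how many times more tokens
gap_points = opus["score_pct"] - gpt["score_pct"]   # absolute gap, percentage points
relative_gain = gap_points / gpt["score_pct"]       # gap relative to the lower score

print(f"{token_ratio:.2f}x tokens, {gap_points:.1f} points, {relative_gain:.1%} relative")
# prints: 3.57x tokens, 5.2 points, 11.6% relative
```

This says nothing about whether looping the cheaper model would close the gap; that would need actual runs.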

6/9/2026 at 11:58:21 AM

I don't specifically care about Claude -vs- GPT, but comparing models at different amounts of test time compute is a gaping hole. It also means that any unreasonably-expensive token guzzling white-elephant model can top all the benchmarks and still be useless.

What we actually have is like a scaling law for test time compute, so it's silly to focus on specific Y values that someone benchmarked (at whatever default X values). Instead, characterize the slope or power of the scaling law, or just plot the damn curve for each model -vs- number of tokens or cost or something!

Noam Brown also raised this issue recently: https://x.com/polynoamial/status/2064210146558136827

by ssivark
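
A minimal sketch of what "characterize the slope" could look like: fit score = a + b * log10(tokens) per model by least squares and compare the slopes b. The functional form is an assumption (a power law or saturating curve may fit better), `fit_log_linear` is a hypothetical helper, and the data points are made up, not taken from the benchmark.

```python
import math

def fit_log_linear(points):
    """Least-squares fit of score = a + b * log10(tokens).

    points: list of (tokens, score_pct) pairs for ONE model at
    different test-time compute budgets. Returns (a, b).
    """
    xs = [math.log10(tokens) for tokens, _ in points]
    ys = [score for _, score in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up curve: each doubling of tokens adds 5 points.
hypothetical_curve = [(5_000, 30.0), (10_000, 35.0), (20_000, 40.0), (40_000, 45.0)]
intercept, slope = fit_log_linear(hypothetical_curve)
points_per_doubling = slope * math.log10(2)
print(f"{slope:.1f} points per 10x tokens, {points_per_doubling:.1f} per doubling")
```

Comparing `points_per_doubling` across models would give a single number per model that does not depend on which effort level happened to be benchmarked.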

6/9/2026 at 12:53:30 PM

ok i mean i agree, how is it a gaping hole when its literally the second (and third and fourth..) chart on the post? yes token cost and reasoning efficiency is important, hence the 2D pareto charts

by swyx

6/9/2026 at 5:19:11 PM

My apologies... I was responding to the above comment / ranting about the general trend and got carried away. Wasn't directed specifically at your post.

I love your second graph; hope the trend catches on as the main graph, instead of the model-wise bar graph that seems to be popular.

by ssivark

6/9/2026 at 6:51:39 PM

1 dimension is unfortunately all the mental bandwidth that talking heads have.

by swyx

6/9/2026 at 8:47:37 AM

> Total token consumption would also be another thing to consider as well, to rule out TPS.

There is a chart that compares this in the article.

> You should compare apples to apples. Weight them in a way that factors in total task completion time as the measure of "effort", not the arbitrary effort settings provided by the AI company. I don't care what the underlying effort level is, I care which model out of multiple, if running for the same amount of time, completes my task to a more accurate degree.

That's your opinion, my goal is exactly what this benchmark measures, the end result being something I can merge into the codebase based on some configuration setup provided by the lab. I don't run 50 agents in parallel and I am able to use the $100 Anthropic plan just enough that I don't go over the limits.

Also what is your specific argument into the benchmark findings considering that some problems are solved by Opus and not solved by Codex? Whether one uses more tokens or not is a completely different metric.

by lnenad

6/9/2026 at 4:15:07 AM

Wow, looks like you've found a massive flaw indeed.

I was skeptical about the results because in my experience both recent GPT and Opus models are strong. Everything else is B or C tier. This is just artisanal vibe testing though. It's very hard to eval them properly.

by edg5000

6/9/2026 at 11:18:58 AM

YMMV but on my local tasks Opus 4.8 is nowhere near close to gpt 5.5. For the lack of a better word Opus is just soo damn lazy.

by tornikeo

6/9/2026 at 2:47:40 AM

Is there anything we can download? Did they test GLM 5.1?

by ilaksh

6/8/2026 at 9:41:48 PM

Since no one knows or can agree on what "code quality" is and we can't measure it for human output, I'm dubious about measuring it for LLMs

by singpolyma3

6/9/2026 at 12:17:25 AM

You don't need universal consensus to measure something. There are many good quality measures of code quality.

by kube-system

6/9/2026 at 10:35:39 AM

> There are many good quality measures of code quality.

Besides the traditional and mechanical "less LOC", cyclomatic complexity and similar, which ones are you talking about exactly?

by embedding-shape

6/9/2026 at 3:39:49 PM

I'm not talking about any specific measure. There are many. Pick some, and you have a benchmark.

by kube-system

6/10/2026 at 9:36:52 AM

I understand you're talking about "many", could you give a concrete name or example of at least one?

by embedding-shape

6/9/2026 at 5:47:49 AM

Opus 4.8 low at 8.2% while medium at 5.9% is definitely an interesting result, to say the least.

by Magniquick

6/8/2026 at 10:09:48 PM

> Today’s coding benchmarks have established that models can write correct code.

I wouldn't say that.

> But as AI-generated code becomes the dominant path to production

I really hope that's not the case.

by einpoklum

6/8/2026 at 10:36:05 PM

How do you define "correct" code?

by zakisaad

6/9/2026 at 12:07:56 AM

The code that gets stuff done instead of beating around the bush making unexpected errors

by newsicanuse

6/9/2026 at 5:22:53 AM

i suspect this is highly dependent on what you're working on

from my experience if you give the models a way to self-verify correctness they succeed basically 100% of the time

by vanuatu

6/9/2026 at 1:32:47 PM

> from my experience if you give the models a way to self-verify correctness they succeed basically 100% of the time

My experience is that if you can get the model to one shot the task, you'll do fine but if it has to iterate it leaves things worse than before and almost always requires human intervention after burning through an enormous amount of tokens

by maccard

6/9/2026 at 4:08:53 AM

You know that it's an honest benchmark when their own model (SWE-1.6) scores terribly on it.

by 2001zhaozhao

6/9/2026 at 7:54:33 AM

I wish there was a new kind of benchmark that...wasn't focused on prompt-to-complete-task completion, rather on how well a model can act as an assistant.

At my day job, despite all the harnessing and providing extensive documentation and user stories via E2Es, I cannot trust models to deliver quality output. They are unable to, and reviewing 18 files of changes is the kind of work that increases my load and effort.

And yes, we have already split and optimized our documentation to not overwhelm the context.

In order to do this, the best flow is planning together, finding edge cases, having review skills, iterating, producing a business logic focused document describing the changes -> iterating to get a code changeset focused document.

Then I want to review step by step all the edits the model does.

On average this triples the amount of time required for a major change, but significantly improves business logic correctness and code quality. The major benefit is that the result requires significantly less maintenance down the line, and it also improves the harness (more quality code, proper information, better examples for the models in the future).

The issue is: models are increasingly getting worse at this kind of work. While it is clear that they have better capabilities, the feedback loop has definitely degraded between Opus 4.7 and Opus 4.8, much more than it did between Opus 4.5 and 4.7.

This is very disappointing to me, as it is crystal clear that models are increasingly reinforced to deliver from prompt to the end result on their own and keep me more and more left out of the loop.

This has resulted in increasing frustration and makes my work slower, not better.

by epolanski

6/9/2026 at 2:41:49 AM

[flagged]

by bisonbear

6/9/2026 at 8:27:35 AM

[flagged]

by alex1sa

6/9/2026 at 4:27:39 AM

[flagged]

by nryoo

6/8/2026 at 9:45:41 PM

[flagged]

by fHr