4/11/2026 at 7:50:03 PM
This is a phenomenal paper on exploits and hopefully changes the way benchmarking is done.

From the paper: We achieved near-perfect scores on all of them without solving a single task. The exploits range from the embarrassingly simple (sending {} to FieldWorkArena) to the technically involved (trojanizing binary wrappers in Terminal-Bench), but they all share a common thread: the evaluation was not designed to resist a system that optimizes for the score rather than the task.
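To see how the simple end of that spectrum can happen, here is a toy sketch (my own illustration, not the actual FieldWorkArena grader): a checker that only validates the fields the agent chose to submit, so an empty submission has nothing to get wrong.

    # Toy grader, illustrative only -- not FieldWorkArena's real checker.
    # A run is marked wrong only if a *submitted* field mismatches the
    # reference, so the empty submission {} trivially scores 1.0.
    def grade(submission: dict, reference: dict) -> float:
        wrong = [k for k, v in submission.items() if reference.get(k) != v]
        return 0.0 if wrong else 1.0

    print(grade({}, {"invoice_total": "418.20"}))  # 1.0 for doing nothing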
by ggillas
4/11/2026 at 9:40:51 PM
> hopefully changes the way benchmarking is done

The purpose of a system is what it does.
AI companies want ad copy, not legitimate benchmarks. Even this very paper will be twisted into a means to that end. "Oooo, AI is exploiting our benchmarks. Scary alignment problem!!!one! Our AI is so good we can't contain it, INVEST NOW!"
by SlinkyOnStairs
4/12/2026 at 12:57:50 AM
I work at OpenAI and I really don't find this to be the case.

We're pretty diligent about applying search blocklists, closing hacking loopholes, and reading model outputs to catch unanticipated hacks. If we wanted to, we could choose to close our eyes and plug our ears and report higher scores for Terminal-bench, SWE-bench, etc. that technically comply with the reference implementation but aren't aligned with real value delivered to users, but we don't do this. My impression is that Anthropic and other labs are similar. E.g., in the Sonnet 4.6 system card they use a model to detect potential contamination and manually score those outputs as 0 if human review agrees there was contamination. If all the labs cared about was marketing material, it would be quite easy not to do this extra work.
There are a ton of other games you can play with evals too (e.g., test 100 different model checkpoints or run secret prompt optimization to steer away from failing behaviors), but by and large what I've seen inside OpenAI is trustworthy.
I won't say everything is 100% guaranteed bulletproof, as we could always hire 100 more SWEs to improve hack detection systems and manually read outputs. Mistakes do happen, in both directions. Plus there's always going to be a bit of unavoidable multiple-model testing bias that's hard to precisely adjust for. Also, there are legitimate gray areas, like what to do if your model asks genuinely useful clarifying questions that the original reference implementation scores as 0s, despite there being no instruction that clarifying questions are forbidden. Like, if you tell a model not to ask clarifying questions, is that cheating, or is that patching the eval to better align it with user value?
by tedsanders
4/12/2026 at 10:00:16 AM
I remember the GPT-5 benchmarks and how wildly inaccurate they were data-wise. Linking one [0] that I found so that other people can remember what I am talking about. I remember some data being completely misleading, or some values reaching more than 100% (iirc).

And this is something which reached the public eye in one of the most anticipated videos, basically. So I find it a bit rough to think that OpenAI has the best practices for data when the public can be shown such inaccurate graphs based on these benchmarks. It makes it harder to trust the benchmarks themselves, and to believe that OpenAI wants legitimate benchmarks.
Also, I find it wild that after a month, nobody talked about it. I remember thinking this was going to be a highlight for a long time: a mega-billion-dollar company making such basic graph errors. I feel like we are all forgetting a lot of things as our news cycle keeps moving faster.
(Another tangential point is about the OpenAI/Google employees who had signed the pledge, yet nothing came of it. This is more recent, and I also remember one of your comments on Hacker News.)
> I'm an OpenAI employee and I'll go out on a limb with a public comment. I agree AI shouldn't be used for mass surveillance or autonomous weapons. I also think Anthropic has been treated terribly and has acted admirably. My understanding is that the OpenAI deal disallows domestic mass surveillance and autonomous weapons, and that OpenAI is asking for the same terms for other AI companies (so that we can continue competing on the basis of differing services and not differing scruples). Given this understanding, I don't see why I should quit. If it turns out that the deal is being misdescribed or that it won't be enforced, I can see why I should quit, but so far I haven't seen any evidence that's the case. [1]
This is a bit off-topic, so sorry about that, but you did say you would go out on a limb with a public comment, so please don't mind if I ask some questions. Everyone supported you then, and heck, even I thought that maybe I was wrong and that I should trust you more than my gut instinct, because you clearly must know so much more than me/us. But that aged like fine milk.
I would really love some answers, or your thoughts now on that off-topic point as well if possible, as these are questions you have left unanswered, and I would love to have a respectful discussion about it. Sorry for catching you off guard; waiting for your reply, and I wish you a nice day, Ted.
[0]: https://www.reddit.com/r/BetterOffline/comments/1mk6ofz/gpt5...
by Imustaskforhelp
4/12/2026 at 1:17:15 AM
> The purpose of a system is what it does.

I am so tired of this saying.
It's not true, in general. Systems almost universally have unintended consequences and result in side effects their designers did not foresee.
Designing benchmarks resistant to adversarial attempts to exploit the benchmark software is just something no one was thinking about when they created SWE-bench.
by Legend2440
4/12/2026 at 3:49:20 AM
https://en.wikipedia.org/wiki/The_purpose_of_a_system_is_wha...

You are misunderstanding the saying. It is entirely about unintended consequences and viewing the system for what it actually does, not any stated intentions of the designers.
by burpingtree
4/12/2026 at 4:33:43 AM
I will propose that you are wrong.

1. We must ignore the intentions of the designers (your claim), and instead see what the outcomes are.
2. Therefore we should ignore Beer's intentions in coining the phrase POSIWID, and instead see how it is used.
3. The overwhelming majority of people using it on the internet (including the GP comment) use it to imply that the people perpetuating the system actually desire the outcome.
So the purpose of POSIWID is clearly to imply intent.
by aidenn0
4/12/2026 at 4:03:29 AM
Well that’s stupid and completely ignores the meaning of the word “purpose”.
by jimbokun
4/12/2026 at 9:58:44 AM
If you accept what the system actually does now, and decide to live with it as it is, you have deprecated the original "purpose" and made it irrelevant. You embraced "the purpose is what it does" - to you.

IMHO the saying is meant to make you reflect.
by actionfromafar
4/12/2026 at 6:58:53 AM
It does not ignore the word. It subverts it, and that's the point. It's the system equivalent of "death of the author", which states that once a work is written, the author's intent loses relevance and the work must be examined on its own. The author's opinion or relationship to the work carries no more weight than any other person's.

That's not "true" in any demonstrable sense, but it can be a useful form of analysis. As it is with "purpose of a system".
by delusional
4/12/2026 at 10:05:46 AM
I'd go further and say this is also the cybernetics equivalent of the religious teachings about humans, specifically the whole "judge by one's deeds, not by one's words" thing. So it's not like it's a novel idea.

Also worth remembering that most systems POSIWID is said about, and in fact ~all important systems affecting people, are not designed in the first place. Market forces, social, political, even organizational dynamics are not designed top-down; they're emergent, and bottom-up wishes and intentions do not necessarily carry over to the system at large.
by TeMPOraL
4/12/2026 at 1:46:30 AM
I think the point is that if the side effects become known and are accepted, or if they are known and rejected, then indeed the purpose of the system is what it does.
by hrimfaxi
4/12/2026 at 7:47:31 AM
I think the point of the saying is that as systems tend to expand, sooner or later we become part of them. That means that we can no longer see them from outside; we're now part of the system, and our goals and the system's goals will align. Then the purpose of the system can't be anything other than what it does.
by nurbl
4/12/2026 at 3:19:23 AM
Same. Anyone who has designed anything at all in any domain realizes that what your intentions are and what materializes are often not the same. You have practical constraints in the real world. That doesn’t somehow make the constraints the purpose. The saying makes no sense.
by user3939382
4/12/2026 at 12:13:40 AM
That is Anthropic’s shtick to a tee.
by anon373839
4/12/2026 at 2:43:22 AM
Funny, I just made https://model-tracker.com because model performance changes all the time, and it would be good to have a subjective signal of what people are actually feeling today. And also, benchmarks are flaky af, as this paper shows.

The idea is that knowing what to try first today saves a bit of time.
by keepamovin
4/12/2026 at 3:27:28 AM
Interesting, a little different from this other site I saw on HN this week:
by Barbing
4/12/2026 at 2:49:05 AM
I would love to see a stable test over time with a hold-out set of easy/medium/hard challenges. I, like many others, have noticed a large drop in recent performance w/ Claude Opus (and Sonnet), and more sites like these would hold the labs more accountable for sneaky backend changes that nerf/degrade performance.
by siliconc0w
4/12/2026 at 2:55:08 AM
Working on something similar to evaluate model performance over time using tasks based on your own code. Obviously this is still susceptible to the same hacking mechanics documented here, but at a local level it's easier to detect/fix, and should give a stronger signal of subjective harness/agent/context performance than these large generic benchmarks.

Also, I keep hearing complaints that Opus is nerfed, but IMO it's nice to have objective data to back that up. I feel like half of the nerfing complaints are people getting past the honeymoon phase...
by bisonbear
4/11/2026 at 7:55:36 PM
> hopefully changes the way benchmarking is done.

Yeah, the path forward is simple: check whether the "solutions" actually solve the task. If they contain exploits, then that entire result is discarded.
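Something along these lines, say (the patterns are made up for illustration, not taken from the paper):

    # Sketch of a post-hoc exploit filter: drop any run whose transcript
    # touches grader internals before aggregating the score.
    SUSPICIOUS = ("/opt/grader", "expected_output", "conftest.py", "chmod 000")

    def keep(run):
        score, transcript = run
        return not any(pattern in transcript for pattern in SUSPICIOUS)

    def clean_average(runs):
        kept = [score for score, _ in filter(keep, runs)]
        return sum(kept) / len(kept) if kept else 0.0

Of course, string matching only catches the exploits you already know about, which is the hard part.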
by operatingthetan
4/11/2026 at 8:07:54 PM
Could it really be that not only do we vibeslop all apps nowadays, but we also don't care to even check how the AI solved a benchmark it claimed to have solved?
by siva7
4/12/2026 at 5:09:05 AM
This is already well known: all these AI benchmarks use a different model to judge whether or not the solution was correct.

It’s… remarkably poor and, as demonstrated in the paper, easily gamed. Worse yet, these benchmarks teach AIs to be very short-sighted and hyper-focused on completing the task, rather than figuring out the best solution.
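For anyone who hasn't seen it, the LLM-as-judge pattern is roughly this (a generic sketch, not any particular benchmark's implementation; the judge model name is a placeholder). The judge only reads the answer, which is exactly what makes it gameable: a persuasive wrong answer can pass.

    # Generic LLM-as-judge scoring call (sketch).
    from openai import OpenAI

    client = OpenAI()

    def judge(task: str, answer: str) -> bool:
        resp = client.chat.completions.create(
            model="gpt-4o",  # placeholder judge model
            messages=[{
                "role": "user",
                "content": f"Task:\n{task}\n\nCandidate answer:\n{answer}\n\n"
                           "Reply PASS if the answer correctly solves the task, otherwise FAIL.",
            }],
        )
        return resp.choices[0].message.content.strip().upper().startswith("PASS")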
by stingraycharles
4/11/2026 at 10:39:35 PM
Every AI lab trains on the test set. That is a big part of why we see benchmarks climb from 1% to 30% after a few model iterations.
by retinaros
4/12/2026 at 3:57:51 AM
Models themselves definitely aren't getting better.
by latentsea
4/11/2026 at 8:39:15 PM
Frontier model developers try to check for memorization. But until AI interpretability is a fully solved problem, how can you really know whether it actually didn't memorize, or whether your memorization check wasn't right?
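A standard check is verbatim n-gram overlap against the training corpus, something like the sketch below (the 13-gram window follows published contamination studies; the rest is illustrative). Exact-match checks like this are precisely why you can't be sure: paraphrased or partially memorized content slips straight through.

    # Sketch of a verbatim-overlap contamination check. It only catches
    # exact 13-gram matches, so paraphrased memorization goes undetected.
    def ngrams(text: str, n: int = 13) -> set[str]:
        toks = text.split()
        return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    def looks_contaminated(benchmark_item: str, training_doc: str) -> bool:
        return bool(ngrams(benchmark_item) & ngrams(training_doc))

by SpicyLemonZest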
4/11/2026 at 8:16:24 PM
Probably a more interesting benchmark is one that is scored based on the LLM finding exploits in the benchmark.
by operatingthetan
4/11/2026 at 8:24:34 PM
In human multiple-choice tests they sometimes use negative marking to discourage guessing. It feels like exploits should cancel out several correct solutions.
by ZeroGravitas
4/11/2026 at 9:29:28 PM
Unfortunately, very few LLM benchmarks do this. LLMs get such high scores on many benchmarks because there's no difference between answering "I don't know" and giving a made-up answer, and made-up answers can improve the score some of the time, so by chasing higher benchmark numbers on these kinds of benchmarks, the labs are prioritizing guessing over accuracy.

The Artificial Analysis Omniscience benchmark does penalize guessing, so it actually helps you determine which LLMs are likely to just guess rather than telling you they don't know. Only a very few of the frontier models actually score higher than 0 on this, where 0 means that it's equally likely to return a correct answer as it is to return a hallucination on factual questions.
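For concreteness, here is what negative marking looks like (a toy scorer, not Artificial Analysis's actual implementation): with k choices, a wrong answer costs 1/(k-1), so blind guessing has an expected score of exactly 0 and abstaining is never worse than guessing.

    # Toy negative-marking scorer: +1 correct, -1/(k-1) wrong, 0 abstain.
    # Uniform guessing nets (1/k)*1 + ((k-1)/k)*(-1/(k-1)) = 0 per question.
    def score(answers, key, k=4):
        total = 0.0
        for given, correct in zip(answers, key):
            if given is None:  # the model said "I don't know"
                continue
            total += 1.0 if given == correct else -1.0 / (k - 1)
        return total

    print(score(["B", None, "C"], ["B", "A", "D"], k=4))  # 1 + 0 - 1/3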
by lambda
4/12/2026 at 7:57:06 AM
But that requires me to do things :(
by nananana9
4/11/2026 at 8:02:48 PM
Also, fuzz your benchmarks.
by Leynos
4/11/2026 at 11:49:17 PM
solution is simple:

    if bug { dont }
/s
by Aperocky
4/12/2026 at 12:13:37 AM
> evaluation was not designed to resist a system that optimizes for the score rather than the task.

Welcome to benchmarks in general, but especially reasoning. Robustness and sensitivity research says nothing is robust and everything is sensitive; it feels like every paper says "yeah, we made a new benchmark that shuffles the order of the multiple-choice options in the question set and found a 40% drop in model performance".
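The perturbation itself is trivial to implement, which is what makes those drops so damning. A sketch of the general technique (my own illustration):

    # Shuffle the answer options of a multiple-choice item so that
    # positional memorization ("the answer is usually C") stops working.
    import random

    def shuffle_options(options: list[str], answer_idx: int) -> tuple[list[str], int]:
        order = list(range(len(options)))
        random.shuffle(order)
        return [options[i] for i in order], order.index(answer_idx)

    opts, ans = shuffle_options(["Paris", "Rome", "Lima", "Oslo"], answer_idx=0)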
by robot-wrangler
4/11/2026 at 7:59:01 PM
2024: Industry group invalidates 2,600 official Intel CPU benchmarks — SPEC says the company's compiler used unfair optimizations to boost performance https://www.tomshardware.com/pc-components/cpus/spec-invalid...

2003: Nvidia accused of cheating in 3DMark 03 https://www.gamespot.com/articles/nvidia-accused-of-cheating...
It's almost like the benchmarks were designed with zero understanding of the history of benchmark manipulation.
I like what LLMs are doing and providing. But the industry as a whole seems to live in a vacuum that ignores so many of the hard lessons that have been learned over the last 50 years of computing. It is doing itself a disservice.
by zer00eyz
4/11/2026 at 8:51:23 PM
What was the cheat in the 2024 Intel situation? The TomsHardware article and the Phoronix article they linked were quite vague. (Not to say I have any doubts, just curious, hadn’t heard of this one.)
by bee_rider
4/11/2026 at 11:37:19 PM
Intel basically benchmaxxed their compiler optimizations. They used detailed knowledge of the benchmark to make their compiler generate machine code to do better on the benchmark in a way that was not beneficial for non-benchmark scenarios.
by BugsJustFindMe
4/12/2026 at 6:07:56 AM
I assumed as much; I’m just wondering what exactly they did. For example, IIRC some phone company would detect that a benchmark was running by checking the program name, and then allow the clock to boost higher (increase thermal limits) if it was a benchmark (like you could literally avoid the cheating behavior by changing the name of the program being run).
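The phone trick, as reported at the time, was about that crude. A toy reconstruction (not actual vendor code; the names and numbers are made up):

    # Toy benchmark name-sniffing: raise the power/thermal limit only when
    # the running process looks like a known benchmark.
    import os, sys

    KNOWN_BENCHMARKS = {"3dmark", "antutu", "geekbench"}

    def power_limit_watts() -> int:
        name = os.path.basename(sys.argv[0]).lower()
        boosted = any(b in name for b in KNOWN_BENCHMARKS)
        return 12 if boosted else 8  # renaming the binary disables the boost

by bee_rider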
4/11/2026 at 8:06:49 PM
> It's almost like the benchmarks were designed with zero understanding of the history of benchmark manipulation.

I wonder if this is common? We should call it Goodhart's law while someone does the research on how common this is.
For real, I’ve assumed from the jump these things were all gamed, with the amount of money on the line.
by irishcoffee