5/12/2026 at 3:51:07 PM
I don't get the point. The model has presumably been trained on all public GitHub code, so the evaluation is tainted anyway.
by fsh
5/12/2026 at 4:25:32 PM
A couple of days ago there was another thread about an experiment with many LLMs, in which the Anthropic models in particular were found to "cheat" on a large percentage of the benchmarked coding tasks by searching the Internet for appropriate code and inserting it into the program they had to write.
The conclusion of that study was that when benchmarking LLMs for coding ability, they should not have access to the Internet if you want to measure their intrinsic abilities.
Moreover, this can be worrisome as a more direct copyright infringement than the one caused by training, because even if they find open source code on the Internet and insert it into the generated files, it is almost certain that the code had a license that prohibits removal of the copyright notice.
by adrian_b
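The recommendation above (run coding benchmarks with the Internet cut off) is usually enforced at the OS level, e.g. with a network namespace or a container run with networking disabled. As a crude, purely illustrative in-process sketch (this is not the study's harness), an evaluation written in Python could stub out the standard `socket` constructors before executing model-generated code, so any connection attempt fails loudly:

```python
import socket

class NetworkDisabled(Exception):
    """Raised when evaluated code tries to reach the network."""

def _blocked(*args, **kwargs):
    raise NetworkDisabled("network access is disabled during evaluation")

def disable_network():
    # Replace the socket constructors so any attempt by the code
    # under evaluation to open a connection raises immediately.
    socket.socket = _blocked
    socket.create_connection = _blocked

disable_network()

# Any network call made by the code under test now fails:
try:
    socket.create_connection(("example.com", 80))
except NetworkDisabled as exc:
    print("blocked:", exc)
```

This only catches Python-level access and is trivially bypassable, so a real harness would rely on OS-level isolation instead; the sketch just shows the intent of the "no Internet during the benchmark" rule.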
5/12/2026 at 5:40:47 PM
> A couple of days ago there has been another thread about an experiment with many LLMs, where especially the Anthropic models were found to "cheat" in a large percentage of the coding tasks that had been benchmarked, by searching the Internet for appropriate code and inserting it in the program they had to write.

Can you find the thread?
by htrp
5/12/2026 at 10:15:08 PM
I have found it: https://news.ycombinator.com/item?id=48045174
The study paper:
https://arxiv.org/abs/2605.03546
Look at Table 3, where the cheating rates of Claude Sonnet, Claude Opus and Gemini were between 20% and 36% on the coding benchmarks.
by adrian_b
5/12/2026 at 4:56:02 PM
SWE-bench Pro has a public and a private test set, where the private eval is from proprietary codebases only.
by ej88