2/2/2026 at 6:23:51 PM
This is a good way to benchmark models. We [the SWE-bench team] took the meta-version of this and implemented it as a new benchmark called CodeClash. We have agents implement agents that play games against each other, so Claude isn't playing against GPT; instead, an agent written by Claude plays poker against an agent written by GPT. This really tough task leads to very interesting findings on AI for coding.
by ofirpress
2/2/2026 at 7:25:58 PM
>this really tough task leads to very interesting findings on AI for coding
Are you going to share those with the class or?
by 63stack
2/2/2026 at 7:11:27 PM
Cool to see core war! I feel it's mostly forgotten by now. My dad is still playing it to this day though and even attends tournaments
by Instantnoodl
2/2/2026 at 9:54:25 PM
https://ai.meta.com/research/publications/gaia-a-benchmark-f...?
by RobRivera
2/2/2026 at 6:44:14 PM
Leaderboard looks very outdated..
by riku_iki