Advancing AI Benchmarking with Game Arena

2/2/2026 at 6:23:51 PM

This is a good way to benchmark models. We [the SWE-bench team] took the meta-version of this and implemented it as a new benchmark called CodeClash -

We have agents implement agents that play games against each other. So Claude isn't playing against GPT; instead, an agent written by Claude plays poker against an agent written by GPT. This really tough task leads to very interesting findings on AI for coding.

https://codeclash.ai/

by ofirpress

2/2/2026 at 7:25:58 PM

>this really tough task leads to very interesting findings on AI for coding

Are you going to share those with the class or?

by 63stack

2/2/2026 at 7:11:27 PM

Cool to see Core War! I feel it's mostly forgotten by now. My dad is still playing it to this day, though, and even attends tournaments.

by Instantnoodl

2/2/2026 at 9:54:25 PM

https://ai.meta.com/research/publications/gaia-a-benchmark-f...

?

by RobRivera

2/2/2026 at 6:44:14 PM

Leaderboard looks very outdated.

by riku_iki

2/3/2026 at 4:56:08 AM

This was effectively what OpenAI did in the very early days with Dota 2: https://en.wikipedia.org/wiki/OpenAI_Five

As someone who's been playing Dota for nearly 20 years now, it was fascinating to watch it play. Some of its decision-making didn't seem logical in the short term, but would often be setups for future plays, even though its observation window was fairly small. Even more impressive, the AI bot changed the meta of professional players, since tactics that arose out of its training ended up being more optimal.

I wish we got to the point where other ai bots were out there, but it's entirely understandable that you couldn't drive a complex game like Dota with LLMs, whereas you can with the ones the Game Arena has selected.

by jjcm

2/3/2026 at 4:52:07 PM

I'd still love to see them play more games, and indeed it'd be fun to play against them. Sad they died off

by Ntrails

2/2/2026 at 10:45:47 PM

Let's add NetHack to the mix!

https://kenforthewin.github.io/blog/posts/nethack-agent/

by kenforthewin

2/2/2026 at 9:26:08 PM

I feel uneasy about werewolf being included here. I don't want AI labs to actively try and make their LLMs deceptive!

by iNic

2/2/2026 at 7:41:38 PM

I'd really like to see them add a complex open world fully physicalized game like Star Citizen (assuming the game itself is stable) with a single primary goal like accumulating currency as a measure of general autonomy and a proxy for how the model might behave in the real world given access to a bipedal robot.

by ZeroCool2u

2/3/2026 at 12:01:09 AM

Oh hey, I've been running Werewolves/Mafia games as benchmarks for a while now

https://mafia-arena.com

Gemini is consistently winning against top models

by mohsen1

2/2/2026 at 6:54:06 PM

If AI can program, why does it matter if it can play Chess using CoT when it can program a Chess Engine instead? This applies to other domains as well.

by 10xDev

2/2/2026 at 8:54:49 PM

It can write a chess engine because it has read the code of a thousand chess engines. This benchmark measures a different aspect of intelligence.

And as a poker player, I can say that this game is much more challenging for computers than chess. Writing a program that can play poker really well and efficiently is an unsolved problem.

by RivieraKid

2/3/2026 at 2:13:43 AM

The most popular form was solved in 2019: https://en.wikipedia.org/wiki/Pluribus_(poker_bot)

by marksimi

2/3/2026 at 10:47:49 AM

Pluribus didn't solve poker. It's limited to fixed starting stack sizes. It can't exploit weak opponents; it tries to approach a Nash equilibrium, but in multiplayer poker, Nash equilibrium doesn't have the theoretical guarantees it does in heads-up. And lastly, it requires a ton of compute.

by RivieraKid
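
A toy illustration of the equilibrium-versus-exploitation point above, using rock-paper-scissors rather than poker. This is a deliberate simplification: the payoff matrix, the `weak` opponent, and the function names are made up for this sketch and have nothing to do with how Pluribus itself works.

```python
# Payoff to the row player; rows and columns are rock, paper, scissors.
PAYOFF = [
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
]


def expected_payoff(mine, theirs):
    """Expected payoff of mixed strategy `mine` against mixed strategy `theirs`."""
    return sum(
        mine[i] * theirs[j] * PAYOFF[i][j]
        for i in range(3)
        for j in range(3)
    )


def best_response(theirs):
    """Pure strategy (as a one-hot list) that maximizes payoff against `theirs`."""
    values = [sum(theirs[j] * PAYOFF[i][j] for j in range(3)) for i in range(3)]
    best = max(range(3), key=lambda i: values[i])
    return [1.0 if i == best else 0.0 for i in range(3)]


nash = [1 / 3, 1 / 3, 1 / 3]   # equilibrium strategy for rock-paper-scissors
weak = [0.5, 0.25, 0.25]       # hypothetical opponent who over-plays rock

nash_ev = expected_payoff(nash, weak)                    # equilibrium play
exploit_ev = expected_payoff(best_response(weak), weak)  # exploitative play

print(f"equilibrium EV: {nash_ev:.3f}, exploitative EV: {exploit_ev:.3f}")
```

In this toy game the equilibrium strategy cannot lose in expectation, but it also earns nothing extra from the opponent's mistake, while the best response does. That is the sense in which an equilibrium-seeking bot "can't exploit weak opponents".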

2/2/2026 at 10:10:38 PM

The program doesn't need to be a solver. It can be anything that helps it.

It doesn't even need to be one tool but a series of tools.

by 10xDev

2/2/2026 at 8:11:40 PM

> If AI can program, why does it matter if it can play Chess using CoT when it can program a Chess Engine instead?

Heh, we really did come full circle on this! When ChatGPT launched in late 2022, one of the first things people noticed was that it sucked at math. Basic math like 12 + 35 would trip it up. Then people "discovered" tool use, and added a calculator. And everyone was like "well, that's cheating, of course it can use a calculator, but look, it can't do the simple addition logic"... And now here we are :)

by NitpickLawyer

2/2/2026 at 8:25:57 PM

IMO there's an expectation for baseline intelligence. I don't expect an "AGI" model to beat Magnus Carlsen out of the box but it should be able to do basic grade school level arithmetic and play chess at a complete beginner level without resorting to external tools.

by paxys

2/2/2026 at 9:46:56 PM

I'm not going to respond to everything but the key to my comment was "This applies to other domains as well." But people are limiting their imagination to the chess engine example given for chess. The tool or program (or even other neural networks that are available) can be literally anything for any task... Use your imagination.

Maybe we should just get rid of tedious benchmarks like chess altogether at this point. They are leading people to think about how to limit AI to keep the benchmark relevant, rather than expanding on what is already there.

by 10xDev

2/2/2026 at 7:08:43 PM

They should be allowed to! In fact, I think a better benchmark would be to invent new games and test the models' ability to allocate compute to tackling them minimax- or AlphaZero-style under compute constraints.

by Davidzheng

2/2/2026 at 7:25:29 PM

It's the same reason we are asked to write exams without calculators even though the real world does have them.

How you work without calculators is a proxy for real world competency.

by simianwords

2/2/2026 at 7:31:35 PM

Funny, you used probably the most useless form of benchmarking used on people as an example of measuring "competency" in the real world.

by 10xDev

2/2/2026 at 7:48:50 PM

A lot of the insights of math come from knowing how to do things efficiently. That’s why the tests are timed. I don’t know, this is pretty basic pedagogy that you are choosing to grief.

by doctorpangloss

2/2/2026 at 7:32:26 PM

are you in favour of children using calculators in exams?

by simianwords

2/2/2026 at 7:35:46 PM

It is a program. I need it to get task X done and I don't care how, whether it is strictly through CoT or with tools. There is no such thing as cheating in real work and no reason to handicap it. Just test the limits of what it can do with whatever means possible.

Trying to solve everything with CoT alone without utilising tools seems futile.

by 10xDev

2/2/2026 at 7:51:39 PM

You are not understanding. It's a proxy for how well it does other things.

by simianwords

2/2/2026 at 9:57:50 PM

A good proxy is knowing which tools to use to solve the problem. Not how to try and emulate how a human would play chess. That is pointless...

by 10xDev

2/3/2026 at 5:18:48 AM

According to you, it says nothing about a person if they are good at chess

by simianwords

2/2/2026 at 8:56:33 PM

CoT is upstream of building a chess engine.

Chess engines don’t grow on trees, they’re built by intelligent systems that can think, namely human brains.

Supposedly we want to build machines that can also think, not just regurgitate things created by human brains. That’s why testing CoT is important.

It’s not actually about chess, it’s about thinking and intelligence.

by CooCooCaCha

2/2/2026 at 6:27:51 PM

My personal threshold for AGI is when an AI can 'sit down' - it doesn't need to have robotic hands, but it needs to only use visual and audio inputs to make its moves - and complete a modern RPG or FPS single player game that it hasn't pre-trained on (it can train on older games).

by cv5005

2/2/2026 at 10:22:02 PM

Isn't this a bit too visual-centric? By this criterion Helen Keller, author of 14 books, would not be generally intelligent.

Ultimately I think it's impossible to define AGI. Maybe "I know it when I see it"—except everyone sees it at a different point (evidently).

by anematode

2/2/2026 at 11:20:11 PM

It could have hands that feel but no vision. I think what they were getting at is that embodiment and playing games in the modality of humans, without thousands of hours of play to reach competency, would be an important milestone.

by jamilton

2/2/2026 at 7:25:57 PM

https://arxiv.org/abs/2507.03793

by bob1029

2/3/2026 at 1:24:00 AM

I believe that if a model can outperform humans in all board/card games, and can autonomously complete all video games, then AGI — or even ASI — has essentially been achieved. We’re still a long way from that.

by deyiao

2/2/2026 at 7:07:41 PM

Gemini tops all benchmarks but when it comes to real world usage it is genuinely unusable

by simianwords

2/2/2026 at 8:57:06 PM

It's legit good at visual stuff. It's just not a great agent and does some weird stuff sometimes.

by CuriouslyC

2/2/2026 at 7:18:19 PM

It’s not that bad. I’ve been using 3 Pro for some time now and I’m quite happy with how it works. Best paired with Opus and Codex, like most models, but it’s solid as a full-stack buddy.

by goniszewski

2/2/2026 at 6:12:12 PM

How about nethack?

by tiahura

2/2/2026 at 9:51:45 PM

For reference for anyone who missed it, the 2021 NetHack challenge results: https://nethackchallenge.com/report.html

That was a whole half a decade ago, but back then deep learning AIs were defeated very badly by handcrafted scripts. Even the best bot in the neural net category was actually a symbolic script/neural net hybrid.

by tux3

2/3/2026 at 4:32:30 AM

It’s my understanding that ai has had some advances in the last 5 years.

by tiahura

2/2/2026 at 6:08:09 PM

Curious why they decided to curate poker hands instead of playing normal poker.

by eamag

2/2/2026 at 6:11:52 PM

Poker has very high variance: you'd need several hundred thousand hands to confidently say who's better. Also, you probably want to precompute the GTO-optimal play for benchmarking purposes.

by qsort
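
A rough back-of-the-envelope check of the sample-size claim above. The 100 bb/100 standard deviation and the example edges are assumed ballpark figures for no-limit hold'em, not numbers from Game Arena, and `hands_needed` is a hypothetical helper written for this sketch.

```python
import math


def hands_needed(edge_bb_per_100, sd_bb_per_100=100.0, z=1.96):
    """Hands needed before the true edge sits about `z` standard errors from zero.

    Both inputs are in big blinds per 100 hands. The default standard
    deviation is an assumed ballpark value, not a measured one.
    """
    edge_per_hand = edge_bb_per_100 / 100.0
    # The standard deviation of a sum scales with sqrt(n), so divide by sqrt(100).
    sd_per_hand = sd_bb_per_100 / math.sqrt(100.0)
    # Solve z * sd_per_hand / sqrt(n) = edge_per_hand for n.
    return math.ceil((z * sd_per_hand / edge_per_hand) ** 2)


for edge in (2.0, 5.0, 10.0):
    print(f"edge {edge:>4} bb/100 -> about {hands_needed(edge):,} hands")
```

Under these assumptions, a 5 bb/100 edge needs on the order of 150,000 hands and a 2 bb/100 edge close to a million, which is in line with the "several hundred thousand" estimate.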

2/2/2026 at 6:46:19 PM

But can't computers play several hundred thousand poker hands easily in a couple of hours?

by johndhi

2/2/2026 at 6:44:49 PM

But now because the hands are so strong we don't see any folds

by eamag

2/2/2026 at 9:59:45 PM

Claude plays Pokemon Red

by mclau153

2/2/2026 at 7:15:23 PM

Wow. I'm generally in the AI maximalist camp. But adding Werewolf feels dangerous to me. Anyone who's played knows lying, deceit, and manipulation are often key to winning. Do we really want models climbing this benchmark?

by bennyfreshness

2/2/2026 at 8:44:58 PM

Oddly, in the highlighted game I watched, the werewolf simply gives up in the last round and says "I'm the werewolf, well done... Vote me."

Bizarre.

by rustyhancock

2/2/2026 at 11:20:01 PM

This is a legitimate strategy for the werewolf, no?

by minihat

2/3/2026 at 11:00:30 AM

Probably not in this case.

There were two villagers and one werewolf. The werewolf started the round by saying "I'm the werewolf, vote for me" and then the game ended with a villager win.

Overnight he had successfully taken out the doctor. It made no sense in my opinion.

There were some funny bits, like one of the Anthropic models forgetting a rule, leading to everyone accusing him of being a werewolf in a pile-on. He wasn't a werewolf; he genuinely forgot the rule. Happens in nearly every human game of werewolf.

by rustyhancock

2/2/2026 at 7:28:52 PM

Good question, but who's going to stop them?

AI already has a very creative imagination for role play so this just adds extra to their arsenal.

by bilekas

2/3/2026 at 4:02:01 AM

Negative benchmark, isn't it? No sane lab is going to release PR that states "our newest model is best at lying". If anything, the reverse may occur: if this catches on, they will make their model play werewolf badly and claim "alignment improvements, our model no longer lies as much in werewolf", while it lies more often in other domains.

by Rastonbury

2/2/2026 at 8:07:46 PM

Confidently and charismatically lying to clueless users has been one of the fundaments of AI adoption.

by PunchyHamster

2/2/2026 at 6:21:58 PM

Anecdotal data point, but recently I’ve found Gemini to perform better than ChatGPT when it came to intent analysis.

by chaostheory

2/2/2026 at 8:06:39 PM

Making models target a benchmark about being good at lying and getting away with it (werewolf) is certainly an interesting choice.

by PunchyHamster