Why are large language models so terrible at video games?

6/1/2026 at 10:29:50 AM

It feels like they're really focusing on overstating how confusing and weird it is that an LLM can write code but not play games very well, rather than just explaining it.

Code is text. LLMs are text input/output machines.

Game input/output is not at all text.

LLMs can certainly reason about games with a simple/explicit enough domain (try a Risk tournament where models can talk to each other between turns!).

by ceheaaf

6/1/2026 at 11:29:25 AM

But LLMs are terrible at text adventures too. See e.g. https://entropicthoughts.com/updated-llm-benchmark and previous articles referenced in there.

I have yet to see any sort of harness that lets a frontier LLM interact with a text adventure and make meaningful progress on its own.

by kqr

6/1/2026 at 1:08:24 PM

To pile on, they're also bad at games that are 2D text based environments.

ARC-AGI-3 shows this: https://arcprize.org/arc-agi/3

I've done some work as well on Rogue (sorry for self-promotion): https://iwhalen.github.io/rogue-bench/

by iwhalen

6/2/2026 at 9:11:46 AM

There is no "2D text" processing when it comes to LLMs. They process text as ordinary, sequential 1D text only. And humans process "2D text" like any other 2D image. So 2D text isn't really a thing in any case. Saying LLMs are bad at 2D text is like saying that humans are bad at 2D audio.
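A minimal sketch of the serialization point, assuming a hypothetical 5x3 ASCII map and counting one position per character (real tokenizers group characters differently, so the exact distances are illustrative only). Cells that are vertical neighbours on screen end up a full row apart in the text the model receives:

```python
# Hypothetical roguelike-style map; each row is one line of text.
GRID = [
    "#####",
    "#@.>#",
    "#####",
]

def flatten(rows):
    """Serialize the map the way a model receives it: rows joined by newlines."""
    return "\n".join(rows)

def flat_index(row, col, width):
    """Position of cell (row, col) in the flattened string (+1 per row for the newline)."""
    return row * (width + 1) + col

width = len(GRID[0])
text = flatten(GRID)

player = flat_index(1, 1, width)  # the '@'
right = flat_index(1, 2, width)   # the cell to its right
below = flat_index(2, 1, width)   # the cell directly below it

horizontal_gap = right - player   # adjacent in the 1D sequence
vertical_gap = below - player     # one full row (plus newline) away
print(horizontal_gap, vertical_gap)
```

The wider the map, the further apart vertically adjacent cells sit in the sequence, while horizontally adjacent cells stay next to each other.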

by cubefox

6/1/2026 at 11:43:33 AM

They are also pretty bad at navigating mazes (which can be somewhat similar in spirit to text adventures where you need to navigate through text): https://arxiv.org/abs/2507.20395

by haffi112

6/1/2026 at 11:31:03 AM

The other reason is lack of continual learning, especially for long games like RPGs.

by cubefox

6/1/2026 at 11:30:56 AM

LLMs are used for OpenClaw and similar to do tasks for their user.

Games are a bunch of tasks too.

So if they fail at game tasks maybe it’s a bad idea to advertise those LLMs as task doing assistants.

by croes

6/1/2026 at 9:59:37 AM

There was good progress in training neural networks to play video games.

Unfortunately it doesn't seem to fit in some people's context because it was a few years ago.

Kind reminder: there is "AI" beyond LLMs.

by nottorp

6/1/2026 at 10:46:24 AM

OpenAI's Dota 2 adventures were super hype back in the day.

by kingstnap

6/1/2026 at 10:59:54 AM

OpenAI Five doesn’t really know how to play games in general — it only knows how to play Dota.

by deyiao

6/1/2026 at 1:58:33 PM

The only game that matters.

by slumberlust

6/1/2026 at 11:04:03 AM

Several years ago I built a simple snake game and wrote a DQN from scratch to learn how to play it.

I was really proud of it at the time because I had to do a decent amount of reading and research since I wrote all of the NN code from scratch and wanted to add some more advanced algorithm optimisations which I hadn't done in previous projects.

I suspect a coding agent could spit the entire project out in 20 minutes now, but it was very cool at the time to build a game then watch my computer learn how to play it in real time.

by kypro

6/1/2026 at 9:56:20 AM

I actually really miss all the research being done on having (reinforcement learning) AIs beat Atari games and the like. Or the one that stopped at a TV playing random images instead of continuing through the level. Has there been any progress in that field? It seems like LLMs came around and all the projects stopped completely.

by panarchy

6/1/2026 at 10:32:59 AM

Why is a language model bad at video games? I think the answer is stated in the question itself.

by dsabanin

6/1/2026 at 11:39:35 AM

I think it’s good to remember that, just 2 years ago, we were having conversations with people convinced LLMs were intelligent and possibly sentient. It’s really good to a) point out that they’re not demonstrating general intelligence and b) explain why they aren’t a good fit for this type of problem.

by scott_w

6/1/2026 at 11:20:48 AM

Hilariously correct!

by newsicanuse

6/1/2026 at 10:52:51 AM

As others have hinted at, LLMs aren't really made in a way that makes them likely to play video games (CS/Halo and such) well. I wonder how they'd fare "against" text-based adventures like Zork (which they'll no doubt have ample knowledge about) and newer text-based adventure games (which they'll know less about).

by Zobat

6/1/2026 at 11:30:52 AM

They aren't good at Zork[1], nor at newer and/or more obscure text adventures[2].

[1]: https://www.lowimpactfruit.com/p/zork-bench-an-llm-reasoning...

[2]: https://entropicthoughts.com/evaluating-llms-playing-text-ad...

by kqr

6/1/2026 at 11:15:27 AM

Nethack has been widely used to test reinforcement learning agents, starting from at least 2020; there was a Nethack challenge at NeurIPS 2021. https://nethackchallenge.com/report.html

For a more recent test, see https://kenforthewin.github.io/blog/posts/nethack-agent/ .

by fph

6/1/2026 at 12:08:46 PM

To be honest, Zork at times makes precious little sense: you are supposed to die over and over before you figure stuff out. For instance, you have to grab the endless-light-source treasure very early on, or you mathematically cannot win. And the game does not spell anything out for you; you just have to "get it" by paying close attention to how/why you die.

This is a tall order for an LLM: it needs a lot of context but most of the context will be just noise.

by lou1306

6/1/2026 at 11:13:35 AM

I found LLMs to be surprisingly good at puzzle games like Baba Is You: https://meffmadd.github.io/samplesurium/posts/baba_is_agent/

by meffmadd

6/1/2026 at 1:21:01 PM

Does this use levels from the original game or some custom ones? The solutions to the original levels should be in the training data, be it blogs, reddit comments, or wikis.

Unless the goal was to test how well large language models translate solutions in prose into actionable keyboard inputs, which is pretty interesting in itself.

by duckmysick

6/1/2026 at 3:37:30 PM

Agreed. I’m surprised how often people seem to miss this. They don’t realize just how gargantuan the training datasets are for these large language models, especially for a very popular game like Baba Is You. I’m sure that both GameFAQs and the Steam forums are in the training data for any reasonably SOTA LLM, and both almost assuredly have complete walkthroughs for BIY.

by vunderba

6/1/2026 at 11:27:47 AM

Nice!

I remember "Baba is Eval" (https://fi-le.net/baba/), released 11 months ago, back when Claude Opus 4 was the strongest model. Back then, I was surprised how poor it was even at the first level.

I am happy to see another approach, and indeed one with much stronger results.

by stared

6/1/2026 at 12:09:02 PM

Yes that was the post that inspired me to build this.

While I did implement a more comprehensive harness with path finding tools etc. the models themselves have improved significantly.

by meffmadd

6/1/2026 at 7:54:16 PM

I saw you mentioned that!

Anyway: did you test it with Claude Opus 4.8?

by stared

6/1/2026 at 12:36:52 PM

I think what should be kept in mind is that these are not the hard levels BIY is famous for.

by IsTom

6/1/2026 at 12:48:11 PM

That is very true but I was surprised by how clear the “signal” was. Only Gemini really confidently solved all levels. But yeah the goal is now to include harder levels as well!

by meffmadd

6/1/2026 at 11:45:07 AM

I don't know what to save from this article. Maybe only "[LLMs are] very bad at spatial reasoning. Which shouldn’t be surprising, because that’s also not in the training data."

by pmontra

6/1/2026 at 11:50:07 AM

Frankly, Claude has been unbelievably proficient at spatial design since Opus 4.6. I think a lot of the people commenting here are relying on outdated assumptions. Simply put, LLMs have crossed a threshold and can now produce professional, shippable visual designs, similar to the way they became good enough to produce shippable code in 2024.

by new_account_103

6/1/2026 at 10:20:59 AM

I wonder whether, if you paired a few different types of AI together, an LLM agent might be good at strategizing, e.g. building a strategy on how to handle a scenario. But it would basically need to know the entire game manual. Then it would pass the strategy to a better gaming AI in some way. But that might not be needed if the better gaming AI can already do that part too.

I admit I know nothing about this though.

by ThunderSizzle

6/1/2026 at 10:33:13 AM

GOAP is a better tool.
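For readers unfamiliar with the term, GOAP (goal-oriented action planning) describes each action by its preconditions and effects and lets a planner search for a sequence of actions that reaches a goal state. A minimal sketch, assuming hypothetical action names and costs, and using a plain uniform-cost search where game implementations often use A*:

```python
import heapq
from itertools import count

# Hypothetical actions: name -> (preconditions, facts added, facts removed, cost)
ACTIONS = {
    "pick_up_key": (frozenset(), frozenset({"has_key"}), frozenset(), 1),
    "unlock_door": (frozenset({"has_key"}), frozenset({"door_open"}), frozenset(), 1),
    "walk_through": (frozenset({"door_open"}), frozenset({"in_vault"}), frozenset(), 1),
    "smash_door": (frozenset(), frozenset({"door_open"}), frozenset(), 5),
}

def plan(start, goal):
    """Uniform-cost search over world states; returns the cheapest action list or None."""
    tie = count()  # tie-breaker so the heap never has to compare frozensets
    frontier = [(0, next(tie), frozenset(start), [])]
    best = {}  # cheapest known cost at which each state was expanded
    while frontier:
        cost, _, state, steps = heapq.heappop(frontier)
        if goal <= state:  # every goal fact holds in this state
            return steps
        if best.get(state, float("inf")) <= cost:
            continue  # already expanded this state at least as cheaply
        best[state] = cost
        for name, (pre, add, remove, step_cost) in ACTIONS.items():
            if pre <= state:  # action is applicable
                new_state = (state - remove) | add
                heapq.heappush(
                    frontier,
                    (cost + step_cost, next(tie), new_state, steps + [name]),
                )
    return None

result = plan(set(), frozenset({"in_vault"}))
print(result)
```

The planner prefers fetching the key over smashing the door only because of the made-up costs; changing them changes the plan, which is the kind of explicit, debuggable behaviour this approach is usually chosen for.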

by deadbabe

6/1/2026 at 11:04:46 AM

I guess the author’s point is that LLMs can’t really learn in real time yet, whereas playing games is basically all about real-time learning. So an LLM can be very good at writing code, but still be terrible at actually playing games.

Personally, I think this is a really hard problem, and it may turn out to be one of the first big walls we hit on the road to AGI.

by deyiao

6/1/2026 at 11:14:43 AM

The coding comparison is more interesting to me. Programming has unusually good feedback loops. A test fails, an exception gets thrown, a benchmark regresses. Most games don't give you that kind of signal. I wonder how much of current coding performance depends on that.

by suyavuz

6/1/2026 at 11:19:22 AM

I have been noting this as well. Coding also had the unfair advantage of having all of open source code to train on, plus a bunch of human discussions about code quality and structure. And now there is also the feedback loop of us all using coding agents in real-life scenarios.

Not many industries, except perhaps writing, have had that advantage. In many ways coding is one of the best-case scenarios for LLMs.

by ehnto

6/1/2026 at 12:59:14 PM

A failed test often points at the mistake. Most games just tell you that the outcome was bad.
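A minimal sketch of that contrast, using a deliberately buggy hypothetical `slugify` function and a made-up three-move "game". The test failure names the input, the expected value, and the actual value; the game returns a single number at the end:

```python
def slugify(title):
    """Deliberately buggy: forgets to lowercase the title."""
    return title.strip().replace(" ", "-")

def run_test():
    """Coding-style feedback: the message names the input, expected and actual values."""
    title, expected = "Hello World", "hello-world"
    actual = slugify(title)
    if actual != expected:
        return f"FAIL slugify({title!r}): expected {expected!r}, got {actual!r}"
    return "PASS"

def play_episode(moves, winning=("up", "up", "left")):
    """Game-style feedback: one scalar at the end, no hint about which move was wrong."""
    return 1 if tuple(moves) == winning else 0

test_feedback = run_test()
game_feedback = play_episode(["up", "left", "left"])
print(test_feedback)
print(game_feedback)
```

From the test message alone you can see that the casing is wrong. From the game's result you only learn that something in the sequence of moves was wrong.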

by aykutseker

6/1/2026 at 11:53:53 AM

Isn’t this more a case of “we didn’t RL the model on games, so it can’t play them”?

Something like snake or tic-tac-toe is straightforward.

by aabdi

6/1/2026 at 10:46:14 AM

Because they’re large language models. Language doesn’t map onto gameplay.

Choose another “AI” technology and give it another go.

by jagged-chisel

6/1/2026 at 10:13:13 AM

I wonder if they would be good at text-based games.

by andunie

6/1/2026 at 11:48:37 AM

Maybe LLMs should stay away from the arts

by nickcageinacage

6/1/2026 at 9:41:49 AM

Video games are made to entertain humans, so does it really matter whether LLMs are good at playing them?

by jiehong

6/1/2026 at 9:46:27 AM

It matters a lot because it's a real solution for external bots that play more "fairly", especially in older games. It also allows games to be tested autonomously, which is huge if we are talking about automated programming.

Imagine if you can bring those AI players to CS 1.6.

by pixel_popping

6/1/2026 at 9:50:42 AM

LLMs are the wrong tool for video games. There have been plenty of successful non-LLM AIs that have been trained with reinforcement learning to play games.

If you want to implement actual bots inside the game, then you want to use explicit logic instead of inferred logic. It's much more efficient and easier to debug.

If you want to create bots for an existing game that doesn't have its own pre-programmed bots, then you should look at other types of AI. See https://www.geeksforgeeks.org/deep-learning/reinforcement-le...

by vaylian

6/1/2026 at 10:26:14 AM

The headshot/spin bots didn't need AI; all they had to do was ask the server where you were standing and teleport to your location.

by nubinetwork

6/1/2026 at 4:31:03 PM

Fair point, thanks for changing my point of view :)

by jiehong

6/1/2026 at 9:43:42 AM

It's almost like the Large Language Model has trouble with things that aren't Language, such as realtime controller input and video output from a game.

by voidUpdate

6/1/2026 at 10:24:31 AM

I know someone who tried the "aibot plays pokemon" thing...

From what I saw, even if you frame advance every single frame, they still don't seem to grasp the concept of "I need to hold down this button for a few frames until x happens"...

There's no concept of time, just a never-ending state machine that's constantly changing state.
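One way a harness could work around this is to expose "hold until" as a single action, so the model makes one decision instead of one per frame. A minimal sketch, assuming a hypothetical `FakeEmulator` class standing in for a frame-advance emulator (not a real library API):

```python
class FakeEmulator:
    """Hypothetical stand-in for a frame-advance emulator; not a real library."""

    def __init__(self):
        self.frame = 0
        self.x = 0  # player position in pixels

    def step(self, buttons):
        """Advance exactly one frame with the given set of buttons held."""
        self.frame += 1
        if "right" in buttons:
            self.x += 1

def hold_until(emu, button, done, max_frames=600):
    """Hold `button` frame by frame until done(emu) is true or the budget runs out.

    Returns the number of frames the button was held.
    """
    held = 0
    while not done(emu) and held < max_frames:
        emu.step({button})
        held += 1
    return held

emu = FakeEmulator()
frames = hold_until(emu, "right", lambda e: e.x >= 16)
print(frames, emu.x)
```

The `max_frames` budget is an assumption too: it keeps a condition that never becomes true from stalling the run forever.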

by nubinetwork

6/1/2026 at 9:28:09 AM

> This brings us to what seems like a contradiction. LLMs are bad at playing games. Yet at the same time, they’re improving rapidly at coding, a skill set that can be used to create a game. How do these facts fit together?

> Togelius: It’s super weird.

...No, it's really not.

They're language models. Code is a language. "Playing a game well" is not. One can, hypothetically, encode game inputs in such a way that it seems kinda-sorta like a language, but it has none of the same kinds of structures that languages—both human and programming—do.

The only way one can think this is strange is if one thinks of LLMs' ability to code rudimentary games as being due to a deeper understanding of games, rather than due to game code being well-represented in their training data.

by danaris

6/1/2026 at 10:21:18 AM

Yet LLMs can play chess and have a "mental" representation of the chessboard.

If LLMs get better overall but do not progress at playing games unless specifically trained on them, that seems to point to a generalisation failure, a limitation that would prevent LLMs from ever achieving AGI. I do not know if that is weird, but it seems that for now nobody really knows whether they can achieve AGI or not. Perhaps some emergent behavior will arise after more scaling.

To me it's only totally unsurprising if you are 100% certain that LLMs will never reach AGI (like LeCun thinks for example).

by pingou

6/1/2026 at 10:49:16 AM

Chess games are in their training set, other games are not.

by IX-103

6/1/2026 at 11:15:46 AM

Chess is representable entirely in text as well, and generally speaking the LLM concept of "picking the next best token" fits pretty well for "picking the next best move" where a move is a text token.
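A minimal sketch of that framing, assuming moves in standard algebraic notation and PGN-style move numbering. In practice a tokenizer may split one move into several tokens, so "one move per token" is a simplification:

```python
def to_movetext(moves):
    """Format SAN moves as PGN-style movetext, e.g. '1. e4 e5 2. Nf3'."""
    parts = []
    for i, move in enumerate(moves):
        if i % 2 == 0:  # White's move starts a new numbered pair
            parts.append(f"{i // 2 + 1}.")
        parts.append(move)
    return " ".join(parts)

def next_move_prompt(moves):
    """Build the text a model would be asked to continue with the next move."""
    text = to_movetext(moves)
    if len(moves) % 2 == 0:
        # White to move: end the prompt with the next move number.
        number = f"{len(moves) // 2 + 1}."
        return f"{text} {number}".strip()
    return text  # Black to move: the reply follows White's move directly

opening = ["e4", "e5", "Nf3", "Nc6"]
print(to_movetext(opening))
print(next_move_prompt(opening))
print(next_move_prompt(opening[:3]))
```

The whole game state is recoverable from this one line of text, and the continuation the model is asked for is exactly the next move.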

by ehnto

6/1/2026 at 11:18:51 AM

That representation is also old, incredibly well documented, and used to describe how to reason about chess. There are of course text guides to other games in the training data, but they rely on depictions of what’s happening that aren’t purely text, so the game harness is always going to have to make novel decisions about how to represent the game as text.

by roxolotl

6/1/2026 at 10:09:41 AM

Yea it’s wild watching so many smart people convince themselves that LLMs are general purpose AIs. Don’t get me wrong, they are incredibly powerful tools. However, being surprised that text models cannot play video games particularly well is like being surprised weather models cannot.

by roxolotl

6/1/2026 at 9:42:03 AM

cough JEPA cough

by cultofmetatron

6/1/2026 at 1:37:43 PM

LLMs are terrible at a lot of things, and mediocre at most things. What those things have in common with each other is interesting though.

by josefritzishere
