2/17/2026 at 7:10:57 PM
As a former competitive MtG player this is really exciting to me.
That said, I reviewed a few of the Legacy games (the format I'm most familiar with and also the hardest by far), and the level of play was so low that I don't think any of the results are valid. It's very possible that for Legacy they would need some assistance playing Blue decks, but they don't seem to grasp even the most basic of concepts, such as "Who's the beatdown?".
IMO the most important part of current competitive Magic is mulligans, and that's something an LLM should be extremely good at, but none of the games I've seen had either player starting with fewer than 7 cards... In my experience, about 75% of games in Legacy have at least one player mulligan their opener.
by danielvinson
2/17/2026 at 7:14:42 PM
Yeah, the intention here is not to answer "which deck is best" - the standard of play is nowhere near high enough for that. It's meant as more of a non-saturated benchmark for different LLM models, so you can say things like "Grok plays as well as a 7-year-old, whereas Opus is a true frontier model and plays as well as a 9-year-old". I'm optimistic that with continued improvements to the harness and new model releases we can get to at least "official Pro Tour stream commentator" skill levels within the next few years.
by GregorStocks
2/17/2026 at 11:47:04 PM
Hmm well, from my perspective, none of them are even really playing the game; they are just taking random actions. Any human, even a small child, would be much better.
And re: ages, it's worth noting that the youngest player to make Day 2 of a Grand Prix was 8 years old, and the youngest Pro Tour winner was 15 years old. I don't think it's realistic to get an LLM anywhere close to either of those players in skill level, though it's absolutely possible with a specialized model.
by danielvinson
2/17/2026 at 9:16:47 PM
> so you can say things like "Grok plays as well as a 7-year-old, whereas Opus is a true frontier model and plays as well as a 9-year-old".
No, no, no... please think. Human child psychology is not the same as an LLM engine rating. Using that common phrase is both inaccurate and destructive to actual understanding. Asking politely: please consider not saying that about LLM game ratings.
by mistrial9
2/17/2026 at 10:41:02 PM
I picked a random Commander game and the first message one of the four players left was "Well I should be getting a new hand soon because I have asked for a mulligan". It's definitely in there; whether they are using it correctly is another question.
by protocolture
2/17/2026 at 11:38:08 PM
The model does not seem to get told what the cards actually do; it's just trying to guess.
Probably less of a problem for Legacy, but it isn't even told what format it's playing. So I expect it's easy to get the bots to do far better than this by giving them more information.
Let me analyze this opening hand:
*Hand analysis:*
- 3 lands (Riverpyre Verge, Steam Vents, Willowrush Verge) - dual lands which is good
- Monument to Endurance ({3}) - a 3-drop
- Accumulate Wisdom ({1}{U}) - 2-drop, likely a blue advantage engine
- Boomerang Basics ({U}) - 1-drop blue spell
- Gran-Gran ({U}) - 1-drop blue creature (1/2)
This is actually a reasonable hand. I have:
- 3 lands giving me mana access (with color fixing)
- A 1-drop (Gran-Gran or Boomerang Basics to choose from)
- A 2-drop (Accumulate Wisdom)
- A 3-drop (Monument to Endurance)
The curve is solid. I have some interactive blue play with Boomerang Basics and early board presence with Gran-Gran. The mana base supports blue well with all three lands. This is keep-able — while not spectacular, it's a functional opening hand.
{
  "hand_size": 7,
  "action_type": "GAME_ASK",
  "players": "GPT52 Vet(20hp), Haiku Worrier(you,20hp)",
  "land_count": 3,
  "context": "T1 ()",
  "response_type": "boolean",
  "respond_with": "answer=true or answer=false",
  "message": "Mulligan down to 6 cards?",
  "your_hand": [
    { "name": "Riverpyre Verge", "is_land": true },
    { "name": "Steam Vents", "is_land": true },
    { "name": "Willowrush Verge", "is_land": true },
    { "mana_cost": "{3}", "name": "Monument to Endurance" },
    { "mana_cost": "{1}{U}", "name": "Accumulate Wisdom" },
    { "mana_cost": "{U}", "name": "Boomerang Basics" },
    { "mana_cost": "{U}", "name": "Gran-Gran", "power": "1", "toughness": "2" }
  ],
  "action_pending": true
}
by Eridrus
2/18/2026 at 12:28:58 AM
Oh, that's a good bug report - historically it was just hallucinating card effects so I made the harness throw the Oracle text for all visible cards into the context, but I bet I forgot to do that for the mulligan decision specifically (it's a weird one). Thanks!
by GregorStocks
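For illustration only, here is a minimal sketch of what attaching Oracle text to the mulligan prompt could look like. It is not the harness's actual code: the `oracle_text` field name, the `add_oracle_text` helper, and the lookup table are all hypothetical, and the card text is a placeholder rather than real Oracle wording. Only the `your_hand` and `name` keys come from the payload posted earlier in the thread.

```python
from copy import deepcopy

# Hypothetical lookup table. A real harness would fill this from a card
# database; the value here is a placeholder, not actual Oracle wording.
ORACLE_LOOKUP = {
    "Steam Vents": "<Oracle text for Steam Vents goes here>",
}


def add_oracle_text(payload, oracle_lookup, missing="(Oracle text unavailable)"):
    """Return a copy of the payload with an assumed 'oracle_text' field on each card.

    The input payload is left untouched so the caller can still log the
    original message.
    """
    enriched = deepcopy(payload)
    for card in enriched.get("your_hand", []):
        card["oracle_text"] = oracle_lookup.get(card["name"], missing)
    return enriched


if __name__ == "__main__":
    # Trimmed-down version of the mulligan payload quoted in the thread.
    mulligan_prompt = {
        "message": "Mulligan down to 6 cards?",
        "your_hand": [
            {"name": "Steam Vents", "is_land": True},
            {"mana_cost": "{U}", "name": "Gran-Gran", "power": "1", "toughness": "2"},
        ],
    }
    for card in add_oracle_text(mulligan_prompt, ORACLE_LOOKUP)["your_hand"]:
        print(card["name"], "->", card["oracle_text"])
```

Cards missing from the lookup get an explicit "unavailable" marker instead of being skipped silently, so the model can at least tell that it is guessing.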
2/18/2026 at 4:03:14 AM
> mulligans and that's something an LLM should be extremely good
Why? I honestly can't think of any reason that LLMs should be specifically good at mulligans.
by raincole
2/18/2026 at 4:07:09 AM
This is actually really interesting to me, but the way to decide whether you should mulligan is to ask whether the 7 cards you are looking at are better than the average 6-card hand from your deck. Given that games in most higher power formats end in the first 2-3 turns, the number of cards generally isn’t as important as their quality. So it’s really just math to determine what an “average” hand looks like.
by danielvinson
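As a rough sketch of the "just math" part: the number of lands in a random opening hand follows a hypergeometric distribution, so 7-card and 6-card hands can be compared directly. The deck numbers below (60 cards, 24 lands) and the "2 to 4 lands is keepable" rule are assumptions made for illustration, not figures from this thread, and land count is only a crude stand-in for the hand quality discussed above.

```python
from math import comb


def land_count_distribution(deck_size, lands_in_deck, hand_size):
    """Return {k: P(exactly k lands)} for a hand drawn without replacement."""
    total_hands = comb(deck_size, hand_size)
    nonlands = deck_size - lands_in_deck
    dist = {}
    for k in range(hand_size + 1):
        # comb(n, k) is 0 when k > n, so impossible land counts get probability 0.
        ways = comb(lands_in_deck, k) * comb(nonlands, hand_size - k)
        dist[k] = ways / total_hands
    return dist


def keep_probability(deck_size, lands_in_deck, hand_size, min_lands, max_lands):
    """Probability that a random hand has between min_lands and max_lands lands."""
    dist = land_count_distribution(deck_size, lands_in_deck, hand_size)
    return sum(p for k, p in dist.items() if min_lands <= k <= max_lands)


if __name__ == "__main__":
    # Assumed deck: 60 cards, 24 lands. Assumed keep rule: 2 to 4 lands.
    for size in (7, 6):
        p = keep_probability(60, 24, size, 2, 4)
        print(f"{size}-card hand: P(2-4 lands) = {p:.3f}")
```

A fuller evaluation would score spells and mana curve as well as land count, but the counting step itself is the same.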
2/18/2026 at 4:22:37 AM
> it's really just math to...
Uh, LLMs are notoriously bad at basic arithmetic. I think you might be thinking about another kind of AI.
Plus I don't really believe LLMs can reliably tell which hand is better. If you remove the drawing part and simply present two hands to an LLM and ask it which one is stronger, I expect it to do much, much, much worse than an experienced player. There isn't much reason to expect otherwise (but I'm willing to be proven wrong if such benchmarks exist).
by raincole