3/20/2026 at 2:01:00 AM
> Frontier models score ~90% on Python but only 3.8% on esoteric languages, exposing how current code generation relies on training data memorization rather than genuine programming reasoning.

Finally! This is a really obvious test case that I've wondered about myself, and that I've seen many casual skeptics and cautiously optimistic people independently raise for several years now. When a megacorp is not crowing about such a test, the silence is deafening; it was practically guaranteed that they tested, didn't like the results, and didn't publish.
I'm still surprised it took this long for academics to try it, and skimming the cites, I don't see anything similar. Anyone know if this is the first paper to try this kind of thing, or just the first to put together an especially good suite of reusable benchies?
If this benchmark becomes popular, then presumably, to avoid such embarrassments, synthetic data will eventually be added to training sets to make sure even esolangs are somewhat more in-distro, and then we gradually run out of esolangs to do honest testing with. SAT is a whole different animal admittedly, but comparable honest tests might involve just forcing models to use randomly generated but easily checked EBNF grammars. I don't have a quick link to the relevant papers, but afaik benchmarks of strict adherence to non-simple JSON schemas are also still pretty bad, and we're just working around that with lots of retries/tokens. "But look how well it works for 10k lines of kubernetes manifests!" Well yeah, maybe, but it barely needs to really follow a schema when that's more stuff that's already in the training set.
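To make the random-grammar idea concrete, here's a toy sketch (Python, all names illustrative, not from the paper): generate a small random context-free grammar, sample a string from it, and check candidate strings against it by brute force. A real benchmark would presumably emit proper EBNF and use a real parser, but the "easily checked" property is the point:

```python
import random

def random_grammar(seed=0, nonterminals=("S", "A", "B"), terminals="abc"):
    """Each nonterminal gets a few random productions (sequences of symbols).
    Productions only reference *later* nonterminals, so the grammar is acyclic
    and every derivation terminates."""
    rng = random.Random(seed)
    grammar = {}
    for i, nt in enumerate(nonterminals):
        pool = list(terminals) + list(nonterminals[i + 1:])
        rules = [[rng.choice(pool) for _ in range(rng.randint(1, 3))]
                 for _ in range(2)]
        rules.append([rng.choice(terminals)])  # guarantee a terminating rule
        grammar[nt] = rules
    return grammar

def derive(grammar, symbol, rng):
    """Sample one string the grammar can produce, starting from `symbol`."""
    if symbol not in grammar:          # terminal: emit it
        return symbol
    rule = rng.choice(grammar[symbol])
    return "".join(derive(grammar, s, rng) for s in rule)

def matches(grammar, symbols, text):
    """Can the sequence of symbols derive exactly `text`? (brute force)"""
    if not symbols:
        return text == ""
    head, rest = symbols[0], symbols[1:]
    if head not in grammar:            # terminal: must match the next char
        return text.startswith(head) and matches(grammar, rest, text[1:])
    return any(matches(grammar, rule + rest, text) for rule in grammar[head])

g = random_grammar(seed=42)
s = derive(g, "S", random.Random(1))
assert matches(g, ["S"], s)            # a sampled string always validates
```

Checking is cheap and mechanical, so grading is trivial; generating conforming strings without having seen the grammar before is the part that exercises the model.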
by robot-wrangler
3/20/2026 at 5:33:35 AM
> If this benchmark becomes popular, then presumably to avoid such embarrassments synthetic data is eventually added to training sets to make sure even esolangs are somewhat more in-distro

https://x.com/lossfunk/status/2034637505916792886
"After the paper was finalized, we ran agentic systems that mimic how humans would learn to solve problems in esoteric languages. We supplied our agents with a custom harness + tools on the same benchmark. They absolutely crushed the benchmark. Stay tuned"
A little harness engineering was enough!
by dinp
3/20/2026 at 7:25:51 AM
> Stay tuned

Never heard that before! But ok, it seems like this entity is affiliated with the paper, so I'm interested.
> A little harness engineering was enough
Enough for what? It's not enough to crush the benchmark if that just means showing it is feasible to generate esolang code. No one cares about that if we're using it as a proxy to investigate general reasoning. Given validation/execution feedback loops, and 1000 retries for hello-world where we succeed with trial and error, the case for reasoning still wouldn't look great.
Suppose it's way better than that though; maybe trials are few and show clear logical progression. Well, we needed a harness, and that's still damning for whether and to what extent models can reason. But with harnesses at least we have a way to do general reasoning well enough on novel problems, right?
> mimic how humans would learn to solve problems in esoteric languages
Well hold on, does the harness do that, or does it enable models to do reasoning? We've retreated back towards solving that thing we weren't actually interested in.
by robot-wrangler
3/20/2026 at 4:36:24 AM
I don’t have much confidence in the premise. Where was the human control? I think most Python programmers, when tasked with “now do it in brainfuck,” would fail. There is not much meaningful overlap in how to express intent and solutions to problems. The ridiculous syntax is the joke.

But more importantly, I don’t have to solve any problems with languages that are elaborate practical jokes, so I’m not worried about the implications for an LLM’s ability to be useful.
by GorbachevyChase
3/20/2026 at 5:12:35 AM
The point here is to test for "genuine reasoning" or something approaching it. If a model is truly reasoning, it should be competent even in a new language you just made up (provided the language itself is competently designed).
by culi
3/20/2026 at 5:26:50 AM
So humans don't do "genuine reasoning"?
by wehnsdaefflae
3/20/2026 at 3:43:29 PM
No. I’m just an NPC in someone else’s simulation, wandering the world aimlessly, incapable of expressing ideas outside of my training corpus of language. Pathetic.
by GorbachevyChase
3/20/2026 at 11:49:06 AM
I would in fact expect any human that's as good at writing code as various state-of-the-art LLMs (if you take the breathless proclamations of their hype bros at face value) to be able to solve the rather simple problems in the benchmark, given the relevant esolang spec and some time to figure it out.

It's not as if the models here were asked to write a kernel in Brainfuck; the medium tier of problems here contains such apparently insurmountable tasks as "calculate the nth prime".
by filleduchaos
3/20/2026 at 4:45:09 AM
> I don’t have to solve any problems with languages that are elaborate practical jokes

This is just being needlessly dismissive. Esolangs are (and have been) an area of active CS research for decades. Granted, I'm a bit of an esolang nerd, but while some are jokes, most explore specific paradigms (e.g. Piet is visual, bf is a Turing tarpit, etc.).
> I think most Python programmers when tasked with “now do it in brainfuck” would fail.
This is untrue. Given internet-level awareness and infinite time, virtually all developers should be able to go from Python to brainfuck (trivially, I might add). Did you even look at the test sets? It's all pretty basic stuff (palindromes, array traversal, etc.; we aren't exactly using pandas here). I mean, sure, it would take forever and be mega annoying, but manipulating a head and some tape is hardly difficult.
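The "head and some tape" machine really is that small. As a sketch (in Python, purely illustrative), a complete Brainfuck interpreter is a tape, one head, and eight commands:

```python
def bf(program, inp=""):
    """Minimal Brainfuck interpreter: a data tape, one head, eight commands."""
    tape, head, out = [0] * 30000, 0, []
    inp = iter(inp)
    # Pre-match brackets so '[' / ']' jumps are O(1) at runtime.
    jump, stack = {}, []
    for i, c in enumerate(program):
        if c == "[":
            stack.append(i)
        elif c == "]":
            j = stack.pop()
            jump[i], jump[j] = j, i
    pc = 0
    while pc < len(program):
        c = program[pc]
        if c == ">":   head += 1                      # move head right
        elif c == "<": head -= 1                      # move head left
        elif c == "+": tape[head] = (tape[head] + 1) % 256
        elif c == "-": tape[head] = (tape[head] - 1) % 256
        elif c == ".": out.append(chr(tape[head]))    # output current cell
        elif c == ",": tape[head] = ord(next(inp, chr(0)))  # read one char
        elif c == "[" and tape[head] == 0: pc = jump[pc]
        elif c == "]" and tape[head] != 0: pc = jump[pc]
        pc += 1
    return "".join(out)

# '++++++[>+++++++++++<-]>.' computes 6 * 11 = 66 and prints chr(66) == 'B'
print(bf("++++++[>+++++++++++<-]>."))  # B
```

Everything a program can do reduces to moving the head and bumping cells, which is why "now do it in brainfuck" is tedious rather than conceptually hard once you have the spec in front of you.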
by dvt