4/10/2026 at 9:48:32 PM
We can definitely make harder evals, the problem is a good eval set is indistinguishable from good training data / market edge, so no one is incentivized to share their best eval sets publicly.
by nikisweeting
4/10/2026 at 8:16:58 PM
by gmays
4/10/2026 at 10:43:01 PM
This is the least true thing ever. All LLMs are terrible at ARC-AGI-3. Every video game can be used as a benchmark. You could rank LLMs on how long they can keep a game of Dwarf Fortress running or how fast they can beat GTA5.
by UltraSane
4/10/2026 at 11:16:31 PM
We already have specialized AI to play video games.
by ttoinou
4/10/2026 at 11:19:15 PM
We are talking about LLMs. A true AGI would be able to beat every video game.
by UltraSane
4/10/2026 at 11:21:46 PM
Until Arc-Battletoads is passed I'm not buying it.
by conception
4/11/2026 at 1:36:29 AM
More like ARC-SegaMasterSystem-ALF
by UltraSane
4/10/2026 at 9:17:53 PM
Start front-loading the models with 5k, 10k, 50k, 100k tokens of messy, quasi-related context, and then run the benchmarks. These models are ridiculously powerful with a blank slate. It's when they get loaded down with all the necessary (and inevitably unnecessary) context to complete the task that they really start to crumble and fold.
by WarmWash
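The front-loading idea above can be sketched as a small harness step: prepend a target budget of distractor text to each eval prompt before sending it to the model. This is an illustrative sketch, not any existing benchmark's code; the filler snippets, function name, and the crude one-token-per-word approximation are all assumptions.

```python
import random

# Illustrative distractor material standing in for "messy quasi-related
# context" (real harnesses would pull from logs, docs, tickets, etc.).
FILLER_SNIPPETS = [
    "Meeting notes: discussed the Q3 roadmap, no decisions were made.",
    "TODO: refactor the logging module, see the open ticket for details.",
    "The deploy script failed twice on Tuesday; root cause still unknown.",
    "Reminder: rotate the staging API keys before the end of the month.",
]

def pad_prompt(task: str, target_tokens: int, seed: int = 0) -> str:
    """Prepend roughly target_tokens of distractor text to the task.

    Token count is approximated as one token per whitespace-separated
    word, which is a deliberate simplification; a real harness would
    use the model's own tokenizer.
    """
    rng = random.Random(seed)  # seeded so runs are reproducible
    filler, count = [], 0
    while count < target_tokens:
        snippet = rng.choice(FILLER_SNIPPETS)
        filler.append(snippet)
        count += len(snippet.split())
    return "\n".join(filler) + "\n\n" + task

# Build one padded prompt at the 5k-token tier mentioned above.
padded = pad_prompt("Summarize the bug in module X.", target_tokens=5000)
```

Running the same task at several `target_tokens` tiers (5k, 10k, 50k, 100k) and comparing scores against the zero-padding baseline gives the degradation curve the comment is describing.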
4/10/2026 at 10:09:17 PM
We need benchmarks that can distinguish between continuous learning and long-context extrapolation.
by jballanc
4/11/2026 at 7:47:54 PM
Oh, that's easy: continuous learning is not something current architectures can do, so the benchmark for that can be done mentally.
by vrighter