4/16/2026 at 3:41:37 PM
So Opus 4.7 is measurably worse at long-context retrieval compared to Opus 4.6: Opus 4.6 scores 91.9% while Opus 4.7 scores 59.2%. At least they're transparent about the model degradation. They traded long-context retrieval for better software engineering and math scores.
by bachittle
4/16/2026 at 4:22:39 PM
To be honest, I think it's just a more honest score of what Opus 4.6 actually was. Once contexts get sufficiently large, Opus develops pretty bad short-term memory loss.
by film42
4/16/2026 at 8:25:40 PM
You can support very long context windows if you don't mind an abysmal recall rate.
by tomaskafka
4/16/2026 at 6:32:16 PM
No: https://x.com/bcherny/status/2044821690920980626
by enraged_camel
4/16/2026 at 4:12:59 PM
Agreed, I appreciate the transparency (and Anthropic isn't normally very transparent). It's also great to know, because I will change how I approach long contexts knowing it struggles more with them.
by freedomben
4/16/2026 at 4:16:24 PM
Could this be because they've found the 1M context uneconomical (i.e., it costs too much to serve, or burns through users' quotas too quickly, causing complaints), and so they're no longer targeting it as a goal?
by RobinL
4/16/2026 at 4:57:38 PM
Opus 4.7 is also worse at 256K context. Go look at pages 195 and 196. It is an across-the-board regression, not just at 1M context.
by Someone1234
4/16/2026 at 8:43:27 PM
Thanks, interesting. Does this make it more surprising that the other benchmarks have improved? I'm not sure I understand the benchmarks well enough, but I'm wondering whether, with agentic workflows, it's possible to get away with a smaller, more focused context (and hence lower cost) while achieving the same or better performance, because of an agentic model's ability to decide what to put in context as it works.
by RobinL
4/16/2026 at 9:22:11 PM
What's all this mean in real-world use?
by timvb
4/16/2026 at 5:29:06 PM
A year ago it felt like SoTA model developers were not improving so much as moving the dirt around. Maybe we're in another such rut.
by teaearlgraycold
4/18/2026 at 3:38:01 AM
Also, just to be clear: this links to a PDF, for some reason.
by msla
4/16/2026 at 4:16:48 PM
At what point along the 1M window does context become "long" enough that this degradation occurs?
by jzig
4/16/2026 at 4:36:10 PM
The benchmark GP mentioned is measuring at 128k-256k context (there's another at 524k-1024k, where 4.6 scored 78.3% and 4.7 scored 32.2%). The longer the context, the worse the performance; there isn't really a qualitative step change in capability (if there is, imo it happens at around 8k-16k tokens, much sooner than is relevant for multi-turn coding tasks; see e.g. this old benchmark https://github.com/adobe-research/NoLiMa ).
by daemonologist
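The band-based retrieval measurement daemonologist describes can be illustrated with a minimal needle-in-a-haystack harness. This is a hypothetical sketch, not the actual benchmark: `build_haystack`, `score_retrieval`, and the stubbed `perfect_model` are all invented names, the token estimate is a crude characters-per-token approximation, and a real harness would call a model API and sweep context sizes up to 1M tokens.

```python
import random

def build_haystack(needle: str, filler: str, target_tokens: int) -> str:
    """Pad a single 'needle' fact with repeated filler text up to a rough
    token budget (approximating 1 token ~= 4 characters), inserting the
    needle at a random position among the filler chunks."""
    target_chars = target_tokens * 4
    chunks = []
    total = 0
    while total < target_chars:
        chunks.append(filler)
        total += len(filler)
    chunks.insert(random.randrange(len(chunks)), needle)
    return "\n".join(chunks)

def score_retrieval(ask_model, needle_answer: str, prompt: str) -> bool:
    """Run one retrieval trial: ask the model to recall the needle and
    score with a simple substring match."""
    reply = ask_model(prompt + "\n\nWhat is the secret number mentioned above?")
    return needle_answer in reply

# Stub that "recalls" perfectly, standing in for a real model API call.
def perfect_model(prompt: str) -> str:
    return "The secret number is 7421." if "7421" in prompt else "I don't know."

prompt = build_haystack(
    needle="The secret number is 7421.",
    filler="The quick brown fox jumps over the lazy dog.",
    target_tokens=1000,  # a real sweep would test 128k, 256k, 512k, 1M bands
)
print(score_retrieval(perfect_model, "7421", prompt))  # True for the stub
```

Running many such trials per context-length band and averaging the pass rate gives per-band recall percentages of the kind quoted in the thread (e.g. 78.3% vs 32.2% at 524k-1024k).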
4/16/2026 at 6:26:44 PM
Be brief. No one wants AI boyfriend users who drone on & on about their day.
by the13