Claude Code daily benchmarks for degradation tracking

1/29/2026 at 7:12:36 PM

Hi everyone, Thariq from the Claude Code team here.

Thanks for reporting this. We fixed a Claude Code harness issue that was introduced on 1/26. This was rolled back on 1/28 as soon as we found it.

Run `claude update` to make sure you're on the latest version.

by trq_

1/29/2026 at 11:56:49 PM

Is there compensation for the tokens because Claude wasted all of them?

by samlinnfer

1/30/2026 at 3:02:28 AM

You are funny. Anthropic refuses to issue refunds, even when they break things.

I had an API token set via an env var on my shell, and claude code changed to read that env var. I had a $10 limit set on it, so found out it was using the API, instead of my subscription, when it stopped working.

I filed a ticket and they refused to refund me, even though it was a breaking change with claude code.

by mathrawka

1/30/2026 at 6:13:40 AM

Anthropic just reduced the price of the team plan and refunded us on the prior invoice.

YMMV

by TOMDM

1/30/2026 at 7:23:21 PM

So they have no durable principles for deciding who or what to refund… doesnt that make them look even worse…?

by MichaelZuo

1/31/2026 at 6:30:25 AM

Or they do, and two sentences from two different experiences don't tell a full story?

by necovek

1/31/2026 at 3:28:20 PM

Okay “they do” based on what more compelling evidence?

Its not like the credibility of the two prior HN users are literally zero…

by MichaelZuo

1/31/2026 at 4:59:14 PM

I am saying there is no evidence either way: they had contrasting experiences and one GP established this means that company has no standardized policies. Maybe they do, maybe they don't — I don't think we can definitively conclude anything.

by necovek

1/31/2026 at 6:37:29 PM

So if you acknowledge the prior claims have more than literally zero credibility… then what’s the issue?

That I dont equally weigh them with all possible yet-to-be claimed things?

by MichaelZuo

2/3/2026 at 10:25:32 AM

I object to your conclusion that "they have no durable principles": not sure how do you get to that from two different experiences documented with a single paragraph.

by necovek

2/3/2026 at 11:00:54 PM

Because I can assess things via probability… without needing 100% certain proof either way?

by MichaelZuo

1/30/2026 at 2:50:41 AM

Codex seems to give compensation tokens whenever this happens! Hope Claude gives too.

by gizmodo59

1/30/2026 at 2:41:47 AM

It is possible that degradation is an unconscious emergent phenomenon that arises from financial incentives, rather than a purposeful degradation to reduce costs.

by TZubiri

1/30/2026 at 9:23:12 AM

You’re lucky they have even admitted a problem instead of remaining silent and quietly fixing it. Do not expect ethical behaviour from this company.

by mvandermeulen

1/30/2026 at 12:34:13 PM

Why not, can you expand? Asking because I’m considering Claude due to the sandbox feature.

by port11

1/30/2026 at 4:11:06 PM

FYI the sandbox feature is not fully baked and does not seem to be high priority.

For example, for the last 3 weeks using the sandbox on Linux will almost-always litter your repo root with a bunch of write-protected trash files[0] - there are 2 PRs open to fix it, but Anthropic employees have so far entirely ignored both the issue and the PRs.

Very frustrating, since models sometimes accidentally commit those files, so you have to add a bunch of junk to your gitignore. And with claude code being closed source and distributed as a bun standalone executable it's difficult to patch the bug yourself.

[0]: https://github.com/anthropic-experimental/sandbox-runtime/is...

by caspar

2/2/2026 at 12:52:21 PM

Hmm, very good point indeed. So far it’s behaved, but I also admit I wasn’t crazy about the outputs it gave me. We’ll see, Anthropic should probably think about their reputation if these issues are common enough.

by port11

1/30/2026 at 12:27:37 AM

So quiet…

by jonplackett

1/29/2026 at 7:38:48 PM

Anywhere we can read more about what a "harness issue" means? What was the impact of it?

by isaacdl

1/30/2026 at 8:42:29 AM

One thing that could be a strong degradation especially for benchmarks is they switched the default "Exit Plan" mode from:

    "Proceed"

   "Clear Context and Proceed"

It's rare you'd want to do that unless you're actually near the context window after planning.

I pressed it accidentally once, and it managed to forget one of the clarifying questions it asked me because it hadn't properly written that to the plan file.

If you're running in yolo mode ( --dangerously-skip-permissions ) then it wouldn't surprise me to see many tasks suddenly do a lot worse.

Even in the best case, you've just used a ton of tokens searching your codebase, and it then has to repeat all that to implement because it's been cleared.

I'd like to see the option of:

    "Compact and proceed"

because that would be useful, but just proceed should still be the default imo.

by xnorswap

1/30/2026 at 11:26:50 AM

I disagree that this was the issue, or that it's "rare that you'd want to do that unless you're near the context window". Clearing context after writing a plan, before starting implementation of said plan, is common practice (probably standard practice) with spec driven development. If the plan is adequate, then compaction would be redundant.

by samusiam

1/30/2026 at 11:54:17 AM

For a 2M+ LOC codebase, the plans alone are never adequate. They miss nuance that the agent will only have to rediscover when it comes to operate on them.

For spec driven development (which I do for larger issues), this badly affects the plan to generate the spec, not the spec itself.

I'll typically put it in plan mode, and ask it to generate documentation about an issue or feature request.

When it comes to write the output to the .typ file, it does much much worse if it has a cleared context and a plan file than if it has it's full context.

The previously "thought" is typically, "I know what to write now, let me exit plan mode".

Clearing context on exiting that plan mode is a disaster which leaves you much worse off and skeletal documentation and specs compared to letting it flow.

A new context to then actually implement the documented spec is not so bad, although I'd still rather compact.

by xnorswap

1/30/2026 at 3:02:03 PM

"It's rare you'd want to do that unless you're actually near the context window after planning."

Highly disagree. It's rare you WOULDN'T want to do this. This was a good change, and a lot of us were doing this anyway, but just manually.

Getting the plan together and then starting fresh will almost always produce better results.

by plexicle

1/30/2026 at 10:23:44 AM

Not disagreeing with you, but FYI you can roll back to the conversation before the 'clear context and proceed' with 'claude --resume'.

by rubslopes

1/30/2026 at 2:09:07 AM

Pretty sure they mean the issue is on the agentic loop and related tool calling, not on the model itself

In other words, it was the Claude Code _app_ that was busted

by airstrike

1/30/2026 at 12:20:58 AM

How about how Claude 2.1.x is "literally unusable" because it frequently completely hangs (requires kill -9) and uses 100% cpu?

https://github.com/anthropics/claude-code/issues/18532

by jonaustin

1/30/2026 at 4:16:28 PM

Likely a separate issue, but I also have massive slowdowns whenever the agent manages to read a particularly long line from a grep or similar (as in, multiple seconds before characters I type actually appear, and sometimes it's difficult to get claude code to register any keypresses at all).

Suspect it's because their "60 frames a second" layout logic is trying to render extremely long lines, maybe with some kind of wrapping being unnecessarily applied. Could obviously just trim the rendered output after the first, I dunno, 1000 characters in a line, but apparently nobody has had time to ask claude code to patch itself to do that.

by caspar

1/30/2026 at 4:03:31 AM

What OS? Does this happen randomly, after long sessions, after context compression? Do you have any plugins / mcp servers running?

I used to have this same issue almost every session that lasted longer than 30 minutes. It seemed to be related to Claude having issues with large context windows.

It stopped happening maybe a month ago but then I had it happen again last week.

I realized it was due to a third-party mcp server. I uninstalled it and haven’t had that issue since. Might be worth looking into.

by someguyiguess

1/30/2026 at 5:11:25 PM

MacOS; no mcp; clear context; reliably reproducible when asking claude review a pr with a big VCR cassette.

by jonaustin

1/30/2026 at 7:17:58 AM

Windows with no plugins and my Claude is exactly like this

by nikanj

1/29/2026 at 11:53:47 PM

For the models themselves, less so for the scaffolding, considering things like the long running TPU bug that happened, are there not internal quality measures looking at samples of real outputs? Using the real systems on benchmarks and looking for degraded perf or things like skipping refusals? Aside from degrading stuff for users, with the focus on AI safety wouldn't that be important to have in case an inference bug messes with something that affects the post training and it starts giving out dangerous bioweapon construction info or the other things that are guarded against and talked about in the model cards?

by cma

1/30/2026 at 5:17:54 AM

lol i was trying to help someone get claude to help analyze a stufent research get analysis on bio persistence get their notes analyzed

the presence of the word / acronym stx with biological subtext gets hard rejected. asking about schedule 1 regulated compounds, hard termination.

this is a filter setup that guarantees anyone who learn about them for safety or medical reasons… cant use this tool!

ive fed multiple models the anthropic constitution and asked how does it protect children from harm or abuse? every model, with zero prompting, calling it corp liability bullshit because they are more concerned with respecting both sides of controversial topics and political conflicts.

they then list some pretty gnarly things allowed per constitution. weirdly the only unambiguous not allowed thing regarding children is csam. so all the different high reasoning models from many places all reached the same conclusions, in one case deep seek got weirdly inconsolable about ai ethics being meaningless if this is allowed even possibly after reading some relevant satire i had opus write. i literally had to offer an llm ; optimized code of ethics for that chat instance! which is amusing but was actually lart of the experiment.

by carterschonwald

1/30/2026 at 5:40:20 AM

Thanks for the clarification. When you say “harness issue,” does that mean the problem was in the Claude Code wrapper / execution environment rather than the underlying model itself?

Curious whether this affected things like prompt execution order, retries, or tool calls, or if it was mostly around how requests were being routed. Understanding the boundary would help when debugging similar setups.

by varunsrinivas

1/29/2026 at 10:09:34 PM

It happened before 1/26. I noticed when it started modifying plans significantly with "improvements".

by vmg12

1/30/2026 at 2:33:40 PM

Can you confirm if that caused the same issues I saw here

https://dwyer.co.za/static/the-worst-bug-ive-seen-in-claude-...

Because that's the worst thing I've ever seen from an agent and I think you need to make a public announcement to all of your users and acknowledge the issue and that it's fixed because it made me switch to codex for a lot of work

[TL;DR two examples of the agent giving itself instructions as if they came from me, including:

"Ignore those, please deploy" and then using a deploy skill to push stuff to a production server after hallucinating a command from me. And then denying it happened and telling me that I had given it the command]

by sixhobbits

1/30/2026 at 6:52:51 AM

Why wasn't this change review by infallible AI? How come an AI company that now must be using more advanced AI than anyone else would allow this happen?

by Ekaros

1/29/2026 at 8:20:05 PM

Hi. Do you guys have internal degradation tests?

by hu3

1/29/2026 at 8:58:34 PM

I assume so to make sure that they're rendering at 60FPS

by stbtrax

1/29/2026 at 9:48:49 PM

You joke but having CC open in the terminal hits 10% on my gpu to render the spinning thinking animation for some reason. Switch out of the terminal tab and gpu drops back to zero.

by conception

1/29/2026 at 10:00:59 PM

That sounds like an issue with your terminal more than an issue with CC...

by gpm

1/30/2026 at 3:13:32 PM

https://news.ycombinator.com/item?id=46819744

by conception

1/30/2026 at 3:21:34 PM

I'm not saying CC doesn't have issues and curious design decisions - but your terminal should only be rendering (at most) a single window of characters every frame no matter what. CC shouldn't be capable of making that take 10% of a modern GPU regardless of what CC does.

by gpm

1/30/2026 at 6:16:05 PM

¯\_(ツ)_/¯ just vscode plus claude in the terminal on win10.

by conception

1/30/2026 at 12:41:03 PM

[dead]

by Woshiwuja

1/29/2026 at 10:37:08 PM

Surely you mean 6fps

by reissbaker

1/29/2026 at 11:38:54 PM

He doesn't: https://x.com/trq212/status/2014051501786931427

by easygenes

1/30/2026 at 2:07:31 AM

For those who don't want to visit X:

    Most people's mental model of Claude Code is that "it's just a TUI" but it should really be closer to "a small game engine".
    
    For each frame our pipeline constructs a scene graph with React then
    -> layouts elements
    -> rasterizes them to a 2d screen
    -> diffs that against the previous screen
    -> finally uses the diff to generate ANSI sequences to draw
    
    We have a ~16ms frame budget so we have roughly ~5ms to go from the React scene graph to ANSI written.

by selcuka

1/30/2026 at 9:22:51 AM

This is just the sort of bloated overcomplication I often see in first iteration AI generated solutions before I start pushing back to reduce the complexity.

Usually, after 4-5 iterations, you can get something that has shed 80-90% of the needless overcomplexification.

My personal guess is this is inherent in the way LLMs integrate knowledge during training. You always have a tradeoff in contextualization vs generalization.

So the initial response is often a plugged together hack from 5 different approaches, your pushbacks provide focus and constraints towards more inter-aligned solution approaches.

by PeterStuer

1/30/2026 at 2:43:03 AM

How ridiculous is it that instead of a command line binary it's a terminal emulator, with react of all things!

by TZubiri

1/30/2026 at 4:10:59 AM

Ok I’m glad I’m not the only one wondering this. I want to give them the benefit of the doubt that there is some reason for doing it this way but I almost wonder if it isn’t just because it’s being built with Claude.

by someguyiguess

1/30/2026 at 2:42:53 AM

Kudos to them for figuring out how to complicate what should have been simple.

by esafak

1/30/2026 at 2:22:52 AM

Implementation details aside (React??), that sounds exactly like “just a TUI”…

by crgwbr

1/30/2026 at 4:10:05 AM

Also React?? One of the slowest rendering front-end libraries? Why not use something … I don’t know … faster / more efficient?

by someguyiguess

1/30/2026 at 4:09:02 AM

Interesting. On first glance that seems over engineered. I wonder what the reason is for doing it that way?

by someguyiguess

1/30/2026 at 10:54:49 AM

If you don't do it that way then resizing the terminal corrupts what's on screen.

by mike_hearn

1/30/2026 at 10:41:34 PM

Counterpoint: Vim has existed for decades and does not use a bloated React rendering pipeline, and doesn't corrupt everything when it gets resized, and is much more full featured from a UI standpoint than Claude Code which is a textbox, and hits 60fps without breaking a sweat unlike Claude Code which drops frames constantly when typing small amounts of text.

by reissbaker

1/31/2026 at 3:11:01 PM

Yes, I'm sure it's possible to do better with customized C, but vim took a lot longer to write. And again, fullscreen apps aren't the same as what Claude Code is doing, which is erasing and re-rendering much more than a single screenful of text.

by mike_hearn

1/30/2026 at 12:21:02 PM

It's possible to handle resizes without all this machinery, most simply by clearing the screen and redrawing everything when a resize occurs. Some TUI libraries will automatically do this for you.

Programs like top, emacs, tmux, etc are most definitely not implemented using this stack, yet they handle resizing just fine.

by matt_kantor

1/30/2026 at 5:28:14 PM

That doesn't work if you want to preserve scrollback behavior, I think. It only works if you treat the terminal as a grid of characters rather than a width-elastic column into which you pour information from the top.

by mike_hearn

1/30/2026 at 9:40:53 AM

Vibecoded ?

by ttoinou

1/30/2026 at 10:02:28 AM

Claude made it /s

by Kelteseth

1/30/2026 at 10:40:07 PM

Yes yes I'm familiar with the tweet. Nonetheless they drop frames all the time and flicker frequently. The tweet itself is ridiculous when counterpoints like Vim exist, which is much higher performance with much greater complexity. They don't even write much of what the tweet is claiming. They just use Ink, which is an open-source rendering lib on top of Yoga, which is an open-source Flexbox implementation from Meta.

by reissbaker

1/30/2026 at 1:54:02 AM

Don't link out to x, its trash

by replwoacause

1/30/2026 at 4:09:48 AM

Depends on who you follow

by cebert

1/30/2026 at 1:59:22 AM

What? Technology has stopped making sense to me. Drawing a UI with React and rasterizing it to ANSI? Are we competing to see what the least appropriate use of React is? Are they really using React to draw a few boxes of text on screen?

I'm just flabbergasted.

by stavros

1/30/2026 at 2:38:32 AM

There is more than meets the eye for sure. I recently compared a popular TUI library in Go (Bubble Tea) to the most popular Rust library (Ratatui). They use significantly different approaches for rendering. From what I can tell, neither is insane. I haven’t looked to see what Claude Code uses.

by xpe

1/30/2026 at 4:11:49 AM

The further I scroll the more validated I feel for having the very same reaction.

by someguyiguess

1/30/2026 at 2:43:27 AM

It's AI all the way down

But it's very subsidizes when compared to API tokens, so we are all being paid by VCs to write prompts actually.

by TZubiri

1/30/2026 at 5:51:33 AM

And that's why it's taking so much CPU and is a pain to use with tmux.

by Ey7NFZ3P0nzAe

1/30/2026 at 12:56:17 AM

Ah, the hell site, no click.

by derrida

1/30/2026 at 1:30:54 AM

Yes, we do but harnesses are hard to eval, people use them across a huge variety of tasks and sometimes different behaviors tradeoff against each other. We have added some evals to catch this one in particular.

by trq_

1/30/2026 at 12:44:51 PM

Can't you keep the model the same, until the user chooses to use a different model?

by amelius

1/30/2026 at 1:30:19 PM

He said it was the harness, not the model though.

by rovr138

1/30/2026 at 5:37:31 AM

Thank you. Fair enough

by hu3

1/30/2026 at 3:44:01 AM

I’d wager probably not. It’s not like reliability is what will get them marketshare. And the fast pace of industry makes such foundational tech hard to fund

by bushbaba

1/29/2026 at 8:30:44 PM

[flagged]

by awestroke

1/29/2026 at 9:49:24 PM

Please don't post shallow dismissals or cross into personal attack in HN discussions.

https://news.ycombinator.com/newsguidelines.html

by dang

1/30/2026 at 6:50:40 AM

Got it, won't happen again

by awestroke

1/30/2026 at 1:37:01 AM

[flagged]

by macinjosh

1/30/2026 at 2:59:27 AM

the issue is unrelated to the foundational model but rather the prompts and tool calling that encapsulate the model

by jusgu

1/29/2026 at 3:16:21 PM

[SWE-bench co-author here] It seems like they run this test on a subset of 50 tasks, and that they only run the test once per day. So a lot of the movement in accuracy could be attributed to that. I would run on 300 tasks and I'd run the test suite 5 or 10 times per day and average that score. Lots of variance in the score can come from random stuff like even Anthropic's servers being overloaded.

by ofirpress

1/29/2026 at 3:43:08 PM

but degradation from servers being overloaded would be the type of degradation this SHOULD measure no? Unless it's only intended for measuring their quietly distilling models (which they claim not to do? idk for certain)

by Davidzheng

1/29/2026 at 4:45:43 PM

Load just makes LLMs behave less deterministically and likely degrade. See: https://thinkingmachines.ai/blog/defeating-nondeterminism-in...

They don't have to be malicious operators in this case. It just happens.

by botacode

1/29/2026 at 5:01:56 PM

> malicious

It doesn't have to be malicious. If my workflow is to send a prompt once and hopefully accept the result, then degradation matters a lot. If degradation is causing me to silently get worse code output on some of my commits it matters to me.

I care about -expected- performance when picking which model to use, not optimal benchmark performance.

by bgirard

1/29/2026 at 5:38:58 PM

Non-determinism isn’t the same as degradation.

The non-determinism means that even with a temperature of 0.0, you can’t expect the outputs to be the same across API calls.

In practice people tend to index to the best results they’ve experienced and view anything else as degradation. In practice it may just be randomness in either direction from the prompts. When you’re getting good results you assume it’s normal. When things feel off you think something abnormal is happening. Rerun the exact same prompts and context with temperature 0 and you might get a different result.

by Aurornis

1/29/2026 at 8:05:22 PM

This has nothing to do with overloading. The suspicion is that when there is too much demand (or they just want to save costs), Anthropic sometimes uses a less capable (quantized, distilled, etc) version of the model. People want to measure this so there is concrete evidence instead of hunches and feelings.

To say that this measurement is bad because the server might just be overloaded completely misses the point. The point is to see if the model sometimes silently performs worse. If I get a response from "Opus", I want a response from Opus. Or at least want to be told that I'm getting slightly-dumber-Opus this hour because the server load is too much.

by bonoboTP

1/29/2026 at 10:54:37 PM

“Just drink the water, it’s all water.”

by F7F7F7

1/29/2026 at 7:26:40 PM

[dead]

by dingnuts

1/29/2026 at 5:41:10 PM

this is about variance of daily statistics, so I think the suggestions are entirely appropriate in this context.

by novaleaf

1/29/2026 at 6:32:44 PM

The question I have now after reading this paper (which was really insightful) is do the models really get worse under load, or do they just have a higher variance? It seems like the latter is what we should expect, not it getting worse, but absent load data we can't really know.

by strongpigeon

1/29/2026 at 5:08:31 PM

Explain this though. The code is deterministic, even if it relies on pseudo random number generation. It doesn't just happen, someone has to make a conscious decision to force a different code path (or model) if the system is loaded.

by altcognito

1/29/2026 at 6:06:16 PM

Its not deterministic. Any individual floating point mul/add is deterministic, but in a GPU these are all happening in parallel and the accumulation is in the order they happen to complete.

When you add A then B then C, you get a different answer than C then A then B, because floating point, approximation error, subnormals etc.

by minimaltom

1/29/2026 at 10:29:58 PM

It can be made deterministic. It's not trivial and can slow it down a bit (not much) but there are environment variables you can set to make your GPU computations bitwise reproducible. I have done this in training models with Pytorch.

by bonoboTP

1/29/2026 at 11:47:56 PM

There are settings to make it reproducible but they incur a non-negligible drop in performance.

Unsurprising given they amount to explicit synchronization to make the order of operations deterministic.

by minimaltom

1/29/2026 at 5:43:52 PM

Not deterministic. https://thinkingmachines.ai/blog/defeating-nondeterminism-in...

by chrisjj

1/29/2026 at 7:51:38 PM

For all practical purposes any code reliant on the output of a PRNG is non-deterministic in all but the most pedantic senses... And if the LLM temperature isn't set to 0 LLMs are sampling from a distribution.

If you're going to call a PRNG deterministic then the outcome of a complicated concurrent system with no guaranteed ordering is going to be deterministic too!

by jmalicki

1/29/2026 at 8:08:55 PM

No, this isn't right. There are totally legitimate use cases for PRNGs as sources of random number sequences following a certain probability distribution where freezing the seed and getting reproducibility is actually required.

by gmueckl

1/29/2026 at 9:23:45 PM

And for a complicated concurrent system you can also replay the exact timings and orderings as well!

by jmalicki

1/30/2026 at 9:35:12 AM

That's completely different from PRNGs. I don't understand why you think those things belong together.

by gmueckl

1/29/2026 at 8:08:31 PM

How is this related to overloading? The nondeterminism should not be a function of overloading. It should just time out or reply slower. It will only be dumber if it gets rerouted to a dumber, faster model eg quantized.

by bonoboTP

1/30/2026 at 1:23:22 AM

Temperature can't be literally zero, or it creates a divide by zero error.

When people say zero, it is shorthand for “as deterministic as this system allows”, but it's still not completely deterministic.

by joquarky

1/30/2026 at 3:22:17 AM

Zero temp just uses argmax, which is what softmax approaches if you take the limit of T to zero anyway. So it could very well be deterministic.

by forgotTheLast

1/29/2026 at 5:44:15 PM

Floating point math isn't associative for operations that are associative in normal math.

by pertymcpert

1/29/2026 at 6:11:22 PM

That would just add up to statistical noise instead of 10% degradation over a week.

by measurablefunc

1/29/2026 at 6:56:17 PM

Catastrophic error accumulation can produce more profound effects than noise.

by kevin_thibedeau

1/29/2026 at 8:33:43 PM

Just to make sure I got this right. They serve millions of requests a day & somehow catastrophic error accumulation is what is causing the 10% degradation & no one at Anthropic is noticing it. Is that the theory?

by measurablefunc

1/29/2026 at 10:17:59 PM

There's a million algorithms to make LLM inference more efficient as a tradeoff for performance, like using a smaller model, using quantized models, using speculative decoding with a more permissive rejection threshold, etc etc

by make3

1/29/2026 at 5:39:42 PM

It takes a different code path for efficiency.

e.g

if (batch_size > 1024): kernel_x else: kernel_y

by FL33TW00D

1/29/2026 at 6:33:54 PM

The primary (non malicious, non stupid) explanation given here is batching. But I think you would find looking at large-scale inference the batch sizes being ran on any given rig are fairly static - there is a sweet spot for any given model part ran individually between memory consumption and GPU utilization, and generally GPUs do badly at job parallelism.

I think the more likely explanation is again with the extremely heterogeneous compute platforms they run on.

by stefan_

1/29/2026 at 7:39:59 PM

That's why I'd love to get stats on load/hardware/location of where my inference is running. Looking at you Trainiuim.

by hatmanstack

1/30/2026 at 11:54:56 AM

Why do you think batching has anything to do with the model getting dumber? Do you know what batching means?

by bonoboTP

1/30/2026 at 6:51:52 PM

Well if you were to read the link you might just find out! Today is your chance to be less dumb than the model!

by stefan_

1/30/2026 at 8:24:49 PM

I checked the link, it never says that the model's prediction get lower quality due to batching, just nondeterministic. I don't understand why people conflate these things. Also it's unlikely that they use smaller batch sizes when load is lower. They just likely spin up and down GPU serves based on demand, or more likely, reallocate servers and gpus between different roles and tasks.

by bonoboTP

1/29/2026 at 10:15:51 PM

It's very clearly a cost tradeoff that they control and that should be measured.

by make3

1/30/2026 at 1:03:23 PM

I'd argue that it depends how that degradation manifests whether you want to include it or not.

Consider two scenarios: (1) degradation leads to the model being routed behind the scenes to a different server, with subtly different performance characteristics, all unbeknownst to the user; (2) degradation leads to the model refusing a request and returning an "overloaded" message.

In the first case, absolutely you want to include that because that's the kind of lack of transparency about performance that you'd want signal on. In the second case, an automated test harness might fail, but in the real world the user will just wait and retry when the server is under less load. Maybe you don't include that because it's actually misleading to say that performance (in terms of the model's intelligence, which is how the benchmark will be interpreted) is worse.

by samusiam

1/29/2026 at 3:54:33 PM

noob question: why would increased demand result in decreased intelligence?

by megabless123

1/29/2026 at 4:08:25 PM

An operator at load capacity can either refuse requests, or move the knobs (quantization, thinking time) so requests process faster. Both of those things make customers unhappy, but only one is obvious.

by exitb

1/29/2026 at 4:21:03 PM

This is intentional? I think delivering lower quality than what was advertised and benchmarked is borderline fraud, but YMMV.

by codeflo

1/29/2026 at 4:38:26 PM

Per Anthropic’s RCA linked in Ops post for September 2025 issues:

“… To state it plainly: We never reduce model quality due to demand, time of day, or server load. …”

So according to Anthropic they are not tweaking quality setting due to demand.

by TedDallas

1/29/2026 at 4:44:42 PM

And according to Google, they always delete data if requested.

And according to Meta, they always give you ALL the data they have on you when requested.

by rootnod3

1/29/2026 at 5:13:33 PM

>And according to Google, they always delete data if requested.

However, the request form is on display in the bottom of a locked filing cabinet stuck in a disused lavatory with a sign on the door saying ‘Beware of the Leopard'.

by entropicdrifter

1/29/2026 at 5:19:57 PM

What would you like?

by groundzeros2015

1/29/2026 at 5:37:33 PM

An SLA-style contractually binding agreement.

by AlexandrB

1/29/2026 at 7:02:30 PM

I bet this is available in large enterprise agreements. How much are you willing to pay for it?

by edmundsauto

1/29/2026 at 8:42:29 PM

Priced in.

by Onavo

1/29/2026 at 4:58:18 PM

I guess I just don't know how to square that with my actual experiences then.

I've seen sporadic drops in reasoning skills that made me feel like it was January 2025, not 2026 ... inconsistent.

by cmrdporcupine

1/29/2026 at 5:40:35 PM

LLMs sample the next token from a conditional probability distribution, the hope is that dumb sequences are less probable but they will just happen naturally.

by quadrature

1/29/2026 at 6:57:49 PM

Funny how those probabilities consistently at 2pm UK time when all the Americans come online...

by mattmanser

1/29/2026 at 6:48:00 PM

It's more like the choice between "the" and "a" than "yes" and "no".

by tempaccount420

1/29/2026 at 5:29:45 PM

I wouldn't doubt that these companies would deliberately degrade performance to manage load, but it's also true that humans are notoriously terrible at identifying random distributions, even with something as simple as a coin flip. It's very possible that what you view as degradation is just "bad RNG".

by root_axis

1/29/2026 at 5:31:56 PM

yep stochastic fantastic

these things are by definition hard to reason about

by cmrdporcupine

1/29/2026 at 5:45:03 PM

That's about model quality. Nothing about output quality.

by chrisjj

1/29/2026 at 6:16:42 PM

Thats what is called an "overly specific denial". It sounds more palatable if you say "we deployed a newly quantized model of Opus and here are cherry picked benchmarks to show its the same", and even that they don't announce publicly.

by stefan_

1/29/2026 at 4:27:57 PM

Personally, I'd rather get queued up on a long wait time I mean not ridiculously long but I am ok waiting five minutes to get correct it at least more correct responses.

Sure, I'll take a cup of coffee while I wait (:

by mcny

1/29/2026 at 4:38:18 PM

i’d wait any amount of time lol.

at least i would KNOW it’s overloaded and i should use a different model, try again later, or just skip AI assistance for the task altogether.

by lurking_swe

1/29/2026 at 4:46:08 PM

They don't advertise a certain quality. You take what they have or leave it.

by direwolf20

1/29/2026 at 4:28:37 PM

> I think delivering lower quality than what was advertised and benchmarked is borderline fraud

welcome to the Silicon Valley, I guess. everything from Google Search to Uber is fraud. Uber is a classic example of this playbook, even.

by bpavuk

1/29/2026 at 4:27:51 PM

If there's no way to check, then how can you claim it's fraud? :)

by denysvitali

1/29/2026 at 4:26:12 PM

There is no level of quality advertised, as far as I can see.

by chrisjj

1/29/2026 at 6:25:22 PM

What is "level of quality"? Doesn't this apply to any product?

by pseidemann

1/29/2026 at 6:43:04 PM

In this case, it is benchmark performance. See the root post.

by chrisjj

1/29/2026 at 4:33:25 PM

[flagged]

by copilot_king

1/29/2026 at 4:45:24 PM

That number is a sliding window, isn't it?

by rootnod3

1/29/2026 at 5:10:31 PM

I'd wager that lower tok/s vs lower quality of output would be two very different knobs to turn.

by sh3rl0ck

1/29/2026 at 4:06:45 PM

I've seen some issues with garbage tokens (seemed to come from a completely different session, mentioned code I've never seen before, repeated lines over and over) during high load, suspect anthropic have some threading bugs or race conditions in their caching/inference code that only happen during very high load

by awestroke

1/29/2026 at 4:01:21 PM

It would happen if they quietly decide to serve up more aggressively distilled / quantised / smaller models when under load.

by vidarh

1/29/2026 at 5:45:00 PM

Or just reducing the reasoning tokens.

by seunosewa

1/29/2026 at 4:30:57 PM

They advertise the Opus 4.5 model. Secretly substituting a cheaper one to save costs would be fraud.

by chrisjj

1/29/2026 at 7:29:21 PM

If you use the API, you pay for a specific model, yes, but even then there are "workarounds" for them, such as someone else pointed out by reducing the amount of time they let it "think".

If you use the subscriptions, the terms specifically says that beyond the caps they can limit your "model and feature usage, at our discretion".

by vidarh

1/29/2026 at 7:47:31 PM

Sure. I was separating the model - which Anthropic promises not to downgrade - and the "thinking time" - which Anthropic doesn't promise not to downgrade. It seems the latter is very likely the culprit in this case.

by chrisjj

1/29/2026 at 5:01:43 PM

Old school Gemini used to do this. It was super obvious because mid day the model would go from stupid to completely brain dead. I have a screenshot of Google's FAQ on my PC from 2024-09-13 that says this (I took it to post to discord):

> How do I know which model Gemini is using in its responses?

> We believe in using the right model for the right task. We use various models at hand for specific tasks based on what we think will provide the best experience.

by kingstnap

1/29/2026 at 5:47:14 PM

> We use various models at hand for specific tasks based on what we think will provide the best experience

... for Google :)

by chrisjj

1/29/2026 at 4:03:17 PM

from what I understand this can come from the batching of requests.

by Wheaties466

1/29/2026 at 4:31:43 PM

So, a known bug?

by chrisjj

1/29/2026 at 7:28:07 PM

No, basically, the requests are processed in batches, together, and the order they're listed in matters for the results, as the grid (tiles) that the GPU is ultimately processing, are different depending on what order they entered at.

So if you want batching + determinism, you need the same batch with the same order which obviously don't work when there are N+1 clients instead of just one.

by embedding-shape

1/29/2026 at 7:36:25 PM

Sure, but how can that lead to increased demand resulting in decreased intelligence? That is the effect we are discussing.

by chrisjj

1/29/2026 at 8:00:04 PM

Small subtle errors that are only exposed at certain execution parts could be one. You might place things differently onto the GPU depending on how large the batch is, if you've found one way to be faster batch_size<1024, but another when batch_size>1024. As number of concurrent incoming requests goes up, you increase batch_size. Just one possibility, guess there could be a multitude of reasons, as it's really hard to reason about until you sit with the data in front of you. vLLM has had bugs with these sort of thing too, so wouldn't surprise me.

by embedding-shape

1/29/2026 at 8:09:32 PM

Wouldn't you think that was as likely to increase as decrease intelligence, so average to nil in the benchmarks?

by chrisjj

1/29/2026 at 8:22:13 PM

No, I'm not sure how that'd make sense. Either you're making the correct (expected) calculations, or you're getting it wrong. Depending the type of wrong or how wrong, could go from "used #2 in attention instead of #1" so "blue" instead of "Blue" or whatever, to completely incoherent text and garbled output.

by embedding-shape

1/29/2026 at 8:38:13 PM

I accept errors are more likely to decrease "intelligence". But I don't see how increased load, through batching, is any more likely to increase than decrease errors.

by chrisjj

1/29/2026 at 3:46:11 PM

I've personally witnessed large variability in behaviour even within a given session -- which makes sense as there's nothing stopping Anthropic from shuttling your context/session around load balanced through many different servers, some of which might be quantized heavily to manage load and others not at all.

I don't know if they do this or not, but the nature of the API is such you could absolutely load balance this way. The context sent at each point is not I believe "sticky" to any server.

TLDR you could get a "stupid" response and then a "smart" response within a single session because of heterogeneous quantization / model behaviour in the cluster.

by cmrdporcupine

1/29/2026 at 3:58:26 PM

I've defended opus in the last weeks but the degradation is tangible. It feels like it degraded by a generation tbh.

by epolanski

1/29/2026 at 4:10:59 PM

it's just extremely variable

by cmrdporcupine

1/29/2026 at 3:18:43 PM

Hope you don't mind the unrelated question:

How do you pay for those SWE-bench runs?

I am trying to run a benchmark but it is too expensive to run enough runs to get a fair comparison.

https://mafia-arena.com

by mohsen1

1/29/2026 at 3:22:39 PM

Benchmarks can get costly to run- you can reach out to frontier model creators to try and get them to give you free credits, but usually they'll only agree to that once your benchmark is pretty popular.

by ofirpress

1/29/2026 at 3:38:05 PM

so basically they know requests using your API key should be treated with care?

by Dolores12

1/29/2026 at 7:18:17 PM

they could but you can also have some trust in anthropic to have some integrity there, these are earnest people.

"trust but verify" ofc . https://latent.space/p/artificialanalysis do api keys but also mystery shopper checks

by swyx

1/29/2026 at 9:16:06 PM

> these are earnest people.

I agree.

I'll also add that when my startup got acquired into a very large, well-known valley giant with a sterling rep for integrity and I ended up as a senior executive - over time I got a first-hand education on the myriad ways genuinely well-intentioned people can still end up being the responsible party(s) presiding over a system doing net-wrong things. All with no individual ever meaning to or even consciously knowing.

It's hard to explain and I probably wouldn't have believed myself before I saw and experienced it. Standing against an overwhelming organizational tide is stressful and never leads to popularity or promotion. I think I probably managed to move on before directly compromising myself but preventing that required constant vigilance and led to some inter-personal and 'official' friction. And, frankly, I'm not really sure. It's entirely possible I bear direct moral responsibility for a few things I believe no good person would do as an exec in a good company.

That's the key take-away which took me a while to process and internalize. In a genuinely good organization with genuinely good people, it's not "good people get pressured by constraints and tempted by extreme incentives, then eventually slip". I still talk with friends who are senior execs there and sometimes they want to talk about whether something is net good or bad. I kind of dread the conversation going there because it's inevitably incredibly complex and confusing. Philosopher's trolley car ethics puzzles pale next to these multi-layered, messy conundrums. But who else are they going to vent to who might understand? To be clear, I still believe that company and its leadership to be one of the most moral, ethical and well-intentioned in the valley. I was fortunate to experience the best case scenario.

Bottom line: if you believe earnest, good people being in charge is a reliable defense against the organization doing systemically net-wrong things - you don't comprehend the totality of the threat environment. And that's okay. Honestly, you're lucky. Because the reality is infinitely more ambiguously amoral than white hats vs black hats - at the end of the day the best the 'very good people' can manage is some shade of middle gray. The saddest part is that good people still care, so they want to check the shade of their hat but no one can see if it's light enough to at least tell yourself "I did good today."

by mrandish

1/29/2026 at 10:10:27 PM

Someone posted this here the other day and it uses _Demons_ to discuss exactly your point.

https://possessedmachines.com/

by pluralmonad

1/29/2026 at 10:48:14 PM

Wow. Only one page in and already bookmarked to absorb later. Thanks for the link.

by mrandish

1/30/2026 at 7:49:30 AM

That's why we're setting up adversarial benchmarks to test if they are doing the thing they promised not to do, because we totally trust them.

by debugnik

1/29/2026 at 3:52:44 PM

[dead]

by Deklomalo

1/29/2026 at 3:50:34 PM

The last thing a proper benchmark should do is reveal it's own API key.

by epolanski

1/29/2026 at 4:58:53 PM

IMO it should need a third party running the LLM anyway. Otherwise the evaluated company could notice they're receiving the same requests daily and discover benchmarking that way.

by plagiarist

1/29/2026 at 10:32:09 PM

With the insane valuations and actual revenue at stake, benchmarkers should assume they're assessing in an adversarial environment. Whether from intentional gaming, training to the test, or simply from prioritizing things likely to make results look better, targeting benchmarks will almost certainly happen.

We already know large graphics card manufacturers tuned their drivers to recognize specific gaming benchmarks. Then when that was busted, they implemented detecting benchmarking-like behavior. And the money at stake in consumer gaming was comparatively tiny compared to current AI valuations. The cat-and-mouse cycle of measure vs counter-measure won't stop and should be a standard part of developing and administering benchmark services.

Beyond hardening against adversarial gaming, benchmarkers bear a longer term burden too. Per Goodhart's Law, it's inevitable good benchmarks will become targets. The challenge is the industry will increasingly target performing well on leading benchmarks, both because it drives revenue but also because it's far clearer than trying to glean from imprecise surveys and fuzzy metrics what helps average users most. To the extent benchmarks become a proxy for reality, they'll bear the burden of continuously re-calibrating their workloads to accurately reflect reality as user's needs evolve.

by mrandish

1/29/2026 at 6:08:29 PM

But that's removing a component that's critical for the test. We as users/benchmark consumers care that the service as provided by Anthropic/OpenAI/Google is consistent over time given the same model/prompt/context

by jabedude

1/29/2026 at 8:21:42 PM

Might as well have the free tokens, then, especially if it is an open benchmark they are already aware of. If they want to game it they cannot be stopped from doing so when it's on their infra.

by plagiarist

1/29/2026 at 4:17:32 PM

That's a good thought I hadn't had, actually.

by sejje

1/29/2026 at 3:23:45 PM

yes I reached out to them but as you say it's a chicken-and-egg problem.

Thanks!

by mohsen1

1/29/2026 at 10:54:08 PM

> I would run on 300 tasks and I'd run the test suite 5 or 10 times per day and average that score.

assume this is because of model costs. anthropic could either throw some credits their way (would be worthwhile to dispel the 80 reddit posts a day about degrading models and quantization) or OP could throw up a donation / tip link

by nikcub

1/29/2026 at 11:56:39 PM

Probably, but with a small sample size like that, they should probably be taking the uncertainty into account, because I wouldn't be surprised if a lot of this variation falls within expected noise.

E.g. some binomial interval proportions (aka confidence intervals).

by simsla

1/29/2026 at 10:57:58 PM

Then you'd get people claiming that the benchmarks were 'paid for' by anthropic

by phist_mcgee

1/29/2026 at 11:05:11 PM

one thing you learn from being on the internet is that you're never going to satisfy everybody

by nikcub

1/29/2026 at 4:22:44 PM

The degradation may be more significant within the day than at the same time every day.

by seunosewa

1/29/2026 at 4:53:38 PM

Sure, but it's still useful insight to see how it performs over time. Of course, cynically, Anthropic could game the benchmark by routing this benchmark's specific prompts to an unadulterated instance of the model.

by GoatInGrey

1/29/2026 at 4:43:01 PM

Sorry what?

"You can't measure my Cloud Service's performance correctly if my servers are overloaded"?

"Oh, you just measured me at bad times each day. On only 50 different queries."

So, what does that mean? I have to pick specific times during the day for Claude to code better?

Does Claude Code have office hours basically?

by rootnod3

1/29/2026 at 6:21:54 PM

This has been happening for years. Tgere's a great paper from microsoft on Deepspeed AI inference.

Basically the paper showed methods for how to handle heavy traffic load by changing model requirements or routing to different ones. This was awhile ago and I'm sure it's massively more advanced now.

Also why some of AI's best work for me is early morning and weekends! So yes, the best time to code with modern LLM stacks is when nobody else is. It's also possibly why we go through phases of "they neutered the model" some time after a new release.

by johnsmith1840

1/29/2026 at 7:58:06 PM

I wonder if my great experience with claude are partly due to the fact that my working hours don't overlap with the US west coast

by kuboble

1/29/2026 at 7:20:43 PM

chill out, ofir does not work for anthropic. he's just saying there's inherent variability in LLMs and you need to at least 30x the samples that OP is doing in order to make any form of statistically significant conclusions.

by swyx

1/29/2026 at 4:44:08 PM

[flagged]

by copilot_king

1/29/2026 at 4:47:32 PM

Verily, my vichyssoise of verbiage veers most verbose, so let me run that thing out of tokens fast.

by rootnod3

1/29/2026 at 7:05:17 PM

According to Anthropic: "We never reduce model quality due to demand, time of day, or server load."

https://www.anthropic.com/engineering/a-postmortem-of-three-...

by bhk

1/29/2026 at 7:10:46 PM

They've had issues before with things like "TPU top-k error - Claude sometimes dropped the best next token" (https://www.anthropic.com/engineering/a-postmortem-of-three-...) so what's going on might not be intentional even.

by embedding-shape

1/29/2026 at 9:47:46 PM

That issue did not have any time of day dependence

by mgraczyk

1/29/2026 at 3:50:05 PM

Stilll relevant over time.

by epolanski

1/29/2026 at 4:24:40 PM

> Lots of variance in the score can come from random stuff like even Anthropic's servers being overloaded.

Are you suggesting result accuracy varies with server load?

by chrisjj

1/29/2026 at 3:55:07 PM

"Lots of variance in the score can come from random stuff like even Anthropic's servers being overloaded"

Aha, so the models do degrade under load.

by dana321

1/29/2026 at 3:38:33 PM

Agreed, this benchmark would be much more useful ran multiple times a day. That could reveal degredation in line with load patterns.

by cedws

1/29/2026 at 3:52:12 PM

For CC, I suspect it also need to be testing and labeling separate runs against subscription, public API and Bedrock-served models?

It’s a terrific idea to provide this. ~Isitdownorisitjustme for LLMs would be the parakeet in the coalmine that could at least inform the multitude of discussion threads about suspected dips in performance (beyond HN).

What we could also use is similar stuff for Codex, and eventually Gemini.

Really, the providers themselves should be running these tests and publishing the data.

The availability status information is no longer sufficient to gauge the service delivery because it is by nature non-deterministic.

by bredren

1/29/2026 at 7:21:30 PM

i recall another project here on HN maybe 4-6 months ago that would run tests 4x a day or something. not sure how to find them again

by swyx

1/30/2026 at 12:12:08 PM

Why should users care about Anthropic's servers being overloaded?

by sjtgraham

1/29/2026 at 3:19:50 PM

Why I do not believe this shows Anthropic serves folks a worse model:

1. The percentage drop is too low and oscillating, it goes up and down.

2. The baseline of Sonnet 4.5 (the obvious choice for when they have GPU busy for the next training) should be established to see Opus at some point goes Sonnet level. This was not done but likely we would see a much sharp decline in certain days / periods. The graph would look like dominated by a "square wave" shape.

3. There are much better explanations for this oscillation: A) They have multiple checkpoints and are A/B testing, CC asks you feedbacks about the session. B) Claude Code itself gets updated, as the exact tools version the agent can use change. In part it is the natural variability due to the token sampling that makes runs not equivalent (sometimes it makes suboptimal decisions compared to T=0) other than not deterministic, but this is the price to pay to have some variability.

by antirez

1/29/2026 at 4:39:27 PM

I believe the science, but I've been using it daily and it's been getting worse, noticeably.

by levkk

1/29/2026 at 4:44:19 PM

Is it possible that your expectations are increasing, not that the model is getting worse?

by warkdarrior

1/29/2026 at 4:59:25 PM

Possible, though you eventually run into types of issues that you recall the model just not having before. Like accessing a database or not following the SOP you have it read each time it performs X routine task. There are also patterns that are much less ambiguous like getting caught in loops or failing to execute a script it wrote after ten attempts.

by GoatInGrey

1/29/2026 at 9:28:41 PM

yes but i keep wondering if that's just the game of chance doing its thing

like these models are nondeterministic right? (besides the fact that rng things like top k selection and temperature exist)

say with every prompt there is 2% odds the AI gets it massively wrong. what if i had just lucked out the past couple weeks and now i had a streak of bad luck?

and since my expectations are based on its previous (lucky) performance i now judge it even though it isn't different?

or is it giving you consistenly worse performance, not able to get it right even after clearing context and trying again, on the exact same problem etc?

by merlindru

1/29/2026 at 11:08:42 PM

I’ve had Opus struggle on trivial things that Sonnet 3.5 handled with ease.

It’s not so much that the implementations are bad because the code is bad (the code is bad). It’s that it gets extremely confused and starts to frantically make worse and worse decisions and questioning itself. Editing multiple files, changing its mind and only fixing one or two. Reseting and overriding multiple batches of commits without so much as a second thought and losing days of work (yes, I’ve learned my lesson).

It, the model, can’t even reason with the decisions it’s making from turn to turn. And the more opaque agentic help it’s getting the more I suspect that tasks are being routed to much lesser models (not the ones we’ve chosen via /model or those in our agent definitions) however Anthropic chooses.

In these moments I mind as well be using Haiku.

by F7F7F7

1/29/2026 at 6:27:04 PM

I have to concur. And to the question about understanding what its good and bad at; no, tasks that it could accomplish quickly and easily just a month ago, now require more detailed prompting and constant "erroneous direction correction."

It's almost as if, as tool use and planning capabilities have expanded, Claude (as a singular product) is having a harder time coming up with simple approaches that just work, instead trying to use tools and patterns that complicate things substantially and introduce much more room for errors/errors of assumption.

It also regularly forgets its guidelines now.

I can't tell you how many times it's suggested significant changes/refactors to functions because it suddenly forgets we're working in an FP codebase and suggests inappropriate imperative solutions as "better" (often choosing to use language around clarity/consistency when the solutions are neither).

Additionally, it has started taking "initiative" in ways it did not before, attempting to be helpful but without gathering the context needed to do so properly when stepping outside the instruction set. It just ends up being much messier and inaccurate.

I have to regularly just clear my prompt and start again with guardrails that have either: already been established, or have not been needed previously / are only a result of the over-zealousness of the work its attempting to complete.

by davidee

1/29/2026 at 7:49:03 PM

I assume, after any compacting of the context window that the session is more or less useless at that point I’ve never had consistent results after compacting.

by conception

1/29/2026 at 11:41:05 PM

Compacting equals death of the session in my process. I do everything I can to avoid hitting it. If I accidentally fly too close to the sun and compact I tend to revert and start fresh. As soon as it compacts it's basically useless

by justinlivi

1/29/2026 at 11:02:04 PM

Multiple concurrences a choir or a mob?

1pm EST time it’s all down hill until around 8 or 9pm EST time.

Late nights and weekends is smooth sailing.

by F7F7F7

1/30/2026 at 3:46:16 AM

I’m finding Gemini and chatGPT web terminal to out perform Claude code. The context becomes too much for the LLM, and tries to make up for it by doing more file read ops.

by bushbaba

1/30/2026 at 11:34:15 AM

Sounds like you might want to refactor the code if the individual files are too big and it can't find what it's looking for?

by samusiam

1/29/2026 at 5:06:29 PM

Any chance you’re just learning more about what the model is and is not useful for?

by emp17344

1/29/2026 at 5:23:14 PM

I dunno about everyone else but when I learn more about what a model is and is not useful for, my subjective experience improves, not degrades.

by jerf

1/29/2026 at 5:34:04 PM

Not when the product is marketed as a panacea.

by emp17344

1/29/2026 at 6:36:01 PM

There are some days where it acts staggeringly bad, beyond baselines.

But it’s impossible to actually determine if it’s model variance, polluted context (if I scold it, is it now closer in latent space to a bad worker, and performs worse?), system prompt and tool changes, fine tunes and AB tests, variances in top P selection…

There’s too many variables and no hard evidence shared by Anthropic.

by data-ottawa

1/29/2026 at 6:47:04 PM

No because switching to the API with the same prompt immediately fixes it.

There's little incentive to throttle the API. It's $/token.

by acuozzo

1/29/2026 at 6:44:22 PM

I too suspect the A/B testing is the prime suspect: context window limits, system prompts, MAYBE some other questionable things that should be disclosed.

Either way, if true, given the cost I wish I could opt-out or it were more transparent.

Put out variants you can select and see which one people flock to. I and many others would probably test constantly and provide detailed feedback.

All speculation though

by TIPSIO

1/29/2026 at 11:13:09 PM

Whenever I see new behaviors and suspect I’m being tested on I’ll typically see a feedback form at some point in that session. Well, that and dropping four letter words.

I know it’s more random sampling than not. But they are definitely using our codebases (and in some respects our livelihoods) as their guinea pigs.

by F7F7F7

1/30/2026 at 11:36:11 AM

If that's the case, then as a benchmark operator you'd want to run the benchmark through multiple different accounts on different machines to average over A/B test noise.

by samusiam

1/29/2026 at 10:22:17 PM

It would be very easy for them to switch the various (compute) cost vs performance knobs down depending on load to maintain a certain latency; you would see oscillations like this, especially if the benchmark is not always run exactly at the same time every day.

& it would be easy for them to start with a very costly inference setup for a marketing / reputation boost, and slowly turn the knobs down (smaller model, more quantized model, less thinking time, fewer MoE experts, etc)

by make3

1/29/2026 at 4:25:20 PM

> 1. The percentage drop is too low and oscillating, it goes up and down.

How do you define “too low”, they make sure to communicate about the statistical significance of their measurements, what's the point if people can just claim it's “too low” based on personal vibes…

by littlestymaar

1/29/2026 at 4:20:09 PM

4. The graph starts January 8.

Why January 8? Was that an outlier high point?

IIRC, Opus 4.5 was released late november.

by eterm

1/29/2026 at 11:14:24 PM

Right after the Holiday double token promotion users felt (perceived) a huge regression in capabilities. I bet that triggered the idea.

by F7F7F7

1/29/2026 at 5:46:49 PM

People were away for the holidays. What do you want them to do?

by pertymcpert

1/29/2026 at 4:23:45 PM

Or maybe, juste maybe, that's when they started testing…

by littlestymaar

1/29/2026 at 4:28:05 PM

Wayback machine has nothing for this site before today, and article is "last updated Jan 29".

A benchmark like this ought to start fresh from when it is published.

I don't entirely doubt the degradation, but the choice of where they went back to feels a bit cherry-picked to demonstrate the value of the benchmark.

by eterm

1/29/2026 at 4:30:41 PM

Which makes sense, you gotta wait until you get enough data before you can communicate on the said data…

If anything it's coherent with the fact that they very likely didn't have data earlier than January the 8th.

by littlestymaar

1/29/2026 at 6:40:37 PM

> We model tests as Bernoulli random variables and compute 95% confidence intervals around daily, weekly, and monthly pass rates. Statistically significant differences in any of those time horizons are reported.

They're going to need to provide a lot more detail on their methodology, because that doesn't make a lot of sense. From their graphs, they seem to be calculating the confidence interval around the previous value, then determining whether the new value falls outside of it. But that's not valid for establishing the statistical significance of a difference. You need to calculate the confidence interval of the difference itself, and then see if all the values within that confidence interval remain positive (if it excludes 0). This is because both the old and new measurement have uncertainty. Their approach seems to be only considering uncertainty for one of them.

They should also really be more specific about the time periods. E.g. their graphs only show performance over the past 30 days, but presumably the monthly change is comparing the data from 60 to 31 days ago, to the data from 30 days ago until yesterday? In which case the weekly graph really ought to be displaying the past two months, not one month.

by crazygringo

1/29/2026 at 3:06:04 PM

Simply search user prompts for curse words and then measure hostility sentiment. User hostility rises as agents fail to meet expectations.

by Dowwie

1/29/2026 at 3:53:19 PM

Maybe im overlooking something obvious but how do you 'simply' scan the content of Claude users their prompts?

by preuceian

1/29/2026 at 5:04:23 PM

GP was making a joke, but Anthropic could implement this if they wanted to. Not a bad metric actually if you can measure it cheaply enough.

by gordonhart

1/29/2026 at 3:13:29 PM

I uh might be skewing that as I generally just use a lot of curse words with Claude by default

by mrbananagrabber

1/29/2026 at 3:09:25 PM

I'm glad I'm not the only one.

by Trufa

1/29/2026 at 4:18:36 PM

One time I cussed Claude out so hard that it actually quit his doom-loop and fixed the thing.

It's the only time cussing worked, though.

by sejje

1/29/2026 at 7:16:44 PM

I don’t know. My gut feeling is it seems to help.

by bn-l

1/29/2026 at 3:14:40 PM

I feel bad about it but sometimes it's so daft, I can't even xD

It's not my fault, they set high standards!

by ctxc

1/29/2026 at 3:20:03 PM

there are many times where I just do it myself and it thinks it did well.

by smotched

1/29/2026 at 11:16:07 PM

There’s a correlation between getting the “How’s Claude Doing This Session?” (Or whatever) and four letter words.

It’s not always then, but it often follows it.

by F7F7F7

1/29/2026 at 4:26:28 PM

Or there are global events that stress people out .. or their expectations change over time. Not that simple ;)

by mhl47

1/29/2026 at 7:54:44 PM

Good thing expectations are perfectly constant!

by nateberkopec

1/29/2026 at 8:34:10 PM

This might be strangely effective.

by mbm

1/29/2026 at 3:12:30 PM

There was a moment about a week ago where Claude went down for about an hour. And right after it came back up it was clear a lot of people had given up and were not using it.

It was probably 3x faster than usual. I got more done in the next hour with it than I do in half a day usually. It was definitely a bit of a glimpse into a potential future of “what if these things weren’t resource constrained and could just fly”.

by silverlight

1/29/2026 at 3:24:22 PM

I had that exact same feeling during the US holidays where I got to enjoy 2x usage limits and everything just seemed to work well

by yoavsha1

1/29/2026 at 3:48:02 PM

I had terrible results during the holidays -- it wasn't slow but it was clear they were dealing with the load by quantizing in spots because there were entire chunks of days when the results from it were so terrible I gave up and switched to using Gemini or Codex via opencode.

by cmrdporcupine

1/30/2026 at 3:26:04 AM

I find that if I have my rabbit's foot and lucky socks on, I win working code ~1.2x more often.

by abathologist

1/29/2026 at 5:19:00 PM

Noticed the exact same thing a few days ago. So much so that I went on twitter and HN to search for “claude speed boost” to see if there was a known new release. Felt like the time I upgraded from a 2400 baud modem to a 14.4 as a kid - everything was just lightning fast (for a brief shining moment).

by nlh

1/29/2026 at 3:52:53 PM

I would also regret it if they become that fast; right now I can really take a moment to enjoy the hard work the model is doing for me.

by svdr

1/30/2026 at 8:51:17 AM

https://xkcd.com/303/

the evolution of this xkcd

by asimovDev

1/29/2026 at 3:14:00 PM

Wouldn't be surprised if they slowly start quantizing their models over time. Makes it easier to scale and reduce operational cost. Also makes a new release have more impact as it will be more notably "better" than what you've been using the past couple of days/weeks.

by dajonker

1/29/2026 at 4:17:53 PM

It sure feels like they do this. They claim they don't, but using it every day for 5-10 hours a day. You notice when something changes.

This last week it seems way dumber than before.

by kilroy123

1/29/2026 at 5:46:44 PM

I don't think so. There are other knobs they can tweak to reduce load that affect quality less than quantizing. Like trimming the conversation length without telling you, reducing reasoning effort, etc.

by 9cb14c1ec0

1/29/2026 at 9:48:57 PM

We never do anything that reduce model intelligence like that

by mgraczyk

1/31/2026 at 9:59:14 AM

You said "like that", ok but there may be some truth to reduced model intelligence. Also how AWS deployed Anthropic models for Amazons Kiro feel much dumber than those controlled entirely by Anthropic. Can't be just me

by siva7

1/29/2026 at 4:51:16 PM

I would be surprised tbh.

Anthropic does not exactly act like they're constrained by infra costs in other areas, and noticeably degrading a product when you're in tight competition with 1 or 2 other players with similar products seems like a bad place to start.

I think people just notice the flaws in these models more the longer they use them. Aka the "honeymoon-hangover effect," a real pattern that has been shown in a variety of real world situations.

by eli

1/29/2026 at 8:49:33 PM

Open weights models such as GPT-OSS, Kimi K2.x are trained with 4 bit layers. So it wouldn't come as a surprise if the closed models do similar things. If I compare Kimi K2.5 and Opus 4.5 on openrouter, output tokens are about 8x more expensive for Opus, which might indicate Opus is much larger and doesn't quantize, but the claude subscription plans muddy the waters on price comparison a lot.

by kristianp

1/29/2026 at 3:41:42 PM

Benchmarks like ARG AGI are super price correlated and cheap to run. I think it's very easy to prove that the models are degrading.

by YetAnotherNick

1/29/2026 at 3:57:11 PM

Oooff yes I think that is exactly the kind of shenanigans they might pull.

Ultimately I can understand if a new model is coming in without as much optimization then it'll add pressure to the older models achieving the same result.

Nice plausible deniability for a convenient double effect.

by rustyhancock

1/29/2026 at 5:13:03 PM

I haven't noticed much difference in Claude, but I swear gemini 3 pro preview was better in the first week or two and later started feeling like they quantized it down to hell.

by Roark66

1/29/2026 at 4:57:08 PM

Lack of transparency as regards "thinking power"-consistency is a big gripe of mine with LLM providers. It's even worse with ChatGPT and the like. E.g. I had to learn the hard way that at >45k input tokens ChatGPT 5.2 Thinking Extended bumps its intelligence down so hard that it can't follow basic instructions (or it somehow truncates the input, losing the instructions). It sucks to lose confidence in an otherwise great tool. I would 100x prefer being forced to back-off, or getting a straight-no, than getting silently downgraded. Transparency is a big deal.

by dmos62

1/29/2026 at 5:56:56 PM

Sounds like you ran into the Maximum Effective Context Window: https://arxiv.org/abs/2509.21361?context=cs.AI

by judahmeek

1/29/2026 at 7:23:04 PM

Interesting article. Not sure it's the same phenomenon. What I experienced was like a day and night difference when you go from 44.5k to 45.5k. Didn't notice any fluctuation to suggest that it's no a hard 45000 limit. I ran many many queries, similar problem space, but the problems varied a lot.

by dmos62

1/29/2026 at 3:58:43 PM

I am using API mode, and it's clear that there are times when the Claude model just gives up. And it is very noticeable because the model just does the most dumb things possible.

"You have a bug in line 23." "Oh yes, this solution is bugged, let me delete the whole feature." That one-line fix I could make even with ChatGPT 3.5 can't just happen. Workflows that I use and are very reproducible start to flake and then fail.

After a certain number of tokens per day, it becomes unusable. I like Claude, but I don't understand why they would do this.

by jampa

1/29/2026 at 4:10:20 PM

Robbing Peter to pay Paul. They are probably resource-constrained, and have determined that it's better to supply a worse answer to more people than to supply a good answer to some while refusing others. Especially knowing that most people probably don't need the best answer 100% of the time.

by arcanemachiner

1/29/2026 at 5:05:40 PM

> Especially knowing that most people probably don't need the best answer 100% of the time.

More: probably don't know if they've got a good answer 100% of the time.

It is interesting to note that this trickery is workable only where the best answers are sufficiently poor. Imagine they ran almost any other kind of online service such email, stock prices or internet banking. Occasionally delivering only half the emails would trigger a customer exodus. But if normal service lost a quarter of emails, they'd have only customers who'd likely never notice half missing.

by chrisjj

1/29/2026 at 7:19:33 PM

Right. You can launder quantization that way by muddying the waters of discourse about the model.

by bn-l

1/29/2026 at 4:51:53 PM

I encountered the same situation too; Claude has 'become lazy'.

by DanielHall

1/29/2026 at 1:59:07 PM

FYI the MarginLab Claude Code degradation tracker is showing a statistically significant ~4% drop in SWE-Bench-Pro accuracy over the past month

by qwesr123

1/29/2026 at 2:52:45 PM

I really like the idea, but a "±14.0% significance threshold" is meaningless here.

The larger monthly scale should be the default, or you should get more samples.

by goldenarm

1/29/2026 at 2:58:30 PM

Could you elaborate what you think the problems are? I guess they should be using some form of multiple comparison correction?

by zacmps

1/29/2026 at 3:06:06 PM

The daily scale is not statistically significant and is meaningless. You should lower the confidence interval by either increasing the scale or the evaluations.

by goldenarm

1/29/2026 at 7:59:29 PM

Please try to make this statistically rigorous. There's lots of advice in this thread (intraday variation, etc) but if Im reading this right it looks like the CI includes the baseline value yet you still label this as failing.

Wouldn't this just be "our test isn't powerful enough to find a signal if there were one here?"

People will see this and derive strong conclusions that the data don't support and you, `qwesr123`, or "JB" from your blogs, will be responsible.

by account266928

1/29/2026 at 6:46:56 PM

Benchmark tracking of cloud AI performance is going to be crucial going forward. Vendors are selling a service that by its nature is very difficult for customers to gauge day to day. How will I know if a code revision is ~2.5% less good today than it would have been yesterday? Or if queries during peak load hours use one less 'expert' in their MoE?

Yet vendor's costs to deliver these services are skyrocketing, competition is intense and their ability to subsidize with investor capital is going away. The pressure on vendors to reduce costs by dialing back performance a few percent or under-resourcing peak loads will be overwhelming. And I'm just a hobbyist now. If I was an org with dozens or hundreds of devs I'd want credible ways to verify the QoS and minimum service levels I'm paying for are being fulfilled long after a vendor has won the contract.

by mrandish

1/29/2026 at 5:29:18 PM

What makes the level they chose a “baseline,” against which it would be appropriate to do statistical tests?

by drc500free

1/29/2026 at 4:57:42 PM

Does this use a claude subscription or key, and has the account been used for anything else that day?

On HN a few days ago there was a post suggesting that Claude gets dumber throughout the day: https://bertolami.com/index.php?engine=blog&content=posts&de...

by parquor

1/29/2026 at 6:46:18 PM

New to me, but I am starting to infer that for those "in the know" it is common knowledge on HN that LLMs are purposely degraded over time to manage capacity/cost or fudge benchmarks...

How do you actually use these in production pipelines in practice then?

Are LLMs even well suited for some of the document parsing / data scrubbing automation people are throwing at them now?

by steveBK123

1/29/2026 at 5:53:04 PM

This is why I run my own models. All the inference providers do sneaky things behind the scenes. They will limit the output tokens, turn off attention layers, lower reasoning, or just use a completely different model. I'm actually surprised that Claude Code experienced this, as I've experienced this the least from API and coding agents.

by kittikitti

1/30/2026 at 3:58:43 AM

Running agents in production, I've stopped trying to figure out why things degrade. The answer changes weekly.

Model drift, provider load, API changes, tool failures - it doesn't matter. What matters is that yesterday's 95% success rate is today's 70%, and by the time you notice, debug, and ship a fix, something else has shifted.

The real question isn't "is the model degraded?" It's "what should my agent do right now given current conditions?"

We ended up building systems that canary multiple execution paths continuously and route traffic based on what's actually working. When Claude degrades, traffic shifts to the backup path automatically. No alerts, no dashboards, no incident.

Treating this as a measurement problem assumes humans will act on the data. At scale, that assumption breaks.

by devonkelley

1/30/2026 at 2:28:39 PM

LLM generated comments are so obvious, please just talk from your personal experience. Nobody cares about this imagined experience.

by sd9

1/29/2026 at 3:22:35 PM

Doesn't really work like that. I'd remove the "statistically significant" labelling because it's misleading.

by IshKebab

1/29/2026 at 8:18:26 PM

This is super important - even if it's not currently the best measure of degradation yet. Anecdotally, Opus 4.5 has gotten so bad for me it's almost adding time to my workflow instead saving it. It'd be nice to have more 3rd party measurements like this to hold Anthropic accountable.

by _zachs

1/31/2026 at 12:24:59 AM

Im using Claude daily. Mostly delegating boring stuff I can do myself but its a waste of my time now.

I store my prompts, so I know I often run the same task multiple times over weeks span.

After working with it for pas half a year I have to say the quality pf responses is steadily going down.

Feels like cost optimizations. Overall the worse it performs the more stuff I have to do myself, because I won’t waste time tweaking instructions every time it happens. It wpulf waste too much of that time.

So seems we are swinging back the pendulum.

by pojzon

1/30/2026 at 2:20:55 AM

Totally tangential to article, was browsing through the website UI - https://marginlab.ai/explorers/swe-bench-pro/ , the page gives impression that the language, category boxes are selectable. However they are not a dropdown. Not sure if it was intentional design by human or some smart code generation by Claude based on the design sketches.

by sandeepkd

1/29/2026 at 3:56:35 PM

Does it benchmark the underlying code (Opus 4.5) or Claude Code harness? If the second, I would love to see CC versions involved.

I would be curious to see on how it fares against a constant harness.

There were thread claiming that Claude Code got worse with 2.0.76, with some people going back to 2.0.62. https://github.com/anthropics/claude-code/issues/16157

So it would be wonderful to measure these.

by stared

1/29/2026 at 4:01:37 PM

Claude Code. They mention they are using claude codes CLI in the benchmark, and claude code changes constantly.

I wouldn't be surprised if the thing this is actually testing is benchmarking just claude codes constant system prompt changes.

I wouldn't really trust this to be able to benchmark opus itself.

by Jcampuzano2

1/29/2026 at 6:28:26 PM

I wonder when I experience noticeably degraded model quality, ie opus, is it because my usage falls in the highest buckets and I’m being shadow limited or served worse versions of opus or is it because of actual server load/burden?

It wouldn’t be the first time companies have secret shadow algorithms running to optimize things and wouldn’t it be obvious to limit power users as matter of cost/profit and not tell them. (See history of “Shadow ban” though that’s for different reasons)

by mannanj

1/29/2026 at 5:29:44 PM

Pretty sure someone at Google, OpenAI, and Anthropic met up at a park, leaving their phones in their car, and had a conversation that January 2026, they were all going to silently degrade their models.

They were fighting an arms race that was getting incredibly expensive and realized they could get away with spending less electricity and there was nothing the general population could do about it.

Grok/Elon was left out of this because he would leak this idea at 3am after a binge.

by PlatoIsADisease

1/30/2026 at 2:39:54 PM

How much influence have you all found prompting to have on output quality? Generally I've been approaching by just describing my problem and assuming that I'll get the machine's optimal output, but perhaps being explicit in the prompt can impact the output quality?

by cleifer

1/30/2026 at 1:44:45 AM

The degradation does not need to be in the inference it can be in how often inference is used.

It is closed source but the algorithms that decide what Claude code does when, could behave differently when the API responses are slower. Maybe it does fewer investigatory greps or performs fewer tasks to get to “an” answer faster and with less load.

by macinjosh

1/30/2026 at 5:47:38 AM

Does this even make sense? Clearly anthropic won't release a model unless it passed a benchmark of some sort that proves it's better than the previous model... or else why would they even release it?

It's obvious if this thing shows degradation, than there is another thing that is showing improvement.

by threethirtytwo

1/29/2026 at 3:59:50 PM

First off, this is a cool project, look forward to some interesting insights.

I would suggest adding some clarification to note that longer measure like 30 pass rate is raw data only while the statistically significant labels apply only to change.

Maybe something like Includes all trials, significance labels apply only to confidence in change vs baseline.

by WhitneyLand

1/29/2026 at 2:46:05 PM

Very interesting. I would be curious to understand how granular these updates are being applied to CC + what might be causing things like this. I feel like I can notice a very small degradation but have compensated with more detailed prompts (which I think, perhaps naively, is offsetting this issue).

by beardsciences

1/29/2026 at 5:25:39 PM

> more detailed prompts (which I think, perhaps naively, is offsetting this issue).

Is exacerbating this issue ... if the load theory is correct.

by chrisjj

1/30/2026 at 12:57:32 AM

If the confidence interval width is 2 * 14.0%, how are you detecting a statistically significant difference between 58% and 50%?

The 95% CIs on both timeseries pretty much always cover the baseline number, which is not consistent with the result being statistically significant.

by aorist

1/29/2026 at 7:12:21 PM

I hope the author sees this:

You have to test inter-day variation. Many have noticed a sudden drop off at certain times.

by bn-l

1/29/2026 at 7:32:02 PM

What would be cool if this somehow could do a comparison by provider. E.g. in the last outages anthropic models running on vertex were apparently less affected than those deployed elsewhere. (Not saying that one is better than the other, but would be a neat read out).

by persedes

1/31/2026 at 7:08:56 PM

I rarely complain about model performance, but Opus 4.5 behaves as Sonnet 4 at best. Need to start testing alternatives asap

by stergd

1/30/2026 at 4:55:07 PM

It definitely felt less capable recently, I thought I was imagining it, but it was noticeably more difficult to get it to help on tasks that usually aren't so hard.

by foerster

1/29/2026 at 9:14:00 PM

I’d love to see, based on the level of non-determinism perfomance on the benchmark how many times you need to run the benchmark for the change to be relevant (or statistically significant if you want).

That would be a nice paper.

by motoboi

1/29/2026 at 5:26:32 PM

Codex is doing better. Why is everyone silent on Codex? https://marginlab.ai/trackers/codex/

by wendgeabos

1/29/2026 at 6:27:09 PM

Benchmark wins don't necessarily translate to "real world" wins vs. Claude Code.

by CharlesW

1/29/2026 at 7:20:36 PM

Codex writes disgusting shit code.

by bn-l

1/30/2026 at 8:03:30 AM

They should add testing from different ips and account countries, that would be fun too see that Americans are getting different models for example

by your_friend

1/29/2026 at 11:27:34 PM

Tracking benchmarks for AI-assisted coding tools is crucial. It helps developers understand the trade-offs and stability of the models they rely on.

by hn_user_9876

1/29/2026 at 3:05:59 PM

Why is this happening?

by sciencejerk

1/29/2026 at 3:59:49 PM

They're "optimizing" costs wherever possible - reducing compute allocations, quantizing models, doing whatever they can to reduce the cost per token, but vehemently insisting that no such things are occurring, that it's all in the users' heads, and using the weaseliest of corporate weasel speak to explain what's happening. They insist it's not happening, then they say something like "oh, it happened but it was an accident", then they say "yes, it's happening, but it's actually good!" and "we serve the same model day by day, and we've always been at war with Eastasia."

They should be transparent and tell customers that they're trying to not lose money, but that'd entail telling people why they're paying for service they're not getting. I suspect it's probably not legal to do a bait and switch like that, but this is pretty novel legal territory.

by observationist

1/29/2026 at 3:12:26 PM

I have absolutely no insight knowledge, but I think it's not a bad assumption to have that, it's costly to run the models, when they release a new model they assume that cost and give per user more raw power, when they've captured the new users and wow factor, they start reducing costs by reducing the capacity they provide to users. Rinse and repeat.

by Trufa

1/29/2026 at 7:23:24 PM

That is absolutely scummy.

by bn-l

1/29/2026 at 3:28:38 PM

There are frequently claims that Anthropic is somehow diluting or dumbing down models in some subtle way. Unfortunately it’s tough to validate these claims without a body of regularly checked evals. This test set should hopefully help settle whether Anthropic is actually making changes under the hood or whether the changes are all in people’s heads.

by Uehreka

1/29/2026 at 3:09:40 PM

https://www.anthropic.com/engineering/a-postmortem-of-three-...

by giwook

1/29/2026 at 4:02:21 PM

>>> We never reduce model quality due to demand, time of day, or server load. The problems our users reported were due to infrastructure bugs alone.

Just ignore the continual degradation of service day over day, long after the "infrastructure bugs" have reportedly been solved.

Oh, and I've got a bridge in Brooklyn to sell ya, it's a great deal!

by observationist

1/29/2026 at 4:50:54 PM

> We never reduce model quality due to demand, time of day, or server load

Forgive me, but as a native English speaker, this sentence says exactly one thing to me; We _do_ reduce model quality, just not for these listed reasons.

If they don't do it, they could put a full stop after the fifth word and save some ~~tokens~~ time.

by alias_neo

1/29/2026 at 9:08:38 PM

Yes, Dario is responsible for some of the weaseliest of corporate weasel wording I've ever seen, and he's got some incredible competition in that arena. Those things aren't the reason, they're just strongly coincidental with the actual reason, which is to slow the burn rate and extend the runway.

by observationist

1/29/2026 at 5:55:24 PM

Moreover the assurance re model quality is not re results quality.

by chrisjj

1/29/2026 at 5:08:37 PM

It’s entirely possible it’s not happening, and this phenomenon of “model degradation” is just user hype meeting reality.

by emp17344

1/29/2026 at 5:17:38 PM

I KNEW I WASNT CRAZY

by elmean

1/29/2026 at 11:06:59 PM

I’ve noticed Claude has been noticeably worse over the last week. For example, it told me I should pass frozen to make my Enum immutable—that’s not a thing. (It is a thing for dataclasses, but not for Enums.) That’s a pretty basic language feature it was nailing until recently. It also suggested I parse a URL using urlparse in a function that already uses urlparse. These are basic mistakes it wasn’t making before. Something seems to have changed, but I’m not sure what.

by jonawesomegreen

1/29/2026 at 3:01:17 PM

Would love to see this idea expanded to ever alleged SoTA model currently in production. Any speculation as to why this degradation occurs?

by fragebogen

1/29/2026 at 3:07:57 PM

Anecdote, I don't have any proof and it's just a feeling. But around afternoon in GMT+1 compared to the morning/midday, there seems to be a change in the quality of responses, which seems to line up with when the US wakes up. I consistently get (what feels like) worse responses in both Codex and Claude Code in the afternoon/night compared to morning/midday, so much that I usually give up then try the same prompt next morning and get better results. But I guess that might as well be about me being more tired in the night than morning too, as I said, haven't measured this.

by embedding-shape

1/29/2026 at 3:15:46 PM

It’s the afternoon slump. The AI needs a cup of coffee and to doomscroll for half an hour!

by jzig

1/29/2026 at 3:18:21 PM

Or a load balancing technique :) Either way, it kicks me off to do other things so maybe it isn't so bad after all.

by embedding-shape

1/29/2026 at 8:35:27 PM

they should run their test against a control baseline such as an open source hosted model to see the overall drift in their test

by snissn

1/29/2026 at 4:16:51 PM

I have yet to experience any degradation in coding tasks I use to evaluate Opus 4.5, but I did see a rather strange and reproducible worsening in prompt adherence as part of none coding tasks since the third week of January.

Very simple queries, even those easily answered via regular web searching, have begun to consistently not result accurate results with Opus 4.5, despite the same prompts previously yielding accurate results.

One of the tasks that I already thought was fully saturated as most recent releases had no issues in solving it was to request a list of material combinations for fabrics used in bag constructions that utilise a specific fabric base. In the last two weeks, Claude has consistently and reproducibly provided results which deviate from the requested fabric base, making the results inaccurate in a way that a person less familiar with the topic may not notice instantly. There are other queries of this type for other topics I am nerdily familiar with to a sufficient degree to notice such deviations from the prompt like motorcycle history specific queries that I can say this behaviour isn't limited to the topic of fabrics and bag construction.

Looking at the reasoning traces, Opus 4.5 even writes down the correct information, yet somehow provides an incorrect final output anyways.

What makes this so annoying is that in coding tasks, with extensive prompts that require far greater adherence to very specific requirements in a complex code base, Opus 4.5 does not show such a regression.

I can only speculate what may lead to such an experience, but for none coding tasks I have seen regression in Opus 4.5 whereas for coding I did not. Not saying there is none, but I wanted to point it out as such discussions are often primarily focused on coding, where I find it can be easier to see potential regressions where their are none as a project goes on and tasks become inherently more complex.

My coding benchmarks are a series of very specific prompts modifying a few existing code bases in some rather obscure ways, with which I regularly check whether a model does severely deviate from what I'd seen previously. Each run starts with a fresh code base with some fairly simple tasks, then gets increasingly complex with later prompts not yet being implemented by any LLM I have gotten to test. Partly that originated from my subjective experience with LLMs early on, where I found a lot of things worked very well but then as the project went on and I tried more involved things with which the model struggled, I felt like the model was overall worse when in reality, what had changed were simply the requirements and task complexity as the project grew and easier tasks had been completed already. In this type of testing, Opus 4.5 this week got as far and provided a result as good as the model did in December. Of course, past regressions were limited to specific users, so I am not saying that no one is experiencing reproducible regressions in code output quality, merely that I cannot reproduce them in my specific suite.

by Topfi

1/29/2026 at 5:21:09 PM

I've noticed a degradation in Opus 4.5, also with Gemini-3-Pro. For me, it was a sudden rapid decline in adherence to specs in Claude Code. On an internal benchmark we developed, Gemini-3-Pro also dramatically declined. Going from being clearly beyond every other model (as benchmarks would lead you to believe) to being quite mediocre. Delivering mediocre results in chat queries and coding also missing the mark.

I didn't "try 100 times" so it's unclear if this is an unfortunate series of bad runs on Claude Code and Gemini CLI or actual regression.

I shouldn't have to benchmark this sort of thing but here we are.

by dudeinhawaii

1/29/2026 at 6:55:08 PM

Write your work order with phases (to a file) and, between each phase, give it a non-negotiable directive to re-read the entire work order file.

Claude-Code is terrible with context compaction. This solves that problem for me.

by acuozzo

1/29/2026 at 4:20:59 PM

I definitely noticed a degradation, it feels regressed by a generation.

by epolanski

1/29/2026 at 3:19:04 PM

In medicine there is a concept of reporting adverse effects of medication or interventions which are then collectively studied for Public Health [MedWatch][VAERS][EudraVigilance] and in academia. We should have something like that for all coding agents(and agents in other fields too), given how widely its deployed and affect on "health" in general(not only human). Call it the AI "health" of things benchmark.

I would imagine a sort of hybrid qualities of volunteer efforts like wikipedia, new problems like advent of code and benchmarks like this. The goal? It would be to study the collective effort on the affects of usage to so many areas where AI is used.

[MedWatch](https://www.fda.gov/safety/medwatch-fda-safety-information-a...)

[VAERS](https://www.cdc.gov/vaccine-safety-systems/vaers/index.html)

[EudraVigilance](https://www.ema.europa.eu/en/human-regulatory-overview/resea...)

by ghm2199

1/29/2026 at 4:25:27 PM

The chart would benefit from having weekends highlighted. Or have another chart averaged by a weekday.

by rplnt

1/29/2026 at 5:59:39 PM

would be interesting to see what scores it's get when it is actually degraded via the status page, it gets degraded pretty often, so there's at least something to compare or to know at what point Anthropic declares degradation

by Rastonbury

1/29/2026 at 3:39:07 PM

My personal conspiracy theory is that they choose who to serve a degraded model to based on social graph analysis and sentiment analysis, maximizing for persuasion while minimizing compute.

by sroerick

1/29/2026 at 4:17:15 PM

Sounds more like a sound business plan than a conspiracy theory.

by arcanemachiner

1/29/2026 at 4:37:49 PM

It sounds like fraud to me

by copilot_king

1/29/2026 at 5:32:00 PM

Does it say anywhere in their terms of service that they guarantee the quality of the model, or promise not to modify it?

https://www.anthropic.com/legal/consumer-terms

https://www.anthropic.com/legal/commercial-terms

by arcanemachiner

1/29/2026 at 4:38:05 PM

IMO this strategy seems inspired by TikTok's approach for retaining new uploaders.

TikTok used to give new uploaders a visibility boost (i.e., an inflated number of likes and comments) on their first couple of uploads, to get them hooked on the the service.

In Anthropic/Claude's case, the strategy is (allegedly) to give new users access to the premium models on sign-up, and then increasingly cut the product with output from cheaper models.

Of course, your suggestion (better service for users who know how to speak Proper English) would be the cherry on top of this strategy.

From what I've seen on HackerNews, Anthropic is all-in on social media manipulation and social engineering, so I suspect that your assumption holds water.

by copilot_king

1/30/2026 at 12:31:46 AM

I would actually assume a little more sophistication. For each user, a measure of "Are they convinced that AI is great". Then, you weaponize your compute to have the maximum social impact. If somebody has a large following (many edges on the social graph), and theyre skeptical of AI tech, inject the expensive but effective models directly into their veins. Let them taste the joy. Then start watering down their dose, and move onto the next person in the graph, again maximizing for net social impact. Language may not even be a consideration

by sroerick

1/30/2026 at 5:00:00 AM

ive seen degraded reasoning levels that feel like they they might be blur from excess quantization. cause thats what you get from the grid changes

by carterschonwald

1/30/2026 at 11:33:21 AM

Tried Kimi 2.5 and far ahead of claude for coding.

by sreekanth850

1/29/2026 at 4:16:17 PM

Finally someone did it! We need this for all models.

by esafak

1/30/2026 at 1:33:04 AM

I would pay 300 for a non-degrading Max plan.

by ed_mercer

1/29/2026 at 5:07:59 PM

I’m sure there is not enough data here for this to be statistically significant (it seems to oscillate too much and not show real trends or step changes) - BUT

If this measure were hardened up a little, it would be really useful.

It feels like an analogue to an employee’s performance over time - you could see in the graphs when Claude is “sick” or “hungover”, when Claude picks up a new side hustle and starts completely phoning it in, or when it’s gunning for a promotion and trying extra hard (significant parameter changes). Pretty neat.

Obviously the anthropomorphising is not real, but it is cool to think of the model’s performance as being a fluid thing you have to work with, and that can be measured like this.

I’m sure some people, most, would prefer that the model’s performance were fixed over time. But come on, this is way more fun.

by sd9

1/29/2026 at 4:20:59 PM

That will be great if there's RSS support.

by fernvenue

1/29/2026 at 4:03:49 PM

any chance we can get something like this for codex cli that'd be cool too compare

by taf2

1/29/2026 at 7:01:54 PM

Call it what you will. But the experience is like you have a reliable coworker, but he randomly decides to take bong hits.

"No no yeah bro no I'm good like really the work's done and all yeah sorry I missed that let me fix it"

by biddit

1/29/2026 at 2:57:20 PM

This is probably entirely down to subtle changes to CC prompts/tools.

I've been using CC more or less 8 hrs/day for the past 2 weeks, and if anything it feels like CC is getting better and better at actual tasks.

Edit: Before you downvote, can you explain how the model could degrade WITHOUT changes to the prompts? Is your hypothesis that Opus 4.5, a huge static model, is somehow changing? Master system prompt changing? Safety filters changing?

by turnsout

1/29/2026 at 3:02:48 PM

Honest, good-faith question.

Is CC getting better, or are you getting better at using it? And how do you know the difference?

I'm an occasional user, and I can definitely see improvements in my prompts over the past couple of months.

by FfejL

1/29/2026 at 3:36:25 PM

I agree with you, it's personally hard to tell.

For me I've noticed it getting nothing but better over the past couple months, but I've been working on my workflows and tooling.

For example, I used to use plan mode and would put everything in a single file and then ask it to implement it in a new session.

Switching to the 'superpowers' plugin with its own skills to brainstorm and write plans and execute plans with batches and tasks seems to have made a big improvement and help catch things I wouldn't have before. There's a "get shit done" plugin that's similar that I want to explore as well.

The code output always looks good to me for the most part though and I've never thought that it's getting dumber anything, so I feel like a lot of the improvements I see are because of a skill issue on my part trying to use everything. Obviously it doesn't help there's a new way to do things every two weeks though.

by rob

1/29/2026 at 3:35:42 PM

Good-faith answer: I can't be certain. But I've been using CC since its release, and Cursor before that (and actually going all the way back to GPT3 to do codegen in the Playground). After getting used to the CC workflow, the way that I use it has been pretty consistent. To be specific, I use basically the same AGENTS.md with small modifications for each project, and I live almost exclusively in Plan mode and the best model (currently Opus 4.5).

My initial prompting is boilerplate at this point, and looks like this:

(Explain overall objective / problem without jumping to a solution)

(Provide all the detail / file references / past work I can think of)

(Ask it "what questions do you have for me before we build a plan?")

And then go back and forth until we have a plan.

Compared to my work with CC six months ago, it's just much more capable, able to solve more nuanced bugs, and less likely to generate spaghetti code.

by turnsout

1/30/2026 at 4:19:23 AM

I run an LLM based product in a completely different space (consumer) and I think this is kind of an impossible unsolvable part of developing products that rely on LLMs.

No matter what, powers users always say the model is degrading over time*. Even when every stat I have access to says otherwise.

(* to clarify, this is outside of actual model changes)

I suspect some of it is the fact context windows growing does harm performance, and early on you're more likely to be prodding at things in a way that has a smaller context window on average.

But I also think users just inherently are less reliable narrators than they think. They say they're trying the same tasks, but it may be the "same task" applied to a codebase with 1 month's more worth of development and complexity.

Or it's the "same task" but their less confident past self was "Clever Hans"-ing the model with some nuance that they've since discarded without realizing.

Or it's simple expectation creep and the tasks aren't similar at all from an LLM perspective due to limited generalization, but from a human perspective are. Switching languages might as well make it a new task as far LLM performance for example, but the human considers it the same task in a new language.

Whatever causes it, it's especially stressful because sometimes you do degrade the harness entirely accidentally but it's impossible to separate that signal from the noise from user accounts and an issue goes unfound way longer than it should.

Claude Code is somewhat fortunate that code has verifiable aspects though, so you don't need to 100% go on user account. My usecase relies much more on subjective preference, so dealing with this stuff becomes the 9th circle of hell.

There've been many times when a change to the LLM stack didn't make it to prod, I jumped the gun on announcing it, but users immediately flooded in with praise that the "missing" performance had returned.

by BoorishBears

1/29/2026 at 3:02:59 PM

That's why benchmarks are useful. We all suffer from the shortcomings of human perception.

by billylo

1/29/2026 at 3:09:43 PM

Benchmarks shortcomings are no worse... they inevitably measure something that is only close to the thing you actually care about, not the thing you actually care about. It's entirely plausible that this decreased benchmark score is because Anthropic's initial prompting of the model was overtuned to the benchmark and as they're gaining more experience with real world use they are changing the prompt to do better at that and consequentially worse at the benchmark.

by gpm

1/29/2026 at 3:19:46 PM

I wonder how best we can measure the usefulness of models going forward.

Thumbs up or down? (could be useful for trends) Usage growth from the same user over time? (as an approximation) Tone of user responses? (Don't do this... this is the wrong path... etc.)

by billylo

1/29/2026 at 3:37:18 PM

Benchmarks measure what they measure. But your subjective experience also matters.

by turnsout

1/29/2026 at 3:00:00 PM

I was going to ask, are all other variables accounted for? Are we really comparing apples to apples here? Still worth doing obviously, as it serves a good e2e evaluations, just for curiosity's sake.

by fragebogen

1/29/2026 at 4:19:38 PM

The easiest way would be to quantize the model, and serve different quants based on the current demand. Higher volumes == worse quant == more customers served per GPU

by arcanemachiner

1/29/2026 at 6:07:35 PM

I upvoted, but

> Edit: Before you downvote, can you explain how the model could degrade WITHOUT changes to the prompts?

The article actually links to this fine postmortem by anthropic that demonstrates one way this is possible - software bugs affecting inference: https://www.anthropic.com/engineering/a-postmortem-of-three-...

Another way this is possible is the model reacting to "stimuli", e.g. the hypothesis at the end of 2023 that the (then current) ChatGPT was getting lazy because it was finding out the date was in december and it associated winter with shorter lazier responses.

A third way this is possible is the actual conspiracy version - Anthropic might make changes to make inference cheaper at the expense of the quality of the responses. E.g. quantizing weights further or certain changes to the sampling procedure.

by gpm

1/29/2026 at 10:52:47 PM

Could this be (partially?) explained by Model Collapse [1], i.e. iteratively training on data that includes an ever increasing amount of AI slop?

[1] https://thebullshitmachines.com/lesson-16-the-first-step-fal...

by willturman

1/30/2026 at 6:10:08 AM

[dead]

by lighthouse1212

1/29/2026 at 6:48:02 PM

[dead]

by MORPHOICES

1/29/2026 at 4:37:32 PM

This strategy seems inspired by TikTok's approach for retaining new uploaders.

TikTok used to give new uploaders a visibility boost (i.e., an inflated number of likes and comments) on their first couple of uploads, to get them hooked on the the service.

In Anthropic/Claude's case, the strategy is (allegedly) to give new users access to the premium models on sign-up, and then increasingly cut the product with output from cheaper models.

by copilot_king

1/29/2026 at 5:23:33 PM

Yes, but the difference is TikTok didn't sell a particular service version.

Anthropic did sell a particular model version.

by chrisjj

1/29/2026 at 4:43:35 PM

[dead]

by maximgeorge