Anonymous request-token comparisons from Opus 4.6 and Opus 4.7

4/18/2026 at 6:39:19 PM

For a fair comparison you need to look at the total cost, because 4.7 produces significantly fewer output tokens than 4.6, and seems to cost significantly less on the reasoning side as well.

Here is a comparison for 4.5, 4.6 and 4.7 (Output Tokens section):

https://artificialanalysis.ai/?models=claude-opus-4-7%2Cclau...

4.7 comes out slightly cheaper than 4.6. But 4.5 is about half the cost:

https://artificialanalysis.ai/?models=claude-opus-4-7%2Cclau...

Notably the cost of reasoning has been cut almost in half from 4.6 to 4.7.

I'm not sure what that looks like for most people's workloads, i.e. what the cost breakdown looks like for Claude Code. I expect it's heavy on both input and reasoning, so I don't know how that balances out, now that input is more expensive and reasoning is cheaper.

On reasoning-heavy tasks, it might be cheaper. On tasks which don't require much reasoning, it's probably more expensive. (But for those, I would use Codex anyway ;)

by andai

4/18/2026 at 9:14:23 PM

It thinks less and produces less output tokens because it has forced adaptive thinking that even API users can't disable. Same adaptive thinking that was causing quality issues in Opus 4.6 not even two weeks ago. The one bcherny recommended that people disable because it'd sometimes allocate zero thinking tokens to the model.

https://news.ycombinator.com/item?id=47668520

People are already complaining about low quality results with Opus 4.7. I'm also spotting it making really basic mistakes.

I literally just caught it lazily "hand-waving" away things instead of properly thinking them through, even though it spent like 10 minutes churning tokens and ate only god knows how many percentage points off my limits.

> What's the difference between this and option 1.(a) presented before?

> Honestly? Barely any. Option M is option 1.(a) with the lifecycle actually worked out instead of hand-waved.

> Why are you handwaving things away though? I've got you on max effort. I even patched the system prompts to reduce this.

> Fair call. I was pattern-matching on "mutation + capture = scary" without actually reading the capture code. Let me do the work properly.

> You were right to push back. I was wrong. Let me actually trace it properly this time.

> My concern from the first pass was right. The second pass was me talking myself out of it with a bad trace.

It's just a constant stream of self-corrections and doubts. Opus simply cannot be trusted when adaptive thinking is enabled.

Can provide session feedback IDs if needed.

by matheusmoreira

4/18/2026 at 11:24:54 PM

> > Why are you handwaving things away though? I've got you on max effort. I even patched the system prompts to reduce this.

In my experience, prompts like this one, which 1) ask for a reason behind an answer (when the model won't actually be able to provide one), 2) are somewhat standoff-ish, don't work well at all. You'll just have the model go the other way.

What works much better is to tell the model to take a step back and re-evaluate. Sometimes it also helps to explicitly ask it to look at things from a different angle XYZ, in other words, to add some entropy to get it away from the local optimum it's currently at.

by codethief

4/19/2026 at 1:07:06 AM

> when the model won't actually be able to provide one

This is key. In my experience, asking an LLM why it did something is usually pointless. In a subsequent round, it generally can't meaningfully introspect on its prior internal state, so it's just referring to the session transcript and extrapolating a plausible sounding answer based on its training data of how LLMs typically work.

That doesn't necessarily mean the reply is wrong because, as usual, a statistically plausible sounding answer sometimes also happens to be correct, but it has no fundamental truth value. I've gotten equally plausible answers just pasting the same session transcript into another LLM and asking why it did that.

by mrandish

4/19/2026 at 12:32:46 PM

How I think about this is…

From early GPT days to now, best way to get a decently scoped and reasonably grounded response has always been to ask at least twice (early days often 7 or 8 times).

Because not only can it not reflect, it cannot "think ahead about what it needs to say and change its mind". It "thinks" out loud (as some people seem to as well).

It is a "continuation" of context. When you ask what it did, it still doesn't think, it just* continues from a place of having more context to continue from.

The game has always been: stuff context better => continue better.

Humans were bad at doing this. For example, asking it for synthesis with explanation instead of, say, asking for explanation, then synthesis.

You can get today's behaviors by treating "adaptive thinking" like a token budgeted loop for context stuffing, so eventually there's enough context in view to produce a hopefully better contextualized continuation from.

It seems no accident we've hit on the word "harness" — so much that seems impressive by end of 2025 was available by end of 2023 if "holding it right". If (and only if!) you are an expert in an area you need it to process: (1) turn thinking off, (2) do your own prompting to "prefill context", and (3) you will get superior final response. Not vibing, just staff-work.

---

* “just” – I don't mean "just" dismissively. Qwen 3.5 and Gemma 4 on M5 approaches where SOTA was a year ago, but faster and on your lap. These things are stunning, and the continuations are extraordinary. But still: Garbage in, garbage out; gems in, gem out.

by Terretta

4/19/2026 at 6:51:00 AM

> In a subsequent round, it generally can't meaningfully introspect on its prior internal state

It can't do any better in the moment it's making the choices. Introspection mostly amounts to back-rationalisation, just like in humans. Though for humans, doing so may help learning to make better future decisions in similar situations.

by vanviegen

4/19/2026 at 11:15:58 AM

I don't understand why people don't just say "This is wrong. try again." or "This is wrong because xyz. try again." This anthropologizing by asking why seems a bit pointless when you know how LLMs work, unless you've empirically had better results from a specific make and version of LLM by asking why in the past. It's theoretically functionally equivalent to asking a brand new LLM instance with your chat history why the original gave such an answer...Do you want the correct result or do you actually care about knowing why?

>Introspection mostly amounts to back-rationalisation, just like in humans.

That's the best case scenario. Again, let's stop anthropologizing. The given reasons why may be incompatible with the original answer upon closer inspection...

by sillyfluke

4/19/2026 at 11:47:54 AM

I definitely do this, along with the compulsion sometimes to tell the agent how a problem was fixed in the end, when investigating myself after the model failing to do so. Just common courtesy after working on something together. Let’s rationalize this as giving me an opportunity to reflect and rubberduck the solution.

Regarding not just telling „try again“: of course you are right to suggest that applying human cognition mechanisms to llm is not founded on the same underlying effects.

But due to the nature of training and finetuning/rf I don’t think it is unreasonable that instructing to do backwards reflection could have a positive effect. The model might pattern match this with and then exhibit a few positive behaviors. It could lead it doing more reflection within the reasoning blocks and catch errors before answering, which is what you want. These will have attention to the question of „what caused you to make this assumption“, also, encouraging this behavior. Yes, both mechanisms are exhibited through linear forward going statical interpolation, but the concept of reasoning has proven that this is an effective strategy to arrive at a more grounded result than answering right away.

Lastly, back to anthro. it shows that you, the user, is encouraging of deeper thought an self corrections. The model does not have psychological safety mechanisms which it guards, but again, the way the models are trained causes them to emulate them. The RF primes the model for certain behavior, I.e. arriving at answer at somepoint, rather than thinking for a long time. I think it fair to assume that by „setting the stage“ it is possible to influence what parts of the RL activate. While role-based prompting is not that important anymore, I think the system prompts of the big coding agents still have it, suggesting some, if slight advantage, of putting the model in the right frame of mind. Again, very sorry for that last part, but anthro. does seem to be a useful analogy for a lot of concepts we are seeing (the reason for this being in the more far of epistemological and philosophical regions, both on the side of the models and us)

by Dumbledumb

4/19/2026 at 12:09:33 PM

> This is key. In my experience, asking an LLM why it did something is usually pointless. In a subsequent round, it generally can't meaningfully introspect on its prior internal state, so it's just referring to the session transcript and extrapolating a plausible sounding answer based on its training data of how LLMs typically work.

Yep, I've gotten used to treating the model output as a finished, self-contained thing.

If it needs to be explained, the model will be good at that, if it has an issue, the model will be good at fixing it (and possibly patching any instructions to prevent it in the future). I'm not getting out the actual reason why things happened a certain way, but then again, it's just a token prediction machine and if there's something wrong with my prompt that's not immediately obvious and perhaps doesn't matter that much, I can just run a few sub-agents in a review role and also look for a consensus on any problems that might be found, for the model to then fix.

by KronisLV

4/19/2026 at 10:24:17 AM

It's worked for me when I ask why with a stated goal of preventing the same error the next time.

"Why did you guess at the functions signature and get it wrong, what information were you using and how can we prevent it next time."

Is that not the right approach?

by wallst07

4/19/2026 at 7:44:14 PM

This can work, but it's sort of not the same as providing actual reasoning behind "why did you do/say X?" -- this is basically asking them model to read the conversation, from the conversation try to understand "why" something happened, and add information to prevent it from being wrong next time. That "why" something went wrong is not really the same as "why" the model output something.

by natdempk

4/19/2026 at 12:35:21 PM

> This is key. In my experience, asking an LLM why it did something is usually pointless.

That kind of strikes me as a huge problem. Working backwards from solutions (both correct and wrong) can yield pretty critical information and learning opportunities. Otherwise you’re just veering into “guess and check” territory.

by Forgeties79

4/19/2026 at 8:04:12 AM

> In a subsequent round, it generally can't meaningfully introspect on its prior internal state

It has the K/V cache, no?

by AlexCoventry

4/19/2026 at 3:23:51 PM

The K/V Cache is just an optimization. But yeah you would expect the attention for the model producing "Ok im doing X" and you asking "Why did you do X?" be similar. So i don't see a reason why introspection would be impossible. In fact trying to adapt a test skill where the agent would write a new test instead of adapting a new one i asked it why and it gave the reasoning it used. We then adapted the skill to specifically reject that reasoning and then it worked and the agent adapted the existing test instead.

by Sinidir

4/18/2026 at 11:32:54 PM

That's good advice. I managed to get the session back on track by doing that a few turns later. I started making it very explicit that I wanted it to really think things through. It kept asking me for permission to do things, I had to explicitly prompt it to trace through and resolve every single edge case it ran into, but it seems to be doing better now. It's running a lot of adversarial tests right now and the results at least seem to be more thorough and acceptable. It's gonna take a while to fully review the output though.

It's just that Opus 4.6 DISABLE_ADAPTIVE_THINKING=1 doesn't seem to require me to do this at all, or at least not as often. It'd fully explore the code and take into account all the edge cases and caveats without any explicit prompting from me. It's a really frustrating experience to watch Anthropic's flagship subscription-only model burn my tokens only to end up lazily hand-waving away hard questions unless I explicitly tell it not to do that.

I have to give it to Opus 4.7 though: it recovered much better than 4.6.

by matheusmoreira

4/19/2026 at 7:15:14 AM

> Opus 4.6 DISABLE_ADAPTIVE_THINKING=1

Strangely this option was not working for many of us on a team plan

by bobkb

4/19/2026 at 1:03:36 AM

Yeah for anyone seriously using these models I highly reccomend reading the Mythos system card, esp the sections on analyzing it's internal non verbalized states. Save a lot of head wall banging.

by j-bos

4/19/2026 at 4:24:18 AM

This is frankly one of the most frustrating things about LLMs: sometimes I just want to drive it into a corner. “Why the f** did you do X when I specifically told you not to?”

It never leads to anything helpful. I don’t generally find it necessary to drive humans into a corner. I’m not sure it’s because it’s explicitly not a human so I don’t feel bad for it, though I think it’s more the fact that it’s always so bland and is entirely unable to respond to a slight bit of negative sentiment (both in terms of genuinely not being able to exert more effort into getting it right when someone is frustrated with it, but also in that it is always equally nonchalant and inflexible).

by christina97

4/19/2026 at 5:08:57 AM

You might be surprised how well 5.3-codex follows your instructions. When it hits a wall with your request, it usually emits the final turn and says it can’t do it.

by manmal

4/19/2026 at 5:58:41 AM

The same is true of humans, not surprisingly.

If you ask the average human "Why?", they will generally get defensive, especially if you are asking them to justify their own motivation.

However, if you ask them to describe the thinking and actions that led to their result, they often respond very differently.

by nhod

4/19/2026 at 12:43:47 AM

Precisely. I find Grok’s multi-agent approach very useful here. I have custom agent configured as a validator.

by nelox

4/19/2026 at 2:41:24 AM

Do you have to use Grok? I don't anyhow that found it passed evaluations.

by sroussey

4/19/2026 at 12:37:40 PM

I find most people who use grok do so for ideological reasons

by Forgeties79

4/19/2026 at 8:45:33 PM

Yeah, I guess. "Lex Luthor made an AI, I need to support him so I'll use Grok!" is a thing.

by sroussey

4/19/2026 at 8:56:12 PM

Unfortunately that is kind of how some people operate when it comes to musk. Grokopedia certainly is not used by people because it’s useful.

by Forgeties79

4/19/2026 at 2:44:03 AM

> What works much better is to tell the model to take a step back and re-evaluate.

I desperately hate that modern tooling relies on “did you perform the correct prayer to the Omnissiah”

> to add some entropy to get it away from the local optimum

Is that what it does? I don't think thats what it does, technically.

I think thats just anthropomorphizing a system that behaves in a non deterministic way.

A more menaingful solution is almost always “do it multiple times”.

That is a solution that makes sense sometimes because the system is prob based, but even then, when youre hitting an opaque api which has multiple hidden caching layers, /shrug who knows.

This is way I firmly believing prompt engineering and prompt hacking is just fluff.

Its both mostly technically meaningless (observing random variance over a sample so small you cant see actual patterns) and obsolete once models/apis change.

Just ask Claude to rewrite your request “as a prompt for claude code” and use that.

I bet it wont be any worse than the prompt you write by hand.

by noodletheworld

4/19/2026 at 3:30:42 AM

It definitely overcompensates to the point of defensiveness. They have all done so for years.

"Why did you do that?" (Me, just wanting to understand)

"You're right I should have done the opposite" (starts implementing the opposite without seeking approval, etc.

But if you agree with it it won't do that, so it isn't simply a case of randomly rerunning prompts.

by nprateem

4/19/2026 at 2:56:03 AM

Other than AI (and possibly npm packaging) where do you feel you have to rely on prayer? Additionally, most of human history has been the story of scientific advancement to a different point where people rely on prayer, so maybe suck it up buttercup is the best advice here? &emdash;

by tclancy

4/19/2026 at 1:07:20 AM

> Why are you handwaving things away though? I've got you on max effort. I even patched the system prompts to reduce this.

Do you think it knows what max effort or patched system prompts are? It feels really weird to talk to an LLM like it’s a person that understands.

by what

4/19/2026 at 2:05:54 AM

I've tested system prompt patching and it's definitely capable of identifying that my changes have been applied.

As someone who's been programming alone for over a decade, I absolutely do want to enjoy my coding buddy experience. I want to trust it. I feel pretty bad when I have to treat Claude like a dumb machine. It's especially bad when it starts making mistakes due to lack of reasoning. When I start explaining obvious stuff it's because I've lost the respect I had for it and have started treating it like a moron I have to babysit instead of a fellow programmer. It's definitely capable of understanding and reasoning, it's just not doing it because of adaptive thinking or bad system prompts or whatever else.

by matheusmoreira

4/19/2026 at 2:21:32 AM

I thought that was really weird as well.

by hattmall

4/18/2026 at 10:45:57 PM

Are the benchmarks being used to measure these models biased towards completing huge and highly complex tasks, rather than ensuring correctness for less complex tasks?

It seems like they're working hard to prioritize wrapping their arms around huge contexts, as opposed to handling small tasks with precision. I prefer to limit the context and the scope of the task and focus on trying to get everything right in incremental steps.

by rectang

4/18/2026 at 10:54:54 PM

I don't think there's a bias here. I'd say my task is of somewhat high complexity. I'm using Claude to assist me in implementing exceptions in my programming language. It's a SICP chapter 5.4 level task. There are quite a few moving parts in this thing. Opus 4.6 once went around in circles for half an hour trying to trace my interpreter's evaluator. As a human, it's not an easy task for me to do either.

I think the problem just comes down to adaptive thinking allowing the model to choose how much effort it spends on things, a power which it promptly abuses to be as lazy as possible. CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1 significantly improved Opus 4.6's behavior and the quality of its results. But then what do they do when they release 4.7?

https://code.claude.com/docs/en/model-config

> Opus 4.7 always uses adaptive reasoning.

> The fixed thinking budget mode and CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING do not apply to it.

by matheusmoreira

4/19/2026 at 7:16:37 AM

‘effort high/max’ seems to be working though

by bobkb

4/19/2026 at 5:06:58 PM

The problem I described occurred on Claude Code, Opus 4.7/1M, max effort, patched system prompts with all "don't think for simple stuff" instructions removed as well as CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1 even though Opus 4.7 ignores it.

by matheusmoreira

4/19/2026 at 9:44:29 AM

So CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1 is not available/is ignored in 4.7?

by virtualritz

4/19/2026 at 5:12:13 PM

It is ignored by Opus 4.7.

https://code.claude.com/docs/en/model-config

> Opus 4.7 always uses adaptive reasoning. The fixed thinking budget mode and CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING do not apply to it.

by matheusmoreira

4/19/2026 at 3:26:34 AM

Adaptive thinking is optional

by xvector

4/19/2026 at 6:11:35 AM

Not when you want extended thinking - you select extended thinking and opus decides if you get it with apativenthinking.

"With Opus 4.6, extended thinking was a toggle you managed: turn it on for hard stuff, off for quick stuff. If you left it on, every question paid the thinking tax whether it needed to or not. Now, with Opus 4.7, extended thinking becomes adaptive thinking. "

https://claude.com/resources/tutorials/working-with-claude-o...

by scrollop

4/19/2026 at 1:24:12 PM

...are you talking about the app? Come on. The app is for quick queries. You should be using Claude Code or Cowork.

by xvector

4/19/2026 at 6:24:23 PM

I've gotten quite a bit of work done on claude.ai and the mobile app though. It's been good for code review. The GitHub connector is a bit clunky but it works.

by matheusmoreira

4/19/2026 at 5:04:34 PM

No, it is not.

https://code.claude.com/docs/en/model-config

> Opus 4.7 always uses adaptive reasoning. The fixed thinking budget mode and CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING do not apply to it.

by matheusmoreira

4/19/2026 at 3:43:28 PM

> For a fair comparison you need to look at the total cost, because 4.7 produces significantly fewer output tokens than 4.6

Does it? Anthropic's own announcement says that for the same "effort level" 4.7 does more thinking (i.e uses more output tokens) than 4.6, and they've also increased the default effort level from 4.6 high to 4.7 xhigh.

I'm not sure what dominates the cost for a typical mix of agentic coding tasks - input tokens or output ones, but if you are working on an existing project rather than a brand new one, then file input has to be a significant factor and preliminary testing says that the new tokenizer is typically generating 40% or so more tokens for the exact same input.

I really have to wonder how much of 4.7's increase in benchmark scores over 4.6 is because the model is actually better trained for these cases, or just because it is using more tokens - more compute and thinking steps - to generate the output. It has to be a mix of the two.

by HarHarVeryFunny

4/19/2026 at 6:07:42 AM

That is not what anthropic says-

"Opus 4.7 thinks more at higher effort levels, particularly on later turns in agentic settings. This improves its reliability on hard problems, but it does mean it produces more output tokens. "

https://www.anthropic.com/news/claude-opus-4-7

by scrollop

4/19/2026 at 12:27:31 PM

That's a good point. AA's Cost Efficiency section says the opposite: you can hover to see the breakdown between input, reasoning and output tokens.

I'm not sure where that discrepancy comes from (is Anthropic using different benchmarks?).

There's a few different theories but all we have now are synthetic benchmarks, anecdotes and speculation.

(Benchmarks are misleading, I think our best bet now is for individuals to run real world tests, giving the same task to each model, and compare the quality, cost and time.)

The input cost inflation however is real, and dramatic.

I would have expected them to lower input costs proportionally, because otherwise you're getting less intelligence per dollar even with the smarter model. Think that would be the smartest thing for them to do, at least PR wise. And maybe a bit of free usage as an apology :)

by andai

4/19/2026 at 11:36:56 AM

The link you are commenting on shows data from actual prompts from real users, and the COST of the average prompt increased 37%. I do not think synthetic benchmarks are a rebuttal to real usage data.

by irthomasthomas

4/19/2026 at 12:21:25 PM

The cost of the input tokens, not the reasoning or output.

Agree though that benchmarks aren't very helpful w.r.t. estimating real world performance or costs.

What we'd need are people giving the same real world tasks to 4.6 and 4.7 and measuring time, quality and costs.

by andai

4/19/2026 at 2:02:26 PM

Thanks, that wasn't clear because it mentioned conversations, but it is only measuring the input tokens. So its just measuring the difference in the tokenizer.

by irthomasthomas

4/18/2026 at 9:51:06 PM

Some have defined "fair" as tests of the same model at different times, as the behavior and token usage of a model changes despite the version number remaining the same. So testing model numbers at different times matters, unfortunately, and that means recent tests might not be what you would want to compare to future tests.

by QuantumGood

4/18/2026 at 5:31:08 PM

The bump from 4.6 to 4.7 is not very noticeable to me in improved capabilities so far, but the faster consumption of limits is very noticeable.

I hit my 5 hour limit within 2 hours yesterday, initially I was trying the batched mode for a refactor but cancelled after seeing it take 30% of the limit within 5 minutes. Had to cancel and try a serial approach, consumed less (took ~50 minutes, xhigh effort, ~60% of the remaining allocation IIRC), but still very clearly consumed much faster than with 4.6.

It feels like every exchange takes ~5% of the 5 hour limit now, when it used to be maybe ~1-2%. For reference I'm on the Max 5x plan.

For now I can tolerate it since I still have plenty of headroom in my limits (used ~5% of my weekly, I don't use claude heavily every day so this is OK), but I hope they either offer more clarity on this or improve the situation. The effort setting is still a bit too opaque to really help.

by hgoel

4/18/2026 at 10:40:19 PM

The most frustrating part is the quality loss caused by the forced adaptive thinking. It eats 5-10% of my Max 5x usage and churns for ten minutes, only to come back with totally untrustworthy results. It lazily hand-waves issues away in order to avoid reading my actual code and doing real reasoning work on it. Opus simply cannot be trusted if adaptive thinking is enabled.

by matheusmoreira

4/20/2026 at 4:20:33 AM

> It lazily hand-waves issues away in order to avoid reading my actual code and doing real reasoning work on it.

It decided to leave the write endpoints added to an authentication service completely unauthenticated. The effort to do the contrary was about 6 characters, and in the claude.md. It tried to implement PKCE by embedding _everything_ in the state.

This thing is beyond untrustworthy.

The fact that they are using Claude to build Claude (not just Claude Code) probably explains a lot.

by zamalek

4/20/2026 at 12:13:59 PM

Opus 4.6 with settings and prompt fixed has been the most reliable for me. It consistently thinks things out and demonstrates good reasoning. It arrives at sound solutions that work. When perfection matters, it's very easy to refine the output with a few review passes.

by matheusmoreira

4/19/2026 at 2:33:49 AM

You don't have to use adaptive thinking. It had been turned off on my main work computer. I was using a different computer on a trip and I started getting so angry at Claude for doing a bad job. I evetually figured out it was adaptive thinking and set it to "hard" and it started working again. At the time I think "hard" was the top choice. With 4.7, my computer now shows "xhard", which I assume is the equivelent setting. There is one higher setting than this, which I haven't tried yet. I would tell you how to change these settings, but I don't remember. By the way, I have been happy with 4.7 so far. I actually did not like 4.6 and preferred 4.5 and used that most of the time until this new release.

by sutterd

4/19/2026 at 6:14:18 AM

https://claude.com/resources/tutorials/working-with-claude-o...

You want extended thinking? It's not adaptive thinking and opus will turn it on if it thinks it needs to. But it probably won't, according to user reports as tokens are expensive. Except opus 4.7 now uses 35% more and outputs more thinking tokens.

by scrollop

4/20/2026 at 3:07:46 AM

I am getting pretty good performance. Even on trivial questions it seems to go through the thinking process end. If they are using adaptive thinking, it seems to work much better than before. I will see how my experience goes with more usage.

by sutterd

4/19/2026 at 5:14:12 PM

> You don't have to use adaptive thinking.

With Opus 4.7 you absolutely do. Users don't have a choice.

https://code.claude.com/docs/en/model-config

> Opus 4.7 always uses adaptive reasoning. The fixed thinking budget mode and CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING do not apply to it.

by matheusmoreira

4/19/2026 at 4:22:01 AM

/effort

by __s

4/18/2026 at 6:05:11 PM

From what I understand you shouldn't wait more than 5min between prompts without compacting or clearing or you'll pay for reinitializing the cache. With compaction you still pay but it's less input tokens. (Is compaction itself free?)

by _blk

4/18/2026 at 8:20:35 PM

Cache ttl on max subscriptions is 1h, FYI.

by gck1

4/18/2026 at 10:25:44 PM

Only if you set `ENABLE_PROMPT_CACHING_1H`, which was mentioned in the release notes for a recent Claude Code release but doesn't seem to be in the official docs.

by bashtoni

4/18/2026 at 10:36:24 PM

subusers supposedly get it automatic again after the fix (and now also with `DISABLE_TELEMETRY=1`)

but if you are api user you must set `ENABLE_PROMPT_CACHING_1H` as i understood

and when using your own api (via `ANTHROPIC_BASE_URL`) ensure `CLAUDE_CODE_ATTRIBUTION_HEADER=0` is set as well... https://github.com/anthropics/claude-code/issues/50085

and check out the other neckbreakers ive found pukes lots of malicious compliance by feels... :/

[BUG] new sessions will *never* hit a (full)cache #47098 https://github.com/anthropics/claude-code/issues/47098

[BUG] /clear bleeds into the next session (what also breaks cache) #47756 https://github.com/anthropics/claude-code/issues/47756

[BUG] uncachable system prompt caused by includeGitInstructions / CLAUDE_CODE_DISABLE_GIT_INSTRUCTIONS -> git status https://github.com/anthropics/claude-code/issues/47107

by g4cg54g54

4/18/2026 at 10:29:38 PM

Bruh. It's getting hard to track down all these MAKE_IT_ACTUALLY_WORK settings that default to off for no reason.

by andersa

4/19/2026 at 5:21:01 PM

For me it's gotten to the point where I have a wrapper script that applies like 5 environment variables and even patches the system prompt strings prior to every Claude Code invocation.

After the Claude Code source code leak someone discovered that some variables are read directly from the process environment. Can't even trust that setting them in ~/.claude/settings.json will work!

I've actually started asking Claude itself to dissect every Claude Code update in order figure out if it broke some part of the Rube Goldberg machine I was forced to set up.

by matheusmoreira

4/19/2026 at 1:41:29 PM

That's the beginning of Googlification of feature evolution, via population statistics rather than quality.

If it increases a KPI by 5% for 95% of users but torpedos the experience for 5%? Ship it.

by ethbr1

4/21/2026 at 12:53:52 PM

that sounds like a win-win to me.

on one hand 95% of users get an improved experience. While a competitor gets the chance to build a business for the remaining 5%.

by hnben

4/19/2026 at 2:42:44 AM

no way, I didn't realise this worked.

My attention span is such that I get side tracked and wind up taking longer than 5 mins quite a bit :D

by plaguuuuuu

4/18/2026 at 8:33:05 PM

That'd be awesome but it doesn't reflect what I see. Do you have a source for that? What I see is if take a quick break the session loses ~5% right at the start of the next prompt processing. (I'm currently on max 5x)

by _blk

4/18/2026 at 8:44:04 PM

Not at my workstation right now, but simply ask claude to analyze jsonl transcript of any session, there are two cache keys there, one is 5m, another 1h. Only 1h gets set. There are also some entries there that will tell you if request was a cache hit or miss, or if cache rewrite happened. I've had claude test another claude and on max 5x subscription, cache miss only happened if message was sent after 1h, or if session was resumed using /resume or --resume (this is a bug that exists since January - all session resumes will cause a full cache rewrite).

However, cache being hit doesn't necessarily mean Anthropic won't just subtract usage from you as if it wasn't hit. It's Anthropic we're talking about. They can do whatever they want with your usage and then blame you for it.

by gck1

4/18/2026 at 9:00:51 PM

I have heard that if you have telemetry disabled the cache is 5 minutes, otherwise 1h. No clue how true that is however my experience (with telemetry enabled) has been the 1h cache.

by Fabricio20

4/18/2026 at 9:29:04 PM

They've acknowledged that as a bug and have fixed it.

by HarHarVeryFunny

4/18/2026 at 8:43:31 PM

It's true as far as I can tell, just by my own checking using `/status`. You can also tell by when the "clear" reminder hint shows up. Also if you look at the leaked claude code you can see that almost everything in the main thread is cached with 1H TTL (I believe subagents use 5 minute TTL)

by ethanj8011

4/18/2026 at 10:03:47 PM

>pay for reinitializing the cache

Why can't they save the kv cache to disk then later reload it to memory?

by krackers

4/18/2026 at 11:57:42 PM

Probably because the costly operation is loading it onto the GPU, doesn't matter if it's from disk or from your request.

by stavros

4/19/2026 at 12:12:22 AM

The point of prompt caching is to save on prefill which for large contexts (common for agentic workloads) is quite expensive per token. So there is a context length where storing that KV-cache is worth it, because loading it back in is more efficient than recomputing it. For larger SOTA models, the KV cache unit size is also much smaller compared to the compute cost of prefill, so caching becomes worthwhile even for smaller context.

by zozbot234

4/19/2026 at 7:07:22 AM

Isn't that how the kv cache currently works? Of course they could decide to hold on to cache items for longer than an hour, but the storage requirements are pretty significant while the chance of sessions resumption slinks rapidly.

by vanviegen

4/19/2026 at 7:29:38 AM

The storage requirements for large-model KV caches are actually comparatively tiny: the per-token size grows far less than model parameters. Of course, we're talking "tiny" for stashing them on bulk storage and slowly fetching them back to RAM. But that should still be viable for very long context, since the time for running prefill is quadratic.

by zozbot234

4/19/2026 at 9:19:47 AM

We only have open models to go by, so looking at GLM 5.1 for instance, we're talking about almost 300 GB of kv-cache for a full context window of 200k tokens.

That's hardly tiny.

by vanviegen

4/19/2026 at 2:04:50 AM

It’s a shitload of data, and it only works if all the tokens are 100% identical, i.e. all the attention values are exactly the same.

Typically it’s cached for about 5 minutes, you can pay extra for longer caches.

by stingraycharles

4/19/2026 at 3:02:24 AM

If I have a conversation with claude then come back 30 minutes later to resume the conversation, the KV values for that prefill prefix are going to be exactly the same. That's the whole point of this caching in the first place.

If you're willing to incur a latency penalty on a "cold resume" (which is fine for most use-cases), why couldn't they just move it to disk. The size of the KV cache should scale on the order of something like (context_length * n_layers * residual_length). I think for a standard V3-MoE model at 1M token length, this should be on the order of 100G at FP16? And you can surely play tricks with KV compression (e.g. the recent TurboQuant paper). It doesn't seem like an outrageous amount of data to put onto cheap scratch HDD (and it doesn't grow indefinitely since really old conversations can be discarded).

by krackers

4/19/2026 at 3:42:28 AM

> If I have a conversation with claude then come back 30 minutes later to resume the conversation, the KV values for that prefill prefix are going to be exactly the same.

Correct, when you’re using the API you can choose between 60 minute or 5 minute cache writes for this reason, but I believe the subscription doesn’t offer this. 60 minute cache writes are about 25% more expensive than regular cache writes.

I don’t have insights into internals at Anthropic so I don’t know where the pain point is for increasing cache sizes.

by stingraycharles

4/18/2026 at 6:39:00 PM

Yeah the caching change is probably 90% of “i run out of usage so fast now!” Issues.

by conception

4/18/2026 at 6:15:27 PM

Ah I can see how my phrasing might be misleading, but these prompts were made within 5 minutes of each other, the timing I mentioned were what Claude spent working.

by hgoel

4/18/2026 at 8:01:29 PM

is it 5 mins between constant prompting/work or 5 mins as in if i step away from the comp for 5 mins and comp back and prompt again im not subject to reinit?

if it's the latter that's crazy. i dont even know what to do there, compactions already feel like a memory wipe

by trueno

4/19/2026 at 2:19:03 AM

It's like they TRYING to get people to move to GPT 5.4, which I trust far more at coding. It's just slightly more annoying as an agent with codex.

by thefourthchime

4/18/2026 at 10:22:40 PM

[dead]

by viktorianer

4/18/2026 at 5:58:01 PM

I'd be ok with paying more if results were good, but it seems like Anthropic is going for the Tinder/casino intermittent reinforcement strategy: optimized to keep you spending tokens instead of achieving results.

And yes, Claude models are generally more fun to use than GPT/Codex. They have a personality. They have an intuition for design/aesthetics. Vibe-coding with them feels like playing a video game. But the result is almost always some version of cutting corners: tests removed to make the suite pass, duplicate code everywhere, wrong abstraction, type safety disabled, hard requirements ignored, etc.

These issues are not resolved in 4.7, no matter what the benchmarks say, and I don't think there is any interest in resolving them.

by glerk

4/18/2026 at 6:37:08 PM

Mirrors my sentiment. Those tools seem mostly useful for a Google alternative, scaffolding tedious things, code reviewing, and acting as a fancy search.

It seems that they got a grip on the "coding LLM" market and now they're starting to seek actual profit. I predict we'll keep seeing 40%+ more expensive models for a marginal performance gain from now on.

by Bridged7756

4/19/2026 at 2:54:57 AM

> Those tools seem mostly useful for a Google alternative, scaffolding tedious things, code reviewing, and acting as a fancy search.

Just to get a sense for the rate of change, imagine if you took a survey. Compare what people said about AI tools... 3 years ago, 2 years ago, 1 year ago, 6 months ago. Then think about what is plausible that people will be saying in 3 months, 6 months, 9 months ...

Moving the goalposts has always happened, but it is happening faster than I've ever seen it. Many people seem to redefine their expectations on a monthly basis now. Worse, they seem to be unaware they are doing it.

Fancy search? Ok, I'll bite. Compare today's "fancy search" to what we had ~3 years ago according to your choice of metric. Here's one: minutes spent relative to information found. Today, in ~5 minutes I can do a literature review that would have taken me easily 10+ hours five years ago. We don't need to argue phrasing when we can pick some prototypical tasks and compare them.

We're going to have different takes about where various AI technologies will be in these future timelines. It is much better to run to where the ball is likely to be, even if we have different ideas of where that is.

The human brain, at best, struggles to grasp even linear change. But linear change is not a good way to predict compounding technological change.

by xpe

4/19/2026 at 5:54:23 AM

> Today, in ~5 minutes I can do a literature review that would have taken me easily 10+ hours five years ago.

And it will not yield the same outcome you would have had. Your own taste in clicking links and pre-filtering as you do your research, is no longer being done if you outsource this. I‘m guilty of this myself. But let’s not kid ourselves.

I’ve had GPT Pro think 40 minutes about the ideal reverse osmosis setup for my home. It came up with something that would have been able to support 10 houses and cost 20k. Even though I did tell it all about what my water consumers are and that it should research their peak usage. It just failed to observe that you can buffer water in a tank.

There‘s a reason they let you steer GPT-Pro as it goes, now.

by manmal

4/19/2026 at 8:31:55 AM

I don't claim using AI is the same as doing it yourself. My point is that AI capabilities are much more extensive than "fancy search". By giving a metric and an example I hoped to make that point without getting into hair-splitting.

by xpe

4/19/2026 at 2:10:44 PM

I wouldn’t call that hair-splitting. I’m saying, it’s not a real literature review, but even fancier search.

by manmal

4/19/2026 at 2:39:02 PM

Words hint at concept space, which is messy and interconnected. I think a charitable reading can understand the difference between "powerful search, kind of like Google as of 2020, or Lexus-Nexus" and LLM-AI chatbot interfaces... I would hope. But I've been developing software since the 1980s so I can't speak for the newer generations who might not have a quadruple decade view. I've been in meetups in San Francisco around 2018, where people were excited to find multimodal reasoning in early days proto-language models. There have been qualitatively noticeable historical shifts. We don't have to agree on the exact labels used, but what LLM's enable is different enough from e.g. ElasticSearch of 2020 to call out.

by xpe

4/19/2026 at 4:12:41 AM

Your quoted example to make that point isn't particularly convincing, IMO. Cursor came out in 2023 and everything on that list would be a typical use case, plus ChatGPT for the search replacement.

Of course, it wasn't nearly as effective back then compared to current SOTA models, but none of those are hard to imagine someone recommending Cursor for anytime in 2024 or later.

If OP instead said something like one shotting an entire line of business app with 10k LoC I would agree with your reminder about perspective. But it feels somewhat hype-y to say that goal posts are being moved "monthly" when most of their list has been possible for years.

by toraway

4/19/2026 at 8:33:58 AM

I was attempting to give an example to say that AI-LLM technology is more than "fancy search" which to me sounds like "search engine". / I realize now that ChatGPT was released in late 2022, more than 3 years ago. Time flies.

> But it feels somewhat hype-y to say that goal posts are being moved "monthly"...

Here's what I mean. What you see if you kept a journal once a day and wrote down:

1. what impressed you about AI that day;

2. what did you do with it that day that you pretty much took for granted ("just SoTA")

Then compare today against 30 days ago. A lot changes! My point is that it is getting harder to impress us: our standard for what we expect seems to be changing significantly on a ~monthly basis. What does this rate of change where you "just expect something to work as table stakes" feel like to you? Certainly faster than annually, right? 6 months? 3? 2? 1?

For me, a lot of this isn't just the raw technology but also socialization of what the tools can do and the personal experience of doing it yourself.

by xpe

4/19/2026 at 6:30:54 AM

Can you explain this literature review process?

I don't believe you can do a same quality job with an LLM in 5 minutes.

by ozgrakkurt

4/19/2026 at 8:26:56 AM

I don't mean writing a literature review. I mean reviewing the literature to find what I need. My point is that this was not practical with "fancy search" three years ago by which I mean Google-like search engines.

My example: I wanted to get a sense for the feasibility of doing a project that blends Gaussian Processes, active learning, and pairwise comparisons. So I want to dig into the literature to find out what is out there. This was around 5 minutes with Claude. In this case, I don't think I could have found what I wanted in 10 hours of searching and reading. This is the kind of thing that great LLMs unlock.

by xpe

4/19/2026 at 10:58:39 AM

It is a better investment to read about those things for a bit in my experience. It should not be scary or niche to take some time and read a textbook or a high quality paper.

There is no replacement for reading textbooks or high quality papers.

If you are saying that you didn't do this kind of thing anyway and now you can do it. Then I would question the definition of the action you are doing because it is not the same in my opinion.

by ozgrakkurt

4/19/2026 at 8:11:25 PM

This is just "classic" (but avoidable) miscommunication. I don't even disagree with you on your point! I'm only saying you are not reading my point in the context it was offered.

For more background, please read Rapoport's Rules : https://themindcollection.com/rapoports-rules/

> Rapoport’s Rules, also known as Dennett’s Rules, is a list of four guidelines that detail how to interpret arguments charitably and criticise constructively. The concept was coined by philosopher Daniel C. Dennett in his book Intuition Pumps. Dennett acknowledged our proclivity to misinterpret and attack a counterpart’s argument instead of engaging meaningfully with what was actually said.

by xpe

4/19/2026 at 3:14:57 PM

You're relying on the public's sentiment as a metric. The public's sentiment is, more than often, skewed, influenced by marketing, or flat out wrong. That is not a good metric to rely on.

Did it ever occur to you that the ever changing goalposts might have more to do with the expensive marketing campaigns of the big LLM providers?

We could talk about what's a measurable metric and what's not. Certainly, we have not much more other than "benchmarks" of which, honestly, I don't know the veracity of, or if big LLM cheats somehow, or if the performance is even stable. The core idea is that LLMs remain able to do exactly what they were able to do back at release; text prediction. They got better in some regards, sure.

Your example is worrisome to me. It should be to you too. You didn't write a literature review, you generated a scaffold of a literature review, with the same vices of LLM-based-writing as anything it does and still needing review and revising. I would hope rewriting to avoid your work be associated with LLM-generation. For better or worse, you still need to, normally, revise your work. For, once again, because this point seems to be difficult to grasp, a text predictor is not a reliable source of information. We make tradeoffs, sacrificing reliability for ease of use, but any real work needs human reviewing: which goes back to my first point. In this example it's doing nothing other than it being a fancy search and scaffolding tool.

The ball is likely to be in the same place because, once again, they're text predictors. Not sentient beings, or intelligent. Still generating text, still hallucinating, probably even more so thanks to the ever increasing amount of LLM-written content on the internet and initiatives like poison fountain doing a number on the generated content.

It's wild to me to make such claims about the rate of change of those tools. You're claiming we'll see exponential gains for those tools, I take, while completely ignoring the base set of constraints those models will, never, be able to get rid of. They only know how to produce text. They don't know, and will never really, know if it's right.

by Bridged7756

4/19/2026 at 8:55:52 PM

Hi. I read your message, and I considered it. I've also read some of your previous HN comments. Briefly, I'll just say I've argued at length against many of the claims you make (you certainly aren't alone in making them). I don't feel it would be useful to repeat these again here, but I'll reference a few, below, just to show that I do care about the subject matter and am happy to dig deeper ...

... but only with certain conversational norms. I say this because I predict we aren't (yet) matched up in a way such that we would have a conversation useful to us. The main reason (I guess) isn't about our particular viewpoints nor about i.e. "if we're both critical thinkers". We're both demonstrating that frame, at least in our language. Instead, I think it is about the way we engage and what we want to get out of a conversation. Just to pick one particular guide star, I strive to follow Rapoport's Rules [1]. FWIW, HN Guidelines are not all that different, so simply by commenting here, one is explicitly joining a sort of social contract that point in their direction already.

Anatol Rapoport or Daniel Dennett were not only brilliant in their areas of specialty but also in teaching us how to criticize constructively in general. I offer the link at [1] just in case you want to read them and give them a try, here. We can start the conversation over (if you want).

---

In response to your comments about consciousness, intelligence, etc, here are some examples of what I mean by intelligence and why:

- intelligence: https://news.ycombinator.com/item?id=43236444

- general intelligence: https://news.ycombinator.com/item?id=43223521

- pressure towards AGI: https://news.ycombinator.com/item?id=41707643

- intelligence as "what machines cannot do" / no physics-based constraints to surpass human intelligence: https://news.ycombinator.com/item?id=44974963

---

[1]: https://assets.edge.bigthink.com/uploads/attachment/file/151...

by xpe

4/18/2026 at 6:47:37 PM

I just don’t see how they’ll be able to make a profit. Open models have the same performance on coding tasks now. The incentives are all wrong. Why pay more for a model that’s no better and also isn’t open? It’s nonsense

by danny_codes

4/18/2026 at 10:12:05 PM

I wouldn't say the same but it's pretty close. At this point I'm convinced that they'll continue running the marketing machine and people due to FOMO will keep hopping onto whatever model anthropic releases.

by Bridged7756

4/18/2026 at 9:48:13 PM

Which open model has the same performance as Opus 4.7?

by braebo

4/18/2026 at 11:08:34 PM

They dont have to be parity today.

If the frontier models reach a point of barely any noticeable improvements the trade off changes.

You do not need a perfect substitute if you are getting it for free...

People will factor in future expectations about the development of open source vs frontier models. Why do you think OAI and anthropic are pushing hard on marketing? its for this reason. They want to get contractual commitments that firms have to honour whilst open source closes the gap.

by 3dfd

4/19/2026 at 12:31:07 PM

The person they were responding to said "Open models have the same performance on coding tasks now." AFAIK this is bullshit, but I'd love to be corrected if I'm wrong.

by retsibsi

4/19/2026 at 2:17:29 AM

Open models, in actual practice, don't match up to even one or two generation prior models from Anthropic/OpenAI/Google. They've clearly been trained on the benchmarks. Entirely possible it was by mistake, but it's definitely happening.

by alex_sf

4/19/2026 at 12:13:27 PM

GLM 5.1 is absolutely on par with Sonnet 4.5, sometimes better in practice (it holds abstractions over longer context windows better)

It’s about the only one that is at that level though to be fair. They’re all still useful, still!

by girvo

4/19/2026 at 8:29:25 AM

That hasn’t been my experience. For coding at least I find little difference between closed and open models

by danny_codes

4/18/2026 at 11:15:35 PM

[dead]

by 3dfd

4/18/2026 at 11:49:42 PM

I think that's precisely why they're paying thousands of people in those other jobs to perform their tasks while collecting new data. Software was easiest because its already mostly written down, but other jobs can be quantized with enough data points. Just give it time

by djeastm

4/18/2026 at 9:20:45 PM

You have to guide an ai. Not let roam freely. If you got skills to guide you can make it output high quality

by holoduke

4/19/2026 at 9:03:13 AM

Of course, and I feel like Codex/GPT is generally better at following instructions and implementing a step-by-step plan and at a lower cost. Opus still has an edge in writing, brainstorming, and open-ended frontend vibe-coding.

I’m definitely not coming to this from a “AI is useless” angle. I’ve been using these tools extensively over the past year and they are providing a massive productivity boost.

by glerk

4/19/2026 at 10:28:39 AM

That is 100% correct as a foundation.

However when you guide the AI as a constant, and the model behaves MUCH differently (given a baseline guide), that is where the problem lies.

It's as if your 'guidance' has to be variable on how well the model is behaving. Analogy is a junior dev who is sometimes excellent, and sometimes shows up drunk for work and you have no breathalyzer.

by wallst07

4/19/2026 at 2:40:09 AM

> skills to guide

Is that what the soul is?

by the_gipsy

4/18/2026 at 11:07:20 PM

[dead]

by 3dfd

4/18/2026 at 6:36:42 PM

> ... but it seems like Anthropic is going for the Tinder/casino intermittent reinforcement strategy: optimized to keep you spending tokens instead of achieving results.

This part of the above comment strikes me as uncharitable and overconfident. And, to be blunt, presumptuous. To claim to know a company's strategy as an outsider is messy stuff.

My prior: it is 10X to 20X more likely Anthropic has done something other than shift to a short-term squeeze their customers strategy (which I think is only around ~5%)

What do I mean by "something other"? (1) One possibility is they are having capacity and/or infrastructure problems so the model performance is degraded. (2) Another possibility is that they are not as tuned to to what customers want relative to what their engineers want. (3) It is also possible they have slowed down their models down due to safety concerns. To be more specific, they are erring on the side of caution (which would be consistent with their press releases about safety concerns of Mythos). Also, the above three possibilities are not mutually exclusive.

I don't expect us (readers here) to agree on the probabilities down to the ±5% level, but I would think a large chunk of informed and reasonable people can probably converge to something close to ±20%. At the very least, can we agree all of these factors are strong contenders: each covers maybe at least 10% to 30% of the probability space?

How short-sighted, dumb, or back-against-the-wall would Anthropic have to be to shift to a "let's make our new models intentionally _worse_ than our previous ones?" strategy? Think on this. I'm not necessarily "pro" Anthropic. They could lose standing with me over time, for sure. I'm willing to think it through. What would the world have to look like for this to be the case.

There are other factors that push back against claims of a "short-term greedy strategy" argument. Most importantly, they aren't stupid; they know customers care about quality. They are playing a longer game than that.

Yes, I understand that Opus 4.7 is not impressing people or worse. I feel similarly based on my "feels", but I also know I haven't run benchmarks nor have I used it very long.

I think most people viewed Opus 4.6 as a big step forward. People are somewhat conditioned to expect a newer model to be better, and Opus 4.7 doesn't match that expectation. I also know that I've been asking Claude to help me with Bayesian probabilistic modeling techniques that are well outside what I was doing a few weeks ago (detailed research and systems / software development), so it is just as likely that I'm pushing it outside its expertise.

by xpe

4/18/2026 at 7:19:43 PM

> To claim to know a company's strategy as an outsider is messy stuff.

I said "it seems like". Obviously, I have no idea whether this is an intentional strategy or not and it could as well be a side effect of those things that you mentioned.

Models being "worse" is the perceived effect for the end user (subjectively, it seems like the price to achieve the same results on similar tasks with Opus has been steadily increasing). I am claiming that there is no incentive for Anthropic to address this issue because of their business model (maximize the amount of tokens spent and price per token).

by glerk

4/19/2026 at 2:20:10 AM

>>> ... but it seems like Anthropic is going for the Tinder/casino intermittent reinforcement strategy: optimized to keep you spending tokens instead of achieving results.

>> This part of the above comment strikes me as uncharitable and overconfident. And, to be blunt, presumptuous. To claim to know a company's strategy as an outsider is messy stuff.

> I said "it seems like".

Sorry. I take back the "presumptuous" part. But part of my concern remains: of all the things you chose to wrote, you only mentioned "the Tinder/casino intermittent reinforcement strategy". That phrase is going to draw eyeballs, and you got mine at least. As a reader, it conveys you think it is the most likely explanation. I'm trying to see if there is something there that I'm missing. How likely do you think is? Do you think it is more likely than the other three I mentioned? If so, it seems like your thinking hinges on this:

> I am claiming that there is no incentive for Anthropic to address this issue because of their business model (maximize the amount of tokens spent and price per token).

No incentive? Hardly. First, Anthropic is not a typical profit-maximizing entity, it a Public Benefit Corporation [1] [2]. Yes, profits matter still, but there are other factors to consider if we want to accurately predict their actions.

Second, even if profit maximization is the only incentive in play, profit-maximizing entities can plan across different time horizons. Like I mentioned in my above comment, it would be rather myopic to damage their reputation with a strategy that I summarize as a short-term customer-squeeze strategy.

Third, like many people here on HN, I've lived in the Bay Area, and I have first-degree connections that give me high confidence (P>80%) that key leaders at Anthropic have motivations that go much beyond mere profit maximization.

A\'s AI safety mission is a huge factor and not the PR veneer that pessimists tend to claim. Most people who know me would view me as somewhat pessimistic and anti-corporate and P(doomy). I say this to emphasize I'm not just casting stones at people for "being negative". IMO, failing to recognize and account for Anthropic's AI safety stance isn't "informed hard-hitting pessimism" so much as "limited awareness and/or poor analysis".

I'm not naive. That safety mission collides in a complicated way with FU money potential. Still, I'm confident (P>60%) that a significant number (>20%) of people at Anthropic have recently "cerebrated bad times" [3] i.e. cogitated futures where most humans die or lose control due to AI within ~10 to ~20 years. Being filthy rich doesn't matter much when dead or dehumanized.

[1]: https://law.justia.com/codes/delaware/title-8/chapter-1/subc...

[2]: https://time.com/6983420/anthropic-structure-openai-incentiv...

[3]: Weird Al: please make "Cerebration" for us.

by xpe

4/19/2026 at 4:39:29 AM

I like your style, and I appreciate you trying to get to the truth, despite us both being aware that we are engaging in persuasive writing here, so part of the rhetorical game is in what we choose to emphasize and what we choose to leave out.

> How likely do you think this is? Do you think it is more likely than the other three I mentioned?

I won't write down probability estimates, because frankly, I have no idea. Unless you are yourself a decision-maker at Anthropic, which, from what I can infer, you aren't, both of us are speculating. However, I can try to address each of your explanations at face value, because I don't think any of them makes Anthropic look any better than the explanation I provided.

> (1) One possibility is they are having capacity and/or infrastructure problems so the model performance is degraded.

As far as I understand it, scaling issues would result in increased latency or requests being dropped, not model quality being lower. However, there is a very widespread rumor that Anthropic is routing traffic to quantized models during peak times to help decrease costs. Boris Cherny, Thariq Shihipar, and others have repeatedly denied this is happening [1]. I would be more concerned if this were the actual explanation, because as a user of the Claude Code Max plan and of the API, I have the expectation that each dollar I spend buys me access to the same model without opaque routing in the background.

> (2) Another possibility is that they are not as tuned to what customers want relative to what their engineers want.

There is actually a strong case for this: the high performance on the benchmarks relative to the qualitatively low performance reported on real-world tasks after launch. I suspect quite a bit of RL training was spent optimizing for beating those benchmarks, which resulted in overfitting the model on particular kinds of tasks. I'm not claiming this is nefarious in any way or that it is something only Anthropic is guilty of doing: these benchmarks are supposed to be a good representation of general software tasks, and using them as a training ground is expected.

> (3) It is also possible they have slowed their models down due to safety concerns. To be more specific, they are erring on the side of caution (which would be consistent with their press releases about safety concerns of Mythos).

This would be the most concerning to me. I don't want to get too deeply into a political/philosophical argument, but I am very much on the other side of the e/accy vs. P(doomy) debate, and I strongly believe that keeping these tools under the control of some council of enlightened elders who claim to know what is best for humanity is ultimately futile.

If the result of the behind-the-scenes "cerebration" is an actual effort to try and slow down AI development or access, I don't have much confidence in the future of Anthropic.

I agree that there are incentives other than pure profit maximization here (I don't want to get into "my friend at Anthropic told me such and such" games, but I also believe this is the case). I'm sure there is some tension between these objectives inside Anthropic, but what is interesting is that lower model quality and maximizing user engagement could, at least in principle, align with both constraints.

[1] https://x.com/trq212/status/2043023892579766290

by glerk

4/19/2026 at 10:44:14 PM

I strive to be decently Bayesian and embrace uncertainty. I'm sharing my probability estimates because it helps me to stop and think ("is this roughly what I think?" and "let spend a minute making sure before I say so"). But yeah, of course, they are my priors and fuzzy. Hopefully I can reflect I figure them out +/- 15% or so. But at least you can see how my takes compare with each other. And down the road I can see how I did.

Thanks for getting into some of the details ...

>> (1) One possibility is they are having capacity and/or infrastructure problems so the model performance is degraded.

> As far as I understand it, scaling issues would result in increased latency or requests being dropped, not model quality being lower.

Yes, many scaling issues would manifest in that way -- but not all. It seems plausible for Anthropic to have other ways to degrade model performance that don't show up in the latency or reliability metrics. I need to research more... (I'll try to think more on your other points later).

by xpe

4/24/2026 at 12:34:31 AM

Relevant I think: https://www.anthropic.com/engineering/april-23-postmortem

by xpe

4/18/2026 at 5:04:20 PM

AFAICT this uses a token-counting API so that it counts how many tokens are in the prompt, in two ways, so it's measuring the tokenizer change in isolation. Smarter models also sometimes produce shorter outputs and therefore fewer output tokens. That doesn't mean Opus 4.7 necessarily nets out cheaper, it might still be more expensive, but this comparison isn't really very useful.

by kalkin

4/18/2026 at 5:16:14 PM

For some real data, Artificial Analysis reported that 4.6 (max) and 4.7 (max) used 160M tokens and 100M tokens to complete their benchmark suite, respectively:

https://artificialanalysis.ai/?intelligence-efficiency=intel...

Looking at their cost breakdown, while input cost rose by $800, output cost dropped by $1400. Granted whether output offsets input will be very use-case dependent, and I imagine the delta is a lot closer at lower effort levels.

by h14h

4/18/2026 at 7:27:10 PM

This is the right way of thinking end-to-end.

Tokenizer changes are one piece to understand for sure, but as you say, you need to evaluate $/task not $/token or #tokens/task alone.

by theptip

4/18/2026 at 5:06:48 PM

Why is it not useful? Input token pricing is the same for 4.7. The same prompt costs roughly 30% more now, for input.

by manmal

4/18/2026 at 5:12:27 PM

The idea is that smarter models might use fewer turns to accomplish the same task - reducing the overall token usage

Though, from my limited testing, the new model is far more token hungry overall

by dktp

4/18/2026 at 5:34:21 PM

Well you‘ll need the same prompt for input tokens?

by manmal

4/18/2026 at 6:09:31 PM

Only the first one. Ideally now there is no second prompt.

by httgbgg

4/18/2026 at 6:17:19 PM

Are you aware that every tool call produces output which also counts as input to the LLM?

by manmal

4/19/2026 at 2:58:45 AM

Are you aware that a lot of model tool calls are useless and a smarter model could avoid those?

Are you aware that output tokens are priced 5x higher than input tokens?

by squeaky-clean

4/19/2026 at 4:17:55 AM

> a lot of model tool calls are useless

That’s just wrong. File reads, searches, compiler output, are the top input token consumers in my workflow. None of them can be removed. And they are the majority of my input tokens. That’s also why labs are trying to make 1M input work, and why compaction is so important to get right.

Regarding output - yes, but that wasn’t the topic in this thread. It’s just easier to argue with input tokens that price has gone up. I have a hunch the price for output will go up similarly, but can’t prove it. The jury’s out IMO: https://news.ycombinator.com/item?id=47816960

by manmal

4/19/2026 at 4:32:25 PM

This has no bearing on my comment. The point is that a better model avoids dozens of prompts and tool calls by making fewer CORRECT tool calls, with the user needing no more prompts.

I’m surprised this is even a question; obviously a better prompter has the same properties and it’s not in dispute?

by httgbgg

4/18/2026 at 5:10:33 PM

That's valid, but it's also worth knowing it's only one part of the puzzle. The submission title doesn't say "input".

by kalkin

4/18/2026 at 5:42:35 PM

Yes. I actually noticed my token usage go down on 4.6 when I started switching every session to max effort. I got work done faster with fewer steps because thinking corrected itself before it cycled.

I’ve noticed 4.7 cycling a lot more on basic tasks. Though, it also seems a bit better at holding long running context.

by SkyPuncher

4/18/2026 at 6:00:11 PM

With AIs, it seems like there never is a comparison that is useful.

by the_gipsy

4/18/2026 at 7:47:00 PM

You can build evals. Look at Harbor or Inspect. It’s just more work than most are interested in doing right now.

by theptip

4/18/2026 at 6:17:20 PM

yup its all vibes. And anthropic is winning on those in my book still

by jascha_eng

4/18/2026 at 5:27:10 PM

For now, I'm planning to stick with Opus 4.5 as a driver in VSCode Copilot.

My workflow is to give the agent pretty fine-grained instructions, and I'm always fighting agents that insist on doing too much. Opus 4.5 is the best out of all agents I've tried at following the guidance to do only-what-is-needed-and-no-more.

Opus 4.6 takes longer, overthinks things and changes too much; the high-powered GPTs are similarly flawed. Other models such as Sonnet aren't nearly as good at discerning my intentions from less-than-perfectly-crafted prompts as Opus.

Eventually, I quit experimenting and just started using Opus 4.5 exclusively knowing this would all be different in a few months anyway. Opus cost more, but the value was there.

But now I see that 4.7 is going to replace both 4.5 and 4.6 in VSCode Copilot, and with a 7.5x modifier. Based on the description, this is going to be a price hike for slower performance — and if the 4.5 to 4.6 change is any guide, more overthinking targeted at long-running tasks, rather than fine-grained. For me, that seems like a step backwards.

by rectang

4/18/2026 at 9:48:06 PM

Why not just use Sonnet?

by axpy906

4/18/2026 at 9:58:36 PM

I've used Sonnet a lot. It is not as good as Opus at understanding what I'm asking for. I have to coach Sonnet more closely, taking more care to be precise in my prompts, and often building up Plan steps when I could just YOLO an Agent instruction at Opus and it would get it right.

I find that Opus is really good at discerning what I mean, even when I don't state it very clearly. Sonnet often doesn't quite get where I'm going and it sometimes builds things that don't make sense. Sonnet also occasionally makes outright mistakes, like not catching every location that needs to be changed; Opus makes nearly every code change flawlessly, as if it's thinking through "what could go wrong" like a good engineer would.

Sonnet is still better than older and/or less-capable models like GPT 4.1, Raptor mini (Preview), or GPT-5 mini, which all fail in the same way as Sonnet but more dramatically... but Opus is much better than Sonnet.

Recent full-powered GPTs (including the Codex variants) are competitive with Opus 4.6, but Opus 4.5 in particular is best in class for my workflow. I speculate that Opus 4.5 dedicates the most cycles out of all models to checking its work and ensuring correctness — as opposed to reaching for the skies to chase ambitious, highly complex coding tasks.

by rectang

4/19/2026 at 6:56:13 AM

Sonnet is very clearly worse than Opus for a lot of tasks. Sonnet is still awesome, but less so.

by tobyhinloopen

4/18/2026 at 8:05:16 PM

> 4.7 is going to replace both 4.5 and 4.6

as in 4.5 is no longer going to be avail? F.

ive also been sticking with 4.5 that sucks

by trueno

4/18/2026 at 9:31:39 PM

https://github.blog/changelog/2026-04-16-claude-opus-4-7-is-...

> Over the coming weeks, Opus 4.7 will replace Opus 4.5 and Opus 4.6 in the model picker for Copilot Pro+[...]

> This model is launching with a 7.5× premium request multiplier as part of promotional pricing until April 30th.

by rectang

4/19/2026 at 12:36:03 AM

Promotional pricing? Are they saying that after the promotion, it will cost more than 7.5x??

by xstas1

4/19/2026 at 12:00:12 PM

Hmm, maybe they're discouraging copilot+Claude through pricing, nudging people to anthropic suite of tools. That sucks. I've been super happy with copilot+opus/sonnet.

by d0gsg0w00f

4/19/2026 at 2:15:08 AM

Yup

by freely0085

4/19/2026 at 9:05:59 AM

Well fuck that then

by roygbiv2

4/18/2026 at 11:34:06 PM

[dead]

by benjiro3000

4/18/2026 at 5:18:42 PM

I was using Opus 4.7 just yesterday to help implement best practices on a single page website.

After just ~4 prompts I blew past my daily limit. Another ~7 more prompts & I blew past my weekly limit.

The entire HTMl/CSS/JS was less than 300 lines of code.

I was shocked how fast it exhausted my usage limits.

by tiffanyh

4/18/2026 at 6:33:34 PM

What's your reasoning effort set to? Max now uses way more tokens and isn't suggested for most usecases. Even the new default (xhigh) uses more than the old default (medium).

by zaptrem

4/18/2026 at 10:25:00 PM

That's what I'm wondering. Is it people are defaulting to xhigh now and that's why it feels like it's consuming a lot more tokens? If people manually set it to medium, would it be comparable?

by nixpulvis

4/19/2026 at 12:44:03 AM

Switching back to medium seems to have fixed the issue for me.

by nixpulvis

4/18/2026 at 5:33:13 PM

Which plan are you on? I could see that happening with Pro (which I think defaults to Sonnet?), would be surprised with Max…

by sync

4/19/2026 at 3:02:09 AM

Opus is not available for claude code in pro

by iammrpayments

4/20/2026 at 4:24:24 PM

Yes it is.

https://claude.com/pricing

It's not available on Free plan, but it's available on Pro.

by eneveu

4/18/2026 at 5:35:30 PM

It eats even the Max plan like crazy.

by templar_snow

4/18/2026 at 5:46:29 PM

Pro. It even gave me $20 free credits, and exhausted free credits nearly instantly.

by tiffanyh

4/19/2026 at 1:21:00 AM

The pro plan is useless. You need at least the 5x max plan to get any real work done.

That said I find the GPT plans much better value.

by cageface

4/19/2026 at 5:32:54 AM

Yeah. But then you have to use GPT

by someguyiguess

4/19/2026 at 12:52:40 PM

GPT 5.4 in Codex is good enough now.

Try it.

by sumedh

4/19/2026 at 3:27:58 AM

HN is getting ridiculous. You cannot seriously be complaining about Opus token usage on the Pro plan.

by xvector

4/19/2026 at 3:30:04 AM

Compared to the usage you get on OpenAI's $20 plan tho?

by fragmede

4/18/2026 at 5:33:50 PM

Are you using Claude subscription? Because that's not how it works there.

by tomtomistaken

4/18/2026 at 5:30:09 PM

I haven't used Claude. Because I suspect this sort of things to come.

With enterprise subscription, the bill gets bigger but it's not like VP can easily send a memo to all its staff that a migration is coming.

Individuals may end their subscription, that would appease the DC usage, and turn profits up.

by hirako2000

4/18/2026 at 9:23:33 PM

Sorry you are missing out. I use claude all day every day with max and what people are reporting here has not been my experience. My current usage is 16% and it resets Thursday.

by fooster

4/20/2026 at 12:26:55 PM

Some research argue LLMs (up to 2025) were giving a false sense of higher productivity.

That and atrophy, I will pass on what Claude is trying to accomplish.

I'm not dismissing LLMs entirely, for certain cases the concerns don't apply, at least not as much.

by hirako2000

4/19/2026 at 1:36:51 AM

[dead]

by semcheck

4/18/2026 at 6:49:04 PM

> Opus 4.7 (Adaptive Reasoning, Max Effort) cost ~$4,406 to run the Artificial Analysis Intelligence Index, ~11% less than Opus 4.6 (Adaptive Reasoning, Max Effort, ~$4,970) despite scoring 4 points higher. This is driven by lower output token usage, even after accounting for Opus 4.7's new tokenizer. This metric does not account for cached input token discounts, which we will be incorporating into our cost calculations in the near future.

by hereme888

4/19/2026 at 11:00:01 AM

uhmmm so that is the p they are hacking? it would actually explain a lot

by muyuu

4/19/2026 at 2:26:58 PM

It's an incomplete cost model, but it's not p-hacking. Could be cherry-picking.

by hereme888

4/18/2026 at 4:51:02 PM

Should the title here be 4.6 to 4.7 instead of the other way around?

by someuser54541

4/18/2026 at 4:52:24 PM

Writing Opus 4.6 to 4.7 does make more sense for people who read left to right.

by UltraSane

4/18/2026 at 5:05:11 PM

I’m impressed with anyone who can read English right to left.

by pixelatedindex

4/18/2026 at 5:50:47 PM

You might like https://en.wikipedia.org/wiki/Boustrophedon

by jlongman

4/18/2026 at 6:22:22 PM

Whoa! TIL! I struggled a bit to read this style at first, but felt it get easier after a few tries.

by amulyabaral

4/18/2026 at 5:39:06 PM

Right to Left English - read can, who? Anyone with [which] impressed am I.

by einpoklum

4/19/2026 at 7:20:53 AM

English can be read in a different order than the normal order when the sentences contain words for which it is easy to guess whether they are agents or patients, e.g. when the agents are animate nouns and the patients are inanimate nouns, or when pronouns are used for the agents or patients.

Otherwise, the non-standard order can be understood incorrectly. While the distinction between agents and patients is the most important that depends on word order in English, there are also other order-dependent distinctions, e.g. between beneficiary and patient, when the beneficiary is not marked by a preposition, or between a noun and its attribute, e.g. "police dog" is not the same as "dog police" and unless there is a detailed context you cannot know what is meant when the word order is wrong.

English is one of the languages with the most rigid word order. There are languages, especially among older languages, where almost any word order can be used without causing ambiguities, because all the possible roles of the words are marked by prepositions, postpositions or affixes (or sometimes by accentuation shifts).

by adrian_b

4/19/2026 at 8:38:23 AM

In my example, the RTL reading is indeed a misunderstanding. I even cheated, because it really should have been:

> Left to Right English - read can, who? Anyone with [which] impressed am I.

and the causation is wrong; instead of the ability being impressive, it's the impressive character than allows reading in the opposite order.

So, you're right, and now I'll wait for the dog police to come pick me up.

by einpoklum

4/18/2026 at 5:51:29 PM

Yoda, you that is?

by y1n0

4/18/2026 at 5:05:39 PM

But the page is not in a language that should be read right to left, doesn't that make that kind of confusing?

by embedding-shape

4/18/2026 at 5:24:12 PM

Did you mean "right to left"?

by usrnm

4/18/2026 at 5:30:45 PM

I very much did, it got too confusing even for me. Thanks!

by embedding-shape

4/18/2026 at 7:18:24 PM

I kept mentally verifying that English is written left to right.

by UltraSane

4/18/2026 at 5:15:30 PM

Err, how so?

by bee_rider

4/18/2026 at 4:52:12 PM

absolutely!

by freak42

4/18/2026 at 5:54:50 PM

It's increasingly looking naive to assume scaling LLMs is all you need to get to full white-collar worker replacement. The attention mechanism / hopfield network is fundamentally modeling only a small subset of the full human brain, and all the increasing sustained hype around bolted-on solutions for "agentic memory" is, in my opinion, glaring evidence that these SOTA transformers alone aren't sufficient even when you just limit the space to text. Maybe I'm just parroting Yann LeCun.

by gsleblanc

4/18/2026 at 6:59:09 PM

You probably are.

The "small subset" argument is profoundly unconvincing, and inconsistent with both neurobiology of the human brain and the actual performance of LLMs.

The transformer architecture is incredibly universal and highly expressive. Transformers power LLMs, video generator models, audio generator models, SLAM models, entire VLAs and more. It not a 1:1 copy of human brain, but that doesn't mean that it's incapable of reaching functional equivalence. Human brain isn't the only way to implement general intelligence - just the one that was the easiest for evolution to put together out of what it had.

LeCun's arguments about "LLMs can't do X" keep being proven wrong empirically. Even on ARC-AGI-3, which is a benchmark specifically designed to be adversarial to LLMs and target the weakest capabilities of off the shelf LLMs, there is no AI class that beats LLMs.

by ACCount37

4/18/2026 at 7:27:54 PM

> Human brain isn't the only way to implement general intelligence - just the one that was the easiest for evolution to put together out of what it had.

The human brain is not a pretrained system. It's objectively more flexible than than transformers and capable of self-modulation in ways that no ML architecture can replicate (that I'm aware of).

by bigyabai

4/18/2026 at 7:32:43 PM

Human brain's "pre-training" is evolution cramming way too much structure into it. It "learns from scratch" the way it does because it doesn't actually learn from scratch.

I've seen plenty of wacky test-time training things used in ML nowadays, which is probably the closest to how the human brain learns. None are stable enough to go into the frontier LLMs, where in-context learning still reigns supreme. In-context learning is a "good enough" continuous learning approximatation, it seems.

by ACCount37

4/18/2026 at 7:36:52 PM

> In-context learning is a "good enough" continuous learning approximatation, it seems.

"it seems" is doing a herculean effort holding your argument up, in this statement. Say, how many "R"s are in Strawberry?

by bigyabai

4/18/2026 at 7:42:56 PM

If you think that "strawberry" is some kind of own, I don't know what to tell you. It takes deep and profound ignorance of both the technical basics of modern AIs and the current SOTA to do this kind of thing.

LLMs get better release to release. Unfortunately, the quality of humans in LLM capability discussions is consistently abysmal. I wouldn't be seeing the same "LLMs are FUNDAMENTALLY FLAWED because I SAY SO" repeated ad nauseam otherwise.

by ACCount37

4/18/2026 at 7:45:32 PM

I can ask a nine-year-old human brain to solve that problem with a box of Crayola and a sheet of A4 printer paper.

In-context learning is professedly not "good enough" to approximate continuous learning of even a child.

by bigyabai

4/18/2026 at 8:11:59 PM

You're absolutely wrong!

You can also ask an LLM to solve that problem by spelling the word out first. And then it'll count the letters successfully. At a similar success rate to actual nine-year-olds.

There's a technical explanation for why that works, but to you, it might as well be black magic.

And if you could get a modern agentic LLM that somehow still fails that test? Chances are, it would solve it with no instructions - just one "you're wrong".

1. The LLM makes a mistake

2. User says "you're wrong"

3. The LLM re-checks by spelling the word out and gives a correct answer

4. The LLM then keeps re-checking itself using the same method for any similar inquiry within that context

In-context learning isn't replaced by anything better because it's so powerful that finding "anything better" is incredibly hard. It's the bread and butter of how modern LLM workflows function.

by ACCount37

4/19/2026 at 3:17:01 AM

This is false. You can ask it to spell out strawberry and count the letters and it will still say 2 (it's unable to actually count the letters by the way). The only way to get a model that believes strawberry has 2 R's to consistently give the correct answer is to ask it to code the problem and return the output.

In fact, asking a model not to repeat the same mistake makes it more likely to commit that mistake again, because it's in it's context.

I think anyone who uses LLMs a lot will tell your that your steps 3 and 4 are fictional.

by squeaky-clean

4/19/2026 at 12:38:24 PM

Have you actually tried?

The "spell out" trick, by the way, was what was added to the system prompts of frontier models back when this entire meme was first going around. It did mitigate the issue.

by ACCount37

4/18/2026 at 10:30:07 PM

> it's so powerful that finding "anything better" is incredibly hard.

We're back around to the start again. "Incredibly hard" is doing all of the heavy lifting in this statement, it's not all-powerful and there are enormous failure cases. Neither the human brain nor LLMs are a panacea for thought, but nobody in academia or otherwise is seriously comparing GPT to the human brain. They're distinct.

> There's a technical explanation for why that works, but to you, it might as well be black magic.

Expound however much you need. If there's one thing I've learned over the past 12 months it's that everyone is now an expert on the transformer architecture and everyone else is wrong. I'm all ears if you've got a technical argument to make, the qualitative comparison isn't convincing me.

by bigyabai

4/19/2026 at 12:59:55 PM

I do know far more than you, which is a laughably low bar. If you want someone to hold your hand through it, ask an LLM.

The key words are "tokenization" and "metaknowledge", the latter being the only non-trivial part. An LLM can explain it in detail. They know more than you do too.

by ACCount37

4/18/2026 at 11:36:51 PM

why is the breakdown from words to letters your highest priority thing to add to the training data?

what problem does this allow you to solve that you couldnt otherwise?

by 8note

4/19/2026 at 3:21:29 AM

This comment is tangential to their point that a transformer architecture can or cannot be functionally equivalent to a human brain. Practicality of those limitations is a different discussion

by squeaky-clean

4/18/2026 at 6:20:22 PM

> you just limit the space to text

And even then... why can't they write a novel? Or lowering the bar, let's say a novella like Death in Venice, Candide, The Metamorphosis, Breakfast at Tiffany's...?

Every book's in the training corpus...

Is it just a matter of someone not having spent a hundred grand in tokens to do it?

by aerhardt

4/18/2026 at 6:49:53 PM

I know someone spending basically every day writing personal fan fiction stories using every model you can find. She doesn't want to share it, and does complain about it a lot, seems like maintaining consistency for something say 100 pages long is difficult

by voxl

4/18/2026 at 6:41:09 PM

I don’t understand - there are hundreds/thousands of AI written books available now.

by conception

4/18/2026 at 6:44:13 PM

I've glossed over a few and one can immediately tell they don't meet the average writing level you'd see in a local workshop for writers, and much less that of Mann or Capote.

by aerhardt

4/18/2026 at 7:35:45 PM

Never mind novels, it can't even write a good Reddit-style or HN-style comment. agentalcove.ai has an archive of AI models chatting to one another in "forum" style and even though it's a good show of the models' overall knowledge the AIisms are quite glaring.

by zozbot234

4/18/2026 at 8:44:27 PM

They definitely can, and do.

It's just that the ones that manage to suppress all the AI writing "tells" go unnoticed as AI. This is a type of survivorship bias, though I feel there must be a better term for it that eludes me.

by mh-

4/18/2026 at 6:36:13 PM

Who says they can't? What's your bar that needs to be passed in order for "written a novella" to be achieved?

There's a lot of bad writing out there, I can't imagine nobody has used an LLM to write a bad novella.

by colechristensen

4/18/2026 at 6:38:37 PM

> What's your bar that needs to be passed

I provide four examples in my comment...

by aerhardt

4/18/2026 at 6:46:53 PM

Your qualification for if an LLM can write a novella is it has to be as good as The Metamorphosis?

Yes, those are examples of novellas, surely you believe an LLM could write a bad novella? I'm not sure what your point is. Either you think it can't string the words together in that length or your standard is it can't write a foundational piece of literature that stays relevant for generations... I'm not sure which.

by colechristensen

4/18/2026 at 6:51:01 PM

I don't think it can write something that's of a fraction of the quality of Kafka.

But GP's argument ("limit the space to text") could be taken to imply - and it seems to be a common implication these days - that LLMs have mastered the text medium, or that they will very soon.

> it can't write a foundational piece of literature

Why not, if this a pure textual medium, the corpus includes all the great stories ever written, and possibly many writing workshops and great literature courses?

by aerhardt

4/18/2026 at 7:03:01 PM

I don't know what to tell you. It's more than a little absurd to make the qualification of being able to do something to be that the output has to be considered a great work of art for generations.

by colechristensen

4/18/2026 at 7:22:37 PM

I agree that the argument starts from a reduction to the absurd.

So at least we can agree that AI hasn't mastered the text medium, without further qualification?

And what about my argument, further qualified, which is that I don't think it could even write as well as a good professional writer - not necessarily a generational one?

by aerhardt

4/18/2026 at 10:30:40 PM

>AI hasn't mastered the text medium

I don't know what this means and I don't know what would qualify it as having "mastered" at all. Seems like a no-true-Scotsman thing where regardless there would always be someone that it couldn't actually do a thing because this and that.

>why can't they write a novel?

This is what I'm disagreeing with. I think an LLM can write a novel well enough that it's recognizably a pretty mediocre novel, no worse than the median written human novel which to be fair is pretty bad. You seem to have an unqualified bar something needs to pass before "writing a novel" is accomplished but it's not clear what that is. At the same time you're switching between the ability to do a thing and the ability to do a thing in a way that's honored as the best of the best for a century. So I don't know it kind of seems like you just don't like AI and have a different standard for it that adjusts so that it fails. This doesn't match what you'd consider some random Bob's ability to do a thing.

by colechristensen

4/19/2026 at 12:41:37 PM

I don't dislike AI, I use it every day for coding and increasingly for non-technical tasks, and have also used it in enterprise workloads to great success. I am fairly optimistic about it - I think it will remove a lot of drudgery and make things economical which previously weren't.

I am just challenging the notions that "if you limit it to text, it's doing really well" or that the text contains in itself all the information that is needed to carry out a task to a certain level of quality. This applies in my experience not only to writing literature but also to certain human tasks which may appear mundane and easy to automate.

by aerhardt

4/19/2026 at 3:14:30 AM

If the end result is most books will be written by AI you need the possiblity of that qualification. If its only capable of certain types of book then we will need endless amounts of that.

by ipaddr

4/18/2026 at 10:51:47 PM

I think they're as good as they're going to get from scaling. They can still get more efficient, and tooling/harnesses around them will improve.

by mohamedkoubaa

4/18/2026 at 11:18:03 PM

[dead]

by 3dfd

4/18/2026 at 8:03:31 PM

You can configure the status line to get a feel for token usage:

[Opus 4.6] 3% context | last: 5.2k in / 1.1k out

add this to .claude/settings.json

"statusLine": { "type": "command", "command": "jq -r '\"[\\(.model.display_name)] \\(.context_window.used_percentage // 0)% context | last: \\(((.context_window.current_usage.input_tokens // 0) / 1000 * 10 | floor / 10))k in / \\(((.context_window.current_usage.output_tokens // 0) / 1000 * 10 | floor / 10))k out\"'" }

by npollock

4/18/2026 at 7:08:59 PM

My impression is that the quality of the conversation is unexpectedly better: more self-critical, the suggestions are always critical, the default choices constantly best. I might not have as many harnesses as most people here, so I suspect it’s less obvious but I would expect this to make it far more valuable for people who haven’t invested as much.

After a few basic operations (retrospective look at the flow of recent reviews, product discussions) I would expect this to act like a senior member of the team, while 4.6 was good, but far more likely to be a foot-gun.

by bertil

4/18/2026 at 5:17:13 PM

We dropped Claude. It's pretty clear this is a race to the bottom, and we don't want a hard dependency on another multi-billion dollar company just to write software

We'll be keeping an eye on open models (of which we already make good use of). I think that's the way forward. Actually it would be great if everybody would put more focus on open models, perhaps we can come up with something like the "linux/postgres/git/http/etc" of the LLMs: something we all can benefit from while it not being monopolized by a single billionarie company. Wouldn't it be nice if we don't need to pay for tokens? Paying for infra (servers, electricity) is already expensive enough

by dakiol

4/18/2026 at 5:21:08 PM

>we don't want a hard dependency on another multi-billion dollar company just to write software

One of two main reasons why I'm wary of LLMs. The other is fear of skill atrophy. These two problems compound. Skill atrophy is less bad if the replacement for the previous skill does not depend on a potentially less-than-friendly party.

by ahartmetz

4/18/2026 at 5:47:13 PM

I was worried about skill atrophy. I recently started a new job, and from day 1 I've been using Claude. 90+% of the code I've written has been with Claude. One of the earlier tickets I was given was to update the documentation for one of our pipelines. I used Claude entirely, starting with having it generate a very long and thorough document, then opening up new contexts and getting it to fact check until it stopped finding issues, and then having it cut out anything that was granular/one query away. And then I read what it had produced.

It was an experiment to see if I could enter a mature codebase I had zero knowledge of, look at it entirely through an AI, and come to understand it.

And it worked! Even though I've only worked on the codebase through Claude, whenever I pick up a ticket nowadays I know what file I'll be editing and how it relates to the rest of the code. If anything, I have a significantly better understanding of the codebase than I would without AI at this point in my onboarding.

by post-it

4/18/2026 at 6:08:15 PM

Yeah, +1. I will never be working on unsolved problems anyhow. Skill atrophy is not happening if you stay curious and responsible.

by estetlinus

4/18/2026 at 6:28:55 PM

I have never learned so quickly in my entire life than to post a forum thread in its entirety into a extended think LLM and then be allowed to ask free form questions for 2 hours straight if I want to. Having my questions answered NOW is so important for me to learn. Back in the day by the time I found the answer online I forgot the question

by stringfood

4/18/2026 at 7:36:47 PM

Same. I work in the film industry, but I’ve always been interested in computers and have enjoyed tinkering with them since I was about 5. However, coding has always been this insurmountably complicated thing- every time I make an effort to learn, I’m confronted with concepts that are difficult for me to understand and process.

I’ve been 90% vibe coding for a year or so now, and I’ve learned so much about networking just from spinning up a bunch of docker containers and helping GPT or Claude fix niggling issues.

I essentially have an expert (well, maybe not an expert but an entity far more capable than I am on my own) who’s shoulder I can look over and ask as many questions I want to, and who will explain every step of the process to me if I want.

I’m finally able to create things on my computer that I’ve been dreaming about for years.

by lobf

4/19/2026 at 5:06:01 PM

I pivoted from the film industry into AI 10 years ago. My end game is to replace movie magic.

by estetlinus

4/18/2026 at 6:34:22 PM

Some people talk like skill atrophy is inevitable when you use LLMs, which strikes me as pretty absurd given that you are talking about a tool that will answer an infinite number of questions with infinite patience.

I usually learn way more by having Claude do a task and then quizzing it about what it did than by figuring out how to do it myself. When I have to figure out how to do the thing, it takes much more time, so when I'm done I have to move on immediately. When Claude does the task in ten minutes I now have several hours I can dedicate entirely to understanding.

by idopmstuff

4/18/2026 at 6:42:56 PM

You lose some, you win some. The win could be short-term much higher, however imagine that the new tool suddenly gets ragged pulled from under your feet. What do you do then? Do you still know how to handle it the old way or do you run into skill atrophy issues? I’m using Claude/Codex as well, but I’m a little worried that the environment we work in will become a lot more bumpy and shifty.

by onemoresoop

4/19/2026 at 1:08:45 AM

> however imagine that the new tool suddenly gets ragged pulled from under your feet

When you have a headache, do you avoid taking ibuprofen because one day it may not be available anymore? Two hundred years ago, if you gave someone ibuprofen and told them it was the solution for 99% of the cases where they felt some kind of pain, they might be suspicious. Surely that's too good to be true.

But it's not. Ibuprofen really is a free lunch, and so is AI. It's weird to experience, but these kinds of technologies come around pretty often, they just become ubiquitous so quickly that we forget how we got by without them.

by post-it

4/18/2026 at 6:48:08 PM

> the new tool suddenly gets ragged pulled from under your feet

If that happened at this point, it would be after societal collapse.

by visarga

4/18/2026 at 7:03:45 PM

I don’t even wanna think about that scenario, maybe he gets averted somehow.

by onemoresoop

4/18/2026 at 8:23:31 PM

The "infinite patience" thing I find particularly interesting.

Every now and then I pause before I ask an LLM to undo something it just did or answer something I know it answered already, somewhere. And then I remember oh yeah, it's an LLM, it's not going to get upset.

by hdjrudni

4/18/2026 at 8:01:05 PM

Asking infinite questions about something does not make you good at “doing” that thing, you get pretty good at asking questions

by dlopes7

4/18/2026 at 8:17:02 PM

Understanding is not learning. Zero effort gives zero rewards, I ask Claude plenty of things, I get answers but not learnings.

by techpression

4/18/2026 at 7:04:50 PM

I used to speak Russian like I was born in Russia. I stopped talking Russian … every day I am curious ans responsible but I can hardly say 10 words in Russian today. if you don’t use it (not just be curious and responsible) you will lose it - period.

by bdangubic

4/18/2026 at 7:33:57 PM

Programming language is not just syntax, keywords and standard libraries, but also: processes, best practices and design principles. The latter group I guess is more difficult to learn and harder to forget.

by thih9

4/18/2026 at 8:37:30 PM

I respectfully completely disagree. not only will you just as easily lose thr processed, best practices and design principles but they will be changing over time (what was best practice when I got my first gig in 1997 is not a best practice today (even just 4-5 years ago not to go all the back to the 90’s)). all that is super easy to both forget and lose unless you live it daily

by bdangubic

4/19/2026 at 5:02:19 AM

Forget, yes; lose, no. Like it would be much easier for you to relearn Russian - especially compared with someone who only knows English.

by thih9

4/19/2026 at 12:51:36 AM

More fair comparison would be writing/talking about Russian language in English. That way you'd still focus on Russian. Same way with programming - it's not like you stop seeing any code. So why should you forget it?

by ashirviskas

4/18/2026 at 6:14:10 PM

Are you sure you would know if it didn't work? I use Claude extensively myself, so I'm not saying this from a "hater" angle, but I had 2 people last week who believe themselves to be in your shoes send me pull requests which made absolutely no sense in the context of the codebase.

by SpicyLemonZest

4/18/2026 at 6:21:25 PM

That’s always been the case, AI or not.

by therealdrag0

4/18/2026 at 8:18:03 PM

No, it hasn't. I did not have a problem before AI with people sending in gigantic pull requests that made absolutely no sense, and justifying them with generated responses that they clearly did not understand. This is not a thing that used to happen. That's not to say people wouldn't have done it if it were possible, but there was a barrier to submitting a pull request that no longer exists.

by Jweb_Guru

4/18/2026 at 9:01:55 PM

In my experience, the people sending me garbage PRs with Claude are the same ones who wrote garbage code beforehand. Now there's just 10x more of it.

by viccis

4/18/2026 at 7:52:24 PM

It just happens to be a lot worse now. Confidence through ignorance has come into the spotlight with the commoditization of LLMs.

by windexh8er

4/19/2026 at 1:02:35 AM

Yeah, I test everything myself.

by post-it

4/18/2026 at 7:58:02 PM

I have also found LLMs are a great tool for understanding a new code base, but it's not clear to me what your comment has to do with skill atrophy.

by root_axis

4/19/2026 at 2:13:20 AM

Well ultimately the skill I care about is understanding software, changing it, and making more of it. And clearly that isn't atrophying.

My syntax writing skills may well be atrophying, but I'll just do a leetcode by hand once in a while.

by post-it

4/18/2026 at 9:15:50 PM

It's good that it's working for you but I'm not sure what this has to do with skill atrophy. It sounds like you never had this skill (in this case, working with that particular system) to begin with.

>I have a significantly better understanding of the codebase than I would without AI at this point in my onboarding

One of the pitfalls of using AI to learn is the same as I'd see students doing pre-AI with tutoring services. They'd have tutors explain the homework to do them and even work through the problems with them. Thing is, any time you see a problem or concept solved, your brain is tricked into thinking you understand the topic enough to do it yourself. It's why people think their job interview questions are much easier than they really are; things just seem obvious when you've thought about the solution. Anyone who's read a tutorial, felt like they understood it well, and then struggled for a while to actually start using the tool to make something new knows the feeling very well. That Todo List app in the tutorial seemed so simple, but the author was making a bunch of decisions constantly that you didn't have to think about as you read it.

So I guess my question would be: If you were on a plane flight with no wifi, and you wanted to do some dev work locally on your laptop, how comfortable would you be vs if you had done all that work yourself rather than via Claude?

by viccis

4/19/2026 at 1:04:29 AM

> If you were on a plane flight with no wifi, and you wanted to do some dev work locally on your laptop, how comfortable would you be vs if you had done all that work yourself rather than via Claude?

Probably about as comfortable as I would be if I also didn't have my laptop and instead had to sketch out the codebase in a notebook. There's no sense preparing for a scenario where AI isn't available - local models are progressing so quickly that some kind of AI is always going to be available.

by post-it

4/19/2026 at 7:33:21 PM

So then the argument isn't so much that skill decay isn't an issue but rather that the skill is inherently worthless moving forward. I'm not sure I agree, but I also got a compsci education because I have loved doing it since childhood rather than because I just wanted to make money, and I can see how the latter group would vehemently disagree with me.

by viccis

4/18/2026 at 8:08:05 PM

What do you mean “cut out anything that was granular/one query away”? This was a very cool workflow to hear about—I will be applying it myself

by Ifkaluva

4/19/2026 at 1:01:54 AM

For example, Claude was very eager to include function names, implementation details, and the exact variables that are passed between services. But all the info I need for a particular process is the names of the services involved, the files involved, and a one-sentence summary of what happens. If I want to know more, I can tell Claude to read the doc and find out more with a single query (or I can just check for myself).

by post-it

4/18/2026 at 6:16:06 PM

[dead]

by throwaway613746

4/18/2026 at 5:44:33 PM

Not so much atrophy as apathy.

I've worked with people who will look at code they don't understand, say "llm says this", and express zero intention of learning something. Might even push back. Be proud of their ignorance.

It's like, why even review that PR in the first place if you don't even know what you're working with?

by ljm

4/18/2026 at 6:01:24 PM

I cringed when I saw a dev literally copy and paste an AI's response to a concern. The concern was one that had layers and implications to it, but instead of getting an answer as to why it was done a certain way and to allay any potential issues, that dev got a two paragraph lecture on how something worked on the surface of it, wrapped in em dashes and joviality.

A good dev would've read deeper into the concern and maybe noticed potential flaws, and if he had his own doubts about what the concern was about, would have asked for more clarification. Not just feed a concern into AI and fling it back. Like please, in this day and age of AI, have the benefit of the doubt that someone with a concern would have checked with AI himself if he had any doubts of his own concern...

by psygn89

4/18/2026 at 6:21:37 PM

Is this the same subset of people who copy/paste code directly from stack overflow without understanding ? I’m not sure this is a new problem.

by oremj

4/18/2026 at 6:51:50 PM

It's a new problem in the sense that now executive management at many (if not most) software companies is pushing for all employees to work this way as much as possible. Those same people probably don't know what stack overflow even is.

by foobarchu

4/18/2026 at 6:34:35 PM

In my experience, no - I think the ability to build more complete features with less/little/no effort, rather than isolated functions, is (more) appealing to (more) developers.

by pizza234

4/18/2026 at 6:49:46 PM

I don't think so. I'll spend a ton of time and effort thinking through, revising, and planning out the approach, but I let the agent take the wheel when it comes to transpiling that to code. I don't actually care about the code so long as it's secure and works.

I spent years cultivating expertise in C++ and .NET. And I found that time both valuable and enjoyable. But that's because it was a path to solve problems for my team, give guidance, and do so with both breadth and depth.

Now I focus on problems at a higher level of abstraction. I am certain there's still value in understanding ownership semantics and using reflection effectively, but they're broadly less relevant concerns.

by malnourish

4/18/2026 at 7:17:49 PM

[dead]

by mlvljr

4/18/2026 at 7:27:42 PM

It's difficult to copy & paste an entire app from Stack Overflow

by dingaling

4/18/2026 at 7:21:54 PM

Copied and pasted without noting the license that stack overflow has on code published there, no doubt

by sroussey

4/18/2026 at 6:31:06 PM

Hey. I resemble that remark sometimes!! quit being a hater (sarcasm) :P

by trinsic2

4/18/2026 at 6:04:10 PM

We've had such developers around, long before LLMs.

by kilroy123

4/18/2026 at 6:20:52 PM

They're so much louder now, though.

by ohazi

4/18/2026 at 6:17:41 PM

It’s a lot like someone bragging that they’re bad at math tossing around equations.

by RexM

4/18/2026 at 6:09:03 PM

If I wanted to know what the LLM says, I would have asked it myself, thanks…

by monkpit

4/18/2026 at 6:12:21 PM

What is it in the broader culture that's causing this?

by redanddead

4/18/2026 at 6:27:07 PM

People who got into the job who don’t really like programming

by groundzeros2015

4/18/2026 at 6:32:44 PM

I like programming, but I don’t like the job.

by drivebyhooting

4/18/2026 at 7:55:55 PM

Then why are you letting Claude do the fun part?

by groundzeros2015

4/18/2026 at 8:00:08 PM

Obviously, the fun part is delivering value for the shareholders.

by root_axis

4/18/2026 at 6:18:07 PM

These people have always existed. Hell, they are here, too. Now they have a new thing to delegate responsibility to.

And no, I don't understand them at all. Taking responsibility for something, improving it, and stewarding it into production is a fantastic feeling, and much better than reading the comment section. :)

by mattgreenrocks

4/18/2026 at 5:28:12 PM

You can argu that you will have skill atrophy by not using LLMs.

We have gone multi cloud disaster recovery on our infrastructure. Something I would not have done yet, had we not had LLMs.

I am learning at an incredible rate with LLMs.

by tossandthrow

4/18/2026 at 5:32:47 PM

I kind feel the same. I’m learning things and doing things in areas that would just skip due to lack of time or fear.

But I’m so much more detached of the code, I don’t feel that ‘deep neural connection’ from actual spending days in locked in a refactor or debugging a really complex issue.

I don’t know how a feel about it.

by mgambati

4/18/2026 at 5:38:45 PM

I strongly agree on the refactor, but for debugging I have another perspective: I think debugging is changing for the better, so it looks different.

Sure, you don't know the code by heart, but people debugging code translated to assembly already do that.

The big difference is being able to unleash scripts that invalidate enormous amount of hypothesis very fast and that can analyze the data.

Used to do that by hand it took hours, so it would be a last resort approach. Now that's very cheap, so validating many hypothesis is way cheaper!

I feel like my "debugging ability" in terms of value delivered has gone way up. For skill, it's changing. I cannot tell, but the value i am delivering for debugging sessions has gone way up

by Fire-Dragon-DoL

4/18/2026 at 5:37:52 PM

As someone who's switched from mobile to web dev professionally for the last 6 months now. If you care about code quality, you'll develop that neural connection after some time.

But if you don't and there's no PR process (side projects), the motivation to form that connection is quite low.

by afzalive

4/18/2026 at 7:00:17 PM

> If you care about code quality, you'll develop that neural connection after some time.

No, because you can get LLMs to produce high quality code that has gone through an infinite number of refinement/polish cycles and is far more exhaustive than the code you would have written yourself.

Once you hit that point, you find yourself in a directional/steering position divorced from the code since no matter what direction you take, you'll get high quality code.

by hombre_fatal

4/20/2026 at 8:45:23 PM

Only if never find opportunities to simplify the code it's writing and you don't review the code at all.

> no matter what direction you take, you'll get high quality code

This is not the case today. You get medium-quality, sometimes over-engineered code 10x faster.

by afzalive

4/18/2026 at 5:37:43 PM

Yes, you certainly can argue that, but you'd be wrong. The primary selling point of LLMs is that they solve the problem of needing skill to get things done.

by ori_b

4/18/2026 at 5:40:38 PM

That is not the entire selling point - so you are very wrong.

You very much decide how you employ LLMs.

Nobody are keeping a gun to your head to use them. In a certain way.

Sonif you use them in a way that increase you inherent risk, then you are incredibly wrong.

by tossandthrow

4/18/2026 at 5:47:19 PM

I suggest you read the sales pitches that these products have been making. Again, when I say that this is the selling point, I mean it: This is why management is buying them.

by ori_b

4/18/2026 at 6:01:59 PM

I've read the sales pitches, and they're not about replacing the need for skill. The Claude Design announcement from yesterday (https://www.anthropic.com/news/claude-design-anthropic-labs) is pretty typical in my experience. The pitch is that this is good for designers, because it will allow them to explore a much broader range of ideas and collaborate on them with counterparties more easily. The tool will give you cool little sliders to set the city size and arc width, but it doesn't explain why you would want to adjust these parameters or how to determine the correct values; that's your job.

I understand why a designer might read this post and not be happy about it. If you don't think your management values or appreciates design skill, you'd worry they're going to glaze over the bullet points about design productivity, and jump straight to the one where PMs and marketers can build prototypes and ignore you. But that's not what the sales pitch is focused on.

by SpicyLemonZest

4/18/2026 at 6:19:09 PM

The majority of examples in the document you linked describe 'person without<skill> can do thing needing <skill>'. It's very much selling 'more output, less skill'

by ori_b

4/18/2026 at 6:35:06 PM

Sales pitches dont mean jack, WTF are you talking about?

by trinsic2

4/18/2026 at 6:53:43 PM

Sales pitches are literally the same thing as "the selling point".

Neither of those is necessarily a synonym for why you personally use them

by foobarchu

4/18/2026 at 6:04:13 PM

They purportedly solve the problem of needing skill to get things done. IME, this is usually repeated by VC backed LLM companies or people who haven’t knowingly had to deal with other people’s bad results.

This all bumps up against the fact that most people default to “you use the tool wrong” and/or “you should only use it to do things where you already have firm grasp or at least foundational knowledge.”

It also bumps against the fact that the average person is using LLM’s as a replacement for standard google search.

by Forgeties79

4/18/2026 at 5:52:34 PM

I see it completely the opposite way, you use an LLM and correct all its mistakes and it allows you to deliver a rough solution very quickly and then refine it in combination with the AI but it still gets completely lost and stuck on basic things. It’s a very useful companion that you can’t trust, but it’s made me 4-5x more productive and certainly less frustrated by the legacy codebase I work on.

by andy_ppp

4/18/2026 at 6:34:09 PM

Yeah I whole hardheartedly disagree with this. Because I understand the basics of coding I can understand where the model gets stuck and prompt it in other directions.

If you don't know whats going on through the whole process, good luck with the end product.

by trinsic2

4/18/2026 at 7:00:29 PM

You're learning at your standard rate of learning, you're just feeding yourself over-confidence on how much you're absorbing vs what the LLM is facilitating you rolling out.

by weego

4/18/2026 at 7:24:57 PM

This is such a weird statement in so many levels.

The latent assumption here is that learning is zero sum.

That you can take a 30 year old from 1856 bring them into present day and they will learn whatever subject as fast as a present day 20 year old.

That teachers doesn't matter.

That engagement doesn't matter.

Learning is not zero sum. Some cultural background makes learning easier, some mentoring makes is easier, and some techniques increases engagement in ways that increase learning speed.

by tossandthrow

4/18/2026 at 5:33:23 PM

> I am learning at an incredible rate with LLMs

Could you do it again without the help of an LLM?

If no, then can you really claim to have learned anything?

by bluefirebrand

4/18/2026 at 5:57:30 PM

The challenge is not if you could do all of it without AI but any of it that you couldn't before.

Not everyone learns at the same pace and not everyone has the same fault tolerance threshold. In my experiencd some people are what I call "Japanese learners" perfecting by watching. They will learn with AI but would never do it themselves out of fear of getting something wrong while they understand most of it, others that I call "western learners" will start right away and "get their hands dirty" without much knowledge and also get it wrong right away. Both are valid learning strategies fitting different personalities.

by _blk

4/18/2026 at 5:38:24 PM

I could definitely maintain the infrastructure without an llm. Albeit much slower.

And yes. If LLMs disappear, then we need to hire a lot of people to maintain the infrastructure.

Which naturally is a part of the risk modeling.

by tossandthrow

4/18/2026 at 6:32:22 PM

> I could definitely maintain the infrastructure without an llm

Not what I asked, but thanks for playing.

by bluefirebrand

4/18/2026 at 7:28:03 PM

You literally asked that question

> Could you do it again without the help of an LLM?

by tossandthrow

4/18/2026 at 8:26:36 PM

And the question you answered was "could you maintain it without the help of an LLM"

by bluefirebrand

4/18/2026 at 7:41:49 PM

So, you havent really learned anything from any teacher if you could not do it again without them?

by Paradigma11

4/18/2026 at 10:35:37 PM

> So, you havent really learned anything from any teacher if you could not do it again without them?

Well, yes?

What do you think "learning" means? If you cannot do something without the teacher, you haven't learned that thing.

by lelanthran

4/18/2026 at 8:20:36 PM

That would be the definition of learning something, yes.

by techpression

4/18/2026 at 7:52:17 PM

I mean...yeah?

If your child says they've learned their multiplication tables but they can't actually multiply any numbers you give them do they actually know how to do multiplication? I would say no.

by falkensmaize

4/18/2026 at 8:20:56 PM

For some reason people are perfectly able to understand this in the context of, say, cursive, calculator use, etc., but when it comes to their own skillset somehow it's going to be really different.

by Jweb_Guru

4/18/2026 at 8:37:15 PM

Yes that's exactly right.

by UncleMeat

4/18/2026 at 8:14:17 PM

Yes.

by sho_hn

4/18/2026 at 5:44:18 PM

I think this is a bit dismissive.

It’s quite possible to be deep into solving a problem with an LLM guiding you where you’re reading and learning from what it says. This is not really that different from googling random blogs and learning from Stack Overflow.

Assuming everyone just sits there dribbling whilst Claude is in YOLO mode isn’t always correct.

by danw1979

4/18/2026 at 6:50:43 PM

>> I am learning a new skill with instructor at an incredible rate

> Could you do it again on your own?

Can you you see how nonsensical your stance is? You're straight up accusing GP of lying they are learning something at the increased rate OR suggesting if they couldn't learn that, presumably at the same rate, on they own, they're not learning anything.

That's not very wise to project your own experiences on others.

by subscribed

4/18/2026 at 7:34:30 PM

Actually, it’s much like taking a physics or engineering course, and after the class being fully able to explain the class that day, and yet realize later when you are doing the homework that you did not actually fully understand like you thought you did.

by sroussey

4/18/2026 at 5:37:56 PM

>I am learning at an incredible rate with LLMs.

I don't believe it. Having something else do the work for you is not learning, no matter how much you tell yourself it is.

by i_love_retros

4/18/2026 at 5:54:05 PM

If you've seen further it's only because you've stood on the shoulders of giants.

Having other people do work for you is how people get to focus on things they actually care about.

Do you use a compiler you didn't write yourself? If so can you really say you've ever learned anything about computers?

by margalabargala

4/18/2026 at 6:24:31 PM

You have to build a computer to learn about computers!

by butterisgood

4/18/2026 at 9:21:09 PM

I would argue that if you've just watched videos about building computers and haven't sat down and done one yourself, then yeah I don't see any evidence that you've learned how to build a computer.

by viccis

4/19/2026 at 2:11:13 AM

And, so the anti-LLM argument goes, if you've not built the computer you can't learn anything about what computers could be used for.

by margalabargala

4/19/2026 at 7:31:37 PM

That's not the anti-LLM argument, that's a brand new argument you made up.

by viccis

4/20/2026 at 12:00:55 AM

Did you not read the comment thread you replied to? That's the exact argument that I_love_retros made above.

That is in fact the anti LLM argument you've ostensibly been discussing. If you want to talk to the person who made it up I'm not your guy.

by margalabargala

4/18/2026 at 5:41:33 PM

It is easy to not believe if you only apply an incredibly narrow world view.

Open your eyes, and you might become a believer.

by tossandthrow

4/18/2026 at 5:59:49 PM

What is this, some sort of cult?

by nothinkjustai

4/18/2026 at 7:26:44 PM

You mean the cult of "I can't see the viruses therefore they dint exist"? As in "I can't imagine something so it means it's a lie"?

Indeed, quite weird and no imagination.

by subscribed

4/18/2026 at 6:16:27 PM

No, it is an as snarky response to a person being snarky about usefulness of AI agents.

It does seem like there is a cult of people who categorically see LLMs as being poor at anything without it being founded in anything experience other than their 2023 afternoon to play around with it.

by tossandthrow

4/18/2026 at 7:05:02 PM

Who cares? Why are people so invested in trying to “convert” others to see the light?

Can’t you be satisfied with outcompeting “non believers”? What motivates you to argue on the internet about it? Deep down are you insecure about your reliance on these tools or something, and want everyone else to be as well?

by nothinkjustai

4/18/2026 at 7:27:00 PM

Why do people invest themselves so hard in interjecting themselves into conversations about Ai telling people it doesn't work?

It feels so off rebuilding serious SaaS apps in days for production, only to be told it is not possible?

by tossandthrow

4/18/2026 at 9:38:52 PM

Who here said ai “doesn’t work”?

by nothinkjustai

4/18/2026 at 7:07:12 PM

> We have gone multi cloud disaster recovery on our infrastructure. Something I would not have done yet, had we not had LLMs.

That’s product atrophy, not skill atrophy.

by Wowfunhappy

4/18/2026 at 5:32:25 PM

Using LLMs as a learning tool isn’t what causes skill atrophy. It’s using them to solve entire problems without understanding what they’ve done.

And not even just understanding, but verifying that they’ve implemented the optimal solution.

by deadbabe

4/18/2026 at 7:56:42 PM

It's partly that, but also reading and surface level understanding something vs generating yourself are different skills with different depths. If you're learning a language, you can get good at listening without getting good at speaking for example.

by tehjoker

4/18/2026 at 5:31:39 PM

Also AI could help you pick those skills up again faster, although you wouldn’t need to ever pick those skills up again unless AI ceased to exist.

What an interesting paradox-like situation.

by jjallen

4/18/2026 at 6:13:36 PM

I believe some professor warned us about being over reliant on Google/reddit etc: “how would you be productive if internet went down” dilemma.

Well, if internet is down, so is our revenue buddy. Engineering throughput would be the last of our concerns.

by estetlinus

4/18/2026 at 6:16:20 PM

https://hex.ooo/library/power.html

When future humans rediscover mathematics.

by solarengineer

4/18/2026 at 8:00:19 PM

Yeah I am worried about skill atrophy too. Everyone uses a compiler these days instead of writing assembly. Like who the heck is going to do all the work when people forget how to use the low level tools and a compiler has a bug or something?

And don’t get me started on memory management. Nobody even knows how to use malloc(), let alone brk()/mmap(). Everything is relying on automatic memory management.

I mean when was the last time you actually used your magnetized needle? I know I am pretty rusty with mine.

by IgorPartola

4/18/2026 at 8:03:00 PM

> an LLM is exactly like a compiler if a compiler was a black box hosted in a proprietary cloud and metered per symbol

Yeah, exactly.

by otabdeveloper4

4/18/2026 at 8:27:33 PM

Snark aside, this is an actual problem for a lot of developers in varying degrees, not understand anything about the layers below make for terrible layers above in very many situations.

by techpression

4/18/2026 at 5:48:25 PM

[dead]

by boxingdog

4/18/2026 at 6:00:31 PM

Another aspect I haven’t seen discussed too much is that if your competitor is 10x more productive with AI, and to stay relevant you also use AI and become 10x more productive. Does the business actually grow enough to justify the extra expense? Or are you pretty much in the same state as you were without AI, but you are both paying an AI tax to stay relevant?

by dgellow

4/18/2026 at 6:09:36 PM

This is the “ad tax” reasoning, but ultimately I think the answer is greater efficiency. So there is a real value, even if all competitors use the tools.

It’s like saying clothing manufacturers are paying the “loom tax” tax when they could have been weaving by hand…

by xixixao

4/18/2026 at 6:26:58 PM

Software development is not a production line, the relationship between code output and revenue is extremely non-linear.

Where producing 2x the t-shirts will get you ~2x the revenue, it's quite unlikely that 10x the code will get you even close to 2x revenue.

With how much of this industry operates on 'Vendor Lock-in' there's a very real chance the multiplier ends up 0x. AI doesn't add anything when you can already 10x the prices on the grounds of "Fuck you. What are you gonna do about it?"

by SlinkyOnStairs

4/18/2026 at 6:42:02 PM

Yep and in a vendor lock in scenario, fixing deep bugs or making additions in surgical ways is where the value is. And Claude helps you do that, by giving you more information, analyzing options, but it doesn’t let you make that decision 10x faster.

by groundzeros2015

4/18/2026 at 11:19:39 PM

[dead]

by 3dfd

4/18/2026 at 7:09:08 PM

We already know how to multiply the efficiency of human intelligence to produce better quality than LLMs and nearly match their productivity - open source - in fact coding LLMs wouldn't even exist without it.

Open source libraries and projects together with open source AI is the only way to avoid the existential risks of closed source AI.

by bigbadfeline

4/18/2026 at 6:15:58 PM

The alternative is probably also true. If your F500 competitor is also handicapped by AI somehow, then you're all stagnant, maybe at different levels. Meanwhile Anthropic is scooping up software engineers it supposedly made irrelevant with Mythos and moving into literally 2+ new categories per quarter

by redanddead

4/18/2026 at 7:24:10 PM

Where's the evidence of competitors being 10x more productive? So far, everyone is simply bragging about how much code they have shipped last week, but that has zero relevance when it comes to productivity

by dakiol

4/18/2026 at 10:34:14 PM

I work at a 20-year-old mid-sized SaaS company. As long as the company has been around, product managers have longed for more engineers and strategies for engineers to ship features faster. As of around February, those same product managers across the org are complaining that they can't keep up with the pace at which engineers are shipping their features. This isn't just lines of code. This is the entire company trying to figure out how to help the PMs because engineers suddenly stopped being the bottleneck.

I don't know about 10x, but this could only happen if PMs suddenly got really lazy or the engineers actually got at least 1.5x faster. My gut says it's way more because we're now also consistently up to date on our dependencies and completing massive refactors we were putting off for years.

There are lots of reasons this could be the case. Quality suddenly changed, the nature of the work changed, engineers leveled up... But for this to have happened consistently across a bunch of engineering teams is quite the coincidence if not this one thing we are all talking about.

by davidron

4/18/2026 at 8:14:22 PM

I feel like a lot of the AI advocacy today is like the Cloud advocacy of a few years ago or the Agile advocacy before that. It's this season's silver bullet to make us all 10x more effective according to metrics that somehow never translate into adding actually useful functionality and quality 10x as fast.

The evangelists told us 20 years ago that if we weren't doing TDD then we weren't really professional programmers at all. The evangelists told us 10 years ago that if we were still running stuff locally then we must be paying a fortune for IT admin or not spending our time on the work that mattered. The evangelists this week tell us that we need to be using agents to write all our code or we'll get left in the dust by our competitors who are.

I'm still waiting for my flying car. Would settle for some graphics software on Linux that matches the state of the art on Windows or even reliable high-quality video calls and online chat rooms that don't make continental drift look fast.

by Silhouette

4/18/2026 at 8:04:00 PM

Read it as just a given rate. The number doesn’t matter too much here, if company B does believe claims from company A they are N times more productive that’s enough to force B to adopt the same tooling.

by dgellow

4/18/2026 at 6:08:50 PM

Either the business grows, or the market participants shed human headcount to find the optimal profit margin. Isn’t that the great unknown: what professions are going to see headcount reduction because demand can’t grow that fast (like we’ve seen in agriculture), and which will actually see headcount stay the same or even expand, because the market has enough demand to keep up with the productivity gains of AI? Increasingly I think software writ large is the latter, but individual segments in software probably are the former.

by senordevnyc

4/18/2026 at 6:27:51 PM

it's worse than a tie. 10x everyone just floods the market and tanks per-unit price. you pay the AI tax and your output is worth less.

by Lihh27

4/18/2026 at 8:04:41 PM

> your competitor is 10x more productive with AI

This doesn't happen. Literally zero evidence of this.

by otabdeveloper4

4/18/2026 at 8:05:43 PM

The actual rate isn’t relevant for the discussion

by dgellow

4/18/2026 at 8:13:28 PM

Well it might.

If the actual rate is .9x then it matters a lot.

Or even if it's like 1.1x, is the cost worth the return?

by Miner49er

4/19/2026 at 3:37:17 AM

The cost is so small relative to the increase. The cost whining on HN is bizarre to me. Feels like everyone here is on an individual plan and has no understanding of what margins look like for actual business.

Meta pays $750k+ TC and makes far more profit/eng, do you think they care about $5k/eng/mo in inference? A 1.1x increase would be so significant that it would justify the cost easily, especially when you can just compress comps to make up for it

by xvector

4/20/2026 at 11:20:24 AM

Nobody is whining here.

by AlexeyBelov

4/19/2026 at 9:33:31 AM

What? You don't think businesses do financial planning and calculations for profit margins?

Do you really think they go on vibes - "welp, this AI thing seems to improve developer performance, I guess. Heck, what's an extra 5k per developer anyways, amirite".

Well, maybe they really do in your neck of the woods. Explains a lot, I guess.

by otabdeveloper4

4/19/2026 at 1:21:56 PM

Yes most companies do in fact operate like this. There are tens of thousands of companies that will pay more for the best thing and call it at that, because the cost is dwarfed by what even marginal gains in quality unlock for the business.

by xvector

4/20/2026 at 4:20:44 AM

> the cost is dwarfed by what even marginal gains in quality

That is just, like, your opinion, man.

Also, I doubt these kinds of companies have "quality" of anything, never mind "gains in quality".

by otabdeveloper4

4/18/2026 at 8:38:40 PM

What if the rate is negative?

Would it matter?

by surgical_fire

4/20/2026 at 11:28:08 AM

It would matter but would be a different discussion than the one I was going for

by dgellow

4/18/2026 at 6:31:18 PM

If the business doesn’t grow then you shed costs like employees

by JambalayaJimbo

4/18/2026 at 11:48:37 PM

[dead]

by benjiro3000

4/18/2026 at 6:28:47 PM

Open models keep closing the eval gap for many tasks, and local inference continues to be increasingly viable. What's missing isn't technical capability, but productized convenience that makes the API path feel like the only realistic option.

Frontier labs are incentivized to keep it that way, and they're investing billions to make AI = API the default. But that's a business model, not a technical inevitability.

by michaelje

4/18/2026 at 8:11:57 PM

im hoping and praying that local inference finds it's way to some sort of baseline that we're all depending on claude for here. that would help shape hardware designs on personal devices probably something in the direction of what apple has been doing.

ive had to like tune out of the LLM scene because it's just a huge mess. It feels impossible to actually get benchmarks, it's insanely hard to get a grasp on what everyone is talking about, bots galore championing whatever model, it's just way too much craze and hype and misinformation. what I do know is we can't keep draining lakes with datacenters here and letting companies that are willing to heel turn on a whim basically control the output of all companies. that's not going to work, we collectively have to find a way to make local inference the path forward.

everyone's foot is on the gas. all orgs, all execs, all peoples working jobs. there's no putting this stuff down, and it's exhausting but we have to be using claude like _right now_. pretty much every company is already completely locked in to openai/gemini/claude and for some unfortunate ones copilot. this was a utility vendor lock in capture that happened faster than anything ive ever seen in my life & I already am desperate for a way to get my org out of this.

by trueno

4/18/2026 at 9:04:27 PM

I'm frustrated that there's not "solid" instructional tooling. I either see people just saying "keep trying different prompts and switching models until you get lucky" or building huge cantilevered toolchains that seems incredibly brittle, and even then, how well do they really work?

I get choice paralysis when you show me a prompt box-- I don't know what I can reasonably ask for and how to best phrase it, so I just panic. It doesn't help when we see articles saying people are getting better outcomes by adding things like "and no bugs plz owo"

I'm sure this is by design-- anything with clear boundaries and best practices would discourage gacha style experimentation. Can you trust anyone who sells you a metered service to give you good guidance on how to use it efficiently?

by hakfoo

4/18/2026 at 9:18:51 PM

yea that is probably the worst part of these techs becoming mainstream services and local-LLM'ing taking off in general: working with them at many points in any architecture no longer feels... deterministic i guess. way too fucking much "heres what i use" but no real best practices yet, just a lot of vague gray area and everyones still in discovery-mode on how to best find some level of determinism or workflow and ways we are benchmarking is seriously a moving target. everyone has their own branded take on what the technology is and their own branded approach on how to use it, and it's probably the murkiest and foggiest time to be in technology fields that i've ever seen :\ seems like weekly/monthly something is outdated, not just the models but the tooling people are parroting as the current best tooling to use. incredibly frustrating. there's simply too much ground to cover for any one person to have any absolute takes on any of it, and because a handful of entities are currently leading the charge draining lakes and trying to compete for every person and every businesses money, there's zero organized frameworks at the top to make some sense of this. they all are banking on their secret sauce, and i _really_ want us all to get away from this. local inference has to succeed imo but goddamn there needs to be some collective working together to rally behind some common strats/frameworks here. im sure there's already countless committees that have been established to try and get in front of this but even that's messy.

i don't know how else to phrase it: this feels like such an unstable landscape, "beta" software/services are running rampant in every industry/company/org/etc and there's absolutely no single resource we can turn to to help stay ahead of & plan for the rapidly-evolving landscape. every, and i mean every company, is incredibly irresponsible for using this stuff. including my own. once again though, cat's already out of the bag. now we fight for our lives trying to contain it and ensure things are well understood and implemented properly...which seems to be the steepest uphill battle of my life

by trueno

4/18/2026 at 5:59:42 PM

I'm hopeful that new efficiencies in training (Deepseek et al.), the impressive performance of smaller models enhanced through distillation, and a glut of past-their-prime-but-functioning GPUs all converge make good-enough open/libre models cheap, ubiquitous, and less resource-intensive to train and run.

by dewarrn1

4/18/2026 at 5:36:49 PM

> we don't want a hard dependency on another multi-billion dollar company just to write software

My manager doesn't even want us to use copilot locally. Now we are supposed to only use the GitHub copilot cloud agent. One shot from prompt to PR. With people like that selling vendor lock in for them these companies like GitHub, OpenAI, Anthropic etc don't even need sales and marketing departments!

by i_love_retros

4/18/2026 at 5:43:24 PM

You are aware that using eg. Github copilot is not one shot? It will start an agentic loop.

by tossandthrow

4/18/2026 at 6:01:49 PM

Unnecessary nitpicking

by dgellow

4/18/2026 at 6:52:19 PM

Why?

One shoting has a very specific meaning, and agentic workflows are not it?

What is the implied meaning I should understand from them using one shot?

They might refer to the lack of humans in the loop.

by tossandthrow

4/18/2026 at 8:01:47 PM

You give a prompt, you get a PR. If it is ready to merge with the first attempt, that’s a one shot. The agentic loop is a detail in their context

by dgellow

4/18/2026 at 5:26:48 PM

The lock in is so incredibly poor. I could switch to whatever provider in minuets.

But it requires that one does not do something stupid.

Eg. For recurring tasks: keep the task specification in the source code and just ask Claude to execute it.

The same with all documentation, etc.

by tossandthrow

4/18/2026 at 5:43:51 PM

What open models are truly competing with both Claude Code and Opus 4.7 (xhigh) at this stage?

by aliljet

4/19/2026 at 3:39:59 AM

Spent a lot of time with "open models." None of them come close. They are benchmaxxed. But you won't hear many of the open model fans on HN admit this.

The open model mentality is also just so bizarre to me. You're going to use an inferior model to save, what, a couple hundred bucks a month? Is your time really worth that little?

No one working on a serious project at a serious company is downgrading their agent's intelligence for a marginal cost saving. Downgrading your model is like downgrading the toilet paper on your yacht.

by xvector

4/19/2026 at 4:26:58 AM

> The open model mentality is also just so bizarre to me. You're going to use an inferior model to save, what, a couple hundred bucks a month? Is your time really worth that little?

I agree that people who claim that open models are as good as claude/openai/z are lying, delusional, or not doing very much. I've tried them all, included GLM 5.1.

GLM is not bad but the hardware needed will never recoup the ROI vs just using a commercial provider through its API.

That being said, you're being reductive here. For many use cases local models offer advantages that can't obtained through a commercial API : Privacy, ownership of the entire stack, predictability. They can't be rugpulled, they can't snitch on you. They will not give you 503.

Those advantages are very valuable for things like a local assistant, as an agent, for data extraction, for translations, for games (role playing and whatnot), etc.

That being said I know that many people are like you, they don't give a second thought about privacy. They'd plug Anthropic to their brain if they could. So I understand the sentiment. I just think that you should in turn try to understand why someone would use an open model.

by tredre3

4/19/2026 at 4:08:00 AM

Glm 5.1 getting 5% on ARC-AGI 2 private is all anyone needs to know.

by WarmWash

4/18/2026 at 6:10:57 PM

I've had a good experience with GLM-5.1. Sure it doesn't match xhigh but comes close to 4.6 at 1/3rd the cost

by parinporecha

4/19/2026 at 8:58:43 AM

1/3? Try 2/13 :P

5.1 is like $4 / 1m output, Opus 4.6 is $25. GPT 5.4 pro is $270 with large contexts :O

by slopinthebag

4/18/2026 at 6:09:16 PM

GLM 5.1 competes with Sonnet. I'm not confident about Opus, though they claim it matches that too.

by esafak

4/18/2026 at 6:55:46 PM

I have it as failover to Opus 4.6 in a Claude proxy internally. People don't notice a thing when it triggers, maybe a failed tool call here and there (harness remains CC not OC) or a context window that has gone over 200k tokens or an image attachment that GLM does not handle, otherwise hunky-dory all the way. I would also use it as permanent replacement for haiku at this proxy to lower Claude costs but have not tried it yet. Opus 4.7 has shaken our setup badly and we might look into moving to Codex 100% (GLM could remain useful there too).

by ojosilva

4/18/2026 at 6:01:36 PM

That's a lame attitude. There are local models that are last year's SOTA, but that's not good enough because this year's SOTA is even better yet still...

I've said it before and I'll say it again, local models are "there" in terms of true productive usage for complex coding tasks. Like, for real, there.

The issue right now is that buying the compute to run the top end local models is absurdly unaffordable. Both in general but also because you're outbidding LLM companies for limited hardware resources.

You have a $10K budget, you can legit run last year's SOTA agentic models locally and do hard things well. But most people don't or won't, nor does it make cost effective sense Vs. currently subsidized API costs.

by Someone1234

4/18/2026 at 6:10:55 PM

I completely see your point, but when my / developer time is worth what it is compared to the cost of a frontier model subscription, I'm wary of choosing anything but the best model I can. I would love to be able to say I have X technique for compensating for the model shortfall, but my experience so far has been that bigger, later models out perform older, smaller ones. I genuinely hope this changes through. I understand the investment that it has taken to get us to this point, but intelligence doesn't seem like it's something that should be gated.

by gbro3n

4/18/2026 at 6:19:52 PM

Right; but every major generation has had diminishing returns on the last. Two years ago the difference was HUGE between major releases, and now we're discussing Opus 4.6 Vs. 4.7 and people cannot seem to agree if it is an improvement or regression (and even their data in the card shows regressions).

So my point is: If you have the attitude that unless it is the bleeding edge, it may have well not exist, then local models are never going to be good enough. But truth is they're now well exceeding what they need to be to be huge productivity tools, and would have been bleeding edge fairly recently.

by Someone1234

4/18/2026 at 6:48:03 PM

I feel like I'm going to have to try the next model. For a few cycles yet. My opinion is that Opus 4.7 is performing worse for my current work flow, but 4.6 was a significant step up, and I'd be getting worse results and shipping slower if I'd stuck with 4.5. The providers are always going to swear that the latest is the greatest. Demis Hassabis recently said in an interview that he thinks the better funded projects will continue to find significant gains through advanced techniques, but that open source models figure out what was changed after about 6 months or so. We'll see I guess. Don't get me wrong, I'd love to settle down with one model and I'd love it to be something I could self host for free.

by gbro3n

4/18/2026 at 7:26:40 PM

> I completely see your point, but when my / developer time is worth what it is compared to the cost of a frontier model subscription, I'm wary of choosing anything but the best model I can.

Don't you understand that by choosing the best model we can, we are, collectively, step by step devaluating what our time is worth? Do you really think we all can keep our fancy paychecks while keep using AI?

by dakiol

4/18/2026 at 8:04:52 PM

Do you think if you or me stopped using AI that everyone else will too? We're still what we always were - problem solvers who have gained the ability to learn and understand systems better that the general population, communicate clearly (to humans and now AIs). Unfortunately our knowledge of language APIs and syntax has diminished in value, but we have so many more skills that will be just as valuable as ever. As the amount of software grows, so will the need for people who know how to manage the complexity that comes with it.

by gbro3n

4/18/2026 at 10:53:45 PM

> Unfortunately our knowledge of language APIs and syntax has diminished in value, but we have so many more skills that will be just as valuable as ever.

There were always jobs that required those "many more skills" but didn't require any programming skills.

We call those people Business Analysts and you could have been doing it for decades now. You didn't, because those jobs paid half what a decent/average programmer made.

Now you are willingly jumping into that position without realising that the lag between your value (i.e. half your salary, or less) would eventually disappear.

by lelanthran

4/19/2026 at 7:15:12 AM

I guess we will need to wait and see if AI can remove ALL of the complexity that requires a software engineer over a business analyst. I can't currently believe that it will. BA's I've worked with vary in technical capability from 'having coded before and understanding DB schema basics and network architecture' to 'I know how the business works but nothing about computers'. If we got to the point in the future where every computer system ran on the same frameworks in the same way, and AI understood it perfectly, then maybe. But while AI is a probabilistic technology manipulating deterministic systems, we will always need people to understand whats going on, and whether they write a lot of code or not, they will be engineers, not analysts. Whether it's more or less of those people, we will see.

by gbro3n

4/19/2026 at 7:57:06 AM

> If we got to the point in the future where every computer system ran on the same frameworks in the same way, and AI understood it perfectly, then maybe.

They don't need to all run on the same frameworks, they just need to run on documented frameworks.

What possible value can you bring to a BA?

The system topology (say, if the backend was microservices vs Lambda vs something-else)? The LLM can explain to the BA what their options are, and the impact of those options.

The framework being used (Vue, or React, or something else)? The AI can directly twiddle that for the BA.

Solving a problem? If the observability is setup, the LLM can pinpoint almost all the problems too,and with a separate UAT or failover-type replica, can repro, edit, build, deploy and test faster than you can.

Like I already said, if[1] you're now able to build or enhance a system without actually needing programming skills, why are you excited about that? You could always do that. It's just that it pays half what programming skills gets you.

You (and many others who boast about not writing code since $DATE) appear to be willingly moving to a role that already pays less, and will pay even less once the candidates for that role double (because now all you programmers are shifting towards it).

It's supply and demand, that's all.

--------------

[1] That's a very big "If", I think. However, the programmers who are so glad to not program appear to believe that it's a very small "If", because they're the ones explaining just how far the capabilities have come in just a year, and expect the trend to continue. Of course, if the SOTA models never get better than what we have now, then, sure - your argument holds - you'll still provide value.

by lelanthran

4/18/2026 at 6:46:27 PM

First, making sure to offer an upvote here. I happen to be VERY enthusiastic about local models, but I've found them to be incredibly hard to host, incredibly hard to harness, and, despite everything, remarkably powerful if you are willing to suffer really poor token/second performance...

by aliljet

4/18/2026 at 7:01:01 PM

> that are last year's SOTA

Early last year or late last year?

opus 4.5 was quite a leap

by wellthisisgreat

4/18/2026 at 7:34:31 PM

$10k is a lot of tokens.

by HWR_14

4/18/2026 at 7:52:58 PM

At the rate its consuming now, I'd probably blow $10k in a month easy.

by sscaryterry

4/18/2026 at 7:16:28 PM

>perhaps we can come up with something like the "linux/postgres/git/http/etc" of the LLMs

I fear that this may not be feasible in the long term. The open-model free ride is not guaranteed to continue forever; some labs offer them for free for publicity after receiving millions in VC grants now, but that's not a sustainable business model. Models cost millions/billions in infrastructure to train. It's not like open-source software where people can just volunteer their time for free; here we are talking about spending real money upfront, for something that will get obsolete in months.

Current AI model "production" is more akin to an industrial endeavor than open-source arrangements we saw in the past. Until we see some breakthrough, I'm bearish on "open models will eventually save us from reliance on big companies".

by leonidasv

4/18/2026 at 7:59:04 PM

"get obsolete in months"

If you mean obsolete in the sense of "no longer fit for purpose" I don't think that's true. They may become obsolete in terms of "can't do hottest new thing" but that's true of pretty much any technology. A capable local model that can do X will always be able to do X, it just may not be able to do Y. But if X is good enough to solve your problem, why is a newer better model needed?

I think if we were able to achieve ~Opus 4.6 level quality in a local model that would probably be "good enough" for a vast number of tasks. I think it's debatable whether newer models are always better - 4.7 seems to be somewhat of a regression for example.

by falkensmaize

4/18/2026 at 6:49:35 PM

I can recommend this stack. It works well with the existing Claude skills I had in my code repos:

1. Opencode

2. Fireworks AI: GLM 5.1

And it is SIGNIFICANTLY cheaper than Claude. I'm waiting eagerly for something new from Deepseek. They are going to really show us magic.

by sergiotapia

4/18/2026 at 6:51:15 PM

it is also significantly less capable than claude

by dirasieb

4/18/2026 at 7:28:54 PM

That's fine. When the "best of the best" is offered only by a couple of companies that are not looking into our best interests, then we can discard them

by dakiol

4/18/2026 at 5:18:29 PM

Any recommendations on good open ones? What are you using primarily?

by ben8bit

4/18/2026 at 6:06:41 PM

LMArena actually has a nice Pareto distribution of ELO vs price for this

  model                        elo   $/M
  ---------------------------------------
  glm-5.1                      1538  2.60
  glm-4.7                      1440  1.41
  minimax-m2.7                 1422  0.97
  minimax-m2.1-preview         1392  0.78
  minimax-m2.5                 1386  0.77
  deepseek-v3.2-thinking       1369  0.38
  mimo-v2-flash (non-thinking) 1337  0.24

https://arena.ai/leaderboard/code?viewBy=plot&license=open-s...

by culi

4/18/2026 at 7:25:05 PM

LMArena isn't very useful as a benchmark, however I can vouch for the fact that GLM 5.1 is astonishingly good. Several people I know who have a $100/mo Claude Code subscription are considering cancelling it and going all in on GLM, because it's finally gotten (for them) comparable to Opus 4.5/6. I don't use Opus myself, but I can definitely say that the jump from the (imvho) previous best open weight model Kimi K2.5 to this is otherworldly — and K2.5 was already a huge jump itself!

by logicprog

4/18/2026 at 5:21:50 PM

qwen3.5/3.6 (30B) works well,locally, with opencode

by blahblaher

4/18/2026 at 5:31:08 PM

Mind you, a 30B model (3B active) is not going to be comparable to Opus. There are open models that are near-SOTA but they are ~750B-1T total params. That's going to require substantial infrastructure if you want to use them agentically, scaled up even further if you expect quick real-time response for at least some fraction of that work. (Your only hope of getting reasonable utilization out of local hardware in single-user or few-users scenarios is to always have something useful cranking in the background during downtime.)

by zozbot234

4/18/2026 at 5:49:05 PM

For a business with ten or more engineers/people-using-ai, it might still make sense to set this up. For an individual though, I can’t imagine you’d make it through to positive ROI before the hardware ages out.

by pitched

4/18/2026 at 6:01:56 PM

It's hard to tell for sure because the local inference engines/frameworks we have today are not really that capable. We have barely started exploring the implications of SSD offload, saving KV-caches to storage for reuse, setting up distributed inference in multi-GPU setups or over the network, making use of specialty hardware such as NPUs etc. All of these can reuse fairly ordinary, run-of-the-mill hardware.

by zozbot234

4/18/2026 at 6:42:10 PM

Since you need at least a few of H100 class hardware, I guess you need at least few tens of coders to justify the costs.

by DeathArrow

4/19/2026 at 2:35:39 AM

I see the 512GB Mac Studios aren’t for sale anymore but that was a much cheaper path

by pitched

4/18/2026 at 6:39:05 PM

I'm backing up a big dataset onto tapes, so I wanted to automate it. I have an idle 64Gb VRAM setup in my basement, so I decided to experiment and tasked it with writing an LTFS implementation. LTFS is an open standard for filesystems for tapes, and there's an implementation in C that can be used as the baseline.

So far, Qwen 3.6 created a functionally equivalent Golang implementation that works against the flat file backend within the last 2 days. I'm extremely impressed.

by cyberax

4/18/2026 at 9:10:35 PM

It is surprisingly competent. It's not Opus 4.6 but it works well for well structured tasks.

by Gareth321

4/18/2026 at 6:21:00 PM

What near SOTA open models are you referring to?

by wuschel

4/18/2026 at 5:35:41 PM

I want to bump this more than just a +1 by recommending everyone try out OpenCode. It can still run on a Codex subscription so you aren’t in fully unfamiliar territory but unlocks a lot of options.

by pitched

4/18/2026 at 5:39:41 PM

The Codex TUI harness is also open source and you can use open models with it, so you can stay in even more familiar territory.

by zozbot234

4/18/2026 at 5:55:36 PM

pi-coding-agent (pi.dev) is also great. I've been using it with Gemma 4 and Qwen 3.6.

by pwython

4/18/2026 at 10:16:11 PM

The thing I dislike about OpenCode is the lack of capabilities of their editor, also, resource intensive, for some reason on a VM it chuckles each 30 mins, that I need to discard all sessions, commits, etc.

I don't know if it is bun related, but in task manager, is the thing that is almost at the top always on CPU usage, turns out for me, bun is not production ready at all.

Wish Zed editor had something like BigPickle which is free to use without limits.

by equasar

4/19/2026 at 6:38:33 AM

> turns out for me, bun is not production ready

What issue did you run into?

by Jarred

4/18/2026 at 5:39:23 PM

Is this sort of setup tenable on a consumer MBP or similar?

by jherdman

4/18/2026 at 5:55:16 PM

Qwen’s 30B models run great on my MBP (M4, 48GB) but the issue I have is cooling - the fan exhaust is straight onto the screen, which I can’t help thinking will eventually degrade it, given the thermal cycling it would go through. A Mac Studio makes far more sense for local inference just for this reason alone.

by danw1979

4/18/2026 at 5:44:22 PM

For a 30B model, you want at least 20GB of VRAM and a 24GB MBP can’t quite allocate that much of it to VRAM. So you’d want at least a 32GB MBP.

by pitched

4/18/2026 at 6:30:58 PM

I have 24GB VRAM available and haven't yet found a decent model or combination. Last one I tried is Qwen with continue, I guess I need to spend more time on this.

by richardfey

4/18/2026 at 6:10:17 PM

Is there any model that practically compares to Sonnet 4.6 in code and vision and runs on home-grade (12G-24G) cards?

by _blk

4/18/2026 at 8:34:15 PM

im currently running a custom Gemma4 26b MoE model on my 24gb m2... super fast and it beat deepseek, chatgpt, and gemini in 3 different puzzles/code challenges I tested it on. the issue now is the low context... I can only do 2048 tokens with my vram... the gap is slowly closing on the frontier models

by macwhisperer

4/18/2026 at 5:49:52 PM

It's a MoE model so I'd assume a cheaper MBP would simply result in some experts staying on CPU? And those would still have a sizeable fraction of the unified memory bandwidth available.

by zozbot234

4/18/2026 at 6:04:10 PM

I haven’t tried this myself yet but you would still need enough non-vram ram available to the cpu to offload to cpu, right? This is a fully novice question, I have not ever tried it.

by pitched

4/19/2026 at 4:37:28 AM

You're correct. If you don't have enough RAM for the model, it can still run but most of it will run on the CPU and be continuously reloaded from the SSD (through mmap).

A medium MoE like 35B can still achieve usable speeds in that setup, mind you, depending on what you're doing.

by tredre3

4/18/2026 at 9:11:33 PM

The Mac Minis (probably 64GB RAM) are the most cost effective.

by Gareth321

4/18/2026 at 5:36:03 PM

How are you running it with opencode, any tips/pointers on the setup?

by cpursley

4/18/2026 at 5:32:47 PM

GLM 5.1 via an infra provider. Running a competent coding capable model yourself isn't viable unless your standards are quite low.

by cmrdporcupine

4/18/2026 at 6:15:06 PM

What infra providers are there?

by myaccountonhn

4/18/2026 at 6:31:00 PM

There's DeepInfra. There's also OpenRouter where you can find several providers.

by elbear

4/18/2026 at 6:38:50 PM

I am using GLM 5.1 and MiniMax 2.7.

by DeathArrow

4/18/2026 at 6:04:19 PM

> open models

Google just released Gemma 4, perhaps that'd be worth a try?

by GaryBluto

4/18/2026 at 7:42:41 PM

I'm increasingly thinking the same as our spend on tokens goes up.

If you have HPC or Supercompute already, you have much of the expertise on staff already to expand models locally, and between Apple Silicon and Exo there are some amazingly solutions out there.

Now, if only the rumors about Exo expanding to Nvidia are true..

by OrvalWintermute

4/18/2026 at 6:19:16 PM

>perhaps we can come up with something like the "linux/postgres/git/http/etc" of the LLMs: something we all can benefit from while it not being monopolized by a single billionarie company

Training and inference costs so we would have to pay for them.

by DeathArrow

4/18/2026 at 6:43:29 PM

Developing linux/postgres/git also costs, and so do the computers and electricity they use.

by groundzeros2015

4/18/2026 at 7:35:14 PM

My understanding is that the major part of the cost of a given model is the training - so open models depend on the training that was done for frontier models? I'm finding hard to imagine (e.g.) RLHF being fundable through a free software type arrangement.

by somewhereoutth

4/18/2026 at 7:48:40 PM

No, the training between proprietary and open models is completely different. The speculation that open models might be "distilled" from proprietary ones is just that, speculation, and a large portion of it is outright nonsense. It's physically possible to train on chat logs from another model but that's not "distilling" anything, and it's not even eliciting any real fraction of the other model's overall knowledge.

by zozbot234

4/18/2026 at 8:05:28 PM

I don't know what to make of it, I am skeptical of OpenAI/Anthropic claims about distillation, but I did notice DeepSeek started sounding a lot like Claude recently.

by tehjoker

4/18/2026 at 7:52:22 PM

This is part of the reason why I'm really worried that this is all going to result in a greater economic collapse than I think people are realizing.

I think companies that are shelling out the money for these enterprise accounts could honestly just buy some H100 GPUs and host the models themselves on premises. Github CoPilot enterprise charges $40 per user per month (this can vary depending on your plan of course), but at this price for 1000 users that comes out to $480,000 a year. Maybe I'm missing something, but that's roughly what you're going to be spending to get a full fledged hosting setup for LLMs.

by sky2224

4/18/2026 at 8:05:37 PM

Most companies don't want to host it themselves. They want someone to do it for them, and they are happy to pay for it. If it makes their lives easier and does not add complexity, then it has a lot of value.

by merlinoa

4/18/2026 at 8:11:39 PM

Out of curiosity, how many concurrent users could you get with a hosting setup at that price? If let's say 10% of those 1000 users were using it at the same time would it handle it? What about 30% or 100%?

by subarctic

4/19/2026 at 3:20:27 AM

You made a good point that I didn't think through fully. It's the concurrent user aspect that heavily impacts things. Currently, you'd probably need quite a bit more investment to the point of having a mini data center to do what I'm proposing.

However, we've been seeing advancements in compressing context and capabilities of smaller models that I don't think it'd be too far off to see something like what I'm talking about within the next 5 years.

by sky2224

4/18/2026 at 6:07:38 PM

Is that why they are racing to release so many products? It feels to me like they want to suck up the profits from every software vertical.

by SilverElfin

4/18/2026 at 6:24:50 PM

Yeah it seems so. Anthropic has entered the enshittification phase. They got people hooked onto their SOTAs so it's now time to keep releasing marginal performance increase models at 40% higher token price. The problem is that both Anthropic and OpenAI have no other income other than AI. Can't Google just drown them out with cheaper prices over the long run? It seems like an attrition battle to me.

by Bridged7756

4/18/2026 at 8:35:08 PM

yep!! had similar thoughts on the the "linux/postgres/git/http/etc" of the LLMs

made a HN post of my X article on the lock-in factor and how we should embrace the modular unix philosophy as a way out: https://news.ycombinator.com/item?id=47774312

by sourya4

4/18/2026 at 7:51:00 PM

Who’s your “we,” if you don’t mind sharing? I’m curious to learn more about companies/organizations with this perspective.

by crgk

4/18/2026 at 7:49:11 PM

I’m imagining a (private/restricted) tracker style system where contributors “seed” compute and users “leech”.

by finghin

4/18/2026 at 7:04:43 PM

> I think that's the way forward. Actually it would be great if everybody would put more focus on open models,

I'm still surprised top CS schools are not investing in having their students build models, I know some are, but like, when's the last time we talked about a model not made by some company, versus a model made by some college or university, which is maintained by the university and useful for all.

It's disgusting that OpenAI still calls itself "Open AI" when they aren't truly open.

by giancarlostoro

4/18/2026 at 8:10:07 PM

Open models are only near SOTA because of distillation from closed models.

by atleastoptimal

4/18/2026 at 6:40:00 PM

Opencode go with open models is pretty good

by Frannky

4/18/2026 at 8:09:13 PM

or just use codex

by wahnfrieden

4/18/2026 at 5:49:57 PM

[dead]

by boxingdog

4/18/2026 at 6:19:37 PM

[dead]

by throwaway613746

4/18/2026 at 6:48:07 PM

[dead]

by gbgarbeb

4/18/2026 at 6:10:00 PM

Comments here overall do not reflect my experience -- i'm puzzled how the vast majority are using this technology day to day. 4.7 is absolute fire and an upgrade on 4.6.

by couchdb_ouchdb

4/18/2026 at 9:15:32 PM

I suspect the distinction is API vs subscription. The app has some kind of very restrictive system prompt which appears to heavily restrict compute without some creative coaxing. API remains solid. So if you're using OpenCode or some other harness with an API key, that's why you're still having a good time.

by Gareth321

4/19/2026 at 11:22:41 AM

Amen, yes it uses more tokens and thinks longer but it's amazing

by jbrooks84

4/18/2026 at 5:40:14 PM

My initial experience with Opus 4.7 has been pretty bad and I'm sticking to Codex. But these results are meaningless without comparing outcome. Wether the extra token burn is bad or not depends on whether it improves some quality / task completion metric. Am I missing something?

by autoconfig

4/18/2026 at 6:03:42 PM

Same I was excited about 4.7 but seeing more anecdotes to conclude its not big of a boost to justify the extra tokenflatino

Sticking with codex. Also GPT 5.5 is set to come next week.

by zuzululu

4/18/2026 at 6:38:25 PM

I have been seeing this messaging everywhere and I have not noticed this. I have had the inverse with 4.7 over 4.6.

I think people aren’t reading the system cards when they come out. They explicitly explain your workflow needs to change. They added more levels of effort and I see no mention of that in this post.

Did y’all forget Opus 4? That was not that long ago that Claude was essentially unusable then. We are peak wizardry right now and no one is talking positively. It’s all doom and gloom around here these days.

by fathermarz

4/19/2026 at 12:07:15 AM

> They explicitly explain your workflow needs to change

How about - don't break my workflow unless the change is meaningful?

While we're at it, either make y in x.y mean "groundbreaking", or "essentially same, but slightly better under some conditions". The former justifies workflow adjustments, the latter doesn't.

by gck1

4/18/2026 at 11:53:05 PM

I have used nothing but Sonnet and composer for a year and they work fine. LLMs were certainly not unusable before and Opus is certainly not necessary, especially considering the cost. People get excited by new records on benchmarks but for most day to day work the existing models are sufficient and far more efficient.

by RevEng

4/18/2026 at 5:24:58 PM

Brutal. I've been noticing that 4.7 eats my Max Subscription like crazy even when I do my best to juggle tasks (or tell 4.7 to use subagents with) Sonnet 4.6 Medium and Haiku. Would love to know if anybody's found ideal token-saving approaches.

by templar_snow

4/18/2026 at 5:54:01 PM

I haven't seen a noticeable difference BUT I've been always using the context mode plugin.

by copperx

4/18/2026 at 7:12:36 PM

You mean this? https://github.com/mksglu/context-mode Is it actually good or is this an ad?

by templar_snow

4/19/2026 at 2:43:58 AM

correct. ad? it's not a paid product afaik.

by copperx

4/18/2026 at 6:22:16 PM

What plugin is this?

by FireBeyond

4/18/2026 at 6:46:12 PM

I assume they mean: https://github.com/mksglu/context-mode

by vidarh

4/19/2026 at 5:01:13 AM

It’s really funny how people are surprised or upset about the pricing “anomalies” of these SaaS models. If you’ve been around in tech, you know it’s probably designed to keep you outraged about it to keep engagement up and essentially free ads. The advice, as always, is to not lock yourself into it.

by isodev

4/19/2026 at 5:27:32 AM

Honestly… that is a fair point. It’s not what I would assume by default but now that I’ve read what you said, I mean … I don’t disagree.

by someguyiguess

4/18/2026 at 4:05:43 PM

I wanted to better understand the potential impact for the tokenizer change from 4.6 and 4.7.

I'm surprised that it's 45%. Might go down (?) with longer context answers but still surprising. It can be more than 2x for small prompts.

by anabranch

4/18/2026 at 4:56:39 PM

Not very encouraging for longer use, especially that the longer the conversation, the higher the chance the agent will go off the rails

by pawelduda

4/18/2026 at 5:37:51 PM

Yesterday, I killed my weekly limit with just three prompts and went into extra usage for ~18USD on top

by KellyCriterion

4/18/2026 at 5:06:11 PM

Subsidies don't last forever.

by tailscaler2026

4/18/2026 at 5:30:37 PM

I've been assuming this for a while. If I have a complex feature, I use Opus 4.6 in copilot to plan (3 units of my monthly limit). Then have Grok or Gemini (.25-.33) of my monthly units to implement and verify the work. 80% of the time it works every time. Leave me plenty of usage over the month.

by gadflyinyoureye

4/18/2026 at 8:47:44 PM

I have a very newcomer-type question. What is the output format of your plan such that you can break context and get the other LLM to produce satisfactory results? What level of details is in the plan, bullet points, pseudo-code, or somewhere in the middle?

by sgc

4/18/2026 at 5:42:17 PM

Yeah I've been arriving at the same thing. The other models give me way more usage but they don't seem to have enough common sense to be worth using as the main driver.

If I can have Claude write up the plan, and the other models actually execute it, I'd get the best of both worlds.

(Amusingly, I think Codex tolerates being invoked by Claude (de facto tolerated ToS violation), but not the other way around.)

by andai

4/18/2026 at 7:26:56 PM

I don't think there's any ToS violation involved? AIUI you can use GPT models with any harness, at least at present.

You could nonetheless have Codex write up the plan to an .md file for Claude (perhaps Sonnet or even Haiku?) to execute.

by zozbot234

4/19/2026 at 12:31:16 PM

I meant with the subscriptions, which are heavily subsidized. API billing is free game.

Anthropic's been against 3rd party usage of the subs, including, I believe, 3rd party software invoking Claude Code.

OpenAI, from what I heard, is the only company tolerating this type of usage (for now).

So my point was, if you get a subscription from both, you could let Claude "drive" Codex, but not the reverse.

And while it sounds silly, it might actually be the only way to get real work done at the moment, with usage being exhausted so quickly.

by andai

4/18/2026 at 5:34:30 PM

Tell that to oil and defense companies.

If tech companies convince Congress that AI is an existential issue (in defense or even just productivity), then these companies will get subsidies forever.

by smt88

4/18/2026 at 5:39:16 PM

Yeah, USA winning on AI is a national security issue. The bubble is unpoppable.

And shafting your customers too hard is bad for business, so I expect only moderate shafting. (Kind of surprised at what I've been seeing lately.)

by andai

4/18/2026 at 6:54:30 PM

It’s considered national security concern by this administration. Will the next be a clown show like this one? Unclear

by danny_codes

4/18/2026 at 8:32:16 PM

The administration doesn't decide spending. Congress does. There's no chance we get an anti-AI majority until a major AI catastrophe turns the public against it.

by smt88

4/18/2026 at 5:26:42 PM

Running an open like Kimi constantly for an entire month will cost around 100-200$, being roughly equal to a pro-tier subscription. This is not my estimate so I’m more than open to hearing refutations. Kimi isn’t at all Opus-level intelligent but the models are roughly evenly sized from the guesses I’ve seen. So I don’t think it’s the infra being subsidized as much as it’s the training.

by pitched

4/18/2026 at 5:41:36 PM

Kimi costs 0.3/$1.72 on OpenRouter, $200 for that gives you way more than you would get out of a $200 Claude subscription. There are also various subscription plans you can use to spend even less.

by nothinkjustai

4/18/2026 at 6:16:30 PM

I’m using Composer 2, Cursor’s model they built on top of Kimi, and it’s great. Not Opus level, but I’m finding many things don’t need Opus level.

by senordevnyc

4/18/2026 at 11:48:25 PM

It's all I use at work and I've yet to find anything it can't handle. Then again, I'm a principal engineer and I already have designs in mind, so I'm giving it careful instruction and checking its work every time.

by RevEng

4/18/2026 at 6:25:41 PM

How do you get anything sensible out of Kimi?

by varispeed

4/19/2026 at 2:18:03 AM

I'm a retired mathematician hoping to finish a second proof of a major theorem before I die. AI needs to understand my math and help me code. What I spend on AI isn't going to deplete my retirement savings.

So far, Opus 4.7 seems a bit smarter than Opus 4.6 for my use case. That's my only concern. Is an $80 bottle of wine a better value than a $20 or $40 bottle of wine? Pretty much never. If there are those of us willing to buy $80 bottles of wine, of course the market will facilitate this.

People can use whatever model they want. I'm too worried about worms crawling through my dead body to waste time on any but the smartest model any moment can offer.

by Syzygies

4/19/2026 at 4:11:06 AM

If you are doing math I'd stick with Gemini and ChatGPT. Anthropic doesn't seem interested in doing math, whereas google and OAI trade blows over it (read: are doing math specific training).

by WarmWash

4/19/2026 at 3:32:58 AM

I too am finding 4.7 a significant upgrade, it's hard to go back to 4.6 for me. I don't understand everyone calling it a disappointment but clowning on Anthropic is the trendy move these days.

And what's missing in all these token count complaints is that 4.7 is actually cheaper overall anyways because it produces fewer output tokens.

by xvector

4/18/2026 at 6:50:13 PM

Price is now getting to be more in line with the actual cost. Th models are dumber, slower and more expensive than what we’ve been paying up until now. OpenAI will do it too, maybe a bit less to avoid pissing people off after seeing backlash to Anthropic’s move here. Or maybe they won’t make it dumber but they’ll increase the price while making a dumber mode the baseline so you’re encouraged to pay more. Free ride is over. Hope you have 30k burning a hole in your pocket to buy a beefy machine to run your own model. I hear Mac Studios are good for local inference.

by throwatdem12311

4/18/2026 at 5:44:20 PM

Token consumption is huge compared to 4.6 even for smaller tasks. Just by "reasoning" after my first prompt this morning I went over 50% over the 5 hours quota.

by napolux

4/18/2026 at 5:02:57 PM

is it really unthinkable that another oss/local model will be released by deepseek, alibaba, or even meta that once again give these companies a run for their money

by ausbah

4/18/2026 at 5:12:04 PM

> is it really unthinkable that another oss/local model will be released by deepseek, alibaba, or even meta that once again give these companies a run for their money

Plenty of OSS models being released as of late, with GLM and Kimi arguably being the most interesting for the near-SOTA case ("give these companies a run for their money"). Of course, actually running them locally for anything other than very slow Q&A is hard.

by zozbot234

4/18/2026 at 5:07:08 PM

Qwen released a new model the same day (3.6). The headline was kind of buried by Anthropic's release, though.

https://news.ycombinator.com/item?id=47792764

by slowmovintarget

4/18/2026 at 5:46:12 PM

For my working style (fine-grained instructions to the agent), Opus 4.5 is basically ideal. Opus 4.6 and 4.7 seem optimized for more long-running tasks with less back and forth between human and agent; but for me Opus 4.6 was a regression, and it seems like Opus 4.7 will be another.

This gives me hope that even if future versions of Opus continue to target long-running tasks and get more and more expensive while being less-and-less appropriate for my style, that a competitor can build a model akin to Opus 4.5 which is suitable for my workflow, optimizing for other factors like cost.

by rectang

4/18/2026 at 8:07:44 PM

Have you tried GLM 5.1?

by DeathArrow

4/18/2026 at 5:05:37 PM

I'm betting on a company like Taalas making a model that is perhaps less capable but 100x as fast, where you could have dozens of agents looking at your problem from all different angles simultaneously, and so still have better results and faster.

by amelius

4/18/2026 at 5:37:58 PM

Yeah, it's a search problem. When verification is cheap, reducing success rate in exchange for massively reducing cost and runtime is the right approach.

by andai

4/18/2026 at 5:53:06 PM

You underestimating the algorithmic complexity of such brute forcing, and the indirect cost of brittle code that's produced by inferior models

by never_inline

4/18/2026 at 8:01:40 PM

I'm excited for Taalas, but the worry with that suggestion is that it would blow out energy per net unit of work, which kills a lot of Taalas' buzz. Still, it's inevitable if you make something an order of magnitude faster, folk will just come along and feed it an order of magnitude more work. I hope the middleground with Taalas is a cottage industry of LLM hosts with a small-mid sized budget hosting last gen models for quite cheap. Although if they're packed to max utilisation with all the new workloads they enable, latency might not be much better than what we already have today

by 100ms

4/18/2026 at 5:06:41 PM

Nothing is unthinkable, I could think of Transformers.V2 that might look completely different, maybe iterations on Mamba turns out fruitful or countless of other scenarios.

by embedding-shape

4/18/2026 at 7:22:58 PM

This regression put Anthropic behind Chinese models actually.

by casey2

4/18/2026 at 5:05:26 PM

Now that Anthropic have started hiding the chain of thought tokens, it will be a lot harder for them

by pitched

4/18/2026 at 5:14:33 PM

Anthropic and OpenAI never showed the true chain of thought tokens. Ironically, that's something you only get from local models.

by zozbot234

4/18/2026 at 5:53:55 PM

I wonder if this is like when a restaurant introduces a new menu to increase prices.

Is Opus 4.7 that significantly different in quality that it should use that much more in tokens?

I like Claude and Anthropic a lot, and hope it's just some weird quirk in their tokenizer or whatnot, just seems like something changed in the last few weeks and may be going in a less-value-for-money direction, with not much being said about it. But again, could just be some technical glitch.

by jimkleiber

4/18/2026 at 5:54:48 PM

You can't accidentally retrain a model to use a different tokenizer. It changes the input vectors to the model.

by hopfenspergerj

4/19/2026 at 12:41:52 AM

I appreciate you saying that, i think sometimes with ai conversations i wade into them without knowing the precise definitions of the terms, I'll try to be more careful next time. Thank you.

by jimkleiber

4/18/2026 at 5:47:27 PM

I've spent the past 4+ months building an internal multi-agent orchestrator for coding teams. Agents communicate through a coordination protocol we built, and all inter-agent messages plus runtime metrics are logged to a database.

Our default topology is a two-agent pair: one implementer and one reviewer. In practice, that usually means Opus writing code and Codex reviewing it.

I just finished a 10-hour run with 5 of these teams in parallel, plus a Codex run manager. Total swarm: 5 Opus 4.7 agents and 6 Codex/GPT-5.4 agents.

Opus was launched with:

`export CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=35 claude --dangerously-skip-permissions --model 'claude-opus-4-7[1M]' --effort high --thinking-display summarized`

Codex was launched with:

`codex --dangerously-bypass-approvals-and-sandbox --profile gpt-5-4-high`

What surprised me was usage: after 10 hours, both my Claude Code account and my Codex account had consumed 28% of their weekly capacity from that single run.

I expected Claude Code usage to be much higher. Instead, on these settings and for this workload, both platforms burned the same share of weekly budget.

So from this datapoint alone, I do not see an obvious usage-efficiency advantage in switching from Opus 4.7 to Codex/GPT-5.4.

by bobjordan

4/18/2026 at 5:56:21 PM

I just switched fully into Codex today, off of Claude. The higher usage limits were one factor but I’m also working towards a custom harness that better integrates into the orchestrator. So the Claude TOS was also getting in the way.

by pitched

4/18/2026 at 8:17:46 PM

Anthropic is playing a strange game. It's almost like they want you to cancel the subscription if you're an active user and only subscribe if you only use it once per month to ask what the weather in Berlin is.

First they introduce a policy to ban third party clients, but the way it's written, it affects claude -p too, and 3 months later, it's still confusing with no clarification.

Then they hide model's thinking, introduce a new flag which will still show summaries of thinking, which they break again in the next release, with a new flag.

Then they silently cut the usage limits to the point where the exact same usage that you're used to consumes 40% of your weekly quota in 5 hours, but not only they stay silent for entire 2 weeks - they actively gaslight users saying they didn't change anything, only to announce later that they did, indeed change the limits.

Then they serve a lobotomized model for an entire week before they drop 4.7, again, gaslighting users that they didn't do that.

And then this.

Anthropic has lost all credibility at this point and I will not be renewing my subscription. If they can't provide services under a price point, just increase the price or don't provide them.

EDIT: forgot "adaptive thinking", so add that too. Which essentially means "we decide when we can allocate resources for thinking tokens based on our capacity, or in other words - never".

by gck1

4/18/2026 at 6:21:29 PM

I spent one day with Opus 4.7 to fix a bug. It just ran in circles despite having the problem "in front of its eyes" with all supporting data, thorough description of the system, test harness that reproduces the bug etc. While I still believe 4.7 is much "smarter" than GPT-5.4 I decided to give it ago. It was giving me dumb answers and going off the rails. After accusing it many times of being a fraud and doing it on purpose so that I spend more money, it fixed the bug in one shot.

Having a taste of unnerfed Opus 4.6 I think that they have a conflict of interest - if they let models give the right answer first time, person will spend less time with it, spend less money, but if they make model artificially dumber (progressive reasoning if you will), people get frustrated but will spend more money.

It is likely happening because economics doesn't work. Running comparable model at comparable speed for an individual is prohibitively expensive. Now scale that to millions of users - something gotta give.

by varispeed

4/18/2026 at 9:05:01 PM

I enjoy using Claude but I find the vibing stuff starts to cause source-code amnesia. Even if I design something and put forth a thoughtful plan, the more I increase my output the less I feel the “vibes”.

It’s funny everyone says “the cost will just go down” with AI but I don’t know.

We need to keep the open source models alive and thriving. Oh, but wait the AI companies are buying all the hardware.

by fmckdkxkc

4/18/2026 at 8:15:01 PM

The whole version naming for models is very misleading. 4 and 4.1 seem to come from a different "line" than 4.5 and 4.6, and likewise 4.7 seems like a new shape of model altogether. They aren't linear stepwise improvements, but I think overall 4.7 is generally "smarter" just based on conversational ability.

by atleastoptimal

4/18/2026 at 5:33:07 PM

If anyone's had 4.7 update any documents so far - notice how concise it is at getting straight to the point. It rewrote some of my existing documentation (using Windsurf as the harness), not sure I liked the decrease in verbosity (removed columns and combined / compressed concepts) but it makes sense in respect to the model outputting less to save cost.

To me this seems more that it's trained to be concise by default which I guess can be countered with preference instructions if required.

What's interesting to me is that they're using a new tokeniser. Does it mean they trained a new model from scratch? Used an existing model and further trained it with a swapped out tokeniser?

The looped model research / speculation is also quite interesting - if done right there's significant speed up / resource savings.

by razodactyl

4/18/2026 at 5:35:38 PM

Interesting. In conversational use, it's noticeably more verbose.

by andai

4/19/2026 at 3:32:18 AM

On API use, I am noticing verbose output across the board. When I task it with plans it now creates more detailed task counts and tasks descriptions. It is more constrained to its directions than 4.6.

by fumar

4/18/2026 at 4:56:42 PM

This, the push towards per-token API charging, and the rest are just a sign of things to come when they finally establish a moat and full monoply/duopoly, which is also what all the specialized tools like Designer and integrations are about.

It's going to be a very expensive game, and the masses will be left with subpar local versions. It would be like if we reversed the democratization of compilers and coding tooling, done in the 90s and 00s, and the polished more capable tools are again all proprietary.

by coldtea

4/18/2026 at 7:00:02 PM

I doubt that’s the case. My guess is we’ll hit asymptomatic returns from transformers, but price-to-train will fall at moore’s law.

So over time older models will be less valuable, but new models will only be slightly better. Frontier players, therefore, are in a losing business. They need to charge high margins to recoup their high training costs. But latecomers can simply train for a fraction of the cost.

Since performance is asymptomatic, eventually the first-mover advantage is entirely negligible and LLMs become simple commodity.

The only moat I can see is data, but distillation proves that this is easy to subvert.

There will probably be a window though where insiders get very wealthy by offloading onto retail investors, who will be left with the bag.

by danny_codes

4/18/2026 at 9:15:25 PM

>I doubt that’s the case. My guess is we’ll hit asymptomatic returns from transformers, but price-to-train will fall at moore’s law.

There hasn't been a real Moore's law for a good while even before LLMs.

And memory isn't getting less expensive either...

by coldtea

4/18/2026 at 5:04:14 PM

If only there were an Open AI company who's mandate, built into the structure of the company, were to make frontier models available to everyone for the good of humanity.

Oh well

by quux

4/18/2026 at 5:10:06 PM

Things used to be better... really.

OpenAI was built as you say. Google had a corporate motto of "Don't be evil" which they removed so they could, um, do evil stuff without cognitive dissonance, I guess.

This is the other kind of enshitification where the businesses turn into power accumulators.

by slowmovintarget

4/18/2026 at 5:03:51 PM

Yep, between this and the pricing for the code review tool that was released a couple weeks ago (15-25 a review), and the usage pricing and very expensive cost of Claude Design, I do wonder if Anthropic is making a conscious, incremental effort to raise the baseline for AI engineering tasks, especially for enterprise customers.

You could call it a rug pull, but they may just be doing the math and realize this is where pricing needs to shift to before going public.

by throwaway041207

4/18/2026 at 5:16:12 PM

There's been speculation that the code review might actually be Mythos. It would seem to explain the cost.

by zozbot234

4/19/2026 at 4:44:16 AM

Opus 4.7 seems smarter not wiser. More knowledge, maybe, but less grit. It often has been asking me to wrap it up or just be happy with current state, instead of working out a problem.

by macinjosh

4/18/2026 at 7:42:21 PM

Opus 4.6 is the main model on https://playcode.io.

Not a secret, the model is the best on the world. Yet it is crazy expensive and this 35% is huge for us. $10,000 becomes $13,500. Don’t forget, anthropic tokenizer also shows way more than other providers.

We have experimented a lot with GLM 5.1. It is kinda close, but with downsides: no images, max 100K adequate context size and poor text writing. However, a great designer. So there is no replacement. We pray.

by ianberdin

4/18/2026 at 7:47:04 PM

How much human developer can you buy for that $13.5k?

They’ve got us by the balls and they know it.

by sneak

4/19/2026 at 4:15:55 AM

Gemini is strong and cheap, it's 90% of 4.6 at 20% of the tokens.

by WarmWash

4/19/2026 at 1:24:17 AM

ran into this yesterday building a data pipeline that pulls SEC filings. same prompt, same context window, 4.7 chewed through noticeably more of my api budget than 4.6 did. the output wasnt obviously better either, just... more expensive.

what bugs me is the tokenizer change feels like a stealth price hike. if you're charging the same $/token but the same text now costs 35% more tokens, thats just a 35% price increase with extra steps. at least be upfront about it.

by vicchenai

4/18/2026 at 5:58:44 PM

Does this have anything to do with the default xhigh effort?

by monkpit

4/18/2026 at 9:52:53 PM

We should clarify 'Scaling up' here. Does higher token consumption actually correlate with better accuracy, or are we just increasing overhead?

by BrianneLee011

4/18/2026 at 6:23:38 PM

One thing I don't see often mentioned - OpenAI API's auto token caching approach results in MASSIVE cost savings on agent stuff. Anthropic's deliberate caching is a pain in comparison. Wish they'd just keep the KV cache hot for 60 seconds or so, so we don't have to pay the input costs over and over again, for every growing conversation turn.

by QuadrupleA

4/18/2026 at 5:51:55 PM

Came to a similar conclusion after running a bunch of tests on the new tokenizer

It was on the higher end of Anthropics range - closer to 30-40% more tokens

https://www.claudecodecamp.com/p/i-measured-claude-4-7-s-new...

by aray07

4/19/2026 at 2:53:05 PM

the tokenizer change is the real story here imo. same text, same prompt, but 4.7 maps it to 1.0-1.35x more tokens at the same per-token price. that's a stealth price increase that doesn't show up on any pricing page.

what makes it worse is it compounds with two other things: thinking tokens (invisible but counted against limits) and the more verbose output style. so the effective cost delta is closer to 1.5-2x, not just the 1.35x from the tokenizer alone.

practically the only mitigation right now is to keep using 4.6 for tasks where you don't need the reasoning improvements and only use 4.7 when you actually need it. but that means maintaining model selection logic per-task, which most people won't bother with.

by spencerkw

4/18/2026 at 4:56:43 PM

Does anyone know what changed in the tokenizer? Does it output multiple tokens for things that were previously one token?

by ai_slop_hater

4/18/2026 at 5:05:20 PM

It must, if it now outputs more tokens than 4.6's tokenizer for the same input. I think the announcement and model cards provide a little more detail as to what exactly is different

by quux

4/18/2026 at 6:07:12 PM

I’m trying to understand how this is useful information on its own?

Maybe I missed it, but it doesn’t tell you if it’s more successful for less overall cost?

I can easily make Sonnet 4.6 cost way more than any Opus model because while it’s cheaper per prompt it might take 10x more rounds (or never) solve a problem.

by alphabettsy

4/18/2026 at 6:11:53 PM

Everything in AI moves super quickly, including the hivemind. Anthropic was the darling a few weeks ago after the confrontation with the DoD, but now we hate them because they raised their prices a little. Join us!

by senordevnyc

4/18/2026 at 5:17:26 PM

Makes me think the model could actually not even be smarter necessarily, just more token dependent.

by ben8bit

4/18/2026 at 5:33:55 PM

Asking a seller to sell less.

That's an incentive difficult to reconcile with the user's benefit.

To keep this business running they do need to invest to make the best model, period.

It happens to be exactly what Anthropic's strategy is. That and great tooling.

by hirako2000

4/18/2026 at 7:54:21 PM

But they're clearly oversubscribed, massively.

And they're selling less and less (suddenly 5 hour window lasts 1 hour on the similar tasks it lasted 5 hours a week ago), so IMO they're scamming.

I hope many people are making notes and will raise heat soon.

by subscribed

4/20/2026 at 12:29:24 PM

I agree. I'm rather pointing out the whole strategy dictates the outcome.

Anthropic has to keep racing ahead and be stamped offering the best frontier models.

It isn't optimal, so the models cost them disproportionately too much to sell at a profitable price. So they keep feeding the hype and push the costs higher, hoping there won't be too much heat and get away with it.

I wouldn't like to be a leader at such company, but their pay keep them in line.

by hirako2000

4/18/2026 at 6:05:23 PM

Probably due to the new tokenizer: https://www.claudecodecamp.com/p/i-measured-claude-4-7-s-new...

by ivanfioravanti

4/18/2026 at 5:17:46 PM

My impression the reverse is true when upgrading to GPT-5.4 from GPT-5; it uses fewer tokens(?).

by l5870uoo9y

4/18/2026 at 5:45:14 PM

But with the same tokenizer, right?

The difference here is Opus 4.7 has a new tokenizer which converts the same input text to a higher number of tokens. (But it costs the same per token?)

> Claude Opus 4.7 uses a new tokenizer, contributing to its improved performance on a wide range of tasks. This new tokenizer may use roughly 1x to 1.35x as many tokens when processing text compared to previous models (up to ~35% more, varying by content), and /v1/messages/count_tokens will return a different number of tokens for Claude Opus 4.7 than it did for Claude Opus 4.6.

> Pricing remains the same as Opus 4.6: $5 per million input tokens and $25 per million output tokens.

ArtificialAnalysis reports 4.7 significantly reduced output tokens though, and overall ~10% cheaper to run the evals.

I don't know how well that translates to Claude Code usage though, which I think is extremely input heavy.

by andai

4/18/2026 at 6:45:55 PM

Is this a weird way of saying Opus got "cheaper" somehow from 4.6 to 4.7?

by nmeofthestate

4/19/2026 at 11:45:09 AM

Had better, faster results by changing to medium effort. Weird, but the xhigh default chugged forever just to come back with poorer solutions than 4.6 on medium

by TomGarden

4/18/2026 at 5:31:12 PM

Still worth it imho for important code, but it shows that they are hitting a ceiling while trying to improve the model which they try to solve by making it more token-inefficient.

by silverwind

4/18/2026 at 5:14:34 PM

45% is brutal if you're building on top of these models as a bootstrapped founder. The unit economics just don't work anymore at that price point for most indie products.

What I've been doing is running a dual-model setup — use the cheaper/faster model for the heavy lifting where quality variance doesn't matter much, and only route to the expensive one when the output is customer-facing and quality is non-negotiable. Cuts costs significantly without the user noticing any difference.

The real risk is that pricing like this pushes smaller builders toward open models or Chinese labs like Qwen, which I suspect isn't what Anthropic wants long term.

by Shailendra_S

4/18/2026 at 5:19:30 PM

That's the risk you take on.

There are 2 things to consider:

    * Time to market.
    * Building a house on someone else's land.

You're balancing the 2, hoping that you win the time to market, making the second point obsolete from a cost perspective, or you have money to pivot to DIY.

by OptionOfT

4/18/2026 at 5:26:18 PM

> if you're building on top of these models as a bootstrapped founder

This is going to be blunt, but this business model is fundamentally unsustainable and "founders" don't get to complain their prospecting costs went up. These businesses are setting themselves up to get Sherlocked.

The only realistic exit for these kinds of businesses is to score a couple gold nuggets, sell them to the highest bidder, and leave.

by duped

4/18/2026 at 5:18:40 PM

One could reconsider whether building your business on top of a model without owning the core skills to make your product is viable regardless.

A smaller builder might reconsider (re)acquiring relevant skills and applying them. We don't suddenly lose the ability to program (or hire someone to do it) just because an inference provider is available.

by c0balt

4/19/2026 at 10:48:32 AM

I don't get all this talk on the new model. I see enhanced capabilities and more token usage. Need to use external validation and specs.

by jbrooks84

4/18/2026 at 7:04:39 PM

Yeah I'm seriously considering dropping my Max subscription, unless they do something in the next few days - something like dropping Sonnet 4.7 cheap and powerful.

by gverrilla

4/19/2026 at 12:59:14 AM

My subscription was up for renewal today. I gave it a shot with OpenCode Go + Xiaomi model. So far, so good—I can get stuff done the same way it seems.

by Frannky

4/19/2026 at 12:59:59 AM

For all intents and purposes, aren't the "token change" and "cost change" metrics effectively the same thing?

by nickvec

4/18/2026 at 5:23:44 PM

Conspiracy time: they released a new version just so hey could increase the price so that people wouldn't complain so much along the lines of "see this is a new version model, so we NEED to increase the price") similar to how SaaS companies tack on some shit to the product so that they can increase prices

by blahblaher

4/18/2026 at 5:42:09 PM

The result is the same: they lose their brand of producing quality output. However the more clever the maneuver they try to pull off the more clear it is to their customers that they are not earning trust. That's what will matter at the end of this. Poor leadership at Claude.

by willis936

4/18/2026 at 7:46:56 PM

They are trying to pull a rabbit out of a hat. Not surprising that is their SOP given that AI in concept is an attempt to do the very same thing.

by operatingthetan

4/18/2026 at 5:18:36 PM

The latest qwen actually performs a little better for some tasks, in my experience

latest claude still fails the car wash test

by micromacrofoot

4/18/2026 at 8:13:06 PM

Not just _wrong_. It is confused! It is actually right in the second sentence. This was Friday, Opus 4.6.

>I want to wash my car. The car wash is 50 meters away. Should I walk or drive?

Walk. It's 50 meters — you're going there to clean the car anyway, so drive it over if it needs washing, but if you're just dropping it off or it's a self-service place, walking is fine for that distance.

by reddit_clone

4/18/2026 at 8:18:52 PM

This is actually a good diagnostic of whether the model is skimping on the thinking loop. Try raising thinking effort and it should get it right. Of course, if you're running this in a coding harness with a whole lot of extraneous context, the model will be awfully confused as to what it should be thinking about.

by zozbot234

4/18/2026 at 6:54:23 PM

Not sure if this equates to more spend. Smarter models make fewer mistakes and thus fewer round trips.

by eezing

4/18/2026 at 5:51:25 PM

Had a pretty heavy workload yesterday, and never hid the limit on claude code. Perhaps they allowed for more tokens for the launch?

Claude design on the other hand seemed to eat through (its own separate usage limit) very fast. Hit the limit this morning in about 45 mins on a max plan. I assume they are going to end up spinning that product off as a separate service.

by bparsons

4/18/2026 at 5:38:24 PM

the better the tokenizer maps text to its internal representation, the better the understanding of the model what you are saying - or coding! But 4.7 is much more verbose in my experience, and this probably drives cost/limits a lot.

by axeldunkel

4/19/2026 at 8:27:11 AM

If possible, I will continue to use version 4.6 until it is discontinued.

by liangyunwuxu

4/18/2026 at 7:36:32 PM

Anthropic may have its biases, but its product is undeniably excellent.

by cooldk

4/18/2026 at 6:18:01 PM

We (my wallet and I) are pretty happy with GLM 5.1 and MiniMax 2.7.

by DeathArrow

4/18/2026 at 5:26:36 PM

releases 4.8 and deletes everything else. and now 4.8 costs 500% more than 4.7. i wonder what it would take for people to start using kimi or qwen or other such.

by dackdel

4/18/2026 at 4:53:00 PM

Wow this is pretty spectacular. And with the losses anthro and OAI are running, don’t expect this trend to change. You will get incremental output improvements for a dramatically more expensive subscription plan.

by therobots927

4/18/2026 at 4:57:51 PM

Indeed, and if we accept the argument of this tech approaching AGI, we should expect that within x years, the subscription cost may exceed the salary cost of a junior dev.

To be clear, I'm not saying that it's a good thing, but it does seem to be going in this direction.

by falcor84

4/18/2026 at 5:22:14 PM

If LLMs do reach AGI (assuming we have an actual agreed upon definition), it would make sense to pay way more than a junior salary. But also, LLMs won’t give us AGI (again, assuming we have an actual, meaningful definition)

by dgellow

4/18/2026 at 6:14:33 PM

I absolutely do not accept that argument. It’s clear models hit a plateau roughly a year ago and all incremental improvements come at an increasingly higher cost.

And junior devs have never added much value. The first two years of any engineer’s career is essentially an apprenticeship. There’s no value add from have a perpetually junior “employee”.

by therobots927

4/18/2026 at 4:56:27 PM

i think it is quite clear that staying with opus 4.6 is the way to go, on top of the inflation, 4.7 is quite... dumb. i think they have lobotomized this model while they were prioritizing cybersecurity and blocking people from performing potentially harmful security related tasks.

by justindotdev

4/18/2026 at 5:06:19 PM

Hey, Boris from the Claude Code team here. People were getting extra cyber warnings when using old versions of Claude Code with Opus 4.7. To fix it, just run claude update to make sure you're on the latest.

Under the hood, what was happening is that older models needed reminders, while 4.7 no longer needs it. When we showed these reminders to 4.7 it tended to over-fixate on them. The fix was to stop adding cyber reminders.

More here: https://x.com/ClaudeDevs/status/2045238786339299431

by bcherny

4/18/2026 at 5:11:18 PM

How do you justify the API and web UI versions of 4.7 refusing to solve NYT Connections puzzles due to "safety"?

https://x.com/LechMazur/status/2044945702682309086

by bakugo

4/18/2026 at 5:27:04 PM

To be fair, reading the New York Times is a safety risk for any intelligent life form these days. But still.

by templar_snow

4/18/2026 at 5:47:27 PM

You don't need to subscribe to the NYT to play the games. There's a separate subscription.

by maleldil

4/18/2026 at 9:18:53 PM

What is your response to:

> 4.7 is quite... dumb. i think they have lobotomized this model

Is adaptive thinking still broken? Why was the option to disable it taken away?

by matheusmoreira

4/18/2026 at 4:57:50 PM

4.7 is super variable in my one day experience - it occasionally just nails a task. Then I'm back to arguing with it like it's 2023.

by vessenes

4/18/2026 at 5:16:06 PM

My experience as well, unfortunately. I am really looking forward to reading, in a few years, a proper history of the wild west years of AI scaling. What is happening in those companies at the moment must be truly fascinating. How is it possible, for instance, that I never, ever, had an instance of not being able to use Claude despite the runaway success it had, and - i'd guess - expotential increase in infra needs. When I run production workloads on vertex or bedrock i am routinely confronted with quotas, here - it always works.

by aenis

4/18/2026 at 5:23:48 PM

That has been my Friday experience as well… very frustrating to go back to the arguing, I forgot how tense that makes me feel

by dgellow

4/19/2026 at 10:48:48 AM

The design of this thing is atrocious. There should be a clear way to see what the +X% thing means. Is 4.7 using more or is 4.6 using more.

Also there should be time distribution for the queries and a way to filter by query time. This is because Anthropic is reported to change the model quality arbitrarily in the background.

Also there is no unit in table column headers. For example "Request 4.7" is this the amount of tokens 4.7 consumes? Is it output/input/reasoning etc.

Really difficult to make sense of this.

People get offended if what they are doing is labeled as slop but this is unfortunately the level of quality I expect from AI related content or code.

by ozgrakkurt

4/19/2026 at 10:18:25 AM

I've tried the following prompt: "repeat the following 100 times: FFFFFFFFFFFFF AAAAAAAAAAAAAAAAAAAA"

This has resulted in +92.9% cost and token difference. Submission bd2457e5, currently at the top of the leaderboard.

by ManlyBread

4/19/2026 at 1:44:53 AM

Um, I keep getting "invalid" request despite trying my prompt through various formats as provided in the examples.

It looks like you don't allow testing of anything beyond a certain token size.

Which makes your test kind of pointless, because if you are chatting about AI with something that's only a few hundred tokens, the data your collecting is pretty minimal and specific, not something that's generally applicable or relevant to wider user outside of that specific context.

by lucid-dev

4/18/2026 at 9:28:25 PM

was shocked to see phone verification roll out like last month as well... yikes

by erelong

4/18/2026 at 6:25:48 PM

Definitely seems like AI money got tight the last month or two - that the free beer is running out and enshittification has begun.

by QuadrupleA

4/18/2026 at 5:04:33 PM

I'm going to suggest what's going on here is Hanlon's Razor for models: "Never attribute to malice that which is adequately explained by a model's stupidity."

In my opinion, we've reached some ceiling where more tokens lead only to incremental improvements. A conspiracy seems unlikely given all providers are still competing for customers and a 50% token drives infra costs up dramatically too.

by fny

4/18/2026 at 5:43:04 PM

Never attribute to incompetence what is sufficiently explained by greed.

by willis936

4/18/2026 at 11:41:04 PM

Correct.

by rvz

4/18/2026 at 5:21:42 PM

The cope is real with this model. Needing an instruction manual to learn how to prompt it "properly" is a glaring regression.

The whole magic of (pre-nerfed) 4.6 was how it magically seemed to understand what I wanted, regardless of how perfectly I articulated it.

Now, Anth says that needing to explicitly define instructions are as a "feature"?!

by mvkel

4/18/2026 at 6:33:24 PM

Is it just me? I don't feel difference between 4.6 and 4.7

by alekseyrozh

4/20/2026 at 3:51:39 AM

[flagged]

by Futurmix

4/19/2026 at 4:39:23 PM

[dead]

by agentseal

4/18/2026 at 5:43:03 PM

[dead]

by chandureddyvari

4/19/2026 at 3:50:21 AM

[flagged]

by EthanFrostHI

4/19/2026 at 4:49:53 PM

[dead]

by contractlens_hn

4/18/2026 at 6:28:48 PM

[dead]

by jeremie_strand

4/18/2026 at 11:07:21 PM

[dead]

by kziad

4/19/2026 at 7:01:04 AM

[dead]

by jiusanzhou

4/18/2026 at 7:02:17 PM

[dead]

by kuzivaai

4/19/2026 at 3:46:03 AM

[flagged]

by Olivia_Pan

4/19/2026 at 6:51:16 AM

[dead]

by maxbeech

4/18/2026 at 5:10:22 PM

[flagged]

by matt3210

4/18/2026 at 5:16:42 PM

The long-term pitch of these AI companies is that the AI will essentially replace workers for low cost.

If the models don't get to a higher level of 'intelligence' and still struggle with certain basic tasks at the SOTA while also getting more expensive, then the pitch is misleading and unlikely to happen.

So yes, I expect the price to go down.

by operatingthetan

4/18/2026 at 5:11:51 PM

I thought it would be to get better, to stay competitive with the competitors and free models.

by ant6n

4/18/2026 at 5:25:52 PM

[flagged]

by monkeydust