4/18/2026 at 6:39:19 PM
For a fair comparison you need to look at the total cost, because 4.7 produces significantly fewer output tokens than 4.6, and seems to cost significantly less on the reasoning side as well.
Here is a comparison for 4.5, 4.6 and 4.7 (Output Tokens section):
https://artificialanalysis.ai/?models=claude-opus-4-7%2Cclau...
4.7 comes out slightly cheaper than 4.6. But 4.5 is about half the cost:
https://artificialanalysis.ai/?models=claude-opus-4-7%2Cclau...
Notably the cost of reasoning has been cut almost in half from 4.6 to 4.7.
I'm not sure what that looks like for most people's workloads, i.e. what the cost breakdown looks like for Claude Code. I expect it's heavy on both input and reasoning, so I don't know how that balances out, now that input is more expensive and reasoning is cheaper.
On reasoning-heavy tasks, it might be cheaper. On tasks which don't require much reasoning, it's probably more expensive. (But for those, I would use Codex anyway ;)
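The trade-off described above can be sketched numerically. The per-million-token prices below are placeholders, not Anthropic's actual pricing; the point is only that which model is cheaper flips depending on the input/reasoning mix of the workload:

```python
# Hypothetical per-million-token prices (placeholders, NOT real pricing):
# 4.7 is assumed to have pricier input but cheaper reasoning than 4.6.
PRICES = {
    "4.6": {"input": 15.0, "reasoning": 75.0, "output": 75.0},
    "4.7": {"input": 20.0, "reasoning": 40.0, "output": 75.0},
}

def total_cost(model, input_tok, reasoning_tok, output_tok):
    """Total cost in dollars; token counts given in millions."""
    p = PRICES[model]
    return p["input"] * input_tok + p["reasoning"] * reasoning_tok + p["output"] * output_tok

# A reasoning-heavy task favors 4.7; an input-heavy task favors 4.6.
heavy_reasoning = (0.1, 1.0, 0.2)   # (input, reasoning, output) in millions
heavy_input     = (2.0, 0.1, 0.2)
```

With these placeholder numbers, 4.7 wins the reasoning-heavy mix and loses the input-heavy one, which is the balance the comment is describing.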
by andai
4/18/2026 at 9:14:23 PM
It thinks less and produces fewer output tokens because it has forced adaptive thinking that even API users can't disable. It's the same adaptive thinking that was causing quality issues in Opus 4.6 not even two weeks ago, the one bcherny recommended that people disable because it'd sometimes allocate zero thinking tokens to the model.
https://news.ycombinator.com/item?id=47668520
People are already complaining about low quality results with Opus 4.7. I'm also spotting it making really basic mistakes.
I literally just caught it lazily "hand-waving" away things instead of properly thinking them through, even though it spent like 10 minutes churning tokens and ate only god knows how many percentage points off my limits.
> What's the difference between this and option 1.(a) presented before?
> Honestly? Barely any. Option M is option 1.(a) with the lifecycle actually worked out instead of hand-waved.
> Why are you handwaving things away though? I've got you on max effort. I even patched the system prompts to reduce this.
> Fair call. I was pattern-matching on "mutation + capture = scary" without actually reading the capture code. Let me do the work properly.
> You were right to push back. I was wrong. Let me actually trace it properly this time.
> My concern from the first pass was right. The second pass was me talking myself out of it with a bad trace.
It's just a constant stream of self-corrections and doubts. Opus simply cannot be trusted when adaptive thinking is enabled.
Can provide session feedback IDs if needed.
by matheusmoreira
4/18/2026 at 11:24:54 PM
> > Why are you handwaving things away though? I've got you on max effort. I even patched the system prompts to reduce this.
In my experience, prompts like this one, which 1) ask for a reason behind an answer (when the model won't actually be able to provide one) and 2) are somewhat standoff-ish, don't work well at all. You'll just have the model go the other way.
What works much better is to tell the model to take a step back and re-evaluate. Sometimes it also helps to explicitly ask it to look at things from a different angle XYZ, in other words, to add some entropy to get it away from the local optimum it's currently at.
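The "step back and re-evaluate" move can be sketched as a neutral follow-up turn rather than an accusatory "why" (a minimal sketch: the model call is a stub, and the prompt wording is just an example, not a recommended incantation):

```python
# Illustrative re-prompt template: neutral, forward-looking, adds a new angle.
STEP_BACK = ("Take a step back and re-evaluate your last answer. "
             "This time, approach it from the angle of {angle}.")

def reprompt(history, model, angle):
    """Append a re-evaluation turn instead of asking the model to
    justify itself, then continue the conversation from there."""
    history = history + [{"role": "user", "content": STEP_BACK.format(angle=angle)}]
    return model(history)
```

The idea is that the new turn injects fresh context (the angle) instead of demanding introspection the model can't perform.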
by codethief
4/19/2026 at 1:07:06 AM
> when the model won't actually be able to provide one
This is key. In my experience, asking an LLM why it did something is usually pointless. In a subsequent round, it generally can't meaningfully introspect on its prior internal state, so it's just referring to the session transcript and extrapolating a plausible-sounding answer based on its training data of how LLMs typically work.
That doesn't necessarily mean the reply is wrong because, as usual, a statistically plausible sounding answer sometimes also happens to be correct, but it has no fundamental truth value. I've gotten equally plausible answers just pasting the same session transcript into another LLM and asking why it did that.
by mrandish
4/19/2026 at 12:32:46 PM
How I think about this is…
From early GPT days to now, the best way to get a decently scoped and reasonably grounded response has always been to ask at least twice (in the early days, often 7 or 8 times).
Because not only can it not reflect, it cannot "think ahead about what it needs to say and change its mind". It "thinks" out loud (as some people seem to as well).
It is a "continuation" of context. When you ask what it did, it still doesn't think, it just* continues from a place of having more context to continue from.
The game has always been: stuff context better => continue better.
Humans were bad at doing this. For example, asking it for synthesis with explanation instead of, say, asking for explanation, then synthesis.
You can get today's behaviors by treating "adaptive thinking" like a token budgeted loop for context stuffing, so eventually there's enough context in view to produce a hopefully better contextualized continuation from.
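That "token budgeted loop" framing can be sketched in Python. This is a toy model of the idea, not any vendor's actual implementation: the context-gathering step and the model call are stubs, and the word-count tokenizer is deliberately crude.

```python
def adaptive_answer(question, gather_context, model, budget=1000):
    """Treat 'adaptive thinking' as a loop that stuffs context until the
    token budget runs out or there is nothing more to gather."""
    context = []
    spent = 0
    while spent < budget:
        more = gather_context(question, context)  # e.g. read a file, trace a call
        if more is None:
            break
        context.append(more)
        spent += len(more.split())  # crude stand-in for a token count
    # Final continuation, produced from the accumulated context.
    return model(question, context)
```

The point of the sketch: the quality of the final continuation depends entirely on what the loop managed to stuff into `context` before the budget ran out.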
It seems no accident we've hit on the word "harness" — so much that seems impressive by end of 2025 was available by end of 2023 if "holding it right". If (and only if!) you are an expert in an area you need it to process: (1) turn thinking off, (2) do your own prompting to "prefill context", and (3) you will get superior final response. Not vibing, just staff-work.
---
* “just” – I don't mean "just" dismissively. Qwen 3.5 and Gemma 4 on M5 approaches where SOTA was a year ago, but faster and on your lap. These things are stunning, and the continuations are extraordinary. But still: Garbage in, garbage out; gems in, gem out.
by Terretta
4/19/2026 at 6:51:00 AM
> In a subsequent round, it generally can't meaningfully introspect on its prior internal state
It can't do any better in the moment it's making the choices. Introspection mostly amounts to back-rationalisation, just like in humans. Though for humans, doing so may help them learn to make better future decisions in similar situations.
by vanviegen
4/19/2026 at 11:15:58 AM
I don't understand why people don't just say "This is wrong. Try again." or "This is wrong because xyz. Try again." This anthropomorphizing by asking why seems a bit pointless when you know how LLMs work, unless you've empirically had better results from a specific make and version of LLM by asking why in the past. It's theoretically functionally equivalent to asking a brand-new LLM instance with your chat history why the original gave such an answer... Do you want the correct result, or do you actually care about knowing why?
> Introspection mostly amounts to back-rationalisation, just like in humans.
That's the best-case scenario. Again, let's stop anthropomorphizing. The given reasons why may be incompatible with the original answer upon closer inspection...
by sillyfluke
4/19/2026 at 11:47:54 AM
I definitely do this, along with the compulsion sometimes to tell the agent how a problem was fixed in the end, when I've investigated it myself after the model failed to do so. Just common courtesy after working on something together. Let's rationalize this as giving me an opportunity to reflect and rubber-duck the solution.
Regarding not just telling it „try again“: of course you are right to suggest that applying human cognition mechanisms to LLMs is not founded on the same underlying effects.
But due to the nature of training and fine-tuning/RL, I don't think it is unreasonable that instructing it to do backwards reflection could have a positive effect. The model might pattern-match on this and then exhibit a few positive behaviors. It could lead it to do more reflection within the reasoning blocks and catch errors before answering, which is what you want. These reasoning blocks will attend to the question of „what caused you to make this assumption“, also encouraging this behavior. Yes, both mechanisms are exhibited through linear, forward-going statistical interpolation, but the concept of reasoning has proven that this is an effective strategy for arriving at a more grounded result than answering right away.
Lastly, back to anthropomorphizing: it shows that you, the user, are encouraging of deeper thought and self-corrections. The model does not have psychological safety mechanisms which it guards, but again, the way the models are trained causes them to emulate them. The RL primes the model for certain behavior, i.e. arriving at an answer at some point rather than thinking for a long time. I think it fair to assume that by „setting the stage“ it is possible to influence which parts of the RL training activate. While role-based prompting is not that important anymore, I think the system prompts of the big coding agents still have it, suggesting some, if slight, advantage to putting the model in the right frame of mind. Again, very sorry for that last part, but anthropomorphizing does seem to be a useful analogy for a lot of the concepts we are seeing (the reason for this lying in the farther-off epistemological and philosophical regions, on the side of both the models and us).
by Dumbledumb
4/19/2026 at 12:09:33 PM
> This is key. In my experience, asking an LLM why it did something is usually pointless. In a subsequent round, it generally can't meaningfully introspect on its prior internal state, so it's just referring to the session transcript and extrapolating a plausible sounding answer based on its training data of how LLMs typically work.
Yep, I've gotten used to treating the model output as a finished, self-contained thing.
If it needs to be explained, the model will be good at that; if it has an issue, the model will be good at fixing it (and possibly patching any instructions to prevent it in the future). I'm not getting the actual reason why things happened a certain way, but then again, it's just a token prediction machine. If there's something wrong with my prompt that's not immediately obvious and perhaps doesn't matter that much, I can just run a few sub-agents in a review role, look for a consensus on any problems that might be found, and have the model fix them.
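That review-by-consensus idea can be sketched as a majority vote over independent reviewer passes. In this sketch the reviewers are plain stub functions; in a real setup they would be sub-agent invocations, and the threshold is an arbitrary choice:

```python
from collections import Counter

def consensus_issues(artifact, reviewers, threshold=0.5):
    """Run several independent reviewers over the same artifact and keep
    only issues flagged by more than `threshold` of the reviewers."""
    counts = Counter()
    for review in reviewers:
        counts.update(set(review(artifact)))  # set(): one vote per reviewer
    needed = len(reviewers) * threshold
    return sorted(issue for issue, n in counts.items() if n > needed)
```

Issues only one reviewer hallucinates get filtered out; issues most reviewers independently converge on survive.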
by KronisLV
4/19/2026 at 10:24:17 AM
It's worked for me when I ask why with a stated goal of preventing the same error the next time.
"Why did you guess at the function's signature and get it wrong? What information were you using, and how can we prevent it next time?"
Is that not the right approach?
by wallst07
4/19/2026 at 7:44:14 PM
This can work, but it's sort of not the same as providing actual reasoning behind "why did you do/say X?" -- this is basically asking the model to read the conversation, try to understand from the conversation "why" something happened, and add information to prevent it from being wrong next time. That "why" something went wrong is not really the same as "why" the model output something.
by natdempk
4/19/2026 at 12:35:21 PM
> This is key. In my experience, asking an LLM why it did something is usually pointless.
That kind of strikes me as a huge problem. Working backwards from solutions (both correct and wrong) can yield pretty critical information and learning opportunities. Otherwise you're just veering into "guess and check" territory.
by Forgeties79
4/19/2026 at 8:04:12 AM
> In a subsequent round, it generally can't meaningfully introspect on its prior internal state
It has the K/V cache, no?
by AlexCoventry
4/19/2026 at 3:23:51 PM
The K/V cache is just an optimization. But yeah, you would expect the attention for the model producing "Ok I'm doing X" and you asking "Why did you do X?" to be similar, so I don't see a reason why introspection would be impossible. In fact, while trying to adapt a test skill where the agent would write a new test instead of adapting an existing one, I asked it why and it gave the reasoning it used. We then adapted the skill to specifically reject that reasoning, and it worked: the agent adapted the existing test instead.
by Sinidir
4/18/2026 at 11:32:54 PM
That's good advice. I managed to get the session back on track by doing that a few turns later. I started making it very explicit that I wanted it to really think things through. It kept asking me for permission to do things, and I had to explicitly prompt it to trace through and resolve every single edge case it ran into, but it seems to be doing better now. It's running a lot of adversarial tests right now and the results at least seem to be more thorough and acceptable. It's gonna take a while to fully review the output though.
It's just that Opus 4.6 with DISABLE_ADAPTIVE_THINKING=1 doesn't seem to require me to do this at all, or at least not as often. It'd fully explore the code and take into account all the edge cases and caveats without any explicit prompting from me. It's a really frustrating experience to watch Anthropic's flagship subscription-only model burn my tokens only to end up lazily hand-waving away hard questions unless I explicitly tell it not to do that.
I have to give it to Opus 4.7 though: it recovered much better than 4.6.
by matheusmoreira
4/19/2026 at 7:15:14 AM
> Opus 4.6 DISABLE_ADAPTIVE_THINKING=1
Strangely, this option was not working for many of us on a team plan.
by bobkb
4/19/2026 at 1:03:36 AM
Yeah, for anyone seriously using these models I highly recommend reading the Mythos system card, especially the sections on analyzing its internal non-verbalized states. Saves a lot of head-against-wall banging.
by j-bos
4/19/2026 at 4:24:18 AM
This is frankly one of the most frustrating things about LLMs: sometimes I just want to drive it into a corner. “Why the f** did you do X when I specifically told you not to?”
It never leads to anything helpful. I don’t generally find it necessary to drive humans into a corner. I’m not sure it’s because it’s explicitly not a human so I don’t feel bad for it, though I think it’s more the fact that it’s always so bland and is entirely unable to respond to a slight bit of negative sentiment (both in terms of genuinely not being able to exert more effort into getting it right when someone is frustrated with it, but also in that it is always equally nonchalant and inflexible).
by christina97
4/19/2026 at 5:08:57 AM
You might be surprised how well 5.3-codex follows your instructions. When it hits a wall with your request, it usually emits the final turn and says it can’t do it.
by manmal
4/19/2026 at 5:58:41 AM
The same is true of humans, not surprisingly.
If you ask the average human "Why?", they will generally get defensive, especially if you are asking them to justify their own motivation.
However, if you ask them to describe the thinking and actions that led to their result, they often respond very differently.
by nhod
4/19/2026 at 12:43:47 AM
Precisely. I find Grok’s multi-agent approach very useful here. I have a custom agent configured as a validator.
by nelox
4/19/2026 at 2:41:24 AM
Do you have to use Grok? I haven't found that it passed evaluations anyhow.
by sroussey
4/19/2026 at 12:37:40 PM
I find most people who use Grok do so for ideological reasons.
by Forgeties79
4/19/2026 at 8:45:33 PM
Yeah, I guess. "Lex Luthor made an AI, I need to support him so I'll use Grok!" is a thing.
by sroussey
4/19/2026 at 8:56:12 PM
Unfortunately that is kind of how some people operate when it comes to Musk. Grokopedia certainly is not used by people because it’s useful.
by Forgeties79
4/19/2026 at 2:44:03 AM
> What works much better is to tell the model to take a step back and re-evaluate.
I desperately hate that modern tooling relies on “did you perform the correct prayer to the Omnissiah”.
> to add some entropy to get it away from the local optimum
Is that what it does? I don't think that's what it does, technically.
I think that's just anthropomorphizing a system that behaves in a non-deterministic way.
A more meaningful solution is almost always “do it multiple times”.
That is a solution that makes sense sometimes because the system is probability-based, but even then, when you're hitting an opaque API which has multiple hidden caching layers, /shrug, who knows.
This is why I firmly believe prompt engineering and prompt hacking is just fluff.
It's both mostly technically meaningless (observing random variance over a sample so small you can't see actual patterns) and obsolete once models/APIs change.
Just ask Claude to rewrite your request “as a prompt for Claude Code” and use that.
I bet it won't be any worse than the prompt you write by hand.
by noodletheworld
4/19/2026 at 3:30:42 AM
It definitely overcompensates to the point of defensiveness. They have all done so for years.
"Why did you do that?" (Me, just wanting to understand)
"You're right, I should have done the opposite" (starts implementing the opposite without seeking approval, etc.)
But if you agree with it it won't do that, so it isn't simply a case of randomly rerunning prompts.
by nprateem
4/19/2026 at 2:56:03 AM
Other than AI (and possibly npm packaging), where do you feel you have to rely on prayer? Additionally, most of human history has been the story of scientific advancement away from a point where people relied on prayer, so maybe "suck it up, buttercup" is the best advice here?
by tclancy
4/19/2026 at 1:07:20 AM
> Why are you handwaving things away though? I've got you on max effort. I even patched the system prompts to reduce this.
Do you think it knows what max effort or patched system prompts are? It feels really weird to talk to an LLM like it’s a person that understands.
by what
4/19/2026 at 2:05:54 AM
I've tested system prompt patching and it's definitely capable of identifying that my changes have been applied.
As someone who's been programming alone for over a decade, I absolutely do want to enjoy my coding buddy experience. I want to trust it. I feel pretty bad when I have to treat Claude like a dumb machine. It's especially bad when it starts making mistakes due to lack of reasoning. When I start explaining obvious stuff it's because I've lost the respect I had for it and have started treating it like a moron I have to babysit instead of a fellow programmer. It's definitely capable of understanding and reasoning, it's just not doing it because of adaptive thinking or bad system prompts or whatever else.
by matheusmoreira
4/19/2026 at 2:21:32 AM
I thought that was really weird as well.
by hattmall
4/18/2026 at 10:45:57 PM
Are the benchmarks being used to measure these models biased towards completing huge and highly complex tasks, rather than ensuring correctness for less complex tasks?
It seems like they're working hard to prioritize wrapping their arms around huge contexts, as opposed to handling small tasks with precision. I prefer to limit the context and the scope of the task and focus on trying to get everything right in incremental steps.
by rectang
4/18/2026 at 10:54:54 PM
I don't think there's a bias here. I'd say my task is of somewhat high complexity. I'm using Claude to assist me in implementing exceptions in my programming language. It's a SICP chapter 5.4 level task. There are quite a few moving parts in this thing. Opus 4.6 once went around in circles for half an hour trying to trace my interpreter's evaluator. As a human, it's not an easy task for me to do either.
I think the problem just comes down to adaptive thinking allowing the model to choose how much effort it spends on things, a power which it promptly abuses to be as lazy as possible. CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1 significantly improved Opus 4.6's behavior and the quality of its results. But then what do they do when they release 4.7?
https://code.claude.com/docs/en/model-config
> Opus 4.7 always uses adaptive reasoning.
> The fixed thinking budget mode and CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING do not apply to it.
by matheusmoreira
4/19/2026 at 7:16:37 AM
‘effort high/max’ seems to be working though.
by bobkb
4/19/2026 at 5:06:58 PM
The problem I described occurred on Claude Code, Opus 4.7/1M, max effort, patched system prompts with all "don't think for simple stuff" instructions removed, as well as CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1 even though Opus 4.7 ignores it.
by matheusmoreira
4/19/2026 at 9:44:29 AM
So CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1 is not available/is ignored in 4.7?
by virtualritz
4/19/2026 at 5:12:13 PM
It is ignored by Opus 4.7.
https://code.claude.com/docs/en/model-config
> Opus 4.7 always uses adaptive reasoning. The fixed thinking budget mode and CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING do not apply to it.
by matheusmoreira
4/19/2026 at 3:26:34 AM
Adaptive thinking is optional.
by xvector
4/19/2026 at 6:11:35 AM
Not when you want extended thinking - you select extended thinking and Opus decides whether you get it, via adaptive thinking.
"With Opus 4.6, extended thinking was a toggle you managed: turn it on for hard stuff, off for quick stuff. If you left it on, every question paid the thinking tax whether it needed to or not. Now, with Opus 4.7, extended thinking becomes adaptive thinking."
https://claude.com/resources/tutorials/working-with-claude-o...
by scrollop
4/19/2026 at 1:24:12 PM
...are you talking about the app? Come on. The app is for quick queries. You should be using Claude Code or Cowork.
by xvector
4/19/2026 at 6:24:23 PM
I've gotten quite a bit of work done on claude.ai and the mobile app though. It's been good for code review. The GitHub connector is a bit clunky but it works.
by matheusmoreira
4/19/2026 at 5:04:34 PM
No, it is not.
https://code.claude.com/docs/en/model-config
> Opus 4.7 always uses adaptive reasoning. The fixed thinking budget mode and CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING do not apply to it.
by matheusmoreira
4/19/2026 at 3:43:28 PM
> For a fair comparison you need to look at the total cost, because 4.7 produces significantly fewer output tokens than 4.6
Does it? Anthropic's own announcement says that for the same "effort level" 4.7 does more thinking (i.e. uses more output tokens) than 4.6, and they've also increased the default effort level from 4.6's high to 4.7's xhigh.
I'm not sure what dominates the cost for a typical mix of agentic coding tasks - input tokens or output ones, but if you are working on an existing project rather than a brand new one, then file input has to be a significant factor and preliminary testing says that the new tokenizer is typically generating 40% or so more tokens for the exact same input.
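The back-of-envelope effect of a ~40% heavier tokenizer on input cost can be sketched directly (all numbers here are illustrative, not measured values):

```python
def input_cost(chars, tokens_per_char, price_per_mtok):
    """Input cost in dollars for a fixed body of source text."""
    return chars * tokens_per_char * price_per_mtok / 1_000_000

# Same 500k-character codebase under two tokenizers: the new one is
# assumed to emit ~40% more tokens per character, at the same price.
old = input_cost(500_000, 0.25, 15.0)
new = input_cost(500_000, 0.25 * 1.4, 15.0)
```

Since input cost is linear in token count, a 40% denser tokenization translates one-for-one into a 40% higher input bill for the same files, before any price change is even considered.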
I really have to wonder how much of 4.7's increase in benchmark scores over 4.6 is because the model is actually better trained for these cases, or just because it is using more tokens - more compute and thinking steps - to generate the output. It has to be a mix of the two.
by HarHarVeryFunny
4/19/2026 at 6:07:42 AM
That is not what Anthropic says:
"Opus 4.7 thinks more at higher effort levels, particularly on later turns in agentic settings. This improves its reliability on hard problems, but it does mean it produces more output tokens."
by scrollop
4/19/2026 at 12:27:31 PM
That's a good point. AA's Cost Efficiency section says the opposite: you can hover to see the breakdown between input, reasoning and output tokens.
I'm not sure where that discrepancy comes from (is Anthropic using different benchmarks?).
There's a few different theories but all we have now are synthetic benchmarks, anecdotes and speculation.
(Benchmarks are misleading, I think our best bet now is for individuals to run real world tests, giving the same task to each model, and compare the quality, cost and time.)
The input cost inflation however is real, and dramatic.
I would have expected them to lower input costs proportionally, because otherwise you're getting less intelligence per dollar even with the smarter model. Think that would be the smartest thing for them to do, at least PR wise. And maybe a bit of free usage as an apology :)
by andai
4/19/2026 at 11:36:56 AM
The link you are commenting on shows data from actual prompts from real users, and the COST of the average prompt increased 37%. I do not think synthetic benchmarks are a rebuttal to real usage data.
by irthomasthomas
4/19/2026 at 12:21:25 PM
The cost of the input tokens, not the reasoning or output.
Agree though that benchmarks aren't very helpful w.r.t. estimating real world performance or costs.
What we'd need are people giving the same real world tasks to 4.6 and 4.7 and measuring time, quality and costs.
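A minimal harness for that kind of head-to-head comparison might look like the sketch below. The `run_task` callable is a stub standing in for an actual agent invocation; it is assumed to return a quality score and a dollar cost for the run:

```python
import time

def compare_models(task, run_task, models):
    """Run the same task on each model and record quality, cost and
    wall-clock time; run_task(model, task) -> (quality, cost_dollars)."""
    results = {}
    for model in models:
        start = time.perf_counter()
        quality, cost = run_task(model, task)
        results[model] = {
            "quality": quality,
            "cost": cost,
            "seconds": time.perf_counter() - start,
        }
    return results
```

With identical tasks and a fixed quality rubric, this gives the per-model time/quality/cost triple the comment is asking for, rather than a synthetic benchmark score.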
by andai
4/19/2026 at 2:02:26 PM
Thanks, that wasn't clear because it mentioned conversations, but it is only measuring the input tokens. So it's just measuring the difference in the tokenizer.
by irthomasthomas
4/18/2026 at 9:51:06 PM
Some have defined "fair" as tests of the same model at different times, as the behavior and token usage of a model changes despite the version number remaining the same. So testing model numbers at different times matters, unfortunately, and that means recent tests might not be what you would want to compare to future tests.by QuantumGood