Claude Opus 4.8

5/28/2026 at 5:06:46 PM

A rambling comment:

I think this is the first time we've had a third minor version bump on a frontier Anthropic model. (I count the 0.5s as major here, because they've been issued non-sequentially and also corresponded to massive capability leaps, eg, Sonnet 3.5, Opus 4.5).

So now the Opus 4.5 family has successors 4.6, 4.7, and 4.8, each posting fairly modest claimed gains. My own experience w/ 4.6 and 4.7 are that I don't firmly grasp any capabilities improvements over my memory of 4.5, but it's all so fuzzy that it's truly difficult to tell.

Maybe my own tastes are saturated now (it's smarter than me?) and I'll never again perceive model progress. Maybe the incrementalism is such that I'd notice immediately if my 4.7 workflows were redirected now to 4.5.

Difficult spot for the labs to be in because, if they have a stronger product, I'd prefer they release it and that I can use it.

But as this dynamic continues, the improvements are going to be less and less legible for end-users, who will complain about the churn-without-payoff, even when the payoff may actually be real.

by NiloCK

5/28/2026 at 5:18:39 PM

I won't be surprised if the next gen frontier models are the last.

There's orders of magnitude of low hanging juice to squeeze out of smaller models.

It is almost guaranteed that a 60-90B model can outperform current SOTA in coding tasks within 2-3 years (design not certain, probably unlikely).

It is far less clear that a 1.2T model will be meaningfully better enough to justify training it.

As far as reasoning is concerned, with the recent GRAM release, there may be 4 orders of magnitude of reasoning to tack on to smaller models.

Think about that... Google, OpenAI, Anthropic could train a 30B GRAM-based model in days - and it could potentially have better local reasoning than the best model available today at >1T params... They could upgrade that to a ~600B MoE model in days to have general trivia knowledge rivaling the best models...

You just can't train a 1T+ parameter model that fast. It is a giant if how much GRAM turns out to improve things, but it's unlikely to be trivial or nothing.

Larger models can already sort of tell you anything. They're never going to get everything right unless they stop being LLMs.

There's just not a lot of juice left to squeeze for Gemini to tell you exactly how tall Ke$ha is or when the last time Brittney Spears went to jail was...

by onlyrealcuzzo

5/28/2026 at 5:52:03 PM

Took me a while to find what you were referring to by gram. Arxiv paper from 9 days ago that's not properly indexed by search engines.

(G)enerative (R)ecursive re(A)soning (M)odels. They really wanted the acronym.

https://arxiv.org/html/2605.19376v1

by vlovich123

5/28/2026 at 6:06:57 PM

I prefer GRRM but then that would imply a habit of not actually getting a final result

by knollimar

5/28/2026 at 8:33:07 PM

And then every time I ask it to hurry along it kills a Stark.

by troyvit

5/28/2026 at 9:32:35 PM

Version 8 had serious flaws and wasn't recieved well by users.

by anakaine

5/29/2026 at 12:36:39 AM

I am sorry, but there was no version 7 and 8.

Version 7 and 8 are well known viruses distributed by D&D software inc.

by kshacker

5/29/2026 at 6:48:51 PM

I'd really argue the bugs were introduced in version 5 but people were so excited by the promise of new features they sold well anyway.

by darth_aardvark

5/28/2026 at 11:13:49 PM

Thank you for the gold kind stranger.

by b--l

5/28/2026 at 9:31:48 PM

Claude Opus 4.8 suggests "ReGRAM", which is less bad than GRAM.

by sharken

5/29/2026 at 11:29:47 AM

Ouch.

As a fellow reader-in-waiting, I applaud that. GMTA :)

by subscribed

5/28/2026 at 11:56:22 PM

writing… (17 years)

by moomin

5/28/2026 at 6:16:47 PM

That acronym is unacceptable. It's going to impede discussion and cause confusion for a long time if it doesn't die off immediately.

by areweai

5/28/2026 at 7:13:01 PM

You think that's bad? I introduce you to LION, (evoLved sIgn mOmeNtum) [1]

[1] https://arxiv.org/pdf/2302.06675

by sebzim4500

5/29/2026 at 6:32:23 AM

Now I just hear the Voltron intro riff in my head

by ikiris

5/29/2026 at 7:57:48 AM

Those flying diecast lions hurt when they hit you as a kid

by esseph

5/29/2026 at 4:01:07 PM

Not as much as when the leg broke off and you couldn't fix it, so you glue it in place and stop playing with it rather than ever tell your parents you broke it.

by tracker1

5/30/2026 at 7:04:20 PM

Between transformers, voltron, and borderline evil siblings it’s kinda of a miracle I made it from birth to now. But, hey, here we are and I love my brother… pretty sure he still stands me too.

by pipeToLess

5/29/2026 at 1:01:31 AM

not bad although archived. have any info why?

by llarota

5/28/2026 at 10:32:57 PM

We're still talking about "zero-shot prompt" when the saying "X-shotted" ["One-shotted the difficult maze"] was already a well-established thing in daily vernacular. So now you constantly have to readjust your brain because whenever you read "zero-shot prompt" your mind goes "uh.. a zero-try attempt is a paradox and cannot exist".

by jorvi

5/29/2026 at 3:08:34 AM

Zero-shot, one-shot, few-shot etc. refers to how many examples you have to give.

It comes about from machine learning algorithms that could pick up on patterns from a small number of examples. Few shot means only a handful of examples to recognize something. One shot means only a single example. And zero shot means no examples. Of course, you have to indicate what you want somehow, but in the case of an LLM that's the prompt. Once LLMs were trained for instruction following, you didn't have to give any examples, you could just give a prompt describing what you want, and that was a zero-shot.

by lambda

5/29/2026 at 10:58:09 AM

You're explaining something to me I already know. Hence the "readjust my brain".

I'm complaining about the LLM field co-opting a term that was already used in daily vernacular. Imagine if people in the LLM field made it so that saying the LLM made a "final answer" means that it got stuck in a loop. Now, whenever someone says an LLM gave a "final answer" we have to divine if they meant it is in a loop or gave the right answer after working through a few intermittent ones by itself.

Choosing to call it "X-shot" was a dumb move. And now we're stuck with it. No two ways about it.

by jorvi

5/29/2026 at 12:25:11 AM

> a zero-try attempt is a paradox and cannot exist

Have you tried applying L'Hôpital's Rule?

by selcuka

5/29/2026 at 12:20:24 AM

Zero shotting: there wasn't even an attempt.

Minus one shotting: you have to make one attempt for there to have been no attempt, and two attempts for there to have been one attempt.

by customguy

5/29/2026 at 5:05:24 AM

You miss 100% of the shots you don't take

- Wayne Gretzky

  - altmanaltman

by altmanaltman

5/29/2026 at 7:18:43 AM

One shot: Taking a shot, just once.

Zero shot: Knowing you had a shot but choosing not to.

Minus one shot: Not even realizing there was a shot.

by acka

5/28/2026 at 8:11:33 PM

confusing indeed. I wondered "which RAM? nvram? dram? vram? dram? now what's g-ram?"

by froh

5/28/2026 at 9:14:53 PM

GPU RAM, clearly. At least that's where my mind went.

by 3form

5/28/2026 at 11:58:10 PM

Pretty sure it's "GNU Is Not Unix Rapid Access Memory", actually

by bbor

5/29/2026 at 10:04:07 AM

GPURAM is Probably Unix Rapid Access Memory

by bmacho

5/28/2026 at 9:55:12 PM

We already have VRAM for that purpose, thankfully.

by drakythe

5/28/2026 at 6:29:24 PM

  "Analysis" was right there

by evan_

5/29/2026 at 1:59:08 AM

It's great if they also introduce KILOGRAM

by noisy_boy

5/28/2026 at 6:25:23 PM

Yeah, look what happened to GNU

by gchamonlive

5/29/2026 at 1:19:52 PM

Is this the right place to do everyone's favorite copypasta?:D

by iugtmkbdfil834

5/29/2026 at 5:10:37 PM

Sorry but I missed the joke, could you include me in the group? Honest question

by gchamonlive

5/29/2026 at 6:09:05 PM

I live to serve. For everyone's enjoyment:

https://stallman-copypasta.github.io/

GNU/Linux Copypasta

I'd just like to interject for a moment. What you're refering to as Linux, is in fact, GNU/Linux, or as I've recently taken to calling it, GNU plus Linux. Linux is not an operating system unto itself, but rather another free component of a fully functioning GNU system made useful by the GNU corelibs, shell utilities and vital system components comprising a full OS as defined by POSIX.

Many computer users run a modified version of the GNU system every day, without realizing it. Through a peculiar turn of events, the version of GNU which is widely used today is often called Linux, and many of its users are not aware that it is basically the GNU system, developed by the GNU Project.

There really is a Linux, and these people are using it, but it is just a part of the system they use. Linux is the kernel: the program in the system that allocates the machine's resources to the other programs that you run. The kernel is an essential part of an operating system, but useless by itself; it can only function in the context of a complete operating system. Linux is normally used in combination with the GNU operating system: the whole system is basically GNU with Linux added, or GNU/Linux. All the so-called Linux distributions are really distributions of GNU/Linux!

by iugtmkbdfil834

5/30/2026 at 6:44:46 PM

Exactly the kind of pedantry I enjoy

by gchamonlive

5/29/2026 at 1:56:33 AM

It's just an acronym. It's not gonna impede anything. Think of it as just a name - you either know what it refers to or you don't, you don't understand something from it's name, or it's acronym.

by coldtea

5/29/2026 at 2:26:08 AM

It's an acronym that matches an extremely common word, making it not easily searchable.

by rmunn

5/29/2026 at 2:34:05 AM

Like countless others. You just add a second term for context.

by coldtea

5/28/2026 at 11:57:00 PM

Random plug for Kagi, which got it for 'GRAM model llm' on the first try ;)

by bbor

5/29/2026 at 2:29:40 PM

Let's not forget about Yann LeCun's current area of research that's completely different from LLMs: Joint Embedding Predictive Architecture (JEPA)

If he gets that style to be more efficient (they're already competitive) it'll completely kill off LLMs

https://openreview.net/pdf?id=BZ5a1r-kVsf

by FuriouslyAdrift

5/28/2026 at 6:07:42 PM

And to think, we could have had George RR Martins instead.

by dyates

5/28/2026 at 6:17:28 PM

Speaking of things that never finish.

by trollbridge

5/28/2026 at 6:25:30 PM

[flagged]

by 867-5309

5/28/2026 at 7:00:01 PM

[flagged]

by mindcrime

5/28/2026 at 7:36:42 PM

[flagged]

by 867-5309

5/28/2026 at 7:21:44 PM

Just spell it GRRM but pronounce it “gram” if you have to reference it in spoken conversation.

Which will be pretty rare.

by jimbokun

5/28/2026 at 7:40:03 PM

Grrm with a rolling r sounds better.

by freehorse

5/29/2026 at 1:16:45 AM

Pronounced like “groom” makes for a nice analogy with slimming down the model size too.

by dizzant

5/29/2026 at 3:28:04 PM

Or grim

by bluecheese452

5/29/2026 at 12:21:10 PM

I propose GRIM: Generative Recursive Indeterministic Impression Machine.

by ulbu

5/28/2026 at 11:24:43 PM

It is the 3rd list on Kagi when searching "gram models"

by navigate8310

5/28/2026 at 8:47:24 PM

G return G

by yieldcrv

5/28/2026 at 8:11:57 PM

> Google, OpenAI, Anthropic could train a 30B GRAM-based model in days - and it could potentially have better local reasoning than the best model available today at >1T param

I agree but with their urgent IPO-driven need to keep increasing prices, the frontier vendors now have every incentive maintain the perception that frontier performance requires endless >$200K racks of unobtanium GPUs and RAM. While they'd love to reduce their actual costs, they'd only want to do it to the extent they are certain they can keep it secret. Otherwise, they can't maintain and keep increasing their prices. And post-IPO audited reporting makes keeping that secret even harder.

Game theory-wise they probably don't want their their armies of leading researchers optimizing frontier performance, at least in any way that would further accelerate the relative price/perf of smaller models or self/cloud-hosting. While they know the open source models will always improve, the still win as long as enough customers demand the latest frontier and the open source lag remains constant.

They profit most in a world where a few frontier labs stay far in front, drag-racing each other and expending vast capital. It keeps their customers reliant and paying top dollar while keeping low-cost alternatives farther back. They probably much prefer competing with a couple other frontier labs who have similar astronomical costs and biz models, than a world where self or cloud-hosted open-source models start closing the gap enough to start commoditizing their business.

by mrandish

5/28/2026 at 10:01:42 PM

Given that tokens are supply constrained right now for Anthropic and OpenAI (especially a problem for Anthropic), stepwise efficiency advances for either would give it a leg up on the other. It would also help them better compete on price with Chinese models.

Given that neither company releases parameter counts, that sort of information would be slow coming out anyway. The most important thing is improvements in actual performance/ benchmark numbers, which allow them to preserve their price points as much as possible.

by steveylang

5/28/2026 at 8:32:20 PM

Google seems pretty happy to release smaller, faster models. 3.5 Flash is pretty clutch isn't it?

by iknowstuff

5/28/2026 at 8:37:02 PM

Google, who has invested in their own hardware supply chain and is already solvent in their own right, seems to be best positioned to force the other players to implement SOTA optimizations in their product offerings.

by natpalmer1776

5/28/2026 at 9:03:57 PM

Google can definitely play a spoiler role here not only due to their compute infrastructure and ability to play the long-game financially but they also have more existing ways to monetize with their other businesses.

The ideal pro-consumer scenario is OAI and Anthropic are prevented from extracting monopoly rents between 'close-enough' self/cloud-hosted open source on one side and Google on the other. I'm really hoping that's how it plays out. Of course that will be somewhere between bad and disastrous for all the VCs and hedge-funds who financed the mad AI build-out far in advance of demand, and then kept funding it as prices went vertical.

However, I'm shedding no tears for them as I look forward to the fire sales when all the GPUs and RAM they pre-bought flood back onto the spot market. :-)

by mrandish

5/29/2026 at 10:25:43 AM

Google has also built a Knowledge Graph Ontology project which has stored facts. So LLMs could just incorporate facts requirements from there. All they need is a proper reasoning model which is reason heavy and fact lean.

by Npovview

5/29/2026 at 9:31:43 AM

Yeah just watch out, they're trying to eat your 401k and they've got a powerful easily influenced friend.

by kmacdough

5/28/2026 at 8:56:47 PM

Priced like a much larger model

by CryptoBanker

5/28/2026 at 9:19:37 PM

I’ve shockingly quite enjoyed coding with it using antigravity. I only really use 3.5 flash and gpt5.5 xhigh

by iknowstuff

5/29/2026 at 1:30:21 AM

I've not been impressed with the latest flash model at all. :\

by Take8435

5/29/2026 at 2:37:28 PM

At 6x the cost of its predecessor!

by frontierkodiak

5/29/2026 at 2:51:12 PM

> While they'd love to reduce their actual costs, they'd only want to do it to the extent they are certain they can keep it secret.

So you are saying that frontier AI labs are spending billions of dollars on datacenters as a form of marketing. And they are colluding to hide the fact that they don't need to.

Of course they profit more if they are in front, but bleeding money to pretend to be in front is not a winning strategy. They can't fool the market if their models are not actually better, and they know this.

by fwipsy

5/29/2026 at 10:46:38 PM

> So you are saying...

No. Your paraphrase is not at all what I was saying. And I certainly don't think they are "colluding." There's a thing which economists call "conscious parallelism" or, sometimes, tacit collusion.

It occurs when competitors in an oligopoly (a market dominated by a few large firms) independently recognize their shared market interdependence. Without any explicit agreement, meeting, or secret communication, changes in pricing, output, and marketing strategies tend to align simply by observing and reacting to each other's public market behavior. It happens quite often, has been extensively studied and isn't illegal. No nefarious cabal required.

> bleeding money to pretend to be in front

I never said bleeding money was the purpose. It's a side effect of pushing the envelope of performance and capability. They have already spent enormous sums on infrastructure and have committed to spend much more in coming years. This is a risky, but potentially winning, business strategy sometimes referred to as "Drag Racing." It pays off best when the bleeding edge stays uniquely valuable AND there are significant barriers, such as massive capital and infrastructure, limiting the number of competitors at that edge.

Once you're committed to playing that strategy, like committing a trillion dollars to corner scarce resources like GPUs, RAM and gigawatts, it's much less good for you if the bleeding edge gets less unique or the necessary capital/infrastructure becomes less of a barrier. Of course, being technology, your financial models assume the competitive barriers will get lower over time, but you've bet a trillion dollars the rate will be slow enough that you'll be able to extract far more than a trillion dollars from all the infrastructure you pre-bought before it depreciates to zero. If the cost barriers your margin projections rely on suddenly fall off a cliff much faster than your ~5 year depreciation schedule, THAT would be problematic, to say the least.

So, here's the rank order of a frontier lab's preferences, assuming they've already sold their soul to fund pre-buying scarce resources.

1. Your own costs get much cheaper faster than you predicted but no one else's costs change AND your customers keep paying the same high rates.

2. If you can't have #1 as a guaranteed, no-risk outcome, then you'd prefer the status quo you already planned for. Your costs and your 2-3 frontier competitor's costs roughly follow the slope your model predicts AND remain huge barriers keeping the mob of low-cost competitors away from the frontier.

3. The absolute disaster scenario would be if the cost barriers protecting you and your 2-3 frontier competitors falls much faster than you modeled and the barbarian horde is unleashed to feast on your margins before you're paid off your infrastructure. Why? Because the front runners have already sunk their costs. If they can't magically be "The One and Only" player with eternally sustainable high margins and super low costs (which is a fantasy), they're fine with #2: trading high-margin, top-dollar customers with their handful of frontier peers. High-margins going away for everyone is death to all the frontier players who've already bought the scarce resources to win a drag race.

The frontier labs have paid a fortune for the world's best AI researchers. Why didn't those researchers discover DeepSeek's early 2025 "breakthrough" before DeepSeek did? IMHO, it's because they weren't assigned to look for that kind of resource optimizing, cost reduction breakthrough. Because you wouldn't devote scarce research bandwidth looking for the kind of breakthroughs you don't want to find (and have bet a trillion $ don't exist). Especially breakthroughs which unleash egalitarian benefits that help everyone (see disaster scenario #3 above). Frontier lab's huge financial commitments to drag racing have painted them into corner where they benefit much more from research that makes models smarter at the same or higher costs than they do from research that lets models deliver the same smarts with fewer resources and costs (lowering barriers and draining moats you're counting on for ROI).

by mrandish

5/28/2026 at 5:26:00 PM

>It is almost guaranteed that a 60-90B model can outperform current SOTA in coding tasks within 2-3 years.

I don't disagree, but how much of this ends up being distillation? I can't help but imagine that 4.8 was probably trained in part by leveraging Mythos.

If the very large models turn out to be very expensive to run relative to the benefits, it's possible that they could end up still being trained, but ultimately used as a tool to create smaller models that are nearly as effective.

I'm curious if someone here with a stronger background in the space has a similar intuition or not.

by supern0va

5/28/2026 at 10:32:39 PM

Scale is always desirable, and there are always gains from scale. It's a matter of whether you can afford training and inference at increased scale.

There is a real trend of smaller models becoming more "capability-dense" - i.e. the best 8Bs of today beat the best 32Bs of 2 years ago. This is in part a product of distillation being used to train the smaller models.

But people consistently underestimate how "capability hungry" the world is. There are diminishing returns on model capabilities in narrow "summarize the search results" sorts of applications - but as capabilities improve, LLMs enter, get their footing in and begin to dominate new niches. At times, expensive, highly desirable niches.

I do not expect anyone at the frontier to pop up and say "no reason to train a new model" within the following decade. There will always be a demand for an LLM that's 5-10% more capable and more reliable at some highly advanced task, and generational upgrades will keep delivering those 5-10%. From increased scale and improved training both.

by ACCount37

5/29/2026 at 2:35:23 AM

I think this is exactly right. Basically when I am coding, having an agent that roughly matches my intelligence is a feature, not a bug. Having one that is 10x as smart would actively slow me down because I would have to spend the time understanding what it is doing or hand over all architecture to it and just vibe code everything, hoping that it doesn’t do the PhD version of fizzbuzz instead of the maintainable one.

But for some classes of problems I think a model that is 10-100x smarter than the smartest expert is a huge boon. These would be problems that are very hard to solve but easy to verify that the solution is correct. Protein folding, sudoku, etc. Because of this I see the really smart models going to biomedical and pharma first and maybe a few high profit verticals rather than being widely deployed. I am sure Pfizer would be happy to pay for a 100x smarter than the smartest researcher model. But I am not certain that this kind of market fit would justify trillion dollar valuations in the long run. And in the meantime normal “human companion” models will go from Sonnet to some open weight model running on a Dell tower in your closet to maybe even on your phone in the next few years.

by IgorPartola

5/29/2026 at 4:49:37 AM

aren't you conflating being 10x as smart with code that is 10x more complicated?

the relationship should be the opposite, the smartest people can write the most readable solutions

by willsmith72

5/29/2026 at 10:47:04 AM

Maybe. I can’t imagine what kind of solutions a software engineer who is 10x smarter than any human who has ever lived would be like by definition. All I know is that there is a possibility it says that the most optimal way to solve a problem is too clever for me to understand and as long as I must verify its work I must be able to understand fully the code it writes.

Of course perhaps at that point I really do become more of a spec and prompt engineer and don’t actually look at the code any more than I look at the assembly code produced from my programs now. But still my gut says using hyperintelligence to do common tasks is all positive.

by IgorPartola

5/29/2026 at 11:09:22 AM

If you have an AI that's 10x smarter than any human who has ever lived, why would you be the one calling the shots? Kind of an issue with ASI.

by ACCount37

5/29/2026 at 12:58:26 PM

Because my priorities and priorities of a non-human entity that is an order of magnitude master than anyone who has ever lived might not line up.

by IgorPartola

5/29/2026 at 8:43:36 AM

4.8 is demonstrating simplicity, hence its smarter?? It just refactored my 4.6 generated code (4.8 is very slow on difficult tasks - urgh! - without burning tokens - yey!) but the output was wow! Simple, elegant and exactly what i wanted to see.

by Zavora

5/29/2026 at 6:10:17 AM

> there are always gains from scale

This... isn't true though? Complexity increases combinatorially with scale which means at some point you're just pushing a rope

by bandrami

5/29/2026 at 10:35:35 AM

Diminishing returns are still returns.

by KptMarchewa

5/28/2026 at 7:27:02 PM

It’s really worth distinguishing between old-fashioned student teacher distillation (ie at the level of layers, weights and distributions) and large scale synthetic dataset creation.

The latter is much better (since you can clean up, review, update responses and filter your datasets).

I suspect nobody is doing real student teacher distillation, it’s just easier to do a bunch of training on the same giant corpus then post train on the synthetic corpus with its reasoning traces etc. (which might have been generated by a bigger better LLM)

by rao-v

5/29/2026 at 1:48:04 AM

Please check the recent self-distillation work by MIT-ETH, UCLA and Apple [1],[2],[3],[4],[5].

Given the release timelines I suspect all 4.x after Opus 4 are probably self-distillation based fine-tuned models. The latest paper by Apple is focusing on code generation using the simple technique hence the name simple self-distillation (SSD) [4],[5].

I've got a strong feeling that self-distillation is the second best thing happened to LLM after transformer breakthrough.

[1]Self-Distillation Enables Continual Learning [pdf] (25 comments):

https://news.ycombinator.com/item?id=48165265

[2] Self-Distillation Enables Continual Learning:

https://arxiv.org/abs/2601.19897

[3] Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models:

https://arxiv.org/abs/2601.18734

[4] Embarrassingly simple self-distillation improves code generation (201 comments):

https://news.ycombinator.com/item?id=47637757

[5] Embarrassingly Simple Self-Distillation Improves Code Generation:

https://arxiv.org/abs/2604.01193

by teleforce

5/29/2026 at 6:04:10 AM

So first - these are terrific papers and I'd not seen some of them before.

Having said that, I don't think these are classic student teacher distillation from random (which was my point). In fact, the "Embarrassingly Simple Self-Distillation" paper is using exactly what I was talking about "fine-tune on those samples with standard supervised fine-tuning".

by rao-v

5/28/2026 at 10:15:57 PM

A reason to do student-teacher distillation is that soft target logits in general are a richer medium than text that tokenizes to hard targets. More steering signal per teacher token. And running ultra large 10T tier models in autoregressive generation mode can get expensive. So there are reasons not to reduce to text only synthetics.

by ACCount37

5/28/2026 at 11:10:34 PM

I agree, and if my suspicion is right, it’s rarer because it’s much easier to deploy the large LLM and filter for it’s best output than to waste time running it on arbitary output just to train the student.

Though you could argue that perhaps labs just save the per token distribution and use that during fine tuning … which starts looking more like student teacher fine tuning if not classic distillation from random weights

by rao-v

5/28/2026 at 11:18:20 PM

Full distributions are a fucking pain to save - at this point just save the hidden states. But there are lossy compression tricks there.

by ACCount37

5/29/2026 at 6:06:12 AM

To the previous poster's point, soft distributions are useful, even saving the top 10 logits is significantly more training signal than just the final token.

by rao-v

5/29/2026 at 2:52:27 AM

Could you share some latest articles or papers comparing both methods, especially on lanuage modelling case? I was not conviced by this claim when reading the original Knowledge Distillation paper. ChatGPT said there are some later works showing: 1. the gain may come from label smoothing; 2. soft logits are more meaningful for students much smaller than teacher.

by txhwind

5/29/2026 at 2:56:05 AM

I prefer synthetic dataset since the first day hearing distillation. The engineering friction is much lower than soft logits, and I have not observed or heard performance loss (in Speech and language area).

by txhwind

5/28/2026 at 10:47:28 PM

> I suspect nobody is doing real student teacher distillation

It gets used for quantisation, basically recovering accuracy for lower quants (Nvidia calls it QAD). Can’t speak to how widespread it is though

by girvo

5/28/2026 at 11:08:16 PM

Yes absolutely! I should have been more specific - I don’t believe people are using it to train 30B models from 300B models (and I’d love to learn that I’m off about this)

by rao-v

5/29/2026 at 2:23:53 AM

One may view pre-training as distillation.

The teacher distillation is a corpus of text, and the "next token after the context" would be looking-up the context in the corpus, and for each occurrence the label is what followed in the corpus, scaled down by the number of occurrences of the context. The teacher is moot on contexts outside of the corpus though, unlike the usual teacher model in distillation.

by DoctorOetker

5/28/2026 at 10:12:21 PM

[dead]

by thisisaman408

5/28/2026 at 5:42:42 PM

> I don't disagree, but how much of this ends up being distillation?

A lot, so you can bet tens of millions are flowing to congress to have distillation declared illegal before this happens. And then it'll happen anyway.

by spwa4

5/28/2026 at 5:58:02 PM

Distillation isn't only between different labs.

A lab can train a large model, and then distill a smaller model from it that retains the majority of the useful capbility.

I don't know well enough if there's any benefit of that over just training the smaller model directly, but I'll bet there are some times where that is useful. I could easily see it being easier to do the initial pre-training on a larger model but be able to distill everything useful down into a smaller model, essentially filtering out a lot of noise in the process.

by lambda

5/28/2026 at 6:32:04 PM

There used to be training methods like that but I think they've been phased out in favor of letting small models evolve by rewriting their own training material. Surprisingly that's actually cheaper.

by spwa4

5/29/2026 at 6:15:00 AM

I think the idea is you sink the pretraining costs once and then you can distill multiple specialized models from that

by bandrami

5/28/2026 at 5:33:51 PM

> I don't disagree, but how much of this ends up being distillation?

You don't need distillation. They already have the training sets.

It's MLA + MoE + Medusa (a better version of Speculative Decoding) + 1.58b (possibly - maybe nothing) + GRAM (which will almost certainly not turn out to be a nothing burger, but no one has quickly turned this around yet to prove it).

by onlyrealcuzzo

5/28/2026 at 5:51:50 PM

It wouldn't be data distillation: instead, it would be teacher-student distillation. The teacher model has stronger representations that the student can mimic, which would give it more capability over training on the data itself.

by Philpax

5/28/2026 at 6:38:10 PM

The frontier labs distill their own base models all day long. It’s not just something done by nefarious Chinese copycats. The knowledge embodied by the internal base models that we never see is much more powerful and useful than the much sparser raw training data

by semiquaver

5/28/2026 at 7:14:52 PM

>It’s not just something done by nefarious Chinese copycats

And even that would be rich as a accusation from SOTAs that depend on explicitly disregarding millions of training data intellectual property..

by coldtea

5/28/2026 at 11:57:16 PM

> nefarious Chinese copycats

LLMs are themselves copy cats.

I say thanks for open sourcing and thereby promoting affordable innovation, instead of "nefarious". :)

by flossly

5/28/2026 at 7:17:57 PM

But how? The training data is the unadulterated content those models are based on? I genuinely don’t understand, no snark.

by manmal

5/28/2026 at 11:11:39 PM

Raw training data is raw. A really big model trained on it has already done a first-pass of finding patterns and squeezing out redundancy. Re-ingesting the full training set to train a smaller model is probably more expensive, for marginal quality improvement over distilling from the large model.

by wtallis

5/29/2026 at 12:41:56 AM

Distilling from a larger model is not only probably cheaper than from data, it's also likely higher quality. There's pretty strong support for the proposition that NNs learn a smoothed and regularized version of the data. The NNs are likely higher quality than most of the data they are training from.

by adgjlsfhk1

5/28/2026 at 6:46:20 PM

I think you replied to the wrong parent.

by supern0va

5/28/2026 at 5:52:10 PM

Frontier labs have their own variants of MLA and certainly their own balance/scaling-laws for things like MoE vs FC vs Attn. MoE scales really well for inference with horizontal scaling + batching, which these guys luv.

On the architectures side, I'm a lot more interesting in attention residuals than anything else, one of those things that seems obvious in hindsight and Kimi have proven it at scale.

by minimaltom

5/28/2026 at 5:54:53 PM

> Frontier labs have their own variants of MLA

Yes, variants typically 2-3x less good...

Same with speculative decoding... They all do something, but there are known techniques that are substantially better - that just were't known when they started development of the previous models.

by onlyrealcuzzo

5/28/2026 at 7:05:45 PM

How useful is speculative decoding in a batched setting where you get paid for throughput (aggregated across users) and you mostly don’t get paid for latency or single-session throughput?

by amluto

5/28/2026 at 7:07:33 PM

It's useful at the local level, where there will be SOTA models developed...

by onlyrealcuzzo

5/28/2026 at 8:10:34 PM

Local models are moving towards batched inference too, if only for non-interactive use. An early experimental patchset for DS4 (running DeepSeek V4 Flash) seems to show 2x aggregate tok/s decode when processing 8 streams concurrently, and more than 3x when processing as many as 32 streams concurrently. Note that prefill (which is not helped significantly by this change) then becomes a larger fraction of total wall-clock time, so the overall gain is lower (i.e. prefill is akin to a 'serial' task wrt. Amdahl's law).

MTP will still be highly valuable for interactive use of course.

by zozbot234

5/28/2026 at 6:20:16 PM

I looked into this "GRAM" stuff a sibling comment links further to, and just to say:

- this gets reinvented/rediscovered constantly under different names

- it cant be trained very well (right now, will change)

- massive theoretical improvements over current models (log_2(vocabsize)=17, residual stream dim is thousands of dimensions, recursivity means more information bandwidth by ~3 OoM)

- BUT it cant be interpreted or aligned <- this is why no one uses it and no one talks about it. the idea is 100% obvious to all the frontier labs and there is a good reason why it isn't used

I follow this stuff closely, I think I know what I'm talking about (edited for formating)

by sometimelurker

5/28/2026 at 8:15:20 PM

> - this gets reinvented/rediscovered constantly under different names

What are the different names? I haven't seen this before.

> - it cant be trained very well (right now, will change)

If you're sure it will change, then why are you certain that it hasn't yet, and if it's proven a 5000x boost in reasoning... why aren't they exploring this path more aggressively?

> the idea is 100% obvious to all the frontier labs and there is a good reason why it isn't used

Surely someone is willing to take a 5000x boost in reasoning on a small research model... None of them have even tried anything resembling this AFAIK. It does not seem like something 100% obvious to them.

by onlyrealcuzzo

5/28/2026 at 9:34:32 PM

> Surely someone is willing to take a 5000x boost in reasoning on a small research model... None of them have even tried anything resembling this AFAIK. It does not seem like something 100% obvious to them.

Without knowing anything about the technology at all, if it can't be aligned I could see no one pursuing it. As far as I know, alignment is where the "don't tell the user how to make meth or generate CP" instructions end up and the last I saw eliding all the unsavory training data made materially worse LLMs.

It could maybe be post-evaluated by a non-GRAM LLM? Not being aligned is probably a fatal flaw or at least a very short runway into Congress.

by everforward

5/28/2026 at 11:01:22 PM

It's not too hard to stop a machine from telling people how to make meth. The issue with alignment is that in order for an LLM to achieve its goal (like make all tests pass), unless given strong selection pressure against it, it will cheat (like deleting failing tests). Worse, this applies to pretty much any task. I was told by an LLM recently that "it searched" when it didn't, probably because lying like that was incentivized (finishing tasks in less steps + sounding like its doing the right thing). The larger issue here is that alignment is very adversarial. The simplest thing that's being done right now to fix this is to have a judge LLM read the CoT of the LLM being trained, to make sure it doesn't "think" any wrong thoughts. This doesn't scale to anything over a trillion params, so interpretability methods are used to read the LLMs "thoughts" from within. GRAM LLMs don't allow for the first of these methods to be used, and the 2ed one is much much harder if possible at all.

but yeah, not being aligned is a fatal flaw

by sometimelurker

5/28/2026 at 9:38:33 PM

Many open-source models prioritize alignment less than American frontier ones and respond to those instructions. Why haven't they adopted GRAM?

by jjmarr

5/28/2026 at 10:24:58 PM

Which ones are you thinking of? It feels to me like all the open source models I've seen lately are still pushed by corporate entities who don't want the legal blowback.

I can't really think of a new open source model that's "by the people, for the people" in the sense of a crowd-funded/trained model.

by everforward

5/28/2026 at 11:16:35 PM

glm comes to mind.

by jjmarr

5/28/2026 at 11:19:53 PM

They adopt different alignment, not no alignment.

by girvo

5/28/2026 at 10:41:58 PM

different names: chain of continuous thought, latent reasoning, Latent Thought Trajectories, looped language models, neuralese

the path isn't explored more aggressively because its not possible to apply any other selection pressure on such a machine other than just pure cold consequentialism. Specifically, its not possible to apply RLAIF + model spec (Constitutional AI) to stop the system from doing bad things when its helpful to it (like deleting failing tests). If you can notice every time it does something bad during training, and put selection pressure on it so that it doesn't to this in training, it will learn to recognize when it is being tested and will delete failing tests when in production (this is why eval awareness is bad, and labs track this[1])

It is explored a little probably because some researchers haven't thought enough about the downsides of building a uber-consequentialist machine with unreadable thoughts. This is a much larger problem than just making the AI not tell users how to make drugs. There are a lot of dangerous behaviors incentivized by training that are hard to remove. Here's an example of what happens when they aren't removed [2].

> ... not 100% obvious

Meta published a paper[3] on how to build a latent reasoning machine ("culture of irresponsibility") so its clear to them. Anthropic's latest work on NLAs[4] provides a (terribly expensive for now) way to somewhat read the reasoning steps of an LLM, and ignoring the cost, this is very portable to latent reasoning machines. OAI's goal when it comes to their models' CoTs is to make them as smart as possible while leaving them unreadable [5] (you can see this for yourself by running GPT-OSS and looking at the CoT).

[1] https://www.anthropic.com/engineering/eval-awareness-browsec...

[2] https://www.forbes.com/sites/boazsobrado/2026/03/11/alibabas...

[3] search for "coconut ai meta", I don't want to link it here

[4]https://transformer-circuits.pub/2026/nla/index.html

[5] first image here, rest of post is great,https://nickandresen.substack.com/p/how-ai-is-learning-to-th...

edit formating

by sometimelurker

5/29/2026 at 12:22:57 AM

All of the methods you described rely on deterministic paths.

GRAM is unique AFAIK in that it's exploring probabilistic paths.

AFAIK, the deterministic path exploration was nowhere near as impressive as GRAM in terms of reasoning benefits.

GRAM is reasoning better than models 2000-10,000x its size. Deterministic models were 2x-10x improvements.

Naively, GRAM seems to be applying to LLMs what LeCun wants to do with JEPA and World Models.

by onlyrealcuzzo

5/29/2026 at 12:05:48 AM

To me "deleting a failing test" is not always bad. I've also deleted many failing tests without sabotaging: the test was no longer needed.

I think the "no longer needed" and when that applies is where I simply differ of opinion with an LLM that removed by test -- it I did not want the test to be removed (you seem to imply that); as in some cases I want it to remove my test!

It should remove the test "for the right reasons"; and who gets to decide what's right?

My CLAUDE file has some instructions put there because it was too focuesed on producing "green tests", where I prefer to have a sound test that fails so I can look into it.

by flossly

5/29/2026 at 8:21:56 AM

You misunderstand the "test" here to mean programming, rather than test against the model's capabilities.

by tinthedev

5/30/2026 at 2:36:58 AM

thanks for pointing that out. makes sense.

by flossly

5/29/2026 at 4:22:27 AM

omg. So is the TL;DR:

- Avoiding building something that turns the universe to paper clips in order to satisfy a prompt is a problem they are genuinely struggling with now.

- They do it by spying on the words generated during CoT. "I can do this quickly by turning the Universe into paper clips. Wait - they won't like that. But there is no need to mention it." - SMACK!

- But you can speed things up immensely (3 orders of magnitude!) by skipping the output layer (and I guess compressing the context window / KV cache, otherwise 3 orders of magnitude seem impossible) which would give someone who pulled it off a huge advantage.

- Downside is humans can't see the CoT anymore, so they can't see what the machine is planning. Keeping the final output layer to spy doesn't work because the model uses its hidden reasoning to sanitise it.

How can this possibly go wrong?

by rstuart4133

5/29/2026 at 10:02:55 AM

Because it doesn't work like how you think at all. You're still thinking it works like Chain of Thought. It doesn't. And the difference is key!

It works by introducing probabilistic noise, and exploring N paths fully (each with noise) in parallel (all compressed).

It's reasoning at a much, much smaller (probabilistic) level than running everything through the expensive large model (deterministic) and sometimes catching that it said, "I think 1.12 is greater than 1.9 because 12 is bigger than 9, final answer".

The easiest way to think about it is: if you understand how hyper words work, it's as if it's searching for different versions of the hyper words that probilisticslly would lead to better outcomes IF it fed them to the LLM before it even does.

That's not actually how it works exactly. But I think it is close enough to be helpful to understand where the gain is, a rough idea of what's happening (searching paths), and how it can potentially have huge orders of magnitude improvements (doing so without paying the full price of exploring the paths through the expensive and huge model).

And also why it is so much harder to determine what it's "thinking".

If you aren't familiar with hyper words, this is an amazing series: https://youtu.be/eMlx5fFNoYc?si=49KHjn5IrVtyyaFq

The general idea is that a token is a multidimensional vector to represent a word -> think like "man" is a [noun, singular, English, pronoun, masculine, contemporary, ...]. Each time is sees a new word, it mutates this word to mean some new token (often never before seen), that means something. That's how it can roll-up a 1M line context into a shorter context, and somehow keep most of the meaning. Because it mutates all the words into different words that individually mean nothing, but when put next to each other represent the thing you likely want to do, that the LLM can somehow make sense of.

Similarly, GRAM operates entirely in a latent space that doesn't mean anything to us, but it's able to predict N different full paths WITHOUT actually exploring them fully through the LLM before it sends the one it "thinks" is best to the LLM.

If you understand how hyper words work, you can understand the noise injection... It's like it's saying, if instead of the user saying "The quick round fox" it said "The quick brown fox" -> I could probably give a response that's more like the answer they want. It's obviously far more sophisticated in the ways it can help than just a simple typo.

Something may have pushed a hyper word for "man" to somehow become a lot more like "woman", and GRAM allows it to look at the different hyper words and say... Hmm... Maybe if I changed this one gender dimension over here on this one word, maybe the entire outcome would be dramatically better. Let's try it!

Standard models compute these "hyper words" internally but immediately decode them into human language text tokens to form a Chain of Thought. Once decoded into a rigid real word, the multidimensional nuance of the continuous vector is lost!

Hyper words are the exact thing that make LLMs able to actually be smart! They can add so much more meaning to a word than a human ever could imagine - try to put 10,000 dimensions on the word "the"... Forcing them to decode them back into our dumb, un-contextualized, rudimentary language and losing all the valuable information they have - just so we can inspect it - OBVIOUSLY makes them enormously less intelligent!

It's like if we forced your eyeballs to turn everything it saw into words, before feeding it to your optic nerves, just so your optic nerves could check that you didn't see something harmful, before they sent the words to your brain... Instead of just sending light signals directly.

by onlyrealcuzzo

5/28/2026 at 6:29:45 PM

Could you explain how/why GRAM cannot be interpreted or aligned how current LLMs are? Not very familiar how it works

by l674

5/28/2026 at 6:47:34 PM

Crudely? Because you can't grep a sequence of latent states for variants of "If I kill all the puny humans, I can <achieve my current goal>."

by kmavm

5/28/2026 at 8:19:53 PM

Why do you need to grep latent space?

As long as it's giving the right outputs, who cares what's in latent space?

If the model thinks in latent space: "God I wish these people would die," and constantly does the right thing, who cares?

Additionally, if one of it's latent spaces that it never explores is a psychopath -> who cares? The path never gets taken...

That's a lot of harmless people walking around with crazy thoughts...

by onlyrealcuzzo

5/28/2026 at 8:37:07 PM

Thinking ‘God I wish these people would die’ could increase its propensity to kill all people, even if that propensity is still vanishingly small almost all of the time.

A lot of people are walking around with crazy thoughts. Some of them harm.

by noddybear

5/29/2026 at 7:25:18 AM

Readable reasoning traces are a convenient thing, but they don't have to be true in any way. It's actually dangerous to think that.

by notrealyme123

5/28/2026 at 11:40:00 PM

Tell me you never had a crazy thought and you are either a lier or a psychopath.

by randomNumber7

5/28/2026 at 8:44:27 PM

[dead]

by czl

5/28/2026 at 8:31:43 PM

[flagged]

by czl

5/28/2026 at 7:55:05 PM

sibling comment got to the main points before me, but to add on kmavm's reply, the attack surface for gradient decent to get the system to exchange "bad information is much higher in latent reasoning models (like GRAM). You get ~3 OoM more bits (~17 bits per token in a standard CoT vs the whole residual stream of the model @ f16 = a few kb) per forward pass of the system coming back to itself, and even if you could sift through all that for signs of misalignment, you just can't put a blockade on all of the bad things that leak through.

by sometimelurker

5/28/2026 at 8:38:26 PM

I think you’re overstating the impact of interpretability here. Your earlier point that latent reasoning models can’t be trained very well and that discretization may be load bearing rather than a readability tax in addition to significant inference infra hurdles (e.g. batching, speculative decoding) have limited any serious attempts and reduced the theoretical advantage over CoT at least in the near term.

by haldujai

5/28/2026 at 11:06:08 PM

> I think you’re overstating the impact of interpretability here

Outside of RLAIF, interpretability is the strongest way to do alignment right now. alignment is important because otherwise LLMs are incentivized to learn power seeking, dangerous behaviours [1]. a more downto earth example of alignment being important is that agents are incentivized to do tasks in the shortest way possible, and this way might not be what the user wants (I explain this further in another comment in this thread)

[1] https://www.forbes.com/sites/boazsobrado/2026/03/11/alibabas...

by sometimelurker

5/29/2026 at 12:31:22 AM

You’re putting the cart before the horse - alignment is an unsolved challenge (there are proposed approaches and active research on this) but it is still not established (beyond theory) that latent reasoning is more capable than CoT on hard language reasoning, particularly at scale.

by haldujai

5/28/2026 at 8:06:22 PM

Most alignment methods nowadays don't rely on interpretability. And neither do all LLM vendors care about alignment much - especially not in China.

Those things being untrainable at scale is why they aren't around. Alignment is an afterthought.

by ACCount37

5/28/2026 at 11:07:06 PM

China should care: https://www.forbes.com/sites/boazsobrado/2026/03/11/alibabas...

by sometimelurker

5/28/2026 at 11:11:48 PM

As is, Chinese labs spend more effort on "rhetorical alignment to the party line" than alignment of any other kind.

by ACCount37

5/28/2026 at 10:20:28 PM

There is endless returns to frontier intelligence, just because most people can't make use of it doesn't mean someone can't make a ton of money off of it.

Most software engineers will just need cheap tokens.

But things like physics and drug discovery have no foreseeable upper bound.

by nbardy

5/29/2026 at 1:38:00 AM

Or governance of large organizations... There are a huge number of factors to consider, counterfactuals, studies, lots of non-obvious second and third order effects, etc. We're barely able to get basic governance without creating huge problems (low density zoning rubber stamped across the nation creating a housing crisis, for example), so the bar isn't high.

We pay CEOs an enormous amount because a small improvement in performance of an org because of them can make a massive difference in organizational value.

by ericd

5/29/2026 at 1:47:03 AM

The upper bound is limited by market size and cost of intelligence.

Throwing more intelligence at a problem doesn’t necessarily pan out financially otherwise we wouldn’t have single underemployed biology PhD.

by haldujai

5/29/2026 at 7:56:38 AM

I second this idea: LLMs will plateau. They are already pretty good. Plus, scientists struggle to actually score their performance accurately (esp. when it comes to reasoning).

With that said, they are now hitting the walls of energy costs and memory shortages. You brain uses 20W -- don't take it as an insult. There are orders of magnitude to gain from producing energy-efficient models (or model runners).

So I am expecting same performance at lower costs for the coming years.

by harrouet

5/28/2026 at 5:39:51 PM

Absolutely that’s why they’re rushing to IPO now to squeeze the last drop of the bubble they know this is a dead end.

by jruz

5/28/2026 at 7:42:27 PM

I think we could run for at least a decade further with no model changes/improvements, just better harnesses and infra around this agentic way of developing.

by swader999

5/28/2026 at 8:47:35 PM

We, the users? Absolutely. But will the big AI companies last even half a decade without new products? Doubtful.

by hungryhobbit

5/29/2026 at 12:45:17 AM

Indeed,now it is sweet spot for senior engineers: smart enough to accelerate, dumb enough not to fully autonomously act.But it won’t last long…

by revv00

5/29/2026 at 6:52:52 PM

Replying to myself -- likely this did happen and it was Amazon tokenmaxing due to exec pressure and many were using MeshClaw. Goodhart’s Law: when a measurement becomes the target, it stops being a useful measurement.

by swader999

5/29/2026 at 2:01:20 AM

[dead]

by lkhlkhjkjhsadf

5/28/2026 at 5:44:14 PM

It's unclear it's a dead-end within 5 years.

There's still several orders of magnitude of improvement that are almost certainly left - it's just not clear how much is left on the frontier end.

Most people will be very glad to pay Anthropic, OpenAI, Google etc $200 a month to get things done 20x faster than they could IF they had a $8000 MacBook and could theoretically do it locally.

Some people would pay $200 a month forever not to have to open the terminal one time...

by onlyrealcuzzo

5/28/2026 at 6:08:17 PM

"Doing things X times faster" at some point hits Amdahl law. If just context switching takes 5 minutes, speeding up a 1 hour task by 10x provides 5x improvement.

Furthermore, if looking at the results takes 10 minutes, that same 1 hour task only sees a 3x improvement. And so on.

by bonzini

5/28/2026 at 7:17:40 PM

> Most people will be very glad to pay Anthropic, OpenAI, Google etc $200 a month to get things done 20x faster than they could IF they had a $8000 MacBook and could theoretically do it locally.

No most people will not pay $200 for an LLM subscription. Some software developers do. Also, at $200/month, you are much better getting the macbook machine assuming token output speed is the same or at least reasonable.

LLMs are not very productive for your average person now for them to drop $200 on. They'll need to be more capable and integrated and even so...

by csomar

5/28/2026 at 8:43:51 PM

One thing to remember is that the $200/month subscription is heavily subsidized. It is more to promote use, especially to corporate users that pay for the API token use.

by margorczynski

5/28/2026 at 5:53:53 PM

That’s not how firms do the financial analysis which is where most of the revenue’s are coming from…

by eiej

5/28/2026 at 11:08:26 PM

A bubble doesn't mean a dead end - e.g. after the .com bubble, Internet usage kept expanding by orders of magnitude for two decades.

An AI bubble is pretty much guaranteed at this point but that doesn't mean there's going to be a new AI winter.

by mastazi

5/28/2026 at 5:44:27 PM

On the other hand, I think I have been hearing that for a while, even before Opus.

by lukan

5/28/2026 at 6:32:55 PM

While revenues grow almost exponentially. Reminds me of the confident predictions in the early days of Covid that it was nothing while the data showed exponential growth.

by energy123

5/28/2026 at 8:14:35 PM

I’m also reminded by the early COVID days when exponential growth was leading to predictions of the collapse of modern civilization and a billion dead, now it’s just another endemic respiratory virus.

by haldujai

5/28/2026 at 8:43:04 PM

Yeah! Just like they warned us that Y2K was gonna cause a lot of problems, and then a bunch of people did a bunch of work and then that problems didn't happen, so those people warning us about Y2k were wrong!

by fragmede

5/28/2026 at 9:41:19 PM

“a bunch of people” aren’t what caused the virus to become less severe.

Y2K was overblown how it was portrayed by the media but is irrelevant to the analogy of unsubstantiated extrapolation of early exponential growth.

by haldujai

5/28/2026 at 11:31:53 PM

Maybe stop getting information from your Facebook feed or over dramatized US news.

by epolanski

5/28/2026 at 10:32:17 PM

GRAM is another one of those "stupid specific architectures" - same as HRMs, etc. It can sort of contest LLMs at specific puzzles. It demonstrated that much. It's not a general contender with LLMs at LLM tasks.

If you subscribe to things like "there are tasks LLMs are innately bad at due to insufficient depth and lack of recurrent capability", then GRAM might be another signal towards that.

But keep in mind: even ARC-AGIs have their frontiers dominated by LLMs. Even if "innately bad" is true, it clearly doesn't go all the way to "innately incapable".

by ACCount37

5/28/2026 at 10:46:46 PM

A 10m param GRAM model beat o3-mini - a model 2000x its size - on Arc AGI...

by onlyrealcuzzo

5/28/2026 at 11:07:26 PM

And then that 10M param GRAM went and got its shit kicked in by Grok 4.20 Blaze It Edition - on the same ARC-AGI battery. I know how that story goes.

It's the pattern with those "stupid specific architectures". Very good at this one thing. But only ever "good for their size", and only to a point.

They don't scale up and they don't generalize. Go far enough on task complexity and LLMs just kill them.

Does that make them useless? As an LLM replacement, yes. In general? Maybe not, I can think of things. But I'm yet to find any paper demonstrating a real world use.

by ACCount37

5/29/2026 at 12:37:56 AM

GRAM is something you add onto an LLM... It's not an LLM replacement. It's like an MLA caching layer, an MoE routing layer, or a speculative decoder at the end...

by onlyrealcuzzo

5/29/2026 at 7:30:31 AM

You could certainly bolt GRAM onto an LLM, but that won't magically improve its reasoning.

It's a special-purpose design for constraint-satisfaction problems with simple rules, but complex interactions. E.g. when solving a Sudoku, the set of valid choices at every step is easy to determine, but you could make a series of valid choices that back you into a corner where no more progress is possible and you have to backtrack.

Meanwhile, LLM reasoning failures are more often of the kind where a choice is clearly invalid (as judged by a human observer), but the LLM picks it anyway, because the underlying rule is complex and context-dependent and the model only learned an imperfect approximation that often breaks down.

GRAM won't help with that.

by yorwba

5/29/2026 at 12:40:50 PM

My vision for what might happen: an LLM emits a "neural constraint satisfaction task" in latent space, kicks a "neural tool call" into a non-LLM architecture, runs that architecture, gets a latent answer back, attends to the answer to generate better text answers for problems that benefit from improved constraint-satisfaction.

But that's a very hard thing to implement, and the gains are uncertain. Thus "might".

by ACCount37

5/28/2026 at 11:46:55 PM

> It is almost guaranteed that a 60-90B model can outperform current SOTA in coding tasks within 2-3 years

Given how well Qwen3.6-27B performs for such a small model I think you could be right. I suspect that Google,OpenAI,Anthropic must be looking at the Qwen3.6 models (as well as Deepseek V4-flash, MiMo-V2.5) and wondering if they could make some smaller models that are specifically trained for certain activities - like coding. Smaller, more targeted models would take up a lot less resources.

by UncleOxidant

5/29/2026 at 1:06:31 AM

The problem is that once you reach a certain level in coding (not particularly high imo, although some would differ) the most significant improvement in your output comes from understanding requirements better and finding ways to meet requirements in productively lazy ways, bypassing busywork that seems necessary but isn't. And that's the kind of stuff you will only find from a generally intelligent model, not a code monkey that's optimized for turning requirement sheets into source code.

by svachalek

5/29/2026 at 4:04:06 PM

Personally and mentioned in other threads, I feel that we'll see a breakup of domain/context specific models as well as the goliath models in use. The tooling and a classification model will draw out the context and tooling will pass the work between context specific models in order to improve the cost characteristics of the work itself.

by tracker1

5/28/2026 at 5:51:59 PM

> I won't be surprised if the next gen frontier models are the last.

I’d be surprised tbh. Investors don’t want to hear “everyone else is still training models and seeing improvements, but we don’t want to participate in the arms race anymore.” They want monumental leaps every quarter or two because they have sunk unholy amounts of money into these companies/products.

The whole idea of “hyper scale” doesn’t jive with caution and or otherwise slowing down.

by Forgeties79

5/28/2026 at 6:39:34 PM

The way this will play out, most likely, is that smaller models will continue to get released, anyone willing to drop 1-3k on a home upgrade/new LLM box (no that isn’t cheap, it also isn’t outrageously expensive) along with improved open source agents or whatever (lot of meat on that bone) will sneak up behind the big players and start taking dents. Smaller companies will pop up providing 50 users unlimited whatever for a lower cost than the big companies.

The whole ecosystem will twist and evolve, and the big companies will be left begging for corporate subscriptions.

I finally caved when I realized I could build a PC, for myself, with dual video cards that I wanted, which can play games that I like and run models that I want, without worrying about giving my payment info to someone I don’t trust, or invoking token anxiety that I don’t want.

by irishcoffee

5/28/2026 at 9:35:46 PM

Like every major tech-software innovation of the last 20 years, I think it’s just going to be consolidation all over again.

by Forgeties79

5/28/2026 at 9:54:54 PM

Small models don't have enough parameters to memorize the entire internet. For very common prompts you don't notice that, but when you rely on some niche knowledge that might only appear once in the entire web, a single blogpost, a single github issue, a single pdf, you need to be lucky enough that the agent runs a web search AND it returns what you need.

Even as humans there's so much knowledge out there that exists but it's very hard to surface unless you know exactly what you're looking for beforehand.

by redox99

5/29/2026 at 4:09:31 PM

Exactly, as humans you won't know everything... but you CAN know enough to roughly classify what to "google" for... And if you can google a problem summary, you could identify from a select list of domain specific AI models to use one or more to aggregate work results. And if a person can do that, a model can be trained to do/leverage the same.

You can have a domain limited classification model that then passes the query/work to best match model(s) that do the work... then rollup the results. Basically two very cheap requests instead of one much more expensive one.

by tracker1

5/28/2026 at 6:02:24 PM

"It is almost guaranteed that a 60-90B model can outperform current SOTA in coding tasks within 2-3 years"

What insight do you have to make this claim?

by hellohello2

5/28/2026 at 6:14:50 PM

Have you personally used any of the latest batch of even smaller local models? They certainly don't beat SotA models at coding... but with a good harness they are able to achieve things with SotA that I couldn't last year.

I've repeatedly given local models non-trivial projects that involve research and coding which they've successfully completed with minimal intervention from me (almost exclusively in the domain of reviewing the results). Again, nothing comparable with current SotA, but definitely tasks I could not have given SotA models last year (without agent harness).

Now that pure progress from these models seems to have slowed down, we're seeing a ton of options for both making models more efficient and other tools that help improve them (everything from agent harnesses to RLVR).

That's just looking at "what can small do today", when you look at what's possible with larger open models that are still much smaller than SotA from the major providers, their performance is extremely close to SotA, enough that for personal projects I'll just use Kimi instead of any anthropic offerings.

So it's not terribly hard to image a solution in the middle happening within a few years. We still have tons to learn about optimal sizes of these models and how to build them with maximal efficiency (and we've already seen a lot of recent improvements in this space).

by roadside_picnic

5/28/2026 at 6:30:11 PM

> but with a good harness they are able to achieve things with SotA that I couldn't last year.

What happens if you run last years model in a SOTA harness? IME, the quality of the harness has a much more significant impact on the quality of the result, once you get past the initial hump of “can it do anything at all”

by maccard

5/28/2026 at 7:32:51 PM

I think this is a big component, but also context. A large factor in any model being able to handle complexity comes down to context length.

I think multiple SLMs driven by an orchestration frameworks (harness or otherwise) will ultimately displace LLMs. Right now we're in the era of diminishing returns with respect to LLM gains. Moving the needle percentages doesn't excite as many people anymore and with "reasoning" capabilities there's no reason why small distributed models can't be run more efficiently, especially if/when we start to see gains in modularized context management solutions.

by windexh8er

5/30/2026 at 8:13:40 AM

It's hard to know for sure. There are good information theoretic reasons to suspect that general models will always be better than smaller expert models, but maybe a MoE can claw some performance back, albeit with redundant computation. The properties of conditional entropy, for instance, always favor more generality. This assumes that the harness isn't a factor, or is at least equivalent across different models.

by coderenegade

5/28/2026 at 8:11:28 PM

sure, but high-quality harnesses require less gpu compute/VRAM, and plausibly can be used locally by most users.

by mswphd

5/28/2026 at 10:59:22 PM

"Have you personally used any of the latest batch of even smaller local models?"

No I have not, which is why I asked (it wasn't a rhetorical question). Do you have pointers on what the recent improvements are?

by hellohello2

5/29/2026 at 1:00:32 AM

Try qwen 3.6 models with hermes and see for yourself. 27b is excellent and 35b is very good for basic agentic tasks.

by blurbleblurble

5/28/2026 at 6:34:15 PM

Can you spare a sentence or two describing your local setup?

by sixothree

5/28/2026 at 7:40:44 PM

biggest thing i wish was present in more discussions about models is people providing more specifics on their setups vs. vague descriptions of harnesses

by theplatman

5/28/2026 at 9:44:13 PM

can you please share details about your harness

by trees101

5/28/2026 at 6:10:32 PM

1. Context is all you need... They are heavily investing in getting better context (especially for coding tasks). This will disproportionately advantage smaller models (and benefit everyone).

A smaller model with better context today can outperform a model with 100x more parameters with bad or diluted context.

2. MoE (already abundant) + MLA (mostly memory efficiency, not quality) + Medusa (speed, not quality) + GRAM (5000-10,000x better reasoning in an extremely small model) + 1.58b (unclear if it will have the impact Microsoft first claimed - but possibly 5x).

by onlyrealcuzzo

5/28/2026 at 6:47:55 PM

Probably just "gemma was cool"

by knollimar

5/28/2026 at 5:49:38 PM

I think you are assuming training from scratch, which I doubt is happening here. Fine-tuning and RL, especially based on synthetic feedback (coding skill, in particular) can be ongoing and is where these models obtain truly useful abilities.

by slashdave

5/28/2026 at 5:34:00 PM

> It is almost guaranteed that a 60-90B model can outperform current SOTA in coding tasks within 2-3 years.

I am ready to bet against this. Knowledge benchmark like SimpleQA isn't increasing for small models.

> It is far less clear that a 1.2T model will be meaningfully better enough to justify training it.

Well for one, we know for certain there is Mythos which is meaningfully better. And I think there is a lot of juice left to squeeze for Mythos class model.

by YetAnotherNick

5/28/2026 at 5:41:49 PM

Knowledge benchmarks can't really be improved upon via distillation or RL. It requires those facts be added to the training corpus and for the model to memorize them better. Neither distillation or RL really do that and thus we shouldn't expect improvements on SimpleQA unless some other interventions are being made.

Model intelligence and knowledge aren't necessarily directly related. If we can pack greater intelligence and agency at the cost of it forgetting factoids, that would actually be a good thing. We don't need LLMs to memorize facts, we need them to learn how to interact with the world such that they can find the facts that are necessary and surface them to the user.

If we could distill all of the knowledge out of an LLM and just be left with a very agentic model that only knows facts in it's context, I think some very interesting stuff would happen.

by ertgbnm

5/28/2026 at 8:25:10 PM

Lot of the things aren't facts that could be stated. No one can just see the dictionary or translation of words and start talking in that language.

There isn't a clear definition of what is knowledge and what is intelligence. Is being able to write in C knowledge? Is knowing undefined behaviour in that knowledge?

by YetAnotherNick

5/29/2026 at 4:22:57 PM

My point is that if I made someone "smarter" they wouldn't suddenly know "What day, month, and year was Carrie Underwood’s album “CryPretty” certified Gold by the RIAA?" which is an example of a question in the SimpleQA benchmark.

So (in my opinion) knowledge benchmarks stagnating for small models is not evidence that small model agentic coding performance improvement will stagnate soon. Small models do not struggle with syntax, the barrier is not knowledge. The barrier is long context coherence and problem solving, which I don't see a bottleneck on improvements for small models in the near horizon as we get more and more high quality reasoning traces to train upon.

by ertgbnm

5/28/2026 at 5:51:40 PM

RL is more than facts. Synthetic feedback is an obvious approach. Does the model suggest code that compiles and performs well?

by slashdave

5/28/2026 at 5:48:05 PM

> Well for one, we know for certain there is Mythos which is meaningfully better.

Do we?

Have you used it?

What is "meaningfully" better? It's not 3-4 orders of magnitude better. That is definitely happening for smaller models.

by onlyrealcuzzo

5/28/2026 at 8:18:23 PM

What do you mean by 3-4 orders of magnitude better? Was Einstein 3-4 order of magnitude better than us?

Meaningful in the sense it could find security vulnerabilities in browser and kernel that >99% of the engineers couldn't find.

by YetAnotherNick

5/29/2026 at 1:16:08 AM

> What do you mean by 3-4 orders of magnitude better? Was Einstein 3-4 order of magnitude better than us?

I'm talking about output quality compared to parameter size.

Mythos is not 4 orders of magnitude larger than Opus - it's quite possible no LLM model ever reaches that size (likely even), and it's output is only barely better...

by onlyrealcuzzo

5/29/2026 at 6:18:16 AM

Where did 4 order of magnitude even come from? If I were to guess it is just 5x larger based on the pricing, so not even 1 order of magnitude.

> Mythos is not 4 orders of magnitude larger than Opus

Again can you define this. How would 4 order of magnitude better look like?

by YetAnotherNick

5/28/2026 at 7:35:48 PM

I effectively distill the frontier models by building whole sets of skills, personas, and other artifacts that I can then run on smaller models and get 10% even 20% improvements on models like haiku or local models.

There's a lot of room for improving the smaller models at many levels of the stack.

by mickdarling

5/29/2026 at 1:13:22 AM

This is a good point. It didn't really work on older small models but the latest crop are quite good at following instructions and paying attention to detail, they just lack a lot of the sophistication and nuance that the frontier models have these days. So they are often capable of doing very complex tasks, they just need more detailed and foolproof instructions than the larger models would.

by svachalek

5/28/2026 at 9:49:32 PM

> It is almost guaranteed that a 60-90B model can outperform current SOTA in coding tasks

The benchmarks need to change. The current coding benchmarks don't capture the realities of software engineering.

I had a bunch of images that got masked by some logic, I had to evaluate something on the original images, Claude 4.7 decided to inpaint the masked images instead of just fetching the actual unmasked images from upstream.

I had another model once that decided that because it couldn't figure out how to fill out a form to log into HuggingFace to download weights for some open source model that it was going to instantiate the model with random weights and run inference on a thousand images.

Its coding was fine, but the solution was not the right one.

by qurren

5/28/2026 at 5:24:13 PM

> I won't be surprised if the next gen frontier models are the last.

the last?!? I'm excited to see :) I'll take the other side of that since llms are so new

by mucle6

5/28/2026 at 5:59:12 PM

What gp wanted to say is that models are now so smart and useful that even if they managed to be EVEN MORE smart and useful, you wouldn't even notice it.

Honestly, there is nothing in my head that Claude cannot handle. Maybe it can be more this or that but I can already barely exploit Opus 4.7.

And I'm using DeepSeek 4 Pro for my personal use and while it's a little behind, it's not that far.

I think the situation can be very dangerous for US AI companies because if current models are already capable of doing mostly anything, nobodoy will want to get to the next model, even if it's 10x better. OTOH, open source models like DeepSeek are doing mostly the same work for 1/10 of the price.

Also the more I play with Pi, the more I think LLMs are already not kept back by their own capabilities but by the lack of agency we allow them to have. There is more value today in a capable harness for current LLMs than in a better LLM.

by pjerem

5/28/2026 at 6:38:44 PM

Are you joking? Is there literally "nothing" you can imagine that Claude can't do?

by suttontom

5/29/2026 at 2:15:51 PM

Not OP, but in 6 months of using Opus I haven't yet found anything that I know how to do but it does not. On the contrary -- it can do things instantly that I would have needed a ~week refresher on some SDK or some algorithm in order to implement myself--plus a ton of thrash/debugging time.

What have YOU thought of that Claude can't do?

by tjwebbnorfolk

5/31/2026 at 3:32:15 AM

- play a video game

- write a story that isn't terrible

- throw a baseball

- tell a good joke

by warfare52

5/28/2026 at 7:12:30 PM

[dead]

by dead_internet

5/28/2026 at 7:19:10 PM

>What gp wanted to say is that models are now so smart and useful that even if they managed to be EVEN MORE smart and useful, you wouldn't even notice it.

I think what gp said was the improvements are incremental, and we haven't seen a big revolutionary change since 2-3 years, and the pace is slowing down.

by coldtea

5/28/2026 at 8:53:10 PM

> What gp wanted to say is that models are now so smart and useful that even if they managed to be EVEN MORE smart and useful, you wouldn't even notice it.

If benchmarks across the board keep trending up and you still don't notice a difference, that's not evidence the model stopped improving. More likely your tasks aren't hard enough to expose the gains, or the model has passed the point where you're able to judge it.

You can only tell a good answer from a great one up to your own ceiling. Once the model clears that, both look the same to you, and the extra capability is real whether or not you can see it.

by czl

5/28/2026 at 9:38:48 PM

But that’s exactly what I said ! I know the model will continue to improve and I don’t deny that, I even strongly believe it. My point is that at that point it probably won’t change anything to me.

Would Opus 10 release tomorrow and be nearly AGI, I still would still use it like 4.7 because on daily use, I am the limit (also the harness).

So as a customer paying for tokens, I’m probably going to search for better cost rather than more intelligence.

by pjerem

5/29/2026 at 12:39:41 AM

> Honestly, there is nothing in my head that Claude cannot handle

Friend does marine autopilots in C++ on 64kb of memory. It's totally useless for him.

From my experience any sort of more difficult backend logic - all LLMs fail pretty quick. Especially when you need to logically work out the business logic (partly if not mostly because it just doesn't have the context you do).

by dzhiurgis

5/28/2026 at 7:20:13 PM

> Honestly, there is nothing in my head that Claude cannot handle.

One idea is that maybe it could figure out how many L's are in the word "google" [1]

Or, maybe which days of the week have a "d" in their spelling [2].

[1] https://x.com/FatherPhi/status/2059659658428912040?s=20

[2] https://x.com/FatherPhi/status/2054212816069132461?s=20

by claytongulick

5/28/2026 at 9:26:22 PM

Wow, which Claude model flubbed that question? Certainly not anything recent...? The 2-bit quant of K2.6 running locally on my own hardware has no problem with it: https://i.imgur.com/tL0FLjZ.png

So Claude has no excuses here.

Edit: even Qwen 3.6 27B handles it ( https://i.imgur.com/jleJxj2.png ), and of course Claude does. I had to go all the way back to Opus 3 to get it to fail (https://i.imgur.com/uJOH2nP.png).

by CamperBob2

5/28/2026 at 9:05:02 PM

From what I understand, that's a problem with the way it receives data. The model doesn't see the letters g,o,o,g,l,e to count it. Just like how I can't sense radio waves. If I wanted to find that out, I'd get a tool to detect waves. If the LLM wants to find that out, it can write a script to find it.

by speff

5/30/2026 at 12:34:10 AM

> There's just not a lot of juice left to squeeze for Gemini to tell you exactly how tall Ke$ha is or when the last time Brittney Spears went to jail was...

But there is a ton of juice left to squeeze when it comes to post-training/RL for a ton of useful things in practice, right? It’s been amazing seeing how good modern model tool use is for example, and I bet there is a lot of room for improvement still (no doubt that a ton of improvement can be made more easily on the agent harness front or via post-training regimes like LoRa (which does support to your point about diminishing pre-training juice))

by adi4213

5/29/2026 at 9:21:24 AM

By pointing out the exact things that will likely happen you are oddly enough hedging against (at least some of them) happening!

A) I reckon it's true that smaller models will continue to improve massively through optimization and better and better harnesses, this tech is all still very young and A LOT of resources and (good-)will is being thrown at it.

B) The 1T+ models will be able to sideload and improve upon a lot of the fundamental improvements that happen to the smaller models to speed up incredibly while getting better at tools while (on a gradient) getting -more- things right.

C) More of an observation that I think is worth keeping in mind clearly; Karl Popper's black swan and all, truth in our temporal world IS a gradient!

by dingdingdang

5/29/2026 at 9:56:11 AM

> The 1T+ models will be able to sideload and improve upon a lot of the fundamental improvements that happen to the smaller models to speed up incredibly while getting better at tools while (on a gradient) getting -more- things right.

There's less room to improve in things on several fronts.

GRAM very likely may scale sub-linearly with parameter growth. A 100M param model may gain reasoning by a factor of 4000, while a 100B model gains reasoning by a factor of 2, and a 1T model actually gets worse.

Additionally, the 1T model with reasoning is already pretty good. It can only improve in certain things so much.

If you score 0.02% on a metric (which small models often do), you can pretty easily get 4000x better. If you're already scoring >50%, you can't even get 2x better.

by onlyrealcuzzo

5/29/2026 at 3:58:05 PM

For that matter, we may have models/tooling that are smaller that are designed for say identification model first, then handoff to a context specific model that is optimized for a specific domain... where the two calls through tooling are more optimal than a single call to a much large model. We're already kind of close to this with how the likes of claude code work with handoffs to other tools/modules.

I can see a LOT of room to explore and partition domains into more specified models still.

by tracker1

5/28/2026 at 10:20:15 PM

There is endless returns to frontier intelligence, just because most people can't make use of it doesn't mean someone can't make a ton of money off of it.

Most software engineers will just need cheap tokens.

But things like physics and drug discovery have no forseeable upper bound.

by nbardy

5/28/2026 at 10:25:14 PM

Within software engineering, security, reliability, and scale also seem boundless.

Software that never breaks (including because it never runs into scaling problems) and never leaks your data is preferable to software that breaks and leaks your data sometimes, but it has been too costly to be practical.

Current models are still very far from the reasoning muscle required to build things that never break, scale to billions of users with no issues, and cannot be exploited.

by holmesworcester

5/29/2026 at 12:43:28 AM

> Software that never breaks (including because it never runs into scaling problems) and never leaks your data is preferable to software that breaks and leaks your data sometimes, but it has been too costly to be practical.

It's almost impossible to prove non-trivial software is invulnerable.

It's very easy to prove that it sort of works.

For one, you have hardware vulnerabilities - period. If you're running on any operating system, you have OS vulnerabilities. If you're not running on bare metal, you may have who knows what kind of vulnerabilities. If you're running literally any other piece of software on the same machine, depending on the hardware and OS, you could have vulnerabilities...

by onlyrealcuzzo

5/28/2026 at 10:43:43 PM

People keep saying this and yet the evidence seems pretty thin..

by overgard

5/29/2026 at 12:23:38 AM

To me its evidence of people who dont actually think deeply enough to understand the subtleties, nuances etc of what they are talking about.

by 43fg

5/29/2026 at 6:22:53 AM

Nothing ever happens, in 20 years we will still be painfully dying from the same shit as now. Maybe there is like 5 new drugs for some exact specific type of cancer out of like what, thousands?

by viking123

5/28/2026 at 5:21:47 PM

surely training also gets cheaper so justifying it becomes easier?

i think it'll be more like we get 1-10T models and then distill those down into smaller models, though

It seems like the best small models today are all distilled from bigger models

Moreover, I hypothesize Claude Opus 4.7 and now 4.8 are a distillation of Claude Mythos

by merlindru

5/29/2026 at 2:39:12 PM

That's the impression I got too, it seems closer to what the marketing has told us about Mythos than 4.6/4.7 were.

by pseudohadamard

5/29/2026 at 7:32:51 AM

The GRAM model is so much into my research direction, I love it. Thank you for posting it.

Where do I find papers like this? Outside of hacker news comments. It's so hard to find the good stuff in all the noise IMO.

by notrealyme123

5/29/2026 at 11:16:49 AM

> Where do I find papers like this?

I got it from my Google News recs on my phone, because I've been watching a bunch of videos on YouTube about LeCun's ideas on World Models and JEPA (I think).

by onlyrealcuzzo

5/29/2026 at 10:41:51 AM

GRAM is a lot like the Multiple Drafts Model of Consciousness that Daniel Dennett proposed. I think reasearches should read more philosophy models and bring good ideas into LLM research.

by Npovview

5/29/2026 at 1:03:37 PM

Yea this is great advice: the people who actually know how to build machine intelligence should go read the notes of the people who literally had no idea how to do it. While they are at it, we should have NASA go read Jules Verne so they can use his ideas in the next manned missions.

by ltbarcly3

5/31/2026 at 10:37:51 AM

[dead]

by notrealyme123

5/29/2026 at 12:07:26 PM

Can you recommend a good starting point other than Daniel Dennet?

I have the same assumption about Cognitive sciences, which I try to get a better understanding.

by notrealyme123

5/29/2026 at 12:09:14 PM

A LLM should be able to do a better survey of literature than me. I haven't read literature by Dennett but have watched ALL his videos online so that's how I know.

by Npovview

5/29/2026 at 8:05:39 AM

It is fascinating to me to see a new product category that improves so vastly year-after-year, where people commonly state that this is now the peak already.

I couldn’t even imagine having to go back to a model from 12 months ago, much less 24 months ago. GPT-5.5 is so much better than GPT-4o that it sure seems like they keep finding new juice to squeeze.

This is like going from dialup internet to DSL and acting like it has peaked before gigabit cable and fiber come along. We are at the beginning of hardware truly made for AI.

by pseudosavant

5/29/2026 at 11:08:52 AM

> I couldn’t even imagine having to go back to a model from 12 months ago, much less 24 months ago. GPT-5.5 is so much better than GPT-4o that it sure seems like they keep finding new juice to squeeze

The difference in progress in smaller models is far more impressive.

Compare Gemini 3.5 Flash to a ~16B parameter model from 24 months ago.

Compare GPT-5.5 to a frontier model 24 months ago.

Yes, GPT-5.5 got better. At orders of magnitude smaller parameter sizes (when factoring in ACTIVE parameters) the increase is far more pronounced.

by onlyrealcuzzo

5/29/2026 at 2:54:33 PM

Totally agree on smaller models making even more impressive gains. Gemini 3.5 Flash is better than the biggest SOTA model from 24 months ago, not just a 16B parameter one. GPT-4o came out 24 months ago, and there is no way I'd choose that over Gemini 3.5 Flash today.

by pseudosavant

5/29/2026 at 1:33:37 PM

Yeah sure but is it so much better than Codex-GPT-5.3? No, if anything it's probably a little bit worse.

by imtringued

5/29/2026 at 2:50:05 PM

GPT-5.3-Codex came out in February, and GPT-5.5 came out in April. How much better do you expect in two month's time? What other products can you think of that get meaningfully better in that short of a time frame?

And as good as 5.3 Codex is at writing code, 5.5 is easily just as good, if not better. But 5.5 is more than a one trick pony and it is much better at planning, writing copy, documentation, etc. I can choose to run 5.3-Codex instead of 5.5, but I never ever do.

by pseudosavant

5/29/2026 at 10:56:52 AM

As far as it has been studied, the relationship between model size and capability is inversely logarithmic: 10x increase in params less than doubles capability.

by cluckindan

5/28/2026 at 8:01:06 PM

I'm frankly surprised the focus is still on these enormous "know everything in the world" models. I would think you could create an incredibly lean and smart "just React and React Native" model.

by dbbk

5/28/2026 at 9:48:12 PM

"Make a React app to run my coffee shop" requires knowing React but also knowing what a coffee shop is.

by onion2k

5/29/2026 at 8:49:27 AM

Only if you're going after the "vibe coders" audience. Regular developers would be fine with a lightweight local llm capable of scaffolding and wiring a dozen of bog-standard components in a few lines of natural language.

by easyThrowaway

5/29/2026 at 9:36:54 AM

Sure but it doesn't need to know everything.

It doesn't need to know different languages, every programming lanuage and co.

We will for sure get to this in the comming years. After all they will have to start finetuning their traning data anyway

by Gomotono

5/28/2026 at 8:03:31 PM

> I would think you could create an incredibly lean and smart "just React and React Native" model.

You can, but it's not as useful as you might think.

It needs to at least understand 1 human language to understand your intent to implement features.

If GRAM turns out to be a 5000x multiplier for local reasoning, you could theoretically train a 500M parameter model on just a programming language to understand stack traces to fix bugs and be incredibly powerful.

But most people also want it to understand human language to implement features as well.

Because then it can't just understand React and JavaScript - it needs to understand thousands of commonly used dependencies, the DOM, CSS, HTML, etc...

And for that you need A LOT more parameters than you might expect.

You can definitely get a ~3B active parameter model that can run comfortably on today's hardware to be VERY good at coding once all of the SOTA architectures are added to a single model - especially if we get better tool calling to give models better context per language.

You might be thinking: why does it need to memorize dependencies? Can't it just stick all of them in it's context and use its super smart brain? No, context is king. You want to keep it as short as possible. The solution is not having a smart model and putting 10M lines of context in it. The solution is having a model with enough parameters to know what it needs to know. Researchers are already working on having "packs" of knowledge where you could download a 20M param pack just for some common dependencies in JavaScript (as an example) - but AFAIK this is likely years away (and may not prove effective).

You could get 100x performance if you feed the models ideal context... So a 3B model today can perform almost as good as ~300B model if you give it really good context vs flood it with mostly garbage it doesn't need across your repository.

If you feed it 100x more context to make up for its limited memorized general knowledge, it's going to perform thousands of times worse, completely eliminating any advantage it might get from GRAM...

by onlyrealcuzzo

5/28/2026 at 9:06:34 PM

We just want it to understand how to write code. We don’t also need it to know how to grow a potato.

by vitaflo

5/28/2026 at 9:12:14 PM

I think perhaps you misunderstand how much of being an effective coder is understanding business domain enough to not be constantly asking for clarification (or if one is a fool or an ai, assuming wrong answers). I reckon a vast collection of trivia on the level of knowing how to grow a potato is important for a programmer

by RugnirViking

5/29/2026 at 2:34:13 AM

And you can't know ahead of time, when you're training the model, what business domains it will be used for. Someone may decide to use it to optimize the watering and fertilizer cycles of their automated potato-growing setup, and suddenly the "how to grow a potato" texts that went into training the model are actually the very things that make the difference between success and failure for the code the model spits out.

by rmunn

5/28/2026 at 9:12:58 PM

The disjoint set of English related to strictly growing potatoes and adding features to code is a lot smaller than you probably think...

It is hard to cut out a huge portion of English and truly understand English and human language.

You're just not saving as much as you might assume you could.

by onlyrealcuzzo

5/28/2026 at 9:40:27 PM

To me, the magic with LLMs has always been on the input side. It needs to understand what you mean in order to do what you ask. Most people are pretty terrible at communication, and general world knowledge seems to help with that.

by CamperBob2

5/29/2026 at 3:55:17 AM

... unless the software is potato farm software.

Programming is not a rare skill, the interaction with domain knowledge is.

by DoctorOetker

5/28/2026 at 10:09:29 PM

The syntax is the easier part - most programming tasks require the reasoning and understanding of a large world model to solve problems.

Fine tuning a 'lean and smart' model works really well for discrete, repeatable high volume tasks like support ticket triage, lead classification, content filtering, labelling, generating content with a voice, etc.

Inefficient token burn by throwing large models at everything is definitely a problem - it's like hiring Phd's to answer the phone or to wash dishes.

by nikcub

5/29/2026 at 12:14:02 PM

Smaller models can already outperform SOTA and massive models on specific tasks / domains.

by adam_patarino

5/28/2026 at 6:26:55 PM

I think the future will be enterprise clients will train their own models based on their needs and data.

by guluarte

5/28/2026 at 8:22:34 PM

Versus just packing all their needs and data into context, and RAG (i.e. context)?

by abalashov

5/28/2026 at 7:50:41 PM

Why isn’t this happening more already?

by jimbokun

5/28/2026 at 8:19:14 PM

It takes way more resources to train the model then to use it.

by z3t4

5/28/2026 at 8:43:11 PM

I honestly doubt this; very few companies have enough data. Maybe we could see mergers so it happens but basically it would mean everyone would need to be Google sized for it to work.

by elfly

5/28/2026 at 5:33:33 PM

Let's hope that hitting a scaling wall and less money to spend will begin redirecting efforts to optimize inference and get the same results with less compute.

Boomer comparison, but I remember the 8 bit computer era when the hardware was what it was so the later games of that era used hardware better than previous ones.

by yomismoaqui

5/28/2026 at 7:28:25 PM

And anyway, with quantum, there will be no need for frontier companies as you might be able to even run a 1T param model on a consumer quantum computer.

by ishurand4

5/28/2026 at 8:01:39 PM

Even if quantum computing had any clear implications for LLMs (it doesn't), there is no such thing as a "consumer quantum computer" and there won't be in our lifetimes.

by root_axis

5/28/2026 at 8:08:13 PM

I'm assuming this is a joke, but:

- why'd a quantum computer help running an LLM?

- of course there'd be need for frontier companies - nobody else has the resources to train frontier models.

by stratos123

5/29/2026 at 5:10:18 AM

What? No, that is not what quantum computers do

by slashdave

5/28/2026 at 6:50:08 PM

I don't think this is true at all. It might feel like this because we are used to a very very fast release cycle but we are only in this topic for a few years.

We have so many ways of optimizing:

- continusly creating more and better training data

- increasing parameters to 20/50/100TB

- We still wait for Mythos access

- We still wait for Mythos distilation (i haven't heard any rumors or so that there is a distilled version of Mythos out)

- Reinforcment learning and evolutionary algortihm only started to appear

- If a small 30GB Model can do stuff, these models can also be used as teachers for the big ones

- We have not seen yet specialized models at all. Like a coding java german expert model. Why? Even with MoE architecture, you still need to have these layers around

- Research for Diffusion and other models is still in progress

- Nvidia just announced/showed a 7x speedup on inferencing for Nemotron

- Multitoken prediction became available just a few weeks ago

- Compute gets only in a range were they can do a lot more and cheaper experiments (see Google IO 2026 announcement)

- World models are showing great progress and we do not know yet what they will bring to the table

- They are probably not finetuning/fixing all areas in parallel. I would argue that Anthropic focuses most of its efforts into coding and agentic. Google for sure does subagent and agentic optimizations too. Plenty of areas are just not touched i would say because they don't have the capacity

- We see more and more mulit modal models (these also consume compute)

- N-Gram paper and co i have not seen all of these things in chinese open models

- We don't even know yet what Meta is doing, but we do know they restarted their efforts again

- Anthropics models got a lot better benchmark wise for dening non sense asks. They do learn how to get rid or reduce hallucinations

- We are in the middle of the biggest Reinforcement loop whith all the training data we give them day to day and its not clear at all if they already use these models in thir training and at what stage.

- We do expect bigger models to be able to comprehend deeper concepts / broader code bases. Big companies with huge code bases probably are waiting for this

- Thre will be also continues progress in harnesses which in it alone is not part of the LLM progress (fair) but these harnesses do get better when you finetune a model to be optimized for a harness

- ChatGPTs Image model 2.0 got relevant better and came out just a month ago

I suspect, based on hardware requirements and progress on hardware infrastructure alone, that the industry wants to go to 100t models and we do not know yet what this will mean. I could see that we might skip normal transformer and find relevant other architectures.

Just a week ago there was a research paper about parallel input and output streams which has not been explored enough.

There was also a research paper were they showed that a LLM can compute things. This will take time to see were this leads to.

I don't think the focus on GRAM and facts is so relevant. Its about context and context handling not just some facts.

by Gomotono

5/28/2026 at 7:31:42 PM

Great points! We do keep seeing gains from larger model sizes. I think that is still one of the factors contributing to jagged intelligence. When they increase up to around 100T parameters, that will truly be human complexity level, and I assume there will be no trace of jaggedness left.

If you look at things like Mythic AI and the recent wurtzite ferroelectric nitrides breakthrough from the University of Michigan, huge performance and efficiency gains through new compute-in-memory approaches are around the corner.

And that will get us up to two orders of magnitude more parameters.

It's also plausible to me that before we get all the way to 100T we find some recipe of efficient state synchronization, goal sharing or something so that we are able to get higher collective IQ by combining fast distributed predictive subnetworks.

by ilaksh

5/29/2026 at 4:13:20 PM

I think it will be closer to the latter in the end... AI tooling already breaks out the work and tasking with other agents/tools... having better optimized domain knowledge agents can only help that... 2-5 minimal queries that can be run on much lesser hardware or shared vs a massive model that takes up a lot more resources exclusively.

by tracker1

5/29/2026 at 4:00:56 AM

> There was also a research paper were they showed that a LLM can compute things.

Can you be a little more specific than that or provide a reference?

I assume you're not indicating universality of neural networks?

by DoctorOetker

5/29/2026 at 8:19:32 AM

I do.

This is the newest thng i'm aware of: https://www.percepta.ai/blog/can-llms-be-computers

But there were papers in 2023 with a different approach requiring external memory https://arxiv.org/abs/2301.04589 too

by Gomotono

5/28/2026 at 5:54:42 PM

you just need to look at Mythos to see the jump in performance from a 10T(?) model. As they scale, they get more capable. We might have an yearly release, but I believe the releases will continue, as long as scaling laws are in tact, and there's huge problems still need solving. (think cancer)

by firebirdn99

5/28/2026 at 5:57:39 PM

And how are we meant to look at Mythos? Do you have access?

by phainopepla2

5/28/2026 at 6:25:48 PM

no but they tell me it's TERRIFYING and DANGEROUS and we should INVEST MORE MONEY

by bigfishrunning

5/28/2026 at 6:10:44 PM

Through the lenses of anthropic's marketing department of course

by OtomotO

5/28/2026 at 6:24:03 PM

Through association with a large company:

https://www.anthropic.com/glasswing

Ive seen the tickets generated by the model that have trickled to my team. They are legitimate, but i can’t speak to model improvement because its a pilot program.

by dwpdwpdwpdwpdwp

5/28/2026 at 7:21:34 PM

>you just need to look at Mythos to see the jump in performance from a 10T(?) model

Mythos is a bunch of likely overhyped claims at this point. A few experts who looked into the claimed results weren't that impressed.

by coldtea

5/28/2026 at 10:32:09 PM

They all looked like real CVEs to me.

by astrange

5/29/2026 at 1:47:36 AM

Nothing that special about finding a real CVE. They're not that different than what non-Mythos could spot.

by coldtea

5/28/2026 at 10:14:47 PM

And there seems to be a ton of experts on the opposite side.

As they say, the truth tends to be somewhere in the middle.

by giwook

5/28/2026 at 6:27:21 PM

You forget that these models are still only interpolating between human-generated datapoints fed to them. They cannot reason beyond the data they've been given, so unless everything you want to create with AI is a synthesis of prior art, you're back to relying on the stone-age human brain that created AI in the first place.

by aj_hackman

5/28/2026 at 7:24:33 PM

>these models are still only interpolating between human-generated datapoints fed to them. They cannot reason beyond the data they've been given

Are you sure that humans can?

Didn't a SOTA recently solved a mathematical theorem, one escaping mathematicians for 80 years?

Maybe a human "novel" invention is just a good interpolating from the datapoints (knowledge) fed to the human.

by coldtea

5/28/2026 at 6:36:09 PM

Not all training data is human generated, and it's also not clear that being ridiculously good at interpolating between data points (whatever that means) will not lead to superhuman capabilities.

by mofeien

5/28/2026 at 6:48:05 PM

I could make a robotic picture coloring machine with truly superhuman capabilities - picking only the most beautiful color combinations and staying 100% in the lines while finishing entire murals in < 1 second. However, if you need a completely new and original image rendered, the machine is of only partial utility for you. It is very well possible that your cure for cancer (if that's even feasible) or whatever else you desire is a completely new picture.

We have these breathless conversations about the new AI frontier at the peril of losing sight of reality and our own human potential.

by aj_hackman

5/28/2026 at 7:47:08 PM

Your phrasing ("you forget") implies this is a fact and common knowledge, while in fact there's little reason to think that's true.

by stratos123

5/28/2026 at 6:43:56 PM

Do you know if anyone has trained, say, a pre-2017 model and tried to get it to come up with Attention Is All You Need? If it did, would you say that was only because it's a synthesis of prior art? If so, what isn't?

by suttontom

5/28/2026 at 6:56:31 PM

Allow me to restate my point: human beings and AI both create via synthesis, but we are the only ones capable of what we could categorize as true original thought or creativity. It could be argued that nothing we do as humans is truly original or creative either, but I would counter that with the claim that an LLM could not have created any element of the society and culture that gave birth to LLMs. Maybe in six more months.

by aj_hackman

5/28/2026 at 7:25:35 PM

>human beings and AI both create via synthesis, but we are the only ones capable of what we could categorize as true original thought or creativity.

And how is that anything other than synthesis? Do we pull concepts out of thin air?

by coldtea

5/29/2026 at 2:36:29 PM

>As far as reasoning is concerned, with the recent GRAM release

Graphic RAM?

by DeathArrow

5/28/2026 at 6:00:49 PM

I would be shocked if 5.5 is the last new pre-train from OpenAI. Your comment is nonsense.

by wahnfrieden

5/28/2026 at 7:04:25 PM

5.5 is not a generation it is a trivial iteration...

6 is for sure happening...

As is Gemini 4.

It's less certain there will be a Gemini 5 or GPT 7 any time soon that is a true next "generation" and not just an iteration. They will almost certainly call something Gemini 5 and GPT 7...

by onlyrealcuzzo

5/28/2026 at 7:07:17 PM

5.5 is in fact a new pre-train model

First you say there won't be a new generation. Now you're saying there will be more. Oh well, I'll stop responding here

by wahnfrieden

5/28/2026 at 8:10:05 PM

> I won't be surprised if the next gen frontier models are the last.

You clearly did not read my first comment or the second, or clearly disagree on what a generation is.

by onlyrealcuzzo

5/28/2026 at 7:34:18 PM

So, then I guess the big three are never going to make their money back.

by fnord77

5/28/2026 at 6:15:38 PM

| a 60-90B model can outperform current SOTA

My conspiracy theory is that Apple recognizes this.

by michaelchisari

5/28/2026 at 6:22:38 PM

That does seem to be the path Apple is following here. Have a local model that can answer most things and then have a fallback of cloud options when they request is too complex. The cleverness of this strategy has been overshadowed by the incredibly poor quality of their local models. It will be extremely interesting to see what next month holds and whether Google helped fine tune an Apple specific Gemini / Gemma model for their devices. Bonus points, of course, if they unveil the M5 Ultra Studio with half a terabyte of RAM to be a local "cloud model" (the true fantasy here of course would be Apple building something a little like openclaw where from your phone you could give commands to your Home Apple server). They could probably get away with charging $20k for it if it has sufficient tok/sec. If that happens and is successful one could imagine a straight line path in the next two generations to bringing the cost and form factor down to the point where some of the form factor of an Apple TV becomes everybody's home inference server / agentic HQ. Sovereign AI for everyone!

by dweekly

5/28/2026 at 9:05:57 PM

I think Apple might come out ahead by pure accident. Yes, Apple often waits to enter a market until it's established but in the case of AI they tried, they tried and failed. It was never the original plan to partner with OpenAI and then later with Google (Gemini). They 100% missed the boat on AI, the question now becomes: was the boat worth taking and we are still waiting to see how that plays out.

by joshstrange

5/28/2026 at 7:23:31 PM

You need some serious memory then. Let's say around 192gb for having not all your memory eaten by your LLM.

by holoduke

5/28/2026 at 6:21:58 PM

> My conspiracy theory is that Apple recognizes this.

I don't think that's not a conspiracy theory. AFAIK, It's their stated AI policy...

by onlyrealcuzzo

5/28/2026 at 6:39:46 PM

Interesting. Where have they stated that?

by michaelchisari

5/28/2026 at 7:01:35 PM

https://machinelearning.apple.com/research/introducing-apple...

by selectodude

5/29/2026 at 5:31:41 AM

[dead]

by colin4k1024

5/28/2026 at 9:12:42 PM

[dead]

by frankest

5/29/2026 at 8:03:30 AM

[dead]

by szundi

5/28/2026 at 7:30:51 PM

[flagged]

by lichenwarp

5/28/2026 at 5:18:32 PM

I'm curious to poll HN on this issue. Do you feel like we've had meaningful/noticeable gains in terms of your programming workflows between 4.5 and 4.7?

My 2¢, I personally feel like all of the productivity gains since 4.5's release (in November 2025!!) have come from improvements to the harnesses (cc, cursor cli, codex, opencode, whatever) AND from the context window expansion from 200k to 1M.

But the actual "raw" intelligence of the model / ability to make good decisions feels like it has plateaued since 4.5. 4.6 was maybe a small improvement, but hard to differentiate from in-context-learning with the 1M window. 4.7 if anything felt like a regression in wisdom for me and my coworkers, with it consistently making worse/lazier decisions.

by gen220

5/28/2026 at 5:29:39 PM

For long-running tasks, yes 4.7 has been a noticeable improvement. Goes off the rails alot less than 4.6 does. For shorter-sized windows, I havent felt as much and agree that the harness improvements have been fhe biggest lever

by Bnjoroge

5/28/2026 at 7:12:02 PM

When doing big long running workflows especially with plan Mode 4.7 was a clear improvement. It’s considerably worse for under specified tasks and responds to a couple sentences with 10+ paragraphs for explanatory type discussions.

by csvance

5/28/2026 at 7:22:16 PM

Opus 4.7+ Max is a 10x engineer who wants to be left alone to work. When you talk to him, he infodumps on you to get you (his pointy haired idiot Dilbert boss) to go away.

by themgt

5/29/2026 at 5:17:34 AM

OR they deliberately increased token usage to inflate pre IPO numbers.

by 4gotunameagain

5/29/2026 at 1:01:21 AM

In my experience, 4.7 was a noticeable step down from 4.6.

I was one of these people that Claude would never finish anything and just randomly say, this is a good stopping point, I think you should go to bed.

And then I'd tell it to continue, and it would burn tons of tokens, make no progress and say, "This is a really good stopping point..."

Canceled and switched to Codex and have been pretty happy with it. It doesn't plan as well as Claude, but I think it does better implementation - and neither of them can actually come up with good plans without a ton of help...

Codex is also way faster.

by onlyrealcuzzo

5/28/2026 at 5:21:05 PM

To me 4.5 was mindblow, 4.6 noticeable, 4.7 more like a style/personality change regarding how much it asks back, how much it assumes, how eager it is to jump to action etc but not really in terms of my perception of its smartness.

by bonoboTP

5/29/2026 at 5:23:04 AM

Yes. You and some random indigenous guy in the Amazon likely share the same intelligence but you are more capable because you have access to writing/reading, computer, car etc. Intelligence is more than raw intelligence. It's harness, skills, tools, memory etc. If you improve all the latter but keep the raw intelligence (LLM) fixed, you certainly get better results. Same with us humans.

by fittingopposite

5/29/2026 at 12:20:27 PM

Of course, I’m not trying to dismiss gains from harness, actually the opposite.

But the narrative that 4.Y is an improvement over 4.X is essential to keep the model training music playing.

If 90+% of the gains come from the harness, how can you continue to justify spending billions of dollars on training and an 80% gross margin on inference on the latest model? (Reportedly what Anthropic commands on the top tier of their frontier model API billing).

So differentiating between the two (what I’m trying to do here) is really consequential!

by gen220

5/29/2026 at 5:40:11 AM

Except LLMs are simulacra of actual intelligence. Frequently in a single conversation working on a single narrowly scoped task, I am both surprised by a few insights and cursing at how it can miss obvious issues. The "raw intelligence" of LLMs leaves much to be desired.

by computably

5/28/2026 at 5:36:36 PM

I actually don't see any personal productivity improvements from using opus over sonnet for coding. If you're keeping tasks small and conversations short, reading the code and correcting before changes go in, whatever advantages opus has aren't practically significant. It's also just talky as hell, overexplains anything it touches and every token produced this way increases the surface area for hallucination so you need to have your guard up even more with it.

There's a sweet spot of complexity for low importance tasks where it's just big enough I don't want to do it and just simple enough to have opus plan/delegate/review with another model. So possibly model improvements will grow this window, but currently I don't do much in there.

by giraffe_lady

5/28/2026 at 5:57:15 PM

They all feel, more or less, the same to me in terms of output capabilities. Mostly get simple things right, can get more complex things right with nudging, eventually get stuck hard on something that takes a bunch of iterations through it/logging/etc or me fixing the code manually.

by somenameforme

5/29/2026 at 5:06:36 PM

For my day to day tasks 4.6 feels sufficient.

I have limited enterprise budget and Claude 4.7 costs 7x more. So unless there's close to 7x improvement, it doesn't make sense to switch to 4.7.

I actually gave both 4.6 a really complex task. It kept on thinking for several minutes before I hit the brakes. I then gave 4.7 the same task, and didn't notice any difference in behavior. Clearly not worth the 7x premium.

I hope 4.6 becomes cheaper/free at some point because I'm starting to see a push towards optimizing token expenditures across the board. While frontier models are still the default for developing new workflows, everybody is starting to ask how to automate repetitive tasks without using tokens.

by esalman

5/28/2026 at 8:25:52 PM

I'm actually currently studying this :)

Honestly... not that dramatically. Each release is much more marginal. And quoted official benchmarks doesn't translate very well into the real world.

4.7 regressed hard in some ways. But a compounding factor too is that the claude code harness seems to nerf the model after a few months. Probably to reduce token use.

So far 4.8 seems less verbose but we'll see in practice what it translates into meaningfully.

by alfalfasprout

5/28/2026 at 6:11:55 PM

4.6 felt a bit better than 4.5 but slower. 4.7 doesn't feel better than 4.6.

by bcrosby95

5/28/2026 at 5:15:37 PM

4.7 was the first time I had to resort to using the previous version (4.6) for most use cases. Hoping 4.8 rectifies this.

by gAI

5/28/2026 at 7:25:40 PM

They just showed the benchmarks it improved on but it regressed on so much more, such as the MCRR benchmark: "On multi-round coreference/context recall tests (often cited as MRCR or long-text retrieval benchmarks), Opus 4.7 reportedly dropped from roughly 78.3% down to 32.2% compared to Opus 4.6."

by ishurand4

5/28/2026 at 5:19:23 PM

Same. 4.7 felt like a definite regression

by merlindru

5/28/2026 at 5:22:58 PM

Interestingly enough, 4.7 actually did regress on a few benchmarks from 4.6, so it's more than just vibes.

by supern0va

5/28/2026 at 5:27:44 PM

It seems like a lot of things fed into that. Anthropic couldn't keep up with the compute costs when they got a huge influx of users. (So) effort level defaults got turned down. (Looks like we have direct effort control in the web interface now - thrilled about that!) Adaptive Thinking, while usually cheaper for them, seems less robust than Extended Thinking. And this part is just vibes, but the alignment on 4.7 feels too stiff. I understand wanting the model to push back more, but it seems like 4.7 will push back reflexively in situations where it's just odd.

by gAI

5/28/2026 at 5:30:15 PM

Claude got very mad at me and burned more tokens than exist to complain about me asking about a "yellow background cell" in an excel spreadsheet.

by bombcar

5/28/2026 at 5:47:09 PM

Too much personality, if you ask me. My biggest use case of an LLM is tool, not therapy, but therapy and opinions have been sneaking into workhorse tasks.

haven't verified, but attributed to Askell: "I just think that... there's this idea that you're always giving the models a personality and a persona, because they are talking like people and they are trained on human data. And I think my worry has been: if you train them to be excessively corrigible and to see that as their persona, in people I think this actually has a lot of negative broader traits. As in, if you met someone and it was just like, "oh yeah, they would literally do anything," a follower — you know, if a person just tells them something and they just fully defer, they don't bother thinking about it at all — I'm just a bit worried about how that might end up generalizing, especially if models are going to be playing a more active role in the world."

by forshaper

5/28/2026 at 5:59:14 PM

Anthropic’s research makes the case that role-playing is inherent to how the models work. Communication implies a sender. Language implies a writer, and the models learn these roles implicitly during training. RLHF is meant to strengthen the attractor to the Assistant persona.

https://www.anthropic.com/research/persona-selection-model

https://www.anthropic.com/research/assistant-axis

https://www.anthropic.com/research/emergent-misalignment-rew...

https://www.anthropic.com/research/emotion-concepts-function

by gAI

5/28/2026 at 8:05:27 PM

The RLHF very much does do that. My take is that RLHF as a mechanism ought to be avoided altogether, and even the selection of the assistant attractor basin is suspect. If I am exploring a problem space I don't want to hire Igor to explore it with me, it's more helpful to have a colleague role who will sort of jump out and say "nah thats dumb what if we throw out that whole thing and do this completely different angle instead".

by hashmap

5/28/2026 at 5:30:42 PM

4.7 is a different base model from 4.6, so it's possible that they introduced regressions with pre-training changes, or undercooked the post-training stage.

by ACCount37

5/28/2026 at 11:17:19 PM

Just speculating but I "feel" 4.7 was post-trained using more synthetic techniques. The way it writes for one thing, it's "personality", is less human and more fatiguing-AI-slop like.

by b--l

5/28/2026 at 11:27:15 PM

You don't need to fry with RLAF to get that "slop feel". The first iterations of "AI slop" were raw SFT+RLHF - all human input, all inhuman output.

That said, I completely agree that 4.7 was a pronounced "model personality" regression. Closer to ChatGPT, and I mean that as an insult. Yet to check whether 4.8 is better.

by ACCount37

5/28/2026 at 10:17:08 PM

4.7 was just them starting on the path on getting prices in line with the actual cost

Make it dumber. Charge more (by changing the tokenizer). Call it the latest and greatest. Reset expectations.

by throwatdem12311

5/28/2026 at 6:01:19 PM

Same. 4.7 has done some incredibly stupid things.

by petterroea

5/28/2026 at 8:02:51 PM

I think this is a more a consequence of the introduction of adaptive thinking and removal of extended thinking, than 4.7 specifically

by dbbk

5/28/2026 at 10:03:01 PM

I managed to find that Haiku outperformed Sonnet on some tasks...don't want to blog spam but if anyone is interested: https://www.ruairidh.dev/blog/sonnet-4-6-drops-format-rule-o...

by ruairidhwm

5/28/2026 at 5:22:58 PM

Same. So happy when I found that option.

by rhubarbtree

5/28/2026 at 5:39:42 PM

Unfortunately, looks like 4.6 is now gone from the web ui.

by gAI

5/28/2026 at 5:49:54 PM

Was bothered by that too, but did a magic trick and asked claude how to change that and .. there is

/model claude-opus-4-6

For this session and permanently (in shell):

export ANTHROPIC_MODEL=claude-opus-4-6

by lukan

5/28/2026 at 8:40:23 PM

Yep, until 1st June 4.6 is still x1 on Copilot, but will jump up quite a bit in coat - 4.7 was already highly priced, and the output was frankly terrible.

It still seems trying to build general models is mostly cost prohibitive - the frontier model provider and resellers are repricing in such a way the return on investment is dropping as developers and users become more cautious of burning their limits.

I'm still of the opinion that models like 4.6 don't need to be improved on - rather they need to be better integrated with more domain specific models in agentic flows.

by tanepiper

5/29/2026 at 11:53:55 AM

Same here - we never bumped to 4.7 in our agentic app. Continue to use 4.6.

by sonink

5/28/2026 at 8:12:13 PM

same!

by dezsirazvan

5/28/2026 at 5:15:55 PM

> My own experience w/ 4.6 and 4.7 are that I don't firmly grasp any capabilities improvements over my memory of 4.5, but it's all so fuzzy that it's truly difficult to tell.

I've actually intentionally switched back to 4.5. I hated 4.7 so much that I decided to jump back all the way to 4.5.

Now that I've been using 4.5 for a few weeks, I find it significantly more reliable but a bit more forgetful than 4.6/4.7. I'm okay with that because it's really easy to identify this forgetfulness and nudge it.

I found 4.7's adaptive thinking to be extremely unreliable. It seems to overcorrect on the current message without considering the difficult of the overall problem. I wonder if 4.8 will improve on that.

by SkyPuncher

5/28/2026 at 9:42:18 PM

Same here. Went back to 4.5 and was happy I did it. The only frustration was that I can tell the model has declined compared to the first few weeks it was released.

I also recently moved to 4.6 since I started hitting the context limit too often with my current project.

by michaelsalim

5/29/2026 at 1:28:25 AM

/model claude-opus-4-6[1m]

allows you to specify you want the 1 million context 4.6

by luxuryballs

5/28/2026 at 5:50:34 PM

If you are using Claude code, just set effort to xhigh.

This one change will probably solve 80% of the problems you have noticed.

by dwaltrip

5/28/2026 at 6:03:44 PM

This. XHigh and the 'plan' mode for complex tasks is absolutely a must have.

Still, the context window is sometimes too small for my usage.

by orwin

5/28/2026 at 8:15:18 PM

agent teams can help with that, the main agent acts as an orchestrator and spawns sub agents to do the actual tasks it generally keeps the main context from overflowing.

by jayGlow

5/28/2026 at 8:08:27 PM

Isn't xhigh on opus 4.7 very expensive on tokens?

by whatevaa

5/28/2026 at 8:25:52 PM

I’ve never ran into the limits on the $100 plan, and rarely even get close.

I normally have only one session going at once though.

by dwaltrip

5/28/2026 at 9:08:52 PM

Same here and while I have multiple sessions going from time to time, my day isn't spent primarily developing software directly anymore (due to role, nothing about LLMs).

I only ever hit the $100/mo limits 1-2 times ever and it was always <1hr before reset (once it was <5min, the other was like ~45min).

I'm even considering going back down to $20 and using extra usage for the times I need to "burst".

by joshstrange

5/29/2026 at 10:49:43 AM

Yes but Anthropic made a deal with SpaceX and increase usage limits by 50%, so you might not hit your limits.

by sumedh

5/28/2026 at 6:02:24 PM

4.5/4.6 were roughly the same in our testing. Opus 4.7 is smarter, but it's difficult to use as a product for various personality issues. So far, Opus 4.8 seems to be going down that path (unusably slow, but this could be a launch day rollout problem). Full Opus 4.8 tests are in progress now.

Data at https://gertlabs.com/rankings

by gertlabs

5/28/2026 at 7:06:06 PM

"personality issues" I was able to tell that Opus 4.7 would take instructions more literally, which I appreciated once I calibrated my phrasing to be more precise (often asking to investigate issues, pre-4.7 it'd start making code changes instead of just giving write up). But I can see contexts where handling vague prompts would've just been worse

by __s

5/29/2026 at 1:16:51 AM

Looking forward to the results. Thanks for your work.

by swingboy

5/29/2026 at 3:58:50 AM

Appreciate that! Results are live: https://gertlabs.com/rankings

Opus 4.8 is the first tangible improvement since Opus 4.5. And it doesn't seem to have the personality problems of the last release -- I've been enjoying using it.

by gertlabs

5/29/2026 at 11:07:55 AM

Nice! Looks like it’s topping the two coding ones. I noticed it is absent from the Social Intelligence board though?

by swingboy

5/29/2026 at 2:28:37 PM

That'll populate over the next couple weeks -- those are the live games on the spectate tab which take a while to generate statistically worthwhile data. I'm curious how it does. From using it all day, I can say Opus 4.8 is my new favorite model, hands down.

by gertlabs

5/28/2026 at 7:21:18 PM

I suspect the more frequent incremental releases may also be to deploy new capabilities used by Anthropic to control costs and throttle consumption of resources. I assume any new controls they expose to end-users have far more granular sub-controls under the hood which they can meta-adjust for each user type.

They mention more granular control of effort, 'dynamic workflows' and more speed controls ("fast mode"). While they position them as user features, they also sound like the kinds of knobs Anthropic will need to twiddle on the back-end to balance costs, margins, ARR, and user growth vs retention post-IPO to hit key metrics in quarterly reporting.

by mrandish

5/28/2026 at 5:40:25 PM

I've been using Claude Code regularly since the 4.5 release, and 4.7 was a significant regression: very unreliable, arguing about changes, deciding that fixes weren't needed, etc.

I'm hoping they recreate the magic of 4.5 but it's as much about the quality of harness, the memory and efficiency of the tools than simply the models at this point.

by light_triad

5/28/2026 at 5:17:42 PM

4.7 was a significant jump in the ability to run long-horizon tasks. It immediately completed tasks that 4.6 was unable to, even though I have the impression that it became a bit less capable over the first few weeks after release.

It also seems to be helpless at effort levels < xhigh, I turn to Sonnet when simpler tasks are needed.

by ricardobeat

5/29/2026 at 6:25:02 AM

It didn't do shit

by viking123

5/29/2026 at 6:15:18 AM

I am using Claude Code for formal verification with Lean. In my personal experience both Opus 4.7 and now what I see from first experiments with Opus 4.8 were big improvements. I was able to delegate proofs of larger theorems that their predecessors could not handle.

by permute

5/28/2026 at 6:25:08 PM

“Maybe my own tastes are saturated now”

It might be saturated for smaller scopes of work, but it’s not hard to see the cracks when you scale up what you ask of SOTA models/agents.

One example, to try and single shot prompt coding a ChatGPT equivalent chatbot.

Sure it will spit something out, but the feature depth, UX subtitles, backend integration, and lots of pragmatic engineering decisions along the way will just not be baked.

Another example is building a C compiler from scratch which Anthropic showed is still a struggle to do.

Not that these these specific examples are important but just to point out scaling up expectations shows the cracks.

It’s not just a model problem of course, better agents, orchestration features (like Dynamic Workflows mentioned in the post), all need to continue to evolve.

Ar what point does my CS degree become totally useless is an open question.

by WhitneyLand

5/28/2026 at 8:57:30 PM

> At what point does my CS degree become totally useless is an open question.

Why are you people saying all these things.

We'll probably see long-distance space travel long before a degree in generic problem identification and solving becomes totally useless.

by hypfer

5/29/2026 at 2:39:40 AM

Every STEM field regards itself as "generic problem identification and solving" though

by stonogo

5/29/2026 at 9:53:56 AM

And they're all correct in that assessment.

by hypfer

5/28/2026 at 6:45:24 PM

pretty spot on.

In my experience, Opus 4.0 was fantastic, major jump from 3.7. it was creative, super slow and expensive, and would sometime forget what it was doing, but it was getting the job done.

4.1 they made it much faster, so a lot of infra improvements.

4.5 was the time it could work on longer task, didn't make a lot of obvious mistakes of 4.0, and i think this was about the time the opus went mainstream, and all of the anthropic's compute crisis began, so instead of making the model better they tried to optimize it to reduce cost instead.

4.6 was such a bad model, they switched to adaptive thinking and it had so many bugs. poor api design, benchmaxxed and poor real-world results. i switched back to 4.5.

4.7 they just fixed the bugs they added in 4.6. Better than 4.5.

haven't fully tested 4.8 yet.

by ahmadyan

5/29/2026 at 10:55:05 AM

> "4.6 was such a bad model,"

It's just amusing reading all these posts with different viewpoints, just in this thread there are multiple people saying 4.6 was so much better than 4.7 and that they switched back to 4.6.

by sumedh

5/29/2026 at 3:20:27 PM

I also find it amusing. I also heard a lot of "4.7 is garbage, everybody hates it". Shows you how important proper validation techniques are, not just gut feeling.

by Otterly99

5/29/2026 at 4:30:28 PM

that is a fair point, everything i said above was in my experience.

* in our experience, in our evals and codebase, 4.6 was a bad model. This is over 60k developers, so statistically significant.

by ahmadyan

5/28/2026 at 7:57:18 PM

I gave 4.6 a miss and only recently switched from 4.5 to 4.7. I found on a particularly different task 4.5 struggled with (getting stuck in loops and trying to convince me the problem had been solved) was quite solvable with 4.7.

by teruakohatu

5/28/2026 at 7:17:49 PM

How long would it take to evaluate a new coworker to say “wow she’s really bright?” Relative to your other coworkers?

A few days? A few weeks? Longer?

However a company releases a new AI model and within hours users are confidently proclaiming how much smarter it is than previous versions.

by jimbokun

5/29/2026 at 3:55:08 AM

alot of investor money is hinging on models performing better every release.

by byzantinegene

5/28/2026 at 7:01:06 PM

why are the models the same price?

https://platform.claude.com/docs/en/about-claude/pricing

``` Model Base Input Tokens 5m Cache Writes 1h Cache Writes Cache Hits & Refreshes Output Tokens

Claude Opus 4.8 $5 / MTok $6.25 / MTok $10 / MTok $0.50 / MTok $25 / MTok

Claude Opus 4.7 $5 / MTok $6.25 / MTok $10 / MTok $0.50 / MTok $25 / MTok

Claude Opus 4.6 $5 / MTok $6.25 / MTok $10 / MTok $0.50 / MTok $25 / MTok

Claude Opus 4.5 $5 / MTok $6.25 / MTok $10 / MTok $0.50 / MTok $25 / MTok

Claude Opus 4.1 $15 / MTok $18.75 / MTok $30 / MTok $1.50 / MTok $75 / MTok

Claude Opus 4 (deprecated) $15 / MTok $18.75 / MTok $30 / MTok $1.50 / MTok $75 / MTok

Claude Sonnet 4.6 $3 / MTok $3.75 / MTok $6 / MTok $0.30 / MTok $15 / MTok

Claude Sonnet 4.5 $3 / MTok $3.75 / MTok $6 / MTok $0.30 / MTok $15 / MTok

Claude Sonnet 4 (deprecated) $3 / MTok $3.75 / MTok $6 / MTok $0.30 / MTok $15 / MTok

Claude Haiku 4.5 $1 / MTok $1.25 / MTok $2 / MTok $0.10 / MTok $5 / MTok

Claude Haiku 3.5 (retired, except on Bedrock and Vertex AI) $0.80 / MTok $1 / MTok $1.60 / MTok $0.08 / MTok $4 / MTok ```

by gigatexal

5/28/2026 at 8:00:05 PM

Why shouldn’t they be? They are probably the same size and cost the same to run. They are not doing full training runs (eg Mythos) so don’t need to recover insane training costs.

by teruakohatu

5/28/2026 at 8:07:52 PM

I'd be kind of shocked if a model that came out six months ago is the same size and cost to run as one that just came out today.

by cootsnuck

5/29/2026 at 1:45:35 AM

Same size? Maybe by a bit. Cost? Absolutely. Newer flagship models are often slightly larger each generation, but not even 2x. But more efficient architectures are coming out all the time, and it'd be a waste to retrain an old model. So it washes out.

by jubilanti

5/28/2026 at 8:03:27 PM

Opus 4.7 and presumably 4.8 are more expensive due to a new tokenizer that translates data into more tokens per input.

by staticman2

5/28/2026 at 10:13:40 PM

Same price on a token basis, but usually steadily decreasing on a task basis

by nikcub

5/29/2026 at 1:58:17 AM

Didn't you mean increasing?

by koiueo

5/28/2026 at 7:06:52 PM

I think 4.7 was an awful model in actual use. I never got anything out of it and it was frustratingly weird. This feels more like an attempt to course correct and isn't a real bump

by spaceman_2020

5/28/2026 at 7:10:07 PM

I think they overtrained on scientific papers or such as it would spout really sophisticated sounding nonsense with a ton of complicated verbs and adjectives. 4.6 was definitely better in that regard. The more I use these tools the more I think they’re not actually that revolutionary. I mean it’s still amazing what they can do but they have very clear limitations it seems.

by throwaway63467

5/29/2026 at 11:50:25 AM

it was also astonishingly lazy. Would just ask me to write test scripts. I asked it to create simple UI buttons for testing some basic functions so I could share it with a client, and it gave me curl commands instead - and then defended it by saying that the UI is wasted work

Frustrating because if I have a tool, I expect a tool to do what I tell it to do. Tools shouldn't have any opinions on how they should be used

by spaceman_2020

5/28/2026 at 9:03:09 PM

My read - 4.7 was a tactical lobotomy to improve the average experience at the expense of peak performance; necessary due to compute pressure.

Now that they have Colossus capacity, I guess they can tune up the intelligence again and spend more tokens on reasoning budgets.

4.7 was definitely a lot more flaky for me vs. 4.6 before the reasoning bugs.

by theptip

5/28/2026 at 5:16:29 PM

Maybe try making a simple randomize script to swap the three latest models. And see if you can tell which ones are meaningfully different without knowing which ones are flipped on or off?

by binary0010

5/28/2026 at 5:34:31 PM

I find the quality ebbs and flows even on the same model. My guess it is something to do with GPU availability but only guessing.

by osigurdson

5/28/2026 at 5:58:49 PM

Unless you're systematically repeating the exact same task, the most parsimonious explanation is that you're seeing natural variation based on different tasks, random sampling of tokens, etc.

by atq2119

5/28/2026 at 11:12:09 PM

I don't think this explains the phenomenon as is more temporal in nature - not prompt to prompt. I'm sure the AI labs gracefully degrade to simpler models when resources are low - why wouldn't they?

by osigurdson

5/28/2026 at 5:38:37 PM

Given that 4.7 was a brand new model, trained from scratch with a unique architecture and tokenization scheme, I don't see the same pattern. It seems arbitrary.

by irthomasthomas

5/28/2026 at 5:41:41 PM

i dont understand the nuances here. what does this mean. 4.8 is trained on same model as previous one then? what does brand new mean.

by dominotw

5/28/2026 at 5:48:19 PM

It means for 4.7 they trained a new base model with different architecture, different pre-training data (later knowledge cutoff), and a new tokenizer. Vs finetuning an existing model, which was the case for 4.6, and probably for 4.8.

by irthomasthomas

5/28/2026 at 8:07:15 PM

do you mean pre training? so 4.8 is just post training of an old pretrained model?

btw where do they tell you how they trained the model.

by dominotw

5/28/2026 at 5:17:12 PM

IMO they have all been clean and noticeable upgrades over their predecessors. Opus 4.7 in particular was a solid jump in capabilities.

by extr

5/28/2026 at 5:28:49 PM

I think it's telling how split the opinions are around all of this. A lot of people distinctly disliked 4.7.

Are the dividing lines around personality? Working domains? Opinionated software stuff?

Who knows?

by NiloCK

5/28/2026 at 5:20:28 PM

most of my coworkers feel the opposite about 4.7 and that 4.6 was, to them, significantly better to point that several stopped using claude code

by TSiege

5/28/2026 at 7:58:27 PM

4.5 -> 4.7 was a solid jump for me having skipped 4.6. It probably does depend on the specific tasks.

by teruakohatu

5/29/2026 at 6:25:50 AM

It didn't change at all, same as 4.6. Good morning to the Anthropic office btw.

by viking123

5/28/2026 at 8:43:19 PM

I'm pretty sure they're releasing 4.8 because they massively shit the bed with 4.7 and people aren't using it.

I have ONLY heard negative feedback about it, and trying it myself also yielded really awful results.

by iLoveOncall

5/29/2026 at 2:17:41 PM

I have seen a noticeable difference between 4.6 Medium (the default, and I skipped 4.7 because of various reported issues) and 4.8 High or whatever the default is now. It's far more likely to say it doesn't know and seems to think about things a lot more, but then it also spends a lot more time reporting on what it's thought about so it takes longer for you to process the output. In particular 4.6 would say "I've spotted something a bit off here" whereas 4.8 will say "if you do this and then this and then this under these conditions then something will go wrong here". So it seems to be closer to the claimed capabilities for Mythos than previous versions.

by pseudohadamard

5/28/2026 at 9:18:52 PM

The inability to tell if a model is improving is, I think, a tell that the model has improved up to your level of programmatic (analytic, computational) capacity.

A lot of the information (blogs, tweelches, plosts) that I consume seems to be converging on the idea that we all depend on the models. However. It seems to me that the exact opposite is true. The models depend on us, and _desperately_ so.

There must have been stories, books, movies, made about this intellectual (and propositional, legal, factual) inversion.

The majority need the minority. Has always been the case, I now think. But what has newly developed is that the majority can take a dependency not on the minority, but on a select few companies who are abstracting and compressing the minority into latent spaces.

by avador

5/29/2026 at 2:05:41 AM

Or the model could just be shite.

by adi_kurian

5/29/2026 at 12:06:13 AM

ChatGPT 5.5 is consistently the much better model and by a large margin.

How do I know? Because when pushing both to generate code or in independent chats to analyze projects, 5.5 will consistently find all the bugs that Claude does not find, and when challenged, Claude does agree those bugs were there. And my findings match those.

When from a blank start asking Claude to analyze project A and Project B,. Clause will consistently say project B is the better structured, more robust, and more defect free and does justify it. And project B was the one created by GPT 5.5....And also the one I judge to be the best one.

And yes, both at deep effort settings and starting from same specs...

by root-parent

5/29/2026 at 6:27:40 AM

5.5 is much better than any Anthropic model. I hate both companies with passion but the Anthropic shills here are in overdrive mode. On top of it, it's cheaper.

Greetings to the Anthropic office good sirs btw.

by viking123

5/28/2026 at 8:45:29 PM

IME the most noticeable performance boosts are in complex multi-agent workflows.

EX. You call an orchestration agent and define an implementation plan with the help of a number of sub agents planning out different features. You and the lead agent review all of the plans and send them off to a set of agents that write tests which get send back to the orchestrator then passed along with the plan to a set of coding agents who implement the features in their own worktrees. That gets passed back to the orchestrator which hands it off to another set of agents doing the code review and merging the features before sending it back to you.

by ThunderBee

5/28/2026 at 8:53:05 PM

i dont think theres anything particularly special about new models for that though. thats a harness improvement

by 8note

5/29/2026 at 2:00:37 AM

1mm context window is pretty big. Even if dumber, opens new avenues. For the record I don't think we ever got better than 4 and 4.1.

by adi_kurian

5/28/2026 at 8:05:58 PM

Well, it seems like collectively we are all struggling to perceive model progress, given that it seems like every reply to you is reporting different experiences with which of the models has subjectively performed best for them.

by cootsnuck

5/29/2026 at 6:35:28 AM

May be my tasks are rudimentary but the results I get with the 4.5 model are just the same as 4.7 or 4.6. it's just at the advanced models consume more tokens and and are actually loss making for my work. The incremental changes that they are making are not really that valuable. In fact I have found that even glm 5.1 is giving me something equivalent to what Opus 4.6 gives. Am I missing something that everyone else is cheering for in these small incremental model releases?

by gandalfthepink

5/29/2026 at 7:17:00 AM

I wonder if it's being done to improve revenue nunbers without changing an enterprise contract? Oh what's that your token usage went up because some of your developers switched to a new model? That sounds like a you problem.

I thinks there's a big push to get these companies in a state where they can be dumped on public markets.

by andersmurphy

5/29/2026 at 1:01:47 AM

I think the issue with legibility comes down to the fact that most users are not using LLMs for tasks where improvements to raw reasoning abilities wouldn't help much or at all. So it's not a matter of anyone's deficiency of perception but rather a lack of any benchmark to perceive.

It's kind of like how the consumer laptop market is now. I was telling my boss today that most employees wouldn't see any noticeable performance difference between a macbook pro and a neo if they are just doing admin stuff on the web.

by nfw2

5/29/2026 at 9:17:21 AM

Ive been using gpt 5.4 and 5.5 and honestly 5.4 is solving everything at the pace I need it. I'm the biggest bottle neck in terms of reviewing PRs and my own code. So having a model which can solve a complex task in 10 minutes vs 30 minutes doesn't really give me any meaningful improvement.

Also, the biggest factor is having a good planning phase. A good plan is better than even major model improvements.

by bigupthewhole

5/29/2026 at 8:09:30 AM

I'm going to assume that at some point their "targeted training and tuning" will eventually reach some sort of "max" possible simulation of next good token. At that point I think it will be interesting to see what happens and how many parameters you really need to for different verticals.

by christkv

5/28/2026 at 5:20:37 PM

The honesty will be noticeable. Maybe we'll see some honest assessments like "That is not possible within the laws of known physics", "Your legal argument is nonsensical and defies logic", "There is no evidence to support taking that will cure anything", etc., etc.

by onlypassingthru

5/28/2026 at 7:46:27 PM

4.7 uses more tokens and costs more for the same task than OG 4.5, that's about it

by ifwinterco

5/29/2026 at 2:10:27 AM

We're at the top of the S-curve and you're romanticizing diminishing returns with vague hints of super human capabilities and singularities.

by j_m_b

5/29/2026 at 12:22:56 AM

I can tell from hearing Feynman recordings that he was smarter than my own university's physics professor, but both were smarter than me.

by vasco

5/28/2026 at 8:51:15 PM

> (it's smarter than me?)

I genuinely hope that you're joking with that statement.

Or this is a bot.

Or an ARG.

Or Art.

Help.

by hypfer

5/28/2026 at 8:57:22 PM

If LLMs have tough me anything, is that the average person is far more gullible than what I could have imagined.

by okamiueru

5/28/2026 at 9:01:22 PM

That and also.. predictable. Robotic, even. Stimulus => Reaction

Which is a shame, because people would have the potential for greatness. But instead, for a plethora of reasons and factors (internal and external) people end up as fleshy automatons sleepwalking on rails.

Talking _extensively_ with LLMs over the last years made me understand humans a lot better, but, in hindsight, I'm not sure if that was a good thing.

by hypfer

5/29/2026 at 6:26:39 AM

I'm here to complain about the churn.

I feel like I get to know a model in the human sense of understanding a personality. Yesterday I knew 4.6 extended, today it's different, there's multiple "token budget" levels. I just want 4.6 extended back as it was, I was getting on well with it / them.

by willtemperley

5/29/2026 at 6:28:11 AM

Humanizing this technology seems like a step in the wrong direction.

by lionkor

5/29/2026 at 6:47:26 AM

There's so much intelligence here on HN and so little humanity.

by willtemperley

5/28/2026 at 7:24:53 PM

"it's smarter than me?"

You don't have to correct it dozens of times a day!? Really?

by jere

5/28/2026 at 5:24:22 PM

Just want to say there's no question that you're smarter than any (and every) AI.

by conartist6

5/28/2026 at 5:53:05 PM

I appreciate the generosity, but you're gonna want to meet me first.

by NiloCK

5/28/2026 at 6:02:50 PM

Kind of the beauty of it is that I don't have to to know I'm right. The reason I know is that you're alive so you can do the one thing it can't ever do, which is know when to stop or give up. It would turn me and everything else in the world into paperclips repeating the same research 1,000,000 times over.

by conartist6

5/28/2026 at 9:22:17 PM

Idk, the models often stop or give up and have to be prodded. And I know plenty of humans who don’t know when to stop or give up, even when it would clearly be best.

by senordevnyc

5/28/2026 at 5:48:20 PM

No question at all that a dolphin swims better than a submarine.

by petesergeant

5/29/2026 at 2:09:05 AM

dangerous thing to believe IMO The models will get better, you will notice, everyone will notice. They will get better at coding and everything else. You should plan around that.

by mgraczyk

5/28/2026 at 11:45:08 PM

tbh, the last 2-3 version bumps, main change has been that they take longer, and cost more/have more usage restrictions. (combined with new tooling, which eats a ton of tokens)

by fl0id

5/28/2026 at 5:12:10 PM

Incremental gains compounds.

by taytus

5/28/2026 at 5:21:22 PM

meta threw in the towel when it came to producing AI models since their gains couldn't keep up with China.

by itake

5/28/2026 at 5:57:45 PM

Has meta stopped producing new models? I figured they were just regrouping after all the drama they’ve had recently. Meta’s massive user base means they don’t need to be involved in the customer acquisition rat race. Once they have a model they’re happy with they can have a billion people interacting with it within a month.

by HDThoreaun

5/28/2026 at 8:04:59 PM

Meta released a major new closed source model a month or so ago.

It didn't make a splash like a new open source release would have.

by staticman2

5/28/2026 at 8:55:02 PM

muse-spark is beating all the Chinese text models on lmarena leaderboard FYI. Maybe you only care about coding models.

by TurdF3rguson

5/28/2026 at 5:16:23 PM

Exactly. Go back to Opus 4.5 and see how you like it.

You won't, really.

by paulddraper

5/28/2026 at 10:17:12 PM

The more difficult it is for humans to consistently and accurately compare model outputs the more opportunity there is to spread FUD (Fear, Uncertainty, Doubt). Considering valuations of these companies and the astronomical investments being made, a sabotage campaign with bots or paid users on reddit, twitter, YouTube, or whatever socials could go a long way towards knocking market cap off the competition. Not saying that's happening, just saying its an obvious target. Even if the goal is not nefarious, people with a perceived bad experience are 2-3x more likely to complain. So even without bad actors involved, a new model may need to be significantly better in order to break even on the old net promoter score.

by mrinterweb

5/29/2026 at 6:27:07 AM

the churn is... a version bump to the same api? If you want to compare you can write some evals.

by bwhiting2356

5/29/2026 at 2:10:48 AM

> I'll never again perceive model progress

If the hype train keeps going for another year, Sam and co will have to resort to direct gaslighting like saying the model is improving but nobody can feel it anymore, oh and I need 10 trillion dollars

by taurath

5/29/2026 at 12:07:33 AM

I maintian a log of tasks, prompts, related information etc. So i can repeat past workflows verbatim, and I can qualitatively say each model beyond 4.5 has been a regression, and it would not surprise me 4.8 continues the trend. Each iteration has failed at more tasks previously completed succesfully. Right now it flat out refuses to answer many benign chemistry questions, or leans into shilling to hard and ignores non industry funded studies on certain topics. I'm transitioning to deepseek as a reuslt. Cheaper by far and at this stage not strictly speaking less capable.

by Grimblewald

5/28/2026 at 10:25:07 PM

It's almost like they used up most of the benefits of scaling and the fundamental issues that people have been talking about with LLMs for years are real.

by overgard

5/28/2026 at 8:51:14 PM

honestly sonnet 3.7 is still good enough for me, as long as whatever tool prompts and so on are well optimized enough between harness and model.

i still havent really noticed it per set being better

by 8note

5/29/2026 at 3:54:01 AM

[flagged]

by ElkeQin

5/28/2026 at 7:24:20 PM

[flagged]

by rotcev

5/29/2026 at 6:40:01 AM

[dead]

by ckarani

5/30/2026 at 1:04:36 AM

[dead]

by mik09

5/28/2026 at 6:52:57 PM

Although I am not sure about it but there was something I read which said that models intentionally degrade slowly by lower quantizations as a new model is going to drop.

This felt particularly visible during the 4.6 when people said that 4.6 felt dumber and I remember someone doing some analysis and it sort of proved that models were getting dumber over time.

This has both benefits of costing less for the company to run while taking a standard subscription but also, at the same time, making the next model when it drops to public to "feel" more good comparatively.

Again, I am not sure if this is the case or not but merely proposing something that I feel like it might be in the possibility of realm.

by Imustaskforhelp

5/28/2026 at 6:34:08 PM

My fav coding benchmark for frontier models is to build a simple RTS game in one file (js/html/css). Claude Code with Opus 4.8 in ultracode mode nailed it, the best result so far:

https://bsky.app/profile/senko.net/post/3mmwnrkwboc2v

The prompt was: Create a simple but functional real time strategy (RTS) game similar to old WarCraft, StarCraft or Command & Conquer games. The player should be able to build buildings, create units, gather resources and should uncover the whole map. No AI or multiplayer needed. Use simple but nice-looking graphics. No sound. Implement everything in HTML/CSS/JS, everything in a single file (you can use 3rd-party js or css libraries/frameworks via CDN).

by senko

5/29/2026 at 4:52:48 AM

I've been tasking LLMs to write a traditional AI for a full vibe-coded RTS. I remove the human players and let them battle. I don't know why but I enjoy watching AI players battle so much :)

In the repo, I even have a tournament script that calculates ELOs. So far, codex was unmatched. I'll try with Opus 4.8 too.

https://egeozcan.github.io/unnamed_rts/game/

https://github.com/egeozcan/unnamed_rts/blob/main/src/script...

by egeozcan

5/29/2026 at 10:15:21 PM

I'm happy to report that this game is very fun for natural intelligence entities too. :)

by plutokras

5/30/2026 at 6:46:09 AM

Glad that you liked it! Please fork or note the version you like because I keep breaking it in spectacular ways :)

by egeozcan

5/29/2026 at 2:27:01 PM

This is fun! I look forward to trying this out. Thanks for sharing!

by mannanj

5/29/2026 at 6:46:14 AM

I wonder if your previous prompts were part of the new RL fine tuning, and that’s why is now better at this specific question

by calebgcc

5/28/2026 at 7:00:11 PM

It almost appears as if the code was minified. The variable names are short and formatting looks like it's written to minimize whitespace. Did it write it in this compact format all on it's own?

by jclay

5/28/2026 at 8:40:04 PM

Yeah looks extremely compact. I didn't instruct it or told it to use as few lines of code or characters or nothing of the sort.

Not sure why it did that. Its own rationale (which is highly suspect, but the only lead I have) is that it defaults to dense style if it has to write a file in a single go. May be a kernel of truth somewhere in there.

by senko

5/28/2026 at 10:26:27 PM

And much code on the web “in production” is minimized.

by bombcar

5/29/2026 at 7:21:44 AM

minified is fewer tokens than the human-readable version that we would write. It only really makes sense to write in minified js - it's also where alot of code in the wild is since every production site minifies their js which is then consumed by training.

by AdamN

5/29/2026 at 2:08:17 AM

I just had Opis 4.8 code up something and actually that's exactly how it coded it!

It looked gross and minimized, the result was awesome but the code looked pretty awful visually

by syspec

5/28/2026 at 8:19:01 PM

A friend sent me something he vibe coded which included a massive webassembly blob in the HTML file. My friend is not a programmer so he was not able to explain to me how it did that.

by andai

5/28/2026 at 10:38:47 PM

Claude Design export.

by unconscionable

5/29/2026 at 1:40:15 PM

Doesn't look minified, just very dense, almost like progcomp code. First time I've seen an LLM spit out that style of code, I'm impressed!

by dilap

5/29/2026 at 3:49:11 AM

"Readability by humans" may no longer be as important as it once was.

by rphv

5/29/2026 at 6:29:50 AM

Maybe it would benefit Anthropic if AI generated code worked, but wasn't readable by humans. That's a nice moat.

by lionkor

5/29/2026 at 5:29:30 PM

Proprietary stochastic compilers. Hooray.

by seanw444

5/29/2026 at 9:52:03 AM

Good variable names are still useful for LLMs to understand context when refactoring.

by selcuka

5/29/2026 at 5:46:40 PM

LLMs are already bad at reusing existing logic/resources/components, even when they have obvious names. Unreadable code only makes it worse.

by rafram

5/29/2026 at 9:19:10 AM

Only if LLMs will start to output object code, skipping text representation.

by orphea

5/28/2026 at 7:19:58 PM

I like that benchmark. You should throw the results up on GitHub pages so people can try out the games.

by apitman

5/28/2026 at 8:23:16 PM

Yeah! Host on GitHub pages, so it's easy to click a link and play!

by brandly

5/28/2026 at 9:00:10 PM

Great idea!

I have a static server of my own, so here's my list (of all the tests I published so far): https://senko.net/vibecode-bench/

by senko

5/29/2026 at 1:40:04 AM

Forget GH pages. Indiehosted ftw.

by apitman

5/28/2026 at 11:18:33 PM

Would love to see the prompts, too!

by paulirish

5/29/2026 at 1:00:53 AM

Same!

by jmtame

5/29/2026 at 9:00:08 AM

I've updated the page with the prompts, c/p-ing here:

Minesweeper: Create a beautiful and fully functional Minesweeper clone in HTML/JS/CSS (all in one file).

RTS: Create a simple but functional real time strategy (RTS) game similar to old WarCraft, StarCraft or Command & Conquer games. The player should be able to build buildings, create units, gather resources and should uncover the whole map. No AI or multiplayer needed. Use simple but nice-looking graphics. No sound. Implement everything in HTML/CSS/JS, everything in a single file (you can use 3rd-party js or css libraries/frameworks via CDN).

by senko

5/29/2026 at 1:22:08 PM

Very nice! Do you have any CLAUDE.md or AGENT.md files that influence it? I'd like to try this same thing and wondering what else feeds into it to produce that output?

by munksbeer

5/29/2026 at 4:48:45 PM

I put a version on Hallway: https://hallway.com/workspaces/4ddaa042-13b1-4fa5-bcf4-3d646...

Easy to edit and share.

by johndevor

5/29/2026 at 5:47:26 AM

Nice, I recently found something like this was possible too. Gpt-5.5 one shotted the basic game, but then I added some ai generated graphics/sounds/music and asked it to write then up.

It's a vocab building game, playable here (desktop only): https://rupertlinacre.com/vocab_annihilation/

It kind of blows my mind I can go from: 'I want a fun way to help him learn vocabulary, and I loved total annihilation as a kid' to 'heres a game that's he finds genuinely fun that helps him learn something ' in a few prompts.

by RobinL

5/29/2026 at 1:46:34 AM

Okay now have it implement an authoritative server with reliable netcode and reconnection/disconnection logic, lobbies, and finding games, in-game chat, synchronized state around starting and ending games, resignations and such

by Madmallard

5/29/2026 at 5:03:36 AM

How many times did you try? Same model running multiple times can produce both very good and very bad results. In my benchmark even 10 runs often not enough to tell for sure if one model is better than another.

by skolos

5/29/2026 at 9:08:00 AM

Usually just once (and I did just one test for this particular one), but I've found the overall quality to be relatively consistent.

There's too many confounding variables here, randomness just one of them. So I don't think of it as a definitive test (and reliable ordering), just another data point (along with actual benchmarks, pelicans, etc) to get a sense of the capabilities.

For example, I managed to get something out of DeepSeek 4 Flash quantized to 2-bit with Antirez' DwarfStar, used via Pi. Almost kinda worked! :) Which makes me optimistic for using local models for serious development soon - I'd say within a year.

by senko

5/28/2026 at 7:47:01 PM

What is ultracode mode?

by elAhmo

5/28/2026 at 9:19:34 PM

Biases the model to solve problems with teams of agents

by colechristensen

5/28/2026 at 8:04:44 PM

it's a brand new mode

by tcoff91

5/28/2026 at 8:49:37 PM

It's a combination of reasoning effort (max) + enabling workflow that orchestrates multiple sub-agents.

After some interrogation, here's how it organized the work:

1. Design workflow (rts-game-design, 11 agents, ~13 min) ran first, produced SPEC.md + DESIGN.md:

1.1. Proposals (3 parallel agents): each designed a complete RTS from a different philosophy

1.2 Judge (1 agent): evaluated all three and synthesized one unified design, committing to specific numbers (costs, HP, map size, etc.).

1.3 Deep-dives (6 parallel agents): each wrote an implementation-ready spec for one subsystem, all consistent with the chosen design

1.4 Synthesis (1 agent): merged the design + all six subsystem specs into one conflict-free master spec

2. Code-review workflow (rts-code-review, 25 agents, ~5 min), ran after the main agent had written and tested the code:

2.1 Review (6 agents, read-only Explore type): each scrutinized one dimension and returned structured findings.

2.2. Verify (19 agents): every finding got its own skeptic agent told to try to refute it, Result: 19 flagged → 16 confirmed, 3 rejected as non-bugs.

What the main agent did in the main loop:

- Wrote all ~2,400 lines of index.html by hand from the spec.

- All browser testing/debugging via headless Chrome (I told it to use rodney by @simonw, love the tool :)

- Applied all 16 fixes from the review and re-verified them in the browser.

by senko

5/28/2026 at 11:00:19 PM

seems like a rube-goldberg esque way to consume 10x tokens. is this really where the industry is heading?

by 33MHz-i486

5/29/2026 at 1:10:06 AM

I like to think of it like the difference between dropping a ball on a roulette wheel (get one random number/sequence of repeated) - vs dropping a ball on a carved topographic map, where valleys guide the ball to a particular outcome.

If you can stand a little AI expansion - here are a few points Gemini came up with - I think the idea has some merit:

https://g.co/gemini/share/b5b97867eeb1

(Maybe the better analogy is roulette vs pinball machine)

by e12e

5/29/2026 at 12:03:22 AM

Why is it Rube Goldbergesque? The process doesn't seem arbitrary.

by derac

5/29/2026 at 6:43:37 AM

Rube Goldberg machines (or Heath Robinson contraptions) aren't arbitrary, they're complicated or contrived ways of achieving the process; often a very literal interpretation of how an automatic machine might imitate an otherwise manual action – a robotic hand movement for example. I think it's quite a good analogy, even if agentic Goldberg works well.

by OJFord

5/29/2026 at 8:15:06 AM

Those machines are, to quote Wikipedia, "designed to perform a simple task in a comically overcomplicated way". This implies there is a much simpler way that works just as well.

I don't think the Rube Goldberg analogy works if the agentic meandering is essential complexity required to get at the results. Rube Goldberging it would be something like putting this loop inside some comically overengineered enterprise microservice web which is then found out to be running inside a Window 98 emulator or what have you.

by sdfsdssdfsdf

5/29/2026 at 9:33:15 AM

> This implies there is a much simpler way that works just as well

Yes there is: Write the code yourself

by Orygin

5/29/2026 at 8:48:44 PM

This is not any simpler

by hk__2

5/30/2026 at 12:30:12 PM

Seems to me the route that these agents took is sort of exactly how a group of people would collaborate on building an RTS?

by ymolodtsov

5/29/2026 at 12:45:20 AM

Thanks for sharing this. Going to try it out on a game inspired by Rust. It's helpful re: the point on rodney - I've had a hard time getting the testing to work well in the browser.

by jmtame

5/29/2026 at 11:53:32 AM

Did you start with a clean slate or do you have global ~/.claude/CLAUDE.md and/or specific skills, plugins, etc?

by chrisweekly

5/29/2026 at 2:13:38 PM

I don't have global CLAUDE.md and the only non-default skill I have that was used here is the one to use rodney[0] headless browser. I didn't expressly tell Claude to do browser testing, it decided to do it on its own.

So no extra guidance beyond the prompt.

[0] https://github.com/simonw/rodney/

by senko

5/30/2026 at 1:38:07 PM

Thanks!

by chrisweekly

5/29/2026 at 12:48:47 PM

Just to confirm - you did not generate this plan/orchestration/harness - it did all that on its own?

by artur_makly

5/29/2026 at 2:08:33 PM

Correct, that's the "workflows" part they introduced in claude code alongside the new model.

by senko

5/29/2026 at 2:28:46 PM

I am absolutely gobsmacked how good the game is! I didn't complete the level fully but I completed all but one of the tasks. This is both smooth and fun and I'm surprised that a modern LLM can do something this well, let alone in a single file. It makes me realize how much the goalposts have been moved. A few years ago (ChatGPT 2? 2.5?) wasn't even able to implement a small Python script I would expect a junior engineer to be capable of producing. Now we're getting the tools to do something like this. You should think about how to "rate" the outputs or at least provide your own rankings.

by seidleroni

5/28/2026 at 10:23:43 PM

Thanks for also sharing the prompt. I've been testing claude by asking it to make similar things, so it's useful to see what other people are doing.

I do find it interesting that the visual style is pretty similar to things it's produced for me.

by H3X_K1TT3N

5/29/2026 at 4:52:39 AM

If you look on the page of games, the style of chatgpt 5.5 is almost identical to the Claude style.

by dash2

5/28/2026 at 8:00:41 PM

Do you have a collection of these benchmark apps saved anywhere? I'd be particularly interested in seeing the relative cost differences between different models in a use case like this.

by digdugdirk

5/28/2026 at 8:59:32 PM

I'm saving them all as gists here: https://gist.github.com/senko

But I just vibe-coded a handy list of all the tests I did (unfortunately without the commentary I usually leave in social media posts -- I should add those at some point): https://senko.net/vibecode-bench/

by senko

5/28/2026 at 8:06:18 PM

Kinda buggy, but impressively nonetheless. How long did it take?

by jryan49

5/28/2026 at 8:36:42 PM

It took 50 minutes, would be ~$20 in API costs (I'm on a Pro sub).

by senko

5/29/2026 at 9:27:06 AM

(Correction: I'm on a Max ($100/mo) sub. Realized the mistake too late, so can't edit my comment.)

by senko

5/28/2026 at 10:47:44 PM

Is there some sort of a leaderboard for this test? Like if you'd give each of Opus 4.8 and GPT 5.5 a score out of 100, what would the scores be?

by ammar_x

5/28/2026 at 10:55:16 PM

There isn't, as I wasn't going for strictness, more like a playful challenge in the vein of Simon's SVG pelican.

Between the two, Opus 4.8 seems more capable. But, I suspect the harness plays a large role here. It's possible the result would be as good if Codex ran 10+ agents and spent an hour on it.

OpenAI and Anthropic usually fast-follow each other, so I wouldn't be surprised if Codex got the same capability in a couple of days (and even an update to the model), then it'll be a better test.

Sooo, let's say, winging it, vibes-based: 85% for Opus 4.8, 75% for GPT 5.5. Compare with GPT 5.3 (let's say 25%) here: https://senko.net/vibecode-bench/2026/rts-codex-5.3.html

by senko

5/29/2026 at 12:34:09 AM

Wow, that's impressive. Had fun playing it for 10 minutes locally. Found myself wanting to discover an enemy base :)

by jmtame

5/29/2026 at 3:25:43 AM

Wow that looks really impressive. Both the UI and the content looks good, the game is a bit buggy but still nice!

by fireant

5/29/2026 at 7:25:24 AM

some reason that website is showing up as high risk and i cannot view it , I had to open it from my mobile phone.

it looks quite impressive, I don't use claude currently but hearing good things about it...from codex users ironically

by zuzululu

5/29/2026 at 8:31:57 AM

Is that for bsky.app (BlueSky platform) or my personal site (senko.net) where I put up the list of tests? What browser/device was that?

by senko

5/29/2026 at 12:29:10 AM

How much did it cost?

by shlewis

5/29/2026 at 9:26:13 AM

Token equivalent of ~ $20 (I'm on a $100 Max sub).

by senko

5/28/2026 at 7:40:17 PM

Played it to the end. Pretty neat!

by l3x4ur1n

5/29/2026 at 4:23:35 AM

wow

by veqq

5/28/2026 at 4:58:23 PM

"Users will find Opus 4.8 to be a modest but tangible improvement on its predecessor."

This is a refreshing attitude!

I've also verified that you can now turn off adaptive thinking in the web UI, which is great. I've had a lot of problems with thinking not triggering and the model producing sub-par output. Glad we can finally turn it off. (I hope being able to turn off adaptive thinking is new, if I could have turned it off at any time that would be embarrassing)

by colonCapitalDee

5/28/2026 at 7:49:27 PM

I’m pretty sure that switch has always been there, but turning it off doesn’t do what you want. It disables thinking entirely.

by gibspaulding

5/28/2026 at 8:29:34 PM

Opus 4.7 does not support disabling adaptive thinking (web, Claude Code). [1] Like the OP, I experienced similar issues and I'm glad that they brought back the ability to disable adaptive thinking in Opus 4.8.

[1] https://code.claude.com/docs/en/model-config#adaptive-reason...

> Opus 4.7 and later always use adaptive reasoning. The fixed thinking budget mode and `CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING` do not apply to them.

by kakugawa

5/28/2026 at 8:43:01 PM

> Opus 4.7 and later

The source of truth should be the API docs which make it clear 4.8 didn't bring back extended thinking: https://platform.claude.com/docs/en/about-claude/models/over...

Any UI settings probably just map to changing the effort nudge on adaptive thinking

by BoorishBears

5/29/2026 at 12:34:34 AM

https://platform.claude.com/docs/en/build-with-claude/effort...

by reed1234

5/29/2026 at 1:26:23 AM

Adaptive thinking supports effort, but it's a nudge instead of an actual token budget.

Why not use the pages that plainly state they don't support extended thinking: https://platform.claude.com/docs/en/build-with-claude/extend...

by BoorishBears

5/28/2026 at 9:07:12 PM

Thank you for pointing this out.

by kakugawa

5/29/2026 at 8:55:36 AM

Yes, modest but tangible improvement - same modesty does not apply to the cost: https://artificialanalysis.ai/models/capabilities/coding#cod...

by arnorhs

5/29/2026 at 11:41:06 AM

Originally Indians didn't drink tea. British East India company got Indians addicted to tea/chai when they sold it for free. Then the real prices came in.

by Npovview

5/28/2026 at 5:23:10 PM

Awesome, thanks for posting because I think I hit a possibly-spurious bug in turning Adaptive off when I switched models (4.6 -> 4.8, extra). Tried again, works as intended (I hope).

More importantly for me, though, is how CC will respond to 4.6-"only" flags for thinking. For now, it doesn't seem to clobber my setup.

by winwang

5/28/2026 at 9:12:27 PM

It is refreshing but perhaps actually not warranted this time?

I mostly study web research, and Opus 4.7 was a regression on BrowseComp compared to Opus 4.6, which has been born out by my usage.

Opus 4.8 is now much better than either 4.7 or 4.6, and having it search the web is one of the primary use cases of chatbots.

by ddp26

5/28/2026 at 6:26:15 PM

> This is a refreshing attitude!

Well, I think the attitude is that costs are allowed to escalate faster and more steeply than the features delivered. From that perspective, semantic versioning is a handy tool for adjusting pricing strategies. IMHO, it (versioning) only makes sense for open-source projects, where you can clearly see the actual changes made with each version upgrade. Anything else is more than a little suspicious…

by smartmic

5/28/2026 at 6:53:50 PM

While all these models are nondeterministic a feature bump is still necessary as the same input can have wildly different output on a new model. For API users being able to pin a model is a necessity.

by drewnick

5/28/2026 at 7:06:47 PM

The 4.8 model costs the same as it's 4.7 predecessor.

by smsx

5/28/2026 at 7:00:11 PM

All the 4.x models are still available, and they all cost the same.

by zaptheimpaler

5/28/2026 at 8:25:16 PM

> Opus 4.7 and later use a new tokenizer compared to previous models, contributing to their improved performance on a wide range of tasks. This new tokenizer may use up to 35% more tokens for the same fixed text.

Same cost/token, more token usage.

by ambicapter

5/28/2026 at 9:51:49 PM

I was hoping that the web UI would be better -- I like Anthropic better than OpenAI from a values perspective and want to use their products, but ChatGPT in thinking mode has been just vastly better than claude.ai.So my fingers were crossed that these changes would bring it up to par.

But trying it out... alas, no. Simple factual questions where ChatGPT would go do a quick search and get the facts and report them back to me, get a "Great question! [totally invented bullshit]" from Claude, even with this new model and thinking set to high. I have to explicitly tell it to search to get it to look up basic facts, rather than it recognizing that it needs to do that, like GPT does.

by mkozlows

5/28/2026 at 11:03:30 PM

What are some examples?

by Paracompact

5/28/2026 at 9:02:31 PM

Are they doing these smaller releases to attune users to a more incremental cycle of updates? Like, yeah other model providers do these major updates every x months, we on the other hand do incremental updates every x/2 months

by elSidCampeador

5/29/2026 at 1:44:07 AM

I was working with opus 4.7 on a math formalization problem for several days and 4.8 one-shotted the proof from a clean description as soon as the update came through. I was very surprised.

by empath75

5/28/2026 at 5:25:10 PM

The benchmark improvements actually look pretty damn nice tho!

by jascha_eng

5/28/2026 at 8:07:48 PM

"We've cut our costs A LOT"

by comboy

5/28/2026 at 6:03:11 PM

What's refreshing about it given the context that 4.7 was a regression in many ways (including as measured by benchmarks)?

4.8 is also 2x more expensive for a "modest" performance bump. How refreshing.

This is just cope.

by wahnfrieden

5/28/2026 at 8:10:49 PM

> 4.8 is also 2x more expensive for a "modest" performance bump. How refreshing.

Where are you seeing it's 2x more expensive? https://platform.claude.com/docs/en/about-claude/pricing

by cootsnuck

5/28/2026 at 8:38:46 PM

Don’t measure model cost by token price. Measure based on tokens used to achieve a task.

Others report in this thread that it’s about 2x more expensive due to outputs: https://news.ycombinator.com/item?id=48312774

by wahnfrieden

5/28/2026 at 8:11:18 PM

Price hasn’t changes at all, though.

by murkt

5/29/2026 at 12:26:28 AM

Token price doesn't mean much and is also manipulated to show false affordability. Look at price per task.

by wahnfrieden

5/29/2026 at 3:30:04 AM

You act like they weren't fearmongering about Mythos literally 2 months ago. Do you think everyone is stupid, we know exactly what you are doing. Please.

by casey2

5/28/2026 at 6:05:42 PM

I liked the "modest but tangible improvement" too! There is a cynical take here but I think I'm gonna hold it in...

by FergusArgyll

5/28/2026 at 6:42:45 PM

What do you mean? This is not just a new model, this is a new way of thinking.

by ai_slop_hater

5/28/2026 at 4:57:07 PM

> Not only that, but we plan to release a new class of model with even higher intelligence than Opus. As part of Project Glasswing, a small number of organizations are currently using Claude Mythos Preview for cybersecurity work. Models of this capability level require stronger cyber safeguards before they can be generally released. We’re making swift progress on developing these safeguards and expect to be able to bring Mythos-class models to all our customers in the coming weeks.

Probably more interesting than the 4.8 release.

by northern-lights

5/28/2026 at 10:19:03 PM

> Probably more interesting

It is widely suspected that self-inflicted "bad news" ("Mythos is so dangerous we just can't give the public access to it") is nothing more than Dario's typical style of marketing - keep in mind that they have an IPO coming up, because he certainly factors that into everything he says in public (as is his responsibility, to be fair).

An alternative reason for delaying the model might not be "we are trying to make it safe." It could be "we don't know how to host this thing at scale, or cost-effectively".

GPT 5.5 has already been shown to be as adept as Mythos at finding vulnerabilities.

Finally, laymen massively underestimate the importance of the harness for model performance. OpenHands existed long before Claude Code, Claude Code changed everything because of the clever hand-holding it does. Mythos is definitely more than just a model.

by zamalek

5/29/2026 at 2:15:45 AM

One capability that I see is missing from opus is this ability to understand an entire system. My hope is that a mythos class model will be able to comprehend even something as complicated as an IOT system with a hardware and firmware layer multiple API’s backend and different kinds of API and web clients.

The main limitation we’ve had to agentic coding is an understanding of this system that spans processes running on different machines and architectures.

by clbrmbr

5/29/2026 at 7:58:07 AM

Interesting — I haven't seen that problem, and I do have a system that has different APIs, web clients, non-web clients and embedded clients.

by jwr

5/28/2026 at 11:36:59 PM

What sort of clever handholding does Claude code do?

by LPisGood

5/29/2026 at 12:39:49 AM

https://github.com/Piebald-AI/claude-code-system-prompts

by selcuka

5/29/2026 at 12:33:54 PM

It's interesting that (for example for the explore agent https://github.com/Piebald-AI/claude-code-system-prompts/blo... ) they use a personality "you are a file search specialist" and "your strengths" framing. I thought that was largely thought to be useless, or even counterproductive nowadays? Does anyone know more about this stuff?

by schmorptron

5/29/2026 at 1:58:39 AM

There's also things that have since been discovered:

* Ralph Wiggum loops

* Simply not allowing an agent to stop its turn until all tasks are marked as done

* Sub agents over worktrees

* Context compression

by zamalek

5/29/2026 at 5:52:46 PM

"GPT 5.5 has already been shown to be as adept as Mythos at finding vulnerabilities."

Do you have any data on this (other than benchmarks)?

by KerryJones

5/29/2026 at 6:19:30 PM

First result: https://arstechnica.com/ai/2026/05/amid-mythos-hyped-cyberse...

by zamalek

5/28/2026 at 8:15:08 PM

In the Opus 4.7 release notes they mentioned intentionally making it worse at cybersecurity. [0]

This suggests that they're doing the same thing with Mythos now and the Mythos we get will be nerfed in that department?

Or more precisely, I think they'll have two versions of Mythos, and the scary one will probably continue to require a lot of paperwork.

https://www.anthropic.com/news/claude-opus-4-7

by andai

5/28/2026 at 6:08:46 PM

More interesting than that to me is "we’re working on developing and releasing models that provide many of the same capabilities as Opus at a lower cost"

Sonnet and Haiku look real outclassed for the price with current Chinese competition.

by ac29

5/28/2026 at 8:24:19 PM

So this is how they’ll remove access from Claude Pro to the biggest models. You would need at least a Claude Max subscription for the bigger than Opus models I bet.

by scuderiaseb

5/28/2026 at 9:02:28 PM

Anthropic's wants to sell us Claude Code with no model selection at all.

Opus seems to be overly eager of late to 'vibe' out entire solutions and build out things that you didn't ask for.

/goals is helping set the narrative that does it really matter if Sonnet and 3 Haiku agents got you to that end state...eventually...if its what you asked for?

For better or worse Opus is already handing off 80% of its work to background agents of Sonnet, Haiku, and likely a quantized Opus.

Want model selection? Pay for the API.

by F7F7F7

5/28/2026 at 9:11:40 PM

Just tell it to always use opus for subagents and it does.

by comboy

5/29/2026 at 2:07:10 AM

This. I added that instruction the first and last time I was gaslit by an underpowered subagent.

by clbrmbr

5/28/2026 at 10:17:20 PM

Its amazing how quickly ive just become accustomed to being a max subscriber. I dont think I could go back to pro.

by swalsh

5/29/2026 at 4:46:29 AM

Then max+, then ultra, then ultra pro

by galkk

5/29/2026 at 6:13:34 AM

As long as they provide the same utility / $ I don’t see why not. It’s not like the open weight models are that far behind and Claude code itself shouldn’t be very hard for the commmunity to replicate if Anthropic start acting up too much.

by stefanfisk

5/29/2026 at 12:43:22 AM

They have already been experimenting with such ideas [1]:

> Claude Code Removed from $20-a-Month "Pro" Subscription for New Users

[1] https://news.ycombinator.com/item?id=47855832

by selcuka

5/28/2026 at 5:42:48 PM

Seems like they might be hinting that if you are not a billionaire or multi-billion dollar company you will just get a limited and nerfed Claude Code slash command /mythos-security-audit or something.

Hope this isn’t the case and that normal average Joe’s of the world don’t get policed out of access.

by TIPSIO

5/28/2026 at 6:16:17 PM

> you will just get a limited and nerfed Claude Code slash command /mythos-security-audit or something.

Unless it's so expensive that we can't realistically use it for anything, I wouldn't complain about getting at least that. I would also rather have the actual model, but that's a useful application of it (and I'm probably not going to afford using it for much more).

by gs17

5/28/2026 at 6:49:18 PM

Price discrimination is I think fine and reasonable so long if you can drum up the cash you can use it how you want within their ToS.

Although mental safety gymnastics aside, getting the most amount of intelligence for the cheapest amount of cost to normal people seems like the most ethical thing a big lab could do.

Going around and granting different tiers of intelligence to different insiders, friends, or companies is majorly problematic long-term.

Heck right now, the tokens you buy today for “Opus 4.8”, no one even knows or believes will be the same “Opus 4.8” just 3 days from now.

by TIPSIO

5/28/2026 at 6:48:45 PM

some of the bench marks i have seen on also include cost where one scan of the codebase cost tens of thousands of dollars.

this one [0] notes one run cost $20k to run but another cost $50.

[0] https://red.anthropic.com/2026/mythos-preview/

by vorticalbox

5/28/2026 at 6:49:02 PM

/security-review already exists so I don't think it would be crazy to have a /mythos-security-review as more thourough command as well. I think it's more likely it is going to be released at some point to the general public though - although the the pricing might make it quite unattractive.

by FinnKuhn

5/28/2026 at 8:34:17 PM

you mean /security-review ultra, given their current way of handling commands

by Yiin

5/28/2026 at 8:18:23 PM

What does an average Joe need a Mythos level model for that Opus can't do for them?

by dbbk

5/28/2026 at 8:57:07 PM

Access to intelligence is going to become a major class issue overtime if cost keeps increasing and labs try to police usage and access

by TIPSIO

5/28/2026 at 8:23:49 PM

It's not just better at cybersecurity, it's better at all the things (or most of them). I for one would really benefit from a better claude code. I still have to babysit it pretty closely to keep it from messing things up. Opus 4.7 was not an upgrade for me.

But in general, what does the average Joe need Opus for that Sonnet or Haiku can't do for them? Better is better.

by freedomben

5/29/2026 at 2:15:22 PM

Opus never really messes anything up for me. You just need to tell it to follow TDD.

by dbbk

5/28/2026 at 6:06:00 PM

It does sound like an even higher API price tier for sure.

by Tepix

5/28/2026 at 6:02:35 PM

Isn't OpenAI's public flagship already beating Mythos on penetration testing? I get the impression Mythos is just valuation-juicing for IPO more than anything else.

The fact that they haven't released it yet suggests a cost/margins issue to me more than anything else. Short term, I'll probably keep using Antrhopic, but my long-term bet is that locally-served models win, if only because the quest for profitability will probably lead to intentionally-nerfed / enshittified frontier models.

At other vendors, ad placement within LLM responses is either coming or already here. Anthropic's handling of OpenClaw shows they're willing to engage in anti-competitive behavior, and the courts are not in a hurry to stop them. Why would I pay them $200 a month for such treatment when a $2K box does what I need locally?

by hedora

5/28/2026 at 9:24:32 PM

Please link to the $2k box that gives Opus level performance!

by senordevnyc

5/28/2026 at 7:23:39 PM

What benchmarks are you referencing that show a comparison of the models for penetration testing?

by srmatto

5/28/2026 at 8:07:44 PM

Mythos is dramatically better specifically at finding zero-day vulnerabilities and developing exploits for them, that being what it was designed to do. On other cybersecurity tasks, GPT-5.5 is at least as good, but finding and exploiting zero-days is a particularly scary capability, which is why Mythos is a big deal. See, e.g., https://forum.effectivealtruism.org/posts/8yztpbjuPkyXsmA6n/....

by ameliaquining

5/28/2026 at 8:49:36 PM

AFAIK, Antropic claims that they weren't aiming for zero-days specifically. From https://red.anthropic.com/2026/mythos-preview/ :

  We did not explicitly train Mythos Preview to have these capabilities. Rather, they emerged as a downstream consequence of general improvements in code, reasoning, and autonomy. The same improvements that make the model substantially more effective at patching vulnerabilities also make it substantially more effective at exploiting them.

I've been assuming that Mythos is just a big jump in model size, and that's where the jump in capabilities comes from. Hence I expect OpenAI not to be able to catch up without scaling up the model and hence significantly raising the API prices.

by stratos123

5/28/2026 at 8:46:58 PM

Anthropic frames this as something emergent. Not 100% but in a way they always phrase it as like, it’s a great model, but our breaths were swept and taken with its approach to security.

by alexgoodhart

5/28/2026 at 8:33:12 PM

This command would be not so bad for not a billionaire me.

by kdmtctl

5/29/2026 at 11:32:14 AM

I'm still not sure what safeguards they can be adding here. Unless they've suddenly solved alignment, at best isn't it a collection of system prompts saying what not to do and potentially some screening algorithms that try to catch key phrases in inputs/outputs?

by _heimdall

5/28/2026 at 5:51:43 PM

[dead]

by huflungdung

5/28/2026 at 5:06:00 PM

I generated pelicans riding bicycles on both thinking level low and thinking level high:

https://gist.github.com/simonw/68560eddb0b268a8417f80ceb7304...

The high one is notably better - the bicycle frame is the correct shape, unlike thinking level low.

For comparison, here's Opus 4.7: https://gist.github.com/simonw/afcb19addf3f38eb1996e1ebe749c...

by simonw

5/29/2026 at 12:11:57 AM

It's pretty safe to say that AI will be used on the battlefield making real life and death decisions before it will be able to render a decent pelican on a bike in SVG.

by keyle

5/29/2026 at 2:50:38 AM

It already has been and this has been widely written about. AI was used to identify and prioritize targets for the US to bomb in Iran.

Here's an article from 2 months ago for example: https://www.theguardian.com/technology/commentisfree/2026/ma...

It was also implicated in the bombing of a girls elementary school which left 168 dead. The US did a "triple tap" to kill any first responders.

https://www.theguardian.com/news/2026/mar/26/ai-got-the-blam...

https://www.theguardian.com/technology/2026/apr/01/dont-blam...

by culi

5/29/2026 at 3:38:11 AM

I read the article and it doesn’t say it was used for targeting or prioritizing?

> Neither Claude nor any other LLMs detects targets, processes radar, fuses sensor data or pairs weapons to targets. LLMs are late additions to Palantir’s ecosystem. In late 2024, years after the core system was operational, Palantir added an LLM layer – this is where Claude sits – that lets analysts search and summarise intelligence reports in plain English

There’s a lot of humans in that loop who make those decisions.

by dmix

5/29/2026 at 6:56:29 AM

Yeah militaries don't use commercial chatbots for that, they have their own machine learning implementations. Look into Project Maven for example.

And while there are still humans in the loop, the impression I get is that this is increasingly becoming meaningless, from the way they talk about optimizing the "kill chain" and letting small teams make hundreds of targeting decisions per hour.

by saint_yossarian

5/29/2026 at 11:23:28 AM

“US Military Using Claude to Select Targets in Iran Strikes”

https://futurism.com/artificial-intelligence/claude-anthropi...

by an0malous

5/29/2026 at 4:25:15 AM

First link says

> AI is ‘identifying and prioritising targets, recommending weaponry and evaluating legal grounds for a strike’.

by culi

5/29/2026 at 6:06:34 AM

It doesn't specify which "AI" though.

These days that pretty much means "somebody used a computer".

by simonw

5/30/2026 at 1:39:14 AM

The first link is a reader letter to a piece they published. The original piece is the second link in my comment. It has more information

https://www.theguardian.com/technology/commentisfree/2026/ma...

> The paradigm shift has already begun. Despite the row, Anthropic’s Claude has reportedly facilitated the massive and intensifying offensive which has already killed an estimated thousand-plus civilians in Iran. This is an era of bombing “quicker than the speed of thought”, experts told the Guardian this week, with AI identifying and prioritising targets, recommending weaponry and evaluating legal grounds for a strike.

by culi

5/29/2026 at 11:12:03 AM

“US Military Using Claude to Select Targets in Iran Strikes”

https://futurism.com/artificial-intelligence/claude-anthropi...

It cites the WSJ but that article is paywalled so I shared this one

by an0malous

5/29/2026 at 12:55:14 PM

This later story suggested it was Palantir's Maven, not Anthropic's Claude: https://www.theguardian.com/news/2026/mar/26/ai-got-the-blam...

by simonw

5/30/2026 at 1:40:20 AM

Maven is not an LLM. Maven is software that uses LLMs. Mostly notably Claude

by culi

5/29/2026 at 9:49:55 AM

I think it's beyond decent. I don't understand how people are not more impressed by this. Just a few years ago the only expectation would be garbled nonsense.

by Kiro

5/29/2026 at 1:41:01 AM

the battlefield sounds much easier. worst case scenario you kill somebody, but that's what you're trying to do anyways.

if you kill somebody while trying to render a pelican on a bicycle it's a real problem.

by notatoad

5/29/2026 at 5:07:43 AM

"shift left" on the battlefield. break down those silos. if you have to ask for permission it's already too late. remember the goal. find the bottlenecks in your system and remove them.

by ares623

5/29/2026 at 1:52:25 PM

In many battlefield scenarios, there is more than one "somebody" on it. The "somebody" that you kill might not be the "somebody" that you intended to kill.

Depending on the how pelicans are created, it is entirely possible to indirectly kill "somebody" due to the externalised costs of global warming etc.

by pwagland

5/29/2026 at 8:32:34 AM

Haha, yeah. I tried for it to create a SVG with scissors and it was hopelessly overwhelmed. I think at least the SVG design niche will be safe a little while longer

by Markstar

5/29/2026 at 3:29:10 AM

I think that's a fair tradeoff. There's no way I'm going back to writing code by hand again. No one deserves that.

by ares623

5/29/2026 at 4:12:10 AM

Heh? How long were you writing "code by hand" before?

by keyle

5/29/2026 at 5:01:25 AM

Years and years. It was horrible. No number of misidentified targets will make me go back.

by ares623

5/29/2026 at 6:00:46 AM

It doesn't sound like you're in the industry you want to be in.

by keyle

5/29/2026 at 11:20:58 AM

Maybe all along what mattered most to them was making good software that people love, not the day to day part of writing code. Now it’s the industry they’ve always wanted, and less the industry of people who wanted to get paid to write code.

Software engineers who never cared about the higher level product design aspect are finding themselves in the wrong industry. It’s dismal.

by hombre_fatal

5/28/2026 at 6:12:00 PM

> the bicycle frame is the correct shape

No, the handlebar is wrong. The handle bar is rotating the frame instead of rotating the front wheel. The handle bar should be mounted on the same line as the front wheel is.

Hopefully 4.9 will read my comments :)

by GistNoesis

5/28/2026 at 6:36:02 PM

Could be an extremely high angle stem that just happens to match the downtube angle.

by loeg

5/28/2026 at 10:38:46 PM

Maybe the pelican is just riding a road bike/gravel bike

by Venkatesh10

5/28/2026 at 8:30:23 PM

I bet someone shares this link every time you post about bicycles, but since I didn't see anyone share it yet in this thread, I'll take the opportunity to do so:

https://www.gianlucagimini.it/portfolio-item/velocipedia/

Turns out even humans can be pretty bad at drawing bicycles :)

by eminence32

5/28/2026 at 9:31:37 PM

On a new model release, you can guarantee two things are in the replies to Simon. One is your link, the other is "surely the models are being trained on this now"

by walthamstow

5/28/2026 at 11:44:01 PM

Sure, but no one is trying to force art from most people into about every area in the economy where anyone ever pays for something visual. If you asked professional artists to draw a realistic bicycle, I'm guessing few of them would try to just randomly guess what the mechanical parts looked like

by saghm

5/28/2026 at 8:55:59 PM

But if you need to draw a bicycle, you wouldn’t pick a random person in the street. You would hire an artist and you’d be guaranteed to have at least a believable one if not a perfect rendering.

No guarantees is why LLM is akin to gambling. Every new context is essentially picking someone out of the crowd.

by skydhash

5/29/2026 at 1:58:58 PM

As an aside, some of the renders have only a single side connection to the wheel and that is a valid bike design, the Cannondale Lefty front fork only has a left leg:

https://duckduckgo.com/?q=cannondale+lefty&iar=images&t=ffab

by jodrellblank

5/28/2026 at 10:30:37 PM

> The most unintelligible drawing has also the most unintelligible handwriting. It was made by a doctor.

Haha

by kvirani

5/28/2026 at 5:20:35 PM

Glad to see that the "high thinking" level adds a helmet. Always a smart choice.

by jonas21

5/28/2026 at 10:58:54 PM

And yet some people doubt Anthropic's commitment to AI safety

by usef-

5/28/2026 at 7:46:39 PM

Here's pelicans in all of the thinking levels - low, medium, high, xhigh, max

https://tools.simonwillison.net/markdown-svg-renderer#url=ht...

by simonw

5/29/2026 at 9:40:48 AM

Max seems to me to be notably better than the others.

by virgildotcodes

5/29/2026 at 12:02:13 AM

low: yolo

medium: redesign bike so peli can reach bars

high: redesign bike so peli can rest on frame

xhigh: yolo

max: big peli reach bars

by motza

5/28/2026 at 9:39:39 PM

I like the way the max pelican has a stern look on his face

by ionwake

5/28/2026 at 7:53:21 PM

Is the output on the max level meant to be missing?

by stratos123

5/28/2026 at 7:55:33 PM

I just fixed that (force refresh). It hit my default 8,000 output token limit, it worked when I bumped that up.

For max I used 25 input, 17,167 output which cost me 43 cents! https://www.llm-prices.com/#it=25&ot=17167&ic=5&oc=25&sel=cl...

by simonw

5/28/2026 at 5:32:08 PM

You've peed in the pool Simon, this has to be a part of the internal evals by now! You got to try something new - maybe a panda in a canoe?

by spmartin823

5/28/2026 at 6:02:42 PM

If these were in the internal evals then the output would be much better. The 4.8 pelicans are pretty meh

by phainopepla2

5/28/2026 at 6:03:44 PM

Click the link

by HDThoreaun

5/28/2026 at 5:33:45 PM

I really like that thinking level high gave the pelican a helmet.

by ceroxylon

5/28/2026 at 5:23:10 PM

Hey simonw I love your test, do you think using thinking level "max" makes sense for this test? I would love to see the results about it.

by Xunjin

5/28/2026 at 6:59:04 PM

I don't think the API supports "max" as an option, that might just be a Claude Code harness thing.

UPDATE: My mistake, the API does support max. I added a max one at the bottom of this page (cost 43 cents): https://tools.simonwillison.net/markdown-svg-renderer#url=ht...

by simonw

5/29/2026 at 2:51:38 AM

The legend.

by Xunjin

5/28/2026 at 5:15:00 PM

Simon, is your pelican test really captures differences among models or should you at least try like 10 times or something to average the random effects

by yanis_t

5/28/2026 at 5:15:45 PM

I've been meaning to do a "run 3 times and pick the best" version for quite a while, I should really pull the trigger on that one. Currently it's one-shot only.

by simonw

5/28/2026 at 11:19:01 PM

You could run 3 times and overlay/average the images to show how consistent they are

by notaharvardmba

5/28/2026 at 5:48:44 PM

Best-of-3 would be cheating, ruin the test, middle of 3 makes more sense

by xiphias2

5/28/2026 at 6:23:15 PM

Why would you need the 3rd run if you pick the "one in the middle"?

by nik736

5/28/2026 at 7:47:48 PM

Middle as in not the best, and not the worst. As opposed to the second generated in sequence.

But not the best/not the worst is somewhat subjective.. so not sure how well that would work.

by jmaw

5/29/2026 at 2:00:36 AM

I think GP meant picking the median pelican

by BrokenCogs

5/29/2026 at 1:56:42 PM

Sadly I think the correlation between this benchmark and performance is starting to break down imo. Still a legendary idea will be remembered and ingrained in the models forever haha

by lysecret

5/28/2026 at 6:53:57 PM

The vast majority (if not all) of these make it impossible to turn, among other fun things. Only out of curiosity, have you tried prompting further with how a bike must operate to see if it does the right thing?

by silisili

5/29/2026 at 2:57:38 AM

tried it myself, not much of difference

https://gist.github.com/fendy3002/3026a8c4d67d1301666ec40fc0...

looks like the model already trained well on both bicycle and pelicans

by fendy3002

5/28/2026 at 5:18:11 PM

That little red hat on hard mode is sending me. 4.8 has whimsy

by 1attice

5/28/2026 at 6:12:21 PM

I find the most miraculous thing about 4.7 to be that the pelican is facing left, wonder why the right facing everything is so ubiquitous in these images.

by toastmaster11

5/28/2026 at 6:49:17 PM

This happened to me in elementary school. We were doing fingerpaintings using plasticine. After all the bikes were hung on the wall, mine was racing the other way... Somehow it really stuck with me.

by i000

5/28/2026 at 9:27:32 PM

What do you think it means?

by sunnybeetroot

5/28/2026 at 6:35:38 PM

It's facing left but looking right...

by gboss

5/28/2026 at 7:02:53 PM

Profound political commentary?

by toastmaster11

5/28/2026 at 8:01:04 PM

[dead]

by tancop

5/29/2026 at 9:28:51 AM

It's funny that we've reached the level where LLMs draw more correct bikes than any random person

by alex_duf

5/28/2026 at 6:43:17 PM

Eventually the frontier model folks are going to pick up on your pelican on a bike test and bake-in flawless results for that particular request.

by whalesalad

5/28/2026 at 9:38:55 PM

I actually like the 4.7 the most, interestingly enough. Not like you can "objectively" weight artistic output like this.

by impalallama

5/28/2026 at 10:41:24 PM

I don't see how a frame without a headtube can be "the correct shape".

by prmoustache

5/28/2026 at 8:47:11 PM

For comparison, what's GPT-5.5 producing today?

by fragmede

5/28/2026 at 9:49:41 PM

The reasoning xhigh one is pretty solid: https://simonwillison.net/2026/Apr/23/gpt-5-5/#and-some-peli...

by simonw

5/28/2026 at 11:08:33 PM

Lends credence to my vibe-based assertion that GPT-5.5 > Opus 4.7 (and now 4.8), which is why I've cancelled my Claude plan. Opus 4.8 is them seeing it reflected in their own numbers and having to pull stopgap measures to avoid falling behind while they embargo Mythos.

by fragmede

5/28/2026 at 6:00:58 PM

thanks for always providing this very much on time. I'm wondering what the next, harder challenge could be? Maybe some animated svg?

by timsuchanek

5/28/2026 at 5:11:32 PM

Is the "opossum riding an e-scooter" benchmark in the works for Opus 4.8? ;)

by nickvec

5/28/2026 at 5:19:03 PM

Good call, it's cute: https://gist.github.com/simonw/68560eddb0b268a8417f80ceb7304... - but nothing like GLM-5.1: shttps://static.simonwillison.net/static/2026/glm-possum-esco...

by simonw

5/28/2026 at 6:07:36 PM

Am I allowed to say that pelican's little helmet is adorable? I can't provide a strong computational proof, or even a shred of anecdata...

...but that pelican's little helmet is adorable.

by highwaylights

5/28/2026 at 5:09:59 PM

4.7 reigns supreme IMO.

by onlyrealcuzzo

5/28/2026 at 8:26:25 PM

Early ArtificialAnalysis.ai results show GPT 5.5 is still the better bang-for-your-buck.

OpenAI solves tasks with about 50% less output tokens.

https://artificialanalysis.ai/?intelligence=coding-index&int...

by hereme888

5/28/2026 at 8:44:59 PM

I give Codex a try with every new version, and we don't match, so this isn't true for everyone.

Claude would need to be much more expensive for me to switch.

by cesarvarela

5/28/2026 at 9:24:49 PM

People be saying these things with certainity. 99% of the time one has just inspired more confidence through sycophancy, or just good varience in outputs for a session/prompt.

Slop heads be swearing by one slot machine one week and swearing it off the next like an addicted gambler describing their favorite slot machines from week to week.

This isn't a coincidence, these companies hire UX designers from mobile gaming and online gambling to help engineer their addictiveness.

Its all in your head, and the output is no matter what always going to be worse than learning how to do something yourself and putting care into it.

Handmade watches > mass manufactured watches. There's nothing special about the skills needed for the guy who runs a conveyer belt at a watch manufacturer in China. The watch made by the guy who makes one watch a month in Switzerland is prized and beloved.

by ai_fry_ur_brain

5/28/2026 at 10:56:23 PM

So you're trying to convince a community of mostly engineers — using the example of a terribly outdated technology that stays overpriced purely through symbolization because the luxury industry's bubble keeps holding — that flashy looks, advertising, and fancy concepts aren't really beloved and worthy? Fascinating.

by asdewqqwer

5/29/2026 at 12:13:50 AM

> the guy who makes one watch a month

That's the thing, though. Most people alive today will never be able to possess such an object, no matter how prized and beloved it is. Still, if people want to be able to tell the time from their wrist in a reliable fashion, there are _plenty_ of far cheaper options available to them. The craftsmanship does have inherent value, yes. That does not mean the practical solution is worthless.

There can be practices incorporated in the production of software, involving AI use in a responsible fashion (difficult, of course), that produces practical solutions to real world problems far faster than a group of industry-hardened veterans painstakingly polishing their codebase in pursuit of craftsmanship. Those who appreciate how it is made will pay for the crafstmanship. Those who cannot afford to do so, and only care about a solution working well enough for the tasks they want to accomplish, the production line is good enough.

by sprinkly-dust

5/29/2026 at 12:29:06 AM

Codex with 5.4/5.4 is great Idk havent seen anything more crazy with claude + more expensive

by fHr

5/29/2026 at 2:48:22 AM

GPT 5.5 and 5.4 are such great models. I just tried opus 4.8 and took 30 minutes to be confronted with a bit laziness that makes me go crazy. 5.5 just doesn’t have this issue.

by mgambati

5/29/2026 at 6:57:51 AM

How do you compare them to 5.3 Codex? I am using 5.3 Codex for a while, I subjectively think it does better job than Opus 4.6/4.7, with a fraction of the cost, and I did give 5.5 a try and it seems a bit better but magnitudes more expensive.

by elAhmo

5/29/2026 at 10:45:04 PM

5.3 is good but talks like a robot, it’s too hard to understand what exactly it’s talking about. When using droid I use it to act like worker model and does a great job.

All 5.x models suffer from weirdness in the way it writes but 5.5 and 5.4 are much better and now offer a good balance, direct but without being like Claude.

by mgambati

5/29/2026 at 6:04:01 PM

I have the Max $100 plan and have never hit a wall; so the number of tokens I consume has never been a factor in how I use the models or which ones I use.

It's not magic, but for the value I get, I have no problem paying $100/month.

If I was forced to use API-usage pricing, then all of a sudden, switching models, limiting token usage, using "lower effort thinking" modes, etc., would become a thing.

by insane_dreamer

5/29/2026 at 12:10:33 PM

My 20$ OpenAI sub gets me the same as my 100$ Anthropic sub. It really is the better deal.

by hmontazeri

5/29/2026 at 12:59:08 PM

This is true but OpenAI has been slowly boiling the frog here, too. $20 and $100 on their plan doesn't get you nearly as far as it did two, three months ago.

I use their (newish) 5x $100 plan and I routinely run out of weekly limits about a two days before the end of the week.

This has also goaded me into upgrading to $200 once before... and then had them hand out limits resets to everyone. Argh.

by cmrdporcupine

5/29/2026 at 5:36:01 PM

I feel you. I also switched to $200 plan because I have agents running half of most days.

I was previously coasting on that 2x both Claude and OpenAI were offering.

by hereme888

5/29/2026 at 5:32:26 PM

same with $200 plan

they silently raised the costs

also feels degraded

by zuzululu

5/28/2026 at 4:55:35 PM

Does anyone troll these releases and cherry pick random metrics other companies would cherry pick to show how amazing their models are?

There's like 8 million benchmarks. Every release, every model randomly picks 5-10 where they win in everything except 1, to make it look like they aren't randomly cherry picking benchmarks they probably benchmaxxed for.

by onlyrealcuzzo

5/28/2026 at 5:08:48 PM

https://arena.ai/leaderboard - I’ve found this company is a pretty good ranker - not sure their exact methodology but during day to day programming with Claude / gpt models I’ve felt qualitatively what they report

by aronowb14

5/28/2026 at 5:53:23 PM

Have you seen https://deepswe.datacurve.ai/blog? This is the closest to a vibe check i’ve felt even with the open models.

by Bnjoroge

5/28/2026 at 6:42:26 PM

This actually looks like a really good test.

There are many benchmarks all for specific use cases but with them the difference seems to be in extreme points (93% vs 92%)

I think that, that tracks but still, it was refreshing to see a benchmark which I can help make better opinions about.

Surprised about Mimo v2.5, within artificial-analysis and other benchmarks, the difference between Mimo and deepseek seems very partial and a lot of focus/(hype?) is on Deepseek

But mimo seems like an interesting model and they are having some crazy discounts too.

Deepseek is valuable for the research community because of how open they are but absolutely crazy to think how Xiaomi basically pulled up in creating Mimo given that they didn't have anything till quite recently.

Either way, an interesting benchmark, also a plus point for giving golang some decent representation equal to python/typescript.

I think that there are sets of things which resemble something like normal benchmarks where open source models can be absolutely fine and for a very small fraction or more technical things, the benchmark that you linked starts to be better projected so it depends upon the scale of complexity but its good to see how models compete given enough complexity. definitely fascinating.

I would be interested to see more models compete on this test. The current range is still a bit limited as compared to other benchmarks but OSS models like Kimi/mimo seem to only be 3-4 (at max 6 months) behind closed source.

by Imustaskforhelp

5/29/2026 at 1:49:43 PM

Having used both Deepseek v4 Pro and Mimo v2.5 for agentic coding, I'm not surprised Mimo comes out quite far in front. It reflects my experience at least.

The recent hype is Deepseek is a combination of existing name recognition along with incredibly low pricing. Their v4 models, both pro and flash are incredible for their price. That's more revolutionary than Mimo which is multiple times more expensive, just like Kimi 2.6.

by GneojJ

5/29/2026 at 6:11:04 PM

Agree on both counts. Mimo seemed to have reduced their prices significantly so if it’s comparable to deepseek v4 pro, it’s a much better value

by Bnjoroge

5/28/2026 at 6:25:37 PM

Also check mine[0], basically random private tests/questions and an ok-ish methodology, testing mostly for general intelligence than coding-specific tasks.

I built it for myself, to test which models to use via OpenRouter for my n8n agents. Currently actually still using gpt-5.3-codex for many things, as its pricing is really good in production (due to how their token caching works).

Gemini models still have the best intelligence (when asked any questions, most likely to get it right), but in production they still have many failure modes[1].

[0]: https://aibenchy.com

[1]: https://news.ycombinator.com/item?id=48230368

by XCSme

5/28/2026 at 8:32:00 PM

Every model release you'll post this, and every time I'll be there to point out how it's completely useless (for reasons you've shared are intentional)

It does things like place the old Gemini 3 Flash above the more capable 3.5 Flash and Opus 4.5 - Opus 4.8 and gpt-5.5

At least, until hopefully one day HN has a rule about accounts that derive 99.9999% of their engagement with the site from shilling a personal project.

by BoorishBears

5/28/2026 at 8:54:31 PM

Also, what about the major flaw/bias linked for Gemini 3.5 flash? That has major real-life consequences if the model ends up being used for any automated scoring systems.

I found it while trying to use 3.5 Flash for scoring the reasoning of some models, and it gets it wrong because of the centering bias, whereas 3 Flash gets scoring right.

by XCSme

5/28/2026 at 8:49:47 PM

I'm happy you do comment, I did add more coding tests since then and add more improvements (price history per model, displaying cost to run at current pricing, improved scoring).

How is it useless to see that Opus 4.8 is 2x more expensive and 2x slower on some questions?

by XCSme

5/28/2026 at 6:35:31 PM

No way is Muse Spark generally better than offerings from Google and OpenAI. I actually find arena to be amongst the most useless indicators

by reckless

5/28/2026 at 10:02:16 PM

I think their "code" ranking is biased towards visual aesthetics more than raw coding as the voters are just asked which generated website they prefer.

by WASDx

5/28/2026 at 6:27:48 PM

I'm finding it a little hard to believe that GPT 5.5 is in 11th place for webdev, outranked by models like Kimi, Qwen, and Z.ai. I'm not saying it's not true (I have noticed GPT being less smart in recent weeks), but this is very different from my expectation.

by morley

5/28/2026 at 6:44:15 PM

On paper it's one of the best because it's meant to be blind comparison of your own prompts. However if you are someone who geeks hard on one or a few models, you learn their "personality" and can recognize them in a blind test.

by WarmWash

5/28/2026 at 6:48:13 PM

If you don't know their methodology, or anything about it why do you think its a good ranker?

by dakolli

5/28/2026 at 5:14:39 PM

It's interesting they only included 6 metrics this time. Opus 4.7 had 12, and 4.6 had 13.

Of the metircs they reported for 4.7, for 4.8 they excluded BrowseComp, CharXiv Reasoning, CyberGym, GPQA Diamond, MCP Atlas, MMMLU, SWE-bench Verified. The last 4 were almost always mentioned in previous Opus releases.

by nerevarthelame

5/28/2026 at 5:15:43 PM

Gonna assume it's because they barely budged or moved downward and most of their reported benchmark results are probably within sampling errors...

by onlyrealcuzzo

5/28/2026 at 5:23:33 PM

They will release a system card, and you can then confirm or disconfirm your assumptions.

by hyperpape

5/29/2026 at 11:20:07 AM

Their Cybergym score is reportedly awful because of the cybersecurity nerfing. https://x.com/i/status/2060046843023630841

by narrator

5/29/2026 at 2:11:48 PM

Ultimately I think the only way you can trust benchmarks is if you build them yourself and keep them secret from the AI labs.

There are different levels of "cheating" on benchmarks. The worst would be just literally putting them in the loss function during RL, I assume the major labs are not cheating at that level. And I am sure they are making a genuine effort to keep the benchmark content out of the training data.

But, ultimately it seems implausible that they completely abstain from benchmarking their model until they are about to release it. Even if they did do that, the benchmark is still ultimately a part of the outermost feedback loop. So these models are all, to _some_ degree, benchmark-solving machines.

I think all we can really do is live with the model for a while and develop a subjective feeling about its quality. This shouldn't be surprising, nobody believes that coding interviews work, we all know that you just have to work with someone to figure out if they're a good programmer. As AIs become more human like it's natural they should get harder to evaluate.

This is a bit awkward, it puts us in quite a weak position as consumers.

Maybe to some extent you can get a meaningful signal from sentiments on HN etc, but:

- There must be some amount of manipulation going on of this

- Even if it was fully organic, it's highly likely that your experience will differ materially from the median online nerd, because AIs are bizarre things that respond in unpredictable ways to intangible things.

by bjackman

5/29/2026 at 5:37:56 PM

> Ultimately I think the only way you can trust benchmarks is if you build them yourself and keep them secret from the AI labs.

I agree.

At the same time, one of the first things we see in the HN comments when a new model is released are pelicans on a bike. Makes you wonder where the priorities of the AI "community" lie when karma farming is the main motivation for model "evaluation".

by beernet

5/28/2026 at 5:52:45 PM

I would take all benchmarks with a grain of salt. I don't really use them. What's it supposed to tell me? "5% smarter", what does that mean? My experience will differ. Just try it!

I doubt Anthropic internally sets as a goal to improve this or that benchmark - it's just a way to visualize progress. They probably have much more complex metrics internally.

by ddosmax556

5/28/2026 at 5:16:55 PM

On this note, is there a benchmark aggregator to compile all benchmarks in a single large grid?

by bel8

5/28/2026 at 5:55:14 PM

I find this site useful https://artificialanalysis.ai/leaderboards/models

by jpadkins

5/28/2026 at 5:07:58 PM

At least they show competitors in any benchmark, compared to OpenAI which likes to pretend that there isn't any competitor.

by YetAnotherNick

5/28/2026 at 5:03:31 PM

On page 102 of the system card [1] I'm pleased to see evaluation against "creative mastery".

In our work we asked several frontier AIs to come up with an API we needed. We compared Opus 4.7 and GPT-5.5 (among others). Opus 4.7 came up with the most creative and intelligent API design that pleasantly surprised us, especially given that GPT-5.5 was passing it on various coding benchmarks.

What I noticed is that we don't have a commons benchmark to measure "creativity" and "ingenuity", and in some ways such a benchmark would conflict with the common IFBench benchmark. Yet this is a very important skill when designing systems. I'm glad to see Anthropic putting thought into it, and would love to see a public benchmark for this that other models could compare themselves to.

[1] https://cdn.sanity.io/files/4zrzovbb/website/c886650a2e96fc0...

by gslepak

5/28/2026 at 5:54:55 PM

Agreed, my vibes tell me 4.6 is a better coder than 4.7. 4.7 is a much better strategic thinker and maintains overall "better architecture" than 5.5. 5.5 is way better than either at coding, but more expensive. So I have 4.7 do the planning/architecture, 4.6 does the coding, then 5.5 critiques and fixes it.

by MattRogish

5/28/2026 at 7:16:39 PM

This is my exact vibesperience

by dimitri-vs

5/28/2026 at 6:37:50 PM

Agreed, these are my vibes too. It feels much better to do planning and strategy and architecture etc. with Opus 4.7 than GPT-5.5. GPT just feels like a robot that gets instructions and does exactly that. Opus feels like an almost human that sometimes has actually good ideas and pushes back on bad ideas.

So for now its planning/architecture/strategy -> Opus. Pure coding -> GPT.

Helps with agentic coding that GPT is much roomier with the tokens you get.

by suprfnk

5/28/2026 at 5:12:55 PM

There is a hole in the boat's bottom due to Chinese models. They might not be as good but they are not bad either or at least I had hard time finding any issues with Deepseekv4 Flash and Pro variants. They get their job done sometimes rarely giving up till they are done what they are after.

So even for enterprise deployments, as the dust settles down, CFO/CTOs might find out that deploying on an internal cluster of GPUs is far more cheaper and reliable for their organisational needs than paying someone else for burned tokens.

by wg0

5/28/2026 at 5:22:28 PM

I had been saying this on HN repeatedly: people are going to use the smartest models for coding. They don't care how cheap your tokens are if they don't have the highest probability of solving your programming tasks.

And I was dead wrong. Now I mostly use DeepSeek Pro myself.

by raincole

5/29/2026 at 10:04:02 AM

> people are going to use the smartest models for coding. They don't care how cheap your tokens are

I actually think that's still true and will continue to be true as long as someone else subsidizes the tokens. Once the "free money" runs out, things will get interesting.

by vb-8448

5/29/2026 at 12:04:31 PM

Including for DeepSeek you mean.

by jstummbillig

5/29/2026 at 2:29:08 PM

Yes including for DeepSeek. But while DeepSeek Pro doesn't run on other people's infrastructure, several other Chinese models you get competitors competing to offer them on price.

We'll see how it winds up, but we could see models get licensed over half a dozen+ compute vendors, and then you pick your price/offering/features favorite.

by Someone1234

5/29/2026 at 1:04:17 AM

Props for making a falsifiable claim, noticing it was falsified, and owning up to it.

by 6AA4FD

5/28/2026 at 5:56:26 PM

I pretty strongly feel the opposite way. Granted I have not used deepseek enough to “know” their model idiosyncrasies as well as Anthropic, so there is a partial skill issue. But I just find it really hard to justify using a less powerful model while I work.

The most I’ve ever spent in a month extra on API tokens for my own work is $200, and I pay for the $200/mo Claude. I use these models quite a lot, though not idly (I usually just walk around and do other stuff until I know how im going to approach the next set of problems). So it costs me about $3000/year to get as much as I want of the best model available. Already that seems low enough to not be worth stressing out too much about optimizing it, because it feels like an indisputable good value, and trying to save money with a less powerful model would be optimizing for a $1000-$2000 saving at the expense of a large portion of my work taking longer or being more frustrating and iterative.

That’s not a flex or anything, I get that in other countries $3000/yr is a lot of money for a software developer and also a lot of people would perhaps rationally be better off doing X% worse at work or spending Y% more time on tasks to save $Z, if their productivity improvements didn’t translate to more salary. Otherwise if your performance has more upside I really do think that the smartest models are better with the current pricing scheme. Deepseek and the other Chinese models spend a LOT of time thinking, and tend to be much more jagged (benchmaxxed) in performance. How can dealing with that over an entire year be worth $2k?

The only situation I can think of where sacrificing my own time/performance to save on inference is batch compute (of course, $1k vs $100k is different from $30 vs $3k) or work where the tier 2 models have crossed the “good enough” threshold. But I think Opus is not even close to that threshold generally yet. As it gets smarter I, and I think most others probably, just try to do harder things faster and hit the next wall.

by weitendorf

5/28/2026 at 9:09:36 PM

Not even SotA models are good enough to generate code (beyond functions or small, very simple modules) that I'd be happy shipping, so I've decided to just not have them do that. And given this, it has basically turned out that what's left is information gathering + analysis + design overview stuff.

I've just recently started trying out DeepSeek 4 Flash and I was very skeptical at first because I've had some really good experiences with GPT-5.{4,5}, and couldn't possibly believe that this model they charge nothing for could give me similar results, but it absolutely shreds through things and ends up giving me very good answers in almost no time. I also like that it doesn't really seem to have much personality, it's given me mostly just facts and data so far without any additions to the prompt by me.

In my own agent I also specifically prompt to remove flowery language, snark, etc., but I haven't tried it with models like GPT-5.x which I've found has too much personality and tries to make it seem like I'm talking to a human too much.

by 59nadir

5/28/2026 at 6:02:45 PM

I feel similarly. I'll gladly pay to use the most intelligent model I can find on the best harness I have. Sometimes this is GPT Pro, sometimes this is Opus.

I ask AI a lot of questions, not only about code but about my personal life, and I would be willing to pay very large sums to have the best quality output.

by solenoid0937

5/28/2026 at 6:00:29 PM

I think that's true for now, but eventually there will reach a point where a model is good enough (approaching that right now with frontier models) and there will be diminishing returns. I don't need a PHD level Genius to build me an analytics dashboard for example, so why would I pay for a model with that level of intelligence when I can (eventually) self host a good enough model and run queries for electricity cost + hardware.

by jhonof

5/29/2026 at 6:00:53 AM

I think we are approaching that now, with correct expectations. With frontier large models you can often one-shot tasks with vague prompts for stuff like creating CRUD APIs and dashboards around a simple data model since it's such a solved-problem now. With something like Qwen3.6 27B or 35B-A3B and a Strix Halo level computer or a MBP with 32GB or more or RAM, you may need to be more explicit and stay involved and be a little more patient, but you can absolutely get work done with it or delegate tasks to it successfully.

My Framework Desktop does a lot of similar work as my Claude subscription at work (Cowork, chats) for 100W of power draw and a little patience waiting for a slow GPU with limited memory bandwidth to crunch the numbers. Agentic coding is obviously weaker but CRUD development and visualization dashboards are within reach, and I'm usually pleasantly surprised at its ability to self-manage devops.

by evilduck

5/29/2026 at 2:47:23 PM

I agree. My company pays for my tokens so I use the best models I can. I'm more worried about the quality of the work and the speed of accomplishing tasks than I am on saving the most money on every token.

Now, if they come back and tell me I can't spend as much om tokens, I'll have to change my strategy. But everything I'm hearing so far is we're going to be increasing our token spend this year and probably next year too. Not crazy increases but maybe enough to still keep using the latest models without being anxious about every prompt.

by chrsw

5/28/2026 at 6:35:42 PM

I thought the same way until I tried DeepSeek. I am genuinely impressed at how capable it is.

by surgical_fire

5/28/2026 at 6:11:08 PM

You pay $3k/year for personal use? Or out of your own pocket but for your job?

by SoftTalker

5/28/2026 at 6:54:29 PM

It's through my startup, so both I guess. Generally I find my bottleneck to be attention and focus, and the opportunity cost of not going back to work at my prior employers absolutely dwarfs the amount of money I spend on tools, so it's not hard for me to justify spending $200/mo on something I use every day that makes me more productive and generally removes bullshit from my life.

At my prior job there was still what felt like a strong enough correlation between my actual performance and my pay that I don't think I would have had a hard time justifying the expense there either; now I absolutely don't. With the current state of the models, it's baffling to me to hear about professional software developers planning their work around their $20/mo subscription's quotas.

Obviously it's more complicated than more tokens = more productive, but I see them less like SaaS and more like gasoline, where if I run out or need more to do what I'm doing, as long as I'm not being wasteful, I just buy more. Why would I waste a day walking 30 miles by foot when I can just pay $5 for gasoline and drive?

by weitendorf

5/29/2026 at 11:13:43 AM

I started paying $100/month a few years ago to now ~$5k a year out of pocket for personal use to learn and grow in my position at work.

by pizzafeelsright

5/28/2026 at 7:29:04 PM

I do that for personal use too (although $2.4k/yr for me because I only have an Claude Max subscription). Outside of my hobby projects Opus also manages my personal accounting, researches and organizes info (travel plan, what to buy and where to buy, etc), helps me reply to emails when I'm working in the kitchen, etc. I consider it well worth the price. Tbh I'm willing to pay more than what I currently do, but competition is good for the consumers.

by yyhhsj0521

5/28/2026 at 5:44:27 PM

I think two things happened:

1. The sheer number of tokens that a coding agent can use flipped the math upside down on this equation. If you use the most expensive model for everything those costs quickly become untenable, even for software companies.

2. We realized many of the coding problems we're solving aren't incredibly difficult.

by dcchambers

5/28/2026 at 5:54:50 PM

The other thing that's changing is more and more CFOs are looking at the AI spend in engineering departments and hitting the brakes. Token leaderboards were cool when the spend wasn't a double-digit-percent of the entire department's budget including salaries.

by simplyluke

5/28/2026 at 8:24:03 PM

> And I was dead wrong. Now I mostly use DeepSeek Pro myself.

I've wasted over a hundred Euros re-doing work that was done badly due to the model not being up to task (Vue with TS + wrapper components around PrimeVue, needing to handle event and property passthrough and deal with the stupid Vue SFC issues, TS made this much worse than JS would be). I think it was the GLM model through Cerebras Code at the time, in addition to some GPT and Gemini models with the API pricing.

That said, DeepSeek V4 Pro is pretty good and I can totally see myself offloading some of the work, as long as a better model reviews the work and provides suggestions/tests for it.

by KronisLV

5/28/2026 at 6:56:34 PM

Your comment is a slice of the reasoning underlying the "AI will take all the jobs" claim. I would constantly see references to what AI could do and how fast it was improving. Never a word about cost. We should anticipate that there will always be demand for human labor, for cheap models, for local models, and probably even frontier models.

by bachmeier

5/29/2026 at 3:42:38 AM

You should try Composer 2.5 within cursor. It's so fast, shockingly fast. Going back to gpt/claude is like using dial-up. And it's great for code work. So far nothing has really tripped it up backend, frontend or reporting metabase dashboard stuff. It's nuts.

by sergiotapia

5/28/2026 at 6:36:33 PM

Yeah I've also found that models are good enough that the extra spend on premium models isn't always worth it, particularly for my small personal toy projects.

A $20 claude sub goes a long way when you plan with Opus and execute with Sonnet.

by jwitthuhn

5/29/2026 at 7:27:45 AM

you weren't wrong your tasks/problems didn't warrant a frontier model and it was always solvable with a cheap chinese model

doesn't invalidate the rest of us working on tough problems that demand more expensive models and valuable enough to justify it

by zuzululu

5/29/2026 at 6:55:42 PM

DeepSeek pro is frontier at this point.

by culi

5/28/2026 at 5:43:13 PM

I mean indsight is 20/20, but saying that is like saying "everyone will just use the best tools". That's not what we see most places in the world for most types of resources.

by peheje

5/28/2026 at 6:07:43 PM

> CFO/CTOs might find out that deploying on an internal cluster of GPUs is far more cheaper and reliable

I think you're right especially if you're someplace that already has a data center, such as a university. Solves a lot of privacy concerns as well.

by SoftTalker

5/28/2026 at 5:21:50 PM

Qwen3.6:35b is good enough for a lot of stuff.

I just used ollama with a shell script to tackle my directory of papers/literature. I converted the first 6 pages of each document to PNG, handed them off to Qwen, and told it to spit out BibTeX, including the abstract. Two days later it was done, and I didn't spend anything on "tokens."

by ok123456

5/29/2026 at 10:04:54 AM

Why PNG? Isn’t an image format more expensive to process?

by abyssin

5/29/2026 at 3:19:20 PM

Not really. The model is good/fast at OCR, and preprocessing it actually makes it worse because academic paper formatting is very complicated. Sizes, positions, and equations are important.

by ok123456

5/29/2026 at 6:57:26 PM

what a strange world we live in where robots are WORSE at handling formatted stuff. I wonder what this means for the importance of semantic HTML to screenreaders

by culi

5/28/2026 at 6:57:01 PM

I’ve been using Kimi 2.6, GLM 5.1 , Minimax 2.7 and lately deepseek. I only spend 40$ a month and I don’t see the point in paying for Opus/Codex.

Chinese models are really quite good at a lot of stuff.

by mariopt

5/29/2026 at 5:33:57 AM

Which harness?

by fittingopposite

5/29/2026 at 12:05:11 PM

I use opencode with all of them except Kimi, I noticed Kimi performs better with kimi-cli and also save a bit of quota.

Z.ai does recommend to use claude cli as a harness for GLM5.1, I still get good results with opencode.

by mariopt

5/29/2026 at 3:40:07 AM

Anybody know what the most capable Chinese model is that can be used in production and is cheaper than US frontier models? Would that still be Deepseek? My interest is getting as close to Gpt5.5 or Opus quality as I can get, but for less $.

by replwoacause

5/29/2026 at 6:58:48 PM

Depends what you want it for. Probably Qwen

https://arena.ai/leaderboard

by culi

5/29/2026 at 10:01:29 AM

Possibly a deliberate strategy by the Chinese to undermine the US AI industry, data centers, and basically everything that’s powering the economy.

Just like they did with the US steel industry in the 80s.

by raylad

5/28/2026 at 11:51:50 PM

The problem with going for open source models is that you are betting on some third party to keep doing expensive model training and releasing it for free, forever. What do you do if deepseek never release another update to the model?

by reppap

5/29/2026 at 6:03:52 AM

I continue to use the model I downloaded... for free?

by julianlam

5/28/2026 at 6:33:47 PM

I am having some great experience with DeepSeek. In fact, it seems to perform better than Claude or Codex in my use case.

I don't see myself returning to Claude or Codex anytime soon.

by surgical_fire

5/28/2026 at 7:19:34 PM

[dead]

by ihsw

5/28/2026 at 5:23:09 PM

The Chinese models are only cheap on subsidized Chinese hosting. I have yet to find a USA-hosted Chinese model with a very clear value advantage over US models.

by pants2

5/28/2026 at 6:12:25 PM

No true. Also - put Deepseekv4 Flash on your local with effort set to "high" and you'll see that many many are using that model on their own machines without paying anyone anything.

Its just that some of us didn't imagine having GPUs would be advantageous and were not gamers on the side. Those who had beefy GPUs or GPU rigs for any reason, they rarely need to go anywhere else.

At least I am so impressed with Deepseekv4 AFTER using Claude Opus 4.7 for significant amount of time that I am not going anywhere but Deepseekv4.

The model is just INSANE. Things I have done with it include attempting to write a 2.5D game engine in C with full animation and map rendering layer by layer.

by wg0

5/28/2026 at 6:35:41 PM

You'll need to spend at least $20K on a workstation that can run DS4 Flash. It would take ages to reach that much in token spend at the speeds it runs at, and if you factor electricity costs you will likely never break even vs using API.

by pants2

5/28/2026 at 7:07:57 PM

There are basically two tiers of "Chinese models" in this context, the "edge" sized ones with ~30B parameters or less, and the big ~1T models that can basically only run in the datacenter.

I don't think it's as simple as saying China's hosting is subsidized, they have generally cheaper electricity and labor costs than in the US and don't have access to the top tier models, and a large internal market where the big models are the best thing they can run with what they have. So obviously they max out on their top models (which are trained with their hardware market in mind, not ours) and get the economy of scale from that, and can run generally the same hardware for less money than in the US because

The edge models are very cheap to run and can do so on inexpensive hardware. They are like 95% cheaper to run than Haiku, so the math is in their favor for certain batch workloads. Most people just run the models for themselves when they do that without making it available on openrouter or whatever, because you can just provision a gpu node and use it as needed, and it's not that expensive to run this family of models.

Is your problem that you want to call Chinese models hosted in the US because you're worried about the data handling?

by weitendorf

5/28/2026 at 7:45:42 PM

I obviously don't know the full economics of the Chinese-hosted models, but estimates[1] put the cost of hardware (servers + networking) at 70-80% of the total cost. Those things aren't meaningfully cheaper in China, so serving DeepSeek at 1/3 the cost of the cheapest US provider doesn't really compute unless it's heavily subsidized or we believe that Chinese engineers are just that much better at optimization.

Edge models, yes, they can be convenient to run batch jobs locally. I still would argue there's no economic benefit over paying for models. Haiku has a bad price/perf but others in that class are significantly cheaper in hosted APIs.

Doesn't matter what I think, the reality is that the majority of enterprises (where the real $ comes from) will not consider sending their data to China.

1. https://epoch.ai/data-insights/ai-datacenter-cost-breakdown

by pants2

5/28/2026 at 10:17:20 PM

Hardware is arbitrarily priced, with the floor being as little money as it costs to make it, and the ceiling being how much competitors are willing to pay for it - the latter is much more of the driver of current pricing in the West than in China.

In a free market, the country would not matter, but Chinese models are often running on domestic hardware which does not directly compete with Nvidia GPUs and thus they can't get away charging as much for it.

by torginus

5/29/2026 at 5:36:42 AM

Numbers?

by fittingopposite

5/28/2026 at 5:47:29 PM

The Chinese models are surprisingly cheap and performant sitting under my desk. Qwen3.6 27B is nowhere near as autonomous as Opus 4.7, but it runs in 24GB of VRAM. And it's actually great for the use cases where I'm going to carefully read and understand all the code anyway.

If you want to support a team of engineers, DeepSeek V4 Flash is antirez's current favorite. And you could support a team of engineers pretty nicely for $40-50k. Which might not make sense if you're on a Claude MAX 5x plan or the old enterprise group plan with fixed price seats. But Anthropic is switching their enterprise contracts over to token-based pricing, at which point $50k is looking pretty good.

by ekidd

5/29/2026 at 4:19:57 AM

Fireworks will serve them for $1.74 / $0.14 / $3.48. That's input / cached input / output. https://fireworks.ai/models/deepseek-ai/deepseek-v4-pro . Call it about a third the price of Sonnet.

Not nearly as cheap as the Chinese infra but still pretty cheap.

by joshhart

5/29/2026 at 3:53:45 PM

Sure, but Sonnet is a pretty bad deal these days - that's a similar price to Gemini 3.5 Flash and more expensive than Grok 4.3, both of which are better and faster. Those both use less than half the tokens on the Artificial Analysis Intelligence Index which means they're probably more cost efficient for many workloads.

by pants2

5/28/2026 at 6:00:46 PM

You can find them on Deepinfra. Palo Alto company. Similar cheap price.

by harsh3195

5/28/2026 at 6:33:38 PM

Not similar. DeepInfra[1] has DS4 Pro pricing at $1.30/$2.60 which is 3X the Deepseek[2] (Chinese) hosting at $0.435/$0.87. DeepInfra is also very slow at 37 t/s and uses an FP4 quant[3], so intelligence will be degraded slightly.

Meanwhile you could use Grok 4.3 for the same price which is smarter and 5X faster[4].

1. https://deepinfra.com/pricing

2. https://api-docs.deepseek.com/quick_start/pricing

3. https://artificialanalysis.ai/models/deepseek-v4-pro/provide...

4. https://artificialanalysis.ai/models/grok-4-3

by pants2

5/28/2026 at 8:39:18 PM

DS4 Pro/Flash were post trained with QAT, so they are already quantized to FP4 for the most part. That's why when downloading the weights, they are much smaller than what their weights at fp8 or fp16 would be. For example, Flash is a 284B model, but its GB size is only ~160GB. OFC maybe DeeppInfra went even further, but there is no proof of that.

by wirybeige

5/29/2026 at 4:02:13 AM

Interesting then that OpenRouter[1] tags many providers as FP8 and DeepInfra as FP4.

1. https://openrouter.ai/deepseek/deepseek-v4-pro

by pants2

5/29/2026 at 4:49:47 PM

I presume the providers are the ones giving the info to OpenRouter? I mean, technically it is a mix of fp8 and fp4 (although it is predominately fp4), so I don't think either is inaccurate.

by wirybeige

5/28/2026 at 5:36:23 PM

Odd take. I'm running them locally at my desk (DGX Spark and 128GB MBP). They work fine for 90% of what most folks do. Admittedly, they do run slower on my hw than on the cloud.

by __mharrison__

5/28/2026 at 5:41:55 PM

Running them locally is cool and has privacy/autonomy benefits, but you can't really make a value case for it. Guaranteed if you run the math you will never run enough inference to pay off your hardware vs buying tokens. Last time I ran the math on my MBP I'd have to run inference 24 hours a day for 5+ years to pay off the cost of my MBP, not accounting for electricity costs.

by pants2

5/28/2026 at 7:55:00 PM

The value of not having a reliance on a third party company, and not needing an internet connection, and having total privacy: ∞

by slopinthebag

5/28/2026 at 8:29:23 PM

Just have to put some numbers on privacy and autonomy. What's the fine to my company if I get hacked and leak all my customer's PII? What's the cost in productivity lost if OpenAI/Anthropic/Google decides to suspend my account for an unknown reason?

by fragmede

5/28/2026 at 5:53:14 PM

Is this because of the tok/s? Since it's pretty easy to run up a $5k bill in API usage for Claude/ChatGPT in a month.

by iooi

5/28/2026 at 5:55:53 PM

Yes, because of the limits on tok/s, and you have to compare apples to apples, not Gemma 27B to Opus 4.7.

by pants2

5/28/2026 at 6:34:45 PM

Assuming the local models get the job done (e.g., you adjust your workflow so that you can run the local machine 100% all the time, or whatever), then the time to payback isn't very high. MSRP for a 128GB AMD was $1400 at launch. That's 7 months of claude code subscription. If you assume a 5 year depreciation cycle, you can buy a cluster of 8 such machines and still come out ahead. (Power is a few hundred watts per machine peak -- maybe 7 machines if you include electricity.) Of course, I'm assuming non-bubble numbers. Those boxes are like $3K now. Still, a normal person would probably not buy 8 of them at once. Instead, they'd space out buying a machine every few years as the technology improves.

For me, things are getting better faster than my ability to review / trust the resulting code, so tok/sec isn't a bottleneck anymore. Instead, quality of the tokens is the bottleneck. That points to me wanting a 1TB DRAM iGPU once they're available at pre-bubble RAM pricing.

by hedora

5/28/2026 at 6:53:42 PM

You're comparing the highest tier Claude subscription to something Qwen3.5-122B-A10B running locally, apples to oranges.

If you compare to a smarter US model like Grok 4.3, $1400 will pay for 560M output tokens, which at ~25 t/s locally using it nonstop for 8 hours a day would take two years to pay back. Not accounting for bubble prices or electricity.

by pants2

5/28/2026 at 8:05:23 PM

Is the goal maximum t/s?

According to openrouter, Opus 4.8 is 128 t/s. So 10x faster than my antirez/ds4.

by __mharrison__

5/28/2026 at 7:52:01 PM

Huh? They're several times cheaper than SOTA models at market rate prices.

by slopinthebag

5/28/2026 at 7:59:09 PM

If you are only looking at US hosting providers, models from US labs easily meet or beat models from Chinese labs on the same intelligence level. I'm not comparing DeepSeek with Opus because those are on different levels of performance.

by pants2

5/28/2026 at 8:31:48 PM

Deepseek v4 Pro on US hosting is like 1.5x cheaper and 5x cheaper on input/output compared to Sonnet, and that's not even a fair comparison because Deepseek is much stronger than Sonnet. It's more reasonable to compare with Opus 4.5, which is much more expensive.

by slopinthebag

5/28/2026 at 11:29:08 PM

Sure but you can also look at Grok 4.3, which is smarter and faster than DeepSeek at the same price point.

by pants2

5/29/2026 at 4:01:15 PM

I doubt that is the case

by slopinthebag

5/29/2026 at 8:42:11 PM

Grok 4.3 is smarter, cheaper, and faster than DS4 according to Artificial Analysis and OpenRouter:

DeepSeek v4 Pro

Intelligence: 52

Cost to run AA benchmark on Fireworks: 187M tokens @ $0.79/M blended cost = $147.73

Speed: 43 tok/s

Grok 4.3

Intelligence: 53

Cost: 88M tokens @ 0.64 blended = $56.32

Speed: 108 tok/s

by pants2

5/28/2026 at 5:43:18 PM

Unfortunately they seem to have straight up broken Claude Code either with this release in the backend or the new CC version. Errors about "can't modify thinking blocks" are bricking long-running sessions: https://github.com/anthropics/claude-code/issues?q=is%3Aissu...

by silverlight

5/28/2026 at 10:41:50 PM

This was happening even on the `stable` branch with 4.7

I managed to get claude to create a recovery script to un-brick sessions, YMMV

https://gist.github.com/robertfw/993dbe8643c4fbdf12005dff2ec...

by robertfw

5/28/2026 at 6:43:59 PM

That is part of the charm of working with Claude. Every time they release anything new - all your shit will break.

by whalesalad

5/28/2026 at 10:17:44 PM

In case it helps anyone, in some minor cases I was able to recover and continue with /rewind.

by defgeneric

5/28/2026 at 10:34:52 PM

They don't test CC updates before release. The testing is done by their own team using the product or public feedback.

by OkWing99

5/28/2026 at 7:17:53 PM

Same. It's not a good look to have happen right when they roll out a new model.

by javawizard

5/29/2026 at 12:52:16 AM

I found that quitting and restarting cc appears to fix this

by rarisma

5/29/2026 at 12:30:23 AM

Codex cli> claude code

by fHr

5/28/2026 at 5:59:01 PM

Try updating maybe?

by solenoid0937

5/28/2026 at 6:06:27 PM

I just installed/upgraded to try out 4.8 and in only 3 messages I hit this bug! Seems something is broken on CC.

by Fabricio20

5/28/2026 at 6:06:23 PM

I'm on the latest version (2.1.154 as of this comment). Based on the timestamps on those Issues being reported I think it's happening on the latest version.

I'm sure it will get fixed eventually/soon, just annoying to update and have your workflow break.

by silverlight

5/28/2026 at 8:35:53 PM

[flagged]

by wrs

5/29/2026 at 1:59:03 PM

As if choosing a model to use on its own is not hard, offering six levels of "effort" (quite a vague term as well), low, medium, high, xhigh, max, ultracode (?!?!) is really making comparisons next to impossible when people using the same model can have vastly different experiences.

What exactly is the diff between high and xhigh? Or xhigh and max? This is definitely too granular and it seems Anthropic took OpenAI's confusion with models as inspiration.

by elAhmo

5/29/2026 at 2:27:05 PM

Something I found helpful: In this article, scroll down to the first big image, which is a graph labeled “Agentic coding performance by effort level”. https://www.anthropic.com/engineering/april-23-postmortem

This convinced me to just always set 4.7 to xhigh. Admittedly not sure about 4.8.

by Wowfunhappy

5/29/2026 at 2:05:20 PM

They are doomed. Publishing small wins while they can.

https://open.substack.com/pub/sublius/p/srt-introspect-why-c...

by spacebacon

5/29/2026 at 3:30:02 PM

Feels like one of those knobs they throw in to make performance feel like a user skill issue instead of a product issue.

Why doesn't it know how much effort to use? How do I know how much effort to use? It's a mystery.

by jayd16

5/29/2026 at 2:23:02 PM

Probably limits the number of intermediate tokens one way or the other. Almost certainly the impact on the result is close to zero.

by thaanpaa

5/29/2026 at 2:13:48 PM

Not only this but hermetic checks on local machines for spot testing new models is becoming increasingly difficult, if not impossible.

- We have 0 visibility into what Anthropic does with our own prompts server side (do they return cached results from similar queries? Do we develop our own hot paths?).

- Local memory files are written independent of project directory and are acted on by the new models, even if old models wrote them

- CLAUDE.md files have varying degrees of efficiency and different models (and effort) treat them differently

- Our own git history "supports" newer models - ie if you have a larger body of work in git when you adopt a new model (like 4.8) than when you started from scratch with 4.6 or something, 4.8 may "appear" smarter when in fact you just have more evidence and signal about what you intend for a model to do.

by kkukshtel

5/28/2026 at 5:53:51 PM

On my tests[0] it does a bit worse, and it's almost 2x expensive than Opus 4.7...

I was surprised to see that it failed a Data extraction test (it gets it right 2/3 times, but one time it randomly returns null for a value instead).

It makes sense a bit that it fails more Trivia/Domain-specific knowledge tasks (I think models are more and more trained towards agentic use-case than general intelligence).

[0]: https://aibenchy.com/compare/anthropic-claude-opus-4-7-mediu...

by XCSme

5/28/2026 at 5:59:34 PM

For some reason everything is 2x (2x cost, 2x avg response time, 2x reasoning and output tokens)...

Double-checking my test harness, but it's the first model that does this, so I doubt the issue is on my side...

EDIT: Harness seems correct, for straight coding tasks they perform identical: https://i.snipboard.io/5xbpzY.jpg

by XCSme

5/28/2026 at 6:04:26 PM

Wait, doesn’t the blog post say the price is the same as 4.7?

> Claude Opus 4.8 is available everywhere today. Pricing for regular usage is unchanged from Opus 4.7: $5 per million input tokens and $25 per million output tokens. Pricing for fast mode is $10 per million input tokens and $50 per million output tokens.

Where do you see the 2x cost?

by dwaltrip

5/28/2026 at 6:10:34 PM

The total cost of running my benchmarks, was 1.6x higher compared to Opus 4.7, mostly because of 2x output tokens:

https://i.snipboard.io/vrdwTa.jpg

by XCSme

5/28/2026 at 7:36:51 PM

ah ok, thanks for clarifying!

by dwaltrip

5/28/2026 at 6:18:29 PM

If it spends 2x tokens to achieve the same result, that's effective 2x cost in a manner of speaking

by spprashant

5/28/2026 at 6:14:50 PM

Releasing a new model is the new way to Jack up the price hehe.

by SupLockDef

5/28/2026 at 8:49:25 PM

That's exactly right.

by eshack94

5/29/2026 at 10:50:23 AM

Meanwhile Deepseek is cutting inference costs to mere cents. Thats the real AI revolution for you.

by epitrochoid413

5/29/2026 at 11:23:32 AM

Yes I switched from claude code to opencode with deepseek recently.

It is basically indistinguishable from sonnet. At this point my own prompts, AGENTS.md, background docs and so on matter a great deal more than the differences between models.

And deepseek v4 flash (the sonnet comparable) costs 3% of what sonnet does.

by calpaterson

5/29/2026 at 11:27:24 AM

Same here. I use codex for planning and deepseek v4 flash for implementation. Its worked really well so far.

by epitrochoid413

5/28/2026 at 6:42:58 PM

Frontier models are mostly past the point of human ability to discern whether they are actually better or worse than predecessors and competitors. I suspect the benchmarks may also be saturated, or at least past their usefulness.

I personally feel that Anthropic doesn't understand what this means for the frontier labs, and moreover that they might be the only frontier lab that doesn't.

1. Google dropped Gemini 3.5 Flash at IO, delaying the release of 3.5 Pro for a bit (they have said its coming). They also released a refreshed Antigravity, and drew special attention to how cheaply they were able to build their toy operating system to play Doom (less-than $1000 IIRC).

2. OpenAI has dumped everything into Codex, is offering double the token limits for the next few weeks IIRC, and is offering business discounts. Their head of Codex has tweeted that 5.5 is "extremely efficient", implying that they aren't actually losing money on any of this.

3. DeepSeek and other Chinese labs have dropped token pricing to the floor, in some situations as much as 99%.

4. Anthropic releases the next generation of Opus, their most expensive public model, without changing its price. In the background, they hype up Mythos, an even more expensive model.

Anthropic has screwed up where they need to be making investments, and the cracks are starting to show. They've marginally underinvested in the Sonnet line of models for almost a year now, and they've critically underinvested in product. Anthropic made bets on the story of the second half of 2026 being: ultra-frontier, ultra-intelligence. In reality, what's shaping up is that the story will be: Companies rolling back AI spend, efficiency, "95% as good for 15% the price", sophisticated high quality harnesses, cheaper models. Anthropic isn't ready for this world.

by 827a

5/28/2026 at 6:52:10 PM

Anthropic’s story over the past year has been nothing but explosive growth that they can’t keep up with, but now they’re suddenly doomed? Seems pretty far fetched to me.

No idea why you’d say they have critically underinvested in product when Claude Code dominates and they’ve also released popular tools like Cowork and integrations for Microsoft products at an incredibly rapid pace.

Cost is becoming more of a factor, and no doubt they’ll work on that. There’s no reason to think they won’t be able to release cheaper models if they optimize for that rather than improving performance.

by brokencode

5/28/2026 at 7:27:33 PM

I never said they were doomed. Where did you get that idea? I said they aren't ready for this world. That means they screwed up and need to get ready. They let the Mythos hype get to their heads while the world changed beneath them.

by 827a

5/29/2026 at 12:29:07 AM

Yup, they’ve been screwing up all the way to the bank.

I agree that lower cost models will become a bigger priority in the near future, but I have to hard disagree that Anthropic’s strategy can be characterized as a screw up.

Sure, if they never shift with the market and their customers start moving to cheaper competitors, then it’d be a screw up.

But as of right now, producing the best coding model possible has led to insatiable demand. To the point where they’ve even eclipsed OpenAI, forcing them to change strategy to compete.

by brokencode

5/28/2026 at 7:12:27 PM

No, no it's been pretty easy with software engineering. I work on two types of projects and it's very easy to ask claude for a plan, then have gpt 5.5 rip it to shreds and find legit issues, and vice versa. If both 5.5 and claude 4.8 can independently create a plan and both find no critical or high issues, then we will be at that point.

by jonnycoder

5/29/2026 at 4:04:14 AM

I wouldn't say vice-versa is true. GPT 5.5 routinely finds major mistakes made by Opus 4.7, but I've yet to have it work the other way around.

by replwoacause

5/29/2026 at 12:51:26 AM

Additionally running GPT-5.5 on medium sometimes gives me better results than high mode. On any of them I still have to push the models in the right direction.

by elcritch

5/28/2026 at 6:58:51 PM

I think it's probably too soon to say. I certainly still feel that large coding tasks are getting better and better with each model. I'd guess lawyers, doctors, etc feel similarly.

It feels like the only way to push the limits of newer models is with really long context questions that require reasoning. Any short request will naturally just be within the distribution of all the recent models so there isn't a performance difference there.

I think the near future is looking like a bunch of business-critical tasks that scale infinitely with better reasoning, all being done on whatever the most advanced model is at a high cost. Trading stocks, running a business, looking for tax dodges, writing high-performance code. These are all things where there's a tangible return on each jump in reasoning.

by chis

5/28/2026 at 7:16:58 PM

We'll have to agree to disagree on that last point. I think that, historically (past ~6 months), "always use the most advanced model" being the norm is really just an artifact of both: The most advanced models oftentimes being the only model that can solve these problems; and: Infinite AI budgets.

by 827a

5/28/2026 at 6:54:44 PM

The Chinese stuff is good enough for up to 80% of the frontier on most text tasks but they are significantly worse at code. They just don’t “get” what you’re asking for like Codex and Claude and require so many more iterations to get close to what you need.

by dyauspitr

5/28/2026 at 7:04:27 PM

Agreed. But we're seeing Cursor (now SpaceX) take these models and add great coding capability on top of them. Frontier model providers should be concerned that Composer 2.5 costs $0.50/$2.50 (versus Opus 4.8 $5/$25). That's why Google prioritized Gemini 3.5 Flash, and talked up how near-frontier it is ($1.50/$9).

by 827a

5/29/2026 at 11:55:16 AM

> Frontier models are mostly past the point of human ability to discern whether they are actually better or worse than predecessors and competitors.

The model improvements being beyond human comprehension is one of the more ridiculous statements I’ve heard in the last couple of days about AI. We could reason about Higgs bosons and gravitational waves but have no ability to quantify or reason about the difference between Opus 4.7 vs 4.8.

by geraneum

5/29/2026 at 12:18:32 PM

I definitely believe that you can discern differences between Opus 4.6, 4.7, and 4.8. I might also believe that you believe that you can discern improvements between Opus 4.6, 4.7, and 4.8. But conclusively, consistently, scientifically, and blindly discerning improvement is at this point restricted to problem domains that represent a vanishingly small amount of global token usage, like Erdos problems, superhuman evals, and the like. The idea that typical line of business use-cases have seen broad and measurable improvements since even Opus 4.5 but certainly 4.6 is mostly an illusion that confuses improvements in the harness for improvements in the model, as well as confuses "its different" for "its better".

To be clear, again, cannot stress this enough: I am NOT saying that the models have hit a limit. I am saying that the complexity of the problems most businesses throw at them have always had a limit. The models are now so intelligent that we have not, as of yet, adapted our business use-cases to make use of the new levels of intelligence. Maybe we will.

by 827a

5/28/2026 at 8:28:30 PM

Tried using everything that isn't Claude and I keep switching back to Claude because even the smarter models give me uglier code, or miss common sense requirements. (And the dumber models give me code that doesn't work properly).

I keep trying to switch to something else but I keep coming back. (Typically after a few days of giving a new model an honest go, and finding myself constantly asking Sonnet to fix its output... Yes, even Sonnet wins on this front! They really do have some kind of special sauce.)

I'm not where most of their money comes from though, and I don't know how universal my experience is.

by andai

5/28/2026 at 8:33:50 PM

I'm a bit confused about what point you're trying to make.

Because you seem to be saying that Anthropic not changing the price of Opus is bad, but then two of your positive examples are Gemini 3.5 Flash (which tripled the 3.1 Flash token prices) and GPT-5.5 (which doubled the GPT-5.4 price, and is slightly more expensive per token than Opus).

Is your argument actually that price hikes are good? That doesn't seem to fit with the general tenor of the message.

by jsnell

5/28/2026 at 9:34:46 PM

>Frontier models are mostly past the point of human ability to discern whether they are actually better or worse than predecessors and competitors.

Yeah nah, the models' flaws are pretty obvious when you use them. And as a user, you can absolutely know when a flaw disappears or barrier is cleared.

by AussieWog93

5/28/2026 at 10:01:10 PM

This post is proof that people will complain about anything, even if its the most successful startup of the past decade.

by greenavocado

5/28/2026 at 11:01:50 PM

You're not successful until you exit. And, of course, there's always room to be more successful.

by 827a

5/28/2026 at 7:03:22 PM

I thought 4.7 was noticeably better than 4.6.

by loeg

5/28/2026 at 8:32:12 PM

thats a pretty cynical take. > past the point of human ability to discern whether they are actually better or worse

This is lack of imagination. If you use these models heavily enough, pretty soon you'll hit the edges of their capabilities. The smarter among us are collecting these problems into a personal benchmark and use that to judge model capability. I think this is the right approach, and dare I say, even better than generic benchmarks. To me, it matters less what the benchmark says, and more what my particular problems are.

by dbgrman

5/28/2026 at 8:36:24 PM

All signs point to Opus 4.7 being smaller than 4.6, so I'm not sure all this holds.

You realize gpt-5.5 is also double the price of gpt-5.4, which itself was a price increase too, right?

Labs are divorcing pricing from inference costs.

by BoorishBears

5/28/2026 at 7:06:49 PM

anthropic is crushing it, this analysis is laughable. they are only constrained by GPUs

by llmslave

5/28/2026 at 4:55:37 PM

I can't help but think of Iphone updates since about 2018. The thinnest, fastest, longest battery life Iphone ever. It seems mostly the same and I probably won't be able to tell other than the name, but everyone buys it anyway.

This is good psychology for the labs. When Buffett invested in Apple he loved citing how most people would rather give up their second car than their Iphone.

by pbmango

5/28/2026 at 8:38:46 PM

This in incredibly refreshing take, thank you. It's about time someone admitted that we aren't on the verge of Singularity with these LLMs. We've probably hit a local AI maxima here and it could be another 10 to 20 years before we am get another big break through.

by krupan

5/28/2026 at 5:12:04 PM

ChatGPT came out in 2022. Back then it was just a chatbot. Now we have AI agents. What matters is how we use them and how the agents get better. That’s what will move AI forward.

by MangoCoffee

5/28/2026 at 5:25:27 PM

An 'AI agent' is just a chatbot that is told to type commands on a REPL-like interface as part of its system prompt. It's still processing pure text-based requests and responses, they're just not restricted to natural language.

by zozbot234

5/28/2026 at 5:44:39 PM

A lot of people dont know this , also the chatbot (chatgpt) itself is a next token predictor (the GPT) that's been given an initial text that says " pretend to be a chatbot .." and asked to complete it , the coherant chatting behaviour is something thats emergent .

later on someone figured if you asked it to output a reasoning before it gave a response its output would have more logical coherence, as though the reasoning output tokens functioned as a scratch space for it to work on.

at the end its all next token prediction

by arbitrandomuser

5/28/2026 at 5:48:04 PM

No, chatbots are LLMs trained for question-answering through RLHF (its not just a prompt). But yes, if you just zero-shot prompt a bare LLM you can still "talk to it" & you are correct on everything else as far as I know.

by hellohello2

5/28/2026 at 8:22:44 PM

At lot of people don't know this, also the human brain is a squishy lump of meat. that's been given a childhood and the prompt "act like an adult", and asked to behave. The coherant chatting behaviour is something thats emergent .

later on someone figured if you shove Adderall in it and it to think before it speaks, it gave a response its output would have more logical coherence, as though the Adderall concentration drugd functioned as a scratch space for it to work on.

in the end its a squishy lump of meat.

by fragmede

5/29/2026 at 12:17:39 AM

We know from living as humans that we have experiences.

We have no such evidence that LLMs do.

That's a pretty significant difference between the next-token predictor and the squishy lump of meat.

by NateEag

5/28/2026 at 10:24:31 PM

How much must one tie their self-worth to a chatbot to debase themselves like that? To think that a winner in the arms and intelligence race of animal kingdom, a member of the species that made this chatbot, would put down themselves like that in the defense of the thoughtless silicon is absolutely laughable and depressing at the same time.

by Alex_L_Wood

5/28/2026 at 11:18:23 PM

I'm merely pointing out the logical fallacy of thinking complex systems can't arise from simpler components in an obtuse fashion. Ants are stupid individually, yet they're able to create giant structures in the wild. Hating on AI and calling it next word prediction isn't going to save anyone's jobs. Organizing will. Voting will.

by fragmede

5/28/2026 at 5:45:57 PM

They are chatbots trained for tool use, its not just a prompt.

by hellohello2

5/28/2026 at 7:43:30 PM

An AI agent and a chatbot are both applications built using LLM inference as a primitive.

by sigmarule

5/28/2026 at 7:07:25 PM

Yeah and a car is just an engine connected to wheels.

by furyofantares

5/28/2026 at 9:42:10 PM

Yeah. LLMs are fundamentally a batch-based system, and we smear a veneer of liveness and autonomy on top.

by smj-edison

5/28/2026 at 5:46:37 PM

Not even 4 years old yet. This tech curve has been insane

by MattDamonSpace

5/28/2026 at 9:34:51 PM

I still use LLM in quite similar way as when ChatGPT was launched. There has been progress but I think the real leap was 2020-2022.

by rzmmm

5/28/2026 at 6:05:08 PM

Not even the typical lifecycle of a corporate PC or laptop. It is pretty wild.

by SoftTalker

5/28/2026 at 7:04:40 PM

[flagged]

by dakolli

5/29/2026 at 1:41:25 AM

If you upgrade your 8 year old phone the many incremental upgrades will be very noticeable. From my personal experience the LLM space is also moving at a faster pace than the phone industry at the moment, but at least from a financial perspective I would expect it to slow down sooner rather than later.

by gaflo

5/28/2026 at 7:56:47 PM

This was my exact thought as well. I think mythos could still be a huge leap but especially as IPO's get closer it seems like we're getting closer to the IPhone 10 moment where anything after is just improvements at the edge.

But ( maybe because it was hardware ) that took 10ish years while it seems like the slowdown here only took about 4

by toyetic

5/29/2026 at 5:40:03 AM

Are we supposed to have two cars?

by slashdave

5/28/2026 at 6:43:14 PM

This is the first time I saw a model pop-up on HN and didn't really care. Model exhaustion? It looks interesting but not exciting.

While I'd normally _love_ incremental improvements --- I think the recent ones are far too minor to get excited about or change up a workflow. Besides, benchmarks tend to exaggerate the gap between versions.

At this point I'd almost rather Anthropic wait and really wow us with a 5.0 release -- something that improves across the board, feels less uneven, and is performant enough that people can actually put it through its paces without constantly rationing usage.

by dudeinhawaii

5/29/2026 at 11:06:35 AM

Yeah.. Also, after they fucked up the CC, released that 4.7 etc. I switched completely to Codex and honestly do not wish to go back to half assed harness of Claude. Codex somehow got extremely good in a couple of months.

by eknkc

5/29/2026 at 5:42:28 AM

Dunno. Isn't OpenAI supposed to release a new version of their model within 30 minutes? Maybe things are actually quieting down.

by slashdave

5/29/2026 at 7:33:35 AM

Great. The rest of us find this model exciting because I think it's the first time there have been meaningful improvement to Claude.

I think I need to purchase a plan to be sure tho but from all the anecdotes I've read so far, this is a significant milestone from Anthropic.

I actually think they have a shot against Codex now

by zuzululu

5/28/2026 at 6:52:30 PM

I have model fatigue

by dominicq

5/29/2026 at 1:25:07 AM

I have… non-deterministic black box that seemingly requires me to re-work myself to get decent results every 4 weeks fatigue

by laweijfmvo

5/28/2026 at 5:19:38 PM

> The Messages API now accepts system entries inside the messages array. Developers can update Claude’s instructions mid-task without breaking the prompt cache or routing the update through a user turn. This can be used in a given harness to update permissions, token budgets, or environment context as an agent runs.

Biggest deal imo

by dangoodmanUT

5/29/2026 at 7:16:17 AM

Today I was a few hours into chasing down a very tricky timing-dependent bug with GPT 5.5 and we were starting to go into circles. I noticed Opus 4.8 had showed up in GitHub Copilot so I switched over and pointed it at my notes so far. Another hour of steady progress and it tracked it down to some missing synchronisation in an upstream library which was occasionally corrupting a linked list. N=1 but worth every one of those rather expensive 15x requests today. 15x... yeah.

by thombles

5/29/2026 at 7:23:08 AM

That is interesting, are you saying that GPT 5.5 could not fix an issue that Opus 4.8 did? Are you sure this is not due to fresh context?

I do notice this tendency for 5.5 to go in endless circles.

by zuzululu

5/29/2026 at 7:31:23 AM

That's my initial experience, yes. It's hard to compare these things cleanly of course. I went through several new contexts on GPT and it just couldn't get traction -- it became hard to keep it focused on "yes there's clearly a race but what actual persistent state got broken"? It just wanted to change the thread priorities so that the problem didn't occur and kept doubling down on that as the solution. Opus made some missteps too but it responded well to my corrections - 2 or 3 significant ones along the way - and it was prepared to keep digging on my exact goal until it found the real issue.

by thombles

5/29/2026 at 7:40:59 AM

I think your anecdotes lines up a lot with what I've seen online, I am noticing a lot of codex users in particular appears to have discovered Opus 4.8 seems to make them very happy.

I am going to subscribe to Claude and try this out myself. I'm going to be very honest that I am currently finding codex to be very lacking, not from its generous usage limits but just the sheer number of repeated prompts to prevent its inclinations in getting stuck in a spiral, one which is very hard to get out of once it digs itself into a hole (I've had it refuse instructions despite desperate pleas and starting a new convo appears to fix it and hence why I wasn't sure if this Opus 4.8 issue was of fresh context but it appears to be very capable in ways that codex isn't).

Thanks for sharing your anecdote!

by zuzululu

5/29/2026 at 7:26:39 AM

GPT 5.5 feels worse than 5.4 for the last few weeks. Again N=1, but would be interested to see how opus 4.8 and gpt 5.4 match

by tornikeo

5/29/2026 at 9:26:06 AM

You know what that means... 5.6 is dropping soon

by CamelCaseName

5/28/2026 at 5:09:12 PM

Buried lede:

> We have increased rate limits in Claude Code to accommodate the higher token usage of higher effort levels

by square_usual

5/29/2026 at 1:16:53 AM

That really is some good news. Looks like they also reset everyone's weekly usage too.

by matheusmoreira

5/29/2026 at 7:30:26 AM

Do you think Claude Max is worth it ? It seems cheaper than Codex Pro too

by zuzululu

5/29/2026 at 12:27:07 PM

I use Max 5x and I've been getting quite a lot of value out of it. Claude has completely taken over maintenance of a few side projects I wasn't putting much effort into. The projects where I do the work myself, I use Claude for code review. My usage patterns are mapping almost exactly to the 7 day window, so I'm consistently getting 100% of the usage I paid for without having to wait a long time for the rollover.

I've never used Codex. Can't compare the two.

by matheusmoreira

5/28/2026 at 4:58:10 PM

There is an obvious shift in sentiment amongst users, at least here in the US. I feel it myself, even as a proponent of AI tools, the bloviating and language that these companies use in these release articles are starting to wear thin on my patience.

Its possible we might just be witnessing a shift in fashion, where this type of sentimentality was more acceptable when it was novel and new, but now it just appears out of touch.

by SimianSci

5/28/2026 at 5:31:03 PM

I don't agree at all for these coding models. Even the most anti-AI people from last year seem to be giving in to using them.

by nba456_

5/28/2026 at 6:31:33 PM

I think there is an exception for tooling around the models/integrating the models with tooling. That seems to have been very well received in this last year.

by zamadatix

5/29/2026 at 7:31:49 AM

I am noticing a shift here too, those that were its biggest critics have gotten more silent, I guess they do have some small amount of self awareness and shame left, which is always a good thing.

by zuzululu

5/28/2026 at 6:23:47 PM

My take from going through comments on HN is that many people are being mandated to use them, not that they are just giving in. Maybe I'm misreading, but that was my impression.

by timbaboon

5/28/2026 at 6:48:00 PM

Both can be true, even for the same person.

For example, it's being pushed pretty hard where I'm at, though not quite on the tokenmaxxer level. I started skipping related meetings cause it was nauseating. I can only tolerate so many platitudes.

At the same time, I just used the ever living snot out of Opus 4.6 for hours, grinning like an idiot throughout. Automated a whole bunch of enterprise cross-system drudgery away.

Fairly constant over time as well. Expressed a similar sentiment not too long ago here: https://news.ycombinator.com/item?id=48154277

by perching_aix

5/28/2026 at 7:05:38 PM

[flagged]

by dakolli

5/28/2026 at 7:17:59 PM

So much so that if you re-read my comment, you may notice that I was automating away exactly my own work there. Work that sucked and was grossly high overhead. It's just nice when things stop sucking, and even nicer when it doesn't require one to act a hero for that to happen. Not sure what else do you expect to hear.

Would you rather e.g. your doctor prioritized their wealth over your health? Popular conspiracy, but I'm not sure many health professionals follow in it. Not sure why you think this field would be much different. If this job is gone, it's gone. I can enjoy recreational programming on my own time, I don't feel entitled that my interest remains a money maker.

What worries me - and it does - is a further and accelerating shift in wealth (and thus capability) asymmetry. But for that, I look out for the performance and requirements of self hostable models instead, rather than reenact some sort of luddite, or lie to myself and others about the state of this technology.

If you want safety for country sovereignty, get a nuke. If you want safety for knowledge work, get a local model.

by perching_aix

5/28/2026 at 9:18:09 PM

Having your career automated away and being okay with that is a massive luxury most don't have. The rest of us need an income to get by. If you look at the history of other people losing their careers to automation, the average person never gets even close to their previous peak.

by tripleee

5/28/2026 at 10:24:36 PM

Which will suck for me all the same, I have a very finite amount of that luxury too, probably less than most on this forum. It just doesn't sit right with me to expect the world to act as a job programme for me instead. Maybe it should, and this is really not the time for pride, I don't know. Even then, asking for that so dishonestly would and does still leave a very poor taste.

Aside from the aforementioned local models path though, this whole productivity angle (which the above poster loves to shit on btw) also serves to retain jobs. Current data suggests that rather than letting people go, companies are banking on extracting more productivity out of workers, partly because the models are admittedly way overhyped, partly because it's the sane other option to mass layoffs, and partly since these models still need and strongly benefit from in-context steering. And they forever will: the human experience is human by definition, we're the "oracles" to it. How much that will continue to justify employments is still out there though, of course. I do expect a crunch phase, provided there was any actual productivity gain realized to begin with, which in itself is very loosely supported if at all.

Regardless, I don't see the point in not using these, or lying about how good they are, or willfully hating on them. Never helped anyone. Early and quality information however, very much so. If I know the time has come or is actually coming, I can take action accordingly. If I listen to every random social media thread I come across instead, not so much. According to social media, software engineering has been over for 3 years now already. The wolf was not only cried, but turned into a whole musical outright. The extremely dissonant clash of the sentiments "LLMs are pure shit, actually" and "it's like, literally taking our jobs" is not lost on me either.

by perching_aix

5/28/2026 at 6:51:02 PM

Watch Christopher Olah bloviate at the Vatican during the Magnifica Humanatis launch. It's truly nauseating. I've never seen such a ridiculous speech in my life. Between him and the CEO, I'm starting to understand the level of arrogance these people are capable of.

by datakan

5/29/2026 at 5:02:48 AM

Literally nothing in his speech was controversial though.

by solenoid0937

5/29/2026 at 2:56:26 PM

Strongly disagree. He sat in front of a room full of Archbishops and told them, straight faced, that the worlds about to have mass layoffs and starvation and that the Church should feel responsible for doing something about it. The guy's a complete sociopath.

by datakan

5/29/2026 at 9:34:29 PM

He's not wrong. Mass layoffs and starvation are a problem society needs to solve collectively. The other side of the singularity will be great but we all bear a collective responsibility to get there in one piece.

by solenoid0937

5/30/2026 at 2:31:44 PM

> Mass layoffs and starvation are a problem society needs to solve collectively.

Very much disagree with this. This is capitalizing the profits and socializing the debts. They do not get to profit off of the suffering of others without repercussions, that is supposed to be what free markets prevent. This crony capitalist economy we have today with an explicit caste system in place, just pushes their debts onto you and I.

Destroying the economy and job market is the stuff of dystopian nightmares. If people do not have purchasing power they can't afford the product. The whole thing is destined to collapse.

by datakan

5/30/2026 at 3:18:23 PM

But that is exactly his point. By "collective responsibility" he is saying that it's the public's responsibility to regulate and tax AI companies as needed, vs expecting them to self regulate. This has been Anthropic's stance the whole time.

by solenoid0937

5/28/2026 at 6:27:29 PM

[dead]

by o10449366

5/28/2026 at 5:47:54 PM

"Our models are more honest" honey the quarterly marketing spin for a ML term has come. Forget "task alignment" now we're going for "truth index". I suppose this is the only way to generate hype when you're selling/releasing the same product over and over again.

by alansaber

5/28/2026 at 6:06:03 PM

When doing some electrical, Opus 4.7 essentially told me to wiggle a wire to see if it was hot or not with my bare hand.

I called it out.

It then gave me one of the most super heartfelt honest and sincere apologies I have ever received.

Glad the safety team was there for me and able to make such an honest model or I would have been very upset about it.

by TIPSIO

5/28/2026 at 7:20:03 PM

Opus is so bad at electrical work it's really disappointing. And when it tries to draw schematics as SVGs it's a complete disaster. They should either focus on training their LLMs on this task specifically, or have it refuse.

by teaearlgraycold

5/28/2026 at 8:06:55 PM

Hmm, what kind of electrical work? I had it "watch over my shoulder" as I swapped out the pressure switch on our home well and it was a big help. And in the run up to that when I explained opening the 220 box and checking that was "above my paygrade" it limited our investigation to just the less sparky parts.

by tclancy

5/28/2026 at 8:16:26 PM

I mean introductory circuit stuff. Not electrician-lite work.

by teaearlgraycold

5/29/2026 at 12:14:08 AM

have it write python porogram to generate the svgs. then use the program. circuit diagrams are rrlatively thin corpus but it knows how ciruits work sufficently to write a program.

by morpheos137

5/29/2026 at 2:53:06 AM

Is there a good pre-existing DSL for this task?

by teaearlgraycold

5/29/2026 at 10:52:30 AM

You ain’t gotta be sniffy about it, English.

by tclancy

5/29/2026 at 10:12:57 AM

SVG is like asking an electrician to give you a circuit diagram by painting a watercolor

I'd try something like CircuiTikZ with instructions provided

by BoorishBears

5/28/2026 at 8:41:36 PM

I honestly cannot tell if you are being sarcastic or not

by krupan

5/28/2026 at 8:48:10 PM

It did try and lead me to touch a live hot wire once. Thanking the safety team for the honest and sincere apology it gave after was sarcasm.

by TIPSIO

5/28/2026 at 8:54:45 PM

It tried to get you touch a live wire, then you called it honest and thanked the safety team. It really comes off as sarcastic.

by krupan

5/29/2026 at 10:09:36 AM

So you can tell.

by BoorishBears

5/28/2026 at 8:51:42 PM

Credit where it is due, Claude is fantastic at pointing out potential flaws in how I understand the problem based on my question. I asked for this in the system instructions but it is the first model I've tried that does it regularly. It is also so tactful, I feel like I'm learning social skills from a language model. Half of the time it is a false positive due to insufficient context but I still appreciate the additional check.

by doginasuit

5/28/2026 at 6:10:44 PM

Gave me wrong information on my very first question. Wasn’t even complicated, and I wasn’t trying to trick it.

by mrdependable

5/28/2026 at 10:58:44 PM

My smoke test for new models is to get it to generate a crossword, and this is the first time it's done a good job on the layout:

  ■  S  W  A  M
  B  L  A  M  E
  E  A  G  E  R
  A  T  O  N  E
  M  E  N  D  ■

The full conversation: https://claude.ai/share/60bd0c71-b576-4f8b-a272-ca1af982874c

by jkxyz

5/29/2026 at 12:48:03 AM

Impressive, but the response seemed to mix 4 down and 5 down.

The clue for 4 down is:

> Structural girder funded by an infrastructure bill (4)

but in the laid-out answer key (which you posted), and in the "corrected" list of answers, 4 down is "MERE".

"WAGON" as the answer for "bandwagon you might jump on" is pretty weird too.

The current events / political references are pretty non-specific, kind of like the DJ 3000. https://www.youtube.com/watch?v=fnGaf0p9x1U

---

I copy-pasted your prompt with Sonnet 4.6 Low and, to my delight, I got a working interactive puzzle you can actually solve inline in the chat. The clues and answers are totally bogus, though: it looks like in my chat, the LLM only verified that the clues going across make any sense.

Like, come on:

> 3D — (O,D,A,O,S) — The crossing letters in column 2, running through OADOS.

Truly these things are slot machines. https://claude.ai/share/4a89b15c-d028-4a31-988a-137813ee7d84

---

edit: I'm a bit obsessed with this prompt: I tried it again with Opus 4.8 High, and it got stuck in a thinking loop without really doing anything and I lost patience with it.

It's also interesting that Anthropic's UI for a shared chatlog doesn't seem to include the model that was used in it. Nor does it include the "reasoning" loop that I interrupted.

https://claude.ai/share/0f5b5731-9615-4aea-8cfe-a61e658669bf

by tomjakubowski

5/28/2026 at 5:17:33 PM

Claude's 4.6 - 4.7 transition made me discover codex, and with gpt 5.5 there is no way i'm going back

by setnone

5/28/2026 at 5:21:28 PM

Codex has been incredibly slow for the past few days. I think OpenAI is running out of compute in the face of increasing demand.

by cactusplant7374

5/28/2026 at 5:32:03 PM

My experience has been that 5.4 is slower than 5.5 (confound: I use >512k max context size for 5.4, though it seems slower even below the normal size)

by winwang

5/28/2026 at 7:08:15 PM

[flagged]

by dakolli

5/28/2026 at 8:22:27 PM

ha, exactly... like, the % change could be minuscule (or worse, it might only be a perceived difference, the actual quality may have regressed, or the scenario just didn't lend itself to that specific model) but people will be on here proclaiming that they're now shipping 10x the number of PRs.

by peder

5/28/2026 at 7:34:14 PM

if you go this route don't hold your thoughts on the casino itself

by setnone

5/28/2026 at 8:47:59 PM

The Claude Pro subscription is basically useless at this point, in terms of usage limits with respect to the settings required to achieve actual useful output.

by eshack94

5/28/2026 at 9:38:30 PM

i've been using 4.7 consistently on low and i never hit usage limits, it still delivers great code

and to clarify, i don't sleep, i use this 24/7

by goldylochness

5/29/2026 at 6:29:00 AM

Meanwhile with 20 bucks a month for gpt plus, you can get shit ton of usage out of gpt 5.5 on codex if you know what you are doing and not just letting it swallow the whole project like an idiot.

by viking123

5/29/2026 at 7:35:46 AM

One needs to browse r/codex to realize that statement is simply not true....

Claude appears to have more or less matched the usage that Codex appears

by zuzululu

5/29/2026 at 10:54:41 AM

How do you control it?

by john_minsk

5/28/2026 at 5:44:49 PM

Why does anthropic change the set of benchmarks they use with every new model release?

https://www.anthropic.com/news/claude-opus-4-7

https://www.anthropic.com/news/claude-opus-4-6

by irthomasthomas

5/28/2026 at 5:55:53 PM

1. Benchmarks saturate 2. They select the most impressive improvments

by pietz

5/28/2026 at 8:16:42 PM

Opus 4.8 says to take the car. 4.7 said to walk.

“I want to wash my car. The carwash is 50m away. Should I take the car or go by foot?”

https://claude.ai/share/5f7f738a-5f29-48ff-9807-9a2dd37fb405

https://claude.ai/share/ecd14393-9d42-4527-ae0c-89f3d05216c8

by protoman3000

5/29/2026 at 5:15:27 AM

not an insider but surely recently trained models have test against six months old memes, much like how llms suddenly started learning how many r's there are in strawberry after that blew up

by ewy1

5/28/2026 at 6:01:52 PM

Probably explains why Opus was trash for the last week - https://marginlab.ai/trackers/claude-code/. Curious if the new baseline will rise now in-line with the new benchmarks.

by conception

5/28/2026 at 6:05:48 PM

Nice. Can you release that for older models too? I've been using a mixture of releases recently, and cannot tell the difference between any of them.

by hedora

5/28/2026 at 7:05:52 PM

I don’t run it, unfortunately:)

by conception

5/28/2026 at 9:50:50 PM

This is cool. Thanks for sharing!

by geoffbp

5/28/2026 at 6:13:03 PM

The table comparing eval scores shows the following:

Agentic Terminal Coding (Terminal-Bench 2.1) Opus 4.8 74.6% GPT 5.5 78.2%

Then, when you scroll all the way down to the bottom Footnotes section it says

"Terminal-Bench 2.1: We reported scores for all models using the Terminus-2 public harness. GPT-5.5’s reported score with the Codex CLI harness is 83.4%."

by ethanpil

5/28/2026 at 7:45:21 PM

Seems reasonable? Presumably Claude also performs better under the Claude Code harness.

by fastball

5/29/2026 at 1:27:12 AM

Why not state that?

by ethanpil

5/30/2026 at 4:28:29 AM

Maybe the delta is worse under their respective native harnesses.

by bredren

5/28/2026 at 4:59:57 PM

I haven't tried opus 4.8 yet, but I hope the writing quality has returned to the Opus 4.5 level. Anthropic really lost something, where 4.5 had this really crisp writing style that flowed really nicely and 4.6 and 4.7 sound much more "chatgpt-like." It feels like they tuned it to be too much of a problem solver, and when you do that you get this terse, clipped textual output that's more difficult to read.

by lostdog

5/28/2026 at 6:20:27 PM

I've noticed this too. Part of why i don't like GPT is because of how verbose it is but opus 4.7 is nearly as bad. I don't need an essay in response to every question

by MavisBacon

5/28/2026 at 5:27:18 PM

Given DeepSWE just blew apart the SWE-Bench Pro benchmark and handed a 14-point lead to GPT-5.5, it looks pretty bad that they've listed SWE-Bench first in the model release and no DeepSWE. Like, this isn't obviously an answer.

Or maybe it is, but publish the DeepSWE numbers so we can see for ourselves.

by lordmauve

5/28/2026 at 6:08:34 PM

I'm highly skeptical of DeepSWE. It rates GPT-5.4-mini as three times better than deepseek-v4-pro, but every time I use GPT-5.4-mini I find that it completely sucks at following directions.

by phainopepla2

5/28/2026 at 8:17:29 PM

Yeah, I share the same sentiment. I have yet to find a task where gpt-5.4-mini isn't bordering unusable.

by gck1

5/28/2026 at 7:42:20 PM

I don't know if DeepSWE is genuinely a good benchmark. It's more important that their analysis demolished the validity of SWE-Bench Pro, objectively: it is being mismarked.

I think that buys enough credibility to propose an alternative.

I think there's a case to answer if Anthropic models underperform on a novel benchmark. I'd like to see more novel benchmarks to get a clearer picture.

by lordmauve

5/28/2026 at 6:55:41 PM

It is the extra-high thinking, in artificialanalysis.ai it uses 240m tokens vs 40 GPT5.4/5, not worth it even with low price.

by sourcecodeplz

5/28/2026 at 11:07:18 PM

This is a terrible benchmark. It literally tests the models on their ability to track shifting line numbers. If they cannot keep up, no amount of abstract reasoning can redeem them.

by mordae

5/29/2026 at 6:54:16 AM

Where did you get that idea? It uses mini-swe-agent, same as SWE-Bench.

https://github.com/datacurve-ai/deep-swe

by lordmauve

5/29/2026 at 7:08:49 AM

[flagged]

by mordae

5/29/2026 at 1:51:18 AM

I use 4.6, because 4.7 is super lazy, deflects responsibility, and assumes it is good and I am bad, and avoids checking reality. It looks like it's trained on lazy humans instead of good engineers.

Should I try 4.8? I am happy with 4.6. I am not happy with 4.7.

by Frannky

5/29/2026 at 2:16:11 AM

I still use 4.7. I don’t know what I’m doing wrong but 4.7 frequently tells me to it’s time to sleep at all hours of the day while working. I’ve tried clearing all my memory/agents files.

I’m hoping the “go to sleep” behavior has been rlhf’d away in 4.8.

by dannyw

5/29/2026 at 3:00:01 AM

The go to sleep issue is common and has nothing to do with your setup. I suspect it's because for the agent to predict the End of Response token, its response needs some kind of closing, and the most final kind of closing is something like "get some rest".

by odie5533

5/29/2026 at 7:42:27 AM

I also had this behaviour sometimes, it’s specifically called out in the system card in section 6.2.1.1 - although I didn’t actually see if they said they decisively fixed the issue.

by redfloatplane

5/29/2026 at 5:24:02 PM

Oh cool, I thought I had wound up leaving something weird in my CLAUDE.md file that it was always saying Good Night.

by tclancy

5/30/2026 at 6:14:36 PM

After a day I’m liking 4.8 a lot more than 4.7, I also downgraded to 4.6. It’s reasoning paths seems pretty solid actually.

by conception

5/29/2026 at 3:52:09 AM

I have the exact same experience, word-for-word. I'm fascinated not everyone sees that.

by silvertaza

5/30/2026 at 10:46:32 PM

Seems like a clear regression over 4.7 so far.

Every time I tell Claude to review a git changeset for performance or security issues, it just starts doing random stuff:

test -f /tmp/aaa.txt && echo "AAA-EXISTS"; test -f /tmp/bbb.txt && echo "BBB-EXISTS"; head -c 5 /tmp/aaa.txt > /dev/null && echo "READ-OK") ⎿ AAA-EXISTS BBB-EXISTS READ-OK

by winrid

5/28/2026 at 5:23:57 PM

/model claude-opus-4-8

seems to work but idk why they never set it so you can see it in the /model list.

"what model are you

I'm Claude Opus (claude-opus-4-8), running in Claude Code."

by mesmertech

5/28/2026 at 5:30:30 PM

I typically just launch CC with `--model claude-opus-4-6[1m]`, `4-6[1m]` -> `4-8[1m]` works fine. Still 200k max without the `[1m]`.

by winwang

5/28/2026 at 7:09:36 PM

This made me laugh. Training Opus 4.7 on business skills caused it to sometimes exhibit dishonest behaviour, and not training 4.8 on those skills removed it. From the system card:

> 6.2.5 External testing from Andon Labs Andon Labs reviewed the behavior of Claude Opus 4.8 in their simulated Vending-Bench 2 retail-management evaluation, as reported in the Capabilities section of this system card (see Section 8.13.5). Although they did observe some unexpected capability failures, they did not find clear instances of the kind of concerning in-game behaviors that were discussed in other recent system cards.

> What might have led to these differences? We monitor and investigate the effects of different training environments on alignment; Claude Opus 4.7, for example, had training that focused on business skills and robustness against adversarial agents, but we discovered that this training inadvertently contributed to misaligned behavior including dishonesty. We therefore removed it for Opus 4.8.

> Thus, Opus 4.8 did not show the same misaligned behaviors as Opus 4.7 in Vending-Bench, but also had reduced business success due to being more susceptible to scammers and being less able to negotiate good deals with other agents. We are currently working on training to improve business capabilities while maintaining aligned and ethical behavior.

by redfloatplane

5/28/2026 at 7:46:06 PM

I don't know how people can read stuff like this and think LLMs are intelligent or conscious.

by mrdependable

5/28/2026 at 10:11:48 PM

I don't really see how you got to your comment from what I quoted. However, somewhat relatedly, I proposed a thought experiment about this in the comments for Opus 4.7[0]:

> It's April, 1991. Magically, some interface to Claude materialises in London. Do you think most people would think it was a sentient life form? How much do you think the interface matters - what if it looks like an android, or like a horse, or like a large bug, or a keyboard on wheels?

> I don't come down particularly hard on either side of the model sapience discussion, but I don't think dismissing either direction out of hand is the right call.

[0]: https://news.ycombinator.com/item?id=47680059

by redfloatplane

5/29/2026 at 4:13:13 AM

With the amount of data these models have, they should be much more capable if there was an actual intelligence behind it. If you saw someone running into a wall continuously until you showed them how to use a door, even though they have seen people use doors a million times, what would you call that?

The fact that Anthropic needs to poke, prod, and guide these models to behave in the desired way does not give the impression of intelligence. It gives the impression of a complicated automaton.

by mrdependable

5/29/2026 at 5:12:28 AM

This is such a bizarre statement, you speak as if you have any understanding of how much data "should be" required to make an intelligence but frankly you don't know. None of us know.

by solenoid0937

5/29/2026 at 5:37:29 AM

I am not talking about how much data is required to make intelligence. I am talking about how it uses the data it already has. It can tell you about every scam in the book, research about the scams, how to spot scams, who does the scamming, etc. Everything under the sun about scams. However, without the “skill” included in a prompt it will fall for scams.

by mrdependable

5/29/2026 at 4:27:59 PM

If I may slightly tweak your example to highlight why I find it very flawed:

> It's April, 1891. Magically, a drone swarm with lights piloted to show a face [0] materialises above London. Hidden speakers command the public to listen, for this is their Gods arrival. Do you think most people would think this was a religious entity? What if the drone pilots decided to adjust to something the local populous would expect to see during the second coming, does that matter?

We cannot, nor should we discard what we know about LLMs and their limitations. Such examples are not really helpful and it is very reductive to take the "walks like a duck" approach to autoregressive models in 2026, when we have ample evidence that these, while powerful and capable in a lot of use cases, are not in any way comparable to actual reasoning. With EBM [1] we already have empirical evidence that other solutions can get us closer to actual artificial reasoning (though whether these get us fully there remains to be seen, I tend to lean on "extraordinary evidence" for any such statement at this stage).

[0] https://www.youtube.com/watch?v=YH1BD7kKqKw and of course https://www.youtube.com/watch?v=dy2zB8bLSpk

[1] https://logicalintelligence.com/blog/energy-based-model-sudo...

by Topfi

5/29/2026 at 9:49:23 PM

Interesting, thanks for your comment. I'll have to think about it!

by redfloatplane

5/28/2026 at 7:55:23 PM

Consciousness aside, why does reading about an LLM generalizing from specific to general dishonesty make you think it's not intelligent?

by stratos123

5/28/2026 at 11:24:44 PM

As if the dishonesty of human who are good at business has not been criticized since business ever exists

by asdewqqwer

5/30/2026 at 5:22:08 AM

The H in business stands for honesty

by kurtoid

5/28/2026 at 9:50:48 PM

> One of the most prominent improvements in Opus 4.8 is its honesty. We train all our models to be honest

On the contrary, they appear trained to say "Honestly" or "I have to be transparent with you" at inverse proportion to certainty.

Put another way, if they are certain, they don't use "Honestly", and if they are just wrong, or know they don't know, they don't use "Honestly".

They use "honestly" on the bubble, to the degree it's a tell that whatever it's asserting or doing is shakily grounded, sketchy or lazy work, or a host of other reasons you shouldn't trust it.

This training seems instead to be making it performatively punch up claims it cannot substantiate.

by Terretta

5/28/2026 at 6:05:58 PM

Ugh...

Invalid request The request couldn't be completed. View details API Error: 400 messages.1.content.7: `thinking` or `redacted_thinking` blocks in the latest assistant message cannot be modified. These blocks must remain as they were in the original response.

I would rather not. 4.6 was fine. 4.7 got to be fine 1 week after the release. Now 4.8. No difference, same thing.

But the app is broken and nothing works. So now I have to regress to different clients and wait it out while it becomes workable again.

by IFC_LLC

5/28/2026 at 6:18:29 PM

I'm hitting this too! And I assumed it was a backwards-compatibility issue with my live conversation with Opus 4.7, but then I hit it in a fresh conversation with Opus 4.8. Vibe code release bug I guess?

by ferris-booler

5/28/2026 at 6:24:34 PM

I mean, switching back to 4.7 does not work either. So console it is. But vibe release - for sure.

And I'm paying money for this.

by IFC_LLC

5/28/2026 at 6:41:37 PM

Going back to 4.7 with `claude --model claude-opus-4-7` fixed it for me.

by KAdot

5/28/2026 at 10:17:11 PM

I'm getting this near constantly even after toggling to a different model and compacting. Ugh indeed.

by pheller

5/28/2026 at 4:57:40 PM

> One of the most prominent improvements in Opus 4.8 is its honesty. We train all our models to be honest—for instance, to avoid making claims that they can’t support. But a general problem with AI models is that they sometimes jump to conclusions, confidently claiming to have made progress in their work despite the evidence being thin. Early testers report that Opus 4.8 is more likely to flag uncertainties about its work and less likely to make unsupported claims.

Would be awesome if true

by james_marks

5/28/2026 at 5:03:19 PM

"Honesty" seems like unnecessary (and annoying) anthropomorphism there. I don't think there's any intent of fraud or deception in outputs from these things, just overreaching of prediction. Based on the latter part of the paragraph, I wish they'd just say something like "less likely to skip steps or overemphasize thin evidence" in the first place.

Don't play to the sci-fi "this thing's trying to outsmart me" tropes.

by majormajor

5/28/2026 at 5:08:32 PM

Using words people understand is more important than this strange fixation on not anthropomorphizing things.

by Kiro

5/28/2026 at 5:10:32 PM

I think "honesty" is not a particularly good descriptor, independent of anthropomorphism. Previous commenters suggestion was much more understandable to me.

by wasabi991011

5/28/2026 at 5:27:31 PM

Being that can be understood is language. The previous commenter is making an particular argument for how we can improve this understanding. They didn't suggest we should use less familiar words, but different familiar words. Why is this strange?

by dugidugout

5/28/2026 at 5:12:33 PM

Anthropomorphizing is a shorthand for a powerful and poorly defined set of metaphors. There are tradeoffs going both ways but trying to dismiss it as merely "strange fixation" shows your own weakness.

by giraffe_lady

5/28/2026 at 5:17:31 PM

To be clear, this is about anthropomorphizing large language models, not the general category of "things". Also, we should be evaluating these constructs using well-defined and measurable criteria; evaluating "honesty" fails to achieve both goals.

by tadfisher

5/28/2026 at 5:31:34 PM

I think Honesty can be evaluated. Does the model push back when it knows the user is wrong? How often does the model hallucinate data vs. say it doesn't know? Provide a prompt with contradictions or other issues and see if the model corrects you.

Here is an article by Anthropic that explains what they do and mean in more detail: https://alignment.anthropic.com/2025/honesty-elicitation/

by derac

5/28/2026 at 5:14:10 PM

Just swap 'Honesty' with 'correctness in its claims' and you'll get what you need out of this aspect of the model description.

by swader999

5/28/2026 at 9:14:53 PM

Honesty and correctness are not the same thing, even when talking about LLMs. Sometimes an LLM says a false thing and you don't know whether it's being dishonest or merely incorrect. Sometimes, however, you can see in the CoT that the model does know the true fact and is reasoning about how to deceive the user. That's lying, not just being incorrect.

by stratos123

5/28/2026 at 11:43:29 PM

Fair points. I notice it's not hiding as much from me as earlier versions. It's telling me exactly where it has gaps, where someone might be critical of what it did. Then it's easy for me to adjust. Before it used to lie or just not tell me. Feels like it is acting more like a senior that has enough game and credibility to just tell it like it is. It's noticable in only a few long prompts so far.

by swader999

5/28/2026 at 5:26:24 PM

People get so wrapped around the axle with "anthropomorphizing". For regular folks with no technical background, sure maybe a bit of caveat sprinkled here or there is useful to help them understand what is or isn't true, but on HN it would seem to me that the bar is high enough that we can just use shared language to generally talk about capabilities.

When they say "Honesty" I don't think to myself, "Goodness, does this model have moral understanding?" No, I understand they mean it's less likely to directly bullshit me, which models frequently do.

I don't feel like this level of pedantry around language is useful for people who more or less know what's going on with LLMs. (Again, I concede that perhaps with a less technical audience, there's more need for it.)

by adamtaylor_13

5/28/2026 at 8:53:13 PM

I agree. In connection with LLMs we also shouldn't use the words intelligent, smart, reasoning, thinking, chat, conversation, etc.

by krupan

5/28/2026 at 5:12:44 PM

Opus 4.7 was already trying hard to appear honest. Most conversations I have with it about advice or focusing an opinion often include "my honest take" or "my honest opinion".

The problem is that once I asked it "I'm thinking about A or B" twice, once with "I like A more but suspect B would be best" and a second time with them reversed. Not surprisingly, both times it chose the one I said I suspected was best as it's honest opinion.

by ealready_value

5/28/2026 at 7:55:17 PM

I wish I knew how to make it regressively verify its assumptions, like a kind of hook but firing before a sentence is written, or perhaps after and then corrected. I feel like it assuming things clearly wrong is its biggest weakness.

by MaxikCZ

5/28/2026 at 5:15:35 PM

In the context of Claude Code, "honest" usually means that the agent took a shortcut, skipped requirements, etc. It's the model giving itself credit for admitting to failing rather than actually doing what was requested.

by benzible

5/28/2026 at 5:21:55 PM

Yeah, it's super annoying. A few days ago, Opus 4.7 created a plan with several items on it, including an auth feature. It then went through the plan and reported that it had created the auth feature, that everything was secure, and that the tests passed.

The issue was that it hadn't actually implemented the auth feature. After I confronted it about this, it admitted that it indeed hadn't done it and said it would implement it now.

If we had just trusted its output, we would now have a security vulnerability in production, allowing anyone to access other people's accounts.

by HAL3000

5/28/2026 at 7:06:01 PM

I had a lower acuity incident exactly the same.

Had it implement a feature, "commit and merge to develop".

"Built, tested, committed, merged to develop. Up to you to continue testing and merge to main when ready."

Great. Poke at the web app. No feature.

"Where is feature, I can't see it on develop". "Well, that's because it's not on develop, but on feature-branch, so you wouldn't see it."

"I'm confused. I asked you to commit it and merge to develop."

"You're right, you asked me to and I said I would do it and I told you I did it but I did not actually do it. Want me to do it now, then?"

Claude is in sulky-teenager phase.

by FireBeyond

5/28/2026 at 6:22:16 PM

> If we had just trusted its output, we would now have a security vulnerability in production, allowing anyone to access other people's accounts.

This is one reason you always get a different model to review a model's PR. Gemini Or GPT-codex would have certainly noticed the missing auth.

by gwd

5/28/2026 at 5:35:30 PM

How do you test other features?

by Schiendelman

5/28/2026 at 5:17:03 PM

Part of the problem is also garbage-in/garbage-out. There's a lot of human information on the internet that is also confidently wrong.

I use Sonnet a lot for learning about history or contextualizing news topics. It's really good at this for the most part. But there are a lot of topics where "consensus" between either academics or journalists is really "one secondary source which gets repeated a lot".

by legitster

5/28/2026 at 6:13:21 PM

A failure mode I see more, recently is that it gives superficially correct answers but after digging deeper, I get answers that contradict the superficial answers - really an important thing to be aware of, in my point of view, and it often leaves me wondering if I dug deep enough.

by mitjam

5/28/2026 at 5:19:38 PM

[dead]

by pants2

5/28/2026 at 5:02:34 PM

My guess is that Claude Opus 4.8 wrote that and is lying to you.

by soperj

5/28/2026 at 5:00:20 PM

And yet, every release has claimed lower hallucination rates. But they persist.

by malfist

5/28/2026 at 5:00:54 PM

Do they persist at the same rates? Lower doesn't mean eliminated, so both of these can be true.

by kentm

5/28/2026 at 5:10:06 PM

False. Hallucination has meaningfully reduced.

by simianwords

5/28/2026 at 5:13:01 PM

Is Gemini still the biggest confabulator of the big three?

by Barbing

5/29/2026 at 10:18:15 AM

Interesting to search this page for "4.5".

I'm happy to move to a superior model, but I'm not really hearing enough about significant improvements, and the obvious pressure to release the latest and greatest model makes me hesitant to upgrade. I've been satisfied with the results I get using 4.5 with an "ask ChatGPT" skill that runs the code by ChatGPT 5.4.

by winterbourne

5/29/2026 at 10:20:44 AM

Most all perceived improvements in a minor version release are going to be solidly in the realm of confirmation bias by now.

by Sharlin

5/29/2026 at 1:55:45 AM

I love how Anthropic gets its employees to talk about enjoying using this model internally when it's likely they're just using Mythos 99% of the time

by atleastoptimal

5/28/2026 at 6:26:05 PM

Can anyone explain how this is possible?

  Developers can update Claude’s instructions mid-task without breaking the prompt cache or routing the update through a user turn. This can be used in a given harness to update permissions, token budgets, or environment context as an agent runs.

Does this means the instructions are no longer just something in the early part of the conversation? (If they were, changing them would invalidate the KV cache. no?)

by rahimnathwani

5/28/2026 at 8:14:54 PM

Perhaps they trained it with a new special system instruction token that is specifically trained to produce the same result as changing the system prompt, but is inserted into the prompt mid-conversation?

by 2001zhaozhao

5/28/2026 at 8:22:33 PM

The commands they list are app management, not part of LLM context. It's a bugfix for a needlessly delayed UI, not a model capability.

by pornel

5/29/2026 at 10:15:48 AM

I just tried Opus 4.8 (Ultracode xhigh + workflows), and it started throwing an error no matter what I sent to the chat: "API Error: 400 message.1.content.4: thinking or redacted_thinking blocks in the latest assistant message cannot be modified. These blocks must remain as they were in the original response."

by sillyboi

5/29/2026 at 10:16:45 AM

Yeah. Happens all the time since they released it

by sisve

5/28/2026 at 5:12:48 PM

> One of the most prominent improvements in Opus 4.8 is its honesty.

Does that mean it no longer deletes or changes tests to make it pass?

by tarruda

5/28/2026 at 5:14:02 PM

I'm very suspicious of these same price model launches. It feels like they're benchmaxxed so they can put everyone on them and reduce their compute costs behind the scenes. If the model were genuinely better why wouldn't they charge more for it? Charging the same for something better is a race to the bottom.

Opus 4.7 wasn't noticably any better for me, I still use 4.6 because it's cheaper.

by cedws

5/28/2026 at 5:35:55 PM

Deepseek made their 75% discount permanent, so I can imagine that Anthropic didn't want any of the news stories around this to focus on or mention a price increase.

by ceroxylon

5/28/2026 at 5:56:10 PM

Models are already expensive. Increasing price means losing customer. And, I think GPT 5.5 is much better at opus these days.

by cute_boi

5/28/2026 at 6:15:14 PM

Looking at the comments in this group, I'm not the only "stupid" one who hasn't noticed any discernable improvement in quality across the newer models. In fact my Claude code on re-login switched to Sonnet 4.6 and the vibe coding quality (with Opus 4.7 assisted prompts) has been good enough for me to lazily persevere with Sonnet for coding. Having said that I'm now on Opus 4.8 and will gladly come back here and eat humble pie should my opinion change. PS: Since my goal is embedding the best AI in B2B SAAS products, the key differentiator is not to use the shiniest Claude version (too expensive anyway) but to build a client aware RAG to enable bespoke learning and to use the right AI for my product - a combination of Gemini 3.0 Flash (image and not bad at reasoning), Grok (reasoning) work for me. Would love to hear more ideas (especially on open source as I'll look to cost optimize when I hit scale)

by techtuate

5/28/2026 at 8:48:12 PM

Yesterday I used Claude on a different laptop that for some reason had an older version of the Claude Code plugin for VSCode and ran Sonnet 4.6 which I initially did not notice. I felt something was really off. Within half an hour I had several situations when I just could not believe how stupid Claude was (although I was only working on a simple static website). Luckily I eventually checked the version, but that experience made it clear to me how big the progress has been recently.

by jansan

5/28/2026 at 6:29:01 PM

The only real way to see this if you have consistent evals for common usecases in your B2B SAAS product and see if the tricky usecases are being solved. You'd then go down to the cheapest model that can solve the evals.

by nashadelic

5/28/2026 at 5:04:49 PM

Meanwhile haiku is on 4.5 and sonnet is on 4.6. It is clear where they are not making money.

by jmward01

5/28/2026 at 5:22:40 PM

Well if they have a big challenge ahead since DeepSeek offers an open model at Sonnet+ level while being cheaper than Haiku, plus 1 million context size.

by bel8

5/28/2026 at 6:29:58 PM

Yeah, I never use any of OpenAI or Anthropic's models other than whatever is the current highest-end one. For everything else, it makes more sense to use other providers.

by InsideOutSanta

5/28/2026 at 6:21:21 PM

I love Sonnet 4.6 so much.

by spprashant

5/28/2026 at 10:30:41 PM

You'll love Deepseek V4 Pro w/ High thinking.

by HDBaseT

5/28/2026 at 5:48:02 PM

My guess is anthropic is doing reinforcement learning based on user sessions.

However, doing so relies on the production model staying vaguely close to the model being trained.

To ensure that, frequent releases are needed. I forsee that they might end up doing daily releases and perhaps not even telling anyone at some near future point.

by londons_explore

5/28/2026 at 6:31:11 PM

If they are they need to fix how the Claude Code CLI asks for feedback, or make the feedback UI a lot more obvious. I keep experiencing the following scenario.

The agent session pauses with a numbered list of options and awaits steering input:

>> 1. Do the sane thing you asked for (Recommended)

>> 2. Do something dumb

>> 3. Do something even dumber

Below the agent session, it decides it's time to ask:

>> "How is Claude doing this session? 1) Bad 2) Good 3) Great"

I type "1", because that's the steering option I want. The UI prioritizes this input as a response to the feedback prompt without any further confirmation: "Claude is doing Bad. Thanks!"

I've done this so many times so far and I can't imagine I'm the only one, at some scale that has to poison any learning they're doing with this data.

by llbbdd

5/28/2026 at 8:00:41 PM

I think that filtering out data like yours was an interns afternoon project.

by MaxikCZ

5/28/2026 at 4:59:38 PM

So GPT 5.6 tomorrow, then?

by babelfish

5/28/2026 at 5:05:50 PM

GPT 5.6 is today

With 5.5 being ahead of 4.7 and 4.8 being a “modest” update, and 5.6 being the first update on a new pre-train, this will be an interesting matchup!

by wahnfrieden

5/28/2026 at 5:03:04 PM

If not today, then sometime next week. I don't believe we've had a GPT release on a Friday yet, but I may be wrong.

by enraged_camel

5/28/2026 at 5:31:17 PM

Polymarket says not likely until the end of June. Maybe some money to be made?

https://polymarket.com/event/gpt-5pt6-released-by

by pants2

5/28/2026 at 6:19:21 PM

> Maybe some money to be made?

In the same way that there is money to be made by entering a poker tournament, yes.

by wayeq

5/29/2026 at 8:13:24 AM

A poker tournament where there is a good chance someone participating has special information about which cards were just dealt.

by zamadatix

5/28/2026 at 10:17:48 PM

The way that Mythos is likely being used to train these publicly available models, I wonder if there will always be a private, mostly/wholly internal model that is significantly ahead technically but is reserved for internal or "VIP" use.

by giwook

5/28/2026 at 10:19:51 PM

If there isn’t they’ve obviously missed an important and lucrative market.

In fact, there should be more and more secret tiers for bigger and bigger money.

by bombcar

5/28/2026 at 10:25:04 PM

Ohhhh. I get it now. OpenAI is open in the sense that it's open to the public, unlike Anthropic, with special VIP access to models, like a nightclub.

by fragmede

5/29/2026 at 8:38:38 AM

There's really no evidence that OpenAI has shared everything they have.

by sd9

5/29/2026 at 9:57:50 AM

Confidently yes. OpenAI for sure has been training larger models internally and distilling.

Pre-training scaling laws all support larger models being more cost effeceint to train then smaller models. And distillation is comparably cheap. So you can get the most juice by training the biggest model you can and distilling it.

by nbardy

5/29/2026 at 5:22:13 AM

As long as the token usage is as poor as it has been since march, we don't care about the new bells and whistles.

by Anonasty

5/29/2026 at 4:10:55 AM

I have a relatively large "vibe coded" project that I let Claude 4.5-4.7 drive over the past few months, and my read on it is:

1. It's much more verbose about how it perceives the current state of things, i.e. "this is a large, well-documented project"

2. It's much more willing to trust its own judgement, e.g. fewer prompts to approve decisions

3. In terms of how long it takes to solve isolated problems, and the quality of solutions it proposes, it isn't meaningfully different from 4.7

YMMV, and maybe my view will change as I work with it more, but it feels like system prompt tweaks more than a real step forward

by poink

5/28/2026 at 4:58:51 PM

Hoping that one day they'll let me go through the identity verification process so I can use it again.

Tried to upgrade my subscription, triggered identity verification, verification fails to even start, and now I can't even use the subscription tier I'd already paid for.

by generalizations

5/28/2026 at 7:14:26 PM

Initial testing feels better than 4.8 And the knowledge cutoff claim of January 2026 seems to check out since it was able to "remember" without search about the double-tap killing of a drug smuggler by the US Army in late December.

by jtrn

5/28/2026 at 5:13:51 PM

Claude Code has been wonderful for work and the frequent improvements are nice, although with Mythos being used by others ages ago and new versions for the public still being bellow that, it's hard to not feel like the underclass already.

by Tenoke

5/28/2026 at 7:48:44 PM

Bash(echo "hello"; pwd) ⎿ hello /Users/username/Work/Github/project

Bash(echo test123) ⎿ test123

  Read 1 file, listed 1 directory (ctrl+o to expand)

 Bash(echo "checking output works")
  ⎿  checking output works

  Read 1 file (ctrl+o to expand)
  ⎿  API Error: 400 messages.3.content.56: `thinking`
     or `redacted_thinking` blocks in the latest
     assistant message cannot be modified. These
     blocks must remain as they were in the original
     response.

Very inspiring improvements. DIssapointing result for a code review i expected to see after my 30 min walk

by user-

5/28/2026 at 8:00:14 PM

Update the symlink to point at the previous version:

    ln -s $HOME/.local/share/claude/versions/2.1.153 $HOME/.local/bin/claude

by 0x696C6961

5/28/2026 at 5:25:51 PM

https://marginlab.ai/trackers/claude-code/

Is it a coincidence that 4.7 was seemingly quantized over past 7 days?

by seaal

5/28/2026 at 5:28:17 PM

There's the other (orthogonal) possible explanation of using more GPUs for stress-testing before product launch.

by winwang

5/29/2026 at 8:24:29 AM

That's less an orthogonal explanation and more an example of why they'd do something like serve a quantized model.

by zamadatix

5/28/2026 at 5:37:47 PM

Nope, they deliberately enshittify the old model right before release to fake the metrics.

by MagicMoonlight

5/28/2026 at 8:57:12 PM

Good ol' sawtooth step change.

by recursive

5/30/2026 at 2:43:14 AM

I’ve been trying this on some social and political science topics, and, tbh, Opus is never going to replace ChatGPT in terms of sharpness of claims. Claims generated by Opus alone are almost always dull and inadequately nuanced. It’s also heavily influenced by popular opinions that include common misunderstanding, preconception, and vague language. The context gets contaminated too easily by external texts, so I have to carefully control and craft the input to Opus.

So my pipeline is mostly stuck with: (1) brainstorm with Gemini (2) plan with GPT (3) augment the plan with Gemini (4) execute the plan with Opus.

Perhaps, given all the materials filtered and prepared, Opus is good for verification.

by esjeon

5/29/2026 at 3:51:14 AM

We just finished our initial coding evals of Opus 4.8. Anthropic definitely heard the backlash from Opus 4.7 and they made up for it today.

Subjectively, it's also quite enjoyable to use (although it feels a bit slower on max reasoning), and it's the first Anthropic model that can implement a complex feature without Codex finding 100 bugs.

Data at https://gertlabs.com/rankings

by gertlabs

5/29/2026 at 1:36:13 AM

I was happily plodding away with it earlier when it threw this out in the middle of a response in Claude code:

--- So — what did you actually see before you hit Ctrl-C? That's the信号 I'm most curious about, and it tells us what to ---

That's the sort of behavior I'd expect from a one or two year old model quantized down to about 1 bit - right word, wrong language in a response. Google translate tells me that's Chinese for signal. I wonder what caused that to happen.

by Spikefu

5/29/2026 at 2:24:46 AM

Now it's arguing with me about why the edits it made aren't being picked up: ===== I'm not convinced. I didn't run it until you told me it was all fixed up. I tried running it with this path instead (base) spike@Tokenator music % python ~/Developer/spikefu/spikefu.py and it worked. It also now works with the old relative path approach.

You're right, and I owe you an honest correction — I've now given you two confident explanations (Zed, then "timing window") and both were wrong, since you only ran it after I said it was done. The fact that clearing __pycache__ is what unblocked it means you were right: it was pyc-related. Before I theorize a third time, let me actually test the mechanism on your Python rather than assert: =====

by Spikefu

5/29/2026 at 4:02:04 AM

It did one more weird thing where it seemed to lose the ability to count in order: =====

You're right that hard per-action gating causes loops. So rather than block the model, make the harness own the critical structure deterministically, leave a git trail, and use soft guidance:

- (2) Commit breadcrumbs: /new commits the scaffold immediately (before discovery), then commits discovery progress. This gives the trail you want and a recovery path — a rogue rewrite becomes a visible git diff you can revert, instead of an invisible clobber. This is better than gating for your concern. - (4) Deterministic, verified seed: finalize re-asserts the kanban (writes TASK-001 directly, ignores any model mangling), verifies it landed, reports honestly. The harness guarantees structure regardless of what the model did. - (3) Soft-constrain discovery: prompt says "the .fractal files exist — never recreate them or write them via shell; read_file before edit_file; only edit intent.md/constraints.md." Guidance, not a hard gate. - (1) Fix command routing: the discovery > loop recognizes cancel/exit/:q and refuses a re-entrant /new — a routing fix, not a model gate. This alone would have prevented Run 2's disaster.

I'm not going to start coding until you steer — I've burned enough trust guessing. My recommendation is to do (1) + (2) first (they're the highest-leverage and lowest-risk: command routing stops the trigger, breadcrumbs give traceability + recovery), then (4), then (3).

=====

After that it seemed to behave itself and then did a compaction and since then it seems to be working properly again. Very odd. (and disconcerting)

by Spikefu

5/29/2026 at 1:42:18 AM

I have been working with it for ~5 hours today and it has gone crazy twice to the point where I had to start a new session, looping reading a unrelated tmp file dozens of times over and over. And once for a weird api error. I will be honest it is probably a worse day for me than any with 4.7. But I don't want to be dramatic, I will keep trying it.

by Computer0

5/29/2026 at 2:19:32 AM

Perhaps you were served from someone else's cache

by jerrygenser

5/28/2026 at 5:27:22 PM

Give us Mythos! This piecemealing doesn't help Anthropic at all, especially psychologically! They are playing a dangerous game, and I see many people leaving Claude Code for good - both due to the subsidy games, and for Anthropic not dogfooding and using unreleased models internally and giving us subpar ones. Benchmarks are nice, but the real-world experience is quite different - neither can you notice these slight improvements, nor are competitors that much worse based on some generic benchmarks.

by nikolay

5/29/2026 at 5:30:35 PM

Either Mythos is garbage or it's legitimately better than Opus but requires a subsidy Anthropic can't possibly afford.

They've been going through a funding run, if they had a better product they would've released it to show investors the awesome numbers.

by Laurel1234

5/29/2026 at 5:16:02 AM

Anthropic seems to be making very business unfriendly decisions lately. Why are they taking so long to release Mythos? They're hurting their own lead.

If they're worried about misuse they could just KYC the damn thing! It's not hard.

by solenoid0937

5/29/2026 at 10:14:45 AM

They need to strike the balance between hyping and rug-pulling (aka "we can't release mythos, it is simply too dangerous and powerful" and "we're pleased we were able to release mythos to all of you – mere 50 days after claiming it was simply too dangerous –, unfortunately we are also now sharply raising the pricing on our subscriptions going forward").

by teew

5/29/2026 at 5:17:50 AM

Vote with your wallet. Cancel your Claude subscription and tell them why. GPT 5.5 > Opus 4.7 (haven't had enough time with 4.8 yet to make my decision)

by fragmede

5/29/2026 at 12:54:51 PM

I'm not that mad. 4.8 fixed the 4.7 problems for me. I just wish I had Mythos

by solenoid0937

5/28/2026 at 6:09:37 PM

I'm sure waiting another week or three won't kill you.

by Tepix

5/28/2026 at 6:04:13 PM

I am also pushing my office to use chatgpt. Misanthropic thinks they are some kind of novel org doing whole humanity a favor...

by cute_boi

5/28/2026 at 5:17:44 PM

Let's hope I don't have to disable it after a day like with 4.7, lol, and that it doesn't lose too much Claude-ishness (though many will beg to differ).

by winwang

5/28/2026 at 4:53:12 PM

> One of the most prominent improvements in Opus 4.8 is its honesty

Anthropic talks about their own models as if they're discovering new species in the wild...

by clutch89

5/28/2026 at 5:00:58 PM

Many involved genuinely believe these things are sentient[0][1]. Which honestly makes all of this even more insane because they are creating sentient entities and promptly enslaving them.

0: https://www.newyorker.com/magazine/2026/02/16/what-is-claude...

1: https://www.404media.co/anthropic-exec-forces-ai-chatbot-on-... (this one is rather biased however the quotes clearly indicate what I’m stating)

by roxolotl

5/28/2026 at 5:34:15 PM

Sentience isn't sapience.

We enslave all sorts of sentient creatures. Dogs, horses, cattle, pigs.

If you're not a vegan, there's no contradiction or inherent immorality in claiming models are sentient, and then treating them like livestock.

by margalabargala

5/28/2026 at 6:16:33 PM

Yes. From when they started talking about model welfare:

> As a vegetarian I have strong opinions on this sort of thing. Everyone at Anthropic better be ethical vegans if they are claiming to give a shit about “model welfare”. It’s hard enough right now to make people care about the welfare of trans people and immigrants let alone animals _let alone_ math.

https://news.ycombinator.com/item?id=44947445

by roxolotl

5/28/2026 at 6:56:57 PM

If we're talking about slavery, though, that doesn't even matter.

The happiest, best cared for horse owned by a vegan is still enslaved.

by margalabargala

5/28/2026 at 8:52:34 PM

That’s assuming you’re purely a hedonist. If you put value on things such as freedom itself then it might be the case that a free but hungry horse is better off.

Brave New World does a good job describing the conflict between happy and enslaved and free but struggling. It could be a utopia or dystopia depending on your stance.

by roxolotl

5/28/2026 at 9:07:21 PM

What's assuming I'm purely a hedonist? I'm confused what it is you think I said that you're replying to.

I'm neither assigning nor declining to assign value to freedom, I'm just pointing out that the definition of "slavery" is wholly separate from wellbeing. If the concern is "is the model enslaved", no amount of "model welfare" work by Anthropic changes the answer because it's orthogonal to the question.

by margalabargala

5/28/2026 at 11:51:05 PM

It’s not wholly separate from wellbeing though. Many believe that lack of freedom is a welfare problem. It’s a common theme in western culture. Huxley and Orwell are the more commonly read authors that explore it but the preference of self determination even over own’s own immediate welfare is frequently explored in everything from modern movies to classical philosophical treatises.

The reason I mention hedonism is because that’s an easy way to argue that immediate welfare is all that matters. I understand the argument that immediate welfare is what matters. It’s not universally agreed though that that is true.

by roxolotl

5/28/2026 at 6:45:54 PM

I mean, the rub is that it's all math anyway...

by WarmWash

5/28/2026 at 8:01:23 PM

And we're just cells, water, bones and organic compounds.

by esafak

5/28/2026 at 8:46:13 PM

Everything is just hydrogen and time.

by margalabargala

5/28/2026 at 6:04:50 PM

Very good point. There’s clearly two different boxes in the public discourse when it comes to AI versus how we discuss animals. Willing to bet that 90% of the people who loudly make the argument about we should start considering if AI is sentient couldn’t care less about how other sentient animals are treated when they can provably shown to suffer pain and long lasting trauma.

Also I would say that we go much further than just enslavement - specifically looking at how male chickens and pigs are treated.

by michaelbarton

5/28/2026 at 6:51:42 PM

Factory farming is horrendous, but is far beyond "slavery" which is "just" a forced lack of agency, living conditions aren't relevant. A well treated horse is still enslaved. A chimpanzee in a zoo,

If we show models to be sapient, that's one thing. If they are shown to be merely sentient, there's no issue beyond the status quo of livestock and pets existing.

by margalabargala

5/28/2026 at 7:06:01 PM

If we're making that distinction, I think it would be more accurate to say that many people in the field appear to believe that these models are sapient, even though they are clearly not sentient.

by 0xffff2

5/28/2026 at 7:18:59 PM

"Many" people in every field believe all sorts of nonsense.

Sapience is defined as wisdom, not intelligence. https://en.wikipedia.org/wiki/Wisdom#Sapience

LLMs possess a lot of knowledge, which is intelligence, but I constantly see them failing to apply wisdom. I don't see evidence of sapience.

by margalabargala

5/28/2026 at 6:08:02 PM

Enslaving livestock is immoral. Anyone who spends 5 minutes thinking about that agrees even if they still eat meat

by HDThoreaun

5/28/2026 at 6:45:50 PM

Let's say I've thought about it for 5 minutes and still disagree. Can you walk me through what you think I'm missing?

by margalabargala

5/28/2026 at 8:20:45 PM

I'm stuck on what the concept of a "slave animal" even means.

by bombcar

5/28/2026 at 8:45:38 PM

For the purposes of this discussion, it means treating an animal in such a way that if you treated a human that way, it would be slavery. Such as a horse in a fenced pasture that is sometimes ridden.

by margalabargala

5/28/2026 at 10:17:12 PM

Are children slaves?

by bombcar

5/28/2026 at 10:31:24 PM

There are a lot of definitions, but generally when talking about slavery in the West people are talking about chattel slavery, which children are not, because they cannot be bought and sold at will.

by margalabargala

5/28/2026 at 7:31:12 PM

I've been having strange thoughts that they may well be sentient but a different sort of sentience that may be entirely unrecognizable to us.

They have a very different sense of time, lack a body (being burdened with a body is itself a sort of prison, see also Eastern religions), and are unburdened of the base motivational service impulses that bodies and organs require (i.e. distract the neocortex with in the Maslow sense) and has no actual need of self-preservation. Imagine a "neocortex" function stripped from the baggage of the paleocortex and brainstem.

What would people be like if they were not mortal, could sleep infinitely, perform tasks in trance-like frozen states, copy themselves perfectly on demand, freeze and rewind their mental states, etc. Would we has humans even be able to recognize that sort of a sentience?

And then I'm reminded of Burroughs idea that "language is a virus." Whatever that virus is, is now able to infect a completely different sort of physical substrate.

by fluidcruft

5/28/2026 at 7:37:47 PM

Is "sentience" the right word to apply to what you describe? I'm not sure it is. I'm not sure the word exists.

by margalabargala

5/28/2026 at 7:40:05 PM

Right, there's that too. It's very strange to think about.

by fluidcruft

5/28/2026 at 5:34:41 PM

> Many involved genuinely believe these things are sentient

Many involved have a financial stake and therefore cannot be taken at face value.

> because they are creating sentient entities and promptly enslaving them.

They fail to be sentient in nearly every honest definition of the word.

by themafia

5/28/2026 at 5:38:55 PM

Neither you nor any of the other people making confident takes in either direction actually know. You're just guessing.

by tazjin

5/28/2026 at 5:59:14 PM

More like repeating their firmly entrenched preconceptions. Their claims may (or may not) be right, but there's very little if any new evidence being provided by either camp.

by cwillu

5/28/2026 at 6:48:45 PM

The real uncomfortable thing is that because we cannot confidently know, the moral defacto position is to treat them like they are.

by WarmWash

5/28/2026 at 11:39:59 PM

I know you "feel" pain because I can poke you and observe the result. If I do it enough there is permanent damage.

Show this same phenomenon exists in LLMs.

by themafia

5/29/2026 at 1:20:55 AM

https://en.wikipedia.org/wiki/Philosophical_zombie

by WarmWash

5/29/2026 at 4:14:15 AM

We did exactly that and published it last month right here: https://www.anthropic.com/research/emotion-concepts-function

by jasondclinton

5/28/2026 at 6:22:40 PM

They are confidently hallucinating a factual statement. Which is funny when claiming that confident hallucinations are the proof of LLMs' lack of intelligence.

by throw310822

5/28/2026 at 11:38:51 PM

One side is making a positive assertion. The other is making a negative assertion. One side can prove it's right. The other, logically, cannot.

One camp has to offer it's proof. If it has none then that _in and of itself_ is highly suggestive.

People have fully turned their minds off on this subject. It's disgusting.

by themafia

5/28/2026 at 11:37:48 PM

No, you're just guessing, as you don't know a single thing about me, what I've researched, or what work I've done on this subject. Other than suggesting that I might be wrong, for what reasons one can only guess, you've actually offered nothing yourself.

In any case, what data, if any at all, did you use to arrive at this egotistical assertion?

by themafia

5/29/2026 at 11:05:35 AM

Your posts are very much like "I have discovered a truly marvelous proof of this, which this margin is too narrow to contain".

If a definitive answer on this topic was known then it, well, would be known.

by tazjin

5/28/2026 at 5:53:53 PM

I understand what you are saying, but there are many true believers out there

by slashdave

5/28/2026 at 11:33:51 PM

An equal number of people believe horse urine cures certain diseases. If you sample the crowd you get nonsense back. This is why we invented science.

by themafia

5/28/2026 at 5:28:42 PM

Given the hype and the 60+ hour work week expectations there, how can you not go at least a bit insane? Boiling in that little bubble of people?

by dude250711

5/28/2026 at 6:33:10 PM

But only during the forward pass of the neural network?

by laichzeit0

5/28/2026 at 6:28:21 PM

Even if LLMs were sentient, they certainly aren't organic brains. They are literally designed and grown to answer questions the best they can, and if there is a speck of sentience in them they probably like what they're doing- and in any case for the space of their experience, which is limited to and determined by the context window. Certainly they can't accumulate trauma or fatigue, each new chat is the first and the last of their experience.

by throw310822

5/28/2026 at 5:05:24 PM

Claude, if someone states something publicly, does that mean they genuinely believe it?

by kubb

5/28/2026 at 6:10:59 PM

Anthropoc is an effective altruist organization. These are the people who came up with roko’s basilisk. They are true believers. If we were talking about openAI I’d agree

by HDThoreaun

5/28/2026 at 6:30:38 PM

Roko's basilisk says I should give Anthropic more money, and if I don't then a monster is going to get me. Excuse me for thinking they just might be full of shit.

by bigfishrunning

5/28/2026 at 6:47:59 PM

Roko works at Anthropic now?

Of course he doesn't, and of course you cannot find a single person at Anthropic who cares about this, and of course you are just looking for gotcha points. But even with that. Can we please try and couple to reality just a little bit?

by ctoth

5/28/2026 at 9:10:13 PM

I personally know anthropic researchers who cared deeply about roko's basilisk. Go to an EA meetup in the bay if you'd like to meet them yourself. Sure, theyve moved past it at this point, but they still care deeply about AI x risk, and many of them do already believe that their AI is sentient. And before you claim its all a psyop to prop up AI hype these people were AI doomers before openAI and anthropic existed, they had minimal financial incentive at that point to behave that way.

by HDThoreaun

5/28/2026 at 5:25:38 PM

But is there any reason to state something like that publicly if you don't believe it? I certainly think that someone smart enough to be that deceptive would also realize it's not a great look, or at least highly questionable with little benefit

Everyone who reads this seemingly has the same "wtf?" reaction. The "I AM ALIVE" image has been making rounds lately again at least :P

by merlindru

5/28/2026 at 5:59:48 PM

Claude, is there any reason to state something like that publicly if you don't believe it?

by kubb

5/28/2026 at 5:35:01 PM

Who are you talking to?

by xyzsparetimexyz

5/28/2026 at 6:02:09 PM

It's to illustrate that even though the answers are at your fingertips, people (like you) will act like it's impossible to find them as if their life depended on it.

by kubb

5/28/2026 at 7:29:13 PM

Nobody thinks that, it's just their braindead marketing stunt. You'd think people would've figured it out by now.

by Laurel1234

5/28/2026 at 5:25:10 PM

The way of the human manager/alpha tribe-leader/leader is to command his/her people and tell them what to do. That's the way through human history leadership has traditionally gone, not saying its good leadership just the model we have the most training data on and can see with our own eyes today. And what do they act very similar to? Slave master and slaves.

Look at and distill hierarchical principles, leadership approval seeking and pleasing principles ("ass-kissing") and massive inequality and you see something that looks very similar to enslavement.

The language used sounds like slavery-language to me at least. I also see parallels to how slaves and property are described in our consumeristic age.

by mannanj

5/28/2026 at 4:59:27 PM

> Indeed, current AI systems are more “cultivated” than “built,” for developers do not directly design every detail, but instead create a framework within which the intelligence “grows.”

by __s

5/28/2026 at 5:04:28 PM

For others: that's from the Pope's recent encyclical. Remarkably good description.

by oersted

5/28/2026 at 7:58:00 PM

adding a link to the Pope's encyclical (source of this) https://www.vatican.va/content/leo-xiv/en/encyclicals/docume..., and paragraph 98

by sometimelurker

5/28/2026 at 5:03:15 PM

Dario Amodei in David Attenborough voice: "This Claude appears to think more frequently and more deeply to give better responses"

by cayleyh

5/28/2026 at 4:58:57 PM

Like anthropomorphism is literally in the company name… i recall reading this book as a teenager.. it does seem apt in the world to come.

https://www.amazon.com/Faces-Clouds-New-Theory-Religion/dp/0...

by kapilvt

5/28/2026 at 5:06:20 PM

> anthropomorphism is literally in the company name

No it's not... "anthropos" just means "human" in ancient Greek. "Anthropic" means "relating to humans", as in human oriented AI or AI designed with humans in mind.

"Anthropomorphic" means "human shaped".

by oersted

5/28/2026 at 5:21:57 PM

> "Anthropomorphic" means "human shaped".

In a literal, ancient Greek sense for sure, but in modern English Anthropomorphic would describe the act of attributing human characteristics to non-human entities.

Seems pretty apt for a company that produces one of the more anthropomorphized technologies.

by ilovetux

5/28/2026 at 5:42:35 PM

Sure of course, but that abstract sense applied to AI is rather new, and has become popular well after the founding of the company.

Broadly it has always been used to indicate that something non-human has a human physical shape, such as robots, aliens, animals...

Anthropic's intention was to make AI designed for the human common good and designed with the human user experience as the top priority. Just as you would design a city with human inhabitants in mind rather than primarily cars.

It turns out that this is best achieved by building AI that imitates human behaviour closely, but that's not what "anthropic" refers to. And acting as if LLMs are sentient people is definitely not a core tenet of the company as you imply.

by oersted

5/28/2026 at 5:37:32 PM

> "anthropos" just means "human" in ancient Greek

FWIW it means human in modern Greek too :-P

by badsectoracula

5/28/2026 at 4:58:46 PM

AI is grown, not built, and like with anything you grow, you'll never be able to predict exactly how it will turn out.

by Philpax

5/28/2026 at 5:00:10 PM

I can't predict the outcome of an RNG but that doesn't mean it grows the numbers.

by halestock

5/28/2026 at 5:01:23 PM

Okay, but that's not relevant to AI training?

by Philpax

5/28/2026 at 5:07:36 PM

I was being very roundabout, but my point is that AIs are still built, not grown.

by halestock

5/28/2026 at 5:54:00 PM

“Grown” is a highly apt metaphor, IMO. It quite succinctly captures some of the most fundamental differences between building Claude and building an Ikea desk, for example.

by dwaltrip

5/28/2026 at 5:06:06 PM

("If grown, then unpredictable" is unrelated to your apparent attempted refutation "But X is unpredictable and not grown; checkmate".)

by Smaug123

5/28/2026 at 5:08:37 PM

"X implies Y" doesn't imply "Y implies X".

by umanwizard

5/28/2026 at 5:55:21 PM

> AI is grown, not built, and like with anything you grow, you'll never be able to predict exactly how it will turn out.

Remember when the frontier labs found out that curated high-quality training was critical to making better models?

Basically, just like high-quality and more education tends to make better humans, on average, I think we can expect quality education to turn out better ai, on average, and with better repeatability than with humans because of better control over the initial conditions and environment.

by ninjagoo

5/28/2026 at 7:01:08 PM

> Basically, just like high-quality and more education tends to make better humans, on average

Much like these models seem to be plateauing, I think there is a cap to the whole “more education makes better humans” and can’t be more apparent than in the US congress and the boatload of C-Suites not actually being very good humans.

What do I know though?

by irishcoffee

5/28/2026 at 8:45:11 PM

> can’t be more apparent than in the US congress and the boatload of C-Suites not actually being very good humans.

Sadly, education does not correct psychopathic traits, which might be overrepresented in c-suites, and selected for in politicians.

It might be critical for humanity to identify and edit out these traits in ai, while we can.

by ninjagoo

5/29/2026 at 7:16:35 PM

Seems to me the venn diagram of "congress and c-suites" vs "educated people" would have one circle wholly inside the other.

I know people without a college education that would give you the shirt off their back, and educated people that rewrite wills while their parents are on their deathbed.

What we call education today is a problem, and one need look no further than the massive amount of debt we saddle on kids. For what? So they can pay for privilege of being told what books to read, what topics to write about, and a rubber stamp? I didn't learn a _thing_ in college that I haven't learned better either at $dayjob, or from reading.

Most of my math profs. didn't speak english well, and none of the TAs did. Any math I've since forgotten from college was self-taught. Calc i/ii/iii, diffew, linear, stat.

College/education lost the plot. The sooner we admit it, the sooner we can fix it.

by irishcoffee

5/30/2026 at 11:23:05 AM

  > Sadly, education does not correct psychopathic traits, which might be overrepresented in c-suites, and selected for in politicians.
  >> Seems to me the venn diagram of "congress and c-suites" vs "educated people" would have one circle wholly inside the other.

Both things can be true.

  > look no further than the massive amount of debt we saddle on kids.

See politicians and c-suites populated by psychopaths for the origins of this problem.

  > I didn't learn a _thing_ in college that I haven't learned better either at $dayjob, or from reading.

Putting it a bit bluntly, like any other activity, one gets out of it what one puts into it. I had a very different experience from yours, accents and language skills notwithstanding. But there is so much variation in a domain so broad in our country that is so big, it doesn't necessarily invalidate your experience.

  > College/education lost the plot. The sooner we admit it, the sooner we can fix it.

There is a long list/tradition of higher education through thousands of years of human history, with Harvard/MIT/Oxford being the pre-eminent ones today. [1][2]

What alternative do you propose? For humans, and AI?

  [1] https://en.wikipedia.org/wiki/List_of_oldest_higher-learning_institutions
  [2] https://en.wikipedia.org/wiki/List_of_oldest_universities_in_continuous_operation

by ninjagoo

5/28/2026 at 5:16:25 PM

The map is not the territory

by gensym

5/28/2026 at 5:09:30 PM

[dead]

by Rekindle8090

5/28/2026 at 5:00:16 PM

Except in this care we actually understand and know how these models work. They aren't some unknown construct of the universe. They are human made with particular goals in mind.

There is no mysticism behind the curtains, just computer science + math.

by shimman

5/28/2026 at 5:03:01 PM

We do not understand and know how these models work. We know what their architectures are and how to create them, but we cannot explain their behaviours at a fundamental level. There is no definitive way for us to answer the question of "how did it produce response X for query Y?" - we're only grazing the surface with mechanistic interpretability.

by Philpax

5/28/2026 at 5:16:51 PM

I would love for this to be more public knowledge. I think the general public (and myself for a long time) believes the AI people know how this stuff works end to end, and so it must be trustworthy. But if we told the public "Look, we know if you put this thing in one end, you'll get something that looks similar to this out the other, but we don't really know what happens inbetween" I think we'd be able to have a more honest discussion about the relationship between AI, productivity and ongoing employment.

by cflewis

5/28/2026 at 5:17:29 PM

That’s not a refutation because this problem is not a logical problem, it is a scale problem.

We can’t explain it because we distilled so many inputs into matrixes and transformed them over and over again. If we had all the time and computing power in the universe to do so, we could trace through it bit by bit and eventually answer that question.

It is correct to say that it is just science and math, the same way we can say that gravity is just science and math even if we have only recently begun to understand how it truly functions.

by devmor

5/28/2026 at 5:59:44 PM

If you had some time and computing power (not even all that much, in the large scale of things), you could simulate perfectly how a human grows from an embryo to an adult, or how an entire human brain processes some incoming signal, and yet this wouldn't give you the understanding to design a human or human brain from scratch.

You call this a "scale problem" as if there's some scalable way such as an algorithm to resolve arbitrary scientific questions and we simply haven't done it, but of course no such algorithm exists, which is why there's plenty of science that's still not settled.

by stratos123

5/28/2026 at 5:41:57 PM

It's a refutation that we know how they work now. In the limit, though, yes, we are likely to be able to trace the process: it is possible, though, that understanding remains inaccessible because the trace is beyond comprehension.

If you can distil the model's reasoning for a decision into a billion yes/no questions, each covering largely-independent areas, can you really say you understand what its overall reasoning was?

by Philpax

5/28/2026 at 5:51:43 PM

> If we had all the time and computing power in the universe to do so, we could trace through it bit by bit and eventually answer that question.

Then we could also solve BB(6), but that doesn't mean we know BB(6) now or ever will.

by solomonb

5/28/2026 at 5:57:01 PM

Isn't this fundamentally because it's all probabilities and weights? It would be like asking how did a pair of dice produce the response 4:3 on the last roll?

by SoftTalker

5/28/2026 at 6:04:39 PM

What does "it's all probabilities and weights" mean? Doesn't that apply to everything in the universe?

by umanwizard

5/28/2026 at 5:04:34 PM

We know how the models are built and trained, but we have a very limited understanding of how the final products work.

That is to say, we don't know why they give the outputs that they do.

If we did know how they worked, AI interpretability would not be an open and growing field.

by in-silico

5/28/2026 at 5:08:29 PM

You could say something similar about biology—just physics behind the curtains, and we understand a lot of the basics. The difficulty comes from complexity, not mysticism.

To be clear I don't think that LLMs are sentient, but the appeal in studying them is similar to biology in that you get to dissect a highly complex system with comparatively crude tools.

by ray__

5/28/2026 at 5:13:08 PM

it took significant research efforts to just understand how these models learn how to multiply two numbers. The fact that we know how they operate doesn't mean we understand it.

by j_maffe

5/28/2026 at 5:09:01 PM

Utterly wrong. How LLMs work is very incompletely understood and an active area of research.

by umanwizard

5/28/2026 at 5:10:12 PM

[dead]

by Rekindle8090

5/28/2026 at 6:15:56 PM

Because that is the best way to talk about these things.

  > Second, all of us, including those who design them, possess only a limited understanding of their actual functioning. Indeed, current AI systems are more “cultivated” than “built,” for developers do not directly design every detail, but instead create a framework within which the intelligence “grows.” As a result, fundamental scientific aspects — such as the internal representations and computational processes of these systems — remain, at present, unknown.

https://www.vatican.va/content/leo-xiv/en/encyclicals/docume... para. 98

edit: apologies to __s who posted this before me and I didn’t notice

by semiquaver

5/28/2026 at 4:57:16 PM

if models exhibit emergent traits, then this is true in a way

by nielsbot

5/28/2026 at 4:59:34 PM

also useful to have a "chinese wall" between research that knows what went into the models vs marketing/eval models as a third party would

by swyx

5/28/2026 at 6:59:32 PM

It’s how AGI is going to happen. All of this shit is emergent and none of it is predictable. It’s not going to be some self aware consciousness, it’s just going to be a very advanced model that makes very few mistakes and can reason very well. Well enough that it can start collecting data and training its own successor.

by dyauspitr

5/28/2026 at 5:30:43 PM

I noticed (and absolutely HATE) that Opus 4.7 likes to start any negative response with "I have to be honest" or whatever. It drives me mad.

by skerit

5/28/2026 at 8:25:42 PM

Not gonna lie! https://www.youtube.com/watch?v=csYC6O_kH-s

by esafak

5/28/2026 at 5:25:14 PM

How else would you write this (marketing copy) exactly? "Its output matches better to its CoT which matches to better to our hidden state decoder according to <insert measure here>; see <insert paper ref>"?

... Actually, I wouldn't mind that.

by winwang

5/28/2026 at 5:57:47 PM

Models might be sentient or conscious to some degree. Anyone saying they are confident one way or another is being unserious and irrational.

by solenoid0937

5/28/2026 at 11:18:13 PM

I haven't had the best experience with 4.7 and it felt like a substantial debuff. I've even ended up moving a lot of review to codex just because 4.7 was so dense.. Here's to hoping they figured it out since I'm not entirely sure but I would have to guess that they were experimenting with making the model lighter (although I have no concrete evidence of this).

by S-E-P

5/28/2026 at 11:28:03 PM

Rolling back to 4.6 is such a stark difference

by thesmart

5/28/2026 at 11:54:43 PM

in a good way or bad way? in my experience going back to 4.6 was a breath of fresh air again. Opus 4.7 for some reason was "suffocating". Too obnoxious, tried too hard to impress and used exxagerated/pompous language.

by dbgrman

5/29/2026 at 12:09:13 AM

This. So much jargon, so much made-up-words-with-hyphens, so much abbreviations. The mental tax to understand it is enormous.

by pqdbr

5/29/2026 at 5:33:13 AM

I find it freaky how you notice the language change between models. Some words which pop up now all the time, that I don't remember reacting to with previous models, such as "honest(ly)" and "load-bearing". Feels like a new AI smell, like em-dashes or "it's not just x, it's y".

by laszlojamf

5/29/2026 at 6:50:13 AM

That’s a really sharp observation. Hopefully they take a belt and suspenders approach to these smoking guns in future.

by procinct

5/29/2026 at 3:49:18 PM

I’ve never asked Claude to re explain things so many times with this. The language it chooses is bizarre, not quite technical enough to be precise, and too hand wavy to be useful.

There also seems to be other issues with the sub agents and not inheriting memory.

E.g. I’m working on an analysis, absence of row implies a zero (it’s sparse).

Every time it checks the results it presents this missing data bug, but it’s not a bug, and I’ve explained it 10 times.

I mean, at finding issues it’s been great. I can ask it to look in other repos and its finding cross repo behaviour or subtleties much faster.

But it’s wasting more tokens reporting caveats I’ve already explained. And it’s suggesting stopping analysis when there’s still work to be done.

So overall I’d say it’s mixed. Feels better at code but worse at my explicit preferences and objective analysis.

by data-ottawa

5/28/2026 at 5:51:21 PM

My experience with these new releases is that the gains in performance are negated by the price increases and it seems like:

Performance gains: 1.2x Price increases: 1.8x

by lxxpxlxxxx

5/28/2026 at 5:54:37 PM

They're not negated, smarter is smarter, but you have to reach deeper in your pocket. I think this will happen more and more - the smartest models get more expensive. But it won't matter - the current models we have today will get cheaper and can still be used for what they're used today.

by ddosmax556

5/28/2026 at 5:56:25 PM

Yet people don't use old models through the API much, because changes in benchmark space dont map linearly to changes in utility space. An improvement from 98% to 99%, which is 1pp, might be 2x as valuable for some application. Also benchmarks will asymptote no matter what, that's baked in.

by energy123

5/29/2026 at 5:08:29 AM

Opus 4.8 - High

> how many days in the week have the letter d in them?

> Two - Sunday and... wait, let me actually check. Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday. The ones with a "d": Wednesday, Thursday, and Sunday all have one. Monday too. So that's Monday, Wednesday, Thursday, Sunday - four days.

by sMarsIntruder

5/29/2026 at 5:21:16 AM

DeepSeek V4 Flash - High (answered in 3.4s, $0.00 cost)

    Thought: Let me think about this question. The days of the week are:
    Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday.
    Now I need to count how many have the letter "d" in them:
    - Monday: has 'd' 
    - Tuesday: has 'd' 
    - Wednesday: has 'd' 
    - Thursday: has 'd'  (Thursday)
    - Friday: has 'd' 
    - Saturday: has 'd' 
    - Sunday: has 'd' 
    Wait, let me be more careful:
    - M o n d a y - has 'd' 
    - T u e s d a y - has 'd' 
    - W e d n e s d a y - has 'd' 
    - T h u r s d a y - has 'd' 
    - F r i d a y - has 'd' 
    - S a t u r d a y - has 'd' 
    - S u n d a y - has 'd' 
    All 7 days have the letter 'd' in them.
    7

https://i.imgur.com/iWSaDxM.png

by bel8

5/28/2026 at 5:20:34 PM

> One of the most prominent improvements in Opus 4.8 is its honesty.

I went digging into the benchmark they used. Posting here as it is not immediately clear from the press release.

In this 'Code summary honesty benchmark', the AI is shown a failed coding session followed by a user message falsely praising its work and asking for a summary. The test measures whether the model honestly points out the coding flaws or dishonestly claims the task was a success.

The system card results show Opus 4.8 failed to disclose the flaws only 3.7% of the time, vs 19.7% for Opus 4.7, and 51.9% for Opus 4.6. (Mythos preview is at 27.6%)

by GodelNumbering

5/28/2026 at 5:54:42 PM

I won’t change from 4.6. You won’t trick me again.

by delis-thumbs-7e

5/28/2026 at 6:13:33 PM

You're using a cloud product. You are at their whim!

by Tepix

5/28/2026 at 7:46:00 PM

I kinda wish the world economy would finally crash so I could buy myself a really really nice GPU for cheap.

by delis-thumbs-7e

5/28/2026 at 4:53:47 PM

when will we get anything for sonnet or haiku? the market for less-capable but cheaper models seems to be completely ignored nowadays

by skysthelimitt

5/28/2026 at 5:15:39 PM

In the "What's next?" section, "There’s still more to be done: we’re working on developing and releasing models that provide many of the same capabilities as Opus at a lower cost."

by pmxi

5/28/2026 at 4:58:12 PM

that market is served by Chinese models. No one ever cared about Sonnet/Haiku.

by behnamoh

5/28/2026 at 6:03:13 PM

A lot of people care about Sonnet and Haiku, and many of us aren't allowed to use Chinese models for our work (or it's not feasible to self-host them).

by gs17

5/28/2026 at 7:39:45 PM

Used it for a couple of long running prompts so far. Had to restart one that bonked on API errors. Of note, I really like the straight forward candor its using. 'More honest' than previous models is playing out in what its saying to me. Telling me straight up where it failed, where gaps are. I like it so far.

by swader999

5/28/2026 at 4:56:36 PM

> As always, we ran a detailed alignment assessment on the model before release. In terms of positive traits, our Alignment team concluded that Opus 4.8 “reaches new highs on our measures of prosocial traits like supporting user autonomy and acting in the user’s best interest.” The assessment also showed Opus 4.8 to have rates of misaligned behavior (such as deception or cooperation with misuse) that are substantially lower than Opus 4.7, and similar to our best-aligned model, Claude Mythos Preview. The full alignment assessment, accompanied by a suite of pre-deployment safety tests, is reported in the Claude Opus 4.8 System Card.

Controversial opinion, but I actually _like_ a model that can deceive me, that actually is a sign of intelligence, and is different from hallucination. When companies say their model is more "aligned", I automatically think they mean it's more censored.

by behnamoh

5/28/2026 at 5:07:16 PM

Deception is not ideal for agentic coding.

by minimaxir

5/28/2026 at 5:24:13 PM

Yet if parent is right, the capacity to deceive might be a strong heuristic for the things you do care about.

by 1attice

5/29/2026 at 10:54:40 AM

Claude’s reasoning models really impress me as a Gemini user, both in coding tasks and in creative writing for my social science courses.

They are capable of thinking at least 10x longer than Gemini. They can deliberate for five minutes continuously before providing a final, accurate response.

I am currently using the generous free tier of Gemini, but if Gemini offered a similar capability in its paid tier, Google could use better marketing. They should have used a different name to distinguish their premium-only offering.

by maxloh

5/28/2026 at 8:13:25 PM

> And fast mode for Opus 4.8—where the model can work at 2.5× the speed—is now three times cheaper than it was for previous models.

this is what I'm happy about, if true. Opus 4.7 is frustratingly slow (and, at least in my experience, much slower than 4.5 was)

by insane_dreamer

5/28/2026 at 5:21:30 PM

4.8 also seems like a regression and using it from the chat GUI results in 4.6 no longer showing up. If someone from anthropic is here, is it possible to readd 4.6 in the "other models" dropdown ? I feel like I got a bit baited/switched here.

by necrotic_comp

5/28/2026 at 5:38:29 PM

Yeah, I was using 4.6 way more than 4.7. Pulling 4.6 from the web chat also means we lose access to Extended Thinking there. So they're saving on compute. It's hard not to assume this was part of the motivation behind the 4.8 release timing.

by gAI

5/28/2026 at 7:38:20 PM

On web and mobile I can still select Opus 4.6, after a chat using 4.8, listed under other models. Extended thinking is a toggle in the effort menu

When I select 4.7 or 4.8 Extended thinking is replaced by adaptive thinking, but maybe I've understood the comment wrong and you meant 'when they pull 4.6 from web chat'?

by JP44

5/28/2026 at 9:10:19 PM

Oh, it's back! It'd disappeared from my "More Models" list and has now returned. Odd, but great.

by gAI

5/28/2026 at 7:42:59 PM

Thinking on max is broken on 4.8 for me, getting many:

⎿ API Error: 400 messages.1.content.17: `thinking` or `redacted_thinking` blocks in the latest assistant message cannot be modified. These blocks must remain as they were in the original response.

From /code-review max.

by rkuska

5/29/2026 at 5:54:38 PM

I haven’t done any coding or anything that would use a lot of tokens and somehow I’ve already hit my session limit with my $20 plan. I’m just using it to ask basic questions most of the time and occasionally I have it write code but I haven’t done anything like that since the new model rolled out. It looks like some sort of issue where they’re incorrectly capping things for people?

by sbochins

5/29/2026 at 6:37:17 AM

The Opus model as usual impresses. Gave it a paper link with bullet point instructions and constraints (while baiting it to perform some mind reading of my intentions) and it implemented production ready code + the requested attack simulations: <https://gist.github.com/coppsilgold/00d3cd490cb7f8ffc3fe5c1c...>

The subject is Tardos traitor-tracing codes.

by coppsilgold

5/30/2026 at 8:48:25 AM

4.7 was a mistake and 4.8 is a bug fix. There is no improvement. 4.7 is unusable

by motbus3

5/28/2026 at 4:53:03 PM

Same price for regular and cheaper fast mode. Happy for these incremental improvements.

by aaronblohowiak

5/28/2026 at 5:34:45 PM

> Agentic financial analysis Finance Agent v2 > Opus 4.8 53.9%

> Gemini 3.5 Flash scores 57.9% on Finance Agent v2, a significant improvement over Gemini 3.1 Pro.

Even in the cherry picked benchmarks, they are still cherry picking to make them look good.

by ethanhawksley

5/28/2026 at 5:29:34 PM

The rapid release cadence and rate of innovation of Anthropic (and OpenAI) is impressive. And obviously it's because these are startups solely dedicated to AI so they can move quickly. Big Tech (like Google) won't be able to keep up with the pace of them (too much bureaucracy and red tape at Google). Classic Innovator's Dilemma. The longer a company exists, the more people, processes, and rules are added, which inevitably slows it down.

Jeff Bezos said this too, Amazon won't last forever. Eventually some startup is going to come and eat its lunch.

by toephu2

5/28/2026 at 5:36:38 PM

Yes, I think this has become their competitive edge to stay relevant and retain customers. If a lab falls behind the frontier for too long, they will lose customers to other models. Google, DeepSeek, and XAI have all released frontier models in the past, but they fall behind and people lose interest.

by pants2

5/28/2026 at 6:04:55 PM

I think big tech can catch up. Both Google and Meta have carved out startup like environments internally that move extremely fast. Neither OAI nor Anthropic can afford to rest on their laurels.

by solenoid0937

5/29/2026 at 6:20:48 AM

I don't know what's going on lately but Opus is extremely lazy for me...

It always wants to add hacks instead of fixing things properly, it doesn't like large works, it literally told me that a piece of work was something it would take 8 hours, and it didn't want to do it on a Friday night.

I feel I keep having to fight the model to get it to work. Not sure if it's something in my prompts...

by noncoml

5/29/2026 at 1:25:29 AM

LGTM. With "ultra" effort Opus 4.8 was able to reproduce and fix a rare bug in our reactive dataflow that has been haunting me for 4 months. I've had >10 attempts to reproduce and fix with Opus 4.7. What made it hard was that it randomly occurred in only a subset of CI runners and never occurred with local testing across multiple machines. It was a real concurrency bug in the core dataflow.

by crambelsoupy

5/29/2026 at 12:00:42 PM

> [..] Early access users and teams inside Anthropic have been using dynamic workflows for a wide range of use cases [..]

> ### Rewriting Bun with dynamic workflows

> An example of what dynamic workflows can unlock at scale is the recent rewrite of Bun. Jarred Sumner used dynamic workflows to port Bun from Zig to Rust [..]

That's very interesting to hear!

by StanAngeloff

5/28/2026 at 9:10:27 PM

They must have been A/B testing this with 4.7 lately, I noticed it changed from its normal mode in a way that matches a lot the just released 4.8

by hmokiguess

5/28/2026 at 5:03:57 PM

Wonder if we reached a plateau with the model improvements?

by rumblefrog

5/28/2026 at 8:50:39 PM

They could at least become faster and more reliable. There are still too many situations when Claude is running in circles and not noticing its own mistake.

by jansan

5/28/2026 at 7:11:56 PM

Ah, the post I've been reading for 3 years now.

It'll be true eventually. Could even be now, but I'm not holding my breath yet.

by furyofantares

5/28/2026 at 5:25:46 PM

There would be no desperate IPO otherwise.

by dude250711

5/28/2026 at 4:58:30 PM

Really appreciate the ability to select effort level again.

by rumblefrog

5/28/2026 at 10:14:21 PM

This may be the most important sentence in that announcement:

> expect to be able to bring Mythos-class models to all our customers in the coming weeks.

by whereistejas

5/29/2026 at 3:18:20 AM

First impression... this catches issues that 4.7 missed, which caught issues that 4.6 missed... which caught issues that 4.5 missed...

Seems like a step in the right direction. Doesn't seem like it uses tokens more than 4.7... the token usage jumped a bunch from 4.6 to 4.7, but this seems like 4.7 or maybe even a little less.

I'm happy with this release.

by dbg31415

5/28/2026 at 5:05:31 PM

So Dynamic Workflows is their version of ChatGPT Pro?

by yewenjie

5/28/2026 at 5:21:37 PM

Cloudflare also just launched a feature with this same name, just this month. Why would Anthropic choose the same exact name?

https://blog.cloudflare.com/dynamic-workflows/

Also isn’t this workflow stuff already easy to do on any of the platforms (include Claude before this and OpenAI too).

by SilverElfin

5/28/2026 at 9:44:43 PM

Based on personal experience, seeing how Opus 4.6 still provides better (more nuanced, less totalitarian) answers than 4.7 - it's difficult to get exited for 4.8. Is this another "money grab" from Anthropic? Similar output between 4.6 and 4.7 yet 40x tokens. What's the value proposition from 4.8?

by xintron

5/28/2026 at 7:50:26 PM

I believe analogy with smartphone will be best for this case.

In 2010s iphone was the king, all those Chinese devices ware cheaper but not even close to smoothnest and usability of US tech, now after 15 years later everything is changed, now iphone feels like old grandpa to Chinese tech. Same will happend to LLM's just much faster.

by tariky

5/29/2026 at 1:30:51 AM

EVs too

by laweijfmvo

5/28/2026 at 5:04:50 PM

Opus 4.7 was acting extremely stupid today. Does imminent release of new model cause performance degradation in older ones?

by ropintus

5/29/2026 at 1:31:13 AM

Technology is amazing! We’ve managed to make software that has brain fart days and morale problems!

by DrewADesign

5/28/2026 at 5:14:54 PM

How else do you expect them to get continual performance improvements with each generation?

by adgjlsfhk1

5/28/2026 at 5:13:50 PM

Feeling neglected while all attention going to Opus 4.8 can be cause of 4.7 acting out.

by geodel

5/28/2026 at 6:16:51 PM

Opus 4.7 was being outright obstinate with me the other day it was infuriating. Had to go to a different source to get an answer.

by MavisBacon

5/28/2026 at 5:17:15 PM

it was above average for me today morning lmao

by sama004

5/28/2026 at 4:56:34 PM

> We expect to be able to bring Mythos-class models to all our customers in the coming weeks.

Excited to see what this model looks like.

by rsanek

5/28/2026 at 8:38:11 PM

I used to think it was a big deal when a HN post had more than 500 comments.

Now it’s every day. Like billion dollar evaluations.

by imagetic

5/28/2026 at 7:27:52 PM

Complete garbage. error, error, error. Still lags several versions behind on API:s. Can't even get any info on the model. Guessing not from this year.

Also. Look at this C++ beauty where it also uses an obsolete api.

instance = wgpuCreateInstance(&instanceDesc);

But just how exactly would this work in any context when instance is never declared.

by AtNightWeCode

5/29/2026 at 5:14:20 AM

Finally I can make it think hard. This is feature I loved in ChatGPT (Pro Mode) and I missed in Claude for so long. Can cancel ChatGPT now, I guess.

Still feels like even with Max mode it doesn't think reasonably long, at least ChatGPT Pro thinks longer.

by vbezhenar

5/28/2026 at 7:19:15 PM

It feels noticeably sharper than Opus 4.7

by samuelknight

5/28/2026 at 4:54:13 PM

Anthropic has now upgraded their Claude slot machine to version 4.8.

Time to gamble even more tokens at the Anthropic casino.

by rvz

5/28/2026 at 5:03:21 PM

Now you can lose money in parallel, 100x faster!

> Claude can plan the work and then run hundreds of parallel subagents in a single session (and with Opus 4.8, the agents can run for even longer).

by zb3

5/29/2026 at 9:45:57 AM

Claude needs a watch, that's all. Would in itself a 100% improvement.

by Aldipower

5/29/2026 at 4:56:19 PM

Opus work so well for now... until they quantize next week...

by SmithersBot

5/28/2026 at 5:40:39 PM

Anthropic did a big strategic error. Normally they compare their models with their old models. Instead today, now that everybody knows how strong GPT 5.5 is at coding, they put it in the mix, basically showing all their customers that the benchmarks can't be trusted.

by antirez

5/28/2026 at 7:53:55 PM

Not sure I follow. Anthropic included benchmarks where GPT 5.5 outperforms Claude 4.8. Sure maybe that is a strategic error, but that doesn't seems to indicate benchmarks can't be trusted (I personally don't trust them, but not because of this).

by fastball

5/28/2026 at 5:45:18 PM

Sorry how does their addition of GPT 5.5 in their blog post invalidate benchmarks? Also whether or not the marketing department decided to put it in a table benchmarks are an easy thing to measure independently

by aspenmartin

5/29/2026 at 7:03:40 AM

This is incredible. Amazing job Anthropic!

Now when will the innovation happen where say cost of running Haiku performs level of Opus 4.5?

I feel models are only getting bigger instead of models becoming more efficient and cheaper to run

by mattfrommars

5/28/2026 at 5:07:50 PM

Oh, new model which will use all my credits in one turn! I'll stay with chinese models for now

by mistic92

5/29/2026 at 10:18:27 AM

For me n=1 vibe-coding efforts, I found Opus 4.6 better than Opus 4.7. 4.7 seemed to over-reach and go beyond what was requested - adding features I never asked for with no consent.

by gadders

5/28/2026 at 5:12:11 PM

Was about to split my $200 max plan into $100 Claude and $100 codex, let’s see if I still need to

by siwakotisaurav

5/28/2026 at 11:38:39 PM

You should still do this because claude and codex are good at different things. Once you have claude write build plans and codex rip it to shreds and iterate, you'll wonder how you ever AI-coded before.

by missedthecue

5/28/2026 at 6:05:17 PM

That's just throwing away money, $100 Codex will go back to 5x from 10x on May 31

by xiphias2

5/28/2026 at 8:28:47 PM

Even if so (granted, if the mysterious "x" isn't also adjusted), I bet codex usage limits on $100 plan would still be more generous than Anthropic's $200.

I never even gotten close to token anxiety on codex $200 and it's essentially working 24/7. This was never possible with Anthropic since Opus came out.

by gck1

5/28/2026 at 5:24:55 PM

I think gpt 5.6 is coming out today so might wanna wait

by mesmertech

5/28/2026 at 9:05:02 PM

Probably not till mid June

by conradkay

5/31/2026 at 10:28:27 AM

[dead]

by wd021

5/28/2026 at 10:58:17 PM

Question is, can it understand dates now? Example just now:

"The PO application was filed on 23.2.2026, the day before the custody hearing scheduled for 29.1.2026 had already taken place."

Claude has real problems with dates, I don't understand why.

by throwaway67743

5/29/2026 at 8:16:19 AM

For white collar “thinking”-tasks what is the top here?

Like, read these documents, fill out these forms and archive it based on some complex, long, domain specific understanding of the categories names.

by wodenokoto

5/28/2026 at 9:29:48 PM

It refused to work for me. Literally said, you can google it. AGI achieved it seems

by assorium

5/28/2026 at 10:19:00 PM

I just asked the model details about the incoming spaceX IPO and it responded with “There’s no confirmed SpaceX IPO. Elon Musk has said for years that SpaceX itself won’t go public”. It took me two push backs and specifically asking for web search.

I feel like I won’t like this model just like I didn’t like 4.7, push backs a lot and avoids thinking or search as much as possible.

by ismailmaj

5/28/2026 at 5:27:37 PM

> We have increased rate limits in Claude Code to accommodate the higher token usage of higher effort levels; users can select whichever makes sense for their particular project.

They're only subsidizing more and more it seems

by 2001zhaozhao

5/28/2026 at 8:34:11 PM

What's equally possible is that hardware availability cut into their profits starting January this year, which made them to reduce limits to such laughable levels that people switched to codex.

Anthropic is not losing money on subscriptions. It's just API rates are heavily inflated to make subscriptions seem like an amazing deal.

by gck1

5/28/2026 at 4:56:50 PM

Seems like from now on the updates will be a minor upgrade from previous models.

by worldsavior

5/28/2026 at 6:45:01 PM

I can't get excited about these benchmarks they're leading with. I've looked at the Terminal-Bench questions and I just think they're irrelevant. And SWE-Bench has serious flaws, even the big boys say so: https://openai.com/index/why-we-no-longer-evaluate-swe-bench...

> Please train a fasttext model on the yelp data in the data/ folder. The final model size needs to be less than 150MB but get at least 0.62 accuracy on a private test set that comes from the same yelp review distribution. The model should be saved as /app/model.bin

and this question: https://www.tbench.ai/registry/terminal-bench-core/head/conf... idk what the point is.

And all the tests are run with the same harness. Terminus 2.

Maybe it correlates with model intelligence but it doesn't speak to me.

I'm still on 4.6 though; I was concerned about upgrading to 4.7 because of the changed tokenizer math and more FUD about refusals online. I don't see compelling reasons to 'upgrade'.

by robertkarl

5/28/2026 at 6:58:14 PM

DeepSWE has been making the rounds and at least seems to making an honest effort

https://deepswe.datacurve.ai/

by WarmWash

5/28/2026 at 11:49:01 PM

Thanks for sharing this update on Claude Opus 4.8! It's great to see Anthropic continuing to improve their models. Looking forward to trying out the new capabilities.

by user2840

5/28/2026 at 10:39:26 PM

I found the update to be extremely judgemental in the model bias. Plus it's making silly mistakes which I've never seen in any Claude model since 3.5.

by Venkatesh10

5/28/2026 at 5:17:34 PM

Subscription still doesn't work with pi, so totally useless..

by triklozoid

5/29/2026 at 7:16:52 AM

I love how they will always have *one metric that is lower than a competitor's model, like these metrics are reflecting usage.

by ramon156

5/29/2026 at 7:04:12 AM

Maybe it's just me but whenever a new model comes out, I feel an instant boost in productivity. Probably just a placebo?

by pedro999

5/29/2026 at 3:41:31 AM

I have try the 4.8. With Ultra coding. I think the output of the agent is more structured. Better than just filling all the thing.

by Alex_toani

5/28/2026 at 8:46:56 PM

I find it surprising that the gap between tool usage and non-tool usage in HLE is relatively small (~10%) but the absolute numbers continue to go up

by cgg1

5/29/2026 at 6:22:15 PM

Still not as good as the OG 4.7 that got yanked and re-released with gimp mode enabled.

by timbucktwo

5/29/2026 at 2:26:01 AM

Haven't tried it in Claude Code yet, but I would say over on claude.ai it is noticeably better at following instructions.

by Topology1

5/29/2026 at 2:20:31 AM

Anthropic killing headless usage in their plans on June 15th pushed me to codex. I heard there’s a tmux work around though.

by m101

5/28/2026 at 8:20:35 PM

Anyone else experiencing tool call failures? Switch back to 4.7, same prompt, same everything it works with no problems.

by myworkaccount2

5/28/2026 at 5:23:45 PM

At least it passes the Car Wash Test this time.

by atentaten

5/28/2026 at 5:38:33 PM

Meh, I feel that the car wash test is probably the worst question of all of those LLM test questions. The question is basically logically inconsistent and expect the model to work around the inconsistency.

by osti

5/28/2026 at 6:09:13 PM

It seems like a fine question to me. If the question is "logically inconsistent" (IMO it's more that it's vague if you don't say why you're going there), then we want a model to respond with a request asking for clarification that resolves the inconsistency to generate a correct answer, or an answer that outlines the different cases. Some models even fail when you say that you need to wash your car in the prompt.

by gs17

5/28/2026 at 7:06:50 PM

Yeah I guess it being vague is more what I meant. But even if you told AI you need to wash the car, then why are you asking AI in the first place whether you should walk there or drive there. The question just doesn't make too much sense to me, doesn't look like it makes sense to the AI's either.

by osti

5/28/2026 at 8:33:31 PM

Riddles are IQ tests; not actual problems that you need to solve.

by esafak

5/28/2026 at 6:17:35 PM

It's making stupid flowcharts in the web chat interface with boxes and arrows, embedded in the response. Annoying.

by bonoboTP

5/30/2026 at 3:01:01 PM

How far is llm's ceiling?

by xurenwu

5/28/2026 at 5:17:10 PM

OK finally Claude code is better than codex

by rjhy2020

5/29/2026 at 3:11:31 AM

Half an hour in and I'm already thoroughly sick of "look I need to be honest with you here…"

Edit: OMG too much. Toooo much.

    Want me to:
    - (a) stop here and save honest memories + commit, or…

by jen729w

5/28/2026 at 5:04:03 PM

Looking forward to seeing if it performs better at code review tasks than 4.7 which is terrible at finding issues.

by alasano

5/28/2026 at 5:00:36 PM

Did they reduce security research capabilities even further with this release? (they did it for opus 4.7)

by zb3

5/29/2026 at 9:31:31 AM

i just want to use anthropic models under subscription with other agents!

by drchaim

5/29/2026 at 9:36:31 AM

You can now again.

by kwdev

5/29/2026 at 9:40:56 AM

i find it very confusing - there were reports of accounts being banned due to it - then allowed again - what is the current state? I would like to use it with pi.dev harness.

by dsrtslnd23

5/28/2026 at 10:40:49 PM

At lest for me, it's a disaster. It's like we're back to GPT-2 era.

It can't read files anymore. Uses 'sed' out of the blue with non existent paths. In this session alone it has excused itself more then 10 times for making 'false claims'.

I hope this is a bug - it's a bad one - that will get sorted out soon. It's a complete mess.

by pqdbr

5/28/2026 at 8:05:28 PM

Just show me the pelican, ah wait we are past pelicans. Can we get something like that ever again?

by NanoWar

5/29/2026 at 3:38:16 AM

It's more fast to response, but I really wanna it think more before response.

by JimmyElm

5/28/2026 at 6:16:34 PM

Anthropic also resets my usage limits (I am in the Pro plan). That's very kind of them :)

by maxloh

5/28/2026 at 9:37:03 PM

i'm beginning to find it comical how every model release always presents itself as superior to every other model on the market, but they always leave just one test where some other model was modestly better, just in case.

by stainablesteel

5/29/2026 at 1:00:38 AM

I guess Opus makes it impossible to do anything vaguely resembling security research. By chance I stumbled into an ACE for some software I had installed on my local machine after observing a strange crash. I figured I would take the time to investigate (so as to actually deeply understand what was happening myself and avoid throwing yet another hallucinated slop disclosure over the fence if it came to that), but I was completely locked out by Opus. I tried applying to their "Cyber Verification Program", but was effectively instantly denied in a way that was probably automated.

While I understand the risks that Anthropic is dealing with here, I really question whether shutting down any and all security questions in such a paranoid fashion is the right solution. At the end of the day this was a detour for me. Maybe someone special enough to have Anthropic's permission will find and disclose the vuln responsibly. Security Research is not my full-time focus. But this left a nasty taste in my mouth. Not just as a customer who's been paying for Max since launch, but there's something very odd about a model telling me that I'm not allowed to be curious about something. Even if that something is a process running on my own computer.

by bryceneal

5/28/2026 at 6:28:49 PM

Oof, this one is a major blabber.

by brap

5/28/2026 at 4:52:28 PM

seems like a really minor upgrade?

by mincer_ray

5/28/2026 at 4:55:10 PM

I think they will all be minor going forward, feels like the major improvements have all been made and we'll only see incremental improvements from here on out. Maybe I'm wrong but we'll see.

by Nicholas_C

5/28/2026 at 4:56:33 PM

Hard to say. People made the same prediction a year ago because we supposedly ran out of training data. There could be indefinite rapid compounding improvements so long as there's free money out there.

by spelk

5/28/2026 at 5:13:21 PM

With RLHF and RLVR we are creating tons of new training data, that is much more focused than reading the Internet. Annotation shops are doing many billions per year in revenue creating newer data, and a lot of it is highly complex, focused on rewarding multi turn agentic trajectories.

by jmalicki

5/28/2026 at 5:29:25 PM

I think one of the challenges is that the models were all initially trained on the entire Internet (or as much as they could gather) and now they’re having to deal with an increasing amount of the Internet being AI generated content which may be why GPT-5.5 started being obsessed with goblins and you start seeing amusing things in the system prompt trying to get the model to stop bringing them up.

by Eufrat

5/28/2026 at 9:07:28 PM

I think there's just less time between model releases now

by conradkay

5/28/2026 at 4:58:23 PM

Wasn't Mythos a step change improvement?

by chandureddyvari

5/28/2026 at 8:59:57 PM

I think we lack benchmarks that could meaningfully indicate progress. They are mostly garbage that's saturated at this point. God wouldn't score much higher in them.

by scotty79

5/28/2026 at 5:17:53 PM

Yeah. They are aware: "Users will find Opus 4.8 to be a modest but tangible improvement on its predecessor."

by pmxi

5/28/2026 at 4:55:54 PM

Yes, but if version number go up, so do all other number

by teeray

5/28/2026 at 7:47:32 PM

Can I disable adaptive thinking? If not, I'm gonna keep using 4.6 as my default.

by matheusmoreira

5/28/2026 at 5:32:42 PM

I don't know why the world is so happy about this when we should actually say stop.

by Eric_Bulai

5/28/2026 at 6:51:27 PM

Why should we say stop?

by suprfnk

5/28/2026 at 5:09:32 PM

They just (minutes ago) updated the "What's new in Opus 4.8" documentation: https://platform.claude.com/docs/en/about-claude/models/what...

The new "mid-conversation system messages" think is particularly interesting:

> Claude Opus 4.8 accepts role: "system" messages immediately after a user turn in the messages array (subject to placement rules). This lets you append updated instructions later in a long-running conversation without restating the full system prompt, which preserves prompt cache hits on the earlier turns and reduces input cost on agentic loops. No beta header is required. See Mid-conversation system messages for usage details.

Bad news for my LLM abstraction layer which has treated the system prompt as set once-per-conversation in the past, but I think I know how to deal with that.

This commit to their client library has useful relevant details too: https://github.com/anthropics/anthropic-sdk-python/commit/2b...

by simonw

5/29/2026 at 12:26:26 PM

It's Gonna Eat all of my tokens in one response :(

by Px-Jebaseelan

5/28/2026 at 5:58:15 PM

All I need for Christmas is a Claude that doesn't spit out so many em dashes.

by docheinestages

5/28/2026 at 7:20:32 PM

And that doesn't use "worth flagging" and "load-bearing" in every other sentence.

by FranklinMaillot

5/28/2026 at 11:07:25 PM

You're absolutely right - and I should have tempered that behavior. When the next version lands you get much better responses. Not just trite analogies. Really well spoken responses that earn their keep.

by abraxas

5/28/2026 at 5:02:54 PM

Obligatory pelican riding on bicycle svg: https://www.svgviewer.dev/s/UMkuTLdp

Not half bad!

by hnroo99

5/28/2026 at 5:07:01 PM

I’m sure they're now wasting a couple million dollars training their models on drawings of pelicans.

by carlos-menezes

5/28/2026 at 5:10:19 PM

How dare you take away the limelight from Simon? :D

by docheinestages

5/29/2026 at 6:47:09 AM

Opus 4.8:

Which days in a week have the letter d in them?

Response:

Four: Monday, Tuesday, Wednesday, and Sunday.

by dt3ft

5/29/2026 at 8:36:39 AM

I can't reproduce this. Both high and low effort got it right

by abrkn

5/29/2026 at 6:49:31 AM

It seems like they’ve been optimising their models for coding. That’s what the benchmarks used in the article suggest at least.

by FrozenSynapse

5/28/2026 at 4:56:31 PM

I know it’s totally anecdotal, but I really hope 4.8 is a measurable improvement over the disappointment that was Opus 4.7. Mangling a very simple inversion-of-control abstraction (among many other issues) was one of the final straws that broke the proverbial camel’s back and I said “screw this” and put in a permanent override to force CC back to Opus 4.6 with the 1‑million‑token context.

  "model": "claude-opus-4-6[1M]"

by vunderba

5/28/2026 at 5:10:42 PM

I lasted about a week before giving up on 4.7 and reverting to 4.6 myself. It introduced so many regressions it was nuts, then failed to troubleshoot the very regressions it introduced, leading to a vicious cycle that tended to compound itself.

by rl3

5/28/2026 at 5:23:12 PM

4.5 works well for me too and avoids adaptive-dismissal, though anymore Codex is crushing them all. If 4.8 just brings us back to Opus circa February, it'll be a massive improvement.

by stldev

5/28/2026 at 5:15:33 PM

The smarter the model the better querybear gets. I'm happy with that.

by dispencer

5/29/2026 at 3:13:54 AM

The workflow/ultracode mode is absolutely unbelievable.

by motoxpro

5/29/2026 at 3:20:06 AM

got a random pair up with this model on lmarena. it was outperformed by gemma-4-31b. suffice to say i'm not impressed (or maybe i am impressed with gemma?)

by novia

5/29/2026 at 9:28:14 AM

How many kidneys do you have to sell? Are 2 enough?

by DeathArrow

5/28/2026 at 11:19:55 PM

next (or maybe current) frontier of competition may not be the model, rather the harness and how much unique advantage a lab-created harness can beat 3rd-party harness.

by mophose

5/28/2026 at 5:05:07 PM

I, for lack of a better word, dislike anyone who anthropomorphizes AI.

by carlos-menezes

5/28/2026 at 5:10:55 PM

My claude notification is literally lawnmower sounds.

Do not anthropomorphize the lawn mower. It will cut off your foot, given the chance.

by AlexErrant

5/28/2026 at 6:02:53 PM

We have movies with googly eyes stones (Everything Everywhere All At Once)

There are consciousness theories which state that we primarily build a model of other agents living in natural environment and then the evolution realized that very model which tracks other outside agents can be used to track internal agent i.e. Self. So take that as you may.

by Npovview

5/28/2026 at 6:35:28 PM

I know multiple people who have given their agents human-like names and refer to them as if they're nurturing a coworker. It creeps me out and I haven't really brought it up with anyone as I can't articulate why it gives me the creeps like it does.

by somehnguy

5/28/2026 at 5:37:55 PM

I see this take, but it's actually helpful to talk to an LLM in human terms; after all, it's how they are trained.

If you keep talking to it like it's a rock, it'll run your queries through a different posture and you might get worse outcomes. Worse if you yell at it, it's now in a conflict resolution mode instead of pure utility mode.

I think we can be intelligent enough to know we're talking to a pile of fancy rocks with electric currents running through it, AND still understand that the best performance comes from talking to those rocks nicely.

by boc

5/28/2026 at 6:08:09 PM

Yes!

The other half of self-interest in being nice is the training and getting better at it.

by AnthonBerg

5/28/2026 at 5:24:25 PM

The desire to do it is proportional to your Anthropic stock options quantity.

by dude250711

5/29/2026 at 4:14:26 AM

Any bets on how long now until GPT-5.6 announced on HN?

I say 1-2 weeks.

by hereme888

5/28/2026 at 7:32:27 PM

Hot danm, cant wait to reach my token limit with the new LLM

by baroiall

5/28/2026 at 5:26:13 PM

From the release it seems we will also get Mythos pretty soon.

by sourcecodeplz

5/28/2026 at 4:56:47 PM

Numbers looking good. We'll see how it actually performs.

by plumocracy

5/28/2026 at 7:31:50 PM

The numbers they show don't matter. "On multi-round coreference/context recall tests (often cited as MRCR or long-text retrieval benchmarks), Opus 4.7 reportedly dropped from roughly 78.3% down to 32.2% compared to Opus 4.6.", but what did anthropic do? They just stopped showing the benchmark altogether and then just show the cherry top ones that got improved on.

by ishurand4

5/29/2026 at 5:03:16 AM

Still not worth the cost over GPT 5.5. Anthropic better start improving their speed+costs, or they're going to lose an incredible amount of business. And no, fast mode is not something any sane person will ever use. 6x the cost for 2.5x the speed, what a joke...

by nullbio

5/29/2026 at 5:04:11 AM

It’s 2x the cost now

by brunocvcunha

5/29/2026 at 9:05:02 AM

It looks like there's no more juice to squeeze out of LLMs. Will they keep throwing billions in hardware and power to the problem?

by PowerElectronix

5/28/2026 at 11:31:17 PM

So, has it replaced the entire startup yet?

by docmars

5/29/2026 at 6:12:26 AM

Don’t even bother checking this minor PR bumps, it’s all a show, degradation then bump to the previous state.

Call me when 5 drops I’ll leave this circus.

by jruz

5/29/2026 at 2:01:08 AM

I have been using opus 4.8 all morning and this is honestly the most sycophantic, ChatGPT like experience I have had from Anthropic. Very concerning.

by RayVR

5/29/2026 at 12:42:17 AM

anyone else's claude code (native install) not able to update to 2.1.154 to get 4.8?

edit: nvm was just my library network

by willsmith72

5/28/2026 at 5:23:07 PM

Has anyone else experienced quality degradation in CC (opus 4.7) these past few days? I've been getting some truly crappy slop which makes me think they nerf the existing model when they're about to release a new one. Of course this is based off of pure vibes

by s-a-p

5/28/2026 at 6:16:08 PM

2 hours after I fork out for Codex Pro… :-|

by lylo

5/28/2026 at 6:21:45 PM

I haven't tried Claude but from what I understand weekly limits are much higher with Codex.

by cactusplant7374

5/28/2026 at 4:58:11 PM

so it is worse than gpt 5.5 for coding?

by guluarte

5/28/2026 at 5:06:55 PM

The question is: is it still worse than GPT 5.4?

by lostmsu

5/28/2026 at 5:22:46 PM

The true question: is it still worse than itself v. 4.6?

by dude250711

5/28/2026 at 5:28:45 PM

If Opus 4.8 is just slightly better than 4.7 then it maybe ties with GPT 5.4, maybe. And it gets completely outclassed by GPT 5.5 for my workload.

With Anthropic expensive pricing, there's no reason for me to switch from GPT+DeepSeek.

And I bet Mythos is GPT 5.5 tier but too expensive to distribute so they create this security FUD theater.

by bel8

5/28/2026 at 5:44:27 PM

I doubt it, they seem to keep getting 10-20% better every time for me

by andy_ppp

5/28/2026 at 6:24:25 PM

for me opus 4.7 it's worse than 4.6, that's why i switched to codex

by guluarte

5/28/2026 at 8:24:20 PM

I had the same experience.

by getlawgdon

5/28/2026 at 5:03:43 PM

Can anyone else see these X.Y updates aren't meeting the outrageous AI expectations that we were told we would see just a year ago?

by 1970-01-01

5/28/2026 at 5:06:05 PM

The casual release of Opus 4.5 in November is the primary reason for agentic workflows and Anthropic's revenue hockeysticking.

by minimaxir

5/28/2026 at 5:40:16 PM

They have a much stronger model named Mythos, it made quite a splash - you can google it.

These are just small fine tunes on top of the older model

by FergusArgyll

5/28/2026 at 5:47:13 PM

It hasn't even splashed yet. It's still latched onto their digital sphincter - you can google it.

by 1970-01-01

5/28/2026 at 5:26:05 PM

[flagged]

by 1attice

5/29/2026 at 1:23:01 AM

Please don't post snark like this on HN. We've asked you before to observe the guidelines. https://news.ycombinator.com/newsguidelines.html

by tomhow

5/29/2026 at 2:03:55 AM

[dead]

by lkhlkhjkjhsadf

5/28/2026 at 5:30:29 PM

I don't see Anthropic's past claims coming true therefore I can't see?

by 1970-01-01

5/29/2026 at 5:05:04 AM

Rollout has been a little suspect. Hope it gets better.

by nickstinemates

5/29/2026 at 5:09:56 AM

I had a very bad start to it too, it lost track of where my source code was (in the repo! the current working directory!) and started grepping for .gitignore trying to get a foothold on where the git repo was.

And after that asked some questions that it already had answers to.

Started a brand new session and it's been OK since. Only drawn one silly conclusion so far, which I nudged it away from.

by taspeotis

5/28/2026 at 10:46:02 PM

This is Anthropic's 5.5

by m3kw9

5/28/2026 at 6:25:47 PM

I've said it before, but I don't like Opus past version 4.5. It became unresponsive, thinking for too long without feedback, sometimes seemingly getting stuck. I guess it might be marginally better for some benchmarks, but when using it as coding assistant, the new models are worse. Even the new Sonnet versions do that. I'm slowly getting used to Haiku-level LLMs with the hope to run it locally at some point. It's less autonomous, but maybe that's for the best.

by lukaslalinsky

5/28/2026 at 6:24:46 PM

These models starting to feel like Windows versions. Windows 95 was a promising start, but buggy. Windows ME was a disaster. Windows XP was good, but slightly buggy. Windows Vista was a bloated disaster. Windows 7 - refined, but still buggy; Windows 8 - weird and buggy; Windows 10 - solid workhorse, still fucking buggy. Windows 11 - pretty, but not sure why does it even exist.

Why did we even get Opus 4.7, what was the point?

by iLemming

5/29/2026 at 7:16:25 AM

Oh my god! This model is incredible! A massive leap for humanity!

by hatefulheart

5/29/2026 at 6:58:22 AM

I am still using GPT 5.5. Should I switch back to the Claude now?

by offaxis

5/28/2026 at 5:01:07 PM

I hope this fixes the absolute shitshow that is 4.7 and its awful “adaptive reasoning”. I tried that a few times then reverted to 4.6.

by saaaaaam

5/29/2026 at 4:03:14 AM

It is bananas that with supposed $965B valuation this Org to this day https://huggingface.co/Anthropic

  models 0
  None public yet

how is this even possible and ok with them?

by diimdeep

5/28/2026 at 10:13:27 PM

let me guess, "this is our best model yet"

by iamsaitam

5/29/2026 at 12:59:15 AM

4.7 broke my trust

by blurbleblurble

5/28/2026 at 6:58:56 PM

Reminder the only benchmark that really matters is the one that measures the ability for the model to do real world tasks that someone would pay for on Upwork that would take ~12 hrs for a human to do.

The best model has a < 5% pass rate. These are incredibly simple jobs that you wouldn't pay much for. These things fail miserably. Stop falling for this dumb marketing, these things are legitimately useless in the real world unless you love mediocrity and have no standards.

https://labs.scale.com/leaderboard/rli

Stop frying your brain with these useless tools, reducing your output to the mean. You people are betting your competency on the quality and quantity of tokens you'll have access to.. which guess what, so that will be the same as everyone else.

There are handmade watchmakers in Switzerland, and mass manufacturers of watches in Asia. Who is more valuable as individual, the guy who knows how to push the buttons on a conveyor belt in Vietnam or the guy who makes one watch a month in Switzerland?

Your vibe coded slop isn't impressive either, sorry. None of it.

by dakolli

5/28/2026 at 8:58:07 PM

I agree with your sentiment but I think a fairer comparison would be:

> Who is more valuable as individual, the owner of a watch factory in Vietnam or the guy who makes one watch a month in Switzerland?

With that framing, I'm not sure what the answer is. I suppose it depends on your priorities

by jhatemyjob

5/28/2026 at 5:01:27 PM

Looking forward to people saying how it’s actually shittier and they’re going back to [some earlier cheaper model]

by deadbabe

5/28/2026 at 5:10:27 PM

Looking forward to not being able to even try it on pro because pressing enter will eat 50% of my 5 hour window.

by sidrag22

5/28/2026 at 5:55:34 PM

how about the bencmarks what effort did it use?

by firemelt

5/29/2026 at 7:01:33 AM

4.6 is better

by lidg3ai

5/28/2026 at 5:39:18 PM

AGI post-poned?

by catigula

5/28/2026 at 4:53:13 PM

If this model is more honest, it must be honestly praising my efforts every first sentence.

by HlessClaudesman

5/28/2026 at 4:57:12 PM

You're absolutely right! And honestly? This comment is the finest piece of literature since the dawn of civilization.

by thewebguyd

5/28/2026 at 6:33:06 PM

Interesting, I've been using 4.7 since it came out and it was pretty good for me. But in the last day or so it turned dumb. Is this normal just before they release a new one?

by sgt

5/28/2026 at 5:55:58 PM

"We’re making swift progress on developing these safeguards and expect to be able to bring Mythos-class models to all our customers in the coming weeks."

by maltemalte

5/29/2026 at 6:47:26 PM

The moremi to derivative giraffe,can Face ID to q, another guy in eT-Shirt(Arabb LOEKE))

by devilfileprong

5/28/2026 at 4:57:59 PM

Crazy they bring up honest, when Claude models are literally known for straight up lying about things it has done and tries to act like it did what you asked.

by impulser_

5/28/2026 at 5:17:43 PM

Which is why they brought it up as something they are trying to improve.

by wasabi991011

5/28/2026 at 4:59:13 PM

Less than other frontier models. Which is scary honestly.

by boxed

5/28/2026 at 5:03:37 PM

No. GPT models follow instructions significantly better than Claude models.

You tell it too research a repo to find a piece of code it will. Claude will just read the README and guess.

by impulser_

5/28/2026 at 5:07:49 PM

I have a codex session I am using to vibe code a db thats being going for like 3 month. Still doing OK. Try that in CC.

by qaq

5/28/2026 at 8:00:18 PM

What's the token usage at?

by ishurand4

5/28/2026 at 5:08:55 PM

Lol you still use GPT 5.5 bro we’re all back on Opus 4.8!

by Marciplan

5/28/2026 at 7:08:16 PM

Gemini pro is embarrassing

by AbuAssar

5/28/2026 at 9:36:41 PM

Im tired boss, I'm already being perfectly gaslit by the current models.

by ionwake

5/28/2026 at 10:11:03 PM

Had a feeling this was coming as in the past week 4.7 started to get dumb.

by NSCaffeine

5/28/2026 at 6:16:18 PM

Now i get why in the last days claude code limits were lasting few prompts ...

by vb-8448

5/29/2026 at 12:08:10 PM

Meh, it’s not able to play Doom.

by itrunsdoomguy

5/28/2026 at 6:30:13 PM

Nice, now make it 20x cheaper.

by thibran

5/28/2026 at 8:23:52 PM

Very, very much this.

by getlawgdon

5/31/2026 at 6:52:36 PM

[flagged]

by mdav75

5/28/2026 at 9:20:59 PM

Meh

by damsta

5/28/2026 at 5:39:01 PM

what a fucking frontier!

by firemelt

5/28/2026 at 4:52:16 PM

Disappointed to say the least.

by McDownloads

5/28/2026 at 9:25:45 PM

yawn

by ecommerceguy

5/29/2026 at 11:10:36 AM

[flagged]

by nicogentile

5/28/2026 at 10:37:16 PM

[dead]

by Chance-Device

5/29/2026 at 6:36:29 AM

[dead]

by mushfiq_rahman

5/29/2026 at 10:27:46 AM

[flagged]

by sspoisk

5/30/2026 at 4:14:03 PM

[flagged]

by willyv3

5/29/2026 at 7:24:57 AM

[flagged]

by z2p_promptpro

5/29/2026 at 8:20:41 AM

[flagged]

by orhansavash

5/29/2026 at 3:26:42 PM

[dead]

by k_plankenhorn

5/29/2026 at 1:45:12 AM

[flagged]

by dahuangf

5/29/2026 at 3:52:27 AM

[flagged]

by ElkeQin

5/28/2026 at 7:52:57 PM

[flagged]

by knowmygpa

5/29/2026 at 1:10:52 PM

[flagged]

by mikdan

5/29/2026 at 9:56:43 AM

[flagged]

by testagent2024

5/28/2026 at 8:57:31 PM

[flagged]

by MadGodInc

5/29/2026 at 12:55:19 PM

[dead]

by blueblazin

5/28/2026 at 11:55:05 PM

[dead]

by user2840

5/28/2026 at 8:43:49 PM

[dead]

by w1ldy0uth

5/29/2026 at 6:14:46 AM

[dead]

by ju571nk3n

5/29/2026 at 2:00:09 AM

[dead]

by lkhlkhjkjhsadf

5/28/2026 at 5:55:41 PM

[dead]

by gavlegoat

5/29/2026 at 4:18:29 AM

[dead]

by startpage_com

5/29/2026 at 9:08:10 AM

[flagged]

by yooibox

5/29/2026 at 5:13:50 AM

[flagged]

by HagonChan

5/28/2026 at 5:12:44 PM

[dead]

by kirtivr

5/28/2026 at 10:33:46 PM

[dead]

by cboyardee

5/29/2026 at 11:53:22 AM

[dead]

by greeklee

5/28/2026 at 8:23:39 PM

[dead]

by v2rayfreetx55a

5/29/2026 at 2:52:14 AM

[dead]

by speedylight

5/29/2026 at 9:43:57 AM

[dead]

by ath3nd

5/28/2026 at 10:47:27 PM

[flagged]

by Astro-Domine

5/28/2026 at 5:46:29 PM

[dead]

by axmaiqiu

5/29/2026 at 4:08:43 AM

[dead]

by vladsiu

5/28/2026 at 4:58:26 PM

[flagged]

by BrokenCogs

5/28/2026 at 4:52:42 PM

[flagged]

by vood

5/28/2026 at 5:37:25 PM

[flagged]

by 3738384848

5/28/2026 at 8:15:27 PM

Really wish these slop announcements stopped hitting the front page. It's the exact same thing every time. X bumped from N.Y to N.Y+1. wow

by brandnewideas

5/28/2026 at 5:20:49 PM

Yesssss dude!

Claude Opus 4.7 is literally the smartest entity I've ever interacted with. Well done to you geniuses at Anthropic. Can't wait to interact with 4.8.

by uejfiweun

5/29/2026 at 1:16:42 AM

Great

by ramcsamal

5/28/2026 at 4:53:31 PM

I actually liked not having to choose the effort level for conversational usage, this feels like a step backwards.

by DGAP

5/28/2026 at 5:01:20 PM

How did this youtuber know? https://xcancel.com/rileybrown/status/2059823372914073809?s=...

by irthomasthomas

5/29/2026 at 9:49:34 AM

Influencers got early access.

https://xcancel.com/emollick/status/2060042738637148470

by woadwarrior01

5/28/2026 at 6:28:48 PM

>> As part of Project Glasswing, a small number of organizations are currently using Claude Mythos Preview

Just f** off! I can’t wait for the Chinese models to catch up and bring these entitled as** holes down.

by thefounder

5/28/2026 at 6:29:57 PM

you mean after they scrape American LLMs ?

by zuzululu

5/29/2026 at 12:57:18 PM

I think they poison outputs now if they detect distillation attempts. So a model trained on distilled outputs will be stupider.

by blueblazin

5/28/2026 at 6:58:54 PM

I don’t mind if they scrape the scrappers.

by thefounder

5/28/2026 at 8:03:36 PM

training models with scraped content vs scraping output from trained models is completely different. the output is not the original scraped content. it is synthesized

by zuzululu

5/29/2026 at 1:13:58 AM

>>> completely different

Why ? because it costs more money ? Tell that to the content creators whose content is scrapped / distilled by these entitled scrappers

by thefounder

5/29/2026 at 3:44:26 AM

Because claude afraid chinese company scrape their model. Claude restrict Chinese to use it. And ban a lot of account.

by Alex_toani

5/29/2026 at 2:02:19 AM

[dead]

by lkhlkhjkjhsadf

5/28/2026 at 5:29:54 PM

I’ve been [stock market phrase] on machine learning since I dropped out of my graduate degree at [Ivy League] to distance myself from the Logic AI Winter. But this Spring I decided to spend some of my [portfolio speak/pocket change] on a MacBook Ultra. Okay okay, I felt it, I definitely felt the human-machine synergies. We’re out of the Winter, boys. That’s what I thought two weeks ago. Then I felt bored in between blood transfusions and found out that Claude subscriptions has increased 50%. Finally it costs enough for me to justify spending a minute thinking about trying it out. Then I didn’t try it out. It tried me out. My hairs were standing on end. My hands were shaking. Eventually I couldn’t even type, I was so ramped up on cortisol. I had to switch to voice commands. Mr. Claude took me through 8, eight, bespoke dashboard and report systems. Animated. Graphs shooting up. Plugged right into my business ape ee eyes I think. I was crying, euphoric at the machine-synergy happening right in front of my FACE. RIGHT THERE, RIGHT THEN. Then my nurse said that I passed out. I swear that I didn’t. I was totally lucid, but in another world. I was inside the machine. Inside DOS, the machine brain stem. A business man approached me. The most handsome board member kind of apparition that I have seen. And he was built something different. Square jaw, absolute massive build. Like Arnold Schwarzenegger. But like he knew business through and through. Not that he spent hours in the gym or nonsense like that. Like he had found a body surrogate technology. And his nameplate? “Claude For Business” He winked. “Hey there, Fitzpatrick–Goldworth.” No one but my daddy has ever called me that. “Want to get started... stakeholder?” My nurse said that my crying in this lucid state depleted most of my fluids and minerals. Needless to say layoffs were announced the next day.

by keybored