GPT-5.5 Codex reasoning-token clustering may be leading to degraded performance

7/4/2026 at 11:20:23 PM

Oh this seems bad, and is fairly easy to reproduce using codex cli. You give it a puzzle prompt that it has to reason about and solve, occasionally it will seemingly short circuit and think for exactly 516 tokens, and return the wrong result. When it ends up using 6000-8000 thinking tokens it returns the correct result.

Maybe some issue with adaptive thinking? Another point for local models I guess, don't have to worry about silent server side changes.

Edit: To follow up, it seems to happen quite often. Out of 10 runs of the exact same prompt, 4/10 had this 516 thinking token issue, and every one of these had the wrong solution. So nearly half the time, 5.5 xhigh could be short circuiting and degrading performance. Granted the sample size is small.

by nsingh2

7/5/2026 at 8:21:48 AM

I have a philosophical problem with adaptive thinking. It’s a dumb guess for how much thinking budget to allocate ahead of thinking. At least in the context of LLMs there is probably no way of knowing how much thinking (token generation) is needed. The problem space is infinity vast, similarly of two prompts is not going to help any LLM decide how much thinning is needed. Models already stop thinking before hitting the thinking budget.

Why there is so much effort in making adaptive thinking happen and don’t we train models to produce the end of thinning token better?

Feels like a bandaid. We need models to be trained to do a reasonable amount of reasoning (no pub intended):

    reason

    estimate remaining uncertainty

    continue?

    reason more

    repeat

by mohsen1

7/5/2026 at 4:20:59 PM

I agree, adaptive thinking is a pest and without a minimum thinking budget especially Claude for me currently defaults to not think at all even on max effort.

Sequential-thinking was really a step in the right direction, and works almost exactly how you've described, though when it was popular before the reasoning models and even now when I tried it recently I have never once see it use its branching feature and it tends also to have the RLHF urge to answer something "helpful" quickly instead.

by CjHuber

7/5/2026 at 12:46:27 PM

At least there should be a tool call that's the equivalent of saying "wow, this is more complicated than I thought". Humans are also often prone to under-allocating reasoning time and coming to wrong conclusions because their reasoning ends up too shallow. But the best humans are great at mentally mapping the problem space and readjusting on the fly

by wongarsu

7/5/2026 at 1:10:36 PM

This is what I do in llm-consortium. An arbiter evaluates the response(s) and decides if more iterations are needed. You can also loop until a minimum confidence threshold, but self-reported confidence isn't a great metric.

by irthomasthomas

7/5/2026 at 12:05:06 PM

Right now we have a LOT of band aids. You want to optimize compute and thinking to a particular problem, sort of like we do. Yes you cannot perfectly predict this but you can do decently well and save a ton of tokens at the cost of this band aid being sort of leaky and gross.

But the larger problem is sound, and the answer is something jointly optimized (idk how they do the routing) but it’s hard to shoehorn it into the current paradigm.

by aspenmartin

7/5/2026 at 2:43:53 PM

Modern LLMs are nothing but band-aids, starting with the absurd bandwidth of HBM3 RAM that makes them possible.

by UltraSane

7/5/2026 at 3:11:46 PM

This is preliminary, but it seems like it might somehow be related to the `## Intermediary updates` system prompt that's provided to the model. Seems like it forces the model to stop thinking and return early to provide updates. Removing that entirely makes all runs succeed [1].

I wonder if it's somehow getting confused between what's supposed to be an intermediate update vs the final result.

[1] https://github.com/openai/codex/issues/30364#issuecomment-48...

by nsingh2

7/5/2026 at 3:12:33 AM

You still have to worry about misconfigured local models. Even the professionals get it wrong, which is why local model performance is uneven across providers.

by postalcoder

7/5/2026 at 7:15:36 AM

And to add insult to injury, some providers will ride on the good reputation of some local model, selling you a terrible quant instead.

With OpenAI, at least my gpt-5.5 is the same as your gpt-5.5. You can't say that about glm for example.

by miki123211

7/5/2026 at 7:30:50 AM

> And to add insult to injury, some providers will ride on the good reputation of some local model, selling you a terrible quant instead.

I just started using OpenRouter for some control testing of local models and what surprises me the most isn't that there are different providers providing different quantization levels, that makes sense, but I can't seemingly find a way of seeing what provider+model+quantization is actually used?! https://openrouter.ai/models shows the models, then say https://openrouter.ai/moonshotai/kimi-k2.7-code shows the providers but when I go to https://openrouter.ai/moonshotai/kimi-k2.7-code?endpoint=e7a... for example, why on earth is it not showing the actual details about the actual weights they're serving?! Give me details! It does have a "Precision" value that is sometimes filled out, but that seems to be a guess at best, even providers with the same values there have wildly different quality responses.

I like the idea about OpenRouter but holy hell does the implementation seem very far off from what it needs to be, in order to be useful.

by embedding-shape

7/5/2026 at 7:43:22 AM

There are properties on the API call you can pass for specific providers, so you test which providers you like the output, then add them to the list in ranked order if you want one by default, then to fall back to the other.

There might be something in the response, or in a followup API call for the session, that you get better details. I think I've seen the details in the dashboard, so they do exist.

by jermaustin1

7/5/2026 at 10:54:42 AM

Sam Altman, good reputation?

> With OpenAI, at least my gpt-5.5 is the same as your gpt-5.5.

How do we know that? The "orchestration" layer probably forwards to different levels of quantization. And it seems tempting to make some sort of load balancer with adaptive computation effort.

by rightbyte

7/5/2026 at 10:43:05 AM

> some providers will ride on the good reputation of some local model, selling you a terrible quant instead.

Quants in popular local inference apps (Ollama, LM Studio, etc) are the worst possible quants (RTN).

by woadwarrior01

7/5/2026 at 9:28:10 AM

That's not a real equivalency. They are not necessarily the same (testing in production, hello!) And most importantly you do not have a local model because openai is not open!

by beacon294

7/5/2026 at 5:19:06 AM

But in that case you have nobody but yourself to blame, and you can stabilize things yourself at any time by refraining from making any changes. You won't be surprised by a provider. Honestly? That's not just valuable—it's essential.

by jdiff

7/5/2026 at 5:48:10 AM

> Honestly? That's not just valuable—it's essential.

I'm curious if you wrote this or had a LLM write it.

I'm genuinely curious to be clear as I don't see why anyone would bother to go through a LLM to write such a short reply. Have we reached the point where Claudeisms that are this obnoxious have become part of regular speech?

by LiamPowell

7/5/2026 at 10:28:15 AM

The obnoxious cliche is mine, although I wouldn't call it "regular speech" since I tacked on that dense blob of LLMisms intentionally.

by jdiff

7/5/2026 at 6:02:35 AM

Or they were making a joke[0].

[0]: https://en.wikipedia.org/wiki/Joke

(…just like that)

by UqWBcuFx6NV4r

7/5/2026 at 6:13:15 AM

Sure, but it doesn't really fit there as a joke, it looks like it's just meant to be part of what they were trying to say.

by LiamPowell

7/5/2026 at 7:13:49 AM

I also think it's a joke, it starts with the response / argument, and then flows into tongue-in-cheek joke about the core issue of the post (LLM)

by subscribed

7/5/2026 at 6:38:26 AM

I've noticed them trying to creep into my writing. It doesn't help that I was a heavy em-dash user ten years before GPT-3.

by topynate

7/5/2026 at 7:51:26 AM

I’ve always loved em dashes…very sad they’re a hallmark of LLM slop now. That and trios in arguments. Maybe I’m part LLM?

by mdgld

7/5/2026 at 8:30:42 AM

This isn’t just em-dashes—it's the empty phrase that includes both whatever the contrastive construction is called and an “Honestly”. It might have been human written but the density of LLM flags is undeniable.

by andy99

7/5/2026 at 10:40:28 AM

Honesty has become a big tell, especially if its brutal. I'm not sure if it's getting worse or if I'm becoming more sensitive, but if I'm being brutally honest I can barely stand using LLMs anymore with their tone and their cliches. It's a horrible nightmare fusion of socmed influencer and LinkedIn hustler that sounds freakish even if it were human, and the density of tropes that are accumulating feels ridiculous.

Didn't the foundries take action against those in the past? I don't see "delve" nearly as often anymore. Why are the models spiraling like this now?

What really grinds my gears is the constant need to guess what I'm doing and offer a million random follow-ups. I asked what the weather was like, I don't appreciate the 5 paragraphs of tokens burned on weather-appropriate activity suggestions.

by jdiff

7/5/2026 at 8:11:00 AM

Some of us been writing texts on the public internet for decades, and humans invented machines trained on our texts, so suddenly the text we write now sounds like robots? Only way to win is to stop caring, let the people believe you're a LLM if so be it.

by embedding-shape

7/5/2026 at 8:32:24 AM

LLM writing style is trained in by data labellers, it’s not just emergent behavior from being trained on internet texts.

by andy99

7/5/2026 at 8:40:09 AM

Ultimately it's a mix-match of everything, including whatever data the pre-training uses and how exactly they do the post-training. I don't think you can say there is a single factor that decides the writing style, unless you have some particular insight into some specific pipeline. Generally though, they output text that looks like the human text they ingested for training.

by embedding-shape

7/5/2026 at 12:44:39 PM

I've made a passive workaround: a pair of Codex CLI hooks that detect the truncation from the local session transcript and warn — in the TUI at turn end, and via a message injected into the model's context on your next message.

See https://github.com/bentoner/codex-516-hook

by bentoner

7/5/2026 at 2:22:44 AM

I wonder if testing during different time/days show patterns? For example, whether the short circuiting happens more often during workday peak hours.

by dannyw

7/5/2026 at 9:27:33 AM

And people pay for those wasted tokens? If that's the case, it is probably good idea to ask for refunds.

by varispeed

7/5/2026 at 10:47:21 AM

You can use this small Python script to display an histogram of `reasoning_output_tokens` in your past Codex sessions. I do see a spike at 516 indeed.

  import os, glob, re
  import matplotlib.pyplot as plt
  vals = []
  for f in glob.glob(os.path.expanduser(r"~\.codex") + r"\**\*", recursive=True):
      if os.path.isfile(f):
          try:
              s = open(f, "r", encoding="utf-8", errors="ignore").read()
              vals += [int(x) for x in re.findall(r'"reasoning_output_tokens"\s*:\s*(\d+)', s)]
          except Exception:
              pass
  plt.hist(vals, bins=200, range=(0, 5000), weights=[100 / len(vals)] * len(vals))
  plt.xlabel("reasoning_output_tokens")
  plt.ylabel("%")
  plt.show()

by josephernest

7/4/2026 at 10:59:17 PM

I’ve definitely experienced step jumps down in quality on an almost daily basis. I usually used xhigh. The experience of relying on codex’s outstandingly thorough coding earlier in the year has evaporated for me. I’m seeing incredibly stupid implementations intermittently, and have simply switched to Claude until openai takes the issue seriously. As far as i could tell they haven’t taken it seriously for the several months I’ve been personally seeing it.

by zenapollo

7/4/2026 at 11:03:36 PM

I've switched 3 months ago to Codex because Claude got incredibly stupid. 6 months ago vice versa. It doesn't matter if you use Codex or Claude. Both will fuck with you at some point. Though Codex probably less.

by siva7

7/4/2026 at 11:47:34 PM

At least OpenAI lets me use my own harness. Having to rely on insane PMs letting Claude Mythos go wild on the codebase has not been going well lately.

by selectodude

7/5/2026 at 12:22:07 AM

How do you mean OpenAI lets you use your own harness? I'm under the impression that a custom harness requires the OpenAI SDK, which requires api tokens rather than plus/pro accounts.

by tunesmith

7/5/2026 at 12:42:13 AM

https://x.com/thsottiaux/status/2058071172361998482

"A little secret. About 5% of our production traffic is on the Pi harness, about another 5% is on OpenCode. Reminder you can use your ChatGPT account in a flourishing set of other tools.

We’ll continue to make Codex awesome, but you have options."

by lhl

7/5/2026 at 4:19:14 PM

They will only do this while they have no users. Once they are compute constrained they won't let you do this anymore

by solenoid0937

7/5/2026 at 1:48:46 AM

Not just harnesses, you can even use the subscription in CI/CD. That, plus the fact that web chat does not count toward the same limits, is why I think the Codex personal plan is easily 10x the value of Claude Code.

https://developers.openai.com/codex/auth/ci-cd-auth

by smoe

7/5/2026 at 5:04:55 PM

How are you using it in ci/cd?

by 7thpower

7/5/2026 at 12:38:52 AM

You must've missed the OpenAI's response to Anthropic forcing everyone to their own harness if they want subscription pricing: official endorsement of custom harnesses like opencode and pi even when used with Codex subscription.

I think they even partnered with opencode or something like that (don't remember).

by hakunin

7/5/2026 at 12:34:27 AM

Anthropic is the one that prohibits harnesses other than Claude Code on subscription plans and bans users for disobeying.

OpenAI officially allows that with subscriptions.

by HumanOstrich

7/5/2026 at 7:19:55 AM

OpenAI doesn't require that; only Anthropic does.

OpenAI's harness is fully open source[1], and (AFAIK) doesn't come with any kind of signed-build request integrity verification like Claude does. And by that logic, if you're allowed to use their API with a fork of Codex that you yourself compiled, there's nothing stopping you from making some other harness act like such a fork.

[1] https://github.com/openai/codex

by miki123211

7/5/2026 at 8:46:27 AM

And although Codex Desktop is not open source (AFAIK), it does expose CDP and appserver which allowed me to build a Greasemonkey-like plugin system on top of it. It's surprisingly powerful, you can add transcript annotations, programmatically control the side panels, integrate a native-looking account switcher, and so on.

by lmwnshn

7/5/2026 at 1:17:56 PM

This sounds really cool. Do you have the source published anywhere?

by walthamstow

7/5/2026 at 5:59:49 PM

Yup! MIT licensed, do whatever you want, you can also just use it directly with nodejs installed through `npx clankerbend` [0]. The screenshot on the github repo links you to a YouTube demo.

Warning: I use this daily myself for work, but beyond me arguing with it to get the CDP architecture, this was completely vibe-coded. I mostly use vim mode (jump to line, jump to prev/next user message) and the account switcher.

[0] https://github.com/onewillai/clankerbend

by lmwnshn

7/5/2026 at 9:52:01 AM

Will this problem not arise on other open source harness?

by navigate8310

7/5/2026 at 12:28:25 AM

If you use GitHub Copilot you can switch between them in the same session if you want.

by jerezzprime

7/5/2026 at 2:23:01 AM

Yeah but now that you pay for tokens that's going to be bad for token caching.

by gwerbin

7/5/2026 at 12:47:17 AM

Same with a third-party open harness.

by ElFitz

7/5/2026 at 3:40:01 AM

I have noticed this degradation of 5.5 reliability to what, in my experience, I consider Claude-level of reliability since early June.

My journey dealing with this has been transitioning from 5.5 high to 5.5 xhigh to 5.4 high.

5.4 high has been perfectly reliable for me for the last 3 weeks, and I am happy there.

Occasionally, I run some tasks on 5.5 xhigh to check if it has gone back to being 100% perfectly reliable, but, at this point, I am assuming they are just counting on releasing 5.6 rather than dealing with this reliability issue.

by matco11

7/5/2026 at 9:31:37 AM

I'm on the same journey but I bought a 3090 and put qwen 3.6 27b on it. It covers some things with better reliability. Obviously it doesn't have the breadth of a large model. If that's even a selling point for large models for coding?

by beacon294

7/4/2026 at 11:18:04 PM

i don't ever believe these issues are technical. They're business decisions to downgrade performance because to fix it means $$$$ and you arn't paying them enough.

by cyanydeez

7/5/2026 at 6:18:32 AM

[dead]

by Losenok

7/4/2026 at 11:32:20 PM

Deja Vu... This looks just like the Claude Code performance regression back in April. I just quit my Claude subscription when that happened and went to Codex.

Now I'm kinda thinking of trying per token for both, using GLM 5.2 on Fireworks for most tasks, shelling out to the big boys only when needed. Not totally confident I'll break even though.

by resonious

7/5/2026 at 12:51:50 AM

Re per token, I had the same reaction, but given both labs are economically advantaged moving customers to per-token consumption... almost want to avoid this on principle. Even if not intentional, benefitting from a degraded product is not something I want to accept or enable.

More now than ever (since original ChatGPT release), the OSS models and open harnesses (eg Pi) are looking mighty attractive.

by andrewcamel

7/5/2026 at 10:04:21 AM

If pricing is per-token then in theory the vendor can offer you modes that optimize token usage or quality whereas all-you-can eat encourages vendors to satisfy you just enough to keep paying but the trend is towards lower quality responses.

by AdamN

7/4/2026 at 11:55:22 PM

Right? I also quit Claude Code and switch to Codex over that. Now I’m trying to figure out how I could make an extra $65,000 to never have to be concerned about this nonsense again. I know the economics of using open router etc…

But I’m reminded of ~2008 and the rise of “the cloud” as a marketing term that seemed to me to be a cover for dropping an expectation of rich clients, increasing a companies margins around subscriptions that would chip away at local ownership.

Then I got offput by the zealotry and absolutism around “true FoSS”, told myself I was young and moved on.

And really, a lot of subscription models I kind of can appreciate/ tolerate. Might be irksome but whatever, I get that software is expensive to make and it’s not fair in 2026 to value a yearly upgrade of Photoshop at $200. The capricious UI changes to things that’ve worked for 20 years and they take away say the classic color swatches altogether - silly and dumb.

I can use another professionally necessary tool I pay $200/ mo for, Codex, to whip up a classic swatch plugin.

Is that $200 a fair price for my token usage? I think an extremely heavy month I might’ve used a billion tokens?

But that right there is the problem. They have no idea what, specifically, profitability looks like and are going to be pulling endless levers for … I genuinely have no idea how long - at least through 2030/2032 if we tea leaves their debt obligations?

I don’t want to think about any of that. At all. I don’t want to spend time evaluating model preference and degradation and updating the nuances of how I “speak” to an AI because there’s some mystery backend experiment running on the output I use to produce functional outputs — ie the actual products I get paid to build/ maintain.

AI’s something between a tool and coworking companion, and the capricious “personality” changes due to playing with poorly understood and knobs and levers at the inference level - is maddening. To that end, I want a box in the corner I can point to and know exactly the quality of outputs that no one but myself modifies.

by cududa

7/5/2026 at 12:28:22 AM

Fireworks?

by thatxliner

7/5/2026 at 12:57:05 AM

Provides access to AI models for a per-token fee. See OpenRouter, they are one of many.

by arcanemachiner

7/5/2026 at 4:49:41 AM

i believe they are referring to https://fireworks.ai/models/fireworks/glm-5p2

by frankdenbow

7/4/2026 at 11:41:56 PM

The vibe-assumed claude code performance regression, yep. People should stop expecting consistent performance from non-deterministic systems. There is zero empirical corroboration of performance degredation.

There has been a step change... in the amount of whining and complaining coders exhibit lately.

by jatora

7/4/2026 at 11:52:22 PM

If you bother to look at the issue instead of whining and complaining, you will see the evidence.

by HumanOstrich

7/5/2026 at 12:07:32 AM

When I disagree with the data: I will nitpick every last detail of methodology, any cross-corroboration is an anecdote, suddenly I demand a-priori levels of justification. All science is flawed anyways, it's not like mathematics, you can't get absolute certainty, so why bother? You're always going to be making base assumptions that can be challenged, you're necessarily going to abstract out the territory, the map is flawed.

When I agree with the data: I will boast about the victories of science and empiricism, we found the perfect set of natural abstractions that are necessary and sufficient to map out the territory that carve at the joints of the problem, any concern about assumptions is rebutted with generic "Well, we're just pragmatists; we're not perfect, but clearly we're converging on the right direction! You're clearly someone who just wants to nitpick and not get any work done."

My experience with certain hackernews commenters in a nutshell.

by sigbottle

7/5/2026 at 1:26:49 AM

This is evidence of a bug, not the purposeful enshittification people are referencing

by jatora

7/5/2026 at 3:52:52 AM

This reminds me of a time in COVID-health response when certain scientists said their evidence was real evidence and yours was not.

"There's no evidence to prove xyz" then they would say, as your evidence was never as rigorous as theirs. And since they were proclaimed to be the only authorized scientists in the room, by authority of big governing bodies, they were right.

So people will see whatever evidence they want, and whine and complain to dig into their side as tribalistic creatures.

by mannanj

7/5/2026 at 4:59:38 AM

Look, I felt it. I didn't wait for the official apology from Anthropic. I quite before they published that, then felt very vindicated when they did.

by resonious

7/4/2026 at 11:56:38 PM

[dead]

by darig

7/5/2026 at 4:31:23 AM

For me, the encrypted reasoning contents, when looking at the base64 string lengtht, show this effect. However, the server-reported reasoning tokens don't. So I assumed it was part of the encryption and/or obfuscation purely. So I don't think there is a real issue.

This is the biggest downside of GPT; thinking is encrypted, so it's more of a black box than kimi/glm/deepseek. You still get thinking summaries though. It's awkward, but workable.

by edg5000

7/5/2026 at 2:53:51 AM

I love that Codex is open source and issues like these can surface/be addressed publicly.

by laurels-marts

7/5/2026 at 4:08:12 AM

But this is model behavior and just a public issue tracker which claude code has just without code? I don’t see how it’s any different than https://github.com/anthropics/claude-code for these issues.

I do appreciate that codex is open source generally, but I don’t think it matters for this class of issue as the model is closed still

by rockwotj

7/5/2026 at 3:51:15 AM

I feel openai in general is much more open and real business like compared to anthropic. They’re just a black box.

by adithyassekhar

7/5/2026 at 4:05:15 AM

Not only that, OpenAI generally doesn't gaslight compared to the misanthropic team especially Boris, who was constantly claiming there is nothing wrong with Claude Code. And OpenAI is generous with resets.

by cute_boi

7/5/2026 at 5:00:06 AM

Thanks, I had good laugh. Nice fanfic.

by tomalbrc

7/5/2026 at 4:32:20 PM

Indeed, it looks like my work has suffered from the clustering issue as well:

  reasoning_output_tokens    count    percent
  ━━━━━━━━━━━━━━━━━━━━━━━━━  ━━━━━━━  ━━━━━━━━━
                         0      873    28.5948
  ─────────────────────────  ───────  ─────────
                         8       64     2.0963
  ─────────────────────────  ───────  ─────────
                         9       60     1.9653
  ─────────────────────────  ───────  ─────────
                        11       54     1.7688
  ─────────────────────────  ───────  ─────────
                       516       48     1.5722
  ─────────────────────────  ───────  ─────────
                        12       45     1.4740
  ─────────────────────────  ───────  ─────────
                        10       43     1.4085
  ─────────────────────────  ───────  ─────────
                        17       40     1.3102
  ─────────────────────────  ───────  ─────────
                        13       38     1.2447
  ─────────────────────────  ───────  ─────────
                        14       36     1.1792

Created a script for this: https://github.com/thehappybug/codex-reasoning-token-check

by m3h

7/5/2026 at 4:45:32 PM

When I reviewed the conversations affected by this issue, they did not always align with my feeling of "degraded output".

Some were definitely below par, and I recall having to iterate on the generated code more than I wanted to. However, it is only true for a very small number of conversations.

So we're looking at a small set of affected conversations, and even within that small set, only a few will have degraded output, likely because the model can compensate for the reasoning defect over the long conversation.

by m3h

7/5/2026 at 5:16:45 PM

I think it might affect real work if part of it requires a lot of thinking, i.e. something similar in nature to a puzzle.

There seems to be something wrong with the "commentary" channel related intermediate updates, maybe the model gets confused about what's an intermediate update vs what's the final answer? [1]

[1] https://github.com/openai/codex/issues/30364#issuecomment-48...

by nsingh2

7/4/2026 at 11:11:43 PM

I swear some days ago someone here claimed Openai succeeded cutting down their compute cost by half with a breakthrough optimization. So this is it?

by siva7

7/4/2026 at 11:15:20 PM

That was an article in The Information but it didn't read very well to me, I didn't get the impression the author was enough of a technical expert on how LLMs work to credibly evaluate the claim, which came from an insider rumor: https://www.theinformation.com/newsletters/ai-agenda/openai-...

> OpenAI engineers earlier this month told some colleagues they had figured out a way to more than halve the cost of inference, or running existing models, thanks to some newly-discovered optimizations, according to a person with knowledge of those discussions.

by simonw

7/5/2026 at 3:42:00 AM

I bet that since this bug has made headlines, there are some panicked engineers at OpenAI desperately trying to figure out how to fix it without undoing their “magic optimisation”.

by jiggawatts

7/5/2026 at 3:22:13 PM

Is there any indication that optimization shipped? Depending on whether it’s R&D or pragmatic, I’d expect it to take months at least.

by brookst

7/5/2026 at 12:00:46 AM

My understanding of the rumor is that it wasn't OpenAI itself, but one of the post-blip OpenAI breakaway groups (rumoured to be Thinking Machines) who have made a breakthrough and seem to be shopping it to OpenAI. I don't think it has actually been implemented by OpenAI yet.

by SyneRyder

7/5/2026 at 2:38:28 PM

Already reported (not as thoroughly but still quite detailed) two weeks ago and silently “closed as not planned” (keep in mind that the specific reason might be an artifact of GitHub workflow/UX and not actually the intended reason) without a acknowledgement or a response.

https://github.com/openai/codex/issues/29353

What even is the point of a public-facing bug tracker “for devs, by devs” when this is how reports get treated? Might as well use Apple’s Feedback Reporter that routes to /dev/null instead.

Anyway, I find it near impossible to see how this wasn’t already caught and flagged internally – it’s not a subtle pattern. Certainly they are at the very least collecting and graphing reasoning tokens vs model vs effort” and such an obvious spike at (multiple) single stops (not even distributed over a narrow range) should have been an immediate statistical red flag… which leads me to believe (combined with the fact the previously reported issue was closed without comment) that they’re at least internally aware of this behavior even if it’s not necessarily an intentional side effect of some internal forcing metric.

by ComputerGuru

7/5/2026 at 11:00:58 AM

> reasoning-token clustering at 516/1034/1552

Interesting. So 516 probably means initial 512 byte buffer and a 4 byte header. Then 516 + 518 = 1034...so another 512 + 4 byte header + 2 bytes for a linked list ref or similar, 1034 + 518 = 1552, etc.

by tyingq

7/4/2026 at 11:22:45 PM

A rare case "they made the model dumber" where they actually made the model dumber, instead of the usual user psychosis?

by ACCount37

7/5/2026 at 12:52:31 PM

This is the second one in a row now (previous was Anthropic flic in Feb/March)

by lostmsu

7/5/2026 at 3:22:59 PM

Nah there have been thousands of “they made the model dumber” false alarms between these two.

by brookst

7/4/2026 at 11:39:49 PM

It seems to be an inference engine or agent harness defect/misconfig rather. Not only do the issue details not evidence a willful stealth nerf, they actively suggest otherwise: the root cause is crude, and evidently not particularly stealthy (as it's being reported on by a regular user with independently verifiable, exact details).

I don't find "usual user psychosis" particularly fair or tasteful anyhow. You're not left with much more than subjective judgement and speculation/suspicion when all you have is a magic sink of an API endpoint that ingests your context window then spits back a continuation of it. Even if you have a standardized model test suite, claiming a stealth nerf remains an exercise in mind reading (of the people working there). Model quality can degrade without an explicit intention that way, or a downgrade of the underlying infrastructure, after all.

Being tongue-in-cheek conspiratorial, or even actually entertaining the possibility of a nerf, is no psychosis anyways. Not a fan of this trend of people abusing psychology diagnosis terminology like this. I'm sure there are people who go a step beyond and are overconfident in these judgements, maybe in their case it holds. But then that's a minority, and so what you have then is a hyperboly. Doesn't serve anyone.

by perching_aix

7/5/2026 at 12:03:30 AM

"They made the model dumber" on literally the same checkpoint with the same prompt on the same quantization running on the same hardware is a staple of AI complaints.

Users are completely incapable of objectively evaluating model quality over time.

Which makes it all the harder to notice actual "stealth nerfs", misconfigurations or other technical issues. Because "they made the model DUMBER, for REAL this time" is background noise.

by ACCount37

7/5/2026 at 2:30:39 AM

How are you so sure that frontier API models are always running the same quant/weights/etc? You think OpenAI and Anthropic are running essentially just vLLM endpoints? Of course not.

Firstly, we know Anthropic has been doing prompt injection into their 1P APIs (not bedrock/vertex AFAIK) for at least a year now. https://old.reddit.com/r/ClaudeAI/comments/1f6hcwo/injection...

This can be verified pretty quickly like OP — count the token metrics, if your context contains classifier-firing terms, you’ll see input_tokens being higher than your input.

So if they’re already doing that, what makes you think it’s just a dumb API, instead of a complicated pipeline filled with trade secrets and optimisations?

by dannyw

7/4/2026 at 11:45:31 PM

Maybe its just bad memory but I feel like 5.3 was the best version in terms of token usage and code quality. 5.5 works better but it just eviscerates tokens.

by ghosty141

7/5/2026 at 5:56:50 AM

It’s not just you this is also my opinion, 5.3-codex was a fantastic model in terms of balancing output quality and cost.

Cheap and efficient enough I could afford to use it on basically everything unlike 5.5 or Opus, but still pretty good, I preferred it to sonnet

by ifwinterco

7/5/2026 at 11:03:33 AM

5.3 was incredibly better than 5.4/5.5. I stuck with it for months after 5.4 was released, and kept testing 5.4/5.5 every now and then but they both were too inconsistent, too rash. I switched to 5.5 a few weeks ago and now regret it, but I am no longer seeing 5.3 as an option to use, only 5.3-Spark, which is trash compared to 5.5.

by notfried

7/5/2026 at 12:12:16 AM

They rendered 5.3 unusable for me a few weeks back. It simply was locking up or answering poorly.

by keyle

7/5/2026 at 3:18:52 PM

There is nothing called "GPT5.5 Codex" unless I've completely misunderstood OpenAI's product line?

Codex is a harness, while GPT-5.5 is a model. The last codex-branded model was 5.3. Codex as a harness ships as a CLI, a desktop app, and a web product (and I'm not at all sure how similar the underlying harness is between them.)

Is the bug here supposed to be with the CLI harness, or the model? Does it also happen in pi, opencode, etc while running GPT-5.5?

by yetanotherjosh

7/5/2026 at 4:28:28 PM

If the issue comments are to be believed it could be related to the codex system prompt itself so likely just the harness. I agree the wording is weird but they appear to be referencing GPT 5.5 in Codex because yeah there isn’t a GPT5.5 Codex model.

by geophph

7/5/2026 at 4:41:30 PM

if these ai companies want to be taken seriously as being productivity tools then they're going to have to stop with these ab tests and forcing unproven features onto everyone. it's bad enough that ais are inherently unpredictable in quality of output, but these kinds of changes just make things worse.

anthropic at least does have a latest and stable channel, as the other day they pushed something irritating that would skip question asking phase if you didn't reply in 60 seconds, and it broke my multi terminal workflow. like I don't know what their product people are thinking when they push this kind of stuff, but it made me switch to stable

by redml

7/4/2026 at 10:49:57 PM

Clearly they are batching reasoning inference in a few multiples of 512 tokens as a throughput optimization

by kleton

7/5/2026 at 4:15:18 AM

My first thought would be an adjustment to a reasoning budget parameter (using llama.cpp as my reference) which would lead to these results. But no way to know precisely without an OpenAI statement.

It could be a very dishonest way of scaling to demand during peak hours. I know that some people already scoff in this topic about the subjective nature of perceived performance of models. But the model seemed less smart when US comes online (at least from my testing over the month of May).

On my company blog post from a few weeks ago I felt the need to point this out because it had a perceptively more consistent pattern during those overlap times. Should have saved the session logs for further analysis https://webesque.agency/blog/2026-06-19-llms.html

by mhitza

7/4/2026 at 11:16:44 PM

Isn't the standard to use continuous batching? If they are using continuous batching -- I'm curious why generated token length matters, and why they might be clustering them. If not -- I'm curious why they aren't and what is the tradeoff here.

by kbdiaz

7/5/2026 at 12:13:56 AM

This "~512 batching" makes me think of things like diffusion or prefill.

If they managed to put together some dirty hack that lets them generate about 512 tokens worth of reasoning in parallel instead of in sequence? That would explain it.

by ACCount37

7/5/2026 at 2:45:34 PM

I was wondering WTF was happening.

This was past month:

   516 + 518*n
   516  n=0  count=4454
  1034  n=1  count=318
  1552  n=2  count=129
  2070  n=3  count=56
  2588  n=4  count=35
  3106  n=5  count=14
  3624  n=6  count=6
  4142  n=7  count=4
  4660  n=8  count=6

by rq1

7/5/2026 at 8:55:07 AM

It's funny, they sell you a subscription for frontier models, then over time begin to nerf them rapidly and no one talks about it. Should give me a discount when they reduce reasoning effort silently on the server side!

But on the other hand, I've been using 5.5-high on a daily basis in multithreading workflows, i.e. in parallel. I'm barely exhausting my weekly limits. I can't even Human-as-a-Service fast enough to catch up and read all the plans and implementations it does. So there is that.

by AmazingTurtle

7/5/2026 at 3:26:11 PM

I’ve been paying $200/mo for Claude code for, IDK, 9 months?

I am 100% sure that I get far more value from today than I did in December. The models are smarter, the limits are higher. It’s possible there’s some “five steps forward, one step back” going on, but it’s hard to imagine complaining about that step back.

by brookst

7/5/2026 at 12:16:47 PM

> they sell you a subscription for frontier models, then over time begin to nerf them rapidly and no one talks about it.

People talk about it all the time. Just check some of the dozens of forums where its non-stop complaining about nerfs, limit nerfs, performance issues etc...

Is hard to prove that any downgrade is a effect of being deliberately served a lower class model / lower quant, or whatever. Or the "optimizations" hurting the models performance.

The TOS allows for those service "optimizations", so legally, nobody has a foot to stand upon. Like when OpenAI or was it Anthropic played with the cache, this to free up more server resources, only to later discover that its gutted the long term context behavior, and heavily degraded the models as context grew.

If you want 100% guaranteed the same performance/behavior, you need to run a model yourself (be it rented GPUs online or your own local setup). But its going to cost you a lot more ...

by benjiro29

7/5/2026 at 4:36:33 AM

Even without stats i know it went bad. In the pass two month barely can do any good scientific writing lately, which of course rely on reasoning. It just writing for gods sake. And it show how far we are from AGI.

by chazeon

7/5/2026 at 4:59:49 AM

> In the pass two month barely can do any good scientific writing lately, which of course rely on reasoning. It just writing for gods sake

Honestly, I think this is a really cool sentence. Imagine going back to 2021 and telling someone this was a legitimate complaint about a pretty cheap and very prevalent technology in 2026.

by LPisGood

7/5/2026 at 6:16:42 AM

This is an intermittent issue, you should still be able to get your work done. 5.5 was released two months ago so perhaps you're using 5.5 wrong and some things that worked in 5.4 require tweaking your prompts?

by cbg0

7/5/2026 at 2:38:46 PM

Well, when 5.5 first came out, it was kind of OK, but now it's almost noticeably worse. It can be done, but requires a lot of effort, just more and more round, and actually, the Gemini Pro on the web (which should be 3.1 Pro) is actually doing a more stable job.

The thing that I ask it to do is like take X and Y paper into Z paragraph --- a not-so-silly model should think of how information in X and Y are related and how they support the whole article to synthesize this sentence in a way that is coherent to the article, but 5.5 now will just copy the stuff without any reasoning about the relation. Of course, this will cost a lot of tokens and will be obvious if not done. One clear indicator is that in a few rounds you can see the length of the article get bloated to 2-3x undesirably long, which is clearly because it is not analyzing/synthesizing the info.

by chazeon

7/5/2026 at 5:40:53 PM

Oh cool, another source of LLM nondeterminism. Just what we needed!

by LogicFailsMe

7/5/2026 at 4:58:20 AM

I swear all these ai companies are trying to rob us for more price

by preetham_rangu

7/5/2026 at 10:43:00 AM

I have some bad news: every company is trying to rob you for more price

by inigyou

7/4/2026 at 9:51:09 PM

tldr:

GPT-5.5 Codex model exhibits a clustering phenomenon in which reasoning_output_tokens cluster at fixed values spaced 518 apart.

These stuck responses at fixed thresholds are strongly correlated with errors in complex tasks.

Observed phenomenon is specific to GPT-5.5; it is much less prevalent in GPT-5.4 and almost absent in GPT-5.2 and 5.3

by maille

7/4/2026 at 10:24:03 PM

[dead]

by joe_mamba

7/5/2026 at 3:25:30 AM

this explains so much why gpt 5.5 has been so bad lately it was really puzzling why it struggled so much where when it first came out it was one shotting stuff totally amazing, i tried the prompt that will tell you if your plan is degraded:

    codex exec --json --skip-git-repo-check --ephemeral -s read-only --disable memories -m gpt-5.5 -c model_reasoning_effort=high "Do not use external tools. A black bag contains candies with counts: round apple 7, round peach 9, round watermelon 8; star apple 7, star peach 6, star watermelon 4. Shape is distinguishable by touch before drawing; flavor is not. What is the minimum number of candies to draw to guarantee having apple and peach candies of different shapes, i.e. round apple + star peach or round peach + star apple? Give reasoning and final number. The local project dir is irrelevant for this task, do not consult it. "

1. 516, 24

2. 516, 27

3. 516, 12

4. 516, 21

5. 516, 21

This means that the whole time we've been paying for a product that was silently routing to something completely different and inferior from gpt 5.5

Also I read through the github issues and it seems like they closed a previous issue without addressing it ???!!

whooo boy somebody from OpenAI is getting fired over this if not a class action lawsuit is almost guaranteed at this point.

by zuzululu

7/5/2026 at 10:46:15 AM

The correct answer is 29, right? You could draw all the watermelons and all the round pieces before drawing a star piece. So the model never gets it right, but it does when listing the cases exhaustively?

by inigyou

7/5/2026 at 1:30:25 PM

No, the answer is 21. You ignored the "Shape is distinguishable by touch before drawing".

by fhars

7/5/2026 at 6:52:19 AM

Verified this locally myself. Thanks for the concrete test. I guess it's time to give Claude another try.

by cageface

7/5/2026 at 4:26:09 PM

I wonder if it's somehow getting confused between what's supposed to be an intermediate update vs the final result.

[1] https://github.com/openai/codex/issues/30364#issuecomment-48...

by nsingh2

7/5/2026 at 7:13:26 PM

yea i just tried it and it works!!! i dont know why it works but now gpt 5.5 feels exactly like how I remembered it a month and half ago.

by zuzululu

7/5/2026 at 7:29:19 AM

I would switch to Claude if they kept Fable 5 in the sub

I'm also afraid to lose my "spot" if I leave codex and 5.6 is coming out so...

by zuzululu

7/5/2026 at 2:47:20 AM

Reset!

by wahnfrieden

7/5/2026 at 7:25:06 AM

I'm seeing this issue with 5.4 also.

by joohwan

7/5/2026 at 2:18:46 AM

Sounds like a problem with promoting the drafter.

by kordlessagain

7/4/2026 at 11:40:52 PM

It's been a month I've been using it as they gave me for free, and I found GPT-5 on Codex quite weird/awful. Even x-high. Then I figured out I should try OMP (Pi), and the experience was much better.

I remember GPT 5.2 Codex being fine...

by vitorgrs

7/5/2026 at 2:28:55 AM

The good experience I had with GPT-5.5 before made me upgrade to Pro this month. Now I want a refund.

by linzhangrun

7/5/2026 at 5:05:13 AM

You want a refund because of a problem you weren't even aware of until now? And you don't even really know if your work has been impacted by this problem.

by cbg0

7/5/2026 at 5:34:18 AM

No, the decline in GPT-5.5's performance over the past few weeks is clearly noticeable.

by linzhangrun

7/5/2026 at 5:36:48 AM

Doesn't look like it: https://marginlab.ai/trackers/codex/

by cbg0

7/5/2026 at 2:08:20 PM

So what are we to make of the two items:

- This tracker not showing any visible degradation. - Clearly incorrect answers being reported due to truncated thinking.

Is the tracker not measuring 'simpler' tasks that might get auto-sent to "low reasoning hell" even on high/xhigh? Is the clustering not actually causing reasoning misses in real-life coding, or not enough of a negative effect compared to the improvements made elsewhere? Something else?

by rahidz

7/5/2026 at 8:34:53 AM

Thanks for sharing this project. Maybe I'm being subjective.

by linzhangrun

7/5/2026 at 9:39:05 AM

its called hedonic adaptation - you get excited by a new model, but then the excitement disappears, and you confuse that with the model being nerfed

by memoriyato3

7/5/2026 at 8:05:01 AM

Cool resource and perfect way to track this, thanks for sharing

by mdgld

7/5/2026 at 7:39:57 AM

This seems really bad…

by maxignol

7/5/2026 at 12:37:16 AM

Does this affect the Codex app too, or just the Codex CLI tool?

by jiggawatts

7/5/2026 at 1:05:42 AM

From some of the numbers I'm seeing in the GitHub issue, the codex desktop app has the same 516 spikes. So most likely it is affected.

by nsingh2

7/5/2026 at 9:32:41 AM

If this really is widespread and degrading performance in 40% of the cases, then if OpenAI simultaneously fixes this bug and releases GPT 5.6 within a day or two, then the sudden boost in capability is going to blow people's hair back.

by jiggawatts

7/5/2026 at 5:01:22 AM

[dead]

by openclawclub

7/5/2026 at 4:46:19 AM

[flagged]

by dualdust

7/4/2026 at 11:33:28 PM

[flagged]

by trycaedral

7/4/2026 at 10:23:26 PM

Personally, I would say very likely, to be honest. I gotta go through this a little more, but I actually use 5.5 codex an obscene amount, and I almost never use it for reasoning anymore. It's not even in the same galaxy as far as actually taking out the thinking and using GPT-5.5 or even Claude and then coming back and giving it the reasoning. Blah blah blah, it's the same model. Well, let me tell you, no, it's not, for several reasons, and the delta on intelligence is pretty staggering.

by ProofHouse

7/4/2026 at 10:27:08 PM

Care to explain what you mean by that?

by benjiro29

7/4/2026 at 11:19:35 PM

I'm struggling as well to understand, and I think perhaps they mean they use ChatGPT website with GPT-5.5+reasoning for problem solving, and paste the output into Codex CLI/App. I think they're saying that letting Codex CLI/App problem solve with GPT-5.5 isn't as effective. Essentially that the web harness is superior to the agentic engineering harness for problem solving?

Not sure if I agree, but I do happen to use a fair bit of web harness as well, just because I find it to be much more effective at web search and a different type of reasoning. So I must agree a little or else I wouldn't do that.

by criley2

7/4/2026 at 11:45:06 PM

I assume they are lying and still think you can use gpt 5.5 non-codex within codex cli. And they outed themselves. A lot of nonsense. And the very poor communication skills just seem like the typical chinese astroturfing you see pretty often now when discussing OAI/Claude.

by jatora

7/5/2026 at 12:31:28 AM

See, this is part of the confusion. There is no such thing as "GPT-5.5-codex". The last codex-branded model was "GPT-5.3-codex". Starting with "GPT-5.4" the main model handles agentic engineering and they did not release a coding model.

Both the web harness and codex app/cli use "GPT-5.5".

by criley2

7/5/2026 at 1:27:18 AM

haha woops. guess im the chinaman now

by jatora

7/5/2026 at 11:17:45 AM

在英语中，我們会说“Chinese”。

by criley2

7/5/2026 at 7:58:51 PM

no thx

by jatora

7/5/2026 at 1:29:22 AM

What do you mean by that? Seems kinda racist.

by fragmede

7/5/2026 at 4:57:45 AM

Did you read my other comment? Seems kinda oblivious

by jatora

7/5/2026 at 1:46:34 AM

[dead]

by redsocksfan45

7/4/2026 at 10:51:25 PM

I know that these types of comments are not really popular here, but this struck a chord with me because I feel the same. They aren't remotely close.

I have codex right now purely because they gave me a month free of ChatGPT Pro, so I have been using it in between my usage resets with claude. Since it's "free money" for me I have been using it exclusively on xHigh.

One of my most frequent prompts is "hey codex worked on ____, but it didn't quite hit the mark, can we review the work..."

Yes, part of this is normal even within the same model -- you have the highest power model review the work for correctness, refactoring opportunities, and so on, but man I tell you, I don't know what it is about codex, this is obviously one guy's anecdote -- same prompting style, same repository documentation ala MD files, same skills, way different results.

All that to say, maybe the bug report is on to something here, and it can be fixed.

by dimitrios1

7/4/2026 at 10:26:10 PM

What?

by m101