7/4/2026 at 11:20:23 PM
Oh this seems bad, and is fairly easy to reproduce using codex cli. You give it a puzzle prompt that it has to reason about and solve, occasionally it will seemingly short circuit and think for exactly 516 tokens, and return the wrong result. When it ends up using 6000-8000 thinking tokens it returns the correct result.Maybe some issue with adaptive thinking? Another point for local models I guess, don't have to worry about silent server side changes.
Edit: To follow up, it seems to happen quite often. Out of 10 runs of the exact same prompt, 4/10 had this 516 thinking token issue, and every one of these had the wrong solution. So nearly half the time, 5.5 xhigh could be short circuiting and degrading performance. Granted the sample size is small.
by nsingh2
7/5/2026 at 8:21:48 AM
I have a philosophical problem with adaptive thinking. It’s a dumb guess for how much thinking budget to allocate ahead of thinking. At least in the context of LLMs there is probably no way of knowing how much thinking (token generation) is needed. The problem space is infinity vast, similarly of two prompts is not going to help any LLM decide how much thinning is needed. Models already stop thinking before hitting the thinking budget.Why there is so much effort in making adaptive thinking happen and don’t we train models to produce the end of thinning token better?
Feels like a bandaid. We need models to be trained to do a reasonable amount of reasoning (no pub intended):
reason
estimate remaining uncertainty
continue?
reason more
repeat
by mohsen1
7/5/2026 at 4:20:59 PM
I agree, adaptive thinking is a pest and without a minimum thinking budget especially Claude for me currently defaults to not think at all even on max effort.Sequential-thinking was really a step in the right direction, and works almost exactly how you've described, though when it was popular before the reasoning models and even now when I tried it recently I have never once see it use its branching feature and it tends also to have the RLHF urge to answer something "helpful" quickly instead.
by CjHuber
7/5/2026 at 12:46:27 PM
At least there should be a tool call that's the equivalent of saying "wow, this is more complicated than I thought". Humans are also often prone to under-allocating reasoning time and coming to wrong conclusions because their reasoning ends up too shallow. But the best humans are great at mentally mapping the problem space and readjusting on the flyby wongarsu
7/5/2026 at 1:10:36 PM
This is what I do in llm-consortium. An arbiter evaluates the response(s) and decides if more iterations are needed. You can also loop until a minimum confidence threshold, but self-reported confidence isn't a great metric.by irthomasthomas
7/5/2026 at 12:05:06 PM
Right now we have a LOT of band aids. You want to optimize compute and thinking to a particular problem, sort of like we do. Yes you cannot perfectly predict this but you can do decently well and save a ton of tokens at the cost of this band aid being sort of leaky and gross.But the larger problem is sound, and the answer is something jointly optimized (idk how they do the routing) but it’s hard to shoehorn it into the current paradigm.
by aspenmartin
7/5/2026 at 2:43:53 PM
Modern LLMs are nothing but band-aids, starting with the absurd bandwidth of HBM3 RAM that makes them possible.by UltraSane
7/5/2026 at 3:11:46 PM
This is preliminary, but it seems like it might somehow be related to the `## Intermediary updates` system prompt that's provided to the model. Seems like it forces the model to stop thinking and return early to provide updates. Removing that entirely makes all runs succeed [1].I wonder if it's somehow getting confused between what's supposed to be an intermediate update vs the final result.
[1] https://github.com/openai/codex/issues/30364#issuecomment-48...
by nsingh2
7/5/2026 at 3:12:33 AM
You still have to worry about misconfigured local models. Even the professionals get it wrong, which is why local model performance is uneven across providers.by postalcoder
7/5/2026 at 7:15:36 AM
And to add insult to injury, some providers will ride on the good reputation of some local model, selling you a terrible quant instead.With OpenAI, at least my gpt-5.5 is the same as your gpt-5.5. You can't say that about glm for example.
by miki123211
7/5/2026 at 7:30:50 AM
> And to add insult to injury, some providers will ride on the good reputation of some local model, selling you a terrible quant instead.I just started using OpenRouter for some control testing of local models and what surprises me the most isn't that there are different providers providing different quantization levels, that makes sense, but I can't seemingly find a way of seeing what provider+model+quantization is actually used?! https://openrouter.ai/models shows the models, then say https://openrouter.ai/moonshotai/kimi-k2.7-code shows the providers but when I go to https://openrouter.ai/moonshotai/kimi-k2.7-code?endpoint=e7a... for example, why on earth is it not showing the actual details about the actual weights they're serving?! Give me details! It does have a "Precision" value that is sometimes filled out, but that seems to be a guess at best, even providers with the same values there have wildly different quality responses.
I like the idea about OpenRouter but holy hell does the implementation seem very far off from what it needs to be, in order to be useful.
by embedding-shape
7/5/2026 at 7:43:22 AM
There are properties on the API call you can pass for specific providers, so you test which providers you like the output, then add them to the list in ranked order if you want one by default, then to fall back to the other.There might be something in the response, or in a followup API call for the session, that you get better details. I think I've seen the details in the dashboard, so they do exist.
by jermaustin1
7/5/2026 at 10:54:42 AM
Sam Altman, good reputation?> With OpenAI, at least my gpt-5.5 is the same as your gpt-5.5.
How do we know that? The "orchestration" layer probably forwards to different levels of quantization. And it seems tempting to make some sort of load balancer with adaptive computation effort.
by rightbyte
7/5/2026 at 10:43:05 AM
> some providers will ride on the good reputation of some local model, selling you a terrible quant instead.Quants in popular local inference apps (Ollama, LM Studio, etc) are the worst possible quants (RTN).
by woadwarrior01
7/5/2026 at 9:28:10 AM
That's not a real equivalency. They are not necessarily the same (testing in production, hello!) And most importantly you do not have a local model because openai is not open!by beacon294
7/5/2026 at 5:19:06 AM
But in that case you have nobody but yourself to blame, and you can stabilize things yourself at any time by refraining from making any changes. You won't be surprised by a provider. Honestly? That's not just valuable—it's essential.by jdiff
7/5/2026 at 5:48:10 AM
> Honestly? That's not just valuable—it's essential.I'm curious if you wrote this or had a LLM write it.
I'm genuinely curious to be clear as I don't see why anyone would bother to go through a LLM to write such a short reply. Have we reached the point where Claudeisms that are this obnoxious have become part of regular speech?
by LiamPowell
7/5/2026 at 10:28:15 AM
The obnoxious cliche is mine, although I wouldn't call it "regular speech" since I tacked on that dense blob of LLMisms intentionally.by jdiff
7/5/2026 at 6:02:35 AM
Or they were making a joke[0].[0]: https://en.wikipedia.org/wiki/Joke
(…just like that)
by UqWBcuFx6NV4r
7/5/2026 at 6:13:15 AM
Sure, but it doesn't really fit there as a joke, it looks like it's just meant to be part of what they were trying to say.by LiamPowell
7/5/2026 at 7:13:49 AM
I also think it's a joke, it starts with the response / argument, and then flows into tongue-in-cheek joke about the core issue of the post (LLM)by subscribed
7/5/2026 at 6:38:26 AM
I've noticed them trying to creep into my writing. It doesn't help that I was a heavy em-dash user ten years before GPT-3.by topynate
7/5/2026 at 7:51:26 AM
I’ve always loved em dashes…very sad they’re a hallmark of LLM slop now. That and trios in arguments. Maybe I’m part LLM?by mdgld
7/5/2026 at 8:30:42 AM
This isn’t just em-dashes—it's the empty phrase that includes both whatever the contrastive construction is called and an “Honestly”. It might have been human written but the density of LLM flags is undeniable.by andy99
7/5/2026 at 10:40:28 AM
Honesty has become a big tell, especially if its brutal. I'm not sure if it's getting worse or if I'm becoming more sensitive, but if I'm being brutally honest I can barely stand using LLMs anymore with their tone and their cliches. It's a horrible nightmare fusion of socmed influencer and LinkedIn hustler that sounds freakish even if it were human, and the density of tropes that are accumulating feels ridiculous.Didn't the foundries take action against those in the past? I don't see "delve" nearly as often anymore. Why are the models spiraling like this now?
What really grinds my gears is the constant need to guess what I'm doing and offer a million random follow-ups. I asked what the weather was like, I don't appreciate the 5 paragraphs of tokens burned on weather-appropriate activity suggestions.
by jdiff
7/5/2026 at 8:11:00 AM
Some of us been writing texts on the public internet for decades, and humans invented machines trained on our texts, so suddenly the text we write now sounds like robots? Only way to win is to stop caring, let the people believe you're a LLM if so be it.by embedding-shape
7/5/2026 at 8:32:24 AM
LLM writing style is trained in by data labellers, it’s not just emergent behavior from being trained on internet texts.by andy99
7/5/2026 at 8:40:09 AM
Ultimately it's a mix-match of everything, including whatever data the pre-training uses and how exactly they do the post-training. I don't think you can say there is a single factor that decides the writing style, unless you have some particular insight into some specific pipeline. Generally though, they output text that looks like the human text they ingested for training.by embedding-shape
7/5/2026 at 12:44:39 PM
I've made a passive workaround: a pair of Codex CLI hooks that detect the truncation from the local session transcript and warn — in the TUI at turn end, and via a message injected into the model's context on your next message.by bentoner
7/5/2026 at 2:22:44 AM
I wonder if testing during different time/days show patterns? For example, whether the short circuiting happens more often during workday peak hours.by dannyw
7/5/2026 at 9:27:33 AM
And people pay for those wasted tokens? If that's the case, it is probably good idea to ask for refunds.by varispeed