3/17/2026 at 6:17:11 PM
I checked the current speed over the API, and so far I'm very impressed. Of course models are usually not as loaded on release day, but right now:
- Older GPT-5 Mini is about 55-60 tokens/s on the API normally, 115-120 t/s when used with service_tier="priority" (2x cost).
- GPT-5.4 Mini averages about 180-190 t/s on API. Priority does nothing for it currently.
- GPT-5.4 Nano is at about 200 t/s.
To put this into perspective, Gemini 3 Flash is about 130 t/s on Gemini API and about 120 t/s on Vertex.
This is raw tokens/s for all models; it doesn't exclude reasoning tokens, but I ran the models with none/minimal reasoning effort where supported.
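For anyone wanting to reproduce these numbers, here's a minimal sketch of how raw t/s can be computed from a streamed response, assuming you record an arrival timestamp and token count per chunk (the helper and the simulated timings below are my own illustration, not any SDK's API):

```python
def stream_stats(events, t0=0.0):
    """TTFT and raw tokens/s from a list of (arrival_time, n_tokens) chunk events.

    'Raw' matches the numbers above: total tokens (reasoning included)
    divided by total elapsed time since the request was sent at t0.
    """
    ttft = events[0][0] - t0
    total_tokens = sum(n for _, n in events)
    return ttft, total_tokens / (events[-1][0] - t0)

# Simulated stream (made-up timings): first chunk at 0.4 s,
# then a 10-token chunk every 50 ms, 100 chunks total.
events = [(0.4 + 0.05 * i, 10) for i in range(100)]
ttft, tps = stream_stats(events)
print(f"TTFT {ttft:.2f}s, {tps:.0f} tok/s")
```

With a real API you'd append (time.monotonic(), chunk_token_count) per streamed chunk instead of the simulated list.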
And quick price comparisons:
- Claude: Opus 4.6 is $5/$25, Sonnet 4.6 is $3/$15, Haiku 4.5 is $1/$5
- GPT: 5.4 is $2.5/$15 ($5/$22.5 for >200K context), 5.4 Mini is $0.75/$4.5, 5.4 Nano is $0.2/$1.25
- Gemini: 3.1 Pro is $2/$12 ($3/$18 for >200K context), 3 Flash is $0.5/$3, 3.1 Flash Lite is $0.25/$1.5
by Tiberium
3/17/2026 at 10:21:01 PM
IME tok/s is only useful with the additional context of TTFT and total latency. At this point a given closed model does not exist in a vacuum but rather in a wider architecture that affects the actual performance profile for an API consumer. This isn't usually an issue when comparing models within the same provider, but it does mean cross-provider comparison using only tok/s is not apples-to-apples in terms of real-world performance.
by rglynn
3/17/2026 at 10:49:05 PM
Exactly. It's really frustrating that they don't advertise TTFT etc., and that it's really hard to find any info in that regard on newer models. For voice agents, gpt-4.1 and gpt-4.1-mini still seem to be the best low-latency models when you need to handle bigger data or more complex asks.
But they are a year old, and trying to figure out whether these new models (instant, chat, realtime, mini, nano, wtf) are a good upgrade is very frustrating. AFAICT they aren't; the TTFT latencies are too high.
by Rapzid
3/18/2026 at 3:43:49 PM
Yeah, this speed is excellent! I'm using GPT-5 mini for my "AI tour guide" (it simply summarizes Wikipedia articles for me on the fly, presented in my app based on geolocation), and it's always been a ~15 second wait before streaming of a large article summary would begin. With GPT-5.4 it's around 2-3 seconds, and the quality seems at least as good. This is a huge UX improvement; it really starts to feel more 'real time'.
by widdershins
3/17/2026 at 9:44:20 PM
Curious to hear why people pick GPT and Claude over Google (when sometimes you'd think they have a natural advantage on costs, resources, business model, etc.)?
by daniel_iversen
3/18/2026 at 3:07:29 PM
Because Claude is so much more expensive, and I rarely need the best. gpt-5.4 is really good now, even for tricky problems. We take opus-4.6 only for the unsolvable problems, or when someone else pays for it.
by rurban
3/17/2026 at 10:40:07 PM
In my workplace, it's availability. We have to use US-only models for government-compliance reasons, so we have access to Opus 4.6 and GPT 5.4, but only Gemini 2.5, which isn't in the same class as the first two.
by coderjames
3/18/2026 at 5:34:25 AM
Have you used Gemini models for code work? Claude and Codex are miles ahead in terms of quality and how thorough they are.
by fullstackchris
3/17/2026 at 6:52:41 PM
I wish someone would also thoroughly measure prompt processing speeds across the major providers. Output speeds are useful too, but they're already the more commonly measured of the two.
by coder543
3/17/2026 at 7:29:22 PM
In my use case for small models I typically generate a max of 100 tokens per API call, with prompt processing taking up the majority of the wait time from the user's perspective. I found OAI's models to be quite poor at this and switched to Anthropic's API for that reason alone. I've found Haiku to be pretty fast at PP, but I'd be willing to investigate another provider if they offer faster speeds.
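A rough model of why short generations feel PP-bound: perceived latency is approximately TTFT (which bundles queueing and prompt processing) plus output_tokens divided by decode t/s. A sketch with illustrative, made-up numbers:

```python
def perceived_latency(ttft_s, out_tokens, tok_per_s):
    # TTFT already includes prompt processing + queueing;
    # the tail is pure decode time.
    return ttft_s + out_tokens / tok_per_s

# Hypothetical figures: 100 output tokens per call, as above.
fast_pp = perceived_latency(0.3, 100, 150)  # quick prompt processing, modest decode
slow_pp = perceived_latency(2.0, 100, 300)  # slow PP, even with 2x faster decode
print(f"{fast_pp:.2f}s vs {slow_pp:.2f}s")
```

At 100 output tokens the decode term is a fraction of a second either way, so the provider with faster prompt processing wins regardless of its headline tok/s.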
by JLO64
3/17/2026 at 8:53:19 PM
OpenRouter has this information.
by asselinpaul
3/17/2026 at 10:40:27 PM
I do not see prompt processing, only some kind of nebulous “throughput” that could be output or input+output, but definitely not input only.by coder543
3/17/2026 at 10:32:17 PM
Man the lowest end pricing has been thoroughly hiked. It was convenient while it lasted.by msp26
3/17/2026 at 11:35:29 PM
token/sec is meaningless without the thinking level. If a model is fast but keeps rambling instead of getting to the point, it can take far longer than a low-token/sec model with low/no thinking.
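The point is easy to put in numbers: since raw t/s counts reasoning tokens, total wall time is roughly (reasoning_tokens + output_tokens) / t/s, so a fast model that "thinks" a lot can still lose. A sketch with illustrative figures:

```python
def total_time(reasoning_toks, output_toks, tok_per_s):
    # Raw tok/s counts reasoning tokens too, so they cost real wall time.
    return (reasoning_toks + output_toks) / tok_per_s

fast_rambler = total_time(3000, 500, 190)  # 190 t/s, but a long chain of thought
slow_direct  = total_time(0, 500, 60)      # 60 t/s, no thinking at all
print(f"{fast_rambler:.1f}s vs {slow_direct:.1f}s")
```

Here the 190 t/s model takes over twice as long end to end, purely because of the reasoning tokens.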
3/17/2026 at 8:17:33 PM
Wow. How fast is haiku?by rattray