3/17/2026 at 6:17:11 PM
I checked the current speed over the API, and so far I'm very impressed. Of course models are usually not as loaded on release day, but right now:
- Older GPT-5 Mini is about 55-60 tokens/s on the API normally, 115-120 t/s when used with service_tier="priority" (2x cost).
- GPT-5.4 Mini averages about 180-190 t/s on API. Priority does nothing for it currently.
- GPT-5.4 Nano is at about 200 t/s.
To put this into perspective, Gemini 3 Flash is about 130 t/s on Gemini API and about 120 t/s on Vertex.
This is raw tokens/s for all models; it doesn't exclude reasoning tokens, but I ran the models with none/minimal reasoning effort where supported.
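For anyone wanting to reproduce these numbers, here's a minimal sketch of how raw t/s can be computed from a streamed response, assuming you record an arrival timestamp and token count per chunk (the helper and the simulated timings below are my own illustration, not any SDK's API):

```python
def stream_stats(events, t0=0.0):
    """TTFT and raw tokens/s from a list of (arrival_time, n_tokens) chunk events.

    'Raw' matches the numbers above: total tokens (reasoning included)
    divided by total elapsed time since the request was sent at t0.
    """
    ttft = events[0][0] - t0
    total_tokens = sum(n for _, n in events)
    return ttft, total_tokens / (events[-1][0] - t0)

# Simulated stream (made-up timings): first chunk at 0.4 s,
# then a 10-token chunk every 50 ms, 100 chunks total.
events = [(0.4 + 0.05 * i, 10) for i in range(100)]
ttft, tps = stream_stats(events)
print(f"TTFT {ttft:.2f}s, {tps:.0f} tok/s")
```

With a real API you'd append (time.monotonic(), chunk_token_count) per streamed chunk instead of the simulated list.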
And quick price comparisons:
- Claude: Opus 4.6 is $5/$25, Sonnet 4.6 is $3/$15, Haiku 4.5 is $1/$5
- GPT: 5.4 is $2.5/$15 ($5/$22.5 for >200K context), 5.4 Mini is $0.75/$4.5, 5.4 Nano is $0.2/$1.25
- Gemini: 3.1 Pro is $2/$12 ($3/$18 for >200K context), 3 Flash is $0.5/$3, 3.1 Flash Lite is $0.25/$1.5
by Tiberium
3/17/2026 at 10:21:01 PM
IME tok/s is only useful with the additional context of TTFT and total latency. At this point a given closed model does not exist in a vacuum but rather in a wider architecture that affects the actual performance profile for an API consumer. This isn't usually an issue when comparing models within the same provider, but it does mean cross-provider comparison using only tok/s is not apples-to-apples in terms of real-world performance.
by rglynn
3/17/2026 at 10:49:05 PM
Exactly. It's really frustrating that they don't advertise TTFT etc., and that it's really hard to find any info in that regard on newer models. For voice agents, gpt-4.1 and gpt-4.1-mini still seem to be the best low-latency models when you need to handle bigger data or more complex asks.
But they are a year old, and trying to figure out whether these new models (instant, chat, realtime, mini, nano, wtf) are a good upgrade is very frustrating. AFAICT they aren't; the TTFT latencies are too high.
by Rapzid
3/18/2026 at 3:43:49 PM
Yeah, this speed is excellent! I'm using GPT-5 mini for my "AI tour guide" (it simply summarizes Wikipedia articles for me on the fly, presented in my app based on geolocation), and it's always been a ~15 second wait before streaming of a large article summary would begin. With GPT-5.4 it's around 2-3 seconds, and the quality seems at least as good. This is a huge UX improvement; it really starts to feel more 'real time'.
by widdershins
3/17/2026 at 9:44:20 PM
Curious to hear why people pick GPT and Claude over Google (when sometimes you'd think they have a natural advantage on costs, resources, business model, etc.)?
by daniel_iversen
3/18/2026 at 3:07:29 PM
Because Claude is so much more expensive, and I rarely need the best. gpt-5.4 is really good now, even for tricky problems. We take opus-4.6 only for the unsolvable problems, or when someone else pays for it.
by rurban
3/17/2026 at 10:40:07 PM
In my workplace, it's availability. We have to use US-only models for government-compliance reasons, so we have access to Opus 4.6 and GPT 5.4, but only Gemini 2.5, which isn't in the same class as the first two.
by coderjames
3/18/2026 at 5:34:25 AM
Have you used Gemini models for code work? Claude and Codex are miles ahead in terms of quality and how thorough they are.
by fullstackchris
3/17/2026 at 6:52:41 PM
I wish someone would also thoroughly measure prompt processing speeds across the major providers. Output speeds are useful too, but they're already the more commonly measured of the two.
by coder543
3/17/2026 at 7:29:22 PM
In my use case for small models I typically generate a max of 100 tokens per API call, with prompt processing taking up the majority of the wait time from the user's perspective. I found OAI's models to be quite poor at this and switched to Anthropic's API for that reason alone. I've found Haiku to be pretty fast at PP, but I'd be willing to investigate another provider if they offer faster speeds.
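A rough model of why short generations feel PP-bound: perceived latency is approximately TTFT (which bundles queueing and prompt processing) plus output_tokens divided by decode t/s. A sketch with illustrative, made-up numbers:

```python
def perceived_latency(ttft_s, out_tokens, tok_per_s):
    # TTFT already includes prompt processing + queueing;
    # the tail is pure decode time.
    return ttft_s + out_tokens / tok_per_s

# Hypothetical figures: 100 output tokens per call, as above.
fast_pp = perceived_latency(0.3, 100, 150)  # quick prompt processing, modest decode
slow_pp = perceived_latency(2.0, 100, 300)  # slow PP, even with 2x faster decode
print(f"{fast_pp:.2f}s vs {slow_pp:.2f}s")
```

At 100 output tokens the decode term is a fraction of a second either way, so the provider with faster prompt processing wins regardless of its headline tok/s.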
by JLO64
3/17/2026 at 8:53:19 PM
OpenRouter has this information.
by asselinpaul
3/17/2026 at 10:40:27 PM
I do not see prompt processing, only some kind of nebulous “throughput” that could be output or input+output, but definitely not input only.by coder543
3/17/2026 at 10:32:17 PM
Man the lowest end pricing has been thoroughly hiked. It was convenient while it lasted.by msp26
3/17/2026 at 11:35:29 PM
token/sec is meaningless without the thinking level. If a model is fast but keeps rambling instead of getting to the point, it can take far longer than a low-token/sec model with low/no thinking.
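The point is easy to put in numbers: since raw t/s counts reasoning tokens, total wall time is roughly (reasoning_tokens + output_tokens) / t/s, so a fast model that "thinks" a lot can still lose. A sketch with illustrative figures:

```python
def total_time(reasoning_toks, output_toks, tok_per_s):
    # Raw tok/s counts reasoning tokens too, so they cost real wall time.
    return (reasoning_toks + output_toks) / tok_per_s

fast_rambler = total_time(3000, 500, 190)  # 190 t/s, but a long chain of thought
slow_direct  = total_time(0, 500, 60)      # 60 t/s, no thinking at all
print(f"{fast_rambler:.1f}s vs {slow_direct:.1f}s")
```

Here the 190 t/s model takes over twice as long end to end, purely because of the reasoning tokens.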
3/17/2026 at 8:17:33 PM
Wow. How fast is haiku?by rattray