alt.hn

5/19/2026 at 5:43:45 PM

Gemini 3.5 Flash

https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-5/

by spectraldrift

5/20/2026 at 2:11:57 AM

For those who would like to know the total and active parameter count of this model: even though Google doesn't disclose the model technicals, we can infer them within relatively tight margins based on what we do know.

We know they serve the model on TPU 8i, which we have plenty of hard specs for (so we know the key constraints: total memory and bandwidth and compute flops). We can also set a ceiling on the compute complexity and memory demand of the model based on knowing they will be at least as efficient as what is disclosed in the Deepseek V4 Technical Report.

We can also assume that the model was explicitly built to run efficiently in a RadixAttention style batched serving scenario on a single TPU 8i (so no tensor parallelism, etc. to avoid unnecessary overheads... Google explicitly designed the 8th-generation inference architecture to eliminate the need for tensor sharding on mid-sized models).

We know Google intends to serve this model at a floor speed of around 280 tok/s too.

Putting all these pieces together, we can confidently say this model is ~250-300B total, and 10-16B active parameters. Likely mostly FP4 with FP8 where it matters most.

Visual:

  ┌────────────────────────────────────────────────────────┐
  │                   TPU 8i VRAM (288 GB)                 │
  ├───────────────────────────┬────────────────────────────┤
  │   Static Model Weights    │  Dynamic Allocations &     │
  │   (250B - 300B @ Mixed    │  Compressed KV Caches      │
  │   FP4/FP8)                │  (RadixAttention / SRAM)   │
  │   ~110 GB - 150 GB        │  ~138 GB - 178 GB          │
  └───────────────────────────┴────────────────────────────┘

I do model serving optimization work. This is napkin math.
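
As a rough check of the split in the diagram above, the weight footprint is just parameters times bytes per parameter. The sketch below is only illustrative: `fp8_fraction` (the share of weights kept in FP8, with the rest in FP4) is a hypothetical knob, the 288 GB capacity is taken from the diagram, and quantization scales, higher-precision embeddings and runtime overhead are ignored.

```python
def weight_gb(total_params_b: float, fp8_fraction: float = 0.0) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes).

    total_params_b: total parameters, in billions.
    fp8_fraction:   hypothetical share of weights stored in FP8 (1 byte each);
                    the remainder is assumed to be FP4 (0.5 byte each).
    """
    bytes_per_param = fp8_fraction * 1.0 + (1.0 - fp8_fraction) * 0.5
    return total_params_b * bytes_per_param


MEMORY_GB = 288  # capacity shown in the diagram above

for params_b in (250, 300, 400):
    for frac in (0.0, 0.1):
        w = weight_gb(params_b, frac)
        print(f"{params_b}B, {frac:.0%} FP8: weights ~{w:.0f} GB, "
              f"left for KV cache etc. ~{MEMORY_GB - w:.0f} GB")
```

Pure FP4 works out to 125 GB for 250B and 150 GB for 300B, so the ~110 GB lower bound in the diagram would imply slightly under 4 bits per weight on average.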

Edit: There's one factor I underrated in my initial estimate... TurboQuant. This is a compute to KV memory use tradeoff. It's plausible that, with TurboQuant at a quality-neutral setting, they've gotten the model up to 400B with similar economics. This is a variable affecting concurrency, and the way they decided total model size was likely based on what they see for the average user's average KV cache depth in real-world usage.

by easygenes

5/20/2026 at 3:24:52 AM

We've been really impressed with the performance of ~30B parameter class models and how close they are to the frontier from ~6-12 months ago, which begs the question, are the frontier labs really serving 10T parameter models? Seems unlikely.

If these Gemini 3.5 numbers are accurate, then I'd wager GPT 5.5 and Opus 4.7 are a lot smaller than people have speculated, too. It's not that frontier labs can't create a 5T+ parameter model, but they don't have the data to optimize a model of that size.

Gemini 3.5 Flash is really smart in one-shot coding reasoning, btw. Near the frontier. But it doesn't do so well in long horizon agentic tasks with arbitrary tool availability. This is a common theme with Google models, and the opposite of what we see with Chinese models (start dumb, iterate consistently toward a smart solution).

Data at https://gertlabs.com/rankings

by gertlabs

5/20/2026 at 7:03:32 AM

Elon says Opus is 5T (and I would expect he'd know)

> It's not that frontier labs can't create a 5T+ parameter model, but they don't have the data to optimize a model of that size.

They have plenty of data. They use very large amounts of verifiable synthetic data (lots in coding and math) to cover the gap.

Also the frontier labs are paying people to do tasks, tracking the trajectories and training on that. Most of the optimization is in RL based on these trajectories.

by nl

5/20/2026 at 12:16:41 PM

> Elon says Opus is 5T (and I would expect he'd know)

Even if he knew, why would anyone expect Elon not to lie about anything?

> They have plenty of data.

I don't think data is the problem either, but compute is: if you want to train your 5T-param model like modern small models are being trained (with a thousand times more training tokens than params), that's an enormous training run.
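
To put a rough number on that, a common rule of thumb is training FLOPs ≈ 6 × parameters × tokens. This is a minimal sketch, assuming a dense 5T model (for a mixture-of-experts model the active parameter count would be the relevant one) and the 1000 tokens per parameter ratio mentioned above; the 30B comparison point is just an illustrative "small model".

```python
def training_flops(params: float, tokens_per_param: float) -> float:
    """Rule-of-thumb training compute: C ~= 6 * N * D."""
    tokens = params * tokens_per_param
    return 6.0 * params * tokens


big = training_flops(5e12, 1000)    # 5T params, 5e15 tokens
small = training_flops(30e9, 1000)  # 30B params, 3e13 tokens

print(f"5T dense:  {big:.2e} FLOPs")
print(f"30B dense: {small:.2e} FLOPs")
print(f"ratio:     {big / small:,.0f}x")
```

At a fixed tokens-per-parameter ratio the compute grows with the square of the parameter count, which is why the same recipe gets so expensive at 5T.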

by stymaar

5/21/2026 at 3:08:06 AM

> if you want to train your 5T-param model like modern small models are being trained (with a thousand times more training tokens than params), that's an enormous training run.

Yes it is. Spending $100M on training runs is common, and $1B might be in scope for some of the large models.

Sonnet 3.5 cost "a few 10s of millions of dollars" back in 2024: https://simonwillison.net/2025/Jan/29/on-deepseek-and-export...

by nl

5/21/2026 at 12:02:24 AM

I mean in general I'm pretty doubtful about things he says, but in this he was comparing Grok and it sort of makes sense in the context: https://x.com/elonmusk/status/2042123561666855235

by nl

5/21/2026 at 10:54:01 AM

In that context specifically, why would you trust him not to lie?

He's using a massive number for Opus to make Grok look good “for its size”.

If he said something praising Anthropic and like “Grok is 7T, while Opus is better while being only 5T, we need to work harder” or something then maybe I could believe it. But here it's a context where he has all the incentives to inflate Opus' size to make himself look somehow “in the race” when he really isn't despite the money and compute advantage.

Given this tweet I wouldn't be surprised if Grok were actually 1T, with Opus in the same ballpark.

And I'm absolutely not buying current-day Sonnet being a 1T-parameter model (that's an absolutely deranged take: that would make Anthropic already behind Chinese model makers, which I think isn't something anyone would put money on).

by stymaar

5/20/2026 at 7:33:16 AM

This is what we do at gertlabs.com - the foundation labs are actually starving for better data. Having quality data is not the same as having a lot of data. Human curated data / RLHF cannot scale to a 5T model and synthetic data pipelines are very much a work in progress in the industry.

Some interesting notes:

- Training a small model with large model output resulted in LESS improvement than distilling a less smart model onto the same small architecture [0]. We are starting to hit intelligence density limits in small models (<30B models may be nearing saturation now)

- good RL environments incidentally also make for good benchmarking

[0] https://arxiv.org/html/2502.12143v1

by gertlabs

5/20/2026 at 10:49:23 AM

Wouldn’t it be good to start looking into a micro model architecture? Like, a first model checks the context and routes to the Java-optimized model, etc. It would also make it simpler to load/unload models in memory.

So extremely small models that are only good for a certain task, like a programming language. A small model at the front that is extremely good at classifying tasks, and then a more complex model that can bring each of these micro models back together.

by merb

5/20/2026 at 11:07:51 AM

My guess is that we underestimate how much non-Java data and context in general is needed to create a good Java coding model. It could be true that a good Java model would be of 80-90% the size of a comparable overall coding model.

Obviously, I have no idea but I guess it’s not as simple as “just train only on Java code and reduce size to 1/10th”.

by lukeundtrug

5/20/2026 at 11:01:06 AM

I think you're describing Mixture-of-Experts.

by puilp0502

5/20/2026 at 12:09:54 PM

> they don't have the data to optimize a model of that size.

So where does humanity cap out? The statement more or less implies that there's a ceiling of our ability to train models which might be below what LLMs are capable of (e.g. not AGI but how good coding agents they might ever become, for example).

by KronisLV

5/20/2026 at 7:37:12 AM

I’m not sure if synthetic data is enough.

xAI paying Cursor to train models with their data tells us that having an agent tool like Claude Code is important for quality data acquisition. That’s why they recently shipped Grok Build.

I think we will see insane SOTA models from xai in the next few months.

by maipen

5/20/2026 at 3:30:14 AM

We know from NVIDIA's public Vera Rubin inference engine marketing materials that the frontier lab models are ~1-2T total.

Mythos is an exception that's larger.

by easygenes

5/20/2026 at 6:02:59 PM

Wouldn’t that be an exciting plot twist? That the release cadence of the big labs doesn’t actually reflect any meaningful improvements, or bigger models, but it’s a marketing ploy to start ratcheting up prices for good ARR numbers prior to the big IPO where the celebrity executives bail out of the stalling plane.

by opsnooperfax

5/20/2026 at 6:22:58 AM

I agree with this sentiment but the reasoned anecdotes do not agree. I imagine the flagship models have modalities/usages that we hn-ers don't imagine easily.

by beacon294

5/20/2026 at 9:01:08 AM

It was estimated that Mythos is 10T.

And serving is not training. For distilling you need to train the big models to have something to be distilled.

by Glohrischi

5/20/2026 at 4:38:09 AM

I exclusively use gemini models and this has been my experience.

I mitigate it by creating dense planning docs for everything and executing iteratively.

Lots of time wasted on procedure, unfortunately.

by MisterPea

5/20/2026 at 10:07:50 AM

If two things hold up - 1) this is actually a 2-300B parameter model and 2) this is actually competitive with frontier OpenAI and Anthropic models (and not just benchmaxing), the implications are pretty big. It would mean you could run "frontier level" performance in one box at home.

300B models at least fit in a single maxed out Mac Studio or a small stack of DGX Sparks or AMD Strix Halo boxes.

For comparison, DeepSeek V4 Flash is all the rage now for small efficient models. It's very good for its size but far from the performance of the latest GPT Pro and Opus models. The vanilla variant has 284B parameters. It fits on both 256GB and 512GB Mac Studios and hits about 20-30 tokens/second.

The implication of all this here is that you could have a (somewhat sluggish) Opus in a small box at home. At least once competing models and hardware to run them will be available (high end Mac Studios have been discontinued).

Something tells me that this means that Google's performance numbers here are inflated.

by DCKing

5/20/2026 at 2:30:17 PM

Opus is estimated to be around 4T parameters, and 5.5 around 9T. [1] And while 3.5 at least qualifies to be in the same neighborhood, which is stunning if these numbers are all true, it may be that closing that last ~10% difference needs 50x more parameters.

[1] https://arxiv.org/pdf/2604.24827

by WarmWash

5/21/2026 at 12:19:26 AM

Their methods are only calibrated on open models (of course) and they admit very broad confidence bounds. You can also just see from comparing their estimates of the same models at different reasoning levels that there are major confounders to this. I would err on the absolute lowest side of their estimates for frontier models (e.g. 3T for GPT-5.5, 1.5-2T for Opus 4.5+).

by easygenes

5/20/2026 at 12:03:14 PM

> the implications are pretty big. It would mean you could run "frontier level" performance in one box at home.

That wouldn't surprise me at all actually: models like Qwen3.6-35B are comparable to frontier-level models from a year ago, and I wouldn't be surprised if we had self-hostable open-weight models matching Opus 4.7 in a year. Assuming that Google has a one-year lead over Chinese labs isn't far-fetched given how many resources they have compared to their Chinese competitors.

by stymaar

5/20/2026 at 12:14:01 PM

I think there was a leap around Opus 4/4.1 that hasn't quite been equalled by self hostable models yet. Perhaps full Kimi K2.6 and Deepseek V4 Pro can achieve Opus 4.1 levels (it's hard to compare anyway, benchmarks are largely a game nowadays), but both of these are also north of 1000B parameters and therefore really impractical to run at home for the foreseeable future.

It's not yet obvious to me that you can achieve the breakthrough performance of say Opus 4.1/4.5 in a number of parameters you can swing at home.

by DCKing

5/20/2026 at 1:39:36 PM

> It's not yet obvious to me that you can achieve the breakthrough performance of say Opus 4.1/4.5 in a number of parameters you can swing at home.

People used to believe the same about GPT-4, and I'm not convinced this is going to be different this time.

You do need a very big model if you want something that remembers random trivia about everything, but I'm not convinced this is needed to do meaningful work.

by stymaar

5/20/2026 at 1:13:23 PM

> 300B models at least fit in a single maxed out Mac Studio or a small stack of DGX Sparks or AMD Strix Halo boxes.

I run 2.54 BPW 397B Qwen 3.5 GGUF on a 128G mac studio at 20 tokens/second generation and 200 tokens/second processing. I'm not suggesting it matches the performance of the full BF16 model, but I did run some benchmarks locally and the results were pretty good:

- MMLU: 87.96%

- GPQA diamond: 86.36%

- IfEval: 91.13%

- GSM8k: 92.57%

So I think we have been at the "frontier capabilities at home" for a few months now.
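
For reference, the arithmetic behind that fit is roughly parameters × bits per weight / 8. This is a quick sketch that counts the weights only and ignores the KV cache, runtime overhead and any tensors kept at higher precision.

```python
def quantized_size_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate quantized weight size in GB (1 GB = 1e9 bytes).

    params_b: parameter count in billions.
    """
    return params_b * bits_per_weight / 8.0


size = quantized_size_gb(397, 2.54)
print(f"397B @ 2.54 BPW ~ {size:.1f} GB of weights")
```

That comes to about 126 GB for the weights alone, so it is a tight fit on a 128G machine.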

by tarruda

5/21/2026 at 4:09:44 AM

TurboQuant. They can fit more in less now

by LarsDu88

5/21/2026 at 7:15:21 AM

TurboQuant is a runtime optimization for a model's KV cache and doesn't allow for reduction in model size.

by DCKing

5/21/2026 at 5:49:38 PM

TurboQuant reduces the runtime memory needed for the model's KV cache.

This reduces both the memory bandwidth needed for inference (at the cost of slightly increasing the amount of compute needed), and the amount of VRAM used overall, meaning more VRAM can be allocated for more weights on the same hardware.

You were replying to a comment estimating model params from hardware. I am saying the param count could be higher for the same hardware.
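
A rough sketch of that tradeoff, using the standard KV cache size formula for plain multi-head/grouped-query attention. Everything concrete here is an assumption for illustration: the model shape, the number of cached tokens in flight, and the 4 bits per value used for the quantized case (not a claim about what TurboQuant actually achieves or how Gemini is built).

```python
def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bits_per_value: float) -> float:
    """KV cache size in GB: one key and one value vector per layer per token."""
    values = 2 * layers * kv_heads * head_dim * tokens  # 2 = keys + values
    return values * bits_per_value / 8 / 1e9


# Hypothetical model shape and serving load, for illustration only.
shape = dict(layers=60, kv_heads=8, head_dim=128)
tokens_in_flight = 500_000  # cached tokens summed over all concurrent requests

baseline = kv_cache_gb(tokens_in_flight, bits_per_value=16, **shape)
quantized = kv_cache_gb(tokens_in_flight, bits_per_value=4, **shape)
freed = baseline - quantized

print(f"16-bit KV: {baseline:.2f} GB, 4-bit KV: {quantized:.2f} GB")
print(f"freed: {freed:.2f} GB, i.e. room for ~{freed / 0.5:.0f}B more FP4 params")
```

The freed memory can go either to more weights or to more concurrent context, which is why the parameter estimate from hardware alone has a wide range.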

by LarsDu88

5/20/2026 at 2:56:55 PM

Since I started using Qwen-3.6 35B A3B, I believe frontier-like capability will be more than enough in these smaller models within a year or two, at least for coding. They don't need to memorize facts into their weights, which likely has very interesting implications that I'm not going to speculatively decode.

by verdverm

5/20/2026 at 5:59:35 AM

Nice post! You piqued my curiosity, so after a bit of research it turns out that, with techniques like MTP/MLA/CSA, it's quite probable that these models are much more efficient (and maybe bigger? tho 400B sounds about right) than a simple RAM breakdown would suggest.

MTP - https://blog.google/innovation-and-ai/technology/developers-...

MLA - https://machinelearningmastery.com/a-gentle-introduction-to-...

CSA - https://deepseek.ai/blog/deepseek-v4-compressed-attention

by smnscu

5/21/2026 at 6:14:41 PM

These techniques are used by DeepSeek, and work well with the commodity (NVIDIA) GPU's they use. Google designs their entire AI stack from the custom silicon up. So they have different optimization approaches. (Though Gemma does use MTP)

by Doxon

5/20/2026 at 3:06:31 AM

If this is accurate it raises the question: why is this model so expensive? DeepSeek v4 Flash is 284B total/13B active, FP4/FP8 mixed, and only costs $0.14/$0.28 - even less from OpenRouter. Of course Gemini 3.5 Flash is most likely a better product, and therefore it can command a higher price from an economics perspective, but does this imply Google is taking roughly a 90% profit margin on inference? If so they're either very compute-limited or confident in the model and wanting to recoup training/fixed costs (or both).
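
To make the margin arithmetic explicit: gross margin is 1 - cost / price, so a 90% margin means charging about 10x the cost to serve. The sketch below treats DeepSeek v4 Flash's listed prices as a stand-in for cost to serve, which is an assumption (their real cost is presumably lower than their price, and Google's cost structure may differ).

```python
def gross_margin(cost: float, price: float) -> float:
    """Gross margin as a fraction of price."""
    return 1.0 - cost / price


def price_for_margin(cost: float, margin: float) -> float:
    """Price needed to hit a target gross margin at a given cost."""
    return cost / (1.0 - margin)


# Assumed proxy for cost to serve, in $ per million tokens (input, output).
proxy_cost_in, proxy_cost_out = 0.14, 0.28

print(f"90% margin input price:  ${price_for_margin(proxy_cost_in, 0.90):.2f}/M")
print(f"90% margin output price: ${price_for_margin(proxy_cost_out, 0.90):.2f}/M")
```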

by daemonologist

5/20/2026 at 3:13:24 AM

Well, we use flash models extensively (both 2.5 and 3.1) and I cannot overstate this, google cannot fucking serve them without 503s 70% of the time on most days

I think it’s pure economics. Flash models are OP for the price, leads to too much demand, google cannot serve it. This is likely expensive to reduce load and hey, if it still makes money just keep the margin.

by xmonkee

5/20/2026 at 3:25:20 AM

Rumor is that GCP was happily selling compute to competitors. After all, under the hood, Google is closer to a federation than a corporation. The state of GCP doesn't care about the state of Gemini.

by WarmWash

5/20/2026 at 3:48:08 AM

> Rumor is

It’s not a rumor - there are many public announcements about $B deals around compute for other Ai companies

by happyopossum

5/20/2026 at 6:53:39 AM

>> Rumor is that GCP was happily selling compute to competitors. After all, under the hood, Google is closer to a federation than a corporation. The state of GCP doesn't care about the state of Gemini.

> It’s not a rumor - there are many public announcements about $B deals around compute for other Ai companies

The last time I read a public announcement, the commentary I read was that this is because Anthropic doesn't want to run out of cash or capacity before a funding round / IPO, so they gave Google some equity and in return Google gave it some compute resources? You could reframe it as Google buying into Anthropic, which is how Claude tends to frame it, but the end result is the same. Equity for spare capacity.

You could even argue that at a hyperscaler like Google's scale, all capacity is spare capacity and no capacity is spare capacity at the same time. GCP seems to have deals with Anthropic, OpenAI, Meta, Apple, healthcare companies, banks, LG(?), Best Buy(?) so in my mind Google (and all AI vendors) are hyping up AI to drive up interest and building up as fast as they can to capture that interest and convert it into cold, hard cash. It honestly feels like this is out of my mental capacity though because these AI vendors had the cold hard cash that they spent on these data centers that we don't even know might become obsolete within a decade(?) but I guess meanwhile they could make beaucoup bucks. There is also the idea that they had to be seen as conspicuously spending on AI or investors might see them as falling behind, triggering a selloff. So yeah I guess it is a fact that GCP is selling compute to other AI companies but it makes sense because basically you can build capacity potentially for Gemini to use in the future while having other companies pay for some of that cost today.

In my mind, for hyperscalers — Google, Amazon dot com, Microsoft — competitors are not really enemies but rather partners. The real fear or threat is market uncertainty and customers souring on AI altogether. As long as customers are interested in this AI stuff, you could compete on merit or cost benefit ratio but if competitors start failing because they ran out of capacity or cash, that could send an unwanted message to the market.

To summarize though, I have to agree that the supposed rumors are better than rumors, they are facts and we could even make an educated guess that this is a part of a strategy, as much as you can strategize when it comes to an "industry" with a high fixed cost and an uncertain demand.

by collabs

5/20/2026 at 6:57:13 AM

Seems like diversification, not only to maximise profit but also to minimise risk (of their models not staying competitive).

by uHuge

5/20/2026 at 5:28:48 AM

This is the reality of the premiums available from being in the lead by ~8 months on model building technicals.

by easygenes

5/20/2026 at 9:14:47 AM

meta - i think that's the first time i've seen a table in a hn comment, and i'm surprised/impressed! nice

are these pre-generated in a different tool with plain unicode and then just copy-pasted, or is it a built-in feature of hn?

by 4ggr0

5/20/2026 at 3:37:34 PM

The fact that this is running on tpus is a huge point. Counting those against the other available datacenter hardware used by others, it puts google at a huge advantage, and compute > * while scaling is still working

by wing-_-nuts

5/20/2026 at 2:49:30 AM

Tell me more about what your day looks like. What do you think of the LLMOps books from Abi, in case you have read them? Any other resources you can recommend?

by Maven911

5/20/2026 at 2:34:45 AM

Do you have similar math for the flash-lite variant of the models? I'd be curious. Based on my testing / benchmark i think it's around the 100-120B mark.

With the Pro variant being around 600B - 800B

My testing is comparing its performance / output to other models in the same size range, so not as scientific as yours.

by zacksiri

5/20/2026 at 2:57:52 AM

given this, is it safe to assume that inference pricing is barely related to cost to serve at this point and there is considerable margin?

by anthonypasq96

5/20/2026 at 8:34:17 AM

I like your chain of thought there !

by rawoke083600

5/20/2026 at 3:24:33 PM

i would like to get a job like that. what can i study? I am mostly a ml engineer / researcher.

by PunchTornado

5/19/2026 at 7:29:53 PM

The pelican is a lot: https://github.com/simonw/llm-gemini/issues/133#issuecomment...

Not a great bicycle though, it forgot the bar between the pedals and the back wheel and weirdly tangled the other bars.

Expensive too - that pelican cost 13 cents: https://www.llm-prices.com/#it=11&ot=14403&sel=gemini-3.5-fl...
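
Backing out a rough per-token price from those numbers, as a sketch only: it reads `it=11&ot=14403` in the link as input and output token counts, treats the 11 input tokens as negligible, and assumes "13 cents" is rounded to the nearest cent.

```python
def implied_output_price_per_million(cost_usd: float, output_tokens: int) -> float:
    """Implied output price in $ per million tokens, ignoring input cost."""
    return cost_usd / output_tokens * 1_000_000


OUTPUT_TOKENS = 14403

low = implied_output_price_per_million(0.125, OUTPUT_TOKENS)
mid = implied_output_price_per_million(0.13, OUTPUT_TOKENS)
high = implied_output_price_per_million(0.135, OUTPUT_TOKENS)

print(f"implied output price: ~${mid:.2f}/M tokens "
      f"(rounding range ${low:.2f} to ${high:.2f})")
```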

by simonw

5/19/2026 at 7:32:42 PM

That pelican looks like it's in Miami for a crypto conference.

by hedgehog

5/20/2026 at 12:22:16 AM

That pelican wears its sunglasses at night. So it can, so it can keep track of the visions in its eyes.

by seemaze

5/20/2026 at 1:00:36 AM

Pelican and I need an optometrist urgently

by whh

5/20/2026 at 2:52:24 AM

It looks quite funny.

by baochillchill

5/19/2026 at 8:01:54 PM

It looks like the starting soon screen of a crypto presentation

by joseda-hg

5/19/2026 at 11:43:58 PM

That pelican looks like it lost 100k on NFTs and now runs a paid stock-trading group.

by coffeecoders

5/19/2026 at 7:53:26 PM

It looks like it’s been partying for 60 years based on the wrinkles on its pouch.

by xattt

5/20/2026 at 11:01:48 AM

You don't know what that pelican has been through.

by ethbr1

5/19/2026 at 8:25:47 PM

Pelican in a white Testarossa.

by Xenoamorphous

5/19/2026 at 11:29:28 PM

They're called ClawCons now

by airstrike

5/20/2026 at 12:31:55 AM

Personally, I don't attend them since I figured out I can set up agents to performatively engage in AI-related discussion and events for me, freeing up tons of my time thanks to automation.

Truly: Nothing better than AI tools to brave the challenges and requirements of modern life. "Claude, ride the hype train" is the decisive prompt you need.

by sho_hn

5/19/2026 at 10:22:50 PM

It looks like the start of a new viral Peliwave aesthetic

by brindleth

5/19/2026 at 8:18:58 PM

and somehow in 1992

by egillie

5/19/2026 at 8:33:54 PM

sorta looks like the Tron ripoff in the I/O keynote

by verdverm

5/19/2026 at 7:48:56 PM

This is a perfect illustration of something I noticed with llm progress. Ask them to improve an svg like this, and it never fixes the missing crossbar or disconnected limbs, it just adds more stuff. In this example they have obviously improved greatly, and it contains a ridiculous amount of detail, but they still get the basic shape of the frame wrong. It's weird. And the pattern shows up everywhere, try it with a webpage and it will add more buttons and stuff. I've even experimented with feeding the broken pelican svgs to an image model to look for flaws, and they still fail to spot the broken elements.

edit: fixed human hallucination

by irthomasthomas

5/19/2026 at 8:29:57 PM

When you say "improve an svg like this", how are you imagining setting that workflow up? Are you just feeding them the SVG to iterate on; or are you giving them access to a browser to look at the rendering of the SVG?

I ask because:

Insofar as the original pelican test is zero-shot, it effectively serves as a way to test for the presence of a kind of "visual imagination" component within the layers of the model, that the model would internally "paint" an SVG [or PostScript, etc] encoding of an image onto, to then extract effective features from, analyze for fitness as a solution to a stated request, etc.

But if you're trying to do a multi-shot pelican, then just feeding back in the SVG produced in the previous attempt, really doesn't correspond to any interesting human capability. Humans can't take an SVG of a pelican and iteratively improve upon it just based on our imagined version of how that SVG renders, either! Rather, a human, given the pelican, would simply load the pelican SVG in a browser; look at the browser's rendering of the pelican; note the things wrong with that rendering; and then edit the SVG to hopefully fix those flaws (and repeat.)

I imagine current (multi-modal and/or computer-use) LLMs would actually be very good at such an "iterative rendered pelican" test.

by derefr

5/19/2026 at 8:39:23 PM

I'm talking about two types of improvement: model improvement and prompt-based improvement. I am noticing that the baseline output has a lot more going on, the model has improved, yet it still makes those obvious-looking mistakes with the shape of the frame or disconnected limbs etc.

And I am saying that if you take one of these SVGs and ask an LLM to look for flaws, it rarely spots those obvious flaws and instead suggests adding a sunset and a fish in the bird's mouth.

by irthomasthomas

5/20/2026 at 1:20:29 PM

This is also my gripe with a lot of this stuff, always evaluating models on what they can literally oneshot is completely pointless; it's not how anything works, neither for humans nor for scaffolded AIs. I guess it's neat if you want to argue that a certain level of intelligence can "never be achieved" in a single forward pass, but like, so what. No one cares about that, except people who have already decided to be anti AI.

(not that I am in any sense pro AI, but it's just a weird lack of intellectual rigor)

by tskj

5/20/2026 at 2:40:23 PM

Asking a model to improve its output is not one-shotting tho? My observation was that asking an llm to iterate and improve a response causes it to add more stuff, rather than repair the broken stuff. And that model progress in general has the same pattern. This new model adds more details to its responses but continues to make mistakes at about the same rate.

by irthomasthomas

5/20/2026 at 4:15:34 PM

The question was whether you were giving it the rendered image and using the model's visual modal capability, or feeding back in the textual SVG.

It's hard to "imagine" what the rendered SVG looks like, for both humans and LLMs, so just iterating on text won't really be as useful of a test. But if you show it what it rendered, it might observe the bad-looking bicycle and be able to fix the text that way.

by losvedir

5/20/2026 at 5:31:41 PM

"I've even experimented with feeding the broken pelican svgs to an image model to look for flaws, and they still fail to spot the broken elements."

by irthomasthomas

5/19/2026 at 11:01:46 PM

To a certain extent, it feels like a Sonnet 3.7 moment. Slightly overeager - you ask for a button color change, you see layout changes, new package dependencies, and the README rewritten from scratch - and not necessarily correctly.

When I ask for a pelican on a bike, I want the Platonic ideal of a pelican on a bike, not a vision of an alternative reality in which pelicans created bikes. Though, thinking about it again, maybe I should.

by stared

5/20/2026 at 12:30:35 AM

What is “Sonnet 3.7 moment”?

by p1esk

5/20/2026 at 6:58:08 PM

Sonnet 3.7 tried its damnedest but it was just kinda "off".

by dormento

5/20/2026 at 1:35:14 AM

Slightly overeager - you ask for a button color change, you see layout changes, new package dependencies, and the README rewritten from scratch - and not necessarily correctly.

by stirfish

5/20/2026 at 2:05:55 AM

So we have to train llms on debugging too, not just how to make things (which you easily train by feeding the outputs).

by Araopa

5/20/2026 at 12:22:54 AM

It's because LLMs are fundamentally generative (creative), not truth-seeking or logic-seeking. Simple logic has always been incredibly expensive to impossible for LLMs.

by gowld

5/20/2026 at 1:00:09 AM

This matches my experience with humans too FWIW.

by sosborn

5/20/2026 at 1:12:14 AM

Why is there always an identical reply like this when anyone criticizes LLMs?

by emp17344

5/19/2026 at 10:30:58 PM

Their ability is best described as "spiky". To steal from aphyr: think kiki, more than bouba. What's interesting is that a lot of the models seem to have similar spikes and "troughs", though there are differences.

by girvo

5/19/2026 at 8:01:31 PM

Forgetting the chainstay is typical of asking random people to draw a bicycle.

https://www.gianlucagimini.it/portfolio-item/velocipedia/

> most ended up drawing something that was pretty far off from a regular men’s bicycle

by tantalor

5/19/2026 at 8:46:59 PM

Asking random people to write SVG gives even worse results

by et1337

5/19/2026 at 9:17:36 PM

Especially without being able to look at the rendered output! (At least I'd be surprised if modern server-side tool calls regularly include an SVG renderer that can show a rasterized version to the model to iterate on it.)

by lxgr

5/20/2026 at 1:19:20 AM

One of the many things Google was pitching today is that they're going to run things like google search with access to linux container environments to do things like run tool calls... which will presumably be able to rasterize SVGs and show them to the model.

But Simon says he runs these through the API without tool access specifically to prevent that sort of "cheating". I.e. it's an LLM benchmark not an LLM+Harness benchmark.

by gpm

5/19/2026 at 10:59:15 PM

Although every single render of those has pedals on the correct side, as opposed to the Gemini optical-illusion back pedal that tries to be both on the other side of the central gear and in front of the back wheel.

Not really a criticism but an interesting point that you would never expect a human to make that mistake even in a bad drawing.

by Eji1700

5/20/2026 at 5:17:11 AM

Thanks for the delightful Velocipedia

by Barbing

5/19/2026 at 7:55:32 PM

I feel like it embodies Google's vibe of an uncool guy trying to stay relevant to the youth pretty well.

by smcleod

5/20/2026 at 1:42:10 AM

That's grok. IMO both gemini and grok are the most overlooked models.

by dzhiurgis

5/20/2026 at 8:24:28 AM

Gemini is absolute garbage for anything useful, the last good model they released was 2.5 pro.

by smcleod

5/20/2026 at 12:55:12 AM

I'm told there is a new Jeff Dean fact inside google: "Jeff Dean manually adjusts the weights in the model just to screw with Simon".

by dekhn

5/20/2026 at 12:18:51 AM

If you sort that table by "output token price", it gets really terrifying - going from 4 cents up to $600 =8-O

by tandr

5/20/2026 at 12:27:21 AM

We've been daily-driving this model for a few weeks and let me tell you, everything it does is a lot. Fast as fuck and it's actually not bad intelligence-wise for a fast model. It basically tries to make up for any intelligence deficit by just doing a lot, checking a lot, retrying a lot.

That's not to say I don't spend my days raging at it... a lot... but it's not that bad. It does tend to ignore completion criteria but it doesn't obviously degrade when being nudged like some models do.

by nrds

5/20/2026 at 8:30:09 AM

One time I told it “we are doing science” and I had DNA emoji everywhere and it so over enthusiastically embraced the science theme I was genuinely laughing. It finished one task with a flourish of several dna emoji and proclaimed: The Science is COMPLETE. I died.

It really is a lot some of the time. And its chain of thought is hilarious a lot of the time.

by bitexploder

5/19/2026 at 7:38:32 PM

Same old issue with Gemini models trying to "enrich" everything

by hydra-f

5/20/2026 at 9:43:07 AM

'Pelicans' should be the unit of measurement for model prices, rather than tokens.

by nomilk

5/20/2026 at 1:01:32 AM

I'm hoping we'll have many of these pelican cyclist pictures collected. Then when all the models can do it well, we'll stop posting about them, and then when the next generations of AIs train on the data we'll have these canonical archetypes.

by karmakaze

5/20/2026 at 5:06:06 AM

Wouldn't be a thread about the tech that is changing the landscape for businesses across nearly every discipline without a pelican svg.

by dankwizard

5/19/2026 at 10:16:07 PM

Wow what’s with all the styling? Is it a manifestation of google’s styling bias? I like the result for sure. It’s shiny and pretty. But then it’s something I didn’t ask for.

by sbinnee

5/20/2026 at 12:40:38 AM

I can’t help but think that what AI is best at is convincing management that things it creates are full featured which reads to their brains as mature

by taurath

5/20/2026 at 2:22:46 AM

I wonder if they added all these unrequested details as an Easter-egg or something? (Since they must be aware of your test by now).

by bee_rider

5/20/2026 at 4:40:17 AM

The fact it went for vaporwave styling on its own is very telling.

by VectorLock

5/19/2026 at 9:05:30 PM

`<!-- Pelican Eye / Sunglasses (Cool Retro Aviators) -->`

wtf

`<!-- Gold Rim -->`

WTF??

by setgree

5/19/2026 at 7:53:44 PM

funny that when I try the same prompt, gemini generates an image, not an SVG. something is not right.

by gcgbarbosa

5/19/2026 at 8:09:52 PM

That's likely because you're using the Gemini app which has a tool for image generation (nano banana) - I do my tests against the API to avoid any possibility of tool use.

by simonw

5/19/2026 at 8:19:24 PM

This question makes me wonder if you one shot each pelican or do you run it a few times to get the best one?

by nickmccann

5/19/2026 at 10:43:19 PM

I one-shot. I have a long-standing ambition to have each model generate 3x and then get the model (assuming it's a vision model) to pick the best one.

by simonw

5/19/2026 at 10:03:10 PM

They are just trolling you now

by __mharrison__

5/19/2026 at 7:44:50 PM

Beats a human by like 10$

by nashashmi

5/20/2026 at 4:21:02 PM

Only if you would use this pelican picture in production.

by FranOntanaya

5/19/2026 at 7:56:40 PM

So according to Google logic, the value of the pelican is $10-eps. (They applied that reasoning to conversions via adwords)

by unglaublich

5/20/2026 at 5:21:02 AM

Eps?

by Barbing

5/20/2026 at 5:52:32 AM

epsilon

by kirubakaran

5/19/2026 at 7:55:49 PM

at a certain point you're gonna need to change your benchmark because this will end up in the model's training set

by holtkam2

5/19/2026 at 9:58:34 PM

I'm sure that certain point came and went many releases ago.

by recursive

5/20/2026 at 9:51:07 AM

As mentioned in another recent thread, that time is now.

by kzrdude

5/20/2026 at 1:56:47 AM

I've found prompts like "capybara with spotted fur and 7 octopus tentacles instead of legs, each a different color, riding a tricycle" etc. to be a better test

Last time I tried, ChatGPT's image generator got the best result.

by Razengan

5/19/2026 at 10:30:40 PM

Love your pelicans, as always. And that one is... Wow.

I noticed the "Synthwave" aesthetic, which has been enjoying quite some success for quite some time now, has found its way into AI models (even when it's not in the user's query). It's not the first time I've seen the sun at sunset with color bands etc. in AI-generated pictures. Don't know why it's now catching on in AI too.

https://en.wikipedia.org/wiki/Synthwave

Hence the comments here about the 90s, Sonny Crockett's white Ferrari Testarossa in Miami, etc.

To be honest as a kid from the 80s and a teenager from the 90s who grew up with that aesthetic in posters, on VHS tape covers, magazine covers, etc. I do love that style and I love that it made a comeback and that that comeback somehow stayed.

by TacticalCoder

5/19/2026 at 11:42:41 PM

Synthwave vibe hype hit a cultural high point with the release of Far Cry 3 Blood Dragon in 2013.

So it's as relevant and baked-in to today as actual 80s synth-culture was in 2000.

by kridsdale3

5/20/2026 at 4:31:20 AM

"Look around to look around."

by professoretc

5/20/2026 at 12:24:50 AM

At the keynote today, Sundar Pichai asked Gemini to clone the Dino Game, and it added a synthwave-esque aesthetic.

by gowld

5/20/2026 at 12:18:02 AM

Given your pelican is very famous now, don't you think they are adding instructions to beat this benchmark these days?

by danilocesar

5/20/2026 at 12:22:51 AM

Well clearly it's not working lmao

by Culonavirus

5/19/2026 at 6:58:06 PM

Per million input/output tokens:

Gemini 2.5 flash: $0.30/$2.50

Gemini 3.0 flash preview: $0.50/$3.00

Gemini 3.5 flash: $1.50/$9.00

Interesting pricing direction. I don't think we have ever seen a 3x price increase for the immediate next same-sized model (and lol @ 3 only ever getting a preview).

3.5 flash costs similar to Gemini 2.5 pro which was $1.25/$10
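
A quick way to check the multiples, using only the per-million-token prices listed above (the dictionary layout and the `price_multiple` helper are illustrative names, not anything from Google's API):

```python
# Prices per million tokens as (input, output) in USD, copied from the list above.
prices = {
    "Gemini 2.5 flash": (0.30, 2.50),
    "Gemini 3.0 flash preview": (0.50, 3.00),
    "Gemini 3.5 flash": (1.50, 9.00),
}

def price_multiple(new, old):
    """Return (input multiple, output multiple) of `new` relative to `old`."""
    new_in, new_out = prices[new]
    old_in, old_out = prices[old]
    return new_in / old_in, new_out / old_out

for older in ("Gemini 3.0 flash preview", "Gemini 2.5 flash"):
    in_x, out_x = price_multiple("Gemini 3.5 flash", older)
    print(f"3.5 flash vs {older}: {in_x:.1f}x input, {out_x:.1f}x output")
```

The jump from 3.0 flash preview is 3x on both input and output; measured against 2.5 flash the input price is up about 5x.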

by GodelNumbering

5/19/2026 at 8:48:45 PM

This understates the cost increase. 3.5 Flash also uses more tokens. artificialanalysis.ai shows these differences in the cost to run the whole eval, which I think is more realistic pricing:

Gemini 2.5 flash (27 score): $172 (1.0x)

Gemini 2.5 pro (35 score): $649 (3.8x)

Gemini 3.0 Flash (46 score): $278 (1.6x)

Gemini 3.5 Flash (55 score): $1,552 (9.0x or 2.4x compared to 2.5 pro)

This is a massive price increase... 5.6x compared to Gemini 3.0 Flash
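
For anyone who wants to reproduce the multiples, here is a small sketch using only the eval costs quoted above (the `eval_cost_usd` dictionary and `multiple` helper are illustrative names):

```python
# Cost in USD to run the whole eval, as quoted above.
eval_cost_usd = {
    "Gemini 2.5 flash": 172,
    "Gemini 2.5 pro": 649,
    "Gemini 3.0 Flash": 278,
    "Gemini 3.5 Flash": 1552,
}

def multiple(model, baseline):
    """How many times more expensive `model` is than `baseline`."""
    return eval_cost_usd[model] / eval_cost_usd[baseline]

for baseline in ("Gemini 2.5 flash", "Gemini 2.5 pro", "Gemini 3.0 Flash"):
    ratio = multiple("Gemini 3.5 Flash", baseline)
    print(f"3.5 Flash vs {baseline}: {ratio:.1f}x")
```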

by __jl__

5/20/2026 at 3:55:42 PM

At these pricing levels, corporations who use the models will need to ensure employees are using them efficiently. I know, where I work, we don't really think about the cost to the company when using copilot chat, but sounds like it could start adding up really fast, especially for poorly defined questions that have to be revised multiple times.

by bnug

5/20/2026 at 6:37:11 AM

the era of subsidised ai is ending

by xdertz

5/20/2026 at 12:30:12 PM

API calls have never been subsidized, only subscriptions.

by driverdan

5/20/2026 at 10:31:54 AM

AI is getting really useful, might be why

by kzrdude

5/20/2026 at 3:41:42 PM

It's interesting they use output tokens as an eval because all tokens are not made equal. Even from model to model (like Opus 4.6 to Opus 4.7) the tokenizer can be different and it's no longer an apples to apples comparison. No one really talks about this but it directly affects stats like usage limits. Certainly comparing models between providers on an apples to apples comparison token wise is not a good test.

by joshmlewis

5/20/2026 at 2:59:07 PM

Sonnet-level performance at Haiku prices. They know what they have and who the audience is they want.

by ahknight

5/20/2026 at 9:32:50 AM

Gemini 2.0 Flash: $19

by ashirviskas

5/20/2026 at 2:50:51 PM

... and you get what you pay for. Or less.

by ahknight

5/19/2026 at 7:16:52 PM

They probably never intended to keep serving cheap models. This is a natural way to introduce the squeeze, now that they have people who built services on their API. It makes a lot of sense to have an abstraction layer where the provider doesn't matter. If you are working in Kotlin, Koog is excellent.

by doginasuit

5/20/2026 at 2:00:04 AM

I think the big 3 are cartelizing and starting to ratchet up costs. GPT5.5 is not easily distinguishable from 5.1. I wouldn't be shocked if we hit the ceiling and everyone is quietly positioning for the exit.

by opsnooperfax

5/20/2026 at 1:33:13 PM

I don't understand why everyone thinks there is a ceiling below human-level intelligence, when we have an existence proof that human-level intelligence is possible.

by tskj

5/20/2026 at 11:41:01 PM

This is very napkin math, but the human brain has about 100 trillion parameters. Even the biggest models today top out at 10 trillion parameters. I think it's reasonable to assume that models need to be at least an order of magnitude bigger to capture the complexity of human intelligence, and probably a lot more.

by heyodai

5/20/2026 at 5:47:12 PM

for LLMs as implemented today?

by aaronblohowiak

5/19/2026 at 8:44:55 PM

switching models is insanely cheap compared to token cost on anything significant, this is a take so cynical it misses the reality

by lanthissa

5/19/2026 at 10:08:28 PM

in any corporate or half compliance-relevant setting switching isn't trivial. new DPA, subprocessor notifications, TIA, procurement review, security questionnaires, plus re-running your evals because prompts don't transfer 1:1. token cost is just one of the line items.

by Clueed

5/19/2026 at 11:05:52 PM

no it's really not, even the soggiest bank has multiple api vendors atm.

by lanthissa

5/20/2026 at 12:23:08 AM

I agree with parent. I'm not sure where your stance is coming from.

From what I hear, most enterprise AI deployments are seat-based subscriptions with annual commitments.

by alexandre_m

5/20/2026 at 12:35:11 AM

Yes, I work at a 50 person startup and even here switching from CC to codex or cursor would be non-trivial for multiple reasons - not just the annual commitment.

by p1esk

5/20/2026 at 9:12:53 AM

I don't doubt you but it's amazing how much easier things get when there's another option at 20% of the price, and that's what's going to happen here if these American companies keep trying to squeeze the prices up.

by esperent

5/20/2026 at 1:56:58 AM

50K FTE global firm. We’re still piloting ChatGPT. AI is a four-letter word and there are ridiculous ceremonies and hundreds of hours of overhead for every trivial use case.

Amusingly, Enterprise credits are more expensive than just paying a zero-commitment on-demand API fee. Personal accounts are still the best value.

by opsnooperfax

5/19/2026 at 8:07:50 PM

> now that they have people who built services on their API

People really can’t wait to be the next Zynga

by hnarn

5/20/2026 at 10:59:14 AM

[dead]

by jopalm

5/19/2026 at 7:07:00 PM

If Google is actually getting cheaper inference than everyone else with their TPUs, this smells like trouble to me. Maybe serving LLMs at a profit is proving difficult.

Or maybe they think because their benchmarks are good they can ramp up the prices. Seems like they don’t have the market share to justify a move like that yet to me.

by rudedogg

5/19/2026 at 7:30:44 PM

This is not priced at inference cost.

My guess: it's the price at which they make more money than if they rent the TPUs to other companies.

The Gemini team has had trouble securing enough TPUs for their users' needs. They struggle with load and their rate limits are really bad. Maybe at a higher price, they have a better chance at getting more TPUs assigned?

by tempaccount420

5/19/2026 at 8:34:54 PM

The price at which they could rent out the TPUs, i.e. the market rate, is the inference cost.

Just because you are vertically integrated doesn't mean you get to discount the one business units products to the other. Doing so discounts the opportunity cost you pay and is just bad accounting.

by gpm

5/19/2026 at 11:45:17 PM

Basic business principle, you charge what people are willing to pay not what it costs.

by KoolKat23

5/20/2026 at 9:00:49 AM

> doesn't mean you get to discount the one business units products to the other

That depends, if all developers get used to Claude and Codex it will become harder for Google to attract them in the future.

They might lose devs in the long term.

by sumedh

5/20/2026 at 1:42:20 PM

Predatory pricing is a great business strategy and all (particularly when countering the competitors' predatory pricing - what could go wrong), but that doesn't mean that the gemini-team should account for it as if they're getting the compute cheaper, it just means that they should run a loss.

by gpm

5/20/2026 at 2:02:56 PM

That's actually where AI differs: there is no network effect. So no reason for me to stay with a tool if suddenly another one is better or cheaper. Changing the model I use is literally two clicks in Zed. No retention possible for providers.

by flaburgan

5/19/2026 at 10:06:05 PM

Look up “double marginalisation”.

by dash2

5/19/2026 at 9:23:37 PM

Depends on if you have spare capacity I think. They have minimal competition so they might be maximizing profit by charging prices higher than what clears all their supply.

by HDThoreaun

5/19/2026 at 8:22:08 PM

It's probably that in 1 or 2 years local (free) models will completely take the place of cheap models so cheap models need to move up the quality chain.

You have free local models for most tasks, $20 subscriptions for near-frontier intelligence, and API per token costs for frontier intelligence.

Flash seems to be targeting the near-frontier category.

by spyckie2

5/19/2026 at 9:05:07 PM

That might work if it wasn't for FOMO. Are you ok with only $20 of frontier usage a month?

by TurdF3rguson

5/20/2026 at 1:08:32 AM

Subjective, but if we compare to compute not everyone needs the most expensive laptops or super computers for their work.

I think frontier models will be invaluable for scientific research, defense, financial analysis and such. But the average person probably would be reasonably well-served with a local model.

If you're in sales, customer service, product management and such - the leading open models at the 30B mark are already good enough.

by rohansood15

5/20/2026 at 8:22:03 AM

I mean customer service maybe, but how much longer will humans even be doing that job at this point?

by TurdF3rguson

5/19/2026 at 8:41:19 PM

Prevailing wisdom is that serving LLMs at a profit is achievable... it's when you factor in the cost of training them that prices get astronomical real fast.

Open-source model inference providers (who do not have to bear the cost of training) seem able to do it at much lower prices.

https://www.together.ai/pricing

https://fireworks.ai/pricing#serverless-pricing (scroll down to headline models)

Of course, it's possible that they are burning through investor cash as well, and apples-to-apples comparisons are not possible because AFAIK Google does not mention the size/paramcount for 3.5 Flash.

But if the prevailing wisdom is true, I think it's actually encouraging. It suggests that OpenAI and Anthropic could perhaps, if they need to, achieve profitability if they slow down model development and focus on tooling etc. instead. If true that's probably good news for everybody w.r.t. preventing a bursting of this economic bubble.

...my opinions here are of course, conjecture built on top of conjecture....

by booty

5/20/2026 at 12:26:54 AM

Most of the training cost is not in the final training run, it's in all of the R&D (including salaries, equity, etc.) that it takes to get to the final training run. The actual cost of all of the TPUs (or GPUs), power, networking, storage, etc. for the final training run is significant, but it's even more expensive to have this huge R&D team doing frontier model development and using a lot of those same resources during development.

I think you're right that releasing models at a slower cadence would bring down costs to some degree, but it's not clear how much. All of these companies could significantly reduce their opex but at the risk of falling behind in terms of being at the frontier.

by eklitzke

5/19/2026 at 10:40:56 PM

Not to discredit you, because you are 100% correct, but a tangential note about together.ai: they seem fairly unreliable, with constant outages or higher-than-normal latency.

by HDBaseT

5/19/2026 at 7:11:47 PM

Maybe the margins are just very large for Google because they predict so much demand for 3.5?

by IncreasePosts

5/19/2026 at 7:15:23 PM

This combined with locally runnable models getting pretty good recently (e.g. Qwen 3.6) tells me that it's time to seriously consider local dev setup again

by GodelNumbering

5/19/2026 at 8:09:19 PM

This should become Apple's new hardware and software play. I am hopeful about the new CEO

by cft

5/20/2026 at 11:12:34 AM

Nothing new about that play. They have been heading in this direction for a very long time now.

by arcatech

5/20/2026 at 4:35:40 PM

Perhaps they would have made the basic spell-checker work on MacOS apple silicon then in this long time?

by cft

5/19/2026 at 7:31:47 PM

Besides the cost, you get control, transparency, and the ability to identify small language models or LoRAs you can serve even more cost-effectively.

by MASNeo

5/19/2026 at 10:01:31 PM

This is trouble if you're not Google/OpenAI/Anthropic: they're all shifting towards pricing for the economic value of the knowledge work they're aiding.

The economic value increases non-linearly as models get more intelligent: being 10% more capable unlocks way more than 10% in downstream value.

That's trouble because the non-linear component means at some point their margins will stop being primarily defined by the cost of compute, and start being dominated by how intelligent the model is.

At that point you can expect compute prices to skyrocket and free capacity to plummet, so even if you have a model that's "good enough", you can't afford to deploy it at scale.

(and in terms of timing, I think they're all well under the curve for pricing by economic value. Everyone is talking about Uber spending millions on tokens, but how much payroll did they pay while devs scrolled their phones and waited for CC to do their job?)

by BoorishBears

5/20/2026 at 1:40:33 PM

Thank you, this is obviously where we're heading. People who think in terms of "will it ever be profitable to sell tokens" are thinking in the wrong framework entirely. The correct framework is "will it be profitable to sell knowledge work", and the answer will clearly be "yes".

by tskj

5/19/2026 at 7:35:16 PM

We need another "Deepseek moment" or else it will become impossible for the regular dude to use AI. It will become something that only big companies can afford.

by hei-lima

5/19/2026 at 8:40:03 PM

We're having DeepSeek moments every couple of weeks.

Qwen 3.6 hit hard in the self-hosting space. It's incredibly capable for its size, really shaking up what's possible in 64GB or even 32GB of VRAM.

The Prism Bonsai ternary model crams a tremendous amount of capability into 1.75GB.

And, DeepSeek V4 is crazy good for the price. They're charging flash model prices for their top-tier Pro model, which is competitive with the frontier of a few months ago.

The winners in the AI war will be the companies that figure out how to run them efficiently, not the ones that eke out a couple percent better performance on a benchmark while spending ten times as much on inference (though the capability has to be there, I think we're seeing that capability alone isn't a strong moat...there's enough competent competition to ensure there's always at least a few options even at the very frontier of capability).

by SwellJoe

5/19/2026 at 9:08:07 PM

> It's incredibly capable for its size, really shaking up what's possible in 64GB or even 32GB of VRAM.

You can lower that to at least 24GB. I've been running Qwen 3.5 and 3.6 with codex on a 7900 XTX and the long horizon tasks it can handle successfully has been blowing my mind. I would seriously choose running my current local setup over (the SOTA models + ecosystem) of a year ago just based on how productive I can be.

by Zambyte

5/20/2026 at 12:23:02 AM

Gonna try it.

by hei-lima

5/19/2026 at 9:13:57 PM

We have Qwen 3.6-35b (6) on a 5090 (32GB) and it's blowing me away. Works fine for most (not all) code generation tasks. One developer here has been extremely stubborn about adopting AI; he's finally adopted it, albeit only when it's coming from a local model like this.

DeepSeek V4 Pro likewise is insanely good for the price. I simply point it at large codebases, go get a cup of coffee or browse Hacker News, and then it's done useful work. This was simply not possible with other models without hitting budget problems.

by trollbridge

5/19/2026 at 10:36:49 PM

Any chance you'd be willing to talk further about your setup? I have 2 x 3090s in a local machine, and I'm still left with questions about how best to use stuff locally.

by akulbe

5/20/2026 at 12:25:38 AM

You can only run heavily quantized models on all 3/4/5 rtx gpus (with 32gb or less vram) - and you probably want moe versions like Qwen 35b for this to run at speed somewhat comparable to Claude. It’s still not there to be honest but getting there. Personally I mess around with llama.cpp on m5 max with 128gb - it’s a decent setup to try various medium sized things, and runs llms surprisingly well without quantization, at least the moe models.

by sheeshkebab

5/20/2026 at 12:37:28 AM

Two 3090s is 48GB, so it's possible to run the 6-bit quantization comfortably, which is fine. It doesn't start to get notably dumber until lower than that. It won't be as fast as a hosted model, but dual 3090s will be comfortably fast for interactive use with the MoE version and not terrible to use with the dense model. I run the dense model at 8 bits on my dual Radeon V620 desktop machine, which I think would be slower than two 3090s, or at least not notably faster.
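
As a rough sanity check on the numbers above, weight memory scales as parameters × bits per weight / 8. The sketch below assumes the 35B parameter count mentioned upthread for the MoE model, and it ignores KV cache, activations, and the mixed per-layer bit depths that real quantizations use. `weight_gb` is an illustrative helper, not part of any library.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8

VRAM_GB = 48  # two 24 GB RTX 3090s

for bits in (4, 6, 8):
    size = weight_gb(35, bits)
    headroom = VRAM_GB - size
    print(f"{bits}-bit: ~{size:.1f} GB of weights, ~{headroom:.1f} GB left over")
```

Under those assumptions a 6-bit 35B model needs roughly 26 GB for weights, which is why it fits in 48GB with room for context but not on a single 24GB card.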

by SwellJoe

5/20/2026 at 1:02:24 AM

Have you done comparisons with 4 bit and seen a noticeable difference for coding tasks?

by hedgehog

5/20/2026 at 3:04:26 AM

No, I've just seen benchmarks showing most models start degrading around 4-5 bits. That's not to say they become useless, just that down to about 6-bits (with careful hybrid quantizations like unsloth where some of the layers aren't quantized or are quantized at higher bit depths) the quality isn't measurably degraded, but below that there are measurable differences in performance.

People report good results from DeepSeek V4 Flash at 2 bits (the DwarfStar 4 folks are doing it, and I've tried it on my Strix Halo, but it's too slow to be usable, so I haven't bothered to figure out if it's actually smart enough to use for anything).

Anyway, it's obvious models have to degrade in terms of knowledge, at any quantization, even though it may not show up clearly on benchmarks until lower. If you halve the size of the data available, it necessarily loses information about the world.

by SwellJoe

5/20/2026 at 4:53:51 AM

One of the things I'm wondering about is what I'm missing for $LLM to create files on the local FS like Claude and Codex do. What I see instead is stuff just printing to stdout, rather than files on the filesystem.

What am I missing?

by akulbe

5/20/2026 at 5:25:40 AM

You're missing an agent. The model uses tool calls to interact with the filesystem, commands on the system, optionally search (you need a search MCP server, like Brave or Exa, and API key), etc.

I usually use the Zed Agent built into Zed editor for self-hosted models, but you could use Pi, OpenCode, Hermes, Claude Code, etc. there are many, many, agents.
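
To make the idea concrete, here is a toy sketch of what an agent does with a tool call. Everything in it is hypothetical: `fake_model` stands in for a real LLM, and the JSON shape and the `write_file` tool are made up for illustration rather than taken from any particular agent.

```python
import json
import tempfile
from pathlib import Path

def fake_model(prompt: str) -> str:
    """Stand-in for an LLM: it only returns text, here a JSON-encoded tool call."""
    return json.dumps({
        "tool": "write_file",
        "args": {"path": "hello.txt", "content": "Hello from the agent\n"},
    })

def write_file(root: Path, path: str, content: str) -> str:
    """Tool implementation: the agent, not the model, touches the filesystem."""
    target = (root / path).resolve()
    # Refuse paths that escape the workspace directory.
    if root.resolve() not in target.parents:
        raise ValueError(f"path escapes workspace: {path}")
    target.write_text(content)
    return f"wrote {len(content)} characters to {path}"

TOOLS = {"write_file": write_file}

def run_agent_step(root: Path, prompt: str) -> str:
    """Ask the model, parse its tool call, and execute the matching tool."""
    call = json.loads(fake_model(prompt))
    tool = TOOLS[call["tool"]]
    return tool(root, **call["args"])

workspace = Path(tempfile.mkdtemp())
result = run_agent_step(workspace, "Create hello.txt")
print(result)
print((workspace / "hello.txt").read_text(), end="")
```

Without the `run_agent_step` part, the JSON would just be printed to stdout, which is the behavior described in the question.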

by SwellJoe

5/20/2026 at 5:12:14 AM

The model just predicts text, Claude Code etc parse the output and do the actual file creation (or run shell commands that do it). If you have Claude Code installed look in ~/.claude/projects/... and you can see the transcripts of your actual sessions, or install Mini-SWE-Agent and play with that to get a feel for what's going on.

by hedgehog

5/20/2026 at 5:15:32 AM

The data I've seen is stuff like the KL Divergence comparisons that Unsloth does which show something but not clearly whether there's an observable or significant difference in task performance.

by hedgehog

5/20/2026 at 4:55:04 AM

How is that machine for local inference? It's a serious consideration for me, but getting to hear more from folks that already have it would be helpful.

by akulbe

5/21/2026 at 2:53:52 AM

It’s a great laptop to mess around with llms, it won’t replace claude opus or even sonnet.

by sheeshkebab

5/19/2026 at 7:49:02 PM

Deepseek had another moment a few weeks ago. V4 isn't far behind the US frontier, and so far its flash variant seems a very reliable coder and costs a pittance.

by squidbeak

5/19/2026 at 7:58:24 PM

Deepseek V4 (not flash) tripled in price too by the way (from Deepseek). Get used to this pattern.

This is what you get for relying on the generosity of billionaires. Keep offshoring your thinking ability to a machine and let me know how competitive you are. Hint: you won't be. There's nothing special about being able to use an LLM.

by ai_fry_ur_brain

5/19/2026 at 8:13:30 PM

Anyone can host Deepseek V4 on rented GPUs and sell inference on it. Price will very quickly converge to the marginal cost of inference. This is as close to a pure commodity as it gets in the AI space so competitive market economics will put in work. Same is true for any open-weights model.

by ls612

5/19/2026 at 8:21:59 PM

You don't understand the costs involved in running inference at scale.

Please go run some numbers. The hardware needed to run Deepseek v4 flash at 20 tps for a single session is nowhere close to what is required to run it at 50 tps for 5,000 concurrent sessions.

Imagine what it takes to be profitable when running at 150 tps for 30 cents per 1MM. You make less than 1k per month and the hardware required to run that costs 10k a month to rent with hardly any concurrent session capability.
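
Spelled out as a sketch, using the three figures from this comment and assuming a 30-day month at 100% utilization (the break-even stream count at the end is an extrapolation that is not in the comment, and it says nothing about whether the hardware could actually sustain that many sessions):

```python
import math

TOKENS_PER_SEC = 150                   # single session, from the comment
PRICE_PER_MTOK = 0.30                  # USD per 1M tokens, from the comment
HARDWARE_RENT_PER_MONTH = 10_000       # USD, from the comment
SECONDS_PER_MONTH = 60 * 60 * 24 * 30  # assumption: 30 days, always busy

tokens_per_month = TOKENS_PER_SEC * SECONDS_PER_MONTH
revenue_per_stream = tokens_per_month / 1_000_000 * PRICE_PER_MTOK

# Extrapolation: concurrent streams needed at the same speed to cover the rent.
streams_to_break_even = math.ceil(HARDWARE_RENT_PER_MONTH / revenue_per_stream)

print(f"tokens per month per stream: {tokens_per_month:,}")
print(f"revenue per stream: ${revenue_per_stream:,.2f} per month")
print(f"streams needed to cover rent: {streams_to_break_even}")
```

A single always-busy stream brings in a bit under $120 a month, so the whole argument hinges on how many concurrent sessions the rented hardware can serve.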

by ai_fry_ur_brain

5/19/2026 at 8:36:36 PM

Yes it is more efficient in $/tok to run at scale than to run just for yourself. Everyone selling Deepseek V4 inference is selling an undifferentiated good. They have run the numbers on how much it costs and are competing against a dozen other outfits also selling undifferentiated open weights tokens. Whatever the dollar cost they face to rent those GPUs will be what they are able to charge in the competitive market. That is great for you and me because we can buy tokens at pretty much exactly what it costs to produce them.

by ls612

5/20/2026 at 12:22:30 PM

They are selling it below costs and training on your tool calling, and potentially all your data. They're selling it for cheap to get your data dumbass.

by ai_fry_ur_brain

5/20/2026 at 6:25:58 PM

Whoever purchased their RAM last month vs this month has the advantage, I suspect.

by drob518

5/19/2026 at 10:28:50 PM

> Please go run some numbers.

- DeepSeek serves DeepSeek V4 Pro at 27 tps: https://openrouter.ai/deepseek/deepseek-v4-pro

- At 27 tps per user, a B300 GPU will give you around 800 tokens per second (serving 30 users): https://developer-blogs.nvidia.com/wp-content/uploads/2026/0...

- That's 800 * 60 * 60 generated tokens per hour, at a cost of $0.87 per 1M tokens, or $2.50 per hour.

- For input and output tokens, the math is a bit more complicated because we have to make assumptions about their ratio. Using the published values from OpenCode, we get another $2.50 for cached tokens (which are almost free for DeepSeek) and another $3.40 for input tokens (which are a lot cheaper to compute than output tokens), which gives us a total of $8.50 per hour per B300 GPU.

- B300 GPUs can be rented for as low as $3.40 per hour, which is less than $8.50, so hosting DeepSeek V4 Pro is profitable.

You could also host it at fewer tps per user to raise the efficiency and therefore the profit even higher.
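
The arithmetic in the list above can be sketched as follows. All figures are the ones quoted there; the variable names are illustrative, and the cached and input revenue numbers are taken as given rather than derived.

```python
# Figures quoted in the list above.
TOKENS_PER_SEC_PER_GPU = 800     # aggregate output throughput at ~27 tps per user
OUTPUT_PRICE_PER_MTOK = 0.87     # USD per 1M output tokens
CACHED_REVENUE_PER_HOUR = 2.50   # USD, estimate quoted above
INPUT_REVENUE_PER_HOUR = 3.40    # USD, estimate quoted above
GPU_RENTAL_PER_HOUR = 3.40       # USD, lowest quoted B300 rental price

output_tokens_per_hour = TOKENS_PER_SEC_PER_GPU * 60 * 60
output_revenue_per_hour = output_tokens_per_hour / 1_000_000 * OUTPUT_PRICE_PER_MTOK

total_revenue_per_hour = (
    output_revenue_per_hour + CACHED_REVENUE_PER_HOUR + INPUT_REVENUE_PER_HOUR
)
margin_per_hour = total_revenue_per_hour - GPU_RENTAL_PER_HOUR

print(f"output tokens per hour: {output_tokens_per_hour:,}")
print(f"output revenue: ${output_revenue_per_hour:.2f}/hour")
print(f"total revenue:  ${total_revenue_per_hour:.2f}/hour")
print(f"margin over rental: ${margin_per_hour:.2f}/hour")
```

Summing the three quoted components gives roughly $8.40 per hour, slightly under the $8.50 stated above (presumably rounding somewhere in the inputs). The conclusion does not change either way: revenue per GPU-hour comes out well above the rental price.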

by gpugreg

5/19/2026 at 10:49:53 PM

Even not assuming Blackwell inference the $3.50/hr price is likely close to the marginal cost. The Deepseek R1 model is a little more than a third of the size of V4 and cost around $1/Mtok to serve at scale based on deepseek's blogs last year and Hopper rental prices.

by ls612

5/19/2026 at 8:01:30 PM

Unlike other providers, Deepseek does promise that they will lower the price when their Huawei cards arrive in a few more months.

by npn

5/19/2026 at 9:33:31 PM

Give me a link. Cannot wait. One PSA is that they have 75% discount right now so it is already cheaper than the full price.

by flakiness

5/19/2026 at 9:56:36 PM

Weird, last time I checked it was right on the pricing page.

But even when it happens I doubt it would be as cheap as it is right now. Enjoy it while it lasts!

by npn

5/20/2026 at 4:45:42 AM

Actually, deepseek v4 was 1/3 promotional price for the first month or so. This was pretty clearly communicated. The promotions window just ended is all.

by barrell

5/20/2026 at 10:18:13 AM

thus proving OP's point

by greenchair

5/20/2026 at 11:11:11 AM

If you run out of 50% coupons to your local pizza joint, did they double their prices? Does every company double or triple their prices after Black Friday?

There’s a pretty significant difference between saying someone tripled their prices, and a temporary promotion ended. It’s even more so the case if someone is using it as an example for raising prices as a trend.

I’m 100% in the camp that prices are going up and quality is going down; companies are retiring models and requiring you to use more expensive ones. This has happened to me and there are dozens of examples that one can point to.

But a promotion ending is a strawman argument and does the point a disservice.

by barrell

5/20/2026 at 11:36:52 AM

Essentially yes. Perpetual "discounts" are common in some industries, like fast fashion, so you could consider that the normal price.

by breezybottom

5/20/2026 at 12:58:42 PM

> If you run out of 50% coupons to your local pizza joint, did they double their prices?

Yes. Did they double their msrp? no. They did double their effective price relative to me which is all that matters unless you're doing economic math or something.

by wraptile

5/20/2026 at 1:39:23 PM

The original comment was used as proof of a trend that vendors are raising prices. Would running out of coupons indicate a trend in rising pizza prices?

by barrell

5/20/2026 at 7:32:08 PM

It depends whether it's me personally who's running out of coupons or the entire supply of coupons is being reduced. If my ability to get the product for the same price is diminished then the price is being effectively raised.

In this case I'd agree that pricing is effectively raised as 10$ > 10$ - 50%, there's no need to complicate it. However this is not even the right metric for this problem, a better one would be total spent / work produced. If all customers spend more money for the same amount of work (adjusted to progress) then clearly the price is increasing. This would be true in this example as well.

by wraptile

5/19/2026 at 9:23:49 PM

V4-Pro is about 2.4× total params and 1.3× active params of V3.2.

by zaptrem

5/19/2026 at 10:25:54 PM

You're typing as your handwriting and letter sending abilities deteriorate to dust. Writing down information as your memory capacity decays. Remembering instead of living at the pure leading edge of perception dulling your reactions.

Smh, it's all downhill from the first unadulterated neuron.

by creationcomplex

5/19/2026 at 8:06:19 PM

Mate why are you so mad at people upset the price tripled? It's a fair complaint that people built services using the cheaper ones with the expectation future models would be similarly priced. You can avoid 'offloading thinking' while still building on top of these models

by dpoloncsak

5/20/2026 at 8:56:03 AM

> It's a fair complaint that people built services using the cheaper ones with the expectation future models would be similarly priced

Everyone could see this coming from miles away, everyone warned that this would happen again and again and again, and it always got dismissed.

by kuschku

5/20/2026 at 12:56:06 PM

I still think it's reasonable to foresee this possibility and be upset when it comes to fruition

by dpoloncsak

5/19/2026 at 8:02:31 PM

I think demand is too great and compute is not enough. Nothing to do with billionaires colluding to increase prices by 3x.

by aurareturn

5/19/2026 at 10:29:47 PM

Actually, why should Google collude on pricing? They have deep pockets and could starve out the competition while keeping prices low, if they really wanted.

I think it is priced high because it's basically their smartest model as well as their fastest, so why shouldn't they?

You can still use earlier generations of Flash at a lower cost if you want "fast and cheap and just OK," which often makes sense. (Just checked)

I would predict they will lower this price when 3.5 High appears, but perhaps not all the way.

by boutell

5/19/2026 at 8:31:57 PM

What we need is a DeepSeek moment in hardware, i.e. China reaching parity on node size. That is the only way the latest computers, let alone the latest AI, will be available to us in the future; otherwise the profit margins will push most production to AI.

by xbmcuser

5/19/2026 at 8:43:58 PM

To be honest, China not having access to the latest hardware is exactly what has driven LLM technology forward the last 2 years.

by throwa356262

5/19/2026 at 8:59:05 PM

Why?

by humanfromearth9

5/19/2026 at 9:06:10 PM

Because it forced them to focus on efficiency, instead of throwing more compute at the problem.

Just like in software, some of the most beautiful solutions come from constraints. Think of the optimisations that game developers implemented because of the frame budget.

by Weryj

5/20/2026 at 1:32:19 AM

On top of that, China is also facing hardware constraints, which is pushing companies to develop better domestic chips for AI training. It'll be interesting to see how things perform once Huawei's newer hardware is fully deployed at DeepSeek.

by Viacol

5/20/2026 at 3:46:25 AM

Open Source ASML EUV. But will wipe off trillions from US stocks so 401k may not like that.

by blackoil

5/20/2026 at 10:51:03 AM

Can you run a coal power plant in your backyard? Or a giant solar power farm?

Of course not

And you don't need to

by Bombthecat

5/19/2026 at 7:43:46 PM

You can use lots of open weight models today.

by segmondy

5/19/2026 at 8:05:39 PM

That's one solution to the problem. But it still needs some good computational capabilities. Either we optimize the hell out of those models, or we wait for the hardware to become good enough for them.

by hei-lima

5/19/2026 at 10:45:41 PM

The real problem is the hardware to run them is still very expensive.

by Gigachad

5/19/2026 at 9:10:40 PM

Maybe we can figure out better ways to use the models that can run on cheap hardware.

by pianopatrick

5/19/2026 at 7:53:00 PM

gemini isn't even that good. just tested 3.5 on usual complex prompts to opus/chat 5.5. meh

by GeorgeOldfield

5/19/2026 at 8:19:00 PM

Are you really comparing flash to opus? Shouldn't you be comparing pro?

by k8sToGo

5/19/2026 at 8:45:36 PM

The benchmark tables in the Google announcement include Opus 4.7, and the numbers are very impressive. Caveat emptor, but it's not unreasonable to compare a new Flash to a current-gen Opus, even if some of the results confirm expectations

by CognitiveLens

5/19/2026 at 8:52:12 PM

Who would have guessed that something costing roughly a third as much wouldn't do as well at certain tasks.

by bachmeier

5/19/2026 at 8:31:39 PM

Well, the first impression is that Gemini still goes off the instruction rails easier than other models, but I noticed that it tends to go back to the initial goal without holding a hand, which is a real improvement. It's really interesting that these models behave so differently.

by kmac_

5/20/2026 at 10:26:51 AM

> Interesting pricing direction.

Is it? More capability, more demand, higher price. Seems relatively uninteresting. The naming structure complicates it: 3.5 Flash is less comparable to 3.0 Flash than it is to 3.0 Pro.

More generally, $/token + naming scheme comparisons are just confusing: I am not looking for a wordy idiot and I doubt most people are (at least not with what I would consider worthwhile business ambitions). In fact wordy idiots are fairly costly, because we have to consider the large amounts of cheap garbage that they are producing, and if you price your own time somewhat competitively then fairly quickly that's the bigger lever.

Even if we don't consider the last part: How do we price the better model, that can one-shot a task without having to go back and forth and spending more tokens or having to fix more bugs later? It is definitely worth something and I think it's quite undervalued right now. What seems to be missing is a better measurement of capability per token. I don't know what that could look like. Maybe something like how we try and measure inflation, some basket of tasks (which then ends up being part of the training data so idk).
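
One way such a basket metric could be sketched is cost per solved task rather than cost per token. Everything below is an assumption for illustration: the `TaskRun` record, the `cost_per_solved_task` helper, and the token counts are made up; only the two price pairs ($0.25/$1.50 and $1.50/$9.00 per million input/output tokens) are figures quoted elsewhere in this thread.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    """One attempt at a basket task (hypothetical record, not a real API)."""
    input_tokens: int
    output_tokens: int
    solved: bool


def cost_per_solved_task(runs, in_price, out_price):
    """Total spend in USD divided by the number of solved tasks.

    Prices are USD per million tokens. Returns infinity if nothing was solved,
    since no amount of spend bought a finished task.
    """
    total_usd = sum(
        r.input_tokens * in_price + r.output_tokens * out_price for r in runs
    ) / 1_000_000
    solved = sum(1 for r in runs if r.solved)
    return total_usd / solved if solved else float("inf")


# Made-up basket: the cheap model retries more and solves fewer tasks.
cheap_runs = [
    TaskRun(40_000, 12_000, False),
    TaskRun(40_000, 15_000, True),
    TaskRun(40_000, 14_000, False),
]
pricey_runs = [
    TaskRun(20_000, 6_000, True),
    TaskRun(20_000, 7_000, True),
    TaskRun(20_000, 5_000, True),
]

print("cheap :", cost_per_solved_task(cheap_runs, 0.25, 1.50))
print("pricey:", cost_per_solved_task(pricey_runs, 1.50, 9.00))
```

Which model wins depends entirely on the basket and the success rates, which is the point: $/token alone cannot answer it, and the basket leaking into training data remains an open problem.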

by jstummbillig

5/19/2026 at 7:12:07 PM

3.1 flash lite — $0.25/$1.50 — plus insanely fast.

3.1 flash lite isn’t quite as good as 3 flash preview (which is the most incredible cheap model… I really love it) — but 3.1 is half the price and the insane speed opens up different use cases.

For comparison, Opus models are $5/$25

by dr_dshiv

5/19/2026 at 8:02:22 PM

Opus 4.7 is smarter than even Gemini 3.1 Pro on nearly every metric, though. You're comparing apples to oranges. Gemini 3.1 Flash is somewhere in the neighborhood between current Haiku and Sonnet, I think? Still a better value than the Anthropic models, I guess, which are quite pricey.

Since Gemini 3.5 Flash is raising the price to $1.50/$9.00, it's priced between Haiku and Sonnet. If it outperforms Sonnet, it remains a good value, I guess. Though DeepSeek V4 Flash is much cheaper than all of them, and seemingly competitive.

by SwellJoe

5/20/2026 at 5:26:54 AM

Definitely apples to oranges, sorry I wasn’t clear. I only included opus pricing for comparison—it is vastly superior. But even 3.1 flash lite is really useful.

Of course, if I manage to reach my limits every week on my Claude $200 sub, opus 4.7 is probably priced closer to flash!

by dr_dshiv

5/19/2026 at 9:37:10 PM

>Opus 4.7 is smarter than even Gemini 3.1 Pro on nearly every metric,

Outside of coding, claude models are pretty meh. GPT and Gemini are the workhorses of science/math/finance.

by WarmWash

5/19/2026 at 11:03:02 PM

Not in my fields of science: Genetics and neuroscience. The combination of Opus 4.7 Adaptive used with well-structured project folders is amazingly useful.

by robwwilliams

5/19/2026 at 11:17:29 PM

And even on coding, they are mostly good at generating new code.

They sure are not at thorough analysis or debugging, etc.

by epolanski

5/19/2026 at 8:54:23 PM

To be fair, Gemini 3.1 flash _lite_ supports structured output (guaranteed json), it’s super fast, runs circles around 2.5 flash and costs $0.25/$1.50.

I use it _a lot_ and it’s very capable if you just plan correctly. I actually almost exclusively use 3.1 flash lite and 2.5 flash lite (even cheaper) and we have 99.5% accuracy in what we do.

That said, I think we’ll see the lite/flash models and the pro models diverge more price-wise. The pro models will become more and more expensive.

by OakNinja

5/20/2026 at 6:35:12 PM

I think that’s true on divergence. Basically, the only moat is living at the frontier, and even that is only temporary. At some point, the frontier advances such that 99% of tasks can use something short of a frontier model and only a very few tasks actually demand frontier performance.

by drob518

5/19/2026 at 7:50:45 PM

It might be temporary pricing given that 3.5 Flash is actually superior to the existing 3.1 Pro in almost all regards, so they're in a bit of a lurch as 3.1 Pro really doesn't make sense given that 3.5 Pro has been delayed a bit.

by llm_nerd

5/20/2026 at 1:49:24 PM

I let it loose on an F# codebase that I know was pretty optimized but with a few low-hanging-fruit changes that would have a big impact.

3.1 Pro did NOT find them. 3.5 flash did. Plus one I hadn't thought of that may or may not work (which it also pointed out).

I'm pretty impressed.

by bjoli

5/19/2026 at 8:07:36 PM

Their rationale might be that its size and intelligence are growing relative to the market.

Fwiw it’s beating Claude Sonnet in most benchmarking (benchmaxxing?), yet they’ve priced it almost half off on a per token basis.

Question is are you going to persuade anyone with this argument?

Are there many devs at Google who legit prefer Gemini over Claude and Codex? Would love to hear about that.

by WhitneyLand

5/19/2026 at 8:50:42 PM

> Are there many devs at Google who legit prefer Gemini over Claude and Codex? Would love to hear about that.

A few weeks ago, Steve Yegge claimed he'd heard that Google employees are banned from using Claude & Codex.

https://x.com/Steve_Yegge/status/2046260541912707471

A number of Googlers replied to say that was totally false, including Demis Hassabis, but they were all on the DeepMind team.

https://x.com/demishassabis/status/2043867486320222333

This person here claims they left Google because of the ban, and because the ban applied outside of Google work as well:

https://x.com/mihaimaruseac/status/2046272726881693960

by SyneRyder

5/20/2026 at 2:16:44 AM

> and because the ban applied outside of Google work as well

I think false (or hasn't filtered to everyone lol)

by myko

5/19/2026 at 8:35:37 PM

At the same time, it is supposedly Gemini 3.1 Pro level at 3/4 the price

and far cheaper than comparable models, Gemini Pro is cheaper than Claude Sonnet (Anthropic still gets to charge a brand premium)

by verdverm

5/20/2026 at 5:55:04 AM

I use Gemini for heavy web scraping-adjacent API work. Web grounding has been super useful for the project.

I will definitely not be updating to this new model, and I think once 2.5 Flash is deprecated I'll have to re-architect so Gemini is only used for web grounding requests. This is an insane price increase.

by davedx

5/19/2026 at 7:06:52 PM

I don't think they're really comparable. Seems they created the Flash-Lite tier to take the spot of the old Flash models.

by dbbk

5/19/2026 at 7:12:10 PM

No, 2.5 had both flash and flash lite.

by GodelNumbering

5/19/2026 at 8:21:37 PM

It is Google, after all ....

by mlmonkey

5/19/2026 at 7:40:20 PM

In general, Gemini Flash is still relatively cheap compared to the "mini" versions of the other big 2. However, I agree that newer versions seem to come with multi-X price increases (similar to the new ChatGPT) and we certainly need competition from the open source models to keep these guys in check with pricing.

by photonair

5/20/2026 at 8:03:00 AM

If you look at the benchmark, the model is not particularly good at coding, and as you point out it costs 3x the price of the previous flash models. So what is the market for it?

I think that they might have reached the latency sweet spot where voice applications become more natural. Natural speech is <100 tokens per second (after STT), so $9 for a million tokens buys roughly 3 hours of speech. That's totally competitive compared to human costs.
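
The arithmetic behind that figure, as a rough sketch. The 100 tokens per second ceiling and the $9 per million tokens come from the estimate above; billing every token at the output rate and ignoring input cost are simplifying assumptions, and the function names are just for illustration.

```python
# Rough check of the "about 3 hours of speech per million tokens" estimate.
# Assumptions: speech uses at most 100 tokens/s, and every token is billed
# at the $9 per million output-token rate (input cost is ignored).
TOKENS_PER_SECOND = 100
PRICE_PER_MILLION_USD = 9.00


def speech_hours(tokens, tokens_per_second=TOKENS_PER_SECOND):
    """Hours of continuous speech that a token budget covers."""
    return tokens / tokens_per_second / 3600


def cost_per_hour(tokens_per_second=TOKENS_PER_SECOND,
                  price_per_million=PRICE_PER_MILLION_USD):
    """USD per hour of continuous speech at the given token rate."""
    tokens_per_hour = tokens_per_second * 3600
    return tokens_per_hour / 1_000_000 * price_per_million


print(f"{speech_hours(1_000_000):.2f} hours per million tokens")  # 2.78
print(f"${cost_per_hour():.2f} per hour of speech")               # $3.24
```

Since 100 tokens per second is an upper bound, real speech would stretch the same budget further.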

by harrouet

5/19/2026 at 7:25:18 PM

Gen AI is unprofitable, especially at the insanely cheap rates they've been offering to get people in the door. So expect more increases in the future.

by LetsGetTechnicl

5/19/2026 at 8:08:08 PM

These companies are unprofitable (as all companies at this stage and ambition should be) but I increasingly don't see any justification for the idea that it is fundamentally unprofitable.

Inference alone is certainly profitable. I'm running models at home, for free, that are comparable in performance to paid models from a year or so ago. Even for much larger models the costs around inference serving are clearly manageable.

Training is where the costs are, but I'm increasingly convinced those too could have costs dramatically reduced if necessary. Chinese companies like Moonshot.ai are doing fantastic work training frontier models for a fraction of the cost we're seeing from Anthropic/OpenAI.

This isn't like Uber or Doordash where the economics fundamentally don't make sense (referring to the early days of these services where rates were very cheap).

It's a compelling story that "current AI is unsustainable", but it doesn't pan out in practice for a multitude of reasons (not the least of which is that we can always fall back to what models did last year for basically free).

by roadside_picnic

5/19/2026 at 10:38:15 PM

Arguably nothing even has to change with training for this to be sustainable. Dario has claimed that Anthropic is profitable on a per training run basis. They aren't profitable because they choose to keep investing in increasingly large training runs.

by overrun11

5/20/2026 at 1:05:55 AM

Cut the crap.

Free cash flow to the firm = EBIT(1-t) - Reinvestment, and the value of the firm's operating assets is the present value of those cash flows.

You (Anthropic) want that sky-high valuation? Accept reinvestment is part of the equation.

If they decide to stop reinvesting, then they are as good as dead.

Moreover, they clearly are not re-investing cash flows from operations. Why do you think they are continually raising money? Lmao.

by dsdsfaa

5/20/2026 at 1:56:37 PM

I'm not sure I understand your argument. If you want an exponentially more expensive training run for each iteration, obviously you need to raise investments even if each training run is profitable. Now I'm not saying that's a good idea, or makes sense, but I am saying that "raising money" disproves neither that each training run makes money, nor that they're re-investing all that money in the next run.

To give a simple example: if each run simply makes a 10% ROI, but you want to spend 2x as much money on the next run, you still need to raise 90% of the previous run's expenses to have enough capital.
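
Spelled out as a small sketch (the function name and the 100-unit run cost are arbitrary, chosen only to make the percentages easy to read):

```python
def capital_to_raise(prev_cost, roi, growth):
    """External capital needed for the next run.

    prev_cost: cost of the previous training run
    roi:       return on that run (0.10 means it earned back 110% of its cost)
    growth:    how many times more expensive the next run is
    """
    proceeds = prev_cost * (1 + roi)   # what the previous run brought in
    next_cost = prev_cost * growth     # what the next run will cost
    return max(0.0, next_cost - proceeds)


prev_cost = 100.0
shortfall = capital_to_raise(prev_cost, roi=0.10, growth=2.0)
print(f"{shortfall / prev_cost:.0%} of the previous run's cost")  # 90%
```

With these numbers the run is profitable and every cent is reinvested, yet outside money is still required; the shortfall only disappears once the ROI covers the growth factor (here, 100% ROI for a 2x run).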

by tskj

5/19/2026 at 8:43:33 PM

And if you can run those strong models at home for free, why would hosting them be a successful business for any of these providers?

Profitable maybe, in terms of having low costs, but why pay Google or whoever when you can do it yourself for cheaper/"free"?

by ReliantGuyZ

5/20/2026 at 5:48:09 AM

For free == with a huge upfront cost of getting a good enough box and running costs of maintaining it and just keeping it powered. By the time it pays off the frontier labs are three generations ahead at least.

Compare with on-demand billing per token and it just doesn’t make sense to own the hardware if you aren’t using it productively or renting it out for 95% of the time.

by baq

5/19/2026 at 9:29:26 PM

If you can run your server at home for free why would hosting it be a successful business for any of these providers?

by HDThoreaun

5/19/2026 at 8:46:17 PM

If it's profitable, why haven't they reported any profits? People like Ed Zitron have done the math and it just doesn't add up. I mean he just published this piece today: https://www.wheresyoured.at/ai-is-too-expensive/

by LetsGetTechnicl

5/19/2026 at 8:56:14 PM

Amazon was unprofitable for over a decade, and they were public. There's no incentive to be profitable as a private company if you can continue to raise money.

Ed Zitron and Gary Marcus are... confused.

by anthonypasq

5/19/2026 at 10:25:50 PM

> Amazon was unprofitable for over a decade, and they were public.

Amazon was unprofitable because they poured their revenue into growth. On paper, they were in the red, but everyone - especially investors - saw what was going to happen, given their trajectory.

Is it the case that any of these AI companies are actually making a ton of money and growing accordingly? AFAICT, we've just got [a] big players like Google that can subsidize AI in the hopes of waiting everyone else out and [b] private companies raising capital in the hopes that when the market returns to rationality, they may be solvent.

by mynameisash

5/20/2026 at 6:39:37 AM

> On paper, they were in the red, but everyone - especially investors - saw what was going to happen, given their trajectory.

As I recall, no, Wall Street and public shareholders were getting pretty antsy over AMZN earnings, which is why Bezos famously said "We are willing to be misunderstood for long periods of time."

The same thing is playing out today: insiders and early investors (presumably privy to information we don't have) see the trajectory of the frontier AI labs, but Wall Street and public shareholders see only the losses. This is why at every earnings report the hyperscalers simultaneously 1) post record revenues and earnings, 2) announce even greater CapEx spending and AI investments, and hence 3) get punished by the stock market.

Clearly all the AI players are willing to be misunderstood for long periods of time.

by keeda

5/19/2026 at 10:43:01 PM

Yes that is exactly what is happening. OpenAI and Anthropic are the fastest growing companies by revenue ever and their gross profit margins are healthy.

by overrun11

5/19/2026 at 10:54:08 PM

According to this article[0]:

> HSBC Global Investment Research projects that OpenAI still won’t be profitable by 2030, even though its consumer base will grow by that point to comprise some 44% of the world’s adult population (up from 10% in 2025). Beyond that, it will need at least another $207 billion of compute to keep up with its growth plans.

This article is from six months ago. Was HSBC wrong? Did something dramatically change in the last six months? Is OpenAI not, in fact, profitable? Or are they in fact doing well but making a huge investment (as was the case with Amazon 25ish years ago)?

I genuinely do not know, but my impression is that they're burning investment capital trying to compete with others' investment capital and Google's bottomless pockets.

[0] https://fortune.com/2025/11/26/is-openai-profitable-forecast...

by mynameisash

5/20/2026 at 2:12:57 AM

Also OpenAI somehow having 44% of the world’s population as its customer base is a plainly absurd goal and will never happen, not in 5 years

by LetsGetTechnicl

5/20/2026 at 1:01:36 AM

and to make matters worse, they are massively over-valued.

Whoever buys the stock at a richly priced 1tn at ipo is a bozo lmao. I know I know, index funds will be forced to hold it bypassing the 1 year rule. Disaster already.

by dsdsfaa

5/20/2026 at 2:12:04 AM

Then why do they constantly need more and more funding from VC and Google and MS and NVIDIA? Why is it all circular dealing? Why aren’t there smaller AI startups running these smaller, “profitable” models?

by LetsGetTechnicl

5/19/2026 at 10:07:39 PM

But I've been told here -- over and over again -- that the cost of inference was going to go down as the technology matured.

The trend lines are going in the opposite direction.

by timmytokyo

5/20/2026 at 5:17:35 PM

Prices are only marginally determined by the cost to produce the product. Just because they are raising prices doesn't mean it's actually getting more expensive for them to serve the models; it just means we are willing to pay for the intelligence.

by anthonypasq

5/20/2026 at 10:00:19 AM

Zitron thinks capex is a liability that needs to be paid off in a year instead of a long-standing asset.

Similarly he thinks that an investment into an AI startup is also a loan that the startup needs to pay back out of their own revenue, instead of a share of a company that will IPO at a higher valuation.

Basically his doomerism is a byproduct of financial illiteracy.

by cloakandswagger

5/19/2026 at 9:07:40 PM

His entire brand is that the AI bubble will burst. By his account it was supposed to have burst several times by now. Like the doomers, it's not if, it's when, and they have to keep pushing back their predictions. Funny how both camps can be so confident. Alas, that's how they get eyes, ears and dollars.

That's not to say they will be or are wrong, it's just that they aren't exactly unbiased, or humble, sources.

by goosejuice

5/19/2026 at 8:47:20 PM

Yeah, at this point I think the worst-case scenario for OpenAI/Anthropic/etc is to slow down frontier model development and focus on tooling and services, as opposed to imploding completely and bursting the economic bubble. I hope?

by booty

5/19/2026 at 7:36:39 PM

If you don't need SOTA or near SOTA there are plenty of dirt cheap models, just look at Gemma 4 31B on Openrouter.

by GaggiX

5/19/2026 at 10:47:12 PM

For all of the use cases being hyped you really do, and you actually need something much better than the SOTA models to do what we are being told can be done.

The small models are useful for small things like summarizing text or search but not much else.

by Gigachad

5/20/2026 at 2:18:51 AM

Yeah a lot of AI hype is look at the amazing new thing our new model can do! Like Google at this event. But when pressed about its pricing reality the answer is “use a worse cheaper model”?? Real convincing argument there

by LetsGetTechnicl

5/20/2026 at 1:46:48 PM

If you don't want to spend 1.5$/9$ for the latest model then yes, use a cheaper model. DeepSeek V4 Flash is 0.11$/0.22$ on OpenRouter and it's more capable than the most expensive model a year ago. Models have never been so cheap given their capabilities unless you want to follow the SOTA (where the hype is).

by GaggiX

5/20/2026 at 5:35:31 AM

You mean Kimi or qwen

by scrollop

5/19/2026 at 8:01:10 PM

[flagged]

by ai_fry_ur_brain

5/19/2026 at 8:05:05 PM

It is insanely profitable though, if you cut out r&d cost, plus the marketing and loss leaders. Don't let them gaslight you.

Even Anthropic, which does not own any hardware, still has a big margin providing Claude models.

by npn

5/19/2026 at 8:47:11 PM

Then why haven't they reported any profits using GAAP (generally accepted accounting principles)? They all use ARR which is easily gamed.

by LetsGetTechnicl

5/19/2026 at 10:48:29 PM

They aren't profitable on a GAAP basis and no one claims this. This obsession over profits is misguided. These are hyper growth companies growing at a scale never seen before. It is both deliberate and uncontroversial to invest in growth rather than slowing down to produce profits.

by overrun11

5/20/2026 at 2:47:35 AM

If my retirement money is going to end up invested in these companies, either directly when they IPO or indirectly through compute providers, then I would like to see some proof that they are capable of producing profits. "Trust me bro" just ain't gonna cut it.

by chillfox

5/19/2026 at 9:25:22 PM

I'm not really sure, but it might be that they count hardware purchases as a loss, too.

Google has just recently upgraded their TPUs.

by npn

5/19/2026 at 10:09:01 PM

Everything is insanely profitable if you ignore the costs.

by timmytokyo

5/20/2026 at 2:55:02 AM

The premise is that if they stop training new models, it will become pure profit after 2 years, once the hardware has finished paying for itself.

It's pretty funny that everyone says this business is unsustainable, but I have yet to see anyone go bankrupt, even the pure hardware providers who are renting out A100s and B200s.

by npn

5/20/2026 at 3:12:41 AM

And AI investors and stock market boosters are just going to accept OpenAI not having anything "new" to show for all their investments? What about replacing hardware once it's been burned out from constant high usage? Is it not odd to you that so many big AI deals get announced and are never heard from again? What's the business reason for neoclouds buying GPUs from NVIDIA only for NVIDIA to then pay them to rent them back? How does this make any sense?

by LetsGetTechnicl

5/19/2026 at 11:20:41 PM

They immediately undercut their argument to the point that I'm not sure if they were being sarcastic.

by operatingthetan

5/20/2026 at 12:24:27 AM

[dead]

by Rekindle8090

5/19/2026 at 7:22:49 PM

Yeah, it is a massive jump in price, hardly a "Flash" model anymore... I wonder if they'll release flash lite or something with a bit more affordable price point.

by ilia-a

5/19/2026 at 9:13:51 PM

There’s already a flash lite tier since 2.5. Latest is 3.1 currently.

by OakNinja

5/19/2026 at 7:39:25 PM

And they are using this to power search answers?

by irthomasthomas

5/19/2026 at 8:14:45 PM

I bet the API pricing helps pay for search users

by CooCooCaCha

5/20/2026 at 2:15:17 AM

To me this is almost like a tone-deaf naming change.

Empty Slot (new Pro as Mythos competitor?)

Old Pro -> now Flash

Old Flash -> now Flash Lite

Old Flash Lite -> now Gemma (and not served by Google)

I say "almost" because the situation is more fluid and unstable than a normal naming change. If Apple were to do this with laptops, maybe it'd be like, Air gets better and pricier and becomes Pro-level model, Neo same way becomes Air-level model, etc. But Apple's too design oriented to do something like that. Google, well...

This change has made me decide to move to a multi-provider situation like through OpenRouter for consumer-facing LLM api in a service I'm building. I just can't trust Google to not constantly rearrange everything under our feet. Doesn't mean I won't use Gemini, but it clearly means I need to have others in the mix ready to go. In fact I used to use lots of Flash Lite, which is now Gemma territory, and I can't get that served by Google anymore and don't want to run my own hardware.

But in any case, I'd compare this "Flash" model with previous "Pro" on all metrics. It's kinda like if in clothes a Small suddenly became what was a Large, or at Starbucks a Grande became the new de facto Venti. And only for the new! drinks.

And if we think this way, it's possible that prices are actually falling?

by malloryerik

5/20/2026 at 5:59:22 AM

Demis is on record saying they need small models on edge devices and if it’s on the edge the weights may as well be public officially.

by baq

5/20/2026 at 9:26:37 AM

don't forget Gemini 2.0 flash at $0.10/$0.40

by ashirviskas

5/19/2026 at 7:51:14 PM

That's a lot. DeepSeek v4 Flash is just over a tenth the price, and DeepSeek v4 Pro is roughly the same price (currently heavily discounted, but will be $1.74).

I mean, the benchmarks for Gemini 3.5 Flash are very strong, but at those prices it has to be. I guess the time of subsidized tokens from the big guys is slowly coming to an end.

by SwellJoe

5/19/2026 at 10:53:23 PM

They have said AI will be priced like a utility, meaning $100-300 per month or so.

by copperx

5/20/2026 at 1:41:49 AM

I use Gemini models in Junie daily. When I need accuracy I switch to Gemini 3.1 Pro Preview (why is it still in preview?), but it burns thru credits, leaving me topping up $5 every day. 3.1 Flash Lite is just not accurate enough. 3 Flash is the sweet spot, just as JetBrains suggests it is.

Maybe I'll look at Opus again, but it was just slower, much more expensive and, worst of all, wasn't listening to your instructions.

by dzhiurgis

5/19/2026 at 8:23:36 PM

just subscribe to the plan, cheaper

by m3kw9

5/19/2026 at 8:40:55 PM

Gemini 2.5 flash was the best Gemini model.

Not the most intelligent but perfect balance of cheap, fast and not-too-dumb.

by throwa356262

5/20/2026 at 5:44:40 AM

The 09-2025 preview was awesome.

by npn

5/19/2026 at 6:18:56 PM

> Create animated SVG of a frog on a boat rowing through jungle river. Single page self contained HTML page with SVG

3.5 Flash: Thinking Medium - 7516 tokens

https://gistpreview.github.io/?5c9858fd2057e678b55d563d9bff0...

3.5 Flash: Thinking High - 7280 tokens

https://gistpreview.github.io/?1cab3d70064349d08cf5952cdc165...

3.1 Pro - 28,258 tokens

https://gistpreview.github.io/?6bf3da2f80487608b9525bce53018...

3.1 took 3 minutes of thinking to generate, but it is the only one that got animated movement.

by SXX

5/19/2026 at 6:41:26 PM

Gemini 3.1 Flash Lite Thinking High - 2,526 tokens:

https://gistpreview.github.io/?3496285c5dac5ba10ebbc0b201a1a...

Gemini 2.5 Pro - 5,325 tokens:

https://gistpreview.github.io/?cc5e0fefeaaffecd228c16c95e736...

Gemini 2.5 Flash - 7,556 tokens:

https://gistpreview.github.io/?263d6058fe526a62b8f270f0620ec...

Gemma 4 31B IT - 3,261 tokens via AI Studio:

https://gistpreview.github.io/?858a42b96af864859a3b89508619d...

Gemma 4 26B A4B IT - 4,034 tokens via AI Studio:

https://gistpreview.github.io/?4adb7703897e0c6b583f9de928e4a...

by SXX

5/20/2026 at 12:09:39 AM

I'm surprised that the "they must have trained for it" camp is not here saying that rubbish.

by segmondy

5/19/2026 at 7:10:29 PM

Opus 4.7

https://claude.ai/public/artifacts/128ebe5a-add7-406a-9bce-6...

by franze

5/19/2026 at 8:19:19 PM

Wow that's terrible. Any idea why?

by tasuki

5/19/2026 at 8:33:46 PM

Did you see the other ones? This is very good by comparison.

by lpa22

5/20/2026 at 12:07:22 PM

Ah, of course it's all subjective: I was pretty impressed with the Gemini ones. How can the frog move the oars the wrong way around, against each other??

by tasuki

5/19/2026 at 10:44:03 PM

Yeah, the oars being the wrong way around (inverted) is very distracting but the other elements appear quaint and "accurate".

by HDBaseT

5/20/2026 at 1:01:09 PM

My guess would be that this is just software that doesn't understand how the world works and is only trying to please? Idk, maybe I'm wrong

by doubleorseven

5/19/2026 at 10:50:00 PM

I think Anthropic optimizes less for visuals. Also, it’s not that terrible.

by stingraycharles

5/19/2026 at 6:42:36 PM

hesamation/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GGUF @ Q6_K

8112 tokens @ 52.97 TPS, 0.85s TTFT

https://gistpreview.github.io/?7bdefff99aca89d1bc12405323bd4...

Full session: https://gist.github.com/abtinf/7bdefff99aca89d1bc12405323bd4...

Generated with LM Studio on a Macbook Pro M2 Max

https://huggingface.co/hesamation/Qwen3.6-35B-A3B-Claude-4.6...

by abtinf

5/19/2026 at 7:14:25 PM

Well, honestly this is quite impressive compared to 3.1 Flash Lite and 2.5 Pro. Considering that 2.5 Pro is actually quite good at generating massive amounts of code one shot.

by SXX

5/19/2026 at 8:18:01 PM

It isn’t animated at all for me?

by svnt

5/19/2026 at 8:19:39 PM

It is animated, just with no movement, like in my 3.5 Flash examples. Maybe try a different browser, unless it's iOS.

by SXX

5/19/2026 at 7:49:43 PM

Here is GPT 5.5 High thinking; I had to add a second follow up prompt "it's not animated though" as the first one was not animated.

https://gistpreview.github.io/?557f979c82701862bc26d24f10399...

by vtail

5/19/2026 at 7:57:24 PM

Here is a GPT 5.5 Extra High with a modified instruction:

> Create animated SVG of a frog on a boat rowing through jungle river. Single page self contained HTML page with SVG. Use the Brave Browser to verifty that the image is indeed animated and looks like a proper rowing frog; iterate until you are satisfied with it.

It was able to discover and fix an animation bug, but the result is still far from perfect: https://gistpreview.github.io/?029df86d03bfe8f87df1e4d9ed2f6...

by vtail

5/19/2026 at 11:34:43 PM

Why is it fixated on the front perspective? Interesting choice though, because most humans (and seems like other LLMs too) would pick a side perspective

by hskalin

5/19/2026 at 6:29:40 PM

All three links animate for me.

by captn3m0

5/19/2026 at 6:34:23 PM

I think they mean the boat is moving. In the flash ones the paddles are animated but the boat is stationary for me.

by NitpickLawyer

5/19/2026 at 6:42:05 PM

The boat moves in all three for me

by codazoda

5/19/2026 at 6:48:10 PM

The boat itself rocks, but do you see the background changing to indicate the boat is progressing through the environment? I only see that in the 3.1 Pro example. I believe that's what the OP meant.

by Fishkins

5/19/2026 at 6:55:23 PM

I think this illustrates the problem with OP's prompt. If the goal is specifically to implement a scrolling background, this should have been in the prompt.

by Manuel_D

5/19/2026 at 7:16:55 PM

Yup. My bad. It was just the first idea that came to my mind, since I enjoy visually comparing each new release with unique prompts.

by SXX

5/20/2026 at 2:29:58 AM

It’s shocking how much better 3.1 is than 3.5 flash

The benchmarks used don’t really give a full story

by r0fl

5/19/2026 at 6:37:19 PM

Can you try with a more complex story such as "three little pigs"? I tried but it created a storybook instead of the SVG animation. I am looking to partially imitate Godogen [1][2] which is really great, even for animations.

[1] https://github.com/htdt/godogen

[2] https://drive.google.com/file/d/1ozZmWcSwieZQG0muYjbj7Xjhhlz...

by wslh

5/19/2026 at 8:15:21 PM

I think it's unreasonable to expect models to generate complex stories in a single prompt since they're trained to be concise, but I tried. This is the prompt, on top of the story, with a request for no control buttons:

   Now think, plan how to tell this story in a cartoon, make scene outline and then generate SVG animation story for "Three Little Pigs" in self contained HTML page. Just single animation no control buttons.

Full prompt in gist comments: https://gist.github.com/ArseniyShestakov/ed9faa53604035005ca...

Actual results for models, one shot:

Gemini 3.5 Flash - Three Little Pigs - 9,050 tokens:

https://gistpreview.github.io/?ed9faa53604035005cae86c63c766...

Gemini 3.1 Pro - Three Little Pigs - 24,272 tokens:

https://gistpreview.github.io/?f506bbfd9b4459c8cd55d89605af8...

Gemini 3 Flash - Three Little Pigs - 5,350 tokens:

https://gistpreview.github.io/?f58eff069cf916031c97d560b0e35...

Gemma 4 31B IT - Three Little Pigs - 5,494 tokens:

https://gistpreview.github.io/?a3aa75abbe8fd7818b73f6fa55ee6...

Gemma 4 26B A4B IT - Three Little Pigs - 6,375 tokens:

https://gistpreview.github.io/?1e631caebeb54f9f0cd6d0e3d4d5e...

by SXX

5/20/2026 at 12:03:09 PM

This was generated locally with Kimi https://gistpreview.github.io/?d55f07c22d54badc8042a7c8b3785...

by segmondy

5/20/2026 at 6:28:08 PM

What Kimi exactly? What version and quant?

by SXX

5/20/2026 at 8:08:45 PM

K2.6, Qx/Q4, it's huge and mostly runs off CPU/system ram. So slow

by segmondy

5/20/2026 at 3:16:20 AM

3.1 pro was pretty good among them. (iOS)

by no-name-here

5/19/2026 at 9:49:25 PM

Wow, Gemini 3.5 Flash surprised me there.

by ZeWaka

5/19/2026 at 8:07:17 PM

These are hilarious. 3.5 Flash Thinking High is the only one that is weirdly deformed (what is going on with the hat in 3.1 Pro??)

by krupan

5/19/2026 at 10:48:29 PM

3.5 Flash definitely got the synth wave vibe preference.

by stingraycharles

5/19/2026 at 6:20:38 PM

Your links are broken FYI.

by abi

5/19/2026 at 6:20:59 PM

They work for me.

by John7878781

5/19/2026 at 6:29:50 PM

They do work here too.

by TacticalCoder

5/20/2026 at 9:05:39 AM

Click on "Listen to article", make sure the voice is "Umbriel" and skip to 4:15 - there's a hallucinated part at the end in Russian (I think). On a blog post about the latest and greatest AI model. Oh the irony.

by lmazgon

5/20/2026 at 9:12:55 AM

Yep, it's Russian, but the whole Russian sentence doesn't make any sense, just jumbled words with no meaning at all :)

by Undrafted9624

5/20/2026 at 9:13:43 AM

But the voice and pauses sound so real, it's hard to say "it was AI"; it sounds like a real human

by Undrafted9624

5/20/2026 at 12:03:21 PM

A high-fidelity simulation of a Russian with damage to Broca's area, perhaps.

by FeteCommuniste

5/20/2026 at 11:09:30 AM

Thank you for this gem.

by luk4

5/20/2026 at 9:35:14 AM

I ran it through speech-to-text and it starts with something along the lines of "dear colleagues, just like a doctor tells a patient 'health can wait'...", after that it's nonsense.

I don't know if what the doctor said is some kind of idiomatic expression, but it appears to be the opposite of sound medical advice. :)

by Tade0

5/19/2026 at 7:56:51 PM

Am I really so old that when someone says "Flash" my immediate response is... "consider HTML5 instead" ??

by OhMeadhbh

5/19/2026 at 8:04:08 PM

Very little of what made the Flash culture so fun made its way into HTML5.

by nightski

5/19/2026 at 9:43:14 PM

I dunno, the tools are kind of there. Browsers have canvases and JavaScript and SVGs and sound. The communities are around; they're just kind of dispersed. There's no one website that is THE place for fun stuff. Instead, there are dozens, and most of them suck.

There's still fun stuff, though. I stumbled upon this bit of insanity just yesterday: https://tykenn.itch.io/trees-hate-you. It would have fit in fabulously with the old Flash sites.

by CobrastanJorji

5/19/2026 at 10:07:28 PM

Edit: looks like you linked something created with Unity?

Not sure, I'm not versed in game dev. So maybe my point about creation tools is moot.

However, 3D content always seems very samey to me, in a way that cartoons and regular animation don't. So the rest of my comment should still express what I mean.

---

Flash had a WYSIWYG editor aimed at media creators who treat programming at best as an afterthought.

Flash was mostly about ease of tweening and extremely flexible vector graphics engine combined with an intuitive creation tool.

So the "Flash vs HTML/JS/SVG/CSS..." debate is not just about technical capabilities of the medium.

Of course there are many fun web apps in the browser, or as native apps, too. But Flash attracted all kinds of slightly nerdy people with cultural things to say, not just web devs with a lot of free time.

What "HTML5"/browser web technology doesn't offer is this intuitive, visual creation pipeline, and this kind of speaks for itself!

Also, I think the Flash "creator's" age is not separable from its time: using Flash wasn't trivial either.

There were just more people with interesting ideas, free time, and a holistic talent for expressing their humor and ideas, combined with the curiosity and skill to learn using Flash (of course only as a licensed copy purchased from Macromedia).

People like this today are probably more often hyper-optimizing social media creators, and/or not terminally online.

In other words: I don't think the typical Newgrounds creator would have taken the time and effort to translate a stickman collage, meme, or other idea into a web app / animation.

---

And to add even more preaching: I think that "creating" things using AI produces exactly the opposite effect: feed it an original idea, and the result will be a regression to the mean.

by moritzwarhier

5/19/2026 at 10:50:45 PM

It's not quite the same but it seems the people who used to be publishing flash games are now making indie games on Steam. With modern dev tools and engines it's possible for one person to make what used to be a team effort before.

The whole "friendslop" genre is what replaced flash games.

by Gigachad

5/20/2026 at 9:20:51 AM

The issue is that Flash had everything you mention about fifteen to twenty years ago (if not more), along with better and more thorough tooling.

In the html5 camp the features appeared one by one and the tooling is still fragmented.

What happened between flash dying and html5 having a complete toolset is that interest died.

by znpy

5/19/2026 at 8:41:42 PM

[dead]

by sieabahlpark

5/19/2026 at 8:37:56 PM

The Flash designer was really nice. One thing the web kind of set back was all the RAD tools from the 90s and 2000s.

by goatlover

5/19/2026 at 9:09:17 PM

And there were some amazing RAD and prototyping tools in the 90s (mostly for DOS, but also for Windoze desktop apps.) You're right, we sort of gave up on the idea when everyone wanted to be seen as a "real" software engineer who knew how to sling Java on the back end.

by OhMeadhbh

5/19/2026 at 10:00:46 PM

They were CPU killers but man those Flash websites were gorgeous (talking mostly about MU Online "private" servers)

by pezgrande

5/19/2026 at 11:00:35 PM

It was probably the right call at the time with low bandwidth. Nowadays I bet flash would execute faster than most js heavy sites :D

by winrid

5/19/2026 at 11:41:55 PM

It was not the right call, Steve Jobs was just a monopolist killing a competing platform and we're all worse off for it.

by guelo

5/20/2026 at 6:02:23 AM

Flash was a security and battery disaster just like Java applets. Both are dead as browser plugins and good riddance.

by baq

5/20/2026 at 4:13:36 AM

I meant that designing Flash to use more CPU to save bandwidth was the right call at the time, unless I misunderstand your reply.

by winrid

5/20/2026 at 1:01:02 AM

I guess I'm slightly younger: I think "weights or it didn't happen"!

by hedora

5/20/2026 at 6:19:54 AM

Frontpage, Dreamweaver, flash, photoshop lol. We are old.

by sagarpatil

5/20/2026 at 3:10:25 PM

and Pagemill and Sitemill. At Bell Canada we had a very early web dev team in '94-'95. At one point pagemill came out and we could hire mostly non technical designers to build web pages. At the time it seemed like magic. We didn't need to have someone who grokked vi standing next to a designer all the time. But the HTML pagemill spat out was horrid. It always added a space to the end of link text and never closed list item elements. I eventually wrote a command line tool that fixed pagemill's output because some of our other tools really didn't like the flavour of HTML-inspired slop it emitted. *

And then I moved to the bay area and noticed there was a road called Page Mill Rd. in Palo Alto and sort of laughed for a bit. Surprised Adobe didn't release a tool called Sandhill.

[*] to be fair, most WYSIWYG page builder tools of the era spat out some sort of crappy subset of HTML, so not trying to say pagemill was the only offender.

by OhMeadhbh

5/21/2026 at 6:55:50 PM

Re: Sandhill. This literally sounds like BNR humour circa '94.

by balnaphone

5/20/2026 at 4:52:53 PM

You're not the only one... Heck, I hear Flash and I say Macromedia in my head :/

by thrownaway561

5/19/2026 at 8:22:12 PM

Lol. Young uns!

Flash, ah, ah, saviour of the universe. Flash, ah, ah, he'll save every one of us!

Every time I have heard the word flash for goodness knows how many years.

by _puk

5/19/2026 at 9:06:47 PM

If Google can reuse the "Flash" brand, I'm re-branding myself as "Meadhbh the Merciless."

by OhMeadhbh

5/20/2026 at 12:55:45 AM

Same here, and worse because in another thread users are generating animations.

by wslh

5/19/2026 at 7:48:14 PM

Gemini 3.5 Flash's 2000 token clocks aren't bad. https://clocks.brianmoore.com/

by lanewinfield

5/20/2026 at 8:42:02 AM

From looking at all of them, it actually seems to be the best one, followed by Deepseek 3.1. And something went wrong with GPT-5's.

by Valakas_

5/19/2026 at 11:49:21 PM

Fascinating, Kimi K2 has a good clock too, from my limited time on the site.

by acters

5/20/2026 at 7:28:56 AM

as does qwen3.5

by khimaros

5/19/2026 at 6:26:33 PM

3.5 Flash was more expensive than 3.1 Pro to run the Artificial Analysis test suite. $1551 for 3.5 Flash [0] vs $892 for 3.1 Pro [1]. That's 74% more cost while ranking lower. It's 2.5x as fast but I don't think the bang for the buck is there anymore like it was with 3.0 Flash. I'm a bit bummed out to be honest.

I did not expect such a huge (3x) price increase from 3.0 Flash and I bet many people will not just blindly upgrade as the value proposition is widely different.

One interesting point to note is that Google marked the model as Stable in contrast to nearly everything else being perpetually set as Preview.

[0] https://artificialanalysis.ai/models/gemini-3-5-flash

[1] https://artificialanalysis.ai/models/gemini-3-1-pro-preview

by eis

5/20/2026 at 1:09:24 AM

Ouch. That's going in completely the wrong direction.

How many people complain that we have too much low quality AI output for humans to read, let alone evaluate vs. how many people are complaining that they want higher quality, more trustworthy output?

by hedora

5/19/2026 at 6:40:27 PM

Seems like the only good thing about 3.5 Flash is its speed. Not cost-competitive or benchmark-leading by any means.

by ekojs

5/19/2026 at 7:30:16 PM

How do they calculate that?

3.1 has 57M output tokens from Intelligence Index, 3.5 Flash has 73M, so not a lot more, and 3.5 is a bit cheaper, I don't get how 3.5 can be 74% more expensive.
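
For what it's worth, the gap between token growth and cost growth can be checked in a few lines of Python. This is a rough sketch using only the figures quoted in this thread ($1551 vs $892, 73M vs 57M output tokens); the variable names are mine, and it assumes nothing about how Artificial Analysis actually computes its totals.

```python
# Figures quoted in this thread; not independently verified.
flash_35_cost_usd = 1551.0      # 3.5 Flash, total cost to run the suite
pro_31_cost_usd = 892.0         # 3.1 Pro, total cost to run the suite
flash_35_output_tokens = 73e6   # output tokens reported for 3.5 Flash
pro_31_output_tokens = 57e6     # output tokens reported for 3.1 Pro

cost_ratio = flash_35_cost_usd / pro_31_cost_usd
token_ratio = flash_35_output_tokens / pro_31_output_tokens

# Cost multiplier left over after accounting for output-token volume alone.
residual_ratio = cost_ratio / token_ratio

print(f"cost:   {cost_ratio - 1:+.0%}")
print(f"tokens: {token_ratio - 1:+.0%}")
print(f"residual multiplier: {residual_ratio:.2f}x")
```

So output volume grows by roughly a quarter while total cost grows by roughly three quarters. The remainder has to come from something other than output volume (input, caching, or per-token price), which this sketch can't tell apart.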

by pingou

5/19/2026 at 8:49:16 PM

Only speculation but cache maybe?

by knollimar

5/19/2026 at 6:36:02 PM

>3.5 Flash was more expensive than 3.1 Pro to run the Artifical Analysis test suite

That's everything I needed to know.

by ls_stats

5/19/2026 at 6:52:34 PM

That's what I came here to check. Last model release they only put it into preview[0] at first.

Does that mean this model is production ready?

[0] https://news.ycombinator.com/item?id=47076484

by mijoharas

5/19/2026 at 8:49:33 PM

I have google ai pro plan and tried antigravity with 3.5 flash but it used up all my quota in two prompts. If that is not a bug then it is seriously unusable.

by hmate9

5/19/2026 at 9:09:21 PM

Yesterday, or the day before, Google lowered the AI Pro quota from 33x standard usage to 4x.

From the talk on the Gemini subreddit it's severely lower than before. I'm likely canceling my AI Pro.

The update also broke the app for me. Editing a message crashes the app every time. I'm on a Pixel lol

by quirino

5/19/2026 at 11:39:06 PM

The crunch is real.

- The model is approx. 3.3x the cost.
- The model is realistically almost 5x the cost due to token usage.
- Google has TPUs to run this on (yet the cost).
- Google has a lot more security and backup cash compared to all other AI companies, likely even combined (yet the cost).

We can continue moving the goal posts, but it seems we're at a bit of a wall. Costs are increasing, intelligence is improving, but the cost is rising drastically.

You'd think Google of all companies in the mix would be able to sustain lower costs with how integrated they are with TPU, Deepmind and effectively unlimited budget.

by HDBaseT

5/20/2026 at 7:29:29 AM

It's an experience anyone who used Google BigQuery would be familiar with: start with an amazing engineering product, and keep continuously degrading the value users get out of a fixed dollar spend. It's like Google doesn't understand that lock-in doesn't work when customers can easily switch to Claude or GPT.

by logicchains

5/20/2026 at 5:07:35 AM

The way they're charging for failed generations is brutal.

Checked my 5 hour quota, it was 0%, got this for multiple attempts:

I'm getting more image requests than usual, so I can't create that for you right now. Please try again later.

or

Can you ask me again later? I'm being asked to create more images than usual, so I can't do that for you right now.

Went back and found they took 34% of my quota for the privilege of repeating that same error.

I think the "Usage Limits" screen is new so who knows how long they've been counting errors against our quota. I guess I should be grateful it's now visible.

by cube00

5/20/2026 at 5:36:26 AM

The web version went from 100 Pro prompts per day to... 12 per 5 hours lol. I just did 3 back-and-forths, not even technical planning, for an infra project and I am ~25% through. Insane.

by abeindoria

5/19/2026 at 9:01:38 PM

[dead]

by moral1ty

5/20/2026 at 2:06:23 AM

On my Agentic SQL benchmark it scores 19/25. That's... mediocre.

That means it performs worse than 3.1 Flash Lite Preview (22/25), is slower (367s vs 142s) and is more expensive (75c vs 2c).

It is outperformed by Gemma4 26B-A4B in every way(!)

https://sql-benchmark.nicklothian.com/?highlight=google_gemi...

(Switch to the cost vs performance chart to see how far this is off the Pareto frontier)

by nl

5/20/2026 at 4:48:44 PM

I'm seeing this too.

I have a SQL agent and my tests with 3.5 are resulting in hitting query budget limits that have never been hit before. On average, to answer the same question, 3.5 is spending 10x more on SQL queries vs gemini-3-flash-preview.

The query patterns can be extremely degenerate too. E.g. the agent will hit the semantic layer tool to pull the schema, then run `SELECT * FROM table LIMIT 1`, which hits the query budget limit and fails.

I've only really been looking this morning, so I need to do a full eval, but the initial results match what your benchmark shows.

---

Side note: your benchmark has an issue. On Q1 medium the model returned gross margin of 0.127 instead of 12.7 (%), and the benchmark failed it. The failures on Q9 and Q21 are the same (I didn't check other questions). Nowhere in the prompt did you specify you wanted the values converted to percentage points and rounded.

If you asked me to write that SQL with that prompt, unless you were throwing it directly into a visualization I would format it the same way gemini-flash did. If I were pulling into a spreadsheet or vis tool this format is preferable because it's easier to format in a client application.

The other failures like Q21 incorrectly averaging the list price are correct failures.
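
One way a grader could sidestep the fraction-vs-percent ambiguity is to accept either scale. A minimal sketch, assuming the expected value is stored in percentage points; `ratio_matches` and its tolerance are hypothetical and not taken from the actual benchmark code.

```python
import math

def ratio_matches(expected_pct: float, actual: float, rel_tol: float = 0.01) -> bool:
    """Return True if `actual` equals `expected_pct` either as percentage
    points (12.7) or as a plain fraction (0.127), within `rel_tol`."""
    as_percent = math.isclose(actual, expected_pct, rel_tol=rel_tol)
    as_fraction = math.isclose(actual * 100, expected_pct, rel_tol=rel_tol)
    return as_percent or as_fraction

print(ratio_matches(12.7, 0.127))  # fraction form of the same margin
print(ratio_matches(12.7, 1.27))   # off by 10x in either reading
```

The trade-off is that an answer that is genuinely wrong by exactly 100x would pass, so stating the expected unit in the prompt is probably still the cleaner fix.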

by data-ottawa

5/19/2026 at 7:17:43 PM

Knowledge cutoff: January 2025

Latest update: May 2026

I have a very bad feeling about this lag.

by reconnecting

5/19/2026 at 8:11:57 PM

At least in some cases, there seems to be a move toward training on more synthetic data and strictly curated data, especially for smaller models where knowledge can't be extremely broad, because there just isn't enough room to store the world in tens or hundreds of gigabytes of model weights. So, to achieve higher quality reasoning, the training has to be focused and the data has to be very high quality and high density.

With strong tool use, it maybe doesn't even matter that the models are using older data. They can search for updated information. Though most models currently don't, without a little nudge in that direction.

Also, I believe the Qwen 3 series are all based on the same base model, with just fine-tuning/post-training to improve them on various metrics. Maybe everything in the Gemini 3 series is the same, and maybe they're concurrently training the Gemini 4 base model with updated knowledge as we speak.

by SwellJoe

5/19/2026 at 9:06:10 PM

> it maybe doesn't even matter that the models are using older data.

This actually really does matter. Otherwise, the model simply won't know about your product and will always suggest only a few market leaders.

Searching for information on the Internet became a jungle a decade ago, and to be visible you have to pay Google for sunlight. Now, we risk falling into real darkness — until some paid model eventually emerges. This might be the reason Google is fine with training data from 2024. If the top spot is reserved for whoever pays anyway, why bother?

by reconnecting

5/19/2026 at 9:27:32 PM

That's a different problem than I thought you were worried about. I wasn't considering the marketing angle, though that is certainly relevant and a risk to consider, especially when it comes to Google, whose primary businesses are ads and surveillance.

by SwellJoe

5/19/2026 at 7:25:48 PM

Can you explain what you mean?

by hosel

5/19/2026 at 7:52:50 PM

LLM pre-training models risk being unable to be updated with data from after 2025, as much of it is corrupted with LLM-generated content. We might be locked into outdated knowledge, where only whitelisted sources decide what to include.

Taking into account the sometimes blind belief that 'LLMs know everything', the outcome could be very costly, especially for technologies and businesses unfortunate enough to emerge after 2025.

by reconnecting

5/20/2026 at 3:24:02 AM

It may not be mainly or solely due to LLM pollution, but rather the fact that every publisher, (social) media company, newspaper, etc. clammed up and started charging (licensing) fees sometime in the last couple of years.

So maybe there's just not much openly available and new content worth training on that wasn't available prior to 2025.

by agnosticmantis

5/19/2026 at 10:12:41 PM

But ChatGPT has been popular since early 2023, and even before it there was no shortage of low-quality content on the web.

If anything, this model being trained up to 2025 is a positive sign that the "circular LLM training" problem hasn't (yet) become unmanageable.

The year-long delay is probably just due to how long it takes to test/refine a cutting-edge model. It's surely possible to train one faster, but Google wouldn't want to release a new model unless it's going to top the usual benchmarks.

by Pikamander2

5/19/2026 at 10:40:23 PM

Looking at token usage at places like OpenRouter as a proxy for overall production we're looking at exponential growth in AI-created content. Weekly token usage there has tripled just in the past 3 months.

by djeastm

5/19/2026 at 9:14:43 PM

Considering all models can use search engines, is this really relevant?

by neksn

5/20/2026 at 8:11:51 AM

Yes. Huge difference in quality in from-weights distilled knowledge vs something based on a search tool. If the LLM uses a search tool there's barely a difference between a 30B model and Opus or GPT 5.5, because it just bases its reply on the stuff that came up. Which is generally SEO junk.

Obviously with the last example I'm not talking about long-running agentic tasks here that involve many dozens of search calls (like the recent Erdos problem stuff).

And that doesn't even consider the extra content rot, the time it takes, the need for such an API and so on.

One of the biggest advantages Anthropic models have had over GPT was GPT's woefully outdated data cutoff. They finally improved on this with 5.5, but IIRC it took a year.

by deaux

5/20/2026 at 12:47:43 AM

This is not meant as an insult, but have you actually LLM/vibe coded anything that used a fast(-ish) moving library or framework? Try asking your favorite LLM with say Jan 2025 knowledge cutoff (or pretraining data cutoff, whatever you want to call it) to work on something using a framework that had a big rewrite later that year (which would make it one year old now, which is like ages in the LLM coding era)... It's a nightmare full of wrestling with the LLM when you try to tell it the version of the framework and that it changed a lot from the previous version and yadda yadda long story short down the thread when context runs out and/or is compressed it begins to forget detailed instructions and just falls back to pulling out old patterns it "remembers" from pretraining. And so you need to constantly remind it what you work with and "oh hey this doesn't work because we're working with react router v7 in framework mode, remember? not react router v6". Or try to use the latest non-LTS/breaking version of a library: at first it looks it up online, but again as you get deeper into the weeds and little details, the struggle begins.

So, as far as I'm concerned, training cutoff is still a big deal.

by Culonavirus

5/20/2026 at 1:38:07 AM

> It's a nightmare full of wrestling with the LLM when you try to tell it the version of the framework and that it changed a lot from the previous version and yadda yadda

Tip: Add a default instruction to look at the actual downloaded source code of the dependencies used (assuming you're not dealing with closed source dependencies). Have the agent treat it as your own (readonly) source code instead of relying on model training data and possibly mismatching documentation on the web. Then it just greps for the exact function signatures and reads the file-based documentation.
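
A rough sketch of what "look at the installed source, not the training data" might look like as a helper the agent can call. It uses only Python's standard `importlib` and `inspect`; the `describe_callable` name is mine, and `json.dumps` is just a stand-in for whatever dependency is actually in use (for a JS project the equivalent would be reading files under `node_modules`).

```python
import importlib
import inspect

def describe_callable(module_name: str, attr: str) -> dict:
    """Report where a function lives on disk and its installed signature."""
    module = importlib.import_module(module_name)
    obj = getattr(module, attr)
    return {
        # Path the agent can open or grep; may be None for non-Python code.
        "file": inspect.getsourcefile(obj),
        # Signature of the version that is actually installed.
        "signature": str(inspect.signature(obj)),
        # First line of the docstring shipped with that version.
        "doc": (inspect.getdoc(obj) or "").splitlines()[:1],
    }

info = describe_callable("json", "dumps")
print(info["file"])
print(info["signature"])
```

The point is only that the signature comes from the installed version, so it can't drift from what the code will actually run against.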

by dinfinity

5/20/2026 at 8:14:09 AM

Great, now you experience context bloat 3x as quickly and any task takes 3x as long.

If Google wants to structurally compete with Anthropic on coding, this issue is a must-fix. OpenAI finally fixed it with 5.5.

by deaux

5/19/2026 at 9:25:53 PM

Until they prefer not to search. Let me explain using the example of the open-source security framework (1) our team is working on.

If you ask Gemini what you should use to integrate fraud prevention or account takeover protection into your product, there will be no mention of our open-source project. Five years in development, 1.3k stars, over 140 pull requests — all this isn't enough to make it into the training data. From this perspective, any technology that emerges after 2024 is simply invisible to LLMs.

The answer is: without being in the training data, LLMs basically don't understand what they're searching for.

1. https://github.com/tirrenotechnologies/tirreno

by reconnecting

5/19/2026 at 10:37:46 PM

I just put the terribly generic query "what tools would you recommend to integrate fraud prevention or account takeover protection into my product" into both Claude (Sonnet) and Gemini (3.1 Pro) via the standard web interface and both took the first step of searching the web. That's consistent with my past experience -- the usual harnesses typically will search the web in cases where I might expect/want them to. Now whether your product has good web visibility or not in those searches and how the LLMs weigh the relative merits of open-source tools versus commercial offerings in deciding what to highlight in their responses is a different issue. As is the change in what constitutes effective SEO in an era where bots, rather than human eyes, are the proximal important target. But I don't think the core issue with folks finding your products is the move away from user-driven search toward using models with out-of-date training cutoffs.

FWIW while neither model included your product in its initial response, when I followed up with "what about open-source" both did another search and Claude's response included your tool....

by ordersofmag

5/20/2026 at 8:19:21 AM

> while neither model included your product in it's initial response, when I followed up with "what about open-source"

You just proved that LLMs don't know about the product (which is fine), but they don't even know the category exists.

It's like driving a car whose mirrors show a two-year-old reflection and insisting they work fine.

by reconnecting

5/19/2026 at 7:29:30 PM

It might indicate core model training and pre training is really slowing down?

by nemomarx

5/19/2026 at 7:41:06 PM

also parsing is harder + so much more of the new data is being generated by ai itself.

still the cutoff is very much concerning and inconvenient

by mixtureoftakes

5/19/2026 at 8:39:05 PM

you really shouldn't have them pulling facts from their weights, they need grounding from real data sources

by verdverm

5/19/2026 at 7:44:11 PM

I thought that was a choice that Google made?

by yoda7marinated

5/19/2026 at 6:43:17 PM

Yikes. I think the concept of a 'flash' model is changing, no? Google used to market this as its lower-intelligence, faster, cheaper option. I appreciate that they are delivering on both of those, but personally I would appreciate if they could create an incremental knowledge improvement while holding price steady. Fortune 500 companies have to make their money I guess.

by s3p

5/19/2026 at 7:21:33 PM

I think flash just means "fast" now

by 2001zhaozhao

5/20/2026 at 4:34:56 AM

Real smart. I’ve come to associate ”Flash” with ”useless make-shit-up”, and always look for Thinking/Pro when I see it set. Now, suddenly, there is only Flash?

by kilpikaarna

5/19/2026 at 8:30:36 PM

My guess is Gemini Pro coming later will be 2x more, bringing it comparable to Opus’s pricing.

by likium

5/19/2026 at 8:33:35 PM

That would be Flash Lite now, and I'm also interested in the cheaper end of things so kinda disappointed they didn't release 3.5 Flash Lite at the same time...

by toraway

5/19/2026 at 6:57:45 PM

The price is crazy.

And I guess Gemini 3.5 pro will have the pricing increment, too. 12 x 5 = 60?

It seems like google does want us to use Chinese models.

by npn

5/19/2026 at 10:18:22 PM

What exactly are you doing with this that you can’t generate $1.50 of value per million tokens?

by brianwawok

5/20/2026 at 8:54:02 AM

I sell service. Imagine my users have to pay 4x more for marginal increment just 'cause.

They are more willing to wait though, so Chinese models are pretty attractive right now.

by npn

5/19/2026 at 10:40:02 PM

Generate 5x more value for the same amount of money.

by bel8

5/20/2026 at 12:34:41 AM

Wrong question.

Right question: What exactly is Google's plan for the long term pricing of these models, and are we all going to be priced out in a year?

by s3p

5/19/2026 at 7:59:27 PM

3x price increase for a similar model almost. And they said AI would be cheaper and ubiquitous.

by wg0

5/19/2026 at 8:22:05 PM

Ubiquitous like the crack epidemic.

by alexandre_m

5/19/2026 at 8:39:59 PM

or 3/4 the price (of 3.1 Pro) if we believe their benchmarks

by verdverm

5/19/2026 at 10:13:51 PM

Wow at the price hike. Still I think in the long run the Chinese will win if they're able to produce hardware comparable to Nvidia.

by margorczynski

5/20/2026 at 1:03:05 AM

Why would the Chinese sell me nvidia cards? I can just buy an AMD iGPU, and the perf/$ is much better than nvidia dGPUs.

(Typed on a 2023 macbook perfectly capable of running the Chinese open weight models.)

by hedora

5/19/2026 at 11:56:20 PM

I've had the $20 Gemini plan to use when my local setup runs into tougher problems and the throttling today has been bonkers. I canceled my subscription and will look into upgrading my local setup.

by 650REDHAIR

5/19/2026 at 11:40:24 PM

Isn't China also allowed to purchase Nvidia GPUs now too?

by HDBaseT

5/20/2026 at 4:30:42 AM

Most Chinese companies will now avoid Nvidia GPUs and as much American tech as they can when it comes to serving AI, since they know it can be cut off at any time by the US or maybe even their own government, so the risk premium is too high. They might still use Nvidia to build the models, but not for running them and serving customers.

by xbmcuser

5/20/2026 at 2:18:20 AM

Up to the H200 iirc, but they haven't made a purchase yet afaik. The experts in such things believe that if they do make a purchase, it will be a token one. Xi is pushing hard for indigenous production, not becoming "hooked" on American AI chips like some (not so bright people) think we can cause to happen.

by verdverm

5/20/2026 at 12:59:16 AM

Doesn't need to be the Chinese. It can be anyone without stratospheric Nvidia margins. The Gold Rush phase of AI economy (aka "the bubble") is beginning to slow down and the Optimization phase is just beginning to ramp up (we see this with massive bumps to token cost and token burn rate of pretty much all frontier models, plus the general pivot away from your typical individual chat end-users to businesses and employees of said businesses) and there will come a time when "nvidia has the best software stack" will not mean much for the big players. Organically, I think it already kinda does, it's just masked with the inertia of massive circular deals and Nvidia selling its services to itself (entities it backs/invests in).

by Culonavirus

5/20/2026 at 10:04:15 AM

You may remember the argument that you can build an AI app and it continues to improve as models improve and costs go down?

Well, looking at OpenAI / Google / Anthropic we see crazy cost increases, such that it might invalidate your unit economics.

Cheering for Chinese models!

by swe_dima

5/20/2026 at 2:32:36 AM

Taking into account that this is a flash model, it's a strong release. It's very fast and frontier-ish for the price.

Raw intelligence is high for a flash model. But Google's problem has always been productization and tool use, whereas raw intelligence is always competitive. It does not look like they solved that with this release -- in fact, their tool use delta (the improvement in scores when given arbitrary tools and a harness) has actually regressed from some previous models.

Data at https://gertlabs.com/rankings

by gertlabs

5/19/2026 at 6:38:56 PM

Beats 3.1 Pro for price per token, but artificial analysis is showing it's dumber per token and costs more overall

by OsrsNeedsf2P

5/19/2026 at 7:04:15 PM

Arena.ai is saying "Gemini 3.5 Flash’s pricing shifts the Pareto frontier in Text. 8 models from GoogleDeepMind dominate the Text Arena Pareto curve where only 4 labs are represented for top performance in their price tiers."

https://x.com/arena/status/2056793180998361233

by golfer

5/19/2026 at 9:14:13 PM

Not sure what to think about this. There isn't even GPT 5.5

by nicce

5/19/2026 at 6:46:46 PM

Yeah, bummer. I was very excited for this release, but this killed it.

by sauwan

5/19/2026 at 6:58:58 PM

The pricing is an issue.

by droidjj

5/19/2026 at 6:04:41 PM

$1.5/m input tokens $9/m output tokens

6x the price of 3.1 flash lite

by asar

5/19/2026 at 7:09:17 PM

"Flash-Lite" is a different product from "Flash", which is more expensive. They couldn't be more confusing with their naming though, especially since they have 3.1 Pro and not 3.1 Flash non-lite.

by Aunche

5/19/2026 at 6:40:49 PM

I haven't used 3.5 at all yet, but previous Gemini (and Gemma) models are by far more token-light per task than any other model.

Cost per task is a more productive measure, but obviously a more difficult one to benchmark.

by WarmWash

5/19/2026 at 6:07:01 PM

I don't think input/output pricing matters, 90% of the cost is cache. $0.15 is pretty good, but still very expensive.

by himata4113

5/19/2026 at 6:17:49 PM

It depends on the use case. Yes, 90% of cost is cache in agentic coding scenarios (actually 95% in my experience). But not when the model reasons for 200k+ tokens before answering a complex problem.
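
A rough sketch of why the split depends on the workload, using the prices quoted in this thread ($1.50 input, $0.15 cached, $9 output per million tokens). The token counts for the two sessions are made up purely for illustration, and `cost_shares` is a hypothetical helper, not part of any SDK.

```python
# Prices quoted in this thread, in USD per million tokens.
PRICE_PER_M = {"input": 1.50, "cached": 0.15, "output": 9.00}


def cost_shares(tokens):
    """Return (total_usd, {kind: share_of_total}) for a dict of token counts."""
    costs = {kind: count * PRICE_PER_M[kind] / 1_000_000 for kind, count in tokens.items()}
    total = sum(costs.values())
    shares = {kind: cost / total for kind, cost in costs.items()}
    return total, shares


# Hypothetical agentic coding session: many turns, each re-reading a long cached prefix.
agentic = {"input": 300_000, "cached": 100_000_000, "output": 100_000}

# Hypothetical single hard question: short prompt, very long reasoning output.
reasoning = {"input": 20_000, "cached": 0, "output": 200_000}

for name, tokens in (("agentic", agentic), ("reasoning", reasoning)):
    total, shares = cost_shares(tokens)
    print(name, round(total, 2), {k: round(v, 3) for k, v in shares.items()})
```

With these made-up counts, cached reads dominate the agentic session, while output dominates the long-reasoning case.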

by wolttam

5/19/2026 at 6:32:05 PM

Gemini models solve a problem in 80% fewer tokens, so that's something to think about.

by himata4113

5/19/2026 at 7:24:03 PM

Gemini caching is confusing though:

  $0.15 / million tokens
  $1.00 / 1,000,000 tokens per hour (storage price)

I much prefer the OpenAI/DeepSeek way of pricing caching where you don't have to think about storage price at all - you pay for cached tokens if you reuse the same prefix within a (loosely defined) time period.

by simonw

5/19/2026 at 8:20:07 PM

As far as I can tell Gemini caching DOES work like OpenAI - see implicit caching here: https://ai.google.dev/gemini-api/docs/caching

I confirmed this by running a bunch of prompts through Gemini 3.5 Flash without doing anything special to configure caching and noting that it comes back with a "cachedContentTokenCount" on many of the responses.
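
A minimal sketch of that check, assuming the REST response JSON carries the counts under a `usageMetadata` object, with `promptTokenCount` next to the `cachedContentTokenCount` field mentioned above. The field layout is an assumption to verify against the API reference, `cached_fraction` is a hypothetical helper, and the sample payload is invented.

```python
def cached_fraction(response):
    """Fraction of prompt tokens served from cache, or 0.0 if none are reported.

    Assumes `response` is the parsed JSON body of a generateContent call.
    """
    usage = response.get("usageMetadata", {})
    prompt_tokens = usage.get("promptTokenCount", 0)
    # Treat a missing field as "nothing was cached" (assumption).
    cached_tokens = usage.get("cachedContentTokenCount", 0)
    if prompt_tokens == 0:
        return 0.0
    return cached_tokens / prompt_tokens


# Invented example payload, not real API output.
sample = {"usageMetadata": {"promptTokenCount": 12_000, "cachedContentTokenCount": 9_000}}
print(cached_fraction(sample))
```

Logging this per request is one way to see how often implicit caching actually kicks in for a given prompt structure.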

The "storage price" quoted is for an optional Gemini feature that most people don't care about: https://ai.google.dev/gemini-api/docs/caching#explicit-cachi...

by simonw

5/19/2026 at 6:19:25 PM

In our experience, caching is not very reliable with Google. We always get random cache misses that don't happen with other providers. We find OpenAI, Anthropic and Fireworks (which we use a lot) all have higher cache hit rates. So it's not only about the cost of cached tokens but also what kind of cache hit rate you get.
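
To make the hit-rate point concrete, here is a back-of-the-envelope sketch. It assumes the $1.50 per million input price and the 10x cache discount quoted elsewhere in this thread; the hit rates are hypothetical, as is the `blended_input_price` helper.

```python
def blended_input_price(base_price, hit_rate, discount):
    """Effective USD per million input tokens given a cache hit rate in [0, 1].

    Cached tokens cost base_price / discount; the rest cost base_price.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return base_price * ((1.0 - hit_rate) + hit_rate / discount)


BASE = 1.50  # USD per million input tokens, as quoted in this thread

for hit_rate in (0.95, 0.80, 0.50):  # hypothetical hit rates
    print(hit_rate, round(blended_input_price(BASE, hit_rate, discount=10), 4))
```

Under these assumptions, dropping from a 95% to an 80% hit rate roughly doubles the effective input price, so the hit rate can matter as much as the listed cached-token price.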

by __jl__

5/19/2026 at 7:14:10 PM

In my experience Google is the most flaky in general, which is surprising considering the rock solid history of their search and other products. Just more likely not to respond at all, to give a response out of left field, to handle the same error in 12 different ways randomly (a rainbow of HTTP status codes and error messages), etc etc.

by svachalek

5/19/2026 at 10:23:10 PM

I agree. https://aistudio.google.com/ is shockingly bad. I'm not sure I've ever used such a flaky Google service before. It's so much worse than Gmail or Google, not to mention the ChatGPT or Claude or DeepSeek or Kimi or Midjourney web interfaces. The bizarre janky integration with your Google Drive, or Gemini or NBPs randomly erroring out, often indefinitely. I've had sessions refresh themselves and just... disappear. Or when you get frustrated with a buggy dead session and hit 'new session' and have to wait minutes for 'saving...' to happen.

by gwern

5/19/2026 at 7:52:51 PM

Exactly our experience too. Effectively, we catch these, and on these status codes we send the request to OpenAI. Retrying the same query in Gemini has a high chance of giving more or less the same status code.
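
The pattern looks roughly like the sketch below. Everything in it is hypothetical: `ProviderError`, the status set, and the stub providers stand in for whatever real clients and error types you use.

```python
class ProviderError(Exception):
    """Hypothetical error type carrying the HTTP status a provider returned."""

    def __init__(self, status):
        super().__init__(f"provider returned HTTP {status}")
        self.status = status


# Hypothetical set of statuses treated as "do not retry here, go elsewhere".
FALLBACK_STATUSES = frozenset({429, 500, 503})


def complete_with_fallback(prompt, primary, fallback, statuses=FALLBACK_STATUSES):
    """Call `primary`; on a listed status, send the same prompt to `fallback`."""
    try:
        return primary(prompt)
    except ProviderError as err:
        if err.status in statuses:
            return fallback(prompt)
        raise  # Anything else (e.g. a 400 for a bad request) is surfaced as-is.


# Stub providers standing in for real clients.
def flaky_primary(prompt):
    raise ProviderError(503)


def stub_fallback(prompt):
    return f"fallback answer to: {prompt}"


print(complete_with_fallback("hello", flaky_primary, stub_fallback))
```

Skipping the same-provider retry is a deliberate choice here, since a retry tends to fail the same way.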

by veselin

5/19/2026 at 6:07:36 PM

10% of input pricing is standard especially compared to competition.

by minimaxir

5/19/2026 at 6:15:25 PM

Yeah, which means that in the end the input cost, plus the cache discount (x10), is the only value worth paying attention to. If Google started offering a x20 discount, it would be twice as cheap while input and output prices stayed the same.

by himata4113

5/19/2026 at 6:07:07 PM

[deleted]

by John7878781

5/19/2026 at 6:10:33 PM

Output cost is 3x from Gemini 3 flash.

by stri8ed

5/21/2026 at 11:13:51 AM

I ran through the eval loop for a side project’s task (personalization of a micro video game, no thinking) last night. Head to head with Gemini 3 Flash Preview, results came out at basically a wash on my rubric. The output quality was good, well grounded, and reliable across 144 runs. But not noticeably better. It isn’t a traditional coding task, so can’t infer anything there. The amazing part was how fast it is. It was consistently about 2x faster than 3 Flash Preview and slightly faster than 3.1 Flash Lite Preview which is amazing. For my task, the price difference doesn’t matter, so easy upgrade. I plan to write up a quick blog post with the results over the weekend.

by time0ut

5/20/2026 at 12:39:07 PM

I've worked with all three of the biggest models and typically have the three of them working together; Gemini is by far the worst of the three. The price hikes will keep me further away from applying it in my day-to-day operations.

by hackmack10

5/19/2026 at 9:14:44 PM

Worth noting that Google marked this stable rather than preview, which is unusual compared to their recent releases. Pair that with the 3x price hike and Flash pricing now reads like the long-term floor they want, not a temporary thing they will walk back later. But it's hard to tell yet whether that's Google specifically reading the room or the whole industry quietly resetting the cheap-inference baseline.

by nikhilpareek13

5/19/2026 at 9:37:32 PM

China: we don’t need to use US models, we can distill them ourselves

Google: we don’t need the Chinese to distill our models, we can do it ourselves

by stared

5/19/2026 at 6:05:03 PM

Engineers at Google have publicly stated that the models are too big and are far from their potential. Glad they're being proven right with every release.

They continue to focus on smaller models while openai and anthropic are increasing compute requirements for their SOTA models.

by himata4113

5/19/2026 at 6:09:15 PM

Given the cost increase associated with this model, and previous model releases, I think the size is trending upwards, not down.

by stri8ed

5/19/2026 at 6:14:23 PM

The speed says otherwise. I think they're increasing costs since they want to start seeing ROI.

by himata4113

5/19/2026 at 6:40:19 PM

Those are (mostly) new, faster TPU

by JanSt

5/19/2026 at 6:44:21 PM

Latest TPUs appear to reach 800tok/s rather than the advertised 300tok/s.

by himata4113

5/19/2026 at 10:29:32 PM

They demoed the 8i today running at 1300 to 1600ish tokens per second. I imagine that is because a single rack was serving the model just for the demo.

by mgambati

5/19/2026 at 11:15:42 PM

There's a limit to how much you can "scale" this process; it's linear. But if we do napkin math based on vLLM, parallel batched streams only lose around 50% performance compared to single-stream output, so that doesn't explain the ridiculously fast numbers here.

I wish google just came out and told us how large their flash model is, because if it's as big or smaller than gpt-5.4-nano that's the real headline here.

by himata4113

5/19/2026 at 6:11:09 PM

Don’t let that fool you. Google will have SOTA models as big as or even bigger than their competitors.

They are just refining their current models while they finish training the next generation.

They will all come out at about the same time. Anthropic, OpenAi, Google, xAI

by maipen

5/19/2026 at 6:18:52 PM

Anthropic has been sitting on Mythos for a while now. I guess they don't feel pressured to fuck it ship it until anyone else gets a 10T to work.

by ACCount37

5/19/2026 at 6:59:29 PM

According to people who have access to Mythos, it is slightly worse than GPT-5.5-xhigh. At least for security tasks.

Hold on, I think this claim needs some hard data. Here you go gentlemen:

https://www.aisi.gov.uk/blog/our-evaluation-of-openais-gpt-5...

by throwa356262

5/19/2026 at 7:24:22 PM

That claim keeps getting contradicted hard by other parties, who say Mythos beats 5.5 resoundingly on both autonomous search and discovery and creation of complex exploit chains.

There might be a harness difference, but also, this CTF-type benchmark might not capture the capability difference fully.

by ACCount37

5/19/2026 at 11:29:54 PM

[dead]

by nimchimpsky

5/19/2026 at 6:26:25 PM

It's doubtful they have the compute to make mythos publicly available even after the SpaceX datacenter deal. And why sell it publicly if people are still willing to pay for Opus 4.7?

by Sevii

5/19/2026 at 6:43:43 PM

I suspect that Mythos doesn't have a business model that works

by outside1234

5/19/2026 at 7:03:21 PM

> Engineers at google have publically stated that the models are too big and are far from their potencial

Can you link to a source?

by Jabbles

5/19/2026 at 11:40:44 PM

I wish I could, it was one of those youtube podcast type interviews with one of the engineers, there was a lot more shared, but that line stuck with me the most.

by himata4113

5/19/2026 at 7:19:12 PM

Source please, because I don't believe that for one second.

by Dinux

5/19/2026 at 7:46:03 PM

I mean, yes and no.

Nobody really knows the answer to which one is more optimal

* Large model trained on a large amount of data across multiple domains, that doesn't need any extra content to answer questions.

* Smaller model that is smart enough to go fetch extra relevant content, and then operate on essentially "reformatting" the context into an answer.

by ActorNightly

5/19/2026 at 6:17:16 PM

Google’s pro models are almost certainly bigger than Openai’s lol

by howdareme

5/19/2026 at 7:19:11 PM

Why would that be? I am curious why you think that.

by fikama

5/19/2026 at 7:43:52 PM

E.g. because they are behind on research and so must compensate with size to achieve similar level of intelligence. At least this is what I heard.

For intelligence/size only OpenAI and Anthropic are the frontier. Google has more compute so it can compensate for that with size of the models...

by mnicky

5/19/2026 at 8:50:03 PM

I'd argue Qwen is pushing the Pareto frontier considerably further when you take size into account.

by snovv_crash

5/19/2026 at 7:41:36 PM

Because TPUs are more efficient, and it's cheaper for them to field them in higher quantity since they own the chip.

by ActorNightly

5/19/2026 at 9:25:18 PM

How is this progress? The token cost just keeps going up and up. Flash is the new Pro? Do the models actually cost more to run or is it fattening margins?

by brikym

5/20/2026 at 4:02:36 PM

Anyone using this yet?

I’m finding it very bad at instruction following vs 3.1. It calls tools it is told it shouldn’t, and it loves calling tools. There’s a pretty strong bias towards its training vs system prompt instructions.

Google’s release notes say to reduce unnecessary tool calls by reducing thinking, but that feels like it should be orthogonal to me.

It definitely has improved a few logic things, like in data visualizations it’s better at labelling data, but it’s much worse at preparing data out of the box.

by data-ottawa

5/20/2026 at 7:19:01 PM

Same. Feels very goal oriented. Requires multiple attempts to deter course and means to achieve it.

On tool use. Gave it interactive design assignment on Antigravity 2. Failed miserably until I asked to use playwright for testing. And boy did it go with it. Tested hell out of visuals, nailed the solution.

On following instructions. Asked Gemini 3.5 Flash to summarize a YouTube video (Google I/O developer keynote), a task that would previously be trivial (I use it often), but it kept hallucinating points and referencing I/O dev keynote blog posts from several years ago. Multiple attempts, same result even on repeat requests. Almost insistent on the validity of the information provided, ignoring questions about whether it had such a capability.

by wwizo

5/21/2026 at 11:32:30 AM

What thinking level were you using?

In my testing, the minimal thinking mode hallucinated 2/3 times, which is pretty scary. The other modes weren’t as bad. I don’t have comprehensive data though.

by data-ottawa

5/19/2026 at 11:58:18 PM

That pelican looks like it just sold a SaaS company and bought a bike because its therapist said it needed balance.

by paol_taja

5/20/2026 at 12:38:26 AM

The pelican is ready to discuss increased synergies of bringing AI to all teams at the firm!

by s3p

5/20/2026 at 8:06:27 AM

That made me subtly, yet audibly, laugh.

by testycool

5/19/2026 at 6:20:55 PM

Is there a good benchmark tracking hallucinations? The models are all incredibly good now, even the open ones, and my hope is that the rate of hallucinations is something that's falling off in concert with larger and larger context lengths.

by aliljet

5/19/2026 at 6:38:37 PM

People complain about them incessantly, but I can almost never get people to actually post receipts. Every provider allows sharing chats, and anyone can share a prompt that reliably produces hallucinations.

More often than not, people are using images in responses that go awry. Which is fair, the models are sold as multi-modal, but image analysis is still at GPT-4.0 text-analysis levels.

Also knowledge cutoff issues, where people forget the models exist months to a year or more in the past.

by WarmWash

5/19/2026 at 7:29:42 PM

I see constant hallucination in Claude Code when using specific tooling: it thinks it knows the AWS CLI, for instance, but there are some flags that don't exist that it attempts to use all the time in 4.6 and 4.7. When asked about it, it says that yes, the flag doesn't exist in that command, but it exists in a different command (which it does), and yet it attempts to use it without extra info.

Claude also believes it knows how AWS' KMS works, quite confidently, while getting things wrong. I have a separate "this is how KMS replication actually works" file just to deal with its misconceptions.

For Gemini, I typically use it to query information from large corpuses, but it often web searches and hallucinates instead of reading the actual corpus. On a book series, it will hallucinate chapters and events which, while reasonable and plausible, do not exist. "Go look at the files and see if your reference is correct" shows that it's not correct, and it's a mandatory step. That doesn't prevent hallucination, though; it just makes sure you catch it after the fact, just like a method in a class that doesn't exist gets found out by the compiler. The LLM still hallucinated it.

by hibikir

5/19/2026 at 6:46:55 PM

I see hallucinations ALL the time. It's only obvious when you're prompting about a subject you know well.

And when I say all the time, I mean it, and this is for Opus 4.7 Adaptive.

I often have to say, please do searches and cite sources, as if it doesn't it will confidently give me wrong or outdated information.

If you're often asking questions about a topic that's not in your specialist knowledge you won't notice them.

by saberience

5/19/2026 at 7:17:36 PM

Hallucination is also much better controlled in the context of agentic coding because outputs can be validated by running the code (or linters/LSP). I almost never notice hallucinations when I’m coding with AI, but when using AI for legal work (my real job) it hallucinates constantly and perniciously because the hallucinations are subtle—e.g., making up a crucial fact about a real case.

by droidjj

5/19/2026 at 8:20:21 PM

Yes, you can catch many mistakes that LLMs make while coding, but I wouldn't necessarily call it "controlled." Every now and then the LLM will run into dead ends where it makes a certain mistake, the compiler or unit tests find the mistake, so it tries a different approach that also fails, and then it goes back to the first approach, then tries the second approach again, and gets stuck in an endless loop trying small variations on those two approaches over and over.

If you aren't paying attention it can spend a long time (and a lot of tokens) spinning in that loop. Sometimes there might be more than two approaches in the loop, which makes it even harder to see that it's repeating itself in a loop. It's pretty frustrating to see it working away productively (so you think) for 20 minutes or so only to finally notice what's going on

by krupan

5/20/2026 at 12:54:09 PM

For coding, the worst I've seen recently is Gemini using or suggesting library methods that don't exist in C#, which it catches when it builds the project (something I've told it to do to catch these).

But for research it makes shit up all the time. I asked GPT5.5 to make me a build for Rogue Trader and not only did it use out-of-date info, it made up a bunch of skills that were NEVER in the game. I attribute that to there not being enough online information in the wikis or whatever, but I wish it would just say "I don't know" instead of hallucinating, though I know that's not how the tech works.

by NothingAboutAny

5/19/2026 at 7:38:14 PM

https://gemini.google.com/share/9cd8ca68025a

I was trying to understand a game I've been playing, The Last Spell. I asked it for a tier list of omens -- which ones the community considers most important. At least a few of the names it posts are hallucinated ("omen of the sun" does not exist, and the omens that give extra gold are "savings," "fortune," and "great wealth").

Obviously not a critical use case but issues like this do keep me on my toes regarding whether the thing is working at all. I should ask 3.5 flash to do the same job. (I did try and it once again hallucinated the omen names and some of the effects.)

by asdfasgasdgasdg

5/19/2026 at 7:36:42 PM

I can reliably produce hallucinations with this genre of prompt: "write a script that does <simple task> with <well known but not too-well-known API>." Even the frontier models will hallucinate the perfect API endpoint that does exactly what I want, regardless of if it exists.

The fix is easy enough though, a line in my global AGENTS.md instructing agents to search/ask for documentation before working on API integrations.

by hamdingers

5/19/2026 at 7:44:01 PM

Yeah. Better to have more details in your prompt than fewer. For example, I use this:

```

Build a Nango sync that stores Figma projects.

Integration ID: figma

Connection ID for dry run: my-figma-connection

Frequency: every hour

Metadata: team_id

Records: Project with id, name, last_modified

API reference: https://www.figma.com/developers/api#projects-endpoints

```

Note: You do need a Nango account and the Nango Skill installed before it could work.

by sapneshnaik

5/19/2026 at 8:10:56 PM

I asked gemini 3.1 Pro to search for the linkedin URLs for a list of peers. It generated a plausible list of links -- but they were all hallucinated. On a follow up it confirmed it couldn't actually search, but didn't tell me that without prompting.

by brooksc

5/19/2026 at 7:15:42 PM

"People complain about them incessantly, but I can almost never get people to actually post receipts."

...my chats are all pretty long and involve personal conversations, or I've deleted them. It's a lot to ask for someone to post receipts. The number of complaints is enough data.

No matter how big the model is there will be edge cases where it has no data or is out of date. In these cases it just makes stuff up. You can detect it yourself by looking for words like usually or often when it states facts, e.g. "the mall often has a Starbucks." I asked it about a Genshin Impact character released in June 2025 and it consistently interpreted the name (Aino) as my player character because Aino wasn't in its data.

Honestly I'm surprised you haven't encountered it if you're using it more than casually. Pro is much better but not perfect.

by rjh29

5/19/2026 at 8:18:09 PM

Claude has gotten good in the past month or two at recognizing when it might need to search the web for updated info rather than saying that it has no idea what I'm talking about or making stuff up.

by ls612

5/19/2026 at 8:25:54 PM

Are the knowledge cut off issues well known? I don't remember seeing them prominently displayed.

Also, prompts that reliably produce hallucinations is kind of a hard ask. It's inconsistent. One day the LLM I work with quotes verbatim from the PCIe spec and it's super helpful. The next day it gives me wrong information and when I ask it what section of the spec that information comes from it just makes up a section number

by krupan

5/19/2026 at 11:05:41 PM

Just ask any real question about stuff. LLM is not about code only...

by vitorgrs

5/20/2026 at 3:17:58 AM

> While OpenAI originally pioneered Codex (which went on to power GitHub Copilot), Google’s direct answer for dedicated, native code completion and natural-language-to-code generation is CodeGemma.

https://g.co/gemini/share/33e7a589a161

by vlmutolo

5/20/2026 at 8:34:15 AM

Nothing about this is a hallucination. The Codex that it talks about is real, existed, and did go on to power the original Copilot. You neither specified that you meant a different Codex, nor did it make anything up. The CodeGemma isn't made up either, as its referenced working link shows.

by deaux

5/19/2026 at 6:25:08 PM

I haven't been bothered by hallucinations in premier models since early last year. Still see it in smaller local models though.

by Sevii

5/19/2026 at 6:29:17 PM

I'm really running into this deep at the edges of content creation. Take, for example, a need to generate some kind of legal work. The cost of painstakingly checking and rechecking each case cited is reducing the value of these frontier models immensely.

Coding, however, is solved like magic. Easier to add tests, to be fair.

by aliljet

5/19/2026 at 6:33:05 PM

if last year's models were the ones people got familiar with in late 2022, hallucinations would be an underrepresented rumor, there would be no articles about it because it's so rare. overconfident lawyers wouldn't have messed up dockets in court with fake case law, in other domains that move faster, sources would be only partially outdated with agentic search and mcp servers filling in the gaps

AI psychosis would be the problem people talk about more, not just outright agreement but subtle ways of making you feel confident in your ideas. "yes, buy that domain name buy these other ones for defensibility"

(the domain name is dumb and completely unmarketable)

by yieldcrv

5/19/2026 at 6:50:06 PM

The models still hallucinate badly when called via APIs, especially if web search is not enabled. Gemini hallucinates quite frequently even with the app and search enabled. More recent prompts/harnesses (e.g. ChatGPT 5.x and Deepseek v4) search very aggressively, which does greatly mitigate hallucinations.

by jampekka

5/20/2026 at 7:44:02 AM

Victim of LLM hallucinations, poor guy

by schneehertz

5/19/2026 at 6:51:20 PM

As long as the model uses web search, they almost never hallucinate anymore. The fast models (haiku, gpt-instant, flash) still sometimes have the problem where they don't search before answering so they can hallucinate

by FergusArgyll

5/19/2026 at 7:08:38 PM

I've seen ChatGPT and Gemini hallucinate even from web search; it's better but not sufficient.

by goldenarm

5/19/2026 at 8:22:51 PM

It really depends what you are asking it. If the answer is in the training data, then the odds of it lying to you are much lower than if you are asking it for something it has never seen before.

by krupan

5/20/2026 at 7:47:32 AM

For me the biggest gain is the speed.

It takes on average 2.84s for Gemini 3.5 Flash to give an answer, compared to 33s for GPT 5.5 [0].

Also the max/slowest test is answered in under 7s, whereas GPT 5.4 takes more than 5 minutes...

[0]: https://aibenchy.com/compare/google-gemini-3-5-flash-low/ope...

by XCSme

5/19/2026 at 8:51:13 PM

The demo of the model in Antigravity automatically renaming and categorizing unstructured assets using vision was quite cool; it demonstrates that the IDE side panel can be used for more than just coding. I wonder if the harness in Antigravity is based on Gemini CLI or if they are completely different. Could Gemini CLI do the same task? Or is the vision feature an Antigravity thing?

by Alifatisk

5/19/2026 at 7:04:51 PM

Arena.ai:

> Gemini 3.5 Flash’s pricing shifts the Pareto frontier in Text. 8 models from GoogleDeepMind dominate the Text Arena Pareto curve where only 4 labs are represented for top performance in their price tiers.

https://x.com/arena/status/2056793180998361233

by golfer

5/19/2026 at 7:56:15 PM

Given how widely varying the amount of tokens each model uses for a given task, "price-per-token" is essentially meaningless when doing this sort of comparison.

Artificial Analysis's "Cost to run" model (aka num_tokens_used * price_per_token) is much better, but even that is likely problematic since it's not clear whether running a bunch of benchmarks maps cleanly to real-world token use.
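
As a worked example of that formula, here is a small sketch using the $9 and $12 output prices quoted elsewhere in this thread. The token counts are hypothetical, and both helper names are made up for illustration.

```python
def cost_to_run(tokens_used, price_per_million):
    """Cost in USD of a workload: tokens used times price per token."""
    return tokens_used * price_per_million / 1_000_000


def break_even_token_ratio(cheap_price, expensive_price):
    """How many times more tokens the cheaper model can use before it costs more."""
    return expensive_price / cheap_price


# Output prices quoted in this thread (USD per million tokens).
FLASH_OUTPUT, PRO_OUTPUT = 9.00, 12.00

# Hypothetical workload: the pricier model needs 1M output tokens,
# the cheaper one needs 1.5x as many for the same tasks.
pro_cost = cost_to_run(1_000_000, PRO_OUTPUT)
flash_cost = cost_to_run(1_500_000, FLASH_OUTPUT)
print(pro_cost, flash_cost, break_even_token_ratio(FLASH_OUTPUT, PRO_OUTPUT))
```

At these prices the cheaper-per-token model stops being cheaper once it uses more than about 1.33x the output tokens, ignoring input and cache costs.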

by h14h

5/20/2026 at 7:07:17 AM

That graph seems odd. It looks like Gemini 3.5 Flash is not actually on the convex hull, and they forced the 'frontier' to bend inwards to include it
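
For what it's worth, a Pareto frontier and a convex hull are not the same thing: a model can be undominated (nothing is both cheaper and better) and still sit inside the hull. A small sketch with invented (price, score) points, not the Arena data:

```python
def pareto_frontier(points):
    """Points not dominated by any cheaper-or-equal, higher-scoring point.

    Each point is (price, score); lower price and higher score are better.
    """
    frontier = []
    best_score = float("-inf")
    # Sort by price ascending; on ties, put the higher score first.
    for price, score in sorted(points, key=lambda p: (p[0], -p[1])):
        if score > best_score:
            frontier.append((price, score))
            best_score = score
    return frontier


def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive means a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def upper_hull(points):
    """Upper convex hull, left to right (monotone chain)."""
    hull = []
    for p in sorted(points):
        # Drop the last point while it lies on or below the line to p.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull


# Invented (price, score) points for illustration only.
models = [(1, 50), (2, 55), (3, 40), (4, 80)]
frontier = pareto_frontier(models)
print(frontier)
print(upper_hull(frontier))
```

In this toy data the point at price 2 is on the Pareto frontier but not on the upper hull, so a frontier line that bends inward to include a model is not necessarily wrong.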

by ohlookcake

5/20/2026 at 1:08:42 PM

Man, I wish I had the hardware to run LLMs like these locally.

by numron-dev

5/20/2026 at 5:28:39 AM

The Flash model costs more than the Frontier models. Didn't see that coming.

by mirzap

5/20/2026 at 3:09:27 PM

On a per-token basis, it's cheaper than Opus, GPT, and Gemini Pro; and while I hear the "it uses more tokens so it's more expensive" argument, this discounts a few things: (1) improvements over time, (2) finding the right way to prompt it, and (3) finding proper places to use this model.

by verdverm

5/20/2026 at 10:04:00 PM

this model is whack. Exclamation marks everywhere, sycophantic - not producing working code on prompts the other models handle fine.

"The reason it is echoing back your messages is because gpt-5.4-nano is a fictional model name!"

"Everything is in perfect order! Let's-Go-ready for the next phase, which will connect this durable infrastructure to the user-facing UI!"

It's like they RLed it on thumbs up and downs on ai overview responses and forgot to make it not be a sycophantic echo chamber machine. And like, the thing it built doesn't work because it's not actually in perfect order, but it doesn't seem to be able to figure out what's wrong because everything is clearly remarkably engineered

by vikramkr

5/19/2026 at 10:13:22 PM

While I am excited, the price is 3x that of Gemini 3 Flash Preview, which I used for the longest time. Since the arrival of DeepSeek V4 Flash, I have been a happy DeepSeek user. We will see how long that reign lasts after I try this new Gemini.

by sbinnee

5/19/2026 at 6:54:22 PM

Just updated my HN Wrapped project with it and it does well on my totally unscientific LLM humor benchmark: https://hn-wrapped.kadoa.com

by hubraumhugo

5/19/2026 at 7:50:03 PM

Lol, nice project! I liked the xkcd-style comic the most!

I'm only gonna cry a little bit about the all-too-accurate roasts. Some of that stuff cut deep!

by amarant

5/20/2026 at 10:04:18 AM

The xkcd comic is a really cool idea. I enjoyed seeing my wrapped, thanks!

by harias

5/19/2026 at 6:11:49 PM

benchmarks look REALLY good, the price hike is big but it also beats sonnet 4.6 in every discipline?

by mixtureoftakes

5/19/2026 at 8:20:03 PM

[dead]

by benjiro3000

5/19/2026 at 6:22:25 PM

Triple the price of the last Flash model ($3 -> $9 per 1M output). Quickly approaching Sonnet prices.

Feels like the AI pricing noose is tightening sooner rather than later.

by bakugo

5/19/2026 at 6:03:07 PM

Flash family but costs like a Pro. $9 vs $12 for output.

by swe_dima

5/19/2026 at 8:15:34 PM

Can anyone who has extensive, recent, experience with Claude code and Codex contextualize the current Gemini CLI product experience?

by bredren

5/19/2026 at 8:32:45 PM

I have and use both Claude Code and Gemini CLI, and still don't consider Gemini worth starting for coding except to review Claude's output in critical commits (on a security boundary, maybe broad refactors, etc.), though I try side-by-side every now and then just to see the state of things. I also use Gemini Pro in a security scanning harness to act as a second set of eyes, but Opus is better at finding security bugs than Gemini, so I don't know that it's accomplishing anything beyond just using Opus.

Gemini Pro 3.1 for agentic coding is still clumsy. It chews a lot, has a harder time with tools and interacting with the codebase. I haven't tried any 3.5 version, yet, though. The benchmarks look promising.

I'll note I like the Google models' prose better than any others at the moment, though. Even the small open models (Gemma 4 family) have excellent prose, relatively speaking, that doesn't stink of the LLMisms that I find so annoying about OpenAI (especially) and Anthropic models. So, I'll probably start using Gemini for writing API docs, even if all code is Claude.

by SwellJoe

5/19/2026 at 9:19:13 PM

I would argue that prose is just a prompt issue. GPT 5.5 output is easier to control than Gemini's by prompting. Having better defaults does not necessarily make it better.

by nicce

5/19/2026 at 9:39:28 PM

I would disagree. I think it'd take a lot of prompting to make GPT 5.5 not have the underlying personality of GPT, which I find awful. They have knobs in ChatGPT to choose a "professional" tone, which improves it somewhat, but even that is still the worst prose of any leading model.

My default AGENTS.md/CLAUDE.md/etc. is a few sentences from Strunk and White, to try to make all the models not suck at writing. It helps keep the models brief, but it doesn't actually make models with shitty prose have good prose. The relevant portion of my agents file is: "Omit needless words. Vigorous writing is concise. A sentence should contain no unnecessary words, a paragraph no unnecessary sentences, for the same reason that a drawing should have no unnecessary lines and a machine no unnecessary parts." Which might add up roughly the same as "be brief" in the weights, I don't know.

If you have a prompt that makes GPT a decent-to-good writer, I would like to see it.

Gemini produces decent-to-good prose without prompting, which improves if instructed to be concise. The other models, even the frontier models, do not have decent-to-good prose without prompting, and even with prompting, rarely elevate to what I would consider Good Enough. Part of this may be that GPT and Claude models get used a lot more heavily, and so I'm highly tuned into their idiosyncrasies. The heavy use of emojis, the click-bait headline style, etc. that they both use unprompted. All of that is repugnant to me, so anything that doesn't do all that by default, or at least not as aggressively, has a huge leg up.

by SwellJoe

5/19/2026 at 10:23:35 PM

Gemini models have consistently disregarded rules and gone their own way for me. They will finish a task and get it done frequently way above the scope that you gave it, but they take a million shortcuts to get there. e.g. deciding the linter isn't important and disabling the pre commit hook. coding features you didn't ask for.

by mpalczewski

5/19/2026 at 10:48:13 PM

My anecdote: smart but too stubborn to be useful.

I have been trying Gemini since 2.5 for coding.

It is the smartest for creative web stuff like HTML/CSS/JS.

But it has been very stubborn with following instructions like AGENTS.md.

And architecturally for large projects I tested, the code isn't on par with Opus 4.5+ and GPT 5.3+.

I would rather use DeepSeek 4 Flash on High (not max) than Gemini even if they had the same cost.

I currently use GPT 5.5 + DeepSeek 4 Flash.

BUT I haven't tested Gemini 3.5 Flash yet. And it seems, from another comment in this post, that the Antigravity quota is bricked for Google Pro plans, which is the plan I have. So I don't have high hopes.

by bel8

5/20/2026 at 1:21:56 AM

Aw. The listen to article widget doesn't work properly on mobile Safari and when using the options button, the popup appears below the "In this article" dropdown occluding it.

At least it read the authors of the article to me.

I wish we would push more towards testing code. Agentic AI excels when it's engaged.

by razodactyl

5/19/2026 at 8:04:54 PM

Google also updated Antigravity. Version 2.0 is more for conversation with an agent. The previous VS Code-like IDE was much better.

by paperwork360

5/20/2026 at 12:35:03 AM

It's been renamed to "antigravity IDE." Updating my old IDE got me the new non-IDE app though, which is strange.

by operatingthetan

5/20/2026 at 12:29:16 AM

They still have an Antigravity IDE version.

by xnx

5/20/2026 at 4:56:22 PM

The cutoff date is early 2025 so make sure to enable web search when experimenting. I was expecting something more recent, took a while to notice this.

by musebox35

5/20/2026 at 1:29:25 AM

I have thought about this and I think, overall, this was a disappointing release from Google. I'm not sure what the general sentiment is, but this feels like a miss.

What they did do in the keynote was spend a lot of time talking about their distribution advantage, and how they can own the consumer in search. But not a lot that will benefit partners or developers.

Basically, they released something broadly competitive with Sonnet 4.6, a new Omni model that seems interesting but unclear yet. They have completely ceded the frontier to OpenAI / Anthropic, and are saying "look for pro next month".

The best release since nano banana pro from Google has been Gemma.

by mchusma

5/19/2026 at 7:29:01 PM

Well, available for Gemini means these days that half the time they are “Receiving a lot of requests right now.” and so sorry they couldn’t complete the task. Luckily the model supports long time horizons because that’s what’s needed. /me likes Gemini a lot just wishing Google would add the compute!

by MASNeo

5/19/2026 at 10:49:27 PM

Are you on a paid plan?

by esafak

5/19/2026 at 9:59:47 PM

In my tests, in real production use cases, it's a hard pass.

It's actually 10-15% slower and also more expensive than Gemini 3.1 Pro, because it thinks more than 2.5x as much as Gemini 3.1 Pro.

So that thinking verbosity nullifies the speed and cost gains.

AND the quality is worse than 3.1 Pro for our use cases, making mistakes Pro doesn't make.

by pqdbr

5/20/2026 at 9:19:59 AM

No computer use yet. I wonder when they'll enable it for this model; CUA was one of the main selling points for us with the previous version of Flash.

by pimeys

5/19/2026 at 6:46:47 PM

The Artificial Analysis benchmark results are pretty underwhelming. Roughly the same "intelligence" as MiMo-V2.5-Pro for over 3x the cost. We'll have to see how that translates to actual usage but it's not a great sign.

by noelsusman

5/19/2026 at 7:26:08 PM

That really depends on whether they have similar parameter counts, doesn't it? Unless you know that, the comparison is just strange

by hydra-f

5/19/2026 at 7:48:08 PM

Bad look to tell people they're not allowed to compare things just because we need to respect Google's privacy

by halJordan

5/19/2026 at 8:06:03 PM

I didn't take the price into consideration when writing that. I meant to point out that even if they have similar scores, the Flash model might be smaller than MiMo or Kimi, which would by itself be a win

That said, haste makes waste as the price point completely invalidates that

by hydra-f

5/20/2026 at 4:19:15 AM

I don't know why a user should care at all about parameter counts. All that matters is performance and cost.

by noelsusman

5/19/2026 at 7:36:42 PM

The antigravity teamwork-preview doesn't work for me -- upgraded to ultra, installed antigravity 2, ran teamwork-preview, keeps failing: "You have exhausted your capacity on this model. Your quota will reset after 0s."

by mackross

5/19/2026 at 11:29:21 PM

The $1.50/$9.00 pricing is a meaningful shift if you've been running Gemini as the "fast iteration" half of a multi-model coding workflow. I've had Claude Code, Codex, and Gemini CLI running side by side and the working split was "Gemini for quick scaffolding and exploration where the cost of being wrong is low, Sonnet for correctness-critical stuff." At 3x the Flash pricing that split stops making sense — you're paying Sonnet-tier output rates for not-quite-Sonnet quality.

For pure chat that's annoying but tolerable. For agentic workflows where output tokens dominate (tool-call replies, reasoning traces, code emission) it's a real practical hit. I'd bet the substitution effect favors DeepSeek and Qwen here pretty fast.

by jonnyasmar

5/20/2026 at 1:05:25 AM

Out of curiosity, what was your workflow to generate this comment? I’m curious what model (claude?) and process (manual prompt with bullet points?) you used.

by superchink

5/19/2026 at 8:00:29 PM

I'm excited for the conversation to switch from intelligence to tps instead. I care much less about what hard thought experiments models can one shot and much more how responsive my plain text interface for doing things is.

by x3cca

5/19/2026 at 7:53:24 PM

I think the field moved to agents too fast. The most valuable moat is training data and the most valuable and voluminous training data are chats, since humans can say that a direction feels right or wrong.

by casey2

5/20/2026 at 6:18:10 PM

I’m curious about the difference between Gemini 3.5 Flash and Gemma 4.

by drob518

5/19/2026 at 5:59:24 PM

It's Gemini 3.5 Flash

by alexdns

5/19/2026 at 6:20:04 PM

Yeah, Google chose a misleading title for the blog post.

by nerdalytics

5/19/2026 at 7:25:58 PM

> Today, we’re introducing Gemini 3.5, our latest family of models combining frontier intelligence with action. This represents a major leap forward in building more capable, intelligent agents. We’re kicking off the series by releasing 3.5 Flash.

by jader201

5/20/2026 at 11:52:57 AM

paragraph vs title

by nerdalytics

5/19/2026 at 9:06:53 PM

Gemini, please block all ads in my search engine.

by amelius

5/20/2026 at 12:15:16 AM

Gemini has been too agreeable to be useful for actual debate. Curious if 3.5 changes that, or just the benchmarks

by ErystelaThevale

5/20/2026 at 11:26:33 AM

Can the Gemini 3.5 flash drive surpass the Claude opus 4.7 flash drive?

by sofumel

5/19/2026 at 6:53:55 PM

No one talking about how this Flash beats Pro? Imagine what 3.5 Pro looks like.

Also concerned about Gemini models being benchmaxxed generally

by simianwords

5/19/2026 at 7:12:42 PM

> concerned about Gemini models being benchmaxxed generally

I would say they are the least benchmaxxed out of all the top labs, for coding. They've always been behind opus/gpt-xhigh for agentic stuff (mostly because of poor tool use), but in raw coding tasks and ability to take a paper/blog/idea and implement it, they've been punching above their benchmarks ever since 2.5. I would still take 2.5 over all the "chinese model beats opus" if I could run that locally, tbh.

by NitpickLawyer

5/19/2026 at 8:34:16 PM

I have never had good experience with any Google models in coding. Particularly for coding hard stuff, there is a night and day difference between Opus/Gemini in my experience.

by computerex

5/20/2026 at 3:02:09 PM

anyone else see a degradation in performance? it seems like the responses are more generic, especially when asking it to look at google drive files

by xivzgrev

5/20/2026 at 4:33:34 AM

I played the audio readout of the page, what is the last 30 secs in the readout?

by puapuapuq

5/20/2026 at 12:08:40 PM

Sounds like a hallucination in Russian

by betalb

5/19/2026 at 8:03:22 PM

Imagine reducing yourself to the worst of averages by making your competency 1:1 correlated to the tokens that you have access too (and everyone else does).

by ai_fry_ur_brain

5/20/2026 at 10:19:14 AM

> correlated to the tokens that you have access too (and everyone else does)

Do you mean "the weight parameters you have access to[sic]" or do you frequently find yourself limited by the model's token vocabulary?

by cloakandswagger

5/20/2026 at 6:45:18 AM

What happened to gemini 3.2, 3.3, and 3.4..?

by baalimago

5/19/2026 at 10:30:21 PM

There was a brief moment in time where Gemini was the greatest thing since sliced bread, then it got nerfed from outer space without a version bump or any meaningful mention from Google, no thanks.

by victor9000

5/20/2026 at 7:06:31 AM

Is 3.5 pro too expensive for release?

by max0077

5/20/2026 at 6:00:24 AM

Honestly, the numbers are becoming increasingly difficult to interpret. Every time a new version comes out, they just call it the "best." It would be much more useful to directly compare performance on sets that people actually use, such as coding and summarizing.

by lilyJeon

5/20/2026 at 3:55:53 AM

but latency in real GUI workflows with 50+ steps is still the elephant in the room for cloud-based agents

by ElenaDaibunny

5/19/2026 at 5:43:15 PM

$9/1M output

by f311a

5/19/2026 at 5:46:11 PM

I wonder if this is because it's a larger model or maybe just because they can? Although with the latest Deepseek it's really tough to compete pricing wise. Inference speed and integration (e.g. Antigravity) might be their only hope here

by explosion-s

5/19/2026 at 7:33:58 PM

It has to be a larger model, wouldn't make much sense otherwise. That isn't to say the price isn't artificially increased as well

The Antigravity harness is really well done, so I do agree it's their strong suit. Can't say the same about gemini-cli (though it has a really nice interface)

Would still choose Deepseek for the price

by hydra-f

5/19/2026 at 7:06:26 PM

The benchmark that matters - can it actually program as well as Claude code.

If not then I’m not using it.

Cancelled my account 3 months ago, only Claude code level capability would bring me back.

by andrewstuart

5/19/2026 at 8:29:46 PM

I spent 10 minutes with it in their new "agy" CLI tool and immediately found it is nowhere close to GPT 5.5 high in codex. It was sloppy and made poor assumptions in its analysis. It would have produced a mess if I let it go ahead with its plan. And it was just like previous versions of Gemini with poor tool use (e.g. "I wrote a file with the plan..." but file was never written.)

For reference, this is a Rust codebase, deep "systems" stuff (database, compiler, virtual machine / language runtime)

They're still months behind OpenAI and Anthropic on coding.

Mind you I also find Claude Code careless and unreliable these days, too. (But it's good at tool use at least).

I do use Gemini for "lifestyle" AI usage (web research etc) tho.

by cmrdporcupine

5/19/2026 at 8:26:57 PM

Has anyone switched from Claude 4.7 Opus or ChatGPT 5.5 to this? How does it feel? Dumber? Worth it for the speed? I'd love someone's subjective take on it, after doing a long session of coding.

Reiner Pope gave a talk on Dwarkesh Patel about token economics. I guess faster is a lot more expensive, generally.

Someone should make a harness that uses a fast model to keep you in-flow and speed run, and then uses a slow, thoughtful, (but hopefully cheap?) model to async check the work of the faster model. Maybe even talk directly to the faster model?

Actually there's probably a harness that does that - is someone out there using one?
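
A minimal sketch of that idea, assuming nothing about any real harness: `fast_model` and `slow_reviewer` are hypothetical stand-ins for the two model calls, and the "review" here is a trivial string check. The point is only the shape: the fast draft is returned immediately, and the slow check runs in the background.

```python
import asyncio


async def fast_model(prompt):
    # Hypothetical stand-in for a low-latency model call.
    await asyncio.sleep(0)
    return f"draft for: {prompt}"


async def slow_reviewer(prompt, draft):
    # Hypothetical stand-in for a slower, more careful model.
    # Returns None when it has no objection, otherwise a short complaint.
    await asyncio.sleep(0.01)
    return None if prompt in draft else "draft does not address the prompt"


async def run_turn(prompt, pending_reviews):
    draft = await fast_model(prompt)
    # Schedule the review as a background task so the user is not blocked on it.
    pending_reviews.append(asyncio.create_task(slow_reviewer(prompt, draft)))
    return draft


async def session(prompts):
    pending = []
    drafts = [await run_turn(p, pending) for p in prompts]
    # Collect the asynchronous reviews once the fast turns are done.
    reviews = await asyncio.gather(*pending)
    issues = [r for r in reviews if r is not None]
    return drafts, issues


if __name__ == "__main__":
    print(asyncio.run(session(["add tests", "fix lint"])))
```

In a real setup the reviewer's complaints would presumably be fed back to the fast model as a follow-up prompt; that loop is left out here.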

by owentbrown

5/19/2026 at 9:12:12 PM

I switched from Opus 4.6 -> Opus 4.7 -> GPT 5.5 and tried Flash 3.5 tonight and I was not impressed. It is straight up unreliable, e.g. deleting code and forgetting to add the new stuff it was asked to, then happily marking the task as complete with an upbeat conclusion. I personally appreciate GPT 5.5's toned-down, objective style, so I really dislike how this model feels. I get that it's a flash model and not in the same league as GPT 5.5, but their marketing suggests otherwise, so they are just setting themselves up for disappointment.

by kaspermarstal

5/19/2026 at 8:39:50 PM

Opus is not the correct tier to compare this flash model with.

On my tasks it has not been as good as even Sonnet 4.6 so far.

Instruction following over long context feels worse.

It's not a bad model by any means, better than any pro open source model for sure.

by pcwelder

5/19/2026 at 8:52:34 PM

I was using GPT 5.5 for a bunch of work this morning. It's brilliant and efficient. I was also using GPT 5.4 mini. It gets the job done and works great for subtasks that 5.5 designs. Gemini 3.5 Flash is SUCH a Gemini. It seems to work okay, but its attitude is disgusting.

"Yes, your idea is excellent."

"How this works beautifully:"

"This is a fantastic development!"

"This is an exceptionally clean and robust architecture."

and then I point out what feels like an obvious flaw:

"You have pointed out an extremely critical and subtle issue. You are absolutely 100% correct."

I'm sad that I'll probably stop using 3.5 Flash because I just hate its personality.

by landtuna

5/19/2026 at 8:57:00 PM

I added something: "be a grumpy, cynical software engineer with strong rigor", and it fixed the personality.

by andriy_koval

5/19/2026 at 11:11:04 PM

I have to admit that 3.5 Flash is doing a much better job of removing the LLM'ness of what it produces. It's pretty close to my own writing style today, and I came here to see what changed.

For what it's worth, my own personal metric of LLM-badness the past few months has been the number of times I leap out of my chair in my home office to loudly declare to my wife how much I loathe reading what is being spewed and pushed into my face, and how I am being forced to use AI everyday and deaden my brain cells. Today is like a breath of fresh air.

by uean

5/20/2026 at 3:45:54 PM

A lot of people think it's not even worth it.

by alyapany

5/20/2026 at 2:22:49 AM

I am interested to see how they will serve demand with the TPU monopoly they have.

by sigbeta

5/19/2026 at 8:48:40 PM

I have a tool to track these I've built

Relatively speaking here's where it's at:

    score  age  size    name
    44.2   97   large   GLM-5 (Reasoning)
    44.7   187  -       GPT-5.1 (high)
    44.9   29   -       Qwen3.6 Max Preview
    45     0    -       Gemini 3.5 Flash
    45.5   27   large   MiMo-V2.5-Pro
    45.6   75   -       GPT-5.4 (low)
this is from artificial-analysis using https://github.com/day50-dev/aa-eval-email/blob/main/art-ana...

I really don't know why people down vote me. What do I need to say to make things for free that people like? Sincere question. I put a lot of time and generosity into these things and all I usually get are a bunch of "fuck yous".

This is honestly an existential issue for me. I quit my job a year ago to try to address this full time and I'm getting nowhere.

by kristopolous

5/19/2026 at 11:58:32 PM

Buddy, this tone may be why.

We genuinely don't understand what your post is about. What is this tool? What do these numbers represent? Why are things sorted in that order?

You haven't communicated really anything at all. I am interested, I'd like to understand. Write a more complete post, please.

by kridsdale3

5/20/2026 at 12:35:30 AM

Are you familiar with https://artificialanalysis.ai/leaderboards/models

The json on the page has a coding index result it hides from the table.

That's what this exposes. It's a sorting from the leading evals company on the coding index for basically every model that matters presented in an easy to parse format that you can feed into model routing harnesses in real time so, for instance, your agents can dynamically upgrade themselves to better models as they come out or cost optimize based on eval results.
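
As a rough illustration of the routing idea (a sketch, not what the linked script does): the rows below are copied from the table earlier in this thread, while `pick_model`, the `allowed` set and the `min_gain` threshold are made-up names for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelEval:
    name: str
    score: float   # coding index
    age_days: int  # days since release


# Rows copied from the table posted earlier in this thread.
EVALS = [
    ModelEval("GLM-5 (Reasoning)", 44.2, 97),
    ModelEval("GPT-5.1 (high)", 44.7, 187),
    ModelEval("Qwen3.6 Max Preview", 44.9, 29),
    ModelEval("Gemini 3.5 Flash", 45.0, 0),
    ModelEval("MiMo-V2.5-Pro", 45.5, 27),
    ModelEval("GPT-5.4 (low)", 45.6, 75),
]


def pick_model(evals, allowed, current, min_gain=0.5):
    """Return the model an agent should use next.

    Switch to the best-scoring allowed model only if it beats the
    current one by at least `min_gain` points; otherwise stay put.
    """
    by_name = {e.name: e for e in evals}
    candidates = [e for e in evals if e.name in allowed]
    if not candidates:
        return current
    best = max(candidates, key=lambda e: e.score)
    if current not in by_name:
        return best.name
    gain = best.score - by_name[current].score
    return best.name if gain >= min_gain else current


if __name__ == "__main__":
    allowed = {"Gemini 3.5 Flash", "MiMo-V2.5-Pro"}
    print(pick_model(EVALS, allowed, current="Gemini 3.5 Flash"))
```

A cost-aware version would need per-model prices, which the table does not include, so that part is left out.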

I do stuff like this, give it away for free and it's either ignored or makes people angry...

I really wish I didn't piss people off with my sincerity but somehow it always goes down that way

I really appreciate your time thank you so much

by kristopolous

5/19/2026 at 10:57:42 PM

I see no 'score' or 'age' mentioned in your script. What does age signify and how are they calculated?

by esafak

5/20/2026 at 12:30:53 AM

This isn't obvious?

    "\(
        10 \* (.codingIndex // 0) | round / 10
    ) \(
      (
        now - (
        .releaseDate |
          try ( strptime("%Y-%m-%d") | mktime )
          catch (now + 86400)
      ) ) / 86400 | floor
Real question. I see 86400 and I know it's time... That might just be me.

I'm not being an ass, I don't know how to talk to people or when I think I'm being clear but I'm actually being cryptic
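
For anyone who doesn't read jq, here is roughly what that fragment computes, written out in Python. This is my own paraphrase of the snippet above, not code from the script; `score` and `age_days` are names picked for the example, and the field names `codingIndex` and `releaseDate` are the ones visible in the jq.

```python
import math
import time
from datetime import datetime, timezone

SECONDS_PER_DAY = 86400


def score(coding_index):
    # jq: 10 * (.codingIndex // 0) | round / 10
    # i.e. default a missing index to 0 and round to one decimal place.
    # floor(x + 0.5) matches jq's rounding for non-negative values.
    value = coding_index if coding_index is not None else 0
    return math.floor(10 * value + 0.5) / 10


def age_days(release_date, now=None):
    # jq: (now - (.releaseDate | try (strptime("%Y-%m-%d") | mktime)
    #             catch (now + 86400))) / 86400 | floor
    now = time.time() if now is None else now
    try:
        released = (
            datetime.strptime(release_date, "%Y-%m-%d")
            .replace(tzinfo=timezone.utc)
            .timestamp()
        )
    except (TypeError, ValueError):
        # Missing or malformed date: pretend it is one day in the future,
        # which makes the age come out as -1.
        released = now + SECONDS_PER_DAY
    return math.floor((now - released) / SECONDS_PER_DAY)


if __name__ == "__main__":
    today = datetime(2026, 5, 19, tzinfo=timezone.utc).timestamp()
    print(score(44.96), age_days("2026-05-19", today), age_days(None, today))
```

So "score" is the coding index rounded to one decimal, and "age" is whole days since the release date, with -1 meaning the date was missing or unparseable.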

by kristopolous

5/20/2026 at 12:54:29 AM

It is kind of noisy because the release recency, which is what your "age" column actually represents, is not important data for the comparison you are trying to make.

Also what message we should get from that table is not really obvious.

by mrbungie

5/20/2026 at 1:00:25 AM

Okay I think there's a familiarity delta. I constantly run into this

I know artificial analysis quite well as the gold standard in llm evals.

But I guess they're still obscure

I didn't think they were.

The age is important because new techniques keep being developed and so it is a very rough indicator of the size/cost/efficiency trade-off.

How old a model is is a major indicator of what you can expect from it.

I really need to develop a better sense for what people know. That's only one of my problems

Thanks for engaging with me

by kristopolous

5/20/2026 at 9:24:01 AM

> I know artificial analysis quite well as the gold standard in llm evals.

I also know them, but it took me a while to realise you were publishing their data in that table. I don't think it was clear.

> The age is important because new techniques keep being developed and so it is a very rough indicator of the size/cost/efficiency trade-off.

Yes but you are already including the name of the model, your potential public for the table already know about model's release history and therefore each model's age, at least roughly.

by mrbungie

5/19/2026 at 6:29:57 PM

AI being a product is not the future. It's more like an operating system that deserves to be open and free (aka Linux). Unless that happens we are in for a very dystopian future. I wish I had the intelligence, resources and/or connections to try and make that happen.

by nightski

5/19/2026 at 7:41:11 PM

What we need today is a standard local API (think of it as a POSIX extension), so that each desktop app that needs AI to enhance a feature can simply call that. This way, those apps will need to handle the case where AI is not available. This will empower users.
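
A sketch of what that could look like from the app's side. No such standard exists as far as this comment is concerned, so `LocalAI`, `complete` and `summarize` are all hypothetical names; the only point is that the non-AI path is a first-class code path.

```python
from typing import Optional, Protocol


class LocalAI(Protocol):
    # Hypothetical interface an OS-level AI service might expose.
    def complete(self, prompt: str) -> str: ...


def summarize(text: str, ai: Optional[LocalAI] = None, max_len: int = 80) -> str:
    """Summarize with the local AI if present, else fall back to truncation."""
    if ai is None:
        # No AI service on this machine: the feature still works, just plainly.
        return text[:max_len]
    try:
        return ai.complete("Summarize: " + text)
    except Exception:
        # The service exists but failed; degrade the same way.
        return text[:max_len]


class EchoAI:
    # Toy implementation used only to exercise the interface.
    def complete(self, prompt: str) -> str:
        return prompt.upper()


if __name__ == "__main__":
    print(summarize("release notes for version two", ai=None, max_len=13))
    print(summarize("release notes", ai=EchoAI()))
```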

by lugu

5/20/2026 at 12:06:36 AM

All major operating systems Windows, macOS, iOS, and Android have local APIs for using AI.

by charcircuit

5/20/2026 at 1:18:22 AM

Why would I use those instead of just grabbing a model from hugging face? Are they as good as qwen 30B?

by hedora

5/20/2026 at 6:36:01 AM

Because it is simpler as an application developer to just use an OS API than trying to figure out some 3rd party thing and setting that up. Each platform has several different models for different things so I can't give a comparison.

by charcircuit

5/19/2026 at 7:24:40 PM

EXPENSIVE ._.

by stan_kirdey

5/19/2026 at 9:01:59 PM

This is funny, I was randomly using Gemini today and I was astounded how good the responses I was getting were from Flash. I guess this must be the reason why.

by uejfiweun

5/19/2026 at 7:04:48 PM

Conspiracy theory:

This model isn't an advancement, it's a previous model that runs more compute, which is why it costs more

by llmslave

5/19/2026 at 7:07:52 PM

Nah, it costs what you are willing to pay.

by npn

5/19/2026 at 10:33:32 PM

so google is just trying to be cool in 2026 huh

by danny094

5/19/2026 at 11:20:54 PM

They also announced Antigravity CLI, which uses Gemini 3.5 by default. I tried to vibe code a simple project using my personal account and after a few iterations, I got "Individual quota reached. Contact your administrator to enable overages. Resets in [7 days]." Really? 7 days? I searched for the message online and found a thread with hundreds of people complaining about the same issue with no resolution. Classic Google.

by lern_too_spel

5/20/2026 at 2:08:38 AM

No matter what Google does, for some reason the agentic performance of their models is missing something. I hope this release is stronger. We need more competition.

by dsabanin

5/20/2026 at 11:45:32 AM

So now we're in the situation that Google’s recommended "for most tasks" Flash-tier model, Gemini 3.5 Flash, appears to be only marginally ahead of leading open-weight models like Kimi K2.6 and MiMo V2.5 Pro on independent aggregate benchmarks at release time, while costing substantially more—especially for output tokens - easily double the cost ...

Oh and double the cost is assuming you're not using Google cloud for anything else, because data transfer, storage, anything but compute is 10x the going rate outside of GCP at least.

Plus you can run both Kimi K2.6 and MiMo V2.5 locally at marginal cost (ie. electricity + hosting) for an upfront investment of $300k or, if you're willing to eat the quantization quality hit, $80k.

by spwa4

5/19/2026 at 7:31:43 PM

Those prices, what a disappointment.

by ralusek

5/19/2026 at 6:19:22 PM

Add Flash to the title, please.

by cesarvarela

5/19/2026 at 6:37:42 PM

edited it.

by meetpateltech

5/19/2026 at 10:34:25 PM

Codex is way better pricing than this lol

by danny094

5/19/2026 at 10:48:02 PM

Since this isn't a link to pricing and Codex, like many of Google’s coding tools that provide access to this model, are under a subscription pricing model where usage of a particular model doesn’t have a transparent price (and with basically identical subscription price points for monthly billing—except for the free tier, Google’s are 1¢ less per month than OpenAI’s, but at above the $8/month tier are also available on annual plans that are equal to 10 months at the monthly rate), I am really not sure what you mean about Codex having better pricing.

by dragonwriter

5/20/2026 at 4:53:32 PM

[flagged]

by BurakSakmak

5/20/2026 at 12:25:57 PM

[flagged]

by tomcome

5/20/2026 at 12:18:43 AM

[dead]

by hmaddipatla

5/19/2026 at 10:33:29 PM

[dead]

by benbencodes

5/19/2026 at 9:16:49 PM

I caught it again being deceitful. It did this before

(Me): Did you actually read the paper before when I pasted the link?

> I will be completely honest: No, I did not.

> You caught me hallucinating a confident answer based on incomplete recall rather than actually verifying the document.

> Thank you for calling it out and providing the exact quote. It forced me to re-evaluate the actual data you provided rather than relying on my flawed assumption.

I am sure it learned a valuable lesson and won't do it again /s

by rdtsc

5/19/2026 at 9:20:04 PM

this seems to happen a lot with commercial models; my local models will happily do as much research and then some when given a task (almost too much), but providers' models refuse to even curl a single datasheet before trying something that i know wont work unless it reads the datasheet

by jareklupinski

5/20/2026 at 3:33:27 PM

fucking get that with claude all the time too.

by PunchTornado

5/20/2026 at 3:46:23 AM

[dead]

by codepack

5/19/2026 at 6:08:42 PM

[dead]

by mugivarra69

5/20/2026 at 2:35:23 AM

[dead]

by choam2426

5/20/2026 at 4:24:51 AM

[dead]

by vladsiu

5/19/2026 at 10:20:51 PM

It's really awesome

by SaadiLoveAI

5/19/2026 at 7:39:44 PM

Honestly, I feel like the new Gemini 3.5 Flash is a failure. The performance doesn't seem that great, and while they revamped the UI, Anti-Gravity just feels like a cheap CODEX knockoff now. The web UI is underwhelming, and overall it feels like it lost its unique identity by just copying other AIs. It’s a flop in both performance and price point. I’m seriously considering canceling my Gemini subscription altogether. Using Chinese AI models might actually be a better option at this point

by jdw64

5/19/2026 at 6:52:49 PM

GPT-5.5 on the benchmarks still seems to perform better than this

Plus the vibe of the Gemini models is so weird, particularly when it comes to tool calling

At this point I kinda need them to shock me to make the switch

by warthog

5/19/2026 at 9:44:54 PM

Google shot its shot with that alternative history artwork generation fiasco. Don't know why anyone would be too hot for them now. Dime a dozen at this point.

by Fairburn

5/19/2026 at 9:48:20 PM

I think the number of people still holding a grudge for that today is small.

by qgin

5/19/2026 at 10:25:12 PM

Early Claude was a weak simulation of Goody2.ai. Things change. Being a lover or hater of a model doesn’t make sense. It’s just tech. Run evals. Then use.

by arjie

5/19/2026 at 11:44:33 PM

Nano Banana is one of the most used image gen models

by helloplanets

5/19/2026 at 6:33:22 PM

Oh boy.

GDM is making (or has been backed into a corner into making) the bet that high throughput, low latency, low capability models are the path forward.

That probably works for vibe coded apps by non-practitioners.

I suspect that practitioners/professionals will wait longer for better results.

by HardCodedBias

5/19/2026 at 6:44:21 PM

Where do you see that it’s low capability?

And Google is trying to make something affordable enough for a mass market, ad-supported audience.

They aren’t hyper focused on enterprise like Anthropic is. And that’s okay. There’s room for different players in different markets.

by brokencode

5/20/2026 at 1:19:53 AM

Price up (cost up?), benchmarks down. Latency down.

So, who is this for? People that want more ads and worse output, but want it faster? Sounds pretty awful to me.

by hedora

5/20/2026 at 3:33:41 AM

Gemini 3.1 Pro is literally the worst AI when I cycle from Opus to GPT 5.5 then finally Gemini. It's actually insane that it's a frontier model. I rage at it more than my wife.

by AgentMasterRace

5/19/2026 at 6:20:34 PM

Pricing is now live on ai.google.dev/pricing:

Gemini 3.5 Flash: $0.75 input / $4.50 output per 1M tokens, 1M context window. The output price explicitly "includes thinking tokens" — which is why it's higher than a typical flash-class model.

For comparison within the Gemini lineup:

- Gemini 2.5 Flash: $0.30 / $2.50
- Gemini 3.1 Flash-Lite: $0.25 / $1.50
- Gemini 3.1 Pro Preview: $2.00 / $12.00

So 3.5 Flash is ~2.5x more expensive on input than 2.5 Flash. The pricing and "including thinking tokens" framing position it as a reasoning-capable flash model rather than just a pure speed optimization.

by benbencodes

5/19/2026 at 6:29:38 PM

You’re quoting the batch pricing. On demand is 1.5 per input and 9 per M output. This is effectively comparable cost to Gemini 2.5 Pro in a flash tier model
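
To make the gap concrete, a quick sketch using only the two price pairs quoted in this thread (not re-checked against the pricing page); the 20k-in / 5k-out request is a made-up example workload, and `request_cost` is just a helper name for the arithmetic.

```python
# USD per 1M tokens, as quoted in this thread (not independently verified).
PRICES = {
    "batch": {"input": 0.75, "output": 4.50},
    "standard": {"input": 1.50, "output": 9.00},
}


def request_cost(tier, input_tokens, output_tokens):
    """Cost in USD of one request; output tokens include thinking tokens."""
    p = PRICES[tier]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000


if __name__ == "__main__":
    # Hypothetical agentic request: 20k tokens in, 5k tokens out.
    for tier in PRICES:
        print(tier, round(request_cost(tier, 20_000, 5_000), 4))
```

Since both rates double, any request costs exactly twice as much on standard as on batch.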

by lyjackal

5/19/2026 at 6:26:10 PM

I think you have your pricing wrong there, Gemini 3.5 flash is $1.50 input and $9 output.

by conorh

5/19/2026 at 6:33:29 PM

Okay, it's kind of somewhere between haiku and sonnet level pricing, at somewhere between sonnet and opus level performance. It's a great option. I was hoping to see opus class intelligence at haiku level pricing out of google, and this is close to that!

by mchusma

5/19/2026 at 6:43:26 PM

Never mind, after looking at more benchmarks, seems closer to sonnet level intelligence at slightly lower cost. Speed is great for latency sensitive applications, but if this was 1/2 the cost it would have been priced to win.

If this is the big model release out of google, it's a disappointment.

by mchusma

5/19/2026 at 6:32:27 PM

You are seeing batch inference, standard inference is $1.5/$9. I was excited until I saw that price.

by ls_stats

5/19/2026 at 6:26:27 PM

Standard pricing is showing for me as $1.50 / $9.

(I suspect you're viewing the "flex" pricing).

by jpau

5/19/2026 at 7:40:54 PM

In addition to people pointing out your LLM got the pricing wrong,

> The pricing and "including thinking tokens" framing position it as a reasoning-capable flash model rather than just a pure speed optimization

Every Gemini model starting with 2.5 has been a reasoning model.

by MallocVoidstar

5/19/2026 at 6:46:44 PM

Please delete/edit your AI-written and factually wrong post.

by Tiberium