Popping the GPU Bubble

6/30/2026 at 5:49:40 AM

I really appreciate this type of article. I feel like a lot of knowledge in LLM training and inference is locked inside the heads of practitioners. Similar to compiler engineers before.

To work in LLM training/inference you’re expected to know this stuff but to know this stuff you need to be working in the space.

by blueblazin

6/30/2026 at 6:15:39 AM

Thank you for the kind words. We will write and share more of these.

by radq

6/30/2026 at 9:31:33 AM

> Similar to compiler engineers before.

I guess the difference here being that we have ample compiler literature and practically know 99% of all there is to know about compilers that exist in the wild vs this new field.

Until we’ve gathered and agreed on a few “dragon books” for LLMs and have explored all there is to LLMs, you’re probably right - know-how will be with the practitioners and in source code until it’s distilled (pun intended).

by alfiedotwtf

6/30/2026 at 9:38:02 AM

Better comparison would be low level code running on smaller chips. Intersection of hardware and software engineering

by Melatonic

6/30/2026 at 9:10:00 AM

Most industries are like that.

by someonebaggy

6/30/2026 at 6:08:59 AM

Gentle reminder that while most money is spent on LLM inference, the vast majority of useful AI use is in fact not LLMs. Also, more and more work is poured into making small models. One thing I like about the whole export controls saga is that people are finding creative ways to squeeze performance out of these devices as witnessed in this post. But, if you then look at solutions like vLLM, vLLM will just fill whatever VRAM is available, no matter the context size, or the model size. So then you have two things to worry about:

First, how do you know exactly what the optimal VRAM assignment per model and per context size is, which currently seems to be based purely on experience? And second, how do you make sure that only that amount is available to your infra/containers, which is being handled by DRA and stuff like https://project-hami.io

This is only tangentially related to the blog post, but the title is picked in such a way that I couldn't help but put the shameless plug here. When he wrote "popping the bubble", I thought we were talking about devices and reducing NVIDIA dependency, but this seems very focused on CUDA.

Disclaimer: I work with Dynamia.ai, the founders of which created HAMi.

by rjzzleep

6/30/2026 at 9:21:02 AM

> the vast majority of useful AI use is in fact not LLMs

Can you explain what you mean here? Are you talking about small neural networks doing specific tasks?

by esperent

7/1/2026 at 3:21:57 PM

All sorts of optimizations. Of course vision is huge. Lots of production use in all sorts of manufacturing. Lam Research had a few talks on semiconductor manufacturing optimization. There is also CUDA-assisted RAN.

Maybe AI is a bit of a misnomer, since everything ML at some point just started getting called AI.

by rjzzleep

6/30/2026 at 5:51:51 AM

Different bubble than the one I was hoping for.

This appears to be different than the recent "Speculative Pipeline Decoding" paper: https://arxiv.org/abs/2605.30852

by gardnr

6/30/2026 at 8:26:58 AM

As someone who works in the field, the blog is nice but it has a lot of CODEX fingerprints on it, and it's also very specific to the size of the model in question in a way that is not explicit from the blog until the very last section.

In general, for some reason CODEX loves CUDA streams; it's the first optimization it goes for every time when writing GPU kernels. However, in many cases this is not the bottleneck. It happens to be one here because the model in the blog is small (a 2.4ms FW-pass is tiny, and 9B params sit on a single GPU). Large models are closer to 30-40ms. The CPU-GPU sync is 1-2ms, so when working on larger MoE models, scheduling tokens this way is much less important than, for example, scheduling of computation/communication or kernel optimization.

I wish the blog would state this at the start with the premise of what has been done, or show that this is indeed the bottleneck with some benchmarking. Otherwise it is kind of overselling things imo.
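
A back-of-the-envelope sketch of the arithmetic behind this point, using only the figures quoted in this comment (2.4ms and 30-40ms forward passes, 1-2ms CPU-GPU sync). The model is an assumption for illustration, not something taken from the blog: it treats each step as one forward pass followed by a fully exposed sync, and `bubble_fraction` is a made-up helper name.

```python
def bubble_fraction(forward_ms: float, sync_ms: float) -> float:
    """Share of each step the GPU sits idle, assuming the sync is fully exposed."""
    return sync_ms / (forward_ms + sync_ms)


# Figures quoted in the comment; pairing them into a grid is illustrative only.
for forward_ms in (2.4, 30.0, 40.0):
    for sync_ms in (1.0, 2.0):
        share = bubble_fraction(forward_ms, sync_ms)
        print(f"forward={forward_ms:>4} ms  sync={sync_ms} ms  idle={share:.1%}")
```

Under that assumption the 2.4ms model idles for roughly 29-45% of each step, while the 30-40ms models idle for roughly 2-6%, which is why hiding the sync matters far more for small models.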

by augment_me

6/30/2026 at 8:45:56 AM

Appreciate you saying the blog was nice. Not sure what you mean by "CODEX fingerprints", but I'll engage with the other points. We work on small models, and our customers want real-time inference on modern GPUs. The sub-title says "near-realtime VLM inference". 20-30ms forward passes are a non-starter for these workloads.

If you scroll down to the section titled "A cost model for the bubble", you will find both benchmark results and us saying, "you get back anywhere from a few percent to a third; more the faster your accelerator/model is".

by radq

6/30/2026 at 12:13:48 PM

My comment is aimed to highlight that the "GPU Bubble" is framed as a general solution when it's not; it's a specific bottleneck based on your model size. You don't mention your model size anywhere, the reader has to infer it from the runtimes, and if they don't know the average forward pass of a model, well too bad, they will leave without understanding the actual trade-off.

The benchmarks you point to in the section titled "A cost model for the bubble" don't include any CPU overheads or the T_block-T_pipe you mention, they just give the improvement %.

In general, your answers here in the thread read as defensive and unhumble. They leave a sour taste of your company, you should consider how you engage with your audience.

by augment_me

6/30/2026 at 5:50:21 AM

> you find that the GPU often sits idle, not for lack of work, but because the CPU hasn't told it what to do next yet. This phenomenon is called a GPU bubble.

This is true, but I've never heard anyone refer to this as a GPU bubble before.

I think most people hear "GPU bubble" and think of a financial bubble of some kind.

by nl

6/30/2026 at 6:04:03 AM

It appears to be a real term? https://docs.vulkan.org/tutorial/latest/Synchronization/Asyn...

Very odd, but perhaps more familiar to graphics programmers? I will say I'd probably call it a stall, which is exactly what the Vulkan docs call it moments later, so :shrug:

by SCdF

6/30/2026 at 6:10:55 AM

"bubble" used to be used a lot more when talking about very deep pipelines, eg Pentium 4 depth.

by kibibu

6/30/2026 at 8:09:46 AM

Or in the case of my poor Verilog, even very short pipelines :(

by tux3

6/30/2026 at 9:34:14 AM

And before that, graphics programmers called it vertical retrace :upsidedown:

by alfiedotwtf

6/30/2026 at 3:42:06 PM

I'm a rendering engineer and have used this term frequently.

It's actually very common in rendering to not be able to easily fill in the gaps, so we frequently deliberately introduce an extra frame of latency, so that the GPU is rendering jobs for the latter half of the rendering passes for frame N+1 and the early half of the rendering passes for frame N+2 while frame N is visible. A frame still takes the same total GPU time to render, but the gaps between jobs on a single frame can be usefully filled with something else from the other.

by ralferoo

6/30/2026 at 6:02:02 AM

It's very common to call it a GPU bubble in gamedev, though not strictly for CPU induced bubbles.

by cma

6/30/2026 at 7:41:06 AM

Pretty sure that would be "[GPU performance] bottlenecked [by the CPU]" in most common terms.

by spaqin

6/30/2026 at 10:38:31 AM

I thought it was normal for the AI field to confuse people by repurposing other terms of art? To: "transformer", "lora", "diffusion", "hallucination", etc, we can now add "bubble".

by Eisenstein

6/30/2026 at 3:20:54 PM

it's very much an in-domain term for folks in machine learning. heavily used when pipeline parallelism caught on in training https://alband.github.io/doc_view/pipeline.html

by brrrrrm

6/30/2026 at 7:03:23 AM

while the title is misleading, when reading GPU profiling data, we do call these bubbles - where the GPU _could_ do something, but it's idle.

any time your GPU is idle = you are losing $$$ = your TCO is going up. you don't want that.

by _zoltan_

6/30/2026 at 6:16:25 AM

I saw it in literature on cpu pipelines in quotes, never without.

by vkazanov

6/30/2026 at 6:45:28 AM

I've never seen it in quotes, but yeah it is a very common term in pipelined CPUs.

by IshKebab

6/30/2026 at 5:54:58 AM

The term I would use would be “underutilised”

by rusk

6/30/2026 at 5:59:36 AM

"stall" is the best term I can think of as in "pipeline stall".

Better term, anyone?

by barries11

6/30/2026 at 7:05:12 AM

it's not stalled, as that would imply that it waits for something, which is not necessarily the case with bubbles. most often it shows lack of proper pipelining or wrong pipeline dependencies (pipeline A waits for pipeline B, pipeline C waits for pipeline B, while pipeline B waits for an event X, now you've just made all three pipelines stalled on event X - not good).

by _zoltan_

6/30/2026 at 7:16:11 AM

When an engine stalls, the implication is that the chain reaction that drives it is failing - I don’t think that is the case with a GPU, as it will quite happily sit there drawing watts til you give it things. In systems nomenclature the inverse term for bubble is utilisation: this or that link or node is using x% of its capacity. Indeed, if you monitor your GPU with nvidia-smi you will see that very term in the instrumentation.
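
For anyone who wants to watch that number, here is a minimal Python sketch that reads the utilisation field from nvidia-smi. The `--query-gpu=utilization.gpu` and `--format=csv,noheader,nounits` options are existing nvidia-smi flags; the helper names `parse_utilization` and `read_utilization` are made up for illustration. Keep in mind that nvidia-smi reports the percent of time over the sample period during which at least one kernel was running, so short bubbles between kernels can be averaged away.

```python
import subprocess

# Ask nvidia-smi for one bare integer (percent) per GPU, one GPU per line.
QUERY = ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"]


def parse_utilization(text: str) -> list[int]:
    """Parse nvidia-smi CSV output, skipping lines that are not plain integers
    (for example fields reported as unsupported)."""
    return [int(line.strip()) for line in text.splitlines() if line.strip().isdigit()]


def read_utilization() -> list[int]:
    """Run nvidia-smi once; needs an NVIDIA driver installed on the machine."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_utilization(out)


if __name__ == "__main__":
    try:
        print(read_utilization())
    except (FileNotFoundError, subprocess.CalledProcessError) as err:
        print(f"nvidia-smi not available: {err}")
```

A profiler timeline is still the better tool for seeing individual bubbles; this only gives the coarse average.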

by rusk

6/30/2026 at 5:52:56 AM

Yes, the title seems off - I also thought I was going to be reading about the AI/pricing bubble.

by nnevatie

6/30/2026 at 10:32:31 AM

The real GPU bubble will be when AI companies figure out they can better make their own ASICs and ditch all their GPUs onto the market.

by amelius

6/30/2026 at 10:39:37 AM

In data center operations, GPUs have some specific lifetime. Because datacenter GPUs are currently so expensive and hard to get, they don't get dropped on the market at some point (even if a better replacement has arrived), but are used as long as possible.

Even if the AI companies decide to use their own ASICs, they will rather slowly, but continuously introduce them, while removing GPUs that have reached their end of life.

by aleph_minus_one

6/30/2026 at 11:27:13 AM

Yes, short term this is right. But at some point PyTorch will have a model.toVHDL() method, and we'll have a PCBWAY-style website for tapeout of the circuit. Nvidia's future looks less bright than they think and their GPU market will certainly pop.

by amelius

6/30/2026 at 2:29:32 PM

Doesn't that assume that VHDL is trivial? I feel like there are tons of performance tradeoffs or hardware designers wouldn't have jobs

by knollimar

6/30/2026 at 9:51:33 PM

No it does not assume that. Some very smart people will write that model.toVHDL() function. And keep in mind that a DL model is only a very small subset of what you can use VHDL for, and most models will have a very similar implementation in hardware from a conceptual point of view.

And don't take it too literally, VHDL could be replaced by other hardware design languages, maybe even at lower abstraction levels.

by amelius

7/1/2026 at 10:38:04 AM

Not trying to take it literally, but aren't there costs vs performance tradeoffs? Like the py.toHDL would have like (maxSize,maxCost,minThroughput) as free and that would determine energy usage?

And a GPU is already pretty optimized for inference, no? Like isn't it a bunch of FP mults? I don't think HDLs do well with that, either.

by knollimar

6/30/2026 at 2:53:15 PM

I can't imagine that model lifetimes will ever justify using model-specific ASICs for public serving (maybe something like serving fixed certified AI models in a vehicle or robot) over more generic GPUs/NPUs until after the AI bubble pops.

by dragonwriter

6/30/2026 at 4:16:45 PM

Be aware that currently the hardware costs and electric bill are two huge problems of modern LLMs.

If such AI models will deliver on their qualitative promises, and just the huge cost is the burden to overcome, custom ASIC might be a part of the solution.

If, on the other hand, AI models will still be unsuitable for many applications because of their qualitative issues, it is a much harder and different problem to solve - in this case, the AI bubble will plausibly burst.

by aleph_minus_one

6/30/2026 at 8:43:29 AM

Regarding the critique on the title: perhaps an analogy can be made to propeller cavitation on ships. Water influx rate, propeller design and operational parameters all influence the detrimental effect of water bubbles forming — deteriorating the system's efficiency.

The GPU would be the propeller, the influx is the work, and the operational parameters is what this article's about.

by tjoekbezoer

6/30/2026 at 8:54:30 AM

I'm disappointed with the commentary here. "GPU bubble" is an industry standard term, and literally how I would describe this to my colleagues in the industry. Look for example at the second slide here https://media.steampowered.com/apps/valve/2015/Alex_Vlachos_...

by radq

6/30/2026 at 12:37:47 PM

Just trying to be helpful by making an adequately coined term more palatable to a critical audience, thereby expediting the end of a fruitless discussion on an otherwise excellent article. Compliments.

by tjoekbezoer

6/30/2026 at 11:04:02 AM

I thought this was going to be an announcement of another GPU manufacturer :(

by NooneAtAll3

6/30/2026 at 7:10:15 AM

I love the brand name, Moondream

by Schlagbohrer

6/30/2026 at 8:20:47 AM

That's a terrible name for that and I can't say that Hanlon's razor applies. Bubble that everyone's knowingly referring to is the stock market collapsing like in 2001. To choose a headline that can be mistaken for that just to get clicks is shit. You could've called it GPU-CPU pipeline stall, but no, you intentionally chose a name that would be confused for something else just to get clicks?

by fragmede

6/30/2026 at 8:48:39 AM

This is what people in the field call it. I'm sorry you're offended.

by radq

6/30/2026 at 9:49:50 AM

You. You are people in the field. You can choose to name it anything else in the article that you just wrote. "We call it the GPU-CPU pipeline stall, but others might call it the GPU bubble."

by fragmede

6/30/2026 at 2:58:00 PM

The term is much older than the current GPU craze though. You're trying to regulate how experts in a field communicate, which is... weird.

by ksbd-pls-finish

6/30/2026 at 9:30:19 AM

Yeah the title is obviously clickbait.

by cubefox

6/30/2026 at 10:02:50 AM

Yeah, it works though, and the content is genuine enough which I guess trumps the issue with the title for me ;)

by dingdingdang

6/30/2026 at 3:58:12 PM

[flagged]

by investmuse