The Missing Nvidia GPU Glossary

1/14/2025 at 7:09:57 PM

The weird part of the programming model is that threadblocks don't map 1:1 to warps or SMs. A single threadblock executes on a single SM, but each SM has multiple warps, and the threadblock could be the size of a single warp, or larger than the combined thread count of all warps in the SM.

So, how large do you make your threadblocks to get optimal SM/warp scheduling? Well it "depends" based on resource usage, divergence, etc. Basically run it, profile, switch the threadblock size, profile again, etc. Repeat on every GPU/platform (if you're programming for multiple GPU platforms and not just CUDA, like games do). It's a huge pain, and very sensitive to code changes.

People new to GPU programming ask me "how big do I make the threadblock size?" and I tell them go with 64 or 128 to start, and then profile and adjust as needed.

Two articles on the AMD side of things:

https://gpuopen.com/learn/occupancy-explained

https://gpuopen.com/learn/optimizing-gpu-occupancy-resource-...

by jms55

1/14/2025 at 7:42:50 PM

I was taught that you want, usually, more threads per block than each SM can execute, because SMs context switch between threads (fancy hardware multi threading!) on memory read stalls to achieve super high throughput.

There are, ofc, other concerns like register pressure that could affect the calculus, but if an SM is waiting on a memory read to proceed and doesn’t have any other threads available to run, you’re probably leaving perf on the table (iirc).

by bassp

1/14/2025 at 10:30:40 PM

> I was taught that you want, usually, more threads per block > than each SM can execute, because SMs context switch between > threads (fancy hardware multi threading!) on memory read > stalls to achieve super high throughput.

You were taught wrong...

First, "execution" on an SM is a complex pipelined thing, like on a CPU core (except without branching). If you mean instruction issues, an SM can up to issue up to 4 instructions, one for each of 4 warps per cycle (on NVIDIA hardware for the last 10 years). But - there is no such thing as an SM "context switch between threads".

Sometimes, more than 432 = 128 threads is a good idea. Sometimes, it's a bad idea. This depends on things like:

Amount of shared memory used per warp

* Makeup of the instructions to be executed

* Register pressure, like you mentioned (because once you exceed 256 threads per block, the number of registers available per thread starts to decrease).

by einpoklum

1/15/2025 at 12:27:08 AM

Sorry if I was sloppy with my wording, instruction issuance is what I meant :)

I thought that warps weren't issued instructions unless they were ready to execute (ie had all the data they needed to execute the next instruction), and that therefore it was a best practice, in most (not all) cases to have more threads per block than the SM can execute at once so that the warp scheduler can issue instructions to one warp while another waits on a memory read. Is that not true?

by bassp

1/15/2025 at 9:59:35 AM

> warps weren't issued instructions unless they were ready to execute

This is true, but after they've been issued, it still takes a while for the execution to conclude.

> it was a best practice, in most (not all) cases to have more threads per block than the SM can execute at once

Just replace "most" with "some". It really depends on what kind of kernel you're writing.

by einpoklum

1/15/2025 at 10:10:56 AM

The GPU Glossary mentions that a warp scheduler can context switch https://modal.com/gpu-glossary/device-hardware/warp-schedule... but you said there is no such thing as an SM "context switch between threads". Is there some ambiguity in context switch

by delifue

1/14/2025 at 8:39:48 PM

Pretty sure CUDA will limit your thread count to hardware constraints? You can’t just request a million threads.

by saagarjha

1/14/2025 at 8:44:20 PM

You can request up to 1024-2048 threads per block depending on the gpu; each SM can execute between 32 and 128 threads at a time! So you can have a lot more threads assigned to an SM than the SM can run at once

by bassp

1/15/2025 at 4:29:14 AM

Right, ok. So you mean a handful of warps and not like a plethora of them for no reason.

by saagarjha

1/14/2025 at 8:47:24 PM

Thread counts per block are limited to 1024 (unless I’ve missed and change and wikipedia is wrong), but total threads per kernel is 1024(2^32-1)65535*65535 ~= 2^74 threads

https://en.wikipedia.org/wiki/Thread_block_(CUDA_programming...

by buildbot

1/15/2025 at 4:32:35 AM

Yeah I’m talking about the limit per-block.

by saagarjha

1/14/2025 at 9:08:08 PM

100% -- there's basically no substitue for benchmarking! I find the empiricism kind of comforting, coming from a research science background.

IIUC, even CuBLAS basically just uses a bunch of heuristics that are mostly derived from benchmarking to decide with kernels to use.

by charles_irl

1/14/2025 at 10:25:04 PM

> It's a huge pain, and very sensitive to code changes.

Optimization is very often like that. Making things generic, uniform and simple typically has a performance penalty - and you use your GPU because you care about that stuff.

by einpoklum

1/14/2025 at 7:11:41 PM

Sounds like the sort of thing that would lend itself to runtime optimization.

by EarlKing

1/14/2025 at 7:17:07 PM

I'm not too informed on the details, but iirc drivers _do_ try and optimize shaders in the background, and then when ready swaps in a better version. But I doubt it does stuff like change threadgroup size, the programmer might assume a certain size and their shader would be broken if changed. Also drivers doing background work means unpredictable performance and stuttering, which developers really don't like.

Someone correct me if I'm wrong, maybe drivers don't do this anymore.

by jms55

1/14/2025 at 8:18:47 PM

Well, if the user isn't going to be sharing the GPU with another task then you could push things back to install-time. In other words: At install time you conduct a benchmark on the relevant shaders, rewrite as necessary, recompile, and save the results accordingly. Now the user has a version of your shaders optimized to their particular configuration. Since installation times are already somewhat lengthy anyway you can be reasonably certain that no one is going to miss an extra minute or two needed to conduct benchmarks, especially if it results in installing optimized code.

by EarlKing

1/14/2025 at 9:16:08 PM

Coming from the neural network world, rather than the shader world, but: I'd say you're absolutely right!

Right now NNs and their workloads are changing quickly enough that people tend to prefer runtime optimization (like the dynamic/JIT compilation provided by Torch's compiler), but when you're confident you understand the workload and have the know-how, you can do static compilation (e.g. with ONNX, TensorRT).

I work on a serverless infrastructure product that gets used for NN inference on GPUs, so we're very interested in ways to amortize as much of that compilation and configuration work as possible. Maybe someday we'll even have something like what Redshift has in their query engine -- pre-compiled binaries cached across users.

by charles_irl

1/15/2025 at 5:33:21 PM

This reminds me of the dreaded Vulkan Shaders Compilation dialog when you try to play some games after driver update.

by lostmsu

1/16/2025 at 6:03:50 AM

People complain a lot about shader compilation, but shader compilation on start-up is much nicer than when a game doesn't do that ahead of time and does it when you need those shaders.

by terribleperson

1/14/2025 at 8:40:25 PM

This is how autotuning often works yes

by saagarjha

1/14/2025 at 10:11:31 PM

But which programming languages are most amenable to automatic runtime optimization?

Should we go back to FORTRAN?

by amelius

1/15/2025 at 1:56:38 AM

The sad answer is... probably none of them. Runtime optimization has always been one of those things that sends most programmers running away screaming, and those who make languages never seem to come from the ranks of those who understand the clear utility of it.

by EarlKing

1/15/2025 at 5:04:51 AM

Squeak Smalltalk has several automatic runtime optimizations and compilers like JIT, parallel load balancing compiler [1], adaptive compiler [2] and a metacircular simulator and byte code virtual machine written in itself that allows you to do runtime optimisations on GPUs. The byte codes are of course replaced with the native GPU instructions at runtime.

There are dozens of scientific papers and active research is still being done [1].

I've worked on automatic parallel runtime optimizations and adaptive compilers since 1981. We make reconfigurable hardware (chips and wafers) that also optimises at runtime.

Truffle/GraalVM is very rigid and overly complicated [6].

With a meta compiler like Ometa or Ohm we can give any programming language the runtime adaptive compilation for GPUs [3][4].

I'm currently adapting my adaptive compiler to Apple Silicon M4 GPU and neural engine to unlock the trillions of operations per second these chips can do.

I can adapt them to more NVIDIA GPUs with the information of the website in the title. Thank you very much charles_irl! I would love to be able to save the whole website in a single PDF.

I can optimise your GPU software a lot with my adaptive compilers. It will cost less than 100K in labour to speed up your GPU code by a factor 4-8 at least, sometimes I see 30-50 times speedup.

[1] https://www.youtube.com/watch?v=wDhnjEQyuDk

[2] https://www.youtube.com/watch?v=CfYnzVxdwZE

[3] https://tinlizzie.org/~ohshima/shadama2/

[4] https://github.com/yoshikiohshima/Shadama

[5] http://www.tinlizzie.org/ometa/

[6] https://github.com/NVIDIA/grcuda

by morphle

1/14/2025 at 8:43:03 PM

It would be nice if this also included terms that are often used by Nvidia that apparently come from computer architecture (?) but are basically foreign to software engineers, like “scoreboard” or “math pipe”.

by saagarjha

1/14/2025 at 8:53:55 PM

Great idea! We'll add some of those in the next round.

by charles_irl

1/14/2025 at 7:10:59 PM

FINALLY. Nvidia's always been pretty craptacular when it comes to their documentation. It's really hard to read unless you already know their internal names for, well, just about everything.

by EarlKing

1/14/2025 at 8:09:12 PM

Nvidia isn't very big on opensource either. Most CUDA libraries are still closed source. I think this might eventually be their downfall, because people want to know what they are working with. For example with PyTorch, I can profile the library against my use case and then decide to modify the official library to get some bespoke optimization. With CUDA, if I need to do that, I need to start from scratch and guess as to whether the library from the api already has such optimizations.

by let_me_post_0

1/14/2025 at 10:36:45 PM

NVIDIA does have a bunch of FOSS libraries - like CUB and Thrust (now part of CCCL). But - they tend to suffer from "not invented here" syndrome [1] ; so they seem to avoid collaboration on FOSS they don't manage/control by themselves.

I have a bit of a chip on my shoulder here, since I've been trying to pitch my Modern C++ API wrappers to them for years, and even though I've gotten some appreciative comments from individuals, they have shown zero interest.

https://github.com/eyalroz/cuda-api-wrappers/

There is also their driver, which is supposedly "open source", but actually none of the logic is exposed to you. Their runtime library is closed too, their management utility (nvidia-smi), their LLVM-based compiler, their profilers, their OpenCL stack :-(

I must say they do have relatively extensive documentation, even if it doesn't cover everything.

[1] - https://en.wikipedia.org/wiki/Not_invented_here

by einpoklum

1/15/2025 at 2:23:24 AM

There's a deeper reason. Remember 3Dfx? They made the entire source code for their 3D hardware available to developers, all in C and a tiny bit of assembler. It could be easily ported to non-Wintel platforms. (I know, because I did port it to a MIPS based platform that had zero operating system. It was a poker machine.)

Then 3Dfx was smashed from the inside and bought out by nVidia. Source code to 3D accellerator hardware drivers never to be seen again.

Why? Because if just anybody could port 3D graphics hardware and drivers to any custom hardware and OS platform, then Microsoft, Apple, etc would rapidly be killed by something with a MUCH better GUI (3D) appearing on the market.

The powers that be do NOT want capable, unchained computing systems to upset their carefully constructed 'enslavement via enshitification' schemes.

by TerraHertz

1/15/2025 at 2:54:20 AM

Some of it is open source:

https://github.com/NVIDIA/open-gpu-kernel-modules

by ryao

1/15/2025 at 9:56:52 AM

Have you actually looked at those "sources"? You should. Or rather, you shouldn't since you'd be wasting your time trying to find anything useful.

by einpoklum

1/15/2025 at 8:14:43 PM

I have peeked at parts. I enjoyed seeing that they use coverity on the modules.

by ryao

1/16/2025 at 11:42:46 PM

>Source code to 3D accellerator hardware drivers never to be seen again.

What? Mesa supports plenty of hardware.

by garaetjjte

1/12/2025 at 9:13:04 PM

Oh hey, I wrote this!

Thanks for sharing it.

by charles_irl

1/14/2025 at 7:24:31 PM

Looks nice. I'm not sure if this is the place for it, but what I am always searching for is a very concise table of the different GPUs available with approximate compute power and costs. Lists such as wikipedia [1] are way to complicated.

[1] https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_proces...

by ks2048

1/14/2025 at 9:06:01 PM

Yeah, there's a tension between showing enough information to be useful for driving decisions and hiding enough information.

For example, "compute capability" sounds like it'd be what you need, but it's actually more of a software versioning index :(

Was thinking of splitting the difference by collecting up the quoted arithmetic (FLOP/s) and memory bandwidths from the manufacturer datasheets. But there's caveats there too, e.g. the dreaded "With sparsity" asterisk on the Tensor Core FLOP/s of recent generations.

by charles_irl

1/14/2025 at 9:12:33 PM

I was looking for a simple table recently- outlining say how the shared memory or total register size/SM varies between generations (Something like that Wiki table). It was surprisingly hard to find those info.

by shihab

1/15/2025 at 1:24:08 AM

Thank you for this.

Any chance you could just make it a single long webpage (as opposed to making me click through one page at a time)?

For some reason on my iPad the links don’t always work the first time I click them.

by alberth

1/13/2025 at 12:40:44 AM

Great work. Nice aesthetic.

"These groups of threads, known as warps , are switched out on a per clock cycle basis — roughly one nanosecond. CPU thread context switches, on the other hand, take few hundred to a few thousand clock cycles"

I would note that intels SMT does do something very similar (2 hw threads). Other like the xeon phi would round robin 4 threads on a single core.

by petermcneeley

1/14/2025 at 7:25:49 PM

SMT isn't that really is it?

SMT allows for concurrent execution of both threads (thus independent front-end for fetch, decode especially) and certain core resources are statically partitioned unlike a warp being scheduled on SM.

I'm not a graphics expert but warps seem closer to run-time/dynamic VLIW than SMT.

by zeusk

1/15/2025 at 2:31:42 PM

In actual implementation they are very much like very wide SIMD on a CPU core. Each HW thread is a different warp as each warp can execute different instructions.

This mapping is so close that translation from GPU to CPU relatively easy and performant.

by petermcneeley

1/14/2025 at 8:58:05 PM

Thanks!

> intels SMT does do something very similar (2 hw threads)

Yeah that's a good point. One thing I learned from looking at both hardware stacks more closely was that they aren't as different as they seem at first -- lots of the same ideas or techniques get are used, but in different ways.

by charles_irl

1/15/2025 at 2:04:07 AM

Thanks! As an old (retired) programmer I was hoping a good intro to GPUs would turn up. Now, I don't suppose you could add 'ink on paper' to the color options? Gray on light gray, with medium gray highlighting, is hard on old eyes. While I never want to see P7 phosphor green again. And I suppose a zipfile of the whole thing, for local reading and archive, would be out of the question?

by TerraHertz

1/14/2025 at 7:23:59 PM

I absolutely love the look. Is it a template or custom?

by byteknight

1/14/2025 at 8:59:10 PM

Custom! Took inspiration from lynx, lotus, and other classic terminal programs.

by charles_irl

1/15/2025 at 12:11:23 AM

What was your method for drawing/generating the SM diagram?

by MostlyAmiable

1/14/2025 at 11:25:41 PM

There's a wonderful correspondence between GPU and more conventional SIMD vector terms in the P&H comp arch book. Slide 13 of https://cse.buffalo.edu/~rsridhar/cse490-590/lec/Chapter04.p...

by krackers

1/14/2025 at 8:26:30 PM

Incredible work, thank you so much! This will hopefully break down more barriers to entry for newcomers wanting to work with GPUs!

by joshdavham

1/14/2025 at 9:11:29 PM

Thanks for the kind words! I still feel like one of those newcomers myself :)

Now that so many more people are running workloads, including critical ones, on GPUs, it feels much more important that a base level of knowledge and intuition is broadly disseminated -- kinda like how most engineers basically grok database index management, even if they couldn't write a high-performance B+ tree from scratch. Hope this document helps that along!

by charles_irl

1/15/2025 at 7:15:36 AM

Really great work, suggest for a next post: the VRAM requirements estimation calculation for running models locally. Especially with different architecture and different Quants, it gets always confusing and even online calculators give different answer. I never found a really good deep dive on this yet.

by 3abiton

1/12/2025 at 9:35:40 PM

Is there a plain text / markdown / html version?

by K0IN

1/12/2025 at 11:58:20 PM

I'm with you. The theme is cool for a brief blog post, but anything longer and I want out of the AS400 terminal.

by htk

1/14/2025 at 7:25:19 PM

I found it much better by clicking "light" at the top to change theme.

by ks2048

1/14/2025 at 8:05:48 PM

Found it more readable, yeah, but all of the captions on the diagrams- identifying block types by color- no longer made any sense.

by mandevil

1/14/2025 at 8:59:49 PM

Good callout! I'll work on the captions.

by charles_irl

1/12/2025 at 10:23:24 PM

I would also like to see a PDF that has all the text in one place, presented linearly. This looks like a very worthwhile read, but waiting a few seconds for two paragraphs to load is a very frustrating user experience.

by aithrowawaycomm

1/14/2025 at 9:02:30 PM

A few seconds is way longer than we intended! When I click around all pages after the first load in milliseconds.

Do you have any script blockers, browser cache settings, or extensions that might mess with navigation?

> would also like to see a PDF that has all the text in one place, presented linearly

Yeah, good idea! I think a PDF with links so that it's still easy to cross-reference terms would get the best of both worlds.

by charles_irl

1/14/2025 at 10:34:38 PM

I am using Safari on iOS - I disabled private relay and tested again, still seems oddly slow. No extensions; the settings to periodically delete cookies and block popups are enabled, don't see why those would affect this. Maybe it's just HN traffic, thousands of people flipping through the first dozen or so pages.

Edit: I just checked again and it didn't load at all... I also see this is on the front page again, at 5:30pm Eastern US time :) Probably HN hug of death.

by aithrowawaycomm

1/14/2025 at 10:24:09 PM

book time book time

by swyx

1/14/2025 at 10:38:15 PM

This has been submitted, like, five times already over the past 5 weeks:

https://news.ycombinator.com/from?site=modal.com

by einpoklum

1/15/2025 at 10:27:44 PM

8*!

And yet, this latest submission is the one that hits triple digit votes.

A small proof of the randomness of virality.

by d3rockk

1/15/2025 at 4:57:40 AM

This is incredible. I'm gonna spend some time here.

And I love the design/UI.

by JeremyMorgan

1/14/2025 at 10:05:23 PM

This is really incredible, thank you!

by germanjoey

1/14/2025 at 9:34:15 PM

content is cool; usability and design of the website is awful (although charming)

by richwater

1/14/2025 at 9:57:56 PM

Not at all -- the usability and design are fantastic! (On desktop, at least.)

What, specifially, do you find awful here?

by yshklarov

1/15/2025 at 6:57:04 AM

Try scrolling with the arrow keys.

by ttwwmm

1/14/2025 at 8:37:36 PM

that's pretty

by weltensturm

1/14/2025 at 9:17:13 PM

Thanks! We think that just because something is deeply technical doesn't mean it has to be ugly.

by charles_irl

1/15/2025 at 6:58:25 AM

Did you really have to hijack the up and down keys? I can't scroll.

by ttwwmm

1/15/2025 at 8:29:19 AM

Awesome <3

by pythops

1/14/2025 at 9:12:11 PM

[dead]

by hkgjjgjfjfjfjf