alt.hn

4/9/2026 at 7:20:10 PM

Rust Threads on the GPU

https://www.vectorware.com/blog/threads-on-gpu/

by PaulHoule

4/14/2026 at 3:15:43 AM

I don’t understand why this is a useful effort. It seems like a solution in search of a problem. It’s going to be incredibly easy to end up with hopelessly inefficient programs that need a full redesign in a normal GPU programming model to be useful.

by nynx

4/14/2026 at 3:44:49 AM

Founder here.

1. Programming GPUs is a problem. The ratio of CPUs to CPU programmers and GPUs to GPU programmers is massively out of whack. Not because GPU programming is less valuable or lucrative, but because GPUs are weird and the tools are weird.

2. We are more interested in leveraging existing libraries than running existing binaries wholesale (mostly within a warp). But, running GPU-unaware code leaves a lot of space for the compiler to move stuff around and optimize things.

3. The compiler changes are not our product, the GPU apps we are building with them are. So it is in our interest to make the apps very fast.

Anyway, skepticism is understandable and we are well aware code wins arguments.

by LegNeato

4/14/2026 at 4:45:51 PM

> the GPU apps we are building with them are

I can't help but get the feeling you have a use-case end goal in mind that's opaque to many of us who are GPU-ignorant.

It could be helpful if there were an example of the type of application that would be nicer to express through your abstractions.

(I think what you've shown so far is super cool btw)

by electronsoup

4/14/2026 at 4:41:36 AM

Do you foresee this being faster than SIMD for things like cosine similarity? Apologies if I missed that context somewhere.

by jzombie

4/14/2026 at 5:04:37 AM

It depends. At VectorWare we are a bit of an extreme case in that we are inverting the relationship and making the GPU the main loop that calls out to the CPU sparingly. So in that model, yes. If your code is run in a more traditional model (CPU driving and using the GPU as a coprocessor), probably not. Going across the bus dominates most workloads. That being said, the traditional wisdom is becoming less relevant as integrated memory is popping up everywhere and tech like GPUDirect exists with the right datacenter hardware.

These are the details we intend to insulate people from so they can just write code and have it run fast. There is a reason why abstractions were invented on the CPU and we think we are at that point for the GPU.

(for the datacenter folks I know hardware topology has a HUGE impact that software cannot overcome on its own in many situations)
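For concreteness, here is a plain scalar cosine similarity in Rust (my own sketch, nothing to do with VectorWare's code). On the CPU an auto-vectorizer maps this loop to SIMD just fine, which is why the bus transfer, not the math, usually decides CPU vs. GPU for a one-off call:

```rust
// Plain scalar cosine similarity; a CPU auto-vectorizer handles this
// loop well, so the GPU only wins once transfer costs are amortized
// (or avoided entirely, as in the GPU-resident model described above).
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let (mut dot, mut na, mut nb) = (0.0f32, 0.0f32, 0.0f32);
    for i in 0..a.len() {
        dot += a[i] * b[i]; // dot product accumulates alongside norms
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    dot / (na.sqrt() * nb.sqrt())
}

fn main() {
    let a = [1.0, 0.0, 1.0];
    println!("{}", cosine_similarity(&a, &a)); // identical vectors: 1
    println!("{}", cosine_similarity(&a, &[0.0, 1.0, 0.0])); // orthogonal: 0
}
```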

by LegNeato

4/14/2026 at 4:22:43 AM

> because GPUs are weird and the tools are weird.

Why is it also that the terminology is so all over the place? Subgroups, wavefronts, warps, etc. all refer to the same concept. That doesn't help either.

by shmerl

4/14/2026 at 6:16:28 AM

This is the fault of NVIDIA, who, instead of using the terms that had been used for decades in computer science before them for things like vector lanes, processor threads, processor cores etc., have invented a new jargon by replacing each old word with a new word, in order to obfuscate how their GPUs really work.

Unfortunately, ATI/AMD has slavishly imitated many things initiated by NVIDIA, so soon afterwards they created their own jargon by replacing every word used by NVIDIA with a different word, also different from the traditional word, enhancing the confusion. The worst part is that the NVIDIA jargon and the AMD jargon sometimes reuse traditional terms while giving them different meanings, e.g. an NVIDIA thread is not what a "thread" normally means.

Later standards, like OpenCL, have attempted to make a compromise between the GPU vendor jargons, instead of going back to a more traditional terminology, so they have only increased the number of possible confusions.

So to be able to understand GPUs, you must create a dictionary with word equivalences: traditional => NVIDIA => ATI/AMD (e.g. IBM 1964 task = Vyssotsky 1966 thread => NVIDIA warp => AMD wavefront).

by adrian_b

4/14/2026 at 4:41:53 AM

All the names for waves come from different hardware and software vendors adopting names for the same or similar concept.

- Wavefront: AMD, comes from their hardware naming

- Warp: Nvidia, comes from their hardware naming for largely the same concept

Both of these were implementation details until Microsoft and Khronos enshrined them in the shader programming model, independent of the hardware implementation, so you get

- Subgroup: Khronos' name for the abstract model that maps to the hardware

- Wave: Microsoft's name for the same

They all describe mostly the same thing so they all get used and you get the naming mess. Doesn't help that you'll have the API spec use wave/subgroup, but the vendor profilers will use warp/wavefront in the names of their hardware counters.

by MindSpunk

4/14/2026 at 10:33:28 AM

You can add to this the Apple terminology, which is simdgroup. This reinforces your point – vendors have a tendency to invent their own terminology rather than use something standard.

by raphlinus

4/14/2026 at 10:46:57 AM

Rule #1 in not getting involved in any patent lawsuit: don't use the same terminology as your competitors.

by amelius

4/14/2026 at 1:04:18 PM

I have to give it to Apple though in this case. Waves or warps are ridiculously uninformative, while simdgroups at least convey some useful information.

by coffeeaddict1

4/14/2026 at 1:43:38 PM

[dead]

by seivan

4/14/2026 at 8:03:09 AM

> The ratio of CPUs to CPU programmers and GPUs to GPU programmers is massively out of whack.

These days I just ask an LLM to write my optimized GPU routines.

by amelius

4/14/2026 at 7:19:14 AM

It looks like they're trying to map the entire "normal GPU programming model" to Rust code, including potentially things like GPU "threads" (to SIMD lanes + masked/predicated execution to account for divergence) and the execution model where a single GPU shader is launched in multiple instances with varying x, y and z indexes. In this context, it makes sense to map the GPU "warp" to a Rust thread since GPU lanes, even with partially independent program counters, still execute in lockstep much like CPU SIMD/SPMD or vector code.
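A rough CPU-side analogy of that mapping (hypothetical on my part, not the blog's actual API): each spawned std::thread stands in for one warp, and the uniform inner loop over lanes is the divergence-free part a compiler is free to map onto vector lanes:

```rust
use std::thread;

const LANES: usize = 32; // one "warp" worth of lanes

// Hypothetical CPU analogy: each spawned thread plays the role of one
// warp; the inner loop over LANES is the uniform, lockstep part that a
// compiler could lower to SIMD lanes with no divergence possible.
fn run_warps(n_warps: usize) -> u32 {
    let handles: Vec<_> = (0..n_warps)
        .map(|warp_id| {
            thread::spawn(move || {
                let mut lane_results = [0u32; LANES];
                for lane in 0..LANES {
                    // every "lane" runs the same code on different data
                    lane_results[lane] = ((warp_id * LANES + lane) * 2) as u32;
                }
                lane_results.iter().sum::<u32>()
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    // 4 "warps" x 32 "lanes", each lane doubling its global index
    println!("{}", run_warps(4)); // 2 * (0 + 1 + ... + 127) = 16256
}
```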

by zozbot234

4/14/2026 at 3:19:25 AM

I think they've taken the integration difficulty into account.

Besides, full redesign isn't so expensive these days (depending).

>It seems like a solution in source of a problem.

Agreed, but it'll be interesting to see how it plays out.

by rl3

4/14/2026 at 5:27:53 PM

I've been using Rust's Burn library recently and have avoided writing kernels in CubeCL because of its lack of documentation and my own lack of experience. I would love to see some collaboration here.

by cbHXBY1D

4/14/2026 at 4:34:18 AM

Isn't this turning a GPU into a slower CPU? It's not like CPUs are slow; in fact, they're quite a bit faster than any single GPU thread. If code is written in a GPU-unaware way, it's not going to take advantage of the reasons for being on the GPU in the first place.

by kevmo314

4/14/2026 at 5:41:59 PM

> It's not like CPUs are slow, in fact they're quite a bit faster than any single GPU thread.

This was overwhelmingly true ten years ago, not so much now.

Modern GPU threads run at about 3 GHz. CPUs are still slightly faster in theory, but the larger amounts of fast local memory make GPU threads pretty competitive in practice.

by fooker

4/14/2026 at 5:34:42 AM

We have this issue in GFQL right now. We wrote the first OSS GPU Cypher query language implementation, where we make a query plan of GPU-friendly collective operations... But today their steps are coordinated via Python, which has high constant overheads.

We are looking to shed some of the Python<->C++<->GPU overheads by pushing macro steps out of Python and into C++. However, it'd probably be way better to skip all the CPU<->GPU back-and-forth by coordinating the task queue on the GPU to begin with. It's 2026, so ideally we can use modern tools and type safety for this.

Note: I looked at the company's GitHub and didn't see any relevant OSS, which changes the calculus for a team like ours. Sustainable infra is hard!

by lmeyerov

4/14/2026 at 1:18:23 PM

Additionally there is still too much performance left on the table by not properly using CPU vector units.

by pjmlp

4/14/2026 at 5:45:05 PM

SIMD performance in modern Intel and AMD CPUs is so bad that it is useless outside very specific circumstances.

This is mainly because vector instructions are implemented by sharing resources with other parts of the CPU, which more or less stalls pipelines, significantly reduces IPC, and makes out-of-order execution ineffective.

The shared resources often involve floating-point registers and compute, so it's a double whammy.

by fooker

4/14/2026 at 5:50:26 PM

Yet it is still faster than doing nothing at all, or than calling into the GPU, on workloads where bus traffic takes the majority of execution time.

by pjmlp

4/14/2026 at 8:50:54 AM

I've seen this objection pop up every single time and I still don't get it.

GPUs run 32, 64, or even 128 vector lanes at once. If you have a block of Rust threads that are properly programmed to take advantage of the vector processing by avoiding divergence, etc., how is it supposed to be slower?

Consider the following:

You have a hyperoptimized matrix multiplication kernel and you also have your inference engine code that previously ran on the CPU. You now port the critical inference engine code to directly run on the GPU, thereby implementing paged attention, prefix caching, avoiding data transfers, context switches, etc. You still call into your optimized GPU kernels.

Where is the magical slowdown supposed to come from? The mega kernel researchers are moving more and more code to the GPU and they got more performance out of it.

Is it really that hard to understand that the CUDA-style programming model is inherently inflexible and limiting? I think the fundamental problem here is that Nvidia marketing gave an incredibly misleading perception of how the hardware actually works. GPUs don't have thousands of cores like the CUDA core marketing suggests. They have a hundred "barrel CPU"-like cores.

The RTX 5090 is advertised to have 21760 CUDA cores. This is a meaningless number in practice since the "CUDA cores" are purely a software concept that doesn't exist in hardware. The vector processing units are not cores. The RTX 5090 actually has 170 streaming multiprocessors each with their own instruction pointer that you can target independently just like a CPU. The key restriction here is that if you want maximum performance you need to take advantage of all 128 lanes and you also need enough thread copies that only differ in the subset of data they process so that the GPU can switch between them while it is working on multi cycle instructions (memory loads and the like). That's it.

Here is what you can do: you can take a bunch of streaming processors, let's say 8, and use them to run your management code on the GPU side without having to transfer data back to the CPU. When you want to do the heavy lifting you are in luck, because you still have 162 streaming processors left to do whatever you want. You proceed to call into cuDNN and get great performance.
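As a CPU-side sketch of that split (purely illustrative, not any real GPU API): one resident "management" thread feeds a queue that a "compute" thread drains, standing in for a few SMs coordinating while the rest run optimized kernels:

```rust
use std::sync::mpsc;
use std::thread;

// Purely illustrative CPU sketch: a resident "management" thread feeds
// work to a "compute" thread over a channel, standing in for a few SMs
// coordinating while the remaining SMs run the heavy kernels.
fn run_pipeline(batches: usize) -> f32 {
    let (task_tx, task_rx) = mpsc::channel::<Vec<f32>>();
    let (result_tx, result_rx) = mpsc::channel::<f32>();

    // "compute" side: in the GPU-resident model this would be the bulk
    // of the SMs calling into a tuned kernel (e.g. cuDNN).
    let worker = thread::spawn(move || {
        while let Ok(batch) = task_rx.recv() {
            result_tx.send(batch.iter().sum()).unwrap();
        }
    });

    // "management" side: enqueues work without a host round-trip.
    for i in 0..batches {
        task_tx.send(vec![i as f32; 4]).unwrap();
    }
    drop(task_tx); // close the queue so the worker exits

    let total = result_rx.iter().sum();
    worker.join().unwrap();
    total
}

fn main() {
    println!("{}", run_pipeline(3)); // 0*4 + 1*4 + 2*4 = 12
}
```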

by imtringued

4/14/2026 at 9:17:08 AM

> a block of Rust threads that are properly programmed to take advantage of the vector processing by avoiding divergence

But the library is using a warp as a single thread

by Bimos

4/14/2026 at 12:44:08 PM

> a block of Rust threads that are properly programmed to take advantage of the vector processing by avoiding divergence

Sure, if you have that then of course it would be fast. But that’s not what this library is proposing.

by kevmo314

4/14/2026 at 9:14:26 AM

I really appreciate the way you've explained this. Are there any resources you recommend to reach your level of understanding?

by monideas

4/14/2026 at 12:40:55 PM

If you map Rust threads to warps, aren’t we basically turning the GPU into a very expensive CPU?

by Talderigi

4/14/2026 at 4:15:09 PM

It makes sense when the inner operations are vectorisable, as in the example.

by hgomersall

4/14/2026 at 12:50:40 PM

This blog post doesn't address how GPU "threads" can be mapped to Rust SIMD/SPMD "lanes" yet, though it hints at that. I assume that this is planned to be a topic for a future blog post.

I'd like to understand how the overall number of "warps" to be launched on the GPU is determined. Is it fixed at shader launch, or can warps be created and destroyed on demand? If it's fixed, these are more like CPU-side "virtual processors" (in OS terminology) than true OS "threads".

by zozbot234

4/14/2026 at 5:07:26 AM

Is this proprietary, or something I can play around with? I can't find a repo.

by gpm

4/14/2026 at 5:09:26 AM

It is not, we just haven't yet upstreamed everything.

by LegNeato

4/14/2026 at 5:58:28 AM

This programming model seems like the wrong one, and I think it's based on some faulty assumptions.

>Another advantage of this approach is that it prevents divergence by construction. Divergence occurs when lanes within a warp take different branches. Because thread::spawn() maps one closure to one warp, every lane in that warp runs the same code. There is no way to express divergent branching within a single std::thread, so divergence cannot occur

This is extremely problematic - being able to write divergent code between lanes is good. Virtually all high-performance GPGPU code I've ever written contains divergent code paths!

>The worst case is that a workload only uses one lane per warp and the remaining lanes sit idle. But idle lanes are strictly better than divergent lanes: idle lanes waste capacity while divergent lanes serialize execution

This is where I think it falls apart a bit, and we need to dig into GPU architecture to find out why. A lot of people think that GPUs are a bunch of executing threads that are grouped into warps that execute in lockstep. This is an overly restrictive model of how they work that misses a lot of the reality.

GPUs are a collection of threads that are broken up into local work groups. These share L2 cache, which can be used for fast intra-work-group communication. Work groups are split up into subgroups - which map to warps - that can communicate extra fast.

This is the first problem with this model: it neglects the local work group execution unit. To get adequate performance, you have to set this value much higher than the size of a warp, at least 64 for a 32-wide warp. In general though, 128-256 is a better size. Different warps in a local work group make true independent progress, so if you take this into account in Rust, it's a bad time and you'll run into races. To get good performance and cache management, these warps need to be executing the same code. Trying to have a task per warp is a really bad move for performance.

>Each warp has its own program counter, its own register file, and can execute independently from other warps

The second problem: it used to be true that all threads in a warp would execute in lockstep, with strict on/off masks for thread divergence, but this is no longer true for modern GPUs; the above is just wrong. On a modern GPU, each *thread* has its own program counter and call stack, and can independently make forward progress. Divergent threads can have better throughput than you'd expect on a modern GPU, as they get more capable at handling this. Divergence isn't bad, it's just something you have to manage - and hardware architectures are rapidly improving here.

Say we have two warps, both running the same code, where half of each warp splits at a divergence point. Modern GPUs will go: huh, it sure would be cool if we just shifted the threads about to produce two non-divergent warps - and bam, divergence solved at the hardware level. But notice that to get this hardware acceleration, we need to actually use the GPU programming model to its fullest.

The key mistake is to assume that the current warp model is always going to stick rigidly to being strictly wide SIMD units with a funny programming model, but we already ditched that concept a while back on GPUs, around the Pascal era. As time goes on this model will only increasingly diverge from how GPUs actually work under the hood, which seems like an error. Right now, even with just the local work group problems, I'd guess you're leaving ~50% of your performance on the table, which seems like a bit of a problem when the entire reason to use a GPU is performance!

by 20k

4/14/2026 at 7:49:15 AM

> Modern GPUs will go: huh, it sure would be cool if we just shifted the threads about to produce two non divergent warps, and bam divergence solved at the hardware level

Could you kindly share a source for this? Shader Execution Reordering (SER) is available for Ray tracing, but it is not a general-purpose feature that can be used in generic compute shaders.

> Divergent threads can have a better throughput than you'd expect on a modern GPU, as they get more capable at handling this. Divergence isn't bad, its just something you have to manage - and hardware architectures are rapidly improving here

I would strongly advise against this. GPUs are highly efficient when neighboring threads within a warp access neighboring data and follow largely the same code path. Even across warps, data locality is highly desirable.

by david-gpu

4/14/2026 at 3:17:17 PM

>I would strongly advise against this. GPUs are highly efficient when neighboring threads within a warp access neighboring data and follow largely the same code path. Even across warps, data locality is highly desirable.

It's a bit like saying writing code at all is bad, though. Divergence isn't desirable, but neither is running any code at all - sometimes you need it to solve a problem.

Not supporting divergence at all is a huge mistake IMO. It isn't good, but sometimes it's necessary.

>Could you kindly share a source for this? Shader Execution Reordering (SER) is available for Ray tracing, but it is not a general-purpose feature that can be used in generic compute shaders.

https://docs.nvidia.com/cuda/cuda-programming-guide/03-advan...

My understanding is that this is fully transparent to the programmer; it's just more advanced scheduling for threads. SER is something different entirely.

Nvidia are a bit vague here, so you have to go digging into patents if you want more information on how it works

by 20k

4/14/2026 at 9:12:16 AM

>The second problem is: it used to be true that all threads in a warp would execute in lockstep, and strictly have on/off masks for thread divergence, but this is strictly no longer true for modern GPUs, the above is just wrong. On a modern GPU, each thread has its own program counter and callstack, and can independently make forward progress. Divergent threads can have a better throughput than you'd expect on a modern GPU, as they get more capable at handling this. Divergence isn't bad, its just something you have to manage - and hardware architectures are rapidly improving here

I haven't found any evidence of the individual program counter thing being true beyond one niche application: running mutexes for a single vector lane, which is not a performance optimization at all. In fact, you are serializing the performance in the worst way possible.

From a hardware design perspective it is completely impractical to implement independent instruction pointers other than maybe as a performance counter. Each instruction pointer requires its own read port on the instruction memory and adding 32, 64 or 128 read ports to SRAM is prohibitively expensive, but even if you had those ports, divergence would still lead to some lanes finishing earlier than others.

What you're probably referring to is a scheduler trick that Nvidia has implemented where they split a streaming processor thread with divergence into two masked streaming processor threads without divergence. This doesn't fundamentally change anything about divergence being bad; you will still get worse performance than if you had figured out a way to avoid divergence. The read port limitations still apply.

by imtringued

4/14/2026 at 3:21:16 PM

Threads have individual program counters according to Nvidia, and have had them for nearly 10 years.

https://docs.nvidia.com/cuda/cuda-programming-guide/03-advan...

> the GPU maintains execution state per thread, including a program counter and call stack, and can yield execution at a per-thread granularity

Divergence isn't good, but sometimes it's necessary - not supporting it in a programming model is a mistake. There are some problems you simply can't solve without it, and in some cases you absolutely will get better performance by using divergence.

People often tend to avoid divergence by writing an algorithm that does effectively what Pascal and earlier GPUs did, which is unconditionally doing all the work on every thread. That will give worse performance than just having a branch, because of the better hardware scheduling these days.
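To make the contrast concrete, here are the two styles in scalar Rust (my own toy example, nothing GPU-specific): the "predicated" version always does all the work and selects a result, which is what unconditional-work rewrites amount to, while the "branching" version skips work per element:

```rust
// Toy contrast between the two styles: "predicated" always computes
// both sides and selects, mirroring what unconditional-work rewrites
// (and pre-Volta masking) effectively do; "branching" runs only one
// path per value, which newer schedulers can handle well.
fn expensive(x: f32) -> f32 {
    x.sqrt() * 2.0
}

fn cheap(x: f32) -> f32 {
    x + 1.0
}

fn predicated(x: f32) -> f32 {
    let a = expensive(x); // always executed
    let b = cheap(x);     // always executed
    if x > 10.0 { a } else { b } // mask-like select of the result
}

fn branching(x: f32) -> f32 {
    // only one of the two paths executes per value
    if x > 10.0 { expensive(x) } else { cheap(x) }
}

fn main() {
    for x in [4.0f32, 16.0] {
        assert_eq!(predicated(x), branching(x)); // identical results
    }
    println!("{} {}", branching(4.0), branching(16.0)); // 5 8
}
```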

by 20k