2/17/2026 at 11:34:18 PM
We have been curious about this for RAPIDS cuDF. We built the Graphistry stack 10 years ago (!) for GPU-native end-to-end acceleration - data loading, wrangling, analytics, enrichment, viz, etc., all the way from server GPUs to client GPUs - but this has been a huge sticking point: major constant overheads for smaller workloads that seem avoidable.

Essentially, we solved the problem of writing our stack in a bulk-oriented way that Nvidia kernels can optimize. Think Apache Arrow, pure vectorized dataframe pipelines, etc. However, cuDF is 'eager', with per-step CPU/GPU control-plane coordination even when the data plane lives on the GPU. Polars in theory moves to lazy scheduling, which can allow deforesting optimizations that batch work into more bulk GPU-side control macro steps, but in practice it hasn't. Nvidia's efforts to cut Python asyncio costs for multitenant etc. flows didn't pan out either. So enabling moving more to the GPU here is super interesting.
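To make the eager-vs-lazy distinction concrete, here's a minimal pure-Python sketch (the `Lazy` class and its methods are hypothetical illustrations, not the cuDF or Polars API): eager execution materializes an intermediate result per step, roughly like one kernel launch plus a control-plane round trip per op, while a lazy plan records the ops first and can fuse ("deforest") the whole chain into a single pass before dispatching one bulk step.

```python
# Hypothetical sketch of eager vs. lazy/fused evaluation.
# Not cuDF or Polars code -- just the scheduling idea.

def eager_pipeline(xs):
    # Eager: each step allocates an intermediate buffer,
    # analogous to per-op CPU/GPU control-plane coordination.
    a = [x * 2 for x in xs]   # step 1: intermediate
    b = [x + 1 for x in a]    # step 2: another intermediate
    return b

class Lazy:
    """Tiny expression plan: records ops, evaluates in one fused pass."""
    def __init__(self, ops=None):
        self.ops = ops or []

    def map(self, f):
        # Building the plan is free: no data is touched yet.
        return Lazy(self.ops + [f])

    def collect(self, xs):
        # One pass over the data, no intermediates -- the kind of
        # fusion a lazy scheduler can do before a single bulk dispatch.
        out = []
        for x in xs:
            for f in self.ops:
                x = f(x)
            out.append(x)
        return out

plan = Lazy().map(lambda x: x * 2).map(lambda x: x + 1)
assert eager_pipeline([1, 2, 3]) == plan.collect([1, 2, 3]) == [3, 5, 7]
```

Both paths compute the same result; the difference is that the lazy plan sees the whole pipeline at once, which is what opens the door to fewer, bigger GPU-side steps.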
Will be watching!
by lmeyerov