Modern GPU Programming for MLSys

6/26/2026 at 11:10:36 PM

So many frameworks are being built.

What are the state-of-the-art frameworks in the ML programming area? Similar to what React is for web and Tailwind is for CSS.

Triton, ONNX, JAX, PyTorch, cuBLAS, ...

I know they might be for different purposes, but having some idea what is for what and when to use would be helpful

by throwaw12

6/26/2026 at 11:17:28 PM

> ONNX, JAX, PyTorch

these are model-level frameworks

> Triton

this is a kernel DSL

> [cublas]

this is a BLAS library built atop CUDA

> I know they might be for different purposes, but having some idea what is for what and when to use would be helpful

when people ask this question i always ask: who are you and what is your job? if you're not an ML/DL/AI person then you knowing the specifics is about as useful as me knowing the specifics of react/express/angular/tailwind/django/whatever as an ML person. this is not meant to be condescending, this is meant to allay your anxiety, ie that if you ever find yourself in the position where you have to know these things for your job, it won't be that hard to figure out (just like it isn't that hard to figure out the difference between react and express and django if you're a webdev).

by mathisfun123

6/26/2026 at 11:32:43 PM

I am a product engineer in yet another enterprise SaaS CRUD shop, who wants to learn more about the landscape and find the way to enter it eventually.

by throwaw12

6/27/2026 at 12:03:38 AM

> wants to learn more about the landscape and find the way to enter it eventually

let's swap roles and let's pretend i'm an ML engineer asking you how to enter CRUD. what would you tell me? my strong suspicion (if i caught you in an honest, frank moment) is you would say to me "why the fuck would you want to do that - it sucks". i have this suspicion because i actually used to do CRUD and it does suck! but here's your moment of zen: so does ML/DL/AI. it really really does suck. it's basically just as bad as webdev in terms of tedium/boredom/incidental complexity/etc. it's not fun, interesting, exciting, whatever else you're projecting based on an outside-looking-in perspective.

now i'll acknowledge that there's one big difference: the pay is way better at the far end of the distribution - meaning if you can get to a FAANG ML team then you'll get more money than you're probably getting now (and a ton more stress too) and it's even more than the CRUD devs in FAANG. fine. but ask yourself if it's really worth learning a whole heap of new bullshit just for a chance at more money (no guarantee).

okay now a useful/practical answer: i went back to school for a PhD but i should've just dropped out with the MS. do that. even better do Georgia Tech's online MS.

by mathisfun123

6/27/2026 at 3:28:05 AM

I'm from the ML platforms and systems domain.

I strongly recommend it if one's able. It's a bit more stable than a quickly evolving ML/DL/AI ecosystem or frontend ecosystem. The skills are more durable. It repays deep investment and knowledge.

It allows you to straddle both the distributed systems and services domain and the ML domain.

ML systems problems are extremely interesting since they require extremes of compute, storage, network, and latency, in very different parts of the model lifecycle. Its unique problem is the scarcity and cost of hardware accelerators.

I've worked eleven years in the space and rarely have had the desire to leave.

by golly_ned

6/27/2026 at 3:41:08 AM

> rarely have had the desire to leave.

I'm currently a GPU compiler engineer in FAANG specializing in compute (not graphics). So clearly ML systems. Previously I worked at every level of the stack above, and during my PhD I worked below (RTL). I hate it and think about leaving every day (I stay because of the money and like wtf else am I gonna do lol).

by mathisfun123

6/27/2026 at 9:47:07 AM

Are you willing to take a pay cut?

by saagarjha

6/27/2026 at 2:00:38 AM

Would you (or someone else passionate about this topic) consider answering the question directly? I am curious about this too.

by dv35z

6/27/2026 at 3:24:29 AM

PyTorch is widely accepted as the de facto ML framework in both research and industry. TensorFlow comes second in industry. JAX is hardly used at all, but it compiles through XLA, the same compiler backend TensorFlow can use.

Triton is a python-like language to define ML math operations that run efficiently on hardware accelerators like GPUs or TPUs. OpenAI open sourced it. If there's a particular math operation you have a unique need for in your model, and it hasn't already been implemented by some other library, and it's important for efficiency, you'd probably write it in triton these days. It'll be compiled to an intermediate representation, then to an efficient runtime.
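To make the "python-like language" part a bit more concrete, here is a rough plain-Python emulation of the block-level style Triton kernels are usually written in: each program instance handles one block of elements and masks off the out-of-range tail. This is not Triton code and nothing here runs on a GPU; `add_block`, `launch`, and `BLOCK` are made-up names for illustration only.

```python
# Plain-Python emulation of a block-level kernel; NOT real Triton, no GPU involved.
# All names here (add_block, launch, BLOCK) are hypothetical.

BLOCK = 4  # elements handled by one program instance


def add_block(pid, x, y, out, n):
    """One 'program instance': add the pid-th block of x and y into out."""
    offsets = [pid * BLOCK + i for i in range(BLOCK)]
    mask = [off < n for off in offsets]  # guard the ragged final block
    for off, keep in zip(offsets, mask):
        if keep:
            out[off] = x[off] + y[off]


def launch(x, y):
    """Run one instance per block; on a GPU these would execute in parallel."""
    n = len(x)
    out = [0] * n
    grid = (n + BLOCK - 1) // BLOCK  # ceiling division: number of blocks
    for pid in range(grid):
        add_block(pid, x, y, out, n)
    return out


result = launch(list(range(10)), [10] * 10)
```

With 10 elements and a block size of 4, three instances run and the last one only touches two elements because of the mask. A real kernel expresses the same idea with vectorized loads and stores instead of Python loops.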

The course linked deals with "MLSys", or "ml systems". That means using GPUs and other hardware accelerators efficiently to run ML math operations on one or more computers.

95% of working ML engineers will never need to write Triton, and will be more than satisfied with PyTorch. Many more ML engineers will, nevertheless, write Triton code, because it is interesting, fun, easy, and people are impressed when you tell them you did.

Hosting PyTorch models efficiently is currently awkward, because there's no clear winner in the ecosystem. ONNX is a way of representing model graphs in a framework-agnostic way. Other systems can interpret ONNX graphs to do inference. So sometimes, when someone wants to host a PyTorch model, they turn it into an ONNX model and run it with an efficient runtime on CPUs or GPUs.
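A toy sketch of the "framework-agnostic graph" idea, assuming nothing beyond the Python standard library: the model is serialized as a list of named operators, and a separate "runtime" that knows those operator names interprets it. This is not the ONNX format (real ONNX is protobuf-based with a standardized operator set); the JSON schema and the `run_graph` helper are invented for illustration.

```python
import json

# Toy stand-in for a framework-agnostic model graph. NOT the ONNX format;
# the schema and run_graph below are hypothetical.
graph_json = json.dumps({
    "inputs": ["x"],
    "constants": {"w": 3.0, "b": 1.0},
    "nodes": [
        {"op": "Mul", "inputs": ["x", "w"], "output": "xw"},
        {"op": "Add", "inputs": ["xw", "b"], "output": "z"},
        {"op": "Relu", "inputs": ["z"], "output": "y"},
    ],
    "outputs": ["y"],
})

# A "runtime" only needs an implementation for each operator name.
OPS = {
    "Mul": lambda a, b: a * b,
    "Add": lambda a, b: a + b,
    "Relu": lambda a: max(a, 0.0),
}


def run_graph(serialized, feeds):
    """Interpret the serialized graph node by node (nodes are listed in topological order)."""
    graph = json.loads(serialized)
    values = dict(graph["constants"])
    values.update(feeds)
    for node in graph["nodes"]:
        args = [values[name] for name in node["inputs"]]
        values[node["output"]] = OPS[node["op"]](*args)
    return [values[name] for name in graph["outputs"]]


positive = run_graph(graph_json, {"x": 2.0})
negative = run_graph(graph_json, {"x": -2.0})
```

The point is the separation: whatever produced `graph_json` does not have to be the thing that executes it, which is why an exported graph can be handed to a different, faster runtime.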

by golly_ned

6/27/2026 at 3:44:55 AM

> Triton is a python-like language to define ML math operations that run efficiently on hardware accelerators like GPUs or TPUs. OpenAI open sourced it.

This is incorrect. Triton has literally no path to TPU and it has always been open source because it was Philippe Tillet's PhD project (OAI simply hired Philippe).

> 95% of working ML engineers will never need to write Triton, and will be more than satisfied with PyTorch.

Maybe 95% of hobbyist ML engineers but professional ML engineers are absolutely writing Triton day-to-day (eg FB has an army of such people). Even if you're not writing Triton you're still using Triton through inductor.

> because it is interesting, fun, easy, and people are impressed when you tell them you did

Professionals write Triton not for any of the reasons you mentioned but for the same reason they wrote CUDA kernels prior: it's a path to peak performance for their specific workloads (where stock PyTorch kernels have mediocre performance).

by mathisfun123

6/26/2026 at 6:26:28 PM

"Modern [NVIDIA GPU] Programming for ..."

Everything after "Pipelining GEMM with TMA" (inclusive) is specific to NVIDIA. Which is fine but the title (of the guide itself) is clearly misleading.

by mathisfun123

6/26/2026 at 8:35:10 PM

> Our main target is the Blackwell generation,

misleading?

by nh23423fefe

6/26/2026 at 9:12:51 PM

what is it with hn people where they willfully misinterpret the simplest observations;

> the title (of the guide itself) is clearly misleading.

...

> title: the distinguishing name of a written, printed, or filmed production

do you understand now? or do i need to also define for you the word misleading?

by mathisfun123

6/26/2026 at 9:39:13 PM

nah talking to you sucks.

by nh23423fefe

6/26/2026 at 8:14:45 PM

This looks great, but I'd really like to see associated exercises (and solutions) to make it useful for self-study

by hazard

6/27/2026 at 12:56:13 AM

I can't signal boost this enough.

I spent months, months of late nights watching commits to nvfuser and shit. I wrote a SASS decompiler and instrumented everything trying to learn Blackwell.

This is the first time I've seen something so clean, just a real work of scholarship on it.

My hat is off to the authors and the contribution it represents.

If I would caution a reader about anything, it's that the 2CTA (sm_100, sm_110) patterns here differ from 1CTA in important ways. It's not a better/worse thing; they are good for different workloads.

Really outstanding work. I proved a lot of this in Lean 4 and published it, but I got lazy and stopped short of really doing the pedagogical work.

This is what you should be starting with if you want to max out 2CTA gear, it's immaculate.

by reinitctxoffset