CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs

5/22/2026 at 6:26:25 AM

Strictly speaking, this is very domain-specific and doesn't enable any performance that Triton couldn't already achieve (eliminating global memory round-trips via epilogue fusion is nothing new). The real takeaway is the design shift for LLM-driven codegen rather than handcrafted kernels.

LLMs are still bad at low-level hardware optimizations, but really good at high-level composition. Designing compiler abstractions with a restricted, composable API so an LLM can easily glue expert-written blocks together is a smart move. I suspect this will eventually become the norm for code generators as we move to agentic development.

by rahen

5/22/2026 at 8:14:15 AM

>LLMs are still bad at low-level hardware optimizations, but really good at high-level composition.

I disagree. Yes, they don't have all the architectural quirks of every GPU memorized, but they are able to extract such optimizations from ISA docs and online guides. Now with 1M context available on frontier models, they can even fit the whole ISA definition in context (RDNA 3.5 here specifically) and spit out swathes of optimizations to try. The rest is just brute-forcing toward a single goal, which they are extremely good at.

Or that's how simple it'll look until you have subtle bugs to solve somewhere deep in your stack.

Anyways, low-level, hardware-optimized GPU kernels have been an exceptionally good use case for agents in my opinion. They have far more trouble in other domains like GUI work.

by tssge

5/22/2026 at 2:26:10 PM

If you look at Anthropic's recent kernel optimization challenge, and the human leaderboard, humans are soundly beating Claude's best attempt.

I think the reason, as the parent suggested, is that LLMs are great at composition (mash-ups/regeneration - this is essentially what they are trained to do), and not so great at innovation. How well they can do relative to a human on a low-level optimization problem is going to depend on how similar the problem is to things they were trained on and/or have access to.

by HarHarVeryFunny

5/22/2026 at 9:11:35 AM

The lack of fast GPU kernels written by AI does not lend credence to your theory.

by saagarjha

5/22/2026 at 2:29:50 PM

Perhaps you missed work like https://crfm.stanford.edu/2025/05/28/fast-kernels.html ?

by boulos

5/22/2026 at 4:02:58 PM

Comparing against torch.compile is not particularly impressive

by saagarjha

5/22/2026 at 11:43:36 AM

> and spit out swathes of optimizations to try.

Without any guarantees of functional correctness.

by reliabilityguy

5/22/2026 at 6:55:59 AM

I imagine this is what’s already done for AI laying out hardware design.

by sroussey

5/22/2026 at 9:37:55 AM

TLDR:

Authors realize that global row-wise dependent functions like RMSNorm/LayerNorm have baked-in scales that are commutative in certain setups, so they can be moved out after a subsequent projection and be partially aggregated on tiles of rows.

So (W1 @ gamma * globally_computed_scale) * W2 can be written as (W1 @ gamma * W2) * globally_computed_scale, as long as the scale only has row-wise interactions.

This was usually not done before because left-to-right graph compilers like torch.compile can't assume that a global row-wise reduction between GEMMs can be commutative.
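
To make the identity above concrete, here is a minimal NumPy sketch. It is not taken from the CODA paper or its code; the shapes, the names x, gamma, W, and eps, and the tile size are all assumptions picked for illustration. It checks that the per-row RMSNorm scale can be applied after the projection, and that the sum of squares behind that scale can be accumulated tile by tile alongside the GEMM.

```python
import numpy as np

rng = np.random.default_rng(0)
rows, d_in, d_out = 8, 16, 32            # illustrative sizes (assumed)
x = rng.standard_normal((rows, d_in))    # activations, one row per token
gamma = rng.standard_normal(d_in)        # RMSNorm weight, one entry per feature
W = rng.standard_normal((d_in, d_out))   # projection that follows the norm
eps = 1e-6

# Per-row scale 1 / rms(row); it depends only on the row it belongs to.
scale = 1.0 / np.sqrt(np.mean(x * x, axis=1, keepdims=True) + eps)

# Reference order: normalize first, then project.
reference = (x * scale * gamma) @ W

# Deferred order: fold gamma into W, run the GEMM on raw x,
# then apply the per-row scale to the GEMM output (the "epilogue").
W_folded = gamma[:, None] * W
deferred = (x @ W_folded) * scale

# Tiled variant: walk the reduction dimension in tiles, accumulating the
# GEMM partial sums and the per-row sum of squares in the same loop.
tile = 4
acc = np.zeros((rows, d_out))
sumsq = np.zeros((rows, 1))
for k0 in range(0, d_in, tile):
    xk = x[:, k0:k0 + tile]
    acc += xk @ W_folded[k0:k0 + tile]
    sumsq += np.sum(xk * xk, axis=1, keepdims=True)
tiled = acc / np.sqrt(sumsq / d_in + eps)

max_abs_err = float(np.max(np.abs(reference - deferred)))
```

The three results agree up to floating-point rounding because the scale is a single scalar per row, so it factors out of every dot product in that row. A scale that mixed information across rows would not factor out this way.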

by augment_me

5/22/2026 at 9:14:43 AM

Guys who have only written CUTLASS GEMM epilogue fusions, seeing their second kernel: Getting a lot of "GEMM epilogue fusion" vibes from this

by saagarjha

5/22/2026 at 7:06:08 AM

« LLMs can successfully author CODA kernels » That might speed up progress in this area then

by maxignol

5/22/2026 at 2:32:18 PM

synthesis-only is the hard part. with execution feedback — run, profile, patch — the gap closes fast. it's basically an RL problem in disguise

by cold_harbor
