Thanks so much for reading!!

>I think around that microarchitecture you want at least a two-deep pipeline where global memory loads to shared memory two iterations ahead
I agree. If I were to spend more time on this, I would add another layer of pipelining like you say, and also tweak it so that each threadblock is persistent on a particular SM and processes multiple tiles. That lets you hide the write latency of the epilogue and arrange the SM<->data layout in a way that maximizes L2 cache locality. (There's a good explanation of this here: https://cudaforfun.substack.com/p/outperforming-cublas-on-h1...)
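For what it's worth, here's a rough sketch of what that kind of multi-stage pipeline looks like using the Ampere-era cp.async pipeline primitives. The stage count, tile size, kernel name, and compute step are all placeholders I made up for illustration, not the kernel from the post:

    #include <cuda_pipeline_primitives.h>

    #define STAGES 3    // loads run STAGES-1 iterations ahead of the compute
    #define TILE   128  // floats copied per iteration (assumes blockDim.x == TILE)

    __global__ void pipelined_loop(const float* __restrict__ gmem, int num_tiles) {
        __shared__ float smem[STAGES][TILE];

        // Prologue: prefetch the first STAGES-1 tiles into shared memory.
        for (int s = 0; s < STAGES - 1 && s < num_tiles; ++s) {
            __pipeline_memcpy_async(&smem[s][threadIdx.x],
                                    &gmem[s * TILE + threadIdx.x],
                                    sizeof(float));
            __pipeline_commit();
        }

        for (int k = 0; k < num_tiles; ++k) {
            // Issue the load that is STAGES-1 iterations ahead of this one.
            int ahead = k + STAGES - 1;
            if (ahead < num_tiles) {
                __pipeline_memcpy_async(&smem[ahead % STAGES][threadIdx.x],
                                        &gmem[ahead * TILE + threadIdx.x],
                                        sizeof(float));
            }
            __pipeline_commit();  // commit every iteration so the wait count stays consistent

            // Block until the copy for tile k has landed, then compute on it.
            __pipeline_wait_prior(STAGES - 1);
            __syncthreads();
            // ... tensor core MMA on smem[k % STAGES] goes here ...
            __syncthreads();
        }
    }

The point is just the structure: with STAGES shared-memory buffers, the copy for iteration k+STAGES-1 is in flight while iteration k computes, and the buffer being overwritten is the one that iteration k-1 just finished reading.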
>I did want to say that I think the ldmatrix instruction was a mistake
I agree. I found this instruction weird because it hides which thread is reading which data, but it still causes shared memory bank conflicts, so you are left to guess which thread is reading what. I find using the TMA on Hopper much nicer.
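To make the complaint concrete, this is roughly what an ldmatrix call looks like (a made-up wrapper, not code from the post): each thread in the warp hands over the shared-memory address of one 16-byte row, but the four registers it gets back hold fragments the hardware distributes across the warp, so the bank-conflict pattern isn't obvious from the addresses you actually wrote.

    // Hypothetical helper (sm_75+): loads four 8x8 b16 matrices from shared memory.
    // All 32 threads of the warp participate; each supplies one row address, but
    // the fragments each thread receives follow the fixed MMA operand layout,
    // not the row that thread pointed at.
    __device__ void ldmatrix_x4(unsigned regs[4], const void* smem_row) {
        unsigned addr = (unsigned)__cvta_generic_to_shared(smem_row);
        asm volatile(
            "ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%0, %1, %2, %3}, [%4];\n"
            : "=r"(regs[0]), "=r"(regs[1]), "=r"(regs[2]), "=r"(regs[3])
            : "r"(addr));
    }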
>The model of "more threads, more occupancy, to hide the memory latency" is truly dead and buried
Well said; it seems like all the intro-to-CUDA textbooks need to be rewritten. For GEMM-like kernels, occupancy now matters very little, and it's more about using the dedicated, asynchronous hardware units properly in conjunction with one another. I like this because there is a bit less black magic involved when chasing the long tail of performance. This is well put here:
https://research.colfax-intl.com/cutlass-tutorial-writing-ge...
"In historical context, these developments continue a trend of replacing general-purpose computational resources by specialized hardware resources, to both remove bottlenecks and free up those general-purpose resources for other operations. Starting with the Volta architecture, the Tensor Cores divorced GEMM arithmetic operations from the general computational pipeline. Ampere’s asynchronous copy instructions allowed for true pipelining of GEMM mainloops. On Hopper GPUs, the asynchronous, single-threaded TMA and the ability to reallocate registers between warpgroups dramatically reduced the register and thread cost of data movement, and the asynchronous WGMMA allowed for pipelining of MMA with other compute operations. Now, Tensor Memory and UMMA do for MMA just what TMA did for copy, making it a single-threaded, asynchronous operation that does not consume registers. As a result, registers can primarily be used for other tasks like scheduling and fused epilogue operations."