6/29/2026 at 10:56:49 AM
Wow, that's really cool, I'll definitely check it out! I've played around with machine learning algorithms built from scratch in C / CUDA too, but once I hit the CUDA part I kind of just left it to the side. I'm curious: how did you use CUDA to optimize the matrix multiplications? And how optimized is training, does it take much longer than using PyTorch?
by AndReics
6/29/2026 at 12:00:21 PM
Hi, in nanoeuler I use cuBLAS (NVIDIA's highly optimized BLAS library) for all matrix multiplications, with the tensor cores in TF32 mode. It's the same library PyTorch uses underneath, so it's very fast. What I've written by hand and optimized (and will keep improving) are the kernels for the other parts (like FlashAttention, which gave a nice 3x speedup), while the large matrix multiplications are delegated to cuBLAS. Training the 116M model on a 4070 runs well and in reasonable time. Compared to PyTorch, it's a bit slower (probably 1.5-2.5x), but nothing dramatic, especially considering it's all done from scratch without a framework and doesn't yet have the other optimizations that would make it faster. I'm working on it.
by vforno
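For anyone wondering what "TF32 mode" means for precision, here is a minimal CPU-side sketch in C. It is not taken from nanoeuler and it does not call cuBLAS. It only emulates the idea that the matmul inputs keep FP32's 8 exponent bits but just 10 mantissa bits, while the accumulation stays in FP32. The names `round_to_tf32` and `dot_tf32` are hypothetical, and the rounding mode (round to nearest, ties away from zero) is an assumption of this sketch, not a statement about what the hardware does in every case.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: round an FP32 value to TF32 precision by keeping
   the sign, the 8 exponent bits and the top 10 of the 23 mantissa bits.
   Rounding mode is an assumption: nearest, ties away from zero. */
float round_to_tf32(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);          /* safe type punning */
    if ((bits & 0x7F800000u) == 0x7F800000u) {
        return x;                            /* leave Inf and NaN untouched */
    }
    bits += 0x00001000u;                     /* add half of the dropped 13-bit range */
    bits &= 0xFFFFE000u;                     /* clear the low 13 mantissa bits */
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* Hypothetical helper: dot product with TF32-rounded inputs and FP32
   accumulation, the same shape of computation a TF32 matmul performs
   for each output element. */
float dot_tf32(const float *a, const float *b, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; i++) {
        acc += round_to_tf32(a[i]) * round_to_tf32(b[i]);
    }
    return acc;
}
```

With this emulation, `1.0f + 1.0f/1024.0f` survives unchanged, while `1.0f + 1.0f/4096.0f` collapses to `1.0f`, which gives a feel for how much input precision is traded for tensor-core throughput.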
6/29/2026 at 1:00:43 PM
Do you have a guess why your code is so much slower than torch? I didn't look, but there should be no reason for it to be 2x slower, esp. for a simple grid of FMAs.
by novaRom
6/29/2026 at 1:11:13 PM
Yes: it runs many separate kernels instead of aggressively fusing them the way PyTorch does (with torch.compile). Each step (norm, matmul, residual, RoPE, etc.) launches its own kernel, which increases launch overhead and memory traffic, since every intermediate result is written out to global memory and read back by the next kernel. cuBLAS helps, but it's not enough to compensate.
by vforno
6/29/2026 at 1:30:59 PM
I see, really cool. Where I failed was trying to build my own matrix operations library, it was just too much, but using cuBLAS definitely helps. I'll look into the custom kernels you wrote, they seem interesting! Did you build the backprop yourself? It's a really cool project to build, and I think you'll agree that it teaches you a lot about how LLMs and machine learning in general work.
by AndReics
6/29/2026 at 1:35:57 PM
Absolutely yes! With nanoeuler I learned so much by testing every little detail of the project. Every little part you see has been tested and verified several times, so that I could understand it and be sure it works.
by vforno