Show HN: NanoEuler – GPT-2 scale model in pure C/CUDA from scratch

6/29/2026 at 10:56:49 AM

Wow, that's really cool, I'll definitely check it out! I've played around with machine learning algorithms built from scratch in C / CUDA too, but once I hit the CUDA part I kinda just left it to the side. I'm curious: how did you use CUDA to optimize the matrix multiplications? How optimized is training? Does it take much longer than using PyTorch?

by AndReics

6/29/2026 at 12:00:21 PM

Hi, in nanoeuler I use cuBLAS (NVIDIA's highly optimized library) for all matrix multiplications, with the tensor cores in TF32 mode. It's the same thing PyTorch uses underneath, so it's very fast. What I've written by hand and optimized (and will keep improving) are the kernels for the other parts (like FlashAttention, which gave a nice 3x speedup), while I've delegated the large matrices to cuBLAS. Training the 116M model on a 4070 runs well and in reasonable time. Compared to PyTorch it's a bit slower (probably 1.5-2.5x), but nothing dramatic, especially considering it's all done from scratch without a framework and doesn't yet have the other optimizations that would make it faster. I'm working on it.
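For context on the TF32 mode mentioned above: TF32 keeps FP32's 8-bit exponent but only 10 explicit mantissa bits for the matmul inputs, while accumulation stays in FP32. Below is a minimal C sketch that emulates the reduced input precision on the CPU. The name `tf32_truncate` is hypothetical, and plain truncation is an assumption made for simplicity; the hardware's actual rounding may differ, and this is not code from NanoEuler.

```c
#include <stdint.h>
#include <string.h>

/* Emulate TF32 input precision on the CPU (illustrative assumption:
 * simple truncation of the low mantissa bits, not hardware rounding). */
float tf32_truncate(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);   /* reinterpret the float's bit pattern */
    bits &= 0xFFFFE000u;              /* keep sign, 8 exponent bits, top 10 of 23 mantissa bits */
    float out;
    memcpy(&out, &bits, sizeof out);
    return out;
}
```

For example, `tf32_truncate(1.0f + 1.0f / 2048.0f)` drops the 2^-11 term that a full FP32 input would keep, while `1.0f + 1.0f / 1024.0f` passes through unchanged.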

by vforno

6/29/2026 at 1:00:43 PM

Do you have a guess why your code is so much slower than torch? I didn't look, but there should be no reason to have 2x slower code, esp. for a simple grid of FMAs.

by novaRom

6/29/2026 at 1:11:13 PM

Yes, because it has many separate kernels instead of aggressive kernel fusion like PyTorch (with `torch.compile`). Each pass (norm, matmul, residual, RoPE, etc.) launches its own kernel, which increases launch overhead and memory traffic. cuBLAS helps, but it's not enough to compensate.

by vforno

6/29/2026 at 1:30:59 PM

I see, really cool. Where I failed was trying to build my own matrix operations library; it was just too much. Using cuBLAS definitely helps. I'll look into the custom kernels you wrote, they seem interesting!

Did you build the backprop yourself? It's a really cool project to build, and I think you can agree that it teaches you a lot about how LLMs and machine learning in general work.

by AndReics

6/29/2026 at 1:35:57 PM

Absolutely yes! With nanoeuler I learned so much by testing every little detail of the project. Every little part you see has been tested and checked several times so that I could understand it and confirm it works.
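One common way to test a hand-written backward pass is a finite-difference gradient check. The sketch below is a generic example in C under assumed names (`loss_forward`, `loss_backward`, `max_grad_error`) and a toy squared-error loss on a dot product. The thread does not say this is how NanoEuler was verified.

```c
#include <math.h>
#include <stddef.h>

/* Forward pass: loss = 0.5 * (dot(w, x) - target)^2 */
double loss_forward(const double *w, const double *x, size_t n, double target) {
    double y = 0.0;
    for (size_t i = 0; i < n; i++) y += w[i] * x[i];
    double d = y - target;
    return 0.5 * d * d;
}

/* Hand-derived backward pass: dloss/dw[i] = (dot(w, x) - target) * x[i] */
void loss_backward(const double *w, const double *x, size_t n, double target,
                   double *grad_w) {
    double y = 0.0;
    for (size_t i = 0; i < n; i++) y += w[i] * x[i];
    double d = y - target;
    for (size_t i = 0; i < n; i++) grad_w[i] = d * x[i];
}

/* Compare the analytic gradient with central differences of step h.
 * Returns the largest absolute difference; w is restored before returning. */
double max_grad_error(double *w, const double *x, size_t n, double target,
                      double h, double *grad_w) {
    loss_backward(w, x, n, target, grad_w);
    double worst = 0.0;
    for (size_t i = 0; i < n; i++) {
        double saved = w[i];
        w[i] = saved + h;
        double loss_plus = loss_forward(w, x, n, target);
        w[i] = saved - h;
        double loss_minus = loss_forward(w, x, n, target);
        w[i] = saved;                                  /* undo the perturbation */
        double numeric = (loss_plus - loss_minus) / (2.0 * h);
        double err = fabs(numeric - grad_w[i]);
        if (err > worst) worst = err;
    }
    return worst;
}
```

A small returned error suggests the backward pass matches the forward pass; a large one usually points to a sign, index, or missing-term bug in the hand-derived gradient.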

by vforno

6/29/2026 at 12:06:06 AM

Mentioning neural ODEs doesn't make sense here, as this is unrelated. Basically any implementation of a transformer uses residuals, but you're not really training a neural ODE here.

Also consider getting rid of the em-dashes. I don't know if you mostly vibe-coded this or not, but the README is pretty clearly AI generated.

by tdesilva

6/29/2026 at 5:23:12 AM

Hi, thanks for the comment. Nanoeuler started as a study and research project, and it will improve over time. I'll do my best to make the README and other things more readable. Thank you very much.

by vforno

6/29/2026 at 10:31:43 AM

This is super interesting. Looking forward to trying this out!

by ali_chherawalla

6/29/2026 at 10:52:55 AM

Thanks a lot! If you need any help or have any questions, I'm here.

by vforno

6/29/2026 at 2:19:27 AM

I'm genuinely curious how much of this is LLM generated?

by isatty

6/29/2026 at 5:20:53 AM

Most of the transformer and SFT parts!

by vforno

6/28/2026 at 10:02:52 PM

How long was it trained for? How many tokens?

by ericb

6/28/2026 at 10:06:53 PM

Hi, a couple of hours, not too much! That includes SFT!

by vforno

6/29/2026 at 8:56:05 AM

[dead]

by valentynkit

6/28/2026 at 7:44:21 PM

Very weird coding style, did you run astyle --style=python on C code?

Also, your LLM left a comment in the CUDA source saying it is untested. Does the CUDA stuff work?

by Chu4eeno

6/28/2026 at 10:04:16 PM

Not sure, but the code is quite dense and lacking in comments. Are `nanoeuler` and `nanoeuler_check` themselves binaries checked straight into git, along with the `.log` file? All of the commit messages are "Add files via upload" and happened in quick succession.

I suspect this is LLM generated, which is cool, but then it shouldn't carry the claim "forward and backward passes are written and verified by hand" unless that is true.

Regarding the data, old texts from Gutenberg probably lower the performance, especially as many texts are whimsical on purpose. Shakespeare, for example, made up words to be theatrical. You have a mix of different old English styles in the corpus, which is a terrible way to learn modern English. I had some success using .ZIM data archives from Kiwix as a source; you should get more stable output using that data.

by bArray

6/28/2026 at 10:09:23 PM

Hi, the uploads came one after the other because this was a long, step-by-step research project and I tested the code on another machine. I admit I'm only gradually catching up on proper commits across all my projects. As for Gutenberg and Shakespeare, I admit they were the best tests I could do, but I'll keep improving!

by vforno

6/29/2026 at 4:19:55 AM

I haven't tested NanoEuler yet but Gutenberg is awesome. Maybe a matter of taste but I like it much better than modern English.

by andai

6/28/2026 at 9:26:31 PM

> Very weird coding style, did you run astyle --style=python on C code?

I'm sure you mean it in a more curious way, but this type of comment on a Show HN often comes across as too harsh/snarky/dismissive for what we want here (see https://news.ycombinator.com/showhn.html).

by dang

6/29/2026 at 2:08:21 PM

Consider adding a rule that an author must disclose (in their own words) which parts of their project LLMs assisted with, and to what extent.

by gaflo

6/29/2026 at 4:35:55 PM

I don't think it would help: most people wouldn't know about it, of those who did many wouldn't conform, and there would be no reliable way to enforce it.

by dang

6/28/2026 at 8:45:56 PM

Yes, yes, tested on a 4070 Ti 16GB, and everything worked without problems!

by vforno