Show HN: Tiny-vLLM – high performance LLM inference engine in C++ and CUDA

5/29/2026 at 8:39:08 PM

The README is, in my opinion (author here), the most interesting part. I wrote it to help others build a useful mental model, so that you can recreate the project yourself without even needing to read my code.

by yu3zhou4

5/30/2026 at 3:25:28 PM

I am not super familiar with C and CUDA, so I read solely for the README and enjoyed it supremely. The blend of cheerful walking through instructive examples and your philosophical takes on how to approach the exercise to get the most out of it put me in a great mood. You captured that special upbeat attitude that comes about when you're doing something as well as you can just because it's so legitimately interesting to you.

by lukemerrick

5/30/2026 at 2:03:52 AM

Really practical teaching approach. I clicked in to see how safetensors are loaded and just kept reading. Thanks for sharing.

by janalsncm

5/29/2026 at 10:11:15 PM

Very nice job on the README.

>>Physically, LLM is a file which contains a lot of float numbers.

aka atoms of the LLM.

by dwa3592

5/29/2026 at 10:16:38 PM

the universe is just atomic if statements

by cyanydeez

5/30/2026 at 7:51:01 AM

it from bit

by nullpoint420

5/30/2026 at 2:10:54 AM

The lesson-style README is a great approach. Breaking down LLM inference into digestible steps makes the codebase approachable even for people who haven't touched CUDA before.

by xuanlin314

5/30/2026 at 2:56:28 AM

Thanks for sharing this. As someone currently researching LLMs, I'm sure I'll be referencing this quite a bit going forward.

by GoldenJade

5/30/2026 at 8:10:25 AM

I feel like I learned twice as much in 10 minutes reading this as I did reading LLM for Dummies. Thank you

by tom-wal

5/29/2026 at 8:41:34 PM

I love the documentation formatted in lessons. I can't wait to read through it.

by nazgulsenpai

5/29/2026 at 9:42:39 PM

Looks interesting, it reminds me of the first llama.cpp, but better documented.

by juancn

5/29/2026 at 10:26:55 PM

Wanted to add that the author has an amazing blog with lots of interesting papers: https://jedrzej.maczan.pl/

by cookiengineer

5/30/2026 at 9:49:42 AM

I am looking for a plain and simple LLM inference implementation in C, and/or in x86_64 assembly, and/or in AMD GPU RDNA assembly.

Anybody?

by sylware

5/30/2026 at 1:10:55 PM

I heard once that C++ can become assembly at some point if you type the right things in. :)

by irishcoffee

5/31/2026 at 11:23:29 AM

Well, the whole purpose is to be independent of invisible backdoor injectors...^W I mean compilers, or, to be more accurate, those compilers that deal with computer languages of absurd and grotesque syntactic complexity.

by sylware

5/29/2026 at 10:13:27 PM

It seems the author believes checking the return values of CUDA API calls is not "tiny" enough :-(

by einpoklum

5/30/2026 at 4:13:29 PM

interesting!

by smy_smy
