6/30/2026 at 5:49:40 AM
I really appreciate this type of article. I feel like a lot of knowledge in LLM training and inference is locked inside the heads of practitioners. Similar to compiler engineers before.
To work in LLM training/inference you’re expected to know this stuff, but to know this stuff you need to be working in the space.
by blueblazin
6/30/2026 at 6:15:39 AM
Thank you for the kind words. We will write and share more of these.
by radq
6/30/2026 at 9:31:33 AM
> Similar to compiler engineers before.
I guess the difference here is that we have ample compiler literature and know practically 99% of all there is to know about the compilers that exist in the wild, which isn't true of this new field.
Until we’ve gathered and agreed on a few “dragon books” for LLMs and have explored all there is to LLMs, you’re probably right - know-how will be with the practitioners and in source code until it’s distilled (pun intended).
by alfiedotwtf
6/30/2026 at 9:38:02 AM
A better comparison would be low-level code running on smaller chips: the intersection of hardware and software engineering.
by Melatonic
6/30/2026 at 9:10:00 AM
Most industries are like that.
by someonebaggy
6/30/2026 at 6:08:59 AM
Gentle reminder that while most money is spent on LLM inference, the vast majority of useful AI use is in fact not LLMs. Also, more and more work is being poured into making small models. One thing I like about the whole export controls saga is that people are finding creative ways to squeeze performance out of these devices, as witnessed in this post. But if you then look at solutions like vLLM, vLLM will just fill whatever VRAM is available, no matter the context size or the model size. So then you have two things to worry about: first, how do you know exactly what the optimal VRAM assignment per model and per context size is, which currently seems to be based purely on experience; and second, how do you make sure that only that amount is available to your infra/containers, which is being handled by DRA and stuff like https://project-hami.io
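To make the sizing question a bit more concrete, here is a rough back-of-the-envelope sketch of how weights plus KV cache translate into a VRAM fraction. It is not how vLLM or HAMi compute anything; the function names are made up for illustration, and the model shape (32 layers, 8 KV heads, head dimension 128, fp16, 8B parameters) is an assumed example rather than a measurement.

```python
# Rough VRAM estimate: model weights plus KV cache.
# All model shapes below are illustrative assumptions, not measurements.

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed to hold keys and values for every layer and token."""
    # Factor 2 = one tensor for keys plus one for values.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * context_len * batch_size


def weight_bytes(num_params, bytes_per_elem=2):
    """Bytes needed to hold the weights at the given precision."""
    return num_params * bytes_per_elem


def vram_fraction(total_vram_bytes, num_params, **kv_kwargs):
    """Fraction of a device that weights plus KV cache would occupy.

    Ignores activations, the CUDA context, and fragmentation, so the
    result should be read as a lower bound.
    """
    needed = weight_bytes(num_params) + kv_cache_bytes(**kv_kwargs)
    return needed / total_vram_bytes


if __name__ == "__main__":
    # Hypothetical 8B-parameter model with grouped-query attention, fp16.
    shape = dict(num_layers=32, num_kv_heads=8, head_dim=128, context_len=8192)
    gib = 1024 ** 3
    print(f"KV cache: {kv_cache_bytes(**shape) / gib:.2f} GiB")
    frac = vram_fraction(80 * gib, 8_000_000_000, **shape)
    print(f"Fraction of an 80 GiB device: {frac:.2f}")
```

A number like this, plus some headroom, is the kind of value one could pass to a knob such as vLLM's `gpu_memory_utilization` instead of leaving it at the default. In practice the right headroom still seems to come from trial and error.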
This is only tangentially related to the blog post, but the title was picked in such a way that I couldn't help but put the shameless plug here. When he wrote "popping the bubble", I thought we were talking about devices and reducing NVIDIA dependency, but this seems very focused on CUDA.
Disclaimer: I work with Dynamia.ai, the founders of which created HAMi.
by rjzzleep
6/30/2026 at 9:21:02 AM
> the vast majority of useful AI use is in fact not LLMs
Can you explain what you mean here? Are you talking about small neural networks doing specific tasks?
by esperent
7/1/2026 at 3:21:57 PM
All sorts of optimizations. Of course vision is huge. Lots of production use in all sorts of manufacturing. Lam Research had a few talks on semiconductor manufacturing optimization. There is also CUDA-assisted RAN.
Maybe AI is a bit of a misnomer, since everything ML at some point just started getting called AI.
by rjzzleep