5/7/2026 at 6:25:54 PM
Heh, I made something very similar for the Qwen3 models a while back. It only runs Qwen3, supports only some quants, loads from GGUF, and has inference optimized by Claude (in a loop). The whole thing is compact (just a couple of files) and easy to reason about. I made it for my students so they could tinker with it and learn (add different decoding strategies, add abliteration, etc.). Popular frameworks are large, complex, and harder to hack on, while educational projects usually focus on something outdated like GPT-2.

Even though the project was meant to be educational, it gave me an idea I can't get out of my head: what if we started building ultra-optimized inference engines tailored to an exact GPU+model combination? GPUs are expensive and harder to get by the day. If you strip away enough abstractions and code directly against the exact hardware/model, you can probably optimize things quite a lot (I hope). Maybe run an agent that tries to optimize inference in a loop (like autoresearch), empirically testing speed/quality.
The only problem with this is that once a model becomes outdated, you have to do it all again from scratch.
by kgeist
5/7/2026 at 10:10:46 PM
> what if we started building ultra-optimized inference engines tailored to an exact GPU+model combination?

The inference engines in use already include different backend building blocks optimized for different hardware.
While there are places where you can pick up some low hanging fruit for less popular platforms, there isn't a lot of room to squeeze in super optimized model-runners for specific GPU families and get much better performance. The core computations are already done by highly optimized kernels for each GPU.
There are forks of llama.cpp that have better optimizations for running on CPU architectures, but (barring maintainer disagreements) a better use of time is to target merging these improvements upstream instead of trying to make super specific model+GPU runners.
by Aurornis
5/8/2026 at 5:27:26 AM
DeepSeek's custom PTX code has previously outperformed CUDA running on Nvidia H800 GPUs.

> DeepSeek made quite a splash in the AI industry by training its Mixture-of-Experts (MoE) language model with 671 billion parameters using a cluster featuring 2,048 Nvidia H800 GPUs in about two months, showing 10X higher efficiency than AI industry leaders like Meta. The breakthrough was achieved by implementing tons of fine-grained optimizations and usage of Nvidia's assembly-like PTX (Parallel Thread Execution) programming instead of Nvidia's CUDA for some functions,
https://www.tomshardware.com/tech-industry/artificial-intell...
Custom code targeting one specific hardware implementation can improve performance quite a bit.
by GeekyBear
5/8/2026 at 3:50:51 AM
When you support multiple backends, you end up having to abstract over them. Each backend may implement the abstraction to the best of its capability, but you still have to deal with the abstraction sitting between your workload and its compute. Wouldn't it be nice if you didn't need that abstraction? That's what GP is talking about, I'm sure: optimizing the workload directly for the hardware, rather than merely the workload and the backend for the abstraction.
by LoganDark
5/8/2026 at 1:46:31 PM
Abstraction doesn't always imply performance overhead.
by Muromec
5/8/2026 at 7:54:20 PM
Abstraction necessarily reduces fit to the hardware when multiple different kinds of hardware are supported. Whether the lost fit affects the hardware you are actually using varies, but in many cases it does, which means you can reach performance gains by shedding the additional support and focusing on just your hardware.
by LoganDark
5/7/2026 at 7:37:56 PM
This takes me to the famous high-throughput FizzBuzz code golf answer [1]. If we could implement optimizations like that for inference, maybe we could increase speeds 10x or more.

[1] https://codegolf.stackexchange.com/questions/215216/high-thr...
by xtracto
5/7/2026 at 8:46:14 PM
I love scrolling and reading through this, thinking yeah, of course Python is slower than Java; oh wow, Rust is pretty on par, I wonder what the Java devs did. Then you hit asm and your jaw drops.
by Juvination
5/7/2026 at 9:04:32 PM
Check out cpp at 208.3 GiB/s, 3x faster than asm.
by slaw
5/8/2026 at 7:26:41 AM
Yeah, because (and here's the trick) they are clever and do less work. Optimizing things usually means "think of a way to do the same thing with less effort".
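A toy sketch of that principle (my own illustration, not from the thread): summing 1..n with a loop repeats the same work n times, while the closed form does it once.

```python
# "Do less work": the loop performs n additions, while Gauss's
# closed form n*(n+1)/2 computes the same answer in constant time.

def sum_loop(n: int) -> int:
    total = 0
    for i in range(1, n + 1):
        total += i
    return total

def sum_formula(n: int) -> int:
    return n * (n + 1) // 2

print(sum_loop(10_000))     # 50005000
print(sum_formula(10_000))  # 50005000
```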
by akie
5/8/2026 at 1:25:32 PM
Hire the laziest programmer :)
by andai
5/7/2026 at 7:27:00 PM
I've built something like this. One issue is that LLMs are actually terrible at writing good shaders. I've spent way too much time trying to get them not to be so awful at it.
by mirsadm
5/8/2026 at 10:48:42 AM
I tried getting every SOTA LLM (GPT 5, Opus 4.6, DeepSeek V4 Pro, GLM-5) to write a Metal 4 shader for a bottle USDZ and none of them got it right. They screwed up the normals and textures, total mess. I tried to do it in Metal 3 and it was still crappy.
by davidwritesbugs
5/7/2026 at 9:57:50 PM
Just curious if you've tried GPT 5.5 Pro?
by wahnfrieden
5/7/2026 at 6:58:31 PM
Another suggestion for optimizing local inference: the Hermes team talks a lot on X about how much better results are when you use custom parsers tuned to the nuances of each model. Some models might like to use a trailing `,` in JSON output, some don't, so if your parser can handle the quirks of the specific model, you get higher-performing functionality.
by joshmarlow
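A minimal sketch of such a model-specific lenient parser (hypothetical helper name, my own illustration; a production version would need to be more careful, since this regex would also touch commas inside string literals):

```python
import json
import re

def parse_model_json(text: str) -> dict:
    """Lenient parse for a model that emits trailing commas
    before ] or }, which strict JSON rejects. Naive sketch:
    the regex does not respect string literals."""
    cleaned = re.sub(r",\s*([\]}])", r"\1", text)
    return json.loads(cleaned)

print(parse_model_json('{"name": "qwen", "args": [1, 2, 3,],}'))
# {'name': 'qwen', 'args': [1, 2, 3]}
```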
5/8/2026 at 9:55:24 AM
I'll add to this: what if chips were designed for the model? What would happen if we moved from digital to analog (vectors represented not as bits but as voltages)? Could the compute-heavy matrix multiplications be done via op-amps? And could this analog approach be far more efficient than working within the limitations of bit representation?
by egesko
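The analog idea can be sketched numerically (my own illustration, assuming an idealized resistive crossbar): weights become conductances, inputs become row voltages, and summing each column's currents yields a dot product with no explicit multiply-accumulate.

```python
# Idealized crossbar: each cell passes current I = G * V (Ohm's
# law); wiring a column together sums those currents (Kirchhoff's
# current law), so one column = one dot product done by physics.

def crossbar_matvec(G, V):
    rows, cols = len(G), len(G[0])
    return [sum(G[i][j] * V[i] for i in range(rows))
            for j in range(cols)]

G = [[0.5, 1.0],
     [2.0, 0.25]]   # conductances (siemens)
V = [1.0, 2.0]      # input voltages (volts)
print(crossbar_matvec(G, V))  # [4.5, 1.5]
```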
5/8/2026 at 10:27:45 AM
There is https://taalas.com/ . Their chips are all digital, though; the weights are written to silicon.
by kristianp
5/7/2026 at 9:17:37 PM
What if PyTorch were extended to have a pluggable compiler? For M GPU types and N models, if the backend allows, run a specialized compiler?
by didip
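One way to picture the dispatch side of that idea (a plain-Python sketch with made-up names, not PyTorch's actual API): a registry keyed by (GPU, model) that falls back to a generic path when no specialized compiler is registered.

```python
# Hypothetical (gpu, model) -> kernel registry; unknown pairs
# fall back to a generic implementation.

COMPILERS = {}

def register(gpu: str, model: str):
    def deco(fn):
        COMPILERS[(gpu, model)] = fn
        return fn
    return deco

def generic_double(xs):
    return [x * 2 for x in xs]   # stand-in for an unspecialized path

@register("h100", "qwen3-8b")
def fused_double(xs):
    return [x * 2 for x in xs]   # stand-in for a tuned kernel

def run(gpu: str, model: str, xs):
    return COMPILERS.get((gpu, model), generic_double)(xs)

print(run("h100", "qwen3-8b", [1, 2, 3]))  # [2, 4, 6] (specialized)
print(run("a770", "qwen3-8b", [1, 2, 3]))  # [2, 4, 6] (fallback)
```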
5/8/2026 at 1:27:19 PM
Ultra-optimized HW-specific engines are what Mojo lang seems to be targeting, but I rarely hear about it here.
by nopurpose
5/8/2026 at 1:40:26 PM
> Mojo lang seems to be targeting, but I rarely hear about it here

Momentum over at Mojo lang seems very, very slow.
According to their roadmap, they're still busy on Phase 1 ("High performance CPU + GPU coding"), and haven't touched Phase 2 ("Systems application programming") and Phase 3 ("Dynamic object-oriented programming").
So perhaps there isn't much to talk about?
by andsoitis
5/8/2026 at 3:28:39 PM
They've got a lot of work yet to do to be a general-purpose language, but for GPU programming they have already demonstrated that they can outperform CUDA on Nvidia GPUs. That's pretty compelling.
by GeekyBear
5/8/2026 at 1:30:20 AM
This feels closer to ATLAS/FFTW than a model runner: the generated kernel ages out; the tuning harness is the bit you actually want to keep.
by p_stuart82