5/7/2026 at 6:25:54 PM
Heh, I made something very similar for the Qwen3 models a while back. It only runs Qwen3, supports only some quants, loads from GGUF, and has inference optimized by Claude (in a loop). The whole thing is compact (just a couple of files) and easy to reason about. I made it for my students so they could tinker with it and learn (add different decoding strategies, add abliteration, etc.). Popular frameworks are large, complex, and harder to hack on, while educational projects usually focus on something outdated like GPT-2.

Even though the project was meant to be educational, it gave me an idea I can't get out of my head: what if we started building ultra-optimized inference engines tailored to an exact GPU+model combination? GPUs are expensive and harder to get by the day. If you strip away enough abstractions and code directly against the exact hardware/model, you can probably optimize things quite a lot (I hope). Maybe run an agent that tries to optimize inference in a loop (like autoresearch), empirically testing speed/quality.
The only problem with this is that once a model becomes outdated, you have to do it all again from scratch.
by kgeist
5/7/2026 at 10:10:46 PM
> what if we started building ultra-optimized inference engines tailored to an exact GPU+model combination?

The inference engines in use already include different backend building blocks optimized for different hardware.
While there are places where you can pick up some low hanging fruit for less popular platforms, there isn't a lot of room to squeeze in super optimized model-runners for specific GPU families and get much better performance. The core computations are already done by highly optimized kernels for each GPU.
There are forks of llama.cpp that have better optimizations for running on CPU architectures, but (barring maintainer disagreements) a better use of time is to target merging these improvements upstream instead of trying to make super specific model+GPU runners.
by Aurornis
5/8/2026 at 5:27:26 AM
DeepSeek's custom PTX code has previously outperformed CUDA running on Nvidia H800 GPUs.

> DeepSeek made quite a splash in the AI industry by training its Mixture-of-Experts (MoE) language model with 671 billion parameters using a cluster featuring 2,048 Nvidia H800 GPUs in about two months, showing 10X higher efficiency than AI industry leaders like Meta. The breakthrough was achieved by implementing tons of fine-grained optimizations and usage of Nvidia's assembly-like PTX (Parallel Thread Execution) programming instead of Nvidia's CUDA for some functions,
https://www.tomshardware.com/tech-industry/artificial-intell...
Custom code targeting one specific hardware implementation can improve performance quite a bit.
by GeekyBear
5/8/2026 at 3:50:51 AM
When you support multiple backends, you end up having to abstract over them. Each backend may implement the abstraction to the best of its capability, but you still have to deal with the abstraction sitting between your workload and its compute. Wouldn't it be nice if you didn't need that abstraction? That's what GP is talking about, I'm sure: optimizing the workload directly for the hardware, rather than merely the workload and the backend for the abstraction.
by LoganDark
5/8/2026 at 1:46:31 PM
Abstraction doesn't always imply performance overhead.
by Muromec
5/8/2026 at 7:54:20 PM
Abstraction necessarily reduces fit to the hardware when multiple different kinds of hardware are supported. Whether the lost fit affects the hardware you are actually using varies, but in many cases it does, which means you can reach performance gains by shedding the additional support and focusing on just your hardware.
by LoganDark
5/7/2026 at 7:37:56 PM
This takes me to the famous high-throughput FizzBuzz code golf answer [1]. If we could implement optimizations like that for inference, maybe we could increase speeds 10x or more.

[1] https://codegolf.stackexchange.com/questions/215216/high-thr...
by xtracto
5/7/2026 at 8:46:14 PM
I love scrolling and reading through this, thinking yeah, of course Python is slower than Java; oh wow, Rust is pretty on par, I wonder what the Java devs did. Then you hit asm and your jaw drops.
by Juvination
5/7/2026 at 9:04:32 PM
Check out cpp at 208.3 GiB/s, 3x faster than asm.
by slaw
5/8/2026 at 7:26:41 AM
Yeah, because (and here's the trick) they are clever and do less work. Optimizing things usually means "think of a way to do the same thing with less effort".
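A toy sketch of that principle (my own illustration, not from the thread): summing 1..n with a loop repeats the same work n times, while the closed form does it once.

```python
# "Do less work": the loop performs n additions, while Gauss's
# closed form n*(n+1)/2 computes the same answer in constant time.

def sum_loop(n: int) -> int:
    total = 0
    for i in range(1, n + 1):
        total += i
    return total

def sum_formula(n: int) -> int:
    return n * (n + 1) // 2

print(sum_loop(10_000))     # 50005000
print(sum_formula(10_000))  # 50005000
```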
by akie
5/8/2026 at 1:25:32 PM
Hire the laziest programmer :)
by andai
5/7/2026 at 7:27:00 PM
I've built something like this. One issue is that LLMs are actually terrible at writing good shaders. I've spent way too much time trying to get them not to be so awful at it.
by mirsadm
5/8/2026 at 10:48:42 AM
I tried getting every SOTA LLM (GPT 5, Opus 4.6, DeepSeek V4 Pro, GLM-5) to write a Metal 4 shader for a bottle USDZ and none of them got it right. They screwed up the normals and textures, total mess. I tried to do it in Metal 3 and it was still crappy.
by davidwritesbugs
5/7/2026 at 9:57:50 PM
Just curious if you've tried GPT 5.5 Pro?
by wahnfrieden
5/7/2026 at 6:58:31 PM
Another suggestion for optimizing local inference: the Hermes team talks a lot on X about how much better results are when you use custom parsers tuned to the nuances of each model. Some models might like to use a trailing `,` in JSON output, some don't, so if your parser can handle the quirks of the specific model, you get higher-performing functionality.
by joshmarlow
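A minimal sketch of such a model-specific lenient parser (hypothetical helper name, my own illustration; a production version would need to be more careful, since this regex would also touch commas inside string literals):

```python
import json
import re

def parse_model_json(text: str) -> dict:
    """Lenient parse for a model that emits trailing commas
    before ] or }, which strict JSON rejects. Naive sketch:
    the regex does not respect string literals."""
    cleaned = re.sub(r",\s*([\]}])", r"\1", text)
    return json.loads(cleaned)

print(parse_model_json('{"name": "qwen", "args": [1, 2, 3,],}'))
# {'name': 'qwen', 'args': [1, 2, 3]}
```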
5/8/2026 at 9:55:24 AM
I'll add to this: what if chips were designed for the model? What would happen if we moved from digital to analog (vectors represented not as bits but as voltages)? Could the compute-heavy matrix multiplications be done via op-amps? And could this analog approach be far more efficient than working within the limitations of bit representation?
by egesko
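The analog idea can be sketched numerically (my own illustration, assuming an idealized resistive crossbar): weights become conductances, inputs become row voltages, and summing each column's currents yields a dot product with no explicit multiply-accumulate.

```python
# Idealized crossbar: each cell passes current I = G * V (Ohm's
# law); wiring a column together sums those currents (Kirchhoff's
# current law), so one column = one dot product done by physics.

def crossbar_matvec(G, V):
    rows, cols = len(G), len(G[0])
    return [sum(G[i][j] * V[i] for i in range(rows))
            for j in range(cols)]

G = [[0.5, 1.0],
     [2.0, 0.25]]   # conductances (siemens)
V = [1.0, 2.0]      # input voltages (volts)
print(crossbar_matvec(G, V))  # [4.5, 1.5]
```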
5/8/2026 at 10:27:45 AM
There is https://taalas.com/ . Their chips are all digital, though; the weights are written to silicon.
by kristianp
5/7/2026 at 9:17:37 PM
What if PyTorch were extended to have a pluggable compiler? For M GPU types and N models, if the backend allows, run a specialized compiler?
by didip
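One way to picture the dispatch side of that idea (a plain-Python sketch with made-up names, not PyTorch's actual API): a registry keyed by (GPU, model) that falls back to a generic path when no specialized compiler is registered.

```python
# Hypothetical (gpu, model) -> kernel registry; unknown pairs
# fall back to a generic implementation.

COMPILERS = {}

def register(gpu: str, model: str):
    def deco(fn):
        COMPILERS[(gpu, model)] = fn
        return fn
    return deco

def generic_double(xs):
    return [x * 2 for x in xs]   # stand-in for an unspecialized path

@register("h100", "qwen3-8b")
def fused_double(xs):
    return [x * 2 for x in xs]   # stand-in for a tuned kernel

def run(gpu: str, model: str, xs):
    return COMPILERS.get((gpu, model), generic_double)(xs)

print(run("h100", "qwen3-8b", [1, 2, 3]))  # [2, 4, 6] (specialized)
print(run("a770", "qwen3-8b", [1, 2, 3]))  # [2, 4, 6] (fallback)
```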
5/8/2026 at 1:27:19 PM
Ultra-optimized HW-specific engines are what Mojo lang seems to be targeting, but I rarely hear about it here.
by nopurpose
5/8/2026 at 1:40:26 PM
> Mojo lang seems to be targeting, but I rarely hear about it here

Momentum over at Mojo lang seems very, very slow.
According to their roadmap, they're still busy on Phase 1 ("High performance CPU + GPU coding"), and haven't touched Phase 2 ("Systems application programming") and Phase 3 ("Dynamic object-oriented programming").
So perhaps there isn't much to talk about?
by andsoitis
5/8/2026 at 3:28:39 PM
They've got a lot of work yet to do to be a general-purpose language, but for GPU programming they have already demonstrated that they can outperform CUDA on Nvidia GPUs. That's pretty compelling.
by GeekyBear
5/8/2026 at 1:30:20 AM
This feels closer to ATLAS/FFTW than a model runner: the generated kernel ages out; the tuning harness is the bit you actually want to keep.
by p_stuart82