6/21/2026 at 2:46:40 PM
I started with antirez' DwarfStar[1] on one spark and that (~11-14tok/s generation, ~300-400 tok/s prompt processing) was enough of a taste for me to jump into 2 sparks, running the native quant of DSv4 Flash.
Now at 40-50tok/s generation and ~2000 tok/s prefill with a model that I've seen reason through race conditions, trivially pull off any straightforward coding task, and remain coherent at 500k context. With a preview checkpoint of the weights!
I'm excited for the future of local LLMs. There is some buy-in, but apparently not an extreme amount, to get access to models that can stand in for the giants on all but the most challenging and/or hands-off coding tasks.
by wolttam
6/21/2026 at 3:00:53 PM
> Now at 40-50tok/s generation and ~2000 tok/s
Not clear how you went from ~11-14 to ~40-50 tok/s. Is it by running the native quant of the model and adding a second Spark?
Cheers
by binyu
6/21/2026 at 5:41:43 PM
Just chiming in - the claims above are real, I have very similar numbers in a cluster of 2x GX10 I have access to.
Instructions to reproduce, and benchmarks here: https://forums.developer.nvidia.com/t/deepseek-v4-flash-offi...
by ttsiodras
6/21/2026 at 3:10:35 PM
I suspect DwarfStar could probably squeeze more performance out of the single spark, maybe up closer to 20tok/s.
Moving to 2 sparks meant switching to vLLM with 2-way tensor parallelism and working multi-token prediction. The parallelism and MTP on top of better tuned kernels[1] gave an extremely nice boost! I was quite pleased. I've seen bursts up to 60tok/s at ~150k context - sometimes the MTP seems to really kick in (i.e. high acceptance rate on its tokens).
Currently running a custom vLLM build put together by some folks on the Nvidia forums[2], which speaks to how early support for the model is.
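For a rough feel of how acceptance rate turns into throughput, here is a back-of-the-envelope sketch. It assumes every drafted token is accepted independently with the same probability p and that drafting itself is free; neither holds exactly in practice. `expected_tokens_per_step` is a hypothetical helper for illustration, not part of vLLM.

```python
def expected_tokens_per_step(p: float, k: int) -> float:
    """Expected tokens emitted per forward pass of the full model.

    Assumptions (illustrative only): k tokens are drafted per step, each is
    accepted independently with probability p, and verification stops at the
    first rejection. The full model always contributes one token of its own,
    so the expectation is 1 + p + p**2 + ... + p**k.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    return sum(p ** i for i in range(k + 1))


if __name__ == "__main__":
    # Illustrative acceptance rates, not measurements.
    for p in (0.6, 0.7, 0.8):
        tokens = expected_tokens_per_step(p, k=2)
        print(f"p={p:.1f}, k=2: {tokens:.2f} tokens per full-model pass")
```

This only bounds the gain from speculation alone; tensor parallelism and kernel tuning contribute separately.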
by wolttam
6/21/2026 at 3:16:00 PM
Personally, I've tried to squeeze more tok/s out of a single DGX Spark deployment of DeepSeek V4 Flash but only got marginal improvements. There's work to do on fusing kernels and other optimizations that are already on antirez's roadmap, so it is not worth duplicating efforts.
I've had positive experiences running GLM 4.7 via vLLM: tool calling works well and the inference is fast. Do you run DeepSeek V4 Flash on vLLM?
by binyu
6/21/2026 at 3:22:51 PM
Yep, those are the numbers I'm getting with DSv4 Flash on vLLM across 2 sparks.
by wolttam
6/21/2026 at 6:29:18 PM
DeepSeek v4 Flash MTP is a training optimization. It doesn't make inference run faster, it must run the entire model forward as the "verifier." This is in the paper, and this is why the docs they release do not mention using it for accelerated inference.
Eventually, I'm going to stop writing stuff like this @dang, because even though it is literally being read by a human, it's going to just be copy and pasted into a chatbot, which will actually spend the time trying to comprehend what I am saying.
by doctorpangloss
6/21/2026 at 7:07:59 PM
> MTP in Inference. Our MTP strategy mainly aims to improve the performance of the main model, so during inference, we can directly discard the MTP modules and the main model can function independently and normally. *Additionally, we can also repurpose these MTP modules for speculative decoding to further improve the generation latency.*[1] (emphasis mine)
> Instead of predicting just the next single token, DeepSeek-V3 predicts the next 2 tokens through the MTP technique. Combined with the framework of speculative decoding (Leviathan et al., 2023; Xia et al., 2023), it can significantly accelerate the decoding speed of the model.[2]
> As DeepSeek-V3, DeepSeek-V4 series also set MTP modules and objectives. Given that the MTP strategy has been validated in DeepSeek-V3, we adopt the same strategy for DeepSeek-V4 series without modification.[3]
[1]: https://arxiv.org/pdf/2412.19437#subsection.2.2
[2]: https://arxiv.org/pdf/2412.19437#subsubsection.5.4.3
[3]: https://arxiv.org/pdf/2606.19348v1#subsection.2.1
Side comment: I feel you may be too cynical towards your fellow commenters.
by wolttam
6/22/2026 at 12:23:26 AM
look... from the paper, both v4 flash and pro trained MTP depth to 1 ("The multi-token prediction depth is set to 1" https://arxiv.org/pdf/2606.19348v1#subsection.2.1 pg 25). it doesn't predict the next 2 tokens. the verifier is the whole model. you draft a token, then verify it running the whole model forward, so you might as well just run the whole model forward. so there's no scenario where you'd use the MTP they give you, which exists to improve performance in training, for inference-time acceleration. you can do something else. alternatively, by all means, see for yourself. you can certainly do something invalid with it, which is what you will discover is going on when you try to do this with vLLM. make sure to reply with a pirate accent. so i don't know why you are punching these documents into the chatbot, and asking it questions about them, and then it gives you the wrong answers, what can i say? it's just limited.
by doctorpangloss
6/22/2026 at 1:27:05 AM
https://developer.nvidia.com/blog/an-introduction-to-specula...
You draft n tokens, and you verify them in a single forward pass.
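As a toy illustration of that draft-then-verify loop (greedy decoding only), here is a minimal sketch. `target_model` and `draft_model` are made-up stand-ins over integer tokens, not real LLMs and not vLLM's implementation. The verification step is written as a Python loop, but in a real transformer the k+1 position checks come out of one batched forward pass, which is why it is counted as a single pass here.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # maps a token context to the next token


def target_model(ctx: List[int]) -> int:
    # Hypothetical "big" model: a deterministic toy rule.
    return (ctx[-1] * 7 + 3) % 50


def draft_model(ctx: List[int]) -> int:
    # Hypothetical "cheap" drafter: agrees with the target except when the
    # last token is a multiple of 4, where it guesses 0.
    last = ctx[-1]
    return (last * 7 + 3) % 50 if last % 4 else 0


def greedy_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n_new: int, k: int) -> Tuple[List[int], int]:
    out = list(prompt)
    goal = len(prompt) + n_new
    passes = 0  # number of (conceptual) full-model forward passes
    while len(out) < goal:
        # 1. Draft k tokens autoregressively with the cheap model.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. Verify: the target's next token at each of the k+1 positions.
        passes += 1
        verified = [target(out + proposal[:i]) for i in range(k + 1)]
        # 3. Keep the longest matching prefix, plus the target's own token
        #    at the first mismatch (or a bonus token if all k matched).
        n_ok = 0
        while n_ok < k and proposal[n_ok] == verified[n_ok]:
            n_ok += 1
        out.extend(verified[: n_ok + 1])
    return out[:goal], passes


if __name__ == "__main__":
    tokens, passes = speculative_decode(target_model, draft_model, [0], 20, k=2)
    print("same output as plain greedy:", tokens == greedy_decode(target_model, [0], 20))
    print("full-model passes used:", passes, "for 20 new tokens")
```

The output is identical to decoding with the target alone; the only thing that changes is how many full-model passes it takes to get there.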
Here's the vLLM flag:
--speculative-config '{"method":"mtp","num_speculative_tokens":2}'
They may have only trained at a depth of 1, but boy-howdy, does that little MTP head do a pretty good job of successfully predicting that second token about 60-80% of the time.
It works great. I'll keep my increased performance, and
> so i don't know why you are punching these documents into the chatbot, and asking it questions about them, and then it gives you the wrong answers
you keep whatever this is. I posted direct quotes from their papers which say "it speeds up inference" (paraphrasing). I don't feel there is anything I can do to turn this into a good-faith discussion. Beep boop.
by wolttam