4/28/2026 at 10:52:32 PM
There's a tradeoff between dense models and MoEs: memory usage vs. compute for the same quality. For example, Qwen3.5 27B and Qwen3.5 122B-A10B have similar average performance across benchmarks. The 122B is much faster to run than the 27B (it generates more tokens for the same compute), while the 27B uses ~4x less VRAM at low context lengths (the gap shrinks at high context lengths). (Rough numbers sketched at the end of this comment.)
Right now, different hardware seems to be suited to different points in the dense vs. MoE balance. On one extreme are devices like the DGX Spark and Strix Halo, which have a lot of memory relative to their compute performance and memory bandwidth and are best suited for MoE workloads. On the other extreme you have cards like the RTX 5090, which have very high performance for the price but rather little memory and are best suited for dense models.
The Arc Pro B70 seems to sit in the awkward middle. With 1-2 of these you can run a ~30B dense model, but slowly, probably not fast enough to be useful interactively (you'd likely need a 5090 or 2x 3090 for that). Or you can run a MoE model at high throughput, but probably without enough quality to support agentic workflows that would actually use that throughput.
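A minimal back-of-envelope sketch of that tradeoff (the quantization factor, bandwidth figure, and per-token cost are illustrative assumptions, not measurements):

    # Back-of-envelope: dense 27B vs. MoE 122B-A10B at the same nominal quality.
    BYTES_PER_PARAM = 0.55  # assume ~4.4-bit quantization on average

    def weight_vram_gb(total_params_b):
        # VRAM for the weights alone, ignoring KV cache and runtime overhead
        return total_params_b * BYTES_PER_PARAM

    def decode_tok_per_s(active_params_b, bandwidth_gb_s):
        # Batch-1 decode is roughly bandwidth-bound: every active weight
        # is read once per generated token.
        return bandwidth_gb_s / (active_params_b * BYTES_PER_PARAM)

    bandwidth = 450  # GB/s, an assumed mid-range GPU
    for name, total_b, active_b in [("dense 27B", 27, 27), ("MoE 122B-A10B", 122, 10)]:
        print(f"{name}: ~{weight_vram_gb(total_b):.0f} GB weights, "
              f"~{decode_tok_per_s(active_b, bandwidth):.0f} tok/s ceiling")

Under those assumptions the dense model needs roughly 4x less VRAM for weights (~15 GB vs. ~67 GB), while the MoE's decode ceiling is roughly 3x higher (~82 vs. ~30 tok/s), which is the tradeoff described above.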
by 2001zhaozhao
4/28/2026 at 11:57:02 PM
DGX Spark is at the compute level of a 5070. Its main issue is low memory bandwidth, i.e. it has quite fast token prefill but awful token generation. Strix Halo is just slow on every metric and used to be a cheap way to get 128GB of unified RAM (now its prices are comparable to the DGX Spark's).
by storus
4/29/2026 at 12:41:45 PM
I have one; this isn't true. The wattage of a 5070 is about 300. The entire Spark unit runs at 200 watts max. In reality it runs like an RTX 5060 with lots of VRAM. Very good for training, okay for inference if you are running batch jobs and don't mind waiting.
by tehologist
4/29/2026 at 7:23:36 PM
DGX Spark actually has the same compute as a 5070 Ti, but its slower RAM and lower TDP bring it down to 5070 territory.
by storus
4/29/2026 at 6:26:10 PM
Strix Halo's TDP is significantly lower. Comparing apples to oranges, really.
by spookie
4/28/2026 at 11:08:41 PM
I am working mostly with image models, so we do a lot of fine-tunes, and the card fits perfectly here. Performance isn't great, but it can just chug along in the background.
by BoredPositron
4/29/2026 at 12:25:18 AM
I still don't see the point of running these models. I'd say they produce plausible garbage, nowhere near the quality of frontier models (when they work). Why can't Intel look beyond this nonsense state of affairs and build something with 1TB of RAM or more?
What I am trying to say is that I have yet to see anything competitive on the market. Cards have very much stalled in the sub-100GB region, and the best these corporations can do is throw out something that runs toy models and forget about it after a week.
by varispeed
4/29/2026 at 4:33:49 AM
What's wrong with Grace Hopper if you want to throw buckets of local memory at a problem?
by AlotOfReading
4/29/2026 at 9:14:57 AM
Most consumer platforms only allow up to 128/256GB of RAM. If you want more, you likely need a data centre platform. This is again a mismatch between where companies think consumers are and the reality. I think e.g. AMD missed the boat with the 9950x3d2 by limiting the memory controller. If it were possible to hook it up with 1TB of consumer DDR5 RAM, that would be something to write home about.
by varispeed
4/30/2026 at 1:41:18 PM
What does Admiral Hopper have to do with this?
by MisterTea
4/29/2026 at 8:07:01 AM
Some people, including myself, loathe Nvidia with the fiery burning passion of a thousand suns, and will put up with whatever nonsense is necessary to run without them.
by MrDrMcCoy
4/29/2026 at 5:46:22 AM
LLMs are memory-bandwidth bound, not compute bound.
by Readerium
4/29/2026 at 8:15:08 AM
LLMs are bound by both; it depends on the hardware which factor dominates.
by AntiUSAbah
4/29/2026 at 6:05:45 PM
Technically true, but if we're talking about local models, you're overwhelmingly going to be bandwidth bound. You need about 2 flops per active parameter per token. An M5 chip has what, 150-200GB/s of bandwidth? But it can easily do something like 16 tflops of fp16, so you're talking roughly 100 flops per byte of bandwidth. Which is just to say that in a batch=1 scenario, i.e. one user, you're only gonna use a few percent of the GPU's compute while you've totally saturated your memory bandwidth. For all practical purposes at the consumer level, take your memory bandwidth, divide by the size of the model, and that gives you the max tok/s throughput you're gonna get. Even a 5090 has something like 50-60 flops per byte of bandwidth; you just can't saturate the compute without running large batches. (At least for decoding; prefill is obviously more compute bound.)
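Quick sketch of that roofline arithmetic (the tflops and bandwidth figures are rough assumptions for the chips mentioned, and the 20 GB model size is just an example):

    # Batch-1 decode roofline: ~2 flops and ~1 byte of weight traffic per
    # active parameter per token (at 8-bit), vs. what the hardware can feed.
    chips = {
        # name: (fp16 tflops, memory bandwidth in GB/s), approximate
        "M5-class SoC": (16, 175),
        "RTX 5090":     (105, 1790),
    }
    model_gb = 20              # e.g. a ~20B-param model at 8-bit
    needed_flops_per_byte = 2  # decode reads each weight once, does ~2 flops

    for name, (tflops, bw) in chips.items():
        available_flops_per_byte = tflops * 1e12 / (bw * 1e9)
        tok_s_ceiling = bw / model_gb  # bandwidth / model size
        compute_util = needed_flops_per_byte / available_flops_per_byte
        print(f"{name}: ~{available_flops_per_byte:.0f} flops/byte available, "
              f"ceiling ~{tok_s_ceiling:.0f} tok/s, compute ~{compute_util:.0%} busy")

With those assumed numbers the ~90-100 flops/byte for the M5-class chip and ~60 for the 5090 fall out directly, and compute utilization sits in the low single digits at batch 1, as described above.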
by joshjob42
4/29/2026 at 6:38:55 AM
This is incorrect; prompt processing is compute bound.
by ondra
4/29/2026 at 7:40:13 AM
This is only true for some parts of the time cost function.
by icelancer