2/6/2026 at 5:09:57 PM
I am! I moved from a shoebox Linux workstation with 32GB of RAM and a 12GB RTX 3060 to a 256GB M3 Ultra, mainly for the unified memory. I've only had it a couple of months, but so far it's proving its worth in the quality of LLM output, even quantized.
I generally run Qwen3-VL 235B at Q4_K_M quantization so that it fits. That leaves me plenty of RAM for workstation tasks while still delivering around 30 tok/s.
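If you're wondering what that looks like in practice, here's a minimal sketch using llama-cpp-python's Metal backend. The GGUF filename and context size are placeholders, not my exact setup; it's just the shape of it.

```python
# Minimal llama-cpp-python sketch (Metal build on Apple Silicon).
# The model filename is a placeholder; substitute your local GGUF.
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3-235B-Q4_K_M.gguf",  # hypothetical filename
    n_gpu_layers=-1,  # offload all layers to Metal / unified memory
    n_ctx=32768,      # raise this as far as your RAM budget allows
)

result = llm("Summarize the tradeoffs of Q4_K_M quantization.", max_tokens=256)
print(result["choices"][0]["text"])
```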
The smaller Qwen3 models (like Qwen3-Coder) I use in tandem; they run much faster, of course, and I tend to run them at higher quants, up to Q8, for quality.
The biggest boon of all that RAM, I've found, is being able to run models with their full context window allocated, which lets me hand them larger and more complicated tasks than I could before. That alone makes the money I spent worth it, IMO.
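To give a feel for why full context eats so much memory: the KV cache grows linearly with context length. Here's a back-of-the-envelope calculation; the architecture numbers are assumptions (roughly the shape of a large GQA model like Qwen3-235B), so check the model card rather than trusting them.

```python
# Rough KV-cache sizing: bytes = 2 (K and V) * layers * kv_heads * head_dim
#                                * bytes_per_element * context_length.
# All architecture numbers below are assumptions, not official specs.
n_layers   = 94       # transformer blocks (assumed)
n_kv_heads = 4        # grouped-query KV heads (assumed)
head_dim   = 128      # per-head dimension (assumed)
elem_bytes = 2        # fp16 cache entries
ctx_tokens = 262_144  # a 256K-token context window

per_token = 2 * n_layers * n_kv_heads * head_dim * elem_bytes
total_gib = per_token * ctx_tokens / 1024**3
print(f"{per_token / 1024:.0f} KiB per token -> {total_gib:.1f} GiB of KV cache")
```

With those assumed numbers that's roughly 47 GiB of cache alone at 256K tokens, which is why this only makes sense on a machine with RAM to burn.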
I did manage to get GLM-4.7 (a 358B model) running at Q3 quantization; its output is adequate quality-wise, though it only delivers about 15 tok/s, and I had to cut down to 128K context to leave enough room for the desktop.
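The 128K cap falls out of a simple budget. Every figure below is a rough assumption (weight size at ~3.5 bits/param for Q3-ish quants, a guessed KV footprint, headroom for macOS and apps), but it shows the kind of arithmetic involved:

```python
# Crude unified-memory budget for a 358B model at Q3 on a 256 GiB machine.
# All numbers are rough assumptions for illustration, not measurements.
total_ram = 256                        # GiB of unified memory
weights   = 358e9 * 3.5 / 8 / 1024**3  # ~3.5 bits/param at Q3-ish quants
kv_cache  = 20                         # GiB at 128K context (guess)
desktop   = 40                         # GiB headroom for macOS + apps

used = weights + kv_cache + desktop
print(f"weights ~{weights:.0f} GiB, total ~{used:.0f} GiB of {total_ram} GiB")
```

Push the context much higher and that budget stops closing, so 128K was the compromise.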
If you get something this big, it's a powerhouse, but not nearly as much of one as a dedicated Nvidia GPU rig. The point is to be able to run these models _adequately_, not at production speeds, and get your work done. I found the price/performance/energy usage compelling at this level, and I am very satisfied.
by josefcub