6/1/2026 at 6:42:04 AM
Hi HN. I wrote this post after getting frustrated by how few ways there are to run the new Gemma 4 Drafter models, with mainstream tools not prioritizing them and hiding all the performance levers.
I ended up getting a modern 26B MoE model (Gemma 4) running at reading speed on an old recycled server with a single Xeon E5-2620 v4 and 128GB of DDR3 RAM (and no GPU). It took a lot of work, but it actually worked out somehow.
I've also linked the quants at the end, but they won't run unless you use the ik_llama-cpp fork I mention; see the other posts for more details.
I'm not an ML engineer, so I'm by no means an expert, and the server is busy acting as a Nix cache, but if you have any questions I'll try to answer on a best-effort basis.
by cafkafk
6/1/2026 at 10:23:09 AM
"-t 8 matches physical cores. The machine has 16 SMT threads but only 8 cores. On a memory-bound workload, oversubscribing threads adds scheduling cost without adding throughput: the cores are waiting on DDR3, not on each other."But ... isnt that a classic use case for SMT? Giving T1 sth. to do while T0 is waiting on DDR(3) and vise-versa?
I also don't understand the explanation of "--cpu-moe". If an expert has ~4.0 GiB of parameters, why does optimizing the sequence of experts minimize cache thrashing? With 20 MiB of L3 cache vs 4.0 GiB of parameters, it won't cache any noticeable amount of the parameters, will it?
As mentioned by others, only some Intel Xeon E5-2xxx v4 did support DDR3, and according to Intel, the E5-2620 v4 is not one of them.
by Sweepi
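As a rough check on the question in the comment above, the share of the weights that fits in L3 can be computed directly. The 20 MiB and ~4.0 GiB figures are taken from the comment; the function name is only illustrative.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

def cache_fraction(cache_bytes, working_set_bytes):
    """Fraction of a working set that fits in a cache of the given size."""
    return cache_bytes / working_set_bytes

# Figures quoted in the comment: 20 MiB of L3 vs ~4.0 GiB of parameters.
fraction = cache_fraction(20 * MIB, 4.0 * GIB)
print(f"L3 can hold about {fraction:.2%} of the weights at once")
```

At well under 1%, the cache cannot hold a meaningful share of the weights, which is the point the comment raises.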
6/1/2026 at 10:44:48 AM
> But ... isnt that a classic use case for SMT? Giving T1 sth. to do while T0 is waiting on DDR(3) and vise-versa?
Waiting in terms of latency. When the bus is mostly empty and it takes a while to make a round trip it's great to try to find a few extra passengers to put on it. When the buses are all completely full adding the extra riders just makes the bus stop that much more chaotic.
by zamadatix
6/1/2026 at 1:58:26 PM
This is ironically a pretty solid use case for (ex VLIW research) ILP-optimizing compilers.
Given knowable runtime hardware usage patterns (huge bursts of memory bandwidth saturation) and a single limited core/thread-shared resource (memory bandwidth), one could optimize for the constraint ahead of runtime.
Because most of the performance optimization levers you have available to pull are (a) trade compute for memory bandwidth (e.g. compression), (b) preload when memory bandwidth is available, (c) optimize the choice of what's in cache when, (d) align to cache size / memory boundaries.
Or tl;dr, try to approximate GPU ISAs at the CPU compiler level. (Though why would anyone but hobbyists bother, when everyone else just buys pallets of Nvidia/AMD or designs their own ML chips?)
by ethbr1
6/1/2026 at 3:40:23 PM
Fantastic practical achievement!
I wonder if I could get similar or even better performance from a similar Dell T7610 workstation with dual Xeons and also 128GB of DDR3?
The CPUs are better core-wise, but that probably does not make much difference?
It has:
CPUs: 2 × Xeon E5-2697 v2
Cores / threads: 24 cores / 48 threads total
Per-CPU cores: 12 cores / 24 threads
Base clock: 2.70 GHz
Max turbo: 3.50 GHz
It is sitting gathering dust, but reading-speed Gemma sounds promising.
by sireat
6/1/2026 at 9:49:11 AM
Are you sure you've got DDR3? I have 2 E5 v4 rigs at home and both have DDR4 ... unless I am wrong and 2011-3 supports both DDR3 and DDR4.
by gdjdhdheb
6/1/2026 at 3:50:11 PM
I won't speak for cafkafk, but I have two E5 (v3/v4) systems, one on DDR4 and one on DDR3. This generation of CPU all support DDR4, but a few SKUs do support DDR3 also. ChatGPT told me they were niche products to meet specific customer needs.
I just picked up the DDR3 board, an Aliexpress "XD3", so I could reuse some DDR3 RAM on a better CPU. Quad channel 1866MT/s is not bad!
by duffyjp
6/1/2026 at 6:37:11 PM
I have a dual e5 v3 that had ddr 4 as well. Been going strong for ten years and still overpowered for what I use it for.
by dawnerd
6/1/2026 at 10:15:21 AM
The first two generations supported DDR3 only. Haswell and Broadwell (v4) brought DDR4 support.
by lightedman
6/1/2026 at 11:44:49 AM
right, and they talk about "v4" which is DDR4.
by _zoltan_
6/1/2026 at 9:18:30 PM
There were several V4 Xeon models that supported DDR3 AND DDR4 simultaneously. If you had a motherboard with an X79 chipset it would (sometimes) work properly.
by lightedman
6/1/2026 at 7:28:44 PM
You're right - the article says 'CPU: Intel Xeon E5-2620 v4 @ 2.10 GHz' but also says DDR3. And the specs page for that CPU (https://www.intel.com/content/www/us/en/products/sku/92986/i...) clearly says the 2620 v4 is DDR4.
E5 CPUs have their supported RAM right on the Intel ARK pages, but short version:
E5-xxxxx v1 and v2 are all DDR3
E5-xxxxx v3 and v4 are all DDR4
Not sure why Intel didn't just cut new model numbers instead of keeping them all as "e5"
More concrete example for E5-2660 (great processor) showing v1 and v2 support DDR3, while v3 and v4, DDR4 (again, different motherboards)
DDR3 v1: https://www.intel.com/content/www/us/en/products/sku/64584/i...
DDR3 v2: https://www.intel.com/content/www/us/en/products/sku/75272/i...
DDR4 v3: https://www.intel.com/content/www/us/en/products/sku/81706/i...
DDR4 v4: https://www.intel.com/content/www/us/en/products/sku/91772/i...
This also means that you need to know which processors your motherboard supports (or, probably easier, which RAM) before putting in an order to upgrade the processor. (These processors are incredibly cheap, less than $10 for something that might have cost literally thousands ten years ago, so it's worthwhile to spend a few minutes and pick out your favorite based on cores, watts, GHz, etc.)
(Another commenter says that there are some motherboards that accept v3/v4 but also can run slower DDR3 RAM. That's new to me and quite cool - DDR3 is extremely cheap, even now. I did find these motherboards on aliexpress, too: https://www.aliexpress.us/w/wholesale-XD3-motherboard.html?s... and one clearly says v3/v4 cpu's with DDR3 RAM. That could be very useful although memory speeds are slower since CPU performance can be boosted with v3/v4.)
v1: https://www.intel.com/content/www/us/en/ark/products/series/...
v2: https://www.intel.com/content/www/us/en/ark/products/series/...
v3: https://www.intel.com/content/www/us/en/ark/products/series/...
v4: https://www.intel.com/content/www/us/en/ark/products/series/...
by _hyn3
6/1/2026 at 9:09:36 PM
I bought a renewed Dell T7810 with 2x E5-2690v4 (28c/56t) and 128GB on Amazon for under $500 2 years ago.
Search Amazon for "chia farming" ...and scroll past chia seeds :)
now same machine is 2.5x the price
https://www.amazon.com/dp/B095TRGCSX
but way cheaper than current ddr5 machines
by m463
6/2/2026 at 12:24:28 AM
Bought the exact same machine (same config and ram as well) around the same time off ebay for ~$280. Part of me wonders if I should sell it, but I do occasionally like to play with homelab stuff.
I have a 3060 12gb card I'd love to hook up to my PoE Reolink cameras for face detection and to get off of the Reolink app.
by justinram11
6/1/2026 at 11:17:38 PM
> now same machine is 2.5x the price
2.5x?! I have a bunch of older Haswell servers I got for free that are rotting away in my garage. I had initially thought of stripping out the ECC DDR4, but now I'm wondering if I'll get takers on Marketplace...
by overfeed
6/1/2026 at 2:28:06 PM
This seems remarkably suited to my situation:
CPU(s): 32
On-line CPU(s) list: 0-31
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
Also with 128G. Do 8 DIMM sockets imply more actual bandwidth in practice?
This poor thing is currently a YouTube watching box.
by Lerc
6/1/2026 at 3:35:47 PM
One thing to note: these Xeons have quad memory channels, which usually means double the bandwidth of an equivalent desktop CPU if you populate all the slots.
I have a dual E5-2667 v2 server with 512GB DDR3 and it's quite nice: the memory bandwidth is higher than that of a DDR4 desktop with a way newer CPU, even though it's ECC and registered.
by miahi
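The quad-channel point above can be sketched as a theoretical peak-bandwidth calculation. This assumes a standard 64-bit (8-byte) DDR channel and borrows the 1866 MT/s figure mentioned elsewhere in the thread; the function name is illustrative, and sustained real-world bandwidth will be lower than this peak.

```python
def peak_bandwidth_gb_s(channels, mega_transfers_per_s, bus_width_bits=64):
    """Theoretical peak bandwidth in GB/s: channels x transfers/s x bytes per transfer."""
    bytes_per_transfer = bus_width_bits // 8
    return channels * mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9

# Quad-channel Xeon vs. a dual-channel desktop at the same transfer rate.
quad = peak_bandwidth_gb_s(channels=4, mega_transfers_per_s=1866)
dual = peak_bandwidth_gb_s(channels=2, mega_transfers_per_s=1866)
print(f"quad channel: {quad:.1f} GB/s, dual channel: {dual:.1f} GB/s")
```

At equal transfer rates the quad-channel figure is exactly twice the dual-channel one, which is why an old DDR3 Xeon can keep up with a newer dual-channel DDR4 desktop on bandwidth.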
6/1/2026 at 7:07:17 AM
(purple on black is really hard to read)
You say it runs "at reading speed". Have you benchmarked it?
by fragmede
6/1/2026 at 7:32:15 AM
> (purple on black is really hard to read)
Noted, and agreed (it looks like it has also already been clicked, which I dislike). Honestly, I need to redo the themes.
> You say it runs "at reading speed". Have you benchmarked it?
At some point a few weeks ago, yes I think so, but I didn't write it down for some reason... so I'll have to find a time when it's not busy and do it again without a noisy system. Right now the system is noisy, but that said doing it like this:
llama-cli --model gemma-4-26B-A4B-it-Q8_0.gguf --model-draft gemma-4-26B-A4B-t-assistant-GGUF/wikitext-2-raw_ik-llama-mtp_drafter-conservative/gemma-4-26B-A4B-it-assistant-Q8_0.gguf --spec-type mtp --draft-max 3 --draft-p-min 0.0 --color -sm graph -smgs -sas -mea 256 --split-mode-f32 --temp 0.7 --cpu-moe -t 8 --flash-attn on --mla-use 3 --merge-up-gate-experts --special --mlock --run-time-repack --spec-autotune --no-kv-offload --parallel 8 --jinja -p "Why is the sky blue?" -n 128
Gives:
llama_print_timings: load time = 83911.65 ms
llama_print_timings: sample time = 26.99 ms / 128 runs ( 0.21 ms per token, 4742.15 tokens per second)
llama_print_timings: prompt eval time = 343.41 ms / 7 tokens ( 49.06 ms per token, 20.38 tokens per second)
llama_print_timings: eval time = 10639.36 ms / 127 runs ( 83.77 ms per token, 11.94 tokens per second)
llama_print_timings: total time = 11114.98 ms / 134 tokens
So 11.94 tokens per second while it's also playing binary cache and CI builder.
When I do it properly, I'll add it to the blog as well!
by cafkafk
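For anyone wanting to re-derive the rates from the raw timings above, here is a minimal sketch that parses the two relevant llama_print_timings lines and recomputes tokens per second. The regular expression and function name are my own and only assume the line layout shown in the comment.

```python
import re

# Two timing lines copied from the output in the comment above.
LOG = """\
llama_print_timings: prompt eval time = 343.41 ms / 7 tokens
llama_print_timings: eval time = 10639.36 ms / 127 runs
"""

# Captures the phase name, elapsed milliseconds, and token/run count.
PATTERN = re.compile(
    r"llama_print_timings:\s+(.+?) time\s*=\s*([\d.]+) ms\s*/\s*(\d+) (?:tokens|runs)"
)

def tokens_per_second(log):
    """Map each phase name to its throughput in tokens per second."""
    rates = {}
    for name, ms, count in PATTERN.findall(log):
        rates[name] = int(count) / (float(ms) / 1000.0)
    return rates

rates = tokens_per_second(LOG)
for name, rate in rates.items():
    print(f"{name}: {rate:.2f} tokens/s")
```

The "eval" figure is generation speed and the "prompt eval" figure is input processing speed; with only 7 prompt tokens the latter is dominated by fixed overhead.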
6/1/2026 at 12:13:03 PM
And if you ever run out of things to do in your copious free time, it looks like that PR #1744 was merged without the has_target_ctx assert two days after you uploaded your drafter quants. So you can now redo all your quants and rerun all your benchmarks ;-).
by fhars
6/1/2026 at 2:06:56 PM
> two days after you uploaded your drafter quants. So you can now redo all your quants and rerun all your benchmarks ;-)
2010s Javascript, putting down the controller: Ha, no one will ever surpass my high score for wasting programmer time with dependency churn...
2026 Open Source ML: Hold my beer.
by ethbr1
6/1/2026 at 1:33:06 PM
What's time to first token? Raw throughput is usually not the problem in local setups in my experience.
by bbatha
6/1/2026 at 9:17:16 AM
I am pretty sure llamacpp have their own benchmarking binary that you can use.
by anon-3988
6/1/2026 at 10:46:21 AM
llama-bench is part of the llama-cpp package, but from recent experimentation, the settings it is able to (or is documented to?) accept lag behind somewhat. Not sure whether it would accept all of the esoteric settings in the article?
by mft_
6/1/2026 at 9:49:45 AM
20 tokens per second for eval time is the killer here. It means you can't use this to process any meaningful amount of text.
A GPU typically processes close to 1000 tokens/s during eval.
by ekianjo
6/1/2026 at 1:39:28 PM
The prompt is literally "why is the sky blue?" and consists of 7 tokens.
It's probably too small for the timings to be taken seriously.
by hnfong
6/1/2026 at 10:47:44 AM
I'm pretty sure eval time is token generation time where it's actually outputting new tokens. If you're getting a thousand per second on that, I'd love to know on what.
by boutell
6/1/2026 at 12:45:48 PM
From the prompt timings above, it seems like 'prompt eval time' is the equivalent to 'processing time for input tokens'.
Hyperscalers can perform this evaluation very quickly because evaluation can be significantly parallelized. The layer `i` output of token `j` only requires access to the layer `i-1` output of all previous tokens, so a parallel frontier develops. Token (0,0) [(token, layer)] is processed first, then tokens (0,1) and (1,0) can be processed in parallel, then (0,2), (1,1), and (2,0), and so on.
The maximum parallel width becomes equal to the number of layers in the model. The Gemma 4 26B-A4B model discussed in this article evidently has 30 layers, giving a 30-fold speedup if the system were otherwise unconstrained (all layers can be run in parallel, and one full set of layer outputs is completed in the KV pass for each pass of the parallel sweep).
In the specific output above, however, the input prompt is only seven tokens long so there are probably considerable non-amortized spinup effects at play.
by Majromax
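The parallel frontier described above can be sketched as a wavefront schedule over (token, layer) cells. This only illustrates the dependency pattern from the comment, under the assumption that every cell takes one time step; it is not how llama.cpp or any particular runtime actually schedules work, and the function name is made up.

```python
def wavefront_schedule(num_tokens, num_layers):
    """Group (token, layer) cells into steps; cells in one step can run in parallel.

    Cell (t, l) depends on layer l-1 outputs of tokens 0..t, all of which
    finish by step t + l - 1, so the cell itself can run at step t + l.
    """
    steps = []
    for step in range(num_tokens + num_layers - 1):
        cells = [
            (t, step - t)
            for t in range(num_tokens)
            if 0 <= step - t < num_layers
        ]
        steps.append(cells)
    return steps

# Small case matching the ordering in the comment.
small = wavefront_schedule(num_tokens=3, num_layers=3)
print(small[0], small[1], small[2])

# The 7-token prompt on a 30-layer model, as in the benchmark above.
schedule = wavefront_schedule(num_tokens=7, num_layers=30)
total_cells = sum(len(cells) for cells in schedule)
max_width = max(len(cells) for cells in schedule)
print(f"{total_cells} cells in {len(schedule)} steps, max width {max_width}")
```

With only 7 tokens the frontier never gets wider than 7 cells, far short of the 30-layer maximum, which is consistent with the remark that such a short prompt understates achievable prompt-processing speed.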
6/1/2026 at 1:43:35 PM
Seven tokens long input isn't very realistic, is it? For coding tasks it's normal for the input to be thousands or 10s of thousands. If it wasn't for prefix caching it'd be one miserable experience, but even then at the very best the input is often in hundreds each time. And don't even try to dump some logs into the prompt.
by bboozzoo
6/1/2026 at 1:54:04 PM
> Seven tokens long input isn't very realistic, is it?
The test prompt above was "Why is the sky blue?", so there's the seven tokens. I meant to highlight that because I'd expect processing of a thousand-token input to be faster per token than presented.
by Majromax
6/1/2026 at 7:17:05 PM
He meant prompt eval time, but have a look at these guys: https://www.youtube.com/watch?v=ndSA9T5yvmM
Over 2500 tokens per second on a single request. With 8 MI300X.
by throwawayffffas
6/1/2026 at 1:38:24 PM
I meant prompt eval time.
by ekianjo
6/1/2026 at 12:09:01 PM
Something doesn't add up here. As someone who has only recently built a home-server from an E5-26xx v2 on DDR3 RAM (because I have a sh*tload of 32g DDR3 DIMMs), I can confidently say that the newer cores (E5-26xx v3 and v4) only run on DDR4 memory...
So either you have a v2 instead of a v4 (and run on DDR3 memory), or you have a v4 but with DDR4 memory (not DDR3)
Everything else doesn't work
by dark-star
6/1/2026 at 1:27:36 PM
There are some OEM-only v3/v4 parts with dual memory controllers (because of a RAM supply crunch at the time, funnily enough), but the E5-2620 v4 is not one of them. The classic example is the very popular 12-core E5-2678 v3.
by mwpmaybe
6/1/2026 at 2:28:04 PM
This is not true. A few well known brands made both DDR3 and DDR4 servers that support v3 & v4 chips. Ask me how I know :-)
by robeastham
6/1/2026 at 3:15:45 PM
enlighten us
by smartbit
6/1/2026 at 10:26:49 PM
https://www.aliexpress.com/s/wiki-ssr/article/2696-v4-ddr3
by bobmcnamara
6/1/2026 at 12:44:32 PM
It looks like Supermicro had some DDR3 Xeon v3/v4 boards, and the first thing that came to mind was a Shenzhen workstation/gaming board using recycled parts... haven't searched on that but it's bound to exist.
by happycube
6/1/2026 at 12:32:57 PM
Yeah, the Intel reference page only lists DDR4, not DDR3:
https://www.intel.com/content/www/us/en/products/sku/92986/i...
by justinclift
6/1/2026 at 12:15:30 PM
> So either you have a v2 instead of a v4 (and run on DDR3 memory), or you have a v4 but with DDR4 memory (not DDR3)
Yup that's odd... I've got a Xeon 2680 v4 (14 cores) (amazing bargain of a little beast btw) and it's indeed on DDR4 and I saw all Xeons v4 as supporting DDR4 only.
Full spec (brand/model/mobo type) would have been nice: mine's an HP Z440 workstation repurposed as a server (which I only turn on when I'm working and which I religiously turn off before going to bed).
by TacticalCoder
6/1/2026 at 9:20:44 AM
How many watts is that setup? Cool you got it to work, but maybe only useful for vintage / retro computing rather than practical if the energy consumption makes it economically wasteful.
by arpinum
6/1/2026 at 2:57:37 PM
IDK about OP's setup, but I run a pile of E5-2683v4 Xeon recycled servers for Ceph and self-hosted business SaaS usage.
One node's ipmitool sensor report (from a self-monitoring PSU, so grain of salt, but my UPS-side monitoring tracks closely) shows 250-300W average power use. Mind you, this is for running 22 spinning disks, 2 SAS/SATA SSDs, 4 NVMe SSDs, and 768GB of DDR4.
Mid-gen 2015ish Xeons were not great at power reduction, but if you are pegging the cores, they were never particularly slow, and they did have lots of PCIe lanes. This boils down to the CPU/mobo itself not being that big a cost floor, especially if you have high utilization rates.
As a comparison, my main desktop development machine, running a Threadripper 9970X, 128GB of DDR5, a RDNA4 GPU, and a small pile of NVME drives has a power floor of roughly 250W. Some CPU centric workloads you'll definitely lose out on on the older gens of machines, but they are by no means impractical.
Maybe for a desktop usecase they are absolutely suboptimal nowadays, but for a lot of realworld usecases I would say they're still relevant.
---
Like the author posts for the LLM usecase, I think optimizing the hardware choice to the application and not leaving levers unpulled is a big key, especially considering how wide a variety of bandwidth/power draw/peak frequency/corecount SKUs exist in the Xeon lines. Without knowing what you intend to run and fitting the correct processor to it, you will end up with a disappointingly poor environment fit.
by vetrom
6/1/2026 at 12:34:24 PM
How many kWh to fabricate a brand new machine better suited to the task?
As long as performance is usable (apply your own metrics!), pulling it from existing hardware is likely the option with the lower eco footprint.
Also: chances are it'll only be used for this purpose occasionally, and/or for a short while. In that scenario [fabricating new hardware] always has the bigger eco footprint.
by RetroTechie
6/1/2026 at 1:14:48 PM
I don’t know why you’d assume that an older system is lower footprint.
If you’ve got something consuming 100 watts average over your 24 hour period, and your electricity costs 20 cents per kWh, you’re already spending almost as much as a Claude subscription.
Just on electricity, this assumes your hardware never fails and you never incur any additional costs.
There’s a big reason why newer more efficient hardware is in demand. Something that’s 10+ years old has drastically worse performance per watt.
Obviously I am not saying to throw away your old hardware as a rule but there is a point where some of this old stuff just isn’t even worth running.
by dangus
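The electricity arithmetic in the comment above can be made explicit. This sketch uses the 100 W and 20 cents per kWh figures from the comment and assumes a 30-day month; the function name is illustrative.

```python
def monthly_cost_usd(avg_watts, usd_per_kwh, days=30):
    """Electricity cost of a constant average load over a month."""
    kwh = avg_watts / 1000 * 24 * days
    return kwh * usd_per_kwh

# 100 W average at $0.20/kWh, as in the comment.
cost = monthly_cost_usd(avg_watts=100, usd_per_kwh=0.20)
print(f"about ${cost:.2f} per month")
```

The same function can be fed any measured wall power, for example the 250-300W range reported for one of the storage nodes elsewhere in the thread, to compare running costs across machines.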
6/1/2026 at 3:55:18 PM
The reason more performance/watt is in demand is that a datacenter can't suddenly draw twice as much power.
by ThatMedicIsASpy
6/1/2026 at 7:07:50 PM
Or because I don’t want my homelab to spike my electricity bill and give me a loud hot closet.
by dangus
6/1/2026 at 2:40:13 PM
I have two LARGE Xeon systems of this era that I used to use when I was heavily involved with Kubernetes and needed to build out a home lab. One is 2x Xeon w/ 256 GB of ram, and one is 1x Xeon w/ 512GB of ram. Both are slow as dogs, and both of them take up at least 150+ watts with only one power supply. My 12th gen Intel Nuc is so, so much faster and efficient. I'm recycling the Xeon systems.
by quietsegfault
6/1/2026 at 3:39:08 PM
Xeon is a group of products with really varying specs. There is no indication of which XEONs. Also new consumer CPUs often have really small internal caches.
by gnerd00
6/1/2026 at 7:05:30 PM
The Xeon processor in use by the OP of this article claims to have 20MB of Intel “Smart Cache.”
An Apple M4 chip in a Mac mini has 16MB on the P-cores and 4MB on the E-cores.
Depending on use case, AMD 3D V-cache at almost 100MB could also work out quite well.
So really, if you wait long enough, consumer chips end up with a pretty similar amount of cache.
by dangus
6/1/2026 at 5:46:33 PM
E5-2690s in my case.
by quietsegfault
6/1/2026 at 2:11:08 PM
You mention lower footprint but then make a cost comparison against Claude subscription pricing.
Claude subscription pricing is a broken way to consider footprint.
by souterrain
6/1/2026 at 5:17:42 PM
You can call it whatever you want, money is money, and money spent on energy is footprint.
by dangus
6/1/2026 at 10:14:44 AM
Would you consider improving the website's layout? Right now I find it below average quality and very distracting. Whether you are an engineer or not is not really important; great engineers can write horrible text or use a layout that is not ideal, for instance.
by shevy-java