A 10 year old Xeon is all you need

6/1/2026 at 6:42:04 AM

Hi HN. I wrote this post after getting frustrated by the lack of ways to run the new Gemma 4 Drafter models, and mainstream tools not prioritizing this, and hiding all the performance levers.

I ended up getting a modern 26B MoE model (Gemma 4) running at reading speed on an old recycled server with a single Xeon E5-2620 v4 and 128GB of DDR3 RAM (and no GPU). It took a lot of work, but it actually worked out somehow.

I've also linked the quants at the end, but they're not gonna run unless you use the ik_llama-cpp fork I mention, see other posts for more details.

I'm not an ML engineer, so I'm by no means an expert, and the server is busy acting as a Nix cache, but if you have any question, I can try to answer, but best effort.

by cafkafk

6/1/2026 at 10:23:09 AM

"-t 8 matches physical cores. The machine has 16 SMT threads but only 8 cores. On a memory-bound workload, oversubscribing threads adds scheduling cost without adding throughput: the cores are waiting on DDR3, not on each other."

But ... isnt that a classic use case for SMT? Giving T1 sth. to do while T0 is waiting on DDR(3) and vise-versa?

I also dont understand the explanation of "--cpu-moe". If an expert has ~ 4.0 GiB of Parameters, why does optimizing the sequence of experts minimize cash trashing? With 20 MiB of L3 Cash vs 4.0 GiB of Parameters, it wont cash any noticeable amount of the Parameters, will it?

As mentioned by others, only some Intel Xeon E5-2xxx v4 did support DDR3, and according to Intel, the E5-2620 v4 is not one of them.

by Sweepi

6/1/2026 at 10:44:48 AM

> But ... isnt that a classic use case for SMT? Giving T1 sth. to do while T0 is waiting on DDR(3) and vise-versa?

Waiting in terms of latency. When the bus is mostly empty and it takes a while to make a round trip it's great to try to find a few extra passengers to put on it. When the buses are all completely full adding the extra riders just makes the bus stop that much more chaotic.

by zamadatix

6/1/2026 at 1:58:26 PM

This is ironically a pretty solid use case for (ex VLIW research) ILP-optimizing compilers.

Given knowable runtime hardware usage patterns (huge bursts of memory bandwidth saturation) and a single limited core/thread-shared resource (memory bandwidth), one could optimize for the constraint ahead of runtime.

Because most of the performance optimization levers you have available to pull are (a) trade compute for memory bandwidth (e.g. compression), (b) preload when memory bandwidth is available, (c) optimize the choice of what's in cache when, (d) align to cache size / memory boundaries.

Or tl;dr, try to approximate GPU ISAs at the CPU compiler level. (Which why would anyone but hobbyists, because everyone else just buys pallets of Nvidia/AMD or designs their own ML chips?)

by ethbr1

6/1/2026 at 3:40:23 PM

Fantastic practical achievement!

I wonder if I could get similar or even better performance from similar Dell T7610 workstation with dual Xeons and also 128GB DDR3?

The CPUs are better core wise, but that probably does not make much difference?

It has CPUs 2 × Xeon E5-2697 v2

Cores / threads 24 cores / 48 threads total

Per-CPU cores 12 cores / 24 threads

Base clock 2.70 GHz

Max turbo 3.50 GHz

It is sitting gather dust but reading spead Gemma sounds promising.

by sireat

6/1/2026 at 9:49:11 AM

You sure you got DDR3 .. I have 2 e5 v4 rigs at home and both have ddr4 ... Unless I am wrong and 2011-3 supports ddr3 and ddr4

by gdjdhdheb

6/1/2026 at 3:50:11 PM

I won't speak for cafkafk, but I have two E5 (v3/v4) systems one on DDR4 and one on DDR3. This generation of CPU all support DDR4, but a few skus do support DDR3 also. ChatGPT told me they were niche products to meet specific customer needs.

I just picked up the DDR3 board, an Aliexpress "XD3" so I could reuse some DDR3 ram on a better CPU. Quad channel 1866MT/s is not bad!

by duffyjp

6/1/2026 at 6:37:11 PM

I have a dual e5 v3 that had ddr 4 as well. Been going strong for ten years and still overpowered for what I use it for.

by dawnerd

6/1/2026 at 10:15:21 AM

The first two generations supported DDR3 only. Haswell and Broadwell (v4) brought DDR4 support.

by lightedman

6/1/2026 at 11:44:49 AM

right, and they talk about "v4" which is DDR4.

by _zoltan_

6/1/2026 at 9:18:30 PM

There were several V4 Xeon models that supported DDR3 AND DDR4 simultaneously. If you had a motherboard with an X79 chipset it would (sometimes) work properly.

by lightedman

6/1/2026 at 7:28:44 PM

You're right - the article says 'CPU: Intel Xeon E5-2620 v4 @ 2.10 GHz' but also says DDR3. And the specs page for that CPU (https://www.intel.com/content/www/us/en/products/sku/92986/i...) clearly says the 2620 v4 is DDR4.

E5 CPUs have their supported RAM right on the Intel ARK pages, but short version:

E5-xxxxx v1 and v2 are all DDR3

E5-xxxxx v3 and v4 are all DDR4

Not sure why Intel didn't just cut new model numbers instead of keeping them all as "e5"

More concrete example for E5-2660 (great processor) showing v1 and v2 support DDR3, while v3 and v4, DDR4 (again, different motherboards)

DDR3 v1: https://www.intel.com/content/www/us/en/products/sku/64584/i...

DDR3 v2: https://www.intel.com/content/www/us/en/products/sku/75272/i...

DDR4 v3: https://www.intel.com/content/www/us/en/products/sku/81706/i...

DDR4 v4: https://www.intel.com/content/www/us/en/products/sku/91772/i...

This also means that you need to know the processor your motherboard supports (or, easier, probably RAM) before putting in an order to upgrade the processor. (These processors are incredibly cheap, less than $10 for something that might have cost literally thousands ten years ago, so worthwhile to spend a few minutes and pick out your favorite based on cores, watts, Ghz, etc.)

(Another commenter says that there are some motherboards that accept v3/v4 but also can run slower DDR3 RAM. That's new to me and quite cool - DDR3 is extremely cheap, even now. I did find these motherboards on aliexpress, too: https://www.aliexpress.us/w/wholesale-XD3-motherboard.html?s... and one clearly says v3/v4 cpu's with DDR3 RAM. That could be very useful although memory speeds are slower since CPU performance can be boosted with v3/v4.)

v1: https://www.intel.com/content/www/us/en/ark/products/series/...

v2: https://www.intel.com/content/www/us/en/ark/products/series/...

v3: https://www.intel.com/content/www/us/en/ark/products/series/...

v4: https://www.intel.com/content/www/us/en/ark/products/series/...

by _hyn3

6/1/2026 at 9:09:36 PM

I bought a renewed 2x E5-2690v4 server (28c/56t) 128gb on amazon for under $500 2 years ago (28c/56t) dell T7810

search amazon for "chia farming" ...and scroll past chia seeds :)

now same machine is 2.5x the price

https://www.amazon.com/dp/B095TRGCSX

but way cheaper than current ddr5 machines

by m463

6/2/2026 at 12:24:28 AM

Bought the exact same machine (same config and ram as well) around the same time off ebay for ~$280. Part of me wonders if I should sell it, but I do occasionally like to play with homelab stuff.

I have a 3060 12gb card I'd love to hook up to my PoE Reolink cameras for face detection and to get off of the Reolink app.

by justinram11

6/1/2026 at 11:17:38 PM

> now same machine is 2.5x the price

2.5x?! I have a bunch of older Haswell servers I got for free that are rotting away in my garage. I had initially thought of stripping out the ECC DDR4, but now I'm wondering if I'll get takers on Marketplace...

by overfeed

6/1/2026 at 2:28:06 PM

This seems remarkably suited to my situation,

    CPU(s): 32
      On-line CPU(s) list: 0-31
    Vendor ID: GenuineIntel  
    Model name: Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz

Also with 128G. Does 8 dimm sockets imply more actual bandwidth in practice?

This poor thing is currently a YouTube watching box.

by Lerc

6/1/2026 at 3:35:47 PM

One thing to note: These Xeons have quad memory channels, that usually means double the bandwidth of an equivalent desktop CPU, if you populate all the slots.

I have a dual E5-2667 v2 server with 512GB DDR3 and it's quite nice, the memory bandwidth is higher than of a DDR4 desktop with a way newer CPU, even though it's ECC and registered.

by miahi

6/1/2026 at 7:07:17 AM

(purple on black is really hard to read)

You say it runs "at reading speed". Have you benchmarked it?

by fragmede

6/1/2026 at 7:32:15 AM

> (purple on black is really hard to read)

Noted, and agree (it looks like it has also already been clicked, which I dislike). I honestly I need to redo the themes.

> You say it runs "at reading speed". Have you benchmarked it?

At some point a few weeks ago, yes I think so, but I didn't write it down for some reason... so I'll have to find a time when it's not busy and do it again without a noisy system. Right now the system is noisy, but that said doing it like this:

llama-cli --model gemma-4-26B-A4B-it-Q8_0.gguf --model-draft gemma-4-26B-A4B-t-assistant-GGUF/wikitext-2-raw_ik-llama-mtp_drafter-conservative/gemma-4-26B-A4B-it-assistant-Q8_0.gguf --spec-type mtp --draft-max 3 --draft-p-min 0.0 --color -sm graph -smgs -sas -mea 256 --split-mode-f32 --temp 0.7 --cpu-moe -t 8 --flash-attn on --mla-use 3 --merge-up-gate-experts --special --mlock --run-time-repack --spec-autotune --no-kv-offload --parallel 8 --jinja -p "Why is the sky blue?" -n 128

Gives:

  llama_print_timings:        load time =   83911.65 ms
  llama_print_timings:      sample time =      26.99 ms /   128 runs   (    0.21 ms per token,  4742.15 tokens per second)
  llama_print_timings: prompt eval time =     343.41 ms /     7 tokens (   49.06 ms per token,    20.38 tokens per second)
  llama_print_timings:        eval time =   10639.36 ms /   127 runs   (   83.77 ms per token,    11.94 tokens per second)
  llama_print_timings:       total time =   11114.98 ms /   134 tokens

So 11.94 tokens per second while it's also playing binary cache and CI builder.

When I do it properly, I'll add it to the blog as well!

by cafkafk

6/1/2026 at 12:13:03 PM

And if you ever run out of things to do in your copious free time, it looks like that PR #1744 was merged without the has_target_ctx assert two days after you uploaded your drafter quants. So you can now redo all your quants and rerun all your benchmarks ;-).

by fhars

6/1/2026 at 2:06:56 PM

> two days after you uploaded your drafter quants. So you can now redo all your quants and rerun all your benchmarks ;-)

2010s Javascript, putting down the controller: Ha, no one will ever surpass my high score for wasting programmer time with dependency churn...

2026 Open Source ML: Hold my beer.

by ethbr1

6/1/2026 at 1:33:06 PM

What's time to first token? Raw throughput is usually not the problem in local setups in my experience.

by bbatha

6/1/2026 at 9:17:16 AM

I am pretty sure llamacpp have their own benchmarking binary that you can use.

by anon-3988

6/1/2026 at 10:46:21 AM

llama-bench is part of the llama-cpp package, but from recent experimentation, the settings it is able to (or is documented to?) accept lag behind somewhat. Not sure whether it would accept all of the esoteric settings in the article?

by mft_

6/1/2026 at 9:49:45 AM

20 tokens per second for eval time is the killer here. It means you can't use this to process any meaningful amount of text.

A GPU typically processes close to 1000 tokens/s during eval.

by ekianjo

6/1/2026 at 1:39:28 PM

The prompt is literally "why is the sky blue?" and consists of 7 tokens.

It's probably too small for the timings to be taken seriously.

by hnfong

6/1/2026 at 10:47:44 AM

I'm pretty sure eval time is token generation time where it's actually outputting new tokens. If you're getting a thousand per second on that, I'd love to know on what.

by boutell

6/1/2026 at 12:45:48 PM

From the prompt timings above, it seems like 'prompt eval time' is the equivalent to 'processing time for input tokens'.

Hyperscalers can perform this evaluation very quickly because evaluation can be significantly parallelized. The layer `i` output of token `j` only requires access to the layer `i-1` output of all previous tokens, so a parallel frontier develops. Token (0,0) [(token, layer)] is processed first, then tokens (0,1) and (1,0) can be processed in parallel, then (0,2), (1,1), and (2,0), and so on.

The maximum parallel width becomes equal to the number of layers in the model. Gemma 4 26B-A4B model discussed in this article evidently has 30 layers, giving a 30-fold speedup if the system were otherwise unconstrained (all layers can be run in parallel, and one full set of layer outputs is completed in the KV pass for each pass of the parallel sweep).

In the specific output above, however, the input prompt is only seven tokens long so there are probably considerable non-amortized spinup effects at play.

by Majromax

6/1/2026 at 1:43:35 PM

Seven tokens long input isn't very realistic, is it? For coding tasks it's normal for the input to be thousands or 10s of thousands. If it wasn't for prefix caching it'd be one miserable experience, but even then at the very best the input is often in hundreds each time. And don't even try to dump some logs into the prompt.

by bboozzoo

6/1/2026 at 1:54:04 PM

> Seven tokens long input isn't very realistic, is it?

The test prompt above was "Why is the sky blue?", so there's the seven tokens. I meant to highlight that because I'd expect processing of a thousand-token input to be faster per token than presented.

by Majromax

6/1/2026 at 7:17:05 PM

He meant prompt eval time, but have a look at these guys: https://www.youtube.com/watch?v=ndSA9T5yvmM

Over 2500 tokens per second on a single request. With 8 MI300X.

by throwawayffffas

6/1/2026 at 1:38:24 PM

I meant prompt eval time.

by ekianjo

6/1/2026 at 12:09:01 PM

Something doesn't add up here. As someone who has only recently built a home-server from an E5-26xx v2 on DDR3 RAM (because I have a sh*tload of 32g DDR3 DIMMs), I can confidently say that the newer cores (E5-26xx v3 and v4) only run on DDR4 memory...

So either you have a v2 instead of a v4 (and run on DDR3 memory), or you have a v4 but with DDR4 memory (not DDR3)

Everything else doesn't work

by dark-star

6/1/2026 at 1:27:36 PM

There are some OEM-only v3/v4 parts with dual memory controllers (because of a RAM supply crunch at the time, funnily enough), but the E5-2620 v4 is not one of them. The classic example is the very popular 12-core E5-2678 v3.

by mwpmaybe

6/1/2026 at 2:28:04 PM

This is not true. A few well known brands made both DDR3 and DDR4 servers that support v3 & v4 chips. Ask me how I know :-)

by robeastham

6/1/2026 at 3:15:45 PM

enlighten us

by smartbit

6/1/2026 at 10:26:49 PM

https://www.aliexpress.com/s/wiki-ssr/article/2696-v4-ddr3

by bobmcnamara

6/1/2026 at 12:44:32 PM

It looks like Supermicro had some DDR3 Xeon v3/v4 boards, and the first thing that came to mind was a Shenzen workstation/gaming board using recycled parts... haven't searched on that but it's bound to exist.

by happycube

6/1/2026 at 12:32:57 PM

Yeah, the Intel reference page only lists DDR4, not DDR3:

https://www.intel.com/content/www/us/en/products/sku/92986/i...

by justinclift

6/1/2026 at 12:15:30 PM

> So either you have a v2 instead of a v4 (and run on DDR3 memory), or you have a v4 but with DDR4 memory (not DDR3)

Yup that's odd... I've got a Xeon 2680 v4 (14 cores) (amazing bargain of a little beast btw) and it's indeed on DDR4 and I saw all Xeons v4 as supporting DDR4 only.

Full spec (brand/model/mobo type) would have been nice: mine's an HP Z440 workstation repurposed as a server (which I only turn on when I'm working and which I religiously turn off before going to bed).

by TacticalCoder

6/1/2026 at 9:20:44 AM

How many watts is that setup? Cool you got it to work, but maybe only useful for vintage / retro computing rather than practical if the energy consumption makes it economically wasteful.

by arpinum

6/1/2026 at 2:57:37 PM

IDK about OPs setup, but I run a pile of E5-2683v4 Xeon recycled servers for Ceph and self hosted business SaaS usage.

One node's ipmitool sensor report (and self-monitoring PSU, so grain of salt, but my UPS side monitoring tracks closely), reports 250-300w average power use. This though, mind you is for running 22 spinning disks, 2 SAS/SATA SSDs, and 4 NVME ssds, and 768GB of DDR4.

Mid-gen 2015ish Xeons were not great at power reduction, but if you are pegging the cores, they were never particularly slow, and they did have lots of PCIe lanes. This boils down to the CPU/mobo itself not being that big a cost floor, especially if you have high utilization rates.

As a comparison, my main desktop development machine, running a Threadripper 9970X, 128GB of DDR5, a RDNA4 GPU, and a small pile of NVME drives has a power floor of roughly 250W. Some CPU centric workloads you'll definitely lose out on on the older gens of machines, but they are by no means impractical.

Maybe for a desktop usecase they are absolutely suboptimal nowadays, but for a lot of realworld usecases I would say they're still relevant.

---

Like the author posts for the LLM usecase, I think optimizing the hardware choice to the application and not leaving levers unpulled is a big key, especially considering how wide a variety of bandwidth/power draw/peak frequency/corecount SKUs exist in the Xeon lines. Without knowing what you intend to run and fitting the correct processor to it, you will end up with a disappointingly poor environment fit.

by vetrom

6/1/2026 at 12:34:24 PM

How many kWh to fabricate a brand new machine better suited to the task?

As long as performance is useable (apply your own metrics!), pulling it from existing hardware is likely the option with the lower eco footprint.

Also: chances are it'll only be used for this purpose occasionally, and/or for a short while. In that scenario [fabricating new hardware] always has the bigger eco footprint.

by RetroTechie

6/1/2026 at 1:14:48 PM

I don’t know why you’d assume that an older system is lower footprint.

If you’ve got something consuming 100 watts average over your 24 hour period, and your electricity costs 20 cents per kWh, you’re already spending almost as much as a Claude subscription.

Just on electricity, this assumes your hardware never fails and you never incur any additional costs.

There’s a big reason why newer more efficient hardware is in demand. Something that’s 10+ years old has drastically worse performance per watt.

Obviously I am not saying to throw away your old hardware as a rule but there is a point where some of this old stuff just isn’t even worth running.

by dangus

6/1/2026 at 3:55:18 PM

The reason more performance/watt is in demand because a datacenter can't suddenly draw twice as much power.

by ThatMedicIsASpy

6/1/2026 at 7:07:50 PM

Or because I don’t want my homelab to spike my electricity bill and give me a loud hot closet.

by dangus

6/1/2026 at 2:40:13 PM

I have two LARGE Xeon systems of this era that I used to use when I was heavily involved with Kubernetes and needed to build out a home lab. One is 2x Xeon w/ 256 GB of ram, and one is 1x Xeon w/ 512GB of ram. Both are slow as dogs, and both of them take up at least 150+ watts with only one power supply. My 12th gen Intel Nuc is so, so much faster and efficient. I'm recycling the Xeon systems.

by quietsegfault

6/1/2026 at 3:39:08 PM

Xeon is a group of products with really varying specs. There is no indication of which XEONs. Also new consumer CPUs often have really small internal caches.

by gnerd00

6/1/2026 at 7:05:30 PM

The Xeon processor in use by the OP of this article claims to have 20MB of Intel “Smart Cache.”

An Apple M4 chip in a Mac mini has 16MB on the P-cores and 4MB on the E-cores.

Depending on use case, AMD 3D V-cache at almost 100MB could also work out quite well.

So really, if you wait long enough, consumer chips end up with a pretty similar amount of cache.

by dangus

6/1/2026 at 5:46:33 PM

E5-2690s in my case.

by quietsegfault

6/1/2026 at 2:11:08 PM

You mention lower footprint but then make a cost comparison against Claude subscription pricing.

Claude subscription pricing is a broken way to consider footprint.

by souterrain

6/1/2026 at 5:17:42 PM

You can call it whatever you want, money is money, and money spent on energy is footprint.

by dangus

6/1/2026 at 10:14:44 AM

Would you consider improving the website's layout? Right now I find it below average quality and very distracting. Whether you are an engineer or not is not really important; great engineers can write horrible text or use a layout that is not ideal, for instance.

by shevy-java

6/1/2026 at 12:07:29 PM

We’re not there yet, but the obvious endgame of the present bubble insanity is open models running on local hardware and devices are “good enough” for most use cases. That will completely implode what’s going on at the moment in tech.

by cmiles8

6/1/2026 at 12:38:58 PM

Happened to me. CoPilot changing prices prompted me to cancel my CoPilot subscription and install a local coding model running entirely in VRAM. Will call Claude APIs when I get really stuck, but I should be able to handle 80% of my needs with a dumber local model.

For a long time, too. Programming languages rarely change much, techniques rarely change, so I should be able to use said model for I hope at least five years; and if at any time they optimize local models to cram even more intelligence into the same amount of VRAM, I can upgrade to that.

I like this path.

by cbdevidal

6/1/2026 at 3:33:58 PM

> Will call Claude APIs when I get really stuck, but I should be able to handle 80% of my needs with a dumber local model.

I experiment with all of the local models I can fit into 32GB of VRAM and I have subscriptions to multiple SOTA providers.

The difference between them is very large, unfortunately. The local models can handle small tasks and refactoring mostly okay, but doing anything challenging with them becomes a waste of time. Unfortunately the waste isn’t immediately obvious because they will come back with something that looks like it works, but then on closer examination I need to throw it out and reset them in a usable direction.

by Aurornis

6/1/2026 at 6:44:56 PM

One thing I don’t quite understand:

Wouldn’t it be in Amazon’s interest to run open models and sell time slots at around the cost of running them?

My only guess for why they don’t is that AI labs are currently selling their models at a huge loss, so this isn’t worth Amazon spending low-margin compute on compared to other higher margin products.

What I’m getting at, is maybe we won’t even need to run the models locally for the current status quo to implode. After today’s AI labs run out of free-money runway and actually have to sell their models at a price above running them, there will be the incentive for anyone with compute to just undercut by selling open-models-as-a-service at commodity prices.

by materielle

6/1/2026 at 7:22:13 PM

AWS Bedrock offers a mixture of proprietary and open-weights models (DeepSeek, Nemotron, gpt-oss, etc.):

https://docs.aws.amazon.com/bedrock/latest/userguide/model-c...

by philipkglass

6/2/2026 at 2:13:53 AM

Given the current performance requirements for "good enough for most people", I just don't see that happening any time soon.

Most users (potential or actual) are not on a desktop and don't have a beefy discrete GPU. There are "NPU" ASIC chips like what is being put in the new raspberry pi's but their performance and compatibility is not what you might think it is. To get GPU-like performance the ASIC would have to be closer to the size of a real GPU, and at that point why bother. And many devices just don't have the room.

by ranger_danger

6/1/2026 at 12:29:15 PM

This. OpenAI and Anthropic are ultimately compute infrastructure plays and not really AI. Everyone will have models, they'll have the ability to run them. This is why the GPU shortage is in their favor.

by PLenz

6/1/2026 at 12:59:45 PM

And like Google and Meta, these companies are going to morph into advertising giants. Advertising is an economic black hole and it eats everything that comes close.

by ryandvm

6/1/2026 at 3:14:20 PM

Embedding ads in LLM responses is something researchers are having a lot of trouble figuring out right now.

I have seen the results of some early attempts. It fails in such hilarious ways that all these companies are scared of productizing it. But once someone does it, the taboo is broken and everyone else will follow suit immediately.

by fooker

6/1/2026 at 4:14:27 PM

It's already being done: https://openai.com/index/testing-ads-in-chatgpt/

by jaimie

6/1/2026 at 8:51:15 PM

I feel like LLM's will change advertising like internet search changed advertising.

Here "like" means similarly in magnitude, not direction. If I could predict the future etc.

by ducttapecrown

6/1/2026 at 9:24:47 PM

It is for now but they cannot keep demand on their side high enough to suck up supply forever. Manufacturing isnt going to stop, not unless there is a Taiwan incident.

For those with tin foil hats, scheme away at possible futures!

by HerbManic

6/1/2026 at 1:26:57 PM

How does that view align with Anthropic leasing data centers from others?

I don’t know OpenAI’s infra, but to the extent they are buying GPUs and building data centers with their own money, that sounds like a bad move.

Satya has mismanaged the AI transition in many ways, but one thing he got right is that models are commodities, and the value is in applications that apply them to create user benefit. I agree that any company trying to build a moat with a model is not long for this world.

by brookst

6/1/2026 at 2:38:45 PM

Then they go bankrupt.

by cmiles8

6/1/2026 at 12:42:45 PM

Do you think there will still be an incentive to release weights in that scenario? Everyone will have models only if there continue to be companies releasing weights.

by butokai

6/1/2026 at 12:52:08 PM

Companies won't but I suspect this is a role that something else open source-y will fill that niche. Maybe orgs like wikimedia or internet archive, maybe some hackers just making things, maybe nation states that want to disrupt other players. Also model training will get better and better both on the algo and the hardware side. You can easily see a world where you might be able to train a good enough model on a home lab in a few days.

by PLenz

6/1/2026 at 3:27:38 PM

But you will need training data. Like a whole Internet search engine or massive data scraping. That‘s a thing that will not change with better algorithms, hardware or cheaper energy.

by rmoriz

6/1/2026 at 3:49:52 PM

Data is the only moat but they'll be starting in the same place the current set of players statyed out just a few years ago. I suspect that the delta between what is publicly available (if not legally publicly available! see scihub) and what open ai and anthropic have is relatively small.

by PLenz

6/1/2026 at 3:21:55 PM

Maybe. But if we can all run our own model locally in 2 years on commodity hardware OpenAI and Anthropic will start to look like WeWork during the pandemic

by aorloff

6/1/2026 at 3:44:33 PM

I agree with you that they are headed in that direction! The GPU shortage is (I think) similar to the pandemic era hiring binge. It's less about the extra compute and more about denying the GPUs to potential competitors. They're racing against time to find something that gives them real moat (gen ai I guess?) and they are trading money for time.

This is also why the money being poured into datacenters isn't going to result in as much development as you think. It's about leveraging other people's money to lockdown more future hardware. This is going to end exactly like fiber build out in the 2000s. Eventually that fiber got used but the folks who originally paid for it got hosed.

by PLenz

6/1/2026 at 3:25:17 PM

And free model supply will stop…

by rmoriz

6/1/2026 at 3:45:09 PM

I wonder if Google will put out a free model with the ads already baked in.

by jayd16

6/1/2026 at 4:36:46 PM

If you mean releasing model weights: They won't, because they know the "shill something" vector will get abliterated immediately. And they can't use trade secrets or copyright to stop it, either, because they released the model themselves and you don't need to redistribute weights, just an adblocker LoRA.

by kyboren

6/1/2026 at 1:42:18 PM

You just described the absolute nightmare scenario for the newly minted trillion-dollar companies whose only hope is for enterprises and SMB to move all their business processes to the cloud, with employees competing at token maxxing.

by mv4

6/1/2026 at 12:25:25 PM

I wouldn't say "completely implode", too much money was poured int it, but it's clear we're heading in that direction. You get a model that is "good enough", plus privacy, plus savings in the long term.

Paradoxically, the better results we get from general harness of coding agents, the less moat Claude and co. get. It's unbelievably how fast some open models outpaced frontier models of just a few months ago.

by benterix

6/1/2026 at 12:53:37 PM

I keep intending to find time to try them. What are you seeing the best results with?

by brightball

6/1/2026 at 3:11:44 PM

If you are willing to spend about 2000 on GPUs, we are almost there.

In my opinion, the bottleneck is the package management layer and not the model capabilities and performance.

I have been an avid Linux user for decades, and if I find it confusing and painful, something is missing.

by fooker

6/1/2026 at 12:16:37 PM

this is sorta like saying that being able to run your blog on your laptop will completely implode the cloud business

by herval

6/1/2026 at 1:10:39 PM

This is actually what happens.

I run my word processing software on my apple 2 (a total joke of a computer) instead of running it on the WANG.

I run my book keeping software on visicalc instead of the IBM.

I run my simulation software on my IBM PC (I even paid for the 8087!) instead of the VAX.

Moore's law has, at least so far, allowed the pioneers with toy computers to grow their toys big enough to solve "big boy" problems after some time has allowed the toy computers to be faster and the pioneers have scaled their crappy home-grown solution to solve their 60% of the problem that was originally solved by some enormous complex system.

Eventually the toy infrastructure gets expensive and solves 90-120% of the "big iron" problem space, but it also grows to cost as much as the big iron solution, but then a new generation of toy software and toy systems emerges to disrupt the "big iron" systems.

by cduzz

6/1/2026 at 1:39:23 PM

Under appreciated requirement for this to work in post-cloud times: open source

If a vendor can SaaS a solution, then enterprise is generally happy (they don't want to have to hire folks for maintenance), and that completely locks out any ability to run locally.

Between enterprise's ambivalence and the obvious financial incentive to vendors, you get SaaS-only products.

by ethbr1

6/1/2026 at 4:09:26 PM

You're right Moore's law has been holding up, but will hit a hard limit on process node size, so all scaling will be based on multiple cores. OTH, computing per watt spent has been plateauing. If the future bottlenecks are energy and cooling, that will require infrastructure-scale solutions. My bet is this is going to be real AI company moat.

https://www.riq.net.br/pub/computing-scaling/

by manoDev

6/1/2026 at 1:14:03 PM

It's a huge difference. If you had AI sufficiently good running locally on a phone, you could devise workflows for things like basic digital hygiene, technical assistance, and tedious tasks like inbox management, image sorting, device updates, and so on. Privacy and security gets a big boost past some local competence threshold, and we're nearly there.

Make the local AI competent enough to do good image generation and editing, realtime voice and music generation, handle agentic tasks with a framework like Hermes, and you can take your AI places to do tasks in contexts that are inaccessible to or inappropriate for cloud.

Frontier big platform models will be the best, but there's a level of "good enough" for local uses that we're already seeing flourish, and "good enough" for the average joe is almost here.

by observationist

6/1/2026 at 1:21:04 PM

Phones and laptops are terrible devices for local AI, way too constrained by bad thermals and small batteries. MiniPC's (many of them using mobile hardware) don't have that particular issue, and can easily run on a 24/7 basis.

by zozbot234

6/1/2026 at 1:37:28 PM

Phones are also a terrible place to run a radio, but there's a huge amount of benefit in figuring out how to do so.

by trollbridge

6/1/2026 at 3:56:18 PM

That level of local AI is also more or less what you need for competent autonomous robots, too. If your household robots are orchestrated from your phone, the local security and cloud convenience converge on a single device. No extra servers, etc, reduced cost, all that - local AI is a massive market amplifier.

by observationist

6/1/2026 at 6:34:45 PM

Let me speculate - we are going in the weird direction of no private property unless you're an overlord that rents his property to peasants. I like to call it the revenge of communism. See how the market behaves in the llm space - it's more viable to share infrastructure than to own it. Imagine the private car revolution in the US was a bus revolution.

by yard2010

6/2/2026 at 2:10:32 AM

We’ve been dreaming about this since the days of talking about wifi mesh networking, but it seems to never happen.

by trollbridge

6/1/2026 at 12:23:54 PM

It's a little different because cloud and blogs didn't actively get in the way of your home compute. To wit, the various cost spikes for hardware.

People -- WANT -- this technology on their home devices and (apparently?) the providers of this tech don't seem to be running a profit so they probably don't want the maintenance tail on their side either.

I think it's a bit different. Inevitable that this becomes a household-run thing? Not likely.

by grumpymuppet

6/1/2026 at 12:40:40 PM

Running an LLM locally is theoretically viable. Running your blog on your laptop is never viable (unless you hook it up like a server). One just requires compute while the other a stable network.

by malmz

6/1/2026 at 1:24:43 PM

tbh, my home network is pretty close to the stability of my host these days…

But my downtimes are a bit self-inflicted: changing ISPs which I can personally workaround but harder for a blog where one expects uptime.

by Scoundreller

6/1/2026 at 1:24:05 PM

The primary feature of a blog or any website is that it is available around the clock, that is the primary feature of cloud: around on the clock computer and network that scales on demand.

The primary feature of "AI" is to process information and reason with a natural language interface at speed, the primary feature of AI bigboys is to provide the machinery that runs the "models".

See the difference?

by asdfsa32

6/1/2026 at 4:28:53 PM

You severely underestimate how little the fraction of the performance and human labor of a frontier AI is in "the model".

Hosting a blog 24x7 on a laptop is trivial, except for hyperscaling to the front page of HN and Reddit.

by gowld

6/1/2026 at 12:22:50 PM

More like implode proprietary blog hosting platforms and replace them with commodity VMs that can be used for blog hosting, among other things

by Kinrany

6/1/2026 at 12:37:12 PM

Wouldn't arcade cabinets vs home video game consoles be a more apt comparison?

by asimovDev

6/1/2026 at 3:22:45 PM

You have to consider that the enshittification factor is much higher now than in the cloud-for-free age.

by emsign

6/1/2026 at 5:25:41 PM

More likely we will have a compute device like NAS or something which will run one good model locally for all the house members just like we have one wifi router in every house. Nvidia can invest in building such a device as well as the models and make money on the hardware.

by Npovview

6/1/2026 at 12:40:36 PM

Curious when NVIDIA monopoly will ends. China will sure release something that can runs on commodity hardware. I wish they will soon.

by sreekanth850

6/1/2026 at 12:18:59 PM

I find that hard to believe. The AI companies will want to control what's possible and find new things to do that "need" their services. Otherwise it would be like Intel and Microsoft had decided in the year 2000 that computers are "good enough" now and we would have explored what's possible with that hardware ever since.

by IdiotSavage

6/1/2026 at 12:39:03 PM

> Otherwise it would be like Intel and Microsoft had decided in the year 2000 that computers are "good enough" now and we would have explored what's possible with that hardware ever since.

I think you've misunderstood what good enough means in the context - which is a model capable of completing the tasks assigned to it without having the breadth of full generalization. Your analogy breaks down because of this - we did get 'good enough' spec profiles for different hardware. That thing you're wearing on your wrist won't have the same specifications as the box you use to play games.

by squidbeak

6/1/2026 at 12:46:16 PM

I think you've misunderstood the analogy. Just ignore it, analogies mostly break down anyways.

> a model capable of completing the tasks assigned to it

The thing is, the "task assigned to it" is changing with improved capabilities. If everyone around you in 2036 is using general AI to do amazing stuff, you will probably have little interest in vibe coding slop like it's 2026.

by IdiotSavage

6/1/2026 at 1:32:15 PM

>The thing is, the "task assigned to it" is changing with improved capabilities.

Only if you give in to fads and FOMO.

The core tasks people need change at a much smaller pace.

by coldtea

6/1/2026 at 1:28:25 PM

Analogies are like metaphors, they’re illustrative rather than literal.

by brookst

6/1/2026 at 12:28:43 PM

> The AI companies will want to control what's possible and find new things to do that "need" their services.

That's correct. The problem is they have smart people, tons of money, and several years to figure that out, and the best thing they can come up is a coding agent.

by benterix

6/1/2026 at 4:46:33 PM

That isn’t the best thing they’ve come up with. It’s a marquee product that is fit for public consumption, however.

The ‘best’ things are; - fuzzy pattern matching algorithms for traffic analysis, human and other image target recognition.

- targeting algorithms that identify ‘suspicious’ individuals in large volumes of metadata.

- fraud analysis

- antagonistic image and video generation, both for fooling other fraud analysis, but also for propaganda, screwing with other actors, etc.

- directed high speed content generation (text, pictures, video) to spam the ‘algorithm’ and allow near realtime identification of additional buttons to push for given target audiences.

- massive marketing/ad manipulation.

Those budget line items (and the suppliers) really want to stay off the radar however, as it makes their life harder.

by lazide

6/1/2026 at 1:29:59 PM

>Otherwise it would be like Intel and Microsoft had decided in the year 2000 that computers are "good enough" now and we would have explored what's possible with that hardware ever since.

That would be the dream... no fucking Electron! No lockdown modules.

by coldtea

6/1/2026 at 12:56:34 PM

I disagree. We are currently in a weird period where these frontier AI companies are losing tons of money even on the subscription-based AI models. It's just too compute intensive and there's no way most people are going to be buying the kind of hardware required to run $20 worth of inference every day.

Sadly - it's going to be ads. Advertising is going to get in there and enshittify the whole thing because as always, advertising income is too easy and too plentiful for any company to resist.

Right now the models are fairly agnostic, but we are a hair-breadth away from ChatGPT responding with, "the right tool for this job is a circular saw - something like the Milwaulkee M18, which happens to be on sale at Home Depot this weekend."

by ryandvm

6/1/2026 at 2:27:04 PM

$20/day x 250 days per year x # devs/agents/etc = $$$. About $5k per dev at that daily use case.

Enough to validate repurposing an existing workstation with enough RAM, or finding a used high VRAM GPU, or in my case buying a Strix Halo system for home lab and local models.

The future is once again not cloud based, for AI tools.

by selicos

6/1/2026 at 1:30:16 PM

Most people are running a whole lot less than $20's worth of tokens per day on cloud platforms. (Is that assuming a frontier model? 1M output tokens per day?) Local hardware could easily take up that workload, at least the part of it that's non-time-critical.

by zozbot234

6/1/2026 at 1:33:48 PM

The advertising future looks like that to me, too. Service proxies like OpenRouter might talk about price optimization, maybe some ad filtering. But I expect proxies will have malicious entries, too, surreptitiously altering agentic prompts.

by enoint

6/1/2026 at 1:29:03 PM

Ads are usually the workaround where you don’t deliver enough value to get people to subscribe or payments are unavailable for some reason.

It makes sense to show some ads and get some money at low volume (like a faraway reader wanting to read a story in your local newspaper) but taking money from regular users directly will pay much more.

Newspapers are happy to cannibalize 99% of their ad revenue with a paywall if that 1% subscribes because that’s how much more money you make from someone paying $10-$20/month vs ads.

But yeah, if people use it as a buying recommendation engine, that’s where the money is on ads/referrals but a lot of AI use has little/no connection to buying intent touchpoints.

by Scoundreller

6/1/2026 at 1:37:03 PM

Newspapers had no choice after craigslist and later Google/Facebook took all their classified revenue.

LLMs may or may not be able to cover their costs with it. We'll see - I suspect product placement as recommendations will become a thing as it won't take as much GPU to give a "recommendation" on "the best widget for X". I firmly expect it to become enshittified the same way google and amazon search has.

And that's if LLMs don't become commodified.

by hylaride

6/1/2026 at 1:53:43 PM

For agentic services, how would you be able to tell that you’ve been product-placed?

by enoint

6/1/2026 at 2:04:15 PM

Hidden advertising is illegal in most jurisdictions, so it has to be indicated to the user for each specific occurrence and hence be trackable anyway.

by layer8

6/1/2026 at 4:33:12 PM

"AI can make mistakes. Responses include sponsored content or weights."

Now it's compliant with the law.

by gowld

6/1/2026 at 10:20:52 PM

That’s not how the current laws generally work.

by layer8

6/1/2026 at 7:58:07 PM

It might just shift who is buying memory, from large corporations to billions of individuals.

by billfor

6/1/2026 at 12:46:41 PM

Not saying this isn't the case, but my Anthropic subscription costs me less than the electricity would to power such a home inference system.

by dboreham

6/1/2026 at 10:49:27 PM

What happens when Anthropic decides that the free hay time is over, and it's milking time?

by dragandj

6/1/2026 at 1:41:11 PM

Gamers Nexus has a good video on this, but if NVIDIA exits the consumer market, and honestly why would they stay when they can charge up to a 100x for the same wafer space for enterprise, AMD would likely do the same. Only Apple really makes consumer hardware suitable for running things locally then, and maybe some weird Qualcomm ARM chip for Windows. It will be hard running things locally if nobody is supplying the hardware.

by techpression

6/1/2026 at 10:32:47 AM

Nice post and technically impressive work. I agree we need to understand the build pipeline and be able to do things locally. However, depending on your electricity cost, it might not make sense financially. These old servers are not energy efficient at all (I'm guessing that old Xeon server will easily pull 200W on load), and that model is currently at 0.1$/0.3$ per 1M tokens (with 76 tps and 262k context) in Openrouter (also, these servers are LOUD).

EDIT: I stand corrected, 200W is apparently way too high of an estimate. I used to run a bunch of old Xeon servers and they slurped watts like crazy, but I can't remember which ones exactly those were.

by deng

6/1/2026 at 10:53:25 AM

2620v4 is not a power slurping beast. Depending on the server board, it might not be either. Servers are often loud, but it depends.

There's a lot of budget hosting built around chips like these, and they're suprisingly power efficient.

by toast0

6/1/2026 at 10:36:34 AM

It should be closer to 85W on load. And it's incredibly silent on even a low end cooler. I rarely get above 50° Celcius.

by jansommer

6/1/2026 at 10:48:04 AM

OK, then you're in luck. I had a bunch of old 1U rack servers and even in the next room it was too annoying to run them (they had a bunch of 40mm fans which always ran at full speed, because in a server room, no one can hear you scream).

by deng

6/1/2026 at 11:01:35 AM

Could it just be really bad cooling? Looking at 9800X3D, it seems like it's running in a similar range wrt TDP unless you really push the 9800X3D. I'm comparing with desktop cpu's because that's what my workload is. cpu governor is set to performance (no schedutil). No audible change in fan speed during heavy compilation or gaming (very silent humming), and i don't have any fans beside cheap intake, cpu and exhaust fans (1 each) + an excessive amount of dust.

by jansommer

6/1/2026 at 11:15:01 AM

These servers had no fan control whatsoever, they always ran full blast. That's not untypical for rack servers, because as written: they are designed for server rooms, and you're supposed to wear ear protection there anyway... Yes, I could've modified them, but I ditched them because running them simply made no sense (especially the high idle power consumption was ridiculous).

by deng

6/1/2026 at 4:52:18 PM

Yeah, 1u is gonna do that. Get something that can accommodate a big tower air cooler such as the Hyper 212 and your airflow will be quieter than the disks.

I don't run it anymore but my old server was a dual xeon (with two of those coolers crammed in) and I rarely heard a peep out of it.

by jabroni_salad

6/1/2026 at 5:17:22 PM

Small fans need to spin faster so these can be very high pitch even if you stuff some Noctua 40mm fans into it.

by irusensei

6/1/2026 at 10:50:11 AM

Only when you remove it from the original server or enable low fan mode (if available). Most 1U/2U cases will happily blow at full speed well over 90db.

You likely need to replace the flow-through server chassis system with an active "normal" cooler to achieve a bit of silence.

85W might be about right. My old server CPU is in the same ballpark and compiling kernels it reached about 90w in power usage. If you want to keep it running: idle is not very low power unless you have one of the "low power" L versions, keep that in mind.

by consp

6/1/2026 at 11:09:44 AM

Get a 4U case, many options if you want to combine it with a NAS. Not hard to cool and keep somewhat quiet. If you can store it in a closet or something that helps too.

Well, you can use it for lots of other things as well.

Compared to the cloud you can probably save up to buy a new server every month. And don't underestimate the gains of having something to experiment on and play with.

by tjoff

6/1/2026 at 12:26:51 PM

85W for the whole system?! The specifications for the CPU mention a TDP of 85W [1].

[1] https://www.intel.com/content/www/us/en/products/sku/92986/i...

by ciupicri

6/1/2026 at 1:42:34 PM

But for LLM work the CPU is mostly idle, waiting for new data - so the CPU itself might not pull much power at all.

by actionfromafar

6/1/2026 at 12:22:58 PM

These servers are loud if you're trying to fit them into a 1U or 2U, which requires high speed fans to generate the necessary static pressure to push air through the case. I run a similar setup in a 4U case with slow 120mm fans and it's fine.

by naasking

6/1/2026 at 10:09:06 AM

Glad to see other people realizing this. I've been running Gemma 26B-A4B Q4 on a 2012 Xeon with 16GB to 24GB of RAM in a container. It's getting around 8 to 12 tokens per second. Obviously it's not comparable to huge contexts and running it on a GPU and the image decoder in llama.cpp is super slow compared to a GPU but for some small automation tasks and general trivia questions it's decent. The speed is just enough to not have to wait for it to finish so you can read along.

Here's my setup. You may want to figure out what the best optimizations are for your specific CPU like AVX2 because mine didn't have most of them. I did try MTP briefly but I wasn't getting performance improvements. You could play around with the batch sizes for cache or context or go even lower for Q2 and don't overcommit on threads either, but I would suggest either defaults or trying out llama-bench. This isn't by any means the best I assume but it worked decently for me and I sometimes swap out Gemma for Qwen. You could also lower q8_0 to q4_0 for more context but it could hurt quality some say, altough I have noticed it too on some models.

# Building

cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=ON -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS -DGGML_OPENMP=ON

# Running

export OPENBLAS_NUM_THREADS=4

export OMP_NUM_THREADS=4

OPENBLAS_NUM_THREADS=4 OMP_NUM_THREADS=4 \

llama.cpp/build/bin/llama-server -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL --temp 1.0 --top-p 0.95 --top-k 64 --min-p 0.00 --jinja --host 0.0.0.0 --port 8080 --cache-type-k q8_0 --cache-type-v q8_0 --threads 4 --threads-batch 4 --ctx-size 8192 -n 8192 --batch-size 2048 --ubatch-size 512 --no-mmap --mlock --chat-template-kwargs '{"enable_thinking":false}' --no-mmproj -np 1 -fa 1

by throwaway2027

6/1/2026 at 4:12:45 PM

I'm setting up a Frankenstein system at the moment. It's a Chinese DDR3 X99 motherboard with a 12 core Xeon v3, 32gb 1866MT/s ram, and a 1080 Ti.

I'm shoehorning it back in the Optiplex that donated the ram, so it's not ready to go at the moment, but when I had it running on top of the motherboard box as a test I ran the (9B?) gemma4:e4b-it-q4_K_M since it can fit entirely in the 11gb vram. It flew, more than 50tk/s. A model that small isn't useful for coding, but there could be uses. I'd love to figure out a Wake-on-Use and use it as my personal ChatGPT. I'm not sure how that would work... Maybe proxy the LLM thru a Pi with a script to Wake-on-LAN the PC? It'll be a fun weekend project someday.

My always-on LLM is the dense Gemma4:31b that's not quite half in GPU on a 12gb 2060. It's really slow, but the quality is great and my use case is an automated queue so I'm not sitting there watching the output. I have another 2060 but unfortunately the PC won't POST with both installed for some reason.

by duffyjp

6/1/2026 at 5:11:55 PM

Speaking of llama and local compute, there was a tweet from Georgi Gerganov (llama.cpp author) a couple of days ago saying that he is currently using Qwen3.6 27B, running locally on a Mac M2 Ultra or RTX 5090, to assist with llama.cpp development.

by HarHarVeryFunny

6/1/2026 at 7:03:05 PM

Went this route after hemming and hawing over a Mac Studio Pro for some time. Eventually bought and configured a headless HP Z620 with 192 GB of ECC RAM and dual Xeon E5-2680 v2 processors, an Optane AIC, two P102-100s with 10 GB VRAM each, and a minimal bootable SDD running Debian 12.6 with an older, locked version of CUDA that supports the Pascal cards. Run it remotely from the basement via AMT/meshcommander. Just fire up llama.cpp and its front end and connect over the local network. Currently playing with Talkie, Qwen 3.6 27b, and medgemma, but have had good luck with GGUF performance in general after selecting an appropriate quant. Total cost was under $500, but I bought the server via eBay last year; things may be different now.

Details aside, the hope is that ternary LLMs blossom in the coming months and this old hardware can eventually host some very dense models full of factual information, perhaps even larger than the GPU RAM and spilling over to the Optane for IO. Speed would be less important than general factual knowledge. The plan would be to configure then mothball the machine in a Faraday trashcan in the basement, retaining it as a possible "rebuild civilization" oracle should the world fall apart. Of course, power would be an issue in such a scenario, but for how cheap this hardware is and how often AI seems to be practically useful in its latest iterations, why not...

by hualapais

6/1/2026 at 9:42:19 AM

What intrigues me the most about AI progress, is not AGI or the model du jour by $AI_UNICORN, but rather what can be run locally. I remember having an amusing, but rather useless model in a beefy gaming PC that I had 6 years ago; and now, something that’s a hundred times better on my M5 laptop.

Should the market react to the memory shortage, the progress of the Apple silicon continue at the same pace, and what we’ll be able to run locally in 6 years will be very exciting. or frightening.

Also I don’t know what this means for the valuation of the AI companies. I remember asking about this very idea to one of their employees at an event and instead of answering he bailed out to grab a cocktail.

by phaser

6/1/2026 at 10:12:48 AM

Things you are not supposed to talk about:

- There is no "moat" (lasting, easy-to-defend technological edge) in AI model businesses. There are just short-term advantages.

- An AI business is a capital-intensive business, just like old factories. Data centers are expensive, models are energy-hungry, and the hardware inside must be replaced every 3–4 years.

- Smaller, specialized models eat margins from below. Transcription, voice, or image detection do not need large models.

There is no reason to expect high margins like you can in traditional software business. Benefits of AI go mostly to consumers.

edit: There is potential for economies of scale. Few megacorps can strive for cost advantage when they achieve scale (Microsoft, Google, Amazon and Meta)

by MAXPOOL

6/1/2026 at 11:17:31 AM

All true.

It does seem like the structural characteristics we’ve observed so far suggest there is a kind of flywheel from short-term to long-term advantage due to the capital requirements at various levels.

If you’re Nvidia, making the best GPUs today, the expanding wavefront of demand is consuming them with volume and margins to give you a huge edge in building out the best next generation of GPUs. Similar to how the mobile wave gave TSMC sustained advantage for about a decade now.

I’m guessing this is also what we’re seeing as Anthropic and OpenAI swap spots in the token-vendor market.

by twoodfin

6/1/2026 at 2:58:28 PM

I can see the fly wheel in action for Nvidia[1], but in terms of model building - I think the companies that have the advantage here are not Anthropic or OpenAI, but rather companies with substantial revenues from other sources - Google is the obvious player here - reported to be planning on spending 185 billion this year without having a raise a dime from the markets, but there are plenty of other companies - like Meta or Alibaba who can easily fund the longer game from existing revenues.

by DrScientist

6/1/2026 at 3:11:39 PM

Everybody talks about this stuff all the time

by treis

6/1/2026 at 10:06:01 AM

What you can run locally in consumer hardware is progressing pretty well.

If you get a not-quite-the-best gaming GPU like a 5080, you can run local models that are better than the state of the art from early 2025. Depending on what you want to do, you might have to switch models. The one size fits all huge models are still a data center thing.

by fooker

6/1/2026 at 10:03:03 AM

Its a convenience thing. You can run a whole lot of stuff locally from wikipedia to social media/email/video servers whatever. Most people with a full time job and 2 kids dont do it cause who has time and energy to patch and maintain the ever growing complexity of this stuff. These systems will keep growing complex. That also means more bugs. Age old tradeoff between freedom and convenience.

by skdb476

6/1/2026 at 1:09:25 PM

You can run mediawiki at home but you won't have wikipedia. You can run a video server but you won't have all the movies that Netfix has. A local model is actually the real thing.

by phaser

6/1/2026 at 3:43:33 PM

you can have the whole wiki loaded with full search available locally. check out kiwix.

by skdb476

6/1/2026 at 5:07:58 PM

Thanks I didn't know about kiwix, but, let's consider the fact that a wiki, or netflix movies are cheap or free, while AI is actually quite expensive at least for now, and i'm not sure if it's because of real costs or to justify the valuation.

So there is a bigger incentive to run locally something that's gonna get you $20 or $100 worth of bills to OpenAI than to mirror something that is actually free.

Example: In the past there was a whole market for sound cards, if you wanted your computer to have any "multimedia" capabilities you needed to get a sound blaster but now everybody assumes a computer will produce sound, and it's basically for free as all chips have it. Now sound interfaces are still a thing but only for audiophiles who are esoteric enough like me to believe that it's worth to have that extra hi-fi quality.

What I think it could happen, is that eventually AI will be part of all the chips, just like soundcards. And there will be people who will buy specialized AI from companies that perhaps are not OpenAI or Anthropic but second-generation sleepers who watched the carnage in the market and decided to enter when it was reasonable.

This could be Apple, or Nvidia or something new. They're just waiting for the others to do the research and introduce the taste for it to the masses, just like sound blaster made us fall in love with high fidelity sound in our computers.

by phaser

6/1/2026 at 11:29:07 AM

[dead]

by SadErn

6/1/2026 at 1:29:25 PM

--what this means for the valuation of the AI companies

Probably nothing. Most users have no idea what an LLM is or how it runs. Anecdotally speaking, I see many LLM users default to whatever their day job provides to them. And even slightly more sophisticated users seem ok with paying for their openai or anthropic subscriptions.

Maybe we will see a small but dedicated group of open weight model users who prefer local llm, but everybody else will just consume from the big providers? The scenario might look something like OS choices today - a small, committed group of Linux users vs the vast majority of other users running Windows, MacOS, or Chrome?

by clusterhacks

6/1/2026 at 10:57:23 PM

Prices from OpenAI and Anthropic have really jumped in the past month. I work for a big giant company and our Github co-pilot costs increased as of today, June 1st. Our internal estimates are that our bill will double or triple. How much are we willing to pay? I don't know, but nobody wants to be "left behind".

I think there's actually a big market opportunity here. Somebody, like Dell or HP, should start selling turnkey on-prem LLM servers.

by exhilaration

6/1/2026 at 12:01:30 PM

This has always been true of software, particularly games. You can get a 5-6 year old game for a fraction of the price, and run it on modest hardware. But the industry wont sit on its hands for 5 years, there will be newer software that requires better hardware.

by mr_toad

6/1/2026 at 1:21:22 PM

Technology doesn't always work like that.

A new game is a totally new world with everything created from scratch. A creation. A model, on the other hand, is a reinterpretation machine for hundreds of years of human creations, but not a creation in itself, more like a discovery.

You would think that by now we would have a much better Bitcoin that's taking over the payment networks of the world but what we actually got is a shitload of shitcoin.

by phaser

6/1/2026 at 10:20:35 AM

Training AI models to drive valuation reminds me of high frequency trading

by rienbdj

6/1/2026 at 12:24:38 PM

Result is ~12 tokens per second, as reported by OP down in these comments here.

An impressive effort, and better than I would have thought possible on this hardware -- but still pretty far short of what one needs for an satisfactory interactive session.

by montroser

6/1/2026 at 12:28:37 PM

Especially if you consider those smaller models are really cheap and fast on platforms like openrouter. Often by the factor 100-500 cheaper than SOTA models, and 2-5x in TPS.

by andix

6/1/2026 at 2:28:59 PM

Yeah took way too long to find that result. Being able to run on slow RAM isn't surprising considering you can run a model off an SSD.

by causal

6/1/2026 at 8:54:57 PM

It's not terrible for interactive... https://mikeveerman.github.io/tokenspeed/?rate=12&mode=text

And it should be just fine for plenty of background use cases.

by kingnothing

6/1/2026 at 4:41:47 PM

Right. You can also perform RSA encryption on pencil and paper with a scientific calculator. It works, but it's not useful throughput for serious work

by gowld

6/1/2026 at 2:55:59 PM

I was about to ask that

by greenavocado

6/1/2026 at 10:27:06 AM

The E5-2620 v4 is great. Have been using it for 10 years now. Wanted to upgrade until I saw current prices. I have 64 GB ddr4. Paired it with rx 9060 xt 16 GB and games run as fast as ever. Perhaps the cpu is a slight bottleneck in DOOM The Dark Ages, but i'm at 60 fps, so no problem. Light llm on the gpu is a nobrainer, and it's cool to see that things can be tuned to run ok on the cpu. I bought 2667 v4 a month ago for 30$. I'd expect it to give a decent performance boost but I just haven't had the need for it yet, but pushing into llm like in the article I'd probably upgrade because 2667 can handle slightly faster ram.

by jansommer

6/1/2026 at 5:55:01 PM

I'm on a dual-E5 2667-v4 / 256 GB DDR4 Z640 with a 1080ti that I picked up all the various pieces for (aside from SSDs) for less than $500 total in the first half of 2025 (case, PSU, riser board included). I'm still kind of blown away by what you can find aftermarket / secondhand!

I also had no idea RAM and GPU costs would explode they way they did, just happened to do it the right time. I might try to grab a ~$300 3080 on Ebay and sell the 1080ti, but otherwise it's been a great upgrade -- it sucks electricity like Coca Cola, but otherwise performs fantastic as a workstation, and I'm just gonna drive it til the wheels fall off.

by kevinsync

6/1/2026 at 11:00:15 AM

    > The E5-2620 v4 is great. Have been using it for 10 years now.

10 years? Damn, that is a long time. I always assumed that heat-induced damage will kill a CPU after a certain amount of time (5-7 years). Am I wrong here? I assume yes. Or are CPUs must stronger/tougher than the bad old days?

by throwaway2037

6/1/2026 at 12:03:18 PM

Intel sacrificing lifetime for short-term gigahertz is a relatively recent phenomenon.

by bobmcnamara

6/1/2026 at 8:50:31 PM

> 10 years? Damn, that is a long time. I always assumed that heat-induced damage will kill a CPU after a certain amount of time (5-7 years). Am I wrong here? I assume yes. Or are CPUs must stronger/tougher than the bad old days?

My i7 920 is still running fine. Or, it was when I decommissioned it in 2017. I don't imagine any reason it shouldn't, except perhaps bitrot of spinning rust (spinning rust rotting is no joke, especially after ~20 years) and maybe aging of thermal paste.

My i7 6950X is still running fine, in use since 2017 even today to write this message.

by inetknght

6/1/2026 at 8:56:29 PM

I've never had a cpu die in the decades I've been using them. I've bought 10-20 year old computers that still work just fine. I kept my last MacBook for 9 years before I upgraded out of want for more RAM.

Most computer equipment fails quickly, otherwise you'll get a long life out of whatever it is.

by kingnothing

6/1/2026 at 12:57:35 PM

This is among the "real" differences between workstation/server CPUs and commodity chips for laptops/desktops/handhelds.

Even then, if a commodity chip isn't pushed full tilt at all times, and assuming that the venting and dissipation are adequate, a commodity chip can last a long time.

by BirAdam

6/1/2026 at 11:21:50 AM

A quick search on Xeon production yields that it goes through a rather rigorous testing. I wouldn't be surprised that server cpu's in a desktop pc works longer. I can't overclock it either, and that probably helps with its lifespan as well. But yeah, the fact that it actually powers on when i click the button and isn't a limiting factor after 10 years is quite something.

by jansommer

6/1/2026 at 3:04:10 PM

You raise two very good points that I didn't think about: (1) better binning/testing, (2) no overclocking. Keep rockin' that elderly Xeon!

by throwaway2037

6/1/2026 at 4:32:51 PM

>I can't overclock it either

Except you can overclock v3 :)

by dur-randir

6/1/2026 at 11:30:07 AM

Back from my old overclocking days - its heat that kills life. And if you keep that under control (what ages is the heatpaste, replace it ever so often) i very much doubt you'll have any life issues from the cpu itself.

Bearings in fans, caps etc. are also stuff that you need to keep an eye on.

I just replaced a i5-660 thats been powered on since 2010 24/7, heatpaste was fucked so it crashed during heavy loads :)

by mrmlz

6/1/2026 at 1:34:32 PM

Not my experience.

by Grazester

6/1/2026 at 9:43:19 AM

Similar recent posting with optimizations for older Xeon:

High-Performance AI on a Budget: Optimizing llama.cpp for Qwen3.5 Inference on a Dual-GPU HP Z440

https://news.ycombinator.com/item?id=47320244

by car

6/1/2026 at 12:55:35 PM

Apparently Itanium works quite well for LLMs https://medium.com/@tglozar/running-llama-inference-on-intel...

Which makes sense I suppose.

by RobotToaster

6/1/2026 at 3:14:30 PM

I want to share something strange. I found a typo or two in the post and this absolutely delighted me, because it implies a human wrote the words. (Or was at least heavily involved in the editing.)

Guess I am a species-ist after all ;)

by andai

6/1/2026 at 3:19:17 PM

I hope LLMs don’t get trained with this reply and start adding typos for making it look like it came from a human :)

by bicepjai

6/1/2026 at 3:40:19 PM

I felt like I had lost something valuable when I switched to mostly AI based programming, because I used to make so many mistakes that the computer would often do truly magical things I did not even realize were possible.

e.g. one time I tried making a collaborative drawing application but I messed up the logic, and the brush strokes would just get temporarily mirrored between the client and server, so you'd see it getting drawn over and over again in a loop.

The drawing wasn't stored anywhere, it existed only in the network packets between client and server. Accidental GNU.

http://www.gnuterrypratchett.com/

So I started working on a tool that adds random errors back into my programs. To reintroduce the possibility of such happy little accidents.

by andai

6/1/2026 at 4:44:15 PM

AIs already make typos, not directly intentionally. Since they are token-based, and tokens are lexemes, they can misconjugate works or make grammatical errors.

by gowld

6/1/2026 at 3:00:07 PM

I've got an old HP Z-620 workstation with dual E5-2697 v2 CPUs (24 cores total, 48 threads @ 2.7GHz) and 128GB of DDR3 RAM. The docs say it supports up to 192GB, but I wasn't able to get it to POST with all the RAM slots full.

It's still a "homelab" beast and does great with development and GIS/Mapping applications. I was not able to figure out how to run AI workloads on it with decent performance, however, so I finally broke down and got a dedicated GPU for it. It's pretty great what can still be done with older hardware.

by ryandrake

6/1/2026 at 10:25:37 PM

I'm in the same situation of having an older workstation nearly maxed out with RAM and neither wanting to pay for the equivalent RAM on a new system neither go down in GBs.

by bobmcnamara

6/1/2026 at 3:04:15 PM

I self host on old HP Z-840 with 2x3.6 GHz Xeons 24 total cores, and 512 GB RAM. Cost me peanuts used and works like a charm for many years already

by FpUser

6/1/2026 at 11:15:22 AM

I may have missed this in the article, but:

What was the net effect of the optimisations? How much faster did it get?

by FartyMcFarter

6/1/2026 at 8:48:47 AM

The E5 2620-v4 only supports DDR4.

by vhaudiquet

6/1/2026 at 12:04:32 PM

Probably in an x99 motherboard

by bobmcnamara

6/1/2026 at 1:29:18 PM

The memory controller is integrated into the CPU, so the motherboard chipset is irrelevant. There are some OEM-only v3/v4 parts with dual memory controllers, but the E5-2620 v4 is not one of them.

by mwpmaybe

6/1/2026 at 10:17:27 PM

Ooh weird!

by bobmcnamara

6/1/2026 at 8:53:06 AM

How about the iMac Pro? Would that work? I was able to put 128gb in it (not as easy as the regular iMac but possible).

by NSUserDefaults

6/1/2026 at 9:19:43 AM

I've been running various models on a Mac Pro 2013 (8 cores, 32 GB RAM) at about 8 to 10 t/s for months. It's not fast, but it's more than enough for many actual tasks, in particular background tasks. An iMac pro will do just as well I suppose.

by wazoox

6/2/2026 at 1:04:36 AM

I have and use a Mac Pro 2013 too. Mine is 8 cores with 64 GB RAM. I haven't used mine for any LLM workloads, but it does just fine for most stuff. My biggest concern with it is the OS. I'm still running macOS (the latest supported version) but it's getting continually further out-of-date security wise all the time.

by neverartful

6/1/2026 at 10:07:46 AM

What are the tasks that do well with 8-10 t/s ?

by fooker

6/1/2026 at 12:19:55 PM

The sort of task you don't expect to end immediately. If extracting data from a bunch of PDFs takes 1 hour or the whole night, that doesn't make much difference to me. It's not fast enough for auto completion and slightly too slow for chat (but bearable IMO).

by wazoox

6/1/2026 at 3:08:25 PM

Running a local llm at 10 t/s overnight to extract data from a few PDFs will burn more in electricity than paying cents for the hosted kimi models.

You can (sometimes) break even if you have a workstation GPU.

by fooker

6/1/2026 at 8:24:08 PM

Sometimes data privacy is paramount.

by wazoox

6/1/2026 at 6:24:52 PM

I bought one AMD MI50 32GB back then when they were sold rather cheap (around $150-$170). it can easily generate over 70 tokens per second for gemma 4 26B moe model (q4).

I have no doubt that we will have another wave of cheap retired server gpus just like before. And that is the time when everyone will have their own models at their home.

Or we can just buy the newest medusa halo mini pc. they will be pretty decent, too, albeit pricey.

by npn

6/1/2026 at 12:35:58 PM

Old hardware is surprisingly effective. I've been considering a side hustle selling offline AI to local businesses who are privacy-sensitive. Medical, legal, places like that.

At the low end, I'd use old Xeons with gobs of DDR3, install some V100s, run a smaller agent for general chat inquiries, and a frontier model for the deeper stuff, with a router that passes between them depending on the complexity.

The frontier model would perform very slowly, but if it's a deep task the user can submit it in a batch in the evening e.g. "Correlate all of these cases and look for patterns" then receive the output with morning coffee.

Of course, AI helped me work out a plan for this. Haha

by cbdevidal

6/1/2026 at 4:58:03 PM

[flagged]

by nicogentile

6/1/2026 at 12:08:27 PM

Doesn't accepting 100% of the MTP draft tokens mean you should just be using the smaller model? Usually the acceptance rate in Qwen36 at least is around 60-70% and the "wrong" tokens are still filled in entirely by the base model, but when you just accept 100% of the draft tokens it seems kind of self defeating unless I'm wrong.

Also I feel like everyone leaves off prompt processing/prefill speeds in these articles. If you are using a very small prompt and asking for mostly generated tokens, sure but I'd love to know the time-to-response of asking for an analysis of an image or a few hundred lines of code.

by lreeves

6/1/2026 at 12:23:05 PM

As far as I know, speculative decoding still verifies that the proposed tokens are what the "big" model would generate, it just uses the guesses to make that process faster. Setting the probability threshold too low then shouldn't affect correctness, just speed (time will be wasted verifying bad guesses).

by dvdkon

6/1/2026 at 12:26:54 PM

But won't setting it to accept 100% of the proposed tokens will skip the verification?

by lreeves

6/1/2026 at 1:19:34 PM

None of those settings set the speculative decoder to accept 100% of drafted token. I assume you are looking at --draft-p-min 0.0, if so, you are misunderstanding what it does.

by ac29

6/1/2026 at 12:26:35 PM

It depends on the type of MTP. If you're using two models, draft + full, then arguably yes, the larger model isn't providing much benefit if you really are seeing 100% acceptance rates. There are other forms of speculative decoding that work within the larger model by itself though, eg. Qwen has additional speculative decoding attention heads, so there is no secondary drafting model.

by naasking

6/1/2026 at 9:44:11 AM

Does this mean my 15 year old Phenom is too old? But it has 16 gb of DDR3 RAM!

Admittedly web browsers and it don't get along that well. Literally the only thing that drags though on my Slackware 15 system, and even then usually only when it gets to around 15 or so open tabs.

by cykros

6/1/2026 at 7:20:27 AM

@cafkafk got a recommendation for a good model that fits into 64GB and leaves a couple GB free for other tasks ?

by potus_kushner

6/1/2026 at 7:35:31 AM

Honestly, at this point you're probably looking at a smaller model, for the Gemma series I'd go with Gemma 4 E4B with drafters, but that's just a hunch from using it on my laptop (where I do have a RTX 4060 M and 96gb ram).

So you'd change the invocation slightly here, but a lot of things you can potentially reuse.

That said, the Gemma 4 E4B models have so far in my experience been... not great when it comes to long context, but they are very passable for basic tasks, and even seem surprisingly okay at tool calls.

by cafkafk

6/1/2026 at 9:51:19 AM

Have you tested Qwen3.6 35B? Putting aside the capability claims for that model (which I support, but are not my point here), that 35B has smaller active parameter count than the gemma 4 26B, potentially making both prefill and decode faster out of the box, and has MTP heads built in the model and well supported (you may need to make sure you download a quant that didn't strip them off, as some do to preserve space). I would be curious to see your numbers there too. And if you do test this, please go for a clean one and not a fine-tuned one.

by sleepyeldrazi

6/1/2026 at 8:26:08 AM

i tried the Q4_K_M model form unsloth with your Q4_K_M drafter, but the required memory to load everything is 72GB. odd. otoh i could load Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled.IQ4_XS.gguf and it requires just ~18 GB:

~/ik_llama.cpp[main]$ build/bin/llama-cli --model ~/models/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled.IQ4_XS.gguf --spec-type mtp --draft-max 3 --draft-p-min 0.0 --spec-autotune -cnv --color --jinja --special -smgs -sas -mea 256 --temp 0.7 -t 6 --parallel 6 --cpu-moe --merge-up-gate-experts --flash-attn on --mla-use 3 --mlock --run-time-repack --no-kv-offload . works pretty fast, at about 15 t/s:

llama_print_timings: sample time = 45.28 ms / 404 runs ( 0.11 ms per token, 8921.67 tokens per second) llama_print_timings: prompt eval time = 949.42 ms / 51 tokens ( 18.62 ms per token, 53.72 tokens per second) llama_print_timings: eval time = 24067.08 ms / 400 runs ( 60.17 ms per token, 16.62 tokens per second) llama_print_timings: total time = 242192.55 ms / 451 tokens

so i wonder why the params used by the quantified qwen model use way less memory than the ones of gemma.

by potus_kushner

6/1/2026 at 3:52:51 PM

I wish this were somehow tagged with AI, so I would know that it's not about say, general computing or cost-efficiency (e.g. using an old xeon machine from ebay instead of new, in these cost-conscious times.)

As it is, the title is click-bait for me, as 1) it says I need at least a Xeon somehow and 2) as it doesn't say what I actually need it for.

by tomega2134

6/1/2026 at 9:41:45 PM

for solo operators that run saas (targeting business customers) & if you do a lot of data processing - old servers are the best bang for the buck.

remember if you serve real customers as a bootstrapped business - you can afford the whole serve down for maintenance. no need for 99.999%.

better than hetzner.

by dzonga

6/1/2026 at 10:03:15 AM

I tried to run gemma 4 on this CPU and it did not go well

https://www.techpowerup.com/cpu-specs/ryzen-7-4800u.c2281

It is way too slow

by anon-3988

6/1/2026 at 3:46:43 PM

llama.cpp includes a benchmarking tool called llama-bench https://github.com/ggml-org/llama.cpp/blob/master/tools/llam...

ik_llama includes llama-sweep-bench https://github.com/ikawrakow/ik_llama.cpp/blob/main/examples...

When comparing hardware, the output of these tools is very helpful to let others put it into context. The post says the output is "reading speed" but knowing the prefill and token generation speeds would be a lot more helpful.

by Aurornis

6/1/2026 at 2:02:35 PM

Noting for reference that Gemma4 MTP work is in progress[0] on llama.cpp; similar work for Qwen3.6 landed recently and has been great thus far.

[0]: https://github.com/ggml-org/llama.cpp/pull/23398

by kristjansson

6/1/2026 at 3:28:53 PM

Did some try to estimates what it would take to bake interference for a capable large language model into silicon so that one can pipeline inputs through it and produce outputs at one token per clock cycle?

by danbruc

6/1/2026 at 3:30:53 PM

I'd expect it to require too much RAM bandwidth to be feasible.

RAM is really slow at silicon speeds. Very little is reachable in one clock cycle, unless the clock cycle is abysmally slow.

by knorker

6/1/2026 at 3:38:39 PM

No RAM. Instead of having a general purpose multiplier that multiplies an input with a weight stored in RAM, just have a multiplier that hardcodes the weight. In some sense replace each weight with a specialized multiplier and wire them together with accumulators and activation functions in between. And some registers for pipelining. If one goes for four bit quantization, one could have sixteen optimized multipliers, one for each possible weight, and the one just selects and connects them according to the model weights and structure.

Example. If you have a neuron with 16 inputs each 8 bit wide and with a 4 bit weight per input, you will have 16 specialized multipliers each scaling its input by the corresponding weight and then the 16 scaled inputs feed into an adder tree and finally an activation function.

by danbruc

6/1/2026 at 9:13:50 PM

That sounds like wiring the RAM information into order of magnitude same number of transistors. A modern CPU has (quick googling) 184B transistors. If they were bits then that's 23GB. But presumably a model bit needs more than one transistor to represent how it acts as a neuron with its interactions.

Then there's the current speedup in inference from restricting which subset of the model is used, which is not a "swap in" that would work with hard wired neurons.

But I dunno. Maybe. I'm just guessing.

by thomashabets2

6/1/2026 at 8:31:51 AM

I have an ancient DDR3 Xeon that doesn't support any AVX (dual x5690 and 96GB 1333 MHz RAM). You reckon it would even build / run at all?

by asimovDev

6/1/2026 at 8:49:16 AM

CPU (2012)

  Model name: Intel(R) Xeon(R) CPU E3-1265L V2 @ 2.50GHz

Mainboard

  Product Name: P8Z77 WS

GPU

  05:00.0 VGA compatible controller: NVIDIA Corporation AD106 [GeForce RTX 4060 Ti 16GB] (rev a1)
  05:00.1 Audio device: NVIDIA Corporation AD106M High Definition Audio Controller (rev a1)

Memory: 32GB

This works.

by qwertox

6/1/2026 at 8:34:10 AM

Loading will take some minutes, but at 96 you can squeeze the model in and have some headroom around like ~10 GB, although depending on the Xeon, you may have to downgrade to E4B instead. Should still work thou.

by cafkafk

6/1/2026 at 8:34:10 AM

It may work - depending on your ram speeds it might not even be that much slower.

by tgtweak

6/1/2026 at 9:36:19 AM

I run Win 11 Enterprise on an el cheapo spare parts Xeon E3-1275 V2 + 32 GiB DDR3-2133 + Gigabyte GA-B75M-D3H rev. 1.2 (TPM support)

by burnt-resistor

6/1/2026 at 11:48:10 AM

What's the best way to apply this to slightly more modern hardware - i.e. 5800XT 32GB DDR4, 9060XT 16GB?

by alimbada

6/1/2026 at 10:01:08 AM

And this is one of those CPUs which had dual slot motherboards so you can have double the fun (and power bill)

https://pcpartpicker.com/products/motherboard/#s=20028,20029...

by haunter

6/1/2026 at 1:25:24 PM

Very intriguing. This might be the use for my e5-2430 V2 X2 server that's been lying around. DDR3 is (relatively) cheap now too. Could fit 192GB of RAM in it and play around for much cheaper than a new GPU.

by Liftyee

6/1/2026 at 7:18:06 AM

I wonder what the tokens per second actually are. Yes, it does say "reading speed" but that varies for everyone, no?

by Eonexus

6/1/2026 at 7:37:35 AM

That is a very fair point! I just ran a not very scientific benchmark with the system under load, and posted the raw logs in a sibling comment above, but the short answer is that it's hitting 11.94 tokens per second for generation - while it's also being a binary cache and CI build server.

Totally just vibes based, I think it goes up to 20+ tps when it's not under load (and that's me trying to be conservative). For context, reading speed at 250 wpm would be around 5 to 6 tokens per second.

by cafkafk

6/1/2026 at 7:40:44 AM

Huh, that's actually not bad at all! Sure, it's not at the speed of a GPU, but still, 20 tps is cromulent for a CPU.

by Eonexus

6/1/2026 at 12:53:22 PM

I have run llama.cpp on an i7-2600 with a 1050. It's too slow for everyday usage but it's not too slow to make it obvious AI is going to be everywhere and in everything. It's too easy to run.

by shovas

6/1/2026 at 11:17:05 AM

Granite or sapphire rapids are very under rated for MoE inference loads. But you need a GPU for the KV cache.

Plus many boards also support CXL for RAM expansion over PCI 5!

Source: building a hybrid inference business for regulated industry workloads.

by robotswantdata

6/1/2026 at 2:05:09 PM

I have an old 192GB DDR4 Dell Precision with dual Intel Xeon Gold 6130 that I've considered spinning up. What's giving me pause is 250W at idle.

by mv4

6/1/2026 at 2:11:26 PM

Surely that number can go lower with some tweaks

by mtoner23

6/1/2026 at 3:20:13 PM

I am sure it can. It will still generate a lot of heat when under load.

Are you telling me I should go for it? :)

I do have a dual DGX Spark cluster running MiniMax M2.7 already so I am all for on-prem. But will be interesting how this old machine will perform!

by mv4

6/1/2026 at 10:59:33 AM

I think one overlooked advantage of older Xeon systems is their availability. Many people can experiment with local AI deployments at a fraction of the cost of building a brand-new setup.

by Hasan121212

6/1/2026 at 1:58:55 PM

Is this John Siracusa? It sounds like it could be something he’d say…

(He has a fully maxed out “last Intel” Mac Pro and laments the lack of replacement).

by bombcar

6/1/2026 at 8:23:32 PM

Hah. My Xeon turns 20 this year. No issues.

by 1970-01-01

6/1/2026 at 10:00:49 AM

This and the previous one are insanely good articles. Thank you!

by egorfine

6/1/2026 at 1:51:08 PM

Either they have a E5-2620 V2 from 13 years ago, or they have DDR4, not DDR3. The V3 and V4 only support DDR4.

by SirMaster

6/1/2026 at 2:10:18 PM

Would there be any advantage of running this as dual Xeon? The CPUs are $5 and a dual mobo is $50...

by qingcharles

6/1/2026 at 2:18:14 PM

More memory bandwidth presumably. Not sure how well the ecosystem handles thread pinning though.

by bee_rider

6/1/2026 at 11:41:04 AM

This is great work.

I'd love if anyone knows how I might fare with an old Dell R710 with 2 x Xeon 5600 (12 cores total) and 96Gb of DDR3.

by coldcity_again

6/1/2026 at 12:43:15 PM

I don’t think it would work as well as there is no AVX or AVX2 on those older CPUs unfortunately.

by rythie

6/1/2026 at 7:11:25 PM

Thanks very much. I'd forgotten that these were Westmere generation! Experimenting anyway; at least the RAID controller is behaving, and Ubuntu 26.04 LTS has gone on cleanly.

by coldcity_again

6/1/2026 at 1:57:26 PM

ive been doing the same thing. i refactored a old newtek stream machine . its my new favorite thing to do! adding old PCs to my "starcraft" fleet xD

by sperandeo

6/1/2026 at 9:32:22 AM

What kind of tokens per second did the op get I saw nothing of this written.

by gigatexal

6/1/2026 at 9:41:24 AM

11.94 tokens/sec (from another answer above)

by urbandw311er

6/1/2026 at 5:16:29 PM

so how many tokens/s do you get, pp and tg? did I miss it in the article?

by b65e8bee43c2ed0

6/1/2026 at 8:04:31 AM

Makes you wonder if its possible to squeeze more tps out of a strix halo system using the 16 zen5 cores as well as the gpu.

by christkv

6/1/2026 at 8:52:26 AM

In general you’re mem bandwidth constrained so cpu vs gpu often ends up similar on APUs

by Havoc

6/1/2026 at 9:35:12 AM

There are ways to trade off compute power for memory bandwidth (like MTP and other speculative decoding approaches). The CPU and GPU would need to be able to share the same cache for this to work. In the Strix Halo case the GPU has a private cache on the GPU die I think, which is the snag.

by fulafel

6/1/2026 at 8:20:53 AM

If you get the inference engine to route the heavy matrix math to the GPU and the speculative drafting to the CPU without choking on latency it's probably gonna be very fast.

Would love to see the benchmarks if someone actually pulls something like that off.

by cafkafk

6/1/2026 at 9:10:52 AM

I'm now staring at a 10 year old 4U with 256 GB of DDR4 and thinking hmmmmm

by hparadiz

6/1/2026 at 7:06:48 PM

My current desktop machine is a 24-core Xeon-3345 with 256GB of RAM and an Nvidia 5090. It still feels extremely fast, even though it's about 8 year old technology with a newer video card.

by fortran77

6/1/2026 at 10:39:24 AM

As someone doing this for fun on a windows 11 machine (96gb ram, 5090 24gb) I wonder if I need any flags to keep the model in memory and avoid swapping to ssd?

I use LM studio and qwen3.5 35B - but never figured out if it is swapping or not.

Om am unrelated note, does anyone know a model that can help with this use case:

https://news.ycombinator.com/item?id=48301635

by rvba

6/1/2026 at 1:52:11 PM

The article talks about using --mlock

by smw

6/1/2026 at 9:06:12 AM

I also run a Qwen 3.6 moe A4B on old hardware. I set it up with

numactl --membind=1

so it is constrained to one of the memory sticks which speeds up token generation a little.

by nurettin

6/1/2026 at 5:10:32 PM

Successfully ran Gemma4-26B-A4B on my 8yo first-gen Ryzen with a GeForce GTX 1070. It actually ran acceptably well; I was surprised. I even did some coding with it, but the wheels fell abruptly off when it tried several times to use a constant I told it doesn't exist. I only have 32 GiB of RAM in this old bucket, and these results are not worth the RAM consumption, so I put it aside. Maybe if I finish that build with more memory...

by bitwize

6/1/2026 at 5:03:03 PM

Have to point out one boring thing though: this will use a lot more electricity than newer things. So it'll work, but it'll run up your electric bill.

by api

6/1/2026 at 1:40:23 PM

Well, lets get started. I have 4 of those machines, and they are Two dual processor. They all had 32GB of ram, so now I have two with 64GB, and two with zero. They all hand stock K5000s, now how two have two cards. I stripped the uni processors ram and video cards, and put those into the dual procs. They have 256Gb SSDs, and two 1TB disk drives. One machine has 8Gb of VRam across two cards. Dual processors are 8Cx2 and 32 Threads. They can easily play 16 videos at once. For AI, I have not found a model that I can get above 3 tokens a second. Not a one.

by ForOldHack

6/1/2026 at 10:48:15 AM

When you use page up and page down key when reading that blog the first line on the screen is obscured by the floating bar or what ever it is. It is not even needed for reading.

by ezconnect

6/1/2026 at 10:13:46 AM

The webpage's layout is just horrible. Scrolling is also non-default - and thus rather annoying; I had to stop after two scroll events. Why do people think they need so much fancy effects or non-standard behaviour, if their alleged goal is to get information across to other people?

by shevy-java

6/1/2026 at 9:17:15 AM

Might consider going for even older CPUs which don't have the Intel ME ring -3 thing which is full of backdoors

by bflesch

6/1/2026 at 10:30:50 AM

I appreciate the downvotes without any reasoning. It's a fact that newer Intel CPUs have Intel ME which was not in older CPUs and significantly increases attack surface if you are not living in a five eyes state.

by bflesch

6/1/2026 at 1:31:09 PM

In a server, you have to worry about the ME only if you also have an Intel Ethernet interface, which is connected to a potentially hostile network.

If that is not true, the ME cannot be controlled remotely.

The existence of the ME is much more worrisome in laptops, where the ME can be accessed remotely through WiFi. There, to be certain that there is no way for the ME to be accessed remotely you would have to disconnect or cut the internal antennas and use a USB dongle for WiFi.

by adrian_b

6/1/2026 at 11:47:22 AM

I agree with the first part. I think this article by FSF about Intel's ME summarizes the issue https://static.fsf.org/nosvn/blogs/Intel_ME_Carikli_article_...

As for the second part, I am not sure about how living in a five eyes state would mitigate it. What do you mean by that?

by s20n

6/1/2026 at 2:03:52 PM

As five eyes citizen you have at least some rights on paper and you can appeal to your government, but if you are foreigner these guys can go gloves off without any fear of retribution.

Try analyzing Epstein files and posting about it, they'll give you a proper penetration test of all your devices to see what you found out about their ex employee.

Nowadays even EU citizens migrating away from US cloud providers are a "national security issue".

by bflesch

6/1/2026 at 4:00:19 PM

Isn't the whole five eyes argument moot because member states spy on citizens from the other countries and trade intel with each other?

by smilespray

6/1/2026 at 4:18:15 PM

No need for that charade if you are a foreigner, even from NATO ally.

by bflesch

6/1/2026 at 12:16:44 PM

How old are we talking?

by tryauuum

6/1/2026 at 1:55:03 PM

IIRC it is pre-2008.

by bflesch

6/1/2026 at 9:44:41 AM

Now we need someone try run Kimi K2.6 on old Xeon and DDR3. After all these platforms do support up to 768GB RAM.

by SXX

6/1/2026 at 3:24:41 PM

You can run these on a turing machine. At what point is it not worth it? At some point the energy to generate each token matters. We often seen token per second. I think a missing metric is tokens per kilowatt. That is what really matters.

by segmondy

6/1/2026 at 8:42:10 PM

This is just like running Crysis via software rendering on CPU / llvmpipe. It dont have to be practical in order to be fun to try.

by SXX

6/1/2026 at 11:16:31 AM

It’ll work but yield a token per minute. With ancient servers the throughput is the limiting aspect not mem size

by Havoc

6/1/2026 at 4:00:52 PM

[flagged]

by maxothex

6/1/2026 at 3:06:32 PM

[dead]

by 6_7

6/1/2026 at 9:31:34 AM

> The argument for speculative decoding is stronger on CPU than on GPU.

Uh. Uuuh.

No?

___

Also

> While a GPU has a massive pool of ultra-fast High-Bandwidth Memory (HBM), a CPU relies on small, lightning-fast “caches” (L1, L2, L3) built directly onto the processor chip.

What purpose does the quoting of "caches" serve there? Is this AI writing written by that model running on that host?

by hypfer