7/3/2026 at 4:22:47 PM
I play with local LLMs a lot. I've spent more on hardware than I should. I'm friends with a local group of people who have spent a lot more than I have.
The warning I would have for everyone is to temper your expectations and read the fine print carefully. The big build in the article starts off with a $40K budget and then includes 4 GPUs that are $12K each. For those doing the math, this build is going to cost more like 50-55K.
Local setups also often rely on quantization and techniques like REAP to fit the models on their hardware. You will read a lot of claims that 4-bit quantization is lossless, but those claims come from KL divergence measurements on a small corpus. Use one of these 4-bit models on long-context coding tasks and the quality will be noticeably worse. Even for non-coding tasks like dataset analysis, I can measure a substantial quality difference between 4-bit models, 8-bit quants, and sometimes even the full 16-bit source.
This article is also encouraging the use of a REAP model, which means someone has cut out some of the weights to make it smaller. The idea is to remove weights that are less useful for certain tasks, but again this is going to reduce the overall quality of the output.
The trap is that people say "I'm running GLM-5.2 locally!" and it sounds amazing when you look at the GLM-5.2 benchmarks. However they're not actually running GLM-5.2, they're running a model derived from GLM-5.2 that discards most of the bits and drops some of the experts. It does not perform the same as what you see in the benchmarks. In my experience, the divergence between a quantized/REAP model and the parent model is unnoticeable when you try it on very small tasks or chat, but becomes painful when you start trying to use it on long-horizon tasks where little errors start compounding.
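To make the compounding point concrete, here is a minimal sketch. It assumes every step of a task succeeds independently with a fixed probability, which is a simplification, and the accuracy figures (0.999 and 0.995) are made-up illustrations, not measurements of any model.

```python
def chance_of_clean_run(per_step_accuracy: float, steps: int) -> float:
    """Probability that all `steps` independent steps succeed."""
    return per_step_accuracy ** steps


# Hypothetical per-step accuracies for a parent model and a derived
# (quantized/pruned) model; both numbers are illustrative assumptions.
for steps in (10, 100, 1000):
    parent = chance_of_clean_run(0.999, steps)
    derived = chance_of_clean_run(0.995, steps)
    print(f"{steps:>5} steps: parent {parent:.3f}, derived {derived:.3f}")
```

A gap that is invisible over ten steps becomes large over a thousand, which is one way to read the difference between short chats and long-horizon tasks.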
Then you get into the slippery slope of thinking you're $50K deep into this project, but what you really need is just one or two more of those $12K GPUs to use the next level of quantization that might improve the quality a little more and make your investment worthwhile...
by Aurornis
7/4/2026 at 10:13:24 AM
I run Qwen3.6 on an RTX 4090, and it does an amazing job for the most part.
For coding tasks, one needs to break the session into multiple calls. I made https://github.com/aka-rider/orqestra for this, but it's possible to do the same in almost any modern harness directly.
The main idea is:
- separate session that burns context on reading code and calling tools (context7, etc.) -> markdown report "here are relevant parts of code, docs" "with evidence" to prevent hallucinations
- separate session for planning (architect)
- (critic <-> architect) 1-3 times, because small models skip over details
- worker <-> validator, again for the same reason
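The session split above can be sketched roughly as follows. This is not the orqestra implementation. `run_task`, the role names, the prompts, and the convention that a reviewer replies starting with "OK" when satisfied are all hypothetical. `call` stands for whatever function starts a fresh model session in your harness.

```python
from typing import Callable

# Hypothetical signature: (role, prompt) -> reply, each call a fresh session.
ModelCall = Callable[[str, str], str]


def run_task(task: str, call: ModelCall, max_reviews: int = 3, max_fixes: int = 3) -> str:
    # 1. The research session spends its own context on reading code and
    #    calling tools, and hands back only a markdown report with evidence.
    report = call("researcher", f"Collect relevant code and docs, with evidence, for: {task}")

    # 2. The architect plans from the report; a critic reviews the plan 1-3 times.
    plan = call("architect", f"Task: {task}\n\nReport:\n{report}\n\nWrite a plan.")
    for _ in range(max_reviews):
        critique = call("critic", f"Find gaps in this plan:\n{plan}")
        if critique.strip().upper().startswith("OK"):
            break
        plan = call("architect", f"Revise the plan.\n\nPlan:\n{plan}\n\nCritique:\n{critique}")

    # 3. The worker implements; a validator checks the result against the plan.
    result = call("worker", f"Implement this plan:\n{plan}")
    for _ in range(max_fixes):
        verdict = call("validator", f"Plan:\n{plan}\n\nResult:\n{result}\n\nAny bugs or gaps?")
        if verdict.strip().upper().startswith("OK"):
            break
        result = call("worker", f"Fix these issues:\n{verdict}\n\nPrevious result:\n{result}")
    return result
```

Keeping each role in its own session means the planner and worker never see the raw tool output, only the condensed report, which is the point of the first bullet.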
Qwen3.6 can run for hours looking for complex bugs in read-only mode, and usually it gets it. The proposed fix would probably be hacky, but so would Sonnet's.
Qwen3.6 can mechanically write code from an Opus-made plan. You would have to prompt afterwards:
"Review your own changes. Any bugs? Cross-validate against the original plan - any gaps? Any violations of CLAUDE.md"
But again, I need to do this for Sonnet too. I also use local LLMs for reindexing a knowledge base.
Grooming tickets: I can leave a caveman note "single panel for errors rendering, move all error messages" and come back to 90% ready specs with the end goal and context.
by aka-rider
7/4/2026 at 11:50:05 AM
I'm afraid prompts and clever arrangements of data don't really negate the parent post's warnings. It's great if it works for you and your projects. Unfortunately, I can almost guarantee your approach will break down once you get a project large enough or switch to a less popular language.
My favorite example is Godot; most local models just can't get it through their thick AI skull that code alone won't be enough to generate working solutions. They must accept a more complex harness, or you must provide much more info that eats the precious available context on every run.
by NBJack
7/4/2026 at 2:39:22 PM
There is no replacement for large models, indeed. And this is not the point I'm trying to make. There are numerous applications for self-hosted models.
As the simplest example, when you ask "explain what this code does", the advantage of large models is negligible.
I tried Fable, "look at this repo, find all bugs" — yeah, neither Qwen nor Opus can do this.
> I can almost guarantee your approach will break down once you get a project large enough or switch to a less popular language.
I can guarantee you it will not. I used my Qwen on 10-15 years of PHP; I just know how and where it will break, what to ask for and what not. Orqestra was/is self-hosted, being developed by, well, an orchestra of Qwen agents.
Moreover, Opus and GPT-5.5 break similarly, yeah they will withstand much more pressure, but they will hallucinate and loop nevertheless. My Qwen experience translates seamlessly. I learned so much about agentic engineering, harnesses, tooling, building custom MCPs...
by aka-rider
7/4/2026 at 3:39:47 PM
my experience has been similar, qwen is very good at ALMOST getting the job done for large tasks and does fine on smaller/medium tasks.
by aayush0325
7/4/2026 at 1:48:35 PM
> Qwen3.6 can run for hours looking for complex bugs in read-only mode, and usually it gets it. The proposed fix would probably be hacky, but so would Sonnet's.
I'll go on a tangent, but to me that's what we're all seeing. It's the "record number of CVEs found by AIs" thing: these tools are extremely good at searching inside code. And that is a godsend.
We've got people (claiming they're from Anthropic) posting comments saying: "Yes GLM 5.2 found that security bug in library xxx, but we just tried with Fable and it found it too".
More code searching, more bug finding. Dick-measuring contests on bug-finding abilities.
But the headlines we don't see at all are: "1000 CVEs found by AI, 1000 CVEs fixed by code written by AI". These are nowhere to be found.
We don't see "GLM 5.2 suggested an elegant fix to CVE-2027-xxxxxx" to then have a paid Anthropic shill posting "Fable suggested an eleganter fix than GLM 5.2".
These headlines are, as of 2026, nowhere.
You wrote the result would be "hacky". Here's what I saw from a top, paid-for, SOTA model from the top company of the moment: instead of doing two integer comparisons (literally one line of code) to verify that a value is within a range, the thing somehow noticed a "pattern" in the hexadecimal representation of the two values and went insane. It started converting the value to its hexadecimal string representation and then started doing substring matching on that.
"Hacky" is too nice of a word.
This is pure garbage.
Those who go hiking "while their agents ship features" don't realize the level of underperforming, buggy, insecure crap that their LLMs are generating.
I find the schism very interesting between those who use LLMs to find issues but verify, modify, or even discard the fixes they suggest, and those who vibe-code while on a yoga retreat.
It's 2026: LLMs do find bugs. But can they fix them?
And do we even care: isn't finding a bug 99% of the job?
by TacticalCoder
7/4/2026 at 3:11:27 PM
The best metaphor I heard about LLMs so far - it's a search engine. The bigger the model the bigger the search space. Small models tend to have "tunnel vision" or fall into "rabbit holes" - they have fewer visible options to choose from.
> underperforming, buggy, insecure crap that their LLMs are generating
The biggest challenge with AI-generated code is that models actively destroy security features. Opus once explained to me that the authorization mechanism is "bad development experience", all while making a backdoor (it made a skeleton key: if token=="test" then all permissions granted). Models also actively destroy QA gates. I don't even complain when they delete tests, since at least that's visible; they can also flip a condition to make a test pass, and with vast code changes these are hard to spot.
I myself, and some people I know, "vibe-code" professionally though, but then we often assess not the code but its behaviour. For instance, whether the hand-made tests all pass, p95 is under 50ms, and so on; I may not care about the implementation details.
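A behaviour gate of that kind can be sketched in a few lines. The function names are hypothetical, the 50 ms budget is the figure from the comment, and the latency samples are assumed to come from your own load test.

```python
import statistics


def p95(samples_ms: list[float]) -> float:
    """95th percentile; quantiles(n=100) returns 99 cut points, index 94 is p95."""
    # Needs at least two samples.
    return statistics.quantiles(samples_ms, n=100, method="inclusive")[94]


def passes_latency_gate(samples_ms: list[float], budget_ms: float = 50.0) -> bool:
    """True when the measured p95 latency is under the budget."""
    return p95(samples_ms) < budget_ms
```

A gate like this only checks what it measures: it says nothing about a flipped test condition or a skeleton-key token, so it complements code review rather than replacing it.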
On the other hand, my friend told me about a garage owner he visited, a 60-year-old auto mechanic: CRM, parts inventory management, payment processing terminal, passwords in a txt file, people's personal data God knows where; it could be an unprotected MySQL facing the Internet bare for all we know.
2026 onwards will be wild.
by aka-rider
7/3/2026 at 6:50:21 PM
This is similar to my experience with (8-bit quantized, non-MOE, 26b) Qwen locally on my computer. It’s really good for small tasks, but the first time I tried to do a major task with it it straight up forgot what agent harness it was in and started using the wrong format for tool calls lol
(If you’re curious, it was running in Pi, but somehow convinced itself it was running in Claude instead and started trying to call Claude tools that didn’t exist)
by odo1242
7/4/2026 at 11:26:53 AM
The model+harness combination means a lot. That's why all major labs are making their own. All models have quirks that harnesses know about: "you are reading the same file for the 3rd time, you are in a loop, step back".
I tried all frontier Chinese models, and Qwen is the one that runs best in ClaudeCode; my personal theory is that it's because Qwen was distilled from Opus.
by aka-rider
7/3/2026 at 6:41:19 PM
I’ve found ds4 on my mbp to be very useful, bought before ram prices became insane. It’s not writing entire applications on its own, but it has resolved annoying networking issues on my tailnet that I had neither the time nor inclination to figure out on my own, and I often find myself reaching for it for simple but annoyingly research-intensive tasks that I wouldn’t have otherwise gotten to. Is it opus? No, but is it useful? Absolutely, and I don’t have to worry about whether or not I’m getting value out of a subscription or the api cost of using it.
by FuckButtons
7/3/2026 at 9:02:13 PM
Yeah, I really wish articles and comments about "<model> running locally" also reran the same common benchmarks published to compare the results.
by nijave
7/4/2026 at 4:02:37 AM
Absolutely true. All this craze about running coding LLMs locally has been detrimental to local AI where purpose-built SLMs could actually be beneficial.
Little tools for NLP, TTS, image processing, audio engineering, signal processing, a diffusion plugin for Krita etc. are all great for a local setup. I wrote a small piece on it a few days back[1].
[1] https://abishekmuthian.com/multiple-20-ai-plans-are-better-t...
by Abishek_Muthian
7/3/2026 at 11:35:14 PM
I would very much recommend first using a cloud vendor and setting up an LLM running on there to get a taste of what it’s like before buying the full hardware.
by stingraycharles
7/3/2026 at 6:43:46 PM
> The warning I would have for everyone is to temper your expectations and read the fine print carefully. The big build in the article starts off with a $40K budget and then includes 4 GPUs that are $12K each. For those doing the math, this build is going to cost more like 50-55K.
> Local setups also often rely on quantization and techniques like REAP to fit the models on their hardware.
This seems to ignore the very real possibility of running SOTA models at full precision on ordinary local hardware using SSD offload. Yes this will be slow and usually have very low throughput (even batched decode can only achieve so much before power and thermal limits become important, and that still leaves you with slow prefill as a major bottleneck) but that's OK if you aren't expecting a real-time response to begin with and your volumes as a single user are low enough.
by zozbot234
7/3/2026 at 6:55:23 PM
SSD streaming throughput is too slow to be usable.
GLM-5.2 has 40B active parameters at a time. At Q4 that's 20GB. The best PCIe 5 SSDs can get 15GB/sec when everything goes well. Every expert load would take more than a second.
If you had enough RAM and enough SSDs in parallel you might get a couple tokens per second on a good day. If you left this machine running 24 hours straight, you might be able to get 200,000 tokens generated.
So it can be done, but only if you interact with your LLM like you're e-mailing someone back and forth and you're okay waiting until tomorrow for a response.
You would spend $50K to buy a machine that consumes 2000W and takes all day to produce as many tokens as I could buy on OpenRouter for $0.60. You would spend $5-15 on electricity depending on where you live.
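The arithmetic behind these figures can be checked with a short sketch. The model size, SSD rate, and power draw are the numbers from the comment. The worst-case assumption that every token re-reads all active weights from SSD, and the electricity prices of $0.10 to $0.30 per kWh, are mine.

```python
ACTIVE_PARAMS = 40e9        # active parameters per token (from the comment)
BITS_PER_WEIGHT = 4         # Q4
SSD_BYTES_PER_SEC = 15e9    # best-case PCIe 5 SSD read rate (from the comment)
POWER_WATTS = 2000          # machine power draw (from the comment)
SECONDS_PER_DAY = 24 * 3600

# Bytes that must be read per token if nothing is cached (assumed worst case).
bytes_per_token = ACTIVE_PARAMS * BITS_PER_WEIGHT / 8
seconds_per_token_one_ssd = bytes_per_token / SSD_BYTES_PER_SEC


def tokens_per_day(tokens_per_sec: float) -> float:
    """Tokens generated in 24 hours at a sustained rate."""
    return tokens_per_sec * SECONDS_PER_DAY


kwh_per_day = POWER_WATTS / 1000 * 24


def electricity_cost(price_per_kwh: float) -> float:
    """Daily electricity cost in dollars at the given price."""
    return kwh_per_day * price_per_kwh


print(f"{bytes_per_token / 1e9:.0f} GB per token, "
      f"{seconds_per_token_one_ssd:.2f} s per token on one SSD")
print(f"{tokens_per_day(2):,.0f} tokens/day at 2 tokens/sec")
print(f"${electricity_cost(0.10):.2f} to ${electricity_cost(0.30):.2f} per day")
```

At 2 tokens per second the daily total lands a little under the 200,000 quoted, and the assumed price range reproduces the $5-15 electricity estimate.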
If you have no other option but to process data locally and you must use a very large model and you aren't in a rush, this can do it. I would not recommend it unless you're desperate and operating inside of rigid constraints.
by Aurornis
7/3/2026 at 7:07:41 PM
You can improve that with speculative preload. I'm sure models could be designed and tuned around efficient SSD offloading to keep throughput pretty high.
by CuriouslyC
7/3/2026 at 7:40:47 PM
surely the supply of unified memory will rise to meet demand before this is needed
by rsalus
7/3/2026 at 7:56:41 PM
It would apply equally to GPU or RAM inference as those are also bandwidth constrained on decode, so people already try to optimize for it.
by searealist
7/3/2026 at 7:26:27 PM
Wonder if the AMD MI350P release will affect setups like this. From what I've heard, the price will be pretty similar to the RTX PRO 6000 while having 50% more VRAM, which is additionally HBM3E instead of GDDR7.
by vient
7/4/2026 at 12:49:10 PM
I’m also watching Intel Celestial with 160GB of LPDDR. Noticed lower memory throughput than AMD or NVIDIA, but potentially significantly lower cost per card. Two of them would likely run deepseek-v4-flash sized models pretty decently.
by bradfa
7/3/2026 at 7:09:29 PM
They do say the cards were purchased when they were cheaper. They debuted at less than nine grand apparently.
by bloat
7/4/2026 at 11:50:19 AM
The models will improve and the hardware will remain useful. It's likely a good investment regardless, if you have the money to spend. Plus your business won't be stolen by Anthropic.
by nullbio
7/3/2026 at 5:13:01 PM
Well you could make a REAP with better input prompts on longer context then. It’ll improve the REAP quality.
by ttoinou
7/3/2026 at 9:03:16 PM
> The big build in the article starts off with a $40K budget and then includes 4 GPUs that are $12K each. For those doing the math, this build is going to cost more like 50-55K.
Just two months ago you could get RTX PRO 6000's for about $8500 on ebay, which is the MSRP.
by nullc
7/3/2026 at 10:05:58 PM
> Just two months ago you could get RTX PRO 6000's for about $8500 on ebay, which is the MSRP.
The MSRP was raised to $13,250.
Warranty is very important for expensive cards like this. I don't recommend buying on eBay unless they come with a very big discount.
by Aurornis
7/3/2026 at 4:48:38 PM
All very true. Right now, running GLM 5.2 at its full BF16 precision needs 1.5 TB of VRAM. You can't run this locally at a usable speed for less than $250K or so, and frankly I'd be surprised if it could be done for less than $500K.
The best NVFP4 quant for 5.2 appears to be lukealonso's at https://huggingface.co/lukealonso/GLM-5.2-NVFP4, and it is capable of good throughput (75-100 tps) without losing much reasoning performance. Allowing for overhead for the KV cache and other requirements, this quant will (barely) run in 8-way tensor-parallel mode on 8x RTX 6000 cards. Not too long ago it was possible to put an 8x machine together for less than $100K USD, but that's probably not true now, assuming you buy all-new components.
It'll almost certainly be worth it, given the abusive behavior we've seen and will continue to see from the major closed-model providers. If I hadn't already put a similar rig together, I'd be kicking myself. But getting it running well is by no means as simple as buying a bunch of RTX6K cards and calling it a day, and people need to know what they're getting into.
Local AI is in its Altair and IMSAI days. There's no turnkey Apple II or C64 on the market yet, much less an IBM PC. Hardware, yes -- you can buy a capable box off the shelf from various vendors -- but you have to be prepared to take up a whole new hobby when it comes to getting a complete system working well.
by CamperBob2
7/3/2026 at 4:53:30 PM
> It'll almost certainly be worth it, given the abusive behavior we've seen and will continue to see from the major closed-model providers.
The proper financial comparison for GLM-5.2 would be one of the providers on OpenRouter or renting a server as needed. Compare apples to apples.
You will almost certainly never break even compared to paying per token.
Local LLMs at this scale are only worth it if you have extremely strict requirements that data not leave the premises.
by Aurornis
7/4/2026 at 3:31:38 AM
Never say never. When the free money party stops, then those token costs are going to have to go up and up. The fact there’s such a glaring disparity between the cost of running AI locally and the pennies it costs to use an online model shows how heavily funded those platforms are right now. This is not and cannot be sustainable.
by gizajob
7/4/2026 at 4:53:08 AM
> When the free money party stops
The OpenRouter providers the GP referenced were never at the "free money party". The actual cost of running something like GLM5.2 is well understood and tokens from those providers are not sold at a loss.
Obviously running things locally is more expensive but that all comes down to economies of scale. GLM5.2 is as expensive as it will ever be, barring an increase in demand that forces/allows providers to realise windfall gains disconnected from their underlying costs (always possible, but not the point).
by sho
7/3/2026 at 4:55:55 PM
Or if you want to hedge against the various tail risks of third-party providers raising prices or denying you service or somehow abusing your data...
by jobeirne
7/3/2026 at 5:03:16 PM
> hedge against the various tail risks of third-party providers raising prices
They could 10X the prices and you’d still be better off. It’s also unlikely that prices go up enough to warrant a $100K local investment to prevent paying a couple bucks per million tokens.
> or denying you service
I guess you’re not familiar with OpenRouter? There are many providers there. There are providers outside of OpenRouter. There will always be someone to take your business.
> or somehow abusing your data...
If data security is your concern then you’re still better off renting a server as needed.
If you cannot tolerate any data leaving, then local models are the only way. You pay a high premium for it!
by Aurornis
7/4/2026 at 3:33:37 AM
People seem to miss that with local models you can have them burning their wee digital brains out 24/7, which is a different class of AI usage than that from online models even at a few dollars per million tokens.
by gizajob
7/4/2026 at 4:00:31 PM
There's a definite psychological branch point. With a remote provider, no matter how readily you can afford it, your mindset is always going to be, "I should think twice about what I'm doing. I hate to waste tokens." With your own hardware, your mindset is more like, "I should try to get more done. I hate to see this thing just sitting there idle."
by CamperBob2
7/3/2026 at 5:10:01 PM
Raising prices is not a tail risk. Anything a local LLM setup can do for you can be done by any cloud provider, with the same capex as yours (or less). There is no moat here, so it is highly price competitive and will remain so. If you want to speculate on hardware shortages, that is a different business altogether and you need no janky garage setup to profit.
by incrudible
7/3/2026 at 4:56:47 PM
Also agreed, it's definitely a sucker's game to run a high-end model locally, by any objective measure.
Still... if it's not your weights, running on your box, you're always going to be behind somebody else's 8-ball. Everybody has to decide for themselves where their priorities lie.
by CamperBob2
7/4/2026 at 3:29:10 AM
Nah we’re in like desktop PCs in the 90s type days - bit clunky and maybe occasionally having to work out what an IRQ number is, but a long long way from hand-toggling switches just to get a “hello world” punched out onto paper tape. You can go to an Apple Store today and an hour later have your AI agent talking to LM Studio, and it configuring your MacBook to code and do useful work while also running a diffusion model in the background. Slowly, but not “hand toggling hex switches” slow.
by gizajob
7/4/2026 at 12:35:37 AM
> $100K USD
With z.AI GLM Coding Subscription for 1344 USD per year, that buys you 74 years.
Maybe if you want to host the model for a group of people or really need no artificial token limits, or maybe cannot use cloud models, then it makes more sense.
by KronisLV
7/3/2026 at 9:55:48 PM
Another option is renting cloud GPUs only when you need them. A server with 8x B200 is around $32/hr.
Obviously depends on the use case and threat model, but that hardware is publicly available at far less than $500k upfront.
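As a rough break-even sketch for renting versus buying: the $32/hr and $500K figures are the ones quoted in this thread, while the usage patterns (2, 8, or 24 hours a day) are assumptions for illustration. Electricity, hosting, depreciation, and resale value are all ignored.

```python
def break_even_hours(upfront_usd: float, rental_usd_per_hour: float) -> float:
    """Rental hours that cost as much as buying the hardware outright."""
    return upfront_usd / rental_usd_per_hour


def break_even_years(upfront_usd: float, rental_usd_per_hour: float,
                     hours_used_per_day: float) -> float:
    """Years of renting at a given daily usage before rental spend matches the purchase."""
    return break_even_hours(upfront_usd, rental_usd_per_hour) / hours_used_per_day / 365


for hours_per_day in (2, 8, 24):
    years = break_even_years(500_000, 32, hours_per_day)
    print(f"{hours_per_day:>2} h/day: {years:.1f} years to match $500K upfront")
```

The lower the daily utilization, the longer renting stays cheaper, which is why "only when you need them" matters in the comparison.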
by thinkmassive
7/3/2026 at 8:38:05 PM
Everything in this post is spot on and it is a rare example of a HN person not saying BS about LLMs!
That said, modern LLM sampling algorithms like min_p, top_n_sigma, etc. heavily mitigate the performance penalty you get from doing long context tasks. Problems with long context come from accumulation of small sampling errors over time.
My qwen 3.6 27b (the dense one) runs perfectly well on coding tasks at the edge of its context window because I run it using a modern LLM sampling stack, namely top N sigma of one, using DRY to stop repetitions and XTC as a superior alternative to temperature for diversification.
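For readers unfamiliar with these samplers, here is a minimal sketch of the two truncation rules as I understand their published definitions. Top-n-sigma keeps tokens whose logit is within n standard deviations of the maximum logit, and min-p keeps tokens whose probability is at least min_p times the top token's probability. The function names are hypothetical and this is not llama.cpp's code. DRY and XTC are left out.

```python
import math


def top_n_sigma_filter(logits: list[float], n: float = 1.0) -> list[int]:
    """Indices of tokens whose logit is within n standard deviations of the max."""
    mean = sum(logits) / len(logits)
    sigma = math.sqrt(sum((x - mean) ** 2 for x in logits) / len(logits))
    threshold = max(logits) - n * sigma
    return [i for i, x in enumerate(logits) if x >= threshold]


def min_p_filter(logits: list[float], min_p: float = 0.05) -> list[int]:
    """Indices of tokens with probability >= min_p * (top token probability)."""
    # p_i / p_max equals exp(logit_i - logit_max), so no full softmax is needed.
    top = max(logits)
    return [i for i, x in enumerate(logits) if math.exp(x - top) >= min_p]
```

Both rules cut off the long tail of unlikely tokens before sampling, so a single bad draw is less likely to derail a long generation.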
Yes there will be a paper soon on arxiv and hopefully NeurIPS proceedings talking about this phenomenon because it’s not well appreciated by the academic AI community yet.
by Der_Einzige
7/3/2026 at 10:10:04 PM
Can you please share your llama.cpp server parameters to turn on the modern LLM sampling stack?
Docs [1] say that top_n_sigma is already in the default sampler list: "(default: penalties;dry;top_n_sigma;top_k;typ_p;top_p;min_p;xtc;temperature)"
[1] https://github.com/ggml-org/llama.cpp/blob/master/tools/serv...
by pulse7