RTX 5080 and RTX 3090 Setup: 80 Tok/s on Qwen 3.6 27B Q8

6/13/2026 at 3:56:27 PM

That's almost exactly my setup and I'm very happy with its performance.

I noticed recently that I started to prefer my local Qwen3.6 35B A3B and pi agent over Claude Code.

Both fail at different tasks, and Qwen more so than Claude.

But the way Qwen fails is much more straightforward. In writing tasks, Qwen's hallucinations and bullshitting are much easier to spot because it doesn't have the sleek vocabulary and wordsmithing skills to disguise its ignorance.

In coding tasks that Qwen can't solve it often just goes into a tool calling doom loop that the pi harness can catch, whereas Claude attempts ever more convoluted and creative things just making more and more mess that takes forever to clean up.

I think part of the story is that the tasks for which I use AI are fairly simple and maybe don't need a frontier model. But I wonder if "proper" developers had similar experience?

by sieste

6/14/2026 at 8:57:16 AM

Frontier models are still better (everyone would use them if it was cheap). Open source models are capable on even non "simple" problems but I trust them less, even though I usually write plans for all changes, and they are worse at debugging. I recently converted my homelab to nixos and let's just say Deepseek failed and Fable did great (the night before getting killed)

by iamanllm

6/14/2026 at 9:34:07 AM

While what you say is in general true, every model that followed Opus 4.6 on Anthropic side has been increasingly worse at what the previous user points out: they are extremely smart and can convince the user about major falsehood.

They are way too trained/reinforced on solving problems rather than assisting you, something they have become extremely bad at.

It's hard to explain because I too had the many moments where "Fable5 / Opus4.8 xhigh could solve bugs/stuff that previous models couldn't", I know that to be true, and they are very useful for that.

But 90% of my tasks are quite mundane and I need thorough investigation and a proper assistant. Not a smart bullshitter fixated on solving the issue itself. On that Opus 4.6 has been the last good model.

Anything after that is completely skewed towards passing benchmarks and E2E tasks, but definitely not assisting.

Fable in particular was a disaster on that, non stop being thorough on the fix it fixated on, writing nthousand experiments in /tmp, etc. Great model, not gonna lie, but only if your focus is vibe coding and you accept that you're nothing but an assistant and accept its shortcomings.

by epolanski

6/13/2026 at 4:51:01 PM

I keep finding more and more use cases for Q3.6 27b (same league), and the best performance comes when the answer to my question is already in the context.

The moment I'm trying something open-ended or ambitious, Claude/ChatGPT clearly take you to the goal quicker.

For things, where there's a way to build a knowledgebase though, the local llm definitely can be a true contender. Plus, having a big context and no worries about filling it over and over - you can get quite far.

I'm writing this literally in between cooking a pasta for which the local llm ordered the products online. I've built a grocery shopping skill, so that it roughly knows what I have in the fridge (loosely), my last 10 representative orders (general preferences plus rich info about shops and skus around me) and actual real-time in-stock info. The last part has been my personal pet peeve for every product that promised cooking ingredient delivery (that is not packaged specifically for that).

This is what has been promised to us by every big tech company with an agent, and now a local llm has actually solved that for me fully.

by eurekin

6/13/2026 at 11:21:32 PM

I keep playing around with this exact concept. While I don’t always trust an entirely AI-generated recipe, more traditional setups are super rigid when it comes to ingredients

by matthewfcarlson

6/14/2026 at 8:08:37 AM

I kept getting recipes with "that one ingredient", which was either a major PITA to source or produced too much waste, even from a real world dietician consultation. Example: use 1/4th of a pumpkin for something. Those were good recipes in terms of macronutrient composition, but they don't work long term due to logistics.

I'm years past needing that strict diet, but the itch to fix or ease some parts of the process stayed.

by eurekin

6/14/2026 at 4:25:34 AM

>the local llm ordered products for me online

do you mean by commanding a browser? or using APIs?

by ed_mercer

6/14/2026 at 8:06:13 AM

Chrome driven by the OS accessibility API

by eurekin

6/14/2026 at 1:22:55 AM

I know the big labs like to pretend that their models are trillion parameter. But how likely is that really to be the case when Qwen 3.6 35B A3B gets so close to their performance? Seems that with the best research applied, best training data, they'd be able to top the charts with a 60B model quite easily.

by nullbio

6/14/2026 at 8:27:02 AM

Qwen 35B isn't even remotely close to the big models. It's just people over hyping small models. Ignore the benchmarks they are almost meaningless.

If you want something comparable you need the trillion parameter open models like deepseek.

by redox99

6/14/2026 at 1:28:11 AM

They want people to believe they have massive models, that is effectively their moat at this point.

Because if they don't imply that size is needed for every task, they'll end up tanking their valuations.

https://blog.nilesh.io/post/ai-profit-race

by MisterKent

6/13/2026 at 8:54:02 PM

Not having a lot of experience with this, I ask a naive question: is there a world where you can take your local LLM and hook it up to Claude and get more Claude-like results from your local model? Obviously, there are going to be material differences in how these perform, but are we getting close to a place where this is viable? I imagine that the answers are a combination of “not yet” and “yes but it’s a lot slower” and “yes but there is actually little point to doing this because ‘what Claude gets you’ is highly baked into anthropic’s models and that’s part of what you’re paying for.”

by hamburglar

6/13/2026 at 9:54:03 PM

I have a "task router" that is a small local LLM on my mac mini (Qwen 3.5 0.8B) that I use to decide (when activated) with Pi whether to route a given task to my local LLM (Step 3.7 Flash) or to <given cloud provider>, if that counts? It works surprisingly well really. Though some of the cloud providers are getting so good and so cheap (GLM 5.1/5.2, MiniMax M3, among others) that the need to use my local one becomes less and less relevant, depressingly!

by girvo

6/14/2026 at 4:47:35 AM

You can use ollama as the backend for claude code!

  ollama launch claude --model

I would characterize it as doable, but not really viable. It's "yes you can do it but it's a lot slower", with a hint of "and the best local LLMs are on par with Haiku or Maybe Sonnet so larger and longer tasks get notably worse".

by datadrivenangel

6/13/2026 at 9:16:44 PM

You're kinda talking about Claude being used for planning/architect role, while local LLM is just executing it (performing edits) -- at least in such form it's a thing, yes.

by petu

6/14/2026 at 4:32:38 AM

Already been done. Look at the Forge project for local LLMs. It can bring 8b models up to Opus-like performance at tool calling.

by znnajdla

6/14/2026 at 12:14:50 AM

opencode is like Claude code, but you can use any model.

by z3t4

6/13/2026 at 8:31:17 PM

I have said this before as well: these top-of-the-line models write clever, convoluted code. The code looks intelligent from above, but is a maintenance headache. Makes entire thing fragile for future developments on top of it.

The smaller models, especially the aforementioned ones, they fail much more, but, do not write that insanity of the code. They do simple, non-clever coding like humans do. Much easier to maintain and build upon.

Qwen-3.6-27b is a wonderful model. Exceptionally good for its size, and excellent in general as well. And with mtp available now, it can run at 60+ tps on a single 3090... this is roughly 30% faster tgs than most of the hosted ones being served from giant data-centers.

by freakynit

6/13/2026 at 11:34:39 PM

i keep seeing people talk about pi harnesses. whats this about?

by trueno

6/14/2026 at 4:13:58 AM

It’s one of the hot new-ish harnesses. Believe it’s like openclaw or Claude code without all of the defaults

https://pi.dev/

by eyeris

6/13/2026 at 6:01:40 PM

It's also going to fail consistently. When calling Claude you don't know what version of the model you are talking to; it might be quantized due to load or have been patched.

by christkv

6/13/2026 at 4:57:36 PM

This is true. The failure modes are simpler. And yes the ceiling is lower as well. Smaller models stability is lower over long sequences. And thus anything that needs a lot of CoT will be weaker. For example, I had a dumb lock + condvar with multiple defenses against lost wakeups in a N producer 1 consumer queue thing. Models generally need a lot of CoT before they realise they can switch it to a semaphore instead. Qwen typically isn't stable over such long CoTs and ends up adding more and more slop and band aids versus a larger model that outputs a large CoT and then realises it can swap 3 functions out with 2 lines if we use a semaphore.
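
For readers who haven't seen the swap, here is a minimal sketch of the semaphore variant of an N producer, 1 consumer queue. It is written in Python purely for illustration (the comment doesn't say which language the original queue used), and the class and function names are hypothetical.

```python
import threading
from collections import deque


class SemaphoreQueue:
    """N-producer / 1-consumer queue where a semaphore counts queued items."""

    def __init__(self):
        self._items = deque()
        self._lock = threading.Lock()             # protects the deque
        self._available = threading.Semaphore(0)  # number of items ready

    def put(self, item):
        with self._lock:
            self._items.append(item)
        # A release is remembered even if the consumer is not waiting yet,
        # so there is no lost-wakeup window to defend against.
        self._available.release()

    def get(self):
        self._available.acquire()  # blocks until at least one item exists
        with self._lock:
            return self._items.popleft()


def run_demo(n_producers=4, per_producer=100):
    """Start the producers, drain everything from one consumer, return the items."""
    q = SemaphoreQueue()

    def produce(pid):
        for i in range(per_producer):
            q.put((pid, i))

    threads = [threading.Thread(target=produce, args=(p,)) for p in range(n_producers)]
    for t in threads:
        t.start()
    received = [q.get() for _ in range(n_producers * per_producer)]
    for t in threads:
        t.join()
    return received
```

The point is that the semaphore's counter replaces the predicate loop and the extra wakeup defenses a condvar needs: a `release()` that happens before the consumer waits is not lost.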

by porridgeraisin

6/13/2026 at 8:19:51 PM

The recommended values for Qwen 3.6 in thinking mode are `--temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.00`, and `--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00` for coding/tool calling tasks, and for non-thinking, `--temp 0.7 --top-p 0.8 --top-k 20 --presence-penalty 1.5 --min-p 0.00`.

The options listed are none of these.

Also, the recommended Qwen MTP settings are `--spec-type draft-mtp --spec-draft-n-max 2`. 3 is not good on Nvidia hardware under different workloads. You can also add `ngram-mod`, but after `draft-mtp`; however, default `ngram-mod` settings aren't well tuned, and you want `--spec-ngram-mod-n-min 12 --spec-ngram-mod-n-max 16 --spec-ngram-mod-n-match 6` (defaults are 48, 64, 24; the ratio is good, the magnitude is suboptimal).
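
To keep these straight, a small sketch that turns the settings above into argument lists may help. The flag names and values are copied from this comment and have not been checked against any particular llama.cpp build; the dictionary and function names are hypothetical.

```python
# Sampling presets as quoted in the comment (values kept as strings for argv use).
SAMPLING_PRESETS = {
    "thinking": {"--temp": "1.0", "--top-p": "0.95", "--top-k": "20", "--min-p": "0.00"},
    "coding": {"--temp": "0.6", "--top-p": "0.95", "--top-k": "20", "--min-p": "0.00"},
    "non-thinking": {
        "--temp": "0.7",
        "--top-p": "0.8",
        "--top-k": "20",
        "--presence-penalty": "1.5",
        "--min-p": "0.00",
    },
}

# MTP settings as quoted in the comment.
MTP_FLAGS = ["--spec-type", "draft-mtp", "--spec-draft-n-max", "2"]

# ngram-mod values as quoted: (n-min, n-max, n-match).
NGRAM_DEFAULTS = (48, 64, 24)
NGRAM_SUGGESTED = (12, 16, 6)


def build_args(mode, use_mtp=True):
    """Flatten one preset into an argv-style list, optionally appending the MTP flags."""
    args = []
    for flag, value in SAMPLING_PRESETS[mode].items():
        args += [flag, value]
    if use_mtp:
        args += MTP_FLAGS
    return args


def same_ratio(a, b):
    """True when b is a uniformly scaled copy of a (same ratio, different magnitude)."""
    return all(x * b[0] == y * a[0] for x, y in zip(a, b))


print(" ".join(build_args("coding")))
print(same_ratio(NGRAM_DEFAULTS, NGRAM_SUGGESTED))
```

The `same_ratio` check just restates the claim about `ngram-mod`: the suggested values keep the default proportions and shrink the magnitude by 4x.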

Of abliterated Qwen 3.6 27B models, huihui's ends up being the worst. Try heretic instead. https://huggingface.co/mradermacher/Qwen3.6-27B-uncensored-h...

by DiabloD3

6/14/2026 at 8:36:34 AM

With qwen3.6-35b-a3b-mtp using lm-studio on RTX 3090, I was getting 120tokens/s. The mtp (multi token prediction) is the key.

I tried coding with Pi and it was much faster than Claude, but for any not-straightforward tasks it did so-so: either looping or not noticing easy-to-spot constraints.

But for exploring codebases and asking questions about big stuff I find it better due to sheer speed.

by tomekowal

6/13/2026 at 4:26:26 PM

80tp/s with 5080 3090 combo is wild. I’ve been working with a 4090 and two Tenstorrent p150 cards, and manage only about 30 tps utilizing all three for qwen3.6 27b q8. Guess I got more optimization to do.

Would like to see the perf of their setup with and without mtp and ngram speculative decoding though, as well as parallel decode performance (once llamacpp mtp plays well with multiple slots).

Being in California electricity alone puts this non-competitive with just paying a cloud though.

by ydj

6/13/2026 at 5:08:13 PM

That’s the cost of using a new hardware provider. A single RTX Pro 6000 Blackwell Max-Q will do better than that and be much more usable. I have 2 running DS4 Flash at 160 tok/s with max num seqs 4.

Very interesting though, these Tenstorrent chips. Might get one to experiment with.

by arjie

6/14/2026 at 7:09:29 AM

Yeah that’s definitely the smarter buy if you want to just have models running quickly. But the cost of 2 p150 and a 4090 was <$5000 for me.

The main issue is the immature software, and somewhat baroque way of writing kernels. Please, buy one and join us.

by ydj

6/14/2026 at 2:07:43 AM

Do you get anything useful out of your 4090 (I have one too)? Local cloud sounds like a fun idea but I just don’t see how it competes against OpenAI/Anthropic

by shepherdjerred

6/14/2026 at 7:41:39 AM

I think it’s not really worth it compared to just buying tokens or a coding plan.

My setup has 4090 handling attention while TT accelerators handles MLP. With just a 4090 you can have CPU handle the MLP layers and use a MoE model, assuming sufficiently powerful cpu. I tried that setup with minimax 2.5 before, and was able to eke out around 10 to 15 tps (albeit with a 7965wx cpu)

by ydj

6/13/2026 at 8:40:33 PM

I get 28tps for Qwen3.6 27B on a Ryzen AI Max 395+, with enough spare memory to run another two small models on the side. 60tps for 35B. Am surprised this is not more common.

by ricardobeat

6/13/2026 at 4:45:19 PM

How is the software compatibility with the Tenstorrent cards? Are you stuck using vendor supplied runtimes/models?

It's surprising how little these things come up given the price they go for

by manbart

6/14/2026 at 7:15:38 AM

The software stack is pretty immature, definitely very DIY. Their officially supported models are pretty old at this point, though there’s community support for gemma4, and support for models with GDN like qwen3.6 is supposedly very close.

The entire stack (minus some binary blobs in firmware) is open source, so if you have the time and persistence you can get whatever you want done.

A few community members have been working on support with llamacpp, where we can have supported operations offloaded to the TT cards, while having unsupported ops running on GPU or CPU. Llamacpp is pretty good at that. The existing kernels could definitely be better, and I’ll try my hand at writing some kernels some time.

by ydj

6/13/2026 at 3:34:38 PM

I just bought a $25 chinese 2x Oculink card and two Minisforum DEG1, had some spare PSUs lying around, and just installed two cards on each. It works. I saw that there is also a 4x Oculink card, but I don't know if that will work, too.

by avyeed_desa

6/13/2026 at 8:25:33 PM

Sits in silence, watching China as they innovated a new type of ultra-thin gpu board and calling it 5090 "Turbos." Still waiting for Shenzhen listings to post a 5090 official verified with VBIOS crack...

by WeylandDarkStar

6/13/2026 at 5:20:04 PM

I really like Qwen 3.6 27B Q8.

On Apple Silicon, with MLX-LM, I am getting 20 tok/s with Macbook Max M5. Not sure how it compares to llama.cpp performance.

In any case, while it is noticeably slower than this Nvidia RTX setup, being able to run such models on laptop is wild. Though, it heats my laptop rapidly.

by stared

6/13/2026 at 4:46:59 PM

Potential specs:

NVIDIA GeForce RTX 5080: https://flopper.io/gpu/nvidia-geforce-rtx-5080-16gb

NVIDIA GeForce RTX 3090: https://flopper.io/gpu/nvidia-geforce-rtx-3090-24gb

by triwats

6/13/2026 at 5:29:43 PM

I bought two 3080/20gb and one of those MACHINIST X99 mainboards as well (one with two full x16 pcie slots) those boards come with a xeon cpu included (for the pcie lane support) it set me back 800 euros total (had a spare psu, ssd and mem in a drawer) and now im also happily running 80tk/s Qwen 3.6 Q8 (MTP).

by cybertim

6/13/2026 at 6:34:54 PM

Good call, I really hesitated between the X570 and the X99, are you using P2P?

by iMil

6/13/2026 at 6:48:51 PM

$ nvidia-smi topo -p2p r

GPU0 GPU1

GPU0 X CNS

GPU1 CNS X

i guess not, i use llama.cpp with:

--spec-draft-n-max 3 --spec-type draft-mtp --split-mode tensor --tensor-split 1,1

and my (gen) tk/s are between 60-80 tk/s

will test this uncensored model and ngram added as well this weekend

btw, i also set my powerlimit to 220watt per card (with nvidia-smi) that will cost you around 1 tk/s but save you a LOT of power and heat :)
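
For anyone wanting to script that, a minimal sketch that only builds the commands: `-i` (GPU index) and `-pl` (power limit in watts) are standard `nvidia-smi` options, while the helper name and the two-GPU index list are assumptions for illustration. Actually applying the limit needs root.

```python
def power_limit_commands(gpu_indices, watts):
    """Return one nvidia-smi argv list per GPU; running them requires root."""
    if watts <= 0:
        raise ValueError("watts must be positive")
    return [["nvidia-smi", "-i", str(i), "-pl", str(watts)] for i in gpu_indices]


# Print the commands instead of running them, so this is safe on any machine.
for argv in power_limit_commands([0, 1], 220):
    print(" ".join(argv))
```

Each list can be passed to `subprocess.run(argv, check=True)` once you are happy with the values.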

by cybertim

6/13/2026 at 7:09:48 PM

CNS means Chipset not supported and I doubt it is the case, are you sure you are using the patched nvidia module? modinfo nvidia to check which one is loaded

by iMil

6/13/2026 at 7:42:29 PM

I'm using bazzite on my ai-rig just because it has the gpu-optimized things setup (also nvidia-open). Looking at P2P seems to be available only for 90-versions of the nvidia rtx gpu line, not 80, and some versions of 50xx? (apparently the 5080?). Anyways, i downloaded that uncensored model and tweaked those kv settings etc. still getting 60-80tk/s but im able to get my context on 180224 now, used to be 131072 which gave me some trouble, this is already a win :)

by cybertim

6/13/2026 at 3:07:37 PM

I would have liked to see a bit more on the theory side of things, explaining optimal weight and inference splits, actual issues with existing drivers, etc instead of what’s essentially just a recipe.

by ComputerGuru

6/13/2026 at 4:04:26 PM

Agreed. To put this in perspective, batch 1 token decode is bandwidth limited in theory.

Memory bandwidth of RTX 3090 is listed as 936GB/s. The post isn't fully clear on which model they used and how big it is, but even assuming it perfectly filled the 24GB of that GPU, 30tok/s means the achieved bandwidth is only 720GB/s. There's a bunch of room for improvement here even without MTP, and those improvements should largely stack with MTP.
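
The arithmetic, as a rough sketch: it assumes a dense model, batch size 1, no speculative decoding, and that every weight is read exactly once per decoded token (KV cache reads are ignored). The 24 GB, 30 tok/s and 936 GB/s figures are the ones used above; the function names are just for illustration.

```python
def achieved_bandwidth_gbs(model_size_gb, tokens_per_s):
    """Each decoded token reads all weights once, so GB/s = size * tok/s."""
    return model_size_gb * tokens_per_s


def max_tokens_per_s(model_size_gb, peak_bandwidth_gbs):
    """Upper bound on decode speed if memory bandwidth is the only limit."""
    return peak_bandwidth_gbs / model_size_gb


PEAK_3090_GBS = 936  # listed memory bandwidth of the RTX 3090

achieved = achieved_bandwidth_gbs(24, 30)      # 720 GB/s
ceiling = max_tokens_per_s(24, PEAK_3090_GBS)  # 39.0 tok/s
utilization = achieved / PEAK_3090_GBS         # about 0.77

print(achieved, ceiling, round(utilization, 2))
```

So under these assumptions roughly a quarter of the theoretical bandwidth is left on the table, and a smaller model file raises the ceiling proportionally.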

by atq2119

6/13/2026 at 3:18:22 PM

I've been using https://spark-arena.com/leaderboard to glean this kind of information for DGX Spark, a sort of recipe book. The Nvidia forum has people talking about the things you wish to know. I see some on Discord/Reddit/et al, but less cohesive

I've switched from using the spark as a way to run one model as best it can to running several support models for the md kb I'm working on

by verdverm

6/13/2026 at 5:44:51 PM

Would you mind giving these a try and let me know how they work for you? I’d imagine you would get better results and the latter will fit on a single GPU.

https://huggingface.co/easiest-ai-shawn/Qwen3.6-27B-ExCal-EX...

https://huggingface.co/easiest-ai-shawn/Qwen3.6-27B-ExCal-Mi...

Do be sure to use dflash and/or mtp for the draft:

https://huggingface.co/turboderp/Qwen3.6-27B-MTP-exl3

https://huggingface.co/turboderp/Qwen3.6-27B-DFlash-exl3

by skhameneh

6/13/2026 at 11:51:54 PM

It is absolutely mind blowing to see some of the responses here. Open source, run-your-own, pay for nothing, we’re-all-nerds-that-buy-the-hardware-anyways ethos seems basically dead.

I guess I’m getting old. I own two 16gb cards and I use them for models, for gpu-pasthru for gaming, 3d model rendering, etc. 14 year old me is mortified at this community.

by irishcoffee

6/14/2026 at 1:36:41 AM

Times are changing. The open-weight models have needed time to catch up, but they're finally at a point now where we can get almost frontier level capabilities for coding.

I just wish we had a way to actually benchmark them properly though. Still seems no one has solved the problem of software architecture, brittleness and bloat as the codebase grows. Models love to add stuff, but they rarely clean up as they go. In a perfect world they'd do both near equally as they're developing.

It would be nice if there was an "architecture quality" benchmark that distilled the essence of what it means to have a good architecture, but I suppose that's an open research question with a lot of variables? Like how is good architecture actually quantified and measured? Is there a mechanism that can be re-used across all codebases to clearly denote one that is good and one that is bad, or is it highly subjective and depend on the lens you're looking at it from? Is there a lot more to it than just "how much refactoring effort is required to extend this in the future?".

Surely this is something that has been well researched - yet I never really hear anything about it. Makes me wonder why.

by nullbio

6/14/2026 at 3:00:35 AM

> Surely this is something that has been well researched - yet I never really hear anything about it. Makes me wonder why.

Occam’s razor rings true here: where’s the money in it?

by irishcoffee

6/14/2026 at 4:25:38 AM

> 14 year old me is mortified at this community.

Same here. There has to be someplace like this that's managed to cultivate a better crowd, but I'll be darned if I can find it.

by CamperBob2

6/13/2026 at 3:34:37 PM

I can understand the joy of running things yourself, and can also see the privacy aspect. However, I pay ~3$ per 1M/tokens for that model on Openrouter, and it's not even quantized. A refurbished 3090 and a 5080 will set you back well over 2k, not to mention the electricity to run them...
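
A back-of-the-envelope sketch of that comparison. Only the ~$3 per 1M tokens and the 2k hardware floor come from this comment; the 600 W draw, 80 tok/s and 0.30 per kWh are illustrative assumptions (the first two are figures quoted elsewhere in the thread, the electricity price is made up), so plug in your own numbers.

```python
def electricity_cost_per_mtok(watts, tokens_per_s, price_per_kwh):
    """Electricity cost of generating 1M tokens locally."""
    hours = 1_000_000 / tokens_per_s / 3600
    return watts / 1000 * hours * price_per_kwh


def break_even_mtok(hardware_cost, cloud_price_per_mtok, local_cost_per_mtok):
    """Millions of tokens needed before the hardware pays for itself."""
    saving = cloud_price_per_mtok - local_cost_per_mtok
    if saving <= 0:
        return float("inf")  # local never catches up on cost alone
    return hardware_cost / saving


# Assumed: 600 W under load, 80 tok/s, 0.30 per kWh.
local = electricity_cost_per_mtok(600, 80, 0.30)
print(round(local, 3))                               # electricity per 1M tokens
print(round(break_even_mtok(2000, 3.0, local), 1))   # millions of tokens to break even
```

With these inputs the break-even is in the hundreds of millions of tokens, which ignores idle power, resale value, and any use of the cards for other things.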

by deng

6/13/2026 at 3:43:14 PM

> I pay ~3$ per 1M/tokens for that model on Openrouter

I think the thing is, there's an unspoken "for now" at the end of that sentence and people running this locally are hedging against that "for now". Some people prefer to feel that they own the means rather than rent the means, even if the one they own is worse than the one they can rent. Especially with today's Fable news and the harsh realisation that the "for now" is dependent on very many unpredictable factors, where the one you have locally costs you capital today and a relatively predictable run-rate (made more predictable with on-prem solar for example), but should otherwise work predictably forever.

I'm not saying that you're wrong to do what you're doing, just that many people have their own lines in the sand where renting vs buying makes sense, and it doesn't only boil down to a rational (or irrational) financial decision.

by redfloatplane

6/13/2026 at 4:17:20 PM

You're treating open weight inference providers the same as proprietary ones. They're fundamentally different business models. Proprietary companies have an incentive to subsidize actual inference and training costs in order to gain market share. The few dozen or so companies selling Qwen models by the token on openrouter are in a commodities market.

If suddenly the CCP declared a total digital embargo on Alibaba's Qwen models or even if for some reason all of mainland China (and Singapore) was completely unreachable from the rest of the world, the dozen or so companies selling Qwen by the token elsewhere in the world could continue business as usual.

by jubilanti

6/13/2026 at 6:36:35 PM

I don’t know anything about the open weight host business model. Do we know for certain that the folks selling inference by the token are really selling them in an upfront and profitable way? No subsidies from harvesting the info, to sell to the model trainers or anything like that?

by bee_rider

6/13/2026 at 10:06:59 PM

Or subsidies from hopeful investors sweet-talked into not understanding the commodity nature of the business they are investing in. But that does not change much about the general assessment.

Chances are the typical story goes founders start fully believing that they would succeed with their own innovation but slip down a gradient towards commodity provider without really noticing themselves.

by usrusr

6/13/2026 at 4:23:28 PM

I was thinking of user-side regulations as well, not only provider-side ones. I could imagine a world where a government rules that you may not use LLMs for anything, which would be much easier to get around if you have local means.

by redfloatplane

6/13/2026 at 5:51:48 PM

I've spent the past week trying to scheme a way to get affordable local inference of something useful (Qwen3.6-36B-A3B) for ~$500 and have come to the conclusion that it simply isn't viable. A pair of power-restricted P100s in a workstation gets close but the workstations themselves are expensive and rare as hen's teeth (not to mention loud and large). I think early '27 will be when things open up as the hardware market unclenches and further strides are made in small capable models.

by alexjplant

6/13/2026 at 11:33:30 PM

I'm running Qwen3.6-35B-A3B on a very ordinary desktop PC (32GB DDR5, 8GB Radeon 6600XT) and getting a useful 15-20 tok/sec out of it. The MoE architecture and auto offloading from system to VRAM is just fantastic. Unsloth Q4_K_XL.

The Qwen3.6-27B is unbearably slow as it doesn't fit in VRAM, though, i think the MoE is very easy to run.

It is also extremely nice that you can just `apt install llama.cpp libggml0-backend-vulkan` now too.

by mappu

6/14/2026 at 5:48:57 AM

I wonder what parent poster means with „useful” and what he actually tried? Feels like he was just comparing some benchmarks.

Yesterday I downloaded Gemma4-26B with Ollama on quite rusty desktop with 1070 8gb and 32gb of ram and Core i5-9400.

I drop photo of my water meter and tell it to read the value and serial number. It was far from instant but it was also easily under 3 minutes and result was correct.

Earlier like in February I was trying the same photo with Gemma3 on the same hardware and results were bad.

by ozim

6/13/2026 at 4:11:40 PM

An R9700 is $1350 and can get 100 TPS running Qwen3.6-35B-A3B Q5 with 130k context window (with room to spare) with a bit of fine tuning llamacpp-vulkan, but llamacpp's repository instability and lack of real versioning frustrates me.

In terms of electricity, if you aren't using it, even with all the vram loaded, at most you're wasting about 30 watts or so.

Prompt processing a large uncached context is annoying, which is why I forced a lower context window, but I don't know if it's any worse in performance than the cloud models I've used.

There's a niceness, to me, knowing I don't have to rent it anymore. If you rent it, the terms can change regularly.

by ThunderSizzle

6/13/2026 at 7:26:47 PM

"An R9700 is $1350 and can get 100 TPS running Qwen3.6-35B-A3B Q5 with 130k context window ..."

How would that change (improve) if you had two R9700 in a similar configuration ?

by rsync

6/13/2026 at 7:56:40 PM

better prompt processing like 1.5x+ and more kv but tg most likely lower like 0.8x or so but I am just going by memory for Qwen3.5 without mtp.

by vardalab

6/13/2026 at 4:46:07 PM

Qwen 27b is a compute heavy dense model.

by bertili

6/13/2026 at 4:48:28 PM

When they declare open models a 'security risk', his setup will be running, yours will not and even that 3090 will be way outside of your reach.

by PeterStuer

6/13/2026 at 4:13:06 PM

I use local models to explore, hosted models to refine. I somewhat envy those who can sustain local models (q8 120b+) running as a hobby.... for me, the practical path is a better SearXNG setup and knowing my routes forward.

by medfield

6/13/2026 at 6:34:01 PM

I think it's important to be able to do both so you can stay in control of the price to value created relationship.

Last year, some people were publishing about aider/ollama/openrouter [1], and now thankfully people are publishing all around about pi/qwen/llama.cpp/openrouter. It's widespread.

[1] https://alexhans.github.io/posts/aider-with-open-router.html

by alexhans

6/14/2026 at 8:16:46 AM

You also aren't limited to LLMS. Vision, whisper, etc. You can even have claude farm out tasks to your local servers.

by sixothree

6/13/2026 at 3:38:55 PM

It’s a personal hobby project. Why should we care how someone chooses to spend their free time and money? Lots of hobbies are expensive and pointless if you think of commercially available offerings. That’s why it’s a hobby and not a small business

by TSiege

6/13/2026 at 4:04:18 PM

Yeah but they can also be used to play games and do other stuff.

by toyg

6/13/2026 at 5:39:26 PM

You are paying with your privacy ...

by amelius

6/13/2026 at 7:08:38 PM

> not to mention the electricity to run them...

And noise.

by pier25

6/13/2026 at 3:56:01 PM

Rtx 3090 24 gb set me back 390€ a year ago ( 2nd hand)

by NicoJuicy

6/13/2026 at 4:10:57 PM

Was it still in good condition? That price makes me wonder if it was used for crypto mining, which can wear down the hardware.

by rirze

6/13/2026 at 4:22:31 PM

Any sane crypto miner undervolted and underclocked their GPUs for efficiency's sake; if anything, they went through less wear than, say, regular gaming.

by gsora

6/13/2026 at 3:50:58 PM

Openrouter doesn't give you access to the models internals, i.e. complete control of logprobs, sampler stack, any PeFTs.

Openrouter fking sucks and I don't know why people here act like it's so great. Stop using it if you care about local AI and accept that the cost you'll pay for tokens is higher than you will when consumed via any cloud. That's the price for privacy, control, and better quality via inference time optimizations that otherwise aren't available.

by Der_Einzige

6/13/2026 at 4:24:28 PM

> Openrouter doesn't give you access to the models internals, i.e. complete control of logprobs, sampler stack, any PeFTs.

Openrouter gives you access to whatever the inference provider gives. They're just the middleman. Many providers give logprobs if you ask, it's in their API. And yeah, no Peft or Lora, but that's an entirely different product. And some of the inference providers do that directly.

> Openrouter fking sucks and I don't know why people here act like it's so great. Stop using it if you care about local AI

But the whole point of openrouter is that you can run models by the token and you don't have to care about local AI? Sounds like you're more upset that people aren't making the same calculation on privacy and local control vs cost and ease of use.

by jubilanti

6/14/2026 at 5:55:04 AM

on 2x 4090:

90 t/s for 27B Q8 256k context

260 t/s for 35B-A3B Q8 256k context

by mirekrusin

6/13/2026 at 11:15:52 PM

I tried implementing qwen through openrouter and deepinfra. Even without thinking, I had to wait 60s+ for the full result, where haiku or flash would be done in 5 or 6 seconds.

by neals

6/13/2026 at 5:44:50 PM

If I had an eGPU right now, I'd 100% be using Qwen

by tonyrice

6/13/2026 at 5:23:18 PM

It does come with one tiny little issue: it now draws 700W on full load. Just a single 5080 is enough to measurably heat up a room when loaded (320W draw at the wall on mine), and with that amount of power flowing through, you better have a good PSU as well as checking your power plugs themselves, these are going to get HOT when your entire setup is basically drawing 1kW.

by well_ackshually

6/13/2026 at 6:13:00 PM

I am actually surprised with the power draw, the box itself idles at 20W, which already amazes me for a Ryzen; when computing, I barely pass the 600W bar, and as I am not really using it to vibecode an entire system, I don't even notice the spikes on the power monitor (Shelly + homeassistant).

by iMil

6/13/2026 at 10:50:21 PM

I've got a 4090 and 3090 in a node that peaks at 600W.

If you're not power limiting in nvidia-smi, start.

by washadjeffmad

6/13/2026 at 4:29:19 PM

Could 2x RTX5080 work just as well?

by varispeed

6/13/2026 at 5:06:54 PM

2xRTX5080 would be awesome. You'd only be able to run a q6, which is already pretty good, but moreover you'd be able to use P2P and use Blackwell full speed, which I can't.

by iMil

6/14/2026 at 1:56:12 AM

With 2 Blackwells, would make sense to run NVFP4 quants

by kcb

6/13/2026 at 3:46:33 PM

Which "good quality PCIe 4 riser" did you buy?

by atlgator

6/13/2026 at 4:12:36 PM

This one: https://es.aliexpress.com/item/1005010123289822.html?spm=a2g...

by iMil
