3/8/2026 at 6:03:21 AM
Running Qwen3.5 9B on my ASUS 5070 Ti 16G with LM Studio gives a stable ~100 tok/s. This outperforms the majority of online LLM services, and the actual quality of output matches the benchmarks. This model is really something: it's the first time I've ever had a usable model on consumer-grade hardware.
by moqizhengz
3/8/2026 at 9:15:01 AM
> This outperforms the majority of online llm services

I assume you mean outperforms in speed on the same model, not in usability compared to other more capable models.
(For those who are getting their hopes up on using local LLMs to be any replacement for Sonnet or Opus.)
by smokel
3/8/2026 at 10:04:14 AM
Obviously it's not going to match a paid-tier 2T-sized SOTA model in quality, but it can probably roughly match Haiku at the very least. And for tasks that aren't super complex, that's already enough.
Personally though, I find Qwen useless for anything but coding tasks because of its insufferable sycophancy. It's like 4o dialed up to 20: every reply starts with "You are absolutely right" with zero self-awareness. And for coding, only the best model available is usually sensible to use, otherwise it's just wasted time.
by moffkalast
3/8/2026 at 10:13:20 AM
That's why I start any prompt to Qwen 3.5 with:
persona: brief rude senior
by Anduia
3/8/2026 at 1:16:08 PM
I'm using:
persona: drunken sailor
Because then at least the tone matches the quality of the output and I'm reminded of what I can expect.
by amelius
3/8/2026 at 4:39:39 PM
Does it tend to break out into sea shanties?
by dlcarrier
3/8/2026 at 6:30:31 PM
Yo, ho, ho, and a bottle of rum.
by drob518
3/8/2026 at 7:45:47 PM
https://www.youtube.com/watch?v=C_k8wYuk8PQ
by yunnpp
3/8/2026 at 8:50:35 PM
But then what do you do with it early in the morning?
by moffkalast
3/8/2026 at 9:10:25 PM
For starters, shave his belly with a rusty razor, obviously ;)
by amelius
3/8/2026 at 1:31:06 PM
This also works:
persona: emotionless vulcan
by em500
3/8/2026 at 4:42:55 PM
Does "persona: air traffic controller" work?
If I could set up a voice assistant that actually verifies commands, instead of assuming it heard everything correctly 100% of the time, it might even be useful.
by dlcarrier
3/8/2026 at 3:47:33 PM
persona: fair witness
by 9wzYQbTYsAIc
3/8/2026 at 8:19:54 PM
You just paste in that YAML? Is this an official LLM config format that is parsed out?
by Chris2048
3/9/2026 at 4:36:35 AM
Wow, I had no idea you could do that. This changes everything for me.
by ranger_danger
3/8/2026 at 2:59:10 PM
persona: party delegate in a rural province who doesn't want to be there
by varispeed
3/8/2026 at 11:33:30 AM
gamechanger
by lemonginger
3/8/2026 at 2:11:52 PM
> I find Qwen useless for anything but coding tasks because of its insufferable sycophancy

We've been using Qwen at work since 2.0 for text/image/video analysis (summarization, categorization, NER, etc.), and I think it's impressive. We ask for JSON and always add "do not explain your response".
by ggregoire
3/8/2026 at 4:56:37 PM
You can replace Sonnet and Opus with local models, you just need to run the larger ones.
by segmondy
3/8/2026 at 10:46:52 AM
What context length and related performance are you getting out of this setup?
At least 100k context without huge degradation is important for coding tasks. Most "I'm running this locally" reports only cover testing with very small contexts.
by the_duke
3/8/2026 at 4:58:25 PM
Long-context degradation is a problem with the Qwen3.5 models for me. They have some clever tricks to accelerate attention that favor more recent context.
The models can be frustrating to use if you expect long contexts to behave like they do on SOTA models. In my trials I could give them strict instructions NOT to do something, and they would follow them for a short time before ignoring my prompt and doing the things I told them not to do.
by Aurornis
3/8/2026 at 1:53:07 PM
Q4 quants on 32G VRAM give you 131K context for the 35B-A3B and 27B models, which are pretty capable. On a 5090 one gets 175 tok/s TG and ~7K tok/s PP with 35B-A3B; 27B is around 90 tok/s TG. So speed is awesome. Even a Strix 395 gives 40 tok/s and 256K context. Pretty amazing; there is a reason people are excited about Qwen 3.5.
by vardalab
3/8/2026 at 7:02:01 AM
There are Qwen3.5 27B quants in the range of 4 bits per weight, which fit into 16G of VRAM. The quality is comparable to Sonnet 4.0 from summer 2025. Inference speed is very good with ik_llama.cpp, and still decent with mainline llama.cpp.
by throwdbaaway
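A quick back-of-the-envelope check of the "4 bpw fits in 16G" claim: weight size in bytes is roughly parameter count times bits-per-weight divided by 8. A minimal sketch (the bpw values are rough assumptions; real GGUF files add quantization scales, embeddings and metadata on top of this estimate):

```python
# Rough VRAM estimate for quantized model weights (weights only, no KV cache).
# The bits-per-weight figures below are illustrative assumptions, not
# official numbers for any particular release.

def model_size_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate in-VRAM size of the weights in GiB."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 2**30

# "4-bit" quant formats typically average a bit more than 4 bpw
# once per-block scales are counted, so assume ~4.5 bpw.
size_27b_q4 = model_size_gib(27, 4.5)
size_9b_q8 = model_size_gib(9, 8.5)

print(f"27B @ ~4.5 bpw: {size_27b_q4:.1f} GiB")  # ~14.1 GiB, tight on a 16G card
print(f"9B  @ ~8.5 bpw: {size_9b_q8:.1f} GiB")
```

The 27B at ~4.5 bpw lands around 14 GiB, which is why it squeezes into 16G only with little room left for context.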
3/8/2026 at 7:54:55 AM
Can someone explain how a 27B model (quantized, no less) can ever be comparable to a model like Sonnet 4.0, which is likely in the mid to high hundreds of billions of parameters?
Is it really just more training data? I doubt it's architecture improvements, or at the very least, I imagine any architecture improvements are marginal.
by codemog
3/8/2026 at 3:15:32 PM
AFAIK post-training and distillation techniques have advanced a lot in the past couple of years. SOTA big models reach a new frontier, and within 6 months it trickles down to open models with 10x fewer parameters.
And mind that the source pre-training data was not made/written for training LLMs; it's just random stuff from the Internet, books, etc. So there's a LOT of completely useless and contradictory information. Better training texts work way better, and you can just generate & curate them from those huge frontier LLMs. This was shown in the TinyStories paper, where GPT-4-generated children's stories could make models 3 orders of magnitude smaller achieve quite a lot.
This is why the big US labs complain that China is "stealing" their work by distilling their models. Chinese labs save many billions in training with just a bunch of accounts. (I'm just stating what they say, not giving my opinion.)
by alecco
3/8/2026 at 8:10:36 AM
There are diminishing returns, bigly, when you increase parameter count. The sweet spot isn't in the "hundreds of billions" range; it's much lower than that.
Anyways, your perception of a model's "quality" is determined by careful post-training.
by otabdeveloper4
3/8/2026 at 8:51:08 AM
Interesting. I see papers where researchers will finetune models in the 7 to 12B range and even beat or be competitive with frontier models. I wish I knew how this was possible, or had more intuition on such things. If anyone has paper recommendations, I'd appreciate it.
by codemog
3/8/2026 at 10:08:02 AM
They're using a revolutionary new method called "training on the test set".
by stavros
3/8/2026 at 6:34:24 PM
So, curve fitting the training data? So we should expect out-of-sample accuracy to be crap?
by drob518
3/8/2026 at 6:47:06 PM
Yeah, that's usually what tends to happen with those tiny models that are amazing in benchmarks.
by stavros
3/8/2026 at 8:23:47 AM
More parameters improve general knowledge a lot, but you have to quantize more in order to fit in a given amount of memory, which, if taken to extremes, leads to erratic behavior. For casual chat use even Q2 models can be compelling; agentic use requires more regularization, thus less-quantized parameters, and lowering the total parameter count to compensate.
by zozbot234
3/8/2026 at 9:24:36 AM
The short answer is that there are more things that matter than parameter count, and we are probably nowhere near the most efficient way to make these models. Also: the big AI labs have shown a few times that internally they have way more capable models.
by spwa4
3/8/2026 at 9:39:37 AM
Considering the full-fat Qwen3.5-plus is good, but barely Sonnet 4 good in my testing (though incredibly cheap!), I doubt the quantised versions are somehow as good, if not better, in practice.
by girvo
3/8/2026 at 10:31:41 AM
I think it depends on work pattern. Many do not give Sonnet or even Opus the full rein where it really pushes ahead of other models.
If you're asking for tightly constrained single functions at a time, it really doesn't make a huge difference.
I.e. the more you vibe-code, the better a model you need, especially over long-running tasks and large contexts. Claude is head and shoulders above everyone else in that setting.
by rustyhancock
3/8/2026 at 10:44:29 AM
> I.e. the more vibe you do the better you need the model especially over long running and large contexts

For sure, but the coolest thing about qwen3.5-plus is the 1-million context length on a $3 coding plan, super neat. But I've found the model isn't really powerful enough to take real advantage of it. Still super neat though!
by girvo
3/8/2026 at 10:07:00 AM
When you say Sonnet 4, do you mean literally 4, or 4.6?
by stavros
3/8/2026 at 10:42:59 AM
It's not as capable as Sonnet 4.6 in my usage over the past couple of days, through a few different coding harnesses (including my own just-for-play one[0]; that's been quite fun).
by girvo
3/8/2026 at 12:35:40 PM
What is the benefit of writing your own harness? I am asking because I need to get better at using AI for programming. I have used Cursor, Gemini CLI, and Antigravity quite a bit and have had a lot of difficulties getting them to do what I want. They just tend to "know better."
by dr_kiszonka
3/8/2026 at 3:35:16 PM
I'm not an expert, but I started with smaller tasks to get a feel for how to phrase things and what I need to include. It's more manageable to manually fix things it screwed up than to give it full rein.
You may want to look at the AGENTS.md file too, so you can include your stock style guidance if it's repeatedly screwing up in the same way.
by everforward
by everforward
3/8/2026 at 8:47:14 PM
Purely as an exercise to see how they operate, and to understand them better. Then additionally because I was curious how much better one could make something like qwen3.5-plus, with its 1-million context window, despite its weaker base behaviour, if I was to give it something very focused on what I want from it.
The Pi framework is probably right up your alley, btw! Very extensible.
by girvo
3/8/2026 at 1:28:09 PM
I think it's the same instinct as making your own game engine. You start off either because you want to learn how they work or because you think your game is special and needs its own engine. Usually, it's a combination of both.
by newswasboring
3/8/2026 at 9:23:13 AM
It doesn't. I'm not sure it outperforms ChatGPT 3.
by revolvingthrow
3/8/2026 at 10:22:03 AM
You are not being serious, are you? Even 1.5-year-old Mistral and Meta models outperform ChatGPT 3.
by BoredomIsFun
3/8/2026 at 10:15:14 AM
3, not 3.5? I think I would even prefer qwen3.5 0.8b over GPT-3.
by gunalx
3/8/2026 at 8:04:13 AM
With MoE models, if the complete weights minus the inactive experts almost fit in RAM, you can set up mmap use and the rest will be streamed from disk when needed. There's obviously a slowdown, but it is quite gradual, and even less relevant if you use fast storage.
by zozbot234
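For illustration, the graceful degradation comes from ordinary OS demand paging: a memory-mapped file only occupies RAM for the pages you actually touch, so rarely-activated experts cost little until they're used. (In llama.cpp, mmap is the default and `--no-mmap` turns it off.) A tiny Python sketch of the mechanism, with a throwaway sparse file standing in for a weights file:

```python
import mmap
import os
import tempfile

# Stand-in for a large weights file; a real MoE GGUF would be tens of GB.
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(path, "wb") as f:
    f.truncate(64 * 1024 * 1024)  # 64 MiB sparse file, no data written

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Nothing is resident yet. Reading a slice faults in only those pages,
    # which is why untouched experts don't consume RAM.
    offset = 32 * 1024 * 1024
    chunk = mm[offset : offset + 16]
    mm.close()

print(len(chunk))  # 16 bytes paged in on demand
```

The same principle means eviction is also cheap: clean pages backed by the file can simply be dropped under memory pressure and re-read later.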
3/8/2026 at 8:13:16 PM
Any good packages you'd recommend for this?
by htrp
3/8/2026 at 11:38:12 AM
Say more please, if you can. How/why is ik_llama.cpp faster than mainline for the 27B dense? I'd like to be able to run the 27B dense faster on a 24GB VRAM GPU, and also on an M2 Max.
by ljosifov
3/8/2026 at 3:37:58 PM
ik_llama.cpp was about 2x faster for CPU inference of Qwen3.5 versus mainline until yesterday. Mainline landed a PR that greatly increased speed for Qwen3.5, so now ik_llama.cpp is only 10% faster on token generation.
by ac29
3/8/2026 at 7:41:32 AM
Qwen3.5 35B A3B is much, much faster and fits if you get a 3-bit version. How fast are you getting 27B to run?
On my M3 Air w/ 24GB of memory, 27B is 2 tok/s but 35B A3B is 14-22 tok/s, which is actually usable.
by teaearlgraycold
3/8/2026 at 9:01:04 AM
Using ik_llama.cpp to run a 27B 4bpw quant on an RTX 3090, I get 1312 tok/s PP and 40.7 tok/s TG at zero context, dropping to 1009 tok/s PP and 36.2 tok/s TG at 40960 context.
35B A3B is faster but didn't do too well in my limited testing.
by throwdbaaway
3/9/2026 at 5:03:31 AM
With regular llama.cpp on a 3070 Ti I get 60 tok/s TG with the 9B model; it's quite impressive.
by ranger_danger
3/9/2026 at 4:58:13 AM
Don't sleep on the 9B version either; I get much faster speeds and can't tell any difference in quality. On my 3070 Ti I get ~60 tok/s with it, and half that with the 35B-A3B.
by ranger_danger
3/8/2026 at 8:03:45 AM
The 27B is rated slightly higher on SWE-bench.
by ece
3/8/2026 at 6:59:28 AM
What exact model are you using?
I have a 16GB GPU as well, but have never run a local model so far. According to the table in the article, 9B at 8-bit (-> 13 GB) and 27B at 3-bit seem to fit inside the memory. Or is there more space required for context etc.?
by lukan
3/8/2026 at 8:17:20 AM
It depends on the task, but you generally want some context. These models can do things like OCR and summarize a PDF for you, which takes a bit of working memory. Even more so for coding CLIs like opencode-ai, qwen code and mistral ai.
Inference engines like llama.cpp will offload model and context to system RAM for you, at the cost of performance. A MoE like 35B-A3B might serve you better than the ones mentioned, even if it doesn't fit entirely on the GPU. I suggest testing all three. Perhaps even 122-A10B if you have plenty of system RAM.
Q4 is a common baseline for simple tasks on local models. I like to step up to Q5/Q6 for anything involving tool use on the smallish models I can run (9B and 35B-A3B).
Larger models tolerate lower quants better than small ones: 27B might be usable at 3 bpw where 9B or 4B wouldn't. You can also quantize the context. On llama.cpp you'd set the flags -fa on, -ctk x and -ctv y; run with -h to see valid parameters. K is more sensitive to quantization than V, so don't bother lowering it past q8_0. KV quantization is allegedly broken for Qwen 3.5 right now, but I can't tell.
by vasquez
by vasquez
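To make the context-memory point concrete: the KV cache grows linearly with context length, since each layer stores a K and a V vector per KV head for every token, and the `-ctk`/`-ctv` cache types change the bytes per element. A rough estimator (the layer/head/dimension numbers below are placeholders for a mid-size dense model, not the actual Qwen3.5 config):

```python
# Rough KV-cache memory estimate, showing why long context costs real VRAM.
# Architecture numbers are placeholders, not any specific model's config.

def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 ctx_len: int, bytes_per_elem: float) -> float:
    # K and V each store n_kv_heads * head_dim values per layer per token.
    values_per_token = 2 * n_layers * n_kv_heads * head_dim
    return values_per_token * ctx_len * bytes_per_elem / 2**30

# f16 cache (2 bytes/elem) vs a ~q8_0 cache (~1 byte/elem) at 32k context:
f16 = kv_cache_gib(48, 8, 128, 32768, 2)
q8 = kv_cache_gib(48, 8, 128, 32768, 1)
print(f"f16: {f16:.1f} GiB, q8_0: {q8:.1f} GiB")
```

With these placeholder numbers an f16 cache at 32k context already costs several GiB, which is why quantizing the cache (or shortening context) matters so much on 16G cards.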
3/8/2026 at 1:09:44 PM
Did you figure out how to fix thinking mode? I had to turn it off completely as it went on forever, and I tried to fix it with different parameters without success.
by jadbox
3/8/2026 at 4:02:47 PM
Thinking has definitely become a bit more convoluted in this model. I gave it the prompt "hey" and it thought for about two minutes straight before giving a bog-standard "hello, how can I help" reply.
by sammyteee
3/9/2026 at 7:55:28 AM
Did you try with the recommended settings? The ones for thinking mode / general tasks really worked for me, especially the repetition_penalty. At first it wasn't working very well, and it was because I was using OpenWebUI's "Repeat Penalty" field, and that didn't work; I needed to set a custom field with the exact name.
by agile-gift0262
3/9/2026 at 12:26:22 AM
Supposedly you can turn it off by passing `\no_think` or `/no_think` into the prompt, but that never worked for me.
What did work was adding this JSON to the request body:
{ "chat_template_kwargs": {"enable_thinking": false} }
[0] https://github.com/QwenLM/Qwen3/discussions/1300
by andrekandre
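For reference, that override is just an extra top-level field on an OpenAI-style /v1/chat/completions request body; servers that forward `chat_template_kwargs` into the tokenizer's chat template (vLLM does, support elsewhere varies) then render the prompt with thinking disabled. A minimal sketch of building such a payload (model name and endpoint URL are placeholders):

```python
import json

payload = {
    "model": "qwen3.5",  # placeholder model name
    "messages": [{"role": "user", "content": "hey"}],
    # Forwarded into the chat template by servers that support it;
    # ignored (or rejected) by servers that don't.
    "chat_template_kwargs": {"enable_thinking": False},
}

body = json.dumps(payload)
print(body)
# POST this to e.g. http://localhost:8000/v1/chat/completions
# with Content-Type: application/json.
```

Note that Python's `False` serializes to JSON `false`, matching the snippet above.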
3/8/2026 at 6:22:48 AM
Do you point Claude Code to this? The orchestration seems to be very important.
by yangikan
3/8/2026 at 10:42:45 AM
I ran Qwen3 Coder 30B through LM Studio and with OpenCode (instead of Claude Code). Did decent on an M4 Max 32GB. https://www.tommyjepsen.com/blog/run-llm-locally-for-coding
by tommyjepsen
3/8/2026 at 4:36:01 PM
The 9B models are not useful for coding outside of very simple requests.
Qwen3.5 is confusing a lot of newcomers because it is very confident in the answers it gives. It can also regurgitate solutions to common test requests like "make a flappy bird clone", which misleads users into thinking it's genuinely smart.
Using the Qwen3.5 models for longer tasks and inspecting the output is a little more disappointing. They're cool for something I can run locally, but I don't agree with all of the claims about being Sonnet-level quality (including previous Sonnet versions) in my experience with the larger models. The 9B model is not going to be close to Sonnet in any way.
by Aurornis
3/8/2026 at 1:44:14 PM
I use Claude Code for agentic coding, but it is better to use qwen3-coder in that case. qwen3-coder is better for code generation and editing, strong at multi-file agentic tasks, and purpose-built for coding workflows.
In contrast, qwen3.5 is more capable at general reasoning, better at planning and architecture decisions, and a good balance of coding and thinking.
by andsoitis
3/8/2026 at 9:13:22 AM
I've tried it with Claude Code; found it to be fairly crap. It got stuck in a loop doing the wrong thing and would not be talked out of it. It would find a bug that stopped the code compiling right after compiling it, that sort of thing.
Also, it seemed to ignore fairly simple instructions in CLAUDE.md about building and running tests.
by badgersnake
by badgersnake
3/8/2026 at 7:50:55 AM
I loaded Qwen into LM Studio and then ran Oh My Pi. It automatically picked up the LM Studio API server. For some reason the 35B A3B model had issues with Oh My Pi's ability to pass a thinking parameter, which caused it to crash. 27B did not have that issue for me, but it's much slower.
Here's how I got the 35B model to work: https://gist.github.com/danthedaniel/c1542c65469fb1caafabe13...
The 35B model is still pretty slow on my machine but it's cool to see it working.
by teaearlgraycold
3/8/2026 at 6:01:23 PM
These smaller models are fine for Q&A-type stuff but are basically unusable for anything agentic like large file modifications, coding, or second-brain type stuff; they need so much handholding. I'd be interested to see a demo of what the larger versions can do on better hardware though.
by bluerooibos
3/8/2026 at 7:56:28 PM
Qwen3.5 27B works very well, to the point that if you spend money on Claude 4.5 Haiku you could save hundreds of USD each day by running it yourself on a consumer GPU at home.
by NorwegianDude
3/8/2026 at 6:44:33 PM
In some ways the handholding is the point. The way I used qwen2.5-coder in the past was as a rubber duck that happens to be able to type. You have to be in the loop with it; it's just a different style of agent use to what you might do with Copilot or Claude.
by regularfry
3/8/2026 at 7:41:04 PM
> consumer-grade hardware

Not disagreeing per se, but a quick look at the installation instructions confirms what I assumed:
Yeah, you can run a highly quantized version on your 2020 Nvidia GPU. But:
- When inferencing, it occupies your whole machine. At least you have a modern interactive heating feature in your flat.
- You need to follow eleven thousand nerdy steps to get it running; my mum is really looking forward to that.
- Not to mention the pain you went through installing Nvidia drivers; nothing my mum will ever manage in the near future.
... and all this to get something that merely competes with Haiku.
Don't get me wrong: I am exaggerating, I know. It's important to have competition and the opportunity to run "AI" on your own metal. But this reminds me of the early days of smartphones and my old XDA Neo. Sure, it was damn smart, and I remember all those jealous faces because of my "device from the future." But oh boy, it was also a PITA to maintain.
Here we are now. Running AI locally is a sneak peek into the future. But as long as you need a CS degree and hardware worth a small car to achieve reasonable results, it's far from mainstream. Therefore, "consumer-grade hardware" sounds like a euphemism here.
I like how we nerds are living in our bubble celebrating this stuff while 99% of mankind still doomscrolls through Facebook, laughing at (now AI-generated) brain rot.
(No offense (ʘ‿ʘ)╯)
by y42