6/13/2026 at 3:56:27 PM
That's almost exactly my setup and I'm very happy with its performance.
I noticed recently that I started to prefer my local Qwen3.6 35B A3B and pi agent over Claude Code.
Both fail at different tasks, and Qwen more so than Claude.
But the way Qwen fails is much more straightforward. In writing tasks, Qwen's hallucinations and bullshitting are much easier to spot because it doesn't have the sleek vocabulary and wordsmithing skills to disguise its ignorance.
In coding tasks that Qwen can't solve, it often just goes into a tool-calling doom loop that the pi harness can catch, whereas Claude attempts ever more convoluted and creative things, just making more and more of a mess that takes forever to clean up.
I think part of the story is that the tasks for which I use AI are fairly simple and maybe don't need a frontier model. But I wonder if "proper" developers have had a similar experience?
by sieste
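For readers wondering what "a doom loop that the harness can catch" might look like in practice, here is a minimal sketch. It is not how pi actually does it; the class name `DoomLoopDetector`, the window size, and the repeat threshold are all assumptions made for illustration. The idea is simply to flag an agent that keeps issuing the same tool call with the same arguments.

```python
import json
from collections import deque


class DoomLoopDetector:
    """Hypothetical helper: flags a tool call that repeats too often recently."""

    def __init__(self, window: int = 8, max_repeats: int = 3):
        # Only the last `window` calls are remembered.
        self.recent = deque(maxlen=window)
        self.max_repeats = max_repeats

    def record(self, tool_name: str, arguments: dict) -> bool:
        """Record one tool call; return True if it looks like a loop."""
        # Canonicalise the arguments so key order does not matter.
        key = (tool_name, json.dumps(arguments, sort_keys=True))
        self.recent.append(key)
        return self.recent.count(key) >= self.max_repeats
```

A harness could call `record` before executing each tool call and abort, or ask the user for guidance, once it returns True.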
6/14/2026 at 8:57:16 AM
Frontier models are still better (everyone would use them if they were cheap). Open source models are capable even on non-"simple" problems, but I trust them less, even though I usually write plans for all changes, and they are worse at debugging. I recently converted my homelab to NixOS and let's just say DeepSeek failed and Fable did great (the night before getting killed).
by iamanllm
6/14/2026 at 9:34:07 AM
While what you say is true in general, every model that followed Opus 4.6 on the Anthropic side has been increasingly worse at what the previous user points out: they are extremely smart and can convince the user of major falsehoods.
They are trained/reinforced far too much on solving problems rather than assisting you, and assisting is something they have become extremely bad at.
It's hard to explain because I too have had many moments where "Fable5 / Opus4.8 xhigh could solve bugs/stuff that previous models couldn't". I know that to be true, and they are very useful for that.
But 90% of my tasks are quite mundane, and I need thorough investigation and a proper assistant, not a smart bullshitter fixated on solving the issue itself. On that front, Opus 4.6 was the last good model.
Anything after that is completely skewed towards passing benchmarks and E2E tasks, but definitely not assisting.
Fable in particular was a disaster on that front: relentlessly thorough about whatever fix it had fixated on, writing thousands of experiments in /tmp, etc. Great model, not gonna lie, but only if your focus is vibe coding and you accept that you're nothing but an assistant and accept its shortcomings.
by epolanski
6/13/2026 at 4:51:01 PM
I keep finding more and more use cases for Q3.6 27b (same league), and the best performance comes when the answer to my question is already in the context.
The moment I try something open-ended or ambitious, Claude/ChatGPT clearly get you to the goal quicker.
For things where there's a way to build a knowledge base, though, the local LLM can definitely be a true contender. Plus, with a big context and no worries about filling it over and over, you can get quite far.
I'm writing this literally in between cooking a pasta for which the local LLM ordered the products online. I've built a grocery shopping skill, so it roughly knows what I have in the fridge (loosely), my last 10 representative orders (general preferences plus rich info about shops and SKUs around me) and actual real-time in-stock info. That last part has been my personal pet peeve with every product that promised cooking ingredient delivery (that is not packaged specifically for that).
This is what has been promised to us by every big tech company with an agent, and now a local LLM has actually solved it for me fully.
by eurekin
6/13/2026 at 11:21:32 PM
I keep playing around with this exact concept. While I don’t always trust an entirely AI-generated recipe, more traditional setups are super rigid when it comes to ingredients.
by matthewfcarlson
6/14/2026 at 8:08:37 AM
I kept getting recipes with "that one ingredient", which was either a major PITA to source or produced too much waste, even from a real-world dietician consultation. For example: use 1/4 of a pumpkin for something. Those were good recipes in terms of macronutrient composition, but they don't work long term due to logistics.
I'm years past needing that strict diet, but the itch to fix or ease some parts of the process stayed.
by eurekin
6/14/2026 at 4:25:34 AM
> the local llm ordered products for me online
Do you mean by commanding a browser? Or using APIs?
by ed_mercer
6/14/2026 at 8:06:13 AM
Chrome driven by the OS accessibility API
by eurekin
6/14/2026 at 1:22:55 AM
I know the big labs like to pretend that their models are trillion parameter. But how likely is that really to be the case when Qwen 3.6 35B A3B gets so close to their performance? Seems that with the best research applied, best training data, they'd be able to top the charts with a 60B model quite easily.
by nullbio
6/14/2026 at 8:27:02 AM
Qwen 35B isn't even remotely close to the big models. It's just people overhyping small models. Ignore the benchmarks; they are almost meaningless.
If you want something comparable, you need the trillion-parameter open models like DeepSeek.
by redox99
6/14/2026 at 1:28:11 AM
They want people to believe they have massive models; that is effectively their moat at this point.
Because if they don't imply that size is needed for every task, they'll end up tanking their valuations.
by MisterKent
6/13/2026 at 8:54:02 PM
Not having a lot of experience with this, I ask a naive question: is there a world where you can take your local LLM and hook it up to Claude and get more Claude-like results from your local model? Obviously, there are going to be material differences in how these perform, but are we getting close to a place where this is viable? I imagine that the answers are a combination of “not yet” and “yes but it’s a lot slower” and “yes but there is actually little point to doing this because ‘what Claude gets you’ is highly baked into Anthropic’s models and that’s part of what you’re paying for.”
by hamburglar
6/13/2026 at 9:54:03 PM
I have a "task router": a small local LLM on my Mac mini (Qwen 3.5 0.8B) that I use with Pi to decide (when activated) whether to route a given task to my local LLM (Step 3.7 Flash) or to <given cloud provider>, if that counts? It works surprisingly well, really. Though some of the cloud providers are getting so good and so cheap (GLM 5.1/5.2, MiniMax M3, among others) that the need to use my local one becomes less and less relevant, depressingly!
by girvo
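The routing idea described above can be sketched roughly as follows. This is not girvo's actual setup: `Router`, `keyword_classifier`, and the `"local"`/`"cloud"` labels are hypothetical names, and the keyword heuristic is only a stand-in for the small router model so that the example runs without any LLM. The one design point it illustrates is falling back to a default when the classifier returns something unexpected, which small models tend to do.

```python
from dataclasses import dataclass
from typing import Callable

LOCAL = "local"
CLOUD = "cloud"


@dataclass
class Router:
    # Hypothetical hook: in a real setup this would wrap the small router LLM
    # and return its raw text answer.
    classify: Callable[[str], str]
    default: str = CLOUD

    def route(self, task: str) -> str:
        label = self.classify(task).strip().lower()
        # Anything other than an exact known label falls back to the default.
        return label if label in (LOCAL, CLOUD) else self.default


def keyword_classifier(task: str) -> str:
    """Stand-in for the small LLM: a crude keyword heuristic (assumption)."""
    hard_markers = ("architecture", "debug", "refactor")
    return CLOUD if any(m in task.lower() for m in hard_markers) else LOCAL
```

Usage would look like `Router(keyword_classifier).route("rename a variable")`, with the returned label deciding which backend receives the task.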
6/14/2026 at 4:47:35 AM
You can use ollama as the backend for Claude Code!
ollama launch claude --model
I would characterize it as doable, but not really viable. It's "yes you can do it but it's a lot slower", with a hint of "and the best local LLMs are on par with Haiku or maybe Sonnet, so larger and longer tasks get notably worse".
by datadrivenangel
6/13/2026 at 9:16:44 PM
You're kinda talking about Claude being used in a planning/architect role while the local LLM just executes the plan (performing edits). At least in that form it's a thing, yes.
by petu
6/14/2026 at 4:32:38 AM
Already been done. Look at the Forge project for local LLMs. It can bring 8b models up to Opus-like performance at tool calling.
by znnajdla
6/14/2026 at 12:14:50 AM
opencode is like Claude Code, but you can use any model.
by z3t4
6/13/2026 at 8:31:17 PM
I have said this before as well: these top-of-the-line models write clever, convoluted code. The code looks intelligent from above, but is a maintenance headache. It makes the entire thing fragile for future development on top of it.
The smaller models, especially the aforementioned ones, fail much more but do not write that kind of insane code. They do simple, non-clever coding like humans do. Much easier to maintain and build upon.
Qwen-3.6-27b is a wonderful model. Exceptionally good for its size, and excellent in general as well. And with MTP available now, it can run at 60+ tps on a single 3090... that is roughly 30% faster token generation than most of the hosted ones being served from giant data centers.
by freakynit
6/13/2026 at 11:34:39 PM
I keep seeing people talk about pi harnesses. What's this about?
by trueno
6/14/2026 at 4:13:58 AM
It’s one of the hot new-ish harnesses. I believe it’s like OpenClaw or Claude Code without all of the defaults.
by eyeris
6/13/2026 at 6:01:40 PM
It's also going to fail consistently. When calling Claude, you don't know what version of the model you are talking to; it might be quantized due to load or have been patched.
by christkv
6/13/2026 at 4:57:36 PM
This is true. The failure modes are simpler. And yes, the ceiling is lower as well. Smaller models' stability is lower over long sequences, and thus anything that needs a lot of CoT will be weaker. For example, I had a dumb lock + condvar with multiple defenses against lost wakeups in an N-producer, 1-consumer queue thing. Models generally need a lot of CoT before they realise they can switch it to a semaphore instead. Qwen typically isn't stable over such long CoTs and ends up adding more and more slop and band-aids, versus a larger model that outputs a large CoT and then realises it can swap 3 functions out with 2 lines if we use a semaphore.
by porridgeraisin
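To make the semaphore point above concrete, here is a minimal sketch of an N-producer, 1-consumer queue built on a counting semaphore. It is not the commenter's code (which isn't shown, and probably wasn't Python); `SemaphoreQueue` and `run_demo` are names invented for this example. The reason lost wakeups stop being a concern is that a semaphore's count remembers every `release`, even if the consumer is not waiting yet.

```python
import threading
from collections import deque


class SemaphoreQueue:
    """Illustrative N-producer, 1-consumer queue (hypothetical name)."""

    def __init__(self):
        self._items = deque()  # deque appends and pops are thread-safe
        self._available = threading.Semaphore(0)  # counts queued items

    def put(self, item):
        self._items.append(item)
        self._available.release()  # the count records the wakeup

    def get(self):
        self._available.acquire()  # blocks until an item has been put
        return self._items.popleft()


def run_demo(n_producers: int = 4, items_each: int = 100) -> list:
    """Run N producers against one consumer and return what was received."""
    q = SemaphoreQueue()
    received = []

    def producer(pid: int) -> None:
        for i in range(items_each):
            q.put((pid, i))

    def consumer() -> None:
        for _ in range(n_producers * items_each):
            received.append(q.get())

    threads = [threading.Thread(target=producer, args=(p,)) for p in range(n_producers)]
    threads.append(threading.Thread(target=consumer))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received
```

With a lock and condition variable, the same guarantee needs a predicate loop around `wait`; here `put` and `get` are two lines each.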