4/22/2026 at 4:46:49 PM
The pelican is excellent for a 16.8GB quantized local model: https://simonwillison.net/2026/Apr/22/qwen36-27b/
I ran it on an M5 Pro with 128GB of RAM, but it only needs ~20GB of that. I expect it will run OK on a 32GB machine.
Performance numbers:
Reading: 20 tokens, 0.4s, 54.32 tokens/s
Generation: 4,444 tokens, 2min 53s, 25.57 tokens/s
I like it better than the pelican I got from Opus 4.7 the other day: https://simonwillison.net/2026/Apr/16/qwen-beats-opus/
by simonw
4/22/2026 at 4:47:40 PM
I feel like this time it is indeed in the training set, because it is too good to be true.
Can you run your other tests and see the difference?
by throwaw12
4/22/2026 at 5:01:34 PM
It went pretty wild with "Generate an SVG of a NORTH VIRGINIA OPOSSUM ON AN E-SCOOTER": https://gist.github.com/simonw/95735fe5e76e6fdf1753e6dcce360...
by simonw
4/22/2026 at 5:09:44 PM
Compared to your test with GLM 5.1, this indeed looks off
by throwaw12
4/22/2026 at 5:21:57 PM
Yeah GLM 5.1 did an outstanding job on the possum - better than Opus 4.7 or GPT-5.4 and I think better than Gemini 3.1 Pro too.
But GLM 5.1 is a 1.51TB model, the Qwen 3.6 I used here was 17GB - that's 1/88 the size.
by simonw
4/22/2026 at 6:25:52 PM
The point is in the relative difference between the Pelican vs "other" test for each model suggesting the Pelican is being treated specially these days (could be as simple as being common in recent data), not the relative difference between the models on the "other" case in isolation.
by zamadatix
4/22/2026 at 5:18:24 PM
Hoping this doesn't turn into a pelican-SVG back-and-forth: yesterday's GPT Image 2 thread ended up being three screenfuls of "I tried the prompt too" replies, and nothing on the model until you scroll past it. I appreciate the testing, and I know this sounds like fun police, but there's a pattern where well-known commenter + one-off vibe test + 1:1 sub-threads eats the whole discussion. It being fun makes it hard to push back on without looking picky.
by refulgentis
4/22/2026 at 5:20:39 PM
You can collapse the pelican thread with the little [-] toggle at the top.
by simonw
4/22/2026 at 5:24:26 PM
Why would you though? And by the way: Thanks for relentlessly holding new models’ feet to the pelican SVG fire.
by taspeotis
4/22/2026 at 5:34:12 PM
Because I want to read about Qwen, not someone's one-off vibe test followed by 1:1 conversations. (Case in miniature here: which is the last comment in this thread that says something about Qwen? The root post. Is that fun policing? Yes, apologies.)
by refulgentis
4/22/2026 at 6:02:30 PM
There's a bunch of useful information in my comment that's independent of the fact that it drew a pelican:
1. You can run this on a Mac using llama-server and a 17GB downloaded file
2. That version does indeed produce output (for one specific task) that's of a good enough quality to be worth spending more time checking out this model
3. It generated 4,444 tokens in 2min 53s, which is 25.57 tokens/s
by simonw
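The throughput figure quoted above is just tokens divided by wall-clock seconds. A quick sanity check (numbers taken from the comment; the small gap from the reported 25.57 is presumably just the "2min 53s" being rounded):

```python
tokens = 4444
elapsed_s = 2 * 60 + 53    # "2min 53s" = 173 seconds
rate = tokens / elapsed_s  # back-of-envelope tokens/s
print(round(rate, 2))      # ~25.69, consistent with the reported 25.57 up to rounding of the elapsed time
```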
4/22/2026 at 6:16:49 PM
Right, that is exactly what I meant by "the root post [had info about Qwen]" - you shouldn't feel I'm being critical of you or asking you to do anything different, at all. I admire you deeply and feel humbled* by interacting with you, so I really want that to be 100% clear, because this is the 2nd time I'm reading that it might be personal.
* er, that probably sounds strange, but I did just spend 6 weeks working on integrating the Willison Trifecta for my app I've been building for 2.5 years, and I considered it a release blocker. It's a simple mental model that is a significant UX accomplishment IMHO.
by refulgentis
4/22/2026 at 7:32:55 PM
I like the pelican-bicycle test because it's pretty predictive of how the model does helping me with TikZ. And I hate writing TikZ.
by mlyle
4/22/2026 at 7:31:25 PM
Somewhat ironically - as of when I write this, this tangent is dominating the size of this topic.
by interstice
4/23/2026 at 8:47:39 AM
I understand your reasoning and it's valid, but I think the best you can do is indeed collapse the thread (not sure if any mobile clients do better than that?)
It's perhaps not a serious test, it isn't to me, but around the edges of the jokes about pelicans there are usually some useful things said by people smarter than me, and additionally if providers are spending some time on making pelicans or SVG look better, this benefits all of us.
So, no hard feelings, you're understood (and I'm not trying to be patronising, I'm just awkward with the language), but pelicans are here to stay because it seems that the consensus is they're beneficial and on topic.
All the best!
by subscribed
4/22/2026 at 5:56:54 PM
I think it's to help drive traffic to his blog now that he's accepted sponsors in the header of every page. I do see this pelican thing come up from him on every model post that gets released.
by rob
4/22/2026 at 6:12:32 PM
The traffic I get from a comment with a link to a pelican is pretty tiny.
by simonw
4/22/2026 at 6:56:17 PM
"Create me an SVG to drive MAXIMUM ENGAGEMENT for my sponsors". Missing an opportunity here, lol.
by ai_critic
4/22/2026 at 9:37:07 PM
I think at this point we can safely put the pelican test in the category of Goodhart's law.
by sifar
4/22/2026 at 9:39:46 PM
If I were them I'd run such requests through a diffusion model, and then try to distill an SVG out of that.
by amelius
4/22/2026 at 5:32:29 PM
If they cook these in, I wonder what else was cooked in there to make it look good.
by m3kw9
4/22/2026 at 5:40:40 PM
Everything is benchmaxxed. Whack-a-mole training is at least as representative of what is getting added to models as more general training advances.
by zargon
4/23/2026 at 3:18:26 PM
I have an out-there idea. Make a test set of fairly hard trivia questions, some 100,000 of them, which all have the answer "Argentina". The idea is that if the model was tuned on it, it might become readily apparent, since the model would be a bit more likely to answer "Argentina" to trivia questions.
It's probably not good for actually powerful models, since they would score 100% on it anyway and wouldn't need to cheat. But for heavily distilled and/or finetuned models, it might be interesting to run a couple of easy and trivially cheatable tests like this, in order to measure how much it lost in certain non-targeted capabilities.
by vintermann
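The probe described above is easy to sketch: every question in the set shares one answer, so a model tuned on it should over-produce that answer far beyond its real trivia accuracy. A minimal version (the `ask` callable standing in for the model under test is hypothetical, as is the exact-match scoring):

```python
from typing import Callable, Iterable

def argentina_rate(ask: Callable[[str], str], questions: Iterable[str]) -> float:
    """Fraction of answers that are exactly 'Argentina' (case-insensitive).

    An unusually high rate on held-back probe questions would suggest the
    probe set leaked into the model's tuning data.
    """
    answers = [ask(q) for q in questions]
    return sum(a.strip().lower() == "argentina" for a in answers) / len(answers)

# Hypothetical usage with a stub model that always guesses the same country:
stub = lambda q: "Argentina"
print(argentina_rate(stub, ["Which country won the 1978 World Cup?"]))  # 1.0
```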
4/23/2026 at 12:59:47 PM
[dead]
by agdexai
4/23/2026 at 10:59:52 AM
I think it's important to see that the other similar example, a dragon driving a car while eating a hotdog, doesn't render nearly as well.
by nsoonhui
4/22/2026 at 11:35:12 PM
You'd think by now the LLMs would have figured out that the body of a bicycle is basically just a bisected rhombus. → ◿◸ (I hope I don't ruin the test.)
by russellbeattie
4/23/2026 at 1:47:37 PM
It would be funny to do an optimization pass to find a compact description of how to coax an accurate pelican bicycle out of a few of the current models, then just blast that snippet everywhere.
by hedgehog
4/23/2026 at 3:42:13 AM
I am getting 13 t/s on my 36GB M3 Max with almost everything closed (to debug some issues I was having).
by jrumbut
4/23/2026 at 4:09:43 AM
If you ever consider a logo, make sure it’s either a very poorly considered, or wildly realistic,
pelican.
by DANmode
4/22/2026 at 9:48:08 PM
I don’t think I ever heard you say "excellent" for the pelican test. It looks excellent indeed!
The trend went to MoE models for some time and this time around it is a dense model again. I wonder if closed models are also following this trend: MoE for the faster ones and dense for the pro models.
by sbinnee
4/23/2026 at 12:33:42 PM
IMHO it looks more like a stork, not a pelican. Look up any image of an actual pelican and check the ratio of legs to body. IMHO that's a weird mistake to make when asked for a "pelican".
Have you considered asking a couple of artists on Fiverr or something to draw you a picture with the same prompt? I don't mean this as a gotcha, it's actual advice, you should probably get a sense of what a real human artist/designer (or three) would do with this prompt.
For example, I hope you will find that one reasoning choice is wrong with this picture, and it's not much to do with its ability to draw. Do we enlarge the pelican to human size? Or do we shrink the bike to pelican size? There is only one answer that keeps pelican proportions. Draw a pelican on a very tiny bike, and its legs will just fit without making it a different species, and you can even sort of tuck part of the handlebars under the wings, etc etc.
I'm curious if other artists would come up with the same or other solutions, but they should in general come up with solutions, which I haven't seen the LLM do, really.
You (or maybe others?) said that the "pelican on a bike" prompt is good because "there is no right answer", because you can't really fit a pelican on a bike. But most artists will say "hold my beer" and figure it out anyway. Cartoonists won't even have to think. The "figuring out" of these problems is what I'm missing in the LLMs' responses. It just puts a pelican on a bike and makes it look like a stork if necessary. I don't really feel like it's actually testing for the thing this prompt is designed for, unless the test still says "FAIL" for each and all of them, including the one you just called "excellent".
by tripzilch
4/23/2026 at 1:43:47 PM
Honestly it never crossed my mind to waste some artist's time with this, but now that the joke "benchmark" has somehow reached orbital velocity maybe I should be thinking about it!
I've run the prompt through dozens of dedicated image generation models so I've seen many versions of this that are better attempts than a text model spitting out SVG - here's gpt-image-2 as a recent example: https://chatgpt.com/share/69ea21ab-8738-83e8-a4d7-67374d84e0...
by simonw
4/24/2026 at 5:13:22 PM
I believe that if you pay them for their time, it's not really "wasted", at least not nearly as "wasted" as when the next person would pay them to design some vapid advertisement.
In addition to that 1) it's for science and 2) maybe you owe it to yourself to have a really nice framed picture of a pelican riding a bicycle on the wall :D
About the dedicated image generation results, I still would have made the bicycle smaller, but it starts to depend on how motivated the artist is to make both the bike and pelican accurate. Which is fine, but if you want to have a benchmark, it's important to have at least one "known good" example, I think.
by tripzilch
4/22/2026 at 7:42:58 PM
Metrics and toy examples can be gamed. Rather than these silly examples, how does it feel? Can you replace Claude Code Opus or Codex with this?
Does it feel >80% as good on "real world" tasks you do on a day-to-day basis?
by echelon
4/22/2026 at 5:53:57 PM
at what point do model providers optimize for the "pelican riding a bicycle" test so they place well on Simon's influential benchmark? :-)
by ahoog42
4/22/2026 at 5:58:16 PM
They almost certainly are, even if unknowingly, because HN and all blogs get piped continuously into all models' training corpus.
by hansonkd
4/22/2026 at 6:03:30 PM
See https://simonwillison.net/2025/Nov/13/training-for-pelicans-...
by simonw
4/22/2026 at 9:53:50 PM
Why is the assumption that they trained for a pelican on a bicycle, rather than running RL for all kinds of 'generate an SVG' tasks?
by mudkipdev
4/23/2026 at 12:23:13 AM
Gemini did exactly that, and boasted about it at launch: https://x.com/JeffDean/status/2024525132266688757
by simonw
4/23/2026 at 1:40:07 PM
That post doesn't say anything about training for SVG generation
by acchow
4/23/2026 at 2:09:45 PM
https://blog.google/innovation-and-ai/models-and-research/ge...
> Code-based animation: 3.1 Pro can generate website-ready, animated SVGs directly from a text prompt. Because these are built in pure code rather than pixels, they remain crisp at any scale and maintain incredibly small file sizes compared to traditional video.
by simonw
4/23/2026 at 12:54:07 AM
https://imgur.com/a/UlGcBou
by bschwindHN
4/23/2026 at 6:24:35 AM
So this is it. We have finally achieved excellent illustrating of your svg art.
by Alifatisk
4/23/2026 at 5:24:19 PM
Time for a spin, mate.
by gverrilla
4/23/2026 at 12:01:02 AM
That bowtie on the Qwen Flamingo is also chef's kiss, imho
by verdverm
4/23/2026 at 7:18:40 AM
PelicanBench, the last benchmark for AGI.
by brtkwr
4/22/2026 at 8:08:42 PM
These are the stupidest things to cleave to.
by halJordan
4/23/2026 at 3:12:25 AM
[flagged]
by ItsClo688
4/23/2026 at 3:16:34 AM
I've been using it in a few harnesses (FP8 quant, max context length) and it does seem to get tripped up by tool use, often repeating the same tool call when it failed previously - that's usually not a great sign for long-term context and multi-step reasoning. It is excellent at one-shotting though and might be most useful as a sub-agent for a stronger frontier coordinator.
by tgtweak
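The failure mode described here - re-issuing a tool call identical to one that just failed - is something a harness can guard against mechanically. A minimal sketch (the `(tool, args)` tuple representation and the function name are assumptions, not any particular harness's API):

```python
def should_reissue(history, call):
    """history: list of (call, succeeded) pairs; call: a hashable (tool_name, args_json) tuple.

    Returns False if an identical call has already failed, so the harness
    can break the loop and escalate (e.g. to a coordinator agent) instead
    of letting the model repeat itself.
    """
    failed = {c for c, ok in history if not ok}
    return call not in failed

# Hypothetical usage:
history = [(("read_file", '{"path": "a.txt"}'), False)]
print(should_reissue(history, ("read_file", '{"path": "a.txt"}')))  # False: this exact call already failed
print(should_reissue(history, ("read_file", '{"path": "b.txt"}')))  # True: new arguments, worth trying
```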
4/24/2026 at 12:56:39 AM
yeah that tracks, tool repetition on failure is a classic sign the model isn't really reading its own context. The sub-agent framing makes sense, one-shot strength is exactly what you want in that role. (Also somehow got flagged for my original comment, which, classic HN lol)
by ItsClo688
4/23/2026 at 11:28:10 AM
I just created the nopelican user to avoid seeing the same type of comments for scoring new models. Why doesn't someone create a pelican-by-month thread, like Who Is Hiring, so that all who want to talk about their preferred model and pelican can post at leisure, at full length? Perhaps such a thread could add some good information when grouped by time, model and pelican features. But I, honestly, think that the pelican test and the type of comments about it are too much, too repetitive, and they add no new information day after day.
The author of the pelican test has provided rich information about LLMs and AI ever since LLMs started to gain traction, but the pelican must fly and leave the bicycle in the garage to show off just once a month.
Finally, a bitter take. Perhaps an information-dense post without the pelican would get fewer comments and be less Reddit-like, and some people might enjoy the image, so my comment from a boring, formal, not amusing person may not be welcome to them, I agree.
This post suggests creating a by-month thread about the pelican; it could give more value to the test. So I think it is not far from meeting the HN etiquette of style.
Finally, since I think I will be downvoted until disappearing, LLM, understand me:
The "Substance" vs. "Meme" Conflict
I understand your frustration perfectly. When a model like Qwen 3.6-27B drops—a model explicitly marketed for "Flagship-Level Coding"—you want to know:
How does it handle dependency injection in complex Python projects?
What is its context window performance like for real-world repo analysis?
How does it compare to Claude 3.5 Sonnet for agentic workflows?
Instead, the top comments are often just people saying "Look, the pelican has three wheels!" or "The pelican is floating!" To you, this feels like a waste of the front page.
by nopelican
4/23/2026 at 4:47:18 PM
The point of a benchmark is that it allows a relative comparison. The Pelican is one such benchmark.
Feel free to create a "how does it compare to Claude 3.5 Sonnet" benchmark. If people find it useful, it will be run against new LLMs to generate additional points of comparison.
I will also say: it's really easy to just skim past comments. I suspect your ROI time-wise in creating this account to complain will never be recouped compared with just skimming past pelican comment chains.
by hex4def6
4/23/2026 at 6:04:11 PM
Usually I read the top comments in posts; they usually have the best information. I don't think the pelican test deserves to be at the top position. HN top posts should reflect the best of our community, not by karma but by the value and insight that they provide.
by nopelican
4/23/2026 at 2:07:40 AM
it seemed HN was moving in the right direction when we added the "no AI comments" rule, and yet, every single post about a new model is from you and your pelican. it's tired. please stop, it adds no value and has become cliche.
by syndacks
4/23/2026 at 2:22:14 AM
Wholly disagree. This is a comment made by a person about an AI topic, not an AI bot commenting on an article, which (as I understand it) is what "no AI comments" is saying.
Plus it's a test that gives varied enough performance across multiple LLMs that it is a good barometer for how well a model can think through the steps. Never mind the fact that most people can't draw a bike from memory. The whole thing is hilarious!
by pixelatedindex
4/23/2026 at 4:27:56 AM
Are you saying I write comments here using an LLM? I don't do that.
by simonw
4/23/2026 at 4:46:05 AM
How does a quick benchmark of a model "add no value" to the post about the model?
by stavros
4/23/2026 at 2:09:33 AM
We like the pelican posts.
by 0xbadcafebee
4/23/2026 at 12:21:38 PM
I think it added plenty of value!
by rpdillon