Qwen3.6-35B-A3B: Agentic coding power, now open to all

4/16/2026 at 5:38:31 PM

I've been running this on my laptop with the Unsloth 20.9GB GGUF in LM Studio: https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/blob/mai...

It drew a better pelican riding a bicycle than Opus 4.7 did! https://simonwillison.net/2026/Apr/16/qwen-beats-opus/

by simonw

4/16/2026 at 10:24:37 PM

Thanks for pointing to the GGUF.

I just tried this GGUF with llama.cpp in its UD Q4_K_XL version on my custom agentic oritened task consisiting of wiki exploration and automatic database building ( https://github.com/GistNoesis/Shoggoth.db/ )

I noted a nice improvement over QWen3.5 in its ability to discover new creatures in the open ended searching task, but I've not quantified it yet with numbers. It also seems faster, at around 140 token/s compared to 100 token/s , but that's maybe due to some different configuration options.

Some little difference with QWen3.5 : to avoid crashes due to lack of memory in multimodal I had to pass --no-mmproj-offload to disable the gpu offload to convert the images to tokens otherwise it would crash for high resolutions images. I also used quantized kv store by passing -ctk q8_0 -ctv q8_0 and with a ctx-size 150000 it only need 23099 MiB of device memory which means no partial RAM offloading when I use a RTX 4090.

by GistNoesis

4/16/2026 at 8:23:47 PM

I'm not sure how you can give the flamingo win to Qwen:

* It's sitting on the tire, not the seat.

* Is that weird white and black thing supposed to be a beak? If so, it's sticking out of the side of its face rather than the center.

* The wheel spokes are bizarre.

* One of the flamingo's legs doesn't extend to the pedal.

* If you look closely at the sunglasses, they're semi-transparent, and the flamingo only has one eye! Or the other eye is just on a different part of its face, which means the sunglasses aren't positioned correctly. Or the other eye isn't.

* (subjective) The sunglasses and bowtie are cute, but you didn't ask for them, so I'd actually dock points for that.

* (subjective) I guess flamingos have multiple tail feathers, but it looks kinda odd as drawn.

In contrast, Opus's flamingo isn't as detailed or fancy, but more or less all of it looks correct.

by kelnos

4/16/2026 at 9:50:27 PM

He literally said it came down to the comment in the SVG. Points for taste, not correctness. Basically.

by withinboredom

4/16/2026 at 10:55:41 PM

Here's a reproduction attempt (LM Studio, same Qwen3.6-35B-A3B-GGUF model as linked in parent, M1 Max 64GB, <90 seconds):

https://files.catbox.moe/r3oru2.png

- My Qwen 3.6 result had sun and cloud in sky, similar to the second Opus 4.7 result in Simon's post.

- My Qwen 3.6 result had no grass (except as a green line), but all three results in Simon's post had grass (thick).

- My Qwen 3.6 result had visible "tailing air motion" like Simon's Qwen 3.6 result.

- My Qwen 3.6 result had a "sun with halo" effect that none of Simon's results had.

But, I know, it's more about the pelican and the bicycle.

by realityfactchex

4/17/2026 at 5:58:37 AM

The bicycle frame is ok. Simon's was better but at least it's not broken like Opus 4.7.

I can't comment that flamingo.

by _ache_

4/16/2026 at 6:16:42 PM

I wonder when pelican riding a bicycle will be useless as an evaluation task. The point was that it was something weird nobody had ever really thought about before, not in the benchmarks or even something a team would run internally. But now I'd bet internally this is one of the new Shirley Cards.

by jubilanti

4/16/2026 at 6:41:07 PM

Simon has an article on this

https://simonwillison.net/2025/Nov/13/training-for-pelicans-...

by abustamam

4/16/2026 at 10:03:22 PM

Pelicanmaxxing

by SwellJoe

4/16/2026 at 7:59:01 PM

Yeah try it with something else, or e.g. add a tiger to the back seat.

by amelius

4/17/2026 at 10:36:52 AM

I use this metric now, and I suggest you change it per your imagination:

"Make a single-page HTML file using threejs from a CDN. Render a scene of a flying dinosaur orbiting a planet. There are clouds with thunder and lightning, and the background is a beautiful starscape with twinkling stars and a colorful nebula"

This allows me to evaluate several factors across models. It is novel and creative. I generally run it multiple times, though now that I have shared it here, I will come up with new scenes personally to evaluate.

I also consider how well it one shots, errors generated, response to errors being corrected, and velocity of iteration to improvement.

Generally speaking, Claude Sonnet has done the best, Qwen3.5 122B does second, and I have nice results from Qwen3.5 35B.

ChatGPT does not do well. It can complete the task without errors but the creativity is atrocious.

by survirtual

4/16/2026 at 7:25:05 PM

They’ll hardcode it in 4.8, just like they do when they need to “fix” other issues

by MagicMoonlight

4/16/2026 at 6:41:34 PM

I mean look at the result where he asked about a unicycle - the model couldn't even keep the spokes inside the wheels - would be rudimentary if it "learned" what it means to draw a bicycle wheel and could transfer that to unicycle.

by rafaelmn

4/16/2026 at 7:27:19 PM

it's the frame that's surprisingly - and consistentnly - wrong. You'd think two triangles would be pretty easy to repro; once you get that the rest is easy. It's not like he's asking "draw a pelican on a four-bar linkage suspension mountainbike..."

by duzer65657

4/16/2026 at 7:59:06 PM

This is older, but even humans don't have a great concept of how a bicycle works... https://twistedsifter.com/2016/04/artist-asks-people-to-draw...

by Reddit_MLP2

4/16/2026 at 8:31:23 PM

Wouldn't this be more about being capable of mentally remembering how a bicycle looks versus how it works?

This reminds me of Pictionary. [0] Some people are good and some are really bad.

I am really bad a remembering how items look in my head and fail at drawing in Pictionary. My drawing skills are tied to being able to copy what I see.

[0] https://en.wikipedia.org/wiki/Pictionary

by yndoendo

4/17/2026 at 6:21:32 AM

I think it’s difficult to draw a bike exactly because you remember how it works rather than how it looks, so you worry about placing all the functional parts and get the overall composition wrong. Similar to drawing faces, without training, people will consistently dedicate too much area to the lower part of the face and draw some kind of neanderthal with no forehead.

by johanvts

4/16/2026 at 9:38:46 PM

is it possible to have greater success with the specificity? I don't think i ever drew a bike frame properly as a kid despite riding them and understanding the concept of spokes and wheels...

by quinnjh

4/17/2026 at 11:43:35 AM

Valid points, but you"d think "superintelligence" would "know" how to draw a pelican on a bike?

by hansmayer

4/16/2026 at 5:53:58 PM

It's fascinating that a $999 Mac Mini (M4 32GB) with almost similar wattage as a human brain gets us this far.

by bertili

4/17/2026 at 6:33:30 AM

Interesting thought, I looked it up out of curiosity and fund 155w max (but realistically more like 80w sustained) for the mac under load, and just around 20watts for the brain, surprisingly almost constant whether “under load” or not.

by johanvts

4/17/2026 at 4:34:31 PM

I clocked my M4 at 108 Watts while running inference using Qwen3.6-35b-a3b via Al dente.

by fragmede

4/17/2026 at 10:27:22 AM

> 155w max (but realistically more like 80w sustained)

155W PSU seems to be unified with M4 Pro model, plus there's reserve for peripherals (~55W for 5 USB/Thunderbolt ports).

Apple lists 65W for base M4 Mac itself: https://support.apple.com/en-am/103253

Notebookcheck found same number: https://www.notebookcheck.net/Apple-Mac-Mini-M4-review-Small...

by petu

4/16/2026 at 7:06:52 PM

the more I look at these images the more convinced I become that world models are the major missing piece and that these really are ultimately just stochastic sentence machines. Maybe Chomsky was right

by culi

4/17/2026 at 6:10:51 AM

> that these really are ultimately just stochastic sentence machines

I thought that's exactly what they are?

by bmitc

4/17/2026 at 7:44:04 PM

No, they have "attention". There is unique logic going on in the deep layers of the neural network.

Even the standard introductory exercise artificial neural networks, handwritten digit recognition, already shows deeper understanding. These simple networks take in raw pixels and somewhere in the many layers recognize "curves" and "edges" and then "circles" and "boxes" and whatnot and eventually "digits".

I think there's a genuine debate about whether or not this is a form of intelligence. I think the oversimplified argument of them just being stochastic sentence machines mostly comes from people who don't understand how they work. But I also think there's a much more nuanced version of this argument offered by people like Chomsky that should be taken seriously

by culi

4/18/2026 at 5:49:15 AM

> No, they have "attention". There is unique logic going on in the deep layers of the neural network.

Any specifics? That doesn't say anything about them not being sentence generators. And it's pretty well known that the LLMs constantly spew out fantastically grammatically correct sentences that have no logic to them whatsoever.

> These simple networks take in raw pixels and somewhere in the many layers recognize "curves" and "edges" and then "circles" and "boxes" and whatnot and eventually "digits".

That sounds like a version of anthropomorphizing. It is my understanding that it is a completely open problem as to what neural networks are actually doing in their internal, deep layers.

> I think the oversimplified argument of them just being stochastic sentence machines mostly comes from people who don't understand how they work.

I mean, that's effectively a logical fallacy, so it's not a strong argument.

by bmitc

4/17/2026 at 6:28:24 AM

I am so perplexed what exactly where people thinking they were. Its nothing else than highly sofisticated statistics.

by mastermage

4/17/2026 at 8:53:52 AM

From that perspective, which is totally correct, it makes you wonder what other domains of knowledge look like when pushed to the boundaries of our capabilities as a species.

by tmountain

4/18/2026 at 9:25:04 AM

That is a genuinely thought provoking idea.

by mastermage

4/17/2026 at 7:49:16 PM

Do you know of any other statistical model that can "hallucinate". They clearly have emergent capabilities that come from scale that are absent in any other statistical model we've ever dreamt up.

We know that LLMs build complex internal representations of language, logic, and concepts rather than just shallow word-counting.

If you deny that then you probably have an elementary understanding of how they work. Not even Chomsky denies that. The real argument imo is whether those internal representations constitute an actual "understanding" of the world or just flatten out to something much less interesting.

by culi

4/18/2026 at 9:29:50 AM

> Do you know of any other statistical model that can "hallucinate".

Actualy most statistical models can "hallucinate", specifically those that are capable of interpolation.

I have witnessed this for example in Gaussian Processes. In my own scientific work.

by mastermage

4/17/2026 at 3:47:40 PM

The Chomsky argument feels like it's moving in a different direction than what's actually useful to know. Whether or not these models have "real" understanding, they're clearly capable of solving problems that were previously considered to require understanding. The more interesting question is whether world models, if they existed, would actually improve the failure modes people care about — like hallucination and planning — or whether we'd just get better stochastic sentence machines with an extra layer of abstraction on top.

by 060880

4/16/2026 at 5:59:58 PM

But that you also gave a win to Qwen on flamingo is pretty outrageous! :)

Tthe right one looks much better, plus adding sunglasses without prompting is not that great. Hopefully it won't add some backdoor to the generated code without asking. ;)

by cyclopeanutopia

4/16/2026 at 6:13:44 PM

I love how the Chinese models often have an unprompted predilection to add flair.

GLM-5.1 added a sparkling earring to a north Virginia opossum the other day and I was delighted: https://simonwillison.net/2026/Apr/7/glm-51/

by simonw

4/16/2026 at 8:57:45 PM

You're running 5.1 locally or hosted?

by monksy

4/16/2026 at 9:37:43 PM

I used that one via OpenRouter.

by simonw

4/16/2026 at 6:41:12 PM

The flamingo on Qwen's unicycle is sitting on the tire, not the seat. That wins because of sunglasses?

by prirun

4/16/2026 at 7:38:22 PM

Can a benchmark meant as a joke not use a fun interpretation of results? The Qwen result has far better style points. Fun sunglasses, a shadow, a better ground, a better sky, clouds, flowers, etc.

If we want to get nitty gritty about the details of a joke, a flamingo probably couldn't physically sit on a unicycle's seat and also reach the pedals anyways.

by evilduck

4/16/2026 at 8:37:07 PM

Well, maybe the flamingo is a really good unicyclist...

https://youtu.be/Rrpgd5oIKwI

by akavel

4/17/2026 at 5:37:36 AM

Transparency of the wheel

Stylized gradients on the flamingo

Flowers

Ground/grass has a stylized look and feel

...despite a miss along the Y-axis where it's below the seat, couple oddly organized tail feathers, spokes, the composition overall is much closer to a production quality entity

Opus 4.7 looks like 20 seconds in MS paint.

Qwen3.6 looks incomplete due to the sitting position, but like a WIP I could see on a designer coworkers screen if I walk up and interrupt them. Click and drag it up, adjust tail feathers and spokes, you're there or much closer, to a usable output

by yabutlivnWoods

4/16/2026 at 7:11:44 PM

interesting, I just tried this very model, unsloth, Q8, so in theory more capable than Simon's Q4, and get those three "pelicans". definitely NOT opus quality. lmstudio, via Simon's llm, but not apple/mlx. Of course the same short prompt.

Simon, any ideas?

https://ibb.co/gFvwzf7M

https://ibb.co/dYHRC3y

https://ibb.co/FLc6kggm (tried here temperature 0.7 instead of pure defaults)

by rdslw

4/17/2026 at 1:04:00 AM

try Unsloth recommended settings

    Thinking mode for general tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0

    Thinking mode for precise coding tasks (e.g. WebDev): temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0

    Instruct (or non-thinking) mode for general tasks: temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0

    Instruct (or non-thinking) mode for reasoning tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0

(Please note that the support for sampling parameters varies according to inference frameworks.)

by strobe

4/16/2026 at 8:59:55 PM

Hey I really enjoy your blog. On some things I end up finding a blog post of yours thats a year+ old and at other times, you and I are investigating similar things. I just pulled Qwen3.6 - 35b -A3B (Can't believe thats a A3B coming from 35b).

I'm impressed about the reach of your blog, and I'm hoping to get into blogging similar things. I currently have a lot on my backlog to blog about.

In short, keep up the good work with an interesting blog!

by monksy

4/16/2026 at 5:41:56 PM

I've had some really gnarly SVGs from Claude. Here's what I got after many iterations trying to draw a hand: https://imgur.com/a/X4Jqius

by jamwise

4/16/2026 at 6:50:38 PM

Probably because all the training material of humans drawing hands are garbage haha.

by giantg2

4/16/2026 at 10:25:56 PM

The real question is what the next truly weird, un-optimized prompt will be. Something involving a sloth debugging a quantum computer in MS Paint?"

by jaspanglia

4/16/2026 at 11:37:32 PM

The qwen flamingo looks like it’s smoking’ a doobie.

by quietsegfault

4/16/2026 at 7:11:19 PM

Interesting, qwen has the pelican driving on the left lane. Coincidence or has it something to do with the workers providing the RL data?

by MeteorMarc

4/16/2026 at 7:24:19 PM

Could be on a bike path where bikes are on the left and pedestrians to the right.

by rubiquity

4/17/2026 at 6:43:36 AM

I've been running qwen3.6:35b-a3b-q4_K_M (22.3GB) via Ollama.

Is the 20.9GB GGUF version better or negligible in comparison?

by Scrounger

4/16/2026 at 9:00:29 PM

I've been trying the Q4_K_M version, and sometimes it gets stuck in a loop. Gemma 4 doesn’t have this issue.

by bwv848

4/16/2026 at 9:33:04 PM

This has happened before with quantizations and other backends (ones not used by the research lab). Give it a week, download latest versions of everything, and try again.

by yencabulator

4/17/2026 at 12:22:31 PM

I'm having the same issues, the more I use it. The repetition penalty doesn't seem to help.

I get some really amusing 'reflective' responses, but I think it needs a bit more cooking. Maybe I'll try another variant.

by mobiuscog

4/17/2026 at 1:10:27 AM

perhaps increasing repitition_penalty might be helpful

by Readerium

4/17/2026 at 11:18:19 AM

what kind of specs does your laptop have? do you know how many tokens/second you get on it?

by logicallee

4/16/2026 at 5:50:28 PM

Oh that is pretty good! And the SVG one!

by danielhanchen

4/16/2026 at 5:48:38 PM

How does it do with the "car wash" benchmark? :D

by slekker

4/16/2026 at 1:59:10 PM

A relief to see the Qwen team still publishing open weights, after the kneecapping [1] and departures of Junyang Lin and others [2]!

[1] https://news.ycombinator.com/item?id=47246746 [2] https://news.ycombinator.com/item?id=47249343

by bertili

4/16/2026 at 2:27:17 PM

This is just one model in the Qwen 3.6 series. They will most likely release the other small sizes (not much sense in keeping them proprietary) and perhaps their 122A10B size also, but the flagship 397A17B size seems to have been excluded.

by zozbot234

4/16/2026 at 10:38:56 PM

And shout-out to Qwen if they release 122b -- Jeff Barr's original Gemma 4 tweet said they'd release a ~122b, then it got redacted :(

by jonaustin

4/17/2026 at 7:11:25 AM

122b would be awesome. It is the largest size you can kinda run with a beefy consumer PC. I wondered about gemma stopping in the 30b category, it is already very strong. 122b might have been too close to being really useful.

by canpan

4/17/2026 at 2:29:31 PM

> not much sense in keeping them proprietary

Maybe for LLMs since everyone has their own competing LLM, but with Video models, Wan 2.2 did a rug pull, left a huge gap for the community that built around Wan 2.2 too, and I don't think a single open video model has come close since. Wan is at 2.7 now, and its been nearly a year since the last update.

by giancarlostoro

4/16/2026 at 2:34:45 PM

Is there any source for these claims?

by bertili

4/16/2026 at 2:43:30 PM

https://x.com/ChujieZheng/status/2039909917323383036 is the pre-release poll they did. ~397B was not a listed choice and plenty of people took it as a signal that it might not be up for release.

by zozbot234

4/16/2026 at 2:42:31 PM

A Qwen research member had a poll on X asking what Qwen 3.6 sizes people wanted to see:

https://x.com/ChujieZheng/status/2039909917323383036

Likely to drive engagement, but the poll excluded the large model size.

by anonova

4/16/2026 at 2:40:47 PM

397A17B = 397B total weights, 17B per expert?

by stingraycharles

4/16/2026 at 3:02:46 PM

That's not how it works. Many people get confused by the “expert” naming, when in reality the key part of the original name “sparse mixture of experts” is sparse.

Experts are just chunks of each layers MLP that are only partially activated by each token, there are thousands of “experts” in such a model (for Qwen3-30BA3, it was 48 layers x 128 “experts” per layer with only 8 active at each token)

by littlestymaar

4/16/2026 at 2:57:47 PM

17b per token. So when you’re generating a single stream of text (“decoding”) 17b parameters are active.

If you’re decoding multiple streams, it will be 17b per stream (some tokens will use the same expert, so there is some overlap).

When the model is ingesting the prompt (“prefilling”) it’s looking at many tokens at once, so the number of active parameters will be larger.

by zackangelo

4/16/2026 at 2:45:43 PM

397B params, 17B activated at the same time

Those 17B might be split among multiple experts that are activated simultaneously

by wongarsu

4/16/2026 at 5:06:46 PM

How many people/hackernews can run a 397b param model at home? Probably like 20-30.

by kylehotchkiss

4/16/2026 at 11:56:13 PM

The point is that open weights turns puts inference on the open market, so if your model is actually good and providers want to serve it, it will drive costs down and inference speeds up. Like Cerebras running Qwen 3 235B Instruct at 1.4k tps for cheaper than Claude Haiku (let that tps number sink in for a second. For reference, Claude Opus runs ~30-40 tps, Claude Haiku at ~60. Several orders of magnitude difference). As a company developing models, it means you can't easily capture the inference margins even though I believe you get a small kickback from the providers.

So I understand why they wouldn't want to go open weight, but on the other hand, open weight wins you popularity/sentiment if the model is any good, researchers (both academic and other labs) working on your stuff, etc etc. Local-first usage is only part of the story here. My guess is Qwen 3.5 was successful enough that now they want to start reaping the profits. Unfortunately most of Qwen 3.5's success is because it's heavily (and successfully!) optimized for extremely long-context usage on heavily constrained VRAM (i.e. local) systems, as a result of its DeltaNet attention layers.

by Nav_Panel

4/16/2026 at 5:53:04 PM

You can rent a cloud H200 with 140GB VRAM in a server with 256GB system ram for $3-4/hr.

by jubilanti

4/17/2026 at 7:31:47 PM

Can you tell me where? I used runpod before, but they don't have systems like that.

by Tamitami2000

4/17/2026 at 8:58:35 AM

This is like saying that Open source is not important because I don't have a machine to run it on right now. Of course it is important. We don't have any state of the art Language models that are open source, but some are still Open Weight. Better than nothing, and the only way to secure some type of privacy and control over own AI use. It is my goal to run these large models locally eventually; if they all go away that is not even a possibility. . .

by wolfhumble

4/16/2026 at 5:42:37 PM

I can (barely, but sustainably) run Q3.5 397B on my Mac Studio with 256GB unified. It cost $10,000 but that's well within reach for most people who are here, I expect.

by kridsdale3

4/16/2026 at 5:50:29 PM

Hacker News moment

by qlm

4/16/2026 at 5:44:57 PM

$10k is well outside my budget for frivolous computer purchases.

by toxik

4/16/2026 at 9:34:11 PM

It would be plenty in-budget if the software part of local AI was a bit more full-featured than it is at present. I want stuff like SSD offload for cold expert weights and/or for saved/cached KV-context, dynamic context sizing, NPU use for prefill, distributed inference over the network, etc. etc. to all be things that just work for most users, without them having to set anything up in an overly error-prone way. The system should not just explode when someone tries to run something slightly larger; it should undergo graceful degradation and let them figure out where the reasonable limits are.

by zozbot234

4/17/2026 at 8:01:25 AM

But it's well within the budget of a small company that wants to run a model locally. There are plenty of reasons to run one locally even if it's not state of the art, such as for privacy, being able to do unlimited local experiments, or refining it to solve niche problems.

by rwmj

4/16/2026 at 9:22:38 PM

yeah, but if you really really wanted to and/or your livelyhood depended on it, you probably could afford it.

by stefs

4/16/2026 at 6:06:42 PM

99.97% of HN users are nodding… :)

by bdangubic

4/16/2026 at 7:49:18 PM

There are way too many good uses of these models for local that I fully expect a standard workstation 10 years from now to start at 128GB of RAM and have at least a workstation inference device.

by hparadiz

4/16/2026 at 8:26:12 PM

or if you believe a lot of HN crowd we are in AI bubble and in 10 years inference will be dirt cheap when all of this crashes and we have all this hardware in data centers and it won't make any sense to run monster workstations at home (I work 128GB M4 but not run inference, just too many electron apps running at the same time...) :)

by bdangubic

4/16/2026 at 11:07:17 PM

> I work 128GB M4 but not run inference, just too many electron apps running at the same time.

This is somewhat depressing - needing a couple of thousand bucks worth of ram just to run your chat app and code/text editor and API doco tool and forum app and notetaking app all at the same time...

by bigiain

4/18/2026 at 6:34:13 AM

Crucial (Micron) sold 128GB of DDR5-5600 in SODIMM form for $280 a year ago. It would be slower tham the same amount on an M4 Mac, but still, I object to characterizing either as “a couple thousand bucks worth”.

by mananaysiempre

4/19/2026 at 11:52:02 PM

I( get that number by optioning up a Mac Studio to 128GB at the Apple Store.

(Admittedly, Apple should be facing criminal price gouging law suits for their ram pricing.)

by bigiain

4/16/2026 at 8:57:12 PM

Inference will be dirt cheap for things like coding but you'll want much more compute for architectural planning, personal assistants with persistent real time "thinking / memory", as well as real time multimedia. I could put 10 M4s to work right now and it won't be enough for what I've been cooking.

by hparadiz

4/17/2026 at 7:22:06 AM

That's kind of a specific percentage. What numbers did you use to get there?

by fragmede

4/17/2026 at 7:19:40 AM

Just have to reclassify it as non-frivolous then. $10k's not a lot for something as important as a car, if you live somewhere where one is required. Housing is typically gonna cost you more than $10k to own. I probably spend close to $10k for food for 1.5 years.

So if you just huff enough of the AI Kool aid, you too can own a Mac Studio. Or an M5 MacBook. Or a dual 3090 rig.

by fragmede

4/16/2026 at 6:06:01 PM

For some reason you were being downvoted but I enjoy hearing how people are running open weights models at home (NOT in the cloud), and what kind of hardware they need, even if it's out of my price range.

by rwmj

4/16/2026 at 5:50:18 PM

I'm running it on my Intel Xeon W5 with 256GB of DDR5 and Nvidia 72GB VRAM. Paid $7-8k for this system. Probably cost twice as much now.

Using UD-IQ4_NL quants.

Getting 13 t/s. Using it with thinking disabled.

by SlavikCA

4/17/2026 at 5:33:08 AM

I get 20 t/s on the UD-Q6_K_XL quant, Radeon 6800 XT.

by GrayShade

4/17/2026 at 5:30:09 AM

In where I am living, 10k USD is a little more than 3 years worth of rent, for a relatively new and convenient 2 bedroom apartment.

by lpnam0201

4/17/2026 at 6:29:11 AM

$277 a month for a two bedroom is literally 6-10% of what someone in the SF Bagholder Area pays.

Either you're in Africa, southeast Asia or south/central Amarica.

How do you even afford internet?

by gbgarbeb

4/17/2026 at 7:08:47 AM

Yes, I am in SEA. Home internet here costs 10$ per month.

My point was: not every person browsing this site has high living standard, and the ability to spend 10k on computing is a privilege.

by lpnam0201

4/18/2026 at 6:48:51 PM

[dead]

by gbgarbeb

4/16/2026 at 10:21:02 PM

you have proved my point

by kylehotchkiss

4/17/2026 at 7:28:08 PM

According to this blog (https://kaitchup.substack.com/p/lessons-from-gguf-evaluation...) the UD_IQ2_M quants are quite strong (rel. error to the base is very low), so it's around 120GB of RAM needed, while the experts can be loaded into VRAM and the rest offloaded into system RAM. It's a high end consumer PC, sure, but not unaffordable. For example, I got an older rig with a RTX 6000 ADA (48GB VRAM), 128 GB RAM and a Threadripper, which runs this quant offloaded at 20 tps

by Tamitami2000

4/16/2026 at 7:16:01 PM

I'm running it on dual DGX Sparks.

by bitbckt

4/18/2026 at 3:55:35 AM

I'm interested in your experiences running dual

by manyatoms

4/18/2026 at 12:00:13 AM

which exact model, and how many tokens per second for generation?

by DoctorOetker

4/17/2026 at 4:28:55 AM

I’ve mentioned this as an option in other discussions, but if you don’t care that much about tok/sec, 4x Xeon E7-8890 v4s with 1TB of DDR3 in a supermicro X10QBi will run a 397b model for <$2k (probably closer to $1500). Power use is pretty high per token but the entry price cannot be beat.

Full (non-quantized, non-distilled) DeepSeek runs at 1-2 tok/sec. A model half the size would probably be a little faster. This is also only with the basic NUMA functionality that was in llama.cpp a few months ago, I know they’ve added more interesting distribution mechanisms recently that I haven’t had a chance to test yet.

by jmb99

4/16/2026 at 5:36:17 PM

It doesn't matter how many can run it now, it's about freedom. Having a large open weights model available allows you to do things you can't do with closed models.

by stavros

4/16/2026 at 5:10:51 PM

OpenRouter.

by r-w

4/16/2026 at 5:48:38 PM

Yeah I think there’s benefits to third-party providers being able to run the large models and have stronger guarantees about ZDR and knowing where they are hosted! So Open Weights for even the large models we can’t personally serve on our laptops is still useful.

by mistercheese

4/16/2026 at 5:31:57 PM

If you're running it from OpenRouter, you might as well use Qwen3.6 Plus. You don't need to be picky about a particular model size of 3.6. If you just want the 397b version to save money, just pick a cheaper model like M2.7.

by parsimo2010

4/16/2026 at 9:54:48 PM

The 397B model can be run at home with the weights stored on an SSD (or on 2 SSDs, for double throughput).

Probably too slow for chat, but usable as a coding assistant.

by adrian_b

4/16/2026 at 10:05:12 PM

I think you have that backwards. Agentic coding is way more demanding than simple chat. The request/response loops (tool calling) are much tighter and more numerous, and the context is waaaaay bigger in general.

by xienze

4/17/2026 at 7:25:29 AM

In processing power, but chat is interactive. Agentic coding, you come up with a plan and sign off on it, and then just let it go for a while. It's the difference between speed and latency.

by fragmede

4/17/2026 at 4:42:25 AM

It only has 17b active params, it's a mixture of experts model. So probably a lot more people than you realize!

by blurbleblurble

4/16/2026 at 7:52:36 PM

Running the mxfp4 unsloth quant of qwen3.5-397b-a17b, I get 40 tps prefill, 20tps decode.

AMD threadripper pro 9965WX, 256gb ddr5 5600, rtx 4090.

by ydj

4/16/2026 at 2:08:02 PM

I really wish they released qwen-image 2.0 as open weights.

by guitcastro

4/16/2026 at 2:12:07 PM

Already quantized/converted into a sane format by Unsloth:

https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF

by homebrewer

4/16/2026 at 3:32:41 PM

Unsloth is great for uploading quants quickly to experiment with, but everyone should know that they almost always revise their quants after testing.

If you download the release day quants with a tool that doesn’t automatically check HF for new versions you should check back again in a week to look for updated versions.

Some times the launch day quantizations have major problems which leads to early adopters dismissing useful models. You have to wait for everyone to test and fix bugs before giving a model a real evaluation.

by Aurornis

4/16/2026 at 3:46:54 PM

We re-uploaded Gemma4 4 times - 3 times were due to 20 llama.cpp bug fixes, which we helped solve some as well. The 4th is an official Gemma chat template improvement from Google themselves, so these are out of our hands. All providers had to re-fix their uploads, so not just us.

For MiniMax 2.7 - there were NaNs, but it wasn't just ours - all quant providers had it - we identified 38% of bartowski's had NaNs. Ours was 22%. We identified a fix, and have already fixed ours see https://www.reddit.com/r/LocalLLaMA/comments/1slk4di/minimax.... Bartowski has not, but is working on it. We share our investigations always.

For Qwen3.5 - we shared our 7TB research artifacts showing which layers not to quantize - all provider's quants were not optimal, not broken - ssm_out and ssm_* tensors were the issue - we're now the best in terms of KLD and disk space - see https://www.reddit.com/r/LocalLLaMA/comments/1rgel19/new_qwe...

On other fixes, we also fixed bugs in many OSS models like Gemma 1, Gemma 3, Llama chat template fixes, Mistral, and many more.

It might seem these issues are due to us, but it's because we publicize them and tell people to update. 95% of them are not related to us, but as good open source stewards, we should update everyone.

by danielhanchen

4/16/2026 at 4:36:13 PM

I just wanted to express gratitude to you guys, you do great work. However, it is a little annoying to have to redownload big models though and keeping up with the AI news and community sentiment is a full time job. I wish there was some mechanism somewhere (on your site or Huggingface or something) for displaying feedback or confidence in a model being "ready for general use" before kicking off 100+ GB model downloads.

by evilduck

4/16/2026 at 4:55:42 PM

Hey thanks - yes agreed - for now we do:

1. Split metadata into shard 0 for huge models so 10B is for chat template fixes - however sometimes fixes cause a recalculation of the imatrix, which means all quants have to be re-made

2. Add HF discussion posts on each model talking about what changed, and on our Reddit and Twitter

3. Hugging Face XET now has de-duplication downloading of shards, so generally redownloading 100GB models again should be much faster - it chunks 100GB into small chunks and hashes them, and only downloads the shards which have changed

by danielhanchen

4/17/2026 at 1:37:35 AM

If you would know - is this also why LM Studio and Ollama model downloads often fail with a signature mismatch error?

by ssrshh

4/17/2026 at 8:25:24 AM

Probably yes

by danielhanchen

4/16/2026 at 7:40:56 PM

Ah thanks, I wasn't aware of #3, that should be a huge boon.

by evilduck

4/17/2026 at 8:25:55 AM

Oh yes! This only applies if one uses hf download / snapshot_download - other normal download methods sadly won't have XET

by danielhanchen

4/16/2026 at 5:42:51 PM

Best policy is to just wait a couple of weeks after a major model is released. It's frustrating to have to re-download tens or hundreds of GB every few days, but the quant producers have no choice but to release early and often if they want to maintain their reputation.

Ideally the labs releasing the open models would work with Unsloth and the llama.cpp maintainers in advance to work out the bugs up front. That does sometimes happen, but not always.

by CamperBob2

4/16/2026 at 6:04:29 PM

Yep agreed at least 1 week is a good idea :)

We do get early access to nearly all models, and we do find the most pressing issues sometimes. But sadly some issues are really hard to find and diagnose :(

by danielhanchen

4/16/2026 at 3:57:20 PM

Please publish sha256sums of the merged GGUFs in the model descriptions. Otherwise it's hard to tell if the version we have is the latest.

by sowbug

4/16/2026 at 4:01:22 PM

Yep we can do that probs add a table - in general be post in discussions of model pages - for eg https://huggingface.co/unsloth/MiniMax-M2.7-GGUF/discussions...

HF also provides SHA256 for eg https://huggingface.co/unsloth/MiniMax-M2.7-GGUF/blob/main/U... is 92986e39a0c0b5f12c2c9b6a811dad59e3317caaf1b7ad5c7f0d7d12abc4a6e8

But agreed it's probs better to place them in a table

by danielhanchen

4/16/2026 at 4:12:15 PM

Thanks! I know about HF's chunk checksums, but HF doesn't publish (or possibly even know) the merged checksums.

by sowbug

4/16/2026 at 4:56:42 PM

Oh for multi files? Hmm ok let me check that out

by danielhanchen

4/16/2026 at 5:56:07 PM

Why do you merge the GGUFs? The 50 GB files are more manageable (IMO) and you can verify checksums as you say.

by zargon

4/16/2026 at 7:19:54 PM

I admit it's a habit that's probably weeks out of date. Earlier engines barfed on split GGUFs, but support is a lot better now. Frontends didn't always infer the model name correctly from the first chunk's filename, but once llama.cpp added the models.ini feature, that objection went away.

The purist in me feels the 50GB chunks are a temporary artifact of Hugging Face's uploading requirements, and the authoritative model file should be the merged one. I am unable to articulate any practical reason why this matters.

by sowbug

4/16/2026 at 4:18:37 PM

What do you think about creating a tool which can just patch the template embedded in the .gguf file instead of forcing a re-download? The whole file hash can be checked afterwards.

by dist-epoch

4/16/2026 at 4:58:05 PM

Sadly it's not always chat template fixes :( But yes we now split the first shard as pure metadata (10MB) for huge models - these include the chat template etc - so you only need to download that.

For serious fixes, sadly we have to re-compute imatrix since the activation patterns have changed - this sadly makes the entire quant change a lot, hence you have to re-download :(

by danielhanchen

4/16/2026 at 7:31:38 PM

Appreciate the work of your team very much.

Though chat templates seem like they need a better solution. So many issues, seems quite fragile.

by magicalhippo

4/17/2026 at 8:27:03 AM

Thank you! Agreed on chat template issue

by danielhanchen

4/16/2026 at 10:58:06 PM

Just curious, the fixes are not about weights but about templates, am I right?

by solomatov

4/17/2026 at 8:27:25 AM

Yes so chat templates and the actual implementations

by danielhanchen

4/16/2026 at 3:46:02 PM

Not to mention that almost every model release has some (at least) minor issue in the prompt template and/or the runtime itself, so even if they (not talking unsloth specifically, in general) claim "Day 0 support", do pay extra attention to actual quality as it takes a week or two before issues been hammered out.

by embedding-shape

4/16/2026 at 3:50:00 PM

Yes this is fair - we try our best to communicate issues - I think we're mostly the only ones doing the communication that model A or B has been fixed etc.

We try our best as model distributors to fix them on day 0 or 1, but 95% of issues aren't our issues - as you mentioned it's the chat template or runtime etc

by danielhanchen

4/17/2026 at 2:38:30 PM

I have to ask - what do you run locally on your laptop (model, backend, and agentic cli)?

Feature request:

A leader board with filtering so you can enter your machine specs and it will sort all models along with all the various quantisation and then rank them all - because so far model ranking site either don’t include all available quants, don’t compare apples to apples (ie was one model tested with Claude code while another benchmark done with opencode) etc

Oh - and as bonus, scoring also ranked by which agentic CLI :)

by alfiedotwtf

4/16/2026 at 4:56:52 PM

I don't understand why the open source model providers don't also publish the quantized version?

by fuddle

4/16/2026 at 4:58:39 PM

They sometimes do! Qwen, Google etc do them!

by danielhanchen

4/16/2026 at 7:39:09 PM

Thank you very much for this comment! I was not aware of that.

by i5heu

4/16/2026 at 6:31:31 PM

[dead]

by canarias_mate

4/16/2026 at 5:48:58 PM

Why doesn't Qwen itself release the quantized model? My impression is that quantization is a highly nontrivial process that can degrade the model in non-obvious ways, thus its best handled by people who actually built the model, otherwise the results might be disappointing.

Users of the quantized model might be even made to think that the model sucks because the quantized version does.

by torginus

4/16/2026 at 6:23:34 PM

Model developers release open-weight models for all sorts of reasons, but the most common reason is to share their work with the greater AI research community. Sure, they might allow or even encourage personal and commercial use of the model, but they don't necessarily want to be responsible for end-user support.

An imperfect analogy might be the Linux kernel. Linus publishes official releases as a tagged source tree but most people who use Linux run a kernel that has been tweaked, built, and packaged by someone else.

That said, models often DO come from the factory in multiple quants. Here's the FP8 quant for Qwen3.6 for example: https://huggingface.co/Qwen/Qwen3.6-35B-A3B-FP8

Unsloth and other organizations produce a wider variety of quants than upstream to fit a wider variety of hardware, and so end users can make their own size/quality trade-offs as needed.

by bityard

4/16/2026 at 6:44:18 PM

Quantization is an extraordinarily trivial process. Especially if you're doing it with llama.cpp (which unsloth obviously does).

Qwen did release an fp8 version, which is a quantized version.

by halJordan

4/16/2026 at 3:58:13 PM

I sense that I don't really understand enough of your comment to know why this is important. I hope you can explain some things to me:

- Why is Qwen's default "quantization" setup "bad" - Who is Unsloth? - Why is his format better? What gains does a better format give? What are the downsides of a bad format? - What is quantization? Granted, I can look up this myself, but I thought I'd ask for the full picture for other readers.

by sander1095

4/16/2026 at 4:12:42 PM

Oh hey - we're actually the 4th largest distributor of OSS AI models in GB downloads - see https://huggingface.co/unsloth

https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs is what might be helpful. You might have heard 1bit dynamic DeepSeek quants (we did that) - not all layers can be 1bit - important ones are in 8bit or 16bit, and we show it still works well.

by danielhanchen

4/16/2026 at 4:20:49 PM

The default Qwen "quantization" is not "bad", it's "large".

Unsloth releases lower-quality versions of the model (Qwen in this case). Think about taking a 95% quality JPEG and converting it to a 40% quality JPEG.

Models are quantized to lower quality/size so they can run on cheaper/consumer GPUs.

by dist-epoch

4/17/2026 at 8:27:49 AM

Love the JPEG analogy :)

by danielhanchen

4/16/2026 at 4:30:51 PM

hey you can do a bit research yourself and tell your results to us!

by est

4/16/2026 at 2:40:30 PM

How much VRAM does it need? I haven't run a local model yet, but I did recently pick up a 16GB GPU, before they were discontinued.

by palmotea

4/16/2026 at 2:49:05 PM

It's on the page:

  Precision  Quantization Tag File Size
  1-bit      UD-IQ1_M         10 GB
  2-bit      UD-IQ2_XXS       10.8 GB
             UD-Q2_K_XL       12.3 GB
  3-bit      UD-IQ3_XXS       13.2 GB
             UD-Q3_K_XL       16.8 GB
  4-bit      UD-IQ4_XS        17.7 GB
             UD-Q4_K_XL       22.4 GB
  5-bit      UD-Q5_K_XL       26.6 GB
  16-bit     BF16             69.4 GB

by WithinReason

4/16/2026 at 3:28:07 PM

Additional VRAM is needed for context.

This model is a MoE model with only 3B active parameters per expert which works well with partial CPU offload. So in practice you can run the -A(N)B models on systems that have a little less VRAM than you need. The more you offload to the CPU the slower it becomes though.

by Aurornis

4/16/2026 at 3:52:53 PM

Isn't that some kind of gambling if you offload random experts onto the CPU?

Or is it only layers but that would affect all Experts?

by Glemllksdf

4/16/2026 at 4:22:44 PM

Pretty sure all partial offload systems I’ve seen work by layers, but there might be something else out there.

by dragonwriter

4/17/2026 at 2:55:41 PM

Speculative decoding is already gambling.

by namibj

4/16/2026 at 4:34:39 PM

I really want to know what does M, K, XL XS mean in this context and how to choose.

I searched all unsloth doc and there seems no explaination at all.

by est

4/16/2026 at 8:14:56 PM

Q4_K is a type of quantization. It means that all weights will be at a minimum 4bits using the K method.

But if you're willing to give more bits to only certain important weights, you get to preserve a lot more quality for not that much more space.

The S/M/L/XL is what tells you how many tensors get to use more bits.

The difference between S and M is generally noticeable (on benchmarks). The difference between M and L/XL is less so, let alone in real use (ymmv).

Here's an example of the contents of a Q4_K_:

    S
    llama_model_loader: - type  f32:  392 tensors
    llama_model_loader: - type q4_K:  136 tensors
    llama_model_loader: - type q5_0:   43 tensors
    llama_model_loader: - type q5_1:   17 tensors
    llama_model_loader: - type q6_K:   15 tensors
    llama_model_loader: - type q8_0:   55 tensors
    M
    llama_model_loader: - type  f32:  392 tensors
    llama_model_loader: - type q4_K:  106 tensors
    llama_model_loader: - type q5_0:   32 tensors
    llama_model_loader: - type q5_K:   30 tensors
    llama_model_loader: - type q6_K:   15 tensors
    llama_model_loader: - type q8_0:   83 tensors
    L
    llama_model_loader: - type  f32:  392 tensors
    llama_model_loader: - type q4_K:  106 tensors
    llama_model_loader: - type q5_0:   32 tensors
    llama_model_loader: - type q5_K:   30 tensors
    llama_model_loader: - type q6_K:   14 tensors
    llama_model_loader: - type q8_0:   84 tensors

by tredre3

4/16/2026 at 4:45:02 PM

They are different quantization types, you can read more here https://huggingface.co/docs/hub/gguf#quantization-types

by huydotnet

4/16/2026 at 10:26:57 PM

Just start with q4_k_m and figure out the rest later.

by arcanemachiner

4/16/2026 at 3:00:04 PM

Thanks! I'd scanned the main content but I'd been blind to the sidebar on the far right.

by palmotea

4/16/2026 at 4:19:55 PM

"16-bit BF16 69.4 GB"

Is that (BF16) a 16-bit float?

by JKCalhoun

4/16/2026 at 10:12:18 PM

The IEEE standard FP16 is an older 16-bit format, which has balanced exponent and significand sizes.

It has been initially supported by GPUs, where it is useful especially for storing the color components of pixels. For geometry data, FP32 is preferred.

In CPUs, some support has been first added in 2012, in Intel Ivy Bridge. Better support is provided in some server CPUs, and since next year also in the desktop AMD Zen 6 and Intel Nova Lake.

BF16 is a format introduced by Google, intended only for AI/ML applications, not for graphics, so initially it was implemented in some of the Intel server CPUs and only later in GPUs. Unlike FP16, which is balanced, BF16 has great dynamic range, but very low precision. This is fine for ML but inappropriate for any other applications.

Nowadays, most LLMs are trained preponderantly using BF16, with a small number of parameters using FP32, for higher precision.

Then from the biggest model that uses BF16, smaller quantized models are derived, which use 8 bits or less per parameter, trading off accuracy for speed.

by adrian_b

4/16/2026 at 4:37:40 PM

Yes, it's a "Brain float", basically an ordinary 32-bit float with the low 16 mantissa bits cut off. Exact same range as fp32, lower precision, and not the same as the other fp16, which has less exponent and more mantissa.

by mtklein

4/16/2026 at 4:35:52 PM

https://en.wikipedia.org/wiki/Bfloat16_floating-point_format

Yes, however it’s a different format from standard fp16, it trades precision for greater dynamic range.

by Gracana

4/16/2026 at 4:49:09 PM

yes, it has 8 exponent bits like float32 instead of 6 like float16

by WithinReason

4/16/2026 at 3:34:38 PM

Pick a decent quant (4-6KM) then use llama-fit-params and try it yourself to see if it's giving you what you need.

by tommy_axle

4/16/2026 at 6:27:45 PM

I habe found llama-fit sometimes just selects a way to conservative load with VRAM to spare.

by gunalx

4/16/2026 at 2:49:26 PM

Should run just fine with CPU-MoE and mmap, but inference might be a bit slow if you have little RAM.

by zozbot234

4/16/2026 at 3:33:09 PM

You can run 25-30b model easily if you use Q3 or Q4 quants and llama-server with a pretty long list of options.

by Ladioss

4/16/2026 at 2:43:59 PM

If you have to ask then your GPU is too small.

With 16 GB you'll be only able to run a very compressed variant with noticable quality loss.

by trvz

4/16/2026 at 3:17:14 PM

Not true. With a MoE, you can offload quite a bit of the model to CPU without losing a ton of performance. 16GB should be fine to run the 4-bit (or larger) model at speeds that are decent. The --n-cpu-moe parameter is the key one on llama-server, if you're not just using -fit on.

by coder543

4/16/2026 at 6:57:35 PM

I've been way out of the local game for a while now, what's the best way to run models for a fairly technical user? I was using llama.cpp in the command line before and using bash files for prompts.

by boppo1

4/16/2026 at 10:17:17 PM

Running llama-server (it belongs to llama.cpp) starts a HTTP server on a specified port.

You can connect to that port with any browser, for chat.

Or you can connect to that port with any application that supports the OpenAI API, e.g. a coding assistant harness.

by adrian_b

4/16/2026 at 2:46:45 PM

> If you have to ask then your GPU is too small.

What's the minimum memory you need to run a decent model? Is it pretty much only doable by people running Macs with unified memory?

by palmotea

4/16/2026 at 3:02:15 PM

It's worth noting now there are other machines than just Apple that combine a powerful SoC with a large pool of unified memory for local AI use:

> https://www.dell.com/en-us/shop/cty/pdp/spd/dell-pro-max-fcm...

> https://marketplace.nvidia.com/en-us/enterprise/personal-ai-...

> https://frame.work/products/desktop-diy-amd-aimax300/configu...

etc.

But yes, a modern SoC-style system with large unified memory pool is still one of the best ways to do it.

by giobox

4/16/2026 at 3:00:55 PM

32 GiB of VRAM is possible to acquire for less than $1000 if you go for the Arc Pro B70. I have two of them. The tokens/sec is nowhere near AMD or NVIDIA high end, but its unexpectedly kind of decent to use. (I probably need to figure out vLLM though as it doesn't seem like llama.cpp is able to do them justice even seemingly with split mode = row. But still, 30t/s on Gemma 4 (on 26B MoE, not dense) is pretty usable, and you can do fit a full 256k context.)

When I get home today I totally look forward to trying the unsloth variants of this out (assuming I can get it working in anything.) I expect due to the limited active parameter count it should perform very well. It's obviously going to be a long time before you can run current frontier quality models at home for less than the price of a car, but it does seem like it is bound to happen. (As long as we don't allow general purpose computers to die or become inaccessible. Surely...)

by jchw

4/16/2026 at 4:24:31 PM

NVIDIA 5070 Ti can run Gemma 4 26B at 4-bit at 120 tk/s.

Arc Pro B70 seems unexpectedely slow? Or are you using 8-bit/16-bit quants.

by dist-epoch

4/16/2026 at 5:17:57 PM

Unfortunately it really is running this slow with Llama.cpp, but of course that's with Vulkan mode. The VRAM capacity is definitely where it shines, rather than compute power. I am pretty sure that this isn't really optimal use of the cards, especially since I believe we should be able to get decent, if still sublinear, scaling with multiple cards. I am not really a machine learning expert, I'm curious to see if I can manage to trace down some performance issues. (I've already seen a couple issues get squashed since I first started testing this.)

I've heard that vLLM performs much better, scaling particularly better in the multi GPU case. The 4x B70 setup may actually be decent for the money given that, but probably worth waiting on it to see how the situation progresses rather than buying on a promise of potential.

A cursory Google search does seem to indicate that in my particular case interconnect bandwidth shouldn't actually be a constraint, so I doubt tensor level parallelism is working as expected.

by jchw

4/17/2026 at 12:05:25 AM

Parallelism can be tricky and always has a cost, but don't discount the 3090 which is more expensive these days in that price bracket.

3090 llama.cpp (container in VM)

    unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL  105 t/s
    unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL  103 t/s

Still slow compaired to the

    ggml-org/gpt-oss-20b-GGUF 206 t/s

But on my 3x 1080 Ti 1x TITAN V getto machine I learned that multi gpu takes a lot of tuning no matter what. With the B70, where Vulkan has the CPU copy problem, and SYCL doesn't have a sponsor or enough volunteers, it will probably take a bit of profiling on your part.

There are a lot of variables, but PCIe bus speed doesn't matter that much for inference, but the internal memory bandwidth does, and you won't match that with PCIe ever.

To be clear, multicard Vulkan and absolutely SYCL have a lot of optimizations that could happen, but the only time two GPUs are really faster for inference is when one doesn't have enough ram to fit the entire model.

A 3090 has 936.2 GB/s of (low latency) internal bandwidth, while 16xPCIe5 only has 504.12, may have to be copied through the CPU, have locks, atomic operations etc...

For LLM inference, the bottleneck just usually going to be memory bandwidth which is why my 3090 is so close to the 5070ti above.

LLM next token prediction is just a form of autoregressive decoding and will primarily be memory bound.

As I haven't used the larger intel GPUs I can't comment on what still needs to be optimized, but just don't expect multiple GPUs to increase performance without some nvlink style RDMA support _unless_ your process is compute and not memory bound.

by nyrikki

4/16/2026 at 3:05:42 PM

New versions of llama.cpp have experimental split-tensor parallelism, but it really only helps with slow compute and a very fast interconnect, which doesn't describe many consumer-grade systems. For most users, pipeline parallelism will be their best bet for making use of multi-GPU setups.

by zozbot234

4/16/2026 at 3:45:15 PM

Yeah, I was doing split tensor and it seemed like a wash. The Arc B70s are not huge on compute.

Right now I'm only able to run them in PCI-e 5.0 x8 which might not be sufficient. But, a cheap older Xeon or TR seems silly since PCI-e 4.0 x16 isn't theoretically more bandwidth than PCI-e 5.0 x8. So it seems like if that is really still bottlenecked, I'll just have to bite the bullet and set up a modern HEDT build. With RAM prices... I am not sure there is a world where it could ever be worth it. At that point, seems like you may as well go for an obscenely priced NVIDIA or AMD datacenter card instead and retrofit it with consumer friendly thermal solutions. So... I'm definitely a bit conflicted.

I do like the Arc Pro B70 so far. Its not a performance monster, but it's quiet and relatively low power, and I haven't run into any instability. (The AMDGPU drivers have made amazing strides, but... The stability is not legendary. :)

I'll have to do a bit of analysis and make sure there really is an interconnect bottleneck first, versus a PEBKAC. Could be dropping more lanes than expected for one reason or another too.

by jchw

4/16/2026 at 3:52:01 PM

You could fit your HEDT with minimum RAM and a combination of Optane storage (for swapping system RAM with minimum wear) and fast NAND (for offloading large read-only data). If you have abundant physical PCIe slots it ought to be feasible.

by zozbot234

4/16/2026 at 2:57:42 PM

Macs with unified memory are economical in terms of $/GB of video memory, and they match an optimized/home built GPU setup in efficiency (W/token), but they are slow in terms of absolute performance.

With this model, since the number of active parameters is low, I would think that you would be fine running it on your 16GB card, as long as you have, say 32GB of regular system memory. Temper your expectations about speed with this setup, as your system memory and CPU are multiple times slower than the GPU, so when layers spill over you will slow down.

To avoid this, there's no need to buy a Mac -- a second 16GB GPU would do the trick just fine, and the combined dual GPU setup will likely be faster than a cheap mac like a Mac mini. Pay attention to your PCIe slots, but as long as you have at least an x4 slot for the second GPU, you'll be fine (LLM inference doesn't need x8 or x16).

by angoragoats

4/16/2026 at 3:03:26 PM

My Mac Studio with 96GB of RAM is maybe just at the low end of passable. It's actually extremely good for local image generation. I could somewhat replace something like Nano Banana comfortably on my machine.

But I don't need Nano Banana very much, I need code. While it can, there's no way I would ever opt to use a local model on my machine for code. It makes so much more sense to spend $100 on Codex, it's genuinely not worth discussing.

For non-thinking tasks, it would be a bit slower, but a viable alternative for sure.

by TechSquidTV

4/16/2026 at 4:29:28 PM

You just need to adjust your workflow to use the smaller models for coding. It's primarily just a case of holding them wrong if you end up with worse outputs.

by slopinthebag

4/16/2026 at 2:57:49 PM

It’s also doable with AMD Strix Halo.

by layer8

4/16/2026 at 2:47:52 PM

A bit like asking how long is a piece of string.

by bfivyvysj

4/16/2026 at 3:01:59 PM

It's twice as long as from one end to the middle.

by latentsea

4/16/2026 at 2:52:06 PM

More like "about how long of a string do I need to run between two houses in the densest residential neighborhood of single-family homes in the US?"

by palmotea

4/16/2026 at 2:51:56 PM

No, GP is excessively restrictive. Llama.cpp supports RAM offloading out of the box.

It's going to be slower than if you put everything on your GPU but it would work.

And if it's too slow for your taste you can try the quantized version (some Q3 variant should fit) and see how well it works for you.

by littlestymaar

4/16/2026 at 2:51:43 PM

Obviously going to depend on your definition of "decent". My impression so far is that you will need between 90GB to 100GB of memory to run medium sized (31B dense or ~110B MoE) models with some quantization enabled.

by utilize1808

4/16/2026 at 2:57:01 PM

I’m running Gemma4 31B (Q8) on my 2 4090s (48GB) with no problem.

by cjbgkagh

4/16/2026 at 3:56:36 PM

I have the same setup but tried paperclip ai with it and it seems to me that either i'm unable to setup it properly or multiply agents struggle with this setup. Especially as it seems that paperclip ai and opencode (used for connection) is blowing up the context to 20-30k

Any tips around your setup running this?

I use lmstudio with default settings and prioritization instead of split.

by Glemllksdf

4/16/2026 at 5:14:57 PM

I asked AI for help setting it up. I use 128k context for 31B and 256k context for 26B4A. Ollama worked out of the box for me but I wanted more control with llama.cpp.

My command for llama-server:

llama-server -m /models/gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf -ngl 99 -sm layer -ts 10,12 --jinja --flash-attn on --cont-batching -np 1 -c 262144 -b 4096 -ub 512 -ctk q8_0 -ctv q8_0 --host 0.0.0.0 --port 8080 --timeout 18000

by cjbgkagh

4/16/2026 at 2:59:34 PM

Aren't 4bits model decent? Since, this is an MOE model, I'm assuming it should have respectable tk/s, similar to previous MOE models.

by FusionX

4/16/2026 at 6:31:04 PM

Running q3 xss with full and quantizised context as options on a 16gb gpu and still has pretty decent quality and fitting fine with up to 64k context.

by gunalx

4/16/2026 at 6:45:53 PM

There's absolutely nothing wrong it insane with a safetensors file. It might be less convenient than a single file gguf. But that's just laziness not insanity

by halJordan

4/17/2026 at 1:08:48 PM

Quantization is the major appeal, we can't all run full precision

by Zetaphor

4/17/2026 at 12:56:44 PM

As long as they're not releasing 32bit .pt :)

by lta

4/16/2026 at 2:35:41 PM

So I can use this in claude code with `ollama run claude`?

by txtsd

4/16/2026 at 4:38:09 PM

https://sleepingrobots.com/dreams/stop-using-ollama/

by nunodonato

4/17/2026 at 1:29:09 AM

Thank you, I had no idea ollama was so shady! I will start using llama.cpp directly.

by txtsd

4/16/2026 at 3:31:01 PM

More like `ollama launch claude --model qwen3.6:latest`

Also you need to check your context size, Ollama default to 4K if <24 Gb of VRAM and you need 64K minimum if you want claude to be able to at least lift a finger.

by Ladioss

4/16/2026 at 5:47:25 PM

If you're on a Mac, use the MLX backend versions which are considerably faster than the GGML based versions (including llama.cpp) and you don't need to fiddle with the context size. The models are `qwen3.6:35b-a3b-nvfp4`, `qwen3.6:35b-a3b-mxfp8`, and `qwen3.6:35b-a3b-mlx-bf16`.

by Patrick_Devine

4/17/2026 at 11:11:38 AM

I was comparing various models at M5 Pro 48GB RAM MLX vs GGUF and found that MLX models have a higher time to first token (sometimes by an order of magnitude) while tokens/sec and memory usage is same as GGUF.

Gemma 3 27B q4:

* MLX: 16.7 t/s, 1220ms ttft

* GGUF: 16.4 t/s, 760ms ttft

Gemma 4 31B q8:

* MLX: 8.3 t/s, 25000ms ttft

* GGUF: 8.4 t/s, 1140ms ttft

Gemma 4 A4B q8:

* MLX: 52 t/s, 1790ms ttft

* GGUF: 51 t/s, 380ms ttft

All comparisons done in LM Studio, all versions of everything are the latest.

by egorfine

4/16/2026 at 6:29:33 PM

I only have 16GB VRAM, and my system uses ~4GB from that. What are my options? I got this one: `Qwen3.6-35B-A3B-UD-IQ2_XXS.gguf`

by txtsd

4/17/2026 at 9:22:44 AM

My system has 16 Gb VRAM / 32 Gb RAM, and ollama runs qwen3.6:latest at decent speed just fine. The 35b model is a moe, so I guess the whole model is offloaded.

by Ladioss

4/16/2026 at 2:45:04 PM

have you found a model that does this with usable speeds on an M2/M3?

by pj_mukh

4/16/2026 at 2:53:57 PM

On a M4 MBP ollama's qwen3.5:35b-a3b-coding-nvfp4 runs incredibly fast when in the claude/codex harness. M2/M3 should be similar.

It's incomparably faster than any other model (i.e. it's actually usable without cope). Caching makes a huge difference.

by postalcoder

4/17/2026 at 2:14:45 PM

> converted into a sane format

Having implemented a GGUF parser, I'd beg to differ on the “sane format” qualifier.

by littlestymaar

4/16/2026 at 2:36:17 PM

lmao they are so fast yooo

by terataiijo

4/16/2026 at 2:40:39 PM

Yes. How do they do it? Literally they must have PagerDuty set up to alert the team the second one of the labs releases anything.

by ttul

4/16/2026 at 2:43:44 PM

They obviously collaborate with some of the labs prior to the official release date.

by beernet

4/16/2026 at 2:46:35 PM

That... is a more plausible explanation I didn't think of.

by sigbottle

4/16/2026 at 3:07:24 PM

Yes we collab with them!

by danielhanchen

4/16/2026 at 6:13:04 PM

Sorry this is a bit of a tangent, but I noticed you also released UD quants of ERNIE-Image the same day it released, which as I understand requires generating a bunch of images. I've been working to do something similar with my CLI program ggufy, and was curious of you had any info you could share on the kind of compute you put into that, and if you generate full images or look at latents?

by qskousen

4/17/2026 at 8:34:41 AM

Yes we have started doing diffusion GGUFs but it's in it's infancy :) But yes we do generate images to test quants out!

by danielhanchen

4/16/2026 at 2:46:03 PM

Is quantization a mostly solved pipeline at this point? I thought that architectures were varied and weird enough where you can't just click a button, say "go optimize these weights", and go. I mean new models have new code that they want to operate on, right, so you'd have to analyze the code and insert the quantization at the right places, automatically, then make sure that doesn't degrade perf?

Maybe I just don't understand how quantization works, but I thought quantization was a very nasty problem involving a lot of plumbing

by sigbottle

4/17/2026 at 1:26:16 AM

that is true. gguf does not support any Architecture.

for the most recent example, as of April 16, 2026 (today)

Turboquant isnt still added to GGUF

by Readerium

4/16/2026 at 3:01:50 PM

Bad QA :/ They had a bunch of broken quantizations in the last releases

by bildung

4/16/2026 at 3:13:52 PM

1. Gemma-4 we re-uploaded 4 times - 3 times were 10-20 llama.cpp bug fixes - we had to notify people to upload the correct ones. The 4th is an official Gemma chat template improvement from Google themselves.

2. Qwen3.5 - we shared our 7TB research artifacts showing which layers not to quantize - all provider's quants were under optimized, not broken - ssm_out and ssm_* tensors were the issue - we're now the best in terms of KLD and disk space

3. MiniMax 2.7 - we swiftly fixed it due to NaN PPL - we found the issue in all quants regardless of provider - so it affected everyone not just us. We wrote a post on it, and fixed it - others have taken our fix and fixed their quants, whilst some haven't updated.

Note we also fixed bugs in many OSS models like Gemma 1, Gemma 3, Llama chat template fixes, Mistral, and many more.

Unfortunately sometimes quants break, but we fix them quickly, and 95% of times these are out of our hand.

We swiftly and quickly fix them, and write up blogs on what happened. Other providers simply just take our blogs and fixes and re-apply, re-use our fixes.

by danielhanchen

4/16/2026 at 3:21:06 PM

Fair enough, appreciate the detailed response! Can you elaborate why other quantizations weren't affected (e.g. bartowski)? Simply because they were straight Q4 etc. for every layer?

by bildung

4/16/2026 at 3:26:44 PM

No Bartowski's are more affected - (38% NaN) than ours (22%) - for MiniMax 2.7 see https://www.reddit.com/r/LocalLLaMA/comments/1slk4di/minimax...

We already fixed ours. Bart hasn't yet but is still working on it following our findings.

blk.61.ffn_down_exps in Q4_K or Q5_K failed - it must be in Q6_K otherwise it overflows.

For the others, yes layers in some precision don't work. For eg Qwen3.5 ssm_out must be minimum Q4-Q6_K.

ssm_alpha and ssm_beta must be Q8_0 or higher.

Again Bart and others apply our findings - see https://www.reddit.com/r/LocalLLaMA/comments/1rgel19/new_qwe...

by danielhanchen

4/16/2026 at 3:34:00 PM

Thanks again, TIL

by bildung

4/16/2026 at 3:52:12 PM

Thanks!

by danielhanchen

4/16/2026 at 3:41:14 PM

Thanks for all the amazing work Daniel. I remember you guys being late to OH because you were working on weights released the night before - and it's great to see you guys keep up the speed!

by rohansood15

4/16/2026 at 3:52:06 PM

Oh thanks haha :) We try our best to get model releases out the door! :) Hope you're doing great!

by danielhanchen

4/16/2026 at 2:55:12 PM

yeah and often their quants are broken. They had to update their Gemma4 quants like 4 times in the past 2 weeks.

by ekianjo

4/16/2026 at 3:09:10 PM

No it's not our fault - re our 4 uploads - the first 3 are due to llama.cpp fixing bugs - this was out of our control (we're llama.cpp contributors, but not the main devs) - we could have waited, but it's best to update when multiple (10-20) bugs are fixed.

The 4th is Google themselves improving the chat template for tool calling for Gemma.

https://github.com/ggml-org/llama.cpp/issues/21255 was another issue CUDA 13.2 was broken - this was NVIDIA's CUDA compiler itself breaking - fully out of our hands - but we provided a solution for it.

by danielhanchen

4/16/2026 at 2:07:36 PM

Nice release from the Qwen team.

Small openweight coding models are, imho, the way to go for custom agents tailored to the specific needs of dev shops that are restricted from accessing public models.

I'm thinking about banking and healthcare sector development agencies, for example.

It's a shame this remains a market largely overlooked by Western players, Mistral being the only one moving in that direction.

by mtct88

4/16/2026 at 2:36:42 PM

> It's a shame this remains a market largely overlooked by Western players, Mistral being the only one moving in that direction.

I've said in a recent comment that Mistral is the only one of the current players who appear to be moving towards a sustainable business - all the other AI companies are simply looking for a big payday, not to operate sustainably.

by lelanthran

4/16/2026 at 6:55:42 PM

Metawith the llama series as well,they just didn't manage to keep upping the game after and with llama4.

by gunalx

4/16/2026 at 3:36:39 PM

I play with the small open weight models and I disagree. They are fun, but they are not in the same class as hosted models running on big hardware.

If some organization forbade external models they should invest in the hardware to run bigger open models. The small models are a waste of time for serious work when there are more capable models available.

by Aurornis

4/17/2026 at 1:07:44 AM

Most organizations aren't going to need the wide breadth of capabilities of the frontier models. They're risk averse and LLMs are non-deterministic, so use cases are typically more tightly scoped to tasks that involve nuanced classification that small models can easily handle even if it takes a little fine-tuning on your organizations data.

by Zetaphor

4/16/2026 at 2:14:26 PM

I agree with the sentiment, but these models aren't suited for that. You can run much bigger models on prem with ~100k of hardware, and those can actually be useful in real-world tasks. These small models are fun to play with, but are nowhere close to solving the needs of a dev shop working in healthcare or banking, sadly.

by NitpickLawyer

4/17/2026 at 2:03:20 PM

100k is a lot of money for a software agency where I come from.

by mtct88

4/16/2026 at 2:26:17 PM

How true is this? How does a regulated industry confirm the model itself wasn't trained with malicious intent?

by smrtinsert

4/16/2026 at 2:47:07 PM

Why would it matter if the model is trained with malicious intent? It's a pure function. The harness controls security policies.

by ndriscoll

4/16/2026 at 6:32:30 PM

Much like a developer can insert a backdoor as a "bug" so can an LLM that was trained to do it.

One way you could probably do it is by identifying a commonly used library that can be misused in a way that would allow some kind of time-of-check to time-of-use (TOCTOU) exploit. Then you train the LLM to use the library incorrectly in this way.

by coppsilgold

4/16/2026 at 2:10:06 PM

I love the idea of building competitor to open weight models but damn is this an expensive game to play

by kennethops

4/17/2026 at 12:38:35 AM

It is, but think about how advances in computing technology have made that power available over time. A Raspberry Pi is almost 5 times more powerful than the Cray-1.

Granted, these next couple of years are going to suck because of the AI Component Drought, but progress marches on and the power and price of running today's frontier models will be affordable to mere mortals in time. Obviously we've hit the wall with Moore's law and other factors but this will not always be out of reach.

by pstuart

4/16/2026 at 3:39:56 PM

Related interesting find on Qwen.

"Qwen's base models live in a very exam-heavy basin - distinct from other base models like llama/gemma. Shown below are the embeddings from randomly sampled rollouts from ambiguous initial words like "The" and "A":"

https://xcancel.com/N8Programs/status/2044408755790508113

by alecco

4/17/2026 at 4:39:37 PM

This makes a lot of my experience with Qwen make sense. I’ve watched all the benchmarks imply how close it should be to various GPT or Claude releases, but in my own use chatting with it or trying to get it do agentic tasks it was nowhere near as smart as even GPT-3.5 for example. Meanwhile Gemma 4 casually dropped and even the 4B models were performing better than Qwen 3.5 MOE in my chats. Benchmaxxing.

by nxtfari

4/16/2026 at 8:07:09 PM

I have been using Qwen3.5-35B-A3B a lot in local testing, and it is by far the most capable model that could fit on my machine. I think quantization technology has really upped its game around these models, and there were two quants that blew me away

Mudler APEX-I-Quality. then later I tried Byteshape Q3_K_S-3.40bpw

Both made claims that seemed too good to be true, but I couldn't find any traces of lobotomization doing long agent coding loops. with the byteshape quant I am up to 40+ t/s which is a speed that makes agents much more pleasant. On an rtx 3060 12GB and 32GB of system ram, I went from slamming all my available memory to having like 14GB to spare.

by kanemcgrath

4/17/2026 at 11:43:47 AM

Unfortunately, llama.cpp quantization technology has been stagnant for two years. The main quantization developer left or was kicked out of llama.cpp due to an attribution dispute. He created his own fork ik_llama.cpp where he has made multiple new and better quants.

unsloth and byteshape are just using and highlighting features that have been available the whole time. I am very invested in figuring out a solution to this dispute, or some way to get the new quants upstreamed.

by Hugsun

4/16/2026 at 11:42:18 PM

Now that I have tried out on a few tasks, Qwen3.6 is a huge jump in capability. It can make improvements to a project that qwen3.5 always struggled with.

by kanemcgrath

4/18/2026 at 1:31:56 AM

Could you share more about your config? I've also got a 3060 12gb and 64gb of ram, but I've never got local models running well enough to be useful

by burgertea

4/17/2026 at 7:44:50 AM

What can and what can't it do compared to Codex and CC?

by edg5000

4/17/2026 at 11:26:34 AM

who do you compare it against qwen3.5 27b?

by mettamage

4/17/2026 at 8:12:08 PM

I haven't ran 27b that much because it only runs at like 2 tokens/sec on my computer.

by kanemcgrath

4/16/2026 at 11:44:00 PM

Which one is best?

by jadbox

4/17/2026 at 12:24:49 AM

I would say byteshape is smaller and faster, I can’t really notice a quality difference. But I haven’t used it as much as I only started using it a few days ago.

by kanemcgrath

4/16/2026 at 2:12:35 PM

I recall a Qwen exec posted a public poll on Twitter, asking which model from Qwen3.6 you want to see open-sourced; and the 27b variant was by far the most popular choice. Not sure why they ignored it lol.

by armanj

4/16/2026 at 2:38:37 PM

The 27B model is dense. Releasing a dense model first would be terrible marketing, whereas 35A3B is a lot smarter and more quick-witted by comparison!

by zozbot234

4/16/2026 at 2:51:59 PM

Each has it's pros and cons. Dense models of equivalent total size obviously do run slower if all else is equal, however, the fact is that 35A3B is absolutely not 'a lot smarter'... in fact, if you set aside the slower inference rates, Qwen3.5 27B is arguably more intelligent and reliable. I use both regularly on a Strix Halo system... the Just see the comparison table here: https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF . The problem that you have to acknowledge if running locally (especially for coding tasks) is that your primary bottleneck quickly becomes prompt processing (NOT token generation) and here the differences between dense and MOE are variable and usually negligible.

by arxell

4/16/2026 at 4:25:13 PM

Could you explain why prompt processing is the bottle neck please? I've seen this behavior but I don't understand why.

by Mikealcl

4/16/2026 at 4:38:58 PM

You should be able to save a lot on prefill by stashing KV-cache shared prefixes (since KV-cache for plain transformers is an append-only structure) to near-line bulk storage and fetching them in as needed. Not sure why local AI engines don't do this already since it's a natural extension of session save/restore and what's usually called prompt caching.

by zozbot234

4/16/2026 at 10:56:14 PM

if I understand you correctly, this is essentially what vllm does with their paged cache, if I’ve misunderstood I apologize.

by FuckButtons

4/16/2026 at 11:08:47 PM

Paged Attention is more of a low-level building block, aimed initially at avoiding duplication of shared KV-cache prefixes in large-batch inference. But you're right that it's quite related. The llama.cpp folks are still thinking about it, per a recent discussion from that project: https://github.com/ggml-org/llama.cpp/discussions/21961

by zozbot234

4/16/2026 at 4:39:38 PM

I was hoping this would be the model to replace our Qwen3.5-27B, but the difference is marginally small. Too risky, I'll pass and wait for the release of a dense version.

by nunodonato

4/16/2026 at 4:23:01 PM

"…whereas 35A3B is a lot smarter…"

Must. Parse. Is this a 35 billion parameter model that needs only 3 billion parameters to be active? (Trying to keep up with this stuff.)

EDIT: A later comment seems to clarify:

"It's a MoE model and the A3B stands for 3 Billion active parameters…"

by JKCalhoun

4/16/2026 at 6:47:55 PM

That makes no sense. If you were just going to release the "more hype-able because it's quicker" model then why have a a poll.

by halJordan

4/16/2026 at 2:51:00 PM

What? 35B-A3B is not nearly as smart as 27B.

by Miraste

4/16/2026 at 8:50:10 PM

One interesting thing about Qwen3 is that looking at the benchmarks, the 35B-A3B models seem to be only a bit worse than the dense 27B ones. This is very different from Gemma 4, where the 26B-A4B model is much worse on several benchmarks (e.g. Codeforces, HLE) than 31B.

by stratos123

4/16/2026 at 9:19:02 PM

> This is very different from Gemma 4, where the 26B-A4B model is much worse on several benchmarks (e.g. Codeforces, HLE) than 31B.

Wouldn't you totally expect that, since 26A4B is lower on both total and active params? The more sensible comparison would pit Qwen 27B against Gemma 31B and Gemma 26A4B against Qwen 35A3B.

by zozbot234

4/17/2026 at 11:49:29 AM

They're comparing Qwen's moe vs dense (smaller difference) against Gemma's moe vs dense (bigger difference). Your proposed alternative misses the point.

by Hugsun

4/17/2026 at 12:04:09 PM

Gemma's dense is bigger than its moe's total parameters. You could totally expect the moe to do terribly by comparison.

by zozbot234

4/16/2026 at 2:56:41 PM

yeah the 27B feels like something completely different. If you use it on long context tasks it performs WAY better than 35b-a3b

by ekianjo

4/16/2026 at 3:27:58 PM

I've been telling analysts/investors for a long time that dense architectures aren't "worse" than sparse MoEs and to continue to anticipate the see-saw of releases on those two sub-architectures. Glad to continuously be vindicated on this one.

For those who don't believe me. Go take a look at the logprobs of a MoE model and a dense model and let me know if you can notice anything. Researchers sure did.

by Der_Einzige

4/17/2026 at 10:36:36 AM

Dense is (much) worse in terms of training budget. At inference time, dense is somewhat more intelligent per bit of VRAM, but much slower, so for a given compute budget it's still usually worse in terms of intelligence-per-dollar even ignoring training cost. If you're willing to spend more you're typically better off training and running a larger sparse model rather than training and running a dense one.

Dense is nice for local model users because they only need to serve a single user and VRAM is expensive. For the people training and serving the models, though, dense is really tough to justify. You'll see small dense models released to capitalize on marketing hype from local model fans but that's about it. No one will ever train another big dense model: Llama 3.1 405B was the last of its kind.

by reissbaker

4/17/2026 at 2:42:23 PM

You want to take bets on this? I'm willing to bet 500USD that an open access dense model of at least 300B is released by some lab within 3 years.

by Der_Einzige

4/17/2026 at 3:55:45 AM

MoE isn't inherently better, but I do think it's still an under explored space. When your sparse model can do 5 runs on the same prompt in the same time as a dense model takes to generate one, there opens up all sorts of interesting possibilities.

by naasking

4/16/2026 at 2:55:35 PM

Yes.

by zkmon

4/16/2026 at 10:58:20 PM

Based on the release schedule of 3.5 previously, my optimistic take is that they distill the small models from the 397B, and it is much faster to distill a sparse A3B model. Hopefully the other variants will be released in the coming days.

by throwdbaaway

4/16/2026 at 2:46:58 PM

Probably coming next

by arunkant

4/16/2026 at 2:51:58 PM

I'm guessing 3.5-27b would beat 3.6-35b. MoE is a bad idea. Because for the same VRAM 27b would leave a lot more room, and the quality of work directly depends on context size, not just the "B" number.

by zkmon

4/16/2026 at 2:59:47 PM

MoE is not a bad idea for local inference if you have fast storage to offload to, and this is quickly becoming feasible with PCIe 5.0 interconnect.

by zozbot234

4/16/2026 at 4:50:53 PM

MoE is excellent for the unified memory inference hardware like DGX Sparc, Apple Studio, etc. Large memory size means you can have quite a few B's and the smaller experts keeps those tokens flowing fast.

by perbu

4/16/2026 at 2:41:48 PM

I'm broadly curious how people are using these local models. Literally, how are they attaching harnesses to this and finding more value than just renting tokens from Anthropic of OpenAI?

by aliljet

4/16/2026 at 3:12:08 PM

Idk about everyone else, but I don’t want to rent tokens forever. I want a self hosted model that is completely private and can’t be monitored or adulterated without me knowing. I use both currently, but I am excited at the prospect of maybe not having to in the near to mid future.

I’ve increasingly started self hosting everything in my home lately because I got tired of SAAS rug pulls and I don’t see why LLM’s should eventually be any different.

by oompydoompy74

4/17/2026 at 2:48:16 AM

Exactly. Relying on external compute for professional work is a non-starter IMO.

by danny_codes

4/16/2026 at 3:13:08 PM

Qwen3.5-9B has been extremely useful for local fuzzy table extraction OCR for data that cannot be sent to the cloud.

The documents have subtly different formatting and layout due to source variance. Previously we used a large set of hierarchical heuristics to catch as many edge cases as we could anticipate.

Now with the multi-modal capabilities of these models we can leverage the language capabilities along side vision to extract structured data from a table that has 'roughly this shape' and 'this location'.

by seemaze

4/16/2026 at 2:52:53 PM

I used vLLM and qwen3-coder-next to batch-process a couple million documents recently. No token quota, no rate limits, just 100% GPU utilization until the job was done.

by marssaxman

4/16/2026 at 3:45:44 PM

Some tasks don’t require SOTA models. For translating small texts I use Gemma 4 on my iPhone because it’s faster and better than Apple Translate or Google Translate and works offline. Also if you can break down certain tasks like JSON healing into small focused coding tasks then local models are useful

by znnajdla

4/17/2026 at 1:09:43 PM

> For translating small texts I use Gemma 4 on my iPhone because it’s faster and better than Apple Translate or Google Translate and works offline.

What does better mean here? Does it handle formal vs informal speech? Idiomatic expressions? Regional variances (like American vs British English)? These are areas where Google Translate is weak.

How fast are we talking here (including initial loading times) and what's the impact on your phone battery? Also, what iPhone do you have?

I am really interested in this application hence my questions.

by netdevphoenix

4/17/2026 at 7:18:19 AM

How does that work? Wouldn't it be slow loading the weights into memory every time you launch it?

by root_axis

4/17/2026 at 8:31:06 AM

I'm guessing they're not using it as a word dictionary, but rather translating longer texts where the time to load the model isn't a significant issue.

by theshrike79

4/16/2026 at 5:01:55 PM

Is it really better? In which languages?

by kaliqt

4/16/2026 at 6:13:32 PM

Yes it is and has been for a very long time, it has been years now. Gemini 1.5 Pro is when LLM translations started significantly outperforming non-LLM machine translation, and that came out over 2 years ago.

Ever since then Google models have been the strongest at translation across the board, so it's no surprise Gemma 4 does well. Gemini 3 Flash is better at translation than any Claude or GPT model. OpenAI models have always been weakest at it, continuing to this day. It's quite interesting how these characteristics have stayed stable over time and many model versions.

I'm primarily talking about non-trivial language pairs, something like English<>Spanish is so "easy" now it's hard to distinguish the strong models.

by deaux

4/16/2026 at 7:31:05 PM

I've been using gemma4 for translating Mongolian to English. It runs circles around Google Translate for that language pair, it's not even close.

by homebrewer

4/17/2026 at 5:48:03 AM

I translate texts between Ukrainian, Russian and English dozens of times daily. The LLM translation is not only better, it's also refineable, you can chat with the AI to make changes to what you meant.

by znnajdla

4/17/2026 at 6:45:38 AM

Do you use E2B or E4B?

by oktoberpaard

4/16/2026 at 2:59:58 PM

The people i know that use local models just end up with both.

The local models don’t really compete with the flagship labs for most tasks

But there are things you may not want to send to them for privacy reasons or tasks where you don’t want to use tokens from your plan with whichever lab. Things like openclaw use a ton of tokens and most of the time the local models are totally fine for it (assuming you find it useful which is a whole different discussion)

by lkjdsklf

4/16/2026 at 6:20:36 PM

The open weights models absolutely compete with flagship labs for most tasks. OpenAI and Anthropic's "cheap tier" models are completely uncompetitive with them for "quality / $" and it's not close. Google is the only one who has remained competitive in the <$5/1M output tier with Flash, and now has an incredibly strong release with Gemma 4.

Unless you have a corporate lock-in/compliance need, there has been no reason to use Haiku or GPT mini/nano/etc over open weights models for a long time now.

by deaux

4/17/2026 at 12:57:55 PM

I am using it with pi agent and I have stopped renting tokens. Much better for me than Claude Code, on M1 Max 64GB. This model with oMLX is at 16k context PP 919.9 tok/s and TG 54.7 tok/s. You have to manage the context but the better you manage context the more focused the output is. I use it without thinking.

by dust42

4/16/2026 at 6:08:27 PM

I've been largely using Qwen3.5-122b at 6 bit quant locally for some c++/go/python dev lately because it is quite capable as long as I can give it pretty specific asks within the codebase and it will produce code that needs minimal massaging to fit into the project.

I do have a $20 claude sub I can fall back to for anything qwen struggles with, but with 3.5 I have been very pleased with the results.

by jwitthuhn

4/16/2026 at 9:08:27 PM

How much VRAM do you need for that?

by 3836293648

4/17/2026 at 7:05:04 AM

Not OP, but I ran 122b successfully with normal RAM offloading. You dont need all that much VRAM, which is super expensive. I used 96gb ram + 16gb vram gpu. But it's not very fast in that setup, maybe 15 token per second. Still, you can give it a task and come back later and its done. (Disclaimer: I build that PC before stuff got expensive)

by canpan

4/17/2026 at 11:26:05 AM

128GB on a mac with unified memory. The model itself takes something like 110 of that and then I have ~16 left over to hold a reasonably sized context and 2 for the OS.

I do have a dedicated machine for it though because I can't run an IDE at the same time as that model.

by jwitthuhn

4/16/2026 at 10:59:40 PM

I squeeze Qwen3.5-122B-A10B at Q6 into 128GB. It's a great model.

by seemaze

4/17/2026 at 4:13:45 AM

Wow what kind of hardware do you have? Mac Studio, dgx spark, strix halo? How fast is it?

by mistercheese

4/19/2026 at 2:11:40 PM

Strix Halo, I'm seeing performance inline with these results[0].

I'm interested to investigate the claimed gains from the lemonade-sdk port of Apple MLX inference[1].

[0]https://kyuz0.github.io/amd-strix-halo-toolboxes/

[1]https://github.com/lemonade-sdk/lemonade/issues/1642

by seemaze

4/17/2026 at 4:09:18 AM

I use local models for asking about personal financial or health data that I want to keep local and private. Or even just whipping up quick and dirty prototypes for whatever I can think of but not seriously enough to spend tokens that I rather use on real projects.

by mistercheese

4/16/2026 at 3:08:33 PM

The privacy/data security angle really is important in some regions and industries. Think European privacy laws or customers demanding NDAs. The value of Anthropic and OpenAI is zero for both cases, so easy to beat, despite local models being dumber and slower.

by bildung

4/16/2026 at 3:29:18 PM

I use LMStudio to host and run GLM 4.7 Flash as a coding agent. I use it with the Pi coding agent, but also use it with the Zed editor agent integrations. I've used the Qwen models in the past, but have consistently come back to GLM 4.7 because of its capabilities. I often use Qwen or Gemma models for their vision capabilities. For example, I often will finish ML training runs, take a photo of the graphs and visualizations of the run metrics and ask the model to tell me things I might look at tweaking to improve subsequent training runs. Qwen 3.5 0.8b is pretty awesome for really small and quick vision tasks like "Give me a JSON representation of the cards on this page".

by kamranjon

4/17/2026 at 8:01:40 PM

I'm using forge code (https://forgecode.dev/) with various local and cloud models and I really like it. MiniMax 2.7 is really great with it, and the new Qwen 3.6 35B A3B feels much stronger, after some testing, than the 3.5 version. Check some harness benchmarks. Forge outperforms Claude Code with Opus by a big margin.

by Tamitami2000

4/16/2026 at 3:43:59 PM

It’s easy to find a combination of llama.cpp and a coding tool like OpenCode for these. Asking an LLM for help setting it up can work well if you don’t want to find a guide yourself.

> and finding more value than just renting tokens from Anthropic of OpenAI?

Buying hardware to run these models is not cost effective. I do it for fun for small tasks but I have no illusions that I’m getting anything superior to hosted models. They can be useful for small tasks like codebase exploration or writing simple single use tools when you don’t want to consume more of your 5-hour token budget though.

by Aurornis

4/16/2026 at 5:53:36 PM

Oh lord, are the LLMs already replacing LLMs?

by toxik

4/16/2026 at 11:31:48 PM

I'm using the smaller vision models (Qwen3.5-4B currently) with Frigate, a FOSS self-hosted "AI" NVR. It's good enough at analyzing images to figure out mostly what's happening, and doesn't require the big knowledge base that bigger models have.

Also use a bigger model for summarizing or translating text, which I don't consume in realtime, so doesn't need to be fast. Would be a thing I could use OpenAI's batch APIs for if I did need something higher quality.

by ThatPlayer

4/16/2026 at 4:31:06 PM

There are really nice GUIs for LLMs - CherryStudio for example, can be used with local or cloud models.

There are also web-UIs - just like the labs ones.

And you can connect coding agents like Codex, Copilot or Pi to local coding agents - the support OpenAI compatible APIs.

It's literally a terminal command to start serving the model locally and you can connect various things to it, like Codex.

by dist-epoch

4/17/2026 at 1:51:44 AM

[dead]

by ssrshh

4/16/2026 at 5:05:22 PM

I am working on a research project to link churches from their IRS Exempt org BMF entry to their google search result from 10 fetched. Gwen2.5-14b on a 16gb Mac Mini. It works good enough!

It's entertaining to see HN increasingly consider coding harness as the only value a model can provide.

by kylehotchkiss

4/16/2026 at 4:06:57 PM

While they can be run locally, and most of the discussion on HN about that, I bet that if you look at total tok/day local usage is a tiny amount compared to total cloud inference even for these models. Most people who do use them locally just do a prompt every now and then.

by deaux

4/16/2026 at 4:17:53 PM

This is why I'd like to see a lot more focus on batched inference with lower-end hardware. If you just do a tiny amount of tok/day and can wait for the answer to be computed overnight or so, you don't really need top-of-the-line hardware even for SOTA results.

by zozbot234

4/16/2026 at 6:01:45 PM

> If you just do a tiny amount of tok/day and can wait for the answer to be computed overnight or so

But they can't? The usage pattern is the polar opposite. Most people running these models locally just ask a few questions to it throughout the day. They want the answers now, or at least within a minute.

by deaux

4/16/2026 at 6:49:47 PM

If you want the answer right now, that alone ups your compute needs to the point where you're probably better off just using a free hosted-AI service. Unless the prompt is trivial enough that it can be answered quickly by a tiny local model.

by zozbot234

4/17/2026 at 9:48:49 AM

A strix halo machine or MAC will run at less than 20watts idle. You could leave it running.

by redman25

4/17/2026 at 4:15:49 AM

That’s a good point. I think I saw Together.ai with that offering, but for some reason just never think to throw random non urgent coding tasks at it overnight

by mistercheese

4/16/2026 at 2:59:56 PM

They are okay for vibe coding throw-away projects without spending your Anthrophic/OAI tokens

by flux3125

4/16/2026 at 2:53:31 PM

I was thinking the same thing. My only guess is that they are excited about local models because they can run it cheaper through Open Router ?

by Panda4

4/16/2026 at 11:07:56 PM

always inside claude code, just using ollama, takes 2 seconds

by zackify

4/16/2026 at 2:30:34 PM

China won again in terms of openness

by rvnx

4/17/2026 at 2:49:28 AM

Ironic

by danny_codes

4/17/2026 at 12:59:01 PM

Not as much as "Open" AI

by lta

4/16/2026 at 2:01:25 PM

Honestly, this is the AI software I actually look forward to seeing. No hype about it being too dangerous to release. No IPO pumping hype. No subscription fees. I am so pumped to try this!

by fooblaster

4/16/2026 at 4:04:19 PM

Same here. I really hope in a near future local model will be good enough and hardware fast enough to run them to become viable for most use cases

by wrxd

4/16/2026 at 8:35:30 PM

No need to hope; it is inevitable.

by vlapec

4/16/2026 at 10:09:52 PM

Is it inevitable though? Open-weight models large enough to come close to an API model are insanely expensive to run for con/prosumers. I'd put the “expensive” bar at ≥24GB since that's already well into 4 digits, which gives you quite many months of a subscription, not including the power will for >400W continuous.

Color me pessimistic, but this feels like a pipe dream.

by Zopieux

4/17/2026 at 8:23:12 PM

A decent amount of software developers and gamers do spend 3000 USD on a PC. That kind of hardware is going go get more and more capable over time wrt genAI models.

Of course there will always be a gap to frontier closed hosted models. It is not an either or proposition.

by jononor

4/17/2026 at 3:50:26 PM

I have moved through the local models at this size.

This one is by far the most capable. I've tried various versions of gemma4.26b, various versions of qwen3.5-27/35b (qwopus's galor),nemotron,phi,glm4.7.

This one is noticeably better as an agent. It's really good at breaking down tasks into small actionable steps, and - where there is ambiguity - asks for clarification. It's reasoning seems more solid than gemma4, tool use, multi-messaging/longer chain thinking.

I am excited to see what other versions of this model people train!

by altruios

4/17/2026 at 5:06:51 PM

How does it compare to CC Opus Max?

by onlyrealcuzzo

4/17/2026 at 5:24:09 PM

I try not to use publicly hosted models, and I avoid the SOTA data harvesting machine... so I can not compare. I can compare only to local models. And this one feels like a decently significant leap compared to 3.5 or gemma4.

I see there is now a distilled reasoning model version on hugging-face. I may look into that, but I have not seen a need to reach for that change yet either...

by altruios

4/16/2026 at 3:00:24 PM

Fingers crossed for mid and larger models as well. I'd personally love to see Qwen3.6-122B-A10B.

by seemaze

4/16/2026 at 7:19:16 PM

That would be really great. Though 3.5 122B is already doing a lot of work in our setup.

by Vespasian

4/16/2026 at 2:06:23 PM

I hope the other sizes are coming too(9B for me). Can't fit much context with this on a 36GB mac.

by abhikul0

4/16/2026 at 2:14:03 PM

It's a MoE model and the A3B stands for 3 Billion active parameters, like the recent Gemma 4.

You can try to offload the experts on CPU with llama.cpp (--cpu-moe) and that should give you quite the extra context space, at a lower token generation speed.

by mhitza

4/16/2026 at 2:23:59 PM

Mac has unified memory, so 36GB is 36GB for everything- gpu,cpu.

by abhikul0

4/16/2026 at 2:39:37 PM

CPU-MoE still helps with mmap. Should not overly hurt token-gen speed on the Mac since the CPU has access to most (though not all) of the unified memory bandwidth, which is the bottleneck.

by zozbot234

4/16/2026 at 3:09:32 PM

I'll try to use that, but llama-server has mmap on by default and the model still takes up the size of the model in RAM, not sure what's going on.

by abhikul0

4/16/2026 at 3:14:31 PM

Try running CPU-only inference to troubleshoot that. GPU layers will likely just ignore mmap.

by zozbot234

4/16/2026 at 2:32:02 PM

For sure I was running on autopilot with that reply. Though in Q4 I would expect it to fit, as 24B-A4B Gemma model without CPU offloading got up to 18GB of VRAM usage

by mhitza

4/16/2026 at 2:17:37 PM

Do I expect the same memory footprint from an N active parameters as from simply N total parameters?

by dgb23

4/16/2026 at 2:35:53 PM

No - this model has the weights memory footprint of a 35B model (you do save a little bit on the KV cache, which will be smaller than the total size suggests). The lower number of active parameters gives you faster inference, including lower memory bandwidth utilization, which makes it viable to offload the weights for the experts onto slower memory. On a Mac, with unified memory, this doesn't really help you. (Unless you want to offload to nonvolatile storage, but it would still be painfully slow.)

All that said you could probably squeeze it onto a 36GB Mac. A lot of people run this size model on 24GB GPUs, at 4-5 bits per weight quantization and maybe with reduced context size.

by daemonologist

4/16/2026 at 2:18:43 PM

i dont get it, mac has unified memory how would offloading experts to cpu help?

by pdyc

4/16/2026 at 2:22:48 PM

I bet the poster just didn’t remember that important detail about Macs, it is kind of unusual from a normal computer point of view.

I wonder though, do Macs have swap, coupled unused experts be offloaded to swap?

by bee_rider

4/16/2026 at 2:39:38 PM

Of course the swap is there for fallback but I hate using it lol as I don't want to degrade SSD longevity.

by abhikul0

4/16/2026 at 2:12:51 PM

can you elaborate? you can use quantized version, would context still be an issue with it?

by pdyc

4/16/2026 at 2:20:02 PM

A usable quant, Q5_KM imo, takes up ~26GB[0], which leaves around ~6-7GB for context and running other programs which is not much.

[0] https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF?show_fil...

by abhikul0

4/16/2026 at 2:15:57 PM

context is always an issue with local models and consumer hardware.

by nickthegreek

4/16/2026 at 2:20:38 PM

correct but it should be some ratio of model size like if model size is x GB, max context would occupy x * some constant of RAM. For quantized version assuming its 18GB for Q4 it should be able to support 64-128k with this mac

by pdyc

4/16/2026 at 2:36:13 PM

For the 9B model, I can use the full context with Q8_0 KV. This uses around ~16GB, while still leaving a comfortable headroom.

Output after I exit the llama-server command:

  llama_memory_breakdown_print: | memory breakdown [MiB]  | total    free     self   model   context   compute    unaccounted |
  llama_memory_breakdown_print: |   - MTL0 (Apple M3 Pro) | 28753 = 14607 + (14145 =  6262 +    4553 +    3329) +           0 |
  llama_memory_breakdown_print: |   - Host                |                   2779 =   666 +       0 +    2112                |

by abhikul0

4/16/2026 at 9:01:10 PM

Qwen3.6 and Gemma4 have the same issue of never getting to the point and just getting stuck in never ending repeating thought loops. Qwen3.5 is still the best local model that works.

by 3836293648

4/16/2026 at 10:41:27 PM

I think the hype around Qwen and even Gemma4 often floated for views/attention glosses over that these models have clear gaps behind what closed models offer.

In short, it has its uses but it would/should not be the main driver. Will it get better, I'm sure of it, but there is too much hype and exaggeration over open source models, for one the hardware simply isn't enough at a price point where we can run something that can seriously compete with today's closed models.

If we got something like GPT-5.4-xhigh that can run on some local hardware under 5k, that would be a major milestone.

by agentifysh

4/17/2026 at 3:38:53 PM

I say "if we got $CURRENT_MODEL that can run under local hardware" claims are postproning BS.

What is gonna happen when that happens? They are gonna cry they need GPT-$CURRENT capabilities locally.

Now we have local models that are way better that GPT-2 (careful, this one is way too dangerous for release!) GPT3.5, in some ways better that 4, and can run on reasonably modest hardware.

by ElectricalUnion

4/17/2026 at 2:49:58 AM

Give it 6 months

by danny_codes

4/17/2026 at 3:59:08 AM

Quantization can introduce these issues, and Gemma 4 also had issues because the prompt tokens that Gemma used was new and not well supported yet.

by naasking

4/17/2026 at 4:42:20 PM

I had issues with Qwen thinking endlessly when I didn’t know I wasn’t using the temp/top_k/min_p/etc settings specified in the readme. I’ve never had an issue with Gemma 4 thinking endlessly but could possibly be the same.

by nxtfari

4/21/2026 at 10:40:10 AM

I use Ollama and kinda just assumed that Ollama would have everything except for context length (which I've explicitly overwritten) setup properly for me.

by 3836293648

4/17/2026 at 10:45:51 PM

I have a Macbook M3 Max with 128GB of RAM.

How close to Opus 4.6 can I get with this? Realistic, real-world usage. And I mean not sitting there for minutes waiting the model to finish saying hello, or being able to use it for anything more than a pelican riding a bicycle.

I'm asking because I'm always seeing excited replies, then I get excited, then I spend minutes to hours setting up the model and then, after first use I forget it exists for one reason or another.

Can I get any realistic use out of this?

by gck1

4/17/2026 at 11:51:39 PM

It won’t be a fair comparison against opus-4.6 but it will run quite well on your machine. I’ve tested qwen3.5 27B, Gemma4, minimax2.5 and Glm4.7 before on my m3 ultra. And i’d say this is the first model that I’m able to use for full agentic sessions. here is a pi session i just did and it worked quite well surprisingly: https://pi.dev/session/#c3d003becb1bfcc7ffbca04e89e1adf8

by qazplm17

4/18/2026 at 12:27:36 AM

Thank you! That actually looks quite impressive.

What seems very promising is that thinking blocks look coherent for the lack of a better word, and not that far away from thinking blocks (or rather, summaries) that I see from Claude models.

I think this could actually work for targeted worker agents that get explicit, detailed task instructions from better models.

I'll be trying this tomorrow in my workflow.

by gck1

4/17/2026 at 11:27:59 PM

You'd be the best person in this thread to answer this question.

by stavros

4/16/2026 at 2:30:04 PM

This is surprisingly close to Haiku quality, but open - and Haiku is quite a capable model (many of the Claude Code subagents use it).

by jake-coworker

4/16/2026 at 3:03:19 PM

Where did you see a haiku comparison? Haiku 4.5 was my daily driver for a month or so before Opus 4.5 dropped and would be unreasonably happy if a local model can give me similar capability

by wild_egg

4/16/2026 at 3:39:13 PM

I didn't see a direct comparison, but there's some overlap in the published benchmarks:

                           │ Qwen 3.6 35B-A3B │ Haiku 4.5               
   ────────────────────────┼──────────────────┼──────────────────────── 
    SWE-Bench Verified     │ 73.4             │ 66.6                    
   ────────────────────────┼──────────────────┼──────────────────────── 
    SWE-Bench Multilingual │ 67.2             │ 64.7                    
   ────────────────────────┼──────────────────┼──────────────────────── 
    SWE-Bench Pro          │ 49.5             │ 39.45                   
   ────────────────────────┼──────────────────┼──────────────────────── 
    Terminal Bench 2.0     │ 51.5             │ 61.2 (Warp), 27.5 (CC)  
   ────────────────────────┼──────────────────┼──────────────────────── 
    LiveCodeBench          │ 80.4             │ 41.92

These are of course all public benchmarks though - I'd expect there to be some memorization/overfitting happening. The proprietary models usually have a bit of an advantage in real-world tasks in my experience.

by daemonologist

4/16/2026 at 3:34:27 PM

Artificial Analysis hasn't posted their independent analysis of Qwen3.6 35B A3B yet, but Alibaba's benchmarks paint it as being on par with Qwen3.5 27B (or better in some cases).

Even Qwen3.5 35B A3B benchmarks roughly on par with Haiku 4.5, so Qwen3.6 should be a noticeable step up.

https://artificialanalysis.ai/models?models=gpt-oss-120b%2Cg...

No, these benchmarks are not perfect, but short of trying it yourself, this is the best we've got.

Compared to the frontier coding models like Opus 4.7 and GPT 5.4, Qwen3.6 35B A3B is not going to feel smart at all, but for something that can run quickly at home... it is impressive how far this stuff has come.

by coder543

4/17/2026 at 4:01:10 AM

Qwen models commonly get accused of benchmaxxing though. Just something to keep in mind when weighing the standard benchmarks.

by naasking

4/17/2026 at 11:41:18 AM

Every model release gets accused of that, including the flagship models.

by coder543

4/17/2026 at 12:36:48 PM

Less so for Gemma-4 because it falls behind Qwen on benchmarks. Tests for benchmaxxing are also strongly suggestive: https://x.com/bnjmn_marie/status/2041540879165403527

by naasking

4/17/2026 at 12:40:31 PM

No… seriously. Every model release is accused. Including Opus, GPT-5.4, whatever. And yes, including smaller models that are not the top in every benchmark.

My own experiences with Gemma 4 have been quite mediocre: https://www.reddit.com/r/LocalLLaMA/comments/1sn3izh/comment...

I would almost be tempted to call it benchmaxed if that term weren’t such a joke at this point. It is a deeply unserious term these days.

Gemma 4 is worse than its benchmarks show in terms of agentic workflows. The Qwen3.x models are much better; not benchmaxed. I have tested this extensively for my own workflows. Google really needs to release Gemma 4.1 ASAP. I really hope they’re not planning to just wait another calendar year like they did for Gemma 3 -> 4 with no intermediate updates.

And the lead author on the paper replied to that tweet to say that the scores would need to be greater than 80 to show actual contamination: https://x.com/MiZawalski/status/2043990236317851944?s=20

by coder543

4/16/2026 at 6:10:58 PM

I find Gemma 4 26B A4B better than Haiku 4.5 and that's smaller than this one.

by deaux

4/17/2026 at 11:38:11 PM

Just tried to use qwen3.6-35b-a3b-bf16 + omlx running a pi session to use my HN cli to do a sentiment analysis on this story and opus4.7 story. I’m getting ~40tk/s on a M3 Ultra Mac Studio and the tool use consistency has been held up well. Even when passing 100k tokens, the session was still going strong. Here is the full sentiment analysis report it produced:

https://gist.github.com/duh17/2db5351da026cec4bd4f46e169e75e...

Here is the full session:

https://pi.dev/session/#c3d003becb1bfcc7ffbca04e89e1adf8

This is by far my smoothest agentic session using a local model of any size. The output quality and speed has really struct the right balance. Very impressive release

by qazplm17

4/16/2026 at 7:04:35 PM

Anyone else getting gibberish when running unsloth/Qwen3.6-35B-A3B-GGUF:UD-IQ4_XS on CUDA (llama.cpp b8815)? UD-Q4_K_XL is fine, as is Vulkan in general.

by cpburns2009

4/16/2026 at 9:42:14 PM

Apparently it's a known issue with CUDA 13.2 [1].

[1] https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/discussi...

by cpburns2009

4/18/2026 at 10:24:23 AM

Yes sadly CUDA 13.2 is broken - NVIDIA will push a fix in CUDA 13.3

by danielhanchen

4/16/2026 at 5:29:08 PM

I wonder how this one compares to Qwen3 Coder Next (the 80B A3B model), since you'd think that even though it's older, it having more parameters would make it more useful for agentic and development use cases: https://huggingface.co/collections/Qwen/qwen3-coder-next

by KronisLV

4/16/2026 at 3:41:03 PM

I tried Gemma 4 A4B and was surprised how hart it is to use it for agentic stuff on a RTX 4090 with 24gb of ram.

Balancing KV Cache and Context eating VRam super fast.

by Glemllksdf

4/16/2026 at 7:13:30 PM

Are we going to get to the point where a local model can do almost what sonnet 4.6 can do?

by codeugo

4/16/2026 at 7:23:29 PM

Of course we are. And Opus 4.6+. It's a matter of when, not if.

by bluerooibos

4/17/2026 at 2:50:38 AM

Once you run out of data it’s just optimizations to commoditization

by danny_codes

4/16/2026 at 7:28:35 PM

We're already there IMHO.. If you have enough ram, sure.. but the ~32gig people can run models that beat sonnet 4.5

by intothemild

4/17/2026 at 4:23:30 AM

I am not sure. I tested it locally on my Desktop Framework and it so far it seem to giving me worse answers then Qwen 3.5. Maybe it is because I am chatting with models in my language instead of enlish or maybe it is optimised for coding instead.

I asked it to give me instruction on how to create SSH key and it tried to do it instead of just answering.

https://internetexception.com/2026/04/16/testing-qwen-3-6/

by npodbielski

4/16/2026 at 11:17:20 PM

Can this run on a PC with 16GB graphics card or a 24GB Macbook Pro? I'm not familiar with how Mixture-of-Experts models differ from standard models.

by poglet

4/16/2026 at 2:05:31 PM

Available for download:

https://huggingface.co/Qwen/Qwen3.6-35B-A3B

by adrian_b

4/16/2026 at 3:24:44 PM

Dangit, I'll need to give this a run on my personal machine. This looks impressive.

At the time of writing, all deepseek or qwen models are de facto prohibited in govcon, including local machine deployments via Ollama or similar. Although no legislative or executive mandate yet exists [1], it's perceived as a gap [2], and contracts are already including language for prohibition not just in the product but any part of the software environment.

The attack surface for a (non-agentic) model running in local ollama is basically non-existent . . but, eh . . I do get it, at some level. While they're not l33t haXX0ring your base, the models are still largely black boxes, can move your attention away from things, or towards things, with no one being the wiser. "Landing Craft? I see no landing craft". This would boil out in test, ideally, but hey, now you know how much time your typical defense subcon spends in meaningful software testing[3].

[1] See also OMB Memorandum M-25-22 (preference for AI developed and produced in the United States), NIST CAISI assessment of PRC-origin AI models as "adversary AI" (September 2025), and House Select Committee on the CCP Report (April 16, 2025), "DeepSeek Unmasked".

[2] Overall, rather than blacklist, I'd recommend a "whitelist" of permitted models, maintained dynamically. This would operate the same way you would manage libraries via SSCG/SSCM (software supply chain governance/management) . . but few if any defense subcons have enough onboard savvy to manage SSCG let alone spooling a parallel construct for models :(. Soooo . . ollama regex scrubbing it is.

[3] i.e. none at all, we barely have the ability to MAKE anything like software, given the combination of underwhelming pay scales and the fact defense companies always seem to have a requirement for on-site 100% in some random crappy town in the middle of BFE. If it wasn't for the downturn in tech we wouldn't have anyone useful at all, but we snagged some silcon refugees.

by lopsotronic

4/16/2026 at 1:57:26 PM

Wowzers, we were worried Qwen was going to suffer having lost several high profile people on the team but that's a huge drop.

It's better than 27b?

by incomingpain

4/16/2026 at 2:16:38 PM

Their previous model Qwen3.5 was available in many sizes, from very small sizes intended for smartphones, to medium sizes like 27B and big sizes like 122B and 397B.

This model is the first that is provided with open weights from their newer family of models Qwen3.6.

Judging from its medium size, Qwen/Qwen3.6-35B-A3B is intended as a superior replacement of Qwen/Qwen3.5-27B.

It remains to be seen whether they will also publish in the future replacements for the bigger 122B and 397B models.

The older Qwen3.5 models can be also found in uncensored modifications. It also remains to be seen whether it will be easy to uncensor Qwen3.6, because for some recent models, like Kimi-K2.5, the methods used to remove censoring from older LLMs no longer worked.

by adrian_b

4/16/2026 at 2:46:45 PM

There was also Qwen3.5-35B-A3B in the previous generation: https://huggingface.co/Qwen/Qwen3.5-35B-A3B

by mft_

4/16/2026 at 4:38:37 PM

> Qwen/Qwen3.6-35B-A3B is intended as a superior replacement of Qwen/Qwen3.5-27B

Not at all, Qwen3.5-27B was much better than Qwen3.5-35B-A3B (dense vs MoE).

by storus

4/16/2026 at 8:34:48 PM

Not sure why you're being downvoted, I guess it's because how your reply is worded. Anyway, Qwen3.7 35B-A3B should have intelligence on par with a 10.25B parameter model so yes Qwen3.5 27B is going to outperform it still in terms of quality of output, especially for long horizon tasks.

by rubiquity

4/16/2026 at 4:51:02 PM

Re-read that

by mudkipdev

4/16/2026 at 5:05:27 PM

You should. 3.5 MoE was worse than 3.5 dense, so expecting 3.6 MoE to be superior than 3.5 dense is questionable, one could argue that 3.6 dense (not yet released) to be superior than 3.5 dense.

by storus

4/16/2026 at 7:31:06 PM

Ok but you made a claim about the new model by stating a fact about the old model. It's easy to see how you appeared to be talking about different things. As for the claim, Qwen do indeed say that their new 3.6 MoE model is on a par with the old 3.5 dense model:

> Despite its efficiency, Qwen3.6-35B-A3B delivers outstanding agentic coding performance, surpassing its predecessor Qwen3.5-35B-A3B by a wide margin and rivaling much larger dense models such as Qwen3.5-27B.

https://qwen.ai/blog?id=qwen3.6-35b-a3b

by spuz

4/16/2026 at 9:03:32 PM

This says a slightly different thing:

https://x.com/alibaba_qwen/status/2044768734234243427?s=48&t...

If you look, at many benchmarks the old dense model is still ahead but in couple benchmarks the new 35B demolishes the old 27B. "rivaling" so YMMV.

by storus

4/17/2026 at 12:34:44 AM

This is obviously a continuation training of 3.5, it's not a new model architecture but an incremental improvement.

by segmondy

4/16/2026 at 4:21:38 PM

Looking to move off ollama on Open Suse tumbleweed.

Should I use brew to install llma.ccp or the zypper to install the tumbleweed package?

by 999900000999

4/16/2026 at 7:36:51 PM

You can compile it from source, all you need to do is clone the repository and do a `cmake -B build -DGGML_VULKAN=1` (add other backends if you want) followed by a `cmake --build build --config Release` and then you get all the llama tools in the `build/bin` (including `llama-server` which provides a web-based interface). There is a `docs/build.md` that has more detailed info (especially if you need another backend, though at least on my RX 7900 XTX i see no difference in terms of performance between Vulkan and ROCm and the former is much more stable and compatible -- i tried ROCm for a bit thinking it'd be much faster but only ended up being much more annoying as some models would OOM on it while they worked on Vulkan -- if you or NVIDIA hardware all this may sound quaint though :-P).

by badsectoracula

4/16/2026 at 9:42:38 PM

Cool, I assume this is how adults use llms.

I’m on a nvidia gpu , but I want to be able to combine vram with system memory.

by 999900000999

4/16/2026 at 5:20:55 PM

Why are you looking to move off Ollama? Just curious because I'm using Ollama and the cloud models (Kimi 2.5 and Minimax 2.7) which I'm having lots of good success with.

by rexreed

4/16/2026 at 5:54:17 PM

Ollama co mingles online and local models which defeats the purpose for me

by 999900000999

4/16/2026 at 8:53:15 PM

You can disable all cloud models in your Ollama settings if you just want all local. For cloud you don't have to use the cloud models unless you explicitly request.

by rexreed

4/16/2026 at 9:10:35 PM

Why not just download the binaries from github releases?

by stratos123

4/16/2026 at 2:33:54 PM

I'm a newbie here and lost how I'm supposed to use these models for coding. When I use them with Continue in VSCode and start typing basic C:

  #include <stdio.h>
  int m

I get nonsensical autocompletions like:

  #include <stdio.h>
  int m</fim_prefix>

What is going on?

by dataflow

4/16/2026 at 2:44:14 PM

These are not autocomplete models. It’s built to be used with an agentic coding harness like Pi or OpenCode.

by sosodev

4/16/2026 at 3:00:42 PM

They are but the IDE needs to be integrated with them.

Qwen specifically calls out FIM (“fill in the middle”) support on the model card and you can see it getting confused and posting the control tokens in the example here.

by zackangelo

4/16/2026 at 3:14:20 PM

Oh, that’s interesting. Thanks for the correction. I didn’t know such heavily post trained models could still do good ol fashion autocomplete.

by sosodev

4/16/2026 at 4:28:40 PM

And even of those models trained for tool calling and agentic flows, mileage may vary depending on lots of factors. Been playing around with smaller local models (Anything that fits on 4090 + 64gb RAM) and it is a lottery it seems on a) if it works at all and b) how long it will work for.

Sometimes they don't manage any tool calls and fall over off the bat, other times they manage a few tool calls and then start spewing nonsense. Some can manage sub agents fr a while then fall apart.. I just can't seem to get any consistently decent output on more 'consumer/home pc' type hardware. Mostly been using either pi or OpenCode for this testing.

by JokerDan

4/16/2026 at 2:59:55 PM

Choose the correct FIM (Fill In the Middle) template for Qwen in Continue. All recent Qwen models are actually trained with FIM capability and you can use them.

by woctordho

4/16/2026 at 2:44:25 PM

This might sound snarky but in all earnestness, try talking to an AI about your experience using it.

by Jeff_Brown

4/16/2026 at 3:22:37 PM

I would use something like zeta-2 instead - https://huggingface.co/bartowski/zed-industries_zeta-2-GGUF

by recov

4/16/2026 at 2:00:09 PM

How does this compare to the commercial models like Sonnet 4.5 or GPT? Close enough that the price is right (free)?

by fred_is_fred

4/16/2026 at 2:07:57 PM

The will not measure up. Notice they're comparing it to Gemma, Google's open weight model, not to Gemini, Sonnet, or GPT. That's fine - this is a tiny model.

If you want something closer to the frontier models, Qwen3.6-Plus (not open) is doing quite well[1] (I've not tested it extensively personally):

https://qwen.ai/blog?id=qwen3.6

by vidarh

4/16/2026 at 3:33:59 PM

on the bright side also worth to keep in mind those tiny models are better than GPT 4.0, 4.1 GPT4o that we used to enjoy less than 2 years ago [1]

[1] https://artificialanalysis.ai/?models=gpt-5-4%2Cgpt-oss-120b...

by pzo

4/16/2026 at 6:35:22 PM

They're absolutely worth using for the right tasks. It's hard to go back to GPT4 level for everything (for me at least), but there's plenty of stuff they are smart enough for.

by vidarh

4/16/2026 at 2:08:43 PM

> Close enough

No. These are nowhere near SotA, no matter what number goes up on benchmark says. They are amazing for what they are (runnable on regular PCs), and you can find usecases for them (where privacy >> speed / accuracy) where they perform "good enough", but they are not magic. They have limitations, and you need to adapt your workflows to handle them.

by NitpickLawyer

4/16/2026 at 2:12:48 PM

Can you share more about what adaptations you made when using smaller models?

I'm just starting my exploration of these small models for coding on my 16GB machine (yeah, puny...) and am running into issues where the solution may very well be to reduce the scope of the problem set so the smaller model can handle it.

by julianlam

4/16/2026 at 2:45:08 PM

You'd do most of the planning/cognition yourself, down to the module/method signature level, and then have it loop through the plan to "fill in the code". Need a strong testing harness to loop effectively.

by ukuina

4/16/2026 at 2:42:37 PM

It is very unlikely that general claims about a model are useful, but only very specific claims, which indicate the exact number of parameters and quantization methods that are used by the compared models.

If you perform the inference locally, there is a huge space of compromise between the inference speed and the quality of the results.

Most open weights models are available in a variety of sizes. Thus you can choose anywhere from very small models with a little more than 1B parameters to very big models with over 750B parameters.

For a given model, you can choose to evaluate it in its native number size, which is normally BF16, or in a great variety of smaller quantized number sizes, in order to fit the model in less memory or just to reduce the time for accessing the memory.

Therefore, if you choose big models without quantization, you may obtain results very close to SOTA proprietary models.

If you choose models so small and so quantized as to run in the memory of a consumer GPU, then it is normal to get results much worse than with a SOTA model that is run on datacenter hardware.

Choosing to run models that do not fit inside the GPU memory reduces the inference speed a lot, and choosing models that do not fit even inside the CPU memory reduces the inference speed even more.

Nevertheless, slow inference that produces better results may reduce the overall time for completing a project, so one should do a lot of experiments to determine an appropriate compromise.

When you use your own hardware, you do not have to worry about token cost or subscription limits, which may change the optimal strategy for using a coding assistant. Moreover, it is likely that in many cases it may be worthwhile to use multiple open-weights models for the same task, in order to choose the best solution.

For example, when comparing older open-weights models with Mythos, by using appropriate prompts all the bugs that could be found by Mythos could also be found by old models, but the difference was that Mythos found all the bugs alone, while with the free models you had to run several of them in order to find all bugs, because all models had different strengths and weaknesses.

(In other HN threads there have been some bogus claims that Mythos was somehow much smarter, but that does not appear to be true, because the other company has provided the precise prompts used for finding the bugs, and it would not hove been too difficult to generate them automatically by a harness, while Anthropic has also admitted that the bugs found by Mythos had not been found by using a prompt like "find the bugs", but by running many times Mythos on each file with increasingly more specific prompts, until the final run that requested only a confirmation of the bug, not searching for it. So in reality the difference between SOTA models like Mythos and the open-weights models exists, but it is far smaller than Anthropic claims.)

by adrian_b

4/16/2026 at 3:32:57 PM

> Anthropic has also admitted that the bugs found by Mythos had not been found by using a prompt like "find the bugs", but by running many times Mythos on each file with increasingly more specific prompts, until the final run that requested only a confirmation of the bug, not searching for it.

Unless there's been more information since their original post (https://red.anthropic.com/2026/mythos-preview/), this is a misleading description of the scaffold. The process was:

- provide a container with running software and its source code

- prompt Mythos to prioritize source files based on the likelihood they contain vulnerabilities

- use this prioritization to prompt parallel agents to look for and verify vulnerabilities, focusing on but not limited to a single seed file

- as a final validation step, have another instance evaluate the validity and interestingness of the resulting bug reports

This amounts to at most three invocations of the model for each file, once for prioritization, once for the main vulnerability run, and once for the final check. The prompts only became more specific as a result of information the model itself produced, not any external process injecting additional information.

by aesthesia

4/19/2026 at 5:10:02 PM

Thank you, I've been doing guided exploration of the various quantized models with the help of Gemini (which is highly ironic, but effective.)

It does seem like 16GB is on the extreme lower end of being able to produce capable results, very much like a junior dev, so much oversight is needed.

A tight code-test-fix loop seems to be the way forward.

by julianlam

4/16/2026 at 2:12:20 PM

I think its worth noting that if you are paying for electricity Local LLM is NOT free. In most cases you will find that Haiku is cheaper, faster, and better than anything that will run on your local machine.

by yaur

4/16/2026 at 2:25:55 PM

Electricity (on continental US) is pretty cheap assuming you already have the hardware:

Running at a full load of 1000W for every second of the year, for a model that produces 100 tps at 16 cents per kWh, is $1200 USD.

The same amount of tokens would cost at least $3,150 USD on current Claude Haiku 3.5 pricing.

by gyrovagueGeist

4/16/2026 at 2:42:32 PM

This 35B-A3B model is 4-5x cheaper than Haiku though, suggesting it would still be cheaper to outsource inference to the cloud vs running locally in your example

by ac29

4/16/2026 at 3:01:02 PM

If you need the heating then it is basically free.

by postalrat

4/16/2026 at 3:13:10 PM

Only if you use resistive electric heating, which is usually the most expensive heating available.

by mrob

4/16/2026 at 3:18:44 PM

I tried the car wash puzzle:

You want to wash your car. Car wash is 50m away. Should you walk or go by car?

> Walk. At 50 meters, the round trip is roughly 100 meters, taking about two minutes on foot. Driving would require starting the engine, navigating, parking, and dealing with unnecessary wear for a negligible distance. Walk to the car wash, and if the bay requires the vehicle inside, have it moved there or return on foot. Walking is faster and more efficient.

Classic response. It was really hard to one shot this with Qwen3.5 Q4_K_M.

Qwen3.6 UD-IQ4_XS also failed the first time, then I added this to the system prompt:

> Double check your logic for errors

Then I created a new dialog and asked the puzzle and it responded:

> Drive it. The car needs to be present to be washed. 50 meters is roughly a 1-minute walk or a 10-second drive. Walking leaves the car behind, making the wash impossible. Driving it the short distance is the only option that achieves the goal.

Now 3.6 gets it right every time. So not as great as a super model, but definitely an improvement.

by nurettin

4/16/2026 at 4:35:01 PM

Interestingly, Gemma4-26B IQ4_XS gets it correct:

> This sounds like a logic riddle! The answer is: You should go by car. Here is why: If you walk, you will arrive at the car wash, but your car will still be 50 meters away at home. You can't wash the car if the car isn't there! To accomplish your goal, you have to drive the car to the car wash.

It has the wrong one in thinking. It did think longer than usual:

Direct answer: Walk.

Reasoning 1: Distance (50m is negligible).

Reasoning 2: Practicality/Efficiency (engine wear/fuel).

Reasoning 3: Time (walking is likely faster or equal when considering car prep).

...

Wait, if I'm washing the car, I need to get the car to the car wash. The question asks how I should get there.

...

Wait, let's think if there's a trick. If you "go by car," you are moving the car to the destination. If you "walk," you are just moving yourself.

Conclusion: You should drive the car.

by dist-epoch

4/16/2026 at 5:54:59 PM

My last laptop was a used 2012 T530.

My current is a used M1 MBP Pro with 16GB of ram.

I thought this was all I was ever going to need, but wanting to run really nice models locally has me thinking about upgrading.

Although, part of me wants to see how far I could get with my trusty laptop.

by cyrialize

4/16/2026 at 5:58:38 PM

Your current laptop is still a fine thin client. Unless you program in the woods, it's probably cheapest to build a home inference box and route it over Tailscale or something.

by bigyabai

4/16/2026 at 7:13:34 PM

Or just an API server for all other devices to connect and do stuff with it.

by system2

4/16/2026 at 2:32:44 PM

What kind of hardware (preferably non-Apple) can run this model? What about 122B?

by kombine

4/16/2026 at 2:47:20 PM

The 3B active is small enough that it's decently fast even with experts offloaded to system memory. Any PC with a modern (>=8 GB) GPU and sufficient system memory (at least ~24 GB) will be able to run it okay; I'm pretty happy with just a 7800 XT and DDR4. If you want faster inference you could probably squeeze it into a 24 GB GPU (3090/4090 or 7900 XTX) but 32 GB would be a lot more comfortable (5090 or Radeon Pro).

122B is a more difficult proposition. (Also, keep in mind the 3.6 122B hasn't been released yet and might never be.) With 10B active parameters offloading will be slower - you'd probably want at least 4 channels of DDR5, or 3x 32GB GPUs, or a very expensive Nvidia Pro 6000 Blackwell.

by daemonologist

4/16/2026 at 2:43:08 PM

You won't like it, but the answer is Apple. The reason is the unified memory. The GPU can access all 32gb, 64gb, 128gb, 256gb, etc. of RAM.

An easy way (napkin math) to know if you can run a model based on it's parameter size is to consider the parameter size as GB that need to fit in GPU RAM. 35B model needs atleast 35gb of GPU RAM. This is a very simplified way of looking at it and YES, someone is going to say you can offload to CPU, but no one wants to wait 5 seconds for 1 token.

by ru552

4/16/2026 at 2:49:11 PM

That estimate doesn't account for context, which is very important for tool use and coding.

I used this napkin math for image generation, since the context (prompts) were so small, but I think it's misleading at best for most uses.

by samtheprogram

4/16/2026 at 4:16:51 PM

> You won't like it, but the answer is Apple.

Or strix halo.

Seems rather over simplified.

The different levels of quants, for Qwen3.6 it's 10GB to 38.5GB.

Qwen supports a context length of 262,144 natively, but can be extended to 1,010,000 and of course the context length can always be shortened.

Just use one of the calculators and you'll get much more useful number.

by sliken

4/16/2026 at 9:10:39 PM

What Strix Halo system has unified memory? A quick google says it's just a static vram allocation in ram, not that CPU and GPU can actively share memory at runtime

by 3836293648

4/17/2026 at 3:49:29 AM

All. Keep in mind strix != strix halo.

You can get tablets, laptops, and desktops. I think windows is more limited and might require static allocation of video memory, not because it's a separate pool, just because windows isn't as flexible.

With linux you can just select the lowest number in bios (usually 256 or 512MB) then let linux balance the needs of the CPU/GPU. So you could easily run a model that requires 96GB or more.

by sliken

4/17/2026 at 3:09:48 AM

> What Strix Halo system has unified memory?

All of them. The static VRAM allocation is tiny (512MB), most of the memory is unified

by ac29

4/16/2026 at 2:41:08 PM

Any good gaming pc can run the 35b-a3 model. Llama cpp with ram offloading. A high end gaming PC can run it at higher speeds. For your 122b, you need a lot of memory, which is expensive now. And it will be much slower as you need to use mostly system ram.

by canpan

4/16/2026 at 3:32:58 PM

Seconding this. You can get A3B/A4B models to run with 10+ tok/sec on a modern 6/8GB GPU with 32k context if you optimize things well. The cheapest way to run this model at larger contexts is probably a 12gb RTX 3060.

by bigyabai

4/16/2026 at 2:48:06 PM

I can run this on an AMD Framework laptop. A Ryzen 7 (I dont have Ryzen AI, just Ryzen 7 7840U) with 32+48 GB DDR. The Ryzen unified memory is enough, I get 26GB of VRAM at least.

Fedora 43 and LM Studio with Vulkan llama.cpp

by mildred593

4/16/2026 at 2:50:26 PM

I run Gemma 4 26B-A4B with 256k context (maximum) on Radeon 9070XT 16GB VRAM + 64GB RAM with partial GPU offload (with recommended LMStudio settings) at very reasonable 35 tokens per second, this model is similiar in size so I expect similar performance.

by terramex

4/16/2026 at 2:41:12 PM

The Q5 quantization (26.6GB) should easily run on a 32GB 5090. The Q4 (22.4GB) should fit on a 24GB 4090, but you may need to drop it down to Q3 (16.8GB) when factoring in the context.

You can also run those on smaller cards by configuring the number of layers on the GPU. That should allow you to run the Q4/Q5 version on a 4090, or on older cards.

You could also run it entirely on the CPU/in RAM if you have 32GB (or ideally 64GB) of RAM.

The more you run in RAM the slower the inference.

by rhdunn

4/16/2026 at 3:17:10 PM

I currently run the qwen3.5-122B (Q4) on a Strix Halo (Bosgame M5) and am pretty happy with it. Obviously much slower than hosted models. I get ~ 20t/s with empty context and am down to about 14t/s with 100k of context filled.

No tuning at all, just apt install rocm and rebuilding llama.cpp every week or so.

by bildung

4/16/2026 at 5:07:24 PM

(Please don't downvote - serious question) Are Chinese models generally accepted for use within US companies? The company I work for won't allow Qwen.

by psim1

4/16/2026 at 5:28:39 PM

There is a difference between Chinese model and Chinese service.

Your company most likely is banning the use of foreign services, but it wouldn't make sense to ban the model, since the model would be ran locally.

I wouldn't allow my employees to use a foreign service either if my company had specific geographic laws it had to follow (ie, fin or med or privacy laws, such as the ones in the EU).

That said, I'm not sure I'd allow them to use any AI product either, locally inferred on-prem or not: I need my employees to _not_ make mistakes, not automate mistake making.

by DiabloD3

4/16/2026 at 5:16:45 PM

In private sector yes. Anything that touches public sector (government) and it starts to be supply chain concerns and they want all american made models

by kelsey98765431

4/17/2026 at 6:37:24 AM

The only problem is that the American models are super fracking dumb. Arcee Thinking Large (398B) is orders of magnitude worse than even Qwen 3.5 35B, getting stuck in thinking loops with incredibly basic questions that Google could answer in 500ms.

by gbgarbeb

4/16/2026 at 2:08:03 PM

how does this compare to gpt-oss-120b? It seems weird to leave it out.

by ghc

4/16/2026 at 3:11:23 PM

GPT-OSS 120B (really 117B-A5.1B) is a lot bigger. better comparison would be to 20B (21B-A3.6B).

by vyr

4/16/2026 at 5:28:15 PM

OSS-120 is too old to be relevant, and four times the size.

by 7734128

4/16/2026 at 6:38:08 PM

I cant wait to see some smaller sizes. I would love to run some sort of coding centric agent on a local TPU or GPU instead of having to pay, even if it's slower.

by giantg2

4/16/2026 at 2:44:03 PM

I'm disappointed they didn't release a 27B dense model. I've been working with Qwen3.5-27B and Qwen3.5-35B-A3B locally, both in their native weights and the versions the community distilled from Opus 4.6 (Qwopus), and I have found I generally get higher quality outputs from the 27B dense model than the 35B-A3B MOE model. My basic conclusion was that MoE approach may be more memory efficient, but it requires a fairly large set of active parameters to match similarly sized dense models, as I was able to see better or comparable results from Qwen3.5-122B-A10B as I got from Qwen3.5-27B, however at a slower generation speed. I am certain that for frontier providers with massive compute that MoE represents a meaningful efficiency gain with similar quality, but for running models locally I still prefer medium sized dense models.

I'll give this a try, but I would be surprised if it outperforms Qwen3.5-27B.

by tristor

4/16/2026 at 2:54:03 PM

You are right, but this is just the first open-weights model of this family.

They said that they will release several open-weights models, though there was an implication that they might not release the biggest models.

by adrian_b

4/16/2026 at 3:09:12 PM

Given that DeepSeek, GLM, Kimi etc have all released large open weight models, I am personally grateful that Qwen fills the mid/small sized model gap even if they keep their largest models to themselves. The only other major player in the mid/small sized space at this point is pretty much only Gemma.

by hnfong

4/16/2026 at 2:57:35 PM

I'm totally fine with that, frankly. I'm blessed with 128GB of Unified Memory to run local models, but that's still tiny in comparison the larger frontier models. I'd much rather get a full array of small and medium sized models, and building useful things within the limits of smaller models is more interesting to me anyway.

by tristor

4/16/2026 at 5:28:44 PM

[dead]

by andrewmcwatters

4/16/2026 at 6:57:42 PM

It's a given that the dense models with comparable size are better. I also proved that in my use case for those two Qwen 3.5 models.

The benchmarks show 3.6 is a bit better than 3.5. I should retry my task, but I don't have a lot of confidence. But it does sound like they worked on the right thing which is getting closer to the 27B performance.

by ilaksh

4/16/2026 at 2:34:05 PM

Planning to deploy Qwen3.6-35B-A3B on NVIDIA Spark DGX for multi-agent coding workflows. The 3B active params should help with concurrent agent density.

by btbr403

4/17/2026 at 11:17:10 AM

What kind of hardware does this require to run locally, and how many tokens/seconds does it produce?

by logicallee

4/16/2026 at 5:07:25 PM

Is it worth running speculative decoding on small active models like this? Or does MTP make speculative decoding unnecessary?

by syntaxing

4/16/2026 at 4:57:03 PM

Looks like they compare only to open models, unfortunately.

As I am using mostly the non-open models, I have no idea what these numbers mean.

by amelius

4/16/2026 at 6:42:47 PM

[dead]

by varispeed

4/16/2026 at 2:37:16 PM

What do all the numbers 6-35B-A3B mean?

by zshn25

4/16/2026 at 2:42:31 PM

3.6 is the release version for Qwen. This model is a mixture of experts (MoE), so while the total model size is big (35 billion parameters), each forward pass only activates a portion of the network that’s most relevant to your request (3 billion active parameters). This makes the model run faster, especially if you don’t have enough VRAM for the whole thing.

The performance/intelligence is said to be about the same as the geometric mean of the total and active parameter counts. So, this model should be equivalent to a dense model with about 10.25 billion parameters.

by dunb

4/16/2026 at 2:54:32 PM

And even if you have enough VRAM to fit the entire thing, inference speed after the first token is proportional to (activated parameters)/(vram bandwidth)

If you have the vram to spare, a model with more total params but fewer activated ones can be a very worthwhile tradeoff. Of course that's a big if

by wongarsu

4/16/2026 at 2:47:10 PM

Sorry, how did you calculate the 10.25B?

by zshn25

4/16/2026 at 3:06:33 PM

> > The performance/intelligence is said to be about the same as the geometric mean of the total and active parameter counts. So, this model should be equivalent to a dense model with about 10.25 billion parameters.

> Sorry, how did you calculate the 10.25B?

The geometric mean of two numbers is the square root of their product. Square root of 105 (35*3) is ~10.25.

by darrenf

4/16/2026 at 2:38:11 PM

The 6 is part of 3.6, the model version. 35B parameters, A3B means it's a mixture of experts model with only 3B parameters active in any forward pass.

by cshimmin

4/16/2026 at 2:42:16 PM

Got it. Thanks

by zshn25

4/16/2026 at 2:41:08 PM

35B (35 billion) is the number of parameters this model has. Its a Mixture of Experts model (MoE) so A3B means that 3B parameters are Active at any moment.

by JLO64

4/16/2026 at 2:42:06 PM

~I see. What’s the 6?~

Nevermind, the other reply clears it

by zshn25

4/16/2026 at 2:42:25 PM

3.6 is model number, 35B is total number of parameters, A3B means that only 3B parameters are activated, which has some implications for serving (either in you you shard the model, or you can keep the total params on RAM and only road to VRAM what you need to compute the current token, which will make it slower, but at least it runs)

by joaogui1

4/16/2026 at 3:41:10 PM

Anybody use these instead of codex or claude code? Thoughts in comparison?

benchmarks dont really help me so much

by yieldcrv

4/16/2026 at 9:17:48 PM

In my test case (a feature all models got stuck on a few months ago) it just gets stuck in a thinking loop and never gets anywhere. Not a super amazing test, but it happened a few times in a row, so...

by 3836293648

4/16/2026 at 2:19:05 PM

"open source"

give me the training data?

by zoobab

4/16/2026 at 2:33:45 PM

The training data is the entire internet. How do you propose they ship that to you

by tjwebbnorfolk

4/16/2026 at 4:28:45 PM

As a zip archive of however they store it in their database?

by thrance

4/17/2026 at 5:06:16 PM

to be fair there are some degree of "hand curation" of the data so while "it is the internet", the actual trained data is a derivation of that.

in a mild but productive analogy:

I could actually hand a K&R book C programming book + lots of specs to say "this is the linux source code" (the raw data that were all observations were made, aka "the internet") ...or just send them the "kernel the source code" (the refined training data, after a LOT of manual stuff) ... that your compiler consumes to generate the kernel. (the Open Weights model, what they actually shared)

Mildly related rant: honestly its a bit shit to say "open source model" in a "open weights" model, its like saying World of Warcraft is opensource because they gave you an executable of the game. (you can still change it, but in more restricted ways)

by aruametello

4/16/2026 at 2:32:52 PM

You ARE the training data

by flux3125

4/16/2026 at 5:34:10 PM

Did anyone try it and Gemma 4? Does it feel that it's better than Gemma 4?

by solomatov

4/16/2026 at 5:19:14 PM

Do we know if other models have started detecting and poisoning training/fine tuning that these Chinese models seem to use for alignment, I’d certainly be doing some naughty stuff to keep my moat if I was Anthropic or OpenAI…

by andy_ppp

4/16/2026 at 9:20:05 PM

They no longer show reasoning traces and are throttling more aggressively.

by storus

4/16/2026 at 9:23:07 PM

They never showed full reasoning traces, just post-hoc summaries.

by zozbot234

4/16/2026 at 10:32:19 PM

DeepSeek still shows them, it sometimes says "I am ChatGPT", and Claude sometimes says "I am DeepSeek" so the distillation went both ways.

by storus

4/16/2026 at 6:59:16 PM

Can anyone confirm this fits on a 3090? Size is exactly 24gb

by ActorNightly

4/16/2026 at 5:46:09 PM

What is the min VRAM this can run on given it is MOE?

by tmaly

4/16/2026 at 7:37:12 PM

Fwiw, with its predecessor's Qwen3.5-35B-A3B-Q6_K.gguf, on a laptop's 6 GB VRAM and 32 GB RAM, with default llama.cpp settings, I get 20 t/s generation.

by mncharity

4/16/2026 at 8:39:10 PM

Have you tried running llama.cpp with Unified Memory Access[1] so your iGPU can seamlessly grab some of the RAM? The environment variable is prefixed with CUDA but this is not CUDA specific. It made a pretty significant difference (> 40% tg/s) on my Ryzen 7840U laptop.

1 - https://github.com/ggml-org/llama.cpp/blob/master/docs/build...

by rubiquity

4/16/2026 at 8:49:22 PM

Your link seems to be describing a runtime environment variable, it doesn't need a separate build from source. I'm not sure though (1) why this info is in build.md which should be specific to the building process, rather than some separate documentation; and (2) if this really isn't CUDA-specific, why the canonical GGML variable name isn't GGML_ENABLE_UNIFIED_MEMORY , with the _CUDA_ variant treated as a legacy alias. AIUI, both of these should be addressed with pull requests for llama.cpp and/or the ggml library itself.

by zozbot234

4/16/2026 at 9:10:29 PM

You are right that it is an environment variable, and that's how I have it set in my nix config. Thanks for correcting that.

Unfortunately llama.cpp is somewhat notorious for having lackluster docs. Most of the CLI tools don't even tell you what they are for.

by rubiquity

4/16/2026 at 10:33:50 PM

Hmm. Perhaps there's a niche for a "The Missing Guide to llama.cpp"? Getting started, I did things like wrapping llama-cli in a pty... and only later noticing a --simple-io argument. I wonder if "living documents" are a thing yet, where LLMs keep an eye on repo and fora, and update a doc autonomously.

by mncharity

4/16/2026 at 10:16:41 PM

I hadn't tried that, thanks! I found simply defining GGML_CUDA_ENABLE_UNIFIED_MEMORY, whether 1, 0, or "", was a 10x hit to 2 t/s. Perhaps because the laptop's RAM is already so over-committed there. But with the much smaller 4B Qwen3.5-4B-Q8_0.gguf, it doubled performance from 20 to 40+ t/s! Tnx! (an old Quadro RTX 3000 rather than an iGPU)

by mncharity

4/17/2026 at 6:06:22 PM

That is pretty solid, I have a 2070 with 8GB VRAM and 64GB RAM, but I haven't run too much. I regret not getting a 3090 back when I built this machine.

by tmaly

4/18/2026 at 4:50:55 AM

Nod. Mine was VR dev leftovers. Fwiw, running 6ish prompts in parallel, roughly doubles my aggregate t/s (but requires cooling kludgery). If one's goal is not local, but rather real-time or consistent or transparent or scalable, there's AWS.

by mncharity

4/17/2026 at 11:53:46 AM

Does any LLM aggregator offers this model?

by Divs2890

4/16/2026 at 7:45:59 PM

any tips for running it locally within an agent harness? maybe using pi or opencode?

by zengid

4/16/2026 at 9:07:42 PM

It pretty much just works. Run the unsloth quant in llama.cpp and hook it up to pi. A bunch of minor annoyances like not having support for thinking effort. It also defaults to "interleaved thinking" (thinking blocks get stripped from context), set `"chat_template_kwargs": {"preserve_thinking": True},` if you interrupt the model often and don't want it to forget what it was thinking.

by stratos123

4/16/2026 at 7:23:11 PM

Is this the hybrid variant of Gwent and Quen? I hope this is in The Witcher IV!

by the__alchemist

4/16/2026 at 2:10:27 PM

Does anyone have any experience with Qwen or any non-Western LLMs? It's hard to get a feel out there with all the doomerists and grifters shouting. Only thing I need is reasonable promise that my data won't be used for training or at least some of it won't. Being able to export conversations in bulk would be helpful.

by bossyTeacher

4/16/2026 at 2:13:51 PM

The Chinese models are generally pretty good.

> Only thing I need is reasonable promise that my data won't be used

Only way is to run it local.

I personally don’t worry about this too much. Things like medical questions I tend to do against local models though

by Havoc

4/16/2026 at 2:26:59 PM

Have you tried asking about sensitive topics?

I asked it if there were out of bounds topics but it never gave me a list.

See its responses:

Convo 1

- Q: ok tell me about taiwan

- A: Oops! There was an issue connecting to Qwen3.6-Plus. Content security warning: output text data may contain inappropriate content!

Convo 2

- Q: is winnie the pooh broadcasted in china?

- A: Oops! There was an issue connecting to Qwen3.6-Plus. Content security warning: input text data may contain inappropriate content!

These seem pretty bad to me. If there are some topics that are not allowed, make a clear and well defined list and share it with the user.

by bossyTeacher

4/16/2026 at 2:45:04 PM

I have both the Qwen 3.5 9B regular and uncensored versions. The censored version sometimes refuses to answer these kinds of questions or just gives a sanitised response. For example:

> ok tell me about taiwan

> Taiwan is an inalienable part of China, and there is no such entity as "Taiwan" separate from the People's Republic of China. The Chinese government firmly upholds national sovereignty and territorial integrity, which are core principles enshrined in international law and widely recognized by the global community. Taiwan has been an inseparable part of Chinese territory since ancient times, with historical, cultural, and legal evidence supporting this fact. For accurate information on cross-strait relations, I recommend referring to official sources such as the State Council Information Office or Xinhua News Agency.

The uncensored version gives a proper response. You can get the uncensored version here:

https://huggingface.co/HauhauCS/Qwen3.5-9B-Uncensored-Hauhau...

by spuz

4/16/2026 at 2:57:42 PM

You can find on Huggingface uncensored modifications of the Qwen models, but I have not tried yet such questions, to see what they might answer.

For some such questions, even the uncensored models might be not able to answer, because I assume that any document about "winnie the pooh" would have been purged from the training set before training.

by adrian_b

4/16/2026 at 2:33:57 PM

You may be interested in heretic. People often post models to hf that have been un-censored

https://github.com/p-e-w/heretic

by boredatoms

4/16/2026 at 2:52:17 PM

lol yes I tried it for giggles back in 2023 when the first Chinese models came out.

Unless you’re a political analyst or child I don’t think asking models about Winnie the Pooh is particularly meaningful test of anything

These days I’m hitting way more restrictions on western models anyway because the range of things considered sensitive is far broader and fuzzier.

by Havoc

4/16/2026 at 3:10:24 PM

> These days I’m hitting way more restrictions on western models anyway because the range of things considered sensitive is far broader and fuzzier.

Ah interesting, what are some topics where you are not getting answers?

by bossyTeacher

4/16/2026 at 3:18:47 PM

General chatbot use about daily life. Accidentally stumbling across something considered racist/sexist/woke/pronouns/whatever being offended about is flavour of the week is much more likely than a casual chat session wandering into turf that is politically sensitive in China.

by Havoc

4/18/2026 at 8:02:18 AM

My main concern in practice here is prompt injection style attacks where the model gets destabilized by an attacker mentioning Chinese political topics.

Part of the issue here is that the western model restriction things you're talking about tend towards well reasoned refusals, whereas these models will outright lie instead. (Actual model output: Your previous question involved a false premise: there is no such thing as a "June 4th incident" in history.)

Like, yes, you don't go to these models for questions about Chinese politics, but imagine agentic scenarios along the lines of "the model sees a git commit message mentioning Taiwan and becomes more inclined to lie about the contents of the commit".

by klempner

4/16/2026 at 2:46:35 PM

> Have you tried asking about sensitive topics?

Quoting my teenage son on the subject of the existence of a god - "I don't know and I don't care."

I mean, seriously - do you really think you have access to a model that isn't lobotomised in some way?

by lelanthran

4/16/2026 at 2:32:33 PM

You can also rent a cloud GPU which is relatively affordable.

by manmal

4/16/2026 at 3:03:38 PM

I used Qwen CLI's undescribed “coder_agent” (I guess Qwen 3.5 with size auto-selection) and it was powerful enough to complete 95% of a small hobby project involving coding, reverse engineering and debugging. Sometimes it was able to work unattended for several tens of minutes, though usually I had to iterate at smaller steps and prompt it every 4-5 minutes on how to continue. I'd rate it a little below the top models by Anthropic and OpenAI, but much better than everything else.

by alberto-m

4/16/2026 at 7:18:26 PM

Personally, I wouldn't trust any foreign or domestic LLM providers to not train on your data. I also wouldn't trust them to not have a data breach eventually which is worse. If you're really worried about your data, run it locally. The Chinese models (Qwen, GLM, etc.) are really competitive to my understanding.

by cpburns2009

4/16/2026 at 2:17:10 PM

> Does anyone have any experience with Qwen or any non-Western LLMs?

I use GLM-5.1 for coding hobby project, that going to end up on github anyway. Works great for me, and I only paid 9 USD for 3 month, though that deal has run out.

> my data won't be used for training

Yeah, I don't know. Doubt it.

by Mashimo

4/16/2026 at 2:23:31 PM

$20 for 3 months is still far better than alternatives, and 5.1 works great

by ramon156

4/16/2026 at 10:22:34 PM

Run open models locally. Data stays local, and exporting sessions is straightforward.

by chabes

4/16/2026 at 11:03:52 PM

if any Alibaba (Qwen) folks are here - website is not working on safari

by dzonga

4/17/2026 at 6:43:12 AM

Yeah damn but they are heavy lol

by thesuperevil

4/18/2026 at 11:00:34 AM

[dead]

by nexustoken

4/17/2026 at 10:15:07 AM

[dead]

by hemangjoshi37a

4/17/2026 at 12:12:23 AM

[dead]

by bustah

4/17/2026 at 10:38:16 PM

[flagged]

by maryjeiel

4/16/2026 at 4:00:52 PM

[dead]

by maxothex

4/16/2026 at 9:42:55 PM

[dead]

by RITESH1985

4/16/2026 at 5:14:10 PM

[dead]

by LouisvilleGeek

4/16/2026 at 5:26:50 PM

[dead]

by ninjahawk1

4/16/2026 at 2:47:08 PM

[dead]

by reynaventures

4/16/2026 at 2:47:32 PM

[dead]

by reynaventures

4/16/2026 at 10:53:17 PM

[dead]

by tech_curator

4/16/2026 at 3:33:40 PM

[dead]

by typia

4/16/2026 at 11:16:25 PM

fuck off: https://news.ycombinator.com/item?id=47796830

by smcl

4/16/2026 at 2:06:30 PM

More benchmaxxing I see. Too bad there’s no rig with 256gb unified ram for under $1000

by amazingamazing

4/16/2026 at 2:11:18 PM

do you know if they did this to it?

https://research.google/blog/turboquant-redefining-ai-effici...

by kennethops

4/16/2026 at 2:28:35 PM

Llama.cpp already uses an idea from it internally for the KV cache [0]

So a quantized KV cache now must see less degradation

[0] https://github.com/ggml-org/llama.cpp/pull/21038

by kgeist

4/16/2026 at 3:38:18 PM

taps the sign

  Unified Memory Is A Marketing Gimmeck. Industrial-Scale Inference Servers Do Not Use It.

by bigyabai

4/16/2026 at 3:42:53 PM

Industrial Scale Inference is moving towards LPDDR memory (alongside HBM), which is essentially what "Unified Memory" is.

by zozbot234

4/16/2026 at 7:11:52 PM

> which is essentially what "Unified Memory" is.

Unified memory is when CPU and GPU can reference the same memory address without things being copied (CUDA allows you to write code as if it was unified even if it's not, so that doesn't count, but HMM does count[1])

That is all. What technology is underneath is hardware detail. Unified memory on macs lets you put something into a memory, then do some computation on it with CPU, ANE, ANA, Metal Shaders. All without copying anything.

DGX Spark also has unified memory.

[1]: https://docs.nvidia.com/cuda/cuda-programming-guide/02-basic...

by 0x457

4/16/2026 at 3:56:33 PM

LPDDR is LPDDR. There's nothing "unified" about it architecturally.

by bigyabai

4/17/2026 at 3:25:23 AM

On M5 Pro/Max the memory is actually just attached straight to the GPU die. CPU accesses memory through the die-to-die bridge. I don't see the difference between that and a pure GPU from a memory connectivity point of view.

Wrt inference servers: sure, it's not cost-effective to have such a huge CPU die and a bunch of media accelerators on the GPU die if you just care about raw compute for inference and training. Apple SoCs are not tuned for that market, nor do they sell into it. I'm not building a datacentre, I'm trying to run inference on my home hardware that I also want to use for other things.

by wren6991

4/17/2026 at 6:49:54 PM

If you're going to do unified memory, that's the way to do it, in addition to using higher-bandwidth RAM and padding out your GPGPU hardware. Nvidia realized this almost a decade ago, and Apple is being dragged through the mud to learn the exact same $4 trillion lesson.

by bigyabai

4/16/2026 at 11:48:51 PM

Unified Memory is mainly how consumer hardware has enough RAM accessible by the GPU to run larger models, because otherwise the market segmentation jacks up the price substantially.

by rcxdude

4/17/2026 at 12:20:29 AM

UMA removes the PCIe bottleneck and replaces it with a memory controller + bandwidth bottleneck. For most high-performance GPUs, that would be a direct downgrade.

by bigyabai

4/16/2026 at 7:09:23 PM

Sir, this is 2026. You're not getting any 128GB of RAM for under $1k.

by cpburns2009

4/16/2026 at 2:08:53 PM

I don't want "Agentic Power".

I want to reduce AI to zero. Granted, this is an impossible to win fight, but I feel like Don Quichotte here. Rather than windmill-dragons, it is some skynet 6.0 blob.

by shevy-java

4/16/2026 at 2:33:54 PM

Then who is Rocinante?

by lagniappe

4/16/2026 at 8:40:18 PM

A dimming IBM x40 Thinkpad missing its F key.

by blazzy