4/29/2026 at 4:30:03 PM
I'm not sure what people are on in the comments. It doesn't beat the other models, but it sure competes despite its size.
GLM 5.1 is an excellent model, but even at Q4 you're looking at ~400GB. Kimi K2.5 is really good too, and at Q4 quantization you're looking at almost 600GB.
This model? You can run it at Q4 with 70GB of VRAM. This is approaching consumer level territory (you can get a Mac Studio with 128GB of RAM for ~3500 USD).
For the Claude-pilled people, I don't know if you only run Opus but when I was on the Pro plan Sonnet was already extremely capable. This beats the latest Sonnet while running locally, without anyone charging you extra for having HERMES.md in your repo, or locking you out of your account on a whim.
Mistral has never been competitive at the frontier, but maybe that is not what we need from them. Having Pareto models that get you 80% of the frontier at 20% of the cost/size sounds really good to me.
by simjnd
4/29/2026 at 5:11:39 PM
> This model? You can run it at Q4 with 70GB of VRAM. This is approaching consumer level territory (you can get a Mac Studio with 128GB of RAM for ~3500 USD).
The one thing I would want everyone curious about local LLMs to know is that being able to run a model and being able to run a model fast are two very different thresholds. You can get these models to run on a 128GB Mac, but you first need to check whether Q4 retains enough quality (models have different sensitivities to quantization) and how fast it runs.
For running async work and background tasks the prompt processing and token generation speeds matter less, but a lot of Mac Studio buyers have discovered the hard way that it's not going to be as responsive as working with a model hosted in the cloud on proper hardware.
For most people without hard requirements for on-site processing, the best use case for this model would be going through one of the OpenRouter hosted providers for it and paying by token.
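To make the pay-by-token route concrete, here is a minimal sketch using the openai Python client pointed at OpenRouter's OpenAI-compatible endpoint. The model slug is a placeholder, not the real identifier for this release:
    # Hedged sketch: OpenRouter speaks the OpenAI chat-completions protocol,
    # so the standard client works once base_url is swapped out.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key="sk-or-...",  # your OpenRouter key
    )

    resp = client.chat.completions.create(
        model="mistralai/mistral-medium-3.5",  # hypothetical slug
        messages=[{"role": "user", "content": "Explain KV cache quantization."}],
    )
    print(resp.choices[0].message.content)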
> This beats the latest Sonnet while running locally
Almost every open weight model launch this year has come with claims that it matches or exceeds Sonnet. I've been trying a lot of them and I have yet to see it in practice, even when the benchmarks show a clear lead.
by Aurornis
4/29/2026 at 9:42:11 PM
> Almost every open weight model launch this year has come with claims that it matches or exceeds Sonnet. I've been trying a lot of them and I have yet to see it in practice, even when the benchmarks show a clear lead.
This has been my experience as well. I've been testing an agent built with Strands Agents which receives a load balancer latency alert and is expected to query logs with AWS Athena (Trino), then drill down with Datadog spans/traces to find the root cause. Admittedly, "devops" domain knowledge is important here.
My notes so far:
"us.anthropic.claude-sonnet-4-6" # working, good results
"us.anthropic.claude-sonnet-4-20250514-v1:0" # has problems following the prompt instructions
"us.anthropic.claude-sonnet-4-5-20250929-v1:0" # working, good results
"us.anthropic.claude-opus-4-5-20251101-v1:0"
"us.anthropic.claude-opus-4-6-v1" # best results, slower, more expensive
"amazon.nova-pro-v1:0" # completely fails
"openai.gpt-oss-120b-1:0" # tool calling broken
"zai.glm-5" # seems to work pretty well, a little slow, more expensive than Sonnet
"minimax.minimax-m2.5" # didn't diagnose correctly
"zai.glm-4.7" # good results but high tool call count, more expensive than Sonnet
"mistral.mistral-large-3-675b-instruct" # misdiagnosed--somehow claimed a Prometheus scrape issue was involved
"moonshotai.kimi-k2.5" # identified the right endpoints but interpreted trace data/root cause incorrectly
"moonshot.kimi-k2-thinking" # identified endpoint, 1 correct root cause, 1 missing index hallucination
Using models on AWS Bedrock. I let Claude Code w/ Opus 4.7 iterate over the agent prompt but didn't try to optimize per model. Really the only thing that came close to Sonnet 4.5 was GLM-5. The real kicker is that Sonnet is also the cheapest, since it supports prompt caching.
The Kimi ones were close to working but didn't quite make the mark.
by nijave
4/30/2026 at 12:00:20 AM
" it supports prompt caching" May I ask if you checked that? I use "{"cachePoint": { "type": "default" }" and I found 2 things: * 1) even if stated in the Doco, Bedrock Converse API does not allow 1hr expiry time, only 5m - gives error when attempted; * 2) Bedrock Converse API does accept up to 4 cachePoint's but does NOT cache and returns zeroes. LOL. It was confirmed by some other people on Github. (Note: VertexAI does cache properly reducing the bill drastically, so I use Vertex instead of OpenRouter.)by pbgcp2026
4/30/2026 at 12:43:30 PM
I had Claude Code pull the OTEL trace and calculate cost based on token counts in the responses. I'll double check later today tho if I remember.
Edit: I do see the first request shows 0 cache read, 7k cache write tokens. The next request shows 7k cache read, 900 cache write tokens. The agent run summary is:
usage {
    cache_read_input_tokens   244586
    cache_write_input_tokens  38399
    completion_tokens         8131
    input_tokens              1172
    output_tokens             8131
    prompt_tokens             1172
    total_tokens              292288
}
I do see a recent issue in the Strands Agent issue tracker about 1hr TTL getting ignored and defaulting to 5m TTL. I haven't validated cache TTL but these agent runs take ~2-3m so a 5m TTL is sufficient.
I also checked the AWS bill and see separate Usage SKUs:
    USE1-MP:USE1_CacheWriteInputTokenCount-Units $0.34
    USE1-MP:USE1_OutputTokenCount-Units          $0.27
    USE1-MP:USE1_CacheReadInputTokenCount-Units  $0.16
    USE1-MP:USE1_InputTokenCount-Units           $0.01
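For anyone trying to reproduce this, here's a minimal sketch of setting a cache point through boto3's Converse API and reading the cache counters back. The request/response shapes follow the AWS docs; the model ID is just the Sonnet 4.5 one from my notes above, and the prompts are stand-ins:
    # Hedged sketch: a cachePoint block marks everything before it as cacheable.
    import boto3

    client = boto3.client("bedrock-runtime", region_name="us-east-1")

    resp = client.converse(
        modelId="us.anthropic.claude-sonnet-4-5-20250929-v1:0",
        system=[
            {"text": "You are an SRE agent. <long static instructions here>"},
            {"cachePoint": {"type": "default"}},  # cache the system prompt
        ],
        messages=[{"role": "user", "content": [{"text": "Diagnose this ALB latency alert: ..."}]}],
    )

    usage = resp["usage"]
    # Non-zero cacheWriteInputTokens on the first call and non-zero
    # cacheReadInputTokens on later calls means caching is actually active.
    print(usage.get("cacheReadInputTokens"), usage.get("cacheWriteInputTokens"))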
by nijave
4/29/2026 at 5:51:20 PM
> The one thing I would want everyone curious about local LLMs to know is that being able to run a model and being able to run a model fast are two very different thresholds. You can get these models to run on a 128GB Mac, but you first need to check whether Q4 retains enough quality (models have different sensitivities to quantization) and how fast it runs.
Very valid. This is an active area of research, and there are a lot of options to try out already today.
- People have successfully used TurboQuant to quantize model weights (TQ3_4S), not just the context KV, to achieve smaller sizes than Q4 (~3.5 bpw) with much better PPL and faster decoding.
- Importance-weighted quantization (e.g. IQ4) also provides way better PPL, KLD, etc. at the same size as a Q4.
- DFlash (block diffusion for speculative decoding) needs a good drafting model compatible with the big model, but can provide an uplift of up to 5x in decoding (although usually in the 2-2.5x range); see the sketch below.
- Forcing a model's thinking to obey a simple grammar has been shown to improve results with drastically lower thinking output (faster effective result generation) although that has been more impactful on smaller models.
We should be skeptical, but it's definitely trending in the right direction and I wouldn't be surprised if we are indeed able to run it at acceptable speeds.
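To make the DFlash point concrete, here is a toy sketch of the draft-then-verify loop behind speculative decoding (and MTP-style self-drafting). Both "models" are stand-in functions; a real implementation verifies all drafted positions in a single forward pass of the big model:
    def draft(prefix, k=4):
        # cheap drafter proposes k tokens (toy: cycles a canned reply)
        canned = ["the", "answer", "is", "forty", "two", "."]
        return [canned[(len(prefix) + i) % len(canned)] for i in range(k)]

    def target(prefix):
        # the big model's next token for a prefix (toy: same canned reply,
        # so most drafts get accepted; a weaker drafter gets rejected more)
        canned = ["the", "answer", "is", "forty", "two", "."]
        return canned[len(prefix) % len(canned)]

    def speculative_step(prefix):
        accepted = []
        for tok in draft(prefix):
            if target(prefix + accepted) == tok:
                accepted.append(tok)  # draft confirmed: a nearly free token
            else:
                accepted.append(target(prefix + accepted))  # correct it, stop
                break
        return accepted

    out = []
    while len(out) < 12:
        out += speculative_step(out)
    # effective speedup ~ average tokens accepted per verification step
    print(" ".join(out))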
> Almost every open weight model launch this year has come with claims that it matches or exceeds Sonnet. I've been trying a lot of them and I have yet to see it in practice, even when the benchmarks show a clear lead.
This hasn't been my experience. After Anthropic started their shenanigans I've switched to exclusively using open-weights models via OpenRouter and OpenCode and I can't really tell a difference (for better or for worse).
by simjnd
4/30/2026 at 3:16:18 AM
> - Importance-weighted quantization (e.g. IQ4) also provides way better PPL, KLD, etc. at the same size as a Q4.
All the Q quants from big quant providers are importance-weighted (imatrix) nowadays.
The main (possibly only?) difference between Q and IQ today is that IQ uses a lookup table to achieve better compression. That is also why IQ suffers more when it can't fully fit into VRAM.
It's important to teach people the distinction and not perpetuate wrong assumptions of the past. If one needs/wants static quants, ignoring IQ_ isn't enough.
by tredre3
4/30/2026 at 11:27:26 AM
Thanks for bringing this up. I looked into it, and if I understood correctly:
- Q4_0 (not K quant) is the traditional flat quantization
- Q4_K (4-bit K quant) uses an imatrix and important weights get higher precision (5-6 bits instead of 4, but still largely 4 bits)
- IQ4 uses an imatrix and important weights get an optimized scale to avoid clipping at 4-bit, but all the weights are still 4-bit
And yeah most quants nowadays are K quants which are importance weighted
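If you want to sanity-check file sizes, a back-of-envelope sketch; the bits-per-weight figures are rough approximations (real GGUF sizes vary with the tensor mix), and 128B is the dense param count discussed in this thread:
    def gguf_size_gb(params_billions, bits_per_weight):
        # weights only; KV cache and runtime overhead come on top
        return params_billions * bits_per_weight / 8

    for name, bpw in [("Q8_0", 8.5), ("Q4_K_M", 4.8), ("IQ4_XS", 4.25), ("TQ ~3.5 bpw", 3.5)]:
        print(f"{name:>12}: ~{gguf_size_gb(128, bpw):.0f} GB")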
by simjnd
4/29/2026 at 9:39:22 PM
Super interesting!
> - People have successfully used TurboQuant to quantize model weights (TQ3_4S), not just the context KV, to achieve smaller sizes than Q4 (~3.5 bpw) with much better PPL and faster decoding.
Where can I find more info on this? I’d like to convert models to onnx this way.
> - Importance-weighted quantization (e.g. IQ4) also provides way better PPL, KLD, etc. at the same size as a Q4.
Where can I find more info on this? I’d like to convert models to onnx this way.
The most difficult environment for small models is in the browser. Would be great to push the SOTA in that environment.
by sroussey
4/30/2026 at 11:13:06 AM
For TurboQuant on model weights AFAIK it's currently a single person effort [1]. It needs his fork of llama.cpp, hasn't been upstreamed. He publishes his quantizations on HuggingFace but I'm not sure if he open-sourced the quantization pipeline.
by simjnd
4/29/2026 at 11:36:04 PM
Google only released their TurboQuant paper barely a month ago; it is bleeding edge even by LLM standards.
by hadlock
4/30/2026 at 12:54:03 AM
Actually, they published it a year ago. What's recent is the post on the official Google blog.
https://arxiv.org/abs/2504.19874
https://research.google/blog/turboquant-redefining-ai-effici...
by sroussey
4/29/2026 at 10:46:20 PM
> being able to run a model and being able to run a model fast are two very different thresholds
Specifically speaking, on my Strix Halo machine with (theoretical) memory bandwidth of 256 GB/s, a 70 GB model can't generate faster than 256/70 ≈ 3.65 t/s. The logic here is that a dense model must do a full read of the weights for each token. So even if the GPU can keep up, the memory bandwidth is limiting.
A Mac M5 Pro has a bandwidth of 307 GB/s, but that's only a little faster.
This thing is going to be slow on consumer hardware. Maybe that is useful for someone, but I probably prefer a faster model in most cases even if the model isn't quite as smart. Qwen3.6 35B-A3B generates about 50 t/s on my machine, so it can make mistakes, be corrected, and try again in the same time that this model would still be thinking about its first response.
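The arithmetic generalizes into a one-liner; the bandwidth numbers below are the ones quoted in this thread, and real-world rates land another 15-20% lower:
    def max_decode_tps(bandwidth_gb_s, model_size_gb):
        # dense decode is bandwidth-bound: every token reads all weights once
        return bandwidth_gb_s / model_size_gb

    for name, bw in [("Strix Halo", 256), ("M5 Pro", 307), ("M3 Ultra", 819)]:
        print(f"{name}: <= {max_decode_tps(bw, 70):.1f} t/s on a 70 GB model")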
by parsimo2010
4/29/2026 at 11:23:19 PM
Recent models support multi-token prediction, which can guess multiple future tokens in a single decode step (using some subset of the model itself, not a separate drafting model) and then verify them all at once. It's an emerging feature still (not widely supported) and it's only useful for speeding up highly predictable token runs, but it's one way to do better in practice than the common-sense theoretical limit might suggest.
by zozbot234
4/30/2026 at 12:03:34 AM
It seems to me it's only Grok 4.20 that does this currently? Which other models did you have in mind, if I may ask?
by pbgcp2026
4/30/2026 at 1:40:46 AM
Gemma4, qwen3.6, deepseek v4, mimo, glm 5/5.1 all do MTP.
by phamilton
4/30/2026 at 2:01:09 AM
Thank you, I just realised we are talking about MTP. It seems that it's not that clear though. "Currently, the MTP capabilities are primarily accessible through Google's proprietary LiteRT framework, rather than the open-weights versions... Despite the missing MTP heads in the open release, Gemma 4 (specifically the 26B-A4B variant) still demonstrates high efficiency"
by pbgcp2026
4/30/2026 at 1:35:57 PM
If Mistral Medium 3.5 supports it, that might get it to 10 t/s. It will still be fairly slow.
by parsimo2010
4/29/2026 at 5:15:01 PM
Cloud hardware is not inherently more "proper" than what's being proposed here; there's nothing wrong per se about targeting slower inference speeds in an on-prem single-user context.
by zozbot234
4/29/2026 at 5:19:11 PM
> Cloud hardware is not inherently more "proper" than what's being proposed here
Cloud hardware can run the original model. Quantization will reduce quality. The quality drop to Q4 is not trivial.
Cloud hardware is also massively faster in time to first token and token generation speed.
> there's nothing wrong per se about targeting slower inference speeds in a local single-user context.
If that's what the user wants and expects, then it's fine.
Most people working interactively with an LLM would suffer from slower turns.
by Aurornis
4/29/2026 at 6:20:06 PM
> Cloud hardware can run the original model. Quantization will reduce quality.
New models are often being released in quantized format to begin with. This is true of both Kimi and the new DeepSeek V4 series. There is no "original model", the model is generated using Quantization Aware Training (QAT).
by zozbot234
4/29/2026 at 6:49:19 PM
> There is no "original model", the model is generated using Quantization Aware Training (QAT).
The original model is the model used for the benchmarks.
People will say "You can run it locally!" then show the benchmarks of the original model, but what they really mean is that you can run a heavily quantized adaptation of the model which has different performance characteristics.
by Aurornis
4/29/2026 at 6:56:05 PM
That remark was specific to newer models like Kimi 2.x and the DeepSeek V4 series, and this is clearly stated in my comment.
As for other models, we quantize them because we are generally constrained by the model's total footprint in bytes, and running a larger model that's been quantized to fit in the same footprint as a smaller one improves performance compared to the smaller original, generally up to Q4 or so. Even tighter quantizations (down to Q2) are usable for some purposes, such as general Q&A chat.
by zozbot234
4/30/2026 at 8:39:26 AM
When you say DeepSeek v4... you do realise it is a 1.6T param model, right?
What kind of consumer hardware can run it reasonably, in your mind?
by hu3
4/30/2026 at 2:03:16 AM
I wish “performance” didn’t cover speed and quality, here.
by DANmode
4/29/2026 at 5:20:23 PM
The quantization for some models can be very detrimental and their quality can drop considerably from the posted benchmarks, which are probably at bf16. This is why having considerable RAM can be important.
by cbg0
4/30/2026 at 1:01:11 AM
being able to run a model fast is definitely more useful, but being able to run a model slowly for free is still super useful. agentic workflows are maturing all the time.
yes, if i'm directly interacting with the LLM, i want it to be reasonably fast. but lately i've been queueing up a bunch of things when i go for lunch, or leaving things running when i go home at the end of the day. and claude doesn't keep working on that all night, it runs for an hour or so, gets to a point where it needs more input from me, and gives me some stuff to review in the morning. that could run 16x slower and still be just as useful for me.
by notatoad
4/29/2026 at 9:07:08 PM
Sure, but for a casual conversational use case I have not found speed to be a huge barrier. I chatted with a 100b model using ddr5 only on a plane recently and it was fine. It's mainly that I cannot do data classification and coding tasks in a timely manner.
by Computer0
4/29/2026 at 4:50:04 PM
I didn't know about HERMES.md ... (??) - found information here for others who are curious: https://github.com/anthropics/claude-code/issues/53262
by gregsadetsky
4/29/2026 at 6:19:34 PM
This GitHub thread is incredible, thanks for sharing. This link should be its own HN topic.
by gnulinux
4/29/2026 at 7:17:31 PM
https://news.ycombinator.com/item?id=47952722
by nomel
4/29/2026 at 5:42:57 PM
That is insane. If you billed me an extra $200 for a bug in your system I'd flat out cancel my subscription. If you're not going to credit that back to me, you don't deserve any more of my money. I'm a Claude-first guy, but if you're going to bill me incorrectly, that's on you; own it, fix it.
by giancarlostoro
4/29/2026 at 5:49:34 PM
They did credit it back to him. There's a comment in the linked issue.
by xcrjm
4/29/2026 at 6:05:31 PM
Where? Just searched the entire thread for both the word "refund" and the word "credit" and I'm seeing nothing about credit being issued.
Also what's with @sasha-id talking to himself? Looks weird as all get out.
by MarsIronPI
4/29/2026 at 6:16:15 PM
Looks like he copy-pasted responses he got from their support agents.
by argee
5/1/2026 at 4:30:46 AM
See also: https://news.ycombinator.com/item?id=47954655
by bouke
4/29/2026 at 6:10:17 PM
Where? All I see is Boris saying "we are unable to issue compensation for degraded service or technical errors that result in incorrect billing routing".
by simjnd
4/29/2026 at 6:39:32 PM
Keep this in mind next time you hear someone talking about "removing the human in the loop".
Anthropic apparently won't take responsibility for issues caused by their own billing systems. You think they'll take responsibility in your system when a bug in their models can be demonstrated as the cause?
by lenerdenator
4/29/2026 at 7:05:08 PM
> Anthropic apparently won't take responsibility for issues caused by their own billing systems.
I think with every org, especially the big ones, trying to dodge responsibility (setting up "customer support" to annoy people enough that they buzz off), the only recourse people have is to generate enough bad press that they wake up and do the refund; it's less than a rounding error for them.
I think Anthropic is hardly unique in that position and being able to chat with a human with any sort of power to actually make things right is becoming more and more rare. If any human eyes saw that, the correct thing to do would probably be passing the message up the chain like "Hey, this will have really bad optics if we don't do the right thing. Can you take like 5 minutes and hit the refund button while I draft up a nice message about it?"
by KronisLV
4/29/2026 at 8:19:15 PM
Bad press is meaningless where it matters most these days. The kind of people who are most responsive to threats of bad press are the kind of people who don't need to be threatened with bad press to do the right thing.
I really wish it carried any weight. It just doesn't. If someone at the organization just says "never admit fault, always attack", it's very likely they'll get away with it.
by lenerdenator
4/30/2026 at 2:05:21 AM
> You think they’ll take responsibility in your system when a bug in their models can be demonstrated as the cause?
Flag on the play: AI doesn’t replace responsibility for your commits.
It doesn’t matter what promises a service makes, what you say is valid code is still on you.
Act accordingly.
by DANmode
5/1/2026 at 2:27:04 PM
The issue is less what's in your commit and more whether you're using these models as a foundation for some other service.
I know this is a rather hackneyed example, but if a customer service agent model were to call a customer a racial slur, that's not the software surrounding the agent, it's the agent's model.
by lenerdenator
4/29/2026 at 4:59:20 PM
It has a similar SWE-bench score to qwen 3.6 27b [1]. No one is comparing it to frontier.
[1]: There is no other common benchmark in the blog.
by YetAnotherNick
4/29/2026 at 6:16:56 PM
That's more a testament to how good Qwen3.6 27B is (it really is great) than to how bad this one is, IMO. Gemma 4 31B was already good, but Qwen3.6 27B is incredible for its size.
by simjnd
4/29/2026 at 8:11:11 PM
Good models vs bad models are relative: if this was released in 2020 it would be earth-shattering. But releasing a model today that's only on par with open-source dense models a quarter of the size, and soundly beaten by open-source MoEs with active param counts a quarter of the size, is kind of a flop. The niche for this is basically no one. It'll run at near-zero TPS for the few local model aficionados with enough hardware to try it out, and is lower throughput and lower quality for people trying to use it at scale.
I'm rooting for Mistral, I want them to release good models. This just isn't one. It's a little sad since they once were so prominent for open-source.
Who knows — if they have the compute to train this, they have the compute to train an MoE that's 3-4T total params with 128B active. Maybe they'll make a comeback (although using Llama 2 attention is... not promising). I hope they do.
by reissbaker
4/29/2026 at 5:16:31 PM
> This model? You can run it at Q4 with 70GB of VRAM.
> This beats the latest Sonnet while running locally
Not sure it will beat Sonnet at Q4.
> This is approaching consumer level territory (you can get a Mac Studio with 128GB of RAM for ~3500 USD).
For $3500 I can get 7-8 years of GLM using coding plans, have a faster model and much better code quality.
by DeathArrow
4/29/2026 at 6:24:48 PM
> Not sure it will beat Sonnet at Q4.
Very valid. Importance-weighted quantization and TurboQuant on model weights can reduce loss a lot compared to "traditional" Q4, so one can be hopeful.
> For $3500 I can get 7-8 years of GLM using coding plans, have a faster model and much better code quality
But you will own no computer, and that's also assuming prices stay what they are. Anyway my point was not whether or not it makes financial sense for everyone. A lot of people are very happy not owning their movies, software, games, cars or house. I'm just happy there is a future where the people can own and locally run the tech that was trained on their stolen data.
by simjnd
4/30/2026 at 6:04:50 AM
@simjnd, I hate this idea, but you remember how radio was regulated to death? And how fast one will be triangulated if one decides to run a "self hosted" radio station today? My bet is that in 5 years not only owning an AI-inference-capable computer but using AI itself will be regulated. Essentially, we will have to scan biometrics just to ask any SOTA model to "summarise this".
Why? Because capable and free models at the dawn of AI almost made people think again and - oh oh - ask questions!
by pbgcp2026
4/30/2026 at 2:42:53 PM
> For $3500 I can get 7-8 years of GLM using coding plans, have a faster model and much better code quality.
I know HN's distaste for crypto, but I do my inference (for personal stuff - not my employer) through Venice. I was in the airdrop for VVV, and kept as much of it staked as I could. I have ~$40/day in inference as long as that service lasts.
These days the multiplier is about 1000x last I checked; if you want $10/day in inference and can lock up $10k in VVV, you get ~$10/day in inference plus (currently) ~16% APY in the form of more VVV.
I'm not sure I'd want to invest that much if I had to today, but it's a reasonable option. The risk of VVV going to $0 seems pretty small to me.
by Ancapistani
4/29/2026 at 5:39:06 PM
> For $3500 I can get 7-8 years of GLM
Mind sharing what's the go-to place to pay for open models?
by kobalsky
4/29/2026 at 6:19:20 PM
I recommend using OpenRouter (openrouter.ai). Basically a broker between inference providers and you, which allows you to pick, try, and switch models from a massive catalog; it's extremely transparent about usage and pricing.
by simjnd
4/30/2026 at 6:10:53 AM
+5% to every API call.
by pbgcp2026
4/29/2026 at 11:00:48 PM
I've had a decent experience with ollama cloud. It is slower than going thru openrouter but much, much cheaper -- the generosity of their $20 plan reminds me of what the Claude Code $20 plan was back in the day.
4/29/2026 at 5:48:03 PM
You can get GLM coding plans from Z.ai, Ollama Cloud, and OpenCode Go.
by DeathArrow
4/29/2026 at 5:39:12 PM
> For the Claude-pilled people, I don't know if you only run Opus but when I was on the Pro plan Sonnet was already extremely capable.
Before February I was able to use Opus on High exclusively on my Max plan no problem. Now I've shifted to just using Sonnet on high and yeah, it's pretty capable. I love that, Claude Pilled. ;)
by giancarlostoro
4/29/2026 at 5:57:22 PM
Yeah I love Claude, amazing models. Anthropic has very quickly burned most of the goodwill I had for it so I still ended up cancelling my subscription.by simjnd
4/29/2026 at 8:11:09 PM
“This beats the latest Sonnet while running locally”
Not really.
- The benchmarks are based on F8_E4M3 and you’re not running that on any Mac.
- Sonnet has a 1M token context window. This is 256k but again you’re probably not even getting that locally.
- Sonnet is fast over the wire. This is going to be much slower.
by WhitneyLand
4/29/2026 at 11:15:14 PM
> Sonnet is fast over the wire.Except when it’s unavailable. For sovereignity, the downsides are worth it to some.
by trvz
4/29/2026 at 8:36:18 PM
the benchmarks we're using to measure llm's do no justice when everyone's mental-benchmark is simply "is it going to feel like using claude" and the answer is still no. the entire llm space is stuffed with tons of crazy datapoints and vernacular that barely paint the picture of the mental benchmark everyone is after.
i too am desperate to just sever ties with these big providers, my fingers are crossed we get there within the constraints of local hardware even if that means me spending 3-5k i just want off this wild ride.
by trueno
4/29/2026 at 11:34:37 PM
Not sure if the 1M token window is meaningful with Sonnet/Opus. The models go dumb quickly as context increases, making them unusable (that is, if you get routed to actual Opus; otherwise they are just dumb regardless of context window).
by varispeed
4/29/2026 at 6:19:37 PM
Let's not forget Qwen 35B A3B MoE. It gets better performance than this in all the metrics for a fraction of the memory / compute footprint.
Sad to see all the non-Chinese open source models being at least one generation behind.
by ksubedi
4/29/2026 at 6:32:16 PM
Qwen3.6 27B is even more impressive IMO. Dense, so it doesn't run as fast, but it's so good.
by simjnd
4/30/2026 at 12:14:15 AM
im kinda torn on which to download. i have the headroom to run either, mostly just want the occasional "do a coding thing im too lazy to do"
by trueno
4/30/2026 at 4:30:20 PM
Then go with Qwen3.6 35B A3B. It's way faster (up to 5x) and it is 80% as capable as the 27B. The 27B is for serious people looking for one-shot coding. The 35B is for iterative and quicker coding. I am in the same situation as you (making something I don't want to do myself) and I use the 35B at Q5_K_M.
by EntityDeletr
4/29/2026 at 5:45:00 PM
Yeah, you can run it locally if you have enough VRAM, but the reports trickling in are saying about 3 tok/sec. This was on a Strix Halo box which definitely has the needed VRAM, but isn't going to have as high mem bandwidth as a GPU card. It's going to be similar on a Mac - that's the dilemma... the unified memory machines have the VRAM, but the bandwidth isn't great for running dense models. A dense model of this size is only going to be runnable (usefully) by the very few people who have multiple GPU cards with enough memory to add up to about 70GB.
by UncleOxidant
4/29/2026 at 6:06:46 PM
I don't think this is quite correct: a Strix Halo box usually has 256 GB/s memory bandwidth. An M5 Max has 614 GB/s. An M3 Ultra (no M4 or M5 Ultra) has 820 GB/s. It's still not GDDR or HBM territory, but still significantly faster.
That's the edge of Apple Silicon for AI. When they scale up the chip they add more memory controllers, which adds more channels and more bandwidth.
But yeah in the end it's still going to be only a handful of people that can run it.
What I meant is that I think researching and developing smaller, more powerful models is more interesting than chasing the next 3T parameter model while burning through VC money and squeezing your customer base more and more aggressively.
by simjnd
4/29/2026 at 5:12:17 PM
The point is it's open weight and is tiny compared to a lot of its competitors. 4 GPUs for world-class performance - sweet!
by 2ndorderthought
4/29/2026 at 11:32:38 PM
> It doesn't beat the other models, but it sure competes despite its size.
But what is the rationale for running a dumb model? Because it can occasionally produce something passable?
I don't get where the value is, apart from mild entertainment, as in "I am somewhat of Anthropic myself".
by varispeed
4/30/2026 at 7:42:02 AM
Are you dumb because you're not Einstein? Intelligence is a spectrum. Just because you're not #1 doesn't mean you're dumb. A lot of small models are not frontier but are still very competent and are very useful coding agents. It may take better prompting and more guiding, but that can be a reasonable tradeoff for some people.
by simjnd
4/29/2026 at 5:30:19 PM
The competition is DeepSeek v4 Flash, for a similar size / deployment target.
by liuliu
4/29/2026 at 5:55:40 PM
DeepSeek v4 Flash is still over 100GB at Q4 IIRC, and Q4 has generally been the sweet spot. Although it's an MoE, so it might run a lot faster than this dense Mistral model if you have the RAM.
by simjnd
4/30/2026 at 5:58:01 AM
"Q4 has generally been the sweet spot" for self-hosting, yes. For any real meaningful work it's dumb AF. The only way to get reasonable intelligence from mid-size Gemma or Qwen is to run full precision BF16. Anything else is just an emulation of AI.by pbgcp2026
4/30/2026 at 4:33:47 PM
I would disagree. I have 8 GB of VRAM and 32 GB of RAM. I can either run a 4B BF16 dense model fully on GPU at around 30 t/s or Qwen3.6 35B A3B Q5_K_M at 20 t/s with GPU offload. Which one would I choose?
by EntityDeletr
4/29/2026 at 4:41:36 PM
It’s a 128B dense model. Good luck getting more than 3 t/s out of a Mac. It doesn’t matter if it fits or not.
by redrove
4/29/2026 at 5:12:03 PM
You could run it on a single Mac Studio with M3 Ultra, or two Mac Studios with M4 Max at higher perf than that. And lightly quantizing this could give us modern dense models in the ~80GB size range, which is a very compelling target.
by zozbot234
4/29/2026 at 5:15:56 PM
Wouldn't matter much still. M3 Ultra has 819GB/s unified memory bandwidth. That means the theoretical max token rate is 819/128 ≈ 6.4 t/s. At 80 GB (5-bit quantization), it's still only about 10 t/s ... far from a good coding experience. Also, these are theoretical maxima; real world token generation rates would be at least 15-20% less.
by freakynit
4/29/2026 at 5:56:11 PM
Isn't Kimi K2.6 natively INT4?
by zackangelo
4/29/2026 at 6:27:39 PM
I don't think any models are natively INT4? I wouldn't see the point of nerfing the model out-of-the-box.
by simjnd
4/29/2026 at 6:39:00 PM
It's not nerfed, it's natively trained at that quantization a.k.a. Quantization Aware Training.
by zozbot234
4/30/2026 at 6:13:13 AM
QAT typically uses BF16/FP32 during the training process to simulate lower precision.
by pbgcp2026
4/30/2026 at 4:36:41 PM
The only model I have seen like that is GPT OSS, natively quantized to MXFP4.
by EntityDeletr
4/29/2026 at 6:31:58 PM
Eh. Those results would be noteworthy if it was a MoE. A 120B dense? Firmly in meh territory.
by revolvingthrow
4/29/2026 at 6:58:12 PM
Why do you care?
by gregorygoc
4/29/2026 at 6:13:29 PM
I would love to be able to run frontier locally, but I think the larger importance of open weight models is price accountability.
In the US with our broken system of capitalism, it’s the only way we can tether these companies to reality. Left to their own devices, I’m not convinced they would actually compete with each other on price.
But nobody likes to talk about how “moat” building is fundamentally anti-competitive, even in name.
Funny that self-proclaimed capitalists hate the system in practice. Commodity pricing is what truly terrifies them.
by deepsquirrelnet
4/29/2026 at 6:30:48 PM
I'm not necessarily interested in having frontier locally. You don't need to be frontier to be a very good and useful coding agent. I agree with your point on price accountability though. Hopefully no tariff comes down on the Chinese and European open-weight models.
by simjnd
4/29/2026 at 5:05:39 PM
I was hoping for a lot from it... but this one is not up to that mark. For example, here is its comparison with a 4.7x smaller model, qwen3.6-27b:
https://chatgpt.com/share/69f239e8-7414-83a8-8fdd-6308906e5f...
Tldr: qwen3.6-27b, a 4.7x smaller model, has similar performance.
by freakynit
4/29/2026 at 5:11:20 PM
To be fair, MoE from Qwen itself had the same "problem". The 3.5 122B MoE was the same or worse than the 3.5 27B. Yet to see a 122B 3.6.
UPD: NVM, Mistral Medium 3.5 is dense. So yes, it is worse in every way.
by lostmsu
4/29/2026 at 5:08:39 PM
That's a chatgpt summary. Actual usage would be a better test.
4/29/2026 at 5:12:11 PM
yep.. until then, this is good enough since the tests are standard, and the results are numeric and can be compared without any doubt.by freakynit