Mistral Medium 3.5

4/29/2026 at 4:30:03 PM

I'm not sure what people are on in the comments. It doesn't beat the other models, but it sure competes despite its size.

GLM 5.1 is an excellent model, but even at Q4 you're looking at ~400GB. Kimi K2.5 is really good too, and at Q4 quantization you're looking at almost ~600GB.

This model? You can run it at Q4 with 70GB of VRAM. This is approaching consumer level territory (you can get a Mac Studio with 128GB of RAM for ~3500 USD).

For the Claude-pilled people, I don't know if you only run Opus but when I was on the Pro plan Sonnet was already extremely capable. This beats the latest Sonnet while running locally, without anyone charging you extra for having HERMES.md in your repo, or locking you out of your account on a whim.

Mistral has never been competitive at the frontier, but maybe that is not what we need from them. Having Pareto models that get you 80% of the frontier at 20% of the cost/size sounds really good to me.

by simjnd

4/29/2026 at 5:11:39 PM

> This model? You can run it at Q4 with 70GB of VRAM. This is approaching consumer level territory (you can get a Mac Studio with 128GB of RAM for ~3500 USD).

The one thing I would want everyone curious about local LLMs to know is that being able to run a model and being able to run a model fast are two very different thresholds. You can get these models to run on a 128GB Mac, but we need to first tell if Q4 retains enough quality (models have different sensitivities to quantization) and how fast it runs.

For running async work and background tasks the prompt processing and token generation speeds matter less, but a lot of Mac Studio buyers have discovered the hard way that it's not going to be as responsive as working with a model hosted in the cloud on proper hardware.

For most people without hard requirements for on-site processing, the best use case for this model would be going through one of the OpenRouter hosted providers for it and paying by token.

> This beats the latest Sonnet while running locally

Almost every open weight model launch this year has come with claims that it matches or exceeds Sonnet. I've been trying a lot of them and I have yet to see it in practice, even when the benchmarks show a clear lead.

by Aurornis

4/29/2026 at 9:42:11 PM

>Almost every open weight model launch this year has come with claims that it matches or exceeds Sonnet. I've been trying a lot of them and I have yet to see it in practice, even when the benchmarks show a clear lead.

This has been my experience as well. I've been testing an agent built with Strands Agents which receives a load balancer latency alert and is expected to query logs with AWS Athena (Trino) then drill down with Datadog spans/traces to find the root cause. Admittedly, "devops" domain knowledge is important here

My notes so far:

"us.anthropic.claude-sonnet-4-6" # working, good results

"us.anthropic.claude-sonnet-4-20250514-v1:0" # has problems following the prompt instructions

"us.anthropic.claude-sonnet-4-5-20250929-v1:0" # working, good results

"us.anthropic.claude-opus-4-5-20251101-v1:0"

"us.anthropic.claude-opus-4-6-v1" # best results, slower, more expensive

"amazon.nova-pro-v1:0" # completely fails

"openai.gpt-oss-120b-1:0" # tool calling broken

"zai.glm-5" # seems to work pretty well, a little slow, more expensive than Sonnet

"minimax.minimax-m2.5" # didn't diagnose correctly

"zai.glm-4.7" # good results but high tool call count, more expensive than Sonnet

"mistral.mistral-large-3-675b-instruct" # misdiagnosed--somehow claimed a Prometheus scrape issue was involved

"moonshotai.kimi-k2.5" # identified the right endpoints but interpreted trace data/root cause incorrectly

"moonshot.kimi-k2-thinking" # identified endpoint, 1 correct root cause, 1 missing index hallucination

Using models on AWS Bedrock. I let Claude Code w/ Opus 4.7 iterate over the agent prompt but didn't try to optimize per model. Really the only thing that came close to Sonnet 4.5 was GLM-5. The real kicker is, Sonnet is also the cheapest since it supports prompt caching

The Kimi ones were close to working but didn't quite make the mark

by nijave

4/30/2026 at 12:00:20 AM

" it supports prompt caching" May I ask if you checked that? I use "{"cachePoint": { "type": "default" }" and I found 2 things: * 1) even if stated in the Doco, Bedrock Converse API does not allow 1hr expiry time, only 5m - gives error when attempted; * 2) Bedrock Converse API does accept up to 4 cachePoint's but does NOT cache and returns zeroes. LOL. It was confirmed by some other people on Github. (Note: VertexAI does cache properly reducing the bill drastically, so I use Vertex instead of OpenRouter.)

by pbgcp2026

4/30/2026 at 12:43:30 PM

I had Claude Code pull the OTEL trace and calculate cost based on token counts in the responses. I'll double check later today tho if I remember

Edit: I do see the first request shows 0 cache read, 7k cache write tokens. The next request shows 7k cache read, 900 cache write tokens. The agent run summary is:

usage {

cache_read_input_tokens 244586

cache_write_input_tokens 38399

completion_tokens 8131

input_tokens 1172

output_tokens 8131

prompt_tokens 1172

total_tokens 292288

}

I do see a recent issue in the Strands Agent issue tracker about 1hr TTL getting ignored and defaulting to 5m TTL. I haven't validated cache TTL but these agent runs take ~2-3m so a 5m TTL is sufficient.

I also checked the AWS bill and see separate Usage SKUs

USE1-MP:USE1_CacheWriteInputTokenCount-Units $0.34

USE1-MP:USE1_OutputTokenCount-Units $0.27

USE1-MP:USE1_CacheReadInputTokenCount-Units $0.16

USE1-MP:USE1_InputTokenCount-Units $0.01

by nijave

4/29/2026 at 5:51:20 PM

> The one thing I would want everyone curious about local LLMs to know is that being able to run a model and being able to run a model fast are two very different thresholds. You can get these models to run on a 128GB Mac, but we need to first tell if Q4 retains enough quality (models have different sensitivities to quantization) and how fast it runs.

Very valid. This is an active area of research, and there are a lot of options to try out already today.

- People have successfully used TurboQuant to quantize model weights (TQ3_4S), not just the context KV, to achieve smaller sizes than Q4 (~3.5 bpw) with much better PPL and faster decoding.

- Importance-weighted quantization (e.g. IQ4) also provides way better PPL, KDL, etc. at the same size as a Q4.

- DFlash (block diffusion for speculative decoding) needs a good drafting model compatible with the big model, but can provide an uplift up to 5x in decoding (although usually in the 2-2.5x range)

- Forcing a model's thinking to obey a simple grammar has been shown to improve results with drastically lower thinking output (faster effective result generation) although that has been more impactful on smaller models.

We should be skeptical, but it's definitely trending in the right direction and I wouldn't be surprised if we are indeed able to run it at acceptable speeds.

> Almost every open weight model launch this year has come with claims that it matches or exceeds Sonnet. I've been trying a lot of them and I have yet to see it in practice, even when the benchmarks show a clear lead.

This hasn't been my experience. After Anthropic's started their shenanigans I've switched to exclusively using open-weights models via OpenRouter and OpenCode and I can't really tell a difference (for better or for worse).

by simjnd

4/30/2026 at 3:16:18 AM

> - Importance-weighted quantization (e.g. IQ4) also provides way better PPL, KDL, etc. at the same size as a Q4.

All the Q quants from big quant providers are importance-weighted (imatrix) nowadays.

The main (possibly only?) difference between Q and IQ today is that IQ uses a lookup table to achieve better compression. That is also why IQ suffers more when it can't fully fit into VRAM.

It's important to teach people the distinction and not perpetuate wrong assumptions of the past. If one needs/wants static quants, ignoring IQ_ isn't enough.

by tredre3

4/30/2026 at 11:27:26 AM

Thanks for bringing this up I looked into it, and if I understood correctly:

- Q4_0 (not K quant) is the traditional flat quantization - Q4_K (4-bit K quant) uses an imatrix and important weights get higher precision (5-6 bits instead of 4, but still largely 4 bits) - IQ4 uses an imatrix and important weights get an optimized scale to avoid clipping at 4-bit, but all the weights are still 4-bit

And yeah most quants nowadays are K quants which are importance weighted

by simjnd

4/29/2026 at 9:39:22 PM

Super interesting!

> - People have successfully used TurboQuant to quantize model weights (TQ3_4S), not just the context KV, to achieve smaller sizes than Q4 (~3.5 bpw) with much better PPL and faster decoding.

Where can I find more info on this? I’d like to convert models to onnx this way.

> - Importance-weighted quantization (e.g. IQ4) also provides way better PPL, KDL, etc. at the same size as a Q4.

Where can I find more info on this? I’d like to convert models to onnx this way.

The most difficult environment for small models is in the browser. Would be great to push the SOTA in that environment.

by sroussey

4/30/2026 at 11:13:06 AM

For TurboQuant on model weights AFAIK it's currently a single person effort [1]. It needs his fork of llama.cpp, hasn't been upstreamed. He publishes his quantizations on HuggingFace but I'm not sure if he open-sourced the quantization pipeline.

[1]: https://x.com/coffeecup2020

by simjnd

4/29/2026 at 11:36:04 PM

Google only released their TurboQuant paper barely a month ago, it is bleeding edge even by LLM standards

by hadlock

4/30/2026 at 12:54:03 AM

Actually, they published a year ago. Recent was being on official Google blog.

https://arxiv.org/abs/2504.19874

https://research.google/blog/turboquant-redefining-ai-effici...

by sroussey

4/29/2026 at 10:46:20 PM

> being able to run a model and being able to run a model fast are two very different thresholds

Specifically speaking, on my Strix Halo machine with (theoretical) memory bandwidth of 256 GB/s, a 70 GB model can't generate faster than 256/70= 3.65 t/s. The logic here is that a dense model must do a full read of the weights for each token. So even if the GPU can keep up, the memory bandwidth is limiting.

A Mac M5 Pro is faster with a bandwidth of 307 GB/s, but that's only a little faster.

This thing is going to be slow on consumer hardware. Maybe that is useful for someone, but I probably prefer a faster model in most cases even if the model isn't quite as smart. Qwen3.6 35B-A3B generates about 50 t/s on my machine, so it can make mistakes, be corrected, and try again in the same time that this model would still be thinking about its first response.

by parsimo2010

4/29/2026 at 11:23:19 PM

Recent models support multi-token prediction, which can guess multiple future tokens in a single decode step (using some subset of the model itself, not a separate drafting model) and then verify them all at once. It's an emerging feature still (not widely supported) and it's only useful for speeding up highly predictable token runs, but it's one way to do better in practice than the common-sense theoretical limit might suggest.

by zozbot234

4/30/2026 at 12:03:34 AM

It seems to me it's only Grok 4.20 that does this currently? Which other models did you have in mind, if I may ask?

by pbgcp2026

4/30/2026 at 1:40:46 AM

Gemma4, qwen3.6, deepseek v4, mimo, glm 5/5.1 all do MTP.

by phamilton

4/30/2026 at 2:01:09 AM

Thank you, I just realised we are talking about MTP. It seems that it's not that clear though. "Currently, the MTP capabilities are primarily accessible through Google's proprietary LiteRT framework, rather than the open-weights versions... Despite the missing MTP heads in the open release, Gemma 4 (specifically the 26B-A4B variant) still demonstrates high efficiency"

by pbgcp2026

4/30/2026 at 1:35:57 PM

If Mistral Medium 3.5 supports it, that might get it to 10 t/s. It will still be fairly slow.

by parsimo2010

5/1/2026 at 3:11:14 AM

[dead]

by freakynit

4/29/2026 at 5:15:01 PM

Cloud hardware is not inherently more "proper" than what's being proposed here, there's nothing wrong per se about targeting slower inference speeds in an on prem single-user context.

by zozbot234

4/29/2026 at 5:19:11 PM

> Cloud hardware is not inherently more "proper" than what's being proposed here

Cloud hardware can run the original model. Quantization will reduce quality. The quality drop to Q4 is not trivial.

Cloud hardware is also massively faster in time to first token and token generation speed.

> there's nothing wrong per se about targeting slower inference speeds in a local single-user context.

If that's what the user wants and expects then it's fine

Most people working interactively with an LLM would suffer from slower turns.

by Aurornis

4/29/2026 at 6:20:06 PM

> Cloud hardware can run the original model. Quantization will reduce quality.

New models are often being released in quantized format to begin with. This is true of both Kimi and the new DeepSeek V4 series. There is no "original model", the model is generated using Quantization Aware Training (QAT).

by zozbot234

4/29/2026 at 6:49:19 PM

> There is no "original model", the model is generated using Quantization Aware Training (QAT).

The original model is the model used for the benchmarks

People will say "You can run it locally!" then show the benchmarks of the original model, but what they really mean is that you can run a heavily quantized adaptation of the model which has difference performance characteristics.

by Aurornis

4/29/2026 at 6:56:05 PM

That remark was specific to newer models like Kimi 2.x and DeepSeek V4 series, and this is clearly stated in my comment.

As for other models, we quantize them because we are generally constrained by the model's total footprint in bytes, and running a larger model that's been quantized to fit in the same footprint as a smaller one improves performance compared to a smaller original, generally up to Q4 or so, with even tighter quantizations (up to Q2) being usable for some uses such as general Q&A chat.

by zozbot234

4/30/2026 at 8:39:26 AM

When you say DeepSeek v4... you do realise it is a 1.6T param model right?

What kind of consumer hardware can run it reasonably in your mind?

by hu3

4/30/2026 at 2:03:16 AM

I wish “performance” didn’t cover speed and quality, here.

by DANmode

4/29/2026 at 5:20:23 PM

The quantization for some models can be very detrimental and their quality can drop considerably from the posted benchmarks which are probably at bf16, this is why having considerable RAM can be important.

by cbg0

4/30/2026 at 1:01:11 AM

being able to run a model fast is definitely more useful, but being able to run a model slowly for free is still super useful. agentic workflows are maturing all the time.

yes, if i'm directly interacting with the LLM, i want it to be reasonably fast. but lately i've been queueing up a bunch of things when i go for lunch, or leaving things running when i go home at the end of the day. and claude doesn't keep working on that all night, it runs for an hour or so, gets to a point where it needs more input from me, and gives me some stuff to review in the morning. that could run 16x slower and still be just as useful for me.

by notatoad

4/29/2026 at 9:07:08 PM

Sure but for a casual conversational use case I have not found speed to be a huge barrier. I chatted with a 100b model using ddr5 only on a plane recently and it was fine. It's mainly that I cannot do data classification and coding tasks in a timely manner.

by Computer0

4/29/2026 at 4:50:04 PM

I didn't know about HERMES.md ... (??) - found information here for others who are curious https://github.com/anthropics/claude-code/issues/53262

by gregsadetsky

4/29/2026 at 6:19:34 PM

This github thread is incredible, thanks for sharing. This link should be its own HN topic.

by gnulinux

4/29/2026 at 7:17:31 PM

https://news.ycombinator.com/item?id=47952722

by nomel

4/29/2026 at 5:42:57 PM

That is insane, if you billed me an extra $200 for a bug in your system I'd flat out cancel my subscription. If you're not going to credit that back to me, you don't deserve anymore of my money. I'm a Claude first guy, but if you're going to bill me incorrectly, that's on you, own it, fix it.

by giancarlostoro

4/29/2026 at 5:49:34 PM

They did credit it back to him. There's a comment in the linked issue.

by xcrjm

4/29/2026 at 6:05:31 PM

Where? Just searched the entire thread for both the word "refund" and the word "credit" and I'm seeing nothing about credit being issued.

Also what's with @sasha-id talking to himself? Looks weird as all get out.

by MarsIronPI

4/29/2026 at 6:16:15 PM

Looks like he copy pasted responses he got from their support agents.

by argee

5/1/2026 at 4:30:46 AM

by bouke

4/29/2026 at 6:10:17 PM

Where? All I see is Boris saying "we are unable to issue compensation for degraded service or technical errors that result in incorrect billing routing".

by simjnd

4/29/2026 at 6:39:32 PM

Keep this in mind next time you hear someone talking about "removing the human in the loop".

Anthropic apparently won't take responsibility for issues their own systems handling billing cause. You think they'll take responsibility in your system when a bug in their models can be demonstrated as the cause?

by lenerdenator

4/29/2026 at 7:05:08 PM

> Anthropic apparently won't take responsibility for issues their own systems handling billing cause.

I think with every org, especially the big ones, trying to dodge responsibility (setting the intent of "customer support" to be annoying them enough for them to buzz off), the only recourse people have is to give them enough bad press where they wake up and do the refund, it's less than a rounding error for them.

I think Anthropic is hardly unique in that position and being able to chat with a human with any sort of power to actually make things right is becoming more and more rare. If any human eyes saw that, the correct thing to do would probably be passing the message up the chain like "Hey, this will have really bad optics if we don't do the right thing. Can you take like 5 minutes and hit the refund button while I draft up a nice message about it?"

by KronisLV

4/29/2026 at 8:19:15 PM

Bad press is meaningless where it matters most these days. The kind of people who are most responsive to threats of bad press are the kind of people who don't need to be threatened with bad press to do the right thing.

I really wish it carried any weight. It just doesn't. If someone at the organization just says "never admit fault, always attack", it's very likely they'll get away with it.

by lenerdenator

4/30/2026 at 2:05:21 AM

> You think they'll take responsibility in your system when a bug in their models can be demonstrated as the cause?

Flag on the play: AI doesn’t replace responsibility for your commits.

It doesn’t matter what promises a service makes, what you say is valid code is still on you.

Act accordingly.

by DANmode

5/1/2026 at 2:27:04 PM

The issue is less what's in your commit and more if you're using these models as a foundation for some other service.

I know this is a rather hackneyed example, but if a customer service agent model were to call a customer a racial slur, that's not the software surrounding the agent, it's the agent's model.

by lenerdenator

4/29/2026 at 4:59:20 PM

It has similar SWE bench score to qwen 3.6 27b[1]. No one is comparing it to frontier.

[1]: There is no other common benchmark in the blog.

by YetAnotherNick

4/29/2026 at 6:16:56 PM

That's more a testament of how good Qwen3.6 27B is (it really is great) more than how bad this one is IMO. Gemma 4 31B was already good, but Qwen3.6 27B is incredible for its size.

by simjnd

4/29/2026 at 8:11:11 PM

Good models vs bad models are relative: if this was released in 2020 it would be earth shattering. But releasing a model today that's only on par with open-source dense models a quarter of the size and soundly beaten by open-source MoEs with active param counts a quarter of the size is kind of a flop. The niche for this is basically no one. It'll run at near-zero TPS for the few local model aficionados with enough hardware to try it out, and is lower throughout and lower quality for people trying to use it at scale.

I'm rooting for Mistral, I want them to release good models. This just isn't one. It's a little sad since they once were so prominent for open-source.

Who knows — if they have the compute to train this, they have the compute to train an MoE that's 3-4T total params with 128B active. Maybe they'll make a comeback (although using Llama 2 attention is... not promising). I hope they do.

by reissbaker

4/29/2026 at 5:16:31 PM

>This model? You can run it at Q4 with 70GB of VRAM. >This beats the latest Sonnet while running locally

Not sure it will beat Sonet at Q4.

>This is approaching consumer level territory (you can get a Mac Studio with 128GB of RAM for ~3500 USD).

For $3500 I can get 7-8 years of GLM using coding plans, have a faster model and much better code quality.

by DeathArrow

4/29/2026 at 6:24:48 PM

> Not sure it will beat Sonet at Q4.

Very valid. Importance-weighted quantization and TurboQuant on model weights can reduce loss a lot compared to "traditional" Q4 so one can be hopeful.

> For $3500 I can get 7-8 years of GLM using coding plans, have a faster model and much better code quality

But you will own no computer, and that's also assuming prices stay what they are. Anyway my point was not whether or not it makes financial sense for everyone. A lot of people are very happy not owning their movies, software, games, cars or house. I'm just happy there is a future where the people can own and locally run the tech that was trained on their stolen data.

by simjnd

4/30/2026 at 6:04:50 AM

@simjnd, I hate this idea but you remember how radio had been regulated to death? And how fast one will be triangulated if one decides to run a "self hosted" radio station today? My bet is in 5 years not only owning AI-inference-capable computer but using AI itself will be regulated. Essentially, we will have to scan biometrics to just ask any SOTA model to "summarise this".

Why? Because capable and free models at the dawn of AI almost made people think again and - oh oh - ask questions!

by pbgcp2026

4/30/2026 at 2:42:53 PM

> For $3500 I can get 7-8 years of GLM using coding plans, have a faster model and much better code quality.

I know HN's distaste for crypto, but I do my inference (for personal stuff - not my employer) through Venice. I was in the airdrop for VVV, and kept as much of it staked as I could. I have ~$40/day in inference as long as that service lasts.

These days the multiplier is about 1000x last I checked; if you want $10/day in inference and can lock up $10k in VVV, you get ~$10/day in inference plus (currently) ~16% APY in the form of more VVV.

I'm not sure I'd want to invest that much if I had to today, but it's a reasonable option. The risk of VVV going to $0 seems pretty small to me.

by Ancapistani

4/29/2026 at 5:39:06 PM

> For $3500 I can get 7-8 years of GLM

mind sharing where's the go to place to pay for open models?

by kobalsky

4/29/2026 at 6:19:20 PM

I recommend using OpenRouter (openrouter.ai). Basically a broker between inference providers and you which allows you to pick, try, and switch models from a massive catalog, extremely transparent about usage and pricing.

by simjnd

4/30/2026 at 6:10:53 AM

+5% to every API call.

by pbgcp2026

4/29/2026 at 11:00:48 PM

I've had a decent experience with ollama cloud. It is slower than going thru openrouter but much, much cheaper -- the generosity of their $20 plan reminds me of what the Claude Code $20 plan was back in the day

by rsanek

4/29/2026 at 5:48:03 PM

You can get GLM coding plans from Z.ai and Ollama Cloud and OpenCode Go.

by DeathArrow

4/29/2026 at 5:39:12 PM

> For the Claude-pilled people, I don't know if you only run Opus but when I was on the Pro plan Sonnet was already extremely capable.

Before February I was able to use Opus on High exclusively on my Max plan no problem. Now I've shifted to just using Sonnet on high and yeah, its pretty capable. I love that, Claude Pilled. ;)

by giancarlostoro

4/29/2026 at 5:57:22 PM

Yeah I love Claude, amazing models. Anthropic has very quickly burned most of the goodwill I had for it so I still ended up cancelling my subscription.

by simjnd

4/29/2026 at 8:11:09 PM

“This beats the latest Sonnet while running locally”

Not really.

- The benchmarks are based on F8_E4M3 and you’re not running that on any Mac.

- Sonnet has a 1M token context window. This is 256k but again you’re probably not even getting that locally.

- Sonnet is fast over the wire. This is going to be much slower.

by WhitneyLand

4/29/2026 at 11:15:14 PM

> Sonnet is fast over the wire.

Except when it’s unavailable. For sovereignity, the downsides are worth it to some.

by trvz

4/29/2026 at 8:36:18 PM

the benchmarks we're using to measure llm's do no justice when everyone's mental-benchmark is simply "is it going to feel like using claude" and the answer is still no. the entire llm space is stuffed with tons of crazy datapoints and vernacular that barely paint the picture of the mental benchmark everyone is after.

i too am desperate to just sever ties with these big providers, my fingers are crossed we get there within the constraints of local hardware even if that means me spending 3-5k i just want off this wild ride.

by trueno

4/29/2026 at 11:34:37 PM

Not sure if 1M token window is meaningful with Sonnet/Opus. The models go dumb quickly as context increases making them unusable (that is if you get routed to actual Opus, otherwise they are just dumb regardless of context window).

by varispeed

4/29/2026 at 6:19:37 PM

Let's not forget Qwen 35B A3B MoE. It gets better performance than this in all the metrics for a fraction of the memory / compute footprint.

Sad to see all the non Chinese open source models being at least one generation behind.

by ksubedi

4/29/2026 at 6:32:16 PM

Qwen3.6 27B is even more impressive IMO. Dense so it doesn't run as fast but it's so good.

by simjnd

4/30/2026 at 12:14:15 AM

im kinda torn on which to download. i have the headroom to run either, mostly just want the occasional "do a coding thing im too lazy to do"

by trueno

4/30/2026 at 4:30:20 PM

Then go with Qwen3.6 35B A3B. It's way faster (up to 5x) and it is 80% as capable as the 27B. The 27B is for serious people looking for one shot coding. The 35B is for iterative and quicker coding. I am in the same situation as you (making something I don't want to do myself) and I use the 35B at Q5_K_M.

by EntityDeletr

4/29/2026 at 5:45:00 PM

Yeah, you can run it locally if you have enough VRAM, but the reports trickling in are saying about 3 tok/sec. This was on a Strix Halo box which definitely has the needed VRAM, but isn't going to have as high mem bandwidth as a GPU card, it's going to be similar on a Mac - that's the dilemma... the unified memory machines have the VRAM, but the bandwidth isn't great for running dense models. This size of a dense model is only going to be runnable (usefully) by very few people who have multiple GPU cards with enough memory to add up to about 70GB.

by UncleOxidant

4/29/2026 at 6:06:46 PM

I don't think this is quite correct, a Strix Halo box usually has 256 GB/s memory bandwidth. An M5 Max has 614 GB/s. An M3 Ultra (no M4 or M5 Ultra) has 820 GB/s. It's still not GDDR or HBM territory, but still significantly faster.

That's the edge of Apple Silicon for AI. When they scale up the chip they add more memory controllers which adds more channels and more bandwidth.

But yeah in the end it's still going to be only a handful of people that can run it.

What I meant is that I think researching and developing smaller more powerful model is more interesting than chasing the next 3T parameter model while burning through VC money and squeezing your customer base more and more aggressively.

by simjnd

4/29/2026 at 5:12:17 PM

The point is it's open weight and is tiny compared to a lot of it's competitors. 4gpus for world class performance - sweet!

by 2ndorderthought

4/29/2026 at 11:32:38 PM

> It doesn't beat the other models, but it sure competes despite its size.

But what is the rationale for running a dumb model? Because it can ocasionally produce something passable?

I don't get where is the value apart from mild entertainment, as in "I am somewhat of Anthropic myself".

by varispeed

4/30/2026 at 7:42:02 AM

Are you dumb because you're not Einstein? Intelligence is a spectrum. Just because you're not #1 doesn't mean you're dumb. A lot of small models are not frontier but are still very competent and are very useful coding agent. It may take better prompting and more guiding, but that can be a reasonable tradeoff for some people.

by simjnd

4/29/2026 at 5:30:19 PM

The competition is on DeepSeek v4 Flash for similar size / deployment target.

by liuliu

4/29/2026 at 5:55:40 PM

DeepSeek v4 Flash is still over 100GB at Q4 IIRC, and Q4 has generally been the sweet spot. Although it's an MoE so it might run a lot faster that this dense Mistral model if you have the RAM.

by simjnd

4/30/2026 at 5:58:01 AM

"Q4 has generally been the sweet spot" for self-hosting, yes. For any real meaningful work it's dumb AF. The only way to get reasonable intelligence from mid-size Gemma or Qwen is to run full precision BF16. Anything else is just an emulation of AI.

by pbgcp2026

4/30/2026 at 4:33:47 PM

I would disagree. I have 8 GB of VRAM and 32 GB of RAM. I can either run a 4B BF16 dense model fully on GPU at around 30 t/s or Qwen3.6 35B A3B Q5_K_M at 20 t/s with GPU offload. Which one would I choose?

by EntityDeletr

4/29/2026 at 4:41:36 PM

It’s 128b dense model. Good luck getting more than 3t/s out of a mac. It doesn’t matter if it fits or not.

by redrove

4/29/2026 at 5:12:03 PM

You could run it on a single Mac Studio with M3 Ultra, or two Mac Studios with M4 Max at higher perf than that. And lightly quantizing this could give us modern dense models in the ~80GB size range, which is a very compelling target.

by zozbot234

4/29/2026 at 5:15:56 PM

Wouldn't matter much still. M3 ultra has 819GB/s unified memory bandwidth. That means theoretical max tokem rate is 819/128 =~ 6.39 t/s. At 80 GB (5 bit quantization), its still near about 10 t/s ... far from a good coding experience. Also, these are theoretical max.. real world token generation rates would be at least 15-20% less.

by freakynit

4/29/2026 at 5:56:11 PM

Isn't Kimi K2.6 natively INT4?

by zackangelo

4/29/2026 at 6:27:39 PM

I don't think any models are natively INT4? I wouldn't see the point to nerf the model out-of-the-box.

by simjnd

4/29/2026 at 6:39:00 PM

It's not nerfed, it's natively trained at that quantization a.k.a. Quantization Aware Training.

by zozbot234

4/30/2026 at 6:13:13 AM

QAT typically uses BF16/FP32 during the training process to simulate lower precision.

by pbgcp2026

4/30/2026 at 4:36:41 PM

The only model I have seen like that is GPT OSS, natively quantized to MXFP4.

by EntityDeletr

4/29/2026 at 6:31:58 PM

Eh. Those results would be noteworthy if it was a a MoE. A 120B dense? Firmly in meh territory.

by revolvingthrow

4/29/2026 at 6:58:12 PM

Why do you care?

by gregorygoc

4/29/2026 at 6:13:29 PM

I would love to be able to run frontier locally, but I think the larger importance of open weight models is price accountability.

In the US with our broken system of capitalism, it’s the only way we can tether these companies to reality. Left to their own devices, I’m not convinced they would actually compete with each other on price.

Buy nobody like to talk about how “moat” building is fundamentally anti-competitive, even in name.

Funny that self proclaimed capitalists hate the system in practice. Commodity pricing is what truly terrifies them.

by deepsquirrelnet

4/29/2026 at 6:30:48 PM

I'm not necessarily interested in having frontier locally. You don't need to be frontier to be a very good and useful coding agent. I agree with your point on price accountability though. Hopefully no tariff comes down on the Chinese and European open-weight models.

by simjnd

4/29/2026 at 6:19:36 PM

[dead]

by sayYayToLife

4/29/2026 at 5:05:39 PM

I was hoping a lot from it... but this one, is not up to that mark. For example, here is it's comparion with 4.7x smaller model, qwen3.7-27b.

https://chatgpt.com/share/69f239e8-7414-83a8-8fdd-6308906e5f...

Tldr: qwen3.6-27b, a 4.7x smaller model, have similar performance.

by freakynit

4/29/2026 at 5:11:20 PM

To be fair MoE from Qwen itself had the same "problem". 3.5 122B MoE was same or worse than 3.5 27B. Yet to see 122B 3.6.

UPD. NVM, Mistral Medium 3.5 is dense. So yes, it is worse in every way.

by lostmsu

4/29/2026 at 5:08:39 PM

That's a chatgpt summary. Actual usage would a better test.

by r0b05

4/29/2026 at 5:12:11 PM

yep.. until then, this is good enough since the tests are standard, and the results are numeric and can be compared without any doubt.

by freakynit

4/29/2026 at 3:54:54 PM

As always, rooting for these guys — model and national diversity is great. This looks like a solid foundation to build on; hopefully the 3.6/3.7 will dial in more gains. It looks like maybe from the computer use benchmarks that their vision pipeline could use improvement, but that’s just speculation.

The different results on some benchmarks vibes as if this is truly an independently trained model, not just exfiltrated frontier logs, which I think is also really important - having different weight architectures inside a particular model seems like a benefit on its own when viewed from a global systems architecture perspective.

by vessenes

4/29/2026 at 6:20:33 PM

[dead]

by sayYayToLife

4/29/2026 at 3:50:24 PM

I'm rooting for Mistral. It seems they are making a big bet that smaller models will win over larger ones and I can see it happening. I was running some simple chat and tool-calling benchmarks for small models and Mistral Small 4 performed well for it's price ($.15/$.60). Seeing this today got me excited, benchmarks seems solid compared to models much larger, but it's priced higher than Haiku, 5.4 mini, and all the the Chinese models it's comparing itself too. It's not even winning those benches either, just being competitive with them, which is great, those models are 5x+ the size, but they are also 1/2 the price. Hard to be excited about that.

by wyre

4/29/2026 at 8:57:56 PM

The problem with this model is that DeepSeek v4 Flash runs quite well quantized to 2 bit (see https://github.com/antirez/llama.cpp-deepseek-v4-flash), at 30 t/s generation and 400 t/s prefill in a M3 Ultra (and not too much slower on a 128GB MacBook Pro M3 Max). It works as a good coding agent with opencode/pi, tool calling is very reliable, and so forth. All this at a speed that a 120B dense model can never achieve. So it has to compete not just with models that fit 4-bit quantized the same size, but with an 86GB GGUF file of DeepSeek v4 Flash, and it is not very easy to win in practical terms for local inference.

Note: I have more uncommitted speed improvements in my tree that I'll push soon, the current tree could be a little bit slower but not much, still super usable.

I don't understand one thing about Mistral, which I'm a fan being in Europe: they opened the open weights MoE show with Mixtral. Why are they now releasing dense models of significant sizes? In this way you don't compete in any credible space, nor local inference, nor remote inference since the model is far from SOTA and not cheap to serve. So why they are training such dense big models? Dense models have a place in the few tens of billion parameters, as Qwen 3.6 27B shows, but if you go 5 times that, it is no longer a fit, unless you are crushing with capabilities anything requiring the same VRAM, which is not the case.

by antirez

4/29/2026 at 10:36:45 PM

Your GitHub link only says "The model quantized in this way behaves very very well in the chat, frontier-model vibes, but it was not extensively tested." This is hardly relevant to how it behaves in agentic workflows, we're aware of how often they degrade severely with Q2 quantization. If this quantized Flash can keep up reasonable quality and performance at larger context lengths (which seems to be a key feature of the V4 series) it could be a very reasonable competitor to models in the same weight class like Qwen 3 Coder-Next 80B.

by zozbot234

4/29/2026 at 10:38:15 PM

Nope it works great with opencode as a agent, you can build a game or things like that. It works. The trick is a mix among the quantization I used, which is very asymmetric, and the fact that I guess DeepSeek v4 Flash tolerates extreme quantizations better than anything I saw in the past.

What I used was up/gate of routed models, IQ2_XXS, out -> Q2_K, then I quantized routing, projections, shared experts to Q8. The trick is that the very sensible parts are a small amount of the weights, and they are kept very high quality.

by antirez

4/29/2026 at 4:48:32 PM

Compared to all other hosted LLMs that I have tested, Mistral seems to be the only one with rather strict CSP headers. When you ask them to create a website with some javascript library it will not preview, even though le chat offers canvas mode.

Sometimes when a new release comes around from any provider I just want to test it a bit on the web. without paying and using an agent harness.

Why are they like this ;_;

Edit: Christ on a bike it's bad at drawing SVGs https://chat.mistral.ai/chat/23214adb-5530-4af9-bb47-90f5219...

by Mashimo

4/29/2026 at 5:54:19 PM

> Edit: Christ on a bike it's bad at drawing SVGs

On the bike would be an improvement. Geez.

I know SVGs may not be the best benchmark, but that matches my experience of trying to run a (previous) Mistral model in Mistral Vibe, asking it to help me configure an MCP server in Vibe. It confidently explained that MCP is the MineCraft Protocol and then began a search of my computer looking for Minecraft binaries.

by SyneRyder

4/29/2026 at 5:18:09 PM

I have never wanted, needed or hoped to draw svgs with an LLM. All of the models suck at it, some are just more fun or something.

by 2ndorderthought

4/29/2026 at 5:37:44 PM

I can't speak for what you consider sucking, but there is a significant difference between Mistral and Kimi or Gemini. I find the others to be usable for my needs.

by Mashimo

4/29/2026 at 5:44:58 PM

I agree there is a difference but does that translate to anything? It's not the same operations used to write code, and it's kind of useless. I wouldn't waste my power bill ensuring a model I was releasing was good at it.

by 2ndorderthought

4/29/2026 at 6:44:02 PM

> It's not the same operations used to write code

Is it not? It's html and javascript. And not even attempting to draw details that other models do.

When I try other html / js prompts it also lacks behind china models from over half a year ago. I mean worse then GLM 4.7.

by Mashimo

4/29/2026 at 7:01:30 PM

I've only tried it via the web so far and it's working great for my usual test prompts. Looks like it was trained on some pretty recent data.

by 2ndorderthought

4/29/2026 at 9:45:43 PM

Claude volunteered this the other day:

https://iili.io/BsfyNXR.jpg

(I think the hair was unintentional, but it is impossible to be sure.)

by andai

4/30/2026 at 2:57:38 PM

Are those cherries overlayed ontop of boobs with a bus to the side, driving towards a rock? Scnr

by ffsm8

4/29/2026 at 3:50:52 PM

This release Mistral really reminds you of the gap between the frontier labs and everyone else.

Pre-agent, there wasn't always an obvious difference between models. Various models had their charms. Nowadays, I don't want to entertain anything less than the frontier models. The difference in capability is enormous and choosing anything less has a real cost in terms of productivity.

I've been a big fan of the smaller labs like Mistral and especially Cohere but it's been a while since I've been excited by a release by either company.

That said, I'm using mistral voxtral realtime daily – it's great.

by postalcoder

4/29/2026 at 4:07:56 PM

Can't agree at all. Productivity gap just 1 year ago was much larger for frontier model vs non-frontier. Let alone 2 years ago.

by deaux

4/29/2026 at 5:59:39 PM

Same. The gap is almost paper thin for anyone who hasn't gone full uninformed vibe code.

by 2ndorderthought

4/29/2026 at 4:28:37 PM

When I was thinking pre-agentic, I was actually thinking more pre-"coding seen as the main use case for these models".

by postalcoder

4/29/2026 at 5:07:43 PM

Coding has always been the main real-world business usecase since day one. There has been no point since the very first public availability of GPT 3.5 in November 2022, that it wasn't.

A lot of us have been agentic coding since almost 2 years ago, mid-2024. I have. The productivity gap of "best vs 2nd vs 3rd best model" was biggest back then and has slowly been shrinking ever since.

by deaux

4/29/2026 at 4:05:29 PM

> Pre-agent, there wasn't always an obvious difference between models. Various models had their charms. Nowadays, I don't want to entertain anything less than the frontier models. The difference in capability is enormous and choosing anything less has a real cost in terms of productivity.

It's just apples to oranges.

There is not a clear, across the board, winner on non-agentic tasks between Gemini, ChatGPT, and Claude - the simple chatbot interface.

But Claude Code is substantially better than Codex which itself is notably better than Gemini-cli.

In this vein, it should not be surprising that Claude Code is way better than non-frontier models for agentic coding... It's substantially better than other frontier models at specialized agentic tasks.

by onlyrealcuzzo

4/29/2026 at 4:36:25 PM

I’ve been comparing Claude Code and Codex extensively side by side over the past couple of weeks with my favorite prompting framework superpowers…

From my perspective, Claude Code is decidedly not better than Codex. They’re slightly different and work better together. I would have no issues dropping CC entirely and using codex 100%.

If you’re working off of “defaults”, in other words no custom prompting, Claude Code does perform a lot better out of the box. I think this matters, but if you’re a professional software developer, I’d make the case that you should be owning your tools and moving beyond the baked in prompts.

by philipbjorge

4/29/2026 at 4:28:21 PM

CC is not better than Codex, nor is it better than OpenCode, Crush, Pi etc…

by nothinkjustai

4/29/2026 at 4:30:17 PM

I think there's a fair amount of evidence that the heavy harnesses actually drag down performance compared to bare harnesses.

by postalcoder

4/29/2026 at 4:06:36 PM

> Pre-agent, there wasn't always an obvious difference between models. Various models had their charms. Nowadays, I don't want to entertain anything less than the frontier models.

This is a very naive and misguided opinion. In most tasks, including complex coding tasks, you can hardly tell the difference between a frontier model and something like GPT4.1. You need to really focus on areas such as context window, tool calling and specific aspects of reasoning steps to start noticing differences. To make matters worse, frontier models are taking a brute force approach to results which ends up making them far more expensive to run, both in terms of what shows up on your invoice and how much more you have to wait to get any resemblance of output.

And I won't even go into the topic or local models.

by locknitpicker

4/29/2026 at 4:17:35 PM

> You need to really focus on areas such as context window, tool calling and specific aspects of reasoning steps to start noticing differences.

This is like saying "the current models and the old models are the same if you ignore every important advance they've made"

by postalcoder

5/1/2026 at 5:51:58 AM

> This is like saying "the current models and the old models are the same if you ignore every important advance they've made"

Please go ahead and list the single most important advance a frontier model has over, say, gpt4.1.

Reasoning is one of the main features, and in practice all it does is waste compute to rewrite your original prompt. See how GPT 5.4 burns through compute by running additional prompts where it acts like your own interpreter, with lengthy reasoning prompts on "the user is asking for (...)" as if you are completely unable to stitch together a usable prompt. That's your frontier model.

by locknitpicker

4/29/2026 at 6:25:09 PM

[dead]

by sayYayToLife

4/29/2026 at 6:51:48 PM

Mistral continuing to ship credible models is good for the market. Buyers need more than a two-company choice if they want pricing and deployment leverage.

by deferredgrant

4/29/2026 at 4:55:48 PM

I'm using mistral-medium-2508 for some text transformation operations. It's giving me better results than mistral-large for my use cases. Looking forward to testing this new model, although I'm not sure if it's really meant at replacing the previous medium model since it's a lot more expensive and presented more as a coding / agentic model (mistral-medium-2508 was priced $0.4/$2 per 1M tokens, mistral-medium-3.5 is $1,5/$7.5).

by seb_lz

4/29/2026 at 8:14:37 PM

I actually use Mistral Large to go through some large text chunks (in production). It gives about the same level of results as Sonnet, while being 90% cheaper. Definitely wouldn't use it for coding, but for this text-analyzing task it has been great. Much better than all the latest Chinese models, for example.

So I was waiting for this release and it's... 5x more expensive than the latest Mistral Large. So now I'm worried they'll pull the plug on the cheap Large when their releases roll over to that one.

by hulk-konen

4/30/2026 at 7:24:16 AM

Yeah I use Mistral Large for a lot of formatting work. For this one use case of mine, it outperforms frontier models by a significant margin. I've found tons of use cases for mistral small as well.

I'd love to use Mistral for more tasks, but Mistral Large doesn't quite cut it for all tasks. So on the one hand, I'm excited there is another model, and presumably more performant based on the price? But the fact it's a "Medium" and 5x the price of the Large definitely concerns me.

The entire release is also about Vibe Coding, and so I'm not even sure if this model is applicable outside of coding, or even worth testing.

by barrell

4/29/2026 at 8:20:01 PM

Why does this matter if the model is open? It can be offered by competitive third-party providers, there's no rug pull.

by zozbot234

4/29/2026 at 8:54:26 PM

Right now it's really not offered by third parties. I found it via a single provider (BitDeer). I'm not sure I'd trust them with my customers' data. Also, considering the model is getting a bit old, I wouldn't expect them to keep offering it forever.

Anyhow, competition is fierce. I'll have some model I can use in the future, even if it's not dirt cheap like current Mistral Large is.

by hulk-konen

4/29/2026 at 4:03:39 PM

I like the idea of Mistral, but the last time I evaluated Mistral Vibe it was really nice for $15/month but not as effective as Gemini Plus with AntiGravity and gemini-cli. I am currently running Gemini Ultra on a 3 month 'special deal' and AntiGravity with Opus 4.7 tokens is pretty much fantastic.

That said, when I stop spending money on Gemini Ultra, I will give Mistral Vibe another 1-month test.

I like the entire business model and vibe of Mistral so much more than OpenAI/Anthropic/Google but I also have stuff to get done. I am curious if Mistral Vibe for $15/month is a stable business model (i.e., can they make a profit).

by mark_l_watson

4/29/2026 at 5:41:00 PM

How do you feel about the responsiveness of gemini-cli? I tried it on a paid plan and the 10-minute hang-ups (per step, not the whole plan execution) really break the illusion of performance gains, unless you run it in the background and do something else in the meantime. It's more noticeable when Americans are awake.

by danelski

4/30/2026 at 10:27:26 PM

it is usually fast, but if gemini-cli or any other coding agent is sluggish I quit using it for a while.

by mark_l_watson

4/29/2026 at 4:35:19 PM

I'm testing it right now and it seems very buggy and unstable, just like before.

by amunozo

4/29/2026 at 3:39:23 PM

It's okay, nothing exceptional, but any news from non US and non Chinese models is still good news.

by mtct88

4/29/2026 at 3:46:30 PM

This is the bar for Europe, huh?

by pb7

4/29/2026 at 4:17:25 PM

Where are the competitive models from Singapore, Japan, Taiwan, Korea, Russia, Canada, India, the UK? From anywhere that isn't China or the US?

There are none. Mistral Small 4 is pareto-competitive in its pricing bracket at $0.15/$0.60, at worst it's second to Gemma 4 26B A4B. The above countries have never had a model that is even close to being so.

This particular Mistral Medium looks to be uncompetitive at that pricing. I'm surprised it's so expensive given its size. Wonder if we'll see other providers offer it for cheaper.

but that doesn't mean Mistral has never produced anything useful.

by deaux

4/29/2026 at 4:47:24 PM

> Korea

EXAONE from LG AI Research https://huggingface.co/LGAI-EXAONE

They had one of the best small models a few months ago and they released a new model just last week.

There's also HyperCLOVA X (haven't tested it, but maybe it is also good) https://huggingface.co/naver-hyperclovax

> India

India has the Sarvam model series, which admittedly are not SotA, but they have pretty good voice capabilities https://huggingface.co/sarvamai

The UAE (not part of the list above) also has a few noteworthy models: https://huggingface.co/tiiuae

by johndough

4/29/2026 at 5:05:04 PM

I'm familiar with those models. They're nowhere near competitive. Miles away from Mistral or (obviously) Chinese models.

> (haven't tested it, but maybe it is also good)

I have. It is not.

by deaux

4/29/2026 at 5:11:28 PM

You mentioned "pareto-competitive", and EXAONE certainly was that. The statement that the "above countries have never had a model that is even close to being so" is simply too broad.

by johndough

4/29/2026 at 6:02:41 PM

You're talking about EXAONE 4.5 33B? Gemma 4 31B was released 1 week earlier and blows it out of the water. Which point in time/model size are you possibly talking about? The original K-EXAONE in January?

More than anything the availability speaks for itself. If it was indeed pareto competitive, all dozens of model providers would be doing their best to offer it for serverless inference. They don't. There's maybe one that does. Do you think a lot of companies wouldn't prefer a Korean model over a Chinese one? In this case, the market speaks. Go talk to people who run business based on putting billions or trillions of tokens through open weights models. And how much time they put into optimization of model selection to save money and latency. And ask why none of them are using EXAONE models. It's not because we're not aware of their existence. There's also reason to believe they've been benchmaxxing more than Chinese models, btw. Have you done the vibecheck?

I wish they were strong, I hope that in the future, they are. More diversity is better. So far they have not yet been a serious option at any point.

by deaux

4/29/2026 at 5:07:42 PM

they should ask unsloth to follow them. For my usecases locally w/128GB, Qwen3.5-Coder-Next is SOTA.

by cyanydeez

4/29/2026 at 4:29:31 PM

DeepMind, which is headquartered in London, probably had a significant role in the development of the Gemini and Gemma models.

Yes, it might be a problem that the UK allows companies like this to be bought up by foreign countries.

by argsnd

4/30/2026 at 12:11:08 AM

Yet ASML is always cited as a great Europe great achievement, but it's hardly ever mentioned that without American's EUV research and patents, and without Cymer there would be no AMSL as we known of.

In all honesty I believe ASML's success is mostly their own. Still, lamenting "being bought up by foreign countries" is a lame excuse.

by signatoremo

4/29/2026 at 5:10:34 PM

Without Google’s funding its not obvious i DeepMind would have went anywhere.

Unless the moved to US for funding while keeping a back office in the UK.

It’s strange to expect anything significant to come out from Europe when VCs there are either very risk averse and/or don’t have enough cash to begin with. It’s not like government or EU funding can replace that since its almost always wasted or missdirected

by wasfgwp

4/29/2026 at 7:39:36 PM

It’s a company containing such remarkable talent that I’m sure they would not have run into significant issues raising capital on international markets.

It’s not like VCs are only allowed to invest in companies in their own country.

by argsnd

4/30/2026 at 5:46:57 AM

Usually to maximize its funding the company would move its HQ to the US and if they are lucky have an IPO there and eventually become effectively American after a few years (e.g. Unity)

by wasfgwp

4/30/2026 at 6:35:06 AM

I have no idea why @wasfgwp is downvoted - it's very true. +1 on that.

by pbgcp2026

4/29/2026 at 6:47:12 PM

What does Pareto competitive mean here? Look at the pricing of the V4-flash model: https://api-docs.deepseek.com/quick_start/pricing

by pama

4/30/2026 at 5:03:47 AM

> What does Pareto competitive mean here?

Being near the Pareto frontier of inference cost vs. output quality.

This was released 6 days ago. The dust hasn't settled yet, and Mistral Small 4 was released earlier. Even if Deepseek V4-flash turns out to crush it, there was a period where it was Pareto competitive. None of the countries I named (i.e. no country that isn't China/US/Mistral) have had a Pareto competitive model at any point in time.

by deaux

4/29/2026 at 7:07:09 PM

80 percent as good for 20 percent the cost.

by selectodude

4/29/2026 at 8:48:48 PM

But it is worse and more expensive…

by pama

4/29/2026 at 4:36:13 PM

Although the Manus decision might change things for AI, Singapore-washing is quite rampant among Chinese companies, so I wouldn't call this place of origin an alternative market.

by class4behavior

4/29/2026 at 6:22:20 PM

[dead]

by sayYayToLife

4/29/2026 at 3:58:45 PM

This is the bar for anybody that's not the frontier labs.

by amunozo

4/29/2026 at 4:09:20 PM

> This is the bar for Europe, huh?

A few months ago China was being criticized left and right on how somehow it was not able to compete, and once DeepSeek showed up then all the hatred shifted onto how China was actually competing but exploring unfair competitive advantages.

Funny how that works.

Also, aren't the likes of OpenAI burning through over $2 of investment for each $1 of revenue?

by locknitpicker

4/29/2026 at 4:10:41 PM

[flagged]

by pb7

4/29/2026 at 4:23:55 PM

2 businesses working to get money from the same customers in the same field is competition. Kellogs is competing with store brand cereal. People are choosing to use these Chinese AI apis because they are good enough for some workflows and cheaper. If they didn't exist, the money would go to the frontier labs. There is no world where this would not be defined as competition.

by nickthegreek

4/29/2026 at 5:36:51 PM

I find it funny how people don't realize the technical achievements and papers coming out of deepseek or Alibaba. They are making this whole AI thing sustainable and cheap and available to do at home. That's the future. I should be able to run my own harness and model and never bother with openai or anthropic at all.

by 2ndorderthought

4/29/2026 at 4:23:19 PM

> China is not competing, it is distilling US models

China are cheating by using data obtained without permission to train their models in an evil commie way!

They should have done what the US did instead and trained models on data obtained without permission in a fair and freedum way!

> Where are the Chinese models that are blowing US ones out of the water?

Kimi2 blows every US model out of the water in any comparison that includes both costs and performance.

by tirpen

4/29/2026 at 5:16:05 PM

Qwen3.6 runs on a single GPU and beats claudes sonnet. In benchmarks and real world tests from humans. Kimi is awesome but most people won't be able to host it themselves.

A lot of people are slowly realizing the moat of 1T closed source models is gone as of the last few weeks. It's going to change the industry. April was a huge month for open models, it'll be curious to see if that continues.

This Mistral submission is another nail in the coffin.

by 2ndorderthought

4/29/2026 at 5:23:31 PM

i run qwen 3.6. you need to drink some settle down juice.

by prodigycorp

4/29/2026 at 5:30:52 PM

No way it's awesome.

by 2ndorderthought

4/29/2026 at 5:18:04 PM

> beats claudes sonnet

Based on benchmarks which don’t mean that much these days.

> models is gone as of the last few weeks.

Yes, that’s exactly what people were saying after every major release for the past year or so. It’s always a couple of weeks away

by wasfgwp

4/29/2026 at 4:21:36 PM

> China is not competing, it is distilling US models.

I think you should check your notes. The likes of Kimi K2 thinking shows up as high as the second best general purpose model currently in existence. It seems they compete just fine.

If you believe "distilling" is all it takes to put together a model at the top of any synthetic benchmark then I wonder what you would have to say about all US models that greatly underperform in comparison and still manage to be used extensively in professional settings.

But your argument is an emotional one and not rarional, isn't it?

by locknitpicker

4/29/2026 at 5:16:25 PM

> high as the second best general purpose model

According to benchmarks which are gamed to the extreme these days. Trusting them blindly isn’t exactly rational either. They don’t necessarily translate that well to real world tasks

It’s obviously not “distilling” as such but there are reasons why Chinnese models are consistently several months behind OpenAI/Antropic

by wasfgwp

4/29/2026 at 5:38:19 PM

[dead]

by 2ndorderthought

4/29/2026 at 5:12:07 PM

Theft is quite a slippery slope argument not in your favor in the context of US based LLMs and how/what they were trained on..

by Jackpillar

4/29/2026 at 4:16:26 PM

Ah yes, like those EUV machines America and China have worked on.

by sagacity

4/29/2026 at 6:56:01 PM

I mean, at least we're not melting the planet trying to predict the next token that sounds about right.

by Matl

4/29/2026 at 7:54:07 PM

Europeans use AI as much as anyone else.

by pb7

4/29/2026 at 8:01:58 PM

Yes, but it would seem that Chinese models are much more efficiently trained than the US ones, (i.e. with fewer resources).

Europe doesn't invest nowhere near as much as the US does into tech, so we need to either figure out how to be at least as, and hopefully more, efficient as the Chinese models are (at least in terms of training) or there's little point in trying.

I suspect this is one of the reasons why Mistral's models are somewhat struggling; i.e. US style training costs, but nowhere near as much cash as OpenAI/Anthopic have.

There are multiple European Google alternatives as well for example, but being 80% as good just doesn't cut it. Chinese models win because they are 95-98% as good as the SotA US ones but at a fraction of the cost.

by Matl

4/29/2026 at 10:48:31 PM

They actually don't use AI as much as a lot of other regions: https://www.visualcapitalist.com/mapped-ai-adoption-across-e...

by troyvit

4/29/2026 at 4:26:02 PM

[flagged]

by wg0

4/29/2026 at 5:06:20 PM

[flagged]

by gadders

4/29/2026 at 5:42:17 PM

[flagged]

by 2ndorderthought

4/29/2026 at 5:12:13 PM

[flagged]

by saulapremium

4/29/2026 at 8:01:07 PM

The fact that this comment is still up hours later but my comment below participating in the discussion got flagged should tell one everything they need to know about the intellectual rigor here.

by pb7

4/30/2026 at 4:40:05 AM

Oh it's flagged as well, and I admit that it was low effort. But your comment served no purpose besides from provocation, and I guess it worked.

by saulapremium

4/29/2026 at 4:27:06 PM

It's funny that 128B is now considered Medium. I remember back in the day when 355M parameters was considered medium with GPT-2.

by minimaxir

4/29/2026 at 4:38:16 PM

And GPT-2 1.5B was considered too dangerous to release.

They were perhaps right.

by speedgoose

4/29/2026 at 6:58:49 PM

considered that by OpenAI for marketing purposes that is

But yes, perhaps it would have been better for all of us if they haven't.

by Matl

4/29/2026 at 7:24:43 PM

In lockstep over the past month, a subset of people, un-labelable, unprompted, share this train of thought:

- Mythos wasn't released widely.

- But Anthropic shared info on it and said it was dangerous.

- Anthropic is a company.

- Companies like money.

- Therefore Mythos is marketing hype.

- Remember GPT-2? That also wasn't released. They said it was dangerous.

- But, GPT-3, GPT-4, GPT-5, etc. were released.

- Therefore GPT-2 being dangerous was marketing hype.

I've seen the idea that GPT-2 not being released was marketing hype at least 6 times since Mythos was shared.

It's Not Even Wrong, in the Pauli sense: they weren't selling anything! They weren't raising funding! What were they marketing!?

And there's a lot more elided from history, ex. they didn't have an API yet.

GPT-3 was released, a year or two later, and did have an API. But, no one used it, it wasn't good enough yet. And they did treat it as dangerous, it was wildly over-the-top manually monitored for anything resembling not-intended-use. I got permanently suspended for using the word "twink"

by refulgentis

4/29/2026 at 7:53:28 PM

> I've seen the idea that GPT-2 not being released was marketing hype at least 6 times since Mythos was shared.

That's not what I am saying.

It's not that GPT-2 not being released was marketing hype, it's that OpenAI themselves claiming it's too dangerous to release specifically, implying it's close to AGI, (or something like that), was marketing hype.

by Matl

4/29/2026 at 8:00:44 PM

That may sound more defensible to you, but its even more detached from reality. I feel very old right now because I actually read the thing at the time, but setting that aside, do you really think anyone thought or said GPT-2 was AGI?

I don't think you do.

I only mention reading it because that would clear it up, and you seem interested, and your parenthetical indicates A) you're aware you're claiming something a bit silly and B) you don't know what was actually said.

by refulgentis

4/30/2026 at 5:40:18 PM

> do you really think anyone thought or said GPT-2 was AGI?

> I don't think you do.

I don't think they did. I think the marketing around it was such as to imply to the general public that it's not that far from AGI/let their imagination run wild, because it's useful marketing.

Same way as Apple's marketing around the iPad was that it's the 'Super. (full stop) (space) Computer. (full stop)' They never say it's any kind of a 'supercomputer'. But they know how that is going to be interpreted by many. It's intentional marketing.

Same with OpenAI.

I do think you're playing obtuse at this point, if you don't get that.

by Matl

4/30/2026 at 6:44:45 PM

I am not playing obtuse, and note “do you really think” is different from accusing someone of playing a character on purpose to gaslight you, especially when the “do you really think they said” is stapled to a long, kind, explication that your parenthetical says straight out you don’t know what was said, you’re guessing.

Your attempt to recount GPT-2 and what was said about it won’t make any sense to anyone who was present at the time, and you’ve circled back to root, “marketing”, again, for a company that was selling nothing and wouldn’t for years and wasn’t raising and wasn’t even mentioning AGI in connection with GPT2. It was just a text generator back then, though a curious one worth noting. It’s genuinely awe inspiring to see someone arguing gpt-2 was near AGI and that anyone at the time thought so or said so.

It’s not worth engaging further, especially with you starting to get nasty and project what’s in my head. Have a good day.

by refulgentis

4/29/2026 at 5:05:16 PM

Given what Vibe already did in the previous versions with codestral-v2, that's great news. Keep up the good work ! I don't want to depend on the world's two hungry superpowers.

by maelito

4/29/2026 at 6:07:24 PM

The Vibe CLI is really bad on Windows, sure they don’t officially support it, so can’t blame them, but a FYI for anyone wanting to try it. It can’t get find and replace right.

by andhuman

4/30/2026 at 6:01:24 AM

Mistral is playing the long game here ngl, lower sized models, lower costs, overall good enough performance!

by vjay15

4/29/2026 at 4:18:43 PM

I can't figure out if this is available in the official Mistral API or not.

Their model listing API returns this:

  {
    "id": "mistral-medium-2508",
    "object": "model",
    "created": 1777479384,
    "owned_by": "mistralai",
    "capabilities": {
      "completion_chat": true,
      "function_calling": true,
      "reasoning": false,
      "completion_fim": false,
      "fine_tuning": true,
      "vision": true,
      "ocr": false,
      "classification": false,
      "moderation": false,
      "audio": false,
      "audio_transcription": false,
      "audio_transcription_realtime": false,
      "audio_speech": false
    },
    "name": "mistral-medium-2508",
    "description": "Update on Mistral Medium 3 with improved capabilities.",
    "max_context_length": 131072,
    "aliases": [
      "mistral-medium-latest",
      "mistral-medium",
      "mistral-vibe-cli-with-tools"
    ],
    "deprecation": null,
    "deprecation_replacement_model": null,
    "default_model_temperature": 0.3,
    "type": "base"
  },

So that has the alias "mistral-medium-latest", but the official ID is "mistral-medium-2508" which suggests it's the model they released in August 2025.

But... that 1777479384 timestamp decodes to Wednesday, April 29, 2026 at 04:16:24 PM UTC

So is that the new Mistral Medium?

by simonw

4/29/2026 at 4:27:55 PM

Some poking around in the source code for https://github.com/mistralai/mistral-vibe got me to this:

  curl https://api.mistral.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(llm keys get mistral)" \
  -d '{
    "model": "mistral-medium-3.5",
    "messages": [
      {"role": "user", "content": "Generate an SVG of a pelican riding a bicycle"}
    ]
  }'

Which did work: https://gist.github.com/simonw/f3158919b18d2c47863b0a5dc257a... - it's pretty disappointing.

Weird that it doesn't show up in the model list:

  curl https://api.mistral.ai/v1/models \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $(llm keys get mistral)" | jq

by simonw

4/29/2026 at 4:54:24 PM

I also did some SVG tests, it's really bad.

https://chat.mistral.ai/chat/897fbe7d-b1ae-4109-9b29-f3ccc4f...

by Mashimo

4/29/2026 at 5:03:03 PM

Wow. I get that "how well can it make SVGs" isn't the (or a) gold standard for how useful a model is or isn't, but the fact the Gemma 4 26B A4B I'm running locally can blow it out of the water doesn't give me high confidence for the model. Maybe an unfair comparison, but...

by spijdar

4/29/2026 at 5:23:58 PM

It sounds like they focussed performance on not drawing svgs. Which honestly, makes a lot of sense to me.

by 2ndorderthought

4/29/2026 at 5:36:41 PM

Drawing SVGs isn't something I really care about either, and I think it's still to "qualitatively compare" e.g. "Opus's pelican vs GPT's pelican vs GLM's pelican" or whatever the kids are doing.

But what stands out to me is that it's barely able to draw a "recognizable" pelican at all. The Devstral 2 model even looks slightly better, though maybe I'm splitting hairs: https://simonwillison.net/2025/Dec/9/

by spijdar

4/29/2026 at 5:19:14 PM

It's so bad I don't want to spend the 18 EUR just to test it for a month. It can't even create an SVG of the facebook logo. There should be plenty of examples of that around.

Gemini fast could do that in under 5 seconds.

by Mashimo

4/29/2026 at 5:11:31 PM

I'm curios: are you doing a real apples to apples comparison, or are you running a harness that already curates prompts? There's a far and wide margin how any of these models respond based on already loaded context. Most models are pretty much hot garbage until their context is curated appropiately.

by cyanydeez

4/29/2026 at 5:18:53 PM

I just copied and pasted each prompt as specified by Mashimo and simonw into a chat interface, using a 4-bit Unsloth quantization of Gemma 4 26B, with the default sampler settings recommended by Google, and a system prompt of "You are a helpful assistant". The results are miles ahead of what the Mistral model output.

I've gotten a lot of use out of Mistral models, and I imagine this model is pretty good at other things, but it really feels like a 128B parameter dense model should be at least a little better than this.

by spijdar

4/29/2026 at 6:55:54 PM

For it's size, that's really good! Though I bet it being a dense model probably helps a lot, if it was MoE at that size, I bet the benchmark performance would go quite a bit down (which consequently would also mean that I'd at least be able to run it with decent tokens/second, with the bunch of Nvidia L4 cards available to me, which presently are only okay with MoE models).

It's cool that they added comparisons to their own Mistral Small 4 119B A7B, which kind of shows that! They could have also included comparisons to something like Qwen Coder Next 80B A3B (or maybe the newer Qwen 3.6 35B A3B, or the 27B dense one), maybe DeepSeek V4 Flash 284B A13B, or the older GPT-OSS 120B A5B to illustrate that difference and where their model sits even better, it would probably give a more positive picture than just comparing themselves against a bunch of bigger models!

Come to think of it, alongside throwing some money at DeepSeek not just Anthropic, I probably should get a Mistral subscription as well sometime, to see how they perform on various tasks - cause they seem pretty cost effective and it's nice to support at least some EU orgs: https://mistral.ai/pricing

by KronisLV

4/29/2026 at 5:00:20 PM

This is a very interesting strategy that might pay off. This model is a very good option for enterprise self host. I would argue a lot of companies are VRAM constrained rather than compute constrained. You could fit 4-5 running instances on one H100 cluster where you can only fit 1-2 Kimi K2 or GLM5.

by syntaxing

4/29/2026 at 5:47:50 PM

This is 128B dense though. the K/V cache on long context is going to be massive

by 2001zhaozhao

4/29/2026 at 6:30:49 PM

Don’t think kv size correlates to dense/moe

by Havoc

4/29/2026 at 6:46:47 PM

KV size correlates with attention parameters which are a subset of active parameters. So a typical MoE model will have way lower KV size than a dense model of equal total parameter count.

by zozbot234

4/29/2026 at 7:22:03 PM

With turbo quant, you would reduce it by over 6X.

by syntaxing

4/29/2026 at 6:23:04 PM

[dead]

by sayYayToLife

4/29/2026 at 3:37:54 PM

TLDR: Mistral Medium 3.5, text-only, 128B dense model, 256k context window, modified MIT license. Model is ~140G ...

https://huggingface.co/mistralai/Mistral-Medium-3.5-128B

They more or less claim this exceeds Claude Sonnet 3.5 on most things, but is worse than Sonnet 3.6, and exceeds all other open models.

Oh and they have a cloud service that will code your apps "in the cloud". But, yeah, at this point, so does my cat.

And, yes, unsloth is on it: https://huggingface.co/unsloth/Mistral-Medium-3.5-128B-GGUF (but 4bit quant is 75G)

by spwa4

4/29/2026 at 4:01:38 PM

Sonnet 4.5 and 4.6*

There is no way it exceeds “all other” open models - but it does exceed all of Mistral’s past models.

You can see it getting blown past by GLM 5.1 and Kimi in this.

Still excited to give it a try

by wolttam

4/29/2026 at 5:28:09 PM

It looks like qwen 3.6 is winning and smaller for the April small model roll out

by 2ndorderthought

4/29/2026 at 4:05:35 PM

Unfortunately they only compare to old “all other open models”. There are probably over 10 other open models better than it by now.

by pama

4/29/2026 at 4:01:13 PM

You mean Sonnet 4.5 and 4.6 riight

by Marciplan

4/29/2026 at 4:55:58 PM

right

by spwa4

4/29/2026 at 3:55:00 PM

I use Mistral Le Chat quite a bit.

One thing in particular I was disappointed in was its bad explanations when asking about French grammar. It made multiple mistakes and the other models got it right, even Qwen 3.6 27b!

Anyway, I'm hoping they catch up some more.

by Tepix

4/29/2026 at 3:59:41 PM

There's a good chance that they'll catch up. The "AI race" is a race to the bottom, with the leaders blowing huge wads of cash on capabilities that get replicated months later by the competition at a fraction of the cost.

The only benefit of leading is mindshare. OpenAI is doubling down on that, by investing in communication companies. That's their pathetic attempt at a "moat".

by kubb

4/29/2026 at 4:08:30 PM

They catch up by distilling frontier models. They will eventually figure out how to prevent that from happening. No one has any interest in investing tens of billions if the product can be copied and sold for less.

by pb7

4/29/2026 at 4:25:21 PM

>No one has any interest in investing tens of billions if the product can be copied and sold for less.

That is what has happened until now though

by amarcheschi

4/30/2026 at 3:46:31 PM

Le Chat fast is really good now. Did a few normal queries and noticed and improvement. Very well done.

by Scroll_Swe

4/29/2026 at 7:27:54 PM

Ouch. Maybe they have a captive buying market to insulate them from actual market forces or ???

by jollymonATX

4/29/2026 at 3:35:19 PM

Looks at first graph. It's SWE-Bench Verified. A benchmark Open-AI stopped using two months ago due to contamination.

Doesn't look to promising. Is there any reason to consider Mistral other than it's not US?

by InputName

4/29/2026 at 5:30:01 PM

They did not stop using it due to contamination. They said it's flawed and indirectly said anthropics results were impossible. It's very possible they are sore losers

by 2ndorderthought

4/29/2026 at 3:43:51 PM

If it's not US and it's within a few percent of SOTA that might be good enough for a lot of people (eg Europeans)

by tpurves

4/29/2026 at 3:46:34 PM

Gemma has been better for us at EU languages than mistral (for comparable sized models) :/ so ... dunno. What mistral does well and others are lagging behind is deploying on prem with their engineers and know-how, offering tuned models for your tasks and finetuning on your own data. (I expect google to start offering this next)

by NitpickLawyer

4/29/2026 at 5:03:26 PM

It's sad that despite their strength in this for onprem, they're so behind on this in the cloud. No publicly available cloud SFT at all. Meanwhile Google has been offering that for years - though remains to be seen if they will for Gemini 3 when GA.

And on top of it a range of providers like Fireworks and so on that offer it for Chinese models. This seems such an obvious thing for Mistral to offer.

by deaux

4/29/2026 at 3:49:12 PM

Price and speed.

by amunozo

4/29/2026 at 5:33:55 PM

With most OSS releases being MoEs, and modern GPUs optimized for MoEs, can somebody with knowledge of the topic explain or speculate why Mistral might have opted for a dense model?

by schipperai

4/29/2026 at 6:02:26 PM

Modern GPUs aren't optimized for MoEs though?

The advantage to a dense model like this Mistral one is that it is as smart as a much larger MoE model so it can fit on less GPUs. The tradeoff is that it is much slower since it has to read 100% of its weights for every token, MoE models typically only read about a tenth (though sparsity levels vary).

by ac29

4/29/2026 at 11:34:54 PM

Thanks, makes sense. I meant Blackwell is explicitly optimized for MoEs.

by schipperai

4/29/2026 at 5:32:29 PM

A 1000B model, can we call it 1KB model?

by Alifatisk

4/29/2026 at 3:23:17 PM

I want to believe it's gonna be good, but after trying GPT-5.5 even the most advanced Chinese models seem depressing.

by amunozo

4/29/2026 at 3:55:15 PM

I am not following this obsession with SOTA and benchmark rankings

I have been using DeepSeek and GLMnmodels with OpenCode and Codex and Claudr side by side.

I have not found the Chinese models lacking. I enjoy for coding and like to maintain full control of my codebade and deeply care about the GOF patterns. So I am very stringent in terms of what I want the LLM to code and how to code.

So from my perspective, they are all about the same.

by manishsharan

4/29/2026 at 4:00:15 PM

That I agree with, but for more complex autonomous changes the differences are considerable. However, it seems that most models will reach the saturation time in which they will be useful for almost everything and the difference will be in more and more niche and specialized tasks.

by amunozo

4/29/2026 at 3:31:56 PM

This is a French model sir

by r0b05

4/29/2026 at 3:40:03 PM

Évidemment

Funny detail: Google AI (the one they use in search) can't spell évidemment correctly.

by spwa4

4/29/2026 at 3:41:07 PM

What's French for 'goblin'...?

by baq

4/29/2026 at 3:31:54 PM

Then you’ll be happy to learn it’s not Chinese

by ako

4/29/2026 at 3:38:14 PM

GP is stating that the second best in the field, the Chinese, is so far behind the best in the field, GPT 5.5, that it is not even worth testing anything else.

by dotancohen

4/29/2026 at 3:48:52 PM

Thanks for the translation, I did not express it very clearly. Anything that I try is so much worse.

by amunozo

4/29/2026 at 4:00:31 PM

Is GPT 5.5 the best in the field? I think Opus is still better despite Anthropic's recent stumbling.

by Ritewut

4/30/2026 at 5:40:26 AM

I did not try much Opus recently as I had a Codex subscription and heard bad things, but Opus is super good too. Let's say compared to any of them.

by amunozo

4/29/2026 at 3:41:37 PM

Honestly I depends on the context which this performance matters. Mistral is quiet cheap

by lava_pidgeon

4/30/2026 at 2:47:21 PM

I hack you

by dznan

4/29/2026 at 9:53:57 PM

[dead]

by vicchenai

4/29/2026 at 4:28:45 PM

Oh they are still a thing?! Completely forgot about Mistral. I am assuming they are still burning trough investor money.

by Giorgi

4/29/2026 at 6:13:29 PM

> they are still burning trough investor money

Difficult to say, this information is not really public. That said, those investors include EU agencies and European multinational companies and governments. It’s not as flashy as the ridiculous sums OpenAI is getting but it should be enough to keep them going for a while.

They also have a different business model. They are selling their expertise to fine tune and adapt their models to on-premises computers (which they can help you build) to handle confidential data and information. I would not be surprised that the revenue they get from normal people is negligible in comparison.

by kergonath

4/30/2026 at 11:41:18 AM

Ooh, ok so people got all worked up because it is EU vs USA thing.

by Giorgi

4/29/2026 at 5:46:57 PM

I believe they'll get profitable sooner than their frontier competition. Their operating costs seem to peanuts compared to the providers they're compared to most often while having the local advantage of not being Chinese nor American.

by danelski

4/29/2026 at 4:35:40 PM

What's better than Voxtral for locally processed voice input? More competition is always better.

by sev_verso

4/29/2026 at 6:33:16 PM

Think they’re positioned pretty well. They’ve got an edge in the European corporate space and don’t have ungodly large numbers to hit

by Havoc