5/11/2026 at 12:36:22 AM
Getting so close to good! I consider Gemma 4 31B (dense, no MoE) the new baseline for local models. It's obviously worse than the frontier models, but it feels less like a science experiment than any previous local model I've run, including GPT OSS 120B and Nemotron Super 120B.
On my M5 Max with 128 GB of RAM and the full 256K context window, I see RAM use spike to about 70 GB, with something like 14 GB of system overhead. A 64 GB Panther Lake machine with the full Arc B390, or a 48 GB Snapdragon X2 Elite machine, could probably run it with a 128K to 256K context window. Maybe you can squeeze it into 32 GB (27.5 GB usable) with a 32K context window?
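For a rough sense of why context length drives those RAM numbers, the KV cache alone scales linearly with it. A quick sketch of the standard dense-transformer sizing formula — every model dimension below is a hypothetical placeholder, not Gemma 4 31B's actual config:

```python
def kv_cache_bytes(ctx_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int) -> int:
    # Two tensors per layer (K and V), each [ctx_len, n_kv_heads, head_dim].
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * ctx_len

# Assumed: 48 layers, 8 KV heads of dim 128, 8-bit (1-byte) cache, 256K context.
gib = kv_cache_bytes(256 * 1024, 48, 8, 128, 1) / 1024**3
print(f"{gib:.1f} GiB")  # prints 24.0 GiB for the cache alone at 256K
```

Halving the context window halves that figure, which is why a 32 GB machine needs to drop to something like 32K to fit the cache alongside the quantized weights.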
Even last year, seeing this kinda performance on a mainstream-ish/plus configuration would have seemed like a pipe dream.
by soganess
5/11/2026 at 3:22:37 AM
Gemma 4 IS good. I've literally had it get a thing right that Opus 4.7 missed. The edges are ragged, but I'm reliably finding use cases where it's basically equivalent. Ultimately the metric is "what can I RELY on it to do?" Opus definitely knows a lot more and can sometimes do much more complex tasks, but especially when you're good about feeding the context, Gemma is amazing. The difference between the sets of things I trust the two models to do is surprisingly small. I've had some insanely good runs recently working on my personal tooling as well as random projects. It's the first local model that can reliably be left to implement features in agentic mode on non-trivial projects.
https://thot-experiment.github.io/gradient-gemma4-31b/
This is a relatively complex piece of tooling built entirely by Gemma 4 inside OpenCode where I manually intervened maybe only 4 times over the course of a few hours.
Running Q6_K_XL, 128K context @ q8: ~800 tok/s read, 16 tok/s write.
Eagerly awaiting turboquant and MTP in llama.cpp; should take me to 256K and 25-30 tok/s if the rumors are true.
by thot_experiment
5/11/2026 at 5:25:00 AM
Re-posting this from a buried comment for visibility because it's just so fucking impressive to me. I went to the store to buy mixers, and while I was out Gemma 4 31B got pretty far along with reverse engineering the Bluetooth protocol of a desk thermometer I have. I forgot to turn on the web search tool, so it just went at it, writing more and more specific diagnostic logging/probing tools over the course of like 8 turns. It connected to the thermometer, scanned the characteristics, and made a dump of the Bluetooth notification data. When I got back it was theorizing about how the data might be encoded in the characteristics, and it got into an infinite loop. (Local models aren't perfect, and I never said they were.) I turned on the web search tool and told it to "pick up the project where it left off"; it read the directory, did a couple of googles, and had a working script to print temperature, humidity, and battery state in like 3 turns. Reading back through its chain of thought, I'm pretty sure it would have gotten there eventually without googling.
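For anyone curious what that final "working script" step usually amounts to: once you have the notification dump, decoding is just unpacking bytes. A hypothetical decoder sketch — the little-endian int16/uint16/uint8 layout below is a common pattern for cheap BLE thermometers, NOT the poster's actual protocol:

```python
import struct

def decode_reading(payload: bytes):
    # Assumed layout: int16 temperature in 0.01 degC, uint16 humidity
    # in 0.01 %, uint8 battery % -- all little-endian, 5 bytes total.
    temp_raw, hum_raw, batt = struct.unpack("<hHB", payload[:5])
    return temp_raw / 100.0, hum_raw / 100.0, batt

# Example notification payload (hex dump of 5 bytes):
print(decode_reading(bytes.fromhex("a50882135b")))
```

The hard part the model did unsupervised was figuring out which characteristic carries the data and what the layout is; the decoder itself is trivial once that's known.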
idk, I thought I was a cool and smart engineer type for being able to do stuff like this, if my GPUs being able to do this more or less unsupervised isn't impressive I guess fuck me lol.
by thot_experiment
5/11/2026 at 12:02:25 PM
Had a very similar experience recently. I built a basic authentication handler for this test just so it wouldn't be in the training data of either model, with deliberately planted bugs. One was a hardcoded secret; another was a wrap-on-0xFFFFFFFF bug as a result of a malloc(length + 1).
Qwen 3.6 found both, along with two other issues I hadn't even considered, and the location of the magic value. GPT-5.4, though, missed the malloc issue (flagging memory exhaustion as the only risk), missed a separate timing bug (it explicitly said the function was safe), and hallucinated the location of the magic value. Qwen correctly identified the integer overflow; GPT-5.4 did not.
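The wrap itself is easy to demonstrate. In C, malloc(length + 1) with a uint32_t length computes the size in 32-bit unsigned arithmetic, so at length == 0xFFFFFFFF it wraps to malloc(0), and the subsequent copy of length bytes overflows the heap. A Python sketch of the arithmetic (a hypothetical reconstruction, not the poster's actual handler):

```python
U32_MAX = 0xFFFFFFFF

def c_alloc_size(length: int) -> int:
    """Size malloc actually receives for malloc(length + 1) with a uint32_t."""
    return (length + 1) & 0xFFFFFFFF  # 32-bit unsigned wraparound

print(c_alloc_size(10))       # 11, as intended
print(c_alloc_size(U32_MAX))  # 0 -> zero-byte allocation, then a ~4 GiB memcpy

def safe_alloc_size(length: int):
    # The usual fix: reject the pathological length (or do the
    # addition in a wider type, e.g. size_t on 64-bit) before it can wrap.
    return None if length == U32_MAX else length + 1
```

This is exactly the kind of bug that "memory exhaustion" hand-waving misses: the danger isn't a big allocation, it's a tiny one followed by a huge write.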
I then compared basic research between them using SearXNG for web search. For example, the current status of MTP in llama.cpp. Qwen 3.6 27B found the current PR, but flagged a related issue that shows the current implementation can be slower than just using a draft model right now. GPT-5.5 Thinking found the same PR, but didn't flag the downsides.
In a similar comparison, I asked both models how I should get started with ESPHome as a total beginner. ChatGPT suggested an ESP32-S3 and a BME280, which is... just not a good idea. It also talked about the ESP32-P4 not having Wi-Fi, and about installing with HA or Docker. Meanwhile, Qwen 3.6 27B said regular ESP32, DHT22, and mentioned HA, Docker, and pip as installation methods. While GPT was good, it was just throwing out jargon for a prompt that explicitly said it was for a beginner.
It kind of blew my mind that in all three of these, Qwen landed it better.
by AdamConwayIE
5/11/2026 at 7:29:41 AM
It definitely is, and just a few years ago it would have been unheard of. And we're progressing on so many different frontiers in parallel: agent harnesses, agent models, hardware, etc.
by AntiUSAbah
5/11/2026 at 9:23:04 AM
A technology indistinguishable from magic.
by hparadiz
5/11/2026 at 6:42:37 AM
The small Qwen 3.6 models handle context a little better than Gemma 4, but Gemma 4 26B in particular produces such small, efficient solutions, which are really smart for its weight class. I was so impressed with its performance in our benchmark upon release that I wrote a blog post about it [0], although its position on the leaderboard later fell a bit as we ran it in more long-context agentic coding environments.
by gertlabs
5/11/2026 at 3:40:26 PM
Here's a great explanation why: https://www.youtube.com/watch?v=_A367W_qvc8
Google's messing with the context. LOTS of speed for a little worse long-context performance.
by spwa4
5/11/2026 at 6:28:08 AM
I use the smaller Gemma E2B model for most of my editing and it works surprisingly well. My workflow is planning with SOTA models and execution via small models. If you plan properly and don't leave ambiguity for the smaller model, it works well.
by pdyc
5/11/2026 at 10:05:17 AM
Out of curiosity, have you tried other small models? The E2B for me was unusable. Llama 3.2 3B was better, and that thing is a year old; I rarely use it now too.
by 2ndorderthought
5/11/2026 at 12:33:12 PM
Yes, I keep trying small models. I've also tried the Qwen 3.5 0.8B, 2B, and 4B and the Gemma 4 E4B models, but they either didn't work reliably (thinking loops, issues following instructions) or had performance issues (prompt speed, tg speed, too much RAM). E2B was the sweet spot where I could give it a plan and it could edit files properly.
by pdyc
5/12/2026 at 1:45:08 AM
How did E2B compare to E4B?
by Melatonic
5/12/2026 at 4:33:14 AM
I didn't see much improvement for my use case (file editing tasks), but with E4B the tg/s is lower, so I stick with E2B.
by pdyc
5/11/2026 at 12:51:20 PM
That makes sense; it sounds like your computer isn't super powerful. Whatever works for you.
by 2ndorderthought
5/12/2026 at 12:54:25 PM
It's great, but I wish I could use these things without it feeling like my laptop is going to melt through the desk.by prettyblocks
5/11/2026 at 3:05:13 AM
Could you please share your time to first token and tok/s?by discordance
5/11/2026 at 5:25:58 AM
M4 Pro 64GB (14 CPU / 20 GPU), Gemma 4 31B Q4_K_M GGUF, LM Studio: time to first token 0.92s, 11.56 tokens/s.
Edit: For comparison with the other poster, same setup as above but with Gemma 4 31B Instruct 8-bit MLX (not sure if it's exactly the same model): time to first token 4.62s, 7.20 tokens/s; with a different prompt, 1.17s and 7.24 tokens/s.
by isomorphic
5/11/2026 at 7:56:51 AM
Could you (or anyone with the same hardware) try antirez's ds4 and report how gracefully it degrades with only 64GB of RAM? Obviously it's going to be dog slow at best for any single inference flow, but can you meaningfully improve on that by running many sessions in parallel? (Ideally you'd need a session count roughly on the order of the model's sparsity in order to get meaningful sharing of MoE weights, but whether that's genuinely achievable is anyone's guess!)
by zozbot234
5/11/2026 at 3:29:48 AM
I’m on an M2 Max and get 10 tok/s with Gemma 4 8-bit MLX.
by ls612
5/11/2026 at 5:21:04 AM
Does Gemma work better than Qwen 3 in your experience?
by plufz
5/11/2026 at 10:06:24 AM
Not in mine. I see a lot of people talking about Gemma on here, but in my circles pretty much everyone else is running Qwen.
by 2ndorderthought
5/11/2026 at 5:06:34 PM
What's your opinion on Gemma 4 vs Qwen 3.6?
by alfiedotwtf
5/12/2026 at 11:29:07 AM
[dead]
by henry_kang