alt.hn

4/12/2026 at 8:43:08 PM

I ran Gemma 4 as a local model in Codex CLI

https://blog.danielvaughan.com/i-ran-gemma-4-as-a-local-model-in-codex-cli-7fda754dc0d4

by dvaughan

4/13/2026 at 10:35:57 AM

I also tried Gemma 4 on an M1 MacBook Pro. It worked, but it was too slow. Great to know that it works on more advanced laptops!

by alvsilvao

4/13/2026 at 8:13:11 AM

> The finding I did not expect: model quality matters more than token speed for agentic coding.

I'm really surprised that this wasn't obvious.

Also, instead of limiting context size to something like 32k, you can offload the MoE expert weights to the CPU with --cpu-moe, at the cost of roughly halving token generation speed.
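For anyone unfamiliar with the flag, a minimal llama.cpp invocation might look like this (the model filename is illustrative): --cpu-moe keeps the expert weights in system RAM, leaving VRAM for the attention layers and a much larger KV cache.

```shell
# Illustrative: serve a Gemma 4 MoE GGUF with expert weights offloaded
# to the CPU. --cpu-moe trades token generation speed for VRAM headroom,
# which lets you keep a much larger context on the GPU.
llama-server \
  -m gemma-4-26b-a4b-Q4_K_M.gguf \
  --ctx-size 131072 \
  --cpu-moe
```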

by mhitza

4/13/2026 at 6:53:15 AM

I'm currently experimenting with running google/gemma-4-26b-a4b with LM Studio (https://lmstudio.ai/) and Opencode on an M3 Ultra with 48GB RAM, and it seems to be working. I had to increase the context size to 65536 so the prompts from Opencode would work, but no other problems so far.
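If you drive LM Studio from the command line, the context bump can be set at load time. A sketch assuming the `lms` CLI that ships with LM Studio; the model identifier may differ on your machine:

```shell
# Illustrative: load the model with a 65536-token context so
# Opencode's large prompts fit.
lms load google/gemma-4-26b-a4b --context-length 65536
```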

I tried running the same on an M3 Max with less memory, but couldn't increase the context size enough to be useful with Opencode.

It's also easy to integrate it with Zed via ACP. For now it's mostly simple code review tasks and generating small front-end related code snippets.

by tuzemec

4/13/2026 at 10:20:05 AM

I do the same thing on a MacBook Pro with an M4 Max and 64GB. I had problems until the most recent LM Studio update (0.4.11+1): tool calling didn't work correctly.

Now both codex and opencode seem to work.

by jwr

4/13/2026 at 9:33:55 AM

I would have liked to see quality comparisons between the different quantization methods (Q4_K_M, Q6_K, Q8_0) rather than tok/s.

by meander_water

4/13/2026 at 9:10:34 AM

I don't really have the hardware to try it out, but I'm curious to see how Qwen3.5 stacks up against Gemma 4 in a comparison like this. Especially this model that was fine-tuned to be good at tool calling and has more than 500k downloads as of this moment: https://huggingface.co/Jackrong/Qwen3.5-27B-Claude-4.6-Opus-...

by dajonker

4/13/2026 at 9:38:45 AM

It's rather surprising that a solo dev can squeeze more performance out of a model with fairly humble resources vs. a frontier lab. I'm skeptical of claims that such a fine-tuned model is "better" -- maybe on certain benchmarks, but overall?

FYI the latest iteration is here: https://huggingface.co/Jackrong/Qwopus3.5-27B-v3

by anana_

4/13/2026 at 10:36:55 AM

Ollama is the worst engine you could use for this. Since you are already running an Nvidia stack for the dense model, you should serve this with vLLM. With 128GB you could even try the original safetensors, though you might need to be careful with caches and context length.
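As a rough sketch, serving the unquantized safetensors with vLLM might look like the following. The model id is hypothetical, and --max-model-len is where you trade context length against KV-cache memory:

```shell
# Illustrative vLLM invocation: cap the context length and the GPU
# memory fraction so the weights plus KV cache fit in 128GB.
vllm serve google/gemma-4-26b-a4b \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.90
```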

by magic_hamster

4/13/2026 at 8:33:28 AM

Related: I upgraded my M4 Pro 24GB to an M5 Pro 48GB yesterday. The same Gemma 4 MoE model (Q4) runs at about 8x the t/s on the M5 Pro and loads 2x faster from disk into memory.

Gonna run some more tests later today.

by egorfine

4/13/2026 at 9:11:00 AM

> The same Gemma 4 MoE model (Q4)

As you have so much RAM I would suggest running Q8_0 directly. It's not slower (perhaps except for the initial model load), and might even be faster, while being almost identical in quality to the original model.

And just to be sure: you're running the MLX version, right? The mlx-community quantization seemed to be broken when I tried it last week (it spat out garbage), so I downloaded the unsloth version instead. That too was broken in mlx-lm (it crashed), but has since been fixed on the main branch of https://github.com/ml-explore/mlx-lm.
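If you want to try the fixed main branch rather than wait for a release, a sketch; the quantized-model repo name here is an assumption, so substitute whichever MLX build you actually downloaded:

```shell
# Install mlx-lm from the main branch, where the crash is fixed,
# then smoke-test generation with a quantized Gemma build.
pip install git+https://github.com/ml-explore/mlx-lm.git
mlx_lm.generate \
  --model unsloth/gemma-4-26b-a4b-8bit-mlx \
  --prompt "Write a haiku about quantization." \
  --max-tokens 64
```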

I unfortunately only have 16 GiB of RAM on my MacBook M1, but I just tried running the Q8_0 GGUF version on a 2023 AMD Framework 13 with 64 GiB RAM using just the CPU, and that works surprisingly well, with tokens/s much faster than I can read the output. The prompt cache is also very useful for quickly inserting a large system prompt or a file to datamine, although there are probably better ways to do that than manually through a script.
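The prompt-cache trick can be done with llama-cli: the first run pays the full prompt-processing cost and saves the KV state, and later runs with the same prompt prefix reuse it (file names here are illustrative):

```shell
# First run: process the large system prompt once and save the KV state.
llama-cli -m gemma-4-26b-a4b-Q8_0.gguf \
  --prompt-cache syscache.bin \
  -f system_prompt.txt

# Later runs with the same prompt prefix and --prompt-cache syscache.bin
# skip most of the prompt processing.
```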

by Confiks

4/13/2026 at 9:17:25 AM

> As you have so much RAM I would suggest running Q8_0 directly

On the 48GB Mac - absolutely. The 24GB one cannot run Q8, hence the comparison.

> And just to be sure: you're are running the MLX version, right?

Nah, not yet. I have only tested in LM Studio, and they don't have recommended MLX versions yet.

> but has since been fixed on the main branch

That's good to know, I will play around with it.

by egorfine

4/13/2026 at 8:09:57 AM

In my experience, for coding it makes no sense to use any quantization worse than Q6_K. More heavily quantized models make more mistakes; that can still be fine for text processing, but not for coding.

by zihotki

4/13/2026 at 8:26:24 AM

Nice walkthrough and interesting findings! The difference between the MoE and the dense models seems to be bigger than what benchmarks report. It makes sense, because a small gain in tool planning and handling can have a large influence on results.

by danilop

4/13/2026 at 8:50:30 AM

I think local models are not yet that good or fast for complex things, so I am just using local Gemma 4 for some dummy refactorings or something really simple.

by karpetrosyan

4/13/2026 at 8:20:24 AM

You can also try speculative decoding with the E2B model as the draft. Under some conditions it can result in a decent speed-up.
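A minimal llama.cpp sketch of that setup, with the small E2B model drafting tokens for the big one to verify (model file names are illustrative):

```shell
# Illustrative: the target model verifies batches of tokens
# drafted by the much smaller E2B model.
llama-server \
  -m gemma-4-26b-a4b-Q8_0.gguf \
  -md gemma-4-e2b-Q8_0.gguf \
  --draft-max 8 --draft-min 1
```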

by Havoc

4/13/2026 at 1:27:33 AM

With an Nvidia Spark or a 128GB+ memory machine, you can get a good speed-up on the 31B model if you use the 26B MoE as a draft model. It uses more memory, but I've seen acceptance rates around 70%+ using Q8 on both models.
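To get a feel for why a ~70% acceptance rate matters, here's the textbook back-of-the-envelope estimate for speculative decoding, under the simplifying assumption that each drafted token is accepted independently:

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model verification pass,
    assuming each drafted token is accepted independently with
    probability accept_rate and draft_len tokens are drafted."""
    a, k = accept_rate, draft_len
    return (1 - a ** (k + 1)) / (1 - a)

# At 70% acceptance with 4 drafted tokens per step, each verification
# pass yields ~2.8 tokens instead of 1.
print(round(expected_tokens_per_step(0.7, 4), 2))  # → 2.77
```

The real speed-up is smaller than this ratio because the draft model's forward passes aren't free, but it shows why acceptance rate dominates once the draft model is cheap.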

by blackmanta

4/13/2026 at 2:11:41 AM

1 token ahead or 2?

It's interesting - imo we'll soon have draft models specifically post-trained for denser, more complicated models. I wouldn't be surprised if diffusion models made a comeback for this - they can draft many tokens at once, and learning curves seem to top out at 90+% match for auto-regressive ones, so it's quite interesting.

by foobar10000

4/13/2026 at 10:10:25 AM

flow matching is making some strides right now, too

by electroglyph

4/13/2026 at 1:46:30 AM

This is genuinely very helpful. I'm planning a MacBook Pro purchase with local inference in mind, and now I see I'll have to aim for a slightly higher memory option, because the Gemma 4 26B MoE is not all that!

by ehtbanton

4/13/2026 at 8:32:08 AM

I upgraded my M4 Pro 24GB to an M5 Pro 48GB yesterday. The same Gemma 4 MoE model (4-bit, don't remember which version) runs about 8x faster on the M5 Pro and loads 2x faster into memory.

So yes, do purchase that new MacBook Pro.

by egorfine

4/13/2026 at 8:12:00 AM

Pretty sure an Nvidia GPU is better bang for the buck because of usable inference speed.

by tomr75

4/13/2026 at 1:02:14 AM

Amazing. Thanks for your detailed posts on the bake-off between the Mac and GB10, Daniel, and on your learnings. I had trying something similar on both compute platforms on my to-do list. Your post should save me a lot of debugging, sweat, and tears.

by anactofgod

4/13/2026 at 9:27:19 AM

Gemma 4 is a heavily censored model, so much so that it refused to answer medical and health-related questions, even basic ones. No one should be using it, and if this is the best that Google can do, it should stop now. Other models do not have such ridiculous self-imposed problems.

by OutOfHere

4/13/2026 at 9:57:28 AM

I suspect a possible future of local models is extreme specialisation - you load a Python-expert model for Python coding, do your shopping with a model focused just on this task, have a model specialised in speech-to-text plus automation to run your smart home, and so on. This makes sense: running a huge model for a task that only uses a small fraction of its ability is wasteful, and home hardware especially isn't suited to this wastefulness. I'd rather have multiple models with a deep narrow ability in particular areas, than a general wide shallow uncertain ability.

Anyway, is it possible that this may be what lies behind Gemma 4's "censoring"? As in, Google took a deliberate choice to focus its training on certain domains, and incorporated the censor to prevent it answering about topics it hasn't been trained on?

Or maybe they're just being sensibly cautious: asking even the top models for critical health advice is risky; asking a 32B model is probably orders of magnitude more so.

by mft_

4/13/2026 at 10:28:42 AM

> is it possible that this may be what lies behind Gemma 4's "censoring"

That isn't the explanation. The censorship here is explicitly imposed by Google. Your explanation would make sense if various other rare domains were also censored, but they aren't, so it doesn't.

> asking even the top models for critical health advice is risky

Not asking, and living in ignorance, is riskier. For high-stakes questions, of course I'd want references that only an online model like ChatGPT or Gemini, etc. would be able to find.

If I am asking a local model for health advice, odds are that it is because I am traveling and am temporarily offline, or am preparing off-grid infrastructure. In both cases I definitely require a best-effort answer. I also require the model to be able to tell when it doesn't know the answer.

Ignore health advice for a moment, and switch to electrical advice. Imagine I am putting together electrical infrastructure, and the model gives me bad advice, risking electrocution and/or a serious fire. Why is this not censored, and what makes it not high-stakes? The logic is the same.

For the record, various open-source Asian models do not have any such problem, so I would rather use them.

by OutOfHere

4/13/2026 at 9:29:30 AM

I don't quite get why you feel so strongly that this should be a deal-breaker for everyone. It's really much better than a wrong answer, for everyone.

by tgv

4/13/2026 at 10:25:03 AM

> It's really much better than a wrong answer

That is a bad premise and a false dichotomy, because most medical questions are simple, with well-known standard answers. ChatGPT and Gemini answer such questions correctly, also finding glaring omissions by doctors, even without having to look up information.

As for the medical questions that are not simple, the ones that require looking up information, the model should in principle be able to respond that it does not know the answer when that is truthfully the case, implying that the answer, or a simple extrapolation thereof, was not in its training data.

by OutOfHere

4/13/2026 at 1:11:58 AM

I've been VERY impressed with Gemma 4 (26B at the moment). It's the first time I've been able to use Opencode via a llama.cpp server reliably and actually get shit done.

In fact, I started using it as a coding partner while learning how to use the Godot game engine (and some custom 'skills' I pulled together from the official docs). I purposely avoided Claude and friends entirely, and just used Gemma4 locally this week... and it's really helped me figure out not just coding issues I was encountering, but also helped me sift through the documentation quite readily. I never felt like I needed to give in and use Claude.

Very, very pleased.

by fortyseven

4/13/2026 at 1:50:07 AM

Nothing about omlx?

by brcmthrowaway