1/19/2026 at 4:25:35 PM
Great, I've been experimenting with OpenCode and running local 30B-A3B models on llama.cpp (4-bit) on a 32 GB GPU, so there's plenty of VRAM left for 128k context. So far Qwen3-coder gives me the best results. Nemotron 3 Nano is supposed to benchmark better, but it doesn't really show for the kind of work I throw at it, mostly "write tests for this and that method which are not covered yet". Will give this a try once someone has quantized it in ~4-bit GGUF.
Codex is notably higher quality but also has me waiting forever. Hopefully these small models get better and better, not just at benchmarks.
by dajonker
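A rough back-of-the-envelope check on that fit, assuming a typical 4-bit GGUF lands around 4.5 bits/weight effective:
    30e9 params x ~0.56 bytes/param ≈ 17 GB of weights
    32 GB - 17 GB ≈ 15 GB left for KV cache and runtime buffers
Whether a full 128k of context actually fits in the remainder depends on the model's attention layout, so treat this as an estimate rather than a guarantee.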
1/20/2026 at 10:26:50 AM
Update: I'm experiencing issues with OpenCode and this model. I have built the latest llama.cpp and followed the Unsloth guide, but it's not usable at the moment because:
- Tool calling doesn't work properly with OpenCode
- It repeats itself very quickly. This is addressed in the Unsloth guide and can be "fixed" by setting --dry-multiplier to 1.1 or higher
- It makes a lot of spelling errors, such as replacing class/file name characters with "1". Or, when I asked it to check AGENTS.md, it tried to open AGANTS.md
I tried both the Q4_K_XL and Q5_K_XL quantizations and they both suffer from these issues.
by dajonker
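For anyone following along, a minimal untested sketch of the kind of llama.cpp invocation described above; the GGUF filename is a placeholder, and the context size and GPU layer count are up to you:
    # DRY anti-repetition sampler per the Unsloth guide; --dry-multiplier
    # defaults to 0 (disabled), so dropping the flag turns it off again
    llama-server -m GLM-4.7-Flash-Q4_K_XL.gguf -c 32768 -ngl 99 --dry-multiplier 1.1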
1/22/2026 at 10:23:21 PM
There is a new update on HF:
> Jan 21 update: llama.cpp fixed a bug that caused looping and poor outputs. We updated the GGUFs - please re-download the model for much better outputs.
by eblanshey
1/26/2026 at 1:21:30 PM
Yes! This update works great. Seems to be pretty good at first glance. I'll have to set up an interesting task and see how different models approach the problem.
by dajonker
1/23/2026 at 1:13:59 AM
After re-downloading the model, do not use --dry-multiplier... and also, don't ask me how I know...
by philippelh
1/19/2026 at 4:29:32 PM
This user has also done a bunch of good quants: https://huggingface.co/unsloth/GLM-4.7-GGUF
by latchkey
1/19/2026 at 5:14:23 PM
I find it hard to trust post-training quantizations. Why don't they run benchmarks to see the degradation in performance? It sketches me out, because it should be the easiest thing to automatically run a suite of benchmarks.
by WanderPanda
1/19/2026 at 6:00:58 PM
Unsloth doesn't seem to do this for every new model, but they did publish a report on their quant methods and the performance loss it causes: https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs
It isn't much until you get down to very small quants.
by Miraste
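For what it's worth, llama.cpp ships a perplexity tool that makes a rough self-check of quant degradation easy. A sketch, assuming you have a higher-precision GGUF to compare against and a plain-text eval file such as wikitext (filenames here are placeholders):
    # lower perplexity = less degradation; compare the quant against a baseline
    ./build/bin/llama-perplexity -m model-Q4_K_XL.gguf -f wiki.test.raw
    ./build/bin/llama-perplexity -m model-Q8_0.gguf -f wiki.test.raw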
1/19/2026 at 4:43:51 PM
Yes, I usually run Unsloth models, however you are linking to the big model now (355B-A32B), which I can't run on my consumer hardware.
The flash model in this thread is more than 10x smaller (30B).
by dajonker
1/19/2026 at 5:25:47 PM
When the Unsloth quant of the flash model does appear, it should show up as unsloth/... on this page: https://huggingface.co/models?other=base_model:quantized:zai...
Probably as:
by a_e_k
1/19/2026 at 5:33:00 PM
It's a new architecture, not yet implemented in llama.cpp. Issue to follow: https://github.com/ggml-org/llama.cpp/issues/18931
by homarp
1/19/2026 at 5:33:16 PM
One thing to consider is that this version is a new architecture, so it'll take time for llama.cpp to get updated. Similar to how it was with Qwen Next.
by dumbmrblah
1/19/2026 at 7:55:00 PM
Apparently it is the same as the DeepSeek-V3 architecture and already supported by llama.cpp once the new name is added. Here's the PR: https://github.com/ggml-org/llama.cpp/pull/18936
by cristoperb
1/20/2026 at 3:17:41 AM
It has been merged.
by khimaros
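So picking up the new architecture should just be a matter of pulling and rebuilding llama.cpp, e.g. with its standard CMake build (the CUDA flag is optional, for GPU offload):
    git pull
    cmake -B build -DGGML_CUDA=ON
    cmake --build build --config Release -j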
1/19/2026 at 4:47:16 PM
There are a bunch of 4-bit quants in the GGUF link, and 0xSero has some smaller stuff too. Might still be too big and you'll need to un-GPU-poor yourself.
by latchkey
1/19/2026 at 4:51:15 PM
Yeah, there is no way to run 4.7 on 32 GB of VRAM. This flash model is something that I'm also waiting to try later tonight.
by disiplus
1/19/2026 at 6:12:08 PM
Why not? Run it with the latest vLLM and enable 4-bit quantization with bnb (bitsandbytes); it will quantize the original safetensors on the fly and fit your VRAM.
by omneity
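A minimal sketch of that setup. The repo id and context length are placeholders based on this thread, and on older vLLM versions you also need --load-format bitsandbytes:
    # repo id is a guess; adjust to the actual HF id of the flash model
    vllm serve zai-org/GLM-4.7-Flash --quantization bitsandbytes --max-model-len 40960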
1/19/2026 at 7:37:58 PM
Because of how huge GLM 4.7 is: https://huggingface.co/zai-org/GLM-4.7
by disiplus
1/19/2026 at 7:54:27 PM
Except this is GLM 4.7 Flash, which has 32B total params, 3B active. It should fit with a decent context window of 40k or so in 20 GB of VRAM at 4-bit weight quantization, and you can save even more by quantizing the activations and KV cache to 8-bit.
by omneity
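Roughly: 32e9 weights x ~0.5 bytes/weight ≈ 16 GB, plus KV cache, which is how you land near the 20 GB figure. For the KV cache part, vLLM exposes a dtype flag; a hedged sketch extending the serve command above (fp8 KV cache support depends on your GPU and vLLM version, and the repo id is still a placeholder):
    vllm serve zai-org/GLM-4.7-Flash --quantization bitsandbytes --kv-cache-dtype fp8 --max-model-len 40960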
1/19/2026 at 8:12:16 PM
Yes, but the parent link was to the big GLM 4.7, which had a bunch of GGUFs; the new one at the point of posting did not, nor does it now. I'm waiting for the Unsloth guys for the 4.7 Flash.
by disiplus
1/19/2026 at 5:04:37 PM
> Codex is notably higher quality but also has me waiting forever.
And while it usually leads to higher quality output, sometimes it doesn't, and I'm left with BS AI slop that would have taken Opus just a couple of minutes to generate anyway.
by behnamoh