4/3/2026 at 2:44:37 PM
If this is your first time using open-weight models right after release, know that there are always bugs in the early implementations, and even in the quantizations. Every project races to have support on launch day so they don’t lose users, but the output you get may not be correct. There are already several problems being discovered in tokenizer implementations, and quantizations may have problems too if they use imatrix.
So in the coming weeks you’re going to see a lot of “I tried it but it sucks because it can’t even do tool calls” and other reports that the models don’t work at all, from people who don’t realize they were using broken implementations.
If you want to try cutting-edge open models, you need to be ready to constantly update your inference engine, check your quantizations for updates, and re-download when they change. The mad rush for launch-day support means everything gets shipped as soon as it looks like it can produce output tokens, not when it’s tested to be correct.
by Aurornis
4/3/2026 at 2:49:24 PM
You seem like you know what you're talking about... what inference engine should I use? (Linux, 4090) I keep having "I tried it but it sucks" issues, mostly around tool calling, and it's not clear if it's the model or Ollama. And it's not one model in particular; any of them, really.
by colechristensen
4/3/2026 at 3:52:34 PM
For the specific issue the parent is talking about, you really need to give the various tools a try yourself, and if you're getting really shit results, assume it's the implementation that is wrong, and either find an existing bug-tracker issue or create a new one. Same thing happened when GPT-OSS launched: a bunch of projects had "day-1" support, but in reality that just meant you could load the model. A bunch of them had broken tool calling, some chat prompt templates were broken, and so on. Even llama.cpp, which usually has the most recent support (in my experience), had this issue, and it wasn't until a week or two after launch that GPT-OSS could be fairly evaluated with it. Then Ollama/LM Studio updated their llama.cpp some days after that.
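One way to tell a model bug from an implementation bug is to hit the backend's OpenAI-compatible endpoint directly with a minimal tool-call request and inspect the raw reply, bypassing any frontend. A sketch, assuming a llama-server-style endpoint on localhost:8080; the `get_weather` tool and the port are placeholders, not from this thread:

```shell
# Build a minimal chat request that advertises one tool.
cat > /tmp/toolcall.json <<'EOF'
{
  "model": "local",
  "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }]
}
EOF

# Sanity-check the payload before sending it.
python3 -m json.tool < /tmp/toolcall.json > /dev/null && echo "payload ok"

# Then send it to a running server and look for a structured tool_calls
# field in the reply (a tool call dumped as plain text in "content" is a
# classic sign of a broken chat template or parser):
# curl -s http://localhost:8080/v1/chat/completions \
#   -H "Content-Type: application/json" -d @/tmp/toolcall.json
```

If the raw reply is wrong here, the frontend on top of it never had a chance.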
So it's a process thing, not "this software is better than that", and it heavily depends on the model.
by embedding-shape
4/3/2026 at 4:55:23 PM
After spending the past few weeks playing with different backends and models, I just can’t believe how buggy most models are. It seems to me that most model providers are not running/testing via the most used backends (e.g. llama.cpp, Ollama), because if they were, they would see how broken their release is.
Tool calling is the Achilles’ heel: most will fail unless you either modify the system prompts or run via proxies so you can inject/munge the request/reply.
Like seriously… how many billions and billions (actually we saw one >$800 billion valuation last week, so almost a whole trillion) go into AI development, and yet 99.999% of all models from the big names do not work straight out of the box with the most common backends. Blows my mind!
by alfiedotwtf
4/3/2026 at 9:23:25 PM
> It seems to me that most model providers are not running/testing via the most used backends (e.g. llama.cpp, Ollama), because if they were, they would see how broken their release is.

The models usually run fine on the server-targeted backends they’re released for.
Those projects you cited are more niche. They each implement their own ways of doing things.
It’s not the responsibility of model providers to implement and debug every different backend out there before they release their model. They release the model and usually a reference way of running it.
The individual projects that do things differently are responsible for making their projects work properly.
Don’t blame the open weight model teams when unrelated projects have bugs!
by Aurornis
4/3/2026 at 5:25:43 PM
Just because I'm curious, what exact models and quantizations are you using? In my own experience, anything smaller than ~32B is basically useless, and any quantization below Q8 absolutely trashes the models. Sure, for single use-cases you could make use of a ~20B model if you fine-tune and have a very narrow use case, but at that point there are usually better solutions than LLMs in the first place. For something general, 32B+ at Q8 is probably the bare minimum for local models, even the "SOTA" ones available today.
by embedding-shape
4/4/2026 at 5:39:29 AM
I haven’t tried any Qwen yet, but so far I’m sticking with gpt-oss-20b. In terms of what I’m using, I’ve looked at anything that will fit on a MacBook Pro with 32 GB RAM (so with shared memory): LFM2, Llama, Mistral, Ministral, Devstral, Phi, and Nemotron.
As for quantisation, I aim for the biggest that will fit while also not being too slow, so it all depends on the model. But I’ll skip a model if I can’t at least use Q4_K_M.
Also, I bump my context to at least 32K, because tool calling sucks when the tool definitions themselves come close to filling a 4096-token context!
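With llama.cpp's llama-server, the context bump is a single flag. A sketch, assuming a local GGUF; the model path and filename are hypothetical:

```shell
# -c / --ctx-size sets the context window in tokens (default is much smaller).
# --jinja makes llama-server use the model's own chat template, which matters
# for tool calling. Model path below is a placeholder.
llama-server -m ~/models/gpt-oss-20b-Q4_K_M.gguf -c 32768 --jinja
```

Note that KV cache memory grows with context size, so on a 32 GB shared-memory machine the context budget trades off directly against model size.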
I can’t wait for RAM prices to come down!
by alfiedotwtf
4/3/2026 at 4:23:01 PM
I've had really good success with LM Studio and GLM 4.7 Flash and the Zed editor, which has a baked-in integration with LM Studio. I am able to one-shot whole projects this way, and it seems to be constantly improving. A recent update even allowed the agent to ask me if it can do a "research" phase, so it'll actually reach out to websites and read docs and code from GitHub if you allow it. GLM 4.7 Flash has been the most adept at tool calling I've found, but the Qwen 3 and 3.5 models are also fairly good, though they run into more snags than I've seen with GLM 4.7 Flash.
by kamranjon
4/3/2026 at 2:55:37 PM
I don’t know if any of the engines are fully tested yet. For new LLMs I’m in the habit of building llama.cpp from upstream HEAD and checking for updated quantizations right before I start using a model. You can also download llama.cpp CI builds from their release page, but on Linux it’s easy to set up a local build.
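That loop looks something like the following; the CUDA flag is for an NVIDIA card like the 4090 mentioned upthread, and flags are worth double-checking against the repo's build docs since they change over time:

```shell
# First time only:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# Before trying a newly released model, pull and rebuild:
git pull
cmake -B build -DGGML_CUDA=ON        # CUDA build for an NVIDIA GPU
cmake --build build --config Release -j

# Confirm you're actually running the fresh build:
./build/bin/llama-server --version
```

Re-downloading the quantization matters too: quant authors often re-upload fixed GGUFs after launch-week tokenizer or template fixes, and an old file silently keeps the old bugs.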
If you don’t want to be a guinea pig for untested work, the safe option is to wait 2-3 weeks.
by Aurornis
4/3/2026 at 10:04:45 PM
For me, LM Studio on Fedora + Gemma 4 didn't work yesterday afternoon with the release, but worked this morning after the runtimes updated. In fact, there are new runtime updates now as I check again.
by accrual
4/3/2026 at 2:52:20 PM
Just use OpenRouter or Google AI Studio for the first week till bugs are ironed out. You still learn the nuances of the model, and then you can switch to local. In addition, you might pick up enough nuance to see if quantization is having any effect.
by vardalab