4/13/2026 at 4:10:44 AM
Gemma 4, in my view, is good enough to do things similar to Gemini 2.5 Flash, meaning if I point it at code and ask for help, and there is a problem with the code, it'll answer correctly in terms of suggestions. But it's not great at using all tools, or at one-shotting things that require a lot of context or "expert knowledge". If a couple more iterations of this, say Gemma 6, is as good as the current Opus and runs completely locally on a Mac, I won't really bother with the cloud models.
That’s a problem.
For the others anyway.
by amazingamazing
4/13/2026 at 10:37:35 AM
> if I point it code and ask for help and there is a problem with the code it'll answer correctly in terms of suggestions

Could I ask how you do that? I installed openclaw and set it to use Gemma 4, but it didn't act in agent mode at all: it only responded in the chat window while doing nothing, and didn't read any files or do anything like what you describe (though I see you do mention that it's not great at using all tools). What are you using, exactly?
by logicallee
4/13/2026 at 6:01:27 AM
> it's not great at using all tools

Glad it wasn't just me. I was impressed with the quality of Gemma 4; it just couldn't write the changes to file 9/10 times when using it with opencode.
by blitzar
4/13/2026 at 6:06:35 AM
https://huggingface.co/google/gemma-4-31B-it/commit/e51e7dcd...

There was an update to tool calling 3 days ago. I haven't tested it myself, but hope it helps.
by seaal
4/13/2026 at 6:57:53 AM
Hmm... is there an updated ONNX?
by sroussey
4/13/2026 at 7:54:58 AM
> it just couldnt write the changes to file 9/10 times when using it with opencode

You might want to give this a try; it dramatically improves Edit tool accuracy without changing the model: https://blog.can.ac/2026/02/12/the-harness-problem/
by erichocean
4/13/2026 at 5:35:30 AM
similar vibes as "640k ought to be enough for anybody"
by swazzy
4/13/2026 at 9:36:07 AM
I think the difference is that with LLMs, in a lot of cases you do see diminishing returns.

I won't deny that the latest Claude models are fantastic at one-shotting loads of problems. But we have an internal proxy to a load of models running on Vertex AI, and I accidentally started using Opus/Sonnet 4 instead of 4.6. I genuinely didn't know until I checked my configuration.
AI models will get to the point where, for 99% of problems, something like Gemma is gonna work great for people. Pair it up with an agentic harness on the device that lets it open apps and click buttons, and we're done.
I still can't fathom that we're in 2026 in the AI boom and I still can't ask Gemini to turn shuffle mode on in Spotify. I don't think model intelligence is as much of an issue as people think it is.
by Philip-J-Fry
4/13/2026 at 9:54:31 AM
I mean, to me even the difference between Opus and Sonnet is as clear as night and day, and likewise between Opus and the best GPT model. Opus 4.6 just seems much more reliable in terms of me asking it to do something and that actually happening.
by mewpmewp2
4/13/2026 at 9:59:40 AM
It depends what you're asking it, though. Sure, in a software development environment the difference between those two models is noticeable.

But think about the general user. They're using the free Gemini or ChatGPT. They're not using the latest and greatest. And they're happy using it.
And I am willing to bet that a lot of paying users would be served perfectly fine by the free models.
If a capable model is able to live on device and solve 99% of people's problems, then why would the average person ever need to pay for ChatGPT or Gemini?
by Philip-J-Fry
4/13/2026 at 6:02:17 AM
Well, you can do a lot with 640k… if you try. We have 16GB in base machines and very few people know how to try anymore.

The world has moved on; that code-golf time is now spent on ad algorithms or whatever.
Escaping the constraint delivered a different future than anticipated.
by shermantanktop
4/13/2026 at 8:22:03 AM
> you can do a lot with 640k…if you try.

It is economically not viable to try anymore.
"XYZ Corp" won't allow their developers to write their desktop app in Rust just so it consumes only 16MB of RAM, then write another implementation for mobile in Swift and/or Kotlin, when they can release a good-enough solution with React + Electron consuming 4GB of RAM and reuse components with React Native.
by throwaw12
4/13/2026 at 8:45:16 AM
People get hung up on bad optimization. If you are working at sufficiently large scale, yes, thinking about bytes might be a good use of your time.

But most likely, it's not. At a system level we don't want people to do that. It's a waste of resources. Making a virtue out of it is bad, unless you care more about bytes than humans.
by jstummbillig
4/13/2026 at 9:27:15 AM
These bytes are human lives. The bytes and the CPU cycles translate to software that takes longer to run, that is more frustrating, that makes people accomplish less in more time than they could, or should. Take too much, and you prevent them from using other software in parallel, compounding the problem. Or you force them to upgrade hardware early, taking away money they could better spend in other areas of their lives. All this scales with the number of users, so for most software with any user base, not caring about bytes and cycles wastes far more people-hours than it saves in dev time.
by TeMPOraL
4/13/2026 at 9:27:18 AM
The simple fact is that a 16 GB RAM stick costs much less than the development time to make the app run on less.
by stavros
4/13/2026 at 8:56:39 AM
Especially if the 640k are "in your hand" and the rest is "in the cloud".
by raverbashing
4/13/2026 at 8:50:12 AM
Look at the whole history of computing. How many times has the pendulum swung from thin to fat clients and back?

I don't think it's even mildly controversial to say that there will be an inflection point where local models get Good Enough and this iteration of the pendulum shall swing to fat clients again.
by pdpi
4/13/2026 at 7:29:05 AM
Assuming improvements in LLMs follow a sigmoid curve, even if the cloud models are always slightly ahead in raw performance, it won't make much of a difference to most people, most of the time.

The local models have their own advantages (privacy, no as-a-service model) that, for many people and orgs, will offset a small performance advantage. And, of course, you can always fall back on the cloud models should you hit something particularly chewy.
(All IMO - we're all just guessing. For example, good marketing or an as-yet-undiscovered network effect of cloud LLMs might distort this landscape).
by flir
4/13/2026 at 4:19:19 AM
Yep, and to be honest we don't really need local models for intensive tasks, at least not yet. You can use OpenRouter (and others) to consume a wide variety of open models which are capable of using tools in an agentic workflow, close to the SOTA models. These open models are essentially commodities: many providers, each serving the same model and competing with each other on uptime, throughput, and price. At some point we will be able to run them on commodity hardware, but for now the fact that we have competition between providers is enough to ensure that rug pulls aren't possible.

Plus, having Gemma on my device for general chat ensures I will always have a privacy-respecting offline oracle which fulfils all of the non-programming tasks I could ever want. We are already at the point where the moat for these hyperscalers has basically dissolved for the general public's use case.
If I was OpenAI or Anthropic I would be shitting my pants right now and trying every unethical dark pattern in the book to lock in my customers. And they are trying hard. It won't work. And I won't shed a single tear for them.
by slopinthebag
4/13/2026 at 4:32:15 AM
Local models seem somewhere between 9 and 24 months behind. I'm not saying I won't be impressed with what online models will be able to do in two years, but I'm pretty satisfied with the prediction that I won't really need them in a couple of years.
by colechristensen
4/13/2026 at 5:01:15 AM
We still aren't going to be putting 200GB of RAM in a phone in a couple of years to run those local models.
by Gigachad
4/13/2026 at 10:43:02 AM
That amount of RAM won't be necessary. Gemma 4 and comparably sized Qwen 3.5 models are already better than the very best, biggest frontier models were just 12-18 months ago. Now in an 18-36GB footprint, depending on quantization.
by anon373839
4/13/2026 at 5:17:17 AM
A lot of people are making the mistake of noticing that local models have been 12-24 months behind SotA ones for a good portion of the last couple of years, and then drawing a dotted line assuming that continues to hold.

It simply... doesn't. The SotA models are enormous now, and there's no free lunch on compression/quantization here.
Opus 4.6 capabilities are not coming to your (even 64-128gb) laptop or phone in the popular architecture that current LLMs use.
Now, that doesn't mean that a much narrower-scoped model with very impressive results can't be delivered. But that narrower model won't have the same breadth of knowledge, and TBD if it's possible to get the quality/outcomes seen with these models without that broad "world" knowledge.
It also doesn't preclude a new architecture or other breakthrough. I'm simply stating it doesn't happen with the current way of building these.
edit: forgot to mention the notion of ASIC-style models on a chip. I haven't been following this closely, but last I saw the power requirements are too steep for a mobile device.
by mh-
4/13/2026 at 10:33:01 AM
Would the model even need that breadth of knowledge? Humans just look things up in books or on Wikipedia, which you can store on a plain old HDD, not VRAM. All books ever written fit into about 60TB if you OCR them, and the useful information in them probably into a lot less; that's well within the range of consumer technology.
by grumbel
4/13/2026 at 5:46:48 AM
Don't underestimate the march of technology. Just look at your phone: it has more FLOPS than there were in the entire world 40 years ago.
by am17an
4/13/2026 at 5:58:01 AM
And I think it's very likely that with improved methods you could get Opus 4.6-level performance on a wristwatch in a few years. You needed a supercomputer to win at chess, until you didn't.

Current local models' performance in natural language is much better than any algorithm running on a supercomputer cluster just a few years ago.
by kuboble
4/13/2026 at 8:09:27 AM
Yeah, but that's the current state of the art after decades of aggressive optimization; there's no foreseeable future where we'll ever be able to cram several orders of magnitude more RAM into a phone.
by root_axis
4/13/2026 at 9:32:26 AM
We already cram several orders of magnitude more flash storage into a phone than RAM (e.g. my phone has 16 GB RAM but 1 TB storage). Even now, with some smart coding, if you don't need all that data at the same time for random access at sub-millisecond speed, it's hard to tell the difference.
by TeMPOraL
4/13/2026 at 9:38:59 AM
But it doesn't have that many more FLOPS than it did a couple of years ago.
by vrighter
4/13/2026 at 7:59:51 AM
Pretty sure there's at least a couple of orders of magnitude in purely algorithmic areas of LLM inference; maybe training, too, though I'm less confident here. Rationale: meat computers run on 20W, though pretraining took a billion years or so.
by baq
4/13/2026 at 5:54:22 AM
There's been plenty of free lunch in shrinking models thus far with regard to capability vs. parameter count. Contradicting that trend takes more than "It simply... doesn't."

There's plenty of room for RAM sizes to double, along with bus speed. RAM capacity idled for a long time as a result of limited need for more.
by colechristensen
4/13/2026 at 6:07:21 AM
We don't need 200GB of RAM on a phone to run big models, just 200GB of storage, thanks to Apple's "LLM in a flash" research.
by jurmous
4/13/2026 at 7:47:58 AM
Yes, I agree that this is the right solution. For a locally hosted model I value the quality of the output more than the speed with which it is produced, so I prefer the models as they were originally trained, without further quantization.

While that paper praises the Apple advantage in SSD speed, which allows decent inference performance with huge models, nowadays SSD speeds equal to or greater than that can be achieved in any desktop PC that has dual PCIe 5.0 SSDs, or even one PCIe 5.0 and one PCIe 4.0 SSD.

Because I had also independently reached this conclusion, like I presume many others, I started work a week ago on modifying llama.cpp to use weights stored on SSDs in an optimal manner, while also batching many tasks so that they share each pass through the SSDs. I assume that in the following months we will see more projects in this direction, so local hosting of very large models will become easier and more widespread, allowing us to avoid the high risks associated with external providers, like the recent enshittification of Claude Code.
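A toy sketch of that access pattern (all sizes, the file layout, and the "computation" are invented for illustration; a real llama.cpp-style change would deal with quantized tensors, prefetch, and caching): stream each layer of a memory-mapped weight file once, and push every batched request through it before touching the next layer, so all requests share a single pass over the SSD.

```python
import mmap
import os
import tempfile

# Assumed toy layout: the weight file is N_LAYERS fixed-size blocks of bytes.
LAYER_BYTES = 4096
N_LAYERS = 8

def write_dummy_weights(path):
    """Create a toy weight file so the sketch is runnable end to end."""
    with open(path, "wb") as f:
        f.write(os.urandom(LAYER_BYTES * N_LAYERS))

def forward_batched(path, request_ids):
    """Stream layers via mmap; each layer read is shared by the whole batch."""
    states = {rid: 0 for rid in request_ids}
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for layer in range(N_LAYERS):
            # One sequential read of this layer's block from the SSD...
            block = mm[layer * LAYER_BYTES:(layer + 1) * LAYER_BYTES]
            # ...then *all* requests consume it before the next layer is read.
            for rid in request_ids:
                states[rid] = (states[rid] + sum(block)) % (2**32)
    return states

path = os.path.join(tempfile.gettempdir(), "toy_weights.bin")
write_dummy_weights(path)
print(forward_batched(path, ["req-a", "req-b"]))
```

The point of the batching: the expensive part (reading weights off the SSD) happens once per layer regardless of batch size, which is exactly why sharing each pass amortizes the cost.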
by adrian_b
4/13/2026 at 9:30:39 AM
But that difference, atm, is the difference between it being OK on its own with a team of subagents (given good enough feedback/review mechanisms) and having to babysit it prompt by prompt.

By the time Gemma 6 allows you to do the above, the proprietary models will supposedly already be on the next step change. It just depends on whether you need to ride the bleeding edge; but especially because it's "intelligence", there's an obvious advantage in using the best version, and it's easy to hype it up and generate FOMO.
by vasco
4/13/2026 at 9:36:45 AM
> But that difference atm is the difference between it being OK on its own with a team of subagents given good enough feedback

Do people actually build meaningful things like that?
It's basically impossible to leave any AI agent unsupervised, even with an amazing harness (which is incredibly hard to build). The code slowly rots and drifts over time if not fully reviewed and refactored constantly.
Even if teams of agents working almost fully autonomously were reliable from a functional perspective (they would build a functional product), the end product would have ever increasing chaos structurally over time.
I'd be happy to be proven wrong.
by oblio
4/13/2026 at 7:50:33 AM
When that happens, you'll have FOMO from not using Opus 5.x. The numbers they showed for Mythos show that the frontier is still moving steadily (and maybe even at a faster pace than before).
by gorgmah
4/13/2026 at 6:18:58 AM
There is a cognitive ceiling for what you can do with smaller models. Animals with simpler neural pathways often outperform what we think they are capable of, but there's no substitute for scale. I don't think you'll ever get a 4B or 8B model equivalent to Opus 4.6. Maybe just for coding tasks, but certainly not Opus' breadth.
by blcknight
4/13/2026 at 7:00:26 AM
The only thing we are sure can't be highly compressed is knowledge, because you can only fit so much information in a given entropy budget without losing fidelity.

The minimal size limits of reasoning ability are not clear at all. It could be that you don't need all that many parameters, in which case the door is open for small, focused models to converge to parity with larger models in reasoning ability.
If that happens we may end up with people using small local models most of the time, and only calling out to large models when they actually need the extra knowledge.
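That split could be sketched as a simple router, with both models as stubs and a deliberately crude keyword heuristic standing in for a real "do I need external knowledge?" signal (everything here, names and trigger words included, is invented for illustration):

```python
# Stub models standing in for a small on-device LLM and a large hosted one.
def local_model(prompt):
    return f"[local] reasoned answer to: {prompt}"

def remote_model(prompt):
    return f"[remote] knowledge-backed answer to: {prompt}"

# Crude stand-in for a real escalation signal, e.g. a confidence score
# or an explicit "lookup" token emitted by the small model itself.
KNOWLEDGE_TRIGGERS = ("who", "when", "which version", "cite", "spec")

def route(prompt):
    """Send knowledge-lookup prompts to the big model; keep the rest local."""
    needs_knowledge = any(t in prompt.lower() for t in KNOWLEDGE_TRIGGERS)
    return remote_model(prompt) if needs_knowledge else local_model(prompt)

print(route("simplify this boolean expression for me"))
print(route("when was TLS 1.3 standardized?"))
```

A real escalation signal would come from the model rather than a keyword list, but the shape of the system (local by default, remote only for knowledge) is the same.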
by zarzavat
4/13/2026 at 7:36:51 AM
> and only calling out to large models when they actually need the extra knowledge

When would you want a lossy encoding of lots of data bundled together with your reasoning? If it is true that reasoning can be done efficiently with fewer parameters, it seems like you would always want it operating normal data search and retrieval tools to access knowledge, rather than risk hallucination.
And re: this discussion of large data centers versus local models, do recall that we already know it's possible to make a pretty darn clever reasoning model that's small and portable and made out of meat.
by idle_zealot
4/13/2026 at 9:51:26 AM
> we already know it's possible to make a pretty darn clever reasoning model

There is a problem though: we know that it is possible, but we don't know how (at least not yet, as far as I am aware). So we know the answer to the "what?" question, but we don't know the answer to the "how?" question.
by dryarzeg
4/13/2026 at 8:05:18 AM
I would call brains with the needed support infrastructure small.
by adrianN
4/13/2026 at 8:19:18 AM
I think you underestimate the amount of knowledge needed to deal with the complexities of language in general as opposed to specific applications. We had algorithms to do complex mathematical reasoning before we had LLMs, the drawback being that they require input in restricted formal languages. Removing that restriction is what LLMs brought to the table.

Once the difficult problem of figuring out what the input is supposed to mean was somewhat solved, bolting on reasoning was easy in comparison. It basically fell out with just a bit of prompting: "let's think step by step."
If you want to remove that knowledge to shrink the model, we're back to contorting our input into a restricted language to get the output we want, i.e. programming.
by yorwba
4/13/2026 at 6:43:49 AM
I think you are underestimating the strength a small model can get from tool use. There may be no substitute for scale, but that scale can live outside of the model and be queried using tools.

In the worst case a smaller model could use a tool that involves a bigger model to do something.
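That worst case can be sketched with the bigger model as just another entry in the tool registry (the dispatch format and both model stubs are invented for illustration):

```python
# Stub for a large hosted model the small model can delegate to.
def big_model(prompt):
    return f"[big-model answer to: {prompt}]"

TOOLS = {
    "web_search": lambda q: f"[search results for: {q}]",
    "ask_big_model": big_model,  # escalation is just another tool
}

def small_model(user_msg):
    """Stub small model: answer directly, or emit a structured tool call."""
    if "obscure" in user_msg:
        return {"tool": "ask_big_model", "arg": user_msg}
    return {"answer": f"[small-model answer to: {user_msg}]"}

def run(user_msg):
    """One dispatch step: execute the tool call if the model emitted one."""
    step = small_model(user_msg)
    if "tool" in step:
        return TOOLS[step["tool"]](step["arg"])
    return step["answer"]

print(run("what is 2 + 2?"))
print(run("an obscure historical fact, please"))
```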
by charcircuit
4/13/2026 at 6:56:26 AM
Small models are bad at tool use. I have LiquidAI doing it in the browser, but it's super fragile.
by sroussey
4/13/2026 at 8:20:33 AM
Except you don't want knowledge in the model, and most of that "size" comes from "encoded knowledge", i.e. overfitting. The goal should be to have only language handling in the model, and the knowledge in a database you can actually update, analyze, etc. It's just really hard to do so.

"World models" (for cars) maybe make sense for self-driving, but they are also just a crude workaround of using a physics simulation to push understanding of physics. Though in contrast to most topics, basic physics tends not to change randomly and is based on observation of reality, so it probably can work.
Law, health advice, programming, etc., on the other hand, change all the time and are based entirely on what humans wrote about them, which in some areas (e.g. law or health) is very commonly outdated, wrong, or at least incomplete in a dangerous way.
Having this separation of language processing and knowledge sources is ... hard, language is messy and often interleaves with information.
But this is most likely achievable with smaller models; actually, it might even be easier with a small model. (Though whether the necessary knowledge bases can fit and run on a Mac is another topic...)
And this should be the goal of AI companies, as it's the only long term sustainable approach as far as I can tell.
I say "should" because it may not be: if they solve it that way and someone manages to clone their success, they lose all their moat in specialized areas, as people can create knowledge bases for those areas with know-how OpenAI simply doesn't have access to. (Which would be a preferable outcome, as it means actual competition and a potentially fair, working market.)
by dathinab
4/13/2026 at 8:38:39 AM
as a concrete outdated case:TLS cipher X25519MLKEM768 is recommended to be enabled on servers which do support it
last time I checked AI didn't even list it when you asked it for a list of TLS 1.3 ciphers (through it has been widely supported since even before it was fully standardized..)
This isn't surprising, as most input sources the AI can use for training are outdated and also don't list it.

Maybe someone at OpenAI will spot this and feed it explicitly into the next training cycle, or people will cover it more and through that it will be fed in implicitly.

But what about all the niche but important information covered by just a handful of outdated Stack Overflow posts or similar? (Which are unlikely to get updated now that everyone uses AI instead...)
The current "let's just train bigger models with more encoded data" approach just doesn't work. It can get you quite far, but then it hits a ceiling. And trying to fix it by also giving the model additional knowledge "it can ask for if it doesn't know" has so far not worked, because it reliably fails to realize it doesn't know when it has enough outdated/incomplete/wrong information encoded in the model. Only by ensuring it doesn't have any specialized domain knowledge can you make that approach work, IMHO.
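The separation argued for above can be sketched with a stub "language model" that only phrases answers around facts pulled from an updatable store, and refuses when retrieval comes up empty (the store, the naive keyword retrieval, and the answer format are all invented for illustration):

```python
# Toy knowledge store: updating facts means editing data, not retraining.
knowledge_base = {
    "tls13_key_exchange": "X25519MLKEM768 is recommended on servers that support it.",
}

def retrieve(query):
    """Naive keyword match; a real system would use BM25 or embeddings."""
    q = query.lower()
    for key, fact in knowledge_base.items():
        if any(token in q for token in key.split("_")):
            return fact
    return None

def answer(query):
    """Stub language model: phrase the retrieved fact, or refuse outright."""
    fact = retrieve(query)
    if fact is None:
        return "I don't have that in my knowledge base."
    return f"According to the current knowledge base: {fact}"

print(answer("which key exchange should a TLS 1.3 server enable?"))
# The whole point: new knowledge is a data update, not a new training run.
knowledge_base["tls13_key_exchange"] = "X25519MLKEM768 is now enabled by default."
print(answer("which key exchange should a TLS 1.3 server enable?"))
```

Because the model only ever speaks from retrieved facts, staleness is fixed by editing the store, and a missing fact produces a refusal instead of a hallucination.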
by dathinab