4/22/2026 at 2:06:09 PM
I already felt that Gemini 3 proved what is possible if you train a model for efficiency. If I had to guess, the pro and flash variants are 5x to 10x smaller than opus and gpt-5 class models. They produce drastically fewer tokens to solve a problem, but they don't seem to have put enough effort into refining their reasoning and execution, as they produce broken tool calls and generally struggle with 'agentic' tasks. For raw problem solving without tools or search, though, they match opus and gpt while presumably being a fraction of the size.
I feel like google will surprise everyone with a model that will be an entire generation beyond SOTA at some point in time once they go from prototyping to making a model that's not a preview model anymore. All models up till now feel like they're just prototypes that were pushed to GA just so they have something to show to investors and to integrate into their suite as a proof of concept.
by himata4113
4/22/2026 at 6:14:05 PM
> If I had to guess the pro and flash variants are 5x to 10x smaller than opus and gpt-5 class models.
I really doubt it, especially Pro. If anything I wouldn't be surprised if their hardware lets them run bigger models more cheaply and quickly than the others. Pro is probably smaller than GPT 5.4 and Opus 4.6 (looks like 4.7 decreased in size), but 5x seems way too much. IMO Gemini 3 Pro is the most "intelligent" in an all-round human way, especially in the humanities. It's highly knowledgeable and undeniably the number one model at producing natural text in a large number of (human!) languages. The difference becomes especially large for more niche languages. That does not suggest a smaller model; more the opposite. The top 4 models at multilinguality are all Google: 1. 3 Pro, 2. 3 Flash, 3. 2.5 Pro, 4. 2.5 Flash. Even the biggest OpenAI and Anthropic models can't compete in that dimension.
It's definitely weaker at math and much worse at agentic things. Gemini chat as an app is also lightyears behind; it's barely different from ChatGPT at release over 3 years ago. These things make it feel much weaker than it is.
by deaux
4/22/2026 at 7:06:01 PM
Regarding Anthropic, they used to make the best multilingual and generalist models; it's a policy thing on their part, not a capability issue. Claude 3 was best at this, including dead and low-resource languages. Neither modern Claude nor Gemini is remotely close to what Claude 3 was capable of (e.g. zero-shot writing styles). Anthropic basically reversed their "character training" policy and started optimizing their models for code generation at the cost of everything else, starting with Sonnet 3.5. Claude 4 took a huge hit in multilingual ability. GPT, on the other hand, was always terrible at languages, except for the short-lived gpt-4.5-preview.
All modern models including Gemini have bugs in basic language coherency - random language switching, self-correction attempts resulting in hallucinations etc. I speculate it's a problem with heavy RL with rewards and policies not optimized for creative writing.
by orbital-decay
4/23/2026 at 10:56:30 PM
I've never ever had Gemini over the API switch languages in translation tasks, and that's across more than 10 language pairs and 6 figures of calls, across both short and long outputs. Maybe your languages are even lower resource ones, though we do include Central Asian languages. The Chinese models are very prone to it, they love to mix them up.
I've seen it in chat, but IMO that's more of a system prompt/harness issue.
I'll admit I don't remember Claude 3, the oldest data I have seems to be 3.5. And at that time Gemini 1.5 Pro did a much better job across all of our language pairs, it wasn't close.
by deaux
4/23/2026 at 2:38:22 AM
This always bothers me because models will almost never see text that is mostly English with a little other language in training data (the opposite happens, of course) and certainly not in RL data. Why do they occasionally language switch?
by rao-v
4/22/2026 at 8:46:03 PM
The benchmarks don't seem to say that language ability has gotten worse?
by awongh
4/23/2026 at 10:49:37 PM
There are no real benchmarks of how "natural/idiomatic" output is in a multitude of languages. "Multilingual benchmarks" are usually something like "how good is it at a multiple-choice exam like the SAT in language X". This is a completely unrelated metric.
by deaux
4/22/2026 at 9:25:30 PM
That's the thing with benchmarks: without evals and actual hands-on experience they can give you false confidence. Claude now sounds almost clinical, and is unable to speak in different styles as easily. Claude 4+ uses a lot more constructions borrowed from English than Claude 3, especially in Slavic languages where they sound unnatural. And most modern models eventually glitch out in longer texts, spitting a few garbage tokens in a random language (Telugu, Georgian, Ukrainian, totally unrelated), then continuing in the main language like nothing happened. It's rare but it happens. Samplers do not help with this; you need a second run to spellcheck it. This wasn't a problem in older models; it's a widespread issue that roughly correlates with the introduction of reasoning.
Another new failure mode is self-correction in complicated texts that need reading comprehension: if the model hallucinates an incorrect fact and spots it, it tries to justify or explain it immediately. Which is much more awkward than leaving it incorrect, and those hallucinations are also more common now (maybe because the model learns to make those mistakes together with the correction? I don't know.)
by orbital-decay
4/22/2026 at 9:45:24 PM
Not disputing this might be true, but this seems like something that should be capturable in a multilingual benchmark. Maybe it's just something that people aren't bothered by?
by awongh
4/22/2026 at 10:29:27 PM
Basically everyone who experiments with creative writing is keenly aware of that (e.g. roleplayers); it's just that the devs with the experience training models for it (Anthropic, DeepMind) aren't bothered to do this anymore, since there's no money in it.
> this seems like something that should be capturable in a multi-lingual benchmark
Creative writing benchmarks just don't have good objectives to measure against. In particular, valid but inauthentic language constructions can't be captured well if your LLM judge lacks fidelity to capture it to begin with. Which is I think what typically happens.
An easy litmus test would be making a selected character in a story speak Ebonics or Haitian Creole or TikTok. Claude 3 Opus was light years ahead of any model in authenticity in using them, and it was immediately obvious in a side-by-side comparison with any model including Claude 3.5+. Nuances of Polish or Russian profanities/mat or British obscenities are always the hardest for any model (they tend to either swear like dockers or tone it down, lacking the eloquence), but Opus 3 was also ahead in any of those.
by orbital-decay
4/23/2026 at 3:24:27 AM
Btw samplers do in fact help with this. Random tokens deep in your output context are due to accumulated sampling errors from using shit samplers like top_p and top_k with temperature. Use a full distribution-aware sampler like p-less decoding, top-H, or top-n sigma, and this goes away.
Yes the paper for this will be up for review at NeurIPS this year.
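For anyone curious, here's a minimal sketch of the top-n sigma idea as I understand it (keep only tokens whose logit is within n standard deviations of the maximum, then renormalize); the function name and defaults are my own illustration, not from any particular implementation:

```python
import numpy as np

def top_n_sigma_filter(logits, n=1.0):
    """Keep only tokens whose logit is within n standard deviations
    of the maximum logit, then renormalize into probabilities."""
    logits = np.asarray(logits, dtype=np.float64)
    threshold = logits.max() - n * logits.std()
    masked = np.where(logits >= threshold, logits, -np.inf)
    # softmax over the surviving tokens; masked tokens get probability 0
    exp = np.exp(masked - masked.max())
    return exp / exp.sum()

# Low-logit "garbage" tokens are cut off entirely, rather than keeping
# a tiny but nonzero chance of being sampled deep into a long output.
probs = top_n_sigma_filter([10.0, 9.5, 0.0, -5.0], n=1.0)
```

The point versus top-p/top-k is that the cutoff adapts to the shape of the whole distribution, instead of a fixed mass or count that can leak in junk tokens.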
by Der_Einzige
4/23/2026 at 3:36:20 AM
3/3.1 Pro appears to have knowledge about eccentric topics with no obvious sources that often turns out to be right. It does hallucinate a lot though, and is the most affected by context rot in multi-turn conversations.
by blueblisters
4/23/2026 at 11:00:57 PM
Agreed on both, especially hallucination. That's what makes their chat app even worse: it's very opaque about web search and sources, so you can't tell whether something is a hallucination.
by deaux
4/22/2026 at 6:56:53 PM
AI Studio should be their default app
by algoth1
4/22/2026 at 9:14:17 PM
generally speaking:
ultra ~ mythos ~ gpt-4.5 ~ 4x behemoth
pro ~ opus ~ 2x maverick
flash ~ sonnet ~ scout ~ other 20-30b active Chinese models
by ahmadyan
4/22/2026 at 2:17:09 PM
> They produce drastically lower amount of tokens to solve a problem, but they haven't seem to have put enough effort into refinining their reasoning and execution as they produce broken toolcalls and generally struggle with 'agentic' tasks, but for raw problem solving without tools or search they match opus and gpt while presumably being a fraction of the size.
Agreed, Gemini-cli is terrible compared to CC and even Codex.
But Google is clearly prioritizing having the best AI to augment and/or replace traditional search. That's their bread and butter. They'll be in a far better place to monetize that than anyone else. They've got a 1B+ user lead on anyone - and even adding all other LLMs together, they still probably have more query volume than everyone else put together.
I hope they start prioritizing Gemini-cli, as I think they'd force a lot more competition into the space.
by onlyrealcuzzo
4/22/2026 at 3:08:24 PM
> Agreed, Gemini-cli is terrible compared to CC and even Codex.
Using it with opencode, I don't find the actual model to cause worse results with tool calling versus Opus/GPT. Could this be a harness problem more than a model problem?
I do prefer the overall results with GPT 5.4, which seems to catch more bugs in reviews that Gemini misses and produce cleaner code overall.
(And no, I can't quantify any of that, just "vibes" based)
by JeremyNT
4/22/2026 at 4:28:01 PM
I wonder what I am missing, because I can use gemini-cli with English descriptions of features or entire projects and it just cranks out the code. Built a bunch of stuff with it. Can't think of anything it's currently lacking.
by rjh29
4/22/2026 at 4:38:09 PM
>> Can't think of anything it's currently lacking.
Speed? The pro models are slow for me.
The 3.1 Pro model is good and I don't recognise the GP's complaint of broken tool calls, but I'm only using it via the gemini-cli harness; it sounds like they might be hosting their own agentic loop?
by CraigJPerry
4/22/2026 at 5:42:37 PM
Same. I've built dozens of small tools and scripts and never felt the need to try something else.
by xnx
4/22/2026 at 3:01:35 PM
also, for incorporating into gsuite, youtube, maps, gcp and their other winning apps and behind-the-scenes infra...by asah
4/22/2026 at 4:44:22 PM
I thought the same for a long time: borderline unusable, with loops and bizarre decisions compared to Claude Code and later Codex. But I picked it up again about a month ago and I have been quite impressed. I haven't hit any of those frustrating QoL issues it was famous for, and I've been using it a few hours a day.
Maybe it will let me down sooner or later but so far it has been working really well for me and is pretty snappy with the auto model selection.
After cancelling my Claude Pro plan months ago due to Anthropic enshittification I’ve been nervous relying solely on Codex in case they do the same, so I’ve been glad to have it available on my Google One plan.
by toraway
4/22/2026 at 3:28:31 PM
Not only that, Google has an advantage because they don't need to always generate a response. When a lot of people ask the same thing, they can just index the questions, like results on the search engine, and recompute them only so often.
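A toy sketch of that idea (everything here, including the class name, the TTL, and the naive query normalization, is my own illustration, not anything Google has described):

```python
import hashlib
import time

class AnswerCache:
    """Serve a cached answer for repeated questions, regenerating
    only after a TTL expires instead of on every request."""
    def __init__(self, generate, ttl_seconds=3600):
        self.generate = generate          # the expensive LLM call
        self.ttl = ttl_seconds
        self.store = {}                   # key -> (answer, timestamp)

    def _key(self, question):
        # naive normalization; a real system would cluster paraphrases
        norm = " ".join(question.lower().split())
        return hashlib.sha256(norm.encode()).hexdigest()

    def ask(self, question):
        key = self._key(question)
        hit = self.store.get(key)
        if hit and time.time() - hit[1] < self.ttl:
            return hit[0]                 # cache hit: no generation
        answer = self.generate(question)
        self.store[key] = (answer, time.time())
        return answer
```

The hard part in practice is the `_key` step: recognizing that two differently-worded questions deserve the same answer, which is exactly the kind of query clustering a search company already does at scale.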
by Iulioh
4/22/2026 at 5:59:08 PM
Google doesn't need to give a shit, because so much of the internet is infested with Google ad trackers and AdWords, and everybody uses Chrome, so they will continue to make billions even without AI. Facebook did the same with their pixel so they could soak up data. Gemini will be dead in 2 years and there'll be something else, but the ad and search company will remain, given that they basically own the world wide web.
Except now, so much of the WWW is filled with AI slop that it breaks the system.
by ljm
4/23/2026 at 1:20:05 AM
Whichever shitty model they're using for search is so much better than the free offerings from the other companies. It's not even close. It's not going anywhere.
by what
4/22/2026 at 3:38:35 PM
IIRC when Gemini 3 Pro came out it was considered to be just about on par with whatever version of Claude was out then (4?). Now Gemini 3 is looking long in the tooth. Considering how many Chinese models have been released since then, and at least 2 or 3 versions of Claude, it's starting to look like Google is kind of sitting still here. Maybe you're right and they'll surprise us soon with a large step improvement over what they currently have. Note: I do realize that there's been a Gemini 3.1 release, but it didn't seem like a noticeable change from 3.
by UncleOxidant
4/22/2026 at 6:23:48 PM
As other people are saying here: the Gemini models are mostly terrible at tool use and long context management, and maybe not quite as good with the finicky "detail" parts of coding generally. Where they excel is just total holistic _knowledge_ about the world. I don't like "talking" to it, because I kind of hate its tone, but I find Gemini generally extremely useful for research and analysis tasks and looking up information.
by cmrdporcupine
4/22/2026 at 8:37:57 PM
People who say Gemini is bad at long contexts are so wrong. You can put a whole 50,000-70,000 LOC codebase into Gemini 3.1 Pro's context, making it 800,000+ tokens, give it a detailed task, and ask for the whole changed files back, and it will execute it sometimes in one shot, sometimes in two. E.g., depending on the stack you work with, you can show it all the errors at once so it can fix everything in a single reply.
Yes, it will give you back 5-15 files, up to 4000 LOC total, with only the relevant parts changed.
This is a terribly inefficient way to burn $10 of tokens in 20 minutes, but the attention and 1:1 context retention are truly amazing.
PS: At the same time it is bad at tool use, but that has nothing to do with context.
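As a rough sanity check on those numbers, here's a back-of-the-envelope sketch (my own, using the common ~4 characters/token heuristic; an exact count would need the model's actual tokenizer):

```python
import os

def estimate_codebase_tokens(root, exts=(".py",), chars_per_token=4):
    """Walk a source tree and estimate its size in LLM-context tokens
    using the rough ~4 characters/token rule of thumb."""
    total_chars = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name.endswith(tuple(exts)):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="ignore") as f:
                    total_chars += len(f.read())
    return total_chars // chars_per_token
```

At ~50 characters per line, 50,000-70,000 LOC is roughly 2.5-3.5M characters, i.e. on the order of 600-900k tokens, which is consistent with the 800,000+ figure above.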
by SXX
4/23/2026 at 2:44:19 AM
This! And with AI Studio you get a couple of free calls per day (it has gotten less and less). I have had days where I was able to get 100 USD worth of tokens from AI Studio for free. 1M tokens in and great code out.
by oezi
4/23/2026 at 3:27:36 AM
You can even turn most of the censorship off in AI Studio (but not the hidden top_k of 64 they force in there). AI Studio is where you go if you want an actually good, mostly uncensored model. Gemini 3.1 is fully, and somehow still quietly, coomer approved.
by Der_Einzige
4/22/2026 at 6:31:45 PM
Gemini had the best long context support for the longest time, and even now at >400k tokens it's still got the best long context recall. Gemini is just not trained for autonomy/tool use/agentic behavior to the same degree as the other frontier models. Goog seems to emphasize video/images/scientific+world knowledge.
by CuriouslyC
4/22/2026 at 6:36:24 PM
My experience is it advertises large context and then just becomes incoherent and confused as it fills that context. E.g. it sucks at general tool use, but sucks even more at it after a chunk of time in a session. One frustrating situation is to watch it go into a loop trying and failing to edit source files.
I often wonder how my old coworkers from Google get by, if this is the agentic coding they have available to them for working on projects in Google3. But I suspect the models they work with have been fine-tuned on Google's custom tooling and perform better?
by cmrdporcupine
4/22/2026 at 4:29:46 PM
Their "preview" naming is pretty arbitrary. It's just their way to avoid making any availability or persistence promises, let alone guarantees. It's also a PR tactic to mask any failures by pretending it's beta quality.
by orbital-decay
4/23/2026 at 3:44:14 AM
> the pro and flash variants are 5x to 10x smaller than opus and gpt-5 class models.
The rumor is that Gemini Pro is the largest model being served today (or at least was, prior to Mythos).
Source: some podcast where they were discussing TPU vs Nvidia cluster topologies, and how Google is exploiting their topology to allow this. But I can't remember exactly which podcast, so hopefully someone else will know.
by nl
4/22/2026 at 8:16:23 PM
I really wonder what I'm missing with Gemini. It's a second-rate model for me at best. I find it okay (not great) at collecting information and completely useless at agentic tasks. It's like it's always drunk. When the Claude credits expire in Antigravity, I'm done for the day.
> They produce drastically lower amount of tokens to solve a problem
I LOLed at this because of the constant death loops that don't even solve the problem at all.
by solarkraft
4/23/2026 at 2:16:21 AM
Yah, it doesn't even make sense how they got through their benchmarks without death loops. Gemini-cli even has a hotfix to break the model out of such death loops. But even if you ignore this bug/quirk that will be fixed in the next patch release, my point still stands.
4/23/2026 at 1:08:17 AM
i get much better results with it using a different toolset. give it serena and it mostly works, and is less likely to hit a death loop.
i feel like the gemini-cli app is missing some tools for making sure the session history is actually valid
by 8note
4/22/2026 at 4:54:52 PM
Am I tripping, or is this an AI reply? Like it barely has anything to do with the article other than both being related to AI.
by big-chungus4
4/22/2026 at 9:51:50 PM
An AI reply would be more relevant to the headline/article; humans often write something tangential, since we have more going on in our heads than just the context at hand, while AI can't ignore context.
4/23/2026 at 2:14:07 AM
Google uses these chips to create gemini, I simply used this as an excuse to rant and predict the future.by himata4113
4/22/2026 at 7:31:53 PM
> a model that will be an entire generation beyond SOTA
That model would then be SOTA.
Tautologically you can't be better than SOTA
by robocat
4/23/2026 at 2:15:20 AM
SOTA at that time*by himata4113
4/22/2026 at 4:47:10 PM
Interesting mix of words: "I felt" -> "proved" -> "guess". One of those is not like the others!by mrcwinn
4/23/2026 at 2:17:35 AM
I guess I felt pretty uncertain that day which proved that a lack of sleep is bad for your mental cognition.by himata4113
4/22/2026 at 2:17:40 PM
[flagged]
by ALLTaken
4/22/2026 at 3:13:01 PM
Is your friend on the JAX team?by _boffin_
4/22/2026 at 3:17:02 PM
I'm really struggling with terrible bloating today, but I deemed it too dangerous to release.by neonstatic
4/22/2026 at 4:43:36 PM
Thank you for your sacrifice. Could you speak to my dog please? You may wish to yell from a distance, actually.by tclancy