6/3/2026 at 4:26:36 PM
The big story here is the encoder-free part, which I still don't fully understand.
> Vision: We replaced Gemma 4’s vision encoder with a lightweight embedding module consisting of a single matrix multiplication, positional embedding and normalizations.
That's technically encoding, just without using a dedicated model for it like SigLIP? The Developer's Guide elaborates: it's still a 35M layer, and I'm curious whether it is robust enough. https://developers.googleblog.com/gemma-4-12b-the-developer-...
> Small enough to run locally on consumer laptops with 16GB of RAM, it unlocks powerful multimodal and agentic experiences right on your machine.
I am assuming that involves quantization, which due to the quality loss makes that statement somewhat misleading IMO.
by minimaxir
6/3/2026 at 5:06:52 PM
Embedded within that developer page is a good explainer of the encoder-free architecture. https://newsletter.maartengrootendorst.com/p/a-visual-guide-...
by georgehm
6/3/2026 at 5:28:34 PM
This is just early fusion, basically.
FAIR did this 2 years ago now: https://arxiv.org/abs/2405.09818
I've been waiting for something like this to be released since then.
The annoying thing is that Chameleon also had multi-modal output based on the same principles, but this model is multi-modal on the input side only... (I'm curious how they did pre-training without having multi-modal outputs as well. I wonder if they just chopped them off rather than support image output.)
by spott
6/3/2026 at 5:21:17 PM
The audio side is even more interesting, as it seems they totally got rid of the positional embedding and are just doing a single linear transform to match the LLM input dimension, and that's it.
> Audio: We simplified audio processing even further. We removed the audio encoder entirely and projected the raw audio signal into the same dimensional space as text tokens.
by mchinen
6/3/2026 at 5:33:36 PM
I guarantee you there's positional information one way or another. They just don't mention it because positional embeddings are extremely cheap computationally, so not worth mentioning.
by make3
6/3/2026 at 5:52:41 PM
Ah yeah, thinking further, it's probably just using some positional embedding based on sequence numbering, added in the LLM layers. For vision it needs the patch location as well.
by mchinen
6/3/2026 at 5:39:55 PM
Agree. Audio is strongly temporal, so there is almost certainly some positional encoding one way or another.
by neosat
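To make the point about position concrete, here is a minimal sketch of "a single linear transform" applied to raw audio. It is not the Gemma implementation: the framing scheme, `FRAME_LEN`, `D_MODEL`, and the helper names `frame_signal` and `project` are all assumptions for illustration, since the announcement only says the raw signal is projected into the text-token dimension. What it shows is that the projection alone treats frames independently, so ordering has to come from somewhere else (for example, position handling inside the LLM layers).

```python
import random

def frame_signal(samples, frame_len):
    """Split a 1-D waveform into non-overlapping frames, dropping any remainder."""
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]

def project(frame, weight):
    """One linear map: frame (frame_len,) times weight (frame_len x d_model)."""
    d_model = len(weight[0])
    return [sum(x * weight[i][j] for i, x in enumerate(frame)) for j in range(d_model)]

random.seed(0)
FRAME_LEN, D_MODEL = 4, 3  # toy sizes (assumed), nothing like the real model's
weight = [[random.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(FRAME_LEN)]

samples = [random.uniform(-1, 1) for _ in range(16)]  # stand-in for raw audio
frames = frame_signal(samples, FRAME_LEN)
embeddings = [project(f, weight) for f in frames]

# Reversing the frame order merely reverses the embeddings: the projection
# by itself carries no notion of time.
reversed_embeddings = [project(f, weight) for f in reversed(frames)]
assert reversed_embeddings == embeddings[::-1]
```

Each frame maps to the same vector no matter where it sits in the clip, which is why some positional signal must be supplied downstream even if the embedding module does not mention one.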
6/3/2026 at 4:32:43 PM
Totally agree that it is "encoding" in the general sense, but I think they are referring to the lack of an "encoder" neural network.
by jszymborski
6/3/2026 at 4:34:54 PM
In hindsight I may have been pedantic.
by minimaxir
6/3/2026 at 4:59:59 PM
I had a similar thought to you, and found your question and the resulting discussion helpful!
by wilkystyle
6/3/2026 at 5:04:05 PM
Not at all. Getting really pedantic, tokenization is also a form of encoding, so whatever modality you're using, you'll end up doing some type of encoding in some way.
by alberto467
6/3/2026 at 5:54:12 PM
Tokens are such a strange base unit. Couldn't we do something that naturally conforms better to reality than such choppy units that cause all sorts of artifacts? Making everything 'language based' prevents true multi-modality. Thinking isn't done in language. Thinking outputs language, but it's far more like multiple waves of data coalescing into an 'idea', internal... subjectively (n=1) at least. I think wave/signal-based transformers are the next jump.
After that, an s1/s2 system (fast generation, with slow wave correction / observation operating over the fast generation) seems like the next leap forward.
Tokens create and hide too many problems to be the 'optimal' solution.
by altruios
6/3/2026 at 4:31:23 PM
> quantization
12B means 12G @ 8 bits/param (basically lossless) and 6G @ 4 bits/param (the generally accepted 'pretty close' level). Not too bad?
But TBD how well the base model performs before thinking too much about quantization
by kristjansson
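The arithmetic in the comment above can be sketched as follows. This is a back-of-the-envelope estimate of weight storage only: it takes "12B" at face value as 12e9 parameters, uses decimal gigabytes, and ignores the KV cache, activations, and the per-block scale metadata that real quantization formats add. The function name `weight_memory_gb` is made up for this example.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 12e9      # "12B" taken at face value (assumption)
SYSTEM_RAM_GB = 16   # laptop size quoted in the announcement

for bits in (16, 8, 4):
    gb = weight_memory_gb(N_PARAMS, bits)
    share = gb / SYSTEM_RAM_GB
    print(f"{bits:>2} bits/param: {gb:5.1f} GB of weights, {share:.0%} of {SYSTEM_RAM_GB} GB RAM")
```

At 8 bits/param the weights come to 12 GB and at 4 bits/param to 6 GB, matching the figures quoted; the unquantized 16-bit weights would not fit in 16 GB at all.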
6/3/2026 at 4:58:45 PM
One side effect is that the separate .mmproj file (Multi-Modal Projection encoder) is no longer needed when using the model with llama.cpp etc.
by matja
6/3/2026 at 5:29:48 PM
But do I have the option to run it 'text only'?
by pferdone
6/3/2026 at 5:35:10 PM
There is plenty of prior work on encoder-free VLMs. I specifically remember the EVE series of models from ~2 years ago.
by woadwarrior01
6/3/2026 at 5:21:07 PM
Encoder-free is huge for running on SBCs etc. Often the encoding time is a significant fraction of generation time if you are using a VLM as an all-purpose vision model.
by rao-v
6/3/2026 at 4:33:16 PM
It actually works well because, unlike with encoders, the latent space is trained on that initial layer, so it “knows” what to do with that sparse density. I’ve been using gemma4-12b with Flux2 and its ability to reason on visual input is pretty good. That said, each model is good in its own way, so YMMV, but overall it’s about as solid as Qwen, just with a more advanced architecture.
by reactordev
6/3/2026 at 4:35:38 PM
I think the idea is that the model is seeing embeddings that map directly to underlying pixel data, rather than being fed semantically rich embeddings from an encoder model which itself had seen the raw pixel data.
by wolttam
6/3/2026 at 4:33:50 PM
Well, it's a real simple encoder, I guess.
by LarsDu88
6/3/2026 at 4:31:09 PM
> That's technically encoding
Isn't that just projecting the patches into the d_model-size vectors that the model takes?
> I am assuming that involves quantization
A 12B model in 16GB seems very reasonable to me; int8 is top quality for running models.
by GaggiX
6/3/2026 at 4:37:10 PM
The guide describes it as projection, although there is apparently an extra step: "A factorized coordinate lookup (X and Y matrices) attaches spatial location information directly to the input."
12B at int8 would take up 12 GB of memory, or 75% of system memory, which technically fits within 16GB, but the OS will not like that.
by minimaxir
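For readers trying to picture "a single matrix multiplication, positional embedding and normalizations" plus the factorized X/Y lookup, here is a rough sketch in plain Python. It is a guess at the general shape, not the actual Gemma module: the patch size, the toy dimensions, the order of operations, the choice of an RMS-style norm, and every name (`patchify`, `matvec`, `rms_norm`, `embed_image`, `w_proj`, `x_table`, `y_table`) are assumptions made for illustration.

```python
import math
import random

def patchify(image, p):
    """Cut an H x W grayscale image (list of rows) into flattened p x p patches.
    Returns (patch_vector, row_index, col_index) tuples in raster order."""
    h, w = len(image), len(image[0])
    out = []
    for r in range(0, h - p + 1, p):
        for c in range(0, w - p + 1, p):
            vec = [image[r + dr][c + dc] for dr in range(p) for dc in range(p)]
            out.append((vec, r // p, c // p))
    return out

def matvec(vec, weight):
    """vec (n,) times weight (n x d) -> (d,). This is the 'single matmul'."""
    d = len(weight[0])
    return [sum(v * weight[i][j] for i, v in enumerate(vec)) for j in range(d)]

def rms_norm(vec, eps=1e-6):
    """RMS normalization without a learned gain (the norm type is an assumption)."""
    rms = math.sqrt(sum(v * v for v in vec) / len(vec) + eps)
    return [v / rms for v in vec]

def embed_image(image, p, w_proj, x_table, y_table):
    """Project each patch, add factorized X and Y position vectors, normalize."""
    tokens = []
    for vec, row, col in patchify(image, p):
        h = matvec(vec, w_proj)
        # Factorized lookup: one vector per column index plus one per row index,
        # so the tables hold grid_w + grid_h entries instead of grid_w * grid_h.
        h = [a + x + y for a, x, y in zip(h, x_table[col], y_table[row])]
        tokens.append(rms_norm(h))
    return tokens

random.seed(0)
H, W, P, D_MODEL = 4, 6, 2, 8  # toy sizes (assumed)
image = [[random.random() for _ in range(W)] for _ in range(H)]
w_proj = [[random.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(P * P)]
x_table = [[random.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(W // P)]
y_table = [[random.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(H // P)]

tokens = embed_image(image, P, w_proj, x_table, y_table)
print(len(tokens), "image tokens of width", len(tokens[0]))
```

The resulting vectors would be fed to the language model alongside text-token embeddings. Nothing here learns semantics on its own, which fits the observation elsewhere in the thread that the model sees something close to raw pixel data rather than the output of a pretrained encoder.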
6/3/2026 at 4:33:43 PM
[dead]
by fushigokira