5/20/2025 at 7:54:29 PM
You can try it on Android right now:
Download the Edge Gallery apk from GitHub: https://github.com/google-ai-edge/gallery/releases/tag/1.0.0
Download one of the .task files from huggingface: https://huggingface.co/collections/google/gemma-3n-preview-6...
Import the .task file in Edge Gallery with the + button at the bottom right.
You can take pictures right from the app. The model is indeed pretty fast.
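If you would rather script the download than click through the Hugging Face UI, here is a rough sketch using huggingface_hub and adb. The repo and file names below are guesses; use whatever the collection linked above actually lists.

    # Sketch only: fetch a Gemma 3n .task bundle and copy it to the phone so it
    # can be imported in Edge Gallery via the "+" button.
    # Requires accepting the Gemma license and `huggingface-cli login` first
    # (the repo is gated).
    import subprocess
    from huggingface_hub import hf_hub_download  # pip install huggingface_hub

    task_path = hf_hub_download(
        repo_id="google/gemma-3n-E4B-it-litert-preview",  # hypothetical repo id
        filename="gemma-3n-E4B-it-int4.task",             # hypothetical file name
    )

    # Push to the phone's Download folder, then import it from Edge Gallery.
    subprocess.run(["adb", "push", task_path, "/sdcard/Download/"], check=True)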
by nolist_policy
5/20/2025 at 8:52:29 PM
Okay, from some first tries with story writing, gemma-3n-E4B-it seems to perform between plain Gemma 3 4B and 12B. It definitely retains the strong instruction following, which is good.
Hint: You have to set the Max tokens to 32000 for longer conversations. The slider makes it look like it's limited to 1024; just enter the value manually.
by nolist_policy
5/20/2025 at 11:15:55 PM
I assume that "pretty fast" depends on the phone. My old Pixel 4a ran Gemma-3n-E2B-it-int4 without problems. Still, it took over 10 minutes to finish answering "What can you see?" when given an image from my recent photos.Final stats:
15.9 seconds to first token
16.4 tokens/second prefill speed
0.33 tokens/second decode speed
662 seconds to complete the answer
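For a sense of scale, a quick back-of-the-envelope check (my own arithmetic, assuming the reported latency is roughly time-to-first-token plus decode time) suggests the answer itself was only a couple hundred tokens long:

    # Sanity check of the reported stats: how many tokens were decoded in 662 s?
    ttft_s = 15.9        # seconds to first token
    decode_tps = 0.33    # decode tokens/second
    total_s = 662        # total latency in seconds

    decoded_tokens = (total_s - ttft_s) * decode_tps
    print(f"~{decoded_tokens:.0f} tokens decoded")   # ~213 tokens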
by philipkglass
5/21/2025 at 12:59:44 AM
I did the same thing on my Pixel Fold. Tried two different images with two different prompts: "What can you see?" and "Describe this image".
First image ('Describe', photo of my desk)
- 15.6 seconds to first token
- 2.6 tokens/second
- Total 180 seconds
Second image ('What can you see?', photo of a bowl of pasta)
- 10.3 seconds to first token
- 3.1 tokens/second
- Total 26 seconds
The Edge Gallery app defaults to CPU as the accelerator. Switched to GPU.
Pasta / what can you see:
- It actually takes a full 1-2 minutes to start printing tokens. But the stats say 4.2 seconds to first token...
- 5.8 tokens/second
- 12 seconds total
Desk / describe:
- The output is: while True: print("[toxicity=0]")
- Bugged? I stopped it after 80 seconds of output. 1st token after 4.1 seconds, then 5.7 tokens/second.
by the_pwner224
5/21/2025 at 1:06:00 AM
Pixel 4a release date = August 2020
Pixel Fold was in the Pixel 8 generation but uses the Tensor G2 from the 7s. Pixel 7 release date = October 2022
That's a 26 month difference, yet a full order of magnitude difference in token generation rate on the CPU. Who said Moore's Law is dead? ;)
by the_pwner224
5/21/2025 at 2:24:27 PM
As another data point, on E4B, my Pixel 6 Pro (Tensor v1, Oct 2021) is getting about 4.4 t/s decode on a picture of a glass of milk, and over 6 t/s on text chat. It's amazing; I never dreamed I'd be viably running an 8 billion param model when I got it 4 years ago. And kudos to the Pixel team for including 12 GB of RAM when even today PC makers think they can get away with selling 8.
by z2
5/21/2025 at 7:15:13 AM
The Pixel 8 has the Tensor G3 chip.
by sujayk_33
5/23/2025 at 11:26:40 AM
In my case it was pretty fast, I would say. Using an S24 FE with Gemma 3n E2B int4, it took around 20 seconds to answer "Describe this image", and the result was pretty amazing.
Stats -
CPU -
first token - 4.52 sec
prefill speed - 57.50 tokens/s
decode speed - 10.59 tokens/s
Latency - 20.66 sec
GPU -
first token - 1.92 sec
prefill speed - 135.35 tokens/s
decode speed - 11.92 tokens/s
Latency - 9.98 sec
by devyy
5/21/2025 at 6:30:16 AM
Gemma-3n-E4B-it on my 2022 Galaxy Z Fold 4.
CPU:
7.37 seconds to first token
35.55 tokens/second prefill speed
7.09 tokens/second decode speed
27.97 seconds to complete the answer
GPU:
1.96 seconds to first token
133.40 tokens/second prefill speed
7.95 tokens/second decode speed
14.80 seconds to complete the answer
by nolist_policy
5/21/2025 at 1:42:54 PM
So apparently the NPU can't be used for models like this. I wonder what it is even good for.
by cubefox
5/21/2025 at 11:44:03 AM
Pixel 9 Pro XL ("What can you see?"; photo of a small monitor displaying stats in my home office)
1st token: 7.48s
Prefill speed: 35.02 tokens/s
Decode speed: 5.72 tokens/s
Latency: 86.88s
It did a pretty good job: the photo had lots of glare and was taken at a bad angle and from a distance, with small text. It picked out the weather, outdoor temperature, CO2 (ppm), temperature (C), and PM2.5 (ug/m^3) in the office; it misread "Homelab" as "Household" but got the UPS load and power correctly, misread "Homelab" again (smaller text this time) as "Hereford" but got the power in W, and misread "Wed May 21" on the weather map as "World May 21".
Overall very good considering how poor the input image was.
Edit: E4B
by alias_neo
5/21/2025 at 2:47:51 PM
10 min and 10% battery?
by m3kw9
5/21/2025 at 5:58:21 AM
Why are we still launching models without simple working Python example code (or llama.cpp support)?
by rao-v
5/21/2025 at 4:01:37 PM
Who runs Python code on mobile?
by thomashop
5/20/2025 at 11:07:41 PM
And the libraries to embed the Gemma series in your iOS/Android app: https://ai.google.dev/edge/litert
Or, run them on a microcontroller! https://github.com/tensorflow/tflite-micro
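For quick desktop poking, LiteRT also ships a Python interpreter. A minimal sketch for a plain .tflite model follows; the package/module names are from memory of the LiteRT docs, so double-check them, and note the Gemma 3n .task files are MediaPipe bundles that will not load this way (they need the LLM Inference runtime instead).

    # Minimal sketch: run a plain .tflite model with the LiteRT Python interpreter.
    import numpy as np

    try:
        from ai_edge_litert.interpreter import Interpreter  # pip install ai-edge-litert
    except ImportError:
        import tensorflow as tf                              # fallback: TF-bundled runtime
        Interpreter = tf.lite.Interpreter

    interpreter = Interpreter(model_path="some_model.tflite")  # hypothetical file
    interpreter.allocate_tensors()

    inp = interpreter.get_input_details()[0]
    interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))
    interpreter.invoke()
    out = interpreter.get_output_details()[0]
    print(interpreter.get_tensor(out["index"]).shape)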
by ignoramous
5/21/2025 at 2:25:11 PM
On a Pixel 8a, I asked Gemma 3n to play 20 questions with me. It says it has an object in mind for me to guess, then it asks me a question about it. Several attempts to clarify who is supposed to ask the questions have gone in circles.
by tootie
5/20/2025 at 10:58:02 PM
Thanks for this guide, it's great.
Okay, perhaps my phone's not great, and perhaps this isn't optimized/pruned for phone use, but it's unusably slow. The answers are solid from my brief test.
I wouldn't exactly say it's fit for phone use, unless you have no internet and don't mind a bit of a wait.
Really impressive, regardless.
by KoolKat23
5/20/2025 at 11:33:58 PM
What phone are you using?
by px43
5/21/2025 at 6:21:27 AM
I see my phone's processor is from 2018, so there's that. Moore's law to save the day, going by the other comments.
by KoolKat23
5/21/2025 at 8:26:22 AM
Suggest giving it no networking permissions (if indeed this is about on-device AI).
by andrepd
5/21/2025 at 4:53:28 PM
Networking perms seem to be required on initial startup of the app.
I just installed the apk on a GrapheneOS endpoint (an old Pixel 7 Pro) without Google Play Services installed. The app requires network access to contact Hugging Face and download the model through your HF account. It also requires some interaction/permission agreement with Kaggle. Installed _with_ network perms, the app works, and I'm getting decent performance on the Gemma-3n-E2B-it-int4 model (5-6 tokens/s). Ok, cool.
Now kill the app, disable network permissions and restart it. Choose one of the models that you downloaded when it had network access. It still works. It does appear to be fully local. Yay.
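If you're not on GrapheneOS, a cruder way to repeat the offline test is to cut connectivity from adb before opening the app. This is device-wide rather than per-app, and the svc commands may be restricted on some builds; a sketch:

    # Kill Wi-Fi and mobile data over adb, then exercise the app offline.
    import subprocess

    for args in (["svc", "wifi", "disable"], ["svc", "data", "disable"]):
        subprocess.run(["adb", "shell", *args], check=True)

    # Re-enable afterwards with "svc wifi enable" / "svc data enable".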
by nicholasjarnold
5/21/2025 at 12:49:18 AM
It reminds me of GPT-3 quality answers. Kind of impressive.
Although my entire use case for local models is amoral questions, which it blocks. Excited for the abliterated version.
by resource_waste
5/20/2025 at 9:31:41 PM
Waiting for approval; is there a magnet?
by lousken
5/20/2025 at 10:56:17 PM
If you go into the app and click the first icon, it directs you to a workflow to get approved. After clicking on a button that is the same color as the background and jumping through some hoops about providing user data, analytics, etc., it will auto-approve you.
by hadlock
5/21/2025 at 5:37:17 PM
Is there a list of which SOCs support the GPU acceleration?
by TiredOfLife
5/22/2025 at 3:03:26 PM
It uses TFLite in the background, which can GPU-accelerate with OpenGL ES 3.1 or OpenCL [0], so it should work on pretty much any SOC.
And you really notice that the model is dumber on GPU, because OpenGL doesn't take accuracy that seriously.
[0] https://blog.tensorflow.org/2020/08/faster-mobile-gpu-infere...
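If you want to check what a given phone advertises before trying the GPU backend, the OpenGL ES level is exposed as a system property. A sketch (ro.opengles.version packs major/minor into one integer):

    # Query the advertised OpenGL ES version over adb; >= 3.1 is what the
    # TFLite GPU delegate wants when it falls back from OpenCL.
    import subprocess

    raw = subprocess.run(
        ["adb", "shell", "getprop", "ro.opengles.version"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

    ver = int(raw)
    major, minor = ver >> 16, ver & 0xFFFF
    status = "GPU delegate should work" if (major, minor) >= (3, 1) else "too old for the GL path"
    print(f"OpenGL ES {major}.{minor} -> {status}")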
by nolist_policy