5/20/2025 at 7:54:29 PM
You can try it on Android right now:
Download the Edge Gallery apk from GitHub: https://github.com/google-ai-edge/gallery/releases/tag/1.0.0
Download one of the .task files from huggingface: https://huggingface.co/collections/google/gemma-3n-preview-6...
Import the .task file in Edge Gallery with the + button at the bottom right.
You can take pictures right from the app. The model is indeed pretty fast.
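If you would rather script the download than click through the Hugging Face UI, here is a rough sketch using huggingface_hub and adb. The repo and file names below are guesses; use whatever the collection linked above actually lists.

    # Sketch only: fetch a Gemma 3n .task bundle and copy it to the phone so it
    # can be imported in Edge Gallery via the "+" button.
    # Requires accepting the Gemma license and `huggingface-cli login` first
    # (the repo is gated).
    import subprocess
    from huggingface_hub import hf_hub_download  # pip install huggingface_hub

    task_path = hf_hub_download(
        repo_id="google/gemma-3n-E4B-it-litert-preview",  # hypothetical repo id
        filename="gemma-3n-E4B-it-int4.task",             # hypothetical file name
    )

    # Push to the phone's Download folder, then import it from Edge Gallery.
    subprocess.run(["adb", "push", task_path, "/sdcard/Download/"], check=True)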
by nolist_policy
5/20/2025 at 8:52:29 PM
Okay, from some first tries with story writing, gemma-3n-E4B-it seems to perform between plain Gemma 3 4B and 12B. It definitely retains the strong instruction following, which is good.
Hint: You have to set the Max tokens to 32000 for longer conversations. The slider makes it look like it's limited to 1024; just enter the value manually.
by nolist_policy
5/20/2025 at 11:15:55 PM
I assume that "pretty fast" depends on the phone. My old Pixel 4a ran Gemma-3n-E2B-it-int4 without problems. Still, it took over 10 minutes to finish answering "What can you see?" when given an image from my recent photos.Final stats:
15.9 seconds to first token
16.4 tokens/second prefill speed
0.33 tokens/second decode speed
662 seconds to complete the answer
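For a sense of scale, a quick back-of-the-envelope check (my own arithmetic, assuming the reported latency is roughly time-to-first-token plus decode time) suggests the answer itself was only a couple hundred tokens long:

    # Sanity check of the reported stats: how many tokens were decoded in 662 s?
    ttft_s = 15.9        # seconds to first token
    decode_tps = 0.33    # decode tokens/second
    total_s = 662        # total latency in seconds

    decoded_tokens = (total_s - ttft_s) * decode_tps
    print(f"~{decoded_tokens:.0f} tokens decoded")   # ~213 tokens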
by philipkglass
5/21/2025 at 12:59:44 AM
I did the same thing on my Pixel Fold. Tried two different images with two different prompts: "What can you see?" and "Describe this image".
First image ('Describe', photo of my desk)
- 15.6 seconds to first token
- 2.6 tokens/second
- Total 180 seconds
Second image ('What can you see?', photo of a bowl of pasta)
- 10.3 seconds to first token
- 3.1 tokens/second
- Total 26 seconds
The Edge Gallery app defaults to CPU as the accelerator. Switched to GPU.
Pasta / what can you see:
- It actually takes a full 1-2 minutes to start printing tokens. But the stats say 4.2 seconds to first token...
- 5.8 tokens/second
- 12 seconds total
Desk / describe:
- The output is: while True: print("[toxicity=0]")
- Bugged? I stopped it after 80 seconds of output. 1st token after 4.1 seconds, then 5.7 tokens/second.
by the_pwner224
5/21/2025 at 1:06:00 AM
Pixel 4a release date = August 2020
Pixel Fold was in the Pixel 8 generation but uses the Tensor G2 from the 7s. Pixel 7 release date = October 2022
That's a 26 month difference, yet a full order of magnitude difference in token generation rate on the CPU. Who said Moore's Law is dead? ;)
by the_pwner224
5/21/2025 at 2:24:27 PM
As another data point, on E4B, my Pixel 6 Pro (Tensor v1, Oct 2021) is getting about 4.4 t/s decode on a picture of a glass of milk, and over 6 t/s on text chat. It's amazing; I never dreamed I'd be viably running an 8 billion param model when I got it 4 years ago. And kudos to the Pixel team for including 12 GB of RAM when even today PC makers think they can get away with selling 8.
by z2
5/21/2025 at 7:15:13 AM
The Pixel 8 has the Tensor G3 chip.
by sujayk_33
5/23/2025 at 11:26:40 AM
In my case it was pretty fast, I would say. Using an S24 FE with Gemma 3n E2B int4, it took around 20 seconds to answer "Describe this image", and the result was pretty amazing.
Stats -
CPU -
first token - 4.52 sec
prefill speed - 57.50 tokens/s
decode speed - 10.59 tokens/s
Latency - 20.66 sec
GPU -
first token - 1.92 sec
prefill speed - 135.35 tokens/s
decode speed - 11.92 tokens/s
Latency - 9.98 sec
by devyy
5/21/2025 at 6:30:16 AM
Gemma-3n-E4B-it on my 2022 Galaxy Z Fold 4.
CPU:
7.37 seconds to first token
35.55 tokens/second prefill speed
7.09 tokens/second decode speed
27.97 seconds to complete the answer
GPU:
1.96 seconds to first token
133.40 tokens/second prefill speed
7.95 tokens/second decode speed
14.80 seconds to complete the answer
by nolist_policy
5/21/2025 at 1:42:54 PM
So apparently the NPU can't be used for models like this. I wonder what it is even good for.
by cubefox
5/21/2025 at 11:44:03 AM
Pixel 9 Pro XL ("What can you see?"; photo of a small monitor displaying stats in my home office)
1st token: 7.48s
Prefill speed: 35.02 tokens/s
Decode speed: 5.72 tokens/s
Latency: 86.88s
It did a pretty good job: the photo had lots of glare and was taken at a bad angle and from a distance, with small text. It picked out the weather, outdoor temperature, CO2 (ppm), temperature (C), and PM2.5 (ug/m^3) in the office; it misread "Homelab" as "Household" but got the UPS load and power correctly, misread "Homelab" again (smaller text this time) as "Hereford" but got the power in W, and misread "Wed May 21" on the weather map as "World May 21".
Overall very good considering how poor the input image was.
Edit: E4B
by alias_neo
5/21/2025 at 2:47:51 PM
10 min and 10% battery?
by m3kw9
5/21/2025 at 5:58:21 AM
Why are we still launching models without simple working Python example code (or llama.cpp support)?
by rao-v
5/21/2025 at 4:01:37 PM
Who runs Python code on mobile?
by thomashop
5/20/2025 at 11:07:41 PM
And the libraries to embed the Gemma series in your iOS/Android app: https://ai.google.dev/edge/litert
Or, run them on a microcontroller! https://github.com/tensorflow/tflite-micro
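For quick desktop poking, LiteRT also ships a Python interpreter. A minimal sketch for a plain .tflite model follows; the package/module names are from memory of the LiteRT docs, so double-check them, and note the Gemma 3n .task files are MediaPipe bundles that will not load this way (they need the LLM Inference runtime instead).

    # Minimal sketch: run a plain .tflite model with the LiteRT Python interpreter.
    import numpy as np

    try:
        from ai_edge_litert.interpreter import Interpreter  # pip install ai-edge-litert
    except ImportError:
        import tensorflow as tf                              # fallback: TF-bundled runtime
        Interpreter = tf.lite.Interpreter

    interpreter = Interpreter(model_path="some_model.tflite")  # hypothetical file
    interpreter.allocate_tensors()

    inp = interpreter.get_input_details()[0]
    interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))
    interpreter.invoke()
    out = interpreter.get_output_details()[0]
    print(interpreter.get_tensor(out["index"]).shape)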
by ignoramous
5/21/2025 at 2:25:11 PM
On a Pixel 8a, I asked Gemma 3n to play 20 questions with me. It says it has an object in mind for me to guess, then it asks me a question about it. Several attempts to clarify who is supposed to ask the questions have gone in circles.
by tootie
5/20/2025 at 10:58:02 PM
Thanks for this guide, it's great.
Okay, perhaps my phone's not great, and perhaps this isn't optimized/pruned for phone use, but it's unusably slow. The answers are solid from my brief test.
I wouldn't exactly say it's fit for phone use, unless you have no internet and don't mind a bit of a wait.
Really impressive, regardless.
by KoolKat23
5/20/2025 at 11:33:58 PM
What phone are you using?
by px43
5/21/2025 at 6:21:27 AM
I see my phone's processor is from 2018, so there's that. Moore's law to save the day, going by the other comments.
by KoolKat23
5/21/2025 at 8:26:22 AM
Suggest giving it no networking permissions (if indeed this is about on-device AI).
by andrepd
5/21/2025 at 4:53:28 PM
Networking perms seem to be required on initial startup of the app.
I just installed the apk on a GrapheneOS endpoint (an old Pixel 7 Pro) without Google Play Services installed. The app requires network access to contact Hugging Face and download the model through your HF account. It also requires some interaction/permission agreement with Kaggle. Installed _with_ network perms, the app works, and I'm getting decent performance on the Gemma-3n-E2B-it-int4 model (5-6 tokens/s). Ok, cool.
Now kill the app, disable network permissions and restart it. Choose one of the models that you downloaded when it had network access. It still works. It does appear to be fully local. Yay.
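If you're not on GrapheneOS, a cruder way to repeat the offline test is to cut connectivity from adb before opening the app. This is device-wide rather than per-app, and the svc commands may be restricted on some builds; a sketch:

    # Kill Wi-Fi and mobile data over adb, then exercise the app offline.
    import subprocess

    for args in (["svc", "wifi", "disable"], ["svc", "data", "disable"]):
        subprocess.run(["adb", "shell", *args], check=True)

    # Re-enable afterwards with "svc wifi enable" / "svc data enable".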
by nicholasjarnold
5/21/2025 at 12:49:18 AM
It reminds me of GPT-3 quality answers. Kind of impressive.
Although my entire use case for local models is amoral questions, which it blocks. Excited for the abliterated version.
by resource_waste
5/20/2025 at 9:31:41 PM
Waiting for approval; is there a magnet?
by lousken
5/20/2025 at 10:56:17 PM
If you go into the app and click the first icon, it directs you to a workflow to get approved. After clicking on a button that is the same color as the background and jumping through some hoops about providing user data, analytics, etc., it will auto-approve you.
by hadlock
5/21/2025 at 5:37:17 PM
Is there a list of which SOCs support the GPU acceleration?
by TiredOfLife
5/22/2025 at 3:03:26 PM
It uses TFLite in the background, which can GPU-accelerate with OpenGL ES 3.1 or OpenCL [0], so it should work on pretty much any SOC.
And you really notice that the model is dumber on GPU, because OpenGL doesn't take accuracy that seriously.
[0] https://blog.tensorflow.org/2020/08/faster-mobile-gpu-infere...
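If you want to check what a given phone advertises before trying the GPU backend, the OpenGL ES level is exposed as a system property. A sketch (ro.opengles.version packs major/minor into one integer):

    # Query the advertised OpenGL ES version over adb; >= 3.1 is what the
    # TFLite GPU delegate wants when it falls back from OpenCL.
    import subprocess

    raw = subprocess.run(
        ["adb", "shell", "getprop", "ro.opengles.version"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

    ver = int(raw)
    major, minor = ver >> 16, ver & 0xFFFF
    status = "GPU delegate should work" if (major, minor) >= (3, 1) else "too old for the GL path"
    print(f"OpenGL ES {major}.{minor} -> {status}")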
by nolist_policy