Show HN: Apple's SHARP running in the browser via ONNX runtime web

5/3/2026 at 3:10:53 PM

A *2.4gb* ONNX? That is wild. This format continues to impress me. ONNX uses 32bit single precision floats I believe, so thats something like ~644m float params/constants. I recently dove deep 'traditional ML' side of the ONNX serialization format for the purposes of writing an JVM ML compiler for trees and regressions. ONNX actually quite clever the way it serializes trees into parallel arrays (which is then serialized using protobuf). My trees have capped out at < 32mb. I haven't dove into the neural net side of things yet, mainly because I don't have any models to run in prod.(https://github.com/exabrial/petrify if anyone is interested.)

by exabrial

5/3/2026 at 4:06:39 PM

Same, I really like the ONNX format. I only wish that they weren't so frustratingly difficult to use on Apple iOS. Their browser engine, WebKit, has become annoyingly restrictive over the years in terms of the working memory cap.

I ran into quite a few out-of-memory iOS safari issues when I was building continuous voice recognition for my blind chess game, so people could play while on the go.

by vunderba

5/3/2026 at 5:05:07 PM

Interesting, what use cases are you using onnx for btw?

by bring-shrubbery

5/3/2026 at 5:26:37 PM

So I use a VAD onnx (Silero [1]) to automatically detect when someone is talking, and then it sends the audio into one of the voice recognition libraries.

I originally tried to get away with just Whisper Tiny in the chess game [2], but it performs worse on the kinds of short phrases (knight E4, c takes d5, etc) used to dictate chess notation. Even with hotword-based phrasing and corrections, I found its accuracy on brief inputs noticeably poorer. So I switched over to Sherpa [3] trained on gigaspeech. It’s significantly more accurate, but it also comes with a correspondingly larger memory footprint.

Ideally, I would have used just one engine, but I needed a fallback for iOS devices (especially older ones) which can easily OOM.

[1] - https://github.com/snakers4/silero-vad

[2] - https://shahkur.specr.net

[3] - https://github.com/k2-fsa/sherpa-onnx

by vunderba

5/4/2026 at 2:22:25 AM

RNNoise has a VAD inbuilt that works much better than silero.

https://github.com/xiph/rnnoise

by Tsarp

5/3/2026 at 4:28:43 PM

Most ONNX files are fp32, but the ONNX format actually allows fp16, int8, etc. as well (see onnx.proto for the full list of dtypes [1] - they even have fp8/fp4 these days!). I ended up switching over to fp16 ONNX models for my own web-based inference project since the quality is ~identical and page loads get 2x faster.

[1] https://github.com/onnx/onnx/blob/main/onnx/onnx.proto#L605

by ollin

5/4/2026 at 1:40:18 AM

Thanks for the pointer actually. I need to take a look at this version of the spec.

by exabrial

5/3/2026 at 5:04:16 PM

Yeah it's pretty cool what a 2gb NN can do from a single image

by bring-shrubbery

5/3/2026 at 12:45:04 PM

I vibecoded a simple web app using Sharp that allowed be to quickly browse any local image folder and view them as "almost" volumetric 3d scenes in a VR headset.

I precomputed and cached each one so it was nearly instant. The effect - although only a crude wrapper around what Sharp already does - was quite transformative and mesmerising. Just the ease of pointing it at any folder of photos and viewing them fully spatially.

It was a bit of a mess code-wise and kinda specific to my local setup - but I should really clean it up deploy it somewhere for other people to try. Although I keep assuming someone else will do it before me and make a better job of it.

by andybak

5/3/2026 at 5:05:46 PM

Nice, would love to see it, feel free to link it here <3

by bring-shrubbery

5/3/2026 at 2:27:30 PM

I would love to try that out, if you ever make it let me know.

by SpyCoder77

5/3/2026 at 2:33:43 PM

My email is in my profile - ping me and I'll be much more likely to remember to do it.

by andybak

5/3/2026 at 12:17:55 PM

Nice, I've also been doing some similarly neat things via ONNX web at https://intabai.dev (caution, just PoC tools atm, only Chrome tested, only some mobile devices work, no filters).

I think all-client-side in-browser AI imagery is becoming very doable and has lots of privacy benefits. However ONNX web leaves a lot to be desired (I had to proto patch many pytorch conversions because things like Conv3D ops had webgpu issues IIRC). I have yet to try Apache TVM webgpu approaches or any others, but I feel if the webgpu space were more invested in, running these models would be even more feasible.

by kodablah

5/3/2026 at 5:08:40 PM

Interesting. Yeah in-browser is not the best, but getting much easier over time!

by bring-shrubbery

5/3/2026 at 12:56:53 PM

I don't like that it uses only a single photo. This means it is going to make up a lot of stuff. E.g. if I show it a photo of a poster, then it will make that poster 3D. With only two photos that problem would already be solved.

by amelius

5/3/2026 at 5:07:36 PM

Yeah I completely agree, but I think this model solves a different problem. AFAIK it's specifically there for the case where you only have one photo, but still need a 3D gaussian splat scene.

by bring-shrubbery

5/3/2026 at 1:15:08 PM

I haven't tried that specific case but - are you sure? It does get a lot of stuff right from context. I think it would probably depend how much of the frame, the poster took up.

by andybak

5/3/2026 at 1:53:13 PM

More reference images from different angles is always going to give more accurate information in 3D. From a single 2D image there is a lot of ambiguity in the context. Several different shapes in 3D can be represented in identical ways in 2D. Additional context like lighting shadows etc helps. But more real signal from more images will always be better

by deanva

5/3/2026 at 2:31:28 PM

I'm not saying it wouldn't be - because that's obvious.

by andybak

5/3/2026 at 9:35:53 PM

Agreed, wasn't arguing just trying to add additional information in case it isn't obvious to anyone

by deanva

5/3/2026 at 1:32:58 PM

Maybe, but what is wrong with wanting real depth instead of "made up depth"? One extra photo mostly solves that.

by amelius

5/3/2026 at 2:32:34 PM

1. There's many use cases where only a single photo is available

2. There are many models similar to Sharp that do accept multiple photos - but Sharp is trying to solve a specific problem. If you have multiple photos - don't use Sharp.

by andybak

5/3/2026 at 11:49:40 AM

Did not work in Firefox on Linux, but it runs on Chrome.

Have to admit, I dont get it. I tried it with 3 landscape photos I have and the results were nowhere close to the results in the demo, but that just speaks to the model.

Regardless, its very cool as a browser tech showcase.

by javier2

5/3/2026 at 5:29:03 PM

Thanks for trying it out! How much ram do you have? Pretty sure it's the only issue that can occur. The quality varies depending on the image too, so it might have been unlucky photos :(

by bring-shrubbery

5/3/2026 at 6:25:35 PM

I've been poking at running LLMs in the browser. It feels like we're definitely close (<1 year) to seeing real use cases there.

Ubiquity and coverage of devices is what will take longest. Largely dependent on how well we can shrink models with similar performance and how much we can accelerate mobile devices. This feels like it's but further (<3 years?)

by parentheses

5/4/2026 at 12:28:49 AM

I can't wait to get something like this small enough to fit into a browser extension. I already use ONNX for zoom + enhance in Ultra Zoom, zooming from 2d to 3d would be crazy.

by david_mchale

5/3/2026 at 3:12:42 PM

What are the requirements for running this? Chrome throws a whole bunch of "out of memory" errors into the console when I try to execute these. I'm guessing 4GiB of VRAM is not enough?

by jeroenhd

5/3/2026 at 5:11:27 PM

Ahh, yeah I forgot to mention it. The model is 2.5gb so I assume you'd need at least free 3gb with all the surroudning stuff, with the rest of your system using more ram I'd guess 4gb is way too low - maybe even 8gb would be in some occasions.

I personally tested it on 32gb Apple M2, and it's able to run much heavier stuff.

by bring-shrubbery

5/4/2026 at 8:55:14 AM

My laptop's GPU has 4GiB of VRAM but it still failed to allocate enough memory. Seems like I'll have to pass on demos like these until someone figures out a way to use less (V)RAM to accomplish these kinds of things.

by jeroenhd

5/3/2026 at 1:54:40 PM

This is cool. For practitioners, What’s the current state of the art for free form multi picture to splat? The last time I looked at it the pipeline was pretty janky and included a few separate steps.

by vessenes

5/3/2026 at 5:29:38 PM

For multi-photo, the go-to is still the original 3D Gaussian Splatting (Kerbl et al., 2023) — most consumer tools like Polycam, Luma, and Postshot wrap that under the hood.

by bring-shrubbery

5/3/2026 at 11:44:26 PM

Does it use both WASM and WebGPU?

by sroussey

5/3/2026 at 1:22:27 PM

Loading the model crashes my browser tab from memory usage :/

by herpdyderp

5/3/2026 at 5:26:08 PM

Yeah, I think you need at least 8gb ram unfortunately, but I tested it only on a 32gb M2, so 8gb might also not be enough.

I might create a compressed version of the model, that would work on low-ram machines.

by bring-shrubbery

5/3/2026 at 5:56:07 PM

I've worked around lower RAM machines with ONNX web models by first separating .onnx from .onnx_data, and second having scripts that split up the "layers" and shards the run (e.g. https://huggingface.co/cretz/Z-Image-Turbo-ONNX-sharded). Then you can have the runtime only run one at a time. I don't understand the details too deep, but Claude is good at writing scripts to shard onnx protos.

by kodablah

5/3/2026 at 6:24:53 PM

It froze up my computer, had to hard-boot lol

(16MB M1 Macbook, Chrome)

by anentropic

5/3/2026 at 2:42:41 PM

Are there any examples one could view before downloading?

by andruby

5/3/2026 at 5:24:35 PM

There's nothing to download - you can run everything from your browser, and the photo you upload is not uploaded anywhere, it stays in your browser.

by bring-shrubbery

5/4/2026 at 6:48:55 AM

The point is what does a typical result look like, especially before loading a 2GB model or making a browser crash.

Model results https://apple.github.io/ml-sharp/

by utopiah

5/3/2026 at 6:04:21 PM

> inference itself is a few seconds on a recent Mac

This is impressive as hell

Very cool demo. It works in about ~9 seconds on my machine.

A few asks if you're going to devote more time to the project: can you make a full orbital camera - it seems to not be able to orbit 360? Also, can you use double click drag to move the camera in non-orbiting mode for view refinement? (Super minor nitpicks - this demo is really cool.)

> Caveats: SHARP's released weights are research-use only (Apple's model license, not the code's).

Nobody should GAF about this. We have all the major players distilling each other in the open. This gives Apple the ability to slap you with lawyers, but in practice you'll often get more done if you just break the rules.

Do you know of any other image-to-splat models? WorldLabs has a few versions of their Marble model, and the Tencent Hunyuan team just released HyWorld as open weights:

https://github.com/Tencent-Hunyuan/HY-World-2.0

HyWorld looks to be SOTA and better than all the other players.

Apple's Sharp is awesome in that it is fast, but it only generates a small depth sample from the image. There are no back faces or splats, so if you move the camera even slightly from the original perspective, you'll see lots of holes.

by echelon

5/3/2026 at 3:25:35 PM

Why is it so large? Is it the same model used to create 3D effects on iOS lockscreen?

by zb3

5/3/2026 at 5:02:00 PM

I think it's essentially a transformer, so it just stores a bunch of weights. As a model that's supposed to be able to convert any image to 3d scene, it's pretty nice size actually.

Regarding the ios lockscreen - I believe they are different models. I think Apple use this one to generate those Vision Pro 3d photos though, but I'm not too sure.

by bring-shrubbery

5/3/2026 at 6:18:05 PM

No the 3D effects are much more simple for the lock screen…more akin to old-school animation via layers

by MattDamonSpace

5/3/2026 at 11:34:06 AM

[dead]

by tokenhub_dev

5/3/2026 at 12:58:45 PM

[flagged]

by takahitoyoneda

5/3/2026 at 1:52:06 PM

[dead]

by eddyaipt

5/3/2026 at 6:29:15 PM

[flagged]

by hottrends

5/3/2026 at 10:30:23 AM

[flagged]

by Grappelli

5/3/2026 at 12:48:11 PM

[dead]

by jt543ujtfrry

5/3/2026 at 1:13:49 PM

[dead]

by tim0414