alt.hn

2/26/2026 at 1:15:25 AM

Show HN: ZSE – Open-source LLM inference engine with 3.9s cold starts

https://github.com/Zyora-Dev/zse

by zyoralabs

2/26/2026 at 8:00:36 AM

A 32B model in 19.3GB is really cool imo. Memory and cold start are what gate production deployments.

I did a piece (1) on how Netflix and Spotify worked this out a while ago: cheap classical methods handle 90%+ of their recommendation requests, and LLMs only get called when the payoff justifies it.

(1) https://philippdubach.com/posts/bandits-and-agents-netflix-a...
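The pattern above (cheap path by default, expensive model only when it's worth it) can be sketched as a simple cost-aware cascade. Everything here is illustrative, not from Netflix's or Spotify's actual systems: `cheap_score` stands in for a classical recommender, `expensive_score` for an LLM call, and the band thresholds are made up.

```python
# Minimal sketch of a cost-aware cascade: a cheap classical scorer handles
# most requests, and the expensive model is only invoked when the cheap
# score is uncertain enough that the extra quality justifies the cost.

def cheap_score(request: dict) -> float:
    """Stand-in for a classical recommender (e.g. matrix factorization)."""
    return request.get("affinity", 0.5)

def expensive_score(request: dict) -> float:
    """Stand-in for an LLM call; assumed far costlier per request."""
    return min(1.0, request.get("affinity", 0.5) + 0.2)

def route(request: dict, low: float = 0.3, high: float = 0.7) -> tuple[str, float]:
    """Escalate only when the cheap score falls in the uncertain band."""
    s = cheap_score(request)
    if s < low or s > high:  # confident either way: cheap path suffices
        return ("cheap", s)
    return ("expensive", expensive_score(request))

requests = [{"affinity": a} for a in (0.1, 0.5, 0.9, 0.2, 0.6)]
routed = [route(r)[0] for r in requests]
# Most traffic stays on the cheap path; only uncertain cases escalate.
print(routed)
```

Tuning the band width is the whole game: widen it and quality goes up while the LLM bill grows, narrow it and the reverse happens.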

by 7777777phil

2/26/2026 at 6:25:37 PM

If you don't mind a stupid question, is this essentially dynamic quantization? I'm trying to understand how this is different from using a regular quantized model to squeeze more parameters into less RAM.

by HanClinto

2/26/2026 at 7:19:43 AM

Discussion on reddit: https://www.reddit.com/r/LocalLLaMA/comments/1rewis9/removed...

by reconnecting

2/26/2026 at 8:00:53 AM

Sorry, this post has been removed by the moderators of r/LocalLLaMA.

Classic reddit..

by 7777777phil

2/26/2026 at 8:57:08 AM

> Classic reddit..

That sub used to be the absolute best place to get the latest in LLM developments. The worst thing that happened to the sub was karpathy making it popular with a tweet. Since then it's been overrun by a whole bunch of drama, toxic behaviour and useless bots, and the quality content has cratered.

There was a mod crisis and new mods came in with really weird stuff (integrations with Discord and such), lots of bots became active with useless posts and "engagement" bait, the Chinese labs are all fighting each other over who's better every time there's a release, Claude-induced manias about "papers" this and "zenodo" that (everyone is a researcher now, everyone is inventing a subquadratic attention, led by Claude-hallucinated stuff), they have an obsession with "local only", leading to removing any discussion about SotA (which is entirely counterproductive), and so on.

by NitpickLawyer

2/26/2026 at 9:55:23 AM

This seems excellent if not revolutionary, just what I've been looking for, but GPU support didn't work on my M1 and M1 Max. Is there a way to support Apple M series processors? That would be greatly appreciated. I don't have experience with this kind of programming and didn't get very far with ChatGPT.

On the M1 Max, it says 14.8GB free / 32.0GB total, but "No GPU detected", and "What Can You Run? (ZSE Ultra Mode)" only says "7B GPU + CPU Hybrid", nothing else.

by cipher-108

2/26/2026 at 2:36:07 AM

This is so freaking awesome. I am working on a project trying to run 10 models on two GPUs, and loading/offloading is the only solution I have in mind.

Will try getting this deployed.

Are the advertised cold start timings for a condition where no other model is loaded on the GPUs?
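For the "10 models, two GPUs" case, the load/offload idea usually comes down to an LRU pool: keep models resident up to a memory budget and evict the least-recently-used one before loading the next. This is a generic sketch of that scheduling logic, not ZSE's implementation; `ModelPool` and its sizes are hypothetical.

```python
from collections import OrderedDict

class ModelPool:
    """Toy LRU pool of loaded models under a fixed GPU memory budget."""

    def __init__(self, budget_gb: float):
        self.budget_gb = budget_gb
        self.loaded: OrderedDict[str, float] = OrderedDict()  # name -> size in GB

    def _used(self) -> float:
        return sum(self.loaded.values())

    def acquire(self, name: str, size_gb: float) -> str:
        """Return 'warm' on a resident hit, 'cold' when a load was needed."""
        if name in self.loaded:
            self.loaded.move_to_end(name)  # refresh LRU position
            return "warm"
        # Offload least-recently-used models until the new one fits.
        while self.loaded and self._used() + size_gb > self.budget_gb:
            self.loaded.popitem(last=False)
        self.loaded[name] = size_gb  # cold start: load weights from disk
        return "cold"

pool = ModelPool(budget_gb=24.0)
print(pool.acquire("model-a", 14.0))  # cold: first load
print(pool.acquire("model-b", 14.0))  # cold: evicts model-a to fit
print(pool.acquire("model-b", 14.0))  # warm: already resident
```

With 10 models contending, the hit rate of this pool (and hence how often you pay the cold start) is driven almost entirely by request locality and the budget size.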

by medi_naseri

2/26/2026 at 9:10:31 AM

Are you using GPU memory snapshotting of the model for this?

by mzl