3/9/2026 at 9:07:57 PM
Hi @fatihturker – exciting project if it works! I have a MacBook Pro M1 Max w/64 GB RAM, and a Mac Studio M3 Ultra w/96 GB RAM. What do you think is possible to run on these? Just curious before I really try it out.
by pcf
3/9/2026 at 6:47:40 AM
by fatihturker
3/9/2026 at 7:20:49 PM
Fascinating. I don't understand the technical terms, but running a big coding agent locally is a dream of mine, so I thank you for your efforts!
by deflator
3/9/2026 at 3:19:20 PM
Running a Mac Mini M4 as a home server for a bunch of automation scripts right now. The mmap-based layer streaming is the part I'm most curious about -- how does latency look when you're streaming layers from disk mid-inference? I'd expect throughput to degrade sharply once you exceed unified memory, but maybe the Top-K sparsity masks enough of the weight accesses that it's not as bad as sequential streaming would be. What's the actual tokens/sec at 140B scale on the base Mac Mini config?
by ryanholtdev
3/9/2026 at 4:17:24 PM
Yeah... https://github.com/opengraviton/graviton?tab=readme-ov-file#...
the benchmarks don't show any results for using these larger-than-memory models, only the size difference
it all smells quite sloppy
by anentropic
3/9/2026 at 8:24:04 PM
What I could find in the readme shows: ~19 tok/s for Apple M1 Max (64 GB) with TinyLlama-1.1B-Chat-v1.0
by hu3