4/16/2026 at 8:30:25 PM
So it's basically just OpenRouter with Cloudflare Argo networking? I feel like they could do so much more interesting stuff with their Replicate acquisition. Application-specific RL is getting so good, but there's no good way to deploy these models scalably. Even providers like Fireworks, which claim to let you deploy LoRAs in a scalable way, can't actually do it. For now I literally have to host the base load for my application on a rack of 3090s in my garage, which seems silly, but it saves me $1k a month.
by mips_avatar
4/17/2026 at 7:38:46 AM
Running a rack of 3090s in your garage to avoid provider lock-in/costs is the most Hacker News thing. Out of curiosity, what are you doing for uptime/failover? If you are running production traffic to that garage rack, does your app just degrade gracefully if your home internet drops, or do you have a cloud fallback?
by bryden_cruz
4/17/2026 at 4:59:20 PM
Yeah, the model I'm running locally is just one of several models the app supports, and it falls back to the others if it's not available.
by mips_avatar
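A minimal sketch of that kind of fallback routing, assuming OpenAI-compatible endpoints on every backend; the URLs and model names below are hypothetical placeholders, not the commenter's actual setup:

    import openai

    # Ordered by preference: local garage rack first, hosted fallbacks after.
    # All endpoint URLs and model names are hypothetical placeholders.
    BACKENDS = [
        {"base_url": "http://garage-rack.local:8000/v1", "model": "local-moe-24b"},
        {"base_url": "https://openrouter.ai/api/v1", "model": "some/hosted-model"},
    ]

    def complete(prompt: str) -> str:
        last_err = None
        for backend in BACKENDS:
            try:
                client = openai.OpenAI(base_url=backend["base_url"], api_key="...")
                resp = client.chat.completions.create(
                    model=backend["model"],
                    messages=[{"role": "user", "content": prompt}],
                    timeout=10,  # fail fast so the fallback actually kicks in
                )
                return resp.choices[0].message.content
            except Exception as err:  # connection refused, timeout, 5xx, ...
                last_err = err
        raise RuntimeError("all backends failed") from last_err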
4/17/2026 at 8:27:29 AM
[flagged]
by handfuloflight
4/17/2026 at 3:30:07 AM
Gilfoyle? Is that you?
by jonfromsf
4/17/2026 at 4:24:19 AM
I think these GPUs were actually used for Bitcoin mining before I bought them.
by mips_avatar
4/17/2026 at 4:54:59 PM
It's Anton's grandson!
by menno-dot-ai
4/16/2026 at 10:00:40 PM
Curious: which models are you able to run, and how many 3090s do they require at scale?
by vladgur
4/16/2026 at 10:20:55 PM
4 3090s, with NVLink on each pair. Super fast inference on MoE models around 20-36B.
by mips_avatar
4/17/2026 at 4:00:46 PM
> Super fast inference

How fast is "super fast" exactly, and with what runtime+model+quant specifically? Curious to see how 4x 3090s compare to 1x Pro 6000. You could probably put together 4x 3090s for a fraction of the cost of the Pro 6000, but every time I've seen the in/out tok/s for multi-GPU setups my heart drops a little.
by embedding-shape
4/17/2026 at 4:43:03 PM
I haven't benchmarked against a Pro 6000; it's more that I have 4 3090s and I don't have a Pro 6000.
by mips_avatar
4/17/2026 at 5:00:25 PM
Yes, that's why I'm asking what exactly 4 3090s get you in prompt processing and generation; sorry if I was unclear.
by embedding-shape
4/17/2026 at 7:29:25 PM
Maxes out around 4K tok/s output. Each pair of 3090s runs its own instance of the model, with parallelism across the NVLink bridge, though NVLink is only about 2x the bandwidth of PCIe 5.0.
by mips_avatar
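For reference, standing up one such paired instance with vLLM's offline API looks roughly like this; the model name and GPU pinning are assumptions for illustration, not the commenter's actual config:

    import os

    # Pin this instance to one NVLinked pair before CUDA initializes;
    # a second instance on the other pair would use "2,3".
    os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0,1")

    from vllm import LLM, SamplingParams

    # Model name is a hypothetical placeholder for a ~20-36B MoE checkpoint.
    llm = LLM(
        model="some-org/moe-24b-instruct",
        tensor_parallel_size=2,        # shard weights across the two 3090s
        gpu_memory_utilization=0.90,
    )

    out = llm.generate(["Hello"], SamplingParams(max_tokens=64))
    print(out[0].outputs[0].text)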
4/17/2026 at 11:39:21 AM
The interesting part is that you can use the same API with Workers AI models (hosted at the edge) and proxied models (OpenRouter-style).

Disclaimer: I work at Cloudflare, but not on this.
by ascorbic
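A rough sketch of what "same API" means in practice, using an OpenAI-compatible client; the base URL and both model identifiers are hypothetical placeholders, not Cloudflare's documented endpoint:

    import openai

    # Hypothetical gateway endpoint: an edge-hosted model and a proxied
    # third-party model are both addressed through the same client.
    client = openai.OpenAI(
        base_url="https://gateway.example.com/v1",  # placeholder URL
        api_key="...",
    )

    for model in ["@cf/some-edge-model", "some-provider/proxied-model"]:  # placeholders
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": "ping"}],
        )
        print(model, "->", resp.choices[0].message.content)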
4/17/2026 at 11:11:47 PM
It's the same problem as Fireworks: the only models supporting LoRA are year-old dense models that perform horribly on most tasks. If you want to do anything close to relevant, you still need to rent/own dedicated GPUs, which seems insane to me when vLLM fully supports dynamic LoRA loading.
by mips_avatar
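The vLLM LoRA support being described looks roughly like this with the offline API; the base model name and adapter path are hypothetical placeholders:

    from vllm import LLM, SamplingParams
    from vllm.lora.request import LoRARequest

    # Base model and adapter path are hypothetical placeholders.
    llm = LLM(model="some-org/base-model-7b", enable_lora=True)

    # Each request can point at a different adapter; vLLM loads and
    # batches them dynamically instead of needing a dedicated deployment
    # per fine-tune.
    out = llm.generate(
        ["Summarize this ticket..."],
        SamplingParams(max_tokens=128),
        lora_request=LoRARequest("my-app-adapter", 1, "/path/to/adapter"),
    )
    print(out[0].outputs[0].text)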