3/1/2026 at 2:11:50 AM
Cool that it's possible, but the performance characteristics are basically unusable. For an 8192-token prompt they report a ~1.5-minute time-to-first-token and then 8.3 tk/s from there. For context, ChatGPT is typically <<1 s TTFT and ~50 tk/s.
by ibeckermayer
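A quick back-of-the-envelope check on those figures (a Python sketch; the prompt size, TTFT, and decode speed are taken from the comment, and the 500-token answer length is an illustrative assumption):

```python
# Rough throughput arithmetic for the numbers quoted above.
prompt_tokens = 8192
ttft_s = 90.0            # ~1.5 minute time-to-first-token
decode_tok_s = 8.3       # reported generation speed

prefill_tok_s = prompt_tokens / ttft_s
print(f"prefill throughput ≈ {prefill_tok_s:.0f} tok/s")  # ≈ 91 tok/s

# Wall time for a hypothetical 500-token answer after prefill:
answer_tokens = 500
total_s = ttft_s + answer_tokens / decode_tok_s
print(f"total wall time ≈ {total_s / 60:.1f} min")  # ≈ 2.5 min
```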
3/1/2026 at 12:54:04 PM
I've never understood the obsession with tokens/s. I'm fine with asking a question and then going on to another task (which might be making coffee).

Even with a cloud-based LLM, where the response is pretty snappy, I still find that I wander off and return when I am ready to digest the entire response.
by JKCalhoun
3/1/2026 at 8:53:50 PM
Your workflow is unusual; oftentimes there is a vigorous back-and-forth, or a desired output like code generation, where a low tk/s drastically affects UX and user productivity.

But the real kicker here is the 90 s TTFT: you ask a question and then see nothing for a full minute and a half.
by ibeckermayer
3/1/2026 at 1:59:21 PM
You are fine with it, but maybe the rest of the world is not. Anyway, to compare performance and benchmark systems we need metrics, and this is one of the basic metrics to measure.
by nitinreddy88
3/1/2026 at 8:44:33 AM
Given that the APU only has 4 memory channels, isn't this setup comically starved for bandwidth? By the same token, wouldn't you expect performance to scale approximately linearly as you add additional boxes? And wouldn't you be better off with smaller nodes (i.e. less RAM and CPU power per box)?

If I'm right about that, then if you're willing to go in for somewhere in the vicinity of $30k (24x the Max 385 model), you should be able to achieve ChatGPT performance.
by fc417fc802
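The linear-scaling hypothesis above is easy to spell out (a Python sketch; perfect linear scaling is the assumption being floated here, not an established result, and the baseline numbers come from the top comment):

```python
# Baseline single-box figures reported upthread.
baseline_ttft_s = 90.0
baseline_decode_tok_s = 8.3

n_boxes = 24  # ~$30k at 24x the single-box price, per the comment

# Under (idealized) perfect linear scaling across boxes:
scaled_ttft_s = baseline_ttft_s / n_boxes
scaled_decode_tok_s = baseline_decode_tok_s * n_boxes
print(f"TTFT ≈ {scaled_ttft_s:.2f} s")             # ≈ 3.75 s
print(f"decode ≈ {scaled_decode_tok_s:.0f} tok/s")  # ≈ 199 tok/s
```

Whether real clusters get anywhere near this ideal is exactly what the replies below dispute.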
3/1/2026 at 9:06:43 PM
Good thought... I think you're wrong, because the dominant factor is bandwidth over the interconnect. In this case they're using 5 Gbps Ethernet; compare that to 80-120 Gbps for a Thunderbolt 5-connected Mac Studio cluster: https://www.youtube.com/watch?v=bFgTxr5yst0
by ibeckermayer
3/1/2026 at 9:52:40 PM
> I think you're wrong because the dominant factor is bandwidth over the interconnect.

Is it? Why do you say that? I understand inference to be almost entirely bottlenecked on memory bandwidth.
There are n^2 weights per layer but only n state values in the vector that exists between layers. Transmitting a few thousand (or even tens of thousands) of fp values does not require a notable amount of bandwidth by modern standards.
Training is an entirely different beast of course. And depending on the workload latency can also impact performance. But for running inference with a single query from a single user I don't see how inter-node bandwidth is going to matter.
by fc417fc802
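The n^2-weights-versus-n-activations argument can be made concrete with a rough estimate (a sketch; the hidden size, layer count, and fp16 precision are illustrative assumptions loosely sized to a large dense model, while the 5 Gbps link is the Ethernet interconnect mentioned upthread):

```python
hidden = 8192        # hypothetical hidden dimension n
bytes_per_val = 2    # fp16

# Per generated token, each node reads its local weights from RAM:
# roughly 12 * n^2 parameters per transformer layer
# (~4n^2 for attention projections, ~8n^2 for a 4x-expansion MLP).
weight_bytes_per_layer = 12 * hidden**2 * bytes_per_val

# But only the n activation values cross the wire at a pipeline split.
activation_bytes = hidden * bytes_per_val

link_gbps = 5.0      # the 5 Gbps Ethernet interconnect
transfer_us = activation_bytes * 8 / (link_gbps * 1e9) * 1e6

print(f"weights read per layer ≈ {weight_bytes_per_layer / 1e6:.0f} MB")
print(f"activations per hop ≈ {activation_bytes / 1024:.0f} KiB, "
      f"≈ {transfer_us:.0f} µs on the link")
```

On these assumptions each node reads on the order of gigabytes of weights per token from local RAM, while each inter-node hop moves only tens of kilobytes, which is the sense in which single-user decode is memory-bandwidth-bound rather than interconnect-bound.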