alt.hn

6/1/2026 at 6:43:51 PM

When does fragmentation occur in the CUDA caching allocator?

https://docs.pytorch.org/devlogs/eager/2026-06-01-cuda-caching-allocator/

by matt_d

6/4/2026 at 4:57:18 AM

In LLM serving, I treat the failure mode at the end of the post (long-lived blocks interleaved with short-lived ones, which expandable segments still can't merge across) as the steady state, not an edge case: weights and graph buffers sit forever while per-request KV churns. So I've stopped relying on the caching allocator for KV at all. vLLM reserves one big region at startup and pages fixed-size KV blocks itself, so the allocator never sees the churn. Same fragmentation problem, solved one layer up.
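
Rough sketch of the idea in plain Python, to show why fixed-size paging sidesteps the interleaving problem. This is not vLLM's actual code: `BlockPool`, `allocate`, and `release` are hypothetical names, and the list of integers stands in for one large device buffer reserved once at startup.

```python
class BlockPool:
    """Toy fixed-size block pool (hypothetical, for illustration only)."""

    def __init__(self, num_blocks: int, block_tokens: int):
        self.block_tokens = block_tokens
        # Stand-in for a single big reservation made once at startup;
        # each integer is the index of one fixed-size KV block inside it.
        self.free = list(range(num_blocks))
        self.tables = {}  # request id -> block ids owned by that request

    def allocate(self, req_id, num_tokens: int):
        # Ceiling division: number of blocks needed to hold num_tokens.
        needed = -(-num_tokens // self.block_tokens)
        if needed > len(self.free):
            raise MemoryError("KV block pool exhausted")
        # Any free block will do; blocks need not be contiguous.
        blocks = [self.free.pop() for _ in range(needed)]
        self.tables.setdefault(req_id, []).extend(blocks)
        return blocks

    def release(self, req_id):
        # Return every block owned by the request to the free list.
        self.free.extend(self.tables.pop(req_id, []))


pool = BlockPool(num_blocks=8, block_tokens=16)
a = pool.allocate("a", 40)   # short-lived request
b = pool.allocate("b", 16)   # long-lived request sitting "in between"
pool.release("a")            # a's blocks go back to the free list
c = pool.allocate("c", 100)  # reuses a's scattered blocks plus the rest
```

Because every block is the same size and a request's blocks don't have to be adjacent, a freed block is always reusable no matter what long-lived neighbors surround it. The cost is an indirection (the per-request block table) and some internal waste in each request's last block.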

by keynha