2/17/2026 at 11:34:18 PM
We have been curious about this for RAPIDS cuDF. We built the Graphistry stack 10 years ago (!) for GPU-native end-to-end acceleration - data loading, wrangling, analytics, enrichment, viz, etc., all the way from server GPUs to client GPUs - but this has been a huge sticking point: major constant overheads for smaller workloads that seem avoidable.

Essentially, we solved the problem of writing our stack in a bulk-oriented way that Nvidia kernels can optimize. Think Apache Arrow, pure vectorized dataframe pipelines, etc. However, cuDF is 'eager', with per-step CPU/GPU control-plane coordination even when the data plane lives on the GPU. Polars in theory moves to lazy scheduling, which can allow deforesting optimizations that batch work into more bulk GPU-side control macro steps, but in practice it hasn't. Nvidia's efforts to cut Python asyncio costs for multitenant etc. flows didn't pan out either. So enabling moving more to the GPU here is super interesting.
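To make the eager-vs-lazy distinction concrete, here's a minimal pure-Python sketch (the `Lazy` class and its methods are hypothetical illustrations, not the cuDF or Polars API): eager execution materializes an intermediate result per step, roughly like one kernel launch plus a control-plane round trip per op, while a lazy plan records the ops first and can fuse ("deforest") the whole chain into a single pass before dispatching one bulk step.

```python
# Hypothetical sketch of eager vs. lazy/fused evaluation.
# Not cuDF or Polars code -- just the scheduling idea.

def eager_pipeline(xs):
    # Eager: each step allocates an intermediate buffer,
    # analogous to per-op CPU/GPU control-plane coordination.
    a = [x * 2 for x in xs]   # step 1: intermediate
    b = [x + 1 for x in a]    # step 2: another intermediate
    return b

class Lazy:
    """Tiny expression plan: records ops, evaluates in one fused pass."""
    def __init__(self, ops=None):
        self.ops = ops or []

    def map(self, f):
        # Building the plan is free: no data is touched yet.
        return Lazy(self.ops + [f])

    def collect(self, xs):
        # One pass over the data, no intermediates -- the kind of
        # fusion a lazy scheduler can do before a single bulk dispatch.
        out = []
        for x in xs:
            for f in self.ops:
                x = f(x)
            out.append(x)
        return out

plan = Lazy().map(lambda x: x * 2).map(lambda x: x + 1)
assert eager_pipeline([1, 2, 3]) == plan.collect([1, 2, 3]) == [3, 5, 7]
```

Both paths compute the same result; the difference is that the lazy plan sees the whole pipeline at once, which is what opens the door to fewer, bigger GPU-side steps.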
Will be watching!
by lmeyerov