3/5/2026 at 9:30:04 PM
I had success on a similar problem by allocating native buffers for the matrices, then using a basic CUDA call. The actual work was 100x faster than my CPU baseline.The bottleneck of course was fetching & loading relevant data to memory to start with.
by lukev