
Conversation

@David-McKenna
Contributor

Hey Cees,

Not the patch I was intending to upstream next, but I've been having endless issues with OpenMP forcing itself to be serial, so here's a quick win I came across.

cuFFT allows you to manage its work-area memory yourself, so here we allocate only the single larger block needed for the cuFFT operations rather than allocating a separate block for each FFT. This saves me around 2 GB of VRAM in my normal configuration. I also ran a test against the configuration you've mentioned in the past (20 subbands, I believe): it should save ~1 GB in your case, so you could increase nforward or sample even more DMs.

Cheers,
David

This is achieved by sharing the work area between the cuFFT operations (since they are never executed in parallel) and allocating only the memory needed by the largest operation (though in this case they -should- be the same size).

References: https://docs.nvidia.com/cuda/cufft/index.html#function-cufftgetsizemany and https://docs.nvidia.com/cuda/cufft/index.html#function-cufftsetautoallocation, plus the further functions in section 3.7.
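For readers unfamiliar with manual cuFFT work-area management, the pattern referenced above can be sketched as follows. This is not the PR's actual code; the transform length, batch count, and variable names are hypothetical placeholders, and the PR itself uses the batched-plan variants (e.g. `cufftGetSizeMany`) rather than the simple 1D plan shown here.

```cuda
// Sketch: share one scratch work area between two cuFFT plans that are
// only ever executed sequentially, instead of letting cuFFT allocate a
// separate work area per plan.
#include <cufft.h>
#include <cuda_runtime.h>
#include <stddef.h>

int main(void) {
    cufftHandle fwdPlan, bwdPlan;
    size_t fwdSize = 0, bwdSize = 0;

    // Create the plans, but disable cuFFT's automatic work-area allocation.
    cufftCreate(&fwdPlan);
    cufftCreate(&bwdPlan);
    cufftSetAutoAllocation(fwdPlan, 0);
    cufftSetAutoAllocation(bwdPlan, 0);

    // Hypothetical transform parameters: 1D batched C2C FFTs.
    int n = 1 << 20;  // FFT length (placeholder)
    int batch = 16;   // batch count (placeholder)

    // Planning reports how much work-area memory each plan needs.
    cufftMakePlan1d(fwdPlan, n, CUFFT_C2C, batch, &fwdSize);
    cufftMakePlan1d(bwdPlan, n, CUFFT_C2C, batch, &bwdSize);

    // Allocate a single block sized for the larger of the two (here they
    // should be equal, as the comment above notes) and attach it to both.
    size_t workSize = fwdSize > bwdSize ? fwdSize : bwdSize;
    void *workArea = NULL;
    cudaMalloc(&workArea, workSize);
    cufftSetWorkArea(fwdPlan, workArea);
    cufftSetWorkArea(bwdPlan, workArea);

    // ... run cufftExecC2C(fwdPlan, ...) and cufftExecC2C(bwdPlan, ...)
    // strictly one after the other -- they must never run concurrently,
    // since they now share the same scratch space ...

    cufftDestroy(fwdPlan);
    cufftDestroy(bwdPlan);
    cudaFree(workArea);
    return 0;
}
```

The key constraint is the one stated above: sharing the work area is only safe because the two transforms never execute in parallel.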
