CuArray Performance #56
Comments
I'm not surprised to see that a native single-GPU 3D FFT implemented in CUDA is way more efficient than the PencilFFTs version. Note that in PencilFFTs, a 3D FFT is implemented as a series of three 1D FFTs (even on a single GPU), with data transpositions happening between FFTs to make sure that FFTs are performed along the contiguous axis. I don't know whether CUFFT does something similar, but I'm pretty sure their implementation is way more optimised, and it will be hard to beat them. This issue is somewhat similar to #35 for CPUs, but for GPUs the differences are much larger. There is still room for optimising support for GPU arrays in PencilArrays/PencilFFTs. It would be great if you could take a look at where the time is actually spent, as I already suggested in a previous issue, so that we can identify a path to possible improvements.
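To make the decomposition above concrete, here is a minimal single-process sketch of the idea (three 1D FFTs, each applied along the contiguous last axis after a transposition). This uses NumPy as a stand-in for illustration only; none of these names are PencilFFTs API:

```python
import numpy as np

def fft3_by_pencils(u):
    """3D FFT built from three 1D FFTs, each along the last axis.

    Between transforms, the target axis is moved into last position
    by a transposition, mimicking the pencil-decomposition strategy
    (transform along the contiguous axis, transpose, repeat).
    """
    v = np.fft.fft(u, axis=-1)                                        # axis 2
    v = np.fft.fft(v.transpose(0, 2, 1), axis=-1).transpose(0, 2, 1)  # axis 1
    v = np.fft.fft(v.transpose(2, 1, 0), axis=-1).transpose(2, 1, 0)  # axis 0
    return v

rng = np.random.default_rng(0)
u = rng.standard_normal((8, 4, 4)) + 1j * rng.standard_normal((8, 4, 4))

# The composition of the three 1D transforms matches a direct 3D FFT.
assert np.allclose(fft3_by_pencils(u), np.fft.fftn(u))
```

In the distributed setting, the transpositions become MPI all-to-all communications between pencil configurations, which is where much of the cost goes; a monolithic single-GPU 3D FFT avoids those reshuffles entirely.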
I am not sure if this is another "direction" for solving the issue. P.S. https://developer.nvidia.com/blog/multinode-multi-gpu-using-nvidia-cufftmp-ffts-at-scale/
Thanks for the link. Their results look quite impressive! I think wrapping the multi-node CUDA FFT is the way to go. What's a bit annoying is that cuFFTMp is, for now, in early access, which apparently means that you need to ask NVIDIA for permission to use it. Assuming we have access to cuFFTMp, I guess the first thing would be to wrap the cuFFTMp functions ourselves, and either add that to CUDA.jl or to a separate package (PencilFFTsCUDA.jl?). Secondly, we'd need to think about how to interface this library with PencilArrays/PencilFFTs...
We also have a similar situation. Our approach is to wrap cufftXt first (cufftXt.jl? still at the exploration stage) and wait for cuFFTMp to mature, although cufftXt only supports single-node multi-GPU computation (max 16 GPUs). Btw, for the interface, there was a discussion on whether CUDA.jl should wrap cufftXt. It turned out those APIs aren't sufficient to implement broadcasting, and they would need to deal with more general cases such as indexing. But that shouldn't be an issue for us, as the package already does the job for indexing.
Firstly, thanks @jipolanco for PencilFFTs! It's a really nice package, and so clearly documented that someone relatively new to Julia can get up to speed quickly. Just thought I'd share a link to a relatively new NVIDIA library. As well as being relatively new to Julia, I'm also very new to using GPUs, but it might be interesting to compare performance between the two libraries.
Hi @chowland, thank you for your kind words. It would be great to have a comparison with it.
It seems that the `PencilFFTs` speed is slower than only using one GPU with `CUFFT`. I'm using CUDA 11.7, OpenMPI 4.1.4, UCX 1.13 (without gdrcopy) and Julia 1.7.3. Data dim is `(8192, 32, 32)`.

For `CUFFT` with a single GPU:

For `PencilFFTs` with a single GPU:

For `PencilFFTs` with 4 GPUs in the same node:

Benchmark files: `CUFFT`: `PencilFFTs`:
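The timings and benchmark scripts themselves did not survive in this copy of the thread. As a rough illustration of the kind of timing loop such a comparison uses, here is a hypothetical CPU-side sketch in NumPy (a stand-in, not the actual GPU benchmark; the array size is reduced from the `(8192, 32, 32)` used in the issue):

```python
import time
import numpy as np

# Reduced from the (8192, 32, 32) grid used in the issue, for illustration.
shape = (512, 32, 32)
rng = np.random.default_rng(1)
u = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

# Warm up once so plan creation / first-call overheads are excluded,
# then time several repetitions and report the mean.
np.fft.fftn(u)
nrep = 10
t0 = time.perf_counter()
for _ in range(nrep):
    v = np.fft.fftn(u)
t1 = time.perf_counter()
print(f"fftn on {shape}: {(t1 - t0) / nrep * 1e3:.2f} ms per transform")
```

On the GPU the same structure applies, except that each timed call must be synchronized (e.g. with `CUDA.@sync` in Julia) so that asynchronous kernel launches are not mistaken for completed transforms.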