
CuArray Performance #56

Open
Lightup1 opened this issue Jul 20, 2022 · 6 comments

@Lightup1

It seems that PencilFFTs is slower than just using CUFFT on a single GPU.
I'm using CUDA 11.7, OpenMPI 4.1.4, UCX 1.13 (without gdrcopy) and Julia 1.7.3.
The data dimensions are (8192, 32, 32).
For CUFFT with a single GPU:

Dimension 8192,32,32
start fft benchmark
complete fft benchmark
FFT benchmark results
BenchmarkTools.Trial: 1521 samples with 1 evaluation.
 Range (min … max):  2.044 ms … 51.419 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.411 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.737 ms ±  2.174 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄█▆▃                                                        
  █████▇▄▁▄▃▃▄▃▄▅▃▁▁▁▁▃▁▃▄▁▁▁▃▁▃▃▁▄▁▃▁▁▁▁▃▃▃▁▁▃▁▁▁▁▁▁▁▁▃▁▃▃▅ █
  2.04 ms      Histogram: log(frequency) by time     14.5 ms <

 Memory estimate: 2.94 KiB, allocs estimate: 49.

For PencilFFTs with a single GPU:

rank:0GPU:CuDevice(0)
has-cuda:true
data size:(8192, 32, 32)
Start data allocation
BenchmarkTools.Trial: 100 samples with 1 evaluation.
 Range (min … max):   93.965 ms … 300.333 ms  ┊ GC (min … max): 0.00% … 4.89%
 Time  (median):     173.977 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   181.007 ms ±  39.066 ms  ┊ GC (mean ± σ):  0.08% ± 0.49%

                 █▆     ▂▂  ▃                                    
  ▄▁▁▁▁▁▁▁▁▁▁▅▇▄▁██▇▇▇▄███▄██▇▅█▅▅▁▅▅▁▁█▅▄▇▅▅▁▄▁▄▅▁▁▄▁▁▄▄▁▁▁▁▁▅ ▄
  94 ms            Histogram: frequency by time          288 ms <

 Memory estimate: 17.64 KiB, allocs estimate: 343.

For PencilFFTs with 4 GPUs on the same node:

rank:0GPU:CuDevice(0)
rank:1GPU:CuDevice(1)
rank:2GPU:CuDevice(2)
rank:3GPU:CuDevice(3)
has-cuda:true
data size:(8192, 32, 32)
Start data allocation
BenchmarkTools.Trial: 100 samples with 1 evaluation.
 Range (min … max):  16.761 ms … 28.305 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     26.601 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   26.547 ms ±  1.360 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                                 ▁▄█ ▃ ▃ ▄ ▃▂  
  ▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▃▃▁▄▇▇████▇█▆███▃██ ▃
  16.8 ms         Histogram: frequency by time        28.2 ms <

 Memory estimate: 13.38 KiB, allocs estimate: 294.

Benchmark scripts:
CUFFT:

using CUDA
using FFTW
using BenchmarkTools

println("Dimension 8192,32,32")
println("start fft benchmark")
# op is an in-place FFT plan; op*data transforms data in place, and CUDA.@sync waits for the GPU to finish
b = @benchmark (CUDA.@sync op*data) setup=(op=plan_fft!(CuArray{ComplexF64}(undef,8192,32,32)); data=CUDA.rand(ComplexF64,8192,32,32))
println("complete fft benchmark")
io = IOBuffer()
show(io, "text/plain", b)
s = String(take!(io))
println("FFT benchmark results")
println(s)

PencilFFTs:

using MPI
using PencilFFTs
using PencilArrays
using BenchmarkTools
using Random
using CUDA

MPI.Init(threadlevel=:funneled)
comm = MPI.COMM_WORLD
dims = (8192, 32, 32)

rank = MPI.Comm_rank(comm)
device!(rank % length(devices()))  # assign one GPU per MPI rank
sleep(1 * rank)                    # stagger the prints so ranks report one at a time
print("rank:", rank, "GPU:", device(), "\n")

pen = Pencil(CuArray, dims, comm)  # pencil decomposition backed by CuArrays
transform = Transforms.FFT!()      # in-place complex-to-complex FFT

plan = PencilFFTPlan(pen, transform)
u = allocate_input(plan)  # for in-place transforms, first(u) is the input view
if rank == 0
    println("has-cuda:", MPI.has_cuda())
    print("data size:", dims, "\n")
    print("Start data allocation\n")
end
randn!(first(u))

# One in-place 3D transform per sample; the barrier keeps ranks in sync between samples
b = @benchmark $plan*$u evals=1 samples=100 seconds=30 teardown=(MPI.Barrier(comm))

if rank == 0
    io = IOBuffer()
    show(io, "text/plain", b)
    s = String(take!(io))
    println(s)
end
@jipolanco (Owner)

I'm not surprised to see that native single-GPU 3D FFT implemented in CUDA is way more efficient than the PencilFFTs version.

Note that in PencilFFTs, a 3D FFT is implemented as a series of three 1D FFTs (even on a single GPU), with data transpositions happening between FFTs to make sure that FFTs are performed along the contiguous axis. I don't know whether CUFFT does something similar, but I'm pretty sure their implementation is way more optimised and it will be hard to beat them.
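To make this concrete, below is a minimal single-process CPU sketch of that "1D FFTs with transpositions in between" pattern, written with plain FFTW calls. It only illustrates the idea (the function name is made up, and PencilFFTs' actual implementation works on distributed pencils, not full arrays):

using FFTW

# Sketch: compute a 3D FFT as three 1D-batched FFTs, transposing the data in
# between so that each FFT always runs along the leading (contiguous) dimension.
function fft3_by_pencils(A::Array{ComplexF64,3})
    B = fft(A, 1)                     # FFT along x (contiguous)
    B = permutedims(B, (2, 1, 3))     # bring y to the front
    B = fft(B, 1)                     # FFT along y
    B = permutedims(B, (3, 2, 1))     # now ordered (z, x, y); z is in front
    B = fft(B, 1)                     # FFT along z
    return permutedims(B, (2, 3, 1))  # restore the original (x, y, z) ordering
end

A = randn(ComplexF64, 64, 32, 32)
@assert fft3_by_pencils(A) ≈ fft(A)

In the distributed case, each of those transpositions becomes a global redistribution of the pencil decomposition over MPI, which is usually where most of the extra cost goes.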

This issue is somewhat similar to #35 for CPUs, but for GPUs the differences are much larger.

There is still room for optimising support for GPU arrays in PencilArrays/PencilFFTs. It would be great if you could take a look at where the time is actually spent as I already suggested in a previous issue, so that we can identify a path to possible improvements.
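In case it's useful, here is a rough sketch of how that breakdown could be obtained. It assumes the TimerOutputs integration described in the PencilFFTs "Measuring performance" docs; the `timer` keyword and the debug-timings calls are taken from those docs and should be double-checked against the installed version:

using MPI, PencilFFTs, PencilArrays, CUDA, TimerOutputs, Random

MPI.Init()
comm = MPI.COMM_WORLD
device!(MPI.Comm_rank(comm) % length(devices()))  # one GPU per rank

# Optional, for finer-grained sections (per the PencilFFTs docs):
# TimerOutputs.enable_debug_timings(PencilFFTs)
# TimerOutputs.enable_debug_timings(PencilArrays)

to = TimerOutput()
pen = Pencil(CuArray, (8192, 32, 32), comm)
plan = PencilFFTPlan(pen, Transforms.FFT!(); timer = to)  # `timer` kwarg per the docs (double-check)

u = allocate_input(plan)
randn!(first(u))

plan * u                 # warm-up (compilation, plan setup)
reset_timer!(to)
for _ in 1:10
    plan * u             # timings accumulate in `to`
end
MPI.Comm_rank(comm) == 0 && print_timer(to)

If the transposition sections dominate over the 1D FFT sections, that would point to the MPI communication and packing rather than CUFFT itself.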

@doraemonho

I am not sure if this is another "direction" for solving the issue.
I see that NVIDIA released a new library for multi-node GPU FFTs early this year. Maybe we can use it to speed things up. However, CUDA.jl hasn't wrapped that library yet... Maybe we need to do it ourselves...

p.s. https://developer.nvidia.com/blog/multinode-multi-gpu-using-nvidia-cufftmp-ffts-at-scale/

@jipolanco (Owner)

Thanks for the link. Their results look quite impressive!

I think wrapping the multi-node CUDA FFT is the way to go. What's a bit annoying is that cuFFTMp is for now in early access, which apparently means that you need to ask NVIDIA for permission to use it.

Assuming we have access to cuFFTMp, I guess the first thing would be to wrap the cuFFTMp functions ourselves, and either add that to CUDA.jl or to a separate package (PencilFFTsCUDA.jl?). Secondly, we'd need to think about how to interface this library with PencilArrays/PencilFFTs...

@doraemonho

> Thanks for the link. Their results look quite impressive!
>
> I think wrapping the multi-node CUDA FFT is the way to go. What's a bit annoying is that cuFFTMp is for now in early access, which apparently means that you need to ask NVIDIA for permission to use it.

We are in a similar situation with MHDFlows.jl, as we are planning to support multi-GPU computation.
We have access to cuFFTMp since our group are NERSC Perlmutter users, but the new system is still in its testing stage and it can't yet run the C++ examples from NVIDIA.

> Assuming we have access to cuFFTMp, I guess the first thing would be to wrap the cuFFTMp functions ourselves, and either add that to CUDA.jl or to a separate package (PencilFFTsCUDA.jl?). Secondly, we'd need to think about how to interface this library with PencilArrays/PencilFFTs...

Our approach is to try wrapping cufftXt first (cufftXt.jl? it is still at the exploration stage right now) and to wait for cuFFTMp to mature, although cufftXt only supports single-node multi-GPU computation (max 16 GPUs).

Btw, for the interface, there was a discussion on whether CUDA.jl should wrap cufftXt. It turns out those APIs aren't sufficient to implement broadcasting, and they would need to handle more general cases such as indexing. But that shouldn't be an issue for us, as the package already handles indexing.

@chowland

Firstly, thanks @jipolanco for PencilFFTs! It's a really nice package, and so clearly documented that someone relatively new to Julia can get up to speed quickly. Just thought I'd share a link to a relatively new NVIDIA library, cuDecomp, that seems to cover very similar ground to PencilFFTs for domain decomposition and FFTs, but from a lower-level CUDA implementation.

As well as being relatively new to Julia, I'm also very new to using GPUs, but it might be interesting to compare performance between cuDecomp and PencilFFTs, given the similar flexibility of these libraries compared to the singular focus of cuFFTMp on 3D FFTs. I'd be happy to look into this, but it might take me a bit of time to learn what I'm doing!

@jipolanco (Owner)

Hi @chowland, thank you for your kind words.

It would be great to have a comparison with cuDecomp. Right now I don't have a lot of time to look into this, so feel free to attempt a comparison. I will be happy to guide you through the usage of PencilFFTs on GPUs.
