
CuArray Performance #56

Open
Lightup1 opened this issue Jul 20, 2022 · 6 comments

@Lightup1

It seems that PencilFFTs is slower than just using CUFFT on a single GPU.
I'm using CUDA 11.7, OpenMPI 4.1.4, UCX 1.13 (without gdrcopy) and Julia 1.7.3.
The data dimensions are (8192, 32, 32).
For CUFFT with a single GPU:

Dimension 8192,32,32
start fft benchmark
complete fft benchmark
FFT benchmark results
BenchmarkTools.Trial: 1521 samples with 1 evaluation.
 Range (min … max):  2.044 ms … 51.419 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.411 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.737 ms ±  2.174 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄█▆▃                                                        
  █████▇▄▁▄▃▃▄▃▄▅▃▁▁▁▁▃▁▃▄▁▁▁▃▁▃▃▁▄▁▃▁▁▁▁▃▃▃▁▁▃▁▁▁▁▁▁▁▁▃▁▃▃▅ █
  2.04 ms      Histogram: log(frequency) by time     14.5 ms <

 Memory estimate: 2.94 KiB, allocs estimate: 49.

For PencilFFTs with a single GPU:

rank:0GPU:CuDevice(0)
has-cuda:true
data size:(8192, 32, 32)
Start data allocation
BenchmarkTools.Trial: 100 samples with 1 evaluation.
 Range (min … max):   93.965 ms … 300.333 ms  ┊ GC (min … max): 0.00% … 4.89%
 Time  (median):     173.977 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   181.007 ms ±  39.066 ms  ┊ GC (mean ± σ):  0.08% ± 0.49%

                 █▆     ▂▂  ▃                                    
  ▄▁▁▁▁▁▁▁▁▁▁▅▇▄▁██▇▇▇▄███▄██▇▅█▅▅▁▅▅▁▁█▅▄▇▅▅▁▄▁▄▅▁▁▄▁▁▄▄▁▁▁▁▁▅ ▄
  94 ms            Histogram: frequency by time          288 ms <

 Memory estimate: 17.64 KiB, allocs estimate: 343.

For PencilFFTs with 4 GPUs on the same node:

rank:0GPU:CuDevice(0)
rank:1GPU:CuDevice(1)
rank:2GPU:CuDevice(2)
rank:3GPU:CuDevice(3)
has-cuda:true
data size:(8192, 32, 32)
Start data allocation
BenchmarkTools.Trial: 100 samples with 1 evaluation.
 Range (min … max):  16.761 ms … 28.305 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     26.601 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   26.547 ms ±  1.360 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                                 ▁▄█ ▃ ▃ ▄ ▃▂  
  ▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▃▃▁▄▇▇████▇█▆███▃██ ▃
  16.8 ms         Histogram: frequency by time        28.2 ms <

 Memory estimate: 13.38 KiB, allocs estimate: 294.

Benchmark scripts:
CUFFT:

using CUDA
using FFTW
using BenchmarkTools

println("Dimension 8192,32,32")
println("start fft benchmark")
# op is an in-place FFT plan; op*data transforms data in place, and CUDA.@sync waits for the GPU to finish
b = @benchmark (CUDA.@sync op*data) setup=(op=plan_fft!(CuArray{ComplexF64}(undef,8192,32,32)); data=CUDA.rand(ComplexF64,8192,32,32))
println("complete fft benchmark")
io = IOBuffer()
show(io, "text/plain", b)
s = String(take!(io))
println("FFT benchmark results")
println(s)

PencilFFTs:

using MPI
using PencilFFTs
using PencilArrays
using BenchmarkTools
using Random
using CUDA

MPI.Init(threadlevel=:funneled)
comm = MPI.COMM_WORLD
dims = (8192, 32, 32)

rank = MPI.Comm_rank(comm)
device!(rank % length(devices()))  # assign one GPU per MPI rank
sleep(1 * rank)                    # stagger the prints so ranks report one at a time
print("rank:", rank, "GPU:", device(), "\n")

pen = Pencil(CuArray, dims, comm)  # pencil decomposition backed by CuArrays
transform = Transforms.FFT!()      # in-place complex-to-complex FFT

plan = PencilFFTPlan(pen, transform)
u = allocate_input(plan)  # for in-place transforms, first(u) is the input view
if rank == 0
    println("has-cuda:", MPI.has_cuda())
    print("data size:", dims, "\n")
    print("Start data allocation\n")
end
randn!(first(u))

# One in-place 3D transform per sample; the barrier keeps ranks in sync between samples
b = @benchmark $plan*$u evals=1 samples=100 seconds=30 teardown=(MPI.Barrier(comm))

if rank == 0
    io = IOBuffer()
    show(io, "text/plain", b)
    s = String(take!(io))
    println(s)
end
@jipolanco (Owner)

I'm not surprised to see that native single-GPU 3D FFT implemented in CUDA is way more efficient than the PencilFFTs version.

Note that in PencilFFTs, a 3D FFT is implemented as a series of three 1D FFTs (even on a single GPU), with data transpositions happening between FFTs to make sure that FFTs are performed along the contiguous axis. I don't know whether CUFFT does something similar, but I'm pretty sure their implementation is way more optimised and it will be hard to beat them.
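To make this concrete, below is a minimal single-process CPU sketch of that "1D FFTs with transpositions in between" pattern, written with plain FFTW calls. It only illustrates the idea (the function name is made up, and PencilFFTs' actual implementation works on distributed pencils, not full arrays):

using FFTW

# Sketch: compute a 3D FFT as three 1D-batched FFTs, transposing the data in
# between so that each FFT always runs along the leading (contiguous) dimension.
function fft3_by_pencils(A::Array{ComplexF64,3})
    B = fft(A, 1)                     # FFT along x (contiguous)
    B = permutedims(B, (2, 1, 3))     # bring y to the front
    B = fft(B, 1)                     # FFT along y
    B = permutedims(B, (3, 2, 1))     # now ordered (z, x, y); z is in front
    B = fft(B, 1)                     # FFT along z
    return permutedims(B, (2, 3, 1))  # restore the original (x, y, z) ordering
end

A = randn(ComplexF64, 64, 32, 32)
@assert fft3_by_pencils(A) ≈ fft(A)

In the distributed case, each of those transpositions becomes a global redistribution of the pencil decomposition over MPI, which is usually where most of the extra cost goes.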

This issue is somewhat similar to #35 for CPUs, but for GPUs the differences are much larger.

There is still room for optimising support for GPU arrays in PencilArrays/PencilFFTs. It would be great if you could take a look at where the time is actually spent as I already suggested in a previous issue, so that we can identify a path to possible improvements.
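In case it's useful, here is a rough sketch of how that breakdown could be obtained. It assumes the TimerOutputs integration described in the PencilFFTs "Measuring performance" docs; the `timer` keyword and the debug-timings calls are taken from those docs and should be double-checked against the installed version:

using MPI, PencilFFTs, PencilArrays, CUDA, TimerOutputs, Random

MPI.Init()
comm = MPI.COMM_WORLD
device!(MPI.Comm_rank(comm) % length(devices()))  # one GPU per rank

# Optional, for finer-grained sections (per the PencilFFTs docs):
# TimerOutputs.enable_debug_timings(PencilFFTs)
# TimerOutputs.enable_debug_timings(PencilArrays)

to = TimerOutput()
pen = Pencil(CuArray, (8192, 32, 32), comm)
plan = PencilFFTPlan(pen, Transforms.FFT!(); timer = to)  # `timer` kwarg per the docs (double-check)

u = allocate_input(plan)
randn!(first(u))

plan * u                 # warm-up (compilation, plan setup)
reset_timer!(to)
for _ in 1:10
    plan * u             # timings accumulate in `to`
end
MPI.Comm_rank(comm) == 0 && print_timer(to)

If the transposition sections dominate over the 1D FFT sections, that would point to the MPI communication and packing rather than CUFFT itself.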

@doraemonho

I am not sure if this is another "direction" for solving the issue.
I see that NVIDIA released a new library for multi-node GPU FFTs early this year. Maybe we can use it to speed things up. However, CUDA.jl hasn't wrapped that library yet... Maybe we need to do it ourselves...

p.s. https://developer.nvidia.com/blog/multinode-multi-gpu-using-nvidia-cufftmp-ffts-at-scale/

@jipolanco (Owner)

Thanks for the link. Their results look quite impressive!

I think wrapping the multi-node CUDA FFT is the way to go. What's a bit annoying is that cuFFTMp is for now in early access, which apparently means that you need to ask NVIDIA for permission to use it.

Assuming we have access to cuFFTMp, I guess the first thing would be to wrap the cuFFTMp functions ourselves, and either add that to CUDA.jl or to a separate package (PencilFFTsCUDA.jl?). Secondly, we'd need to think about how to interface this library with PencilArrays/PencilFFTs...

@doraemonho

> Thanks for the link. Their results look quite impressive!
>
> I think wrapping the multi-node CUDA FFT is the way to go. What's a bit annoying is that cuFFTMp is for now in early access, which apparently means that you need to ask NVIDIA for permission to use it.

We are in a similar situation with MHDFlows.jl, as we are planning to support multi-GPU computation.
We have access to cuFFTMp since our group are NERSC Perlmutter users, but the new system is still in its testing stage and it can't yet run the C++ examples from NVIDIA.

> Assuming we have access to cuFFTMp, I guess the first thing would be to wrap the cuFFTMp functions ourselves, and either add that to CUDA.jl or to a separate package (PencilFFTsCUDA.jl?). Secondly, we'd need to think about how to interface this library with PencilArrays/PencilFFTs...

Our approach is to try wrapping cufftXt first (cufftXt.jl? it is still at the exploration stage right now) and to wait for cuFFTMp to mature, although cufftXt only supports single-node multi-GPU computation (max 16 GPUs).

Btw, for the interface, there was a discussion on whether CUDA.jl should wrap cufftXt. It turns out those APIs aren't sufficient to implement broadcasting, and they would need to handle more general cases such as indexing. But that shouldn't be an issue for us, as the package already handles indexing.

@chowland

Firstly, thanks @jipolanco for PencilFFTs! It's a really nice package, and so clearly documented that someone relatively new to Julia can get up to speed quickly. Just thought I'd share a link to a relatively new NVIDIA library, cuDecomp, that seems to cover very similar ground to PencilFFTs for domain decomposition and FFTs, but from a lower-level CUDA implementation.

As well as being relatively new to Julia, I'm also very new to using GPUs, but it might be interesting to compare performance between cuDecomp and PencilFFTs, given the similar flexibility of these libraries compared to the singular focus of cuFFTMp on 3D FFTs. I'd be happy to look into this, but it might take me a bit of time to learn what I'm doing!

@jipolanco (Owner)

Hi @chowland, thank you for your kind words.

It would be great to have a comparison with cuDecomp. Right now I don't have a lot of time to look into this, so feel free to attempt a comparison. I will be happy to guide you through the usage of PencilFFTs on GPUs.
