
[BUG]: thrust::device_vector initialization with fancy iterators 4x slower than initialization + thrust::transform #2451

Closed · rwarmstr opened this issue Sep 23, 2024 · 12 comments
Labels: bug (Something isn't working right.)

@rwarmstr

Type of Bug

Performance

Component

Thrust

Describe the bug

In a performance comparison between two methods of initializing device vectors, creating a zero-initialized vector and subsequently initializing it with thrust::transform seems to be approximately 4x faster than using fancy iterators to initialize the same vector through the constructor.

How to Reproduce

Example of fast initialization:

    // Create a device vector to hold the combined waveform
    thrust::device_vector<float> d_combined(NUM_SAMPLES);

    auto wave1 = thrust::make_transform_iterator(thrust::counting_iterator<int>(0), sine_wave_functor(1, 2 * M_PI * FREQ_A / SAMPLE_RATE, 0));
    auto wave2 = thrust::make_transform_iterator(thrust::counting_iterator<int>(0), sine_wave_functor(0.5, 2 * M_PI * FREQ_E / SAMPLE_RATE, 0));

    // Initialize the two device vectors
    thrust::transform(wave1, wave1 + NUM_SAMPLES, wave2, d_combined.begin(), thrust::plus<float>());
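
(sine_wave_functor isn't shown in the repro; below is a minimal sketch consistent with the call sites above, with assumed member names amplitude/omega/phase:)

    // Hypothetical reconstruction of sine_wave_functor - not part of the
    // original report; parameters inferred from the calls above.
    struct sine_wave_functor
    {
        float amplitude, omega, phase;

        __host__ __device__ sine_wave_functor(float a, float w, float p)
            : amplitude(a), omega(w), phase(p) {}

        __host__ __device__ float operator()(int i) const
        {
            return amplitude * sinf(omega * i + phase);
        }
    };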

[Attached image: Nsight Systems trace of the fast initialization]

Example of slow initialization:

    auto wave1 = thrust::make_transform_iterator(thrust::counting_iterator<int>(0), sine_wave_functor(1, 2 * M_PI * FREQ_A / SAMPLE_RATE, 0));
    auto wave2 = thrust::make_transform_iterator(thrust::counting_iterator<int>(0), sine_wave_functor(0.5, 2 * M_PI * FREQ_E / SAMPLE_RATE, 0));
    const auto waves = thrust::make_zip_iterator(thrust::make_tuple(wave1, wave2));
    const auto initializer = thrust::make_transform_iterator(waves, [] __host__ __device__(thrust::tuple<float, float> const &t)
                                                             { return thrust::get<0>(t) + thrust::get<1>(t); });
    thrust::device_vector<float> d_combined(initializer, initializer + NUM_SAMPLES);

[Attached image: Nsight Systems trace of the slow initialization]

Expected behavior

Both methods of initialization should use similar code paths and take similar amounts of time.

Reproduction link

https://godbolt.org/z/bq4G3bces

Operating System

Ubuntu Linux 24.04

nvidia-smi output

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX 6000 Ada Gene...    Off |   00000000:01:00.0 Off |                    0 |
| 30%   42C    P8             22W /  300W |       1MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

NVCC version

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Aug_14_10:10:22_PDT_2024
Cuda compilation tools, release 12.6, V12.6.68
Build cuda_12.6.r12.6/compiler.34714021_0
@fbusato (Contributor) commented Sep 23, 2024

Both methods use the same cub:: routine (cub::DeviceFor) with the same parameters.
My suspicion is that the "fast initialization" measurement doesn't account for the memory allocation itself. Isolating the two phases on SM89, the "fast initialization" allocation plus its transform matches the "slow initialization" method.
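
For example, the two phases can be isolated with event timing along these lines (a sketch; assumes the repro's wave1, wave2, and NUM_SAMPLES are in scope):

    cudaEvent_t start, mid, stop;
    cudaEventCreate(&start); cudaEventCreate(&mid); cudaEventCreate(&stop);

    cudaEventRecord(start);
    thrust::device_vector<float> d_combined(NUM_SAMPLES);           // allocation + zero-fill kernel
    cudaEventRecord(mid);
    thrust::transform(wave1, wave1 + NUM_SAMPLES, wave2,
                      d_combined.begin(), thrust::plus<float>());   // transform kernel
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float alloc_ms = 0.0f, transform_ms = 0.0f;
    cudaEventElapsedTime(&alloc_ms, start, mid);
    cudaEventElapsedTime(&transform_ms, mid, stop);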

@rwarmstr (Author)

In the "fast" method, total end-to-end time for initialization and population, including lazy loading, is 7.965 ms. In the 'slow' case, the same goalposts add to 20.72 ms.

The kernel runtime itself is 11.286 ms in the slow case and 2.584 ms in the fast case. This is visible in the attached traces.

How are you isolating them to say they're the same?

@fbusato (Contributor) commented Sep 23, 2024

This is the methodology that I adopted:

  • I used the code provided in the link, added the fast initialization method, and removed everything after both initializations.
  • Nsight Systems to trace CUDA calls (locking CPU/GPU clocks, disabling autoboost, persistence mode, etc.); the command is sketched below.
  • repeated the same experiment with the two methods swapped
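
The profiling command was along these lines (an assumption; the exact invocation isn't recorded in this thread):

    nsys profile --stats=true -o init_compare ./signals
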
 Time (%)  Total Time (ns)  Instances  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)                                                  Name
 --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     51.1          677,415          1  677,415.0  677,415.0   677,415   677,415          0.0  void cub::CUB_200400_890_NS::detail::for_each::static_kernel<cub::CUB_200400_890_NS::detail::for_ea…
     42.1          557,926          1  557,926.0  557,926.0   557,926   557,926          0.0  void cub::CUB_200400_890_NS::detail::for_each::static_kernel<cub::CUB_200400_890_NS::detail::for_ea…
      6.8           90,081          1   90,081.0   90,081.0    90,081    90,081          0.0  void cub::CUB_200400_890_NS::detail::for_each::static_kernel<cub::CUB_200400_890_NS::detail::for_ea…

I need to investigate the whole code to understand why the end-to-end time changes between the two methods

@rwarmstr (Author) commented Sep 23, 2024

Interesting - here is my CMake config for reference:

cmake_minimum_required(VERSION 3.20)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

set(CMAKE_CUDA_STANDARD 17)
set(CMAKE_CUDA_STANDARD_REQUIRED ON)

set(CMAKE_CUDA_ARCHITECTURES "native")
if(CMAKE_BUILD_TYPE STREQUAL "Debug")
    set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -g -G")  # enable cuda-gdb
endif()

set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --extended-lambda")

project(signals LANGUAGES CXX CUDA)

find_package(CUDAToolkit REQUIRED)

add_executable(signals
    signals.cu
)

target_link_libraries(signals
    CUDA::cufft
)

set_target_properties(signals PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
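
For reference, in a Debug configuration this roughly corresponds to a compile line like the following (an approximation; the exact flags depend on the CMake version and generator):

    nvcc -std=c++17 -g -G --extended-lambda -rdc=true -arch=native signals.cu -o signals -lcufft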

@rwarmstr (Author)

Also interestingly, my mangled kernel name is different - I'm using the version that shipped in CTK 12.6.1. Is there a delta between that and the one you're using (presumably the version from GitHub)?

void cub::CUB_200500_890_NS::detail::for_each::static_kernel<cub::CUB_200500_890_NS::detail::for_each::policy_hub_t::policy_350_t, long, thrust::THRUST_200500_890_NS::cuda_cub::__transform::unary_transform_f<thrust::THRUST_200500_890_NS::transform_iterator<add_waves, thrust::THRUST_200500_890_NS::zip_iterator<thrust::THRUST_200500_890_NS::tuple<thrust::THRUST_200500_890_NS::transform_iterator<sine_wave_functor, thrust::THRUST_200500_890_NS::counting_iterator<int, thrust::THRUST_200500_890_NS::use_default, thrust::THRUST_200500_890_NS::use_default, thrust::THRUST_200500_890_NS::use_default>, thrust::THRUST_200500_890_NS::use_default, thrust::THRUST_200500_890_NS::use_default>, thrust::THRUST_200500_890_NS::transform_iterator<sine_wave_functor, thrust::THRUST_200500_890_NS::counting_iterator<int, thrust::THRUST_200500_890_NS::use_default, thrust::THRUST_200500_890_NS::use_default, thrust::THRUST_200500_890_NS::use_default>, thrust::THRUST_200500_890_NS::use_default, thrust::THRUST_200500_890_NS::use_default>>>, thrust::THRUST_200500_890_NS::use_default, thrust::THRUST_200500_890_NS::use_default>, thrust::THRUST_200500_890_NS::device_ptr<float>, thrust::THRUST_200500_890_NS::cuda_cub::__transform::no_stencil_tag, thrust::THRUST_200500_890_NS::identity<float>, thrust::THRUST_200500_890_NS::cuda_cub::__transform::always_true_predicate>>(T2, T3)

@fbusato (Contributor) commented Sep 24, 2024

Yes, I used CUDA 12.5 to set up the experiment quickly, but I didn't notice any meaningful difference. I'm trying with CUDA 12.6u1.

@fbusato (Contributor) commented Sep 24, 2024

I tried with the same configuration:

  • same GPU, CUDA toolkit 12.6u1 (libs + compiler)
  • whole application profiling
  • I also evaluated the same code with more elements (32 × 10^6) to see if the difference becomes more visible.

I'm still observing very similar performance.

Original code:

 Time (%)  Total Time (ns)  Instances  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)                                                  Name       
 --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     49.8        1,416,034          3  472,011.3  455,040.0   449,121   511,873     34,647.8  void regular_fft_factor<(unsigned int)256, EPT<(unsigned int)16>, (unsigned int)8, (unsigned int)0,…
     23.2          658,721          1  658,721.0  658,721.0   658,721   658,721          0.0  void cub::CUB_200500_890_NS::detail::for_each::static_kernel<cub::CUB_200500_890_NS::detail::for_ea…
     13.0          369,505          1  369,505.0  369,505.0   369,505   369,505          0.0  void postprocess_kernel<float, unsigned int, (loadstore_modifier_t)2>(real_complex_args_t<T2>)
      8.8          251,552          1  251,552.0  251,552.0   251,552   251,552          0.0  void cub::CUB_200500_890_NS::detail::for_each::static_kernel<cub::CUB_200500_890_NS::detail::for_ea…
      5.2          148,256          1  148,256.0  148,256.0   148,256   148,256          0.0  void cub::CUB_200500_890_NS::detail::for_each::static_kernel<cub::CUB_200500_890_NS::detail::for_ea…

[7/8] Executing 'cuda_gpu_mem_time_sum' stats report

 Time (%)  Total Time (ns)  Count   Avg (ns)     Med (ns)    Min (ns)    Max (ns)   StdDev (ns)           Operation
 --------  ---------------  -----  -----------  -----------  ---------  ----------  -----------  ----------------------------
     95.9       18,643,131      2  9,321,565.5  9,321,565.5  6,160,393  12,482,738  4,470,573.0  [CUDA memcpy Device-to-Host]
      4.1          792,577      7    113,225.3        384.0        384     790,273    298,550.0  [CUDA memcpy Host-to-Device]

[8/8] Executing 'cuda_gpu_mem_size_sum' stats report

 Total (MB)  Count  Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)           Operation
 ----------  -----  --------  --------  --------  --------  -----------  ----------------------------
    201.327      2   100.663   100.663    67.109   134.218       47.453  [CUDA memcpy Device-to-Host]
      6.799      7     0.971     0.000     0.000     6.799        2.570  [CUDA memcpy Host-to-Device]

"Fast initialization":

 Time (%)  Total Time (ns)  Instances  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)                                                  Name       
 --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     49.6        1,415,906          3  471,968.7  456,897.0   446,848   512,161     35,168.4  void regular_fft_factor<(unsigned int)256, EPT<(unsigned int)16>, (unsigned int)8, (unsigned int)0,…
     18.8          536,705          1  536,705.0  536,705.0   536,705   536,705          0.0  void cub::CUB_200500_890_NS::detail::for_each::static_kernel<cub::CUB_200500_890_NS::detail::for_ea…
     13.1          372,737          1  372,737.0  372,737.0   372,737   372,737          0.0  void postprocess_kernel<float, unsigned int, (loadstore_modifier_t)2>(real_complex_args_t<T2>)
      8.9          252,992          1  252,992.0  252,992.0   252,992   252,992          0.0  void cub::CUB_200500_890_NS::detail::for_each::static_kernel<cub::CUB_200500_890_NS::detail::for_ea…
      6.1          173,089          1  173,089.0  173,089.0   173,089   173,089          0.0  void cub::CUB_200500_890_NS::detail::for_each::static_kernel<cub::CUB_200500_890_NS::detail::for_ea…
      3.6          102,016          1  102,016.0  102,016.0   102,016   102,016          0.0  void cub::CUB_200500_890_NS::detail::for_each::static_kernel<cub::CUB_200500_890_NS::detail::for_ea…

[7/8] Executing 'cuda_gpu_mem_time_sum' stats report

 Time (%)  Total Time (ns)  Count    Avg (ns)      Med (ns)    Min (ns)    Max (ns)   StdDev (ns)           Operation
 --------  ---------------  -----  ------------  ------------  ---------  ----------  -----------  ----------------------------
     96.7       23,617,421      2  11,808,710.5  11,808,710.5  6,169,292  17,448,129  7,975,342.1  [CUDA memcpy Device-to-Host]
      3.3          801,186      7     114,455.1         384.0        352     798,850    301,789.8  [CUDA memcpy Host-to-Device]

[8/8] Executing 'cuda_gpu_mem_size_sum' stats report

 Total (MB)  Count  Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)           Operation
 ----------  -----  --------  --------  --------  --------  -----------  ----------------------------
    201.327      2   100.663   100.663    67.109   134.218       47.453  [CUDA memcpy Device-to-Host]
      6.799      7     0.971     0.000     0.000     6.799        2.570  [CUDA memcpy Host-to-Device]

Is there any other detail that could help explain the performance difference?

The only relevant difference I see is in the device-to-host memcpy.

@rwarmstr (Author)

That's very weird - your results are what I'd expect to see, but not what I actually see. Here's my "slow" result:

[6/8] Executing 'cuda_gpu_kern_sum' stats report

 Time (%)  Total Time (ns)  Instances    Avg (ns)      Med (ns)     Min (ns)    Max (ns)   StdDev (ns)                                                  Name
 --------  ---------------  ---------  ------------  ------------  ----------  ----------  -----------  ----------------------------------------------------------------------------------------------------
     83.3       11,260,683          1  11,260,683.0  11,260,683.0  11,260,683  11,260,683          0.0  void cub::CUB_200500_890_NS::detail::for_each::static_kernel<cub::CUB_200500_890_NS::detail::for_ea…
     15.2        2,060,697          1   2,060,697.0   2,060,697.0   2,060,697   2,060,697          0.0  void cub::CUB_200500_890_NS::detail::for_each::static_kernel<cub::CUB_200500_890_NS::detail::for_ea…
      1.3          180,898          1     180,898.0     180,898.0     180,898     180,898          0.0  void cub::CUB_200500_890_NS::detail::for_each::static_kernel<cub::CUB_200500_890_NS::detail::for_ea…
      0.1            9,760          1       9,760.0       9,760.0       9,760       9,760          0.0  void regular_fft_factor<(unsigned int)512, EPT<(unsigned int)8>, (unsigned int)8, (unsigned int)0, …
      0.0            6,241          1       6,241.0       6,241.0       6,241       6,241          0.0  void regular_fft_factor<(unsigned int)1024, EPT<(unsigned int)16>, (unsigned int)8, (unsigned int)0…
      0.0            3,968          1       3,968.0       3,968.0       3,968       3,968          0.0  void postprocess_kernel<float, unsigned int, (loadstore_modifier_t)2>(real_complex_args_t<T2>)

[7/8] Executing 'cuda_gpu_mem_time_sum' stats report

 Time (%)  Total Time (ns)  Count  Avg (ns)   Med (ns)   Min (ns)  Max (ns)   StdDev (ns)           Operation
 --------  ---------------  -----  ---------  ---------  --------  ---------  -----------  ----------------------------
     69.6        1,047,341      5  209,468.2      384.0       352  1,045,869    467,562.3  [CUDA memcpy Host-to-Device]
     30.4          456,900      2  228,450.0  228,450.0   142,337    314,563    121,782.2  [CUDA memcpy Device-to-Host]

[8/8] Executing 'cuda_gpu_mem_size_sum' stats report

 Total (MB)  Count  Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)           Operation
 ----------  -----  --------  --------  --------  --------  -----------  ----------------------------
      6.799      5     1.360     0.000     0.000     6.799        3.040  [CUDA memcpy Host-to-Device]
      6.291      2     3.146     3.146     2.097     4.194        1.483  [CUDA memcpy Device-to-Host]

I sent you my CMakeLists.txt and that's basically the entire thing - are you building in the Debug configuration or Release? Which GPU are you using for profiling? Mine is an RTX 6000 Ada.

@fbusato (Contributor) commented Sep 24, 2024

I didn't use CMakeLists.txt. Just a single compile command equivalent to release mode, but I don't think it makes a difference. I'm using the same GPU. I will try other experiments to understand where the problem is. Your profile results are useful in that direction.

@jrhemstad (Collaborator)

@fbusato one important thing in @rwarmstr's CMake is that he is enabling relocatable device code (set_target_properties(signals PROPERTIES CUDA_SEPARABLE_COMPILATION ON)). That can impact Thrust internals quite a bit, and so may affect your ability to reproduce the issue.

@rwarmstr (Author)

Based on some more experimentation, having separable compilation enabled didn't make a difference, but the use of -G did. I had expected device-side debug to carry significantly higher overhead, but I would have intuitively expected that overhead to be uniform; it turns out that isn't the case. Turning off the -G option makes both code paths equivalently efficient.
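
(As a general note, not something tested in this thread: for profiling, keeping optimizations and adding -lineinfo for source correlation avoids the distortion that -G introduces.)

    # device-debug build: disables device optimization and inlining
    nvcc -G --extended-lambda signals.cu -o signals -lcufft
    # profiling-friendly build: optimized, but retains source line info
    nvcc -O3 -lineinfo --extended-lambda signals.cu -o signals -lcufft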

So, this doesn't seem to be an inherent issue with CCCL. Thanks @fbusato for your debug help and I think we can consider this resolved.

@jrhemstad (Collaborator)

> would have intuitively expected it to be uniform; turns out this wasn't the case.

Huh, I would have totally thought the same.

I'm guessing the difference happens because -G turns off function inlining, and the constructor path uses more deeply nested fancy iterators, which incurs more non-inlined function calls per element.
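
To illustrate with a simplified stand-in (hypothetical names, not Thrust internals) - each level of nesting is a real device function call unless the compiler inlines it, and -G prevents that:

    // Roughly what one element read of the "slow" initializer decomposes into:
    __device__ float inner_wave(int i)      // ~ sine_wave_functor via an inner transform_iterator
    {
        return sinf(0.001f * i);
    }

    __device__ float2 zipped(int i)         // ~ zip_iterator materializing the tuple
    {
        return make_float2(inner_wave(i), inner_wave(i));
    }

    __device__ float initializer_at(int i)  // ~ outer transform_iterator applying the lambda
    {
        float2 t = zipped(i);
        return t.x + t.y;
    }
    // An optimized build collapses this into a few instructions; under -G,
    // every level stays a separate call, paid once per element.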
