
[BUG]: thrust::device_vector initialization with fancy iterators 4x slower than initialization + thrust::transform #2451

Closed · rwarmstr opened this issue Sep 23, 2024 · 12 comments
Labels: bug (Something isn't working right.)

@rwarmstr

Type of Bug

Performance

Component

Thrust

Describe the bug

In a performance comparison between two methods of initializing device vectors, creating a zero-initialized vector and subsequently initializing it with thrust::transform seems to be approximately 4x faster than using fancy iterators to initialize the same vector through the constructor.

How to Reproduce

Example of fast initialization:

    // Create a device vector to hold the combined waveform
    thrust::device_vector<float> d_combined(NUM_SAMPLES);

    auto wave1 = thrust::make_transform_iterator(thrust::counting_iterator<int>(0), sine_wave_functor(1, 2 * M_PI * FREQ_A / SAMPLE_RATE, 0));
    auto wave2 = thrust::make_transform_iterator(thrust::counting_iterator<int>(0), sine_wave_functor(0.5, 2 * M_PI * FREQ_E / SAMPLE_RATE, 0));

    // Initialize the two device vectors
    thrust::transform(wave1, wave1 + NUM_SAMPLES, wave2, d_combined.begin(), thrust::plus<float>());
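
(sine_wave_functor isn't shown in the repro; below is a minimal sketch consistent with the call sites above, with assumed member names amplitude/omega/phase:)

    // Hypothetical reconstruction of sine_wave_functor - not part of the
    // original report; parameters inferred from the calls above.
    struct sine_wave_functor
    {
        float amplitude, omega, phase;

        __host__ __device__ sine_wave_functor(float a, float w, float p)
            : amplitude(a), omega(w), phase(p) {}

        __host__ __device__ float operator()(int i) const
        {
            return amplitude * sinf(omega * i + phase);
        }
    };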

[Attached image: Nsight Systems trace of the fast initialization]

Example of slow initialization:

    auto wave1 = thrust::make_transform_iterator(thrust::counting_iterator<int>(0), sine_wave_functor(1, 2 * M_PI * FREQ_A / SAMPLE_RATE, 0));
    auto wave2 = thrust::make_transform_iterator(thrust::counting_iterator<int>(0), sine_wave_functor(0.5, 2 * M_PI * FREQ_E / SAMPLE_RATE, 0));
    const auto waves = thrust::make_zip_iterator(thrust::make_tuple(wave1, wave2));
    const auto initializer = thrust::make_transform_iterator(waves, [] __host__ __device__(thrust::tuple<float, float> const &t)
                                                             { return thrust::get<0>(t) + thrust::get<1>(t); });
    thrust::device_vector<float> d_combined(initializer, initializer + NUM_SAMPLES);

[Attached image: Nsight Systems trace of the slow initialization]

Expected behavior

Both methods of initialization should use similar code paths and take similar amounts of time.

Reproduction link

https://godbolt.org/z/bq4G3bces

Operating System

Ubuntu Linux 24.04

nvidia-smi output

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX 6000 Ada Gene...    Off |   00000000:01:00.0 Off |                    0 |
| 30%   42C    P8             22W /  300W |       1MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

NVCC version

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Aug_14_10:10:22_PDT_2024
Cuda compilation tools, release 12.6, V12.6.68
Build cuda_12.6.r12.6/compiler.34714021_0
@fbusato (Contributor) commented Sep 23, 2024

Both methods use the same cub:: routine (cub::DeviceFor) with the same parameters.
My suspicion is that the "fast initialization" measurement doesn't account for the memory allocation itself. Isolating the two phases on SM89, the "fast initialization" allocation plus its transform matches the "slow initialization" method.
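
For example, the two phases can be isolated with event timing along these lines (a sketch; assumes the repro's wave1, wave2, and NUM_SAMPLES are in scope):

    cudaEvent_t start, mid, stop;
    cudaEventCreate(&start); cudaEventCreate(&mid); cudaEventCreate(&stop);

    cudaEventRecord(start);
    thrust::device_vector<float> d_combined(NUM_SAMPLES);           // allocation + zero-fill kernel
    cudaEventRecord(mid);
    thrust::transform(wave1, wave1 + NUM_SAMPLES, wave2,
                      d_combined.begin(), thrust::plus<float>());   // transform kernel
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float alloc_ms = 0.0f, transform_ms = 0.0f;
    cudaEventElapsedTime(&alloc_ms, start, mid);
    cudaEventElapsedTime(&transform_ms, mid, stop);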

@rwarmstr (Author)

In the "fast" method, total end-to-end time for initialization and population, including lazy loading, is 7.965 ms. In the 'slow' case, the same goalposts add to 20.72 ms.

The kernel runtime itself is 11.286 ms in the slow case and 2.584 ms in the fast case. This is visible in the attached traces.

How are you isolating them to say they're the same?

@fbusato (Contributor) commented Sep 23, 2024

This is the methodology that I adopted:

  • I used the code provided in the link, added the fast initialization method, and removed everything after both initializations.
  • Nsight Systems to trace CUDA calls (locking CPU/GPU clocks, disabling autoboost, persistence mode, etc.); the command is sketched below.
  • repeated the same experiment with the two methods swapped
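
The profiling command was along these lines (an assumption; the exact invocation isn't recorded in this thread):

    nsys profile --stats=true -o init_compare ./signals
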
 Time (%)  Total Time (ns)  Instances  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)                                                  Name
 --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     51.1          677,415          1  677,415.0  677,415.0   677,415   677,415          0.0  void cub::CUB_200400_890_NS::detail::for_each::static_kernel<cub::CUB_200400_890_NS::detail::for_ea…
     42.1          557,926          1  557,926.0  557,926.0   557,926   557,926          0.0  void cub::CUB_200400_890_NS::detail::for_each::static_kernel<cub::CUB_200400_890_NS::detail::for_ea…
      6.8           90,081          1   90,081.0   90,081.0    90,081    90,081          0.0  void cub::CUB_200400_890_NS::detail::for_each::static_kernel<cub::CUB_200400_890_NS::detail::for_ea…

I need to investigate the whole code to understand why the end-to-end time changes between the two methods

@rwarmstr (Author) commented Sep 23, 2024

Interesting - here is my CMake config for reference:

cmake_minimum_required(VERSION 3.20)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

set(CMAKE_CUDA_STANDARD 17)
set(CMAKE_CUDA_STANDARD_REQUIRED ON)

set(CMAKE_CUDA_ARCHITECTURES "native")
if(CMAKE_BUILD_TYPE STREQUAL "Debug")
    set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -g -G")  # enable cuda-gdb
endif()

set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --extended-lambda")

project(signals LANGUAGES CXX CUDA)

find_package(CUDAToolkit REQUIRED)

add_executable(signals
    signals.cu
)

target_link_libraries(signals
    CUDA::cufft
)

set_target_properties(signals PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
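
For reference, in a Debug configuration this roughly corresponds to a compile line like the following (an approximation; the exact flags depend on the CMake version and generator):

    nvcc -std=c++17 -g -G --extended-lambda -rdc=true -arch=native signals.cu -o signals -lcufft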

@rwarmstr (Author)

Also interestingly, my mangled kernel name is different - I'm using the version that shipped in CTK 12.6.1. Is there a delta between that and the one you're using (presumably the version from GitHub)?

void cub::CUB_200500_890_NS::detail::for_each::static_kernel<cub::CUB_200500_890_NS::detail::for_each::policy_hub_t::policy_350_t, long, thrust::THRUST_200500_890_NS::cuda_cub::__transform::unary_transform_f<thrust::THRUST_200500_890_NS::transform_iterator<add_waves, thrust::THRUST_200500_890_NS::zip_iterator<thrust::THRUST_200500_890_NS::tuple<thrust::THRUST_200500_890_NS::transform_iterator<sine_wave_functor, thrust::THRUST_200500_890_NS::counting_iterator<int, thrust::THRUST_200500_890_NS::use_default, thrust::THRUST_200500_890_NS::use_default, thrust::THRUST_200500_890_NS::use_default>, thrust::THRUST_200500_890_NS::use_default, thrust::THRUST_200500_890_NS::use_default>, thrust::THRUST_200500_890_NS::transform_iterator<sine_wave_functor, thrust::THRUST_200500_890_NS::counting_iterator<int, thrust::THRUST_200500_890_NS::use_default, thrust::THRUST_200500_890_NS::use_default, thrust::THRUST_200500_890_NS::use_default>, thrust::THRUST_200500_890_NS::use_default, thrust::THRUST_200500_890_NS::use_default>>>, thrust::THRUST_200500_890_NS::use_default, thrust::THRUST_200500_890_NS::use_default>, thrust::THRUST_200500_890_NS::device_ptr<float>, thrust::THRUST_200500_890_NS::cuda_cub::__transform::no_stencil_tag, thrust::THRUST_200500_890_NS::identity<float>, thrust::THRUST_200500_890_NS::cuda_cub::__transform::always_true_predicate>>(T2, T3)

@fbusato (Contributor) commented Sep 24, 2024

Yes, I used CUDA 12.5 to set up the experiment quickly, but I didn't notice any meaningful difference. I'm trying with CUDA 12.6u1.

@fbusato (Contributor) commented Sep 24, 2024

I tried with the same configuration:

  • same GPU, CUDA toolkit 12.6u1 (libs + compiler)
  • whole application profiling
  • I also evaluated the same code with more elements (32 × 10^6) to see if the difference becomes more visible.

I'm still observing very similar performance.

Original code:

 Time (%)  Total Time (ns)  Instances  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)                                                  Name       
 --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     49.8        1,416,034          3  472,011.3  455,040.0   449,121   511,873     34,647.8  void regular_fft_factor<(unsigned int)256, EPT<(unsigned int)16>, (unsigned int)8, (unsigned int)0,…
     23.2          658,721          1  658,721.0  658,721.0   658,721   658,721          0.0  void cub::CUB_200500_890_NS::detail::for_each::static_kernel<cub::CUB_200500_890_NS::detail::for_ea…
     13.0          369,505          1  369,505.0  369,505.0   369,505   369,505          0.0  void postprocess_kernel<float, unsigned int, (loadstore_modifier_t)2>(real_complex_args_t<T2>)
      8.8          251,552          1  251,552.0  251,552.0   251,552   251,552          0.0  void cub::CUB_200500_890_NS::detail::for_each::static_kernel<cub::CUB_200500_890_NS::detail::for_ea…
      5.2          148,256          1  148,256.0  148,256.0   148,256   148,256          0.0  void cub::CUB_200500_890_NS::detail::for_each::static_kernel<cub::CUB_200500_890_NS::detail::for_ea…

[7/8] Executing 'cuda_gpu_mem_time_sum' stats report

 Time (%)  Total Time (ns)  Count   Avg (ns)     Med (ns)    Min (ns)    Max (ns)   StdDev (ns)           Operation
 --------  ---------------  -----  -----------  -----------  ---------  ----------  -----------  ----------------------------
     95.9       18,643,131      2  9,321,565.5  9,321,565.5  6,160,393  12,482,738  4,470,573.0  [CUDA memcpy Device-to-Host]
      4.1          792,577      7    113,225.3        384.0        384     790,273    298,550.0  [CUDA memcpy Host-to-Device]

[8/8] Executing 'cuda_gpu_mem_size_sum' stats report

 Total (MB)  Count  Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)           Operation
 ----------  -----  --------  --------  --------  --------  -----------  ----------------------------
    201.327      2   100.663   100.663    67.109   134.218       47.453  [CUDA memcpy Device-to-Host]
      6.799      7     0.971     0.000     0.000     6.799        2.570  [CUDA memcpy Host-to-Device]

"Fast initialization":

 Time (%)  Total Time (ns)  Instances  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)                                                  Name       
 --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     49.6        1,415,906          3  471,968.7  456,897.0   446,848   512,161     35,168.4  void regular_fft_factor<(unsigned int)256, EPT<(unsigned int)16>, (unsigned int)8, (unsigned int)0,…
     18.8          536,705          1  536,705.0  536,705.0   536,705   536,705          0.0  void cub::CUB_200500_890_NS::detail::for_each::static_kernel<cub::CUB_200500_890_NS::detail::for_ea…
     13.1          372,737          1  372,737.0  372,737.0   372,737   372,737          0.0  void postprocess_kernel<float, unsigned int, (loadstore_modifier_t)2>(real_complex_args_t<T2>)
      8.9          252,992          1  252,992.0  252,992.0   252,992   252,992          0.0  void cub::CUB_200500_890_NS::detail::for_each::static_kernel<cub::CUB_200500_890_NS::detail::for_ea…
      6.1          173,089          1  173,089.0  173,089.0   173,089   173,089          0.0  void cub::CUB_200500_890_NS::detail::for_each::static_kernel<cub::CUB_200500_890_NS::detail::for_ea…
      3.6          102,016          1  102,016.0  102,016.0   102,016   102,016          0.0  void cub::CUB_200500_890_NS::detail::for_each::static_kernel<cub::CUB_200500_890_NS::detail::for_ea…

[7/8] Executing 'cuda_gpu_mem_time_sum' stats report

 Time (%)  Total Time (ns)  Count    Avg (ns)      Med (ns)    Min (ns)    Max (ns)   StdDev (ns)           Operation
 --------  ---------------  -----  ------------  ------------  ---------  ----------  -----------  ----------------------------
     96.7       23,617,421      2  11,808,710.5  11,808,710.5  6,169,292  17,448,129  7,975,342.1  [CUDA memcpy Device-to-Host]
      3.3          801,186      7     114,455.1         384.0        352     798,850    301,789.8  [CUDA memcpy Host-to-Device]

[8/8] Executing 'cuda_gpu_mem_size_sum' stats report

 Total (MB)  Count  Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)           Operation
 ----------  -----  --------  --------  --------  --------  -----------  ----------------------------
    201.327      2   100.663   100.663    67.109   134.218       47.453  [CUDA memcpy Device-to-Host]
      6.799      7     0.971     0.000     0.000     6.799        2.570  [CUDA memcpy Host-to-Device]

Is there any other detail that could help explain the performance difference?

The only relevant difference I see is in the device-to-host memcpy.

@rwarmstr (Author)

That's very weird - your results are what I'd expect to see, but not what I actually see. Here's my "slow" result:

[6/8] Executing 'cuda_gpu_kern_sum' stats report

 Time (%)  Total Time (ns)  Instances    Avg (ns)      Med (ns)     Min (ns)    Max (ns)   StdDev (ns)                                                  Name
 --------  ---------------  ---------  ------------  ------------  ----------  ----------  -----------  ----------------------------------------------------------------------------------------------------
     83.3       11,260,683          1  11,260,683.0  11,260,683.0  11,260,683  11,260,683          0.0  void cub::CUB_200500_890_NS::detail::for_each::static_kernel<cub::CUB_200500_890_NS::detail::for_ea…
     15.2        2,060,697          1   2,060,697.0   2,060,697.0   2,060,697   2,060,697          0.0  void cub::CUB_200500_890_NS::detail::for_each::static_kernel<cub::CUB_200500_890_NS::detail::for_ea…
      1.3          180,898          1     180,898.0     180,898.0     180,898     180,898          0.0  void cub::CUB_200500_890_NS::detail::for_each::static_kernel<cub::CUB_200500_890_NS::detail::for_ea…
      0.1            9,760          1       9,760.0       9,760.0       9,760       9,760          0.0  void regular_fft_factor<(unsigned int)512, EPT<(unsigned int)8>, (unsigned int)8, (unsigned int)0, …
      0.0            6,241          1       6,241.0       6,241.0       6,241       6,241          0.0  void regular_fft_factor<(unsigned int)1024, EPT<(unsigned int)16>, (unsigned int)8, (unsigned int)0…
      0.0            3,968          1       3,968.0       3,968.0       3,968       3,968          0.0  void postprocess_kernel<float, unsigned int, (loadstore_modifier_t)2>(real_complex_args_t<T2>)

[7/8] Executing 'cuda_gpu_mem_time_sum' stats report

 Time (%)  Total Time (ns)  Count  Avg (ns)   Med (ns)   Min (ns)  Max (ns)   StdDev (ns)           Operation
 --------  ---------------  -----  ---------  ---------  --------  ---------  -----------  ----------------------------
     69.6        1,047,341      5  209,468.2      384.0       352  1,045,869    467,562.3  [CUDA memcpy Host-to-Device]
     30.4          456,900      2  228,450.0  228,450.0   142,337    314,563    121,782.2  [CUDA memcpy Device-to-Host]

[8/8] Executing 'cuda_gpu_mem_size_sum' stats report

 Total (MB)  Count  Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)           Operation
 ----------  -----  --------  --------  --------  --------  -----------  ----------------------------
      6.799      5     1.360     0.000     0.000     6.799        3.040  [CUDA memcpy Host-to-Device]
      6.291      2     3.146     3.146     2.097     4.194        1.483  [CUDA memcpy Device-to-Host]

I sent you my CMakeLists.txt and that's basically the entire thing - are you building in the Debug configuration or Release? Which GPU are you using for profiling? Mine is an RTX 6000 Ada.

@fbusato (Contributor) commented Sep 24, 2024

I didn't use CMakeLists.txt. Just a single compile command equivalent to release mode, but I don't think it makes a difference. I'm using the same GPU. I will try other experiments to understand where the problem is. Your profile results are useful in that direction.

@jrhemstad (Collaborator)

@fbusato one important thing in @rwarmstr's CMake is that he is enabling relocatable device code (set_target_properties(signals PROPERTIES CUDA_SEPARABLE_COMPILATION ON)). That can impact Thrust internals quite a bit, and so may affect your ability to reproduce the issue.

@rwarmstr (Author)

Based on some more experimentation, having separable compilation enabled didn't make a difference, but the use of -G did. I had expected device-side debug to carry significantly higher overhead, but I would have intuitively expected that overhead to be uniform; it turns out that isn't the case. Turning off the -G option makes both code paths equivalently efficient.
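
(As a general note, not something tested in this thread: for profiling, keeping optimizations and adding -lineinfo for source correlation avoids the distortion that -G introduces.)

    # device-debug build: disables device optimization and inlining
    nvcc -G --extended-lambda signals.cu -o signals -lcufft
    # profiling-friendly build: optimized, but retains source line info
    nvcc -O3 -lineinfo --extended-lambda signals.cu -o signals -lcufft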

So, this doesn't seem to be an inherent issue with CCCL. Thanks @fbusato for your debug help and I think we can consider this resolved.

@jrhemstad (Collaborator)

> would have intuitively expected it to be uniform; turns out this wasn't the case.

Huh, I would have totally thought the same.

I'm guessing the difference happens because -G turns off function inlining, and the constructor path uses more deeply nested fancy iterators, which incurs more non-inlined function calls per element.
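
To illustrate with a simplified stand-in (hypothetical names, not Thrust internals) - each level of nesting is a real device function call unless the compiler inlines it, and -G prevents that:

    // Roughly what one element read of the "slow" initializer decomposes into:
    __device__ float inner_wave(int i)      // ~ sine_wave_functor via an inner transform_iterator
    {
        return sinf(0.001f * i);
    }

    __device__ float2 zipped(int i)         // ~ zip_iterator materializing the tuple
    {
        return make_float2(inner_wave(i), inner_wave(i));
    }

    __device__ float initializer_at(int i)  // ~ outer transform_iterator applying the lambda
    {
        float2 t = zipped(i);
        return t.x + t.y;
    }
    // An optimized build collapses this into a few instructions; under -G,
    // every level stays a separate call, paid once per element.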
