[BUG]: thrust::device_vector initialization with fancy iterators 4x slower than initialization + thrust::transform #2451
Both methods use the same cub:: routine (…).
In the "fast" method, total end-to-end time for initialization and population, including lazy loading, is 7.965 ms. In the 'slow' case, the same goalposts add to 20.72 ms. The kernel runtime itself is 11.286 ms in the slow case and 2.584 in the fast case. This information is visible in the traces attached. How are you isolating them to say they're the same? |
This is the methodology that I adopted:
I need to investigate the whole code to understand why the end-to-end time differs between the two methods.
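For reference, one common way to separate allocation overhead from the fill kernel itself (not necessarily the exact methodology used in this thread) is to bracket the call under test with CUDA events and read kernel-only times from a profiler such as Nsight Systems. In this sketch, `first` and `n` stand in for the fancy-iterator range and element count from the reproducer and are not defined here:

```cpp
#include <thrust/device_vector.h>
#include <cuda_runtime.h>

// Hypothetical harness: `first` is the fancy input iterator and `n` the
// element count from the reproducer.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
thrust::device_vector<float> v(first, first + n); // "slow" method under test
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop); // end-to-end: allocation + fill kernel
```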
Interesting. Here is my CMake config for reference:

```cmake
cmake_minimum_required(VERSION 3.20)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_CUDA_STANDARD 17)
set(CMAKE_CUDA_STANDARD_REQUIRED ON)
set(CMAKE_CUDA_ARCHITECTURES "native")

if(CMAKE_BUILD_TYPE STREQUAL "Debug")
  set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -g -G") # enable cuda-gdb
endif()
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --extended-lambda")

project(signals LANGUAGES CXX CUDA)

find_package(CUDAToolkit REQUIRED)

add_executable(signals
  signals.cu
)
target_link_libraries(signals
  CUDA::cufft
)
set_target_properties(signals PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
```
Also, interestingly, my mangled name for the kernel is different. I'm using the version that shipped in CTK 12.6.1; is there a delta between that and the one you're using (presumably the version from GitHub)?

```
void cub::CUB_200500_890_NS::detail::for_each::static_kernel<cub::CUB_200500_890_NS::detail::for_each::policy_hub_t::policy_350_t, long, thrust::THRUST_200500_890_NS::cuda_cub::__transform::unary_transform_f<thrust::THRUST_200500_890_NS::transform_iterator<add_waves, thrust::THRUST_200500_890_NS::zip_iterator<thrust::THRUST_200500_890_NS::tuple<thrust::THRUST_200500_890_NS::transform_iterator<sine_wave_functor, thrust::THRUST_200500_890_NS::counting_iterator<int, thrust::THRUST_200500_890_NS::use_default, thrust::THRUST_200500_890_NS::use_default, thrust::THRUST_200500_890_NS::use_default>, thrust::THRUST_200500_890_NS::use_default, thrust::THRUST_200500_890_NS::use_default>, thrust::THRUST_200500_890_NS::transform_iterator<sine_wave_functor, thrust::THRUST_200500_890_NS::counting_iterator<int, thrust::THRUST_200500_890_NS::use_default, thrust::THRUST_200500_890_NS::use_default, thrust::THRUST_200500_890_NS::use_default>, thrust::THRUST_200500_890_NS::use_default, thrust::THRUST_200500_890_NS::use_default>>>, thrust::THRUST_200500_890_NS::use_default, thrust::THRUST_200500_890_NS::use_default>, thrust::THRUST_200500_890_NS::device_ptr<float>, thrust::THRUST_200500_890_NS::cuda_cub::__transform::no_stencil_tag, thrust::THRUST_200500_890_NS::identity<float>, thrust::THRUST_200500_890_NS::cuda_cub::__transform::always_true_predicate>>(T2, T3)
```
Yes, I tried CUDA 12.5 to set up the experiment quickly, but I didn't notice any meaningful difference. I'm trying with CUDA 12.6u1.
I tried with the same configuration and I'm still observing very similar performance.

Original code:

"Fast initialization":

Is there any other detail that could help explain the performance difference? The only relevant difference that I see is in the memcpy to host.
That's very weird; your results are what I'd expect to see, but not what I actually see. Here's my "slow" result:
I sent you my CMakeLists.txt, and that's basically the entire thing. Are you building in Debug or Release configuration? Which GPU are you using for profiling? Mine is an RTX A6000 Ada.
I didn't use CMakeLists.txt, just a single compile command equivalent to release mode, but I don't think it makes a difference. I'm using the same GPU. I will try other experiments to understand where the problem is; your profile results are useful in that direction.
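For comparison, a release-mode compile command consistent with the CMake config quoted earlier might look something like this (the exact command used isn't shown in the thread):

```
nvcc -O3 -std=c++17 --extended-lambda -o signals signals.cu -lcufft
```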
Based on some more experimentation, separable compilation didn't make a difference, but the use of … did. So this doesn't seem to be an inherent issue with CCCL. Thanks @fbusato for your help debugging; I think we can consider this resolved.
Huh, I would have totally thought the same. I'm guessing the difference here happens because …
Is this a duplicate?
Type of Bug
Performance
Component
Thrust
Describe the bug
In a performance comparison between two methods of initializing device vectors, creating a zero-initialized vector and subsequently initializing it with thrust::transform seems to be approximately 4x faster than using fancy iterators to initialize the same vector through the constructor.
How to Reproduce
Example of fast initialization:
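The original snippet is not preserved in this copy of the thread (see the Godbolt link below for the real reproducer). The following is a hedged reconstruction based on the functor and iterator names visible in the mangled kernel symbol quoted earlier (`sine_wave_functor`, `add_waves`, a zip of two transform iterators over counting iterators); the functor bodies, parameters, and element count are invented for illustration:

```cpp
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>

// Invented functor: generates a sine-wave sample from a sample index.
struct sine_wave_functor {
  float freq;
  float amp;
  __host__ __device__ float operator()(int i) const {
    return amp * sinf(freq * static_cast<float>(i));
  }
};

// Invented functor: sums the two zipped wave samples.
struct add_waves {
  __host__ __device__ float operator()(const thrust::tuple<float, float>& t) const {
    return thrust::get<0>(t) + thrust::get<1>(t);
  }
};

int main() {
  const int n = 1 << 24; // invented size

  auto wave_a = thrust::make_transform_iterator(thrust::make_counting_iterator(0),
                                                sine_wave_functor{0.010f, 1.0f});
  auto wave_b = thrust::make_transform_iterator(thrust::make_counting_iterator(0),
                                                sine_wave_functor{0.025f, 0.5f});
  auto zipped = thrust::make_zip_iterator(thrust::make_tuple(wave_a, wave_b));

  // "Fast" variant: allocate (and zero-fill) first, then populate in place.
  thrust::device_vector<float> signal(n);
  thrust::transform(zipped, zipped + n, signal.begin(), add_waves{});
  return 0;
}
```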
Example of slow initialization:
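Again a reconstruction rather than the reporter's exact code, reusing `sine_wave_functor`, `add_waves`, `wave_a`, `wave_b`, and `n` from the sketch above. The only difference is that the fancy-iterator range is passed straight to the `device_vector` constructor:

```cpp
// "Slow" variant: construct directly from the fancy-iterator range.
// Per the report this is ~4x slower end to end, even though the same
// cub kernel ultimately does the filling.
auto first = thrust::make_transform_iterator(
    thrust::make_zip_iterator(thrust::make_tuple(wave_a, wave_b)), add_waves{});

thrust::device_vector<float> signal(first, first + n);
```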
Expected behavior
Both methods of initialization should use similar code paths and take similar amounts of time.
Reproduction link
https://godbolt.org/z/bq4G3bces
Operating System
Ubuntu Linux 24.04
nvidia-smi output
NVCC version