[FEA]: Design single-stage mr-based API for CUB #2523

gevtushenko · 2024-10-09T18:44:10Z

Is this a duplicate?

I confirmed there appear to be no duplicate issues for this request and that I agree to the Code of Conduct

Area

CUB

Is your feature request related to a problem? Please describe.

It's not uncommon for users to define a macro that queries temporary storage required by CUB algorithm, allocates that storage using cudaMemcpyAsync, invokes the algorithm, and then frees the storage. This essentially makes the two-stage CUB API a single-stage one while preserving asynchrony. This approach leads to less verbose API, and addresses issues associated with mismatching parameters at query and execution stages. We should have a standard solution.

Describe the solution you'd like

We should consider memory resource-based API for CUB. That'd allow users to customize temporary storage allocation when needed and take advantage of asynchronous memory managemenet by default. Something along the lines of:

cub::DeviceReduce::Max(in_it, out_it, in_it.size(), stream, mr = cudax::mr::async_resource{});

Describe alternatives you've considered

We could have a wrapper function / macro, but these solutions are more verbose and limits functionality.

Additional context

No response

The text was updated successfully, but these errors were encountered:

NailaRais · 2025-02-04T14:17:31Z

template <typename InputIt, typename OutputIt, typename MR = cudax::mr::async_resource>
__host__ __device__ 
cudaError_t DeviceReduceMax(InputIt d_in, OutputIt d_out, int num_items, cudaStream_t stream = 0, MR mr = MR{}) {
    void* d_temp_storage = nullptr;
    size_t temp_storage_bytes = 0;

    // Query required temporary storage size
    cub::DeviceReduce::Max(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items, stream);

    // Allocate temporary storage using the provided memory resource
    d_temp_storage = mr.allocate(temp_storage_bytes);

    // Perform reduction
    cudaError_t err = cub::DeviceReduce::Max(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items, stream);

    // Free temporary storage
    mr.deallocate(d_temp_storage, temp_storage_bytes);

    return err;
}

[FEA] Implement single-stage MR-based API for CUB (#2523)

- Introduced a memory resource-based API for CUB.
- Eliminated the need for two-step (query + execution) calls.
- Enabled asynchronous memory allocation with cudax::mr::async_resource{}.
- Improved usability while maintaining backward compatibility.

How about this?

gevtushenko · 2025-02-18T18:05:38Z

@NailaRais the interface that you've specified looks great! We've just discussed the interface with the team. A few conclusions below:

New interface should accept stream first, then memory resource. We expect the need of changing the stream to be more frequent than changing the default memory resource behavior. This is confirmed by precedent of libcudf ordering stream and memory resource this way.
New interface should accept memory resource type as a template parameter to avoid any performance overhead caused by indirect calls coming from the type erasure.
We should be careful not to create ambiguity with existing overloads (probably SFINAE-out on size_t& as the second parameter).
Memory resources are experimental, so we might want to hide the new interface behind the LIBCUDACXX_ENABLE_EXPERIMENTAL_MEMORY_RESOURCE macro for now.

As to the implementation, it'll be tracked by a separate issue. @NailaRais it looks like you have an implementation for reduction. We'll split the work into sub-issues soon. Let us know if you'd like to contribute new API for reduction.

NailaRais · 2025-02-20T17:08:42Z

@NailaRais the interface that you've specified looks great! We've just discussed the interface with the team. A few conclusions below:

New interface should accept stream first, then memory resource. We expect the need of changing the stream to be more frequent than changing the default memory resource behavior. This is confirmed by precedent of libcudf ordering stream and memory resource this way.

New interface should accept memory resource type as a template parameter to avoid any performance overhead caused by indirect calls coming from the type erasure.

We should be careful not to create ambiguity with existing overloads (probably SFINAE-out on size_t& as the second parameter).

Memory resources are experimental, so we might want to hide the new interface behind the LIBCUDACXX_ENABLE_EXPERIMENTAL_MEMORY_RESOURCE macro for now.

As to the implementation, it'll be tracked by a separate issue. @NailaRais it looks like you have an implementation for reduction. We'll split the work into sub-issues soon. Let us know if you'd like to contribute new API for reduction.

@gevtushenko

Hi @gevtushenko 

Thank you for reviewing my proposal and for the detailed feedback! I appreciate the insights from the team discussion.

Here’s how I plan to incorporate the suggested improvements:

Reordering parameters – I will update the interface to accept the stream first, followed by the memory resource, ensuring consistency with libcudf conventions.
Template-based memory resource – I'll modify the implementation to take the memory resource type as a template parameter, avoiding any potential performance overhead due to type erasure.
Overload ambiguity handling – I’ll ensure that existing overloads are not affected by using SFINAE to avoid conflicts (e.g., handling size_t& as the second parameter).
Hiding the interface behind the experimental macro – Since memory resources are still experimental, I will wrap the new API behind LIBCUDACXX_ENABLE_EXPERIMENTAL_MEMORY_RESOURCE to align with best practices.
I’d be happy to contribute the new API for reduction as part of the upcoming sub-issues.

Looking forward to your thoughts!

Best,
Naila Rais

gevtushenko added the feature request New feature or request. label Oct 9, 2024

github-project-automation bot added this to CCCL Oct 9, 2024

github-project-automation bot moved this to Todo in CCCL Oct 9, 2024

gevtushenko added the good first issue Good for newcomers. label Oct 9, 2024

gevtushenko closed this as completed Feb 18, 2025

github-project-automation bot moved this from Todo to Done in CCCL Feb 18, 2025

gevtushenko mentioned this issue Feb 19, 2025

[FEA]: Add memory resource-based overloads to CUB device-level algorithms #3855

Open

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA]: Design single-stage mr-based API for CUB #2523

[FEA]: Design single-stage mr-based API for CUB #2523

gevtushenko commented Oct 9, 2024

NailaRais commented Feb 4, 2025

gevtushenko commented Feb 18, 2025

NailaRais commented Feb 20, 2025

[FEA]: Design single-stage mr-based API for CUB #2523

[FEA]: Design single-stage mr-based API for CUB #2523

Comments

gevtushenko commented Oct 9, 2024

Is this a duplicate?

Area

Is your feature request related to a problem? Please describe.

Describe the solution you'd like

Describe alternatives you've considered

Additional context

NailaRais commented Feb 4, 2025

gevtushenko commented Feb 18, 2025

NailaRais commented Feb 20, 2025