Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA]: Design single-stage mr-based API for CUB #2523

Closed
1 task done
gevtushenko opened this issue Oct 9, 2024 · 3 comments
Closed
1 task done

[FEA]: Design single-stage mr-based API for CUB #2523

gevtushenko opened this issue Oct 9, 2024 · 3 comments
Labels
feature request New feature or request. good first issue Good for newcomers.

Comments

@gevtushenko
Copy link
Collaborator

Is this a duplicate?

Area

CUB

Is your feature request related to a problem? Please describe.

It's not uncommon for users to define a macro that queries temporary storage required by CUB algorithm, allocates that storage using cudaMemcpyAsync, invokes the algorithm, and then frees the storage. This essentially makes the two-stage CUB API a single-stage one while preserving asynchrony. This approach leads to less verbose API, and addresses issues associated with mismatching parameters at query and execution stages. We should have a standard solution.

Describe the solution you'd like

We should consider memory resource-based API for CUB. That'd allow users to customize temporary storage allocation when needed and take advantage of asynchronous memory managemenet by default. Something along the lines of:

cub::DeviceReduce::Max(in_it, out_it, in_it.size(), stream, mr = cudax::mr::async_resource{});

Describe alternatives you've considered

We could have a wrapper function / macro, but these solutions are more verbose and limits functionality.

Additional context

No response

@gevtushenko gevtushenko added the feature request New feature or request. label Oct 9, 2024
@github-project-automation github-project-automation bot moved this to Todo in CCCL Oct 9, 2024
@gevtushenko gevtushenko added the good first issue Good for newcomers. label Oct 9, 2024
@NailaRais
Copy link

template <typename InputIt, typename OutputIt, typename MR = cudax::mr::async_resource>
__host__ __device__ 
cudaError_t DeviceReduceMax(InputIt d_in, OutputIt d_out, int num_items, cudaStream_t stream = 0, MR mr = MR{}) {
    void* d_temp_storage = nullptr;
    size_t temp_storage_bytes = 0;

    // Query required temporary storage size
    cub::DeviceReduce::Max(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items, stream);

    // Allocate temporary storage using the provided memory resource
    d_temp_storage = mr.allocate(temp_storage_bytes);

    // Perform reduction
    cudaError_t err = cub::DeviceReduce::Max(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items, stream);

    // Free temporary storage
    mr.deallocate(d_temp_storage, temp_storage_bytes);

    return err;
}

[FEA] Implement single-stage MR-based API for CUB (#2523)

- Introduced a memory resource-based API for CUB.
- Eliminated the need for two-step (query + execution) calls.
- Enabled asynchronous memory allocation with cudax::mr::async_resource{}.
- Improved usability while maintaining backward compatibility.

How about this?

@gevtushenko
Copy link
Collaborator Author

@NailaRais the interface that you've specified looks great! We've just discussed the interface with the team. A few conclusions below:

  • New interface should accept stream first, then memory resource. We expect the need of changing the stream to be more frequent than changing the default memory resource behavior. This is confirmed by precedent of libcudf ordering stream and memory resource this way.
  • New interface should accept memory resource type as a template parameter to avoid any performance overhead caused by indirect calls coming from the type erasure.
  • We should be careful not to create ambiguity with existing overloads (probably SFINAE-out on size_t& as the second parameter). 
  • Memory resources are experimental, so we might want to hide the new interface behind the LIBCUDACXX_ENABLE_EXPERIMENTAL_MEMORY_RESOURCE macro for now.

As to the implementation, it'll be tracked by a separate issue. @NailaRais it looks like you have an implementation for reduction. We'll split the work into sub-issues soon. Let us know if you'd like to contribute new API for reduction.

@NailaRais
Copy link

@NailaRais the interface that you've specified looks great! We've just discussed the interface with the team. A few conclusions below:

  • New interface should accept stream first, then memory resource. We expect the need of changing the stream to be more frequent than changing the default memory resource behavior. This is confirmed by precedent of libcudf ordering stream and memory resource this way.
  • New interface should accept memory resource type as a template parameter to avoid any performance overhead caused by indirect calls coming from the type erasure.
  • We should be careful not to create ambiguity with existing overloads (probably SFINAE-out on size_t& as the second parameter).
  • Memory resources are experimental, so we might want to hide the new interface behind the LIBCUDACXX_ENABLE_EXPERIMENTAL_MEMORY_RESOURCE macro for now.

As to the implementation, it'll be tracked by a separate issue. @NailaRais it looks like you have an implementation for reduction. We'll split the work into sub-issues soon. Let us know if you'd like to contribute new API for reduction.

@gevtushenko

Hi @gevtushenko 

Thank you for reviewing my proposal and for the detailed feedback! I appreciate the insights from the team discussion.

Here’s how I plan to incorporate the suggested improvements:

Reordering parameters – I will update the interface to accept the stream first, followed by the memory resource, ensuring consistency with libcudf conventions.
Template-based memory resource – I'll modify the implementation to take the memory resource type as a template parameter, avoiding any potential performance overhead due to type erasure.
Overload ambiguity handling – I’ll ensure that existing overloads are not affected by using SFINAE to avoid conflicts (e.g., handling size_t& as the second parameter).
Hiding the interface behind the experimental macro – Since memory resources are still experimental, I will wrap the new API behind LIBCUDACXX_ENABLE_EXPERIMENTAL_MEMORY_RESOURCE to align with best practices.
I’d be happy to contribute the new API for reduction as part of the upcoming sub-issues.

Looking forward to your thoughts!

Best,
Naila Rais


Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request. good first issue Good for newcomers.
Projects
Archived in project
Development

No branches or pull requests

2 participants