[FEA]: Design single-stage mr-based API for CUB #2523
Comments
How about this?
@NailaRais the interface that you've specified looks great! We've just discussed the interface with the team. A few conclusions below:
As to the implementation, it'll be tracked by a separate issue. @NailaRais it looks like you have an implementation for reduction. We'll split the work into sub-issues soon. Let us know if you'd like to contribute the new API for reduction.
Is this a duplicate?
Area
CUB
Is your feature request related to a problem? Please describe.
It's not uncommon for users to define a macro that queries the temporary storage required by a CUB algorithm, allocates that storage using
cudaMallocAsync
, invokes the algorithm, and then frees the storage. This essentially turns the two-stage CUB API into a single-stage one while preserving asynchrony. This approach leads to a less verbose API and addresses issues associated with mismatching parameters between the query and execution stages. We should have a standard solution.
Describe the solution you'd like
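For reference, the two-stage pattern described above looks roughly like this (a sketch using cub::DeviceReduce::Max with stream-ordered allocation; error handling omitted for brevity):

```cpp
#include <cub/device/device_reduce.cuh>

// Sketch of the existing two-stage CUB pattern: query the required
// temporary storage size, allocate it in stream order, run the
// algorithm, then free the storage on the same stream.
void max_reduce(const float* d_in, float* d_out, int num_items,
                cudaStream_t stream)
{
  void*  d_temp_storage = nullptr;
  size_t temp_storage_bytes = 0;

  // Stage 1: d_temp_storage == nullptr, so this call only
  // computes temp_storage_bytes.
  cub::DeviceReduce::Max(d_temp_storage, temp_storage_bytes,
                         d_in, d_out, num_items, stream);

  cudaMallocAsync(&d_temp_storage, temp_storage_bytes, stream);

  // Stage 2: actually run the reduction.
  cub::DeviceReduce::Max(d_temp_storage, temp_storage_bytes,
                         d_in, d_out, num_items, stream);

  cudaFreeAsync(d_temp_storage, stream);
}
```

Every algorithm call site has to repeat this boilerplate, and nothing prevents the query and execution calls from drifting out of sync.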
We should consider a memory resource-based API for CUB. That'd allow users to customize temporary storage allocation when needed and take advantage of asynchronous memory management by default. Something along the lines of:
cub::DeviceReduce::Max(in_it, out_it, in_it.size(), stream, mr = cudax::mr::async_resource{});
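Hypothetically, such an overload would let users fall back on a default asynchronous resource or pass their own (the names and overloads below are illustrative only, extrapolated from the proposal, not an existing CUB API):

```cpp
// Hypothetical single-stage calls (illustrative, not a real API):

// Default: temporary storage comes from an async memory resource.
cub::DeviceReduce::Max(in_it, out_it, in_it.size(), stream);

// Customized: temporary storage comes from a user-provided resource
// (my_pool_resource is a placeholder for any compatible resource type).
cub::DeviceReduce::Max(in_it, out_it, in_it.size(), stream, my_pool_resource);
```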
Describe alternatives you've considered
We could have a wrapper function / macro, but these solutions are more verbose and limit functionality.
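For illustration, a wrapper-function alternative might look like the sketch below (single_stage is a hypothetical helper, not a CUB API); the extra lambda at every call site shows the verbosity cost:

```cpp
#include <cstddef>
#include <utility>

// Hypothetical helper: runs a two-stage CUB call as a single stage by
// performing the query pass, a stream-ordered allocation, the execution
// pass, and a stream-ordered free.
template <class F>
cudaError_t single_stage(F&& two_stage_call, cudaStream_t stream)
{
  void*  d_temp_storage = nullptr;
  size_t temp_storage_bytes = 0;

  // Query pass: d_temp_storage == nullptr, so only the size is written.
  cudaError_t err = two_stage_call(d_temp_storage, temp_storage_bytes);
  if (err != cudaSuccess) return err;

  err = cudaMallocAsync(&d_temp_storage, temp_storage_bytes, stream);
  if (err != cudaSuccess) return err;

  // Execution pass.
  err = two_stage_call(d_temp_storage, temp_storage_bytes);
  cudaFreeAsync(d_temp_storage, stream);
  return err;
}

// Call-site usage still needs a lambda per invocation:
//   single_stage([&](void* t, size_t& b) {
//     return cub::DeviceReduce::Max(t, b, d_in, d_out, n, stream);
//   }, stream);
```

Beyond the verbosity, a wrapper like this cannot customize where the temporary storage comes from, which is exactly what a memory-resource parameter would provide.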
Additional context
No response