[FEA]: Introduce cuda.cooperative overloads not requiring temporary storage #2527
Closed
Labels
feature request
Is this a duplicate?
Area
CUB
Is your feature request related to a problem? Please describe.
The cuda.cooperative API currently has an issue: we do not specify the alignment of the temporary storage. This leads to bugs like the following one:
Because both allocations of shared memory are made at `uint8` granularity, the second one is not properly aligned, leading to:
Describe the solution you'd like
The majority of kernels do not create temporary storage unions, so we could simplify the API by not requiring temporary storage:
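For illustration only, here is a minimal pure-Python sketch of the offset arithmetic behind the alignment problem, and of the rounding an implementation could apply if it owned the allocation itself. It is not the cuda.cooperative API: the helper names `packed_offsets` and `aligned_offsets`, the buffer sizes, and the 16-byte alignment requirement are all hypothetical, and the sketch assumes the base of the shared-memory region is itself suitably aligned.

```python
def packed_offsets(sizes):
    """Place buffers back to back at byte (uint8) granularity, ignoring alignment."""
    offsets, cursor = [], 0
    for size in sizes:
        offsets.append(cursor)
        cursor += size
    return offsets


def aligned_offsets(sizes, alignments):
    """Round each buffer's start offset up to that buffer's required alignment."""
    offsets, cursor = [], 0
    for size, alignment in zip(sizes, alignments):
        # Round cursor up to the next multiple of `alignment`.
        cursor = (cursor + alignment - 1) // alignment * alignment
        offsets.append(cursor)
        cursor += size
    return offsets


# Hypothetical requirements: two temporary-storage buffers of 100 and 64 bytes,
# each assumed to need 16-byte alignment.
sizes = [100, 64]
alignments = [16, 16]

packed = packed_offsets(sizes)
aligned = aligned_offsets(sizes, alignments)

print(packed, [offset % 16 for offset in packed])    # second offset is not a multiple of 16
print(aligned, [offset % 16 for offset in aligned])  # every offset is a multiple of 16
```

With byte-granular packing the second buffer starts at offset 100, which is not a multiple of 16. An overload that does not take temporary storage from the caller would, under the assumption that it allocates internally, be in a position to apply the rounding shown in `aligned_offsets` itself.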
Describe alternatives you've considered
No response
Additional context
No response