Often, it is recommended to loop over items in a CUDA kernel like:
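For reference, a grid-stride loop typically looks something like the following sketch (the kernel and parameter names here are illustrative, not taken from the codebase):

```cuda
// Minimal grid-stride loop sketch: each thread starts at its global index
// and advances by the total number of threads in the grid, so any launch
// configuration covers all n elements.
__global__ void scale(float *out, const float *in, float alpha, size_t n) {
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += (size_t)gridDim.x * blockDim.x) {
        out[i] = alpha * in[i];
    }
}
```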
This was recently done for most kernels in #787
However, this is not currently implemented for reductions.
Currently, each chunk to be reduced is spread across thread blocks, so the threads in a single block may be working on one or more chunks. How would this interact with grid striding?
Additionally, how would shared memory be used?
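For reference, one common way grid striding is combined with shared memory in a sum reduction is sketched below. This is the standard pattern, not necessarily how it should be wired into the kernels here: each thread first accumulates a private partial sum over a grid-stride loop, the block then combines those partials with a shared-memory tree reduction, and one thread per block merges the block result (all names are illustrative):

```cuda
// Sketch of a grid-strided sum reduction (assumes out is zero-initialized
// and blockDim.x is a power of two; launch with blockDim.x * sizeof(float)
// bytes of dynamic shared memory).
__global__ void sum_reduce(const float *in, float *out, size_t n) {
    extern __shared__ float smem[];

    // Phase 1: grid-stride accumulation into a per-thread register, so the
    // grid size is decoupled from n.
    float acc = 0.0f;
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += (size_t)gridDim.x * blockDim.x) {
        acc += in[i];
    }

    // Phase 2: tree reduction of the per-thread partials in shared memory.
    smem[threadIdx.x] = acc;
    __syncthreads();
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) smem[threadIdx.x] += smem[threadIdx.x + s];
        __syncthreads();
    }

    // Phase 3: one thread per block merges the block's partial sum.
    if (threadIdx.x == 0) atomicAdd(out, smem[0]);
}
```

The atomicAdd could instead be replaced by a second kernel pass over the per-block partials if deterministic summation order matters.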