Often, it is recommended to loop over items in a CUDA kernel like:
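For reference, a grid-stride loop typically looks something like the following sketch (the kernel and parameter names here are illustrative, not taken from the codebase):

```cuda
// Minimal grid-stride loop sketch: each thread starts at its global index
// and advances by the total number of threads in the grid, so any launch
// configuration covers all n elements.
__global__ void scale(float *out, const float *in, float alpha, size_t n) {
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += (size_t)gridDim.x * blockDim.x) {
        out[i] = alpha * in[i];
    }
}
```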
This was recently done for most kernels in #787
However, this is not currently implemented for reductions.
Currently, each chunk to be reduced is spread across thread blocks, so the threads in a single block may be working on one or more chunks. How would this interact with grid striding?
Additionally, how would shared memory be used?
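For reference, one common way grid striding is combined with shared memory in a sum reduction is sketched below. This is the standard pattern, not necessarily how it should be wired into the kernels here: each thread first accumulates a private partial sum over a grid-stride loop, the block then combines those partials with a shared-memory tree reduction, and one thread per block merges the block result (all names are illustrative):

```cuda
// Sketch of a grid-strided sum reduction (assumes out is zero-initialized
// and blockDim.x is a power of two; launch with blockDim.x * sizeof(float)
// bytes of dynamic shared memory).
__global__ void sum_reduce(const float *in, float *out, size_t n) {
    extern __shared__ float smem[];

    // Phase 1: grid-stride accumulation into a per-thread register, so the
    // grid size is decoupled from n.
    float acc = 0.0f;
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += (size_t)gridDim.x * blockDim.x) {
        acc += in[i];
    }

    // Phase 2: tree reduction of the per-thread partials in shared memory.
    smem[threadIdx.x] = acc;
    __syncthreads();
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) smem[threadIdx.x] += smem[threadIdx.x + s];
        __syncthreads();
    }

    // Phase 3: one thread per block merges the block's partial sum.
    if (threadIdx.x == 0) atomicAdd(out, smem[0]);
}
```

The atomicAdd could instead be replaced by a second kernel pass over the per-block partials if deterministic summation order matters.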