Issue using cub reduce on more elements than fit into a 4-byte integer #129
Comments
So, an explanation and a solution: CUB's algorithms actually aren't hard-coded for 32-bit counts; they are specialized by a template parameter. However, 64-bit offsets take twice the register file (RF) space of 32-bit offsets, and many of the algorithms (prefix sum, radix sort, etc.) carry pervasive bookkeeping offsets and counts, so specializing for 64-bit counts often reduces performance because the extra RF pressure lowers occupancy. So the outer interface specializes everything for 32-bit counts. If you want to use a 64-bit count, see for example https://github.com/NVlabs/cub/blob/1.8.0/cub/device/device_reduce.cuh#L148 (which is where the outer interface specializes …)
Duane
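A CPU-only sketch of the layering described above may help: the algorithm is templated on an offset type, and a thin outer wrapper pins that type to int. All names here (DispatchReduceSketch, ReduceSketch, ReduceSketch64) are hypothetical and this is not CUB code; it only illustrates the type-level structure, not the GPU register-pressure trade-off.

```cpp
#include <cstdint>

// Hypothetical "dispatch layer": the reduction is templated on OffsetT,
// the integer type used for the item count and the running index.
template <typename OffsetT, typename T, typename ReductionOpT>
T DispatchReduceSketch(const T* in, OffsetT num_items, ReductionOpT op, T init)
{
    T acc = init;
    for (OffsetT i = 0; i < num_items; ++i)  // index has the same width as OffsetT
        acc = op(acc, in[i]);
    return acc;
}

// Hypothetical "outer interface": pins OffsetT to int, the way the
// DeviceReduce entry points fix their offset type to 32 bits.
template <typename T, typename ReductionOpT>
T ReduceSketch(const T* in, int num_items, ReductionOpT op, T init)
{
    return DispatchReduceSketch<int>(in, num_items, op, init);
}

// A caller that needs more than INT_MAX items can skip the outer
// interface and choose a 64-bit offset type explicitly.
template <typename T, typename ReductionOpT>
T ReduceSketch64(const T* in, std::int64_t num_items, ReductionOpT op, T init)
{
    return DispatchReduceSketch<std::int64_t>(in, num_items, op, init);
}
```

In CUB 1.8.0, the dispatch layer that DeviceReduce forwards to is cub::DispatchReduce in cub/device/dispatch/dispatch_reduce.cuh, and its OffsetT template parameter plays the role of OffsetT above. Its template parameters and Dispatch signature have changed across CUB releases, so check the header of the version you actually build against.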
OK, yes, I see how to do this now. This does unstick us, thank you. Felipe
So maybe I'm missing something here, but it appears that …
@jakirkham I guess what @dumerrill meant is to invoke …
Ah sorry. I got overly focused on the highlighted line. So IIUC we should be looking here. Is that right?
Yeah, I guess so.
Also, if 32-bit offsets are significantly more performant than 64-bit ones, what is the recommended way to do reductions whose size exceeds the range of a 32-bit signed integer? Additionally, I understand that 32 bits has special value here, but why is a signed type used instead of an unsigned one? Switching would double the range of allowed values without increasing the number of bits used.
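One common workaround for the first question (not something the thread itself recommends) is to split the input into chunks of at most INT_MAX items and feed each chunk's result in as the next chunk's init, so init is applied exactly once and the final value equals one reduction over the whole range for an associative operator. A minimal sketch, where ChunkedReduce and the reduce_chunk callback are hypothetical names:

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>

// Hypothetical helper: reduces [0, num_items) in chunks whose counts fit in int.
// reduce_chunk(offset, count, init) must return the reduction of items
// [offset, offset + count) seeded with init; on the GPU it would wrap one
// int-counted call per chunk.
template <typename T, typename ReduceChunk>
T ChunkedReduce(std::size_t num_items, T init, ReduceChunk reduce_chunk,
                std::size_t max_chunk = static_cast<std::size_t>(std::numeric_limits<int>::max()))
{
    const std::size_t int_max = static_cast<std::size_t>(std::numeric_limits<int>::max());
    if (max_chunk == 0 || max_chunk > int_max)
        max_chunk = int_max;  // keep every chunk count in (0, INT_MAX]

    T acc = init;
    std::size_t offset = 0;
    while (offset < num_items) {
        const std::size_t count = std::min(max_chunk, num_items - offset);
        // Seed each chunk with the running result so init is applied only once.
        acc = reduce_chunk(offset, static_cast<int>(count), acc);
        offset += count;  // never exceeds num_items, so no size_t wraparound
    }
    return acc;
}
```

On the device, one way to fill in reduce_chunk (an assumption, not shown) is to call cub::DeviceReduce::Reduce on d_in + offset with num_items = count and the given init, then copy the one-element result back to the host with cudaMemcpy to use as the next init. That costs one synchronization per chunk, which is small relative to chunks of up to about 2^31 items.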
Closing as this is part of a larger issue being tracked in #212. |
cudaError_t cub::DeviceReduce::Reduce(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, ReductionOpT reduction_op, T init, cudaStream_t stream = 0, bool debug_synchronous = false)
My issue seems to be that num_items is of type int, so when I try to reduce more elements than fit into a 4-byte signed integer (more than 2,147,483,647), the count overflows and the code obviously doesn't work properly. Given that GPUs keep growing in memory size, and that we can now oversubscribe GPU memory using managed memory (cudaMallocManaged), are there any plans to change that parameter to accept size_t?
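To make the overflow concrete: the largest value an int can hold is 2^31 - 1 = 2,147,483,647, and on a two's-complement target a plain static_cast of 3,000,000,000 to a 32-bit int yields -1,294,967,296 (implementation-defined before C++20, defined as modular since C++20). A minimal sketch of a guard, using a hypothetical helper name CheckedNumItems, that rejects such counts before they reach the int num_items parameter instead of letting them wrap silently:

```cpp
#include <cstddef>
#include <limits>
#include <stdexcept>

// Hypothetical helper: converts a size_t element count to the int that
// cub::DeviceReduce::Reduce takes as num_items, throwing instead of
// letting values above INT_MAX wrap to a wrong (possibly negative) count.
inline int CheckedNumItems(std::size_t n)
{
    if (n > static_cast<std::size_t>(std::numeric_limits<int>::max()))
        throw std::out_of_range("num_items does not fit in int");
    return static_cast<int>(n);
}
```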