This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Issue using cub reduce on more than elements than fit into a 4 byte integer #129

Closed
felipeblazing opened this issue Feb 20, 2018 · 8 comments

Comments


felipeblazing commented Feb 20, 2018

Reduce (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, ReductionOpT reduction_op, T init, cudaStream_t stream=0, bool debug_synchronous=false)

My issue is that num_items is of type int, so when I try to reduce more elements than fit into a 4-byte integer, the count overflows and the code doesn't work properly. Given that GPUs are growing in RAM size and that we can now oversubscribe GPU memory by using managed memory (cudaMallocManaged), are there any plans to change that parameter so it can take a size_t?
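To make the failure mode concrete, here is a minimal host-side sketch of what happens to a large element count when it is squeezed through a 32-bit signed parameter. The helper names `fits_in_int` and `as_int32_count` are hypothetical and are not part of CUB; the sketch also assumes a 64-bit `size_t` and the usual two's-complement wrap-around on narrowing (guaranteed from C++20, implementation-defined before that).

```cpp
#include <climits>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true when an element count can be passed through
// an `int num_items` parameter without losing information.
bool fits_in_int(std::size_t num_items) {
    return num_items <= static_cast<std::size_t>(INT_MAX);
}

// Hypothetical helper: the value a 32-bit signed parameter ends up holding
// for a 64-bit count. Only the low 32 bits survive, reinterpreted as signed
// (assumes two's-complement wrap-around on narrowing).
std::int32_t as_int32_count(std::uint64_t num_items) {
    return static_cast<std::int32_t>(static_cast<std::uint32_t>(num_items));
}
```

With three billion items, for instance, `fits_in_int` returns false and the narrowed count comes out negative, so a guard like this before the call at least turns silent misbehaviour into an explicit error.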

felipeblazing changed the title from "Issue using cub reduce on more than elements than fit into a 4 bit integer" to "Issue using cub reduce on more than elements than fit into a 4 byte integer" on Feb 20, 2018
Contributor

dumerrill commented May 30, 2018

So, an explanation and a solution:

CUB's algorithms actually aren't hard-coded for 32-bit counts; they are specialized by template parameter. However, 64b offsets take twice the register-file space of 32b offsets, and many of the algorithms (prefix sum, radix sort, etc.) have pervasive bookkeeping offsets and counts, so specializing for 64b counts often reduces performance because the added register-file (RF) pressure reduces occupancy. So the outer interface specializes everything for 32b int counts, because the majority of people aren't reducing/scanning/sorting more than 2 billion items.

If you want to use a 64b count (e.g., int64_t or size_t or whatever), you can simply invoke the more generic interface underneath. See:

https://github.com/NVlabs/cub/blob/1.8.0/cub/device/device_reduce.cuh#L148

for example (which is where the outer interface specializes int as the offset type). Let me know if that unsticks you,

Duane
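The layering described in this comment can be sketched in plain C++ as follows. This is not CUB's code: `dispatch_reduce`, `reduce`, and `reduce64` are hypothetical stand-ins that only mirror the pattern of a generic inner layer templated on the offset type, with a thin outer wrapper that pins that type to `int`. In CUB itself the generic layer is the `DispatchReduce` template; its exact template parameters and arguments differ between versions, so check the header for the release in use before calling it directly.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical generic inner layer: every count and index uses the
// caller-chosen OffsetT, so the same code serves 32-bit and 64-bit sizes.
template <typename T, typename OffsetT, typename ReductionOpT>
T dispatch_reduce(const T* in, OffsetT num_items, ReductionOpT op, T init) {
    T acc = init;
    for (OffsetT i = 0; i < num_items; ++i) {
        acc = op(acc, in[i]);
    }
    return acc;
}

// Hypothetical outer interface: pins OffsetT to int, as the convenience
// wrapper described above does for the common case.
template <typename T, typename ReductionOpT>
T reduce(const T* in, int num_items, ReductionOpT op, T init) {
    return dispatch_reduce<T, int, ReductionOpT>(in, num_items, op, init);
}

// Hypothetical 64-bit entry point: same inner layer, wider offset type.
template <typename T, typename ReductionOpT>
T reduce64(const T* in, std::int64_t num_items, ReductionOpT op, T init) {
    return dispatch_reduce<T, std::int64_t, ReductionOpT>(in, num_items, op, init);
}
```

Callers who need more than `INT_MAX` items would bypass `reduce` and instantiate the inner layer with a wider offset type, accepting the register cost mentioned above.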

@felipeblazing
Author

OK, yes, I see how to do this now. This does unstick us, thank you.

Felipe

@jakirkham

So maybe I'm missing something here, but it appears that num_items is still int. Is there a way to relax that constraint? It would be useful to have something like size_t here instead. Thoughts? 🙂

Member

leofang commented Feb 25, 2020

@jakirkham I guess what @dumerrill meant is to invoke DispatchReduce::Dispatch() ourselves with num_items being size_t?

@jakirkham

Ah sorry. I got overly focused on the highlighted line. So IIUC we should be looking here. Is that right?

Member

leofang commented Feb 25, 2020

Yeah I guess so.

@jakirkham

Also if using 32-bit is significantly more performant than 64-bit, what is the recommendation for doing reductions that exceed the size of 32-bit signed integers?
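The thread does not record an answer to this question. One possible approach, offered only as an assumption and not as a maintainer recommendation, is to keep the 32-bit interface and split the input into chunks that each fit in an `int`, then combine the partial results. The host-side sketch below illustrates the idea for a sum; `sum_chunk` is a hypothetical stand-in for any reduction whose count parameter is `int`, and `sum_chunked` is likewise hypothetical.

```cpp
#include <algorithm>
#include <climits>
#include <cstddef>

// Hypothetical stand-in for a reduction whose count parameter is `int`.
long long sum_chunk(const int* in, int num_items) {
    long long acc = 0;
    for (int i = 0; i < num_items; ++i) {
        acc += in[i];
    }
    return acc;
}

// Hypothetical wrapper: walks a size_t-sized input in pieces of at most
// max_chunk items and adds up the partial sums.
// Requires max_chunk > 0, otherwise the loop would never advance.
long long sum_chunked(const int* in, std::size_t num_items,
                      int max_chunk = INT_MAX) {
    long long total = 0;
    std::size_t done = 0;
    while (done < num_items) {
        const std::size_t remaining = num_items - done;
        const int n = static_cast<int>(
            std::min<std::size_t>(remaining, static_cast<std::size_t>(max_chunk)));
        total += sum_chunk(in + done, n);
        done += static_cast<std::size_t>(n);
    }
    return total;
}
```

This only works when the operation is associative, so that partial results can be combined in a second step; on a GPU each chunk would presumably be its own reduce call, which adds launch overhead that would need measuring against the 64-bit offset path.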

Additionally I understand that 32-bit has special value here, but why is a signed value used instead of an unsigned one? Switching would double the size of allowed values without affecting the number of bits used.
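The arithmetic behind that observation can be checked at compile time. The constants below are illustrative names, not CUB identifiers, and the sketch says nothing about why a signed type was chosen, which the thread leaves unanswered.

```cpp
#include <cstdint>
#include <limits>

// Largest count representable in a 32-bit signed integer: 2^31 - 1.
constexpr std::uint64_t kMaxSigned32 =
    std::numeric_limits<std::int32_t>::max();

// Largest count representable in a 32-bit unsigned integer: 2^32 - 1.
constexpr std::uint64_t kMaxUnsigned32 =
    std::numeric_limits<std::uint32_t>::max();

// Same number of bits, roughly twice the range.
static_assert(kMaxUnsigned32 == 2 * kMaxSigned32 + 1,
              "unsigned 32-bit range is twice the signed range plus one");
```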

@alliepiper
Collaborator

Closing as this is part of a larger issue being tracked in #212.
