Optimize DeviceSegmentedRadixSort #879

gevtushenko · 2021-10-05T20:47:06Z

During the development of the new segmented sort, I extracted an AgentSegmentedRadixSort class. It's mostly based on the existing DeviceSegmentedRadixSort implementation. The only differences are:

- The while (current_bit < end_bit) loop is moved from the host to the device side.
- If the segment data fit into shared memory, BlockRadixSort is used.
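To make the two differences concrete, here is a minimal CPU-only sketch of the control flow. It is not the CUB implementation: `sort_segment`, `radix_pass`, `segmented_sort`, `kSmallSegmentItems`, and `kRadixBits` are hypothetical names, the 5000-item threshold is an assumption standing in for "fits into shared memory", and `std::stable_sort` stands in for BlockRadixSort. The point is only that each segment owns its whole bit loop, instead of the host launching one pass per digit.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed stand-in for "the segment fits into shared memory".
constexpr std::size_t kSmallSegmentItems = 5000;
// Assumed digit width of one radix pass.
constexpr int kRadixBits = 8;
constexpr std::size_t kBuckets = std::size_t{1} << kRadixBits;

// Extract the bits in [begin_bit, end_bit) of a key.
inline std::uint32_t bit_field(std::uint32_t key, int begin_bit, int end_bit)
{
    const int width = end_bit - begin_bit;
    const std::uint32_t mask = width >= 32 ? ~0u : ((1u << width) - 1u);
    return (key >> begin_bit) & mask;
}

// One stable counting-sort pass on the digit starting at current_bit.
void radix_pass(const std::uint32_t* in, std::uint32_t* out, std::size_t n,
                int current_bit, int pass_bits)
{
    std::array<std::size_t, kBuckets + 1> offsets{};
    for (std::size_t i = 0; i < n; ++i)
        ++offsets[bit_field(in[i], current_bit, current_bit + pass_bits) + 1];
    for (std::size_t d = 0; d < kBuckets; ++d)
        offsets[d + 1] += offsets[d]; // exclusive prefix sum of digit counts
    for (std::size_t i = 0; i < n; ++i)
        out[offsets[bit_field(in[i], current_bit, current_bit + pass_bits)]++] = in[i];
}

// Work done for one segment (one thread block in the GPU setting).
void sort_segment(std::uint32_t* keys, std::uint32_t* tmp, std::size_t n,
                  int begin_bit, int end_bit)
{
    if (n <= kSmallSegmentItems)
    {
        // Small segment: sort it in one step (BlockRadixSort in the issue).
        std::stable_sort(keys, keys + n, [=](std::uint32_t a, std::uint32_t b) {
            return bit_field(a, begin_bit, end_bit) < bit_field(b, begin_bit, end_bit);
        });
        return;
    }

    // Large segment: the bit loop runs here, per segment, not in the caller.
    std::uint32_t* src = keys;
    std::uint32_t* dst = tmp;
    int current_bit = begin_bit;
    while (current_bit < end_bit)
    {
        const int pass_bits = std::min(kRadixBits, end_bit - current_bit);
        radix_pass(src, dst, n, current_bit, pass_bits);
        std::swap(src, dst);
        current_bit += pass_bits;
    }
    if (src != keys)
        std::copy(src, src + n, keys); // odd number of passes: copy result back
}

// offsets has num_segments + 1 entries; segment s is [offsets[s], offsets[s + 1]).
void segmented_sort(std::vector<std::uint32_t>& keys,
                    const std::vector<std::size_t>& offsets,
                    int begin_bit = 0, int end_bit = 32)
{
    assert(!offsets.empty() && offsets.back() <= keys.size());
    std::vector<std::uint32_t> tmp(keys.size());
    for (std::size_t s = 0; s + 1 < offsets.size(); ++s)
    {
        const std::size_t first = offsets[s];
        sort_segment(keys.data() + first, tmp.data() + first,
                     offsets[s + 1] - first, begin_bit, end_bit);
    }
}
```

Calling `segmented_sort(keys, offsets)` sorts every segment independently over the full 32-bit key. In this sketch the host-side loop over segments stands in for the grid of thread blocks, so the caller makes one call in total rather than one call per radix pass.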

Combined, these changes give about a 6x speedup on an RTX 3090 and up to 7x on an RTX 2080 for segments with up to 5k elements. Unfortunately, large segments are affected as well. Because the new code requires a different number of registers, the effect there is unpredictable: for some input data types and segment sizes I measured about a 14% improvement, while in a few cases I saw a 40% slowdown. The median speedup was around 0.996x, which is close to neutral, but more research is required.

Once the slowdowns for large segments are addressed, we should use AgentSegmentedRadixSort as the DeviceSegmentedRadixSort implementation.

jrhemstad added the cub label Feb 22, 2023

jarmak-nv assigned alliepiper Feb 23, 2023

alliepiper removed their assignment Feb 23, 2023

jarmak-nv transferred this issue from NVIDIA/cub Nov 8, 2023
