During the development of the new segmented sort, I extracted an AgentSegmentedRadixSort class. It's mostly based on the existing DeviceSegmentedRadixSort implementation. The only differences are:
- The while (current_bit < end_bit) loop is moved from the host to the device side.
- If the segment data fits into shared memory, BlockRadixSort is used.
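The two changes can be sketched as a CPU-side C++ model of the per-segment control flow. This is only an illustration, not the CUB code: `sort_segment`, `radix_pass`, `segmented_sort`, `kSmallSegmentCapacity`, and `kRadixBits` are hypothetical names, `std::sort` on a local buffer stands in for BlockRadixSort, and the model assumes 32-bit unsigned keys sorted over all of their bits.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for "the segment fits into shared memory".
constexpr std::size_t kSmallSegmentCapacity = 1024;
// Hypothetical digit width; 32 key bits are covered in 32 / 8 = 4 passes.
constexpr int kRadixBits = 8;
constexpr int kEndBit = 32;

// One stable counting-sort pass on the digit starting at current_bit.
void radix_pass(const std::uint32_t* in, std::uint32_t* out,
                std::size_t n, int current_bit) {
    constexpr std::size_t kBuckets = std::size_t{1} << kRadixBits;
    const std::uint32_t mask = static_cast<std::uint32_t>(kBuckets - 1);
    std::size_t counts[kBuckets + 1] = {};
    for (std::size_t i = 0; i < n; ++i)
        ++counts[((in[i] >> current_bit) & mask) + 1];
    for (std::size_t d = 0; d < kBuckets; ++d)
        counts[d + 1] += counts[d];  // counts[d] becomes the start offset of bucket d
    for (std::size_t i = 0; i < n; ++i)
        out[counts[(in[i] >> current_bit) & mask]++] = in[i];
}

// Models the work one thread block would do for one segment.
void sort_segment(std::uint32_t* keys, std::size_t n) {
    if (n <= kSmallSegmentCapacity) {
        // Small path: copy to a local buffer (the "shared memory"), sort once,
        // and write back. std::sort is a stand-in for a block-wide sort.
        std::vector<std::uint32_t> local(keys, keys + n);
        std::sort(local.begin(), local.end());
        std::copy(local.begin(), local.end(), keys);
        return;
    }
    // Large path: the loop over digits lives inside the per-segment routine,
    // instead of being driven by one launch per digit from the caller.
    std::vector<std::uint32_t> tmp(n);
    std::uint32_t* in = keys;
    std::uint32_t* out = tmp.data();
    int current_bit = 0;
    while (current_bit < kEndBit) {
        radix_pass(in, out, n, current_bit);
        std::swap(in, out);
        current_bit += kRadixBits;
    }
    if (in != keys)  // result ended in the temporary buffer
        std::copy(in, in + n, keys);
}

// offsets has num_segments + 1 entries; segment s is [offsets[s], offsets[s + 1]).
void segmented_sort(std::vector<std::uint32_t>& keys,
                    const std::vector<std::size_t>& offsets) {
    assert(!offsets.empty() && offsets.back() <= keys.size());
    for (std::size_t s = 0; s + 1 < offsets.size(); ++s)
        sort_segment(keys.data() + offsets[s], offsets[s + 1] - offsets[s]);
}
```

In this model each segment is sorted by a single call, whichever path it takes, which mirrors the idea of avoiding a separate host-driven step per digit.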
The combination of these changes gives about a 6x speedup on the RTX 3090 and up to 7x on the RTX 2080 for segments with up to 5k elements. Unfortunately, the large-segment case is also affected. Since the new code requires a different number of registers, the speedup or slowdown is unpredictable. For some input data types and segment sizes, I got about a 14% improvement. In a few cases, I noticed a 40% slowdown. Although the median speedup was around 0.996x, more research is required.
Once the slowdowns in large-segment sorting are addressed, we should use AgentSegmentedRadixSort as the DeviceSegmentedRadixSort implementation.