Faster segmented sorting (and segmented problems in general) #224

ogreen · 2020-10-26T16:51:25Z

Segmented problems can suffer from workload imbalance if the distribution of the segment sizes varies. The workload imbalance becomes even more challenging when the number of segments is very large.

The attached zip file has an efficient segmented sort that is based on NVIDIA's CUB library. Most of the sorting is done using CUB's sorting functionality. The key thing that this segmented sorting algorithm offers is simple and efficient load-balancing.
lrb_sort.cuh.zip

The solution used for the segmented sort is applicable to other segmented problems.

Details of the algorithm (and performance) can be found in the following paper.

gevtushenko · 2021-05-29T16:11:54Z

I've written some extreme-case tests for applying Logarithmic Radix Binning (LRB) to segmented sorting. In segmented sort, a thread block is assigned to a single segment, so load balancing could affect how work is distributed between SMs. To test how workload imbalance affects performance, I used the following pattern: every segment that satisfies position % N == 0 contains millions of items to sort, and all other segments contain hundreds. I used the multiprocessor count as N. The idea was to assign all the huge segments to a single SM. If we keep increasing the number of waves, we'll reach a point where the SMs with small tasks have finished sorting their small segments while the SM with huge segments is still busy. At this point, huge segments will be scheduled on other SMs, which might improve balance. Here is the experimental data that seems to illustrate the behaviour above.
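For reference, the input pattern described above can be sketched on the host roughly as follows. This is only an illustration, not the actual test code: the function name `make_imbalanced_offsets` and its parameters are hypothetical, and on a real device `period` would be the multiprocessor count (for example `cudaDeviceProp::multiProcessorCount`).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Builds segment offsets (num_segments + 1 entries, begin/end style) for the
// imbalanced pattern: every segment whose position is a multiple of `period`
// is large, all other segments are small.
// `period`, `large_size` and `small_size` are illustrative parameters.
std::vector<std::size_t> make_imbalanced_offsets(std::size_t num_segments,
                                                 std::size_t period,
                                                 std::size_t large_size,
                                                 std::size_t small_size)
{
  assert(period > 0);
  std::vector<std::size_t> offsets(num_segments + 1, 0);
  for (std::size_t i = 0; i < num_segments; ++i)
  {
    const std::size_t size = (i % period == 0) ? large_size : small_size;
    offsets[i + 1] = offsets[i] + size; // running sum of segment sizes
  }
  return offsets;
}
```

Segment `i` then spans `[offsets[i], offsets[i + 1])`, which is the begin/end offset layout that segmented algorithms usually take.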

I haven't considered the runtime of LRB itself here. As far as I can see, the speedup from load balancing (for segmented sorting) can occur with a moderate number of segments. With a large number of segments, thread-block scheduling should balance the work across free SMs. The experiment also supports this premise: as the wave count increases further, the speedup converges to around 2%.

I think the load-balancing property of LRB could be more useful in algorithms with a different level of parallelism, for example, thread per segment. In that case, it could reduce thread divergence. Therefore, I think it would be helpful to have LRB as a separate algorithm in CUB.
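To make the binning idea concrete, here is a minimal host-side sketch of how I understand LRB in general terms: each segment goes into a bin chosen by the logarithm of its size, and segments are then reordered bin by bin, largest first. This is an assumption-laden illustration, not the code from the attached file or the prototype; the names `lrb_bin` and `lrb_reorder`, the exact bin boundaries, and the largest-first order are my own choices.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bin index = number of significant bits in `size` (0 for an empty segment),
// so bin b > 0 holds sizes in [2^(b-1), 2^b).
inline int lrb_bin(std::uint64_t size)
{
  int bits = 0;
  while (size != 0) { ++bits; size >>= 1; }
  return bits;
}

// Returns segment indices reordered so that segments from higher bins come
// first. This is a counting sort over bins; order inside a bin is preserved.
std::vector<std::size_t> lrb_reorder(const std::vector<std::uint64_t>& sizes)
{
  constexpr int num_bins = 65; // bins 0..64 for 64-bit sizes
  std::vector<std::size_t> bin_start(num_bins + 1, 0);

  // Histogram, stored in descending-bin order (slot 0 = largest bin).
  for (std::uint64_t s : sizes)
  {
    const int slot = num_bins - 1 - lrb_bin(s);
    assert(slot >= 0 && slot < num_bins);
    ++bin_start[slot + 1];
  }

  // Prefix sum turns counts into the starting position of each slot.
  for (int b = 0; b < num_bins; ++b) bin_start[b + 1] += bin_start[b];

  // Scatter segment indices into their slots.
  std::vector<std::size_t> order(sizes.size());
  for (std::size_t i = 0; i < sizes.size(); ++i)
  {
    const int slot = num_bins - 1 - lrb_bin(sizes[i]);
    order[bin_start[slot]++] = i;
  }
  return order;
}
```

After the reordering, neighbouring entries of `order` refer to segments whose sizes differ by less than a factor of two, which is what would help a thread-per-segment kernel avoid divergence.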

Regarding the application of LRB to segmented sorting, a different property of it is most helpful here. Clustering segments could facilitate segmented scan specialisation. For benchmarking kernel specialisation, I generated a different input data pattern: all the large segments are at the head of the list, and the tail contains small segments. This pattern eliminates the effects of load balancing. After performing LRB, I processed all the small segments with a different kernel, which assigns a warp to each segment and executes a bitonic warp sort.
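As an illustration of the warp-per-segment idea, the sketch below emulates a 32-lane bitonic sorting network on the host, with one array slot standing in for each lane's register. It is not the benchmarked kernel: the function name is hypothetical, the key type is fixed to `int`, and a device version would exchange values between lanes with warp shuffles rather than array accesses.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

constexpr std::size_t warp_size = 32;

// Sorts one small segment (at most 32 keys) in ascending order using a
// bitonic network over 32 emulated lanes.
void bitonic_sort_small_segment(std::vector<int>& segment)
{
  assert(segment.size() <= warp_size);

  // Unused lanes are padded with the maximum key so they sink to the end.
  std::array<int, warp_size> lanes;
  lanes.fill(std::numeric_limits<int>::max());
  std::copy(segment.begin(), segment.end(), lanes.begin());

  for (std::size_t k = 2; k <= warp_size; k <<= 1)  // bitonic sequence length
  {
    for (std::size_t j = k >> 1; j > 0; j >>= 1)    // compare distance
    {
      for (std::size_t lane = 0; lane < warp_size; ++lane)
      {
        const std::size_t partner = lane ^ j;
        if (partner <= lane) continue;              // handle each pair once
        const bool ascending = (lane & k) == 0;
        const bool out_of_order = ascending ? lanes[lane] > lanes[partner]
                                            : lanes[lane] < lanes[partner];
        if (out_of_order) std::swap(lanes[lane], lanes[partner]);
      }
    }
  }

  std::copy_n(lanes.begin(), segment.size(), segment.begin());
}
```

The network always performs the same compare-exchange steps regardless of the data, which is why it maps well onto a warp executing in lockstep.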

Kernel specialisation for small segments demonstrates a significant speedup. I've also tried to specialise the kernel for large segments. After LRB, I process large segments using the whole device. This is done by calling cub::DeviceRadixSort::SortKeys. Unlike the small-segment specialisation, the speedup here depends on the number of large segments. It might be worth developing a different kernel for this purpose.
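The dispatch described above could be expressed as a simple size classification, sketched below. The cut-off values are not stated in this thread, so `SizeThresholds` and its fields are purely hypothetical, as is the assumption that everything between the two cut-offs stays on the regular block-per-segment path.

```cpp
#include <cassert>
#include <cstddef>

enum class SortPath { WarpPerSegment, BlockPerSegment, WholeDevice };

// Hypothetical cut-offs; real values would have to come from benchmarking.
struct SizeThresholds
{
  std::size_t small_max; // at most this many items: warp-per-segment kernel
  std::size_t large_min; // at least this many items: device-wide sort
};

// Picks the processing path for one segment based on its size.
SortPath choose_sort_path(std::size_t segment_size, SizeThresholds t)
{
  assert(t.small_max < t.large_min);
  if (segment_size <= t.small_max) return SortPath::WarpPerSegment;
  if (segment_size >= t.large_min) return SortPath::WholeDevice;
  return SortPath::BlockPerSegment;
}
```

Because LRB already groups segments of similar size together, each path would receive a contiguous range of the reordered segment list rather than a scattered set.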

alliepiper · 2021-06-01T20:02:19Z

Just to make sure we're on the same page, can you define what you mean by "waves" in the above?

These are good ideas. Specializing based on work size makes sense, as does having the LRB machinery as a shared utility between the segmented algorithms.

For now, let's focus on getting LRB implemented as a utility, and start applying it to the segmented algorithms, and look into specializing for size later.

ogreen · 2021-06-01T21:29:07Z

Adding an update: the latest version of the segmented sort code is available here:
Segmented Sort
This code is more up to date than the attached code at the top of this PR.

gevtushenko · 2021-06-02T13:58:03Z

A wave here stands for a number of segments equal to the number of SMs. For example, if a GPU has two SMs, four segments form two waves. This term is convenient because it's possible to launch as many waves as the maximum number of thread blocks per SM, with the N-th segment in each wave having a lot of work, and expect that all large segments will be assigned to a single SM. This should cause the biggest imbalance between SMs.
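A toy model of that expectation is sketched below. It assumes that thread blocks are handed to SMs in strict round-robin order, which is a simplification for illustration only; the hardware scheduler assigns blocks dynamically and makes no such guarantee. Both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of waves needed to cover all segments, one segment per SM per wave.
std::size_t num_waves(std::size_t num_segments, std::size_t num_sms)
{
  assert(num_sms > 0);
  return (num_segments + num_sms - 1) / num_sms; // ceiling division
}

// Model assumption: the block for segment i runs on SM (i % num_sms).
// Returns the total number of items each SM would have to sort.
std::vector<std::size_t> work_per_sm(const std::vector<std::size_t>& segment_sizes,
                                     std::size_t num_sms)
{
  assert(num_sms > 0);
  std::vector<std::size_t> work(num_sms, 0);
  for (std::size_t i = 0; i < segment_sizes.size(); ++i)
  {
    work[i % num_sms] += segment_sizes[i];
  }
  return work;
}
```

With two SMs and segment sizes {1000, 10, 1000, 10}, this model puts 2000 items on the first SM and 20 on the second, which is the kind of imbalance the test pattern aims for.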

mnicely · 2021-06-25T19:08:27Z

@senior-zero @allisonvacanti It has come to my attention that this can greatly improve a particular math function of high importance. What needs to be done to get this into the next release of CUB, which I think is 1.14? That way it'll be in the next release of the CTK, and we can start using it.

gevtushenko · 2021-06-25T19:16:00Z

Hello, @mnicely! The LRB part can be ready quite soon. The prototype is available here. Specialization of segmented sorting could take more time. Do you need a generalized algorithm for load balancing or optimized segmented sorting?

mnicely · 2021-06-25T19:20:30Z

@senior-zero Thanks for the quick reply. We need optimized segmented sorting.

@federico-busato for viz

alliepiper · 2022-02-07T18:37:45Z

Closing as #357 added a (significantly!) improved segmented sort.

alliepiper added the type: enhancement New feature or request. label Oct 26, 2020

alliepiper assigned gevtushenko May 26, 2021

gevtushenko added this to the 1.14.0 milestone Jun 28, 2021

gevtushenko added the P1: should have Necessary, but not critical. label Jun 28, 2021

gevtushenko mentioned this issue Aug 17, 2021

New segmented sort algorithm #357

Merged

alliepiper modified the milestones: 1.14.0, 1.15.0 Aug 17, 2021

alliepiper modified the milestones: 1.15.0, 1.16.0 Oct 15, 2021

alliepiper closed this as completed Feb 7, 2022

alliepiper modified the milestones: 1.16.0, 1.15.0 Feb 7, 2022
