Skip to content
This repository was archived by the owner on Mar 21, 2024. It is now read-only.

Faster segmented sorting (and segmented problems in general) #224

Closed
ogreen opened this issue Oct 26, 2020 · 8 comments
Closed

Faster segmented sorting (and segmented problems in general) #224

ogreen opened this issue Oct 26, 2020 · 8 comments
Assignees
Labels
P1: should have Necessary, but not critical. type: enhancement New feature or request.
Milestone

Comments

@ogreen
Copy link

ogreen commented Oct 26, 2020

Segmented problems can suffer from workload imbalance if the distribution of the segment sizes vary. The workload imbalance becomes even more challenging when the number of segments is very large.

The attached zip file has an efficient segmented sort that is based on NVIDIA's CUB library. Most of the sorting is done using CUB's sorting functionality. The key thing that this segmented sorting algorithm offers is simple and efficient load-balancing.
lrb_sort.cuh.zip

The solution used for the segmented sort is applicable to other segmented problems.

Details of the algorithm (and performance) can be found in the following paper.

@alliepiper alliepiper added the type: enhancement New feature or request. label Oct 26, 2020
@gevtushenko
Copy link
Collaborator

I've written some extreme-case tests for Logarithmic Radix Binning (LRB) application in segmented sorting. In segmented sort, a thread block is assigned to a single segment. Therefore load balancing could affect work distribution between SMs. To test how workload imbalance affects performance, I've used the following pattern: all segments that satisfy the property position % N == 0 contains millions of items to sort. All other segments contain hundreds. I used the multiprocessors count as N. The idea was to assign huge segments to a single SM. If we continue increasing the waves count, we'll reach the point where SMs with small tasks will have small segments sorted and the SM with huge segments still busy. At this point, huge chunks will be scheduled on other SMs, which might improve balance. Here is the experimental data that seem to illustrate the behaviour above.

image

I haven't considered a runtime of LRB itself here. As far as I can see, the speedup from load balancing (for segmented sorting) can occur on a moderate segments number. In the case of a large number of segments, thread-block scheduling should balance between free SMs. The experiment also supports this premise. Further increase of waves counts leads to speedup convergence around 2%.

I think that the load balancing property of LRB could be more useful in algorithms with different level of parallelism, for example, thread per segment. In this case, it could reduce thread divergence. Therefore I think it should be helpful to have LRB as a separate algorithm in CUB.

Regarding LRB application for segmented sorting, its different property is most helpful here. Clustering segments could facilitate segmented scan specialisation. For kernel specialisation benchmarking, I've generated different input data pattern. This time, all the large segments are at the head of the list. The tail contains small segments. This pattern eliminates the effects of load balancing. After performing the LRB, I've processed all the small segments with a different kernel. It assigns a warp to a segment and executes a bitonic warp sort.

image

Kernel specialisation for small segments demonstrates significant speedup. I've also tried to specialise kernel for large segments. After LRB, I process large segments by the whole device. It's done by a call to cub::DeviceRadixSort::SortKeys. Unlike small kernel specialisation, the speedup here depends on the number of large segments. It might be worth developing a different kernel for this purpose.

image

@alliepiper
Copy link
Collaborator

Just to make sure we're on the same page, can you define what you mean by "waves" in the above?

These are good ideas. Specializing based on work size makes sense, as does having the LRB machinery as a shared utility between the segmented algorithms.

For now, let's focus on getting LRB implemented as a utility, and start applying it to the segmented algorithms, and look into specializing for size later.

@ogreen
Copy link
Author

ogreen commented Jun 1, 2021

Adding an update the latest version of the segmented sort code is available here:
Segmented Sort
This code is more up to date than the attached code at the top of this PR.

@gevtushenko
Copy link
Collaborator

The wave hare stands for the segments count equal to the SMs count. For example, if a GPU has two SMs, four segments form two waves. This term is convenient here because it's possible to launch max thread blocks per SM waves with N-th segment in each having a lot of work and expect that all large segments will be assigned to a single SM. This should cause the biggest imbalance between SMs.

@mnicely
Copy link
Collaborator

mnicely commented Jun 25, 2021

@senior-zero @allisonvacanti It has come to my attention that this can greatly improve a particular math function of high importance. What can need to be done to get this into the next release of CUB, which I think is 1.14? So it'll be in next release of CTK, and we can start using it.

@gevtushenko
Copy link
Collaborator

@senior-zero @allisonvacanti It has come to my attention that this can greatly improve a particular math function of high importance. What can need to be done to get this into the next release of CUB, which I think is 1.14? So it'll be in next release of CTK, and we can start using it.

Hello, @mnicely! LRB part can be ready quite soon. The prototype is available here. Specialization of segmented sorting could take more time. Do you need a generalized algorithm for load balancing or optimized segmented sorting?

@mnicely
Copy link
Collaborator

mnicely commented Jun 25, 2021

@senior-zero Thanks for the quick reply. We need optimized seqmented sorting

@gevtushenko gevtushenko added this to the 1.14.0 milestone Jun 28, 2021
@gevtushenko gevtushenko added the P1: should have Necessary, but not critical. label Jun 28, 2021
@alliepiper alliepiper modified the milestones: 1.14.0, 1.15.0 Aug 17, 2021
@alliepiper alliepiper modified the milestones: 1.15.0, 1.16.0 Oct 15, 2021
@alliepiper
Copy link
Collaborator

Closing as #357 added a (significantly!) improved segmented sort.

@alliepiper alliepiper modified the milestones: 1.16.0, 1.15.0 Feb 7, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
P1: should have Necessary, but not critical. type: enhancement New feature or request.
Projects
None yet
Development

No branches or pull requests

4 participants