Simplify the iterator adaptive splitting strategy #857

cuviper · 2021-05-13T18:28:44Z

Before, when an iterator job was stolen, we would reset the split count
all the way back to current_num_threads to adaptively split jobs more
aggressively when threads seem to need more work. This ends up splitting
a lot farther than a lot of people expect, especially in the tail end of
a computation when threads are fighting over what's left. Excess
splitting can also be harmful for things like fold or map_with that
want to share state as much as possible.

We can get a much lazier "adaptive" effect by just not updating the
split count when we split a stolen job, effectively giving it only one
extra boost of splitting.

cuviper · 2021-05-13T18:31:38Z

Here are my benchmark results with a 5% threshold. It's rather mixed at extremes, so I'm not sure how to evaluate this...

 name                                                                split-num-threads ns/iter  split-skip ns/iter     diff ns/iter   diff %  speedup
 factorial::factorial_par_iter                                       785,857                    744,186                     -41,671   -5.30%   x 1.06
 factorial::factorial_recursion                                      1,312,877                  1,393,096                    80,219    6.11%   x 0.94
 fibonacci::fibonacci_split_iterative                                79,043                     66,834                      -12,209  -15.45%   x 1.18
 fibonacci::fibonacci_split_recursive                                597,931                    536,831                     -61,100  -10.22%   x 1.11
 join_microbench::increment_all                                      62,094                     51,761                      -10,333  -16.64%   x 1.20
 join_microbench::increment_all_max                                  73,677                     181,079                     107,402  145.77%   x 0.41
 life::bench::par_iter_generations                                   11,345,270                 9,244,708                -2,100,562  -18.51%   x 1.23
 map_collect::i_mod_10_to_i::with_collect                            4,321,277                  3,916,068                  -405,209   -9.38%   x 1.10
 map_collect::i_mod_10_to_i::with_fold                               683,736                    506,218                    -177,518  -25.96%   x 1.35
 map_collect::i_mod_10_to_i::with_fold_vec                           819,674                    557,608                    -262,066  -31.97%   x 1.47
 map_collect::i_mod_10_to_i::with_linked_list_collect_vec            4,286,532                  3,917,595                  -368,937   -8.61%   x 1.09
 map_collect::i_mod_10_to_i::with_linked_list_collect_vec_sized      4,354,821                  3,908,441                  -446,380  -10.25%   x 1.11
 map_collect::i_mod_10_to_i::with_linked_list_map_reduce_vec_sized   4,404,012                  3,931,468                  -472,544  -10.73%   x 1.12
 map_collect::i_mod_10_to_i::with_mutex_vec                          11,913,779                 9,188,517                -2,725,262  -22.87%   x 1.30
 map_collect::i_mod_10_to_i::with_vec_vec_sized                      4,601,016                  4,026,035                  -574,981  -12.50%   x 1.14
 map_collect::i_to_i::with_collect                                   6,957,727                  6,354,939                  -602,788   -8.66%   x 1.09
 map_collect::i_to_i::with_fold_vec                                  38,067,101                 35,491,125               -2,575,976   -6.77%   x 1.07
 map_collect::i_to_i::with_linked_list_collect_vec_sized             6,840,195                  6,383,791                  -456,404   -6.67%   x 1.07
 map_collect::i_to_i::with_linked_list_map_reduce_vec_sized          6,901,355                  6,313,163                  -588,192   -8.52%   x 1.09
 map_collect::i_to_i::with_mutex_vec                                 36,533,047                 33,202,140               -3,330,907   -9.12%   x 1.10
 map_collect::i_to_i::with_vec_vec_sized                             7,247,123                  6,418,205                  -828,918  -11.44%   x 1.13
 nbody::bench::nbody_parreduce                                       9,854,831                  11,379,041                1,524,210   15.47%   x 0.87
 pythagoras::euclid_parallel_one                                     2,877,928                  3,144,155                   266,227    9.25%   x 0.92
 pythagoras::euclid_parallel_weightless                              2,878,436                  3,131,320                   252,884    8.79%   x 0.92
 quicksort::bench::quick_sort_splitter                               5,588,636                  13,225,091                7,636,455  136.64%   x 0.42
 sort::demo_merge_sort_ascending                                     102,956 (3885 MB/s)        114,411 (3496 MB/s)          11,455   11.13%   x 0.90
 sort::demo_merge_sort_big                                           6,817,448 (938 MB/s)       6,112,125 (1047 MB/s)      -705,323  -10.35%   x 1.12
 sort::demo_merge_sort_descending                                    110,011 (3636 MB/s)        201,164 (1988 MB/s)          91,153   82.86%   x 0.55
 sort::demo_merge_sort_mostly_ascending                              258,798 (1545 MB/s)        462,874 (864 MB/s)          204,076   78.86%   x 0.56
 sort::demo_merge_sort_mostly_descending                             255,883 (1563 MB/s)        481,346 (831 MB/s)          225,463   88.11%   x 0.53
 sort::demo_quick_sort_big                                           3,858,490 (1658 MB/s)      2,992,156 (2138 MB/s)      -866,334  -22.45%   x 1.29
 sort::demo_quick_sort_strings                                       3,346,042 (239 MB/s)       3,122,390 (256 MB/s)       -223,652   -6.68%   x 1.07
 sort::par_sort_unstable_big                                         1,830,812 (3495 MB/s)      2,530,672 (2528 MB/s)       699,860   38.23%   x 0.72
 sort::par_sort_unstable_mostly_descending                           184,598 (2166 MB/s)        171,825 (2327 MB/s)         -12,773   -6.92%   x 1.07
 str_split::parallel_space_char                                      264,467                    213,826                     -50,641  -19.15%   x 1.24
 str_split::parallel_space_fn                                        215,124                    182,302                     -32,822  -15.26%   x 1.18
 vec_collect::vec_i::with_collect_into_vec_reused                    419,200                    381,671                     -37,529   -8.95%   x 1.10
 vec_collect::vec_i::with_fold                                       8,026,621                  7,100,183                  -926,438  -11.54%   x 1.13
 vec_collect::vec_i::with_linked_list_collect_vec                    4,623,693                  3,617,714                -1,005,979  -21.76%   x 1.28
 vec_collect::vec_i::with_linked_list_collect_vec_sized              4,765,766                  3,618,636                -1,147,130  -24.07%   x 1.32
 vec_collect::vec_i::with_linked_list_map_reduce_vec_sized           4,659,306                  3,644,773                -1,014,533  -21.77%   x 1.28
 vec_collect::vec_i::with_vec_vec_sized                              4,454,536                  3,543,003                  -911,533  -20.46%   x 1.26
 vec_collect::vec_i_filtered::with_collect                           5,133,122                  4,143,791                  -989,331  -19.27%   x 1.24
 vec_collect::vec_i_filtered::with_fold                              10,158,530                 9,013,442                -1,145,088  -11.27%   x 1.13
 vec_collect::vec_i_filtered::with_linked_list_collect_vec           6,698,931                  5,980,331                  -718,600  -10.73%   x 1.12
 vec_collect::vec_i_filtered::with_linked_list_collect_vec_sized     6,810,821                  6,010,995                  -799,826  -11.74%   x 1.13
 vec_collect::vec_i_filtered::with_linked_list_map_reduce_vec_sized  5,073,883                  4,195,550                  -878,333  -17.31%   x 1.21
 vec_collect::vec_i_filtered::with_vec_vec_sized                     4,915,209                  4,116,080                  -799,129  -16.26%   x 1.19

nikomatsakis · 2021-05-15T08:46:18Z

cc @wagnerf42

wagnerf42 · 2021-05-16T10:36:00Z

hi, i'd like to take some time looking at the benches.

on top of my head here are raw comments:

it is impossible to always win, (i have one example where the current algorithm is not splitting enough already). the idea for me would be to have good defaults and to allow users to tune the policy when they really need it. it is actually possible to move the scheduling policy out of the bridge and into an adaptor so it would be ok to screw a few examples.
i don't like micro benches so much because they need to be on very small inputs. they bring something but a 20% speed increase give an impression of being 20% faster while if you double the size you are maybe only 10% faster. i'd rather count the number of tasks and draw some logs to figure out what is going on.
i'd like to target specifically examples which might suffer from this change (they need some un-even load distribution)
that being said, the current policy is cutting way too much for high number of threads.

if you can give me a bit more time (1 week ?) i will take a closer look.

cuviper · 2021-05-17T21:34:17Z

it is impossible to always win, [...] the idea for me would be to have good defaults and to allow users to tune the policy when they really need it.

Yes, I agree that needs will vary, but I hope this change makes a better default.

i don't like micro benches so much because they need to be on very small inputs.

Yeah, it's hard. Even with the larger benchmarks (cargo run in rayon-demo rather than cargo bench), I'm not really confident that we're representing realistic workloads, or enough diversity.

if you can give me a bit more time (1 week ?) i will take a closer look.

I'm not in a rush, so I'm happy to let you dig further -- thanks!

wagnerf42 · 2021-05-21T16:14:01Z

hi, so here is one example i had in mind:

            let r = (0..*size)
                .into_par_iter()
                .map(|e| (0..e).map(|x| x % 2).sum::<u64>())
                .sum::<u64>();

so, why that ?

if everything is nicely balanced then i don't see how the proposed modification could be bad.
this example makes sense for me because this is what people would do for parallel combinations.
(all 2 out of n) and it is not balanced.

in this example when the parallel range gets divided in two, the left part contains 1/4 of total work and the right part 3/4. for performances the scheduler will need to divide regularly throughout the execution. (log(n) times).
note that for two threads it is not handled nicely by the current rayon and there is a high randomness in the execution times.

i wonder of the following: if you need to keep generating tasks on one branch during the execution then i think at one point it should be possible that only one single stealable task remains with the new mechanism. what does it mean ? well it is still enough to distribute work to all threads but now all stealers must try around p times unsuccesfully before achieving a successful steal.
is that a big deal ? well maybe/maybe not because the number of steals is related to the depth which is logarithmic here.

i did a run of this code on a 32 cores machine using 64 threads and the new code was slower (40%) there (sizes 100k to 300k).

i'd still need to take a look inside the run to see what is really going on.

i'll also try some more benches next week.

cuviper · 2021-05-21T17:27:19Z

Thank you for your investigation! The slowdown is surprising and disappointing, but I hope you'll be able to get a complete picture of why that is. We should also encode that knowledge in source comments at least, because this is all currently very opaque.

adamreichold · 2021-10-05T14:46:42Z

To add another data point to the discussion, I have an example at GeoStat-Framework/GSTools-Core#6 where this makes the difference between a slow down (serial versus parallel) and a speed-up because without it around 500 large temporary arrays are needed for a fold and using it, this drops down to 150 (with theoretical maximum of 1000). This is especially important in that case as I cannot use with_min_len due to ndarray's Zip not providing an IndexedParallelIterator implementation.

Before, when an iterator job was stolen, we would reset the split count all the way back to `current_num_threads` to adaptively split jobs more aggressively when threads seem to need more work. This ends up splitting a lot farther than a lot of people expect, especially in the tail end of a computation when threads are fighting over what's left. Excess splitting can also be harmful for things like `fold` or `map_with` that want to share state as much as possible. We can get a much lazier "adaptive" effect by just not updating the split count when we split a stolen job, effectively giving it only _one_ extra boost of splitting.

adamreichold mentioned this pull request Oct 6, 2021

What are typical problem sizes? GeoStat-Framework/GSTools-Core#6

Merged

cuviper force-pushed the simple-split branch from 520e171 to 6e2ce77 Compare July 2, 2022 01:37

cuviper mentioned this pull request Aug 18, 2022

Performance regression with non-power-of-2 number of threads #968

Open

cuviper mentioned this pull request Feb 3, 2024

fold creates identity for each sublist #1124

Open

cuviper force-pushed the simple-split branch from 6e2ce77 to 34ed130 Compare February 9, 2024 19:42

cuviper mentioned this pull request Dec 3, 2024

init function in .map_init called unexpectedly many times. #1214

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Simplify the iterator adaptive splitting strategy #857

Simplify the iterator adaptive splitting strategy #857

cuviper commented May 13, 2021

cuviper commented May 13, 2021

nikomatsakis commented May 15, 2021

wagnerf42 commented May 16, 2021

cuviper commented May 17, 2021

wagnerf42 commented May 21, 2021

cuviper commented May 21, 2021

adamreichold commented Oct 5, 2021 •

edited

Loading

Simplify the iterator adaptive splitting strategy #857

Are you sure you want to change the base?

Simplify the iterator adaptive splitting strategy #857

Conversation

cuviper commented May 13, 2021

cuviper commented May 13, 2021

nikomatsakis commented May 15, 2021

wagnerf42 commented May 16, 2021

cuviper commented May 17, 2021

wagnerf42 commented May 21, 2021

cuviper commented May 21, 2021

adamreichold commented Oct 5, 2021 • edited Loading

adamreichold commented Oct 5, 2021 •

edited

Loading