Multi-Threading for Sample_z #463

Open · wants to merge 4 commits into dev

Conversation


@Marvin-Beckmann (Member) commented Feb 26, 2025

Description

This PR implements

  • a method that utilises multi-threading for sample_z, and uses it when sampling matrices and polynomials.

Comment: Rayon also has the option min_length, which might be a more intuitive way to split the amount of work per rayon job. However, I was not able to come up with a good way to share the discrete Gaussian integer samplers (dgis) within a single rayon job using that option, so I kept my implementation. If a reviewer finds a nice way, please suggest it. A sketch of the bucket approach follows below.
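
In outline, the bucket approach looks like the following sketch; sample_z_single and sample_z_parallel are illustrative placeholders rather than the library's actual identifiers, and the real sample_z takes the Gaussian parameters as arguments:

use rayon::prelude::*;

// Hypothetical stand-in for the library's single-sample routine.
fn sample_z_single() -> i64 {
    0 // placeholder
}

// Sketch of the bucket approach: split `nr_samples` into one
// similar-sized bucket per thread, sample each bucket in its own
// rayon job, and concatenate the per-bucket results at the end.
fn sample_z_parallel(nr_samples: usize) -> Vec<i64> {
    let nr_threads = rayon::current_num_threads();
    let bucket_size = (nr_samples + nr_threads - 1) / nr_threads;

    (0..nr_threads)
        .into_par_iter()
        .map(|thread_i| {
            let start = thread_i * bucket_size;
            let end = (start + bucket_size).min(nr_samples);
            // each thread owns its sampler state for its whole bucket
            (start..end).map(|_| sample_z_single()).collect::<Vec<_>>()
        })
        .reduce(Vec::new, |mut a, mut b| {
            a.append(&mut b);
            a
        })
}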

Improvements, measured on a Mac:

SampleZ wide 10,000     time:   [4.0497 ms 4.1460 ms 4.2510 ms]
                        change: [-77.351% -76.809% -76.309%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  6 (6.00%) high mild
  2 (2.00%) high severe

Benchmarking SampleZ narrow 10,000: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.5s, enable flat sampling, or reduce sample count to 50.
SampleZ narrow 10,000   time:   [1.7816 ms 1.7972 ms 1.8158 ms]
                        change: [-83.647% -83.539% -83.425%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  6 (6.00%) high mild

SampleZ wide single     time:   [8.5441 µs 8.5694 µs 8.5949 µs]
                        change: [-0.4292% +0.1489% +0.7009%] (p = 0.61 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low severe
  1 (1.00%) high mild
  4 (4.00%) high severe

SampleZ narrow single   time:   [4.2505 µs 4.2601 µs 4.2714 µs]
                        change: [-2.2812% -1.8075% -1.3209%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe

  • Let the benchmark run on other devices; the WSL results follow below.

On WSL:

SampleZ wide 10,000     time:   [13.522 ms 13.746 ms 13.977 ms]
                        change: [-50.594% -49.275% -47.947%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  8 (8.00%) low severe
  1 (1.00%) low mild
  2 (2.00%) high mild
  2 (2.00%) high severe

SampleZ narrow 10,000   time:   [7.4188 ms 7.5750 ms 7.7502 ms]
                        change: [-56.419% -55.085% -53.799%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) low mild
  4 (4.00%) high mild
  6 (6.00%) high severe

SampleZ wide single     time:   [15.781 µs 16.062 µs 16.365 µs]
                        change: [-9.4522% -6.9744% -4.4529%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  7 (7.00%) high mild
  1 (1.00%) high severe

SampleZ narrow single   time:   [7.7638 µs 7.9789 µs 8.2728 µs]
                        change: [-3.9677% -1.0294% +2.6720%] (p = 0.54 > 0.05)
                        No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
  5 (5.00%) high mild
  4 (4.00%) high severe

  • Consider the speed for small matrices: I added a cutoff based on the number of samples, tuned to my laptop's performance. Matrices with fewer than 10 entries are sampled without multi-threading, as this turned out to be the break-even point on my device. For most use cases these matrices will probably not have so few entries, but without the cutoff the efficiency drop-off would be larger. (A sketch of this guard follows the benchmarks below.)
SampleZ 1x1 matrix      time:   [8.6651 µs 8.6872 µs 8.7113 µs]
                        change: [+0.7105% +1.2841% +1.8256%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) low mild
  1 (1.00%) high mild

SampleZ 1x9 matrix      time:   [65.601 µs 65.748 µs 65.897 µs]
                        change: [+1.3021% +1.9660% +2.7914%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  3 (3.00%) high severe

So for very small matrices there is a small runtime loss, but for larger matrices we gain efficiency.

SampleZ 1x10 matrix     time:   [63.215 µs 63.398 µs 63.580 µs]
                        change: [-12.884% -12.142% -11.525%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) low severe
  1 (1.00%) low mild
  3 (3.00%) high mild
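
The guard itself is a simple branch in front of the parallel path. A minimal sketch, reusing the placeholder helpers from the sketch above; the constant name is illustrative, and only the value 10 is the tuned part:

// Below this many entries, the thread setup overhead outweighs the
// parallel speed-up (cutoff measured on the author's laptop).
const MULTITHREADING_CUTOFF: usize = 10;

fn sample_entries(nr_entries: usize) -> Vec<i64> {
    if nr_entries < MULTITHREADING_CUTOFF {
        // sequential path: no thread setup, best for tiny matrices
        (0..nr_entries).map(|_| sample_z_single()).collect()
    } else {
        // parallel bucket path from the earlier sketch
        sample_z_parallel(nr_entries)
    }
}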

Testing

  • I added basic working examples (possibly in doc-comment)
  • I triggered all possible errors in my test in every possible way
  • I included tests for all reasonable edge cases

Checklist:

  • I have performed a self-review of my own code
    • The code provides good readability and maintainability, such that it fulfills best practices like talking code, modularity, ...
      • The chosen implementation is not more complex than it has to be
    • My code should work as intended and no side effects occur (e.g. memory leaks)
    • The doc comments fit our style guide
    • I have credited related sources if needed

@Marvin-Beckmann added the performance⚡ label Feb 26, 2025
@Marvin-Beckmann self-assigned this Feb 27, 2025
@Marvin-Beckmann marked this pull request as ready for review February 27, 2025 08:07
Comment on lines +219 to +221
(0..nr_threads)
    .into_par_iter()
    .map(|thread_i| {
Member

Just to offer another idea for making things quicker (although this might not be rayon's way of parallelising things):
Currently, you split the workload into similar-sized buckets, implicitly assuming that each bucket takes roughly the same time on its thread. This assumption should hold here for larger bucket sizes, but for smaller sizes it might be quicker to submit tasks to a pool of threads, where each thread picks up a new task once it has finished its current one.

Member Author

I looked again, but I was not able to find a better solution, and given that the function will probably not be called with small numbers of samples, I think the current implementation is reasonable.
The problem with the dynamic approach is that I was not able to find a good way to also distribute the integer sampler. Additionally, it introduces overhead through extra thread management, which might increase the runtime due to the dynamic distribution of tasks.
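
For reference, one possible direction for the dynamic approach is rayon's map_init, which hands each worker its own state while rayon distributes the items adaptively; Dgis below is a placeholder for the discrete Gaussian integer sampler, and whether the extra management overhead beats the bucket approach is exactly the open question raised above:

use rayon::prelude::*;

// Placeholder for the discrete Gaussian integer sampler state.
struct Dgis;

impl Dgis {
    fn new() -> Self {
        Dgis
    }
    fn sample(&mut self) -> i64 {
        0 // placeholder
    }
}

// Dynamic variant: rayon splits `0..nr_samples` adaptively and calls
// the init closure once per worker split, so each gets its own sampler.
fn sample_z_dynamic(nr_samples: usize) -> Vec<i64> {
    (0..nr_samples)
        .into_par_iter()
        .map_init(Dgis::new, |dgis, _i| dgis.sample())
        .collect()
}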

Comment on lines +232 to +235
.reduce(Vec::new, |mut a, mut b| {
    a.append(&mut b);
    a
})
Member

I haven't looked at the flamegraph, but from my experience with our library so far, setting the values of the matrix has a significant overhead. Furthermore, joining and iterating vectors shouldn't be the fastest thing in the world.
Could it be quicker to set each entry in the matrix directly after sampling it, sharing the matrix in an Arc?

Member Author

I will check later, but the join/reduce is probably efficient, as it moves the values rather than copying them.
I would assume that the additional management cost of an Arc would exceed the runtime of the collect.
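
As an aside on the Arc idea: for writes to disjoint regions, Rust needs neither an Arc nor a lock. If the samples first go into a contiguous buffer (the matrix type itself may not expose one, so this is a sketch under that assumption, reusing the placeholder Dgis from the earlier sketch), par_chunks_mut hands each rayon job a disjoint mutable chunk and makes the final reduce unnecessary:

use rayon::prelude::*;

// Fill a pre-allocated buffer in place: each rayon job receives a
// disjoint mutable chunk, so no Arc, Mutex, or final reduce is needed.
// `chunk_size` must be non-zero.
fn sample_z_in_place(nr_samples: usize, chunk_size: usize) -> Vec<i64> {
    let mut out = vec![0i64; nr_samples];
    out.par_chunks_mut(chunk_size).for_each(|chunk| {
        // one sampler per chunk, owned by the job that fills it
        let mut dgis = Dgis::new();
        for slot in chunk {
            *slot = dgis.sample();
        }
    });
    out
}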

Co-authored-by: Jan Niklas Siemer <[email protected]>