Multi-Threading for Sample_z #463

Open · wants to merge 4 commits into dev

Conversation


@Marvin-Beckmann (Member) commented Feb 26, 2025

Description

This PR implements

  • a method that utilises multi-threading for sample_z, and uses it when sampling matrices and polynomials.

Comment: Rayon also has the option min_length, which might be a more intuitive way to split the amount of work per rayon job. However, I was not able to come up with a good way to share the discrete Gaussian integer samplers (dgis) within a single rayon job using that option, so I kept my implementation. If a reviewer finds a nice way, please suggest it. A sketch of the bucket approach follows below.
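
In outline, the bucket approach looks like the following sketch; sample_z_single and sample_z_parallel are illustrative placeholders rather than the library's actual identifiers, and the real sample_z takes the Gaussian parameters as arguments:

use rayon::prelude::*;

// Hypothetical stand-in for the library's single-sample routine.
fn sample_z_single() -> i64 {
    0 // placeholder
}

// Sketch of the bucket approach: split `nr_samples` into one
// similar-sized bucket per thread, sample each bucket in its own
// rayon job, and concatenate the per-bucket results at the end.
fn sample_z_parallel(nr_samples: usize) -> Vec<i64> {
    let nr_threads = rayon::current_num_threads();
    let bucket_size = (nr_samples + nr_threads - 1) / nr_threads;

    (0..nr_threads)
        .into_par_iter()
        .map(|thread_i| {
            let start = thread_i * bucket_size;
            let end = (start + bucket_size).min(nr_samples);
            // each thread owns its sampler state for its whole bucket
            (start..end).map(|_| sample_z_single()).collect::<Vec<_>>()
        })
        .reduce(Vec::new, |mut a, mut b| {
            a.append(&mut b);
            a
        })
}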

Improvements, measured on a Mac:

SampleZ wide 10,000     time:   [4.0497 ms 4.1460 ms 4.2510 ms]
                        change: [-77.351% -76.809% -76.309%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  6 (6.00%) high mild
  2 (2.00%) high severe

Benchmarking SampleZ narrow 10,000: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.5s, enable flat sampling, or reduce sample count to 50.
SampleZ narrow 10,000   time:   [1.7816 ms 1.7972 ms 1.8158 ms]
                        change: [-83.647% -83.539% -83.425%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  6 (6.00%) high mild

SampleZ wide single     time:   [8.5441 µs 8.5694 µs 8.5949 µs]
                        change: [-0.4292% +0.1489% +0.7009%] (p = 0.61 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low severe
  1 (1.00%) high mild
  4 (4.00%) high severe

SampleZ narrow single   time:   [4.2505 µs 4.2601 µs 4.2714 µs]
                        change: [-2.2812% -1.8075% -1.3209%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe

  • Let the benchmark run on other devices; the WSL results follow below.

On WSL:

SampleZ wide 10,000     time:   [13.522 ms 13.746 ms 13.977 ms]
                        change: [-50.594% -49.275% -47.947%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  8 (8.00%) low severe
  1 (1.00%) low mild
  2 (2.00%) high mild
  2 (2.00%) high severe

SampleZ narrow 10,000   time:   [7.4188 ms 7.5750 ms 7.7502 ms]
                        change: [-56.419% -55.085% -53.799%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) low mild
  4 (4.00%) high mild
  6 (6.00%) high severe

SampleZ wide single     time:   [15.781 µs 16.062 µs 16.365 µs]
                        change: [-9.4522% -6.9744% -4.4529%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  7 (7.00%) high mild
  1 (1.00%) high severe

SampleZ narrow single   time:   [7.7638 µs 7.9789 µs 8.2728 µs]
                        change: [-3.9677% -1.0294% +2.6720%] (p = 0.54 > 0.05)
                        No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
  5 (5.00%) high mild
  4 (4.00%) high severe

  • Consider the speed for small matrices: I added a cutoff based on the number of samples, tuned to my laptop's performance. Matrices with fewer than 10 entries are sampled without multi-threading, as this turned out to be the break-even point on my device. For most use cases these matrices will probably not have so few entries, but without the cutoff the efficiency drop-off would be larger. (A sketch of this guard follows the benchmarks below.)
SampleZ 1x1 matrix      time:   [8.6651 µs 8.6872 µs 8.7113 µs]
                        change: [+0.7105% +1.2841% +1.8256%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) low mild
  1 (1.00%) high mild

SampleZ 1x9 matrix      time:   [65.601 µs 65.748 µs 65.897 µs]
                        change: [+1.3021% +1.9660% +2.7914%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  3 (3.00%) high severe

So for very small matrices there is a small runtime loss, but for larger matrices we gain efficiency.

SampleZ 1x10 matrix     time:   [63.215 µs 63.398 µs 63.580 µs]
                        change: [-12.884% -12.142% -11.525%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) low severe
  1 (1.00%) low mild
  3 (3.00%) high mild
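
The guard itself is a simple branch in front of the parallel path. A minimal sketch, reusing the placeholder helpers from the sketch above; the constant name is illustrative, and only the value 10 is the tuned part:

// Below this many entries, the thread setup overhead outweighs the
// parallel speed-up (cutoff measured on the author's laptop).
const MULTITHREADING_CUTOFF: usize = 10;

fn sample_entries(nr_entries: usize) -> Vec<i64> {
    if nr_entries < MULTITHREADING_CUTOFF {
        // sequential path: no thread setup, best for tiny matrices
        (0..nr_entries).map(|_| sample_z_single()).collect()
    } else {
        // parallel bucket path from the earlier sketch
        sample_z_parallel(nr_entries)
    }
}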

Testing

  • I added basic working examples (possibly in doc-comment)
  • I triggered all possible errors in my test in every possible way
  • I included tests for all reasonable edge cases

Checklist:

  • I have performed a self-review of my own code
    • The code provides good readability and maintainability, such that it fulfills best practices like talking code, modularity, ...
      • The chosen implementation is not more complex than it has to be
    • My code should work as intended and no side effects occur (e.g. memory leaks)
    • The doc comments fit our style guide
    • I have credited related sources if needed

@Marvin-Beckmann added the performance⚡ label Feb 26, 2025
@Marvin-Beckmann self-assigned this Feb 27, 2025
@Marvin-Beckmann marked this pull request as ready for review February 27, 2025 08:07
Comment on lines +219 to +221
(0..nr_threads)
    .into_par_iter()
    .map(|thread_i| {
Member

Just to offer another idea for making things quicker (although this might not be rayon's way of parallelising things):
Currently, you split the workload into similar-sized buckets, implicitly assuming that each bucket takes roughly the same time on its thread. This assumption should hold here for larger bucket sizes, but for smaller sizes it might be quicker to submit tasks to a pool of threads, where each thread picks up a new task once it has finished its current one.

Member Author

I looked again, but I was not able to find a better solution, and given that the function will probably not be called with small numbers of samples, I think the current implementation is reasonable.
The problem with the dynamic approach is that I was not able to find a good way to also distribute the integer sampler. Additionally, it introduces overhead through extra thread management, which might increase the runtime due to the dynamic distribution of tasks.
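
For reference, one possible direction for the dynamic approach is rayon's map_init, which hands each worker its own state while rayon distributes the items adaptively; Dgis below is a placeholder for the discrete Gaussian integer sampler, and whether the extra management overhead beats the bucket approach is exactly the open question raised above:

use rayon::prelude::*;

// Placeholder for the discrete Gaussian integer sampler state.
struct Dgis;

impl Dgis {
    fn new() -> Self {
        Dgis
    }
    fn sample(&mut self) -> i64 {
        0 // placeholder
    }
}

// Dynamic variant: rayon splits `0..nr_samples` adaptively and calls
// the init closure once per worker split, so each gets its own sampler.
fn sample_z_dynamic(nr_samples: usize) -> Vec<i64> {
    (0..nr_samples)
        .into_par_iter()
        .map_init(Dgis::new, |dgis, _i| dgis.sample())
        .collect()
}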

Comment on lines +232 to +235
.reduce(Vec::new, |mut a, mut b| {
    a.append(&mut b);
    a
})
Member

I haven't looked at the flamegraph, but from my experience with our library so far, setting the values of the matrix has a significant overhead. Furthermore, joining and iterating vectors shouldn't be the fastest thing in the world.
Could it be quicker to set each entry in the matrix directly after sampling it, sharing the matrix in an Arc?

Member Author

I will check later, but the join/reduce is probably efficient, as it moves the values rather than copying them.
I would assume that the additional management cost of an Arc would exceed the runtime of the collect.
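
As an aside on the Arc idea: for writes to disjoint regions, Rust needs neither an Arc nor a lock. If the samples first go into a contiguous buffer (the matrix type itself may not expose one, so this is a sketch under that assumption, reusing the placeholder Dgis from the earlier sketch), par_chunks_mut hands each rayon job a disjoint mutable chunk and makes the final reduce unnecessary:

use rayon::prelude::*;

// Fill a pre-allocated buffer in place: each rayon job receives a
// disjoint mutable chunk, so no Arc, Mutex, or final reduce is needed.
// `chunk_size` must be non-zero.
fn sample_z_in_place(nr_samples: usize, chunk_size: usize) -> Vec<i64> {
    let mut out = vec![0i64; nr_samples];
    out.par_chunks_mut(chunk_size).for_each(|chunk| {
        // one sampler per chunk, owned by the job that fills it
        let mut dgis = Dgis::new();
        for slot in chunk {
            *slot = dgis.sample();
        }
    });
    out
}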

Co-authored-by: Jan Niklas Siemer <[email protected]>