Performance regression with non-power-of-2 number of threads #968
Comments
Well, I don't think the pool itself would care about that, but it might be a problem with job-splitting on parallel iterators. Basically, they work by even 50-50 splits until there are enough jobs to fill the pool, but that does effectively overshoot up to the next power of two. Plus, when jobs do get stolen, we split further in an attempt at "adaptive" splitting for imbalanced workloads. So basically, a pool of 12 threads will split just as much as a pool of 16 threads would. And with smaller jobs, they may finish sooner and work-steal more, which adaptively splits more, potentially cascading poorly. I'm guessing about all of this, but maybe that could stack up to your 20% effect, especially if you have things like I've known that rayon's job-splitting is not "ideal", but it's trying to strike a general balance. In the past I found that having more control knobs on this ends up slowing everyone down, but there's at least
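The overshoot described above can be sketched with a toy model. This is only an illustration under stated assumptions, not rayon's actual splitter: it assumes every job is halved in rounds until there are at least as many jobs as threads, it ignores the extra adaptive splitting that happens on steals, and the function names `leaf_jobs` and `max_jobs_per_thread` are hypothetical.

```rust
/// Toy model (an assumption, not rayon's real code): start with one job and
/// halve every job each round until there are at least `threads` jobs.
/// Assumes `threads >= 1`.
fn leaf_jobs(threads: usize) -> usize {
    let mut jobs = 1;
    while jobs < threads {
        jobs *= 2; // one round of even 50-50 splits doubles the job count
    }
    jobs
}

/// If those equally sized jobs were spread over the pool with no further
/// stealing, the busiest thread would hold this many of them.
/// Assumes `threads >= 1`.
fn max_jobs_per_thread(threads: usize) -> usize {
    let jobs = leaf_jobs(threads);
    (jobs + threads - 1) / threads // ceiling division
}

fn main() {
    for threads in [3, 6, 8, 12, 16] {
        println!(
            "{threads} threads -> {} leaf jobs, busiest thread holds {}",
            leaf_jobs(threads),
            max_jobs_per_thread(threads)
        );
    }
}
```

In this model a 12-thread pool ends up with the same 16 leaf jobs as a 16-thread pool, so some threads hold two jobs while others hold one. Real work stealing would even some of that out, at the cost of the additional splits mentioned above.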
To expand a bit more on our problem, what we're using rayon for is to compile all functions in a wasm module in parallel. We basically build a big list of all the functions and then use rayon's

    let funcs = if std::env::var("SLOW").is_ok() {
        let (a, _) = rayon::join(|| calculate(), || ());
        a
    } else {
        calculate()
    };

Specifically on an Intel CPU (I can't seem to reproduce this on arm64) when Otherwise though I tested out #857 and it unfortunately didn't show any effect here. Using Apart from this one manual call to
That's wacky, especially that it would happen on Intel and not arm64. The only thing I can see, which has nothing to do with being power-of-two, is that Does the same slowdown happen if you use
Unfortunately yeah even with
Sorry in advance for the drive-by comment: wonder if this thread #795 is related or not?
@Walther that issue is specific to the way that
Over in bytecodealliance/wasmtime#4327 we found a ~20% performance regression on a workload when a non-power-of-2 number of threads was specified, on my 12-core system (Ryzen 3900X) and on other systems with 3, 6, ... threads manually specified. On my own system, manually specifying a power-of-two number of threads (e.g., 8) causes the regression to disappear.
@alexcrichton produced a diff that isolates the issue in that PR; it appears that it may have something to do with scopes. I am not familiar enough with the design of rayon or terminology here to describe in more detail what these changes mean.
Does this issue seem familiar at all? Could there be some dependence on a power-of-two number of threads for, e.g., even work distribution?