max thread count is not directly settable or queryable #823
This max thread count is mostly temporary. I don't have the means to test the code with so many threads, as I have no machines with more than 4 threads, and I did not feel comfortable setting the max thread count above 64 for that reason. If I had access to a machine with, let's say, 32 or more threads, I could run some tests, for example the test suite with thread sanitizer, and that restriction could be lifted entirely. Also, I don't think setting more than 64 threads is really useful; I don't think the code scales very well beyond 32 threads anyway.
All fair, but what should client applications actually do? I would rather not have to invent a magic value myself that happens to be adequate for FLAC 1.5.0 but might not be in the future. I guess there would also be another option: I do have machines with up to 16 threads available, so if you want me to run some specific test, or a general scaling measurement, I could do that when I find the time for it.
I agree, I didn't think of this. I assumed 64 would be so large it wouldn't cause any problems. And in some respects I still think it is silly to use so many threads: while I haven't tested it, I think at 64 threads the overhead is already so large that adding any more would not change anything. At some point, two things become a bottleneck: the MD5 calculation and the main thread, which "interacts" with the client app and does some data preparation. When just using presets, this already becomes a bottleneck at something like 16 threads. I agree, it would have been nice not to rely on magic numbers. On the other hand, it would be nice if implementers gave this a little thought: at some point adding more threads is pointless, and firing up a second encoder instance (to process a second file in parallel) is much more efficient. If that is not possible, it is perhaps better to just leave some of those cores dormant; otherwise they'll only be consuming electricity for pure overhead.
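A rough sketch of what that advice could look like in client code (an illustration, not code from this thread; the helper name and parameters are made up), assuming a batch converter that encodes several files at once:

```c
#include <FLAC/stream_encoder.h>

/* Hypothetical helper for a batch converter: instead of giving every hardware
 * thread to a single encoder (where MD5 and the main thread become the
 * bottleneck), split the threads across the encoder instances that run in
 * parallel, one per file. */
static void configure_encoder_threads(FLAC__StreamEncoder *encoder,
                                      unsigned hw_threads,
                                      unsigned files_in_flight)
{
	unsigned per_encoder = hw_threads / files_in_flight;
	if(per_encoder < 1)
		per_encoder = 1;
	/* Return value ignored here for brevity. */
	(void)FLAC__stream_encoder_set_num_threads(encoder, per_encoder);
}
```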
I guess we need some numbers.
build script (current git master):
test corpus (44100Hz stereo 16bit):
benchmark script:
gnuplot script (the 1- and 2-thread results were discarded before plotting):
test systems:
raw results (
For system A (8C/16T), the optimum appears to be 10 threads. My best guess is that FLAC saturates the execution units well enough that SMT does not help much beyond the 8 physical cores. However, it could also be scheduling-related due to WSL2 inside Windows 11. I do not have any 8C/16T native Linux system available (I do have an AMD Ryzen 2700 (Zen+), but it is also running Ubuntu in WSL2 on Windows 11 24H2).

For system B (16C/16T), we can see the algorithm still scaling clearly even with 16 threads. As far as I know (https://www.anandtech.com/show/4955/the-bulldozer-review-amd-fx8150-tested/2, https://www.anandtech.com/show/5831/amd-trinity-review-a10-4600m-a-new-hope/), in Piledriver the FPU is used for integer SIMD and is shared between 2 cores on that microarchitecture (see https://en.wikipedia.org/wiki/Bulldozer_(microarchitecture) and https://en.wikipedia.org/wiki/Piledriver_(microarchitecture)), so I would actually have expected worse scaling than what the results show. I do not have any modern non-SMT Intel CPUs (like Core Ultra 200 (Arrow Lake)) available for testing.

In general, the upper bound of scaling does not appear to have been reached yet at 16 threads with the current algorithm.
Without the WSL2 VM overhead, it scales better on that system, and the optimum appears to be 14 threads. The scaling problem for system A is more likely the VM overhead and/or Windows, and less so SMT. So my first intuition of "throw all available threads at libFLAC" appears to be a sane choice, at least for up to 16 threads.
`flac/src/libFLAC/stream_encoder.c`, lines 2160 to 2161 in 4c57829:
`FLAC__stream_encoder_set_num_threads()` ignores the set value when `value > FLAC__STREAM_ENCODER_MAX_THREADS` and returns `FLAC__STREAM_ENCODER_SET_NUM_THREADS_TOO_MANY_THREADS`. The value of `FLAC__STREAM_ENCODER_MAX_THREADS` is not exposed in the API, thus a client application that sets a value that is too high has no way of knowing the maximum value that would be settable.

I think the intention of client applications setting a high value is probably "use as many threads as possible", so the behavior of returning an error instead of applying the maximum value possible probably goes against that intention. I would also guess that, in the absence of any concrete reason to use another value, most applications might just put the number of threads available in the system there - and CPUs with more than 64 hardware threads do exist.
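To make the client-side situation concrete, here is a minimal sketch (an illustration, not code from the issue) of what such an application would do today, assuming it simply requests one encoder thread per hardware thread:

```c
#include <stdio.h>
#include <FLAC/stream_encoder.h>

/* Hypothetical client code: request "all hardware threads" and see what happens.
 * hw_threads would come from e.g. sysconf(_SC_NPROCESSORS_ONLN); values like
 * 128 are realistic on current server CPUs. */
static void request_all_threads(FLAC__StreamEncoder *encoder, unsigned hw_threads)
{
	switch(FLAC__stream_encoder_set_num_threads(encoder, hw_threads)) {
	case FLAC__STREAM_ENCODER_SET_NUM_THREADS_OK:
		break;
	case FLAC__STREAM_ENCODER_SET_NUM_THREADS_TOO_MANY_THREADS:
		/* The limit (FLAC__STREAM_ENCODER_MAX_THREADS) is internal, so the
		 * client cannot tell which smaller value would have been accepted. */
		fprintf(stderr, "%u threads rejected; actual maximum unknown\n", hw_threads);
		break;
	default:
		/* other statuses (e.g. encoder already initialized) not relevant here */
		break;
	}
}
```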
I am not sure what the best solution would be. Some possibilities:
- Apply the maximum value and return `FLAC__STREAM_ENCODER_SET_NUM_THREADS_TOO_MANY_THREADS`.
- Apply the maximum value and return `FLAC__STREAM_ENCODER_SET_NUM_THREADS_OK`.
- Expose `FLAC__STREAM_ENCODER_MAX_THREADS`.
- Keep `FLAC__stream_encoder_set_num_threads()` and add `FLAC__stream_encoder_set_num_threads2()` which applies the maximum value and returns `FLAC__STREAM_ENCODER_SET_NUM_THREADS_OK`.
- Keep `FLAC__stream_encoder_set_num_threads()` and add `FLAC__stream_encoder_set_num_threads2()` which applies the maximum value and returns a new return value `FLAC__STREAM_ENCODER_SET_NUM_THREADS_MAX_THREADS`.
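As an aside, a client that just wants "as many threads as possible" can only probe downwards today; a rough sketch of that workaround (illustrative only, relying on nothing beyond the return values quoted above):

```c
#include <FLAC/stream_encoder.h>

/* Illustrative workaround: lower the requested thread count until libFLAC
 * accepts it. The internal maximum itself still cannot be queried. */
static unsigned set_as_many_threads_as_accepted(FLAC__StreamEncoder *encoder,
                                                unsigned wanted)
{
	while(wanted > 1 &&
	      FLAC__stream_encoder_set_num_threads(encoder, wanted) ==
	          FLAC__STREAM_ENCODER_SET_NUM_THREADS_TOO_MANY_THREADS)
		wanted--;
	return wanted;
}
```

This works, but every client ends up re-implementing the same guessing logic, which is what the options above try to avoid.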