Skip to content

max thread count is not directly settable or queryable #823

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
manxorist opened this issue Mar 16, 2025 · 5 comments
Open

max thread count is not directly settable or queryable #823

manxorist opened this issue Mar 16, 2025 · 5 comments

Comments

@manxorist
Copy link
Contributor

if(value > FLAC__STREAM_ENCODER_MAX_THREADS)
return FLAC__STREAM_ENCODER_SET_NUM_THREADS_TOO_MANY_THREADS;
FLAC__stream_encoder_set_num_threads() ignores the set value when value > FLAC__STREAM_ENCODER_MAX_THREADS and returns FLAC__STREAM_ENCODER_SET_NUM_THREADS_TOO_MANY_THREADS.

The value of FLAC__STREAM_ENCODER_MAX_THREADS is not exposed in the API, thus a client application that is setting a value that is too high has no way of knowing the maximum value that would be settable.

I think the intention of client applications setting a high value is probably "use as many threads as possible", so the behavior of returning an error instead of applying the maximum value possible probably goes against that intention. I would also guess that, in absence of any concrete reason to use another value, most applications might just put the number of threads available in the system there - and CPUs with more than 64 hardware threads do exist.

I am not sure what the best solution would be. Some possibilities:

  1. Leave it as it is, and expect client applications to do a retry loop.
  2. Apply the maximum value and still return FLAC__STREAM_ENCODER_SET_NUM_THREADS_TOO_MANY_THREADS.
  3. Apply the maximum value and return FLAC__STREAM_ENCODER_SET_NUM_THREADS_OK.
  4. Provide an explicit API to query FLAC__STREAM_ENCODER_MAX_THREADS.
  5. Deprecate FLAC__stream_encoder_set_num_threads() and add FLAC__stream_encoder_set_num_threads2() which applies the maximum value and returns FLAC__STREAM_ENCODER_SET_NUM_THREADS_OK.
  6. Deprecate FLAC__stream_encoder_set_num_threads() and add FLAC__stream_encoder_set_num_threads2() which applies the maximum value and returns a new return value FLAC__STREAM_ENCODER_SET_NUM_THREADS_MAX_THREADS.
manxorist added a commit to OpenMPT/openmpt that referenced this issue Mar 16, 2025
@ktmf01
Copy link
Collaborator

ktmf01 commented Mar 17, 2025

This max thread count is mostly temporary. I don't have the means to test the code with so many threads, as I have no machines with more than 4 threads. I did not feel comfortable setting max thread count above 64 for that reason.

If I would have access to a machine with, lets say, 32 or more threads, I could run some tests, for example the test suite with thread sanitizer, and that restriction could be lifted entirely.

Also, I don't think setting more than 64 threads is really useful, I don't think the code scales very well beyond 32 threads anyway.

@manxorist
Copy link
Contributor Author

manxorist commented Mar 17, 2025

All fair, but what should client applications actually do?

I would rather not have to invent a magic value myself that happens to be adequate for FLAC 1.5.0, but might not be in the future.

I guess there would also be another option:
7. Provide a constant (like ((uint32_t)-1)) that can be passed to FLAC__stream_encoder_set_num_threads which either sets the supported maximum or a recommended default.

I do have machines with up to 16 threads available, so if you want me to run some specific test, or a general scaling measurement, I could do that when I find the time for it.

@ktmf01
Copy link
Collaborator

ktmf01 commented Mar 18, 2025

I agree that I didn't think of this. I assumed 64 would be so large it would cause any problems. And in some respect I still think it is silly to use so many threads: while I haven't tested, I think at 64 threads the overhead is already so large adding any more would not change anything.

At some point, there are two things that become a bottleneck: the MD5 calculation and the main thread which "interacts" with the client app and does some data preparation. When just using presets, this already becomes a bottleneck at something like 16 threads.

I agree, it would have been nice not to rely on magic numbers. On the other hand, it would be nice if implementers give this a little thought: at some point adding more threads is pointless, and firing up a second encoder instance (to process a second file in parallel) is much more efficient. If that is not possible, perhaps it is better to just leave some of those cores dormant, otherwise they'll only be consuming electricity for pure overhead.

@manxorist
Copy link
Contributor Author

manxorist commented Mar 18, 2025

I guess we need some numbers.

build script (current git master):

#!/usr/bin/env bash
set -e

./autogen.sh
./configure --enable-static --disable-shared
make
make check

test corpus (44100Hz stereo 16bit):

  • Placebo - Without You I'm Nothing (Full Album)
  • Suzanne Vega - Solitude Standing (Full Album)

benchmark script:

#!/usr/bin/env bash
set -e

function bench_threads () {
	echo "Threads: $1"
	for i in {1..5}; do
		rm -f ../wav/*.flac
		sync
		time ( /usr/bin/time -f "$1 %e" ./src/flac/flac --silent --force --best ../wav/*.wav -j $1 >> bench.log 2>&1 )
	done
	echo ""
	return 0;
}

rm -f bench.log
for threads in {1..16}; do
	bench_threads $threads
done

gnuplot script (1 and 2 threads result were discarded before plotting):

set term png
set output "bench.png"
plot 'bench.log'

test systems:

  • system A: AMD Ryzen 7 6800U (8C/16T) (Zen3+), 24GB RAM, GCC 11.4, Linux 5.15, Ubuntu 22.04 in WSL2 on Windows 11 24H2
  • system B: AMD Opteron 6378 (16C/16T) (Piledriver), 128GB RAM, GCC 12.2, Linux 6.1, LMDE 6 (Debian 12)

results:
system A:
Image
system B:
Image

raw results (threads seconds, 5 runs each):

For system A (8C/16T), the optimum appears to be 10 threads. My best guess would be that FLAC saturates the execution units good enough such that SMT does not help much beyond the 8 cores. However, it could also be scheduling related due to WSL2 inside Windows 11. I do not have any 8C/16T native Linux system available (I do have an AMD Ryzen 2700 (Zen+), but it is also running Ubuntu in WSL2 on Windows 11 24H2).

For system B (16C/16T), we can see obvious scaling of the algorithm even with 16 threads. As far as I know (https://www.anandtech.com/show/4955/the-bulldozer-review-amd-fx8150-tested/2, https://www.anandtech.com/show/5831/amd-trinity-review-a10-4600m-a-new-hope/), in Piledriver, the FPU is used for integer SIMD and is shared between 2 cores on that microarchitecture (see https://en.wikipedia.org/wiki/Bulldozer_(microarchitecture) and https://en.wikipedia.org/wiki/Piledriver_(microarchitecture)). So, I would actually have expected worse scaling than what the results show.

I do not have any modern non-SMT Intel CPUs (like Core Ultra 200 (Arrow Lake)) available for testing.

I did use --best in order to increase the CPU load and reduce the relative overhead of thread setup and synchronization.

In general, the upper bound of scaling appears to not be reached yet with 16 threads for the current algorithm.

@manxorist
Copy link
Contributor Author

  • system C: AMD Ryzen 7 6800U (8C/16T) (Zen3+), 24GB RAM, GCC 14.1, MSYS2-MINGW64 on Windows 11 24H2

Image

C-results.txt

Without the WSL2 VM overhead, it scales better on that system, and the optimum appears to be 14 threads. The scaling problem for system A is more likely the VM overhead and/or Windows, and less so SMT scaling.

So, my first intuition of "throw all available threads at libFLAC" appears to be a sane choice, at least so for up to 16 threads.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants