max thread count is not directly settable or queryable #823

manxorist · 2025-03-16T19:09:06Z

Lines 2160 to 2161 in 4c57829

    
    if(value > FLAC__STREAM_ENCODER_MAX_THREADS)
        return FLAC__STREAM_ENCODER_SET_NUM_THREADS_TOO_MANY_THREADS;

FLAC__stream_encoder_set_num_threads() ignores the set value when value > FLAC__STREAM_ENCODER_MAX_THREADS and returns FLAC__STREAM_ENCODER_SET_NUM_THREADS_TOO_MANY_THREADS.

The value of FLAC__STREAM_ENCODER_MAX_THREADS is not exposed in the API, thus a client application that is setting a value that is too high has no way of knowing the maximum value that would be settable.

I think the intention of client applications setting a high value is probably "use as many threads as possible", so returning an error instead of applying the maximum possible value probably goes against that intention. I would also guess that, in the absence of any concrete reason to use another value, most applications will just pass the number of hardware threads available in the system, and CPUs with more than 64 hardware threads do exist.

I am not sure what the best solution would be. Some possibilities:

1. Leave it as it is, and expect client applications to do a retry loop.
2. Apply the maximum value and still return FLAC__STREAM_ENCODER_SET_NUM_THREADS_TOO_MANY_THREADS.
3. Apply the maximum value and return FLAC__STREAM_ENCODER_SET_NUM_THREADS_OK.
4. Provide an explicit API to query FLAC__STREAM_ENCODER_MAX_THREADS.
5. Deprecate FLAC__stream_encoder_set_num_threads() and add FLAC__stream_encoder_set_num_threads2() which applies the maximum value and returns FLAC__STREAM_ENCODER_SET_NUM_THREADS_OK.
6. Deprecate FLAC__stream_encoder_set_num_threads() and add FLAC__stream_encoder_set_num_threads2() which applies the maximum value and returns a new return value FLAC__STREAM_ENCODER_SET_NUM_THREADS_MAX_THREADS.
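
For illustration, the retry loop from the first possibility could look roughly like the sketch below. It does not link against libFLAC: `mock_set_num_threads()` is a hypothetical stand-in for FLAC__stream_encoder_set_num_threads(), the limit of 64 is the value discussed in this issue, and the numeric return codes are placeholders rather than values taken from the libFLAC headers.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder return codes for this sketch (not copied from libFLAC). */
enum { SET_THREADS_OK = 0, SET_THREADS_TOO_MANY = 3 };

/* ASSUMPTION: internal limit of 64, as discussed in this issue. */
#define MOCK_MAX_THREADS 64u

static uint32_t mock_applied_threads = 1;

/* Hypothetical stand-in for FLAC__stream_encoder_set_num_threads():
 * a value above the limit is ignored and an error code is returned. */
static uint32_t mock_set_num_threads(uint32_t value)
{
    if (value > MOCK_MAX_THREADS)
        return SET_THREADS_TOO_MANY;
    mock_applied_threads = value;
    return SET_THREADS_OK;
}

/* Retry with halved values until one is accepted.
 * Returns the value that was applied, or 0 if a different error occurred. */
static uint32_t set_threads_with_retry(uint32_t wanted)
{
    assert(wanted > 0);
    for (uint32_t n = wanted; n > 0; n /= 2) {
        uint32_t rc = mock_set_num_threads(n);
        if (rc == SET_THREADS_OK)
            return n;
        if (rc != SET_THREADS_TOO_MANY)
            return 0; /* some other failure: retrying will not help */
    }
    return 0;
}
```

Halving keeps the number of attempts small even for very large requests, but it can settle below the real maximum (a request for 96 ends up at 48, not 64). Decrementing by one would find the exact limit at the cost of more calls.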

ktmf01 · 2025-03-17T08:03:28Z

This max thread count is mostly temporary. I don't have the means to test the code with so many threads, as I have no machines with more than 4 threads. I did not feel comfortable setting max thread count above 64 for that reason.

If I had access to a machine with, let's say, 32 or more threads, I could run some tests, for example the test suite with thread sanitizer, and that restriction could be lifted entirely.

Also, I don't think setting more than 64 threads is really useful; I don't think the code scales very well beyond 32 threads anyway.

manxorist · 2025-03-17T08:58:32Z

All fair, but what should client applications actually do?

I would rather not have to invent a magic value myself that happens to be adequate for FLAC 1.5.0, but might not be in the future.

I guess there would also be another option:
7. Provide a constant (like ((uint32_t)-1)) that can be passed to FLAC__stream_encoder_set_num_threads which either sets the supported maximum or a recommended default.
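
A rough sketch of how such a sentinel could be resolved, combined with the clamping from possibility 3. Everything here is hypothetical: `SKETCH_THREADS_AUTO`, `SKETCH_MAX_THREADS`, and `resolve_thread_request()` are made-up names, the limit of 64 is the value discussed above, and treating 0 as 1 is my own assumption.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_THREADS  64u             /* ASSUMPTION: limit discussed in this issue */
#define SKETCH_THREADS_AUTO ((uint32_t)-1)  /* hypothetical sentinel value */

/* Map a client request to the thread count that would actually be used.
 * - sentinel: use the recommended default chosen by the library
 * - 0: treated as 1 (assumption made for this sketch)
 * - anything above the limit: clamped instead of rejected */
static uint32_t resolve_thread_request(uint32_t requested,
                                       uint32_t recommended_default)
{
    assert(recommended_default >= 1 &&
           recommended_default <= SKETCH_MAX_THREADS);

    if (requested == SKETCH_THREADS_AUTO)
        return recommended_default;
    if (requested == 0)
        return 1;
    if (requested > SKETCH_MAX_THREADS)
        return SKETCH_MAX_THREADS;
    return requested;
}
```

With something like this, a client never needs to know the limit: it either passes the sentinel or passes its hardware thread count and gets the closest supported value.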

I do have machines with up to 16 threads available, so if you want me to run some specific test, or a general scaling measurement, I could do that when I find the time for it.

ktmf01 · 2025-03-18T08:25:34Z

I agree that I didn't think of this. I assumed 64 would be so large that it would not cause any problems. In some respects I still think it is silly to use so many threads: while I haven't tested it, I think that at 64 threads the overhead is already so large that adding any more would not change anything.

At some point, there are two things that become a bottleneck: the MD5 calculation and the main thread which "interacts" with the client app and does some data preparation. When just using presets, this already becomes a bottleneck at something like 16 threads.

I agree, it would have been nice not to rely on magic numbers. On the other hand, it would be nice if implementers give this a little thought: at some point adding more threads is pointless, and firing up a second encoder instance (to process a second file in parallel) is much more efficient. If that is not possible, perhaps it is better to just leave some of those cores dormant, otherwise they'll only be consuming electricity for pure overhead.
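
As a sketch of that reasoning, a client with several files to encode could split its hardware threads across encoder instances instead of giving them all to one. The names `thread_plan` and `plan_threads()` are hypothetical, and the per-encoder cap is a parameter the client would have to pick itself (16 is only the rough bottleneck figure mentioned above for presets).

```c
#include <assert.h>
#include <stdint.h>

/* How many encoder instances to run in parallel, and how many
 * threads to give each of them. */
typedef struct {
    uint32_t instances;
    uint32_t threads_each;
} thread_plan;

static thread_plan plan_threads(uint32_t hw_threads, uint32_t files,
                                uint32_t per_encoder_cap)
{
    assert(hw_threads > 0 && files > 0 && per_encoder_cap > 0);

    thread_plan p;
    /* Instances needed so that no encoder exceeds the cap (rounded up). */
    uint32_t needed = hw_threads / per_encoder_cap
                    + (hw_threads % per_encoder_cap != 0);
    /* There is no point in running more instances than there are files. */
    p.instances = needed < files ? needed : files;
    p.threads_each = hw_threads / p.instances;
    /* With too few files, leave the remaining cores idle. */
    if (p.threads_each > per_encoder_cap)
        p.threads_each = per_encoder_cap;
    return p;
}
```

For example, 64 hardware threads, 10 files, and a cap of 16 give 4 instances with 16 threads each. With only one file, the plan is a single instance with 16 threads and the rest of the cores stay dormant.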

manxorist · 2025-03-18T12:14:53Z

I guess we need some numbers.

build script (current git master):

#!/usr/bin/env bash
set -e

./autogen.sh
./configure --enable-static --disable-shared
make
make check

test corpus (44100Hz stereo 16bit):

Placebo - Without You I'm Nothing (Full Album)
Suzanne Vega - Solitude Standing (Full Album)

benchmark script:

#!/usr/bin/env bash
set -e

function bench_threads () {
	echo "Threads: $1"
	for i in {1..5}; do
		rm -f ../wav/*.flac
		sync
		time ( /usr/bin/time -f "$1 %e" ./src/flac/flac --silent --force --best ../wav/*.wav -j $1 >> bench.log 2>&1 )
	done
	echo ""
	return 0;
}

rm -f bench.log
for threads in {1..16}; do
	bench_threads $threads
done

gnuplot script (the results for 1 and 2 threads were discarded before plotting):

set term png
set output "bench.png"
plot 'bench.log'

test systems:

system A: AMD Ryzen 7 6800U (8C/16T) (Zen3+), 24GB RAM, GCC 11.4, Linux 5.15, Ubuntu 22.04 in WSL2 on Windows 11 24H2
system B: AMD Opteron 6378 (16C/16T) (Piledriver), 128GB RAM, GCC 12.2, Linux 6.1, LMDE 6 (Debian 12)

results:
system A: [plot of bench.log, image not preserved]

system B: [plot of bench.log, image not preserved]

raw results (threads seconds, 5 runs each):

system A: A-results.txt
system B: B-results.txt
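
For anyone who wants to reduce such a "threads seconds" log to one number per thread count, a minimal sketch follows. Taking the fastest of the 5 runs and reporting speedup relative to the single-thread run are choices made for this example, not how the plots above were produced. `record_sample()` and `speedup()` are hypothetical helper names.

```c
#include <assert.h>
#include <stdio.h>

#define MAX_THREADS_BENCHED 16 /* the benchmark script runs 1..16 threads */

/* best[t] holds the fastest time seen for t threads; 0.0 means "no sample".
 * Parses one "threads seconds" line; returns 1 if it was used, 0 if skipped. */
static int record_sample(double best[MAX_THREADS_BENCHED + 1], const char *line)
{
    unsigned threads;
    double seconds;

    if (sscanf(line, "%u %lf", &threads, &seconds) != 2)
        return 0; /* not a result line */
    if (threads < 1 || threads > MAX_THREADS_BENCHED || seconds <= 0.0)
        return 0; /* out of the benchmarked range */
    if (best[threads] == 0.0 || seconds < best[threads])
        best[threads] = seconds;
    return 1;
}

/* Speedup of `threads` threads relative to 1 thread; 0.0 if data is missing. */
static double speedup(const double best[MAX_THREADS_BENCHED + 1], unsigned threads)
{
    assert(threads >= 1 && threads <= MAX_THREADS_BENCHED);
    if (best[1] == 0.0 || best[threads] == 0.0)
        return 0.0;
    return best[1] / best[threads];
}
```

Feeding it each line of a results file (for example with fgets() in a loop) and then printing speedup(best, t) for t from 1 to 16 gives a scaling curve that is easier to compare between systems than raw seconds.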

For system A (8C/16T), the optimum appears to be 10 threads. My best guess would be that FLAC saturates the execution units well enough that SMT does not help much beyond the 8 cores. However, it could also be scheduling related due to WSL2 inside Windows 11. I do not have any 8C/16T native Linux system available (I do have an AMD Ryzen 2700 (Zen+), but it is also running Ubuntu in WSL2 on Windows 11 24H2).

For system B (16C/16T), we can see obvious scaling of the algorithm even with 16 threads. As far as I know (https://www.anandtech.com/show/4955/the-bulldozer-review-amd-fx8150-tested/2, https://www.anandtech.com/show/5831/amd-trinity-review-a10-4600m-a-new-hope/), the FPU in Piledriver is also used for integer SIMD and is shared between 2 cores (see https://en.wikipedia.org/wiki/Bulldozer_(microarchitecture) and https://en.wikipedia.org/wiki/Piledriver_(microarchitecture)). So, I would actually have expected worse scaling than what the results show.

I do not have any modern non-SMT Intel CPUs (like Core Ultra 200 (Arrow Lake)) available for testing.

I did use --best in order to increase the CPU load and reduce the relative overhead of thread setup and synchronization.

In general, the upper bound of scaling appears to not be reached yet with 16 threads for the current algorithm.

manxorist · 2025-03-18T16:45:55Z

system C: AMD Ryzen 7 6800U (8C/16T) (Zen3+), 24GB RAM, GCC 14.1, MSYS2-MINGW64 on Windows 11 24H2

C-results.txt

Without the WSL2 VM overhead, it scales better on that system, and the optimum appears to be 14 threads. The scaling problem for system A is more likely the VM overhead and/or Windows, and less so SMT scaling.

So, my first intuition of "throw all available threads at libFLAC" appears to be a sane choice, at least for up to 16 threads.

manxorist added a commit to OpenMPT/openmpt that referenced this issue Mar 16, 2025

[Fix] Work-around <xiph/flac#823>.

139bbda

git-svn-id: https://source.openmpt.org/svn/openmpt/trunk/OpenMPT@23044 56274372-70c3-4bfc-bfc3-4c3a0b034d27
