vnni512 is faster than vnni256 on Xeon w5-2445 despite MHz throttling (downclocking) on AVX-512-heavy code #5757

maximmasiutin · 2025-01-08T01:16:57Z

Describe the issue

Despite the report at #3038, the downclocking penalty does not hold true on all CPUs. That report refers to a question at https://stackoverflow.com/questions/56852812/simd-instructions-lowering-cpu-frequency asked more than 5 years ago.

Today, a Xeon w5-2445 with GCC 14 is faster with vnni512 than with vnni256 (and all other options, including avx2, bmi2, etc.). I also tested GCC 13 vs GCC 14 on vnni512: the difference is negligible but statistically present.

Attached are the runs of https://github.com/hazzl/pyshbench

The first run uses the "bench" parameter of Stockfish (500 runs, speedup +0.0139); the results file is bench-pyshbench-log.txt
The second run uses the "speedtest" parameter (20 runs, speedup +0.0606). This is a newly implemented parameter, see https://official-stockfish.github.io/docs/stockfish-wiki/UCI-&-Commands.html#speedtest; the results file is speedtest-pyshbench-log.txt
The third run is GCC 13 vs GCC 14 on vnni512 ("speedtest", 20 runs, speedup +0.0049); the results file is speedtest-13-vs-14.txt

The "bench" parameter makes Stockfish run a single thread, whereas "speedtest" uses all available threads on the CPU (10 cores, 20 threads in my case), which is why the speed increase is more noticeable with "speedtest". Although there were just 20 runs of "speedtest", they took more time combined than the 500 runs of "bench".

@vondele correctly pointed out at #3038 (comment) that (quote): "The problem is that this frequency behavior will change over time, and presumably the widest vectors will eventually be most efficient."

The improvement is probably due to GCC 14 together with code that the compiler can unroll, implemented by @mstembera here:
32e46fc47 (mstembera 2024-01-08 23:20:23 -0800 231) vec_add_dpbusd_32(acc[k], in0, col0[k]);

Unlike the earlier code, this new code has no dependency on previous data.
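To illustrate what "no dependency on previous data" can look like, here is a minimal scalar sketch. It is not Stockfish's code: dpbusd_scalar is a hypothetical stand-in for one 32-bit lane of the VPDPBUSD instruction (four unsigned-by-signed byte products added to a 32-bit accumulator), and affine_independent together with its chunk-major weight layout is an assumption made for the example. Each acc[k] is updated only from the current input chunk and its own weights, so the K updates within one iteration do not wait on each other, which is the kind of loop a compiler can unroll and overlap.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scalar model of one 32-bit VPDPBUSD lane:
// acc += a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3],
// where a holds unsigned bytes and b holds signed bytes.
inline void dpbusd_scalar(std::int32_t& acc,
                          const std::uint8_t* a,
                          const std::int8_t* b) {
    for (int i = 0; i < 4; ++i)
        acc += static_cast<std::int32_t>(a[i]) * static_cast<std::int32_t>(b[i]);
}

// K outputs accumulated in parallel. Assumed layout: for every 4-byte
// input chunk, the weights hold K consecutive groups of 4 signed bytes.
template <std::size_t K>
std::array<std::int32_t, K> affine_independent(
    const std::vector<std::uint8_t>& input,
    const std::vector<std::int8_t>& weights) {
    std::array<std::int32_t, K> acc{};  // all accumulators start at zero
    const std::size_t chunks = input.size() / 4;
    for (std::size_t i = 0; i < chunks; ++i) {
        const std::uint8_t* in0 = &input[i * 4];
        const std::int8_t* col0 = &weights[i * K * 4];
        // acc[k] never reads acc[j] for j != k, so these K updates
        // form K separate dependency chains instead of one long one.
        for (std::size_t k = 0; k < K; ++k)
            dpbusd_scalar(acc[k], in0, col0 + 4 * k);
    }
    return acc;
}
```

With a single accumulator, every multiply-add would have to wait for the previous one to finish; with K accumulators the latency of one update can be hidden behind the others.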

The Stockfish script at https://github.com/official-stockfish/Stockfish/blob/master/scripts/get_native_properties.sh seems to deliberately avoid vnni512 unless this target is explicitly specified, as in make -j profile-build ARCH=x86-64-vnni512; so does Fishtest at https://github.com/official-stockfish/fishtest/blob/master/worker/games.py#L636 (and line 643).

Attached files:
bench-pyshbench-log.txt
speedtest-pyshbench-log.txt
speedtest-13-vs-14.txt

Expected behavior

vnni512 is used by default on capable processors

Steps to reproduce (vnni256 vs vnni512 on GCC 14)

Make sure GCC 14 or later is your default compiler; otherwise install it and add the COMPCXX=g++-14 (or 15, 16, whichever is applicable) parameter to make, e.g. make -j profile-build ARCH=x86-64-vnni256 COMPCXX=g++-14
Install https://github.com/hazzl/pyshbench
Make two separate directories for the target Stockfish binaries: ~/1/ and ~/2/
2.1 In the first directory, compile Stockfish with the vnni256 target by running make -j profile-build ARCH=x86-64-vnni256 COMPCXX=g++-14, rename the stockfish executable to stockfish-x86-64-vnni256, and copy it to ~/1/stockfish-x86-64-vnni256
2.2 In the second directory, compile Stockfish with the vnni512 target by running make -j profile-build ARCH=x86-64-vnni512 COMPCXX=g++-14, rename the stockfish executable to stockfish-x86-64-vnni512, and copy it to ~/2/stockfish-x86-64-vnni512
To use the "bench" parameter, run ./pyshbench ~/1/stockfish-x86-64-vnni256 ~/2/stockfish-x86-64-vnni512 500 > bench.txt
To use the "speedtest" parameter, edit ./pyshbench and replace "bench" with "speedtest", then run ./pyshbench ~/1/stockfish-x86-64-vnni256 ~/2/stockfish-x86-64-vnni512 20 > speedtest.txt

Steps to reproduce (vnni512 GCC13 vs vnni512 on GCC 14)

Build Stockfish with GCC 13 using make -j profile-build ARCH=x86-64-vnni512 COMPCXX=g++-13
Run pyshbench as specified in the previous subsection, but this time compare GCC 13 vs GCC 14 on vnni512

stockfish-x86-64-vnni512-gcc13 compiler
Stockfish dev-20250106-c76c1793 by the Stockfish developers (see AUTHORS file)

Compiled by                : g++ (GNUC) 13.3.0 on Linux
Compilation architecture   : x86-64-vnni512
Compilation settings       : 64bit VNNI AVX512 BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT
Compiler __VERSION__ macro : 13.3.0

stockfish-x86-64-vnni512-gcc14 compiler
Stockfish dev-20250106-c76c1793 by the Stockfish developers (see AUTHORS file)

Compiled by                : g++ (GNUC) 14.2.0 on Linux
Compilation architecture   : x86-64-vnni512
Compilation settings       : 64bit VNNI AVX512 BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT
Compiler __VERSION__ macro : 14.2.0

Anything else?

I ran the tests with GCC on Ubuntu 24.04.1 LTS under Windows Subsystem for Linux (WSL2) on Windows 11

Operating system

Linux

Stockfish version

dev-20250106-c76c1793

P.S. Thanks to @Disservin for guidance, and for help in finding the links to relevant code.


Disservin · 2025-01-08T02:08:09Z

Was gcc 14 really required or did you simply not test it?
Did you modify pyshbench, because the linked version only uses "bench"?
Against which targets were the two speedup commands run? vnni256?

maximmasiutin · 2025-01-08T09:41:43Z

GCC 14 was not strictly required; the difference compared to GCC 13 is small but persistent, see the "speedtest" pyshbench result of 20 runs:
speedtest-13-vs-14.txt
See also the free-form results in stockfish-runs.txt. The additional optional parameters I specified in the environment are "-O3 -march=native -mtune=native".
I first used unmodified pyshbench with "bench", but then replaced "bench" with "speedtest".
The targets were make -j profile-build ARCH=x86-64-vnni256 vs make -j profile-build ARCH=x86-64-vnni512, as I mentioned in "Steps to reproduce" in this issue, #5757 (comment)

ppigazzini · 2025-01-13T14:13:41Z

@maximmasiutin pyshbench uses stockfish bench, which is a shortcut for stockfish bench 16 1 13 default depth
Using depth 20 with stockfish bench 16 1 20 default depth gives a more stable nps (at the cost of more time, of course); with that I use 1/10 of the iterations.

Stockfish/src/benchmark.cpp

Lines 391 to 396 in c085670

    
    // Assign default values to missing arguments
    std::string ttSize    = (is >> token) ? token : "16";
    std::string threads   = (is >> token) ? token : "1";
    std::string limit     = (is >> token) ? token : "13";
    std::string fenFile   = (is >> token) ? token : "default";
    std::string limitType = (is >> token) ? token : "depth";
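The default-filling pattern in the quoted lines can be tried in isolation with a small standalone sketch. The BenchArgs struct and parse_bench_args function are hypothetical names introduced for this example and are not part of Stockfish; only the five defaults are taken from the snippet above.

```cpp
#include <sstream>
#include <string>

// Hypothetical container for the five positional bench arguments.
struct BenchArgs {
    std::string ttSize, threads, limit, fenFile, limitType;
};

// Reads whitespace-separated tokens in order; once the stream runs out,
// every remaining argument falls back to its default value.
BenchArgs parse_bench_args(const std::string& args) {
    std::istringstream is(args);
    std::string token;
    BenchArgs a;
    a.ttSize    = (is >> token) ? token : "16";
    a.threads   = (is >> token) ? token : "1";
    a.limit     = (is >> token) ? token : "13";
    a.fenFile   = (is >> token) ? token : "default";
    a.limitType = (is >> token) ? token : "depth";
    return a;
}
```

In this sketch an empty argument string yields 16 1 13 default depth, while "16 1 20" changes only the limit to 20 and leaves the last two arguments at their defaults.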
