vnni512 is faster than vnni256 on Xeon w5-2445 despite MHz throttling (downclocking) on AVX-512-heavy code #5757

Open
maximmasiutin opened this issue Jan 8, 2025 · 3 comments

Comments

@maximmasiutin
Contributor

maximmasiutin commented Jan 8, 2025

Describe the issue

Despite the report in #3038, AVX-512 downclocking does not hold true on all CPUs. That report refers to a Stack Overflow question, https://stackoverflow.com/questions/56852812/simd-instructions-lowering-cpu-frequency, asked more than 5 years ago.

Today, on a Xeon w5-2445 with GCC 14, vnni512 is faster than vnni256 (and than all other targets, including avx2, bmi2, etc.). I also tested GCC 13 vs GCC 14 on vnni512: the difference is small but measurable.

Here are attached the runs of https://github.com/hazzl/pyshbench

  1. The first run uses the "bench" parameter of Stockfish (500 runs, speedup +0.0139) - results attached in bench-pyshbench-log.txt
  2. The second run uses the "speedtest" parameter (20 runs, speedup +0.0606). This is a newly implemented parameter, see https://official-stockfish.github.io/docs/stockfish-wiki/UCI-&-Commands.html#speedtest - results attached in speedtest-pyshbench-log.txt
  3. The third run compares GCC 13 vs GCC 14 on vnni512 ("speedtest", 20 runs, speedup +0.0049) - results attached in speedtest-13-vs-14.txt

The "bench" parameter makes Stockfish run a single thread, whereas "speedtest" uses all available threads on the CPU (10 cores, 20 threads in my case). That is why the speed increase is more noticeable with the "speedtest" parameter. Although there were only 20 runs of "speedtest", they took more time combined than the 500 runs of "bench".
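For readers who want to check figures like these against their own logs, here is a minimal sketch of one way to compute a relative speedup from two sets of nodes-per-second samples. The formula (difference of means divided by the base mean) is an assumption, not necessarily what pyshbench reports, and the sample values are invented for illustration.

```python
from statistics import mean


def relative_speedup(base_nps, test_nps):
    """Return (mean(test) - mean(base)) / mean(base) for two lists of nps samples.

    This definition is an assumption made for illustration; pyshbench may
    define its "speedup" figure differently.
    """
    if not base_nps or not test_nps:
        raise ValueError("both sample lists must be non-empty")
    base = mean(base_nps)
    return (mean(test_nps) - base) / base


# Invented nps samples, NOT taken from the attached log files.
vnni256 = [1_000_000, 1_010_000, 990_000]
vnni512 = [1_060_000, 1_070_000, 1_050_000]

print(f"speedup: {relative_speedup(vnni256, vnni512):+.4f}")
```

A positive value means the second binary searched more nodes per second on average than the first.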

@vondele correctly pointed out at #3038 (comment) that (quote): "The problem is that this frequency behavior will change over time, and presumably the widest vectors will eventually be most efficient."

The improvement is probably due to GCC 14 combined with code that the compiler can unroll, implemented by @mstembera here:
32e46fc47 (mstembera 2024-01-08 23:20:23 -0800 231) vec_add_dpbusd_32(acc[k], in0, col0[k]);

Unlike the previous code, this new code has no dependency on previously computed data.

The Stockfish script at https://github.com/official-stockfish/Stockfish/blob/master/scripts/get_native_properties.sh seems to deliberately avoid vnni512 unless this target is explicitly specified, as in make -j profile-build ARCH=x86-64-vnni512. Fishtest does the same at https://github.com/official-stockfish/fishtest/blob/master/worker/games.py#L636 (and line 643).
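To make the requested behavior concrete, here is a hedged sketch of a target picker that prefers vnni512 when the CPU advertises the relevant features. It is not the logic of get_native_properties.sh or of Fishtest: the function name `pick_arch`, the `prefer_512` switch, and the exact set of required flags are assumptions for illustration. The flag names follow the spelling used in the `flags` line of Linux's /proc/cpuinfo.

```python
# Feature flags assumed (for this sketch) to be needed by both VNNI targets.
AVX512_VNNI_FLAGS = {"avx512f", "avx512bw", "avx512dq", "avx512vl", "avx512_vnni"}


def pick_arch(cpu_flags, prefer_512=True):
    """Map a whitespace-separated CPU flag string to a Stockfish ARCH name.

    Hypothetical mapping: the real scripts check more targets and
    currently avoid vnni512 by default.
    """
    flags = set(cpu_flags.split())
    if AVX512_VNNI_FLAGS <= flags:
        # The issue proposes taking the 512-bit target on capable CPUs.
        return "x86-64-vnni512" if prefer_512 else "x86-64-vnni256"
    if {"avx2", "bmi2"} <= flags:
        return "x86-64-bmi2"
    if "avx2" in flags:
        return "x86-64-avx2"
    return "x86-64"


# Example flag string (shortened); on Linux it would come from /proc/cpuinfo.
example = "sse2 avx2 bmi2 avx512f avx512bw avx512dq avx512vl avx512_vnni"
print(pick_arch(example))                    # 512-bit target preferred
print(pick_arch(example, prefer_512=False))  # conservative choice
```

The `prefer_512` switch stands in for the policy question raised here: whether capable processors should default to the 512-bit target or stay on the 256-bit one.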

Attached files:
bench-pyshbench-log.txt
speedtest-pyshbench-log.txt
speedtest-13-vs-14.txt

Expected behavior

vnni512 is used by default on capable processors

Steps to reproduce (vnni256 vs vnni512 on GCC 14)

  1. Make sure GCC 14 or later is your default compiler; otherwise install it and add the COMPCXX=g++-14 parameter (or 15, 16, whichever applies) to make, e.g. make -j profile-build ARCH=x86-64-vnni256 COMPCXX=g++-14
  2. Install https://github.com/hazzl/pyshbench
  3. Make two separate directories for the target Stockfish binaries: ~/1/ and ~/2/
    3.1 In the first directory, compile Stockfish for the vnni256 target by running make -j profile-build ARCH=x86-64-vnni256 COMPCXX=g++-14, rename the stockfish executable to stockfish-x86-64-vnni256, and copy it to ~/1/stockfish-x86-64-vnni256
    3.2 In the second directory, compile Stockfish for the vnni512 target by running make -j profile-build ARCH=x86-64-vnni512 COMPCXX=g++-14, rename the stockfish executable to stockfish-x86-64-vnni512, and copy it to ~/2/stockfish-x86-64-vnni512
  4. To test with the "bench" parameter, run ./pyshbench ~/1/stockfish-x86-64-vnni256 ~/2/stockfish-x86-64-vnni512 500 > bench.txt
  5. To test with the "speedtest" parameter, edit ./pyshbench and replace "bench" with "speedtest", then run ./pyshbench ~/1/stockfish-x86-64-vnni256 ~/2/stockfish-x86-64-vnni512 20 > speedtest.txt
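The steps above could be scripted roughly as follows. This is a sketch only: it builds the command lines without running them, the helper names `build_command` and `pyshbench_command` are hypothetical, and it passes the target through ARCH=, as the other make invocations in this issue do.

```python
import os
import shlex


def build_command(arch, compcxx=None):
    """Return the make invocation for one Stockfish target as an argv list."""
    cmd = ["make", "-j", "profile-build", f"ARCH={arch}"]
    if compcxx:
        # Only needed when the default g++ is older than the one wanted.
        cmd.append(f"COMPCXX={compcxx}")
    return cmd


def pyshbench_command(base_binary, test_binary, runs):
    """Return the pyshbench invocation comparing two binaries."""
    return ["./pyshbench", base_binary, test_binary, str(runs)]


targets = ["x86-64-vnni256", "x86-64-vnni512"]
for arch in targets:
    print(shlex.join(build_command(arch, "g++-14")))

# Directory layout from the steps above: ~/1/ and ~/2/.
binaries = [
    os.path.expanduser(f"~/{i}/stockfish-{arch}")
    for i, arch in enumerate(targets, start=1)
]
print(shlex.join(pyshbench_command(binaries[0], binaries[1], 500)))
```

Passing the resulting lists to `subprocess.run` from the Stockfish `src` directory would execute them; redirecting the pyshbench output to a file is left to the caller.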

Steps to reproduce (vnni512 on GCC 13 vs vnni512 on GCC 14)

  1. Build Stockfish with GCC 13 using make -j profile-build ARCH=x86-64-vnni512 COMPCXX=g++-13
  2. Run pyshbench as described in the previous subsection, but this time compare the GCC 13 and GCC 14 builds of vnni512
stockfish-x86-64-vnni512-gcc13 compiler
Stockfish dev-20250106-c76c1793 by the Stockfish developers (see AUTHORS file)

Compiled by                : g++ (GNUC) 13.3.0 on Linux
Compilation architecture   : x86-64-vnni512
Compilation settings       : 64bit VNNI AVX512 BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT
Compiler __VERSION__ macro : 13.3.0
stockfish-x86-64-vnni512-gcc14 compiler
Stockfish dev-20250106-c76c1793 by the Stockfish developers (see AUTHORS file)

Compiled by                : g++ (GNUC) 14.2.0 on Linux
Compilation architecture   : x86-64-vnni512
Compilation settings       : 64bit VNNI AVX512 BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT
Compiler __VERSION__ macro : 14.2.0

Anything else?

I ran the tests on Ubuntu 24.04.1 LTS with GCC, under Windows Subsystem for Linux (WSL2) on Windows 11.

Operating system

Linux

Stockfish version

dev-20250106-c76c1793

P.S. Thanks to @Disservin for guidance, and for help in finding the links to relevant code.

@Disservin
Member

  • Was gcc 14 really required or did you simply not test it?
  • Did you modify pyshbench because the linked version only uses "bench" ?
  • Against which targets were the two speedup commands run? vnni256?

@maximmasiutin
Contributor Author

maximmasiutin commented Jan 8, 2025

  1. GCC 14 was not strictly required; the difference compared to GCC 13 is small but persistent. See the "speedtest" pyshbench result of 20 runs:
    speedtest-13-vs-14.txt
    See also the free-form results in stockfish-runs.txt. The additional optional parameters I specified in the environment are "-O3 -march=native -mtune=native".
  2. I first used the unmodified pyshbench, which runs "bench", and then replaced "bench" with "speedtest".
  3. make -j profile-build ARCH=x86-64-vnni256 vs make -j profile-build ARCH=x86-64-vnni512, as I mentioned in "Steps to reproduce" in #5757 (comment).

@ppigazzini
Contributor

ppigazzini commented Jan 13, 2025

@maximmasiutin pyshbench uses stockfish bench, which is shorthand for stockfish bench 16 1 13 default depth.
Using depth 20, i.e. stockfish bench 16 1 20 default depth, gives more stable nps (at the cost of a longer run), so with it I use 1/10 of the iterations.

Stockfish/src/benchmark.cpp

Lines 391 to 396 in c085670

// Assign default values to missing arguments
std::string ttSize = (is >> token) ? token : "16";
std::string threads = (is >> token) ? token : "1";
std::string limit = (is >> token) ? token : "13";
std::string fenFile = (is >> token) ? token : "default";
std::string limitType = (is >> token) ? token : "depth";
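As a rough illustration of how those defaults expand, here is a small Python sketch that mirrors the positional fallback in the excerpt above. It is not part of Stockfish or pyshbench; the function name `parse_bench_args` and the key names are made up for this example.

```python
# Positional defaults in the same order as the excerpt from benchmark.cpp.
BENCH_DEFAULTS = (
    ("tt_size", "16"),
    ("threads", "1"),
    ("limit", "13"),
    ("fen_file", "default"),
    ("limit_type", "depth"),
)


def parse_bench_args(args):
    """Fill missing positional 'bench' arguments with their defaults."""
    tokens = args.split()
    return {
        name: tokens[i] if i < len(tokens) else default
        for i, (name, default) in enumerate(BENCH_DEFAULTS)
    }


print(parse_bench_args(""))         # plain "bench": every default applies
print(parse_bench_args("16 1 20"))  # depth 20, remaining arguments defaulted
```

Because the arguments are positional, raising the depth to 20 means spelling out the hash size and thread count that precede it.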
