diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 50a3f3b9e0d..ed9bdc9d21f 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -51,8 +51,8 @@ For more information about design and development practices for each CCCL compon - [CUB Developer Guide](docs/cub/developer_overview.rst) - General overview of the design of CUB internals - [CUB Test Overview](docs/cub/test_overview.rst) - Overview of how to write CUB unit tests -- [CUB Tuning Infrastructure](docs/cub/tuning.rst) - Overview of CUB's performance tuning infrastructure - [CUB Benchmarks](docs/cub/benchmarking.rst) - Overview of CUB's performance benchmarks +- [CUB Tuning Infrastructure](docs/cub/tuning.rst) - Overview of CUB's performance tuning infrastructure #### Thrust diff --git a/cub/benchmarks/nvbench_helper/nvbench_helper/nvbench_helper.cuh b/cub/benchmarks/nvbench_helper/nvbench_helper/nvbench_helper.cuh index 081bc5aa263..e8dacb4a1ff 100644 --- a/cub/benchmarks/nvbench_helper/nvbench_helper/nvbench_helper.cuh +++ b/cub/benchmarks/nvbench_helper/nvbench_helper/nvbench_helper.cuh @@ -61,6 +61,7 @@ using integral_types = nvbench::type_list; using fundamental_types = nvbench::type_list; using all_types = nvbench::type_list; #else +// keep those lists in sync with the documentation in tuning.rst using integral_types = nvbench::type_list; using fundamental_types = diff --git a/docs/cub/benchmarking.rst b/docs/cub/benchmarking.rst index dbd22d84209..6d0603e49cd 100644 --- a/docs/cub/benchmarking.rst +++ b/docs/cub/benchmarking.rst @@ -1,34 +1,200 @@ CUB Benchmarks ************************************* -This file contains instructions on how to run all CUB benchmarks using CUB tuning infrastructure. +.. TODO(bgruber): this guide applies to Thrust as well. We should rename it to "CCCL Benchmarks" and move it out of CUB + +CUB comes with a set of `NVBench `_-based benchmarks for its algorithms, +which can be used to measure the performance of CUB on your system on a variety of workloads. 
+The integration with NVBench makes it possible to archive and compare benchmark results,
+which is useful for continuous performance testing, detecting regressions, tuning, and optimization.
+This guide gives an introduction to CUB's benchmarking infrastructure.
+
+Building benchmarks
+--------------------------------------------------------------------------------
+
+CUB benchmarks are built as part of the CCCL CMake infrastructure.
+Starting from scratch:
 
 .. code-block:: bash
 
-    pip3 install --user fpzip pandas scipy
     git clone https://github.com/NVIDIA/cccl.git
-    cmake -B build -DCCCL_ENABLE_THRUST=OFF\
-    -DCCCL_ENABLE_LIBCUDACXX=OFF\
-    -DCCCL_ENABLE_CUB=ON\
-    -DCCCL_ENABLE_BENCHMARKS=YES\
-    -DCUB_ENABLE_DIALECT_CPP11=OFF\
-    -DCUB_ENABLE_DIALECT_CPP14=OFF\
-    -DCUB_ENABLE_DIALECT_CPP17=ON\
-    -DCUB_ENABLE_DIALECT_CPP20=OFF\
-    -DCUB_ENABLE_RDC_TESTS=OFF\
-    -DCUB_ENABLE_TUNING=YES\
-    -DCMAKE_BUILD_TYPE=Release\
-    -DCMAKE_CUDA_ARCHITECTURES="89;90"
+    cd cccl
+    mkdir build
     cd build
-    ../cub/benchmarks/scripts/run.py
+    cmake ..\
+      -GNinja\
+      -DCCCL_ENABLE_BENCHMARKS=YES\
+      -DCCCL_ENABLE_CUB=YES\
+      -DCCCL_ENABLE_THRUST=NO\
+      -DCCCL_ENABLE_LIBCUDACXX=NO\
+      -DCUB_ENABLE_RDC_TESTS=NO\
+      -DCMAKE_BUILD_TYPE=Release\
+      -DCMAKE_CUDA_ARCHITECTURES=90 # TODO: Set your GPU architecture
+
+These commands clone the repository, create a build directory, and configure the build with CMake.
+It's important that you enable the benchmarks (`CCCL_ENABLE_BENCHMARKS=YES`),
+build in Release mode (`CMAKE_BUILD_TYPE=Release`),
+and set the GPU architecture to match your system (`CMAKE_CUDA_ARCHITECTURES=XX`).
+This _
+contains a great table listing the architectures for different brands of GPUs.
+.. TODO(bgruber): do we have a public NVIDIA maintained table I can link here instead?
+We use Ninja as the CMake generator in this guide, but you can use any other generator you prefer.
+
+You can then proceed to build the benchmarks. 
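If you are unsure which value to pass for `CMAKE_CUDA_ARCHITECTURES`, recent drivers can report the compute capability of the installed GPU directly. The following sketch is an illustration only: it assumes `nvidia-smi` supports the `compute_cap` query field (available in recent drivers), and the `9.0` fallback is just a placeholder for machines without a visible GPU.

```shell
# Query the GPU's compute capability, e.g. "9.0" on H100 (assumes a recent
# nvidia-smi that supports the compute_cap query field).
cc=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null | head -n1)
cc=${cc:-9.0}                     # placeholder fallback when no GPU is visible
arch=$(echo "$cc" | tr -d '. ')   # "9.0" -> "90", the form CMake expects
echo "-DCMAKE_CUDA_ARCHITECTURES=${arch}"
```

The printed flag can then be passed to the `cmake` invocation above instead of the hard-coded `90`.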
+
+If you intend to build only selected benchmarks, you can list the available CMake build targets with:
+
+.. code-block:: bash
+
+    ninja -t targets | grep '\.bench\.'
+    cub.bench.adjacent_difference.subtract_left.base: phony
+    cub.bench.copy.memcpy.base: phony
+    ...
+    cub.bench.transform.babelstream3.base: phony
+    cub.bench.transform_reduce.sum.base: phony
+
+We also provide a target to build all benchmarks:
+
+.. code-block:: bash
+
+    ninja cub.all.benches
+
+
+Running a benchmark
+--------------------------------------------------------------------------------
+
+After building a benchmark, we can run it as follows:
+
+.. code-block:: bash
+
+    ./bin/cub.bench.adjacent_difference.subtract_left.base\
+      -d 0\
+      --stopping-criterion entropy\
+      --json base.json\
+      --md base.md
+
+In this command, `-d 0` indicates that we want to run on GPU 0 of our system.
+Setting `--stopping-criterion entropy` is advisable since it reduces runtime
+and increases confidence in the resulting data.
+It is not the default yet, because NVBench is still evaluating it.
+By default, NVBench prints the benchmark results to the terminal as Markdown.
+`--json base.json` additionally saves the detailed results in a JSON file for later use.
+`--md base.md` additionally saves the Markdown output to a file,
+so you can easily view the results later without having to parse the JSON.
+More information on the available command line options can be found in the
+`NVBench documentation `__.
+
+The expected terminal output is something along the following lines (also saved to `base.md`),
+shortened for brevity:
+
+.. code-block:: bash
+
+    # Log
+    Run:  [1/8] base [Device=0 T{ct}=I32 OffsetT{ct}=I32 Elements{io}=2^16]
+    Pass: Cold: 0.004571ms GPU, 0.009322ms CPU, 0.00s total GPU, 0.01s total wall, 334x
+    Run:  [2/8] base [Device=0 T{ct}=I32 OffsetT{ct}=I32 Elements{io}=2^20]
+    Pass: Cold: 0.015161ms GPU, 0.023367ms CPU, 0.01s total GPU, 0.02s total wall, 430x
+    ... 
+    # Benchmark Results
+    | T{ct} | OffsetT{ct} | Elements{io}     | Samples | CPU Time   | Noise   | GPU Time   | Noise  | Elem/s  | GlobalMem BW | BWUtil |
+    |-------|-------------|------------------|---------|------------|---------|------------|--------|---------|--------------|--------|
+    | I32   | I32         | 2^16 = 65536     | 334x    | 9.322 us   | 104.44% | 4.571 us   | 10.87% | 14.337G | 114.696 GB/s | 14.93% |
+    | I32   | I32         | 2^20 = 1048576   | 430x    | 23.367 us  | 327.68% | 15.161 us  | 3.47%  | 69.161G | 553.285 GB/s | 72.03% |
+    ...
+
+If you are only interested in a subset of workloads, you can restrict benchmarking as follows:
+
+.. code-block:: bash
+
+    ./bin/cub.bench.adjacent_difference.subtract_left.base ...\
+      -a 'T{ct}=I32'\
+      -a 'OffsetT{ct}=I32'\
+      -a 'Elements{io}[pow2]=[24,28]'
+
+The `-a` option allows you to restrict the values for each axis available in the benchmark.
+See the `NVBench documentation `__
+for more information on how to specify the axis values.
+If the specified axis does not exist, the benchmark terminates with an error.
+
+
+Comparing benchmark results
+--------------------------------------------------------------------------------
+
+Let's say you have a modification that you'd like to benchmark.
+To compare the performance, you first build and run the benchmark as described above for the unmodified code,
+saving the results to a JSON file, e.g. `base.json`.
+Then, you apply your code changes (e.g., switch to a different branch, git stash pop, apply a patch file, etc.),
+rebuild, and rerun the benchmark, saving the results to a different JSON file, e.g. `new.json`.
+
+You can now compare the two result JSON files as follows, assuming you are still in your build directory:
+
+.. code-block:: bash
+
+    PYTHONPATH=./_deps/nvbench-src/scripts ./_deps/nvbench-src/scripts/nvbench_compare.py base.json new.json
+
+The `PYTHONPATH` environment variable may not be necessary in all cases. 
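For intuition, the PASS/FAIL decision can be thought of as checking whether the relative runtime difference exceeds the measurement noise. The sketch below only illustrates that idea with the numbers from the first row of the example comparison table; the threshold choice (the smaller of the two noise values) is an assumption that happens to match the example, not necessarily the exact criterion the comparison script implements.

```shell
# Values from the first row of the example comparison (I32, 2^16):
ref_time=4.571; ref_noise=10.87    # base run: GPU time (us) and noise (%)
cmp_time=4.096; cmp_noise=0.00     # new run
status_line=$(awk -v rt="$ref_time" -v rn="$ref_noise" \
                  -v ct="$cmp_time" -v cn="$cmp_noise" 'BEGIN {
    diff   = (ct - rt) / rt * 100.0            # the %Diff column
    adiff  = diff < 0 ? -diff : diff
    thresh = rn < cn ? rn : cn                 # assumed noise threshold
    printf "%%Diff = %.2f%%  Status = %s", diff, (adiff > thresh ? "FAIL" : "PASS")
}')
echo "$status_line"
# -> %Diff = -10.39%  Status = FAIL
```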
+The script prints a Markdown report showing the runtime differences between each variant of the two benchmark runs.
+This could look like the following, again shortened for brevity:
+
+.. code-block:: bash
+
+    | T{ct} | OffsetT{ct} | Elements{io} | Ref Time  | Ref Noise | Cmp Time  | Cmp Noise | Diff      | %Diff   | Status |
+    |-------|-------------|--------------|-----------|-----------|-----------|-----------|-----------|---------|--------|
+    | I32   | I32         | 2^16         | 4.571 us  | 10.87%    | 4.096 us  | 0.00%     | -0.475 us | -10.39% | FAIL   |
+    | I32   | I32         | 2^20         | 15.161 us | 3.47%     | 15.143 us | 3.55%     | -0.018 us | -0.12%  | PASS   |
+    ...
+
+In addition to showing the absolute and relative runtime difference,
+NVBench reports the noise of the measurements,
+which corresponds to the relative standard deviation.
+The `Status` column then reports, with statistical significance,
+how the runtime changed from the base to the new version.
+
+
+Running all benchmarks directly from the command line
+--------------------------------------------------------------------------------
+
+To get a full snapshot of CUB's performance, you can run all benchmarks and save the results.
+For example:
+
+.. code-block:: bash
+
+    ninja cub.all.benches
+    benchmarks=$(ls bin | grep cub.bench); n=$(echo $benchmarks | wc -w); i=1; \
+    for b in $benchmarks; do \
+      echo "=== Running $b ($i/$n) ==="; \
+      ./bin/$b -d 0 --stopping-criterion entropy --json $b.json --md $b.md; \
+      ((i++)); \
+    done
+
+This will generate one JSON and one Markdown file for each benchmark.
+You can archive those files for later comparison or analysis.
+
+
+Running all benchmarks via tuning scripts (alternative)
+--------------------------------------------------------------------------------
+
+The benchmark suite can also be run using the :ref:`tuning <cub-tuning>` infrastructure.
+The tuning infrastructure handles building the benchmarks itself, because it records the build times. 
+Therefore, it's critical that you run it in a clean build directory without any build artifacts.
+A freshly configured build directory (running CMake only) is enough.
+Alternatively, you can clean an existing build directory with `ninja clean`.
+Furthermore, the tuning scripts require additional Python dependencies, which you have to install:
-Expected output for the command above is:
+.. code-block:: bash
+
+    ninja clean
+    pip install --user fpzip pandas scipy
+
+We can then run the full benchmark suite from the build directory with:
+
+.. code-block:: bash
+
+    ../benchmarks/scripts/run.py
+
+You can expect the output to look like this:
 
 .. code-block:: bash
 
-    ../cub/benchmarks/scripts/run.py
     &&&& RUNNING bench
     ctk: 12.2.140
     cub: 812ba98d1
@@ -38,15 +204,62 @@ Expected output for the command above is:
     &&&& PERF cub_bench_adjacent_difference_subtract_left_base_T_ct__I32___OffsetT_ct__I32___Elements_io__pow2__28 0.002673664130270481 -sec
     ...
+The tuning infrastructure will build and execute all benchmarks and their variants one after another,
+reporting the time in seconds it took to execute each benchmark executable.
 
 It's also possible to benchmark a subset of algorithms and workloads:
 
 .. 
code-block:: bash
 
-    ../cub/benchmarks/scripts/run.py -R '.*scan.exclusive.sum.*' -a 'Elements{io}[pow2]=[24,28]' -a 'T{ct}=I32'
+    ../benchmarks/scripts/run.py -R '.*scan.exclusive.sum.*' -a 'Elements{io}[pow2]=[24,28]' -a 'T{ct}=I32'
     &&&& RUNNING bench
-    ctk: 12.2.140
-    cub: 812ba98d1
-    &&&& PERF cub_bench_scan_exclusive_sum_base_T_ct__I32___OffsetT_ct__I32___Elements_io__pow2__24 0.00016899200272746384 -sec
-    &&&& PERF cub_bench_scan_exclusive_sum_base_T_ct__I32___OffsetT_ct__I32___Elements_io__pow2__28 0.002696000039577484 -sec
+    ctk: 12.6.77
+    cccl: v2.7.0-rc0-265-g32aa6aa5a
+    &&&& PERF cub_bench_scan_exclusive_sum_base_T_ct__I32___OffsetT_ct__U32___Elements_io__pow2__28 0.003194367978721857 -sec
+    &&&& PERF cub_bench_scan_exclusive_sum_base_T_ct__I32___OffsetT_ct__U64___Elements_io__pow2__28 0.00319383991882205 -sec
     &&&& PASSED bench
+
+
+The `-R` option allows you to specify a regular expression for selecting benchmarks.
+The `-a` option restricts the values for an axis across all benchmarks.
+See the `NVBench documentation `__
+for more information on how to specify the axis values.
+In contrast to running a benchmark directly,
+the tuning infrastructure will just ignore an axis value that a benchmark does not support,
+run the benchmark regardless, and continue.
+
+The tuning infrastructure stores results in an SQLite database called `cccl_meta_bench.db` in the build directory.
+This database persists across tuning runs.
+If you interrupt the benchmark script and then launch it again, only the missing benchmark variants will be run.
+The resulting database contains all samples, which can be extracted into JSON files:
+
+.. code-block:: bash
+
+    ../benchmarks/scripts/analyze.py -o ./cccl_meta_bench.db
+
+This will create a JSON file for each benchmark variant next to the database.
+For example:
+
.. 
code-block:: bash + + cat cub_bench_scan_exclusive_sum_base_T_ct__I32___OffsetT_ct__U32___Elements_io__pow2__28.json + [ + { + "variant": "base ()", + "elapsed": 2.6299014091, + "center": 0.003194368, + "bw": 0.8754671386, + "samples": [ + 0.003152896, + 0.0031549439, + ... + ], + "Elements{io}[pow2]": "28", + "base_samples": [ + 0.003152896, + 0.0031549439, + ... + ], + "speedup": 1 + } + ] diff --git a/docs/cub/index.rst b/docs/cub/index.rst index 21e42d81cc3..da59a9b8ec0 100644 --- a/docs/cub/index.rst +++ b/docs/cub/index.rst @@ -10,8 +10,8 @@ CUB modules developer_overview test_overview - tuning benchmarking + tuning ${repo_docs_api_path}/cub_api .. the line below can be used to use the README.md file as the index page diff --git a/docs/cub/tuning.rst b/docs/cub/tuning.rst index c1ebe1864f5..184dc57900a 100644 --- a/docs/cub/tuning.rst +++ b/docs/cub/tuning.rst @@ -1,3 +1,5 @@ +.. _cub-tuning: + CUB Tuning Infrastructure ================================================================================ @@ -168,9 +170,29 @@ construct: #endif -This logic is automatically applied to :code:`all_types`, :code:`offset_types`, and -:code:`fundamental_types` lists when you use matching names for the axes. You can define -your own axis names and use the logic above for them (see sort pairs example). +This logic is already implemented if you use any of the following predefined type lists: + +.. 
list-table:: Predefined type lists
+   :header-rows: 1
+
+   * - Axis name
+     - C++ identifier
+     - Included types
+   * - :code:`T{ct}`
+     - :code:`integral_types`
+     - :code:`int8_t, int16_t, int32_t, int64_t`
+   * - :code:`T{ct}`
+     - :code:`fundamental_types`
+     - :code:`integral_types` and :code:`int128_t, float, double`
+   * - :code:`T{ct}`
+     - :code:`all_types`
+     - :code:`fundamental_types` and :code:`complex`
+   * - :code:`OffsetT{ct}`
+     - :code:`offset_types`
+     - :code:`int32_t, int64_t`
+
+
+You are also free to define your own axis names and apply the logic above to them (see the sort pairs example).
 
 Search Process