
Extend CUB benchmarking documentation #2831

Merged: 11 commits, Nov 15, 2024
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -51,8 +51,8 @@ For more information about design and development practices for each CCCL component

- [CUB Developer Guide](docs/cub/developer_overview.rst) - General overview of the design of CUB internals
- [CUB Test Overview](docs/cub/test_overview.rst) - Overview of how to write CUB unit tests
- [CUB Benchmarks](docs/cub/benchmarking.rst) - Overview of CUB's performance benchmarks
- [CUB Tuning Infrastructure](docs/cub/tuning.rst) - Overview of CUB's performance tuning infrastructure

#### Thrust

@@ -61,6 +61,7 @@ using integral_types = nvbench::type_list<TUNE_T>;
using fundamental_types = nvbench::type_list<TUNE_T>;
using all_types = nvbench::type_list<TUNE_T>;
#else
// keep those lists in sync with the documentation in tuning.rst
using integral_types = nvbench::type_list<int8_t, int16_t, int32_t, int64_t>;

using fundamental_types =
257 changes: 235 additions & 22 deletions docs/cub/benchmarking.rst
@@ -1,34 +1,200 @@
CUB Benchmarks
*************************************

.. TODO(bgruber): this guide applies to Thrust as well. We should rename it to "CCCL Benchmarks" and move it out of CUB

CUB comes with a set of `NVBench <https://github.com/NVIDIA/nvbench>`_-based benchmarks for its algorithms,
which can be used to measure the performance of CUB on your system on a variety of workloads.
The NVBench integration allows archiving and comparing benchmark results,
which is useful for continuous performance testing, detecting regressions, tuning, and optimization.
This guide introduces CUB's benchmarking infrastructure.

Building benchmarks
--------------------------------------------------------------------------------

CUB benchmarks are built as part of the CCCL CMake infrastructure.
Starting from scratch:

.. code-block:: bash

    git clone https://github.com/NVIDIA/cccl.git
    cd cccl
    mkdir build
    cd build
    cmake ..\
      -GNinja\
      -DCCCL_ENABLE_BENCHMARKS=YES\
      -DCCCL_ENABLE_CUB=YES\
      -DCCCL_ENABLE_THRUST=NO\
      -DCCCL_ENABLE_LIBCUDACXX=NO\
      -DCUB_ENABLE_RDC_TESTS=NO\
      -DCMAKE_BUILD_TYPE=Release\
      -DCMAKE_CUDA_ARCHITECTURES=90 # TODO: Set your GPU architecture

You clone the repository, create a build directory, and configure the build with CMake.
It's important that you enable benchmarks (`CCCL_ENABLE_BENCHMARKS=ON`),
build in Release mode (`CMAKE_BUILD_TYPE=Release`),
and set the GPU architecture to match your system (`CMAKE_CUDA_ARCHITECTURES=XX`).
`This website <https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/>`_
contains a great table listing the architectures for different brands of GPUs.
With CMake 3.24 or newer, you can alternatively use `CMAKE_CUDA_ARCHITECTURES=native` to build for the GPUs of your machine.
.. TODO(bgruber): do we have a public NVIDIA maintained table I can link here instead?
We use Ninja as the CMake generator in this guide, but you can use any other generator you prefer.

You can then proceed to build the benchmarks.

If you only intend to build selected benchmarks, you can list the available CMake build targets with:

.. code-block:: bash

ninja -t targets | grep '\.bench\.'
cub.bench.adjacent_difference.subtract_left.base: phony
cub.bench.copy.memcpy.base: phony
...
cub.bench.transform.babelstream3.base: phony
cub.bench.transform_reduce.sum.base: phony

We also provide a target to build all benchmarks:

.. code-block:: bash

ninja cub.all.benches


Running a benchmark
--------------------------------------------------------------------------------

After building a benchmark, we can run it as follows:

.. code-block:: bash

./bin/cub.bench.adjacent_difference.subtract_left.base\
-d 0\
--stopping-criterion entropy\
--json base.json\
--md base.md

In this command, `-d 0` indicates that we want to run on GPU 0 on our system.
Setting `--stopping-criterion entropy` is advisable, since it reduces runtime
and increases confidence in the resulting data.
It's not the default yet, because NVBench is still evaluating it.
By default, NVBench will print the benchmark results to the terminal as Markdown.
`--json base.json` will save the detailed results in a JSON file as well for later use.
`--md base.md` will save the Markdown output to a file as well,
so you can easily view the results later without having to parse the JSON.
More information on what command line options are available can be found in the
`NVBench documentation <https://github.com/NVIDIA/nvbench/blob/main/docs/cli_help.md>`__.

The expected terminal output is something along the following lines (also saved to `base.md`),
shortened for brevity:

.. code-block:: bash

# Log
Run: [1/8] base [Device=0 T{ct}=I32 OffsetT{ct}=I32 Elements{io}=2^16]
Pass: Cold: 0.004571ms GPU, 0.009322ms CPU, 0.00s total GPU, 0.01s total wall, 334x
Run: [2/8] base [Device=0 T{ct}=I32 OffsetT{ct}=I32 Elements{io}=2^20]
Pass: Cold: 0.015161ms GPU, 0.023367ms CPU, 0.01s total GPU, 0.02s total wall, 430x
...
# Benchmark Results
| T{ct} | OffsetT{ct} | Elements{io} | Samples | CPU Time | Noise | GPU Time | Noise | Elem/s | GlobalMem BW | BWUtil |
|-------|-------------|------------------|---------|------------|---------|------------|--------|---------|--------------|--------|
| I32 | I32 | 2^16 = 65536 | 334x | 9.322 us | 104.44% | 4.571 us | 10.87% | 14.337G | 114.696 GB/s | 14.93% |
| I32 | I32 | 2^20 = 1048576 | 430x | 23.367 us | 327.68% | 15.161 us | 3.47% | 69.161G | 553.285 GB/s | 72.03% |
...
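As a sanity check, the derived columns in this table can be recomputed from the raw measurements. A short sketch for the 2^20-element row (assuming one 4-byte read and one 4-byte write per element for `adjacent_difference`; the exact byte counts are benchmark-specific):

```python
# Recompute the derived columns of the 2^20-element row above.
elements = 2**20          # Elements{io}
gpu_time = 15.161e-6      # GPU Time, in seconds

elem_per_sec = elements / gpu_time              # matches the 69.161G column
bytes_per_elem = 4 + 4                          # one int32 read + one int32 write
bandwidth = elements * bytes_per_elem / gpu_time

print(f"{elem_per_sec / 1e9:.2f} G elem/s")
print(f"{bandwidth / 1e9:.2f} GB/s")            # matches the 553.285 GB/s column
# BWUtil is this bandwidth divided by the device's peak memory bandwidth:
# 553.285 GB/s at 72.03% utilization implies a peak of roughly 768 GB/s.
```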

If you are only interested in a subset of workloads, you can restrict benchmarking as follows:

.. code-block:: bash

./bin/cub.bench.adjacent_difference.subtract_left.base ...\
-a 'T{ct}=I32'\
-a 'OffsetT{ct}=I32'\
-a 'Elements{io}[pow2]=[24,28]'

The `-a` option allows you to restrict the values for each axis available for the benchmark.
See the `NVBench documentation <https://github.com/NVIDIA/nvbench/blob/main/docs/cli_help_axis.md>`__
for more information on how to specify the axis values.
If the specified axis does not exist, the benchmark will terminate with an error.
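The values of a `[pow2]` axis are base-2 exponents, and `[24,28]` is a two-element value list, so the restriction above selects exactly two problem sizes:

```python
# 'Elements{io}[pow2]=[24,28]' lists two exponents, not a range:
exponents = [24, 28]
element_counts = [2**e for e in exponents]
print(element_counts)  # [16777216, 268435456]
```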


Comparing benchmark results
--------------------------------------------------------------------------------

Let's say you have a modification that you'd like to benchmark.
To compare the performance you have to build and run the benchmark as described above for the unmodified code,
saving the results to a JSON file, e.g. `base.json`.
Then, you apply your code changes (e.g., switch to a different branch, git stash pop, apply a patch file, etc.),
rebuild and rerun the benchmark, saving the results to a different JSON file, e.g. `new.json`.

Assuming you are still in your build directory, you can now compare the two result JSON files using:

.. code-block:: bash

PYTHONPATH=./_deps/nvbench-src/scripts ./_deps/nvbench-src/scripts/nvbench_compare.py base.json new.json

The `PYTHONPATH` environment variable may not be necessary in all cases.
The script will print a Markdown report showing the runtime differences between each variant of the two benchmark runs.
The output could look like this, again shortened for brevity:

.. code-block:: bash

| T{ct} | OffsetT{ct} | Elements{io} | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---------|---------------|----------------|------------|-------------|------------|-------------|------------|---------|----------|
| I32 | I32 | 2^16 | 4.571 us | 10.87% | 4.096 us | 0.00% | -0.475 us | -10.39% | FAIL |
| I32 | I32 | 2^20 | 15.161 us | 3.47% | 15.143 us | 3.55% | -0.018 us | -0.12% | PASS |
...

In addition to showing the absolute and relative runtime difference,
NVBench reports the noise of the measurements,
which corresponds to the relative standard deviation.
It then reports in the `Status` column, with statistical significance,
how the runtime changed from the base to the new version.
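As an illustration of how such a verdict can be reached, here is a simplified sketch that recomputes the `%Diff` column and derives a status from the measured noise (illustrative only; the exact statistical test in `nvbench_compare.py` may differ):

```python
import statistics

def noise_pct(samples):
    """Relative standard deviation in percent, as NVBench reports noise."""
    return statistics.stdev(samples) / statistics.mean(samples) * 100

def compare(ref_time, ref_noise, cmp_time, cmp_noise):
    """Flag a change when the relative difference exceeds the smaller noise.

    An illustrative approximation of the PASS/FAIL column above.
    """
    diff_pct = (cmp_time - ref_time) / ref_time * 100
    threshold = min(ref_noise, cmp_noise)
    status = "FAIL" if abs(diff_pct) > threshold else "PASS"
    return diff_pct, status

# The two rows from the example report (times in us, noise in %):
print(compare(4.571, 10.87, 4.096, 0.00))   # about -10.39% -> FAIL
print(compare(15.161, 3.47, 15.143, 3.55))  # about -0.12% -> PASS
```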


Running all benchmarks directly from the command line
--------------------------------------------------------------------------------

To get a full snapshot of CUB's performance, you can run all benchmarks and save the results.
For example:

.. code-block:: bash

ninja cub.all.benches
benchmarks=$(ls bin | grep cub.bench); n=$(echo $benchmarks | wc -w); i=1; \
for b in $benchmarks; do \
echo "=== Running $b ($i/$n) ==="; \
./bin/$b -d 0 --stopping-criterion entropy --json $b.json --md $b.md; \
((i++)); \
done

This will generate one JSON and one Markdown file for each benchmark.
You can archive those files for later comparison or analysis.


Running all benchmarks via tuning scripts (alternative)
--------------------------------------------------------------------------------

The benchmark suite can also be run using the :ref:`tuning <cub-tuning>` infrastructure.
The tuning infrastructure handles building benchmarks itself, because it records the build times.
Therefore, it's critical that you run it in a clean build directory without any build artifacts.
Running CMake on a fresh build directory is enough. Alternatively, you can clean an existing build directory with:

.. code-block:: bash

    ninja clean

Furthermore, the tuning scripts require some additional Python dependencies, which you have to install:

.. code-block:: bash

    pip install --user fpzip pandas scipy

We can then run the full benchmark suite from the build directory with:

.. code-block:: bash

../benchmarks/scripts/run.py

You can expect the output to look like this:

.. code-block:: bash

&&&& RUNNING bench
ctk: 12.6.77
cccl: v2.7.0-rc0-265-g32aa6aa5a
&&&& PERF cub_bench_adjacent_difference_subtract_left_base_T_ct__I32___OffsetT_ct__I32___Elements_io__pow2__28 0.002673664130270481 -sec
...

The tuning infrastructure will build and execute all benchmarks and their variants one after another,
reporting the time in seconds it took to execute each benchmark executable.

It's also possible to benchmark a subset of algorithms and workloads:

.. code-block:: bash

../benchmarks/scripts/run.py -R '.*scan.exclusive.sum.*' -a 'Elements{io}[pow2]=[24,28]' -a 'T{ct}=I32'
&&&& RUNNING bench
ctk: 12.6.77
cccl: v2.7.0-rc0-265-g32aa6aa5a
&&&& PERF cub_bench_scan_exclusive_sum_base_T_ct__I32___OffsetT_ct__U32___Elements_io__pow2__28 0.003194367978721857 -sec
&&&& PERF cub_bench_scan_exclusive_sum_base_T_ct__I32___OffsetT_ct__U64___Elements_io__pow2__28 0.00319383991882205 -sec
&&&& PASSED bench


The `-R` option allows you to specify a regular expression for selecting benchmarks.
The `-a` option restricts the values for an axis across all benchmarks.
See the `NVBench documentation <https://github.com/NVIDIA/nvbench/blob/main/docs/cli_help_axis.md>`__
for more information on how to specify the axis values.
Contrary to running a benchmark directly,
the tuning infrastructure will just ignore an axis value if a benchmark does not support it,
run the benchmark regardless, and continue.

The tuning infrastructure stores results in an SQLite database called `cccl_meta_bench.db` in the build directory.
This database persists across tuning runs.
If you interrupt the benchmark script and then launch it again, only missing benchmark variants will be run.
The resulting database contains all samples, which can be extracted into JSON files:

.. code-block:: bash

../benchmarks/scripts/analyze.py -o ./cccl_meta_bench.db

This will create a JSON file for each benchmark variant next to the database.
For example:

.. code-block:: bash

cat cub_bench_scan_exclusive_sum_base_T_ct__I32___OffsetT_ct__U32___Elements_io__pow2__28.json
[
{
"variant": "base ()",
"elapsed": 2.6299014091,
"center": 0.003194368,
"bw": 0.8754671386,
"samples": [
0.003152896,
0.0031549439,
...
],
"Elements{io}[pow2]": "28",
"base_samples": [
0.003152896,
0.0031549439,
...
],
"speedup": 1
}
]
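Since these files are plain JSON, they are straightforward to post-process. A small sketch (the `summarize` helper is hypothetical, not part of the tuning scripts; it relies only on the `variant`, `center`, and `samples` fields shown above):

```python
import json
import statistics

def summarize(path):
    """Return one line per variant with its center time and sample noise."""
    with open(path) as f:
        records = json.load(f)  # the file is a list of result records
    lines = []
    for r in records:
        s = r["samples"]
        # noise as relative standard deviation in percent:
        noise = statistics.stdev(s) / statistics.mean(s) * 100 if len(s) > 1 else 0.0
        lines.append(f"{r['variant']}: center={r['center']:.9f}s noise={noise:.2f}%")
    return lines

if __name__ == "__main__":
    import sys
    print("\n".join(summarize(sys.argv[1])))
```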
2 changes: 1 addition & 1 deletion docs/cub/index.rst
@@ -10,8 +10,8 @@ CUB
modules
developer_overview
test_overview
benchmarking
tuning
${repo_docs_api_path}/cub_api

.. the line below can be used to use the README.md file as the index page
28 changes: 25 additions & 3 deletions docs/cub/tuning.rst
@@ -1,3 +1,5 @@
.. _cub-tuning:

CUB Tuning Infrastructure
================================================================================

@@ -168,9 +170,29 @@ construct:
#endif


This logic is already implemented if you use any of the following predefined type lists:

.. list-table:: Predefined type lists
:header-rows: 1

* - Axis name
- C++ identifier
- Included types
* - :code:`T{ct}`
- :code:`integral_types`
- :code:`int8_t, int16_t, int32_t, int64_t`
* - :code:`T{ct}`
- :code:`fundamental_types`
- :code:`integral_types` and :code:`int128_t, float, double`
* - :code:`T{ct}`
- :code:`all_types`
- :code:`fundamental_types` and :code:`complex`
* - :code:`OffsetT{ct}`
- :code:`offset_types`
- :code:`int32_t, int64_t`


But you are free to define your own axis names and use the logic above for them (see sort pairs example).
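In summary, the predefined lists nest inside one another. An illustrative Python mirror of the table (names only, not actual CUB code):

```python
# Mirrors the predefined type-list table above (names only, for illustration).
integral_types = ["int8_t", "int16_t", "int32_t", "int64_t"]
fundamental_types = integral_types + ["int128_t", "float", "double"]
all_types = fundamental_types + ["complex"]
offset_types = ["int32_t", "int64_t"]

print(len(integral_types), len(fundamental_types), len(all_types))  # 4 7 8
```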


Search Process