Commit 1dd5bc7

Extend CUB benchmarking documentation (#2831)

* Document predefined benchmark typelists
* Show benchmark guide before tuning guide
* Extend CUB benchmark guide
* Rework and extend tuning section

1 parent bb1c7e7 commit 1dd5bc7
File tree

5 files changed: +263 -27 lines changed

CONTRIBUTING.md (+1 -1)

@@ -51,8 +51,8 @@ For more information about design and development practices for each CCCL compon
 - [CUB Developer Guide](docs/cub/developer_overview.rst) - General overview of the design of CUB internals
 - [CUB Test Overview](docs/cub/test_overview.rst) - Overview of how to write CUB unit tests
-- [CUB Tuning Infrastructure](docs/cub/tuning.rst) - Overview of CUB's performance tuning infrastructure
 - [CUB Benchmarks](docs/cub/benchmarking.rst) - Overview of CUB's performance benchmarks
+- [CUB Tuning Infrastructure](docs/cub/tuning.rst) - Overview of CUB's performance tuning infrastructure
 
 #### Thrust
cub/benchmarks/nvbench_helper/nvbench_helper/nvbench_helper.cuh (+1)

@@ -61,6 +61,7 @@ using integral_types = nvbench::type_list<TUNE_T>;
 using fundamental_types = nvbench::type_list<TUNE_T>;
 using all_types = nvbench::type_list<TUNE_T>;
 #else
+// keep those lists in sync with the documentation in tuning.rst
 using integral_types = nvbench::type_list<int8_t, int16_t, int32_t, int64_t>;
 
 using fundamental_types =

docs/cub/benchmarking.rst (+235 -22)

@@ -1,34 +1,200 @@
 CUB Benchmarks
 *************************************
 
-This file contains instructions on how to run all CUB benchmarks using CUB tuning infrastructure.
+.. TODO(bgruber): this guide applies to Thrust as well. We should rename it to "CCCL Benchmarks" and move it out of CUB
+
+CUB comes with a set of `NVBench <https://github.com/NVIDIA/nvbench>`_-based benchmarks for its algorithms,
+which can be used to measure the performance of CUB on your system on a variety of workloads.
+The integration with NVBench makes it possible to archive and compare benchmark results,
+which is useful for continuous performance testing, detecting regressions, tuning, and optimization.
+This guide gives an introduction to CUB's benchmarking infrastructure.
+
+Building benchmarks
+--------------------------------------------------------------------------------
+
+CUB benchmarks are built as part of the CCCL CMake infrastructure.
+Starting from scratch:
 
 .. code-block:: bash
 
-    pip3 install --user fpzip pandas scipy
     git clone https://github.com/NVIDIA/cccl.git
-    cmake -B build -DCCCL_ENABLE_THRUST=OFF \
-      -DCCCL_ENABLE_LIBCUDACXX=OFF \
-      -DCCCL_ENABLE_CUB=ON \
-      -DCCCL_ENABLE_BENCHMARKS=YES \
-      -DCUB_ENABLE_DIALECT_CPP11=OFF \
-      -DCUB_ENABLE_DIALECT_CPP14=OFF \
-      -DCUB_ENABLE_DIALECT_CPP17=ON \
-      -DCUB_ENABLE_DIALECT_CPP20=OFF \
-      -DCUB_ENABLE_RDC_TESTS=OFF \
-      -DCUB_ENABLE_TUNING=YES \
-      -DCMAKE_BUILD_TYPE=Release \
-      -DCMAKE_CUDA_ARCHITECTURES="89;90"
+    cd cccl
+    mkdir build
     cd build
-    ../cub/benchmarks/scripts/run.py
+    cmake .. \
+      -GNinja \
+      -DCCCL_ENABLE_BENCHMARKS=YES \
+      -DCCCL_ENABLE_CUB=YES \
+      -DCCCL_ENABLE_THRUST=NO \
+      -DCCCL_ENABLE_LIBCUDACXX=NO \
+      -DCUB_ENABLE_RDC_TESTS=NO \
+      -DCMAKE_BUILD_TYPE=Release \
+      -DCMAKE_CUDA_ARCHITECTURES=90 # TODO: Set your GPU architecture
+
+You clone the repository, create a build directory, and configure the build with CMake.
+It's important that you enable benchmarks (`CCCL_ENABLE_BENCHMARKS=YES`),
+build in Release mode (`CMAKE_BUILD_TYPE=Release`),
+and set the GPU architecture to match your system (`CMAKE_CUDA_ARCHITECTURES=XX`).
+This `website <https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/>`_
+contains a great table listing the architectures for different brands of GPUs.
+.. TODO(bgruber): do we have a public NVIDIA maintained table I can link here instead?
+We use Ninja as the CMake generator in this guide, but you can use any other generator you prefer.
+
+You can then proceed to build the benchmarks.
+
+If you only intend to build selected benchmarks, you can list the available CMake build targets with:
+
+.. code-block:: bash
+
+    ninja -t targets | grep '\.bench\.'
+    cub.bench.adjacent_difference.subtract_left.base: phony
+    cub.bench.copy.memcpy.base: phony
+    ...
+    cub.bench.transform.babelstream3.base: phony
+    cub.bench.transform_reduce.sum.base: phony
+
+We also provide a target to build all benchmarks:
+
+.. code-block:: bash
+
+    ninja cub.all.benches
+
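Instead of building everything, you can pass one of the listed targets to `ninja` directly. The snippet below sketches how to filter the target listing; the target list is a small hard-coded stand-in (the `radix_sort` name is hypothetical), and in a real build you would pipe `ninja -t targets` into the filter instead:

```shell
# Stand-in for the output of `ninja -t targets | grep '\.bench\.'`
# (the radix_sort entry is a hypothetical example name):
targets='cub.bench.adjacent_difference.subtract_left.base: phony
cub.bench.copy.memcpy.base: phony
cub.bench.radix_sort.keys.base: phony'

# Keep only the radix_sort benchmarks and strip the ": phony" suffix:
selected=$(printf '%s\n' "$targets" | grep 'radix_sort' | cut -d: -f1)
echo "$selected"
# The selected target name could then be built with: ninja $selected
```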
+Running a benchmark
+--------------------------------------------------------------------------------
+
+After building a benchmark, we can run it as follows:
+
+.. code-block:: bash
+
+    ./bin/cub.bench.adjacent_difference.subtract_left.base \
+      -d 0 \
+      --stopping-criterion entropy \
+      --json base.json \
+      --md base.md
+
+In this command, `-d 0` indicates that we want to run on GPU 0 on our system.
+Setting `--stopping-criterion entropy` is advisable since it reduces runtime
+and increases confidence in the resulting data.
+It's not the default yet, because NVBench is still evaluating it.
+By default, NVBench will print the benchmark results to the terminal as Markdown.
+`--json base.json` will save the detailed results in a JSON file as well for later use.
+`--md base.md` will save the Markdown output to a file as well,
+so you can easily view the results later without having to parse the JSON.
+More information on the available command line options can be found in the
+`NVBench documentation <https://github.com/NVIDIA/nvbench/blob/main/docs/cli_help.md>`__.
+
+The expected terminal output is something along the following lines (also saved to `base.md`),
+shortened for brevity:
+
+.. code-block:: bash
+
+    # Log
+    Run:  [1/8] base [Device=0 T{ct}=I32 OffsetT{ct}=I32 Elements{io}=2^16]
+    Pass: Cold: 0.004571ms GPU, 0.009322ms CPU, 0.00s total GPU, 0.01s total wall, 334x
+    Run:  [2/8] base [Device=0 T{ct}=I32 OffsetT{ct}=I32 Elements{io}=2^20]
+    Pass: Cold: 0.015161ms GPU, 0.023367ms CPU, 0.01s total GPU, 0.02s total wall, 430x
+    ...
+    # Benchmark Results
+    | T{ct} | OffsetT{ct} | Elements{io}   | Samples | CPU Time  | Noise   | GPU Time  | Noise  | Elem/s  | GlobalMem BW | BWUtil |
+    |-------|-------------|----------------|---------|-----------|---------|-----------|--------|---------|--------------|--------|
+    | I32   | I32         | 2^16 = 65536   | 334x    | 9.322 us  | 104.44% | 4.571 us  | 10.87% | 14.337G | 114.696 GB/s | 14.93% |
+    | I32   | I32         | 2^20 = 1048576 | 430x    | 23.367 us | 327.68% | 15.161 us | 3.47%  | 69.161G | 553.285 GB/s | 72.03% |
+    ...
+
+If you are only interested in a subset of workloads, you can restrict benchmarking as follows:
+
+.. code-block:: bash
+
+    ./bin/cub.bench.adjacent_difference.subtract_left.base ... \
+      -a 'T{ct}=I32' \
+      -a 'OffsetT{ct}=I32' \
+      -a 'Elements{io}[pow2]=[24,28]'
+
+The `-a` option allows you to restrict the values for each axis available for the benchmark.
+See the `NVBench documentation <https://github.com/NVIDIA/nvbench/blob/main/docs/cli_help_axis.md>`__
+for more information on how to specify the axis values.
+If the specified axis does not exist, the benchmark will terminate with an error.
+
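The `[pow2]` axis values are exponents: `Elements{io}[pow2]=[24,28]` runs the benchmark with 2^24 and 2^28 elements. As a small illustration (plain `awk`, not part of the CUB tooling) of what those sizes mean for `int32` input:

```shell
# 'Elements{io}[pow2]=[24,28]' runs with 2^24 and 2^28 elements.
# Print the element counts and the resulting int32 input size in GiB:
for exp in 24 28; do
  awk -v e="$exp" 'BEGIN {
    n = 2^e
    printf "2^%d = %d elements, %.2f GiB of int32 input\n", e, n, n * 4 / 2^30
  }'
done
```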
+
+Comparing benchmark results
+--------------------------------------------------------------------------------
+
+Let's say you have a modification that you'd like to benchmark.
+To compare the performance, you have to build and run the benchmark as described above for the unmodified code,
+saving the results to a JSON file, e.g. `base.json`.
+Then, you apply your code changes (e.g., switch to a different branch, git stash pop, apply a patch file, etc.),
+rebuild, and rerun the benchmark, saving the results to a different JSON file, e.g. `new.json`.
+
+Assuming you are still in your build directory, you can now compare the two result JSON files using:
 
+.. code-block:: bash
+
+    PYTHONPATH=./_deps/nvbench-src/scripts ./_deps/nvbench-src/scripts/nvbench_compare.py base.json new.json
+
+The `PYTHONPATH` environment variable may not be necessary in all cases.
+The script will print a Markdown report showing the runtime differences between each variant of the two benchmark runs.
+The output could look like this, again shortened for brevity:
+
+.. code-block:: bash
+
+    | T{ct} | OffsetT{ct} | Elements{io} | Ref Time  | Ref Noise | Cmp Time  | Cmp Noise | Diff      | %Diff   | Status |
+    |-------|-------------|--------------|-----------|-----------|-----------|-----------|-----------|---------|--------|
+    | I32   | I32         | 2^16         | 4.571 us  | 10.87%    | 4.096 us  | 0.00%     | -0.475 us | -10.39% | FAIL   |
+    | I32   | I32         | 2^20         | 15.161 us | 3.47%     | 15.143 us | 3.55%     | -0.018 us | -0.12%  | PASS   |
+    ...
+
+In addition to showing the absolute and relative runtime difference,
+NVBench reports the noise of the measurements,
+which corresponds to the relative standard deviation.
+In the `Status` column, it then reports with statistical significance
+how the runtime changed from the base to the new version.
+
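The `%Diff` column is the runtime difference relative to the reference measurement. A small sketch (plain shell and `awk`, not part of NVBench) reproducing the first row of the example comparison table:

```shell
# GPU times from the first row of the example comparison report, in microseconds:
ref_us=4.571   # reference (base.json)
cmp_us=4.096   # comparison (new.json)

# Relative difference in percent, as in the %Diff column:
pct=$(awk -v r="$ref_us" -v c="$cmp_us" 'BEGIN { printf "%.2f", (c - r) / r * 100 }')
echo "%Diff = ${pct}%"   # prints: %Diff = -10.39%
```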
+
+Running all benchmarks directly from the command line
+--------------------------------------------------------------------------------
+
+To get a full snapshot of CUB's performance, you can run all benchmarks and save the results.
+For example:
+
+.. code-block:: bash
+
+    ninja cub.all.benches
+    benchmarks=$(ls bin | grep cub.bench); n=$(echo $benchmarks | wc -w); i=1; \
+    for b in $benchmarks; do \
+      echo "=== Running $b ($i/$n) ==="; \
+      ./bin/$b -d 0 --stopping-criterion entropy --json $b.json --md $b.md; \
+      ((i++)); \
+    done
+
+This will generate one JSON and one Markdown file for each benchmark.
+You can archive those files for later comparison or analysis.
+
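Since results are only useful against a known baseline, it can be handy to bundle the generated files into a dated archive. A sketch, where the two result files are placeholders created purely for illustration (in a real run the JSON/Markdown files from the loop above would already exist):

```shell
# Create placeholder result files standing in for real benchmark output:
mkdir -p results
echo '{}'        > results/cub.bench.copy.memcpy.base.json
echo '# results' > results/cub.bench.copy.memcpy.base.md

# Bundle everything into a dated archive for later comparison:
stamp=$(date +%Y-%m-%d)
tar czf "cub-bench-results-$stamp.tar.gz" -C results .
tar tzf "cub-bench-results-$stamp.tar.gz"
```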
+
+Running all benchmarks via tuning scripts (alternative)
+--------------------------------------------------------------------------------
+
+The benchmark suite can also be run using the :ref:`tuning <cub-tuning>` infrastructure.
+The tuning infrastructure handles building benchmarks itself, because it records the build times.
+Therefore, it's critical that you run it in a clean build directory without any build artifacts.
+Running CMake in a fresh build directory is enough; alternatively, you can clean an existing build directory.
+Furthermore, the tuning scripts require some additional Python dependencies, which you have to install:
 
-Expected output for the command above is:
+.. code-block:: bash
+
+    ninja clean
+    pip install --user fpzip pandas scipy
 
+We can then run the full benchmark suite from the build directory with:
+
+.. code-block:: bash
+
+    ../benchmarks/scripts/run.py
+
+You can expect the output to look like this:
 
 .. code-block:: bash
 
-    ../cub/benchmarks/scripts/run.py
     &&&& RUNNING bench
     ctk: 12.2.140
     cub: 812ba98d1
@@ -38,15 +204,62 @@ Expected output for the command above is:
     &&&& PERF cub_bench_adjacent_difference_subtract_left_base_T_ct__I32___OffsetT_ct__I32___Elements_io__pow2__28 0.002673664130270481 -sec
     ...
 
+The tuning infrastructure will build and execute all benchmarks and their variants one after another,
+reporting the time in seconds it took to execute each benchmark executable.
 
 It's also possible to benchmark a subset of algorithms and workloads:
 
 .. code-block:: bash
 
-    ../cub/benchmarks/scripts/run.py -R '.*scan.exclusive.sum.*' -a 'Elements{io}[pow2]=[24,28]' -a 'T{ct}=I32'
+    ../benchmarks/scripts/run.py -R '.*scan.exclusive.sum.*' -a 'Elements{io}[pow2]=[24,28]' -a 'T{ct}=I32'
     &&&& RUNNING bench
-    ctk: 12.2.140
-    cub: 812ba98d1
-    &&&& PERF cub_bench_scan_exclusive_sum_base_T_ct__I32___OffsetT_ct__I32___Elements_io__pow2__24 0.00016899200272746384 -sec
-    &&&& PERF cub_bench_scan_exclusive_sum_base_T_ct__I32___OffsetT_ct__I32___Elements_io__pow2__28 0.002696000039577484 -sec
+    ctk: 12.6.77
+    cccl: v2.7.0-rc0-265-g32aa6aa5a
+    &&&& PERF cub_bench_scan_exclusive_sum_base_T_ct__I32___OffsetT_ct__U32___Elements_io__pow2__28 0.003194367978721857 -sec
+    &&&& PERF cub_bench_scan_exclusive_sum_base_T_ct__I32___OffsetT_ct__U64___Elements_io__pow2__28 0.00319383991882205 -sec
     &&&& PASSED bench
+
+The `-R` option allows you to specify a regular expression for selecting benchmarks.
+The `-a` option restricts the values for an axis across all benchmarks.
+See the `NVBench documentation <https://github.com/NVIDIA/nvbench/blob/main/docs/cli_help_axis.md>`__
+for more information on how to specify the axis values.
+Contrary to running a benchmark directly,
+the tuning infrastructure will just ignore an axis value if a benchmark does not support it,
+run the benchmark regardless, and continue.
+
+The tuning infrastructure stores results in an SQLite database called `cccl_meta_bench.db` in the build directory.
+This database persists across tuning runs.
+If you interrupt the benchmark script and then launch it again, only missing benchmark variants will be run.
+The resulting database contains all samples, which can be extracted into JSON files:
+
+.. code-block:: bash
+
+    ../benchmarks/scripts/analyze.py -o ./cccl_meta_bench.db
+
+This will create a JSON file for each benchmark variant next to the database.
+For example:
+
+.. code-block:: bash
+
+    cat cub_bench_scan_exclusive_sum_base_T_ct__I32___OffsetT_ct__U32___Elements_io__pow2__28.json
+    [
+      {
+        "variant": "base ()",
+        "elapsed": 2.6299014091,
+        "center": 0.003194368,
+        "bw": 0.8754671386,
+        "samples": [
+          0.003152896,
+          0.0031549439,
+          ...
+        ],
+        "Elements{io}[pow2]": "28",
+        "base_samples": [
+          0.003152896,
+          0.0031549439,
+          ...
+        ],
+        "speedup": 1
+      }
+    ]
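The extracted JSON files can be post-processed with standard tools. As a sketch, the snippet below recreates a trimmed record with values from the example above and pulls out the `center` field (the distribution center of the samples, in seconds); a JSON-aware tool such as jq would be preferable when available:

```shell
# Recreate a trimmed result record for illustration; in a real run this is
# one of the JSON files that analyze.py extracted from cccl_meta_bench.db.
cat > result.json <<'EOF'
[
  {
    "variant": "base ()",
    "elapsed": 2.6299014091,
    "center": 0.003194368,
    "bw": 0.8754671386,
    "speedup": 1
  }
]
EOF

# Pull out the headline number without extra tooling:
center=$(grep -o '"center": [0-9.]*' result.json | cut -d' ' -f2)
echo "center = $center s"   # prints: center = 0.003194368 s
```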

docs/cub/index.rst (+1 -1)

@@ -10,8 +10,8 @@ CUB
    modules
    developer_overview
    test_overview
-   tuning
    benchmarking
+   tuning
    ${repo_docs_api_path}/cub_api
 
 .. the line below can be used to use the README.md file as the index page

docs/cub/tuning.rst (+25 -3)

@@ -1,3 +1,5 @@
+.. _cub-tuning:
+
 CUB Tuning Infrastructure
 ================================================================================
 
@@ -168,9 +170,29 @@ construct:
 #endif
 
 
-This logic is automatically applied to :code:`all_types`, :code:`offset_types`, and
-:code:`fundamental_types` lists when you use matching names for the axes. You can define
-your own axis names and use the logic above for them (see sort pairs example).
+This logic is already implemented if you use any of the following predefined type lists:
+
+.. list-table:: Predefined type lists
+   :header-rows: 1
+
+   * - Axis name
+     - C++ identifier
+     - Included types
+   * - :code:`T{ct}`
+     - :code:`integral_types`
+     - :code:`int8_t, int16_t, int32_t, int64_t`
+   * - :code:`T{ct}`
+     - :code:`fundamental_types`
+     - :code:`integral_types` and :code:`int128_t, float, double`
+   * - :code:`T{ct}`
+     - :code:`all_types`
+     - :code:`fundamental_types` and :code:`complex`
+   * - :code:`OffsetT{ct}`
+     - :code:`offset_types`
+     - :code:`int32_t, int64_t`
+
+
+But you are free to define your own axis names and use the logic above for them (see sort pairs example).
 
 
 Search Process

0 commit comments