CUB Benchmarks
*************************************

.. TODO(bgruber): this guide applies to Thrust as well. We should rename it to "CCCL Benchmarks" and move it out of CUB

CUB comes with a set of `NVBench <https://github.com/NVIDIA/nvbench>`_-based benchmarks for its algorithms,
which can be used to measure the performance of CUB on your system on a variety of workloads.
The integration with NVBench makes it possible to archive and compare benchmark results,
which is useful for continuous performance testing, detecting regressions, tuning, and optimization.
This guide gives an introduction to CUB's benchmarking infrastructure.

Building benchmarks
--------------------------------------------------------------------------------

CUB benchmarks are built as part of the CCCL CMake infrastructure.
Starting from scratch:

.. code-block:: bash

   git clone https://github.com/NVIDIA/cccl.git
   cd cccl
   mkdir build
   cd build
   cmake .. \
     -GNinja \
     -DCCCL_ENABLE_BENCHMARKS=YES \
     -DCCCL_ENABLE_CUB=YES \
     -DCCCL_ENABLE_THRUST=NO \
     -DCCCL_ENABLE_LIBCUDACXX=NO \
     -DCUB_ENABLE_RDC_TESTS=NO \
     -DCMAKE_BUILD_TYPE=Release \
     -DCMAKE_CUDA_ARCHITECTURES=90 # TODO: Set your GPU architecture

You clone the repository, create a build directory, and configure the build with CMake.
It's important that you enable benchmarks (`CCCL_ENABLE_BENCHMARKS=YES`),
build in Release mode (`CMAKE_BUILD_TYPE=Release`),
and set the GPU architecture to match your system (`CMAKE_CUDA_ARCHITECTURES=XX`).
This `website <https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/>`_
contains a great table listing the architectures for different brands of GPUs.

.. TODO(bgruber): do we have a public NVIDIA maintained table I can link here instead?

We use Ninja as the CMake generator in this guide, but you can use any other generator you prefer.

You can then proceed to build the benchmarks.

If you intend to build only selected benchmarks, you can list the available build targets with:

.. code-block:: bash

   ninja -t targets | grep '\.bench\.'
   cub.bench.adjacent_difference.subtract_left.base: phony
   cub.bench.copy.memcpy.base: phony
   ...
   cub.bench.transform.babelstream3.base: phony
   cub.bench.transform_reduce.sum.base: phony

We also provide a target to build all benchmarks:

.. code-block:: bash

   ninja cub.all.benches


Running a benchmark
--------------------------------------------------------------------------------

After we have built a benchmark, we can run it as follows:

.. code-block:: bash

   ./bin/cub.bench.adjacent_difference.subtract_left.base \
     -d 0 \
     --stopping-criterion entropy \
     --json base.json \
     --md base.md

In this command, `-d 0` indicates that we want to run on GPU 0 of our system.
Setting `--stopping-criterion entropy` is advisable since it reduces runtime
and increases confidence in the resulting data.
It's not the default yet, because NVBench is still evaluating it.
By default, NVBench prints the benchmark results to the terminal as Markdown.
`--json base.json` additionally saves the detailed results to a JSON file for later use.
`--md base.md` saves the Markdown output to a file as well,
so you can easily view the results later without having to parse the JSON.
More information on the available command line options can be found in the
`NVBench documentation <https://github.com/NVIDIA/nvbench/blob/main/docs/cli_help.md>`__.

The expected terminal output is something along the following lines (also saved to `base.md`),
shortened for brevity:

.. code-block:: bash

   # Log
   Run:  [1/8] base [Device=0 T{ct}=I32 OffsetT{ct}=I32 Elements{io}=2^16]
   Pass: Cold: 0.004571ms GPU, 0.009322ms CPU, 0.00s total GPU, 0.01s total wall, 334x
   Run:  [2/8] base [Device=0 T{ct}=I32 OffsetT{ct}=I32 Elements{io}=2^20]
   Pass: Cold: 0.015161ms GPU, 0.023367ms CPU, 0.01s total GPU, 0.02s total wall, 430x
   ...
   # Benchmark Results
   | T{ct} | OffsetT{ct} |  Elements{io}  | Samples | CPU Time  |  Noise  | GPU Time  | Noise  | Elem/s  | GlobalMem BW | BWUtil |
   |-------|-------------|----------------|---------|-----------|---------|-----------|--------|---------|--------------|--------|
   |   I32 |         I32 |   2^16 = 65536 |    334x |  9.322 us | 104.44% |  4.571 us | 10.87% | 14.337G | 114.696 GB/s | 14.93% |
   |   I32 |         I32 | 2^20 = 1048576 |    430x | 23.367 us | 327.68% | 15.161 us |  3.47% | 69.161G | 553.285 GB/s | 72.03% |
   ...

If you are only interested in a subset of workloads, you can restrict benchmarking as follows:

.. code-block:: bash

   ./bin/cub.bench.adjacent_difference.subtract_left.base ... \
     -a 'T{ct}=I32' \
     -a 'OffsetT{ct}=I32' \
     -a 'Elements{io}[pow2]=[24,28]'

The `-a` option allows you to restrict the values for each axis available in the benchmark.
See the `NVBench documentation <https://github.com/NVIDIA/nvbench/blob/main/docs/cli_help_axis.md>`__
for more information on how to specify the axis values.
If the specified axis does not exist, the benchmark will terminate with an error.


120
+ Comparing benchmark results
121
+ --------------------------------------------------------------------------------
122
+
123
+ Let's say you have a modification that you'd like to benchmark.
124
+ To compare the performance you have to build and run the benchmark as described above for the unmodified code,
125
+ saving the results to a JSON file, e.g. `base.json `.
126
+ Then, you apply your code changes (e.g., switch to a different branch, git stash pop, apply a patch file, etc.),
127
+ rebuild and rerun the benchmark, saving the results to a different JSON file, e.g. `new.json `.
128
+
129
+ You can now compare the two result JSON files using, assuming you are still in your build directory:

.. code-block:: bash

   PYTHONPATH=./_deps/nvbench-src/scripts ./_deps/nvbench-src/scripts/nvbench_compare.py base.json new.json

The `PYTHONPATH` environment variable may not be necessary in all cases.
The script prints a Markdown report showing the runtime differences between each variant of the two benchmark runs.
This could look like the following, again shortened for brevity:

.. code-block:: bash

   | T{ct} | OffsetT{ct} | Elements{io} | Ref Time  | Ref Noise | Cmp Time  | Cmp Noise |   Diff    |  %Diff  | Status |
   |-------|-------------|--------------|-----------|-----------|-----------|-----------|-----------|---------|--------|
   |   I32 |         I32 |         2^16 |  4.571 us |    10.87% |  4.096 us |     0.00% | -0.475 us | -10.39% |   FAIL |
   |   I32 |         I32 |         2^20 | 15.161 us |     3.47% | 15.143 us |     3.55% | -0.018 us |  -0.12% |   PASS |
   ...

In addition to showing the absolute and relative runtime difference,
NVBench reports the noise of the measurements,
which corresponds to the relative standard deviation.
The `Status` column then reports, with statistical significance,
how the runtime changed from the base to the new version.


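The pass/fail decision can be illustrated with a small sketch. This is a simplified model, not NVBench's exact algorithm: it assumes a runtime difference counts as significant when its magnitude exceeds the smaller of the two measured noise levels.

```python
def compare_status(ref_time, ref_noise, cmp_time, cmp_noise):
    """Simplified sketch of a noise-based significance check.

    Times share a unit (e.g. microseconds); noise values are relative
    standard deviations (0.1087 means 10.87%). This illustrates the
    idea only and is not NVBench's exact decision rule.
    """
    rel_diff = (cmp_time - ref_time) / ref_time
    # Assumed threshold: the smaller of the two noise levels.
    threshold = min(ref_noise, cmp_noise)
    return "PASS" if abs(rel_diff) <= threshold else "FAIL"
```

Applied to the table above, the -10.39% difference in the first row exceeds the 0.00% comparison noise and is flagged, while the -0.12% difference in the second row is well within the noise.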
Running all benchmarks directly from the command line
--------------------------------------------------------------------------------

To get a full snapshot of CUB's performance, you can run all benchmarks and save the results.
For example:

.. code-block:: bash

   ninja cub.all.benches
   benchmarks=$(ls bin | grep cub.bench); n=$(echo $benchmarks | wc -w); i=1
   for b in $benchmarks; do
     echo "=== Running $b ($i/$n) ==="
     ./bin/$b -d 0 --stopping-criterion entropy --json $b.json --md $b.md
     ((i++))
   done

This will generate one JSON and one Markdown file for each benchmark.
You can archive those files for later comparison or analysis.


Running all benchmarks via tuning scripts (alternative)
--------------------------------------------------------------------------------

The benchmark suite can also be run using the :ref:`tuning <cub-tuning>` infrastructure.
The tuning infrastructure builds the benchmarks itself, because it records the build times.
Therefore, it's critical that you run it in a clean build directory without any build artifacts.
Configuring a fresh build directory with CMake is enough; alternatively, you can clean your existing build directory.
Furthermore, the tuning scripts require some additional Python dependencies, which you have to install:

.. code-block:: bash

   ninja clean
   pip install --user fpzip pandas scipy

We can then run the full benchmark suite from the build directory with:

.. code-block:: bash

   ../benchmarks/scripts/run.py

You can expect the output to look like this:

.. code-block:: bash

   &&&& RUNNING bench
   ctk: 12.2.140
   cub: 812ba98d1
   ...
   &&&& PERF cub_bench_adjacent_difference_subtract_left_base_T_ct__I32___OffsetT_ct__I32___Elements_io__pow2__28 0.002673664130270481 -sec
   ...

The tuning infrastructure will build and execute all benchmarks and their variants one after another,
reporting the time in seconds it took to execute each benchmark executable.

It's also possible to benchmark a subset of algorithms and workloads:

.. code-block:: bash

   ../benchmarks/scripts/run.py -R '.*scan.exclusive.sum.*' -a 'Elements{io}[pow2]=[24,28]' -a 'T{ct}=I32'
   &&&& RUNNING bench
   ctk: 12.6.77
   cccl: v2.7.0-rc0-265-g32aa6aa5a
   &&&& PERF cub_bench_scan_exclusive_sum_base_T_ct__I32___OffsetT_ct__U32___Elements_io__pow2__28 0.003194367978721857 -sec
   &&&& PERF cub_bench_scan_exclusive_sum_base_T_ct__I32___OffsetT_ct__U64___Elements_io__pow2__28 0.00319383991882205 -sec
   &&&& PASSED bench

The `-R` option allows you to specify a regular expression for selecting benchmarks.
The `-a` option restricts the values for an axis across all benchmarks.
See the `NVBench documentation <https://github.com/NVIDIA/nvbench/blob/main/docs/cli_help_axis.md>`__
for more information on how to specify the axis values.
Contrary to running a benchmark directly,
the tuning infrastructure will just ignore an axis value if a benchmark does not support it,
run the benchmark regardless, and continue.

The tuning infrastructure stores results in an SQLite database called `cccl_meta_bench.db` in the build directory.
This database persists across tuning runs.
If you interrupt the benchmark script and then launch it again, only missing benchmark variants will be run.
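
Because `cccl_meta_bench.db` is a regular SQLite file, you can also inspect it directly. The table layout is created by the tuning scripts and may change between versions, so a robust first step is to enumerate the tables; a small Python sketch:

```python
import sqlite3

def list_tables(db_path):
    # Enumerate the tables present in the results database.
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        ).fetchall()
        return [name for (name,) in rows]
    finally:
        con.close()

# Example: list_tables("cccl_meta_bench.db")
```
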
The resulting database contains all samples, which can be extracted into JSON files:

.. code-block:: bash

   ../benchmarks/scripts/analyze.py -o ./cccl_meta_bench.db

This will create a JSON file for each benchmark variant next to the database.
For example:

.. code-block:: bash

   cat cub_bench_scan_exclusive_sum_base_T_ct__I32___OffsetT_ct__U32___Elements_io__pow2__28.json
   [
     {
       "variant": "base ()",
       "elapsed": 2.6299014091,
       "center": 0.003194368,
       "bw": 0.8754671386,
       "samples": [
         0.003152896,
         0.0031549439,
         ...
       ],
       "Elements{io}[pow2]": "28",
       "base_samples": [
         0.003152896,
         0.0031549439,
         ...
       ],
       "speedup": 1
     }
   ]
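
Since the extracted files are plain JSON, they are easy to post-process with standard tools. As an example, here is a small Python sketch that loads such a file and summarizes each variant (field names taken from the sample above):

```python
import json
import statistics

def summarize(path):
    # Load a per-variant JSON file produced by analyze.py and
    # compute a few summary statistics from the raw samples.
    with open(path) as f:
        variants = json.load(f)
    return {
        v["variant"]: {
            "center": v["center"],
            "median": statistics.median(v["samples"]),
            "num_samples": len(v["samples"]),
        }
        for v in variants
    }
```
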