@@ -27,9 +27,11 @@ You clone the repository, create a build directory and configure the build with
 It's important that you enable benchmarks (``CCCL_ENABLE_BENCHMARKS=ON``),
 build in Release mode (``CMAKE_BUILD_TYPE=Release``),
 and set the GPU architecture to match your system (``CMAKE_CUDA_ARCHITECTURES=XX``).
-This <website `https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/`>_
+This `website <https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/>`_
 contains a great table listing the architectures for different brands of GPUs.
+
 .. TODO(bgruber): do we have a public NVIDIA maintained table I can link here instead?
+
 We use Ninja as CMake generator in this guide, but you can use any other generator you prefer.

 You can then proceed to build the benchmarks.
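The configure step described above can be sketched as a single CMake invocation. This is only an illustration, not the guide's exact commands: the checkout location, build directory name, and the architecture value ``90`` are assumptions you should adapt to your system.

```shell
# Sketch of the configure step. Assumptions: the CCCL checkout is ./cccl
# and the target GPU is compute capability 9.0 (adjust the value of
# CMAKE_CUDA_ARCHITECTURES for your own GPU, per the table linked above).
cd cccl
mkdir -p build
cd build
cmake -G Ninja \
  -DCCCL_ENABLE_BENCHMARKS=ON \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CUDA_ARCHITECTURES=90 \
  ..
```

All three cache variables named in the text appear here; everything else (generator choice aside, which the guide fixes to Ninja) is left at CMake defaults.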
@@ -172,6 +174,10 @@ Therefore, it's critical that you run it in a clean build directory without any
 Running cmake is enough. Alternatively, you can also clean your build directory with:
 Furthermore, the tuning scripts require some additional python dependencies, which you have to install.

+To select the appropriate CUDA GPU, first identify the GPU ID by running ``nvidia-smi``, then set the
+desired GPU using ``export CUDA_VISIBLE_DEVICES=x``, where ``x`` is the ID of the GPU you want to use (e.g., ``1``).
+This ensures your application uses only the specified GPU.
+
 .. code-block:: bash

    ninja clean
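The GPU-selection step can be exercised as below. The GPU ID ``1`` is an assumed example; use whichever ID ``nvidia-smi`` reports for the device you want.

```shell
# nvidia-smi prints one row per GPU; the leftmost column is the GPU ID.
# Restrict this shell (and every process launched from it) to one GPU.
# The ID used here (1) is an assumed example.
export CUDA_VISIBLE_DEVICES=1

# CUDA applications started from this shell now enumerate only that GPU,
# and see it as device 0.
echo "$CUDA_VISIBLE_DEVICES"   # prints: 1
```

Note that ``CUDA_VISIBLE_DEVICES`` renumbers the visible devices: the selected GPU appears as device 0 inside the application, regardless of its system-wide ID.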
@@ -181,7 +187,7 @@ We can then run the full benchmark suite from the build directory with:

 .. code-block:: bash

-   ../benchmarks/scripts/run.py
+   <root_dir_to_cccl>/cccl/benchmarks/scripts/run.py

 You can expect the output to look like this:

@@ -197,13 +203,13 @@ You can expect the output to look like this:
    ...

 The tuning infrastructure will build and execute all benchmarks and their variants one after another,
-reporting the time it seconds it took to execute the benchmark executable.
+reporting the time in seconds it took to execute the benchmark executable.

 It's also possible to benchmark a subset of algorithms and workloads:

 .. code-block:: bash

-   ../benchmarks/scripts/run.py -R '.*scan.exclusive.sum.*' -a 'Elements{io}[pow2]=[24,28]' -a 'T{ct}=I32'
+   <root_dir_to_cccl>/cccl/benchmarks/scripts/run.py -R '.*scan.exclusive.sum.*' -a 'Elements{io}[pow2]=[24,28]' -a 'T{ct}=I32'
    &&&& RUNNING bench
    ctk: 12.6.77
    cccl: v2.7.0-rc0-265-g32aa6aa5a
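The ``-R`` option selects benchmarks whose names match a regular expression. As an illustration of how such a filter behaves (the benchmark names below are made up for the example, not taken from the real suite), the match works like an extended grep over the name list:

```shell
# Hypothetical benchmark names; only the one matching the regex survives.
printf '%s\n' \
  'cub.bench.scan.exclusive.sum' \
  'cub.bench.radix_sort.keys' \
  'cub.bench.reduce.sum' |
  grep -E '.*scan.exclusive.sum.*'
# prints: cub.bench.scan.exclusive.sum
```

The ``-a`` options additionally pin benchmark axes, e.g. restricting the element count to 2^24 and 2^28 and the element type to ``I32`` in the command above.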
@@ -227,7 +233,7 @@ The resulting database contains all samples, which can be extracted into JSON files

 .. code-block:: bash

-   ../benchmarks/scripts/analyze.py -o ./cccl_meta_bench.db
+   <root_dir_to_cccl>/cccl/benchmarks/scripts/analyze.py -o ./cccl_meta_bench.db

 This will create a JSON file for each benchmark variant next to the database.
 For example: