NVIDIA · gonidelis · Nov 19, 2024 · Nov 18, 2024 · Nov 18, 2024 · Nov 18, 2024
@@ -27,9 +27,11 @@ You clone the repository, create a build directory and configure the build with
 It's important that you enable benchmarks (`CCCL_ENABLE_BENCHMARKS=ON`),
 build in Release mode (`CMAKE_BUILD_TYPE=Release`),
 and set the GPU architecture to match your system (`CMAKE_CUDA_ARCHITECTURES=XX`).
-This <website `https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/`>_
+This `website <https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/>`_
 contains a great table listing the architectures for different NVIDIA GPU models.
+
 .. TODO(bgruber): do we have a public NVIDIA maintained table I can link here instead?
+
 We use Ninja as CMake generator in this guide, but you can use any other generator you prefer.
 
 You can then proceed to build the benchmarks.
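
As a complement to the lookup table mentioned in this hunk, the architecture value can often be derived from the GPU itself. The following is a minimal sketch, not part of the change under review: `cap_to_arch` is a hypothetical helper name, and the `compute_cap` query field is assumed to be supported by the installed driver's `nvidia-smi`.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of CCCL): turn a compute capability such as
# "8.9" into the form expected by CMAKE_CUDA_ARCHITECTURES ("89").
cap_to_arch() {
    local cap="$1"
    # Remove every dot from the capability string.
    printf '%s\n' "${cap//./}"
}

# Query the first GPU only when nvidia-smi is available.
# Assumption: the driver is recent enough to support the compute_cap field.
if command -v nvidia-smi >/dev/null 2>&1; then
    cap="$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null | head -n 1)"
    if [ -n "$cap" ]; then
        echo "CMAKE_CUDA_ARCHITECTURES=$(cap_to_arch "$cap")"
    fi
fi
```

With CMake 3.24 or newer, passing `-DCMAKE_CUDA_ARCHITECTURES=native` is another option that lets CMake detect the architecture of the GPUs present at configure time.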
@@ -172,6 +174,10 @@ Therefore, it's critical that you run it in a clean build directory without any
 Running cmake is enough. Alternatively, you can clean your build directory with `ninja clean`.
 Furthermore, the tuning scripts require some additional python dependencies, which you have to install.
 
+To run the benchmarks on a specific GPU, first find its ID by running `nvidia-smi`, then select it with
+`export CUDA_VISIBLE_DEVICES=x`, where `x` is the ID of the GPU you want to use (e.g., `1`).
+This ensures that the benchmarks use only the specified GPU.
+
 .. code-block:: bash
 
     ninja clean
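
The GPU selection step added in this hunk could look like the sketch below. It is illustrative only: the ID `1` is an example value, and setting `CUDA_DEVICE_ORDER=PCI_BUS_ID` is an extra precaution not mentioned in the text above. By default the CUDA runtime orders devices fastest first, which may differ from the PCI bus order that `nvidia-smi` reports.

```bash
#!/usr/bin/env bash
# List the available GPUs with their indices, if nvidia-smi is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,name --format=csv,noheader || true
fi

# Make CUDA enumerate devices in PCI bus order so that the IDs match
# the ones printed by nvidia-smi.
export CUDA_DEVICE_ORDER=PCI_BUS_ID

# Example value only: expose just the GPU with ID 1 to CUDA programs
# started from this shell.
export CUDA_VISIBLE_DEVICES=1

echo "Benchmarks will see GPU(s): ${CUDA_VISIBLE_DEVICES}"
```

Both variables only affect processes started from the same shell afterwards, so they need to be exported before invoking the benchmark scripts.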
@@ -181,7 +187,7 @@ We can then run the full benchmark suite from the build directory with:
 
 .. code-block:: bash
 
-    ../benchmarks/scripts/run.py
+    <root_dir_to_cccl>/cccl/benchmarks/scripts/run.py
 
 You can expect the output to look like this:
 
@@ -197,13 +203,13 @@ You can expect the output to look like this:
     ...
 
 The tuning infrastructure will build and execute all benchmarks and their variants one after another,
-reporting the time it seconds it took to execute the benchmark executable.
+reporting the time in seconds it took to execute the benchmark executable.
 
 It's also possible to benchmark a subset of algorithms and workloads:
 
 .. code-block:: bash
 
-    ../benchmarks/scripts/run.py -R '.*scan.exclusive.sum.*' -a 'Elements{io}[pow2]=[24,28]' -a 'T{ct}=I32'
+    <root_dir_to_cccl>/cccl/benchmarks/scripts/run.py -R '.*scan.exclusive.sum.*' -a 'Elements{io}[pow2]=[24,28]' -a 'T{ct}=I32'
     &&&& RUNNING bench
      ctk:  12.6.77
     cccl:  v2.7.0-rc0-265-g32aa6aa5a
@@ -227,7 +233,7 @@ The resulting database contains all samples, which can be extracted into JSON fi
 
 .. code-block:: bash
 
-    ../benchmarks/scripts/analyze.py -o ./cccl_meta_bench.db
+    <root_dir_to_cccl>/cccl/benchmarks/scripts/analyze.py -o ./cccl_meta_bench.db
 
 This will create a JSON file for each benchmark variant next to the database.
 For example: