From 9eeb7d5515fc3a9d3f1711f5c3a8ca0ae4f9d375 Mon Sep 17 00:00:00 2001
From: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>
Date: Fri, 15 Nov 2024 13:05:00 +0100
Subject: [PATCH 01/11] Document predefined benchmark typelists

---
 .../nvbench_helper/nvbench_helper.cuh         |  1 +
 docs/cub/tuning.rst                           | 26 ++++++++++++++++---
 2 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/cub/benchmarks/nvbench_helper/nvbench_helper/nvbench_helper.cuh b/cub/benchmarks/nvbench_helper/nvbench_helper/nvbench_helper.cuh
index 081bc5aa263..e8dacb4a1ff 100644
--- a/cub/benchmarks/nvbench_helper/nvbench_helper/nvbench_helper.cuh
+++ b/cub/benchmarks/nvbench_helper/nvbench_helper/nvbench_helper.cuh
@@ -61,6 +61,7 @@ using integral_types    = nvbench::type_list<TUNE_T>;
 using fundamental_types = nvbench::type_list<TUNE_T>;
 using all_types         = nvbench::type_list<TUNE_T>;
 #else
+// keep those lists in sync with the documentation in tuning.rst
 using integral_types = nvbench::type_list<int8_t, int16_t, int32_t, int64_t>;
 
 using fundamental_types =
diff --git a/docs/cub/tuning.rst b/docs/cub/tuning.rst
index c1ebe1864f5..f706ec61bff 100644
--- a/docs/cub/tuning.rst
+++ b/docs/cub/tuning.rst
@@ -168,9 +168,29 @@ construct:
   #endif
 
 
-This logic is automatically applied to :code:`all_types`, :code:`offset_types`, and
-:code:`fundamental_types` lists when you use matching names for the axes. You can define
-your own axis names and use the logic above for them (see sort pairs example).
+This logic is already implemented if you use any of the following predefined type lists:
+
+.. list-table:: Predefined type lists
+   :header-rows: 1
+
+   * - Axis name
+     - C++ identifier
+     - Included types
+   * - :code:`T{ct}`
+     - :code:`integral_types`
+     - :code:`int8_t, int16_t, int32_t, int64_t`
+   * - :code:`T{ct}`
+     - :code:`fundamental_types`
+     - :code:`integral_types` and :code:`int128_t, float, double`
+   * - :code:`T{ct}`
+     - :code:`all_types`
+     - :code:`fundamental_types` and :code:`complex`
+   * - :code:`OffsetT{ct}`
+     - :code:`offset_types`
+     - :code:`int32_t, int64_t`
+
+
+You are also free to define your own axis names and use the logic above for them (see the sort pairs example).
 
 
 Search Process

From 8a1777fd303517adc4c590f05229dda945de4da3 Mon Sep 17 00:00:00 2001
From: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>
Date: Fri, 15 Nov 2024 13:36:41 +0100
Subject: [PATCH 02/11] Show benchmark guide before tuning guide

---
 CONTRIBUTING.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 50a3f3b9e0d..ed9bdc9d21f 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -51,8 +51,8 @@ For more information about design and development practices for each CCCL compon
 
 - [CUB Developer Guide](docs/cub/developer_overview.rst) - General overview of the design of CUB internals
 - [CUB Test Overview](docs/cub/test_overview.rst) - Overview of how to write CUB unit tests
-- [CUB Tuning Infrastructure](docs/cub/tuning.rst) - Overview of CUB's performance tuning infrastructure
 - [CUB Benchmarks](docs/cub/benchmarking.rst) - Overview of CUB's performance benchmarks
+- [CUB Tuning Infrastructure](docs/cub/tuning.rst) - Overview of CUB's performance tuning infrastructure
 
 #### Thrust
 

From edac61c8f6bdcc854f2c767544c93ace9f10981b Mon Sep 17 00:00:00 2001
From: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>
Date: Fri, 15 Nov 2024 13:37:01 +0100
Subject: [PATCH 03/11] Extend CUB benchmark guide

---
 docs/cub/benchmarking.rst | 81 +++++++++++++++++++++++++++++++--------
 1 file changed, 64 insertions(+), 17 deletions(-)

diff --git a/docs/cub/benchmarking.rst b/docs/cub/benchmarking.rst
index dbd22d84209..b82a7149921 100644
--- a/docs/cub/benchmarking.rst
+++ b/docs/cub/benchmarking.rst
@@ -1,26 +1,73 @@
 CUB Benchmarks
 *************************************
+.. TODO(bgruber): this guide applies to Thrust as well. We should rename it to "CCCL Benchmarks" and move it out of CUB
 
-This file contains instructions on how to run all CUB benchmarks using CUB tuning infrastructure.
+CUB comes with a set of `NVBench <https://github.com/NVIDIA/nvbench>`_-based benchmarks for its algorithms,
+which can be used to measure the performance of CUB on your system on a variety of workloads.
+The integration with NVBench allows you to archive and compare benchmark results,
+which is useful for continuous performance testing, detecting regressions, tuning, and optimization.
+This guide gives an introduction to CUB's benchmarking infrastructure.
+
+Building benchmarks
+--------------------------------------------------------------------------------
+
+CUB benchmarks are built as part of the CCCL CMake infrastructure.
+Starting from scratch:
 
 .. code-block:: bash
 
-    pip3 install --user fpzip pandas scipy
     git clone https://github.com/NVIDIA/cccl.git
-    cmake -B build -DCCCL_ENABLE_THRUST=OFF\
-             -DCCCL_ENABLE_LIBCUDACXX=OFF\
-             -DCCCL_ENABLE_CUB=ON\
-             -DCCCL_ENABLE_BENCHMARKS=YES\
-             -DCUB_ENABLE_DIALECT_CPP11=OFF\
-             -DCUB_ENABLE_DIALECT_CPP14=OFF\
-             -DCUB_ENABLE_DIALECT_CPP17=ON\
-             -DCUB_ENABLE_DIALECT_CPP20=OFF\
-             -DCUB_ENABLE_RDC_TESTS=OFF\
-             -DCUB_ENABLE_TUNING=YES\
-             -DCMAKE_BUILD_TYPE=Release\
-             -DCMAKE_CUDA_ARCHITECTURES="89;90"
+    cd cccl
+    mkdir build
     cd build
-    ../cub/benchmarks/scripts/run.py
+    cmake ..\
+        -GNinja\
+        -DCCCL_ENABLE_BENCHMARKS=YES\
+        -DCCCL_ENABLE_CUB=YES\
+        -DCCCL_ENABLE_THRUST=NO\
+        -DCCCL_ENABLE_LIBCUDACXX=NO\
+        -DCUB_ENABLE_RDC_TESTS=NO\
+        -DCMAKE_BUILD_TYPE=Release\
+        -DCMAKE_CUDA_ARCHITECTURES=90 # TODO: Set your GPU architecture
+
+You clone the repository, create a build directory and configure the build with CMake.
+It's important that you enable benchmarks (`CCCL_ENABLE_BENCHMARKS=ON`),
+build in Release mode (`CMAKE_BUILD_TYPE=Release`),
+and set the GPU architecture to match your system (`CMAKE_CUDA_ARCHITECTURES=XX`).
+`This website <https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/>`_
+contains a great table listing the architectures for different brands of GPUs.
+.. TODO(bgruber): do we have a public NVIDIA maintained table I can link here instead?
+We use Ninja as CMake generator in this guide, but you can use any other generator you prefer.
+
+You can then proceed to build the benchmarks.
+
+If you intend to build only selected benchmarks, you can list the available CMake build targets with:
+
+.. code-block:: bash
+
+    ninja -t targets | grep '\.bench\.'
+    cub.bench.adjacent_difference.subtract_left.base: phony
+    cub.bench.copy.memcpy.base: phony
+    ...
+    cub.bench.transform.babelstream3.base: phony
+    cub.bench.transform_reduce.sum.base: phony
+
+We also provide a target to build all benchmarks:
+
+.. code-block:: bash
+
+    ninja cub.all.benches
+
+
+Running all benchmarks
+--------------------------------------------------------------------------------
+
+This file contains instructions on how to run all CUB benchmarks using CUB tuning infrastructure.
+
+.. code-block:: bash
+
+    pip install --user fpzip pandas scipy
+    ../benchmarks/scripts/run.py
 
 
 Expected output for the command above is:
@@ -28,7 +75,7 @@ Expected output for the command above is:
 
 .. code-block:: bash
 
-    ../cub/benchmarks/scripts/run.py
+    ../benchmarks/scripts/run.py
     &&&& RUNNING bench
     ctk:  12.2.140
     cub:  812ba98d1
@@ -43,7 +90,7 @@ It's also possible to benchmark a subset of algorithms and workloads:
 
 .. code-block:: bash
 
-    ../cub/benchmarks/scripts/run.py -R '.*scan.exclusive.sum.*' -a 'Elements{io}[pow2]=[24,28]' -a 'T{ct}=I32'
+    ../benchmarks/scripts/run.py -R '.*scan.exclusive.sum.*' -a 'Elements{io}[pow2]=[24,28]' -a 'T{ct}=I32'
     &&&& RUNNING bench
     ctk:  12.2.140
     cub:  812ba98d1

From af210a693b3f2b11e3c00ecb0f8d2c4ab6b5dfc6 Mon Sep 17 00:00:00 2001
From: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>
Date: Fri, 15 Nov 2024 13:52:57 +0100
Subject: [PATCH 04/11] Extend CUB benchmark guide on running a benchmark

---
 docs/cub/benchmarking.rst | 44 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)

diff --git a/docs/cub/benchmarking.rst b/docs/cub/benchmarking.rst
index b82a7149921..542b9769f94 100644
--- a/docs/cub/benchmarking.rst
+++ b/docs/cub/benchmarking.rst
@@ -59,6 +59,50 @@ We also provide a target to build all benchmarks:
     ninja cub.all.benches
 
 
+Running a benchmark
+--------------------------------------------------------------------------------
+
+After we have built a benchmark, we can run it as follows:
+
+.. code-block:: bash
+
+    ./bin/cub.bench.adjacent_difference.subtract_left.base\
+        -d 0\
+        --stopping-criterion entropy\
+        --json base.json\
+        --md base.md
+
+In this command, `-d 0` indicates that we want to run on GPU 0 on our system.
+Setting `--stopping-criterion entropy` is advisable since it reduces runtime
+and increases confidence in the resulting data.
+It's not set as default yet, because NVBench is still evaluating it.
+By default, NVBench will print the benchmark results to the terminal as Markdown.
+`--json base.json` will save the detailed results in a JSON file as well for later use.
+`--md base.md` will save the Markdown output to a file as well,
+so you can easily view the results later without having to parse the JSON.
+More information on what command line options are available can be found in the
+`NVBench documentation <https://github.com/NVIDIA/nvbench/blob/main/docs/cli_help.md>`_.
+
+The expected terminal output is something along the following lines (also saved to `base.md`):
+
+.. code-block:: bash
+
+    TODO
+
+If you are only interested in a subset of workloads, you can restrict benchmarking as follows:
+
+.. code-block:: bash
+
+    ./bin/cub.bench.adjacent_difference.subtract_left.base ...\
+        -a 'T{ct}=I32'\
+        -a 'OffsetT{ct}=I32'\
+        -a 'Elements{io}[pow2]=[24,28]'\
+
+The `-a` option allows you to restrict the values for each axis available for the benchmark.
+See the `NVBench documentation <https://github.com/NVIDIA/nvbench/blob/main/docs/cli_help_axis.md>`_.
+for more information on how to specify the axis values.
+
+
 Running all benchmarks
 --------------------------------------------------------------------------------
 

From ffa61fcb5196d457df66c88fb2e086eddc280a97 Mon Sep 17 00:00:00 2001
From: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>
Date: Fri, 15 Nov 2024 14:21:43 +0100
Subject: [PATCH 05/11] Extend CUB benchmark guide on comparing benchmark
 results

---
 docs/cub/benchmarking.rst | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/docs/cub/benchmarking.rst b/docs/cub/benchmarking.rst
index 542b9769f94..670688282d3 100644
--- a/docs/cub/benchmarking.rst
+++ b/docs/cub/benchmarking.rst
@@ -103,6 +103,35 @@ See the `NVBench documentation <https://github.com/NVIDIA/nvbench/blob/main/docs
 for more information on how to specify the axis values.
 
 
+Comparing benchmark results
+--------------------------------------------------------------------------------
+
+Let's say you have a modification that you'd like to benchmark.
+To compare the performance you have to build and run the benchmark as described above for the unmodified code,
+saving the results to a JSON file, e.g. `base.json`.
+Then, you apply your code changes (e.g., switch to a different branch, git stash pop, apply a patch file, etc.),
+rebuild and rerun the benchmark, saving the results to a different JSON file, e.g. `new.json`.
+
+Assuming you are still in your build directory, you can now compare the two result JSON files using:
+
+.. code-block:: bash
+
+    PYTHONPATH=./_deps/nvbench-src/scripts ./_deps/nvbench-src/scripts/nvbench_compare.py base.json new.json
+
+The `PYTHONPATH` environment variable may not be necessary in all cases.
+The script will print a Markdown report, showing the runtime differences between each variant of the two benchmark run:
+
+.. code-block:: bash
+
+    TODO
+
+In addition to showing the absolute and relative runtime difference,
+NVBench reports the noise of the measurements,
+which corresponds to the relative standard deviation.
+It then reports with statistical significance in the `Status` column
+how the runtime changed from the base to the new version.
+
+
 Running all benchmarks
 --------------------------------------------------------------------------------
 

From d792e9057db8a1c38598e8dcbaa083498db82f5a Mon Sep 17 00:00:00 2001
From: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>
Date: Fri, 15 Nov 2024 14:25:57 +0100
Subject: [PATCH 06/11] Extend CUB benchmark guide on running all benchmarks

---
 docs/cub/benchmarking.rst | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)
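
Reviewer note (not part of the patch): for readers who prefer Python over the
shell loop added below, here is a minimal sketch of the same idea. It assumes
the benchmark executables live in `bin/` and are named `cub.bench.*`, as in the
guide; the helper names `benchmark_commands` and `run_all` are hypothetical.
Only the NVBench flags already used in the guide appear here.

```python
import subprocess
from pathlib import Path


def benchmark_commands(bin_dir: Path, device: int = 0) -> list[list[str]]:
    """Build one command line per benchmark executable found in bin_dir."""
    commands = []
    # Assumption: benchmark executables are named cub.bench.<algorithm>...
    for exe in sorted(bin_dir.glob("cub.bench.*")):
        commands.append([
            str(exe),
            "-d", str(device),
            "--stopping-criterion", "entropy",
            "--json", f"{exe.name}.json",
            "--md", f"{exe.name}.md",
        ])
    return commands


def run_all(bin_dir: Path, device: int = 0) -> None:
    """Run every benchmark in sequence, printing a progress header for each."""
    commands = benchmark_commands(bin_dir, device)
    for i, cmd in enumerate(commands, start=1):
        print(f"=== Running {Path(cmd[0]).name} ({i}/{len(commands)}) ===")
        # check=False: keep going even if a single benchmark fails
        subprocess.run(cmd, check=False)
```

Calling `run_all(Path("bin"))` from the build directory would write one JSON and
one Markdown file per benchmark into the current directory. Unlike the
`grep cub.bench` in the shell version, the glob only matches names that start
with `cub.bench.`.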

diff --git a/docs/cub/benchmarking.rst b/docs/cub/benchmarking.rst
index 670688282d3..06ab46967b4 100644
--- a/docs/cub/benchmarking.rst
+++ b/docs/cub/benchmarking.rst
@@ -132,7 +132,27 @@ It then reports with statistical significance in the `Status` column
 how the runtime changed from the base to the new version.
 
 
-Running all benchmarks
+Running all benchmarks directly from the command line
+--------------------------------------------------------------------------------
+
+To get a full snapshot of CUB's performance, you can run all benchmarks and save the results.
+For example:
+
+.. code-block:: bash
+
+    ninja cub.all.benches
+    benchmarks=$(ls bin | grep cub.bench); n=$(echo $benchmarks | wc -w); i=1; \
+    for b in $benchmarks; do \
+      echo "=== Running $b ($i/$n) ==="; \
+      ./bin/$b -d 0 --stopping-criterion entropy --json $b.json --md $b.md; \
+      ((i++)); \
+    done
+
+This will generate one JSON and one Markdown file for each benchmark.
+You can archive those files for later comparison or analysis.
+
+
+Running all benchmarks via tuning scripts
 --------------------------------------------------------------------------------
 
 This file contains instructions on how to run all CUB benchmarks using CUB tuning infrastructure.

From d7bb087a252a9ec42f166afc731c8429c077b240 Mon Sep 17 00:00:00 2001
From: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>
Date: Fri, 15 Nov 2024 14:26:51 +0100
Subject: [PATCH 07/11] Show benchmark guide before tuning guide in docs index

---
 docs/cub/index.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/cub/index.rst b/docs/cub/index.rst
index 21e42d81cc3..da59a9b8ec0 100644
--- a/docs/cub/index.rst
+++ b/docs/cub/index.rst
@@ -10,8 +10,8 @@ CUB
    modules
    developer_overview
    test_overview
-   tuning
    benchmarking
+   tuning
    ${repo_docs_api_path}/cub_api
 
 .. the line below can be used to use the README.md file as the index page

From 32aa6aa5a7e07c1c060ee84d6f72678197022254 Mon Sep 17 00:00:00 2001
From: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>
Date: Fri, 15 Nov 2024 15:30:25 +0100
Subject: [PATCH 08/11] Fill TODOs

---
 docs/cub/benchmarking.rst | 25 +++++++++++++++++++++----
 1 file changed, 21 insertions(+), 4 deletions(-)
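
Reviewer note (not part of the patch): the `Diff` and `%Diff` columns in the
comparison table added below are consistent with a plain difference relative to
the reference time. The sketch assumes `Diff = Cmp - Ref` and
`%Diff = Diff / Ref`; this is inferred from the two example rows only, not taken
from the nvbench_compare.py source. The function name `runtime_diff` is
hypothetical.

```python
def runtime_diff(ref_us: float, cmp_us: float) -> tuple[float, float]:
    """Return (absolute difference in us, relative difference in percent)."""
    diff = cmp_us - ref_us
    return diff, 100.0 * diff / ref_us


# (Ref Time, Cmp Time) in microseconds, copied from the example table
rows = {"2^16": (4.571, 4.096), "2^20": (15.161, 15.143)}
for elements, (ref, cmp_time) in rows.items():
    diff, pct = runtime_diff(ref, cmp_time)
    print(f"{elements}: {diff:+.3f} us ({pct:+.2f}%)")
```

A negative value means the new version ran faster than the base. How the
`Status` column is derived from the noise is not covered by this sketch.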

diff --git a/docs/cub/benchmarking.rst b/docs/cub/benchmarking.rst
index 06ab46967b4..40014cc70bc 100644
--- a/docs/cub/benchmarking.rst
+++ b/docs/cub/benchmarking.rst
@@ -83,11 +83,23 @@ so you can easily view the results later without having to parse the JSON.
 More information on what command line options are available can be found in the
 `NVBench documentation <https://github.com/NVIDIA/nvbench/blob/main/docs/cli_help.md>`_.
 
-The expected terminal output is something along the following lines (also saved to `base.md`):
+The expected terminal output is something along the following lines (also saved to `base.md`),
+shortened for brevity:
 
 .. code-block:: bash
 
-    TODO
+    # Log
+    Run:  [1/8] base [Device=0 T{ct}=I32 OffsetT{ct}=I32 Elements{io}=2^16]
+    Pass: Cold: 0.004571ms GPU, 0.009322ms CPU, 0.00s total GPU, 0.01s total wall, 334x
+    Run:  [2/8] base [Device=0 T{ct}=I32 OffsetT{ct}=I32 Elements{io}=2^20]
+    Pass: Cold: 0.015161ms GPU, 0.023367ms CPU, 0.01s total GPU, 0.02s total wall, 430x
+    ...
+    # Benchmark Results
+    | T{ct} | OffsetT{ct} |   Elements{io}   | Samples |  CPU Time  |  Noise  |  GPU Time  | Noise  | Elem/s  | GlobalMem BW | BWUtil |
+    |-------|-------------|------------------|---------|------------|---------|------------|--------|---------|--------------|--------|
+    |   I32 |         I32 |     2^16 = 65536 |    334x |   9.322 us | 104.44% |   4.571 us | 10.87% | 14.337G | 114.696 GB/s | 14.93% |
+    |   I32 |         I32 |   2^20 = 1048576 |    430x |  23.367 us | 327.68% |  15.161 us |  3.47% | 69.161G | 553.285 GB/s | 72.03% |
+    ...
 
 If you are only interested in a subset of workloads, you can restrict benchmarking as follows:
 
@@ -119,11 +131,16 @@ You can now compare the two result JSON files using, assuming you are still in y
     PYTHONPATH=./_deps/nvbench-src/scripts ./_deps/nvbench-src/scripts/nvbench_compare.py base.json new.json
 
 The `PYTHONPATH` environment variable may not be necessary in all cases.
-The script will print a Markdown report, showing the runtime differences between each variant of the two benchmark run:
+The script will print a Markdown report showing the runtime differences between each variant of the two benchmark runs.
+The output could look like this, again shortened for brevity:
 
 .. code-block:: bash
 
-    TODO
+    |  T{ct}  |  OffsetT{ct}  |  Elements{io}  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
+    |---------|---------------|----------------|------------|-------------|------------|-------------|------------|---------|----------|
+    |   I32   |      I32      |      2^16      |   4.571 us |      10.87% |   4.096 us |       0.00% |  -0.475 us | -10.39% |   FAIL   |
+    |   I32   |      I32      |      2^20      |  15.161 us |       3.47% |  15.143 us |       3.55% |  -0.018 us |  -0.12% |   PASS   |
+    ...
 
 In addition to showing the absolute and relative runtime difference,
 NVBench reports the noise of the measurements,

From fa7ab5a8f6474affaf9c10dafd355b78ce4cb6e9 Mon Sep 17 00:00:00 2001
From: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>
Date: Fri, 15 Nov 2024 16:07:47 +0100
Subject: [PATCH 09/11] Rework and extend tuning section

---
 docs/cub/benchmarking.rst | 73 ++++++++++++++++++++++++++++++++++-----
 docs/cub/tuning.rst       |  2 ++
 2 files changed, 66 insertions(+), 9 deletions(-)
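
Reviewer note (not part of the patch): a minimal sketch of how the per-variant
JSON files written by analyze.py could be post-processed. It relies only on the
fields visible in the example added below (`variant`, `center`, `samples`) and
assumes the samples are runtimes in seconds. The relative standard deviation is
computed here as an illustration of the "noise" notion used elsewhere in the
guide; the helper name `summarize` is hypothetical.

```python
import json
import statistics


def summarize(variant: dict) -> dict:
    """Compute mean and relative standard deviation of a variant's samples."""
    samples = variant["samples"]
    mean = statistics.fmean(samples)
    # The sample standard deviation needs at least two data points
    stdev = statistics.stdev(samples) if len(samples) > 1 else 0.0
    return {
        "variant": variant["variant"],
        "center": variant["center"],
        "mean": mean,
        "noise_pct": 100.0 * stdev / mean,
    }


# Truncated example: only the two sample values visible in the guide are used
text = """[{"variant": "base ()", "center": 0.003194368,
            "samples": [0.003152896, 0.0031549439]}]"""
for entry in json.loads(text):
    print(summarize(entry))
```

For a real file, replace the inline string with the contents of one of the
generated JSON files, e.g. via `json.load(open(path))`.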

diff --git a/docs/cub/benchmarking.rst b/docs/cub/benchmarking.rst
index 40014cc70bc..cf6a8008dcc 100644
--- a/docs/cub/benchmarking.rst
+++ b/docs/cub/benchmarking.rst
@@ -113,6 +113,7 @@ If you are only interested in a subset of workloads, you can restrict benchmarki
 The `-a` option allows you to restrict the values for each axis available for the benchmark.
 See the `NVBench documentation <https://github.com/NVIDIA/nvbench/blob/main/docs/cli_help_axis.md>`_.
 for more information on how to specify the axis values.
+If the specified axis does not exist, the benchmark will terminate with an error.
 
 
 Comparing benchmark results
@@ -169,23 +170,30 @@ This will generate one JSON and one Markdown file for each benchmark.
 You can archive those files for later comparison or analysis.
 
 
-Running all benchmarks via tuning scripts
+Running all benchmarks via tuning scripts (alternative)
 --------------------------------------------------------------------------------
 
-This file contains instructions on how to run all CUB benchmarks using CUB tuning infrastructure.
+The benchmark suite can also be run using the :ref:`tuning <cub-tuning>` infrastructure.
+The tuning infrastructure handles building benchmarks itself, because it records the build times.
+Therefore, it's critical that you run it in a clean build directory without any build artifacts.
+Running cmake is enough. Alternatively, you can clean an existing build directory with `ninja clean`.
+Furthermore, the tuning scripts require some additional Python dependencies, which you have to install.
 
 .. code-block:: bash
 
+    ninja clean
     pip install --user fpzip pandas scipy
-    ../benchmarks/scripts/run.py
 
+We can then run the full benchmark suite from the build directory with:
 
-Expected output for the command above is:
+.. code-block:: bash
+
+    PYTHONPATH=../benchmarks/scripts ../benchmarks/scripts/run.py
 
+You can expect the output to look like this:
 
 .. code-block:: bash
 
-    ../benchmarks/scripts/run.py
     &&&& RUNNING bench
     ctk:  12.2.140
     cub:  812ba98d1
@@ -195,6 +203,8 @@ Expected output for the command above is:
     &&&& PERF cub_bench_adjacent_difference_subtract_left_base_T_ct__I32___OffsetT_ct__I32___Elements_io__pow2__28 0.002673664130270481 -sec
     ...
 
+The tuning infrastructure will build and execute all benchmarks and their variants one after another,
+reporting the time in seconds it took to run each benchmark executable.
 
 It's also possible to benchmark a subset of algorithms and workloads:
 
@@ -202,8 +212,53 @@ It's also possible to benchmark a subset of algorithms and workloads:
 
     ../benchmarks/scripts/run.py -R '.*scan.exclusive.sum.*' -a 'Elements{io}[pow2]=[24,28]' -a 'T{ct}=I32'
     &&&& RUNNING bench
-    ctk:  12.2.140
-    cub:  812ba98d1
-    &&&& PERF cub_bench_scan_exclusive_sum_base_T_ct__I32___OffsetT_ct__I32___Elements_io__pow2__24 0.00016899200272746384 -sec
-    &&&& PERF cub_bench_scan_exclusive_sum_base_T_ct__I32___OffsetT_ct__I32___Elements_io__pow2__28 0.002696000039577484 -sec
+     ctk:  12.6.77
+    cccl:  v2.7.0-rc0-265-g32aa6aa5a
+    &&&& PERF cub_bench_scan_exclusive_sum_base_T_ct__I32___OffsetT_ct__U32___Elements_io__pow2__28 0.003194367978721857 -sec
+    &&&& PERF cub_bench_scan_exclusive_sum_base_T_ct__I32___OffsetT_ct__U64___Elements_io__pow2__28 0.00319383991882205 -sec
     &&&& PASSED bench
+
+
+The `-R` option allows you to specify a regular expression for selecting benchmarks.
+The `-a` restricts the values for an axis across all benchmarks
+See the `NVBench documentation <https://github.com/NVIDIA/nvbench/blob/main/docs/cli_help_axis.md>`_.
+for more information on how to specify the axis values.
+Contrary to running a benchmark directly,
+the tuning infrastructure will just ignore an axis value if a benchmark does not support,
+run the benchmark regardless, and continue.
+
+The tuning infrastructure stores results in an SQLite database called `cccl_meta_bench.db` in the build directory.
+This database persists across tuning runs.
+If you interrupt the benchmark script and then launch it again, only missing benchmark variants will be run.
+The resulting database contains all samples, which can be extracted into JSON files:
+
+.. code-block:: bash
+
+    ../benchmarks/scripts/analyze.py -o ./cccl_meta_bench.db
+
+This will create a JSON file for each benchmark variant next to the database.
+For example:
+
+.. code-block:: bash
+
+    cat cub_bench_scan_exclusive_sum_base_T_ct__I32___OffsetT_ct__U32___Elements_io__pow2__28.json
+    [
+      {
+        "variant": "base ()",
+        "elapsed": 2.6299014091,
+        "center": 0.003194368,
+        "bw": 0.8754671386,
+        "samples": [
+          0.003152896,
+          0.0031549439,
+          ...
+        ],
+        "Elements{io}[pow2]": "28",
+        "base_samples": [
+          0.003152896,
+          0.0031549439,
+          ...
+        ],
+        "speedup": 1
+      }
+    ]
diff --git a/docs/cub/tuning.rst b/docs/cub/tuning.rst
index f706ec61bff..184dc57900a 100644
--- a/docs/cub/tuning.rst
+++ b/docs/cub/tuning.rst
@@ -1,3 +1,5 @@
+.. _cub-tuning:
+
 CUB Tuning Infrastructure
 ================================================================================
 

From f3026670415772f6b377d0bff6ad7c0adf021a9c Mon Sep 17 00:00:00 2001
From: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>
Date: Fri, 15 Nov 2024 16:18:56 +0100
Subject: [PATCH 10/11] Fix links with same tags

See: https://github.com/sphinx-doc/sphinx/issues/3921
---
 docs/cub/benchmarking.rst | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/docs/cub/benchmarking.rst b/docs/cub/benchmarking.rst
index cf6a8008dcc..6960d0dabc2 100644
--- a/docs/cub/benchmarking.rst
+++ b/docs/cub/benchmarking.rst
@@ -1,5 +1,6 @@
 CUB Benchmarks
 *************************************
+
 .. TODO(bgruber): this guide applies to Thrust as well. We should rename it to "CCCL Benchmarks" and move it out of CUB
 
 CUB comes with a set of `NVBench <https://github.com/NVIDIA/nvbench>`_-based benchmarks for its algorithms,
@@ -81,7 +82,7 @@ By default, NVBench will print the benchmark results to the terminal as Markdown
 `--md base.md` will save the Markdown output to a file as well,
 so you can easily view the results later without having to parse the JSON.
 More information on what command line options are available can be found in the
-`NVBench documentation <https://github.com/NVIDIA/nvbench/blob/main/docs/cli_help.md>`_.
+`NVBench documentation <https://github.com/NVIDIA/nvbench/blob/main/docs/cli_help.md>`__.
 
 The expected terminal output is something along the following lines (also saved to `base.md`),
 shortened for brevity:
@@ -111,7 +112,7 @@ If you are only interested in a subset of workloads, you can restrict benchmarki
         -a 'Elements{io}[pow2]=[24,28]'\
 
 The `-a` option allows you to restrict the values for each axis available for the benchmark.
-See the `NVBench documentation <https://github.com/NVIDIA/nvbench/blob/main/docs/cli_help_axis.md>`_.
+See the `NVBench documentation <https://github.com/NVIDIA/nvbench/blob/main/docs/cli_help_axis.md>`__
 for more information on how to specify the axis values.
 If the specified axis does not exist, the benchmark will terminate with an error.
 
@@ -221,7 +222,7 @@ It's also possible to benchmark a subset of algorithms and workloads:
 
 The `-R` option allows you to specify a regular expression for selecting benchmarks.
 The `-a` restricts the values for an axis across all benchmarks
-See the `NVBench documentation <https://github.com/NVIDIA/nvbench/blob/main/docs/cli_help_axis.md>`_.
+See the `NVBench documentation <https://github.com/NVIDIA/nvbench/blob/main/docs/cli_help_axis.md>`__
 for more information on how to specify the axis values.
 Contrary to running a benchmark directly,
 the tuning infrastructure will just ignore an axis value if a benchmark does not support,

From fb315b58c9195b3fbe85d400ffffb306a6bdea50 Mon Sep 17 00:00:00 2001
From: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>
Date: Fri, 15 Nov 2024 16:23:19 +0100
Subject: [PATCH 11/11] Drop PYTHONPATH

---
 docs/cub/benchmarking.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/cub/benchmarking.rst b/docs/cub/benchmarking.rst
index 6960d0dabc2..6d0603e49cd 100644
--- a/docs/cub/benchmarking.rst
+++ b/docs/cub/benchmarking.rst
@@ -189,7 +189,7 @@ We can then run the full benchmark suite from the build directory with:
 
 .. code-block:: bash
 
-    PYTHONPATH=../benchmarks/scripts ../benchmarks/scripts/run.py
+    ../benchmarks/scripts/run.py
 
 You can expect the output to look like this: