From 04b0a92306e68e510227b575530bde4ac458ff40 Mon Sep 17 00:00:00 2001 From: gonzalobg <65027571+gonzalobg@users.noreply.github.com> Date: Sun, 2 Feb 2025 17:07:48 +0100 Subject: [PATCH 01/54] Document CUDA execution model --- docs/libcudacxx/extended_api.rst | 1 + .../extended_api/execution_model.rst | 243 ++++++++++++++++++ 2 files changed, 244 insertions(+) create mode 100644 docs/libcudacxx/extended_api/execution_model.rst diff --git a/docs/libcudacxx/extended_api.rst b/docs/libcudacxx/extended_api.rst index 9b359b12166..26a3dc6a450 100644 --- a/docs/libcudacxx/extended_api.rst +++ b/docs/libcudacxx/extended_api.rst @@ -6,6 +6,7 @@ Extended API .. toctree:: :maxdepth: 2 + extended_api/execution_model extended_api/memory_model extended_api/thread_groups extended_api/shapes diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst new file mode 100644 index 00000000000..8e2f9e54a42 --- /dev/null +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -0,0 +1,243 @@ +.. _libcudacxx-extended-api-execution-model: + +Execution model +=============== + +CUDA C++ aims to provide `_parallel forward progress_ [intro.progress.9] `__ +for all device threads of execution, making the parallelization of pre-existing C++ applications with CUDA C++ straight-forward. + +.. dropdown:: [intro.progress] + + - `[intro.progress.7] `__: For a thread of execution + providing `concurrent forward progress guarantees `__, + the implementation ensures that the thread will eventually make progress for as long as it has not terminated. + + [Note 5: This applies regardless of whether or not other threads of execution (if any) have been or are making progress. + To eventually fulfill this requirement means that this will happen in an unspecified but finite amount of time. — end note] + + - `[intro.progress.9] `__: For a thread of execution providing + `parallel forward progress guarantees `__, the implementation is not required to ensure that + the thread will eventually make progress if it has not yet executed any execution step; once this thread has executed a step, + it provides concurrent forward progress guarantees. + + [Note 6: This does not specify a requirement for when to start this thread of execution, which will typically be specified by the entity + that creates this thread of execution. For example, a thread of execution that provides concurrent forward progress guarantees and executes + tasks from a set of tasks in an arbitrary order, one after the other, satisfies the requirements of parallel forward progress for these + tasks. — end note] + +.. _libcudacxx-extended-api-execution-model-host-threads: + +Host threads +------------ + +The forward-progress provided by threads of execution created by the host implementation to +execute `main `__, `std::thread `__, +and `std::jthread https://en.cppreference.com/w/cpp/thread/jthread>`__ is implementation-defined behavior of the host +implementation `[intro.progress] `__. +General-purpose host implementations should provide _concurrent forward progress_. + +If the host implementation provides `_concurrent forward progress_ [intro.progress.7] `__, +then CUDA C++ provides `_parallel forward progress_ [intro.progress.9] `__ for device threads. + + +.. _libcudacxx-extended-api-execution-model-device-threads: + +Device threads +-------------- + +Once a device-thread makes progress: + +- If the device-thread is part of a `Cooperative Grid `__, + then all device-threads in its grid shall eventually make progress. +_ Otherwise, all device-threads in its `thread-block cluster `__ + shall eventually make progress. + + [Note: Threads in other thread-block clusters are not guaranteed to eventually make progress. - end note.] + [Note: This implies that all device-threads in its thread block shall eventually make progress. - end note.] + +The order in which device-threads eventually get a chance to make progress is _unspecified_. + +Modify `[intro.progress.1] `__ as follows + +The implementation may assume that any **host** thread will eventually do one of the following: + + 1. terminate, + 2. invoke the function `std::his_thread::yield `__ (`[thread.thread.this] `__), + 3. make a call to a library I/O function, + 4. perform an access through a volatile glvalue, + 5. perform a synchronization operation or an atomic operation, or + 6. continue execution of a trivial infinite loop (`[stmt.iter.general] `__). + +**The implementation may assume that any device thread will eventually do one of the following:** + + 1. **terminate**, + 2. **make a call to a library I/O function**, + 3. **perform an access through a volatile glvalue except if the designated object has automatic storage duration, or** + 4. **perform a synchronization operation or an atomic read operation except if the designated object has automatic storage duration.** + +.. dropdown:: Examples of forward progress guarantee differences between host and device threads due to [intro.progress.1]. + + The following examples refer to the itemized sub-clauses of the implementation assumptions for host and device threads above + using "host.threads." and "device.threads.", respectively. + + .. code:: cuda + // Example: Execution.Model.Device.0 + // Allowed outcome: No thread makes progress because device threads don't support host.threads.2. + __global__ void ii() { + while(true) std::this_thread::yield(); + } + + .. code:: cuda + // Example: Execution.Model.Device.1 + // Allowed outcome: No thread makes progress because device threads don't support host.threads.4 + // for objects with automatic storage duration (see exception in device.threads.3). + __global__ void iv() { + volatile bool True = true; + while(True); + } + + .. code:: cuda + // Example: Execution.Model.Device.2 + // Allowed outcome: No thread makes progress because device threads don't support host.threads.5 + // for objects with automatic storage duration (see exception in device.threads.4). + __global__ void v_atomic_automatic() { + cuda::atomic True = true; + while(True.load()); + } + + .. code:: cuda + // Example: Execution.Model.Device.3 + // Allowed outcome: No thread makes progress because device threads don't support host.thread.6. + __global void vi() { + while(true) { /* empty */ } + } + +.. _libcudacxx-extended-api-execution-model-cuda-apis: + +CUDA APIs +--------- + +Any in-progress CUDA API shall eventually either return or ensure at least one device-thread makes progress. + +CUDA query functions (e.g. `cudaStreamQuery `__, +`cudaEventQuery `__, etc.) shall not consistently +return ``cudaErrorNotReady`` without a device-thread making progress. + +[Note: The device-thread need not be "related" to the API call, e.g., an API operating on one stream or process may ensure progress of a device-thread on another stream or process. - end note.] + +[Note: A simple but not sufficient method to test workloads for CUDA API Forward Progress conformance is to run them with following environment variables set: ``CUDA_DEVICE_MAX_CONNECTIONS=1 CUDA_LAUNCH_BLOCKING=1`` - end note.] + +.. dropdown:: Examples of CUDA API forward progress guarantees. + + .. code:: cuda + // Example: Execution.Model.API.1 + // Outcome: if device empty, terminates and returns cudaSuccess. + // Rationale: CUDA guarantees that if the device is empty: + // - `cudaDeviceSynchronize` eventually ensures that at least one device-thread makes progress, which implies that eventually `hello_world` grid and one of its device-threads start. + // - All thread-block threads eventually start (due to "if a device thread makes progress, all other threads in its thread-block cluster eventually make progress"). + // - Once all threads in thread-block arrive at `__syncthreads` barrier, all waiting threads are unblocked. + // - Therefore all device threads eventually exit the `hello_world`` grid. + // - And `cudaDeviceSynchronize`` eventually unblocks. + __global__ void hello_world() { __syncthreads(); } + int main() { + hello_world<<<1,2>>>(); + return (int)cudaDeviceSynchronize(); + } + + .. code:: cuda + // Example: Execution.Model.API.2 + // Allowed outcome: eventually, no thread makes progress. + // Rationale: the `cudaDeviceSynchronize` API below is only called if a device thread eventually makes progress and sets the flag. + // However, CUDA only guarantees that `producer` device thread eventually starts if the synchronization API is called. + // Therefore, the host thread may never be unblocked from the flag spin-loop. + cuda::atomic flag = 0; + __global__ void producer() { flag.store(1); } + int main() { + cudaHostRegister(&flag, sizeof(flag)); + producer<<<1,1>>>(); + while (flag.load() == 0); + return cudaDeviceSynchronize(); + } + + .. code:: cuda + // Example: Execution.Model.API.3 + // Allowed outcome: eventually, no thread makes progress. + // Rationale: same as Example.Model.API.2, with the addition that a single CUDA query API call does not guarantee + // the device thread eventually starts, only repeated CUDA query API calls do (see Execution.Model.API.4). + cuda::atomic flag = 0; + __global__ void producer() { flag.store(1); } + int main() { + cudaHostRegister(&flag, sizeof(flag)); + producer<<<1,1>>>(); + (void)cudaStreamQuery(0); + while (flag.load() == 0); + return cudaDeviceSynchronize(); + } + + .. code:: cuda + // Example: Execution.Model.API.4 + // Outcome: terminates. + // Rationale: same as Execution.Model.API.3, but this example repeatedly calls + // a CUDA query API in within the flag spin-loop, which guarantees that the device thread + // eventually makes progress. + cuda::atomic flag = 0; + __global__ void producer() { flag.store(1); } + int main() { + cudaHostRegister(&flag, sizeof(flag)); + producer<<<1,1>>>(); + while (flag.load() == 0) { + (void)cudaStreamQuery(0); + } + return cudaDeviceSynchronize(); + } + +.. _libcudacxx-extended-api-execution-model-stream-ordering: + +Stream and event ordering +------------------------- + +A device-thread shall not make progress if it is dependent on termination of one or more unterminated device-threads or tasks via CUDA streams and/or events. + +[Note: This excludes dependencies such as Programmatic Dependent Launch or Launch Completion which do not encompass termination of the dependency. - end note.] + +[Note: Tasks are also known as `Commands `__. - end note. ] + +.. dropdown:: Examples of CUDA API forward progress guarantees due to Stream and event ordering + + .. code:: cuda + // Example: Exeuction.Model.Stream.0 + // Allowed outcome: eventually, no thread makes progress. + // Rationale: while CUDA guarantees that one device thread makes progress, since there + // is no dependency between `first` and `second`, it does not guarantee which thread, + // and therefore it could always pick the device thread from `second`, which then never + // unblocks from the spin-loop. + // That is, `second` may starve `first`. + cuda::atomic flag = 0; + __global__ void first() { flag.store(1, rlx); } + __global__ void second() { while(flag.load(rlx) == 0) {} } + int main() { + cudaHostRegister(&flag, sizeof(flag)); + cudaStream_t s0, s1; + cudaStreamCreate(&s0); + cudaStreamCreate(&s1); + first<<<1,1,0,s0>>>(); + second<<<1,1,0,s1>>>(); + return cudaDeviceSynchronize(); + } + + .. code:: cuda + // Example: Exeuction.Model.Stream.1 + // Outcome: terminates. + // Rationale: same as Execution.Model.Stream.0, but this example has a stream dependency + // between first and second, which requires CUDA to run the grids in order. + cuda::atomic flag = 0; + __global__ void first() { flag.store(1, rlx); } + __global__ void second() { while(flag.load(rlx) == 0) {} } + int main() { + cudaHostRegister(&flag, sizeof(flag)); + cudaStream_t s0; + cudaStreamCreate(&s0); + first<<<1,1,0,s0>>>(); + second<<<1,1,0,s0>>>(); + return cudaDeviceSynchronize(); + } \ No newline at end of file From 1c3e1ff0b327cb55b90d65cd8a3bbd3a31989b81 Mon Sep 17 00:00:00 2001 From: gonzalobg <65027571+gonzalobg@users.noreply.github.com> Date: Mon, 3 Feb 2025 13:06:30 +0100 Subject: [PATCH 02/54] Add example --- .../libcudacxx/extended_api/execution_model.rst | 17 ++++++++++++++--- 1 file changed, 14 insertions(+), 3 deletions(-) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index 8e2f9e54a42..db9cb9bfd13 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -82,13 +82,24 @@ The implementation may assume that any **host** thread will eventually do one of .. code:: cuda // Example: Execution.Model.Device.0 + // Outcome: grid eventually terminates per device.threads.4 because the atomic object does not have automatic storage duration. + __global__ void ii(cuda::atomic_ref atom) { + if (threadIdx.x == 0) { + while(atom.load(cuda::memory_order_relaxed) == 0); + } else if (threadIdx.x == 1) { + atom.store(1, cuda::memory_order_relaxed); + } + } + + .. code:: cuda + // Example: Execution.Model.Device.1 // Allowed outcome: No thread makes progress because device threads don't support host.threads.2. __global__ void ii() { while(true) std::this_thread::yield(); } .. code:: cuda - // Example: Execution.Model.Device.1 + // Example: Execution.Model.Device.2 // Allowed outcome: No thread makes progress because device threads don't support host.threads.4 // for objects with automatic storage duration (see exception in device.threads.3). __global__ void iv() { @@ -97,7 +108,7 @@ The implementation may assume that any **host** thread will eventually do one of } .. code:: cuda - // Example: Execution.Model.Device.2 + // Example: Execution.Model.Device.3 // Allowed outcome: No thread makes progress because device threads don't support host.threads.5 // for objects with automatic storage duration (see exception in device.threads.4). __global__ void v_atomic_automatic() { @@ -106,7 +117,7 @@ The implementation may assume that any **host** thread will eventually do one of } .. code:: cuda - // Example: Execution.Model.Device.3 + // Example: Execution.Model.Device.4 // Allowed outcome: No thread makes progress because device threads don't support host.thread.6. __global void vi() { while(true) { /* empty */ } From c4f9366143d254bef5a4b6c6c86c0bd1fd4c38ce Mon Sep 17 00:00:00 2001 From: gonzalobg <65027571+gonzalobg@users.noreply.github.com> Date: Thu, 13 Feb 2025 10:37:20 +0100 Subject: [PATCH 03/54] Remove undefined in-progress --- docs/libcudacxx/extended_api/execution_model.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index db9cb9bfd13..de845f12fa2 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -128,7 +128,7 @@ The implementation may assume that any **host** thread will eventually do one of CUDA APIs --------- -Any in-progress CUDA API shall eventually either return or ensure at least one device-thread makes progress. +Any CUDA API shall eventually either return or ensure at least one device-thread makes progress. CUDA query functions (e.g. `cudaStreamQuery `__, `cudaEventQuery `__, etc.) shall not consistently @@ -251,4 +251,4 @@ A device-thread shall not make progress if it is dependent on termination of one first<<<1,1,0,s0>>>(); second<<<1,1,0,s0>>>(); return cudaDeviceSynchronize(); - } \ No newline at end of file + } From 3432fd1a5f3df2f5e8d75c8643dbdb6ea01a54c8 Mon Sep 17 00:00:00 2001 From: gonzalobg <65027571+gonzalobg@users.noreply.github.com> Date: Thu, 20 Feb 2025 14:01:19 +0100 Subject: [PATCH 04/54] Missing colon --- docs/libcudacxx/extended_api/execution_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index de845f12fa2..2615bd3ff1a 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -57,7 +57,7 @@ _ Otherwise, all device-threads in its `thread-block cluster `__ as follows +Modify `[intro.progress.1] `__ as follows: The implementation may assume that any **host** thread will eventually do one of the following: From ab8d0481825507ad303bae90e68ac2548103e7d4 Mon Sep 17 00:00:00 2001 From: gonzalobg <65027571+gonzalobg@users.noreply.github.com> Date: Thu, 20 Feb 2025 14:01:48 +0100 Subject: [PATCH 05/54] Fix typo --- docs/libcudacxx/extended_api/execution_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index 2615bd3ff1a..72c25963595 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -45,7 +45,7 @@ then CUDA C++ provides `_parallel forward progress_ [intro.progress.9] `__, then all device-threads in its grid shall eventually make progress. From 6d3121e53b4f047994255baa00e3a7b1c5a6785b Mon Sep 17 00:00:00 2001 From: gonzalobg <65027571+gonzalobg@users.noreply.github.com> Date: Thu, 20 Feb 2025 14:02:11 +0100 Subject: [PATCH 06/54] Fix typo --- docs/libcudacxx/extended_api/execution_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index 72c25963595..eaa5150f600 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -47,7 +47,7 @@ Device threads Once a device thread makes progress: -- If the device-thread is part of a `Cooperative Grid `__, +- If the device thread is part of a `Cooperative Grid `__, then all device-threads in its grid shall eventually make progress. _ Otherwise, all device-threads in its `thread-block cluster `__ shall eventually make progress. From 61b8dec5564933a37f731ed32e50554874f4cd93 Mon Sep 17 00:00:00 2001 From: gonzalobg <65027571+gonzalobg@users.noreply.github.com> Date: Thu, 20 Feb 2025 14:02:31 +0100 Subject: [PATCH 07/54] Fix typo --- docs/libcudacxx/extended_api/execution_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index eaa5150f600..dae52a42b22 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -48,7 +48,7 @@ Device threads Once a device thread makes progress: - If the device thread is part of a `Cooperative Grid `__, - then all device-threads in its grid shall eventually make progress. + then all device threads in its grid shall eventually make progress. _ Otherwise, all device-threads in its `thread-block cluster `__ shall eventually make progress. From ba2bdb578bb44e43b5f9c6eb065f8102bef1a518 Mon Sep 17 00:00:00 2001 From: gonzalobg <65027571+gonzalobg@users.noreply.github.com> Date: Thu, 20 Feb 2025 14:02:51 +0100 Subject: [PATCH 08/54] Fix typo --- docs/libcudacxx/extended_api/execution_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index dae52a42b22..85a55a05001 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -49,7 +49,7 @@ Once a device thread makes progress: - If the device thread is part of a `Cooperative Grid `__, then all device threads in its grid shall eventually make progress. -_ Otherwise, all device-threads in its `thread-block cluster `__ +_ Otherwise, all device threads in its `thread-block cluster `__ shall eventually make progress. [Note: Threads in other thread-block clusters are not guaranteed to eventually make progress. - end note.] From 1ffdf19022170422f87c9f9abfd6732eb5261b72 Mon Sep 17 00:00:00 2001 From: gonzalobg <65027571+gonzalobg@users.noreply.github.com> Date: Thu, 20 Feb 2025 14:03:56 +0100 Subject: [PATCH 09/54] Fix typo --- docs/libcudacxx/extended_api/execution_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index 85a55a05001..d1e21209366 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -55,7 +55,7 @@ _ Otherwise, all device threads in its `thread-block cluster `__ as follows: From 2a9d1f6ef1e3411cec2e4078f49ee5f8f48c3f3f Mon Sep 17 00:00:00 2001 From: gonzalobg <65027571+gonzalobg@users.noreply.github.com> Date: Thu, 20 Feb 2025 14:04:43 +0100 Subject: [PATCH 10/54] Clarify --- docs/libcudacxx/extended_api/execution_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index d1e21209366..d1025f880f5 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -57,7 +57,7 @@ _ Otherwise, all device threads in its `thread-block cluster `__ as follows: +Modify `[intro.progress.1] `__ as follows (modifications in **bold**): The implementation may assume that any **host** thread will eventually do one of the following: From 3ec07f143e0bac9a1b1726fd118667b7863e5707 Mon Sep 17 00:00:00 2001 From: gonzalobg <65027571+gonzalobg@users.noreply.github.com> Date: Thu, 20 Feb 2025 14:05:11 +0100 Subject: [PATCH 11/54] Fix typo --- docs/libcudacxx/extended_api/execution_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index d1025f880f5..69bd59576d3 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -62,7 +62,7 @@ Modify `[intro.progress.1] `__ as follow The implementation may assume that any **host** thread will eventually do one of the following: 1. terminate, - 2. invoke the function `std::his_thread::yield `__ (`[thread.thread.this] `__), + 2. invoke the function `std::this_thread::yield `__ (`[thread.thread.this] `__), 3. make a call to a library I/O function, 4. perform an access through a volatile glvalue, 5. perform a synchronization operation or an atomic operation, or From 3e5d26fa53a420983bc9e9e7e4c667947583f202 Mon Sep 17 00:00:00 2001 From: gonzalobg <65027571+gonzalobg@users.noreply.github.com> Date: Thu, 20 Feb 2025 14:06:34 +0100 Subject: [PATCH 12/54] Fix typo --- docs/libcudacxx/extended_api/execution_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index 69bd59576d3..c4859268603 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -75,7 +75,7 @@ The implementation may assume that any **host** thread will eventually do one of 3. **perform an access through a volatile glvalue except if the designated object has automatic storage duration, or** 4. **perform a synchronization operation or an atomic read operation except if the designated object has automatic storage duration.** -.. dropdown:: Examples of forward progress guarantee differences between host and device threads due to [intro.progress.1]. +.. dropdown:: Examples of forward progress guarantee differences between host and device threads due to modifications to [intro.progress.1]. The following examples refer to the itemized sub-clauses of the implementation assumptions for host and device threads above using "host.threads." and "device.threads.", respectively. From b24688ef6f3c5ec74d74dfa2d52d1456dad88087 Mon Sep 17 00:00:00 2001 From: gonzalobg <65027571+gonzalobg@users.noreply.github.com> Date: Thu, 20 Feb 2025 14:11:56 +0100 Subject: [PATCH 13/54] Better example name --- docs/libcudacxx/extended_api/execution_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index c4859268603..ac6a4718128 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -83,7 +83,7 @@ The implementation may assume that any **host** thread will eventually do one of .. code:: cuda // Example: Execution.Model.Device.0 // Outcome: grid eventually terminates per device.threads.4 because the atomic object does not have automatic storage duration. - __global__ void ii(cuda::atomic_ref atom) { + __global__ void ex0(cuda::atomic_ref atom) { if (threadIdx.x == 0) { while(atom.load(cuda::memory_order_relaxed) == 0); } else if (threadIdx.x == 1) { From a5ee4476dd8a954a4e2441ac17064a82cecaf95e Mon Sep 17 00:00:00 2001 From: gonzalobg <65027571+gonzalobg@users.noreply.github.com> Date: Thu, 20 Feb 2025 14:13:11 +0100 Subject: [PATCH 14/54] Better example name --- docs/libcudacxx/extended_api/execution_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index ac6a4718128..ab5b2c69172 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -94,7 +94,7 @@ The implementation may assume that any **host** thread will eventually do one of .. code:: cuda // Example: Execution.Model.Device.1 // Allowed outcome: No thread makes progress because device threads don't support host.threads.2. - __global__ void ii() { + __global__ void ex1() { while(true) std::this_thread::yield(); } From c32acefd1dc8e5780aec026c8ed6f2a66d672a92 Mon Sep 17 00:00:00 2001 From: gonzalobg <65027571+gonzalobg@users.noreply.github.com> Date: Thu, 20 Feb 2025 14:13:42 +0100 Subject: [PATCH 15/54] Better example name --- docs/libcudacxx/extended_api/execution_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index ab5b2c69172..d331a6f15e5 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -102,7 +102,7 @@ The implementation may assume that any **host** thread will eventually do one of // Example: Execution.Model.Device.2 // Allowed outcome: No thread makes progress because device threads don't support host.threads.4 // for objects with automatic storage duration (see exception in device.threads.3). - __global__ void iv() { + __global__ void ex2() { volatile bool True = true; while(True); } From 3add820db9b9f50191e3a33133fa9f7897d87996 Mon Sep 17 00:00:00 2001 From: gonzalobg <65027571+gonzalobg@users.noreply.github.com> Date: Thu, 20 Feb 2025 14:14:19 +0100 Subject: [PATCH 16/54] Better example name --- docs/libcudacxx/extended_api/execution_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index d331a6f15e5..3c83a3175c8 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -111,7 +111,7 @@ The implementation may assume that any **host** thread will eventually do one of // Example: Execution.Model.Device.3 // Allowed outcome: No thread makes progress because device threads don't support host.threads.5 // for objects with automatic storage duration (see exception in device.threads.4). - __global__ void v_atomic_automatic() { + __global__ void ex3() { cuda::atomic True = true; while(True.load()); } From 6c966ad999536876621731916f80cdcf232f2c11 Mon Sep 17 00:00:00 2001 From: gonzalobg <65027571+gonzalobg@users.noreply.github.com> Date: Thu, 20 Feb 2025 14:14:43 +0100 Subject: [PATCH 17/54] Better example name --- docs/libcudacxx/extended_api/execution_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index 3c83a3175c8..95e14b861b1 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -119,7 +119,7 @@ The implementation may assume that any **host** thread will eventually do one of .. code:: cuda // Example: Execution.Model.Device.4 // Allowed outcome: No thread makes progress because device threads don't support host.thread.6. - __global void vi() { + __global void ex4() { while(true) { /* empty */ } } From b33a350389d52be88746b3ce69363b9f455393b8 Mon Sep 17 00:00:00 2001 From: gonzalobg <65027571+gonzalobg@users.noreply.github.com> Date: Thu, 20 Feb 2025 14:15:45 +0100 Subject: [PATCH 18/54] Fix typo --- docs/libcudacxx/extended_api/execution_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index 95e14b861b1..1493afc297e 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -128,7 +128,7 @@ The implementation may assume that any **host** thread will eventually do one of CUDA APIs --------- -Any CUDA API shall eventually either return or ensure at least one device-thread makes progress. +Any CUDA API shall eventually either return or ensure at least one device thread makes progress. CUDA query functions (e.g. `cudaStreamQuery `__, `cudaEventQuery `__, etc.) shall not consistently From 863deb52650c281db4e2424d2f6c54ff6900d111 Mon Sep 17 00:00:00 2001 From: gonzalobg <65027571+gonzalobg@users.noreply.github.com> Date: Thu, 20 Feb 2025 14:16:09 +0100 Subject: [PATCH 19/54] Fix typo --- docs/libcudacxx/extended_api/execution_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index 1493afc297e..763e1086983 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -132,7 +132,7 @@ Any CUDA API shall eventually either return or ensure at least one device thread CUDA query functions (e.g. `cudaStreamQuery `__, `cudaEventQuery `__, etc.) shall not consistently -return ``cudaErrorNotReady`` without a device-thread making progress. +return ``cudaErrorNotReady`` without a device thread making progress. [Note: The device-thread need not be "related" to the API call, e.g., an API operating on one stream or process may ensure progress of a device-thread on another stream or process. - end note.] From d935dd8658a97fbc3222e04fb9022e09389ac619 Mon Sep 17 00:00:00 2001 From: gonzalobg <65027571+gonzalobg@users.noreply.github.com> Date: Thu, 20 Feb 2025 14:16:34 +0100 Subject: [PATCH 20/54] Fix typo --- docs/libcudacxx/extended_api/execution_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index 763e1086983..4ca2f2c65ed 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -134,7 +134,7 @@ CUDA query functions (e.g. `cudaStreamQuery `__, etc.) shall not consistently return ``cudaErrorNotReady`` without a device thread making progress. -[Note: The device-thread need not be "related" to the API call, e.g., an API operating on one stream or process may ensure progress of a device-thread on another stream or process. - end note.] +[Note: The device thread need not be "related" to the API call, e.g., an API operating on one stream or process may ensure progress of a device-thread on another stream or process. - end note.] [Note: A simple but not sufficient method to test workloads for CUDA API Forward Progress conformance is to run them with following environment variables set: ``CUDA_DEVICE_MAX_CONNECTIONS=1 CUDA_LAUNCH_BLOCKING=1`` - end note.] From 494ec392114f4a7898f23fc5b9570dbce6783a73 Mon Sep 17 00:00:00 2001 From: gonzalobg <65027571+gonzalobg@users.noreply.github.com> Date: Thu, 20 Feb 2025 14:17:27 +0100 Subject: [PATCH 21/54] Fix typo --- docs/libcudacxx/extended_api/execution_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index 4ca2f2c65ed..0e364e6e5ef 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -134,7 +134,7 @@ CUDA query functions (e.g. `cudaStreamQuery `__, etc.) shall not consistently return ``cudaErrorNotReady`` without a device thread making progress. -[Note: The device thread need not be "related" to the API call, e.g., an API operating on one stream or process may ensure progress of a device-thread on another stream or process. - end note.] +[Note: The device thread need not be "related" to the API call, e.g., an API operating on one stream or process may ensure progress of a device thread on another stream or process. - end note.] [Note: A simple but not sufficient method to test workloads for CUDA API Forward Progress conformance is to run them with following environment variables set: ``CUDA_DEVICE_MAX_CONNECTIONS=1 CUDA_LAUNCH_BLOCKING=1`` - end note.] From ce1e4e133a853d573477d11bcd71d4b811c92f1d Mon Sep 17 00:00:00 2001 From: gonzalobg <65027571+gonzalobg@users.noreply.github.com> Date: Thu, 20 Feb 2025 14:27:45 +0100 Subject: [PATCH 22/54] Add note to clarify bugs vs cxx being too strict --- docs/libcudacxx/extended_api/execution_model.rst | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index 0e364e6e5ef..4600f45eb36 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -75,6 +75,14 @@ The implementation may assume that any **host** thread will eventually do one of 3. **perform an access through a volatile glvalue except if the designated object has automatic storage duration, or** 4. **perform a synchronization operation or an atomic read operation except if the designated object has automatic storage duration.** + [Note: We consider some of the current limitations of device threads with respect to host + threads implementation bugs, e.g., the undefined behavior introduced by programs that + eventually only perform volatile or atomic operations on objects with automatic storage + duration. However, we consider that some of the limitations are due to the C++ standard + currently being too strict, and this resulting in the performance of common workloads + being degraded to provide forward progress to “useless” programs, e.g., the undefined + behavior introduced by programs that eventually only perform atomic write operations + or fences. - end note.] .. dropdown:: Examples of forward progress guarantee differences between host and device threads due to modifications to [intro.progress.1]. The following examples refer to the itemized sub-clauses of the implementation assumptions for host and device threads above From adba14e15672cbd42e9add91e4c3cc5fa178ad21 Mon Sep 17 00:00:00 2001 From: gonzalobg <65027571+gonzalobg@users.noreply.github.com> Date: Thu, 20 Feb 2025 14:28:22 +0100 Subject: [PATCH 23/54] Fix typo --- docs/libcudacxx/extended_api/execution_model.rst | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index 4600f45eb36..0af0bc4d74a 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -83,6 +83,7 @@ The implementation may assume that any **host** thread will eventually do one of being degraded to provide forward progress to “useless” programs, e.g., the undefined behavior introduced by programs that eventually only perform atomic write operations or fences. - end note.] + .. dropdown:: Examples of forward progress guarantee differences between host and device threads due to modifications to [intro.progress.1]. The following examples refer to the itemized sub-clauses of the implementation assumptions for host and device threads above From bc876fa9a48fa035e8e0aa30b35efbada1745a47 Mon Sep 17 00:00:00 2001 From: gonzalobg <65027571+gonzalobg@users.noreply.github.com> Date: Thu, 20 Feb 2025 14:29:09 +0100 Subject: [PATCH 24/54] Fix spacing --- docs/libcudacxx/extended_api/execution_model.rst | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index 0af0bc4d74a..18399e7f80e 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -75,14 +75,14 @@ The implementation may assume that any **host** thread will eventually do one of 3. **perform an access through a volatile glvalue except if the designated object has automatic storage duration, or** 4. **perform a synchronization operation or an atomic read operation except if the designated object has automatic storage duration.** - [Note: We consider some of the current limitations of device threads with respect to host - threads implementation bugs, e.g., the undefined behavior introduced by programs that - eventually only perform volatile or atomic operations on objects with automatic storage - duration. However, we consider that some of the limitations are due to the C++ standard - currently being too strict, and this resulting in the performance of common workloads - being degraded to provide forward progress to “useless” programs, e.g., the undefined - behavior introduced by programs that eventually only perform atomic write operations - or fences. - end note.] + [Note: We consider some of the current limitations of device threads with respect to host + threads implementation bugs, e.g., the undefined behavior introduced by programs that + eventually only perform volatile or atomic operations on objects with automatic storage + duration. However, we consider that some of the limitations are due to the C++ standard + currently being too strict, and this resulting in the performance of common workloads + being degraded to provide forward progress to “useless” programs, e.g., the undefined + behavior introduced by programs that eventually only perform atomic write operations + or fences. - end note.] .. dropdown:: Examples of forward progress guarantee differences between host and device threads due to modifications to [intro.progress.1]. From 9bbec18d93b985c5b8ba0766ce420c3cb725580e Mon Sep 17 00:00:00 2001 From: gonzalobg <65027571+gonzalobg@users.noreply.github.com> Date: Thu, 20 Feb 2025 14:29:50 +0100 Subject: [PATCH 25/54] Fix typo --- docs/libcudacxx/extended_api/execution_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index 18399e7f80e..43110d0a7f3 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -4,7 +4,7 @@ Execution model =============== CUDA C++ aims to provide `_parallel forward progress_ [intro.progress.9] `__ -for all device threads of execution, making the parallelization of pre-existing C++ applications with CUDA C++ straight-forward. +for all device threads of execution, making the parallelization of pre-existing C++ applications with CUDA C++ straightforward. .. dropdown:: [intro.progress] From f64a5b4d991c43d24a3096587575452e8e30364e Mon Sep 17 00:00:00 2001 From: "pre-commit-ci[bot]" <66853113+pre-commit-ci[bot]@users.noreply.github.com> Date: Thu, 20 Feb 2025 13:34:54 +0000 Subject: [PATCH 26/54] [pre-commit.ci] auto code formatting --- .../extended_api/execution_model.rst | 32 +++++++++---------- 1 file changed, 16 insertions(+), 16 deletions(-) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index 43110d0a7f3..ec074f9237a 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -3,21 +3,21 @@ Execution model =============== -CUDA C++ aims to provide `_parallel forward progress_ [intro.progress.9] `__ +CUDA C++ aims to provide `_parallel forward progress_ [intro.progress.9] `__ for all device threads of execution, making the parallelization of pre-existing C++ applications with CUDA C++ straightforward. .. dropdown:: [intro.progress] - `[intro.progress.7] `__: For a thread of execution - providing `concurrent forward progress guarantees `__, + providing `concurrent forward progress guarantees `__, the implementation ensures that the thread will eventually make progress for as long as it has not terminated. - [Note 5: This applies regardless of whether or not other threads of execution (if any) have been or are making progress. + [Note 5: This applies regardless of whether or not other threads of execution (if any) have been or are making progress. To eventually fulfill this requirement means that this will happen in an unspecified but finite amount of time. — end note] - `[intro.progress.9] `__: For a thread of execution providing `parallel forward progress guarantees `__, the implementation is not required to ensure that - the thread will eventually make progress if it has not yet executed any execution step; once this thread has executed a step, + the thread will eventually make progress if it has not yet executed any execution step; once this thread has executed a step, it provides concurrent forward progress guarantees. [Note 6: This does not specify a requirement for when to start this thread of execution, which will typically be specified by the entity @@ -33,7 +33,7 @@ Host threads The forward-progress provided by threads of execution created by the host implementation to execute `main `__, `std::thread `__, and `std::jthread https://en.cppreference.com/w/cpp/thread/jthread>`__ is implementation-defined behavior of the host -implementation `[intro.progress] `__. +implementation `[intro.progress] `__. General-purpose host implementations should provide _concurrent forward progress_. If the host implementation provides `_concurrent forward progress_ [intro.progress.7] `__, @@ -49,9 +49,9 @@ Once a device thread makes progress: - If the device thread is part of a `Cooperative Grid `__, then all device threads in its grid shall eventually make progress. -_ Otherwise, all device threads in its `thread-block cluster `__ +_ Otherwise, all device threads in its `thread-block cluster `__ shall eventually make progress. - + [Note: Threads in other thread-block clusters are not guaranteed to eventually make progress. - end note.] [Note: This implies that all device-threads in its thread block shall eventually make progress. - end note.] @@ -75,9 +75,9 @@ The implementation may assume that any **host** thread will eventually do one of 3. **perform an access through a volatile glvalue except if the designated object has automatic storage duration, or** 4. **perform a synchronization operation or an atomic read operation except if the designated object has automatic storage duration.** - [Note: We consider some of the current limitations of device threads with respect to host - threads implementation bugs, e.g., the undefined behavior introduced by programs that - eventually only perform volatile or atomic operations on objects with automatic storage + [Note: We consider some of the current limitations of device threads with respect to host + threads implementation bugs, e.g., the undefined behavior introduced by programs that + eventually only perform volatile or atomic operations on objects with automatic storage duration. However, we consider that some of the limitations are due to the C++ standard currently being too strict, and this resulting in the performance of common workloads being degraded to provide forward progress to “useless” programs, e.g., the undefined @@ -140,7 +140,7 @@ CUDA APIs Any CUDA API shall eventually either return or ensure at least one device thread makes progress. CUDA query functions (e.g. `cudaStreamQuery `__, -`cudaEventQuery `__, etc.) shall not consistently +`cudaEventQuery `__, etc.) shall not consistently return ``cudaErrorNotReady`` without a device thread making progress. [Note: The device thread need not be "related" to the API call, e.g., an API operating on one stream or process may ensure progress of a device thread on another stream or process. - end note.] @@ -150,7 +150,7 @@ return ``cudaErrorNotReady`` without a device thread making progress. .. dropdown:: Examples of CUDA API forward progress guarantees. .. code:: cuda - // Example: Execution.Model.API.1 + // Example: Execution.Model.API.1 // Outcome: if device empty, terminates and returns cudaSuccess. // Rationale: CUDA guarantees that if the device is empty: // - `cudaDeviceSynchronize` eventually ensures that at least one device-thread makes progress, which implies that eventually `hello_world` grid and one of its device-threads start. @@ -205,7 +205,7 @@ return ``cudaErrorNotReady`` without a device thread making progress. int main() { cudaHostRegister(&flag, sizeof(flag)); producer<<<1,1>>>(); - while (flag.load() == 0) { + while (flag.load() == 0) { (void)cudaStreamQuery(0); } return cudaDeviceSynchronize(); @@ -229,7 +229,7 @@ A device-thread shall not make progress if it is dependent on termination of one // Allowed outcome: eventually, no thread makes progress. // Rationale: while CUDA guarantees that one device thread makes progress, since there // is no dependency between `first` and `second`, it does not guarantee which thread, - // and therefore it could always pick the device thread from `second`, which then never + // and therefore it could always pick the device thread from `second`, which then never // unblocks from the spin-loop. // That is, `second` may starve `first`. cuda::atomic flag = 0; @@ -238,7 +238,7 @@ A device-thread shall not make progress if it is dependent on termination of one int main() { cudaHostRegister(&flag, sizeof(flag)); cudaStream_t s0, s1; - cudaStreamCreate(&s0); + cudaStreamCreate(&s0); cudaStreamCreate(&s1); first<<<1,1,0,s0>>>(); second<<<1,1,0,s1>>>(); @@ -256,7 +256,7 @@ A device-thread shall not make progress if it is dependent on termination of one int main() { cudaHostRegister(&flag, sizeof(flag)); cudaStream_t s0; - cudaStreamCreate(&s0); + cudaStreamCreate(&s0); first<<<1,1,0,s0>>>(); second<<<1,1,0,s0>>>(); return cudaDeviceSynchronize(); From 1c05413b8373350c6f41df73944d9d9f729b8c71 Mon Sep 17 00:00:00 2001 From: gonzalobg <65027571+gonzalobg@users.noreply.github.com> Date: Thu, 20 Feb 2025 14:35:30 +0100 Subject: [PATCH 27/54] Fix typo --- docs/libcudacxx/extended_api/execution_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index ec074f9237a..edce4b6bc73 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -49,7 +49,7 @@ Once a device thread makes progress: - If the device thread is part of a `Cooperative Grid `__, then all device threads in its grid shall eventually make progress. -_ Otherwise, all device threads in its `thread-block cluster `__ +- Otherwise, all device threads in its `thread-block cluster `__ shall eventually make progress. [Note: Threads in other thread-block clusters are not guaranteed to eventually make progress. - end note.] From d7e6a3f5ffb43411d3a333c0deeb20c4ba8c9bbe Mon Sep 17 00:00:00 2001 From: gonzalobg <65027571+gonzalobg@users.noreply.github.com> Date: Thu, 20 Feb 2025 14:37:34 +0100 Subject: [PATCH 28/54] Fix typo --- docs/libcudacxx/extended_api/execution_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index edce4b6bc73..f6eaec6865f 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -225,7 +225,7 @@ A device-thread shall not make progress if it is dependent on termination of one .. dropdown:: Examples of CUDA API forward progress guarantees due to Stream and event ordering .. code:: cuda - // Example: Exeuction.Model.Stream.0 + // Example: Execution.Model.Stream.0 // Allowed outcome: eventually, no thread makes progress. // Rationale: while CUDA guarantees that one device thread makes progress, since there // is no dependency between `first` and `second`, it does not guarantee which thread, From fd466053f4dc51249fcfdd3ffe150fe1fcf186d4 Mon Sep 17 00:00:00 2001 From: gonzalobg <65027571+gonzalobg@users.noreply.github.com> Date: Thu, 20 Feb 2025 14:37:58 +0100 Subject: [PATCH 29/54] Fix typo --- docs/libcudacxx/extended_api/execution_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index f6eaec6865f..ef666b9d9ce 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -246,7 +246,7 @@ A device-thread shall not make progress if it is dependent on termination of one } .. code:: cuda - // Example: Exeuction.Model.Stream.1 + // Example: Execution.Model.Stream.1 // Outcome: terminates. // Rationale: same as Execution.Model.Stream.0, but this example has a stream dependency // between first and second, which requires CUDA to run the grids in order. From 273d691ce0ff28eba0808c3da6bcab34b71b1d6b Mon Sep 17 00:00:00 2001 From: gonzalobg <65027571+gonzalobg@users.noreply.github.com> Date: Thu, 20 Feb 2025 14:56:34 +0100 Subject: [PATCH 30/54] Rephrase --- docs/libcudacxx/extended_api/execution_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index ef666b9d9ce..61902bd82ef 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -4,7 +4,7 @@ Execution model =============== CUDA C++ aims to provide `_parallel forward progress_ [intro.progress.9] `__ -for all device threads of execution, making the parallelization of pre-existing C++ applications with CUDA C++ straightforward. +for all device threads of execution, facilitating the parallelization of pre-existing C++ applications with CUDA C++. .. dropdown:: [intro.progress] From 48532d10a702384446bf5f5bb7ed2b50beded3d6 Mon Sep 17 00:00:00 2001 From: gonzalobg <65027571+gonzalobg@users.noreply.github.com> Date: Thu, 20 Feb 2025 14:59:28 +0100 Subject: [PATCH 31/54] Rephrase --- docs/libcudacxx/extended_api/execution_model.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index 61902bd82ef..3726d3cd1b8 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -47,8 +47,8 @@ Device threads Once a device thread makes progress: -- If the device thread is part of a `Cooperative Grid `__, - then all device threads in its grid shall eventually make progress. +- If it is part of a `Cooperative Grid `__, + all device threads in its grid shall eventually make progress. - Otherwise, all device threads in its `thread-block cluster `__ shall eventually make progress. From 3c40ed80be05c44ac1b311684b02fd0ad3df1aa8 Mon Sep 17 00:00:00 2001 From: gonzalobg <65027571+gonzalobg@users.noreply.github.com> Date: Thu, 20 Feb 2025 15:00:51 +0100 Subject: [PATCH 32/54] Typos --- docs/libcudacxx/extended_api/execution_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index 3726d3cd1b8..bbbeda97827 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -53,7 +53,7 @@ Once a device thread makes progress: shall eventually make progress. [Note: Threads in other thread-block clusters are not guaranteed to eventually make progress. - end note.] - [Note: This implies that all device-threads in its thread block shall eventually make progress. - end note.] + [Note: This implies that all device threads within its thread block shall eventually make progress. - end note.] The order in which device threads eventually get a chance to make progress is _unspecified_. From 1e5ff6da6c522a3f48849cb6bcbf5b9ebf0cf474 Mon Sep 17 00:00:00 2001 From: gonzalobg <65027571+gonzalobg@users.noreply.github.com> Date: Thu, 20 Feb 2025 15:01:37 +0100 Subject: [PATCH 33/54] Rephrase --- docs/libcudacxx/extended_api/execution_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index bbbeda97827..9b15e8150f8 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -55,7 +55,7 @@ Once a device thread makes progress: [Note: Threads in other thread-block clusters are not guaranteed to eventually make progress. - end note.] [Note: This implies that all device threads within its thread block shall eventually make progress. - end note.] -The order in which device threads eventually get a chance to make progress is _unspecified_. +The order in which device threads make progress is _unspecified_. Modify `[intro.progress.1] `__ as follows (modifications in **bold**): From 6e02dbfbe37bd1eabe81adbee974c21ab56df153 Mon Sep 17 00:00:00 2001 From: gonzalobg <65027571+gonzalobg@users.noreply.github.com> Date: Thu, 20 Feb 2025 15:14:15 +0100 Subject: [PATCH 34/54] Rephrase --- docs/libcudacxx/extended_api/execution_model.rst | 14 ++++++-------- 1 file changed, 6 insertions(+), 8 deletions(-) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index 9b15e8150f8..799fc1743b0 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -75,14 +75,12 @@ The implementation may assume that any **host** thread will eventually do one of 3. **perform an access through a volatile glvalue except if the designated object has automatic storage duration, or** 4. **perform a synchronization operation or an atomic read operation except if the designated object has automatic storage duration.** - [Note: We consider some of the current limitations of device threads with respect to host - threads implementation bugs, e.g., the undefined behavior introduced by programs that - eventually only perform volatile or atomic operations on objects with automatic storage - duration. However, we consider that some of the limitations are due to the C++ standard - currently being too strict, and this resulting in the performance of common workloads - being degraded to provide forward progress to “useless” programs, e.g., the undefined - behavior introduced by programs that eventually only perform atomic write operations - or fences. - end note.] + [Note: Some current limitations of device threads relative to host threads are considered + implementation defects, such as the undefined behavior that arises from device threads + that eventually only performs volatile or atomic operations on automatic storage duration objects. + However, other limitations stem from the strictness of the C++ standard, e.g., providing + forward progress to programs that eventually only perform atomic writes or fences degrades overall + performance for little practical benefit. - end note.] .. dropdown:: Examples of forward progress guarantee differences between host and device threads due to modifications to [intro.progress.1]. From d4dbe7e825c19ec1e21196b430d45c5a947be717 Mon Sep 17 00:00:00 2001 From: gonzalobg <65027571+gonzalobg@users.noreply.github.com> Date: Thu, 20 Feb 2025 15:14:51 +0100 Subject: [PATCH 35/54] Typo --- docs/libcudacxx/extended_api/execution_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index 799fc1743b0..24878a7f868 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -77,7 +77,7 @@ The implementation may assume that any **host** thread will eventually do one of [Note: Some current limitations of device threads relative to host threads are considered implementation defects, such as the undefined behavior that arises from device threads - that eventually only performs volatile or atomic operations on automatic storage duration objects. + that eventually only perform volatile or atomic operations on automatic storage duration objects. However, other limitations stem from the strictness of the C++ standard, e.g., providing forward progress to programs that eventually only perform atomic writes or fences degrades overall performance for little practical benefit. - end note.] From ee99ba2f9587d388dadf2c991112283165f978fd Mon Sep 17 00:00:00 2001 From: gonzalobg <65027571+gonzalobg@users.noreply.github.com> Date: Thu, 20 Feb 2025 15:17:41 +0100 Subject: [PATCH 36/54] Rephrase --- docs/libcudacxx/extended_api/execution_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index 24878a7f868..e9eb7543d5d 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -218,7 +218,7 @@ A device-thread shall not make progress if it is dependent on termination of one [Note: This excludes dependencies such as Programmatic Dependent Launch or Launch Completion which do not encompass termination of the dependency. - end note.] -[Note: Tasks are also known as `Commands `__. - end note. ] +[Note: Tasks are also referred to as `Commands `__. - end note. ] .. dropdown:: Examples of CUDA API forward progress guarantees due to Stream and event ordering From 2c7a31d5b852c60966bf0db197ace9343e7bc85b Mon Sep 17 00:00:00 2001 From: gonzalobg <65027571+gonzalobg@users.noreply.github.com> Date: Fri, 21 Feb 2025 11:50:26 +0100 Subject: [PATCH 37/54] Fix incorrect namespace --- docs/libcudacxx/extended_api/execution_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index e9eb7543d5d..adf3cdd5d1d 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -102,7 +102,7 @@ The implementation may assume that any **host** thread will eventually do one of // Example: Execution.Model.Device.1 // Allowed outcome: No thread makes progress because device threads don't support host.threads.2. __global__ void ex1() { - while(true) std::this_thread::yield(); + while(true) cuda::std::this_thread::yield(); } .. code:: cuda From 753c2a987921c9bd08e63f113eb7f820dbaf25dd Mon Sep 17 00:00:00 2001 From: gonzalobg <65027571+gonzalobg@users.noreply.github.com> Date: Fri, 21 Feb 2025 12:04:51 +0100 Subject: [PATCH 38/54] Fix typo Co-authored-by: Mark Hoemmen --- docs/libcudacxx/extended_api/execution_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index adf3cdd5d1d..3a2fdea18f1 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -218,7 +218,7 @@ A device-thread shall not make progress if it is dependent on termination of one [Note: This excludes dependencies such as Programmatic Dependent Launch or Launch Completion which do not encompass termination of the dependency. - end note.] -[Note: Tasks are also referred to as `Commands `__. - end note. ] +[Note: Tasks are also referred to as `Commands `__. - end note.] .. dropdown:: Examples of CUDA API forward progress guarantees due to Stream and event ordering From 4851f6acd3470e4e58b12970f33f60cae5bf57b8 Mon Sep 17 00:00:00 2001 From: gonzalobg <65027571+gonzalobg@users.noreply.github.com> Date: Fri, 21 Feb 2025 12:05:05 +0100 Subject: [PATCH 39/54] Fix typo Co-authored-by: Mark Hoemmen --- docs/libcudacxx/extended_api/execution_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index 3a2fdea18f1..8c5c1f99da2 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -220,7 +220,7 @@ A device-thread shall not make progress if it is dependent on termination of one [Note: Tasks are also referred to as `Commands `__. - end note.] -.. dropdown:: Examples of CUDA API forward progress guarantees due to Stream and event ordering +.. dropdown:: Examples of CUDA API forward progress guarantees due to stream and event ordering .. code:: cuda // Example: Execution.Model.Stream.0 From d7de9b06004fa47492372d05bec3f87752b815e4 Mon Sep 17 00:00:00 2001 From: gonzalobg <65027571+gonzalobg@users.noreply.github.com> Date: Fri, 21 Feb 2025 12:05:40 +0100 Subject: [PATCH 40/54] Fix typo --- docs/libcudacxx/extended_api/execution_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index 8c5c1f99da2..16d0650303f 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -232,7 +232,7 @@ A device-thread shall not make progress if it is dependent on termination of one // That is, `second` may starve `first`. cuda::atomic flag = 0; __global__ void first() { flag.store(1, rlx); } - __global__ void second() { while(flag.load(rlx) == 0) {} } + __global__ void second() { while(flag.load(cuda::memory_order_relaxed) == 0) {} } int main() { cudaHostRegister(&flag, sizeof(flag)); cudaStream_t s0, s1; From 0af7452a064f13bb0700e5891c1024d9dc7ffc32 Mon Sep 17 00:00:00 2001 From: gonzalobg <65027571+gonzalobg@users.noreply.github.com> Date: Fri, 21 Feb 2025 12:06:07 +0100 Subject: [PATCH 41/54] Fix typo --- docs/libcudacxx/extended_api/execution_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index 16d0650303f..3f5696947c5 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -231,7 +231,7 @@ A device-thread shall not make progress if it is dependent on termination of one // unblocks from the spin-loop. // That is, `second` may starve `first`. cuda::atomic flag = 0; - __global__ void first() { flag.store(1, rlx); } + __global__ void first() { flag.store(1, cuda::memory_order_relaxed); } __global__ void second() { while(flag.load(cuda::memory_order_relaxed) == 0) {} } int main() { cudaHostRegister(&flag, sizeof(flag)); From 41f57bb2202ec8701cb925e0e86b004898f6cba7 Mon Sep 17 00:00:00 2001 From: gonzalobg <65027571+gonzalobg@users.noreply.github.com> Date: Fri, 21 Feb 2025 12:06:40 +0100 Subject: [PATCH 42/54] Fix typos --- docs/libcudacxx/extended_api/execution_model.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index 3f5696947c5..e0bb14abd04 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -249,8 +249,8 @@ A device-thread shall not make progress if it is dependent on termination of one // Rationale: same as Execution.Model.Stream.0, but this example has a stream dependency // between first and second, which requires CUDA to run the grids in order. cuda::atomic flag = 0; - __global__ void first() { flag.store(1, rlx); } - __global__ void second() { while(flag.load(rlx) == 0) {} } + __global__ void first() { flag.store(1, cuda::memory_order_relaxed); } + __global__ void second() { while(flag.load(cuda::memory_order_relaxed) == 0) {} } int main() { cudaHostRegister(&flag, sizeof(flag)); cudaStream_t s0; From bd48ba2e395d6346fa00939e9f878fe06bb51c82 Mon Sep 17 00:00:00 2001 From: gonzalobg <65027571+gonzalobg@users.noreply.github.com> Date: Fri, 21 Feb 2025 17:35:37 +0100 Subject: [PATCH 43/54] Clarify additions and modifications to C++ standard. --- docs/libcudacxx/extended_api/execution_model.rst | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index e0bb14abd04..82f8ed45b7d 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -27,6 +27,10 @@ for all device threads of execution, facilitating the parallelization of pre-exi .. _libcudacxx-extended-api-execution-model-host-threads: +The CUDA C++ Programming Language is an extension of the C++ Programming Language. +This section documents the modifications and extensions to the `[intro.progress] `__ section of the current `ISO International Standard ISO/IEC 14882 – Programming Language C++ `__ draft. +Modified sections are called out explicitly and their diff is shown in **bold**. +All other sections are additions. Host threads ------------ From fb3b5737fe664a404c6f1ab2608b254ec5af913b Mon Sep 17 00:00:00 2001 From: gonzalobg <65027571+gonzalobg@users.noreply.github.com> Date: Fri, 21 Feb 2025 17:36:17 +0100 Subject: [PATCH 44/54] Fix typo Co-authored-by: Mark Hoemmen --- docs/libcudacxx/extended_api/execution_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index 82f8ed45b7d..c394ca816e1 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -34,7 +34,7 @@ All other sections are additions. Host threads ------------ -The forward-progress provided by threads of execution created by the host implementation to +The forward progress provided by threads of execution created by the host implementation to execute `main `__, `std::thread `__, and `std::jthread https://en.cppreference.com/w/cpp/thread/jthread>`__ is implementation-defined behavior of the host implementation `[intro.progress] `__. From 39fc26e84ba5f5d14dd088df661e873f92a5b6bb Mon Sep 17 00:00:00 2001 From: gonzalobg <65027571+gonzalobg@users.noreply.github.com> Date: Fri, 21 Feb 2025 19:30:06 +0100 Subject: [PATCH 45/54] Rephrase note Co-authored-by: Mark Hoemmen --- .../libcudacxx/extended_api/execution_model.rst | 17 +++++++++++------ 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index c394ca816e1..80519548dad 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -79,12 +79,17 @@ The implementation may assume that any **host** thread will eventually do one of 3. **perform an access through a volatile glvalue except if the designated object has automatic storage duration, or** 4. **perform a synchronization operation or an atomic read operation except if the designated object has automatic storage duration.** - [Note: Some current limitations of device threads relative to host threads are considered - implementation defects, such as the undefined behavior that arises from device threads - that eventually only perform volatile or atomic operations on automatic storage duration objects. - However, other limitations stem from the strictness of the C++ standard, e.g., providing - forward progress to programs that eventually only perform atomic writes or fences degrades overall - performance for little practical benefit. - end note.] + [Note: Some current limitations of device threads relative to host threads + are implementation defects known to us, that we may fix over time. + Examples include the undefined behavior that arises from device threads + that eventually only perform volatile or atomic operations + on automatic storage duration objects. + However, other limitations of device threads relative to host threads + are intentional choices. They enable performance optimizations + that would not be possible if device threads followed the C++ Standard strictly. + For example, providing forward progress to programs + that eventually only perform atomic writes or fences + would degrade overall performance for little practical benefit. - end note.] .. dropdown:: Examples of forward progress guarantee differences between host and device threads due to modifications to [intro.progress.1]. From 3fbab9d1cb6cacbe136ab6207363408a4ab21b40 Mon Sep 17 00:00:00 2001 From: gonzalobg <65027571+gonzalobg@users.noreply.github.com> Date: Fri, 21 Feb 2025 19:31:08 +0100 Subject: [PATCH 46/54] Rephrase Co-authored-by: Mark Hoemmen --- docs/libcudacxx/extended_api/execution_model.rst | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index 80519548dad..88eb7e9f3bb 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -158,7 +158,8 @@ return ``cudaErrorNotReady`` without a device thread making progress. .. code:: cuda // Example: Execution.Model.API.1 - // Outcome: if device empty, terminates and returns cudaSuccess. + // Outcome: if no other device threads (e.g., from other processes) are making progress, + // this program terminates and returns cudaSuccess. // Rationale: CUDA guarantees that if the device is empty: // - `cudaDeviceSynchronize` eventually ensures that at least one device-thread makes progress, which implies that eventually `hello_world` grid and one of its device-threads start. // - All thread-block threads eventually start (due to "if a device thread makes progress, all other threads in its thread-block cluster eventually make progress"). From ecae57dde3dc851359e48752134ac6261cab57de Mon Sep 17 00:00:00 2001 From: "pre-commit-ci[bot]" <66853113+pre-commit-ci[bot]@users.noreply.github.com> Date: Fri, 21 Feb 2025 18:35:39 +0000 Subject: [PATCH 47/54] [pre-commit.ci] auto code formatting --- docs/libcudacxx/extended_api/execution_model.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index 88eb7e9f3bb..d3f1a010149 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -28,8 +28,8 @@ for all device threads of execution, facilitating the parallelization of pre-exi .. _libcudacxx-extended-api-execution-model-host-threads: The CUDA C++ Programming Language is an extension of the C++ Programming Language. -This section documents the modifications and extensions to the `[intro.progress] `__ section of the current `ISO International Standard ISO/IEC 14882 – Programming Language C++ `__ draft. -Modified sections are called out explicitly and their diff is shown in **bold**. +This section documents the modifications and extensions to the `[intro.progress] `__ section of the current `ISO International Standard ISO/IEC 14882 – Programming Language C++ `__ draft. +Modified sections are called out explicitly and their diff is shown in **bold**. All other sections are additions. Host threads ------------ From bde9281940e8599dfd82235aab1655c572ad765e Mon Sep 17 00:00:00 2001 From: gonzalobg <65027571+gonzalobg@users.noreply.github.com> Date: Mon, 24 Feb 2025 13:19:53 +0100 Subject: [PATCH 48/54] Rephrase --- docs/libcudacxx/extended_api/execution_model.rst | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index d3f1a010149..3ee9f137da8 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -152,7 +152,9 @@ return ``cudaErrorNotReady`` without a device thread making progress. [Note: The device thread need not be "related" to the API call, e.g., an API operating on one stream or process may ensure progress of a device thread on another stream or process. - end note.] -[Note: A simple but not sufficient method to test workloads for CUDA API Forward Progress conformance is to run them with following environment variables set: ``CUDA_DEVICE_MAX_CONNECTIONS=1 CUDA_LAUNCH_BLOCKING=1`` - end note.] +[Note: A simple but not sufficient method to test a program for CUDA API Forward Progress conformance is to run them with following environment variables set: ``CUDA_DEVICE_MAX_CONNECTIONS=1 CUDA_LAUNCH_BLOCKING=1``, and then check that the program still terminates. +If it does not, the program has a bug. +This method is not sufficient because it does not catch all Forward Progress bugs, but it does catch many such bugs. - end note.] .. dropdown:: Examples of CUDA API forward progress guarantees. From cb7906da27c9418c738dc6115dc98c884b1b0c34 Mon Sep 17 00:00:00 2001 From: gonzalobg <65027571+gonzalobg@users.noreply.github.com> Date: Mon, 24 Feb 2025 13:20:37 +0100 Subject: [PATCH 49/54] Rephrase --- docs/libcudacxx/extended_api/execution_model.rst | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index 3ee9f137da8..7fdc291848a 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -226,11 +226,10 @@ This method is not sufficient because it does not catch all Forward Progress bug Stream and event ordering ------------------------- -A device-thread shall not make progress if it is dependent on termination of one or more unterminated device-threads or tasks via CUDA streams and/or events. +A device thread shall not start making progress until all its dependencies have completed. -[Note: This excludes dependencies such as Programmatic Dependent Launch or Launch Completion which do not encompass termination of the dependency. - end note.] - -[Note: Tasks are also referred to as `Commands `__. - end note.] +[Note: Dependencies that prevent device threads from starting to make progress can be created, for example, via CUDA Stream `Command `__s. +These may include dependencies on the completion of, among others, `CUDA Events `__ and `CUDA Kernels `__. - end note.] .. dropdown:: Examples of CUDA API forward progress guarantees due to stream and event ordering From ac61f7a5f9731638715aab0567187ef0c0fd6ae1 Mon Sep 17 00:00:00 2001 From: "pre-commit-ci[bot]" <66853113+pre-commit-ci[bot]@users.noreply.github.com> Date: Mon, 24 Feb 2025 12:32:46 +0000 Subject: [PATCH 50/54] [pre-commit.ci] auto code formatting --- docs/libcudacxx/extended_api/execution_model.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index 7fdc291848a..350a9fed64a 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -152,7 +152,7 @@ return ``cudaErrorNotReady`` without a device thread making progress. [Note: The device thread need not be "related" to the API call, e.g., an API operating on one stream or process may ensure progress of a device thread on another stream or process. - end note.] -[Note: A simple but not sufficient method to test a program for CUDA API Forward Progress conformance is to run them with following environment variables set: ``CUDA_DEVICE_MAX_CONNECTIONS=1 CUDA_LAUNCH_BLOCKING=1``, and then check that the program still terminates. +[Note: A simple but not sufficient method to test a program for CUDA API Forward Progress conformance is to run them with following environment variables set: ``CUDA_DEVICE_MAX_CONNECTIONS=1 CUDA_LAUNCH_BLOCKING=1``, and then check that the program still terminates. If it does not, the program has a bug. This method is not sufficient because it does not catch all Forward Progress bugs, but it does catch many such bugs. - end note.] @@ -228,7 +228,7 @@ Stream and event ordering A device thread shall not start making progress until all its dependencies have completed. -[Note: Dependencies that prevent device threads from starting to make progress can be created, for example, via CUDA Stream `Command `__s. +[Note: Dependencies that prevent device threads from starting to make progress can be created, for example, via CUDA Stream `Command `__s. These may include dependencies on the completion of, among others, `CUDA Events `__ and `CUDA Kernels `__. - end note.] .. dropdown:: Examples of CUDA API forward progress guarantees due to stream and event ordering From 35fb6b7b8aa80c11e9c612a1acf61ebb135b68e8 Mon Sep 17 00:00:00 2001 From: gonzalobg <65027571+gonzalobg@users.noreply.github.com> Date: Mon, 24 Feb 2025 19:11:50 +0100 Subject: [PATCH 51/54] Fix typo --- docs/libcudacxx/extended_api/execution_model.rst | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index 350a9fed64a..3cb970efd68 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -31,6 +31,7 @@ The CUDA C++ Programming Language is an extension of the C++ Programming Languag This section documents the modifications and extensions to the `[intro.progress] `__ section of the current `ISO International Standard ISO/IEC 14882 – Programming Language C++ `__ draft. Modified sections are called out explicitly and their diff is shown in **bold**. All other sections are additions. + Host threads ------------ From 7449b51aaeea02fe026c0a1344f255fd74f3d05b Mon Sep 17 00:00:00 2001 From: gonzalobg <65027571+gonzalobg@users.noreply.github.com> Date: Mon, 24 Feb 2025 19:27:16 +0100 Subject: [PATCH 52/54] Remove redundancy --- docs/libcudacxx/extended_api/execution_model.rst | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index 3cb970efd68..caff7d66a1c 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -60,7 +60,6 @@ Once a device thread makes progress: [Note: Threads in other thread-block clusters are not guaranteed to eventually make progress. - end note.] [Note: This implies that all device threads within its thread block shall eventually make progress. - end note.] -The order in which device threads make progress is _unspecified_. Modify `[intro.progress.1] `__ as follows (modifications in **bold**): From 2334ebcfe9e6d845f7ee5aa977914595bfa0ee82 Mon Sep 17 00:00:00 2001 From: gonzalobg <65027571+gonzalobg@users.noreply.github.com> Date: Tue, 25 Feb 2025 16:07:52 +0100 Subject: [PATCH 53/54] Fix bug The old wording may allow non steps to start executing (e.g. non atomic data accesses). Just saying that it does not start should suffice to prevent evaluating any expression. --- docs/libcudacxx/extended_api/execution_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index caff7d66a1c..ec7526e1a49 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -226,7 +226,7 @@ This method is not sufficient because it does not catch all Forward Progress bug Stream and event ordering ------------------------- -A device thread shall not start making progress until all its dependencies have completed. +A device thread shall not start until all its dependencies have completed. [Note: Dependencies that prevent device threads from starting to make progress can be created, for example, via CUDA Stream `Command `__s. These may include dependencies on the completion of, among others, `CUDA Events `__ and `CUDA Kernels `__. - end note.] From 5504adbf977c06b82001a19990dbc4c55661856e Mon Sep 17 00:00:00 2001 From: gonzalobg <65027571+gonzalobg@users.noreply.github.com> Date: Tue, 25 Feb 2025 17:20:12 +0100 Subject: [PATCH 54/54] Clarify semantics --- docs/libcudacxx/extended_api/execution_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst index ec7526e1a49..6def2b2c6e1 100644 --- a/docs/libcudacxx/extended_api/execution_model.rst +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -144,7 +144,7 @@ The implementation may assume that any **host** thread will eventually do one of CUDA APIs --------- -Any CUDA API shall eventually either return or ensure at least one device thread makes progress. +A host or device thread CUDA API call shall eventually either return or ensure at least once device thread makes progress. CUDA query functions (e.g. `cudaStreamQuery `__, `cudaEventQuery `__, etc.) shall not consistently