diff --git a/docs/libcudacxx/extended_api.rst b/docs/libcudacxx/extended_api.rst
index 1df995a432f..5584fd77dd1 100644
--- a/docs/libcudacxx/extended_api.rst
+++ b/docs/libcudacxx/extended_api.rst
@@ -6,7 +6,6 @@ Extended API
 .. toctree::
    :maxdepth: 2
 
-   extended_api/execution_model
    extended_api/memory_model
    extended_api/thread_groups
    extended_api/shapes
diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst
deleted file mode 100644
index 6def2b2c6e1..00000000000
--- a/docs/libcudacxx/extended_api/execution_model.rst
+++ /dev/null
@@ -1,272 +0,0 @@
.. _libcudacxx-extended-api-execution-model:

Execution model
===============

CUDA C++ aims to provide `_parallel forward progress_ [intro.progress.9] <https://eel.is/c++draft/intro.progress#9>`__
for all device threads of execution, facilitating the parallelization of pre-existing C++ applications with CUDA C++.

.. dropdown:: [intro.progress]

   - `[intro.progress.7] <https://eel.is/c++draft/intro.progress#7>`__: For a thread of execution
     providing `concurrent forward progress guarantees <https://eel.is/c++draft/intro.progress#7>`__,
     the implementation ensures that the thread will eventually make progress for as long as it has not terminated.

     [Note 5: This applies regardless of whether or not other threads of execution (if any) have been or are making progress.
     To eventually fulfill this requirement means that this will happen in an unspecified but finite amount of time. — end note]

   - `[intro.progress.9] <https://eel.is/c++draft/intro.progress#9>`__: For a thread of execution providing
     `parallel forward progress guarantees <https://eel.is/c++draft/intro.progress#9>`__, the implementation is not required to ensure that
     the thread will eventually make progress if it has not yet executed any execution step; once this thread has executed a step,
     it provides concurrent forward progress guarantees.

     [Note 6: This does not specify a requirement for when to start this thread of execution, which will typically be specified by the entity
     that creates this thread of execution.
     For example, a thread of execution that provides concurrent forward progress guarantees and executes
     tasks from a set of tasks in an arbitrary order, one after the other, satisfies the requirements of parallel forward progress for these
     tasks. — end note]

.. _libcudacxx-extended-api-execution-model-host-threads:

The CUDA C++ Programming Language is an extension of the C++ Programming Language.
This section documents the modifications and extensions to the `[intro.progress] <https://eel.is/c++draft/intro.progress>`__ section of the current `ISO International Standard ISO/IEC 14882 – Programming Language C++ <https://isocpp.org/std/the-standard>`__ draft.
Modified sections are called out explicitly and their diff is shown in **bold**.
All other sections are additions.

Host threads
------------

The forward progress provided by threads of execution created by the host implementation to
execute `main <https://en.cppreference.com/w/cpp/language/main_function>`__, `std::thread <https://en.cppreference.com/w/cpp/thread/thread>`__,
and `std::jthread <https://en.cppreference.com/w/cpp/thread/jthread>`__ is implementation-defined behavior of the host
implementation `[intro.progress] <https://eel.is/c++draft/intro.progress>`__.
General-purpose host implementations should provide _concurrent forward progress_.

If the host implementation provides `_concurrent forward progress_ [intro.progress.7] <https://eel.is/c++draft/intro.progress#7>`__,
then CUDA C++ provides `_parallel forward progress_ [intro.progress.9] <https://eel.is/c++draft/intro.progress#9>`__ for device threads.

.. _libcudacxx-extended-api-execution-model-device-threads:

Device threads
--------------

Once a device thread makes progress:

- If it is part of a `Cooperative Grid <https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cooperative-groups>`__,
  all device threads in its grid shall eventually make progress.
- Otherwise, all device threads in its `thread-block cluster <https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#thread-block-clusters>`__
  shall eventually make progress.

  [Note: Threads in other thread-block clusters are not guaranteed to eventually make progress. - end note.]
  [Note: This implies that all device threads within its thread block shall eventually make progress. - end note.]
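To make the cluster-scoped note above concrete, here is a minimal sketch of our own (the kernel name ``cross_block`` and the example itself are editor's additions, not part of the original example numbering). It places the spinning thread and the storing thread in different thread blocks of a non-cooperative grid; because they belong to different thread-block clusters, progress of the spinning block does not guarantee that the storing block ever starts:

.. code:: cuda

   // Hypothetical illustration (editor's addition).
   // Allowed outcome: eventually, no thread makes progress.
   // Rationale: the two communicating threads are in different thread blocks
   // (and thus different thread-block clusters) of a non-cooperative grid, so
   // the intra-cluster guarantee above does not apply between them: block 0
   // may spin forever without block 1 ever starting.
   #include <cuda/atomic>

   __global__ void cross_block(cuda::atomic_ref<int> atom) {
     if (blockIdx.x == 0) {
       while (atom.load(cuda::memory_order_relaxed) == 0) {}
     } else { // blockIdx.x == 1
       atom.store(1, cuda::memory_order_relaxed);
     }
   }
   // Launched as, e.g.: cross_block<<<2, 1>>>(atom);

Launching the same kernel as a Cooperative Grid instead (e.g., via ``cudaLaunchCooperativeKernel``) would place both blocks under the grid-wide guarantee above, so the grid would be guaranteed to terminate.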
Modify `[intro.progress.1] <https://eel.is/c++draft/intro.progress#1>`__ as follows (modifications in **bold**):

The implementation may assume that any **host** thread will eventually do one of the following:

1. terminate,
2. invoke the function `std::this_thread::yield <https://en.cppreference.com/w/cpp/thread/yield>`__ (`[thread.thread.this] <https://eel.is/c++draft/thread.thread.this>`__),
3. make a call to a library I/O function,
4. perform an access through a volatile glvalue,
5. perform a synchronization operation or an atomic operation, or
6. continue execution of a trivial infinite loop (`[stmt.iter.general] <https://eel.is/c++draft/stmt.iter.general>`__).

**The implementation may assume that any device thread will eventually do one of the following:**

1. **terminate**,
2. **make a call to a library I/O function**,
3. **perform an access through a volatile glvalue except if the designated object has automatic storage duration, or**
4. **perform a synchronization operation or an atomic read operation except if the designated object has automatic storage duration.**

[Note: Some current limitations of device threads relative to host threads
are implementation defects known to us that we may fix over time.
Examples include the undefined behavior that arises from device threads
that eventually only perform volatile or atomic operations
on automatic storage duration objects.
However, other limitations of device threads relative to host threads
are intentional choices. They enable performance optimizations
that would not be possible if device threads followed the C++ Standard strictly.
For example, providing forward progress to programs
that eventually only perform atomic writes or fences
would degrade overall performance for little practical benefit. - end note.]

.. dropdown:: Examples of forward progress guarantee differences between host and device threads due to modifications to [intro.progress.1].

   The following examples refer to the itemized sub-clauses of the implementation assumptions for host and device threads above
   using "host.threads."
   and "device.threads.", respectively.

   .. code:: cuda

      // Example: Execution.Model.Device.0
      // Outcome: grid eventually terminates per device.threads.4 because the atomic object does not have automatic storage duration.
      __global__ void ex0(cuda::atomic_ref<int> atom) {
        if (threadIdx.x == 0) {
          while(atom.load(cuda::memory_order_relaxed) == 0);
        } else if (threadIdx.x == 1) {
          atom.store(1, cuda::memory_order_relaxed);
        }
      }

   .. code:: cuda

      // Example: Execution.Model.Device.1
      // Allowed outcome: No thread makes progress because device threads don't support host.threads.2.
      __global__ void ex1() {
        while(true) cuda::std::this_thread::yield();
      }

   .. code:: cuda

      // Example: Execution.Model.Device.2
      // Allowed outcome: No thread makes progress because device threads don't support host.threads.4
      // for objects with automatic storage duration (see exception in device.threads.3).
      __global__ void ex2() {
        volatile bool True = true;
        while(True);
      }

   .. code:: cuda

      // Example: Execution.Model.Device.3
      // Allowed outcome: No thread makes progress because device threads don't support host.threads.5
      // for objects with automatic storage duration (see exception in device.threads.4).
      __global__ void ex3() {
        cuda::atomic True = true;
        while(True.load());
      }

   .. code:: cuda

      // Example: Execution.Model.Device.4
      // Allowed outcome: No thread makes progress because device threads don't support host.threads.6.
      __global__ void ex4() {
        while(true) { /* empty */ }
      }

.. _libcudacxx-extended-api-execution-model-cuda-apis:

CUDA APIs
---------

A CUDA API call made by a host or device thread shall eventually either return or ensure that at least one device thread makes progress.

CUDA query functions (e.g. `cudaStreamQuery <https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html>`__,
`cudaEventQuery <https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EVENT.html>`__, etc.) shall not consistently
return ``cudaErrorNotReady`` without a device thread making progress.
[Note: The device thread need not be "related" to the API call, e.g., an API operating on one stream or process may ensure progress of a device thread on another stream or process. - end note.]

[Note: A simple but not sufficient method to test a program for CUDA API Forward Progress conformance is to run it with the following environment variables set: ``CUDA_DEVICE_MAX_CONNECTIONS=1 CUDA_LAUNCH_BLOCKING=1``, and then check that the program still terminates.
If it does not, the program has a bug.
This method is not sufficient because it does not catch all Forward Progress bugs, but it does catch many such bugs. - end note.]

.. dropdown:: Examples of CUDA API forward progress guarantees.

   .. code:: cuda

      // Example: Execution.Model.API.1
      // Outcome: if no other device threads (e.g., from other processes) are making progress,
      // this program terminates and returns cudaSuccess.
      // Rationale: CUDA guarantees that if the device is empty:
      // - `cudaDeviceSynchronize` eventually ensures that at least one device thread makes progress, which implies that eventually the `hello_world` grid and one of its device threads start.
      // - All thread-block threads eventually start (due to "if a device thread makes progress, all other threads in its thread-block cluster eventually make progress").
      // - Once all threads in the thread block arrive at the `__syncthreads` barrier, all waiting threads are unblocked.
      // - Therefore all device threads eventually exit the `hello_world` grid.
      // - And `cudaDeviceSynchronize` eventually unblocks.
      __global__ void hello_world() { __syncthreads(); }
      int main() {
        hello_world<<<1,2>>>();
        return (int)cudaDeviceSynchronize();
      }

   .. code:: cuda

      // Example: Execution.Model.API.2
      // Allowed outcome: eventually, no thread makes progress.
      // Rationale: the `cudaDeviceSynchronize` API below is only called if a device thread eventually makes progress and sets the flag.
      // However, CUDA only guarantees that the `producer` device thread eventually starts if the synchronization API is called.
      // Therefore, the host thread may never be unblocked from the flag spin-loop.
      cuda::atomic flag = 0;
      __global__ void producer() { flag.store(1); }
      int main() {
        cudaHostRegister(&flag, sizeof(flag));
        producer<<<1,1>>>();
        while (flag.load() == 0);
        return cudaDeviceSynchronize();
      }

   .. code:: cuda

      // Example: Execution.Model.API.3
      // Allowed outcome: eventually, no thread makes progress.
      // Rationale: same as Execution.Model.API.2, with the addition that a single CUDA query API call does not guarantee
      // that the device thread eventually starts; only repeated CUDA query API calls do (see Execution.Model.API.4).
      cuda::atomic flag = 0;
      __global__ void producer() { flag.store(1); }
      int main() {
        cudaHostRegister(&flag, sizeof(flag));
        producer<<<1,1>>>();
        (void)cudaStreamQuery(0);
        while (flag.load() == 0);
        return cudaDeviceSynchronize();
      }

   .. code:: cuda

      // Example: Execution.Model.API.4
      // Outcome: terminates.
      // Rationale: same as Execution.Model.API.3, but this example repeatedly calls
      // a CUDA query API within the flag spin-loop, which guarantees that the device thread
      // eventually makes progress.
      cuda::atomic flag = 0;
      __global__ void producer() { flag.store(1); }
      int main() {
        cudaHostRegister(&flag, sizeof(flag));
        producer<<<1,1>>>();
        while (flag.load() == 0) {
          (void)cudaStreamQuery(0);
        }
        return cudaDeviceSynchronize();
      }

.. _libcudacxx-extended-api-execution-model-stream-ordering:

Stream and event ordering
-------------------------

A device thread shall not start until all its dependencies have completed.

[Note: Dependencies that prevent device threads from starting to make progress can be created, for example, via CUDA Stream `Command <https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#streams>`__\ s.
These may include dependencies on the completion of, among others, `CUDA Events <https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#events>`__ and `CUDA Kernels <https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#kernels>`__.
- end note.]

.. dropdown:: Examples of CUDA API forward progress guarantees due to stream and event ordering

   .. code:: cuda

      // Example: Execution.Model.Stream.0
      // Allowed outcome: eventually, no thread makes progress.
      // Rationale: while CUDA guarantees that one device thread makes progress, since there
      // is no dependency between `first` and `second`, it does not guarantee which thread,
      // and therefore it could always pick the device thread from `second`, which then never
      // unblocks from the spin-loop.
      // That is, `second` may starve `first`.
      cuda::atomic flag = 0;
      __global__ void first() { flag.store(1, cuda::memory_order_relaxed); }
      __global__ void second() { while(flag.load(cuda::memory_order_relaxed) == 0) {} }
      int main() {
        cudaHostRegister(&flag, sizeof(flag));
        cudaStream_t s0, s1;
        cudaStreamCreate(&s0);
        cudaStreamCreate(&s1);
        first<<<1,1,0,s0>>>();
        second<<<1,1,0,s1>>>();
        return cudaDeviceSynchronize();
      }

   .. code:: cuda

      // Example: Execution.Model.Stream.1
      // Outcome: terminates.
      // Rationale: same as Execution.Model.Stream.0, but this example has a stream dependency
      // between `first` and `second`, which requires CUDA to run the grids in order.
      cuda::atomic flag = 0;
      __global__ void first() { flag.store(1, cuda::memory_order_relaxed); }
      __global__ void second() { while(flag.load(cuda::memory_order_relaxed) == 0) {} }
      int main() {
        cudaHostRegister(&flag, sizeof(flag));
        cudaStream_t s0;
        cudaStreamCreate(&s0);
        first<<<1,1,0,s0>>>();
        second<<<1,1,0,s0>>>();
        return cudaDeviceSynchronize();
      }
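   The same ordering can also be obtained across two different streams by recording a CUDA Event. The following sketch is an editor's addition (its name and numbering are hypothetical, not part of the original example set); it reuses the ``first``/``second`` kernels and flag setup from the examples above:

   .. code:: cuda

      // Hypothetical example in the style of Execution.Model.Stream.1 (editor's addition).
      // Outcome: terminates.
      // Rationale: cudaStreamWaitEvent makes all work submitted to s1 after the wait
      // dependent on the completion of the work captured by the event on s0, so
      // `second` cannot start before `first` has stored the flag.
      cuda::atomic flag = 0;
      __global__ void first() { flag.store(1, cuda::memory_order_relaxed); }
      __global__ void second() { while(flag.load(cuda::memory_order_relaxed) == 0) {} }
      int main() {
        cudaHostRegister(&flag, sizeof(flag));
        cudaStream_t s0, s1;
        cudaStreamCreate(&s0);
        cudaStreamCreate(&s1);
        cudaEvent_t e;
        cudaEventCreate(&e);
        first<<<1,1,0,s0>>>();
        cudaEventRecord(e, s0);
        cudaStreamWaitEvent(s1, e, 0);
        second<<<1,1,0,s1>>>();
        return cudaDeviceSynchronize();
      }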