diff --git a/docs/libcudacxx/extended_api.rst b/docs/libcudacxx/extended_api.rst index 5584fd77dd1..1df995a432f 100644 --- a/docs/libcudacxx/extended_api.rst +++ b/docs/libcudacxx/extended_api.rst @@ -6,6 +6,7 @@ Extended API .. toctree:: :maxdepth: 2 + extended_api/execution_model extended_api/memory_model extended_api/thread_groups extended_api/shapes diff --git a/docs/libcudacxx/extended_api/execution_model.rst b/docs/libcudacxx/extended_api/execution_model.rst new file mode 100644 index 00000000000..6def2b2c6e1 --- /dev/null +++ b/docs/libcudacxx/extended_api/execution_model.rst @@ -0,0 +1,272 @@ +.. _libcudacxx-extended-api-execution-model: + +Execution model +=============== + +CUDA C++ aims to provide `_parallel forward progress_ [intro.progress.9] `__ +for all device threads of execution, facilitating the parallelization of pre-existing C++ applications with CUDA C++. + +.. dropdown:: [intro.progress] + + - `[intro.progress.7] `__: For a thread of execution + providing `concurrent forward progress guarantees `__, + the implementation ensures that the thread will eventually make progress for as long as it has not terminated. + + [Note 5: This applies regardless of whether or not other threads of execution (if any) have been or are making progress. + To eventually fulfill this requirement means that this will happen in an unspecified but finite amount of time. — end note] + + - `[intro.progress.9] `__: For a thread of execution providing + `parallel forward progress guarantees `__, the implementation is not required to ensure that + the thread will eventually make progress if it has not yet executed any execution step; once this thread has executed a step, + it provides concurrent forward progress guarantees. + + [Note 6: This does not specify a requirement for when to start this thread of execution, which will typically be specified by the entity + that creates this thread of execution. For example, a thread of execution that provides concurrent forward progress guarantees and executes + tasks from a set of tasks in an arbitrary order, one after the other, satisfies the requirements of parallel forward progress for these + tasks. — end note] + +.. _libcudacxx-extended-api-execution-model-host-threads: + +The CUDA C++ Programming Language is an extension of the C++ Programming Language. +This section documents the modifications and extensions to the `[intro.progress] `__ section of the current `ISO International Standard ISO/IEC 14882 – Programming Language C++ `__ draft. +Modified sections are called out explicitly and their diff is shown in **bold**. +All other sections are additions. + +Host threads +------------ + +The forward progress provided by threads of execution created by the host implementation to +execute `main `__, `std::thread `__, +and `std::jthread https://en.cppreference.com/w/cpp/thread/jthread>`__ is implementation-defined behavior of the host +implementation `[intro.progress] `__. +General-purpose host implementations should provide _concurrent forward progress_. + +If the host implementation provides `_concurrent forward progress_ [intro.progress.7] `__, +then CUDA C++ provides `_parallel forward progress_ [intro.progress.9] `__ for device threads. + + +.. _libcudacxx-extended-api-execution-model-device-threads: + +Device threads +-------------- + +Once a device thread makes progress: + +- If it is part of a `Cooperative Grid `__, + all device threads in its grid shall eventually make progress. +- Otherwise, all device threads in its `thread-block cluster `__ + shall eventually make progress. + + [Note: Threads in other thread-block clusters are not guaranteed to eventually make progress. - end note.] + [Note: This implies that all device threads within its thread block shall eventually make progress. - end note.] + + +Modify `[intro.progress.1] `__ as follows (modifications in **bold**): + +The implementation may assume that any **host** thread will eventually do one of the following: + + 1. terminate, + 2. invoke the function `std::this_thread::yield `__ (`[thread.thread.this] `__), + 3. make a call to a library I/O function, + 4. perform an access through a volatile glvalue, + 5. perform a synchronization operation or an atomic operation, or + 6. continue execution of a trivial infinite loop (`[stmt.iter.general] `__). + +**The implementation may assume that any device thread will eventually do one of the following:** + + 1. **terminate**, + 2. **make a call to a library I/O function**, + 3. **perform an access through a volatile glvalue except if the designated object has automatic storage duration, or** + 4. **perform a synchronization operation or an atomic read operation except if the designated object has automatic storage duration.** + + [Note: Some current limitations of device threads relative to host threads + are implementation defects known to us, that we may fix over time. + Examples include the undefined behavior that arises from device threads + that eventually only perform volatile or atomic operations + on automatic storage duration objects. + However, other limitations of device threads relative to host threads + are intentional choices. They enable performance optimizations + that would not be possible if device threads followed the C++ Standard strictly. + For example, providing forward progress to programs + that eventually only perform atomic writes or fences + would degrade overall performance for little practical benefit. - end note.] + +.. dropdown:: Examples of forward progress guarantee differences between host and device threads due to modifications to [intro.progress.1]. + + The following examples refer to the itemized sub-clauses of the implementation assumptions for host and device threads above + using "host.threads." and "device.threads.", respectively. + + .. code:: cuda + // Example: Execution.Model.Device.0 + // Outcome: grid eventually terminates per device.threads.4 because the atomic object does not have automatic storage duration. + __global__ void ex0(cuda::atomic_ref atom) { + if (threadIdx.x == 0) { + while(atom.load(cuda::memory_order_relaxed) == 0); + } else if (threadIdx.x == 1) { + atom.store(1, cuda::memory_order_relaxed); + } + } + + .. code:: cuda + // Example: Execution.Model.Device.1 + // Allowed outcome: No thread makes progress because device threads don't support host.threads.2. + __global__ void ex1() { + while(true) cuda::std::this_thread::yield(); + } + + .. code:: cuda + // Example: Execution.Model.Device.2 + // Allowed outcome: No thread makes progress because device threads don't support host.threads.4 + // for objects with automatic storage duration (see exception in device.threads.3). + __global__ void ex2() { + volatile bool True = true; + while(True); + } + + .. code:: cuda + // Example: Execution.Model.Device.3 + // Allowed outcome: No thread makes progress because device threads don't support host.threads.5 + // for objects with automatic storage duration (see exception in device.threads.4). + __global__ void ex3() { + cuda::atomic True = true; + while(True.load()); + } + + .. code:: cuda + // Example: Execution.Model.Device.4 + // Allowed outcome: No thread makes progress because device threads don't support host.thread.6. + __global void ex4() { + while(true) { /* empty */ } + } + +.. _libcudacxx-extended-api-execution-model-cuda-apis: + +CUDA APIs +--------- + +A host or device thread CUDA API call shall eventually either return or ensure at least once device thread makes progress. + +CUDA query functions (e.g. `cudaStreamQuery `__, +`cudaEventQuery `__, etc.) shall not consistently +return ``cudaErrorNotReady`` without a device thread making progress. + +[Note: The device thread need not be "related" to the API call, e.g., an API operating on one stream or process may ensure progress of a device thread on another stream or process. - end note.] + +[Note: A simple but not sufficient method to test a program for CUDA API Forward Progress conformance is to run them with following environment variables set: ``CUDA_DEVICE_MAX_CONNECTIONS=1 CUDA_LAUNCH_BLOCKING=1``, and then check that the program still terminates. +If it does not, the program has a bug. +This method is not sufficient because it does not catch all Forward Progress bugs, but it does catch many such bugs. - end note.] + +.. dropdown:: Examples of CUDA API forward progress guarantees. + + .. code:: cuda + // Example: Execution.Model.API.1 + // Outcome: if no other device threads (e.g., from other processes) are making progress, + // this program terminates and returns cudaSuccess. + // Rationale: CUDA guarantees that if the device is empty: + // - `cudaDeviceSynchronize` eventually ensures that at least one device-thread makes progress, which implies that eventually `hello_world` grid and one of its device-threads start. + // - All thread-block threads eventually start (due to "if a device thread makes progress, all other threads in its thread-block cluster eventually make progress"). + // - Once all threads in thread-block arrive at `__syncthreads` barrier, all waiting threads are unblocked. + // - Therefore all device threads eventually exit the `hello_world`` grid. + // - And `cudaDeviceSynchronize`` eventually unblocks. + __global__ void hello_world() { __syncthreads(); } + int main() { + hello_world<<<1,2>>>(); + return (int)cudaDeviceSynchronize(); + } + + .. code:: cuda + // Example: Execution.Model.API.2 + // Allowed outcome: eventually, no thread makes progress. + // Rationale: the `cudaDeviceSynchronize` API below is only called if a device thread eventually makes progress and sets the flag. + // However, CUDA only guarantees that `producer` device thread eventually starts if the synchronization API is called. + // Therefore, the host thread may never be unblocked from the flag spin-loop. + cuda::atomic flag = 0; + __global__ void producer() { flag.store(1); } + int main() { + cudaHostRegister(&flag, sizeof(flag)); + producer<<<1,1>>>(); + while (flag.load() == 0); + return cudaDeviceSynchronize(); + } + + .. code:: cuda + // Example: Execution.Model.API.3 + // Allowed outcome: eventually, no thread makes progress. + // Rationale: same as Example.Model.API.2, with the addition that a single CUDA query API call does not guarantee + // the device thread eventually starts, only repeated CUDA query API calls do (see Execution.Model.API.4). + cuda::atomic flag = 0; + __global__ void producer() { flag.store(1); } + int main() { + cudaHostRegister(&flag, sizeof(flag)); + producer<<<1,1>>>(); + (void)cudaStreamQuery(0); + while (flag.load() == 0); + return cudaDeviceSynchronize(); + } + + .. code:: cuda + // Example: Execution.Model.API.4 + // Outcome: terminates. + // Rationale: same as Execution.Model.API.3, but this example repeatedly calls + // a CUDA query API in within the flag spin-loop, which guarantees that the device thread + // eventually makes progress. + cuda::atomic flag = 0; + __global__ void producer() { flag.store(1); } + int main() { + cudaHostRegister(&flag, sizeof(flag)); + producer<<<1,1>>>(); + while (flag.load() == 0) { + (void)cudaStreamQuery(0); + } + return cudaDeviceSynchronize(); + } + +.. _libcudacxx-extended-api-execution-model-stream-ordering: + +Stream and event ordering +------------------------- + +A device thread shall not start until all its dependencies have completed. + +[Note: Dependencies that prevent device threads from starting to make progress can be created, for example, via CUDA Stream `Command `__s. +These may include dependencies on the completion of, among others, `CUDA Events `__ and `CUDA Kernels `__. - end note.] + +.. dropdown:: Examples of CUDA API forward progress guarantees due to stream and event ordering + + .. code:: cuda + // Example: Execution.Model.Stream.0 + // Allowed outcome: eventually, no thread makes progress. + // Rationale: while CUDA guarantees that one device thread makes progress, since there + // is no dependency between `first` and `second`, it does not guarantee which thread, + // and therefore it could always pick the device thread from `second`, which then never + // unblocks from the spin-loop. + // That is, `second` may starve `first`. + cuda::atomic flag = 0; + __global__ void first() { flag.store(1, cuda::memory_order_relaxed); } + __global__ void second() { while(flag.load(cuda::memory_order_relaxed) == 0) {} } + int main() { + cudaHostRegister(&flag, sizeof(flag)); + cudaStream_t s0, s1; + cudaStreamCreate(&s0); + cudaStreamCreate(&s1); + first<<<1,1,0,s0>>>(); + second<<<1,1,0,s1>>>(); + return cudaDeviceSynchronize(); + } + + .. code:: cuda + // Example: Execution.Model.Stream.1 + // Outcome: terminates. + // Rationale: same as Execution.Model.Stream.0, but this example has a stream dependency + // between first and second, which requires CUDA to run the grids in order. + cuda::atomic flag = 0; + __global__ void first() { flag.store(1, cuda::memory_order_relaxed); } + __global__ void second() { while(flag.load(cuda::memory_order_relaxed) == 0) {} } + int main() { + cudaHostRegister(&flag, sizeof(flag)); + cudaStream_t s0; + cudaStreamCreate(&s0); + first<<<1,1,0,s0>>>(); + second<<<1,1,0,s0>>>(); + return cudaDeviceSynchronize(); + }