cp_async_bulk: Fix test (#3198)
* memcpy_async_tx: Fix bug in test

Two bugs, one of which occurs in practice:

1. There is a missing fence.proxy.space::global between the writes to
   global memory and the memcpy_async_tx. (Occurs in practice)

2. The end of the kernel should be fenced with `__syncthreads()`,
   because the barrier is invalidated in the destructor. If other
   threads are still waiting on it, there will be UB. (Has not yet
   manifested itself)

* cp_async_bulk_tensor: Pre-emptively fence more in test

* cp_async_bulk: Fix test

The global memory pointer could be misaligned.
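
A minimal host-side sketch of the alignment concern, assuming (as I understand the bulk-copy requirements) that source and destination addresses must be at least 16-byte aligned. A plain `int` array is only guaranteed `alignof(int)` alignment, while `alignas(128)` forces the stricter boundary. The names `aligned_buffer` and `is_aligned` are hypothetical and do not appear in the test itself.

```cpp
#include <cstdint>

constexpr int buf_len = 1024;

// Hypothetical stand-in for gmem_buffer: alignas(128) forces the array
// to start on a 128-byte boundary instead of merely alignof(int).
alignas(128) int aligned_buffer[buf_len];

// Returns true when ptr is a multiple of `alignment` bytes.
bool is_aligned(const void* ptr, std::uintptr_t alignment)
{
  return reinterpret_cast<std::uintptr_t>(ptr) % alignment == 0;
}
```

A check such as `is_aligned(aligned_buffer, 16)` could guard a bulk copy in a test. An address offset by 4 bytes from the buffer start fails the same check, which is the kind of misalignment an unannotated array may exhibit.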
ahendriksen authored Dec 20, 2024
1 parent 86b9118 commit 7fd9ae5
Showing 1 changed file with 1 addition and 1 deletion.
@@ -26,7 +26,7 @@ using barrier = cuda::barrier<cuda::thread_scope_block>;
namespace cde = cuda::device::experimental;

static constexpr int buf_len = 1024;
-__device__ int gmem_buffer[buf_len];
+__device__ alignas(128) int gmem_buffer[buf_len];

__device__ void test()
{