[BUG]: compute-sanitizer shows invalid read access on reduction #1229

Closed · 1 task done
cliffburdick opened this issue Dec 18, 2023 · 2 comments
Labels: bug (Something isn't working right.)

@cliffburdick (Contributor)

Is this a duplicate?

Type of Bug

Something else

Component

CUB

Describe the bug

When running a simple DeviceReduce::Max using the example provided in the docs, compute-sanitizer reports an invalid 16-byte __shared__ read. It's not clear whether this is a CUB or a compute-sanitizer issue, since the result appears to be correct. The output is:

compute-sanitizer examples/fft_conv
========= COMPUTE-SANITIZER
========= Invalid __shared__ read of size 16 bytes
=========     at __half cub::CUB_200200_800_NS::BlockReduceWarpReductions<__half, (int)256, (int)1, (int)1, (int)0>::ApplyWarpAggregates<(bool)1, cub::CUB_200200_800_NS::Max, (int)1>(T2, __half, int, cub::CUB_200200_800_NS::Int2Type<T3>)+0x6320 in /repro/MatX/build/_deps/cccl-src/cub/cub/block/specializations/block_reduce_warp_reductions.cuh:118
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x8 is misaligned
=========     Device Frame:__half cub::CUB_200200_800_NS::BlockReduceWarpReductions<__half, (int)256, (int)1, (int)1, (int)0>::ApplyWarpAggregates<(bool)1, cub::CUB_200200_800_NS::Max>(T2, __half, int)+0xc0 in /repro/MatX/build/_deps/cccl-src/cub/cub/block/specializations/block_reduce_warp_reductions.cuh:156
=========     Device Frame:__half cub::CUB_200200_800_NS::BlockReduceWarpReductions<__half, (int)256, (int)1, (int)1, (int)0>::Reduce<(bool)1, cub::CUB_200200_800_NS::Max>(__half, int, T2)+0xc0 in /repro/MatX/build/_deps/cccl-src/cub/cub/block/specializations/block_reduce_warp_reductions.cuh:207
=========     Device Frame:__half cub::CUB_200200_800_NS::BlockReduce<__half, (int)256, (cub::CUB_200200_800_NS::BlockReduceAlgorithm)2, (int)1, (int)1, (int)0>::Reduce<cub::CUB_200200_800_NS::Max>(__half, T1, int)+0xc0 in /repro/MatX/build/_deps/cccl-src/cub/cub/block/block_reduce.cuh:450
=========     Device Frame:__half cub::CUB_200200_800_NS::AgentReduce<cub::CUB_200200_800_NS::AgentReducePolicy<(int)256, (int)16, __half, (int)4, (cub::CUB_200200_800_NS::BlockReduceAlgorithm)2, (cub::CUB_200200_800_NS::CacheLoadModifier)5, cub::CUB_200200_800_NS::MemBoundScaling<(int)256, (int)16, __half>>, __half *, __half *, unsigned int, cub::CUB_200200_800_NS::Max, __half>::ConsumeRange<(int)1>(cub::CUB_200200_800_NS::GridEvenShare<unsigned int> &, cub::CUB_200200_800_NS::Int2Type<T1>)+0xc0 in /repro/MatX/build/_deps/cccl-src/cub/cub/agent/agent_reduce.cuh:367
=========     Device Frame:cub::CUB_200200_800_NS::AgentReduce<cub::CUB_200200_800_NS::AgentReducePolicy<(int)256, (int)16, __half, (int)4, (cub::CUB_200200_800_NS::BlockReduceAlgorithm)2, (cub::CUB_200200_800_NS::CacheLoadModifier)5, cub::CUB_200200_800_NS::MemBoundScaling<(int)256, (int)16, __half>>, __half *, __half *, unsigned int, cub::CUB_200200_800_NS::Max, __half>::ConsumeRange(unsigned int, unsigned int)+0xc0 in /repro/MatX/build/_deps/cccl-src/cub/cub/agent/agent_reduce.cuh:392
=========     Device Frame:void cub::CUB_200200_800_NS::DeviceReduceSingleTileKernel<cub::CUB_200200_800_NS::DeviceReducePolicy<__half, unsigned int, cub::CUB_200200_800_NS::Max>::Policy600, __half *, __half *, unsigned int, cub::CUB_200200_800_NS::Max, __half, __half>(T2, T3, T4, T5, T6)+0x50 in /repro/MatX/build/_deps/cccl-src/cub/cub/device/dispatch/dispatch_reduce.cuh:265

How to Reproduce

Use the following example:

#include <cub/device/device_reduce.cuh>
#include <cuda_fp16.h>

int main() {
  int      num_items = 300;
  __half  *d_in;             // input values (contents are irrelevant for this repro)
  __half  *d_out;            // single-element output
  cudaMalloc((void**)&d_in, num_items * sizeof(__half));
  cudaMalloc((void**)&d_out, sizeof(__half));

  // First call sizes the temporary storage; second call runs the reduction.
  void    *d_temp_storage     = nullptr;
  size_t   temp_storage_bytes = 0;
  cub::DeviceReduce::Max(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);
  cudaMalloc(&d_temp_storage, temp_storage_bytes);
  cub::DeviceReduce::Max(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);
  cudaDeviceSynchronize();
}

Compile the example and run compute-sanitizer on it with no extra flags.

Expected behavior

Clean output from sanitizer

Reproduction link

No response

Operating System

Ubuntu 22.04

nvidia-smi output

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.27                 Driver Version: 550.27         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100 80GB PCIe          On  |   00000000:41:00.0 Off |                    0 |
| N/A   40C    P0             45W /  300W |       0MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|

NVCC version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Sun_Dec__3_19:16:31_PST_2023
Cuda compilation tools, release 12.4, V12.4.46
Build cuda_12.4.r12.4/compiler.33607301_0
cliffburdick added the bug label on Dec 18, 2023
github-project-automation bot moved this to Todo in CCCL on Dec 18, 2023
@gevtushenko (Collaborator)

@cliffburdick I can reproduce the issue on sm89.
The policy in use is 600, which leads to 256 threads per block (8 warps). Looking at the offset of the unaligned read, it's caused by the layout of cub::BlockReduceWarpReductions::_TempStorage, which contains typename WarpReduce::TempStorage warp_reduce[WARPS];, i.e. eight single-byte elements. I believe the compiler vectorizes loads from shared memory and reads 16 bytes at a time, but the address is not a multiple of that alignment. Simply reordering warp_aggregates to be the first field addresses the issue, but it's not a real solution, since the temporary storage can still be aligned to 2 bytes rather than 16. We should investigate this sooner rather than later.
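
For illustration only, here is a minimal sketch of the layout described above (hypothetical names, not the actual CUB definition): eight empty, one-byte warp-reduce storage slots followed by the per-warp __half aggregates, so the aggregates start at offset 8 and are only guaranteed 2-byte alignment. A single 16-byte vector load over them would then be misaligned, matching the "Address 0x8 is misaligned" report:

#include <cstddef>
#include <cuda_fp16.h>

// Stand-in for WarpReduce::TempStorage, which is empty for shuffle-based warp reductions.
struct EmptyWarpStorage {};

// Hypothetical layout mirroring the description above (not the exact CUB struct).
struct TempStorageSketch {
  EmptyWarpStorage warp_reduce[8];   // eight one-byte elements at offsets 0..7
  __half warp_aggregates[8];         // 16 bytes of aggregates, starting at offset 8
};

static_assert(offsetof(TempStorageSketch, warp_aggregates) == 8,
              "aggregates begin at offset 8: 2-byte aligned, not 16-byte aligned");

// If the compiler reads all eight aggregates with one 16-byte vector load
// (e.g. as if through a float4), that access is misaligned whenever the
// enclosing shared storage is aligned only to alignof(__half) == 2.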

A bit simpler reproducer:

#include <cub/block/block_reduce.cuh>
#include <cuda_fp16.h>

constexpr unsigned block_threads = 256;

__global__ void kernel(__half *ptr) {
  using block_reduce_t = cub::BlockReduce<__half, block_threads, cub::BLOCK_REDUCE_WARP_REDUCTIONS>;
  using storage_t      = typename block_reduce_t::TempStorage;

  __half data = ptr[threadIdx.x];

  __shared__ storage_t storage;
  data = block_reduce_t(storage).Reduce(data, cub::Max{});

  if (threadIdx.x == 0) {
    ptr[0] = data;
  }
}

int main() {
  __half *d_ptr{};
  cudaMalloc(&d_ptr, block_threads * sizeof(__half));

  kernel<<<1, block_threads>>>(d_ptr);
  cudaDeviceSynchronize();
}

gevtushenko self-assigned this on Dec 18, 2023
gevtushenko moved this from Todo to In Progress in CCCL on Dec 18, 2023
cccl-authenticator-app bot moved this from In Progress to In Review in CCCL on Dec 19, 2023
jrhemstad moved this from In Review to Blocked in CCCL on Jan 17, 2024
@gevtushenko (Collaborator)

Addressed by nvbug 4428282

github-project-automation bot moved this from Blocked to Done in CCCL on May 9, 2024