[BUG]: compute-sanitizer shows invalid read access on reduction #1229

Closed · 1 task done
cliffburdick opened this issue Dec 18, 2023 · 2 comments
Labels: bug (Something isn't working right.)

@cliffburdick (Contributor)

Is this a duplicate?

Type of Bug

Something else

Component

CUB

Describe the bug

When running a simple DeviceReduce::Max using the example provided in the docs, compute-sanitizer reports an invalid 16-byte __shared__ read. It's not clear whether this is a CUB or a compute-sanitizer issue, since the result appears to be correct. The output is:

compute-sanitizer examples/fft_conv
========= COMPUTE-SANITIZER
========= Invalid __shared__ read of size 16 bytes
=========     at __half cub::CUB_200200_800_NS::BlockReduceWarpReductions<__half, (int)256, (int)1, (int)1, (int)0>::ApplyWarpAggregates<(bool)1, cub::CUB_200200_800_NS::Max, (int)1>(T2, __half, int, cub::CUB_200200_800_NS::Int2Type<T3>)+0x6320 in /repro/MatX/build/_deps/cccl-src/cub/cub/block/specializations/block_reduce_warp_reductions.cuh:118
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x8 is misaligned
=========     Device Frame:__half cub::CUB_200200_800_NS::BlockReduceWarpReductions<__half, (int)256, (int)1, (int)1, (int)0>::ApplyWarpAggregates<(bool)1, cub::CUB_200200_800_NS::Max>(T2, __half, int)+0xc0 in /repro/MatX/build/_deps/cccl-src/cub/cub/block/specializations/block_reduce_warp_reductions.cuh:156
=========     Device Frame:__half cub::CUB_200200_800_NS::BlockReduceWarpReductions<__half, (int)256, (int)1, (int)1, (int)0>::Reduce<(bool)1, cub::CUB_200200_800_NS::Max>(__half, int, T2)+0xc0 in /repro/MatX/build/_deps/cccl-src/cub/cub/block/specializations/block_reduce_warp_reductions.cuh:207
=========     Device Frame:__half cub::CUB_200200_800_NS::BlockReduce<__half, (int)256, (cub::CUB_200200_800_NS::BlockReduceAlgorithm)2, (int)1, (int)1, (int)0>::Reduce<cub::CUB_200200_800_NS::Max>(__half, T1, int)+0xc0 in /repro/MatX/build/_deps/cccl-src/cub/cub/block/block_reduce.cuh:450
=========     Device Frame:__half cub::CUB_200200_800_NS::AgentReduce<cub::CUB_200200_800_NS::AgentReducePolicy<(int)256, (int)16, __half, (int)4, (cub::CUB_200200_800_NS::BlockReduceAlgorithm)2, (cub::CUB_200200_800_NS::CacheLoadModifier)5, cub::CUB_200200_800_NS::MemBoundScaling<(int)256, (int)16, __half>>, __half *, __half *, unsigned int, cub::CUB_200200_800_NS::Max, __half>::ConsumeRange<(int)1>(cub::CUB_200200_800_NS::GridEvenShare<unsigned int> &, cub::CUB_200200_800_NS::Int2Type<T1>)+0xc0 in /repro/MatX/build/_deps/cccl-src/cub/cub/agent/agent_reduce.cuh:367
=========     Device Frame:cub::CUB_200200_800_NS::AgentReduce<cub::CUB_200200_800_NS::AgentReducePolicy<(int)256, (int)16, __half, (int)4, (cub::CUB_200200_800_NS::BlockReduceAlgorithm)2, (cub::CUB_200200_800_NS::CacheLoadModifier)5, cub::CUB_200200_800_NS::MemBoundScaling<(int)256, (int)16, __half>>, __half *, __half *, unsigned int, cub::CUB_200200_800_NS::Max, __half>::ConsumeRange(unsigned int, unsigned int)+0xc0 in /repro/MatX/build/_deps/cccl-src/cub/cub/agent/agent_reduce.cuh:392
=========     Device Frame:void cub::CUB_200200_800_NS::DeviceReduceSingleTileKernel<cub::CUB_200200_800_NS::DeviceReducePolicy<__half, unsigned int, cub::CUB_200200_800_NS::Max>::Policy600, __half *, __half *, unsigned int, cub::CUB_200200_800_NS::Max, __half, __half>(T2, T3, T4, T5, T6)+0x50 in /repro/MatX/build/_deps/cccl-src/cub/cub/device/dispatch/dispatch_reduce.cuh:265

How to Reproduce

Use the following example:

#include <cub/device/device_reduce.cuh>
#include <cuda_fp16.h>

int main() {
  int      num_items = 300;
  __half  *d_in;             // input values (contents are irrelevant for this repro)
  __half  *d_out;            // single-element output
  cudaMalloc((void**)&d_in, num_items * sizeof(__half));
  cudaMalloc((void**)&d_out, sizeof(__half));

  // First call sizes the temporary storage; second call runs the reduction.
  void    *d_temp_storage     = nullptr;
  size_t   temp_storage_bytes = 0;
  cub::DeviceReduce::Max(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);
  cudaMalloc(&d_temp_storage, temp_storage_bytes);
  cub::DeviceReduce::Max(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);
  cudaDeviceSynchronize();
}

Compile the example and run compute-sanitizer on it with no extra flags.

Expected behavior

Clean output from sanitizer

Reproduction link

No response

Operating System

Ubuntu 22.04

nvidia-smi output

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.27                 Driver Version: 550.27         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100 80GB PCIe          On  |   00000000:41:00.0 Off |                    0 |
| N/A   40C    P0             45W /  300W |       0MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|

NVCC version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Sun_Dec__3_19:16:31_PST_2023
Cuda compilation tools, release 12.4, V12.4.46
Build cuda_12.4.r12.4/compiler.33607301_0
cliffburdick added the bug label on Dec 18, 2023
github-project-automation bot moved this to Todo in CCCL on Dec 18, 2023
@gevtushenko (Collaborator)

@cliffburdick I can reproduce the issue on sm89.
The policy in use is 600, which leads to 256 threads per block (8 warps). Looking at the offset of the unaligned read, it's caused by the layout of cub::BlockReduceWarpReductions::_TempStorage, which contains typename WarpReduce::TempStorage warp_reduce[WARPS];, i.e. eight single-byte elements. I believe the compiler vectorizes loads from shared memory and reads 16 bytes at a time, but the address is not a multiple of that alignment. Simply reordering warp_aggregates to be the first field addresses the issue, but it's not a real solution, since the temporary storage can still be aligned to 2 bytes rather than 16. We should investigate this sooner rather than later.
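
For illustration only, here is a minimal sketch of the layout described above (hypothetical names, not the actual CUB definition): eight empty, one-byte warp-reduce storage slots followed by the per-warp __half aggregates, so the aggregates start at offset 8 and are only guaranteed 2-byte alignment. A single 16-byte vector load over them would then be misaligned, matching the "Address 0x8 is misaligned" report:

#include <cstddef>
#include <cuda_fp16.h>

// Stand-in for WarpReduce::TempStorage, which is empty for shuffle-based warp reductions.
struct EmptyWarpStorage {};

// Hypothetical layout mirroring the description above (not the exact CUB struct).
struct TempStorageSketch {
  EmptyWarpStorage warp_reduce[8];   // eight one-byte elements at offsets 0..7
  __half warp_aggregates[8];         // 16 bytes of aggregates, starting at offset 8
};

static_assert(offsetof(TempStorageSketch, warp_aggregates) == 8,
              "aggregates begin at offset 8: 2-byte aligned, not 16-byte aligned");

// If the compiler reads all eight aggregates with one 16-byte vector load
// (e.g. as if through a float4), that access is misaligned whenever the
// enclosing shared storage is aligned only to alignof(__half) == 2.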

A bit simpler reproducer:

#include <cub/block/block_reduce.cuh>
#include <cuda_fp16.h>

constexpr unsigned block_threads = 256;

__global__ void kernel(__half *ptr) {
  using block_reduce_t = cub::BlockReduce<__half, block_threads, cub::BLOCK_REDUCE_WARP_REDUCTIONS>;
  using storage_t      = typename block_reduce_t::TempStorage;

  __half data = ptr[threadIdx.x];

  __shared__ storage_t storage;
  data = block_reduce_t(storage).Reduce(data, cub::Max{});

  if (threadIdx.x == 0) {
    ptr[0] = data;
  }
}

int main() {
  __half *d_ptr{};
  cudaMalloc(&d_ptr, block_threads * sizeof(__half));

  kernel<<<1, block_threads>>>(d_ptr);
  cudaDeviceSynchronize();
}

gevtushenko self-assigned this on Dec 18, 2023
gevtushenko moved this from Todo to In Progress in CCCL on Dec 18, 2023
cccl-authenticator-app bot moved this from In Progress to In Review in CCCL on Dec 19, 2023
jrhemstad moved this from In Review to Blocked in CCCL on Jan 17, 2024
@gevtushenko (Collaborator)

Addressed by nvbug 4428282

github-project-automation bot moved this from Blocked to Done in CCCL on May 9, 2024