Do not pass integral constants to ptx #2229
Conversation
Force-pushed from 7627b14 to 43a1fff
```diff
 _CCCL_DEVICE static inline void cp_async_bulk_wait_group(n32_t<_N32> __n)
 {
   NV_IF_ELSE_TARGET(
     NV_PROVIDES_SM_90,
     (asm volatile("cp.async.bulk.wait_group %0;"
                   :
-                  : "n"(__N)
+                  : "n"(__n.value)
```
Suggestion: How about using just the template parameter `_N32`? I am not certain `__n` is a constant expression here, and therefore some compilers complain that `__n.value` isn't either. I am also not sure whether that's required here.
`integral_constant::value` is a static constexpr data member, so `__n.value` is a constant expression even if it is accessed through a non-constant object.
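A minimal sketch of that point (a standalone illustration, not the cccl source; `n32_t` and `wait_group` here are stand-ins for the real headers): because `value` is a static constexpr member, `__n.value` names a compile-time constant even though `__n` itself is an ordinary runtime parameter.

```cuda
// Minimal sketch, assuming a n32_t alias over cuda::std::integral_constant;
// not the actual cccl implementation.
#include <cuda/std/type_traits>

template <int _N32>
using n32_t = cuda::std::integral_constant<int, _N32>;

template <int _N32>
__device__ void wait_group(n32_t<_N32> __n)
{
  // __n is a plain function parameter, but n32_t<_N32>::value is static
  // constexpr, so __n.value is usable wherever a constant expression is
  // required -- e.g. a static_assert or an "n" (immediate) asm constraint.
  static_assert(__n.value == _N32, "value is a constant expression");
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
  asm volatile("cp.async.bulk.wait_group %0;" : : "n"(__n.value) : "memory");
#endif
}

__global__ void kernel()
{
  wait_group(n32_t<0>{});
}
```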
I remember running into issues with this at some point, but I just checked on compiler explorer and could not reproduce it. All good then!
This popped up during one of the compilation tests I ran trying to import v2.5.0. Does the error get triggered on one of cccl's tests with clang in your CI? I assume it should. It would give the definitive answer on whether clang is OK with `__n.value`.
Looks like the tests do not build for sm_90a, so they would not catch anything hidden behind `NV_HAS_FEATURE_SM_90a`:

```
[41/1786] /usr/bin/sccache /usr/bin/clang++ -Dheadertest___ptx_instructions_tensormap_cp_fenceproxy_h_EXPORTS -I/home/coder/cccl/libcudacxx/include -O3 \
    -DNDEBUG -std=gnu++20 --cuda-gpu-arch=sm_60 --cuda-gpu-arch=sm_70 --cuda-gpu-arch=sm_80
```

Considering that there's a nontrivial amount of interesting functionality specific to sm_90 and sm_90a, it may be useful to build/test for them, too.
That is really strange, considering we specifically have a target for SM90a for that reason.
Oh, you know what, our SM90a tests only run on nvcc with gcc: https://github.com/NVIDIA/cccl/actions/runs/10368241031/job/28701520566
I added two CI runs for SM90 and SM90a, let's see what happens with that.
🟨 CI finished in 3h 52m: Pass: 93%/417 | Total: 1d 16h | Avg: 5m 49s | Max: 51m 28s | Hits: 98%/31314

| | Project |
|---|---|
| | CCCL Infrastructure |
| +/- | libcu++ |
| | CUB |
| | Thrust |
| | CUDA Experimental |
| | pycuda |

Modifications in project or dependencies?

| | Project |
|---|---|
| | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | pycuda |

🏃 Runner counts (total jobs: 417)

| # | Runner |
|---|---|
| 305 | linux-amd64-cpu16 |
| 61 | linux-amd64-gpu-v100-latest-1 |
| 28 | linux-arm64-cpu16 |
| 23 | windows-amd64-cpu16 |
Force-pushed from b011e37 to af65d16
🟨 CI finished in 1h 41m: Pass: 91%/433 | Total: 1d 20h | Avg: 6m 10s | Max: 38m 44s | Hits: 98%/31314

| | Project |
|---|---|
| +/- | CCCL Infrastructure |
| +/- | libcu++ |
| | CUB |
| | Thrust |
| | CUDA Experimental |
| | pycuda |

Modifications in project or dependencies?

| | Project |
|---|---|
| +/- | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | pycuda |

🏃 Runner counts (total jobs: 433)

| # | Runner |
|---|---|
| 317 | linux-amd64-cpu16 |
| 65 | linux-amd64-gpu-v100-latest-1 |
| 28 | linux-arm64-cpu16 |
| 23 | windows-amd64-cpu16 |
🟨 CI finished in 16h 31m: Pass: 98%/433 | Total: 1d 22h | Avg: 6m 22s | Max: 38m 44s | Hits: 98%/31314

| | Project |
|---|---|
| +/- | CCCL Infrastructure |
| +/- | libcu++ |
| | CUB |
| | Thrust |
| | CUDA Experimental |
| | pycuda |

Modifications in project or dependencies?

| | Project |
|---|---|
| +/- | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | pycuda |

🏃 Runner counts (total jobs: 433)

| # | Runner |
|---|---|
| 317 | linux-amd64-cpu16 |
| 65 | linux-amd64-gpu-v100-latest-1 |
| 28 | linux-arm64-cpu16 |
| 23 | windows-amd64-cpu16 |
@Artem-B So it seems SM_90a is not supported, which makes CI fail.

I am seeing a lot of spurious clang errors about initializing a `__shared__` barrier:

```cuda
__shared__ barrier bar; // <- error: initialization is not supported for __shared__ variables.
if (threadIdx.x == 0)
{
  init(&bar, blockDim.x);
}
__syncthreads();
```

Full report here: https://github.com/NVIDIA/cccl/actions/runs/10372551667/job/28747729022?pr=2229
Also looks like we need to revisit our PTX detection scheme for clang. All of that is guarded by:

```cpp
#if __cccl_ptx_isa >= 830
```

Which in turn is defined as:

```cpp
#if (defined(__CUDACC_VER_MAJOR__) && (__CUDACC_VER_MAJOR__ > 12)) || (!defined(__CUDACC_VER_MAJOR__))
#  define __cccl_ptx_isa 830ULL
#elif (defined(__CUDACC_VER_MAJOR__) && (__CUDACC_VER_MAJOR__ >= 12 && __CUDACC_VER_MINOR__ >= 3)) \
  || (!defined(__CUDACC_VER_MAJOR__))
#  define __cccl_ptx_isa 830ULL
```
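To make the concern concrete, here is a hedged reading of that guard, under the assumption that clang-cuda does not define `__CUDACC_VER_MAJOR__` (an nvcc-specific macro), so clang always takes the `!defined(__CUDACC_VER_MAJOR__)` escape hatch:

```cuda
// Hedged sketch of the effective result under clang-cuda (assumption:
// __CUDACC_VER_MAJOR__ is only defined by nvcc). Both branches above contain
// `!defined(__CUDACC_VER_MAJOR__)`, so for clang this collapses to an
// unconditional:
#define __cccl_ptx_isa 830ULL
// i.e. the PTX ISA is pinned to 8.3 regardless of which ISA the clang/LLVM
// version in use can actually emit.
```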
🟨 CI finished in 7h 00m: Pass: 99%/427 | Total: 1d 23h | Avg: 6m 37s | Max: 44m 23s | Hits: 75%/31314

| | Project |
|---|---|
| +/- | CCCL Infrastructure |
| +/- | libcu++ |
| | CUB |
| | Thrust |
| | CUDA Experimental |
| | pycuda |

Modifications in project or dependencies?

| | Project |
|---|---|
| +/- | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | pycuda |

🏃 Runner counts (total jobs: 427)

| # | Runner |
|---|---|
| 311 | linux-amd64-cpu16 |
| 65 | linux-amd64-gpu-v100-latest-1 |
| 28 | linux-arm64-cpu16 |
| 23 | windows-amd64-cpu16 |
Force-pushed from b20b250 to c29ed0a
🟨 CI finished in 59m 55s: Pass: 97%/382 | Total: 1d 17h | Avg: 6m 31s | Max: 46m 01s | Hits: 74%/25663

| | Project |
|---|---|
| +/- | CCCL Infrastructure |
| +/- | libcu++ |
| | CUB |
| | Thrust |
| | CUDA Experimental |
| | pycuda |
| | CUDA C Core Library |

Modifications in project or dependencies?

| | Project |
|---|---|
| +/- | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | pycuda |
| +/- | CUDA C Core Library |

🏃 Runner counts (total jobs: 382)

| # | Runner |
|---|---|
| 311 | linux-amd64-cpu16 |
| 28 | linux-amd64-gpu-v100-latest-1 |
| 28 | linux-arm64-cpu16 |
| 15 | windows-amd64-cpu16 |
@Artem-B I am seeing some compiler crashes, could I drop them on you? (I don't know whether they are legit.)
🟨 CI finished in 1h 12m: Pass: 97%/382 | Total: 2d 02h | Avg: 7m 53s | Max: 49m 11s | Hits: 86%/25663

| | Project |
|---|---|
| +/- | CCCL Infrastructure |
| +/- | libcu++ |
| | CUB |
| | Thrust |
| | CUDA Experimental |
| | pycuda |
| | CUDA C Core Library |

Modifications in project or dependencies?

| | Project |
|---|---|
| +/- | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | pycuda |
| +/- | CUDA C Core Library |

🏃 Runner counts (total jobs: 382)

| # | Runner |
|---|---|
| 311 | linux-amd64-cpu16 |
| 28 | linux-amd64-gpu-v100-latest-1 |
| 28 | linux-arm64-cpu16 |
| 15 | windows-amd64-cpu16 |
Force-pushed from 04bbc18 to 90bdfd4
@Artem-B It seems something is wrong with the NV_TARGET implementation here. I don't really understand what is going wrong that makes it believe the variables are unused. Is that worth investigating, or can I just mark them as xfail for now?
🟨 CI finished in 2h 09m: Pass: 97%/382 | Total: 3d 05h | Avg: 12m 06s | Max: 1h 22m | Hits: 10%/25663

| | Project |
|---|---|
| +/- | CCCL Infrastructure |
| +/- | libcu++ |
| | CUB |
| | Thrust |
| | CUDA Experimental |
| | pycuda |
| | CUDA C Core Library |

Modifications in project or dependencies?

| | Project |
|---|---|
| +/- | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | pycuda |
| +/- | CUDA C Core Library |

🏃 Runner counts (total jobs: 382)

| # | Runner |
|---|---|
| 311 | linux-amd64-cpu16 |
| 28 | linux-amd64-gpu-v100-latest-1 |
| 28 | linux-arm64-cpu16 |
| 15 | windows-amd64-cpu16 |
Force-pushed from 90bdfd4 to d4fc802
🟨 CI finished in 1h 19m: Pass: 98%/386 | Total: 2d 02h | Avg: 7m 52s | Max: 49m 04s | Hits: 68%/27963

| | Project |
|---|---|
| +/- | CCCL Infrastructure |
| +/- | libcu++ |
| | CUB |
| | Thrust |
| | CUDA Experimental |
| | pycuda |
| | CCCL C Parallel Library |

Modifications in project or dependencies?

| | Project |
|---|---|
| +/- | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | pycuda |
| +/- | CCCL C Parallel Library |

🏃 Runner counts (total jobs: 386)

| # | Runner |
|---|---|
| 312 | linux-amd64-cpu16 |
| 31 | linux-amd64-gpu-v100-latest-1 |
| 28 | linux-arm64-cpu16 |
| 15 | windows-amd64-cpu16 |
Force-pushed from 3f69de3 to dc91833
🟨 CI finished in 2h 00m: Pass: 98%/386 | Total: 2d 13h | Avg: 9m 31s | Max: 1h 29m | Hits: 10%/27985

| | Project |
|---|---|
| +/- | CCCL Infrastructure |
| +/- | libcu++ |
| | CUB |
| | Thrust |
| | CUDA Experimental |
| | pycuda |
| | CCCL C Parallel Library |

Modifications in project or dependencies?

| | Project |
|---|---|
| +/- | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | pycuda |
| +/- | CCCL C Parallel Library |

🏃 Runner counts (total jobs: 386)

| # | Runner |
|---|---|
| 312 | linux-amd64-cpu16 |
| 31 | linux-amd64-gpu-v100-latest-1 |
| 28 | linux-arm64-cpu16 |
| 15 | windows-amd64-cpu16 |
🟨 CI finished in 1h 06m: Pass: 99%/386 | Total: 1d 22h | Avg: 7m 14s | Max: 51m 55s | Hits: 76%/27985

| | Project |
|---|---|
| +/- | CCCL Infrastructure |
| +/- | libcu++ |
| | CUB |
| | Thrust |
| | CUDA Experimental |
| | pycuda |
| | CCCL C Parallel Library |

Modifications in project or dependencies?

| | Project |
|---|---|
| +/- | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | pycuda |
| +/- | CCCL C Parallel Library |

🏃 Runner counts (total jobs: 386)

| # | Runner |
|---|---|
| 312 | linux-amd64-cpu16 |
| 31 | linux-amd64-gpu-v100-latest-1 |
| 28 | linux-arm64-cpu16 |
| 15 | windows-amd64-cpu16 |
🟨 CI finished in 56m 06s: Pass: 99%/386 | Total: 1d 20h | Avg: 6m 57s | Max: 49m 54s | Hits: 99%/27985

| | Project |
|---|---|
| +/- | CCCL Infrastructure |
| +/- | libcu++ |
| | CUB |
| | Thrust |
| | CUDA Experimental |
| | pycuda |
| | CCCL C Parallel Library |

Modifications in project or dependencies?

| | Project |
|---|---|
| +/- | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | pycuda |
| +/- | CCCL C Parallel Library |

🏃 Runner counts (total jobs: 386)

| # | Runner |
|---|---|
| 312 | linux-amd64-cpu16 |
| 31 | linux-amd64-gpu-v100-latest-1 |
| 28 | linux-arm64-cpu16 |
| 15 | windows-amd64-cpu16 |
🟩 CI finished in 2h 09m: Pass: 100%/386 | Total: 1d 21h | Avg: 7m 00s | Max: 49m 54s | Hits: 99%/27985

| | Project |
|---|---|
| +/- | CCCL Infrastructure |
| +/- | libcu++ |
| | CUB |
| | Thrust |
| | CUDA Experimental |
| | pycuda |
| | CCCL C Parallel Library |

Modifications in project or dependencies?

| | Project |
|---|---|
| +/- | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | pycuda |
| +/- | CCCL C Parallel Library |

🏃 Runner counts (total jobs: 386)

| # | Runner |
|---|---|
| 312 | linux-amd64-cpu16 |
| 31 | linux-amd64-gpu-v100-latest-1 |
| 28 | linux-arm64-cpu16 |
| 15 | windows-amd64-cpu16 |
```diff
@@ -10,6 +10,9 @@
 
 // UNSUPPORTED: pre-sm-70
 
+// clang-cuda errors out with "fatal error: error in backend: Cannot cast between two non-generic address spaces"
```
BTW, these errors should be fixed in recent LLVM: llvm/llvm-project#114325
```cuda
__shared__ barrier* bar;
cde::cp_async_bulk_global_to_shared(&smem, gmem, 1024, *bar);
```
This looks highly suspicious: https://godbolt.org/z/4Ma1Yq9M4

We end up using an uninitialized value passed to `cp.async.bulk.shared::cluster.global.mbarrier::complete_tx::bytes` via `%r3`:

```
.visible .entry _Z9test_bulkPv(
    .param .u64 _Z9test_bulkPv_param_0
)
{
    .reg .b32 %r<6>;
    .reg .b64 %rd<4>;
    .loc 1 6 0

    .shared .align 4 .u32 _ZZ9test_bulkPvE4smem;
    ld.param.u64 %rd3, [_Z9test_bulkPv_param_0];
    cvta.to.global.u64 %rd1, %rd3;
    mov.u32 %r1, _ZZ9test_bulkPvE4smem;
    mov.u32 %r5, 1024;
    cp.async.bulk.shared::cluster.global.mbarrier::complete_tx::bytes [%r1], [%rd1], %r5, [%r3]; // 1a. unicast
```
@miscco This is still an issue. We do need `bar` pointing to a valid shared memory location.

Using something like this appears to make the cp_async_bulk* tests work:

```cuda
alignas(128) __shared__ int smem_buffer[buf_len];
__shared__ char barrier_data[sizeof(barrier)];
barrier &bar = *reinterpret_cast<barrier*>(&barrier_data);
```
I guess this may be considered a new bug. I'll file the issue and send the patch tomorrow. LMK if this workaround is OK. I may still need to look at why clang complains about a non-trivial initializer for `barrier`.
I believe this is an old version; on main it reads:

```cuda
__shared__ barrier* bar;
if (threadIdx.x == 0)
{
  init(bar, blockDim.x);
}
cde::cp_async_bulk_global_to_shared(&smem, gmem, 1024, *bar);
```
`bar` is still an uninitialized pointer, and it's expected to point to valid data that gets initialized and then used. E.g.:

```cuda
asm volatile("mbarrier.init.shared.b64 [%0], %1;" ::"r"(
```

In the original comment I've quoted the PTX the current code produces, and it's clearly wrong. We're passing `[%r3]` as the last mbar argument, but `%r3` is not set to anything.
Oh, we are missing the sync, I believe: `init(bar, blockDim.x)` does initialize it, but we need to ensure that we wait for that.
opened #3061
> init(bar, blockDim.x) does initialize it

I'm confused (or missing something). `bar` is a pointer, pointing who knows where, right? If we pass an invalid pointer to `init`, what exactly does `init` initialize then? AFAICT `init()` assumes that `bar` points to valid storage in shared memory.

In the original code `bar` was a variable, and that looks like what the barrier APIs expect:

cccl/libcudacxx/test/libcudacxx/cuda/barrier/cp_async_bulk_tensor.pass.cpp, lines 71 to 75 in baee3f5:

```cuda
__shared__ barrier bar;
if (threadIdx.x == 0)
{
  init(&bar, blockDim.x);
}
```
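For reference, a hedged sketch of the pattern this thread converges on (assumptions: `barrier` is `cuda::barrier<cuda::thread_scope_block>`, `cde` is `cuda::device::experimental`, and `gmem`, buffer sizes, and the kernel name are placeholders): give `bar` real, properly aligned shared storage, initialize it from one thread, and synchronize before use.

```cuda
// Hedged sketch, not the exact test code: combines the pieces discussed above.
#include <cuda/barrier>

using barrier = cuda::barrier<cuda::thread_scope_block>;
namespace cde  = cuda::device::experimental;

__global__ void test_bulk(void* gmem)
{
  alignas(128) __shared__ int smem[1024 / sizeof(int)];
  // Raw shared storage works around clang rejecting the non-trivial
  // initializer of `__shared__ barrier bar;`.
  __shared__ alignas(barrier) char barrier_storage[sizeof(barrier)];
  barrier& bar = *reinterpret_cast<barrier*>(barrier_storage);

  if (threadIdx.x == 0)
  {
    init(&bar, blockDim.x); // construct the mbarrier in shared memory
  }
  __syncthreads(); // make the initialized barrier visible to every thread

  cde::cp_async_bulk_global_to_shared(smem, gmem, 1024, bar);
  // A real test would then arrive on bar with the transaction count and wait
  // before reading smem.
}
```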
Sorry, I somehow totally misread what this thread was about and mentally filed it as fixed.
Happens. I should've filed a separate issue rather than leave a comment on an already closed pull request... and then forget about it, too, because I've fixed it locally. I'll do better next time.
Rather than passing the instance of `integral_constant`, we should pass the value.
Fixes #2225