-
Notifications
You must be signed in to change notification settings - Fork 197
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CCCL C] Structs use size t for size and alignment members #3978
base: main
Are you sure you want to change the base?
[CCCL C] Structs use size t for size and alignment members #3978
Conversation
…ll values. This resolves valgrind error messages about conditional jump based on uninitialized values shown during run of cccl.c.parallel.test.for.cpp
…get_name struct Since the struct member contains a reference, default compiler-generated methods may lead to undefined behavior. This caused a crash when executing Release binary of cccl.c.parallel.test.for.cpp test file. This is output with debugging print statements which since had been removed. ``` executing GetName executed nvrtcGetLoweredName ==977762== Invalid free() / delete / delete[] / realloc() ==977762== at 0x484BB6F: operator delete(void*, unsigned long) (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so) ==977762== by 0x4877B51: cccl_device_for_build (in /home/coder/cccl/build/cuda12.8-gcc13/cccl-c-parallel/lib/libcccl.c.parallel.so) ==977762== by 0x12A41E: for_each(cccl_iterator_t, unsigned long, cccl_op_t) (in /home/coder/cccl/build/cuda12.8-gcc13/cccl-c-parallel/bin/cccl.c.parallel.test.for.cpp) ==977762== by 0x12DE09: void CATCH2_INTERNAL_TEMPLATE_TEST_0<int>() (in /home/coder/cccl/build/cuda12.8-gcc13/cccl-c-parallel/bin/cccl.c.parallel.test.for.cpp) ==977762== by 0x16859E: Catch::RunContext::runCurrentTest() (in /home/coder/cccl/build/cuda12.8-gcc13/cccl-c-parallel/bin/cccl.c.parallel.test.for.cpp) ==977762== by 0x168B44: Catch::RunContext::runTest(Catch::TestCaseHandle const&) (in /home/coder/cccl/build/cuda12.8-gcc13/cccl-c-parallel/bin/cccl.c.parallel.test.for.cpp) ==977762== by 0x13AB31: Catch::Session::runInternal() (in /home/coder/cccl/build/cuda12.8-gcc13/cccl-c-parallel/bin/cccl.c.parallel.test.for.cpp) ==977762== by 0x13AE8D: Catch::Session::run() (in /home/coder/cccl/build/cuda12.8-gcc13/cccl-c-parallel/bin/cccl.c.parallel.test.for.cpp) ==977762== by 0x1293F2: main (in /home/coder/cccl/build/cuda12.8-gcc13/cccl-c-parallel/bin/cccl.c.parallel.test.for.cpp) ==977762== Address 0x1ffeffd5f0 is on thread 1's stack ==977762== in frame NVIDIA#1, created by cccl_device_for_build (???:) ==977762== done with GetName ``` After this change the crash disappeared. Add deleted default constructor for nvrtc_get_name struct
Size and alignment struct members must have same types as return types of sizeof(T) and of alignof(T) respectively, which are both `size_t`.
Replace unsigned long long types used to describe problem size with uint64_t to avoid LP64/LLP64 ambiguity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great cleanup!
I'm not sure what the consequences are for the int
-> size_t
changes (4
-> 8
bytes), but I saw the "no objections" response from @gevtushenko, therefore I assume it's fine. If there is an easy way to provide a rationale in the PR description, that might be nice.
🟩 CI finished in 51m 31s: Pass: 100%/3 | Total: 1h 06m | Avg: 22m 10s | Max: 51m 12s | Hits: 92%/308
|
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
CUB | |
Thrust | |
CUDA Experimental | |
+/- | python |
+/- | CCCL C Parallel Library |
Catch2Helper |
Modifications in project or dependencies?
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
CUB | |
Thrust | |
CUDA Experimental | |
+/- | python |
+/- | CCCL C Parallel Library |
Catch2Helper |
🏃 Runner counts (total jobs: 3)
# | Runner |
---|---|
2 | linux-amd64-gpu-rtx2080-latest-1 |
1 | linux-amd64-cpu16 |
# warning "C exposure is experimental and subject to change. Define CCCL_C_EXPERIMENTAL to acknowledge this warning." | ||
#else // ^^^ !CCCL_C_EXPERIMENTAL ^^^ / vvv CCCL_C_EXPERIMENTAL vvv | ||
# error "C exposure is experimental and subject to change. Define CCCL_C_EXPERIMENTAL to acknowledge this notice." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As long as the team is OK with compile-time error, I am OK. Most likely this is not user-visible because we build/ship the binary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other headers cause compile-time error if CCCL_C_EXPERIMENTAL
is not set, so I went along.
However, I feel these conditionals should be moved from header files to translation units that implement exported functionality. It is for a separate PR and requires a discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Error or warning is both fine, as long as you can suppres it somehow.
int candidate_ipt = static_cast<int>(items_per_thread * 4 / accumulator_type.size); | ||
items_per_thread = cuda::std::clamp(candidate_ipt, 1, items_per_thread * 2); | ||
int max_block_size = static_cast<int>((((1024 * 48) / (accumulator_type.size * items_per_thread)) + 31) / 32 * 32); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit
int candidate_ipt = static_cast<int>(items_per_thread * 4 / accumulator_type.size); | |
items_per_thread = cuda::std::clamp(candidate_ipt, 1, items_per_thread * 2); | |
int max_block_size = static_cast<int>((((1024 * 48) / (accumulator_type.size * items_per_thread)) + 31) / 32 * 32); | |
auto candidate_ipt = static_cast<int>(items_per_thread * 4 / accumulator_type.size); | |
items_per_thread = cuda::std::clamp(candidate_ipt, 1, items_per_thread * 2); | |
auto max_block_size = static_cast<int>((((1024 * 48) / (accumulator_type.size * items_per_thread)) + 31) / 32 * 32); |
Maybe we should also avoid hard-coding int
by replacing them with decltype(items_per_thread
and decltype(block_size)
, respectively?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(same for changes elsewhere)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But int
is shorter than auto
:)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah for all CUDA-supported platforms int = int32_t, but if we were to be pedantic then the latter is preferred, then its longer than auto 😆
@@ -14,8 +14,9 @@ | |||
|
|||
#include "test_util.h" | |||
#include <cccl/c/for.h> | |||
#include <stdint.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO test files need not include std library headers unless there's type declarations only needed for tests, but here uint64_t
should already come from our C headers. (When in doubt, follow the same practice as in other C++ tests, for which I am less familiar 😅)
{ | ||
std::string_view name; | ||
std::string& lowered_name; | ||
|
||
nvrtc_get_name() = delete; | ||
nvrtc_get_name(std::string_view name, std::string& lowered_name) | ||
: name(name) | ||
, lowered_name(lowered_name) | ||
{} | ||
~nvrtc_get_name() noexcept {}; | ||
|
||
nvrtc_get_name(const nvrtc_get_name&) = delete; | ||
nvrtc_get_name(nvrtc_get_name&& other) noexcept | ||
: name(std::move(other.name)) | ||
, lowered_name(other.lowered_name) | ||
{} | ||
|
||
nvrtc_get_name& operator=(const nvrtc_get_name&) = delete; | ||
nvrtc_get_name& operator=(nvrtc_get_name&&) = delete; | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class smells. Mainly because we need to define a lot of special member functions and it has a reference member. Is there any particular reason, this class could not consist of two string views and require no special handling?
struct nvrtc_get_name
{
std::string_view name;
std::string_view lowered_name;
}
int candidate_ipt = static_cast<int>(items_per_thread * 4 / accumulator_type.size); | ||
items_per_thread = cuda::std::clamp(candidate_ipt, 1, items_per_thread * 2); | ||
int max_block_size = static_cast<int>((((1024 * 48) / (accumulator_type.size * items_per_thread)) + 31) / 32 * 32); | ||
block_size = _CUDA_VSTD::min(block_size, max_block_size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a lot of magic numbers in these lines. I think we can do better.
items_per_thread * 4 / accumulator_type.size
indicaftes that we are not having the items per thread, but the nominal 4B items per thread. Please consider renaming the structured binding.
1024 * 48
looks like the amount of statically available shared memory. We should have a constant for this. CUB has cub::detail::max_smem_per_block
.
int max_block_size = static_cast<int>((((1024 * 48) / (accumulator_type.size * items_per_thread)) + 31) / 32 * 32);
looks like it's rounding up to a multiple of 32. Please use cuda::round_up
.
Description
closes gh-3960
closes gh-3865
size
andalignment
member fields in structs tosize_t
, consistent with return types ofsizeof(T)
andalignof(T)
compiler built-ins.unsigned long long num_iterms
touint64_t num_items
in CCCL.c.parallel algorithm function signatures.c/parallel/tests
as per changes abovepython/cuda_parallel/cuda/parallel/experimental/_cccl.py
as per changes aboveRationale for the change: Using
size_t
for size and alignment information aligns with best C++ practices. I believe it is uncontroversial due to return types ofsizeof(T)
andalignof(T)
.The rationale to changing
unsigned long long
touint64_t
is that it is manifestly portable (64-bit wide on all platforms).Checklist