
[CCCL C] Structs use size_t for size and alignment members #3978

Open · wants to merge 6 commits into base: main
Conversation

oleksandr-pavlyk
Contributor

@oleksandr-pavlyk oleksandr-pavlyk commented Mar 2, 2025

Description

closes gh-3960
closes gh-3865

  • Change size and alignment member fields in structs to size_t, consistent with the return types of the sizeof(T) and alignof(T) operators.
  • Change unsigned long long num_items to uint64_t num_items in CCCL.c.parallel algorithm function signatures.
  • Change c/parallel/tests as per changes above
  • Change python/cuda_parallel/cuda/parallel/experimental/_cccl.py as per changes above

Rationale for the change: using size_t for size and alignment information aligns with C++ best practices. I believe it is uncontroversial, given the return types of sizeof(T) and alignof(T).

The rationale for changing unsigned long long to uint64_t is that it is manifestly portable (exactly 64 bits wide on all platforms).

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

…ll values.

This resolves valgrind error messages about conditional jumps based on uninitialized
values shown while running cccl.c.parallel.test.for.cpp
…get_name struct

Since the struct contains a reference member, the default compiler-generated methods
may lead to undefined behavior. This caused a crash when executing the Release binary
of the cccl.c.parallel.test.for.cpp test file.

This is output with debugging print statements which have since been removed.

```
executing GetName
executed nvrtcGetLoweredName
==977762== Invalid free() / delete / delete[] / realloc()
==977762==    at 0x484BB6F: operator delete(void*, unsigned long) (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==977762==    by 0x4877B51: cccl_device_for_build (in /home/coder/cccl/build/cuda12.8-gcc13/cccl-c-parallel/lib/libcccl.c.parallel.so)
==977762==    by 0x12A41E: for_each(cccl_iterator_t, unsigned long, cccl_op_t) (in /home/coder/cccl/build/cuda12.8-gcc13/cccl-c-parallel/bin/cccl.c.parallel.test.for.cpp)
==977762==    by 0x12DE09: void CATCH2_INTERNAL_TEMPLATE_TEST_0<int>() (in /home/coder/cccl/build/cuda12.8-gcc13/cccl-c-parallel/bin/cccl.c.parallel.test.for.cpp)
==977762==    by 0x16859E: Catch::RunContext::runCurrentTest() (in /home/coder/cccl/build/cuda12.8-gcc13/cccl-c-parallel/bin/cccl.c.parallel.test.for.cpp)
==977762==    by 0x168B44: Catch::RunContext::runTest(Catch::TestCaseHandle const&) (in /home/coder/cccl/build/cuda12.8-gcc13/cccl-c-parallel/bin/cccl.c.parallel.test.for.cpp)
==977762==    by 0x13AB31: Catch::Session::runInternal() (in /home/coder/cccl/build/cuda12.8-gcc13/cccl-c-parallel/bin/cccl.c.parallel.test.for.cpp)
==977762==    by 0x13AE8D: Catch::Session::run() (in /home/coder/cccl/build/cuda12.8-gcc13/cccl-c-parallel/bin/cccl.c.parallel.test.for.cpp)
==977762==    by 0x1293F2: main (in /home/coder/cccl/build/cuda12.8-gcc13/cccl-c-parallel/bin/cccl.c.parallel.test.for.cpp)
==977762==  Address 0x1ffeffd5f0 is on thread 1's stack
==977762==  in frame #1, created by cccl_device_for_build (???:)
==977762==
done with GetName
```

After this change the crash disappeared.

Add deleted default constructor for nvrtc_get_name struct
Size and alignment struct members must have the same types as the
return types of sizeof(T) and alignof(T), respectively,
which are both `size_t`.
Replace unsigned long long types used to describe problem size with uint64_t
to avoid LP64/LLP64 ambiguity.
Contributor

@rwgk rwgk left a comment

Great cleanup!

I'm not sure what the consequences are for the int -> size_t changes (4 -> 8 bytes), but I saw the "no objections" response from @gevtushenko, therefore I assume it's fine. If there is an easy way to provide a rationale in the PR description, that might be nice.

Contributor

github-actions bot commented Mar 2, 2025

🟩 CI finished in 51m 31s: Pass: 100%/3 | Total: 1h 06m | Avg: 22m 10s | Max: 51m 12s | Hits: 92%/308
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 15m 19s | Avg: 7m 39s | Max: 12m 49s | Hits: 92%/308

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total: 15m 19s | Avg:  7m 39s | Max: 12m 49s | Hits:  92%/308   
    🟩 ctk
      🟩 12.8               Pass: 100%/2   | Total: 15m 19s | Avg:  7m 39s | Max: 12m 49s | Hits:  92%/308   
    🟩 cudacxx
      🟩 nvcc12.8           Pass: 100%/2   | Total: 15m 19s | Avg:  7m 39s | Max: 12m 49s | Hits:  92%/308   
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total: 15m 19s | Avg:  7m 39s | Max: 12m 49s | Hits:  92%/308   
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total: 15m 19s | Avg:  7m 39s | Max: 12m 49s | Hits:  92%/308   
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total: 15m 19s | Avg:  7m 39s | Max: 12m 49s | Hits:  92%/308   
    🟩 gpu
      🟩 rtx2080            Pass: 100%/2   | Total: 15m 19s | Avg:  7m 39s | Max: 12m 49s | Hits:  92%/308   
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 30s | Avg:  2m 30s | Max:  2m 30s | Hits:  85%/154   
      🟩 Test               Pass: 100%/1   | Total: 12m 49s | Avg: 12m 49s | Max: 12m 49s | Hits:  98%/154   
    
  • 🟩 python: Pass: 100%/1 | Total: 51m 12s | Avg: 51m 12s | Max: 51m 12s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 51m 12s | Avg: 51m 12s | Max: 51m 12s
    🟩 ctk
      🟩 12.8               Pass: 100%/1   | Total: 51m 12s | Avg: 51m 12s | Max: 51m 12s
    🟩 cudacxx
      🟩 nvcc12.8           Pass: 100%/1   | Total: 51m 12s | Avg: 51m 12s | Max: 51m 12s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 51m 12s | Avg: 51m 12s | Max: 51m 12s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 51m 12s | Avg: 51m 12s | Max: 51m 12s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 51m 12s | Avg: 51m 12s | Max: 51m 12s
    🟩 gpu
      🟩 rtx2080            Pass: 100%/1   | Total: 51m 12s | Avg: 51m 12s | Max: 51m 12s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 51m 12s | Avg: 51m 12s | Max: 51m 12s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
Catch2Helper

🏃‍ Runner counts (total jobs: 3)

# Runner
2 linux-amd64-gpu-rtx2080-latest-1
1 linux-amd64-cpu16

# warning "C exposure is experimental and subject to change. Define CCCL_C_EXPERIMENTAL to acknowledge this warning."
#else // ^^^ !CCCL_C_EXPERIMENTAL ^^^ / vvv CCCL_C_EXPERIMENTAL vvv
# error "C exposure is experimental and subject to change. Define CCCL_C_EXPERIMENTAL to acknowledge this notice."
Member

As long as the team is OK with a compile-time error, I am OK. Most likely this is not user-visible because we build/ship the binary.

Contributor Author

Other headers cause compile-time error if CCCL_C_EXPERIMENTAL is not set, so I went along.

However, I feel these conditionals should be moved from the header files to the translation units that implement the exported functionality. That is for a separate PR and requires discussion.

Contributor

Either an error or a warning is fine, as long as you can suppress it somehow.

Comment on lines +101 to +103
int candidate_ipt = static_cast<int>(items_per_thread * 4 / accumulator_type.size);
items_per_thread = cuda::std::clamp(candidate_ipt, 1, items_per_thread * 2);
int max_block_size = static_cast<int>((((1024 * 48) / (accumulator_type.size * items_per_thread)) + 31) / 32 * 32);
Member

nit

Suggested change
int candidate_ipt = static_cast<int>(items_per_thread * 4 / accumulator_type.size);
items_per_thread = cuda::std::clamp(candidate_ipt, 1, items_per_thread * 2);
int max_block_size = static_cast<int>((((1024 * 48) / (accumulator_type.size * items_per_thread)) + 31) / 32 * 32);
auto candidate_ipt = static_cast<int>(items_per_thread * 4 / accumulator_type.size);
items_per_thread = cuda::std::clamp(candidate_ipt, 1, items_per_thread * 2);
auto max_block_size = static_cast<int>((((1024 * 48) / (accumulator_type.size * items_per_thread)) + 31) / 32 * 32);

Maybe we should also avoid hard-coding int by replacing it with decltype(items_per_thread) and decltype(block_size), respectively?

Member

(same for changes elsewhere)

Contributor Author

But int is shorter than auto :)

Member

Yeah, for all CUDA-supported platforms int = int32_t, but if we were to be pedantic then the latter is preferred, even though it's longer than auto 😆

@@ -14,8 +14,9 @@

#include "test_util.h"
#include <cccl/c/for.h>
#include <stdint.h>
Member

IMO test files need not include standard library headers unless there are type declarations needed only for tests, but here uint64_t should already come from our C headers. (When in doubt, follow the same practice as other C++ tests, with which I am less familiar 😅)

Comment on lines 42 to 61
{
std::string_view name;
std::string& lowered_name;

nvrtc_get_name() = delete;
nvrtc_get_name(std::string_view name, std::string& lowered_name)
: name(name)
, lowered_name(lowered_name)
{}
~nvrtc_get_name() noexcept {};

nvrtc_get_name(const nvrtc_get_name&) = delete;
nvrtc_get_name(nvrtc_get_name&& other) noexcept
: name(std::move(other.name))
, lowered_name(other.lowered_name)
{}

nvrtc_get_name& operator=(const nvrtc_get_name&) = delete;
nvrtc_get_name& operator=(nvrtc_get_name&&) = delete;
};
Contributor

This class smells, mainly because we need to define a lot of special member functions and it has a reference member. Is there any particular reason this class could not consist of two string views and require no special handling?

struct nvrtc_get_name
{
  std::string_view name;
  std::string_view lowered_name;
};

Comment on lines +101 to +104
int candidate_ipt = static_cast<int>(items_per_thread * 4 / accumulator_type.size);
items_per_thread = cuda::std::clamp(candidate_ipt, 1, items_per_thread * 2);
int max_block_size = static_cast<int>((((1024 * 48) / (accumulator_type.size * items_per_thread)) + 31) / 32 * 32);
block_size = _CUDA_VSTD::min(block_size, max_block_size);
Contributor

There are a lot of magic numbers in these lines. I think we can do better.

items_per_thread * 4 / accumulator_type.size indicates that the variable holds not the items per thread, but the nominal 4-byte items per thread. Please consider renaming the structured binding.

1024 * 48 looks like the amount of statically available shared memory. We should have a constant for this. CUB has cub::detail::max_smem_per_block.

int max_block_size = static_cast<int>((((1024 * 48) / (accumulator_type.size * items_per_thread)) + 31) / 32 * 32);
looks like it's rounding up to a multiple of 32. Please use cuda::round_up.

Labels: none yet · Projects: In Review · 4 participants