
[CCCL C] Structs use size_t for size and alignment members #3978

Open · wants to merge 6 commits into base: main
Conversation

oleksandr-pavlyk
Contributor

@oleksandr-pavlyk oleksandr-pavlyk commented Mar 2, 2025

Description

closes gh-3960
closes gh-3865

  • Change size and alignment member fields in structs to size_t, consistent with the return types of the sizeof(T) and alignof(T) operators.
  • Change unsigned long long num_items to uint64_t num_items in CCCL.c.parallel algorithm function signatures.
  • Change c/parallel/tests as per changes above
  • Change python/cuda_parallel/cuda/parallel/experimental/_cccl.py as per changes above

Rationale for the change: using size_t for size and alignment information aligns with C++ best practices. I believe it is uncontroversial, given the return types of sizeof(T) and alignof(T).

The rationale for changing unsigned long long to uint64_t is that it is manifestly portable (exactly 64 bits wide on all platforms).

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

…ll values.

This resolves valgrind error messages about conditional jumps based on uninitialized
values shown while running cccl.c.parallel.test.for.cpp
…get_name struct

Since the struct contains a reference member, the default compiler-generated methods
may lead to undefined behavior. This caused a crash when executing the Release binary
of the cccl.c.parallel.test.for.cpp test file.

This is output with debugging print statements which have since been removed.

```
executing GetName
executed nvrtcGetLoweredName
==977762== Invalid free() / delete / delete[] / realloc()
==977762==    at 0x484BB6F: operator delete(void*, unsigned long) (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==977762==    by 0x4877B51: cccl_device_for_build (in /home/coder/cccl/build/cuda12.8-gcc13/cccl-c-parallel/lib/libcccl.c.parallel.so)
==977762==    by 0x12A41E: for_each(cccl_iterator_t, unsigned long, cccl_op_t) (in /home/coder/cccl/build/cuda12.8-gcc13/cccl-c-parallel/bin/cccl.c.parallel.test.for.cpp)
==977762==    by 0x12DE09: void CATCH2_INTERNAL_TEMPLATE_TEST_0<int>() (in /home/coder/cccl/build/cuda12.8-gcc13/cccl-c-parallel/bin/cccl.c.parallel.test.for.cpp)
==977762==    by 0x16859E: Catch::RunContext::runCurrentTest() (in /home/coder/cccl/build/cuda12.8-gcc13/cccl-c-parallel/bin/cccl.c.parallel.test.for.cpp)
==977762==    by 0x168B44: Catch::RunContext::runTest(Catch::TestCaseHandle const&) (in /home/coder/cccl/build/cuda12.8-gcc13/cccl-c-parallel/bin/cccl.c.parallel.test.for.cpp)
==977762==    by 0x13AB31: Catch::Session::runInternal() (in /home/coder/cccl/build/cuda12.8-gcc13/cccl-c-parallel/bin/cccl.c.parallel.test.for.cpp)
==977762==    by 0x13AE8D: Catch::Session::run() (in /home/coder/cccl/build/cuda12.8-gcc13/cccl-c-parallel/bin/cccl.c.parallel.test.for.cpp)
==977762==    by 0x1293F2: main (in /home/coder/cccl/build/cuda12.8-gcc13/cccl-c-parallel/bin/cccl.c.parallel.test.for.cpp)
==977762==  Address 0x1ffeffd5f0 is on thread 1's stack
==977762==  in frame #1, created by cccl_device_for_build (???:)
==977762==
done with GetName
```

After this change the crash disappeared.

Add deleted default constructor for nvrtc_get_name struct
Size and alignment struct members must have the same types as the
return types of sizeof(T) and alignof(T), respectively,
which are both `size_t`.
Replace unsigned long long types used to describe problem size with uint64_t
to avoid LP64/LLP64 ambiguity.
Contributor

@rwgk rwgk left a comment

Great cleanup!

I'm not sure what the consequences are for the int -> size_t changes (4 -> 8 bytes), but I saw the "no objections" response from @gevtushenko, therefore I assume it's fine. If there is an easy way to provide a rationale in the PR description, that might be nice.

Contributor

github-actions bot commented Mar 2, 2025

🟩 CI finished in 51m 31s: Pass: 100%/3 | Total: 1h 06m | Avg: 22m 10s | Max: 51m 12s | Hits: 92%/308
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 15m 19s | Avg: 7m 39s | Max: 12m 49s | Hits: 92%/308

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total: 15m 19s | Avg:  7m 39s | Max: 12m 49s | Hits:  92%/308   
    🟩 ctk
      🟩 12.8               Pass: 100%/2   | Total: 15m 19s | Avg:  7m 39s | Max: 12m 49s | Hits:  92%/308   
    🟩 cudacxx
      🟩 nvcc12.8           Pass: 100%/2   | Total: 15m 19s | Avg:  7m 39s | Max: 12m 49s | Hits:  92%/308   
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total: 15m 19s | Avg:  7m 39s | Max: 12m 49s | Hits:  92%/308   
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total: 15m 19s | Avg:  7m 39s | Max: 12m 49s | Hits:  92%/308   
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total: 15m 19s | Avg:  7m 39s | Max: 12m 49s | Hits:  92%/308   
    🟩 gpu
      🟩 rtx2080            Pass: 100%/2   | Total: 15m 19s | Avg:  7m 39s | Max: 12m 49s | Hits:  92%/308   
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 30s | Avg:  2m 30s | Max:  2m 30s | Hits:  85%/154   
      🟩 Test               Pass: 100%/1   | Total: 12m 49s | Avg: 12m 49s | Max: 12m 49s | Hits:  98%/154   
    
  • 🟩 python: Pass: 100%/1 | Total: 51m 12s | Avg: 51m 12s | Max: 51m 12s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 51m 12s | Avg: 51m 12s | Max: 51m 12s
    🟩 ctk
      🟩 12.8               Pass: 100%/1   | Total: 51m 12s | Avg: 51m 12s | Max: 51m 12s
    🟩 cudacxx
      🟩 nvcc12.8           Pass: 100%/1   | Total: 51m 12s | Avg: 51m 12s | Max: 51m 12s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 51m 12s | Avg: 51m 12s | Max: 51m 12s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 51m 12s | Avg: 51m 12s | Max: 51m 12s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 51m 12s | Avg: 51m 12s | Max: 51m 12s
    🟩 gpu
      🟩 rtx2080            Pass: 100%/1   | Total: 51m 12s | Avg: 51m 12s | Max: 51m 12s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 51m 12s | Avg: 51m 12s | Max: 51m 12s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
Catch2Helper

🏃‍ Runner counts (total jobs: 3)

# Runner
2 linux-amd64-gpu-rtx2080-latest-1
1 linux-amd64-cpu16

# warning "C exposure is experimental and subject to change. Define CCCL_C_EXPERIMENTAL to acknowledge this warning."
#else // ^^^ !CCCL_C_EXPERIMENTAL ^^^ / vvv CCCL_C_EXPERIMENTAL vvv
# error "C exposure is experimental and subject to change. Define CCCL_C_EXPERIMENTAL to acknowledge this notice."
Member

As long as the team is OK with a compile-time error, I am OK. Most likely this is not user-visible because we build/ship the binary.

Contributor Author

Other headers cause compile-time error if CCCL_C_EXPERIMENTAL is not set, so I went along.

However, I feel these conditionals should be moved from the header files to the translation units that implement the exported functionality. That is for a separate PR and requires discussion.

Contributor

Either an error or a warning is fine, as long as you can suppress it somehow.

Comment on lines +101 to +103
int candidate_ipt = static_cast<int>(items_per_thread * 4 / accumulator_type.size);
items_per_thread = cuda::std::clamp(candidate_ipt, 1, items_per_thread * 2);
int max_block_size = static_cast<int>((((1024 * 48) / (accumulator_type.size * items_per_thread)) + 31) / 32 * 32);
Member

nit

Suggested change
int candidate_ipt = static_cast<int>(items_per_thread * 4 / accumulator_type.size);
items_per_thread = cuda::std::clamp(candidate_ipt, 1, items_per_thread * 2);
int max_block_size = static_cast<int>((((1024 * 48) / (accumulator_type.size * items_per_thread)) + 31) / 32 * 32);
auto candidate_ipt = static_cast<int>(items_per_thread * 4 / accumulator_type.size);
items_per_thread = cuda::std::clamp(candidate_ipt, 1, items_per_thread * 2);
auto max_block_size = static_cast<int>((((1024 * 48) / (accumulator_type.size * items_per_thread)) + 31) / 32 * 32);

Maybe we should also avoid hard-coding int by replacing it with decltype(items_per_thread) and decltype(block_size), respectively?

Member

(same for changes elsewhere)

Contributor Author

But int is shorter than auto :)

Member

Yeah, for all CUDA-supported platforms int = int32_t, but if we were to be pedantic then the latter is preferred, even though it's longer than auto 😆

@@ -14,8 +14,9 @@

#include "test_util.h"
#include <cccl/c/for.h>
#include <stdint.h>
Member

IMO test files need not include standard library headers unless there are type declarations needed only for tests, but here uint64_t should already come from our C headers. (When in doubt, follow the same practice as other C++ tests, with which I am less familiar 😅)

Comment on lines 42 to 61
{
std::string_view name;
std::string& lowered_name;

nvrtc_get_name() = delete;
nvrtc_get_name(std::string_view name, std::string& lowered_name)
: name(name)
, lowered_name(lowered_name)
{}
~nvrtc_get_name() noexcept {};

nvrtc_get_name(const nvrtc_get_name&) = delete;
nvrtc_get_name(nvrtc_get_name&& other) noexcept
: name(std::move(other.name))
, lowered_name(other.lowered_name)
{}

nvrtc_get_name& operator=(const nvrtc_get_name&) = delete;
nvrtc_get_name& operator=(nvrtc_get_name&&) = delete;
};
Contributor

This class smells, mainly because we need to define a lot of special member functions and it has a reference member. Is there any particular reason this class could not consist of two string views and require no special handling?

struct nvrtc_get_name
{
  std::string_view name;
  std::string_view lowered_name;
};

Comment on lines +101 to +104
int candidate_ipt = static_cast<int>(items_per_thread * 4 / accumulator_type.size);
items_per_thread = cuda::std::clamp(candidate_ipt, 1, items_per_thread * 2);
int max_block_size = static_cast<int>((((1024 * 48) / (accumulator_type.size * items_per_thread)) + 31) / 32 * 32);
block_size = _CUDA_VSTD::min(block_size, max_block_size);
Contributor

There are a lot of magic numbers in these lines. I think we can do better.

items_per_thread * 4 / accumulator_type.size indicates that the variable holds not the items per thread, but the nominal 4-byte items per thread. Please consider renaming the structured binding.

1024 * 48 looks like the amount of statically available shared memory. We should have a constant for this. CUB has cub::detail::max_smem_per_block.

int max_block_size = static_cast<int>((((1024 * 48) / (accumulator_type.size * items_per_thread)) + 31) / 32 * 32);
looks like it's rounding up to a multiple of 32. Please use cuda::round_up.

Labels: none yet · Projects: In Review · 4 participants