Fix block reduce alignment #1233

gevtushenko · 2023-12-19T01:46:08Z

Description

Vectorized loads of temporary storage in block reduce lead to unaligned accesses. This PR declares the temporary storage for block reduce as struct alignas(detail::max_alignment_t<16, T, WarpReduceStorage>::value) _TempStorage. Together with a reordering of the member variables in temporary storage, this ensures that warp aggregate loads can be vectorized.
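The alignment computation described above can be sketched in plain host C++. This is only an illustration under assumptions, not the code from the PR: only the single-argument specialization of max_alignment_t appears in the review excerpt of this conversation, so the primary template and the recursive case below are guesses at its shape. WarpReduceStorageSketch, TempStorageSketch, OverAligned32, the member names, and the array size of 8 are hypothetical, and std::size_t stands in for ::cuda::std::size_t so that the snippet compiles without CUDA headers.

```cpp
#include <cstddef>

// Assumed primary template: folds a minimum alignment with the
// alignments of an arbitrary list of types.
template <std::size_t Alignment, class... Ts>
struct max_alignment_t;

// Base case (this is the specialization visible in the review excerpt).
template <std::size_t Alignment>
struct max_alignment_t<Alignment>
{
  static constexpr std::size_t value = Alignment;
};

// Assumed recursive case: keep the larger of the running maximum and alignof(T).
template <std::size_t Alignment, class T, class... Ts>
struct max_alignment_t<Alignment, T, Ts...>
{
  static constexpr std::size_t value =
    max_alignment_t<(alignof(T) > Alignment ? alignof(T) : Alignment), Ts...>::value;
};

// Hypothetical stand-in for the warp-level reduction storage.
struct WarpReduceStorageSketch
{
  int scratch[4];
};

// Hypothetical user type with an explicitly specified alignment above 16.
struct alignas(32) OverAligned32
{
  char data[32];
};

// Hypothetical block-level storage: at least 16-byte aligned, and more if
// T or the warp storage requires it. Placing the aggregates first (an
// assumption about the reordering) puts them at the aligned start.
template <class T>
struct alignas(max_alignment_t<16, T, WarpReduceStorageSketch>::value) TempStorageSketch
{
  T warp_aggregates[8];
  WarpReduceStorageSketch warp_reduce;
  T block_prefix;
};

static_assert(alignof(TempStorageSketch<int>) == 16, "minimum alignment applies");
static_assert(alignof(TempStorageSketch<OverAligned32>) == 32, "stricter type alignment wins");
```

With this shape, the struct never drops below the 16 bytes that a vectorized load would need, and it still honors a stricter alignment requested by the element type.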

Note that other parts of CUB rely on __align__(16). This is potentially problematic for custom types with explicitly specified alignment. I'll create a separate issue for unifying the approach later.
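The concern about a fixed 16-byte alignment can be illustrated with a small host C++ sketch. It does not reproduce any CUB code: FixedAlignedBytes, DerivedAlignedBytes, and Wide are hypothetical names, and standard alignas(16) is used here in place of __align__(16) so that the snippet compiles without CUDA. The point is only that raw storage pinned to 16 bytes is under-aligned for a type that asks for more.

```cpp
#include <cstddef>

// Hypothetical custom type with an explicitly specified 32-byte alignment.
struct alignas(32) Wide
{
  float v[8];
};

// Raw storage with a hard-coded 16-byte alignment, regardless of T.
template <class T>
struct alignas(16) FixedAlignedBytes
{
  unsigned char bytes[sizeof(T)];
};

// Raw storage whose alignment is the larger of 16 and alignof(T).
template <class T>
struct alignas(alignof(T) > 16 ? alignof(T) : 16) DerivedAlignedBytes
{
  unsigned char bytes[sizeof(T)];
};

// The fixed variant is not guaranteed to be suitably aligned for Wide.
static_assert(alignof(FixedAlignedBytes<Wide>) < alignof(Wide), "under-aligned for Wide");

// The derived variant satisfies Wide and keeps the 16-byte minimum for int.
static_assert(alignof(DerivedAlignedBytes<Wide>) >= alignof(Wide), "suitable for Wide");
static_assert(alignof(DerivedAlignedBytes<int>) == 16, "minimum alignment kept");
```

Constructing a Wide object inside FixedAlignedBytes<Wide> would therefore not be guaranteed to be correctly aligned, which is the kind of problem a unified approach would avoid.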

Checklist

New or existing tests cover these changes.
The documentation is up to date with these changes.

gevtushenko · 2023-12-19T01:50:52Z

Performance is equivalent after the change:

## [0] NVIDIA RTX 6000 Ada Generation

|  T{ct}  |  OffsetT{ct}  |  Elements{io}  |   Cmp Noise |   %Diff |  Status  |
|---------|---------------|----------------|-------------|---------|----------|
|   I8    |      I32      |      2^24      |       2.41% |   0.28% |   PASS   |
|   I8    |      I64      |      2^24      |       3.02% |   0.19% |   PASS   |
|   I16   |      I32      |      2^24      |       2.67% |   0.25% |   PASS   |
|   I16   |      I64      |      2^24      |       2.52% |  -0.12% |   PASS   |
|   I32   |      I32      |      2^24      |       2.42% |   0.03% |   PASS   |
|   I32   |      I64      |      2^24      |       2.69% |  -0.05% |   PASS   |
|   I64   |      I32      |      2^24      |       1.68% |   0.00% |   PASS   |
|   I64   |      I64      |      2^24      |       1.77% |   0.01% |   PASS   |
|  I128   |      I32      |      2^24      |       1.81% |   0.05% |   PASS   |
|  I128   |      I64      |      2^24      |       1.63% |  -0.02% |   PASS   |
|   F32   |      I32      |      2^24      |       2.48% |   0.06% |   PASS   |
|   F32   |      I64      |      2^24      |       2.78% |  -0.11% |   PASS   |
|   F64   |      I32      |      2^24      |       1.88% |   0.04% |   PASS   |
|   F64   |      I64      |      2^24      |       2.04% |  -0.06% |   PASS   |
|   C64   |      I32      |      2^24      |       1.88% |  -0.09% |   PASS   |
|   C64   |      I64      |      2^24      |       1.70% |   0.03% |   PASS   |
|   I8    |      I32      |      2^28      |       1.80% |  -0.04% |   PASS   |
|   I8    |      I64      |      2^28      |       2.64% |  -0.22% |   PASS   |
|   I16   |      I32      |      2^28      |       1.91% |  -0.14% |   PASS   |
|   I16   |      I64      |      2^28      |       1.81% |   0.01% |   PASS   |
|   I32   |      I32      |      2^28      |       1.39% |   0.06% |   PASS   |
|   I32   |      I64      |      2^28      |       1.48% |  -0.16% |   PASS   |
|   I64   |      I32      |      2^28      |       9.73% |  -0.94% |   PASS   |
|   I64   |      I64      |      2^28      |       8.91% |   1.64% |   PASS   |
|  I128   |      I32      |      2^28      |       7.61% |   1.08% |   PASS   |
|  I128   |      I64      |      2^28      |       7.08% |   0.02% |   PASS   |
|   F32   |      I32      |      2^28      |       1.07% |   0.12% |   PASS   |
|   F32   |      I64      |      2^28      |       1.38% |   0.06% |   PASS   |
|   F64   |      I32      |      2^28      |       9.92% |  -1.05% |   PASS   |
|   F64   |      I64      |      2^28      |       9.02% |   0.48% |   PASS   |
|   C64   |      I32      |      2^28      |      10.04% |  -0.40% |   PASS   |
|   C64   |      I64      |      2^28      |       9.21% |   0.51% |   PASS   |

miscco · 2023-12-19T08:22:05Z

cub/cub/util_type.cuh

+template <::cuda::std::size_t Alignment> 
+struct max_alignment_t<Alignment> 
+{
+  constexpr static ::cuda::std::size_t value = Alignment;


Suggestion:

We usually use the ordering static constexpr, because static is the more relevant information here.

Remark: clang-format can enforce such orderings with the QualifierOrder style option, which we do not seem to set in our .clang-format.

Opened: #1748

gevtushenko · 2024-05-09T15:50:41Z

Addressed by nvbug 4428282

Fix block reduce alignment

122c4a0

gevtushenko requested review from a team as code owners December 19, 2023 01:46

gevtushenko requested review from elstehle and griwes December 19, 2023 01:46

griwes approved these changes Dec 19, 2023

miscco approved these changes Dec 19, 2023

elstehle approved these changes Dec 19, 2023

gevtushenko closed this May 9, 2024

bernhardmgruber mentioned this pull request May 16, 2024

Specify qualifier order in .clang-format #1748

Open
