Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix block reduce alignment #1233

Closed

Conversation

gevtushenko
Copy link
Collaborator

Description

closes #1229

Vectorized loads of temporary storage in block reduce, lead to unaligned access. This PR marks temporary storage for block reduce as struct alignas(detail::max_alignment_t<16, T, WarpReduceStorage>::value) _TempStorage. This and reordering of member variables in temporary storage, ensures that warp aggregate loads can be vectorized.

Note that other parts of CUB rely on __align__(16). This is potentially problematic for custom types with explicitly specified alignment. I'll create a separate issue for unifying the approach later.

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@gevtushenko gevtushenko requested review from a team as code owners December 19, 2023 01:46
@gevtushenko
Copy link
Collaborator Author

Performance is equivalent after the change:

## [0] NVIDIA RTX 6000 Ada Generation

|  T{ct}  |  OffsetT{ct}  |  Elements{io}  |   Cmp Noise |   %Diff |  Status  |
|---------|---------------|----------------|-------------|---------|----------|
|   I8    |      I32      |      2^24      |       2.41% |   0.28% |   PASS   |
|   I8    |      I64      |      2^24      |       3.02% |   0.19% |   PASS   |
|   I16   |      I32      |      2^24      |       2.67% |   0.25% |   PASS   |
|   I16   |      I64      |      2^24      |       2.52% |  -0.12% |   PASS   |
|   I32   |      I32      |      2^24      |       2.42% |   0.03% |   PASS   |
|   I32   |      I64      |      2^24      |       2.69% |  -0.05% |   PASS   |
|   I64   |      I32      |      2^24      |       1.68% |   0.00% |   PASS   |
|   I64   |      I64      |      2^24      |       1.77% |   0.01% |   PASS   |
|  I128   |      I32      |      2^24      |       1.81% |   0.05% |   PASS   |
|  I128   |      I64      |      2^24      |       1.63% |  -0.02% |   PASS   |
|   F32   |      I32      |      2^24      |       2.48% |   0.06% |   PASS   |
|   F32   |      I64      |      2^24      |       2.78% |  -0.11% |   PASS   |
|   F64   |      I32      |      2^24      |       1.88% |   0.04% |   PASS   |
|   F64   |      I64      |      2^24      |       2.04% |  -0.06% |   PASS   |
|   C64   |      I32      |      2^24      |       1.88% |  -0.09% |   PASS   |
|   C64   |      I64      |      2^24      |       1.70% |   0.03% |   PASS   |
|   I8    |      I32      |      2^28      |       1.80% |  -0.04% |   PASS   |
|   I8    |      I64      |      2^28      |       2.64% |  -0.22% |   PASS   |
|   I16   |      I32      |      2^28      |       1.91% |  -0.14% |   PASS   |
|   I16   |      I64      |      2^28      |       1.81% |   0.01% |   PASS   |
|   I32   |      I32      |      2^28      |       1.39% |   0.06% |   PASS   |
|   I32   |      I64      |      2^28      |       1.48% |  -0.16% |   PASS   |
|   I64   |      I32      |      2^28      |       9.73% |  -0.94% |   PASS   |
|   I64   |      I64      |      2^28      |       8.91% |   1.64% |   PASS   |
|  I128   |      I32      |      2^28      |       7.61% |   1.08% |   PASS   |
|  I128   |      I64      |      2^28      |       7.08% |   0.02% |   PASS   |
|   F32   |      I32      |      2^28      |       1.07% |   0.12% |   PASS   |
|   F32   |      I64      |      2^28      |       1.38% |   0.06% |   PASS   |
|   F64   |      I32      |      2^28      |       9.92% |  -1.05% |   PASS   |
|   F64   |      I64      |      2^28      |       9.02% |   0.48% |   PASS   |
|   C64   |      I32      |      2^28      |      10.04% |  -0.40% |   PASS   |
|   C64   |      I64      |      2^28      |       9.21% |   0.51% |   PASS   |

template <::cuda::std::size_t Alignment>
struct max_alignment_t<Alignment>
{
constexpr static ::cuda::std::size_t value = Alignment;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion:

we usually go with the ordering of static constexpr because static is the more relevant information here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remark: clang-format can handle such orderings using the QualifierOrder style option, which we seem to not set in our .clang-format.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened: #1748

@gevtushenko
Copy link
Collaborator Author

Addressed by nvbug 4428282

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

[BUG]: compute-sanitizer shows invalid read access on reduction
5 participants