Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for large num_items to DeviceReduce::{ArgMin,ArgMax} #2515

Closed
6 tasks done
elstehle opened this issue Oct 8, 2024 · 0 comments · Fixed by #2647
Closed
6 tasks done

Add support for large num_items to DeviceReduce::{ArgMin,ArgMax} #2515

elstehle opened this issue Oct 8, 2024 · 0 comments · Fixed by #2647
Assignees
Labels
2.8.0 target for 2.8.0 release

Comments

@elstehle
Copy link
Collaborator

elstehle commented Oct 8, 2024

DeviceReduce::Arg{Min,Max} does not yet support larger-than INT_MAX number of items. Simply instantiating the kernel with a wider offset type degrades performance by as much as 38%.

We want to mitigate the performance downside and also address #414 (comment)

Performance numbers by simply instantiating DispatchReduce with different offset types:

Rel. time u32 vs i32 any num items Rel. time u32 vs i32 2^28 num items Rel. time i64 vs i32 any num items Rel. time i64 vs i32 2^28 num items Rel. time u64 vs i32 any num items Rel. time u64 vs i32 2^28 num items
min 98.93% 100.00% 99.43% 100.00% 98.82% 100.00%
max 108.61% 106.36% 138.59% 138.59% 138.10% 138.10%
avg 101.00% 101.14% 106.47% 107.81% 106.80% 107.79%
  • Add benchmarks for DeviceReduce::Arg{Min,Max}
  • Add tests for large number of items for DeviceReduce::Arg{Min,Max}
  • Identify solution for large number of items support for DeviceReduce::Arg{Min,Max} ➡ "streaming" reduction
  • implement new dispatch class for streaming reduction that will be used for DeviceReduce::Arg{Min,Max}
  • Implement solution for large number of items support for DeviceReduce::Arg{Min,Max}
  • Find solution for 64-bit indexing for DeviceSegmentedReduce #414 (comment)

Performance comparison for old.main.i64 offset type vs. streaming approach:

Summary for all rows

  • Minimum %Diff: -27.44% (i.e., 1.38-fold higher throughput)
  • Maximum %Diff: 5.64% (5% slow-down, but really only for a tiny problem size)
Performance comparison for `old.main.i64` vs. streaming approach
T{ct} Operation{ct} Elements{io} Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
I8 ArgMin 2^16 9.105 us 2.33% 9.338 us 2.53% 0.233 us 2.56% SLOW
I8 ArgMin 2^20 12.638 us 1.44% 11.554 us 1.77% -1.084 us -8.58% FAST
I8 ArgMin 2^24 41.738 us 0.52% 33.996 us 0.52% -7.743 us -18.55% FAST
I8 ArgMin 2^28 339.095 us 0.43% 246.679 us 0.66% -92.416 us -27.25% FAST
I8 ArgMax 2^16 9.432 us 2.06% 9.499 us 2.07% 0.067 us 0.71% SAME
I8 ArgMax 2^20 12.650 us 1.74% 11.540 us 1.84% -1.110 us -8.77% FAST
I8 ArgMax 2^24 41.745 us 0.56% 33.918 us 0.64% -7.828 us -18.75% FAST
I8 ArgMax 2^28 338.962 us 0.35% 245.951 us 0.77% -93.010 us -27.44% FAST
I16 ArgMin 2^16 9.347 us 2.54% 9.583 us 2.24% 0.237 us 2.53% SLOW
I16 ArgMin 2^20 12.476 us 1.67% 11.917 us 1.57% -0.559 us -4.48% FAST
I16 ArgMin 2^24 46.075 us 0.62% 38.511 us 0.68% -7.565 us -16.42% FAST
I16 ArgMin 2^28 346.506 us 1.98% 310.016 us 2.11% -36.490 us -10.53% FAST
I16 ArgMax 2^16 9.296 us 2.43% 9.615 us 1.68% 0.319 us 3.43% SLOW
I16 ArgMax 2^20 12.463 us 1.67% 11.889 us 1.73% -0.574 us -4.61% FAST
I16 ArgMax 2^24 45.914 us 0.58% 38.449 us 0.66% -7.465 us -16.26% FAST
I16 ArgMax 2^28 345.574 us 1.85% 310.318 us 2.10% -35.256 us -10.20% FAST
I32 ArgMin 2^16 8.847 us 2.19% 8.512 us 2.31% -0.334 us -3.78% FAST
I32 ArgMin 2^20 12.941 us 1.59% 12.460 us 1.82% -0.482 us -3.72% FAST
I32 ArgMin 2^24 59.855 us 0.40% 56.674 us 0.47% -3.181 us -5.31% FAST
I32 ArgMin 2^28 579.582 us 1.33% 576.438 us 1.22% -3.144 us -0.54% SAME
I32 ArgMax 2^16 8.812 us 1.55% 8.505 us 2.21% -0.308 us -3.49% FAST
I32 ArgMax 2^20 13.020 us 1.67% 12.493 us 1.80% -0.528 us -4.05% FAST
I32 ArgMax 2^24 59.830 us 0.44% 56.633 us 0.49% -3.196 us -5.34% FAST
I32 ArgMax 2^28 579.888 us 1.32% 576.493 us 1.23% -3.395 us -0.59% SAME
I64 ArgMin 2^16 8.701 us 1.73% 9.121 us 2.49% 0.420 us 4.83% SLOW
I64 ArgMin 2^20 14.708 us 1.13% 15.204 us 1.31% 0.497 us 3.38% SLOW
I64 ArgMin 2^24 93.593 us 0.27% 94.248 us 0.32% 0.655 us 0.70% SLOW
I64 ArgMin 2^28 1.120 ms 0.87% 1.116 ms 0.81% -4.249 us -0.38% SAME
I64 ArgMax 2^16 8.598 us 1.59% 9.083 us 2.45% 0.485 us 5.64% SLOW
I64 ArgMax 2^20 14.839 us 1.29% 15.271 us 1.39% 0.431 us 2.91% SLOW
I64 ArgMax 2^24 93.671 us 0.31% 94.310 us 0.32% 0.639 us 0.68% SLOW
I64 ArgMax 2^28 1.120 ms 0.88% 1.116 ms 0.82% -4.115 us -0.37% SAME
I128 ArgMin 2^16 10.218 us 1.89% 10.497 us 1.56% 0.279 us 2.73% SLOW
I128 ArgMin 2^20 24.396 us 0.85% 24.242 us 0.92% -0.153 us -0.63% SAME
I128 ArgMin 2^24 165.965 us 0.91% 167.414 us 0.71% 1.449 us 0.87% SLOW
I128 ArgMin 2^28 2.214 ms 0.55% 2.206 ms 0.50% -7.729 us -0.35% SAME
I128 ArgMax 2^16 10.515 us 1.83% 10.305 us 2.07% -0.211 us -2.01% FAST
I128 ArgMax 2^20 24.632 us 0.94% 23.853 us 0.87% -0.780 us -3.16% FAST
I128 ArgMax 2^24 166.315 us 0.93% 167.585 us 0.59% 1.270 us 0.76% SLOW
I128 ArgMax 2^28 2.214 ms 0.54% 2.206 ms 0.50% -7.473 us -0.34% SAME
F32 ArgMin 2^16 9.295 us 2.57% 8.222 us 2.36% -1.073 us -11.55% FAST
F32 ArgMin 2^20 13.201 us 1.64% 12.146 us 1.35% -1.055 us -7.99% FAST
F32 ArgMin 2^24 60.171 us 0.45% 56.509 us 0.44% -3.662 us -6.09% FAST
F32 ArgMin 2^28 580.091 us 1.33% 576.227 us 1.23% -3.864 us -0.67% SAME
F32 ArgMax 2^16 9.236 us 2.18% 8.304 us 2.41% -0.932 us -10.09% FAST
F32 ArgMax 2^20 13.253 us 1.54% 12.098 us 1.74% -1.155 us -8.72% FAST
F32 ArgMax 2^24 60.159 us 0.47% 56.480 us 0.43% -3.679 us -6.12% FAST
F32 ArgMax 2^28 580.086 us 1.34% 576.152 us 1.23% -3.934 us -0.68% SAME
F64 ArgMin 2^16 9.080 us 2.17% 9.179 us 2.86% 0.098 us 1.08% SAME
F64 ArgMin 2^20 15.096 us 1.46% 15.044 us 1.42% -0.052 us -0.34% SAME
F64 ArgMin 2^24 93.488 us 0.31% 94.285 us 0.30% 0.798 us 0.85% SLOW
F64 ArgMin 2^28 1.116 ms 0.86% 1.116 ms 0.81% -0.344 us -0.03% SAME
F64 ArgMax 2^16 9.128 us 2.39% 8.986 us 2.20% -0.142 us -1.56% SAME
F64 ArgMax 2^20 15.059 us 1.62% 15.048 us 1.25% -0.011 us -0.07% SAME
F64 ArgMax 2^24 93.443 us 0.30% 94.263 us 0.29% 0.819 us 0.88% SLOW
F64 ArgMax 2^28 1.117 ms 0.85% 1.116 ms 0.80% -0.703 us -0.06% SAME

Performance comparison for old.main.i32 vs. streaming approach

Summary for rows where Elements{io} is 2^28

  • Minimum %Diff: -0.03%
  • Maximum %Diff: 2.51%
Performance comparison for `old.main.i32` vs. streaming approach
T{ct} Operation{ct} Elements{io} Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
I8 ArgMin 2^16 9.118 us 1.97% 9.338 us 2.53% 0.221 us 2.42% SLOW
I8 ArgMin 2^20 11.384 us 1.49% 11.554 us 1.77% 0.170 us 1.49% SAME
I8 ArgMin 2^24 33.450 us 0.63% 33.996 us 0.52% 0.546 us 1.63% SLOW
I8 ArgMin 2^28 240.632 us 0.79% 246.679 us 0.66% 6.046 us 2.51% SLOW
I8 ArgMax 2^16 9.424 us 2.57% 9.499 us 2.07% 0.075 us 0.80% SAME
I8 ArgMax 2^20 11.358 us 1.69% 11.540 us 1.84% 0.182 us 1.61% SAME
I8 ArgMax 2^24 33.498 us 0.67% 33.918 us 0.64% 0.419 us 1.25% SLOW
I8 ArgMax 2^28 243.687 us 0.79% 245.951 us 0.77% 2.265 us 0.93% SLOW
I16 ArgMin 2^16 9.495 us 2.63% 9.583 us 2.24% 0.088 us 0.93% SAME
I16 ArgMin 2^20 11.671 us 1.42% 11.917 us 1.57% 0.246 us 2.11% SLOW
I16 ArgMin 2^24 38.096 us 0.75% 38.511 us 0.68% 0.414 us 1.09% SLOW
I16 ArgMin 2^28 309.922 us 2.11% 310.016 us 2.11% 0.094 us 0.03% SAME
I16 ArgMax 2^16 9.346 us 2.64% 9.615 us 1.68% 0.269 us 2.87% SLOW
I16 ArgMax 2^20 11.692 us 1.88% 11.889 us 1.73% 0.197 us 1.69% SAME
I16 ArgMax 2^24 38.072 us 0.75% 38.449 us 0.66% 0.378 us 0.99% SLOW
I16 ArgMax 2^28 309.651 us 2.13% 310.318 us 2.10% 0.667 us 0.22% SAME
I32 ArgMin 2^16 8.017 us 1.77% 8.512 us 2.31% 0.495 us 6.18% SLOW
I32 ArgMin 2^20 12.078 us 1.82% 12.460 us 1.82% 0.382 us 3.16% SLOW
I32 ArgMin 2^24 56.300 us 0.44% 56.674 us 0.47% 0.374 us 0.67% SLOW
I32 ArgMin 2^28 576.020 us 1.23% 576.438 us 1.22% 0.418 us 0.07% SAME
I32 ArgMax 2^16 8.050 us 1.63% 8.505 us 2.21% 0.455 us 5.65% SLOW
I32 ArgMax 2^20 12.016 us 1.23% 12.493 us 1.80% 0.476 us 3.96% SLOW
I32 ArgMax 2^24 56.175 us 0.45% 56.633 us 0.49% 0.459 us 0.82% SLOW
I32 ArgMax 2^28 576.168 us 1.22% 576.493 us 1.23% 0.324 us 0.06% SAME
I64 ArgMin 2^16 8.708 us 2.19% 9.121 us 2.49% 0.414 us 4.75% SLOW
I64 ArgMin 2^20 14.878 us 1.36% 15.204 us 1.31% 0.326 us 2.19% SLOW
I64 ArgMin 2^24 93.945 us 0.30% 94.248 us 0.32% 0.303 us 0.32% SLOW
I64 ArgMin 2^28 1.116 ms 0.82% 1.116 ms 0.81% -0.039 us -0.00% SAME
I64 ArgMax 2^16 8.743 us 2.07% 9.083 us 2.45% 0.339 us 3.88% SLOW
I64 ArgMax 2^20 14.939 us 1.40% 15.271 us 1.39% 0.332 us 2.22% SLOW
I64 ArgMax 2^24 93.908 us 0.31% 94.310 us 0.32% 0.402 us 0.43% SLOW
I64 ArgMax 2^28 1.116 ms 0.81% 1.116 ms 0.82% 0.088 us 0.01% SAME
I128 ArgMin 2^16 10.168 us 1.77% 10.497 us 1.56% 0.329 us 3.23% SLOW
I128 ArgMin 2^20 23.805 us 0.82% 24.242 us 0.92% 0.438 us 1.84% SLOW
I128 ArgMin 2^24 167.297 us 0.51% 167.414 us 0.71% 0.117 us 0.07% SAME
I128 ArgMin 2^28 2.206 ms 0.50% 2.206 ms 0.50% 0.288 us 0.01% SAME
I128 ArgMax 2^16 10.461 us 1.78% 10.305 us 2.07% -0.156 us -1.50% SAME
I128 ArgMax 2^20 24.045 us 0.99% 23.853 us 0.87% -0.193 us -0.80% SAME
I128 ArgMax 2^24 167.588 us 0.54% 167.585 us 0.59% -0.003 us -0.00% SAME
I128 ArgMax 2^28 2.206 ms 0.50% 2.206 ms 0.50% 0.573 us 0.03% SAME
F32 ArgMin 2^16 8.489 us 2.06% 8.222 us 2.36% -0.267 us -3.15% FAST
F32 ArgMin 2^20 12.315 us 1.71% 12.146 us 1.35% -0.169 us -1.37% FAST
F32 ArgMin 2^24 56.654 us 0.45% 56.509 us 0.44% -0.145 us -0.26% SAME
F32 ArgMin 2^28 576.334 us 1.21% 576.227 us 1.23% -0.106 us -0.02% SAME
F32 ArgMax 2^16 8.492 us 2.31% 8.304 us 2.41% -0.188 us -2.21% SAME
F32 ArgMax 2^20 12.281 us 1.78% 12.098 us 1.74% -0.184 us -1.50% SAME
F32 ArgMax 2^24 56.601 us 0.47% 56.480 us 0.43% -0.122 us -0.21% SAME
F32 ArgMax 2^28 576.297 us 1.23% 576.152 us 1.23% -0.145 us -0.03% SAME
F64 ArgMin 2^16 9.214 us 1.91% 9.179 us 2.86% -0.036 us -0.39% SAME
F64 ArgMin 2^20 15.252 us 1.51% 15.044 us 1.42% -0.208 us -1.37% SAME
F64 ArgMin 2^24 94.372 us 0.32% 94.285 us 0.30% -0.086 us -0.09% SAME
F64 ArgMin 2^28 1.116 ms 0.81% 1.116 ms 0.81% 0.000 us 0.00% SAME
F64 ArgMax 2^16 9.181 us 2.28% 8.986 us 2.20% -0.195 us -2.13% SAME
F64 ArgMax 2^20 15.272 us 1.45% 15.048 us 1.25% -0.225 us -1.47% FAST
F64 ArgMax 2^24 94.383 us 0.30% 94.263 us 0.29% -0.120 us -0.13% SAME
F64 ArgMax 2^28 1.116 ms 0.80% 1.116 ms 0.80% -0.281 us -0.03% SAME
@github-project-automation github-project-automation bot moved this to Todo in CCCL Oct 8, 2024
@elstehle elstehle self-assigned this Oct 9, 2024
@elstehle elstehle moved this from Todo to In Progress in CCCL Oct 9, 2024
@cccl-authenticator-app cccl-authenticator-app bot moved this from In Progress to In Review in CCCL Oct 29, 2024
@cccl-authenticator-app cccl-authenticator-app bot moved this from In Review to In Progress in CCCL Oct 30, 2024
@cccl-authenticator-app cccl-authenticator-app bot moved this from In Progress to In Review in CCCL Oct 30, 2024
@jollylili jollylili added the 2.8.0 target for 2.8.0 release label Nov 15, 2024
@elstehle elstehle added the blocked This PR cannot be merged due to various reasons label Dec 12, 2024
@elstehle elstehle changed the title Add support for large num_items to DeviceReduce::{ArgMin,ArgMax} [DO NOT MERGE] Add support for large num_items to DeviceReduce::{ArgMin,ArgMax} Dec 12, 2024
@elstehle elstehle changed the title [DO NOT MERGE] Add support for large num_items to DeviceReduce::{ArgMin,ArgMax} Add support for large num_items to DeviceReduce::{ArgMin,ArgMax} Dec 12, 2024
@elstehle elstehle removed the blocked This PR cannot be merged due to various reasons label Dec 12, 2024
@github-project-automation github-project-automation bot moved this from In Review to Done in CCCL Dec 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2.8.0 target for 2.8.0 release
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

2 participants