[CUDAX] Introduce pinned memory pool and move pinned memory resource to use it on new CUDA versions #3975

Draft
wants to merge 7 commits into
base: main

pciolkosz
Contributor

Draft, todo description

copy-pr-bot bot commented Mar 1, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@pciolkosz
Contributor Author

/ok to test

Contributor

github-actions bot commented Mar 1, 2025

🟥 CI finished in 25m 26s: Pass: 0%/22 | Total: 55m 39s | Avg: 2m 31s | Max: 8m 15s
  • 🟥 cudax: Pass: 0%/22 | Total: 55m 39s | Avg: 2m 31s | Max: 8m 15s

    🟥 cudacxx_family
      🟥 nvcc               Pass:   0%/22  | Total: 55m 39s | Avg:  2m 31s | Max:  8m 15s
    🟥 cpu
      🟥 amd64              Pass:   0%/18  | Total: 48m 43s | Avg:  2m 42s | Max:  8m 15s
      🟥 arm64              Pass:   0%/4   | Total:  6m 56s | Avg:  1m 44s | Max:  1m 52s
    🟥 ctk
      🟥 12.0               Pass:   0%/1   | Total:  8m 15s | Avg:  8m 15s | Max:  8m 15s
      🟥 12.5               Pass:   0%/2   | Total:  8m 07s | Avg:  4m 03s | Max:  4m 12s
      🟥 12.8               Pass:   0%/19  | Total: 39m 17s | Avg:  2m 04s | Max:  7m 54s
    🟥 cudacxx
      🟥 nvcc12.0           Pass:   0%/1   | Total:  8m 15s | Avg:  8m 15s | Max:  8m 15s
      🟥 nvcc12.5           Pass:   0%/2   | Total:  8m 07s | Avg:  4m 03s | Max:  4m 12s
      🟥 nvcc12.8           Pass:   0%/19  | Total: 39m 17s | Avg:  2m 04s | Max:  7m 54s
    🟥 cxx
      🟥 Clang14            Pass:   0%/1   | Total:  2m 17s | Avg:  2m 17s | Max:  2m 17s
      🟥 Clang15            Pass:   0%/1   | Total:  2m 16s | Avg:  2m 16s | Max:  2m 16s
      🟥 Clang16            Pass:   0%/1   | Total:  2m 18s | Avg:  2m 18s | Max:  2m 18s
      🟥 Clang17            Pass:   0%/1   | Total:  2m 18s | Avg:  2m 18s | Max:  2m 18s
      🟥 Clang18            Pass:   0%/4   | Total:  5m 56s | Avg:  1m 29s | Max:  2m 19s
      🟥 GCC10              Pass:   0%/1   | Total:  2m 10s | Avg:  2m 10s | Max:  2m 10s
      🟥 GCC11              Pass:   0%/1   | Total:  1m 53s | Avg:  1m 53s | Max:  1m 53s
      🟥 GCC12              Pass:   0%/2   | Total:  2m 13s | Avg:  1m 06s | Max:  2m 13s
      🟥 GCC13              Pass:   0%/6   | Total: 10m 02s | Avg:  1m 40s | Max:  2m 18s
      🟥 MSVC14.39          Pass:   0%/1   | Total:  8m 15s | Avg:  8m 15s | Max:  8m 15s
      🟥 MSVC14.42          Pass:   0%/1   | Total:  7m 54s | Avg:  7m 54s | Max:  7m 54s
      🟥 NVHPC24.7          Pass:   0%/2   | Total:  8m 07s | Avg:  4m 03s | Max:  4m 12s
    🟥 cxx_family
      🟥 Clang              Pass:   0%/8   | Total: 15m 05s | Avg:  1m 53s | Max:  2m 19s
      🟥 GCC                Pass:   0%/10  | Total: 16m 18s | Avg:  1m 37s | Max:  2m 18s
      🟥 MSVC               Pass:   0%/2   | Total: 16m 09s | Avg:  8m 04s | Max:  8m 15s
      🟥 NVHPC              Pass:   0%/2   | Total:  8m 07s | Avg:  4m 03s | Max:  4m 12s
    🟥 gpu
      🟥 h100               Pass:   0%/2   | Total:  2m 15s | Avg:  1m 07s | Max:  2m 15s
      🟥 rtx2080            Pass:   0%/20  | Total: 53m 24s | Avg:  2m 40s | Max:  8m 15s
    🟥 jobs
      🟥 Build              Pass:   0%/19  | Total: 55m 39s | Avg:  2m 55s | Max:  8m 15s
      🟥 Test               Pass:   0%/3  
    🟥 sm
      🟥 90                 Pass:   0%/3   | Total:  4m 33s | Avg:  1m 31s | Max:  2m 18s
      🟥 90a                Pass:   0%/1   | Total:  2m 10s | Avg:  2m 10s | Max:  2m 10s
    🟥 std
      🟥 17                 Pass:   0%/4   | Total:  9m 55s | Avg:  2m 28s | Max:  4m 12s
      🟥 20                 Pass:   0%/18  | Total: 45m 44s | Avg:  2m 32s | Max:  8m 15s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
+/- CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
+/- CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

🏃‍ Runner counts (total jobs: 22)

# Runner
13 linux-amd64-cpu16
4 linux-arm64-cpu16
2 windows-amd64-cpu16
2 linux-amd64-gpu-rtx2080-latest-1
1 linux-amd64-gpu-h100-latest-1

@pciolkosz
Contributor Author

/ok to test

Contributor

github-actions bot commented Mar 1, 2025

🟨 CI finished in 22m 53s: Pass: 77%/22 | Total: 2h 24m | Avg: 6m 35s | Max: 14m 26s | Hits: 86%/9499
  • 🟨 cudax: Pass: 77%/22 | Total: 2h 24m | Avg: 6m 35s | Max: 14m 26s | Hits: 86%/9499

    🔍 cpu: amd64 🔍
      🔍 amd64              Pass:  72%/18  | Total:  2h 07m | Avg:  7m 05s | Max: 14m 26s | Hits:  86%/7167  
      🟩 arm64              Pass: 100%/4   | Total: 17m 12s | Avg:  4m 18s | Max:  4m 34s | Hits:  87%/2332  
    🔍 sm: 90 🔍
      🔍 90                 Pass:  66%/3   | Total: 23m 03s | Avg:  7m 41s | Max: 14m 26s | Hits:  86%/1166  
      🟩 90a                Pass: 100%/1   | Total:  4m 08s | Avg:  4m 08s | Max:  4m 08s | Hits:  87%/583   
    🔍 std: 20 🔍
      🟩 17                 Pass: 100%/4   | Total: 19m 43s | Avg:  4m 55s | Max:  7m 01s | Hits:  84%/2124  
      🔍 20                 Pass:  72%/18  | Total:  2h 05m | Avg:  6m 57s | Max: 14m 26s | Hits:  86%/7375  
    🟨 ctk
      🟥 12.0               Pass:   0%/1   | Total: 12m 12s | Avg: 12m 12s | Max: 12m 12s
      🟩 12.5               Pass: 100%/2   | Total: 13m 57s | Avg:  6m 58s | Max:  7m 01s | Hits:  76%/750   
      🟨 12.8               Pass:  78%/19  | Total:  1h 58m | Avg:  6m 15s | Max: 14m 26s | Hits:  87%/8749  
    🟨 cudacxx
      🟥 nvcc12.0           Pass:   0%/1   | Total: 12m 12s | Avg: 12m 12s | Max: 12m 12s
      🟩 nvcc12.5           Pass: 100%/2   | Total: 13m 57s | Avg:  6m 58s | Max:  7m 01s | Hits:  76%/750   
      🟨 nvcc12.8           Pass:  78%/19  | Total:  1h 58m | Avg:  6m 15s | Max: 14m 26s | Hits:  87%/8749  
    🟨 cxx
      🟩 Clang14            Pass: 100%/1   | Total:  4m 21s | Avg:  4m 21s | Max:  4m 21s | Hits:  87%/585   
      🟩 Clang15            Pass: 100%/1   | Total:  5m 01s | Avg:  5m 01s | Max:  5m 01s | Hits:  87%/583   
      🟩 Clang16            Pass: 100%/1   | Total:  4m 48s | Avg:  4m 48s | Max:  4m 48s | Hits:  87%/583   
      🟩 Clang17            Pass: 100%/1   | Total:  4m 40s | Avg:  4m 40s | Max:  4m 40s | Hits:  87%/583   
      🟨 Clang18            Pass:  75%/4   | Total: 25m 22s | Avg:  6m 20s | Max: 12m 11s | Hits:  87%/1749  
      🟩 GCC10              Pass: 100%/1   | Total:  4m 39s | Avg:  4m 39s | Max:  4m 39s | Hits:  87%/585   
      🟩 GCC11              Pass: 100%/1   | Total:  4m 52s | Avg:  4m 52s | Max:  4m 52s | Hits:  87%/583   
      🟨 GCC12              Pass:  50%/2   | Total: 17m 16s | Avg:  8m 38s | Max: 12m 29s | Hits:  87%/583   
      🟨 GCC13              Pass:  83%/6   | Total: 35m 53s | Avg:  5m 58s | Max: 14m 26s | Hits:  86%/2915  
      🟥 MSVC14.39          Pass:   0%/1   | Total: 12m 12s | Avg: 12m 12s | Max: 12m 12s
      🟥 MSVC14.42          Pass:   0%/1   | Total: 11m 55s | Avg: 11m 55s | Max: 11m 55s
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 13m 57s | Avg:  6m 58s | Max:  7m 01s | Hits:  76%/750   
    🟨 cxx_family
      🟨 Clang              Pass:  87%/8   | Total: 44m 12s | Avg:  5m 31s | Max: 12m 11s | Hits:  87%/4083  
      🟨 GCC                Pass:  80%/10  | Total:  1h 02m | Avg:  6m 16s | Max: 14m 26s | Hits:  87%/4666  
      🟥 MSVC               Pass:   0%/2   | Total: 24m 07s | Avg: 12m 03s | Max: 12m 12s
      🟩 NVHPC              Pass: 100%/2   | Total: 13m 57s | Avg:  6m 58s | Max:  7m 01s | Hits:  76%/750   
    🟨 cudacxx_family
      🟨 nvcc               Pass:  77%/22  | Total:  2h 24m | Avg:  6m 35s | Max: 14m 26s | Hits:  86%/9499  
    🟨 gpu
      🟨 h100               Pass:  50%/2   | Total: 18m 40s | Avg:  9m 20s | Max: 14m 26s | Hits:  86%/583   
      🟨 rtx2080            Pass:  80%/20  | Total:  2h 06m | Avg:  6m 18s | Max: 12m 29s | Hits:  86%/8916  
    🟨 jobs
      🟨 Build              Pass:  89%/19  | Total:  1h 45m | Avg:  5m 34s | Max: 12m 12s | Hits:  86%/9499  
      🟥 Test               Pass:   0%/3   | Total: 39m 06s | Avg: 13m 02s | Max: 14m 26s
    

Collaborator

@miscco miscco left a comment

Cursory glance

@@ -219,9 +219,9 @@ try

printf("Enabling peer access between GPU%d and GPU%d...\n", peers[0].get(), peers[1].get());
cudax::device_memory_resource dev0_resource(peers[0]);
- dev0_resource.enable_peer_access_from(peers[1]);
+ dev0_resource.enable_access_from(peers[1]);
Collaborator

Can we move that rename into a separate PR?

Comment on lines +144 to +148
inline all_devices::operator ::std::vector<device_ref>() const
{
return ::std::vector<device_ref>(begin(), end());
}

Collaborator

Is there any benefit to not defining these functions inline?

//! @param __device_id The id of the device for which to query support.
//! @throws cuda_error if \c cudaDeviceGetAttribute failed.
//! @returns true if \c cudaDevAttrMemoryPoolsSupported is not zero.
inline void __device_supports_stream_ordered_allocations(const int __device_id)
Collaborator

I believe I came up with that name, but it's bad, because this function does not return anything.

It should rather be prefixed with something like __check or __verify.
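
To illustrate the naming point, here is a minimal sketch in plain C++ of the two styles side by side. Every name in it is hypothetical and prefixed with `example_`; the query function is a stand-in for a real attribute lookup, and none of this is the PR's code.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a runtime attribute query; not the PR's code.
// For the sake of the example, only even device ids report pool support.
inline int example_query_pool_support(int device_id)
{
  return (device_id % 2 == 0) ? 1 : 0;
}

// Predicate style: the name reads as a question and the return value answers it.
inline bool example_supports_stream_ordered_allocations(int device_id)
{
  return example_query_pool_support(device_id) != 0;
}

// Check style: returns nothing and reports failure by throwing,
// so a verb prefix such as "verify" or "check" describes it better.
inline void example_verify_stream_ordered_allocations(int device_id)
{
  if (!example_supports_stream_ordered_allocations(device_id))
  {
    throw std::runtime_error(
      "device " + std::to_string(device_id) + " does not support stream-ordered allocations");
  }
}
```

With this split, a call site such as `if (example_supports_...(id))` reads naturally, while the `verify` variant is called purely for its side effect.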

// Construct on NUMA node 0 only for now
__pool_properties.location.type = ::cudaMemLocationTypeHostNuma;
__pool_properties.location.id = __id;
#else
Collaborator

Please add comments to the conditional compilation branches.
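
One possible way to annotate such branches is sketched below. The macro `EXAMPLE_HAS_HOST_NUMA_POOLS` and the `example_` types are hypothetical stand-ins; the actual condition guarding the `#else` in the PR is not shown in the snippet above, so it is not reproduced here.

```cpp
// EXAMPLE_HAS_HOST_NUMA_POOLS is a hypothetical feature macro, defaulting to "off".
#ifndef EXAMPLE_HAS_HOST_NUMA_POOLS
#  define EXAMPLE_HAS_HOST_NUMA_POOLS 0
#endif // EXAMPLE_HAS_HOST_NUMA_POOLS

// Hypothetical stand-ins for the pool location fields set in the PR snippet.
enum class example_location_type
{
  host,
  host_numa
};

struct example_pool_location
{
  example_location_type type;
  int id;
};

inline example_pool_location example_make_location(int numa_id)
{
  example_pool_location loc{};
#if EXAMPLE_HAS_HOST_NUMA_POOLS
  // NUMA-aware host pools are available: bind the pool to the requested node.
  loc.type = example_location_type::host_numa;
  loc.id   = numa_id;
#else // ^^^ EXAMPLE_HAS_HOST_NUMA_POOLS ^^^ / vvv !EXAMPLE_HAS_HOST_NUMA_POOLS vvv
  // No NUMA support: fall back to a plain host location and ignore the node id.
  (void) numa_id;
  loc.type = example_location_type::host;
  loc.id   = 0;
#endif // EXAMPLE_HAS_HOST_NUMA_POOLS
  return loc;
}
```

Repeating the condition on `#else` and `#endif` lets a reader who lands in the middle of a long block see which configuration each branch belongs to.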
