[STF] Ensure algorithms with nested contexts use allocator adapters #3548

Draft: caugonnet wants to merge 5 commits into main

Conversation

caugonnet (Contributor) commented Jan 27, 2025

Description

Creating memory nodes in a CUDA graph is very expensive, and caching executable graphs that contain memory nodes will leak memory. We therefore do our best to let the parent context, which is based on CUDA streams, handle the allocations made in the graph_ctx internal to an "algorithm".

This PR should ensure that we do not create CUDA graph memory nodes, and instead use the allocator of the parent context for the "uncached allocations".
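
As a rough illustration of the idea described above (not the code in this PR), the sketch below shows how an adapter might route a nested context's allocations to its parent's allocator. Every name here (`parent_allocator`, `allocator_adapter`, `run_nested_algorithm`) is hypothetical and is not part of the CUDASTF API, and `std::malloc` stands in for device allocation so the example compiles without CUDA. It assumes that buffers must stay alive until the nested graph has run, so `deallocate` is deferred until `clear()`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical stand-in for the allocator owned by the stream-based parent context.
struct parent_allocator {
  std::size_t live = 0;         // buffers currently outstanding
  std::size_t total_allocs = 0; // allocations served so far

  void* allocate(std::size_t bytes) {
    ++live;
    ++total_allocs;
    return std::malloc(bytes); // assumption: host malloc replaces a device allocation
  }

  void deallocate(void* p) {
    std::free(p);
    --live;
  }
};

// Hypothetical adapter handed to the nested (graph) context. It forwards every
// allocation to the parent, so the nested context never creates memory nodes.
class allocator_adapter {
public:
  explicit allocator_adapter(parent_allocator& parent) : parent_(parent) {}

  // Debug-build check that clear() was called before the adapter goes away.
  ~allocator_adapter() {
    assert(pending_.empty() && "clear() must be called before destruction");
  }

  void* allocate(std::size_t bytes) {
    void* p = parent_.allocate(bytes);
    pending_.push_back(p);
    return p;
  }

  // Deferred: the buffer is only returned to the parent in clear().
  void deallocate(void*) {}

  // Return every buffer to the parent allocator.
  void clear() {
    for (void* p : pending_) {
      parent_.deallocate(p);
    }
    pending_.clear();
  }

  std::size_t pending() const { return pending_.size(); }

private:
  parent_allocator& parent_;
  std::vector<void*> pending_;
};

// Hypothetical "algorithm": makes one temporary allocation per iteration
// through the adapter, then releases everything. Returns how many buffers
// were still pending just before clear().
std::size_t run_nested_algorithm(parent_allocator& parent, int iterations) {
  allocator_adapter adapter(parent);
  for (int i = 0; i < iterations; ++i) {
    void* tmp = adapter.allocate(256);
    adapter.deallocate(tmp); // no-op here, released by clear()
  }
  const std::size_t pending = adapter.pending();
  adapter.clear();
  return pending;
}
```

In this sketch the parent sees every allocation and gets every buffer back once `clear()` runs, which is the property the description is after: the nested context owns no memory of its own that a cached executable graph could keep alive.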

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

copy-pr-bot bot commented Jan 27, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

caugonnet self-assigned this Jan 27, 2025
caugonnet added the stf (Sequential Task Flow programming model) label Jan 27, 2025
caugonnet and others added 2 commits January 28, 2025 10:08
…le and to check clear() was called, factorize code to setup allocators in algorithms
caugonnet (Contributor, Author) commented

/ok to test

Contributor

🟩 CI finished in 57m 24s: Pass: 100%/20 | Total: 3h 09m | Avg: 9m 27s | Max: 17m 16s | Hits: 388%/522
  • 🟩 cudax: Pass: 100%/20 | Total: 3h 09m | Avg: 9m 27s | Max: 17m 16s | Hits: 388%/522

    🟩 cpu
      🟩 amd64              Pass: 100%/16  | Total:  2h 37m | Avg:  9m 49s | Max: 17m 16s | Hits: 388%/522   
      🟩 arm64              Pass: 100%/4   | Total: 31m 45s | Avg:  7m 56s | Max:  8m 22s
    🟩 ctk
      🟩 12.0               Pass: 100%/1   | Total:  9m 41s | Avg:  9m 41s | Max:  9m 41s | Hits: 388%/261   
      🟩 12.5               Pass: 100%/2   | Total: 11m 30s | Avg:  5m 45s | Max:  5m 49s
      🟩 12.6               Pass: 100%/17  | Total:  2h 47m | Avg:  9m 52s | Max: 17m 16s | Hits: 388%/261   
    🟩 cudacxx
      🟩 nvcc12.0           Pass: 100%/1   | Total:  9m 41s | Avg:  9m 41s | Max:  9m 41s | Hits: 388%/261   
      🟩 nvcc12.5           Pass: 100%/2   | Total: 11m 30s | Avg:  5m 45s | Max:  5m 49s
      🟩 nvcc12.6           Pass: 100%/17  | Total:  2h 47m | Avg:  9m 52s | Max: 17m 16s | Hits: 388%/261   
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/20  | Total:  3h 09m | Avg:  9m 27s | Max: 17m 16s | Hits: 388%/522   
    🟩 cxx
      🟩 Clang14            Pass: 100%/1   | Total:  8m 30s | Avg:  8m 30s | Max:  8m 30s
      🟩 Clang15            Pass: 100%/1   | Total:  9m 27s | Avg:  9m 27s | Max:  9m 27s
      🟩 Clang16            Pass: 100%/1   | Total:  9m 20s | Avg:  9m 20s | Max:  9m 20s
      🟩 Clang17            Pass: 100%/1   | Total:  9m 16s | Avg:  9m 16s | Max:  9m 16s
      🟩 Clang18            Pass: 100%/4   | Total: 41m 47s | Avg: 10m 26s | Max: 16m 37s
      🟩 GCC10              Pass: 100%/1   | Total:  8m 59s | Avg:  8m 59s | Max:  8m 59s
      🟩 GCC11              Pass: 100%/1   | Total:  9m 44s | Avg:  9m 44s | Max:  9m 44s
      🟩 GCC12              Pass: 100%/2   | Total: 28m 08s | Avg: 14m 04s | Max: 17m 16s
      🟩 GCC13              Pass: 100%/4   | Total: 30m 07s | Avg:  7m 31s | Max:  8m 22s
      🟩 MSVC14.36          Pass: 100%/1   | Total:  9m 41s | Avg:  9m 41s | Max:  9m 41s | Hits: 388%/261   
      🟩 MSVC14.39          Pass: 100%/1   | Total: 12m 32s | Avg: 12m 32s | Max: 12m 32s | Hits: 388%/261   
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 11m 30s | Avg:  5m 45s | Max:  5m 49s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/8   | Total:  1h 18m | Avg:  9m 47s | Max: 16m 37s
      🟩 GCC                Pass: 100%/8   | Total:  1h 16m | Avg:  9m 37s | Max: 17m 16s
      🟩 MSVC               Pass: 100%/2   | Total: 22m 13s | Avg: 11m 06s | Max: 12m 32s | Hits: 388%/522   
      🟩 NVHPC              Pass: 100%/2   | Total: 11m 30s | Avg:  5m 45s | Max:  5m 49s
    🟩 gpu
      🟩 v100               Pass: 100%/20  | Total:  3h 09m | Avg:  9m 27s | Max: 17m 16s | Hits: 388%/522   
    🟩 jobs
      🟩 Build              Pass: 100%/18  | Total:  2h 35m | Avg:  8m 37s | Max: 12m 32s | Hits: 388%/522   
      🟩 Test               Pass: 100%/2   | Total: 33m 53s | Avg: 16m 56s | Max: 17m 16s
    🟩 sm
      🟩 90                 Pass: 100%/1   | Total:  6m 31s | Avg:  6m 31s | Max:  6m 31s
      🟩 90a                Pass: 100%/1   | Total:  7m 19s | Avg:  7m 19s | Max:  7m 19s
    🟩 std
      🟩 17                 Pass: 100%/4   | Total: 27m 35s | Avg:  6m 53s | Max:  7m 55s
      🟩 20                 Pass: 100%/16  | Total:  2h 41m | Avg: 10m 05s | Max: 17m 16s | Hits: 388%/522   
    

👃 Inspect Changes

Modifications in project?

     Project
     CCCL Infrastructure
     libcu++
     CUB
     Thrust
+/-  CUDA Experimental
     python
     CCCL C Parallel Library
     Catch2Helper

Modifications in project or dependencies?

     Project
     CCCL Infrastructure
     libcu++
     CUB
     Thrust
+/-  CUDA Experimental
     python
     CCCL C Parallel Library
     Catch2Helper

🏃‍ Runner counts (total jobs: 20)

# Runner
12 linux-amd64-cpu16
4 linux-arm64-cpu16
2 windows-amd64-cpu16
2 linux-amd64-gpu-v100-latest-1

Labels
stf Sequential Task Flow programming model
Projects
Status: In Review
2 participants