Tests failed #58

Closed
findmyway opened this issue Jul 28, 2024 · 3 comments

findmyway commented Jul 28, 2024

Do I need to configure anything to pass the tests?

(This is a fresh installation based on the pytorch:24.01-py3 image.)

     Testing Running tests...
┌ Info: CUDA information:
│ CUDA runtime 12.5, artifact installation
│ CUDA driver 12.5
│ NVIDIA driver 535.161.8, originally for CUDA 12.2
│
│ CUDA libraries:
│ - CUBLAS: 12.5.3
│ - CURAND: 10.3.6
│ - CUFFT: 11.2.3
│ - CUSOLVER: 11.6.3
│ - CUSPARSE: 12.5.1
│ - CUPTI: 2024.2.1 (API 23.0.0)
│ - NVML: 12.0.0+535.161.8
│
│ Julia packages:
│ - CUDA: 5.4.3
│ - CUDA_Driver_jll: 0.9.1+1
│ - CUDA_Runtime_jll: 0.14.1+0
│
│ Toolchain:
│ - Julia: 1.10.4
│ - LLVM: 15.0.7
│
│ 8 devices:
│   0: NVIDIA H100 80GB HBM3 (sm_90, 79.106 GiB / 79.647 GiB available)
│   1: NVIDIA H100 80GB HBM3 (sm_90, 79.106 GiB / 79.647 GiB available)
│   2: NVIDIA H100 80GB HBM3 (sm_90, 79.106 GiB / 79.647 GiB available)
│   3: NVIDIA H100 80GB HBM3 (sm_90, 79.106 GiB / 79.647 GiB available)
│   4: NVIDIA H100 80GB HBM3 (sm_90, 79.106 GiB / 79.647 GiB available)
│   5: NVIDIA H100 80GB HBM3 (sm_90, 79.106 GiB / 79.647 GiB available)
│   6: NVIDIA H100 80GB HBM3 (sm_90, 79.106 GiB / 79.647 GiB available)
└   7: NVIDIA H100 80GB HBM3 (sm_90, 79.106 GiB / 79.647 GiB available)
[ Info: NCCL version: 2.19.4
Communicator: Error During Test at /....../NCCL.jl/test/runtests.jl:11
  Got exception outside of a @test
  NCCLError(code ncclUnhandledCudaError, a call to a CUDA function failed)
  Stacktrace:
    [1] check
      @ /....../NCCL.jl/src/libnccl.jl:17 [inlined]
    [2] ncclCommInitAll
      @ ~/.julia/packages/CUDA/Tl08O/lib/utils/call.jl:34 [inlined]
    [3] Communicators(deviceids::Vector{Int32})
      @ NCCL /....../NCCL.jl/src/communicator.jl:70
    [4] Communicators(devices::CUDA.DeviceIterator)
      @ NCCL /....../NCCL.jl/src/communicator.jl:80
    [5] macro expansion
      @ /....../NCCL.jl/test/runtests.jl:13 [inlined]
    [6] macro expansion
      @ ~/.julia/juliaup/julia-1.10.4+0.x64.linux.gnu/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
    [7] macro expansion
      @ /....../NCCL.jl/test/runtests.jl:13 [inlined]
    [8] macro expansion
      @ ~/.julia/juliaup/julia-1.10.4+0.x64.linux.gnu/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
    [9] top-level scope
      @ /....../NCCL.jl/test/runtests.jl:11
   [10] include(fname::String)
      @ Base.MainInclude ./client.jl:489
   [11] top-level scope
      @ none:6
   [12] eval
      @ ./boot.jl:385 [inlined]
   [13] exec_options(opts::Base.JLOptions)
      @ Base ./client.jl:291
   [14] _start()
      @ Base ./client.jl:552

findmyway (Author) commented:

By setting NCCL_DEBUG=INFO I got the following error message:

NCCL WARN Cuda failure 'initialization error'
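
For anyone reproducing this: NCCL_DEBUG is a regular NCCL environment variable, so it can also be set from inside Julia, as long as that happens before NCCL is first used (a minimal sketch):

using CUDA, NCCL

# NCCL reads NCCL_DEBUG when it initializes, so set it before the
# first communicator is created.
ENV["NCCL_DEBUG"] = "INFO"

# Creating the communicators now makes libnccl print its internal log
# lines, including the "Cuda failure" warning above.
comms = NCCL.Communicators(CUDA.devices())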

findmyway (Author) commented:

It seems that in the original CI pipeline there are some extra configuration steps:

echo -e "[extras]\nCUDA_Runtime_jll = \"76a88914-d11a-5bdc-97e0-2f5a05c973a2\"\nCUDA_Driver_jll = \"4ee394cb-3365-5eb0-8335-949819d2adfc\"" >>test/Project.toml
echo -e "[CUDA_Runtime_jll]\nversion = \"{{matrix.cuda}}\"" >test/LocalPreferences.toml
echo -e "[CUDA_Driver_jll]\ncompat = \"false\"" >>test/LocalPreferences.toml

I then set LocalPreferences.toml to:

[CUDA_Runtime_jll]
version = "12.3"
[CUDA_Driver_jll]
compat = "false"
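
As a side note, instead of editing LocalPreferences.toml by hand, CUDA.jl also provides CUDA.set_runtime_version! to write the runtime-version preference (a minimal sketch; the CUDA_Driver_jll compat override above still has to be added manually):

using CUDA

# Pin the CUDA runtime that CUDA.jl uses to 12.3, matching the
# version = "12.3" preference above. This writes the preference into
# LocalPreferences.toml of the active environment; restart Julia for
# it to take effect.
CUDA.set_runtime_version!(v"12.3")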

Now I can at least initialize NCCL.Communicators, but all collective operations hit the following error:

sum: Error During Test at NCCL.jl/test/runtests.jl:29
  Got exception outside of a @test
  ArgumentError: cannot take the GPU address of inaccessible device memory.
  
  You are trying to use memory from GPU 0 on GPU 7.
  P2P access between these devices is not possible; either switch to GPU 0
  by calling `CUDA.device!(0)`, or copy the data to an array allocated on device 7.
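
The error message itself suggests the fix pattern: every buffer that takes part in a collective has to be allocated (and used) while its device is active. A minimal sketch of per-device allocation (buffer size and names are illustrative):

using CUDA

# Allocate one send/receive pair per GPU, switching the active device
# before each allocation so no buffer is later touched from a device
# that has no P2P access to it.
buffers = map(collect(CUDA.devices())) do dev
    CUDA.device!(dev)                  # make `dev` the active device
    (send = CUDA.rand(Float32, 1024),  # both arrays live on `dev`
     recv = CUDA.zeros(Float32, 1024))
end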

findmyway commented Aug 1, 2024

The error is triggered when converting a CuArray into a CuPtr. The root cause is that, at

https://github.com/JuliaGPU/CUDA.jl/blob/d7077da2b7df32f9d0a2bced56511cdd778ab4ed/src/memory.jl#L549

P2P access is not enabled:

julia> CUDA.peer_access[]
8×8 Matrix{Int64}:
  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0
 -1  0  0  0  0  0  0  0

However, according to nvidia-smi topo -p2p r, P2P access on my node should be fine:

nvidia-smi topo -p2p r
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7
 GPU0   X       OK      OK      OK      OK      OK      OK      OK
 GPU1   OK      X       OK      OK      OK      OK      OK      OK
 GPU2   OK      OK      X       OK      OK      OK      OK      OK
 GPU3   OK      OK      OK      X       OK      OK      OK      OK
 GPU4   OK      OK      OK      OK      X       OK      OK      OK
 GPU5   OK      OK      OK      OK      OK      X       OK      OK
 GPU6   OK      OK      OK      OK      OK      OK      X       OK
 GPU7   OK      OK      OK      OK      OK      OK      OK      X

Legend:

  X    = Self
  OK   = Status Ok
  CNS  = Chipset not supported
  GNS  = GPU not supported
  TNS  = Topology not supported
  NS   = Not supported
  U    = Unknown
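
As a cross-check from inside Julia, the raw driver query behind that matrix can be reproduced with CUDA.jl's low-level cuDeviceCanAccessPeer wrapper (a sketch; this bypasses the cached CUDA.peer_access table shown above):

using CUDA

# Ask the driver directly whether each ordered device pair supports
# peer access, independent of CUDA.jl's cached matrix.
devs = collect(CUDA.devices())
for src in devs, dst in devs
    src == dst && continue
    flag = Ref{Cint}(0)
    CUDA.cuDeviceCanAccessPeer(flag, src, dst)
    println(src, " -> ", dst, ": ", flag[] == 1 ? "OK" : "NS")
end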

OK, I missed a very important warning:

NCCL version 2.19.4+cuda12.3
┌ Warning: Enabling peer-to-peer access between CuDevice(7) and CuDevice(0) failed; please file an issue.
│   exception =
│    CUDA error: peer access is already enabled (code 704, ERROR_PEER_ACCESS_ALREADY_ENABLED)
│    Stacktrace:

This is reported from:
https://github.com/JuliaGPU/CUDA.jl/blob/d7077da2b7df32f9d0a2bced56511cdd778ab4ed/lib/cudadrv/context.jl#L404
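
Since the underlying CUDA error is ERROR_PEER_ACCESS_ALREADY_ENABLED (code 704), i.e. access was already on, one hypothetical user-side workaround is to enable peer access for all device pairs up front and treat code 704 as success (a sketch using CUDA.jl's enable_peer_access; whether this is needed may depend on the CUDA.jl version):

using CUDA

devs = collect(CUDA.devices())

# Grab each device's context once (device! makes it current).
ctxs = Dict(dev => (CUDA.device!(dev); CUDA.context()) for dev in devs)

for src in devs, dst in devs
    src == dst && continue
    CUDA.device!(src)
    try
        CUDA.enable_peer_access(ctxs[dst])
    catch err
        # Code 704 ("peer access is already enabled") is fine here.
        (err isa CUDA.CuError && Int(err.code) == 704) || rethrow()
    end
end

Since CUDA.jl caches the failed attempt as -1 in CUDA.peer_access (see the matrix above), this would have to run before the first collective call.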
