Issue with P2P on PVC #810

Open
BenBrock opened this issue Apr 16, 2025 · 6 comments

@BenBrock commented Apr 16, 2025

Describe the issue

I'm trying to set up IPEX on a system with 8 PVC tiles and am having difficulty getting it working. Right now I'm just running sanity tests to make sure the basics work, and a basic P2P test is failing.

Steps Taken So Far

  • I installed IPEX using the instructions linked in the repo. The install appears successful.
  • After sourcing oneAPI with source /opt/intel/oneapi/setvars.sh and also pointing LD_LIBRARY_PATH at the pip install's lib folder (see the snippet after this list for what I mean), the sanity test from the install instructions completes successfully, with one warning (reproduced in the "Sanity Check Warning" section below).
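For concreteness, the lib folder I point LD_LIBRARY_PATH at is the one inside the pip-installed torch package. A quick way to print it, assuming the standard wheel layout (the layout and this snippet are my assumptions, not from the install docs):

# Print the lib directory shipped inside the pip-installed torch package.
# Assumed wheel layout: <site-packages>/torch/lib.
import os
import torch

print(os.path.join(os.path.dirname(torch.__file__), "lib"))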

Simple P2P Check

I then tried to run a simple P2P check to measure bandwidth between devices:

#!/usr/bin/env python

import os
import sys
import time
import torch
import torch.distributed as dist
import intel_extension_for_pytorch as ipex
import oneccl_bindings_for_pytorch as torch_ccl

def get_device():
    return 'xpu:%s' % (dist.get_rank() % torch.xpu.device_count(),)

def get_rank_from_env():
    # Env vars are strings; dist.init_process_group parses RANK itself.
    if 'PMI_RANK' in os.environ:
        return os.environ['PMI_RANK']
    elif 'PMIX_RANK' in os.environ:
        return os.environ['PMIX_RANK']
    elif 'RANK' in os.environ:
        return os.environ['RANK']
    else:
        raise Exception("Error: none of 'PMI_RANK', 'PMIX_RANK', or 'RANK' found in the environment. Are you invoking this script using mpirun or torchrun?")

def get_nprocs_from_env():
    if 'PMI_SIZE' in os.environ:
        return os.environ['PMI_SIZE']
    elif 'WORLD_SIZE' in os.environ:
        return os.environ['WORLD_SIZE']
    else:
        raise Exception("Error: neither 'PMI_SIZE' nor 'WORLD_SIZE' found in the environment. Are you invoking this script using mpirun or torchrun?")

os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29500"
os.environ["RANK"] = get_rank_from_env()
os.environ["WORLD_SIZE"] = get_nprocs_from_env()
dist.init_process_group(backend="ccl", init_method="env://")

nbytes = 1024*1024*1024  # 1 GiB payload

n = nbytes // 4          # number of float32 elements
nbytes = n * 4
gbytes = nbytes * 1e-9   # decimal GB, for GB/s reporting

print('Process %s/%s using device %s' % (dist.get_rank(), dist.get_world_size(), get_device()))

send_tensor = torch.zeros(n, dtype=torch.float32, device=get_device())
recv_tensor = torch.zeros(n, dtype=torch.float32, device=get_device())

# Perform an all_reduce to initialize communicators and such.
dist.all_reduce(send_tensor)

if dist.get_rank() == 0:
    print('Benchmarking P2P...')

for send_rank in range(dist.get_world_size()):
    for recv_rank in range(dist.get_world_size()):
        if send_rank != recv_rank:
            dist.barrier()

            if dist.get_rank() == send_rank:
                print('Send %s -> %s' % (send_rank, recv_rank))

            dist.barrier()
            begin = time.time()

            reqs = []

            if dist.get_rank() == send_rank:
                req = dist.isend(send_tensor, recv_rank)
                reqs.append(req)

            if dist.get_rank() == recv_rank:
                req = dist.irecv(recv_tensor, send_rank)
                reqs.append(req)

            for req in reqs:
                req.wait()

            end = time.time()
            duration = end - begin

            if dist.get_rank() == recv_rank:
                print('%s -> %s took %s s, achieved %s GB/s' % (send_rank, recv_rank, duration, gbytes / duration))

The output is as follows (removing the ATen warning previously mentioned):

(ipex-new) bbrock@hedp030:~/src/ai/pytorch-gemm/ccl> cat out.dat
My guessed rank = 4
My guessed rank = 0
My guessed rank = 1
My guessed rank = 2
My guessed rank = 3
My guessed rank = 5
My guessed rank = 6
My guessed rank = 7
Process 6/8 using device xpu:6
Process 4/8 using device xpu:4
Process 5/8 using device xpu:5
Process 3/8 using device xpu:3
Process 7/8 using device xpu:7
Process 2/8 using device xpu:2
Process 0/8 using device xpu:0
Process 1/8 using device xpu:1
Benchmarking P2P...
Send 0 -> 1
0 -> 1 took 0.30544233322143555 s, achieved 3.51536675573249 GB/s
Send 0 -> 2
There are two problems:

  1. It blocks indefinitely on the send from 0 -> 2.

  2. The bandwidth is far lower than expected. It reports 3.5 GB/s, when the 0 -> 1 transfer goes over MDFI between tiles on the same card and should exceed 150 GB/s (even Xe Link between cards would be around 20 GB/s).
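(A caveat on my own numbers: the loop above times a single cold iteration per pair, so the 3.5 GB/s figure also includes whatever first-contact connection setup the runtime does. Once the hang is fixed I'd move to a warmed-up measurement, roughly like the sketch below; the iteration count is arbitrary and the tensor names are from my script above.)

# Sketch: warm the pair up once, then average the transfer time over
# several iterations. Reuses send_tensor/recv_tensor from the script above.
niters = 10

def p2p_once(send_rank, recv_rank):
    reqs = []
    if dist.get_rank() == send_rank:
        reqs.append(dist.isend(send_tensor, recv_rank))
    if dist.get_rank() == recv_rank:
        reqs.append(dist.irecv(recv_tensor, send_rank))
    for req in reqs:
        req.wait()

p2p_once(send_rank, recv_rank)  # warm-up: connection/IPC setup happens here
dist.barrier()
begin = time.time()
for _ in range(niters):
    p2p_once(send_rank, recv_rank)
duration = (time.time() - begin) / niters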

My GPUs on the system appear to be configured correctly:

(ipex-new) bbrock@hedp030:~/src/ai/pytorch-gemm/ccl> sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2024.17.3.0.08_160000]
[opencl:cpu:1] Intel(R) OpenCL, Intel(R) Xeon(R) Platinum 8480+ OpenCL 3.0 (Build 0) [2024.17.3.0.08_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO  [25.05.32567]
[opencl:gpu:3] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO  [25.05.32567]
[opencl:gpu:4] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO  [25.05.32567]
[opencl:gpu:5] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO  [25.05.32567]
[opencl:gpu:6] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO  [25.05.32567]
[opencl:gpu:7] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO  [25.05.32567]
[opencl:gpu:8] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO  [25.05.32567]
[opencl:gpu:9] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO  [25.05.32567]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.6 [1.3.32567]
[ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.6 [1.3.32567]
[ext_oneapi_level_zero:gpu:2] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.6 [1.3.32567]
[ext_oneapi_level_zero:gpu:3] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.6 [1.3.32567]
[ext_oneapi_level_zero:gpu:4] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.6 [1.3.32567]
[ext_oneapi_level_zero:gpu:5] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.6 [1.3.32567]
[ext_oneapi_level_zero:gpu:6] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.6 [1.3.32567]
[ext_oneapi_level_zero:gpu:7] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.6 [1.3.32567]
(ipex-new) bbrock@hedp030:~/src/ai/pytorch-gemm/ccl> xpu-smi topology -m
         GPU 0/0  GPU 0/1  GPU 1/0  GPU 1/1  GPU 2/0  GPU 2/1  GPU 3/0  GPU 3/1  CPU Affinity
GPU 0/0  S        MDF      XL*      XL8      XL8      XL*      XL8      XL*      0-55,112-167
GPU 0/1  MDF      S        XL8      XL*      XL*      XL8      XL*      XL8      0-55,112-167
GPU 1/0  XL*      XL8      S        MDF      XL*      XL8      XL*      XL8      0-55,112-167
GPU 1/1  XL8      XL*      MDF      S        XL8      XL*      XL8      XL*      0-55,112-167
GPU 2/0  XL8      XL*      XL*      XL8      S        MDF      XL8      XL*      56-111,168-223
GPU 2/1  XL*      XL8      XL8      XL*      MDF      S        XL*      XL8      56-111,168-223
GPU 3/0  XL8      XL*      XL*      XL8      XL8      XL*      S        MDF      56-111,168-223
GPU 3/1  XL*      XL8      XL8      XL*      XL*      XL8      MDF      S        56-111,168-223

Please advise. I get the same results whether I use the mpirun bundled with the pip packages or the system's Intel MPI.

Sanity Check Warning

The warning produced by the post-install sanity check concerns ATen op registration. Someone was told in another issue that this warning can be ignored, so I'm ignoring it and treating the install as successful.

(ipex-new) bbrock@hedp030:~/src/ai/pytorch-gemm/ccl> python3 -c "import torch; import intel_extension_for_pytorch as ipex; print(torch.__version__); print(ipex.__version__); [print(f'[{i}]: {torch.xpu.get_device_properties(i)}') for i in range(torch.xpu.device_count())];"
[W416 15:51:01.385586326 OperatorEntry.cpp:154] Warning: Warning only once for all operators,  other operators may also be overridden.
  Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::_validate_compressed_sparse_indices(bool is_crow, Tensor compressed_idx, Tensor plain_idx, int cdim, int dim, int nnz) -> ()
    registered at /pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
  dispatch key: XPU
  previous kernel: registered at /pytorch/build/aten/src/ATen/RegisterCPU.cpp:30477
       new kernel: registered at /build/intel-pytorch-extension/build/Release/csrc/gpu/csrc/aten/generated/ATen/RegisterXPU.cpp:468 (function operator())
2.6.0+xpu
2.6.10+xpu
[0]: _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.32567+18', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
[1]: _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.32567+18', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
[2]: _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.32567+18', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
[3]: _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.32567+18', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
[4]: _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.32567+18', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
[5]: _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.32567+18', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
[6]: _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.32567+18', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
[7]: _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.32567+18', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
[W416 15:51:04.037020987 OperatorEntry.cpp:154] Warning: Warning only once for all operators,  other operators may also be overridden.
  Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::_validate_compressed_sparse_indices(bool is_crow, Tensor compressed_idx, Tensor plain_idx, int cdim, int dim, int nnz) -> ()
    registered at /pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
  dispatch key: XPU
  previous kernel: registered at /pytorch/build/aten/src/ATen/RegisterCPU.cpp:30477
       new kernel: registered at /build/intel-pytorch-extension/build/Release/csrc/gpu/csrc/aten/generated/ATen/RegisterXPU.cpp:468 (function operator())
@jingxu10
Contributor

Could you check if this torch-ccl demo script works?
https://github.com/intel/torch-ccl/tree/master/demo

@BenBrock
Author

The demo script seems to work:

(ipex-new) bbrock@hedp030:~/src/ai/pytorch-gemm/ccl> mpirun -n 8 -l python demo.py --device xpu
[0] [rank0]:[W416 22:45:40.723224165 reducer.cpp:69] Warning: measureDifference between two events is not supported on XPU backend! (function operator())
[2] [rank2]:[W416 22:45:40.723223113 reducer.cpp:69] Warning: measureDifference between two events is not supported on XPU backend! (function operator())
[3] [rank3]:[W416 22:45:40.723229753 reducer.cpp:69] Warning: measureDifference between two events is not supported on XPU backend! (function operator())
[4] [rank4]:[W416 22:45:40.723227604 reducer.cpp:69] Warning: measureDifference between two events is not supported on XPU backend! (function operator())
[5] [rank5]:[W416 22:45:40.723257697 reducer.cpp:69] Warning: measureDifference between two events is not supported on XPU backend! (function operator())
[7] [rank7]:[W416 22:45:40.726658140 reducer.cpp:69] Warning: measureDifference between two events is not supported on XPU backend! (function operator())
[1] [rank1]:[W416 22:45:40.730492019 reducer.cpp:69] Warning: measureDifference between two events is not supported on XPU backend! (function operator())
[6] [rank6]:[W416 22:45:40.731073464 reducer.cpp:69] Warning: measureDifference between two events is not supported on XPU backend! (function operator())
[2] My guessed rank = 2
[0] My guessed rank = 0
[1] My guessed rank = 1
[3] My guessed rank = 3
[4] My guessed rank = 4
[5] My guessed rank = 5
[6] My guessed rank = 6
[7] My guessed rank = 7
[6] Runing Iteration: 0 on device xpu:6
[6] Runing forward: 0 on device xpu:6
[6] Runing loss: 0 on device xpu:6
[6] Runing backward: 0 on device xpu:6
[6] Runing optim: 0 on device xpu:6
[6] Runing Iteration: 1 on device xpu:6
[6] Runing forward: 1 on device xpu:6
[6] Runing loss: 1 on device xpu:6
[6] Runing backward: 1 on device xpu:6
[6] Runing optim: 1 on device xpu:6
[6] Runing Iteration: 2 on device xpu:6
[6] Runing forward: 2 on device xpu:6
[6] Runing loss: 2 on device xpu:6
[6] Runing backward: 2 on device xpu:6
[6] Runing optim: 2 on device xpu:6
[6] Finish
[4] Runing Iteration: 0 on device xpu:4
[4] Runing forward: 0 on device xpu:4
[4] Runing loss: 0 on device xpu:4
[4] Runing backward: 0 on device xpu:4
[4] Runing optim: 0 on device xpu:4
[4] Runing Iteration: 1 on device xpu:4
[4] Runing forward: 1 on device xpu:4
[4] Runing loss: 1 on device xpu:4
[4] Runing backward: 1 on device xpu:4
[4] Runing optim: 1 on device xpu:4
[4] Runing Iteration: 2 on device xpu:4
[4] Runing forward: 2 on device xpu:4
[4] Runing loss: 2 on device xpu:4
[4] Runing backward: 2 on device xpu:4
[4] Runing optim: 2 on device xpu:4
[4] Finish
[1] Runing Iteration: 0 on device xpu:1
[1] Runing forward: 0 on device xpu:1
[1] Runing loss: 0 on device xpu:1
[1] Runing backward: 0 on device xpu:1
[1] Runing optim: 0 on device xpu:1
[1] Runing Iteration: 1 on device xpu:1
[1] Runing forward: 1 on device xpu:1
[1] Runing loss: 1 on device xpu:1
[1] Runing backward: 1 on device xpu:1
[1] Runing optim: 1 on device xpu:1
[1] Runing Iteration: 2 on device xpu:1
[1] Runing forward: 2 on device xpu:1
[1] Runing loss: 2 on device xpu:1
[1] Runing backward: 2 on device xpu:1
[1] Runing optim: 2 on device xpu:1
[1] Finish
[7] Runing Iteration: 0 on device xpu:7
[7] Runing forward: 0 on device xpu:7
[7] Runing loss: 0 on device xpu:7
[7] Runing backward: 0 on device xpu:7
[7] Runing optim: 0 on device xpu:7
[7] Runing Iteration: 1 on device xpu:7
[7] Runing forward: 1 on device xpu:7
[7] Runing loss: 1 on device xpu:7
[7] Runing backward: 1 on device xpu:7
[7] Runing optim: 1 on device xpu:7
[7] Runing Iteration: 2 on device xpu:7
[7] Runing forward: 2 on device xpu:7
[7] Runing loss: 2 on device xpu:7
[7] Runing backward: 2 on device xpu:7
[7] Runing optim: 2 on device xpu:7
[7] Finish
[3] Runing Iteration: 0 on device xpu:3
[3] Runing forward: 0 on device xpu:3
[3] Runing loss: 0 on device xpu:3
[3] Runing backward: 0 on device xpu:3
[3] Runing optim: 0 on device xpu:3
[3] Runing Iteration: 1 on device xpu:3
[3] Runing forward: 1 on device xpu:3
[3] Runing loss: 1 on device xpu:3
[3] Runing backward: 1 on device xpu:3
[3] Runing optim: 1 on device xpu:3
[3] Runing Iteration: 2 on device xpu:3
[3] Runing forward: 2 on device xpu:3
[3] Runing loss: 2 on device xpu:3
[3] Runing backward: 2 on device xpu:3
[3] Runing optim: 2 on device xpu:3
[3] Finish
[5] Runing Iteration: 0 on device xpu:5
[5] Runing forward: 0 on device xpu:5
[5] Runing loss: 0 on device xpu:5
[5] Runing backward: 0 on device xpu:5
[5] Runing optim: 0 on device xpu:5
[5] Runing Iteration: 1 on device xpu:5
[5] Runing forward: 1 on device xpu:5
[5] Runing loss: 1 on device xpu:5
[5] Runing backward: 1 on device xpu:5
[5] Runing optim: 1 on device xpu:5
[5] Runing Iteration: 2 on device xpu:5
[5] Runing forward: 2 on device xpu:5
[5] Runing loss: 2 on device xpu:5
[5] Runing backward: 2 on device xpu:5
[5] Runing optim: 2 on device xpu:5
[5] Finish
[2] Runing Iteration: 0 on device xpu:2
[2] Runing forward: 0 on device xpu:2
[2] Runing loss: 0 on device xpu:2
[2] Runing backward: 0 on device xpu:2
[2] Runing optim: 0 on device xpu:2
[2] Runing Iteration: 1 on device xpu:2
[2] Runing forward: 1 on device xpu:2
[2] Runing loss: 1 on device xpu:2
[2] Runing backward: 1 on device xpu:2
[2] Runing optim: 1 on device xpu:2
[2] Runing Iteration: 2 on device xpu:2
[2] Runing forward: 2 on device xpu:2
[2] Runing loss: 2 on device xpu:2
[2] Runing backward: 2 on device xpu:2
[2] Runing optim: 2 on device xpu:2
[2] Finish
[0] Runing Iteration: 0 on device xpu:0
[0] Runing forward: 0 on device xpu:0
[0] Runing loss: 0 on device xpu:0
[0] Runing backward: 0 on device xpu:0
[0] Runing optim: 0 on device xpu:0
[0] Runing Iteration: 1 on device xpu:0
[0] Runing forward: 1 on device xpu:0
[0] Runing loss: 1 on device xpu:0
[0] Runing backward: 1 on device xpu:0
[0] Runing optim: 1 on device xpu:0
[0] Runing Iteration: 2 on device xpu:0
[0] Runing forward: 2 on device xpu:0
[0] Runing loss: 2 on device xpu:0
[0] Runing backward: 2 on device xpu:0
[0] Runing optim: 2 on device xpu:0
[0] Finish
[4] My guessed rank = 4
[6] My guessed rank = 6
[7] My guessed rank = 7
[5] My guessed rank = 5
[1] My guessed rank = 1
[0] My guessed rank = 0
[3] My guessed rank = 3
[2] My guessed rank = 2

(I'm again leaving out the ATen warnings I mentioned previously.)

@BenBrock
Author

@jingxu10 Are there any more tests I should run to verify that things are working correctly? And are the CCL point-to-point features I'm using expected to work on an 8-tile PVC system?

@jingxu10
Contributor

Probably some environment variable is misconfigured. I'm checking internally and will get back to you later this week.
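In the meantime, raising oneCCL's log verbosity should show which transport and topology it selects at init. CCL_LOG_LEVEL is a real oneCCL knob; setting it from inside the script before the CCL backend initializes, as sketched below, is just one way to apply it:

# Debugging aid: make oneCCL print the transport/topology it picks.
# CCL_LOG_LEVEL must be set before the "ccl" backend initializes.
import os
os.environ["CCL_LOG_LEVEL"] = "debug"  # levels: error|warn|info|debug|trace

import torch
import torch.distributed as dist
import oneccl_bindings_for_pytorch  # registers the "ccl" backend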

@jingxu10
Contributor

torch-ccl will be deprecated in favor of the new XCCL backend in the latest PyTorch. I'm gathering BKMs and will share them with you early next week.

@BenBrock
Author

Thanks. In that case I'd like to move to XCCL, and I'd appreciate details on how to get it up and running.
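For reference, here's what I'm guessing the XCCL path looks like, assuming the backend registers under the name "xccl" in recent PyTorch XPU builds and needs no oneccl_bindings_for_pytorch import; please correct me if the setup differs:

# Hypothetical sketch of a minimal XCCL run. The backend name "xccl" is my
# assumption; the rest mirrors my existing CCL setup. Assumes launch via
# torchrun, which sets RANK and WORLD_SIZE in the environment.
import os
import torch
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(backend="xccl", init_method="env://")
rank = dist.get_rank()
device = f"xpu:{rank % torch.xpu.device_count()}"

t = torch.ones(1024, dtype=torch.float32, device=device)
dist.all_reduce(t)  # sanity check that the backend actually moves data
print(rank, t[0].item())
dist.destroy_process_group()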
