Issue with P2P on PVC #810
Comments
Could you check if this torch-ccl demo script works?
The demo script seems to work:
(I'm again leaving out the ATen warnings I mentioned previously.)
@jingxu10 Wondering if there are any more tests I should run to make sure things are running correctly? Or if the CCL features I'm using are expected to work on an 8-tile PVC system?
Probably some environment variable is misconfigured. I'm checking internally and will get back to you later this week.
torch-ccl will be deprecated in favor of the new XCCL backend in the latest PyTorch. I'm checking for BKMs and will share them with you early next week.
Thanks. In that case, I'd like to get up and running with XCCL, and would appreciate details on how to do so.
Describe the issue
I'm trying to set up IPEX on a system with 8 PVC tiles and am having difficulty getting things working. Right now, I'm just trying to run some sanity tests to ensure things work. A basic P2P test is failing.
Steps Taken So Far
After running source /opt/intel/oneapi/setvars.sh and setting my LD_LIBRARY_PATH to point to pip's lib folder as well, the sanity test from the install instructions completes successfully with a warning (see below).
Simple P2P Check
I then tried to run a simple P2P check to measure bandwidth between devices:
The output is as follows (removing the ATen warning previously mentioned):
It blocks indefinitely on the send from 0 -> 2.
The bandwidth is far lower than expected: it reports 3.5 GB/s, when it should be >150 GB/s over MDFI between tiles 0 and 1 (Xe Link would be 20 GB/s).
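For reference, a P2P bandwidth check of this kind can be sketched roughly as below. This is not the script used above (which is not reproduced here), only a minimal illustration under stated assumptions: the process group has already been initialized elsewhere, each rank drives one tile, and the XPU device index equals the rank. The names bandwidth_gb_per_s and p2p_bandwidth are hypothetical helpers introduced for this sketch.

```python
import time


def bandwidth_gb_per_s(num_bytes: int, seconds: float) -> float:
    """Convert num_bytes transferred in `seconds` to GB/s (1 GB = 1e9 bytes)."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return num_bytes / seconds / 1e9


def p2p_bandwidth(src: int, dst: int, num_bytes: int = 1 << 28, iters: int = 10):
    """Time blocking send/recv from rank `src` to rank `dst`.

    Returns the measured GB/s on the two participating ranks, None on others.
    Assumes torch.distributed is already initialized (assumption, not shown).
    """
    # Deferred imports so the pure-Python helper above works without PyTorch.
    import torch
    import torch.distributed as dist

    rank = dist.get_rank()
    if rank not in (src, dst):
        return None

    # Assumption: one rank per tile, device index == rank.
    device = torch.device("xpu", rank)
    buf = torch.empty(num_bytes, dtype=torch.uint8, device=device)

    def transfer():
        if rank == src:
            dist.send(buf, dst=dst)
        else:
            dist.recv(buf, src=src)

    transfer()  # warm-up, excluded from timing
    torch.xpu.synchronize(device)

    start = time.perf_counter()
    for _ in range(iters):
        transfer()
    torch.xpu.synchronize(device)
    elapsed = time.perf_counter() - start

    return bandwidth_gb_per_s(num_bytes * iters, elapsed)
```

With this convention, 3.5e9 bytes moved in one second corresponds to the 3.5 GB/s figure above. A hang like the 0 -> 2 case would show up as the first transfer() call never returning on either rank.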
My GPUs on the system appear to be configured correctly:
Please advise on what to do. I get the same results whether using the mpirun bundled with pip or the system's Intel MPI.
Sanity Check Warning
The warning produced by the sanity check after install is about ATen op registration. I saw someone was told in another issue that this warning can be ignored, so I'm ignoring it and assuming this is successful.