Modified successfully, but not displayed on ALLTOALL? #25

Open
2 tasks done
yiguCM opened this issue Nov 15, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@yiguCM

yiguCM commented Nov 15, 2024

NVIDIA Open GPU Kernel Modules Version

550.90.07-p2p

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Ubuntu 22.04.5 LTS

Kernel Release

Linux 6.8.0-47-generic #47~22.04.1-Ubuntu

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

GPU 0 through GPU 7: NVIDIA GeForce RTX 4090 (8 GPUs)

Describe the bug

Hi, I successfully modified BAR1 and enabled P2P, which improved P2PTEST performance. However, when I ran the NCCL tests, performance in the all-to-all scenario decreased. Why is this?
Is it because BAR1 is now too large? Is it possible to enable P2P without modifying BAR1?
image
image

To Reproduce

/nccl-tests/build/alltoall_perf -b 8 -e 8G -f 2 -g 8
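
For readers unfamiliar with the flags: in nccl-tests, `-b` is the minimum message size, `-e` the maximum, `-f` the multiplication factor between successive sizes, and `-g` the number of GPUs per thread. The sketch below lists the message sizes such a sweep would cover. The helper name `sweep_sizes` is hypothetical, and the code assumes `8G` is interpreted as 8 GiB; it only illustrates the sweep and does not run NCCL.

```python
def sweep_sizes(min_bytes: int, max_bytes: int, factor: int) -> list[int]:
    """Return the message sizes visited when multiplying by `factor` each step."""
    if min_bytes <= 0 or factor < 2:
        raise ValueError("min_bytes must be positive and factor at least 2")
    sizes = []
    size = min_bytes
    while size <= max_bytes:
        sizes.append(size)
        size *= factor
    return sizes


# Mirrors "-b 8 -e 8G -f 2", assuming 8G means 8 GiB.
sizes = sweep_sizes(8, 8 * 1024**3, 2)
print(f"{len(sizes)} sizes, from {sizes[0]} to {sizes[-1]} bytes")
```

Comparing the before and after results at each of these sizes shows whether the regression affects small messages, large messages, or both.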

Bug Incidence

Always

nvidia-bug-report.log.gz

image
image
image

More Info

No response

yiguCM added the bug label Nov 15, 2024

@xiaobuding-cx

@yiguCM Hi, I encountered the same issue as you. Have you resolved it?

@yiguCM
Author

yiguCM commented Dec 23, 2024 via email

@mylesgoose

This seems to be solved for multiple GPUs on an EPYC CPU. See https://github.com/aikitoria/open-gpu-kernel-modules
Screenshot_20241226_080252_Brave

@xiaobuding-cx

@yiguCM @mylesgoose Thank you all. We are currently experimenting with different CPU and hardware configurations, hoping to make some discoveries.

@mylesgoose

I think if you have two CPUs, you have to bridge the P lanes on EPYC with MCIO cables from ports on one CPU to the other. That gives you 128 lanes, which is enough for 8 GPUs at PCIe x16. If you use 9 or 10 GPUs (10 GPUs would need 160 lanes at x16), P2P goes via the CPU.
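
The lane arithmetic in this comment can be checked with a few lines of Python. This is only a sketch: the 128-lane budget and the x16 width per GPU are taken from the comment above and are not independently verified, and the names `LANE_BUDGET` and `lanes_needed` are hypothetical.

```python
LANES_PER_GPU = 16   # PCIe x16 per GPU, as stated in the comment
LANE_BUDGET = 128    # lane count quoted in the comment (assumption, not verified)


def lanes_needed(num_gpus: int, lanes_per_gpu: int = LANES_PER_GPU) -> int:
    """Total PCIe lanes required to give every GPU a full-width link."""
    return num_gpus * lanes_per_gpu


for num_gpus in (8, 9, 10):
    needed = lanes_needed(num_gpus)
    verdict = "fits" if needed <= LANE_BUDGET else "exceeds the budget"
    print(f"{num_gpus} GPUs need {needed} lanes: {verdict}")
```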

@mylesgoose

Also, I have not tested this, but if you're using a dual-root PLX board you may hit the same issue. The issue seems to come from the GPUs not being under a single root. You could try buying some C-Payne PCIe 5 to two PCIe 4 MCIO splitters and putting all the GPUs on the same CPU or root complex.
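
One way to check whether the GPUs share a root is to look at the matrix printed by `nvidia-smi topo -m`. In its legend, `PIX` means the path crosses at most a single PCIe bridge, while `SYS` means it crosses the interconnect between NUMA nodes. The sketch below groups GPU pairs by link type. The matrix in `SAMPLE_TOPO` is a made-up 4-GPU example, not output from the system in this issue, and `pairs_by_link` is a hypothetical helper that reads only the GPU-to-GPU part of the table.

```python
from collections import defaultdict

# Hypothetical sample in the style of `nvidia-smi topo -m` (not real output).
SAMPLE_TOPO = """\
        GPU0    GPU1    GPU2    GPU3
GPU0     X      PIX     SYS     SYS
GPU1    PIX      X      SYS     SYS
GPU2    SYS     SYS      X      PIX
GPU3    SYS     SYS     PIX      X
"""


def pairs_by_link(topo_text: str) -> dict[str, list[tuple[str, str]]]:
    """Group each unordered GPU pair by the link type shown in the matrix."""
    rows = []
    for line in topo_text.splitlines():
        cells = line.split()
        # Data rows look like "GPU0 X PIX ..."; the header row holds only GPU names.
        if len(cells) > 1 and cells[0].startswith("GPU") and not cells[1].startswith("GPU"):
            rows.append(cells)
    groups = defaultdict(list)
    for i, cells in enumerate(rows):
        for j in range(i + 1, len(rows)):
            # Column 1 + j of row i holds the link type between GPU i and GPU j.
            groups[cells[1 + j]].append((cells[0], rows[j][0]))
    return dict(groups)


groups = pairs_by_link(SAMPLE_TOPO)
for link, pairs in sorted(groups.items()):
    print(link, pairs)
```

In this made-up example, the four `SYS` pairs are the ones split across CPUs, which is the situation the comment above suggests avoiding.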


3 participants