Low performance when running over NVLink #3

Open
2 tasks done
sheepymeh opened this issue Apr 14, 2024 · 7 comments
Labels
bug Something isn't working

Comments

sheepymeh commented Apr 14, 2024

NVIDIA Open GPU Kernel Modules Version

Comparing with NVIDIA commit 12933b2

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Ubuntu 22.04.4 LTS

Kernel Release

5.15.0-102-generic

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

NVIDIA GeForce RTX 3090

Describe the bug

Thank you for this project! It seems to be working well on 3090s. However, NVLink seems to underperform with this fork.

In the results below, the variation in the performance of PCIe GPUs is caused by differing PCIe versions and lanes. GPUs 2 and 3 are connected via NVLink (4 lanes, 56.25GB/s theoretical unidirectional performance). They are also connected via PCIe Gen 4 x8 (25GB/s theoretical unidirectional performance).

Running p2pBandwidthLatencyTest with this fork:

Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6
     0 837.80  11.33  11.67  10.49  15.66  11.40  11.11
     1  11.37 812.92   8.92   8.93  11.38   8.94  11.40
     2  11.23   8.94 838.70   8.97  11.14   8.98  11.27
     3  11.20   8.90   8.91 838.00  11.12   8.92  11.25
     4  15.48  11.35  11.57  11.55 838.93  11.39  16.07
     5  11.34   8.90   8.95   8.93  11.38 838.03  11.31
     6  15.86  11.39  10.57  11.67  16.05  10.95 838.48

Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6
     0 837.13  25.42  26.00  25.98  50.82  25.46  51.20
     1  25.45 838.25  25.40  25.50  25.46  25.52  25.45
     2  25.95  25.45 837.58  17.27  25.99  25.46  25.99
     3  25.99  25.50  17.04 835.34  25.99  25.46  25.99
     4  50.18  25.46  26.00  25.98 838.25  25.42  51.21
     5  25.46  25.57  25.41  25.51  25.38 837.35  25.47
     6  50.20  25.46  25.99  25.98  51.22  25.47 839.83

With the original open-source driver:

Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6
     0 837.13  11.36  11.66  10.49  15.65  11.37  15.92
     1  11.43 830.23   8.88   8.92  11.41   8.95  11.38
     2  11.18   8.93 837.80   8.97  11.13   8.99  11.26
     3  11.21   8.91   8.91 839.60  11.13   8.91  11.26
     4  15.51  11.38  11.56  11.57 838.70  11.41  16.01
     5  11.34   8.97   8.93   8.94  11.35 838.67  11.28
     6  15.86  11.35  11.66  11.68  11.66  11.27 838.03

Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6
     0 837.56  11.35  11.66  11.65  15.60  11.32  15.92
     1  11.42 838.66   8.94   8.94  11.37   8.94  11.38
     2  11.21   8.94 838.70 101.69  11.14   8.94  11.26
     3  11.19   8.97 101.91 837.80  11.11   8.92  11.26
     4  15.50  11.37  11.57  11.57 838.48  11.37  15.84
     5  11.31   8.95   8.93   8.94  11.33 838.03  11.28
     6  15.80  11.35  11.70  10.43  16.07  11.28 838.93

We can see that the P2P driver improves performance as expected over PCIe with this fork (e.g. GPU 6 to GPU 0 goes from 15.80 GB/s to 50.20 GB/s). However, NVLink performance (GPUs 2 and 3) decreases from ~100 GB/s to ~17 GB/s.
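To put numbers on the comparison above, here is a small sketch that computes the change for one PCIe pair and the NVLink pair, using only figures quoted in this report. The helper name `summarize` is made up for illustration, and the bidirectional NVLink peak is assumed to be twice the quoted unidirectional figure.

```python
# Figures quoted in this report, in GB/s.
NVLINK_UNIDIR_THEORETICAL = 56.25  # 4 NVLink lanes, one direction
# Assumption: the bidirectional peak is twice the unidirectional peak.
NVLINK_BIDIR_THEORETICAL = 2 * NVLINK_UNIDIR_THEORETICAL


def summarize(label, original, fork, peak=None):
    """Describe how a bandwidth figure changed from the original driver to the fork."""
    line = f"{label}: {original:.2f} -> {fork:.2f} GB/s ({fork / original:.2f}x)"
    if peak is not None:
        line += f", {100 * fork / peak:.0f}% of theoretical peak"
    return line


# Bidirectional P2P=Enabled values taken from the matrices above.
print(summarize("PCIe GPU 6 -> GPU 0", 15.80, 50.20))
print(summarize("NVLink GPU 2 -> GPU 3", 101.69, 17.27, NVLINK_BIDIR_THEORETICAL))
```

The NVLink pair ends up below what the same pair reached with the original driver and also below the other P2P-enabled PCIe pairs in the fork's matrix.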

To Reproduce

Run p2pBandwidthLatencyTest with this fork and compare the results with the original open-source driver.
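For anyone repeating the comparison, the sketch below parses the bandwidth matrices out of saved p2pBandwidthLatencyTest output and lists the GPU pairs whose bandwidth dropped the most. It assumes the output layout shown in this report; `parse_matrices` and `largest_changes` are hypothetical helper names, and the sample text is a trimmed excerpt (GPUs 2 and 3 only) of the matrices above.

```python
import re

ROW = re.compile(r"\d+(\s+\d+\.\d+)+")


def parse_matrices(text):
    """Return {title: {(row_gpu, col_gpu): GB/s}} for each bandwidth matrix."""
    matrices, title, cols = {}, None, []
    for line in text.splitlines():
        s = line.strip()
        if not s:
            continue
        if "Bandwidth Matrix" in s:
            title, cols = s, []
            matrices[title] = {}
        elif title and s.startswith("D\\D"):
            cols = [int(c) for c in s.split()[1:]]
        elif title and cols and ROW.fullmatch(s):
            fields = s.split()
            for col, value in zip(cols, fields[1:]):
                matrices[title][(int(fields[0]), col)] = float(value)
        else:
            title = None  # unrelated output ends the current matrix
    return matrices


def largest_changes(baseline, candidate, top=3):
    """Off-diagonal pairs sorted by candidate/baseline ratio, worst first."""
    ratios = [
        (candidate[pair] / baseline[pair], pair)
        for pair in baseline
        if pair in candidate and pair[0] != pair[1]
    ]
    return sorted(ratios)[:top]


ORIGINAL = r"""
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     2      3
     2 838.70 101.69
     3 101.91 837.80
"""
FORK = r"""
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     2      3
     2 837.58  17.27
     3  17.04 835.34
"""

TITLE = "Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)"
base = parse_matrices(ORIGINAL)[TITLE]
fork = parse_matrices(FORK)[TITLE]
for ratio, (src, dst) in largest_changes(base, fork):
    print(f"GPU {src} -> GPU {dst}: {base[(src, dst)]} -> {fork[(src, dst)]} GB/s ({ratio:.2f}x)")
```

In practice the two strings would be replaced by the full saved output of each run.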

Bug Incidence

Always

nvidia-bug-report.log.gz

More Info

No response

sheepymeh added the bug label Apr 14, 2024
@TheAhmadOsman

Experiencing the same

geohot commented Apr 15, 2024

Ahh, yea this is real, and glad to see it working with 3090s. I have only tested on 4090s where there's no NVLink to worry about.

The driver is forcing P2P to go through PCIe; I'm sure there's a way to avoid forcing it. I would merge a PR that fixes this, and I doubt it's too hard. However, we are only maintaining this driver for tinybox, so the fix would have to come from an external contributor.

@zvorinji

@geohot if you connect two Tinyboxes together, will it allow for the GPU in one box to communicate P2P with a GPU from the second box if you connect them with Mellanox adapter cards in the OCP slots?

ilovesouthpark commented May 27, 2024

@zvorinji I thought about this before, but it does not seem practical, either economically or technically.

  1. To match PCIe speed over Mellanox, e.g. PCIe 4.0 x16 at 64 GB/s, you would need an adapter card of at least 500G (are only 400G or 800G available as actual products?), which is very expensive. You would also give up at least one PCIe 4.0 x16 slot in each machine. I am not sure whether two IB/RDMA-capable adapters can transfer more data than that bandwidth limit.
  2. I am also not sure whether activating P2P in the driver means that RDMA is activated too. If not, we would need two more RDMA-capable GPUs to relay the P2P data inside each machine before it is transferred between the two machines.
    Correct me if I am wrong; I am new to this area, but I also want to build my own low-cost inference cluster :)
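The adapter sizing in point 1 is a unit conversion from GB/s to Gbit/s, sketched below. The list of port speeds is an assumption for illustration rather than a product catalogue, the function names are made up, and protocol overhead is ignored.

```python
def gbytes_to_gbits(gb_per_s):
    """Convert a bandwidth in GB/s to Gbit/s (8 bits per byte, no overhead)."""
    return gb_per_s * 8


# Assumed port speeds in Gbit/s, for illustration only.
PORT_SPEEDS_GBIT = [100, 200, 400, 800]


def smallest_port_for(gb_per_s, ports=PORT_SPEEDS_GBIT):
    """Return the smallest listed port speed that covers gb_per_s, or None."""
    needed = gbytes_to_gbits(gb_per_s)
    for speed in sorted(ports):
        if speed >= needed:
            return speed
    return None


needed = gbytes_to_gbits(64)  # the 64 GB/s figure quoted above
print(f"64 GB/s = {needed} Gbit/s, smallest listed port: {smallest_port_for(64)}G")
```

With these assumed port speeds, 64 GB/s works out to 512 Gbit/s, so a 400G port falls short and the next step up is 800G.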

samsja commented Aug 3, 2024

Wondering if anybody has found a workaround. I'm planning on using the driver with my NVLinked 3090s.

rimb05 commented Aug 20, 2024

I'm also curious about this. It would be nice to be able to use this with NVLinked 3090s.

Ph0rk0z commented Jan 5, 2025

So imagine this: a server with 4-6x 3090. The pairs have NVLink, but there is no P2P between the pairs. If the driver could respect both PCIe and NVLink access, you'd have a heck of a machine for training or for using tools that support peer access.
