RTX4090 and RTX4060Ti no P2P #28

Open · Ivan04012025 opened this issue Jan 3, 2025 · 10 comments
Labels: bug (Something isn't working)

@Ivan04012025
NVIDIA Open GPU Kernel Modules Version

550.90.07

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Linux Mint 21.3

Kernel Release

6.8.0-50-generic #51~22.04.1-Ubuntu

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

RTX4090 and RTX4060Ti

Describe the bug

Is it possible to enable P2P between 4090 and 4060 Ti cards?
My motherboard is an Asus Pro WS X299 SAGE II. I enabled large BAR and disabled the IOMMU in the BIOS.
Next I installed open-gpu-kernel-modules-550.90.07-p2p using the install.sh script, followed by the driver: NVIDIA-Linux-x86_64-550.90.07.run --no-kernel-modules.
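(Roughly these commands, reconstructed from the description above; the exact invocation details are illustrative:)

```sh
# inside the open-gpu-kernel-modules-550.90.07-p2p checkout
sudo ./install.sh
# userspace driver, skipping its bundled kernel modules
sudo sh NVIDIA-Linux-x86_64-550.90.07.run --no-kernel-modules
```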
nvidia-smi works fine, but p2pBandwidthLatencyTest gives the following output:
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA GeForce RTX 4090, pciBusID: 1a, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA GeForce RTX 4060 Ti, pciBusID: 68, pciDeviceID: 0, pciDomainID:0
Device=0 CANNOT Access Peer Device=1
Device=1 CANNOT Access Peer Device=0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
D\D 0 1
0 1 0
1 0 1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1
0 911.08 6.27
1 6.26 244.87
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0 1
0 913.21 6.27
1 6.25 245.25
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1
0 915.33 8.49
1 8.63 244.46
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1
0 918.31 8.43
1 8.63 244.56
P2P=Disabled Latency Matrix (us)
GPU 0 1
0 1.44 20.43
1 20.54 1.20

CPU 0 1
0 2.25 6.10
1 5.98 2.24
P2P=Enabled Latency (P2P Writes) Matrix (us)
GPU 0 1
0 1.45 20.61
1 11.36 1.20

CPU 0 1
0 2.22 5.93
1 6.19 2.23

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

This result indicates that P2P is not working. The manual says that both the 4090 and 4060 Ti should be supported. Is there anything that can be done to enable P2P?
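(For reference, the sample's connectivity matrix reports the same capability you can query directly; a minimal standalone sketch, compiled with `nvcc minimal_p2p_check.cu`:)

```cuda
// minimal_p2p_check.cu -- sketch: query P2P accessibility between devices 0 and 1
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int access01 = 0, access10 = 0;
    cudaDeviceCanAccessPeer(&access01, 0, 1);  // can device 0 map device 1's memory?
    cudaDeviceCanAccessPeer(&access10, 1, 0);  // can device 1 map device 0's memory?
    printf("GPU0 -> GPU1: %s\n", access01 ? "yes" : "no");
    printf("GPU1 -> GPU0: %s\n", access10 ? "yes" : "no");
    return 0;
}
```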

To Reproduce

I followed the installation instructions for the 550 version of the kernel modules.

Bug Incidence

Always

nvidia-bug-report.log.gz


More Info

No response

@Ivan04012025 added the bug label Jan 3, 2025
@mylesgoose

I've seen a few people having issues with mixed cards. One guy had an A6000 and four 4090s, and P2P didn't work until he disabled the A6000.

@Ivan04012025
Author

> I've seen a few people having issues with mixed cards. One guy had an A6000 and four 4090s, and P2P didn't work until he disabled the A6000.

Thank you for the reply. It's a pity that P2P doesn't work on mixed cards.

@ilovesouthpark

"Manual says that both 4090 and 4060Ti should be supported" which manual says that? You may mess up with the p2p mod and the original open-gpu-kernel-modules.
_"Normally, P2P on NVIDIA cards uses MAILBOXP2P. This is some hardware interface designed to allow GPUs to transfer memory back in the days of small BAR. It is not present or disabled in hardware on the 4090s, and that's why P2P doesn't work. There was a bug in early versions of the driver that reported that it did work, and it was actually sending stuff on the PCIe bus. However, because the mailbox hardware wasn't present, these copies wouldn't go to the right place. You could even crash the system by doing something like torch.zeros(10000,10000).cuda().to("cuda:1")

In some 3090s and all 4090s, NVIDIA added large BAR support."_

@Ivan04012025
Author

Ivan04012025 commented Jan 5, 2025

> Which manual says that?

For example, here; at the end there is a table of compatible GPUs:
https://github.com/tinygrad/open-gpu-kernel-modules/tree/535.54.03
However, I installed version 550.90.07 of the driver and open GPU kernel modules, so I am wondering: should I set the "NVreg_OpenRmEnableUnsupportedGpus" nvidia.ko kernel module parameter to 1 in this version of the kernel module, or is it already set by default? I didn't set it.
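(For reference, a parameter like that is normally set via a modprobe.d entry; a sketch, with a hypothetical file name -- the loaded value should then be visible in /proc/driver/nvidia/params after a reboot:)

```sh
# /etc/modprobe.d/nvidia-p2p.conf  (hypothetical file name)
options nvidia NVreg_OpenRmEnableUnsupportedGpus=1
```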

@mylesgoose

@Ivan04012025 I think he means that the list of compatible GPUs is the list for the original NVIDIA open driver. For P2P, he mentioned you need a GPU with large BAR support. Perhaps that 4060 GPU does not support that.

@Ivan04012025
Author

> @Ivan04012025 I think he means that the list of compatible GPUs is the list for the original NVIDIA open driver. For P2P, he mentioned you need a GPU with large BAR support. Perhaps that 4060 GPU does not support that.

Then is there a way to find out whether the 4060 has large BAR support or not?

@mylesgoose

@Ivan04012025 The spec page for the 4060 says it supports Resizable BAR: "Resizable BAR is an advanced PCI Express feature that enables the CPU to access the entire GPU frame buffer at once, improving performance in many games." You can see the BAR information from dmesg or system info, or with CPU-X on Linux.
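For example, the BAR sizes appear in the "Memory at ..." lines of lspci (a sketch; bus IDs 1a and 68 are taken from the p2pBandwidthLatencyTest output above):

```sh
sudo lspci -s 1a:00.0 -v | grep -i 'memory at'   # RTX 4090
sudo lspci -s 68:00.0 -v | grep -i 'memory at'   # RTX 4060 Ti
```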

@Ivan04012025
Author

Ivan04012025 commented Jan 5, 2025

> The spec page for the 4060 says it supports Resizable BAR: "Resizable BAR is an advanced PCI Express feature that enables the CPU to access the entire GPU frame buffer at once, improving performance in many games." You can see the BAR information from dmesg or system info, or with CPU-X on Linux.

In system info or CPU-X I do not see any info about BAR support. dmesg gives a lot of output, and I don't know which line refers to BAR support on the 4060. I copied the output; maybe you can help with that?
dmesg.log

The NVIDIA Settings software shows "Resizable BAR: Yes" on both the 4090 and 4060 GPUs.
[screenshot: BAR]

@mylesgoose

@Ivan04012025 In some 3090s and all 4090s, NVIDIA added large BAR support:

tiny@tiny14:~$ lspci -s 01:00.0 -v
01:00.0 VGA compatible controller: NVIDIA Corporation AD102 [GeForce RTX 4090] (rev a1) (prog-if 00 [VGA controller])
Subsystem: Micro-Star International Co., Ltd. [MSI] Device 510b
Physical Slot: 49
Flags: bus master, fast devsel, latency 0, IRQ 377
Memory at b2000000 (32-bit, non-prefetchable) [size=16M]
Memory at 28800000000 (64-bit, prefetchable) [size=32G]
Memory at 28400000000 (64-bit, prefetchable) [size=32M]
I/O ports at 3000 [size=128]
Expansion ROM at b3000000 [virtual] [disabled] [size=512K]
Capabilities:
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
Notice how BAR1 is size 32G.

In H100, they also added support for a PCIe mode that uses the BAR directly instead of the mailboxes, called BAR1P2P. So, what happens if we try to enable that on a 4090?

We do this by bypassing the HAL and calling a bunch of the GH100 methods directly. Methods like kbusEnableStaticBar1Mapping_GH100, which maps the entire VRAM into BAR1. This mostly just works, but we had to disable the use of that region in the MapAperture function for some reason. Shouldn't matter.

[ 3491.654009] NVRM: kbusEnableStaticBar1Mapping_GH100: Static bar1 mapped offset 0x0 size 0x5e9200000
[ 3491.793389] NVRM: kbusEnableStaticBar1Mapping_GH100: Static bar1 mapped offset 0x0 size 0x5e9200000
Perfect, we now have the VRAM mapped. However, it's not that easy to get P2P. When you run ./simpleP2P from cuda-samples, you get this error.

[ 3742.840689] NVRM: kbusCreateP2PMappingForBar1P2P_GH100: added PCIe BAR1 P2P mapping between GPU2 and GPU3
[ 3742.840762] NVRM: kbusCreateP2PMappingForBar1P2P_GH100: added PCIe BAR1 P2P mapping between GPU3 and GPU2
[ 3742.841089] NVRM: nvAssertFailed: Assertion failed: (shifted >> pField->shift) == value @ field_desc.h:272
[ 3742.841106] NVRM: nvAssertFailed: Assertion failed: (shifted & pField->maskPos) == shifted @ field_desc.h:273
[ 3742.841281] NVRM: nvAssertFailed: Assertion failed: (shifted >> pField->shift) == value @ field_desc.h:272
[ 3742.841292] NVRM: nvAssertFailed: Assertion failed: (shifted & pField->maskPos) == shifted @ field_desc.h:273
[ 3742.865948] NVRM: GPU at PCI:0000:01:00: GPU-49c7a6c9-e3a8-3b48-f0ba-171520d77dd1
[ 3742.865956] NVRM: Xid (PCI:0000:01:00): 31, pid=21804, name=simpleP2P, Ch 00000013, intr 00000000. MMU Fault: ENGINE CE3 HUBCLIENT_CE1 faulted @ 0x7f97_94000000. Fault is of type FAULT_INFO_TYPE_UNSUPPORTED_KIND ACCESS_TYPE_VIRT_WRITE
Failing with an MMU fault. So you dive into this and find that it's using GMMU_APERTURE_PEER as the mapping type. That doesn't seem supported in the 4090. So let's see what types are supported: GMMU_APERTURE_VIDEO, GMMU_APERTURE_SYS_NONCOH, and GMMU_APERTURE_SYS_COH. We don't care about being coherent with the CPU's L2 cache, but it does have to go out the PCIe bus, so we rewrite GMMU_APERTURE_PEER to GMMU_APERTURE_SYS_NONCOH. We also no longer set the peer id that was corrupting the page table.

cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 24.21GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Verification error @ element 1: val = 0.000000, ref = 4.000000
Verification error @ element 2: val = 0.000000, ref = 8.00000

@mylesgoose

Also, why is one of your GPUs running at only PCIe x8?
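(The negotiated link speed and width can be checked with lspci; a sketch, using the 4060 Ti's bus ID from the test output above:)

```sh
sudo lspci -s 68:00.0 -vv | grep -E 'LnkCap|LnkSta'
```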
