Software seems installed OK, but no P2P #21
Comments
Large BAR support "ON" and IOMMU off. There is no way to enable it, just auto. Auto is enabled. Did you blacklist the nouveau drivers? |
I did not blacklist anything. I installed the NVIDIA driver from the NVIDIA-Linux-x86_64-550.67.run file, not from any deb package. To my knowledge there is no way the nouveau driver could be installed, unless I am missing something. How would I check it?
If I run:
sudo lshw -c video | grep 'configuration'
I get:
configuration: driver=nvidia latency=0
And if I run:
lspci | grep VGA
I get:
21:00.0 VGA compatible controller: NVIDIA Corporation Device 2684 (rev a1) |
I also tried blacklisting the nouveau driver. I created a file containing:
blacklist nouveau
then ran:
sudo update-initramfs -u
and rebooted the system. Same result:
[/home/renato/cuda-samples-master/Samples/0_Introduction/simpleP2P/simpleP2P] - Starting...
Checking GPU(s) for support of peer to peer memory access...
Is there a way to double-check whether the P2P modules are installed correctly? |
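One way to approach the question above is to read the version string that the loaded driver publishes. The sketch below assumes that /proc/driver/nvidia/version exists while the module is loaded and that open builds include the phrase "Open Kernel Module" in the NVRM line. Both are assumptions about typical driver output, and the helper name describe_nvrm is made up for illustration. It cannot tell a P2P-patched open module from a stock open one; it only separates the open flavor from the proprietary one and reports the version.

```python
import re
from pathlib import Path

# Assumed location of the driver's version file; it is only present while
# the nvidia kernel module is loaded.
VERSION_FILE = Path("/proc/driver/nvidia/version")


def describe_nvrm(text):
    """Return (flavor, version) parsed from the NVRM line, or None."""
    for line in text.splitlines():
        if line.startswith("NVRM version:"):
            # Assumption: open builds mention "Open Kernel Module" here.
            flavor = "open" if "Open Kernel Module" in line else "proprietary"
            match = re.search(r"\b\d+\.\d+(?:\.\d+)?\b", line)
            return flavor, (match.group(0) if match else None)
    return None


if __name__ == "__main__":
    if VERSION_FILE.exists():
        print(describe_nvrm(VERSION_FILE.read_text()))
    else:
        print("nvidia kernel module does not appear to be loaded")
```

If the flavor comes back as proprietary, or the version differs from the one nvidia-smi shows, the modules in use are probably not the ones built from the open source tree.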
I think I see your problem. First, how did you install the NVIDIA driver from a run file? You must have been looking at a screen, so the open-source driver was installed first. When you ran the run file, did you pass --no-kernel-modules? Perhaps you installed the modules from the run file. That is no problem: just replace them with the ones from the geohot deb package. I also noticed that the deb package installs its modules into a different location than the run file or apt does.
So your best solution is to run that run file and uninstall, purging the drivers with the run file itself. Then run apt purge nvidia-* and so on, until all NVIDIA drivers are removed, but keep the NVIDIA installer handy so you can run it again. Now run the installer and let it install the modules for the kernel. Reboot and check that everything works. Then find out where those files are stored and overwrite them with your modified modules from the terminal. You can also run the installer script again to double-check. Then you have to regenerate the initramfs so the kernel actually loads your modules, I think. And reboot.
If that does not work, simply purge everything again, remove the nouveau blacklist, uninstall the NVIDIA driver with the run file, reboot, and let the standard nouveau driver work. Because nouveau is then the driver in use, you won't have any problems overwriting the NVIDIA modules, as they won't be loaded. Then make sure there are no lingering drivers from apt and that apt is not auto-installing updates. Run the installer again, this time with the no-kernel-modules flag, then install your kernel modules, and you should see nvidia-smi working. |
How to Build
make modules -j$(nproc)
make modules_install -j$(nproc)
sh ./NVIDIA-Linux-[...].run --no-kernel-modules |
I can see from your nvidia-smi output that the driver is loaded, so you are almost there. Your problem is that you are using the kernel modules from the run file. |
First of all, thank you for your help! I started with Ubuntu Server 22.04, I presume with no drivers, as I chose not to install any third-party driver. Then these are the commands I executed, in this exact order:
sudo ./NVIDIA-Linux-x86_64-550.67.run --no-kernel-modules
make modules -j$(nproc)
As I had to build the simpleP2P and nvbandwidth tools, I downloaded cuda_12.4.0_550.54.14_linux.run, again in the ".run" format to be in control of what was being installed. Strangely enough, it asked me whether I wanted to install a different and newer driver, to which I said no. I then compiled nvbandwidth and simpleP2P without errors, and went on to blacklist the nouveau driver as I mentioned before.
Is there a way for me to check that the installed modules are indeed the ones from open-gpu-kernel-modules-550? I believe I did my utmost to make sure they are the ONLY modules ever built. I did not install any .deb package which might have overwritten those modules. Before I retry the process, starting from a reinstallation of Ubuntu, I would like to know whether there is any verification of, or change to, my process that would keep me from ending up in the same place again. |
Err, I used NVIDIA driver version 550.67, not 560.35.03. There isn't an open-gpu-kernel-modules-xxx branch for driver version 560xxx. Did you use the 560.35.03 driver with the 550 branch? I used 550.67 because the branch description says that is the driver to use: "Note that the kernel modules built here must be used with GSP firmware and user-space NVIDIA GPU driver components from a corresponding 550.67 driver release. This can be achieved by installing the NVIDIA GPU driver from the .run file using the --no-kernel-modules option." Everything seems to compile and install fine with 550.67, except for the fact that it does not work :)). Is using the 550.67 NVIDIA driver what is causing the problem? |
@thecaptain2000 https://github.com/tinygrad/open-gpu-kernel-modules/releases/download/550.90.07-p2p/nvidia-kernel-source-550-open-0ubuntu1_amd64.deb This one is already precompiled. So purge your system of all drivers, then install this precompiled deb package, then install the matching run file with no kernel modules and reboot. If it works, you can then compile from source if you like. |
@mylesgoose, I will give it a go. I will let you know of the progress. Thank you again in the meantime |
So, I installed the 550.90.07 driver from the run file this way, as you mentioned:
sudo ./NVIDIA-Linux-x86_64-550.90.07.run --no-kernel-modules
Even BEFORE installing the driver, I executed:
dpkg -i nvidia-kernel-source-550-open-0ubuntu1_amd64.deb
If I execute nvidia-smi I get:
I tried rebooting the PC and it did not help. I also tried executing dpkg -i nvidia-kernel-source-550-open-0ubuntu1_amd64.deb after the driver installation, but the situation remained the same. I also tried executing sudo apt install dpkg -i nvidia-kernel-source-550-open-0ubuntu1_amd64.deb. Same result. My doubt at this point is whether I need to specify a different directory when I execute dpkg -i nvidia-kernel-source-550-open-0ubuntu1_amd64.deb. Is that the case? |
Well, I think you should run the installer without the no-kernel-modules option, see where it copies the files to, then replace them with the ones from the deb package and rebuild the initramfs. These are mine: modules.zip. Sorry, it would not fit if I just zipped it, so unzip and then untar. /usr/lib/modules/6.8.0-44-generic/kernel/drivers/video/nvidia-uvm.ko |
modules.zip |
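Since the installer, apt, and the deb package may each place modules in a different directory, it can help to list every copy before overwriting anything. This is a minimal sketch, assuming the modules for the running kernel live under /lib/modules/<release> and that module files may be stored compressed (for example .ko.zst). The function name find_module_files is hypothetical.

```python
import os
import platform


def find_module_files(root, prefix="nvidia"):
    """List kernel-module files under root whose name starts with prefix."""
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            # Modules may be stored compressed, e.g. .ko.zst or .ko.xz.
            if name.startswith(prefix) and ".ko" in name:
                hits.append(os.path.join(dirpath, name))
    return sorted(hits)


if __name__ == "__main__":
    # Assumption: modules for the running kernel live under this directory.
    root = os.path.join("/lib/modules", platform.release())
    for path in find_module_files(root):
        print(path)
```

If the same module name shows up in more than one directory, one copy is probably shadowing the other, which would match the "installed to a different location" observation above.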
@mylesgoose I am getting somewhere. While I was waiting for your response, I performed a clean Linux install, installed 550.67, and compiled and installed the open modules. I had a hunch that the modules were actually working when I originally installed them, but that somewhere, somehow, they were getting overwritten. So after the clean install of Linux plus the modules, I installed my Python + PyTorch environment and ran torch.zeros(70000,70000).cuda().to("cuda:1"). It took 3.9 seconds, where before it was taking just under 8 seconds.
The problem is that at that point I could not run simpleP2P and nvbandwidth, as I did not have them anywhere else, so I installed the CUDA toolkit (again from a .run file), asking it not to install anything but the CUDA toolkit. I re-ran torch.zeros(70000,70000).cuda().to("cuda:1") and boom, it was taking 8 seconds again, which means the CUDA toolkit overwrote all or part of the NVIDIA modules. I have now compiled and run simpleP2P and it tells me there is no P2P.
So what I will do now is also build nvbandwidth and save both tools. Hopefully they do not need any library to run, and I will be able to recreate the initial situation where, I suspect, the whole "toy" was running as expected with P2P enabled, before the installation of the CUDA toolkit.
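The copy timing above is an indirect signal. Because PyTorch is already installed in this setup, a more direct check is to ask it whether each GPU pair reports peer access. The sketch below uses torch.cuda.can_device_access_peer; the wrapper name peer_access_matrix is made up for illustration, and it returns None rather than failing when PyTorch or a second GPU is not available.

```python
def peer_access_matrix():
    """Map (src, dst) GPU index pairs to whether peer access is reported.

    Returns None when PyTorch is missing or fewer than two GPUs are visible.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available() or torch.cuda.device_count() < 2:
        return None
    count = torch.cuda.device_count()
    return {
        (src, dst): torch.cuda.can_device_access_peer(src, dst)
        for src in range(count)
        for dst in range(count)
        if src != dst
    }


if __name__ == "__main__":
    print(peer_access_matrix())
```

Running this before and after installing the CUDA toolkit would show whether the toolkit installation is the step that removes peer access, without needing simpleP2P to be built first.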
alltoall_perf.zip Why do you want to use that old version, 550.67?
When you install that deb package, it does not put the modules in the correct location. If you installed CUDA and it replaced your driver, why not just reinstall the driver, or replace the modules that it replaced? |
Well, given that once I installed the 550.90.07 driver and compiled the module it worked the first time, I would say "Because I am an idiot :))" |
Thank you for helping me through this |
This has had me stuck for 2 days. How do you compile cuda-samples (simpleP2P)? It fails with a LargeKernelParameter error. Thank you very much! |
Are you trying to compile all of the samples or just simpleP2P? I didn't want to recompile all of them, so I just copied the simpleP2P folder to my desktop, opened a terminal inside that folder, and typed make clean and then
sudo make INCLUDES="-I../../../Common -I/home/myles/cuda-samples/Common"
because you are either going to be in a directory below that Common folder with the CUDA helper headers, or you just link to it. |
@thecaptain2000 Hey, can you try this newer version? https://github.com/mylesgoose/open-gpu-kernel-modules/tree/560.35.03-p2p Make sure you install the run file corresponding to that newer release. |
Oh shit!! It works when I copy it outside! I kept running "make" inside the 0_Introduction folder. |
@keithyau which NVIDIA driver did you install? |
560, and then I patched the tinygrad P2P update into it. |
Well, I can see your problem there: you have not patched it correctly and missed a file, as your screenshot shows. |
https://github.com/mylesgoose/open-gpu-kernel-modules/tree/560.35.03-p2p @keithyau try this one, because the one you are using obviously does not have this file: mylesgoose@1ca8b01 |
Thank you ! |
I followed the steps, but when I run simpleP2P I get this. Please give me a hand. Thanks a lot! |
Is your IOMMU disabled from GRUB, and is Large BAR support enabled in the BIOS?
I did solve it: I went into the motherboard settings and disabled the IOMMU.
Moreover, I guess the driver I was using was not working; I used the most
up-to-date one on the list and it worked just fine.
Good luck.
Renato
On Fri, 20 Dec 2024 at 07:39, hetian127 ***@***.***> wrote:
… the same error.. any idea?
|
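The IOMMU question above can be checked from the running system as well as from the firmware menu. This is a minimal sketch, assuming the kernel parameters iommu=off, amd_iommu=off, and intel_iommu=off are the ones of interest and that a populated /sys/kernel/iommu_groups directory means the IOMMU is active. The helper names are hypothetical, and a firmware-level setting will not show up on the command line.

```python
from pathlib import Path

# Kernel parameters that switch the IOMMU off (assumed sufficient for this
# check; firmware settings are not visible here).
OFF_FLAGS = {"iommu=off", "amd_iommu=off", "intel_iommu=off"}


def iommu_off_on_cmdline(cmdline):
    """True if any IOMMU-off parameter appears on the kernel command line."""
    return any(token in OFF_FLAGS for token in cmdline.split())


def iommu_groups_present(root="/sys/kernel/iommu_groups"):
    """True if the kernel exposes at least one IOMMU group (IOMMU active)."""
    path = Path(root)
    return path.is_dir() and any(path.iterdir())


if __name__ == "__main__":
    cmdline_file = Path("/proc/cmdline")
    if cmdline_file.exists():
        print("off on cmdline:", iommu_off_on_cmdline(cmdline_file.read_text()))
    print("groups present:", iommu_groups_present())
```

If groups are still present after the IOMMU was supposedly disabled, the setting has probably not taken effect, which is consistent with the fix reported above of disabling it on the motherboard.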
NVIDIA Open GPU Kernel Modules Version
550
Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.
Operating System and Version
Ubuntu 22.04.5 LTS
Kernel Release
Linux ai-server 5.15.0-124-generic #134-Ubuntu SMP Fri Sep 27 20:20:17 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.
Hardware: GPU
GPU 0: NVIDIA GeForce RTX 4090 (UUID: GPU-2fbe0316-3cc8-4b18-797e-de9975b5f814) GPU 1: NVIDIA GeForce RTX 4090 (UUID: GPU-21adc1c4-fcf0-de35-d8a5-8a864de22da8)
Describe the bug
OpenGPU installs fine. I built the modules in OpenGPU (I did not build the modules when I installed the server) and all seems correct. The IOMMU is off, and Large BAR is set to auto (there is no way to enable it, just auto/disable).
nvidia-smi reports:
NVIDIA-SMI 550.67 Driver Version: 550.67 CUDA Version: 12.4
simpleP2P reports:
checking GPU(s) for support of peer to peer memory access...
Two or more GPUs with Peer-to-Peer access capability are required for ./simpleP2P.
Peer to Peer access is not available amongst GPUs in the system, waiving test.
I created the modules using the open P2P software only. I did not build the modules when installing the NVIDIA driver, so I presume they are the correct modules.
My motherboard is a TRX40 Designare with a Threadripper 3970, Large BAR support and IOMMU off. Is there anything else I need to enable / disable / install / uninstall, etc?
To Reproduce
well, I just followed the installation instructions for the kernel version 550
Bug Incidence
Always
nvidia-bug-report.log.gz
there is no bug, it just does not work
More Info
to have P2P working? :)