update Nvidia device driver docs to link to list of supported cards and newer versions #25531

Merged
merged 4 commits into from
Apr 28, 2025

Contributor

@sofixa sofixa commented Mar 26, 2025

Description

The NVIDIA device driver docs had a few outdated references (e.g. the nvidia/cuda:11.0-base image no longer exists) and didn't link to the list of compatible NVIDIA devices (e.g. Jetsons aren't compatible).

Also removed the x86_64 qualifier for Linux because the driver runs successfully (even if it can't fully fingerprint the unsupported card) on arm64 in my testing, and the underlying library and its Go bindings are arm64 compatible (NVML and nvml-go).

Reviewer Checklist

  • Backport Labels Please add the correct backport labels as described by the internal
    backporting document.
  • Commit Type Ensure the correct merge method is selected which should be "squash and merge"
    in the majority of situations. The main exceptions are long-lived feature branches or merges where
    history should be preserved.
  • Enterprise PRs If this is an enterprise only PR, please add any required changelog entry
    within the public repository.

@sofixa sofixa requested review from a team as code owners March 26, 2025 17:35
@aimeeu aimeeu added the theme/docs Documentation issues and enhancements label Mar 26, 2025
@aimeeu aimeeu added backport/website This will backport PR changes to `stable-website` && the latest release-branch backport/1.9.x backport to 1.9.x release line labels Mar 26, 2025
Contributor

@aimeeu aimeeu left a comment

Thanks for the update! I left a few style suggestions.

Member

@tgross tgross left a comment

> Also removed the x86_64 qualifier for Linux because the driver runs successfully (even if it can't fully fingerprint the unsupported card) on arm64 in my testing, and the underlying library and its Go bindings are arm64 compatible (NVML and nvml-go).

I'm hesitant to remove this unless we've actually tested it on arm64 and shown it works end-to-end. I don't see any issues in the driver repo that show someone using it on arm64. Do you know of anyone who's done so?

@sofixa
Contributor Author

sofixa commented Mar 26, 2025

@tgross I've run it on arm64, kind of, and it works well enough to identify the NVIDIA GPU, but it fails to get a device handle to check power/memory because the card in question is an iGPU, which NVML doesn't support.

We can give it a spin in an AWS G5g (Graviton ARM CPU + NVIDIA T4 GPU, which is supported by NVML).

@tgross
Member

tgross commented Mar 26, 2025

> We can give it a spin in an AWS G5g (Graviton ARM CPU + NVIDIA T4 GPU, which is supported by NVML).

If you could, that'd be great. It'd at least give us a reasonable smoke test before we attest to it.

@sofixa
Contributor Author

sofixa commented Mar 27, 2025

@tgross I gave it a spin on an AWS G5g.2xlarge. The driver compiles like a charm, and everything I could think of works:

root@ip-172-31-5-203:/home/ubuntu# nomad node status -self -verbose
ID              = 3187b260-20ae-6a1d-ffc4-e121a3a7298a
Name            = ip-172-31-5-203
Node Pool       = default
Class           = <none>
DC              = dc1
Drain           = false
Eligibility     = eligible
Status          = ready
CSI Controllers = <none>
CSI Drivers     = <none>
Uptime          = 9m34s

Drivers
Driver    Detected  Healthy  Message   Time
docker    true      true     Healthy   2025-03-27T16:39:45Z
exec      true      true     Healthy   2025-03-27T16:39:45Z
java      false     false    <none>    2025-03-27T16:39:45Z
qemu      false     false    <none>    2025-03-27T16:39:45Z
raw_exec  false     false    disabled  2025-03-27T16:39:45Z

Node Events
Time                  Subsystem       Message          Details
2025-03-27T16:25:05Z  Driver: docker  Healthy          driver: docker
2025-03-27T16:17:35Z  Cluster         Node registered  <none>

Allocated Resources
CPU          Memory       Disk
0/10000 MHz  0 B/7.6 GiB  0 B/3.2 GiB

Allocation Resource Utilization
CPU          Memory
0/10000 MHz  0 B/7.6 GiB

Host Resource Utilization
CPU           Memory           Disk
75/10000 MHz  367 MiB/7.6 GiB  (/dev/root)

Device Resource Utilization
nvidia/gpu/NVIDIA T4G[GPU-0c3f1dde-eb43-9134-a648-afe61c408cfa]  446 / 15360 MiB

....

Attributes
cpu.arch                                 = arm64
cpu.frequency.efficiency                 = 2500
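
For anyone repeating this smoke test, the two facts that matter in the output above (an `nvidia/gpu` device is fingerprinted, and `cpu.arch` is `arm64`) could be checked with a small script. This is only a sketch: `summarize_node_status` is a hypothetical helper, not part of Nomad, and it assumes the plain-text layout shown above.

```python
import re

# Hypothetical helper, not part of Nomad. It assumes the plain-text layout
# of `nomad node status -self -verbose` shown in the output above.
DEVICE_RE = re.compile(
    r"^(?P<vendor>[^/\s]+)/(?P<type>[^/\s]+)/(?P<name>.+?)"
    r"\[(?P<id>[^\]]+)\]\s+(?P<used>\d+)\s*/\s*(?P<total>\d+)\s+MiB\s*$"
)
ARCH_RE = re.compile(r"^cpu\.arch\s*=\s*(\S+)\s*$")


def summarize_node_status(text):
    """Return (cpu_arch, devices) parsed from node status text."""
    arch = None
    devices = []
    for raw in text.splitlines():
        line = raw.strip()
        match = DEVICE_RE.match(line)
        if match:
            device = match.groupdict()
            # Memory figures are reported in MiB.
            device["used"] = int(device["used"])
            device["total"] = int(device["total"])
            devices.append(device)
            continue
        match = ARCH_RE.match(line)
        if match:
            arch = match.group(1)
    return arch, devices


sample = """\
Device Resource Utilization
nvidia/gpu/NVIDIA T4G[GPU-0c3f1dde-eb43-9134-a648-afe61c408cfa]  446 / 15360 MiB

Attributes
cpu.arch                                 = arm64
"""

arch, devices = summarize_node_status(sample)
print(arch, devices[0]["name"], devices[0]["total"])  # prints: arm64 NVIDIA T4G 15360
```

In practice the text would come from running the command on the node rather than from a pasted sample.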

Output of a Docker job running nvidia-smi in a CUDA container (the same image I put in the docs):

Thu Mar 27 16:39:57 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.86.15              Driver Version: 570.86.15      CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA T4G                     Off |   00000000:00:1F.0 Off |                    0 |
| N/A   43C    P8             15W /   70W |       1MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
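
If the job's log needs to be checked automatically, the banner line above carries the driver and CUDA versions. A minimal sketch, assuming the banner format shown here; `parse_smi_banner` is a hypothetical helper and not an NVIDIA or Nomad API:

```python
import re

# Hypothetical helper: pulls the versions out of the nvidia-smi banner line,
# assuming the table layout shown in the output above.
BANNER_RE = re.compile(
    r"NVIDIA-SMI\s+(?P<smi>\S+)\s+"
    r"Driver Version:\s+(?P<driver>\S+)\s+"
    r"CUDA Version:\s+(?P<cuda>\S+)"
)


def parse_smi_banner(text):
    """Return a dict of versions, or None if no banner line is found."""
    match = BANNER_RE.search(text)
    return match.groupdict() if match else None


banner = (
    "| NVIDIA-SMI 570.86.15              Driver Version: 570.86.15"
    "      CUDA Version: 12.8     |"
)
versions = parse_smi_banner(banner)
print(versions["driver"], versions["cuda"])  # prints: 570.86.15 12.8
```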

Anything else you can think of that I should test?

@tgross
Member

tgross commented Mar 27, 2025

Awesome, that's great @sofixa. Let's ship it. I'll mark this for re-review.

@tgross tgross requested a review from aimeeu March 27, 2025 16:47
Labels
  • backport/ent/1.9.x+ent: Changes are backported to 1.9.x+ent
  • backport/website: This will backport PR changes to `stable-website` && the latest release-branch
  • backport/1.9.x: backport to 1.9.x release line
  • backport/1.10.x: backport to 1.10.x release line
  • theme/docs: Documentation issues and enhancements
4 participants