Does this GPU sharing plugin support monitoring with dcgm-exporter? #211
Comments
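Same question, +1 |
I have received your email and will read and reply to it as soon as possible. Thank you. Wang Xin
|
Same question, +1 |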
@ZhangSetSail Hi dear friend, how did you set up the monitoring with this GPU operator? Any documentation would be very helpful. Thanks. |
Kubernetes version: v1.23.16
nvidia-docker info
Client: Docker Engine - Community
Version: 24.0.2
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
Version: v0.10.5
Path: /usr/libexec/docker/cli-plugins/docker-buildx
compose: Docker Compose (Docker Inc.)
Version: v2.18.1
Path: /usr/libexec/docker/cli-plugins/docker-compose
Server:
Containers: 110
Running: 56
Paused: 0
Stopped: 54
Images: 40
Server Version: 20.10.24
Storage Driver: overlay2
Backing Filesystem: xfs
Supports d_type: true
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Cgroup Version: 1
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux nvidia runc
Default Runtime: nvidia
Init Binary: docker-init
containerd version: 3dce8...
runc version: v1.1.7-0-g860f061
init version: de40ad0
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 5.4.0-150-generic
Operating System: Ubuntu 20.04.6 LTS
OSType: linux
Architecture: x86_64
CPUs: 48
Total Memory: 125.6GiB
Name: node01
ID:
Docker Root Dir: /app/docker
Debug Mode: false
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
WARNING: No swap limit support
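I deployed gpushare on my cluster for GPU sharing, and I use dcgm-exporter for monitoring: https://github.com/NVIDIA/dcgm-exporter
However, Prometheus shows no values for the GPU utilization metrics, and I cannot monitor the GPU resource utilization of individual pods.
Has anyone used this setup? Any help would be appreciated.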
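One way to narrow this down might be to look at the exporter's own output before looking at Prometheus. The following is only a rough sketch, not an official tool: it assumes the exporter serves the Prometheus text format, that the utilization metric is named `DCGM_FI_DEV_GPU_UTIL`, and that per-pod samples carry a `pod` label (the label name may differ between exporter versions). The function names and the URL passed to `fetch_metrics` are hypothetical.

```python
import re
import urllib.request

# One sample line of the Prometheus text exposition format, e.g.
#   metric_name{label="value",...} 12.5
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def fetch_metrics(url):
    """Download the raw metrics text; the URL is an assumption of this sketch."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def parse_samples(text, metric):
    """Return (labels, value) pairs for every sample of `metric` in `text`."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = SAMPLE_RE.match(line)
        if m is None or m.group(1) != metric:
            continue
        labels = dict(LABEL_RE.findall(m.group(2) or ""))
        samples.append((labels, float(m.group(3))))
    return samples


def summarize(text, metric="DCGM_FI_DEV_GPU_UTIL"):
    """Count samples of `metric`, and how many carry a non-empty `pod` label."""
    samples = parse_samples(text, metric)
    with_pod = [s for s in samples if s[0].get("pod")]
    return {"samples": len(samples), "with_pod_label": len(with_pod)}
```

If `summarize(fetch_metrics(...))` reports zero samples, the metric is missing at the exporter itself; if samples exist but none has a pod label, the exporter is reporting per-GPU data without mapping it to pods. Either result at least shows whether the problem is on the exporter side or the Prometheus side.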