What did you do?

1. Deployed the descheduler with a policy whose HighNodeUtilization thresholds include only GPU resources (see the policy config section below).
2. Created pods with and without GPU resource requests on the nodes.
3. Observed the plugin's behavior when a node's GPU utilization falls below the threshold (40%).
Steps to reproduce:
# 1. Deploy a pod with a GPU request
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: gpu-container
    image: nvidia/cuda
    resources:
      requests:
        "cloudml.gpu/v100-32g": "1"
      limits:
        "cloudml.gpu/v100-32g": "1"

# 2. Deploy a pod without a GPU request
apiVersion: v1
kind: Pod
metadata:
  name: non-gpu-pod
spec:
  containers:
  - name: nginx
    image: nginx
What did you expect to see?
When a node's GPU utilization falls below the threshold:
- Only pods that have requested the specified GPU resources (cloudml.gpu/v100-32g or cloudml.gpu/h20-96g) should be considered for eviction
- Pods that don't use these GPU resources should not be affected
What did you see instead?
When a node's GPU utilization falls below the threshold:
- The HighNodeUtilization plugin evicts all evictable pods from the node, regardless of whether they use the specified GPU resources
- This includes pods that do not request any GPU resources at all, which should not be affected by the GPU utilization threshold
This behavior is problematic because:
- It unnecessarily disrupts pods that do not contribute to the GPU utilization
- It may cause service disruption for non-GPU workloads
- It does not align with the principle of balancing only the specifically configured resources
The root cause appears to be in the pod filtering logic of the HighNodeUtilization plugin: pods are selected for eviction without checking whether they actually request any of the configured GPU resources.
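As an illustration of the expected behavior only (a hedged sketch, not the descheduler's actual implementation; the function name and standalone program are hypothetical, and only the k8s.io/api and k8s.io/apimachinery types are real), a resource-aware filter would skip pods that request none of the resources named in the thresholds:

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// podRequestsAnyThresholdResource reports whether any container in the pod
// requests at least one of the resources named in the policy thresholds.
// A complete check would also need to cover init containers.
func podRequestsAnyThresholdResource(pod *v1.Pod, thresholdResources map[v1.ResourceName]struct{}) bool {
	for _, c := range pod.Spec.Containers {
		for name := range c.Resources.Requests {
			if _, ok := thresholdResources[name]; ok {
				return true
			}
		}
	}
	return false
}

func main() {
	// The two GPU extended resources configured in the policy thresholds.
	gpuResources := map[v1.ResourceName]struct{}{
		"cloudml.gpu/v100-32g": {},
		"cloudml.gpu/h20-96g":  {},
	}

	gpuPod := &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "gpu-pod"},
		Spec: v1.PodSpec{Containers: []v1.Container{{
			Name:  "gpu-container",
			Image: "nvidia/cuda",
			Resources: v1.ResourceRequirements{
				Requests: v1.ResourceList{"cloudml.gpu/v100-32g": resource.MustParse("1")},
			},
		}}},
	}
	nonGpuPod := &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "non-gpu-pod"},
		Spec:       v1.PodSpec{Containers: []v1.Container{{Name: "nginx", Image: "nginx"}}},
	}

	fmt.Println("gpu-pod considered:", podRequestsAnyThresholdResource(gpuPod, gpuResources))        // true
	fmt.Println("non-gpu-pod considered:", podRequestsAnyThresholdResource(nonGpuPod, gpuResources)) // false
}

With a check of this kind in place, gpu-pod from the reproduction above would remain an eviction candidate while non-gpu-pod would be left alone.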
What version of descheduler are you using?
descheduler version: main branch (latest development version)
Does this issue reproduce with the latest release?
Yes, this issue can be reproduced with the latest release.
Which descheduler CLI options are you using?
Using descheduler as a Kubernetes deployment with a ConfigMap-based policy configuration.
Please provide a copy of your descheduler policy config file
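The exact policy file from this deployment is not reproduced here. Based on the description above (HighNodeUtilization with 40% thresholds set only for the two GPU extended resources), a minimal policy would look roughly like the sketch below; the v1alpha2 API shape and the profile name are assumptions, while the resource names and threshold value come from the report.

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: default
    pluginConfig:
      - name: "HighNodeUtilization"
        args:
          thresholds:
            "cloudml.gpu/v100-32g": 40
            "cloudml.gpu/h20-96g": 40
    plugins:
      balance:
        enabled:
          - "HighNodeUtilization"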
What k8s version are you using (kubectl version)?
Client version: v1.24.0
Server version: v1.21