fix: HighNodeUtilization plugin incorrectly evicts all pods when only GPU resources are configured #1635

Open
ditingdapeng opened this issue Feb 24, 2025 · 1 comment
Labels: kind/bug

Comments

@ditingdapeng

What version of descheduler are you using?

descheduler version: main branch (latest development version)

Does this issue reproduce with the latest release?
Yes, this issue can be reproduced with the latest release.

Which descheduler CLI options are you using?
Using descheduler as a Kubernetes deployment with a ConfigMap-based policy configuration.

Please provide a copy of your descheduler policy config file

apiVersion: descheduler/v1alpha2
kind: DeschedulerConfiguration
deschedulerPolicy:
  strategies:
    HighNodeUtilization:
      enabled: true
      params:
        thresholds:
          "cloudml.gpu/v100-32g": 40
          "cloudml.gpu/h20-96g": 40
        numberOfNodes: 0
  nodeSelector:
    node.miks.io/type: "gpu"

What k8s version are you using (kubectl version)?

Client version: v1.24.0

Server version: v1.21

What did you do?

  1. Deployed descheduler with the above configuration, which only includes GPU resource thresholds
  2. Created pods with and without GPU resource requests on nodes
  3. Observed the behavior when node GPU utilization falls below the threshold (40%)

Steps to reproduce:

# 1. Deploy a pod with GPU request
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: gpu-container
    image: nvidia/cuda
    resources:
      requests:
        "cloudml.gpu/v100-32g": "1"
      limits:
        "cloudml.gpu/v100-32g": "1"

# 2. Deploy a pod without GPU request
apiVersion: v1
kind: Pod
metadata:
  name: non-gpu-pod
spec:
  containers:
  - name: nginx
    image: nginx

What did you expect to see?
When a node's GPU utilization falls below the threshold:

  • Only pods that have requested the specified GPU resources (cloudml.gpu/v100-32g or cloudml.gpu/h20-96g) should be considered for eviction
  • Pods that don't use these GPU resources should not be affected

What did you see instead?
When a node's GPU utilization falls below the threshold:

  • The HighNodeUtilization plugin evicts all evictable pods from the node, regardless of whether they use the specified GPU resources or not
  • This includes pods that don't request any GPU resources, which should not be affected by the GPU utilization threshold

This behavior is problematic because:

  1. It unnecessarily disrupts pods that don't contribute to the GPU utilization
  2. It may cause service disruption for non-GPU workloads
  3. It doesn't align with the principle of balancing only the specifically configured resources

The root cause appears to be in the pod-filtering logic of the HighNodeUtilization plugin: pods on an underutilized node are selected for eviction without checking whether they actually request any of the configured GPU resources.
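
For illustration only, here is a minimal Go sketch of the kind of check being described. The helper name podRequestsAnyConfiguredResource and its placement are hypothetical and not taken from the descheduler codebase; only the Kubernetes API types are real. The idea is that a pod would be treated as an eviction candidate only if it requests at least one of the resources named in the thresholds.

package sketch

import (
	v1 "k8s.io/api/core/v1"
)

// podRequestsAnyConfiguredResource reports whether the pod requests at least
// one of the resources configured in the HighNodeUtilization thresholds
// (e.g. "cloudml.gpu/v100-32g"). Init containers are ignored for brevity.
// NOTE: hypothetical helper, shown only to illustrate the expected filtering.
func podRequestsAnyConfiguredResource(pod *v1.Pod, configured []v1.ResourceName) bool {
	for _, container := range pod.Spec.Containers {
		for _, name := range configured {
			if quantity, ok := container.Resources.Requests[name]; ok && !quantity.IsZero() {
				return true
			}
		}
	}
	return false
}

With a predicate like this applied before eviction, a node that falls below the GPU threshold would only have its GPU-requesting pods considered for eviction, which matches the expected behavior described above.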

@ditingdapeng added the kind/bug label on Feb 24, 2025
@googs1025
Member

same as: #1634
