HighNodeUtilization evicts pods from nodes that do not have the resource types specified in the configuration profile thresholds #1634

Open
JBinin opened this issue Feb 20, 2025 · 2 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@JBinin

JBinin commented Feb 20, 2025

What version of descheduler are you using?

descheduler version: v0.28.0

Does this issue reproduce with the latest release?

Yes

Which descheduler CLI options are you using?

Please provide a copy of your descheduler policy config file

  - name: HighNodeUtilization
    args:
      thresholds:
        "cloudml.gpu/v100-32g": 40

What k8s version are you using (kubectl version)?

kubectl version Output
$ kubectl version

What did you do?

What did you expect to see?
If a node does not possess the resources specified in the configuration profile thresholds, then it should not be considered an underutilized node and thus should not be subject to eviction.

In a cluster with multiple types of GPU nodes, if a certain GPU type is not included in the threshold settings of the configuration file, the nodes hosting that GPU are deemed underutilized and their pods are evicted. This occurs even if GPU allocation on those nodes is at 100%, which is evidently unreasonable.

What did you see instead?
In the cluster, a node that did not have any "cloudml.gpu/v100-32g" resources was mistakenly identified as an underutilized node, and its pods were consequently evicted.

2025-02-20T12:06:51.894357974Z I0220 12:06:51.894295 1 nodeutilization.go:198] "Node is underutilized" node="test", usage=map[cloudml.gpu/v100-32g:0 cpu:13777m memory:19106058Ki pods:33] usagePercentage=map[cpu:10.76328125 memory:1.8098670918577757 pods:12.992125984251969]
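To illustrate the suspected mechanism, here is a simplified Go sketch (not the descheduler's actual code; isUnderutilized and its inputs are hypothetical): a resource that the node does not advertise shows up as 0% usage, so every threshold comparison passes and the node is classified as underutilized even though its GPUs may be fully allocated.

package main

import "fmt"

// isUnderutilized is a hypothetical, simplified stand-in for the real check.
// usagePercent: resource name -> percent of node capacity in use.
// thresholds:   resource name -> node counts as "underutilized" below this percent.
func isUnderutilized(usagePercent, thresholds map[string]float64) bool {
    for res, limit := range thresholds {
        // A resource the node does not have is missing from the map,
        // which reads as 0% usage and is always under the threshold.
        if usagePercent[res] >= limit {
            return false
        }
    }
    return true
}

func main() {
    // Usage percentages from the log line above; the node has no
    // cloudml.gpu/v100-32g capacity, so that key is simply absent.
    usage := map[string]float64{
        "cpu":    10.76,
        "memory": 1.81,
        "pods":   12.99,
    }
    thresholds := map[string]float64{"cloudml.gpu/v100-32g": 40}
    fmt.Println(isUnderutilized(usage, thresholds)) // prints: true
}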

@JBinin JBinin added the kind/bug Categorizes issue or PR as related to a bug. label Feb 20, 2025
@LY-today

I encountered the same problem. After I added a new type of GPU machine to the cluster and scheduled a pod onto it, the pod was evicted by the HighNodeUtilization policy. It feels very risky to use this policy in a production environment.

@googs1025
Member

/cc
