[OCPBUGS-50992]: Filter out unreachable taints from tolerations #1990
base: main
/test images
@ingvagabund, @atiratree I am trying to make sense of the failed checks, but I cannot find the correlation between my proposed patch and the errors. What am I missing here, from your point of view? Much appreciated.
Looks like all three e2e tests are flaking (unrelated to this PR), and the verify check needs "make update-gofmt" to be run. I think you are good here when it comes to passing the tests :)
TolerationLoop:
	for _, tol := range tolerations {
		for _, excluded := range excludedTaints {
			if tol.ToleratesTaint(&excluded) {
So you are excluding taints that are added in the same method. What about not adding the excluded taints instead of adding them and then excluding them? Or is this code meant to be extended later with a list of user-provided excluded taints?
Also, neither of those two tolerations tolerates the taint, as the Key fields are always different:
"node-role.kubernetes.io/master" != "node.kubernetes.io/unreachable"
"node.kubernetes.io/not-ready" != "node.kubernetes.io/unreachable"
This makes filteredTolerations = tolerations always, which turns the double loop into a no-op. Maybe I misread the code.
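To make the no-op concern concrete, here is a minimal sketch of the matching rules. It uses locally declared `Taint` and `Toleration` types as simplified stand-ins for the corev1 ones, so it runs without Kubernetes dependencies. The helper names, the `Exists` operator, and the effects chosen for the two tolerations are assumptions for illustration, not the PR's actual values.

```go
package main

import "fmt"

// Taint and Toleration are simplified, hypothetical stand-ins for the
// corev1 types, declared locally so the sketch has no external dependencies.
type Taint struct {
	Key, Value, Effect string
}

type Toleration struct {
	Key, Operator, Value, Effect string
}

// toleratesTaint follows the usual matching rules: a non-empty effect and a
// non-empty key must equal the taint's, then the operator decides whether
// the value is compared ("Equal" or empty) or ignored ("Exists").
func (t Toleration) toleratesTaint(taint Taint) bool {
	if t.Effect != "" && t.Effect != taint.Effect {
		return false
	}
	if t.Key != "" && t.Key != taint.Key {
		return false
	}
	switch t.Operator {
	case "", "Equal":
		return t.Value == taint.Value
	case "Exists":
		return true
	default:
		return false
	}
}

// filterTolerations drops every toleration that tolerates an excluded taint.
func filterTolerations(tolerations []Toleration, excluded []Taint) []Toleration {
	var kept []Toleration
TolerationLoop:
	for _, tol := range tolerations {
		for _, ex := range excluded {
			if tol.toleratesTaint(ex) {
				continue TolerationLoop
			}
		}
		kept = append(kept, tol)
	}
	return kept
}

func main() {
	// Assumed shape of the two tolerations discussed in the review.
	tolerations := []Toleration{
		{Key: "node-role.kubernetes.io/master", Operator: "Exists", Effect: "NoSchedule"},
		{Key: "node.kubernetes.io/not-ready", Operator: "Exists", Effect: "NoSchedule"},
	}
	excluded := []Taint{
		{Key: "node.kubernetes.io/unreachable", Effect: "NoExecute"},
		{Key: "node.kubernetes.io/unreachable", Effect: "NoSchedule"},
	}
	filtered := filterTolerations(tolerations, excluded)
	// Keys never match, so nothing is removed.
	fmt.Println(len(filtered) == len(tolerations)) // prints "true"
}
```

Under these assumptions the filter only has an effect for a toleration whose key is empty (tolerate everything) or equal to the excluded key, which is neither of the two tolerations in question.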
Still unanswered
@ingvagabund with respect to this, I have added a comment saying that I intend to support user-defined exclusion taints, in the hope of preserving this logic. Should we define a unit test that validates this logic further?
How far in the future do you expect support for user-defined exclusion taints to be added? If the current logic has no use, it's better to introduce it in the PR that introduces the user-defined exclusion taints, so that all the relevant changes are in the same PR and the context is preserved. Otherwise the code changes will be split apart, and the logic may never be extended or finished.
@ingvagabund I expect to expand it in the next PR with the experimental flag from here: #1990 (comment)
Gotcha. In that case it's better to introduce the loop as part of the next PR.
@ingvagabund unfortunately this cannot be deferred to the next PR. The main feature is to manage the tolerations for the must-gather pod and avoid the random allocation of a pod that cannot be scheduled. The initial implementation allowed the must-gather pod to tolerate everything, hence this section was changed to exclude specific taints.
What about to extend the |
What if all control plane nodes are "node.kubernetes.io/unreachable" tainted? What if all nodes are "node.kubernetes.io/unreachable" tainted? Some nodes might still be eligible for running must-gather. Presence of the taint does not necessarily mean the node is "gone". From https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/#taint-based-evictions:
Maybe it's just the node controller that's temporarily separated from the rest of the control plane? |
I can expand this topic as follows:
|
remove commented line 946
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: midu16 The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
I am not able to replicate this theoretical scenario in which all the nodes are tainted (manually, I assume). Here is my attempt:
# oc get nodes
NAME STATUS ROLES AGE VERSION
hub-ctlplane-0.5g-deployment.lab NotReady control-plane,master,worker 30h v1.31.7
hub-ctlplane-1.5g-deployment.lab Ready control-plane,master,worker 30h v1.31.7
hub-ctlplane-2.5g-deployment.lab Ready control-plane,master,worker 30h v1.31.7
Here, one of the nodes has been powered off to simulate the exact shutdown scenario. I will try to taint the other nodes:
# oc adm taint node hub-ctlplane-1.5g-deployment.lab node.kubernetes.io/unreachable:NoExecute
node/hub-ctlplane-1.5g-deployment.lab tainted
# oc adm taint node hub-ctlplane-2.5g-deployment.lab node.kubernetes.io/unreachable:NoExecute
node/hub-ctlplane-2.5g-deployment.lab tainted
As you can see, the nodes have been tainted, but when describing the nodes tainted above:
# oc describe nodes hub-ctlplane-1.5g-deployment.lab
Name: hub-ctlplane-1.5g-deployment.lab
Roles: control-plane,master,worker
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/arch=amd64
kubernetes.io/hostname=hub-ctlplane-1.5g-deployment.lab
kubernetes.io/os=linux
node-role.kubernetes.io/control-plane=
node-role.kubernetes.io/master=
node-role.kubernetes.io/worker=
node.openshift.io/os_id=rhcos
Annotations: k8s.ovn.org/host-cidrs: ["172.16.30.10/24","172.16.30.11/24","172.16.30.21/24"]
k8s.ovn.org/l3-gateway-config:
{"default":{"mode":"shared","bridge-id":"br-ex","interface-id":"br-ex_hub-ctlplane-1.5g-deployment.lab","mac-address":"aa:aa:aa:aa:01:02",...
k8s.ovn.org/network-ids: {"default":"0"}
k8s.ovn.org/node-chassis-id: 76b0cc16-f376-4664-aaaf-85260e2b7a82
k8s.ovn.org/node-gateway-router-lrp-ifaddrs: {"default":{"ipv4":"100.64.0.3/16"}}
k8s.ovn.org/node-id: 3
k8s.ovn.org/node-masquerade-subnet: {"ipv4":"169.254.0.0/17","ipv6":"fd69::/112"}
k8s.ovn.org/node-primary-ifaddr: {"ipv4":"172.16.30.21/24"}
k8s.ovn.org/node-subnets: {"default":["10.133.0.0/23"]}
k8s.ovn.org/node-transit-switch-port-ifaddr: {"ipv4":"100.88.0.3/16"}
k8s.ovn.org/remote-zone-migrated: hub-ctlplane-1.5g-deployment.lab
k8s.ovn.org/zone-name: hub-ctlplane-1.5g-deployment.lab
machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
machineconfiguration.openshift.io/currentConfig: rendered-master-22119f80b4843c8b8f72be63d136687b
machineconfiguration.openshift.io/desiredConfig: rendered-master-22119f80b4843c8b8f72be63d136687b
machineconfiguration.openshift.io/desiredDrain: uncordon-rendered-master-22119f80b4843c8b8f72be63d136687b
machineconfiguration.openshift.io/lastAppliedDrain: uncordon-rendered-master-22119f80b4843c8b8f72be63d136687b
machineconfiguration.openshift.io/lastObservedServerCAAnnotation: false
machineconfiguration.openshift.io/lastSyncedControllerConfigResourceVersion: 720542
machineconfiguration.openshift.io/reason:
machineconfiguration.openshift.io/state: Done
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Mon, 26 May 2025 01:00:22 -0700
Taints: <none> ###<<<<<<<<---There is no taint
Unschedulable: false
Lease:
HolderIdentity: hub-ctlplane-1.5g-deployment.lab
AcquireTime: <unset>
RenewTime: Tue, 27 May 2025 07:39:20 -0700
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Tue, 27 May 2025 07:37:29 -0700 Mon, 26 May 2025 01:00:22 -0700 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Tue, 27 May 2025 07:37:29 -0700 Mon, 26 May 2025 01:00:22 -0700 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Tue, 27 May 2025 07:37:29 -0700 Mon, 26 May 2025 01:00:22 -0700 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Tue, 27 May 2025 07:37:29 -0700 Mon, 26 May 2025 01:02:15 -0700 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 172.16.30.21
Hostname: hub-ctlplane-1.5g-deployment.lab
Capacity:
cpu: 40
ephemeral-storage: 313894128Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 72030168Ki
pods: 250
Allocatable:
cpu: 39500m
ephemeral-storage: 288211086062
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 70879192Ki
pods: 250
System Info:
Machine ID: 48d5d5458e8446029eec591aba88f832
System UUID: 48d5d545-8e84-4602-9eec-591aba88f832
Boot ID: 95d9e7c9-0f03-4252-963a-63234abdfe5b
Kernel Version: 5.14.0-427.64.1.el9_4.x86_64
OS Image: Red Hat Enterprise Linux CoreOS 418.94.202504080525-0
Operating System: linux
Architecture: amd64
Container Runtime Version: cri-o://1.31.7-2.rhaos4.18.git83d6749.el9
Kubelet Version: v1.31.7
Kube-Proxy Version: v1.31.7
Non-terminated Pods: (74 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
kcli-infra coredns-hub-ctlplane-1.5g-deployment.lab 150m (0%) 0 (0%) 512Mi (0%) 0 (0%) 30h
kcli-infra haproxy-hub-ctlplane-1.5g-deployment.lab 150m (0%) 0 (0%) 512Mi (0%) 0 (0%) 30h
kcli-infra keepalived-hub-ctlplane-1.5g-deployment.lab 150m (0%) 0 (0%) 2Gi (2%) 0 (0%) 30h
kcli-infra mdns-hub-ctlplane-1.5g-deployment.lab 150m (0%) 0 (0%) 1Gi (1%) 0 (0%) 30h
openshift-apiserver apiserver-6cb9f4dd8-z29xf 110m (0%) 0 (0%) 250Mi (0%) 0 (0%) 30h
openshift-authentication oauth-openshift-89b7f44b5-cvs68 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 30h
openshift-catalogd catalogd-controller-manager-5fd54d45c8-7gnww 105m (0%) 0 (0%) 264Mi (0%) 0 (0%) 30h
openshift-cloud-controller-manager-operator cluster-cloud-controller-manager-operator-8464c5cbdd-4s2jj 30m (0%) 0 (0%) 95Mi (0%) 0 (0%) 146m
openshift-cluster-node-tuning-operator tuned-cc8hc 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 30h
openshift-cluster-samples-operator cluster-samples-operator-58b96497bf-f5d6w 20m (0%) 0 (0%) 100Mi (0%) 0 (0%) 30h
openshift-cluster-storage-operator csi-snapshot-controller-7869c57f49-7j6mg 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 30h
openshift-cluster-version cluster-version-operator-7d97d57688-tx6pl 20m (0%) 0 (0%) 50Mi (0%) 0 (0%) 30h
openshift-console-operator console-operator-dd7847f44-2z4gn 10m (0%) 0 (0%) 100Mi (0%) 0 (0%) 146m
openshift-console console-b7bf68596-fwhd6 10m (0%) 0 (0%) 100Mi (0%) 0 (0%) 143m
openshift-console downloads-76fc98c7c9-hbpzg 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 30h
openshift-controller-manager controller-manager-85c8cbd6f9-sbhcw 100m (0%) 0 (0%) 100Mi (0%) 0 (0%) 6h43m
openshift-dns dns-default-ncwml 60m (0%) 0 (0%) 110Mi (0%) 0 (0%) 30h
openshift-dns node-resolver-2zq5f 5m (0%) 0 (0%) 21Mi (0%) 0 (0%) 30h
openshift-etcd etcd-guard-hub-ctlplane-1.5g-deployment.lab 10m (0%) 0 (0%) 5Mi (0%) 0 (0%) 30h
openshift-etcd etcd-hub-ctlplane-1.5g-deployment.lab 370m (0%) 0 (0%) 960Mi (1%) 0 (0%) 30h
openshift-gitops openshift-gitops-application-controller-0 1 (2%) 16 (40%) 2Gi (2%) 32Gi (47%) 30h
openshift-gitops openshift-gitops-redis-5d74f4d9d9-fgpmx 250m (0%) 500m (1%) 128Mi (0%) 256Mi (0%) 143m
openshift-gitops openshift-gitops-server-f5b9f6644-56kvj 125m (0%) 500m (1%) 128Mi (0%) 256Mi (0%) 143m
openshift-image-registry image-registry-77c7d7d786-762xw 100m (0%) 0 (0%) 256Mi (0%) 0 (0%) 143m
openshift-image-registry node-ca-f57pj 10m (0%) 0 (0%) 10Mi (0%) 0 (0%) 30h
openshift-ingress-canary ingress-canary-5qw6n 10m (0%) 0 (0%) 20Mi (0%) 0 (0%) 30h
openshift-ingress router-default-6f79bff9ff-grlxb 100m (0%) 0 (0%) 256Mi (0%) 0 (0%) 30h
openshift-kube-apiserver kube-apiserver-guard-hub-ctlplane-1.5g-deployment.lab 10m (0%) 0 (0%) 5Mi (0%) 0 (0%) 30h
openshift-kube-apiserver kube-apiserver-hub-ctlplane-1.5g-deployment.lab 290m (0%) 0 (0%) 1224Mi (1%) 0 (0%) 6h38m
openshift-kube-controller-manager kube-controller-manager-guard-hub-ctlplane-1.5g-deployment.lab 10m (0%) 0 (0%) 5Mi (0%) 0 (0%) 30h
openshift-kube-controller-manager kube-controller-manager-hub-ctlplane-1.5g-deployment.lab 80m (0%) 0 (0%) 500Mi (0%) 0 (0%) 30h
openshift-kube-scheduler openshift-kube-scheduler-guard-hub-ctlplane-1.5g-deployment.lab 10m (0%) 0 (0%) 5Mi (0%) 0 (0%) 30h
openshift-kube-scheduler openshift-kube-scheduler-hub-ctlplane-1.5g-deployment.lab 25m (0%) 0 (0%) 150Mi (0%) 0 (0%) 30h
openshift-kube-storage-version-migrator migrator-7c8b9d7fc7-mb5ck 11m (0%) 0 (0%) 201Mi (0%) 0 (0%) 30h
openshift-local-storage diskmaker-manager-dgx2q 20m (0%) 0 (0%) 70Mi (0%) 0 (0%) 30h
openshift-logging cluster-logging-operator-6cff84cbf9-2wdjs 0 (0%) 0 (0%) 0 (0%) 0 (0%) 143m
openshift-machine-api ironic-proxy-2g547 5m (0%) 0 (0%) 50Mi (0%) 0 (0%) 146m
openshift-machine-api metal3-56c5d48f5-7pvg6 65m (0%) 0 (0%) 555Mi (0%) 0 (0%) 146m
openshift-machine-api metal3-baremetal-operator-97c7cc5f9-x5nfz 20m (0%) 0 (0%) 50Mi (0%) 0 (0%) 146m
openshift-machine-api metal3-image-customization-674fd478b5-q6q7d 5m (0%) 0 (0%) 50Mi (0%) 0 (0%) 146m
openshift-machine-config-operator kube-rbac-proxy-crio-hub-ctlplane-1.5g-deployment.lab 20m (0%) 0 (0%) 50Mi (0%) 0 (0%) 30h
openshift-machine-config-operator machine-config-controller-58f77b6849-b2csz 40m (0%) 0 (0%) 100Mi (0%) 0 (0%) 30h
openshift-machine-config-operator machine-config-daemon-8gfkk 40m (0%) 0 (0%) 100Mi (0%) 0 (0%) 30h
openshift-machine-config-operator machine-config-server-v97c2 20m (0%) 0 (0%) 50Mi (0%) 0 (0%) 30h
openshift-marketplace cs-redhat-operator-index-vjm8t 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 30h
openshift-monitoring alertmanager-main-0 9m (0%) 0 (0%) 120Mi (0%) 0 (0%) 30h
openshift-monitoring kube-state-metrics-6fd4985569-t4x4h 4m (0%) 0 (0%) 110Mi (0%) 0 (0%) 30h
openshift-monitoring metrics-server-68d4d68b5-hlkc4 1m (0%) 0 (0%) 40Mi (0%) 0 (0%) 6h39m
openshift-monitoring monitoring-plugin-675547f6b9-q7tqp 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 30h
openshift-monitoring node-exporter-2dcsn 9m (0%) 0 (0%) 47Mi (0%) 0 (0%) 30h
openshift-monitoring openshift-state-metrics-664f6f85d6-p6zfc 3m (0%) 0 (0%) 72Mi (0%) 0 (0%) 30h
openshift-monitoring prometheus-k8s-0 75m (0%) 0 (0%) 1099Mi (1%) 0 (0%) 30h
openshift-monitoring prometheus-operator-578754df59-5hph2 6m (0%) 0 (0%) 165Mi (0%) 0 (0%) 143m
openshift-monitoring prometheus-operator-admission-webhook-795b5cdd7f-n6xhl 5m (0%) 0 (0%) 30Mi (0%) 0 (0%) 30h
openshift-monitoring telemeter-client-7d5b898b55-vp7hr 3m (0%) 0 (0%) 70Mi (0%) 0 (0%) 30h
openshift-monitoring thanos-querier-fcb87c979-wsvb8 15m (0%) 0 (0%) 87Mi (0%) 0 (0%) 143m
openshift-multus multus-additional-cni-plugins-dlf29 10m (0%) 0 (0%) 10Mi (0%) 0 (0%) 30h
openshift-multus multus-admission-controller-5bd89d9df-zqbtx 20m (0%) 0 (0%) 70Mi (0%) 0 (0%) 30h
openshift-multus multus-fvbsc 10m (0%) 0 (0%) 65Mi (0%) 0 (0%) 30h
openshift-multus network-metrics-daemon-fw29v 20m (0%) 0 (0%) 120Mi (0%) 0 (0%) 30h
openshift-network-console networking-console-plugin-776976c9c8-mv4pj 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 30h
openshift-network-diagnostics network-check-source-579f89f8b7-2fp9k 10m (0%) 0 (0%) 40Mi (0%) 0 (0%) 30h
openshift-network-diagnostics network-check-target-7f5kt 10m (0%) 0 (0%) 15Mi (0%) 0 (0%) 30h
openshift-network-node-identity network-node-identity-fzfmz 20m (0%) 0 (0%) 100Mi (0%) 0 (0%) 30h
openshift-network-operator iptables-alerter-dwj8n 10m (0%) 10m (0%) 65Mi (0%) 0 (0%) 30h
openshift-oauth-apiserver apiserver-586d946cb8-l9zdx 150m (0%) 0 (0%) 200Mi (0%) 0 (0%) 30h
openshift-operator-lifecycle-manager packageserver-dd856455b-gpk5x 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 30h
openshift-operators istio-operator-54c86cd695-g5f4c 0 (0%) 0 (0%) 0 (0%) 0 (0%) 143m
openshift-ovn-kubernetes ovnkube-control-plane-56584d9f65-vczvp 20m (0%) 0 (0%) 320Mi (0%) 0 (0%) 30h
openshift-ovn-kubernetes ovnkube-node-h9k8v 80m (0%) 0 (0%) 1630Mi (2%) 0 (0%) 30h
openshift-route-controller-manager route-controller-manager-8448b4cb88-rl5xn 100m (0%) 0 (0%) 100Mi (0%) 0 (0%) 6h43m
openshift-service-ca service-ca-6f5fcb79c8-l2vlq 10m (0%) 0 (0%) 120Mi (0%) 0 (0%) 30h
openshift-storage odf-operator-controller-manager-77685b5ddd-9fkbc 200m (0%) 200m (0%) 200Mi (0%) 300Mi (0%) 30h
rhbk rhbk-operator-6fdc9d5fb4-475sr 300m (0%) 700m (1%) 450Mi (0%) 450Mi (0%) 143m
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 4896m (12%) 17910m (45%)
memory 18007Mi (26%) 34030Mi (49%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal RegisteredNode 149m node-controller Node hub-ctlplane-1.5g-deployment.lab event: Registered Node hub-ctlplane-1.5g-deployment.lab in Controller
# oc adm taint node hub-ctlplane-2.5g-deployment.lab node.kubernetes.io/unreachable:NoExecute
node/hub-ctlplane-2.5g-deployment.lab tainted
[root@INBACRNRDL0102 ~]# oc describe nodes hub-ctlplane-2.5g-deployment.lab
Name: hub-ctlplane-2.5g-deployment.lab
Roles: control-plane,master,worker
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/arch=amd64
kubernetes.io/hostname=hub-ctlplane-2.5g-deployment.lab
kubernetes.io/os=linux
node-role.kubernetes.io/control-plane=
node-role.kubernetes.io/master=
node-role.kubernetes.io/worker=
node.openshift.io/os_id=rhcos
Annotations: k8s.ovn.org/host-cidrs: ["172.16.30.22/24"]
k8s.ovn.org/l3-gateway-config:
{"default":{"mode":"shared","bridge-id":"br-ex","interface-id":"br-ex_hub-ctlplane-2.5g-deployment.lab","mac-address":"aa:aa:aa:aa:01:03",...
k8s.ovn.org/network-ids: {"default":"0"}
k8s.ovn.org/node-chassis-id: c9b287cf-e164-4288-b6cd-88ba5e81a58f
k8s.ovn.org/node-gateway-router-lrp-ifaddrs: {"default":{"ipv4":"100.64.0.4/16"}}
k8s.ovn.org/node-id: 4
k8s.ovn.org/node-masquerade-subnet: {"ipv4":"169.254.0.0/17","ipv6":"fd69::/112"}
k8s.ovn.org/node-primary-ifaddr: {"ipv4":"172.16.30.22/24"}
k8s.ovn.org/node-subnets: {"default":["10.134.0.0/23"]}
k8s.ovn.org/node-transit-switch-port-ifaddr: {"ipv4":"100.88.0.4/16"}
k8s.ovn.org/remote-zone-migrated: hub-ctlplane-2.5g-deployment.lab
k8s.ovn.org/zone-name: hub-ctlplane-2.5g-deployment.lab
machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
machineconfiguration.openshift.io/currentConfig: rendered-master-22119f80b4843c8b8f72be63d136687b
machineconfiguration.openshift.io/desiredConfig: rendered-master-22119f80b4843c8b8f72be63d136687b
machineconfiguration.openshift.io/desiredDrain: uncordon-rendered-master-22119f80b4843c8b8f72be63d136687b
machineconfiguration.openshift.io/lastAppliedDrain: uncordon-rendered-master-22119f80b4843c8b8f72be63d136687b
machineconfiguration.openshift.io/lastObservedServerCAAnnotation: false
machineconfiguration.openshift.io/lastSyncedControllerConfigResourceVersion: 720542
machineconfiguration.openshift.io/reason:
machineconfiguration.openshift.io/state: Done
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Mon, 26 May 2025 01:00:20 -0700
Taints: <none> ###<<<<<<<<---There is no taint
Unschedulable: false
Lease:
HolderIdentity: hub-ctlplane-2.5g-deployment.lab
AcquireTime: <unset>
RenewTime: Tue, 27 May 2025 07:39:51 -0700
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Tue, 27 May 2025 07:38:12 -0700 Mon, 26 May 2025 01:00:20 -0700 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Tue, 27 May 2025 07:38:12 -0700 Mon, 26 May 2025 01:00:20 -0700 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Tue, 27 May 2025 07:38:12 -0700 Mon, 26 May 2025 01:00:20 -0700 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Tue, 27 May 2025 07:38:12 -0700 Mon, 26 May 2025 01:02:14 -0700 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 172.16.30.22
Hostname: hub-ctlplane-2.5g-deployment.lab
Capacity:
cpu: 40
ephemeral-storage: 313894128Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 72030136Ki
pods: 250
Allocatable:
cpu: 39500m
ephemeral-storage: 288211086062
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 70879160Ki
pods: 250
System Info:
Machine ID: 574ceed884774bb48a565bf18c817135
System UUID: 574ceed8-8477-4bb4-8a56-5bf18c817135
Boot ID: 738c675a-9a86-479f-9a7e-d53c49f9bb5c
Kernel Version: 5.14.0-427.64.1.el9_4.x86_64
OS Image: Red Hat Enterprise Linux CoreOS 418.94.202504080525-0
Operating System: linux
Architecture: amd64
Container Runtime Version: cri-o://1.31.7-2.rhaos4.18.git83d6749.el9
Kubelet Version: v1.31.7
Kube-Proxy Version: v1.31.7
Non-terminated Pods: (91 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
kcli-infra coredns-hub-ctlplane-2.5g-deployment.lab 150m (0%) 0 (0%) 512Mi (0%) 0 (0%) 30h
kcli-infra haproxy-hub-ctlplane-2.5g-deployment.lab 150m (0%) 0 (0%) 512Mi (0%) 0 (0%) 30h
kcli-infra keepalived-hub-ctlplane-2.5g-deployment.lab 150m (0%) 0 (0%) 2Gi (2%) 0 (0%) 30h
kcli-infra mdns-hub-ctlplane-2.5g-deployment.lab 150m (0%) 0 (0%) 1Gi (1%) 0 (0%) 30h
openshift-adp openshift-adp-controller-manager-55f97dd46d-9zxtk 500m (1%) 1 (2%) 128Mi (0%) 512Mi (0%) 30h
openshift-amq-streams amq-streams-cluster-operator-v2.9.0-2-76d5c4f596-xwcnm 200m (0%) 1 (2%) 384Mi (0%) 384Mi (0%) 143m
openshift-apiserver-operator openshift-apiserver-operator-b9ff4697-l5vgb 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 30h
openshift-apiserver apiserver-6cb9f4dd8-rxwb6 110m (0%) 0 (0%) 250Mi (0%) 0 (0%) 30h
openshift-authentication-operator authentication-operator-576f97686b-qgm87 20m (0%) 0 (0%) 200Mi (0%) 0 (0%) 30h
openshift-authentication oauth-openshift-89b7f44b5-d69p2 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 30h
openshift-cloud-credential-operator cloud-credential-operator-59f7476c4c-ggjmx 20m (0%) 0 (0%) 40Mi (0%) 0 (0%) 30h
openshift-cluster-machine-approver machine-approver-7ffb5f77dd-bwjhk 20m (0%) 0 (0%) 70Mi (0%) 0 (0%) 30h
openshift-cluster-node-tuning-operator cluster-node-tuning-operator-7c664f7594-snf9z 10m (0%) 0 (0%) 20Mi (0%) 0 (0%) 30h
openshift-cluster-node-tuning-operator tuned-795z5 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 30h
openshift-cluster-olm-operator cluster-olm-operator-6dbfcc698-5jmq8 10m (0%) 0 (0%) 20Mi (0%) 0 (0%) 30h
openshift-cluster-storage-operator cluster-storage-operator-558748b75f-8zs5k 10m (0%) 0 (0%) 20Mi (0%) 0 (0%) 30h
openshift-cluster-storage-operator csi-snapshot-controller-7869c57f49-g6m5q 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 146m
openshift-cluster-storage-operator csi-snapshot-controller-operator-7fb4d7b57c-fqhsc 10m (0%) 0 (0%) 65Mi (0%) 0 (0%) 30h
openshift-config-operator openshift-config-operator-66c89dddbc-s9267 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 30h
openshift-console console-b7bf68596-6ft5l 10m (0%) 0 (0%) 100Mi (0%) 0 (0%) 143m
openshift-console downloads-76fc98c7c9-c25sd 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 30h
openshift-controller-manager-operator openshift-controller-manager-operator-796c59f564-z5qk5 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 30h
openshift-controller-manager controller-manager-85c8cbd6f9-b5rkn 100m (0%) 0 (0%) 100Mi (0%) 0 (0%) 6h44m
openshift-dns-operator dns-operator-5b4c468d8c-58bvg 20m (0%) 0 (0%) 69Mi (0%) 0 (0%) 30h
openshift-dns dns-default-dhc75 60m (0%) 0 (0%) 110Mi (0%) 0 (0%) 30h
openshift-dns node-resolver-p7ch7 5m (0%) 0 (0%) 21Mi (0%) 0 (0%) 30h
openshift-etcd-operator etcd-operator-67f458dbd6-f6ppq 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 30h
openshift-etcd etcd-guard-hub-ctlplane-2.5g-deployment.lab 10m (0%) 0 (0%) 5Mi (0%) 0 (0%) 30h
openshift-etcd etcd-hub-ctlplane-2.5g-deployment.lab 370m (0%) 0 (0%) 960Mi (1%) 0 (0%) 30h
openshift-gitops cluster-84d59f6c79-rk22g 250m (0%) 500m (1%) 128Mi (0%) 256Mi (0%) 30h
openshift-gitops gitops-plugin-9c746b4cb-mkj6s 250m (0%) 500m (1%) 128Mi (0%) 256Mi (0%) 30h
openshift-gitops openshift-gitops-applicationset-controller-6b7c978dc7-82vj6 250m (0%) 2 (5%) 512Mi (0%) 1Gi (1%) 30h
openshift-gitops openshift-gitops-dex-server-9d98b46b6-4lrvl 250m (0%) 500m (1%) 128Mi (0%) 256Mi (0%) 143m
openshift-gitops openshift-gitops-repo-server-68c96b4977-t8k9z 1 (2%) 8 (20%) 2Gi (2%) 16Gi (23%) 30h
openshift-image-registry cluster-image-registry-operator-6897b4cb5b-k5gwz 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 30h
openshift-image-registry node-ca-pb9n5 10m (0%) 0 (0%) 10Mi (0%) 0 (0%) 30h
openshift-ingress-canary ingress-canary-x7z48 10m (0%) 0 (0%) 20Mi (0%) 0 (0%) 30h
openshift-ingress-operator ingress-operator-c454b9c66-x5wdm 20m (0%) 0 (0%) 96Mi (0%) 0 (0%) 30h
openshift-ingress router-default-6f79bff9ff-vzr7p 100m (0%) 0 (0%) 256Mi (0%) 0 (0%) 143m
openshift-insights insights-operator-687564db7-ps55h 10m (0%) 0 (0%) 54Mi (0%) 0 (0%) 30h
openshift-kube-apiserver-operator kube-apiserver-operator-758dcb88c8-sglh8 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 30h
openshift-kube-apiserver kube-apiserver-guard-hub-ctlplane-2.5g-deployment.lab 10m (0%) 0 (0%) 5Mi (0%) 0 (0%) 30h
openshift-kube-apiserver kube-apiserver-hub-ctlplane-2.5g-deployment.lab 290m (0%) 0 (0%) 1224Mi (1%) 0 (0%) 6h36m
openshift-kube-controller-manager-operator kube-controller-manager-operator-685cdfb464-749sz 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 30h
openshift-kube-controller-manager kube-controller-manager-guard-hub-ctlplane-2.5g-deployment.lab 10m (0%) 0 (0%) 5Mi (0%) 0 (0%) 30h
openshift-kube-controller-manager kube-controller-manager-hub-ctlplane-2.5g-deployment.lab 80m (0%) 0 (0%) 500Mi (0%) 0 (0%) 30h
openshift-kube-scheduler-operator openshift-kube-scheduler-operator-7b8944d9d4-gv9df 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 30h
openshift-kube-scheduler openshift-kube-scheduler-guard-hub-ctlplane-2.5g-deployment.lab 10m (0%) 0 (0%) 5Mi (0%) 0 (0%) 30h
openshift-kube-scheduler openshift-kube-scheduler-hub-ctlplane-2.5g-deployment.lab 25m (0%) 0 (0%) 150Mi (0%) 0 (0%) 30h
openshift-kube-storage-version-migrator-operator kube-storage-version-migrator-operator-589978f5dc-2ff5z 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 30h
openshift-local-storage diskmaker-manager-p2xlk 20m (0%) 0 (0%) 70Mi (0%) 0 (0%) 30h
openshift-local-storage local-storage-operator-85d9c5c6-9z5dv 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 30h
openshift-machine-api cluster-autoscaler-operator-74c7c965c5-xk4gz 30m (0%) 0 (0%) 70Mi (0%) 0 (0%) 30h
openshift-machine-api cluster-baremetal-operator-867d57586-xxbtc 20m (0%) 0 (0%) 70Mi (0%) 0 (0%) 30h
openshift-machine-api control-plane-machine-set-operator-f965d496c-ngb67 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 30h
openshift-machine-api ironic-proxy-stm5m 5m (0%) 0 (0%) 50Mi (0%) 0 (0%) 146m
openshift-machine-api machine-api-operator-d8fc99dff-t4vtx 20m (0%) 0 (0%) 70Mi (0%) 0 (0%) 30h
openshift-machine-config-operator kube-rbac-proxy-crio-hub-ctlplane-2.5g-deployment.lab 20m (0%) 0 (0%) 50Mi (0%) 0 (0%) 30h
openshift-machine-config-operator machine-config-daemon-z2788 40m (0%) 0 (0%) 100Mi (0%) 0 (0%) 30h
openshift-machine-config-operator machine-config-operator-85bb75494f-bqhsh 40m (0%) 0 (0%) 100Mi (0%) 0 (0%) 30h
openshift-machine-config-operator machine-config-server-pkr5m 20m (0%) 0 (0%) 50Mi (0%) 0 (0%) 30h
openshift-marketplace marketplace-operator-9948777d4-b24tz 1m (0%) 0 (0%) 5Mi (0%) 0 (0%) 30h
openshift-monitoring cluster-monitoring-operator-67b9f49f97-m4p2k 10m (0%) 0 (0%) 75Mi (0%) 0 (0%) 30h
openshift-monitoring metrics-server-68d4d68b5-vtghq 1m (0%) 0 (0%) 40Mi (0%) 0 (0%) 143m
openshift-monitoring monitoring-plugin-675547f6b9-4z99q 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 143m
openshift-monitoring node-exporter-z6bfh 9m (0%) 0 (0%) 47Mi (0%) 0 (0%) 30h
openshift-monitoring prometheus-operator-admission-webhook-795b5cdd7f-tscsh 5m (0%) 0 (0%) 30Mi (0%) 0 (0%) 143m
openshift-monitoring thanos-querier-fcb87c979-g7f5j 15m (0%) 0 (0%) 87Mi (0%) 0 (0%) 30h
openshift-multus multus-5pclh 10m (0%) 0 (0%) 65Mi (0%) 0 (0%) 30h
openshift-multus multus-additional-cni-plugins-rcr58 10m (0%) 0 (0%) 10Mi (0%) 0 (0%) 30h
openshift-multus multus-admission-controller-5bd89d9df-7j5c8 20m (0%) 0 (0%) 70Mi (0%) 0 (0%) 143m
openshift-multus network-metrics-daemon-skgmz 20m (0%) 0 (0%) 120Mi (0%) 0 (0%) 30h
openshift-network-console networking-console-plugin-776976c9c8-qrtgx 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 30h
openshift-network-diagnostics network-check-target-ccvk4 10m (0%) 0 (0%) 15Mi (0%) 0 (0%) 30h
openshift-network-node-identity network-node-identity-d5tl4 20m (0%) 0 (0%) 100Mi (0%) 0 (0%) 30h
openshift-network-operator iptables-alerter-8fmtd 10m (0%) 10m (0%) 65Mi (0%) 0 (0%) 30h
openshift-network-operator network-operator-69cffcb848-w4b48 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 30h
openshift-oauth-apiserver apiserver-586d946cb8-khgr4 150m (0%) 0 (0%) 200Mi (0%) 0 (0%) 30h
openshift-operator-controller operator-controller-controller-manager-7c845b4b6-94mg4 15m (0%) 0 (0%) 128Mi (0%) 0 (0%) 146m
openshift-operator-lifecycle-manager catalog-operator-c85bfcd98-zd9md 10m (0%) 0 (0%) 80Mi (0%) 0 (0%) 30h
openshift-operator-lifecycle-manager olm-operator-5df758fdf-2lhlh 10m (0%) 0 (0%) 160Mi (0%) 0 (0%) 30h
openshift-operator-lifecycle-manager package-server-manager-5fb67f9466-s29sf 20m (0%) 0 (0%) 30Mi (0%) 0 (0%) 30h
openshift-operator-lifecycle-manager packageserver-dd856455b-n8cxq 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 146m
openshift-operators cluster-group-upgrades-controller-manager-v2-6fcb8695bf-b2npg 100m (0%) 0 (0%) 20Mi (0%) 0 (0%) 30h
openshift-operators openshift-gitops-operator-controller-manager-58757c9cc-4qh2p 1m (0%) 500m (1%) 15Mi (0%) 128Mi (0%) 30h
openshift-ovn-kubernetes ovnkube-control-plane-56584d9f65-dzllk 20m (0%) 0 (0%) 320Mi (0%) 0 (0%) 30h
openshift-ovn-kubernetes ovnkube-node-p6knw 80m (0%) 0 (0%) 1630Mi (2%) 0 (0%) 30h
openshift-route-controller-manager route-controller-manager-8448b4cb88-cgzrs 100m (0%) 0 (0%) 100Mi (0%) 0 (0%) 6h44m
openshift-service-ca-operator service-ca-operator-6448b956f6-8xd92 10m (0%) 0 (0%) 80Mi (0%) 0 (0%) 30h
openshift-storage odf-console-d497c4785-mzg56 100m (0%) 100m (0%) 512Mi (0%) 512Mi (0%) 143m
quay-operator quay-operator.v3.13.5-6464876748-s2bm6 0 (0%) 0 (0%) 0 (0%) 0 (0%) 143m
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 5822m (14%) 14110m (35%)
memory 17511Mi (25%) 19712Mi (28%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal RegisteredNode 149m node-controller Node hub-ctlplane-2.5g-deployment.lab event: Registered Node hub-ctlplane-2.5g-deployment.lab in Controller
Also, it looks like Kubernetes defines the following:
references:
My understanding is that these taints are managed by Kubernetes itself at the system level, not manually or by an administrator.
This has been implemented here: 93d98b5
…ted nodes
- Prioritize nodes without 'unreachable' or 'not-ready' taints and with recent heartbeats
- Fallback to tainted nodes if no healthy nodes exist
- Print non-blocking warnings when scheduling to problematic nodes
- All logic is derived dynamically from cluster state
TolerationLoop: | ||
for _, tol := range tolerations { | ||
for _, excluded := range excludedTaints { | ||
if tol.ToleratesTaint(&excluded) { |
Still unanswered
/retest |
@midu16: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Running the added code locally for filtering the taints (with
If a cluster has no masters, the list of filteredTolerations will be empty. This is, or might be, the case on hypershift clusters (#1347), and it makes must-gather less resilient than it is now.
It also invalidates the original intention to still allow necessary tolerations.
for _, cond := range node.Status.Conditions {
	if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
		// Check if heartbeat is recent (less than 2m old)
		if time.Since(cond.LastHeartbeatTime.Time) < 2*time.Minute {
Note: This assumes that the clock on the host where oc adm must-gather runs is in sync with the nodes, which might not always be the case.
@@ -462,6 +514,10 @@ func (o *MustGatherOptions) Run() error {
	nodes, err := o.Client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{
		LabelSelector: o.NodeSelector,
	})
	if err != nil {
Duplicated code
@@ -401,6 +417,42 @@ func (o *MustGatherOptions) Validate() error {
	return nil
}

// prioritizeHealthyNodes returns a preferred node to run the must-gather pod on, and a fallback node if no preferred node is found.
func prioritizeHealthyNodes(nodes *corev1.NodeList) (preferred *corev1.Node, fallback *corev1.Node) {
prioritizeHealthyNodes is currently unused in the code
[OCPBUGS-50992]: Filter out unreachable taints from tolerations:
- … (unreachableTaintKey) with effects NoExecute and NoSchedule.
- … node-role.kubernetes.io/master (if applicable) and node.kubernetes.io/not-ready.
- … unreachableTaintKey.
This change ensures that workloads do not unintentionally tolerate unreachable nodes while still allowing necessary tolerations.