[OCPBUGS-50992]: Filter out unreachable taints from tolerations #1990

Open: 17 commits to be merged into base branch main

midu16 commented Mar 11, 2025

[OCPBUGS-50992]: Filter out unreachable taints from tolerations:

  • Define a list of excluded taints (unreachableTaintKey) with effects NoExecute and NoSchedule.
  • Add tolerations for node-role.kubernetes.io/master (if applicable) and node.kubernetes.io/not-ready.
  • Filter out tolerations that match the excluded taints to prevent unintended scheduling behavior.
  • Implement a loop to iterate over tolerations and exclude ones that tolerate unreachableTaintKey.

This change ensures that workloads do not unintentionally tolerate unreachable nodes while still allowing necessary tolerations.
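
For illustration, the filtering described in the list above can be sketched as follows. This is a minimal sketch, not the PR's actual code: `Taint`, `Toleration`, and `filterTolerations` are simplified, hypothetical stand-ins for the `corev1` types and the PR's helper, and the matching rules follow the Kubernetes taints-and-tolerations documentation as I understand it.

```go
package main

import "fmt"

// Taint and Toleration are simplified stand-ins for corev1.Taint and
// corev1.Toleration (assumption: only these fields matter for the filter).
type Taint struct {
	Key, Value, Effect string
}

type Toleration struct {
	Key, Operator, Value, Effect string
}

// ToleratesTaint applies the documented matching rules: an empty Effect
// matches any effect, an empty Key matches any key, Exists ignores the
// value, and Equal (the default operator) compares values.
func (t Toleration) ToleratesTaint(taint Taint) bool {
	if t.Effect != "" && t.Effect != taint.Effect {
		return false
	}
	if t.Key != "" && t.Key != taint.Key {
		return false
	}
	switch t.Operator {
	case "", "Equal":
		return t.Value == taint.Value
	case "Exists":
		return true
	default:
		return false
	}
}

// filterTolerations drops every toleration that would tolerate one of the
// excluded taints and keeps the rest in their original order.
func filterTolerations(tolerations []Toleration, excluded []Taint) []Toleration {
	kept := make([]Toleration, 0, len(tolerations))
TolerationLoop:
	for _, tol := range tolerations {
		for _, taint := range excluded {
			if tol.ToleratesTaint(taint) {
				continue TolerationLoop
			}
		}
		kept = append(kept, tol)
	}
	return kept
}

func main() {
	const unreachable = "node.kubernetes.io/unreachable"
	excluded := []Taint{
		{Key: unreachable, Effect: "NoExecute"},
		{Key: unreachable, Effect: "NoSchedule"},
	}
	tolerations := []Toleration{
		{Operator: "Exists"}, // hypothetical tolerate-everything wildcard
		{Key: "node.kubernetes.io/not-ready", Operator: "Exists"},
	}
	for _, tol := range filterTolerations(tolerations, excluded) {
		fmt.Printf("kept: key=%q operator=%s\n", tol.Key, tol.Operator)
	}
}
```

In this sketch only the wildcard toleration is dropped, because it is the only one that matches the unreachable taint; the not-ready toleration is kept.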

openshift-ci bot added the do-not-merge/work-in-progress label on Mar 11, 2025
openshift-ci bot requested review from atiratree and ingvagabund on March 11, 2025 19:23
midu16 changed the title from "WIP: Filter out unreachable taints from tolerations" to "[OCPBUGS-50992]: Filter out unreachable taints from tolerations" on Mar 18, 2025
openshift-ci bot removed the do-not-merge/work-in-progress label on Mar 18, 2025

midu16 (Author) commented Apr 2, 2025

/test images

midu16 (Author) commented Apr 4, 2025

@ingvagabund, @atiratree I am trying to make sense of the failed checks, but I cannot find the correlation between my proposed patch and the errors. What am I missing here, from your point of view?

Much appreciated,
M

ingvagabund (Member) commented Apr 7, 2025

Looks like all three e2e tests are flaking (unrelated to this PR), and the verify check needs "make update-gofmt" to be run. I think you are good here when it comes to passing the tests :)

TolerationLoop:
	for _, tol := range tolerations {
		for _, excluded := range excludedTaints {
			if tol.ToleratesTaint(&excluded) {
ingvagabund (Member) commented Apr 15, 2025

So you are excluding tolerations that are added in the same method. Why not skip adding them in the first place, instead of adding them and then excluding them? Or is this code meant to be extended later with a user-provided list of excluded taints?

Also, neither of those two tolerations tolerates the taint, as the Key fields always differ:

  • "node-role.kubernetes.io/master" != "node.kubernetes.io/unreachable"
  • "node.kubernetes.io/not-ready" != "node.kubernetes.io/unreachable"

That makes filteredTolerations = tolerations always, so the double loop is a no-op. Maybe I misread the code.
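
The no-op observation above can be checked with a small sketch. It is illustrative only: `taint`, `toleration`, `tolerates`, and `removedBy` are hypothetical names, the operator is assumed to be Exists, and the tolerations are given an empty effect (the most permissive case) because the effects used in the PR are not shown here.

```go
package main

import "fmt"

type taint struct{ key, effect string }

// toleration assumes the Exists operator; an empty field matches anything.
type toleration struct{ key, effect string }

// tolerates reports whether both the key and the effect match.
func tolerates(tol toleration, t taint) bool {
	keyOK := tol.key == "" || tol.key == t.key
	effectOK := tol.effect == "" || tol.effect == t.effect
	return keyOK && effectOK
}

// removedBy counts how many tolerations the double loop would drop.
func removedBy(tols []toleration, excluded []taint) int {
	removed := 0
	for _, tol := range tols {
		for _, t := range excluded {
			if tolerates(tol, t) {
				removed++
				break
			}
		}
	}
	return removed
}

func main() {
	excluded := []taint{
		{"node.kubernetes.io/unreachable", "NoExecute"},
		{"node.kubernetes.io/unreachable", "NoSchedule"},
	}
	added := []toleration{
		{key: "node-role.kubernetes.io/master"},
		{key: "node.kubernetes.io/not-ready"},
	}
	fmt.Println("removed:", removedBy(added, excluded)) // prints "removed: 0"
}
```

Since the keys never match, nothing is removed even with the most permissive effect; only a toleration with an empty key or the unreachable key would be affected.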

Member replied:

Still unanswered

Author replied:

@ingvagabund regarding this, I have added a comment stating that I intend to support user-defined exclusion taints, in the hope of preserving this logic. Should we define a unit test that validates this logic further?

Member replied:

How soon do you expect support for user-defined exclusion taints to be added? If the current logic has no use yet, it's better to introduce it in the PR that adds the user-defined exclusion taints, so that all the relevant changes land in the same PR and the context is preserved. Otherwise the code changes will be split apart, and the logic may never be extended or finished.

Author replied:

@ingvagabund I expect to expand it in the next PR with the experimental flag from here: #1990 (comment)

Member replied:

Gotcha. In that case it's better to introduce the loop as part of the next PR.

Author replied:

@ingvagabund unfortunately this cannot be deferred to the next PR. The main purpose of this feature is to manage the tolerations for the must-gather pod and avoid random allocation to a node on which the pod cannot be scheduled. The initial implementation allowed the must-gather pod to tolerate everything, hence this section was changed to exclude specific taints.

ingvagabund (Member) commented Apr 15, 2025

What about extending the oc adm must-gather code with a warning saying "hey, the must-gather pod got scheduled to THIS node that is tainted with node.kubernetes.io/unreachable. Maybe you'd like to schedule it to a different node to avoid the pod getting stuck. Here are some suggestions: NODES"? This will help in the user-initiated case, though not much when must-gather runs through automation. On the other hand, the automation might set a special flag to tell oc adm must-gather to be stricter in selecting which nodes are still "safe" to tolerate. So, as we probably suggested in one of the previous bug reports, it's preferable to have the user or automation provide a list of explicit tolerations if the current "tolerate everything" does not work well.
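
The suggested warning could look roughly like the sketch below. It is only an illustration under stated assumptions: `node`, `hasTaint`, and `unreachableWarning` are hypothetical names, nodes are reduced to a name plus a list of taint keys, and nothing here reflects the actual oc adm must-gather code.

```go
package main

import (
	"fmt"
	"strings"
)

const unreachableTaintKey = "node.kubernetes.io/unreachable"

// node is a hypothetical, trimmed-down view of a cluster node.
type node struct {
	name   string
	taints []string // taint keys only
}

// hasTaint reports whether the node carries a taint with the given key.
func hasTaint(n node, key string) bool {
	for _, k := range n.taints {
		if k == key {
			return true
		}
	}
	return false
}

// unreachableWarning returns an empty string when the scheduled node does
// not carry the unreachable taint; otherwise it returns a message naming
// the node and listing the other nodes that do not carry the taint.
func unreachableWarning(scheduled node, all []node) string {
	if !hasTaint(scheduled, unreachableTaintKey) {
		return ""
	}
	var alternatives []string
	for _, n := range all {
		if n.name != scheduled.name && !hasTaint(n, unreachableTaintKey) {
			alternatives = append(alternatives, n.name)
		}
	}
	msg := fmt.Sprintf("warning: must-gather pod was scheduled to node %q, which is tainted with %s; the pod may get stuck.",
		scheduled.name, unreachableTaintKey)
	if len(alternatives) == 0 {
		return msg + " No untainted nodes were found."
	}
	return msg + " Consider one of these nodes instead: " + strings.Join(alternatives, ", ")
}

func main() {
	nodes := []node{
		{name: "ctlplane-0", taints: []string{unreachableTaintKey}},
		{name: "ctlplane-1"},
		{name: "ctlplane-2"},
	}
	fmt.Println(unreachableWarning(nodes[0], nodes))
}
```

Returning a string rather than printing keeps the check usable both interactively and from automation, which could decide to fail instead of warn.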

ingvagabund (Member) commented:

What if all control plane nodes are "node.kubernetes.io/unreachable" tainted? What if all nodes are "node.kubernetes.io/unreachable" tainted? Some nodes might still be eligible for running must-gather. Presence of the taint does not necessarily mean the node is "gone". From https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/#taint-based-evictions:

node.kubernetes.io/unreachable: Node is unreachable from the node controller. This corresponds to the NodeCondition Ready being "Unknown".

Maybe it's just the node controller that's temporarily separated from the rest of the control plane?

midu16 (Author) commented Apr 15, 2025

> What if all control plane nodes are "node.kubernetes.io/unreachable" tainted? What if all nodes are "node.kubernetes.io/unreachable" tainted? Some nodes might still be eligible for running must-gather. Presence of the taint does not necessarily mean the node is "gone". From https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/#taint-based-evictions:
>
> node.kubernetes.io/unreachable: Node is unreachable from the node controller. This corresponds to the NodeCondition Ready being "Unknown".
>
> Maybe it's just the node controller that's temporarily separated from the rest of the control plane?

I can expand this topic as follows:

  • Don’t outright exclude tainted nodes.
  • Prioritize nodes without the taints.
  • Fall back to unreachable or not-ready nodes if no other healthy candidates exist.
  • Optionally, add logic to:
    • Check node heartbeat age (status.conditions[].lastHeartbeatTime).
    • Or try lightweight probing before deploying the workload.
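
The prioritization outlined above could be sketched as follows. This is a rough illustration, not a proposed implementation: `candidate`, `rankCandidates`, and `heartbeatFresh` are hypothetical names, and the five-minute freshness threshold is an arbitrary assumption.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// candidate is a hypothetical summary of a node for scheduling purposes.
type candidate struct {
	name          string
	tainted       bool      // carries the unreachable or not-ready taint
	lastHeartbeat time.Time // e.g. the Ready condition's lastHeartbeatTime
}

// rankCandidates returns a sorted copy: untainted nodes first, and within
// each group the node with the most recent heartbeat first. Tainted nodes
// are kept as a fallback rather than excluded.
func rankCandidates(nodes []candidate) []candidate {
	ranked := append([]candidate(nil), nodes...)
	sort.SliceStable(ranked, func(i, j int) bool {
		if ranked[i].tainted != ranked[j].tainted {
			return !ranked[i].tainted
		}
		return ranked[i].lastHeartbeat.After(ranked[j].lastHeartbeat)
	})
	return ranked
}

// heartbeatFresh reports whether the last heartbeat is at most maxAge old.
func heartbeatFresh(c candidate, now time.Time, maxAge time.Duration) bool {
	return now.Sub(c.lastHeartbeat) <= maxAge
}

func main() {
	now := time.Date(2025, 5, 27, 7, 40, 0, 0, time.UTC)
	nodes := []candidate{
		{"ctlplane-0", true, now.Add(-6 * time.Hour)},
		{"ctlplane-1", false, now.Add(-2 * time.Minute)},
		{"ctlplane-2", true, now.Add(-30 * time.Second)},
	}
	for _, c := range rankCandidates(nodes) {
		fmt.Printf("%s tainted=%t fresh=%t\n", c.name, c.tainted,
			heartbeatFresh(c, now, 5*time.Minute))
	}
}
```

With this ordering a tainted node whose heartbeat is still recent ranks ahead of one that has been silent for hours, which matches the idea that the taint alone does not mean the node is gone.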

openshift-ci bot (Contributor) commented May 26, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: midu16
Once this PR has been reviewed and has the lgtm label, please assign ardaguclu for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

midu16 (Author) commented May 27, 2025

> What if all control plane nodes are "node.kubernetes.io/unreachable" tainted? What if all nodes are "node.kubernetes.io/unreachable" tainted? Some nodes might still be eligible for running must-gather. Presence of the taint does not necessarily mean the node is "gone". From https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/#taint-based-evictions:
>
> node.kubernetes.io/unreachable: Node is unreachable from the node controller. This corresponds to the NodeCondition Ready being "Unknown".
>
> Maybe it's just the node controller that's temporarily separated from the rest of the control plane?

I am not able to replicate this theoretical scenario in which all the nodes are tainted (manually, I assume). Here is my attempt:

# oc get nodes
NAME                               STATUS     ROLES                         AGE   VERSION
hub-ctlplane-0.5g-deployment.lab   NotReady   control-plane,master,worker   30h   v1.31.7
hub-ctlplane-1.5g-deployment.lab   Ready      control-plane,master,worker   30h   v1.31.7
hub-ctlplane-2.5g-deployment.lab   Ready      control-plane,master,worker   30h   v1.31.7

Here, one of the nodes has been powered off to simulate the exact shutdown scenario.

I will try to taint hub-ctlplane-1.5g-deployment.lab and hub-ctlplane-2.5g-deployment.lab with node.kubernetes.io/unreachable:NoExecute as follows:

# oc adm taint node hub-ctlplane-1.5g-deployment.lab node.kubernetes.io/unreachable:NoExecute
node/hub-ctlplane-1.5g-deployment.lab tainted
# oc adm taint node hub-ctlplane-2.5g-deployment.lab node.kubernetes.io/unreachable:NoExecute
node/hub-ctlplane-2.5g-deployment.lab tainted

As you can see, the taint commands succeeded, but describing the nodes tainted above shows no taints:

# oc describe nodes hub-ctlplane-1.5g-deployment.lab
Name:               hub-ctlplane-1.5g-deployment.lab
Roles:              control-plane,master,worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=hub-ctlplane-1.5g-deployment.lab
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/control-plane=
                    node-role.kubernetes.io/master=
                    node-role.kubernetes.io/worker=
                    node.openshift.io/os_id=rhcos
Annotations:        k8s.ovn.org/host-cidrs: ["172.16.30.10/24","172.16.30.11/24","172.16.30.21/24"]
                    k8s.ovn.org/l3-gateway-config:
                      {"default":{"mode":"shared","bridge-id":"br-ex","interface-id":"br-ex_hub-ctlplane-1.5g-deployment.lab","mac-address":"aa:aa:aa:aa:01:02",...
                    k8s.ovn.org/network-ids: {"default":"0"}
                    k8s.ovn.org/node-chassis-id: 76b0cc16-f376-4664-aaaf-85260e2b7a82
                    k8s.ovn.org/node-gateway-router-lrp-ifaddrs: {"default":{"ipv4":"100.64.0.3/16"}}
                    k8s.ovn.org/node-id: 3
                    k8s.ovn.org/node-masquerade-subnet: {"ipv4":"169.254.0.0/17","ipv6":"fd69::/112"}
                    k8s.ovn.org/node-primary-ifaddr: {"ipv4":"172.16.30.21/24"}
                    k8s.ovn.org/node-subnets: {"default":["10.133.0.0/23"]}
                    k8s.ovn.org/node-transit-switch-port-ifaddr: {"ipv4":"100.88.0.3/16"}
                    k8s.ovn.org/remote-zone-migrated: hub-ctlplane-1.5g-deployment.lab
                    k8s.ovn.org/zone-name: hub-ctlplane-1.5g-deployment.lab
                    machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
                    machineconfiguration.openshift.io/currentConfig: rendered-master-22119f80b4843c8b8f72be63d136687b
                    machineconfiguration.openshift.io/desiredConfig: rendered-master-22119f80b4843c8b8f72be63d136687b
                    machineconfiguration.openshift.io/desiredDrain: uncordon-rendered-master-22119f80b4843c8b8f72be63d136687b
                    machineconfiguration.openshift.io/lastAppliedDrain: uncordon-rendered-master-22119f80b4843c8b8f72be63d136687b
                    machineconfiguration.openshift.io/lastObservedServerCAAnnotation: false
                    machineconfiguration.openshift.io/lastSyncedControllerConfigResourceVersion: 720542
                    machineconfiguration.openshift.io/reason: 
                    machineconfiguration.openshift.io/state: Done
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Mon, 26 May 2025 01:00:22 -0700
Taints:             <none> ###<<<<<<<<---There is no taint
Unschedulable:      false
Lease:
  HolderIdentity:  hub-ctlplane-1.5g-deployment.lab
  AcquireTime:     <unset>
  RenewTime:       Tue, 27 May 2025 07:39:20 -0700
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Tue, 27 May 2025 07:37:29 -0700   Mon, 26 May 2025 01:00:22 -0700   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Tue, 27 May 2025 07:37:29 -0700   Mon, 26 May 2025 01:00:22 -0700   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Tue, 27 May 2025 07:37:29 -0700   Mon, 26 May 2025 01:00:22 -0700   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Tue, 27 May 2025 07:37:29 -0700   Mon, 26 May 2025 01:02:15 -0700   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  172.16.30.21
  Hostname:    hub-ctlplane-1.5g-deployment.lab
Capacity:
  cpu:                40
  ephemeral-storage:  313894128Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             72030168Ki
  pods:               250
Allocatable:
  cpu:                39500m
  ephemeral-storage:  288211086062
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             70879192Ki
  pods:               250
System Info:
  Machine ID:                                  48d5d5458e8446029eec591aba88f832
  System UUID:                                 48d5d545-8e84-4602-9eec-591aba88f832
  Boot ID:                                     95d9e7c9-0f03-4252-963a-63234abdfe5b
  Kernel Version:                              5.14.0-427.64.1.el9_4.x86_64
  OS Image:                                    Red Hat Enterprise Linux CoreOS 418.94.202504080525-0
  Operating System:                            linux
  Architecture:                                amd64
  Container Runtime Version:                   cri-o://1.31.7-2.rhaos4.18.git83d6749.el9
  Kubelet Version:                             v1.31.7
  Kube-Proxy Version:                          v1.31.7
Non-terminated Pods:                           (74 in total)
  Namespace                                    Name                                                               CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                                    ----                                                               ------------  ----------  ---------------  -------------  ---
  kcli-infra                                   coredns-hub-ctlplane-1.5g-deployment.lab                           150m (0%)     0 (0%)      512Mi (0%)       0 (0%)         30h
  kcli-infra                                   haproxy-hub-ctlplane-1.5g-deployment.lab                           150m (0%)     0 (0%)      512Mi (0%)       0 (0%)         30h
  kcli-infra                                   keepalived-hub-ctlplane-1.5g-deployment.lab                        150m (0%)     0 (0%)      2Gi (2%)         0 (0%)         30h
  kcli-infra                                   mdns-hub-ctlplane-1.5g-deployment.lab                              150m (0%)     0 (0%)      1Gi (1%)         0 (0%)         30h
  openshift-apiserver                          apiserver-6cb9f4dd8-z29xf                                          110m (0%)     0 (0%)      250Mi (0%)       0 (0%)         30h
  openshift-authentication                     oauth-openshift-89b7f44b5-cvs68                                    10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         30h
  openshift-catalogd                           catalogd-controller-manager-5fd54d45c8-7gnww                       105m (0%)     0 (0%)      264Mi (0%)       0 (0%)         30h
  openshift-cloud-controller-manager-operator  cluster-cloud-controller-manager-operator-8464c5cbdd-4s2jj         30m (0%)      0 (0%)      95Mi (0%)        0 (0%)         146m
  openshift-cluster-node-tuning-operator       tuned-cc8hc                                                        10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         30h
  openshift-cluster-samples-operator           cluster-samples-operator-58b96497bf-f5d6w                          20m (0%)      0 (0%)      100Mi (0%)       0 (0%)         30h
  openshift-cluster-storage-operator           csi-snapshot-controller-7869c57f49-7j6mg                           10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         30h
  openshift-cluster-version                    cluster-version-operator-7d97d57688-tx6pl                          20m (0%)      0 (0%)      50Mi (0%)        0 (0%)         30h
  openshift-console-operator                   console-operator-dd7847f44-2z4gn                                   10m (0%)      0 (0%)      100Mi (0%)       0 (0%)         146m
  openshift-console                            console-b7bf68596-fwhd6                                            10m (0%)      0 (0%)      100Mi (0%)       0 (0%)         143m
  openshift-console                            downloads-76fc98c7c9-hbpzg                                         10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         30h
  openshift-controller-manager                 controller-manager-85c8cbd6f9-sbhcw                                100m (0%)     0 (0%)      100Mi (0%)       0 (0%)         6h43m
  openshift-dns                                dns-default-ncwml                                                  60m (0%)      0 (0%)      110Mi (0%)       0 (0%)         30h
  openshift-dns                                node-resolver-2zq5f                                                5m (0%)       0 (0%)      21Mi (0%)        0 (0%)         30h
  openshift-etcd                               etcd-guard-hub-ctlplane-1.5g-deployment.lab                        10m (0%)      0 (0%)      5Mi (0%)         0 (0%)         30h
  openshift-etcd                               etcd-hub-ctlplane-1.5g-deployment.lab                              370m (0%)     0 (0%)      960Mi (1%)       0 (0%)         30h
  openshift-gitops                             openshift-gitops-application-controller-0                          1 (2%)        16 (40%)    2Gi (2%)         32Gi (47%)     30h
  openshift-gitops                             openshift-gitops-redis-5d74f4d9d9-fgpmx                            250m (0%)     500m (1%)   128Mi (0%)       256Mi (0%)     143m
  openshift-gitops                             openshift-gitops-server-f5b9f6644-56kvj                            125m (0%)     500m (1%)   128Mi (0%)       256Mi (0%)     143m
  openshift-image-registry                     image-registry-77c7d7d786-762xw                                    100m (0%)     0 (0%)      256Mi (0%)       0 (0%)         143m
  openshift-image-registry                     node-ca-f57pj                                                      10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         30h
  openshift-ingress-canary                     ingress-canary-5qw6n                                               10m (0%)      0 (0%)      20Mi (0%)        0 (0%)         30h
  openshift-ingress                            router-default-6f79bff9ff-grlxb                                    100m (0%)     0 (0%)      256Mi (0%)       0 (0%)         30h
  openshift-kube-apiserver                     kube-apiserver-guard-hub-ctlplane-1.5g-deployment.lab              10m (0%)      0 (0%)      5Mi (0%)         0 (0%)         30h
  openshift-kube-apiserver                     kube-apiserver-hub-ctlplane-1.5g-deployment.lab                    290m (0%)     0 (0%)      1224Mi (1%)      0 (0%)         6h38m
  openshift-kube-controller-manager            kube-controller-manager-guard-hub-ctlplane-1.5g-deployment.lab     10m (0%)      0 (0%)      5Mi (0%)         0 (0%)         30h
  openshift-kube-controller-manager            kube-controller-manager-hub-ctlplane-1.5g-deployment.lab           80m (0%)      0 (0%)      500Mi (0%)       0 (0%)         30h
  openshift-kube-scheduler                     openshift-kube-scheduler-guard-hub-ctlplane-1.5g-deployment.lab    10m (0%)      0 (0%)      5Mi (0%)         0 (0%)         30h
  openshift-kube-scheduler                     openshift-kube-scheduler-hub-ctlplane-1.5g-deployment.lab          25m (0%)      0 (0%)      150Mi (0%)       0 (0%)         30h
  openshift-kube-storage-version-migrator      migrator-7c8b9d7fc7-mb5ck                                          11m (0%)      0 (0%)      201Mi (0%)       0 (0%)         30h
  openshift-local-storage                      diskmaker-manager-dgx2q                                            20m (0%)      0 (0%)      70Mi (0%)        0 (0%)         30h
  openshift-logging                            cluster-logging-operator-6cff84cbf9-2wdjs                          0 (0%)        0 (0%)      0 (0%)           0 (0%)         143m
  openshift-machine-api                        ironic-proxy-2g547                                                 5m (0%)       0 (0%)      50Mi (0%)        0 (0%)         146m
  openshift-machine-api                        metal3-56c5d48f5-7pvg6                                             65m (0%)      0 (0%)      555Mi (0%)       0 (0%)         146m
  openshift-machine-api                        metal3-baremetal-operator-97c7cc5f9-x5nfz                          20m (0%)      0 (0%)      50Mi (0%)        0 (0%)         146m
  openshift-machine-api                        metal3-image-customization-674fd478b5-q6q7d                        5m (0%)       0 (0%)      50Mi (0%)        0 (0%)         146m
  openshift-machine-config-operator            kube-rbac-proxy-crio-hub-ctlplane-1.5g-deployment.lab              20m (0%)      0 (0%)      50Mi (0%)        0 (0%)         30h
  openshift-machine-config-operator            machine-config-controller-58f77b6849-b2csz                         40m (0%)      0 (0%)      100Mi (0%)       0 (0%)         30h
  openshift-machine-config-operator            machine-config-daemon-8gfkk                                        40m (0%)      0 (0%)      100Mi (0%)       0 (0%)         30h
  openshift-machine-config-operator            machine-config-server-v97c2                                        20m (0%)      0 (0%)      50Mi (0%)        0 (0%)         30h
  openshift-marketplace                        cs-redhat-operator-index-vjm8t                                     10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         30h
  openshift-monitoring                         alertmanager-main-0                                                9m (0%)       0 (0%)      120Mi (0%)       0 (0%)         30h
  openshift-monitoring                         kube-state-metrics-6fd4985569-t4x4h                                4m (0%)       0 (0%)      110Mi (0%)       0 (0%)         30h
  openshift-monitoring                         metrics-server-68d4d68b5-hlkc4                                     1m (0%)       0 (0%)      40Mi (0%)        0 (0%)         6h39m
  openshift-monitoring                         monitoring-plugin-675547f6b9-q7tqp                                 10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         30h
  openshift-monitoring                         node-exporter-2dcsn                                                9m (0%)       0 (0%)      47Mi (0%)        0 (0%)         30h
  openshift-monitoring                         openshift-state-metrics-664f6f85d6-p6zfc                           3m (0%)       0 (0%)      72Mi (0%)        0 (0%)         30h
  openshift-monitoring                         prometheus-k8s-0                                                   75m (0%)      0 (0%)      1099Mi (1%)      0 (0%)         30h
  openshift-monitoring                         prometheus-operator-578754df59-5hph2                               6m (0%)       0 (0%)      165Mi (0%)       0 (0%)         143m
  openshift-monitoring                         prometheus-operator-admission-webhook-795b5cdd7f-n6xhl             5m (0%)       0 (0%)      30Mi (0%)        0 (0%)         30h
  openshift-monitoring                         telemeter-client-7d5b898b55-vp7hr                                  3m (0%)       0 (0%)      70Mi (0%)        0 (0%)         30h
  openshift-monitoring                         thanos-querier-fcb87c979-wsvb8                                     15m (0%)      0 (0%)      87Mi (0%)        0 (0%)         143m
  openshift-multus                             multus-additional-cni-plugins-dlf29                                10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         30h
  openshift-multus                             multus-admission-controller-5bd89d9df-zqbtx                        20m (0%)      0 (0%)      70Mi (0%)        0 (0%)         30h
  openshift-multus                             multus-fvbsc                                                       10m (0%)      0 (0%)      65Mi (0%)        0 (0%)         30h
  openshift-multus                             network-metrics-daemon-fw29v                                       20m (0%)      0 (0%)      120Mi (0%)       0 (0%)         30h
  openshift-network-console                    networking-console-plugin-776976c9c8-mv4pj                         10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         30h
  openshift-network-diagnostics                network-check-source-579f89f8b7-2fp9k                              10m (0%)      0 (0%)      40Mi (0%)        0 (0%)         30h
  openshift-network-diagnostics                network-check-target-7f5kt                                         10m (0%)      0 (0%)      15Mi (0%)        0 (0%)         30h
  openshift-network-node-identity              network-node-identity-fzfmz                                        20m (0%)      0 (0%)      100Mi (0%)       0 (0%)         30h
  openshift-network-operator                   iptables-alerter-dwj8n                                             10m (0%)      10m (0%)    65Mi (0%)        0 (0%)         30h
  openshift-oauth-apiserver                    apiserver-586d946cb8-l9zdx                                         150m (0%)     0 (0%)      200Mi (0%)       0 (0%)         30h
  openshift-operator-lifecycle-manager         packageserver-dd856455b-gpk5x                                      10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         30h
  openshift-operators                          istio-operator-54c86cd695-g5f4c                                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         143m
  openshift-ovn-kubernetes                     ovnkube-control-plane-56584d9f65-vczvp                             20m (0%)      0 (0%)      320Mi (0%)       0 (0%)         30h
  openshift-ovn-kubernetes                     ovnkube-node-h9k8v                                                 80m (0%)      0 (0%)      1630Mi (2%)      0 (0%)         30h
  openshift-route-controller-manager           route-controller-manager-8448b4cb88-rl5xn                          100m (0%)     0 (0%)      100Mi (0%)       0 (0%)         6h43m
  openshift-service-ca                         service-ca-6f5fcb79c8-l2vlq                                        10m (0%)      0 (0%)      120Mi (0%)       0 (0%)         30h
  openshift-storage                            odf-operator-controller-manager-77685b5ddd-9fkbc                   200m (0%)     200m (0%)   200Mi (0%)       300Mi (0%)     30h
  rhbk                                         rhbk-operator-6fdc9d5fb4-475sr                                     300m (0%)     700m (1%)   450Mi (0%)       450Mi (0%)     143m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests       Limits
  --------           --------       ------
  cpu                4896m (12%)    17910m (45%)
  memory             18007Mi (26%)  34030Mi (49%)
  ephemeral-storage  0 (0%)         0 (0%)
  hugepages-1Gi      0 (0%)         0 (0%)
  hugepages-2Mi      0 (0%)         0 (0%)
Events:
  Type    Reason          Age   From             Message
  ----    ------          ----  ----             -------
  Normal  RegisteredNode  149m  node-controller  Node hub-ctlplane-1.5g-deployment.lab event: Registered Node hub-ctlplane-1.5g-deployment.lab in Controller


# oc adm taint node hub-ctlplane-2.5g-deployment.lab node.kubernetes.io/unreachable:NoExecute
node/hub-ctlplane-2.5g-deployment.lab tainted
# oc describe nodes hub-ctlplane-2.5g-deployment.lab
Name:               hub-ctlplane-2.5g-deployment.lab
Roles:              control-plane,master,worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=hub-ctlplane-2.5g-deployment.lab
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/control-plane=
                    node-role.kubernetes.io/master=
                    node-role.kubernetes.io/worker=
                    node.openshift.io/os_id=rhcos
Annotations:        k8s.ovn.org/host-cidrs: ["172.16.30.22/24"]
                    k8s.ovn.org/l3-gateway-config:
                      {"default":{"mode":"shared","bridge-id":"br-ex","interface-id":"br-ex_hub-ctlplane-2.5g-deployment.lab","mac-address":"aa:aa:aa:aa:01:03",...
                    k8s.ovn.org/network-ids: {"default":"0"}
                    k8s.ovn.org/node-chassis-id: c9b287cf-e164-4288-b6cd-88ba5e81a58f
                    k8s.ovn.org/node-gateway-router-lrp-ifaddrs: {"default":{"ipv4":"100.64.0.4/16"}}
                    k8s.ovn.org/node-id: 4
                    k8s.ovn.org/node-masquerade-subnet: {"ipv4":"169.254.0.0/17","ipv6":"fd69::/112"}
                    k8s.ovn.org/node-primary-ifaddr: {"ipv4":"172.16.30.22/24"}
                    k8s.ovn.org/node-subnets: {"default":["10.134.0.0/23"]}
                    k8s.ovn.org/node-transit-switch-port-ifaddr: {"ipv4":"100.88.0.4/16"}
                    k8s.ovn.org/remote-zone-migrated: hub-ctlplane-2.5g-deployment.lab
                    k8s.ovn.org/zone-name: hub-ctlplane-2.5g-deployment.lab
                    machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
                    machineconfiguration.openshift.io/currentConfig: rendered-master-22119f80b4843c8b8f72be63d136687b
                    machineconfiguration.openshift.io/desiredConfig: rendered-master-22119f80b4843c8b8f72be63d136687b
                    machineconfiguration.openshift.io/desiredDrain: uncordon-rendered-master-22119f80b4843c8b8f72be63d136687b
                    machineconfiguration.openshift.io/lastAppliedDrain: uncordon-rendered-master-22119f80b4843c8b8f72be63d136687b
                    machineconfiguration.openshift.io/lastObservedServerCAAnnotation: false
                    machineconfiguration.openshift.io/lastSyncedControllerConfigResourceVersion: 720542
                    machineconfiguration.openshift.io/reason: 
                    machineconfiguration.openshift.io/state: Done
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Mon, 26 May 2025 01:00:20 -0700
Taints:             <none>   ###<<<<<<<<---There is no taint
Unschedulable:      false
Lease:
  HolderIdentity:  hub-ctlplane-2.5g-deployment.lab
  AcquireTime:     <unset>
  RenewTime:       Tue, 27 May 2025 07:39:51 -0700
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Tue, 27 May 2025 07:38:12 -0700   Mon, 26 May 2025 01:00:20 -0700   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Tue, 27 May 2025 07:38:12 -0700   Mon, 26 May 2025 01:00:20 -0700   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Tue, 27 May 2025 07:38:12 -0700   Mon, 26 May 2025 01:00:20 -0700   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Tue, 27 May 2025 07:38:12 -0700   Mon, 26 May 2025 01:02:14 -0700   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  172.16.30.22
  Hostname:    hub-ctlplane-2.5g-deployment.lab
Capacity:
  cpu:                40
  ephemeral-storage:  313894128Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             72030136Ki
  pods:               250
Allocatable:
  cpu:                39500m
  ephemeral-storage:  288211086062
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             70879160Ki
  pods:               250
System Info:
  Machine ID:                                       574ceed884774bb48a565bf18c817135
  System UUID:                                      574ceed8-8477-4bb4-8a56-5bf18c817135
  Boot ID:                                          738c675a-9a86-479f-9a7e-d53c49f9bb5c
  Kernel Version:                                   5.14.0-427.64.1.el9_4.x86_64
  OS Image:                                         Red Hat Enterprise Linux CoreOS 418.94.202504080525-0
  Operating System:                                 linux
  Architecture:                                     amd64
  Container Runtime Version:                        cri-o://1.31.7-2.rhaos4.18.git83d6749.el9
  Kubelet Version:                                  v1.31.7
  Kube-Proxy Version:                               v1.31.7
Non-terminated Pods:                                (91 in total)
  Namespace                                         Name                                                               CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                                         ----                                                               ------------  ----------  ---------------  -------------  ---
  kcli-infra                                        coredns-hub-ctlplane-2.5g-deployment.lab                           150m (0%)     0 (0%)      512Mi (0%)       0 (0%)         30h
  kcli-infra                                        haproxy-hub-ctlplane-2.5g-deployment.lab                           150m (0%)     0 (0%)      512Mi (0%)       0 (0%)         30h
  kcli-infra                                        keepalived-hub-ctlplane-2.5g-deployment.lab                        150m (0%)     0 (0%)      2Gi (2%)         0 (0%)         30h
  kcli-infra                                        mdns-hub-ctlplane-2.5g-deployment.lab                              150m (0%)     0 (0%)      1Gi (1%)         0 (0%)         30h
  openshift-adp                                     openshift-adp-controller-manager-55f97dd46d-9zxtk                  500m (1%)     1 (2%)      128Mi (0%)       512Mi (0%)     30h
  openshift-amq-streams                             amq-streams-cluster-operator-v2.9.0-2-76d5c4f596-xwcnm             200m (0%)     1 (2%)      384Mi (0%)       384Mi (0%)     143m
  openshift-apiserver-operator                      openshift-apiserver-operator-b9ff4697-l5vgb                        10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         30h
  openshift-apiserver                               apiserver-6cb9f4dd8-rxwb6                                          110m (0%)     0 (0%)      250Mi (0%)       0 (0%)         30h
  openshift-authentication-operator                 authentication-operator-576f97686b-qgm87                           20m (0%)      0 (0%)      200Mi (0%)       0 (0%)         30h
  openshift-authentication                          oauth-openshift-89b7f44b5-d69p2                                    10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         30h
  openshift-cloud-credential-operator               cloud-credential-operator-59f7476c4c-ggjmx                         20m (0%)      0 (0%)      40Mi (0%)        0 (0%)         30h
  openshift-cluster-machine-approver                machine-approver-7ffb5f77dd-bwjhk                                  20m (0%)      0 (0%)      70Mi (0%)        0 (0%)         30h
  openshift-cluster-node-tuning-operator            cluster-node-tuning-operator-7c664f7594-snf9z                      10m (0%)      0 (0%)      20Mi (0%)        0 (0%)         30h
  openshift-cluster-node-tuning-operator            tuned-795z5                                                        10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         30h
  openshift-cluster-olm-operator                    cluster-olm-operator-6dbfcc698-5jmq8                               10m (0%)      0 (0%)      20Mi (0%)        0 (0%)         30h
  openshift-cluster-storage-operator                cluster-storage-operator-558748b75f-8zs5k                          10m (0%)      0 (0%)      20Mi (0%)        0 (0%)         30h
  openshift-cluster-storage-operator                csi-snapshot-controller-7869c57f49-g6m5q                           10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         146m
  openshift-cluster-storage-operator                csi-snapshot-controller-operator-7fb4d7b57c-fqhsc                  10m (0%)      0 (0%)      65Mi (0%)        0 (0%)         30h
  openshift-config-operator                         openshift-config-operator-66c89dddbc-s9267                         10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         30h
  openshift-console                                 console-b7bf68596-6ft5l                                            10m (0%)      0 (0%)      100Mi (0%)       0 (0%)         143m
  openshift-console                                 downloads-76fc98c7c9-c25sd                                         10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         30h
  openshift-controller-manager-operator             openshift-controller-manager-operator-796c59f564-z5qk5             10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         30h
  openshift-controller-manager                      controller-manager-85c8cbd6f9-b5rkn                                100m (0%)     0 (0%)      100Mi (0%)       0 (0%)         6h44m
  openshift-dns-operator                            dns-operator-5b4c468d8c-58bvg                                      20m (0%)      0 (0%)      69Mi (0%)        0 (0%)         30h
  openshift-dns                                     dns-default-dhc75                                                  60m (0%)      0 (0%)      110Mi (0%)       0 (0%)         30h
  openshift-dns                                     node-resolver-p7ch7                                                5m (0%)       0 (0%)      21Mi (0%)        0 (0%)         30h
  openshift-etcd-operator                           etcd-operator-67f458dbd6-f6ppq                                     10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         30h
  openshift-etcd                                    etcd-guard-hub-ctlplane-2.5g-deployment.lab                        10m (0%)      0 (0%)      5Mi (0%)         0 (0%)         30h
  openshift-etcd                                    etcd-hub-ctlplane-2.5g-deployment.lab                              370m (0%)     0 (0%)      960Mi (1%)       0 (0%)         30h
  openshift-gitops                                  cluster-84d59f6c79-rk22g                                           250m (0%)     500m (1%)   128Mi (0%)       256Mi (0%)     30h
  openshift-gitops                                  gitops-plugin-9c746b4cb-mkj6s                                      250m (0%)     500m (1%)   128Mi (0%)       256Mi (0%)     30h
  openshift-gitops                                  openshift-gitops-applicationset-controller-6b7c978dc7-82vj6        250m (0%)     2 (5%)      512Mi (0%)       1Gi (1%)       30h
  openshift-gitops                                  openshift-gitops-dex-server-9d98b46b6-4lrvl                        250m (0%)     500m (1%)   128Mi (0%)       256Mi (0%)     143m
  openshift-gitops                                  openshift-gitops-repo-server-68c96b4977-t8k9z                      1 (2%)        8 (20%)     2Gi (2%)         16Gi (23%)     30h
  openshift-image-registry                          cluster-image-registry-operator-6897b4cb5b-k5gwz                   10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         30h
  openshift-image-registry                          node-ca-pb9n5                                                      10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         30h
  openshift-ingress-canary                          ingress-canary-x7z48                                               10m (0%)      0 (0%)      20Mi (0%)        0 (0%)         30h
  openshift-ingress-operator                        ingress-operator-c454b9c66-x5wdm                                   20m (0%)      0 (0%)      96Mi (0%)        0 (0%)         30h
  openshift-ingress                                 router-default-6f79bff9ff-vzr7p                                    100m (0%)     0 (0%)      256Mi (0%)       0 (0%)         143m
  openshift-insights                                insights-operator-687564db7-ps55h                                  10m (0%)      0 (0%)      54Mi (0%)        0 (0%)         30h
  openshift-kube-apiserver-operator                 kube-apiserver-operator-758dcb88c8-sglh8                           10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         30h
  openshift-kube-apiserver                          kube-apiserver-guard-hub-ctlplane-2.5g-deployment.lab              10m (0%)      0 (0%)      5Mi (0%)         0 (0%)         30h
  openshift-kube-apiserver                          kube-apiserver-hub-ctlplane-2.5g-deployment.lab                    290m (0%)     0 (0%)      1224Mi (1%)      0 (0%)         6h36m
  openshift-kube-controller-manager-operator        kube-controller-manager-operator-685cdfb464-749sz                  10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         30h
  openshift-kube-controller-manager                 kube-controller-manager-guard-hub-ctlplane-2.5g-deployment.lab     10m (0%)      0 (0%)      5Mi (0%)         0 (0%)         30h
  openshift-kube-controller-manager                 kube-controller-manager-hub-ctlplane-2.5g-deployment.lab           80m (0%)      0 (0%)      500Mi (0%)       0 (0%)         30h
  openshift-kube-scheduler-operator                 openshift-kube-scheduler-operator-7b8944d9d4-gv9df                 10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         30h
  openshift-kube-scheduler                          openshift-kube-scheduler-guard-hub-ctlplane-2.5g-deployment.lab    10m (0%)      0 (0%)      5Mi (0%)         0 (0%)         30h
  openshift-kube-scheduler                          openshift-kube-scheduler-hub-ctlplane-2.5g-deployment.lab          25m (0%)      0 (0%)      150Mi (0%)       0 (0%)         30h
  openshift-kube-storage-version-migrator-operator  kube-storage-version-migrator-operator-589978f5dc-2ff5z            10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         30h
  openshift-local-storage                           diskmaker-manager-p2xlk                                            20m (0%)      0 (0%)      70Mi (0%)        0 (0%)         30h
  openshift-local-storage                           local-storage-operator-85d9c5c6-9z5dv                              10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         30h
  openshift-machine-api                             cluster-autoscaler-operator-74c7c965c5-xk4gz                       30m (0%)      0 (0%)      70Mi (0%)        0 (0%)         30h
  openshift-machine-api                             cluster-baremetal-operator-867d57586-xxbtc                         20m (0%)      0 (0%)      70Mi (0%)        0 (0%)         30h
  openshift-machine-api                             control-plane-machine-set-operator-f965d496c-ngb67                 10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         30h
  openshift-machine-api                             ironic-proxy-stm5m                                                 5m (0%)       0 (0%)      50Mi (0%)        0 (0%)         146m
  openshift-machine-api                             machine-api-operator-d8fc99dff-t4vtx                               20m (0%)      0 (0%)      70Mi (0%)        0 (0%)         30h
  openshift-machine-config-operator                 kube-rbac-proxy-crio-hub-ctlplane-2.5g-deployment.lab              20m (0%)      0 (0%)      50Mi (0%)        0 (0%)         30h
  openshift-machine-config-operator                 machine-config-daemon-z2788                                        40m (0%)      0 (0%)      100Mi (0%)       0 (0%)         30h
  openshift-machine-config-operator                 machine-config-operator-85bb75494f-bqhsh                           40m (0%)      0 (0%)      100Mi (0%)       0 (0%)         30h
  openshift-machine-config-operator                 machine-config-server-pkr5m                                        20m (0%)      0 (0%)      50Mi (0%)        0 (0%)         30h
  openshift-marketplace                             marketplace-operator-9948777d4-b24tz                               1m (0%)       0 (0%)      5Mi (0%)         0 (0%)         30h
  openshift-monitoring                              cluster-monitoring-operator-67b9f49f97-m4p2k                       10m (0%)      0 (0%)      75Mi (0%)        0 (0%)         30h
  openshift-monitoring                              metrics-server-68d4d68b5-vtghq                                     1m (0%)       0 (0%)      40Mi (0%)        0 (0%)         143m
  openshift-monitoring                              monitoring-plugin-675547f6b9-4z99q                                 10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         143m
  openshift-monitoring                              node-exporter-z6bfh                                                9m (0%)       0 (0%)      47Mi (0%)        0 (0%)         30h
  openshift-monitoring                              prometheus-operator-admission-webhook-795b5cdd7f-tscsh             5m (0%)       0 (0%)      30Mi (0%)        0 (0%)         143m
  openshift-monitoring                              thanos-querier-fcb87c979-g7f5j                                     15m (0%)      0 (0%)      87Mi (0%)        0 (0%)         30h
  openshift-multus                                  multus-5pclh                                                       10m (0%)      0 (0%)      65Mi (0%)        0 (0%)         30h
  openshift-multus                                  multus-additional-cni-plugins-rcr58                                10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         30h
  openshift-multus                                  multus-admission-controller-5bd89d9df-7j5c8                        20m (0%)      0 (0%)      70Mi (0%)        0 (0%)         143m
  openshift-multus                                  network-metrics-daemon-skgmz                                       20m (0%)      0 (0%)      120Mi (0%)       0 (0%)         30h
  openshift-network-console                         networking-console-plugin-776976c9c8-qrtgx                         10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         30h
  openshift-network-diagnostics                     network-check-target-ccvk4                                         10m (0%)      0 (0%)      15Mi (0%)        0 (0%)         30h
  openshift-network-node-identity                   network-node-identity-d5tl4                                        20m (0%)      0 (0%)      100Mi (0%)       0 (0%)         30h
  openshift-network-operator                        iptables-alerter-8fmtd                                             10m (0%)      10m (0%)    65Mi (0%)        0 (0%)         30h
  openshift-network-operator                        network-operator-69cffcb848-w4b48                                  10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         30h
  openshift-oauth-apiserver                         apiserver-586d946cb8-khgr4                                         150m (0%)     0 (0%)      200Mi (0%)       0 (0%)         30h
  openshift-operator-controller                     operator-controller-controller-manager-7c845b4b6-94mg4             15m (0%)      0 (0%)      128Mi (0%)       0 (0%)         146m
  openshift-operator-lifecycle-manager              catalog-operator-c85bfcd98-zd9md                                   10m (0%)      0 (0%)      80Mi (0%)        0 (0%)         30h
  openshift-operator-lifecycle-manager              olm-operator-5df758fdf-2lhlh                                       10m (0%)      0 (0%)      160Mi (0%)       0 (0%)         30h
  openshift-operator-lifecycle-manager              package-server-manager-5fb67f9466-s29sf                            20m (0%)      0 (0%)      30Mi (0%)        0 (0%)         30h
  openshift-operator-lifecycle-manager              packageserver-dd856455b-n8cxq                                      10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         146m
  openshift-operators                               cluster-group-upgrades-controller-manager-v2-6fcb8695bf-b2npg      100m (0%)     0 (0%)      20Mi (0%)        0 (0%)         30h
  openshift-operators                               openshift-gitops-operator-controller-manager-58757c9cc-4qh2p       1m (0%)       500m (1%)   15Mi (0%)        128Mi (0%)     30h
  openshift-ovn-kubernetes                          ovnkube-control-plane-56584d9f65-dzllk                             20m (0%)      0 (0%)      320Mi (0%)       0 (0%)         30h
  openshift-ovn-kubernetes                          ovnkube-node-p6knw                                                 80m (0%)      0 (0%)      1630Mi (2%)      0 (0%)         30h
  openshift-route-controller-manager                route-controller-manager-8448b4cb88-cgzrs                          100m (0%)     0 (0%)      100Mi (0%)       0 (0%)         6h44m
  openshift-service-ca-operator                     service-ca-operator-6448b956f6-8xd92                               10m (0%)      0 (0%)      80Mi (0%)        0 (0%)         30h
  openshift-storage                                 odf-console-d497c4785-mzg56                                        100m (0%)     100m (0%)   512Mi (0%)       512Mi (0%)     143m
  quay-operator                                     quay-operator.v3.13.5-6464876748-s2bm6                             0 (0%)        0 (0%)      0 (0%)           0 (0%)         143m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests       Limits
  --------           --------       ------
  cpu                5822m (14%)    14110m (35%)
  memory             17511Mi (25%)  19712Mi (28%)
  ephemeral-storage  0 (0%)         0 (0%)
  hugepages-1Gi      0 (0%)         0 (0%)
  hugepages-2Mi      0 (0%)         0 (0%)
Events:
  Type    Reason          Age   From             Message
  ----    ------          ----  ----             -------
  Normal  RegisteredNode  149m  node-controller  Node hub-ctlplane-2.5g-deployment.lab event: Registered Node hub-ctlplane-2.5g-deployment.lab in Controller

Also, it looks like Kubernetes defines the following:

When problems occur on nodes, the Kubernetes control plane automatically creates taints that match the conditions affecting the node. An example of this is when the status of the Ready condition remains Unknown or False for longer than the kube-controller-manager's NodeMonitorGracePeriod, which defaults to 50 seconds. This will cause either a node.kubernetes.io/unreachable taint, for an Unknown status, or a node.kubernetes.io/not-ready taint, for a False status, to be added to the Node.

references:

My understanding is that these taints are managed by Kubernetes itself, at the system level, rather than set manually or managed by an administrator.
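The rule quoted above can be sketched as a small mapping from the status of the Ready condition to the taint key the control plane would add. This is only an illustration: `conditionStatus` and `taintKeyForReady` are hypothetical local names, not Kubernetes APIs, and the sketch ignores the grace period that the real node lifecycle controller applies before tainting.

```go
package main

import "fmt"

// conditionStatus is a local stand-in for the three values a node
// condition can take (hypothetical type, not the corev1 one).
type conditionStatus string

const (
	conditionTrue    conditionStatus = "True"
	conditionFalse   conditionStatus = "False"
	conditionUnknown conditionStatus = "Unknown"
)

// Taint keys named in the quoted documentation.
const (
	unreachableTaintKey = "node.kubernetes.io/unreachable"
	notReadyTaintKey    = "node.kubernetes.io/not-ready"
)

// taintKeyForReady returns the taint key that corresponds to the given
// status of the Ready condition, or "" when the node is Ready.
func taintKeyForReady(status conditionStatus) string {
	switch status {
	case conditionUnknown:
		return unreachableTaintKey
	case conditionFalse:
		return notReadyTaintKey
	default:
		return ""
	}
}

func main() {
	for _, s := range []conditionStatus{conditionTrue, conditionFalse, conditionUnknown} {
		fmt.Printf("Ready=%s -> taint %q\n", s, taintKeyForReady(s))
	}
}
```

The node described above reports Ready=True, which is consistent with its empty Taints field.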

@midu16
Author

midu16 commented May 27, 2025

What about extending the oc adm must-gather code with a warning saying "hey, the must-gather pod got scheduled to THIS node, which is tainted with node.kubernetes.io/unreachable. Maybe you'd like to schedule it to a different node to avoid the pod getting stuck. Here are some suggestions: NODES"? This will help in the user-initiated case, though not much when must-gather runs through automation. On the other hand, the automation might set a special flag to tell oc adm must-gather to be stricter in selecting which nodes are still "safe" to tolerate. So, as we probably suggested in one of the previous bug reports, it's preferable to have the user/automation provide a list of explicit tolerations if the current "tolerate everything" does not work well.

This has been implemented here: 93d98b5

midu16 added 2 commits May 27, 2025 18:19
…ted nodes

    - Prioritize nodes without 'unreachable' or 'not-ready' taints and with recent heartbeats
    - Fallback to tainted nodes if no healthy nodes exist
    - Print non-blocking warnings when scheduling to problematic nodes
    - All logic is derived dynamically from cluster state
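
The selection described in this commit message could look roughly like the sketch below. It is not the code from the commit: `node`, `isProblematic` and `selectNode` are hypothetical names, the `node` struct is a simplified stand-in for corev1.Node, and the heartbeat check is left out.

```go
package main

import "fmt"

const (
	unreachableTaintKey = "node.kubernetes.io/unreachable"
	notReadyTaintKey    = "node.kubernetes.io/not-ready"
)

// node is a simplified, hypothetical stand-in for corev1.Node.
type node struct {
	Name      string
	Ready     bool     // status of the Ready condition
	TaintKeys []string // keys of the taints currently on the node
}

// isProblematic reports whether the node is not Ready or carries one of
// the taints the control plane adds for unhealthy nodes.
func isProblematic(n node) bool {
	if !n.Ready {
		return true
	}
	for _, k := range n.TaintKeys {
		if k == unreachableTaintKey || k == notReadyTaintKey {
			return true
		}
	}
	return false
}

// selectNode returns the first healthy node. If there is none, it falls
// back to the first problematic node and returns a non-blocking warning.
// The returned node is nil only when the list is empty.
func selectNode(nodes []node) (chosen *node, warning string) {
	var fallback *node
	for i := range nodes {
		if !isProblematic(nodes[i]) {
			return &nodes[i], ""
		}
		if fallback == nil {
			fallback = &nodes[i]
		}
	}
	if fallback != nil {
		warning = fmt.Sprintf("warning: no healthy node found, falling back to %s", fallback.Name)
	}
	return fallback, warning
}

func main() {
	nodes := []node{
		{Name: "node-a", Ready: false, TaintKeys: []string{notReadyTaintKey}},
		{Name: "node-b", Ready: true},
	}
	if n, warn := selectNode(nodes); n != nil {
		fmt.Println("selected:", n.Name, warn)
	}
}
```

Returning the warning as a value rather than printing it inside the function keeps the selection testable and leaves it to the caller to decide how to surface the message.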
TolerationLoop:
	for _, tol := range tolerations {
		for _, excluded := range excludedTaints {
			if tol.ToleratesTaint(&excluded) {
Member

Still unanswered

@midu16
Author

midu16 commented Jun 17, 2025

/retest

Contributor

openshift-ci bot commented Jun 17, 2025

@midu16: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                     Commit   Details  Required  Rerun command
ci/prow/okd-scos-e2e-aws-ovn  516c4a6  link     false     /test okd-scos-e2e-aws-ovn
ci/prow/e2e-aws-ovn-upgrade   516c4a6  link     true      /test e2e-aws-ovn-upgrade

@ingvagabund
Member

ingvagabund commented Jun 18, 2025

Running the added code locally for filtering the taints (with hasMaster==false):

func TestExcludeTolerations(t *testing.T) {
	excludedTaints := []corev1.Taint{
		{Key: unreachableTaintKey, Effect: corev1.TaintEffectNoExecute},
		{Key: notReadyTaintKey, Effect: corev1.TaintEffectNoSchedule},
	}

	candidateTolerations := []corev1.Toleration{tolerationNotReady}

	filteredTolerations := make([]corev1.Toleration, 0, len(candidateTolerations))
TolerationLoop:
	for _, tol := range candidateTolerations {
		for _, excluded := range excludedTaints {
			if tol.ToleratesTaint(&excluded) {
				// Skip this toleration if it tolerates an excluded taint
				continue TolerationLoop
			}
		}
		filteredTolerations = append(filteredTolerations, tol)
	}
	fmt.Printf("filteredTolerations: %#v\n", filteredTolerations)
}

If a cluster has no masters, the list of filteredTolerations will be empty, which is (or might be) the case on hypershift clusters (#1347). That makes must-gather less resilient than it is now.

This change ensures that workloads do not unintentionally tolerate unreachable nodes while still allowing necessary tolerations.

This invalidates the original intention of still allowing the necessary tolerations.
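
For readers without the oc tree at hand, the same experiment can be reproduced with the standard library only. The sketch below makes two assumptions: `toleratesTaint` is a simplified local re-implementation of the matching rules of corev1's ToleratesTaint (effect, then key, then operator), and the not-ready toleration is taken to be key node.kubernetes.io/not-ready, operator Exists, effect NoSchedule, since the definition of tolerationNotReady is not shown in this thread. `taint`, `toleration` and `filterTolerations` are hypothetical local names.

```go
package main

import "fmt"

const (
	unreachableTaintKey = "node.kubernetes.io/unreachable"
	notReadyTaintKey    = "node.kubernetes.io/not-ready"
)

// taint and toleration are local stand-ins for the corev1 types.
type taint struct{ Key, Value, Effect string }
type toleration struct{ Key, Operator, Value, Effect string }

// toleratesTaint mirrors the matching rules in simplified form: an empty
// effect or key on the toleration matches anything, and an empty
// operator is treated as Equal.
func (t toleration) toleratesTaint(tn taint) bool {
	if t.Effect != "" && t.Effect != tn.Effect {
		return false
	}
	if t.Key != "" && t.Key != tn.Key {
		return false
	}
	switch t.Operator {
	case "", "Equal":
		return t.Value == tn.Value
	case "Exists":
		return true
	default:
		return false
	}
}

// filterTolerations drops every candidate that tolerates at least one
// of the excluded taints, as in the test above.
func filterTolerations(candidates []toleration, excluded []taint) []toleration {
	filtered := make([]toleration, 0, len(candidates))
candidateLoop:
	for _, tol := range candidates {
		for _, ex := range excluded {
			if tol.toleratesTaint(ex) {
				continue candidateLoop
			}
		}
		filtered = append(filtered, tol)
	}
	return filtered
}

func main() {
	notReady := toleration{Key: notReadyTaintKey, Operator: "Exists", Effect: "NoSchedule"}

	// Excluded list as used in the test above.
	asTested := []taint{
		{Key: unreachableTaintKey, Effect: "NoExecute"},
		{Key: notReadyTaintKey, Effect: "NoSchedule"},
	}
	// Excluded list as worded in the PR description (unreachable only).
	unreachableOnly := []taint{
		{Key: unreachableTaintKey, Effect: "NoExecute"},
		{Key: unreachableTaintKey, Effect: "NoSchedule"},
	}

	fmt.Println(len(filterTolerations([]toleration{notReady}, asTested)))        // prints 0
	fmt.Println(len(filterTolerations([]toleration{notReady}, unreachableOnly))) // prints 1
}
```

Under these assumptions the not-ready toleration only survives when the excluded list contains nothing but unreachable taints, which is what the PR description asks for.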

for _, cond := range node.Status.Conditions {
	if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
		// Check if heartbeat is recent (less than 2m old)
		if time.Since(cond.LastHeartbeatTime.Time) < 2*time.Minute {
Member

Note: This assumes that the clock on the host where oc adm must-gather runs is in sync with the nodes, which might not always be the case.
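
One possible way to avoid depending on the local clock, sketched here purely as an illustration and not as something proposed in this PR, is to compare each heartbeat against the newest heartbeat seen across the listed nodes instead of against time.Since. `freshNodes`, `sampleHeartbeats` and `maxHeartbeatAge` are hypothetical names. The approach has limits of its own: heartbeat timestamps come from the nodes, so skew between nodes still matters, and if every node is stale the newest one still counts as fresh.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// maxHeartbeatAge matches the 2-minute threshold used in the diff above.
const maxHeartbeatAge = 2 * time.Minute

// freshNodes returns the sorted names of nodes whose last heartbeat is
// within maxAge of the newest heartbeat among all nodes. The local
// wall clock is never consulted.
func freshNodes(heartbeats map[string]time.Time, maxAge time.Duration) []string {
	var newest time.Time
	for _, hb := range heartbeats {
		if hb.After(newest) {
			newest = hb
		}
	}
	var fresh []string
	for name, hb := range heartbeats {
		if newest.Sub(hb) < maxAge {
			fresh = append(fresh, name)
		}
	}
	sort.Strings(fresh)
	return fresh
}

// sampleHeartbeats builds made-up data for the example.
func sampleHeartbeats() map[string]time.Time {
	newest := time.Date(2025, 5, 27, 14, 38, 12, 0, time.UTC)
	return map[string]time.Time{
		"node-a": newest,
		"node-b": newest.Add(-30 * time.Second),
		"node-c": newest.Add(-10 * time.Minute),
	}
}

func main() {
	fmt.Println(freshNodes(sampleHeartbeats(), maxHeartbeatAge)) // prints [node-a node-b]
}
```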

@@ -462,6 +514,10 @@ func (o *MustGatherOptions) Run() error {
	nodes, err := o.Client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{
		LabelSelector: o.NodeSelector,
	})
	if err != nil {
Member

Duplicated code

@@ -401,6 +417,42 @@ func (o *MustGatherOptions) Validate() error {
	return nil
}

// prioritizeHealthyNodes returns a preferred node to run the must-gather pod on, and a fallback node if no preferred node is found.
func prioritizeHealthyNodes(nodes *corev1.NodeList) (preferred *corev1.Node, fallback *corev1.Node) {
Member

prioritizeHealthyNodes is currently unused in the code
