
CAPI logs filled with error messages if no machine deployments/pools exist and ControlPlane does not implement v1beta2 conditions #11820

Open · jakefhyde opened this issue Feb 7, 2025 · 5 comments
Labels: area/conditions · help wanted · kind/bug · priority/important-soon · triage/accepted

jakefhyde (Contributor) commented Feb 7, 2025

What steps did you take and what happened?

Clusters that do not use MachineDeployments or MachinePools (for example, clusters whose Machines are created manually) cause the capi-controller-manager to write error logs endlessly whenever the cluster status is updated. The capi-controller-manager shows the following logs:

E0207 14:31:48.116038       1 cluster_controller_status.go:838] "Failed to aggregate ControlPlane, MachinePool, MachineDeployment's RollingOut conditions" err="sourceObjs can't be empty" controller="cluster" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="<namespace/cluster>" namespace="<namespace>" name="<name>" reconcileID="71e43209-4e27-41d9-8470-f1c1c33901c1"
E0207 14:31:48.116068       1 cluster_controller_status.go:915] "Failed to aggregate ControlPlane, MachinePool, MachineDeployment, MachineSet's ScalingUp conditions" err="sourceObjs can't be empty" controller="cluster" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="<namespace/cluster>" namespace="<namespace>" name="<name>" reconcileID="71e43209-4e27-41d9-8470-f1c1c33901c1"
E0207 14:31:48.116081       1 cluster_controller_status.go:992] "Failed to aggregate ControlPlane, MachinePool, MachineDeployment, MachineSet's ScalingDown conditions" err="sourceObjs can't be empty" controller="cluster" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="<namespace/cluster>" namespace="<namespace>" name="<name>" reconcileID="71e43209-4e27-41d9-8470-f1c1c33901c1"

The relevant code can be found here:

We are able to work around this by explicitly setting the conditions to False so that they are present.
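As a sketch of that workaround (types and names here are simplified stand-ins, not the real Cluster API types; in practice this would set metav1.Conditions on the ControlPlane's v1beta2 status), a controller can pre-populate the missing condition types with status False so the aggregation has sources to consume:

```go
package main

import "fmt"

// Condition is a simplified stand-in for metav1.Condition.
type Condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
	Reason string
}

// ensureV1Beta2Conditions appends any missing condition types with status
// False, so that the cluster controller's aggregation no longer fails with
// "sourceObjs can't be empty".
func ensureV1Beta2Conditions(existing []Condition, types []string) []Condition {
	present := map[string]bool{}
	for _, c := range existing {
		present[c.Type] = true
	}
	for _, t := range types {
		if !present[t] {
			existing = append(existing, Condition{
				Type:   t,
				Status: "False",
				Reason: "NotApplicable", // hypothetical reason string
			})
		}
	}
	return existing
}

func main() {
	conds := ensureV1Beta2Conditions(nil, []string{"RollingOut", "ScalingUp", "ScalingDown"})
	for _, c := range conds {
		fmt.Printf("%s=%s\n", c.Type, c.Status)
	}
}
```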

What did you expect to happen?

My expectation is that, during the v1beta1 -> v1beta2 migration, the aggregated condition is set to Unknown when the source conditions are not present, rather than the aggregation failing with an error.
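A minimal sketch of that expectation (simplified; the real aggregation lives in cluster_controller_status.go and operates on typed objects, not strings): when there are no source objects to aggregate, fall back to Unknown instead of returning an error:

```go
package main

import "fmt"

// aggregateStatus mimics the shape of condition aggregation: it merges the
// statuses of the source objects, but instead of failing with
// "sourceObjs can't be empty" it falls back to "Unknown" when there is
// nothing to aggregate (e.g. no MachineDeployments, no MachinePools, and a
// ControlPlane that does not implement v1beta2 conditions).
func aggregateStatus(sourceStatuses []string) string {
	if len(sourceStatuses) == 0 {
		return "Unknown"
	}
	// Worst-status-wins merge: any True (e.g. ScalingUp=True) dominates,
	// otherwise Unknown dominates False.
	agg := "False"
	for _, s := range sourceStatuses {
		switch s {
		case "True":
			return "True"
		case "Unknown":
			agg = "Unknown"
		}
	}
	return agg
}

func main() {
	fmt.Println(aggregateStatus(nil))                       // no sources: Unknown
	fmt.Println(aggregateStatus([]string{"False", "True"})) // True
}
```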

Cluster API version

v1.9.4

Kubernetes version

v1.30.2

Anything else you would like to add?

We implement bring-your-own-host-style provisioning (not related to the BYOH provider) in which users can register nodes freely, leaving lifecycle management to the user. Although this is a less common provisioning model, I imagine it could also affect clusters using control plane providers that have not yet implemented v1beta2 conditions and that provision machines via manual definition.

Label(s) to be applied

/kind bug
/area conditions

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. area/conditions Issues or PRs related to conditions needs-priority Indicates an issue lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 7, 2025
@jakefhyde jakefhyde changed the title Cluster fails to provision if no machine deployments/pools exist and ControlPlane does not implement v1beta2 conditions CAPI logs filled with error messages if no machine deployments/pools exist and ControlPlane does not implement v1beta2 conditions Feb 7, 2025
jakefhyde (Contributor, Author) commented

Apologies for the rename, there was a little confusion on my end. We have a test case where the etcd plane is scaled to 0, a new etcd machine is created, and we perform an etcd restore on top of that. These logs were being printed endlessly, so I had erroneously assumed they were related. Although we skip draining with the machine.cluster.x-k8s.io/exclude-node-draining annotation, we weren't doing the same for volume detachment. I think this was fine previously purely by happy accident, and the addition of alwaysReconcile helped expose this race condition.

That all being said, would it be possible to lower the log level for those messages until v1beta2 goes live? I'm content to just leave the conditions on there for now, otherwise it fills the logs and makes debugging quite difficult.

chrischdi (Member) commented

/triage accepted
/priority important-soon

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates an issue lacks a `priority/foo` label and requires one. labels Feb 19, 2025
chrischdi (Member) commented

/help

k8s-ci-robot (Contributor) commented

@chrischdi:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Feb 19, 2025
fabriziopandini (Member) commented

Note: I will try to take a look at this in the context of the work I'm doing for #11474, but I'm not sure if or when I will get to it (if someone else wants to take care of this before me, feel free to do so!)
