
ClusterClass Upgrade Fails Due to ManagedFields Behavior Change in CAPI v1.6 → v1.8 #11857

Open
okozachenko1203 opened this issue Feb 17, 2025 · 8 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@okozachenko1203

What steps did you take and what happened?

I encountered an issue when upgrading a Cluster to use a new ClusterClass after migrating from CAPI v1.6 → v1.8. The upgrade fails because old variables persist in spec.topology.variables, even though they should have been removed by Server-Side Apply (SSA).

Steps to Reproduce

  1. Create an initial ClusterClass (oldClass)

    • This ClusterClass contains a required variable: oldVariable.
  2. Create a Cluster using oldClass

    • The cluster’s spec.topology.variables includes oldVariable.
  3. Create a new ClusterClass (newClass)

    • newClass has a required variable newVariable instead of oldVariable.
    • Only the name differs; the format and value remain the same.
  4. Upgrade the cluster to use newClass via SSA patch

    • Expected result: oldVariable should be removed, and newVariable should be added.
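
For reference, here is a minimal sketch of the manifests involved in the steps above. The variable names come from the steps; the schema, values, and object names are placeholders, and other topology fields (controlPlane, workers, etc.) are omitted.

    # oldClass: contains the required variable "oldVariable"
    apiVersion: cluster.x-k8s.io/v1beta1
    kind: ClusterClass
    metadata:
      name: oldclass
    spec:
      variables:
      - name: oldVariable
        required: true
        schema:
          openAPIV3Schema:
            type: string
    ---
    # Cluster created against oldClass; the upgrade switches spec.topology.class
    # to "newclass" and replaces oldVariable with newVariable.
    apiVersion: cluster.x-k8s.io/v1beta1
    kind: Cluster
    metadata:
      name: example-cluster
    spec:
      topology:
        class: oldclass
        version: v1.28.3
        variables:
        - name: oldVariable
          value: "example"

newClass is identical except that its required variable is named newVariable, and the upgraded Cluster manifest lists only newVariable under spec.topology.variables.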

Scenarios Tested

Scenario 1: Running on CAPI v1.6.x

Deployment Versions:

NAMESPACE                           NAME                       VERSION  
capi-kubeadm-bootstrap-system       bootstrap-kubeadm          v1.6.0  
capi-kubeadm-control-plane-system   control-plane-kubeadm      v1.6.0  
capi-system                         cluster-api                v1.6.0  
capo-system                         infrastructure-openstack   v0.9.0  

Result:

  • The upgrade works correctly.
  • After upgrading the ClusterClass, only newVariable exists, and oldVariable is removed.

Scenario 2: Running on CAPI v1.8.x

Deployment Versions:

NAMESPACE                           NAME                       VERSION  
capi-kubeadm-bootstrap-system       bootstrap-kubeadm          v1.8.4  
capi-kubeadm-control-plane-system   control-plane-kubeadm      v1.8.4  
capi-system                         cluster-api                v1.8.4  
capo-system                         infrastructure-openstack   v0.11.2  

Result:

  • The upgrade works correctly.
  • After upgrading the ClusterClass, only newVariable exists, and oldVariable is removed.

Scenario 3: Upgrading from CAPI v1.6 → v1.8 and then upgrading ClusterClass

  1. Deploy Cluster using CAPI v1.6.0 (Scenario 1 setup).
  2. Upgrade CAPI to v1.8.4 (Scenario 2 setup).
  3. Attempt to upgrade Cluster to use newClass (SSA patch).
    • 🔴 The upgrade fails because oldVariable still exists, despite not being defined in newClass.

What did you expect to happen?

  • When upgrading a cluster to use a new ClusterClass, SSA should correctly remove old variables that are no longer part of the new ClusterClass.
  • This behavior should remain consistent across CAPI versions.

Cluster API version

The cluster-api versions are listed in the scenarios above.

Kubernetes version

I tested with these versions:
  • 1.25.x
  • 1.28.x

Anything else you would like to add?

What I Found

managedFields behaves differently between versions.

  • In CAPI v1.6.x (older versions), spec.topology.variables was tracked as a single field:
    f:variables: {}
  • In CAPI v1.8.x (newer versions), each variable in spec.topology.variables is individually tracked:
    f:variables:
      k:{"name":"apiServerTLSCipherSuites"}:
        .: {}
        f:name: {}
        f:value: {}

Hypothesis:

  • In CAPI v1.6, SSA does not track individual variables (the list is effectively atomic), so applying a manifest without a variable implicitly removes it from spec.topology.variables.
  • In CAPI v1.8, SSA tracks each variable separately, preventing removal if ownership conflicts exist.
  • This change breaks upgrades when transitioning from v1.6 to v1.8.
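
To make the hypothesis concrete, this is roughly what a full managedFields entry looks like for a Cluster whose variables are tracked per entry (i.e. created or re-applied under v1.8). The manager name and variable name are placeholders for whichever client applied the Cluster; managedFields can be inspected with kubectl get cluster <name> -o yaml --show-managed-fields.

    metadata:
      managedFields:
      - apiVersion: cluster.x-k8s.io/v1beta1
        fieldsType: FieldsV1
        manager: example-applier      # placeholder for the client that applied the Cluster
        operation: Apply
        fieldsV1:
          f:spec:
            f:topology:
              f:variables:
                k:{"name":"oldVariable"}:
                  .: {}
                  f:name: {}
                  f:value: {}

Under v1.6, the same entry stops at f:variables: {}, so the applier owns the list as a whole rather than the individual entries.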

Possible Causes

  1. Changes in SSA handling of spec.topology.variables between CAPI v1.6 and v1.8.
  2. Stricter managedFields tracking in newer versions.
  3. Potential ownership conflicts preventing removal of fields.

Label(s) to be applied

/kind bug

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-priority Indicates an issue lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 17, 2025
@mnaser

mnaser commented Feb 18, 2025

I wonder if this is similar to #11773

@mnaser

mnaser commented Feb 18, 2025

Indeed, this is what I found:

#10841

Specifically, this commit explicitly says:

Add "name" as a map key for variable arrays. This will allow separate ownership per variable

So since variables are now tracked individually, that's exactly what's happening. Now we need to figure out how to work around that.
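
In other words, the schema change boils down to roughly this difference on the spec.topology.variables list in the generated CRDs (a sketch, not the exact generated YAML):

    # CAPI v1.6 CRDs: no list-type markers on the variables list,
    # so server-side apply treats it as a single atomic field (the default for plain lists).
    variables:
      type: array
      items: {}           # variable schema omitted

    # CAPI v1.8 CRDs: the list is a map keyed by "name",
    # so ownership is tracked per variable.
    variables:
      type: array
      items: {}           # variable schema omitted
      x-kubernetes-list-map-keys:
      - name
      x-kubernetes-list-type: map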

@okozachenko1203
Author

After further investigation, I found that the change in managedFields behavior is not immediate after upgrading CAPI but rather occurs when an SSA (Server-Side Apply) update is performed.

Steps to Reproduce

  1. Deploy a Cluster using CAPI v1.6.x and CAPO v0.9.x

    • In managedFields, spec.topology.variables is not tracked at the level of individual sub-fields.
    • The managedFields entry looks like this:
      f:variables: {}
  2. Upgrade CAPI to v1.8.x and CAPO to v0.11.x

    • managedFields remains unchanged after the upgrade.
    • The cluster still behaves as if it was created under CAPI v1.6.
  3. Perform a no-op SSA apply on the cluster (apply with no changes; see the sketch after these steps)

    • I applied the same cluster manifest without any modifications.
    • After this apply, managedFields was updated, and the sub-fields of spec.topology.variables were now explicitly tracked:
      f:variables:
        k:{"name":"apiServerTLSCipherSuites"}:
          .: {}
          f:name: {}
          f:value: {}
    • This mirrors the behavior of clusters created in CAPI v1.8.x.
  4. After this no-op SSA apply (which refreshed managedFields), the ClusterClass upgrade worked successfully.

    • oldVariable was removed, and newVariable was applied correctly.
    • The upgrade failure observed earlier no longer occurs.
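
For clarity, the no-op SSA apply from step 3 is nothing more than re-applying the unchanged Cluster manifest server-side, along these lines (manifest and field manager are placeholders; adjust to whatever originally applied the Cluster):

    # e.g. kubectl apply --server-side --field-manager=<original manager> -f cluster.yaml
    # No fields change, but the apiserver rewrites this applier's managedFields entry
    # against the new schema, switching f:variables from atomic to per-variable keys.
    apiVersion: cluster.x-k8s.io/v1beta1
    kind: Cluster
    metadata:
      name: example-cluster
    spec:
      topology:
        class: oldclass
        version: v1.28.3
        variables:
        - name: oldVariable
          value: "example"

After that, switching spec.topology.class to newclass and replacing oldVariable with newVariable in a subsequent apply removes the old entry as expected.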

@sbueringer
Member

I think #11773 is different and I would keep it out of this issue to avoid confusion.

Yes we intentionally added "name" as a map key to allow tracking separate ownership per variable.

I'm not sure if there is something that we can do within Cluster API. This seems to be due to how the kube-apiserver handles schema changes in managedFields. I wonder if more recent kube-apiserver versions might have improved or changed that behavior though (very likely not).

@sbueringer
Member

sbueringer commented Feb 18, 2025

Please see the discussion here in Slack: https://kubernetes.slack.com/archives/C0EG7JC6T/p1738861392254669

It's not exactly the same topic, but it's potentially relevant. See also the linked PR here:

This might be relevant here: kubernetes-sigs/structured-merge-diff#170
That PR taught SSA how to handle changes in atomicity as best it can. It's not feasible to make such a change backward compatible, but that PR can be useful in some circumstances.

(not sure in which Kubernetes version that landed)

@mnaser

mnaser commented Feb 18, 2025

@sbueringer does this mean we're missing +structType=atomic here, by any chance, and that might fix this?

EDIT: I might be wrong; structType=atomic would be counter-intuitive since you want to manage all the variables independently?

@sbueringer
Member

sbueringer commented Feb 18, 2025

No. We intentionally moved from atomic to

                    x-kubernetes-list-map-keys:
                    - name
                    x-kubernetes-list-type: map

structType=atomic would be counter-intuitive since you want to manage all the variables independently?

exactly

The PR I linked above might have improved some of these cases, i.e. after that PR the apiserver handles schema transitions better.

@chrischdi
Member

/priority important-soon

At least for documenting the issue.

For the other points, let's see where the discussion leads.

/triage accepted

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-priority Indicates an issue lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 19, 2025