
ClusterClass Upgrade Fails Due to ManagedFields Behavior Change in CAPI v1.6 → v1.8 #11857

Open
okozachenko1203 opened this issue Feb 17, 2025 · 8 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@okozachenko1203

What steps did you take and what happened?

I encountered an issue when upgrading a Cluster to use a new ClusterClass after migrating from CAPI v1.6 → v1.8. The upgrade fails because old variables persist in spec.topology.variables, even though they should have been removed by Server-Side Apply (SSA).

Steps to Reproduce

  1. Create an initial ClusterClass (oldClass)

    • This ClusterClass contains a required variable: oldVariable.
  2. Create a Cluster using oldClass

    • The cluster’s spec.topology.variables includes oldVariable.
  3. Create a new ClusterClass (newClass)

    • newClass has a required variable newVariable instead of oldVariable.
    • Only the name differs; the format and value remain the same.
  4. Upgrade the cluster to use newClass via SSA patch

    • Expected result: oldVariable should be removed, and newVariable should be added.
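
For reference, here is a minimal sketch of the manifests involved in the steps above. The variable names come from the steps; the schema, values, and object names are placeholders, and other topology fields (controlPlane, workers, etc.) are omitted.

    # oldClass: contains the required variable "oldVariable"
    apiVersion: cluster.x-k8s.io/v1beta1
    kind: ClusterClass
    metadata:
      name: oldclass
    spec:
      variables:
      - name: oldVariable
        required: true
        schema:
          openAPIV3Schema:
            type: string
    ---
    # Cluster created against oldClass; the upgrade switches spec.topology.class
    # to "newclass" and replaces oldVariable with newVariable.
    apiVersion: cluster.x-k8s.io/v1beta1
    kind: Cluster
    metadata:
      name: example-cluster
    spec:
      topology:
        class: oldclass
        version: v1.28.3
        variables:
        - name: oldVariable
          value: "example"

newClass is identical except that its required variable is named newVariable, and the upgraded Cluster manifest lists only newVariable under spec.topology.variables.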

Scenarios Tested

Scenario 1: Running on CAPI v1.6.x

Deployment Versions:

NAMESPACE                           NAME                       VERSION  
capi-kubeadm-bootstrap-system       bootstrap-kubeadm          v1.6.0  
capi-kubeadm-control-plane-system   control-plane-kubeadm      v1.6.0  
capi-system                         cluster-api                v1.6.0  
capo-system                         infrastructure-openstack   v0.9.0  

Result:

  • The upgrade works correctly.
  • After upgrading the ClusterClass, only newVariable exists, and oldVariable is removed.

Scenario 2: Running on CAPI v1.8.x

Deployment Versions:

NAMESPACE                           NAME                       VERSION  
capi-kubeadm-bootstrap-system       bootstrap-kubeadm          v1.8.4  
capi-kubeadm-control-plane-system   control-plane-kubeadm      v1.8.4  
capi-system                         cluster-api                v1.8.4  
capo-system                         infrastructure-openstack   v0.11.2  

Result:

  • The upgrade works correctly.
  • After upgrading the ClusterClass, only newVariable exists, and oldVariable is removed.

Scenario 3: Upgrading from CAPI v1.6 → v1.8 and then upgrading ClusterClass

  1. Deploy Cluster using CAPI v1.6.0 (Scenario 1 setup).
  2. Upgrade CAPI to v1.8.4 (Scenario 2 setup).
  3. Attempt to upgrade Cluster to use newClass (SSA patch).
    • 🔴 The upgrade fails because oldVariable still exists, despite not being defined in newClass.

What did you expect to happen?

  • When upgrading a cluster to use a new ClusterClass, SSA should correctly remove old variables that are no longer part of the new ClusterClass.
  • This behavior should remain consistent across CAPI versions.

Cluster API version

The cluster-api versions are listed in the scenarios above.

Kubernetes version

I tested with these versions:
  • 1.25.x
  • 1.28.x

Anything else you would like to add?

What I Found

managedFields behaves differently between versions.

  • In CAPI v1.6.x (older versions), spec.topology.variables was tracked as a single field:
    f:variables: {}
  • In CAPI v1.8.x (newer versions), each variable in spec.topology.variables is individually tracked:
    f:variables:
      k:{"name":"apiServerTLSCipherSuites"}:
        .: {}
        f:name: {}
        f:value: {}

Hypothesis:

  • In CAPI v1.6, SSA does not track individual variables (the list is effectively atomic), so applying a manifest without a variable implicitly removes it from spec.topology.variables.
  • In CAPI v1.8, SSA tracks each variable separately, preventing removal if ownership conflicts exist.
  • This change breaks upgrades when transitioning from v1.6 to v1.8.
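
To make the hypothesis concrete, this is roughly what a full managedFields entry looks like for a Cluster whose variables are tracked per entry (i.e. created or re-applied under v1.8). The manager name and variable name are placeholders for whichever client applied the Cluster; managedFields can be inspected with kubectl get cluster <name> -o yaml --show-managed-fields.

    metadata:
      managedFields:
      - apiVersion: cluster.x-k8s.io/v1beta1
        fieldsType: FieldsV1
        manager: example-applier      # placeholder for the client that applied the Cluster
        operation: Apply
        fieldsV1:
          f:spec:
            f:topology:
              f:variables:
                k:{"name":"oldVariable"}:
                  .: {}
                  f:name: {}
                  f:value: {}

Under v1.6, the same entry stops at f:variables: {}, so the applier owns the list as a whole rather than the individual entries.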

Possible Causes

  1. Changes in SSA handling of spec.topology.variables between CAPI v1.6 and v1.8.
  2. Stricter managedFields tracking in newer versions.
  3. Potential ownership conflicts preventing removal of fields.

Label(s) to be applied

/kind bug

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-priority Indicates an issue lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 17, 2025
@mnaser

mnaser commented Feb 18, 2025

I wonder if this is similar to #11773

@mnaser

mnaser commented Feb 18, 2025

Indeed, this is what I found:

#10841

Specifically, this commit explicitly says:

Add "name" as a map key for variable arrays. This will allow separate ownership per variable

So since variables are now tracked individually, that's exactly what's happening. Now we need to figure out how to work around that.
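
In other words, the schema change boils down to roughly this difference on the spec.topology.variables list in the generated CRDs (a sketch, not the exact generated YAML):

    # CAPI v1.6 CRDs: no list-type markers on the variables list,
    # so server-side apply treats it as a single atomic field (the default for plain lists).
    variables:
      type: array
      items: {}           # variable schema omitted

    # CAPI v1.8 CRDs: the list is a map keyed by "name",
    # so ownership is tracked per variable.
    variables:
      type: array
      items: {}           # variable schema omitted
      x-kubernetes-list-map-keys:
      - name
      x-kubernetes-list-type: map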

@okozachenko1203
Author

After further investigation, I found that the change in managedFields behavior is not immediate after upgrading CAPI but rather occurs when an SSA (Server-Side Apply) update is performed.

Steps to Reproduce

  1. Deploy a Cluster using CAPI v1.6.x and CAPO v0.9.x

    • In managedFields, spec.topology.variables is not tracked at the level of individual sub-fields.
    • The managedFields entry looks like this:
      f:variables: {}
  2. Upgrade CAPI to v1.8.x and CAPO to v0.11.x

    • managedFields remains unchanged after the upgrade.
    • The cluster still behaves as if it was created under CAPI v1.6.
  3. Perform a no-op SSA apply on the cluster (apply with no changes; see the sketch after these steps)

    • I applied the same cluster manifest without any modifications.
    • After this apply, managedFields was updated, and the sub-fields of spec.topology.variables were now explicitly tracked:
      f:variables:
        k:{"name":"apiServerTLSCipherSuites"}:
          .: {}
          f:name: {}
          f:value: {}
    • This mirrors the behavior of clusters created in CAPI v1.8.x.
  4. After this no-op SSA apply (which refreshed managedFields), the ClusterClass upgrade worked successfully.

    • oldVariable was removed, and newVariable was applied correctly.
    • The upgrade failure observed earlier no longer occurs.
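
For clarity, the no-op SSA apply from step 3 is nothing more than re-applying the unchanged Cluster manifest server-side, along these lines (manifest and field manager are placeholders; adjust to whatever originally applied the Cluster):

    # e.g. kubectl apply --server-side --field-manager=<original manager> -f cluster.yaml
    # No fields change, but the apiserver rewrites this applier's managedFields entry
    # against the new schema, switching f:variables from atomic to per-variable keys.
    apiVersion: cluster.x-k8s.io/v1beta1
    kind: Cluster
    metadata:
      name: example-cluster
    spec:
      topology:
        class: oldclass
        version: v1.28.3
        variables:
        - name: oldVariable
          value: "example"

After that, switching spec.topology.class to newclass and replacing oldVariable with newVariable in a subsequent apply removes the old entry as expected.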

@sbueringer
Member

I think #11773 is different and I would keep it out of this issue to avoid confusion.

Yes we intentionally added "name" as a map key to allow tracking separate ownership per variable.

I'm not sure if there is something that we can do within Cluster API. This seems to be due to how the kube-apiserver handles schema changes in managedFields. I wonder if more recent kube-apiserver versions might have improved or changed that behavior though (very likely not).

@sbueringer
Member

sbueringer commented Feb 18, 2025

Please see the discussion here in Slack: https://kubernetes.slack.com/archives/C0EG7JC6T/p1738861392254669

It's not exactly the same topic, but it's potentially relevant. See also the linked PR here:

This might be relevant here: kubernetes-sigs/structured-merge-diff#170
That PR taught SSA how to handle changes in atomicity as best it can. It's not feasible to make such a change backward compatible, but that PR can be useful in some circumstances.

(not sure in which Kubernetes version that landed)

@mnaser

mnaser commented Feb 18, 2025

@sbueringer does this mean we're missing +structType=atomic here, by any chance, and that might fix this?

EDIT: I might be wrong; structType=atomic would be counter-intuitive since you want to manage all the variables independently?

@sbueringer
Member

sbueringer commented Feb 18, 2025

No. We intentionally moved from atomic to

                    x-kubernetes-list-map-keys:
                    - name
                    x-kubernetes-list-type: map

structType=atomic would be counter-intuitive since you want to manage all the variables independently?

exactly

The PR I linked above might have improved some of these cases, i.e. after that PR the apiserver handles schema transitions better.

@chrischdi
Member

/priority important-soon

At least for documenting the issue.

For the other points, let's see where the discussion leads.

/triage accepted

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-priority Indicates an issue lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 19, 2025