
KEP-4650: StatefulSet Support for Updating Volume Claim Template #4651


Open · huww98 wants to merge 30 commits into master from sts-update-claim

Conversation


@huww98 huww98 commented May 22, 2024

  • One-line PR description: initial proposal of KEP-4650: StatefulSet Support for Updating Volume Claim Template

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 22, 2024
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels May 22, 2024

huww98 commented May 22, 2024

This KEP is inspired by the previous proposal in #3412. However, there are several major differences, so I decided to create a new KEP for it.

Differences:

  • No validation is performed at the StatefulSet level. We accept everything, do what we can, and report what we have done in the status. This should resolve the main concern with the previous attempt, namely that it is hard to recover from an expansion failure. With this KEP, a plain kubectl rollout undo should work (see the sketch after this list).
  • We take extra care with backward compatibility. No immediate changes are expected when the feature is enabled.
  • We accept changes to all fields, not just the storage size, and we explicitly consider what to do if a PVC does not match the template.
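
For illustration, a minimal sketch of what this enables (resource names, image, and sizes are hypothetical): the template is edited in place, and if the edit turns out to be a mistake, `kubectl rollout undo statefulset/web` simply restores the previous template.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: web
  replicas: 3
  selector:
    matchLabels: {app: web}
  template:
    metadata:
      labels: {app: web}
    spec:
      containers:
      - name: web
        image: registry.k8s.io/nginx-slim:0.8
        volumeMounts:
        - {name: data, mountPath: /usr/share/nginx/html}
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 20Gi   # previously 10Gi; the controller patches the existing PVCs
```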

Please read more in the "Alternatives" section in the KEP.

This KEP also contains more details about how to coordinate the updates of Pods and PVCs, which was another main concern with the previous attempt.

Comment on lines 5 to 7
owning-sig: sig-storage
participating-sigs:
- sig-app
Member

I think the main issues with #3412 were around StatefulSet controller and its behavior in error cases. IMO it should be owned by sig-apps.

Author

OK. Given that the code changes should mainly happen in the StatefulSet controller, and that the previous attempts all listed sig-apps as the owning-sig (it is odd that they are placed in the sig-storage folder), I will make sig-apps the owning SIG.

@xing-yang
Contributor

/sig apps

@k8s-ci-robot k8s-ci-robot added the sig/apps Categorizes an issue or PR as relevant to SIG Apps. label May 23, 2024
@huww98 huww98 force-pushed the sts-update-claim branch from 294de45 to 9ef734c Compare May 27, 2024 02:48

huww98 commented May 27, 2024

/cc @smarterclayton @gnufied @soltysh
Also cc @areller @maxin93 @jimethn
You have joined the conversations in #3412, so you may also be interested in this KEP.

Contributor

@soltysh soltysh left a comment

You're missing the production readiness questionnaire (it needs to be filled in) and a PRR file in https://github.com/kubernetes/enhancements/tree/master/keps/prod-readiness/sig-apps matching the template.

know that this has succeeded?
-->
* Allow users to update the `volumeClaimTemplates` of a `StatefulSet` in place.
* Automatically update the associated PersistentVolumeClaim objects in-place if applicable.
Contributor

What do you mean by in-place here?


For example, the volume resize feature can automatically adjust the disk and filesystem sizes while the pod is running. In addition to this, there will be new features like VolumeAttributesClass that will also support in-place changes to the storage being used.

Author

I use the word in-place to explicitly distinguish from Delete/Create style update for Pods. We should never delete PVC automatically.


huww98 commented Jun 17, 2024

Updated "Production Readiness Review Questionnaire" as requested.
/assign @wojtek-t
For PRR.

During the last sig-apps meeting,
Q1: @kow3ns asked why, when we do want to modify PVCs (e.g. to migrate between storage providers), we don't use application-level features and the StatefulSet ordinal feature to migrate between two StatefulSets.

First, this KEP still mainly targets the use cases where we can update the PVC in place without interrupting the running pods. Migration and other use cases are just by-products of allowing edits to the volumeClaimTemplates. Still, migration by editing the StatefulSet is simpler: it only requires a rolling restart, which should already be familiar to most Kubernetes operators. Editing the StatefulSet does not require switching traffic between two StatefulSets, for example, and the name of the StatefulSet won't change after migration.

Yes, we cannot roll back easily with this procedure. The user has to delete the PVC again to roll back, and the data in the PVC may not be recoverable if the retention policy is Delete. But rolling back a stateful application is inherently complex. Once a replica leaves the cluster, its state in the PV becomes stale. There is no guarantee that the old data can be used for a rollback anyway. Whichever procedure is used, this complexity should be addressed at a higher level. Custom migrators built on the ordinal feature face the same issue.

Q2: @soltysh suggested we could still follow the path of KEP-0661 and do more after we have gained more experience.

Here is why I don't want to proceed that way:

  1. We have VAC now, which is expected to go to beta soon. VAC is closely related to the storage class and can be patched onto an existing PVC. This KEP also integrates with VAC. Only updating the VAC if the storage class matches is a very logical choice.
  2. By allowing the whole template to be edited, we are forced to fully consider what to do if the PVC and the template are inconsistent, for example when the storage class differs. In KEP-0661, it is logical to just do the expansion anyway. But in this KEP, I think we should not expand it, because the person who writes the StatefulSet spec and the person who applies the change may not be the same, and we should not surprise the operator. This is also consistent with the VAC operation model. This is the divergence: we cannot go through KEP-0661 first and then move to this KEP, or there will be breaking changes.
  3. KEP-0661 proposes not reverting the volumeClaimTemplates when rolling back the StatefulSet, which is very confusing. That is another potential breaking change if we go that way first.

Please read more in the Alternatives section in this KEP.

@soltysh said we don't want the STS to be stuck in a permanently broken state. Of course. With this KEP, since we are not validating the templates, it is actually very hard to get stuck. A larger PVC is compatible with a smaller template, so just rolling back the template should unblock future rollouts, either leaving the PVCs in the expanding state, or trying to cancel the expansion if the RecoverVolumeExpansionFailure feature gate is enabled.

I think the only way we may get stuck is when patching VAC: one replica is successfully updated to the new VAC, another replica fails, and rolling back the first replica to the old VAC also fails. Even in this case, the user can just set volumeClaimUpdateStrategy to OnDelete to unblock pod rollouts.

Q3: @kow3ns thinks it is not appropriate to delete and recreate the PVC to alter the performance characteristics of volumes.

VAC is the KEP that actually parameterizes the storage class and allows us to specify and update the performance characteristics of volumes without interrupting the running pod, by patching the existing PVC. So this KEP should also integrate with VAC. Updating the VAC in the volumeClaimTemplates should not require re-creating the PVC, and it is fully automated if everything goes well.
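
For context, a minimal sketch of the kind of in-place change VAC enables on an existing PVC (names are illustrative); this KEP proposes driving the same patch from the template:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-web-0
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: csi-disk
  resources:
    requests:
      storage: 10Gi
  volumeAttributesClassName: gold   # changed in place from "silver"; no PVC re-creation
```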

Q4: @kow3ns asked how we should handle each field of volumeClaimTemplates.

This is described in the KEP in "How to update PVCs" section in "Updated Reconciliation Logic". Basically, patch what we can, skip the rest.

It seems the wording "update PVC in-place" causes many misunderstandings. I will replace it with "patch PVC".

We didn’t actually decide anything during the last meeting. I think these core questions should be decided to push this KEP forward:

  • Allow editing all fields vs only allow storage size and VAC?
  • What to do if storage class does not match between template and PVC? Skip, block, or proceed anyway.
  • Should kubectl rollout undo affect volumeClaimTemplates or not?


soltysh commented Jun 18, 2024

Here is why I don't want to proceed that way:

The general sentiment from that sig-apps call (see https://docs.google.com/document/d/1LZLBGW2wRDwAfdBNHJjFfk9CFoyZPcIYGWU7R1PQ3ng/edit#heading=h.2utc2e8dj14) was that smaller changes have a greater chance of moving forward. Also, it's worth noting that the smaller changes do not stand in opposition to the changes proposed here; they only take a gradual approach by focusing on a minimal subset of changes.

@mowangdk

Here is why I don't want to proceed that way:

The general sentiment from that sig-apps call (see https://docs.google.com/document/d/1LZLBGW2wRDwAfdBNHJjFfk9CFoyZPcIYGWU7R1PQ3ng/edit#heading=h.2utc2e8dj14) was that smaller changes have a greater chance of moving forward. Also, it's worth noting that the smaller changes do not stand in opposition to the changes proposed here; they only take a gradual approach by focusing on a minimal subset of changes.

Okay, we agreed to focus on the minimal subset of changes. @huww98 and @vie-serendipity will proceed with only the VolumeClaimTemplates.spec.resources.requests.storage and VolumeClaimTemplates.spec.volumeAttributesClassName fields. There are still two questions remaining to solve; maybe we can talk about them in the next sig-apps meetings.

  • What to do if storage class does not match between template and PVC? Skip, block, or proceed anyway.
  • Should kubectl rollout undo affect volumeClaimTemplates or not?


huww98 commented Jul 11, 2024

At the last sig-apps meeting, we decided that we should:

  • Only allow editing of storage size in the template;
  • If storage class does not match between template and PVC, block the update process and wait for user interaction.

These changes should greatly shrink the scope of this KEP, and I think this is beneficial for moving the KEP forward.

But for the validation of the template, I think we still need more discussion; it can be a major blocking point for this KEP. @soltysh thinks that we should not allow decreasing the size in the template, and that we can remove the validation later if desired. But I think the validation has many drawbacks which may block normal usage of this feature and should be resolved in the initial version:

  1. If we disallow decreasing, we make the editing a one-way road. If a user edits it and then finds it was a mistake, there is no way back. The StatefulSet will be broken forever. If this happens, the updates to pods will also be blocked. This is not acceptable IMO.
  2. To mitigate the above issue, we will want to prevent the user from going down this one-way road by mistake. We are then forced to do many more validations in the API server, which is very complex and fragile (please see KEP-0661). For example: check the storage class allowVolumeExpansion, check each PVC's storage class and size, basically duplicating all the validations we already do for PVCs. And even if we do all the validations, there are still race conditions and async failures that are impossible to catch. I see this as a major drawback of KEP-0661 that I want to avoid in this KEP.
  3. Validation means we disable rollback of the storage size. If we enable it later, it can surprise users, if it does not outright count as a breaking change.
  4. The validation conflicts with the RecoverVolumeExpansionFailure feature, although that feature is still alpha.

By contrast, if we just don't add the validation, we avoid all these issues and lose nothing: the user can already expand a PVC independently today, so the state where the template is smaller than the PVC is already very common and stable. The strategy in this state is simply not to try to shrink the PVC. I think this is well defined and easy to follow. If Kubernetes ever supports shrinking in the future, we will still need to support drivers that can't shrink, so even then we could only support shrinking with a new volumeClaimUpdateStrategy (maybe InPlaceShrinkable).

To take a step back, I think validating the template across resources violates the high-level design. The template describes a desired final state, not an immediate instruction. A lot of things can happen externally after we update the template. For example, I have an IaaS platform which tries to kubectl apply one updated StatefulSet plus one new StorageClass to the cluster to trigger the expansion of PVs. We don't want to reject that just because the StorageClass is applied after the StatefulSet, right?

To conclude, I don't want to add the validation; we shouldn't add it just to remove it in the future.

@liubog2008

By contrast, if we just don't add the validation, we avoid all these issues and lose nothing: the user can already expand a PVC independently today, so the state where the template is smaller than the PVC is already very common and stable. The strategy in this state is simply not to try to shrink the PVC. I think this is well defined and easy to follow. If Kubernetes ever supports shrinking in the future, we will still need to support drivers that can't shrink, so even then we could only support shrinking with a new volumeClaimUpdateStrategy (maybe InPlaceShrinkable).

Agree. By the way, a request means the minimal resource requirement, so an actual allocation that is larger than the request is reasonable. What we need is just to show it to users.


soltysh commented Oct 1, 2024

  1. If we disallow decreasing, we make the editing a one-way road. If a user edits it and then finds it was a mistake, there is no way back. The StatefulSet will be broken forever. If this happens, the updates to pods will also be blocked. This is not acceptable IMO.

That's one way of looking at it; also, in those cases where a mistake happens (I consider that a rare occurrence), you can always use #3335 and migrate to a new, smaller StatefulSet.
Additionally, it's always easier to limit the scope of an enhancement and expand it once we confirm there are sufficient use cases supporting it. The other way around, i.e. removing functionality, is either hard or more often not possible at all. We've done that in the past, and given the available solutions around "fixing a mistake", I'll stand strong on only allowing increases to the size.


soltysh commented Oct 1, 2024

@huww98 and @liubog2008 are you planning to push this through for 1.32?


kfox1111 commented Oct 1, 2024

Users have been waiting for many years to be able to scale up StatefulSet volumes. I agree we shouldn't overcomplicate that use case by trying to solve other issues at the same time. Let's focus on the very common use case, and then reevaluate other features after that is completed.

### Kubernetes API Changes

Change API server to allow specific updates to `volumeClaimTemplates` of a StatefulSet:
* `spec.volumeClaimTemplates.spec.resources.requests.storage` (increase only)
Member

For historical reasons, not all PVCs can be expanded, btw. The SC must have allowVolumeExpansion set to true. What happens if the user increases the size here but the underlying SC doesn't allow it?

Currently this is explicitly blocked for PVCs via admission. I am not recommending we do the same for StatefulSets, but we need a way out. We don't want the StatefulSet controller to be stuck forever retrying this operation.

Author

With the current limitations, the way out is:

  • Change the StatefulSet volumeClaimUpdatePolicy back to OnClaimDelete, effectively disabling this feature for the STS.
  • Migrate to a new STS and delete the problematic one.

There is no technical reason preventing a reduction of the storage size in the template. Maybe we can discuss this with @soltysh again. If reducing the size is supported, we only need to undo the StatefulSet change.
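
As an illustration of the first escape hatch, a minimal sketch (field name and value as proposed in this KEP; the StatefulSet name is hypothetical):

```yaml
# Setting the proposed field back to OnClaimDelete stops the controller from
# patching PVCs, unblocking Pod rollouts for this StatefulSet.
# e.g. kubectl patch statefulset web --type merge \
#        -p '{"spec":{"volumeClaimUpdatePolicy":"OnClaimDelete"}}'
spec:
  volumeClaimUpdatePolicy: OnClaimDelete
```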

Member

@gnufied gnufied Jun 11, 2025

I am not talking about reducing the size per se, but about the case where volume expansion in general is disallowed by the SC.

What perhaps we should do is: if the user changes the template size and the SC doesn't allow it, then the StatefulSet controller should stop reconciling the change and add a warning to the StatefulSet. Where it gets tricky is: what if the user changes the pod spec along with the template size?

Author

  • If the PVC update fails, we should block the StatefulSet rollout process. This will also block the creation of the new Pod. We should detect common cases (e.g. storage class mismatch) and report events before deleting the old Pod. If this still happens (e.g., because of a webhook), we should retry and report events for this. The events and status should look like those when Pod creation fails.

My current proposal is adding some best-effort pre-checks before deleting the Pod, replicating some of the PVC admission logic.

Another possible solution is to update the PVCs before deleting the old pod, so that even if the PVC update fails, the pod is not disrupted and the user has enough time to deal with it. In this way, old Pods may briefly see the new volume config before terminating. This should be fine because updating the PVC should be non-disruptive. I now prefer this solution, given the complexity of PVC admission and the quota issue mentioned above.
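
A rough sketch of the ordering this implies for one replica (plain Go with hypothetical stand-in functions, not the actual controller code):

```go
// Sketch only: patch the PVCs first and wait for them to become ready, and only
// then replace the Pod, so a failed PVC update never disrupts a running Pod.
package main

import "fmt"

type replica struct {
	pod    string
	claims []string
}

// applyClaim, claimReady, and replacePod stand in for the real controller actions.
func rollOutReplica(r replica,
	applyClaim func(string) error,
	claimReady func(string) bool,
	replacePod func(string) error,
) error {
	for _, c := range r.claims {
		if err := applyClaim(c); err != nil {
			// Block the rollout and report an event; the old Pod keeps running.
			return fmt.Errorf("patch claim %s: %w", c, err)
		}
	}
	for _, c := range r.claims {
		if !claimReady(c) {
			// Requeue and wait; do not delete the old Pod yet.
			return fmt.Errorf("claim %s not ready yet", c)
		}
	}
	// Only now delete the old Pod and create the new one with the new revision label.
	return replacePod(r.pod)
}

func main() {
	r := replica{pod: "web-2", claims: []string{"data-web-2"}}
	err := rollOutReplica(r,
		func(string) error { return nil },
		func(string) bool { return true },
		func(string) error { return nil },
	)
	fmt.Println("done:", err)
}
```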

Contributor

I wasn't aware that's the case. But I believe the failure cases in the design details section (below the table) should basically cover this scenario. I'm talking about this wording specifically:

If the PVC update fails, we should block the StatefulSet rollout process. This will 
also block the creation of new Pod. We should detect common cases (e.g. storage 
class mismatch) and report events before deleting the old Pod. If this still happens 
(e.g., because of webhook), We should retry and report events for this. 
The events and status should look like those when the Pod creation fails.

We should ensure the case described by Hemant is expressed there.

Author

what if user changes pod spec along with template size?

With the new revision, the old pod will still be running if the SC doesn't allow expansion.

2. Apply the changes to the PVCs used by this replica.
3. Create the new pod with new `controller-revision-hash` label.
4. Wait for the new pod and PVCs to be ready.
5. Advance to the next replica and repeat from step 1.
Member

@gnufied gnufied Jun 10, 2025

So, this works for most storage providers. But there are storage providers that can't expand a volume while it is in use by a pod (on the control-plane side), and yet the volume needs to be mounted for the filesystem expansion to happen on the node. So online expansion will not always work for all storage providers.

I am not sure how we intend to solve this; we could just say you must scale down your workloads if you want to expand the StatefulSet.

Author

I think we just don't support these providers. And this KEP should not break existing workflow.

Contributor

As explained elsewhere, we'll just stop the rollout, and we need to document that case.

Member

@gnufied gnufied Jun 16, 2025

So for drivers that support only offline expansion, if the user makes both a pod spec and a volume claim template change, they could have the StatefulSet rollout wedged. At minimum, we need to add appropriate events and error messages that can be surfaced to the user.

I am not sure if the right decision is to leave it at that or whether we should try to improve things. I was discussing a few ideas with @jsafrane and came up with some possible solutions:

  • when starting a pod, let kubelet wait until controller expansion completes (if it can detect so and not break anything)
  • add a field to CSIDriver, so the StatefulSet controller knows what kind of expansion it is dealing with and can support both
  • add an enum to StatefulSet for what kind of expansion to do (basically add a new value to volumeClaimUpdatePolicy)
    • helm chart authors can use safe (= offline) expansion
    • people with local storage knowledge can use online

Author

when starting a pod, let kubelet wait until controller expansion completes (if it can detect so and not break anything)

@gnufied I think this is not enough. According to the CSI spec, offline resize requires the volume not to be controller-published. This means we would need to:

  • introduce a new mechanism to communicate volume resize capability (ideally at the volume level; new volumes may support online expansion, while old volumes may not, for historical reasons)
  • change kubelet to wait for expansion before updating Node.status.volumesInUse
  • change KCM to wait for node.volumesInUse before creating VolumeAttachment

And since we cannot fully cancel an expansion, if the expansion fails, we are blocking the pod from starting until the user deletes and recreates all the Pod/PVC/PV objects, which may not be a good user experience.

So I think this is too complex to fit into this KEP.

helm chart authors can use safe (= offline) expansion

Why is offline expansion safer?

Member

@gnufied I think this is not enough.

That should be enough. I am not sure which part is causing that confusion.

And since we cannot fully cancel an expansion, if the expansion fails, we are blocking the pod from starting until the user deletes and recreates all the Pod/PVC/PV objects, which may not be a good user experience.

Users who use offline volumes will be wedged right now anyway, even in the successful case. Even for online expansion, if the user updates the pod spec and volumeClaimTemplate and the expansion fails, then the rollout will be wedged.

I do not want to block this KEP on this point, tbh. I think we have had enough iterations of this proposal that it seems unfair to block on this. But it would be useful to think of a path forward for the offline-only expansion case.

Member

Why is offline expansion safer?

I meant offline is safer because it will work for all storage providers.

Author

@gnufied By offline expansion, we mean the following steps are necessary for the expansion:

  1. delete the old Pod
  2. CSI NodeUnpublishVolume/NodeUnstageVolume
  3. CSI ControllerUnpublishVolume
  4. CSI ControllerExpandVolume
  5. start the new Pod
  6. CSI ControllerPublishVolume
  7. CSI NodeStageVolume/NodePublishVolume/NodeExpandVolume

Is that correct? How can we sequence ControllerExpandVolume after ControllerUnpublishVolume? If the old and new Pods are on the same node, there may even be no ControllerUnpublishVolume. Maybe we can delay the creation of the new Pod to achieve this. But that still requires adding some new fields to the Kubernetes API, and should belong to a future KEP.

Users who use offline volumes will be wedged right now anyway, even in the successful case. Even for online expansion, if the user updates the pod spec and volumeClaimTemplate and the expansion fails, then the rollout will be wedged.

But that is all at the StatefulSet level; the Pod will always work. If we involve kubelet in waiting for something, the pod will not come up when that goes wrong, which is much worse.

Do you have an example of an SP that only supports offline expansion? Then we can check what operations are really required.

Member

How can we sequence ControllerExpandVolume after ControllerUnpublishVolume? If the old and new Pods are on the same node

You don't have to. external-resizer will keep retrying and will eventually succeed once ControllerUnpublish is called. The catch is, you do not schedule the new pod until the PVC has the NodeResizePending status (speaking roughly; there is more nuance to this).
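
Roughly, the signal described here looks like this on the PVC (field names from the volume expansion status API; the exact shape varies by Kubernetes version, so treat this as a sketch):

```yaml
status:
  capacity:
    storage: 10Gi              # old size still mounted
  allocatedResources:
    storage: 20Gi              # controller-side expansion finished
  allocatedResourceStatuses:
    storage: NodeResizePending # only the node filesystem resize remains
```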


Additionally collect the status of managed PVCs, and show them in the StatefulSet status.
Some fields in the `status` are updated to reflect the status of the PVCs:
- claimsReadyReplicas: the number of replicas with all PersistentVolumeClaims ready to use.
Member

This is a new field, right?

Author

Yes

Contributor

As mentioned in other comments, I'm against adding one more readyreplicas field which will be very confusing to users.

Author

I think we will need a field to expose the PVC readiness, for kubectl rollout status to wait for the last replica to be ready before returning. Any idea better than claimsReadyReplicas?

Author

With the new revision, I think we don't need claimsReadyReplicas anymore. When the Pod is ready, it is guaranteed that the PVC is ready too, so any existing tools that monitor the StatefulSet rollout process do not need to change.

- `volumeClaimUpdatePolicy` is `InPlace` and the PVC is updating;
- availableReplicas: total number of replicas of which both Pod and PVCs are ready for at least `minReadySeconds`
- currentRevision, updateRevision, currentReplicas, updatedReplicas
are updated to reflect the status of PVCs.
Member

+1 to @soltysh

As for changing currentRevision etc. - conceptually changing their semantics sounds ok, but I don't fully understand how that works.
You mentioned "since I've added claim templates to ControllerRevision" - how is that added? What does it mean for our upgrade story if we have a ControllerRevision that doesn't contain the PVC template?
@soltysh - I would like to hear your thoughts too about that aspect


gnufied commented Jun 12, 2025

@wojtek-t would you consider this enhancement from PRR perspective complete for alpha while we finalize some of the implementation details? I am asking this assuming today is PRR freeze.

@wojtek-t
Member

@wojtek-t would you consider this enhancement from PRR perspective complete for alpha while we finalize some of the implementation details? I am asking this assuming today is PRR freeze.

PRR freeze is not about having it approved, but rather about having it in reviewable shape.
It definitely meets this bar, so we have a week to figure it out - I will take another pass tomorrow.


soltysh commented Jun 16, 2025

/label tide/merge-method-squash

@k8s-ci-robot k8s-ci-robot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Jun 16, 2025
Contributor

@soltysh soltysh left a comment

Several more comments.

-->
* Allow users to update some fields of `volumeClaimTemplates` of a `StatefulSet`, specifically:
* increasing the requested storage size (`spec.volumeClaimTemplates.spec.resources.requests.storage`)
* modifying the VolumeAttributesClass used by the claim (`spec.volumeClaimTemplates.spec.volumeAttributesClassName`)
Contributor

IIUC, it should not affect quota.

### Kubernetes API Changes

Change API server to allow specific updates to `volumeClaimTemplates` of a StatefulSet:
* `spec.volumeClaimTemplates.spec.resources.requests.storage` (increase only)
Contributor

I wasn't aware that's the case. But I believe the failure cases in the design details section (below the table) should basically cover this scenario. I'm talking about this wording specifically:

If the PVC update fails, we should block the StatefulSet rollout process. This will 
also block the creation of new Pod. We should detect common cases (e.g. storage 
class mismatch) and report events before deleting the old Pod. If this still happens 
(e.g., because of webhook), We should retry and report events for this. 
The events and status should look like those when the Pod creation fails.

We should ensure the case described by Hemant is expressed there.

- `volumeClaimUpdatePolicy` is `InPlace` and the PVC is updating;
- availableReplicas: total number of replicas of which both Pod and PVCs are ready for at least `minReadySeconds`
- currentRevision, updateRevision, currentReplicas, updatedReplicas
are updated to reflect the status of PVCs.
Contributor

I will add claimsReadyReplicas instead.

I will admit that our StatefulSet status already has too many replica-count fields, which are confusing to numerous users. Adding one more will only increase that confusion, so I will be strongly against it.

But for currentRevision, updateRevision, currentReplicas, updatedReplicas

Yes, those fields will be affected by these changes, since those directly reflect information about current and updated pods. So it's reasonable they will cover those changes.

@soltysh - I would like to hear your thoughts too about that aspect

We currently calculate the ControllerRevision based on the entire pod template (and a few other fields), so adding volumeClaimTemplates is feasible, although it will be a significant increase in size, given that this is an array of templates, not a single one, which might be problematic.

2. Apply the changes to the PVCs used by this replica.
3. Create the new pod with new `controller-revision-hash` label.
4. Wait for the new pod and PVCs to be ready.
5. Advance to the next replica and repeat from step 1.
Contributor

As explained elsewhere, we'll just stop the rollout, and we need to document that case.


Additionally collect the status of managed PVCs, and show them in the StatefulSet status.
Some fields in the `status` are updated to reflect the status of the PVCs:
- claimsReadyReplicas: the number of replicas with all PersistentVolumeClaims ready to use.
Contributor

As mentioned in other comments, I'm against adding one more readyreplicas field which will be very confusing to users.

- `volumeClaimUpdatePolicy` is `InPlace` and the PVC is updating;
- availableReplicas: total number of replicas of which both Pod and PVCs are ready for at least `minReadySeconds`
- currentRevision, updateRevision, currentReplicas, updatedReplicas
are updated to reflect the status of PVCs.
Contributor

We could consider expanding the ControllerRevision not with the whole volumeClaimTemplate but only with the modifiable fields listed in this document. This way we'll ensure the size constraints are not stretched too thin.

* `spec.volumeClaimTemplates.metadata.labels`
* `spec.volumeClaimTemplates.metadata.annotations`

Introduce a new field in StatefulSet `spec`: `volumeClaimUpdatePolicy` to
Contributor

Apologies for not catching this sooner, but upon reviewing the API types once again, I realized we already have .spec.persistentVolumeClaimRetentionPolicy, which allows the user to define what happens during PVC deletion and scaling. I believe we should reuse that field and add WhenUpdated as a third supported policy.
So we'd have:

type StatefulSetPersistentVolumeClaimRetentionPolicy struct {
	// existing fields
	WhenDeleted PersistentVolumeClaimRetentionPolicyType
	WhenScaled PersistentVolumeClaimRetentionPolicyType

	// new field
	WhenUpdated PersistentVolumeClaimRetentionPolicyType
}

type PersistentVolumeClaimRetentionPolicyType string

const (
	// existing consts
	RetainPersistentVolumeClaimRetentionPolicyType PersistentVolumeClaimRetentionPolicyType = "Retain"
	DeletePersistentVolumeClaimRetentionPolicyType PersistentVolumeClaimRetentionPolicyType = "Delete"

	// new constant
	InPlacePersistentVolumeClaimRetentionPolicyType PersistentVolumeClaimRetentionPolicyType = "InPlace"
)

This approach would nicely allow users to request the appropriate operations based on their needs. Even better, we'd extend our already available Delete policy with a new use case.
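
For comparison, this is roughly how the reused field would read on a StatefulSet under this suggestion (whenUpdated being the new addition):

```yaml
spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain
    whenScaled: Retain
    whenUpdated: InPlace   # new policy proposed in this comment
```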

Member

I guess, it could make sense to reuse the already existing retention policy field. But not all combinations would be useful.

Can we make a table to see the valid values for each field?

Member

Hm, according to #4651 (comment) it seems we might need a struct instead of an enum for the WhenUpdated.

Author

@huww98 huww98 Jun 17, 2025

I don't like this idea. The word Retention seems to mean whether to delete the PVC or not. Adding InPlace to this is strange.

And as stated in the non-goal:

Support automatic re-creating of PersistentVolumeClaim. We will never delete a PVC automatically.

We don't want to automatically delete the PVC on update; it is too dangerous for the data in the volume. So Delete will not be valid for WhenUpdated. Naturally, InPlace is not applicable to WhenDeleted/WhenScaled, which do not have a new version to update to. So only Retain is common to all, which basically means a no-op.

On the other hand, I think we may want to reserve WhenUpdated for future use. When we support adding/removing volumeClaimTemplates from a StatefulSet, we can use WhenUpdated to control whether we should delete the PVCs corresponding to a removed template.

So I still prefer the original design.
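
For contrast, the original design kept here is a separate, two-value field (as proposed in the KEP):

```yaml
spec:
  volumeClaimUpdatePolicy: InPlace   # or OnClaimDelete, the default and current behavior
```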

Member

@gnufied gnufied Jun 17, 2025

I'm afraid I agree with @huww98: since the original type has the word "Retention" in it, it is hard to reuse it for expansion purposes when the former defines the retention policy of PVCs vis-a-vis the StatefulSet.

Contributor

I don't like this idea. The word Retention seems to mean whether to delete the PVC or not. Adding InPlace to this is strange.

Based on this definition, I see this as a good fit. Also, expanding this field will make it easier to ensure:

  1. Compatibility with that feature. This is also what @atiratree mentions, about adding the table listing all the possible combinations and how they will interact.
  2. Simplicity of the API: we're building on what we have, rather than expanding the API surface, which might cause unnecessary confusion.

We don't want to automatically delete the PVC on update; it is too dangerous for the data in the volume. So Delete will not be valid for WhenUpdated. Naturally, InPlace is not applicable to WhenDeleted/WhenScaled, which do not have a new version to update to. So only Retain is common to all, which basically means a no-op.

See my no. 1 above, for why combining the two will be beneficial for both of the features.

Author

Compatibility with that feature.

I'm not sure I understand. That one is about deleting PVCs, while this one is about updating PVCs. How are they related?

about adding the table listing all the possible combinations and how they will interact.

If we need such a table, we are already making the API too complex. And IMO these two features are orthogonal; they don't interact with each other.

Simplicity of the API

I'm not sure. Both proposals add a new field, but while my original proposal has only 2 possible values for the new field, your proposal has 3 (with 1 invalid). I'd say my proposal is actually simpler.

And how about my point above about reserving WhenUpdated for future use?


Change API server to allow specific updates to `volumeClaimTemplates` of a StatefulSet:
* `spec.volumeClaimTemplates.spec.resources.requests.storage` (increase only)
* `spec.volumeClaimTemplates.spec.volumeAttributesClassName`
Member

Do we want to support this feature when it is still disabled by default? https://kubernetes.io/docs/concepts/storage/volume-attributes-classes/. Maybe it would be better to start simpler; with the size first.

Author

The proposal is to use server-side apply for the whole PVC, so we do not need a special case for each field; it naturally supports every mutable field.
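
A minimal sketch of that idea as a server-side apply patch via client-go (the field manager name and values are hypothetical; the real controller would build the patch from the template):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// applyClaim server-side-applies only the fields the controller owns; everything
// else on the PVC (and fields set by other managers) stays untouched.
func applyClaim(ctx context.Context, cs kubernetes.Interface, ns, name string) error {
	patch := []byte(fmt.Sprintf(
		`{"apiVersion":"v1","kind":"PersistentVolumeClaim","metadata":{"name":%q},`+
			`"spec":{"resources":{"requests":{"storage":"20Gi"}},"volumeAttributesClassName":"gold"}}`,
		name))
	force := true
	_, err := cs.CoreV1().PersistentVolumeClaims(ns).Patch(ctx, name, types.ApplyPatchType, patch,
		metav1.PatchOptions{FieldManager: "statefulset-controller", Force: &force})
	return err
}

func main() {}
```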

Contributor

This might be problematic. I was under the impression that the other feature (being beta) is on by default; that's my bad, I hadn't checked it. Now, looking at #3751, I don't see a clear path for that functionality going forward, so I'd be inclined to hold off on this until we have clear information on why that is.

Author

KEP-3136 is why it is off-by-default. And if everything goes well, it will become GA in 1.34.

Anyway, that KEP should be orthogonal to this one. If that one is enabled, we will support it; if not, we will still work on our own.

Member

If it is enabled/reaches stable in this release, then we are fine. Can we please track it in the KEP under alpha?

* `spec.volumeClaimTemplates.metadata.labels`
* `spec.volumeClaimTemplates.metadata.annotations`

Introduce a new field in StatefulSet `spec`: `volumeClaimUpdatePolicy` to
Member

I guess, it could make sense to reuse the already existing retention policy field. But not all combinations would be useful.

Can we make a table to see the valid values for each field?

- `volumeClaimUpdatePolicy` is `InPlace` and the PVC is updating;
- availableReplicas: total number of replicas of which both Pod and PVCs are ready for at least `minReadySeconds`
- currentRevision, updateRevision, currentReplicas, updatedReplicas
are updated to reflect the status of PVCs.
Member

I am not entirely convinced by the idea of disabling the rollout undo when we increase the size.

What do you think about exploring the rollout undo story further? IMO when somebody changes the image and size, they might still want to revert the image, but keep the PVC with the new size.

some questions:

  • Do we need to include the PVC in the revision? E.g. the case of somebody updating the PVC externally? Also if you cannot roll back, what is the purpose of the revisions?
  • Would it make sense to track the PVC revisions separately?
  • Would it make sense to have different kinds of undo and add support for them in kubectl?

* `spec.volumeClaimTemplates.metadata.labels`
* `spec.volumeClaimTemplates.metadata.annotations`

Introduce a new field in StatefulSet `spec`: `volumeClaimUpdatePolicy` to
Member

Hm, according to #4651 (comment) it seems we might need a struct instead of an enum for the WhenUpdated.

@huww98 huww98 force-pushed the sts-update-claim branch from aa58e48 to 906006f Compare June 18, 2025 03:50

huww98 commented Jun 18, 2025

A new revision is uploaded: the order of PVC/Pod updates is changed. We now update the PVC and wait for it to be ready before deleting the old pod. This:

  • resolves the previous comments about quota and about PVC update failures for other reasons;
  • avoids the possible race condition described in KEP-5381;
  • removes the need for a new status field to monitor the PVC update process.

Currently unresolved discussions (correct me if I missed something):

  • offline expansion: not likely to be included in this KEP; we can continue the discussion for a future KEP
  • undo: we may allow reducing the template size, or enhance kubectl to do a partial undo. We will see as we gain more experience.

The enhancement freeze is close. Can we merge this PR and continue to discuss and refine it in future PRs, as suggested in the KEP template:

When editing KEPS, aim for tightly-scoped, single-topic PRs to keep discussions
focused. If you disagree with what is already in a document, open a new PR
with suggested changes.

@huww98 huww98 force-pushed the sts-update-claim branch from 906006f to 0c4d6b5 Compare June 18, 2025 07:46
@huww98 huww98 force-pushed the sts-update-claim branch from 0c4d6b5 to afd1aaf Compare June 18, 2025 08:07


Additionally collect the status of managed PVCs, and show them in the StatefulSet status.
Some fields in the `status` are updated to reflect the status of the PVCs:
Member

It would be useful (as mentioned in the thread below) to write the full API documentation in the KEP to see what API changes we plan to advertise to the users.

Author

The only planned change is already listed below:

  • currentRevision, updateRevision, currentReplicas, updatedReplicas are updated to reflect the status of PVCs.

No new fields in status are planned.

Member

@atiratree atiratree Jun 19, 2025

Even if the fields stay the same, their meaning will change, and the meaning is also part of the API. Nevertheless, the API review will be easier if you include the API comments/docs in the KEP.

Author

Added

With these changes, user can still use `kubectl rollout status` to monitor the update process,
both for automated patching and for the PVCs that need manual intervention.

A PVC is considered ready if:
Member

Are we going to use the PVC readiness term in the API? Btw, what about the PVC phase?

Author

I think not. I only planned to use the "ready" term in this KEP doc, not in the API.

- currentRevision, updateRevision, currentReplicas, updatedReplicas
are updated to reflect the status of PVCs.

With these changes, user can still use `kubectl rollout status` to monitor the update process,
Member

Are we going to use the PVC readiness in the kubectl rollout status? What changes are required in the kubectl?

Author

I had planned to add a new field, status.claimsReadyReplicas, and use that in kubectl rollout status, but that was rejected by @soltysh. The current plan is to wait for the PVC to be ready before updating the Pod. So pod ready will imply PVC ready, and no changes are required in kubectl. Please see Alternatives > Order of Pod / PVC updates in the KEP doc for more.

Member

PVCs can also be observed in kubectl and contribute to the status.

Comment on lines 287 to 288
If `volumeClaimUpdatePolicy` is `OnClaimDelete`, nothing changes. This field acts like a per-StatefulSet feature-gate.
The changes described below applies only for `InPlace` policy.
Member

This sounds like we do not plan to gate the feature. What about this?

Suggested change:

- If `volumeClaimUpdatePolicy` is `OnClaimDelete`, nothing changes. This field acts like a per-StatefulSet feature-gate.
- The changes described below applies only for `InPlace` policy.
+ If the `volumeClaimUpdatePolicy` field is set to `OnClaimDelete`, nothing changes.
+ To opt in to the new behavior, the `inPlace` policy should be used.
+ This new behaviour is described below.

Include `volumeClaimTemplates` in the `ControllerRevision`.

Since modifying `volumeClaimTemplates` will change the hash,
Add support for updating `controller-revision-hash` label of the Pod without deleting and recreating the Pod,
Member

Do we expect to support changing the InPlace to OnClaimDelete?
What are the considerations? Do we have to update the revisions retrospectively? What happens during a rollback?

Are we going to start tracking volumes in revisions when OnClaimDelete is used?

Maybe it would be better to not allow changes to the volumeClaimUpdatePolicy field, and only allow setting it during creation of the StatefulSet - this would remove quite a few headaches from managing the revisions IMO.

Author

Do we expect to support changing the InPlace to OnClaimDelete?
What are the considerations?

Yes. I expect users to change to OnClaimDelete to escape from any PVC update failure. And users always change their minds; OnClaimDelete has its own use case, where the user really wants each of the PVCs to look different.

Do we have to update the revisions retrospectively?

I'm not sure I understood this. Will these words from the KEP answer your question?

Note that when Pod is at revision B but PVC is at revision A, we will not update PVC.
Such state can only happen when user set volumeClaimUpdatePolicy to InPlace when the feature-gate of KCM is disabled,
or disable the previously enabled feature-gate.
We require user to initiate another rollout to update the PVCs, to avoid any surprise.

What happens during a rollback?

What is rolled back? The behavior when changing between InPlace and OnClaimDelete is already described in the doc. volumeClaimTemplate rollback works just like a normal update.

Are we going to start tracking volumes in revisions when OnClaimDelete is used?

No, because OnClaimDelete will be the default value, and it is set automatically for all StatefulSets when the feature-gate is enabled. If we start tracking volumes immediately, we will update all StatefulSets and all pods under any StatefulSet at once. I think this introduces the risk of overloading the control plane.

Maybe it would be better to not allow changes to the volumeClaimUpdatePolicy field, and only allow setting it during creation of the StatefulSet - this would remove quite a few headaches from managing the revisions IMO.

I'd like to enable existing StatefulSets to take advantage of the new feature. And what are the headaches? I'd expect the current plan to handle everything smoothly.

Member

Yes. I expect users to change to OnClaimDelete to escape from any PVC update failure.

Reverting to OnClaimDelete will not help. Updating or reverting the claim templates will.

And users always change their minds; OnClaimDelete has its own use case, where the user really wants each of the PVCs to look different.

Yes this is useful, but we will have to figure out how to support that.

We require user to initiate another rollout to update the PVCs, to avoid any surprise.

I am not sure here, I think the bigger surprise would be if we do not update the PVC when InPlace is used.

What is rolled back? The behavior when changing between InPlace and OnClaimDelete is already described in the doc. volumeClaimTemplate rollback works just like a normal update.

The StatefulSet to an older revision. I think it would be good to go over some of these scenarios as they might be surprising and not just a simple update.

I'd like to enable existing StatefulSets to take advantage of the new feature. And what are the headaches? I'd expect the current plan to handle everything smoothly.

Rev 1:
StatefulSet with OnClaimDelete on Node A (tracked)
PVC on Node A (untracked)

Rev 2:
StatefulSet with InPlace on Node A (tracked)
PVC on Node A (tracked)

Rev 3:
StatefulSet with InPlace on Node B (tracked)
PVC on Node B (tracked)

Reverting to Rev 1 will make the pod unschedulable.

Author

Reverting to OnClaimDelete will not help. Updating or reverting the claim templates will.

I agree that reverting the claim templates will be very useful. But reverting the claim templates will not always recover things. For VolumeAttributesClass, when reverting the template, we have to revert the already-modified volumeAttributesClassName in the PVC, which can be slow and can also fail. However, reverting to OnClaimDelete should always unblock the pod rollout immediately.

I think the bigger surprise would be if we do not update the PVC when InPlace is used.

This is a design decision to make. Consider that, when using this feature to roll out Ver B from Ver A, the user finds something wrong and turns off the feature gate. Now we should not touch the PVCs, and we continue the Pod rollout to Ver B. status.currentRevision will indicate the rollout to Ver B is finished. Now the user enables the feature gate again. If we then update the PVCs still at Ver A to Ver B, we will have no StatefulSet status to track the updates, since the status is already at Ver B.

Note this will only happen when a rollout happened with the KCM feature-gate disabled and volumeClaimUpdatePolicy set to InPlace, which should be very rare. The more common use case, where the user updates the claim template before changing volumeClaimUpdatePolicy to InPlace, will work as you expect. The new claim template changes will be rolled out, since changing volumeClaimUpdatePolicy to InPlace will add the template to the ControllerRevision, and we will get a new revision hash to roll out.

Reverting to Rev 1 will make the pod unschedulable.

Currently, the nodeAffinity of a PV is immutable, so what you describe cannot happen, although I'd like to enable it in KEP-5381.

But I understand your request, and I also thought about this, but didn't come up with a better solution.

Do you think it is acceptable to update all pods under any StatefulSet at once?

Or we can make this tri-state:

  • empty/nil: the default and preserve the current behavior.
  • OnClaimDelete: Add volumeClaimTemplate to the history, but don't update PVCs
  • InPlace: Add volumeClaimTemplate to the history, and also update PVCs in-place

There is a unique challenge if we track claim templates when OnClaimDelete is set. When updating from OnClaimDelete to InPlace, the revision hash will not change, so a rollout is not triggered; the user will need to modify the claim template again to trigger the rollout. We could resolve this by also adding volumeClaimUpdatePolicy to the ControllerRevision, but none of the policies we already have are present in the ControllerRevision, so this is not ideal either.

I admit that the kubectl rollout undo behavior will become more surprising in the current design. But that is just a convenience method for updating the StatefulSet; the user can always do the update manually.

So I conclude that the current design is the best. What do you think? I added a new section in Alternatives to discuss this.


Naturally, most of the update control logic also applies to PVCs.
* If `updateStrategy` is `RollingUpdate`, update the PVCs in the order from the largest ordinal to the smallest.
* If `updateStrategy` is `OnDelete`, only update the PVCs if the Pod is deleted manually.
Member

I would vote for not updating the PVCs when OnDelete is used, since this is a legacy behavior.

FYI: revisions do not work in OnDelete and kubernetes/kubernetes#122272 should be fixed first before any work is done in here. Can we please track it in the KEP?

Author

OK, I think I need to gain a deeper understanding of this by reading the code.

Author

Updated the KEP: we will not update the PVCs when OnDelete is used.

When updating volumeClaimTemplates along with pod template, we will go through the following steps:
1. Apply the changes to the PVCs used by this replica.
2. Wait for the PVCs to be ready.
3. Delete the old pod.
Member

Shouldn't we delete the old pod first?

Author

Please see Alternatives > Order of Pod / PVC updates in the KEP doc.

Member

OK, makes sense. Anyway, volumeClaimUpdatePolicy should be a struct, to support customizations of the update process in the future.

Author

Will make the change.
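A minimal sketch of the struct form suggested above, with hypothetical field names; the point is only that a struct leaves room for future knobs without another API rename:

```go
// Sketch only: hypothetical names, not the final API.
type StatefulSetVolumeClaimUpdatePolicy struct {
	// How existing PVCs are updated, e.g. "OnClaimDelete" or "InPlace".
	UpdateMode string `json:"updateMode,omitempty"`

	// Future customizations of the update process (ordering, unavailability
	// budgets for PVC updates, per-template overrides, ...) could be added
	// here later without changing the field's shape.
}
```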

Comment on lines 444 to 446
| at revision A | not existing | create PVC at revision B |
| at revision A | at revision A | update PVC to revision B |
| at revision A | at revision B | wait for PVC to be ready, then delete Pod or update Pod label |
Member

the same question about the order applies here

Author

Please see Alternatives > Order of Pod / PVC updates in the KEP doc.
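The revision table quoted above reads as a small decision function. Below is a runnable Go sketch of that logic, under the assumption that the Pod is still at revision A while the StatefulSet rolls to revision B; the function name and string results are illustrative only, not the actual controller code:

```go
package main

import "fmt"

// nextAction sketches the quoted table: given the PVC's current revision
// ("" means the PVC does not exist yet) and whether it is ready, decide the
// controller's next step for a Pod that is still at revision A.
func nextAction(pvcRevision string, pvcReady bool) string {
	switch {
	case pvcRevision == "":
		return "create PVC at revision B"
	case pvcRevision == "A":
		return "update PVC to revision B"
	case !pvcReady:
		return "wait for PVC to be ready"
	default:
		return "delete Pod or update Pod label"
	}
}

func main() {
	fmt.Println(nextAction("", false)) // create PVC at revision B
	fmt.Println(nextAction("A", true)) // update PVC to revision B
	fmt.Println(nextAction("B", true)) // delete Pod or update Pod label
}
```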

or disable the previously enabled feature-gate.
We require user to initiate another rollout to update the PVCs, to avoid any surprise.

When `volumeClaimUpdatePolicy` is updated from `OnClaimDelete` to `InPlace`,
Member

As mentioned before. Do we have to support that?

Member

I still think it deserves a discussion: #4651 (comment)

We require user to initiate another rollout to update the PVCs, to avoid any surprise.

When `volumeClaimUpdatePolicy` is updated from `OnClaimDelete` to `InPlace`,
StatefulSet controller will begin to add claim templates to ControllerRevision,
Member

what about the history and rollbacks?

Author

This will create a new ControllerRevision, so a new line in the history. For rollback, kubectl rollout undo will be a no-op, since volumeClaimUpdatePolicy is not included in the ControllerRevision. The user can update it back to OnClaimDelete manually to roll back.

Member

I would expect rollout undo/rollback to work even when volumeClaimUpdatePolicy is not present and is just defaulted by the API.

Author

If volumeClaimUpdatePolicy is not set, undoing volumeClaimTemplates will not trigger an automated rollout, so it is not very useful anyway. This is discussed in detail in the new section "When to track volumeClaimTemplates in ControllerRevision".

@gnufied
Member

gnufied commented Jun 18, 2025

Offline expansion: not likely to be included in this KEP. We can continue discussion for future KEP.

I would like to see a way this can fail fast for drivers that only support offline expansion. We probably do not want users to attempt a StatefulSet update that will get wedged because the volumes can only be expanded offline. We may not have to block the KEP for that; it can be discussed during API review as well.

undo: we may allow reducing template size, or enhance kubectl to do partial undo. We will see as we gain more experience.

I am curious why we decided to store volumeClaimTemplate in the ControllerRevision. I guess this is the main reason we can't support undo, right?

@atiratree
Member

I think it would be preferable to design the feature with undo in mind: #4651 (comment)

@wojtek-t
Member

Given the lack of SIG agreement up until now, it's probably too late to get it into 1.34.

@huww98
Author

huww98 commented Jun 20, 2025

I would like to see a way this can fail fast for drivers that only support offline expansion.

@gnufied That can be hard. Kubernetes currently is not aware of whether a volume supports online resize.

I am curious why did we decide to store volumeClaimTemplate in controllerrevision?

As described in the other comment, I need to add the claim templates to the revision so that I get a new revision when they are updated; that triggers the rollout, reusing the existing Pod rollout infrastructure for PVCs.

Besides the above reason, we can still undo the other fields of the claim templates, and we may also support undoing size changes in the future.

And it is useful when creating new PVCs. Quoting the KEP:

When creating new PVCs, use the volumeClaimTemplates from the same revision that is used to create the Pod.
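A minimal sketch of that last point, assuming the revision data decodes into a partial StatefulSet that carries spec.volumeClaimTemplates (as this KEP proposes); the function and decoding approach below are illustrative assumptions, not the actual controller code:

```go
package sketch

import (
	"encoding/json"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// pvcForOrdinal builds the PVC for one replica from the volumeClaimTemplates
// stored in the same ControllerRevision that is used to create the Pod.
func pvcForOrdinal(rev *appsv1.ControllerRevision, set *appsv1.StatefulSet, templateName string, ordinal int) (*corev1.PersistentVolumeClaim, error) {
	var patch appsv1.StatefulSet
	// Assumption: the revision's raw data unmarshals into a partial
	// StatefulSet whose spec includes the claim templates.
	if err := json.Unmarshal(rev.Data.Raw, &patch); err != nil {
		return nil, err
	}
	for i := range patch.Spec.VolumeClaimTemplates {
		tmpl := &patch.Spec.VolumeClaimTemplates[i]
		if tmpl.Name != templateName {
			continue
		}
		pvc := tmpl.DeepCopy()
		// StatefulSet PVC names follow the <template>-<set>-<ordinal> scheme.
		pvc.Name = fmt.Sprintf("%s-%s-%d", templateName, set.Name, ordinal)
		pvc.Namespace = set.Namespace
		return pvc, nil
	}
	return nil, fmt.Errorf("claim template %q not found in revision %s", templateName, rev.Name)
}
```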

@huww98
Author

huww98 commented Jun 23, 2025

I reviewed the StatefulSet controller code recently and made these changes:

  • We will not update PVCs when OnDelete updateStrategy is used, as suggested by @atiratree
  • We will now update the PVC and wait for it to be ready even if the corresponding Pod is already ready (update the PVC retrospectively, if I understand this word correctly). This is because I realized the Pod may be deleted externally, e.g. evicted; we cannot just skip the PVC update if the Pod is deleted. This also changes the behavior when the feature gate is enabled, disabled, then enabled again. Check the new KEP revision for details; a sketch of the check follows this list.
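A minimal sketch of that second point, assuming PVCs carry the same controller-revision-hash label as Pods once this KEP lands; the helper name is mine, not the controller's. Even a ready Pod is not enough: its PVCs must also be at the target revision before the replica counts as updated.

```go
import corev1 "k8s.io/api/core/v1"

// replicaAtRevision reports whether both the Pod and all of its PVCs are at
// the target revision.
func replicaAtRevision(pod *corev1.Pod, pvcs []*corev1.PersistentVolumeClaim, target string) bool {
	if pod == nil || pod.Labels["controller-revision-hash"] != target {
		return false
	}
	for _, pvc := range pvcs {
		// A ready Pod is not enough: the Pod may have been deleted and
		// recreated externally (e.g. evicted) without its PVCs ever being
		// reconciled, so stale PVCs are still updated retrospectively.
		if pvc.Labels["controller-revision-hash"] != target {
			return false
		}
	}
	return true
}
```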
