KEP-4650: StatefulSet Support for Updating Volume Claim Template #4651
Conversation
This KEP is inspired by the previous proposal in #3412. However, there are several major differences, so I decided to make a new KEP for this. Differences:
Please read more in the "Alternatives" section in the KEP. This KEP also contains more details about how to coordinate the update of Pod and PVC, which is another main concern of the previous attempt.
owning-sig: sig-storage
participating-sigs:
  - sig-apps
I think the main issues with #3412 were around StatefulSet controller and its behavior in error cases. IMO it should be owned by sig-apps.
OK. Given that the code changes should mainly happen in the StatefulSet controller, and that the previous attempts all put sig-apps as the owning-sig (it is odd that they are placed in the sig-storage folder):
/sig apps
You're missing the filled-in production readiness questionnaire and a PRR file in https://github.com/kubernetes/enhancements/tree/master/keps/prod-readiness/sig-apps matching the template.
* Allow users to update the `volumeClaimTemplates` of a `StatefulSet` in place.
* Automatically update the associated PersistentVolumeClaim objects in-place if applicable.
What do you mean by `in-place` here?
For example, the volume resize feature can automatically adjust the disk and filesystem sizes while the pod is running. In addition, there will be new features like VolumeAttributesClass that also support in-place changes to the storage being used.
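As a concrete illustration of such an in-place change (not part of the KEP; the namespace, claim name, new size, and VolumeAttributesClass name below are placeholders, and the VAC field requires its feature gate), an existing claim can be grown and re-pointed at a different VolumeAttributesClass by patching the object rather than recreating it:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Merge-patch the existing claim: grow the request and switch its
	// VolumeAttributesClass. The PVC object itself is never recreated.
	patch := []byte(`{"spec":{` +
		`"resources":{"requests":{"storage":"20Gi"}},` +
		`"volumeAttributesClassName":"fast-io"}}`)

	pvc, err := client.CoreV1().PersistentVolumeClaims("default").Patch(
		context.TODO(), "data-web-0", types.MergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("patched %s: requests=%v\n", pvc.Name, pvc.Spec.Resources.Requests)
}
```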
I use the word in-place to explicitly distinguish it from the Delete/Create style of update used for Pods. We should never delete a PVC automatically.
Updated the "Production Readiness Review Questionnaire" as requested.

Several questions were raised during the last sig-apps meeting:

Q1: First, this KEP is still mainly targeting the use-cases where we can update the PVC in-place, without interrupting the running Pods. Migration and other use-cases are just a by-product of allowing edits to the volumeClaimTemplates. Still, migration by editing the STS is simpler, just requiring a rolling restart, which should already be familiar to most Kubernetes operators. Editing the STS does not require switching traffic between two STSes, for example, and the name of the STS won't change after migration. Yes, we cannot roll back easily with this procedure: the user should delete the PVC again to roll back, and the data in the PVC may not be recoverable depending on the retention policy.

Q2: @soltysh suggested we may still go along the way of KEP-0661, then do more after we get more experience. Here is why I don't want to proceed that way: please read more in the KEP.

@soltysh said we don't want the STS to be stuck in a permanently broken state. Of course. With this KEP, since we are not validating the templates, it is actually very hard to get stuck. A larger PVC is compatible with a smaller template, so just rolling back the template should unblock us from future rollouts, leaving the PVCs in the expanding state, or trying to cancel the expansion if possible. I think the only way we may get stuck is patching the VAC: one replica is successfully updated to the new VAC, but another replica failed, and rolling back the first replica to the old VAC also failed. Even in this case, the user can just set the update policy back to OnClaimDelete to unblock the rollout.

Q3: @kow3ns thinks it is not appropriate to delete and recreate the PVC to alter the performance characteristics of volumes. VAC is the KEP that actually parameterizes the storage class and allows us to specify and update the performance characteristics of volumes without interrupting the running Pod, by patching the existing PVC. So this KEP should also integrate with VAC. The update to the VAC in the volumeClaimTemplates should not require re-creating the PVC, and is fully automated if everything goes well.

Q4: @kow3ns asked how we should handle each field of the volumeClaimTemplates. This is described in the KEP in the "How to update PVCs" part of the "Updated Reconciliation Logic" section. Basically: patch what we can, skip the rest. It seems the wording "update PVC in-place" causes many misunderstandings; I will replace it with "patch PVC".

We didn't actually decide anything during the last meeting. I think these core questions should be decided to push this KEP forward:
The general sentiment from that sig-apps call (see https://docs.google.com/document/d/1LZLBGW2wRDwAfdBNHJjFfk9CFoyZPcIYGWU7R1PQ3ng/edit#heading=h.2utc2e8dj14) was that the smaller changes have a greater chance of moving forward. Also, it's worth noting that the smaller changes do not stand in opposition to the changes proposed here; they only take the gradual approach by focusing on a minimal subset of changes.
Okay, we agreed to focus on the minimal subset of changes. @huww98 and @vie-serendipity will proceed with only
At the last sig-apps meeting, we decided that we should:
But for the validation of the template, I think we still need more discussion. It can be a major blocking point of this KEP. @soltysh thinks that we should not allow decreasing the size in the template. He thinks we can remove the validation later if desired. But I think validation has many drawbacks which may block normal usage of this feature and should be resolved in the initial version:
By contrast, if we just don't add the validation, we can avoid all these issues and lose nothing: the user can expand a PVC independently today, so the state where the template is smaller than the PVC is already very common and stable. The strategy in this state is to not even try to shrink the PVC. I think this is well defined and easy to follow. If Kubernetes ever supports shrinking in the future, we will still need to support drivers that can't shrink. So, even then we can only support shrinking with a new
To take one step back, I think validating the template across resources violates the high-level design. The template describes a desired final state, rather than an immediate instruction. A lot of things can happen externally after we update the template. For example, I have an IaaS platform, which tries to
To conclude, I don't want to add the validation; we shouldn't add it just to remove it in the future.
Agree. By the way,
That's one way of looking at it. Also, in those cases where a mistake happens (I consider that a rare occurrence), you can always use #3335 and migrate to a new, smaller StatefulSet.
@huww98 and @liubog2008, are you planning to push this through for 1.32?
Users have been waiting for many years to be able to scale up StatefulSet volumes. I agree we shouldn't overcomplicate that use case by trying to solve other issues at the same time. Let's focus on the very common use case, and then reevaluate other features after that is completed.
### Kubernetes API Changes

Change API server to allow specific updates to `volumeClaimTemplates` of a StatefulSet:
* `spec.volumeClaimTemplates.spec.resources.requests.storage` (increase only)
For historical reasons, not all PVCs can be expanded, btw. The SC must have `allowVolumeExpansion` set to `true`. What happens if the user increases the size here but the underlying SC doesn't allow it?
Currently this is explicitly blocked for PVCs via admission. I am not recommending we do the same for StatefulSets, but we need a way out. We don't want the StatefulSet controller to be stuck forever, retrying this operation.
With the current limitation, the way out is:
- Change the StatefulSet `volumeClaimUpdatePolicy` back to `OnClaimDelete`, effectively disabling this feature for the STS.
- Migrate to a new STS and delete the problematic one.
There is no technical reason preventing us from reducing the storage size in the template. Maybe we can discuss this with @soltysh again. If reducing the size were supported, we would only need to undo the StatefulSet change.
I am not talking about reducing size per se, but about volume expansion in general being disallowed by the SC.
What perhaps we should do is: if the user changes the template size and the SC doesn't allow it, then the StatefulSet controller should stop reconciling the change and add a warning to the STS. Where it gets tricky is: what if the user changes the pod spec along with the template size?
- If the PVC update fails, we should block the StatefulSet rollout process.
This will also block the creation of the new Pod.
We should detect common cases (e.g. storage class mismatch) and report events before deleting the old Pod.
If this still happens (e.g., because of a webhook), we should retry and report events for this.
The events and status should look like those when Pod creation fails.

My current proposal is adding a best-effort pre-check before deleting the Pod, replicating some of the PVC admission logic.

Another possible solution: update the PVCs before deleting the old Pod, so that even if the PVC update fails, the Pod is not disrupted and the user has enough time to deal with it. In this way, old Pods may briefly see the new volume config before terminating. This should be fine because updating the PVC should be non-disruptive. I prefer this solution now, given the complexity of PVC admission and the quota issue mentioned above.
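To sketch what that ordering could look like from the outside (purely illustrative, not the controller implementation; the StatefulSet is assumed to be named web with a claim template data, namespace and sizes are placeholders), one can emulate it today by patching the claim, waiting for the new capacity to be reported, and only then deleting the Pod so it gets recreated:

```go
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.TODO()
	want := resource.MustParse("20Gi")

	// 1. Patch the claim used by replica 0 toward the new template.
	patch := []byte(`{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}`)
	if _, err := client.CoreV1().PersistentVolumeClaims("default").Patch(
		ctx, "data-web-0", types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
		panic(err)
	}

	// 2. Wait until the claim actually reports the new capacity. If this never
	//    happens (e.g. the SC does not allow expansion), the old Pod keeps running.
	err = wait.PollUntilContextTimeout(ctx, 5*time.Second, 10*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			pvc, err := client.CoreV1().PersistentVolumeClaims("default").Get(ctx, "data-web-0", metav1.GetOptions{})
			if err != nil {
				return false, err
			}
			got := pvc.Status.Capacity[corev1.ResourceStorage]
			return got.Cmp(want) >= 0, nil
		})
	if err != nil {
		panic(err)
	}

	// 3. Only now delete the old Pod; the StatefulSet controller recreates it.
	if err := client.CoreV1().Pods("default").Delete(ctx, "web-0", metav1.DeleteOptions{}); err != nil {
		panic(err)
	}
	fmt.Println("replica 0 updated")
}
```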
I wasn't aware that's the case. But I believe the failure cases in the design details section (below the table) should basically cover this scenario. I'm talking about this wording specifically:
If the PVC update fails, we should block the StatefulSet rollout process. This will also block the creation of the new Pod. We should detect common cases (e.g. storage class mismatch) and report events before deleting the old Pod. If this still happens (e.g., because of a webhook), we should retry and report events for this. The events and status should look like those when Pod creation fails.
We should ensure the case described by Hemant is expressed there.
what if user changes pod spec along with template size?
With the new revision, the old Pod will still be running if the SC doesn't allow expansion.
2. Apply the changes to the PVCs used by this replica.
3. Create the new pod with new `controller-revision-hash` label.
4. Wait for the new pod and PVCs to be ready.
5. Advance to the next replica and repeat from step 1.
So, this works for most storage providers. But there are storage providers which can't expand a volume if the volume is in use by a pod (in the control plane), and yet the volume needs to be mounted for the file system expansion to happen on the node. So online expansion will not always work for all storage providers.
I am not sure how we intend to solve this; we could just say you must scale down your workloads if you want to expand the STS.
I think we just don't support these providers. And this KEP should not break existing workflow.
As explained elsewhere, we'll just stop the rollout and document that case.
So for drivers that only support offline expansion, if the user makes both a pod spec and a volume claim template change, they could have the StatefulSet rollout wedged. At minimum, we need to add appropriate events and error messages that can be surfaced to the user.
I am not sure if the right decision is to leave it at that or whether we should try to improve things. I was discussing a few ideas with @jsafrane and came up with some solutions:
- when starting a pod, let kubelet wait until controller expansion completes (if it can detect so and not break anything)
- add a field to CSIDriver, so the StatefulSet controller knows what kind of expansion it is dealing with and can support both
- add an enum to StatefulSet for what kind of expansion to do (basically add a new value to `volumeClaimUpdatePolicy`)
  - helm chart authors can use safe (= offline) expansion
  - people with local storage knowledge can use online expansion
when starting a pod, let kubelet wait until controller expansion completes (if it can detect so and not break anything)
@gnufied I think this is not enough. According to the CSI spec, offline resize requires the volume not to be controller-published. This means we need to:
- introduce a new mechanism to communicate volume resize capability (ideally at the volume level; new volumes may support online expansion while old volumes may not, for historical reasons).
- change kubelet to wait for expansion before updating Node.status.volumesInUse
- change KCM to wait for node.volumesInUse before creating the VolumeAttachment
And since we cannot fully cancel an expansion, if the expansion fails, we block the Pod from starting until the user deletes and recreates all of the Pod/PVC/PV, which may not be a good user experience.
So I think this is too complex to fit into this KEP.
helm chart authors can use safe (= offline) expansion
Why is offline expansion safer?
@gnufied I think this is not enough.
That should be enough. I am not sure which part is causing that confusion.
And since we cannot fully cancel an expansion, if the expansion failed, we are blocking the pod from starting until user delete and recreate all the Pod/PVC/PV, which may not be a good user experience.
Users who use offline volumes will be wedged right now anyway, even in the successful case. Even for online expansion, if the user updates the pod spec and volumeClaimTemplate and the expansion fails, then the rollout will be wedged.
I do not want to block this KEP on this point, tbh. I think we have had enough iterations of this proposal that it seems unfair to block on this. But it would be useful to think of a path forward for the offline-only expansion case.
Why is offline expansion safer?
I meant offline is safer because it will work for all storage providers.
@gnufied By offline expansion, we mean the following steps are necessary for the expansion:
- delete the old Pod
- CSI NodeUnpublishVolume/NodeUnstageVolume
- CSI ControllerUnpublishVolume
- CSI ControllerExpandVolume
- start the new Pod
- CSI ControllerPublishVolume
- CSI NodeStageVolume/NodePublishVolume/NodeExpandVolume
Is that correct? How can we sequence ControllerExpandVolume after ControllerUnpublishVolume? If the old and new Pods are on the same node, there may be no ControllerUnpublishVolume at all. Maybe we can delay the creation of the new Pod to achieve this. But that still requires adding some new fields to the Kubernetes API, and should belong to a future KEP.
Users who use offline volumes will be wedged right now anyways, even in successful case. Even for online expansion if user updates pod spec and volumeClaimTemplate and expansion fails, then rollout will be wedged.
But that is all at the StatefulSet level. The Pod will always work. If we involve kubelet to wait for something, the Pod will not come up when that goes wrong, which is much worse.
Do you have an example of an SP that only supports offline expansion, so that we can check which operations are really required?
How can we sequence ControllerExpandVolume after ControllerUnpublishVolume? If the old and new Pods are on the same node
You don't have to. external-resizer will keep retrying and will eventually succeed when ControllerUnpublish is called. The catch is, you do not schedule the new pod until the PVC has NodeResizePending status (speaking roughly, there is more nuance to this).
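For orientation, here is a small client-go sketch of how a controller could check whether a claim is still waiting for the node-side part of an expansion before it brings the Pod up. It is only illustrative: it uses the long-standing FileSystemResizePending PVC condition rather than the newer allocatedResourceStatuses reporting gnufied refers to, and the claim name and namespace are placeholders.

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// pendingNodeResize reports whether the claim's controller-side expansion has
// finished and only the node-side (file-system) resize remains.
func pendingNodeResize(pvc *corev1.PersistentVolumeClaim) bool {
	for _, cond := range pvc.Status.Conditions {
		if cond.Type == corev1.PersistentVolumeClaimFileSystemResizePending &&
			cond.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	pvc, err := client.CoreV1().PersistentVolumeClaims("default").Get(
		context.TODO(), "data-web-0", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	// Per the suggestion above, a controller would only (re)create the Pod for
	// this replica once the remaining work is the node-side resize.
	fmt.Println("node-side resize pending:", pendingNodeResize(pvc))
}
```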
Additionally collect the status of managed PVCs, and show them in the StatefulSet status.
Some fields in the `status` are updated to reflect the status of the PVCs:
- claimsReadyReplicas: the number of replicas with all PersistentVolumeClaims ready to use.
This is a new field, right?
Yes
As mentioned in other comments, I'm against adding one more readyreplicas field which will be very confusing to users.
I think we will need a field to expose the PVC readiness, for `kubectl rollout status` to wait for the last replica to be ready before returning. Any idea better than `claimsReadyReplicas`?
With the new revision, I think we don't need `claimsReadyReplicas` now. When the Pod is ready, it is guaranteed that the PVC is ready too, so any existing tools that monitor the StatefulSet rollout process do not need to change.
- `volumeClaimUpdatePolicy` is `InPlace` and the PVC is updating;
- availableReplicas: total number of replicas of which both Pod and PVCs are ready for at least `minReadySeconds`
- currentRevision, updateRevision, currentReplicas, updatedReplicas are updated to reflect the status of PVCs.
+1 to @soltysh
As for changing `currentRevision` etc.: conceptually changing their semantics sounds OK, but I don't fully understand how that works.
You mentioned "since I've added claim templates to ControllerRevision" - how is that added? What does it mean for our upgrade story if we have a ControllerRevision that doesn't contain the PVC template?
@soltysh - I would like to hear your thoughts too about that aspect
@wojtek-t, would you consider this enhancement complete for alpha from a PRR perspective while we finalize some of the implementation details? I am asking this assuming today is the PRR freeze.
PRR freeze does not require it to be approved, but rather to be in reviewable shape.
/label tide/merge-method-squash
Several more comments.
* Allow users to update some fields of `volumeClaimTemplates` of a `StatefulSet`, specifically:
  * increasing the requested storage size (`spec.volumeClaimTemplates.spec.resources.requests.storage`)
  * modifying the VolumeAttributesClass used by the claim (`spec.volumeClaimTemplates.spec.volumeAttributesClassName`)
IIUC, it should not affect quota.
I will add `claimsReadyReplicas` instead.

I will admit that our StatefulSet status already has too many replicas fields, which are confusing to numerous users. Adding one more will only increase that confusion, so I will be strongly against one.
But for currentRevision, updateRevision, currentReplicas, updatedReplicas

Yes, those fields will be affected by these changes, since they directly reflect information about current and updated pods. So it's reasonable that they will cover those changes.

@soltysh - I would like to hear your thoughts too about that aspect

We currently calculate the ControllerRevision based on the entire template of a pod (and a few other fields), so adding volumeClaimTemplates is feasible, although it will be a significant increase, given that this is an array of templates, not a single one, which might be problematic.
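To illustrate why folding the claim templates into the revision produces a new hash (a simplified sketch only; the real controller hashes the serialized ControllerRevision data rather than calling fnv on the structs directly, and the container/claim values below are arbitrary):

```go
package main

import (
	"encoding/json"
	"fmt"
	"hash/fnv"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// hashOf fnv-hashes the JSON encoding of any object. This stands in for the
// controller's revision hashing; it is not the exact upstream algorithm.
func hashOf(obj interface{}) uint32 {
	data, err := json.Marshal(obj)
	if err != nil {
		panic(err)
	}
	h := fnv.New32a()
	h.Write(data)
	return h.Sum32()
}

func main() {
	set := appsv1.StatefulSet{}
	set.Spec.Template.Spec.Containers = []corev1.Container{{Name: "web", Image: "nginx:1.27"}}
	set.Spec.VolumeClaimTemplates = []corev1.PersistentVolumeClaim{
		{ObjectMeta: metav1.ObjectMeta{Name: "data"}},
	}

	// Today only the pod template (plus a few other fields) feeds the hash,
	// so editing volumeClaimTemplates alone does not produce a new revision.
	fmt.Println("pod template only:", hashOf(set.Spec.Template))

	// If the claim templates are hashed as well, any edit to them yields a
	// new revision and therefore triggers a rollout.
	fmt.Println("pod + claim templates:", hashOf(struct {
		Template       corev1.PodTemplateSpec
		ClaimTemplates []corev1.PersistentVolumeClaim
	}{set.Spec.Template, set.Spec.VolumeClaimTemplates}))
}
```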
We could consider expanding the ControllerRevision not with the whole volumeClaimTemplate but only with the modifiable fields listed in this document. This way we'll ensure the size constraints are not stretched too thin.
* `spec.volumeClaimTemplates.metadata.labels`
* `spec.volumeClaimTemplates.metadata.annotations`

Introduce a new field in StatefulSet `spec`: `volumeClaimUpdatePolicy` to
Apologies for not catching this sooner, but upon reviewing the API types once again, I realized we already have `.spec.persistentVolumeClaimRetentionPolicy`, which allows the user to define what happens during PVC deletion and scaling. I believe we should re-use that field and add `WhenUpdated` as a 3rd supported policy. So we'd have:
type StatefulSetPersistentVolumeClaimRetentionPolicy struct {
    // existing fields
    WhenDeleted PersistentVolumeClaimRetentionPolicyType
    WhenScaled  PersistentVolumeClaimRetentionPolicyType
    // new field
    WhenUpdated PersistentVolumeClaimRetentionPolicyType
}

type PersistentVolumeClaimRetentionPolicyType string

const (
    // existing consts
    RetainPersistentVolumeClaimRetentionPolicyType  PersistentVolumeClaimRetentionPolicyType = "Retain"
    DeletePersistentVolumeClaimRetentionPolicyType  PersistentVolumeClaimRetentionPolicyType = "Delete"
    // new constant
    InPlacePersistentVolumeClaimRetentionPolicyType PersistentVolumeClaimRetentionPolicyType = "InPlace"
)
This approach would nicely allow users to request appropriate operations based on their demand. Even more, we'd expand our already available `Delete` policy with a new use-case.
I guess it could make sense to reuse the already existing retention policy field, but not all combinations would be useful. Can we make a table to see the valid values for each field?
Hm, according to #4651 (comment) it seems we might need a struct instead of an enum for the WhenUpdated.
I don't like this idea. The word `Retention` seems to mean whether to delete the PVC or not. Adding `InPlace` to this is strange.

And as stated in the non-goal:

Support automatic re-creating of PersistentVolumeClaim. We will never delete a PVC automatically.

we don't want to automatically delete the PVC on update; it is too dangerous for the data in the volume. So `Delete` will not be valid for `WhenUpdated`. Naturally, `InPlace` is not applicable to `WhenDeleted`/`WhenScaled`, which do not have a new version to update. So only `Retain` is common, which basically means no-op.

On the other side, I think we may reserve `WhenUpdated` for future use. When we support adding/removing volumeClaimTemplates from a StatefulSet, we can use `WhenUpdated` to control whether we should delete the PVCs corresponding to a removed template.

So I still prefer the original design.
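For comparison, the original design referred to here would look roughly like the following. This is my own sketch of the shape described in the KEP discussion (a dedicated `volumeClaimUpdatePolicy` field with `OnClaimDelete` and `InPlace` values); the Go identifiers and field placement are guesses, not reviewed API:

```go
// Sketch of the originally proposed API: a dedicated update policy field,
// separate from persistentVolumeClaimRetentionPolicy. Names are illustrative.
package apps

// VolumeClaimUpdatePolicyType selects how edits to volumeClaimTemplates are
// propagated to the PVCs the StatefulSet already owns.
type VolumeClaimUpdatePolicyType string

const (
	// OnClaimDelete keeps today's behavior: an existing PVC only picks up the
	// new template after the user deletes it and it is recreated.
	OnClaimDeleteVolumeClaimUpdatePolicy VolumeClaimUpdatePolicyType = "OnClaimDelete"
	// InPlace lets the controller patch existing PVCs to match the template.
	InPlaceVolumeClaimUpdatePolicy VolumeClaimUpdatePolicyType = "InPlace"
)

// StatefulSetSpecFragment shows only the proposed field.
type StatefulSetSpecFragment struct {
	// VolumeClaimUpdatePolicy defaults to OnClaimDelete, which also acts as a
	// per-StatefulSet opt-in switch for the new behavior.
	VolumeClaimUpdatePolicy VolumeClaimUpdatePolicyType `json:"volumeClaimUpdatePolicy,omitempty"`
}
```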
I am afraid I agree with @huww98. Since the original type has the word `Retention` in it, it is hard to reuse it for expansion purposes when the former defines the retention policy of the PVC vis-à-vis the StatefulSet.
I don't like this idea. The word `Retention` seems to mean whether to delete the PVC or not. Adding `InPlace` to this is strange.

Based on this definition, I see this as a good fit. Also, expanding this field will make it easier to ensure:

1. Compatibility with that feature. This is also what @atiratree mentions, about adding the table listing all the possible combinations and how they will interact.
2. Simplicity of the API; we're building on what we have, rather than expanding the API surface, which might cause unnecessary confusion.

we don't want to automatically delete the PVC on update; it is too dangerous for the data in the volume. So `Delete` will not be valid for `WhenUpdated`. Naturally, `InPlace` is not applicable to `WhenDeleted`/`WhenScaled`, which do not have a new version to update. So only `Retain` is common, which basically means no-op.

See my no. 1 above for why combining the two will be beneficial for both of the features.
Compatibility with that feature.

I'm not sure I understand. That one is about deleting PVCs, while this one is about updating PVCs. How are they related?

about adding the table listing all the possible combinations and how they will interact.

When we need such a table, we are already making the API too complex. And IMO these two features are orthogonal; they don't interact with each other.

Simplicity of the API

I'm not sure. Both proposals add a new field. While my original proposal has only 2 possible values for the new field, your proposal has 3 (with 1 invalid). I'd say my proposal is actually simpler.

And how about my point above about reserving it for future use?
Change API server to allow specific updates to `volumeClaimTemplates` of a StatefulSet:
* `spec.volumeClaimTemplates.spec.resources.requests.storage` (increase only)
* `spec.volumeClaimTemplates.spec.volumeAttributesClassName`
Do we want to support this feature when it is still disabled by default? https://kubernetes.io/docs/concepts/storage/volume-attributes-classes/. Maybe it would be better to start simpler; with the size first.
The proposal uses server-side apply for the whole PVC, so we do not need a special case for each field; it naturally supports every mutable field.
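To make the server-side apply idea concrete, here is a minimal sketch of applying the whole desired claim under a single field manager (illustrative only; the claim contents and the manager name are placeholders, not the controller's actual implementation):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The full desired claim, rendered from the (updated) template. Fields this
	// manager does not own are left untouched by server-side apply.
	desired := []byte(`{
	  "apiVersion": "v1",
	  "kind": "PersistentVolumeClaim",
	  "metadata": {"name": "data-web-0", "namespace": "default"},
	  "spec": {
	    "accessModes": ["ReadWriteOnce"],
	    "resources": {"requests": {"storage": "20Gi"}},
	    "volumeAttributesClassName": "fast-io"
	  }
	}`)

	force := true
	pvc, err := client.CoreV1().PersistentVolumeClaims("default").Patch(
		context.TODO(), "data-web-0", types.ApplyPatchType, desired,
		metav1.PatchOptions{FieldManager: "statefulset-claim-updater", Force: &force})
	if err != nil {
		panic(err)
	}
	fmt.Println("applied", pvc.Name)
}
```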
This might be problematic. I was under the impression that the other feature (being beta) is on by default; that's my bad, I hadn't checked it. Now, looking at #3751, I don't see a clear path for that functionality going forward, so I'd be inclined to hold off on this until we have clear information on why that is so.
KEP-3136 is why it is off by default. And if everything goes well, it will become GA in 1.34.
Anyway, that KEP should be orthogonal to this one. If that one is enabled, we will support it; if not, we will work on our own.
If it is enabled/reaches stable in this release, then we are fine. Can we please track it in the KEP under alpha?
I am not entirely convinced by the idea of disabling `rollout undo` when we increase the size.

What do you think about exploring the `rollout undo` story further? IMO when somebody changes the image and size, they might still want to revert the image, but keep the PVC with the new size.

Some questions:
- Do we need to include the PVC in the revision? E.g. the case of somebody updating the PVC externally? Also, if you cannot roll back, what is the purpose of the revisions?
- Would it make sense to track the PVC revisions separately?
- Would it make sense to have different kinds of `undo` and add support for them in kubectl?
A new revision is uploaded: the order of PVC/Pod updates is changed. We now update the PVC and wait for it to be ready before deleting the old Pod:
Currently unresolved discussions (correct me if I missed something):
The enhancement freeze is close. Can we merge this PR and continue to discuss and refine in future PRs, as suggested in the KEP template?
Additionally collect the status of managed PVCs, and show them in the StatefulSet status.
Some fields in the `status` are updated to reflect the status of the PVCs:
It would be useful (as mentioned in the thread below) to write the full API documentation in the KEP to see what API changes we plan to advertise to the users.
The only planned change is already listed below:
- currentRevision, updateRevision, currentReplicas, updatedReplicas are updated to reflect the status of PVCs.
No new fields in status are planned.
Even if the fields stay the same, their meaning will change. The meaning is also part of the API. Nevertheless, the API review will be easier if you include the API comments/docs in the KEP.
Added
With these changes, user can still use `kubectl rollout status` to monitor the update process,
both for automated patching and for the PVCs that need manual intervention.

A PVC is considered ready if:
Are we going to use the PVC readiness term in the API? Btw, what about the PVC phase?
I think no. I only planned to use the `ready` term in this KEP doc, not in the API.
- currentRevision, updateRevision, currentReplicas, updatedReplicas are updated to reflect the status of PVCs.

With these changes, user can still use `kubectl rollout status` to monitor the update process,
Are we going to use the PVC readiness in `kubectl rollout status`? What changes are required in kubectl?
I was planning to add a new field `status.claimReadyReplicas` and use that in `kubectl rollout status`. But that was rejected by @soltysh. The current plan is to wait for the PVC to be ready before updating the Pod, so Pod ready will imply PVC ready, and no changes are required in kubectl. Please see "Alternatives > Order of Pod / PVC updates" in the KEP doc for more.
PVCs can also be observed in kubectl and contribute to the status.
If `volumeClaimUpdatePolicy` is `OnClaimDelete`, nothing changes. This field acts like a per-StatefulSet feature-gate.
The changes described below applies only for `InPlace` policy.
This sounds like we do not plan to gate the feature. What about this?
Suggested change:
- If `volumeClaimUpdatePolicy` is `OnClaimDelete`, nothing changes. This field acts like a per-StatefulSet feature-gate.
- The changes described below applies only for `InPlace` policy.
+ If the `volumeClaimUpdatePolicy` field is set to `OnClaimDelete`, nothing changes.
+ To opt in to the new behavior, the `inPlace` policy should be used.
+ This new behaviour is described below.
Include `volumeClaimTemplates` in the `ControllerRevision`.

Since modifying `volumeClaimTemplates` will change the hash,
Add support for updating `controller-revision-hash` label of the Pod without deleting and recreating the Pod,
Do we expect to support changing `InPlace` to `OnClaimDelete`?

What are the considerations? Do we have to update the revisions retrospectively? What happens during a rollback?

Are we going to start tracking volumes in revisions when `OnClaimDelete` is used?

Maybe it would be better to not allow changes to the volumeClaimUpdatePolicy field, only during creation of the StatefulSet - this would remove quite a few headaches from managing the revisions, IMO.
Do we expect to support changing the InPlace to OnClaimDelete?
What are the considerations?

Yes. I expect users to change to `OnClaimDelete` to escape from any PVC update failure. And users always change their mind; `OnClaimDelete` has its own use case where the user really wants each of the PVCs to look different.

Do we have to update the revisions retrospectively?

I'm not sure I understood this. Will these words from the KEP answer your question?

Note that when Pod is at revision B but PVC is at revision A, we will not update PVC. Such state can only happen when user set `volumeClaimUpdatePolicy` to `InPlace` when the feature-gate of KCM is disabled, or disable the previously enabled feature-gate. We require user to initiate another rollout to update the PVCs, to avoid any surprise.

What happens during a rollback?

What is rolled back? The behavior when changing between `InPlace` and `OnClaimDelete` is already described in the doc. A `volumeClaimTemplate` rollback works just like a normal update.

Are we going to start tracking volumes in revisions when OnClaimDelete is used?

No, because `OnClaimDelete` will be the default value, and it is set automatically for all StatefulSets when the feature-gate is enabled. If we start tracking volumes immediately, we will update all StatefulSets and all Pods under any StatefulSet at once. I think this introduces the risk of overloading the control-plane.

Maybe it would be better to not allow changes to the volumeClaimUpdatePolicy field. Only during creation of the StatefulSets - this would remove quite a few headaches from managing the revisions IMO.

I'd like to enable existing StatefulSets to take advantage of the new feature. And what are the headaches? I'd expect the current plan to handle everything smoothly.
Yes. I expect user to change to OnClaimDelete to escape from any PVC update failure
Reverting to OnClaimDelete will not help. Updating or reverting the claim templates will.
And users always change their mind, OnClaimDelete has its own use case where user really want each of the PVCs looks different.
Yes this is useful, but we will have to figure out how to support that.
We require user to initiate another rollout to update the PVCs, to avoid any surprise.
I am not sure here. I think the bigger surprise would be if we do not update the PVC when `InPlace` is used.
What is rolled back? Behavior when change between InPlace and OnClaimDelete is already described in the doc. volumeClaimTemplate rollback works just like a normal update.
The StatefulSet to an older revision. I think it would be good to go over some of these scenarios as they might be surprising and not just a simple update.
I'd like to enable existing StatefulSet to taking advantage of the new feature. And what is the headaches? I'd expect the current plan will handle everything smoothly.
Rev 1:
- StatefulSet with OnClaimDelete on node A (tracked)
- PVC on node A (untracked)

Rev 2:
- StatefulSet with InPlace on node A (tracked)
- PVC on node A (tracked)

Rev 3:
- StatefulSet with InPlace on node B (tracked)
- PVC on node B (tracked)

Reverting to Rev 1 will make the Pod unschedulable.
Reverting to OnClaimDelete will not help. Updating or reverting the claim templates will.

I agree that reverting the claim templates will be very useful. But reverting claim templates will not always recover. For VolumeAttributesClass, when reverting the template, we have to revert the already-modified volumeAttributesClassName in the PVC, which can be slow and can also fail. However, reverting to OnClaimDelete should always unblock the Pod rollout immediately.

I think the bigger surprise would be if we do not update the PVC when InPlace is used.

This is a design decision to make. Consider that when using this feature, while rolling out Ver B from Ver A, the user finds something wrong and turns off the feature gate. Now we should not touch the PVCs, and we continue the Pod rollout to Ver B. `status.currentRevision` will indicate that the rollout to Ver B is finished. Now the user enables the feature-gate again. If we continue to update PVCs at Ver A to Ver B, we will have no StatefulSet status to track the updates, since the status is already at Ver B.

Note this will only happen when a rollout happened with the KCM feature-gate disabled and volumeClaimUpdatePolicy set to InPlace, which should be very rare. The more common use-case, where the user updates the claimTemplate before changing volumeClaimUpdatePolicy to InPlace, will work as you expected. The new claimTemplate changes will be rolled out, since changing volumeClaimUpdatePolicy to InPlace will add the template to the ControllerRevision, and we will get a new revision hash to roll out.

Reverting to Rev1 will make the pod unschedulable.

Currently, the nodeAffinity of a PV is immutable, so what you describe cannot happen. Although I'd like to enable it in KEP-5381.

But I understand your request, and I also thought about this, but didn't come up with a better solution.

Do you think it is acceptable to update all Pods under any StatefulSet at once?

Or we can make this tri-state:
- empty/nil: the default; preserve the current behavior.
- `OnClaimDelete`: add volumeClaimTemplate to the history, but don't update PVCs
- `InPlace`: add volumeClaimTemplate to the history, and also update PVCs in-place

There is a unique challenge if we track claim templates when `OnClaimDelete` is set. When updating from OnClaimDelete to InPlace, the revision hash will not change, so a rollout is not triggered; the user will need to modify the claim template again to trigger the rollout. We can resolve this by also adding volumeClaimUpdatePolicy to the `ControllerRevision`. But none of the policies we already have are present in the `ControllerRevision`, so this is not ideal either.

I admit that the `kubectl rollout undo` behavior will become more surprising in the current design. But that is just a convenient method to update the StatefulSet. The user can always do the update manually.

So I conclude that the current design is the best. What do you think? Added a new section in Alternatives to discuss this.
Naturally, most of the update control logic also applies to PVCs.
* If `updateStrategy` is `RollingUpdate`, update the PVCs in the order from the largest ordinal to the smallest.
* If `updateStrategy` is `OnDelete`, only update the PVCs if the Pod is deleted manually.
I would vote for not updating the PVCs when OnDelete is used, since this is legacy behavior.
FYI: revisions do not work in OnDelete, and kubernetes/kubernetes#122272 should be fixed first before any work is done here. Can we please track it in the KEP?
OK, I think I need to gain a deeper understanding of this by reading the code.
Updated the KEP: we will not update the PVCs when OnDelete is used.
When updating volumeClaimTemplates along with pod template, we will go through the following steps:
1. Apply the changes to the PVCs used by this replica.
2. Wait for the PVCs to be ready.
3. Delete the old pod.
Shouldn't we delete the old pod first?
Please see "Alternatives > Order of Pod / PVC updates" in the KEP doc.
OK, makes sense. Anyway, `volumeClaimUpdatePolicy` should be a struct to support customizations of the update process in the future.
Will make the change
| Pod revision | PVC revision | Action |
| --- | --- | --- |
| at revision A | not existing | create PVC at revision B |
| at revision A | at revision A | update PVC to revision B |
| at revision A | at revision B | wait for PVC to be ready, then delete Pod or update Pod label |
the same question about the order applies here
Please see "Alternatives > Order of Pod / PVC updates" in the KEP doc.
or disable the previously enabled feature-gate.
We require user to initiate another rollout to update the PVCs, to avoid any surprise.

When `volumeClaimUpdatePolicy` is updated from `OnClaimDelete` to `InPlace`,
As mentioned before. Do we have to support that?
I still think it deserves a discussion: #4651 (comment)
We require user to initiate another rollout to update the PVCs, to avoid any surprise.

When `volumeClaimUpdatePolicy` is updated from `OnClaimDelete` to `InPlace`,
StatefulSet controller will begin to add claim templates to ControllerRevision,
what about the history and rollbacks?
This will create a new `ControllerRevision`, so a new line in the history. For rollback, `kubectl rollout undo` will be a no-op, since `volumeClaimUpdatePolicy` is not included in the `ControllerRevision`. The user can update it back to `OnClaimDelete` manually to roll back.
I would expect the rollout undo/rollback to work even when `volumeClaimUpdatePolicy` is not present and is just defaulted by the API.
If `volumeClaimUpdatePolicy` is not set, undoing volumeClaimTemplates will not trigger an automated rollout, so it is not very useful anyway. This is discussed in detail in the new section "When to track volumeClaimTemplates in ControllerRevision".
I would like to see a way this can fail fast for drivers that only support offline expansion. We probably do not want users to attempt a StatefulSet update that will get wedged because volumes can't be expanded offline. We may not have to block the KEP for that; it can be discussed during API review as well.
I am curious why we decided to store
I think it would be preferable to design the feature with
Given the lack of SIG agreement up until now, it's probably too late to get it into 1.34.
@gnufied That can be hard. Kubernetes currently is not aware of whether a volume supports online resize.
As described in the other comment, I need to add the PVC into the revision, so that I get a new revision when the claim templates are updated and can trigger the rollout, reusing the existing Pod rollout infrastructure for PVCs. Besides the above reason, we can still undo other fields of the claim templates, and we may also support undoing a size change in the future. And it is useful when creating new PVCs. To quote the KEP:
Pod may be deleted externally, e.g. evicted.
I reviewed the StatefulSet controller code recently and made these changes: