Rolling a cluster from Kubernetes 1.30 to 1.31 gets stuck in a validation loop when new nodes are added to the cluster via CAS/Karpenter after kops update cluster completes #16907
Some options discussed in office hours:
We'll likely start with the first option and see how the ergonomics of the second option feel, given that it depends on the first option. In either case we'll add upgrade instructions to the release notes for this new behavior.
/kind blocks-next
Another option, and possibly a more correct one, would be to enforce the version skew policy (https://kubernetes.io/releases/version-skew-policy/#kubelet). As such, the userdata for instance groups shouldn't be updated until the control plane has already been rolled out to the newer version, ensuring that we never have nodes coming up with a kubelet version that is more recent than any control-plane node. E.g., kops update cluster would:
In this situation:
Optionally, similar to the suggestion above, a flag for going through the whole procedure, like --sync or --wait, could:
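To make the invariant in that proposal concrete, here is a rough read-only check (a sketch, not kOps code; it assumes kubectl access and the standard node-role.kubernetes.io/control-plane label) of the condition that must hold throughout the roll:

# Oldest kubelet version among control-plane nodes.
kubectl get nodes -l node-role.kubernetes.io/control-plane \
  -o jsonpath='{range .items[*]}{.status.nodeInfo.kubeletVersion}{"\n"}{end}' | sort -V | head -n1

# Newest kubelet version among worker nodes.
kubectl get nodes -l '!node-role.kubernetes.io/control-plane' \
  -o jsonpath='{range .items[*]}{.status.nodeInfo.kubeletVersion}{"\n"}{end}' | sort -V | tail -n1

# The second value must never be newer than the first; the proposal above keeps
# node userdata unchanged until that holds for the target version.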
/kind bug
1. What kops version are you running? The command kops version will display this information.
1.31.0-alpha.1
2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.
Upgrading from 1.30.5 to 1.31.1.
3. What cloud provider are you using?
AWS
4. What commands did you run? What is the simplest way to reproduce this issue?
Update the cluster kubernetesVersion and then run (a reproduction sketch follows below):
kops update cluster
kops rolling-update cluster
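A minimal reproduction sketch of that sequence, assuming an existing kOps cluster with Karpenter-managed instance groups and the versions from this report:

# Set spec.kubernetesVersion: 1.31.1 in the cluster spec.
kops edit cluster

# Push the new configuration; this also updates the node userdata that
# Karpenter uses for newly launched instances.
kops update cluster --yes

# Start replacing nodes. Any 1.31 worker that Karpenter launches while the
# control plane is still on 1.30 fails validation, and the roll loops until it
# times out.
kops rolling-update cluster --yes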
5. What happened after the commands executed?
The rolling-update got stuck in a validation loop and eventually timed out, because pods on the new worker nodes created by Karpenter after kops update cluster failed to start, as described in kubernetes/kubernetes#127316.
6. What did you expect to happen?
It would have been great if the rolling update completed without errors.
7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.
The only relevant part here is having Karpenter enabled and then upgrading the Kubernetes version to 1.31.1.
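For illustration, a minimal excerpt of the relevant manifest fields (an assumed sketch, not the reporter's actual manifest; field names follow the kOps Cluster and InstanceGroup APIs):

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
spec:
  kubernetesVersion: 1.31.1
  karpenter:
    enabled: true
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
spec:
  manager: Karpenter
  role: Node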
8. Please run the commands with most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.
The rolling-update validation loop outputs messages like this over and over:
Upon describing one of those pods:
9. Anything else we need to know?
It should be possible to work around this issue by pausing autoscaling before kops update cluster until after kops rolling-update cluster has replaced all of the control-plane nodes, or with judicious use of kops rolling-update cluster --cloudonly.
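One way the pausing workaround could look in practice (a sketch under assumptions, not a verified procedure: the Karpenter deployment name/namespace, the replica count, and the role value passed to --instance-group-roles may differ per cluster and kOps release):

# Stop Karpenter from launching nodes with the new (1.31) userdata.
kubectl -n kube-system scale deployment karpenter --replicas=0

# Apply the new kubernetesVersion to the cloud resources.
kops update cluster --yes

# Replace only the control-plane instance groups first.
kops rolling-update cluster --instance-group-roles=control-plane --yes

# Once every control-plane node runs 1.31, let Karpenter resume
# (restore the original replica count).
kubectl -n kube-system scale deployment karpenter --replicas=1

# Finish rolling the remaining worker instance groups.
kops rolling-update cluster --yes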