[e2e test flake] Failed to run clusterctl move...failed calling webhook #11856
Maybe this is just me being paranoid and seeing a pattern where one doesn't exist. These errors are just how it looks when a cluster doesn't come up. It's just weird to see three new failure types in three days on three separate tests.
The test does an upgrade of the cluster. It looks like the bootstrap controller pod was once again not able to start after the upgrade:
To note: the pod was scheduled to the node. The node's kubelet logs show that images were removed because a disk usage threshold was reached. That should explain why we hit the ImagePullBackOff.
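For background on the mechanism: the behaviour in those kubelet logs is the kubelet's image garbage collection, which is driven by two percentage thresholds in its KubeletConfiguration. Below is a minimal sketch of the relevant knobs, assuming the stock kubelet defaults of 85% (start collecting) and 80% (stop collecting); it is only an illustration, not a proposed change.

```go
// Minimal sketch (assumption: stock kubelet defaults). When usage of the
// image filesystem crosses ImageGCHighThresholdPercent, the kubelet deletes
// unused images until usage is back below ImageGCLowThresholdPercent.
package main

import (
	"fmt"

	kubeletconfigv1beta1 "k8s.io/kubelet/config/v1beta1"
	"k8s.io/utils/ptr"
)

func main() {
	cfg := kubeletconfigv1beta1.KubeletConfiguration{
		// Kubelet defaults: start image GC at 85% usage, stop at 80%.
		ImageGCHighThresholdPercent: ptr.To[int32](85),
		ImageGCLowThresholdPercent:  ptr.To[int32](80),
	}
	fmt.Printf("image GC starts at %d%% usage and frees space down to %d%%\n",
		*cfg.ImageGCHighThresholdPercent, *cfg.ImageGCLowThresholdPercent)
}
```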
Wondering under which circumstances we can hit this. Maybe if we run too many tests, or the wrong combination of tests, in parallel within a ProwJob Pod?
I guess it's probably due to the Prow worker node's host filesystem. I'm trying to confirm that in #11862.
(Confirmed): The relevant filesystem comes from the Prow node the test is running on. Also, the normal thresholds might not make sense for those nodes because their disks are pretty huge. Example:
In that test case, garbage collection would kick in at the latest at 85% (?) disk usage, at which point there would still be 42 GB of free space.
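To make those numbers concrete, here is a quick back-of-the-envelope check, assuming the ~280 GB disk size mentioned further down and the kubelet's default 85% high threshold:

```go
// Back-of-the-envelope check (assumptions: 280 GB disk as mentioned below,
// kubelet default imageGCHighThresholdPercent of 85): how much space is
// still free when image garbage collection kicks in.
package main

import "fmt"

func main() {
	const diskGB = 280.0            // size of the Prow node's disk (from the thread)
	const gcHighThresholdPct = 85.0 // kubelet default imageGCHighThresholdPercent

	freeAtGC := diskGB * (100 - gcHighThresholdPct) / 100
	fmt.Printf("image GC starts with about %.0f GB still free\n", freeAtGC) // ~42 GB
}
```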
Just to make sure I understand: this is from the Prow node running multiple jobs, accumulating too many images, and then cleaning up images we still need? And pinning would help make sure images we're still using stay around? Would we need to unpin when done?
It's a layered problem.
First layer: the Prow node.
Then we have the next layer: our Prow test container.
Then we have another layer: the clusters we create using CAPD and kind.
What they all have in common is the size of the filesystem: on every level the same underlying filesystem is used, and we only have limited access to it (overlayfs / the way containers do their thing), but the size shown is always that of the real disk (280 GB). In our case:
I hope this makes it a bit clearer :-) Let me know if I can clarify some more.
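A small illustration of that shared-filesystem point: a statfs call against `/` reports the size of the backing filesystem, so a sketch like the one below would report roughly the same ~280 GB total whether it runs on the Prow node, inside the ProwJob container, or inside a CAPD/kind node container, assuming they all sit on the host's disk via overlayfs as described above (Linux-only sketch, not something the thread actually ran).

```go
// Illustration (assumption: the containers are backed by the host's
// filesystem via overlayfs): statfs on "/" reports the host disk's size
// at every layer, even though each layer only has limited access to it.
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	var st unix.Statfs_t
	if err := unix.Statfs("/", &st); err != nil {
		panic(err)
	}
	totalGB := float64(st.Blocks) * float64(st.Bsize) / (1 << 30)
	freeGB := float64(st.Bavail) * float64(st.Bsize) / (1 << 30)
	fmt.Printf("total: %.0f GiB, available: %.0f GiB\n", totalGB, freeGB)
}
```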
Thanks, that helps. So options seem to include:
We need to pin images at the layer where we pull them, so in our case it would be inside the CAPD Clusters where we already set them as
I would not like to go down the road of changing the defaults here.
They do, but with their own settings, which make sense for them. I agree others might hit the same issue (creating a kind cluster, using
Also, it looks like kind is already doing this (but for the images it builds into a node image): https://github.com/kubernetes-sigs/kind/blob/main/pkg/build/nodeimage/imageimporter.go#L80 Maybe we can do this in CAPD too :-) Another alternative would be to re-load the images onto the machines (via CAPD).
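For reference, a rough sketch of what pinning could look like against a node's containerd, modelled on the `io.cri-containerd.pinned=pinned` label that the linked kind code path relies on. The socket path and image ref below are placeholders, and this is an illustration rather than an actual CAPD change.

```go
// Rough sketch (assumptions: socket path and image ref are placeholders).
// Setting the io.cri-containerd.pinned=pinned label marks the image as
// pinned in the CRI, so image garbage collection in recent Kubernetes
// versions will not delete it.
package main

import (
	"context"
	"log"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/namespaces"
)

func pinImage(ctx context.Context, client *containerd.Client, ref string) error {
	store := client.ImageService()
	img, err := store.Get(ctx, ref)
	if err != nil {
		return err
	}
	if img.Labels == nil {
		img.Labels = map[string]string{}
	}
	img.Labels["io.cri-containerd.pinned"] = "pinned"
	// Only update the single label field, leaving everything else untouched.
	_, err = store.Update(ctx, img, "labels.io.cri-containerd.pinned")
	return err
}

func main() {
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Images pulled through the CRI live in the k8s.io namespace.
	ctx := namespaces.WithNamespace(context.Background(), "k8s.io")
	// Placeholder image ref for illustration only.
	if err := pinImage(ctx, client, "registry.k8s.io/cluster-api/kubeadm-bootstrap-controller:v1.9.0"); err != nil {
		log.Fatal(err)
	}
}
```

Recent kubelet versions skip pinned images during image garbage collection; removing the label again would effectively unpin the image, which is what the unpinning question above would come down to.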
/triage accepted
Note: maybe we should make sig-k8s-infra aware that this could happen, just in case others hit the same issue, so they have a quick answer on why it happens :-)
Edit: did so: https://kubernetes.slack.com/archives/CCK68P2Q2/p1740060087478159
Which jobs are flaking?
periodic-cluster-api-e2e-main
periodic-cluster-api-e2e-mink8s-main
periodic-cluster-api-e2e-mink8s-release-1-9
(so far...every day seems to be a new job)
Which tests are flaking?
capi-e2e [It] When testing Cluster API working on single-node self-hosted clusters using ClusterClass [ClusterClass] Should pivot the bootstrap cluster to a self-hosted cluster [ClusterClass]
capi-e2e [It] When testing clusterctl upgrades using ClusterClass (v1.9=>current) on K8S latest ci mgmt cluster [ClusterClass] Should create a management cluster and then upgrade all the providers [ClusterClass]
capi-e2e [It] When testing Cluster API working on self-hosted clusters using ClusterClass with a HA control plane [ClusterClass] Should pivot the bootstrap cluster to a self-hosted cluster
Two of these are failures during clusterctl move; one is during an attempt to scale a MachineDeployment.
Since when has it been flaking?
2/15/2024
Testgrid link
https://storage.googleapis.com/k8s-triage/index.html?text=failed%20to%20call%20webhook&job=.*cluster-api.*(test%7Ce2e)*&xjob=.*-provider-.*
Reason for failure (if possible)
Anything else we need to know?
No response
Label(s) to be applied
/kind flake
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.