[e2e test flake] Failed to run clusterctl move...failed calling webhook #11856

Open
cprivitere opened this issue Feb 17, 2025 · 11 comments

@cprivitere
Member

Which jobs are flaking?

periodic-cluster-api-e2e-main
periodic-cluster-api-e2e-mink8s-main
periodic-cluster-api-e2e-mink8s-release-1-9

(so far...every day seems to be a new job)

Which tests are flaking?

capi-e2e [It] When testing Cluster API working on single-node self-hosted clusters using ClusterClass [ClusterClass] Should pivot the bootstrap cluster to a self-hosted cluster [ClusterClass]
capi-e2e [It] When testing clusterctl upgrades using ClusterClass (v1.9=>current) on K8S latest ci mgmt cluster [ClusterClass] Should create a management cluster and then upgrade all the providers [ClusterClass]
capi-e2e [It] When testing Cluster API working on self-hosted clusters using ClusterClass with a HA control plane [ClusterClass] Should pivot the bootstrap cluster to a self-hosted cluster

Two of these are failures in clusterctl move; one is a failure while attempting to scale a machine deployment.

Since when has it been flaking?

2/15/2025

Testgrid link

https://storage.googleapis.com/k8s-triage/index.html?text=failed%20to%20call%20webhook&job=.*cluster-api.*(test%7Ce2e)*&xjob=.*-provider-.*

Reason for failure (if possible)

Failed to run clusterctl move
Expected success, but got an error:
    <errors.aggregate | len:2, cap:2>: 
    [action failed after 10 attempts: error creating "addons.cluster.x-k8s.io/v1beta1, Kind=ClusterResourceSet" self-hosted-gdy22l/self-hosted-b8ee9c-crs-0: Internal error occurred: failed calling webhook "validation.clusterresourceset.addons.cluster.x-k8s.io": failed to call webhook: Post "https://capi-webhook-service.capi-system.svc:443/validate-addons-cluster-x-k8s-io-v1beta1-clusterresourceset?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority, action failed after 10 attempts: error creating "cluster.x-k8s.io/v1beta1, Kind=ClusterClass" self-hosted-gdy22l/quick-start: Internal error occurred: failed calling webhook "validation.clusterclass.cluster.x-k8s.io": failed to call webhook: Post "https://capi-webhook-service.capi-system.svc:443/validate-cluster-x-k8s-io-v1beta1-clusterclass?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority]
Failed to scale machine deployment topology md-0
Expected success, but got an error:
    <*errors.withStack | 0xc001809830>: 
    failed to patch Cluster clusterctl-upgrade/clusterctl-upgrade-workload-kmb46o: Internal error occurred: failed calling webhook "validation.cluster.cluster.x-k8s.io": failed to call webhook: Post "https://capi-webhook-service.capi-system.svc:443/validate-cluster-x-k8s-io-v1beta1-cluster?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority
    {
[FAILED] Failed to run clusterctl move
Expected success, but got an error:
    <errors.aggregate | len:1, cap:1>: 
    action failed after 10 attempts: error adding delete-for-move annotation from "bootstrap.cluster.x-k8s.io/v1beta1, Kind=KubeadmConfig" self-hosted-ftu63g/self-hosted-43qx90-md-0-mwnng-mbk5w-jqntj: Internal error occurred: failed calling webhook "default.kubeadmconfig.bootstrap.cluster.x-k8s.io": failed to call webhook: Post "https://capi-kubeadm-bootstrap-webhook-service.capi-kubeadm-bootstrap-system.svc:443/mutate-bootstrap-cluster-x-k8s-io-v1beta1-kubeadmconfig?timeout=10s": dial tcp 10.128.51.31:443: connect: connection refused

Anything else we need to know?

No response

Label(s) to be applied

/kind flake
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

@k8s-ci-robot k8s-ci-robot added kind/flake Categorizes issue or PR as related to a flaky test. needs-priority Indicates an issue lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 17, 2025
@cprivitere
Member Author

Maybe this is just me being paranoid and seeing a pattern where one doesn't exist. These errors are just how it looks when a cluster doesn't come up. It's just weird to see three new failure types in three days on three separate tests.

@chrischdi
Member

The test does an upgrade of the cluster.

Looks like the bootstrap controller pod was not able to start again after the upgrade:

    state:
      waiting:
        message: 'Back-off pulling image "gcr.io/k8s-staging-cluster-api/kubeadm-bootstrap-controller-amd64:dev":
          ErrImagePull: rpc error: code = NotFound desc = failed to pull and unpack
          image "gcr.io/k8s-staging-cluster-api/kubeadm-bootstrap-controller-amd64:dev":
          failed to resolve reference "gcr.io/k8s-staging-cluster-api/kubeadm-bootstrap-controller-amd64:dev":
          gcr.io/k8s-staging-cluster-api/kubeadm-bootstrap-controller-amd64:dev: not
          found'
        reason: ImagePullBackOff

https://storage.googleapis.com/kubernetes-ci-logs/logs/periodic-cluster-api-e2e-mink8s-release-1-9/1890615249074130944/artifacts/clusters/self-hosted-43qx90/resources/capi-kubeadm-bootstrap-system/Pod/capi-kubeadm-bootstrap-controller-manager-7879599446-59z6r.yaml

  • The test started upgrading at 04:54:37.967
  • That pod was created at 2025-02-15T05:05:44Z
  • The test started trying to move the cluster at 05:08:31.699

Of note: the pod was scheduled to the node self-hosted-43qx90-cqmfb-jv4lm, which has the image listed in its preLoadImages: https://storage.googleapis.com/kubernetes-ci-logs/logs/periodic-cluster-api-e2e-mink8s-release-1-9/1890615249074130944/artifacts/clusters/self-hosted-43qx90/resources/self-hosted-ftu63g/DockerMachine/self-hosted-43qx90-cqmfb-jv4lm.yaml
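
For reference, the preloaded images for that node can also be checked directly on the DockerMachine object (illustrative command, assuming kubectl access to the self-hosted management cluster; in this run the object sits in the self-hosted-ftu63g namespace):

    kubectl -n self-hosted-ftu63g get dockermachine self-hosted-43qx90-cqmfb-jv4lm \
      -o jsonpath='{.spec.preLoadImages}'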

The node's kubelet logs show that images were removed because the disk usage threshold was reached, which should explain why we hit the ImagePullBackOff:

Feb 15 05:04:39.290460 self-hosted-43qx90-cqmfb-jv4lm kubelet[613]: I0215 05:04:39.290396     613 kubelet_volumes.go:163] "Cleaned up orphaned pod volumes dir" podUID="5efac9ee-31c9-4a8b-a377-0ba2a79deffe" path="/var/lib/kubelet/pods/5efac9ee-31c9-4a8b-a377-0ba2a79deffe/volumes"
Feb 15 05:05:25.262419 self-hosted-43qx90-cqmfb-jv4lm kubelet[613]: I0215 05:05:25.262267     613 image_gc_manager.go:383] "Disk usage on image filesystem is over the high threshold, trying to free bytes down to the low threshold" usage=85 highThreshold=85 amountToFree=13981256908 lowThreshold=80
Feb 15 05:05:25.264041 self-hosted-43qx90-cqmfb-jv4lm kubelet[613]: I0215 05:05:25.263948     613 image_gc_manager.go:487] "Removing image to free bytes" imageID="sha256:baa0d31514ee559c8962d6469a23315bcc8342cbbc1eccc49fa803dcd9653cb7" size=3084671 runtimeHandler=""
Feb 15 05:05:25.290595 self-hosted-43qx90-cqmfb-jv4lm kubelet[613]: I0215 05:05:25.290390     613 image_gc_manager.go:487] "Removing image to free bytes" imageID="sha256:27ebcc620bde4c43ac64f806f158db24396a214799e386c2fc74a5a310cbe497" size=73498175 runtimeHandler=""
Feb 15 05:05:25.429100 self-hosted-43qx90-cqmfb-jv4lm kubelet[613]: I0215 05:05:25.429021     613 image_gc_manager.go:487] "Removing image to free bytes" imageID="sha256:d995a297967012c97f18ca941dc0ebeebc94722ba00c758cf87edcb540782f22" size=62496298 runtimeHandler=""
Feb 15 05:05:25.598058 self-hosted-43qx90-cqmfb-jv4lm kubelet[613]: I0215 05:05:25.597740     613 image_gc_manager.go:487] "Removing image to free bytes" imageID="sha256:ce7242a2b54beda8b4a2079d5d783a823af40d6d56d6a51decc8acc01c22849f" size=53902911 runtimeHandler=""
Feb 15 05:05:25.851309 self-hosted-43qx90-cqmfb-jv4lm kubelet[613]: I0215 05:05:25.851196     613 image_gc_manager.go:487] "Removing image to free bytes" imageID="sha256:cb05a8f20cdc3e18095f2560e18f3af9d6c4c81a2847d1cb140fd88b23c82d92" size=85475942 runtimeHandler=""
Feb 15 05:05:25.903318 self-hosted-43qx90-cqmfb-jv4lm kubelet[613]: I0215 05:05:25.902657     613 image_gc_manager.go:487] "Removing image to free bytes" imageID="sha256:c69fa2e9cbf5f42dc48af631e956d3f95724c13f91596bc567591790e5e36db6" size=18562039 runtimeHandler=""
Feb 15 05:05:26.226474 self-hosted-43qx90-cqmfb-jv4lm kubelet[613]: I0215 05:05:26.226394     613 image_gc_manager.go:487] "Removing image to free bytes" imageID="sha256:c8794c20dcd9222dfe8a8d67435ea7cc114edfed626f7ade95b5c586cb3da05f" size=81263718 runtimeHandler=""
Feb 15 05:05:26.367668 self-hosted-43qx90-cqmfb-jv4lm kubelet[613]: I0215 05:05:26.367614     613 image_gc_manager.go:487] "Removing image to free bytes" imageID="sha256:dbcae34dd0560135d059f43be27c687967a037a27965fef1d67cd6b6d4b25531" size=79571553 runtimeHandler=""
Feb 15 05:05:26.452592 self-hosted-43qx90-cqmfb-jv4lm kubelet[613]: I0215 05:05:26.452401     613 image_gc_manager.go:487] "Removing image to free bytes" imageID="sha256:849ea80db7159fb01aec7bca60fbeffcec62f5058423d45a2200036ad5323be6" size=84038758 runtimeHandler=""
Feb 15 05:05:26.509224 self-hosted-43qx90-cqmfb-jv4lm kubelet[613]: I0215 05:05:26.506306     613 image_gc_manager.go:487] "Removing image to free bytes" imageID="sha256:889ed70abd7975882aad18946e0213f66a2fe5db585e3c96f697425de33d4e65" size=77193318 runtimeHandler=""
Feb 15 05:05:26.551737 self-hosted-43qx90-cqmfb-jv4lm kubelet[613]: I0215 05:05:26.551540     613 image_gc_manager.go:487] "Removing image to free bytes" imageID="sha256:d300845f67aebd4f27f549889087215f476cecdd6d9a715b49a4152857549c56" size=39008320 runtimeHandler=""
Feb 15 05:05:26.606494 self-hosted-43qx90-cqmfb-jv4lm kubelet[613]: I0215 05:05:26.606327     613 image_gc_manager.go:487] "Removing image to free bytes" imageID="sha256:04b7d0b91e7e514a13892f539330df607343a12ef344d7337f10a975e53cd01a" size=22541737 runtimeHandler=""
Feb 15 05:05:26.716423 self-hosted-43qx90-cqmfb-jv4lm kubelet[613]: I0215 05:05:26.716328     613 image_gc_manager.go:487] "Removing image to free bytes" imageID="sha256:71cea600fd40cd33100ba17da232e78ce565e04bd28a8307a1ba221058e6411e" size=80679909 runtimeHandler=""

@sbueringer
Member

Wondering under which circumstances we can hit this. If we run too many and the wrong tests in parallel within a ProwJob Pod?

@chrischdi
Member

I guess it's probably due to the prow worker node's host filesystem.

I'm trying to confirm at #11862

@chrischdi
Member

(Confirmed):

The relevant filesystem comes from the prow node the test is running on.

Also, the default thresholds might not make sense for these nodes because their disks are pretty large.

Example:

Filesystem                         Size  Used Avail Use% Mounted on
overlay                            280G  129G  151G  47% /
tmpfs                               64M     0   64M   0% /dev
shm                                 64M     0   64M   0% /dev/shm
tmpfs                               63G  9.6M   63G   1% /run
tmpfs                               63G     0   63G   0% /tmp
/.bottlerocket/rootfs/dev/nvme3n1  280G  129G  151G  47% /var
overlay                             20G  1.6G   19G   8% /usr/lib/modules
overlay                            280G  129G  151G  47% /run/docker.sock
tmpfs                              5.0M     0  5.0M   0% /run/lock
shm                                 64M     0   64M   0% /run/containerd/io.containerd.grpc.v1.cri/sandboxes/306dfe99fe0d29bcbbc55835cbca8ef2836680d197bb157c5716ced3fef99cb9/shm
shm                                 64M     0   64M   0% /run/containerd/io.containerd.grpc.v1.cri/sandboxes/b2caa722f2c926d0d143b982fa93ff94c1b8f897e2ae72fe65e895cd78ffa713/shm
overlay                            280G  129G  151G  47% /run/containerd/io.containerd.runtime.v2.task/k8s.io/b2caa722f2c926d0d143b982fa93ff94c1b8f897e2ae72fe65e895cd78ffa713/rootfs
overlay                            280G  129G  151G  47% /run/containerd/io.containerd.runtime.v2.task/k8s.io/306dfe99fe0d29bcbbc55835cbca8ef2836680d197bb157c5716ced3fef99cb9/rootfs
shm                                 64M     0   64M   0% /run/containerd/io.containerd.grpc.v1.cri/sandboxes/13014cf2cf974b4001cdbb71a2c68fda8ad89b2515ec5640fa83b64aaa6a6a1a/shm
shm                                 64M     0   64M   0% /run/containerd/io.containerd.grpc.v1.cri/sandboxes/e8c27357b4bdde2f84d455d858978e7b6f5b52eff66fec1c26bbed5336c17da9/shm
overlay                            280G  129G  151G  47% /run/containerd/io.containerd.runtime.v2.task/k8s.io/13014cf2cf974b4001cdbb71a2c68fda8ad89b2515ec5640fa83b64aaa6a6a1a/rootfs
overlay                            280G  129G  151G  47% /run/containerd/io.containerd.runtime.v2.task/k8s.io/e8c27357b4bdde2f84d455d858978e7b6f5b52eff66fec1c26bbed5336c17da9/rootfs
overlay                            280G  129G  151G  47% /run/containerd/io.containerd.runtime.v2.task/k8s.io/eda4fdaa146f3f03abf3e2f68af3bba9b9a7752d06cac8b7d71fe128ac32f065/rootfs
overlay                            280G  129G  151G  47% /run/containerd/io.containerd.runtime.v2.task/k8s.io/b68ff35ecd17ad5ccd5ed889a6c4f7ebb0190ef7f22bcaa5182ca3ce521aac03/rootfs
overlay                            280G  129G  151G  47% /run/containerd/io.containerd.runtime.v2.task/k8s.io/d5e6c0cba792cf9ce63a72669ff0bc7e1af562323e7f9535a8c93eae2cd304e5/rootfs
overlay                            280G  129G  151G  47% /run/containerd/io.containerd.runtime.v2.task/k8s.io/a66d55d8688de027ab79a007541883ef0bea5d475fc71ecb67e382d2ffcdbf94/rootfs
tmpfs                              125G   12K  125G   1% /var/lib/kubelet/pods/5c8de401-1f4a-4502-a27b-6113ec6c4e96/volumes/kubernetes.io~projected/kube-api-access-zj6pj
shm                                 64M     0   64M   0% /run/containerd/io.containerd.grpc.v1.cri/sandboxes/61a04dcd9999379a414878f5e483097b7b19376558d5c21f2c7591c9f616c4e1/shm
overlay                            280G  129G  151G  47% /run/containerd/io.containerd.runtime.v2.task/k8s.io/61a04dcd9999379a414878f5e483097b7b19376558d5c21f2c7591c9f616c4e1/rootfs
overlay                            280G  129G  151G  47% /run/containerd/io.containerd.runtime.v2.task/k8s.io/ff586ac7093726cdd9bad43a8f87eb88bc4d099d3f992ace0e87249e1f1cb448/rootfs
tmpfs                               50M   12K   50M   1% /var/lib/kubelet/pods/d53ad4bc-f071-4851-a350-3c9da2fe623d/volumes/kubernetes.io~projected/kube-api-access-4xzwc
shm                                 64M     0   64M   0% /run/containerd/io.containerd.grpc.v1.cri/sandboxes/e24b834eb1fc0fff741c93273c7ccacc712958cd4af2dbfc9331b68d55f9f164/shm
overlay                            280G  129G  151G  47% /run/containerd/io.containerd.runtime.v2.task/k8s.io/e24b834eb1fc0fff741c93273c7ccacc712958cd4af2dbfc9331b68d55f9f164/rootfs
overlay                            280G  129G  151G  47% /run/containerd/io.containerd.runtime.v2.task/k8s.io/56ecd7fdca9e14c229d75cbf0e9ff40c8b20c33c6c0c3c300788a0070fe66166/rootfs
tmpfs                              170M   12K  170M   1% /var/lib/kubelet/pods/968ca102-694e-4906-aa3f-b56fa4f2f8c1/volumes/kubernetes.io~projected/kube-api-access-w6s4n
tmpfs                              170M   12K  170M   1% /var/lib/kubelet/pods/4faeabd7-7ca5-471f-900f-e561668f6ced/volumes/kubernetes.io~projected/kube-api-access-pgn79
shm                                 64M     0   64M   0% /run/containerd/io.containerd.grpc.v1.cri/sandboxes/5bcc5c32e1b4a7887acf522e6a6fcfd9dbcf699e245cd467b79efb244efe5dcb/shm
shm                                 64M     0   64M   0% /run/containerd/io.containerd.grpc.v1.cri/sandboxes/cd4a1e35cd3b086a7b1e2966f5ff965b2b279a5261e959a2af3dec77ec537fbf/shm
overlay                            280G  129G  151G  47% /run/containerd/io.containerd.runtime.v2.task/k8s.io/5bcc5c32e1b4a7887acf522e6a6fcfd9dbcf699e245cd467b79efb244efe5dcb/rootfs
overlay                            280G  129G  151G  47% /run/containerd/io.containerd.runtime.v2.task/k8s.io/cd4a1e35cd3b086a7b1e2966f5ff965b2b279a5261e959a2af3dec77ec537fbf/rootfs
overlay                            280G  129G  151G  47% /run/containerd/io.containerd.runtime.v2.task/k8s.io/e26561b66bace18df83c8bf05f3881cdb0956a47154f445340082cafeea37cd2/rootfs
overlay                            280G  129G  151G  47% /run/containerd/io.containerd.runtime.v2.task/k8s.io/d311cc641af7b8e0ce255028476db88bf8d4ea707ee88f5cf625ea6793762363/rootfs

source

In that test case, garbage collection would kick in at the latest at 85% usage (?), at which point there would still be about 42GB of free space.
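
For reference, with the thresholds from the kubelet log above: 85% of the 280GB disk is ~238GB used, leaving ~42GB free, and freeing down to the 80% low threshold means removing (0.85 - 0.80) × 280GB ≈ 14GB, which matches the amountToFree=13981256908 (~14GB) in the log.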

@chrischdi
Member

Note: something that may help would be pinning images when loading them:

Arbitrary pinning is available via sudo ctr -n k8s.io images label <image-name> io.cri-containerd.pinned=pinned since containerd v1.7.14 and v1.6.30.

source

This could be done in CAPD 🤔
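
A rough sketch of what that could look like for the preloaded images, run against a CAPD/kind node container (node and image name are taken from this run and purely illustrative):

    # Pin a preloaded image inside the node container so the kubelet's image GC
    # should skip it (needs containerd >= v1.6.30 / v1.7.14):
    docker exec self-hosted-43qx90-cqmfb-jv4lm \
      ctr -n k8s.io images label \
      gcr.io/k8s-staging-cluster-api/kubeadm-bootstrap-controller-amd64:dev \
      io.cri-containerd.pinned=pinned

    # Verify the label shows up:
    docker exec self-hosted-43qx90-cqmfb-jv4lm \
      ctr -n k8s.io images ls \
      "name==gcr.io/k8s-staging-cluster-api/kubeadm-bootstrap-controller-amd64:dev"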

@cprivitere
Member Author

Just to make sure I understand, this is from the prow node running multiple jobs and having too many images around and then cleaning up images we still need? And pinning would help make sure images we're still using stay around? Would we need to unpin when done?

@chrischdi
Member

It's a layered problem.

First layer: prow node

  • The prow node has its filesystem (280GB) and runs a CRI (probably containerd) on top
  • The kubelet on the prow node creates our test-pod using the above containerd instance
  • The kubelet on the prow node also does image garbage collection, with whatever values have been configured

Then we have the next layer: our prow test container

  • Inside our test container we run docker (DIND), which has its own containerd instance
  • (no kubelet here)

Then we have another layer: the clusters we create using CAPD and kind

  • These clusters run inside the DIND instance in our container; each of the nodes is a container in DIND (no matter whether it was created by kind for the management cluster or for a workload cluster)
  • These nodes all run their own kubelet and CRI (containerd), with the kubelet's default GC values

What all layers have in common is the size of the filesystem: every level uses the same underlying filesystem, to which we only have limited access (overlayfs, the usual container layering), but the size shown is always that of the real disk (280GB).
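
Illustrative check (node name from this run): running df from inside a node container reports the prow node's 280GB disk, not a per-container quota:

    docker exec self-hosted-43qx90-cqmfb-jv4lm df -h /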

In our case:

  • we started the node for our cluster (which is a container on the second layer / DIND)
  • loaded the images (as configured to preload via preLoadImages in DockerMachine)
  • the kubelet inside our node container noticed that the local filesystem was above the 85% high threshold (e.g. 238G used) and therefore tried to garbage collect images to free some space if possible
    • This only affects images in our node's container, but of course it frees some space on the overall disk
    • We preloaded those images because they can't be pulled from anywhere, so once removed they are gone; nothing adds them back

I hope this makes it a bit more clear :-) Let me know if I can clarify some more.

@cprivitere
Member Author

Thanks, that helps. So options seem to include:

  • Pin images in that top layer, which guarantees the ones we need don't go away. It doesn't guarantee we'll have enough space, however...
  • We're GCing at 85% as it's the default, but that means we still have plenty of room when these deletions happen; arguably we shouldn't be garbage collecting at all on these?
  • Some other feedback to the prow instance owners that instances are getting full enough that it's triggering default kubelet cleanups regularly in tests running on those clusters, as we surely aren't the only ones hitting this.
  • something else.

@chrischdi
Member

chrischdi commented Feb 19, 2025

Thanks, that helps. So options seem to include:

  • Pin images in that top layer, which guarantees the ones we need don't go away. It doesn't guarantee we'll have enough space, however...

We need to pin images at the layer where we pull them, so in our case it would be inside the CAPD Clusters where we already set them as preLoadImages in DockerMachine.

  • We're GCing at 85% as it's the default, but that means we still have plenty of room when these deletions happen; arguably we shouldn't be garbage collecting at all on these?

I would not like to go down the road of changing the defaults here.

  • Some other feedback to the prow instance owners that instances are getting full enough that it's triggering default kubelet cleanups regularly in tests running on those clusters, as we surely aren't the only ones hitting this.

They do, but with their own settings, which make sense for them. I agree others might hit the same issue (creating a kind cluster and using kind load docker-image), but it seems to be a niche use case.

  • something else.

Also, looks like kind is already doing this (but for the images it builds into a node image):

https://github.com/kubernetes-sigs/kind/blob/main/pkg/build/nodeimage/imageimporter.go#L80

Maybe we can do this in CAPD too :-)

Another alternative would be to re-load the images to the machines (via CAPD).
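
Roughly, re-loading would look like what kind load docker-image does under the hood (names illustrative, taken from this run):

    # Stream the image from the host Docker into the node container's containerd:
    docker save gcr.io/k8s-staging-cluster-api/kubeadm-bootstrap-controller-amd64:dev \
      | docker exec -i self-hosted-43qx90-cqmfb-jv4lm ctr -n k8s.io images import -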

@chrischdi
Member

chrischdi commented Feb 19, 2025

/triage accepted
/priority important-soon

Note: maybe we should make sig-k8s-infra aware that this could happen, just in case others have that same issue and so they have a quick answer on why this happens :-)

Edit: did so: https://kubernetes.slack.com/archives/CCK68P2Q2/p1740060087478159

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates an issue lacks a `priority/foo` label and requires one. labels Feb 19, 2025