Kubelet TLS Handshake Failures After Certificate Rotation #16850
Comments

(The k8s-triage-robot applied its standard triage boilerplate: /lifecycle stale, then /lifecycle rotten, and finally /close not-planned, each noting that the Kubernetes project lacks enough active contributors to adequately respond to all issues and directing feedback to sig-contributor-experience at kubernetes/community.)

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
What happened?
We deploy to several kops clusters via pipelines. Since kops 1.23, some pipelines fail with the error below, so we implemented a temporary retry mechanism that retries the failed request (a Go sketch of the detection logic is at the end of this issue). We are currently on kops 1.29 and the issue still persists. It is not causing any outage, but I would like to remove our workaround and fix the underlying problem. I also checked the PRs for kops 1.23 but found nothing that looked related; on kops 1.22 we never encountered this error:
/usr/bin/helm Error: unable to get pod logs for <APPLICATION>: Get "https://<WORKER NODE>:10250/containerLogs/default/<APPLICATION>/test-service": write tcp <CONTROL-PLANE NODE>:44194-><WORKER NODE>:10250: use of closed network connection
At the exact same time, the api-server logs:
kube-apiserver I0920 16:30:39.542320 11 cert_rotation.go:88] certificate rotation detected, shutting down client connections to start using new credentials
kube-apiserver E0920 16:30:39.543643 11 status.go:71] apiserver received an error that is not an metav1.Status: &url.Error{Op:"Get", URL:"https://<WORKER NODE>:10250/containerLogs/default/application/filebeat?sinceSeconds=300", Err:(*net.OpError)(0xc071c91090)}: Get "https://<WORKER NODE>:10250/containerLogs/default/application/filebeat?sinceSeconds=300": write tcp <CONTROL-PLANE NODE>:44194-><WORKER NODE>:10250: use of closed network connection
Every time this error happens, the same log line appears in the kubelet:
kubelet[5111]: I0920 16:30:39.542666 5111 log.go:245] http: TLS handshake error from <CONTROL-PLANE NODE>:44194: EOF
I checked the validity of the certificates; they are all valid.
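For reference, this is roughly how I spot-checked the kubelet serving certificate (a minimal Go sketch; the node address is a placeholder — an equivalent openssl s_client check against port 10250 shows the same thing):

```go
package main

import (
	"crypto/tls"
	"fmt"
	"log"
)

func main() {
	// Placeholder address: the kubelet serves TLS on port 10250.
	addr := "worker-node.example.internal:10250"

	// InsecureSkipVerify is acceptable here because we only want to
	// inspect the presented serving certificate, not trust the connection.
	conn, err := tls.Dial("tcp", addr, &tls.Config{InsecureSkipVerify: true})
	if err != nil {
		log.Fatalf("TLS dial failed: %v", err)
	}
	defer conn.Close()

	for _, cert := range conn.ConnectionState().PeerCertificates {
		fmt.Printf("subject=%s notBefore=%s notAfter=%s\n",
			cert.Subject, cert.NotBefore, cert.NotAfter)
	}
}
```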
apiserver logs:
I0920 16:10:03.110741 11 cert_rotation.go:88] certificate rotation detected, shutting down client connections to start using new credentials
I0920 16:20:03.515003 11 cert_rotation.go:88] certificate rotation detected, shutting down client connections to start using new credentials
I0920 16:30:39.542320 11 cert_rotation.go:88] certificate rotation detected, shutting down client connections to start using new credentials
I0920 16:43:43.688572 11 cert_rotation.go:88] certificate rotation detected, shutting down client connections to start using new credentials
I0920 16:53:43.688628 11 cert_rotation.go:88] certificate rotation detected, shutting down client connections to start using new credentials
I0920 17:03:43.689273 11 cert_rotation.go:88] certificate rotation detected, shutting down client connections to start using new credentials
I0920 17:14:04.170499 11 cert_rotation.go:88] certificate rotation detected, shutting down client connections to start using new credentials
I0920 17:28:43.689063 11 cert_rotation.go:88] certificate rotation detected, shutting down client connections to start using new credentials
I0920 17:38:43.688520 11 cert_rotation.go:88] certificate rotation detected, shutting down client connections to start using new credentials
I0920 17:48:43.688942 11 cert_rotation.go:88] certificate rotation detected, shutting down client connections to start using new credentials
Is this normal behaviour? Certificate rotation roughly every 10 minutes?
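From skimming client-go's transport/cert_rotation.go (where that log line comes from), my understanding is that the client tracks every connection it dials and, as soon as the certificate content changes, closes all of them so the next request handshakes with the new credentials — and anything in flight on those connections fails. Below is a simplified sketch of that pattern as I understand it, not the real implementation; the file path and poll interval are made up:

```go
package main

import (
	"bytes"
	"log"
	"net"
	"os"
	"sync"
	"time"
)

// connTracker remembers every connection the client dials so that all
// of them can be closed when the client certificate changes.
type connTracker struct {
	mu    sync.Mutex
	conns map[net.Conn]struct{}
}

func (t *connTracker) Dial(network, addr string) (net.Conn, error) {
	c, err := net.Dial(network, addr)
	if err != nil {
		return nil, err
	}
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.conns == nil {
		t.conns = map[net.Conn]struct{}{}
	}
	t.conns[c] = struct{}{}
	return c, nil
}

// closeAll abruptly closes every tracked connection; a request in
// flight on one of them fails with "use of closed network connection".
func (t *connTracker) closeAll() {
	t.mu.Lock()
	defer t.mu.Unlock()
	for c := range t.conns {
		c.Close()
	}
	t.conns = map[net.Conn]struct{}{}
}

func main() {
	certFile := "/path/to/client.crt" // made-up path for the sketch
	tracker := &connTracker{}

	var last []byte
	for range time.Tick(time.Minute) { // made-up poll interval
		cur, err := os.ReadFile(certFile)
		if err != nil {
			continue
		}
		if last != nil && !bytes.Equal(last, cur) {
			log.Print("certificate rotation detected, shutting down client connections to start using new credentials")
			tracker.closeAll()
		}
		last = cur
	}
}
```

If that reading is right, a rotation log every ~10 minutes would mean the underlying credential source is being refreshed that often, which still seems surprisingly frequent to me.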
What cloud provider are you using?
AWS
What did you expect to happen?
I expected that certificate rotation would not cause intermittent network issues.
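To make the expectation concrete: I would have expected rotation to stop reusing old connections without killing requests already in flight, along the lines of the standard-library sketch below. This is an illustration of the behaviour I expected, using a hypothetical onCertRotated hook — not actual kops or Kubernetes code — and I understand the hard close may be deliberate so the old credentials stop being used immediately.

```go
package main

import (
	"log"
	"net/http"
)

func main() {
	// A plain transport stands in for the apiserver's kubelet client.
	transport := &http.Transport{}
	client := &http.Client{Transport: transport}
	_ = client

	// Hypothetical rotation hook: dropping only *idle* connections lets
	// in-flight requests finish on the old credentials instead of dying
	// with "use of closed network connection"; new requests dial fresh.
	onCertRotated := func() {
		log.Print("certificate rotated; closing idle connections only")
		transport.CloseIdleConnections()
	}
	onCertRotated()
}
```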
Kubelet config
kubelet:
  containerLogMaxSize: "20Mi"
  containerLogMaxFiles: 5
  anonymousAuth: false
  authenticationTokenWebhook: true
  authorizationMode: Webhook
  readOnlyPort: 0
  protectKernelDefaults: true
  streamingConnectionIdleTimeout: "30m"
  eventQps: "0"
  featureGates:
    RotateKubeletServerCertificate: "true"
    HPAContainerMetrics: "true"
  kubeReserved:
    cpu: "100m"
    memory: "100Mi"
  kubeReservedCgroup: "/kube-reserved"
  systemReserved:
    cpu: "100m"
    memory: "100Mi"
  systemReservedCgroup: "/system-reserved"
  tlsCipherSuites:
    - TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
    - TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
    - TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305
    - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
    - TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305
    - TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
    - TLS_RSA_WITH_AES_256_GCM_SHA384
    - TLS_RSA_WITH_AES_128_GCM_SHA256
Possible relation
Is there a chance that this issue is related to golang/go#50984?
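For completeness, our temporary workaround is just a retry loop in the pipeline around the failing helm call; the sketch below shows the equivalent detection logic in Go (the function name, attempt count, and backoff are illustrative, not our production code). It checks both the typed net.ErrClosed sentinel and the raw error string, since errors wrapped by HTTP round trips do not always unwrap to the sentinel.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"strings"
	"time"
)

// retryOnClosedConn retries fn only when it fails with the
// closed-connection error we see during apiserver cert rotation.
func retryOnClosedConn(attempts int, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		// net.ErrClosed is the typed "use of closed network connection";
		// wrapped HTTP errors often only expose it as a string.
		if !errors.Is(err, net.ErrClosed) &&
			!strings.Contains(err.Error(), "use of closed network connection") {
			return err // a different failure: don't mask it with retries
		}
		time.Sleep(time.Duration(i+1) * time.Second) // simple linear backoff
	}
	return fmt.Errorf("still failing after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	err := retryOnClosedConn(3, func() error {
		calls++
		if calls < 2 {
			// Simulate the error from the issue on the first attempt.
			return fmt.Errorf("write tcp 10.0.0.1:44194->10.0.0.2:10250: %w", net.ErrClosed)
		}
		return nil // e.g. the pod-logs request succeeded on retry
	})
	fmt.Println("calls:", calls, "err:", err)
}
```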