Application errors during linkerd-destination restarts / chart upgrades #10405
Replies: 2 comments
-
It might be expected to see some warnings in the proxy logs while connecting to linkerd-destination pods that are being terminated, and how long they last depends on how quickly the cluster converges to a consistent state. But as you point out, it's puzzling that this is affecting application connectivity.
-
Hi, sorry for the delay in coming back to this. We've just upgraded to 2.13.0 and a whole bunch of production apps crashed during the rolling upgrade. Logs from one of the application pods that threw connection errors during the upgrade:
And a linkerd-destination pod that is shutting down:
This looks to me like the application's proxy is still sending traffic to the destination pod after it has shut down and should no longer be able to receive traffic. We have seen similar Endpoints / EndpointSlice propagation issues on GKE, which we work around with preStop sleeps (a sketch of that hook is below), but I don't think we have the same configuration options for the control plane?
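For illustration, that preStop-sleep workaround on an application Deployment looks roughly like the following; the names, image, and the 30-second delay are placeholders, not the actual configuration discussed in this thread:

```yaml
# Hypothetical application Deployment showing the preStop-sleep workaround.
# The sleep delays SIGTERM so Endpoints/EndpointSlice removal can propagate
# and clients stop opening new connections before the process shuts down.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      # Must exceed the preStop sleep plus the app's normal shutdown time.
      terminationGracePeriodSeconds: 60
      containers:
        - name: my-app
          image: registry.example.com/my-app:latest
          lifecycle:
            preStop:
              exec:
                # Requires a `sleep` binary in the image; use a shell form
                # or another mechanism if the image is distroless.
                command: ["sleep", "30"]
```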
-
Hi all,
When we upgrade the Linkerd control plane, or do a rolling restart of linkerd-destination for any reason, some of our applications throw connection errors, which means Linkerd is not zero-downtime for us.
From the application's linkerd-proxy, we see hundreds or thousands of log lines such as:
This noise persists until all 3 replicas are back up and running.
Note: We run the Linkerd control plane in HA mode with three replicas of linkerd-destination, `rollingUpdate.maxSurge: 25%`, and `rollingUpdate.maxUnavailable: 1`, so there should always be at least two destination replicas running during an upgrade (the rendered strategy is sketched below).
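For reference, those settings correspond to a Deployment strategy stanza roughly like the following; this is a sketch rendered from the values quoted above, not a dump of our actual manifest:

```yaml
# Rolling-update strategy matching the settings described above.
# With 3 replicas and maxUnavailable: 1, at most one linkerd-destination
# pod should be unavailable at any point during the rollout.
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 1
```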
Picking a log to look into: `10.133.24.20` is a destination pod, `linkerd-destination-9f699ffdd-9j45f`. This pod reported ready until ~16:19, when it then terminated:

And the destination pod's logs are:
This log stands out as interesting, given the pod should be terminating:
So it looks to me like there is an issue with graceful shutdown of the destination pod, but equally I don't understand why this manifests as an application-level error rather than the proxy using one of the remaining destination pods, or falling back to the default ClusterIP behaviour (I saw there is another open issue on that topic at the moment). Can anyone offer some thoughts?
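For what it's worth, if the chart doesn't expose such an option, one hypothetical way to experiment with the same preStop delay on the control plane would be a strategic-merge patch against the linkerd-destination Deployment. The container name and the assumption that the controller image ships a `sleep` binary are unverified here, so treat this purely as a sketch:

```yaml
# Hypothetical strategic-merge patch, e.g. applied with:
#   kubectl patch deploy linkerd-destination -n linkerd --patch-file prestop-patch.yaml
# Assumes the container is named "destination" and that its image contains
# a `sleep` binary; verify both before trying this anywhere important.
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: destination
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "30"]
```

This would only be a mitigation for slow endpoint propagation; it wouldn't explain why the proxy doesn't fail over to the remaining destination replicas.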