High Node Latency Corresponding to Proxies in failfast #10693
Unanswered
peter-glotfelty asked this question in Help
Replies: 1 comment
-
Came here to ask about a similar problem: a hiccup in the destination pods caused the source pods' linkerd-proxy to go into fail-fast mode for several minutes. I don't see a way to configure this built-in circuit-breaking behavior, and removing linkerd from the source pods (while keeping it on the destination pods) prevents this.
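For what it's worth, later Linkerd releases (2.13+) expose some of this behavior as configurable failure accrual via Service annotations. This is a hedged sketch, not a confirmed fix for 2.12: the annotation names below are taken from the 2.13 circuit-breaking docs and the values are illustrative, so verify them against the docs for your version before relying on them.

```yaml
# Illustrative only: enable consecutive-failure accrual (circuit breaking)
# on a destination Service in Linkerd 2.13+. Annotation names and values
# should be checked against the docs for your installed version.
apiVersion: v1
kind: Service
metadata:
  name: my-svc            # hypothetical service name
  namespace: my-ns        # hypothetical namespace
  annotations:
    balancer.linkerd.io/failure-accrual: consecutive
    balancer.linkerd.io/failure-accrual-consecutive-max-failures: "10"
spec:
  selector:
    app: my-app
  ports:
    - port: 8080
```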
-
Hey folks, not really sure if this is an issue or just a best-practices thing, so I'm posting here in the discussions. We're running Linkerd 2.12.4 on AKS, and last night we hit an issue where one of our nodes hit networking problems and inbound/outbound latency spiked. Unfortunately for us, one of the "destination" pods was running on that node. Looking back, we see a pretty tight correlation between linkerd-proxies across the cluster going into "failfast" mode for TCP connections to services on the cluster and each of these latency spikes on the one node (`outbound_tcp_errors` shown below).
The destination pod in question didn't log anything of note, and other services' logs are pretty bare as well.
We're running in high-availability mode, but I'm wondering if we've misconfigured something, or if this might be a bug in the proxies when the destination controller doesn't respond quickly. Anyone have thoughts about where we should investigate next? The issue self-mitigated after about 90 minutes, so we have, to some extent, lost our repro.
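In case it helps anyone digging into the same symptom: you can pull a proxy's raw Prometheus counters with `linkerd diagnostics proxy-metrics` and filter for failfast-tagged errors. The snippet below is a minimal sketch that filters a hypothetical metrics snapshot inline (the exact metric name and label set may differ by proxy version, so treat them as illustrative); in a live cluster you would pipe the real command's output through the same grep.

```shell
# Hypothetical snapshot of linkerd-proxy metrics. In a live cluster,
# fetch the real counters with something like:
#   linkerd diagnostics proxy-metrics -n <namespace> po/<pod> | grep failfast
metrics='outbound_tcp_errors_total{direction="outbound",error="failfast"} 42
outbound_tcp_errors_total{direction="outbound",error="io"} 3'

# Filter for failfast-tagged errors to see which counters are climbing.
echo "$metrics" | grep failfast
```

Graphing the rate of that counter per source pod is also a quick way to confirm whether the failfast episodes line up with the node's latency spikes.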