High Node Latency Corresponding to Proxies in failfast #10693
Unanswered
peter-glotfelty asked this question in Help
Replies: 1 comment
-
Came here to ask about a similar problem: a hiccup in the destination pods caused the source pods' linkerd-proxy to go into fail-fast mode for several minutes. I don't see a way to configure this built-in circuit-breaking behavior, and removing linkerd from the source pods (while keeping it on the destination pods) prevents this.
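For what it's worth, later Linkerd releases (2.13+) expose some of this behavior as configurable failure accrual via Service annotations. This is a hedged sketch, not a confirmed fix for 2.12: the annotation names below are taken from the 2.13 circuit-breaking docs and the values are illustrative, so verify them against the docs for your version before relying on them.

```yaml
# Illustrative only: enable consecutive-failure accrual (circuit breaking)
# on a destination Service in Linkerd 2.13+. Annotation names and values
# should be checked against the docs for your installed version.
apiVersion: v1
kind: Service
metadata:
  name: my-svc            # hypothetical service name
  namespace: my-ns        # hypothetical namespace
  annotations:
    balancer.linkerd.io/failure-accrual: consecutive
    balancer.linkerd.io/failure-accrual-consecutive-max-failures: "10"
spec:
  selector:
    app: my-app
  ports:
    - port: 8080
```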
-
Hey folks, not really sure if this is an issue or just a best-practices thing, so I'm posting here in the discussions. We're running Linkerd 2.12.4 on AKS, and last night we hit an issue where one of our nodes hit networking problems and inbound/outbound latency spiked. Unfortunately for us, one of the "destination" pods was running on that node. Looking back, we see a pretty tight correlation between linkerd-proxies across the cluster going into "failfast" mode for TCP connections to services on the cluster and each of these latency spikes on the one node (`outbound_tcp_errors` shown below).
The destination pod in question didn't log anything of note, and other services' logs are pretty bare as well.
We're running in high-availability mode, but I'm wondering if we've misconfigured something, or if this might be a bug in the proxies when the destination controller doesn't respond quickly. Anyone have thoughts about where we should investigate next? The issue self-mitigated after about 90 minutes, so we have, to some extent, lost our repro.
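In case it helps anyone digging into the same symptom: you can pull a proxy's raw Prometheus counters with `linkerd diagnostics proxy-metrics` and filter for failfast-tagged errors. The snippet below is a minimal sketch that filters a hypothetical metrics snapshot inline (the exact metric name and label set may differ by proxy version, so treat them as illustrative); in a live cluster you would pipe the real command's output through the same grep.

```shell
# Hypothetical snapshot of linkerd-proxy metrics. In a live cluster,
# fetch the real counters with something like:
#   linkerd diagnostics proxy-metrics -n <namespace> po/<pod> | grep failfast
metrics='outbound_tcp_errors_total{direction="outbound",error="failfast"} 42
outbound_tcp_errors_total{direction="outbound",error="io"} 3'

# Filter for failfast-tagged errors to see which counters are climbing.
echo "$metrics" | grep failfast
```

Graphing the rate of that counter per source pod is also a quick way to confirm whether the failfast episodes line up with the node's latency spikes.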