linkerd-proxy container in destination OOMed causing traffic disruption #12924
-
Hi Team,
Investigations:
We need the Linkerd team's and the community's support to understand how we can replicate the issue and to ensure adequate steps are taken so that it does not recur.
-
Questions:
-
This sounds a lot like the issue that was fixed by #12598. If you're able to run a newer version of Linkerd that includes this fix, that would help us confirm.
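For reference, here's a rough way to check which proxy version your pods are actually running, so you can confirm whether the fix is in place. This is a minimal sketch using the official `kubernetes` Python client; it assumes kubeconfig access, and `linkerd-proxy` is the default sidecar container name added by injection.

```python
# Minimal sketch: print the linkerd-proxy image tag of every injected pod.
# Assumes cluster access and the official `kubernetes` Python client.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

for pod in v1.list_pod_for_all_namespaces(watch=False).items:
    for c in pod.spec.containers:
        if c.name == "linkerd-proxy":
            print(f"{pod.metadata.namespace}/{pod.metadata.name}: {c.image}")
```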
-
@olix0r Thanks for the revert! The issue mentioned in that PR does seem to be related to pod churn where a large-scale deployment is rolled out, but that was not our case. We didn't see any big deployment rolled out before the issue. Also, we are unable to reproduce it in our environments, so we will never know for certain whether the update fixed it.
-
It's a little more nuanced than that. This issue could trigger for any workload that communicates with more than 100 target IPs. Churn could make the issue worse. I can provide a longer write-up when I'm at a desk, but I strongly recommend trying a version of Linkerd with this change.
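If it helps anyone gauge whether they're in that range, here's a minimal sketch (again using the `kubernetes` Python client; the Service name and namespace below are hypothetical) that counts how many endpoint IPs sit behind a given Service, so you can see whether a target crosses that ~100-address mark:

```python
# Minimal sketch: count the ready endpoint IPs behind a Service, to gauge
# whether a workload's target exceeds the ~100-address threshold above.
# Service name and namespace are hypothetical placeholders.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

ep = v1.read_namespaced_endpoints("my-backend", "my-namespace")
addresses = {
    addr.ip
    for subset in (ep.subsets or [])
    for addr in (subset.addresses or [])
}
print(f"{len(addresses)} ready endpoint IPs behind my-backend")
```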
-
We finally managed to get this into prod (the only place we can see this). The failure to reconcile metrics still exists. Could this be volume related? It's the only thing we haven't been able to replicate in test. After a few hours of runtime, the metric count on a single proxy is
Perhaps it cannot reconcile such a large dataset on initial startup; these immediately have roughly ~45k metrics.
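For anyone who wants to reproduce that count on their own proxies, this is a minimal sketch assuming the proxy admin port is the default 4191 and has been port-forwarded locally (e.g. `kubectl port-forward <pod> 4191:4191`); it simply counts the exported metric series:

```python
# Minimal sketch: count how many metric series a single linkerd-proxy exports.
# Assumes the admin port (default 4191) is port-forwarded to localhost.
import urllib.request

with urllib.request.urlopen("http://localhost:4191/metrics") as resp:
    lines = resp.read().decode().splitlines()

# Prometheus exposition format: comment/help lines start with '#'.
series = [line for line in lines if line and not line.startswith("#")]
print(f"{len(series)} metric series exported by this proxy")
```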
-
Though I don't completely understand why we don't get any balancer state in enabled mode, switching to ingress seems to fix it. #12916 (comment)
Looking back at our commit history, in 2022 we were given advice by the community to move from ingress -> enabled because we were having problems with the destination service. However, this looks like it may have unknowingly fixed the gRPC issue at the time, since it probably greatly reduced gRPC load (I don't have metrics that far back). I won't be able to go back to an older version to test this theory, so we might not get 100% confirmation that this was the issue. Thank you all so much for your help in diagnosing this complex issue. (Graph of enabled vs ingress mode memory is in the issue tracker.)
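For context, the switch between the two modes is just the `linkerd.io/inject` annotation on the pod template. A minimal sketch of flipping it with the `kubernetes` Python client (the deployment and namespace names are hypothetical; changing the template triggers a rollout of the workload) might look like:

```python
# Minimal sketch: switch a workload between "enabled" and "ingress" injection
# modes by patching the pod template annotation. Names are hypothetical.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

patch = {
    "spec": {
        "template": {
            "metadata": {
                "annotations": {"linkerd.io/inject": "ingress"}  # or "enabled"
            }
        }
    }
}
# Patching the template causes the pods to be re-created with the new mode.
apps.patch_namespaced_deployment("my-ingress-controller", "my-namespace", patch)
```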
-
Don't recommend excluding 443 on your inbound if you are on Traefik 1. It appears Traefik 1 is not particularly good at gracefully handling connections; Linkerd, however, is. 443 should be set to opaque mode anyway, which is probably automatic from Helm or a default setting. So never mind that bit of this conversation.
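For completeness, marking 443 as opaque is the `config.linkerd.io/opaque-ports` annotation, which can also be set at the namespace level. A minimal sketch with the `kubernetes` Python client (the namespace name is hypothetical, and existing pods would likely need to be re-rolled so the setting is picked up at injection time):

```python
# Minimal sketch: mark port 443 as opaque for a whole namespace via the
# config.linkerd.io/opaque-ports annotation. The namespace is hypothetical.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

patch = {"metadata": {"annotations": {"config.linkerd.io/opaque-ports": "443"}}}
v1.patch_namespace("my-namespace", patch)
```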