linkerd-proxy container in destination OOMed causing traffic disruption #12924
-
Hi Team,
Investigations:
We need the Linkerd team's and the community's support to understand how we can replicate the issue and to ensure adequate steps are taken so that it does not recur.
-
Questions:
-
This sounds a lot like the issue that was fixed by #12598. If you're able to run a newer version of Linkerd that includes this fix, that would help us confirm.
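For reference, here's a rough way to check which proxy version your pods are actually running, so you can confirm whether the fix is in place. This is a minimal sketch using the official `kubernetes` Python client; it assumes kubeconfig access, and `linkerd-proxy` is the default sidecar container name added by injection.

```python
# Minimal sketch: print the linkerd-proxy image tag of every injected pod.
# Assumes cluster access and the official `kubernetes` Python client.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

for pod in v1.list_pod_for_all_namespaces(watch=False).items:
    for c in pod.spec.containers:
        if c.name == "linkerd-proxy":
            print(f"{pod.metadata.namespace}/{pod.metadata.name}: {c.image}")
```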
-
@olix0r Thanks for the revert! The issue mentioned in that PR does seem to be related to pod churn where a large-scale deployment is rolled out, but that was not our case. We didn't see any big deployment rolled out before the issue. Also, we are unable to reproduce it in our environments, so we will never know for certain whether the update fixed it.
-
It's a little more nuanced than that. This issue could trigger for any workload that communicates with more than 100 target IPs. Churn could make the issue worse. I can provide a longer write-up when I'm at a desk, but I strongly recommend trying a version of Linkerd with this change.
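If it helps anyone gauge whether they're in that range, here's a minimal sketch (again using the `kubernetes` Python client; the Service name and namespace below are hypothetical) that counts how many endpoint IPs sit behind a given Service, so you can see whether a target crosses that ~100-address mark:

```python
# Minimal sketch: count the ready endpoint IPs behind a Service, to gauge
# whether a workload's target exceeds the ~100-address threshold above.
# Service name and namespace are hypothetical placeholders.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

ep = v1.read_namespaced_endpoints("my-backend", "my-namespace")
addresses = {
    addr.ip
    for subset in (ep.subsets or [])
    for addr in (subset.addresses or [])
}
print(f"{len(addresses)} ready endpoint IPs behind my-backend")
```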
-
We finally managed to get this into prod (the only place we can see this). The failure to reconcile metrics still exists. Could this be volume related? It's the only thing we haven't been able to replicate in test. After a few hours of runtime, the metric count on a single proxy is
Perhaps it cannot reconcile such a large dataset on initial startup; these immediately have roughly ~45k metrics.
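For anyone who wants to reproduce that count on their own proxies, this is a minimal sketch assuming the proxy admin port is the default 4191 and has been port-forwarded locally (e.g. `kubectl port-forward <pod> 4191:4191`); it simply counts the exported metric series:

```python
# Minimal sketch: count how many metric series a single linkerd-proxy exports.
# Assumes the admin port (default 4191) is port-forwarded to localhost.
import urllib.request

with urllib.request.urlopen("http://localhost:4191/metrics") as resp:
    lines = resp.read().decode().splitlines()

# Prometheus exposition format: comment/help lines start with '#'.
series = [line for line in lines if line and not line.startswith("#")]
print(f"{len(series)} metric series exported by this proxy")
```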
-
Though I don't completely understand why we don't get any balancer state in enabled mode, switching to ingress seems to fix it. #12916 (comment)
Looking back at our commit history, in 2022 we were given advice by the community to move from ingress -> enabled because we were having problems with the destination service. However, this looks like it may have unknowingly fixed the gRPC issue at the time, since it probably greatly reduced gRPC load (I don't have metrics that far back). I won't be able to go back to an older version to test this theory, so we might not get 100% confirmation that this was the issue. Thank you all so much for your help in diagnosing this complex issue. (Graph of enabled vs ingress mode memory is in the issue tracker.)
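For context, the switch between the two modes is just the `linkerd.io/inject` annotation on the pod template. A minimal sketch of flipping it with the `kubernetes` Python client (the deployment and namespace names are hypothetical; changing the template triggers a rollout of the workload) might look like:

```python
# Minimal sketch: switch a workload between "enabled" and "ingress" injection
# modes by patching the pod template annotation. Names are hypothetical.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

patch = {
    "spec": {
        "template": {
            "metadata": {
                "annotations": {"linkerd.io/inject": "ingress"}  # or "enabled"
            }
        }
    }
}
# Patching the template causes the pods to be re-created with the new mode.
apps.patch_namespaced_deployment("my-ingress-controller", "my-namespace", patch)
```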
-
Don't recommend excluding 443 on your inbound if you are on Traefik 1. It appears Traefik 1 is not particularly good at gracefully handling connections; Linkerd, however, is. 443 should be set to opaque mode anyway, which is probably automatic from Helm or a default setting. So never mind that bit of this conversation.
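For completeness, marking 443 as opaque is the `config.linkerd.io/opaque-ports` annotation, which can also be set at the namespace level. A minimal sketch with the `kubernetes` Python client (the namespace name is hypothetical, and existing pods would likely need to be re-rolled so the setting is picked up at injection time):

```python
# Minimal sketch: mark port 443 as opaque for a whole namespace via the
# config.linkerd.io/opaque-ports annotation. The namespace is hypothetical.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

patch = {"metadata": {"annotations": {"config.linkerd.io/opaque-ports": "443"}}}
v1.patch_namespace("my-namespace", patch)
```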