-
I am trying to determine why the latency for this simple gRPC service running in Kubernetes with Linkerd is so high. I've prepared a minimal example of the service used for these latency tests, which you can see in this gist. As shown in the gist, the client ("augmentor" in the image above) receives an HTTP request, sends an Echo message to the server ("bidder" in the image above), and the server simply returns the message it received.

At under 100 rps, the P99 latency is around 30ms. If I increase the load to about 1,200 rps, the latency skyrockets and gets up to around 200ms. I ran this test on a fresh GKE cluster running only these two services, deployed with the configuration you will find in the gist, and I installed Linkerd by following the quickstart guide.

Given how simple this example application is, I'm inclined to think the issue is networking related. Is there something I have not configured properly that would cause this kind of behavior?
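For reference, the load was driven against the client's HTTP endpoint roughly along these lines; the tool, URL, and flags below are illustrative placeholders, not the exact harness from the gist:

```sh
# ~1,200 rps for 5 minutes: 32 workers, each rate-limited to ~38 requests/sec.
# The URL is a placeholder for the augmentor service's HTTP endpoint.
hey -z 5m -c 32 -q 38 http://augmentor.default.svc.cluster.local:8080/echo
```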
-
200ms does sound a bit slow. The first thing that stood out in your YAML is that you assigned about 4 CPUs to your client and server pods. Did you configure the Linkerd proxy with CPU requests? Make sure your proxies aren't being starved. You can use either the `config.linkerd.io/proxy-cpu-request` annotation or the corresponding `linkerd inject` flag to set this (see the sketch below).

Also, what does the latency distribution look like without Linkerd? If you haven't already, you might want to run the test with a continuous traffic flow for a longer period of time (e.g., 1,000 rps for > 5 minutes) to get a better sense of the distribution trend.
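A minimal sketch of setting explicit resources on the proxy sidecar at injection time; the values here (500m CPU request, 1 CPU limit, 128Mi memory request) are assumptions for illustration, not tuned recommendations for this workload:

```sh
# Re-inject the workload manifest with explicit requests/limits for the
# linkerd-proxy container, then apply it. Values are illustrative only.
linkerd inject \
  --proxy-cpu-request 500m \
  --proxy-cpu-limit 1 \
  --proxy-memory-request 128Mi \
  augmentor.yaml | kubectl apply -f -
```

The same values can instead be set as `config.linkerd.io/proxy-cpu-request` (and related) annotations on the pod template, if you'd rather keep them in the YAML alongside the rest of the deployment.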
-
@danieljamespost So far, I haven't been able to reproduce the latency you experienced. At 1,200 rps with 32 concurrent connections, the p99 latency hovers around 6ms to 9ms. Here's the p99 trend over a period of 15 minutes: [chart]

CPU utilization is around 70% to 80% (with a limit of 1 CPU): [chart]

Memory usage is near negligible (with a limit of 2GB): [chart]

I tested Linkerd 2.8.1 on a GKE cluster comprised of n1-standard-1 nodes, using the following resource requirements on the client and server proxies: CPU: [screenshot] Memory: [screenshot]
Also, nothing stood out to me in the attached logs and metrics so far. Are you able to gather the CPU usage of the proxy (see the sketch below for one way to do this)? I wonder if the latency you saw was caused by CPU spikes on the proxy.
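One way to gather this, as a sketch assuming metrics-server is available (GKE provides it out of the box); the `app=` labels are placeholders for however your pods are actually labeled:

```sh
# Per-container CPU/memory, including the linkerd-proxy sidecar.
kubectl top pod --containers -l app=augmentor
kubectl top pod --containers -l app=bidder
```

Running this in a loop while the load test is in flight should show whether the proxy is bumping up against its CPU limit.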
-
@ihcsim I apologize for taking so long to get back to this issue. We had to hold off on deploying the new gRPC service that I was going to test with. However, I'm now running edge-20.8.4, the latency issues have gone away, and we are getting excellent performance. There are some other content-type issues I'm looking into, but the performance is great. Thank you for your help.