-
I am trying to determine why the latency for this simple gRPC service running in Kubernetes with Linkerd is so high. I've prepared a minimal example of the service used for these latency tests, which you can see in this gist. As shown in the gist, the client ("augmentor" in the image above) receives an HTTP request, sends an Echo message to the server ("bidder" in the image above), and the server simply returns the message it received.

At under 100 rps, the P99 latency is around 30ms. If I increase the load to about 1,200 rps, the latency skyrockets and gets up to around 200ms. I ran this test on a fresh GKE cluster running only these two services, deployed with the configuration you will find in the gist, and I installed Linkerd by following the quickstart guide.

Given how simple this example application is, I'm inclined to think the issue is networking related. Is there something I have not configured properly that would cause this kind of behavior?
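For reference, the load was driven against the client's HTTP endpoint roughly along these lines; the tool, URL, and flags below are illustrative placeholders, not the exact harness from the gist:

```sh
# ~1,200 rps for 5 minutes: 32 workers, each rate-limited to ~38 requests/sec.
# The URL is a placeholder for the augmentor service's HTTP endpoint.
hey -z 5m -c 32 -q 38 http://augmentor.default.svc.cluster.local:8080/echo
```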
-
200ms does sound a bit slow. The first thing that stood out in your YAML is that you assigned about 4 CPUs to your client and server pods. Did you configure the Linkerd proxy with CPU requests? Make sure your proxies aren't being starved. You can use either the `config.linkerd.io/proxy-cpu-request` annotation or the corresponding `linkerd inject` flag to set this (see the sketch below).

Also, what does the latency distribution look like without Linkerd? If you haven't already, you might want to run the test with a continuous traffic flow for a longer period of time (e.g., 1,000 rps for > 5 minutes) to get a better sense of the distribution trend.
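A minimal sketch of setting explicit resources on the proxy sidecar at injection time; the values here (500m CPU request, 1 CPU limit, 128Mi memory request) are assumptions for illustration, not tuned recommendations for this workload:

```sh
# Re-inject the workload manifest with explicit requests/limits for the
# linkerd-proxy container, then apply it. Values are illustrative only.
linkerd inject \
  --proxy-cpu-request 500m \
  --proxy-cpu-limit 1 \
  --proxy-memory-request 128Mi \
  augmentor.yaml | kubectl apply -f -
```

The same values can instead be set as `config.linkerd.io/proxy-cpu-request` (and related) annotations on the pod template, if you'd rather keep them in the YAML alongside the rest of the deployment.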
-
@danieljamespost So far, I haven't been able to reproduce the latency you experienced. At 1,200 rps with 32 concurrent connections, the p99 latency hovers around 6ms to 9ms. Here's the p99 trend over a period of 15 minutes: [chart]

CPU utilization is around 70% to 80% (with a limit of 1 CPU): [chart]

Memory usage is near negligible (with a limit of 2GB): [chart]

I tested Linkerd 2.8.1 on a GKE cluster comprised of n1-standard-1 nodes, using the following resource requirements on the client and server proxies: CPU: [screenshot] Memory: [screenshot]
Also, nothing stood out to me in the attached logs and metrics so far. Are you able to gather the CPU usage of the proxy (see the sketch below for one way to do this)? I wonder if the latency you saw was caused by CPU spikes on the proxy.
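One way to gather this, as a sketch assuming metrics-server is available (GKE provides it out of the box); the `app=` labels are placeholders for however your pods are actually labeled:

```sh
# Per-container CPU/memory, including the linkerd-proxy sidecar.
kubectl top pod --containers -l app=augmentor
kubectl top pod --containers -l app=bidder
```

Running this in a loop while the load test is in flight should show whether the proxy is bumping up against its CPU limit.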
-
@ihcsim I apologize for taking so long to get back to this issue. We had to hold off on deploying the new gRPC service that I was going to test with. However, I'm now running edge-20.8.4, the latency issues have gone away, and we are getting excellent performance. There are some other content-type issues I'm looking into, but the performance is great. Thank you for your help.