
Measure GRPC server metrics at the HTTP level and client metrics at the service level #5786


Draft
wants to merge 2 commits into
base: main
from

Conversation

rdettai
Collaborator

@rdettai rdettai commented Jun 5, 2025

Description

Currently, we have metrics called grpc_xxx that actually take measurements at the service level:

  • this makes it hard to measure bytes transferred
  • it sometimes measures calls that actually don't use the network stack, which makes the "grpc" name confusing.

This PR attempts to streamline the metrics:

  • grpc_x metrics are now recorded by an HTTP-level layer on gRPC servers only
  • we introduce service_x metrics that are recorded in clients only and use the "kind" label to distinguish "local" vs "remote" invocations.
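For illustration, an HTTP-level layer only sees the request path and headers, so the RPC name and request size have to be derived from those. The sketch below is a minimal standard-library version under stated assumptions: `rpc_name_from_path` and `content_length` are hypothetical stand-ins, not the helpers used in this PR, and the header value is passed as a plain `Option<&str>` instead of a real header map. It relies only on the gRPC convention that request paths look like `/{package.Service}/{Method}`.

```rust
/// Hypothetical helper (not the PR's implementation).
/// gRPC request paths have the form "/{package.Service}/{Method}".
/// Returns the path without its leading slash, or "unknown" when the
/// path does not have exactly a service segment and a method segment.
fn rpc_name_from_path(path: &str) -> &str {
    let trimmed = path.strip_prefix('/').unwrap_or(path);
    match trimmed.split_once('/') {
        Some((service, method))
            if !service.is_empty() && !method.is_empty() && !method.contains('/') =>
        {
            trimmed
        }
        _ => "unknown",
    }
}

/// Hypothetical helper (not the PR's implementation).
/// Parses a Content-Length header value when one is present and valid.
/// Returns None when the header is missing or malformed.
fn content_length(header_value: Option<&str>) -> Option<u64> {
    header_value?.trim().parse().ok()
}
```

Mapping unexpected paths to a single "unknown" value keeps the label cardinality bounded, which matters when the RPC name is used as a metric label.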

How was this PR tested?

TODO

@rdettai
Collaborator Author

rdettai commented Jun 5, 2025

TODO:

  • tests
  • verify that the client layers are properly registered (never attached to server stacks)
  • see if we can somehow improve GrpcMetricsLayer to also track streams
  • see if we can somehow improve GrpcMetricsLayer to also track compression

@rdettai rdettai marked this pull request as draft July 9, 2025 13:58

let rpc_name = extract_rpc_name_from_path(request.uri().path()).to_string();

let request_size = get_content_length(request.headers());
Collaborator


Hmmm... What happens in gRPC streaming?
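To make the streaming concern concrete: a streaming call does not know its total size up front, so a Content-Length header cannot describe it, and sizes would have to be counted from the body as it flows. A rough sketch of that counting, assuming the standard gRPC length-prefixed framing (1-byte compressed flag, 4-byte big-endian length, then the payload). `count_grpc_messages` is a hypothetical name, and it works on an in-memory buffer rather than a real HTTP body.

```rust
/// Hypothetical helper: sums the payloads of complete gRPC
/// length-prefixed messages found in `buf`.
/// Each message is: 1-byte compressed flag, 4-byte big-endian length, payload.
/// Returns (message_count, payload_bytes). A trailing partial message is ignored.
fn count_grpc_messages(buf: &[u8]) -> (u64, u64) {
    const HEADER_LEN: usize = 5;
    let mut offset = 0;
    let mut count = 0u64;
    let mut bytes = 0u64;
    // `offset` never exceeds `buf.len()`, so the subtraction cannot underflow.
    while buf.len() - offset >= HEADER_LEN {
        let len = u32::from_be_bytes([
            buf[offset + 1],
            buf[offset + 2],
            buf[offset + 3],
            buf[offset + 4],
        ]) as usize;
        let end = offset + HEADER_LEN + len;
        if end > buf.len() {
            break; // incomplete message, wait for more data
        }
        count += 1;
        bytes += len as u64;
        offset = end;
    }
    (count, bytes)
}
```

The length prefix describes the message as it appears on the wire, so for compressed messages this counts the compressed size.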

@rdettai
Collaborator Author

rdettai commented Jul 16, 2025

Memo on what should be done here

We monitor calls to internal services for two reasons:

  • Understand service load across a cluster. This can help identify overloaded services/nodes.
  • Measure sources of network traffic across nodes

Current behavior

  • Service metrics are broken down using a “kind” label that can be either “client” or “server”.
    • Server metrics are useful to understand the behavior from the perspective of the service provider.
    • Client metrics can help identify how the load is spread across callers, but they are a bit harder to leverage.
  • We measure these metrics only at the service layer (in the tower stack), so
    • We don’t have access to HTTP transfer sizes
    • We don’t always distinguish between local vs remote calls, which is confusing because the metric names contain “grpc”

Proposed behavior (before this PR)

First of all, we want to have measurements at the network level. The question is how to compose that with the current client/server metrics.

Note: it’s a bit tricky to get a handle to the network layer on the client side

On the server side, we want to:

  • Measure bytes received and bytes sent for network calls (show true size, i.e. after compression)
  • Measure both the number of calls coming from the network stack and those made locally
    • This gives a view of the total load of a service.
  • Distinguish between local/remote
    • Proportion of calls shortcutting the network stack
    • Calculate the size per request metric for network calls
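As a rough illustration of what these server-side measurements would allow, the sketch below derives the quantities listed above from four counters. `ServiceCallStats` and its fields are hypothetical names chosen for this example; they do not correspond to actual metrics in the codebase.

```rust
/// Hypothetical per-service counters, as a server-side layer might collect them.
#[derive(Debug, Default, Clone, Copy)]
struct ServiceCallStats {
    remote_calls: u64,   // calls that came through the network stack
    local_calls: u64,    // calls that bypassed the network stack
    bytes_received: u64, // network calls only, size on the wire
    bytes_sent: u64,     // network calls only, size on the wire
}

impl ServiceCallStats {
    /// Total load of the service, local and remote combined.
    fn total_calls(&self) -> u64 {
        self.remote_calls + self.local_calls
    }

    /// Proportion of calls shortcutting the network stack.
    fn local_ratio(&self) -> Option<f64> {
        let total = self.total_calls();
        (total > 0).then(|| self.local_calls as f64 / total as f64)
    }

    /// Average request size, computed over network calls only.
    fn avg_request_bytes(&self) -> Option<f64> {
        (self.remote_calls > 0).then(|| self.bytes_received as f64 / self.remote_calls as f64)
    }

    /// Average response size, computed over network calls only.
    fn avg_response_bytes(&self) -> Option<f64> {
        (self.remote_calls > 0).then(|| self.bytes_sent as f64 / self.remote_calls as f64)
    }
}
```

Dividing bytes by remote calls only is the reason the local/remote split matters: mixing local calls into the denominator would understate the size per request.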

Technically, this requires each service to set up a separate metric layer depending on whether the service is mounted on the network or not. In particular, we would risk mounting a service-level metric layer on a stack that is then mounted on the network, hence double counting some service calls.

On the client side:

  • Remove all client metrics
    • They increase the risk of double counting, especially when based on a label (seen in practice in the Mezmo dashboards)
    • They will behave differently from the server metrics because it’s technically complicated to get the same HTTP-level data
    • They involve a 2x increase in the number of service metrics
    • They can be useful, but are generally not intuitive to use (i.e. never really used in the Mezmo dashboards)


2 participants