Measure GRPC server metrics at the HTTP level and client metrics at the service level #5786

rdettai · 2025-06-05T15:48:08Z

Description

Currently, we have metrics called grpc_xxx that actually take measurements at the service level:

this makes it hard to measure bytes transferred
it sometimes measures calls that don't actually use the network stack, which makes the "grpc" name confusing.

This PR attempts to streamline the metrics:

grpc_x metrics are now recorded by an HTTP-level layer, on gRPC servers only
we introduce service_x metrics that are recorded in clients only and use the "kind" label to distinguish "local" from "remote" invocations.
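To illustrate the client-side split, here is a minimal sketch of a counter keyed by RPC name and a "kind" label. It uses only the standard library; `CallKind` and `ServiceCallCounter` are hypothetical names for illustration and are not the types introduced by this PR.

```rust
use std::collections::HashMap;

/// Whether a client call stayed in-process or crossed the network.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum CallKind {
    Local,
    Remote,
}

impl CallKind {
    /// Value exposed through the "kind" label.
    pub fn as_label(self) -> &'static str {
        match self {
            CallKind::Local => "local",
            CallKind::Remote => "remote",
        }
    }
}

/// Hypothetical stand-in for a labelled counter family (assumption:
/// the real metrics are registered through a metrics library instead).
#[derive(Default)]
pub struct ServiceCallCounter {
    counts: HashMap<(String, CallKind), u64>,
}

impl ServiceCallCounter {
    /// Records one client invocation of `rpc_name` with the given kind.
    pub fn record(&mut self, rpc_name: &str, kind: CallKind) {
        *self
            .counts
            .entry((rpc_name.to_string(), kind))
            .or_insert(0) += 1;
    }

    /// Returns the number of recorded invocations, or 0 if none.
    pub fn get(&self, rpc_name: &str, kind: CallKind) -> u64 {
        self.counts
            .get(&(rpc_name.to_string(), kind))
            .copied()
            .unwrap_or(0)
    }
}
```

Keeping local and remote calls under one metric name with a label means the total client load is the sum over both label values, while the remote series alone approximates what reaches the network.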

How was this PR tested?

TODO

rdettai · 2025-06-05T15:50:53Z

TODO:

tests
verify that the client layers are properly registered (never attached to server stacks)
see if we can somehow improve GrpcMetricsLayer to also track streams
see if we can somehow improve GrpcMetricsLayer to also track compression

quickwit/quickwit-common/src/tower/metrics_service.rs

fulmicoton-dd · 2025-07-10T01:37:09Z

quickwit/quickwit-common/src/tower/metrics_grpc.rs

+
+        let rpc_name = extract_rpc_name_from_path(request.uri().path()).to_string();
+
+        let request_size = get_content_length(request.headers());


Hmmm... What happens in gRPC streaming?
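To make the streaming concern concrete: Content-Length is optional in HTTP/2, and a streaming RPC sends its body as a sequence of frames, so the header may be missing when the request headers are inspected. The sketch below shows one possible fallback, summing the sizes of the body chunks actually observed. The two helpers reuse the names from the diff, but their bodies are assumptions written against plain `(name, value)` pairs, not the PR's implementation; `observed_body_size` and `request_size` are hypothetical.

```rust
/// Hypothetical helper: returns the last segment of a gRPC path such as
/// "/package.Service/Method". The real function may behave differently.
pub fn extract_rpc_name_from_path(path: &str) -> &str {
    path.rsplit('/').next().unwrap_or(path)
}

/// Hypothetical helper: parses Content-Length from (name, value) pairs.
/// Returns None when the header is absent or not a valid integer.
pub fn get_content_length(headers: &[(&str, &str)]) -> Option<u64> {
    headers
        .iter()
        .find(|(name, _)| name.eq_ignore_ascii_case("content-length"))
        .and_then(|(_, value)| value.trim().parse::<u64>().ok())
}

/// Sums the sizes of the body chunks that were actually observed.
pub fn observed_body_size<'a, I>(chunks: I) -> u64
where
    I: IntoIterator<Item = &'a [u8]>,
{
    chunks.into_iter().map(|chunk| chunk.len() as u64).sum()
}

/// Uses Content-Length when present, otherwise falls back to counting
/// the bytes of the chunks seen on the wire.
pub fn request_size<'a, I>(headers: &[(&str, &str)], chunks: I) -> u64
where
    I: IntoIterator<Item = &'a [u8]>,
{
    get_content_length(headers).unwrap_or_else(|| observed_body_size(chunks))
}
```

In a real tower layer, the fallback would presumably require wrapping the request and response bodies so that each frame is counted as it is polled, which is more involved than reading a header.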

rdettai · 2025-07-16T12:28:06Z

Memo on what should be done here

We monitor calls to internal services for two reasons:

Understand services load across a cluster. This can help identify overloaded services/nodes.
Measure sources of network traffic across nodes

Current behavior

Service metrics are broken down using a “kind” label that can be “client” or “server”.
- Server metrics are useful to understand the behavior from the perspective of the service provider.
- Client metrics can help identify how the load is spread across callers, but they are a bit harder to leverage.
We measure these metrics only at the service layer (in the tower stack), so
- We don’t have access to HTTP transfer sizes
- We don’t always distinguish between local and remote calls, which is confusing because the metric names contain “grpc”

Proposed behavior (before this PR)

First of all, we want to have a measurement at the network level. The question is how to compose that with the current client/server metrics.

Note: it’s a bit tricky to get a handle to the network layer on the client side

On the server side, we want to:

Measure bytes received and bytes sent for network calls (show the true size, i.e. after compression)
Measure both number of calls coming from the network stack and those made locally
- View on the total load of a service.
Distinguish between local/remote
- Proportion of calls shortcutting the network stack
- Calculate the size per request metric for network calls
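The derived quantities in this list can be sketched as simple ratios over four counters. `ServerCallStats` and its field names are hypothetical and only illustrate the arithmetic; in practice these values would likely be computed in dashboard queries rather than in the server.

```rust
/// Hypothetical snapshot of server-side counters for one service.
#[derive(Debug, Default, Clone, Copy)]
pub struct ServerCallStats {
    /// Calls that shortcut the network stack.
    pub local_calls: u64,
    /// Calls that arrived through the network stack.
    pub remote_calls: u64,
    /// Bytes received over the network (after compression).
    pub bytes_received: u64,
    /// Bytes sent over the network (after compression).
    pub bytes_sent: u64,
}

impl ServerCallStats {
    /// Total load of the service, local and remote combined.
    pub fn total_calls(&self) -> u64 {
        self.local_calls + self.remote_calls
    }

    /// Proportion of calls that shortcut the network stack.
    pub fn local_ratio(&self) -> Option<f64> {
        match self.total_calls() {
            0 => None,
            total => Some(self.local_calls as f64 / total as f64),
        }
    }

    /// Average received bytes per request, over network calls only.
    pub fn avg_request_bytes(&self) -> Option<f64> {
        match self.remote_calls {
            0 => None,
            remote => Some(self.bytes_received as f64 / remote as f64),
        }
    }
}
```

Dividing bytes by remote calls only is the reason the local/remote distinction matters: including local calls in the denominator would understate the size per network request.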

Technically, this requires each service to set up a different metric layer depending on whether or not it is mounted on the network. In particular, we would generally face the risk of mounting a service-level metric layer on a stack that is then mounted on the network, hence double counting some service calls.
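The double-counting risk can be reproduced with a toy wrapper. This sketch uses a simplified synchronous trait rather than tower's `Service`, and all names (`SimpleService`, `Echo`, `Counted`) are hypothetical; it only shows that stacking two counting layers that feed the same metric records one call twice.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Minimal synchronous stand-in for a service; not the tower `Service` trait.
pub trait SimpleService {
    fn call(&self, request: &str) -> String;
}

/// Innermost service: returns the request unchanged.
pub struct Echo;

impl SimpleService for Echo {
    fn call(&self, request: &str) -> String {
        request.to_string()
    }
}

/// Wrapper that increments a shared counter on every call.
pub struct Counted<S> {
    inner: S,
    calls: Rc<Cell<u64>>,
}

impl<S> Counted<S> {
    pub fn new(inner: S, calls: Rc<Cell<u64>>) -> Self {
        Counted { inner, calls }
    }
}

impl<S: SimpleService> SimpleService for Counted<S> {
    fn call(&self, request: &str) -> String {
        // Count first, then delegate to the wrapped service.
        self.calls.set(self.calls.get() + 1);
        self.inner.call(request)
    }
}
```

Wrapping `Echo` in `Counted` once for the service stack and again when mounting it on the network, with both wrappers sharing one counter, yields a count of 2 for a single request.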

On the client side:

Remove all client metrics
- They increase the risk of double counting, especially when based on a label (seen in practice in the Mezmo dashboards)
- They will behave differently from the server metrics because it’s technically complicated to get the same HTTP-level data
- They involve a 2x increase in the number of service metrics
- They can be useful, but are generally not intuitive to use (i.e. never really used in the Mezmo dashboards)

Measure GRPC server metrics at the HTTP level and client metrics at t…

2859040

…he service level

rdettai marked this pull request as draft July 9, 2025 13:58

fulmicoton-dd reviewed Jul 10, 2025

Fix service metric help

732c544
