Skip to content

xds: add MetricsReporter for generic xds client #8274

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

purnesh42H
Copy link
Contributor

@purnesh42H purnesh42H commented Apr 24, 2025

  • Add interfaces for xDS client to report different types of metrics.
  • Use provided metrics reporter to register and record metrics for valid, invalid resource updates and xDS server failures.

RELEASE NOTES: None

@purnesh42H purnesh42H requested a review from dfawley April 24, 2025 19:48
@purnesh42H purnesh42H added Area: xDS Includes everything xDS related, including LB policies used with xDS. Type: Feature New features or improvements in behavior labels Apr 24, 2025
@purnesh42H purnesh42H added this to the 1.73 Release milestone Apr 24, 2025
@purnesh42H
Copy link
Contributor Author

@dfawley could you take a look at the MetricsReporter interface proposed here? I have tried to keep it generic enough so that we don't need to add new methods to top level interfaces when adding new metric types and metric handles.

Once we have an agreement on the interface, I can write the tests.

@purnesh42H purnesh42H force-pushed the generic-xds-client-metrics-recorder branch from 0279ea0 to bd3ecc8 Compare April 24, 2025 19:59
Copy link

codecov bot commented Apr 24, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 82.27%. Comparing base (080f956) to head (31f0c98).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8274      +/-   ##
==========================================
+ Coverage   82.13%   82.27%   +0.13%     
==========================================
  Files         417      417              
  Lines       41385    41405      +20     
==========================================
+ Hits        33991    34064      +73     
+ Misses       5961     5921      -40     
+ Partials     1433     1420      -13     
Files with missing lines Coverage Δ
xds/internal/clients/xdsclient/authority.go 71.21% <100.00%> (+0.89%) ⬆️
xds/internal/clients/xdsclient/xdsclient.go 80.45% <100.00%> (+0.83%) ⬆️
xds/internal/clients/xdsclient/xdsconfig.go 100.00% <ø> (ø)

... and 16 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@purnesh42H purnesh42H force-pushed the generic-xds-client-metrics-recorder branch from bd3ecc8 to 6fa2987 Compare April 24, 2025 20:56
@@ -121,6 +122,7 @@ type authorityBuildOptions struct {
getChannelForADS xdsChannelForADS // Function to acquire a reference to an xdsChannel
logPrefix string // Prefix for logging
target string // Target for the gRPC Channel that owns xDS Client/Authority
metricsRecorder MetricsRecorder // Metrics recorder for metrics.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please try to avoid comments that add no value. Similar for "Prefix for logging" when the name of the variable already tells you what it is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed some unnecessary comments

@@ -363,6 +366,9 @@ func (a *authority) handleADSResourceUpdate(serverConfig *ServerConfig, rType Re
// On error, keep previous version of the resource. But update status
// and error.
if uErr.Err != nil {
if a.metricsRecorder != nil {
a.metricsRecorder.Record(xdsClientResourceUpdatesInvalidMetric, int64(1), a.target, serverConfig.ServerIdentifier.ServerURI, rType.TypeName)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This cast should not be needed here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -378,6 +384,10 @@ func (a *authority) handleADSResourceUpdate(serverConfig *ServerConfig, rType Re
continue
}

if a.metricsRecorder != nil {
a.metricsRecorder.Record(xdsClientResourceUpdatesValidMetric, int64(1), a.target, serverConfig.ServerIdentifier.ServerURI, rType.TypeName)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


// MetricsReporter provides a way for XDSClient to register MetricHandle(s)
// and obtain a MetricsRecorder to record metrics.
type MetricsReporter interface {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is far more complex than I believe we need for this package. Most of these things here are concerns of OpenTelemetry, not this package. This package can just say what happened. The application can determine based on the outputs of this how to send it to whatever metric system it is using.

E.g.
https://go.dev/play/p/Tk6LKooRIMC

This is if we need to be able to carry information along with each event. If we don't need information in the events, we could just use an enumeration for the events that can occur instead (type Metric int)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, we should maybe share one Metrics definition between all the clients?

Copy link
Contributor Author

@purnesh42H purnesh42H Apr 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have simplified to only having MetricsReport that records metrics as suggested above. I have removed the registration part which means the MetricsReporter implementation needs to dynamically consume the reported metric and make sense of it through type casting etc. So, its going to be same for all clients. I added a method to Metric interface to return target because we can't have empty interfaces and target should be common for all types of metrics. Right now for generic client its hardcoded to "xds-client".

I have created a test implementation which is similar to internal/testutils/stats/test_metrics_recorder.go with the difference being the MetricsReporter implementation adding the lables, description, name etc. We can do the same with internal metrics recorder that is used for grpc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have copied the metrics test and modified to use this new MetricsReporter implementation.

@dfawley dfawley assigned purnesh42H and unassigned dfawley Apr 24, 2025
@purnesh42H purnesh42H force-pushed the generic-xds-client-metrics-recorder branch from 49961d8 to 7114106 Compare April 25, 2025 18:23
@purnesh42H purnesh42H requested a review from dfawley April 25, 2025 18:24
@purnesh42H purnesh42H assigned dfawley and unassigned purnesh42H Apr 25, 2025
Comment on lines 21 to 22
// Metric is type of metric to be reported by XDSClient.
type Metric interface {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move this definition and the MetricsReporter to clients instead? Individual metrics could be declared in each of the xds/lrs clients, but the base definition could be shared.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


// Metric is type of metric to be reported by XDSClient.
type Metric interface {
Target() string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the source of this information? Isn't it passed to the xdsclient when it's created? If so, I don't think there's any need to give it back to the application now. The MetricsRecorder can be instrumented with the same knowledge.

Actually maybe we can just remove this type, even, and make ReportMetric take an any instead. That's probably sufficient. It can help with documentation to have things like this have their own type, but maybe that's not a concern.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah i was also contemplating making it any because vet was complaining of non proto file having interface without methods. Yeah target right now is being hardcoded to "xds-client" by us. I think once we have an interface for logger, we can accept user provided target there?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I thought "target" would be the channel's target string. It seems to have no value, then, if it's hard-coded.

What does logging have to do with this?

Copy link
Contributor Author

@purnesh42H purnesh42H Apr 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah for xds resolver it is a target string but for xds-enabled server it is a constant "#server" as per grfc A71. In internal client, we pass it as a param to the client and mainly use it for refcounting client instances. In external client, we don't have any use case for it except may be wrap in the log messages. So, we can still accept it as param for client via client config may be or just as a prefix in logger? We can have a prefix method in the logger interface.

// the xDS management server for a given resource type.
type MetricResourceUpdateValid struct {
ServerURI string // ServerURI of the xDS management server.
Incr int64 // Count to be incremented.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we ever see this being anything but "1"?

Copy link
Contributor Author

@purnesh42H purnesh42H Apr 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right now its only being incr by 1 but having it has a parameter makes it explicit to user what to do with it. Otherwise we can document it that way that it reports a single new update.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can just delete it.

@purnesh42H purnesh42H requested a review from dfawley April 28, 2025 10:21
// MetricsReporter is used by the XDSClient to report metrics. Metric can be of
// any type.
//
// For example: see xdsclient/internal/test_metrics_reporter.go
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documentation shouldn't link to internal test code.

If an example is needed, we can add a clients_examples_test.go and put an example there. Otherwise, just text documentation should normally be fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i have updated the documentation to close what you suggested below. I was thinking of an example which type cast and print the metric but was not very convinced of adding it in the docstring. I think with docstring sentences now user should be able to figure our how to use

Comment on lines 104 to 105
// MetricsReporter is used by the XDSClient to report metrics. Metric can be of
// any type.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Metric can be of any type.

Not really. The metric will be one of a predefined set of types depending on the client. This should explain that each client will produce different metrics and to see that client's documentation for a list of possible metrics events.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. Thanks.

@@ -59,6 +59,17 @@ func (c *Channel) Replace(value any) {
}
}

// ReceiveOrFail returns the value on the underlying channel and true, or nil
// and false if the channel was empty.
func (c *Channel) ReceiveOrFail() (any, bool) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this for? It's called once, and its return value isn't used.

Copy link
Contributor Author

@purnesh42H purnesh42H Apr 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is draining the existing value from channel if present before we send the new value. Similar to what is being done in test metrics recorder in stats package.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we want to do that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just inconvenience i believe. For tests, we usually want to verify the metric that was reported recently, even though the multiple metrics are reported throughout tests. So, draining the channel before reporting a new value makes it easier by only keeping the metric of immediate interest.

For example, if we are looking for a metric that should not be reported, we only have to check the first value instead of all values in the channel.

I have made the channel size to be 1 to make it more obvious.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I agree with this testing philosophy. We shouldn't blindly throw away outputs at any step. If the test needs to say "ignore everything else, I need this metric" then it should do:

loop: for {
	select {
	case got := <-metricsChan:
		if got == want {
			break loop // success
		}
	case <-ctx.Done():
		t.Fatalf("timed out waiting for metric %v", want)
	}
}

And if we need we can put that in a helper function.

Copy link
Contributor Author

@purnesh42H purnesh42H Apr 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, i have just removed ReceiveOrFail. For the tests we have right now, we are anyway checking for one metric at a time after its reported.

We should probably see if we can do something better in stats one.

@@ -363,6 +367,11 @@ func (a *authority) handleADSResourceUpdate(serverConfig *ServerConfig, rType Re
// On error, keep previous version of the resource. But update status
// and error.
if uErr.Err != nil {
if a.metricsReporter != nil {
a.metricsReporter.ReportMetric(MetricResourceUpdateInvalid{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's pass all of these by pointer instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

// MetricResourceUpdateInvalid is a Metric to report invalid resource updates
// from the xDS management server for a given resource type.
type MetricResourceUpdateInvalid struct {
Target string // Target of the metric.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need this then?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have removed Target and Incr

// recording events on channels and provides helpers to check if certain events
// have taken place. It also persists metrics data keyed on the metrics
// type.
type TestMetricsReporter struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not export anything defined in test files. Nothing can import them anyway.

Copy link
Contributor Author

@purnesh42H purnesh42H Apr 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not a test file? This is in internal package under xdsclient. But yeah this is only being used for e2e tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyways, I have moved the test metric recorder within e2e tests so that we don't have to export.

@purnesh42H purnesh42H requested a review from dfawley April 29, 2025 06:43
@@ -59,6 +59,17 @@ func (c *Channel) Replace(value any) {
}
}

// ReceiveOrFail returns the value on the underlying channel and true, or nil
// and false if the channel was empty.
func (c *Channel) ReceiveOrFail() (any, bool) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we want to do that?


package xdsclient

// MetricResourceUpdateValid is a Metric to report a valid resource update from
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit:

Suggested change
// MetricResourceUpdateValid is a Metric to report a valid resource update from
// MetricResourceUpdateValid is a metric to report a valid resource update from

(And below)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


r.mu.Lock()
defer r.mu.Unlock()
r.data[metricName] = 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be ++?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


// MetricResourceUpdateValid is a Metric to report a valid resource update from
// the xDS management server for a given resource type.
type MetricResourceUpdateValid struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should move metrics into a child package (xdsclient/metrics)? Or is that too many packages? It would help keep them nicely organized, though:

// Package metrics defines all metrics that can be produced by an xDS client.  All calls to
// the MetricsRecorder by the xDS client will contain a struct from this package passed by
// pointer.
package metrics

type ResourceUpdateValid struct {}
type ResourceUpdateInvalid struct {}
type ServerFailure struct {}

Thoughts? @easwars do you have an opinion?

Copy link
Contributor Author

@purnesh42H purnesh42H Apr 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to child package metrics. Seems fine to me.

Or is that too many packages

this was the only reason i initially didn't do it because this was brought up in during design in general.

@dfawley dfawley assigned purnesh42H and unassigned dfawley Apr 29, 2025
@purnesh42H purnesh42H force-pushed the generic-xds-client-metrics-recorder branch from 7afcefa to 71c41ed Compare April 30, 2025 13:44
@purnesh42H purnesh42H assigned dfawley and unassigned purnesh42H Apr 30, 2025
@purnesh42H purnesh42H requested a review from dfawley April 30, 2025 13:46
@purnesh42H purnesh42H force-pushed the generic-xds-client-metrics-recorder branch from 71c41ed to c59d9ac Compare April 30, 2025 13:58
// from this package passed by pointer.
package metrics

// MetricResourceUpdateValid is a metric to report a valid resource update from
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can drop the Metric prefix from these types since they end up being used currently as metrics.MetricResourceUpdateValid at callsites, which is harded to read compared to metrics.ResourceUpdateValid.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per comments elsewhere I gather that there will be a separate metrics package for the LRS client. If so, what will that package be called? If it will be called metrics as well, then all callsites need to rename both imports. If it will be called lrsmetrics, then this should be called xdsmetrics. But that doesn't read very well. Any thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we put the metrics package at the same level as the xdsclient and lrsclient packages and prefix the metric names appropriately. So, metrics defined now might look like metrics.XDSResourceUpdateValid, metrics.ResourceUpdateInvalid etc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to separate the metrics that the different clients can generate into those clients' sub-packages. If we ended up with 20 clients (we won't, but humor me) then it would be harder to use just one of them.

I think just metrics is fine, and let the user importing it decide if they need to rename it when they import.

I'm also OK with xdsclient/xdsmetrics and lrsclient/lrsmetrics if anyone feels strongly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah i think metrics package is fine. We have similar cases in otel library as well and we rename while importing.

@easwars easwars assigned purnesh42H and unassigned easwars and dfawley May 1, 2025
@purnesh42H purnesh42H requested a review from easwars May 2, 2025 08:02
@purnesh42H purnesh42H assigned easwars and dfawley and unassigned purnesh42H May 2, 2025
Copy link
Member

@dfawley dfawley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just a couple minor nits

// the xDS management server for a given resource type.
type ResourceUpdateValid struct {
ServerURI string // ServerURI of the xDS management server.
ResourceType string // Resource type.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please delete unnecessary comments. Not every field needs a comment. Or explain something about it that wouldn't be known already from the name itself.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually the exported fields and structs must have comments else linter will complain. I have updated the comment for ResourceType. Let me know if it make sense.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we disable that linter? There is no requirement to document every field in every struct.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Random first struct I looked for in the stdlib:

https://pkg.go.dev/time#ParseError

Zero comments.


// metricsData represents data associated with a metric.
type metricsData struct {
intIncr int64 // count to be incremented.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's delete this please.

Copy link
Contributor Author

@purnesh42H purnesh42H May 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}

// metricsData represents data associated with a metric.
type metricsData struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, this whole type seems unnecessary. Why not just push all metrics received as an any to the channel directly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I was thinking of making this test reporter close to our grpc stats one to showcase the metric interface will be able to satisfy what we want in grpc xds client.

// have taken place. It also persists metrics data keyed on the metrics
// type.
type testMetricsReporter struct {
intCountCh *testutils.Channel
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't an "int count channel". It's a metricsData channel. But see other comment about just putting the metric from the client directly on the channel.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

// waitForInt64Count waits for an int64 count metric to be recorded and verifies
// that the recorded metrics data matches the expected metricsDataWant. Returns
// an error if failed to wait or received wrong data.
func (r *testMetricsReporter) waitForInt64Count(ctx context.Context, metricsDataWant metricsData) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

waitForMetric(ctx, metricWant any)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

@dfawley dfawley assigned purnesh42H and unassigned dfawley May 2, 2025
@purnesh42H purnesh42H requested a review from dfawley May 2, 2025 18:24
@purnesh42H purnesh42H assigned dfawley and unassigned purnesh42H May 2, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: xDS Includes everything xDS related, including LB policies used with xDS. Type: Feature New features or improvements in behavior
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants