-
Notifications
You must be signed in to change notification settings - Fork 4.5k
xds: add MetricsReporter for generic xds client #8274
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
xds: add MetricsReporter for generic xds client #8274
Conversation
@dfawley could you take a look at the MetricsReporter interface proposed here? I have tried to keep it generic enough so that we don't need to add new methods to top level interfaces when adding new metric types and metric handles. Once we have an agreement on the interface, I can write the tests. |
0279ea0
to
bd3ecc8
Compare
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #8274 +/- ##
==========================================
+ Coverage 82.13% 82.27% +0.13%
==========================================
Files 417 417
Lines 41385 41405 +20
==========================================
+ Hits 33991 34064 +73
+ Misses 5961 5921 -40
+ Partials 1433 1420 -13
🚀 New features to boost your workflow:
|
bd3ecc8
to
6fa2987
Compare
@@ -121,6 +122,7 @@ type authorityBuildOptions struct { | |||
getChannelForADS xdsChannelForADS // Function to acquire a reference to an xdsChannel | |||
logPrefix string // Prefix for logging | |||
target string // Target for the gRPC Channel that owns xDS Client/Authority | |||
metricsRecorder MetricsRecorder // Metrics recorder for metrics. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please try to avoid comments that add no value. Similar for "Prefix for logging" when the name of the variable already tells you what it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed some unnecessary comments
@@ -363,6 +366,9 @@ func (a *authority) handleADSResourceUpdate(serverConfig *ServerConfig, rType Re | |||
// On error, keep previous version of the resource. But update status | |||
// and error. | |||
if uErr.Err != nil { | |||
if a.metricsRecorder != nil { | |||
a.metricsRecorder.Record(xdsClientResourceUpdatesInvalidMetric, int64(1), a.target, serverConfig.ServerIdentifier.ServerURI, rType.TypeName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This cast should not be needed here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -378,6 +384,10 @@ func (a *authority) handleADSResourceUpdate(serverConfig *ServerConfig, rType Re | |||
continue | |||
} | |||
|
|||
if a.metricsRecorder != nil { | |||
a.metricsRecorder.Record(xdsClientResourceUpdatesValidMetric, int64(1), a.target, serverConfig.ServerIdentifier.ServerURI, rType.TypeName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
||
// MetricsReporter provides a way for XDSClient to register MetricHandle(s) | ||
// and obtain a MetricsRecorder to record metrics. | ||
type MetricsReporter interface { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is far more complex than I believe we need for this package. Most of these things here are concerns of OpenTelemetry, not this package. This package can just say what happened. The application can determine based on the outputs of this how to send it to whatever metric system it is using.
E.g.
https://go.dev/play/p/Tk6LKooRIMC
This is if we need to be able to carry information along with each event. If we don't need information in the events, we could just use an enumeration for the events that can occur instead (type Metric int
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, we should maybe share one Metrics definition between all the clients?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have simplified to only having MetricsReport that records metrics as suggested above. I have removed the registration part which means the MetricsReporter implementation needs to dynamically consume the reported metric and make sense of it through type casting etc. So, its going to be same for all clients. I added a method to Metric interface to return target because we can't have empty interfaces and target should be common for all types of metrics. Right now for generic client its hardcoded to "xds-client".
I have created a test implementation which is similar to internal/testutils/stats/test_metrics_recorder.go
with the difference being the MetricsReporter implementation adding the lables, description, name etc. We can do the same with internal metrics recorder that is used for grpc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have copied the metrics test and modified to use this new MetricsReporter implementation.
49961d8
to
7114106
Compare
// Metric is type of metric to be reported by XDSClient. | ||
type Metric interface { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move this definition and the MetricsReporter
to clients
instead? Individual metrics could be declared in each of the xds/lrs clients, but the base definition could be shared.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
||
// Metric is type of metric to be reported by XDSClient. | ||
type Metric interface { | ||
Target() string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the source of this information? Isn't it passed to the xdsclient when it's created? If so, I don't think there's any need to give it back to the application now. The MetricsRecorder can be instrumented with the same knowledge.
Actually maybe we can just remove this type, even, and make ReportMetric
take an any
instead. That's probably sufficient. It can help with documentation to have things like this have their own type, but maybe that's not a concern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah i was also contemplating making it any because vet was complaining of non proto file having interface without methods. Yeah target right now is being hardcoded to "xds-client" by us. I think once we have an interface for logger, we can accept user provided target there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I thought "target" would be the channel's target string. It seems to have no value, then, if it's hard-coded.
What does logging have to do with this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah for xds resolver it is a target string but for xds-enabled server it is a constant "#server" as per grfc A71. In internal client, we pass it as a param to the client and mainly use it for refcounting client instances. In external client, we don't have any use case for it except may be wrap in the log messages. So, we can still accept it as param for client via client config may be or just as a prefix in logger? We can have a prefix method in the logger interface.
// the xDS management server for a given resource type. | ||
type MetricResourceUpdateValid struct { | ||
ServerURI string // ServerURI of the xDS management server. | ||
Incr int64 // Count to be incremented. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we ever see this being anything but "1"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right now its only being incr by 1 but having it has a parameter makes it explicit to user what to do with it. Otherwise we can document it that way that it reports a single new update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can just delete it.
xds/internal/clients/config.go
Outdated
// MetricsReporter is used by the XDSClient to report metrics. Metric can be of | ||
// any type. | ||
// | ||
// For example: see xdsclient/internal/test_metrics_reporter.go |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Documentation shouldn't link to internal test code.
If an example is needed, we can add a clients_examples_test.go
and put an example there. Otherwise, just text documentation should normally be fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i have updated the documentation to close what you suggested below. I was thinking of an example which type cast and print the metric but was not very convinced of adding it in the docstring. I think with docstring sentences now user should be able to figure our how to use
xds/internal/clients/config.go
Outdated
// MetricsReporter is used by the XDSClient to report metrics. Metric can be of | ||
// any type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Metric can be of any type.
Not really. The metric will be one of a predefined set of types depending on the client. This should explain that each client will produce different metrics and to see that client's documentation for a list of possible metrics events.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated. Thanks.
@@ -59,6 +59,17 @@ func (c *Channel) Replace(value any) { | |||
} | |||
} | |||
|
|||
// ReceiveOrFail returns the value on the underlying channel and true, or nil | |||
// and false if the channel was empty. | |||
func (c *Channel) ReceiveOrFail() (any, bool) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this for? It's called once, and its return value isn't used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is draining the existing value from channel if present before we send the new value. Similar to what is being done in test metrics recorder in stats package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we want to do that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just inconvenience i believe. For tests, we usually want to verify the metric that was reported recently, even though the multiple metrics are reported throughout tests. So, draining the channel before reporting a new value makes it easier by only keeping the metric of immediate interest.
For example, if we are looking for a metric that should not be reported, we only have to check the first value instead of all values in the channel.
I have made the channel size to be 1 to make it more obvious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think I agree with this testing philosophy. We shouldn't blindly throw away outputs at any step. If the test needs to say "ignore everything else, I need this metric" then it should do:
loop: for {
select {
case got := <-metricsChan:
if got == want {
break loop // success
}
case <-ctx.Done():
t.Fatalf("timed out waiting for metric %v", want)
}
}
And if we need we can put that in a helper function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, i have just removed ReceiveOrFail
. For the tests we have right now, we are anyway checking for one metric at a time after its reported.
We should probably see if we can do something better in stats one.
@@ -363,6 +367,11 @@ func (a *authority) handleADSResourceUpdate(serverConfig *ServerConfig, rType Re | |||
// On error, keep previous version of the resource. But update status | |||
// and error. | |||
if uErr.Err != nil { | |||
if a.metricsReporter != nil { | |||
a.metricsReporter.ReportMetric(MetricResourceUpdateInvalid{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's pass all of these by pointer instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
// MetricResourceUpdateInvalid is a Metric to report invalid resource updates | ||
// from the xDS management server for a given resource type. | ||
type MetricResourceUpdateInvalid struct { | ||
Target string // Target of the metric. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need this then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have removed Target and Incr
// recording events on channels and provides helpers to check if certain events | ||
// have taken place. It also persists metrics data keyed on the metrics | ||
// type. | ||
type TestMetricsReporter struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not export anything defined in test files. Nothing can import them anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not a test file? This is in internal package under xdsclient. But yeah this is only being used for e2e tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anyways, I have moved the test metric recorder within e2e tests so that we don't have to export.
@@ -59,6 +59,17 @@ func (c *Channel) Replace(value any) { | |||
} | |||
} | |||
|
|||
// ReceiveOrFail returns the value on the underlying channel and true, or nil | |||
// and false if the channel was empty. | |||
func (c *Channel) ReceiveOrFail() (any, bool) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we want to do that?
|
||
package xdsclient | ||
|
||
// MetricResourceUpdateValid is a Metric to report a valid resource update from |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit:
// MetricResourceUpdateValid is a Metric to report a valid resource update from | |
// MetricResourceUpdateValid is a metric to report a valid resource update from |
(And below)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
||
r.mu.Lock() | ||
defer r.mu.Unlock() | ||
r.data[metricName] = 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be ++
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
||
// MetricResourceUpdateValid is a Metric to report a valid resource update from | ||
// the xDS management server for a given resource type. | ||
type MetricResourceUpdateValid struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should move metrics into a child package (xdsclient/metrics
)? Or is that too many packages? It would help keep them nicely organized, though:
// Package metrics defines all metrics that can be produced by an xDS client. All calls to
// the MetricsRecorder by the xDS client will contain a struct from this package passed by
// pointer.
package metrics
type ResourceUpdateValid struct {}
type ResourceUpdateInvalid struct {}
type ServerFailure struct {}
Thoughts? @easwars do you have an opinion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to child package metrics. Seems fine to me.
Or is that too many packages
this was the only reason i initially didn't do it because this was brought up in during design in general.
7afcefa
to
71c41ed
Compare
71c41ed
to
c59d9ac
Compare
// from this package passed by pointer. | ||
package metrics | ||
|
||
// MetricResourceUpdateValid is a metric to report a valid resource update from |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can drop the Metric
prefix from these types since they end up being used currently as metrics.MetricResourceUpdateValid
at callsites, which is harded to read compared to metrics.ResourceUpdateValid
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As per comments elsewhere I gather that there will be a separate metrics package for the LRS client. If so, what will that package be called? If it will be called metrics
as well, then all callsites need to rename both imports. If it will be called lrsmetrics
, then this should be called xdsmetrics
. But that doesn't read very well. Any thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if we put the metrics
package at the same level as the xdsclient
and lrsclient
packages and prefix the metric names appropriately. So, metrics defined now might look like metrics.XDSResourceUpdateValid
, metrics.ResourceUpdateInvalid
etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's better to separate the metrics that the different clients can generate into those clients' sub-packages. If we ended up with 20 clients (we won't, but humor me) then it would be harder to use just one of them.
I think just metrics
is fine, and let the user importing it decide if they need to rename it when they import.
I'm also OK with xdsclient/xdsmetrics
and lrsclient/lrsmetrics
if anyone feels strongly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah i think metrics package is fine. We have similar cases in otel library as well and we rename while importing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, just a couple minor nits
// the xDS management server for a given resource type. | ||
type ResourceUpdateValid struct { | ||
ServerURI string // ServerURI of the xDS management server. | ||
ResourceType string // Resource type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please delete unnecessary comments. Not every field needs a comment. Or explain something about it that wouldn't be known already from the name itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually the exported fields and structs must have comments else linter will complain. I have updated the comment for ResourceType. Let me know if it make sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we disable that linter? There is no requirement to document every field in every struct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
// metricsData represents data associated with a metric. | ||
type metricsData struct { | ||
intIncr int64 // count to be incremented. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's delete this please.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
} | ||
|
||
// metricsData represents data associated with a metric. | ||
type metricsData struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, this whole type seems unnecessary. Why not just push all metrics received as an any
to the channel directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. I was thinking of making this test reporter close to our grpc stats one to showcase the metric interface will be able to satisfy what we want in grpc xds client.
// have taken place. It also persists metrics data keyed on the metrics | ||
// type. | ||
type testMetricsReporter struct { | ||
intCountCh *testutils.Channel |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't an "int count channel". It's a metricsData
channel. But see other comment about just putting the metric from the client directly on the channel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
// waitForInt64Count waits for an int64 count metric to be recorded and verifies | ||
// that the recorded metrics data matches the expected metricsDataWant. Returns | ||
// an error if failed to wait or received wrong data. | ||
func (r *testMetricsReporter) waitForInt64Count(ctx context.Context, metricsDataWant metricsData) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
waitForMetric(ctx, metricWant any)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
RELEASE NOTES: None