-
Notifications
You must be signed in to change notification settings - Fork 729
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
metrics: add region request metrics which record caller info #9117
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: okJiang <[email protected]>
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #9117 +/- ##
==========================================
- Coverage 76.36% 75.73% -0.64%
==========================================
Files 473 477 +4
Lines 71663 73269 +1606
==========================================
+ Hits 54728 55487 +759
- Misses 13514 14308 +794
- Partials 3421 3474 +53
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
Subsystem: "server", | ||
Name: "region_request_cnt", | ||
Help: "Counter of region request.", | ||
}, []string{"request", "caller_id", "caller_component", "err_msg"}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make it more simple, e.g., request-component-module?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ScanRegions-tidb-server->tidb-tikv-driver
is kind of confusing and not readable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
request-component-module LGTM
However, there are caller_id
and caller_component in the current code. If we use request-component-module
here, it will cause confusion with the term component
.
ScanRegions-tidb-server->tidb-tikv-driver is kind of confusing and not readable.
It is indeed easy to get confused, and I am still thinking of a better format. Since we chose to use a hyphen as a connector in the component, continuing to use a hyphen in Grafana might look a bit strange.

Maybe we should name the component using camel case? Such as RegionCache
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think? cc @lhy1024 @JmPotato @bufferflies
Signed-off-by: okJiang <[email protected]>
client/pkg/caller/caller.go
Outdated
// client-go component | ||
CodecPDClient Component = "codec-pd-client" | ||
RegionCache Component = "region-cache" | ||
StoreCache Component = "store-cache" | ||
Oracles Component = "oracle" | ||
Rawkv Component = "rawkv" | ||
KvStore Component = "kv-store" | ||
InterceptedPDClient Component = "intercepted-pd-client" | ||
|
||
// TiDB component | ||
Pitr Component = "pitr" | ||
Ddl Component = "ddl" | ||
ImportInto Component = "import-into" | ||
TikvHandler Component = "tikv-handler" | ||
GcWorker Component = "gc-worker" | ||
GcJob Component = "gc-job" | ||
DistributedGcJob Component = "distributed-gc-job" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These names can be reviewed with these two PR https://github.com/pingcap/tidb/pull/59911/files and tikv/client-go#1516
client/pkg/caller/caller.go
Outdated
CodecPDClient Component = "codec-pd-client" | ||
RegionCache Component = "region-cache" | ||
StoreCache Component = "store-cache" | ||
Oracles Component = "oracle" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oracles Component = "oracle" | |
Oracle Component = "oracle" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where is it used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is used in other repo https://github.com/tikv/pd/pull/9117/files/ebd66e5b0042353ad6402f6239ab049a438c21a0#r1982716822
Did I misunderstand your comment? I thought what you needed was the current PR like. tikv/client-go#1516 (comment)
@@ -1440,6 +1443,9 @@ func (s *GrpcServer) GetPrevRegion(ctx context.Context, request *pdpb.GetRegionR | |||
return rsp.(*pdpb.GetRegionResponse), err | |||
} | |||
|
|||
defer func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we extract one function for this? and get the call name by another way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't there only one function right now?
Signed-off-by: okJiang <[email protected]>
prometheus.MustRegister(regionRequestCounter) | ||
} | ||
|
||
func incRegionRequestCounter(method string, header *pdpb.RequestHeader, err *pdpb.Error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hash function is little heavy, and we don't need to report all the request just see the the source. So we can use sample to report the normal request to decrease the usage of hash function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can indeed lighten the load; for example, we can collect metrics with a probability of one percent. Even with tens of thousands of requests, we can still identify the sources of these large requests.
If there are no objections from others, I will make the changes today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done~ 7c00c9e
Signed-off-by: okJiang <[email protected]>
Signed-off-by: okJiang <[email protected]>
Signed-off-by: okJiang <[email protected]>
@@ -28,11 +28,31 @@ type ( | |||
Component string | |||
) | |||
|
|||
// nolint: exported |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, otherwise it would require a comment for each variable
if callerComponent == "" { | ||
callerComponent = "unknown" | ||
} | ||
regionRequestCounter.WithLabelValues(method, callerID, callerComponent, errMsg).Inc() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the error is nil, the inc one is wrong; it should multi * 100.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rest lgtm
if err == nil && rand.Intn(100) != 0 { | ||
// sample 1% region requests to avoid high cardinality | ||
return | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of sampling with a specific probability, can we record the data into a counter and periodically flush it to Prometheus?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Recording to the counter also requires calculation via a hash map, and we need to ensure it's thread-safe, which likewise introduces considerable overhead.
Signed-off-by: okJiang <[email protected]>
What problem does this PR solve?
Issue Number: Ref #8593
What is changed and how does it work?
Check List
Tests
Related changes
pingcap/docs
/pingcap/docs-cn
:pingcap/tiup
:Release note