Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

metrics: add region request metrics which record caller info #9117

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

okJiang
Copy link
Member

@okJiang okJiang commented Mar 5, 2025

What problem does this PR solve?

Issue Number: Ref #8593

What is changed and how does it work?

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
image

Related changes

Release note

None.

Signed-off-by: okJiang <[email protected]>
@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. dco-signoff: yes Indicates the PR's author has signed the dco. labels Mar 5, 2025
Copy link
Contributor

ti-chi-bot bot commented Mar 5, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign binshi-bing for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Mar 5, 2025
Copy link

codecov bot commented Mar 5, 2025

Codecov Report

Attention: Patch coverage is 65.95745% with 16 lines in your changes missing coverage. Please review.

Project coverage is 75.73%. Comparing base (3b505f5) to head (66e8344).
Report is 19 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #9117      +/-   ##
==========================================
- Coverage   76.36%   75.73%   -0.64%     
==========================================
  Files         473      477       +4     
  Lines       71663    73269    +1606     
==========================================
+ Hits        54728    55487     +759     
- Misses      13514    14308     +794     
- Partials     3421     3474      +53     
Flag Coverage Δ
unittests 75.73% <65.95%> (∅)

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Subsystem: "server",
Name: "region_request_cnt",
Help: "Counter of region request.",
}, []string{"request", "caller_id", "caller_component", "err_msg"})
Copy link
Member

@rleungx rleungx Mar 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make it more simple, e.g., request-component-module?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ScanRegions-tidb-server->tidb-tikv-driver is kind of confusing and not readable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

request-component-module LGTM

However, there are caller_id and caller_component in the current code. If we use request-component-module here, it will cause confusion with the term component.

ScanRegions-tidb-server->tidb-tikv-driver is kind of confusing and not readable.

It is indeed easy to get confused, and I am still thinking of a better format. Since we chose to use a hyphen as a connector in the component, continuing to use a hyphen in Grafana might look a bit strange.

image

Maybe we should name the component using camel case? Such as RegionCache

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think? cc @lhy1024 @JmPotato @bufferflies

Signed-off-by: okJiang <[email protected]>
Comment on lines 38 to 54
// client-go component
CodecPDClient Component = "codec-pd-client"
RegionCache Component = "region-cache"
StoreCache Component = "store-cache"
Oracles Component = "oracle"
Rawkv Component = "rawkv"
KvStore Component = "kv-store"
InterceptedPDClient Component = "intercepted-pd-client"

// TiDB component
Pitr Component = "pitr"
Ddl Component = "ddl"
ImportInto Component = "import-into"
TikvHandler Component = "tikv-handler"
GcWorker Component = "gc-worker"
GcJob Component = "gc-job"
DistributedGcJob Component = "distributed-gc-job"
Copy link
Member Author

@okJiang okJiang Mar 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These names can be reviewed with these two PR https://github.com/pingcap/tidb/pull/59911/files and tikv/client-go#1516

@okJiang okJiang changed the title client: add region request metrics which record caller info service: add region request metrics which record caller info Mar 7, 2025
@okJiang okJiang changed the title service: add region request metrics which record caller info metrics: add region request metrics which record caller info Mar 7, 2025
CodecPDClient Component = "codec-pd-client"
RegionCache Component = "region-cache"
StoreCache Component = "store-cache"
Oracles Component = "oracle"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Oracles Component = "oracle"
Oracle Component = "oracle"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is it used?

Copy link
Member Author

@okJiang okJiang Mar 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is used in other repo https://github.com/tikv/pd/pull/9117/files/ebd66e5b0042353ad6402f6239ab049a438c21a0#r1982716822

Did I misunderstand your comment? I thought what you needed was the current PR like. tikv/client-go#1516 (comment)

@@ -1440,6 +1443,9 @@ func (s *GrpcServer) GetPrevRegion(ctx context.Context, request *pdpb.GetRegionR
return rsp.(*pdpb.GetRegionResponse), err
}

defer func() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we extract one function for this? and get the call name by another way.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't there only one function right now?

Signed-off-by: okJiang <[email protected]>
prometheus.MustRegister(regionRequestCounter)
}

func incRegionRequestCounter(method string, header *pdpb.RequestHeader, err *pdpb.Error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hash function is little heavy, and we don't need to report all the request just see the the source. So we can use sample to report the normal request to decrease the usage of hash function.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can indeed lighten the load; for example, we can collect metrics with a probability of one percent. Even with tens of thousands of requests, we can still identify the sources of these large requests.

If there are no objections from others, I will make the changes today.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done~ 7c00c9e

okJiang added 3 commits March 27, 2025 18:10
Signed-off-by: okJiang <[email protected]>
Signed-off-by: okJiang <[email protected]>
Signed-off-by: okJiang <[email protected]>
@@ -28,11 +28,31 @@ type (
Component string
)

// nolint: exported
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessary?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, otherwise it would require a comment for each variable

@okJiang
Copy link
Member Author

okJiang commented Apr 1, 2025

ping @bufferflies @JmPotato @rleungx

if callerComponent == "" {
callerComponent = "unknown"
}
regionRequestCounter.WithLabelValues(method, callerID, callerComponent, errMsg).Inc()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the error is nil, the inc one is wrong; it should multi * 100.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image I separated successful and failed requests, marking a 1% sampling rate for successful requests.

Copy link
Contributor

@bufferflies bufferflies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rest lgtm

Comment on lines +220 to +223
if err == nil && rand.Intn(100) != 0 {
// sample 1% region requests to avoid high cardinality
return
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of sampling with a specific probability, can we record the data into a counter and periodically flush it to Prometheus?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recording to the counter also requires calculation via a hash map, and we need to ensure it's thread-safe, which likewise introduces considerable overhead.

Signed-off-by: okJiang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dco-signoff: yes Indicates the PR's author has signed the dco. release-note-none Denotes a PR that doesn't merit a release note. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants