DOC-11497 Docs for obs: Enabling troubleshooting hot spots externally (e.g., logs or metrics) #19577
Conversation
Deploy previews:
- ✅ Deploy Preview for cockroachdb-interactivetutorials-docs canceled.
- ✅ Deploy Preview for cockroachdb-api-docs canceled.
- ❌ Deploy Preview for cockroachdb-docs failed.
thanks for the first look, @angles-n-daemons please review again.
Awesome, couple more quick comments here.
TFTR
Awesome Doc! Few questions and suggestions.
Few questions and suggestions:
- Can we simplify this and remove the second box ("Is there a node outlier in the metrics?")?
- Are we guaranteed to have a 'hot ranges log' when there is a popular key log for the latch contention workflow? CC @angles-n-daemons
modified diagram
We aren't; I'll explain why in detail.
The hot ranges log shows up under two conditions when enabled (see the sketch after this list):
- The logging interval duration has elapsed (e.g., once every four hours).
- A single replica has exceeded the CPU threshold configured for logging.
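For reference, here is a minimal sketch of turning this logging on with cluster settings. The setting names below are the `server.telemetry.hot_ranges_stats.*` settings as I recall them; treat them as an assumption and verify against your CockroachDB version. The CPU-threshold condition is controlled by its own setting, which I'm not naming here since I'd be guessing.

```sql
-- Enable periodic hot ranges logging (emitted to the TELEMETRY channel).
SET CLUSTER SETTING server.telemetry.hot_ranges_stats.enabled = true;

-- Interval between emissions; the 4h default matches the
-- "once every four hours" example above.
SET CLUSTER SETTING server.telemetry.hot_ranges_stats.interval = '4h';
```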
Now, when there's a popular key, or rather a row hotspot, a single range may be receiving most of the traffic, but many of the incoming queries are waiting for a latch to be released rather than doing work. Waiting for a latch consumes no CPU, so when there are lots of waiting queries, there isn't correspondingly more CPU activity.
You can see this difference in the Anatomy of a Hotspot document: in "Appendix B: Anatomy of a Row Hotspot", the CPU utilization for the leaseholder, while elevated, doesn't exceed 25%.
It's certainly possible that this is enough to go over the configured threshold, but it's not guaranteed.
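To make the row-hotspot scenario concrete, here is a hypothetical workload sketch (the table and key are made up for illustration) that would produce exactly this pattern: every write targets the same row, so sessions queue on that row's latch instead of consuming CPU.

```sql
-- Hypothetical single-row counter that every client updates.
CREATE TABLE IF NOT EXISTS counters (id INT PRIMARY KEY, n INT NOT NULL DEFAULT 0);
INSERT INTO counters (id, n) VALUES (1, 0) ON CONFLICT (id) DO NOTHING;

-- Run concurrently from many sessions: all writes hit id = 1 and
-- serialize on its latch. Most sessions spend their time waiting,
-- not computing, which is why leaseholder CPU stays modest even
-- though the range is clearly hot.
UPDATE counters SET n = n + 1 WHERE id = 1;
```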
src/current/v25.2/detect-hotspots.md (Outdated)

> - Once you identify a relevant log, note the range ID in the tag section of the log.
>
> {{site.data.alerts.callout_info}}
> There may be false positives of the `popular key detected` log.
How? If we've determined that there is a metric anomaly in latches or CPU, don't we rule out the false positives?
@angles-n-daemons Would you be able to answer the above questions?
I think metric anomalies don't guarantee that there's a hotspot in the keyspace: there could, for example, be a hotspot in data domiciling, in a changefeed job, or in another similar task. Separately, because we only collect 20 samples, it's possible that the samples used to determine a popular key are randomly skewed.
That said, I'm not sure the false positives are as big a concern as I originally thought. I recommended adding this warning, but I think we can remove it and see whether it proves to be an issue at all.
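One practical aside, returning to the quoted line about noting the range ID: once you have a range ID from the log's tag section, you can inspect that range from SQL. A sketch, assuming the `crdb_internal.ranges` virtual table and its usual columns (verify the column names on your version):

```sql
-- Look up the key span and leaseholder for a range ID taken from the log.
-- Replace 42 with the range ID you noted.
SELECT range_id, start_pretty, end_pretty, lease_holder
FROM crdb_internal.ranges
WHERE range_id = 42;
```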
@kevin-v-ngo thanks for your first review, please take a second look.
modified diagram
Fixes DOC-11497
Added detect-hotspots.md and associated images.
Rendered previews: