DOC-11497 Docs for obs: Enabling troubleshooting hot spots externally (e.g., logs or metrics) #19577

Open
wants to merge 29 commits into
base: main

Conversation

florence-crl
Contributor

@florence-crl florence-crl commented May 1, 2025

Fixes DOC-11497

Added detect-hotspots.md and associated images.

Rendered previews:


github-actions bot commented May 1, 2025

Files changed:


netlify bot commented May 1, 2025

Deploy Preview for cockroachdb-interactivetutorials-docs canceled.

Name Link
🔨 Latest commit f0370b3
🔍 Latest deploy log https://app.netlify.com/projects/cockroachdb-interactivetutorials-docs/deploys/6851b312362ada00083641dd


netlify bot commented May 1, 2025

Deploy Preview for cockroachdb-api-docs canceled.

Name Link
🔨 Latest commit f0370b3
🔍 Latest deploy log https://app.netlify.com/projects/cockroachdb-api-docs/deploys/6851b312a578fc00083c1794


netlify bot commented May 1, 2025

Deploy Preview for cockroachdb-docs failed. Why did it fail? →

Name Link
🔨 Latest commit 0173eb8
🔍 Latest deploy log https://app.netlify.com/sites/cockroachdb-docs/deploys/6813b55b6c4a2d00084eadec


netlify bot commented May 1, 2025

Netlify Preview

Name Link
🔨 Latest commit f0370b3
🔍 Latest deploy log https://app.netlify.com/projects/cockroachdb-docs/deploys/6851b3120add9a000825327e
😎 Deploy Preview https://deploy-preview-19577--cockroachdb-docs.netlify.app

@florence-crl florence-crl requested a review from kevin-v-ngo May 13, 2025 19:17
Contributor Author

@florence-crl florence-crl left a comment

Thanks for the first look, @angles-n-daemons. Please review again.

@angles-n-daemons angles-n-daemons left a comment

Awesome, couple more quick comments here.

Contributor Author

@florence-crl florence-crl left a comment

TFTR

@kevin-v-ngo kevin-v-ngo left a comment

Awesome Doc! Few questions and suggestions.

A few questions and suggestions:

  1. Can we simplify this and remove the second box ("Is there a node outlier in the metrics?")?
  2. Are we guaranteed to have a 'hot ranges log' whenever there is a popular key log in the latch contention workflow? CC @angles-n-daemons

Contributor Author

modified diagram

We aren't. I'll explain why in detail.

The hot ranges log shows up under two conditions when enabled:

  1. The logging interval duration has elapsed (e.g., once every four hours).
  2. A single replica has exceeded the CPU threshold we configured for logging.

Now, when there's a popular key (that is, a row hotspot), a single range may be receiving most of the traffic, but many of the incoming queries are waiting for a latch to be released rather than doing any work. Waiting for a latch has no effect on CPU utilization, so when many queries are waiting, there is less CPU activity than the request volume would suggest.

You can see this difference in the Anatomy of a Hotspot document. In "Appendix B: Anatomy of a Row Hotspot", the CPU utilization for the leaseholder is elevated but doesn't exceed 25%.

It's certainly possible that this is enough to go over the threshold defined, but not guaranteed.
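
The two conditions above, and why a row hotspot may not trigger the second one, can be sketched roughly as follows. This is only an illustration: the names `should_log_hot_ranges` and `ReplicaSample` and the 50% threshold are hypothetical, not CockroachDB's actual implementation or defaults.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical values for illustration only; not CockroachDB defaults.
LOG_INTERVAL_SECONDS = 4 * 60 * 60  # "once every four hours" from the comment above
CPU_THRESHOLD = 0.50                # assumed per-replica CPU fraction that triggers a log


@dataclass
class ReplicaSample:
    range_id: int
    cpu_fraction: float    # share of CPU attributed to this replica (0.0 to 1.0)
    waiting_requests: int  # requests blocked on a latch; they add no CPU load


def should_log_hot_ranges(seconds_since_last_log: float,
                          replicas: List[ReplicaSample]) -> bool:
    """Return True if either of the two logging conditions holds."""
    interval_elapsed = seconds_since_last_log >= LOG_INTERVAL_SECONDS
    replica_over_threshold = any(r.cpu_fraction > CPU_THRESHOLD for r in replicas)
    return interval_elapsed or replica_over_threshold


# A row hotspot: many requests queue on a latch, but CPU stays modest.
row_hotspot = [ReplicaSample(range_id=42, cpu_fraction=0.25, waiting_requests=500)]
# A CPU-bound hot range, for comparison.
cpu_hotspot = [ReplicaSample(range_id=7, cpu_fraction=0.80, waiting_requests=3)]

print(should_log_hot_ranges(60, row_hotspot))  # interval not elapsed, CPU below threshold
print(should_log_hot_ranges(60, cpu_hotspot))  # replica above the assumed threshold
print(should_log_hot_ranges(LOG_INTERVAL_SECONDS, row_hotspot))  # interval elapsed
```

With these assumed numbers, the latch-bound replica is only logged once the interval elapses, which is the reason a popular key log can appear without a matching hot ranges log.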

- Once you identify a relevant log, note the range ID in the tag section of the log.

{{site.data.alerts.callout_info}}
There may be false positives of the `popular key detected` log.

How? If we determined that there is a metric anomaly in latch or CPU, don't we remove the false positives?

Contributor Author

@angles-n-daemons Would you be able to answer the above questions?

I think metric anomalies don't guarantee that there's a hotspot in the keyspace. There could, for example, be a hotspot in data domiciling, in a changefeed job, or in a similar task. Separately, because we only collect 20 samples, it's possible that the samples collected to determine a popular key are randomly skewed.

That said, I'm not sure the false positives are as big a concern as I previously thought. I recommended adding this warning, but I think we can remove it and see whether it proves to be an issue at all.
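
To get a rough feel for the sampling-skew point, here is a small binomial calculation. Only the sample count of 20 comes from the comment above. The 10% true share and the "5 or more hits" flagging rule are hypothetical stand-ins, since the actual popular key criterion isn't stated in this thread.

```python
from math import comb


def tail_probability(samples: int, key_share: float, min_hits: int) -> float:
    """P(X >= min_hits) for X ~ Binomial(samples, key_share)."""
    return sum(
        comb(samples, k) * key_share**k * (1 - key_share) ** (samples - k)
        for k in range(min_hits, samples + 1)
    )


SAMPLES = 20       # sample count mentioned in the comment above
TRUE_SHARE = 0.10  # hypothetical: the key really receives 10% of requests
FLAG_AT = 5        # hypothetical: flag a key seen in 5 or more of the 20 samples

p = tail_probability(SAMPLES, TRUE_SHARE, FLAG_AT)
print(f"chance a 10% key shows up in at least {FLAG_AT} of {SAMPLES} samples: {p:.3f}")
```

Under these assumptions the chance is small but not zero, which matches the idea that false positives are possible without necessarily being common.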

Contributor Author

@florence-crl florence-crl left a comment

@kevin-v-ngo Thanks for your first review. Please take a second look.

@florence-crl florence-crl requested a review from kevin-v-ngo June 17, 2025 18:37