DOC-11497 Docs for obs: Enabling troubleshooting hot spots externally (e.g., logs or metrics) #19577
base: main
Changes from all commits
0173eb8
6dbaa47
5a2c18f
c31804e
9bd27e8
db76289
b12f81c
4b1cf7a
e78be2d
74965b4
57fa244
8fc6e2c
80e592f
0aab4d9
e8411a4
30fd1c0
6a5609b
c7c0a9e
4c3b13f
3744bc2
98122e5
f724d59
afbd9ff
1886437
8fd4609
7e7f289
7d32151
502cc35
f0370b3
96f0169
d48a24a
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,141 @@ | ||
--- | ||
title: Detect Hotspots | ||
summary: Learn how to detect hotspots using real-time monitoring and historical logs in CockroachDB. | ||
toc: true | ||
--- | ||
|
||
This page provides practical guidance on identifying common [hotspots]({% link {{ page.version.version }}/understand-hotspots.md %}) in CockroachDB clusters using real-time monitoring and historical logs. This tutorial assumes that you have identified a metrics outlier in your cluster. It focuses on CPU and latch contention metrics to help you identify hot-key and hot-index scenarios. | ||
Review comment: "hot-key" and "hot-index" do not need to be hyphenated.
"hot key" could link to the relevant section below (as could "hot index").
do we have anything to link to for "identified a metrics outlier"? i did a search of our docs for "metrics outlier" and got these results, some of which look promising - maybe https://www.cockroachlabs.com/docs/cockroachcloud/metrics-overview.html ? |
||
|
||
## Before you begin | ||
|
||
- Review the [Understand hotspots page]({% link {{ page.version.version }}/understand-hotspots.md %}) for definitions and concepts. | ||
- Ensure you have access to the DB Console Metrics and the relevant logs. | ||
Review comment: "DB console metrics" could link to https://cockroachlabs.com/docs/v25.2/ui-overview#metrics
"relevant logs" could link to https://www.cockroachlabs.com/docs/v25.2/logging-overview#logging-channels |
||
|
||
## Troubleshooting overview | ||
|
||
Identify potential hotspots and optimize query and schema performance. The following sections provide details for each step. | ||
Review comment: suggest editing these sentences together, something like "the following sections provide detailed instructions for identifying potential hotspots and optimizing query and schema performance".
although reading through the subsections, it seems like it might be more accurate to say something like "the following sections provide detailed instructions for identifying potential hotspots and applying mitigations". what do you think? |
||
|
||
<img src="{{ 'images/v25.2/detect-hotspots-workflow.svg' | relative_url }}" alt="Troubleshoot hotspots workflow" style="border:1px solid #eee;max-width:100%" /> | ||
|
||
## Step 1. Check for a node outlier in metrics | ||
Review comment: re: my comment above about what to link to for "metrics outliers", maybe that could link to this section? |
||
|
||
To identify a [hotspot]({% link {{ page.version.version }}/understand-hotspots.md %}), monitor the following metrics on the [DB Console **Metrics** page]({% link {{ page.version.version }}/ui-overview.md %}#metrics) and the [DB Console **Advanced Debug Custom Chart** page]({% link {{ page.version.version }}/ui-custom-chart-debug-page.md %}). A node with a maximum value that is a clear outlier in the cluster may indicate a potential hotspot. | ||
|
||
### A. Latch conflict wait durations | ||
|
||
- Navigate to [DB Console **Advanced Debug Custom Chart** page]({% link {{ page.version.version }}/ui-custom-chart-debug-page.md %}). | ||
Review comment: nit: missing article, "navigate to the ... page" |
||
- Create a custom chart to monitor the `kv.concurrency.latch_conflict_wait_durations-avg` metric, which tracks time spent on [latch acquisition]({% link {{ page.version.version }}/architecture/transaction-layer.md %}#latch-manager) waiting for conflicts with other latches. For example, a [sequence]({% link {{ page.version.version }}/understand-hotspots.md %}#hot-sequence) that writes to the same row must wait to acquire the latch. | ||
- To display the metric per node, select the `PER NODE/STORE` checkbox. | ||
|
||
For example: | ||
Review comment: this "for example" could be dropped, and the previous bullet updated slightly to say "select the ... checkbox as shown". i think if the
re: the bullets, since this is an ordered series of steps, i suggest updating this to a numbered list |
||
|
||
<img src="{{ 'images/v25.2/detect-hotspots-latch-conflict-wait-durations.png' | relative_url }}" alt="kv.concurrency.latch_conflict_wait_durations-avg" style="border:1px solid #eee;max-width:100%" /> | ||
|
||
- Is there a node with a maximum value that is a clear outlier in the cluster for the latch conflict wait durations metric? | ||
|
||
- If **Yes**, note the ID of the [hot node]({% link {{ page.version.version }}/understand-hotspots.md %}#hot-node) and the time period when it was hot. Proceed to check for a [`popular key detected` log](#a-popular-key-detected). | |
- If **No**, check for a node outlier in [CPU percent](#b-cpu-percent) metric. | ||
Review comment: nit: missing article, "in the CPU percent metric" |
||
|
||
### B. CPU percent | ||
|
||
- Navigate to the DB Console **Metrics** page **Hardware** dashboard. | ||
Review comment: suggest making this a link to the docs: https://www.cockroachlabs.com/docs/v25.2/ui-hardware-dashboard
Review comment: because these steps are done in order, suggest making this an ordered (numbered) list |
||
- Monitor the [**CPU Percent** graph]({% link {{ page.version.version }}/ui-hardware-dashboard.md %}#cpu-percent). | ||
- CPU usage typically increases with traffic volume. | ||
Review comment: this is a list of steps, but this is not a step. suggest combining with the previous bullet, since it seems to be a comment on the previous bullet |
||
- Check if the CPU usage of the hottest node is 20% or more above the cluster average. For example, node `n5`, represented by the green line in the following **CPU Percent** graph, hovers at around 87% at time 17:35 compared to other nodes that hover around 20% to 25%. | ||
|
||
<img src="{{ 'images/v25.2/detect-hotspots-cpu-percent.png' | relative_url }}" alt="graph of CPU Percent utilization per node showing hot key" style="border:1px solid #eee;max-width:100%" /> | ||
|
||
- Is there a node with a maximum value that is a clear outlier in the cluster for the CPU percent metric? | ||
|
||
- If **Yes**, note the ID of the [hot node]({% link {{ page.version.version }}/understand-hotspots.md %}#hot-node) and the time period when it was hot. Proceed to check for a [`popular key detected` log](#a-popular-key-detected). | |
- If **No**, and the metrics outlier appears in a metric other than CPU percent or latch conflict wait duration, consider causes other than a hotspot. | ||
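
The CPU check in this step can be sketched in a few lines of Python. This is only an illustration, not part of the DB Console: the function name, the input dictionary, and the `min_gap_points` parameter are hypothetical, and the sketch assumes that "20% or more above the cluster average" means a gap of 20 percentage points, which the text does not state explicitly.

```python
from statistics import mean


def find_cpu_outlier(cpu_percent_by_node, min_gap_points=20.0):
    """Return the ID of the hottest node if it is a clear CPU outlier, else None.

    cpu_percent_by_node maps a node ID (for example "n5") to its CPU percent
    at the moment being inspected. The 20-point gap is an assumed reading of
    the guidance in this section.
    """
    if not cpu_percent_by_node:
        return None
    cluster_avg = mean(cpu_percent_by_node.values())
    # Pick the node with the highest CPU percent.
    hot_node, hot_cpu = max(cpu_percent_by_node.items(), key=lambda item: item[1])
    if hot_cpu - cluster_avg >= min_gap_points:
        return hot_node
    return None


# Values loosely based on the graph described above: n5 near 87%, others 20-25%.
readings = {"n1": 22.0, "n2": 20.0, "n3": 25.0, "n4": 23.0, "n5": 87.0}
print(find_cpu_outlier(readings))
```

If a relative threshold (for example, 1.2 times the average) matches your interpretation better, only the comparison line needs to change.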
|
||
## Step 2. Check for existence of `no split key found` log | ||
|
||
The [`no split key found` log]({% link {{ page.version.version }}/load-based-splitting.md %}#monitor-load-based-splitting) is emitted in the [`KV_DISTRIBUTION` log channel]({% link {{ page.version.version }}/logging-overview.md %}#logging-channels). This log is only emitted when a single replica begins using a significant percentage of the resources on the node where it resides. | ||
Review comment: suggest making "replica" a link to https://www.cockroachlabs.com/docs/v25.2/architecture/glossary.html#replica |
||
|
||
This log is not associated with a specific event type but includes an unstructured message, for example: | ||
|
||
``` | ||
I250523 21:59:25.755283 31560 13@kv/kvserver/split/decider.go:298 ⋮ [T1,Vsystem,n5,s5,r1115/3:‹/Table/106/1/{113338-899841…}›] 2979 no split key found: insufficient counters = 0, imbalance = 20, most popular key occurs in 36% of samples, access balance right-biased 98%, popular key detected, clear direction detected | ||
``` | ||
|
||
In the preceding log example, the square-bracketed tag section provides the following information: | ||
|
||
- node ID: `n5` indicates the node ID is 5. | ||
Review comment: suggest capitalizing "Node ID" and "Range ID" |
||
- range ID: `r1115` indicates the range ID is 1115. | ||
|
||
The timestamp at the beginning of the log is `250523 21:59:25.755283`. | ||
|
||
The unstructured message ends with one of the following string combinations: | ||
Review comment: could simplify to "ends with one of the following strings" |
||
|
||
1. `popular key detected, clear direction detected` | ||
1. `popular key detected, no clear direction` | ||
1. `no popular key, clear direction detected` | ||
1. `no popular key, no clear direction` | ||
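
As a rough illustration of how the fields described above could be pulled out of a `no split key found` line, the following Python sketch parses the timestamp, node ID, range ID, and trailing string combination. The regular expressions are assumptions derived only from the single example line shown in this section, and the `SplitLog` and `parse_no_split_key_log` names are hypothetical; other log formats or redaction settings may need different patterns.

```python
import re
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Example line copied from this page.
SAMPLE_SPLIT_LINE = (
    "I250523 21:59:25.755283 31560 13@kv/kvserver/split/decider.go:298 ⋮ "
    "[T1,Vsystem,n5,s5,r1115/3:‹/Table/106/1/{113338-899841…}›] 2979 "
    "no split key found: insufficient counters = 0, imbalance = 20, "
    "most popular key occurs in 36% of samples, access balance right-biased 98%, "
    "popular key detected, clear direction detected"
)

# Pattern inferred from the sample line only (an assumption, not a format spec).
SPLIT_LOG_RE = re.compile(
    r"^[A-Z](?P<ts>\d{6} \d{2}:\d{2}:\d{2}\.\d{6}) "
    r".*?\[(?P<tags>[^\]]*)\]"
    r".*?no split key found:.*?"
    r"(?P<popular>popular key detected|no popular key), "
    r"(?P<direction>clear direction detected|no clear direction)\s*$"
)


@dataclass
class SplitLog:
    timestamp: datetime
    node_id: int
    range_id: int
    popular_key: bool
    clear_direction: bool


def parse_no_split_key_log(line: str) -> Optional[SplitLog]:
    """Parse one log line; return None if it does not look like the sample."""
    match = SPLIT_LOG_RE.match(line)
    if match is None:
        return None
    tags = match.group("tags")
    node = re.search(r"(?:^|,)n(\d+)(?=,|$)", tags)      # for example n5
    rng = re.search(r"(?:^|,)r(\d+)(?=[/,]|$)", tags)    # for example r1115/3
    if node is None or rng is None:
        return None
    return SplitLog(
        timestamp=datetime.strptime(match.group("ts"), "%y%m%d %H:%M:%S.%f"),
        node_id=int(node.group(1)),
        range_id=int(rng.group(1)),
        popular_key=match.group("popular") == "popular key detected",
        clear_direction=match.group("direction") == "clear direction detected",
    )


parsed = parse_no_split_key_log(SAMPLE_SPLIT_LINE)
print(parsed)
```

Filtering the parsed results by `node_id` and by the time period noted in Step 1 narrows the search to the logs that matter for the outlier.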
|
||
### A. `popular key detected` | ||
florence-crl marked this conversation as resolved.
|
||
|
||
The `popular key detected` log indicates that a significant percentage of reads or writes target a single row within a range. | ||
Review comment: suggest linking "range" to https://www.cockroachlabs.com/docs/v25.2/architecture/glossary.html#range since the user needs to understand replicas/ranges pretty well to troubleshoot this |
||
|
||
- To check for a `popular key detected` log, search the `KV_DISTRIBUTION` logs on the hot node from Step 1 within the noted time period. | ||
Review comment: suggest making 'Step 1' a link to that step |
||
|
||
- Once you identify a relevant log, note the range ID in the tag section of the log. | ||
|
||
- If the outlier appears in the latch conflict wait durations metric, does a `popular key detected` log exist? | ||
Review comment: suggest making "latch conflict wait durations" link back to the previous section about that metric |
||
|
||
- If **Yes**, it may be a [write hotspot]({% link {{ page.version.version }}/understand-hotspots.md %}#write-hotspot). Note the range ID of the `popular key detected` log and proceed to find the corresponding [hot ranges log](#step-3-find-hot-ranges-log). | |
- If **No**, investigate other reasons for the latch conflict wait durations metric outlier. | ||
|
||
- If the outlier appears in the CPU percent metric, does a `popular key detected` log exist? | ||
Review comment: suggest making "CPU percent metric" link back to the previous section about that metric |
||
|
||
- If **Yes**, it may be a [read hotspot]({% link {{ page.version.version }}/understand-hotspots.md %}#read-hotspot), because the write hotspot was ruled out with the latch conflict wait durations metric. The order of operations in this troubleshooting process matters. Note the range ID of the `popular key detected` log and proceed to find the corresponding [hot ranges log](#step-3-find-hot-ranges-log). | |
Review comment: possible typo, this currently says "latch conflict wait durations metric", did you mean to edit this to match "CPU percent metric" in the previous bullet?
Review comment: since the order of operations matters, suggest changing this to be an ordered (numbered) list |
||
- If **No**, check whether a [`clear direction detected` log](#b-clear-direction-detected) exists. | ||
|
||
### B. `clear direction detected` | ||
florence-crl marked this conversation as resolved.
|
||
|
||
The `clear direction detected` log indicates that the rows touched in the range are steadily increasing or decreasing within the index. | ||
Review comment: suggest making "range" a link to the definition of range as above.
could make "index" a link to our general "indexes" docs: https://www.cockroachlabs.com/docs/v25.2/indexes.html |
||
|
||
- To determine whether a `clear direction detected` log exists, check whether any `no split key found` logs for the hot node identified in Step 1, within the noted time period, have an unstructured message that ends with `clear direction detected`. | ||
Review comment: suggest linking 'Step 1' to that step on this page |
||
|
||
- Does a `clear direction detected` log exist? | ||
|
||
- If **Yes**, it may be an [index hotspot]({% link {{ page.version.version }}/understand-hotspots.md %}#index-hotspot). Note the range ID of the `clear direction detected` log and proceed to find the corresponding [hot ranges log](#step-3-find-hot-ranges-log). | |
- If **No**, investigate other possible causes for CPU skew. | ||
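
The decision flow in Steps 1 and 2 can be summarized as a small function. This Python sketch is only a restatement of the questions above, with hypothetical names (`OutlierMetric`, `classify_hotspot`); the returned strings are informal labels, not values that CockroachDB emits.

```python
from enum import Enum


class OutlierMetric(Enum):
    """Which Step 1 metric showed the node outlier."""
    LATCH_CONFLICT_WAIT = "latch conflict wait durations"
    CPU_PERCENT = "CPU percent"


def classify_hotspot(metric: OutlierMetric, popular_key: bool, clear_direction: bool) -> str:
    """Map the Step 1 metric and the Step 2 log findings to a likely diagnosis."""
    if metric is OutlierMetric.LATCH_CONFLICT_WAIT:
        if popular_key:
            return "possible write hotspot"
        return "investigate other causes of the latch conflict wait durations outlier"
    if metric is OutlierMetric.CPU_PERCENT:
        # The latch metric is checked first, so a write hotspot is already ruled out here.
        if popular_key:
            return "possible read hotspot"
        if clear_direction:
            return "possible index hotspot"
        return "investigate other causes of CPU skew"
    raise ValueError(f"unexpected metric: {metric!r}")


print(classify_hotspot(OutlierMetric.CPU_PERCENT, popular_key=False, clear_direction=True))
```

Keeping the latch check ahead of the CPU check mirrors the ordering that the troubleshooting process relies on.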
|
||
## Step 3. Find hot ranges log | ||
|
||
A hot ranges log is a log of an event of type `hot_ranges_stats` emitted to the [`HEALTH` logging channel]({% link {{ page.version.version }}/logging-overview.md %}#logging-channels). Because this log corresponds to an event type, it includes a structured message such as: | ||
|
||
``` | ||
I250602 04:46:54.752464 2023 2@util/log/event_log.go:39 ⋮ [T1,Vsystem,n5] 31977 ={"Timestamp":1748839613749807000,"EventType":"hot_ranges_stats","RangeID":1115,"Qps":0,"LeaseholderNodeID":5,"WritesPerSecond":0.0012048123820978134,"CPUTimePerSecond":251.30338109510822,"Databases":["kv"],"Tables":["kv"],"Indexes":["kv_pkey"]} | ||
``` | ||
|
||
- To find the relevant hot ranges log, within the noted time range of the metric outlier, search for | ||
Review comment: suggest making this an ordered list since the user has to do these steps in order (i think the sublist of log keys would remain unordered since JSON keys are unordered) |
||
- `"EventType":"hot_ranges_stats"` and | ||
- `"RangeID":{range ID from popular key detected log or clear direction log}` and | ||
- `"LeaseholderNodeID":{node ID from metric outlier}`. | ||
- Once you find the relevant hot ranges log, note the values for `Databases`, `Tables`, and `Indexes`. | ||
- For a write hotspot or read hotspot, proceed to [Mitigation for hot key](#mitigation-1-hot-key). | ||
- For an index hotspot, proceed to [Mitigation for hot index](#mitigation-2-hot-index). | ||
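
The search described in this step could be scripted along the following lines. This Python sketch assumes that the structured payload is the JSON object that follows `=` in the log line, as in the example above; the `find_hot_range_events` helper is hypothetical, and it expects the caller to pass only lines from the time range of the metric outlier.

```python
import json

# Example line copied from this page.
SAMPLE_HOT_RANGES_LINE = (
    'I250602 04:46:54.752464 2023 2@util/log/event_log.go:39 ⋮ [T1,Vsystem,n5] 31977 '
    '={"Timestamp":1748839613749807000,"EventType":"hot_ranges_stats","RangeID":1115,'
    '"Qps":0,"LeaseholderNodeID":5,"WritesPerSecond":0.0012048123820978134,'
    '"CPUTimePerSecond":251.30338109510822,"Databases":["kv"],"Tables":["kv"],'
    '"Indexes":["kv_pkey"]}'
)


def find_hot_range_events(lines, range_id, leaseholder_node_id):
    """Return the hot_ranges_stats payloads that match the range and node IDs."""
    matches = []
    for line in lines:
        start = line.find("={")  # assumed marker for the structured payload
        if start == -1:
            continue
        try:
            event = json.loads(line[start + 1:])
        except json.JSONDecodeError:
            continue  # skip lines whose payload is not valid JSON
        if (
            event.get("EventType") == "hot_ranges_stats"
            and event.get("RangeID") == range_id
            and event.get("LeaseholderNodeID") == leaseholder_node_id
        ):
            matches.append(event)
    return matches


for event in find_hot_range_events([SAMPLE_HOT_RANGES_LINE], range_id=1115, leaseholder_node_id=5):
    # These are the three values to note for the mitigation steps.
    print(event["Databases"], event["Tables"], event["Indexes"])
```

Parsing the payload as JSON rather than matching substrings avoids false matches such as `"RangeID":11150` when searching for range 1115.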
|
||
## Mitigation 1 - hot key | ||
|
||
To mitigate a [hot key]({% link {{ page.version.version }}/understand-hotspots.md %}#row-hotspot) (whether a write hotspot or read hotspot), identify the problematic queries and refactor your application accordingly. Use the `Databases`, `Tables`, and `Indexes` values from the hot ranges log to filter the following DB Console pages by the time period of the metric outlier and log emission: | ||
Review comment: is there something on this page to link to from the "hot ranges log" text in this paragraph? |
||
|
||
- The [**Databases** Index Details page]({% link {{ page.version.version }}/ui-databases-page.md %}#index-details-page) includes an **Index Usage** section that shows statement fingerprints using that index. | ||
- The [**SQL Activity Statements** page]({% link {{ page.version.version }}/ui-statements-page.md %}) shows statement fingerprints that can be filtered. | ||
|
||
## Mitigation 2 - hot index | ||
|
||
To mitigate a hot index, update the index schema using the values noted for `Databases`, `Tables`, and `Indexes` in the hot ranges log. Refer to [Resolving index hotspots]({% link {{ page.version.version }}/understand-hotspots.md %}#resolving-index-hotspots). | ||
Review comment: is there something on this page to link to from "the hot ranges log"? |
||
|
||
Review comment: it looks like the 'reduce hotspots' section of https://www.cockroachlabs.com/docs/v25.2/performance-recipes#hotspots and https://www.cockroachlabs.com/docs/v25.2/understand-hotspots#reduce-hotspots is an include that is shared across both of those pages. would it make sense to also include it on this page as general advice, in addition to the specific advice here? |
||
## See also | ||
|
||
- [Understand Hotspots]({% link {{ page.version.version }}/understand-hotspots.md %}) | ||
- [**Metrics** page]({% link {{ page.version.version }}/ui-overview.md %}#metrics) | ||
- [**Advanced Debug Custom Chart** page]({% link {{ page.version.version }}/ui-custom-chart-debug-page.md %}) | ||
- [Logging channels]({% link {{ page.version.version }}/logging-overview.md %}#logging-channels) | ||
- [Load-based splitting]({% link {{ page.version.version }}/load-based-splitting.md %}) | ||
- [**SQL Activity Statements** page]({% link {{ page.version.version }}/ui-statements-page.md %}) | ||
- [**Databases** Index Details page]({% link {{ page.version.version }}/ui-databases-page.md %}#index-details-page) |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -8,6 +8,8 @@ In distributed SQL, hotspots refer to bottlenecks that limit a cluster's ability | |
|
||
The page also offers best practices for [reducing hotspots](#reduce-hotspots), including a [video demo](#video-demo). | ||
|
||
To troubleshoot common hotspots, refer to the [Detect Hotspots page]({% link {{ page.version.version }}/detect-hotspots.md %}). | ||
|
||
## Terminology | ||
|
||
### Hotspot | ||
|
@@ -335,5 +337,6 @@ For a demo on hotspot reduction, watch the following video: | |
|
||
## See also | ||
|
||
- [Detect Hotspots]({% link {{ page.version.version }}/detect-hotspots.md %}) | ||
Review comment: I think another place in our docs that could mention this new 'detect hotspots' page and link to it is https://www.cockroachlabs.com/docs/v25.2/performance-recipes#hotspots |
||
- [Performance Tuning Recipes: Hotspots]({% link {{ page.version.version }}/performance-recipes.md %}#hotspots) | ||
- [Single hot node]({% link {{ page.version.version }}/query-behavior-troubleshooting.md %}#single-hot-node) |
Review comment: i just wanted to say i like this diagram very much!