Add docs for leader-leaseholder splits #19755

Merged: 2 commits, Jun 20, 2025
4 changes: 2 additions & 2 deletions src/current/_includes/v25.2/essential-alerts.md
@@ -318,9 +318,9 @@ Send an alert when the number of ranges with replication below the replication f

- Refer to [Replication issues]({% link {{ page.version.version }}/cluster-setup-troubleshooting.md %}#replication-issues).

### Requests stuck in raft
### Requests stuck in Raft

Send an alert when requests are taking a very long time in replication. An (evaluated) request has to pass through the replication layer, notably the quota pool and raft. If it fails to do so within a highly permissive duration, the gauge is incremented (and decremented again once the request is either applied or returns an error). A nonzero value indicates range or replica unavailability, and should be investigated.
Send an alert when requests are taking a very long time in replication. An (evaluated) request has to pass through the replication layer, notably the quota pool and raft. If it fails to do so within a highly permissive duration, the gauge is incremented (and decremented again once the request is either applied or returns an error). A nonzero value indicates range or replica unavailability, and should be investigated. This can also be a symptom of a [leader-leaseholder split]({% link {{ page.version.version }}/architecture/replication-layer.md %}#leader-leaseholder-splits).

**Metric**
<br>`requests.slow.raft`
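
For reference, a minimal Go sketch of this alert check, assuming an insecure node serving Prometheus-format metrics at `http://localhost:8080/_status/vars` (the default HTTP port) and a one-minute polling cadence; `requests_slow_raft` is how `requests.slow.raft` appears in the `_status/vars` output:

```go
// slowraftcheck is a minimal sketch of the alert above: it scrapes a node's
// Prometheus-format metrics from /_status/vars and reports when the
// requests_slow_raft gauge is nonzero. The URL, insecure HTTP, and the
// one-minute polling cadence are assumptions; adapt them to your deployment.
package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"strconv"
	"strings"
	"time"
)

// scrapeGauge sums every sample of the named metric in the Prometheus text
// output at url (store-level metrics may appear once per store, with labels).
func scrapeGauge(url, metric string) (float64, error) {
	resp, err := http.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	var total float64
	found := false
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if !strings.HasPrefix(line, metric) {
			continue
		}
		rest := line[len(metric):]
		if rest != "" && rest[0] != ' ' && rest[0] != '{' {
			continue // a different metric that shares the prefix
		}
		if fields := strings.Fields(line); len(fields) >= 2 {
			if v, err := strconv.ParseFloat(fields[len(fields)-1], 64); err == nil {
				total += v
				found = true
			}
		}
	}
	if err := scanner.Err(); err != nil {
		return 0, err
	}
	if !found {
		return 0, fmt.Errorf("metric %q not found at %s", metric, url)
	}
	return total, nil
}

func main() {
	const varsURL = "http://localhost:8080/_status/vars" // assumed local, insecure node
	for {
		v, err := scrapeGauge(varsURL, "requests_slow_raft")
		switch {
		case err != nil:
			log.Printf("scrape failed: %v", err)
		case v > 0:
			// Per the alert description, a nonzero value indicates range or
			// replica unavailability and should be investigated.
			log.Printf("ALERT: requests_slow_raft=%v (requests stuck in Raft)", v)
		}
		time.Sleep(time.Minute)
	}
}
```
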
@@ -1,5 +1,7 @@
{% include_cached new-in.html version="v25.2" %} For the purposes of [Raft replication]({% link {{ page.version.version }}/architecture/replication-layer.md %}#raft) and determining the [leaseholder]({% link {{ page.version.version }}/architecture/overview.md %}#architecture-leaseholder) of a [range]({% link {{ page.version.version }}/architecture/overview.md %}#architecture-range), node health is no longer determined by heartbeating a single "liveness range"; instead it is determined using [Leader leases]({% link {{ page.version.version }}/architecture/replication-layer.md %}#leader-leases).

<a name="liveness-range"></a>

However, node heartbeats of a single range are still used to determine:

- Whether a node is still a member of a cluster (this is used by [`cockroach node decommission`]({% link {{ page.version.version }}/cockroach-node.md %}#node-decommission)).
4 changes: 2 additions & 2 deletions src/current/_includes/v25.3/essential-alerts.md
@@ -318,9 +318,9 @@ Send an alert when the number of ranges with replication below the replication f

- Refer to [Replication issues]({% link {{ page.version.version }}/cluster-setup-troubleshooting.md %}#replication-issues).

### Requests stuck in raft
### Requests stuck in Raft

Send an alert when requests are taking a very long time in replication. An (evaluated) request has to pass through the replication layer, notably the quota pool and raft. If it fails to do so within a highly permissive duration, the gauge is incremented (and decremented again once the request is either applied or returns an error). A nonzero value indicates range or replica unavailability, and should be investigated.
Send an alert when requests are taking a very long time in replication. An (evaluated) request has to pass through the replication layer, notably the quota pool and raft. If it fails to do so within a highly permissive duration, the gauge is incremented (and decremented again once the request is either applied or returns an error). A nonzero value indicates range or replica unavailability, and should be investigated. This can also be a symptom of a [leader-leaseholder split]({% link {{ page.version.version }}/architecture/replication-layer.md %}#leader-leaseholder-splits).

**Metric**
<br>`requests.slow.raft`
@@ -1,5 +1,7 @@
For the purposes of [Raft replication]({% link {{ page.version.version }}/architecture/replication-layer.md %}#raft) and determining the [leaseholder]({% link {{ page.version.version }}/architecture/overview.md %}#architecture-leaseholder) of a [range]({% link {{ page.version.version }}/architecture/overview.md %}#architecture-range), node health is no longer determined by heartbeating a single "liveness range"; instead it is determined using [Leader leases]({% link {{ page.version.version }}/architecture/replication-layer.md %}#leader-leases).

<a name="liveness-range"></a>

However, node heartbeats of a single range are still used to determine:

- Whether a node is still a member of a cluster (this is used by [`cockroach node decommission`]({% link {{ page.version.version }}/cockroach-node.md %}#node-decommission)).
9 changes: 9 additions & 0 deletions src/current/v25.2/architecture/replication-layer.md
@@ -160,6 +160,15 @@ Unlike table data, system ranges use expiration-based leases; expiration-based l

Expiration-based leases are also used temporarily during operations like lease transfers, until the new Raft leader can be fortified based on store liveness, as described in [Leader leases](#leader-leases).

#### Leader-leaseholder splits

[Epoch-based leases](#epoch-based-leases) (unlike [Leader leases](#leader-leases)) are vulnerable to _leader-leaseholder splits_. These can occur when a leaseholder's Raft log has fallen behind other replicas in its group and it cannot acquire Raft leadership. Coupled with a [network partition]({% link {{ page.version.version }}/cluster-setup-troubleshooting.md %}#network-partition), this split can cause permanent unavailability of the range if (1) the stale leaseholder continues heartbeating the [liveness range](#liveness-range) to hold its lease but (2) cannot reach the leader to propose writes.

Symptoms of leader-leaseholder splits include a [stalled Raft log]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#requests-stuck-in-raft) on the leaseholder and [increased disk usage]({% link {{ page.version.version }}/cluster-setup-troubleshooting.md %}#disks-filling-up) on follower replicas buffering pending Raft entries. Remediations include:

- Restarting the affected nodes.
- Fixing the network partition (or slow networking) between nodes.
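
Purely as an illustration, a rough Go sketch that polls each node's `_status/vars` endpoint and prints the two signals side by side: the `requests_slow_raft` gauge (stalled replication, typically on the stale leaseholder's node) and available store capacity (which shrinks on followers buffering pending Raft entries). The node URLs and the `capacity_available` metric name are assumptions; substitute the disk-usage signal your monitoring already collects.

```go
// splitTriage is an illustrative sketch for spotting the symptom pattern of a
// leader-leaseholder split across a cluster: slow Raft requests on one node
// while available capacity keeps dropping on its peers. Node URLs and the
// capacity_available metric name are assumptions.
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strconv"
	"strings"
)

// scrapeMetrics fetches a node's /_status/vars page and returns the summed
// values of the requested metrics (store-level metrics can appear once per
// store, with labels).
func scrapeMetrics(url string, names []string) (map[string]float64, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	out := make(map[string]float64)
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "#") {
			continue // HELP/TYPE comment lines
		}
		for _, name := range names {
			if !strings.HasPrefix(line, name) {
				continue
			}
			rest := line[len(name):]
			if rest != "" && rest[0] != ' ' && rest[0] != '{' {
				continue // a different metric that shares the prefix
			}
			if fields := strings.Fields(line); len(fields) >= 2 {
				if v, err := strconv.ParseFloat(fields[len(fields)-1], 64); err == nil {
					out[name] += v
				}
			}
		}
	}
	return out, scanner.Err()
}

func main() {
	// Assumed node addresses; list every node in the cluster.
	nodes := []string{
		"http://node1:8080/_status/vars",
		"http://node2:8080/_status/vars",
		"http://node3:8080/_status/vars",
	}
	metrics := []string{"requests_slow_raft", "capacity_available"}

	for _, url := range nodes {
		vals, err := scrapeMetrics(url, metrics)
		if err != nil {
			fmt.Printf("%-40s unreachable: %v\n", url, err)
			continue
		}
		// A node reporting slow Raft requests while its peers' available
		// capacity keeps shrinking matches the symptoms described above.
		fmt.Printf("%-40s requests_slow_raft=%.0f capacity_available=%.0f bytes\n",
			url, vals["requests_slow_raft"], vals["capacity_available"])
	}
}
```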

#### Leader leases

{% include_cached new-in.html version="v25.2" %} {% include {{ page.version.version }}/leader-leases-intro.md %}
2 changes: 2 additions & 0 deletions src/current/v25.2/cluster-setup-troubleshooting.md
@@ -372,6 +372,8 @@ Like any database system, if you run out of disk space the system will no longer
- [Why is disk usage increasing despite lack of writes?]({% link {{ page.version.version }}/operational-faqs.md %}#why-is-disk-usage-increasing-despite-lack-of-writes)
- [Can I reduce or disable the storage of timeseries data?]({% link {{ page.version.version }}/operational-faqs.md %}#can-i-reduce-or-disable-the-storage-of-time-series-data)

In rare cases, disk usage can increase on nodes with [Raft followers]({% link {{ page.version.version }}/architecture/replication-layer.md %}#raft) due to a [leader-leaseholder split]({% link {{ page.version.version }}/architecture/replication-layer.md %}#leader-leaseholder-splits).

###### Automatic ballast files

CockroachDB automatically creates an emergency ballast file at [node startup]({% link {{ page.version.version }}/cockroach-start.md %}). This feature is **on** by default. Note that the [`cockroach debug ballast`]({% link {{ page.version.version }}/cockroach-debug-ballast.md %}) command is still available but deprecated.
2 changes: 1 addition & 1 deletion src/current/v25.2/monitoring-and-alerting.md
@@ -1206,7 +1206,7 @@ Currently, not all events listed have corresponding alert rule definitions avail

#### Requests stuck in Raft

- **Rule:** Send an alert when requests are taking a very long time in replication.
- **Rule:** Send an alert when requests are taking a very long time in replication. This can be a symptom of a [leader-leaseholder split]({% link {{ page.version.version }}/architecture/replication-layer.md %}#leader-leaseholder-splits).

- **How to detect:** Calculate this using the `requests_slow_raft` metric in the node's `_status/vars` output.

2 changes: 1 addition & 1 deletion src/current/v25.2/ui-slow-requests-dashboard.md
@@ -29,7 +29,7 @@ Hovering over the graph displays values for the following metrics:

Metric | Description
--------|----
Slow Raft Proposals | The number of requests that have been stuck for longer than usual in [Raft]({% link {{ page.version.version }}/architecture/replication-layer.md %}#raft), as tracked by the `requests.slow.raft` metric.
Slow Raft Proposals | The number of requests that have been stuck for longer than usual in [Raft]({% link {{ page.version.version }}/architecture/replication-layer.md %}#raft), as tracked by the `requests.slow.raft` metric. This can be a symptom of a [leader-leaseholder split]({% link {{ page.version.version }}/architecture/replication-layer.md %}#leader-leaseholder-splits).

## Slow DistSender RPCs

9 changes: 9 additions & 0 deletions src/current/v25.3/architecture/replication-layer.md
@@ -160,6 +160,15 @@ Unlike table data, system ranges use expiration-based leases; expiration-based l

Expiration-based leases are also used temporarily during operations like lease transfers, until the new Raft leader can be fortified based on store liveness, as described in [Leader leases](#leader-leases).

#### Leader-leaseholder splits

[Epoch-based leases](#epoch-based-leases) (unlike [Leader leases](#leader-leases)) are vulnerable to _leader-leaseholder splits_. These can occur when a leaseholder's Raft log has fallen behind other replicas in its group and it cannot acquire Raft leadership. Coupled with a [network partition]({% link {{ page.version.version }}/cluster-setup-troubleshooting.md %}#network-partition), this split can cause permanent unavailability of the range if (1) the stale leaseholder continues heartbeating the [liveness range](#liveness-range) to hold its lease but (2) cannot reach the leader to propose writes.

Symptoms of leader-leaseholder splits include a [stalled Raft log]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#requests-stuck-in-raft) on the leaseholder and [increased disk usage]({% link {{ page.version.version }}/cluster-setup-troubleshooting.md %}#disks-filling-up) on follower replicas buffering pending Raft entries. Remediations include:

- Restarting the affected nodes.
- Fixing the network partition (or slow networking) between nodes.

#### Leader leases

{% include {{ page.version.version }}/leader-leases-intro.md %}
2 changes: 2 additions & 0 deletions src/current/v25.3/cluster-setup-troubleshooting.md
@@ -372,6 +372,8 @@ Like any database system, if you run out of disk space the system will no longer
- [Why is disk usage increasing despite lack of writes?]({% link {{ page.version.version }}/operational-faqs.md %}#why-is-disk-usage-increasing-despite-lack-of-writes)
- [Can I reduce or disable the storage of timeseries data?]({% link {{ page.version.version }}/operational-faqs.md %}#can-i-reduce-or-disable-the-storage-of-time-series-data)

In rare cases, disk usage can increase on nodes with [Raft followers]({% link {{ page.version.version }}/architecture/replication-layer.md %}#raft) due to a [leader-leaseholder split]({% link {{ page.version.version }}/architecture/replication-layer.md %}#leader-leaseholder-splits).

###### Automatic ballast files

CockroachDB automatically creates an emergency ballast file at [node startup]({% link {{ page.version.version }}/cockroach-start.md %}). This feature is **on** by default. Note that the [`cockroach debug ballast`]({% link {{ page.version.version }}/cockroach-debug-ballast.md %}) command is still available but deprecated.
2 changes: 1 addition & 1 deletion src/current/v25.3/monitoring-and-alerting.md
@@ -1206,7 +1206,7 @@ Currently, not all events listed have corresponding alert rule definitions avail

#### Requests stuck in Raft

- **Rule:** Send an alert when requests are taking a very long time in replication.
- **Rule:** Send an alert when requests are taking a very long time in replication. This can be a symptom of a [leader-leaseholder split]({% link {{ page.version.version }}/architecture/replication-layer.md %}#leader-leaseholder-splits).

- **How to detect:** Calculate this using the `requests_slow_raft` metric in the node's `_status/vars` output.

2 changes: 1 addition & 1 deletion src/current/v25.3/ui-slow-requests-dashboard.md
@@ -29,7 +29,7 @@ Hovering over the graph displays values for the following metrics:

Metric | Description
--------|----
Slow Raft Proposals | The number of requests that have been stuck for longer than usual in [Raft]({% link {{ page.version.version }}/architecture/replication-layer.md %}#raft), as tracked by the `requests.slow.raft` metric.
Slow Raft Proposals | The number of requests that have been stuck for longer than usual in [Raft]({% link {{ page.version.version }}/architecture/replication-layer.md %}#raft), as tracked by the `requests.slow.raft` metric. This can be a symptom of a [leader-leaseholder split]({% link {{ page.version.version }}/architecture/replication-layer.md %}#leader-leaseholder-splits).

## Slow DistSender RPCs
