
Commit 190a308

add recovery guide
1 parent b7263fa commit 190a308

3 files changed: +54 -7 lines changed


command-line-flags-for-pd-configuration.md

Lines changed: 10 additions & 0 deletions
@@ -105,3 +105,13 @@ PD is configurable using command-line flags and environment variables.

 - The address of Prometheus Pushgateway, which does not push data to Prometheus by default.
 - Default: `""`

+## `--force-new-cluster`

+- Forces PD to create a new cluster using the current nodes.
+- Default: `false`
+- It is recommended to use this flag only when recovering the service because PD has lost most of its replicas; note that data loss might occur.

+## `-V`, `--version`

+- Outputs the version of PD and then exits.
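
For context on how the new `--force-new-cluster` flag might be used, the following is a minimal sketch of restarting a surviving PD node so that it rebuilds a single-member cluster from its existing data directory. The node name, data directory, and URLs are placeholders, not values from this commit.

```shell
# Sketch only: restart a surviving PD node and force it to form a new cluster.
# Replace the name, data directory, and URLs with the values of your own deployment.
pd-server --name="pd-1" \
    --data-dir="/data/pd-1" \
    --client-urls="http://10.0.1.10:2379" \
    --peer-urls="http://10.0.1.10:2380" \
    --force-new-cluster
```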

tikv-control.md

Lines changed: 20 additions & 0 deletions
@@ -463,6 +463,26 @@ success!
 > - The argument of the `-p` option specifies the PD endpoints without the `http` prefix. Specifying the PD endpoints is to query whether the specified `region_id` is validated or not.
 > - You need to run this command for all stores where specified Regions' peers are located.

+### Recover from ACID inconsistency data

+To recover data that is inconsistent with ACID semantics, for example, after the loss of most replicas or incomplete data synchronization, you can use the `reset-to-version` command. When using this command, you need to provide an earlier version number at which ACID consistency is guaranteed. Then `tikv-ctl` cleans up all data written after the specified version.

+- The `-v` option is used to specify the version number to reset to. To get the value of the `-v` parameter, you can use the `pd-ctl min-resolved-ts` command.

+```shell
+tikv-ctl --host 127.0.0.1:20160 reset-to-version -v 430315739761082369
+```

+```
+success!
+```

+> **Note:**
+>
+> - The preceding command only supports the online mode. Before executing it, you need to stop the processes that write data to TiKV, such as TiDB. After the command is executed successfully, it returns `success!`.
+> - You need to execute the same command on all TiKV nodes in the cluster.
+> - Stop all PD scheduling tasks before executing the command.

 ### Ldb Command

 The `ldb` command line tool offers multiple data access and database administration commands. Some examples are listed below. For more information, refer to the help message displayed when running `tikv-ctl ldb` or check the documents from RocksDB.
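
As a rough illustration of the workflow added above, the following sketch first queries PD for a consistent version and then runs `reset-to-version` on every TiKV node. The PD address and TiKV host list are placeholders, and the version shown is the example value from the section above.

```shell
# Sketch only: stop TiDB and PD scheduling first, as required by the note above.

# 1. Get a version number that is guaranteed to be ACID-consistent.
pd-ctl -u http://127.0.0.1:2379 min-resolved-ts

# 2. Run reset-to-version with that version on every TiKV node (placeholder hosts).
for host in 10.0.1.1:20160 10.0.1.2:20160 10.0.1.3:20160; do
    tikv-ctl --host "${host}" reset-to-version -v 430315739761082369
done
```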

two-data-centers-in-one-city-deployment.md

Lines changed: 24 additions & 7 deletions
@@ -212,7 +212,6 @@ The replication mode is controlled by PD. You can configure the replication mode
 primary-replicas = 2
 dr-replicas = 1
 wait-store-timeout = "1m"
-wait-sync-timeout = "1m"
 ```

 - Method 2: If you have deployed a cluster, use pd-ctl commands to modify the configurations of PD.
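
The configuration fragment in the hunk above belongs to the `[replication-mode.dr-auto-sync]` block of the PD configuration file (Method 1). A sketch of the surrounding block is shown below, assuming a placeholder label key and data center names; only the keys visible in this hunk are taken from the commit itself.

```toml
# Sketch of the surrounding PD configuration block (label key and data center names are placeholders).
[replication-mode]
replication-mode = "dr-auto-sync"

[replication-mode.dr-auto-sync]
label-key = "dc"
primary = "dc-1"
dr = "dc-2"
primary-replicas = 2
dr-replicas = 1
wait-store-timeout = "1m"
```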
@@ -274,14 +273,32 @@ The details for the status switch are as follows:

 ### Disaster recovery

-This section introduces the disaster recovery solution of the two data centers in one city deployment.
+This section introduces the disaster recovery solution of the two data centers in one city deployment. The disaster discussed in this section is either the overall failure of the primary data center, or the failure of multiple TiKV nodes in the primary or secondary data center, which results in the loss of most replicas and leaves the cluster unable to provide services.

-When a disaster occurs to a cluster in the synchronous replication mode, you can perform data recovery with `RPO = 0`:
+#### Overall failure of the primary data center

-- If the primary data center fails and most of the Voter replicas are lost, but complete data exists in the DR data center, the lost data can be recovered from the DR data center. At this time, manual intervention is required with professional tools. You can contact the TiDB team for a recovery solution.
+In this situation, all Regions in the primary data center have lost most of their replicas, so the cluster is unavailable. You need to use the secondary data center to recover the service. The replication status before the failure determines the recovery capability:

-- If the DR center fails and a few Voter replicas are lost, the cluster automatically switches to the asynchronous replication mode.
+- If the cluster was in the synchronous replication mode before the failure (the status code is `sync` or `async_wait`), you can use the secondary data center to recover with `RPO = 0`.

-When a disaster occurs to a cluster that is not in the synchronous replication mode and you cannot perform data recovery with `RPO = 0`:
+- If the cluster was in the asynchronous replication mode before the failure (the status code is `async`), the data written to the primary data center in the asynchronous replication mode is lost after you use the secondary data center to recover. A typical scenario is that the primary data center disconnects from the secondary data center, switches to the asynchronous replication mode, and provides service for a while before the overall failure.

-- If most of the Voter replicas are lost, manual intervention is required with professional tools. You can contact the TiDB team for a recovery solution.
+- If the cluster was switching from the asynchronous to the synchronous replication mode before the failure (the status code is `sync-recover`), part of the data written to the primary data center in the asynchronous replication mode is lost after you use the secondary data center to recover. This might cause ACID inconsistency, which needs additional recovery. A typical scenario is that the primary data center disconnects from the secondary data center, the connection is restored after the cluster switches to the asynchronous mode, and data is written. Then, during the data synchronization between the primary and secondary data centers, something goes wrong and the primary data center fails as a whole.

+The process of disaster recovery is as follows:

+1. Stop all PD, TiKV, and TiDB services in the secondary data center.

+2. Start the PD nodes of the secondary data center in single-replica mode using the [`--force-new-cluster`](/command-line-flags-for-pd-configuration.md#--force-new-cluster) flag.

+3. Use [Online Unsafe Recovery](/online-unsafe-recovery.md) to process the TiKV data in the secondary data center. The parameter is the list of all Store IDs in the primary data center.

+4. Write a new placement rule configuration using [PD Control](/pd-control.md), so that Regions in the secondary data center have the same Voter replica configuration as the original cluster.

+5. Start the PD and TiKV services in the secondary data center.

+6. To recover ACID consistency (that is, the `DR_STATE` status in the old PD is `sync-recover`), use [`reset-to-version`](/tikv-control.md#recover-from-acid-inconsistency-data) to process the TiKV data. The `version` parameter can be obtained from `pd-ctl min-resolved-ts`.

+7. Start the TiDB service in the secondary data center and check the data integrity and consistency.

+If you need support for disaster recovery, you can contact the TiDB team for a recovery solution.
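
To make the new procedure more concrete, here is a condensed sketch of steps 2, 3, and 6 for the surviving secondary data center. All addresses, Store IDs, and the version value are placeholders; follow the linked Online Unsafe Recovery and PD Control documents for the authoritative commands, including the placement rule change in step 4, which is omitted here.

```shell
# Sketch only: placeholder addresses, Store IDs, and version value.

# Step 2: restart a PD node of the secondary data center as a new single-member cluster.
pd-server --name="pd-dr-1" --data-dir="/data/pd-dr-1" \
    --client-urls="http://10.0.2.10:2379" --peer-urls="http://10.0.2.10:2380" \
    --force-new-cluster

# Step 3: remove the failed stores of the primary data center (Store IDs 1, 2, and 3 are examples).
pd-ctl -u http://10.0.2.10:2379 unsafe remove-failed-stores 1,2,3
pd-ctl -u http://10.0.2.10:2379 unsafe remove-failed-stores show   # check the progress

# Step 6: if the old DR_STATE was sync-recover, clean up data newer than the min resolved timestamp.
pd-ctl -u http://10.0.2.10:2379 min-resolved-ts
tikv-ctl --host 10.0.2.11:20160 reset-to-version -v <version-from-min-resolved-ts>
```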
