Disk failures can present in several forms, which may be difficult to detect.

Look for the following symptoms:
- This is usually caused by writes to disk partially or completely failing
- This can also be caused by running out of disk space

As with an instance crash, the consequences can be far-reaching and not
immediately clear in all cases.

To migrate to a new disk, follow the
[emergency primary migration](#emergency-primary-migration) flow. When you
create a new replica, you can populate it with the latest snapshot you have
taken, and then recover the rest using replicated WALs in the object store.
### Flows

#### Planned primary migration

Use this flow when you want to change your primary to another instance, but the
primary has not failed.

The database can be started in a mode which disallows further ingestion, but
allows replication. With this method, you can ensure that all outstanding data
has been replicated before you start ingesting into a new primary instance.
- Ensure primary instance is still capable of replicating data to the object store
- Stop primary instance
- Restart primary instance with `replication.role=primary-catchup-uploads`
- Wait for the instance to complete its uploads and exit with `code 0`
- Then follow the [emergency primary migration](#emergency-primary-migration) flow
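The catch-up step above is a one-line configuration change. As a sketch,
assuming a standard layout where `server.conf` lives under `conf/`:

```ini
# conf/server.conf on the stopped primary, before restarting it.
# Remove this setting again once the instance has exited with code 0.
replication.role=primary-catchup-uploads
```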
#### Emergency primary migration

Use this flow when you wish to discard a failed primary instance and move to a
new one.
- Stop the replica instance
- Set `replication.role=primary` on the replica
- Ensure other primary-related settings are configured appropriately
  - for example, snapshotting policies
- Create an empty `_migrate_primary` file in your database installation
  directory (i.e. the parent of `conf` and `db`)
- Start the replica instance, which is now the new primary
- Create a new replica instance to replace the promoted replica
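The promotion steps above can be sketched as shell commands. This is
illustrative only: `questdb-replica` stands in for your actual installation
directory, and the replica instance must already be stopped.

```shell
# Hypothetical installation root of the replica being promoted;
# adapt to your deployment.
qdb_root=./questdb-replica
mkdir -p "$qdb_root/conf"

# Promote the replica by switching its role to primary.
echo "replication.role=primary" >> "$qdb_root/conf/server.conf"

# An empty marker file in the installation directory (the parent of
# conf/ and db/) triggers the migration on the next start.
touch "$qdb_root/_migrate_primary"
```

After this, starting the instance brings it up as the new primary.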
:::warning

Any data committed to the primary, but not yet replicated, will be lost. If the
primary has not completely failed, you can follow the
[planned primary migration](#planned-primary-migration) flow to ensure that all
remaining data has been replicated before switching primary.

:::
#### When could migration fail?

Two primaries started within the same
`replication.primary.keepalive.interval=10s` may still break.

It is important not to migrate the primary without stopping the first primary,
if it is still within this interval.

This config can be set in the range of 1 to 300 seconds.
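For reference, the interval is an ordinary `server.conf` setting; the value
below is the one quoted above, not necessarily the right one for your
deployment:

```ini
# conf/server.conf; accepts values in the range 1s to 300s
replication.primary.keepalive.interval=10s
```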
#### Point-in-time recovery

Create a QuestDB instance matching a specific historical point in time.

This builds a new instance based on a recently recovered snapshot and WAL data
in the object store.

It can also be used if you wish to remove the latest transactions from the
database, or if you encounter corrupted transactions (though replicating a
corrupt transaction has never been observed).
**Flow**

- (Recommended) Locate a recent primary instance snapshot that predates your
  intended recovery timestamp.
  - A snapshot taken from **after** your intended recovery timestamp will not
    work.
- Create the new primary instance, ideally from a snapshot, and ensure it is
  not running.
- Touch a `_recover_point_in_time` file.
  - Inside this file, add a `replication.object.store` setting pointing to the
    object store you wish to load transactions from.
  - Also add a `replication.recovery.timestamp` setting with the UTC time to
    which you would like to recover.
    - The format is `YYYY-MM-DDThh:mm:ss.mmmZ`.
- (Optional) Configure replication settings in `server.conf` pointing at a
  **new** object store location.
  - You can either configure this instance as a standalone (non-replicated)
    instance, or
  - Configure it as a new primary by setting `replication.role=primary`. In
    this case, the `replication.object.store` **must** point to a fresh, empty
    location.
- If you have created the new primary using a snapshot, touch a `_restore`
  file to trigger the snapshot recovery process.
  - More details can be found in the
    [backup and restore](/documentation/operations/backup.md) documentation.
- Start new primary instance.
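The `_recover_point_in_time` steps above can be sketched as follows. The
installation path, bucket name, region, and timestamp are placeholders, and
the object-store connection string is illustrative; use the value matching the
`replication.object.store` setting of the instance you are recovering.

```shell
# Hypothetical installation root of the new instance (the parent of
# conf/ and db/); adapt to your deployment.
qdb_root=./questdb-recovered
mkdir -p "$qdb_root"

# The recovery marker file names the object store to replay WALs from
# and the UTC point in time to stop at (YYYY-MM-DDThh:mm:ss.mmmZ).
cat > "$qdb_root/_recover_point_in_time" <<'EOF'
replication.object.store=s3::bucket=my-replication-bucket;root=dbroot;region=us-east-1;
replication.recovery.timestamp=2024-01-02T03:04:05.000Z
EOF
```

On the next start, the instance replays replicated WALs up to the given
timestamp and no further.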
## Multi-primary ingestion

[QuestDB Enterprise](/enterprise/) supports multi-primary ingestion, where
multiple primaries can write to the same database.

See the [Multi-primary ingestion](/docs/operations/multi-primary-ingestion/)
page for more information.