docs: Add federated region concept and operations pages. (#24477)
In order to help users understand multi-region federated deployments, this change adds two new sections to the website. The first expands the architecture page, so we can add further detail over time with an initial federation page. The second adds a federation operations page which goes into failure planning and mitigation.

Co-authored-by: Aimee Ukasick <[email protected]>
Co-authored-by: Michael Schurter <[email protected]>
1 parent 89c3d69, commit dc50133. Showing 8 changed files with 291 additions and 8 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,64 @@ | ||
--- | ||
layout: docs | ||
page_title: Federation | ||
description: |- | ||
Nomad federation is a multi-cluster orchestration and management feature that allows multiple | ||
Nomad clusters, each defined as a region, to work together seamlessly. | ||
--- | ||
|
||
# Federation | ||
|
||
Nomad federation is a multi-cluster orchestration and management feature that allows multiple Nomad | ||
clusters, each defined as a region, to work together seamlessly. Federating clusters gives you | ||
improved scalability, fault tolerance, and centralized management of workloads across multiple | ||
data centers or geographical locations. | ||
|
||
## Cross-region request forwarding | ||
|
||
API calls can include a `region` query parameter that specifies which Nomad region the request | ||
targets. If the target is not the local region, Nomad transparently forwards the request to a | ||
server in the requested region. When you omit the query parameter, Nomad uses the region of the | ||
server that is processing the request. | ||
|
||
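As a minimal sketch of how a client might target another region, the helper below builds a request URL with the `region` query parameter. The function name `nomad_url`, the local address, and the choice of `/v1/jobs` as the endpoint are assumptions for illustration only.

```python
from urllib.parse import urlencode


def nomad_url(base, path, region=None):
    """Build a Nomad HTTP API URL, optionally targeting a specific region.

    `nomad_url` is a hypothetical helper, not part of any Nomad client library.
    """
    query = {}
    if region is not None:
        # The server that receives the request forwards it to this region.
        query["region"] = region
    suffix = "?" + urlencode(query) if query else ""
    return base.rstrip("/") + path + suffix


# No region: the server that handles the request uses its own region.
print(nomad_url("http://127.0.0.1:4646", "/v1/jobs"))
# Explicit region: the request is forwarded if the region is not the local one.
print(nomad_url("http://127.0.0.1:4646", "/v1/jobs", region="europe-west-1"))
```

The region name `europe-west-1` is only an example value; any federated region name can be used.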
## Replication | ||
|
||
Nomad writes the following objects in the authoritative region and replicates them to all federated | ||
regions: | ||
|
||
- ACL [policies][acl_policy], [roles][acl_role], [auth methods][acl_auth_method], | ||
[binding rules][acl_binding_rule], and [global tokens][acl_token] | ||
- [Namespaces][namespace] | ||
- [Node pools][node_pool] | ||
- [Quota specifications][quota] | ||
- [Sentinel policies][sentinel_policies] | ||
|
||
When creating, updating, or deleting these objects, Nomad always sends the request to the | ||
authoritative region using RPC forwarding. | ||
|
||
Nomad starts replication routines on each federated cluster's leader server in a hub-and-spoke | ||
design. The routines use blocking queries to receive updates from the authoritative region and | ||
mirror them in their own state store. These routines also implement rate limiting, so that busy | ||
clusters do not degrade due to overly aggressive replication processes. | ||
|
||
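The replication pattern described above can be illustrated with a toy model. This is a simplified sketch, not Nomad's implementation: the classes `AuthoritativeRegion` and `Replicator`, the `min_interval` rate limit, and the use of namespaces as the replicated object are all assumptions chosen for illustration. A real blocking query waits on the server until the index advances; the toy version returns immediately instead.

```python
import time


class AuthoritativeRegion:
    """Toy stand-in for the authoritative region's state (hypothetical)."""

    def __init__(self):
        self.index = 0          # increases on every write
        self.namespaces = {}

    def upsert(self, name, description):
        self.index += 1
        self.namespaces[name] = description

    def delete(self, name):
        self.index += 1
        self.namespaces.pop(name, None)

    def list_since(self, min_index):
        # A real blocking query would wait until index > min_index.
        # This toy version returns None right away when nothing changed.
        if self.index <= min_index:
            return None
        return self.index, dict(self.namespaces)


class Replicator:
    """Toy replication routine for one federated region's leader."""

    def __init__(self, source, min_interval=0.0):
        self.source = source
        self.min_interval = min_interval  # seconds between fetches
        self.last_index = 0
        self.local = {}
        self._last_fetch = None

    def run_once(self):
        # Rate limit: wait out the remainder of the minimum interval.
        if self._last_fetch is not None:
            remaining = self.min_interval - (time.monotonic() - self._last_fetch)
            if remaining > 0:
                time.sleep(remaining)
        self._last_fetch = time.monotonic()

        result = self.source.list_since(self.last_index)
        if result is None:
            return False  # nothing new to mirror
        index, remote = result
        # Mirror: drop local entries deleted upstream, then upsert the rest.
        for name in list(self.local):
            if name not in remote:
                del self.local[name]
        self.local.update(remote)
        self.last_index = index
        return True


hub = AuthoritativeRegion()
spoke = Replicator(hub)
hub.upsert("prod", "production workloads")
hub.upsert("dev", "development workloads")
spoke.run_once()
hub.delete("dev")
spoke.run_once()
print(spoke.local)  # {'prod': 'production workloads'}
```

Tracking the last seen index is what lets each spoke ask only for changes it has not mirrored yet, and the interval check keeps a busy hub from being polled too aggressively.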
<Note> | ||
Nomad writes ACL local tokens in the region where you make the request and does not replicate | ||
those local tokens. | ||
</Note> | ||
|
||
## Multi-region job deployments <EnterpriseAlert inline /> | ||
|
||
Nomad job deployments can use the [`multiregion`][] block when running in federated mode. | ||
Multiregion configuration instructs Nomad to register and run the job in all the specified regions, | ||
removing the need to copy the job specification and register it in each region separately. | ||
Multiregion jobs do not provide regional failover in the event of failure. | ||
|
||
[acl_policy]: /nomad/docs/concepts/acl#policy | ||
[acl_role]: /nomad/docs/concepts/acl#role | ||
[acl_auth_method]: /nomad/docs/concepts/acl#auth-method | ||
[acl_binding_rule]: /nomad/docs/concepts/acl#binding-rule | ||
[acl_token]: /nomad/docs/concepts/acl#token | ||
[node_pool]: /nomad/docs/concepts/node-pools | ||
[namespace]: /nomad/docs/other-specifications/namespace | ||
[quota]: /nomad/docs/other-specifications/quota | ||
[sentinel_policies]: /nomad/docs/enterprise/sentinel#sentinel-policies | ||
[`multiregion`]: /nomad/docs/job-specification/multiregion |
File renamed without changes.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,139 @@ | ||
--- | ||
layout: docs | ||
page_title: Federated cluster failure scenarios | ||
description: Failure scenarios in multi-region federated cluster deployments. | ||
--- | ||
|
||
# Failure scenarios | ||
|
||
When running Nomad in federated mode, failure situations and their impacts differ depending on | ||
whether the impacted region is the authoritative region, and on the failure mode. In | ||
soft failures, the region's servers have lost quorum but the Nomad processes are still up, running, | ||
and reachable. In hard failures, the regional servers are completely unreachable, as if | ||
the underlying hardware had been terminated (cloud) or powered off (on-prem). | ||
|
||
The scenarios are based on a Nomad deployment running three federated regions: | ||
* `asia-south-1` | ||
* `europe-west-1` - authoritative region | ||
* `us-east-1` | ||
|
||
## Federated region failure: soft | ||
In this situation the region `asia-south-1` has lost leadership but the servers are reachable and | ||
up. | ||
|
||
All server logs in the impacted region have entries such as this example. | ||
```console | ||
[ERROR] nomad/worker.go:504: worker: failed to dequeue evaluation: worker_id=d19e6bb5-5ec9-8f75-9caf-47e2513fe28d error="No cluster leader" | ||
``` | ||
|
||
✅ Request forwarding continues to work between all federated regions that are running with | ||
leadership. | ||
|
||
🟨 API requests to the impacted region, whether sent directly or through request forwarding, | ||
fail unless they use the `stale=true` flag. | ||
|
||
✅ Creation and deletion of replicated objects, such as namespaces, are written to the | ||
authoritative region. | ||
|
||
✅ All federated regions with leadership continue to replicate all objects detailed | ||
previously. | ||
|
||
✅ Creation of local ACL tokens continues to work for all regions with leadership. | ||
|
||
✅ Jobs **without** the [`multiregion`][] block deploy to all regions with leadership. | ||
|
||
❌ Jobs **with** the [`multiregion`][] block defined fail to deploy. | ||
|
||
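The `stale=true` behavior noted in this scenario can be sketched as a read that falls back to a stale query when the default read fails. This is a hedged illustration: `read_with_stale_fallback` and `leaderless_region` are hypothetical names, and the fake transport only imitates the "No cluster leader" error shown in the logs. Stale reads can return data that is out of date, so a caller should treat the second return value as a warning.

```python
def read_with_stale_fallback(fetch, path, region):
    """Try a default read first; retry with stale=true if it fails.

    `fetch` is any callable taking (path, params) and returning a result or
    raising RuntimeError. Returns (result, used_stale).
    """
    try:
        return fetch(path, {"region": region}), False
    except RuntimeError:
        # Without a leader, only reads that accept stale data can succeed.
        return fetch(path, {"region": region, "stale": "true"}), True


def leaderless_region(path, params):
    """Fake transport imitating a region that has lost leadership."""
    if params.get("stale") != "true":
        raise RuntimeError("No cluster leader")
    return {"path": path, "region": params["region"], "jobs": []}


result, used_stale = read_with_stale_fallback(
    leaderless_region, "/v1/jobs", "asia-south-1"
)
print(used_stale)  # True
```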
## Federated region failure: hard | ||
In this situation the region `asia-south-1` has gone down. When this happens, the Nomad server logs | ||
for the other regions have log entries similar to this example. | ||
```console | ||
[DEBUG] [email protected]/stdlog.go:58: nomad: memberlist: Failed UDP ping: asia-south-1-server-1.asia-south-1 (timeout reached) | ||
[INFO] [email protected]/stdlog.go:60: nomad: memberlist: Suspect asia-south-1-server-1.asia-south-1 has failed, no acks received | ||
[DEBUG] [email protected]/stdlog.go:58: nomad: memberlist: Initiating push/pull sync with: us-east-1-server-1.us-east-1 192.168.1.193:9002 | ||
[DEBUG] [email protected]/stdlog.go:58: nomad: memberlist: Failed UDP ping: asia-south-1-server-1.asia-south-1 (timeout reached) | ||
[INFO] [email protected]/stdlog.go:60: nomad: memberlist: Suspect asia-south-1-server-1.asia-south-1 has failed, no acks received | ||
``` | ||
|
||
✅ Request forwarding continues to work between all federated regions that are running with | ||
leadership. | ||
|
||
❌ API requests to the impacted region, whether sent directly or through request forwarding, | ||
fail. | ||
|
||
✅ Creation and deletion of replicated objects, such as namespaces, are written to the | ||
authoritative region. | ||
|
||
✅ All federated regions with leadership continue to replicate all objects detailed | ||
previously. | ||
|
||
✅ Creation of local ACL tokens continues to work for all regions which are running with | ||
leadership. | ||
|
||
✅ Jobs **without** the [`multiregion`][] block deploy to all regions with leadership. | ||
|
||
❌ Jobs **with** the [`multiregion`][] block defined fail to deploy. | ||
|
||
## Authoritative region failure: soft | ||
In this situation the region `europe-west-1` has lost leadership but the servers are reachable and | ||
up. | ||
|
||
The server logs in the authoritative region have entries such as this example. | ||
```console | ||
[ERROR] nomad/worker.go:504: worker: failed to dequeue evaluation: worker_id=68b3abe2-5e16-8f04-be5a-f76aebb0e59e error="No cluster leader" | ||
``` | ||
|
||
✅ Request forwarding continues to work between all federated regions that are running with | ||
leadership. | ||
|
||
🟨 API requests to the impacted region, whether sent directly or through request forwarding, | ||
fail unless they use the `stale=true` flag. | ||
|
||
❌ Creation and deletion of replicated objects, such as namespaces, fail. | ||
|
||
❌ Federated regions can still read data to replicate because they use the stale flag, but no | ||
writes can occur to the authoritative region, as described previously. | ||
|
||
✅ Creation of local ACL tokens continues to work for all federated regions which are running | ||
with leadership. | ||
|
||
✅ Jobs **without** the [`multiregion`][] block deploy to all federated regions which | ||
are running with leadership. | ||
|
||
❌ Jobs **with** the [`multiregion`][] block defined fail to deploy. | ||
|
||
## Authoritative region failure: hard | ||
In this situation the region `europe-west-1` has gone down. When this happens, the Nomad server | ||
leader logs for the other regions have log entries similar to this example. | ||
```console | ||
[ERROR] nomad/leader.go:544: nomad: failed to fetch namespaces from authoritative region: error="rpc error: EOF" | ||
[ERROR] nomad/leader.go:1767: nomad: failed to fetch policies from authoritative region: error="rpc error: EOF" | ||
[ERROR] nomad/leader.go:2498: nomad: failed to fetch ACL binding rules from authoritative region: error="rpc error: EOF" | ||
[ERROR] nomad/leader_ent.go:226: nomad: failed to fetch quota specifications from authoritative region: error="rpc error: EOF" | ||
[ERROR] nomad/leader.go:703: nomad: failed to fetch node pools from authoritative region: error="rpc error: EOF" | ||
[ERROR] nomad/leader.go:1909: nomad: failed to fetch tokens from authoritative region: error="rpc error: EOF" | ||
[ERROR] nomad/leader.go:2083: nomad: failed to fetch ACL Roles from authoritative region: error="rpc error: EOF" | ||
[DEBUG] nomad/leader_ent.go:84: nomad: failed to fetch policies from authoritative region: error="rpc error: EOF" | ||
[ERROR] nomad/leader.go:2292: nomad: failed to fetch ACL auth-methods from authoritative region: error="rpc error: EOF" | ||
[DEBUG] [email protected]/stdlog.go:58: nomad: memberlist: Failed UDP ping: europe-west-1-server-1.europe-west-1 (timeout reached) | ||
[INFO] [email protected]/stdlog.go:60: nomad: memberlist: Suspect europe-west-1-server-1.europe-west-1 has failed, no acks received | ||
[DEBUG] [email protected]/stdlog.go:58: nomad: memberlist: Failed UDP ping: europe-west-1-server-1.europe-west-1 (timeout reached) | ||
``` | ||
|
||
✅ Request forwarding continues to work between all federated regions that are running with | ||
leadership. | ||
|
||
❌ API requests to the impacted region, whether sent directly or through request forwarding, | ||
fail. | ||
|
||
❌ Creation and deletion of replicated objects, such as namespaces, fail. | ||
|
||
❌ Federated regions with leadership are not able to replicate the objects named in the logs. | ||
|
||
✅ Creation of local ACL tokens continues to work for all regions with leadership. | ||
|
||
✅ Jobs **without** the [`multiregion`][] block deploy to regions with leadership. | ||
|
||
❌ Jobs **with** the [`multiregion`][] block defined fail to deploy. | ||
|
||
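For quick reference, the outcomes of the four scenarios can be collected into a lookup table. The sketch below only restates the markers from the sections above; the capability names and the `outcome` helper are hypothetical labels introduced for illustration.

```python
OK, STALE_ONLY, FAIL = "ok", "stale-only", "fail"

# Capability labels (hypothetical names), in the order the scenarios list them.
CAPABILITIES = (
    "request_forwarding_between_healthy_regions",
    "api_requests_to_impacted_region",
    "replicated_object_writes",
    "replication_to_federated_regions",
    "local_acl_token_creation",
    "jobs_without_multiregion",
    "jobs_with_multiregion",
)

# Keys are (role of the impacted region, failure mode).
MATRIX = {
    ("federated", "soft"): (OK, STALE_ONLY, OK, OK, OK, OK, FAIL),
    ("federated", "hard"): (OK, FAIL, OK, OK, OK, OK, FAIL),
    # Soft authoritative failure: replication reads still work with the stale
    # flag, but the page marks the item as failed because no writes can occur.
    ("authoritative", "soft"): (OK, STALE_ONLY, FAIL, FAIL, OK, OK, FAIL),
    ("authoritative", "hard"): (OK, FAIL, FAIL, FAIL, OK, OK, FAIL),
}


def outcome(role, mode, capability):
    """Look up the documented outcome for one capability in one scenario."""
    return MATRIX[(role, mode)][CAPABILITIES.index(capability)]


print(outcome("authoritative", "hard", "replicated_object_writes"))  # fail
print(outcome("federated", "soft", "api_requests_to_impacted_region"))  # stale-only
```

The "ok" entries for job deployment and local token creation apply only to regions that are still running with leadership.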
[`multiregion`]: /nomad/docs/job-specification/multiregion |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,57 @@ | ||
--- | ||
layout: docs | ||
page_title: Federated cluster operations | ||
description: Information on running Nomad federated clusters. | ||
--- | ||
|
||
## Operational considerations | ||
|
||
When operating multi-region federated Nomad clusters, consider the following: | ||
|
||
* **Regular snapshots**: You can back up Nomad server state using the | ||
[`nomad operator snapshot save`][] and [`nomad operator snapshot agent`][] commands. Performing | ||
regular backups expedites disaster recovery. The cadence depends on cluster rates of change | ||
and your internal SLAs. Regularly test snapshots using the | ||
[`nomad operator snapshot restore`][] command to ensure they work. | ||
|
||
* **Local ACL management tokens**: You need local management tokens to perform federated cluster | ||
administration when the authoritative region is down. Make sure you have existing break-glass | ||
tokens available for each region. | ||
|
||
* **Known paths to creating local ACL tokens**: If the authoritative region fails, creation of | ||
global ACL tokens fails. If this happens, having the ability to create local ACL tokens allows | ||
you to continue to interact with each available federated region. | ||
|
||
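As a rough sketch of the local-token path mentioned above, the helper below builds the request for creating a local (non-global) ACL token. It assumes the token creation endpoint `POST /v1/acl/token` with `Name`, `Type`, `Policies`, and `Global` fields; check these against the ACL API documentation for your Nomad version. The function name `local_token_request` and the policy name `readonly` are hypothetical, and the request would need to be sent to a server in the target region with a management token for that region.

```python
import json


def local_token_request(name, token_type="management", policies=()):
    """Return (path, body) for creating a local ACL token.

    Hypothetical helper: it only builds the request, it does not send it.
    """
    if token_type not in ("client", "management"):
        raise ValueError("token_type must be 'client' or 'management'")
    body = {
        "Name": name,
        "Type": token_type,
        # Local tokens are written in the region that receives the request
        # and are not replicated, so they do not need the authoritative region.
        "Global": False,
    }
    if token_type == "client":
        body["Policies"] = list(policies)
    return "/v1/acl/token", body


path, body = local_token_request("break-glass-asia-south-1")
print(path)
print(json.dumps(body, sort_keys=True))

_, client_body = local_token_request("ci-reader", "client", ["readonly"])
print(json.dumps(client_body, sort_keys=True))
```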
## Authoritative and federated regions | ||
|
||
* **Can non-authoritative regions continue to operate if the authoritative region is unreachable?** | ||
Yes, running workloads are never interrupted due to federation failures. Scheduling of new | ||
workloads and rescheduling of failed workloads are also never interrupted due to federation failures. | ||
See [Failure Scenarios][failure_scenarios] for details. | ||
|
||
* **Can the authoritative region be deployed with servers only?** Yes, deploying the Nomad | ||
authoritative region with servers only, without clients, works as expected. This servers-only | ||
approach can expedite disaster recovery of the region. Restoration does not include objects such | ||
as nodes, jobs, or allocations, which are large and require compute-intensive reconciliation | ||
after restoration. | ||
|
||
* **Can I migrate the authoritative region to a currently federated region?** Yes, by | ||
following these steps: | ||
|
||
1. Update the [`authoritative_region`][] configuration parameter on the desired authoritative | ||
region servers. | ||
1. Restart the server processes in the new authoritative region and ensure all data is present in | ||
state as expected. If the network was partitioned as part of the failure of the original | ||
authoritative region, writes of replicated objects may not have been successfully replicated to | ||
federated regions. | ||
1. Update the [`authoritative_region`][] configuration parameter on the federated region servers | ||
and restart their processes. | ||
|
||
* **Can federated regions be bootstrapped while the authoritative region is down?** No, they | ||
cannot. | ||
|
||
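The ordering of the migration steps listed above can be sketched as a small planning helper. This is only an illustration of the sequence: `migration_plan` is a hypothetical function, the action strings are free-form labels, and nothing here edits configuration or restarts servers.

```python
def migration_plan(regions, new_authoritative):
    """Return ordered (region, action) steps for moving the authoritative region.

    `regions` lists the regions you can still operate on. The new
    authoritative region is always handled first.
    """
    if new_authoritative not in regions:
        raise ValueError("new authoritative region must be one of the regions")

    steps = [
        (new_authoritative, "update authoritative_region in the server config"),
        (new_authoritative, "restart server processes"),
        (new_authoritative, "verify replicated objects are present in state"),
    ]
    for region in regions:
        if region == new_authoritative:
            continue
        # Remaining federated regions follow only after the new hub is ready.
        steps.append((region, "update authoritative_region in the server config"))
        steps.append((region, "restart server processes"))
    return steps


plan = migration_plan(["asia-south-1", "us-east-1"], "us-east-1")
for number, (region, action) in enumerate(plan, start=1):
    print(f"{number}. {region}: {action}")
```

The verification step matters because, if the original authoritative region failed during a network partition, some writes of replicated objects may never have reached the region you are promoting.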
[`nomad operator snapshot save`]: /nomad/docs/commands/operator/snapshot/save | ||
[`nomad operator snapshot agent`]: /nomad/docs/commands/operator/snapshot/agent | ||
[`nomad operator snapshot restore`]: /nomad/docs/commands/operator/snapshot/restore | ||
[failure_scenarios]: /nomad/docs/operations/federation/failure | ||
[`authoritative_region`]: /nomad/docs/configuration/server#authoritative_region |