docs: Add federated region concept and operations pages. (#24477)
In order to help users understand multi-region federated
deployments, this change adds two new sections to the website.

The first expands the architecture page into a section, starting
with an initial federation page, so we can add further detail over
time. The second adds a federation operations page that covers
failure planning and mitigation.

Co-authored-by: Aimee Ukasick <[email protected]>
Co-authored-by: Michael Schurter <[email protected]>
3 people authored Nov 19, 2024
1 parent 89c3d69 commit dc50133
Showing 8 changed files with 291 additions and 8 deletions.
64 changes: 64 additions & 0 deletions website/content/docs/concepts/architecture/federation.mdx
@@ -0,0 +1,64 @@
---
layout: docs
page_title: Federation
description: |-
  Nomad federation is a multi-cluster orchestration and management feature that allows multiple
  Nomad clusters, each defined as a region, to work together seamlessly.
---

# Federation

Nomad federation is a multi-cluster orchestration and management feature that allows multiple Nomad
clusters, each defined as a region, to work together seamlessly. By federating clusters, you benefit
from improved scalability, fault tolerance, and centralized management of workloads across
data centers or geographical locations.

## Cross-Region request forwarding

API calls can include a `region` query parameter that specifies the Nomad region the query applies
to. If this is not the local region, Nomad transparently forwards the request to a server in the
requested region. When you omit the query parameter, Nomad uses the region of the server that is
processing the request.

## Replication

Nomad writes the following objects in the authoritative region and replicates them to all federated
regions:

- ACL [policies][acl_policy], [roles][acl_role], [auth methods][acl_auth_method],
[binding rules][acl_binding_rule], and [global tokens][acl_token]
- [Namespaces][namespace]
- [Node pools][node_pool]
- [Quota specifications][quota]
- [Sentinel policies][sentinel_policies]

When creating, updating, or deleting these objects, Nomad always sends the request to the
authoritative region using RPC forwarding.
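
For example, a write to a replicated object can be issued against any region and still lands in
the authoritative region. A minimal sketch, assuming a namespace named `prod`:

```console
$ # Issued against a federated region, this write is forwarded by RPC to the
$ # authoritative region, then replicated back out to all federated regions.
$ nomad namespace apply -description "production workloads" prod
```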

Nomad starts replication routines on each federated cluster's leader server in a hub and spoke
design. The routines then use blocking queries to receive updates from the authoritative region to
mirror in their own state store. These routines also implement rate limiting, so that busy clusters
do not degrade due to overly aggressive replication processes.
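
These replication reads use the same blocking-query mechanism the HTTP API exposes; a hand-run
equivalent, assuming a locally reachable server, looks like this sketch:

```console
$ # Long-poll for namespace changes: the request returns when the state store
$ # index exceeds 125 or the 60-second wait elapses, whichever comes first.
$ curl "http://127.0.0.1:4646/v1/namespaces?index=125&wait=60s"
```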

<Note>
Nomad writes ACL local tokens in the region where you make the request and does not replicate
those local tokens.
</Note>

## Multi-Region job deployments <EnterpriseAlert inline />

Nomad job deployments can use the [`multiregion`][] block when running in federated mode. The
multiregion configuration instructs Nomad to register and run the job in all specified regions,
removing the need to copy and register the job specification in each region. Multiregion jobs do
not provide regional failover in the event of failure.
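
A minimal sketch of a multiregion registration; the region, datacenter, job, and file names are
hypothetical:

```console
$ # One registration fans the job out to both regions.
$ cat <<'EOF' > example.nomad.hcl
job "example" {
  multiregion {
    strategy {
      max_parallel = 1
      on_failure   = "fail_all"
    }
    region "europe-west-1" {
      count       = 1
      datacenters = ["ew1-dc1"]
    }
    region "us-east-1" {
      count       = 1
      datacenters = ["ue1-dc1"]
    }
  }
  group "cache" {
    task "redis" {
      driver = "docker"
      config {
        image = "redis:7"
      }
    }
  }
}
EOF
$ nomad job run example.nomad.hcl
```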

[acl_policy]: /nomad/docs/concepts/acl#policy
[acl_role]: /nomad/docs/concepts/acl#role
[acl_auth_method]: /nomad/docs/concepts/acl#auth-method
[acl_binding_rule]: /nomad/docs/concepts/acl#binding-rule
[acl_token]: /nomad/docs/concepts/acl#token
[node_pool]: /nomad/docs/concepts/node-pools
[namespace]: /nomad/docs/other-specifications/namespace
[quota]: /nomad/docs/other-specifications/quota
[sentinel_policies]: /nomad/docs/enterprise/sentinel#sentinel-policies
[`multiregion`]: /nomad/docs/job-specification/multiregion
2 changes: 1 addition & 1 deletion website/content/docs/concepts/workload-identity.mdx
@@ -172,7 +172,7 @@ Consul and Vault can be configured to accept workload identities from Nomad for
authentication. Refer to the [Consul][consul_int] and [Vault][vault_int]
integration pages for more information.

[allocation]: /nomad/docs/concepts/architecture#allocation
[allocation]: /nomad/docs/glossary#allocation
[identity-block]: /nomad/docs/job-specification/identity
[jobspec_consul]: /nomad/docs/job-specification/consul
[jobspec_consul_ns]: /nomad/docs/job-specification/consul#namespace
11 changes: 6 additions & 5 deletions website/content/docs/configuration/server.mdx
@@ -31,11 +31,12 @@ server {

- `authoritative_region` `(string: "")` - Specifies the authoritative region,
which provides a single source of truth for global configurations such as ACL
Policies and global ACL tokens. Non-authoritative regions will replicate from
the authoritative to act as a mirror. By default, the local region is assumed
to be authoritative. Setting `authoritative_region` assumes that ACLs have
been bootstrapped in the authoritative region. See [Configure for multiple
regions][] in the ACLs tutorial.
Policies and global ACL tokens in multi-region, federated deployments.
Non-authoritative regions will replicate from the authoritative to act as a
mirror. By default, the local region is assumed to be authoritative. Setting
`authoritative_region` assumes that ACLs have been bootstrapped in the
authoritative region. See [Configure for multiple regions][] in the ACLs
tutorial.

- `bootstrap_expect` `(int: required)` - Specifies the number of server nodes to
wait for before bootstrapping. It is most common to use the odd-numbered
2 changes: 1 addition & 1 deletion website/content/docs/networking/index.mdx
@@ -24,7 +24,7 @@ Nomad differs from other tools in this aspect.
## Allocation networking

The base unit of scheduling in Nomad is an
[allocation](/nomad/docs/concepts/architecture#allocation), which means that all
[allocation](/nomad/docs/glossary#allocation), which means that all
tasks in the same allocation run in the same client and share common resources,
such as disk and networking. Allocations can request access to network
resources, such as ports, using the
139 changes: 139 additions & 0 deletions website/content/docs/operations/federation/failure.mdx
@@ -0,0 +1,139 @@
---
layout: docs
page_title: Federated cluster failure scenarios
description: Failure scenarios in multi-region federated cluster deployments.
---

# Failure scenarios

When running Nomad in federated mode, failure scenarios and their impacts differ depending on
whether the impacted region is the authoritative region and on the failure mode. In soft
failures, the region's servers have lost quorum but the Nomad processes are still up, running,
and reachable. In hard failures, the regional servers are completely unreachable, akin to the
underlying hardware having been terminated (cloud) or powered off (on-prem).

The scenarios are based on a Nomad deployment running three federated regions:
* `asia-south-1`
* `europe-west-1` - authoritative region
* `us-east-1`

## Federated region failure: soft
In this situation, the region `asia-south-1` has lost leadership, but the servers are reachable
and up.

All server logs in the impacted region have entries such as this example.
```console
[ERROR] nomad/worker.go:504: worker: failed to dequeue evaluation: worker_id=d19e6bb5-5ec9-8f75-9caf-47e2513fe28d error="No cluster leader"
```

✅ Request forwarding continues to work between all federated regions that are running with
leadership.

🟨 API requests, either directly or attempting to use request forwarding to the impacted region,
fail unless they use the `stale=true` flag, as shown in the sketch after this list.

✅ Creation and deletion of replicated objects, such as namespaces, is written to the
authoritative region.

✅ All federated regions with leadership are able to continue replicating the objects detailed
previously.

✅ Creation of local ACL tokens continues to work for all regions with leadership.

✅ Jobs **without** the [`multiregion`][] block deploy to all regions with leadership.

❌ Jobs **with** the [`multiregion`][] block defined fail to deploy.
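
The stale read mentioned previously looks like this sketch, assuming a reachable server in the
impacted region:

```console
$ # With stale=true, any asia-south-1 server answers the read from its own
$ # state store instead of forwarding to a leader that does not exist.
$ curl "http://127.0.0.1:4646/v1/jobs?region=asia-south-1&stale=true"
```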

## Federated region failure: hard
In this situation, the region `asia-south-1` has gone down. When this happens, the Nomad server
logs for the other regions have entries similar to this example.
```console
[DEBUG] [email protected]/stdlog.go:58: nomad: memberlist: Failed UDP ping: asia-south-1-server-1.asia-south-1 (timeout reached)
[INFO] [email protected]/stdlog.go:60: nomad: memberlist: Suspect asia-south-1-server-1.asia-south-1 has failed, no acks received
[DEBUG] [email protected]/stdlog.go:58: nomad: memberlist: Initiating push/pull sync with: us-east-1-server-1.us-east-1 192.168.1.193:9002
[DEBUG] [email protected]/stdlog.go:58: nomad: memberlist: Failed UDP ping: asia-south-1-server-1.asia-south-1 (timeout reached)
[INFO] [email protected]/stdlog.go:60: nomad: memberlist: Suspect asia-south-1-server-1.asia-south-1 has failed, no acks received
```

✅ Request forwarding continues to work between all federated regions that are running with
leadership.

❌ API requests, either directly or attempting to use request forwarding to the impacted region,
fail.

✅ Creation and deletion of replicated objects, such as namespaces, are written to the
authoritative region.

✅ All federated regions with leadership continue to replicate the objects detailed above.

✅ Creation of local ACL tokens continues to work for all regions that are running with
leadership.

✅ Jobs **without** the [`multiregion`][] block deploy to all regions with leadership.

❌ Jobs **with** the [`multiregion`][] block defined fail to deploy.

## Authoritative region failure: soft
In this situation, the region `europe-west-1` has lost leadership, but the servers are reachable
and up.

The server logs in the authoritative region have entries such as this example.
```console
[ERROR] nomad/worker.go:504: worker: failed to dequeue evaluation: worker_id=68b3abe2-5e16-8f04-be5a-f76aebb0e59e error="No cluster leader"
```

✅ Request forwarding continues to work between all federated regions that are running with
leadership.

🟨 API requests, either directly or attempting to use request forwarding to the impacted region,
fail unless they use the `stale=true` flag, as in the earlier stale read sketch.

❌ Creation and deletion of replicated objects, such as namespaces, fails.

❌ Federated regions can still read the data they replicate because those reads use the stale
flag, but no writes can occur to the authoritative region, as described previously.

✅ Creation of local ACL tokens continues to work for all federated regions that are running
with leadership.

✅ Jobs **without** the [`multiregion`][] block deploy to all federated regions that are running
with leadership.

❌ Jobs **with** the [`multiregion`][] block defined fail to deploy.

## Authoritative region failure: hard
In this situation, the region `europe-west-1` has gone down. When this happens, the Nomad server
leader logs for the other regions have entries similar to this example.
```console
[ERROR] nomad/leader.go:544: nomad: failed to fetch namespaces from authoritative region: error="rpc error: EOF"
[ERROR] nomad/leader.go:1767: nomad: failed to fetch policies from authoritative region: error="rpc error: EOF"
[ERROR] nomad/leader.go:2498: nomad: failed to fetch ACL binding rules from authoritative region: error="rpc error: EOF"
[ERROR] nomad/leader_ent.go:226: nomad: failed to fetch quota specifications from authoritative region: error="rpc error: EOF"
[ERROR] nomad/leader.go:703: nomad: failed to fetch node pools from authoritative region: error="rpc error: EOF"
[ERROR] nomad/leader.go:1909: nomad: failed to fetch tokens from authoritative region: error="rpc error: EOF"
[ERROR] nomad/leader.go:2083: nomad: failed to fetch ACL Roles from authoritative region: error="rpc error: EOF"
[DEBUG] nomad/leader_ent.go:84: nomad: failed to fetch policies from authoritative region: error="rpc error: EOF"
[ERROR] nomad/leader.go:2292: nomad: failed to fetch ACL auth-methods from authoritative region: error="rpc error: EOF"
[DEBUG] [email protected]/stdlog.go:58: nomad: memberlist: Failed UDP ping: europe-west-1-server-1.europe-west-1 (timeout reached)
[INFO] [email protected]/stdlog.go:60: nomad: memberlist: Suspect europe-west-1-server-1.europe-west-1 has failed, no acks received
[DEBUG] [email protected]/stdlog.go:58: nomad: memberlist: Failed UDP ping: europe-west-1-server-1.europe-west-1 (timeout reached)
```

✅ Request forwarding continues to work between all federated regions that are running with
leadership.

❌ API requests, either directly or attempting to use request forwarding to the impacted region,
fail.

❌ Creation and deletion of replicated objects, such as namespaces, fails.

❌ Federated regions with leadership are not able to replicate the objects detailed in the logs.

✅ Creation of local ACL tokens continues to work for all regions with leadership.

✅ Jobs **without** the [`multiregion`][] block deploy to regions with leadership.

❌ Jobs **with** the [`multiregion`][] block defined fail to deploy.

[`multiregion`]: /nomad/docs/job-specification/multiregion
57 changes: 57 additions & 0 deletions website/content/docs/operations/federation/index.mdx
@@ -0,0 +1,57 @@
---
layout: docs
page_title: Federated cluster operations
description: Information on running Nomad federated clusters.
---

## Operational considerations

When operating multi-region federated Nomad clusters, consider the following:

* **Regular snapshots**: You can back up Nomad server state using the
[`nomad operator snapshot save`][] and [`nomad operator snapshot agent`][] commands. Performing
regular backups expedites disaster recovery. The cadence depends on cluster rates of change
and your internal SLAs. You should regularly test snapshots using the
[`nomad operator snapshot restore`][] command to ensure they work; the sketch after this list
shows the relevant commands.

* **Local ACL management tokens**: You need local management tokens to perform federated cluster
administration when the authoritative region is down. Make sure you have existing break-glass
tokens available for each region.

* **Known paths to creating local ACL tokens**: If the authoritative region fails, creation of
global ACL tokens fails. If this happens, having the ability to create local ACL tokens allows
you to continue to interact with each available federated region.
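
A minimal sketch of these operations; the snapshot file name, token name, and regions are
assumptions:

```console
$ # Back up server state for a region, then periodically verify the snapshot
$ # restores cleanly, ideally in a test cluster.
$ nomad operator snapshot save -region=europe-west-1 europe-west-1.snap
$ nomad operator snapshot restore europe-west-1.snap

$ # Create a break-glass local management token in a federated region. Local
$ # tokens do not depend on the authoritative region being reachable.
$ nomad acl token create -name="break-glass-us-east-1" -type=management -region=us-east-1
```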

## Authoritative and federated regions

* **Can non-authoritative regions continue to operate if the authoritative region is unreachable?**
Yes. Running workloads are never interrupted due to federation failures, and scheduling of new
workloads and rescheduling of failed workloads are likewise never interrupted. See
[Failure scenarios][failure_scenarios] for details.

* **Can the authoritative region be deployed with servers only?** Yes, deploying the Nomad
authoritative region with servers only, without clients, works as expected. This servers-only
approach can expedite disaster recovery of the region. Restoration does not include objects such
as nodes, jobs, or allocations, which are large and require compute intensive reconciliation
after restoration.

* **Can I migrate the authoritative region to a currently federated region?** It is possible by
following these steps; a configuration sketch follows at the end of this section:

1. Update the [`authoritative_region`][] configuration parameter on the desired authoritative
region servers.
1. Restart the server processes in the new authoritative region and ensure all data is present in
state as expected. If the network was partitioned as part of the failure of the original
authoritative region, writes of replicated objects may not have been successfully replicated to
federated regions.
1. Update the [`authoritative_region`][] configuration parameter on the federated region servers
and restart their processes.

* **Can federated regions be bootstrapped while the authoritative region is down?** No, they
cannot.
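
The migration steps above amount to a configuration change and restart on each server; a sketch,
assuming `us-east-1` is being promoted and a typical configuration path:

```console
$ # On each server in the region being promoted, point the server block at the
$ # new authoritative region, then restart the agent.
$ cat /etc/nomad.d/server.hcl
server {
  enabled              = true
  authoritative_region = "us-east-1"
}
$ sudo systemctl restart nomad
```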

[`nomad operator snapshot save`]: /nomad/docs/commands/operator/snapshot/save
[`nomad operator snapshot agent`]: /nomad/docs/commands/operator/snapshot/agent
[`nomad operator snapshot restore`]: /nomad/docs/commands/operator/snapshot/restore
[failure_scenarios]: /nomad/docs/operations/federation/failure
[`authoritative_region`]: /nomad/docs/configuration/server#authoritative_region
24 changes: 23 additions & 1 deletion website/data/docs-nav-data.json
@@ -131,7 +131,16 @@
},
{
"title": "Architecture",
"path": "concepts/architecture"
"routes": [
{
"title": "Overview",
"path": "concepts/architecture"
},
{
"title": "Federation",
"path": "concepts/architecture/federation"
}
]
},
{
"title": "CPU",
@@ -2385,6 +2394,19 @@
"title": "Key Management",
"path": "operations/key-management"
},
{
"title": "Federation",
"routes": [
{
"title": "Overview",
"path": "operations/federation"
},
{
"title": "Failure",
"path": "operations/federation/failure"
}
]
},
{
"title": "Considerations for Stateful Workloads",
"path": "operations/stateful-workloads"
