docs: Add federated region concept and operations pages.
In order to help users understand multi-region federated
deployments, this change adds two new sections to the website.

The first expands the architecture page, so we can add further
detail over time with an initial federation page. The second adds
a federation operations page which goes into failure planning and
mitigation.
jrasell committed Nov 18, 2024
1 parent ff8ca8a commit 48d60b6
Showing 5 changed files with 282 additions and 1 deletion.
68 changes: 68 additions & 0 deletions website/content/docs/concepts/architecture/federation.mdx
@@ -0,0 +1,68 @@
---
layout: docs
page_title: Federation
description: Learn about processes and architecture of federated clusters.
---

# Federation

Nomad federation is a multi-cluster orchestration and management feature that allows multiple
Nomad clusters, each defined as a region, to work together seamlessly. By federating clusters,
organizations benefit from improved scalability, fault tolerance, and centralized management of
workloads across multiple data centers or geographical locations.

## Cross Region Request Forwarding

API calls can include a `region` query parameter that specifies which Nomad region the request
targets. If this is not the local region, the request is transparently forwarded to and serviced by
a server in the target region. When the query parameter is omitted, the region of the server
handling the request is used.
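
For example, a request can be pinned to a specific region via either the CLI `-region` flag or the
HTTP `region` query parameter. This sketch assumes a local agent listening on the default address
and a federated region named `europe-west-1`:

```console
$ # CLI: target the europe-west-1 region explicitly
$ nomad job status -region=europe-west-1

$ # HTTP API: the equivalent query parameter
$ curl --silent "http://localhost:4646/v1/jobs?region=europe-west-1"
```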

## Replication

In federated Nomad environments, a number of objects are replicated from the authoritative region
to all federated regions. When creating, updating, or deleting these objects, the request will
always be sent to the authoritative region using RPC forwarding.

* **ACL Policies**: All ACL policies are written in the authoritative region and replicated to
federated regions.

* **ACL Roles**: All ACL roles are written in the authoritative region and replicated to federated
regions.

* **ACL Auth Methods**: All ACL authentication methods are written in the authoritative region and
replicated to federated regions.

* **ACL Binding Rules**: All ACL binding rules are written in the authoritative region and
replicated to federated regions.

* **ACL Tokens**: ACL tokens whose `global` parameter is set to `true` are written in the
authoritative region and replicated to federated regions. Otherwise, they are written to the
region where the request is made and not replicated.

* **Namespaces**: All namespaces are written in the authoritative region and replicated to
federated regions.

* **Node Pools**: All node pools are written in the authoritative region and replicated to
federated regions.

* **Quota Specifications**: All quotas are written in the authoritative region and replicated to
federated regions.

* **Sentinel Policies**: All sentinel policies are written in the authoritative region and
replicated to federated regions.

Replication routines are started on each federated cluster's leader server in a hub and spoke
design. The routines then utilize blocking queries to receive updates from the authoritative region
to mirror in their own state store. They also implement rate limiting, so that busy clusters do not
degrade due to overly aggressive replication processes.
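
Blocking queries of the kind the replication routines rely on are also exposed on the public HTTP
API via the `index` and `wait` query parameters. A sketch, assuming a local agent on the default
address:

```console
$ # Long-poll: the request blocks until the jobs list advances past index 120,
$ # or until the 60s wait time elapses, whichever comes first
$ curl --silent "http://localhost:4646/v1/jobs?index=120&wait=60s"
```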

## Multi-Region Job Deployments <EnterpriseAlert inline />

Nomad job deployments can use the [`multiregion`][] block when running in federated mode with
enterprise binaries. When configured, this instructs Nomad to register and run the job in all the
specified regions, removing the need for multiple copies of the job specification and separate
registration in each region. It is important to note that multiregion jobs do not provide regional
failover in the event of failure.
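
A minimal sketch of a job using the block might look like the following; the region names, counts,
and task details here are illustrative assumptions:

```hcl
job "example" {
  # Run this job in two federated regions from a single registration.
  multiregion {
    region "europe-west-1" {
      count = 2
    }

    region "us-east-1" {
      count = 1
    }
  }

  group "cache" {
    task "redis" {
      driver = "docker"

      config {
        image = "redis:7"
      }
    }
  }
}
```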

[`multiregion`]: /nomad/docs/job-specification/multiregion
138 changes: 138 additions & 0 deletions website/content/docs/operations/federation/failure.mdx
@@ -0,0 +1,138 @@
---
layout: docs
page_title: Federated Cluster Failure Scenarios
description: Failure scenarios in multi-region federated cluster deployments.
---

# Failure Scenarios

When running Nomad in federated mode, the impact of a failure depends on whether the impacted
region is the authoritative region and on the failure mode. In a soft failure, the region's servers
have lost quorum but the Nomad processes are still up, running, and reachable. In a hard failure,
the region's servers are completely unreachable, akin to the underlying hardware having been
terminated (cloud) or powered off (on-prem).

The scenarios are based on a Nomad deployment which is running with three federated regions named
`asia-south-1`, `europe-west-1`, and `us-east-1`. The region `europe-west-1` is authoritative.

## Federated Region Failure: Soft

In this situation the region `asia-south-1` has lost leadership but the servers are reachable and
up.

All server logs in the impacted region will have entries such as:
```console
2024-10-30T14:34:50.262Z [ERROR] nomad/worker.go:504: worker: failed to dequeue evaluation: worker_id=d19e6bb5-5ec9-8f75-9caf-47e2513fe28d error="No cluster leader"
```

✅ Request forwarding will continue to work between all federated regions that are running with
leadership.

🟨 API requests to the impacted region, whether made directly or via request forwarding, will fail
unless they set the `stale=true` flag.

✅ Creation and deletion of replicated objects, such as namespaces, will be written to the
authoritative region.

✅ Any federated regions with leadership will be able to continue to replicate all objects detailed
above.

✅ Creation of local ACL tokens will continue to work for all regions with leadership.

✅ Jobs **without** the [`multiregion`][] block will be deployable to all regions with leadership.

❌ Jobs **with** the [`multiregion`][] block defined will fail to deploy.
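
As a concrete illustration of the stale-read workaround noted above, a read against the leaderless
region can still be serviced by one of its non-leader servers. The region name and agent address
are assumptions:

```console
$ # Fails while asia-south-1 has no leader
$ curl --silent "http://localhost:4646/v1/jobs?region=asia-south-1"

$ # Succeeds, served by a non-leader with potentially stale data
$ curl --silent "http://localhost:4646/v1/jobs?region=asia-south-1&stale=true"
```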

## Federated Region Failure: Hard

In this situation the region `asia-south-1` has gone down. When this happens, the Nomad server logs
for the other regions will have log entries similar to those below:
```console
2024-10-30T10:22:49.673Z [DEBUG] [email protected]/stdlog.go:58: nomad: memberlist: Failed UDP ping: asia-south-1-server-1.asia-south-1 (timeout reached)
2024-10-30T10:22:51.673Z [INFO] [email protected]/stdlog.go:60: nomad: memberlist: Suspect asia-south-1-server-1.asia-south-1 has failed, no acks received
2024-10-30T10:22:59.618Z [DEBUG] [email protected]/stdlog.go:58: nomad: memberlist: Initiating push/pull sync with: us-east-1-server-1.us-east-1 192.168.1.193:9002
2024-10-30T10:22:59.674Z [DEBUG] [email protected]/stdlog.go:58: nomad: memberlist: Failed UDP ping: asia-south-1-server-1.asia-south-1 (timeout reached)
2024-10-30T10:23:01.673Z [INFO] [email protected]/stdlog.go:60: nomad: memberlist: Suspect asia-south-1-server-1.asia-south-1 has failed, no acks received
```

✅ Request forwarding will continue to work between all federated regions that are running with
leadership.

❌ API requests to the impacted region, whether made directly or via request forwarding, will
fail.

✅ Creation and deletion of replicated objects, such as namespaces, will be written to the
authoritative region.

✅ Any federated regions with leadership will be able to continue to replicate all objects detailed
above.

✅ Creation of local ACL tokens will continue to work for all regions which are running with
leadership.

✅ Jobs **without** the [`multiregion`][] block will be deployable to all regions with leadership.

❌ Jobs **with** the [`multiregion`][] block defined will fail to deploy.

## Authoritative Region Failure: Soft

In this situation the region `europe-west-1` has lost leadership but the servers are reachable and
up.

The server logs in the authoritative region will have entries such as:
```console
2024-10-30T14:42:30.370Z [ERROR] nomad/worker.go:504: worker: failed to dequeue evaluation: worker_id=68b3abe2-5e16-8f04-be5a-f76aebb0e59e error="No cluster leader"
```

✅ Request forwarding will continue to work between all federated regions that are running with
leadership.

🟨 API requests to the impacted region, whether made directly or via request forwarding, will fail
unless they set the `stale=true` flag.

❌ Creation and deletion of replicated objects, such as namespaces, will fail.

❌ Federated regions can still read the objects they replicate, because replication uses the stale
flag, but no writes of replicated objects can occur to the authoritative region as described
above.

✅ Creation of local ACL tokens will continue to work for all federated regions which are running
with leadership.

✅ Jobs **without** the [`multiregion`][] block will be deployable to all federated regions which
are running with leadership.

❌ Jobs **with** the [`multiregion`][] block defined will fail to deploy.
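
During such an outage, a local (non-global) ACL token can still be minted in any healthy federated
region. A hedged sketch, where the policy name and region are assumptions:

```console
$ # Global token creation fails while the authoritative region has no leader
$ nomad acl token create -global -policy="ops" -region="us-east-1"

$ # Creating a local token in us-east-1 still succeeds
$ nomad acl token create -policy="ops" -region="us-east-1"
```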

## Authoritative Region Failure: Hard

In this situation the region `europe-west-1` has gone down. When this happens, the Nomad server
leader logs for the other regions will have log entries similar to those below:
```console
2024-10-30T10:58:52.019Z [ERROR] nomad/leader.go:544: nomad: failed to fetch namespaces from authoritative region: error="rpc error: EOF"
2024-10-30T10:58:52.019Z [ERROR] nomad/leader.go:1767: nomad: failed to fetch policies from authoritative region: error="rpc error: EOF"
2024-10-30T10:58:52.019Z [ERROR] nomad/leader.go:2498: nomad: failed to fetch ACL binding rules from authoritative region: error="rpc error: EOF"
2024-10-30T10:58:52.019Z [ERROR] nomad/leader_ent.go:226: nomad: failed to fetch quota specifications from authoritative region: error="rpc error: EOF"
2024-10-30T10:58:52.019Z [ERROR] nomad/leader.go:703: nomad: failed to fetch node pools from authoritative region: error="rpc error: EOF"
2024-10-30T10:58:52.019Z [ERROR] nomad/leader.go:1909: nomad: failed to fetch tokens from authoritative region: error="rpc error: EOF"
2024-10-30T10:58:52.019Z [ERROR] nomad/leader.go:2083: nomad: failed to fetch ACL Roles from authoritative region: error="rpc error: EOF"
2024-10-30T10:58:52.019Z [DEBUG] nomad/leader_ent.go:84: nomad: failed to fetch policies from authoritative region: error="rpc error: EOF"
2024-10-30T10:58:52.019Z [ERROR] nomad/leader.go:2292: nomad: failed to fetch ACL auth-methods from authoritative region: error="rpc error: EOF"
2024-10-30T10:58:57.391Z [DEBUG] [email protected]/stdlog.go:58: nomad: memberlist: Failed UDP ping: europe-west-1-server-1.europe-west-1 (timeout reached)
2024-10-30T10:58:59.390Z [INFO] [email protected]/stdlog.go:60: nomad: memberlist: Suspect europe-west-1-server-1.europe-west-1 has failed, no acks received
2024-10-30T10:59:02.391Z [DEBUG] [email protected]/stdlog.go:58: nomad: memberlist: Failed UDP ping: europe-west-1-server-1.europe-west-1 (timeout reached)
```

✅ Request forwarding will continue to work between all federated regions that are running with
leadership.

❌ API requests to the impacted region, whether made directly or via request forwarding, will
fail.

❌ Creation and deletion of replicated objects, such as namespaces, will fail.

❌ Any federated regions with leadership will not be able to replicate objects detailed in the logs
above.

✅ Creation of local ACL tokens will continue to work for all regions with leadership.

✅ Jobs **without** the [`multiregion`][] block will be deployable to regions with leadership.

❌ Jobs **with** the [`multiregion`][] block defined will fail to deploy.

[`multiregion`]: /nomad/docs/job-specification/multiregion
53 changes: 53 additions & 0 deletions website/content/docs/operations/federation/index.mdx
@@ -0,0 +1,53 @@
---
layout: docs
page_title: Federated Cluster Operations
description: Information on running Nomad federated clusters.
---

## Operational Considerations

When operating multi-region federated Nomad clusters, the following considerations are important to
keep in mind:

* **Regular snapshots**: Nomad server state can be backed up using the
  [`nomad operator snapshot save`][] and [`nomad operator snapshot agent`][] commands. Performing
  regular backups can expedite disaster recovery; the appropriate cadence depends on each cluster's
  rate of change and your internal SLAs. Snapshots should also be tested regularly using the
  [`nomad operator snapshot restore`][] command to ensure they work.

* **Local ACL management tokens**: In order to perform federated cluster administration when the
authoritative region is down, local management tokens are required. It is important to have
existing “break glass” tokens for each region available.

* **Known paths to creating local ACL tokens**: If the authoritative region fails, creation of
global ACL tokens will fail. If this happens, having the ability to create local ACL tokens will
allow you to continue to interact with each available federated region.
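
The backup-and-verify loop described above can be sketched as follows; the snapshot file name and
test server address are assumptions:

```console
$ # Take a snapshot of the local region's server state
$ nomad operator snapshot save backup.snap

$ # Periodically verify the snapshot restores cleanly, e.g. against a test server
$ nomad operator snapshot restore -address=https://test-server.example.com:4646 backup.snap
```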

## FAQ

* **Can the authoritative region be deployed with servers only?** Yes, deploying the Nomad
  authoritative region with servers only (no Nomad clients) works as expected. It can expedite
  disaster recovery of the region because restoration does not include objects such as nodes, jobs,
  or allocations, which are large and require compute-intensive reconciliation after restoration.

* **Can I migrate the authoritative region to a currently federated region?** Yes, this is
  possible by following the steps below:

* Update the [`authoritative_region`][] configuration parameter on the desired authoritative
region servers.

* Restart the server processes in the new authoritative region and ensure all data is present in
state as expected. If the network was partitioned as part of the failure of the original
authoritative region, writes of replicated objects may not have been successfully replicated to
federated regions.

* Update the [`authoritative_region`][] configuration parameter on the federated region servers
and restart their processes.

* **Can federated regions be bootstrapped while the authoritative region is down?** No, they
  cannot.
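
For the migration steps above, the relevant change is a single server configuration parameter,
applied first on the newly authoritative region's servers and subsequently on every federated
region's servers. A minimal sketch, where the region name is an assumption:

```hcl
server {
  enabled              = true
  authoritative_region = "us-east-1"
}
```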

[`nomad operator snapshot save`]: /nomad/docs/commands/operator/snapshot/save
[`nomad operator snapshot agent`]: /nomad/docs/commands/operator/snapshot/agent
[`nomad operator snapshot restore`]: /nomad/docs/commands/operator/snapshot/restore
[`authoritative_region`]: /nomad/docs/configuration/server#authoritative_region
24 changes: 23 additions & 1 deletion website/data/docs-nav-data.json
@@ -131,7 +131,16 @@
},
{
"title": "Architecture",
"path": "concepts/architecture"
"routes": [
{
"title": "Overview",
"path": "concepts/architecture"
},
{
"title": "Federation",
"path": "concepts/architecture/federation"
}
]
},
{
"title": "CPU",
@@ -2385,6 +2394,19 @@
"title": "Key Management",
"path": "operations/key-management"
},
{
"title": "Federation",
"routes": [
{
"title": "Overview",
"path": "operations/federation"
},
{
"title": "Failure",
"path": "operations/federation/failure"
}
]
},
{
"title": "Considerations for Stateful Workloads",
"path": "operations/stateful-workloads"
