Nomad server `retry_join` gives up after a single discovery failure #24560

ryansammonaiven · 2024-11-27T23:38:02Z

Nomad version

Nomad v1.9.1+ent
BuildDate 2024-10-21T14:02:49Z
Revision d6cbc0d24773eeeaf9bd7ce6117856be2c05d4a9

Operating system and Environment details

Fedora 40

Issue

The Nomad server's retry_join gives up after a single discovery failure instead of retrying.

Reproduction steps

Start a server that uses retry_join with provider=aws inside a VPC while the EC2 VPC endpoint is still provisioning, so that the first discovery request fails.

Expected Result

The process retries until it succeeds or exhausts the configured number of retries.

Actual Result

The process gave up after a single discovery failure (see logs below).

We believe this is related to #18745, which added a return in command/agent/retry_join.go.
Prior to that change, a discovery failure would not cause the joiner to give up (i.e. return).

Nomad Server config

server {
  enabled          = true
  bootstrap_expect = 3
  server_join {
    retry_join = ["provider=aws tag_key=nomad-cluster tag_value=reference-default"]
  }
}

Nomad Server logs

Nov 27 20:26:54 i-08fcbfca67fb5da88.eu-west-1.compute.internal systemd[1]: Started nomad.service - Nomad.
Nov 27 20:26:54 i-08fcbfca67fb5da88.eu-west-1.compute.internal nomad[1312]:     2024-11-27T20:26:54.915Z [INFO]  agent.joiner: discover-aws: Region is eu-west-1
Nov 27 20:26:54 i-08fcbfca67fb5da88.eu-west-1.compute.internal nomad[1312]:     2024-11-27T20:26:54.915Z [INFO]  agent.joiner: discover-aws: Filter instances with nomad-cluster=reference-default
Nov 27 20:26:55 i-08fcbfca67fb5da88.eu-west-1.compute.internal nomad[1312]:     2024-11-27T20:26:55.258Z [ERROR] agent.joiner: discovering join addresses failed: join_config="provider=aws tag_key=nomad-cluster tag_value=reference-default"
Nov 27 20:26:55 i-08fcbfca67fb5da88.eu-west-1.compute.internal nomad[1312]:   error=
Nov 27 20:26:55 i-08fcbfca67fb5da88.eu-west-1.compute.internal nomad[1312]:   | discover-aws: DescribeInstancesInput failed: RequestError: send request failed
Nov 27 20:26:55 i-08fcbfca67fb5da88.eu-west-1.compute.internal nomad[1312]:   | caused by: Post "https://ec2.eu-west-1.amazonaws.com/": dial tcp: lookup ec2.eu-west-1.amazonaws.com: no such host
Nov 27 20:26:55 i-08fcbfca67fb5da88.eu-west-1.compute.internal nomad[1312]:   
Nov 27 20:26:55 i-08fcbfca67fb5da88.eu-west-1.compute.internal nomad[1312]:     2024-11-27T20:26:55.925Z [WARN]  nomad.raft: no known peers, aborting election

jrasell · 2024-11-28T09:01:55Z

Hi @ryansammonaiven and thanks for raising this and identifying the culprit. I am working on a fix and a test to cover this case and will raise the resulting PR shortly.

ryansammonaiven added the type/bug label Nov 27, 2024

ryansammonaiven mentioned this issue Nov 27, 2024

Add go-netaddrs support to retry_join #18745

Merged

3 tasks

jrasell self-assigned this Nov 28, 2024

jrasell added the theme/discovery, stage/accepted, and hcc/jira labels Nov 28, 2024

jrasell added this to Nomad - Community Issues Triage Nov 28, 2024

github-project-automation bot moved this to Needs Triage in Nomad - Community Issues Triage Nov 28, 2024

jrasell moved this from Needs Triage to In Progress in Nomad - Community Issues Triage Nov 28, 2024

jrasell mentioned this issue Nov 28, 2024

agent: Fix a bug where retry_join was not retrying. #24561

Merged

6 tasks

jrasell closed this as completed in #24561 Nov 29, 2024

github-project-automation bot moved this from In Progress to Done in Nomad - Community Issues Triage Nov 29, 2024

hc-github-team-nomad-core mentioned this issue Nov 29, 2024

Backport of agent: Fix a bug where retry_join was not retrying. into release/1.9.x #24566

Merged

6 tasks
