Nomad server retry_join gives up after a single discovery failure #24560

Closed
ryansammonaiven opened this issue Nov 27, 2024 · 1 comment · Fixed by #24561
Labels: hcc/jira, stage/accepted (Confirmed, and intend to work on. No timeline commitment though.), theme/discovery, type/bug


Nomad version

Nomad v1.9.1+ent
BuildDate 2024-10-21T14:02:49Z
Revision d6cbc0d24773eeeaf9bd7ce6117856be2c05d4a9

Operating system and Environment details

Fedora 40

Issue

Nomad server retry_join gives up after a single discovery failure.

Reproduction steps

Use retry_join with provider=aws inside a VPC while the EC2 VPC endpoint is still provisioning.

Expected Result

The process retries until it succeeds or exhausts the configured number of retries.

Actual Result

The process gave up after a single discovery failure (see logs below).

We believe this is related to #18745, which added a return statement in command/agent/retry_join.go.
Before that change, a discovery failure did not cause the retry loop to give up (i.e. return).

Nomad Server config

server {
  enabled          = true
  bootstrap_expect = 3
  server_join {
    retry_join = ["provider=aws tag_key=nomad-cluster tag_value=reference-default"]
  }
}

Nomad Server logs

Nov 27 20:26:54 i-08fcbfca67fb5da88.eu-west-1.compute.internal systemd[1]: Started nomad.service - Nomad.
Nov 27 20:26:54 i-08fcbfca67fb5da88.eu-west-1.compute.internal nomad[1312]:     2024-11-27T20:26:54.915Z [INFO]  agent.joiner: discover-aws: Region is eu-west-1
Nov 27 20:26:54 i-08fcbfca67fb5da88.eu-west-1.compute.internal nomad[1312]:     2024-11-27T20:26:54.915Z [INFO]  agent.joiner: discover-aws: Filter instances with nomad-cluster=reference-default
Nov 27 20:26:55 i-08fcbfca67fb5da88.eu-west-1.compute.internal nomad[1312]:     2024-11-27T20:26:55.258Z [ERROR] agent.joiner: discovering join addresses failed: join_config="provider=aws tag_key=nomad-cluster tag_value=reference-default"
Nov 27 20:26:55 i-08fcbfca67fb5da88.eu-west-1.compute.internal nomad[1312]:   error=
Nov 27 20:26:55 i-08fcbfca67fb5da88.eu-west-1.compute.internal nomad[1312]:   | discover-aws: DescribeInstancesInput failed: RequestError: send request failed
Nov 27 20:26:55 i-08fcbfca67fb5da88.eu-west-1.compute.internal nomad[1312]:   | caused by: Post "https://ec2.eu-west-1.amazonaws.com/": dial tcp: lookup ec2.eu-west-1.amazonaws.com: no such host
Nov 27 20:26:55 i-08fcbfca67fb5da88.eu-west-1.compute.internal nomad[1312]:   
Nov 27 20:26:55 i-08fcbfca67fb5da88.eu-west-1.compute.internal nomad[1312]:     2024-11-27T20:26:55.925Z [WARN]  nomad.raft: no known peers, aborting election

jrasell commented Nov 28, 2024

Hi @ryansammonaiven and thanks for raising this and identifying the culprit. I am working on a fix and a test to cover this case and will raise the resulting PR shortly.
