Allow agents in client mode to permanently leave the cluster when leave_on_terminate is enabled #25200

Open
econsult-devops opened this issue Feb 24, 2025 · 1 comment

Comments

@econsult-devops
Feature Request: Allow agents in client mode to permanently leave the cluster when leave_on_terminate is enabled

Proposal

We suggest adding an option for agents in client mode to fully leave the cluster when leave_on_terminate is set to true. Currently, when a client node is terminated, it is marked as down but stays in the cluster. Instead, it should be marked as left and then removed from the node pool.

Use Cases

Right now, there’s no way to tell the difference between a node that’s down due to an error and one that’s down because it left the cluster. This can cause problems like:

  1. Monitoring Issues: False alerts might be triggered because monitoring systems can’t distinguish between failed nodes and nodes that were intentionally terminated.
  2. Manual Cleanup: Admins have to run nomad system gc to remove the terminated nodes and clear the resulting false alerts, which adds unnecessary overhead.

This feature would be helpful in situations like:

  • Downscaling and autoscaling: When the number of client nodes is reduced, manually or by an autoscaler, terminated nodes should cleanly exit the cluster.
  • Immutable Infrastructure: In setups where instances are frequently recreated, nodes should leave the cluster cleanly when terminated to avoid stale entries.

Proposed Solution

We propose adding a configuration option that lets agents in client mode fully leave the cluster when terminated. This would:

  1. Mark the node as left when the agent is terminated.
  2. Automatically remove the node from the node pool after a configurable time.

This change would make cluster management easier, reduce manual work, and improve monitoring accuracy.

Thanks for considering this request!

@mismithhisler
Member

Hi @econsult-devops!

Thanks for opening this issue. I took a look through the code, and can confirm what you are seeing. When the client receives a shutdown signal, it exits without notifying the server. There is an existing drain_on_shutdown configuration that drains the node.
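
For reference, a minimal sketch of a client agent configuration that combines leave_on_terminate with the drain_on_shutdown block mentioned above. The values are illustrative assumptions, not recommendations, and this only drains workloads before exit; it does not change how the terminated node's status is reported.

```hcl
# Client agent configuration (illustrative values, adjust for your cluster).

# Top-level agent option: leave gracefully on the terminate signal.
leave_on_terminate = true

client {
  enabled = true

  # Drain allocations before the agent exits. This block only takes effect
  # when leave_on_interrupt or leave_on_terminate is set.
  drain_on_shutdown {
    deadline           = "1h"  # how long to wait for allocations to stop
    force              = false # do not force-stop allocations immediately
    ignore_system_jobs = false # drain system jobs as well
  }
}
```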

I think this is a reasonable request, but it may take a good bit of work to implement, so I'll mark it for roadmapping.

In the meantime, you can also use the node_gc_threshold configuration to help with some of the manual gc'ing, although I do realize this won't help with alerts.
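
As a rough sketch, node_gc_threshold is set in the server block of the server agents' configuration. The 1h value below is an assumption chosen for illustration (the documented default is 24h); pick a value that fits how quickly stale nodes should disappear in your environment.

```hcl
# Server agent configuration (illustrative value).
server {
  enabled = true

  # How long a node must be in a terminal state before it is
  # garbage collected. Lower values remove terminated clients sooner.
  node_gc_threshold = "1h"
}
```

A shorter threshold reduces how long terminated clients linger, but the node is still reported as down until it is collected.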

@mismithhisler mismithhisler moved this from Triaging to Needs Roadmapping in Nomad - Community Issues Triage Feb 25, 2025
@mismithhisler mismithhisler removed their assignment Feb 25, 2025
Projects
Status: Needs Roadmapping
Development

No branches or pull requests

2 participants