Allow agents in client mode to permanently leave the cluster when `leave_on_terminate` is enabled #25200

econsult-devops · 2025-02-24T10:21:19Z

Feature Request: Allow agents in client mode to permanently leave the cluster when `leave_on_terminate` is enabled

Proposal

We suggest adding an option for agents in client mode to fully leave the cluster when `leave_on_terminate` is set to `true`. Currently, when a client node is terminated, it is marked as down but remains registered in the cluster. Instead, it should be marked as left and then removed from the node pool.

Use Cases

Right now, there’s no way to tell the difference between a node that’s down due to an error and one that’s down because it left the cluster. This can cause problems like:

- Monitoring Issues: Monitoring systems can’t distinguish failed nodes from nodes that were intentionally terminated, so they may trigger false alerts.
- Manual Cleanup: To clear those false alerts, admins have to run `nomad system gc` to remove the terminated nodes, which adds unnecessary overhead.

This feature would be helpful in situations like:

- Downscaling/Autoscaling: When reducing the number of client nodes or using autoscaling, terminated nodes should cleanly exit the cluster.
- Immutable Infrastructure: In setups where instances are frequently recreated, nodes should leave the cluster cleanly when terminated to avoid stale entries.

Proposed Solution

We propose adding a configuration option that lets agents in client mode fully leave the cluster when terminated. This would:

- Mark the node as left when the agent is terminated.
- Automatically remove the node from the node pool after a configurable time.

This change would make cluster management easier, reduce manual work, and improve monitoring accuracy.

Thanks for considering this request!


mismithhisler · 2025-02-25T21:39:18Z

Hi @econsult-devops!

Thanks for opening this issue. I took a look through the code, and can confirm what you are seeing. When the client receives a shutdown signal, it exits without notifying the server. There is an existing drain_on_shutdown configuration that drains the node.

Looking through the code, I think this is a reasonable request, but may take a good bit of work to implement, so I'll mark it for roadmapping.

In the meantime, you can also use the `node_gc_threshold` configuration to reduce some of the manual garbage collection, although I realize this won't help with the alerts.
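For reference, the two existing settings mentioned above could be combined roughly as follows. This is a minimal sketch, not a recommended configuration: the file names in the comments and all duration values are assumptions chosen for illustration, and `node_gc_threshold` is set on server agents rather than on clients. As noted in this thread, neither setting causes a terminated client to be marked as left.

```hcl
# --- client.hcl (hypothetical file name, client agents) ---

# Existing top-level agent option referenced in this issue.
leave_on_terminate = true

client {
  enabled = true

  # Drain allocations off this node when the agent shuts down.
  # The deadline is illustrative; tune it to your workloads.
  drain_on_shutdown {
    deadline           = "1h"
    force              = false
    ignore_system_jobs = false
  }
}

# --- server.hcl (hypothetical file name, server agents) ---

server {
  enabled = true

  # How long a node must be in a terminal state before it is
  # eligible for garbage collection. "1h" is an assumed value
  # shorter than the default, so down nodes are reaped sooner.
  node_gc_threshold = "1h"
}
```

A shorter `node_gc_threshold` only reduces how long down nodes linger before they become eligible for garbage collection; monitoring still cannot tell a failed node from an intentionally terminated one during that window.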

econsult-devops added the type/enhancement label Feb 24, 2025

tgross added this to Nomad - Community Issues Triage Feb 24, 2025

github-project-automation bot moved this to Needs Triage in Nomad - Community Issues Triage Feb 24, 2025

mismithhisler self-assigned this Feb 25, 2025

mismithhisler moved this from Needs Triage to Triaging in Nomad - Community Issues Triage Feb 25, 2025

mismithhisler added theme/drain hcc/jira labels Feb 25, 2025

mismithhisler moved this from Triaging to Needs Roadmapping in Nomad - Community Issues Triage Feb 25, 2025

mismithhisler removed their assignment Feb 25, 2025
