Bug Report: Allocations marked as lost during graceful shutdown with configured shutdown_delay
Nomad Version
Operating System and Environment
Issue Description
When a node is configured with drain_on_shutdown for graceful shutdown, tasks that have a shutdown_delay configured are incorrectly marked as lost instead of complete when the node is shut down. This behavior triggers unnecessary alerts in our monitoring system.
The tasks do get migrated, but some of them are marked as lost.
The length of the shutdown_delay appears to affect whether the bug is triggered.
The drain executes successfully, and Nomad does appear to wait for the shutdown_delay to finish before shutting down, but the allocation is still marked as lost instead of complete.
Configuration
drain_on_shutdown {
  deadline = "10m"
}
We configured the Nomad service with an extended timeout period to prevent it from being killed by systemd before it finishes shutting down gracefully.
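The job specifications are not included in this report. As a minimal sketch of the job side of this setup, assuming a Docker task and a task-level shutdown_delay (the group name, task name, and image below are hypothetical, and the delay could equally be set at the group level), an affected job might look roughly like this:

```hcl
# Hypothetical job sketch; only the job name and the 60s delay come from the report.
job "qa-pgadmin" {
  group "pgadmin" {
    task "pgadmin" {
      driver = "docker"

      config {
        # Assumed image, for illustration only.
        image = "dpage/pgadmin4"
      }

      # Delay between deregistering the task's services and sending the kill signal.
      shutdown_delay = "60s"
    }
  }
}
```

With the 10m drain deadline shown above, a 60s delay fits well within the drain window, so the delay by itself should not cause the drain to time out.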
Observed Behavior
Tasks with a shutdown_delay (e.g., 30s or 60s) are migrated but marked as lost.
qa-mailpit (30s delay) is not marked as lost.
traefik (30s delay, system job) is marked as lost.
qa-pgadmin (60s delay) is marked as lost.
Expected Behavior
Impact
Hypothesis
The issue appears to be influenced by the shutdown_delay timeout, where longer delays increase the likelihood of the bug being triggered.
Request
Is there a way to prevent allocations from being marked as lost during graceful shutdown? If not, this should be addressed as a bug.
Nomad Server logs
Nomad Client logs