Allocations marked as lost during graceful shutdown with configured shutdown_delay #25199

econsult-devops opened this issue Feb 24, 2025 · 0 comments

Nomad Version

  • Version: Nomad v1.9.6
  • Build Date: 2025-02-11T18:55:10Z
  • Revision: 7f8b449+CHANGES

Operating System and Environment

  • Distributor ID: Debian
  • Description: Debian GNU/Linux 12 (bookworm)
  • Release: 12
  • Codename: bookworm

Issue Description

When a node is configured with drain_on_shutdown for graceful shutdown, tasks that have a shutdown_delay configured are incorrectly marked as lost instead of complete when the node is shut down. This triggers unnecessary alerts in our monitoring system.
The tasks are migrated successfully, but some of their allocations are still marked as lost. The length of the shutdown_delay appears to determine whether the bug is triggered.
The drain itself completes successfully, and Nomad does seem to wait for the shutdown_delay to finish before shutting down, yet the allocation is still marked as lost instead of complete.
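For reference, the affected tasks set shutdown_delay at the task level, roughly like the minimal sketch below (job, task, and image names are illustrative, not our exact specs):

job "qa-pgadmin" {
  group "pgadmin" {
    task "pgadmin" {
      driver = "docker"

      # Nomad waits this long between deregistering the task's
      # services and sending it the kill signal.
      shutdown_delay = "60s"

      config {
        image = "dpage/pgadmin4"
      }
    }
  }
}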

Configuration

drain_on_shutdown {
  deadline = "10m"
}
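For context, drain_on_shutdown sits inside the client block of the agent configuration; ours looks roughly like the sketch below. Only deadline is set explicitly; force and ignore_system_jobs are documented fields shown here with what we understand to be their defaults (treat those values as an assumption):

client {
  enabled = true

  drain_on_shutdown {
    deadline           = "10m"   # give allocations up to 10 minutes to drain
    force              = false   # assumed default
    ignore_system_jobs = false   # assumed default; system jobs are drained too
  }
}

Note that traefik (see Observed Behavior below) is a system job; with ignore_system_jobs left at false, system allocations are drained as well, which matches what we see.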

We configured the nomad service with an extended stop timeout so that systemd does not kill it before the graceful shutdown completes.

# /etc/systemd/system/nomad.service.d/override.conf
[Unit]
After=consul.service
Wants=consul.service

[Service]
...
TimeoutStopSec=900s
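The effective timeout can be verified with standard systemd tooling (shown as a sanity check; systemd reports the value via the TimeoutStopUSec property):

systemctl show nomad.service --property=TimeoutStopUSec
# expected output: TimeoutStopUSec=15min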

Observed Behavior

  • Tasks with a shutdown_delay (e.g., 30s or 60s) are migrated, but some are marked as lost.
  • Example:
    • qa-mailpit (30s delay) is not marked as lost.
    • traefik (30s delay, system job) is marked as lost.
    • qa-pgadmin (60s delay) is marked as lost.

Expected Behavior

  • Tasks with a configured shutdown_delay should be marked as complete after a graceful shutdown.

Impact

  • False alerts are triggered in monitoring systems due to allocations being marked as lost.

Hypothesis

The issue appears to be influenced by the shutdown_delay timeout: longer delays increase the likelihood of the bug being triggered. The server log below may point at the mechanism: the node's heartbeat TTL expires at 13:43:47, after the client already completed its shutdown at 13:43:21, so the server presumably marked the node down at that point and transitioned any allocations whose final state it had not yet recorded to lost. Longer delays would extend the window in which the agent is shutting down, which would fit this pattern.

Request

Is there a way to prevent allocations from being marked as lost during graceful shutdown? If not, this should be addressed as a bug.
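As a stopgap on our side (an untested assumption, not something the documentation prescribes), the node could be drained explicitly before the agent is stopped, so the servers record the drain through the regular API path instead of relying solely on drain_on_shutdown:

# drain the local node and wait for allocations to migrate,
# then stop the agent
nomad node drain -self -enable -deadline 10m
systemctl stop nomad

Raising the servers' heartbeat_grace might also widen the window before the node is marked down, but whether that affects this case is speculation on our part.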

Nomad Server logs

2025-02-20 13:43:47.570		
{
  "@level": "warn",
  "@message": "node TTL expired",
  "@module": "nomad.heartbeat",
  "@timestamp": "2025-02-20T12:43:47.570608Z",
  "node_id": "e710ffc2-e2e6-a0ff-822e-839ccd392d24"
}

Nomad Client logs

2025-02-20 13:43:21.301		
nomad.service: Consumed 2.411s CPU time.
2025-02-20 13:43:21.301		
Stopped nomad.service - Nomad.
2025-02-20 13:43:21.301		
nomad.service: Unit process 1859 (nomad) remains running after unit stopped.
2025-02-20 13:43:21.301		
nomad.service: Unit process 1688 (nomad) remains running after unit stopped.
2025-02-20 13:43:21.301		
nomad.service: Unit process 1312 (nomad) remains running after unit stopped.
2025-02-20 13:43:21.301		
nomad.service: Failed with result 'exit-code'.
2025-02-20 13:43:21.301		
nomad.service: Main process exited, code=exited, status=1/FAILURE
2025-02-20 13:43:21.297		
{
  "@level": "info",
  "@message": "shutdown complete",
  "@module": "agent",
  "@timestamp": "2025-02-20T12:43:21.297029Z"
}
2025-02-20 13:43:21.255		
{
  "@level": "info",
  "@message": "plugin manager finished",
  "@module": "client.plugin",
  "@timestamp": "2025-02-20T12:43:21.255166Z",
  "plugin-type": "csi"
}
2025-02-20 13:43:21.254		
{
  "@level": "info",
  "@message": "shutting down plugin manager",
  "@module": "client.plugin",
  "@timestamp": "2025-02-20T12:43:21.254929Z",
  "plugin-type": "csi"
}
2025-02-20 13:43:21.254		
{
  "@level": "info",
  "@message": "plugin manager finished",
  "@module": "client.plugin",
  "@timestamp": "2025-02-20T12:43:21.254840Z",
  "plugin-type": "driver"
}
2025-02-20 13:43:21.253		
{
  "@level": "info",
  "@message": "shutting down plugin manager",
  "@module": "client.plugin",
  "@timestamp": "2025-02-20T12:43:21.253502Z",
  "plugin-type": "driver"
}
2025-02-20 13:43:21.253		
{
  "@level": "info",
  "@message": "plugin manager finished",
  "@module": "client.plugin",
  "@timestamp": "2025-02-20T12:43:21.253442Z",
  "plugin-type": "device"
}
2025-02-20 13:43:21.253		
{
  "@level": "info",
  "@message": "shutting down plugin manager",
  "@module": "client.plugin",
  "@timestamp": "2025-02-20T12:43:21.253338Z",
  "plugin-type": "device"
}
2025-02-20 13:43:21.251		
{
  "@level": "info",
  "@message": "marking allocation for GC",
  "@module": "client.gc",
  "@timestamp": "2025-02-20T12:43:21.251529Z",
  "alloc_id": "835a8c31-594f-a281-1290-13c9a044d603"
}
2025-02-20 13:43:21.251		
{
  "@level": "info",
  "@message": "(runner) received finish",
  "@module": "agent",
  "@timestamp": "2025-02-20T12:43:21.251452Z"
}
2025-02-20 13:43:21.251		
{
  "@level": "info",
  "@message": "(runner) stopping",
  "@module": "agent",
  "@timestamp": "2025-02-20T12:43:21.251293Z"
}
2025-02-20 13:43:21.251		
{
  "@level": "info",
  "@message": "plugin process exited",
  "@module": "client.alloc_runner.task_runner.task_hook.logmon",
  "@timestamp": "2025-02-20T12:43:21.250983Z",
  "alloc_id": "835a8c31-594f-a281-1290-13c9a044d603",
  "id": "1484",
  "plugin": "/usr/bin/nomad",
  "task": "pgadmin"
}
2025-02-20 13:43:21.243		
{
  "@level": "info",
  "@message": "Task event",
  "@module": "client.alloc_runner.task_runner",
  "@timestamp": "2025-02-20T12:43:21.243256Z",
  "alloc_id": "835a8c31-594f-a281-1290-13c9a044d603",
  "failed": false,
  "msg": "Task successfully killed",
  "task": "pgadmin",
  "type": "Killed"
}
2025-02-20 13:43:21.229		
{
  "@level": "info",
  "@message": "plugin process exited",
  "@module": "client.driver_mgr.docker.docker_logger",
  "@timestamp": "2025-02-20T12:43:21.229794Z",
  "driver": "docker",
  "id": "1706",
  "plugin": "/usr/bin/nomad"
}
2025-02-20 13:43:21.228		
{
  "@level": "error",
  "@message": "log streaming ended with error",
  "@module": "docker_logger",
  "@timestamp": "2025-02-20T12:43:21.227990Z",
  "driver": "docker",
  "error": "context canceled",
  "timestamp": "2025-02-20T12:43:21.227Z"
}
2025-02-20 13:43:21.221		
{
  "@level": "info",
  "@message": "stopped container",
  "@module": "client.driver_mgr.docker",
  "@timestamp": "2025-02-20T12:43:21.221558Z",
  "container_id": "7f996c2df5bcb1b30fdd87e98d9cc7ec2bdbe667ea90cd1512d0b4aaa6da3932",
  "driver": "docker"
}
2025-02-20 13:42:49.833		
{
  "@level": "info",
  "@message": "marking allocation for GC",
  "@module": "client.gc",
  "@timestamp": "2025-02-20T12:42:49.833610Z",
  "alloc_id": "23efacad-4979-4c5f-4340-824c83e636e3"
}
2025-02-20 13:42:49.830		
{
  "@level": "info",
  "@message": "Task event",
  "@module": "client.alloc_runner.task_runner",
  "@timestamp": "2025-02-20T12:42:49.830218Z",
  "alloc_id": "23efacad-4979-4c5f-4340-824c83e636e3",
  "failed": false,
  "msg": "Task successfully killed",
  "task": "qa-mailpit",
  "type": "Killed"
}
2025-02-20 13:42:49.824		
{
  "@level": "error",
  "@message": "error collecting stats from container",
  "@module": "client.driver_mgr.docker",
  "@timestamp": "2025-02-20T12:42:49.824769Z",
  "container_id": "1b1123629f323249d2dbd97a5bb4ffc7886f76bdd15d95a250855f6ce1d061a6",
  "driver": "docker",
  "error": "error decoding stats data: stats were nil"
}
2025-02-20 13:42:49.823		
{
  "@level": "info",
  "@message": "plugin process exited",
  "@module": "client.driver_mgr.docker.docker_logger",
  "@timestamp": "2025-02-20T12:42:49.823134Z",
  "driver": "docker",
  "id": "1847",
  "plugin": "/usr/bin/nomad"
}
2025-02-20 13:42:49.821		
{
  "@level": "error",
  "@message": "log streaming ended with error",
  "@module": "docker_logger",
  "@timestamp": "2025-02-20T12:42:49.821293Z",
  "driver": "docker",
  "error": "context canceled",
  "timestamp": "2025-02-20T12:42:49.821Z"
}
2025-02-20 13:42:49.815		
{
  "@level": "info",
  "@message": "stopped container",
  "@module": "client.driver_mgr.docker",
  "@timestamp": "2025-02-20T12:42:49.815256Z",
  "container_id": "1b1123629f323249d2dbd97a5bb4ffc7886f76bdd15d95a250855f6ce1d061a6",
  "driver": "docker"
}
2025-02-20 13:42:23.362		
{
  "@level": "info",
  "@message": "shutting down",
  "@module": "client",
  "@timestamp": "2025-02-20T12:42:23.362596Z"
}
2025-02-20 13:42:23.362		
{
  "@level": "info",
  "@message": "requesting shutdown",
  "@module": "agent",
  "@timestamp": "2025-02-20T12:42:23.362085Z"
}
2025-02-20 13:42:19.444		
{
  "@level": "info",
  "@message": "Task event",
  "@module": "client.alloc_runner.task_runner",
  "@timestamp": "2025-02-20T12:42:19.444125Z",
  "alloc_id": "23efacad-4979-4c5f-4340-824c83e636e3",
  "failed": false,
  "msg": "Waiting for shutdown_delay of 30s before killing the task.",
  "task": "qa-mailpit",
  "type": "Waiting for shutdown delay"
}
2025-02-20 13:42:19.441		
{
  "@level": "info",
  "@message": "Task event",
  "@module": "client.alloc_runner.task_runner",
  "@timestamp": "2025-02-20T12:42:19.441764Z",
  "alloc_id": "23efacad-4979-4c5f-4340-824c83e636e3",
  "failed": false,
  "msg": "Sent interrupt. Waiting 2m0s before force killing",
  "task": "qa-mailpit",
  "type": "Killing"
}
2025-02-20 13:42:19.414		
{
  "@level": "info",
  "@message": "Task event",
  "@module": "client.alloc_runner.task_runner",
  "@timestamp": "2025-02-20T12:42:19.414918Z",
  "alloc_id": "835a8c31-594f-a281-1290-13c9a044d603",
  "failed": false,
  "msg": "Waiting for shutdown_delay of 1m0s before killing the task.",
  "task": "pgadmin",
  "type": "Waiting for shutdown delay"
}
2025-02-20 13:42:19.413		
{
  "@level": "info",
  "@message": "Task event",
  "@module": "client.alloc_runner.task_runner",
  "@timestamp": "2025-02-20T12:42:19.413089Z",
  "alloc_id": "835a8c31-594f-a281-1290-13c9a044d603",
  "failed": false,
  "msg": "Sent interrupt. Waiting 5s before force killing",
  "task": "pgadmin",
  "type": "Killing"
}
2025-02-20 13:42:18.393		
{
  "@level": "info",
  "@message": "monitoring self-drain",
  "@module": "client.drain",
  "@timestamp": "2025-02-20T12:42:18.389504Z"
}
2025-02-20 13:42:18.369		
{
  "@level": "info",
  "@message": "Gracefully shutting down agent...",
  "@module": "agent",
  "@timestamp": "2025-02-20T12:42:18.361460Z"
}
2025-02-20 13:42:18.369		
{
  "@level": "info",
  "@message": "Caught signal: interrupt",
  "@module": "agent",
  "@timestamp": "2025-02-20T12:42:18.361419Z"
}
2025-02-20 13:42:18.360		
Stopping nomad.service - Nomad...