
Start system jobs before service jobs #25038

Closed
EugenKon opened this issue Feb 6, 2025 · 11 comments

@EugenKon

EugenKon commented Feb 6, 2025

Nomad version

1.8.2

Operating system and Environment details

Ubuntu 22.04

Issue

We have service and system jobs. The service jobs are started before the system jobs, which leaves new hosts unmonitored by Datadog.

Reproduction steps

Configure system and service jobs, and configure memory limits.
Start the cluster.
Add a new EC2 instance.

Expected Result

System jobs are more important and should be started before regular jobs.

Actual Result

Service jobs are started before system jobs.

[Screenshots. In total, this node has 8 GB of memory.]

@tgross
Member

tgross commented Feb 6, 2025

@EugenKon new hosts don't automatically get service allocations added to them unless you have blocked evaluations, so unless that's the case, the system job should typically show up first.

But if you need to enforce that, you can also enable preemption in the scheduler and give the system jobs higher priority. Use nomad operator scheduler set-config -preempt-system-scheduler=true and then set the job.priority of those system jobs to a higher value (e.g. 100).
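
For reference, a minimal sketch of those two steps might look like the following (the job name and file name are placeholders for your own system job):

# Enable preemption for the system scheduler (cluster-wide setting)
nomad operator scheduler set-config -preempt-system-scheduler=true

# In the system job's HCL, raise its priority above the service jobs:
job "dd-agent" {
  type     = "system"
  priority = 100
  # ...
}

# Re-register the job so the new priority takes effect
nomad job run dd-agent.hcl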

@EugenKon
Author

I'll try this solution when possible.

@EugenKon
Author

EugenKon commented Feb 24, 2025

@tgross Hi, I had a chance to check this.
According to this documentation: https://developer.hashicorp.com/nomad/docs/concepts/scheduling/preemption#details
a job with higher priority should preempt jobs with lower priority, but it does not.

For the sake of example we increased the job priority to 90 for the wi-knecht job. The others have the default 50 or 70. The wi-knecht task has large memory requirements, so I expect all other jobs on the client to be preempted:

[Screenshot]

Here wi-knecht occupies 3.6 GB and the node has 7.7 GB. Jobs 1-4 should be preempted.

ek-mac@job nomad $ nomad job plan -var=deploy_version=nomad -var=max_count=10 -var=min_count=2 derived-src/services/wi/wi-knecht.hcl
+/- Job: "wi-knecht"
+/- Priority: "30" => "90"
+/- Task Group: "wi-knecht-group" (1 create, 1 in-place update)
  +/- Count: "1" => "2" (forces create)
  +/- Scaling {
    +/- Min: "1" => "2"
      }
      Task: "cleanup-task"
      Task: "post-stop-cleanup"
      Task: "wi-knecht-task"

Scheduler dry-run:
- WARNING: Failed to place all allocations.
  Task Group "wi-knecht-group" (failed to place 1 allocation):
    * Class "webapi": 1 nodes excluded by filter
    * Constraint "${node.class} regexp worker": 1 nodes excluded by filter
    * Resources exhausted on 1 nodes
    * Class "worker" exhausted on 1 nodes
    * Dimension "memory" exhausted on 1 nodes

Job Modify Index: 69375
To submit the job with version verification run:

nomad job run -check-index 69375 -var="deploy_version=nomad" -var="max_count=10" -var="min_count=2" derived-src/services/wi/wi-knecht.hcl

When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.
ek-mac@job nomad $ nomad operator scheduler set-config -preempt-system-scheduler=true
Scheduler configuration updated!
ek-mac@job nomad $ nomad job plan -var=deploy_version=nomad -var=max_count=10 -var=min_count=2 derived-src/services/wi/wi-knecht.hcl
+/- Job: "wi-knecht"
+/- Priority: "30" => "90"
+/- Task Group: "wi-knecht-group" (1 create, 1 in-place update)
  +/- Count: "1" => "2" (forces create)
  +/- Scaling {
    +/- Min: "1" => "2"
      }
      Task: "cleanup-task"
      Task: "post-stop-cleanup"
      Task: "wi-knecht-task"

Scheduler dry-run:
- WARNING: Failed to place all allocations.
  Task Group "wi-knecht-group" (failed to place 1 allocation):
    * Class "webapi": 1 nodes excluded by filter
    * Constraint "${node.class} regexp worker": 1 nodes excluded by filter
    * Resources exhausted on 1 nodes
    * Class "worker" exhausted on 1 nodes
    * Dimension "memory" exhausted on 1 nodes

Job Modify Index: 69375
To submit the job with version verification run:

nomad job run -check-index 69375 -var="deploy_version=nomad" -var="max_count=10" -var="min_count=2" derived-src/services/wi/wi-knecht.hcl

When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.
ek-mac@job nomad $ nomad job run -check-index 69375 -var="deploy_version=nomad" -var="max_count=10" -var="min_count=2" derived-src/services/wi/wi-knecht.hcl
==> 2025-02-24T15:04:36-05:00: Monitoring evaluation "88187acc"
    2025-02-24T15:04:37-05:00: Evaluation triggered by job "wi-knecht"
    2025-02-24T15:04:37-05:00: Evaluation within deployment: "a367af31"
    2025-02-24T15:04:37-05:00: Allocation "2f6db5d1" modified: node "7e4530a1", group "wi-knecht-group"
    2025-02-24T15:04:37-05:00: Evaluation status changed: "pending" -> "complete"
==> 2025-02-24T15:04:37-05:00: Evaluation "88187acc" finished with status "complete" but failed to place all allocations:
    2025-02-24T15:04:37-05:00: Task Group "wi-knecht-group" (failed to place 1 allocation):
      * Class "webapi": 1 nodes excluded by filter
      * Constraint "${node.class} regexp worker": 1 nodes excluded by filter
      * Resources exhausted on 1 nodes
      * Class "worker" exhausted on 1 nodes
      * Dimension "memory" exhausted on 1 nodes
    2025-02-24T15:04:37-05:00: Evaluation "e20f221d" waiting for additional capacity to place remainder
==> 2025-02-24T15:04:37-05:00: Monitoring deployment "a367af31"
  ⠼ Deployment "a367af31" in progress...

    2025-02-24T15:05:07-05:00
    ID          = a367af31
    Job ID      = wi-knecht
    Job Version = 3
    Status      = running
    Description = Deployment is running

    Deployed
    Task Group       Auto Revert  Desired  Placed  Healthy  Unhealthy  Progress Deadline
    wi-knecht-group  true         2        1       1        0          2025-02-24T20:16:06Z
variable "deploy_version" {
  type = string
}

variable "min_count" {
  type    = number
  default = 1
}

variable "max_count" {
  type    = number
  default = 10
}

job "wi-knecht" {
  region      = "planitar"
  datacenters = ["plntr_dc"]
  type        = "service"

  priority = 90

  update {
    max_parallel      = 1
    min_healthy_time  = "30s"
    healthy_deadline  = "10m"
    progress_deadline = "11m"
    health_check      = "checks"
    canary            = 0
    auto_revert       = true
  }

  group "wi-knecht-group" {
    count = "${var.min_count}"

    scaling {
      enabled = true
      min     = "${var.min_count}"
      max     = "${var.max_count}"
      policy {
        cooldown            = "15m"
        evaluation_interval = "5m"

        check "high-cpu-usage" {
          source       = "nomad-apm"
          query        = "avg_cpu-allocated"
          query_window = "3m"
          group        = "cpu-usage"

          strategy "threshold" {
            lower_bound           = 70
            delta                 = 2
            within_bounds_trigger = 1
          }
        }

        check "low-cpu-usage" {
          source       = "nomad-apm"
          query        = "avg_cpu-allocated"
          query_window = "3m"
          group        = "cpu-usage"

          strategy "threshold" {
            upper_bound           = 30
            delta                 = -1
            within_bounds_trigger = 1
          }
        }

        check "high-memory-usage" {
          source       = "nomad-apm"
          query        = "avg_memory-allocated"
          query_window = "3m"
          group        = "memory-usage"

          strategy "threshold" {
            lower_bound           = 70
            delta                 = 2
            within_bounds_trigger = 1
          }
        }

        check "low-memory-usage" {
          source       = "nomad-apm"
          query        = "avg_memory-allocated"
          query_window = "3m"
          group        = "memory-usage"

          strategy "threshold" {
            upper_bound           = 30
            delta                 = -1
            within_bounds_trigger = 1
          }
        }
      }
    }

    restart {
      attempts = 5
      delay    = "15s"
      interval = "3m"
      mode     = "delay"
    }

   ....
      service {
        tags     = ["wi-knecht"]
        name     = "wi-knecht"
        provider = "consul"
      }

      resources {
        memory     = 3000
        memory_max = 3500
        cpu        = 2000
      }
    }
  }
}

@tgross
Member

tgross commented Feb 24, 2025

Can you show the output of nomad operator scheduler get-config? This job is a service job, so you specifically need to ensure that the scheduler has been configured to allow preemption for service jobs (similar to what I said above about system jobs).

@EugenKon
Author

$ nomad operator scheduler get-config
Scheduler Algorithm           = binpack
Memory Oversubscription       = true
Reject Job Registration       = false
Pause Eval Broker             = false
Preemption System Scheduler   = true
Preemption Service Scheduler  = false
Preemption Batch Scheduler    = false
Preemption SysBatch Scheduler = false
Modify Index                  = 69413

@EugenKon
Author

ah, ok. You are referring to https://developer.hashicorp.com/nomad/docs/concepts/scheduling/preemption#preemption-in-nomad
This description so implicit =(

It is unclear if service job should preempt system jobs. Because I enabled system jobs to be preempted in this case.

@tgross
Member

tgross commented Feb 24, 2025

The section of the docs I linked you to should be pretty clear:

-preempt-service-scheduler - Specifies whether preemption for service jobs is enabled. Note that if this is set to true, then service jobs can preempt any other jobs. Must be one of [true|false].

vs

-preempt-system-scheduler - Specifies whether preemption for system jobs is enabled. Note that if this is set to true, then system jobs can preempt any other jobs. Must be one of [true|false].
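
Since wi-knecht is a service job, a sketch of the setting that would need to be enabled for it to preempt other allocations (in addition to the system-scheduler flag already set above) would be:

# Allow service jobs to preempt lower-priority allocations
nomad operator scheduler set-config -preempt-service-scheduler=true

# Confirm the change took effect
nomad operator scheduler get-config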

@EugenKon
Author

Nice. Thanks. It would be nice to extend this example with that info too:
https://developer.hashicorp.com/nomad/docs/concepts/scheduling/preemption#details

tgross added a commit that referenced this issue Feb 24, 2025
Fix a broken link from the preemption concepts docs to the relevant API. Also
include a link to the relevant command.

Ref: #25038
@tgross
Member

tgross commented Feb 24, 2025

Done #25203

tgross added a commit that referenced this issue Feb 25, 2025
… (#25207)

Fix a broken link from the preemption concepts docs to the relevant API. Also include a link to the relevant command.

Ref: #25038

Co-authored-by: Tim Gross <[email protected]>
@EugenKon
Author

EugenKon commented Feb 25, 2025

@tgross

@EugenKon new hosts don't automatically get service allocations added to them unless you have blocked evaluations, so unless that's the case the system job should show up first typically.

OK, I ran into another issue. We have the Nomad server configured like this:

server {
  enabled          = true
  bootstrap_expect = SERVER_COUNT

  default_scheduler_config {
    memory_oversubscription_enabled = true
  }

  server_join {
    retry_max      = 0
    retry_interval = "30s"
  }
}

and the Nomad jobs configured like this:

variable "project_name" {
  type = string
}

variable "aws_region" {
  type = string
}

job "dd-agent" {
  type        = "system"

  group "monitoring" {
    count = 1

    network {
      port "dd-tcp" {
        static = xxx
        to     = xxx
      }

      port "dd-udp" {
        static = xxx
        to     = xxx
      }
    }

    task "datadog-task" {
      volume_mount {
        volume      = "docker-sock-volume"
        destination = "/var/run/docker.sock"
        read_only   = false
      }

      driver = "docker"

      config {
        force_pull   = false
        image        = "datadog/agent:7.59.0-linux"
        ports        = ["dd-tcp", "dd-udp"]
        command      = "bash"
        network_mode = "host"
        volumes      = [
        ]
      }

      template {
        destination = "local/datadog/consul.d/conf.yaml"
        data        = <<-EOH
        EOH
      }

      template {
        destination = "local/datadog/postgres.d/conf.yaml"
        data        = <<-EOH
        EOH
      }

      resources {
        memory = 700
      }

      service {
        provider = "consul"

        tags = ["dd-agent-service-tag"]
        name = "dd-agent"
        port = "dd-tcp"

        meta {
          custom = "label"
        }

        check {
          type     = "tcp"
          port     = "dd-tcp"
          interval = "30s"
          timeout  = "5s"
        }
      }
    }
  }
}
variable "NOMAD_TOKEN" {
  type = string
}

variable "NOMAD_ADDR" {
  type = string
}

job "autoscaler" {
  type        = "system"

  group "autoscaler" {
    count = 1

    network {
      port "http" {
        to = 8080
      }
    }

    task "autoscaler" {
      driver = "docker"

      config {
        image   = "hashicorp/nomad-autoscaler:0.4.5"
        command = "nomad-autoscaler"
        args = [
          "agent",
          "-config",
          "${NOMAD_TASK_DIR}/config.hcl",
          "-http-bind-address",
          "0.0.0.0",
          "-policy-dir",
          "${NOMAD_TASK_DIR}/policies/"
        ]
        ports = ["http"]

        labels = {
          "com.datadoghq.ad.logs" = jsonencode([{
            source = "nomad"
            service = "nomad-autoscaler"
          }])
        }
      }

      env {
         NOMAD_TOKEN = "${var.NOMAD_TOKEN}"
      }

      service {
        name = "autoscaler"
        port = "http"
      }

      template {
        data = <<EOF
nomad {
  address = "${var.NOMAD_ADDR}"
  region  = "planitar"
  skip_verify = true
}

  apm "nomad" {
    driver = "nomad-apm"
  }

strategy "target-value" {
  driver = "target-value"
}
EOF
        destination = "${NOMAD_TASK_DIR}/config.hcl"
      }

      template {
        data = <<EOF
EOF
        destination = "${NOMAD_TASK_DIR}/policies/hashistack.hcl"
      }
    }
  }
}

Usually dd-agent starts normally on all EC2 nodes, but sometimes something goes wrong. New EC2 instances are created and registered with Consul and Nomad, but Nomad deploys only one of the system jobs and that is all.
We expect Nomad to deploy all three system jobs and then the other service jobs.

Here is how this client appears in the Nomad UI.

[Screenshots from the Nomad UI]

You can see that the instance has plenty of resources, but the scheduler does not schedule dd-agent there.
You can also see that there is not much difference between the autoscaler and dd-agent resource requirements.

@tgross
Member

tgross commented Feb 26, 2025

New EC2 instances are created and registered with Consul and Nomad, but Nomad deploys only one of the system jobs and that is all. We expect Nomad to deploy all three system jobs and then the other service jobs.

No, you should not expect to see any service jobs get deployed to the new instances unless you actually deploy them again with an additional count. Only the system jobs would show up on the new instances.

And as usual you've provided screenshots without context as to what I'm supposed to be looking at. But I'm going to guess that you're saying "I have 16 nodes and only 14 of the nodes have placements for the system jobs datadog-rds-postrgres-trigger and dd-agent." And you're saying that those 2 extra nodes should have enough memory space for the allocations.

@EugenKon at this point I'm going to push back a challenge to you, in the hope that we can raise your level of education here. You have 2 nodes that don't seem to have the placements you think they should. Without using the web UI, which commands should you run to try to figure out why the scheduler made the decisions it did about those two jobs?
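
For readers following the thread, a sketch of the kind of commands that help answer that question (the job and node identifiers below are placeholders):

# Show the job's allocations and any recent placement failures
nomad job status dd-agent

# List and inspect the evaluations for the job; the eval status output
# includes per-node filtering and resource-exhaustion metrics
nomad eval list -job dd-agent
nomad eval status <eval-id>

# Check what is already allocated on the node in question
nomad node status -verbose <node-id>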
