
Start system jobs before service jobs #25038

Closed
EugenKon opened this issue Feb 6, 2025 · 11 comments

@EugenKon

EugenKon commented Feb 6, 2025

Nomad version

1.8.2

Operating system and Environment details

Ubuntu 22.04

Issue

We have service and system jobs. The service jobs are started before the system jobs, which leaves new hosts unmonitored by Datadog.

Reproduction steps

Configure system and service jobs, and configure memory limits.
Start the cluster.
Add a new EC2 instance.

Expected Result

System jobs are more important and should be started before regular jobs.

Actual Result

Service jobs are started before system jobs.

[Screenshots. In total, this node has 8 GB of memory.]

@tgross
Member

tgross commented Feb 6, 2025

@EugenKon new hosts don't automatically get service allocations added to them unless you have blocked evaluations, so unless that's the case, the system job should typically show up first.

But if you need to enforce that, you can also enable preemption in the scheduler and give the system jobs higher priority. Use nomad operator scheduler set-config -preempt-system-scheduler=true and then set the job.priority of those system jobs to a higher value (e.g. 100).
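
For reference, a minimal sketch of those two steps might look like the following (the job name and file name are placeholders for your own system job):

# Enable preemption for the system scheduler (cluster-wide setting)
nomad operator scheduler set-config -preempt-system-scheduler=true

# In the system job's HCL, raise its priority above the service jobs:
job "dd-agent" {
  type     = "system"
  priority = 100
  # ...
}

# Re-register the job so the new priority takes effect
nomad job run dd-agent.hcl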

@EugenKon
Author

I'll try this solution when possible.

@EugenKon
Author

EugenKon commented Feb 24, 2025

@tgross Hi, I had a chance to check this.
According to this documentation: https://developer.hashicorp.com/nomad/docs/concepts/scheduling/preemption#details
a job with higher priority should preempt jobs with lower priority, but it does not.

For the sake of example we increased the job priority to 90 for the wi-knecht job. The others have the default 50 or 70. The wi-knecht task has large memory requirements, so I expect all other jobs on the client to be preempted:

[Screenshot]

Here wi-knecht occupies 3.6 GB and the node has 7.7 GB. Jobs 1-4 should be preempted.

ek-mac@job nomad $ nomad job plan -var=deploy_version=nomad -var=max_count=10 -var=min_count=2 derived-src/services/wi/wi-knecht.hcl
+/- Job: "wi-knecht"
+/- Priority: "30" => "90"
+/- Task Group: "wi-knecht-group" (1 create, 1 in-place update)
  +/- Count: "1" => "2" (forces create)
  +/- Scaling {
    +/- Min: "1" => "2"
      }
      Task: "cleanup-task"
      Task: "post-stop-cleanup"
      Task: "wi-knecht-task"

Scheduler dry-run:
- WARNING: Failed to place all allocations.
  Task Group "wi-knecht-group" (failed to place 1 allocation):
    * Class "webapi": 1 nodes excluded by filter
    * Constraint "${node.class} regexp worker": 1 nodes excluded by filter
    * Resources exhausted on 1 nodes
    * Class "worker" exhausted on 1 nodes
    * Dimension "memory" exhausted on 1 nodes

Job Modify Index: 69375
To submit the job with version verification run:

nomad job run -check-index 69375 -var="deploy_version=nomad" -var="max_count=10" -var="min_count=2" derived-src/services/wi/wi-knecht.hcl

When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.
ek-mac@job nomad $ nomad operator scheduler set-config -preempt-system-scheduler=true
Scheduler configuration updated!
ek-mac@job nomad $ nomad job plan -var=deploy_version=nomad -var=max_count=10 -var=min_count=2 derived-src/services/wi/wi-knecht.hcl
+/- Job: "wi-knecht"
+/- Priority: "30" => "90"
+/- Task Group: "wi-knecht-group" (1 create, 1 in-place update)
  +/- Count: "1" => "2" (forces create)
  +/- Scaling {
    +/- Min: "1" => "2"
      }
      Task: "cleanup-task"
      Task: "post-stop-cleanup"
      Task: "wi-knecht-task"

Scheduler dry-run:
- WARNING: Failed to place all allocations.
  Task Group "wi-knecht-group" (failed to place 1 allocation):
    * Class "webapi": 1 nodes excluded by filter
    * Constraint "${node.class} regexp worker": 1 nodes excluded by filter
    * Resources exhausted on 1 nodes
    * Class "worker" exhausted on 1 nodes
    * Dimension "memory" exhausted on 1 nodes

Job Modify Index: 69375
To submit the job with version verification run:

nomad job run -check-index 69375 -var="deploy_version=nomad" -var="max_count=10" -var="min_count=2" derived-src/services/wi/wi-knecht.hcl

When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.
ek-mac@job nomad $ nomad job run -check-index 69375 -var="deploy_version=nomad" -var="max_count=10" -var="min_count=2" derived-src/services/wi/wi-knecht.hcl
==> 2025-02-24T15:04:36-05:00: Monitoring evaluation "88187acc"
    2025-02-24T15:04:37-05:00: Evaluation triggered by job "wi-knecht"
    2025-02-24T15:04:37-05:00: Evaluation within deployment: "a367af31"
    2025-02-24T15:04:37-05:00: Allocation "2f6db5d1" modified: node "7e4530a1", group "wi-knecht-group"
    2025-02-24T15:04:37-05:00: Evaluation status changed: "pending" -> "complete"
==> 2025-02-24T15:04:37-05:00: Evaluation "88187acc" finished with status "complete" but failed to place all allocations:
    2025-02-24T15:04:37-05:00: Task Group "wi-knecht-group" (failed to place 1 allocation):
      * Class "webapi": 1 nodes excluded by filter
      * Constraint "${node.class} regexp worker": 1 nodes excluded by filter
      * Resources exhausted on 1 nodes
      * Class "worker" exhausted on 1 nodes
      * Dimension "memory" exhausted on 1 nodes
    2025-02-24T15:04:37-05:00: Evaluation "e20f221d" waiting for additional capacity to place remainder
==> 2025-02-24T15:04:37-05:00: Monitoring deployment "a367af31"
  ⠼ Deployment "a367af31" in progress...

    2025-02-24T15:05:07-05:00
    ID          = a367af31
    Job ID      = wi-knecht
    Job Version = 3
    Status      = running
    Description = Deployment is running

    Deployed
    Task Group       Auto Revert  Desired  Placed  Healthy  Unhealthy  Progress Deadline
    wi-knecht-group  true         2        1       1        0          2025-02-24T20:16:06Z
variable "deploy_version" {
  type = string
}

variable "min_count" {
  type    = number
  default = 1
}

variable "max_count" {
  type    = number
  default = 10
}

job "wi-knecht" {
  region      = "planitar"
  datacenters = ["plntr_dc"]
  type        = "service"

  priority = 90

  update {
    max_parallel      = 1
    min_healthy_time  = "30s"
    healthy_deadline  = "10m"
    progress_deadline = "11m"
    health_check      = "checks"
    canary            = 0
    auto_revert       = true
  }

  group "wi-knecht-group" {
    count = "${var.min_count}"

    scaling {
      enabled = true
      min     = "${var.min_count}"
      max     = "${var.max_count}"
      policy {
        cooldown            = "15m"
        evaluation_interval = "5m"

        check "high-cpu-usage" {
          source       = "nomad-apm"
          query        = "avg_cpu-allocated"
          query_window = "3m"
          group        = "cpu-usage"

          strategy "threshold" {
            lower_bound           = 70
            delta                 = 2
            within_bounds_trigger = 1
          }
        }

        check "low-cpu-usage" {
          source       = "nomad-apm"
          query        = "avg_cpu-allocated"
          query_window = "3m"
          group        = "cpu-usage"

          strategy "threshold" {
            upper_bound           = 30
            delta                 = -1
            within_bounds_trigger = 1
          }
        }

        check "high-memory-usage" {
          source       = "nomad-apm"
          query        = "avg_memory-allocated"
          query_window = "3m"
          group        = "memory-usage"

          strategy "threshold" {
            lower_bound           = 70
            delta                 = 2
            within_bounds_trigger = 1
          }
        }

        check "low-memory-usage" {
          source       = "nomad-apm"
          query        = "avg_memory-allocated"
          query_window = "3m"
          group        = "memory-usage"

          strategy "threshold" {
            upper_bound           = 30
            delta                 = -1
            within_bounds_trigger = 1
          }
        }
      }
    }

    restart {
      attempts = 5
      delay    = "15s"
      interval = "3m"
      mode     = "delay"
    }

   ....
      service {
        tags     = ["wi-knecht"]
        name     = "wi-knecht"
        provider = "consul"
      }

      resources {
        memory     = 3000
        memory_max = 3500
        cpu        = 2000
      }
    }
  }
}

@tgross
Member

tgross commented Feb 24, 2025

Can you show the output of nomad operator scheduler get-config? This job is a service job, so you specifically need to ensure that the scheduler has been configured to allow preemption for service jobs (similar to what I said above about system jobs).

@EugenKon
Author

$ nomad operator scheduler get-config
Scheduler Algorithm           = binpack
Memory Oversubscription       = true
Reject Job Registration       = false
Pause Eval Broker             = false
Preemption System Scheduler   = true
Preemption Service Scheduler  = false
Preemption Batch Scheduler    = false
Preemption SysBatch Scheduler = false
Modify Index                  = 69413

@EugenKon
Author

ah, ok. You are referring to https://developer.hashicorp.com/nomad/docs/concepts/scheduling/preemption#preemption-in-nomad
This description so implicit =(

It is unclear if service job should preempt system jobs. Because I enabled system jobs to be preempted in this case.

@tgross
Member

tgross commented Feb 24, 2025

The section of the docs I linked you to should be pretty clear:

-preempt-service-scheduler - Specifies whether preemption for service jobs is enabled. Note that if this is set to true, then service jobs can preempt any other jobs. Must be one of [true|false].

vs

-preempt-system-scheduler - Specifies whether preemption for system jobs is enabled. Note that if this is set to true, then system jobs can preempt any other jobs. Must be one of [true|false].
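
Since wi-knecht is a service job, a sketch of the setting that would need to be enabled for it to preempt other allocations (in addition to the system-scheduler flag already set above) would be:

# Allow service jobs to preempt lower-priority allocations
nomad operator scheduler set-config -preempt-service-scheduler=true

# Confirm the change took effect
nomad operator scheduler get-config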

@EugenKon
Author

Nice. Thanks. It would be nice to extend this example with that info too:
https://developer.hashicorp.com/nomad/docs/concepts/scheduling/preemption#details

tgross added a commit that referenced this issue Feb 24, 2025
Fix a broken link from the preemption concepts docs to the relevant API. Also
include a link to the relevant command.

Ref: #25038
@tgross
Member

tgross commented Feb 24, 2025

Done #25203

tgross added a commit that referenced this issue Feb 25, 2025
… (#25207)

Fix a broken link from the preemption concepts docs to the relevant API. Also include a link to the relevant command.

Ref: #25038

Co-authored-by: Tim Gross <[email protected]>
@EugenKon
Author

EugenKon commented Feb 25, 2025

@tgross

@EugenKon new hosts don't automatically get service allocations added to them unless you have blocked evaluations, so unless that's the case the system job should show up first typically.

OK, I ran into another issue. We have the Nomad server configured like this:

server {
  enabled          = true
  bootstrap_expect = SERVER_COUNT

  default_scheduler_config {
    memory_oversubscription_enabled = true
  }

  server_join {
    retry_max      = 0
    retry_interval = "30s"
  }
}

and the Nomad jobs configured like this:

variable "project_name" {
  type = string
}

variable "aws_region" {
  type = string
}

job "dd-agent" {
  type        = "system"

  group "monitoring" {
    count = 1

    network {
      port "dd-tcp" {
        static = xxx
        to     = xxx
      }

      port "dd-udp" {
        static = xxx
        to     = xxx
      }
    }

    task "datadog-task" {
      volume_mount {
        volume      = "docker-sock-volume"
        destination = "/var/run/docker.sock"
        read_only   = false
      }

      driver = "docker"

      config {
        force_pull   = false
        image        = "datadog/agent:7.59.0-linux"
        ports        = ["dd-tcp", "dd-udp"]
        command      = "bash"
        network_mode = "host"
        volumes      = [
        ]
      }

      template {
        destination = "local/datadog/consul.d/conf.yaml"
        data        = <<-EOH
        EOH
      }

      template {
        destination = "local/datadog/postgres.d/conf.yaml"
        data        = <<-EOH
        EOH
      }

      resources {
        memory = 700
      }

      service {
        provider = "consul"

        tags = ["dd-agent-service-tag"]
        name = "dd-agent"
        port = "dd-tcp"

        meta {
          custom = "label"
        }

        check {
          type     = "tcp"
          port     = "dd-tcp"
          interval = "30s"
          timeout  = "5s"
        }
      }
    }
  }
}
variable "NOMAD_TOKEN" {
  type = string
}

variable "NOMAD_ADDR" {
  type = string
}

job "autoscaler" {
  type        = "system"

  group "autoscaler" {
    count = 1

    network {
      port "http" {
        to = 8080
      }
    }

    task "autoscaler" {
      driver = "docker"

      config {
        image   = "hashicorp/nomad-autoscaler:0.4.5"
        command = "nomad-autoscaler"
        args = [
          "agent",
          "-config",
          "${NOMAD_TASK_DIR}/config.hcl",
          "-http-bind-address",
          "0.0.0.0",
          "-policy-dir",
          "${NOMAD_TASK_DIR}/policies/"
        ]
        ports = ["http"]

        labels = {
          "com.datadoghq.ad.logs" = jsonencode([{
            source = "nomad"
            service = "nomad-autoscaler"
          }])
        }
      }

      env {
         NOMAD_TOKEN = "${var.NOMAD_TOKEN}"
      }

      service {
        name = "autoscaler"
        port = "http"
      }

      template {
        data = <<EOF
nomad {
  address = "${var.NOMAD_ADDR}"
  region  = "planitar"
  skip_verify = true
}

  apm "nomad" {
    driver = "nomad-apm"
  }

strategy "target-value" {
  driver = "target-value"
}
EOF
        destination = "${NOMAD_TASK_DIR}/config.hcl"
      }

      template {
        data = <<EOF
EOF
        destination = "${NOMAD_TASK_DIR}/policies/hashistack.hcl"
      }
    }
  }
}

Usually dd-agent starts normally on all EC2 nodes, but sometimes something goes wrong. New EC2 instances are created and registered with Consul and Nomad, but Nomad deploys only one of the system jobs and that is all.
We expect Nomad to deploy all three system jobs and then the other service jobs.

Here is how this client appears in the Nomad UI.

[Screenshots from the Nomad UI]

You can see that the instance has plenty of resources, but the scheduler does not schedule dd-agent there.
You can also see that there is not much difference between the autoscaler and dd-agent resource requirements.

@tgross
Member

tgross commented Feb 26, 2025

New EC2 instances are created and registered with Consul and Nomad, but Nomad deploys only one of the system jobs and that is all. We expect Nomad to deploy all three system jobs and then the other service jobs.

No, you should not expect to see any service jobs get deployed to the new instances unless you actually deploy them again with an additional count. Only the system jobs would show up on the new instances.

And as usual you've provided screenshots without context as to what I'm supposed to be looking at. But I'm going to guess that you're saying "I have 16 nodes and only 14 of the nodes have placements for the system jobs datadog-rds-postrgres-trigger and dd-agent." And you're saying that those 2 extra nodes should have enough memory space for the allocations.

@EugenKon at this point I'm going to push back a challenge to you, in the hope that we can raise your level of education here. You have 2 nodes that don't seem to have the placements you think they should. Without using the web UI, which commands should you run to try to figure out why the scheduler made the decisions it did about those two jobs?
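
For readers following the thread, a sketch of the kind of commands that help answer that question (the job and node identifiers below are placeholders):

# Show the job's allocations and any recent placement failures
nomad job status dd-agent

# List and inspect the evaluations for the job; the eval status output
# includes per-node filtering and resource-exhaustion metrics
nomad eval list -job dd-agent
nomad eval status <eval-id>

# Check what is already allocated on the node in question
nomad node status -verbose <node-id>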
