
Nomad does not report properly about not-started system jobs #25058

Closed
EugenKon opened this issue Feb 7, 2025 · 7 comments

Comments

EugenKon commented Feb 7, 2025

Nomad version

1.8.2

Operating system and Environment details

Ubuntu 24.04

Issue

[screenshot]
But we have 12 clients:
[screenshot]
If I SSH to that EC2 instance and run `docker ps`, I can see that the autoscaler is not among the running jobs.

Reproduction steps

1. Deploy the cluster.
2. Create a new EC2 instance in the cluster.

Expected Result

Nomad should report that some EC2 instances are without system jobs.

Actual Result

No issues are reported by Nomad.

Job file (if appropriate)

variable "NOMAD_TOKEN" {
  type = string
}

variable "NOMAD_ADDR" {
  type = string
}

job "autoscaler" {
  region      = "xxxx"
  type        = "system"

  group "autoscaler" {
    count = 1

    network {
      port "http" {
        to = 8080
      }
    }

    task "autoscaler" {
      driver = "docker"

      config {
        image   = "hashicorp/nomad-autoscaler:0.4.5"
        command = "nomad-autoscaler"
        args = [
          "agent",
          "-config",
          "${NOMAD_TASK_DIR}/config.hcl",
          "-http-bind-address",
          "0.0.0.0",
          "-policy-dir",
          "${NOMAD_TASK_DIR}/policies/"
        ]
        ports = ["http"]

        labels = {
          "com.datadoghq.ad.logs" = jsonencode([{
            source = "nomad"
            service = "nomad-autoscaler"
          }])
        }
      }

      env {
         NOMAD_TOKEN = "${var.NOMAD_TOKEN}"
      }

      service {
        name = "autoscaler"
        port = "http"
      }

      template {
        data = <<EOF
        nomad {
          address = "${var.NOMAD_ADDR}"
          region  = "xxx"
          skip_verify = true
        }

        apm "nomad" {
          driver = "nomad-apm"
        }

        strategy "target-value" {
          driver = "target-value"
        }
        EOF
        destination = "${NOMAD_TASK_DIR}/config.hcl"
      }

      template {
        data = <<EOF
        scaling "cluster_policy" {
          enabled = false // This will be enabled once we extend nomad to scale the cluster.
          min     = 1
          max     = 10

          policy {
            cooldown            = "5m"
            evaluation_interval = "3m"

            check "cpu_allocated_percentage" {
              source       = "nomad-apm"
              query        = "percentage-allocated_cpu"
              query_window = "5m"

              strategy "target-value" {
                target = 70
              }
            }

            check "memory_allocated_percentage" {
              source       = "nomad-apm"
              query        = "percentage-allocated_memory"
              query_window = "5m"

              strategy "target-value" {
                target = 70
              }
            }
          }
        }
        EOF
        destination = "${NOMAD_TASK_DIR}/policies/hashistack.hcl"
      }
    }
  }
}

*It would be nice to have a `<details>` tag in the issue template.

tgross (Member) commented Feb 7, 2025

@EugenKon if certain clients don't match the constraints of a system job, then there's no way for Nomad to know you meant to have a given system job on that client (otherwise it would have placed it!). If the nodes match the constraints but there's not enough room to make a placement, then the evaluations list for that job will show a blocked eval.
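The blocked-eval check tgross describes can be done from the CLI. A hedged sketch, assuming the job is named `autoscaler` as in the job file below; `<eval-id>` is a placeholder and these commands need a reachable Nomad cluster:

```shell
# Placement failures, if any, show up in the job status output
nomad job status autoscaler

# List evaluations for the job; a blocked eval has status "blocked"
nomad eval list -job autoscaler

# Inspect one evaluation; "Placement Failures" explains why nodes were
# filtered or exhausted (replace <eval-id> with an ID from the list)
nomad eval status <eval-id>
```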

EugenKon (Author) commented Feb 7, 2025

@tgross Hi. But this job has no constraints. And if a job is not placed because of some constraint, Nomad usually reports that some clients were filtered.

EugenKon (Author) commented Feb 7, 2025

Based on this screenshot:
[screenshot]

I expect to see 11/12 here, because system jobs are documented as running on all instances:
[screenshot]
Our system jobs do not have any constraints.

Also, the job should be marked like this:
[screenshot]

tgross (Member) commented Feb 10, 2025

because system jobs are documented as running on all instances:

@EugenKon the very next half of that sentence says "that meets the constraints". I'm not sure why you're posting giant screenshots of text when it just says what I just told you.

But for this job there is no constraints.

Jobs almost always have implicit constraints. For example, maybe the 12th node doesn't have a healthy Docker driver. Maybe it's in the wrong node pool. Maybe it's missing required CNI plugins. Or maybe it's just full. I don't know, because you haven't provided any information about the node or the evals. Post the text (please not a screenshot; they're extremely hard for me to read) of `nomad node status -verbose :nodeid` for the relevant node, as well as `nomad eval status :evalid` for the evaluation that should have created the allocation. That's always the starting point when you want to know what's going on.

But also, the last screenshot you posted shows 12 allocations, not 11. Isn't 12 what you want?

EugenKon (Author) commented Feb 10, 2025

@tgross
Yes, I manually killed some service jobs to allow the system job to start. After that I got 12 allocations running as expected. But I used that screenshot to illustrate the expected UI.

Jobs almost always have implicit constraints.

In this particular case the service job took all of the memory, so the system job had no room to run.

This job does not have any explicit constraints, so I assume it should run on all available clients. If it does not, I expect to see an error message.

Imagine the situation: we have 10 Nomad clients and 5 jobs. We run them, and all of them fail because of explicit or implicit constraints (like the wrong node pool or missing CNI plugins you mentioned). And then the Nomad UI shows nothing, as in my example.

This feels weird: we have a lot of clients, we have jobs, but we see no error message.

Here I expect to see an error that one system job failed and one node exhausted its memory.

variable "NOMAD_TOKEN" {
  type = string
}

variable "NOMAD_ADDR" {
  type = string
}

job "autoscaler" {
  type        = "system"

  group "autoscaler" {
    count = 1

    network {
      port "http" {
        to = 8080
      }
    }

    task "autoscaler" {
      driver = "docker"

      config {
        image   = "hashicorp/nomad-autoscaler:0.4.5"
        command = "nomad-autoscaler"
        args = [
          "agent",
          "-config",
          "${NOMAD_TASK_DIR}/config.hcl",
          "-http-bind-address",
          "0.0.0.0",
          "-policy-dir",
          "xxxxxx"
        ]
        ports = ["http"]
     }

      env {
         NOMAD_TOKEN = "${var.NOMAD_TOKEN}"
      }

      service {
        name = "autoscaler"
        port = "http"
      }
    }
  }
}
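Since the node was simply out of memory, one hedged option (a sketch, not the reporting feature being requested here): Nomad enables preemption for system jobs by default, so giving the system job explicit resources and a priority above the default of 50 lets the scheduler evict lower-priority service allocations to make room. The values below are illustrative:

```hcl
job "autoscaler" {
  type     = "system"
  priority = 80  # above the default 50; lower-priority allocs may be preempted

  group "autoscaler" {
    task "autoscaler" {
      driver = "docker"

      # Explicit resources keep the scheduler's memory accounting predictable
      resources {
        cpu    = 200  # MHz
        memory = 256  # MB
      }
    }
  }
}
```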

tgross (Member) commented Feb 10, 2025

In this particular case service job took whole memory thus system job has no room to run.

So in other words, there was nothing wrong with the decisions the scheduler made; it was hitting the behavior I've already described to you in #25038 and #25061. You've already been told how to solve this problem, and even setting that aside, you haven't provided the specific pieces of information I asked for, so there's nothing left to do here.

EugenKon (Author) commented

Thanks a lot for the new commands. I'll try to use them instead of screenshots where possible. Unfortunately, I cannot provide the output: we are in active development and have already moved to a different configuration, so that output would be useless here =(.
