
Nomad does not report properly about not-started system jobs #25058

Closed
EugenKon opened this issue Feb 7, 2025 · 7 comments

Comments

EugenKon commented Feb 7, 2025

Nomad version

1.8.2

Operating system and Environment details

Ubuntu 24.04

Issue

[screenshot]
But we have 12 clients:
[screenshot]
If I SSH to that EC2 instance and run `docker ps`, I can see that the autoscaler is not among the running jobs.

Reproduction steps

1. Deploy the cluster.
2. Create a new EC2 instance in the cluster.

Expected Result

Nomad should report that some EC2 instances are without system jobs.

Actual Result

No issues are reported by Nomad.

Job file (if appropriate)

variable "NOMAD_TOKEN" {
  type = string
}

variable "NOMAD_ADDR" {
  type = string
}

job "autoscaler" {
  region      = "xxxx"
  type        = "system"

  group "autoscaler" {
    count = 1

    network {
      port "http" {
        to = 8080
      }
    }

    task "autoscaler" {
      driver = "docker"

      config {
        image   = "hashicorp/nomad-autoscaler:0.4.5"
        command = "nomad-autoscaler"
        args = [
          "agent",
          "-config",
          "${NOMAD_TASK_DIR}/config.hcl",
          "-http-bind-address",
          "0.0.0.0",
          "-policy-dir",
          "${NOMAD_TASK_DIR}/policies/"
        ]
        ports = ["http"]

        labels = {
          "com.datadoghq.ad.logs" = jsonencode([{
            source = "nomad"
            service = "nomad-autoscaler"
          }])
        }
      }

      env {
         NOMAD_TOKEN = "${var.NOMAD_TOKEN}"
      }

      service {
        name = "autoscaler"
        port = "http"
      }

      template {
        data = <<EOF
        nomad {
          address = "${var.NOMAD_ADDR}"
          region  = "xxx"
          skip_verify = true
        }

        apm "nomad" {
          driver = "nomad-apm"
        }

        strategy "target-value" {
          driver = "target-value"
        }
        EOF
        destination = "${NOMAD_TASK_DIR}/config.hcl"
      }

      template {
        data = <<EOF
        scaling "cluster_policy" {
          enabled = false // This will be enabled once we extend nomad to scale the cluster.
          min     = 1
          max     = 10

          policy {
            cooldown            = "5m"
            evaluation_interval = "3m"

            check "cpu_allocated_percentage" {
              source       = "nomad-apm"
              query        = "percentage-allocated_cpu"
              query_window = "5m"

              strategy "target-value" {
                target = 70
              }
            }

            check "memory_allocated_percentage" {
              source       = "nomad-apm"
              query        = "percentage-allocated_memory"
              query_window = "5m"

              strategy "target-value" {
                target = 70
              }
            }
          }
        }
        EOF
        destination = "${NOMAD_TASK_DIR}/policies/hashistack.hcl"
      }
    }
  }
}

*It would be nice to have a `<details>` tag in the issue template.

tgross (Member) commented Feb 7, 2025

@EugenKon if certain clients don't match the constraints of a system job, then there's no way for Nomad to know you meant to have a given system job on that client (otherwise it would have placed it!). If the nodes match the constraints but there's not enough room to make a placement, then the evaluations list for that job will show a blocked eval.
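The blocked-eval check tgross describes can be done from the CLI. A hedged sketch, assuming the job is named `autoscaler` as in the job file below; `<eval-id>` is a placeholder and these commands need a reachable Nomad cluster:

```shell
# Placement failures, if any, show up in the job status output
nomad job status autoscaler

# List evaluations for the job; a blocked eval has status "blocked"
nomad eval list -job autoscaler

# Inspect one evaluation; "Placement Failures" explains why nodes were
# filtered or exhausted (replace <eval-id> with an ID from the list)
nomad eval status <eval-id>
```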

EugenKon (Author) commented Feb 7, 2025

@tgross Hi. But this job has no constraints. And if a job is not placed because of some constraint, Nomad usually reports that some clients were filtered.

EugenKon (Author) commented Feb 7, 2025

Based on this screenshot:
[screenshot]

I expect to see 11/12 here, because system jobs are documented as running on all instances:
[screenshot]
Our system jobs do not have any constraints.

Also, the job should be marked like this:
[screenshot]

tgross (Member) commented Feb 10, 2025

because system jobs are documented as running on all instances:

@EugenKon the very next half of that sentence says "that meets the constraints". I'm not sure why you're posting giant screenshots of text when it just says what I just told you.

But for this job there is no constraints.

Jobs almost always have implicit constraints. For example, maybe the 12th node doesn't have a healthy Docker driver. Maybe it's in the wrong node pool. Maybe it's missing required CNI plugins. Or maybe it's just full. I don't know, because you haven't provided any information about the node or the evals. Post the text (please not a screenshot; they're extremely hard for me to read) of `nomad node status -verbose :nodeid` for the relevant node, as well as `nomad eval status :evalid` for the evaluation that should have created the allocation. That's always the starting point when you want to know what's going on.

But also, the last screenshot you posted shows 12 allocations, not 11. Isn't 12 what you want?

EugenKon (Author) commented Feb 10, 2025

@tgross
Yes, I manually killed some service jobs to allow the system job to start. After that I got 12 allocations running as expected. But I used that screenshot to illustrate the expected UI.

Jobs almost always have implicit constraints.

In this particular case the service job took all of the memory, so the system job had no room to run.

This job does not have any explicit constraints, so I assume it should run on all available clients. If it does not, I expect to see an error message.

Imagine the situation: we have 10 Nomad clients and 5 jobs. We run them, and all of them fail because of explicit or implicit constraints (like the wrong node pool or missing CNI plugins you mentioned). And then the Nomad UI shows nothing, as in my example.

This feels weird: we have a lot of clients, we have jobs, but we see no error message.

Here I expect to see an error that one system job failed and one node exhausted its memory.

variable "NOMAD_TOKEN" {
  type = string
}

variable "NOMAD_ADDR" {
  type = string
}

job "autoscaler" {
  type        = "system"

  group "autoscaler" {
    count = 1

    network {
      port "http" {
        to = 8080
      }
    }

    task "autoscaler" {
      driver = "docker"

      config {
        image   = "hashicorp/nomad-autoscaler:0.4.5"
        command = "nomad-autoscaler"
        args = [
          "agent",
          "-config",
          "${NOMAD_TASK_DIR}/config.hcl",
          "-http-bind-address",
          "0.0.0.0",
          "-policy-dir",
          "xxxxxx"
        ]
        ports = ["http"]
     }

      env {
         NOMAD_TOKEN = "${var.NOMAD_TOKEN}"
      }

      service {
        name = "autoscaler"
        port = "http"
      }
    }
  }
}
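Since the node was simply out of memory, one hedged option (a sketch, not the reporting feature being requested here): Nomad enables preemption for system jobs by default, so giving the system job explicit resources and a priority above the default of 50 lets the scheduler evict lower-priority service allocations to make room. The values below are illustrative:

```hcl
job "autoscaler" {
  type     = "system"
  priority = 80  # above the default 50; lower-priority allocs may be preempted

  group "autoscaler" {
    task "autoscaler" {
      driver = "docker"

      # Explicit resources keep the scheduler's memory accounting predictable
      resources {
        cpu    = 200  # MHz
        memory = 256  # MB
      }
    }
  }
}
```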

tgross (Member) commented Feb 10, 2025

In this particular case service job took whole memory thus system job has no room to run.

So in other words, there was nothing wrong with the decisions the scheduler made; it was hitting the behavior I've already described to you in #25038 and #25061. You've already been told how to solve this problem, and even setting that aside, you haven't provided the specific pieces of information I asked for, so there's nothing left to do here.

EugenKon (Author) commented

Thanks a lot for the new commands. I'll try to use them instead of screenshots where possible. Unfortunately, I cannot provide the output: we are in active development and have already moved to a different configuration, so that output would be useless here =(.
