
Backport of scheduler: take all assigned cpu cores into account instead of only those part of the largest lifecycle into release/1.9.x #24530

Conversation

hc-github-team-nomad-core (Contributor)

Backport

This PR is auto-generated from #24304 to be assessed for backporting due to the inclusion of the label backport/1.9.x.

The below text is copied from the body of the original PR.


In our production environment, where we run Nomad v1.8.2, we noticed overlapping cpusets and the Nomad reserve/share slices being out of sync. Specifically, this happened with the setup shown below, where various prestart and poststart tasks run alongside the main task of the group.
[screenshot of the setup omitted]

I managed to reproduce it with the below job spec on the latest main (v1.9.1) in my sandbox environment:

job "redis-job-{{SOME_SED_MAGIC}}" {
  type = "service"
  group "cache" {
    count = 1
    task "redis" {
      driver = "docker"
      config {
        image = "redis:3.2"
      }
      resources {
        cores = 4
      }
    }

    task "redis-start-side" {
      lifecycle {
        hook    = "poststart"
        sidecar = true
      }
      driver = "docker"
      config {
        image = "redis:3.2"
      }
      resources {
        cores = 4
      }
    }
  }
}

Spinning up two jobs with this spec resulted in the following overlap:

[sandbox@nomad-dev nomad]$ docker ps --format '{{.ID}}' | xargs -I {} bash -c 'grep -H . /sys/fs/cgroup/cpuset/docker/{}*/cpuset.effective_cpus' | column -s: -t | sort -n -k2
/sys/fs/cgroup/cpuset/docker/ec9220fbe2d0/cpuset.effective_cpus  0-3
/sys/fs/cgroup/cpuset/docker/6e06a9ed1631/cpuset.effective_cpus  4-7
/sys/fs/cgroup/cpuset/docker/a52a46cfa489/cpuset.effective_cpus  4-7
/sys/fs/cgroup/cpuset/docker/c9049b1b3f2c/cpuset.effective_cpus  8-11

Full output

[sandbox@nomad-dev nomad]$ docker ps
CONTAINER ID   IMAGE       COMMAND                  CREATED          STATUS          PORTS      NAMES
a52a46cfa489   redis:3.2   "docker-entrypoint.s…"   19 seconds ago   Up 18 seconds   6379/tcp   redis-start-side-4d6d1f92-fab2-f2bb-ca79-1f56ad3772c0
ec9220fbe2d0   redis:3.2   "docker-entrypoint.s…"   19 seconds ago   Up 18 seconds   6379/tcp   redis-4d6d1f92-fab2-f2bb-ca79-1f56ad3772c0

[sandbox@nomad-dev nomad]$ grep -H . /sys/fs/cgroup/cpuset/nomad/{reserve,share}/cpuset.effective_cpus
/sys/fs/cgroup/cpuset/nomad/reserve/cpuset.effective_cpus:0-7
/sys/fs/cgroup/cpuset/nomad/share/cpuset.effective_cpus:8-123

[sandbox@nomad-dev nomad]$ docker ps --format '{{.ID}}' | xargs -I {} bash -c 'grep -H . /sys/fs/cgroup/cpuset/docker/{}*/cpuset.effective_cpus' | column -s: -t | sort -n -k2
/sys/fs/cgroup/cpuset/docker/ec9220fbe2d0edef8bd9f67cabd7da226f32d346f65d196463bc4d6701864213/cpuset.effective_cpus  0-3
/sys/fs/cgroup/cpuset/docker/a52a46cfa489fe815fcbd11019c391d7fe771b878f77ddb3c993ab5cd98d8084/cpuset.effective_cpus  4-7

[sandbox@nomad-dev nomad]$ docker ps --format '{{.ID}}' | xargs docker inspect | egrep '(CpusetCpus|NOMAD_CPU_LIMIT|Id)'
        "Id": "a52a46cfa489fe815fcbd11019c391d7fe771b878f77ddb3c993ab5cd98d8084",
            "CpusetCpus": "4,5,6,7",
                "NOMAD_CPU_LIMIT=8980",
        "Id": "ec9220fbe2d0edef8bd9f67cabd7da226f32d346f65d196463bc4d6701864213",
            "CpusetCpus": "0,1,2,3",
                "NOMAD_CPU_LIMIT=8980",
[sandbox@nomad-dev nomad]$ docker ps
CONTAINER ID   IMAGE       COMMAND                  CREATED          STATUS          PORTS      NAMES
c9049b1b3f2c   redis:3.2   "docker-entrypoint.s…"   16 seconds ago   Up 15 seconds   6379/tcp   redis-start-side-50ef4e44-0e41-b273-7915-bfd0c2fc2ec2
6e06a9ed1631   redis:3.2   "docker-entrypoint.s…"   16 seconds ago   Up 16 seconds   6379/tcp   redis-50ef4e44-0e41-b273-7915-bfd0c2fc2ec2
a52a46cfa489   redis:3.2   "docker-entrypoint.s…"   3 minutes ago    Up 3 minutes    6379/tcp   redis-start-side-4d6d1f92-fab2-f2bb-ca79-1f56ad3772c0
ec9220fbe2d0   redis:3.2   "docker-entrypoint.s…"   3 minutes ago    Up 3 minutes    6379/tcp   redis-4d6d1f92-fab2-f2bb-ca79-1f56ad3772c0

[sandbox@nomad-dev nomad]$ grep -H . /sys/fs/cgroup/cpuset/nomad/{reserve,share}/cpuset.effective_cpus
/sys/fs/cgroup/cpuset/nomad/reserve/cpuset.effective_cpus:0-11
/sys/fs/cgroup/cpuset/nomad/share/cpuset.effective_cpus:12-123

[sandbox@nomad-dev nomad]$ docker ps --format '{{.ID}}' | xargs -I {} bash -c 'grep -H . /sys/fs/cgroup/cpuset/docker/{}*/cpuset.effective_cpus' | column -s: -t | sort -n -k2
/sys/fs/cgroup/cpuset/docker/ec9220fbe2d0edef8bd9f67cabd7da226f32d346f65d196463bc4d6701864213/cpuset.effective_cpus  0-3
/sys/fs/cgroup/cpuset/docker/6e06a9ed1631758827aa4136690818d04c050c55559fb9f74b780b6ff8d33728/cpuset.effective_cpus  4-7
/sys/fs/cgroup/cpuset/docker/a52a46cfa489fe815fcbd11019c391d7fe771b878f77ddb3c993ab5cd98d8084/cpuset.effective_cpus  4-7
/sys/fs/cgroup/cpuset/docker/c9049b1b3f2c2bbfebc6ec8e2f3aa280a9ab23b86322452a54575b1cba3ae179/cpuset.effective_cpus  8-11

[sandbox@nomad-dev nomad]$ docker ps --format '{{.ID}}' | xargs docker inspect | egrep '(CpusetCpus|NOMAD_CPU_LIMIT|Id)'
        "Id": "c9049b1b3f2c2bbfebc6ec8e2f3aa280a9ab23b86322452a54575b1cba3ae179",
            "CpusetCpus": "8,9,10,11",
                "NOMAD_CPU_LIMIT=8980",
        "Id": "6e06a9ed1631758827aa4136690818d04c050c55559fb9f74b780b6ff8d33728",
            "CpusetCpus": "4,5,6,7",
                "NOMAD_CPU_LIMIT=8980",
        "Id": "a52a46cfa489fe815fcbd11019c391d7fe771b878f77ddb3c993ab5cd98d8084",
            "CpusetCpus": "4,5,6,7",
                "NOMAD_CPU_LIMIT=8980",
        "Id": "ec9220fbe2d0edef8bd9f67cabd7da226f32d346f65d196463bc4d6701864213",
            "CpusetCpus": "0,1,2,3",
                "NOMAD_CPU_LIMIT=8980",
Fixes a bug in the BinPackIterator.Next method, where the scheduler would only take into account the cpusets of the tasks in the largest lifecycle. This could result in overlapping cgroup cpusets. By using Allocation.ReservedCores, the scheduler uses the same cpuset view as Partition.Reserve. Logging was added so that future regressions can be spotted without manual inspection of cgroup files.
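
To illustrate the idea behind the fix, here is a minimal Go sketch using simplified stand-in types (alloc, taskCores and reservedCores are illustrative names, not Nomad's actual structs): the set of cores already reserved on the node is built from every task of every existing allocation, rather than from the tasks of a single lifecycle phase.

package main

import "fmt"

// taskCores is a simplified stand-in for a task's reserved CPU cores
// (cf. the reserved-cores field of a task's allocated CPU resources).
type taskCores struct {
	name  string
	cores []uint16
}

// alloc is a simplified stand-in for an existing allocation on the node.
type alloc struct {
	tasks []taskCores
}

// reservedCores unions the cores of every task in every allocation, i.e. the
// same cpuset view that Partition.Reserve ends up with. The pre-fix behaviour
// effectively looked only at the tasks of the largest lifecycle, so the
// poststart sidecar's cores (4-7 below) appeared free to the next placement.
func reservedCores(allocs []alloc) map[uint16]bool {
	reserved := map[uint16]bool{}
	for _, a := range allocs {
		for _, t := range a.tasks {
			for _, c := range t.cores {
				reserved[c] = true
			}
		}
	}
	return reserved
}

func main() {
	existing := []alloc{{tasks: []taskCores{
		{name: "redis", cores: []uint16{0, 1, 2, 3}},
		{name: "redis-start-side", cores: []uint16{4, 5, 6, 7}},
	}}}
	fmt.Println(len(reservedCores(existing)), "cores already reserved") // 8, so the next placement starts at core 8
}

Applied to the repro above, both the main task's cores 0-3 and the poststart sidecar's cores 4-7 count as reserved, so a second copy of the job lands on cores 8-11 instead of overlapping 4-7.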

Overview of commits

@hc-github-team-nomad-core force-pushed the backport/mvegter-fix-missing-exclusion-of-cpusets/locally-giving-mosquito branch from 6bd6f54 to 17a7fc5 on November 21, 2024 18:22
@tgross merged commit e97b625 into release/1.9.x on November 21, 2024 (20 checks passed)
@tgross deleted the backport/mvegter-fix-missing-exclusion-of-cpusets/locally-giving-mosquito branch on November 21, 2024 18:56