
Allocation Metrics no longer emitted #24339

Open
dosera opened this issue Oct 31, 2024 · 23 comments
Comments

dosera commented Oct 31, 2024

Nomad version

Nomad v1.9.1
BuildDate 2024-10-21T09:00:50Z
Revision d9ec23f0c1035401e9df6c64d6ffb8bffc555a5e

Operating system and Environment details

AlmaLinux release 9.4 (Seafoam Ocelot)

/etc/nomad/base.hcl:

telemetry {
..
  prometheus_metrics = "true"
  publish_allocation_metrics = "true"
  publish_node_metrics = "true"

..
}

Issue

After upgrading Nomad from 1.8.4 to 1.9.1, allocation metrics like nomad_client_allocs_memory_usage no longer appear to be emitted properly (or only for a few seconds at a time; I am using Prometheus to scrape them).

Reproduction steps

  1. Enable telemetry
  2. Start any allocation
  3. Collect metrics

Expected Result

Allocation metrics are available.

Actual Result

Allocation metrics are available only sporadically (at best). See the attached screenshot; Nomad was upgraded on Oct 24th.

This is how it looks in Prometheus since then:
[screenshot: nomad no alloc metrics]


Himura2la commented Nov 1, 2024

We have the same issue with allocation metrics after updating from 1.8.2 to 1.9.1:
[screenshot]
Query: avg_over_time(nomad_client_allocs_memory_usage[10m]) / avg_over_time(nomad_client_allocs_memory_allocated[10m])

They are returned irregularly, and often values are zero.

our configs:

telemetry {
  collection_interval = "30s"
  disable_hostname = true
  prometheus_metrics = true
  publish_allocation_metrics = true
  publish_node_metrics = true
}
global:
  scrape_interval:     1m
  scrape_timeout:     20s
  evaluation_interval: 1m

scrape_configs:
  - job_name: nomad
    static_configs:
      - targets:
        # ...
    scheme: https
    metrics_path: /v1/metrics
    params:
      format: [prometheus]
    tls_config:
      insecure_skip_verify: true
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
      # ...

Operating system: Debian 12, amd64

jrasell (Member) commented Nov 4, 2024

Hi @Himura2la and @dosera, and thanks for raising this issue. I have been unable to reproduce this on macOS or Ubuntu 24.04 using the steps below. Could you please provide a minimal job specification that helps reproduce this, along with any other relevant information?

Example agent config:

telemetry {
  prometheus_metrics         = "true"
  publish_allocation_metrics = "true"
  publish_node_metrics       = "true"
}

Example job:

job "example" {

  group "cache" {
    network {
      mode = "bridge"
      port "db" {
        to = 6379
      }
    }

    task "redis" {
      driver = "docker"

      config {
        image          = "redis:7"
        ports          = ["db"]
        auth_soft_fail = true
      }

      identity {
        env  = true
        file = true
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}

Example test command:

while true; do
  curl -s localhost:4646/v1/metrics | jq '.Gauges | .[] | select(.Name | contains("nomad.client.allocs.memory.usage")) | .Value'
  curl -s 'localhost:4646/v1/metrics?format=prometheus' | grep 'nomad.client.allocs.memory.usage{' | awk '{print $2}'
  sleep 2
done


Himura2la commented Nov 4, 2024

I'm afraid your validation method is not relevant to the issue, because the problem is best visible over a long-term run.
I think you can simply follow this tutorial https://developer.hashicorp.com/nomad/tutorials/manage-clusters/prometheus-metrics and see how it goes with 5-10 jobs that consume something for several hours.


jedd commented Nov 5, 2024

I'm seeing the same thing with 4 boxes I upgraded from 1.9.0 to 1.9.1:
2024-10-18 - upgraded from 1.8.4 to 1.9.0 - no issues observed at this time
2024-10-25 - upgraded from 1.9.0 to 1.9.1 - from this point I am seeing a lot of gaps in the data - image attached

I'm using Debian testing on all four machines.

My Prometheus scrapes a telegraf (Prometheus export mode) on the same machines, and I'm seeing no drops in that data.

Scrape interval for both is set to 1m. Boxes had no other packages upgraded at that time.

More importantly, I'm not seeing CPU or memory data in the Nomad GUI in either Task or Allocation view. I will occasionally get a red line indicating what I presume is the 'current' X MiB / total memory used, but no history at all.
The CPU graph consistently shows 0 MHz (and no history either, of course).

[screenshot]

jrasell (Member) commented Nov 6, 2024

Hi @jedd, @Himura2la and @dosera. I ran two longer tests using Debian 12 and Ubuntu 24.04 and was still unable to reproduce the problem. Each test ran for several hours, included several jobs, and used Prometheus for metric collection. I'll include the two example graphs below.

I'll mark this issue for further investigation and add it to our backlog. If anyone is able to provide a minimal reproduction that would be greatly appreciated.

[screenshots: 2024-11-06 at 07:33:46, 2024-11-05 at 09:03:06]


jedd commented Nov 6, 2024

Thanks for diving into that, @jrasell -- I'll see if I can distil a Prometheus scrape config, but you've probably already matched it: it's just an interval of 1m, and 4 hosts on :4646 (no TLS or auth). As per my graphs, some data is being made available periodically, so it's not an auth issue.

Can you tell me what you saw on the task / allocation cpu & memory graphs?

Opening these up and letting them sit for more than a minute, I see no change - usually I'd expect to see the graph start to populate and scroll within a few seconds.

Also odd are the times shown on the X axis - it's 23:17 local here, but in one of these I see 00:00:00 for ALL x points, and in the other 08:44:00 to 08:48:00. These are taken on the same host (my 'master'), but the screenshots are from a job running on this host and another job running on one of my client-only Nomad hosts.

[screenshots]


dmclf commented Nov 7, 2024

I also seem to be suffering from this issue.

Development cluster consisting of 5 Ubuntu 22.04 machines: 3 servers, 2 clients.
Servers upgraded from 1.9.0-1 to 1.9.1-1 on October 30, 02:00 UTC.
Clients downgraded from 1.9.0-1 to 1.8.3-1 on Oct 20, upgraded to 1.9.1-1 on Oct 30.

nomad version

Nomad v1.9.1
BuildDate 2024-10-21T09:00:50Z
Revision d9ec23f0c1035401e9df6c64d6ffb8bffc555a5e

nomad config

telemetry {
  disable_hostname = true
  prometheus_metrics = true
  publish_allocation_metrics = true
  publish_node_metrics = true
}

tls {
  http = true
  rpc  = true
[..]
vault {
  enabled               = true
[..]
consul {
  allow_unauthenticated = true
  ssl                 = true
}

  • prometheus 2.53.1
  • docker 27.2.1
  • linux kernel 5.15.0-121-generic

Distorted picture for several Nomad allocation metrics

  • nomad_client_allocs_memory_usage
    [screenshot]

  • nomad_client_allocs_memory_allocated
    [screenshot]

  • nomad_client_allocs_cpu_allocated
    [screenshot]

  • nomad_client_allocs_cpu_user
    [screenshot]

Non-Nomad stats, i.e. cAdvisor / node-exporter, are clearly more consistent

(this also suggests the scraping process itself is mostly stable)

  • cadvisor - machine_memory_bytes
    [screenshot]

  • node-exporter - node_memory_Active_bytes
    [screenshot]

nomad-metrics process_virtual_memory_bytes / nomad_client_host_memory_free

Other metrics provided by the Nomad metrics endpoint also seem consistent:

  • process_virtual_memory_bytes
    [screenshot]

  • nomad_client_host_memory_free
    [screenshot]

So something seems odd with the per-task statistics.
Also, when graphing the count of occurrences of nomad_client_allocs_memory_usage for a specific alloc_id (to also show that it is not being restarted/stopped), one clearly sees decreases.

  • avg(count_over_time(nomad_client_allocs_memory_usage{job="nomad_metrics",task_group="demo",alloc_id="30265304-3d02-1b82-1023-de6f68af7816"}[10m]))
    [screenshot]

curl testing

This can also easily be tracked by just doing some curl'ing, e.g.:

curl -k "https://${x}:4646/v1/metrics?format=prometheus" 2>/dev/null | grep ^nomad_client_allocs_memory_usage

Sometimes there is no reply, sometimes some replies, but the full set is never consistently included.
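
To make the gaps easier to see over time, here is a rough sketch of a sampling loop (not from the original report; it reuses the ${x} client hostname variable from the curl above and assumes the HTTPS API on port 4646):

# Count the per-alloc memory metric lines every 10 seconds and log the result,
# so drops to zero or partial counts become visible over a longer run.
while true; do
  count=$(curl -ks "https://${x}:4646/v1/metrics?format=prometheus" | grep -c '^nomad_client_allocs_memory_usage')
  echo "$(date -u '+%Y-%m-%dT%H:%M:%SZ') nomad_client_allocs_memory_usage lines: ${count}"
  sleep 10
done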

curl comparison 1.9.1 vs 1.8.3

Doing checks like these, there are also clear inconsistencies between my dev cluster on 1.9.1
and the staging/prod clusters on 1.8.3:

export host=nomad-development-client-1
curl -k "https://${host}:4646/v1/metrics?format=prometheus" 2>/dev/null|grep -e ^nomad_client_tasks_running
echo "wc -l @ nomad_client_allocs_memory_usage" $(curl -k "https://${host}:4646/v1/metrics?format=prometheus" 2>/dev/null | grep -e "^nomad_client_allocs_memory_usage.*task=" | wc -l)"
  • development 1.9.1 (random client)
nomad_client_tasks_running{datacenter="nomad24339",node_class="none",node_id="20591b4f-586c-cf24-0706-b17640b6ea1b",node_pool="default",node_scheduling_eligibility="eligible",node_status="ready"} 23
wc -l @ nomad_client_allocs_memory_usage 2

= 23 vs 2 = inconsistent ❌

  • staging 1.8.3 (random client)
nomad_client_tasks_running{datacenter="nomad24339",node_class="none",node_id="64102b51-0ebc-893a-3c2d-08e538bb4a66",node_pool="default",node_scheduling_eligibility="eligible",node_status="ready"} 331
wc -l @ nomad_client_allocs_memory_usage 331

= 331 vs 331 = consistent ✅

  • production 1.8.3 (random client)
nomad_client_tasks_running{datacenter="nomad24339",node_class="none",node_id="95d3df8d-a82c-0357-60aa-fd6cb952fb2b",node_pool="default",node_scheduling_eligibility="eligible",node_status="ready"} 240
wc -l @ nomad_client_allocs_memory_usage 240

= 240 vs 240 = consistent ✅

dmclf added a commit to dmclf/nomad-debug that referenced this issue Nov 8, 2024
- ugly
- hopefully can help with hashicorp/nomad#24339

dmclf commented Nov 8, 2024

> Hi @jedd, @Himura2la and @dosera. I ran two longer tests using Debian 12 and Ubuntu 24.04 and was still unable to reproduce the problem. Each test ran for several hours, included several jobs, and used Prometheus for metric collection. I'll include the two example graphs below.
>
> I'll mark this issue for further investigation and add it to our backlog. If anyone is able to provide a minimal reproduction that would be greatly appreciated.

@jrasell, in a quick-and-dirty attempt to mimic what I'm seeing, I put something together:
https://github.com/dmclf/nomad-debug/tree/main/gh-24339

It's not exactly the most elegant, but hopefully it can help somehow on your end.
(I'm also not sure it is guaranteed to work 100% or whether it needs a rerun every now and then; it is very much quick, dirty and rough.)


jedd commented Nov 9, 2024

I have basically zero Go, and because it's an intermittent problem I'm not 100% sure my git bisect activities have narrowed things down, but I'm cautiously confident the problem first manifested with:

commit 6c3f222

Looking at git show 6c3f222fe154a66369e84789828f4a8d36a20c7b -- drivers/docker/stats.go

there's a comment about a move to the streaming stats API, and a bunch of changes to the collection mechanism.

I don't know how fragile that stream may be under various circumstances, and I'm really out of my depth here trying to guess whether this may be the root cause of an intermittent problem like we're seeing. But perhaps @dmclf, if you're able to follow the https://developer.hashicorp.com/nomad/docs/install#compiling-from-source process on one of your clients, you could validate that this commit is where things go pear-shaped on your system too?

I noted - but I think it's entirely coincidental - that the problem seemed to occur shortly after journalctl reported client.gc operation was skipped:

client.gc: garbage collection skipped because no terminal allocations: reason="disk usage of 97 is over gc threshold of 80"

(I'm pretty sure this is entirely coincidental, but it was the only log entry emitted by Nomad.)

I did the bisect manually, given the intermittent nature of the problem, and waited for 10 minutes with some regular curl'ing to see if the problem manifested - so I caught definite failures, but may have accidentally given a 'good' to some commits that just didn't fail fast enough.
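
For reference, a hedged sketch of how such a manual bisect can be driven (these are not jedd's exact commands; the good/bad tags follow the upgrade timeline above):

# Mark v1.9.0 as good and v1.9.1 as bad, then build and observe each step
# for ~10 minutes before marking it, since the failure is intermittent.
git clone https://github.com/hashicorp/nomad.git && cd nomad
git bisect start
git bisect bad v1.9.1    # allocation metrics drop out on this release
git bisect good v1.9.0   # metrics were still emitted reliably here
# at each commit bisect suggests:
#   make bootstrap && make dev
#   run bin/nomad as the client, curl /v1/metrics regularly for ~10 minutes,
#   then mark the commit with `git bisect good` or `git bisect bad`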


dmclf commented Nov 9, 2024

@jedd right... OK, I suppose I can try that (I'm not much of a Go person nor a developer either).

So if you suspect 6c3f222fe154a66369e84789828f4a8d36a20c7b,

you are asking me to build release/1.9.1 from a checkout of
e5e26c8eb34c40263d1557f7e6f948c92fc57ca8 (the one before)

and see if that works better?


dmclf commented Nov 9, 2024

@jedd to confirm, from my end at least.

git clone -b release/1.9.1 https://github.com/hashicorp/nomad.git && cd nomad
git checkout e5e26c8eb34c40263d1557f7e6f948c92fc57ca8 && make bootstrap && make dev && cp bin/nomad /usr/local/bin/

-> works fine ✅

git clone -b release/1.9.1 https://github.com/hashicorp/nomad.git && cd nomad
git checkout 6c3f222fe154a66369e84789828f4a8d36a20c7b && make bootstrap && make dev && cp bin/nomad /usr/local/bin/

-> gives this ticket's odd behaviour ❌

dmclf added a commit to dmclf/nomad-debug that referenced this issue Nov 10, 2024

dmclf commented Nov 11, 2024

@jedd for completeness:

I updated one of the two clients on the dev cluster too, to check behaviour there with the two versions.

Results:
[screenshot]

So initially the graph starts with

Nomad v1.9.1
BuildDate 2024-10-21T09:00:50Z
Revision d9ec23f0c1035401e9df6c64d6ffb8bffc555a5e

then after updating to

Nomad v1.9.1-dev
BuildDate 2024-10-17T09:13:46Z
Revision e5e26c8eb34c40263d1557f7e6f948c92fc57ca8

and then lastly changed to

# nomad version
Nomad v1.9.1-dev
BuildDate 2024-10-17T12:47:17Z
Revision 6c3f222fe154a66369e84789828f4a8d36a20c7b


dmclf commented Nov 11, 2024

Also confirming the issue still exists with the most recent v1.9.2:

Nomad v1.9.2
BuildDate 2024-11-08T08:58:16Z
Revision de8a30ba86a2ed8dfb7188da10c250056d165fea


jedd commented Nov 11, 2024

As @jrasell wasn't able to reproduce this in their lab - perhaps some other confounding factors are at play?

The last reference I could find to a non-trivial Docker stats API change was 2015, with the introduction of one-shot metrics rather than just streaming (which is the opposite direction to the change in that commit above).

FWIW these are the two systems where I'm seeing this problem present:

Debian bullseye / linux 5.10.0 amd64 / Xeon / docker 20.10.5 / containerd 1.4.13

Debian bookworm / linux 6.1.0 amd64 / Intel i5 / docker 20.10.24 / containerd 1.6.20


dmclf commented Nov 11, 2024

@jedd

Would you be willing to share the details of the systems used in the lab that were unable to reproduce the issue?
(OS/kernel/arch/docker/containerd versions / number of tasks running?)

From my end the problem is present on the following (❌ in the pass column indicates the problem is present):

| pass | OS | Version | Kernel / Arch | CPU | Dockerd | Containerd | machine | user |
|------|----|---------|---------------|-----|---------|------------|---------|------|
| ❌ | Alpine (container) | 3.20.3 | 6.8.0-45 amd64 | Intel i5 | 27.2.1 (host) | 1.7.22 (host) | vmware | dmclf |
| ❌ | Ubuntu (container) | 24.04 | 6.8.0-45 amd64 | Intel i5 | 27.2.1 (host) | 1.7.22 (host) | vmware | dmclf |
| ❌ | Ubuntu | 22.04 | 5.15.0-121 amd64 | Xeon | 27.2.1 | 1.7.22 | esxi | dmclf |
| ❌ | Ubuntu | 22.04 | 5.15.0-121 arm64 | Graviton2 | 27.2.1 | 1.7.22 | ec2 | dmclf |
| ❌ | RHEL | 8.5 (Ootpa) | 4.18.0-348 x86_64 | Xeon | 20.10.12 | 1.4.12 | physical | dmclf |
| ❌ | MacOS | 15.1 (Sequoia) host / Ubuntu 24.04 container | 24.1.0 arm64 host / 6.10.11-linuxkit container | Apple M1 Pro | 27.3.1 (MacOS host) | - | physical | dmclf |
| ❌ | Debian | bullseye | 5.10.0 amd64 | Xeon | 20.10.5 | 1.4.13 | ? | jedd |
| ❌ | Debian | bookworm | 6.1.0 amd64 | Intel i5 | 20.10.24 | 1.6.20 | ? | jedd |
| ❌ | AlmaLinux | 9.4 (Seafoam Ocelot) | ? | ? | ? | ? | ? | dosera |
| ❌ | Debian | 12 | ? amd64 | ? | ? | ? | ? | Himura2la |
| ✅ | macOS | ? | ? | ? | ? | ? | ? | jrasell |
| ✅ | Ubuntu | 24.04 | ? | ? | ? | ? | ? | jrasell |
| ✅ | Debian | 12 | ? | ? | ? | ? | ? | jrasell |
| ✅ | Ubuntu (nomad agent -dev) | 24.04 | 6.8.0-45 amd64 | Intel i5 | 27.2.1 (host) | 1.7.22 (host) | vmware | dmclf |


jedd commented Nov 11, 2024

@dmclf - so far I've not had any systems present metrics data with Nomad 1.9.1, but I'm not running a wide range of environments, as noted above.
We have a RHEL platform at work, and this has paused plans to upgrade it - I'll look at a throwaway dev environment there to see whether the problem presents.


dmclf commented Nov 11, 2024

@jedd OK, and the details of the system that @jrasell used to try (and fail) to reproduce this in the lab?

Perhaps an easy sample job @jrasell can try is below
(or increase the count to 50 if things work fine with 20).

It should be easy to see within 2-3 minutes whether stats are being returned properly after the containers are reported 'healthy' by Nomad.

I can potentially extend the scope to RHEL and ARM64 too, to see if those are immune, but it would be nice to get a sense of what kind of system @jrasell uses in that case.

job "example-job" {
  group "cache" {
    count = 20
    task "redis" {
      driver = "docker"
      config {
        image          = "redis:7"
        auth_soft_fail = true
      }

      identity {
        env = true
        file = true
      }

      resources {
        cpu    = 100
        memory = 64
      }
    }
  }
}
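
One way to exercise this job and watch the metric count (a sketch, assuming the job above is saved as example-job.nomad.hcl and the client API is plain HTTP on localhost:4646):

# With count = 20, a healthy client should expose roughly 20
# nomad_client_allocs_memory_usage lines on every scrape.
nomad job run example-job.nomad.hcl
watch -n 5 "curl -s 'localhost:4646/v1/metrics?format=prometheus' | grep -c '^nomad_client_allocs_memory_usage'"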


dmclf commented Nov 11, 2024

@jedd I updated the table in #24339 (comment)
with a just-performed RHEL test, plus ARM/Graviton2 and macOS/M1.

Tested with the official 1.9.1 binary (Revision d9ec23f):
all start failing to transmit reliable stats under a minute after starting the Nomad daemon.


dmclf commented Nov 12, 2024

As an extra observation: nomad agent -dev is not a valid method to troubleshoot this issue ⚠️
(I just tried, and nomad agent -dev seems to work fine and does not show this ticket's issue, but that is not how people normally run Nomad.)

You will need a minimal 'normal' cluster for this, like 3 servers + 1 client,
which has been the case in all my failed tests.

The server version does not seem to matter, as it happens with servers on 1.8.3 and 1.9.1, whilst a client on 1.9.1 is so far always faulty.


jedd commented Nov 12, 2024

Oh, that's interesting.

FWIW I did my bisect analysis on my workstation which is a single-node (server/client) system, and it was definitely seeing the problem manifest.

(I'll have to update later on whether that machine was in dev mode at the time.)


dmclf commented Nov 12, 2024

@jedd OK, so a single server with bootstrap_expect=1 and a single client may also work?

-- Update: confirming on my end that a single server + single client (but not in dev mode) also shows the same symptoms.
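
For anyone trying to reproduce this outside dev mode, a sketch of the minimal topology described above (file names and the server address are placeholders, not from the original report; other required settings such as data_dir are omitted):

# server.hcl - agent 1
server {
  enabled          = true
  bootstrap_expect = 1
}

# client.hcl - agent 2, pointed at the server
client {
  enabled = true
  servers = ["<server-address>:4647"]
}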

jrasell (Member) commented Nov 12, 2024

Hi all, just to circle back here and let you know that this issue has been prioritised internally and is waiting for an engineer to become available and pick it up. Once they do, we will use the comments and details to try to get a reproduction and a fix.


valeriansaliou commented Dec 6, 2024

I am also having the same issue here, Nomad latest in a single server/client setup.

EDIT: this issue seems to have been fixed, it is just not yet part of a Nomad release -> #24525

It works for a few minutes/hours and then stops reporting metrics for some allocations, if not all. Some allocations still have metrics, but they may stop working after some time. These are all using the Docker driver.

When looking at DEBUG logs in the Monitor section of the Nomad client in the Nomad web UI, I am seeing a flood of these log lines, which start occurring when the metrics are no longer reported:

2024-12-06T23:19:26.799+0100 [DEBUG] client.driver_mgr.docker: error decoding stats data from container: container_id=323eaaabe9155e561b6f7f685ee6050b6c274a0e9dd285740d91a8d86903ad1b driver=docker error="json: cannot unmarshal string into Go value of type container.Stats"
2024-12-06T23:19:26.800+0100 [DEBUG] client.driver_mgr.docker: error decoding stats data from container: container_id=7c19e49d5a185498f773e701140e0e73c7526bd84249cb463ccaba1a21f1712d driver=docker error="invalid character ',' looking for beginning of value"
2024-12-06T23:19:26.851+0100 [DEBUG] client.driver_mgr.docker: error decoding stats data from container: container_id=18ca409554162b014ded4fba5d74ccba0da2f4ec9af08c850684a3c93e375374 driver=docker error="invalid character 'm' looking for beginning of value"
2024-12-06T23:19:26.889+0100 [DEBUG] client.driver_mgr.docker: error decoding stats data from container: container_id=87825133c0788d7547df887418d4a62f8c53d0afa9eee24c5a0d6f0dcb434e73 driver=docker error="invalid character 'a' looking for beginning of value"

If the Nomad server is configured to report telemetry every 5s (with the telemetry configuration block in nomad.hcl), those logs can be seen immediately after restarting Nomad. With the default setting of 1s, they don't appear immediately. That may help with finding the cause of the issue.
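
For reference, a sketch of the telemetry block being described (the 5s collection interval is the value from this comment, not a recommendation):

telemetry {
  collection_interval        = "5s"
  prometheus_metrics         = true
  publish_allocation_metrics = true
  publish_node_metrics       = true
}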
