Allocation Metrics no longer emitted #24339
Hi @Himura2la and @dosera, and thanks for raising this issue. I have been unable to reproduce this on macOS or Ubuntu 24.04 using the steps below. Could you please provide a minimal job specification that would help reproduce this, along with any other relevant information? Example agent config:
Example job:
Example test command:
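(The config, job, and command snippets from this comment did not survive being copied into this thread. As a stand-in, a minimal setup along the following lines exercises the same path; the job name, image, and intervals are illustrative assumptions, not the exact files from the original comment.)

```hcl
# Illustrative agent telemetry config (assumed values, goes in the agent's nomad.hcl).
telemetry {
  collection_interval        = "1s"
  prometheus_metrics         = true
  publish_allocation_metrics = true
  publish_node_metrics       = true
}
```

```hcl
# Illustrative job (separate job file): any small Docker task produces allocation metrics.
job "metrics-probe" {
  datacenters = ["dc1"]

  group "probe" {
    task "redis" {
      driver = "docker"

      config {
        image = "redis:7"
      }

      resources {
        cpu    = 100
        memory = 64
      }
    }
  }
}

# Illustrative test (run repeatedly and watch for gaps):
#   curl -s "http://localhost:4646/v1/metrics?format=prometheus" | grep nomad_client_allocs_memory_usage
```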
I'm afraid your validation method is not relevant to the issue, because it's best visible on a long-term run.
I'm seeing the same thing with 4 boxes I upgraded to 1.9.1 from 1.9.0. I'm using Debian testing on all four machines. My Prometheus scrapes a telegraf (Prometheus export mode) on the same machines, and I'm seeing no drops in that data. The scrape interval for both is set to 1m. The boxes had no other packages upgraded at that time. More importantly, I'm not seeing CPU or memory data in the Nomad GUI in either the Task or Allocation view. I will occasionally get a red line indicating what I presume is 'current' X MiB / total memory used, but no history at all.
Hi @jedd, @Himura2la and @dosera. I ran two longer tests using Debian 12 and Ubuntu 24.04 and was still unable to reproduce the problem. Each test ran for several hours, included several jobs, and used Prometheus for metric collection. I'll include the two example graphs below. I'll mark this issue for further investigation and add it to our backlog. If anyone is able to provide a minimal reproduction, that would be greatly appreciated.
Thanks for diving into that, @jrasell -- I'll see if I can distil a Prometheus scrape option, but you've probably already matched that - it's just an interval of 1m, and 4 hosts on :4646 (no TLS, no auth). As per my graphs, some data is being made available periodically, so it's not an auth issue. Can you tell me what you saw on the task / allocation CPU & memory graphs? Opening these up and letting them sit for more than a minute, I see no change - usually I'd expect the graph to start populating and scrolling within a few seconds. Also odd are the times shown on the X axis - it's 23:17 local here, but in one of these I see 00:00:00 for ALL x points, and on the other 08:44:00 to 08:48:00. These were taken on the same host (my 'master'), but the screenshots are from a job running on this host and another running on one of my client-only Nomad hosts.
- ugly - hopefully can help with hashicorp/nomad#24339
@jrasell in a quick-n-dirty attempt to mimic what I'm seeing, I made up something that's not exactly the most elegant, but hopefully it can aid somehow on your end.
I have basically zero golang, and because it's an intermittent problem I'm not 100% sure that my git bisect activities have narrowed things down, but I'm cautiously confident the problem manifested with commit 6c3f222. Looking at it, there's a comment about a move to a streaming stats API, and a bunch of changes to the collection mechanism. I don't know how fragile that stream may be under various circumstances, and I'm really out of my depth here trying to guess whether this may be the root cause of an intermittent problem like we're seeing, but perhaps @dmclf, if you're able to do the https://developer.hashicorp.com/nomad/docs/install#compiling-from-source process on one of your clients, you could validate that this commit is where things go pear-shaped on your system too? I noted - but I think it's entirely coincidental - that the problem seemed to occur shortly after journalctl reported that a client.gc operation was skipped:
(I'm pretty sure this is entirely coincidental, but it was the only log entry emitted by Nomad.) I did the bisect manually, given the intermittent nature of the problem, and waited for 10 minutes with some regular curl'ing to see if the problem manifested - so I caught definite failures, but may have accidentally given a 'good' to some commits that just didn't fail fast enough.
@jedd right... OK, I suppose I can try that (I'm not much of a golang person nor a developer either) - so you suspect that commit, and you'd like me to build from it and from the one before, and see if that works better?
@jedd to confirm, from my end at least.
-> works fine ✅
-> gives this ticket's odd behaviour ❌
@jedd for completeness, I updated 1 of the 2 clients on the dev-cluster too, to check behavior there with the 2 versions. So initially the graph starts with
after updating to
and then lastly changed to
and also confirming the issue still exists with the most recent
As @jrasell wasn't able to reproduce this in their lab - perhaps some other confounding factors are at play? The last reference I could find to a non-trivial Docker stats API change was 2015 - with the introduction of one-shot metrics rather than just streaming (which is the opposite direction to the change in that commit above). FWIW these are the two systems on which I'm seeing this problem present:
Debian bullseye / linux 5.10.0 amd64 / Xeon / docker 20.10.5 / containerd 1.4.13
Debian bookworm / linux 6.1.0 amd64 / Intel i5 / docker 20.10.24 / containerd 1.6.20
Would you be willing to share the details of the systems used in the lab that were unsuccessful in reproducing the issue? From my end, the problem is present on the following (pass -> ❌ indicates the problem is present)
@dmclf - so far I've not had any systems present metrics data with Nomad 1.9.1 - but I'm not running a wide range of environments, as noted above.
@jedd ok, and the details of the system that @jrasell used to try to reproduce in the lab, but failed? Perhaps an easy sample job @jrasell can try - it should be easy to see within 2-3 minutes whether stats are being returned properly after containers are reported 'healthy' by Nomad. I can potentially extend the scope to RHEL and ARM64 too, to see if those are immune, but it might be nice to get a sense of direction as to what kind of system @jrasell uses in that case.
@jedd updated the #24339 (comment) table, tested with the official 1.9.1 binary (Revision d9ec23f).
As an extra observation: you will need a minimal 'normal' cluster for this, like 3 servers + 1 client. The versions of the servers do not seem to matter, as it happens with servers on 1.8.3 and 1.9.1, whilst a client on 1.9.1 is so far always faulty.
Oh, that's interesting. FWIW I did my bisect analysis on my workstation which is a single-node (server/client) system, and it was definitely seeing the problem manifest. (I'll have to update later on whether that machine was in dev mode at the time.) |
@jedd ok, so a single server with bootstrap_expect=1 and a single client may also work? -- Update: confirming on my end, a single server + single client (but not in dev-mode) also shows the same symptoms.
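For anyone wanting to replicate that minimal topology, a single-node agent config along these lines should do - a sketch only, with the data_dir path and datacenter name being assumptions rather than anything posted in this thread:

```hcl
# nomad.hcl - one host acting as both server and client (not dev mode).
data_dir   = "/opt/nomad/data"
datacenter = "dc1"

server {
  enabled          = true
  bootstrap_expect = 1
}

client {
  enabled = true
}

telemetry {
  prometheus_metrics         = true
  publish_allocation_metrics = true
  publish_node_metrics       = true
}
```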
Hi all, just to circle back here and let you know that this issue has been prioritised internally and is waiting for an engineer to become available and pick it up. Once they do, we will use the comments and details to try and get a reproduction and a fix.
(cleanup after yourself)
I am also having the same issue here, with the latest Nomad in a single server/client setup. EDIT: this issue seems to have been fixed, it is just not yet part of a Nomad release -> #24525. It works for a few minutes/hours and then stops reporting metrics for some allocations, if not all. Some allocations still have metrics, but they may stop working after some time. Those are all using the Docker driver. When looking at DEBUG logs in the Monitor section of the Nomad client in the Nomad Web UI, I am seeing a flood of these logs, which start occurring when the metrics are no longer reported:
If I configure the Nomad server to report telemetry every 5s (with the telemetry configuration block in nomad.hcl), those logs can be seen immediately after restarting Nomad. If using the default setting of 1s, they don't appear immediately. That may help with finding the reason for the issue.
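For reference, a telemetry block along these lines reproduces the 5s setting described above - a sketch assuming the stanza lives in the agent's nomad.hcl; the other flags are common settings for Prometheus scraping and are not taken from this comment:

```hcl
telemetry {
  collection_interval        = "5s"  # default is 1s; 5s made the logs appear immediately, per the comment above
  prometheus_metrics         = true
  publish_allocation_metrics = true
  publish_node_metrics       = true
}
```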
Nomad version
Operating system and Environment details
AlmaLinux release 9.4 (Seafoam Ocelot)
/etc/nomad/base.hcl
Issue
After upgrading nomad from `1.8.4` to `1.9.1`, allocation metrics like `nomad_client_allocs_memory_usage` appear not to be properly emitted (or only for several seconds - I am using Prometheus to scrape them).
Reproduction steps
Expected Result
Allocation metrics are available.
Actual Result
Allocation metrics are available sporadically (at best). See the attached screenshot - nomad was upgraded on Oct 24th.
This is how it looks in prometheus since then: