Commit 47aeada

jfly authored and Mic92 committed
Alert if scraping jobs fail
In addition to alerting on failing scraping jobs, we need to address our existing failures, which I researched in #551. Copy-pasting the analysis from there:

- `r13y`: Last successful scrape: 2024-09-21. The website is down, and @grahamc didn't respond to my query.
- `rfc39`: IMO, this is mis-architected. It scrapes a job that only runs periodically, which means we regularly get scrape failures. I've filed NixOS/rfc39#14 with the upstream project seeking advice.
  - Since we don't actually alert on these metrics, I propose doing the simplest thing and just disabling this scrape job for now.
- `hydra_notify`: Last successful scrape: 2024-08-02. We disabled this service [in 2024](66da5cf), so there's no reason to keep scraping it.
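If a periodically running job like `rfc39` were kept as a scrape target instead of being disabled, one alternative (a hypothetical sketch, not what this commit does) would be to exclude it from the `NotUp` alert with a PromQL negative label matcher. The sketch assumes the rule lives under the NixOS `services.prometheus` options, as the surrounding diff context suggests; the job name comes from the message above, and everything else mirrors the rule added in `default.nix`.

```nix
# Hypothetical variant of the NotUp rule that ignores a job known to be
# reachable only intermittently (here, `rfc39`). Not part of this commit.
{ pkgs, ... }:
{
  services.prometheus.ruleFiles = [
    (pkgs.writeText "up.rules" (
      builtins.toJSON {
        groups = [
          {
            name = "up";
            rules = [
              {
                alert = "NotUp";
                # `job!="rfc39"` drops that job's series before comparing.
                expr = ''
                  up{job!="rfc39"} == 0
                '';
                for = "10m";
                labels.severity = "warning";
                annotations.summary = "scrape job {{ $labels.job }} is failing on {{ $labels.instance }}";
              }
            ];
          }
        ];
      }
    ))
  ];
}
```

Excluding the job keeps its metrics available for dashboards while silencing the expected failures; the commit instead takes the simpler route of dropping the scrape job entirely.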
1 parent 1ca9309 commit 47aeada


4 files changed (+24, -60 lines)


build/pluto/prometheus/default.nix

+24 -3
```diff
@@ -1,4 +1,4 @@
-{ ... }:
+{ pkgs, ... }:
 
 {
   imports = [
@@ -16,8 +16,6 @@
     ./exporters/owncast.nix
     ./exporters/postgresql.nix
     ./exporters/rasdaemon.nix
-    ./exporters/r13y.nix
-    ./exporters/rfc39.nix
     ./exporters/zfs.nix
   ];
 
@@ -40,5 +38,28 @@
       "--web.external-url=https://prometheus.nixos.org/"
     ];
     globalConfig.scrape_interval = "15s";
+
+    ruleFiles = [
+      (pkgs.writeText "up.rules" (
+        builtins.toJSON {
+          groups = [
+            {
+              name = "up";
+              rules = [
+                {
+                  alert = "NotUp";
+                  expr = ''
+                    up == 0
+                  '';
+                  for = "10m";
+                  labels.severity = "warning";
+                  annotations.summary = "scrape job {{ $labels.job }} is failing on {{ $labels.instance }}";
+                }
+              ];
+            }
+          ];
+        }
+      ))
+    ];
   };
 }
```
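The rule file above works because Prometheus reads rule files as YAML and JSON is, for practical purposes, valid YAML, so `builtins.toJSON` output can be loaded directly. A roughly equivalent sketch uses the nixpkgs `pkgs.formats.yaml` generator instead; it assumes the rule sits under `services.prometheus` as the diff context suggests, and the file name `up.rules.yml` is an arbitrary choice.

```nix
{ pkgs, ... }:
let
  # nixpkgs helper that serialises a Nix value into a YAML file.
  yaml = pkgs.formats.yaml { };
in
{
  services.prometheus.ruleFiles = [
    (yaml.generate "up.rules.yml" {
      groups = [
        {
          name = "up";
          rules = [
            {
              alert = "NotUp";
              # `up` is 0 for a target whose most recent scrape failed.
              expr = "up == 0";
              # Fire only after the expression has held for 10 minutes.
              for = "10m";
              labels.severity = "warning";
              annotations.summary = "scrape job {{ $labels.job }} is failing on {{ $labels.instance }}";
            }
          ];
        }
      ];
    })
  ];
}
```

The trade-off is mostly stylistic: the `writeText` plus `builtins.toJSON` route used in the commit needs only a plain text file, while `pkgs.formats.yaml` produces conventional YAML at the cost of a small conversion derivation.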

build/pluto/prometheus/exporters/hydra.nix

-6
```diff
@@ -36,12 +36,6 @@
       scheme = "https";
       static_configs = [ { targets = [ "hydra.nixos.org:443" ]; } ];
     }
-    {
-      job_name = "hydra_notify";
-      metrics_path = "/metrics";
-      scheme = "http";
-      static_configs = [ { targets = [ "hydra.nixos.org:9199" ]; } ];
-    }
     {
       job_name = "hydra_queue_runner";
       metrics_path = "/metrics";
```

build/pluto/prometheus/exporters/r13y.nix

-10
This file was deleted.

build/pluto/prometheus/exporters/rfc39.nix

-41
This file was deleted.
