Commit 5c91267

Alert if scraping jobs fail
In addition to alerting on failing scraping jobs, we need to address our existing failures, which I researched in #551. Copy-pasting the analysis from there:

- `r13y`: Last successful scrape: 2024-09-21. Website is down. @grahamc didn't respond to my query.
- `rfc39`: IMO, this is mis-architected. This scrapes a job that only runs periodically, which means we regularly get scrape failures. I've filed NixOS/rfc39#14 with the upstream project seeking advice.
  - Since we don't actually alert on these metrics, I propose doing the simplest thing and just disabling this scrape job for now.
- `hydra_notify`: Last successful scrape: 2024-08-02. We disabled this service [in 2024](66da5cf), so there's no reason to keep scraping it.
1 parent 5eeb20e commit 5c91267

4 files changed, +24 -60 lines changed


build/pluto/prometheus/default.nix

+24 -3
@@ -1,4 +1,4 @@
-{ ... }:
+{ pkgs, ... }:
 
 {
   imports = [
@@ -16,8 +16,6 @@
     ./exporters/owncast.nix
     ./exporters/postgresql.nix
     ./exporters/rasdaemon.nix
-    ./exporters/r13y.nix
-    ./exporters/rfc39.nix
     ./exporters/zfs.nix
   ];
 
@@ -40,5 +38,28 @@
       "--web.external-url=https://prometheus.nixos.org/"
     ];
     globalConfig.scrape_interval = "15s";
+
+    ruleFiles = [
+      (pkgs.writeText "up.rules" (
+        builtins.toJSON {
+          groups = [
+            {
+              name = "up";
+              rules = [
+                {
+                  alert = "NotUp";
+                  expr = ''
+                    up == 0
+                  '';
+                  for = "10m";
+                  labels.severity = "warning";
+                  annotations.summary = "scrape job {{ $labels.job }} is failing on {{ $labels.instance }}";
+                }
+              ];
+            }
+          ];
+        }
+      ))
+    ];
   };
 }
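
Not part of this commit: the `NotUp` rule added above fires when a target's `up` series stays at 0 for 10 minutes. One way to check that behaviour might be a `promtool` unit-test file, sketched here in Nix so it can be rendered with `builtins.toJSON` the same way the rule file is. The job name, the instance, and the `up.rules` path are hypothetical placeholders, and the sketch assumes `promtool` reads the JSON output as YAML.

```nix
# Sketch only: a promtool unit-test file for the NotUp rule.
# The "example" job, the instance, and the rule file path are hypothetical.
builtins.toJSON {
  rule_files = [ "up.rules" ]; # assumed location of the rendered rule file
  evaluation_interval = "1m";
  tests = [
    {
      interval = "1m";
      input_series = [
        {
          # A target that is down for the whole test window (21 samples of 0).
          series = ''up{job="example", instance="example.invalid:9100"}'';
          values = "0+0x20";
        }
      ];
      alert_rule_test = [
        {
          # Still inside the 10m "for" window: the alert is pending, not firing.
          eval_time = "5m";
          alertname = "NotUp";
          exp_alerts = [ ];
        }
        {
          # Past the 10m "for" window: the alert should be firing.
          eval_time = "15m";
          alertname = "NotUp";
          exp_alerts = [
            {
              exp_labels = {
                severity = "warning";
                job = "example";
                instance = "example.invalid:9100";
              };
              exp_annotations.summary = "scrape job example is failing on example.invalid:9100";
            }
          ];
        }
      ];
    }
  ];
}
```

Written to a file next to the rendered rules, this could be passed to `promtool test rules`. The two `eval_time` checks are chosen to sit on either side of the 10-minute `for` window.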

build/pluto/prometheus/exporters/hydra.nix

-6
@@ -36,12 +36,6 @@
       scheme = "https";
       static_configs = [ { targets = [ "hydra.nixos.org:443" ]; } ];
     }
-    {
-      job_name = "hydra_notify";
-      metrics_path = "/metrics";
-      scheme = "http";
-      static_configs = [ { targets = [ "hydra.nixos.org:9199" ]; } ];
-    }
     {
       job_name = "hydra_queue_runner";
       metrics_path = "/metrics";

build/pluto/prometheus/exporters/r13y.nix

-10
This file was deleted.

build/pluto/prometheus/exporters/rfc39.nix

-41
This file was deleted.
