Commit d3d8c62

meeting-notes: add 2024-02-22
1 parent 14a29fb commit d3d8c62

1 file changed

+147
-0
lines changed

docs/meeting-notes/2024-02-22.md

+147
@@ -0,0 +1,147 @@
1+
# 2024-02-22
2+
3+
Attendees: delroth, edolstra, hexa, JulienMalka, raitobezarius, vcunat, zimbatm
4+
5+
## [delroth] FYI on availability next few weeks
6+
7+
- Traveling until mid-April, low availability, will be on JST timezone (UTC+9)
8+
- Missing for the next 2 infra meetings
9+
10+
## [delroth] Backups situation
11+
12+
- How do we back up haumea, long term?
13+
- borgbackup isn't really a good fit for a 500GB Postgres DB.
14+
- Currently: zrepl to my personal infra and hexa's, but that's obviously not a good long term
15+
solution.
16+
- Used to have backups to graham's rsync.net account, but that has been broken since mid-January.
17+
- [raito] Have you ever tried pg_dump's optimized dump format?
18+
- [delroth] Is it fast enough to do a daily dump?
19+
- [raito] unsure, but there are ways to do incremental backups:
20+
- pg_basebackup + pg_dump compressed format
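
As a rough illustration of the `pg_dump` suggestion above, here is a minimal sketch that only builds the command line for a daily dump. It assumes that the "optimized dump format" refers to `pg_dump`'s custom format (`--format=custom`), which is compressed by default. The database name `hydra`, the output directory, and the helper name `pg_dump_command` are hypothetical and not taken from the notes.

```python
from datetime import date
from pathlib import PurePosixPath


def pg_dump_command(
    dbname: str,
    out_dir: PurePosixPath,
    day: date,
    compress_level: int = 6,
) -> list[str]:
    """Build the argv for a compressed, custom-format pg_dump of one database.

    Nothing is executed here; the caller decides how and where to run it.
    """
    if not 0 <= compress_level <= 9:
        raise ValueError("compress_level must be between 0 and 9")
    # One file per day, e.g. /var/backup/hydra-2024-02-22.dump (hypothetical path).
    target = out_dir / f"{dbname}-{day.isoformat()}.dump"
    return [
        "pg_dump",
        "--format=custom",
        f"--compress={compress_level}",
        f"--file={target}",
        dbname,
    ]
```

Whether such a dump finishes fast enough to run daily on a 500GB database is exactly the open question from the discussion; timing one run would answer it.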
21+
22+
## [hexa] Migration of Synapse from EMS
23+
24+
- Apparently waiting for EMS to sort out removal of PII?
25+
- [raito] As long as there's discussion happening between Graham and EMS we probably don't have to
26+
care about this, the legacy hosting plan is not getting cancelled.
27+
- [raito] If anything goes wrong we'd likely get notified.
28+
29+
## [eelco] Move fastly log aggregator to pluto
30+
31+
- This is currently running on Eelco's local machine which is suboptimal.
32+
- Weekly script that takes Fastly logs and loads them into AWS Athena + generates some aggregates.
33+
- https://github.com/NixOS/infra/tree/master/metrics/fastly
34+
- We will put that on the new Eris: Pluto
35+
- [eelco] I will need to create an AWS IAM identity with adequate permissions for the script
36+
to run on Pluto.
37+
- [eelco] I just need read/write access to Athena and some S3 bucket.
38+
- [delroth] Who is using this data?
39+
- [eelco] The reports shown on that page are generated from this data
40+
- PII data regarding access logs of cache.nixos.org
41+
- [everyone] What kind of policy do we want regarding PII and the non-critical infrastructure?
42+
e.g. new wiki access logs are available to the non-critical infrastructure
43+
- Let's take note of this, think about it for the next weeks
44+
45+
## [delroth, hexa] Machine changes
46+
47+
Our spend on outdated AWS EC2 instances and EBS volumes is too high, so we are cutting back on our
48+
use of EC2 and renewing our infra at Hetzner instead.
49+
50+
- Reduce AWS spending
51+
- Started pruning old snapshots and EBS volumes (e.g. nixos-webserver, old nixos versions)
52+
- [eelco] I think it should be fine to delete them. There's a small risk there could be some
53+
historical data, for instance, our subversion repo used to be there as well and the
54+
nix-dev mailing list too. In theory, we have copies of all of that.
55+
- [delroth] I might start an instance and extract the data from there; otherwise I will just
56+
delete it.
57+
- [eelco] There was a lot of scratch space for something… I don't remember it.
58+
- [delroth] I think it was bastion and is now paused.
59+
- Bastion is now stopped/paused
60+
- [hexa] Migrated to Eris and now to Pluto
61+
- [hexa] Channel scripts are running way faster
62+
- [raito] :tada:
63+
- Pinged survey.nixos.org owners (@garbas), to get the limesurvey instance migrated to something
64+
more reasonable
65+
- [hexa] 150 USD/mo
66+
- [hexa] Proposal: Migrate to Hetzner Cloud for a fraction of the costs
67+
- [delroth] I asked Julien to look into it
68+
- [delroth] In general, it's open to anyone who is looking to do non-critical work
69+
- Archeology machine from the cache team
70+
- [delroth] Jonas, can you look into the cost? And can we make it start on-demand?
71+
- [jonas] Asking edef whether they can accommodate these changes
72+
- Hetzner machine renewal
73+
- Phasing out eris.nixos.org (EX41S-SSD, Intel i7-6700, 64GB RAM, 2x 256GB SATA)
74+
- [hexa] Old hardware
75+
- Created and deployed pluto.nixos.org (EX44, Intel i5-13500, 2x 512GB NVMe)
76+
- [hexa] Slightly cheaper but modern hardware
77+
- [hexa] Everything migrated except for monitoring
78+
- [delroth] Some disentanglement required to migrate monitoring
79+
80+
There's a potential of around $700/month of savings in all those operations. That is, we're
81+
offsetting our whole current Hetzner spend with those AWS savings.
82+
83+
- [delroth] Future savings (more involved):
84+
- [delroth] Two layers of storage for cache.nixos.org: warm paths on Hetzner
85+
- [delroth] It might be easier to do that stuff on NixOS releases S3 bucket (much smaller
86+
bucket) and it's costing ~1000 USD per month in **bandwidth**
87+
88+
## [julien] Opening non-critical to more members
89+
90+
- [Julien] Idea of non-critical infra was to lower the barrier to entry, because people could be
91+
trusted with less risky infra
92+
- [Julien] I would like to post a Discourse post to look for new people who might be interested
93+
to join the team
94+
- [Julien] It seems like we have some open issues for non-critical infra that people could
95+
tackle, and these could constitute a first project
96+
- [delroth]
97+
https://github.com/NixOS/infra/issues?q=is%3Aopen+is%3Aissue+label%3Anon-critical-infra
98+
- [Julien] I think it's a good time to do such a post and reach out
99+
- [Julien] I wanted to check with everyone whether it is okay to invite new people
100+
- [delroth/zimbatm] Yes
101+
- [delroth] I think the most important thing is to know who will take care of onboarding and
102+
leading the work
103+
- [Julien] I am ready to handle the onboarding load and the lead, I would prefer to manage
104+
newcomers rather than do all the stuff by myself
105+
106+
## [delroth, hexa] Deployment changes
107+
108+
We removed NixOps and deployment now happens from a `flake.nix`. The plan is to go for colmena
109+
eventually.
110+
111+
- Deployment via `nixos-rebuild --flake .#<host> --target-host root@<host>.nixos.org
112+
--use-substitutes switch`
113+
- NixOps generated configuration was imported and is being migrated, for example we:
114+
- started using agenix for secrets management and imported existing secrets
115+
- migrated network configuration to systemd-networkd/resolved
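
The deployment command above has the same shape for every host, so it can be generated from the host name. The following is a minimal sketch under that assumption; the helper name `nixos_rebuild_command` and the example host `pluto` are illustrative, and only the flags quoted in the notes are used.

```python
def nixos_rebuild_command(host: str, action: str = "switch") -> list[str]:
    """Build the argv for deploying one host from the flake in the current directory.

    Mirrors the command from the notes:
    nixos-rebuild --flake .#<host> --target-host root@<host>.nixos.org
    --use-substitutes switch
    """
    # Expect a short host name such as "pluto", not a full domain or a path.
    if not host or "." in host or "/" in host:
        raise ValueError(f"expected a short host name, got {host!r}")
    return [
        "nixos-rebuild",
        "--flake", f".#{host}",
        "--target-host", f"root@{host}.nixos.org",
        "--use-substitutes",
        action,
    ]
```

Returning a list rather than a single string avoids shell quoting problems with the `#` in the flake reference when the result is passed to something like `subprocess.run`.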
116+
117+
118+
## [delroth, hexa] Infra Changelog
119+
120+
- All machines are now running on NixOS 23.11
121+
- Migrated haumea's database to PostgreSQL 16
122+
- Aligned timezone across machines
123+
- Fixed backup of haumea's database
124+
- zrepl to delroth and hexa
125+
- rsync.net stopped working due to zrepl API version mismatch
126+
- Enabled trimming and scrubbing on all ZFS pools
127+
128+
- Fixed the fastly-exporter deployment
129+
- Migrated to nixpkgs module, which [required its own
130+
fixes](https://github.com/NixOS/nixpkgs/pull/287348)
131+
- Generated a new API token, the old one was invalid
132+
- 📊 [Dashboard](https://monitoring.nixos.org/grafana/d/SHjM6e-ik/fastly?orgId=1)
133+
- Fixed [race condition and world-writable state
134+
file](https://github.com/packethost/prometheus-packet-sd/issues/15) upstream in packet-sd
135+
- Added alerting for
136+
- Failed systemd units
137+
- [Domain expiry](https://github.com/NixOS/infra/pull/249) within the next 30 days
138+
- Lazy loading of eval errors on hydra (Patch by @ajs124)
139+
- Reduces page sizes on the common jobsets/evals from 15-20MB to a few kB
140+
- More work needed, because error logs are still being fetched from the DB, just not rendered
141+
- Services migrated to pluto.nixos.org
142+
- channel-scripts/hydra-mirror
143+
- netboot
144+
- rfc39
145+
- Removed and refactored legacy code, e.g.
146+
- hydra-provisioner
147+
- delft/network.nix
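
For the domain expiry alerting listed in the changelog above, the underlying check is a date comparison against a 30-day window. The sketch below shows only that comparison and is not the rule from the linked pull request; the function name `expires_soon` and the example dates are made up for illustration.

```python
from datetime import date, timedelta


def expires_soon(expiry: date, today: date, window_days: int = 30) -> bool:
    """Return True if the domain expires within the window or has already expired."""
    return expiry - today <= timedelta(days=window_days)
```

Treating already expired domains as alerting is a design choice of this sketch, made so that a missed renewal keeps firing rather than going silent.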
