meeting-notes: add 2024-02-22

delroth · delroth · commit d3d8c6228636 · 2024-02-22T19:14:55.000+01:00
diff --git a/docs/meeting-notes/2024-02-22.md b/docs/meeting-notes/2024-02-22.md
@@ -0,0 +1,147 @@
+# 2024-02-22
+
+Attendees: delroth, edolstra, hexa, JulienMalka, raitobezarius, vcunat, zimbatm
+
+## [delroth] FYI on availability next few weeks
+
+- Traveling until mid-April, low availability, will be on JST timezone (UTC+9)
+- Missing for the next 2 infra meetings
+
+## [delroth] Backups situation
+
+- How do we back up haumea, long term?
+  - borgbackup isn't really a good fit for a 500GB Postgres DB.
+  - Currently: zrepl to my personal infra and hexa's, but that's obviously not a good long-term
+    solution.
+  - Used to have backups to graham's rsync.net account, but that's broken since mid-Jan.
+  - [raito] Have you ever tried pg_dump's optimized dump format?
+    - [delroth] Is it fast enough to do a daily dump?
+    - [raito] unsure, but there are ways to do incremental backups:
+        - pg_basebackup + pg_dump compressed format
+
+## [hexa] Migration of Synapse from EMS
+
+- Apparently waiting for EMS to sort out removal of PII?
+- [raito] As long as there's discussion happening between Graham and EMS we probably don't have to
+  care about this, the legacy hosting plan is not getting cancelled.
+- [raito] If anything goes wrong we'd likely get notified.
+
+## [eelco] Move fastly log aggregator to pluto
+
+- This is currently running on Eelco's local machine which is suboptimal.
+- Weekly script that takes Fastly logs and loads them into AWS Athena + generates some aggregates.
+- https://github.com/NixOS/infra/tree/master/metrics/fastly
+- We will put that on the new Eris: Pluto
+- [eelco] I will need to create an AWS IAM to bestow the adequate permissions to enable the script
+  to run on Pluto.
+    - [eelco] I just need read/write access to Athena and some S3 bucket.
+- [delroth] Who is using this data?
+    - [eelco] You can see on that page that the reporting is generated via this data
+- PII data regarding access logs of cache.nixos.org
+    - [everyone] What kind of policy do we want regarding PII and the non-critical infrastructure?
+      e.g. new wiki access logs are available to the non-critical infrastructure
+        - Let's take note of this, think about it for the next weeks
+
+## [delroth, hexa] Machine changes
+
+Our spend on outdated AWS EC2 instances and EBS volumes is too high, so we are cutting back on our
+use of EC2 and instead renewing our infra at Hetzner.
+
+- Reduce AWS spending
+    - Started pruning old snapshots and EBS volumes (e.g. nixos-webserver, old nixos versions)
+        - [eelco] I think it should be fine to delete them. There's a small risk there could be some
+          historical data, for instance, our subversion repo used to be there as well and the
+          nix-dev mailing list too. In theory, we have copies of all of that.
+        - [delroth] I might start an instance and extract the data from there; otherwise I will
+          just delete it.
+        - [eelco] There was a lot of scratch space for something… I don't remember it.
+        - [delroth] I think it was bastion and is now paused.
+    - Bastion is now stopped/paused
+        - [hexa] Migrated to Eris and now to Pluto
+        - [hexa] Channel scripts are running way faster
+        - [raito] :tada:
+    - Pinged survey.nixos.org owners (@garbas), to get the limesurvey instance migrated to something
+      more reasonable
+        - [hexa] 150 USD/mo
+        - [hexa] Proposal: Migrate to Hetzner Cloud for a fraction of the costs
+        - [delroth] I asked Julien to look into it
+        - [delroth] In general, it's open to anyone who is looking to do non-critical work
+    - Archeology machine from the cache team
+        - [delroth] Jonas, can you look into the cost? And can we make it start on-demand?
+        - [jonas] asking edef whether they can accommodate these changes
+- Hetzner machine renewal
+    - Phasing out eris.nixos.org (EX41S-SSD, Intel i7-6700, 64GB RAM, 2x 256GB SATA)
+        - [hexa] Old hardware
+    - Created and deployed pluto.nixos.org (EX44, Intel i5-13500, 2x512GB NVME)
+        - [hexa] Slightly cheaper but modern hardware
+        - [hexa] Everything migrated except for monitoring
+        - [delroth] Some disentanglement required to migrate monitoring
+
+There's a potential of around $700/month of savings in all those operations. That is, we're
+offsetting our whole current Hetzner spend with those AWS savings.
+
+- [delroth] Future savings (more involved):
+    - [delroth] Two layers of storage for cache.nixos.org: warm paths on Hetzner
+    - [delroth] It might be easier to do that stuff on NixOS releases S3 bucket (much smaller
+      bucket) and it's costing ~1000 USD per month in **bandwidth**
+
+## [julien] Opening non-critical to more members
+
+- [Julien] Idea of non-critical infra was to lower the barrier to entry, because people could be
+  trusted with less risky infra
+    - [Julien] I would like to publish a Discourse post to look for new people who might be
+      interested in joining the team
+    - [Julien] We have some open issues for non-critical infra that we could let people tackle,
+      and these could constitute a first project
+      - [delroth]
+        https://github.com/NixOS/infra/issues?q=is%3Aopen+is%3Aissue+label%3Anon-critical-infra
+    - [Julien] I think it's a good time to do such a post and reach out
+    - [Julien] I wanted to check with everyone whether it is okay to invite new people
+    - [delroth/zimbatm] Yes
+    - [delroth] I think the most important thing is to know who will take care of onboarding and
+      leading the work
+    - [Julien] I am ready to handle the onboarding load and the lead, I would prefer to manage
+      newcomers rather than do all the stuff by myself
+
+## [delroth, hexa] Deployment changes
+
+We removed nixops and deployment now happens from a `flake.nix`. The plan is to go for colmena
+eventually.
+
+- Deployment via `nixos-rebuild --flake .#<host> --target-host root@<host>.nixos.org
+  --use-substitutes switch`
+- NixOps-generated configuration was imported and is being migrated; for example, we:
+    - started using agenix for secrets management and imported existing secrets
+    - migrated network configuration to systemd-networkd/resolved
+
+
+## [delroth, hexa] Infra Changelog
+
+- All machines are now running on NixOS 23.11
+- Migrated haumea's database to PostgreSQL 16
+- Aligned timezone across machines
+- Fixed backup of haumea's database
+    - zrepl to delroth and hexa
+    - rsync.net stopped working due to zrepl API version mismatch
+- Enabled trimming and scrubbing on all ZFS pools
+
+- Fixed the fastly-exporter deployment
+    - Migrated to nixpkgs module, which [required its own
+      fixes](https://github.com/NixOS/nixpkgs/pull/287348)
+    - Generated a new API token, the old one was invalid
+    - 📊 [Dashboard](https://monitoring.nixos.org/grafana/d/SHjM6e-ik/fastly?orgId=1)
+- Fixed [race condition and world-writable state
+  file](https://github.com/packethost/prometheus-packet-sd/issues/15) upstream in packet-sd
+- Added alerting for
+    - Failed systemd units
+    - [Domain expiry](https://github.com/NixOS/infra/pull/249) within the next 30 days
+- Lazy loading of eval errors on hydra (Patch by @ajs124)
+    - Reduces page sizes on the common jobsets/evals from 15-20MB to a few kB
+    - More work needed, because error logs are still being fetched from the DB, just not rendered
+- Services migrated to pluto.nixos.org
+    - channel-scripts/hydra-mirror
+    - netboot
+    - rfc39
+- Removed and refactored legacy code, e.g.
+    - hydra-provisioner
+    - delft/network.nix