|
| 1 | +# 2024-02-08 |
| 2 | + |
| 3 | +Attendees: delroth, edolstra, hexa, JulienMalka, raitobezarius, vcunat, zimbatm |
| 4 | + |
| 5 | +## [delroth] FYI on availability next few weeks |
| 6 | + |
| 7 | +- Traveling until mid-April, low availability, will be in the JST timezone (UTC+9) |
| 8 | +- Missing for the next 2 infra meetings |
| 9 | + |
| 10 | +## [delroth] Backups situation |
| 11 | + |
| 12 | +- How do we back up haumea long term? |
| 13 | + - borgbackup isn't really a good fit for a 500GB Postgres DB. |
| 14 | + - Currently: zrepl to my personal infra and hexa's, but that's obviously not a good long-term
| 15 | + solution. |
| 16 | + - Used to have backups to graham's rsync.net account, but that's broken since mid-Jan. |
| 17 | + - [raito] Have you ever tried pg_dump's optimized dump format? |
| 18 | + - [delroth] Is it fast enough to do a daily dump? |
| 19 | + - [raito] unsure, but there are ways to do incremental backups: |
| 20 | + - pg_basebackup + pg_dump compressed format |
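To make the dump idea above concrete, here is a minimal sketch of how a daily compressed custom-format dump could be assembled. It is not what runs on haumea: the database name, output directory, and compression level are hypothetical placeholders, and only the standard `pg_dump` options `--format=custom`, `--compress`, and `--file` are used.

```python
import shlex
from datetime import date


def build_pg_dump_command(
    dbname: str, out_dir: str, day: date, compress_level: int = 6
) -> list[str]:
    """Build a pg_dump invocation writing a compressed custom-format archive.

    All arguments are illustrative; none of them come from the real setup.
    """
    if not 0 <= compress_level <= 9:
        raise ValueError("compress_level must be between 0 and 9")
    # One archive per day, so a failed run never overwrites an older dump.
    outfile = f"{out_dir}/{dbname}-{day.isoformat()}.dump"
    return [
        "pg_dump",
        "--format=custom",               # pg_dump's compressed archive format
        f"--compress={compress_level}",  # 0 disables compression, 9 is slowest
        f"--file={outfile}",
        dbname,
    ]


if __name__ == "__main__":
    # Hypothetical database name and path, printed rather than executed.
    cmd = build_pg_dump_command("exampledb", "/var/backup/postgres", date(2024, 2, 8))
    print(shlex.join(cmd))
```

Timing one run of the printed command against a copy of the database would be one way to answer whether a daily dump is fast enough.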
| 21 | + |
| 22 | +## [hexa] Migration of Synapse from EMS |
| 23 | + |
| 24 | +- Apparently waiting for EMS to sort out removal of PII? |
| 25 | +- [raito] As long as there's discussion happening between Graham and EMS we probably don't have to |
| 26 | + care about this, the legacy hosting plan is not getting cancelled. |
| 27 | +- [raito] If anything goes wrong we'd likely get notified. |
| 28 | + |
| 29 | +## [eelco] Move fastly log aggregator to pluto |
| 30 | + |
| 31 | +- This is currently running on Eelco's local machine which is suboptimal. |
| 32 | +- Weekly script that takes Fastly logs and loads them into AWS Athena + generates some aggregates. |
| 33 | +- https://github.com/NixOS/infra/tree/master/metrics/fastly |
| 34 | +- We will put that on the new Eris: Pluto |
| 35 | +- [eelco] I will need to create an AWS IAM identity with adequate permissions for the script
 to run on Pluto. |
| 37 | + - [eelco] I just need read/write access to Athena and some S3 bucket. |
| 38 | +- [delroth] Who is using this data? |
| 39 | + - [eelco] You can see on that page that the reporting is generated via this data |
| 40 | +- PII data regarding access logs of cache.nixos.org |
| 41 | + - [everyone] What kind of policy do we want regarding PII and the non-critical infrastructure? |
| 42 | + e.g. new wiki access logs are available to the non-critical infrastructure |
| 43 | + - Let's take note of this and think about it over the next few weeks |
| 44 | + |
| 45 | +## [delroth, hexa] Machine changes |
| 46 | + |
| 47 | +Our spend on outdated AWS EC2 instances and EBS volumes is too high, so we are cutting back on our
 use of EC2 and renewing our infra at Hetzner instead. |
| 49 | + |
| 50 | +- Reduce AWS spending |
| 51 | + - Started pruning old snapshots and EBS volumes (e.g. nixos-webserver, old nixos versions) |
| 52 | + - [eelco] I think it should be fine to delete them. There's a small risk there could be some |
| 53 | + historical data, for instance, our subversion repo used to be there as well and the |
| 54 | + nix-dev mailing list too. In theory, we have copies of all of that. |
| 55 | + - [delroth] I might start an instance and extract the data from there; otherwise I will just
 delete it. |
| 57 | + - [eelco] There was a lot of scratch space for something… I don't remember it. |
| 58 | + - [delroth] I think it was bastion and is now paused. |
| 59 | + - Bastion is now stopped/paused |
| 60 | + - [hexa] Migrated to Eris and now to Pluto |
| 61 | + - [hexa] Channel scripts are running way faster |
| 62 | + - [raito] :tada: |
| 63 | + - Pinged survey.nixos.org owners (@garbas), to get the limesurvey instance migrated to something |
| 64 | + more reasonable |
| 65 | + - [hexa] 150 USD/mo |
| 66 | + - [hexa] Proposal: Migrate to Hetzner Cloud for a fraction of the costs |
| 67 | + - [delroth] I asked Julien to look into it |
| 68 | + - [delroth] In general, it's open to anyone who is looking to do non-critical work |
| 69 | + - Archeology machine from the cache team |
| 70 | + - [delroth] Jonas, can you look into the cost? And can we make it start on-demand? |
| 71 | + - [jonas] Asking edef whether they can accommodate these changes |
| 72 | +- Hetzner machine renewal |
| 73 | + - Phasing out eris.nixos.org (EX41S-SSD, Intel i7-6700, 64GB RAM, 2x 256GB SATA) |
| 74 | + - [hexa] Old hardware |
| 75 | + - Created and deployed pluto.nixos.org (EX44, Intel i5-13500, 2x512GB NVME) |
| 76 | + - [hexa] Slightly cheaper but modern hardware |
| 77 | + - [hexa] Everything migrated except for monitoring |
| 78 | + - [delroth] Some disentanglement required to migrate monitoring |
| 79 | + |
| 80 | +There's potential for around $700/month of savings across all those operations. That is, we're
| 81 | +offsetting our whole current Hetzner spend with those AWS savings. |
| 82 | + |
| 83 | +- [delroth] Future savings (more involved): |
| 84 | + - [delroth] Two layers of storage for cache.nixos.org: warm paths on Hetzner |
| 85 | + - [delroth] It might be easier to do that on the NixOS releases S3 bucket (a much smaller
 bucket), which costs ~1000 USD per month in **bandwidth** |
| 87 | + |
| 88 | +## [julien] Opening non-critical to more members |
| 89 | + |
| 90 | +- [Julien] Idea of non-critical infra was to lower the barrier to entry, because people could be |
| 91 | + trusted with less risky infra |
| 92 | + - [Julien] I would like to post a Discourse post to look for new people who might be interested |
| 93 | + to join the team |
| 94 | + - [Julien] We have some issues open for non-critical infra; letting people tackle them
 could constitute a first project |
| 96 | + - [delroth] |
| 97 | + https://github.com/NixOS/infra/issues?q=is%3Aopen+is%3Aissue+label%3Anon-critical-infra |
| 98 | + - [Julien] I think it's a good time to do such a post and reach out |
| 99 | + - [Julien] I wanted to check with everyone whether it is okay to invite new people |
| 100 | + - [delroth/zimbatm] Yes |
| 101 | + - [delroth] I think the most important thing is to know who will take care of onboarding and |
| 102 | + leading the work |
| 103 | + - [Julien] I am ready to handle the onboarding load and the lead, I would prefer to manage |
| 104 | + newcomers rather than do all the stuff by myself |
| 105 | + |
| 106 | +## [delroth, hexa] Deployment changes |
| 107 | + |
| 108 | +We removed nixops and deployment now happens from a `flake.nix`. The plan is to go for colmena |
| 109 | +eventually. |
| 110 | + |
| 111 | +- Deployment via `nixos-rebuild --flake .#<host> --target-host root@<host>.nixos.org |
| 112 | + --use-substitutes switch` |
| 113 | +- NixOps generated configuration was imported and is being migrated, for example we: |
| 114 | + - started using agenix for secrets management and imported existing secrets |
| 115 | + - and migrated network configuration to systemd-networkd/resolved |
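The deployment command above can be wrapped in a small helper so the host name is filled in consistently. This is only a sketch: the function name, the host-name check, and the default `domain` and `user` values are assumptions for illustration, and the flags are exactly those quoted in the notes.

```python
import shlex


def build_deploy_command(
    host: str, domain: str = "nixos.org", user: str = "root"
) -> list[str]:
    """Assemble the nixos-rebuild invocation for one host of the flake."""
    # Hypothetical sanity check: allow only letters, digits and dashes.
    if not host or not host.replace("-", "").isalnum():
        raise ValueError(f"unexpected host name: {host!r}")
    return [
        "nixos-rebuild",
        "--flake", f".#{host}",                      # flake output for this host
        "--target-host", f"{user}@{host}.{domain}",  # deploy over SSH
        "--use-substitutes",                         # let the target fetch from caches
        "switch",
    ]


if __name__ == "__main__":
    # Print the command instead of running it; shlex.join quotes where needed.
    print(shlex.join(build_deploy_command("pluto")))
```

Returning an argument list rather than a string means the result can be passed to `subprocess.run` without shell quoting concerns.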
| 116 | + |
| 117 | + |
| 118 | +## [delroth, hexa] Infra Changelog |
| 119 | + |
| 120 | +- All machines are now running on NixOS 23.11 |
| 121 | +- Migrated haumea's database to PostgreSQL 16 |
| 122 | +- Aligned timezones across machines |
| 123 | +- Fixed backup of haumea's database |
| 124 | + - zrepl to delroth and hexa |
| 125 | + - rsync.net stopped working due to zrepl API version mismatch |
| 126 | +- Enabled trimming and scrubbing on all ZFS pools |
| 127 | + |
| 128 | +- Fixed the fastly-exporter deployment |
| 129 | + - Migrated to nixpkgs module, which [required its own |
| 130 | + fixes](https://github.com/NixOS/nixpkgs/pull/287348) |
| 131 | + - Generated a new API token, the old one was invalid |
| 132 | + - 📊 [Dashboard](https://monitoring.nixos.org/grafana/d/SHjM6e-ik/fastly?orgId=1) |
| 133 | +- Fixed [race condition and world-writable state |
| 134 | + file](https://github.com/packethost/prometheus-packet-sd/issues/15) upstream in packet-sd |
| 135 | +- Added alerting for |
| 136 | + - Failed systemd units |
| 137 | + - [Domain expiry](https://github.com/NixOS/infra/pull/249) within the next 30 days |
| 138 | +- Lazy loading of eval errors on hydra (Patch by @ajs124) |
| 139 | + - Reduces page sizes on the common jobsets/evals from 15-20MB to a few kB |
| 140 | + - More work needed, because error logs are still being fetched from the DB, just not rendered |
| 141 | +- Services migrated to pluto.nixos.org |
| 142 | + - channel-scripts/hydra-mirror |
| 143 | + - netboot |
| 144 | + - rfc39 |
| 145 | +- Removed and refactored legacy code, e.g. |
| 146 | + - hydra-provisioner |
| 147 | + - delft/network.nix |