Fix CNPG failover #680

Open
samcday opened this issue Aug 10, 2024 · 3 comments

Comments

@samcday
Owner

samcday commented Aug 10, 2024

Originally this issue was "Remove CNPG". It's now "Fix CNPG because there's literally no other option except to completely hand roll a Postgres deployment from scratch and I'd rather punch myself repeatedly in the crotch than do this".


Postgres + Kubernetes is cursed.

First, there was the dumpster fire that Zalando inflicted on the OSS world with their comically bad operator. Then there was the CrunchyData one, which was so tragically documented and maintained I think it might actually be some kind of military psy-op posing as a software project.

CNPG was a breath of fresh air, because it actually works reasonably well on the happy path, didn't make a dog's breakfast of backup/restore, and has good documentation.

Unfortunately, in practice, CNPG is also terrible. A light breeze knocks clusters over, and replicas constantly end up in a broken state that requires manual remediation.

The final straw was today when I went around the cluster, replacing a bunch of ethernet cables with some nice short-length ones. Just yanking and reconnecting some cables was enough to bring down several of the DB clusters.

I could dig into this, figure out a reproducible test case, and contribute that (and maybe even a fix) upstream. That would be the right thing to do. I don't want to fucking do the right thing here. I just want a Postgres database running in Kubernetes that is reliable and backed up.

I think the best approach will be to just build a handful of DB clusters with the bitnami Helm chart and accept that they'll need some occasional petting.

@samcday
Owner Author

samcday commented Aug 10, 2024

So the proverbial straw is a known issue, at least. Though in some ways that's worse, because this is a critical flaw in the operator that was first raised more than 4 months ago. It still has not had any acknowledgement from the maintainers.

As far as I'm concerned, this is a smoking gun that demonstrates CNPG is, sadly, dead. Or at least, it's dead to me. 6 months of bashing my head against a brick wall is long enough, thank you very much.

Still, I'm somewhat optimistic. I was having similar problems several years ago with my first forays into the Zalando/CrunchyData shitshow. CNPG was a major improvement over the status quo. I will hold out hope that the next iteration in this space yields an operator that is actually worth relying on 🤞

@samcday
Owner Author

samcday commented Aug 10, 2024

Ugh. Turns out the bitnami postgres-ha chart is also utter garbage.

The ergonomics are pretty terrible, but I'm used to that kind of abuse with Bitnami Helm charts. The way it does user management is particularly awful though - you have to repeat the users/passwords in both the postgres instances and the pgpool deployment. The users and passwords have to be comma/semicolon delimited. My brain literally cannot.

The kicker is that this thing isn't remotely "HA". If you do a rolling restart of the postgres nodes, the frontend pgpools shit the bed indefinitely until you manually rollout restart them. This was first reported in July 2022, ignored the entire time, and raised several more times, where it was also ignored.

Holy moly. I'm really struggling to accept just how sad the state of the k8s ecosystem is for Postgres. Postgres deserves so much more than this :(

@samcday
Owner Author

samcday commented Aug 10, 2024

I guess the outcome of this day is there is no alternative but to be a good citizen, roll up my sleeves, and figure out wtf is up with CNPG. God damnit.

@samcday changed the title from "Remove CNPG" to "Fix CNPG failover" on Aug 10, 2024