Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(optionally) only consider streaming nodes to be healthy #50

Closed
mbanck opened this issue Sep 27, 2023 · 7 comments
Closed

(optionally) only consider streaming nodes to be healthy #50

mbanck opened this issue Sep 27, 2023 · 7 comments
Labels
bug Something isn't working

Comments

@mbanck
Copy link

mbanck commented Sep 27, 2023

In #30 the streaming state was added as an alternative to running. However, I don't agree with this - if the Patroni version is 3.0.4 or up, a state of merely running indicates an unhealthy (streaming replication not working) node.

So it would be good to have an option for the cluster_node_count command to only count streaming (replicas and standby leaders) as healthy.

@mbanck mbanck changed the title (optinally) only consider streaming nodes to be healthy (optionally) only consider streaming nodes to be healthy Sep 27, 2023
@blogh blogh added the bug Something isn't working label Sep 27, 2023
@blogh
Copy link
Collaborator

blogh commented Sep 27, 2023

Good catch :

-bash-4.2$ patroni --version
patroni 3.1.1

-bash-4.2$ patronictl --config-file /etc/patroni/demo.yaml list
+ Cluster: patroni-demo (7283074439919541064) ----+-----------+
| Member | Host        | Role    | State     | TL | Lag in MB |
+--------+-------------+---------+-----------+----+-----------+
| p1     | 10.20.30.51 | Replica | running   |  4 |       173 |
| p2     | 10.20.30.52 | Leader  | running   |  8 |           |
| p3     | 10.20.30.53 | Replica | streaming |  8 |         0 |
+--------+-------------+---------+-----------+----+-----------+
(.venv) [vagrant@p2 check_patroni]$ check_patroni -vv -e http://10.20.30.53:8008 cluster_has_replica   
CLUSTERHASREPLICA OK - healthy_replica is 2
| healthy_replica=2 p1_lag=181687224 p1_sync=0 p3_lag=0 p3_sync=0 sync_replica=0 unhealthy_replica=0

@blogh
Copy link
Collaborator

blogh commented Sep 27, 2023

The node_is_replica is also kinda useless as is ..

Tests with the same cluster:

(.venv) [vagrant@p2 check_patroni]$ check_patroni -vv -e http://10.20.30.51:8008 node_is_replica       
NODEISREPLICA OK - This node is a running replica with no noloadbalance tag.
| is_replica=1;;@0

(.venv) [vagrant@p2 check_patroni]$ check_patroni -vv -e http://10.20.30.53:8008 node_is_replica
NODEISREPLICA OK - This node is a running replica with no noloadbalance tag.
| is_replica=1;;@0

The check/description would be more usefull if we checked for healthy replicas.

(.venv) [vagrant@p2 check_patroni]$ check_patroni node_is_replica --help
Usage: check_patroni node_is_replica [OPTIONS]

  Check if the node is a running replica with no noloadbalance tag.

  It is possible to check if the node is synchronous or asynchronous. If
  nothing is specified any kind of replica is accepted. When checking for a
  synchronous replica, it's not possible to specify a lag.

  Check:
  * `OK`: if the node is a running replica with noloadbalance tag and the lag is under the maximum threshold.
  * `CRITICAL`:  otherwise

  Perfdata: `is_replica` is 1 if the node is a running replica with
  noloadbalance tag and the lag is under the maximum threshold, 0 otherwise.

What's your opinion on this ?

@mbanck
Copy link
Author

mbanck commented Sep 27, 2023

I don't know about the node_is_replica check, as the roles can be dynamic, I don't have a good use-case for this; maybe there should be a useful default for max-lag though? Patroni's kinda-default is 1MB I belive (maximum_lag_on_failover: 104857).

I don't have a great opinion on what the UX should be, except I think the less complicated the better, and it seems useless to check for a running but non-working (or lagging) replica.

So I think a replica should be considered healthy when it is streaming or - for older Patroni versions - when it is on the same timline as the local leader and has no lag.

@mbanck
Copy link
Author

mbanck commented Sep 27, 2023

Well, put otherwise: I had a look at #30 and I noticed that it does not seem to touch node_is_replica.

Maybe a second check node_is_follower or so could be added that does the right thing, or node_is_replica's behavior changed to do so (but probably warranting a major version bump then), as check-patroni is a relatively young project.

@blogh
Copy link
Collaborator

blogh commented Nov 15, 2023

Fixed thanks for reporting this.

@blogh
Copy link
Collaborator

blogh commented Nov 15, 2023

Re opened, there is still the cluster_node_count issue to fix.

@blogh blogh reopened this Nov 15, 2023
@blogh
Copy link
Collaborator

blogh commented Jan 9, 2024

Done here: #66

@blogh blogh closed this as completed Jan 9, 2024
@blogh blogh mentioned this issue Apr 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants