Evict slow/hanged node that still responds to network #510

MagnusKlingenberg · 2018-06-27T11:46:50Z

We have a three-node PXC cluster with ProxySQL in front. All writes go to one node, and everything is managed by Severalnines ClusterControl.

However, we have had several complete outages caused by one of the PXC nodes hanging.

Scenario 1: Hanging non-write node
The underlying hardware causes all writes to the binlog to hang.
After a while the PXC node can't complete commits.
Later the whole cluster stops due to flow control.
evs.auto_evict does nothing, since the node still responds to network activity.

Possible solution: Allow a node to be evicted if it falls too far behind on writes.

Scenario 2: Hanging write node
The underlying hardware causes all writes to the binlog to hang.
After a while the PXC node can't complete commits.
The ProxySQL Galera checker script reports everything as OK, so traffic is not moved to a new node.

Possible solution: Allow a node to self-evict if it is unable to perform commits.

How to reproduce:
Install a regular Galera cluster with 3 nodes.
Make sure the binlog is located on a separate partition.
Keep a steady stream of write queries going to one of the nodes.
Run "fsfreeze -f /mount/for/binlog/".
Wait for the cluster to stop serving queries.

Workaround:
We have currently deployed a workaround where a script regularly tries to update a file on /mount/for/binlog. If the update takes more than X seconds to complete, the script blocks all network traffic to the other Galera nodes. That way the node drops out of the cluster and ProxySQL can find a new node to write to.
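
For illustration, here is a minimal Python sketch of such a watchdog. It is not our actual script. The probe file name, the peer IP addresses, the timeout, the check interval, and the use of iptables to drop traffic are all assumptions made for this example. The probe runs in a separate thread because a write to a frozen filesystem blocks indefinitely, so the timeout has to be enforced from outside the write itself.

```python
import os
import subprocess
import threading
import time


def probe_write(directory):
    """Write and fsync a small file in `directory`.

    This call blocks for as long as the filesystem is hung.
    The file name ".write_probe" is a hypothetical choice.
    """
    path = os.path.join(directory, ".write_probe")
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, str(time.time()).encode())
        os.fsync(fd)
    finally:
        os.close(fd)


def write_completes_within(directory, timeout_s, probe=probe_write):
    """Return True if the probe write finishes within `timeout_s` seconds."""
    done = threading.Event()

    def worker():
        try:
            probe(directory)
            done.set()
        except OSError:
            pass  # a failed write is treated the same as a hung one

    # Daemon thread: a stuck probe is not waited on at interpreter exit.
    threading.Thread(target=worker, daemon=True).start()
    return done.wait(timeout_s)


def isolate_commands(peer_ips):
    """Build iptables commands that drop all traffic to and from the peers.

    Assumption: iptables is the firewall in use on the node.
    """
    cmds = []
    for ip in peer_ips:
        cmds.append(["iptables", "-I", "INPUT", "-s", ip, "-j", "DROP"])
        cmds.append(["iptables", "-I", "OUTPUT", "-d", ip, "-j", "DROP"])
    return cmds


def watchdog_loop(directory, peer_ips, timeout_s, interval_s):
    """Probe the binlog mount until a write hangs, then isolate this node."""
    while write_completes_within(directory, timeout_s):
        time.sleep(interval_s)
    for cmd in isolate_commands(peer_ips):
        subprocess.run(cmd, check=True)  # requires root
```

It could be started with something like watchdog_loop("/mount/for/binlog", ["10.0.0.2", "10.0.0.3"], timeout_s=10, interval_s=5), where the addresses and timings are placeholders. The loop returns after isolating the node, so removing the firewall rules once the hardware has recovered is left to the operator.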
