
Pinot graceful node replacement for large scale production usage #14592

Open · lnbest0707-uber opened this issue Dec 4, 2024 · 0 comments
On a real-world cloud-based stateful platform, the hosts underlying Pinot containers are dynamic, and host/node replacement is very frequent. Ideally, such an operation should be fully transparent to users, without even requiring Pinot admins' attention.
However, Pinot today does not have a truly graceful way to handle node replacement. Although tables usually have multiple replicas, running in an under-replicated state makes the system stressed and risky. For example, for a table with 2 replicas, during node replacement we have to accept that:

  • Many segments are down to 1 replica, so the query load on the remaining replica doubles.
  • For segments running with 1 replica, there is a high risk that data is lost or queries fail if another node goes down due to hardware or network issues.

Although we experience the same issue during a node restart, node replacement is far slower than a restart, especially at high data volume. For example, for a large node with 5+ TB of data, a single node replacement might take 5+ hours to complete, whereas a restart might take only minutes. The longer the node replacement takes, the longer the downtime we have to endure and the higher the risk we introduce.
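
For intuition (illustrative numbers, not a benchmark): at an effective download throughput of around 300 MB/s per node, transferring 5 TB is roughly 5,000,000 MB / 300 MB/s ≈ 16,700 s ≈ 4.6 hours of raw data movement alone, before any segment loading, which is consistent with the 5+ hour figure above.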

Hence, reducing node replacement downtime is crucial for smooth large-scale production maintenance.

During the downtime, we would observe that the replacement is far slower than a node restart because the new node has to download all the missing segment data from either the deep store or peers before loading it into memory.

Therefore, a straightforward and effective way to reduce the downtime is to pre-download all required segments onto the new node (NN) before bringing down the old node (ON). Afterwards, bringing down the ON and starting up the NN is as lightweight as a node restart; a sketch of the pre-download step follows.
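
As a concrete illustration of the pre-download step, here is a minimal Java sketch that warms the NN's segment store ahead of the swap. It assumes a controller reachable at CONTROLLER_URL, a local PRELOAD_DIR on the NN, and that the ON's segment list has already been derived (e.g. from GET /tables/{tableName}/idealstate on the controller). GET /segments/{tableName}/{segmentName} is, to the best of my knowledge, the controller's segment-download endpoint; everything else (class name, paths, arguments) is hypothetical, and this is a sketch of the proposed workflow, not an existing Pinot feature.

```java
import java.io.InputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class SegmentPreloader {
  // Hypothetical values; adjust for the target cluster.
  private static final String CONTROLLER_URL = "http://pinot-controller:9000";
  private static final Path PRELOAD_DIR = Path.of("/data/pinot/preload");

  public static void main(String[] args) throws Exception {
    // args[0] is the table name (it may need to be the raw name or the name
    // with type suffix, depending on how the download URL is formed);
    // args[1..] is the list of segments hosted on the old node.
    String table = args[0];
    HttpClient client = HttpClient.newHttpClient();
    for (int i = 1; i < args.length; i++) {
      String segment = args[i];
      HttpRequest request = HttpRequest.newBuilder()
          .uri(URI.create(CONTROLLER_URL + "/segments/" + table + "/" + segment))
          .GET()
          .build();
      Path dest = PRELOAD_DIR.resolve(table).resolve(segment + ".tar.gz");
      Files.createDirectories(dest.getParent());
      // Stream the segment tarball to local disk so the new node starts warm
      // instead of paying the network transfer cost after the swap.
      HttpResponse<InputStream> response =
          client.send(request, HttpResponse.BodyHandlers.ofInputStream());
      try (InputStream in = response.body()) {
        Files.copy(in, dest, StandardCopyOption.REPLACE_EXISTING);
      }
      System.out.println("Pre-downloaded " + segment + " -> " + dest);
    }
  }
}
```

With all segments already on local disk, bringing down the ON and starting the NN reduces to loading warm local data, comparable in cost to a restart. A first-class implementation would additionally need to keep the preloaded copies in sync with segment changes (e.g. newly completed real-time segments) between the pre-download and the actual swap.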
