
Redis Cluster Node Enabled - Failed to read from master when replica is being replaced #3231

Open
anushar04 opened this issue Mar 26, 2025 · 1 comment
Labels
status: waiting-for-feedback We need additional information before we can continue

Comments

@anushar04

Bug Report

We use the Lettuce client to connect to AWS ElastiCache (Redis) with cluster mode enabled.
We have 5 shards with 3 nodes each (one of the 3 is the master). The replica node in shard 1 had degraded performance, so AWS triggered a replacement for it, which took about 7 minutes. During this window we were not able to read from the primary, even though the master node itself was not impacted.

Current Behavior

Reads and writes against the master node fail while one of the replicas in the shard is being replaced.

We received 2 different types of errors during this window:
  • Command timed out after [x] seconds
  • CLUSTERDOWN Hash slot not served
// your stack trace here;

Java Application

Input Code
// your code here;

Expected no disruption in reads/writes against the master node.
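Whether reads can touch a replica at all depends on the connection's ReadFrom setting; by default Lettuce routes reads through the master. A minimal sketch of pinning reads to the master (the endpoint is a placeholder, not the actual application configuration):

```java
import io.lettuce.core.ReadFrom;
import io.lettuce.core.RedisURI;
import io.lettuce.core.cluster.RedisClusterClient;
import io.lettuce.core.cluster.api.StatefulRedisClusterConnection;

public class ReadFromExample {

    public static void main(String[] args) {
        // Placeholder endpoint; an ElastiCache deployment would use its configuration endpoint.
        RedisClusterClient client = RedisClusterClient.create(RedisURI.create("redis://localhost:7000"));

        StatefulRedisClusterConnection<String, String> connection = client.connect();

        // Route reads to the master only. Replica-reading settings
        // (SLAVE/REPLICA or the *_PREFERRED variants, depending on the Lettuce version)
        // send reads to replica nodes, so a degraded replica would affect read traffic directly.
        connection.setReadFrom(ReadFrom.MASTER);

        System.out.println(connection.sync().get("some-key"));

        connection.close();
        client.shutdown();
    }
}
```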

Environment

  • Lettuce version(s): 5.1.5.RELEASE
  • Redis version: 5.0.9 (ElastiCache engine version)

Possible Solution

Additional context

07:51 AM PST - redis-0001-003 (primary) became unhealthy; we had some issues reading from it, which is expected behavior from Lettuce
07:55 AM PST - continued degraded experience from master node redis-0001-003
07:56 AM PST - failover of the master node performed by AWS; redis-0001-002 became the new master (no impact during this time)
07:56 AM PST to 08:31 AM PST - redis-0001-003 was not available in the shard, however the other 2 nodes in the shard were active
08:31 AM PST - AWS triggered a replacement for redis-0001-003 (replica) since it was still in a degraded state. During this window, the application was not able to read from or write to the master node
08:38 AM PST - complete application recovery; redis-0001-002 continued to be the primary, and we were able to read/write from the client

Also, during this failure window (8:31 to 8:38) we see ConnectionWatchdog logs trying to reconnect to redis-0001-003.

We need to understand why reads from the master node failed while the replica was being replaced.
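One way to check whether the driver's topology view went stale during such a window is to dump the partitions it currently holds. This is only a sketch, assuming access to the application's RedisClusterClient instance, with a placeholder endpoint:

```java
import io.lettuce.core.RedisURI;
import io.lettuce.core.cluster.RedisClusterClient;

public class TopologyDump {

    public static void main(String[] args) {
        // Placeholder endpoint; an ElastiCache deployment would use its configuration endpoint.
        RedisClusterClient client = RedisClusterClient.create(RedisURI.create("redis://localhost:7000"));

        // Print the driver's current view of the cluster. If hash slots are still
        // mapped to a node that has been replaced, CLUSTERDOWN and timeout errors on
        // commands routed to those slots would be consistent with a stale topology view.
        client.getPartitions().forEach(node ->
                System.out.println(node.getNodeId() + " " + node.getUri()
                        + " role=" + node.getRole() + " slots=" + node.getSlots().size()));

        client.shutdown();
    }
}
```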

@tishun
Collaborator

tishun commented Mar 27, 2025

Hey @anushar04 ,

The way I read this is that the driver was using the redis-0001-003 node even after it was replaced with redis-0001-002?
How is the driver configured? Do you have some topology update mechanism configured?

During such a failover the driver has no way to know that the - otherwise healthy - node was experiencing issues. Depending on how topology is updated and how the driver is configured it might continue trying to connect to the same node.

There are a lot of details missing, so I can't help much.
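For reference, enabling periodic and adaptive topology refresh in Lettuce looks roughly like this; a minimal sketch in which the endpoint and refresh interval are placeholders, not values taken from this report:

```java
import java.time.Duration;

import io.lettuce.core.RedisURI;
import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.ClusterTopologyRefreshOptions;
import io.lettuce.core.cluster.RedisClusterClient;

public class TopologyRefreshExample {

    public static void main(String[] args) {
        // Placeholder endpoint; an ElastiCache deployment would use its configuration endpoint.
        RedisClusterClient client = RedisClusterClient.create(RedisURI.create("redis://localhost:7000"));

        // Re-read the cluster topology on a schedule and whenever adaptive triggers
        // fire (MOVED/ASK redirects, persistent reconnect attempts), so a replaced
        // node is dropped from the driver's view instead of being retried indefinitely.
        ClusterTopologyRefreshOptions refreshOptions = ClusterTopologyRefreshOptions.builder()
                .enablePeriodicRefresh(Duration.ofSeconds(30))
                .enableAllAdaptiveRefreshTriggers()
                .build();

        client.setOptions(ClusterClientOptions.builder()
                .topologyRefreshOptions(refreshOptions)
                .build());

        // ... connect and use the client as usual ...
        client.shutdown();
    }
}
```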

@tishun tishun added status: waiting-for-feedback We need additional information before we can continue and removed status: waiting-for-triage labels Mar 27, 2025