
Redis Cluster Node Enabled - Failed to read from master when replica is being replaced #3231

Open
anushar04 opened this issue Mar 26, 2025 · 1 comment
Labels
status: waiting-for-feedback We need additional information before we can continue

Comments

@anushar04

Bug Report

We use the Lettuce client to connect to AWS ElastiCache (Redis) with cluster mode enabled.
We have 5 shards with 3 nodes each (one of the 3 is the master). The replica node in shard 1 had degraded performance, so AWS triggered a replacement for it, which took about 7 minutes. During this window we were not able to read from the primary, even though the master node itself was not impacted.

Current Behavior

Reads and writes against the master node fail while one of the replicas in the shard is being replaced.

We received 2 different types of errors during this window:
  • Command timed out after [x] seconds
  • CLUSTERDOWN Hash slot not served
// your stack trace here;

Java Application

Input Code
// your code here;

Expected no disruption in reads/writes against the master node.
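Whether reads can touch a replica at all depends on the connection's ReadFrom setting; by default Lettuce routes reads through the master. A minimal sketch of pinning reads to the master (the endpoint is a placeholder, not the actual application configuration):

```java
import io.lettuce.core.ReadFrom;
import io.lettuce.core.RedisURI;
import io.lettuce.core.cluster.RedisClusterClient;
import io.lettuce.core.cluster.api.StatefulRedisClusterConnection;

public class ReadFromExample {

    public static void main(String[] args) {
        // Placeholder endpoint; an ElastiCache deployment would use its configuration endpoint.
        RedisClusterClient client = RedisClusterClient.create(RedisURI.create("redis://localhost:7000"));

        StatefulRedisClusterConnection<String, String> connection = client.connect();

        // Route reads to the master only. Replica-reading settings
        // (SLAVE/REPLICA or the *_PREFERRED variants, depending on the Lettuce version)
        // send reads to replica nodes, so a degraded replica would affect read traffic directly.
        connection.setReadFrom(ReadFrom.MASTER);

        System.out.println(connection.sync().get("some-key"));

        connection.close();
        client.shutdown();
    }
}
```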

Environment

  • Lettuce version(s): 5.1.5.RELEASE
  • Redis version: 5.0.9 (ElastiCache engine version)

Possible Solution

Additional context

07:51 AM PST - redis-0001-003 (primary) became unhealthy; we had some issues reading from it, which is expected behavior from Lettuce
07:55 AM PST - continued degraded experience from master node redis-0001-003
07:56 AM PST - failover of the master node performed by AWS; redis-0001-002 became the new master (no impact during this time)
07:56 AM PST to 08:31 AM PST - redis-0001-003 was not available in the shard, however the other 2 nodes in the shard were active
08:31 AM PST - AWS triggered a replacement for redis-0001-003 (replica) since it was still in a degraded state. During this window, the application was not able to read from or write to the master node
08:38 AM PST - complete application recovery; redis-0001-002 continued to be the primary, and we were able to read/write from the client

Also, during this failure window (8:31 to 8:38) we see ConnectionWatchdog logs trying to reconnect to redis-0001-003.

We need to understand why reads from the master node failed while the replica was being replaced.
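One way to check whether the driver's topology view went stale during such a window is to dump the partitions it currently holds. This is only a sketch, assuming access to the application's RedisClusterClient instance, with a placeholder endpoint:

```java
import io.lettuce.core.RedisURI;
import io.lettuce.core.cluster.RedisClusterClient;

public class TopologyDump {

    public static void main(String[] args) {
        // Placeholder endpoint; an ElastiCache deployment would use its configuration endpoint.
        RedisClusterClient client = RedisClusterClient.create(RedisURI.create("redis://localhost:7000"));

        // Print the driver's current view of the cluster. If hash slots are still
        // mapped to a node that has been replaced, CLUSTERDOWN and timeout errors on
        // commands routed to those slots would be consistent with a stale topology view.
        client.getPartitions().forEach(node ->
                System.out.println(node.getNodeId() + " " + node.getUri()
                        + " role=" + node.getRole() + " slots=" + node.getSlots().size()));

        client.shutdown();
    }
}
```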

@tishun
Collaborator

tishun commented Mar 27, 2025

Hey @anushar04 ,

The way I read this is that the driver was using the redis-0001-003 node even after it was replaced with redis-0001-002?
How is the driver configured? Do you have some topology update mechanism configured?

During such a failover the driver has no way to know that the - otherwise healthy - node was experiencing issues. Depending on how topology is updated and how the driver is configured it might continue trying to connect to the same node.

There are a lot of details missing, so I can't help much.
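For reference, enabling periodic and adaptive topology refresh in Lettuce looks roughly like this; a minimal sketch in which the endpoint and refresh interval are placeholders, not values taken from this report:

```java
import java.time.Duration;

import io.lettuce.core.RedisURI;
import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.ClusterTopologyRefreshOptions;
import io.lettuce.core.cluster.RedisClusterClient;

public class TopologyRefreshExample {

    public static void main(String[] args) {
        // Placeholder endpoint; an ElastiCache deployment would use its configuration endpoint.
        RedisClusterClient client = RedisClusterClient.create(RedisURI.create("redis://localhost:7000"));

        // Re-read the cluster topology on a schedule and whenever adaptive triggers
        // fire (MOVED/ASK redirects, persistent reconnect attempts), so a replaced
        // node is dropped from the driver's view instead of being retried indefinitely.
        ClusterTopologyRefreshOptions refreshOptions = ClusterTopologyRefreshOptions.builder()
                .enablePeriodicRefresh(Duration.ofSeconds(30))
                .enableAllAdaptiveRefreshTriggers()
                .build();

        client.setOptions(ClusterClientOptions.builder()
                .topologyRefreshOptions(refreshOptions)
                .build());

        // ... connect and use the client as usual ...
        client.shutdown();
    }
}
```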

@tishun tishun added status: waiting-for-feedback We need additional information before we can continue and removed status: waiting-for-triage labels Mar 27, 2025