Bug Report
We use the Lettuce client to connect to AWS ElastiCache (Redis) with cluster mode enabled.
We have 5 shards with 3 nodes each (one of the 3 is the master). The replica node in shard 1 had degraded performance, so AWS triggered a replacement for it, which took 7 minutes. During this window we were not able to read from the primary, even though the master node was not impacted.
Current Behavior
Reads/writes against the master node fail while one of the replicas in the shard is being replaced.
We received two different types of errors during this window (see the sketch below):
Command timed out after [x] seconds
CLUSTERDOWN Hash slot not served
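For illustration only (this is not the reporter's code): with Lettuce's synchronous API, the two errors above surface differently in a Java application. A timed-out command throws RedisCommandTimeoutException, while error replies coming back from Redis, such as CLUSTERDOWN, surface as RedisCommandExecutionException. A minimal sketch, assuming a synchronous cluster command interface is already available:

```java
import io.lettuce.core.RedisCommandExecutionException;
import io.lettuce.core.RedisCommandTimeoutException;
import io.lettuce.core.cluster.api.sync.RedisAdvancedClusterCommands;

public class ErrorMappingSketch {

    // Hypothetical helper showing which Lettuce exception each observed error maps to.
    static String readOrNull(RedisAdvancedClusterCommands<String, String> commands, String key) {
        try {
            return commands.get(key);
        } catch (RedisCommandTimeoutException e) {
            // "Command timed out after [x] second(s)": the client-side command timeout expired.
            return null;
        } catch (RedisCommandExecutionException e) {
            // Error replies from the server, e.g. "CLUSTERDOWN Hash slot not served".
            return null;
        }
    }
}
```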
// your stack trace here;
Java Application
Input Code
// your code here;
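The code section above was left with the template placeholder. Purely as a point of reference, a minimal sketch of the kind of setup described in this report might look like the following; the endpoint and the ReadFrom setting are assumptions, not taken from the report, and the actual configuration is exactly what the reply below asks about.

```java
import io.lettuce.core.ReadFrom;
import io.lettuce.core.RedisURI;
import io.lettuce.core.cluster.RedisClusterClient;
import io.lettuce.core.cluster.api.StatefulRedisClusterConnection;
import io.lettuce.core.cluster.api.sync.RedisAdvancedClusterCommands;

public class ClusterConnectSketch {

    public static void main(String[] args) {
        // Hypothetical ElastiCache cluster-mode configuration endpoint.
        RedisURI uri = RedisURI.create("redis://my-cluster.example.clustercfg.cache.amazonaws.com:6379");

        RedisClusterClient clusterClient = RedisClusterClient.create(uri);
        StatefulRedisClusterConnection<String, String> connection = clusterClient.connect();

        // ReadFrom decides whether reads go to the master or to replicas; MASTER is an
        // assumption here - if reads are routed to replicas, a replica replacement can
        // affect read traffic directly.
        connection.setReadFrom(ReadFrom.MASTER);

        RedisAdvancedClusterCommands<String, String> commands = connection.sync();
        commands.set("key", "value");
        commands.get("key");

        connection.close();
        clusterClient.shutdown();
    }
}
```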
Expected no disruption in reads/writes to the master node.
Environment
Lettuce version(s): 5.1.5.RELEASE
Redis version: 5.0.9 (engine version)
Possible Solution
Additional context
07:51 AM PST - redis-0001-003 (primary) became unhealthy; we had some issues reading from it - this is expected from Lettuce
07:55 AM PST - master node redis-0001-003 continued to provide a degraded experience
07:56 AM PST - failover of the master node performed by AWS; redis-0001-002 became the new master (no impact during this time)
07:56 AM PST to 08:31 AM PST - redis-0001-003 was not available in the shard, however the other 2 nodes in the shard were active
08:31 AM PST - AWS triggered a replacement for redis-0001-003 (replica) since it was still in a degraded state. During this window, the application was not able to read or write from the master node
08:38 AM PST - complete application recovery; redis-0001-002 continued to be the primary and we were able to read/write from the client
Also, during this failure window (08:31 to 08:38) we see logs from ConnectionWatchdog trying to reconnect to redis-0001-003.
We need to understand why reads from the master node failed while the replica was being replaced.
The way I read this is that the driver was using the redis-0001-003 node even after it was replaced with redis-0001-002?
How is the driver configured? Do you have some topology update mechanism configured?
During such a failover the driver has no way to know that the - otherwise healthy - node was experiencing issues. Depending on how topology is updated and how the driver is configured it might continue trying to connect to the same node.
There are a lot of details missing, so I can't help much.
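For context on the topology update mechanism mentioned above: in Lettuce 5.x, periodic and adaptive topology refresh can be enabled via ClusterTopologyRefreshOptions so that replaced nodes are eventually dropped from the client's view. A minimal sketch (endpoint and refresh period are illustrative, not taken from the report):

```java
import java.time.Duration;

import io.lettuce.core.RedisURI;
import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.ClusterTopologyRefreshOptions;
import io.lettuce.core.cluster.RedisClusterClient;

public class TopologyRefreshSketch {

    public static void main(String[] args) {
        // Hypothetical endpoint; the reporter's actual URI is not given.
        RedisClusterClient clusterClient = RedisClusterClient.create(
                RedisURI.create("redis://my-cluster.example:6379"));

        // Re-fetch the cluster topology every 30 seconds and also when adaptive triggers
        // fire (e.g. MOVED/ASK redirects, persistent reconnect attempts).
        ClusterTopologyRefreshOptions refreshOptions = ClusterTopologyRefreshOptions.builder()
                .enablePeriodicRefresh(Duration.ofSeconds(30))
                .enableAllAdaptiveRefreshTriggers()
                .build();

        clusterClient.setOptions(ClusterClientOptions.builder()
                .topologyRefreshOptions(refreshOptions)
                .build());
    }
}
```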