Don't deplete all the startup nodes after ConnectionError/TimeoutError #3697
base: master
Conversation
… or TimeoutError against all nodes; rather, keep one around so that the retry algorithm has at least one node to work with
Pull Request Overview
This PR updates the cluster command execution logic to avoid removing the last remaining startup node on connection or timeout failures, ensuring retries can still proceed.
- Preserve one startup node when all others fail, and wrap the original exception in a RedisClusterException
- Only remove failed nodes if more than one startup node remains
- Re-raise the appropriate exception after forcing a cluster layout reinitialization (a simplified sketch of this flow follows the list)
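To make the intended control flow concrete, here is a minimal, self-contained sketch of the behaviour the overview describes. It is a simplification, not the redis-py implementation: handle_node_failure is a hypothetical helper, and the startup_nodes dict stands in for the name-to-node mapping kept by NodesManager.

from redis.exceptions import RedisClusterException


def handle_node_failure(
    startup_nodes: dict, failed_node_name: str, error: Exception
) -> Exception:
    """Drop the failed node unless it is the last startup node left."""
    if len(startup_nodes) > 1:
        # Other startup nodes remain, so the retry loop still has targets.
        startup_nodes.pop(failed_node_name, None)
        return error
    # Last startup node: keep it so a retry has somewhere to go, and wrap
    # the original error in a clearer, cluster-level exception.
    wrapped = RedisClusterException(
        "Redis Cluster cannot be connected. "
        "Connection or Timeout Errors across all startup nodes"
    )
    wrapped.__cause__ = error
    return wrapped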
Comments suppressed due to low confidence (2)
redis/asyncio/cluster.py:824
- [nitpick] The error message could be more descriptive and grammatically clear, e.g., 'Unable to connect to Redis Cluster: connection or timeout errors on all startup nodes'.
'Connection or Timeout Errors across all startup nodes'
redis/asyncio/cluster.py:820
- Add a unit test covering the scenario where only one startup node remains, to ensure it isn't removed and that the RedisClusterException is raised with the original exception as its cause (a rough sketch of such a test follows the code below).
if len(self.nodes_manager.startup_nodes) == 1:
    ce = RedisClusterException(
        'Redis Cluster cannot be connected. '
        'Connection or Timeout Errors across all startup nodes'
    )
    ce.__cause__ = e
    e = ce
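A rough sketch of the kind of test asked for above, written against the hypothetical handle_node_failure helper from the earlier sketch rather than the real _execute_command path (which would require mocking cluster connections). The node name and error message are illustrative only.

from redis.exceptions import ConnectionError, RedisClusterException


def test_last_startup_node_is_kept_and_error_is_wrapped():
    # Assumes handle_node_failure from the sketch above is importable/in scope.
    startup_nodes = {"127.0.0.1:7000": object()}  # only one startup node left
    original = ConnectionError("node unreachable")

    result = handle_node_failure(startup_nodes, "127.0.0.1:7000", original)

    assert len(startup_nodes) == 1  # the last node must not be removed
    assert isinstance(result, RedisClusterException)
    assert result.__cause__ is original  # original error preserved as the cause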
[nitpick] Reassigning the caught exception variable e to a new exception can be confusing; consider raising the new RedisClusterException directly or using a separate variable name for clarity.
Suggested change:
-     ce = RedisClusterException(
-         'Redis Cluster cannot be connected. '
-         'Connection or Timeout Errors across all startup nodes'
-     )
-     ce.__cause__ = e
-     e = ce
+     raise RedisClusterException(
+         'Redis Cluster cannot be connected. '
+         'Connection or Timeout Errors across all startup nodes'
+     ) from e
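As background for the suggestion above (not part of the PR): raise ... from e records the original exception as __cause__, which is the same chaining the manual assignment achieves, so nothing is lost by raising directly. A tiny standalone illustration:

try:
    try:
        raise ConnectionError("node unreachable")
    except ConnectionError as e:
        raise RuntimeError("cluster cannot be reached") from e
except RuntimeError as wrapped:
    # `from e` set the cause, just like assigning wrapped.__cause__ = e by hand.
    assert isinstance(wrapped.__cause__, ConnectionError)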
@eoghanmurray, hello!
Thanks for the PR!
We also ran into a similar error, where redis removes all the startup nodes from the pool and goes into an endless retry loop.
Would you like to take a look at the comments from Copilot?
I'd like to see this PR merged as soon as possible :)
Hi @eoghanmurray, thank you for your PR!
Don't deplete all the startup nodes after ConnectionError/TimeoutError against all nodes; rather, keep one around so that the retry algorithm has at least one node to work with.
Description of change
See bug report #3693