I'm experiencing an issue that seems similar to #2940 in that Lettuce does not seem to be resubscribing Sharded PubSub subscriptions automatically, except I'm using Lettuce 6.5.5.RELEASE, in which the referenced issue should have been fixed.
I may be misunderstanding how things are supposed to work or have misconfigured something, so if that's the case, please let me know.
Current Behavior
Assume that you have a number of applications connected to a Redis cluster with two shards, using .connectPubSub(...).async().ssubscribe(topic).... The subscriptions are distributed across the two shards.
Then, remove one shard (either manually, or via autoscaling policy, such as in an AWS ElastiCache deployment).
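For concreteness, the subscription setup described above can be sketched roughly as follows. This is a simplified version, not my exact configuration: the URI, channel name, and listener body are placeholders.

```java
import io.lettuce.core.cluster.RedisClusterClient;
import io.lettuce.core.cluster.models.partitions.RedisClusterNode;
import io.lettuce.core.cluster.pubsub.RedisClusterPubSubAdapter;
import io.lettuce.core.cluster.pubsub.StatefulRedisClusterPubSubConnection;

public class ShardedSubscriber {

    public static void main(String[] args) {
        // Placeholder URI; in my case this points at the cluster's configuration endpoint.
        RedisClusterClient client = RedisClusterClient.create("redis://localhost:6379");

        StatefulRedisClusterPubSubConnection<String, String> connection = client.connectPubSub();

        // As in my real listener, only message(...) is overridden.
        connection.addListener(new RedisClusterPubSubAdapter<String, String>() {
            @Override
            public void message(RedisClusterNode node, String channel, String message) {
                System.out.println(channel + ": " + message);
            }
        });

        // Sharded subscription; the slot of "my-topic" determines which shard serves it.
        connection.async().ssubscribe("my-topic");
    }
}
```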
By watching debug logs, it appears that when subscriptions are made with the regular non-sharded .subscribe call, once Lettuce is disconnected from the shard that is going away and reconnects to the new shard, it issues another SUBSCRIBE command. This can also be verified by connecting to the Redis cluster via the CLI and running CLIENT LIST (I can see the connection that was transferred over, and that the latest command run on that connection was subscribe) and PUBSUB CHANNELS.
However, when subscriptions are made with the sharded .ssubscribe, Lettuce reconnects to the new shard, but there are no debug logs indicating that an SSUBSCRIBE command was issued. Connecting via the CLI, CLIENT LIST shows that the application did successfully reconnect to the new shard, but the latest command is cluster|myid instead of ssubscribe, and PUBSUB SHARDCHANNELS shows only the subscriptions that were originally created on that shard and none from the transferred connections.
This difference in behavior (where SUBSCRIBE reconnects successfully, but SSUBSCRIBE does not) also applies to test-initiated failovers (initiated via the AWS ElastiCache console), with the same outcome.
The result is that some number of Sharded PubSub messages are lost because there are no active subscribers for those messages.
Input Code
I can paste my Lettuce client configuration if desired or if that would be helpful, in case you think this might be a problem with my configuration.
Expected behavior/code
Since SUBSCRIBE seems to automatically resubscribe on failovers and auto-scale-in for Redis clusters, I would have expected SSUBSCRIBE to also do the same.
Environment
Lettuce version(s): 6.5.5.RELEASE
Redis version: 7.2.4 (Valkey 8.0.1)
Additional context
As a tangentially related question, does Lettuce handle slot movement/rebalancing for SSUBSCRIBE, such as when nodes are added and slots are redistributed? I couldn't really find documentation on how slot movement works with Sharded Pub/Sub in general, and I'm not familiar enough with Lettuce and Redis to figure it out from reading the code, though I gave it a try. Mostly, my concern is whether it's something I'd need to implement myself, or whether it's handled by Lettuce.
I did some additional testing and created a simple Java console application, with only the Lettuce 6.5.5.RELEASE client and Netty 4.1.119.Final. I configured six RedisClusterClient instances: two clients handling publish and spublish respectively, two subscriber clients, and two sharded subscriber clients (for a total of four topics). I was able to confirm that sharded subscriptions can stop receiving messages both when adding new nodes to the ElastiCache cluster (scaling up) and when removing nodes (scaling down).
In both cases, the non-sharded subscriptions seemed unaffected, which differs from the sharded behavior.
If it isn't a configuration mistake on my end, and it isn't intended behavior (am I perhaps supposed to override the sunsubscribed method in my RedisPubSubListener, or listen for some topology refresh event, and manually re-establish the sharded subscription when slots move?), then this seems to be an unexpected bug of some kind.
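To illustrate the kind of manual workaround I have in mind, here is a minimal sketch, assuming that listening on the client's event bus for cluster topology changes is the right hook (the URI and channel name are placeholders, and re-issuing SSUBSCRIBE unconditionally is just the simplest possible strategy):

```java
import io.lettuce.core.cluster.RedisClusterClient;
import io.lettuce.core.cluster.event.ClusterTopologyChangedEvent;
import io.lettuce.core.cluster.pubsub.StatefulRedisClusterPubSubConnection;

public class ResubscribeWorkaround {

    public static void main(String[] args) {
        RedisClusterClient client = RedisClusterClient.create("redis://localhost:6379");
        StatefulRedisClusterPubSubConnection<String, String> connection = client.connectPubSub();

        connection.async().ssubscribe("my-topic");

        // On every topology change, blindly re-issue SSUBSCRIBE. This is redundant when
        // the subscription survived, but (if this approach is sound) it would restore the
        // subscription when the owning shard was removed or the slot moved.
        client.getResources().eventBus().get()
                .filter(ClusterTopologyChangedEvent.class::isInstance)
                .subscribe(event -> connection.async().ssubscribe("my-topic"));
    }
}
```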
Without pasting the code, the general options I have configured are:
- ClientResources using a custom DnsAddressResolverGroup to specify things like the NioDatagramChannel type, DnsNameResolverChannelStrategy.ChannelPerResolution, retries on timeout, TCP fallback, no DNS caching, etc.
- The implementation of RedisPubSubListener only overrides message(...)
- SocketOptions with extended keepalive options enabled, idle set to 30 seconds and an interval of a few seconds after idle
- TcpUserTimeout set to 30 seconds
- ConnectTimeout set to 10 seconds
- PeriodicRefresh set to 60 seconds
- DynamicRefreshSources set to false
- EnableAllAdaptiveRefreshTriggers set to true
- CloseStaleConnections set to true
- NodeFilter enabled, filtering out NodeFlag.FAIL and NodeFlag.EVENTUAL_FAIL
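The shape of that configuration is roughly the following sketch (durations match the list above; the custom DNS resolver setup is elided, and the keepalive interval is a stand-in for "a few seconds"):

```java
import java.time.Duration;

import io.lettuce.core.SocketOptions;
import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.ClusterTopologyRefreshOptions;
import io.lettuce.core.cluster.models.partitions.RedisClusterNode.NodeFlag;

public class ClientOptionsSketch {

    public static ClusterClientOptions build() {
        SocketOptions socketOptions = SocketOptions.builder()
                .connectTimeout(Duration.ofSeconds(10))
                .keepAlive(SocketOptions.KeepAliveOptions.builder()
                        .enable()
                        .idle(Duration.ofSeconds(30))
                        .interval(Duration.ofSeconds(5)) // stand-in for "a few seconds"
                        .build())
                .tcpUserTimeout(SocketOptions.TcpUserTimeoutOptions.builder()
                        .enable()
                        .tcpUserTimeout(Duration.ofSeconds(30))
                        .build())
                .build();

        ClusterTopologyRefreshOptions refreshOptions = ClusterTopologyRefreshOptions.builder()
                .enablePeriodicRefresh(Duration.ofSeconds(60))
                .dynamicRefreshSources(false)
                .enableAllAdaptiveRefreshTriggers()
                .closeStaleConnections(true)
                .build();

        return ClusterClientOptions.builder()
                .socketOptions(socketOptions)
                .topologyRefreshOptions(refreshOptions)
                // Filter out nodes flagged FAIL or EVENTUAL_FAIL.
                .nodeFilter(node -> !(node.getFlags().contains(NodeFlag.FAIL)
                        || node.getFlags().contains(NodeFlag.EVENTUAL_FAIL)))
                .build();
    }
}
```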