JetStream consumer does not resume receiving messages after successful reconnect #1729
Comments
Hello @njkleiner, thanks for creating the issue. I'll look at this, but in the meantime could you please check whether your consumer is still there on the server after the reconnect? I'm obviously not saying that's the case, and I'll be looking at this regardless, but this is a pretty common cause, so it would be nice if you could check.
Sure, I can take a look. But if this turns out to be the case, I would still argue that the client should not continue blocking indefinitely -- upon a successful reconnect -- when a consumer is deleted server-side (and instead return an error immediately, on reconnect).
@njkleiner I agree with you, but there is not much we can do to make it better. We do not have a way of knowing that the consumer has been deleted unless it happened during an active pull request, which results in us getting a "Consumer Deleted" error on that request.
If the consumer does heartbeats, the client will detect that it is gone eventually.
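For context, a minimal sketch of requesting idle heartbeats on a Consume call with the jetstream API; the interval, names, and error logging below are illustrative and are not taken from this issue:

```go
package example

import (
	"log"
	"time"

	"github.com/nats-io/nats.go/jetstream"
)

// startConsuming requests idle heartbeats from the server so that a consumer
// that has gone away eventually surfaces as "no heartbeat" errors on the
// client. cons is an already-looked-up jetstream.Consumer; the five-second
// interval is an arbitrary example value.
func startConsuming(cons jetstream.Consumer) (jetstream.ConsumeContext, error) {
	return cons.Consume(
		func(msg jetstream.Msg) {
			// Process and acknowledge the message.
			_ = msg.Ack()
		},
		jetstream.PullHeartbeat(5*time.Second),
		jetstream.ConsumeErrHandler(func(_ jetstream.ConsumeContext, err error) {
			// Missed heartbeats and other consume errors are reported here.
			log.Printf("consume error: %v", err)
		}),
	)
}
```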
I haven't had time to take a detailed look yet, but both of the example logs I have provided where the consumer gets "stuck" appear to represent cases where there was a disconnect of at least five seconds. So you might be right about the consumer being deleted on the server side @piotrpio, I will investigate further next week.

@derekcollison I am not entirely sure how to interpret your comment. As I originally stated, a "stuck" consumer eventually enters an indefinite state of heartbeat errors, so the client did detect that it is "gone" in that sense. However, I have not treated these errors as unrecoverable so far -- my reasoning was that, if I assume an unstable connection where reconnects may occur, heartbeat timeouts are expected as well, and that a successful reconnect implies the heartbeat timeouts should, in principle, eventually recover too. Are you saying that indefinite heartbeat errors as described are indicative of a consumer that was deleted server-side? And, if so, is there a way to distinguish these heartbeat errors from other (recoverable) heartbeat errors on the client side?
I think if the system can delete consumers out from underneath apps, then heartbeats should be required, and if a heartbeat fails, do a consumer info to determine whether the consumer still exists; if not, take appropriate action to resolve it.
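A sketch of that approach, assuming the jetstream API (names and intervals are placeholders): on a missed heartbeat, probe the server with Consumer.Info and stop consuming only if the consumer is actually gone.

```go
package example

import (
	"context"
	"errors"
	"log"
	"time"

	"github.com/nats-io/nats.go/jetstream"
)

// consumeWithProbe consumes messages and, whenever a heartbeat is missed,
// asks the server whether the consumer still exists. If it has been deleted,
// consumption is stopped so the application can recreate it and start over.
func consumeWithProbe(cons jetstream.Consumer) (jetstream.ConsumeContext, error) {
	return cons.Consume(
		func(msg jetstream.Msg) {
			_ = msg.Ack()
		},
		jetstream.PullHeartbeat(5*time.Second),
		jetstream.ConsumeErrHandler(func(cc jetstream.ConsumeContext, err error) {
			if !errors.Is(err, jetstream.ErrNoHeartbeat) {
				log.Printf("consume error: %v", err)
				return
			}
			// Heartbeat missed: check whether the consumer is still there.
			ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
			defer cancel()
			if _, infoErr := cons.Info(ctx); errors.Is(infoErr, jetstream.ErrConsumerNotFound) {
				log.Println("consumer no longer exists on the server; stopping")
				cc.Stop() // the caller can now recreate the consumer
			}
		}),
	)
}
```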
Let's clear up some points.

1. Missing heartbeats: A missing heartbeat indicates that the client did not receive a heartbeat message from the server-side consumer in time. This can be due to several factors, including:
• Temporary unavailability of JetStream

While a few missed heartbeats may be transient and recoverable, persistent misses over a longer period suggest it's worth checking what is happening with the consumer. There is no definitive threshold for how many missed heartbeats are too many, or for whether this should be treated as a terminal error; it largely depends on your specific architecture, topology, and infrastructure.

2. Handling terminal errors: We initially treated some errors as terminal, but realized we cannot always determine whether an error is truly terminal. In many scenarios, consumers are managed separately from where messages are consumed; systems managing streams and consumers might independently create or delete them, and that should not force action by consuming apps. Because of this, we currently believe that even a "Consumer Deleted" error should not be considered terminal. Making errors terminal forces users to take immediate action; by treating errors as non-terminal, we instead provide users with all available information and options.
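To make the "no definitive threshold" point concrete, one possible way an application could encode its own policy is sketched below; the miss count and the callback are arbitrary choices, not a library recommendation.

```go
package example

import (
	"errors"
	"log"
	"sync/atomic"

	"github.com/nats-io/nats.go/jetstream"
)

// heartbeatPolicy treats missed heartbeats as non-terminal but, after a run of
// consecutive misses, invokes onSuspect so the application can decide what to
// do (check consumer info, recreate the consumer, alert an operator, ...).
// Note: this sketch only resets the counter when the threshold fires or a
// different error arrives; a real implementation would also reset it on
// successful deliveries.
func heartbeatPolicy(maxMisses int64, onSuspect func()) func(jetstream.ConsumeContext, error) {
	var misses atomic.Int64
	return func(_ jetstream.ConsumeContext, err error) {
		if !errors.Is(err, jetstream.ErrNoHeartbeat) {
			misses.Store(0)
			log.Printf("consume error: %v", err)
			return
		}
		if misses.Add(1) >= maxMisses {
			misses.Store(0)
			onSuspect()
		}
	}
}
```

It could be wired in via jetstream.ConsumeErrHandler(heartbeatPolicy(3, recreateConsumer)), where recreateConsumer is a hypothetical callback implementing whatever recovery the application prefers.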
@piotrpio It would appear that the issue was in fact that the consumer was deleted server-side after the idle timeout had expired. Setting a higher idle timeout indeed seems to work, in the sense that I have been unable to reproduce the "stuck" behavior since. I have also encountered additional, isolated problems related to heartbeat errors and Go timer behavior. Unfortunately, I am not able to reproduce those problems anymore or provide more information, as we have since chosen to implement a different approach for consuming from JetStream. Either way, there does not appear to be a defect in the library w.r.t. the issue at hand.
Observed behavior
I am experiencing a bug where a JetStream consumer will sometimes become "stuck" after a reconnect.
Concretely, a consumer will sometimes not resume receiving messages and instead reach a state
where it permanently throws "no heartbeat" errors after a successful reconnect to the server.
Expected behavior
The JetStream consumer should resume receiving messages after a successful reconnect (which it sometimes does).
Server and client version
I am using the main branch of nats.go as of commit c7cf3452dd6359bdf40cbad0c39d900cbeba81e2.

Host environment
I am running these tests using Go 1.23.2 (darwin/arm64), with the asynctimerchan=1 GODEBUG setting enabled.

I am using the Consumer.Consume method and the following ConsumerConfig:
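The configuration itself is not reproduced in this issue. Purely for illustration, a config of this general shape could be used; all names and values below are placeholders rather than the reporter's actual settings, and InactiveThreshold is presumably the "idle timeout" discussed in the comments above.

```go
package example

import (
	"context"
	"log"
	"time"

	"github.com/nats-io/nats.go/jetstream"
)

// setupConsumer is an illustrative sketch only; it is not the reporter's code.
// js is a jetstream.JetStream created with jetstream.New(nc).
func setupConsumer(js jetstream.JetStream) (jetstream.ConsumeContext, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	cons, err := js.CreateOrUpdateConsumer(ctx, "EVENTS", jetstream.ConsumerConfig{
		Durable:           "worker",   // placeholder name
		FilterSubject:     "events.>", // placeholder subject
		AckPolicy:         jetstream.AckExplicitPolicy,
		AckWait:           30 * time.Second,
		MaxAckPending:     1000,
		InactiveThreshold: 5 * time.Minute, // consumer is removed server-side after this much inactivity
	})
	if err != nil {
		return nil, err
	}

	return cons.Consume(func(msg jetstream.Msg) {
		log.Printf("received: %s", msg.Subject())
		_ = msg.Ack()
	})
}
```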
Steps to reproduce
After a lot of manual debugging, I believe the issue is a race condition in the core NATS code, where a call to Subscription.pCond.Wait will block forever, because no subsequent call to pCond.Signal or pCond.Broadcast ever occurs, in spite of the interrupted connection having been successfully reconnected.

I have attached two example logs each that demonstrate a "stuck" consumer and a consumer that is not "stuck", respectively. See the attached patch for the context w.r.t. the debug messages in these logs. (A generic illustration of the blocking pattern described here follows the attachments below.)
stuck2.log
notstuck2.log
stuck.log
notstuck.log
0001-add-debug-messages.patch
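As a generic illustration of the failure mode described above (plain sync.Cond, not the actual nats.go internals): a goroutine parked in cond.Wait stays parked forever unless some other goroutine changes the awaited condition and calls Signal or Broadcast.

```go
package example

import "sync"

// queue is a minimal blocking queue used only to illustrate the pattern.
type queue struct {
	mu    sync.Mutex
	cond  *sync.Cond
	items []string
}

func newQueue() *queue {
	q := &queue{}
	q.cond = sync.NewCond(&q.mu)
	return q
}

// pop blocks until an item is available.
func (q *queue) pop() string {
	q.mu.Lock()
	defer q.mu.Unlock()
	for len(q.items) == 0 {
		// If no producer ever calls push (and hence Broadcast) again after a
		// reconnect, this Wait never returns -- the "stuck" consumer.
		q.cond.Wait()
	}
	item := q.items[0]
	q.items = q.items[1:]
	return item
}

// push adds an item and wakes any waiting consumers.
func (q *queue) push(item string) {
	q.mu.Lock()
	q.items = append(q.items, item)
	q.mu.Unlock()
	q.cond.Broadcast()
}
```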