JetStream consumer does not resume receiving messages after successful reconnect #1729
Comments
Hello @njkleiner, thanks for creating the issue. I'll look at this, but in the meantime could you please check whether your consumer is still there on the server after the reconnect? I'm obviously not saying that's the case, and I'll be looking at this regardless, but this is a pretty common cause, so it would be nice if you could check.
Sure, I can take a look. But if this turns out to be the case, I would still argue that the client should not continue blocking indefinitely -- upon a successful reconnect -- when a consumer is deleted server-side (and instead return an error immediately, on reconnect).
@njkleiner I agree with you, but there is not much we can do to make it better. We do not have a way of knowing that the consumer has been deleted unless it happened during an active pull request, which results in us getting a "Consumer Deleted" error on that request.
If the consumer does heartbeats, the client will detect that it is gone eventually.
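For context, a minimal sketch of requesting idle heartbeats on a Consume call with the jetstream API; the interval, names, and error logging below are illustrative and are not taken from this issue:

```go
package example

import (
	"log"
	"time"

	"github.com/nats-io/nats.go/jetstream"
)

// startConsuming requests idle heartbeats from the server so that a consumer
// that has gone away eventually surfaces as "no heartbeat" errors on the
// client. cons is an already-looked-up jetstream.Consumer; the five-second
// interval is an arbitrary example value.
func startConsuming(cons jetstream.Consumer) (jetstream.ConsumeContext, error) {
	return cons.Consume(
		func(msg jetstream.Msg) {
			// Process and acknowledge the message.
			_ = msg.Ack()
		},
		jetstream.PullHeartbeat(5*time.Second),
		jetstream.ConsumeErrHandler(func(_ jetstream.ConsumeContext, err error) {
			// Missed heartbeats and other consume errors are reported here.
			log.Printf("consume error: %v", err)
		}),
	)
}
```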
I haven't had time to take a detailed look yet, but both of the example logs I have provided where the consumer gets "stuck" appear to represent cases where there was a disconnect of at least five seconds. So you might be right about the consumer being deleted on the server side @piotrpio, I will investigate further next week.

@derekcollison I am not entirely sure how to interpret your comment. As I originally stated, a "stuck" consumer eventually enters an indefinite state of heartbeat errors, so the client did detect that it is "gone" in that sense. However, I have not treated these errors as unrecoverable so far -- my reasoning was that, if I assume an unstable connection where reconnects may occur, heartbeat timeouts are expected as well, and that a successful reconnect implies the heartbeat timeouts should, in principle, eventually recover too. Are you saying that indefinite heartbeat errors as described are indicative of a consumer that was deleted server-side? And, if so, is there a way to distinguish these heartbeat errors from other (recoverable) heartbeat errors on the client side?
I think if the system can delete consumers out from underneath apps, then heartbeats should be required, and if a heartbeat fails, do a consumer info to determine whether the consumer still exists; if not, take appropriate action to resolve it.
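A sketch of that approach, assuming the jetstream API (names and intervals are placeholders): on a missed heartbeat, probe the server with Consumer.Info and stop consuming only if the consumer is actually gone.

```go
package example

import (
	"context"
	"errors"
	"log"
	"time"

	"github.com/nats-io/nats.go/jetstream"
)

// consumeWithProbe consumes messages and, whenever a heartbeat is missed,
// asks the server whether the consumer still exists. If it has been deleted,
// consumption is stopped so the application can recreate it and start over.
func consumeWithProbe(cons jetstream.Consumer) (jetstream.ConsumeContext, error) {
	return cons.Consume(
		func(msg jetstream.Msg) {
			_ = msg.Ack()
		},
		jetstream.PullHeartbeat(5*time.Second),
		jetstream.ConsumeErrHandler(func(cc jetstream.ConsumeContext, err error) {
			if !errors.Is(err, jetstream.ErrNoHeartbeat) {
				log.Printf("consume error: %v", err)
				return
			}
			// Heartbeat missed: check whether the consumer is still there.
			ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
			defer cancel()
			if _, infoErr := cons.Info(ctx); errors.Is(infoErr, jetstream.ErrConsumerNotFound) {
				log.Println("consumer no longer exists on the server; stopping")
				cc.Stop() // the caller can now recreate the consumer
			}
		}),
	)
}
```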
Let's clear up some points.

1. Missing heartbeats: A missing heartbeat indicates that the client did not receive a heartbeat message from the server-side consumer in time. This can be due to several factors, including:
• Temporary unavailability of JetStream

While a few missed heartbeats may be transient and recoverable, persistent misses over a longer period suggest it's worth checking what is happening with the consumer. There is no definitive threshold for how many missed heartbeats are too many, or for whether this should be treated as a terminal error; it largely depends on your specific architecture, topology, and infrastructure.

2. Handling terminal errors: We initially treated some errors as terminal, but realized we cannot always determine whether an error is truly terminal. In many scenarios, consumers are managed separately from where messages are consumed; systems managing streams and consumers might independently create or delete them, and that should not force action by consuming apps. Because of this, we currently believe that even a "Consumer Deleted" error should not be considered terminal. Making errors terminal forces users to take immediate action; by treating errors as non-terminal, we instead provide users with all available information and options.
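To make the "no definitive threshold" point concrete, one possible way an application could encode its own policy is sketched below; the miss count and the callback are arbitrary choices, not a library recommendation.

```go
package example

import (
	"errors"
	"log"
	"sync/atomic"

	"github.com/nats-io/nats.go/jetstream"
)

// heartbeatPolicy treats missed heartbeats as non-terminal but, after a run of
// consecutive misses, invokes onSuspect so the application can decide what to
// do (check consumer info, recreate the consumer, alert an operator, ...).
// Note: this sketch only resets the counter when the threshold fires or a
// different error arrives; a real implementation would also reset it on
// successful deliveries.
func heartbeatPolicy(maxMisses int64, onSuspect func()) func(jetstream.ConsumeContext, error) {
	var misses atomic.Int64
	return func(_ jetstream.ConsumeContext, err error) {
		if !errors.Is(err, jetstream.ErrNoHeartbeat) {
			misses.Store(0)
			log.Printf("consume error: %v", err)
			return
		}
		if misses.Add(1) >= maxMisses {
			misses.Store(0)
			onSuspect()
		}
	}
}
```

It could be wired in via jetstream.ConsumeErrHandler(heartbeatPolicy(3, recreateConsumer)), where recreateConsumer is a hypothetical callback implementing whatever recovery the application prefers.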
@piotrpio It would appear that the issue was in fact that the consumer was deleted server-side after the idle timeout had expired. Setting a higher idle timeout indeed seems to work, in the sense that I have been unable to reproduce the "stuck" behavior since. I have also encountered additional, isolated problems related to heartbeat errors and Go timer behavior. Unfortunately, I am not able to reproduce those problems anymore or provide more information, as we have since chosen to implement a different approach for consuming from JetStream. Either way, there does not appear to be a defect in the library w.r.t. the issue at hand.
Observed behavior
I am experiencing a bug where a JetStream consumer will sometimes become "stuck" after a reconnect.
Concretely, a consumer will sometimes not resume receiving messages and instead reach a state
where it permanently throws "no heartbeat" errors after a successful reconnect to the server.
Expected behavior
The JetStream consumer should resume receiving messages after a successful reconnect (which it sometimes does).
Server and client version
I am using the main branch of nats.go as of commit c7cf3452dd6359bdf40cbad0c39d900cbeba81e2.

Host environment
I am running these tests using Go 1.23.2 (darwin/arm64), with the asynctimerchan=1 GODEBUG setting enabled.

I am using the Consumer.Consume method and the following ConsumerConfig:
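The configuration itself is not reproduced in this issue. Purely for illustration, a config of this general shape could be used; all names and values below are placeholders rather than the reporter's actual settings, and InactiveThreshold is presumably the "idle timeout" discussed in the comments above.

```go
package example

import (
	"context"
	"log"
	"time"

	"github.com/nats-io/nats.go/jetstream"
)

// setupConsumer is an illustrative sketch only; it is not the reporter's code.
// js is a jetstream.JetStream created with jetstream.New(nc).
func setupConsumer(js jetstream.JetStream) (jetstream.ConsumeContext, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	cons, err := js.CreateOrUpdateConsumer(ctx, "EVENTS", jetstream.ConsumerConfig{
		Durable:           "worker",   // placeholder name
		FilterSubject:     "events.>", // placeholder subject
		AckPolicy:         jetstream.AckExplicitPolicy,
		AckWait:           30 * time.Second,
		MaxAckPending:     1000,
		InactiveThreshold: 5 * time.Minute, // consumer is removed server-side after this much inactivity
	})
	if err != nil {
		return nil, err
	}

	return cons.Consume(func(msg jetstream.Msg) {
		log.Printf("received: %s", msg.Subject())
		_ = msg.Ack()
	})
}
```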
Steps to reproduce
After a lot of manual debugging, I believe the issue is a race condition in the core NATS code, where a call to Subscription.pCond.Wait will block forever, because no subsequent call to pCond.Signal or pCond.Broadcast ever occurs, in spite of the interrupted connection having been successfully reconnected.

I have attached two example logs each that demonstrate a "stuck" consumer and a consumer that is not "stuck", respectively. See the attached patch for the context w.r.t. the debug messages in these logs. (A generic illustration of the blocking pattern described here follows the attachments below.)
stuck2.log
notstuck2.log
stuck.log
notstuck.log
0001-add-debug-messages.patch
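As a generic illustration of the failure mode described above (plain sync.Cond, not the actual nats.go internals): a goroutine parked in cond.Wait stays parked forever unless some other goroutine changes the awaited condition and calls Signal or Broadcast.

```go
package example

import "sync"

// queue is a minimal blocking queue used only to illustrate the pattern.
type queue struct {
	mu    sync.Mutex
	cond  *sync.Cond
	items []string
}

func newQueue() *queue {
	q := &queue{}
	q.cond = sync.NewCond(&q.mu)
	return q
}

// pop blocks until an item is available.
func (q *queue) pop() string {
	q.mu.Lock()
	defer q.mu.Unlock()
	for len(q.items) == 0 {
		// If no producer ever calls push (and hence Broadcast) again after a
		// reconnect, this Wait never returns -- the "stuck" consumer.
		q.cond.Wait()
	}
	item := q.items[0]
	q.items = q.items[1:]
	return item
}

// push adds an item and wakes any waiting consumers.
func (q *queue) push(item string) {
	q.mu.Lock()
	q.items = append(q.items, item)
	q.mu.Unlock()
	q.cond.Broadcast()
}
```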