I often get the error nats: no heartbeat received #1622
Comments
Hey! What we don't see here is the context in which both provided functions are called. Could you please share it? Inactive Threshold kicks in when no one is listening for messages from a consumer (so, no active consume/fetch calls). This means it can happen, for example, if there is a long pause between creating a consumer and consuming its messages, or if the consuming app is down for longer than a minute. Moving the issue to the nats.go repo, as it's probably a client-side discussion unless we find an actual issue.
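For readers landing here, a minimal sketch of what the above describes, using the nats.go jetstream API; the stream name, durable name, and threshold value are placeholders, not taken from this thread:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/nats-io/nats.go"
	"github.com/nats-io/nats.go/jetstream"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := jetstream.New(nc)
	if err != nil {
		log.Fatal(err)
	}

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// The server deletes the consumer once no client has been consuming from it
	// for longer than InactiveThreshold ("EVENTS" and "worker" are placeholders).
	cons, err := js.CreateOrUpdateConsumer(ctx, "EVENTS", jetstream.ConsumerConfig{
		Durable:           "worker",
		AckPolicy:         jetstream.AckExplicitPolicy,
		InactiveThreshold: 5 * time.Minute, // raise this if the app can be idle for a while
	})
	if err != nil {
		log.Fatal(err)
	}

	// Start consuming promptly; a long gap between creation and consumption,
	// or app downtime longer than the threshold, lets the server remove the consumer.
	cc, err := cons.Consume(func(msg jetstream.Msg) {
		_ = msg.Ack()
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cc.Stop()

	time.Sleep(time.Minute)
}
```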
I believe this happens if you spend too long processing a message in the message loop. You need to take the message off the "queue" and return processing to the library ASAP (at least, that is what I found when trying to debug this issue in my own code).
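If you do want to keep slow processing out of the delivery path, here is a hedged sketch of one way to do it: the Consume callback only hands messages off to a worker goroutine. The handleMsg function, buffer size, and ack placement are illustrative assumptions, not from this thread:

```go
package example

import (
	"log"

	"github.com/nats-io/nats.go/jetstream"
)

// startWorker keeps slow processing off the delivery path by handing messages
// to a buffered channel drained by a separate goroutine.
func startWorker(cons jetstream.Consumer, handleMsg func(jetstream.Msg)) (jetstream.ConsumeContext, error) {
	work := make(chan jetstream.Msg, 128)

	go func() {
		for msg := range work {
			handleMsg(msg) // slow work happens here, not in the Consume callback
			if err := msg.Ack(); err != nil {
				log.Printf("ack: %v", err)
			}
		}
	}()

	// The callback only hands the message off, so the library regains control quickly.
	return cons.Consume(func(msg jetstream.Msg) {
		work <- msg
	})
}
```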
That should not be the case, especially with such a long inactive threshold.
AFAICT, the issue appears to be that heartbeats are only processed when calling Next().
This is not correct. The heartbeats are indeed processed in the background.
Indeed! Looking at the code, it looks like this error might be coming from networking issues, as the heartbeat timer isn't paused/reset during reconnection (see lines 608 to 630 at commit 8894a27).
I can confirm that the above PR appears to fix the issue -- at least, I haven't seen this message in a while now.
@VuongUranus try adding this to your go.mod file and see if you also stop seeing the issue:

replace github.com/nats-io/nats.go => github.com/withinboredom/nats.go patch-1
@withinboredom I can't apply that replace.
@VuongUranus The issue mentioned by @withinboredom is fixed here: #1643; we'll be merging it soon. However, I am still not certain that this is indeed your problem. It would be great if you could verify, and if you still encounter the issue, it would be very helpful if you could answer the questions from this comment: #1622 (comment)
My message handling function runs in a goroutine, so it may not be caused by spending too long processing messages.
@piotrpio
I also get "no heartbeat received" with version 1.36.0.
After you get that error, do you get it once and then resume normal operation, or are there many consecutive heartbeat errors?
This error occurs repeatedly on a consumer. When it occurs, I tried searching for the consumer using the CLI but could not find it. I also noticed that the behavior seems to depend on how I ack messages.
This error means that there is some issue with the consumer or JetStream; ack should have nothing to do with it. The most probable reason is a consumer with an inactivity threshold set, which will be deleted after the given duration of client inactivity.
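For reference, a rough sketch of checking from the client whether the consumer still exists, assuming the current jetstream API; the stream and consumer names are placeholders:

```go
package example

import (
	"context"
	"errors"
	"log"

	"github.com/nats-io/nats.go/jetstream"
)

// checkConsumer reports whether the named consumer still exists on the server.
func checkConsumer(ctx context.Context, js jetstream.JetStream, stream, name string) {
	cons, err := js.Consumer(ctx, stream, name)
	if errors.Is(err, jetstream.ErrConsumerNotFound) {
		log.Printf("consumer %q is gone (e.g. inactivity threshold exceeded)", name)
		return
	}
	if err != nil {
		log.Fatal(err)
	}

	info, err := cons.Info(ctx)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("inactive threshold: %s, pending: %d",
		info.Config.InactiveThreshold, info.NumPending)
}
```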
I use the Consume function to pull messages, so why is the consumer deleted by the inactivity threshold?
I increased the value of InactiveThreshold.
@VuongUranus I have encountered this error. It was a consecutive set of heartbeat errors. The thing was that I had missed a reconnection option in my connection setup. Maybe in your case there was some server failure and the reconnection attempts were exhausted.
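For completeness, a hedged sketch of connection options that keep the client reconnecting instead of exhausting its attempts; the URL and values are illustrative, not taken from this thread:

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// Illustrative values; tune for your environment.
	nc, err := nats.Connect("nats://127.0.0.1:4222",
		nats.MaxReconnects(-1),            // never give up reconnecting
		nats.ReconnectWait(2*time.Second), // pause between attempts
		nats.DisconnectErrHandler(func(_ *nats.Conn, err error) {
			log.Printf("disconnected: %v", err)
		}),
		nats.ReconnectHandler(func(nc *nats.Conn) {
			log.Printf("reconnected to %s", nc.ConnectedUrl())
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()
}
```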
Facing the same issue here. I'm using Messages() with a MessagesContext to pull messages. Another odd thing is that this is only observed on the server, not in the local development environment, so there is a chance this might be a network issue? Hope to get some insights from you guys. Here is the config for the consumer:
@yimliuwork can you share a snippet of the code where you create and then use the consumer?
Hi @Jarema, thanks for replying! The consumer creation, MessagesContext creation, and message iteration are in different functions; I'll just put everything together here (without the logging code):

func (n *NatsClient) GenerateConsumer(ctx context.Context, queueGroup, gamespace string, startSeq uint64) (jetstream.Consumer, error) {
consConfig := jetstream.ConsumerConfig{
Name: queueGroup,
Durable: queueGroup,
Description: fmt.Sprintf("consumer for queue group: %v", queueGroup),
AckPolicy: jetstream.AckExplicitPolicy,
AckWait: 30 * time.Second, // reduce redelivery as much as possible
MaxDeliver: 10, // may need redelivery when one connection is cut off and messages in its buffer are not delivered
FilterSubject: fmt.Sprintf("%v.gamespace.%v", n.mpEnv, gamespace),
ReplayPolicy: jetstream.ReplayInstantPolicy,
MaxAckPending: 5120, // put a high number to ensure good throughput
HeadersOnly: false,
InactiveThreshold: 60 * time.Second,
}
if startSeq != 0 {
consConfig.DeliverPolicy = jetstream.DeliverByStartSequencePolicy
consConfig.OptStartSeq = startSeq
} else {
consConfig.DeliverPolicy = jetstream.DeliverNewPolicy
}
// If consumer already exists and the provided configuration differs from its configuration, ErrConsumerExists
// is returned. If the provided configuration is the same as the existing consumer, the existing consumer
// is returned.
return n.js.CreateConsumer(ctx, fmt.Sprintf("GAMESPACE-%v", n.mpEnv), consConfig)
}
func (s *Server) initSubscription(ctx context.Context, queueGroup, gamespace string, startSeq uint64) (jetstream.MessagesContext, error) {
cons, err := s.nc.GenerateConsumer(ctx, queueGroup, gamespace, startSeq)
if err != nil {
return nil, err
}
log.Info(ctx, fmt.Sprintf("consumer %v created or found", queueGroup))
subscription, err := cons.Messages(jetstream.PullMaxMessages(200))
if err != nil {
return nil, err
}
return subscription, nil
}
func (sw *streamWorker) startPulling(ctx context.Context) {
defer log.Debug(ctx, "stop pulling...")
defer close(sw.toSend)
firstMsg := true
_ = firstMsg // only used by the logging that was trimmed from this snippet
for {
msg, err := sw.subscription.Next()
_ = err // error handling/logging omitted in this snippet
sw.toSend <- msg
}
}

Judging from the logs, I found that the …
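As an aside (not part of the original snippet), a hedged variant of that loop which checks the error returned by Next(); it assumes jetstream.ErrMsgIteratorClosed and jetstream.ErrNoHeartbeat are the cases worth distinguishing, and the function name, channel, and retry policy are illustrative:

```go
package example

import (
	"context"
	"errors"
	"log"

	"github.com/nats-io/nats.go/jetstream"
)

// pullLoop drains a MessagesContext and forwards messages to a channel,
// surfacing errors from Next() instead of discarding them.
func pullLoop(ctx context.Context, it jetstream.MessagesContext, out chan<- jetstream.Msg) {
	defer close(out)
	for {
		msg, err := it.Next()
		if err != nil {
			if errors.Is(err, jetstream.ErrMsgIteratorClosed) {
				return // Stop() was called or the iterator terminated; exit cleanly
			}
			if errors.Is(err, jetstream.ErrNoHeartbeat) {
				log.Printf("missed heartbeat: %v", err) // surface it instead of sending a nil msg
				continue
			}
			log.Printf("next: %v", err)
			continue
		}
		select {
		case out <- msg:
		case <-ctx.Done():
			return
		}
	}
}
```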
This is pretty weird. Could you show us how those functions are used?
That's pretty much all the code, really. Basically, I'm building a gRPC server-streaming API that distributes messages from a message queue (a consumer) to different clients. The API request payload includes the consumer's name and start sequence. With each request, we create such a consumer, or we find the consumer if it already exists. Then we create a subscription to that consumer using Messages(). Another issue I just found from load testing is that this 'no heartbeat' error can happen halfway through a stream. For example, I have 100 subscriptions to one consumer and they are pulling messages fine; then at some point all 100 subscriptions receive a 'no heartbeat' error.
@Jarema I have a possible reproduction here, alongside another problem from this issue: #1703 (comment)
Observed behavior
I often get the error nats: no heartbeat received. Why do I get this error, and where have I misconfigured something?
When I get this error, it seems my pull consumer has been deleted. I don't know the reason, but it appears to be due to exceeding the time configured in the InactiveThreshold attribute of the consumer config.
Expected behavior
Please explain what causes this and how to avoid the situation.
Server and client version
nats-server version: 2.10.12
Host environment
No response
Steps to reproduce
No response