RAFT leadership transfers and health check failures [v2.10.22] #6079
Comments
Can you please provide more complete logs from around the times of the problem, as well as server configs? Do you have account limits and/or Normally the only things that should cause leader transfers on streams in normal operation are a) you asking for one by issuing a step-down, or b) hitting the configured JetStream system limits.
@neilalexander We do not have any account-level limits. max_file_store is 50GB and max_memory_store is 10GB. I have shared the config file and complete logs over email. Let me know if you want any additional details.
I've taken a look at the logs you sent through but it appears as though the system is already unstable by the start of the logs? Was there a network-level event leading up to this, or any nodes that restarted unexpectedly?
@neilalexander We didn't observe any network-level events. The nodes did restart due to health check failures. I've sent you another email containing additional logs from an hour before the instability occurred. Let me know if that helps or if you have any additional queries.
I am going to try reproducing this from the MQTT side. The QoS2-on-JetStream implementation is quite resource intensive (per sub, and per message). This kind of volume might have introduced failures and ultimately blocked the IO (readloop) while it waits for JS responses before acknowledging back to the MQTT clients, as required by the protocol.
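To give a rough sense of why QoS 2 is heavy per subscription and per message, here is a minimal sketch that counts MQTT-level control packets only. It relies on the standard MQTT 3.1.1 QoS 2 handshake (PUBLISH, PUBREC, PUBREL, PUBCOMP) on each hop. The function names are hypothetical, and the sketch says nothing about how nats-server maps these packets onto JetStream operations or RAFT traffic internally.

```python
# MQTT 3.1.1 QoS 2 handshake for a single hop (sender -> receiver), in order.
QOS2_HANDSHAKE = ("PUBLISH", "PUBREC", "PUBREL", "PUBCOMP")


def qos2_packets_per_message(qos2_subscribers: int) -> int:
    """Count MQTT control packets exchanged for one QoS 2 message.

    One hop covers publisher -> broker, plus one hop per matching
    QoS 2 subscriber (broker -> subscriber). Hypothetical helper.
    """
    if qos2_subscribers < 0:
        raise ValueError("qos2_subscribers must be non-negative")
    hops = 1 + qos2_subscribers
    return hops * len(QOS2_HANDSHAKE)


def qos2_packets_per_second(publish_rps: float, qos2_subscribers: int) -> float:
    """Scale the per-message packet count by an aggregate publish rate."""
    return publish_rps * qos2_packets_per_message(qos2_subscribers)
```

For example, `qos2_packets_per_second(10, 1)` models 10 messages per second with one matching QoS 2 subscriber each. The number of subscribers matching each publish is not stated in this thread, so it is left as a parameter.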
@levb I have shared the config file with Neil. Let me know if you need any additional inputs in reproducing this. I can jump on a call as well if required.
@levb @neilalexander My hunch is that the huge amount of RAFT sync required for R3 consumers might be causing the instability in the system. Even in a steady-state scenario we see 2 million system messages per minute. Let me know your thoughts on this. @derekcollison Do we have any plans to support R3 file streams with R1 memory consumers?
@derekcollison Additionally, we have already set Do we have plans to re-introduce this consumer replica override capability?
It will work, but yes, if there are retention-based streams backing the MQTT stuff, the system will override and force the peer sets to be the same. Is this QoS 2?
@derekcollison this ticket is, but @slice-arpitkhatri said they got into this state with QoS 1 as well.
Yes, we have faced the issue with both QoS 1 and QoS 2.
@levb @neilalexander @MauriceVanVeen @derekcollison Any luck figuring out the issue?
@slice-arpitkhatri what are the specs of the cluster? On k8s I would recommend giving at least 4 cores to each NATS pod. Another thing would be to change the readiness health check so that it is done on
@wallyqs we've allocated 16 cores to each pod.
Observed behavior
We've observed frequent RAFT leadership transfers of the $MQTT_PUBREL consumers and health check failures, even in a steady state. Occasionally, these issues escalate, causing sharp spikes in leadership transfers and health check failures, which lead to cluster downtime.
During these intense spikes, metrics from NATS Surveyor show an enormous surge in system messages, with counts reaching billions of messages per minute (metric name: nats_core_account_msgs_recv).
System details
Additional details
Associated logs:
nats traffic in steady state (taken minutes after starting the pods): nats-traffic-of-sys-account.txt
Expected behavior
No leadership transfers of consumers & no health check failures in steady state.
Server and client version
NATS Server version 2.10.22
Host environment
Kubernetes v1.25
Steps to reproduce
Set up a 3-node NATS cluster, start 5k MQTT connections with 10k QoS 2 subscriptions (2 per client), and publish QoS 2 messages at 10 RPS.
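As a starting point for a reproduction harness, the load parameters above can be captured in a small plan object. This is only a sketch: the class name, the topic naming scheme, and the reading of "10 RPS" as an aggregate rate across all publishers are assumptions, not details from this report. It does not open any MQTT connections itself.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LoadPlan:
    """Hypothetical description of the reproduction load."""
    clients: int
    subs_per_client: int
    publish_rps: float  # assumed to be the aggregate publish rate

    @property
    def total_subscriptions(self) -> int:
        return self.clients * self.subs_per_client

    @property
    def publish_interval_s(self) -> float:
        # Delay between consecutive publishes at the aggregate rate.
        return 1.0 / self.publish_rps

    def topics_for(self, client_index: int) -> List[str]:
        # Topic names are illustrative only; the report does not give them.
        return [
            f"repro/client-{client_index}/sub-{n}"
            for n in range(self.subs_per_client)
        ]


# Values taken from the steps to reproduce.
PLAN = LoadPlan(clients=5000, subs_per_client=2, publish_rps=10)
```

A driver built on an MQTT client library could iterate over `range(PLAN.clients)`, subscribe each client to `PLAN.topics_for(i)` at QoS 2, and sleep `PLAN.publish_interval_s` between publishes.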