RAFT leadership transfers and health check failures [v2.10.22] #6079

Open
slice-arpitkhatri opened this issue Nov 5, 2024 · 15 comments
Labels
defect Suspected defect such as a bug or regression

Comments

@slice-arpitkhatri

slice-arpitkhatri commented Nov 5, 2024

Observed behavior

We've observed frequent RAFT leadership transfers of the $MQTT_PUBREL consumers and health check failures, even in a steady state. Occasionally, these issues escalate, causing sharp spikes in leadership transfers and health check failures, which lead to cluster downtime.

During these intense spikes, metrics from NATS Surveyor show an enormous surge in system messages, with counts reaching billions of messages per minute (metric name: nats_core_account_msgs_recv).

System details

  1. Peak load of 5k MQTT clients, each with 2 QoS 2 subscriptions, totaling 10k subscriptions across 10k MQTT topics.
  2. Messages produced at ~10 RPS
  3. A single NATS queue group subscription is used to consume MQTT-published messages on one topic.

Additional details

  1. Cluster of 3 nodes
  2. max_outstanding_catchup 128MB

Associated logs:

  • RAFT [cnrtt3eg - C-R3F-yMOeq7kb] Stepping down due to leadership transfer
  • Falling behind in health check, commit 3202757 != applied 3202742
  • Healthcheck failed: "JetStream is not current with the meta leader"
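
For reference, the gap in the "Falling behind" line can be read off directly: commit minus applied is roughly the number of committed Raft entries the node has not yet applied. A minimal Python sketch, assuming the log format shown above; `apply_lag` and the regular expression are illustrative helpers, not part of NATS:

```python
import re

# Matches the "commit N != applied M" fragment of the health check log line.
LAG_RE = re.compile(r"commit (\d+) != applied (\d+)")


def apply_lag(line):
    """Return commit - applied for a matching log line, or None if absent."""
    match = LAG_RE.search(line)
    if match is None:
        return None
    commit, applied = int(match.group(1)), int(match.group(2))
    return commit - applied


line = "Falling behind in health check, commit 3202757 != applied 3202742"
print(apply_lag(line))  # 15
```

Tracking this number over time would show whether a node is briefly behind or steadily falling further back.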

[Images: Screenshot 2024-11-05 at 22 29 15; leadership transfer]

NATS traffic in steady state (taken minutes after starting the pods):

[Image: Screenshot 2024-11-05 at 22 21 06]

Attachment: nats-traffic-of-sys-account.txt

Expected behavior

No leadership transfers of consumers & no health check failures in steady state.

Server and client version

NATS Server version 2.10.22

Host environment

Kubernetes v1.25

Steps to reproduce

Set up a 3-node NATS cluster, start 5k MQTT connections with 10k QoS 2 subscriptions (2 per client), and publish QoS 2 messages at 10 RPS.

@slice-arpitkhatri added the defect label (Suspected defect such as a bug or regression) on Nov 5, 2024
@neilalexander
Member

Can you please provide more complete logs from around the times of the problem, as well as server configs?

Do you have account limits and/or max_file/max_mem set?

Normally the only things that should cause leader transfers on streams in normal operation are a) a step-down that you issue explicitly, or b) hitting the configured JetStream system limits.

@slice-arpitkhatri
Author

@neilalexander We do not have any account level limits. max_file_store is 50GB and max_memory_store is at 10GB.

Have shared the config file and complete logs over email. Let me know if you want any additional details.

@neilalexander
Member

I've taken a look at the logs you sent through but it appears as though the system is already unstable by the start of the logs? Was there a network-level event leading up to this, or any nodes that restarted unexpectedly?

@slice-arpitkhatri
Author

@neilalexander We didn't observe any network-level events. The nodes did restart due to health check failures. I've sent you another email containing additional logs from an hour before the instability occurred. Let me know if that helps or if you have any additional queries

@levb
Contributor

levb commented Nov 6, 2024

I am going to try reproducing this from the MQTT side. The QoS2-on-JetStream implementation is quite resource intensive (per sub, and per message). This kind of volume might have introduced failures, ultimately blocking the IO (readloop) while it waits for JS responses before acknowledging back to the MQTT clients, as required by the protocol.

@slice-arpitkhatri
Author

@levb have shared the config file with Neil. Let me know if you need any additional inputs in reproducing this. Can jump on a call as well if required.

@slice-arpitkhatri
Author

slice-arpitkhatri commented Nov 8, 2024

@levb @neilalexander My hunch is that the huge amount of Raft sync required for R3 consumers might be causing the instability in the system. Even in the steady state we see 2 million system messages per minute. Let me know your thoughts on this.
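
To put the steady-state figure in proportion, here is a rough back-of-the-envelope calculation in Python, using only the numbers reported in this issue (~10 RPS published, ~2 million system messages per minute). The function name is hypothetical, and the result is only an indicator: system-account traffic includes more than Raft replication, so it should not be read as a per-message Raft cost.

```python
def system_msgs_per_published_msg(publish_rps, system_msgs_per_minute):
    """Ratio of system-account messages to published messages over one minute."""
    published_per_minute = publish_rps * 60
    return system_msgs_per_minute / published_per_minute


ratio = system_msgs_per_published_msg(publish_rps=10, system_msgs_per_minute=2_000_000)
print(round(ratio))               # 3333 system messages per published message
print(round(2_000_000 / 60))      # 33333 system messages per second
```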

@derekcollison Do we have any plans to support R3 file streams with R1 memory consumers?

@derekcollison
Member

That is supported today. Under the mqtt config section you have the following options to control consumers.

image

Config blocks just convert to snake case, e.g. consumer_replicas = 1

@slice-arpitkhatri
Author

slice-arpitkhatri commented Nov 8, 2024

@derekcollison
I believe the consumer_replicas setting under the MQTT config is currently not in use (the server ignores it; see this), and that consumer replicas are instead aligned with the parent stream's replica count for interest or workqueue streams (source).

Additionally, we have already set consumer_replicas to 1 in our production cluster, and I can see that the consumers still have a Raft leader, which wouldn't be the case if this consumer replica override were functional.

Do we have plans to re-introduce this consumer replica override capability?

@derekcollison
Member

It will work, but yes, if retention-based streams are backing the MQTT stuff, the system will override it and force the peer sets to be the same.

Is this QoS 2?

@levb
Contributor

levb commented Nov 8, 2024

@derekcollison this ticket is, but @slice-arpitkhatri said they got into this state with QoS 1 as well:
image

@slice-arpitkhatri
Author

Yes, we have faced the issue with both QoS 1 and QoS 2.

@wallyqs changed the title from "RAFT leadership transfers and health check failures" to "RAFT leadership transfers and health check failures [v2.10.22]" on Dec 11, 2024
@slice-arpitkhatri
Author

@levb @neilalexander @MauriceVanVeen @derekcollison Any luck figuring out the issue?

@wallyqs
Member

wallyqs commented Jan 6, 2025

@slice-arpitkhatri what are the specs of the cluster? On k8s I would recommend giving at least 4 cores to each NATS pod. Another thing would be to change the readiness health check so that it is done on /varz instead, to avoid a k8s health check disconnecting the clients when the meta leader is temporarily behind.

readinessProbe:
  httpGet:
    path: /varz
    port: 8222  # assumption: default NATS monitoring port; adjust to your deployment
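
To compare how the two endpoints would look to a readiness probe, a small Python sketch can mimic a kubelet httpGet check, which counts any status from 200 to 399 as success. This is only an illustration: `http_probe` is a hypothetical helper, and the base URL in the usage note assumes the monitoring port is reachable on localhost:8222.

```python
import urllib.error
import urllib.request


def http_probe(url, timeout=2.0):
    """Return True if the URL answers with a 2xx/3xx status, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        # Covers 4xx/5xx responses (HTTPError), refused connections and timeouts.
        return False
```

Calling `http_probe("http://localhost:8222/varz")` and `http_probe("http://localhost:8222/healthz")` side by side while the "JetStream is not current with the meta leader" message is being logged should show whether only the /healthz check is failing.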

@slice-arpitkhatri
Author

@wallyqs we've allocated 16 cores to each pod
