Ingester goes haywire in distributor-ingester setup #8040

Open
mfoldenyi opened this issue Jan 6, 2025 · 0 comments
mfoldenyi commented Jan 6, 2025

Thanos, Prometheus and Golang version used:
Observed with:
bitnami/thanos:0.35.1-debian-12-r1
bitnami/thanos:0.37.1-debian-12-r0

bitnami/thanos@sha256:9a802050e9fa1d479dbda733d47a8dc5d48b9e3b5a3806b2724d4f04e169ba48 (0.35.1)
bitnami/thanos@sha256:5bf82b98c82c485a9033ad6779ec1861b44c758627225c782e97cf44130c9f84 (0.37.1)

A little context
Running a DEV cluster on Amazon EKS with a split Distributor - Ingester setup. Distributors are autoscaled; ingesters are fixed at 3 groups of 5, each group in a separate AWS AZ. The hashring is a global, AZ-aware Ketama hashring with replication factor 3.
I have a test load coming in at about 60 requests/second, or ~50K appended samples/second.
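
For context, the hashring definition looks roughly like the sketch below. Hostnames and AZ names are placeholders, and the file format and flag names reflect my understanding of the Thanos Receive options rather than a verbatim copy of our config:

```json
[
  {
    "hashring": "default",
    "endpoints": [
      { "address": "thanos-ingester-a-0.thanos-ingester:10901", "az": "az-a" },
      { "address": "thanos-ingester-b-0.thanos-ingester:10901", "az": "az-b" },
      { "address": "thanos-ingester-c-0.thanos-ingester:10901", "az": "az-c" }
    ]
  }
]
```

(one entry per ingester, 15 in total), with `--receive.hashrings-algorithm=ketama` and `--receive.replication-factor=3` set on the distributor side.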

Object Storage Provider:
S3
What happened:
On a random day at 13:19:30 everything was running smoothly. At 13:20:00 something happened, and successful ingestion dropped significantly. At the same time, the distributors reported a huge spike in inflight request count:
[screenshot: distributor inflight request count spiking at 13:20:00]
At the same moment, one of the ingesters (c-0, the first in the group for AZ c) went dark and Prometheus could not scrape it anymore:
[screenshot: Prometheus scrapes of ingester c-0 stopping]
CPU and memory metrics (which are not scraped from the ingester itself) show that from this point onward it steadily built up memory and eventually got OOMKilled:
[screenshot: ingester c-0 memory climbing steadily until the OOMKill]
(In this particular case the ingester already had a higher amount of memory in use, but in a different occurrence this was not the case.)
When it was eventually terminated and restarted, it came back with temporarily high memory usage, but ultimately it rejoined the other ingesters.
When these charts are put next to each other, it is very obvious that the failed ingestion is the direct result of ingester c-0 being in this weird state:
[screenshot: failed ingestion lining up with ingester c-0 being in this state]

During this period the distributors' gRPC clients reported a large number of DeadlineExceeded errors:
[screenshot: distributor gRPC client DeadlineExceeded error rate]

What you expected to happen:

  1. Not to have an ingester randomly freak out and develop a memory climb all the way to an OOMKill.
  2. Metric ingestion to continue without interruption, since only 1 of the 15 ingesters (3 groups of 5) was compromised, which should still have satisfied the write quorum of 2 (see the quorum sketch below this list).
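
As a sanity check on point 2, here is my understanding of how the write quorum is derived (the usual floor(rf/2)+1 rule; an illustrative sketch, not code taken from the Thanos source):

```go
package main

import "fmt"

// writeQuorum follows the common floor(rf/2)+1 rule.
// Illustrative only, not copied from the Thanos codebase.
func writeQuorum(replicationFactor int) int {
	return replicationFactor/2 + 1
}

func main() {
	rf := 3
	fmt.Printf("replication factor %d -> write quorum %d\n", rf, writeQuorum(rf))
	// With only ingester c-0 down, at most 1 of the 3 replicas of any series
	// should have been unavailable, so 2 acks should still have been achievable.
}
```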

How to reproduce it (as minimally and precisely as possible):
No idea as of yet. The last head compaction and upload happened at 12:00:00, 80 minutes prior. There were no reconfigurations, restarts, or other notable events happening in the cluster at the time.
This has happened 2 times in total so far, on different versions (0.35.1/0.37.1) and different clusters.

Full logs to relevant components:
There was no logging at all before or during the event; the last entries before the restart relate to the last head compaction and upload. After the restart, there was a huge number of "out-of-order" warnings (400K+ log entries).
In general, there is nothing out of the ordinary in the logs for ±1 day besides the out-of-order warnings.

Anything else we need to know:
All in all the event took 3 hours and 10 minutes, but that duration was determined by the memory headroom the ingester had at the time.

I realise this is very little to go on, so I am not expecting a miracle, but any ideas on what to look for, or what to enable so that next time I have more info on such an event, would be welcome. (That said, enabling debug logging will not happen. I tried that, and the sheer volume of debug logs is simply too much for this cluster to be feasible, especially in a "maybe it will happen again" scenario where I would need to turn it on and leave it on.)

Thanks for reading!

@dosubot dosubot bot added the bug label Jan 6, 2025