Ingester goes haywire in distributor-ingester setup #8040

Open
mfoldenyi opened this issue Jan 6, 2025 · 0 comments
mfoldenyi commented Jan 6, 2025

Thanos, Prometheus and Golang version used:
Observed with:
bitnami/thanos:0.35.1-debian-12-r1
bitnami/thanos:0.37.1-debian-12-r0

bitnami/thanos@sha256:9a802050e9fa1d479dbda733d47a8dc5d48b9e3b5a3806b2724d4f04e169ba48 (0.35.1)
bitnami/thanos@sha256:5bf82b98c82c485a9033ad6779ec1861b44c758627225c782e97cf44130c9f84 (0.37.1)

A little context
Running a DEV cluster on Amazon EKS with a split Distributor - Ingester setup. Distributors are autoscaled; ingesters are fixed at 3 groups of 5, each group in a separate AWS AZ. The hashring is a global, AZ-aware Ketama hashring with replication factor 3.
I have a test load coming in at about 60 requests/second, or ~50K appended samples/second.
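
For context, the hashring definition looks roughly like the sketch below. Hostnames and AZ names are placeholders, and the file format and flag names reflect my understanding of the Thanos Receive options rather than a verbatim copy of our config:

```json
[
  {
    "hashring": "default",
    "endpoints": [
      { "address": "thanos-ingester-a-0.thanos-ingester:10901", "az": "az-a" },
      { "address": "thanos-ingester-b-0.thanos-ingester:10901", "az": "az-b" },
      { "address": "thanos-ingester-c-0.thanos-ingester:10901", "az": "az-c" }
    ]
  }
]
```

(one entry per ingester, 15 in total), with `--receive.hashrings-algorithm=ketama` and `--receive.replication-factor=3` set on the distributor side.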

Object Storage Provider:
S3
What happened:
On a random day at 13:19:30 everything was running smoothly. At 13:20:00 something happened, and successful ingestion dropped significantly. At the same time, the distributors reported a huge spike in inflight request count:
[screenshot: distributor inflight request count spiking at 13:20:00]
At the same moment, one of the ingesters (c-0, the first in the group for AZ c) went dark and Prometheus could not scrape it anymore:
[screenshot: Prometheus scrapes of ingester c-0 stopping]
CPU and memory metrics (which are not scraped from the ingester itself) show that from this point onward it steadily built up memory and eventually got OOMKilled:
[screenshot: ingester c-0 memory climbing steadily until the OOMKill]
(In this particular case the ingester already had a higher amount of memory in use, but in a different occurrence this was not the case.)
When it was eventually terminated and restarted, it came back with temporarily high memory usage, but ultimately it rejoined the other ingesters.
When these charts are put next to each other, it is very obvious that the failed ingestion is the direct result of ingester c-0 being in this weird state:
[screenshot: failed ingestion lining up with ingester c-0 being in this state]

During this period the distributors' gRPC clients reported a large number of DeadlineExceeded errors:
[screenshot: distributor gRPC client DeadlineExceeded error rate]

What you expected to happen:

  1. Not to have an ingester randomly freak out and develop a memory climb all the way to an OOMKill.
  2. Metric ingestion to continue without interruption, since only 1 of the 15 ingesters (3 groups of 5) was compromised, which should still have satisfied the write quorum of 2 (see the quorum sketch below this list).
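
As a sanity check on point 2, here is my understanding of how the write quorum is derived (the usual floor(rf/2)+1 rule; an illustrative sketch, not code taken from the Thanos source):

```go
package main

import "fmt"

// writeQuorum follows the common floor(rf/2)+1 rule.
// Illustrative only, not copied from the Thanos codebase.
func writeQuorum(replicationFactor int) int {
	return replicationFactor/2 + 1
}

func main() {
	rf := 3
	fmt.Printf("replication factor %d -> write quorum %d\n", rf, writeQuorum(rf))
	// With only ingester c-0 down, at most 1 of the 3 replicas of any series
	// should have been unavailable, so 2 acks should still have been achievable.
}
```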

How to reproduce it (as minimally and precisely as possible):
No idea as of yet. The last head compaction and upload happened at 12:00:00, 80 minutes prior. There were no reconfigurations, restarts, or other notable events happening in the cluster at the time.
This has happened 2 times in total so far, on different versions (0.35.1/0.37.1) and different clusters.

Full logs to relevant components:
There was no logging at all before or during the event; the last entries before the restart relate to the last head compaction and upload. After the restart, there was a huge number of "out-of-order" warnings (400K+ log entries).
In general, there is nothing out of the ordinary in the logs for ±1 day besides the out-of-order warnings.

Anything else we need to know:
All in all the event took 3 hours and 10 minutes, but that duration was determined by the memory headroom the ingester had at the time.

I realise this is very little to go on, so I am not expecting a miracle, but any ideas on what to look for, or what to enable so that next time I have more info on such an event, would be welcome. (That said, enabling debug logging will not happen. I tried that, and the sheer volume of debug logs is simply too much for this cluster to be feasible, especially in a "maybe it will happen again" scenario where I would need to turn it on and leave it on.)

Thanks for reading!

@dosubot dosubot bot added the bug label Jan 6, 2025