receive: Why is load not evenly distributed across Thanos Receivers? #3794
-
This is actually good and expected behavior. The same time series should always hash to the same node in a hashring of a given size, which means that for a given set of time series, a hashring should always see roughly the same load distribution. The fact that one replica sees consistently higher load than the others is most likely due to some inherent lumpiness in the time series being sent.

Thanos decides which replica should ingest data by hashing the name and label-value pairs of a time series and picking the corresponding replica from the ring to handle that hash. It seems that the data being sent simply has more time series that hash to one replica. There is no guarantee that data will be distributed uniformly across replicas; however, with a good hash function, the greater the number of time series and the more random their names and labels, the closer we should statistically converge towards a uniform distribution.

Unfortunately, I don't think there is actually a bug here :/ Take, for example, the case where a Prometheus server produces only a single time series and remote-writes a million samples/second. We would rightfully expect very high load on a single hashring replica. This is essentially an extreme version of the case we see here.

One thing I could imagine would be a feature proposal to improve the randomness of replica selection, and thus drive load distribution towards uniformity even in the degenerate single-time-series case, by adding a random value to the hash that changes for every request. It would be important that replicas forward this metadata along with the request so that, once set, the value is fixed and replicas can agree on who should ultimately ingest a sample. WDYT? This would help solve load distribution for lumpy data.
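To make that concrete, here is a toy sketch in Go of the "hash the label pairs, pick a ring member" idea. This is not the actual Thanos implementation; the hash function, replica names, and label sets below are only illustrative:

```go
// Toy sketch of hash-based replica selection, not the actual Thanos code.
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// hashSeries derives a stable hash from a series' label name/value pairs.
// Sorting the label names first ensures the same series always produces
// the same hash, regardless of map iteration order.
func hashSeries(labels map[string]string) uint64 {
	names := make([]string, 0, len(labels))
	for name := range labels {
		names = append(names, name)
	}
	sort.Strings(names)

	h := fnv.New64a()
	for _, name := range names {
		h.Write([]byte(name))
		h.Write([]byte{0xff}) // separator so adjacent strings cannot blur together
		h.Write([]byte(labels[name]))
		h.Write([]byte{0xff})
	}
	return h.Sum64()
}

// pickReplica maps the series hash onto one ring member. For a fixed ring
// size, the same series always lands on the same replica.
func pickReplica(labels map[string]string, replicas []string) string {
	return replicas[hashSeries(labels)%uint64(len(replicas))]
}

func main() {
	replicas := []string{"thanos-receive-0", "thanos-receive-1", "thanos-receive-2"}

	series := []map[string]string{
		{"__name__": "http_requests_total", "job": "api", "instance": "a"},
		{"__name__": "http_requests_total", "job": "api", "instance": "b"},
		{"__name__": "up", "job": "api", "instance": "a"},
	}

	for _, s := range series {
		fmt.Printf("%v -> %s\n", s, pickReplica(s, replicas))
	}
}
```

Because the mapping depends only on the label set, a single very hot series always lands on the same replica no matter how many samples it sends; the proposal above would mix a per-request random value into the hash (forwarded along with the request) so that even such a series spreads across the ring.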
-
Thanos, Prometheus and Golang version used:
Thanos v0.15.0
Object Storage Provider:
S3
What happened:
I have installed 3 Thanos Receivers, 1 Compactor, 1 Store, and 2 Queriers.
Each pod runs on its own node.
When I re-installed the Thanos Receivers, 1 pod consumed 2x the memory of the other 2 pods. The problem is that, since these 3 Receiver pods are part of a single hashring, the 3rd pod (thanos-receive-2) gets OOMKilled immediately while the other 2 pods still have plenty of headroom, and no metrics are shown in Thanos Query.
Note: I have re-installed many times, but the behavior is the same: the 3rd pod still consumes twice the memory of the other 2 pods. I checked the pod itself; there is no other app running on that node and there are no errors in the logs.
What you expected to happen:
The load/memory should be distributed at least close to evenly across the Receivers.
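For context, a hashrings file that puts all three Receivers into one hashring (the file passed via --receive.hashrings-file) looks along these lines; the endpoint addresses here are placeholders, not my actual service names:

```json
[
  {
    "hashring": "default",
    "endpoints": [
      "thanos-receive-0.thanos-receive:10901",
      "thanos-receive-1.thanos-receive:10901",
      "thanos-receive-2.thanos-receive:10901"
    ]
  }
]
```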