Thanos, Prometheus and Golang version used:
0.37.2
We are doing some tests and a tenant started to ship quite a lot of data. Initially we did not have enough resources for this; after we scaled things up, it still took a long while for the system to get back into a healthy shape.
I think the following screenshot shows the state pretty well. These are the Thanos Receivers running in 'router' / distributor mode:
Now, I get the increase in 409s: we were bottlenecked, so that is somewhat inevitable. What I cannot exactly grasp, however, is the amount of 500s. Basically my question is: where are those coming from?
How we eventually solved this was by simply setting --receive.forward.async-workers=200, which feels like quite a lot compared to the default of 5. What makes me curious, though, is that increasing the number of replicas on the distributor hardly achieves the same result. This is based on gut feeling rather than hard data, but running 5 replicas with 200 workers each seems to work better than running 20 replicas with 100 workers each, even though the latter should result in effectively more workers (2000 vs. 1000).
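For reference, this is roughly what our router-mode invocation looks like after the change. The hashring file path and addresses below are placeholders rather than our exact setup, so treat it as a sketch:

```
# Thanos Receive in router / distributor mode; paths and addresses are placeholders.
thanos receive \
  --remote-write.address=0.0.0.0:19291 \
  --grpc-address=0.0.0.0:10901 \
  --http-address=0.0.0.0:10902 \
  --receive.hashrings-file=/etc/thanos/hashrings.json \
  --receive.replication-factor=3 \
  --receive.forward.async-workers=200  # bumped from the default of 5
```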
The forward delay was really high, but remained "stable" around the ~12-14 minute mark. That is kind of interesting, as I would have figured it would keep increasing over time, but it just seems to cap out around there. Is there some limit at which there is so much delay that it stops queueing more work and thus returns 500s?
So, obviously, the root cause here is getting into this state in the first place, which isn't a Thanos bug. We merely used this situation to get more experience and battle-test the system. Still, those 500s seem very weird to me, and I also suspect the number of workers is limited by something else (?). I feel there should be a way to scale this such that, regardless of the load, retries, etc., the system recovers faster the more resources I throw at it.
More stats (some back-of-envelope math on these below):
- ~25M active series
- ~500k series/s
- ~50k samples/s per distributor
- 7 receivers
- replication factor of 3
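For some rough context on those numbers (back-of-envelope, and assuming the ~500k series/s is the total incoming rate): with a replication factor of 3, every series gets forwarded to 3 of the 7 receivers, so ~500k series/s ingested works out to roughly 1.5M series/s of replicated writes across the ring, or on the order of ~215k series/s per receiver. With the default of 5 forward workers per router, that is a lot of outstanding work per worker, which is presumably why bumping the worker count made such a difference for us.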