
Receive: http 500 responses and the effect of receive.forward.async-worker #8063

wiardvanrij opened this issue Jan 17, 2025 · 0 comments


Thanos, Prometheus and Golang version used:
0.37.2

We are doing some tests and a tenant started to ship quite a lot of data, which initially left us without enough resources. We scaled things up, but it took a long while to get back into a healthy shape.

I think the following screenshot shows the state pretty well. These are the Thanos Receivers running in 'router' / distributor mode:

[Screenshot: HTTP response codes on the Thanos Receive routers, showing the rise in 409s and 500s]

Now, I understand the increase in 409s: we were bottlenecked, so those are somewhat inevitable. What I cannot quite grasp is the number of 500s. My question is basically: where are those coming from?

How we eventually solved this was by simply setting --receive.forward.async-workers=200, which feels like quite a lot compared to the default of 5. What makes me curious, though, is that increasing the number of distributor replicas hardly seems to achieve the same result. This is only a gut feeling, but running 5 replicas with 200 workers each appears to work better than running 20 replicas with 100 workers each, even though the latter should effectively yield more workers.
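To make my mental model concrete, here is a minimal sketch of how a bounded async-worker forwarder can fail fast once its buffer is saturated. This is not Thanos's actual implementation; all names are made up. The point is only that with a fixed-size queue per replica, the reject rate depends on workers per replica, not just on the total worker count across replicas:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errQueueFull stands in for whatever error a bounded forwarder returns when
// its buffer is saturated; on the write path this would surface as a 5xx.
var errQueueFull = errors.New("forward queue full")

// forwardPool is a hypothetical bounded async-worker forwarder, not a Thanos type.
type forwardPool struct {
	queue chan func()
}

func newForwardPool(workers, queueSize int) *forwardPool {
	p := &forwardPool{queue: make(chan func(), queueSize)}
	for i := 0; i < workers; i++ {
		go func() {
			for job := range p.queue {
				job() // each worker drains queued forwards sequentially
			}
		}()
	}
	return p
}

// submit enqueues a forward request; if the buffer is full it fails fast
// instead of blocking, which is one plausible source of immediate 500s.
func (p *forwardPool) submit(job func()) error {
	select {
	case p.queue <- job:
		return nil
	default:
		return errQueueFull
	}
}

func main() {
	pool := newForwardPool(5, 10) // few workers, small buffer
	for i := 0; i < 20; i++ {
		err := pool.submit(func() { time.Sleep(100 * time.Millisecond) })
		if err != nil {
			fmt.Println("request", i, "rejected:", err)
		}
	}
}
```

If something like this is going on, a full buffer turns into an immediate error on the write path, which is exactly the kind of thing that would show up as a burst of 500s.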

The forward delay was really high but remained "stable" around that ~12-14 minute mark, which is kind of interesting: I would have expected it to keep increasing over time, but it just seems to cap there. Is there some 'limit' at which there is so much delay that it stops queueing more work and thus returns 500s?

[Screenshot: forward delay plateauing around the ~12-14 minute mark]
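If there is indeed a bounded buffer somewhere, that would also explain the plateau: with at most N requests queued and workers draining them at R requests/s, the steady-state delay caps at roughly N/R, and everything beyond that is rejected rather than queued. A back-of-the-envelope with made-up numbers (just to show the shape, not measured values):

```go
package main

import "fmt"

// Back-of-the-envelope: with a bounded queue, the steady-state forward delay
// caps at roughly queueDepth / drainRate; extra load is rejected, not queued.
// Both numbers below are hypothetical, chosen only to illustrate the shape.
func main() {
	queueDepth := 400_000.0 // assumed max buffered forward requests
	drainRate := 500.0      // assumed requests/s the workers push downstream
	delay := queueDepth / drainRate
	fmt.Printf("max delay ≈ %.0f s (~%.1f min)\n", delay, delay/60)
	// Once the buffer is full, the delay stops growing and new requests fail,
	// which would look exactly like a plateau plus a wall of 500s.
}
```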

So, obviously, the root cause here is getting into this state in the first place, which isn't a Thanos bug or anything. We merely used this situation to get more experience and battle-test the system. Yet those 500s feel very weird to me, and I also suspect the number of workers is limited by something else (?). I feel there should be a way to scale it such that, regardless of the load, retries, etc., the system recovers faster the more resources I throw at it.

More stats:

  • ~25M active series
  • ~500k series/s
  • ~50k samples/s per distributor
  • 7 receivers
  • replication factor of 3