
Receive: http 500 responses and the effect of receive.forward.async-worker #8063

wiardvanrij opened this issue Jan 17, 2025 · 0 comments


Thanos, Prometheus and Golang version used:
0.37.2

We are doing some tests and a tenant started to ship quite a lot of data, which initially left us without enough resources. We scaled things up, but it took a long while to get back into a healthy shape.

I think the following screenshot shows the state pretty well. These are the Thanos Receivers running in 'router' / distributor mode:

[Screenshot: HTTP response codes on the Thanos Receive routers, showing the rise in 409s and 500s]

Now, I understand the increase in 409s: we were bottlenecked, so those are somewhat inevitable. What I cannot quite grasp is the number of 500s. My question is basically: where are those coming from?

How we eventually solved this was by simply setting --receive.forward.async-workers=200, which feels like quite a lot compared to the default of 5. What makes me curious, though, is that increasing the number of distributor replicas hardly seems to achieve the same result. This is only a gut feeling, but running 5 replicas with 200 workers each appears to work better than running 20 replicas with 100 workers each, even though the latter should effectively yield more workers.
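To make my mental model concrete, here is a minimal sketch of how a bounded async-worker forwarder can fail fast once its buffer is saturated. This is not Thanos's actual implementation; all names are made up. The point is only that with a fixed-size queue per replica, the reject rate depends on workers per replica, not just on the total worker count across replicas:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errQueueFull stands in for whatever error a bounded forwarder returns when
// its buffer is saturated; on the write path this would surface as a 5xx.
var errQueueFull = errors.New("forward queue full")

// forwardPool is a hypothetical bounded async-worker forwarder, not a Thanos type.
type forwardPool struct {
	queue chan func()
}

func newForwardPool(workers, queueSize int) *forwardPool {
	p := &forwardPool{queue: make(chan func(), queueSize)}
	for i := 0; i < workers; i++ {
		go func() {
			for job := range p.queue {
				job() // each worker drains queued forwards sequentially
			}
		}()
	}
	return p
}

// submit enqueues a forward request; if the buffer is full it fails fast
// instead of blocking, which is one plausible source of immediate 500s.
func (p *forwardPool) submit(job func()) error {
	select {
	case p.queue <- job:
		return nil
	default:
		return errQueueFull
	}
}

func main() {
	pool := newForwardPool(5, 10) // few workers, small buffer
	for i := 0; i < 20; i++ {
		err := pool.submit(func() { time.Sleep(100 * time.Millisecond) })
		if err != nil {
			fmt.Println("request", i, "rejected:", err)
		}
	}
}
```

If something like this is going on, a full buffer turns into an immediate error on the write path, which is exactly the kind of thing that would show up as a burst of 500s.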

The forward delay was really high but remained "stable" around that ~12-14 minute mark, which is kind of interesting: I would have expected it to keep increasing over time, but it just seems to cap there. Is there some 'limit' at which there is so much delay that it stops queueing more work and thus returns 500s?

[Screenshot: forward delay plateauing around the ~12-14 minute mark]
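If there is indeed a bounded buffer somewhere, that would also explain the plateau: with at most N requests queued and workers draining them at R requests/s, the steady-state delay caps at roughly N/R, and everything beyond that is rejected rather than queued. A back-of-the-envelope with made-up numbers (just to show the shape, not measured values):

```go
package main

import "fmt"

// Back-of-the-envelope: with a bounded queue, the steady-state forward delay
// caps at roughly queueDepth / drainRate; extra load is rejected, not queued.
// Both numbers below are hypothetical, chosen only to illustrate the shape.
func main() {
	queueDepth := 400_000.0 // assumed max buffered forward requests
	drainRate := 500.0      // assumed requests/s the workers push downstream
	delay := queueDepth / drainRate
	fmt.Printf("max delay ≈ %.0f s (~%.1f min)\n", delay, delay/60)
	// Once the buffer is full, the delay stops growing and new requests fail,
	// which would look exactly like a plateau plus a wall of 500s.
}
```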

So, obviously, the root cause here is getting into this state in the first place, which isn't a Thanos bug or anything. We merely used this situation to get more experience and battle-test the system. Yet those 500s feel very weird to me, and I also suspect the number of workers is limited by something else (?). I feel there should be a way to scale it such that, regardless of the load, retries, etc., the system recovers faster the more resources I throw at it.

More stats:

  • ~25M active series
  • ~500k series/s
  • ~50k samples/s per distributor
  • 7 receivers
  • replication factor of 3