Parallel updates of VPAs and checkpoints. #7951
Wow! That's pretty darn big! Thanks for the code change and description, I think we could improve something here. Will see if someone wants to take the issue and see how they solve it.
/help
@adrianmoisey: Guidelines: Please ensure that the issue body includes answers to the following questions:
For more details on the requirements of such an issue, please see here and ensure that they are met. If this request no longer meets these requirements, the label can be removed. In response to this: /help
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Thanks for this! I really like that approach :) Updating the VPAs concurrently is definitely a great idea.
/assign
I'm wondering if this will require a setting to control concurrency, but I'll wait for a PR before I make more comments :P
Yeah, I see your point. I'll dig into this and see if we can come up with a smarter approach. In any case, I plan to put "concurrent mode" behind a flag (defaulting to false for now) so users can choose between both modes. WDYT?
That makes perfect sense to me
Hey @tkukushkin, looking at the numbers you provided, it seems that you left the recommender's client-side QPS settings at their default values (see autoscaler/vertical-pod-autoscaler/common/flags.go, lines 42 to 43 at dc57f7c).
As you can see, those values (5 requests/s to the KAPI) are not a fit for large-scale usage. There was an upstream discussion around removing client-side rate limiting altogether, now that API Priority and Fairness is available for the KAPIs; maybe we could pick this up for VPA as well and remove the client-side rate limiting entirely?
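For context on how those settings take effect: client-go builds a token-bucket rate limiter from the QPS and burst values on rest.Config, and every request waits for a token before it is sent. Below is a minimal sketch of that wiring, with flag names mirroring the VPA common flags referenced above; it is illustrative, not the actual recommender code:

```go
package main

import (
	"flag"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

var (
	// Mirrors the recommender's client-side rate-limit flags referenced above.
	kubeAPIQPS   = flag.Float64("kube-api-qps", 5.0, "QPS limit when making requests to the Kubernetes API server")
	kubeAPIBurst = flag.Float64("kube-api-burst", 10.0, "burst limit when making requests to the Kubernetes API server")
)

func buildClient(config *rest.Config) (kubernetes.Interface, error) {
	// client-go wraps every request in a token-bucket limiter built from
	// these two values, so with the defaults at most ~5 requests/s reach
	// the API server, no matter how many goroutines issue them.
	config.QPS = float32(*kubeAPIQPS)
	config.Burst = int(*kubeAPIBurst)
	// A negative QPS disables client-side rate limiting entirely, leaving
	// throttling to server-side API Priority and Fairness.
	return kubernetes.NewForConfig(config)
}
```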
As additional context: vpa-recommender has some pretty good instrumentation, allowing you to keep an eye on how long the individual steps in its loop take. A few years back, I had an issue showing the histograms for those steps in action – we had similar problems in large-scale environments like you do now: #4498
Hey @voelzmo,
I didn't change the defaults; we override them through command-line options in our recommender's deployment. Our current values are …
My knowledge about it is really poor, so I just trust your opinion.
Yeah, we know about these metrics and monitor them.
Oh that's so confusing, because the numbers seemed to line up so nicely! Which version of VPA are you running? Maybe the recommender needs more than 1 call to update a VPA status then?
We're running 1.2.1.
As far as I understand, it's 1 call to update a VPA and 1 call to update a checkpoint. Updating the VPAs takes 9 seconds and updating the checkpoints takes 13 seconds, so in total one iteration takes less than 25 seconds, which is totally fine for us.
I wouldn't like to increase the load on our Kubernetes masters that much.
Oh, lol, sorry – I misread seconds for minutes. Not enough coffee this morning ☕
No, wait, I did read "minutes" in your original post:
In the scenario, where one loop took 10 minutes to execute:
I understand that using your modified code, the …
That's what the goroutine-based solution does as well, doesn't it?
So if I got it right, we should completely remove client-side rate limiting (regardless of this issue), since Kubernetes has had built-in flow control (https://kubernetes.io/docs/concepts/cluster-administration/flow-control/) since version 1.20, right?
If memory serves me well, we had the default settings.
I don't have this information anymore, it was long ago 😞 But I remember that even after this change the default rate limit was not enough to fit into 1 minute, so we increased it through CLI options.
I don't get this point, sorry, could you please explain why it skips client-side rate limiting? I believe the goroutine-based solution makes as many concurrent requests as the client-side rate limit allows. And I see a lot of logs like: …
The current approach makes only 1 request at a time. Even if the rate limit is high, say 1000 QPS, the API server would have to reply within 1 ms for the client to actually make 1000 requests in 1 second.
I apologize for misleading you with the numbers by not providing the settings under which we obtained them.
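To put assumed numbers on the latency point: at, say, 50 ms per request, a strictly sequential client tops out at ~20 requests/s regardless of a 1000 QPS client-side limit. With 2.7k VPAs and one status call plus one checkpoint call each, that is ~5400 requests, or roughly 4.5 minutes per loop; with 20 concurrent workers the same work finishes in well under 20 seconds while still staying far below the configured limit. (The 50 ms figure is an assumption for illustration, not a measured value from this thread.)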
Yeah, sorry, I'm easily confused today. Great point about the query latency, which I absolutely wasn't accounting for! So to summarize: …
Does that make sense?
Yes, it totally makes sense. Btw, I've just tested the unmodified version of the recommender with our current rate-limit configuration, and one iteration takes more than 6 minutes, even with a timeout error from …
This matches my assumption about the system. If concurrency is increased too high, the QPS and burst limits in client-go get hit. I assume a goroutine-per-VPA (as in the current solution) is overkill; I think we need to set a sane (safe?) default, but also make it configurable, should users need to increase the throughput.
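A bounded worker pool along those lines might look like the sketch below. The flag name, its default, and the vpaUpdate type are assumptions for illustration; none of them exist in the codebase as shown:

```go
package main

import (
	"context"
	"flag"
	"log"
	"sync"
)

// concurrentUpdates is a hypothetical flag name; the real one would be
// decided in the PR. Defaulting to 1 preserves today's sequential behavior
// unless the user opts in.
var concurrentUpdates = flag.Int("concurrent-updates", 1, "max number of VPA update requests in flight at once")

// vpaUpdate stands in for whatever per-VPA work the recommender does
// (status update, checkpoint write); it is an assumption for this sketch.
type vpaUpdate func(ctx context.Context) error

// runUpdates fans the updates out to a fixed pool of workers, so the number
// of in-flight API calls is bounded by the flag rather than by the number of
// VPAs. client-go's QPS/burst limiter still applies on top of this.
func runUpdates(ctx context.Context, updates []vpaUpdate) {
	work := make(chan vpaUpdate)
	var wg sync.WaitGroup
	for i := 0; i < *concurrentUpdates; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range work {
				if err := u(ctx); err != nil {
					// One failed update should not block the rest of the loop.
					log.Printf("update failed: %v", err)
				}
			}
		}()
	}
	for _, u := range updates {
		work <- u
	}
	close(work)
	wg.Wait()
}
```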
Thanks for a great summarisation!
/unassign |
Which component are you using?:
/area vertical-pod-autoscaler
Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:
Hi! We have more than 2.7k VPAs in our cluster, and we've run into a problem: the recommender works really slowly, and one recommendation cycle takes more than 10 minutes.
We've tried updating all VPAs and all checkpoints in parallel with simple wait groups, relying on the rate limit of `--kube-api-qps`, and it works pretty nicely: the `UpdateVPAs` step takes around 9 seconds and `MaintainCheckpoints` takes 13 seconds in our cluster. This also allowed us to remove the `--min-checkpoints` and `--checkpoints-timeout` options because, imho, they don't make sense anymore. I pushed a modified version for reference: master...tkukushkin:autoscaler:parallel-updates
Describe the solution you'd like.:
I'm not good at Go, otherwise I would open a pull request myself, but I think such an approach could be implemented.
I'm not completely sure the `--min-checkpoints` and `--checkpoints-timeout` options should be removed. Also, maybe it's better to add a specific rate limit for update operations.
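For illustration, the wait-group approach described above amounts to something like the following sketch; updateVPA here is a hypothetical placeholder for the recommender's real per-object update call, and the linked branch remains the authoritative version of the idea:

```go
package main

import (
	"context"
	"sync"
)

// updateVPA is a placeholder for the recommender's real per-object update
// call, e.g. a PATCH of the VPA status via client-go.
func updateVPA(ctx context.Context, name string) error {
	return nil
}

// updateVPAsInParallel launches one goroutine per VPA and waits for all of
// them. Throughput is then bounded by --kube-api-qps/--kube-api-burst in
// client-go rather than by per-request latency, as in the sequential loop.
func updateVPAsInParallel(ctx context.Context, names []string) {
	var wg sync.WaitGroup
	for _, name := range names {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			// The client-side rate limiter blocks here until a token is
			// available, so the request rate stays within the configured QPS.
			_ = updateVPA(ctx, name)
		}(name)
	}
	wg.Wait()
}
```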