Google Groups API thresholds/quotas exceeded by RBACSync querying #30

RochesterinNYC opened this issue Dec 2, 2021 · 2 comments

@RochesterinNYC

Hi, we (Spotify) are currently running RBACSync on each GKE/K8s cluster we operate (over 30 clusters). The aggregate querying from all of these RBACSync instances frequently hits the limits of the Google Groups API.

We understand that there are many configuration options that can change RBACSync's query behavior and how frequently it hits the Google Groups API. However, these options can't mitigate the sheer volume of querying when many Kubernetes clusters each run their own copy of RBACSync. The problem is especially exacerbated when a new version of RBACSync is deployed, since the RBACSync workloads on every cluster restart or update at once and produce a burst of simultaneous querying.

We're curious whether there's anything recommended for this case to keep the aggregate querying from these RBACSync instances from exceeding the Google Groups API quotas. We're also curious how Cruise and other RBACSync users are running, operating, and deploying RBACSync, and whether anyone else is encountering these kinds of issues.

@stevvooe
Contributor

stevvooe commented Feb 1, 2022

Hello!

Which configuration options have you tried at this point?

Typically, we have raised our quotas in this case. That doesn't scale well with the number of clusters, but I wouldn't expect 30 clusters to hit the limit unless you have a large number of groups. In the short term, raising the poll period with -upstream-poll-period (i.e., polling less often) will help a lot. By default it is 5m, but 15m or 30m works fine for most cases.
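
For example, something like the following, assuming the flag is passed straight to the rbacsync binary (the exact invocation depends on how you deploy it; only the flag itself comes from rbacsync):

```
rbacsync -upstream-poll-period=30m
```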

Long term, I think there are some changes that could help:

  1. Use golang.org/x/time/rate to add actual rate limiting, so queries from an individual rbacsync instance are counted and throttled. This will better smear queries where you have clusters with different numbers of groups (rough sketch after this list).
  2. Query group membership through an intermediate cache. This is a larger problem to solve, but I see no reason that can't be added as a client for rbacsync, along with an intermediate group caching service (or maybe have it query sibling clusters).
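
To make the first point concrete, here's a rough sketch of what I mean (by time/rate I mean golang.org/x/time/rate; the GroupMemberLister interface and the names around it are hypothetical, not the actual rbacsync client API):

```go
// Rough sketch of option 1, using golang.org/x/time/rate to throttle lookups
// from a single rbacsync instance. GroupMemberLister and everything around it
// is hypothetical, not the actual rbacsync client.
package ratelimit

import (
	"context"
	"fmt"

	"golang.org/x/time/rate"
)

// GroupMemberLister stands in for whatever client resolves Google Group membership.
type GroupMemberLister interface {
	ListMembers(ctx context.Context, group string) ([]string, error)
}

// limitedLister wraps an upstream lister and smears its queries over time.
type limitedLister struct {
	upstream GroupMemberLister
	limiter  *rate.Limiter
}

// NewLimitedLister allows roughly perSecond lookups per second with a small burst.
func NewLimitedLister(upstream GroupMemberLister, perSecond float64, burst int) GroupMemberLister {
	return &limitedLister{
		upstream: upstream,
		limiter:  rate.NewLimiter(rate.Limit(perSecond), burst),
	}
}

func (l *limitedLister) ListMembers(ctx context.Context, group string) ([]string, error) {
	// Wait blocks until a token is available or the context is cancelled, so a
	// sync over thousands of groups is spread out instead of sent as one burst.
	if err := l.limiter.Wait(ctx); err != nil {
		return nil, fmt.Errorf("waiting for rate limiter before querying %q: %w", group, err)
	}
	return l.upstream.ListMembers(ctx, group)
}
```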

@RochesterinNYC
Author

RochesterinNYC commented Feb 3, 2022

So far, we've tried tweaking and configuring the gsuite.timeout and upstream-poll-period options.

Unfortunately, we're not able to raise our Google Groups API quota any further. For a sense of the scale here, we have over 2500 Namespaces, and hence roughly that many Google Groups to sync RBAC for (one per Namespace). In typical steady-state operation the quota is not hit, and playing around with upstream-poll-period (raising it above the 5 minute default) helped.

However, we run into issues on operations like deploying or re-deploying RBACSync. Our deployment system essentially deploys the new version of RBACSync to every cluster at roughly the same time, and the fresh RBACSync instance on each cluster immediately starts trying to sync groups (so 30 clusters * ~2500 groups ≈ 75,000 group queries against the Google Groups API). Staggering the deploys might work here, but it's not a paradigm our deployment system currently supports; it only supports gradual rollouts within each cluster, not across all clusters.

For the cache option, that isn't something that's currently or natively supported, correct? It sounds like it would take two pieces of custom software: a custom cache, plus some kind of custom Google Groups proxy API sitting between RBACSync and the Google Groups API that uses that cache (roughly like the sketch below)?
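
To make sure we're picturing the same thing, here's a rough sketch of the caching piece we have in mind between RBACSync and the Groups API (all of the names here are made up; as far as we know nothing like this exists in RBACSync today):

```go
// Rough sketch of an intermediate group-membership cache that many clusters
// could share through a proxy. Every name here is hypothetical.
package groupcache

import (
	"context"
	"sync"
	"time"
)

// MemberLister is the upstream lookup we want to shield from repeated queries.
type MemberLister interface {
	ListMembers(ctx context.Context, group string) ([]string, error)
}

type entry struct {
	members []string
	expires time.Time
}

// CachedLister answers repeated lookups for the same group from memory for a
// TTL, so many clusters syncing the same groups fan in to far fewer upstream
// calls (assuming they all go through one shared caching proxy).
type CachedLister struct {
	upstream MemberLister
	ttl      time.Duration

	mu    sync.Mutex
	cache map[string]entry
}

func NewCachedLister(upstream MemberLister, ttl time.Duration) *CachedLister {
	return &CachedLister{
		upstream: upstream,
		ttl:      ttl,
		cache:    make(map[string]entry),
	}
}

func (c *CachedLister) ListMembers(ctx context.Context, group string) ([]string, error) {
	c.mu.Lock()
	if e, ok := c.cache[group]; ok && time.Now().Before(e.expires) {
		members := e.members
		c.mu.Unlock()
		return members, nil
	}
	c.mu.Unlock()

	// Cache miss or stale entry: hit the real Groups API once, then remember it.
	members, err := c.upstream.ListMembers(ctx, group)
	if err != nil {
		return nil, err
	}

	c.mu.Lock()
	c.cache[group] = entry{members: members, expires: time.Now().Add(c.ttl)}
	c.mu.Unlock()
	return members, nil
}
```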
