feat: adjust pool dynamically based on demand #3855
base: main
Conversation
```ts
let topUp = 0;
if (event.poolSize >= 0) {
  topUp = event.poolSize - numberOfRunnersInPool;
} else if (event.poolSize === -1) {
```
Can you move the else part to a separate method, for readability?
Of course! I'm on it. Thanks for the suggestion.
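A possible shape for that refactor, as a minimal sketch only (the helper names `calculateDynamicTopUp` and `countQueuedJobs` are illustrative, not the PR's actual code):

```ts
// Sketch only: pull the dynamic branch out of the if/else so `adjust` stays readable.
// `countQueuedJobs` stands in for whatever logic measures demand for this owner
// (e.g. counting queued workflow jobs that match the runner's labels).
async function calculateDynamicTopUp(
  numberOfRunnersInPool: number,
  countQueuedJobs: () => Promise<number>,
): Promise<number> {
  const queuedJobs = await countQueuedJobs();
  // Never return a negative top-up: the pool is only ever grown here, never shrunk.
  return Math.max(queuedJobs - numberOfRunnersInPool, 0);
}
```

The `-1` branch would then reduce to something like `topUp = await calculateDynamicTopUp(numberOfRunnersInPool, countQueuedJobs);`.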
```ts
@@ -18,6 +22,13 @@ interface RunnerStatus {
  status: string;
}

function canRunJob(workflowJobLabels: string[], runnerLabels: string[]): boolean {
```
I assume this code is copied from the webhook, correct? In that case we should move it to a common module. That can be done later in a new PR; in that case, can you create an issue and link it here in the code?
Yes, that's correct. I simplified it a little bit here. It makes sense to me to extract it to a common module. I'll try to incorporate it into this PR, but in case it starts feeling too big for this change, I'll create an issue and propose a fix separately, as suggested.
Both are fine. In case you keep it here, please create an issue and refer to it in a comment.
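For reference, the label check being discussed is essentially a subset test: a runner can take a job when every label the job requests is present on the runner. A sketch of that logic, which may differ in detail from the webhook's implementation:

```ts
// Sketch of the label-matching rule: a workflow job can run on a runner iff every
// label the job requests is present in the runner's label set (case-insensitive).
function canRunJob(workflowJobLabels: string[], runnerLabels: string[]): boolean {
  const runnerSet = new Set(runnerLabels.map((label) => label.toLowerCase()));
  return workflowJobLabels.every((label) => runnerSet.has(label.toLowerCase()));
}

// Example: ['self-hosted', 'linux'] matches a runner labelled
// ['self-hosted', 'linux', 'x64'], but not one labelled ['self-hosted', 'windows'].
```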
```diff
@@ -36,7 +47,7 @@ export async function adjust(event: PoolEvent): Promise<void> {
   const launchTemplateName = process.env.LAUNCH_TEMPLATE_NAME;
   const instanceMaxSpotPrice = process.env.INSTANCE_MAX_SPOT_PRICE;
   const instanceAllocationStrategy = process.env.INSTANCE_ALLOCATION_STRATEGY || 'lowest-price'; // same as AWS default
-  const runnerOwner = process.env.RUNNER_OWNER;
+  const runnerOwners = process.env.RUNNER_OWNER.split(',');
```
I assume you propose to use a comma-separated list for multiple owners, which can be repos as well.
Yes, that's correct. It can be either a list of orgs or repos depending on the use case.
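A sketch of how the comma-separated `RUNNER_OWNER` value might be parsed (the helper name and the `OWNER/REPO` detection follow the convention described in this PR's description; the PR's actual code may differ):

```ts
// Sketch: split RUNNER_OWNER on commas and treat entries containing a slash as
// repo-level owners ('my-org/my-repo'), everything else as org-level owners.
interface RunnerOwner {
  type: 'Org' | 'Repo';
  owner: string;
}

function parseRunnerOwners(raw: string): RunnerOwner[] {
  return raw
    .split(',')
    .map((entry) => entry.trim())
    .filter((entry) => entry.length > 0)
    .map((entry): RunnerOwner => ({ type: entry.includes('/') ? 'Repo' : 'Org', owner: entry }));
}

// e.g. RUNNER_OWNER="my-org,my-org/my-repo" yields one org-level and one
// repo-level pool adjustment.
const owners = parseRunnerOwners(process.env.RUNNER_OWNER ?? '');
```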
```ts
logger.info('Checking for queued jobs to determine pool size');
let repos;
if (runnerType === 'Repo') {
  repos = [repo];
```
Would it not be easier to maintain to fetch the list of repos from the app, like you do for the org? In that case this `if` is not needed, and the pool can adjust regardless of whether it is installed in an org or a repo.
If we were to do it that way, then it wouldn't be possible to provide different configs for different repositories, since a single pool adjust configuration would always be applied to all the repositories the app has access to. Even though it is not a requirement for me at the moment, I can see it as a valid use case.
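For context, deriving the repo list from the app installation, as suggested, could look roughly like the sketch below, using Octokit's `GET /installation/repositories` endpoint via `paginate` (depending on the Octokit version, the paginate types may need an override, which is what the `@ts-expect-error` in the next snippet hints at). The trade-off raised above is that one pool configuration would then apply to every repository the installation can see:

```ts
import { Octokit } from '@octokit/rest';

// Sketch: list every repository the GitHub App installation has access to,
// instead of relying on the configured runner owner. This removes the need for
// the Org/Repo branch, at the cost of per-repository pool configuration.
async function listInstallationRepos(client: Octokit): Promise<string[]> {
  const repos = await client.paginate(client.rest.apps.listReposAccessibleToInstallation);
  return repos.map((repo) => repo.full_name);
}
```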
```ts
if (runnerType === 'Repo') {
  repos = [repo];
} else {
  // @ts-expect-error The types normalized by paginate are not correct,
```
This way of dynamically scaling could work for smaller deployments, since it will check all repos and, for each repo, the potentially queued jobs. If you have, for example, 1000 repos, that means 1000 calls to check for jobs, which means the app will hit the rate limit quickly.
GitHub does not provide a way to request queued jobs directly, so via the API there is no good alternative. But hitting a rate limit will mean the whole control plane can no longer scale until the rate limit is reset.
The only alternative I could think of is hooking into the events and using them to press less hard on the API.
I am fine with this approach, but there should be a very clear warning in the documentation. I would also opt to add a flag to enable dynamic scaling and mark it clearly experimental.
Yes, that's correct! Unfortunately, I couldn't find a more efficient way to implement this using the API alone. As you said, it could be possible to use webhook events to limit the search space: instead of checking all the repositories, we would check only those that recently had jobs queued that haven't been picked up by a runner yet.
One way to implement it could be with an extra SQS queue where we would store workflow job queued events. The queue could have a delay configured. There could also be a lambda that consumes events from that queue and checks whether the job the event is for has been scheduled already or not. If it has, that's it. If it hasn't, it would put the event back on that queue and on the queue that the scale-up lambda consumes. Would something like that make sense to you?

> I am fine with this approach, but there should be a very clear warning in the documentation. I would also opt to add a flag to enable dynamic scaling and mark it clearly experimental.
Sure, that makes sense! I'll add a new config option for enabling the "dynamic" mode explicitly instead of controlling it via setting `poolSize` to `-1`. I'll make sure to add a warning about this in the documentation too.
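The delay-queue idea above could be prototyped with something like the following. This is purely a hypothetical sketch of the re-queue-with-delay pattern being proposed, not code from the PR, and the queue URL and handler name are made up:

```ts
import { SQSClient, SendMessageCommand } from '@aws-sdk/client-sqs';

const sqs = new SQSClient({});

// Hypothetical sketch of the proposal: when a queued `workflow_job` event is
// consumed and the job still has no runner, push the event back with a delay so
// it is re-checked later (and, per the proposal, also forward it to the
// scale-up queue).
async function requeueWithDelay(queueUrl: string, eventBody: string, delaySeconds = 60): Promise<void> {
  await sqs.send(
    new SendMessageCommand({
      QueueUrl: queueUrl,
      MessageBody: eventBody,
      DelaySeconds: delaySeconds, // SQS allows at most 900 seconds per message
    }),
  );
}
```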
Thank you for the review, really appreciated 🙇 I'm aiming to apply the requested changes next week.
This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed if no further activity occurs. Thank you for your contributions.
@galargh Still a nice PR, but some time ago I implemented a job retry mechanism, which uses another queue to retry jobs. Does that also solve your issue here? We use the job retry on our smaller fleets where we have no or very small pools. I'll leave the PR open, since I see some improvements on the unit tests that I would like to bring back.
Description
This PR extends the `pool` lambda with a dynamic pool, repo-level runners, and multiple runner owners.

Why?
Dynamic Pool
Sometimes, a runner startup fails, for example, due to a connectivity issue with GitHub servers. When using the self-hosted runners in an ephemeral mode, such failures can lead to starvation. By adding support for dynamic runner scaling based on the number of currently queued jobs to the pool adjust lambda, it becomes possible to counteract this issue.
Repo Level Runners
Up until now, the pool adjustment supported only Org-level runners. Adding support for Repo-level runners means that anyone using self-hosted runners in this manner will now be able to use the pool lambda as intended.
Multiple Runner Owners
By accepting multiple runner owners as input to the pool lambda, the overall number of lambdas created for the setup can be reduced. This might be desirable when someone is running Repo-level runners for many repositories.
How?
Dynamic Pool
The use of this feature can be controlled via the `pool_size` input of the pool lambda. Setting it to `-1` will enable the dynamic pool. When the dynamic pool is enabled, the function determines the desired pool size by counting the number of currently queued jobs in the context of the runner owner.
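As an illustration of what counting queued jobs can involve with the plain REST API, here is a rough sketch (not the PR's actual code): it counts queued workflow runs per repository, which approximates queued jobs and shows why the per-repo fan-out discussed in review gets expensive for large installations.

```ts
import { Octokit } from '@octokit/rest';

// Sketch: count queued workflow runs across a set of repositories. Every repo
// costs at least one API call, which is the rate-limit concern raised in review.
async function countQueuedRuns(client: Octokit, owner: string, repos: string[]): Promise<number> {
  let queued = 0;
  for (const repo of repos) {
    const runs = await client.paginate(client.rest.actions.listWorkflowRunsForRepo, {
      owner,
      repo,
      status: 'queued',
    });
    queued += runs.length;
  }
  return queued;
}
```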
Repo Level Runners

From now on, the pool lambda will accept the `runner_owner` input in the form of `OWNER/REPO`. When the runner owner is a repository, the function will create repository-level runners instead of org-level ones. It will also check for idle runners in the repository context; it will not check for idle runners in the context of the organization.
Multiple Runner Owners

The pool lambda will expect the `runner_owner` input to be a comma-separated list of owners and process the pool adjustment for each.
Testing

I have deployed the `runners.zip` and configured the dynamic pool in the setup I use for https://github.com/libp2p and https://github.com/ipfs.

I added a set of unit tests for the newly added functionality.