feat: adjust pool dynamically based on demand #3855
base: main
Conversation
```ts
let topUp = 0;
if (event.poolSize >= 0) {
  topUp = event.poolSize - numberOfRunnersInPool;
} else if (event.poolSize === -1) {
```
Can you move the else part to a separate method, for readability?
Of course! I'm on it. Thanks for the suggestion.
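A possible shape for that refactor, as a minimal sketch only (the helper names `calculateDynamicTopUp` and `countQueuedJobs` are illustrative, not the PR's actual code):

```ts
// Sketch only: pull the dynamic branch out of the if/else so `adjust` stays readable.
// `countQueuedJobs` stands in for whatever logic measures demand for this owner
// (e.g. counting queued workflow jobs that match the runner's labels).
async function calculateDynamicTopUp(
  numberOfRunnersInPool: number,
  countQueuedJobs: () => Promise<number>,
): Promise<number> {
  const queuedJobs = await countQueuedJobs();
  // Never return a negative top-up: the pool is only ever grown here, never shrunk.
  return Math.max(queuedJobs - numberOfRunnersInPool, 0);
}
```

The `-1` branch would then reduce to something like `topUp = await calculateDynamicTopUp(numberOfRunnersInPool, countQueuedJobs);`.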
```ts
@@ -18,6 +22,13 @@ interface RunnerStatus {
  status: string;
}

function canRunJob(workflowJobLabels: string[], runnerLabels: string[]): boolean {
```
I assume this code is copied from the webhook, correct? In that case we should move it to a common module. That can be done later in a new PR; in that case, can you create an issue and link it here in the code?
Yes, that's correct. I simplified it a little bit here. It makes sense to me to extract it to a common module. I'll try to incorporate it into this PR, but in case it starts feeling too big for this change, I'll create an issue and propose a fix separately, as suggested.
Both are fine. In case you keep it here, please create an issue and refer to it in a comment.
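For reference, the label check being discussed is essentially a subset test: a runner can take a job when every label the job requests is present on the runner. A sketch of that logic, which may differ in detail from the webhook's implementation:

```ts
// Sketch of the label-matching rule: a workflow job can run on a runner iff every
// label the job requests is present in the runner's label set (case-insensitive).
function canRunJob(workflowJobLabels: string[], runnerLabels: string[]): boolean {
  const runnerSet = new Set(runnerLabels.map((label) => label.toLowerCase()));
  return workflowJobLabels.every((label) => runnerSet.has(label.toLowerCase()));
}

// Example: ['self-hosted', 'linux'] matches a runner labelled
// ['self-hosted', 'linux', 'x64'], but not one labelled ['self-hosted', 'windows'].
```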
```diff
@@ -36,7 +47,7 @@ export async function adjust(event: PoolEvent): Promise<void> {
   const launchTemplateName = process.env.LAUNCH_TEMPLATE_NAME;
   const instanceMaxSpotPrice = process.env.INSTANCE_MAX_SPOT_PRICE;
   const instanceAllocationStrategy = process.env.INSTANCE_ALLOCATION_STRATEGY || 'lowest-price'; // same as AWS default
-  const runnerOwner = process.env.RUNNER_OWNER;
+  const runnerOwners = process.env.RUNNER_OWNER.split(',');
```
I assume you propose to use a comma-separated list for multiple owners, which can be repos as well.
Yes, that's correct. It can be either a list of orgs or repos depending on the use case.
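A sketch of how the comma-separated `RUNNER_OWNER` value might be parsed (the helper name and the `OWNER/REPO` detection follow the convention described in this PR's description; the PR's actual code may differ):

```ts
// Sketch: split RUNNER_OWNER on commas and treat entries containing a slash as
// repo-level owners ('my-org/my-repo'), everything else as org-level owners.
interface RunnerOwner {
  type: 'Org' | 'Repo';
  owner: string;
}

function parseRunnerOwners(raw: string): RunnerOwner[] {
  return raw
    .split(',')
    .map((entry) => entry.trim())
    .filter((entry) => entry.length > 0)
    .map((entry): RunnerOwner => ({ type: entry.includes('/') ? 'Repo' : 'Org', owner: entry }));
}

// e.g. RUNNER_OWNER="my-org,my-org/my-repo" yields one org-level and one
// repo-level pool adjustment.
const owners = parseRunnerOwners(process.env.RUNNER_OWNER ?? '');
```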
```ts
logger.info('Checking for queued jobs to determine pool size');
let repos;
if (runnerType === 'Repo') {
  repos = [repo];
```
Would it not be easier to maintain to fetch the list of repos from the app, like you do for the org? In that case this `if` is not needed, and the pool can adjust regardless of whether it is installed in an org or a repo.
If we were to do it that way, then it wouldn't be possible to provide different configs for different repositories, since a single pool adjust configuration would always be applied to all the repositories the app has access to. Even though it is not a requirement for me at the moment, I can see it as a valid use case.
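For context, deriving the repo list from the app installation, as suggested, could look roughly like the sketch below, using Octokit's `GET /installation/repositories` endpoint via `paginate` (depending on the Octokit version, the paginate types may need an override, which is what the `@ts-expect-error` in the next snippet hints at). The trade-off raised above is that one pool configuration would then apply to every repository the installation can see:

```ts
import { Octokit } from '@octokit/rest';

// Sketch: list every repository the GitHub App installation has access to,
// instead of relying on the configured runner owner. This removes the need for
// the Org/Repo branch, at the cost of per-repository pool configuration.
async function listInstallationRepos(client: Octokit): Promise<string[]> {
  const repos = await client.paginate(client.rest.apps.listReposAccessibleToInstallation);
  return repos.map((repo) => repo.full_name);
}
```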
```ts
if (runnerType === 'Repo') {
  repos = [repo];
} else {
  // @ts-expect-error The types normalized by paginate are not correct,
```
This way of dynamically scaling could work for smaller deployments, since it will check all repos and, for each repo, the potentially queued jobs. If you have, for example, 1000 repos, that means 1000 calls to check for jobs, which means the app will hit the rate limit quickly.
GitHub does not provide a way to request queued jobs directly, so via the API there is no good alternative. But hitting a rate limit will mean the whole control plane can no longer scale until the rate limit is reset.
The only alternative I could think of is hooking into the events and using them to press less hard on the API.
I am fine with this approach, but there should be a very clear warning in the documentation. I would also opt to add a flag to enable dynamic scaling and mark it clearly experimental.
Yes, that's correct! Unfortunately, I couldn't find a more efficient way to implement this using the API alone. As you said, it could be possible to use webhook events to limit the search space: instead of checking all the repositories, we would check only those that recently had jobs queued that haven't been picked up by a runner yet.
One way to implement it could be with an extra SQS queue where we would store workflow job queued events. The queue could have a delay configured. There could also be a lambda that consumes events from that queue and checks whether the job the event is for has been scheduled already or not. If it has, that's it. If it hasn't, it would put the event back on that queue and on the queue that the scale-up lambda consumes. Would something like that make sense to you?

> I am fine with this approach, but there should be a very clear warning in the documentation. I would also opt to add a flag to enable dynamic scaling and mark it clearly experimental.
Sure, that makes sense! I'll add a new config option for enabling the "dynamic" mode explicitly instead of controlling it via setting `poolSize` to `-1`. I'll make sure to add a warning about this in the documentation too.
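The delay-queue idea above could be prototyped with something like the following. This is purely a hypothetical sketch of the re-queue-with-delay pattern being proposed, not code from the PR, and the queue URL and handler name are made up:

```ts
import { SQSClient, SendMessageCommand } from '@aws-sdk/client-sqs';

const sqs = new SQSClient({});

// Hypothetical sketch of the proposal: when a queued `workflow_job` event is
// consumed and the job still has no runner, push the event back with a delay so
// it is re-checked later (and, per the proposal, also forward it to the
// scale-up queue).
async function requeueWithDelay(queueUrl: string, eventBody: string, delaySeconds = 60): Promise<void> {
  await sqs.send(
    new SendMessageCommand({
      QueueUrl: queueUrl,
      MessageBody: eventBody,
      DelaySeconds: delaySeconds, // SQS allows at most 900 seconds per message
    }),
  );
}
```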
Thank you for the review, really appreciated 🙇 I'm aiming to apply the requested changes next week.
This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed if no further activity occurs. Thank you for your contributions.
@galargh Still a nice PR, but some time ago I implemented a job retry mechanism, which uses another queue to retry jobs. Does that also solve your issue here? We use the job retry on our smaller fleets where we have no or very small pools. I'll leave the PR open, since I see some improvements on the unit tests that I would like to bring back.
Description
This PR extends the `pool` lambda with a dynamic pool, repo-level runners, and multiple runner owners.

Why?
Dynamic Pool
Sometimes, a runner startup fails, for example, due to a connectivity issue with GitHub servers. When using the self-hosted runners in an ephemeral mode, such failures can lead to starvation. By adding support for dynamic runner scaling based on the number of currently queued jobs to the pool adjust lambda, it becomes possible to counteract this issue.
Repo Level Runners
Up until now, the pool adjustment supported only Org-level runners. Adding support for Repo-level runners means that anyone using self-hosted runners in this manner will now be able to use the pool lambda as intended.
Multiple Runner Owners
By accepting multiple runner owners as input to the pool lambda, the overall number of lambdas created for the setup can be reduced. This might be desirable when someone is running Repo-level runners for many repositories.
How?
Dynamic Pool
The use of this feature can be controlled via the `pool_size` input of the pool lambda. Setting it to `-1` will enable the dynamic pool. When the dynamic pool is enabled, the function determines the desired pool size by counting the number of currently queued jobs in the context of the runner owner.
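As an illustration of what counting queued jobs can involve with the plain REST API, here is a rough sketch (not the PR's actual code): it counts queued workflow runs per repository, which approximates queued jobs and shows why the per-repo fan-out discussed in review gets expensive for large installations.

```ts
import { Octokit } from '@octokit/rest';

// Sketch: count queued workflow runs across a set of repositories. Every repo
// costs at least one API call, which is the rate-limit concern raised in review.
async function countQueuedRuns(client: Octokit, owner: string, repos: string[]): Promise<number> {
  let queued = 0;
  for (const repo of repos) {
    const runs = await client.paginate(client.rest.actions.listWorkflowRunsForRepo, {
      owner,
      repo,
      status: 'queued',
    });
    queued += runs.length;
  }
  return queued;
}
```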
Repo Level Runners

From now on, the pool lambda will accept the `runner_owner` input in the form of `OWNER/REPO`. When the runner owner is a repository, the function will create repository-level runners instead of org-level ones. It will also check for idle runners in the repository context; it will not check for idle runners in the context of the organization.
Multiple Runner Owners

The pool lambda will expect the `runner_owner` input to be a comma-separated list of owners and process the pool adjustment for each.
Testing

I have deployed the `runners.zip` and configured the dynamic pool in the setup I use for https://github.com/libp2p and https://github.com/ipfs.

I added a set of unit tests for the newly added functionality.