updates to support 1000-node SPS scaling test using the IDPS SRL pipeline #414
Purpose
This PR updates the SPS deployment, configuration, and DAGs so that SPS can support a large-scale test (1000 worker nodes).
Proposed Changes
- Set `weight_rule` and `priority_weight` so that large-scale processing is not bottlenecked by the default priority scheme
- Set `retries` to `3` for all KPO calls to improve dagRun resiliency
- `c6i.large` PGE requirements
- `spot` for cost savings
- Updated values in `airflow/helm/values.tmpl.yaml` that were required to perform the 1000-node SPS scaling test; the updated values were determined by iterating over the large-scale test many times and making these tweaks to resolve various bottlenecks
Issues
#285
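For reviewers less familiar with the Airflow settings named under Proposed Changes, here is a minimal sketch of how `retries`, `priority_weight`, and `weight_rule` interact. It does not import Airflow and is not the code in this PR. The helper names (`make_default_args`, `effective_weights`), the `"absolute"` rule, the weight of `1`, and the one-minute `retry_delay` are assumptions for illustration; only `retries=3` comes from the description above. The three rule names are the ones Airflow accepts, and the arithmetic follows Airflow's documented behaviour for a simple linear chain of tasks.

```python
from datetime import timedelta

# Weight rules accepted by Airflow operators.
VALID_WEIGHT_RULES = {"downstream", "upstream", "absolute"}


def make_default_args(retries=3, priority_weight=1, weight_rule="absolute"):
    """Build a default_args dict that a DAG could hand to every task.

    retries=3 matches the PR description; the other defaults are
    illustrative assumptions, not values taken from the PR.
    """
    if weight_rule not in VALID_WEIGHT_RULES:
        raise ValueError(f"unknown weight_rule: {weight_rule!r}")
    return {
        "retries": retries,
        "retry_delay": timedelta(minutes=1),  # assumed, not from the PR
        "priority_weight": priority_weight,
        "weight_rule": weight_rule,
    }


def effective_weights(chain_weights, weight_rule):
    """Effective priority of each task in a linear chain t0 >> t1 >> ...

    chain_weights[i] is the priority_weight of task i.
    """
    n = len(chain_weights)
    if weight_rule == "absolute":
        # Each task keeps exactly its own weight.
        return list(chain_weights)
    if weight_rule == "downstream":
        # Own weight plus the weights of all tasks after it (Airflow's default).
        return [sum(chain_weights[i:]) for i in range(n)]
    if weight_rule == "upstream":
        # Own weight plus the weights of all tasks before it.
        return [sum(chain_weights[: i + 1]) for i in range(n)]
    raise ValueError(f"unknown weight_rule: {weight_rule!r}")


if __name__ == "__main__":
    print(make_default_args())
    for rule in ("downstream", "upstream", "absolute"):
        print(rule, effective_weights([1, 1, 1], rule))
```

Under the default `downstream` rule, the first task of every dagRun outranks all later tasks, so with many concurrent dagRuns the scheduler tends to favour starting new runs over finishing existing ones. In a real DAG, a dict like this would typically be passed as `default_args` so that each operator, including `KubernetesPodOperator`, inherits it.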
Testing
A video of the large-scale test can be viewed here: https://drive.google.com/file/d/1CZ1U0AOUy8bzm9HDWlq6YUVH75InzyEG/view?usp=drive_link