WIP: MULTIARCH-4974: Cluster wide architecture weighted affinity #452

Open

wants to merge 4 commits into base: main from cluster-wide-architecture-weighted-affinity

Conversation

AnnaZivkovic
Contributor

Implements cluster-wide weights using PreferredDuringSchedulingIgnoredDuringExecution, which stores its items as a list.

In the case where a pod already has predefined PreferredSchedulingTerms, we must preserve their weights and avoid unbalancing them with the new architecture weights. To do so, we can normalize the existing weights using the architecture weights:

new_weight = 100 * old_weight / sum(arch weights)

For example, with user-defined arch weights:

  • arm64, 50
  • amd64, 75

the pod YAML would look like the following if there was a pre-existing rule with weight 15:

      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 52   # 100 * (15 + 50) / (50 + 75)
          preference:
            matchExpressions:
              - key: label-1
                operator: In
                values:
                  - key-1
              - key: label/arch
                operator: In
                values:
                  - arm64
        - weight: 72   # 100 * (15 + 75) / (50 + 75)
          preference:
            matchExpressions:
              - key: label-1
                operator: In
                values:
                  - key-1
              - key: label/arch
                operator: In
                values:
                  - amd64
        - weight: 40   # 100 * 50 / (50 + 75)
          preference:
            matchExpressions:
              - key: label/arch
                operator: In
                values:
                  - arm64
        - weight: 60   # 100 * 75 / (50 + 75)
          preference:
            matchExpressions:
              - key: label/arch
                operator: In
                values:
                  - amd64
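For reference, a minimal Go sketch of the weight arithmetic used in this example (hypothetical helper names, not the PR's actual code; the denominator is the sum of the architecture weights, and integer division plays the role of rounding down):

    package sketch

    // scaledWeight combines a pre-existing term's weight with one architecture
    // weight and normalizes by the sum of the architecture weights.
    // e.g. scaledWeight(15, 50, 50+75) == 52 and scaledWeight(15, 75, 50+75) == 72.
    func scaledWeight(oldWeight, archWeight, archWeightSum int32) int32 {
        return 100 * (oldWeight + archWeight) / archWeightSum
    }

    // archOnlyWeight normalizes an architecture-only term against the same sum.
    // e.g. archOnlyWeight(50, 50+75) == 40 and archOnlyWeight(75, 50+75) == 60.
    func archOnlyWeight(archWeight, archWeightSum int32) int32 {
        return 100 * archWeight / archWeightSum
    }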

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 5, 2025
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 5, 2025
@openshift-ci-robot

openshift-ci-robot commented Feb 5, 2025

@AnnaZivkovic: This pull request references MULTIARCH-4974 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this: (the PR description above)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


openshift-ci bot commented Feb 5, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from annazivkovic. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@aleskandro
Member

aleskandro commented Feb 5, 2025

@Prashanth684 and I were also discussing providing two merging strategies through an additional field for the nodeAffinityScoring plugin, mergingStrategy: append|normalize.

In append mode, we concatenate the CPPC's list of preferred affinity terms to the ones in the Pod we are reconciling. Users are free to set the weights in their pods and in the CPPC according to their preference, since no normalization will take place. A sketch of this mode follows.
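A minimal Go sketch of what append mode could look like, using the k8s.io/api/core/v1 types (hypothetical function name, not the operator's actual implementation):

    package sketch

    import corev1 "k8s.io/api/core/v1"

    // appendMerge concatenates the CPPC terms to the pod's existing preferred
    // scheduling terms; weights are taken as-is, with no normalization.
    func appendMerge(pod, cppc []corev1.PreferredSchedulingTerm) []corev1.PreferredSchedulingTerm {
        out := make([]corev1.PreferredSchedulingTerm, 0, len(pod)+len(cppc))
        out = append(out, pod...)
        out = append(out, cppc...)
        return out
    }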

normalize merging strategy

In normalize mode, we can (a) combine the two sets with the cartesian product, and (b) append the CPPC preferences as they are too.

Combining the two sets with the cartesian product allows us to scale the preference for a given term to N preferences weighted by both the architecture preference and the one already defined. For example, considering the architecture preferences to be $$\{(25, arm64), (75, amd64)\}$$, a pod expressing a preference for a node labeled disktype=ssd with weight 1 (with no other terms, a weight of 1 is the same as 100%, which may lose significance if the architecture weights are set differently, e.g., by letting their sum be 100) should be scaled to prefer, in order:

  1. disktype==ssd + amd64
  2. disktype==ssd + arm64
  3. amd64 (includes the case no disktype label is present in the node)
  4. arm64 (includes the case no disktype label is present in the node)

Changing the previously set preferred affinity term, $$(1, disktype=ssd)$$, can be considered safe as there is no node where the architecture labels are not available (that could let the scheduler skip scoring a node labeled disktype==ssd).

Finally, also appending the CPPC's architecture preferred affinity terms as they are will cover the scoring of nodes that carry none of the other labels referenced in the pod's preferred affinity terms, only the architecture ones. In this way, we don't risk skipping the architecture-aware scoring of nodes that do not match any of the matchExpressions already set in the pod spec's preferred affinity terms.

In the case where a pod already has predefined PreferredSchedulingTerms, we must preserve their weights and avoid unbalancing them with the new architecture weights. To do so, we can normalize the existing weights using the architecture weights:

new_weight = 100 * old_weight / sum(arch weights)

I think the normalization should use the sum of all the weights (arch weights and already existing weights) in the denominator.

Each preferred term is a tuple

$$T = (weight \in \mathbb{N}^+, labelSelectorPredicates)$$

The resulting set of preferred affinity terms should consist of the union of the cartesian product of the pod ones with the CPPC ones and the CPPC ones:

$$\text{PodPreferredAffinityTerms}_\text{New} = \left(\text{PodPreferredAffinityTerms}_\text{Old} \times \text{CPPCPreferredAffinityTerms}\right) \cup \text{CPPCPreferredAffinityTerms}$$

$$\left|\text{PodPreferredAffinityTerms}_\text{New}\right| \le \left|\text{PodPreferredAffinityTerms}_\text{Old}\right| \cdot 4 + 4$$

For each preferred affinity term $$T_{old} = (weight_{old}, labelSelectorPredicates_{old})$$ in the pod specifications, we will have N terms $$T_i$$, where N is the number of (architecture) preferred affinity terms in the CPPC (at most 4).

In the new set, $$T_{i} \forall i \in \text{CPPCPreferredAffinityTerms} \subseteq \{ amd64, arm64, ppc64le, s390x \} $$ should consist of:

$$labelSelectorPredicates_{new, i} = labelSelectorPredicates_{old} \cup \{ \text{"kubernetes.io/arch"} \in \{ i \} \}$$

$$weight_{new, i} = \Bigg\lfloor 100 \cdot \frac{weight_{old} + weight_{i}}{\sum_{j=1}^{M}{weight_{old,j}} + \sum_{k=1}^{N}{weight_{k}}} \Bigg\rfloor$$

where M is the number of preferred affinity terms in the Pod specs: $$M = \left|\text{PodPreferredAffinityTerms}\right|$$
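A rough Go sketch of this normalize merge (again with the k8s.io/api/core/v1 types and a hypothetical function name, not the operator's actual code); it follows the worked example below, where the CPPC-only terms are also normalized against the same denominator:

    package sketch

    import corev1 "k8s.io/api/core/v1"

    // normalizeMerge builds the cartesian product of the pod terms with the CPPC
    // (architecture) terms and also appends the CPPC terms, normalizing every
    // weight against the sum of all pod and CPPC weights (integer division ~ floor).
    func normalizeMerge(pod, cppc []corev1.PreferredSchedulingTerm) []corev1.PreferredSchedulingTerm {
        var total int32
        for _, t := range pod {
            total += t.Weight
        }
        for _, t := range cppc {
            total += t.Weight
        }
        var out []corev1.PreferredSchedulingTerm
        for _, p := range pod {
            for _, c := range cppc {
                merged := *p.DeepCopy()
                merged.Weight = 100 * (p.Weight + c.Weight) / total
                merged.Preference.MatchExpressions = append(merged.Preference.MatchExpressions,
                    c.Preference.MatchExpressions...)
                out = append(out, merged)
            }
        }
        for _, c := range cppc {
            appended := *c.DeepCopy()
            appended.Weight = 100 * c.Weight / total
            out = append(out, appended)
        }
        return out
    }

With the inputs of the example below, this yields weights 31 and 77 for the combined terms and 22 and 68 for the architecture-only ones.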

Example

$$\text{CPPCPreferredAffinityTerms} = \{ (25, \{amd64\}), (75, \{arm64\}) \}$$

$$\text{PodPreferredAffinityTerms} = \{ (10, \{ssd\}) \}$$

$$\text{PodPreferredAffinityTerms}_{new} = \{(31, \{amd64, ssd\}), (77, \{arm64, ssd\}), (22, \{amd64\}), (68, \{arm64\})\}$$

For that pod, the normalize merging strategy will prioritize:

  1. ARM64 Node with disktype=ssd
  2. ARM64 Node without disktype=ssd
  3. AMD64 Node with disktype=ssd
  4. AMD64 Node without disktype=ssd

The concatenate merging strategy would produce

$$\text{PodPreferredAffinityTerms}_{new} = \{(25, \{amd64\}), (75, \{arm64\}), (10, \{ssd\}) \}$$

In this case, the scheduler will prioritize:

  1. ARM64 Node with disktype=ssd
  2. ARM64 Node without disktype=ssd
  3. AMD64 Node with disktype=ssd
  4. AMD64 Node without disktype=ssd

Which is the same, though. So I'm not sure yet if we should skip the two strategies (implementing concatenation only) or find a better normalization formula.

@AnnaZivkovic, having this information in a small EP (and working through some more examples to get a clearer idea of how we will affect the scheduling of pods with pre-existing affinity terms) in the docs/ folder might be nice.

@AnnaZivkovic AnnaZivkovic force-pushed the cluster-wide-architecture-weighted-affinity branch from 6777797 to e39938b Compare February 5, 2025 20:36
@Prashanth684
Contributor

@AnnaZivkovic, having this information in a small EP (and working through some more examples to get a clearer idea of how we will affect the scheduling of pods with pre-existing affinity terms) in the docs/ folder might be nice.

+1, nice write-up @aleskandro. We should definitely get this recorded in an EP and define the two strategies clearly. In the future there might also be room to add more strategies, e.g., one where an input variable or even a normalization function is provided by the user to influence the scheduling.

@Prashanth684
Contributor

Which is the same, though. So I'm not sure yet if we should skip the two strategies (implementing concatenation only) or find a better normalization formula.

Wait, I'm confused... the whole reason we thought about the merging strategy was because the nodes with ssd should be considered over nodes without ssd? Or is it just purely based on the higher number?

@AnnaZivkovic AnnaZivkovic force-pushed the cluster-wide-architecture-weighted-affinity branch from d3b7b94 to 7c27733 Compare February 7, 2025 00:54
@AnnaZivkovic
Contributor Author

Which is the same, though. So I'm not sure yet if we should skip the two strategies (implementing concatenation only) or find a better normalization formula.

Wait, I'm confused... the whole reason we thought about the merging strategy was because the nodes with ssd should be considered over nodes without ssd? Or is it just purely based on the higher number?

When the scheduler finds nodes that meet all the other scheduling requirements of the Pod, the scheduler iterates through every preferred rule that the node satisfies and adds the value of the weight for that expression to a sum. The final sum is added to the score of other priority functions for the node. Nodes with the highest total score are prioritized when the scheduler makes a scheduling decision for the Pod.

https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/

So it will prefer the one with the highest score. We could just append to this list, but we would risk unbalancing any predefined user rules. A simplified sketch of that scoring follows.
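As a simplified Go illustration of the scoring quoted above (hypothetical function, only the In operator is handled, and the real scheduler additionally normalizes this sum against the other scoring plugins):

    package sketch

    import corev1 "k8s.io/api/core/v1"

    // preferredAffinityScore sums the weights of the preferred terms whose
    // matchExpressions the node's labels satisfy.
    func preferredAffinityScore(nodeLabels map[string]string, terms []corev1.PreferredSchedulingTerm) int32 {
        var score int32
        for _, t := range terms {
            matches := true
            for _, req := range t.Preference.MatchExpressions {
                if req.Operator != corev1.NodeSelectorOpIn {
                    continue // sketch: only "In" requirements are evaluated
                }
                found := false
                for _, v := range req.Values {
                    if nodeLabels[req.Key] == v {
                        found = true
                        break
                    }
                }
                if !found {
                    matches = false
                    break
                }
            }
            if matches {
                score += t.Weight
            }
        }
        return score
    }

Since every matching term adds its weight to the node's score, blindly appending new terms can shift the balance of rules the user had already tuned, which is the motivation for normalizing.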

main.go Outdated
@AnnaZivkovic AnnaZivkovic force-pushed the cluster-wide-architecture-weighted-affinity branch 2 times, most recently from dcab070 to ececc41 Compare February 11, 2025 16:55
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 11, 2025
…terPodPlacementConfig

Signed-off-by: Punith Kenchappa <[email protected]>

# Conflicts:
#	bundle/manifests/multiarch-tuning-operator.clusterserviceversion.yaml
@AnnaZivkovic AnnaZivkovic force-pushed the cluster-wide-architecture-weighted-affinity branch from ececc41 to ea9a87b Compare February 11, 2025 21:45
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 11, 2025
@AnnaZivkovic AnnaZivkovic force-pushed the cluster-wide-architecture-weighted-affinity branch from ea9a87b to c694bbb Compare February 11, 2025 21:47
@AnnaZivkovic AnnaZivkovic force-pushed the cluster-wide-architecture-weighted-affinity branch from c694bbb to 1df7a2c Compare February 11, 2025 22:34

openshift-ci bot commented Feb 13, 2025

@AnnaZivkovic: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/vendor 13b95a0 link true /test vendor
ci/prow/ocp416-e2e-gcp 13b95a0 link true /test ocp416-e2e-gcp
ci/prow/lint 13b95a0 link true /test lint
ci/prow/unit 13b95a0 link true /test unit
ci/prow/goimports 13b95a0 link true /test goimports
ci/prow/ocp418-e2e-gcp 13b95a0 link true /test ocp418-e2e-gcp
ci/prow/ocp417-e2e-gcp 13b95a0 link true /test ocp417-e2e-gcp
ci/prow/vet 13b95a0 link true /test vet

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
