WIP: MULTIARCH-4974: Cluster wide architecture weighted affinity #452

Open

wants to merge 4 commits into base: main from cluster-wide-architecture-weighted-affinity

Conversation

AnnaZivkovic
Contributor

Implements cluster-wide weights using PreferredDuringSchedulingIgnoredDuringExecution, which stores its items as a list.

In the case where a pod already has predefined PreferredSchedulingTerms, we must preserve their weights and avoid unbalancing them with the new architecture weights. To do so, we can normalize the existing weights using the architecture weights:

new_weight = 100 * old_weight / sum(arch weights)

For example, with user-defined arch weights:

  • arm64, 50
  • amd64, 75

the pod YAML would look like the following if there was a pre-existing rule with weight 15:

      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 52   # 100 * (15 + 50) / (50 + 75)
          preference:
            matchExpressions:
              - key: label-1
                operator: In
                values:
                  - key-1
              - key: label/arch
                operator: In
                values:
                  - arm64
        - weight: 72   # 100 * (15 + 75) / (50 + 75)
          preference:
            matchExpressions:
              - key: label-1
                operator: In
                values:
                  - key-1
              - key: label/arch
                operator: In
                values:
                  - amd64
        - weight: 40   # 100 * 50 / (50 + 75)
          preference:
            matchExpressions:
              - key: label/arch
                operator: In
                values:
                  - arm64
        - weight: 60   # 100 * 75 / (50 + 75)
          preference:
            matchExpressions:
              - key: label/arch
                operator: In
                values:
                  - amd64
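For reference, a minimal Go sketch of the weight arithmetic used in this example (hypothetical helper names, not the PR's actual code; the denominator is the sum of the architecture weights, and integer division plays the role of rounding down):

    package sketch

    // scaledWeight combines a pre-existing term's weight with one architecture
    // weight and normalizes by the sum of the architecture weights.
    // e.g. scaledWeight(15, 50, 50+75) == 52 and scaledWeight(15, 75, 50+75) == 72.
    func scaledWeight(oldWeight, archWeight, archWeightSum int32) int32 {
        return 100 * (oldWeight + archWeight) / archWeightSum
    }

    // archOnlyWeight normalizes an architecture-only term against the same sum.
    // e.g. archOnlyWeight(50, 50+75) == 40 and archOnlyWeight(75, 50+75) == 60.
    func archOnlyWeight(archWeight, archWeightSum int32) int32 {
        return 100 * archWeight / archWeightSum
    }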

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 5, 2025
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 5, 2025
@openshift-ci-robot

openshift-ci-robot commented Feb 5, 2025

@AnnaZivkovic: This pull request references MULTIARCH-4974 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this: (the PR description above)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


openshift-ci bot commented Feb 5, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from annazivkovic. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@aleskandro
Member

aleskandro commented Feb 5, 2025

@Prashanth684 and I were also discussing providing two merging strategies through an additional field for the nodeAffinityScoring plugin, mergingStrategy: append|normalize.

In append mode, we concatenate the CPPC's list of preferred affinity terms to the ones in the Pod we are reconciling. Users are free to set the weights in their pods and in the CPPC according to their preference, since no normalization will take place. A sketch of this mode follows.
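A minimal Go sketch of what append mode could look like, using the k8s.io/api/core/v1 types (hypothetical function name, not the operator's actual implementation):

    package sketch

    import corev1 "k8s.io/api/core/v1"

    // appendMerge concatenates the CPPC terms to the pod's existing preferred
    // scheduling terms; weights are taken as-is, with no normalization.
    func appendMerge(pod, cppc []corev1.PreferredSchedulingTerm) []corev1.PreferredSchedulingTerm {
        out := make([]corev1.PreferredSchedulingTerm, 0, len(pod)+len(cppc))
        out = append(out, pod...)
        out = append(out, cppc...)
        return out
    }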

normalize merging strategy

In normalize mode, we can (a) combine the two sets with the cartesian product, and (b) append the CPPC preferences as they are too.

Combining the two sets with the cartesian product allows us to scale the preference for a given term to N preferences weighted by both the architecture preference and the one already defined. For example, considering the architecture preferences to be $$\{(25, arm64), (75, amd64)\}$$, a pod expressing a preference for a node labeled disktype=ssd with weight 1 (with no other terms, a weight of 1 is the same as 100%, which may lose significance if the architecture weights are set differently, e.g., by letting their sum be 100) should be scaled to prefer, in order:

  1. disktype==ssd + amd64
  2. disktype==ssd + arm64
  3. amd64 (includes the case no disktype label is present in the node)
  4. arm64 (includes the case no disktype label is present in the node)

Changing the previously set preferred affinity term, $$(1, disktype=ssd)$$, can be considered safe as there is no node where the architecture labels are not available (that could let the scheduler skip scoring a node labeled disktype==ssd).

Finally, also appending the CPPC's architecture preferred affinity terms as they are will cover the scoring of nodes that carry none of the other labels referenced in the pod's preferred affinity terms, only the architecture ones. In this way, we don't risk skipping the architecture-aware scoring of nodes that do not match any of the matchExpressions already set in the pod spec's preferred affinity terms.

In the case where a pod already has predefined PreferredSchedulingTerms, we must preserve their weights and avoid unbalancing them with the new architecture weights. To do so, we can normalize the existing weights using the architecture weights:

new_weight = 100 * old_weight / sum(arch weights)

I think the normalization should use the sum of all the weights (arch weights and already existing weights) in the denominator.

Each preferred term is a tuple

$$T = (weight \in \mathbb{N}^+, labelSelectorPredicates)$$

The resulting set of preferred affinity terms should consist of the union of the cartesian product of the pod ones with the CPPC ones and the CPPC ones:

$$\text{PodPreferredAffinityTerms}_\text{New} = \left(\text{PodPreferredAffinityTerms}_\text{Old} \times \text{CPPCPreferredAffinityTerms}\right) \cup \text{CPPCPreferredAffinityTerms}$$

$$\left|\text{PodPreferredAffinityTerms}_\text{New}\right| \le \left|\text{PodPreferredAffinityTerms}_\text{Old}\right| \cdot 4 + 4$$

For each preferred affinity term $$T_{old} = (weight_{old}, labelSelectorPredicates_{old})$$ in the pod specifications, we will have N terms $$T_i$$, where N is the number of (architecture) preferred affinity terms in the CPPC (at most 4).

In the new set, $$T_{i} \forall i \in \text{CPPCPreferredAffinityTerms} \subseteq \{ amd64, arm64, ppc64le, s390x \} $$ should consist of:

$$labelSelectorPredicates_{new, i} = labelSelectorPredicates_{old} \cup \{ \text{"kubernetes.io/arch"} \in \{ i \} \}$$

$$weight_{new, i} = \Bigg\lfloor 100 \cdot \frac{weight_{old} + weight_{i}}{\sum_{j=1}^{M}{weight_{old,j}} + \sum_{k=1}^{N}{weight_{k}}} \Bigg\rfloor$$

where M is the number of preferred affinity terms in the Pod specs: $$M = \left|\text{PodPreferredAffinityTerms}\right|$$
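A rough Go sketch of this normalize merge (again with the k8s.io/api/core/v1 types and a hypothetical function name, not the operator's actual code); it follows the worked example below, where the CPPC-only terms are also normalized against the same denominator:

    package sketch

    import corev1 "k8s.io/api/core/v1"

    // normalizeMerge builds the cartesian product of the pod terms with the CPPC
    // (architecture) terms and also appends the CPPC terms, normalizing every
    // weight against the sum of all pod and CPPC weights (integer division ~ floor).
    func normalizeMerge(pod, cppc []corev1.PreferredSchedulingTerm) []corev1.PreferredSchedulingTerm {
        var total int32
        for _, t := range pod {
            total += t.Weight
        }
        for _, t := range cppc {
            total += t.Weight
        }
        var out []corev1.PreferredSchedulingTerm
        for _, p := range pod {
            for _, c := range cppc {
                merged := *p.DeepCopy()
                merged.Weight = 100 * (p.Weight + c.Weight) / total
                merged.Preference.MatchExpressions = append(merged.Preference.MatchExpressions,
                    c.Preference.MatchExpressions...)
                out = append(out, merged)
            }
        }
        for _, c := range cppc {
            appended := *c.DeepCopy()
            appended.Weight = 100 * c.Weight / total
            out = append(out, appended)
        }
        return out
    }

With the inputs of the example below, this yields weights 31 and 77 for the combined terms and 22 and 68 for the architecture-only ones.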

Example

$$\text{CPPCPreferredAffinityTerms} = \{ (25, \{amd64\}), (75, \{arm64\}) \}$$

$$\text{PodPreferredAffinityTerms} = \{ (10, \{ssd\}) \}$$

$$\text{PodPreferredAffinityTerms}_{new} = \{(31, \{amd64, ssd\}), (77, \{arm64, ssd\}), (22, \{amd64\}), (68, \{arm64\})\}$$

For that pod, the normalize merging strategy will prioritize:

  1. ARM64 Node with disktype=ssd
  2. ARM64 Node without disktype=ssd
  3. AMD64 Node with disktype=ssd
  4. AMD64 Node without disktype=ssd

The concatenate merging strategy would produce

$$\text{PodPreferredAffinityTerms}_{new} = \{(25, \{amd64\}), (75, \{arm64\}), (10, \{ssd\}) \}$$

In this case, the scheduler will prioritize:

  1. ARM64 Node with disktype=ssd
  2. ARM64 Node without disktype=ssd
  3. AMD64 Node with disktype=ssd
  4. AMD64 Node without disktype=ssd

Which is the same, though. So I'm not sure yet if we should skip the two strategies (implementing concatenation only) or find a better normalization formula.

@AnnaZivkovic, having this information in a small EP (and working through some more examples to get a clearer idea of how we will affect the scheduling of pods with pre-existing affinity terms) in the docs/ folder might be nice.

@AnnaZivkovic AnnaZivkovic force-pushed the cluster-wide-architecture-weighted-affinity branch from 6777797 to e39938b Compare February 5, 2025 20:36
@Prashanth684
Contributor

@AnnaZivkovic, having this information in a small EP (and working through some more examples to get a clearer idea of how we will affect the scheduling of pods with pre-existing affinity terms) in the docs/ folder might be nice.

+1, nice write-up @aleskandro. We should definitely get this recorded in an EP and define the two strategies clearly. In the future there might also be room to add more strategies, e.g., one where an input variable or even a normalization function is provided by the user to influence the scheduling.

@Prashanth684
Contributor

Which is the same, though. So I'm not sure yet if we should skip the two strategies (implementing concatenation only) or find a better normalization formula.

Wait, I'm confused... the whole reason we thought about the merging strategy was because the nodes with ssd should be considered over nodes without ssd? Or is it just purely based on the higher number?

@AnnaZivkovic AnnaZivkovic force-pushed the cluster-wide-architecture-weighted-affinity branch from d3b7b94 to 7c27733 Compare February 7, 2025 00:54
@AnnaZivkovic
Contributor Author

Which is the same, though. So I'm not sure yet if we should skip the two strategies (implementing concatenation only) or find a better normalization formula.

Wait, I'm confused... the whole reason we thought about the merging strategy was because the nodes with ssd should be considered over nodes without ssd? Or is it just purely based on the higher number?

When the scheduler finds nodes that meet all the other scheduling requirements of the Pod, the scheduler iterates through every preferred rule that the node satisfies and adds the value of the weight for that expression to a sum. The final sum is added to the score of other priority functions for the node. Nodes with the highest total score are prioritized when the scheduler makes a scheduling decision for the Pod.

https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/

So it will prefer the one with the highest score. We could just append to this list, but we would risk unbalancing any predefined user rules. A simplified sketch of that scoring follows.
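As a simplified Go illustration of the scoring quoted above (hypothetical function, only the In operator is handled, and the real scheduler additionally normalizes this sum against the other scoring plugins):

    package sketch

    import corev1 "k8s.io/api/core/v1"

    // preferredAffinityScore sums the weights of the preferred terms whose
    // matchExpressions the node's labels satisfy.
    func preferredAffinityScore(nodeLabels map[string]string, terms []corev1.PreferredSchedulingTerm) int32 {
        var score int32
        for _, t := range terms {
            matches := true
            for _, req := range t.Preference.MatchExpressions {
                if req.Operator != corev1.NodeSelectorOpIn {
                    continue // sketch: only "In" requirements are evaluated
                }
                found := false
                for _, v := range req.Values {
                    if nodeLabels[req.Key] == v {
                        found = true
                        break
                    }
                }
                if !found {
                    matches = false
                    break
                }
            }
            if matches {
                score += t.Weight
            }
        }
        return score
    }

Since every matching term adds its weight to the node's score, blindly appending new terms can shift the balance of rules the user had already tuned, which is the motivation for normalizing.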

main.go Outdated
@AnnaZivkovic AnnaZivkovic force-pushed the cluster-wide-architecture-weighted-affinity branch 2 times, most recently from dcab070 to ececc41 Compare February 11, 2025 16:55
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 11, 2025
…terPodPlacementConfig

Signed-off-by: Punith Kenchappa <[email protected]>

# Conflicts:
#	bundle/manifests/multiarch-tuning-operator.clusterserviceversion.yaml
@AnnaZivkovic AnnaZivkovic force-pushed the cluster-wide-architecture-weighted-affinity branch from ececc41 to ea9a87b Compare February 11, 2025 21:45
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 11, 2025
@AnnaZivkovic AnnaZivkovic force-pushed the cluster-wide-architecture-weighted-affinity branch from ea9a87b to c694bbb Compare February 11, 2025 21:47
@AnnaZivkovic AnnaZivkovic force-pushed the cluster-wide-architecture-weighted-affinity branch from c694bbb to 1df7a2c Compare February 11, 2025 22:34

openshift-ci bot commented Feb 13, 2025

@AnnaZivkovic: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/vendor 13b95a0 link true /test vendor
ci/prow/ocp416-e2e-gcp 13b95a0 link true /test ocp416-e2e-gcp
ci/prow/lint 13b95a0 link true /test lint
ci/prow/unit 13b95a0 link true /test unit
ci/prow/goimports 13b95a0 link true /test goimports
ci/prow/ocp418-e2e-gcp 13b95a0 link true /test ocp418-e2e-gcp
ci/prow/ocp417-e2e-gcp 13b95a0 link true /test ocp417-e2e-gcp
ci/prow/vet 13b95a0 link true /test vet

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
