[wip][OTA-1545] Extend ClusterVersion for accepted risks #2360

Open: wants to merge 3 commits into base: master

Member

@hongkailiu hongkailiu commented Jun 9, 2025

The API extension is proposed in openshift/enhancements#1807.

  • Add a new field 'clusterversion.spec.desiredUpdate.accept': It contains
    the names of conditional update risks that are considered acceptable.
  • Move clusterversion.status.conditionalUpdates.risks two levels up as
    clusterversion.status.conditionalUpdateRisks. It contains all the risks
    for clusterversion.status.conditionalUpdates.
  • Add a new field 'clusterversion.status.conditionalUpdates.riskNames': It
    contains the names of the risks for the conditional update. It deprecates
    clusterversion.status.conditionalUpdates.risks.
  • Add a new field 'clusterversion.status.conditionalUpdateRisks.conditions':
    It contains the observations of the conditional update risk's current
    status.

Contributor

openshift-ci bot commented Jun 9, 2025

Hello @hongkailiu! Some important instructions when contributing to openshift/api:
API design plays an important part in the user experience of OpenShift and as such API PRs are subject to a high level of scrutiny to ensure they follow our best practices. If you haven't already done so, please review the OpenShift API Conventions and ensure that your proposed changes are compliant. Following these conventions will help expedite the api review process for your PR.

@openshift-ci openshift-ci bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jun 9, 2025
@hongkailiu hongkailiu changed the title [OTA-1545] Extend ClusterVersion for accepted risks [wip][OTA-1545] Extend ClusterVersion for accepted risks Jun 9, 2025
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 9, 2025
@openshift-ci openshift-ci bot requested review from deads2k and everettraven June 9, 2025 21:08
Contributor

openshift-ci bot commented Jun 9, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: hongkailiu
Once this PR has been reviewed and has the lgtm label, please assign joelspeed for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 10, 2025
@openshift-ci openshift-ci bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 10, 2025
@openshift-ci openshift-ci bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jun 10, 2025
@hongkailiu hongkailiu force-pushed the OTA-1545 branch 2 times, most recently from 6dc1e06 to e780c3a Compare June 10, 2025 16:12
@openshift-ci openshift-ci bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 10, 2025
@hongkailiu hongkailiu force-pushed the OTA-1545 branch 4 times, most recently from 5218865 to 4b05550 Compare June 10, 2025 22:49
@openshift-ci openshift-ci bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 10, 2025
@hongkailiu hongkailiu force-pushed the OTA-1545 branch 3 times, most recently from 854ed1f to f03a0de Compare June 11, 2025 03:39
Contributor

@everettraven everettraven left a comment

I know this is marked as WIP, but I figured it wouldn't hurt to provide some early feedback.

As @wking mentioned, these new fields will need to be gated with a new feature gate.

// The cluster-version operator will evaluate all risks associated to a conditional
// update when it is the desired update and only accept it if all its associated
// risks are in desiredUpdate.accept.
// +kubebuilder:validation:MaxItems=1000
Contributor

Why was 1000 chosen? Do we have a record somewhere of how many UpdateRisks there are?

Member Author

https://github.com/openshift/cincinnati-graph-data/tree/master/blocked-edges

$ ls blocked-edges/*.yaml | while read file; do yq -r '.name'  "$file"; done | tee ~/Downloads/risks.txt

$ cat ~/Downloads/risks.txt| sort | uniq | wc -l
      91

So far we have 91 risks (not every one will appear in cv.status, because CVO does some filtering).
But the total number could grow as more risks are declared for OCP bugs.
1000 leaves room for the future.
I picked it without much more thought than the above.

What is the impact of, say, putting 10 there in the rule?
If we update the object with 11 elements, would Kubernetes block the update and return an error?

// operator only if all of its risks are acceptable.
//
// +kubebuilder:validation:items:MaxLength=256
// +kubebuilder:validation:MaxItems=1000
Contributor

Why was 1000 chosen? Do we have a history of there being up to 1000 risks for a given upgrade?

Member Author

Follow up https://github.com/openshift/api/pull/2360/files/a9d2af3985180a169deaef9fea6ae0d40e807b8d#r2154495183

In theory, the user could accept all the risks.
The 1000 here simply follows from the 1000 there.

// it is either not applied to the cluster or considered acceptable
// by the cluster administrator.
// +kubebuilder:validation:items:MaxLength=256
// +kubebuilder:validation:MaxItems=100
Contributor

Why 100 here but 1000 elsewhere?

Member Author

The other places hold the total risks across all conditional updates.
This one holds the risks associated with ONE conditional update.
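
The doc comment under review says a risk is fine when it is either not applied to the cluster or considered acceptable by the cluster administrator. A minimal Go sketch of that rule for a single conditional update follows. The `Risk` type, its `Applies` field, and `updateAcceptable` are hypothetical simplifications, not the types from this PR, and treating an unknown risk name as blocking is an assumption.

```go
package main

import "fmt"

// Risk is a hypothetical, simplified view of one entry in
// status.conditionalUpdateRisks.
type Risk struct {
	Name    string
	Applies bool // assumed to be derived from the risk's conditions
}

// updateAcceptable reports whether every named risk is either not applied to
// the cluster or listed in accept (spec.desiredUpdate.accept).
// A name missing from risks is treated as blocking, which is an assumption.
func updateAcceptable(riskNames []string, risks map[string]Risk, accept []string) bool {
	accepted := make(map[string]bool, len(accept))
	for _, name := range accept {
		accepted[name] = true
	}
	for _, name := range riskNames {
		risk, known := risks[name]
		if !known {
			return false
		}
		if risk.Applies && !accepted[name] {
			return false
		}
	}
	return true
}

func main() {
	risks := map[string]Risk{
		"AMD19hFirmware":            {Name: "AMD19hFirmware", Applies: true},
		"AcceleratedNetworkingRace": {Name: "AcceleratedNetworkingRace", Applies: false},
	}
	names := []string{"AMD19hFirmware", "AcceleratedNetworkingRace"}
	fmt.Println(updateAcceptable(names, risks, nil))
	fmt.Println(updateAcceptable(names, risks, []string{"AMD19hFirmware"}))
}
```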

// risks represents the range of issues associated with
// updating to the target release. The cluster-version
// operator will evaluate all entries, and only recommend the
// update if there is at least one entry and all entries
// recommend the update.
// DEPRECATED: the risks has been deprecated by riskNames.
Contributor

What does this mean for a user/clients?

Member Author

It suggests that a user of cv.status.conditionalUpdates.risks switch to cv.status.conditionalUpdates.riskNames.

If a client uses fields of cv.status.conditionalUpdates.risks other than name, it has to use the name as the key to look up the whole risk object in cv.status.conditionalUpdateRisks.
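
The migration described above amounts to a lookup by name. A minimal Go sketch follows, assuming simplified stand-in types: `ConditionalUpdateRisk` here carries only `Name` and `Message`, and `resolveRisks` is an illustrative helper, not part of the API.

```go
package main

import "fmt"

// ConditionalUpdateRisk is a simplified stand-in for the API type.
type ConditionalUpdateRisk struct {
	Name    string
	Message string
}

// resolveRisks maps the names in conditionalUpdates[].riskNames to the full
// objects in status.conditionalUpdateRisks. Names with no match are returned
// separately so the caller can decide how to handle them.
func resolveRisks(riskNames []string, all []ConditionalUpdateRisk) (found []ConditionalUpdateRisk, missing []string) {
	byName := make(map[string]ConditionalUpdateRisk, len(all))
	for _, r := range all {
		byName[r.Name] = r
	}
	for _, name := range riskNames {
		if r, ok := byName[name]; ok {
			found = append(found, r)
		} else {
			missing = append(missing, name)
		}
	}
	return found, missing
}

func main() {
	all := []ConditionalUpdateRisk{
		{Name: "AMD19hFirmware", Message: "example message"},
		{Name: "ARM64SecCompError524", Message: "example message"},
	}
	found, missing := resolveRisks([]string{"ARM64SecCompError524", "NotListed"}, all)
	fmt.Println(len(found), missing)
}
```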

// conditions represents the observations of the conditional update
// risk's current status. Known types are:
// * Apply, for whether the risk is applied to the current cluster.
// +kubebuilder:validation:MaxItems=16
Contributor

Why 16? It looks like you've only got one known condition type so would a maximum length of 1 be sufficient? Do you anticipate adding additional condition types in the future?

Member Author

Decreased to 2.
Checked a few existing conditions.
It seems this marker requirement is new.
Is it OK if I leave room for adding a new type without having to change the API (although I do not anticipate adding other types at the moment)?

@@ -806,6 +842,15 @@ type ConditionalUpdate struct {
// for not recommending a conditional update.
// +k8s:deepcopy-gen=true
type ConditionalUpdateRisk struct {
// conditions represents the observations of the conditional update
// risk's current status. Known types are:
// * Apply, for whether the risk is applied to the current cluster.
Contributor

Applied feels a bit more appropriate here than Apply

Member Author

Checked on a live cluster and Applied indeed fits better: none of the existing condition types is a bare verb.

$ oc get clusterversion version -o yaml | yq -r '.status.conditions[]|.type'
RetrievedUpdates
Upgradeable
ImplicitlyEnabledCapabilities
ReleaseAccepted
Available
Failing
Progressing

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 19, 2025
Member Author

@hongkailiu hongkailiu left a comment

Thanks for the review, @everettraven

// those are considered acceptable. A conditional update is accepted by Cluster-Version
// operator only if all of its risks are acceptable.
//
// +kubebuilder:validation:items:MaxLength=256
Member Author

@hongkailiu hongkailiu Jun 19, 2025

Follow up https://github.com/openshift/api/pull/2360/files/a9d2af3985180a169deaef9fea6ae0d40e807b8d#r2154495183

Here are some examples of risk names:

$ cat ~/Downloads/risks.txt| sort| uniq | head -n 3
AcceleratedNetworkingRace
AMD19hFirmware
ARM64SecCompError524

and the longest one is 54 characters at the moment (the 55 reported below includes the trailing newline counted by wc -m):

$ awk 'length > max_length { max_length = length; longest_line = $0 } END { print longest_line }' ~/Downloads/risks.txt
LabeledMachineConfigAndContainerRuntimeConfigBlocksMCO

$ awk 'length > max_length { max_length = length; longest_line = $0 } END { print longest_line }' ~/Downloads/risks.txt | wc -m
      55

At the moment, there are no restrictions on the risk names from CVO's point of view.
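
The length check above can also be done without the trailing newline that `wc -m` counts. A minimal Go sketch follows, with a hypothetical `longestName` helper and the risk names quoted above as input.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// longestName returns the longest name and its length in characters (runes).
// The first name wins when several share the maximum length.
func longestName(names []string) (string, int) {
	longest, best := "", 0
	for _, n := range names {
		if l := utf8.RuneCountInString(n); l > best {
			longest, best = n, l
		}
	}
	return longest, best
}

func main() {
	names := []string{
		"AcceleratedNetworkingRace",
		"AMD19hFirmware",
		"ARM64SecCompError524",
		"LabeledMachineConfigAndContainerRuntimeConfigBlocksMCO",
	}
	name, length := longestName(names)
	fmt.Printf("%s: %d characters (MaxLength is 256)\n", name, length)
}
```

Either way, the longest current name is well under the proposed MaxLength of 256.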

@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 19, 2025
@petr-muller
Member

/cc

@openshift-ci openshift-ci bot requested a review from petr-muller June 19, 2025 19:03
Contributor

openshift-ci bot commented Jun 19, 2025

@hongkailiu: all tests passed!


@@ -806,6 +845,15 @@ type ConditionalUpdate struct {
// for not recommending a conditional update.
// +k8s:deepcopy-gen=true
type ConditionalUpdateRisk struct {
// conditions represents the observations of the conditional update
// risk's current status. Known types are:
// * Applied, for whether the risk is applied to the current cluster.
Member

nit: I'd prefer "Applies" over "Applied". In reality, any condition is going to be slightly stale, e.g. the ClusterVersion status.conditions Failing isn't actually "is failing this way right now" it's "was failing that way recently". But we don't give consumers a way to determine freshness, and we expect them to act as if the gap between "how fresh the data actually is" and "right now" is almost always negligibly small. To me, Applied sets folks up to worry about how stale the data is, and without a way to determine freshness, it's hard to be productive with that worry. While Applies says "it's ok if you assume this is current; if we fall too far behind there will probably be other alerting".

Labels
do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
5 participants