[wip][OTA-1545] Extend ClusterVersion for accepted risks #2360

hongkailiu · 2025-06-09T21:07:23Z

The API extensions are proposed in openshift/enhancements#1807.

- Add a new field 'clusterversion.spec.desiredUpdate.accept': It contains
  the names of conditional update risks that are considered acceptable.
- Move clusterversion.status.conditionalUpdates.risks two levels up as
  clusterversion.status.conditionalUpdateRisks. It contains all the risks
  for clusterversion.status.conditionalUpdates.
- Add a new field 'clusterversion.status.conditionalUpdates.riskNames': It
  contains the names of the risks for the conditional update. It deprecates
  clusterversion.status.conditionalUpdates.risks.
- Add a new field 'clusterversion.status.conditionalUpdateRisks.conditions':
  It contains the observations of the conditional update risk's current
  status.
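
A minimal sketch of how these fields might fit together, assuming simplified stand-in types rather than the real openshift/api definitions. The type names, the `updateAccepted` helper, and the condition type string `Applies` are illustrative assumptions (the condition name is still being discussed in this PR). The rule itself comes from the field comments in the diff: a conditional update is accepted only if each of its risks is either not applied to the cluster or listed in desiredUpdate.accept.

```go
package main

import "fmt"

// Condition is a simplified stand-in for metav1.Condition.
type Condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
}

// Risk is a simplified stand-in for an entry in status.conditionalUpdateRisks.
type Risk struct {
	Name       string
	Conditions []Condition
}

// appliesType is an assumed condition type name; the PR has not settled it.
const appliesType = "Applies"

// riskApplies reports whether the risk is treated as applying to the cluster.
// A missing condition or any status other than "False" counts as applying,
// which is a conservative assumption of this sketch.
func riskApplies(r Risk) bool {
	for _, c := range r.Conditions {
		if c.Type == appliesType {
			return c.Status != "False"
		}
	}
	return true
}

// updateAccepted returns true when every named risk is either known not to
// apply or is listed in accept. Risk names with no matching entry block.
func updateAccepted(riskNames []string, risks []Risk, accept []string) bool {
	byName := make(map[string]Risk, len(risks))
	for _, r := range risks {
		byName[r.Name] = r
	}
	accepted := make(map[string]bool, len(accept))
	for _, n := range accept {
		accepted[n] = true
	}
	for _, n := range riskNames {
		if accepted[n] {
			continue
		}
		r, ok := byName[n]
		if !ok || riskApplies(r) {
			return false
		}
	}
	return true
}

func main() {
	risks := []Risk{
		{Name: "AMD19hFirmware", Conditions: []Condition{{Type: appliesType, Status: "True"}}},
		{Name: "AcceleratedNetworkingRace", Conditions: []Condition{{Type: appliesType, Status: "False"}}},
	}
	names := []string{"AMD19hFirmware", "AcceleratedNetworkingRace"}
	fmt.Println(updateAccepted(names, risks, nil))                        // false
	fmt.Println(updateAccepted(names, risks, []string{"AMD19hFirmware"})) // true
}
```

Treating an unknown or missing condition as "applies" keeps the sketch on the safe side; the real operator may choose differently.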

openshift-ci · 2025-06-09T21:07:27Z

Hello @hongkailiu! Some important instructions when contributing to openshift/api:
API design plays an important part in the user experience of OpenShift and as such API PRs are subject to a high level of scrutiny to ensure they follow our best practices. If you haven't already done so, please review the OpenShift API Conventions and ensure that your proposed changes are compliant. Following these conventions will help expedite the api review process for your PR.

openshift-ci · 2025-06-09T21:08:23Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: hongkailiu
Once this PR has been reviewed and has the lgtm label, please assign joelspeed for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

config/v1/types_cluster_version.go

everettraven

I know this is marked as WIP, but I figured it wouldn't hurt to provide some early feedback.

As @wking mentioned, these new fields will need to be gated with a new feature gate.

config/v1/types_cluster_version.go

everettraven · 2025-06-18T12:38:53Z

config/v1/types_cluster_version.go

+	// The cluster-version operator will evaluate all risks associated to a conditional
+	// update when it is the desired update and only accept it if all its associated
+	// risks are in desiredUpdate.accept.
+	// +kubebuilder:validation:MaxItems=1000


Why was 1000 chosen? Do we have a record somewhere of how many UpdateRisks there are?

config/v1/types_cluster_version.go

everettraven · 2025-06-18T12:57:59Z

config/v1/types_cluster_version.go

+	// operator only if all of its risks are acceptable.
+	//
+	// +kubebuilder:validation:items:MaxLength=256
+	// +kubebuilder:validation:MaxItems=1000


Why was 1000 chosen? Do we have a history of there being up to 1000 risks for a given upgrade?

everettraven · 2025-06-18T13:04:20Z

config/v1/types_cluster_version.go

+	// it is either not applied to the cluster or considered acceptable
+	// by the cluster administrator.
+	// +kubebuilder:validation:items:MaxLength=256
+	// +kubebuilder:validation:MaxItems=100


Why 100 here but 1000 elsewhere?

config/v1/types_cluster_version.go

everettraven · 2025-06-18T13:05:41Z

config/v1/types_cluster_version.go

 	// risks represents the range of issues associated with
 	// updating to the target release. The cluster-version
 	// operator will evaluate all entries, and only recommend the
 	// update if there is at least one entry and all entries
 	// recommend the update.
+	// DEPRECATED: the risks has been deprecated by riskNames.


What does this mean for a user/clients?

everettraven · 2025-06-18T13:08:27Z

config/v1/types_cluster_version.go

+	// conditions represents the observations of the conditional update
+	// risk's current status. Known types are:
+	// * Apply, for whether the risk is applied to the current cluster.
+	// +kubebuilder:validation:MaxItems=16


Why 16? It looks like you've only got one known condition type so would a maximum length of 1 be sufficient? Do you anticipate adding additional condition types in the future?

everettraven · 2025-06-18T13:09:24Z

config/v1/types_cluster_version.go

@@ -806,6 +842,15 @@ type ConditionalUpdate struct {
 // for not recommending a conditional update.
 // +k8s:deepcopy-gen=true
 type ConditionalUpdateRisk struct {
+	// conditions represents the observations of the conditional update
+	// risk's current status. Known types are:
+	// * Apply, for whether the risk is applied to the current cluster.


Applied feels a bit more appropriate here than Apply

hongkailiu

Thanks for the review, @everettraven

hongkailiu · 2025-06-19T17:14:47Z

config/v1/types_cluster_version.go

+	// The cluster-version operator will evaluate all risks associated to a conditional
+	// update when it is the desired update and only accept it if all its associated
+	// risks are in desiredUpdate.accept.
+	// +kubebuilder:validation:MaxItems=1000


https://github.com/openshift/cincinnati-graph-data/tree/master/blocked-edges

$ ls blocked-edges/*.yaml | while read file; do yq -r '.name' "$file"; done | tee ~/Downloads/risks.txt
$ cat ~/Downloads/risks.txt| sort | uniq | wc -l
91

So far we have 91 risks (not every one will appear in cv.status, because the CVO does some filtering).
But the total number could grow as more risks are declared from OCP bugs.
1000 leaves room for the future.
I picked it without thinking much beyond the above.

What is the impact of, say, putting 10 there in the rule?
If we update the object with 11 elements, would Kubernetes block the update and return an error?

hongkailiu · 2025-06-19T17:35:49Z

config/v1/types_cluster_version.go

+	// those are considered acceptable. A conditional update is accepted by Cluster-Version
+	// operator only if all of its risks are acceptable.
+	//
+	// +kubebuilder:validation:items:MaxLength=256


Follow up https://github.com/openshift/api/pull/2360/files/a9d2af3985180a169deaef9fea6ae0d40e807b8d#r2154495183

Here are some examples of risk names:

$ cat ~/Downloads/risks.txt| sort| uniq | head -n 3
AcceleratedNetworkingRace
AMD19hFirmware
ARM64SecCompError524

and the longest one is 54 characters at the moment (wc -m reports 55 because it also counts the trailing newline):

$ awk 'length > max_length { max_length = length; longest_line = $0 } END { print longest_line }' ~/Downloads/risks.txt
LabeledMachineConfigAndContainerRuntimeConfigBlocksMCO
$ awk 'length > max_length { max_length = length; longest_line = $0 } END { print longest_line }' ~/Downloads/risks.txt | wc -m
55

At the moment, there are no restrictions on the risk names from the CVO's point of view.
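
For context on what the two markers in this hunk mean in practice: the API server enforces the limits from the generated CRD schema and rejects a write that exceeds them. A minimal client-side sketch of the same bounds follows; the `checkRiskNames` helper and the constant names are hypothetical, and the values simply mirror MaxItems=1000 and items:MaxLength=256 from the diff.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

const (
	maxItems  = 1000 // mirrors +kubebuilder:validation:MaxItems=1000
	maxLength = 256  // mirrors +kubebuilder:validation:items:MaxLength=256
)

// checkRiskNames is a hypothetical helper that applies the same bounds the
// schema declares: at most maxItems entries, each at most maxLength characters.
func checkRiskNames(names []string) error {
	if len(names) > maxItems {
		return fmt.Errorf("too many items: %d > %d", len(names), maxItems)
	}
	for i, n := range names {
		// Count characters rather than bytes.
		if l := utf8.RuneCountInString(n); l > maxLength {
			return fmt.Errorf("item %d too long: %d > %d", i, l, maxLength)
		}
	}
	return nil
}

func main() {
	// The longest risk name found so far fits comfortably.
	names := []string{"LabeledMachineConfigAndContainerRuntimeConfigBlocksMCO"}
	fmt.Println(checkRiskNames(names))

	// One entry over the limit is rejected.
	fmt.Println(checkRiskNames(make([]string, maxItems+1)))
}
```

With the longest current name at 54 characters and 91 known risks, both limits leave a wide margin.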

hongkailiu · 2025-06-19T17:43:29Z

config/v1/types_cluster_version.go

+	// operator only if all of its risks are acceptable.
+	//
+	// +kubebuilder:validation:items:MaxLength=256
+	// +kubebuilder:validation:MaxItems=1000


Follow up https://github.com/openshift/api/pull/2360/files/a9d2af3985180a169deaef9fea6ae0d40e807b8d#r2154495183

In theory, the user could accept all the risks.
So the 1000 here simply follows from the 1000 limit there.

hongkailiu · 2025-06-19T17:51:37Z

config/v1/types_cluster_version.go

+	// it is either not applied to the cluster or considered acceptable
+	// by the cluster administrator.
+	// +kubebuilder:validation:items:MaxLength=256
+	// +kubebuilder:validation:MaxItems=100


The other limits cover the total risks across all conditional updates.
This one covers the risks associated with ONE conditional update.

hongkailiu · 2025-06-19T18:00:47Z

config/v1/types_cluster_version.go

 	// risks represents the range of issues associated with
 	// updating to the target release. The cluster-version
 	// operator will evaluate all entries, and only recommend the
 	// update if there is at least one entry and all entries
 	// recommend the update.
+	// DEPRECATED: the risks has been deprecated by riskNames.


It suggests that a user who reads cv.status.conditionalUpdates.risks should use cv.status.conditionalUpdates.riskNames instead.

If fields of cv.status.conditionalUpdates.risks other than name are needed, the client has to use the name as the key to look up the whole risk object in cv.status.conditionalUpdateRisks.
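
A minimal sketch of that lookup, assuming a simplified stand-in type with only two fields. The `resolveRisks` helper is hypothetical and the message text is a placeholder, not real graph data.

```go
package main

import "fmt"

// ConditionalUpdateRisk is a simplified stand-in for an entry in
// cv.status.conditionalUpdateRisks; only the fields used here are shown.
type ConditionalUpdateRisk struct {
	Name    string
	Message string
}

// resolveRisks maps the names from conditionalUpdates[].riskNames to the full
// entries, and reports any names that have no matching entry.
func resolveRisks(riskNames []string, all []ConditionalUpdateRisk) (found []ConditionalUpdateRisk, missing []string) {
	index := make(map[string]ConditionalUpdateRisk, len(all))
	for _, r := range all {
		index[r.Name] = r
	}
	for _, n := range riskNames {
		if r, ok := index[n]; ok {
			found = append(found, r)
		} else {
			missing = append(missing, n)
		}
	}
	return found, missing
}

func main() {
	all := []ConditionalUpdateRisk{
		{Name: "AMD19hFirmware", Message: "placeholder message"},
	}
	found, missing := resolveRisks([]string{"AMD19hFirmware", "ARM64SecCompError524"}, all)
	fmt.Println(len(found), missing)
}
```

Returning the unmatched names separately lets a client notice when riskNames and conditionalUpdateRisks are out of sync instead of silently dropping entries.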

hongkailiu · 2025-06-19T18:05:50Z

config/v1/types_cluster_version.go

@@ -806,6 +842,15 @@ type ConditionalUpdate struct {
 // for not recommending a conditional update.
 // +k8s:deepcopy-gen=true
 type ConditionalUpdateRisk struct {
+	// conditions represents the observations of the conditional update
+	// risk's current status. Known types are:
+	// * Apply, for whether the risk is applied to the current cluster.


Checked on a live cluster, and Applied indeed fits better: none of the existing condition types is a bare verb.

$ oc get clusterversion version -o yaml | yq -r '.status.conditions[]|.type'
RetrievedUpdates
Upgradeable
ImplicitlyEnabledCapabilities
ReleaseAccepted
Available
Failing
Progressing

hongkailiu · 2025-06-19T18:13:47Z

config/v1/types_cluster_version.go

+	// conditions represents the observations of the conditional update
+	// risk's current status. Known types are:
+	// * Apply, for whether the risk is applied to the current cluster.
+	// +kubebuilder:validation:MaxItems=16


Decreased to 2.
I checked a few existing conditions, and it seems this marker requirement is new.
Is it OK if I leave room for adding a new type without having to change the API (although I do not anticipate adding other types at the moment)?

petr-muller · 2025-06-19T19:03:02Z

/cc

openshift-ci · 2025-06-19T22:07:22Z

@hongkailiu: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

wking · 2025-06-23T21:49:40Z

config/v1/types_cluster_version.go

@@ -806,6 +845,15 @@ type ConditionalUpdate struct {
 // for not recommending a conditional update.
 // +k8s:deepcopy-gen=true
 type ConditionalUpdateRisk struct {
+	// conditions represents the observations of the conditional update
+	// risk's current status. Known types are:
+	// * Applied, for whether the risk is applied to the current cluster.


nit: I'd prefer "Applies" over "Applied". In reality, any condition is going to be slightly stale, e.g. the ClusterVersion status.conditions Failing isn't actually "is failing this way right now" it's "was failing that way recently". But we don't give consumers a way to determine freshness, and we expect them to act as if the gap between "how fresh the data actually is" and "right now" is almost always negligibly small. To me, Applied sets folks up to worry about how stale the data is, and without a way to determine freshness, it's hard to be productive with that worry. While Applies says "it's ok if you assume this is current; if we fall too far behind there will probably be other alerting".

openshift-ci bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jun 9, 2025

hongkailiu changed the title ~~[OTA-1545] Extend ClusterVersion for accepted risks~~ [wip][OTA-1545] Extend ClusterVersion for accepted risks Jun 9, 2025

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 9, 2025

openshift-ci bot requested review from deads2k and everettraven June 9, 2025 21:08

hongkailiu force-pushed the OTA-1545 branch from 0eed585 to 611d63d Compare June 10, 2025 14:30

openshift-ci bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 10, 2025

hongkailiu force-pushed the OTA-1545 branch from 1398b1f to 92b713d Compare June 10, 2025 15:09

openshift-ci bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 10, 2025

hongkailiu force-pushed the OTA-1545 branch from 92b713d to a3e4639 Compare June 10, 2025 15:12

openshift-ci bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jun 10, 2025

hongkailiu force-pushed the OTA-1545 branch 2 times, most recently from 6dc1e06 to e780c3a Compare June 10, 2025 16:12

openshift-ci bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 10, 2025

hongkailiu force-pushed the OTA-1545 branch 4 times, most recently from 5218865 to 4b05550 Compare June 10, 2025 22:49

openshift-ci bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 10, 2025

hongkailiu mentioned this pull request Jun 10, 2025

[wip][OTA-1544]update: support accepted risks openshift/enhancements#1807

Open

hongkailiu force-pushed the OTA-1545 branch 3 times, most recently from 854ed1f to f03a0de Compare June 11, 2025 03:39

hongkailiu force-pushed the OTA-1545 branch from f03a0de to 73dc822 Compare June 11, 2025 14:17

wking reviewed Jun 16, 2025

wking reviewed Jun 16, 2025

wking reviewed Jun 16, 2025

hongkailiu force-pushed the OTA-1545 branch from 73dc822 to a9d2af3 Compare June 17, 2025 15:05

everettraven reviewed Jun 18, 2025

hongkailiu force-pushed the OTA-1545 branch from a9d2af3 to f75de30 Compare June 19, 2025 18:23

openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 19, 2025

hongkailiu commented Jun 19, 2025

hongkailiu added 3 commits June 19, 2025 14:41

make update

99ba268

Add a feature gate: ClusterUpgradeAcceptedRisks

7d206bc

hongkailiu force-pushed the OTA-1545 branch from f75de30 to 7d206bc Compare June 19, 2025 18:52

openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 19, 2025

openshift-ci bot requested a review from petr-muller June 19, 2025 19:03

wking reviewed Jun 23, 2025
