[wip][OTA-1545] Extend ClusterVersion for accepted risks #2360
Hello @hongkailiu! Some important instructions when contributing to openshift/api: |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: hongkailiu The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
6dc1e06 to e780c3a
5218865 to 4b05550
854ed1f to f03a0de
I know this is marked as WIP, but I figured it wouldn't hurt to provide some early feedback.
As @wking mentioned, these new fields will need to be gated with a new feature gate.
// The cluster-version operator will evaluate all risks associated to a conditional
// update when it is the desired update and only accept it if all its associated
// risks are in desiredUpdate.accept.
// +kubebuilder:validation:MaxItems=1000
Why was 1000 chosen? Do we have a record somewhere of how many UpdateRisks there are?
https://github.com/openshift/cincinnati-graph-data/tree/master/blocked-edges
$ ls blocked-edges/*.yaml | while read file; do yq -r '.name' "$file"; done | tee ~/Downloads/risks.txt
$ cat ~/Downloads/risks.txt| sort | uniq | wc -l
91
So far we have 91 risks (not every one of them will appear in cv.status, because CVO does some filtering).
But the total number could grow as more risks are declared for OCP bugs.
1000 is a number that leaves room for the future.
I picked it without much thought beyond the above.
What would be the impact of, say, putting 10 there in the rule?
If we update the object with 11 elements, would Kubernetes block the update and return an error?
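On the question above: for a CRD whose schema is generated from a `MaxItems` marker, a write that carries more elements than the limit fails schema validation, and the API server rejects the whole request instead of truncating the list. The following is a minimal sketch of that behaviour, not the API server's code; `validateMaxItems`, the field name `example.field`, and the error wording are hypothetical.

```go
package main

import "fmt"

// validateMaxItems is a simplified, hypothetical stand-in for the length check
// that a MaxItems marker produces in the generated schema. The real check runs
// inside the API server; the error text here is illustrative only.
func validateMaxItems(field string, items []string, max int) error {
	if len(items) > max {
		return fmt.Errorf("%s: has %d items, must have at most %d", field, len(items), max)
	}
	return nil
}

func main() {
	// Build an 11-element list to test against a limit of 10.
	risks := make([]string, 11)
	for i := range risks {
		risks[i] = fmt.Sprintf("Risk%d", i)
	}

	// Over the limit: the update as a whole is rejected.
	if err := validateMaxItems("example.field", risks, 10); err != nil {
		fmt.Println("rejected:", err)
	}

	// Exactly at the limit: the update passes this check.
	fmt.Println("ten items pass:", validateMaxItems("example.field", risks[:10], 10) == nil)
}
```

A limit that is too low therefore shows up as failed writes once the data grows past it, which is the argument for leaving headroom above the 91 risks counted today.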
// operator only if all of its risks are acceptable.
//
// +kubebuilder:validation:items:MaxLength=256
// +kubebuilder:validation:MaxItems=1000
Why was 1000 chosen? Do we have a history of there being up to 1000 risks for a given upgrade?
In theory, all the risks could be accepted by the user.
The 1000 here simply follows from the 1000 there.
// it is either not applied to the cluster or considered acceptable
// by the cluster administrator.
// +kubebuilder:validation:items:MaxLength=256
// +kubebuilder:validation:MaxItems=100
Why 100 here but 1000 elsewhere?
The other places hold the total risks across all conditional updates.
This one holds the risks associated with ONE conditional update.
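The doc comments quoted in these threads describe the intended rule: a conditional update is accepted only if each of its risks either does not apply to the cluster or is listed as accepted by the administrator. The following is a minimal sketch of that rule under stated assumptions. `riskAcceptable` and `updateAccepted` are hypothetical names, the maps stand in for data the cluster-version operator would hold, and treating a risk with unknown status as applying is an assumption made here, not something the PR states.

```go
package main

import "fmt"

// riskAcceptable reports whether one risk is acceptable: either it is known
// not to apply to this cluster, or the administrator accepted it.
// Assumption: a risk with no entry in applies is treated as applying.
func riskAcceptable(name string, applies, accepted map[string]bool) bool {
	if a, known := applies[name]; known && !a {
		return true
	}
	return accepted[name]
}

// updateAccepted reports whether every risk named by ONE conditional update
// is acceptable.
func updateAccepted(riskNames []string, applies, accepted map[string]bool) bool {
	for _, name := range riskNames {
		if !riskAcceptable(name, applies, accepted) {
			return false
		}
	}
	return true
}

func main() {
	// Risks attached to one conditional update (bounded by the smaller limit).
	riskNames := []string{"AcceleratedNetworkingRace", "AMD19hFirmware"}
	applies := map[string]bool{
		"AcceleratedNetworkingRace": true,
		"AMD19hFirmware":            false,
	}

	// Nothing accepted yet: the applying risk blocks the update.
	fmt.Println(updateAccepted(riskNames, applies, map[string]bool{}))

	// The administrator accepts the one risk that applies.
	fmt.Println(updateAccepted(riskNames, applies, map[string]bool{"AcceleratedNetworkingRace": true}))
}
```

The per-update list is the input that is iterated, while the accepted set can name risks from any update, which matches the smaller limit on the former and the larger limit on the latter.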
// risks represents the range of issues associated with
// updating to the target release. The cluster-version
// operator will evaluate all entries, and only recommend the
// update if there is at least one entry and all entries
// recommend the update.
// DEPRECATED: the risks has been deprecated by riskNames.
What does this mean for a user/clients?
It suggests that a user who reads cv.status.conditionalUpdates.risks should use cv.status.conditionalUpdates.riskNames instead.
If fields of cv.status.conditionalUpdates.risks other than name are used, the client has to use the name as the key to look up the whole risk object in cv.status.conditionalUpdateRisks.
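A client migrating from `risks` to `riskNames` would do a lookup along these lines. This is a sketch only: the struct types are hypothetical, trimmed-down stand-ins for the real API types, `resolveRisks` is not part of any library, and the URLs and messages are placeholders.

```go
package main

import "fmt"

// ConditionalUpdateRisk is a hypothetical, trimmed stand-in for an entry in
// cv.status.conditionalUpdateRisks.
type ConditionalUpdateRisk struct {
	Name    string
	URL     string
	Message string
}

// ConditionalUpdate is a hypothetical, trimmed stand-in for an entry in
// cv.status.conditionalUpdates; only riskNames is modelled.
type ConditionalUpdate struct {
	RiskNames []string
}

// resolveRisks uses each risk name as a key into the top-level risk list and
// returns the full objects, plus any names that had no matching entry.
func resolveRisks(update ConditionalUpdate, all []ConditionalUpdateRisk) (found []ConditionalUpdateRisk, missing []string) {
	byName := make(map[string]ConditionalUpdateRisk, len(all))
	for _, r := range all {
		byName[r.Name] = r
	}
	for _, name := range update.RiskNames {
		if r, ok := byName[name]; ok {
			found = append(found, r)
		} else {
			missing = append(missing, name)
		}
	}
	return found, missing
}

func main() {
	all := []ConditionalUpdateRisk{
		{Name: "AMD19hFirmware", URL: "https://example.com/a", Message: "placeholder message"},
		{Name: "AcceleratedNetworkingRace", URL: "https://example.com/b", Message: "placeholder message"},
	}
	update := ConditionalUpdate{RiskNames: []string{"AMD19hFirmware"}}

	found, missing := resolveRisks(update, all)
	for _, r := range found {
		fmt.Println(r.Name, r.URL)
	}
	fmt.Println("unresolved names:", len(missing))
}
```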
config/v1/types_cluster_version.go
Outdated
// conditions represents the observations of the conditional update
// risk's current status. Known types are:
// * Apply, for whether the risk is applied to the current cluster.
// +kubebuilder:validation:MaxItems=16
Why 16? It looks like you've only got one known condition type so would a maximum length of 1 be sufficient? Do you anticipate adding additional condition types in the future?
Decreased to 2.
I checked a few existing conditions, and it seems this marker requirement is new.
Is it OK if I leave room for adding a new type without having to change the API (although I do not anticipate adding other types at the moment)?
config/v1/types_cluster_version.go
Outdated
@@ -806,6 +842,15 @@ type ConditionalUpdate struct {
// for not recommending a conditional update.
// +k8s:deepcopy-gen=true
type ConditionalUpdateRisk struct {
// conditions represents the observations of the conditional update
// risk's current status. Known types are:
// * Apply, for whether the risk is applied to the current cluster.
Applied feels a bit more appropriate here than Apply.
Checked on a live cluster, and Applied indeed fits better: none of the existing condition types is a bare verb.
$ oc get clusterversion version -o yaml | yq -r '.status.conditions[]|.type'
RetrievedUpdates
Upgradeable
ImplicitlyEnabledCapabilities
ReleaseAccepted
Available
Failing
Progressing
Thanks for the review, @everettraven
// those are considered acceptable. A conditional update is accepted by Cluster-Version
// operator only if all of its risks are acceptable.
//
// +kubebuilder:validation:items:MaxLength=256
Here are some examples of risk names:
$ cat ~/Downloads/risks.txt| sort| uniq | head -n 3
AcceleratedNetworkingRace
AMD19hFirmware
ARM64SecCompError524
and the longest one is 54 characters at the moment (wc -m below reports 55 because it also counts the trailing newline):
$ awk 'length > max_length { max_length = length; longest_line = $0 } END { print longest_line }' ~/Downloads/risks.txt
LabeledMachineConfigAndContainerRuntimeConfigBlocksMCO
$ awk 'length > max_length { max_length = length; longest_line = $0 } END { print longest_line }' ~/Downloads/risks.txt | wc -m
55
At the moment, there are no restrictions on the risk names from CVO's point of view.
- Add a new field `clusterversion.spec.desiredUpdate.accept`: it contains the names of conditional update risks that are considered acceptable.
- Move `clusterversion.status.conditionalUpdates.risks` two levels up as `clusterversion.status.conditionalUpdateRisks`. It contains all the risks for `clusterversion.status.conditionalUpdates`.
- Add a new field `clusterversion.status.conditionalUpdates.riskNames`: it contains the names of the risks for the conditional update. It deprecates `clusterversion.status.conditionalUpdates.risks`.
- Add a new field `clusterversion.status.conditionalUpdateRisks.conditions`: it contains the observations of the conditional update risk's current status.
/cc |
@hongkailiu: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
@@ -806,6 +845,15 @@ type ConditionalUpdate struct {
// for not recommending a conditional update.
// +k8s:deepcopy-gen=true
type ConditionalUpdateRisk struct {
// conditions represents the observations of the conditional update
// risk's current status. Known types are:
// * Applied, for whether the risk is applied to the current cluster.
nit: I'd prefer "Applies" over "Applied". In reality, any condition is going to be slightly stale, e.g. the ClusterVersion status.conditions Failing isn't actually "is failing this way right now" it's "was failing that way recently". But we don't give consumers a way to determine freshness, and we expect them to act as if the gap between "how fresh the data actually is" and "right now" is almost always negligibly small. To me, Applied sets folks up to worry about how stale the data is, and without a way to determine freshness, it's hard to be productive with that worry. While Applies says "it's ok if you assume this is current; if we fall too far behind there will probably be other alerting".
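Whichever name is chosen, consumers would read the condition by its type string, so the name becomes part of the contract. The following is a minimal sketch of such a lookup, assuming a trimmed, hypothetical `Condition` type in place of `metav1.Condition`; the type name is passed in as a parameter because it is still under discussion in this thread.

```go
package main

import "fmt"

// Condition is a hypothetical, trimmed stand-in for metav1.Condition.
type Condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
}

// findCondition returns the first condition with the given type, if any.
func findCondition(conds []Condition, condType string) (Condition, bool) {
	for _, c := range conds {
		if c.Type == condType {
			return c, true
		}
	}
	return Condition{}, false
}

func main() {
	// Conditions as they might appear on one conditional update risk.
	conds := []Condition{{Type: "Applies", Status: "True"}}

	// A client written against a different name silently finds nothing.
	for _, name := range []string{"Applies", "Applied"} {
		c, ok := findCondition(conds, name)
		fmt.Println(name, "found:", ok, "status:", c.Status)
	}
}
```

With real API types, `meta.FindStatusCondition` from `k8s.io/apimachinery/pkg/api/meta` performs the equivalent lookup on a `[]metav1.Condition`.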
The API extension is proposed in openshift/enhancements#1807.