-
Notifications
You must be signed in to change notification settings - Fork 412
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OCPBUGS-48810: Ensure that build jobs are always reconciled #4811
base: master
Are you sure you want to change the base?
OCPBUGS-48810: Ensure that build jobs are always reconciled #4811
Conversation
Skipping CI for Draft Pull Request. |
@cheesesashimi: This pull request references Jira Issue OCPBUGS-43896, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: cheesesashimi The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
55000c2
to
091e97c
Compare
/testwith openshift/machine-config-operator/master/e2e-gcp-op-ocl openshift/api#2134 |
@cheesesashimi: This pull request references Jira Issue OCPBUGS-48810, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
6b3309a
to
c29ba3e
Compare
c29ba3e
to
d79c934
Compare
// state transitions is observed, the MachineOSBuild object should not be | ||
// updated because they are invalid and make no sense. | ||
{ | ||
name: "Terminal -> Initial", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: For the case where a build fails, we fix the issue, and then trigger the build again with the rebuild annotation, having this restriction shouldn't cause an issue right?
The MOSB is deleted and recreated for this case but it is created with the same name. I don't think trying to update the state for the new MOSB with the same name should cause an issue but I think it is something worth confirming.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It shouldn't in that case because it is a new MOSB.
Changes LGTM |
/jira refresh |
@cheesesashimi: This pull request references Jira Issue OCPBUGS-48810, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Upgrade from 4.19.0-0.nightly-2025-01-28-090833 to this fix (RUNNING):
Hello, this scenario is working ok
The result is that the new os-builder pod is able to reconcile everything and the image is correctly deployed to the nodes. Nevertheless, we have tested this other scenario
The result in this scenario is that the MOSB resource is not updated and we can see this error in the os-builder logs:
Is the the second scenario within the scope of the issue that we are fixing in this PR or it needs a different jira ticket? |
It feels like this second scenario is within the scope of this PR, so I'll try to fix it. |
d79c934
to
0fe2c2f
Compare
Hello! We still see a problem when we do this:
We see these messages in the os-builder pod
|
0fe2c2f
to
d2b1357
Compare
If the machine-os-builder pod is stopped and an active build job completes before the machine-os-builder pod is rescheduled, it will be ignored. In this situation, we should check if the job is in a terminal state and take the appropriate action if it is. This also opportunistically cleans up the buildprogress -> conditions mapping and adds additional test cases for detecting state changes.
d2b1357
to
dd1daba
Compare
All the scenarios described in the PR worked. Nevertheless, we hit an issue in our regressions
The result is that even though a new MC was rendered, no machineosbuild is created. The cluster remains in an inconsistent status, with the worker pool updating but with no image to update, stuck reporting 0 workers updated. ~$ oc get mcp We can reproduce it by
|
@cheesesashimi: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
- What I did
If the machine-os-builder pod is stopped and a running job completes simultaneously, it is possible that the machine-os-builder pod will ignore it once it has been rescheduled. Although the build controller is aware of the existence of the job object, because it has not transitioned state, it is ignored. To avoid this, we check each added job object and reconcile it accordingly by updating the MachineOSBuild status according to our state transition rules.
Note: This needs to be rebased and land after #4756 does because it is based upon those changes.
- How to verify it
One could verify this by terminating the machine-os-builder pod at approximately the same time as a build job completes. In that situation, the MachineOSBuild should be updated after the machine-os-builder pod is rescheduled and begins running. Also, the unit test suite has an added test for this specific scenario.
- Description for the changelog
Ensure that build jobs are always reconciled