
OCPBUGS-48810: Ensure that build jobs are always reconciled #4811

Open

cheesesashimi wants to merge 2 commits into master from zzlotnik/OCPBUGS-43896-rebase

Conversation

cheesesashimi
Member

@cheesesashimi cheesesashimi commented Jan 27, 2025

- What I did

If the machine-os-builder pod is stopped and a running build job completes at the same time, the rescheduled machine-os-builder pod may ignore that job. The build controller is aware that the job object exists, but because the job never transitions state after the pod restarts, it is ignored. To avoid this, we examine each added job object and reconcile it by updating the MachineOSBuild status according to our state transition rules.
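A minimal sketch of the idea, using hypothetical names (`jobIsTerminal`, `addJob`, `updateMachineOSBuildStatusFromJob`) rather than the actual MCO helpers: when the informer reports a job as added and that job is already in a terminal state, reconcile the MachineOSBuild immediately instead of waiting for a state transition that will never arrive.

```go
// Illustrative sketch only; the real controller lives in the MCO's build
// controller packages and uses different helper names.
package builder

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

type controller struct{}

// jobIsTerminal reports whether the build job has already finished
// (succeeded or failed), i.e. no further state transitions will be observed.
func jobIsTerminal(job *batchv1.Job) bool {
	for _, cond := range job.Status.Conditions {
		if (cond.Type == batchv1.JobComplete || cond.Type == batchv1.JobFailed) &&
			cond.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}

// addJob handles every job the informer reports as added, including jobs
// that finished while the machine-os-builder pod was down.
func (c *controller) addJob(job *batchv1.Job) error {
	if !jobIsTerminal(job) {
		// Initial and transient states are left alone here; they will emit a
		// normal update event when they transition.
		return nil
	}
	// The job reached a terminal state while nothing was watching, so
	// reconcile the MachineOSBuild status now.
	return c.updateMachineOSBuildStatusFromJob(job)
}

// updateMachineOSBuildStatusFromJob is a placeholder for the real logic that
// maps the job's terminal state onto the MachineOSBuild status via the state
// transition rules.
func (c *controller) updateMachineOSBuildStatusFromJob(job *batchv1.Job) error {
	return nil
}
```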

Note: This needs to be rebased and land after #4756 does because it is based upon those changes.

- How to verify it

One could verify this by terminating the machine-os-builder pod at approximately the same time as a build job completes. In that situation, the MachineOSBuild should be updated after the machine-os-builder pod is rescheduled and begins running. Also, the unit test suite has an added test for this specific scenario.
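The added unit test itself is not shown in this description; as a rough idea of its shape, here is a hedged sketch against the hypothetical `addJob`/`jobIsTerminal` helpers above, not the actual test from this PR:

```go
// Sketch of a unit test for the "job is already terminal when added" case,
// using the illustrative helpers from the sketch above rather than the MCO's
// real test fixtures.
package builder

import (
	"testing"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

func TestAddJobReconcilesAlreadyCompletedJob(t *testing.T) {
	// A job that completed while the machine-os-builder pod was not running.
	job := &batchv1.Job{
		Status: batchv1.JobStatus{
			Conditions: []batchv1.JobCondition{{
				Type:   batchv1.JobComplete,
				Status: corev1.ConditionTrue,
			}},
		},
	}

	if !jobIsTerminal(job) {
		t.Fatalf("expected a completed job to be treated as terminal")
	}

	// The real test would back the controller with fake clients and assert on
	// the resulting MachineOSBuild status; here we only check that addJob
	// accepts an already-finished job without error.
	c := &controller{}
	if err := c.addJob(job); err != nil {
		t.Fatalf("addJob returned an unexpected error: %v", err)
	}
}
```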

- Description for the changelog
Ensure that build jobs are always reconciled

@openshift-ci-robot openshift-ci-robot added the jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. label Jan 27, 2025
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 27, 2025
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jan 27, 2025
Contributor

openshift-ci bot commented Jan 27, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot
Contributor

@cheesesashimi: This pull request references Jira Issue OCPBUGS-43896, which is invalid:

  • expected the bug to target the "4.19.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

- What I did

If the machine-os-builder pod is stopped and a running job completes simultaneously, it is possible that the machine-os-builder pod will ignore it once it has been rescheduled. Although the build controller is aware of the existence of the job object, because it has not transitioned state, it is ignored. To avoid this, we check each added job object to determine if it is in a terminal state. If so, we update the MachineOSBuild status. We purposely ignore initial and transient states because they cause an update event whenever a state transition occurs.

- How to verify it

One could verify this by terminating the machine-os-builder pod at approximately the same time as a build job completes. In that situation, the MachineOSBuild should be updated after the machine-os-builder pod is rescheduled and begins running. Also, the unit test suite has an added test for this specific scenario.

- Description for the changelog
Ensure that build jobs are always reconciled

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@cheesesashimi cheesesashimi changed the title from "OCPBUGS-43896: examine added jobs" to "OCPBUGS-43896: Ensure that build jobs are always reconciled" on Jan 27, 2025
Contributor

openshift-ci bot commented Jan 27, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cheesesashimi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 27, 2025
@cheesesashimi cheesesashimi force-pushed the zzlotnik/OCPBUGS-43896-rebase branch from 55000c2 to 091e97c on January 27, 2025 17:13
@cheesesashimi
Member Author

/testwith openshift/machine-config-operator/master/e2e-gcp-op-ocl openshift/api#2134

@cheesesashimi cheesesashimi changed the title from "OCPBUGS-43896: Ensure that build jobs are always reconciled" to "OCPBUGS-48810: Ensure that build jobs are always reconciled" on Jan 27, 2025
@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. and removed jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. labels Jan 27, 2025
@openshift-ci-robot
Contributor

@cheesesashimi: This pull request references Jira Issue OCPBUGS-48810, which is invalid:

  • expected the bug to target the "4.19.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

- What I did

If the machine-os-builder pod is stopped and a running job completes simultaneously, it is possible that the machine-os-builder pod will ignore it once it has been rescheduled. Although the build controller is aware of the existence of the job object, because it has not transitioned state, it is ignored. To avoid this, we check each added job object to determine if it is in a terminal state. If so, we update the MachineOSBuild status. We purposely ignore initial and transient states because they cause an update event whenever a state transition occurs.

Note: This needs to be rebased and land after #4756 does because it is based upon those changes.

- How to verify it

One could verify this by terminating the machine-os-builder pod at approximately the same time as a build job completes. In that situation, the MachineOSBuild should be updated after the machine-os-builder pod is rescheduled and begins running. Also, the unit test suite has an added test for this specific scenario.

- Description for the changelog
Ensure that build jobs are always reconciled

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@cheesesashimi cheesesashimi force-pushed the zzlotnik/OCPBUGS-43896-rebase branch 2 times, most recently from 6b3309a to c29ba3e, on January 27, 2025 19:37
@cheesesashimi cheesesashimi marked this pull request as ready for review January 27, 2025 19:40
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 27, 2025
@cheesesashimi cheesesashimi force-pushed the zzlotnik/OCPBUGS-43896-rebase branch from c29ba3e to d79c934 on January 27, 2025 19:54
// state transitions is observed, the MachineOSBuild object should not be
// updated because they are invalid and make no sense.
{
name: "Terminal -> Initial",
Contributor


Question: For the case where a build fails, we fix the issue and then trigger the build again with the rebuild annotation. This restriction shouldn't cause an issue there, right?

The MOSB is deleted and recreated in this case, but it is created with the same name. I don't think trying to update the state of the new MOSB with the same name should cause an issue, but it is worth confirming.

Member Author

@cheesesashimi cheesesashimi Jan 30, 2025


It shouldn't in that case because it is a new MOSB.

@umohnani8
Contributor

Changes LGTM

@cheesesashimi
Member Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jan 30, 2025
@openshift-ci-robot
Contributor

@cheesesashimi: This pull request references Jira Issue OCPBUGS-48810, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.19.0) matches configured target version for branch (4.19.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sergiordlr

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested a review from sergiordlr January 30, 2025 18:24
@sergiordlr

sergiordlr commented Jan 31, 2025

Upgrade from 4.19.0-0.nightly-2025-01-28-090833 to this fix (RUNNING):

https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.19-amd64-nightly-4.19-upgrade-from-stable-4.19-aws-ipi-ocl-fips-f60/1885328465318645760

Hello, this scenario is working ok

  1. Create a MOSC
  2. Wait until a MOSB is created and its status is fully filled in.
  3. Wait until the build pod runs for 1 minute or so
  4. Start deleting the os-builder pod in a loop: `watch oc delete pod -l k8s-app=machine-os-builder`
  5. Wait until the build pod finishes and reports a Completed status
  6. Stop deleting the os-builder pod

The result is that the new os-builder pod is able to reconcile everything and the image is correctly deployed to the nodes.

Nevertheless, we have tested this other scenario

  1. Create a MOSC
  2. Wait until a MOSB is created
  3. Wait until the build pod is created. Here the MOSB status is still empty
  4. While the MOSB status is empty, start deleting the os-builder pod in a loop: `watch oc delete pod -l k8s-app=machine-os-builder`
  5. When the builder pod has run for 2 minutes or so, stop deleting the os-builder pod

The result in this scenario is that the MOSB resource is not updated and we can see this error in the os-builder logs:

I0131 13:28:12.905928       1 jobimagebuilder.go:165] Using provided build job
I0131 13:28:12.905932       1 jobimagebuilder.go:192] Build job "build-mosc-worker-4fc90392d6b6f8f5a2d137af4082d2b7" status {Conditions:[{Type:SuccessCriteriaMet Status:True LastProbeTime:2025-01-31 13:22:18 +0000 UTC LastTransitionTime:2025-01-31 13:22:18 +0000 UTC Reason:CompletionsReached Message:Reached expected number of succeeded pods} {Type:Complete Status:True LastProbeTime:2025-01-31 13:22:18 +0000 UTC LastTransitionTime:2025-01-31 13:22:18 +0000 UTC Reason:CompletionsReached Message:Reached expected number of succeeded pods}] StartTime:2025-01-31 13:19:19 +0000 UTC CompletionTime:2025-01-31 13:22:18 +0000 UTC Active:0 Succeeded:1 Failed:0 Terminating:0xc000365b3c CompletedIndexes: FailedIndexes:<nil> UncountedTerminatedPods:&UncountedTerminatedPods{Succeeded:[],Failed:[],} Ready:0xc000365b40} mapped to MachineOSBuild progress "Succeeded"
I0131 13:28:12.907891       1 reconciler.go:632] MachineOSBuild "mosc-worker-4fc90392d6b6f8f5a2d137af4082d2b7" transitioned from transient state (Building) -> terminal state (Succeeded); update needed
I0131 13:28:12.911455       1 reconciler.go:813] Finished updating Job "build-mosc-worker-4fc90392d6b6f8f5a2d137af4082d2b7" after 5.969159ms
E0131 13:28:12.911492       1 wrappedqueue.go:257] "Unhandled Error" err="Updating Job \"build-mosc-worker-4fc90392d6b6f8f5a2d137af4082d2b7\" failed: could not set status on MachineOSBuild \"mosc-worker-4fc90392d6b6f8f5a2d137af4082d2b7\": could not update status on MachineOSBuild \"mosc-worker-4fc90392d6b6f8f5a2d137af4082d2b7\": MachineOSBuild.machineconfiguration.openshift.io \"mosc-worker-4fc90392d6b6f8f5a2d137af4082d2b7\" is invalid: status: Invalid value: \"object\": buildEnd must be after buildStart"
I0131 13:28:12.912579       1 wrappedqueue.go:258] Dropping item <kind: "Job", name: "build-mosc-worker-4fc90392d6b6f8f5a2d137af4082d2b7", func: "(*OSBuildController).updateJob"> out of queue machineosbuilder: Updating Job "build-mosc-worker-4fc90392d6b6f8f5a2d137af4082d2b7" failed: could not set status on MachineOSBuild "mosc-worker-4fc90392d6b6f8f5a2d137af4082d2b7": could not update status on MachineOSBuild "mosc-worker-4fc90392d6b6f8f5a2d137af4082d2b7": MachineOSBuild.machineconfiguration.openshift.io "mosc-worker-4fc90392d6b6f8f5a2d137af4082d2b7" is invalid: status: Invalid value: "object": buildEnd must be after buildStart

Is the second scenario within the scope of the issue that we are fixing in this PR, or does it need a different Jira ticket?

@cheesesashimi
Member Author

It feels like this second scenario is within the scope of this PR, so I'll try to fix it.

@cheesesashimi cheesesashimi force-pushed the zzlotnik/OCPBUGS-43896-rebase branch from d79c934 to 0fe2c2f on February 4, 2025 15:00
@sergiordlr

Hello! We still see a problem when we do this:

  1. Create a MOSC
  2. Wait until a MOSB is created
  3. Before the build pod starts running (while it is still in Creating status), delete the os-builder pod in a loop: `watch oc delete pod -l k8s-app=machine-os-builder`
  4. Wait until the build pod finishes and reports a Completed status
  5. Stop deleting the os-builder pod

We see these messages in the os-builder pod logs:

2025-02-05T10:38:41.493652922Z I0205 10:38:41.493627       1 reconciler.go:617] MachineOSBuild "mosc-worker-f0a8abdb824fef7196a904a0976592cf" transitioned from initial state -> terminal state (Succeeded); update needed
2025-02-05T10:38:41.499492631Z I0205 10:38:41.499439       1 reconciler.go:798] Finished adding Job "build-mosc-worker-f0a8abdb824fef7196a904a0976592cf" after 9.813623ms
2025-02-05T10:38:41.499492631Z E0205 10:38:41.499473       1 wrappedqueue.go:257] "Unhandled Error" err="Adding Job \"build-mosc-worker-f0a8abdb824fef7196a904a0976592cf\" failed: could not update job status for \"build-mosc-worker-f0a8abdb824fef7196a904a0976592cf\": unable to set status on MachineOSBuild \"mosc-worker-f0a8abdb824fef7196a904a0976592cf\": could not update status on MachineOSBuild \"mosc-worker-f0a8abdb824fef7196a904a0976592cf\": MachineOSBuild.machineconfiguration.openshift.io \"mosc-worker-f0a8abdb824fef7196a904a0976592cf\" is invalid: status: Invalid value: \"object\": buildEnd must be after buildStart"
2025-02-05T10:38:41.500573992Z I0205 10:38:41.500560       1 wrappedqueue.go:258] Dropping item <kind: "Job", name: "build-mosc-worker-f0a8abdb824fef7196a904a0976592cf", func: "(*OSBuildController).addJob"> out of queue machineosbuilder: Adding Job "build-mosc-worker-f0a8abdb824fef7196a904a0976592cf" failed: could not update job status for "build-mosc-worker-f0a8abdb824fef7196a904a0976592cf": unable to set status on MachineOSBuild "mosc-worker-f0a8abdb824fef7196a904a0976592cf": could not update status on MachineOSBuild "mosc-worker-f0a8abdb824fef7196a904a0976592cf": MachineOSBuild.machineconfiguration.openshift.io "mosc-worker-f0a8abdb824fef7196a904a0976592cf" is invalid: status: Invalid value: "object": buildEnd must be after buildStart

@cheesesashimi cheesesashimi force-pushed the zzlotnik/OCPBUGS-43896-rebase branch from 0fe2c2f to d2b1357 on February 6, 2025 21:35
If the machine-os-builder pod is stopped and an active build job completes
before the machine-os-builder pod is rescheduled, the job will be ignored.
In this situation, we should check whether the job is in a terminal state
and take the appropriate action if it is.

This also opportunistically cleans up the buildprogress -> conditions
mapping and adds additional test cases for detecting state changes.
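The commit message mentions cleaning up the buildprogress -> conditions mapping; below is a hedged, table-driven sketch of what such a mapping can look like. The progress values and condition types are illustrative assumptions, not the MCO's actual API.

```go
// Illustrative only: a buildProgress -> conditions mapping in the spirit of
// the cleanup described above. All names here are hypothetical.
package builder

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

type buildProgress string

const (
	buildPending   buildProgress = "Pending"
	buildBuilding  buildProgress = "Building"
	buildSucceeded buildProgress = "Succeeded"
	buildFailed    buildProgress = "Failed"
)

// conditionsFor returns the conditions that should be set on a MachineOSBuild
// for a given progress value. Keeping the mapping in one place makes the
// state transition rules, and their tests, easier to reason about.
func conditionsFor(p buildProgress) []metav1.Condition {
	condition := func(t string) []metav1.Condition {
		return []metav1.Condition{{
			Type:   t,
			Status: metav1.ConditionTrue,
			Reason: string(p),
		}}
	}
	switch p {
	case buildPending:
		return condition("BuildPending")
	case buildBuilding:
		return condition("Building")
	case buildSucceeded:
		return condition("BuildSucceeded")
	case buildFailed:
		return condition("BuildFailed")
	}
	return nil
}

// isTerminalProgress reports whether a progress value can no longer change,
// which is what the add-job reconciliation checks for.
func isTerminalProgress(p buildProgress) bool {
	return p == buildSucceeded || p == buildFailed
}
```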
@cheesesashimi cheesesashimi force-pushed the zzlotnik/OCPBUGS-43896-rebase branch from d2b1357 to dd1daba on February 6, 2025 22:52
@sergiordlr

All the scenarios described in the PR worked. Nevertheless, we hit an issue in our regression tests:

  1. With OCL enabled
  2. The os-builder pod is restarted and takes several minutes to take the lease when it is recreated
  3. In the meantime, a new MC is created and a new rendered MC is generated
  4. The os-builder pod takes the lease and starts working again

The result is that even though a new MC was rendered, no MachineOSBuild is created. The cluster remains in an inconsistent state, with the worker pool updating but with no image to update to, stuck reporting 0 updated workers.

~$ oc get mcp
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-a424d6e7f92b9772d16735ba5fe617ab True False False 3 3 3 0 171m
worker rendered-worker-5ca8169e97f53703c89fad5b7fdc9f5c False True False 2 0 0 0 171m

We can reproduce it by:

  1. Enable OCL
  2. Delete the os-builder pod in a loop
  3. Create a MC and wait for a new rendered MC
  4. Stop deleting the os-builder pod.

Contributor

openshift-ci bot commented Feb 8, 2025

@cheesesashimi: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
--- | --- | --- | --- | ---
ci/prow/e2e-aws-ovn | 9970e04 | link | true | /test e2e-aws-ovn
ci/prow/e2e-azure-ovn-upgrade-out-of-change | 9970e04 | link | false | /test e2e-azure-ovn-upgrade-out-of-change
ci/prow/e2e-gcp-op-techpreview | 9970e04 | link | false | /test e2e-gcp-op-techpreview

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
