
OCPBUGS-48810: Ensure that build jobs are always reconciled #4811

Open

cheesesashimi wants to merge 2 commits into master from zzlotnik/OCPBUGS-43896-rebase

Conversation

cheesesashimi
Member

@cheesesashimi cheesesashimi commented Jan 27, 2025

- What I did

If the machine-os-builder pod is stopped and a running build job completes at the same time, the rescheduled machine-os-builder pod may ignore that job. The build controller is aware that the job object exists, but because the job never transitions state after the pod restarts, it is ignored. To avoid this, we examine each added job object and reconcile it by updating the MachineOSBuild status according to our state transition rules.
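A minimal sketch of the idea, using hypothetical names (`jobIsTerminal`, `addJob`, `updateMachineOSBuildStatusFromJob`) rather than the actual MCO helpers: when the informer reports a job as added and that job is already in a terminal state, reconcile the MachineOSBuild immediately instead of waiting for a state transition that will never arrive.

```go
// Illustrative sketch only; the real controller lives in the MCO's build
// controller packages and uses different helper names.
package builder

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

type controller struct{}

// jobIsTerminal reports whether the build job has already finished
// (succeeded or failed), i.e. no further state transitions will be observed.
func jobIsTerminal(job *batchv1.Job) bool {
	for _, cond := range job.Status.Conditions {
		if (cond.Type == batchv1.JobComplete || cond.Type == batchv1.JobFailed) &&
			cond.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}

// addJob handles every job the informer reports as added, including jobs
// that finished while the machine-os-builder pod was down.
func (c *controller) addJob(job *batchv1.Job) error {
	if !jobIsTerminal(job) {
		// Initial and transient states are left alone here; they will emit a
		// normal update event when they transition.
		return nil
	}
	// The job reached a terminal state while nothing was watching, so
	// reconcile the MachineOSBuild status now.
	return c.updateMachineOSBuildStatusFromJob(job)
}

// updateMachineOSBuildStatusFromJob is a placeholder for the real logic that
// maps the job's terminal state onto the MachineOSBuild status via the state
// transition rules.
func (c *controller) updateMachineOSBuildStatusFromJob(job *batchv1.Job) error {
	return nil
}
```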

Note: This needs to be rebased and land after #4756 does because it is based upon those changes.

- How to verify it

One could verify this by terminating the machine-os-builder pod at approximately the same time as a build job completes. In that situation, the MachineOSBuild should be updated after the machine-os-builder pod is rescheduled and begins running. Also, the unit test suite has an added test for this specific scenario.
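The added unit test itself is not shown in this description; as a rough idea of its shape, here is a hedged sketch against the hypothetical `addJob`/`jobIsTerminal` helpers above, not the actual test from this PR:

```go
// Sketch of a unit test for the "job is already terminal when added" case,
// using the illustrative helpers from the sketch above rather than the MCO's
// real test fixtures.
package builder

import (
	"testing"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

func TestAddJobReconcilesAlreadyCompletedJob(t *testing.T) {
	// A job that completed while the machine-os-builder pod was not running.
	job := &batchv1.Job{
		Status: batchv1.JobStatus{
			Conditions: []batchv1.JobCondition{{
				Type:   batchv1.JobComplete,
				Status: corev1.ConditionTrue,
			}},
		},
	}

	if !jobIsTerminal(job) {
		t.Fatalf("expected a completed job to be treated as terminal")
	}

	// The real test would back the controller with fake clients and assert on
	// the resulting MachineOSBuild status; here we only check that addJob
	// accepts an already-finished job without error.
	c := &controller{}
	if err := c.addJob(job); err != nil {
		t.Fatalf("addJob returned an unexpected error: %v", err)
	}
}
```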

- Description for the changelog
Ensure that build jobs are always reconciled

@openshift-ci-robot openshift-ci-robot added the jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. label Jan 27, 2025
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 27, 2025
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jan 27, 2025
Contributor

openshift-ci bot commented Jan 27, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot
Contributor

@cheesesashimi: This pull request references Jira Issue OCPBUGS-43896, which is invalid:

  • expected the bug to target the "4.19.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

- What I did

If the machine-os-builder pod is stopped and a running job completes simultaneously, it is possible that the machine-os-builder pod will ignore it once it has been rescheduled. Although the build controller is aware of the existence of the job object, because it has not transitioned state, it is ignored. To avoid this, we check each added job object to determine if it is in a terminal state. If so, we update the MachineOSBuild status. We purposely ignore initial and transient states because they cause an update event whenever a state transition occurs.

- How to verify it

One could verify this by terminating the machine-os-builder pod at approximately the same time as a build job completes. In that situation, the MachineOSBuild should be updated after the machine-os-builder pod is rescheduled and begins running. Also, the unit test suite has an added test for this specific scenario.

- Description for the changelog
Ensure that build jobs are always reconciled

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@cheesesashimi cheesesashimi changed the title from "OCPBUGS-43896: examine added jobs" to "OCPBUGS-43896: Ensure that build jobs are always reconciled" on Jan 27, 2025
Contributor

openshift-ci bot commented Jan 27, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cheesesashimi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 27, 2025
@cheesesashimi cheesesashimi force-pushed the zzlotnik/OCPBUGS-43896-rebase branch from 55000c2 to 091e97c on January 27, 2025 17:13
@cheesesashimi
Member Author

/testwith openshift/machine-config-operator/master/e2e-gcp-op-ocl openshift/api#2134

@cheesesashimi cheesesashimi changed the title from "OCPBUGS-43896: Ensure that build jobs are always reconciled" to "OCPBUGS-48810: Ensure that build jobs are always reconciled" on Jan 27, 2025
@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. and removed jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. labels Jan 27, 2025
@openshift-ci-robot
Contributor

@cheesesashimi: This pull request references Jira Issue OCPBUGS-48810, which is invalid:

  • expected the bug to target the "4.19.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

- What I did

If the machine-os-builder pod is stopped and a running job completes simultaneously, it is possible that the machine-os-builder pod will ignore it once it has been rescheduled. Although the build controller is aware of the existence of the job object, because it has not transitioned state, it is ignored. To avoid this, we check each added job object to determine if it is in a terminal state. If so, we update the MachineOSBuild status. We purposely ignore initial and transient states because they cause an update event whenever a state transition occurs.

Note: This needs to be rebased and land after #4756 does because it is based upon those changes.

- How to verify it

One could verify this by terminating the machine-os-builder pod at approximately the same time as a build job completes. In that situation, the MachineOSBuild should be updated after the machine-os-builder pod is rescheduled and begins running. Also, the unit test suite has an added test for this specific scenario.

- Description for the changelog
Ensure that build jobs are always reconciled

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@cheesesashimi cheesesashimi force-pushed the zzlotnik/OCPBUGS-43896-rebase branch 2 times, most recently from 6b3309a to c29ba3e, on January 27, 2025 19:37
@cheesesashimi cheesesashimi marked this pull request as ready for review January 27, 2025 19:40
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 27, 2025
@cheesesashimi cheesesashimi force-pushed the zzlotnik/OCPBUGS-43896-rebase branch from c29ba3e to d79c934 on January 27, 2025 19:54
// state transitions is observed, the MachineOSBuild object should not be
// updated because they are invalid and make no sense.
{
name: "Terminal -> Initial",
Contributor


Question: For the case where a build fails, we fix the issue and then trigger the build again with the rebuild annotation. This restriction shouldn't cause an issue there, right?

The MOSB is deleted and recreated in this case, but it is created with the same name. I don't think trying to update the state of the new MOSB with the same name should cause an issue, but it is worth confirming.

Member Author

@cheesesashimi cheesesashimi Jan 30, 2025


It shouldn't in that case because it is a new MOSB.

@umohnani8
Contributor

Changes LGTM

@cheesesashimi
Member Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jan 30, 2025
@openshift-ci-robot
Contributor

@cheesesashimi: This pull request references Jira Issue OCPBUGS-48810, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.19.0) matches configured target version for branch (4.19.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sergiordlr

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested a review from sergiordlr January 30, 2025 18:24
@sergiordlr

sergiordlr commented Jan 31, 2025

Upgrade from 4.19.0-0.nightly-2025-01-28-090833 to this fix (RUNNING):

https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.19-amd64-nightly-4.19-upgrade-from-stable-4.19-aws-ipi-ocl-fips-f60/1885328465318645760

Hello, this scenario is working ok

  1. Create a MOSC
  2. Wait until a MOSB is created and its status is fully filled in.
  3. Wait until the build pod runs for 1 minute or so
  4. Start deleting the os-builder pod in a loop: `watch oc delete pod -l k8s-app=machine-os-builder`
  5. Wait until the build pod finishes and reports a Completed status
  6. Stop deleting the os-builder pod

The result is that the new os-builder pod is able to reconcile everything and the image is correctly deployed to the nodes.

Nevertheless, we have tested this other scenario

  1. Create a MOSC
  2. Wait until a MOSB is created
  3. Wait until the build pod is created. Here the MOSB status is still empty
  4. While the MOSB status is empty, start deleting the os-builder pod in a loop: `watch oc delete pod -l k8s-app=machine-os-builder`
  5. When the builder pod has run for 2 minutes or so, stop deleting the os-builder pod

The result in this scenario is that the MOSB resource is not updated and we can see this error in the os-builder logs:

I0131 13:28:12.905928       1 jobimagebuilder.go:165] Using provided build job
I0131 13:28:12.905932       1 jobimagebuilder.go:192] Build job "build-mosc-worker-4fc90392d6b6f8f5a2d137af4082d2b7" status {Conditions:[{Type:SuccessCriteriaMet Status:True LastProbeTime:2025-01-31 13:22:18 +0000 UTC LastTransitionTime:2025-01-31 13:22:18 +0000 UTC Reason:CompletionsReached Message:Reached expected number of succeeded pods} {Type:Complete Status:True LastProbeTime:2025-01-31 13:22:18 +0000 UTC LastTransitionTime:2025-01-31 13:22:18 +0000 UTC Reason:CompletionsReached Message:Reached expected number of succeeded pods}] StartTime:2025-01-31 13:19:19 +0000 UTC CompletionTime:2025-01-31 13:22:18 +0000 UTC Active:0 Succeeded:1 Failed:0 Terminating:0xc000365b3c CompletedIndexes: FailedIndexes:<nil> UncountedTerminatedPods:&UncountedTerminatedPods{Succeeded:[],Failed:[],} Ready:0xc000365b40} mapped to MachineOSBuild progress "Succeeded"
I0131 13:28:12.907891       1 reconciler.go:632] MachineOSBuild "mosc-worker-4fc90392d6b6f8f5a2d137af4082d2b7" transitioned from transient state (Building) -> terminal state (Succeeded); update needed
I0131 13:28:12.911455       1 reconciler.go:813] Finished updating Job "build-mosc-worker-4fc90392d6b6f8f5a2d137af4082d2b7" after 5.969159ms
E0131 13:28:12.911492       1 wrappedqueue.go:257] "Unhandled Error" err="Updating Job \"build-mosc-worker-4fc90392d6b6f8f5a2d137af4082d2b7\" failed: could not set status on MachineOSBuild \"mosc-worker-4fc90392d6b6f8f5a2d137af4082d2b7\": could not update status on MachineOSBuild \"mosc-worker-4fc90392d6b6f8f5a2d137af4082d2b7\": MachineOSBuild.machineconfiguration.openshift.io \"mosc-worker-4fc90392d6b6f8f5a2d137af4082d2b7\" is invalid: status: Invalid value: \"object\": buildEnd must be after buildStart"
I0131 13:28:12.912579       1 wrappedqueue.go:258] Dropping item <kind: "Job", name: "build-mosc-worker-4fc90392d6b6f8f5a2d137af4082d2b7", func: "(*OSBuildController).updateJob"> out of queue machineosbuilder: Updating Job "build-mosc-worker-4fc90392d6b6f8f5a2d137af4082d2b7" failed: could not set status on MachineOSBuild "mosc-worker-4fc90392d6b6f8f5a2d137af4082d2b7": could not update status on MachineOSBuild "mosc-worker-4fc90392d6b6f8f5a2d137af4082d2b7": MachineOSBuild.machineconfiguration.openshift.io "mosc-worker-4fc90392d6b6f8f5a2d137af4082d2b7" is invalid: status: Invalid value: "object": buildEnd must be after buildStart

Is the second scenario within the scope of the issue that we are fixing in this PR, or does it need a different Jira ticket?

@cheesesashimi
Member Author

It feels like this second scenario is within the scope of this PR, so I'll try to fix it.

@cheesesashimi cheesesashimi force-pushed the zzlotnik/OCPBUGS-43896-rebase branch from d79c934 to 0fe2c2f on February 4, 2025 15:00
@sergiordlr

Hello! We still see a problem when we do this:

  1. Create a MOSC
  2. Wait until a MOSB is created
  3. Before the build pod starts running (while it is still in Creating status), delete the os-builder pod in a loop: `watch oc delete pod -l k8s-app=machine-os-builder`
  4. Wait until the build pod finishes and reports a Completed status
  5. Stop deleting the os-builder pod

We see these messages in the os-builder pod logs:

2025-02-05T10:38:41.493652922Z I0205 10:38:41.493627       1 reconciler.go:617] MachineOSBuild "mosc-worker-f0a8abdb824fef7196a904a0976592cf" transitioned from initial state -> terminal state (Succeeded); update needed
2025-02-05T10:38:41.499492631Z I0205 10:38:41.499439       1 reconciler.go:798] Finished adding Job "build-mosc-worker-f0a8abdb824fef7196a904a0976592cf" after 9.813623ms
2025-02-05T10:38:41.499492631Z E0205 10:38:41.499473       1 wrappedqueue.go:257] "Unhandled Error" err="Adding Job \"build-mosc-worker-f0a8abdb824fef7196a904a0976592cf\" failed: could not update job status for \"build-mosc-worker-f0a8abdb824fef7196a904a0976592cf\": unable to set status on MachineOSBuild \"mosc-worker-f0a8abdb824fef7196a904a0976592cf\": could not update status on MachineOSBuild \"mosc-worker-f0a8abdb824fef7196a904a0976592cf\": MachineOSBuild.machineconfiguration.openshift.io \"mosc-worker-f0a8abdb824fef7196a904a0976592cf\" is invalid: status: Invalid value: \"object\": buildEnd must be after buildStart"
2025-02-05T10:38:41.500573992Z I0205 10:38:41.500560       1 wrappedqueue.go:258] Dropping item <kind: "Job", name: "build-mosc-worker-f0a8abdb824fef7196a904a0976592cf", func: "(*OSBuildController).addJob"> out of queue machineosbuilder: Adding Job "build-mosc-worker-f0a8abdb824fef7196a904a0976592cf" failed: could not update job status for "build-mosc-worker-f0a8abdb824fef7196a904a0976592cf": unable to set status on MachineOSBuild "mosc-worker-f0a8abdb824fef7196a904a0976592cf": could not update status on MachineOSBuild "mosc-worker-f0a8abdb824fef7196a904a0976592cf": MachineOSBuild.machineconfiguration.openshift.io "mosc-worker-f0a8abdb824fef7196a904a0976592cf" is invalid: status: Invalid value: "object": buildEnd must be after buildStart

@cheesesashimi cheesesashimi force-pushed the zzlotnik/OCPBUGS-43896-rebase branch from 0fe2c2f to d2b1357 on February 6, 2025 21:35
If the machine-os-builder pod is stopped and an active build job completes
before the machine-os-builder pod is rescheduled, the job will be ignored.
In this situation, we should check whether the job is in a terminal state
and take the appropriate action if it is.

This also opportunistically cleans up the buildprogress -> conditions
mapping and adds additional test cases for detecting state changes.
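The commit message mentions cleaning up the buildprogress -> conditions mapping; below is a hedged, table-driven sketch of what such a mapping can look like. The progress values and condition types are illustrative assumptions, not the MCO's actual API.

```go
// Illustrative only: a buildProgress -> conditions mapping in the spirit of
// the cleanup described above. All names here are hypothetical.
package builder

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

type buildProgress string

const (
	buildPending   buildProgress = "Pending"
	buildBuilding  buildProgress = "Building"
	buildSucceeded buildProgress = "Succeeded"
	buildFailed    buildProgress = "Failed"
)

// conditionsFor returns the conditions that should be set on a MachineOSBuild
// for a given progress value. Keeping the mapping in one place makes the
// state transition rules, and their tests, easier to reason about.
func conditionsFor(p buildProgress) []metav1.Condition {
	condition := func(t string) []metav1.Condition {
		return []metav1.Condition{{
			Type:   t,
			Status: metav1.ConditionTrue,
			Reason: string(p),
		}}
	}
	switch p {
	case buildPending:
		return condition("BuildPending")
	case buildBuilding:
		return condition("Building")
	case buildSucceeded:
		return condition("BuildSucceeded")
	case buildFailed:
		return condition("BuildFailed")
	}
	return nil
}

// isTerminalProgress reports whether a progress value can no longer change,
// which is what the add-job reconciliation checks for.
func isTerminalProgress(p buildProgress) bool {
	return p == buildSucceeded || p == buildFailed
}
```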
@cheesesashimi cheesesashimi force-pushed the zzlotnik/OCPBUGS-43896-rebase branch from d2b1357 to dd1daba on February 6, 2025 22:52
@sergiordlr

All the scenarios described in the PR worked. Nevertheless, we hit an issue in our regression tests:

  1. With OCL enabled
  2. The os-builder pod is restarted and takes several minutes to take the lease when it is recreated
  3. In the meantime, a new MC is created and a new rendered MC is generated
  4. The os-builder pod takes the lease and starts working again

The result is that even though a new MC was rendered, no MachineOSBuild is created. The cluster remains in an inconsistent state, with the worker pool updating but with no image to update to, stuck reporting 0 updated workers.

~$ oc get mcp
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-a424d6e7f92b9772d16735ba5fe617ab True False False 3 3 3 0 171m
worker rendered-worker-5ca8169e97f53703c89fad5b7fdc9f5c False True False 2 0 0 0 171m

We can reproduce it by:

  1. Enable OCL
  2. Delete the os-builder pod in a loop
  3. Create a MC and wait for a new rendered MC
  4. Stop deleting the os-builder pod.

Contributor

openshift-ci bot commented Feb 8, 2025

@cheesesashimi: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
--- | --- | --- | --- | ---
ci/prow/e2e-aws-ovn | 9970e04 | link | true | /test e2e-aws-ovn
ci/prow/e2e-azure-ovn-upgrade-out-of-change | 9970e04 | link | false | /test e2e-azure-ovn-upgrade-out-of-change
ci/prow/e2e-gcp-op-techpreview | 9970e04 | link | false | /test e2e-gcp-op-techpreview

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
