[reconfigurator] Pre-checks and post_update actions for RoT bootloader update #8325
Conversation
// We'll loop for 3 minutes to wait for any ongoing RoT bootloader update.
// We need to wait for 2 resets, which have a timeout of 60 seconds each,
// and an attempt to retrieve boot info, which has a timeout of 30 seconds.
// We give an additional 30 seconds as a buffer for the other actions.
Ok(PrecheckStatus::WaitingForOngoingRotBootloaderUpdate) => {
    if before.elapsed()
        >= WAIT_FOR_ONGOING_ROT_BOOTLOADER_UPDATE_TIMEOUT
    {
        return Err(UpdateWaitError::Timeout(
            WAIT_FOR_ONGOING_ROT_BOOTLOADER_UPDATE_TIMEOUT,
        ));
    }

    tokio::time::sleep(ROT_BOOLOADER_UPDATE_PROGRESS_INTERVAL)
        .await;
    continue;
}
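For reference, the budget in that comment adds up as follows. This is a hypothetical sketch to make the arithmetic explicit; only WAIT_FOR_ONGOING_ROT_BOOTLOADER_UPDATE_TIMEOUT is a name from this change, and the helper constants are invented:

use std::time::Duration;

// Hypothetical helper constants; the real code may define this differently.
const ROT_RESET_TIMEOUT_SECS: u64 = 60; // two resets expected
const BOOT_INFO_TIMEOUT_SECS: u64 = 30; // one attempt to retrieve boot info
const BUFFER_SECS: u64 = 30; // slack for the other actions

// 2 * 60 + 30 + 30 = 180 seconds, i.e. the 3 minutes described above.
const WAIT_FOR_ONGOING_ROT_BOOTLOADER_UPDATE_TIMEOUT: Duration =
    Duration::from_secs(2 * ROT_RESET_TIMEOUT_SECS + BOOT_INFO_TIMEOUT_SECS + BUFFER_SECS);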
@davepacheco is this implementation consistent with #7988 (comment)? Or is there something I missed?
// TODO-K: In the RoT bootloader update code in wicket, there is a set of
// known bootloader FWIDs that don't have cabooses. Is this something we
// should care about here?
// https://github.com/oxidecomputer/omicron/blob/89ce370f0a96165c777e90a008257a6085897f2a/wicketd/src/update_tracker.rs#L1817-L1841

// TODO-K: There are also older versions of the SP that have a bug preventing
// setting the active slot for the RoT bootloader. Is this something we should
// care about here?
// https://github.com/oxidecomputer/omicron/blob/89ce370f0a96165c777e90a008257a6085897f2a/wicketd/src/update_tracker.rs#L1705-L1710
It would be good to get some input from @davepacheco or @lzrd here.
I spoke with @lzrd IRL about this. We do want to keep these checks in place for the development experience, but they're not urgent; they can be added later.
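For context, a rough sketch of the kind of special-casing the linked wicket code does; the FWID values below are placeholders, not the real list from wicketd/src/update_tracker.rs:

// Placeholder FWIDs standing in for the real set in update_tracker.rs.
const KNOWN_CABOOSELESS_STAGE0_FWIDS: &[&str] = &[
    "placeholder-fwid-a",
    "placeholder-fwid-b",
];

// Development stage0 builds listed above ship without a caboose, so there
// is no caboose version to read; callers must fall back to comparing FWIDs.
fn is_known_cabooseless_stage0(fwid: &str) -> bool {
    KNOWN_CABOOSELESS_STAGE0_FWIDS.contains(&fwid)
}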
// TODO-K: In post_update we'll be restarting the RoT twice to do signature
// checks, and to set stage0 to the new version. What happens if the RoT
// itself is being updated (during the reset stage)? Should we check for that
// here before setting the RoT bootloader as ready to update?
Is this even possible? I know the planner will be doing the SP, RoT, bootloader, and host OS updates sequentially. But could a rogue Nexus attempt an RoT update while a bootloader update is happening, or vice versa?
I don't know about a "rogue" Nexus, but I think we should assume it's always possible any given Nexus could be executing an older blueprint concurrently with a different Nexus executing a newer blueprint.
Thanks! Yeah, that makes sense. I agree.
I guess my question now is what happens if a Nexus is resetting an RoT as part of an RoT update, and another Nexus is resetting an RoT as part of an RoT bootloader update?
I think all of the prechecks should prevent this from happening, even if a Nexus is operating on an older blueprint. Assuming all of this is working as intended:
- The planner ensures there is at most one PendingMgsUpdate in a given blueprint
- The planner only removes a PendingMgsUpdate if it's completed or become impossible
- The prechecks of any given update prevent a Nexus from starting an update if the target isn't in the same state it was in when the planner decided to perform the update
I don't think it's possible for two different Nexuses to attempt to reset two different components simultaneously:
- "reset" happens at the end of the update
- ... which means all the prechecks passed
- ... which means the update couldn't have been completed yet
- ... which means the planner couldn't have created a new blueprint with a different PendingMgsUpdate (unless the update has become impossible, which should have caused any in-flight update to fail before it got to "reset")
Maybe there's some path through here where this is possible, but if there is it seems like something we have to fix?
Maybe I'm misunderstanding, but that's assuming we're talking about the same blueprint version, yes? In a comment just above, you mention:
... I think we should assume it's always possible any given Nexus could be executing an older blueprint concurrently with a different Nexus executing a newer blueprint.
So, we could assume it's possible a Nexus is resetting an RoT as part of an RoT update of an older blueprint, and another Nexus is resetting an RoT as part of an RoT bootloader update of a newer blueprint.
Something like:
- Nexus#1 with a blueprint with a new RoT version starts an RoT update.
- Nexus#2 with a different blueprint with a new RoT bootloader version (and no changes to the RoT) starts an update.
- Both Nexus#1 and Nexus#2 enter the post-update stage at similar times, and clash resetting the RoT
Is this possible?
If so, it would probably make sense for the RoT bootloader update to have pre-checks that validate the expected state of the RoT, and the RoT update to have pre-checks that validate the expected state of the RoT bootloader.
Is it overkill to add those additional checks even if we're almost certain that this scenario is near impossible?
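For concreteness, a minimal sketch of what such cross-component pre-checks could look like; every type and function name below is hypothetical, not taken from this PR:

// Snapshots of the state the planner observed when it scheduled the update.
#[derive(PartialEq)]
struct ExpectedRotState { active_slot: u16, version: String }

#[derive(PartialEq)]
struct ExpectedStage0State { fwid: String }

enum PrecheckError { WrongRotState, WrongStage0State }

// Before an RoT bootloader update proceeds (and eventually resets the RoT),
// verify that both the bootloader and the RoT itself still match the
// planner's snapshot. A concurrent RoT update driven from a different
// blueprint would have changed the RoT's state, so this precheck would
// fail instead of the two updates clashing at the reset stage.
fn precheck_rot_bootloader_update(
    expected_rot: &ExpectedRotState,
    found_rot: &ExpectedRotState,
    expected_stage0: &ExpectedStage0State,
    found_stage0: &ExpectedStage0State,
) -> Result<(), PrecheckError> {
    if found_stage0 != expected_stage0 {
        return Err(PrecheckError::WrongStage0State);
    }
    if found_rot != expected_rot {
        return Err(PrecheckError::WrongRotState);
    }
    Ok(())
}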
Maybe I'm misunderstanding, but that's assuming we're talking about the same blueprint version, yes?
No, I meant if one Nexus was executing an older blueprint.
Something like:
- Nexus#1 with a blueprint with a new RoT version starts an RoT update.
- Nexus#2 with a different blueprint with a new RoT bootloader version (and no changes to the RoT) starts an update.
- Both Nexus#1 and Nexus#2 enter the post-update stage at similar times, and clash resetting the RoT
Is this possible?
It shouldn't be. Assuming the blueprint Nexus#1 is operating on is older, I think the first two points I made above:
- The planner ensures there is at most one PendingMgsUpdate in a given blueprint
- The planner only removes a PendingMgsUpdate if it's completed or become impossible
mean that the existence of the blueprint Nexus#2 is operating on implies that the RoT update in Nexus#1's older blueprint has either completed or become impossible, which means Nexus#1's prechecks should fail, and therefore it will never get to the post-update stage.
Maybe it's possible in the "abandoned update" case, but I hope not? That would be an ordering something like this:
1. Nexus#1 starts an RoT update, performs successful prechecks, then goes out to lunch.
2. A couple minutes later, Nexus#2 aborts Nexus#1's update, takes it over, and drives it to completion.
3. A new blueprint is produced to update the RoT bootloader.
4. Nexus#2 starts executing the new RoT bootloader update.
5. Nexus#1 wakes back up; it's already passed the prechecks, so it resumes updating from whatever point it was at.
I guess there's a question here of: in step 5, would Nexus#1 realize everything has changed out from under it and fail to resume an update that's been abandoned? Maybe it depends on which step exactly it was on? (cc @davepacheco who has probably thought about this a lot more 😅)
This commit implements several checks that must happen before updating an RoT bootloader, as well as the post-update actions.
Manual testing on a simulated Omicron:
Previous state
Updating via reconfigurator-sp-updater:
State after the update
Related: #7988