[sled-agent] Integrate config-reconciler #8064


Open
wants to merge 11 commits into
base: main

jgallagher
Contributor

This PR integrates the new sled-agent-config-reconciler crate with sled-agent. It will not currently pass tests because the reconciler is not completely implemented, but I'd like to get any feedback on this integration work itself (particularly as it pertains to the API of sled-agent-config-reconciler). See the description of #8063 for more context.

There are a couple of serious warts with this PR:

  • The inventory system has not been updated with all the details we need to report for the reconciler. This is a bigger chunk of work because it involves a database migration and touches various bits of Nexus, so I'll do that in a separate PR.
  • This integration removes most uses of the StorageManager (because its functionality is being absorbed into sled-agent-config-reconciler); however, the storage manager also has a rich set of test support. This PR leaves a couple sled-agent submodules using that test support (support-bundle/storage and zone-bundle). In the long run I think it'd be better to rework these (if there are no remaining production uses of StorageManager), but for now I think this is... okay? Feedback welcome.

jgallagher added a commit that referenced this pull request Apr 29, 2025
This is somewhat extracted from #8064, but can be landed independently
and will make some of the followup sled-agent-config-reconciler PRs a
little cleaner.

Fixes #7774.
@@ -34,14 +34,6 @@ enum SledAgentCommands {
    #[clap(subcommand)]
    Zones(ZoneCommands),

    /// print information about zpools
Collaborator

Are you expecting that inventory will supplant this info? Or are you planning on replacing this access to the sled agent later?

Contributor Author

I was expecting that inventory would supplant this. (I think maybe it already has, in practice? I definitely only look at inventory when I'm curious about zpools; I don't think I've ever used these omdb subcommands.)

jgallagher added a commit that referenced this pull request Apr 30, 2025
This is somewhat extracted from #8064, but can be landed independently
and will make some of the followup sled-agent-config-reconciler PRs a
little cleaner. We don't yet ledger `OmicronSledConfig`s to disk, so
we're free to fiddle with the details of its fields without worrying
about backwards compatibility.

Fixes #7774.
@jgallagher jgallagher force-pushed the john/sled-agent-config-reconciler-2 branch from abd7542 to 2574c5c Compare April 30, 2025 19:17
Base automatically changed from john/sled-agent-config-reconciler-1 to main May 1, 2025 12:34
@jgallagher jgallagher force-pushed the john/sled-agent-config-reconciler-2 branch from 2574c5c to a057195 Compare May 2, 2025 14:59
@jgallagher jgallagher force-pushed the john/sled-agent-config-reconciler-2 branch from a057195 to 0faddda Compare May 21, 2025 20:38
jgallagher added a commit that referenced this pull request May 22, 2025
…ig reconciler (#8188)

The primary change here is replacing these inventory fields (a subset of
`OmicronSledConfig`):

```rust
    pub omicron_zones: OmicronZonesConfig,
    pub omicron_physical_disks_generation: Generation,
```

with these:

```rust
    pub ledgered_sled_config: Option<OmicronSledConfig>,
    pub reconciler_status: ConfigReconcilerInventoryStatus,
    pub last_reconciliation: Option<ConfigReconcilerInventory>,
```

Once #8064 lands, all three of these will be filled in meaningfully; as
of this PR, only `ledgered_sled_config` is populated.
(`reconciler_status` is always `NotYetRun` and `last_reconciliation` is
always `None`, since there is no reconciler yet.) The rest of the
changes are all fallout from changing inventory:

* Update `omdb` printing
* Update sled-agent to report the new inventory fields
* Update consumers of inventory (tests, reconfigurator planner, one
Nexus RPW) - these all just look at `ledgered_sled_config` for now, but
will need to be updated on #8064 once other fields are populated
* Update database schema, model, and queries (the bulk of the diff).
This requires dropping all preexisting collections, since there's no way
to migrate from just `omicron_zones` to a full `OmicronSledConfig`. The
first few schema migrations take care of this.
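
As a rough illustration of the new inventory shape, here is a minimal sketch. The three field names and the `NotYetRun` variant come from the description above; everything else (the `SledInventory` container name, the `u64` generation, the empty `ConfigReconcilerInventory`, and both helper functions) is a stand-in assumed for the example, not the real omicron definitions.

```rust
// Stand-in types: the real definitions carry many more fields. Only the
// three inventory field names and `NotYetRun` are taken from the PR text.
#[derive(Debug, Clone, PartialEq)]
pub struct OmicronSledConfig {
    pub generation: u64, // assumption: simplified stand-in for `Generation`
}

#[derive(Debug, Clone, PartialEq)]
pub enum ConfigReconcilerInventoryStatus {
    NotYetRun,
}

#[derive(Debug, Clone, PartialEq)]
pub struct ConfigReconcilerInventory; // contents omitted in this sketch

// Hypothetical container name for the per-sled inventory record.
#[derive(Debug, Clone, PartialEq)]
pub struct SledInventory {
    pub ledgered_sled_config: Option<OmicronSledConfig>,
    pub reconciler_status: ConfigReconcilerInventoryStatus,
    pub last_reconciliation: Option<ConfigReconcilerInventory>,
}

/// Inventory as reported while there is no reconciler yet: only the
/// ledgered config carries information.
pub fn pre_reconciler_inventory(
    ledgered: Option<OmicronSledConfig>,
) -> SledInventory {
    SledInventory {
        ledgered_sled_config: ledgered,
        reconciler_status: ConfigReconcilerInventoryStatus::NotYetRun,
        last_reconciliation: None,
    }
}

/// What a consumer can do "for now": consult only `ledgered_sled_config`.
pub fn ledgered_generation(inv: &SledInventory) -> Option<u64> {
    inv.ledgered_sled_config.as_ref().map(|c| c.generation)
}
```

Consumers written this way would need revisiting once the other two fields are populated, which matches the note above about updating them on #8064.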

Before merging I'll go through an upgrade on a racklette and confirm
things come back up okay after the schema migration blows away all the
pre-update inventory collections. (We think this is fine, but it'd be
good to confirm.) But I think this is close enough that it's reviewable.

Couple other minor changes that came along for the ride:

* Closes #6770 (`inv_sled_omicron_zones` is gone now)
* Fixes #8084 (added `image_source` columns to the inventory zone config
table, so we don't lose `ImageSource::Artifact { hash }` values reported
by sled-agent)
@jgallagher jgallagher force-pushed the john/sled-agent-config-reconciler-2 branch from 0faddda to 8ff4ae3 Compare May 22, 2025 15:30
@jgallagher jgallagher marked this pull request as ready for review May 22, 2025 21:24
@jgallagher jgallagher requested a review from sunshowers May 22, 2025 21:24
@jgallagher
Contributor Author

I'm putting racklette testing notes for this branch plus a few followups in comments on the last of those followups (#8220).

Contributor

@andrewjstone andrewjstone left a comment

This is a hard PR to review, given its broad scope. It was made somewhat easier by recognizing a few patterns, such as replacing calls to the storage manager with rx channels for disks and datasets.

It all appears correct to me, but again, hard to really tell. I'm sure it was tedious to implement as well :)

Regardless, looks good enough to merge and continue with.
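
For readers unfamiliar with the pattern mentioned in this review (a consumer holding a receiver instead of calling into a manager), here is a minimal sketch. It uses `std::sync::mpsc` only to stay self-contained; the actual code presumably uses an async watch-style channel, and `DiskSnapshot`, `DiskWatcher`, `disk_channel`, and `publish` are all hypothetical names.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical snapshot of the disks a sled currently manages.
#[derive(Debug, Clone, PartialEq)]
pub struct DiskSnapshot {
    pub disk_ids: Vec<String>,
}

/// Consumer side: holds only a receiver and the last snapshot it saw,
/// rather than a handle it must call into for every query.
pub struct DiskWatcher {
    rx: Receiver<DiskSnapshot>,
    latest: DiskSnapshot,
}

impl DiskWatcher {
    /// Drain any pending updates and return the most recent snapshot.
    pub fn current(&mut self) -> &DiskSnapshot {
        while let Ok(snapshot) = self.rx.try_recv() {
            self.latest = snapshot;
        }
        &self.latest
    }
}

/// Create a connected producer handle and watcher, starting with no disks.
pub fn disk_channel() -> (Sender<DiskSnapshot>, DiskWatcher) {
    let (tx, rx) = channel();
    let watcher = DiskWatcher {
        rx,
        latest: DiskSnapshot { disk_ids: Vec::new() },
    };
    (tx, watcher)
}

/// Producer side: publish a new snapshot whenever the disk set changes.
pub fn publish(tx: &Sender<DiskSnapshot>, ids: &[&str]) {
    let snapshot = DiskSnapshot {
        disk_ids: ids.iter().map(|s| s.to_string()).collect(),
    };
    // A send error only means the consumer has gone away; ignore it here.
    let _ = tx.send(snapshot);
}
```

The consumer never blocks on the producer: it reads whatever was published last, which is the property that makes the call-site replacements mechanical.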

    method = GET,
    path = "/datasets",
}]
async fn datasets_get(
Contributor

Were these only used by the OMDB commands that got removed?

async fn dyn_datasets_config_list(&self) -> Result<DatasetsConfig, Error> {
    self.datasets_config_list().await.map_err(|err| err.into())
    // TODO-cleanup This is super gross; add a better API (maybe fetch a
Contributor

Did you want to clean this up in this PR?

/// Given a sled config, produce a reconciler result that sled-agent could
/// have emitted if reconciliation succeeded.
///
/// This method should only be used by tests and dev tools; real code should
Contributor

Should we mark this with `#[test]` and maybe `#[cfg(any(test, feature = "testing"))]`?
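
A minimal sketch of what that gating could look like, assuming a Cargo feature named `testing` exists in the crate. The types and method names here are hypothetical stand-ins, not the PR's actual API. Since `#[test]` marks a zero-argument test function run by the harness, it would not fit a helper that takes a sled config, so the sketch uses only the `cfg` gate.

```rust
/// Hypothetical, pared-down sled config.
#[derive(Debug, Clone, PartialEq)]
pub struct SledConfig {
    pub generation: u64,
}

/// Hypothetical, pared-down reconciler result.
#[derive(Debug, Clone, PartialEq)]
pub struct ReconcilerResult {
    pub last_reconciled_config: SledConfig,
    pub all_succeeded: bool,
}

impl ReconcilerResult {
    /// Production path: built from what the reconciler actually observed.
    pub fn from_outcome(config: SledConfig, all_succeeded: bool) -> Self {
        Self { last_reconciled_config: config, all_succeeded }
    }

    /// Test-only shortcut: pretend reconciliation of `config` succeeded.
    /// Compiled only when this crate is built for its own tests or when
    /// the (assumed) `testing` feature is enabled.
    #[cfg(any(test, feature = "testing"))]
    pub fn assume_success(config: SledConfig) -> Self {
        Self::from_outcome(config, true)
    }
}
```

Note that `cfg(test)` is only set while the defining crate itself is compiled for tests, so tests and dev tools in other crates would have to enable the feature to see `assume_success`.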

@@ -214,17 +214,18 @@ impl<'a> Planner<'a> {
// The sled is not expunged. We have to see if the inventory
// reflects the parent blueprint disk generation. If it does
// then we mark any expunged disks decommissioned.
//
// TODO-correctness We inspect `last_reconciliation` here to confirm
Contributor

On balance, this seems like the right choice to me. We should know the sled agent has acted before decommissioning.
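
The check under discussion can be sketched roughly as follows. Only `last_reconciliation` is a name from the PR; the struct layouts, the `config_generation` field, the function name, and the `>=` comparison are assumptions for illustration.

```rust
/// Hypothetical stand-in for what a sled reported after its last
/// reconciliation pass.
#[derive(Debug, Clone, PartialEq)]
pub struct LastReconciliation {
    /// Generation of the sled config the reconciler most recently acted on.
    pub config_generation: u64,
}

/// Hypothetical, pared-down per-sled inventory record.
#[derive(Debug, Clone, PartialEq)]
pub struct SledInventory {
    pub last_reconciliation: Option<LastReconciliation>,
}

/// Returns true only if the sled agent has reconciled a config at least as
/// new as the parent blueprint's disk generation. A sled that has never
/// reconciled (None) is treated as not having acted yet.
pub fn may_decommission_expunged_disks(
    inv: &SledInventory,
    parent_blueprint_generation: u64,
) -> bool {
    match &inv.last_reconciliation {
        Some(last) => last.config_generation >= parent_blueprint_generation,
        None => false,
    }
}
```

Keying on the reconciled config rather than the ledgered one means the planner waits for evidence that the sled agent acted, not merely that it stored the config.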

3 participants