From 88d155c292943fe131cb7b67dc21cd846553e169 Mon Sep 17 00:00:00 2001 From: David Pacheco Date: Thu, 22 May 2025 12:21:43 -0700 Subject: [PATCH 01/10] WIP: reconfigurator dev guide --- docs/reconfigurator-dev-guide.adoc | 264 +++++++++++++++++++++++++++++ docs/reconfigurator.adoc | 4 + 2 files changed, 268 insertions(+) create mode 100644 docs/reconfigurator-dev-guide.adoc diff --git a/docs/reconfigurator-dev-guide.adoc b/docs/reconfigurator-dev-guide.adoc new file mode 100644 index 00000000000..713c8553886 --- /dev/null +++ b/docs/reconfigurator-dev-guide.adoc @@ -0,0 +1,264 @@ +:showtitle: +:numbered: +:toc: left + += Reconfigurator Developer Guide + +This document covers practical tips for working on Reconfigurator. For principles and design, see xref:reconfigurator.adoc[Reconfigurator Overview]. + +== Introduction + +Reconfigurator is a control plane subsystem that's responsible for runtime changes to the control plane. It's used to add, remove, and upgrade components. It's divided into two big pieces: + +* The **planner** generates **blueprints**, which are complete descriptions of how the system _should_ look (in terms of what components exist, at what versions, etc.) +* The **executor** takes a given blueprint and attempts to make reality match it. + +Blueprints are stored in CockroachDB (the control plane database). This makes them available to all Nexus instances. It also ensures strong consistency in what the system's current blueprint is supposed to be. + +Reconfigurator is designed to run autonomously as part of **Nexus** (see xref:control-plane-architecture.adoc[]). But as much as possible, the pieces are factored into self-contained packages that don't know about most of Nexus. As a concrete example: + +* Autonomous blueprint execution in real systems is driven by the Nexus `blueprint_execution` background task. But that task essentially just invokes `nexus_reconfigurator_execution::realize_blueprint`. 
+* Execution itself is encapsulated within this `nexus_reconfigurator_execution` package. + +This has some big benefits: + +* When working on blueprint execution, you usually only need to run `cargo check` and `cargo test` on the `nexus_reconfigurator_execution` package. You don't need to build and link Nexus (which involves a lot more code and takes a lot more time). +* It's possible to build developer tools like `reconfigurator-exec-unsafe`, which directly uses the `nexus_reconfigurator_execution` package. This gives developers finer control over blueprint execution and more direct visibility, while still using the exact same interfaces that the autonomous system in Nexus is using. + +It does mean there are lots of layers, though. Here's a conceptual map of components involved in blueprint execution: + +```mermaid +graph TD + Executor["Executor"] + MgsUpdateDriver["MgsUpdateDriver
(updates SPs, RoTs, etc.)"] + LiveSystem["Live System
(rack/racklette/a4x2/simulated)"] + + subgraph Nexus ["Nexus (real systems)"] + NexusUsesExecutor["Executor
(background task)"] + end + + %% Dev tools - right side: Executor tools + subgraph ExecutorTools ["Execution Tools (dev/test)"] + ExecUnsafe(["reconfigurator-exec-unsafe
(execute blueprints manually)"]) + SPUpdater(["reconfigurator-sp-updater
(updates SPs, RoTs, etc.)"]) + end + + NexusUsesExecutor --> |blueprint from database| Executor + ExecUnsafe -->|blueprint from file| Executor + + Executor -->|Modifies| LiveSystem + Executor --> |input: from blueprint| MgsUpdateDriver + MgsUpdateDriver --> |Modifies| LiveSystem + SPUpdater --> |input: REPL| MgsUpdateDriver + + %% Styling + style Nexus fill:#c8e6c9,stroke:#388e3c + style ExecutorTools fill:#f3e5f5,stroke:#8e24aa + style Executor fill:#ffe0b2,stroke:#fb8c00 + style MgsUpdateDriver fill:#ffe0b2,stroke:#fb8c00 +``` + +// XXX-dap fix diagrams not working well in dark mode + +== Key Rust packages + +Below are some of the most important Rust packages to know about. This is not a complete list. + +.Key Rust packages used in Reconfigurator +[cols="1,1,3",options="header"] +|=== +h|Area +|Omicron repo path (Rust package) +|Description + +.2+|Not Reconfigurator-specific + +|`nexus/types` (`nexus_types`) +|Very widely-used package containing types common to many parts of Nexus (shared with lots of components that are used within Nexus, but aren't generally aware of the rest of Nexus). (Not Reconfigurator-specific.) + +|`nexus/db-model`/`nexus/db-queries` (`nexus_db_model`, `nexus_db_queries`) +|Everything related to the control plane database: Rust types representing the database schema itself, types that model the various tables in the database, and implementations of database queries that fetch/insert/update/delete data in the database. (Not Reconfigurator-specific.) + +.4+|Reconfigurator (and Reconfigurator-adjacent) + +|`nexus/inventory` (`nexus_inventory`) +|Inventory subsystem. Collects information from the whole system about its current state, stores it in the database, and makes it available to the rest of Nexus. Inventory collection is driven periodically and on-demand by a Nexus background task that just calls into this package. + +|`nexus/reconfigurator/planning` (`nexus_reconfigurator_planning`) +|Implementation of the planner. 
Currently, this is driven only by explicit calls to the Nexus internal API, which in turn come from a person running `omdb`. In the medium term, this will be driven periodically and on-demand by a Nexus background task. + +|`nexus/reconfigurator/execution` (`nexus_reconfigurator_execution`) +|Implementation of blueprint execution. Blueprint execution is driven periodically and on-demand by a Nexus background task that just calls into this package. + +|`nexus/mgs-updates` (`nexus_mgs_updates`) +a|Implementation of software update for components that are updated through Management Gateway Service (MGS) and the service processor (SP). This includes the service processor Hubris image, the root of trust Hubris image, the root of trust bootloader, and phase 1 of the host operating system (the part that's stored in flash). + +This is used as part of execution. +|=== + +== Developer tools + +.Key developer tools for working on Reconfigurator +[cols="1,1,1,3",options="header"] +|=== +h|Area +|Tool +|Omicron repo path +|Description + +.4+|Reconfigurator-specific +|`reconfigurator-cli` +|`dev-tools/reconfigurator-cli` +|Directly edit blueprints or run the planner in-memory. Can import state from real systems and export blueprints back to real systems. Essential tool for observing and testing planner behavior and for generating blueprints that a real system might not otherwise do. This in turn is useful for development and for operational emergencies. + +|`reconfigurator-exec-unsafe` +|`dev-tools/reconfigurator-exec-unsafe` +|Directly execute blueprints against a live system (outside the context of Nexus). The main use of this tool is to be able to precisely control blueprint execution (usually for testing) and to be able to execute blueprints whose JSON representation does not match the database representation (common while features are under development, but never expected in a real system). 
+ +|`reconfigurator-sp-updater` +|`dev-tools/reconfigurator-sp-updater` +|Directly runs Reconfigurator-style updates of MGS/SP-managed software. This is used for development and testing of `nexus_mgs_updates` without having to create blueprints or go through real blueprint execution. + +|`repo-depot-standalone` +|`dev-tools/repo-depot-standalone` +a|Standalone command line tool for serving the Repo Depot API (which serves TUF repo artifacts over HTTP) from any TUF repository in your local filesystem. ++ +This is especially useful with `reconfigurator-sp-updater`. + +.2+|Non-Reconfigurator-specific (general tools) +|`omdb` +|`dev-tools/omdb` +a|`omdb` is a general tool for inspecting and controlling various Omicron components. + +* You can control blueprint planning and execution with `omdb nexus blueprints`. +* You can monitor blueprint execution with `omdb nexus background-tasks show blueprint_executor`. +* You can view database state with `omdb db` (e.g., `omdb db inventory collections show latest`). + +|`cargo xtask omicron-dev run-all` +|`dev-tools/omicron-dev` +|Stands up a whole control plane using simulated sled agents. This is by far the quickest and simplest way to test quite a lot of the system, but of course has limitations on what it's able to simulate. + +|=== + +Here's a conceptual map of components involved in planning and execution and the tools you can use to work on them directly: + +```mermaid +graph TD + Planner["Planner / Blueprint Editor"] + subgraph Nexus ["Nexus (real systems)"] + NexusUsesPlanner["Planner
(background task)
(eventually)"] + NexusUsesExecutor["Executor
(background task)"] + end + + NexusUsesPlanner -->|blueprint:
from database| Planner + + subgraph PlannerTools ["Planner Tools (dev/test/support)"] + CLI(["reconfigurator-cli
(dev/test/support tool)"]) + end + CLI -->|"blueprint: synthetic (REPL) or loaded from a real system"| Planner + + style Nexus fill:#c8e6c9,stroke:#388e3c + style Planner fill:#ffe0b2,stroke:#fb8c00 + style PlannerTools fill:#f3e5f5,stroke:#8e24aa + + Executor["Executor"] + MgsUpdateDriver["MgsUpdateDriver
(updates SPs, RoTs, etc.)"] + LiveSystem["Live System
(rack/racklette/a4x2/simulated)"] + + %% Dev tools - right side: Executor tools + subgraph ExecutorTools ["Execution Tools (dev/test)"] + ExecUnsafe(["reconfigurator-exec-unsafe
(execute blueprints manually)"]) + SPUpdater(["reconfigurator-sp-updater
(updates SPs, RoTs, etc.)"]) + end + + NexusUsesExecutor --> |blueprint: from database| Executor + ExecUnsafe -->|blueprint: from file| Executor + + Executor -->|Modifies| LiveSystem + Executor --> |input: from blueprint| MgsUpdateDriver + MgsUpdateDriver --> |Modifies| LiveSystem + SPUpdater --> |input: REPL| MgsUpdateDriver + + + %% Styling + style Nexus fill:#c8e6c9,stroke:#388e3c + style ExecutorTools fill:#f3e5f5,stroke:#8e24aa + style Executor fill:#ffe0b2,stroke:#fb8c00 + style MgsUpdateDriver fill:#ffe0b2,stroke:#fb8c00 + +``` + +== Nexus background tasks + +Background operations in the control plane are driven by Nexus **background tasks**. See xref:../nexus/src/app/background/mod.rs[] for important background on the design of background tasks. Most importantly, the system has been designed to streamline writing background activities that: + +* correctly handle crashing in the middle of execution +* correctly handle being executed concurrently (in other Nexus instances) +* make their status observable +* can be activated on-demand by a developer or support technician + +Again, there's a lot more about this in the comment in the file linked above. + +**In general, the Rust module that implements the background task does almost nothing except call into an implementation that's in some other Rust package.** Generally, this approach: + +* Makes it easier to write comprehensive tests for the background task. That's because the background task abstraction itself is intentionally very opaque. It just has one `activate()` function. So to test it exhaustively, it's helpful to put the bulk of the implementation into something with a richer interface for control and observability. +* Makes it faster to iterate on the implementation because you need only run `cargo check`, `cargo nextest`, etc. on your implementation package, which usually won't require building and linking the rest of Nexus. 
By contrast, the background tasks themselves are part of Nexus so rebuilding them takes more time. + +Each background task has a fixed name (e.g., `blueprint_executor`). You can use `omdb nexus background-tasks` to list, activate, observe the status of background tasks. + +Here are the most important background tasks related to Reconfigurator: + +// XXX-dap working here + +.Key Reconfigurator-related background tasks +[cols="1h,4",options="header"] +|=== +|Task name +|Description + +|`inventory_collection` +|Fetches information about the current state of all hardware and software in the system (the whole rack) + +|`blueprint_executor` +|Executes the most recently loaded blueprint + +|`blueprint_loader` +|Loads the latest target blueprint from the database + +|`blueprint_rendezvous` +|Updates rendezvous tables based on the most recent target blueprint + +|`dns_config_internal`, `dns_servers_internal`, `dns_propagation_internal`, +`dns_config_external`, `dns_servers_external`, `dns_propagation_external` +|Drives the propagation of internal and external DNS. Configuration changes start in Nexus and get written to the database. Then these background tasks load the configuration (`dns_config_*`), load the list of servers to propagate it to (`dns_servers_*`), and propagate the config to the servers (`dns_propagation_*`). + +|`tuf_artifact_replication` +|Distributes all artifact files in all user-uploaded TUF repositories to all sleds + +|=== + +Many other tasks work with Reconfigurator, too (e.g., region replacement and region snapshot replacement). + +Notably absent from this list is anything related to planning. This has not been automated as a background task yet. + +== Testing and developer workflow + +// XXX-dap talk about developer workflow: "inner loop", etc. +// XXX-dap talk about testing and test environments, live tests, etc. 
+ + + +// XXX-dap task: generate a new blueprint using the planner +// XXX-dap task: export reconfigurator state +// XXX-dap task: generate a new blueprint using reconfigurator-cli +// XXX-dap task: import blueprint +// XXX-dap task: execute blueprint (via Nexus) +// XXX-dap task: monitor blueprint execution +// XXX-dap task: previewing what changes a blueprint will make + + + +// XXX-dap diagram showing: +// - planner creates blueprints and stores them into database +// - user can import blueprints with reconfigurator-cli +// - execution reads blueprints diff --git a/docs/reconfigurator.adoc b/docs/reconfigurator.adoc index 0aa7ce78cc3..d623e85f997 100644 --- a/docs/reconfigurator.adoc +++ b/docs/reconfigurator.adoc @@ -4,6 +4,10 @@ = Reconfigurator +This document gives a first-principles overview of Reconfigurator. + +**For practical tips for working on Reconfigurator, see the xref:reconfigurator-dev-guide.adoc[Reconfigurator Developer Guide].** + == Introduction **Reconfigurator** is a system within Nexus for carrying out all kinds of system changes in a controlled way. Examples of what Reconfigurator can do today or that we plan to extend it to do in the future: From 432041b4c848a85a3b30d3a4e84af5b4c88a1a3c Mon Sep 17 00:00:00 2001 From: David Pacheco Date: Thu, 22 May 2025 15:09:22 -0700 Subject: [PATCH 02/10] WIP --- docs/reconfigurator-dev-guide.adoc | 211 ++++++++++++++++++++++++++++- 1 file changed, 205 insertions(+), 6 deletions(-) diff --git a/docs/reconfigurator-dev-guide.adoc b/docs/reconfigurator-dev-guide.adoc index 713c8553886..99e6d427020 100644 --- a/docs/reconfigurator-dev-guide.adoc +++ b/docs/reconfigurator-dev-guide.adoc @@ -208,8 +208,6 @@ Each background task has a fixed name (e.g., `blueprint_executor`). 
You can use Here are the most important background tasks related to Reconfigurator: -// XXX-dap working here - .Key Reconfigurator-related background tasks [cols="1h,4",options="header"] |=== @@ -243,9 +241,206 @@ Notably absent from this list is anything related to planning. This has not bee == Testing and developer workflow -// XXX-dap talk about developer workflow: "inner loop", etc. -// XXX-dap talk about testing and test environments, live tests, etc. +There are a bunch of different environments that you can set up and use to test Omicron. + +.Kinds of Omicron test environments +[cols="1,2,2a,2a,2a",options="header"] +|=== +|Name +|Summary +|Pros +|Good for +|Limitations + +|xref:how-to-run-simulated.adoc[`cargo xtask omicron-dev run-all`] +|Command-line tool that stands up real instances of much of the control plane locally (in-process and child processes): Nexus, CockroachDB, Clickhouse, Management Gateway Service, Oximeter, Crucible Pantry. Limitations result from using simulated sled agent, simulated service processors, and loopback networking. +* Easy (one command), quick (starts in ~10s) +* Fast to iterate (rebuilds in a minute or two, depending on what component you're changing) +* Exactly matches the environment provided to Nexus integration tests (so it can be useful for developing and debugging these tests). +| +* Nexus internal/external API changes +* Most of development for anything that can be simulated (e.g., inventory, most parts of execution) +* `omdb`-only changes +| +* Simulated sled agent has many limitations: cannot run VMs, does not simulate the actual control plane components that it pretends to run, no simulation of Crucible storage, etc. 
+* Simulated SPs have limited fidelity to the real thing (e.g., resetting SP will not simulate reset of the sled, even though a real one would) +* No Wicket, no full RSS path +* No meaningful simulation of networking (so can't be used to test behavior of underlay connectivity, external connectivity, configuring Dendrite, etc.) + +|https://github.com/oxidecomputer/testbed/tree/main/a4x2[`a4x2`] +|Uses VMs, fancy local networking config, and a software-based switch (https://github.com/oxidecomputer/softnpu[softnpu]) to create a multi-sled environment that looks much more realistic to the control plane than `omicron-dev run-all`. +| +* Much higher fidelity to real systems than `omicron-dev run-all`: +** most components' environments look largely like a real system (e.g., run in a zone, using the SMF start methods) +** softnpu implements the same (runtime-configurable) networking behavior that real switches do +** real sled agent runs real instances of all components except simulated networking (which is full-fidelity) and simulated service processors +| +* More time required up front to get started (may need beefier dev machine) +* Somewhat bumpy developer experience (see README) +* Longer iteration time (rebuild and redeploy takes ~30-60 minutes) +* Limitations in fidelity: +** Cannot run instances (sleds are running in VMs and we don't support nested virt) +** Service processors are simulated (just like `omicron-dev run-all`) + +|xref:how-to-run.adoc[`Running non-simulated Omicron on a single system`] +|Runs real Sled Agent and all other components directly on your dev system the same way they'd run on a real system +| +* Moderate iteration time (rebuild and redeploy could take minutes, depending on what you're changing) +* Could support running VMs +| ? 
| +* "Takes over" your dev system -- does not clearly delineate what global state it's responsible for, nor does it provide a way to clean it all up +* Somewhat brittle (e.g., after reboot, SMF service for sled agent may start but not find the files it needs) +* Limitations in fidelity: +** Only one sled +** No service processors +** Networking simulation is incomplete (connectivity depends on how your dev system is set up) + +|Racklette +|Real Oxide hardware (sleds and switches), essentially indistinguishable from a real Oxide rack +|Everything. Worthwhile for: +* any testing involving real "customer" VMs +* final smoke testing for work developed with simulated components +| +* Very limited, shared resource +|=== + +https://github.com/oxidecomputer/omicron/pull/7424[Work is ongoing] to add `cargo xtask` commands for launching an a4x2 environment. This would significantly streamline the process of using a4x2 and also make it possible to use a4x2 in CI. + +A common development workflow is: + +* "inner loop" as you work on code: run `cargo check` +* some combination of: +** use `cargo xtask omicron-dev run-all` and various developer tools to test it out +** add unit tests run with `cargo nextest run` +* once things are working, test end-to-end on a4x2 (if that's faithful enough) or a racklette + +== Automated testing + +Broadly, we have several kinds of tests: + +* Various levels of unit test and small-scale integration tests for most components, including the planner, execution, etc. The integration tests use an environment identical to `cargo xtask omicron-dev run-all`. +* For testing the planner and blueprint builder: we have `reconfigurator-cli` _scripts_ that run a bunch of commands, print the contents of blueprints and diffs between blueprints, and verify that these look like we expect. +* Omicron CI runs xref:../end-to-end-tests["end-to-end"] tests in the "Running non-simulated Omicron on a single system" environment. 
+* We have a small number of xref:../live-tests["live tests"] that can be run on-demand in a4x2 or a racklette that exercise behavior that can't currently be tested in CI. + +The https://github.com/oxidecomputer/omicron/pull/7424[ongoing work mentioned above] will make it possible to run the live tests in a4x2 in CI. + +== Updates for SPs, RoTs, etc. +Updates for the following components get lumped together: + +* service processor Hubris image +* root of trust Hubris image +* root of trust bootloader Hubris image +* host OS phase 1 image + +That's because all of these are managed by the service processor (SP). They all follow a similar flow. The control plane talks to SPs through Management Gateway Service, so we often call these MGS-managed updates or just "MGS Updates" (or sometimes "SP-managed updates"). + +There are a few ways to update SPs and their associated components: + +* via Wicket, which uses MGS to deploy an artifact from the TUF repo. This is the way we update most systems in development and production today. Since you're supplying the TUF repo, Wicket is doing the work to figure out which artifact is appropriate for the hardware being updated. +* via `faux-mgs`, which talks directly to the SP and deploys an image directly from a file you give it. Since you're giving it the specific file to use, you do the work of figuring out what that should be (e.g., picking which artifact from a TUF repo is appropriate for the hardware you're updating). Updating with `faux-mgs` is outside the scope of this document but there's some information and links below on how to do this. +* via `humility` or other low-level tools (outside the scope of this document) +* "Reconfigurator-driven": what this section is about. + +"Reconfigurator-driven" means that we're using `nexus_mgs_updates` to perform the update. 
That implementation is designed to support: + +* updating to software images stored in a TUF repository +* resuming after crashing at any point +* executing concurrently (in different Nexus instances) + +The easiest way to test Reconfigurator-driven updates is using `reconfigurator-sp-updater` (more on this below). You can also use `reconfigurator-cli` to generate a blueprint that specifies an MGS-managed update and then use `reconfigurator-exec-unsafe` to execute it. This is more cumbersome but tests the integration of `nexus_mgs_updates` into blueprint execution. (That's pretty simple and tested at this point so this is probably not a very useful flow unless something is broken.) Eventually, you'll be able to test these updates through normal, Nexus-driven blueprint execution. This is blocked on database support for the parts of blueprints that specify MGS-managed updates. + +Regardless of how you perform updates, it's useful to use `faux-mgs` to read the ground truth state from the SP about its configuration (what versions are in each slot and which slots are active). More on this below. + +=== Task: manually performing Reconfigurator-driven update of SPs and RoTs + +. Decide what software you want to deploy. This must be packaged in a TUF repository. ++ +If you're just testing the update process and don't care what you're deploying, you can use one generated by the CI process from any commit on "main". +. Figure out which artifact within the TUF repository you need to use for your hardware. ++ +In all cases, you can either look at the metadata in the unpacked TUF repo (`jq < repo/targets/*.artifacts.json`) or just look at the filenames of the artifacts (`ls repo/targets`). +** For service processors: the image should reflect the type of board you're updating (`kind` should include `switch` or `gimlet` or `psc`) +. Use `repo-depot-standalone` to serve the TUF repo depot API backed by the TUF repo you want to use. +. Use `reconfigurator-sp-updater` to perform the update. 
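To make the artifact-selection step concrete, here's a runnable sketch of the filename-based approach. The filenames below are fabricated stand-ins (real names come from `repo/targets` in an actual unpacked TUF repo); only the idea of matching the board kind in the artifact name comes from this guide:

```shell
# Sketch: pick an SP artifact by board kind from an unpacked TUF repo's
# targets directory. All filenames here are made-up stand-ins.
REPO=$(mktemp -d)
mkdir -p "$REPO/targets"
touch "$REPO/targets/example-gimlet-sp.tar.gz" \
      "$REPO/targets/example-switch-sp.tar.gz" \
      "$REPO/targets/example-psc-sp.tar.gz"
BOARD_KIND=gimlet   # "switch" for a sidecar, "psc" for a power shelf controller
ls "$REPO/targets" | grep "$BOARD_KIND"
# prints: example-gimlet-sp.tar.gz
rm -rf "$REPO"
```

On a real repo you'd confirm the choice against the metadata (`jq < repo/targets/*.artifacts.json`), as described above, rather than trusting filenames alone.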
+ +// XXX-dap working here +// XXX-dap create separate task sections for the different pieces here + +=== Setting up `faux-mgs` + +https://github.com/oxidecomputer/management-gateway-service/tree/main/faux-mgs[`faux-mgs`] is a command-line tool that talks directly to SPs (without using MGS). For Omicron developers, it's the lowest level tool we usually need to directly inspect SP state and issue commands to the SP. + +This tool is most useful for: + +* directly inspecting the current SP state (while debugging or learning) +* manually performing SP-managed updates as part of understanding how they work + +To use: first clone the above repo and build with `cargo build --bin=faux-mgs`. + +For racklettes: copy this binary to the switch zone and run it from there. Use `faux-mgs --interface gimlet14 ...` to use it against the SP for sled 14 (just as an example). Use `dladm show-vlan` in the switch zone to see what other interfaces exist to talk to switches, PSCs, etc. + +For a4x2: copy this binary to the switch zone and run it from there. You'll need to find the IP and ports of the simulated SPs running in this zone. TODO how do you do that? + +For `omicron-dev run-all`, you can run this command from the same system where you're running `omicron-dev`. Instead of `--interface`, you need to use the `--sp-sim-addr IPV6_ADDR:PORT` option to point `faux-mgs` at the simulated SP. Unfortunately, the easiest way to find the address and port of the simulated SP is in the log file whose path is printed out by `omicron-dev run-all`. + +--- + +However you get `faux-mgs` running, you can use it to inspect state and https://github.com/oxidecomputer/meta/blob/master/engineering/mupdate/manual-rot-sp-updates.adoc[perform updates by hand]. (If you follow those linked instructions, note that they use `pilot sp exec -e CMD SERIAL`. This is a thin wrapper that finds the right interface for the host with serial `SERIAL` and then runs `faux-mgs --interface INTERFACE CMD`. 
You can just do this transformation yourself.) + +The most useful commands for inspecting state are: + +* `faux-mgs ... state`: summarizes the SP and RoT information +* `faux-mgs ... update-status`: reports whether any SP-managed update is in progress +* `faux-mgs ... read-component-caboose`: reports one piece of metadata about the software in a particular firmware slot. You need to specify the component (e.g., `sp` or `rot`), the slot (e.g., `0` or `1`), and the key (`VERS` for version, `SIGN` for a hash of the signing key, etc.) + +Also useful are: + +* `faux-mgs ... reset`: resets a component (SP, RoT, etc.) +* `faux-mgs ... update`: uploads a new software image for a particular component (SP, RoT, etc.) slot + +=== Using `omdb` to read inventory + +The system inventory includes all the information we need about SPs and what software they're running. You can print this with: + +``` +$ omdb db inventory collections show latest sp +... + +Sled SimGimlet00 + part number: i86pc + power: A2 + revision: 0 + MGS slot: Sled 0 (cubby 0) + found at: 2025-05-22 21:49:54.267308 UTC from http://[::1]:63421 + cabooses: + SLOT BOARD NAME VERSION GIT_COMMIT + SpSlot0 SimGimletSp SimGimlet 0.0.2 ffffffff + SpSlot1 SimGimletSp SimGimlet 0.0.1 fefefefe + RotSlotA SimRot SimGimletRot 0.0.4 eeeeeeee + RotSlotB SimRot SimGimletRot 0.0.3 edededed + Stage0 SimRotStage0 SimGimletRot 0.0.200 ddddddddd + Stage0Next SimRotStage0 SimGimletRot 0.0.200 dadadadad + RoT pages: + SLOT DATA_BASE64 + Cmpa Z2ltbGV0LWNtcGEAAAAAAAAAAAAAAAAA... + CfpaActive Z2ltbGV0LWNmcGEtYWN0aXZlAAAAAAAA... + CfpaInactive Z2ltbGV0LWNmcGEtaW5hY3RpdmUAAAAA... + CfpaScratch Z2ltbGV0LWNmcGEtc2NyYXRjaAAAAAAA... 
+ RoT: active slot: slot A + RoT: persistent boot preference: slot A + RoT: pending persistent boot preference: - + RoT: transient boot preference: - + RoT: slot A SHA3-256: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa + RoT: slot B SHA3-256: bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb + +... +``` + +// XXX-dap link to section showing how to trigger an inventory collection, which itself should link to a section on waiting for it to complete. +This is a handy summary, but it only gets updated when inventory is collected. This is more cumbersome than `faux-mgs` when you only need to get one piece of information and need it to be up-to-date. // XXX-dap task: generate a new blueprint using the planner @@ -255,8 +450,12 @@ Notably absent from this list is anything related to planning. This has not bee // XXX-dap task: execute blueprint (via Nexus) // XXX-dap task: monitor blueprint execution // XXX-dap task: previewing what changes a blueprint will make - - +// XXX-dap task: trigger inventory collection +// XXX-dap task: wait for inventory collection to complete +// XXX-dap task: download a TUF repo from CI (and link this where we do it above) +// XXX-dap task: figure out which SP image you need +// XXX-dap task: figure out which RoT image you need from a TUF repo +// XXX-dap task: serve a local, unpacked TUF repo via repo-depot-API // XXX-dap diagram showing: // - planner creates blueprints and stores them into database From 2fa38e3c88364fc89ce32e5b241b0f1622bd6abe Mon Sep 17 00:00:00 2001 From: David Pacheco Date: Fri, 23 May 2025 09:32:08 -0700 Subject: [PATCH 03/10] WIP: about to try to make this more coherent --- docs/reconfigurator-dev-guide.adoc | 118 ++++++++++++++++++++++++++--- 1 file changed, 106 insertions(+), 12 deletions(-) diff --git a/docs/reconfigurator-dev-guide.adoc b/docs/reconfigurator-dev-guide.adoc index 99e6d427020..2474b6148a2 100644 --- a/docs/reconfigurator-dev-guide.adoc +++ 
b/docs/reconfigurator-dev-guide.adoc @@ -254,6 +254,7 @@ There are a bunch of different environments that you can set up and use to test |xref:how-to-run-simulated.adoc[`cargo xtask omicron-dev run-all`] |Command-line tool that stands up real instances of much of the control plane locally (in-process and child processes): Nexus, CockroachDB, Clickhouse, Management Gateway Service, Oximeter, Crucible Pantry. Limitations result from using simulated sled agent, simulated service processors, and loopback networking. +| * Easy (one command), quick (starts in ~10s) * Fast to iterate (rebuilds in a minute or two, depending on what component you're changing) * Exactly matches the environment provided to Nexus integration tests (so it can be useful for developing and debugging these tests). @@ -275,6 +276,8 @@ There are a bunch of different environments that you can set up and use to test ** softnpu implements the same (runtime-configurable) networking behavior that real switches do ** real sled agent runs real instances of all components except simulated networking (which is full-fidelity) and simulated service processors | +* Testing that can't be done with `omicron-dev run-all` +| * More time required up front to get started (may need beefier dev machine) * Somewhat bumpy developer experience (see README) * Longer iteration time (rebuild and redeploy takes ~30-60 minutes) @@ -282,7 +285,7 @@ There are a bunch of different environments that you can set up and use to test ** Cannot run instances (sleds are running in VMs and we don't support nested virt) ** Service processors are simulated (just like `omicron-dev run-all`) -|xref:how-to-run.adoc[`Running non-simulated Omicron on a single system`] +|xref:how-to-run.adoc[Running non-simulated Omicron on a single system] |Runs real Sled Agent and all other components directly on your dev system the same way they'd run on a real system | * Moderate iteration time (rebuild and redeploy could take minutes, depending on what 
you're changing) @@ -354,22 +357,113 @@ The easiest way to test Reconfigurator-driven updates is using `reconfigurator-s Regardless of how you perform updates, it's useful to use `faux-mgs` to read the ground truth state from the SP about its configuration (what versions are in each slot and which slots are active). More on this below. +[task-download-TUF-repo] +=== Task: downloading a TUF repo from CI + +Reconfigurator-driven updates always use artifacts from a TUF repository. TUF repos are built from each Omicron commit, both on "main" and pull request branches. + +First, decide the commit you want to use. We'll call that `OMICRON_COMMIT`. If you don't care all that much (because you're just testing update itself, not the image that you're deploying), just https://github.com/oxidecomputer/omicron/commits/main/[list the recent commits to "main"] and pick the latest one that has passed all CI checks. + +For our example, we'll use `OMICRON_COMMIT=630cc10930c448ce5c3e92b65be3a66ed73bbb64`: + +``` +$ OMICRON_COMMIT=630cc10930c448ce5c3e92b65be3a66ed73bbb64 +``` + +Check that its TUF repo build job completed by visiting `https://github.com/oxidecomputer/omicron/commit/OMICRON_COMMIT`. Just below the title, where it says who authored the commit, there should be a green checkmark showing that all CI jobs passed. If you see a green checkmark here, you should be set. If not, some jobs failed. You can click the icon to see the list of checks run and see if the "build TUF repo" one passed or not. If it didn't, pick another commit. + +Now, construct the download URL like this: + +``` +$ TUF_REPO_DOWNLOAD_URL=https://buildomat.eng.oxide.computer/public/file/oxidecomputer/omicron/rot-all/$OMICRON_COMMIT/repo.zip +``` + +Now `cd` to the directory you want to download the TUF repo to. You should have at least 4-5 GiB of free disk space (enough for the zipped and unzipped copies of the TUF repo). 
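If you're not sure how much room you have, a quick check (plain `df`, nothing Omicron-specific; look at the `Avail` column for the filesystem containing the current directory):

```
$ df -h .
```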
We'll create a directory named for the commit: + +``` +$ mkdir $OMICRON_COMMIT +$ cd $OMICRON_COMMIT +``` + +Download the repo with: + +``` +$ curl -L -C - -O $TUF_REPO_DOWNLOAD_URL +``` + +If this gets interrupted, you can run the same command again to resume it where it left off. + +For some of the workflows here, you'll want an _unpacked_ TUF repo. You can unpack it with: + +``` +$ unzip FILENAME +``` + +This should create a directory called `repo` with subdirectories `metadata` and `targets`. + + +[task-trim-TUF-repo] +=== Task: trim TUF repo for shipping to a switch zone + +On a4x2 or a racklette, it's handy to run `reconfigurator-sp-updater` and `repo-depot-standalone` from the switch zone, with the TUF repo you're using also in the switch zone. But the switch zone generally doesn't have enough space for a full TUF repo. You can work around this by deleting some large artifacts that we don't need for our purposes. + +Prerequisite: you must already have an _unpacked_ TUF repo. You could <>. + +For testing SP updates, we don't need the host OS and control plane images, which are by far the largest files in the repo. You can delete them with: + +``` +$ rm -f repo/targets/*.host* repo/targets/*.trampoline-* repo/targets/*.control-plane-* +``` + +Then copy this directory tree over to the switch zone. + +[task-repo-depot-standalone] +=== Task: use `repo-depot-standalone` to serve artifacts from a local TUF repo + +Prerequisite: you must already have an _unpacked_ TUF repo. You could <>. + +We'll call UNPACKED_TUF_REPO_PATH the path to the "repo" directory that you get after unpacking it. 
+ +``` +$ UNPACKED_TUF_REPO_PATH=./repo +``` + +Now, over in a clone of Omicron, build the `repo-depot-standalone` tool in the usual way: + +``` +$ cargo build --bin=repo-depot-standalone +``` + +If you're using `omicron-dev run-all`, you can probably run `repo-depot-standalone` right in the repo as either `./target/debug/repo-depot-standalone` or `cargo run --bin=repo-depot-standalone`. If you're testing in a4x2 or on a racklette, you'll likely want to copy `repo-depot-standalone` _and_ the TUF repo (probably <>) to the switch zone in your system and do the rest there. + +One the system where you want to run `repo-depot-standalone`, and where the unpacked TUF repo is at `$UNPACKED_TUF_REPO_PATH`, you can now just run: + +``` +$ ./repo-depot-standalone $UNPACKED_TUF_REPO_PATH +May 22 23:05:52.057 INFO loaded Omicron TUF repository, path: /home/dap/tuf-repos/R12/repo +May 22 23:05:52.061 INFO listening, local_addr: [::]:64761 +``` + +As the log implies, this is now running a repo depot server on IPv6 localhost (`::`) port 64761. (If you want to specify a specific IP and port to listen on, which is useful so that the port doesn't keep changing on you, you can do that with the `--listen-addr` argument.) + === Task: manually performing Reconfigurator-driven update of SPs and RoTs -. Decide what software you want to deploy. This must be packaged in a TUF repository. -+ -If you're just testing update and don't care what you're deploying, you can use one generated by the CI process from any commit on "main". -2. Figure out which artifact within the TUF repository you need to use for your hardware. -+ -In all cases, you can either look at the metadata in the unpacked TUF repo (`jq < repo/targets/*.artifacts.json`) or just look at the filenames of the artifacts (`ls repo/targets`). -** For service processors: the image should reflect the type of board you're updating (`kind` should include `switch` or `gimlet` or `psc`) -3. 
Use `repo-depot-standalone` to serve the TUF repo depot API backed by the TUF repo you want to use. -4. Use `reconfigurator-sp-updater` to perform the update. +Prerequisite: you must have something serving the TUF repo depot API. Usually you will: + +* <> +* For a4x2 and racklettes: <> +* <> + +We'll call the IP:port where the repo depot API is being served `REPO_DEPOT_SOCKADDR`: + +``` +$ REPO_DEPOT_SOCKADDR=[::]:64761 +``` // XXX-dap working here -// XXX-dap create separate task sections for the different pieces here +// XXX-dap build and start reconfigurator-sp-updater and then use it -=== Setting up `faux-mgs` +=== Task: setting up `faux-mgs` https://github.com/oxidecomputer/management-gateway-service/tree/main/faux-mgs[`faux-mgs`] is a command-line tool that talks directly to SPs (without using MGS). For Omicron developers, it's the lowest level tool we usually need to directly inspect SP state and issue commands to the SP. From 26beb93befdcdd40224cfdfe70c5dcc9bf739051 Mon Sep 17 00:00:00 2001 From: David Pacheco Date: Fri, 23 May 2025 13:40:49 -0700 Subject: [PATCH 04/10] WIP: about to convert table to prose --- docs/reconfigurator-dev-guide.adoc | 410 +++++++++++++++++++++++++++-- 1 file changed, 385 insertions(+), 25 deletions(-) diff --git a/docs/reconfigurator-dev-guide.adoc b/docs/reconfigurator-dev-guide.adoc index 2474b6148a2..8609831de37 100644 --- a/docs/reconfigurator-dev-guide.adoc +++ b/docs/reconfigurator-dev-guide.adoc @@ -6,6 +6,8 @@ This document covers practical tips for working on Reconfigurator. For principles and design, see xref:reconfigurator.adoc[Reconfigurator Overview]. +NOTE: Documents like this tend to get out of date as the software evolves. If you notice errors, _please_ consider fixing it. If you're not sure how, reach out for help. + == Introduction Reconfigurator is a control plane subsystem that's responsible for runtime changes to the control plane. It's used to add, remove, and upgrade components. 
It's divided into two big pieces: @@ -357,10 +359,100 @@ The easiest way to test Reconfigurator-driven updates is using `reconfigurator-s Regardless of how you perform updates, it's useful to use `faux-mgs` to read the ground truth state from the SP about its configuration (what versions are in each slot and which slots are active). More on this below. -[task-download-TUF-repo] +[#task-testing-reconfigurator-driven-sp-updates] +=== Testing Reconfigurator-driven updates + +You can test Reconfigurator-driven updates using any of the test environments mentioned above (`omicron-dev run-all`, a4x2, or a racklette). However, the flow is a bit different in each case. + +With **`omicron-dev run-all`**, the flow is: + +. Build the binaries you need (by cloning the corresponding repo and using `cargo build --bin=BINARY`): +** `reconfigurator-sp-updater` (built from Omicron repo) +** `repo-depot-standalone` (built from Omicron repo) +** `faux-mgs` (built from `management-gateway-service` repo) +. Get _and unpack_ at least one TUF repository with images for **simulated** SPs (probably by <>). You'll want two different TUF repos if you want to be able to do multiple updates, switching between two different versions. +. Start `cargo xtask omicron-dev run-all`. +. <> +. <> backed by this TUF repository. +. <>. +. <>. + +With **a4x2**, the flow is: + +. Build the binaries you need (by cloning the corresponding repo and using `cargo build --bin=BINARY`): +** `reconfigurator-sp-updater` (built from Omicron repo) +** `repo-depot-standalone` (built from Omicron repo) +** `faux-mgs` (built from `management-gateway-service` repo) +. Get at least one TUF repository with images for **simulated** SPs (probably by <>). You'll want two different TUF repos if you want to be able to do multiple updates, switching between two different versions. +. Use `scp` to copy the TUF repository and the binaries to a switch zone in your a4x2. 
For example: ++ +``` +scp \ + my-tuf-repo.zip \ + omicron/target/debug/repo-depot-standalone \ + omicron/target/debug/reconfigurator-sp-updater \ + management-gateway-service/target/debug/faux-mgs \ + root@MY_A4X2_G0_GZ_IP:/zone/oxz_switch/root/root +``` +. <> +. From inside the switch zone: +.. Unpack the TUF repository (with `unzip`). +.. <> backed by this TUF repository. +.. <>. +.. <>. + +With a **racklette**, the flow is: + +. Build the binaries you need (by cloning the corresponding repo and using `cargo build --bin=BINARY`): +** `reconfigurator-sp-updater` (built from Omicron repo) +** `repo-depot-standalone` (built from Omicron repo) +** `faux-mgs` (built from `management-gateway-service` repo) +. Get at least one TUF repository with images for **real** SPs (probably by <> or using an official release one). You'll want two different TUF repos if you want to be able to do multiple updates, switching between two different versions. +. <> so that they will fit in the switch zone of your racklette. +. Use `scp` to copy the _trimmed_ TUF repository and the binaries to a switch zone in your racklette. For example: ++ +``` +scp \ + my-trimmed-tuf-repo.zip \ + omicron/target/debug/repo-depot-standalone \ + omicron/target/debug/reconfigurator-sp-updater \ + management-gateway-service/target/debug/faux-mgs \ + root@racklet_gz_ip:/zone/oxz_switch/root/root +``` +. <> +. From inside the switch zone: +.. Unpack the TUF repository (with `unzip`). +.. <> backed by this TUF repository. +.. <>. +.. <>. + + +These steps are described in sections below. + +[#task-build-fake-TUF-repo] +=== Task: build a TUF repo with images targeting simulated SPs + +The artifacts in TUF repos built by the Omicron build process do not work with simulated SPs. That's because simulated SPs report a different board type than real Oxide hardware. But you can easily build your own TUF repo with images that do work with simulated SPs. + +. You'll need a copy of the `tufaceous` binary. +.. 
Clone the https://github.com/oxidecomputer/tufaceous[tufaceous] repository. +.. Build with `cargo build --bin=tufaceous`. +. You'll need a TUF repository manifest that specifies that `tufaceous` should conjure up fake Hubris images for simulated SPs. There's one in the Omicron repo at xref:../update-common/manifests/fake.toml[]. +. Run: ++ +``` +$ tufaceous assemble update-common/manifests/fake.toml /var/tmp/my-fake-repo.zip +``` +. Confirm the contents of the repo: ++ +``` +$ zipinfo /var/tmp/my-fake-repo.zip +``` + +[#task-download-TUF-repo] === Task: downloading a TUF repo from CI -Reconfigurator-driven updates always use artifacts from a TUF repository. TUF repos are built from each Omicron commit, both on "main" and pull request branches. +To test Reconfigurator-driven updates of real SPs, you can use the artifacts from TUF repositories that are built with each Omicron commit on GitHub, including those on "main" and pull request branches. First, decide the commit you want to use. We'll call that `OMICRON_COMMIT`. If you don't care all that much (because you're just testing update itself, not the image that you're deploying), just https://github.com/oxidecomputer/omicron/commits/main/[list the recent commits to "main"] and pick the latest one that has passed all CI checks. @@ -391,7 +483,7 @@ Download the repo with: $ curl -L -C - -O $TUF_REPO_DOWNLOAD_URL ``` -If this gets interrupted, you can run the same command again to resume it where it left off. +Sometimes this download gets interrupted. If that happens, you can run the same command again to resume the download where it left off. For some of the workflows here, you'll want an _unpacked_ TUF repo. You can unpack it with: @@ -402,7 +494,7 @@ $ unzip FILENAME This should create a directory called `repo` with subdirectories `metadata` and `targets`. 
-[task-trim-TUF-repo] +[#task-trim-TUF-repo] === Task: trim TUF repo for shipping to a switch zone On a4x2 or a racklette, it's handy to run `reconfigurator-sp-updater` and `repo-depot-standalone` from the switch zone, with the TUF repo you're using also in the switch zone. But the switch zone generally doesn't have enough space for a full TUF repo. You can work around this by deleting some large artifacts that we don't need for our purposes. @@ -417,53 +509,320 @@ $ rm -f repo/targets/*.host* repo/targets/*.trampoline-* repo/targets/*.control- Then copy this directory tree over to the switch zone. -[task-repo-depot-standalone] -=== Task: use `repo-depot-standalone` to serve artifacts from a local TUF repo +[#task-decide-sp-artifact] +=== Task: decide which SP artifact you want to deploy -Prerequisite: you must already have an _unpacked_ TUF repo. You could <>. +Prerequisite: + +* You must have an unpacked TUF repo. -We'll call UNPACKED_TUF_REPO_PATH the path to the "repo" directory that you get after unpacking it. +You must first decide which SP you're going to update. With simulated SPs (`cargo xtask omicron-dev run-all` and a4x2), this choice doesn't matter much. With real hardware, it's a bigger deal because resetting the SP will reset the corresponding host. You don't want to update the SP for the host you're doing your testing from! +If you don't particularly care because you just want to test update itself, sled 15 is a good choice on a racklette (since it's not a Scrimlet) and `SimGimlet00` (the first sled) is a good choice in simulated deployments. + +Once you've picked an SP, you need to know what kind of board it is. + +* With real hardware, it will be a specific Gimlet revision (e.g., `gimlet-e`), Sidecar revision (`sidecar-c`), or PSC (e.g., `psc-c`). +* With simulated SPs, it will be `SimGimletSp` or `SimSidecarSp`. 
+ +Once you know which SP you're going to update, you can identify the board in one of two ways: + +* Using <> to view inventory, you want the value of the BOARD column for the `SpSlot0` caboose. (It will be the same for `SpSlot1`.) +* Using <>, you first need to figure out how to get `faux-mgs` to talk to the SP you care about (described in the linked section), and then you can use the `read-component-caboose` command, like this: ++ ``` -$ UNPACKED_TUF_REPO_PATH=./repo +$ faux-mgs --log-level warn --sp-sim-addr [::1]:42084 read-component-caboose --component sp --slot 0 BORD +SimGimletSp ``` -Now, over in a clone of Omicron, build the `repo-depot-standalone` tool in the usual way: +Finally, you need to find the artifact in your TUF repo that corresponds to the SP image for this type of board. Here's an example list of TUF repo artifacts: ``` -$ cargo build --bin=repo-depot-standalone +repo $ ls targets/ +005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236.gimlet_rot_bootloader-fake-gimlet-rot-bootloader-1.0.0.tar.gz +005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236.psc_rot_bootloader-fake-psc-rot-bootloader-1.0.0.tar.gz +005ea358f1cd316df42465b1e3a0334ea22cc0c0442cf9ddf9b42fbf49780236.switch_rot_bootloader-fake-switch-rot-bootloader-1.0.0.tar.gz +019d84b563f32a85467235d23142de2fff11eb4e70b18c9567a374af8aa2422b.control_plane-fake-control-plane-1.0.0.tar.gz +339cb54072f5f61b36377062e64e6e41f5491e5eccbf1caec637bfbf1ae069ac.psc_rot-fake-psc-rot-1.0.0.tar.gz +4cd56ec2380cbbbc1da842c44776e421bf0cb2362e22dd2ff65eb8cba337fe00.artifacts.json +64f911b96c7b2f08222d25c1a37f039173da7461897ec28d5850c986c9e29e50.trampoline-fake-trampoline-1.0.0.tar.gz +727d2cc5e0d4677940fb8a66156ab376f7485bde7e55963694913d94aa92d119.gimlet_rot-fake-gimlet-rot-1.0.0.tar.gz +727d2cc5e0d4677940fb8a66156ab376f7485bde7e55963694913d94aa92d119.switch_rot-fake-switch-rot-1.0.0.tar.gz 
+7e6667e646ad001b54c8365a3d309c03f89c59102723d38d01697ee8079fe670.gimlet_sp-fake-gimlet-sp-1.0.0.tar.gz +ab32ec86e942e1a16c8d43ea143cd80dd05a9639529d3569b1c24dfa2587ee74.switch_sp-fake-switch-sp-1.0.0.tar.gz +d51b8fd66c631346459725b8868d0614f0884dba05faec20fc0fdd334eb5d0fd.host-fake-host-1.0.0.tar.gz +f896cf5b19ca85864d470ad8587f980218bff3954e7f52bbd999699cd0f9635b.psc_sp-fake-psc-sp-1.0.0.tar.gz ``` -If you're using `omicron-dev run-all`, you can probably run `repo-depot-standalone` right in the repo as either `./target/debug/repo-depot-standalone` or `cargo run --bin=repo-depot-standalone`. If you're testing in a4x2 or on a racklette, you'll likely want to copy `repo-depot-standalone` _and_ the TUF repo (probably <>) to the switch zone in your system and do the rest there. +For the SP, we want an artifact whose name looks like `\*.*_sp*`. It's one of these: -One the system where you want to run `repo-depot-standalone`, and where the unpacked TUF repo is at `$UNPACKED_TUF_REPO_PATH`, you can now just run: +``` +repo $ ls targets/*.*_sp* +targets/7e6667e646ad001b54c8365a3d309c03f89c59102723d38d01697ee8079fe670.gimlet_sp-fake-gimlet-sp-1.0.0.tar.gz +targets/ab32ec86e942e1a16c8d43ea143cd80dd05a9639529d3569b1c24dfa2587ee74.switch_sp-fake-switch-sp-1.0.0.tar.gz +targets/f896cf5b19ca85864d470ad8587f980218bff3954e7f52bbd999699cd0f9635b.psc_sp-fake-psc-sp-1.0.0.tar.gz +``` +This is a TUF repo for simulated hardware. In that case, there's only one image for each type of board so it's pretty easy. 
A TUF repo for real hardware will look more like this: + +``` +repo $ ls targets/*.*_sp* +targets/48d00f59dacc27e8cbb3abcfff2a263d5dbd361fe018e1bf06fb936811cc2446.switch_sp-sidecar-b-1.0.32.tar.gz +targets/556dcf6416e6da79d49657c0cf77d02e286ba28dc481f92e87136c44b1e9f329.gimlet_sp-gimlet-f-1.0.32.tar.gz +targets/7576f5a13feefe75f6390c78666cc62ebef4b36d16959dc38141497ece21198b.psc_sp-psc-b-1.0.31.tar.gz +targets/7f6cf23a3cf26fe9c7a40a76d7e2be8a418723ef505786c8e41df89fd8d1f77e.gimlet_sp-gimlet-d-1.0.32.tar.gz +targets/90d483ff62ad16fb82d7e8831f222071dda4aba046fba1603b823555c6bb096e.switch_sp-sidecar-d-1.0.32.tar.gz +targets/9e53e5f408e9a0026955c31ae52d222ed192f098de57f24855e67fda114a4ed7.psc_sp-psc-c-1.0.31.tar.gz +targets/c9cb6c6d2b3fd9e198074b4160119caa21ca88632b218420a570725ffd0b8616.gimlet_sp-gimlet-e-1.0.32.tar.gz +targets/d761c7f19bb33c9250c847ce83ade57a137013b8497ffa81e4ded85014571dd0.gimlet_sp-gimlet-c-1.0.32.tar.gz +targets/e151c800331d0e20a9be15eecd1511dcd576f16bc5c4deebcf2d7bf48e77e0f6.switch_sp-sidecar-c-1.0.32.tar.gz +targets/f2fcb24dbb85a8be78235226fc95dd183250f75819bc813befdf5a166a72acd0.gimlet_sp-gimlet-b-1.0.32.tar.gz ``` -$ ./repo-depot-standalone $UNPACKED_TUF_REPO_PATH + +Find the one that matches your board (e.g., `gimlet-e`). + +In either case, the artifact id is the long shasum at the beginning of the filename. If you wanted the `gimlet-e` SP image, you'd use `c9cb6c6d2b3fd9e198074b4160119caa21ca88632b218420a570725ffd0b8616`. + +This document uses simulated images, and we'll update a simulated sled SP, so we'll choose `7e6667e646ad001b54c8365a3d309c03f89c59102723d38d01697ee8079fe670` from the output above. + +--- + +That's the quick-and-dirty way. The more precise way to work this out is: + +. Look at `targets/*.artifacts.json` and find the entry in the `artifacts` array having `kind` = `gimlet_sp` (or `switch_sp` or `psc_sp`, if you're updating a switch or PSC) and `name` matching your board. Note the `"target"` property. +. 
Find the file in `targets` whose suffix matches the `"target"` property. + +For example, in our case, the first entry in `artifacts` is the one that we want: + +```json +{ + "system_version": "1.0.0", + "artifacts": [ + { + "name": "fake-gimlet-sp", + "version": "1.0.0", + "kind": "gimlet_sp", + "target": "gimlet_sp-fake-gimlet-sp-1.0.0.tar.gz" + }, + ... +``` + +That tells us that we want `targets/*.gimlet_sp-fake-gimlet-sp-1.0.0.tar.gz`, which is `targets/7e6667e646ad001b54c8365a3d309c03f89c59102723d38d01697ee8079fe670.gimlet_sp-fake-gimlet-sp-1.0.0.tar.gz`, whose artifact id is `7e6667e646ad001b54c8365a3d309c03f89c59102723d38d01697ee8079fe670`. + +[#task-repo-depot-standalone] +=== Task: use `repo-depot-standalone` to serve artifacts from a local TUF repo + +Prerequisite: you must already have one or more _unpacked_ TUF repos. See above for which ones to use. + +If you're testing with `cargo xtask omicron-dev run-all`, you can run `repo-depot-standalone` right in the repo. If you're testing on a4x2 or a racklette, you'll want to copy this binary (and the TUF repo(s)) to the switch zone. See above for more on this. + +Once you have the binary and unpacked TUF repo(s) where you want them, you just run the command with one or more paths to the "repo" directory in each unpacked TUF repo. We'll also use the `--listen-addr` argument to start it on a predictable port, but you can leave this off to pick any unused port: + +``` +$ ./repo-depot-standalone --listen-addr [::]:64761 /home/dap/tuf-repos/R12/repo May 22 23:05:52.057 INFO loaded Omicron TUF repository, path: /home/dap/tuf-repos/R12/repo May 22 23:05:52.061 INFO listening, local_addr: [::]:64761 ``` -As the log implies, this is now running a repo depot server on IPv6 localhost (`::`) port 64761. (If you want to specify a specific IP and port to listen on, which is useful so that the port doesn't keep changing on you, you can do that with the `--listen-addr` argument.) 
+As the log implies, this is now running a repo depot server on IPv6 localhost (`::`) port 64761. -=== Task: manually performing Reconfigurator-driven update of SPs and RoTs +[#task-start-sp-updater] +=== Task: start `reconfigurator-sp-updater` -Prerequisite: you must have something serving the TUF repo depot API. Usually you will: +Prerequisites: -* <> -* For a4x2 and racklettes: <> -* <> +* you must have something serving the TUF repo depot API (see above) +* you have a system running a DNS server and MGS that points at one or more SPs to update. This is usually `cargo xtask omicron-dev run-all`, a4x2, or a racklette. -We'll call the IP:port where the repo depot API is being served `REPO_DEPOT_SOCKADDR`: +In our example, we'll assume the repo depot server is running on `[::]:64761`. + +If you're using a4x2 or a racklette, you can start the updater with: ``` -$ REPO_DEPOT_SOCKADDR=[::]:64761 +$ reconfigurator-sp-updater [::1]:64761 ``` -// XXX-dap working here -// XXX-dap build and start reconfigurator-sp-updater and then use it +If you're using `omicron-dev run-all`, you'll also need the IP:port where the internal DNS server is running. This is printed out by `omicron-dev run-all`, which emits a line like this: + +``` +... +omicron-dev: internal DNS: [::1]:63673 +... +``` + +In this case, we'd say: + +``` +$ reconfigurator-sp-updater --dns-server [::1]:63673 [::1]:64761 + +``` + +Once `reconfigurator-sp-updater` starts, you'll get a REPL and can try an SP update. + +[#task-sp-update] +=== Task: Do an SP update + +Prerequisites: + +* you must already be running `reconfigurator-sp-updater` (see above) +* you must have already decided which SP to update and which artifact to deploy. See <>. Here, we're going to update `SimGimlet00` to artifact id `7e6667e646ad001b54c8365a3d309c03f89c59102723d38d01697ee8079fe670`. 
+ +In the `reconfigurator-sp-updater` REPL, you can use `help` to see what's available: + +``` +〉help +reconfigurator-sp-updater: interactively manage SP updates + +Usage: + +Commands: + config Show configured updates + status Show status of recent and in-progress updates + set Configure an update + delete Delete a configured update + help Print this message or the help of the given subcommand(s) +``` + +Initially, `config` will show no configured updates: + +``` +〉config +configured updates (0): + +``` + +and `status` will show nothing in progress or completed: + +``` +〉status +recent completed attempts: + +currently in progress: + +waiting for retry: + +``` + +In order to configure an SP update, you need to know what software is currently running on the SP. You can view this with `omdb`: + +``` +$ omdb --dns-server [::1]:63673 db inventory collections show latest sp +... +Sled SimGimlet00 + part number: i86pc + power: A2 + revision: 0 + MGS slot: Sled 0 (cubby 0) + found at: 2025-05-23 17:36:11.421897 UTC from http://[::1]:58672 + cabooses: + SLOT BOARD NAME VERSION GIT_COMMIT + SpSlot0 SimGimletSp SimGimlet 0.0.2 ffffffff + SpSlot1 SimGimletSp SimGimlet 0.0.1 fefefefe + RotSlotA SimRot SimGimletRot 0.0.4 eeeeeeee + RotSlotB SimRot SimGimletRot 0.0.3 edededed + Stage0 SimRotStage0 SimGimletRot 0.0.200 ddddddddd + Stage0Next SimRotStage0 SimGimletRot 0.0.200 dadadadad +... +``` + +That shows version 0.0.2 in the SP active slot (slot 0) and 0.0.1 in the SP inactive slot (slot 1). For more on using inventory like this, see <> -- note that this information is cached and will not necessarily show the right thing after you perform the update. 
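If you just want the SP slot versions out of the `omdb` output above, a little `awk` over the caboose table works (a convenience sketch; it assumes the column layout shown above, where the slot is column 1 and the version is column 4):

```
$ omdb --dns-server [::1]:63673 db inventory collections show latest sp \
    | awk '$1 == "SpSlot0" || $1 == "SpSlot1" { print $1, $4 }'
SpSlot0 0.0.2
SpSlot1 0.0.1
```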
+ +You can view the very latest state with `faux-mgs` (see <>): + +``` +$ faux-mgs --log-level warn --sp-sim-addr [::1]:42084 read-component-caboose --component sp --slot 0 VERS +0.0.2 +$ faux-mgs --log-level warn --sp-sim-addr [::1]:42084 read-component-caboose --component sp --slot 1 VERS +0.0.1 +``` + +Now we have enough information to configure an SP update: + +``` +〉help set +Configure an update + +Usage: set + +Commands: + sp + help Print this message or the help of the given subcommand(s) + +Arguments: + serial number to update + artifact hash id + version + +Options: + -h, --help Print help + +〉set SimGimlet00 7e6667e646ad001b54c8365a3d309c03f89c59102723d38d01697ee8079fe670 1.0.0 sp help +error: the following required arguments were not provided: + + +Usage: set sp + +For more information, try '--help'. + +〉set SimGimlet00 7e6667e646ad001b54c8365a3d309c03f89c59102723d38d01697ee8079fe670 1.0.0 sp 0.0.2 0.0.1 +updated configuration for SimGimlet00 +``` + +NOTE: You will immediately start seeing log messages from `nexus_mgs_updates` spewing to the console. This is ugly, but it's been convenient to be able to see these logs in real time. + +After a few seconds (20+ seconds on a racklette), you'd expect to see status like this: + +``` +〉status +recent completed attempts: + 2025-05-23T17:46:18.020Z to 2025-05-23T17:46:19.156Z (took 1s 135ms): serial SimGimlet00 + attempt#: 1 + version: 1.0.0 + hash: 7e6667e646ad001b54c8365a3d309c03f89c59102723d38d01697ee8079fe670 + result: Ok(CompletedUpdate) + +currently in progress: + +waiting for retry: + serial SimGimlet00: will try again at 2025-05-23 17:46:39.156210419 UTC (attempt 2) + +``` + +We can see that it successfully performed the update. + +All updates (even successful ones) are re-attempted after 20 seconds. 
So if you wait for another lap: + +``` +〉status +recent completed attempts: + 2025-05-23T17:46:18.020Z to 2025-05-23T17:46:19.156Z (took 1s 135ms): serial SimGimlet00 + attempt#: 1 + version: 1.0.0 + hash: 7e6667e646ad001b54c8365a3d309c03f89c59102723d38d01697ee8079fe670 + result: Ok(CompletedUpdate) + 2025-05-23T17:46:39.158Z to 2025-05-23T17:46:39.238Z (took 79ms): serial SimGimlet00 + attempt#: 2 + version: 1.0.0 + hash: 7e6667e646ad001b54c8365a3d309c03f89c59102723d38d01697ee8079fe670 + result: Ok(FoundNoChangesNeeded) + +currently in progress: + +waiting for retry: + serial SimGimlet00: will try again at 2025-05-23 17:46:59.238220447 UTC (attempt 3) +``` + +This time, it was able to tell that it didn't need to do anything. + +To stop trying, unconfigure the update: + +``` +〉delete SimGimlet00 +deleted configured update for serial SimGimlet00 +``` -=== Task: setting up `faux-mgs` +[#task-using-faux-mgs] +=== Using `faux-mgs` https://github.com/oxidecomputer/management-gateway-service/tree/main/faux-mgs[`faux-mgs`] is a command-line tool that talks directly to SPs (without using MGS). For Omicron developers, it's the lowest level tool we usually need to directly inspect SP state and issue commands to the SP. @@ -495,6 +854,7 @@ Also useful are: * `faux-mgs ... reset`: resets a component (SP, RoT, etc.) * `faux-mgs ... update`: uploads a new software image for a particular component (SP, RoT, etc.) slot +[#task-using-omdb] +=== Using `omdb` to read inventory The system inventory includes all the information we need about SPs and what software they're running. 
You can print this with: From 2015e52cdf4a490dc7ae543bb7bae86ef82639d1 Mon Sep 17 00:00:00 2001 From: David Pacheco Date: Fri, 23 May 2025 13:45:57 -0700 Subject: [PATCH 05/10] replace table with prose (about to revert) --- docs/reconfigurator-dev-guide.adoc | 78 ++++++++++++++++++------------ 1 file changed, 47 insertions(+), 31 deletions(-) diff --git a/docs/reconfigurator-dev-guide.adoc b/docs/reconfigurator-dev-guide.adoc index 8609831de37..7bcb2658c11 100644 --- a/docs/reconfigurator-dev-guide.adoc +++ b/docs/reconfigurator-dev-guide.adoc @@ -245,55 +245,67 @@ Notably absent from this list is anything related to planning. This has not bee There are a bunch of different environments that you can set up and use to test Omicron. -.Kinds of Omicron test environments -[cols="1,2,2a,2a,2a",options="header"] -|=== -|Name -|Summary -|Pros -|Good for -|Limitations - -|xref:how-to-run-simulated.adoc[`cargo xtask omicron-dev run-all`] -|Command-line tool that stands up real instances of much of the control plane locally (in-process and child processes): Nexus, CockroachDB, Clickhouse, Management Gateway Service, Oximeter, Crucible Pantry. Limitations result from using simulated sled agent, simulated service processors, and loopback networking. -| +=== xref:how-to-run-simulated.adoc[`cargo xtask omicron-dev run-all`] + +This is a command-line tool that stands up real instances of much of the control plane locally (in-process and child processes): Nexus, CockroachDB, Clickhouse, Management Gateway Service, Oximeter, Crucible Pantry. Limitations result from using simulated sled agent, simulated service processors, and loopback networking. + +Pros: + * Easy (one command), quick (starts in ~10s) * Fast to iterate (rebuilds in a minute or two, depending on what component you're changing) * Exactly matches the environment provided to Nexus integration tests (so it can be useful for developing and debugging these tests). 
-| + +Good for: + * Nexus internal/external API changes * Most of development for anything that can be simulated (e.g., inventory, most parts of execution) * `omdb`-only changes -| + +Limitations: + * Simulated sled agent has many limitations: cannot run VMs, does not simulate the actual control plane components that it pretends to run, no simulation of Crucible storage, etc. * Simulated SPs have limited fidelity to the real thing (e.g., resetting SP will not simulate reset of the sled, even though a real one would) * No Wicket, no full RSS path * No meaningful simulation of networking (so can't be used to test behavior of underlay connectivity, external connectivity, configuring Dendrite, etc.) -|https://github.com/oxidecomputer/testbed/tree/main/a4x2[`a4x2`] -|Uses VMs, fancy local networking config, and a software-based switch (https://github.com/oxidecomputer/softnpu[softnpu]) to create a multi-sled environment that looks much more realistic to the control plane than `omicron-dev run-all`. -| +=== https://github.com/oxidecomputer/testbed/tree/main/a4x2[`a4x2`] + +This uses VMs, fancy local networking config, and a software-based switch (https://github.com/oxidecomputer/softnpu[softnpu]) to create a multi-sled environment that looks much more realistic to the control plane than `omicron-dev run-all`. 
+
+Pros:
+
 * Much higher fidelity to real systems than `omicron-dev run-all`:
 ** most components' environments look largely like a real system (e.g., run in a zone, using the SMF start methods)
 ** softnpu implements the same (runtime-configurable) networking behavior that real switches do
 ** real sled agent runs real instances of all components except simulated networking (which is full-fidelity) and simulated service processors
-|
-* Testing that can't be done with `omicron-dev run-all`
-|
+
+Good for:
+
+* Testing that can't be done with `omicron-dev run-all` but does not require real hardware
+
+Limitations:
+
 * More time required up front to get started (may need beefier dev machine)
 * Somewhat bumpy developer experience (see README)
 * Longer iteration time (rebuild and redeploy takes ~30-60 minutes)
 * Limitations in fidelity:
 ** Cannot run instances (sleds are running in VMs and we don't support nested virt)
 ** Service processors are simulated (just like `omicron-dev run-all`)
+** Host OS is installed to physical disks, rather than running out of a ramdisk loaded from M2s like on real systems. System behavior after a reboot may differ from real deployments.
+
+https://github.com/oxidecomputer/omicron/pull/7424[Work is ongoing] to add `cargo xtask` commands for launching an a4x2 environment. This would significantly streamline the process of using a4x2 and also make it possible to use a4x2 in CI.
+
+=== xref:how-to-run.adoc[Running non-simulated Omicron on a single system]
+
+This runs real Sled Agent and all other components directly on your dev system the same way they'd run on a real system
+
+Pros:
 
-|xref:how-to-run.adoc[Running non-simulated Omicron on a single system]
-|Runs real Sled Agent and all other components directly on your dev system the same way they'd run on a real system
-|
 * Moderate iteration time (rebuild and redeploy could take minutes, depending on what you're changing)
 * Could support running VMs
-| ?
-|
+
+Limitations:
+
 * "Takes over" your dev system -- does not clearly delineate what global state it's responsible for and have a way to clean it all up
 * Somewhat brittle (e.g., after reboot, SMF service for sled agent may start but not find the files it needs)
 * Limitations in fidelity:
@@ -301,16 +313,20 @@ There are a bunch of different environments that you can set up and use to test
 ** No service processors
 ** Networking simulation is incomplete (connectivity depends on how your dev system is set up)
 
-|Racklette
-|Real Oxide hardware (sleds and switches), essentially indistinguishable from a real Oxide rack
-|Everything. Worthwhile for:
+=== Racklette
+
+This is Real Oxide hardware (sleds and switches), essentially indistinguishable from a real Oxide rack.
+
+Can be used to test just about everything, but particularly good for:
+
 * any testing involving real "customer" VMs
 * final smoke testing for work developed with simulated components
-|
+
+Limitations:
+
 * Very limited, shared resource
-|===
 
-https://github.com/oxidecomputer/omicron/pull/7424[Work is ongoing] to add `cargo xtask` commands for launching an a4x2 environment. This would significantly streamline the process of using a4x2 and also make it possible to use a4x2 in CI.
+=== Development workflow
 
 A common development workflow is:
 

From 9abd087b4255bb36404579693b4f10431d5be3aa Mon Sep 17 00:00:00 2001
From: David Pacheco
Date: Fri, 23 May 2025 13:45:58 -0700
Subject: [PATCH 06/10] Revert "replace table with prose (about to revert)"

This reverts commit 2015e52cdf4a490dc7ae543bb7bae86ef82639d1.
---
 docs/reconfigurator-dev-guide.adoc | 78 ++++++++++++------------
 1 file changed, 31 insertions(+), 47 deletions(-)

diff --git a/docs/reconfigurator-dev-guide.adoc b/docs/reconfigurator-dev-guide.adoc
index 7bcb2658c11..8609831de37 100644
--- a/docs/reconfigurator-dev-guide.adoc
+++ b/docs/reconfigurator-dev-guide.adoc
@@ -245,67 +245,55 @@ Notably absent from this list is anything related to planning. This has not bee
 
 There are a bunch of different environments that you can set up and use to test Omicron.
 
-=== xref:how-to-run-simulated.adoc[`cargo xtask omicron-dev run-all`]
-
-This is a command-line tool that stands up real instances of much of the control plane locally (in-process and child processes): Nexus, CockroachDB, Clickhouse, Management Gateway Service, Oximeter, Crucible Pantry. Limitations result from using simulated sled agent, simulated service processors, and loopback networking.
-
-Pros:
-
+.Kinds of Omicron test environments
+[cols="1,2,2a,2a,2a",options="header"]
+|===
+|Name
+|Summary
+|Pros
+|Good for
+|Limitations
+
+|xref:how-to-run-simulated.adoc[`cargo xtask omicron-dev run-all`]
+|Command-line tool that stands up real instances of much of the control plane locally (in-process and child processes): Nexus, CockroachDB, Clickhouse, Management Gateway Service, Oximeter, Crucible Pantry. Limitations result from using simulated sled agent, simulated service processors, and loopback networking.
+|
 * Easy (one command), quick (starts in ~10s)
 * Fast to iterate (rebuilds in a minute or two, depending on what component you're changing)
 * Exactly matches the environment provided to Nexus integration tests (so it can be useful for developing and debugging these tests).
-
-Good for:
-
+|
 * Nexus internal/external API changes
 * Most of development for anything that can be simulated (e.g., inventory, most parts of execution)
 * `omdb`-only changes
-
-Limitations:
-
+|
 * Simulated sled agent has many limitations: cannot run VMs, does not simulate the actual control plane components that it pretends to run, no simulation of Crucible storage, etc.
 * Simulated SPs have limited fidelity to the real thing (e.g., resetting SP will not simulate reset of the sled, even though a real one would)
 * No Wicket, no full RSS path
 * No meaningful simulation of networking (so can't be used to test behavior of underlay connectivity, external connectivity, configuring Dendrite, etc.)
 
-=== https://github.com/oxidecomputer/testbed/tree/main/a4x2[`a4x2`]
-
-This uses VMs, fancy local networking config, and a software-based switch (https://github.com/oxidecomputer/softnpu[softnpu]) to create a multi-sled environment that looks much more realistic to the control plane than `omicron-dev run-all`.
-
-Pros:
-
+|https://github.com/oxidecomputer/testbed/tree/main/a4x2[`a4x2`]
+|Uses VMs, fancy local networking config, and a software-based switch (https://github.com/oxidecomputer/softnpu[softnpu]) to create a multi-sled environment that looks much more realistic to the control plane than `omicron-dev run-all`.
+|
 * Much higher fidelity to real systems than `omicron-dev run-all`:
 ** most components' environments look largely like a real system (e.g., run in a zone, using the SMF start methods)
 ** softnpu implements the same (runtime-configurable) networking behavior that real switches do
 ** real sled agent runs real instances of all components except simulated networking (which is full-fidelity) and simulated service processors
-
-Good for:
-
-* Testing that can't be done with `omicron-dev run-all` but does not require real hardware
-
-Limitations:
-
+|
+* Testing that can't be done with `omicron-dev run-all`
+|
 * More time required up front to get started (may need beefier dev machine)
 * Somewhat bumpy developer experience (see README)
 * Longer iteration time (rebuild and redeploy takes ~30-60 minutes)
 * Limitations in fidelity:
 ** Cannot run instances (sleds are running in VMs and we don't support nested virt)
 ** Service processors are simulated (just like `omicron-dev run-all`)
-** Host OS is installed to physical disks, rather than running out of a ramdisk loaded from M2s like on real systems. System behavior after a reboot may differ from real deployments.
-
-https://github.com/oxidecomputer/omicron/pull/7424[Work is ongoing] to add `cargo xtask` commands for launching an a4x2 environment. This would significantly streamline the process of using a4x2 and also make it possible to use a4x2 in CI.
-
-=== xref:how-to-run.adoc[Running non-simulated Omicron on a single system]
-
-This runs real Sled Agent and all other components directly on your dev system the same way they'd run on a real system
-
-Pros:
 
+|xref:how-to-run.adoc[Running non-simulated Omicron on a single system]
+|Runs real Sled Agent and all other components directly on your dev system the same way they'd run on a real system
+|
 * Moderate iteration time (rebuild and redeploy could take minutes, depending on what you're changing)
 * Could support running VMs
-
-Limitations:
-
+| ?
+|
 * "Takes over" your dev system -- does not clearly delineate what global state it's responsible for and have a way to clean it all up
 * Somewhat brittle (e.g., after reboot, SMF service for sled agent may start but not find the files it needs)
 * Limitations in fidelity:
@@ -313,20 +301,16 @@ Limitations:
 ** No service processors
 ** Networking simulation is incomplete (connectivity depends on how your dev system is set up)
 
-=== Racklette
-
-This is Real Oxide hardware (sleds and switches), essentially indistinguishable from a real Oxide rack.
-
-Can be used to test just about everything, but particularly good for:
-
+|Racklette
+|Real Oxide hardware (sleds and switches), essentially indistinguishable from a real Oxide rack
+|Everything. Worthwhile for:
 * any testing involving real "customer" VMs
 * final smoke testing for work developed with simulated components
-
-Limitations:
-
+|
 * Very limited, shared resource
+|===
 
-=== Development workflow
+https://github.com/oxidecomputer/omicron/pull/7424[Work is ongoing] to add `cargo xtask` commands for launching an a4x2 environment. This would significantly streamline the process of using a4x2 and also make it possible to use a4x2 in CI.
 
 A common development workflow is:
 

From 52388eaaaeb375deb9fa6521a7571ba1dfa75a8c Mon Sep 17 00:00:00 2001
From: David Pacheco
Date: Fri, 23 May 2025 13:52:39 -0700
Subject: [PATCH 07/10] add link

---
 docs/reconfigurator-dev-guide.adoc | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/docs/reconfigurator-dev-guide.adoc b/docs/reconfigurator-dev-guide.adoc
index 8609831de37..c09fdfdb97d 100644
--- a/docs/reconfigurator-dev-guide.adoc
+++ b/docs/reconfigurator-dev-guide.adoc
@@ -241,7 +241,7 @@ Many other tasks work with Reconfigurator, too (e.g., region replacement and reg
 Notably absent from this list is anything related to planning. This has not been automated as a background task yet.
 
-== Testing and developer workflow
+== Manual testing and developer workflow
 
 There are a bunch of different environments that you can set up and use to test Omicron.
 
@@ -301,13 +301,16 @@ There are a bunch of different environments that you can set up and use to test
 ** No service processors
 ** Networking simulation is incomplete (connectivity depends on how your dev system is set up)
 
-|Racklette
-|Real Oxide hardware (sleds and switches), essentially indistinguishable from a real Oxide rack
+|https://github.com/oxidecomputer/meta/blob/master/engineering/lab/environments.adoc[Racklette]
+|Real Oxide hardware (sleds and switches)
+|
+* Essentially indistinguishable from a real Oxide rack
 |Everything. Worthwhile for:
 * any testing involving real "customer" VMs
 * final smoke testing for work developed with simulated components
 |
 * Very limited, shared resource
+
 |===
 
 https://github.com/oxidecomputer/omicron/pull/7424[Work is ongoing] to add `cargo xtask` commands for launching an a4x2 environment. This would significantly streamline the process of using a4x2 and also make it possible to use a4x2 in CI.
From 82c58f70c37a51c07420b0f429d26f3d2d392f99 Mon Sep 17 00:00:00 2001
From: David Pacheco
Date: Fri, 23 May 2025 13:56:13 -0700
Subject: [PATCH 08/10] edits

---
 docs/reconfigurator-dev-guide.adoc | 34 ++++++++++--------------------
 1 file changed, 11 insertions(+), 23 deletions(-)

diff --git a/docs/reconfigurator-dev-guide.adoc b/docs/reconfigurator-dev-guide.adoc
index c09fdfdb97d..ab1ee96ee3a 100644
--- a/docs/reconfigurator-dev-guide.adoc
+++ b/docs/reconfigurator-dev-guide.adoc
@@ -2,6 +2,17 @@
 :numbered:
 :toc: left
 
+// TODO: This guide could use more sections:
+// - task: generate a new blueprint using the planner
+// - task: export reconfigurator state
+// - task: generate a new blueprint using reconfigurator-cli
+// - task: import blueprint
+// - task: execute blueprint (via Nexus)
+// - task: monitor blueprint execution
+// - task: previewing what changes a blueprint will make
+// - task: trigger inventory collection (and add back reference from `omdb` section)
+// - task: wait for inventory collection to complete
+
 = Reconfigurator Developer Guide
 
 This document covers practical tips for working on Reconfigurator. For principles and design, see xref:reconfigurator.adoc[Reconfigurator Overview].
@@ -60,8 +71,6 @@ graph TD
     style MgsUpdateDriver fill:#ffe0b2,stroke:#fb8c00
 ```
 
-// XXX-dap fix diagrams not working well in dark mode
-
 == Key Rust packages
 
 Below are some of the most important Rust packages to know about. This is not a complete list.
@@ -896,25 +905,4 @@ Sled SimGimlet00
 ...
 ```
 
-// XXX-dap link to section showing how to trigger an inventory collection, which itself should link to a section on waiting for it to complete.
 This is a handy summary, but it only gets updated when inventory is collected. This is more cumbersome than `faux-mgs` when you only need to get one piece of information and need it to be up-to-date.
-
-
-// XXX-dap task: generate a new blueprint using the planner
-// XXX-dap task: export reconfigurator state
-// XXX-dap task: generate a new blueprint using reconfigurator-cli
-// XXX-dap task: import blueprint
-// XXX-dap task: execute blueprint (via Nexus)
-// XXX-dap task: monitor blueprint execution
-// XXX-dap task: previewing what changes a blueprint will make
-// XXX-dap task: trigger inventory collection
-// XXX-dap task: wait for inventory collection to complete
-// XXX-dap task: download a TUF repo from CI (and link this where we do it above)
-// XXX-dap task: figure out which SP image you need
-// XXX-dap task: figure out which RoT image you need from a TUF repo
-// XXX-dap task: serve a local, unpacked TUF repo via repo-depot-API
-
-// XXX-dap diagram showing:
-// - planner creates blueprints and stores them into database
-// - user can import blueprints with reconfigurator-cli
-// - execution reads blueprints

From 6fcdb7aa508e1120f060eed0fad6cd34ebee9d5d Mon Sep 17 00:00:00 2001
From: David Pacheco
Date: Fri, 23 May 2025 13:58:15 -0700
Subject: [PATCH 09/10] title

---
 docs/reconfigurator-dev-guide.adoc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/reconfigurator-dev-guide.adoc b/docs/reconfigurator-dev-guide.adoc
index ab1ee96ee3a..5b6598d31b6 100644
--- a/docs/reconfigurator-dev-guide.adoc
+++ b/docs/reconfigurator-dev-guide.adoc
@@ -343,7 +343,7 @@ Broadly, we have several kinds of tests:
 
 The https://github.com/oxidecomputer/omicron/pull/7424[ongoing work mentioned above] will make it possible to run the live tests in a4x2 in CI.
 
-== Updates for SPs, RoTs, etc.
+== Updating SPs
 
 Updates for the following components get lumped together:
 

From 5ebbd4a279b43e38e40541abef34d3284ee15b1c Mon Sep 17 00:00:00 2001
From: David Pacheco
Date: Fri, 23 May 2025 14:37:28 -0700
Subject: [PATCH 10/10] update racklette link

---
 docs/reconfigurator-dev-guide.adoc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/reconfigurator-dev-guide.adoc b/docs/reconfigurator-dev-guide.adoc
index 5b6598d31b6..61cbf9f54c3 100644
--- a/docs/reconfigurator-dev-guide.adoc
+++ b/docs/reconfigurator-dev-guide.adoc
@@ -310,7 +310,7 @@ There are a bunch of different environments that you can set up and use to test
 ** No service processors
 ** Networking simulation is incomplete (connectivity depends on how your dev system is set up)
 
-|https://github.com/oxidecomputer/meta/blob/master/engineering/lab/environments.adoc[Racklette]
+|https://inventron.eng.oxide.computer/env?group=testrack[Racklette]
 |Real Oxide hardware (sleds and switches)
 |
 * Essentially indistinguishable from a real Oxide rack