Description
See discussion in oxidecomputer/stlouis#94, specifically https://github.com/oxidecomputer/stlouis/issues/94#issuecomment-1496594723. We need to collectively choose an architectural path forward here, both short-term and long-term; one possible outcome is that this ticket will be closed without any work required in sled-agent. There are several options:
- Do nothing in the short term, which will mean that service actions involving insertion of a PCIe device (SSD or SC) will require the Gimlet in question to be rebooted before the newly-attached device will function. This would require documentation in the form of a highly visible product-level erratum. In the long term, implement one of the remaining options.
- In the short term, have sled-agent detect the insertion of an SSD via hacks (in a path similar to existing/planned hacks for detection of configured storage devices) and configure the attachment point to bring the device online. In this scenario, Sidecar attachment can be disregarded as the short-term service action for SC-Scrimlet changes is already complex and expected to be infrequent. In the long term, implement one of the remaining options.
- In the short term, have system software forcibly enable all devices on hot-insertion. While this can likely be done fairly quickly, it will contend with other host system software work and will require tradeoffs between the existing schedule and other work. Short-term risks are substantial due to inadequate downstack staffing. In the long term, implement one of the remaining options.
- Provide a flexible mechanism to configure system software to automatically enable devices on hot-insertion, with a sysevent and/or FRU monitor based mechanism for detecting auto-enable failures and perhaps diagnosing such failures as faults. Configure system software to do this automatically on all oxide arch implementations and/or as part of the Helios build process. Requires generic and Gimlet-specific topo work, including a FRU monitor (see RFD 360). Months of work; requires schedule slip if part of the MVP definition.
- Provide FRU monitoring mechanisms in fmd or other system software and provide a sysevent and/or additional interface for upstack software to consume it. As part of this, manage and propagate hot-insertion events into upstack software (sled-agent, on the oxide architecture) that is tasked with implementing policy. This could be done in conjunction with other non-sled-agent implementations serving the same purpose on other architectures. Months of OS work, plus additional sled-agent work; requires schedule slip if part of the MVP definition.
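
To make the division of labour in the sled-agent options concrete, here is a minimal sketch of what the upstack policy side might look like, assuming some lower layer delivers hot-insertion events. Every name in it (`HotInsertEvent`, `DeviceKind`, `short_term_policy`, `handle_event`) is hypothetical and exists only for illustration; none of them are existing sled-agent or illumos interfaces, and the actual configuration of the attachment point is stubbed out as a caller-supplied closure. The policy shown is the short-term one from the second option: bring SSDs online and disregard Sidecar.

```rust
// Hypothetical types for illustration only; these are not sled-agent or
// illumos interfaces.

/// Kind of PCIe device that appeared at an attachment point.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DeviceKind {
    Ssd,
    Sidecar,
}

/// A hot-insertion notification as upstack software might receive it.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct HotInsertEvent {
    pub attachment_point: String,
    pub kind: DeviceKind,
}

/// What the policy layer decides to do with an event.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    Configure,
    Ignore,
}

/// Short-term policy from the second option: configure SSDs, ignore Sidecar.
pub fn short_term_policy(event: &HotInsertEvent) -> Action {
    match event.kind {
        DeviceKind::Ssd => Action::Configure,
        DeviceKind::Sidecar => Action::Ignore,
    }
}

/// Apply the policy to one event. `configure` stands in for whatever
/// mechanism actually brings the attachment point online; its error is
/// passed back to the caller so that failures are not silently dropped.
pub fn handle_event<F>(event: &HotInsertEvent, mut configure: F) -> Result<Action, String>
where
    F: FnMut(&str) -> Result<(), String>,
{
    let action = short_term_policy(event);
    if action == Action::Configure {
        configure(&event.attachment_point)?;
    }
    Ok(action)
}
```

Keeping the policy function separate from the mechanism means the same event type could later be fed by a sysevent or FRU monitor source without changing the policy code, which is roughly the split the last option describes.
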
Some additional hybrid and/or interim-path solutions may exist. The above does not specifically consider what happens if sled-agent (and/or other userland functionality, including userland system software if applicable) is unable to run; that case needs additional design work.

I plan to write a very brief RFD on this, so this isn't the place to choose our path. Instead, I'm opening this ticket to track the potential for upstack software work in this area and to ensure adequate visibility for MVP definition vs. schedule vs. staffing priority calls and possible effects on other sled-agent engineering choices. As there is little documentation covering sled-agent's intended functions or architecture, I do not have good visibility into that aspect of this problem.

Note that this behaviour is not specific to Oxide hardware, so it technically exists on PC-based stand-ins also; if that's considered an important environment going forward (possible but not recommended), the long-term solution should take that into consideration.