Skip to content

[nexus] add tokio-dtrace probes #8364

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 7 commits into from
Closed

[nexus] add tokio-dtrace probes #8364

wants to merge 7 commits into from

Conversation

hawkw
Copy link
Member

@hawkw hawkw commented Jun 17, 2025

I've written a crate that adds USDT probes for Tokio runtime events, using the unstable callbacks provided by the Tokio runtime. This commit adds that crate to Nexus. Note that, because tokio-dtrace releis on Tokio's unstable features, it was necessary to add config to .cargo/config.toml to enable the --cfg tokio_unstable rustc config when compiling for illumos. In addition, adding the code necessary configure the Tokio runtime to record these probes required some changes to bin/nexus.rs, to change the use of the #[tokio::main] attribute to manual use of the runtime builder.

I'd like to add these DTrace
probes in other binaries in Omicron as well, but figured I'd start with Nexus as a proof of concept. It might be worth some general refactoring of our main functions in the process of doing so.

I've written [a crate][1] that adds USDT probes for Tokio runtime
events, using the unstable callbacks provided by the Tokio runtime. This
commit adds that crate to Nexus. Note that, because `tokio-dtrace`
releis on Tokio's unstable features, it was necessary to add config to
`.cargo/config.toml` to enable the `--cfg tokio_unstable` rustc config
when compiling for illumos. In addition, adding the code necessary
configure the Tokio runtime to record these probes required some changes
to `bin/nexus.rs`, to change the use of the `#[tokio::main]` attribute
to manual use of the runtime builder.

I'd like to add these DTrace
probes in other binaries in Omicron as well, but figured I'd start with
Nexus as a proof of concept. It might be worth some general refactoring
of our `main` functions in the process of doing so.

[1]: https://github.com/oxidecomputer/tokio-dtrace
@hawkw hawkw requested review from davepacheco and smklein June 17, 2025 23:04
@hawkw
Copy link
Member Author

hawkw commented Jun 17, 2025

Yuck, the build failure suggests that ... something ... is trying to compile tokio-dtrace without tokio_unstable enabled. Not sure why...

@hawkw
Copy link
Member Author

hawkw commented Jun 17, 2025

Yuck, the build failure suggests that ... something ... is trying to compile tokio-dtrace without tokio_unstable enabled. Not sure why...

Oh, it's because the build script that runs on that CI job is doing its own munging of the RUSTFLAGS environment variable in order to set -D warnings (which is quite reasonable). Thus, it clobbers the tokio_unstable flag that gets set in .cargo/config.toml. That's so cool! Gotta fix that, I guess.

@hawkw
Copy link
Member Author

hawkw commented Jun 17, 2025

NOTE TO FUTURE ELIZA: you should make sure that RUSTFLAGS="--cfg tokio_unstable" also makes it into...wherever the release TUF repo artifacts actually get built. Maybe ask @iliana where to look...

run_server(&config).await.map_err(|err| CmdError::Failure(anyhow!(err)))
let rt = {
let mut builder = tokio::runtime::Builder::new_multi_thread();
#[cfg(target_os = "illumos")]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this could also be conditional on cfg(tokio_unstable) so we don't fail to build if it's not set, but i actually want builds to fail for now to make sure it's gettig set in the places I want it to be...

Comment on lines +51 to +61
let rt = {
let mut builder = tokio::runtime::Builder::new_multi_thread();
#[cfg(target_os = "illumos")]
tokio_dtrace::register_hooks(&mut builder)
.map_err(|err| CmdError::Failure(anyhow!(err)))?;
builder
.enable_all()
.build()
.map_err(|err| CmdError::Failure(anyhow!(err)))?
};
rt.block_on(async move {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eventually, I think it would be nice to take this runtime setup code and put it in a crate, so that we can do the same setup in sled-agent and gateway binaries (in order to ensure we also set up the tokio-dtrace probes there, etc). But I didn't do that in this branch, since that felt like a larger change...especially since it should probably be a new crate. There's similar code in omicron_common's cmd module, but I didn't want to put it there, as that crate is also a dependency of a bunch of libraries, and making it depend on tokio would probably be bad for the parallelizability of the build graph: right now, we can go and compile a bunch of the smaller crates that depend on omicron_common fairly quickly, while the runtime setup code would only be a dependency of the actual binary crates.

@iliana
Copy link
Contributor

iliana commented Jun 18, 2025

Oh, it's because the build script that runs on that CI job is doing its own munging of the RUSTFLAGS environment variable in order to set -D warnings (which is quite reasonable). Thus, it clobbers the tokio_unstable flag that gets set in .cargo/config.toml. That's so cool! Gotta fix that, I guess.

It'd be nice if we could fetch the rustflags out of .cargo/config.toml and append to it from our bash scripts instead of having to copy the rustflags in several places. Maybe worth a small xtask? Or can we convince cargo to tell us what it thinks the rustflags are? Not sure if this is worth it for this PR.

@hawkw
Copy link
Member Author

hawkw commented Jun 18, 2025

@iliana apparently there's a cargo config get build.rustflags that will give you what's in the .cargo/config.toml, but it's only available on nightly toolchains :/

see rust-lang/cargo#12087

@hawkw
Copy link
Member Author

hawkw commented Jun 18, 2025

huh, looks like there's a helios-deploy failure due to...sled-agent crashlooping? which is weird because i didn't mess with sled-agent. will investigate.

@hawkw
Copy link
Member Author

hawkw commented Jun 18, 2025

huh, looks like there's a helios-deploy failure due to...sled-agent crashlooping? which is weird because i didn't mess with sled-agent. will investigate.

uh: https://buildomat.eng.oxide.computer/wg/0/artefact/01JY1ZKBMDX77JR3RMXBDT3BV9/VaWSJ8qC6auY1se2JQlSbfEe7oYePo6Nfzk1XCJtaOrRpCSx/01JY1ZMJ1MWPP2B35JK5ZBVY3D/01JY226GX073ZZ57MNZE0JXA0P/oxide-sled-agent:default.log?format=x-bunyan

2	 	[ Jun 18 17:42:41 Executing start method ("ctrun -l child -o noorphan,regent /opt/oxide/sled-agent/sled-agent run /opt/oxide/sled-agent/pkg/config.toml &"). ]
3	 	[ Jun 18 17:42:41 Method "start" exited with status 0. ]
4	 	[ Jun 18 17:42:41 Rereading configuration. ]
5	 	ld.so.1: sled-agent: fatal: libipcc.so.1: open failed: No such file or directory
6	 	[ Jun 18 17:42:41 No 'refresh' method defined.  Treating as :true. ]
7	 	[ Jun 18 17:42:41 Stopping because all processes in service exited. ]
8	 	[ Jun 18 17:42:41 Executing stop method (:kill). ]
9	 	[ Jun 18 17:42:41 Executing start method ("ctrun -l child -o noorphan,regent /opt/oxide/sled-agent/sled-agent run /opt/oxide/sled-agent/pkg/config.toml &"). ]
10	 	[ Jun 18 17:42:41 Method "start" exited with status 0. ]
11	 	ld.so.1: sled-agent: fatal: libipcc.so.1: open failed: No such file or directory
12	 	[ Jun 18 17:42:41 Stopping because all processes in service exited. ]
13	 	[ Jun 18 17:42:41 Executing stop method (:kill). ]
14	 	[ Jun 18 17:42:41 Executing start method ("ctrun -l child -o noorphan,regent /opt/oxide/sled-agent/sled-agent run /opt/oxide/sled-agent/pkg/config.toml &"). ]
15	 	[ Jun 18 17:42:41 Method "start" exited with status 0. ]
16	 	ld.so.1: sled-agent: fatal: libipcc.so.1: open failed: No such file or directory
17	 	[ Jun 18 17:42:41 Stopping because all processes in service exited. ]
18	 	[ Jun 18 17:42:41 Executing stop method (:kill). ]
19	 	[ Jun 18 17:42:41 Executing start method ("ctrun -l child -o noorphan,regent /opt/oxide/sled-agent/sled-agent run /opt/oxide/sled-agent/pkg/config.toml &"). ]
20	 	[ Jun 18 17:42:41 Method "start" exited with status 0. ]
21	 	ld.so.1: sled-agent: fatal: libipcc.so.1: open failed: No such file or directory
22	 	[ Jun 18 17:42:41 Stopping because all processes in service exited. ]
23	 	[ Jun 18 17:42:41 Executing stop method (:kill). ]
24	 	[ Jun 18 17:42:41 Restarting too quickly, changing state to maintenance. ]

i don't really think i did that?

hawkw added a commit that referenced this pull request Jun 22, 2025
Currently, all the async Rust binaries in the Omicron workspace use the
`#[tokio::main]` attribute on an `async fn main()`. This creates a Tokio
runtime with the default configuration provided by that attribute, and
runs the contents of the `async` main function on that runtime.

We now have need to configure the Tokio runtime used by Omicron
differently than the defaults provided by the `#[tokio::main]` macro. In
particular, we want to do the following:
- Disable the LIFO slot optimization, which can cause severe latency
spikes in some cases (see
#8334 (comment))
- When running on illumos, configure runtime callbacks to emit
`tokio-dtrace` probes (see #8364)

This requires the use of tokio's `runtime::Builder` API, rather than the
`#[tokio::main]` attribute. Since there are a lot of binaries in
Omicron, this branch introduces a new `omicron-runtime` crate that
configures a runtime with our desired settings and runs a future on it.
This can be used to replace `#[tokio::main]` fairly simply, which I've
done for most binaries in the repo.

There are a handful of binaries I did *not* update to use
`omicron-runtime`. In particular, I didn't change the use of
`#[tokio::main]` in example binaries (in particular, `oximeter-producer`
and `update-engine`), as I didn't want to add an `omicron-runtime`
dependency to those crates just for example. In the case of
`oximeter-producer`, in particular, which is depended on by code outside
of the Omicron repo, that would force those binaries to compile
`omicron-runtime` unnecessarily. Similarly, I didn't make some of the
dev-tools
binaries which are used in xtasks use `omicron-runtime`, because I
didn't want to introduce another dependency that has to be compiled
before an xtask can run, and that felt more important than adding DTrace
probes to those simple commands --- if something goes wrong with them,
we can add the proves later. The exception to this are the `mgs-dev` and
`omicron-dev` binaries, which run MGS and Nexus *in-process*; in those
cases, I felt it was useful to have the DTrace probes enabled when
running in development. Similarly, I also thought it was worthwhile to
use `omicron-runtime` in OMDB, as it's deployed on production systems
and wouild be harder to recompile with DTrace probes if we end up
wanting them.

Depends on #8364
hawkw added a commit that referenced this pull request Jun 23, 2025
Currently, all the async Rust binaries in the Omicron workspace use the
`#[tokio::main]` attribute on an `async fn main()`. This creates a Tokio
runtime with the default configuration provided by that attribute, and
runs the contents of the `async` main function on that runtime.

We now have need to configure the Tokio runtime used by Omicron
differently than the defaults provided by the `#[tokio::main]` macro. In
particular, we want to do the following:
- Disable the LIFO slot optimization, which can cause severe latency
spikes in some cases (see
#8334 (comment))
- When running on illumos, configure runtime callbacks to emit
`tokio-dtrace` probes (see #8364)

This requires the use of tokio's `runtime::Builder` API, rather than the
`#[tokio::main]` attribute. Since there are a lot of binaries in
Omicron, this branch introduces a new `omicron-runtime` crate that
configures a runtime with our desired settings and runs a future on it.
This can be used to replace `#[tokio::main]` fairly simply, which I've
done for most binaries in the repo.

There are a handful of binaries I did *not* update to use
`omicron-runtime`. In particular, I didn't change the use of
`#[tokio::main]` in example binaries (in particular, `oximeter-producer`
and `update-engine`), as I didn't want to add an `omicron-runtime`
dependency to those crates just for example. In the case of
`oximeter-producer`, in particular, which is depended on by code outside
of the Omicron repo, that would force those binaries to compile
`omicron-runtime` unnecessarily. Similarly, I didn't make some of the
dev-tools
binaries which are used in xtasks use `omicron-runtime`, because I
didn't want to introduce another dependency that has to be compiled
before an xtask can run, and that felt more important than adding DTrace
probes to those simple commands --- if something goes wrong with them,
we can add the proves later. The exception to this are the `mgs-dev` and
`omicron-dev` binaries, which run MGS and Nexus *in-process*; in those
cases, I felt it was useful to have the DTrace probes enabled when
running in development. Similarly, I also thought it was worthwhile to
use `omicron-runtime` in OMDB, as it's deployed on production systems
and wouild be harder to recompile with DTrace probes if we end up
wanting them.

Depends on #8364
hawkw added a commit that referenced this pull request Jun 23, 2025
Currently, all the async Rust binaries in the Omicron workspace use the
`#[tokio::main]` attribute on an `async fn main()`. This creates a Tokio
runtime with the default configuration provided by that attribute, and
runs the contents of the `async` main function on that runtime.

We now have need to configure the Tokio runtime used by Omicron
differently than the defaults provided by the `#[tokio::main]` macro. In
particular, we want to do the following:
- Disable the LIFO slot optimization, which can cause severe latency
spikes in some cases (see
#8334 (comment))
- When running on illumos, configure runtime callbacks to emit
`tokio-dtrace` probes (see #8364)

This requires the use of tokio's `runtime::Builder` API, rather than the
`#[tokio::main]` attribute. Since there are a lot of binaries in
Omicron, this branch introduces a new `omicron-runtime` crate that
configures a runtime with our desired settings and runs a future on it.
This can be used to replace `#[tokio::main]` fairly simply, which I've
done for most binaries in the repo.

There are a handful of binaries I did *not* update to use
`omicron-runtime`. In particular, I didn't change the use of
`#[tokio::main]` in example binaries (in particular, `oximeter-producer`
and `update-engine`), as I didn't want to add an `omicron-runtime`
dependency to those crates just for example. In the case of
`oximeter-producer`, in particular, which is depended on by code outside
of the Omicron repo, that would force those binaries to compile
`omicron-runtime` unnecessarily. Similarly, I didn't make some of the
dev-tools
binaries which are used in xtasks use `omicron-runtime`, because I
didn't want to introduce another dependency that has to be compiled
before an xtask can run, and that felt more important than adding DTrace
probes to those simple commands --- if something goes wrong with them,
we can add the proves later. The exception to this are the `mgs-dev` and
`omicron-dev` binaries, which run MGS and Nexus *in-process*; in those
cases, I felt it was useful to have the DTrace probes enabled when
running in development. Similarly, I also thought it was worthwhile to
use `omicron-runtime` in OMDB, as it's deployed on production systems
and wouild be harder to recompile with DTrace probes if we end up
wanting them.

Depends on #8364
@hawkw hawkw closed this in #8412 Jun 26, 2025
karencfv pushed a commit to karencfv/omicron that referenced this pull request Jun 26, 2025
…uter#8412)

Currently, all the async Rust binaries in the Omicron workspace use the
`#[tokio::main]` attribute on an `async fn main()`. This creates a Tokio
runtime with the default configuration provided by that attribute, and
runs the contents of the `async` main function on that runtime.

We now have need to configure the Tokio runtime used by Omicron
differently than the defaults provided by the `#[tokio::main]` macro. In
particular, we want to do the following:
- Disable the LIFO slot optimization, which can cause severe latency
spikes in some cases (see

oxidecomputer#8334 (comment))
- When running on illumos, configure runtime callbacks to emit
`tokio-dtrace` probes (see oxidecomputer#8364)

In order to make it easier to use these configurations in all production
binaries in Omicron and other projects, I've introduced a new
[`oxide-tokio-rt`
crate](https://github.com/oxidecomputer/oxide-tokio-rt). This crate
provides
functions that configure a runtime with our desired settings and runs a
future
on it. This can be used to replace `#[tokio::main]` fairly simply, which
I've
done for most binaries in the repo.

There are a handful of binaries I did *not* update to use
`omicron-runtime`. In particular, I didn't change the use of
`#[tokio::main]` in example binaries (in particular, `oximeter-producer`
and `update-engine`), as I didn't want to add an `omicron-runtime`
dependency to those crates just for example. In the case of
`oximeter-producer`, in particular, which is depended on by code outside
of the Omicron repo, that would force those binaries to compile
`omicron-runtime` unnecessarily. Similarly, I didn't make some of the
dev-tools
binaries which are used in xtasks use `oxide-tokio-rt`, because I didn't
want to introduce another dependency that has to be compiled before an
xtask can run, and that felt more important than adding DTrace probes to
those simple commands --- if something goes wrong with them, we can add
the proves later. The exception to this are the `mgs-dev` and
`omicron-dev` binaries, which run MGS and Nexus *in-process*; in those
cases, I felt it was useful to have the DTrace probes enabled when
running in development. Similarly, I also thought it was worthwhile to
use `omicron-runtime` in OMDB, as it's deployed on production systems
and would be harder to recompile with DTrace probes if we end up
wanting them.

Closes oxidecomputer#8364
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants