Trust quorum: reconfiguration and commit behavior #8052

Open: wants to merge 4 commits into main

andrewjstone

This PR adds further functionality to the sans-io trust quorum protocol. Configurations can now be committed via `Node::commit_reconfiguration`. For each reconfiguration attempt made on top of a committed configuration, the rack secret for the last committed reconfiguration will be reconstructed after retrieving a threshold of shares from members of that configuration. At that point, this "old" rack secret will be encrypted with a key derived from the rack secret for the current configuration being coordinated and included as necessary in prepare messages sent out during coordination.

The property-based test for coordinator behavior has been expanded to include support for this functionality, as well as to allow dropping messages between nodes if such an action is generated. The bulk of this PR lies in the test code, which has been restructured to handle multiple reconfigurations and commits. This has led to the tracking of shares across non-existent test nodes, and enhancements to the model.

Additionally, a small change was made to copy some of the errors out of `validators.rs` and into their own file.
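
The share-collection step described above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: `ShareCollector`, `NodeId`, `Share`, and `on_share` are hypothetical names, and a share is modeled as plain bytes. It assumes the coordinator's own share is inserted at construction and that a share from a node already seen is treated as a duplicate.

```rust
use std::collections::BTreeMap;

// Hypothetical stand-ins for the real types, which are not shown here.
type NodeId = u32;
type Share = Vec<u8>;

struct ShareCollector {
    threshold: usize,
    collected_shares: BTreeMap<NodeId, Share>,
}

impl ShareCollector {
    /// The coordinator's own share is part of the map from the start,
    /// so callers never need to filter the coordinator out later.
    fn new(threshold: usize, coordinator: NodeId, own_share: Share) -> Self {
        let mut collected_shares = BTreeMap::new();
        collected_shares.insert(coordinator, own_share);
        Self { threshold, collected_shares }
    }

    /// Record a share. Returns the collected shares when the share is new
    /// and at least `threshold` shares are held, otherwise `None`.
    fn on_share(&mut self, from: NodeId, share: Share) -> Option<Vec<&Share>> {
        // A share from a node we already heard from is not new.
        if self.collected_shares.insert(from, share).is_some() {
            return None;
        }
        // Not enough shares yet to attempt reconstruction.
        if self.collected_shares.len() < self.threshold {
            return None;
        }
        Some(self.collected_shares.values().collect())
    }
}

fn main() {
    let mut collector = ShareCollector::new(3, 1, vec![0xAA]);
    assert!(collector.on_share(2, vec![0xBB]).is_none()); // 2 of 3 held
    assert!(collector.on_share(2, vec![0xBB]).is_none()); // duplicate sender
    let shares = collector.on_share(3, vec![0xCC]).expect("threshold reached");
    assert_eq!(shares.len(), 3);
}
```

In the real protocol the returned shares would feed secret reconstruction; the sketch stops at the point where reconstruction could begin.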

andrewjstone requested a review from sunshowers on April 25, 2025 23:05
It's no longer necessary to filter out the coordinator explicitly, as
its share is always included in `collected_shares` upon construction.
andrewjstone added a commit that referenced this pull request Apr 29, 2025
This builds on #8052.

Nodes now handle `PrepareMsg`s from coordinators. The coordinator
proptest was updated to generate prepares from non-existent, test-only
nodes and send them to the coordinator.

Additionally, protocol invariant violations are now detected in a few
cases and recorded to the `PersistentState`. This is for debugging and
support purposes. The goal is to test the code well enough that we never
actually see an alarm in production.

sunshowers left a comment

(in the middle of reviewing rn)

};

info!(
log,

nit: should this log have the component set to tq-coordinator-state?

});
}
}
CoordinatorOperation::CollectLrtqShares { .. } => {}

Interesting that there wasn't a Rust warning for these variables being unused. Do you know why?

Comment on lines +300 to +306
// A valid share was received. Is it new?
if collected_shares.insert(from, share).is_some() {
return None;
}
//
// Do we have enough shares to recompute the rack secret
// for `epoch`?

tiny nit

Suggested change
// A valid share was received. Is it new?
if collected_shares.insert(from, share).is_some() {
return None;
}
//
// Do we have enough shares to recompute the rack secret
// for `epoch`?
// A valid share was received. Is it new?
if collected_shares.insert(from, share).is_some() {
return None;
}
// Do we have enough shares to recompute the rack secret
// for `epoch`?

Err(err) => {
error!(
self.log,
"Failed to reconstruct old rack secret: {err}";

worth using slog-inline-error?

}
};

// Encrypt our old secret with a key derived from the new secret

Is this protocol described somewhere? Would be nice to have a link to refer to in a comment.

Comment on lines +267 to +268
// First, perform some validation on the incoming share
if *last_committed_epoch != epoch {

Would recommend creating a new logger with the last_committed_epoch set up here so it doesn't get missed in any of the statements below.

error!(
self.log,
"Failed to reconstruct old rack secret: {err}";
"epoch" => %epoch

should this be last_committed_epoch? and is epoch = self.configuration.epoch also worth logging here?

Comment on lines +445 to +449
error!(
self.log,
"logic error: already preparing";
"epoch" => %self.configuration.epoch,
);

worth logging the arguments to the Prepare message?

self.log,
"Starting to prepare after collecting shares";
"epoch" => %self.configuration.epoch,
// Safety: This whole method relies on having a previous configuration

Is this comment meant to go above, in last_committed_epoch? If so could it become an expect message?
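
For illustration, moving a "Safety"-style comment into an `expect` message might look like the sketch below. The `Configuration` and `State` types, the `last_committed` field, and the message wording are all hypothetical; the point is only that the invariant then appears in the panic output instead of living in a comment.

```rust
// Hypothetical types for illustration; not the PR's actual definitions.
struct Configuration {
    epoch: u64,
}

struct State {
    last_committed: Option<Configuration>,
}

impl State {
    /// Panics with the stated invariant if no configuration was committed.
    fn last_committed_epoch(&self) -> u64 {
        self.last_committed
            .as_ref()
            .expect("this method relies on having a previous committed configuration")
            .epoch
    }
}

fn main() {
    let state = State { last_committed: Some(Configuration { epoch: 4 }) };
    assert_eq!(state.last_committed_epoch(), 4);
}
```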


sunshowers left a comment

(still reviewing)

Comment on lines +209 to +215
pub fn reconstruct_from_iter<'a>(
shares: impl Iterator<Item = &'a Share>,
) -> Result<ReconstructedRackSecret, RackSecretReconstructError> {
let mut shares: Vec<Share> = shares.cloned().collect();
let res = RackSecret::reconstruct(&shares);
shares.zeroize();
res

Is there a reason this looks different from reconstruct above?

Comment on lines +276 to +277
// This key is only used to encrypt one plaintext. A nonce of all zeroes is
// all that's required.

Why is this the case? I'm curious but not sure how the second statement follows from the first.
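
One common reading of the quoted comment, offered as an assumption rather than the author's answer: a nonce exists so that the same (key, nonce) pair is never used for two different plaintexts. If a key encrypts exactly one plaintext, any fixed nonce already satisfies that, so all zeroes is as good as any other value. The sketch below shows what goes wrong when a pair is reused. It uses a toy splitmix64 keystream that is NOT cryptographically secure and only stands in for a real stream cipher; `keystream`, `xor_encrypt`, and `xor_bytes` are hypothetical helpers.

```rust
/// Toy keystream (splitmix64). NOT secure; a stand-in for a real cipher.
fn keystream(key: u64, nonce: u64, len: usize) -> Vec<u8> {
    let mut state = key ^ nonce.rotate_left(32);
    let mut out = Vec::with_capacity(len + 8);
    while out.len() < len {
        state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^= z >> 31;
        out.extend_from_slice(&z.to_le_bytes());
    }
    out.truncate(len);
    out
}

/// XOR the plaintext with the keystream selected by (key, nonce).
fn xor_encrypt(key: u64, nonce: u64, plaintext: &[u8]) -> Vec<u8> {
    keystream(key, nonce, plaintext.len())
        .iter()
        .zip(plaintext)
        .map(|(k, p)| k ^ p)
        .collect()
}

fn xor_bytes(a: &[u8], b: &[u8]) -> Vec<u8> {
    a.iter().zip(b).map(|(x, y)| x ^ y).collect()
}

fn main() {
    let p1 = b"old rack secret A";
    let p2 = b"old rack secret B";

    // Reusing (key, nonce): the keystream cancels out, so the XOR of the
    // ciphertexts equals the XOR of the plaintexts.
    let c1 = xor_encrypt(7, 0, p1);
    let c2 = xor_encrypt(7, 0, p2);
    assert_eq!(xor_bytes(&c1, &c2), xor_bytes(p1, p2));

    // A different key per plaintext with the same all-zero nonce: the
    // keystreams differ, so that relation no longer holds.
    let c3 = xor_encrypt(8, 0, p2);
    assert_ne!(xor_bytes(&c1, &c3), xor_bytes(p1, p2));
}
```

Whether a given key really is used for only one plaintext is a property of the surrounding protocol and has to be argued there; the sketch does not establish it.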

new_rack_secret.expose_secret(),
);

// The "info" string is context to bind the key to its purpose

What does "bind" mean here?

prk.expand_multi_info(
&[
b"trust-quorum-v1-rack-secret",
rack_id.as_untyped_uuid().as_ref(),

Huh I guess we should probably impl AsRef<[u8]> within newtype-uuid.
