Trust quorum: reconfiguration and commit behavior #8052
base: main
This PR adds further functionality to the sans-io trust quorum protocol. Configurations can now be committed via `Node::commit_reconfiguration`. For each reconfiguration attempt made on top of a committed configuration, the rack secret for the last committed configuration is reconstructed after retrieving a threshold of shares from members of that configuration. This "old" rack secret is then encrypted with a key derived from the rack secret for the configuration currently being coordinated, and included as necessary in the prepare messages sent out during coordination.

The property-based test for coordinator behavior has been expanded to support this functionality, and to allow dropping messages between nodes when such an action is generated. The bulk of this PR lies in the test code, which has been restructured to handle multiple reconfigurations and commits. This has led to the tracking of shares across non-existent test nodes, and to enhancements to the model.

Additionally, a small change was made to copy some of the errors out of `validators.rs` and into their own file.
It's no longer necessary to filter out the coordinator explicitly, as its share is always included in `collected_shares` upon construction.
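A minimal sketch of that idea, under stated assumptions: `ShareCollector`, `NodeId`, `Share`, and `handle_share` are hypothetical stand-ins, not the types in this PR. It only illustrates seeding the map with the coordinator's own share at construction, so the duplicate check and the threshold check treat every member the same way.

```rust
use std::collections::BTreeMap;

// Hypothetical stand-ins for the real types; names are illustrative only.
type NodeId = u32;
type Share = Vec<u8>;

struct ShareCollector {
    threshold: usize,
    collected_shares: BTreeMap<NodeId, Share>,
}

impl ShareCollector {
    /// Seed the map with the coordinator's own share so that later code
    /// never needs to special-case the coordinator.
    fn new(coordinator: NodeId, own_share: Share, threshold: usize) -> Self {
        let mut collected_shares = BTreeMap::new();
        collected_shares.insert(coordinator, own_share);
        ShareCollector { threshold, collected_shares }
    }

    /// Returns `Some(count)` when a new share leaves us at or above the
    /// threshold, and `None` for duplicates or while still below it.
    fn handle_share(&mut self, from: NodeId, share: Share) -> Option<usize> {
        // A share was received. Is it new?
        if self.collected_shares.insert(from, share).is_some() {
            return None;
        }
        // Do we have enough shares to attempt reconstruction?
        if self.collected_shares.len() >= self.threshold {
            Some(self.collected_shares.len())
        } else {
            None
        }
    }
}
```

With a threshold of 3 and the coordinator as node 1, shares from two other distinct nodes are enough; a repeated share from any node, including the coordinator itself, is ignored.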
This builds on #8052. Nodes now handle `PrepareMsg`s from coordinators. The coordinator proptest was updated to generate prepares from non-existent, test-only nodes and send them to the coordinator. Additionally, protocol invariant violations are now detected in a few cases and recorded to the `PersistentState`. This is for debugging and support purposes. The goal is to test the code well enough that we never actually see an alarm in production.
(in the middle of reviewing rn)
};

info!(
    log,
nit: should this log have the component set to `tq-coordinator-state`?
        });
    }
}
CoordinatorOperation::CollectLrtqShares { .. } => {}
Interesting that there wasn't a Rust warning for these variables being unused. Do you know why?
// A valid share was received. Is it new?
if collected_shares.insert(from, share).is_some() {
    return None;
}
//
// Do we have enough shares to recompute the rack secret
// for `epoch`?
tiny nit
// A valid share was received. Is it new?
if collected_shares.insert(from, share).is_some() {
    return None;
}
//
// Do we have enough shares to recompute the rack secret
// for `epoch`?

// A valid share was received. Is it new?
if collected_shares.insert(from, share).is_some() {
    return None;
}
// Do we have enough shares to recompute the rack secret
// for `epoch`?
Err(err) => {
    error!(
        self.log,
        "Failed to reconstruct old rack secret: {err}";
worth using slog-inline-error?
    }
};

// Encrypt our old secret with a key derived from the new secret
Is this protocol described somewhere? Would be nice to have a link to refer to in a comment.
// First, perform some validation on the incoming share
if *last_committed_epoch != epoch {
Would recommend creating a new logger with the `last_committed_epoch` set up here so it doesn't get missed in any of the statements below.
error!(
    self.log,
    "Failed to reconstruct old rack secret: {err}";
    "epoch" => %epoch
should this be `last_committed_epoch`? and is `epoch = self.configuration.epoch` also worth logging here?
error!(
    self.log,
    "logic error: already preparing";
    "epoch" => %self.configuration.epoch,
);
worth logging the arguments to the `Prepare` message?
    self.log,
    "Starting to prepare after collecting shares";
    "epoch" => %self.configuration.epoch,
// Safety: This whole method relies on having a previous configuration
Is this comment meant to go above, in `last_committed_epoch`? If so could it become an `expect` message?
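For illustration, a minimal sketch of what turning such a comment into an `expect` message could look like. `PersistentState` here is a simplified, hypothetical stand-in with an assumed `committed_epochs` field, and `epoch_to_collect_shares_for` is an invented helper; neither reflects the PR's actual code.

```rust
// Hypothetical, simplified stand-in for the persistent state in the PR.
struct PersistentState {
    committed_epochs: Vec<u64>,
}

impl PersistentState {
    /// The highest committed epoch, if any configuration was committed.
    fn last_committed_epoch(&self) -> Option<u64> {
        self.committed_epochs.iter().copied().max()
    }
}

/// The invariant that used to live in a `// Safety:` comment is stated in
/// the panic message instead, so it shows up if it is ever violated.
fn epoch_to_collect_shares_for(state: &PersistentState) -> u64 {
    state
        .last_committed_epoch()
        .expect("collecting old shares requires a previously committed configuration")
}
```

The trade-off is that the explanation moves from a comment that can drift from the code to a message that is attached to the exact call that depends on the invariant.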
(still reviewing)
pub fn reconstruct_from_iter<'a>(
    shares: impl Iterator<Item = &'a Share>,
) -> Result<ReconstructedRackSecret, RackSecretReconstructError> {
    let mut shares: Vec<Share> = shares.cloned().collect();
    let res = RackSecret::reconstruct(&shares);
    shares.zeroize();
    res
Is there a reason this looks different from `reconstruct` above?
// This key is only used to encrypt one plaintext. A nonce of all zeroes is
// all that's required.
Why is this the case? I'm curious but not sure how the second statement follows from the first.
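One common reading of the code comment, offered as an interpretation rather than something the PR states: a nonce exists to keep each (key, nonce) pair unique across encryptions. If a key only ever encrypts one plaintext, any fixed nonce is already unique for that key. The toy sketch below uses a made-up XOR keystream (`toy_keystream`, `toy_encrypt`, and `xor` are illustrative names, and this is not a real cipher) to show what goes wrong only when a (key, nonce) pair is reused.

```rust
/// Toy keystream for illustration only: NOT a real cipher.
/// Derives one byte per position from (key, nonce, index).
fn toy_keystream(key: u8, nonce: u8, len: usize) -> Vec<u8> {
    (0..len)
        .map(|i| {
            key.wrapping_mul(31)
                .wrapping_add(nonce)
                .wrapping_add(i as u8)
                .rotate_left(3)
        })
        .collect()
}

/// Byte-wise XOR of two slices, truncated to the shorter one.
fn xor(a: &[u8], b: &[u8]) -> Vec<u8> {
    a.iter().zip(b).map(|(x, y)| x ^ y).collect()
}

/// "Encrypt" by XORing the plaintext with the toy keystream.
fn toy_encrypt(key: u8, nonce: u8, plaintext: &[u8]) -> Vec<u8> {
    xor(plaintext, &toy_keystream(key, nonce, plaintext.len()))
}
```

Encrypting two plaintexts under the same key and nonce yields ciphertexts whose XOR equals the XOR of the plaintexts, because the keystream cancels. With a fresh key per plaintext, no second ciphertext under the same pair ever exists, so the fixed nonce never gets a chance to repeat.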
    new_rack_secret.expose_secret(),
);

// The "info" string is context to bind the key to its purpose
What does "bind" mean here?
prk.expand_multi_info(
    &[
        b"trust-quorum-v1-rack-secret",
        rack_id.as_untyped_uuid().as_ref(),
Huh I guess we should probably impl `AsRef<[u8]>` within newtype-uuid.