Trust quorum: reconfiguration and commit behavior #8052

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Closed

andrewjstone wants to merge 4 commits into main from ajs/realtq-4

Contributor

andrewjstone commented Apr 25, 2025

This PR adds further functionality to the sans-io trust quorum protocol. Configurations can now be committed via Node::commit_reconfiguration. For each reconfiguration attempt made on top of a committed configuration, the rack secret for the last committed reconfiguration will be reconstructed after retreiving a threshold of shares from members of that configuration. At this point this "old" rack secret will be encrypted with a key derived from the rack secret for the current configuration being coordinated and included as necessary in prepare messages sent out during coordination.

The property based test for coordinator behavior has been expanded to include support for this functionality, as well as to allow dropping messages between nodes if such an action is generated. The bulk of this PR lies in the test code, and it has been restructured to handle multiple reconfigurations and commits. This has led to the tracking of shares across non-existent test nodes, and enhancements to the model.

Additionally, a small change was made to copy some of the errors out of validators.rs and into their own file.


          Trust quorum: reconfiguration and commit behavior

94f7b6e

This PR adds further functionality to the sans-io trust quorum protocol.
Configurations can now be committed via `Node::commit_reconfiguration`.
For each reconfiguration attempt made on top of a committed
configuration, the rack secret for the last committed reconfiguration
will be reconstructed after retreiving a threshold of shares from
members of that configuration. At this point this "old" rack secret will
be encrypted with a key derived from the rack secret for the current
configuration being coordinated and included as necessary in prepare
messages sent out during coordination.

The property based test for coordinator behavior has been expanded to
include support for this functionality, as well as to allow dropping
messages between nodes if such an action is generated. The bulk of
this PR lies in the test code, and it has been restructured to handle
multiple reconfigurations and commits. This has led to the tracking of
shares across non-existent test nodes, and enhancements to the model.

Additionally, a small change was made to copy some of the errors out of
`validators.rs` and into their own file.

andrewjstone requested a review from sunshowers

April 25, 2025 23:05

andrewjstone added 3 commits

April 25, 2025 23:08


          Reduce send_msgs filter condition

9858e7b

It's no longer necessary to filter out the coordinator explicitly, as
it's share is always included in the `collected_shares` upon construction.


          ugh, missed a comment

bc17d08


          Remove another redundant removal of coordinator id

cbb7db4

andrewjstone added a commit that referenced this pull request


          Trust Quorum: Handle prepare messages + Alarms

10ae04e

This builds on #8052.

Node's now handle `PrepareMsg`s from coordinators. The coordinator
proptest was updated to generate prepares from non-existent test only
nodes and send them to the coordinator.

Additionally, protocol invariant violations are now detected in a few
cases and recorded to the `PersistentState`. This is for debugging and
support purposes. The goal is to test the code well enough that we never
actually see an alarm in production.

andrewjstone mentioned this pull request

Trust Quorum: Handle prepare messages + Alarms #8062

Closed

sunshowers reviewed

View reviewed changes

Contributor

sunshowers left a comment

(in the middle of reviewing rn)

trust-quorum/src/coordinator_state.rs

                       };
+                      info!(
+                          log,

Contributor

sunshowers May 23, 2025

nit: should this log have the component set to tq-coordinator-state?

trust-quorum/src/coordinator_state.rs

+                                  });
+                              }
+                          }
+                          CoordinatorOperation::CollectLrtqShares { .. } => {}

Contributor

sunshowers May 23, 2025

Interesting that there wasn't a Rust warning for these variables being unused. Do you know why?

trust-quorum/src/coordinator_state.rs

Comment on lines +300 to +306

+                              // A valid share was received. Is it new?
+                              if collected_shares.insert(from, share).is_some() {
+                                  return None;
+                              }
+                              //
+                              // Do we have enough shares to recompute the rack secret
+                              // for `epoch`?

Contributor

sunshowers May 23, 2025

tiny nit

Suggested change

      
                            // A valid share was received. Is it new?
          
                            if collected_shares.insert(from, share).is_some() {
          
                                return None;
          
                            }
          
                            //
          
                            // Do we have enough shares to recompute the rack secret
          
                            // for `epoch`?
          
                            // A valid share was received. Is it new?
          
                            if collected_shares.insert(from, share).is_some() {
          
                                return None;
          
                            }
          
                            // Do we have enough shares to recompute the rack secret
          
                            // for `epoch`?

trust-quorum/src/coordinator_state.rs

+                                  Err(err) => {
+                                      error!(
+                                          self.log,
+                                          "Failed to reconstruct old rack secret: {err}";

Contributor

sunshowers May 23, 2025

worth using slog-inline-error?

trust-quorum/src/coordinator_state.rs

+                                  }
+                              };
+                              // Encrypt our old secret with a key derived from the new secret

Contributor

sunshowers May 23, 2025

Is this protocol described somewhere? Would be nice to have a link to refer to in a comment.

Contributor Author

andrewjstone Jun 16, 2025

It's described in RFD 238, section 5.3. This is actually one of the things that is changing in follow up PRs. The new protocol may encrypt more than one prior rack secret so that secrets can be rotated across multiple committed configurations. This protocol change was specified in TLA+ in #8347.

trust-quorum/src/coordinator_state.rs

Comment on lines +267 to +268

		// First, perform some validation on the incoming share
		if *last_committed_epoch != epoch {

Contributor

sunshowers May 23, 2025

Would recommend creating a new logger with the last_committed_epoch set up here so it doesn't get missed in any of the statements below.

trust-quorum/src/coordinator_state.rs

+                                      error!(
+                                          self.log,
+                                          "Failed to reconstruct old rack secret: {err}";
+                                          "epoch" => %epoch

Contributor

sunshowers May 23, 2025

should this be last_committed_epoch? and is epoch = self.configuration.epoch also worth logging here?

trust-quorum/src/coordinator_state.rs

Comment on lines +445 to +449

+                              error!(
+                                  self.log,
+                                  "logic error: already preparing";
+                                  "epoch" => %self.configuration.epoch,
+                              );

Contributor

sunshowers May 23, 2025

worth logging the arguments to the Prepare message?

trust-quorum/src/coordinator_state.rs

+                          self.log,
+                          "Starting to prepare after collecting shares";
+                          "epoch" => %self.configuration.epoch,
+                          // Safety: This whole method relies on having a previous configuration

Contributor

sunshowers May 23, 2025

Is this comment meant to go above, in last_committed_epoch? If so could it become an expect message?

sunshowers reviewed

View reviewed changes

Contributor

sunshowers left a comment

(still reviewing)

trust-quorum/src/crypto.rs

Comment on lines +209 to +215

+                  pub fn reconstruct_from_iter<'a>(
+                      shares: impl Iterator<Item = &'a Share>,
+                  ) -> Result<ReconstructedRackSecret, RackSecretReconstructError> {
+                      let mut shares: Vec<Share> = shares.cloned().collect();
+                      let res = RackSecret::reconstruct(&shares);
+                      shares.zeroize();
+                      res

Contributor

sunshowers May 23, 2025

Is there a reason this looks different from reconstruct above?

Contributor Author

andrewjstone Jun 16, 2025

This is just for API ergonomics.

trust-quorum/src/crypto.rs

Comment on lines +276 to +277

		// This key is only used to encrypt one plaintext. A nonce of all zeroes is
		// all that's required.

Contributor

sunshowers May 23, 2025

Why is this the case? I'm curious but not sure how the second statement follows from the first.

Contributor Author

andrewjstone Jun 16, 2025

Nonces are used to ensure that an identical input message maps to a different output message when encrypted with the same key. The purpose of doing this is twofold:

Prevent replay attacks - An attacker shouldn't be able to resend an intercepted message and gain any privileges, like authentication.
Prevent cryptanalysis - An attacker shouldn't be able to deduce a message based on structure and location in the ciphertext.

In this case, we care about neither of these scenarios. There is no chance of a replay attack, as the nonce is used locally to encrypt data and not sent over the network. Everything sent over the network is already encrypted with TLS.

In the second case (where the comment is relevant), we never encrypt more than one piece of data with the same key. There's no chance for cryptanalysis since we encrypt all the data in one shot, and never reuse the key.

Furthermore, this data must be able to be deterministically encrypted and decrypted each time the rack secret is reconstructed and the key derived. The easiest thing to do then is to just use a constant of zero for the nonce, so we don't need to keep track via some advanced protocol. Changing the encrypted value each time we do this doesn't buy us anything and it adds significant complexity and risk.

trust-quorum/src/crypto.rs

+                      new_rack_secret.expose_secret(),
+                  );
+                  // The "info" string is context to bind the key to its purpose

Contributor

sunshowers May 23, 2025

What does "bind" mean here?

Contributor Author

andrewjstone Jun 16, 2025

Each key should have a single purpose. This ties the key to that "purpose" via a unique string and a set of unique parameters for this specific encryption task. It prevents accidentally deriving the same key given the same input key material, but for which the key is intended for different purposes. None of this is strictly necessary here, but is defense in depth and a best practice.

This is primarily protection against a kind of confused deputy attack, described for a somewhat similar case by soatok.

I'm going to conjure up a very hypothetical example that I read something similar about a while ago, but cannot find the reference to.

We can describe a hypothetical service that maintains shared input key material that it uses across accounts in order to derive encryption keys. These keys are used to decrypt on demand ciphertext stored in user accounts. Let's assume there's also a vulnerability that allows an attacker to upload encrypted data to an account they have access to so that the service will decrypt it on demand. Then, let's assume an attacker has a legitimate account for this service, as they are a member of an organization, and input key material is based on a shared org secret uploaded to the service at sign up time. Let's see what happens if keys are derived without binding the derived key to context.

The attacker is in the office, and has access to encrypted backups of data stored in the a particularly sensitive company account. They do not have access to this information in the course of their duties. Say this is sensitive insider information that can be traded on and in the CFO's account. The attacker is able to inject this encrypted copy of data inside their own legitimate account via the vulnerability. Since the same encryption key is shared across users for this organization, the attacker can then ask to download or view the decrypted version even though they only have access to the encrypted data and a mechanism for uploading it to the server.

However, If there was context info used to derive the key, we could set the context info to the particular user's account id. Then the encrypted data would only properly decrypt when the derived key was regenerated with the same input key material and the CFOs account id. The attacker could upload this data, but then when accessing the service they wouldn't be able to decrypt the information, because the derived key would be constructed using their own account id, and would not decrypt the data correctly.

Contributor Author

andrewjstone Jun 16, 2025

And because this example is very contrived, and I realize you may have been asking a simpler question, I'll provide that. By binding the info it just means that the algorithm for generating the key mixes the bits from the info string in with the input key material and other context parameters to generate the key.

trust-quorum/src/crypto.rs

+                  prk.expand_multi_info(
+                      &[
+                          b"trust-quorum-v1-rack-secret",
+                          rack_id.as_untyped_uuid().as_ref(),

Contributor

sunshowers May 23, 2025

Huh I guess we should probably impl AsRef<[u8]> within newtype-uuid.

Contributor

sunshowers May 23, 2025

Now out in newtype-uuid 1.2.2 :)

sunshowers reviewed

View reviewed changes

Contributor

sunshowers left a comment

nice! several comments but overall looks good

trust-quorum/src/crypto.rs

+                  prk.expand_multi_info(
+                      &[
+                          b"trust-quorum-v1-rack-secret",
+                          rack_id.as_untyped_uuid().as_ref(),

Contributor

sunshowers May 23, 2025

Now out in newtype-uuid 1.2.2 :)

trust-quorum/src/errors.rs

Comment on lines +23 to +24

		#[error("prepare for a later configuration exists")]
		OutOfOrderCommit,

Contributor

sunshowers May 23, 2025

can/should this include more information on the epoch numbers?

trust-quorum/src/lib.rs

+              impl Epoch {
+                  // Increment the epoch and return the new value
+                  pub fn inc(&self) -> Epoch {
+                      Epoch(self.0 + 1)

Contributor

sunshowers May 23, 2025

worth using checked_add?

trust-quorum/src/node.rs

Comment on lines +300 to +302

+                      // Safety: We have a committed configuration that is not LRTQ
+                      let our_share =
+                          self.persistent_state.last_committed_key_share().unwrap();

Contributor

sunshowers May 23, 2025

move to expect?

trust-quorum/tests/coordinator.rs

+                      &mut self,
+                      from: PlatformId,
+                      epoch: Epoch,
+                  ) -> Result<bool, TestCaseError> {

Contributor

sunshowers May 26, 2025

use an enum here? in particular it would be nice to indicate why false was returned.

trust-quorum/tests/coordinator.rs

Comment on lines +331 to +332

		// A cache of our member universe, so we only have to generate it once
		member_universe: Vec<PlatformId>,

Contributor

sunshowers May 26, 2025

worth mentioning that the first element of the list is the coordinator?

trust-quorum/tests/coordinator.rs

Comment on lines +579 to +584

+                  // Precondition: Nexus will only actually commit if enough nodes
+                  // have acked the `PrepareMsg` from the coordinator. Let's simulate
+                  // that by checking our model state.
+                  //
+                  // Return `Some(reconfigure_msg)` if Nexus would decide to commit, `None` otherwise.
+                  fn action_commit_precondition(

Contributor

sunshowers May 26, 2025

nice, love a clear property to test with

Contributor Author

andrewjstone commented Jun 16, 2025

After modeling the protocol in TLA+, a few improvements were made. The changes are large enough that they probably warrant closing this PR and #8062 and reworking them to incorporate the changes. In particular all the ordering between prepares and commits is mostly unnecessary now, as configurations can be separately retrieved after the fact and shares recomputed locally at the node missing the prepare. This should make things much more straightforward to implement and reduce the number of alarms introduced in #8062, if there are any remaining.

I'll take into consideration all the review comments here and incorporate them in the new PR(s). I'm also going to answer a few of the questions in these PRs, as they are particularly useful to understand. I'll add related comments in the follow up PRs.

Contributor

sunshowers commented Jun 16, 2025

Sounds good @andrewjstone, thanks!

andrewjstone closed this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet