Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make sure own identity does not get wiped during TVU deduplication #5380

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

alexpyattaev
Copy link

@alexpyattaev alexpyattaev commented Mar 19, 2025

Problem

  • fn dedup_tvu_addrs in cluster_nodes.rs is too aggressive and can accidentally delete node's own identity if a staked node was ever running on the same TVU socket address.
  • This in turn invalidates a bunch of invariants for weighted shuffle, causing validator to panic

Summary of Changes

  • Make sure fn dedup_tvu_addrs does not accidentally wipe node's own identity
  • Add extra paranoid checks in case other changes are made around that code resulting in similar issues.

Fixes #5356

@alexpyattaev alexpyattaev added the v2.2 Backport to v2.2 branch label Mar 19, 2025
Copy link

mergify bot commented Mar 19, 2025

Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis.

@alexpyattaev alexpyattaev marked this pull request as ready for review March 19, 2025 20:07
Copy link

@bw-solana bw-solana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most importantly, I think this solves the problem. Nice find!

That said, I would prefer to either assert that we will always find our own index or fully handle the case where we don't.

Left a few other nits along the way

// This node's index within shuffled nodes.
let index = nodes.by_ref().position(pred).unwrap();
// This node's index should be somewhere within shuffled indices.
let index = nodes.by_ref().position(pred)?;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're going to just unwrap/expect on the return value, might as well just unwrap here, closer to the root of the problem

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed in 92202a0

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • get_retransmit_addrs() returns a Result
  • get_retransmit_addrs() calls get_retransmit_peers() (the function that this .unwrap() is in)

Why not have get_retransmit_peers() return a Result and allow the caller decide how to handle it ? While we have seemingly found the cause of this panic, I don't think it hurts to make our error handling more robust

@@ -310,6 +311,11 @@ pub fn new_cluster_nodes<T: 'static>(
if broadcast {
weighted_shuffle.remove_index(index[&self_pubkey]);
}
// Paranoid check to ensure the correct operation of the weighted shuffle.
assert!(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe debug_assert ?

Copy link
Author

@alexpyattaev alexpyattaev Mar 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not get called very often (once every few seconds), and this ensures the thing crashes early (rather than in any of the unwraps in the code). Basically this is the check that would get hit if the invariant of "have my own key in the list" is ever violated in the future changes to dedup logic and/or if there are bugs in gossip.

@alexpyattaev alexpyattaev requested a review from bw-solana March 20, 2025 15:44
Copy link

@bw-solana bw-solana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link

@steviez steviez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't had the chance to review yet but will do so today. Please hold off on merging until I'm able to review

@alexpyattaev alexpyattaev added the automerge automerge Merge this Pull Request automatically once CI passes label Mar 20, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
automerge automerge Merge this Pull Request automatically once CI passes v2.2 Backport to v2.2 branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Turbine panics on v2.2
3 participants