
stop unstaked nodes from pushing EpochSlots into the cluster #5141

Merged: 6 commits into anza-xyz:master, Mar 14, 2025

Conversation

@alexpyattaev commented Mar 4, 2025

Problem

  • EpochSlots is 70% of gossip traffic
  • Unstaked nodes do not need to send it

Summary of Changes

  • Prevent them from sending the message

Fixes #
Partially #5034

@alexpyattaev force-pushed the epoch_slots_unstaked branch from 038f5b2 to d20ce72 on March 4, 2025 20:11
@alexpyattaev (Author)

@gregcusack @bw-solana please take a look if this is what we need to stop unstaked nodes from polluting gossip

@gregcusack

> @gregcusack @bw-solana please take a look if this is what we need to stop unstaked nodes from polluting gossip

Just to confirm, based on a side convo: we are holding off on this until EpochSlots are ready to be fully removed, right?

@alexpyattaev (Author)

> Just to confirm, based on a side convo: we are holding off on this until EpochSlots are ready to be fully removed, right?

As far as I understand, we can do this right away, since EpochSlots made by unstaked nodes do little besides pollute gossip (repair will prioritize staked nodes anyway, and unstaked nodes do not take part in consensus).

@alexpyattaev marked this pull request as ready for review on March 7, 2025 11:47
@alexpyattaev (Author) commented Mar 7, 2025

@alessandrod mentioned that FD are happy with EpochSlots gone, can we at least start testing this simple solution that will cut bandwidth by 50%? Waiting for a complete solution can take months.

@bw-solana left a comment


Left a couple comments on the code.

But it sounds like we need to get aligned on direction...

My understanding is EpochSlots are used in repair and ancestor hash repair sampling services. If unstaked nodes stop pushing out EpochSlots, what is the expected behavior change? More concentrated repair/sampling load on the staked nodes?

If we're comfortable with the behavior changes for these services, this seems like high impact gossip bandwidth reduction.

@@ -79,6 +79,14 @@ impl ClusterSlotsService {
cluster_slots_update_receiver: ClusterSlotsUpdateReceiver,
exit: Arc<AtomicBool>,
) {
let node_id = cluster_info.id();
let my_stake = bank_forks
    .read()
    .unwrap()
    .root_bank()
    .current_epoch_stakes()
    .node_id_to_stake(&node_id)
    .unwrap_or_default();


I believe we need to derive each of these in the loop since they can change

@alexpyattaev (Author)

addressed in df0ffb9
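
For illustration, a minimal sketch of what deriving these inside the loop could look like; the accessors are the ones shown in the diff above, but the loop shape and gating here are assumptions, not the exact committed code:

```rust
loop {
    // Re-derive own stake each pass so a node that becomes staked
    // (or unstaked) mid-run changes behavior without a restart.
    let my_stake = bank_forks
        .read()
        .unwrap()
        .root_bank()
        .current_epoch_stakes()
        .node_id_to_stake(&node_id)
        .unwrap_or_default();
    let i_am_staked = my_stake > 0;
    // ... push EpochSlots into CRDS only when i_am_staked ...
}
```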

&cluster_slots_update_receiver,
&cluster_info,
);
// only staked nodes push EpochSlots into CRDS


would be better to include the "why" here

@alexpyattaev (Author)

addressed in df0ffb9
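
A comment carrying the "why" might read something like this (a sketch based on the reasoning in this thread, not the exact text committed):

```rust
// Only staked nodes push EpochSlots into CRDS: repair sampling is
// stake-weighted, so EpochSlots from unstaked nodes are effectively
// never used, and unstaked nodes take no part in consensus; pushing
// them only adds gossip traffic.
```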

@AshwinSekar

> My understanding is EpochSlots are used in repair and ancestor hash repair sampling services

Speaking from the consensus angle, one of the conditions for kicking off ancestor hashes repair is observing 52%+ of stake on a dead block via EpochSlots. If unstaked nodes do not push EpochSlots, this doesn't matter.
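
For concreteness, that 52% condition amounts to a stake-fraction check of roughly this shape (a sketch with assumed names and threshold handling, not the actual agave code):

```rust
// Hypothetical shape of the trigger: start ancestor hashes repair for a
// dead slot once the stake observed on it via EpochSlots crosses ~52%.
const ANCESTOR_HASHES_REPAIR_THRESHOLD: f64 = 0.52; // assumed constant name

fn should_start_ancestor_hashes_repair(observed_stake: u64, total_stake: u64) -> bool {
    observed_stake as f64 / total_stake as f64 >= ANCESTOR_HASHES_REPAIR_THRESHOLD
}
```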

However, when sampling for ancestor hashes repair (unlike normal repair) we select peers purely based on EpochSlots. For correctness it does not matter if we exclude unstaked nodes here, since we must already have seen that enough staked nodes have frozen this slot. Sampling only from staked nodes might add latency; however, if we're in a situation that relies on ancestor hashes repair, the cluster is already temporarily stuck.

I think the more important factor is whether we want to restrict regular repair to only staked nodes.

@alexpyattaev (Author)

Regular repair already heavily prefers staked nodes. The weight function is literally just the node's stake, and unstaked nodes are given a stake of 1.
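
In sketch form, the weighting described here (and visible in the filter_map quoted in a later comment) boils down to stake plus one; the helper name is hypothetical, the real selection lives in serve_repair.rs:

```rust
// Each peer's sampling weight is its stake in lamports plus one, so
// unstaked peers (stake 0) end up with weight 1 while staked peers
// keep weights many orders of magnitude larger.
fn repair_sample_weights(peer_stakes: &[u64]) -> Vec<u64> {
    peer_stakes.iter().map(|&stake| stake.saturating_add(1)).collect()
}
```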

@alexpyattaev force-pushed the epoch_slots_unstaked branch from df0ffb9 to 92475ba on March 8, 2025 18:21
@bw-solana

The code changes on this PR LGTM as far as the mechanics of removing EpochSlots, but I want to make sure everyone is on board that we aren't accidentally rugging any downstream services.

It sounds like we are cleared to remove EpochSlots from gossip for unstaked nodes from a consensus (ancestor hash repair sampling) perspective, according to @AshwinSekar (correct me if this is wrong).

Are we okay from a repair perspective? Any additional code changes we would need to make this work? @behzadnouri ?

Any concerns for FD? CC @ptaffet-jump

Comment on lines 105 to 111
let my_stake = bank_forks
.read()
.unwrap()
.root_bank()
.current_epoch_stakes()
.node_id_to_stake(&node_id)
.unwrap_or_default();


@alexpyattaev (Author)

Done in e3df68e

@alexpyattaev (Author)

Further digging: repair weights for unstaked nodes are set to 1, while repair weights for staked nodes are in the millions and above. Raw data from https://github.com/alexpyattaev/agave/blob/4a8e72c36ff6f36cdfe712af7b6edf2cc7825f59/core/src/repair/serve_repair.rs#L1087 looks like this:

14077635496876, 27799088767508, 100008717125, 1997717120, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

A similar thing is going on with ancestor hashes:

.filter_map(|(i, ci)| Some((slot_peers.get(ci.pubkey())? + 1, i)))

actual odds look like this:

len(weights) = 5732                       # total number of nodes in cluster
max(weights) = 13322909430537740          # max weight (in lamports)
np.median(weights) = 1.0                  # yep, most nodes are unstaked
sum(weights == 1) = 4405                  # 4405 unstaked nodes; that is also their total sampling weight
sum(weights[weights > 1]) = 377029995643360008  # total weight of all staked nodes
4405 / 377029995643360008 = 1.1683420552477143e-14  # chance of an unstaked node getting picked at all

So I think we have very low odds of actually picking any unstaked node even today.

@bw-solana

> Further digging: repair weights for unstaked nodes are set to 1 […] So I think we have very low odds of actually picking any unstaked node even today.

This matches my understanding. My takeaway: this change would have no measurable effect on repair concentration on staked nodes or on protocol security.

@gregcusack self-requested a review on March 13, 2025 19:06
@behzadnouri previously approved these changes Mar 13, 2025

@behzadnouri left a comment


please wait for Ashwin to also approve

@jeffwashington

> I think the more important factor is whether we want to restrict regular repair to only staked nodes.

It appears @AshwinSekar is mainly concerned with whether to restrict regular repair to staked nodes.
It appears @alexpyattaev has demonstrated with math that repair is already effectively restricted to staked nodes.

I think there is value in getting this in and getting the testing going. We have spilled a lot of ink on epoch slots.

@wen-coding commented Mar 13, 2025

> > I think the more important factor is whether we want to restrict regular repair to only staked nodes.
>
> It appears @AshwinSekar is mainly concerned with whether to restrict regular repair to staked nodes. It appears @alexpyattaev has demonstrated with math that repair is already effectively restricted to staked nodes.
>
> I think there is value in getting this in and getting the testing going. We have spilled a lot of ink on epoch slots.

Ashwin is OOO today, but I did chat with him earlier this week about this change. He thought it should be fine.

@alexpyattaev (Author)

Side note: an unstaked node will still push 3-4 EpochSlots messages on startup even with this patch applied. It happens in a completely different part of the code that I missed, and only once, so no need to patch it.

@alessandrod

backport to 2.2?

@bw-solana

> backport to 2.2?

I support this, but it would be good to make sure we've collected adequate signal on testnet for the "remove deprecated values from gossip pull messages" change, so as not to confuse things.

@gregcusack - do we have confirmation yet? I think we're still around 40% Agave stake on testnet.

@gregcusack commented Mar 14, 2025

No confirmation yet, since we're still waiting for a little more stake on Agave on testnet. TBH I believe the risk on that backport is very low, but good to make sure.

@alexpyattaev merged commit 145d562 into anza-xyz:master on Mar 14, 2025
47 checks passed
@alexpyattaev deleted the epoch_slots_unstaked branch on March 14, 2025 06:28
@alessandrod added the v2.2 (Backport to v2.2 branch) label on Mar 14, 2025
mergify bot commented Mar 14, 2025

Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis.

mergify bot pushed a commit that referenced this pull request Mar 14, 2025
* stop unstaked nodes from pushing EpochSlots into the cluster
* reload own stake on every epoch in case I become staked
* use epoch specs to reduce contention for bank forks

---------

Co-authored-by: Alex Pyattaev <[email protected]>

Big thanks to Behzad for code suggestions.

(cherry picked from commit 145d562)
alexpyattaev added a commit that referenced this pull request Mar 21, 2025
stop unstaked nodes from pushing EpochSlots into the cluster (backport of #5141) (#5286)

stop unstaked nodes from pushing EpochSlots into the cluster (#5141)

* stop unstaked nodes from pushing EpochSlots into the cluster
* reload own stake on every epoch in case I become staked
* use epoch specs to reduce contention for bank forks

---------

Co-authored-by: Alex Pyattaev <[email protected]>

Big thanks to Behzad for code suggestions.

(cherry picked from commit 145d562)

Co-authored-by: Alex Pyattaev <[email protected]>