
try bounded channel for sigverify-retransmit #5091

Open
alexpyattaev wants to merge 3 commits into master from retransmit_stage_bounded_chan
Conversation

@alexpyattaev

Problem

  • The retransmit stage has been observed to stall, which results in unbounded growth of the channel from the sigverify stage
  • Bounded channels also have better performance overall

Summary of Changes

  • Change to a bounded channel with enough capacity to effectively never overflow: 1024 batches of up to 1024 shreds each, i.e. about 1M shreds, or roughly 10 Gbits, about 3 slots' worth (and we discard everything older than 1 slot anyway). The channel buffer itself occupies about 24 KB of RAM.
  • Add a metric, num_retransmit_stage_overflow, to keep an eye on the number of shreds that do not fit (see the sketch below).
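As a rough sketch of the pattern (not the literal diff; the ShredBatch alias and helper names below are placeholders, and in the real code the overflow count feeds the metric rather than being returned):

```rust
use crossbeam_channel::{bounded, Receiver, Sender, TrySendError};

/// Simplified stand-in for one batch of shreds coming out of sigverify.
type ShredBatch = Vec<Vec<u8>>;

/// 1024 slots, each holding one batch of up to 1024 packets (MAX_IOV).
fn make_retransmit_channel() -> (Sender<ShredBatch>, Receiver<ShredBatch>) {
    bounded(1024)
}

/// Send side: never block sigverify. If retransmit has stalled and the channel
/// is full, drop the batch and return the number of shreds that did not fit,
/// to be accumulated into num_retransmit_stage_overflow.
fn send_or_count_overflow(sender: &Sender<ShredBatch>, batch: ShredBatch) -> u64 {
    match sender.try_send(batch) {
        Ok(()) => 0,
        Err(TrySendError::Full(dropped) | TrySendError::Disconnected(dropped)) => {
            dropped.len() as u64
        }
    }
}
```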

@alexpyattaev added and then removed the noCI (Suppress CI on this Pull Request) label on Feb 27, 2025
@alexpyattaev marked this pull request as ready for review on February 27, 2025 20:54
@alexpyattaev

Tested on mainnet: no packets were dropped over 1 hour of operation. On average the channel is effectively empty, since it stores batches rather than individual packets.
[Screenshot 2025-02-27 at 22:57]

@alexpyattaev

@yihau can you check why the noCI label would not clear from this PR? Thanks!

@yihau added the CI (Pull Request is ready to enter CI) label on Feb 28, 2025
@anza-team removed the CI (Pull Request is ready to enter CI) label on Feb 28, 2025
core/src/tvu.rs (Outdated)
@@ -191,7 +191,7 @@ impl Tvu {
         );

         let (verified_sender, verified_receiver) = unbounded();
-        let (retransmit_sender, retransmit_receiver) = unbounded();
+        let (retransmit_sender, retransmit_receiver) = bounded(1024); // Allow for a max of 1024 batches of 1024 packets each (according to MAX_IOV).

[Review comment]

What is the typical batch size we see here on mainnet? My understanding is that it is typically only 1 or 2 shreds, which would mean we might fill the channel with only a couple thousand shreds (a single slot).

My understanding of the intent is that this is only to prevent OOM in extreme cases. Dropping shreds is a sign that something is not going well in the system. It seems we might want to make this channel even larger to be safe.

@alexpyattaev

This never dropped any shreds on mainnet in over 24 hours. Under heavy load the batches would be larger too; if batches do not become larger under load, we are doing something very wrong. If we only get 1-2 packets in a batch, we should just flatten the batches right away to reduce heap allocations (see the sketch below). I suggest we try this on testnet over a few weeks, and if it ever overflows we can bump the size up.
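For illustration, the flattening mentioned above amounts to something like this (a generic sketch, not code from this PR; the function name and element type are placeholders):

```rust
/// Coalesce many tiny batches into one contiguous batch before sending, so the
/// channel carries fewer, fuller entries and fewer small heap allocations
/// survive past this point.
fn flatten_batches<T>(batches: Vec<Vec<T>>) -> Vec<T> {
    batches.into_iter().flatten().collect()
}
```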

@alexpyattaev Mar 1, 2025

Update: 132 shreds were reported lost. These were isolated events unrelated to the number of shreds stored; the resulting shred loss is 0.02%.

@alexpyattaev

Incidentally, we have a problem with shred_fetch_stage: it simply does not make batches that are big enough, and when retransmit backs up, the channel can get clogged.
[Screenshot 2025-03-02 at 01:07]
It appears that the vast majority of the time we have fewer than 10 shreds in an individual vec pushed over the channel. This is better than 1-2, but still probably not good enough.

@alexpyattaev

Another 100 shreds were dropped over a 20-hour period.
Bumped the buffer length to 2048 entries just to be sure we never drop one.

@alexpyattaev Mar 4, 2025

Yes, I have seen the coalesce code, but I was not sure whether adjusting something like that would be in scope for this PR.

> The oldest shreds are probably so old as to be worthless to downstream nodes

This sounds reasonable. However, this should never happen, so I am not sure it makes a difference which ones we drop.
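For context, the coalescing being referenced boils down to something like the following sketch (not the actual shred_fetch_stage code; the parameter names are placeholders): keep pulling packets until either a size or a time budget is hit, so downstream channels see fewer, fuller batches.

```rust
use std::time::{Duration, Instant};

use crossbeam_channel::{Receiver, RecvTimeoutError};

/// Pull items from `rx` until `max_batch` items are gathered or `max_wait`
/// has elapsed, whichever comes first.
fn coalesce<T>(rx: &Receiver<T>, max_batch: usize, max_wait: Duration) -> Vec<T> {
    let deadline = Instant::now() + max_wait;
    let mut batch = Vec::with_capacity(max_batch);
    while batch.len() < max_batch {
        let now = Instant::now();
        if now >= deadline {
            break;
        }
        match rx.recv_timeout(deadline - now) {
            Ok(item) => batch.push(item),
            // On timeout or disconnect, return whatever has been gathered.
            Err(RecvTimeoutError::Timeout | RecvTimeoutError::Disconnected) => break,
        }
    }
    batch
}
```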

@alexpyattaev

Will test tomorrow whether the channel is big enough for extreme loads.

@alexpyattaev Mar 6, 2025

[Screenshot 2025-03-06 at 13:05]

Blocks with 18577 transactions of 1 KB each. No problems.

Thanks @KirillLykov for help running the test!

[Review comment]

Can you confirm how many shreds were generated per block during this attack?

It looks like the workload is rather bursty in nature (I'm assuming only 1/10 nodes generating the large blocks). Would be good to run with sustained traffic (pointing a handful of bench-tps instances at the cluster should do the trick) to see what happens when there's no idle time to "catch up". Also, adding some fake turbine tree nodes so that retransmit is more expensive would be interesting as well (one example of how I did this here: alessandrod@5b784b7)

@behzadnouri - can you think of anything else we should add to the experiment? I believe the goal here is to see how large we should size the channel such that we never actually hit the limit under "normal" operation.

@alexpyattaev

[Screenshot 2025-03-06 at 21:54]
Load testing with 5 of 10 leaders producing 18K TX blocks showed no drops, so the channel can handle massive numbers of shreds without problems as long as the retransmit stage keeps up.

Adding fake turbine peers overloads the retransmit stage and causes load shedding on the buffer, as expected.

@alexpyattaev requested a review from bw-solana on March 3, 2025 20:40
core/src/tvu.rs (Outdated)
@@ -191,7 +191,7 @@ impl Tvu {
         );

         let (verified_sender, verified_receiver) = unbounded();
-        let (retransmit_sender, retransmit_receiver) = unbounded();
+        let (retransmit_sender, retransmit_receiver) = bounded(2048); // Allow for a max of 2048 batches of up to 1024 packets each (according to MAX_IOV). In reality this holds about 15K shreds since most batches are never full.
[Review comment]

nit: Please place the comment on the line above and wrap it; a line like this can prevent cargo fmt from working properly.

Side note, we are looking to add a lint rule in another PR that will enforce this
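For reference, the requested placement would look roughly like this (the same comment text from the diff above, wrapped on the lines preceding the call so cargo fmt can format the line normally):

```rust
// Allow for a max of 2048 batches of up to 1024 packets each (according to
// MAX_IOV). In reality this holds about 15K shreds since most batches are
// never full.
let (retransmit_sender, retransmit_receiver) = bounded(2048);
```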

@alexpyattaev

Fixed in 847265e.

@alexpyattaev force-pushed the retransmit_stage_bounded_chan branch from 88f0071 to 847265e on March 7, 2025 12:23
@alexpyattaev

@cpubot what do you think? Good idea or not? Or should we go with the new EvictingSender here?

@cpubot commented Mar 21, 2025

> @cpubot what do you think? Good idea or not? Or should we go with the new EvictingSender here?

I agree with @bw-solana that we should prefer newer shreds, so EvictingSender would be a reasonable tool for the job.
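For illustration, the "prefer newer" semantics could look something like this sketch (not the actual EvictingSender API): when the bounded channel is full, the producer evicts the oldest queued batch and retries, so the freshest shreds survive.

```rust
use crossbeam_channel::{Receiver, Sender, TrySendError};

/// Drop-oldest send: the producer holds a clone of the Receiver so that, when
/// the bounded channel is full, it can pop the oldest queued batch and retry.
/// Returns how many old batches were evicted, which a metric such as
/// num_retransmit_stage_overflow could count.
fn send_prefer_newer<T>(sender: &Sender<T>, receiver: &Receiver<T>, mut batch: T) -> usize {
    let mut evicted = 0;
    loop {
        match sender.try_send(batch) {
            Ok(()) => return evicted,
            Err(TrySendError::Full(returned)) => {
                batch = returned;
                // Make room by discarding the oldest entry, then retry.
                if receiver.try_recv().is_ok() {
                    evicted += 1;
                }
            }
            Err(TrySendError::Disconnected(_)) => return evicted,
        }
    }
}
```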
