
try bounded channel for sigverify-retransmit #5091

Open
alexpyattaev wants to merge 3 commits into master from retransmit_stage_bounded_chan
Conversation

@alexpyattaev

Problem

  • The retransmit stage has been observed to stall, which results in unbounded growth of the channel from the sigverify stage
  • Bounded channels also have better performance overall

Summary of Changes

  • Change to a bounded channel with enough capacity to effectively never overflow: 1024 batches of up to 1024 shreds each, i.e. about 1M shreds, or roughly 10 Gbits, about 3 slots' worth (and we discard everything older than 1 slot anyway). The channel buffer itself occupies about 24 KB of RAM.
  • Add a metric, num_retransmit_stage_overflow, to keep an eye on the number of shreds that do not fit (see the sketch below).
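As a rough sketch of the pattern (not the literal diff; the ShredBatch alias and helper names below are placeholders, and in the real code the overflow count feeds the metric rather than being returned):

```rust
use crossbeam_channel::{bounded, Receiver, Sender, TrySendError};

/// Simplified stand-in for one batch of shreds coming out of sigverify.
type ShredBatch = Vec<Vec<u8>>;

/// 1024 slots, each holding one batch of up to 1024 packets (MAX_IOV).
fn make_retransmit_channel() -> (Sender<ShredBatch>, Receiver<ShredBatch>) {
    bounded(1024)
}

/// Send side: never block sigverify. If retransmit has stalled and the channel
/// is full, drop the batch and return the number of shreds that did not fit,
/// to be accumulated into num_retransmit_stage_overflow.
fn send_or_count_overflow(sender: &Sender<ShredBatch>, batch: ShredBatch) -> u64 {
    match sender.try_send(batch) {
        Ok(()) => 0,
        Err(TrySendError::Full(dropped) | TrySendError::Disconnected(dropped)) => {
            dropped.len() as u64
        }
    }
}
```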

@alexpyattaev added and then removed the noCI (Suppress CI on this Pull Request) label on Feb 27, 2025
@alexpyattaev marked this pull request as ready for review on February 27, 2025 20:54
@alexpyattaev

Tested on mainnet: no packets were dropped over 1 hour of operation. On average the channel is effectively empty, since it stores batches rather than individual packets.
[Screenshot 2025-02-27 at 22:57]

@alexpyattaev

@yihau can you check why the noCI label would not clear from this PR? Thanks!

@yihau added the CI (Pull Request is ready to enter CI) label on Feb 28, 2025
@anza-team removed the CI (Pull Request is ready to enter CI) label on Feb 28, 2025
core/src/tvu.rs (Outdated)
@@ -191,7 +191,7 @@ impl Tvu {
         );

         let (verified_sender, verified_receiver) = unbounded();
-        let (retransmit_sender, retransmit_receiver) = unbounded();
+        let (retransmit_sender, retransmit_receiver) = bounded(1024); // Allow for a max of 1024 batches of 1024 packets each (according to MAX_IOV).

[Review comment]

What is the typical batch size we see here on mainnet? My understanding is that it is typically only 1 or 2 shreds, which would mean we might fill the channel with only a couple thousand shreds (a single slot).

My understanding of the intent is that this is only to prevent OOM in extreme cases. Dropping shreds is a sign that something is not going well in the system. It seems we might want to make this channel even larger to be safe.

@alexpyattaev

This never dropped any shreds on mainnet in over 24 hours. Under heavy load the batches would be larger too; if batches do not become larger under load, we are doing something very wrong. If we only get 1-2 packets in a batch, we should just flatten the batches right away to reduce heap allocations (see the sketch below). I suggest we try this on testnet over a few weeks, and if it ever overflows we can bump the size up.
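For illustration, the flattening mentioned above amounts to something like this (a generic sketch, not code from this PR; the function name and element type are placeholders):

```rust
/// Coalesce many tiny batches into one contiguous batch before sending, so the
/// channel carries fewer, fuller entries and fewer small heap allocations
/// survive past this point.
fn flatten_batches<T>(batches: Vec<Vec<T>>) -> Vec<T> {
    batches.into_iter().flatten().collect()
}
```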

@alexpyattaev Mar 1, 2025

Update: 132 shreds were reported lost. These were isolated events unrelated to the number of shreds stored; the resulting shred loss is 0.02%.

@alexpyattaev

Incidentally, we have a problem with shred_fetch_stage: it simply does not make batches that are big enough, and when retransmit backs up, the channel can get clogged.
[Screenshot 2025-03-02 at 01:07]
It appears that the vast majority of the time we have fewer than 10 shreds in an individual vec pushed over the channel. This is better than 1-2, but still probably not good enough.

@alexpyattaev

Another 100 shreds were dropped over a 20-hour period.
Bumped the buffer length to 2048 entries just to be sure we never drop one.

@alexpyattaev Mar 4, 2025

Yes, I have seen the coalesce code, but I was not sure whether adjusting something like that would be in scope for this PR.

> The oldest shreds are probably so old as to be worthless to downstream nodes

This sounds reasonable. However, this should never happen, so I am not sure it makes a difference which ones we drop.
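For context, the coalescing being referenced boils down to something like the following sketch (not the actual shred_fetch_stage code; the parameter names are placeholders): keep pulling packets until either a size or a time budget is hit, so downstream channels see fewer, fuller batches.

```rust
use std::time::{Duration, Instant};

use crossbeam_channel::{Receiver, RecvTimeoutError};

/// Pull items from `rx` until `max_batch` items are gathered or `max_wait`
/// has elapsed, whichever comes first.
fn coalesce<T>(rx: &Receiver<T>, max_batch: usize, max_wait: Duration) -> Vec<T> {
    let deadline = Instant::now() + max_wait;
    let mut batch = Vec::with_capacity(max_batch);
    while batch.len() < max_batch {
        let now = Instant::now();
        if now >= deadline {
            break;
        }
        match rx.recv_timeout(deadline - now) {
            Ok(item) => batch.push(item),
            // On timeout or disconnect, return whatever has been gathered.
            Err(RecvTimeoutError::Timeout | RecvTimeoutError::Disconnected) => break,
        }
    }
    batch
}
```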

@alexpyattaev

Will test tomorrow whether the channel is big enough for extreme loads.

@alexpyattaev Mar 6, 2025

[Screenshot 2025-03-06 at 13:05]

Blocks with 18577 transactions of 1 KB each. No problems.

Thanks @KirillLykov for help running the test!

[Review comment]

Can you confirm how many shreds were generated per block during this attack?

It looks like the workload is rather bursty in nature (I'm assuming only 1/10 nodes generating the large blocks). Would be good to run with sustained traffic (pointing a handful of bench-tps instances at the cluster should do the trick) to see what happens when there's no idle time to "catch up". Also, adding some fake turbine tree nodes so that retransmit is more expensive would be interesting as well (one example of how I did this here: alessandrod@5b784b7)

@behzadnouri - can you think of anything else we should add to the experiment? I believe the goal here is to see how large we should size the channel such that we never actually hit the limit under "normal" operation.

@alexpyattaev

[Screenshot 2025-03-06 at 21:54]
Load testing with 5 of 10 leaders producing 18K TX blocks showed no drops, so the channel can handle massive numbers of shreds without problems as long as the retransmit stage keeps up.

Adding fake turbine peers overloads the retransmit stage and causes load shedding on the buffer, as expected.

@alexpyattaev requested a review from bw-solana on March 3, 2025 20:40
core/src/tvu.rs (Outdated)
@@ -191,7 +191,7 @@ impl Tvu {
         );

         let (verified_sender, verified_receiver) = unbounded();
-        let (retransmit_sender, retransmit_receiver) = unbounded();
+        let (retransmit_sender, retransmit_receiver) = bounded(2048); // Allow for a max of 2048 batches of up to 1024 packets each (according to MAX_IOV). In reality this holds about 15K shreds since most batches are never full.
[Review comment]

nit: Please place the comment on the line above and wrap it; a line like this can prevent cargo fmt from working properly.

Side note, we are looking to add a lint rule in another PR that will enforce this
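For reference, the requested placement would look roughly like this (the same comment text from the diff above, wrapped on the lines preceding the call so cargo fmt can format the line normally):

```rust
// Allow for a max of 2048 batches of up to 1024 packets each (according to
// MAX_IOV). In reality this holds about 15K shreds since most batches are
// never full.
let (retransmit_sender, retransmit_receiver) = bounded(2048);
```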

@alexpyattaev

Fixed in 847265e.

@alexpyattaev force-pushed the retransmit_stage_bounded_chan branch from 88f0071 to 847265e on March 7, 2025 12:23
@alexpyattaev

@cpubot what do you think? Good idea or not? Or should we go with the new EvictingSender here?

@cpubot commented Mar 21, 2025

> @cpubot what do you think? Good idea or not? Or should we go with the new EvictingSender here?

I agree with @bw-solana that we should prefer newer shreds, so EvictingSender would be a reasonable tool for the job.
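For illustration, the "prefer newer" semantics could look something like this sketch (not the actual EvictingSender API): when the bounded channel is full, the producer evicts the oldest queued batch and retries, so the freshest shreds survive.

```rust
use crossbeam_channel::{Receiver, Sender, TrySendError};

/// Drop-oldest send: the producer holds a clone of the Receiver so that, when
/// the bounded channel is full, it can pop the oldest queued batch and retry.
/// Returns how many old batches were evicted, which a metric such as
/// num_retransmit_stage_overflow could count.
fn send_prefer_newer<T>(sender: &Sender<T>, receiver: &Receiver<T>, mut batch: T) -> usize {
    let mut evicted = 0;
    loop {
        match sender.try_send(batch) {
            Ok(()) => return evicted,
            Err(TrySendError::Full(returned)) => {
                batch = returned;
                // Make room by discarding the oldest entry, then retry.
                if receiver.try_recv().is_ok() {
                    evicted += 1;
                }
            }
            Err(TrySendError::Disconnected(_)) => return evicted,
        }
    }
}
```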
