try bounded channel for sigverify-retransmit #5091
Conversation
@yihau can you check why NOCI would not clear from this PR? Thanks!
core/src/tvu.rs
Outdated
```diff
@@ -191,7 +191,7 @@ impl Tvu {
         );

         let (verified_sender, verified_receiver) = unbounded();
-        let (retransmit_sender, retransmit_receiver) = unbounded();
+        let (retransmit_sender, retransmit_receiver) = bounded(1024); // Allow for a max of 1024 batches of 1024 packets each (according to MAX_IOV).
```
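For readers outside the diff context, here is a minimal, hedged sketch of what the bounded channel changes, using plain crossbeam-channel rather than the actual Agave code (the batch type and drop counter below are made up for illustration): `try_send` fails fast when the buffer is full, so the producer drops a batch instead of letting the queue grow until the process runs out of memory.

```rust
use crossbeam_channel::{bounded, TrySendError};

// Illustrative stand-in for a batch of shreds/packets; not the real type.
type PacketBatch = Vec<[u8; 1232]>;

fn main() {
    // A capacity of 1024 batches bounds worst-case memory instead of
    // letting an unbounded queue grow until the process OOMs.
    let (retransmit_sender, retransmit_receiver) = bounded::<PacketBatch>(1024);

    let mut num_shreds_dropped = 0usize;
    let batch: PacketBatch = vec![[0u8; 1232]; 2];

    // Producer side: never block sigverify. If retransmit has fallen
    // behind and the channel is full, drop the batch and count it.
    match retransmit_sender.try_send(batch) {
        Ok(()) => {}
        Err(TrySendError::Full(dropped)) => num_shreds_dropped += dropped.len(),
        Err(TrySendError::Disconnected(_)) => return,
    }

    // Consumer side drains whatever is queued, as before.
    while let Ok(batch) = retransmit_receiver.try_recv() {
        let _ = batch.len();
    }
    let _ = num_shreds_dropped;
}
```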
what is the typical batch size we see here on mainnet? My understanding is it is typically only 1 shred or 2, which would mean we might fill the channel with only a couple thousand shreds (single slot).
My understanding of the intent is that this is only to prevent OOM in extreme cases. Dropping shreds is a sign that something is not going well in the system. Seems we might want to make this channel even larger in size to be safe.
This never dropped any shreds on mainnet in over 24 hours. Under heavy load the batches would be larger too; if batches do not become larger under load, we are doing something very wrong. If we only get 1-2 packets in a batch, we should just flatten the batches right away to reduce heap allocations. I suggest we try this on testnet over a few weeks, and if it ever overflows we can bump the size up.
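As a sketch of the flattening idea mentioned above (the helper name is hypothetical, not the actual shred_fetch_stage code): collapsing many tiny batches into a single `Vec` means one channel message and one allocation instead of dozens of 1-2 element ones.

```rust
// Hypothetical helper: collapse many small batches into one Vec before
// pushing them over the channel.
fn flatten_batches<T>(batches: Vec<Vec<T>>) -> Vec<T> {
    let total: usize = batches.iter().map(Vec::len).sum();
    let mut flat = Vec::with_capacity(total);
    for batch in batches {
        flat.extend(batch);
    }
    flat
}

fn main() {
    // Ten 2-element batches become a single 20-element batch.
    let batches: Vec<Vec<u64>> = (0..10).map(|i| vec![i, 2 * i]).collect();
    assert_eq!(flatten_batches(batches).len(), 20);
}
```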
Update: got 132 shreds reported lost. These were isolated events unrelated to the number of shreds stored. The resulting shred loss is 0.02%.
Incidentally, we have a problem with shred_fetch_stage: it simply does not make batches that are big enough, and when retransmit backs up, the channel can get clogged.
It appears that the vast majority of the time we have <10 shreds in an individual vec pushed over the channel. This is better than 1-2, but still probably not good enough.
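One way the coalescing could look, as a hedged sketch rather than the real shred_fetch_stage logic: block for the first batch, then keep appending whatever arrives within a short window, so downstream sees a few larger batches instead of many tiny ones.

```rust
use std::time::{Duration, Instant};

use crossbeam_channel::{unbounded, Receiver, RecvTimeoutError};

// Hypothetical coalescing loop: wait for the first item, then keep
// appending whatever arrives before the window closes.
fn recv_coalesced(
    receiver: &Receiver<Vec<u8>>,
    window: Duration,
) -> Result<Vec<u8>, RecvTimeoutError> {
    let mut coalesced = receiver.recv_timeout(window)?;
    let deadline = Instant::now() + window;
    // recv_deadline returns Err(Timeout) once the window closes.
    while let Ok(more) = receiver.recv_deadline(deadline) {
        coalesced.extend(more);
    }
    Ok(coalesced)
}

fn main() {
    let (sender, receiver) = unbounded();
    for shred in 0u8..5 {
        sender.send(vec![shred]).unwrap();
    }
    let batch = recv_coalesced(&receiver, Duration::from_millis(5)).unwrap();
    assert_eq!(batch.len(), 5);
}
```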
Another 100 shreds dropped over a 20-hour period.
Bumped the buffer length to 2048 entries just to be sure we never drop one.
Yes, I have seen the coalesce code, but I was not sure whether adjusting something like that would be in scope for this PR.

> The oldest shreds are probably so old as to be worthless to downstream nodes

This sounds reasonable. However, this should never happen, so I am not sure it makes a difference which ones we drop.
Will test tomorrow whether the channel is big enough for extreme loads.

Tested blocks with 18577 transactions of 1 KB each. No problems.
Thanks @KirillLykov for help running the test!
Can you confirm how many shreds were generated per block during this attack?
It looks like the workload is rather bursty in nature (I'm assuming only 1/10 nodes generating the large blocks). Would be good to run with sustained traffic (pointing a handful of bench-tps instances at the cluster should do the trick) to see what happens when there's no idle time to "catch up". Also, adding some fake turbine tree nodes so that retransmit is more expensive would be interesting as well (one example of how I did this here: alessandrod@5b784b7)
@behzadnouri - can you think of anything else we should add to the experiment? I believe the goal here is to see how large we should size the channel such that we never actually hit the limit under "normal" operation.
core/src/tvu.rs
Outdated
```diff
@@ -191,7 +191,7 @@ impl Tvu {
         );

         let (verified_sender, verified_receiver) = unbounded();
-        let (retransmit_sender, retransmit_receiver) = unbounded();
+        let (retransmit_sender, retransmit_receiver) = bounded(2048); // Allow for a max of 2048 batches of up to 1024 packets each (according to MAX_IOV). In reality this holds about 15K shreds since most batches are never full.
```
Nit: please place the comment on the line above and wrap it; a line like this can prevent cargo fmt from working properly.
Side note: we are looking to add a lint rule in another PR that will enforce this.
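For clarity, the requested shape is roughly this, i.e. the same line from the diff above with the comment wrapped onto its own lines so rustfmt can still format the statement:

```rust
// Allow for a max of 2048 batches of up to 1024 packets each (according
// to MAX_IOV). In reality this holds about 15K shreds since most batches
// are never full.
let (retransmit_sender, retransmit_receiver) = bounded(2048);
```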
fixed in 847265e
@cpubot what do you think? Good idea or not? Or should we go with the new EvictingSender here?
I agree with @bw-solana that we should prefer newer shreds, so EvictingSender would be a reasonable tool for the job.
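For reference, a hedged sketch of the "prefer newer shreds" semantics being discussed, built on a plain bounded channel (this only illustrates the eviction idea; it is not the actual EvictingSender implementation): when the channel is full, pop the oldest queued item and retry the send, so the newest data survives.

```rust
use crossbeam_channel::{bounded, Receiver, Sender, TrySendError};

// Sketch of "keep the newest" semantics: if the channel is full, evict
// the oldest queued item and retry. Returns how many items were evicted.
fn send_evicting<T>(sender: &Sender<T>, receiver: &Receiver<T>, mut item: T) -> usize {
    let mut evicted = 0;
    loop {
        match sender.try_send(item) {
            Ok(()) => return evicted,
            Err(TrySendError::Full(returned)) => {
                item = returned;
                // Drop the oldest entry; the newest data is kept instead.
                if receiver.try_recv().is_ok() {
                    evicted += 1;
                }
            }
            Err(TrySendError::Disconnected(_)) => return evicted,
        }
    }
}

fn main() {
    let (sender, receiver) = bounded(2);
    for shred in 0..5 {
        send_evicting(&sender, &receiver, shred);
    }
    // Only the two newest items survive.
    assert_eq!(receiver.try_recv().unwrap(), 3);
    assert_eq!(receiver.try_recv().unwrap(), 4);
}
```

One caveat with this pattern is that the producer needs a handle to the receiver in order to evict, which is presumably part of what a dedicated EvictingSender abstraction would hide.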