Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

p2p/discover: improved node revalidation #687

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

sonhv0212
Copy link
Contributor

@sonhv0212 sonhv0212 commented Feb 6, 2025

Cherry-picked following commits from go-ethereum to improve table revalidation and fix some follow-after issues

ethereum/go-ethereum#29572
ethereum/go-ethereum#29864
ethereum/go-ethereum#29836
ethereum/go-ethereum#30239

@sonhv0212 sonhv0212 force-pushed the discovery-new-reval branch 2 times, most recently from b31535a to 856723c Compare February 7, 2025 03:57
@minh-bq minh-bq added this to the v2.9.1 milestone Feb 10, 2025
@minh-bq
Copy link
Collaborator

minh-bq commented Feb 10, 2025

We should copy description from the above commit to this PR.

This PR will revert the effect of this commit: f0935e6 ("p2p/discover: apply enr filtering in discovery phase, fix minor logic in doRevalidate"). So maybe @huyngopt1994 wants to take a look.

@minh-bq minh-bq requested a review from huyngopt1994 February 10, 2025 06:47
@huyngopt1994
Copy link
Collaborator

@minh-bq Yes, @sonhv0212 need to rework this PR again for adapting with the implementing enr filtering .

@huyngopt1994
Copy link
Collaborator

@sonhv0212 Coul u rework this PR which including some commits from Private repo in here ?

fjl and others added 4 commits February 14, 2025 14:14
Node discovery periodically revalidates the nodes in its table by sending PING, checking
if they are still alive. I recently noticed some issues with the implementation of this
process, which can cause strange results such as nodes dropping unexpectedly, certain
nodes not getting revalidated often enough, and bad results being returned to incoming
FINDNODE queries.

In this change, the revalidation process is improved with the following logic:

- We maintain two 'revalidation lists' containing the table nodes, named 'fast' and 'slow'.
- The process chooses random nodes from each list on a randomized interval, the interval being
  faster for the 'fast' list, and performs revalidation for the chosen node.
- Whenever a node is newly inserted into the table, it goes into the 'fast' list.
  Once validation passes, it transfers to the 'slow' list. If a request fails, or the
  node changes endpoint, it transfers back into 'fast'.
- livenessChecks is incremented by one for successful checks. Unlike the old implementation,
  we will not drop the node on the first failing check. We instead quickly decay the
  livenessChecks give it another chance.
- Order of nodes in bucket doesn't matter anymore.

I am also adding a debug API endpoint to dump the node table content.

Co-Authored-By: Martin HS <[email protected]>
In #29572, I assumed the revalidation list that the node is contained in could only ever
be changed by the outcome of a revalidation request. But turns out that's not true: if the
node gets removed due to FINDNODE failure, it will also be removed from the list it is in.
This causes a crash.

The invariant is: while node is in table, it is always in exactly one of the two lists. So
it seems best to store a pointer to the current list within the node itself.
It seems the semantic differences between addFoundNode and addInboundNode were lost in
(and are unsure if is available) whereas addInboundNode is for adding nodes that have
contacted the local node and we can verify they are active.

handleAddNode seems to be the consolidation of those two methods, yet it bumps the node in
the bucket (updating it's IP addr) even if the node was not an inbound. This PR fixes
this. It wasn't originally caught in tests like TestTable_addSeenNode because the
manipulation of the node object actually modified the node value used by the test.

New logic is added to reject non-inbound updates unless the sequence number of the
(signed) ENR increases. Inbound updates, which are published by the updated node itself,
are always accepted. If an inbound update changes the endpoint, the node will be
revalidated on an expedited schedule.

Co-Authored-By: Felix Lange <[email protected]>
…(#30239)

If `nextTime` has passed, but all nodes are excluded, `get` would return
`nil` and `run` would therefore not invoke `schedule`. Then, we schedule
a timer for the past, as neither `nextTime` value has been updated. This
creates a busy loop, as the timer immediately returns.

With this PR, revalidation will be also rescheduled when all nodes are
excluded.

---------

Co-Authored-By: lightclient <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants