Swip responsible nbhood split #43
base: master
Conversation
Having read through this SWIP, watching what is about to occur on the Sepolia testnet swarm, and having monitored pusher behavior when errors occur, I have doubts about the wisdom of pausing new chunk acceptance with an error message to the pusher. The pusher tries really hard (and fast) to deliver deferred chunks. When an error occurs, it just keeps retrying the push until some other node accepts the chunk. This happens on errors as well as when a "shallow receipt depth" is detected, and the latter is what I suspect would eventually happen if the target neighborhood rejects a push because it cannot split. This would then cause an outward ripple effect: the "shallow" chunk-accepting nodes, having errored out all of their closer peers, would accept the chunks into their own reserves, eventually filling them and causing yet another neighborhood to attempt a split and possibly pause. Rinse and repeat, in an outward direction.

IMHO, it would be better for the over-full, cannot-split neighborhood nodes to continue accepting chunks so that the swarm can continue to fully operate. New data could still be stored, and existing data would not be evicted until the newly split neighborhoods have sufficient peers to cover them. Then reserve evictions could resume, knowing that the chunks have a new, protected home.

I've often thought that nodes should maintain a pseudo-reserve, a secure cache, where they continuously pull and retain chunks for their adjacent neighborhoods. I call this a pseudo-reserve because these chunks would be stored WITH their stamps, unlike the stamp-less chunks in the cache. That way, the stamped chunks can be pulled back into the adjacent neighborhoods when/if new nodes appear to cover them. This provides better storage redundancy, and even ensures (somewhat) retrievability, because Kademlia routing gets requests "close" to the target storage neighborhood.
Retrieval requests would (hopefully, or eventually) be routed through the adjacent-neighborhood nodes, which would be able to satisfy the request from the pseudo-reserve. As an extension, the storage compensation Schelling game could actually be competed in the pseudo-neighborhoods, because in theory they would be fully populated with all of the required chunks.
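To make the "outward ripple" concern above concrete, here is a deliberately simplified toy model (not Bee's actual push-sync code; all names and numbers are illustrative): neighbourhoods are ordered by proximity to the chunk address, each with a fixed reserve capacity, and a full, cannot-split neighbourhood rejects the push, so the pusher falls back to the next shallower one.

```python
# Toy model of push-sync fallback when a full neighbourhood rejects chunks.
# Hypothetical simplification: a list of reserves ordered from closest
# (index 0) to shallowest, each holding at most `capacity` chunks.

def push_chunk(reserves, capacity):
    """Try neighbourhoods from closest outward; return the index of the one
    that accepted the chunk, or None if every reserve is full."""
    for depth, used in enumerate(reserves):
        if used < capacity:
            reserves[depth] = used + 1
            return depth
    return None

# Three neighbourhoods with capacity 2 each; the closest starts full.
reserves = [2, 1, 0]
accepted = [push_chunk(reserves, capacity=2) for _ in range(3)]
# The closest neighbourhood rejects everything, so pushes spill outward
# and progressively fill the shallower reserves as well.
print(accepted)   # [1, 2, 2]
print(reserves)   # [2, 2, 2]
```

In this toy run every push lands one neighbourhood shallower than intended, and after a few pushes the shallower reserves are full too, which is exactly the cascading saturation the comment warns about.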
Keeping an extra "reserve" for accepting chunks could be problematic, as one does not know in advance how many chunks it would need to accept; it could perhaps fill up the hard drive, or stop at some point, where again the mechanism described would need to be used. A node could also signal, well before it runs out of space, that a negative situation is arising. As for the situation described above, if I understand correctly, it should be solved in general, so that error messages do not overwhelm the network and the network adapts more appropriately. Adding @istae to the thread.
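The early-warning idea above can be sketched as a simple utilisation threshold. This is only an illustration of the signalling concept; the function name, threshold, and status strings are invented for this sketch and are not part of Bee.

```python
# Sketch: a node emits a saturation warning once reserve utilisation
# crosses a threshold, well before it actually has to reject chunks.
# The 90% warning ratio is an arbitrary illustrative choice.

def reserve_status(used_chunks, reserve_capacity, warn_ratio=0.9):
    """Return 'ok', 'warning' (approaching capacity), or 'full'."""
    if used_chunks >= reserve_capacity:
        return "full"
    if used_chunks >= warn_ratio * reserve_capacity:
        return "warning"
    return "ok"

print(reserve_status(500_000, 1_000_000))    # ok
print(reserve_status(950_000, 1_000_000))    # warning
print(reserve_status(1_000_000, 1_000_000))  # full
```

A signal like this would give neighbouring nodes (or uploaders) time to react before any chunks are actually refused.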
1. Chunk distribution is fairly uniform, so any file of reasonable size can be expected to push at least one chunk to every neighbourhood.
2. A file upload constitutes a failure if any of its chunks fails to upload.
3. Swarm's operation with the required redundancy is incentivised by the pricing mechanism.
4. It is not in the direct interest of node operators to stop accepting chunks.
5. An erasure-coded content upload should not necessarily be considered a failure if chunk pushes to certain neighbourhoods fail, since downloaders can still reconstruct the content.
6. The new stewardship tool for repairing the missing chunks of an erasure-coded file ensures that content can survive temporary upload failures, as long as the proportion of filled-up neighbourhoods does not exceed the chunk failure rate assumed by the respective erasure coding redundancy level.
Points 1 and 2 imply that it is sufficient to indicate to users directly that the network may be saturated and that, because certain neighbourhoods are unable to split, file upload is problematic.
Points 3 and 4 imply that the problem is tackled by the incentive system, and also that any further measures such as those proposed in this SWIP are not incentive-aligned and at best of dubious added value, given the complexity and reliability cost of compliance.
Points 5 and 6 imply that there may be no merit in rejecting uploads due to saturated neighbourhoods; in fact there is already a user-side mitigation that enables content to survive temporary non-storage, helped by so-called stewards who do have the incentive to keep files retrievable.
Although I think the observation in this SWIP that users may want to be informed about whether their newly uploaded content will be retrievable from the network is a correct one, it is equally valid for the retrievability of past data. Therefore I recommend implementing a warning, as part of network monitoring, that tracks neighbourhood health and suggests the appropriate level of erasure redundancy necessary for downloads, and therefore required for uploads.
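The survival condition behind points 5 and 6 can be stated numerically. In a Reed-Solomon style scheme with n shards, any k of which suffice to reconstruct, content tolerates losing n - k shards; if chunks are spread uniformly across neighbourhoods, an upload survives as long as the fraction of saturated (chunk-rejecting) neighbourhoods stays at or below (n - k) / n. The parameters below are illustrative only and are not Swarm's actual redundancy levels.

```python
# Rough check of the erasure-coding survival condition: with n shards and
# k required for reconstruction, up to (n - k) / n of the shards may be
# lost before the content becomes unrecoverable.

def tolerated_failure_rate(n_shards, k_required):
    """Maximum fraction of shards (chunks) that may fail to be stored."""
    return (n_shards - k_required) / n_shards

def upload_survives(saturated_fraction, n_shards, k_required):
    """True if the saturated-neighbourhood fraction is within tolerance."""
    return saturated_fraction <= tolerated_failure_rate(n_shards, k_required)

# Illustrative parameters: 128 shards, any 112 of which reconstruct.
print(tolerated_failure_rate(128, 112))  # 0.125
print(upload_survives(0.10, 128, 112))   # True
print(upload_survives(0.20, 128, 112))   # False
```

A monitoring tool like the one recommended above could invert this check: measure the fraction of unhealthy neighbourhoods and suggest the smallest redundancy level whose tolerance exceeds it.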
The motivation for the SWIP is for nodes not to increase their radius if chunks would get lost in the process. It gives priority to data that is already stored over data that is yet to be uploaded. Sure, uploads might not proceed, but existing data will not be discarded. (It does not solve the problem of a whole neighbourhood of nodes leaving the network, so it is a partial solution at best.) Any monitoring that instructs uploaders which EC level to use would also only cover future uploads; past uploads with lesser EC guarantees might not meet the "current" requirements. It is my belief that it is in the interest of nodes (node operators) for data to persist on the network as expected by the uploaders, since otherwise uploaders will stop using the network, which is not in the interest of node operators.
Describe how a responsible neighborhood split / storage radius increase should be handled by a node.