vine: properly close link in the worker transfer server process #4076
Proposed Changes
There has been a lot of pain around tasks being forsaken and workflows getting stuck: the task forsaken loop in #4038, long retrieval times in #4007, @cmoore24-24's complaint about the workflow's long tail at the end, the manager's put request that may fail in #4061, temp files that cannot communicate with peers in #3763, and the imbalanced replica count for each temp file in #4056.

This is likely the culprit behind tasks being forsaken for no apparent reason. Tasks are forsaken because they are dispatched to a worker with the expectation that their inputs will be transferred from another worker. If the worker determines that one of a task's inputs is invalid, it intentionally forsakes the task. This behavior is expected: forsaken tasks are a symptom, not the root cause. The issue is not that workers forsake tasks when they shouldn't, but rather that worker transfers are failing.
The occurrence of tasks being forsaken appears to be quite random. As shown in the following figure, a task may be forsaken at different stages:
The root cause of all these symptoms is the frequent failure of worker transfers. The issue lies in the worker's transfer server process, where the transfer server opens a file descriptor in a child process to accept new connections but fails to properly close the opened file descriptors in the parent process. This leads to file descriptor leaks when a large number of worker transfers are in progress.
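The leak pattern described above can be illustrated with a minimal sketch. This is not the worker's actual transfer server code (which is not shown here); it is a generic fork-per-connection server in Python, and every name in it (`count_open_fds`, `serve_connections`, `run_demo`, `close_in_parent`) is hypothetical. It assumes a POSIX system with a loopback interface. The point it shows: after `fork()`, both parent and child hold the accepted connection, so the parent must close its own copy or its descriptor table grows by one per connection.

```python
import os
import socket


def count_open_fds(limit=1024):
    """Count open descriptors in this process by probing fds 0..limit-1."""
    count = 0
    for fd in range(limit):
        try:
            os.fstat(fd)
        except OSError:
            continue
        count += 1
    return count


def serve_connections(listener, n_connections, close_in_parent=True):
    """Accept n connections and hand each one to a forked child.

    Returns the sockets the parent kept open (empty when closing properly).
    """
    kept = []
    for _ in range(n_connections):
        conn, _addr = listener.accept()
        pid = os.fork()
        if pid == 0:
            # Child: it does not accept new connections, so drop the listener.
            listener.close()
            try:
                conn.sendall(b"ok")
            finally:
                conn.close()
                os._exit(0)
        # Parent: the child owns the connection now.
        if close_in_parent:
            conn.close()        # the fix: release the parent's copy
        else:
            kept.append(conn)   # the bug: the parent's copy stays open
        os.waitpid(pid, 0)
    return kept


def run_demo(n_connections=5, close_in_parent=True):
    """Return (descriptors leaked in the parent, replies seen by clients)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(n_connections)
    addr = listener.getsockname()
    # Connections complete in the listen backlog before accept() is called.
    clients = [socket.create_connection(addr) for _ in range(n_connections)]

    before = count_open_fds()
    kept = serve_connections(listener, n_connections, close_in_parent)
    after = count_open_fds()

    replies = [c.recv(2) for c in clients]
    for s in clients + kept:
        s.close()
    listener.close()
    return after - before, replies
```

Calling `run_demo(5, close_in_parent=False)` leaves the parent holding one extra descriptor per connection, while `run_demo(5, close_in_parent=True)` leaves none. In a long-running server the first case accumulates until the per-process limit is reached.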
In short, as transfer connections pile up, the parent process (the worker transfer server) hoards file descriptors until it hits the system limit (1024 per process, I guess?), which makes some transfers fail or get stuck in a long queue. Things get really bad when a task needs hundreds of input files via worker transfer: everything slows to a crawl, the waiting tasks end up being forsaken, and the application just sits there waiting for transfers to finish. That's why we saw some tasks wait for 30 minutes after resubmission: fds run out on the worker, transfers drag, and everything gets bottlenecked!
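The per-process limit mentioned above does not have to be guessed; it can be read at runtime. The following is a small sketch using Python's standard `resource` module (POSIX only). The function name `describe_fd_limit` is hypothetical. The soft limit is the one a process actually runs into; 1024 is a common default on Linux, but the value depends on the system and shell configuration.

```python
import resource


def describe_fd_limit():
    """Report the soft and hard limits on open file descriptors."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    def fmt(value):
        # RLIM_INFINITY means no limit is enforced at this level.
        return "unlimited" if value == resource.RLIM_INFINITY else str(value)

    return f"soft={fmt(soft)} hard={fmt(hard)}"


if __name__ == "__main__":
    print(describe_fd_limit())
```

Running this on a worker node shows how many descriptors the transfer server could hold before new connections start failing.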
And because the problem only occurs when the file descriptor limit per process is hit, we cannot reproduce the issue in every run:
Merge Checklist
The following items must be completed before PRs can be merged.
Check these off to verify you have completed all steps.
make test
Run local tests prior to pushing.
make format
Format source code to comply with lint policies. Note that some lint errors can only be resolved manually (e.g., Python).
make lint
Run lint on source code prior to pushing.