vine: properly close link in the worker transfer server process #4076

JinZhou5042 · 2025-02-26T17:57:50Z

Proposed Changes

There has been a lot of pain around tasks being forsaken and workflows getting stuck: the task forsaken loop in #4038, long retrieval times in #4007, @cmoore24-24's complaint about the workflow's long tail at the end in, the manager's put requests failing in #4061, temp files being unable to communicate with peers in #3763, and imbalanced replica counts across temp files in #4056.

The bug fixed here is likely the culprit behind tasks being forsaken for no apparent reason. Tasks are forsaken because they are dispatched to a worker with the expectation that their inputs will be transferred from another worker. If the worker determines that one of a task's inputs is invalid, it intentionally forsakes the task. This behavior is expected: tasks being forsaken is a symptom, not the root cause. The issue is not that workers forsake tasks when they shouldn't, but rather that worker transfers are failing.

The occurrence of tasks being forsaken appears to be quite random. As shown in the following figure, a task may be forsaken at different stages:

At the beginning of a worker's connection (which likely indicates that the destination worker is fine, but there are errors in the source worker).
In the middle of the application, on a random worker (suggesting the issue is not specific to a particular worker).
At the end of the application, with a straggler task sitting on a worker, which eventually determines that its inputs are invalid and is ultimately forsaken.

The root cause of all these symptoms is the frequent failure of worker transfers. The issue lies in the worker's transfer server process, where the transfer server opens a file descriptor in a child process to accept new connections but fails to properly close the opened file descriptors in the parent process. This leads to file descriptor leaks when a large number of worker transfers are in progress.
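The leak pattern described above can be sketched in a few lines of C. This is a minimal illustration, not TaskVine's actual code: `serve_transfer`, `dispatch_connection`, `run_transfers`, and the `close_in_parent` flag are hypothetical names, and a `socketpair` stands in for a connection accepted from a peer worker. After `fork()`, both parent and child hold a copy of the connection descriptor, so the parent has to close its own copy or it gains one descriptor per transfer.

```c
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Count descriptors currently open in this process (scan capped at 4096). */
int count_open_fds(void)
{
    int count = 0;
    for (int fd = 0; fd < 4096; fd++) {
        if (fcntl(fd, F_GETFD) != -1)
            count++;
    }
    return count;
}

/* Hypothetical per-transfer work done by the child: send a short reply. */
static void serve_transfer(int conn_fd)
{
    ssize_t sent = write(conn_fd, "ok", 2);
    (void)sent;
}

/* Hand one connection to a child process. The child serves it and exits.
   With close_in_parent == 0 the parent keeps its copy open (the leak);
   with close_in_parent != 0 the parent releases it (the fix). */
static pid_t dispatch_connection(int conn_fd, int close_in_parent)
{
    pid_t pid = fork();
    if (pid == 0) {
        serve_transfer(conn_fd);
        close(conn_fd);
        _exit(0);
    }
    if (close_in_parent)
        close(conn_fd);
    return pid;
}

/* Simulate n transfers and return how many descriptors the parent gained,
   or -1 if a socket pair could not be created. */
int run_transfers(int n, int close_in_parent)
{
    int before = count_open_fds();
    for (int i = 0; i < n; i++) {
        int sv[2]; /* sv[0]: peer side, sv[1]: simulated accepted connection */
        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
            return -1;
        pid_t pid = dispatch_connection(sv[1], close_in_parent);
        if (pid > 0) {
            char reply[2];
            ssize_t got = read(sv[0], reply, sizeof reply);
            (void)got;
            waitpid(pid, NULL, 0);
        }
        close(sv[0]);
    }
    return count_open_fds() - before;
}
```

Calling `run_transfers(8, 0)` leaves the parent holding one extra descriptor per simulated transfer, while `run_transfers(8, 1)` leaves its descriptor count unchanged.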

In short, as transfer connections pile up, the parent process (the worker transfer server) hoards file descriptors until it hits the system limit (1024 per process, I guess?), which makes some transfers fail or get stuck in a long queue. Things get really bad when a task needs hundreds of input files via worker transfer: everything slows to a crawl, the waiting tasks end up being forsaken, and the application just sits there waiting for transfers to finish. That's why we saw some tasks wait for 30 minutes after resubmission: fds run out on the worker, transfers drag, and everything gets bottlenecked!
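The per-process limit guessed at above can be checked directly rather than assumed. The sketch below uses the POSIX `getrlimit` call with `RLIMIT_NOFILE`. The helper name `nofile_soft_limit` is hypothetical, and the value it reports depends on how the worker was launched, so 1024 is only a common default and not a guarantee.

```c
#include <sys/resource.h>

/* Return the soft limit on open descriptors for this process,
   or -1 if it cannot be determined or is unlimited. */
long nofile_soft_limit(void)
{
    struct rlimit lim;
    if (getrlimit(RLIMIT_NOFILE, &lim) != 0)
        return -1;
    if (lim.rlim_cur == RLIM_INFINITY)
        return -1;
    return (long)lim.rlim_cur;
}
```

Comparing this value against the number of descriptors a worker currently holds shows how close the transfer server is to the ceiling.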

And because the problem only occurs when the file descriptor limit per process is hit, we cannot reproduce the issue in every run:

Sometimes there are more workers, which helps distribute the peer transfer burden, making the issue less apparent.
Sometimes a worker unintentionally accumulates a large number of inputs for a task, and when that task is dispatched to another worker, it results in a straggler issue.
Sometimes even when the process reaches the fd limit, the OS may manage to clean up stale file descriptors when a new one is opened, preventing the issue from occurring.

Merge Checklist

The following items must be completed before PRs can be merged.
Check these off to verify you have completed all steps.

make test Run local tests prior to pushing.
make format Format source code to comply with lint policies. Note that some lint errors can only be resolved manually (e.g., Python)
make lint Run lint on source code prior to pushing.
Manual Update: Update the manual to reflect user-visible changes.
Type Labels: Select a github label for the type: bugfix, enhancement, etc.
Product Labels: Select a github label for the product: TaskVine, Makeflow, etc.
PR RTM: Mark your PR as ready to merge.

vine: close transfer link in the server process

932ba06

JinZhou5042 marked this pull request as ready for review February 26, 2025 17:57

btovar approved these changes Feb 26, 2025

btovar merged commit e657730 into cooperative-computing-lab:master Feb 26, 2025
9 of 10 checks passed

btovar pushed a commit that referenced this pull request Feb 26, 2025

vine: close transfer link in the server process (#4076)

69c88ae

JinZhou5042 deleted the worker_transfer_server_should_close_fd branch February 26, 2025 18:24
