vine: properly close link in the worker transfer server process #4076

@JinZhou5042 JinZhou5042 commented Feb 26, 2025

Proposed Changes

There has been a lot of pain around tasks being forsaken and workflows getting stuck: the task forsaken loop in #4038, long retrieval times in #4007, @cmoore24-24's complaint about the workflow's long tail at the end, the manager's put requests failing in #4061, temp files unable to communicate with peers in #3763, and imbalanced replica counts across temp files in #4056.

The bug fixed here is likely the culprit behind tasks being forsaken for no apparent reason. Tasks are forsaken because they are dispatched to a worker with the expectation that their inputs will be transferred from another worker. If the worker determines that one of a task's inputs is invalid, it intentionally forsakes the task. This behavior is expected: tasks being forsaken is a symptom, not the root cause. The issue is not that workers forsake tasks when they shouldn't, but rather that worker transfers are failing.

The occurrence of tasks being forsaken appears to be quite random. As shown in the following figure, a task may be forsaken at different stages:

  • At the beginning of a worker's connection (which likely indicates that the destination worker is fine, but there are errors in the source worker).
  • In the middle of the application, on a random worker (suggesting the issue is not specific to a particular worker).
  • At the end of the application, with a straggler task sitting on a worker, which eventually determines that its inputs are invalid and is ultimately forsaken.
[image: figure showing tasks forsaken at different stages of the run]

The root cause of all these symptoms is the frequent failure of worker transfers. The issue lies in the worker's transfer server process, where the transfer server opens a file descriptor in a child process to accept new connections but fails to properly close the opened file descriptors in the parent process. This leads to file descriptor leaks when a large number of worker transfers are in progress.
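
The leak pattern described above can be illustrated outside of TaskVine. The following is a minimal Python sketch, not the cctools C code: `socket.socketpair()` stands in for an accepted transfer connection, and `count_open_fds` and `serve_connections` are hypothetical helpers introduced only for this illustration. It assumes Linux, since it reads `/proc/self/fd`. After `fork()`, both parent and child hold a copy of the connection's descriptor; if the parent never closes its copy, its descriptor count grows by one per connection.

```python
import os
import socket


def count_open_fds():
    """Count file descriptors open in this process (Linux-only assumption)."""
    return len(os.listdir("/proc/self/fd"))


def serve_connections(n, close_in_parent):
    """Hand n simulated connections to forked children.

    Returns how many descriptors the parent gained while serving them.
    """
    before = count_open_fds()
    held = []  # keeps leaked sockets alive so Python's GC does not close them
    for _ in range(n):
        # conn plays the role of an accepted connection, peer is the remote end.
        conn, peer = socket.socketpair()
        pid = os.fork()
        if pid == 0:
            # Child: serve the transfer on its copy of conn, then exit.
            try:
                peer.close()
                conn.sendall(b"ok")
            finally:
                os._exit(0)
        # Parent: the child owns the transfer now.
        peer.recv(2)
        peer.close()
        if close_in_parent:
            conn.close()       # the fix: drop the parent's copy
        else:
            held.append(conn)  # the bug: the parent's copy stays open
        os.waitpid(pid, 0)
    gained = count_open_fds() - before
    for sock in held:
        sock.close()
    return gained
```

Under these assumptions, `serve_connections(5, close_in_parent=False)` reports a gain of 5 descriptors in the parent, while `serve_connections(5, close_in_parent=True)` reports 0.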

In short, as transfer connections pile up, the parent process (the worker transfer server) hoards file descriptors until it hits the system limit (1024 per process, I guess?), which makes some transfers fail or get stuck in a long queue. Things get really bad when a task needs hundreds of input files via worker transfer: everything slows to a crawl, the waiting tasks end up being forsaken, and the application just sits there waiting for transfers to finish. That's why we saw some tasks wait for 30 minutes after resubmission: fds run out on the worker, transfers drag, and everything gets bottlenecked!
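
The per-process limit mentioned above can be checked rather than guessed. The sketch below is a Python illustration, not part of TaskVine: `fds_until_exhaustion` is a hypothetical helper that temporarily lowers the soft `RLIMIT_NOFILE` limit, opens descriptors until the kernel refuses, and reports the resulting error code. It assumes a POSIX system where the `resource` module is available.

```python
import errno
import os
import resource


def fds_until_exhaustion(soft_limit):
    """Open descriptors under a lowered soft limit until the kernel refuses.

    Returns (number of descriptors opened, errno of the failure or None).
    """
    old_soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, hard))
    held = []
    err = None
    try:
        # Bounded loop: at most soft_limit descriptors can ever be open.
        for _ in range(soft_limit + 1):
            held.append(os.open(os.devnull, os.O_RDONLY))
    except OSError as exc:
        err = exc.errno
    finally:
        # Release everything and restore the original limit.
        for fd in held:
            os.close(fd)
        resource.setrlimit(resource.RLIMIT_NOFILE, (old_soft, hard))
    return len(held), err


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("current soft/hard fd limits:", soft, hard)
    opened, err = fds_until_exhaustion(64)
    print("opened", opened, "descriptors before", errno.errorcode.get(err))
```

Once the limit is reached, every further `open`, `accept`, or `socket` call in that process fails with `EMFILE`, which is consistent with transfers failing or stalling on a worker that has leaked descriptors.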

And because the problem only occurs when the file descriptor limit per process is hit, we cannot reproduce the issue in every run:

  • sometimes there are more workers, which helps distribute the peer transfer burden, making the issue less apparent.
  • sometimes a worker unintentionally accumulates a large number of inputs for a task, and when that task is dispatched to another worker, it results in a straggler issue.
  • sometimes even when the process reaches the fd limit, the OS may manage to clean up stale file descriptors when a new one is opened, preventing the issue from occurring.

Merge Checklist

The following items must be completed before PRs can be merged.
Check these off to verify you have completed all steps.

  • make test Run local tests prior to pushing.
  • make format Format source code to comply with lint policies. Note that some lint errors can only be resolved manually (e.g., Python)
  • make lint Run lint on source code prior to pushing.
  • Manual Update: Update the manual to reflect user-visible changes.
  • Type Labels: Select a github label for the type: bugfix, enhancement, etc.
  • Product Labels: Select a github label for the product: TaskVine, Makeflow, etc.
  • PR RTM: Mark your PR as ready to merge.

@JinZhou5042 JinZhou5042 marked this pull request as ready for review February 26, 2025 17:57