
[Jobs] Move task retry logic to correct branch in stream_logs_by_id #4407

Open · wants to merge 14 commits into master
Conversation

@andylizf (Contributor) commented Nov 24, 2024:

Fixes #4250

The task retry logic in `stream_logs_by_id` was placed in the wrong branch of the if-else structure: it sat in the `else` branch of `if task_id < num_tasks - 1 and follow`, so it only ran when the last task failed rather than whenever any task failed.

This PR moves the retry check so that it runs immediately after a job failure is detected (using `JobStatus.FAILED`, which indicates a user-code failure), before the task-progression logic.
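
To illustrate the control-flow change, here is a minimal, self-contained sketch; the function and helper names are placeholders, not the actual code in sky/jobs/utils.py:

```python
# Minimal sketch of the before/after control flow; names are placeholders,
# not the real implementation in sky/jobs/utils.py.

def on_job_status(job_status: str, task_id: int, num_tasks: int,
                  follow: bool) -> str:
    # Before the fix, the retry check lived in the `else` branch of
    # `if task_id < num_tasks - 1 and follow`, so it could only fire when
    # there was no next task to advance to (i.e. the last task failed).
    # After the fix, a user-code failure is checked first, for every task.
    if job_status == 'FAILED':  # stands in for job_lib.JobStatus.FAILED
        return 'retry current task'
    if task_id < num_tasks - 1 and follow:
        return 'start next task'
    return 'finish streaming'

# A failure on the first of two tasks now triggers a retry instead of
# falling through to the task-progression branch.
assert on_job_status('FAILED', task_id=0, num_tasks=2, follow=True) == \
    'retry current task'
```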

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

Manual Test

For a job with task retry enabled:

resources:
  ports: 8080
  cpus: 2+
  job_recovery:
    max_restarts_on_errors: 1

run: exit 1  # Task 1: Always fails
---
run: exit 0  # Task 2: Always succeeds
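
(For reference, a managed job like this would typically be launched with `sky jobs launch <yaml-file>`; the exact command and file name used for the test are not shown in the PR.)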

Before fix

⚙︎ Job submitted, ID: 4
├── Waiting for task resources on 1 node.
└── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed)
ERROR: Job 1 failed with return code list: [1] 
Shared connection to 34.41.128.28 closed.

✓ Managed job finished: 4 (status: CANCELLED).

The job terminates immediately after the first failure, without retrying.

After fix

⚙︎ Job submitted, ID: 3
├── Waiting for task resources on 1 node.
└── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed)
ERROR: Job 1 failed with return code list: [1] 
Shared connection to 34.41.128.28 closed.

├── Waiting for task resources on 1 node.
└── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed)
ERROR: Job 1 failed with return code list: [1] 
Shared connection to 34.72.7.54 closed.

✓ Managed job finished: 3 (status: CANCELLED).

The job properly retries the failed task once before terminating.

@andylizf (Contributor, Author) commented:
@cblmemo PTAL, thanks!

@cblmemo (Collaborator) left a comment:

Thanks for fixing this @andylizf! Mostly LGTM. Left one discussion.

@Michaelvll (Collaborator) left a comment:

Thanks @andylizf for fixing this! This looks mostly good to me. Could we add a smoke test for this new fix, so we can catch any regression in the future?

@@ -384,8 +384,29 @@ def stream_logs_by_id(job_id: int, follow: bool = True) -> str:
job_statuses = backend.get_job_status(handle, stream_logs=False)
job_status = list(job_statuses.values())[0]
assert job_status is not None, 'No job found.'
assert task_id is not None, job_id
if job_status == job_lib.JobStatus.FAILED:
A collaborator left an inline comment on this line:
There can be a few different statuses that can trigger the retry, including FAILED_SETUP and FAILED. Should we add both?

@andylizf (Contributor, Author) replied on Nov 25, 2024:

Yes, I agree we should add both FAILED and FAILED_SETUP as retry triggers since setup failure is also a user program failure.

Also, the comment in job_lib says `FAILED_SETUP` needs to be placed after the `FAILED` state; does this ordering affect our retry logic implementation?

    # The job setup failed (only in effect if --detach-setup is set). It
    # needs to be placed after the `FAILED` state, so that the status
    # set by our generated ray program will not be overwritten by
    # ray's job status (FAILED).
    # This is for a better UX, so that the user can find out the reason
    # of the failure quickly.
    FAILED_SETUP = 'FAILED_SETUP'
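
A small self-contained sketch of how the retry condition could cover both failure statuses (illustrative only; the real enum lives in the repo's job_lib module and the actual change is in sky/jobs/utils.py):

```python
# Illustrative stand-in for job_lib.JobStatus; not the real class.
import enum


class JobStatus(enum.Enum):
    RUNNING = 'RUNNING'
    FAILED = 'FAILED'
    FAILED_SETUP = 'FAILED_SETUP'
    SUCCEEDED = 'SUCCEEDED'


# Both statuses represent user-program failures, so both should count as
# retryable. The enum-definition ordering discussed above matters for how
# statuses overwrite each other on the cluster, not for a simple membership
# check like this one.
USER_CODE_FAILURE_STATES = (JobStatus.FAILED, JobStatus.FAILED_SETUP)


def should_retry(job_status: JobStatus) -> bool:
    return job_status in USER_CODE_FAILURE_STATES


assert should_retry(JobStatus.FAILED_SETUP)
assert not should_retry(JobStatus.SUCCEEDED)
```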

@cblmemo (Collaborator) left a comment:

Thanks! LGTM. Should be good to go after adding the smoke test.

@andylizf (Contributor, Author) commented Nov 26, 2024:

@cblmemo The smoke test passed, PTAL, thanks!

Successfully merging this pull request may close this issue: Bug: stream_logs_by_id incorrectly handles task retry logic