
[Jobs] Move task retry logic to correct branch in stream_logs_by_id #4407

Open · wants to merge 14 commits into master
Conversation

@andylizf (Contributor) commented Nov 24, 2024:

Fixes #4250

The task retry logic in `stream_logs_by_id` was placed in the wrong branch of the if-else structure: it sat in the `else` branch of `if task_id < num_tasks - 1 and follow`, so it only ran when the last task failed rather than whenever any task failed.

This PR moves the retry check so that it runs immediately after a job failure is detected (using `JobStatus.FAILED`, which indicates a user-code failure), before the task-progression logic.
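
To illustrate the control-flow change, here is a minimal, self-contained sketch; the function and helper names are placeholders, not the actual code in sky/jobs/utils.py:

```python
# Minimal sketch of the before/after control flow; names are placeholders,
# not the real implementation in sky/jobs/utils.py.

def on_job_status(job_status: str, task_id: int, num_tasks: int,
                  follow: bool) -> str:
    # Before the fix, the retry check lived in the `else` branch of
    # `if task_id < num_tasks - 1 and follow`, so it could only fire when
    # there was no next task to advance to (i.e. the last task failed).
    # After the fix, a user-code failure is checked first, for every task.
    if job_status == 'FAILED':  # stands in for job_lib.JobStatus.FAILED
        return 'retry current task'
    if task_id < num_tasks - 1 and follow:
        return 'start next task'
    return 'finish streaming'

# A failure on the first of two tasks now triggers a retry instead of
# falling through to the task-progression branch.
assert on_job_status('FAILED', task_id=0, num_tasks=2, follow=True) == \
    'retry current task'
```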

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

Manual Test

For a job with task retry enabled:

resources:
  ports: 8080
  cpus: 2+
  job_recovery:
    max_restarts_on_errors: 1

run: exit 1  # Task 1: Always fails
---
run: exit 0  # Task 2: Always succeeds
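
(For reference, a managed job like this would typically be launched with `sky jobs launch <yaml-file>`; the exact command and file name used for the test are not shown in the PR.)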

Before fix

⚙︎ Job submitted, ID: 4
├── Waiting for task resources on 1 node.
└── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed)
ERROR: Job 1 failed with return code list: [1] 
Shared connection to 34.41.128.28 closed.

✓ Managed job finished: 4 (status: CANCELLED).

The job terminates immediately after the first failure, without retrying.

After fix

⚙︎ Job submitted, ID: 3
├── Waiting for task resources on 1 node.
└── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed)
ERROR: Job 1 failed with return code list: [1] 
Shared connection to 34.41.128.28 closed.

├── Waiting for task resources on 1 node.
└── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed)
ERROR: Job 1 failed with return code list: [1] 
Shared connection to 34.72.7.54 closed.

✓ Managed job finished: 3 (status: CANCELLED).

The job properly retries the failed task once before terminating.

@andylizf (Contributor, Author) commented:
@cblmemo PTAL, thanks!

@cblmemo (Collaborator) left a comment:

Thanks for fixing this @andylizf! Mostly LGTM. Left one discussion.

@Michaelvll (Collaborator) left a comment:

Thanks @andylizf for fixing this! This looks mostly good to me. Could we add a smoke test for this new fix, so we can catch any regression in the future?

@@ -384,8 +384,29 @@ def stream_logs_by_id(job_id: int, follow: bool = True) -> str:
job_statuses = backend.get_job_status(handle, stream_logs=False)
job_status = list(job_statuses.values())[0]
assert job_status is not None, 'No job found.'
assert task_id is not None, job_id
if job_status == job_lib.JobStatus.FAILED:
A collaborator left an inline comment on this line:
There can be a few different statuses that can trigger the retry, including FAILED_SETUP and FAILED. Should we add both?

@andylizf (Contributor, Author) replied on Nov 25, 2024:

Yes, I agree we should add both FAILED and FAILED_SETUP as retry triggers since setup failure is also a user program failure.

Also, the comment in job_lib says `FAILED_SETUP` needs to be placed after the `FAILED` state; does this ordering affect our retry logic implementation?

    # The job setup failed (only in effect if --detach-setup is set). It
    # needs to be placed after the `FAILED` state, so that the status
    # set by our generated ray program will not be overwritten by
    # ray's job status (FAILED).
    # This is for a better UX, so that the user can find out the reason
    # of the failure quickly.
    FAILED_SETUP = 'FAILED_SETUP'
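
A small self-contained sketch of how the retry condition could cover both failure statuses (illustrative only; the real enum lives in the repo's job_lib module and the actual change is in sky/jobs/utils.py):

```python
# Illustrative stand-in for job_lib.JobStatus; not the real class.
import enum


class JobStatus(enum.Enum):
    RUNNING = 'RUNNING'
    FAILED = 'FAILED'
    FAILED_SETUP = 'FAILED_SETUP'
    SUCCEEDED = 'SUCCEEDED'


# Both statuses represent user-program failures, so both should count as
# retryable. The enum-definition ordering discussed above matters for how
# statuses overwrite each other on the cluster, not for a simple membership
# check like this one.
USER_CODE_FAILURE_STATES = (JobStatus.FAILED, JobStatus.FAILED_SETUP)


def should_retry(job_status: JobStatus) -> bool:
    return job_status in USER_CODE_FAILURE_STATES


assert should_retry(JobStatus.FAILED_SETUP)
assert not should_retry(JobStatus.SUCCEEDED)
```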

@cblmemo (Collaborator) left a comment:

Thanks! LGTM. Should be good to go after adding the smoke test.

@andylizf (Contributor, Author) commented Nov 26, 2024:

@cblmemo The smoke test passed, PTAL, thanks!

Successfully merging this pull request may close this issue: Bug: stream_logs_by_id incorrectly handles task retry logic