
Bug: stream_logs_by_id incorrectly handles task retry logic #4250

Open
andylizf opened this issue Nov 2, 2024 · 4 comments · May be fixed by #4407
@andylizf
Contributor

andylizf commented Nov 2, 2024

The recent commit 46ea0d8, which adds task retry functionality, misplaced the retry logic in stream_logs_by_id. The retry check currently sits in the wrong branch of the if-else structure:

Current placement:

if returncode == 0:
    if job_status != job_lib.JobStatus.CANCELLED:
        if task_id < num_tasks - 1 and follow:
            # handle next task
        else:  # <- retry logic incorrectly placed here
            task_specs = managed_job_state.get_task_specs(...)
            # retry logic

Should be:

if returncode == 0:
    if job_status != job_lib.JobStatus.CANCELLED:
        if task_id < num_tasks - 1 and follow:
            # handle next task
        else:
            break
else:  # <- retry logic should be here, handling cluster failures
    # check task_specs and handle retry
# old cluster failed logic

This bug prevents task retries from being handled properly: the code checks for retries in the wrong scenario (when the task completes normally) instead of when the task or cluster fails.
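To make the proposed control flow concrete, here is a minimal, self-contained sketch of the corrected branching. This is illustrative only, not SkyPilot's actual code: `JobStatus`, `handle_stream_result`, and `max_restarts_remaining` are stand-ins for the real module's names, and the real function streams logs rather than returning an action string.

```python
from enum import Enum, auto


class JobStatus(Enum):
    """Stand-in for job_lib.JobStatus (hypothetical subset)."""
    SUCCEEDED = auto()
    CANCELLED = auto()
    FAILED = auto()


def handle_stream_result(returncode, job_status, task_id, num_tasks,
                         follow, max_restarts_remaining):
    """Decide the next action for the log-streaming loop.

    Sketch of the *proposed* placement: retry is only checked in the
    failure branch (non-zero returncode), not when a task finishes.
    """
    if returncode == 0:
        if job_status != JobStatus.CANCELLED:
            if task_id < num_tasks - 1 and follow:
                return 'next_task'  # move on to the next task's logs
            return 'break'          # last task (or not following): stop
        return 'break'              # job was cancelled: stop
    # Non-zero returncode: the cluster/log tailing failed. Per the
    # proposed fix, THIS is where the retry check belongs.
    if max_restarts_remaining > 0:
        return 'retry'
    return 'break'
```

With this placement, a successful intermediate task advances to the next task, while a failure path is the only one that consults the retry budget.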

> Seems the retry logic is added in the wrong branch. It's currently in the `else` branch of `if task_id < num_tasks - 1 and follow`, which means it only triggers when we want to terminate. The retry check should be in the outer `else` branch where we handle cluster failures.

Originally posted by @andylizf in #4169 (comment)

@andylizf
Contributor Author

andylizf commented Nov 2, 2024

@cblmemo PTAL, thanks!

@andylizf
Contributor Author

andylizf commented Nov 2, 2024

@Michaelvll I noticed a potential issue with the retry logic placement in `stream_logs_by_id`. The current implementation puts it in the `else` branch of `task_id < num_tasks - 1 and follow`, which seems incorrect: that branch only handles the last task or the non-follow case.

I think it should log retries when tasks actually fail (i.e., in the outer else branch where we handle cluster failures). However, @cblmemo pointed out that the fix might not be as simple as just moving the code block, since there are two cases that can trigger retry: user program failure and log tailing failure. The outer else branch only handles the latter case.
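To illustrate the distinction @cblmemo raised, here is a hedged sketch of the two retry triggers. The names (`retry_trigger`, the flag arguments) and the mapping of cases to branches are assumptions for illustration, not SkyPilot's actual API.

```python
def retry_trigger(user_program_failed, log_tailing_failed):
    """Classify which retry case applies, if any (hypothetical helper).

    Per the discussion, simply moving the retry block into the outer
    `else` branch would only cover the log-tailing/cluster-failure
    case, leaving user-program failures unhandled.
    """
    if log_tailing_failed:
        return 'log_tailing'   # the case the outer else branch covers
    if user_program_failed:
        return 'user_program'  # NOT covered by just moving the block
    return None                # no failure: no retry check needed
```

This is why the fix may need special handling rather than a straight move of the code block.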

I might be missing something in my understanding of the task retry mechanism. Could you please take a look and verify if this is indeed an issue? cc @romilbhardwaj

@Michaelvll
Collaborator

That's a good catch @andylizf! We may need to move the code path that handles restarts to before the place where we handle `task_id < num_tasks - 1`, with some special handling. Would you be able to help dig into it @andylizf?

@andylizf
Contributor Author

Thanks for pointing that out! I'll look into it tomorrow.
