execute_k8s_job does not handle watch client stale state #26626

Open
OrenLederman opened this issue Dec 20, 2024 · 0 comments
Labels
area: execution (Related to Execution) · deployment: k8s (Related to deploying Dagster to Kubernetes) · type: bug (Something isn't working)

OrenLederman commented Dec 20, 2024

What's the issue?

Long-running calls to execute_k8s_job sometimes fail while reading logs. The method retries around next(log_stream), but if the watch client itself enters a stale state, every retry hits the same dead client and the call eventually fails. Example log:

dagster._core.errors.DagsterExecutionStepExecutionError: Error occurred while executing op "run_generic_training":

  File "/usr/local/lib/python3.10/dist-packages/dagster/_core/execution/plan/execute_plan.py", line 245, in dagster_event_sequence_for_step
    for step_event in check.generator(step_events):
  File "/usr/local/lib/python3.10/dist-packages/dagster/_core/execution/plan/execute_step.py", line 499, in core_dagster_event_sequence_for_step
    for user_event in _step_output_error_checked_user_event_sequence(
  File "/usr/local/lib/python3.10/dist-packages/dagster/_core/execution/plan/execute_step.py", line 183, in _step_output_error_checked_user_event_sequence
    for user_event in user_event_sequence:
  File "/usr/local/lib/python3.10/dist-packages/dagster/_core/execution/plan/execute_step.py", line 87, in _process_asset_results_to_events
    for user_event in user_event_sequence:
  File "/usr/local/lib/python3.10/dist-packages/dagster/_core/execution/plan/compute.py", line 193, in execute_core_compute
    for step_output in _yield_compute_results(step_context, inputs, compute_fn, compute_context):
  File "/usr/local/lib/python3.10/dist-packages/dagster/_core/execution/plan/compute.py", line 162, in _yield_compute_results
    for event in iterate_with_context(
  File "/usr/local/lib/python3.10/dist-packages/dagster/_utils/__init__.py", line 480, in iterate_with_context
    with context_fn():
  File "/usr/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/usr/local/lib/python3.10/dist-packages/dagster/_core/execution/plan/utils.py", line 84, in op_execution_error_boundary
    raise error_cls(

The above exception was caused by the following exception:
urllib3.exceptions.ProtocolError: Response ended prematurely

  File "/usr/local/lib/python3.10/dist-packages/dagster/_core/execution/plan/utils.py", line 54, in op_execution_error_boundary
    yield
  File "/usr/local/lib/python3.10/dist-packages/dagster/_utils/__init__.py", line 482, in iterate_with_context
    next_output = next(iterator)
  File "/usr/local/lib/python3.10/dist-packages/dagster/_core/execution/plan/compute_generator.py", line 140, in _coerce_op_compute_fn_to_iterator
    result = invoke_compute_fn(
  File "/usr/local/lib/python3.10/dist-packages/dagster/_core/execution/plan/compute_generator.py", line 128, in invoke_compute_fn
    return fn(context, **args_to_pass) if context_arg_provided else fn(**args_to_pass)
  File "/app/generic_ml_training/dags/ops.py", line 117, in run_generic_training
    execute_k8s_job(
  File "/usr/local/lib/python3.10/dist-packages/dagster/_core/decorator_utils.py", line 203, in wrapped_with_pre_call_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/dagster_k8s/ops/k8s_job_op.py", line 424, in execute_k8s_job
    raise e
  File "/usr/local/lib/python3.10/dist-packages/dagster_k8s/ops/k8s_job_op.py", line 389, in execute_k8s_job
    log_entry = k8s_api_retry(
  File "/usr/local/lib/python3.10/dist-packages/dagster_k8s/client.py", line 144, in k8s_api_retry
    return fn()
  File "/usr/local/lib/python3.10/dist-packages/dagster_k8s/ops/k8s_job_op.py", line 390, in <lambda>
    lambda: next(log_stream),
  File "/usr/local/lib/python3.10/dist-packages/kubernetes/watch/watch.py", line 178, in stream
    for line in iter_resp_lines(resp):
  File "/usr/local/lib/python3.10/dist-packages/kubernetes/watch/watch.py", line 56, in iter_resp_lines
    for segment in resp.stream(amt=None, decode_content=False):
  File "/usr/local/lib/python3.10/dist-packages/urllib3/response.py", line 1057, in stream
    yield from self.read_chunked(amt, decode_content=decode_content)
  File "/usr/local/lib/python3.10/dist-packages/urllib3/response.py", line 1206, in read_chunked
    self._update_chunk_length()
  File "/usr/local/lib/python3.10/dist-packages/urllib3/response.py", line 1136, in _update_chunk_length
    raise ProtocolError("Response ended prematurely") from None

The above exception occurred during handling of the following exception:
ValueError: invalid literal for int() with base 16: b''

  File "/usr/local/lib/python3.10/dist-packages/urllib3/response.py", line 1128, in _update_chunk_length
    self.chunk_left = int(line, 16)

I found similar issues reported against ansible-playbook, as well as a related issue in the Kubernetes Python client. The fix is to move the watch client creation (log_stream = watch.stream(...)) inside the retry loop as well, so that a stale client is recreated rather than retried. I'm trying this out in my repo and will post a PR with a fix once I've confirmed that it works (or at least doesn't introduce new issues). A rough sketch of the idea is below.
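For illustration only, here is a minimal sketch of the approach. core_api, pod_name, namespace, and max_retries are placeholder names, not the actual variables in k8s_job_op.py, and the retry bookkeeping is simplified compared to k8s_api_retry:

```python
from kubernetes import watch
import urllib3


def stream_pod_logs_with_retries(core_api, pod_name, namespace, max_retries=3):
    """Yield pod log lines, recreating the watch client after connection errors."""
    retries = 0
    while True:
        # Create the watch client *inside* the loop so that a stale client
        # is replaced on retry instead of being retried forever.
        log_stream = watch.Watch().stream(
            core_api.read_namespaced_pod_log,
            name=pod_name,
            namespace=namespace,
        )
        try:
            for log_entry in log_stream:
                retries = 0  # reset the retry budget whenever we make progress
                yield log_entry
            return  # the stream ended normally (e.g. the pod finished)
        except urllib3.exceptions.ProtocolError:
            retries += 1
            if retries > max_retries:
                raise
```

One thing a real fix would still need to handle: recreating the stream likely replays the pod's logs from the beginning, so the actual implementation should deduplicate or resume (e.g. via the since_seconds argument to read_namespaced_pod_log) to avoid emitting lines twice.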

What did you expect to happen?

The code shouldn't fail because of intermittent connection errors.

How to reproduce?

This is difficult to reproduce. The error originates in the underlying Kubernetes client and happens only rarely (but often enough to fail long-running, expensive training jobs).

Dagster version

1.9.3

Deployment type

Dagster Helm chart

Deployment details

No response

Additional information

No response

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.
By submitting this issue, you agree to follow Dagster's Code of Conduct.

@OrenLederman OrenLederman added the type: bug Something isn't working label Dec 20, 2024
@garethbrickman garethbrickman added deployment: k8s Related to deploying Dagster to Kubernetes area: execution Related to Execution labels Dec 20, 2024