execute_k8s_job does not handle watch client stale state #26626

Open
OrenLederman opened this issue Dec 20, 2024 · 0 comments
Labels
area: execution (Related to Execution) · deployment: k8s (Related to deploying Dagster to Kubernetes) · type: bug (Something isn't working)

OrenLederman commented Dec 20, 2024

What's the issue?

Long-running calls to execute_k8s_job sometimes fail while reading logs. The method retries around next(log_stream), but if the watch client itself enters a stale state, every retry hits the same dead client and the call eventually fails. Example log:

dagster._core.errors.DagsterExecutionStepExecutionError: Error occurred while executing op "run_generic_training":

  File "/usr/local/lib/python3.10/dist-packages/dagster/_core/execution/plan/execute_plan.py", line 245, in dagster_event_sequence_for_step
    for step_event in check.generator(step_events):
  File "/usr/local/lib/python3.10/dist-packages/dagster/_core/execution/plan/execute_step.py", line 499, in core_dagster_event_sequence_for_step
    for user_event in _step_output_error_checked_user_event_sequence(
  File "/usr/local/lib/python3.10/dist-packages/dagster/_core/execution/plan/execute_step.py", line 183, in _step_output_error_checked_user_event_sequence
    for user_event in user_event_sequence:
  File "/usr/local/lib/python3.10/dist-packages/dagster/_core/execution/plan/execute_step.py", line 87, in _process_asset_results_to_events
    for user_event in user_event_sequence:
  File "/usr/local/lib/python3.10/dist-packages/dagster/_core/execution/plan/compute.py", line 193, in execute_core_compute
    for step_output in _yield_compute_results(step_context, inputs, compute_fn, compute_context):
  File "/usr/local/lib/python3.10/dist-packages/dagster/_core/execution/plan/compute.py", line 162, in _yield_compute_results
    for event in iterate_with_context(
  File "/usr/local/lib/python3.10/dist-packages/dagster/_utils/__init__.py", line 480, in iterate_with_context
    with context_fn():
  File "/usr/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/usr/local/lib/python3.10/dist-packages/dagster/_core/execution/plan/utils.py", line 84, in op_execution_error_boundary
    raise error_cls(

The above exception was caused by the following exception:
urllib3.exceptions.ProtocolError: Response ended prematurely

  File "/usr/local/lib/python3.10/dist-packages/dagster/_core/execution/plan/utils.py", line 54, in op_execution_error_boundary
    yield
  File "/usr/local/lib/python3.10/dist-packages/dagster/_utils/__init__.py", line 482, in iterate_with_context
    next_output = next(iterator)
  File "/usr/local/lib/python3.10/dist-packages/dagster/_core/execution/plan/compute_generator.py", line 140, in _coerce_op_compute_fn_to_iterator
    result = invoke_compute_fn(
  File "/usr/local/lib/python3.10/dist-packages/dagster/_core/execution/plan/compute_generator.py", line 128, in invoke_compute_fn
    return fn(context, **args_to_pass) if context_arg_provided else fn(**args_to_pass)
  File "/app/generic_ml_training/dags/ops.py", line 117, in run_generic_training
    execute_k8s_job(
  File "/usr/local/lib/python3.10/dist-packages/dagster/_core/decorator_utils.py", line 203, in wrapped_with_pre_call_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/dagster_k8s/ops/k8s_job_op.py", line 424, in execute_k8s_job
    raise e
  File "/usr/local/lib/python3.10/dist-packages/dagster_k8s/ops/k8s_job_op.py", line 389, in execute_k8s_job
    log_entry = k8s_api_retry(
  File "/usr/local/lib/python3.10/dist-packages/dagster_k8s/client.py", line 144, in k8s_api_retry
    return fn()
  File "/usr/local/lib/python3.10/dist-packages/dagster_k8s/ops/k8s_job_op.py", line 390, in <lambda>
    lambda: next(log_stream),
  File "/usr/local/lib/python3.10/dist-packages/kubernetes/watch/watch.py", line 178, in stream
    for line in iter_resp_lines(resp):
  File "/usr/local/lib/python3.10/dist-packages/kubernetes/watch/watch.py", line 56, in iter_resp_lines
    for segment in resp.stream(amt=None, decode_content=False):
  File "/usr/local/lib/python3.10/dist-packages/urllib3/response.py", line 1057, in stream
    yield from self.read_chunked(amt, decode_content=decode_content)
  File "/usr/local/lib/python3.10/dist-packages/urllib3/response.py", line 1206, in read_chunked
    self._update_chunk_length()
  File "/usr/local/lib/python3.10/dist-packages/urllib3/response.py", line 1136, in _update_chunk_length
    raise ProtocolError("Response ended prematurely") from None

The above exception occurred during handling of the following exception:
ValueError: invalid literal for int() with base 16: b''

  File "/usr/local/lib/python3.10/dist-packages/urllib3/response.py", line 1128, in _update_chunk_length
    self.chunk_left = int(line, 16)

I found similar issues reported against ansible-playbook, as well as a related issue in the Kubernetes Python client. The fix is to move the watch client creation (log_stream = watch.stream(...)) inside the retry loop as well, so that a stale client is recreated rather than retried. I'm trying this out in my repo and will post a PR with a fix once I've confirmed that it works (or at least doesn't introduce new issues). A rough sketch of the idea is below.
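For illustration only, here is a minimal sketch of the approach. core_api, pod_name, namespace, and max_retries are placeholder names, not the actual variables in k8s_job_op.py, and the retry bookkeeping is simplified compared to k8s_api_retry:

```python
from kubernetes import watch
import urllib3


def stream_pod_logs_with_retries(core_api, pod_name, namespace, max_retries=3):
    """Yield pod log lines, recreating the watch client after connection errors."""
    retries = 0
    while True:
        # Create the watch client *inside* the loop so that a stale client
        # is replaced on retry instead of being retried forever.
        log_stream = watch.Watch().stream(
            core_api.read_namespaced_pod_log,
            name=pod_name,
            namespace=namespace,
        )
        try:
            for log_entry in log_stream:
                retries = 0  # reset the retry budget whenever we make progress
                yield log_entry
            return  # the stream ended normally (e.g. the pod finished)
        except urllib3.exceptions.ProtocolError:
            retries += 1
            if retries > max_retries:
                raise
```

One thing a real fix would still need to handle: recreating the stream likely replays the pod's logs from the beginning, so the actual implementation should deduplicate or resume (e.g. via the since_seconds argument to read_namespaced_pod_log) to avoid emitting lines twice.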

What did you expect to happen?

The code shouldn't fail because of intermittent connection errors.

How to reproduce?

This is difficult to reproduce. The error originates in the underlying Kubernetes client and happens only rarely (but often enough to fail long-running, expensive training jobs).

Dagster version

1.9.3

Deployment type

Dagster Helm chart

Deployment details

No response

Additional information

No response

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.
By submitting this issue, you agree to follow Dagster's Code of Conduct.

@OrenLederman OrenLederman added the type: bug Something isn't working label Dec 20, 2024
@garethbrickman garethbrickman added deployment: k8s Related to deploying Dagster to Kubernetes area: execution Related to Execution labels Dec 20, 2024