execute_k8s_job does not handle watch client stale state #26626
Labels
area: execution (Related to Execution)
deployment: k8s (Related to deploying Dagster to Kubernetes)
type: bug (Something isn't working)
What's the issue?
Long calls to `execute_k8s_job` sometimes fail when reading the logs. The method has retries around `next(log_stream)`, but if the watch client enters a stale state, the code ends up failing. Example log:

I found similar issues reported in ansible-playbook, and the relevant issue in the kubernetes client. The solution is to move the watch client creation (`log_stream = watch.stream()`) into a loop as well. I'm trying it out in my repo and will post a PR with a fix after I confirm that it's working (or at least not introducing new issues).
What did you expect to happen?
The code shouldn't fail because of intermittent errors.
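
A minimal sketch of the proposed fix, not the actual Dagster implementation: the stream is created inside the retry loop, so a stale watch client is thrown away and replaced instead of being retried. The names `read_logs_with_fresh_stream`, `make_stream`, `max_attempts`, and `retry_delay_seconds` are hypothetical. With the kubernetes Python client, `make_stream` would presumably wrap a call along the lines of `watch.Watch().stream(core_api.read_namespaced_pod_log, ...)`. The `lines_seen` argument is an assumed way of avoiding duplicate lines after a reconnect.

```python
import time
from typing import Callable, Iterator, List


def read_logs_with_fresh_stream(
    make_stream: Callable[[int], Iterator[str]],
    max_attempts: int = 3,
    retry_delay_seconds: float = 0.0,
) -> List[str]:
    """Collect log lines, recreating the stream after each failure.

    make_stream is a hypothetical factory that receives the number of
    lines already read and returns a new iterator over the remaining ones.
    """
    lines: List[str] = []
    failures = 0
    while True:
        # Key point: create the stream inside the loop, so every retry
        # starts from a fresh client instead of reusing a stale one.
        log_stream = make_stream(len(lines))
        try:
            for line in log_stream:
                lines.append(line)
            return lines  # the stream ended normally
        except Exception:
            failures += 1
            if failures >= max_attempts:
                raise  # give up and surface the last error
            time.sleep(retry_delay_seconds)
```

Resuming by line count is only for illustration. Against a real pod log, a reconnect would more likely resume by time, for example through the `since_seconds` or `timestamps` options of `read_namespaced_pod_log`. Otherwise, lines already read would be replayed.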
How to reproduce?
This is difficult to reproduce. It originates from the underlying k8s client and happens only rarely, but often enough to fail long-running, expensive training jobs.
Dagster version
1.9.3
Deployment type
Dagster Helm chart
Deployment details
No response
Additional information
No response
Message from the maintainers
Impacted by this issue? Give it a 👍! We factor engagement into prioritization.