-
Hi @NiallRees... Was it the run that failed, or the schedule evaluation that failed? If it was the run itself, are you using the default run launcher? You can make runs resilient to code server downtime by launching them as independent containers; the ECS / K8s / Docker run launchers are examples of run launchers that isolate runs in separate containers. For run failures, you can set up retries or run failure sensor alerting. For schedule failures, the scheduler daemon process should retry recently failed ticks, so it should tolerate ephemeral failures caused by code server redeploys.
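For anyone landing here, a minimal sketch of the run-level mitigations mentioned above: run retries via the `dagster/max_retries` run tag, plus a run failure sensor for alerting. The `replicate` / `replication_job` names and the `print` notification are placeholders (not from this thread), and the retry tag only takes effect once run retries are enabled on the instance (`run_retries` in dagster.yaml or the corresponding Helm value).

```python
from dagster import (
    Definitions,
    RunFailureSensorContext,
    job,
    op,
    run_failure_sensor,
)


@op
def replicate():
    ...


# Retry the whole run up to 3 times if it fails, e.g. because the code
# server was mid-redeploy. Requires run retries to be enabled on the
# instance (run_retries in dagster.yaml / the Helm values).
@job(tags={"dagster/max_retries": 3})
def replication_job():
    replicate()


# Fires whenever a run in the deployment fails, so missing partitions
# surface as an alert instead of being discovered later.
@run_failure_sensor
def alert_on_run_failure(context: RunFailureSensorContext):
    message = (
        f"Run {context.dagster_run.run_id} of {context.dagster_run.job_name} "
        f"failed: {context.failure_event.message}"
    )
    print(message)  # swap in Slack / PagerDuty / email, etc.


defs = Definitions(jobs=[replication_job], sensors=[alert_on_run_failure])
```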
-
Hi @prha - got some more screenshots. It was the schedule that failed. I'm using the K8s Helm chart. Unfortunately there was no retry: it started 20 of the total partitions and then just stopped. Thanks for the help
-
Here are the logs from the daemon. Schedule starts at 06:00. It starts 20 runs/partitions out of over a hundred, then doesn't recover after the code server returns.
-
Some more information: this was Dagster 1.5.6, by which point this PR had been released. cc @gibsondan
-
This morning, our scheduled run failed mid-run. The error message was:
dagster._core.errors.DagsterCodeLocationLoadError: Failure loading replicator-code: dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach user code server. gRPC Error code: UNAVAILABLE
Two questions:
Thanks.