Just adding a little bump here to try to get some eyeballs.
Hi @nbuck1234, our recommendation for making runs more resilient to daemon failures is to use one of our run launchers that launches each run in an isolated container. The one that works on a single host is the DockerRunLauncher: https://docs.dagster.io/deployment/guides/docker#launching-runs-in-containers and we also have integrations with Kubernetes: https://docs.dagster.io/deployment/guides/kubernetes/deploying-with-helm and Amazon ECS: https://docs.dagster.io/deployment/guides/aws#launching-runs-in-ecs. These take a bit more work to deploy, but they give you much better isolation guarantees, since each run happens in an entirely separate container rather than in a subprocess of the daemon process. Once the run has launched, the daemon can go down entirely and the run will continue unimpeded.
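For the single-host case, a minimal sketch of what the `run_launcher` section of `dagster.yaml` might look like with the DockerRunLauncher is shown here. The network name and the environment variable names are placeholders assumed for illustration, not values from this thread, so adjust them to your own deployment and check the linked Docker guide for the full set of options.

```yaml
# dagster.yaml (sketch; all values below are assumed placeholders)
run_launcher:
  module: dagster_docker
  class: DockerRunLauncher
  config:
    # Hypothetical Docker network that the run containers join so they
    # can reach the PostgreSQL service.
    network: dagster_network
    # Environment variables passed through to each run container
    # (assumed names; use whatever your storage configuration reads).
    env_vars:
      - DAGSTER_POSTGRES_USER
      - DAGSTER_POSTGRES_PASSWORD
      - DAGSTER_POSTGRES_DB
```

With a setup along these lines, the daemon only asks Docker to start a container for each run, so the daemon needs access to the Docker socket on the host, and the run's process no longer lives inside the daemon process.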
My primary question: Is there a way for the dagster daemon to recover from being restarted during a job run? Is the description below the expected behavior?
My setup:
My observations relating to dagster services and in-progress runs:
Webserver: Restarting the webserver at any time does not affect the outcome of the run.
Database: If the PostgreSQL database service is restarted and comes back up before an in-progress run completes, or within a short time after it completes, Dagster picks it up and the run shows as complete in the web UI. If the database comes back up more than a few minutes after the run actually completed, you end up seeing an error in the UI saying that the run result couldn't be written to the database.
Daemon: This is the big one. If the daemon goes down during a run, regardless of whether it is restarted before the run would have completed, all state for the run is lost. Whatever you do, the run still shows as in progress in the UI and must be force terminated. I would love to find a way to restart the daemon and get some amount of recovery and/or graceful exits for in-progress jobs.
I hope this post is helpful to the community, cheers.
Nick