Confusion about the Dagster hybrid deployment: agents, launchers, executors, workers #18267

skeller88 · 2023-11-27T01:58:35Z

I'm finding it difficult to follow all of the deployment-related documentation (though it's definitely helpful), because there are overlapping docs and concepts spread across OSS, serverless, hybrid, and then all 3. Here's my current understanding.

Daemon - long-running process needed for schedules, sensors, the queued run coordinator, monitoring run workers, and triggering retries

Storage
History of runs - SQLite, Postgres, or MySQL; can be set to a local or remote db
Logs from op compute functions - s3, gcs
Output from ops - where to store outputs that are written to a filesystem

gRPC server - loads code that defines our Dagster assets (ops, jobs, assets, schedules, etc) and makes metadata about those assets available via a GraphQL API

Execution
Run coordinator -> run launcher -> run worker -> executor
Run coordinator - a class invoked by either the webserver or a Dagster API request. This class can be configured to pass runs to the daemon via a queue. The coordinator determines the prioritization rules and concurrency limits for runs.

Run launcher - a class invoked by the daemon when it receives a run from the coordinator. This class initializes a new run worker to handle execution. Depending on the launcher, this could mean spinning up a new process, container, Kubernetes pod, etc.
Run worker - a process which traverses a graph and uses the executor to execute each op.
Executor - a class invoked by the run worker for running user ops. Depending on the executor, ops run in local processes, new containers, Kubernetes pods, etc.

Web server - UI for visualizing jobs, assets, and ops, as well as starting jobs.
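
To make the chain above concrete (run coordinator -> run launcher -> run worker -> executor), here is a minimal toy model in plain Python. It does not use Dagster, and every class and function name in it is hypothetical; it only shows which component hands work to which, under the assumption that the daemon dequeues runs and the launcher starts one worker per run.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Run:
    run_id: str
    ops: list[str]  # op names in execution order
    events: list[str] = field(default_factory=list)


class InProcessExecutor:
    """Executor: decides where each op runs (here: inline in the worker)."""

    def execute(self, run: Run, op_name: str) -> None:
        run.events.append(f"executed {op_name}")


class RunWorker:
    """Run worker: walks the run's ops and hands each one to the executor."""

    def __init__(self, executor: InProcessExecutor) -> None:
        self.executor = executor

    def work(self, run: Run) -> None:
        for op_name in run.ops:
            self.executor.execute(run, op_name)


class LocalRunLauncher:
    """Run launcher: starts one worker per run.

    A real launcher would start a process, container, or pod instead.
    """

    def launch(self, run: Run) -> None:
        run.events.append("worker started")
        RunWorker(InProcessExecutor()).work(run)


class QueuedCoordinator:
    """Run coordinator: accepts submitted runs and queues them."""

    def __init__(self) -> None:
        self.queue = deque()

    def submit(self, run: Run) -> None:
        run.events.append("queued")
        self.queue.append(run)


def daemon_tick(coordinator: QueuedCoordinator, launcher: LocalRunLauncher) -> None:
    """Daemon: dequeues runs and passes each one to the launcher."""
    while coordinator.queue:
        launcher.launch(coordinator.queue.popleft())


coordinator = QueuedCoordinator()
run = Run(run_id="run-1", ops=["extract", "load"])
coordinator.submit(run)
daemon_tick(coordinator, LocalRunLauncher())
print(run.events)
```

Swapping `LocalRunLauncher` changes where the whole run executes, while swapping `InProcessExecutor` changes where each op executes. That is the distinction between the launcher and the executor in the list above.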

Some questions.

Dagster hosts the following in Dagster Cloud:

Daemon
Run storage
Log storage?
GraphQL API
Web server
Run coordinator?
Anything else?

The user hosts:

Agent
Run launcher
Run worker
Executor

I get that the Agent relays data to/from Dagster Cloud. Does it also serve the role of the run launcher?

What happens with the DefaultRunLauncher if no executors are available? Does the op get queued using the runs database?

How do you configure the Run Worker? I'm not seeing a section on that. I'm a little confused by how a run launcher triggers one or more run workers, which in turn trigger one or more executors?

In general, I'm confused about how the job execution flow works within a hybrid deployment. Is it possible to use a Celery executor with a hybrid deployment? Is the Dagster-hosted run storage accessible outside of Dagster Cloud? That seems like the only way that it would work.

Thanks!

gibsondan · 2023-11-27T22:21:52Z

Maintainer

Hi @skeller88 - the overall understanding you lay out here is more or less entirely correct. Happy to mop up the specific questions here:

The agent does serve the role of run launcher. It spins up an isolated task/pod/process/etc. for each run based on instructions that it pulls from our API.
If you don't pick an executor, the default one runs each op in its own subprocess. More here: https://docs.dagster.io/concepts/ops-jobs-graphs/job-execution#default-job-executor
The specific knobs available to configure the run worker vary considerably depending on which agent you're using (or what environment you're deploying to in OSS). We have a k8s agent configuration reference here, for example: https://docs.dagster.io/dagster-cloud/deployment/agents/kubernetes/configuration-reference#kubernetes-agent-configuration-reference and an ECS agent reference here: https://docs.dagster.io/dagster-cloud/deployment/agents/amazon-ecs/configuration-reference#amazon-ecs-agent-configuration-reference
This doc might help explain the job execution flow in hybrid: https://docs.dagster.io/dagster-cloud/deployment/hybrid#hybrid-architecture-overview - once a run actually starts, it's executing in the agent's environment in its own task/pod/etc. and writing metadata back to our servers over our API.
We don't currently support the Celery executor in Dagster Cloud. Since we first shipped our Celery integration we've significantly improved our built-in concurrency support, which covers a lot of the same bases that would lead somebody to use Celery: https://docs.dagster.io/guides/limiting-concurrency-in-data-pipelines#limiting-concurrency-in-data-pipelines - so we haven't seen much demand for Celery support in Dagster Cloud.

Let us know if there's anything else we can help with!
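
As a rough illustration of the hybrid flow described in this answer, the sketch below models an agent that pulls launch instructions from a hosted API and starts one isolated process per run. It is a toy in plain Python, not the Dagster agent: `FakeCloudApi`, `agent_tick`, and the status strings are hypothetical names made up for this example. Two simplifications are assumed: the agent blocks until each process exits, and the agent reports status on the run's behalf, whereas in the flow described above the run itself writes metadata back over the API.

```python
import subprocess
import sys
from collections import deque


class FakeCloudApi:
    """Hypothetical stand-in for the hosted control plane; not a real Dagster client."""

    def __init__(self) -> None:
        self.pending = deque()  # launch instructions waiting for an agent
        self.events = []        # run metadata reported back from the user's environment

    def request_launch(self, run_id: str, code: str) -> None:
        self.pending.append((run_id, code))

    def poll_instructions(self) -> list:
        instructions = list(self.pending)
        self.pending.clear()
        return instructions

    def report(self, run_id: str, status: str) -> None:
        self.events.append((run_id, status))


def agent_tick(api: FakeCloudApi) -> None:
    """One polling cycle: pull instructions, then start one isolated process per run."""
    for run_id, code in api.poll_instructions():
        api.report(run_id, "STARTING")
        # Stand-in for launching an ECS task, a Kubernetes pod, or a local process.
        result = subprocess.run(
            [sys.executable, "-c", code], capture_output=True, text=True
        )
        api.report(run_id, "SUCCESS" if result.returncode == 0 else "FAILURE")


api = FakeCloudApi()
api.request_launch("run-1", "print('ok')")
api.request_launch("run-2", "raise SystemExit(1)")
agent_tick(api)
print(api.events)
```

The point of the pull-based shape is that all traffic is initiated from the agent's side: the hosted API never needs to reach into the user's environment.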

8 replies

gibsondan Nov 27, 2023
Maintainer

That's all correct, yeah.

Dagster Cloud uses a standard queued run coordinator. It can be configured via the API or settings: https://docs.dagster.io/dagster-cloud/managing-deployments/deployment-settings-reference#run-queue-run_queue

skeller88 Nov 28, 2023
Author

I'm looking for a bit more detail on the implementation of the coordinator if possible, to give us visibility into the durability and performance of the coordinator at scale. Is it an in-memory queue backed by some persistence layer like postgres? Is it redis?

gibsondan Nov 28, 2023
Maintainer

It's stored persistently in postgres - you can see the implementation here in open source, which is spiritually similar to how it works in the cloud product as well: https://github.com/dagster-io/dagster/blob/master/python_modules/dagster/dagster/_daemon/run_coordinator/queued_run_coordinator_daemon.py#L184-L260

podviaznikov Nov 28, 2023

Thank you for this thread. I personally found the documentation and terminology for all 3 deployment scenarios (I tried them all) a bit confusing and complex.

gRPC server and code locations -> those are not super straightforward to understand. What do they do? Why are they named like that, etc.?

Radiergummi Nov 12, 2024

"We don't currently support the Celery executor in Dagster Cloud. Since we first shipped our Celery integration we've significantly improved our built-in concurrency support which covers a lot of the same bases that would lead somebody to use Celery […]"

@gibsondan this is a bit concerning to me as I'm just wrestling with a Celery setup. Just like @podviaznikov, I also struggle to make sense of code locations, especially since they seem so eager to execute jobs. At first I thought they host the user code exclusively, which would enable reloading user code by simply updating the code location, but Celery worker instances need the source code too, rendering the code location instance redundant. Or so I think.

I would absolutely love to get rid of the complexity Celery and RabbitMQ bring to the table; all I really want is the ability to spread job execution across an arbitrary number of nodes. From all I've gathered about Dagster, in theory, that should be possible by user code runners connecting to the Daemon instance.
Do I understand you correctly that gRPC servers should be able to fulfil that role, as in, when set up correctly, several code location instances are able to run jobs concurrently? If yes, then the documentation gave me absolutely no clues on how to do that yet.

It feels like the documentation describes several approaches to demonstrate theoretical feasibility, but focuses on K8s or Dagster+ exclusively for "proper" deployments. I wouldn't mind if that was the case, but a little clarity wouldn't hurt and may have led me to buy a license before attempting to set up an underspecified system in the first place.
