Using Dagster for lightweight task orchestration at Snorkel.ai #22259
-
Hi @derekahuang - thanks for taking a look at Dagster! What you've described (various "blocks" of processing, each with different constraints) maps well to the modular design of pipelines using Assets in Dagster.

I don't believe it is ever too early to adopt an orchestrator, as many of the features are opt-in. This means you can start small and gradually adopt more complex features as your workflows evolve, and by adopting early you won't have to rewrite things entirely later. One of the main benefits of Dagster is its strong developer workflow, which lets you develop locally with a high level of feature parity with what is then deployed in production.

Regarding running Dagster inside a worker process, you might lose some of Dagster's functionality. Some users run Dagster within a GitHub Action, for example, but this can mean losing features like persistence of runs, sensors, and automations. You can find examples of how people are using Dagster here: https://github.com/dagster-io/awesome-dagster

Another thing to consider is using Dagster as the worker and launching jobs via the API from your other processes.

Hopefully, this gives you a starting point to explore. Let me know if you have any other questions!
-
Hi there, my name is Derek and I’m an engineer at Snorkel.ai. We build a platform for data-centric AI and we run a lot of jobs, ranging from data preprocessing to computing metrics to making queries to external LLM providers.
I’m currently exploring how we can redesign our job system to move away from running monolithic functions and instead run each job as a series of blocks. Each block might have different resource constraints: one might be CPU bound and sent to execute on our Ray cluster, while another might be making network requests and just need to run multithreaded.
I came across Dagster because it can get messy managing the dependencies between all these blocks. Dagster seems like a great way to manage the dependencies while at the same time providing robust features.
I’m curious whether this community thinks Dagster is the right tool for this type of problem or whether it is overkill. It would be running entirely in process (an RQ worker would execute the Flow). I’d also like to know if there are any tradeoffs or considerations I should be aware of before we start developing. Thank you!