Using Dagster for lightweight task orchestration at Snorkel.ai #22259

derekahuang · 2024-06-03T18:54:32Z

Hi there, my name is Derek and I’m an engineer at Snorkel.ai. We build a platform for data-centric AI and we run a lot of jobs, ranging from data preprocessing to computing metrics to querying external LLM providers.

I’m currently exploring how we can redesign our job system so that, instead of running monolithic functions, each job runs as a series of blocks. Each block might have different resource constraints: one block might be CPU bound and sent to execute on our Ray cluster, while another might be making network requests and just needs to run multithreaded.
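For illustration only, the block idea might look something like the plain-Python sketch below. This is not Dagster and not any real job system: the block names (`fetch`, `score`, `report`), the `BLOCKS` table, and the `"threads"`/`"inline"` backend labels are all hypothetical, and a real CPU-bound block would be dispatched to something like a Ray cluster rather than run inline.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter


def fetch(inputs):
    # Stand-in for an I/O-bound block (e.g. network requests).
    return [1, 2, 3]


def score(inputs):
    # Stand-in for a CPU-bound block; consumes the output of "fetch".
    return [x * 2 for x in inputs["fetch"]]


def report(inputs):
    # Final block; consumes the output of "score".
    return sum(inputs["score"])


# Hypothetical block table: name -> (function, upstream block names, backend label).
BLOCKS = {
    "fetch": (fetch, [], "threads"),
    "score": (score, ["fetch"], "inline"),
    "report": (report, ["score"], "inline"),
}


def run_blocks(blocks):
    """Run blocks in dependency order, dispatching each one by its backend label."""
    graph = {name: deps for name, (_, deps, _) in blocks.items()}
    results = {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        # static_order() yields every block after all of its upstream blocks.
        for name in TopologicalSorter(graph).static_order():
            fn, deps, backend = blocks[name]
            inputs = {dep: results[dep] for dep in deps}
            if backend == "threads":
                results[name] = pool.submit(fn, inputs).result()
            else:
                results[name] = fn(inputs)
    return results
```

Even in this toy version, the dependency bookkeeping and the per-block dispatch are hand-rolled, which is the part that tends to get messy as the number of blocks grows.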

I came across Dagster because it can get messy managing the dependencies between all these blocks. Dagster seems like a great way to manage the dependencies while at the same time providing robust features.

I’m curious whether this community thinks Dagster is the right tool for this type of problem or whether it is overkill. It would run entirely in process (an RQ worker would execute the Flow). Are there any tradeoffs or considerations I should be aware of before we start developing? Thank you!

cmpadden · 2024-06-04T19:41:49Z

Maintainer

Hi @derekahuang - thanks for taking a look at Dagster!

What you've described -- having various "blocks" of processing with different constraints -- maps well to the modular design of pipelines using Assets in Dagster. I don't think it's ever too early to adopt an orchestrator, as many of the features are opt-in. This means you can start small and gradually adopt more complex features as your workflows evolve, and by adopting early you won't have to rewrite things entirely later. One of the main benefits of Dagster is its strong developer workflow, which lets you develop locally with a high level of feature parity with what is then deployed in production.

Regarding running Dagster in a worker process, you might lose some of Dagster's functionality. While some users run Dagster within a GitHub action, this can result in the loss of features like persistence of runs, sensors, and automations. You can find examples of how people are using Dagster here:

https://github.com/dagster-io/awesome-dagster

Another thing to consider is using Dagster as the worker and launching jobs via the API from your other processes.
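A rough sketch of that second option, assuming "the API" here means Dagster's GraphQL API accessed through the Python client in the `dagster-graphql` package, and assuming a Dagster webserver is reachable at the given host and port. The wrapper `launch_job` and its defaults are hypothetical.

```python
def launch_job(job_name, run_config=None, host="localhost", port=3000):
    """Ask a running Dagster deployment to launch a job and return its run ID.

    Hypothetical helper: host, port, and job_name depend on your deployment.
    """
    # Imported lazily so the calling module can be loaded even where
    # dagster-graphql is not installed.
    from dagster_graphql import DagsterGraphQLClient

    client = DagsterGraphQLClient(host, port_number=port)
    # submit_job_execution returns the ID of the newly launched run.
    return client.submit_job_execution(job_name, run_config=run_config or {})
```

With this arrangement the RQ worker only submits the run and Dagster executes it, so the run is tracked by the Dagster deployment rather than living inside the worker process.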

Hopefully, this gives you a starting point to explore. Let me know if you have any other questions!

1 reply

derekahuang Jul 19, 2024
Author

Thanks for the response, I'll look into the examples and let you know if I have any more questions!
