[RFC] Dagster Pipes (previously ext) #16319
Replies: 18 comments 32 replies
-
I have a question: In the example where logic and subsequent logging calls are added, do you foresee that the script could remain runnable without dagster (as in whatever fashion it was runnable originally, like …
-
My team is very interested in this. We have a huge graph of assets that are produced via Scala Spark code in a bespoke workflow manager. It would be amazing to be able to extend the bespoke workflow manager to produce Dagster events. We'd be happy to collaborate on something of this nature. As of right now these are run on top of Databricks, so it seems that you're already planning on developing a runner (not sure what the appropriate terminology is yet in this context) for that in the near-term, but it seems that there would need to be a bit of Scala code to do the serialization/deserialization - my team would be happy to contribute to this, we'd just need some guidance on the protocol as it becomes more standardized. Overall really excited about this, it opens up a whole new world of possibilities. The ability to write pipelines in a language like Scala or Rust and still interact with Dagster could be really huge, both for type-safety and performance.
-
This seems really sweet. We are creating a dataproc step launcher, but I guess it might make more sense to wait for this feature first. I do have some questions. One of the benefits of the step launcher is that you can write code inside of the dagster codebase and run that code remotely. Here, there would be an expectation that the code is already inside of the spark cluster, right? We would have to create some sort of 'dataproc_ext_client' to be able to run these jobs, right? What would the work of creating one of these clients entail compared to the amount of work of creating a step launcher (which is non-trivial)? Also, we have a bunch of Node.js jobs that we currently run using a subprocess; what would change in the way that we run these scripts? Is there a future where we can declare jobs directly inside of Node.js?
-
Thanks for this proposal and discussion - we really enjoy watching how Dagster is progressing.
-
Sounds pretty interesting. How do you imagine credentials might work? For instance, if …
-
Already discussed this offline with Nick, but we're mega excited for this as we're just about to roll out Dagster and the library should enable us to onboard a product team much more quickly into the system! Specifically looking to orchestrate Spark computations running in our own Kubernetes cluster in Azure. Only question is, how are we expected to provide the …
-
I'm curious if you've imagined dagster-ext being used to simplify / complement step launchers? It seems that this framework could be used in tandem with some more generalized tooling for moving step code to a remote location for external execution, to give a more solid framework to back step launchers. I totally see the utility of not having to go through step launchers, but they're also a pretty awesome way to give a more unified coding experience to devs where they can define all their pipelines in one place and not have to think about making sure code changes get deployed to the various different platforms on which it needs to be executed. This is particularly useful for local iteration, where a dev can be working in their IDE making changes and launching jobs through a local Dagster deployment. They don't have to remember to sync code to somewhere that Databricks can refer to every time they make a change, the step launcher does that at runtime for them. Sometimes if our devs are using a pre-configured step launcher or working on someone else's pipeline they don't even realize their code is executing on Databricks, it's that transparent.
-
I notice that in your k8s example the command is `["python", "the_existing_script.py"]`. If I have a team trying to run some bespoke CLI utility written in an arbitrary non-Python language, will they be able to at least read the context information via pod-mounted env variables or similar?
-
Our team currently materializes a few assets in non-standard third party platforms and the way we've dealt with those has been through polling their API to check for asset readiness. We can run our code in those platforms, so I'm wondering if we would be able not only to share logs, but to push information telling dagster that the asset is materialized. I would love to drop this polling architecture...
-
How will I/O for assets executed by an external process be handled? For writing/persisting the asset, presumably, the external process now has to do the work of writing the output to disk/S3/data warehouse? If so, won't this represent a significant regression compared to the abstraction an `IOManager` provides? For reading/loading the asset in downstream assets that are not executed in an external process, I assume you still need to define an `IOManager`?
-
I'm really interested in this since at our site we have a mixed Cloud/HPC environment where we have dagster OSS on the cloud side and where we launch heavy steps on the HPC side with a custom StepLauncher. We are currently releasing the first pipelines in production and we have adopted the main dagster abstractions: Software Defined Assets, Resources and IOManager.
My main concern in using this proposed protocol over a StepLauncher is that it seems like it will opt you out not only from the IOManager, but also from the Resource abstraction. Moreover, the resulting Software Defined Asset is opaque, since the code that is being executed and that actually materializes the asset is not even hosted in the same repository. I can see the value of this protocol in onboarding existing codebases, but it would be really nice to use ext also within fresh pipelines natively written using the dagster framework and abstractions. It would probably be nice to have an @external_asset for such use cases; do you plan to add something similar in the future?
When I originally got into dagster, the IOManager was a key and distinguishing component of it. It was advertised as a good abstraction to use also with spark, not only for scalars and small enough datasets. This new protocol, together with the deps argument and the latest docs, seems to go in a different direction. While at the beginning dagster was requiring you to refactor the code to use its abstractions, now the underlying message I read is: "structure your code as you please and use the tools that you are most familiar with. We will provide a simple api to describe your pipeline in dagster and to stream metadata to it so that it can orchestrate and visualize the pipeline for you".
-
I was on a call today with Dagster Sales, and when describing my use case they pointed me toward this discussion on Dagster Ext. However, after reading this page, I am not sure it fits my use case. Let me explain what I am trying to do to see if it can be solved using Dagster Ext.
I am in the process of evaluating Dagster to replace SnapLogic for a large number of integrations (1200). These integrations run nightly to extract the last 24 hours of the customer's data and send it to our application's data warehouse. To accomplish this, we run an on-prem local agent (created by SnapLogic) that connects to the customer's SQL database in the customer's environment, and the agent then ships the data up to the SnapLogic server. This is orchestrated from SnapLogic, with of course some setup on the customer's Windows server to install and configure the remote agent.
Reading through this discussion, it seems as though we could sort of accomplish this, but we would still have to write our own "agent" code, deploy that code to the customer's server, and then have that code read from the DB and send up the data of interest. The benefit of using Dagster Ext would be that the logs would show up in Dagster Cloud, and we could configure some of the input parameters that are sent to the "agent" from Dagster Cloud. The problem here is that if we want to deploy new code to the customer's server, we have to do that manually; we can't leverage Dagster to update the "agent". Am I understanding this correctly? Is there any other way to accomplish this with Dagster?
-
I'm coming from a thread on Slack and wanted to verify whether this addresses the thing I'm searching for or whether it's something different. My question was:
I guess the basic question would be: if I orchestrate the job scheduling and running of dagster code locations myself, can I still send the metadata/job runs back to a central dagster instance which would visualize the job results (failures, logs, history, ...)? Not sure if the following use case is what I'm looking for?
-
Hey all. Quick update here. We decided to change the name from "ext" to "Pipes". We think that is a lot clearer and provides an analogy that people can latch onto. It goes out this week!
-
First let me say that I love dagster, and you guys are doing amazing work. I have never been happier. Dagit (or is it dagster-webserver now?) is such a huge gift for being able to see what's going on with the data, and for showing it to others. I had a much easier time convincing people that we should use dagster once they saw it.
I absolutely love the idea of being able to leverage the beautiful single pane of glass from other languages, but I think I would love it much more if it were a deeper/more stringent integration. I know this is only a first step, and I believe/hope that you have plans to do deeper integration with other languages/ecosystems, but this step worries me because I can see it being a local maximum for a lot of people/teams. If someone can take their random Jupyter notebook, Lambda, AWS Step Function, or whatever and pretend it works well with dagster, then I feel like I lose leverage to push them for more rigor. Dagster as a framework has allowed and helped me to push for people to do better: to more accurately model their pipelines, to separate io/resources, to test things, to make them able to run locally and not just in production.
I also feel like it makes dagit less useful, and much more like airflow. I no longer necessarily know what's going in or out of a given asset/op if it's using this, or what resources it needs/uses. Since it's sold as just adding a few lines of code, I know I'm going to have trouble buying time to refactor/improve or even really understand the existing code. I know this is probably the better choice from a company perspective for dagster labs (RIP elementl), and that it's not dagster's responsibility to fix other companies' engineering/organizational issues, but I worry that it detracts from the principled stance I've seen dagster taking up until now and the state of the art.
I would be much more excited if the concepts of registering resources/assets/ops/graphs etc. were ported to other languages/ecosystems first so that we could have that deep integration with dagit and engineering rigor, though I recognize that would likely be much more difficult and I haven't thought about this nearly as much as you all have. Anyway, that's just my two cents. Thank you all again so much for all that you've done so far and continue to do!
-
I was wondering if pipes would fit our requirements. But it would seem not? 🤔 Primary motivation:
Things we tried:
I was hoping the agent may be made available as part of the OSS offering (it seems to fit our problem statement well), and I'm aware that we can write a custom run launcher, but I do not want to write and maintain this type of distributed custom code long term, which is hard to test and debug and may be brittle over time.
-
Pretty far-out vision: It appears that this will allow the use of Dagster's UI with not just external data …
-
Hi all! We are now live. Wanted to thank everyone in this discussion for such thoughtful commentary and feedback. It was an invaluable part of the process. Please go kick the tires and see if the reality matches the promise!
-
Introduction
Dagster’s 1.5 release will contain a new protocol designed to enhance integrations with external execution environments. We call it Pipes, short for "Protocol for Inter-Process Execution with Streaming logs and metadata".
Pipes has a few goals:
Context
Dagster has traditionally integrated business logic and orchestration. Our tutorial focuses on this approach, where business logic is structured within dagster's definition objects and requires importing the full dagster library. For simple data pipelines where data fits in memory and is directly processed within the orchestrator, this works well.
However this approach falls flat in a number of important contexts:
The “way out” of this is for the body of an asset to invoke external environments directly via Python clients. However, users that do this are left with little support in Dagster:
Not repeating Airflow’s mistakes
Airflow has gone through a similar journey, and many users––especially ones that operate at-scale data platforms––use operators that separate execution and orchestration, such as the K8sPodOperator (see docs). This is a fantastic article that details this approach and its operational advantages.
However there are also large costs associated with this approach. In particular users are forced to write per-task, bespoke CLI applications in order to invoke compute using K8sPodOperator. One of our users memorably described this: "In Airflow you have to choose between dependency hell (meaning a single, shared Python environment) and CLI hell."
Path Forward: A Protocol
What we propose is a protocol between the orchestration environment and external execution environments, and a toolkit for building implementations of that protocol. In order for an external process to participate in a first-class way it must:
The transport layer varies depending on your operating context. For the subprocess case, the default transport layer is a combination of environment variables and temp files for injecting context and parameters, and a temp file for streaming structured metadata. For a case like Databricks, parameters and context information are passed as parameters to the REST API, and logs and messages are streamed to dbfs.
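To make the subprocess case concrete, here is a rough sketch of the shape of that exchange. The environment variable names and the message format below are hypothetical, purely for illustration; the actual protocol defines its own names and encoding.

```python
import json
import os
import tempfile

# Hypothetical names and message shapes, for illustration only; the real
# protocol defines its own env var names and payload encoding.
with tempfile.TemporaryDirectory() as tmpdir:
    context_path = os.path.join(tmpdir, "context.json")
    messages_path = os.path.join(tmpdir, "messages.jsonl")

    # Orchestrator side: write context/parameters to a temp file and hand its
    # location to the external process via environment variables, e.g.
    #   subprocess.run(["python", "the_existing_script.py"],
    #                  env={**os.environ,
    #                       "EXT_CONTEXT_PATH": context_path,
    #                       "EXT_MESSAGES_PATH": messages_path})
    with open(context_path, "w") as f:
        json.dump({"asset_key": "my_asset", "partition_key": None}, f)

    # External-process side: read the injected context, then stream structured
    # messages back by appending JSON lines to the messages file.
    with open(context_path) as f:
        ext_context = json.load(f)
    with open(messages_path, "a") as f:
        f.write(json.dumps({
            "method": "report_asset_materialization",
            "params": {"metadata": {"num_rows": 42}},
        }) + "\n")

    # Orchestrator side: read the streamed messages and convert each one into
    # a Dagster event (materialization, log line, etc.).
    with open(messages_path) as f:
        events = [json.loads(line) for line in f]
```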
We will provide out-of-the-box implementations of streaming logs and structured messages for major integrations and object stores. This is the most complex component of the protocol, and these are generally accessible from all cloud-based services. Customizing the "launch" behavior is more straightforward, and, with appropriate support and guardrails in the toolkit, we are confident that the community and users can implement that as required.
What does the code look like
Most users writing business logic will not have to understand or care that there is a "protocol." What they will experience are much-improved integrations with environments like Kubernetes, Lambda, Databricks, arbitrary subprocesses, and other hosted runtimes. Let's work through an example to see what it looks like in practice.
The scenario here is that you have been tasked with orchestrating an existing Python script that produces an asset, making its logs viewable in Dagster’s UI, and then altering it to emit some metadata back in Dagster (e.g. number of rows). However it is large, complex, untested, not authored by you, and incomprehensible. You do not fully understand it, nor do you have any desire to. You want to invoke this script as an external process, rather than bring that code into the Dagster process.
You write the following asset that invokes the external script. You use the PipedSubprocess resource, which implements the orchestration side of the protocol.
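The code block from the original post is not reproduced here; as a rough sketch, using the `PipesSubprocessClient` name under which this resource ultimately shipped and a hypothetical asset name, it looks along these lines:

```python
from dagster import AssetExecutionContext, Definitions, PipesSubprocessClient, asset


@asset
def telemetry_report(  # hypothetical asset name, for illustration
    context: AssetExecutionContext, pipes_subprocess_client: PipesSubprocessClient
):
    # Launch the existing script as a subprocess; Pipes injects the context and
    # relays any logs/metadata the script reports back to Dagster.
    return pipes_subprocess_client.run(
        command=["python", "the_existing_script.py"],
        context=context,
    ).get_materialize_result()


defs = Definitions(
    assets=[telemetry_report],
    resources={"pipes_subprocess_client": PipesSubprocessClient()},
)
```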
This works on the existing script without any modifications to that script. However, now you want code in that script to emit metadata back to Dagster. Previously the way to do that would have been to write code within the asset function in the Dagster process, but you want to log metadata that is only available in the script. Dagster Pipes allows you to do that with just a few lines of code:
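The script-side snippet from the post is likewise not reproduced; a sketch using the `dagster-pipes` package as it later shipped (the business-logic function here is a placeholder) might look like this:

```python
# the_existing_script.py
from dagster_pipes import open_dagster_pipes


def compute_report():
    # Stand-in for the existing, untouched business logic.
    return [{"id": i} for i in range(100)]


if __name__ == "__main__":
    # One line to initialize Dagster Pipes...
    with open_dagster_pipes() as pipes:
        rows = compute_report()
        pipes.log.info(f"computed {len(rows)} rows")
        # ...and one line to report a materialization with metadata.
        pipes.report_asset_materialization(metadata={"num_rows": len(rows)})
```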
dagster-pipes has no dependencies. It is easy to install and also easy to vendor (it is a single Python file), if necessary. You have to add a single line of code to initialize Dagster Pipes, and then a single line to report a materialization with metadata.
With a few lines of code in the script, it is now a first-class Dagster asset. When it is run, its logs appear streaming in the Dagster UI, and its asset catalog entry collects metadata.
Kubernetes
Now imagine you wanted to move this script to Kubernetes. Here you'll see the power of standardization, as we can shift the code to execute in a different environment.
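Again, the original code block isn't reproduced; a sketch using the `PipesK8sClient` that later shipped in `dagster-k8s` (image name hypothetical) could look roughly like this, with only the client swapped relative to the subprocess version:

```python
from dagster import AssetExecutionContext, Definitions, asset
from dagster_k8s import PipesK8sClient


@asset
def telemetry_report(  # same hypothetical asset as before
    context: AssetExecutionContext, pipes_k8s_client: PipesK8sClient
):
    # The same script, now baked into an image the cluster can pull.
    return pipes_k8s_client.run(
        context=context,
        image="acme/the-existing-script:latest",  # hypothetical image
        command=["python", "the_existing_script.py"],
    ).get_materialize_result()


defs = Definitions(
    assets=[telemetry_report],
    resources={"pipes_k8s_client": PipesK8sClient()},
)
```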
There are no modifications necessary in the other process. You can use the same code unmodified, as long as it is accessible inside of the container.
This structure is a big step forward for data engineering teams that want to incorporate stakeholder teams into a unified Dagster deployment. The stakeholder teams can, with minimal modifications, get their assets into the asset graph, use Dagster as their system of record for metadata, and use Dagster's UI for improved operations and observability. Contrast that with today's world, where stakeholder teams have to substantially restructure their business logic code to fit into Dagster definitions and bring the full dagster library into their Python environment in order to get those benefits.
What about Step Launchers?
Users who have grappled with the issues––especially in Spark––may be asking "what about step launchers?"
Historically we have tried to make the integration of business logic and orchestration work in external runtimes such as Spark with framework-level abstractions.
The step launcher was the framework-level abstraction designed to support ergonomic remote execution. However this has usability problems and is also inherently untenable for some users to adopt:
For users who do not want to structure their Spark business logic in Dagster definitions, we think that Pipes is the right path forward.
Multi-language future
The library to implement this on the external side is lightweight. In the case where this IPC is implemented using temp files and environment variables (for example, our subprocess and Kubernetes integrations work like this), no external dependencies are required and it is a small amount of Python code.
As a result, this protocol is fairly straightforward to implement in other programming languages. They just have to deserialize and serialize standardized objects to a filesystem or an object store. This will enable a future where practitioners in any programming language and any hosted execution environment can participate in Dagster in a first-class way, which is an exciting future. As we mature the system, we'll formalize this protocol in a spec and provide implementations in other programming languages.
Call-to-action
We have two asks:
First, please provide feedback on this idea. And if you see yourself using this, please let us know your concrete use case! It's always helpful to know all the different ways people can envision using a tool.
Second, we are looking for design partners. This feature is under active development in our repo, and is in our public releases (but not in our top-level exports). We're looking for folks who want to use these capabilities immediately. If that is of interest to you, please reach out! Our subprocess and Kubernetes integrations are ready for use by active design partners. We've created a channel in the Dagster Slack, #dagster-pipes, for those who want to follow along.
We are targeting the following external environments (focusing on AWS).
We can prioritize development based on demand for these (and other) environments, so please speak up! While this enables a multi-language future, we are only targeting Python in the near-term.
Please comment here with questions and feedback. And join the #dagster-pipes Slack channel! Thank you!