RFC: Community Input for the Dagster Embedded ELT #17300
-
I really like this. I think built-in stateful data quality checks (anomaly detection) would be really cool.
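For illustration, here is a minimal sketch of what a stateful check could look like using Dagster's asset checks; the asset name, the stand-in row count, and the in-process state are all assumptions, not anything shipped in the integration:

```python
from dagster import AssetCheckResult, asset_check

_previous_count: int | None = None  # toy in-process state; a real check would persist this


@asset_check(asset="synced_table")  # "synced_table" is a hypothetical asset
def row_count_anomaly(context) -> AssetCheckResult:
    global _previous_count
    current = 1_000  # stand-in for a real row-count query against the target
    # Flag an anomaly if the row count moved more than 50% since the last run.
    jumped = _previous_count is not None and abs(current - _previous_count) > 0.5 * _previous_count
    _previous_count = current
    return AssetCheckResult(passed=not jumped, metadata={"row_count": current})
```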
-
I think dlt would be a great next integration 😄
-
Proposing something related to the greater topic of integrations: if you look at the source code for our data ingestion integrations, they often follow the same implementation patterns. One option is to build builder-pattern utilities that standardize how to do a task in Dagster, and to provide a spec/abstract class for each tool describing how to plug into that standard. This way, each new tool only needs to implement the spec.
Here's a short conceptual example of what this could look like for ingesting data with dlt (the `EmbeddedEltResource` base class and `build_ingestion_assets` helper are hypothetical):

```python
from typing import override  # Python 3.12+

from dagster import Definitions

# Hypothetical API, sketched for this proposal
from dagster_embedded_elt import EmbeddedEltResource, build_ingestion_assets


class DltResource(EmbeddedEltResource):
    source: str
    destination: str

    @override
    def start_ingestion(self):
        # defines how to start dlt
        ...
        pipeline.run(...)

    @override
    def poll_ingestion(self):
        # defines how to evaluate the sync progress (in-progress vs. done)
        # also sends logs back to Dagster
        pass

    @override
    def get_asset_keys(self):
        # defines how to figure out what assets are being made at compile time
        pass

    @override
    def get_materialization_metadata(self):
        # defines how metadata is made for a materialization
        pass


mongo_to_bigquery_resource = DltResource(source="mongodb", destination="bigquery")
dlt_ingested_assets = build_ingestion_assets(mongo_to_bigquery_resource)

defs = Definitions(assets=dlt_ingested_assets)
```

This can eventually span any other category of tools in the stack. Embedded ELT can standardize what it means to start an ingestion, poll its progress, resolve asset keys, and emit materialization metadata.
And we can package common representations into the library, allowing people to work with their Slings, Meltanos, dlts, etc. out of the box, but also be able to quickly add a new integration for the next incoming ingestion tool or a proprietary in-house service.
-
Love the idea of embedded ELT; well spotted that big data is not necessarily that big and that many tasks can be solved with lightweight tools. Now that I have tested it out, I have a few comments: in this situation, if SlingResource has just one source+target pair, it would be easier for source_connection/target_connection to default to it.
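A minimal sketch of that defaulting behavior, with `resolve_connections` as a hypothetical helper rather than anything in the actual API:

```python
from typing import Optional


def resolve_connections(
    resource_source: str,
    resource_target: str,
    source: Optional[str] = None,
    target: Optional[str] = None,
) -> tuple[str, str]:
    # When the resource holds exactly one source/target pair, fall back to it
    # instead of requiring every asset to repeat the same connections.
    return source or resource_source, target or resource_target


# resolve_connections("MY_POSTGRES", "MY_DUCKDB") -> ("MY_POSTGRES", "MY_DUCKDB")
```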
-
Fantastic addition! This is exactly as simple as basic data movement should be. One use case that'd be nice to support would be dynamic Sling asset creation: for example, if I only want to move certain tables from one db to another with a full refresh, it'd be ideal to just provide a list of the table names instead of manually writing out a lot of boilerplate. Thanks for listening and for all you're doing! Love the direction this is going!
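A sketch of that table-list factory using the early `build_sling_asset` API; the table names, schemas, and asset keys here are illustrative:

```python
from dagster import AssetSpec, Definitions
from dagster_embedded_elt.sling import SlingMode, build_sling_asset

TABLES = ["accounts", "users", "transactions"]  # illustrative table list

# One full-refresh asset per table, instead of hand-writing each definition.
assets = [
    build_sling_asset(
        asset_spec=AssetSpec(key=["target_db", table]),
        source_stream=f"public.{table}",
        target_object=f"analytics.{table}",
        mode=SlingMode.FULL_REFRESH,
    )
    for table in TABLES
]

defs = Definitions(assets=assets)
```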
-
I really like this integration in general. It seems to solve a lot of the frustration I had when setting up dlt, and I'm enjoying exploring Sling. I'd be curious to hear what others think of this idea: what I like about the dbt integration (dbt, not dlt) is that you can operate dbt as a CLI tool, Dagster can ingest that configuration, and we can orchestrate dbt commands through Dagster. As is, I can have a YAML config I use with the Sling CLI, but I have to recreate that config in Dagster/Python if I want to use the current integration, and it would be annoying and difficult to keep the two in sync. The Sling CLI is nice to use and convenient for co-workers who might want to do occasional ad-hoc runs without having to get Dagster involved.
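A minimal sketch of the reuse being asked for, assuming the integration accepted a parsed config dict (the later replication_config API ended up supporting exactly this kind of sharing):

```python
import yaml  # PyYAML

# One replication.yaml as the single source of truth: the CLI runs it with
# `sling run -r replication.yaml`, and Dagster loads the same file.
with open("replication.yaml") as f:
    replication_config = yaml.safe_load(f)

# ...pass `replication_config` to the Dagster integration instead of
# re-declaring the same streams in Python.
```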
-
@PedramNavid Is there anywhere to follow the roadmap for this integration?
-
Finally have some bandwidth for this, and I'm excited by the prospect of being able to insource whatever we're still using Fivetran and Airbyte for. I learned about dlt after seeing it mentioned in the video explaining Embedded ELT, and that would take a solid bite out of our managed services.
Like the comment above, I'd love to see a roadmap too. Our fiscal year starts in July, so it would be great if we could have a (mostly) reliable integration sometime in the next 6 months.
-
Hey all, I wanted to give a quick preview of where we've been headed since this RFC was posted. I know it's been a while, but we have been working on making sure the next iteration addresses all of the feedback here.

Here's an example asset that uses a new asset decorator, rather than the existing asset builder methods. As you can see, you can include multiple connections in a single resource. It uses the new Sling replication API: https://docs.slingdata.io/sling-cli/run/configuration#replication-config

```python
from dagster import EnvVar
from dagster_embedded_elt.sling import (
    DagsterSlingTranslator,
    SlingConnectionResource,
    SlingResource,
    sling_assets,
)

sling_resource = SlingResource(
    connections=[
        SlingConnectionResource(
            name="MY_POSTGRES", type="postgres", connection_string=EnvVar("POSTGRES_URL")
        ),
        SlingConnectionResource(
            name="MY_DUCKDB",
            type="duckdb",
            connection_string="duckdb:///var/tmp/duckdb.db",
        ),
    ]
)

replication_config = "replication.yaml"  # the replication config shown below


@sling_assets(replication_config=replication_config)  # can also add partitions_def, backfill_policy and op_tags here
def my_assets(context, sling: SlingResource):
    for lines in sling.replicate(
        replication_config=replication_config,
        dagster_sling_translator=DagsterSlingTranslator(),
        debug=True,
    ):
        context.log.info(lines)
```

An example replication config:

```yaml
source: MY_POSTGRES
target: MY_DUCKDB

defaults:
  mode: full-refresh
  object: '{stream_schema}_{stream_table}'

streams:
  public.accounts:
  public.users:
    disabled: true
  public.finance_departments_old:
    object: 'departments' # overwrite default object
    source_options:
      empty_as_null: false
    meta:
      dagster_source: boo
  public."Transactions":
    mode: incremental # overwrite default mode
    primary_key: id
    update_key: last_updated_at
  public.all_users:
    sql: |
      select all_user_id, name
      from public."all_Users"
    object: public.all_users # need to add 'object' key for custom SQL

env:
  SLING_LOADED_AT_COLUMN: true # adds the _sling_loaded_at timestamp column
  SLING_STREAM_URL_COLUMN: true # if source is file, adds a _sling_stream_url column with file path / URL
```

With the above code and config, you'll get an asset graph with one asset per configured stream. The implementation is now working, so I just wanted to share this as I work to wrap up docs, examples and tests.

Thank you to @nixent for the feedback on resources and partition defs, to @hello-world-bfree for the suggestions on making stdout available to process, and to @cdchan for the suggestion of using a shared Sling replication config. Please keep the feedback coming, we do read it and appreciate it.
-
PyAirbyte just entered public beta. Maybe this is something to consider on the roadmap, since they have more connectors than other packages? 😄
-
Hi all! Just wanted to let you all know the latest version of dagster-embedded-elt just shipped. Check out our docs for the latest: https://docs.dagster.io/integrations/embedded-elt and you can find example code there as well. This release makes it even easier to sync data using multiple connections and to define multiple tables using Sling's replication YAML. We've tested it on millions of rows and it's worked really well, but as always, we'd love your feedback! Much of the feedback you all provided has gone into making this latest release possible.
-
👋 Hey @PedramNavid! Not sure how deep y'all are into the dlt integration, but I wanted to provide some hopefully helpful feedback after integrating it myself. I actually found integrating the two to be a fairly painless process after getting up to speed on how dlt works under the hood.

I'm sure you guys will make working with dlt in Dagster more ergonomic, but I wanted to suggest that the coupling be kept as light and loose as possible. I'm tempted to go as far as saying an explicit integration may not even be necessary. Some helper functions and education, maybe an opinionated best-practices guide, might do the trick on their own. Showing folks how to do it, and how easy it is, may be enough. dlt is so flexible that I fear any explicit integration may lead to unnecessary restrictions and limitations.

With that said, I'm loving the pairing of Dagster and dlt! It was the last piece of the puzzle to allow my org to completely cut ties with Fivetran and Airbyte. Appreciate you bringing the idea of embedded ELT to the forefront!
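To illustrate the "helper functions and education may be enough" point, here is a minimal sketch of running dlt inside a plain Dagster asset with no dedicated integration; the pipeline names and inline data are placeholders:

```python
import dlt
from dagster import asset


@asset
def github_issues():
    # A plain dlt pipeline, orchestrated by nothing more than an ordinary asset.
    pipeline = dlt.pipeline(
        pipeline_name="github_issues",
        destination="duckdb",
        dataset_name="github",
    )
    # Placeholder payload; a real pipeline would run a dlt source/resource.
    load_info = pipeline.run([{"id": 1, "title": "example"}], table_name="issues")
    return load_info
```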
-
Hey everyone! I'm excited to share that the dltHub integration has landed in version 1.7 of Dagster. You can find the relevant documentation and the introductory blog post here:
Thanks so much to all of the community members who requested this integration, and for the early feedback. I look forward to further collaboration and to making continued improvements to the integration. Cheers!
-
The dlt enhancement is great, thanks @cmpadden. Now that I'm using it fully, I have some observations and questions:
-
I know I'm late to the party, but I missed this before. I had previously been building my own assets on top of the replication config using an earlier version of the Sling integration, but now I'm trying to switch to the new decorator. However, it seems that it makes a multi-asset with can_subset explicitly set to False, which disallows running a single stream. I was wondering if there's a reason for that, as it's really helpful when you have just one table or stream that needs to be replicated at a given time. I think this also disallows making jobs that replicate certain subsets of the data more frequently, though that's less important for my immediate use case, which is just running/re-running failed assets. Replicating a subset of streams is supported by Sling with the --streams CLI option: https://docs.slingdata.io/sling-cli/run/configuration/cli-flags.
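For illustration only, a sketch of what stream-level subsetting might look like if the decorator allowed it; `can_subset` is not currently exposed by `@sling_assets`, and the mapping from selected asset keys to stream names is an assumption:

```python
from dagster_embedded_elt.sling import SlingResource, sling_assets

replication_config = "replication.yaml"  # same config as above


@sling_assets(replication_config=replication_config)  # imagine can_subset=True here
def my_assets(context, sling: SlingResource):
    # Hypothetical: map the selected Dagster asset keys back to Sling stream
    # names, analogous to `sling run -r replication.yaml --streams <names>`.
    selected = [key.path[-1] for key in context.selected_asset_keys]
    context.log.info(f"would replicate only: {selected}")
```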
-
I've only tried the dlt integration briefly, so I might be missing some things, but I have a couple of questions about making the pipeline in the asset decorator:
-
Extract telemetry, i.e. the number of rows imported from ELT/Sling: is there already some work on retrieving the affected record count after a sync via Sling into the asset metadata?
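Until the integration surfaces this, here's a minimal sketch of attaching a row count to a materialization by hand; the sync trigger and the count value are placeholders, not part of any Sling API:

```python
from dagster import MaterializeResult, asset


@asset
def synced_table() -> MaterializeResult:
    # run_the_sync()  # placeholder: trigger the Sling sync here
    row_count = 12_345  # placeholder: e.g. a COUNT(*) against the target table
    # "dagster/row_count" is the standard metadata key recognized by Dagster's UI.
    return MaterializeResult(metadata={"dagster/row_count": row_count})
```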
-
I find it very troublesome to define upstream dependencies and auto-materialization for individual mapped dlt assets. In dbt, I can define auto-materialization and dependencies in the model schema.yml for each individual model, and I don't need to write a specific translator. But I don't find a way to define them for each dlt resource: if I define them in a translator, it means the same auto-materialization policy and the same dependencies for the whole batch of dlt resources. Unless I write a dlt_assets definition for every single dlt resource, I can't control auto-materialization and dependencies at a per-resource grain. But then I don't see the point of using this integration library anymore; I could just write generic Dagster asset defs.
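One possible workaround, sketched under the assumption that the installed DagsterDltTranslator exposes per-resource hooks like get_auto_materialize_policy and get_deps_asset_keys (check your version's API): a single translator can branch on the resource name, so one translator need not mean one policy for the whole batch.

```python
from dagster import AssetKey, AutoMaterializePolicy
from dagster_embedded_elt.dlt import DagsterDltTranslator

# Illustrative per-resource configuration, keyed by dlt resource name.
POLICIES = {"users": AutoMaterializePolicy.eager()}
DEPS = {"users": [AssetKey("raw_users_source")]}


class PerResourceTranslator(DagsterDltTranslator):
    def get_auto_materialize_policy(self, resource):
        # Different policy per dlt resource; None means no policy.
        return POLICIES.get(resource.name)

    def get_deps_asset_keys(self, resource):
        # Different upstream deps per dlt resource, falling back to the default.
        return DEPS.get(resource.name, super().get_deps_asset_keys(resource))
```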
-
Hello!
With the great reception we've seen of our initial launch of dagster-embedded-elt, we'd love to get feedback from the community about our philosophy and approach. We've captured these thoughts in more detail in our blog post, but briefly: we believe that smaller embedded libraries work really well when paired with a powerful orchestrator. We've shipped our initial version using Sling, and the [docs] cover the API and code examples.
By not having to reinvent an orchestrator, these libraries can focus on what makes ingestion hard, while Dagster fills in the orchestration gaps, such as state management, scheduling, logging, and so on.
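As a small illustration of Dagster filling one of those gaps (scheduling) around embedded ELT assets; the group name and cron string are placeholders:

```python
from dagster import AssetSelection, Definitions, ScheduleDefinition, define_asset_job

# Run all assets in the (illustrative) "ingestion" group nightly at 2am.
sync_job = define_asset_job("nightly_sync", selection=AssetSelection.groups("ingestion"))

defs = Definitions(
    jobs=[sync_job],
    schedules=[ScheduleDefinition(job=sync_job, cron_schedule="0 2 * * *")],
)
```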
Are there particular integrations that you would like us to focus on next? What do you think about this overall approach? Do you think there are other parts of the data lifecycle that would benefit from this as well?
Appreciate it!