Loading or adding assets and partitions at run time from config #16524
-
Hi,
-
Hey @dmsfabiano - not sure how this discussion slipped through the cracks. I think I've wrapped my head around part of what you're asking, but I'm still trying to understand it completely. If you want to apply the same logic - e.g.

```python
@asset(partitions_def=my_dynamic_partitions_def)
def datasets_without_nans(context):
    raw_dataset_for_partition = get_raw_dataset_for_partition(partition=context.partition_key)
    write_dataset_without_nans(remove_nans(raw_dataset_for_partition), partition=context.partition_key)
```

Are you looking to have a more complicated dependency graph than this though? E.g. is the idea that you want to only apply
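For context, `my_dynamic_partitions_def` in the snippet above refers to a dynamic partitions definition; a minimal sketch of how such a definition is typically declared (the name is an assumption, and the helper functions in the snippet are placeholders):

```python
from dagster import DynamicPartitionsDefinition

# Each partition key added at runtime (e.g. one per incoming dataset) becomes
# a materializable partition of any asset that uses this definition.
my_dynamic_partitions_def = DynamicPartitionsDefinition(name="raw_datasets")
```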
-
@cimadure @sryza I have created a PR here #18625 to extend the examples of dynamic asset partition usage, which showcases how to load dynamic partitions dynamically through a configurable job, but it essentially comes down to a simple pattern:

```python
from typing import Any

from dagster import (
    AssetKey,
    Definitions,
    OpExecutionContext,
    job,
    load_assets_from_modules,
    op,
)


@op
def dynamic_partition_loader(
    context: OpExecutionContext, asset_key: str, partition_key: str
) -> Any:
    """Dynamically fetches a previous value of asset_key at partition_key.

    Args:
        context (OpExecutionContext): standard op context
        asset_key (str): unique identifier of the asset to load the partition from
        partition_key (str): unique identifier of the partition

    Returns:
        Any: the previously stored value of the dynamic partition
    """
    with defs.get_asset_value_loader(instance=context.instance) as loader:
        partition_value = loader.load_asset_value(
            AssetKey(asset_key),
            partition_key=partition_key,
        )
    return partition_value


@job
def adhoc_partition_load():
    """Job wrapper of dynamic_partition_loader."""
    dynamic_partition_loader()


# `assets`, `release_sensor`, and `duckdb_io_manager` come from the rest of the
# example project referenced in the PR.
defs = Definitions(
    assets=load_assets_from_modules([assets]),
    sensors=[release_sensor],
    jobs=[adhoc_partition_load],
    resources={"warehouse": duckdb_io_manager.configured({"database": "releases.duckdb"})},
)
```
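Since `asset_key` and `partition_key` are unconnected op inputs, they can be supplied through run config when the job is launched. A minimal sketch of what a launch could look like (the asset and partition names are made up, and the exact run-config shape may vary between Dagster versions):

```python
from dagster import DagsterInstance

# Hypothetical invocation: load the "2024-01-01" partition of "raw_datasets".
result = adhoc_partition_load.execute_in_process(
    run_config={
        "ops": {
            "dynamic_partition_loader": {
                "inputs": {
                    "asset_key": {"value": "raw_datasets"},
                    "partition_key": {"value": "2024-01-01"},
                }
            }
        }
    },
    instance=DagsterInstance.get(),
)
```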
-
Background

We are evaluating the feasibility of using Dagster for all our Data/ML pipelines. The use case is simple: we receive datasets, to which we apply arbitrary (but repetitive) rules to clean them and produce clean datasets, which are then used to train our ML models.

To accomplish this, we have:

- Used `DynamicPartitionsDefinition` to define a `raw_datasets` partitioned asset, where each partition is a dataset. They all have the same structure, but we receive them on a schedule from production dumps.
- To populate `raw_datasets`, we use a `sensor` that, based on some arbitrary checks on our SQL datasets, returns a `SensorResult` that yields `run_requests` and `dynamic_partitions_requests` (see the sketch below).
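A minimal sketch of what such a sensor can look like (the sensor name, the job it targets, and the dataset-discovery helper are assumptions, not our actual code):

```python
from dagster import DynamicPartitionsDefinition, RunRequest, SensorResult, sensor

raw_datasets_partitions = DynamicPartitionsDefinition(name="raw_datasets")


@sensor(job=raw_datasets_job)  # `raw_datasets_job` targets the partitioned asset
def release_sensor(context):
    # Hypothetical check against the SQL source for datasets we haven't seen yet.
    new_dataset_names = find_new_production_dumps(context)

    return SensorResult(
        run_requests=[RunRequest(partition_key=name) for name in new_dataset_names],
        dynamic_partitions_requests=[
            raw_datasets_partitions.build_add_request(new_dataset_names)
        ],
    )
```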
This has worked smoothly so far. What we are trying to do next is have a job that references (by parameter) a dataset name (`raw_datasets.partition_key`) and some extra parameters, applies some common data cleaning tasks, and creates a partition for a new asset (say `clean_datasets`). One example argument could be `remove_nan`, where if true it drops all rows containing NaNs. We can have an asset that holds the logic for cleaning the `raw_dataset` into partitions, but `SpecificPartitionMappings` have to be hard coded and declared upfront.
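For illustration, the hard-coding limitation mentioned above (presumably `SpecificPartitionsPartitionMapping`) looks roughly like this; the asset, partition names, and pandas-based cleaning are placeholders:

```python
from dagster import (
    AssetIn,
    DynamicPartitionsDefinition,
    SpecificPartitionsPartitionMapping,
    asset,
)

clean_datasets_partitions = DynamicPartitionsDefinition(name="clean_datasets")


@asset(
    partitions_def=clean_datasets_partitions,
    ins={
        "raw_datasets": AssetIn(
            # The upstream partition keys must be spelled out in code ahead of
            # time, rather than supplied as a parameter at run time.
            partition_mapping=SpecificPartitionsPartitionMapping(["dump_2024_01_01"])
        )
    },
)
def clean_datasets(raw_datasets):
    # Assuming the upstream value is a pandas DataFrame.
    return raw_datasets.dropna()
```

Issue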
I have not found a way to have an op or a job that can yield or create a partition based on parameters. Here is what we have tried:

- I went through Using LastPartitionMapping without an asset #13918 and Partitioned jobs with partitioned source assets as input #13357, where there is a mention of a workaround using `load_asset_value` inside an `op` to get a specific partition. This helps; we can use this concept to fetch `raw_datasets.partition_key`.
- I went through Enable yielding dynamic partitions requests within assets/ops #13955, which mentions that yielding dynamic partitions requests is not currently supported. However, while reading through How do I create dynamic partitions within an op/asset? #15428, I realized that we could use this approach to create a new partition in the parametrized job for `clean_datasets` (see the sketch below). However, how do we tell Dagster to start materializing the new partition, and furthermore, pass the parameters for that run?
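The approach from #15428 amounts to registering the partition on the instance from inside the op; a minimal sketch (the partitions-definition name and the input are assumptions):

```python
from dagster import OpExecutionContext, op


@op
def register_clean_dataset_partition(context: OpExecutionContext, new_partition_key: str):
    # Registers a new key on the "clean_datasets" dynamic partitions definition.
    # This only creates the partition; it does not materialize it or pass any
    # run parameters, which is exactly the open question above.
    context.instance.add_dynamic_partitions(
        partitions_def_name="clean_datasets",
        partition_keys=[new_partition_key],
    )
```

Other solutions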
- We can always go back to the `workflow` approach, where we have a job that creates an entry for a new `clean_dataset` in our SQL dataset along with the parameters to clean, and have a `sensor` that triggers partition runs based on the DB changes. However, we really want to embrace the concept of assets, this would lose the data lineage, and we feel that applying generic data cleansing rules is a common use case.
- Create repetitive assets manually (in code) that apply repetitive cleaning rules.
Desired outcome
Ideally, we would have something like (pseudo-code):

Another way would be to somehow have the ability to manually trigger creation of partitions based on some arbitrary rules (this is exactly what a `sensor` does), but there is a need to trigger manually, and with different parameters; a rough sketch of that is below.
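A rough sketch of what such a manual trigger could look like as a one-off script (the asset, partition key, and config key are assumptions, and this assumes `clean_datasets` exposes a `remove_nan` config field):

```python
from dagster import DagsterInstance, materialize

instance = DagsterInstance.get()

# 1. Register the new partition manually on the "clean_datasets" definition.
instance.add_dynamic_partitions("clean_datasets", ["customer_dump_42"])

# 2. Materialize that partition, passing the cleaning parameters as run config.
#    `clean_datasets` is the dynamically partitioned asset, assumed to be in scope.
materialize(
    [clean_datasets],
    partition_key="customer_dump_42",
    instance=instance,
    run_config={"ops": {"clean_datasets": {"config": {"remove_nan": True}}}},
)
```

Disclaimer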
We are relatively new to Dagster; we may be missing understanding of some concepts.