How to launch a single run that backfills a range of asset partitions #11653

sryza · 2023-01-12T00:15:27Z

sryza
Jan 12, 2023

Refer to the Dagster documentation: https://docs.dagster.io/concepts/partitions-schedules-sensors/backfills#single-run-backfills

mjclarke94 · 2023-01-16T23:38:13Z

mjclarke94
Jan 16, 2023

Hi @sryza,

Thanks for this, really useful feature! Just a sanity check: am I right in thinking that this won't work with multi-dimensional partitions, since they don't have partition ranges but use subsets instead?

Just want to make sure before I start throwing weird strings into the tag to try to trick it into working!

4 replies

sryza Jan 17, 2023
Author

@clairelin135 - do you know the answer to this?

clairelin135 Jan 17, 2023
Maintainer

Hi @mjclarke94. Multipartitioned assets do have partition ranges: any contiguous subset of multipartitions_def.get_partition_keys() is a valid range that you can pass in through the tags.

Something to note is that there is a slight nuance with the way multipartition keys are defined. This multipartitions definition

MultiPartitionsDefinition(
    {
        "date": DailyPartitionsDefinition(start_date="2022-06-11"),
        "abc": StaticPartitionsDefinition(["a", "b", "c"]),
    }
)

has partition keys ['a|2022-06-11'...'a|2023-01-16', 'b|2022-06-11'...'b|2023-01-16', 'c|2022-06-11'...'c|2023-01-16'], so selecting the range a|2022-06-11...b|2022-06-11 will select all of the keys whose "abc" dimension is "a". This is a little awkward, and ideally we would support subsets for this functionality too, but for now the ranges do work.
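
To make the ordering concrete, here is a minimal plain-Python sketch that mimics the key order described above and slices a range out of it. It does not call Dagster; the helper names (daily_keys, multipartition_keys, select_range) are hypothetical, and it assumes the end key of a range is inclusive.

```python
from datetime import date, timedelta


def daily_keys(start: date, end: date) -> list:
    """Return ISO date strings from start to end, inclusive."""
    days = (end - start).days
    return [(start + timedelta(days=i)).isoformat() for i in range(days + 1)]


def multipartition_keys(static_keys, date_keys):
    """Order keys as in the reply: every date for "a", then "b", then "c"."""
    return [f"{s}|{d}" for s in static_keys for d in date_keys]


def select_range(keys, start_key, end_key):
    """Return the contiguous slice from start_key to end_key (end assumed inclusive)."""
    lo, hi = keys.index(start_key), keys.index(end_key)
    if lo > hi:
        raise ValueError("start_key must not come after end_key")
    return keys[lo : hi + 1]


dates = daily_keys(date(2022, 6, 11), date(2023, 1, 16))
keys = multipartition_keys(["a", "b", "c"], dates)
selected = select_range(keys, "a|2022-06-11", "b|2022-06-11")

# Every "a" key falls inside the range because all of them sort before the first "b" key.
a_keys_selected = [k for k in selected if k.startswith("a|")]
print(len(a_keys_selected) == len(dates))
```

Under these assumptions, a range that stays within one static dimension value (for example a|2022-06-11...a|2022-06-20) selects only the dates in between, while a range that crosses from "a" to "b" pulls in the rest of the "a" keys.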

mjclarke94 Jan 17, 2023

Ah, interesting. Thanks for clarifying!

patrikdevlin May 14, 2024

Linking this bug here to link back in case anyone else runs into the issue. Currently, I have a StaticPartitionsDefinition that is tied to each tenant, but selecting all n time-based partitions for, say, tenant a just spawns a new job for each, which is quite slow.

#18852

j-hulbert · 2023-01-24T21:26:09Z

j-hulbert
Jan 24, 2023

Thanks for providing this example! Is this warning expected when running a partition range using this method? This is with a custom IO manager and I don't get this warning when I run for a single partition.

WARNING: No previously stored outputs found for source StepOutputHandle(step_key='raw_sf__product_download', output_name='result', mapping_key=None). This is either because you are using an IO Manager that does not depend on run ID, or because all the previous runs have skipped the output in conditional execution.

1 reply

sryza Jan 26, 2023
Author

I wouldn't expect that warning message to be related.

Are you re-executing a run?

geoHeil · 2023-02-15T17:01:51Z

geoHeil
Feb 15, 2023

@sryza do they even support non-contiguous ranges?

1 reply

sryza Feb 15, 2023
Author

Currently, they only support contiguous ranges
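
Since only contiguous ranges are accepted, one possible workaround is to split a non-contiguous selection into contiguous pieces and launch one run per piece. The following is a plain-Python sketch of that grouping step only; contiguous_ranges is a hypothetical helper, not a Dagster API, and it assumes you already have the full ordered list of partition keys.

```python
def contiguous_ranges(all_keys, wanted):
    """Group wanted keys into (start, end) pairs that are contiguous in all_keys."""
    wanted_set = set(wanted)
    ranges = []
    start = prev = None
    for key in all_keys:
        if key in wanted_set:
            if start is None:
                start = key  # open a new range
            prev = key
        elif start is not None:
            ranges.append((start, prev))  # a gap closes the current range
            start = prev = None
    if start is not None:
        ranges.append((start, prev))
    return ranges


all_keys = [f"2023-01-0{d}" for d in range(1, 8)]
wanted = ["2023-01-01", "2023-01-02", "2023-01-04", "2023-01-06", "2023-01-07"]
ranges = contiguous_ranges(all_keys, wanted)
```

Each resulting (start, end) pair could then be supplied as the start and end of one range.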

seandavi · 2023-02-20T01:00:46Z

seandavi
Feb 20, 2023

When using this feature, how does one handle AssetMaterializations and Outputs? Yield one output for each partition? One AssetMaterialization for each partition?

3 replies

sryza Feb 22, 2023
Author

You only need one Output across all partitions. The value you should supply for your Output will depend on your IO manager. If it's a database IO manager like DuckDB or Snowflake, then you'll typically just return a single DataFrame. If it's a filesystem IO manager like the default IO manager or S3, then you'll return a dictionary keyed by partition key.

Btw this only works with software-defined assets. With software-defined assets, you never need to yield AssetMaterializations yourself.
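
As a rough illustration of the two return shapes described above, here is a plain-Python sketch. Lists of row dictionaries stand in for DataFrames, and the function names and sample rows are hypothetical; which shape a given IO manager expects should be checked against that IO manager's documentation.

```python
# Hypothetical per-partition results; each list of rows stands in for a DataFrame.
rows_by_partition = {
    "2023-02-20": [{"id": 1, "created_at": "2023-02-20"}],
    "2023-02-21": [
        {"id": 2, "created_at": "2023-02-21"},
        {"id": 3, "created_at": "2023-02-21"},
    ],
}


def output_for_database_io_manager(rows_by_partition):
    """One combined table covering every partition in the run."""
    return [row for rows in rows_by_partition.values() for row in rows]


def output_for_filesystem_io_manager(rows_by_partition):
    """One value per partition, keyed by partition key."""
    return dict(rows_by_partition)


combined = output_for_database_io_manager(rows_by_partition)
keyed = output_for_filesystem_io_manager(rows_by_partition)
```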

geoHeil Feb 23, 2023

But how should materialization metadata (row count, ...) be handled in this case? It is usually recorded per partition (see the plots on the asset page in Dagster). Wouldn't multiple materialization events be required?

sryza Feb 23, 2023
Author

Currently, the same metadata will get attached to every materialization. I filed an issue to track making it possible to provide different metadata for each partition: #12498.

clairelin135 · 2023-03-31T22:18:35Z

clairelin135
Mar 31, 2023
Maintainer

An example of how you could programmatically launch runs across a partition range, e.g. within a schedule:

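from dagster import RunRequest, schedule


@schedule(job=the_job, cron_schedule="* * * * *")
def my_schedule():
    ...
    # start_partition_key and end_partition_key are computed in the elided code above
    return RunRequest(
        tags={
            "dagster/asset_partition_range_start": start_partition_key,
            "dagster/asset_partition_range_end": end_partition_key,
        }
    )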
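
One way the start and end keys might be computed is sketched below in plain Python, without importing Dagster. The tag names come from the example above; the helpers partition_range_tags and trailing_days_range are hypothetical, and the sketch assumes daily partitions whose keys are formatted as YYYY-MM-DD.

```python
from datetime import date, timedelta

RANGE_START_TAG = "dagster/asset_partition_range_start"
RANGE_END_TAG = "dagster/asset_partition_range_end"


def partition_range_tags(start_key: str, end_key: str) -> dict:
    """Build the tags dict that marks a run as covering a partition range."""
    return {RANGE_START_TAG: start_key, RANGE_END_TAG: end_key}


def trailing_days_range(today: date, days: int) -> tuple:
    """Return (start_key, end_key) for the `days` daily partitions ending yesterday."""
    if days < 1:
        raise ValueError("days must be at least 1")
    end = today - timedelta(days=1)
    start = end - timedelta(days=days - 1)
    return start.isoformat(), end.isoformat()


start_key, end_key = trailing_days_range(date(2023, 3, 31), days=7)
tags = partition_range_tags(start_key, end_key)
```

The resulting dictionary has the same shape as the tags argument passed to RunRequest in the schedule.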

1 reply

sryza Jun 5, 2023
Author

fyi I spun this out into a separate discussion: #14622

slopp · 2023-04-19T16:32:07Z

slopp
Apr 19, 2023
Maintainer

An example of a project that implements this capability, specifically for two assets:

https://github.com/dagster-io/hooli-data-eng-pipelines/blob/master/hooli_data_eng/assets/raw_data/__init__.py

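from datetime import timedelta

import pandas as pd
from dagster import HourlyPartitionsDefinition, OpExecutionContext, asset

hourly_partitions = HourlyPartitionsDefinition(
    start_date="2023-04-11-00:00"
)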


def _hourly_partition_seq(start, end):
    start = pd.to_datetime(start)
    end = pd.to_datetime(end)
    hourly_diffs = int((end - start) / timedelta(hours=1))
    
    return [str(start + timedelta(hours=i)) for i in range(hourly_diffs)]


@asset(
    compute_kind="api",
    required_resource_keys={"data_api"},
    partitions_def=hourly_partitions,
    metadata={"partition_expr": "created_at"},
)
def users(context: OpExecutionContext) -> pd.DataFrame:
    """A table containing all users data"""
    api = context.resources.data_api
    # during a backfill the partition range will span multiple hours
    # during a single run the partition range will be for a single hour
    first_partition, last_partition = context.asset_partitions_time_window_for_output()
    partition_seq = _hourly_partition_seq(first_partition, last_partition)
    all_users = []
    for partition in partition_seq:
        resp = api.get_users(partition)
        users = pd.read_json(resp.json())
        all_users.append(users)

    return pd.concat(all_users)

In this example, the asset always calls context.asset_partitions_time_window_for_output() inside the asset function. This ensures the asset works whether it is materialized for a single partition or for multiple partitions. The asset loops over the partitions, calling the API once per partition, and then relies on the Snowflake and DuckDB IO managers to handle the output correctly.

The example also includes a unit test to ensure the asset works whether it is called with a single partition (regular incremental runs) or a backfill (multiple partitions supplied in one run): https://github.com/dagster-io/hooli-data-eng-pipelines/blob/master/hooli_data_eng_tests/test_assets.py
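
The single-partition and backfill cases can also be checked on the hourly helper alone. Below is a standard-library sketch of the same idea as _hourly_partition_seq, using datetime in place of pandas and returning datetimes rather than strings. It assumes the end of the time window is exclusive, and hourly_partition_seq is a hypothetical stand-in rather than the project's own function.

```python
from datetime import datetime, timedelta


def hourly_partition_seq(start: datetime, end: datetime) -> list:
    """Return the start of each hour in the half-open window [start, end)."""
    hours = int((end - start) / timedelta(hours=1))
    return [start + timedelta(hours=i) for i in range(hours)]


# A regular run: the window covers exactly one hourly partition.
single = hourly_partition_seq(datetime(2023, 4, 11, 0), datetime(2023, 4, 11, 1))

# A single-run backfill: the window spans three hourly partitions.
backfill = hourly_partition_seq(datetime(2023, 4, 11, 0), datetime(2023, 4, 11, 3))
```

Because the loop body is the same in both cases, the asset code does not need a separate branch for backfills.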

0 replies
