[how-to] Processing many partitions of an asset in parallel, within a single run #26738

dpeng817 · 2024-12-27T16:48:04Z

dpeng817
Dec 27, 2024
Maintainer

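An asset backed by a dynamic graph can process many partitions in parallel within a single run: one op fans out a dynamic output per partition key, a second op is mapped over each output, and a final op collects the results.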

from typing import Any

import dagster as dg

partitions_def = ...

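# Fan out a computation for each partition.
# `values_per_partition` is not defined in this snippet; it is assumed to be a mapping
# from partition key to the object that should be processed for that partition.
# Note that mapping keys may only contain letters, digits, and underscores, so partition
# keys with other characters (for example dashes in dates) must be sanitized first.
@dg.op(out=dg.DynamicOut())
def fan_out_partitions(context: dg.OpExecutionContext):
    for partition_key in context.partition_keys:
        yield dg.DynamicOutput(values_per_partition[partition_key], mapping_key=partition_key)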

# Process each partitioned object
@dg.op
def process_partition(context, obj: Any):
    return obj + 1

# Collect the results. This op does not return anything:
# its output type is Nothing, indicating no return value.
# We don't return anything because there is no single path at which this value could be
# stored, and returning one would confuse many of the system IO managers.
@dg.op(out=dg.Out(dagster_type=dg.Nothing))
def collect_partition_results(context, objs):
    ...

# Construct a graph which returns the result of `collect_partition_results`.
# The graph doesn't actually return anything, but returning here tells Dagster that when 
# `collect_partition_results` completes, the asset has been materialized.
# We additionally have backfill_policy set to single_run, which allows us to operate on many partitions in a single run.
@dg.graph_asset(partitions_def=partitions_def, backfill_policy=dg.BackfillPolicy.single_run())
def doubly_dynamic_asset():
    vals = fan_out_partitions().map(process_partition)
    return collect_partition_results(vals.collect())
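
One practical detail when using partition keys as mapping keys: Dagster only accepts letters, digits, and underscores in a `mapping_key`, so date-style keys such as `2024-12-27` would be rejected. Below is a minimal sketch of one way to handle this. The helpers `to_mapping_key` and `build_mapping_keys` are hypothetical names introduced here for illustration, not part of Dagster, and replacing invalid characters with underscores is just one possible convention.

```python
import re

# Hypothetical helpers (not part of Dagster). Assumption: a valid mapping key
# consists only of letters, digits, and underscores.
_INVALID_CHARS = re.compile(r"[^A-Za-z0-9_]")


def to_mapping_key(partition_key: str) -> str:
    """Replace every character that is not a letter, digit, or underscore with '_'."""
    return _INVALID_CHARS.sub("_", partition_key)


def build_mapping_keys(partition_keys):
    """Return {partition_key: mapping_key}, raising if two keys sanitize to the same string."""
    owner_by_mapping_key = {}
    result = {}
    for partition_key in partition_keys:
        mapping_key = to_mapping_key(partition_key)
        previous = owner_by_mapping_key.get(mapping_key)
        if previous is not None and previous != partition_key:
            raise ValueError(
                f"partition keys {previous!r} and {partition_key!r} "
                f"both map to {mapping_key!r}"
            )
        owner_by_mapping_key[mapping_key] = partition_key
        result[partition_key] = mapping_key
    return result
```

Inside `fan_out_partitions`, the sanitized value would then be passed as `mapping_key` while the original partition key is still used to look up the object to process. The collision check matters because two distinct partition keys (for example `a-b` and `a_b`) would otherwise produce the same mapping key.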
