Replies: 2 comments 4 replies
-
Hi @marcilj,

Reading your question, something I have found to be a useful / workable solution for generating many assets is to use a factory pattern (https://dagster.io/blog/python-factory-patterns). Their final result:

```python
import csv

import requests
from dagster import Definitions, asset

# One spec per asset to generate from the donor platform API.
specs = [
    {'name': 'volunteers', 'endpoint': 'v1/volunteers', 'file_type': 'csv'},
    {'name': 'donations', 'endpoint': 'v2/donations', 'file_type': 'csv'},
    {'name': 'donors', 'endpoint': 'v1/donors', 'file_type': 'json'},
    {'name': 'projects', 'endpoint': 'v1/projects', 'file_type': 'json'},
    {'name': 'fundraisers', 'endpoint': 'v1/fundraisers', 'file_type': 'csv'},
]


def generate_donor_platform_asset(spec):
    # Factory: build one @asset definition per spec.
    @asset(name=spec['name'])
    def _asset():
        result = requests.get(f'https://www.donorplatform.org/api/{spec["endpoint"]}')
        with open(f'{spec["name"]}.{spec["file_type"]}', 'w') as f:
            if spec["file_type"] == 'csv':
                writer = csv.writer(f)
                writer.writerows(result.json())
            elif spec["file_type"] == 'json':
                f.write(result.text)

    return _asset


defs = Definitions(assets=[generate_donor_platform_asset(spec) for spec in specs])
```

In your case you might be able to use:

```python
from functools import reduce
from operator import add

from dagster import AssetKey, AssetSpec, Definitions, external_assets_from_specs


def asset_spec_bucket_factory(spec):
    buckets = spec['buckets']
    data_type = spec['data_type']
    group_name = spec['group_name']
    meta_data = spec['meta_data']

    # One external AssetSpec per bucket, each depending on the previous one,
    # so the lineage reads bucket1 -> bucket2 -> bucket3 -> bucket4.
    a0 = AssetSpec(
        key=AssetKey([buckets[0], data_type]),
        metadata=meta_data,
        group_name=group_name,
    )
    a1 = AssetSpec(
        key=AssetKey([buckets[1], data_type]),
        metadata=meta_data,
        deps=[a0],
        group_name=group_name,
    )
    a2 = AssetSpec(
        key=AssetKey([buckets[2], data_type]),
        metadata=meta_data,
        deps=[a1],
        group_name=group_name,
    )
    a3 = AssetSpec(
        key=AssetKey([buckets[3], data_type]),
        metadata=meta_data,
        deps=[a2],
        group_name=group_name,
    )
    return external_assets_from_specs([a0, a1, a2, a3])

specs = {
    'case-1': {
        'buckets': ['bucket1', 'bucket2', 'bucket3', 'bucket4'],
        'data_type': 'type1',
        'group_name': 'type1',
        'meta_data': {'owner': 'data'},
    },
    'case-2': {
        'buckets': ['bucket1', 'bucket2', 'bucket3', 'bucket4'],
        'data_type': 'type2',
        'group_name': 'type2',
        'meta_data': {'owner': 'data'},
    },
}

all_assets = reduce(add, [asset_spec_bucket_factory(spec) for case, spec in specs.items()])
```

Quick question:
From your description, it looks as though only the first step, where something lands in bucket 1, is out of your control? If that is the case, you would use regular …

**What I currently do**

For some context, I currently have files landing in various directories like so:

```
experiement_type_A/exp0/*
experiement_type_A/exp1/*
...
experiement_type_A/expN/*
experiement_type_B/exp0/*
experiement_type_B/exp1/*
...
experiement_type_B/expM/*
```

Each experiment type here represents an asset, where each submitted experiment is run from a sensor (a rough sketch follows below) that:

- uses cursors: https://docs.dagster.io/concepts/partitions-schedules-sensors/sensors#sensor-optimizations-using-cursors
- creates new dynamic partitions: https://docs.dagster.io/concepts/partitions-schedules-sensors/partitioning-assets#dynamically-partitioned-assets

Now on to the Partitioned …
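To make that concrete, here is a rough, untested sketch of that cursor-plus-dynamic-partitions pattern. The asset name, the `experiement_type_A` directory, and the discovery logic are placeholders, not my real code:

```python
import os

from dagster import (
    AssetSelection,
    Definitions,
    DynamicPartitionsDefinition,
    RunRequest,
    SensorResult,
    asset,
    define_asset_job,
    sensor,
)

# One dynamic partition per submitted experiment (exp0, exp1, ...).
exp_partitions = DynamicPartitionsDefinition(name="experiment_type_A")


@asset(partitions_def=exp_partitions)
def experiment_type_A(context):
    # Process everything under experiement_type_A/<partition_key>/*.
    context.log.info(f"processing {context.partition_key}")


exp_job = define_asset_job(
    "experiment_type_A_job", selection=AssetSelection.assets(experiment_type_A)
)


@sensor(job=exp_job)
def experiment_sensor(context):
    # The cursor remembers which experiment directories have already been seen,
    # so each tick only picks up new submissions.
    seen = set(context.cursor.split(",")) if context.cursor else set()
    current = set(os.listdir("experiement_type_A"))  # placeholder discovery step
    new = sorted(current - seen)
    context.update_cursor(",".join(sorted(current)))
    return SensorResult(
        run_requests=[RunRequest(partition_key=name) for name in new],
        dynamic_partitions_requests=[exp_partitions.build_add_request(new)],
    )


defs = Definitions(
    assets=[experiment_type_A], jobs=[exp_job], sensors=[experiment_sensor]
)
```

The partition-add requests and the run requests go out in the same `SensorResult`, so the new partitions exist by the time the runs are launched.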
-
@nickvazz Seeing this break in dagster 1.7.14 with:

1.7.13 seems OK though. By the way, I wonder if we have a proper fix for the original issue?
-
I've read the documentation here and listened to the blog post here, but I have a hard time understanding how this is supposed to scale / be configured.

The primary example described in the documentation is configuring files like shown here. In a real-world scenario that would end up being files stored in S3, and that's where my confusion starts.

If we want to follow the transformation of a file, then for each file we need to create an `AssetSpec`. That seems like a lot of configuration and doesn't look very scalable.

If we want to follow the transformation of a group of files to trigger runs downstream, it means that we have a concept of a group of files that work together to build a downstream asset. So we would configure folders of files in our `AssetSpec` definition. That being said, `AssetSpec` doesn't seem to support partitions, so each file would only be new metadata on the asset instead of a new partition?

Now I'll stop the assumptions and ask how a production-ready structure like mine could be configured with the external assets feature.

Each of my files is processed 4 times before landing in Snowflake. At each step, the new file lands in a new bucket. That means my files go from `bucket-1` -> `bucket-2` -> `bucket-3` -> `bucket-4`.

The files I store in my buckets are also organized in a structured way, per type and date. That means my files go from `bucket-1/type-1/2023/01/01` -> `bucket-2/type-1/2023/01/01` -> `bucket-3/type-1/2023/01/01` -> `bucket-4/type-1/2023/01/01`.

Knowing that I have around 50 types of data defined and more coming, how would I configure this to see upstream assets in Dagster?

What I've tried is to configure it like this:

But even for 1 type this seems to become highly complicated, and it doesn't look like it will scale very easily. In addition, since the partition is not really used in this situation, I have a hard time understanding how I would be able to trigger runs based on anything that happens upstream if I can't configure them.
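To be explicit about the shape of the problem, here is a rough sketch of the per-type chain I would need to repeat for all ~50 types. `BUCKETS`, `DATA_TYPES`, and `chain_for_type` are placeholder names, not my actual configuration:

```python
from dagster import AssetKey, AssetSpec, Definitions, external_assets_from_specs

BUCKETS = ["bucket-1", "bucket-2", "bucket-3", "bucket-4"]
DATA_TYPES = ["type-1", "type-2"]  # ...and so on, up to ~50 types


def chain_for_type(data_type):
    # One external asset per bucket, each depending on the previous bucket,
    # so the lineage reads bucket-1 -> bucket-2 -> bucket-3 -> bucket-4.
    specs = []
    upstream = None
    for bucket in BUCKETS:
        spec = AssetSpec(
            key=AssetKey([bucket, data_type]),
            deps=[upstream] if upstream is not None else [],
            # group names can't contain dashes, so swap them for underscores
            group_name=data_type.replace("-", "_"),
        )
        specs.append(spec)
        upstream = spec
    return specs


all_specs = [spec for data_type in DATA_TYPES for spec in chain_for_type(data_type)]
defs = Definitions(assets=external_assets_from_specs(all_specs))
```

Even if a loop like this keeps the boilerplate down, I still don't see how a new date prefix (`2023/01/01`, `2023/01/02`, ...) would become a partition or trigger anything downstream.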
Thank you for your help.