-
For option 2:
You can select which assets are part of a particular run. Selected assets will read the latest stored version of any upstream assets they depend on, even if those upstream assets are not part of the run. Given that ability, I think option 2 probably makes the most sense for what I know of your use case.
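To make that behavior concrete, here is a toy model of it in plain Python (this is not Dagster code; the `store` dict and all names are invented stand-ins): assets outside the run's selection are not recomputed, but their last materialized value is still what selected downstream assets read.

```python
# Toy model of asset selection: only assets in `selection` are recomputed;
# everything else is read from its last materialized value in `store`.
# This mimics the behavior described above but is NOT Dagster code.

store = {"addresses": ["old address A", "old address B"]}  # last materialization

def compute_postal_codes(addresses):
    return [f"code-for:{a}" for a in addresses]

def run(selection):
    if "addresses" in selection:
        store["addresses"] = ["new address C"]  # recomputed only if selected
    if "postal_codes" in selection:
        # Reads whatever version of `addresses` is currently stored,
        # whether or not `addresses` was part of this run.
        store["postal_codes"] = compute_postal_codes(store["addresses"])

run(selection={"postal_codes"})  # `addresses` is NOT selected
print(store["postal_codes"])     # → ['code-for:old address A', 'code-for:old address B']
```

The postal-code step ran against the previously stored address list, which is exactly why selecting only the downstream asset avoids the unwanted refresh of the upstream one.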
-
Let's take a basic scenario: create a pipeline that retrieves postal codes from an API, given a list of addresses as input, and stores the results in Snowflake. What's the best way to achieve this using Dagster?
Base elements for all options:
Option 1 (Using a static list):
Obviously not the best option.
Option 2 (Using an addresses asset and an input list):
The problem here is that the asset will be refreshed on each run, which is something we might not want. Also, there's no way to filter inside the addresses asset; we would need to build the logic into the postal-code asset to filter on what we want, if needed.
Option 3 (Using an op to get the addresses):
This seems like a good option, but we don't leverage the asset concept, so we lose the lineage of where the source data comes from.
Option 4 (Using the list of addresses as partitions):
The problems here are scalability and the cost of running this on Dagster Cloud. We need logic that updates the list of partitions, so technically an op that does this. And if we configure an op for that anyway, why not use the op directly as an input instead?
Since we pay for every materialization, this would increase the cost of running Dagster Cloud by a lot.
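The cost concern can be made concrete with simple arithmetic. The per-materialization price below is an invented placeholder, not actual Dagster Cloud pricing; the point is only that a partition-per-address design multiplies materializations by the number of addresses.

```python
# Rough cost comparison: one materialization per run for a whole-list asset
# versus one materialization per address for a partition-per-address design.
# The price is a made-up number (in cents) for illustration only.

PRICE_CENTS_PER_MATERIALIZATION = 3  # hypothetical price

def cost_whole_list(runs: int) -> int:
    # Options 1-3: each run materializes the postal-code asset once.
    return runs * 1 * PRICE_CENTS_PER_MATERIALIZATION

def cost_per_address_partitions(runs: int, n_addresses: int) -> int:
    # Option 4: each run materializes one partition per address.
    return runs * n_addresses * PRICE_CENTS_PER_MATERIALIZATION

print(cost_whole_list(100))                   # → 300 cents
print(cost_per_address_partitions(100, 500))  # → 150000 cents
```

With 500 addresses, option 4 costs 500x more per refresh than the single-asset options under this toy pricing, which is the scalability worry stated above.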
So here's where I am in my thinking: I don't know which option is best. If anyone could review them and tell me which would be the best option and why, that would be awesome.
Thank you very much.