-
For option 2:
You can select which assets are part of a particular run. Selected assets will read the latest stored version of any upstream assets they depend on, even if those upstream assets are not part of the run. Given that ability, I think option 2 probably makes the most sense for what I know of your use case.
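To make that behavior concrete, here is a toy model of it in plain Python (this is not Dagster code; the `store` dict and all names are invented stand-ins): assets outside the run's selection are not recomputed, but their last materialized value is still what selected downstream assets read.

```python
# Toy model of asset selection: only assets in `selection` are recomputed;
# everything else is read from its last materialized value in `store`.
# This mimics the behavior described above but is NOT Dagster code.

store = {"addresses": ["old address A", "old address B"]}  # last materialization

def compute_postal_codes(addresses):
    return [f"code-for:{a}" for a in addresses]

def run(selection):
    if "addresses" in selection:
        store["addresses"] = ["new address C"]  # recomputed only if selected
    if "postal_codes" in selection:
        # Reads whatever version of `addresses` is currently stored,
        # whether or not `addresses` was part of this run.
        store["postal_codes"] = compute_postal_codes(store["addresses"])

run(selection={"postal_codes"})  # `addresses` is NOT selected
print(store["postal_codes"])     # → ['code-for:old address A', 'code-for:old address B']
```

The postal-code step ran against the previously stored address list, which is exactly why selecting only the downstream asset avoids the unwanted refresh of the upstream one.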
-
Let's take a basic scenario: create a pipeline that retrieves postal codes from an API, given a list of addresses as input, and stores the results in Snowflake. What's the best way to achieve this using Dagster?
Base elements for all options:
Option 1 (Using a static list):
Obviously not the best option.
Option 2 (Using an addresses asset and an input list):
The problem here is that the asset will be refreshed on each run, which is something we might not want. Also, there's no way to filter inside the addresses asset; we would need to build the logic into the postal-code asset to filter on what we want, if needed.
Option 3 (Using an op to get the addresses):
This seems like a good option, but we don't leverage the asset concept, so we lose the lineage of where the source data comes from.
Option 4 (Using the list of addresses as partitions):
The problems here are scalability and the cost of running this on Dagster Cloud. We need logic that updates the list of partitions, so technically an op that does this. And if we configure an op for that anyway, why not use the op directly as an input instead?
Since we pay for every materialization, this would increase the cost of running Dagster Cloud by a lot.
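The cost concern can be made concrete with simple arithmetic. The per-materialization price below is an invented placeholder, not actual Dagster Cloud pricing; the point is only that a partition-per-address design multiplies materializations by the number of addresses.

```python
# Rough cost comparison: one materialization per run for a whole-list asset
# versus one materialization per address for a partition-per-address design.
# The price is a made-up number (in cents) for illustration only.

PRICE_CENTS_PER_MATERIALIZATION = 3  # hypothetical price

def cost_whole_list(runs: int) -> int:
    # Options 1-3: each run materializes the postal-code asset once.
    return runs * 1 * PRICE_CENTS_PER_MATERIALIZATION

def cost_per_address_partitions(runs: int, n_addresses: int) -> int:
    # Option 4: each run materializes one partition per address.
    return runs * n_addresses * PRICE_CENTS_PER_MATERIALIZATION

print(cost_whole_list(100))                   # → 300 cents
print(cost_per_address_partitions(100, 500))  # → 150000 cents
```

With 500 addresses, option 4 costs 500x more per refresh than the single-asset options under this toy pricing, which is the scalability worry stated above.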
So here's where I am in my thinking: I don't know which option is best. If anyone could review them and tell me which would be the best option and why, that would be awesome.
Thank you very much.