Replies: 1 comment
+1, and piggybacking on this to say that Freshness Policies as they exist are confusing in how they define "freshness". Some source data will never update (e.g. a lookup table), but I'd like to use a freshness policy to determine whether my reporting views have incorporated all new materializations since the end of the last interval. In their current state, freshness policies only consider the absolute age of all underlying assets. If every downstream asset has materialized after its respective upstream assets, then the freshness policy of the terminal asset should report that it's up to date.
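A recency-based check like the one described might look as follows. This is a rough sketch with invented asset names and timestamps, not Dagster's actual freshness logic (which compares against wall-clock lag): an asset counts as fresh if it materialized after each of its direct upstreams, regardless of how old a static source is.

```python
from datetime import datetime

# Hypothetical materialization timestamps (asset names invented for illustration).
last_materialized = {
    "lookup_table": datetime(2023, 1, 1),        # static source; never updates
    "staging":      datetime(2024, 6, 1, 8, 0),
    "report_view":  datetime(2024, 6, 1, 9, 0),
}
upstreams = {"staging": ["lookup_table"], "report_view": ["staging"]}

def is_up_to_date(asset):
    """Fresh if the asset materialized after every direct upstream's latest run."""
    return all(
        last_materialized[asset] >= last_materialized[parent]
        for parent in upstreams.get(asset, [])
    )

# report_view ran after staging, which ran after the static lookup table,
# so the chain counts as fresh despite the lookup table's absolute age:
print(is_up_to_date("report_view"))  # True
```

Under these semantics, the year-old lookup table never makes the terminal asset look stale, because only relative ordering matters.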
If an asset has a hefty set of upstream assets (say 20 levels of parents with ELT/ETL pipelines, dbt models, etc. updating them), the `max_lag_minutes` of the `FreshnessPolicy` becomes too burdensome to estimate.

I want to be able to make a freshness guarantee that this asset contains data at most 1 hour late compared to its immediate parent, or to parents within 5 levels up, but I don't care about 20 levels up. The upstream assets 10-20 levels up may take a variable amount of time to run, so one has to estimate that time and add it to the `FreshnessPolicy`'s `max_lag_minutes` to make a policy that can actually be met.

More concretely, if I have asset Z with a cron schedule at 5 PM, which has source assets A, B, C, D somewhere up the chain: it's too burdensome for Z to consider each pipeline that takes source A, B, C, or D and materializes the intermediate assets. Maybe the pipelines (assets E -> ...) take 2 minutes to run on Tuesdays but 5 hours on Friday nights. Now Z has to add up the time of asset runs, including Dagster tick delays and external delays/scheduling (say some of these are observable source assets/specs), just to know to set `max_lag_minutes` to 5 hours. And each newly added downstream asset, or newly integrated upstream source asset, keeps increasing this burden.

However, Z should be able to make a simple freshness guarantee: I will contain data no more than 5 minutes behind asset Y (`max_parent_level=1`) by 5 PM. And Y might say: I will contain data no more than 1 hour behind my parents within `max_parent_level=5`.

This naturally maps to team boundaries as well, since, say, a data analyst team working on their dbt models isn't reasonably expected to know the ins and outs of the scheduling of the upstream ELT from a different team that feeds data into the DW they consume from. But Dagster is great at showing the lineage and orchestrating both the ELT and the dbt models. Let the analyst set a boundary of `max_parent_level=5` (or whatever) so that they are only considering their freshness with respect to source data they understand, but we don't have to add phantom sources or hide the lineage of the data.