Replies: 1 comment
+1, and piggybacking on this to say that Freshness Policies as they exist are confusing in how they define "freshness". Some source data will never update (e.g. a lookup table), but I'd like to use a freshness policy to determine whether my reporting views have incorporated all new materializations since the end of the last interval. In their current state, freshness policies only consider the absolute age of all underlying assets. If every downstream asset has materialized after its respective upstream assets, then the freshness policy of the terminal asset should report that it's up to date.
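A recency-based check like the one described might look as follows. This is a rough sketch with invented asset names and timestamps, not Dagster's actual freshness logic (which compares against wall-clock lag): an asset counts as fresh if it materialized after each of its direct upstreams, regardless of how old a static source is.

```python
from datetime import datetime

# Hypothetical materialization timestamps (asset names invented for illustration).
last_materialized = {
    "lookup_table": datetime(2023, 1, 1),        # static source; never updates
    "staging":      datetime(2024, 6, 1, 8, 0),
    "report_view":  datetime(2024, 6, 1, 9, 0),
}
upstreams = {"staging": ["lookup_table"], "report_view": ["staging"]}

def is_up_to_date(asset):
    """Fresh if the asset materialized after every direct upstream's latest run."""
    return all(
        last_materialized[asset] >= last_materialized[parent]
        for parent in upstreams.get(asset, [])
    )

# report_view ran after staging, which ran after the static lookup table,
# so the chain counts as fresh despite the lookup table's absolute age:
print(is_up_to_date("report_view"))  # True
```

Under these semantics, the year-old lookup table never makes the terminal asset look stale, because only relative ordering matters.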
If an asset has a hefty set of upstream assets (say 20 levels of parents with ELT/ETL pipelines, dbt models, etc. updating them), the `max_lag_minutes` of the `FreshnessPolicy` becomes too burdensome to estimate.

I want to be able to make a freshness guarantee that this asset contains data at most 1 hour late compared to its immediate parent, or to parents within 5 levels up, but I don't care about 20 levels up. The upstream assets 10-20 levels up may take a variable amount of time to run, so one has to estimate that time and add it to the `FreshnessPolicy`'s `max_lag_minutes` to make a policy that can actually be met.

More concretely, if I have asset Z with a cron schedule at 5 PM, which has source assets A, B, C, D somewhere up the chain: it's too burdensome for Z to consider each pipeline that takes source A, B, C, or D and materializes the intermediate assets. Maybe the pipelines (assets E -> ...) take 2 minutes to run on Tuesdays but 5 hours on Friday nights. Now Z has to add up the time of asset runs, including Dagster tick delays and external delays/scheduling (say some of these are observable source assets/specs), just to know to set `max_lag_minutes` to 5 hours. And each newly added downstream asset, or newly integrated upstream source asset, keeps increasing this burden.

However, Z should be able to make a simple freshness guarantee: I will contain data no more than 5 minutes behind asset Y (`max_parent_level=1`) by 5 PM. And Y might say: I will contain data no more than 1 hour behind my parents within `max_parent_level=5`.

This naturally maps to team boundaries as well, since, say, a data analyst team working on their dbt models isn't reasonably expected to know the ins and outs of the scheduling of the upstream ELT from a different team that feeds data into the DW they consume from. But Dagster is great at showing the lineage and orchestrating both the ELT and the dbt models. Let the analyst set a boundary of `max_parent_level=5` (or whatever) so that they are only considering their freshness with respect to source data they understand, but we don't have to add phantom sources or hide the lineage of the data.