TL;DR: what is the proper way to read the data already present in an asset when recomputing it?
There is a useful feature in dbt: a model can query the already existing table in the database that this same model is responsible for creating. It is particularly useful for incremental tasks, because you can filter out the incoming rows that are already present in the table and avoid recalculating them on every run.
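To make the pattern concrete, here is a minimal sketch of what I mean, written in plain Python with `sqlite3` rather than dbt or Dagster. The table name `events`, its columns, and the function `incremental_load` are made up for illustration only.

```python
import sqlite3


def incremental_load(conn, incoming):
    """Insert only the rows whose id is not yet in the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events (id INTEGER PRIMARY KEY, payload TEXT)"
    )
    # The "query your own table" step: read the keys that are already stored.
    existing = {row[0] for row in conn.execute("SELECT id FROM events")}
    # Drop the incoming rows that were handled in a previous run.
    new_rows = [(i, p) for i, p in incoming if i not in existing]
    conn.executemany("INSERT INTO events (id, payload) VALUES (?, ?)", new_rows)
    conn.commit()
    return len(new_rows)


conn = sqlite3.connect(":memory:")
first = incremental_load(conn, [(1, "a"), (2, "b")])   # empty table, both rows are new
second = incremental_load(conn, [(2, "b"), (3, "c")])  # id 2 is already stored
total = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

The second call only inserts the row with id 3, which is the behaviour I would like to reproduce inside an asset.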
So my question is basically: what is the best way to achieve the same thing inside a Dagster asset?
I mean, while Dagster abstracts away the actual reading and writing of data from the underlying storage, I cannot find a way to pass the data of an existing asset into its own computation.
There is a way to work around this by calling the underlying storage directly from the asset function, but it either:
breaks the idea of abstracting IO from the computation and removes the ability to easily change the physical storage with an IO manager
or makes the code very messy, since the asset has to call the IOManager's internal adapter as a resource (this approach also requires a very specific code architecture, I assume).
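For reference, the second workaround looks roughly like the sketch below. It is framework-agnostic Python, not Dagster API: `TableStore`, `InMemoryStore` and `compute_asset` are hypothetical names, and in a real project I assume the store object would be handed to the asset as a resource.

```python
from typing import Dict, Iterable, Protocol, Set, Tuple

Row = Tuple[int, str]


class TableStore(Protocol):
    """Hypothetical storage interface; not part of Dagster."""

    def existing_ids(self) -> Set[int]: ...

    def append(self, rows: Iterable[Row]) -> None: ...


class InMemoryStore:
    """Stand-in backend; a database-backed variant would expose the same two methods."""

    def __init__(self) -> None:
        self.rows: Dict[int, str] = {}

    def existing_ids(self) -> Set[int]:
        return set(self.rows)

    def append(self, rows: Iterable[Row]) -> None:
        self.rows.update(rows)


def compute_asset(store: TableStore, incoming: Iterable[Row]) -> int:
    """Process only rows the store has not seen yet and return how many were added."""
    seen = store.existing_ids()
    # The upper-casing stands in for an expensive per-row computation.
    fresh = [(i, payload.upper()) for i, payload in incoming if i not in seen]
    store.append(fresh)
    return len(fresh)


store = InMemoryStore()
added_first = compute_asset(store, [(1, "a"), (2, "b")])
added_second = compute_asset(store, [(2, "b"), (3, "c")])
```

This keeps the storage swappable, but the asset now does its own reads and writes next to whatever the IO manager does, which is the messiness I am trying to avoid.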
To my knowledge, Dagster supports self-dependency only in very specific cases involving time partitioning:
Assets can only depend on themselves if they are:
(a) time-partitioned and each partition depends on earlier partitions
(b) multipartitioned, with one time dimension that depends on earlier time partitions
However, it is not always optimal to use time partitioning, or there may already be a plain, non-partitioned pipeline, and introducing partitions there would add unnecessary complexity.