Assets & IOManager improvements #14978
danielgafni started this conversation in Ideas
@sryza and I had a discussion regarding possible ways to improve the `IOManager` and assets. Here are my thoughts:

1. `IOManager.delete_output` method. Add a way to call it from Dagit. `SourceAsset` shouldn't have this functionality, only a normal `AssetsDefinition`. Calling this method should delete the data itself (in contrast to "wiping" assets from Dagit, which only deletes their records from Dagster's DB).

2. `IOManager.observe_input` and `IOManager.observe_object` methods. `observe_object` would take precedence over `observe_input`. The difference is that `observe_object` would be chained with `load_input` and requires actually loading and processing the data (for example, computing a hash), while `observe_input` wouldn't necessarily load the data and could use something like a `modified_at` timestamp to compute the `DataVersion`. Both could be used in normal and source assets. For normal assets we could use them to recreate the metadata and data version in case they were lost, the object was modified externally, or we are in a fresh branch deployment and using the `BranchingIOManager` to read upstream production data. This would also cover the "mark assets/partitions as materialized" use case. Having these methods in the `IOManager` would be more convenient than in the `AssetsDefinition`/`SourceAsset`, because the `IOManager` already has the logic to load/observe the data, while having the `observe_fn` as a separate function requires duplicating this code.

3. `PartitionedIOManager`. Instead of handling the logic around loading multiple partitions in `load_input`, let Dagster detect whether it needs to call it multiple times when loading multiple partitions. We currently have this logic inside the `UPathIOManager`. We would have to add an optional `partition_key` argument to `load_input` to achieve that. When loading multiple partitions, Dagster could resolve the partition mapping and pass the upstream partitions there. The `PartitionedIOManager` would return a mapping of partition keys to objects, like the `UPathIOManager` currently does. This behavior would not be ideal for all possible IOManagers (for example, when working with DBs you would rather have a single `load_input` call that does a bulk read of all the partitions), but it would be convenient when a bulk read is not available.
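To make the three proposals above more concrete, here is a rough sketch in plain Python. It deliberately does not import Dagster: `FileIOManager`, `HashingFileIOManager`, `observe`, and `load_partitions` are hypothetical names, the method signatures (plain `asset_key` / `partition_key` arguments instead of context objects) are simplified assumptions, and the returned strings only stand in for a `DataVersion`. It illustrates the intended behavior and is not how Dagster implements anything today.

```python
import hashlib
import json
from pathlib import Path
from typing import Any, Dict, Optional, Sequence


class FileIOManager:
    """Hypothetical JSON-on-disk manager; a stand-in, not Dagster's IOManager."""

    def __init__(self, base_dir: str) -> None:
        self.base_dir = Path(base_dir)

    def _path(self, asset_key: str, partition_key: Optional[str] = None) -> Path:
        name = asset_key if partition_key is None else f"{asset_key}__{partition_key}"
        return self.base_dir / f"{name}.json"

    def handle_output(self, asset_key: str, obj: Any, partition_key: Optional[str] = None) -> None:
        self._path(asset_key, partition_key).write_text(json.dumps(obj, sort_keys=True))

    def load_input(self, asset_key: str, partition_key: Optional[str] = None) -> Any:
        # Proposal 3: an optional partition_key, so one call loads one partition.
        return json.loads(self._path(asset_key, partition_key).read_text())

    def delete_output(self, asset_key: str, partition_key: Optional[str] = None) -> None:
        # Proposal 1: delete the stored data itself, not only bookkeeping records.
        self._path(asset_key, partition_key).unlink()

    def observe_input(self, asset_key: str, partition_key: Optional[str] = None) -> str:
        # Proposal 2 (cheap path): derive a version from file metadata, no load.
        return str(self._path(asset_key, partition_key).stat().st_mtime_ns)


class HashingFileIOManager(FileIOManager):
    def observe_object(self, obj: Any) -> str:
        # Proposal 2 (expensive path): derive a version from the loaded object.
        return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()


def observe(manager: FileIOManager, asset_key: str, partition_key: Optional[str] = None) -> str:
    """Framework side (assumed): prefer observe_object, chained with load_input."""
    observe_object = getattr(manager, "observe_object", None)
    if observe_object is not None:
        return observe_object(manager.load_input(asset_key, partition_key))
    return manager.observe_input(asset_key, partition_key)


def load_partitions(
    manager: FileIOManager, asset_key: str, partition_keys: Sequence[str]
) -> Dict[str, Any]:
    """Framework side (assumed): call load_input once per upstream partition."""
    return {key: manager.load_input(asset_key, key) for key in partition_keys}
```

With something like this, the framework would own the per-partition loop and the choice between the two observe paths, while a manager backed by a database could still opt out and implement a single bulk `load_input` instead.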