-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[external-assets] refactor AssetGraph #19900
Conversation
e1375b2
to
92c7a9f
Compare
44f6b66
to
bf0569b
Compare
92c7a9f
to
1ae5a2c
Compare
bf0569b
to
e839eb2
Compare
1ae5a2c
to
03b5a31
Compare
e839eb2
to
ee4dd4e
Compare
c58fbba
to
d3673a7
Compare
ee4dd4e
to
d29c798
Compare
d3673a7
to
96bc75b
Compare
d29c798
to
c4ff010
Compare
96bc75b
to
30b2395
Compare
c4ff010
to
a62ad47
Compare
30b2395
to
b5da3fc
Compare
0f657ae
to
c0e8feb
Compare
1d3da61
to
982b7e4
Compare
c0e8feb
to
1340755
Compare
982b7e4
to
a3e7f79
Compare
1340755
to
e6a7dab
Compare
5a0382b
to
3dd1442
Compare
Deploy preview for dagit-core-storybook ready! ✅ Preview Built with commit 9532c18. |
9146696
to
ab4801d
Compare
non_source_node_pairs: List[Tuple[RepositoryHandle, "ExternalAssetNode"]] = [] | ||
observable_source_node_pairs: List[Tuple[RepositoryHandle, "ExternalAssetNode"]] = [] | ||
non_observable_source_node_pairs: List[Tuple[RepositoryHandle, "ExternalAssetNode"]] = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need to rename ExternalAssetNode
or else we might go crazy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok given this we need clearer names and copious comments here. Are observable_source_node_pairs
just assets represented by SourceAsset
, or is it inclusive of external observables?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"source" here in general is inclusive of observables-- when the ExternalAssetNode
list is built in the code server external assets and source assets both have is_source
set to True
. I've changed the names and added a comment.
# It is possible for multiple nodes to exist that share the same key. This is invalid if | ||
# more than one node is materializable or if more than one node is observable. It is valid | ||
# if there is at most one materializable node and at most one observable node, with all | ||
# other nodes unexecutable. The asset graph will receive only a single `ExternalAssetNode` | ||
# representing the asset. This will always be the materializable node if one exists; then | ||
# the observable node if it exists; then finally the first-encountered unexecutable node. | ||
for repo_handle, node in non_source_node_pairs: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a super consequential decision and we need to consider it carefully. This is a pretty deep assumption that will spread throughout the system, all the way to the frontend and our top-level APIs.
The case that directly exposes this is when an asset is materializable in one repo but observable in another. What is the behavior in that case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear, this is a refactor that makes (1) the existing "merge" behavior legible for valid cases; (2) errors on undefined cases. i.e. There is no intended behavior change here except to error on two materialization nodes or two observation nodes, which is an undefined state that we currently don't explicitly handle.
The case that directly exposes this is when an asset is materializable in one repo but observable in another. What is the behavior in that case?
This PR preserves the status quo behavior for this case. ExternalAssetGraph
will report that the asset is materializable but also is observable.
The use of ExternalAssetNode
for the merged info is non-ideal because it supports only a single execution_type
(and for other reasons). Via DM I mentioned that I was working on a coherent representation for assets across locations that, among other things, maps execution types to repo handles. My idea is to not use ExternalAssetNode
, which isn't suited for a multi-location representation, but rather a new class GlobalAssetNode
or MultiLocationAssetNode
(or whatever) analogous to the asset node class you proposed for InternalAssetGraph
. Then both InternalAssetGraph
and ExternalAssetGraph
will have their own node classes that share a common interface.
if ad.group_names_by_key[key] == group_name | ||
} | ||
|
||
def get_required_multi_asset_keys(self, asset_key: AssetKey) -> AbstractSet[AssetKey]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is required a new adjective in our ontology? I figured out based on code (keys that will have corresponding results 100% of the time when the containing op is executed) but it wasn't super natural
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nah that's been around for a long time. I usually get confused by it when I encounter it (usually via AssetOutputInfo.is_required
). get_required_multi_asset_keys
is a method that is already present on existing AssetGraph
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The word makes more sense to me in the context of an output (since that is always thought of in the context of an op) whereas the word makes less sense whening thinking about an asset in the context of a group or graph.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok this is looking pretty good. The most important thing is to discuss the implication of that canonicalization process of going from multiple representations of assets across code locations to a single ExternalAssetNode
.
IMO we should also rename ExternalAssetNode
asap to prevent ourselves from getting confused. We can then rename the entire External* family of classes to the same scheme out of the critical path of this workstream.
Strongly prefer to rename all |
ab4801d
to
9532c18
Compare
Deploy preview for dagster-docs ready! Preview available at https://dagster-docs-r6v2lftz9-elementl.vercel.app Direct link to changed pages: |
Deploy preview for dagit-storybook ready! ✅ Preview Built with commit 9532c18. |
if not node.is_source or node.is_observable | ||
|
||
# Split the nodes into materializable, observable, and unexecutable nodes. Observable and | ||
# unexecutable nodes represent both source and external assets. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to leave no doubt, this means that a SourceAsset
in a code location gets serialized as an ExternalAssetNode
and this in basically an unfortunate naming collision because of the entire External family of classes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exactly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool. Let's leave a comment to that effect
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated comment
asset_nodes_by_key = {} | ||
|
||
# It is possible for multiple nodes to exist that share the same key. This is invalid if | ||
# more than one node is materializable or if more than one node is observable. It is valid | ||
# if there is at most one materializable node and at most one observable node, with all | ||
# other nodes unexecutable. The asset graph will receive only a single `ExternalAssetNode` | ||
# representing the asset. This will always be the materializable node if one exists; then | ||
# the observable node if it exists; then finally the first-encountered unexecutable node. | ||
for repo_handle, node in materializable_node_pairs: | ||
if node.asset_key in asset_nodes_by_key: | ||
check.failed("Found two materialization nodes with the same asset key") | ||
asset_nodes_by_key[node.asset_key] = node | ||
|
||
for repo_handle, node in observable_node_pairs: | ||
if node.asset_key in asset_nodes_by_key: | ||
current_node = asset_nodes_by_key[node.asset_key] | ||
if current_node.is_observable: | ||
check.failed("Found two observable source nodes with the same asset key") | ||
asset_nodes_by_key[node.asset_key] = current_node._replace(is_observable=True) | ||
else: | ||
asset_nodes_by_key[node.asset_key] = node | ||
|
||
for repo_handle, node in unexecutable_node_pairs: | ||
if node.asset_key in asset_nodes_by_key: | ||
continue | ||
asset_nodes_by_key[node.asset_key] = node |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extracting this function into its own block might be nice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Disagree here as it's tightly coupled to the preceding block that splits nodes into materializable/observable/unexecutable, and these two blocks together comprise basically the entire function.
9532c18
to
01cf448
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok let's move forward with it.
[INTERNAL_BRANCH=sean/external-assets-refactor-asset-graph]
01cf448
to
ed6e938
Compare
…20059) ## Summary & Motivation #19900 led to an auto-materialize performance regression caught by our regression tests. After investigating with a profiler, I found the issue was lack of caching on asset key subset methods in the new `AssetGraph`. This PR adds them. ## How I Tested These Changes Ran perf regression tests against this branch.
…onstructing ExternalAssetGraph (#20062) ## Summary & Motivation #19900 introduced a check in the construction of `ExternalAssetGraph` that would throw an error if two materializable or observable defs having the same key were detected in a deployment (this is an invalid state). However, after running a shadow script against customer deployments, I discovered that ~3% of active deployments have this problem, so I am changing it to a warning (and making the message more informative) until we can find a clean strategy to message our customers who have this problem. ## How I Tested These Changes New test.
…20059) ## Summary & Motivation #19900 led to an auto-materialize performance regression caught by our regression tests. After investigating with a profiler, I found the issue was lack of caching on asset key subset methods in the new `AssetGraph`. This PR adds them. ## How I Tested These Changes Ran perf regression tests against this branch.
…onstructing ExternalAssetGraph (#20062) ## Summary & Motivation #19900 introduced a check in the construction of `ExternalAssetGraph` that would throw an error if two materializable or observable defs having the same key were detected in a deployment (this is an invalid state). However, after running a shadow script against customer deployments, I discovered that ~3% of active deployments have this problem, so I am changing it to a warning (and making the message more informative) until we can find a clean strategy to message our customers who have this problem. ## How I Tested These Changes New test.
## Summary & Motivation Internal companion PR: dagster-io/internal#8339 - Refactor `InternalAssetGraph` and `ExternalAssetGraph` to implement the `AssetGraph` interface by proxying to wrapped `AssetsDefinition`/`ExternalAssetNode` instead of building separate `<property_x>_by_key` dicts. - Add `AssetGraph` methods to retrieve various subsets of keys (e.g. `executable_asset_keys`) - Rename a few methods for consistency (e.g. `observable_keys` -> `observable_asset_keys`) - Standardize property lookup methods to `AssetGraph.get_<property_x>(asset_key)` instead of `AssetGraph.<property_x>_by_key.get(asset_key)`. The previous state is a mix, where some properties are exposed via single-asset-scoped getters and others are exposed as dict properties. This is over-complicated and leads to inconsistency in how lookups for missing asset keys are handled. We standardize on single-key lookup-- if callers need a dict, this can be easily constructed on site with something like: ```python {k: asset_graph.get_x(k) for k in graph.{all,observable,materializable,executable,external}_asset_keys`} ``` - Move `InternalAssetGraph` to its own module. ## Additional notes `AssetGraph` (a private API) has two variants; `InternalAssetGraph` (for assets scoped to one location) and `ExternalAssetGraph` (for assets scoped to multiple locations). They share a common interface that is based on a large number of `<property_x>_by_key` dicts, e.g. `code_location_by_key`. These dictionaries are constructed from `AssetsDefinition` (for `InternalAssetGraph`) and `ExternalAssetNode` (for `ExternalAssetGraph`) in lengthy `@staticmethod` constructors. This is complex, error-prone, and adds significant friction when making any changes to assets. This PR eliminates the `<property_x>_by_key` dicts in favor of having `InternalAssetGraph` and `ExternalAssetGraph` wrap `AssetsDefinition` and `ExternalAssetNode` respectively, implementing the common `AssetGraph` interface in terms of lookup operations on these nodes. In the initial version of this PR I have not added any caching on the methods except for `asset_dep_graph`. We may want to cache various sets of asset keys etc. I'm frankly not sure about the memory/perf tradeoff. ## How I Tested These Changes Existing test suite.
…20059) ## Summary & Motivation #19900 led to an auto-materialize performance regression caught by our regression tests. After investigating with a profiler, I found the issue was lack of caching on asset key subset methods in the new `AssetGraph`. This PR adds them. ## How I Tested These Changes Ran perf regression tests against this branch.
…onstructing ExternalAssetGraph (#20062) ## Summary & Motivation #19900 introduced a check in the construction of `ExternalAssetGraph` that would throw an error if two materializable or observable defs having the same key were detected in a deployment (this is an invalid state). However, after running a shadow script against customer deployments, I discovered that ~3% of active deployments have this problem, so I am changing it to a warning (and making the message more informative) until we can find a clean strategy to message our customers who have this problem. ## How I Tested These Changes New test.
Summary & Motivation
Internal companion PR: https://github.com/dagster-io/internal/pull/8339
InternalAssetGraph
andExternalAssetGraph
to implement theAssetGraph
interface by proxying to wrappedAssetsDefinition
/ExternalAssetNode
instead of building separate<property_x>_by_key
dicts.AssetGraph
methods to retrieve various subsets of keys (e.g.executable_asset_keys
)observable_keys
->observable_asset_keys
)AssetGraph.get_<property_x>(asset_key)
instead ofAssetGraph.<property_x>_by_key.get(asset_key)
. The previous state is a mix, where some properties are exposed via single-asset-scoped getters and others are exposed as dict properties. This is over-complicated and leads to inconsistency in how lookups for missing asset keys are handled. We standardize on single-key lookup-- if callers need a dict, this can be easily constructed on site with something like:InternalAssetGraph
to its own module.Additional notes
AssetGraph
(a private API) has two variants;InternalAssetGraph
(for assets scoped to one location) andExternalAssetGraph
(for assets scoped to multiple locations). They share a common interface that is based on a large number of<property_x>_by_key
dicts, e.g.code_location_by_key
. These dictionaries are constructed fromAssetsDefinition
(forInternalAssetGraph
) andExternalAssetNode
(forExternalAssetGraph
) in lengthy@staticmethod
constructors. This is complex, error-prone, and adds significant friction when making any changes to assets.This PR eliminates the
<property_x>_by_key
dicts in favor of havingInternalAssetGraph
andExternalAssetGraph
wrapAssetsDefinition
andExternalAssetNode
respectively, implementing the commonAssetGraph
interface in terms of lookup operations on these nodes.In the initial version of this PR I have not added any caching on the methods except for
asset_dep_graph
. We may want to cache various sets of asset keys etc. I'm frankly not sure about the memory/perf tradeoff.How I Tested These Changes
Existing test suite.