Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[external-assets] refactor AssetGraph #19900

Merged
merged 1 commit into from
Feb 23, 2024

Conversation

smackesey
Copy link
Collaborator

@smackesey smackesey commented Feb 19, 2024

Summary & Motivation

Internal companion PR: https://github.com/dagster-io/internal/pull/8339

  • Refactor InternalAssetGraph and ExternalAssetGraph to implement the AssetGraph interface by proxying to wrapped AssetsDefinition/ExternalAssetNode instead of building separate <property_x>_by_key dicts.
  • Add AssetGraph methods to retrieve various subsets of keys (e.g. executable_asset_keys)
  • Rename a few methods for consistency (e.g. observable_keys -> observable_asset_keys)
  • Standardize property lookup methods to AssetGraph.get_<property_x>(asset_key) instead of AssetGraph.<property_x>_by_key.get(asset_key). The previous state is a mix, where some properties are exposed via single-asset-scoped getters and others are exposed as dict properties. This is over-complicated and leads to inconsistency in how lookups for missing asset keys are handled. We standardize on single-key lookup-- if callers need a dict, this can be easily constructed on site with something like:
{k: asset_graph.get_x(k) for k in graph.{all,observable,materializable,executable,external}_asset_keys`}
  • Move InternalAssetGraph to its own module.

Additional notes

AssetGraph (a private API) has two variants; InternalAssetGraph (for assets scoped to one location) and ExternalAssetGraph (for assets scoped to multiple locations). They share a common interface that is based on a large number of <property_x>_by_key dicts, e.g. code_location_by_key. These dictionaries are constructed from AssetsDefinition (for InternalAssetGraph) and ExternalAssetNode (for ExternalAssetGraph) in lengthy @staticmethod constructors. This is complex, error-prone, and adds significant friction when making any changes to assets.

This PR eliminates the <property_x>_by_key dicts in favor of having InternalAssetGraph and ExternalAssetGraph wrap AssetsDefinition and ExternalAssetNode respectively, implementing the common AssetGraph interface in terms of lookup operations on these nodes.

In the initial version of this PR I have not added any caching on the methods except for asset_dep_graph. We may want to cache various sets of asset keys etc. I'm frankly not sure about the memory/perf tradeoff.

How I Tested These Changes

Existing test suite.

@smackesey smackesey force-pushed the sean/external-assets-assets-to-execute branch from e1375b2 to 92c7a9f Compare February 19, 2024 19:59
@smackesey smackesey force-pushed the sean/external-assets-refactor-asset-graph branch from 44f6b66 to bf0569b Compare February 19, 2024 20:00
@smackesey smackesey force-pushed the sean/external-assets-assets-to-execute branch from 92c7a9f to 1ae5a2c Compare February 19, 2024 21:34
@smackesey smackesey force-pushed the sean/external-assets-refactor-asset-graph branch from bf0569b to e839eb2 Compare February 19, 2024 21:45
@smackesey smackesey force-pushed the sean/external-assets-assets-to-execute branch from 1ae5a2c to 03b5a31 Compare February 20, 2024 12:29
@smackesey smackesey force-pushed the sean/external-assets-refactor-asset-graph branch from e839eb2 to ee4dd4e Compare February 20, 2024 12:29
@smackesey smackesey force-pushed the sean/external-assets-assets-to-execute branch 2 times, most recently from c58fbba to d3673a7 Compare February 20, 2024 15:40
@smackesey smackesey force-pushed the sean/external-assets-refactor-asset-graph branch from ee4dd4e to d29c798 Compare February 20, 2024 15:40
@smackesey smackesey force-pushed the sean/external-assets-assets-to-execute branch from d3673a7 to 96bc75b Compare February 20, 2024 20:46
@smackesey smackesey force-pushed the sean/external-assets-refactor-asset-graph branch from d29c798 to c4ff010 Compare February 20, 2024 20:46
@smackesey smackesey force-pushed the sean/external-assets-assets-to-execute branch from 96bc75b to 30b2395 Compare February 20, 2024 21:08
@smackesey smackesey force-pushed the sean/external-assets-refactor-asset-graph branch from c4ff010 to a62ad47 Compare February 20, 2024 21:09
@smackesey smackesey force-pushed the sean/external-assets-assets-to-execute branch from 30b2395 to b5da3fc Compare February 20, 2024 21:10
@smackesey smackesey force-pushed the sean/external-assets-refactor-asset-graph branch 2 times, most recently from 0f657ae to c0e8feb Compare February 20, 2024 21:19
@smackesey smackesey force-pushed the sean/external-assets-assets-to-execute branch from 1d3da61 to 982b7e4 Compare February 20, 2024 21:54
@smackesey smackesey force-pushed the sean/external-assets-refactor-asset-graph branch from c0e8feb to 1340755 Compare February 20, 2024 21:54
@smackesey smackesey force-pushed the sean/external-assets-assets-to-execute branch from 982b7e4 to a3e7f79 Compare February 20, 2024 22:58
@smackesey smackesey force-pushed the sean/external-assets-refactor-asset-graph branch from 1340755 to e6a7dab Compare February 20, 2024 23:01
@smackesey smackesey force-pushed the sean/external-assets-refactor-asset-graph branch from 5a0382b to 3dd1442 Compare February 22, 2024 20:10
Copy link

github-actions bot commented Feb 22, 2024

Deploy preview for dagit-core-storybook ready!

✅ Preview
https://dagit-core-storybook-aft28vbdb-elementl.vercel.app
https://sean-external-assets-refactor-asset-graph.core-storybook.dagster-docs.io

Built with commit 9532c18.
This pull request is being automatically deployed with vercel-action

@smackesey smackesey force-pushed the sean/external-assets-refactor-asset-graph branch 2 times, most recently from 9146696 to ab4801d Compare February 23, 2024 10:22
Comment on lines 102 to 104
non_source_node_pairs: List[Tuple[RepositoryHandle, "ExternalAssetNode"]] = []
observable_source_node_pairs: List[Tuple[RepositoryHandle, "ExternalAssetNode"]] = []
non_observable_source_node_pairs: List[Tuple[RepositoryHandle, "ExternalAssetNode"]] = []
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to rename ExternalAssetNode or else we might go crazy.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok given this we need clearer names and copious comments here. Are observable_source_node_pairs just assets represented by SourceAsset, or is it inclusive of external observables?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"source" here in general is inclusive of observables-- when the ExternalAssetNode list is built in the code server external assets and source assets both have is_source set to True. I've changed the names and added a comment.

Comment on lines 115 to 121
# It is possible for multiple nodes to exist that share the same key. This is invalid if
# more than one node is materializable or if more than one node is observable. It is valid
# if there is at most one materializable node and at most one observable node, with all
# other nodes unexecutable. The asset graph will receive only a single `ExternalAssetNode`
# representing the asset. This will always be the materializable node if one exists; then
# the observable node if it exists; then finally the first-encountered unexecutable node.
for repo_handle, node in non_source_node_pairs:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a super consequential decision and we need to consider it carefully. This is a pretty deep assumption that will spread throughout the system, all the way to the frontend and our top-level APIs.

The case that directly exposes this is when an asset is materializable in one repo but observable in another. What is the behavior in that case?

Copy link
Collaborator Author

@smackesey smackesey Feb 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, this is a refactor that makes (1) the existing "merge" behavior legible for valid cases; (2) errors on undefined cases. i.e. There is no intended behavior change here except to error on two materialization nodes or two observation nodes, which is an undefined state that we currently don't explicitly handle.

The case that directly exposes this is when an asset is materializable in one repo but observable in another. What is the behavior in that case?

This PR preserves the status quo behavior for this case. ExternalAssetGraph will report that the asset is materializable but also is observable.

The use of ExternalAssetNode for the merged info is non-ideal because it supports only a single execution_type (and for other reasons). Via DM I mentioned that I was working on a coherent representation for assets across locations that, among other things, maps execution types to repo handles. My idea is to not use ExternalAssetNode, which isn't suited for a multi-location representation, but rather a new class GlobalAssetNode or MultiLocationAssetNode (or whatever) analogous to the asset node class you proposed for InternalAssetGraph. Then both InternalAssetGraph and ExternalAssetGraph will have their own node classes that share a common interface.

if ad.group_names_by_key[key] == group_name
}

def get_required_multi_asset_keys(self, asset_key: AssetKey) -> AbstractSet[AssetKey]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is required a new adjective in our ontology? I figured out based on code (keys that will have corresponding results 100% of the time when the containing op is executed) but it wasn't super natural

Copy link
Collaborator Author

@smackesey smackesey Feb 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nah that's been around for a long time. I usually get confused by it when I encounter it (usually via AssetOutputInfo.is_required). get_required_multi_asset_keys is a method that is already present on existing AssetGraph.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The word makes more sense to me in the context of an output (since that is always thought of in the context of an op) whereas the word makes less sense whening thinking about an asset in the context of a group or graph.

Copy link
Member

@schrockn schrockn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok this is looking pretty good. The most important thing is to discuss the implication of that canonicalization process of going from multiple representations of assets across code locations to a single ExternalAssetNode.

IMO we should also rename ExternalAssetNode asap to prevent ourselves from getting confused. We can then rename the entire External* family of classes to the same scheme out of the critical path of this workstream.

@smackesey
Copy link
Collaborator Author

IMO we should also rename ExternalAssetNode asap to prevent ourselves from getting confused. We can then rename the entire External* family of classes to the same scheme out of the critical path of this workstream.

Strongly prefer to rename all External* in one shot rather than first ExternalAssetNode and then the others. I think that is pretty likely to confuse other engs not in this stream.

@smackesey smackesey force-pushed the sean/external-assets-refactor-asset-graph branch from ab4801d to 9532c18 Compare February 23, 2024 13:32
Copy link

github-actions bot commented Feb 23, 2024

Deploy preview for dagster-docs ready!

Preview available at https://dagster-docs-r6v2lftz9-elementl.vercel.app
https://sean-external-assets-refactor-asset-graph.dagster.dagster-docs.io

Direct link to changed pages:

Copy link

Deploy preview for dagit-storybook ready!

✅ Preview
https://dagit-storybook-q1hmxdndf-elementl.vercel.app
https://sean-external-assets-refactor-asset-graph.components-storybook.dagster-docs.io

Built with commit 9532c18.
This pull request is being automatically deployed with vercel-action

if not node.is_source or node.is_observable

# Split the nodes into materializable, observable, and unexecutable nodes. Observable and
# unexecutable nodes represent both source and external assets.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to leave no doubt, this means that a SourceAsset in a code location gets serialized as an ExternalAssetNode and this in basically an unfortunate naming collision because of the entire External family of classes

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exactly

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool. Let's leave a comment to that effect

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated comment

Comment on lines +115 to +142
asset_nodes_by_key = {}

# It is possible for multiple nodes to exist that share the same key. This is invalid if
# more than one node is materializable or if more than one node is observable. It is valid
# if there is at most one materializable node and at most one observable node, with all
# other nodes unexecutable. The asset graph will receive only a single `ExternalAssetNode`
# representing the asset. This will always be the materializable node if one exists; then
# the observable node if it exists; then finally the first-encountered unexecutable node.
for repo_handle, node in materializable_node_pairs:
if node.asset_key in asset_nodes_by_key:
check.failed("Found two materialization nodes with the same asset key")
asset_nodes_by_key[node.asset_key] = node

for repo_handle, node in observable_node_pairs:
if node.asset_key in asset_nodes_by_key:
current_node = asset_nodes_by_key[node.asset_key]
if current_node.is_observable:
check.failed("Found two observable source nodes with the same asset key")
asset_nodes_by_key[node.asset_key] = current_node._replace(is_observable=True)
else:
asset_nodes_by_key[node.asset_key] = node

for repo_handle, node in unexecutable_node_pairs:
if node.asset_key in asset_nodes_by_key:
continue
asset_nodes_by_key[node.asset_key] = node
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extracting this function into its own block might be nice

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Disagree here as it's tightly coupled to the preceding block that splits nodes into materializable/observable/unexecutable, and these two blocks together comprise basically the entire function.

@smackesey smackesey force-pushed the sean/external-assets-refactor-asset-graph branch from 9532c18 to 01cf448 Compare February 23, 2024 14:48
Copy link
Member

@schrockn schrockn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok let's move forward with it.

[INTERNAL_BRANCH=sean/external-assets-refactor-asset-graph]
@smackesey smackesey force-pushed the sean/external-assets-refactor-asset-graph branch from 01cf448 to ed6e938 Compare February 23, 2024 16:44
@smackesey smackesey merged commit d7d5e62 into master Feb 23, 2024
1 of 2 checks passed
@smackesey smackesey deleted the sean/external-assets-refactor-asset-graph branch February 23, 2024 17:04
smackesey added a commit that referenced this pull request Feb 26, 2024
…20059)

## Summary & Motivation

#19900 led to an auto-materialize performance regression caught by our
regression tests. After investigating with a profiler, I found the issue
was lack of caching on asset key subset methods in the new `AssetGraph`.
This PR adds them.

## How I Tested These Changes

Ran perf regression tests against this branch.
smackesey added a commit that referenced this pull request Feb 26, 2024
…onstructing ExternalAssetGraph (#20062)

## Summary & Motivation

#19900 introduced a check in the construction of `ExternalAssetGraph`
that would throw an error if two materializable or observable defs
having the same key were detected in a deployment (this is an invalid
state). However, after running a shadow script against customer
deployments, I discovered that ~3% of active deployments have this
problem, so I am changing it to a warning (and making the message more
informative) until we can find a clean strategy to message our customers
who have this problem.

## How I Tested These Changes

New test.
cmpadden pushed a commit that referenced this pull request Feb 28, 2024
…20059)

## Summary & Motivation

#19900 led to an auto-materialize performance regression caught by our
regression tests. After investigating with a profiler, I found the issue
was lack of caching on asset key subset methods in the new `AssetGraph`.
This PR adds them.

## How I Tested These Changes

Ran perf regression tests against this branch.
cmpadden pushed a commit that referenced this pull request Feb 28, 2024
…onstructing ExternalAssetGraph (#20062)

## Summary & Motivation

#19900 introduced a check in the construction of `ExternalAssetGraph`
that would throw an error if two materializable or observable defs
having the same key were detected in a deployment (this is an invalid
state). However, after running a shadow script against customer
deployments, I discovered that ~3% of active deployments have this
problem, so I am changing it to a warning (and making the message more
informative) until we can find a clean strategy to message our customers
who have this problem.

## How I Tested These Changes

New test.
PedramNavid pushed a commit that referenced this pull request Mar 28, 2024
## Summary & Motivation

Internal companion PR: dagster-io/internal#8339

- Refactor `InternalAssetGraph` and `ExternalAssetGraph` to implement
the `AssetGraph` interface by proxying to wrapped
`AssetsDefinition`/`ExternalAssetNode` instead of building separate
`<property_x>_by_key` dicts.
- Add `AssetGraph` methods to retrieve various subsets of keys (e.g.
`executable_asset_keys`)
- Rename a few methods for consistency (e.g. `observable_keys` ->
`observable_asset_keys`)
- Standardize property lookup methods to
`AssetGraph.get_<property_x>(asset_key)` instead of
`AssetGraph.<property_x>_by_key.get(asset_key)`. The previous state is a
mix, where some properties are exposed via single-asset-scoped getters
and others are exposed as dict properties. This is over-complicated and
leads to inconsistency in how lookups for missing asset keys are
handled. We standardize on single-key lookup-- if callers need a dict,
this can be easily constructed on site with something like:
 
```python
{k: asset_graph.get_x(k) for k in graph.{all,observable,materializable,executable,external}_asset_keys`}
```

- Move `InternalAssetGraph` to its own module.

## Additional notes

`AssetGraph` (a private API) has two variants; `InternalAssetGraph` (for
assets scoped to one location) and `ExternalAssetGraph` (for assets
scoped to multiple locations). They share a common interface that is
based on a large number of `<property_x>_by_key` dicts, e.g.
`code_location_by_key`. These dictionaries are constructed from
`AssetsDefinition` (for `InternalAssetGraph`) and `ExternalAssetNode`
(for `ExternalAssetGraph`) in lengthy `@staticmethod` constructors. This
is complex, error-prone, and adds significant friction when making any
changes to assets.

This PR eliminates the `<property_x>_by_key` dicts in favor of having
`InternalAssetGraph` and `ExternalAssetGraph` wrap `AssetsDefinition`
and `ExternalAssetNode` respectively, implementing the common
`AssetGraph` interface in terms of lookup operations on these nodes.

In the initial version of this PR I have not added any caching on the
methods except for `asset_dep_graph`. We may want to cache various sets
of asset keys etc. I'm frankly not sure about the memory/perf tradeoff.

## How I Tested These Changes

Existing test suite.
PedramNavid pushed a commit that referenced this pull request Mar 28, 2024
…20059)

## Summary & Motivation

#19900 led to an auto-materialize performance regression caught by our
regression tests. After investigating with a profiler, I found the issue
was lack of caching on asset key subset methods in the new `AssetGraph`.
This PR adds them.

## How I Tested These Changes

Ran perf regression tests against this branch.
PedramNavid pushed a commit that referenced this pull request Mar 28, 2024
…onstructing ExternalAssetGraph (#20062)

## Summary & Motivation

#19900 introduced a check in the construction of `ExternalAssetGraph`
that would throw an error if two materializable or observable defs
having the same key were detected in a deployment (this is an invalid
state). However, after running a shadow script against customer
deployments, I discovered that ~3% of active deployments have this
problem, so I am changing it to a warning (and making the message more
informative) until we can find a clean strategy to message our customers
who have this problem.

## How I Tested These Changes

New test.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants