[external-assets] refactor AssetGraph #19900

smackesey · 2024-02-19T18:12:25Z

Summary & Motivation

Internal companion PR: https://github.com/dagster-io/internal/pull/8339

Refactor InternalAssetGraph and ExternalAssetGraph to implement the AssetGraph interface by proxying to wrapped AssetsDefinition/ExternalAssetNode instead of building separate <property_x>_by_key dicts.
Add AssetGraph methods to retrieve various subsets of keys (e.g. executable_asset_keys)
Rename a few methods for consistency (e.g. observable_keys -> observable_asset_keys)
Standardize property lookup methods to AssetGraph.get_<property_x>(asset_key) instead of AssetGraph.<property_x>_by_key.get(asset_key). The previous state is a mix, where some properties are exposed via single-asset-scoped getters and others are exposed as dict properties. This is over-complicated and leads to inconsistency in how lookups for missing asset keys are handled. We standardize on single-key lookup-- if callers need a dict, this can be easily constructed on site with something like:

{k: asset_graph.get_x(k) for k in graph.{all,observable,materializable,executable,external}_asset_keys`}

Move InternalAssetGraph to its own module.

Additional notes

AssetGraph (a private API) has two variants; InternalAssetGraph (for assets scoped to one location) and ExternalAssetGraph (for assets scoped to multiple locations). They share a common interface that is based on a large number of <property_x>_by_key dicts, e.g. code_location_by_key. These dictionaries are constructed from AssetsDefinition (for InternalAssetGraph) and ExternalAssetNode (for ExternalAssetGraph) in lengthy @staticmethod constructors. This is complex, error-prone, and adds significant friction when making any changes to assets.

This PR eliminates the <property_x>_by_key dicts in favor of having InternalAssetGraph and ExternalAssetGraph wrap AssetsDefinition and ExternalAssetNode respectively, implementing the common AssetGraph interface in terms of lookup operations on these nodes.

In the initial version of this PR I have not added any caching on the methods except for asset_dep_graph. We may want to cache various sets of asset keys etc. I'm frankly not sure about the memory/perf tradeoff.

How I Tested These Changes

Existing test suite.

smackesey · 2024-02-19T18:12:36Z

Current dependencies on/for this PR:

This stack of pull requests is managed by Graphite.

github-actions · 2024-02-22T20:13:47Z

Deploy preview for dagit-core-storybook ready!

✅ Preview
https://dagit-core-storybook-aft28vbdb-elementl.vercel.app
https://sean-external-assets-refactor-asset-graph.core-storybook.dagster-docs.io

Built with commit 9532c18.
This pull request is being automatically deployed with vercel-action

schrockn · 2024-02-23T11:27:37Z

python_modules/dagster/dagster/_core/definitions/external_asset_graph.py

+        non_source_node_pairs: List[Tuple[RepositoryHandle, "ExternalAssetNode"]] = []
+        observable_source_node_pairs: List[Tuple[RepositoryHandle, "ExternalAssetNode"]] = []
+        non_observable_source_node_pairs: List[Tuple[RepositoryHandle, "ExternalAssetNode"]] = []


we need to rename ExternalAssetNode or else we might go crazy.

Ok given this we need clearer names and copious comments here. Are observable_source_node_pairs just assets represented by SourceAsset, or is it inclusive of external observables?

"source" here in general is inclusive of observables-- when the ExternalAssetNode list is built in the code server external assets and source assets both have is_source set to True. I've changed the names and added a comment.

schrockn · 2024-02-23T11:35:13Z

python_modules/dagster/dagster/_core/definitions/external_asset_graph.py

+        # It is possible for multiple nodes to exist that share the same key. This is invalid if
+        # more than one node is materializable or if more than one node is observable. It is valid
+        # if there is at most one materializable node and at most one observable node, with all
+        # other nodes unexecutable. The asset graph will receive only a single `ExternalAssetNode`
+        # representing the asset. This will always be the materializable node if one exists; then
+        # the observable node if it exists; then finally the first-encountered unexecutable node.
+        for repo_handle, node in non_source_node_pairs:


This is a super consequential decision and we need to consider it carefully. This is a pretty deep assumption that will spread throughout the system, all the way to the frontend and our top-level APIs.

The case that directly exposes this is when an asset is materializable in one repo but observable in another. What is the behavior in that case?

To be clear, this is a refactor that makes (1) the existing "merge" behavior legible for valid cases; (2) errors on undefined cases. i.e. There is no intended behavior change here except to error on two materialization nodes or two observation nodes, which is an undefined state that we currently don't explicitly handle.

The case that directly exposes this is when an asset is materializable in one repo but observable in another. What is the behavior in that case?

This PR preserves the status quo behavior for this case. ExternalAssetGraph will report that the asset is materializable but also is observable.

The use of ExternalAssetNode for the merged info is non-ideal because it supports only a single execution_type (and for other reasons). Via DM I mentioned that I was working on a coherent representation for assets across locations that, among other things, maps execution types to repo handles. My idea is to not use ExternalAssetNode, which isn't suited for a multi-location representation, but rather a new class GlobalAssetNode or MultiLocationAssetNode (or whatever) analogous to the asset node class you proposed for InternalAssetGraph. Then both InternalAssetGraph and ExternalAssetGraph will have their own node classes that share a common interface.

schrockn · 2024-02-23T11:39:16Z

python_modules/dagster/dagster/_core/definitions/internal_asset_graph.py

+            if ad.group_names_by_key[key] == group_name
+        }
+
+    def get_required_multi_asset_keys(self, asset_key: AssetKey) -> AbstractSet[AssetKey]:


is required a new adjective in our ontology? I figured out based on code (keys that will have corresponding results 100% of the time when the containing op is executed) but it wasn't super natural

Nah that's been around for a long time. I usually get confused by it when I encounter it (usually via AssetOutputInfo.is_required). get_required_multi_asset_keys is a method that is already present on existing AssetGraph.

The word makes more sense to me in the context of an output (since that is always thought of in the context of an op) whereas the word makes less sense whening thinking about an asset in the context of a group or graph.

schrockn

Ok this is looking pretty good. The most important thing is to discuss the implication of that canonicalization process of going from multiple representations of assets across code locations to a single ExternalAssetNode.

IMO we should also rename ExternalAssetNode asap to prevent ourselves from getting confused. We can then rename the entire External* family of classes to the same scheme out of the critical path of this workstream.

smackesey · 2024-02-23T12:29:16Z

IMO we should also rename ExternalAssetNode asap to prevent ourselves from getting confused. We can then rename the entire External* family of classes to the same scheme out of the critical path of this workstream.

Strongly prefer to rename all External* in one shot rather than first ExternalAssetNode and then the others. I think that is pretty likely to confuse other engs not in this stream.

github-actions · 2024-02-23T13:36:50Z

Deploy preview for dagster-docs ready!

Preview available at https://dagster-docs-r6v2lftz9-elementl.vercel.app
https://sean-external-assets-refactor-asset-graph.dagster.dagster-docs.io

Direct link to changed pages:

github-actions · 2024-02-23T13:37:38Z

Deploy preview for dagit-storybook ready!

✅ Preview
https://dagit-storybook-q1hmxdndf-elementl.vercel.app
https://sean-external-assets-refactor-asset-graph.components-storybook.dagster-docs.io

Built with commit 9532c18.
This pull request is being automatically deployed with vercel-action

schrockn · 2024-02-23T13:48:22Z

python_modules/dagster/dagster/_core/definitions/external_asset_graph.py

-            if not node.is_source or node.is_observable
+
+        # Split the nodes into materializable, observable, and unexecutable nodes. Observable and
+        # unexecutable nodes represent both source and external assets.


Just to leave no doubt, this means that a SourceAsset in a code location gets serialized as an ExternalAssetNode and this in basically an unfortunate naming collision because of the entire External family of classes

Cool. Let's leave a comment to that effect

updated comment

schrockn · 2024-02-23T13:49:27Z

python_modules/dagster/dagster/_core/definitions/external_asset_graph.py

+        asset_nodes_by_key = {}
+
+        # It is possible for multiple nodes to exist that share the same key. This is invalid if
+        # more than one node is materializable or if more than one node is observable. It is valid
+        # if there is at most one materializable node and at most one observable node, with all
+        # other nodes unexecutable. The asset graph will receive only a single `ExternalAssetNode`
+        # representing the asset. This will always be the materializable node if one exists; then
+        # the observable node if it exists; then finally the first-encountered unexecutable node.
+        for repo_handle, node in materializable_node_pairs:
+            if node.asset_key in asset_nodes_by_key:
+                check.failed("Found two materialization nodes with the same asset key")
+            asset_nodes_by_key[node.asset_key] = node
+
+        for repo_handle, node in observable_node_pairs:
+            if node.asset_key in asset_nodes_by_key:
+                current_node = asset_nodes_by_key[node.asset_key]
+                if current_node.is_observable:
+                    check.failed("Found two observable source nodes with the same asset key")
+                asset_nodes_by_key[node.asset_key] = current_node._replace(is_observable=True)
+            else:
+                asset_nodes_by_key[node.asset_key] = node
+
+        for repo_handle, node in unexecutable_node_pairs:
+            if node.asset_key in asset_nodes_by_key:
+                continue
+            asset_nodes_by_key[node.asset_key] = node


extracting this function into its own block might be nice

Disagree here as it's tightly coupled to the preceding block that splits nodes into materializable/observable/unexecutable, and these two blocks together comprise basically the entire function.

schrockn

Ok let's move forward with it.

[INTERNAL_BRANCH=sean/external-assets-refactor-asset-graph]

…20059) ## Summary & Motivation #19900 led to an auto-materialize performance regression caught by our regression tests. After investigating with a profiler, I found the issue was lack of caching on asset key subset methods in the new `AssetGraph`. This PR adds them. ## How I Tested These Changes Ran perf regression tests against this branch.

…onstructing ExternalAssetGraph (#20062) ## Summary & Motivation #19900 introduced a check in the construction of `ExternalAssetGraph` that would throw an error if two materializable or observable defs having the same key were detected in a deployment (this is an invalid state). However, after running a shadow script against customer deployments, I discovered that ~3% of active deployments have this problem, so I am changing it to a warning (and making the message more informative) until we can find a clean strategy to message our customers who have this problem. ## How I Tested These Changes New test.

…20059) ## Summary & Motivation #19900 led to an auto-materialize performance regression caught by our regression tests. After investigating with a profiler, I found the issue was lack of caching on asset key subset methods in the new `AssetGraph`. This PR adds them. ## How I Tested These Changes Ran perf regression tests against this branch.

…onstructing ExternalAssetGraph (#20062) ## Summary & Motivation #19900 introduced a check in the construction of `ExternalAssetGraph` that would throw an error if two materializable or observable defs having the same key were detected in a deployment (this is an invalid state). However, after running a shadow script against customer deployments, I discovered that ~3% of active deployments have this problem, so I am changing it to a warning (and making the message more informative) until we can find a clean strategy to message our customers who have this problem. ## How I Tested These Changes New test.

## Summary & Motivation Internal companion PR: dagster-io/internal#8339 - Refactor `InternalAssetGraph` and `ExternalAssetGraph` to implement the `AssetGraph` interface by proxying to wrapped `AssetsDefinition`/`ExternalAssetNode` instead of building separate `<property_x>_by_key` dicts. - Add `AssetGraph` methods to retrieve various subsets of keys (e.g. `executable_asset_keys`) - Rename a few methods for consistency (e.g. `observable_keys` -> `observable_asset_keys`) - Standardize property lookup methods to `AssetGraph.get_<property_x>(asset_key)` instead of `AssetGraph.<property_x>_by_key.get(asset_key)`. The previous state is a mix, where some properties are exposed via single-asset-scoped getters and others are exposed as dict properties. This is over-complicated and leads to inconsistency in how lookups for missing asset keys are handled. We standardize on single-key lookup-- if callers need a dict, this can be easily constructed on site with something like: ```python {k: asset_graph.get_x(k) for k in graph.{all,observable,materializable,executable,external}_asset_keys`} ``` - Move `InternalAssetGraph` to its own module. ## Additional notes `AssetGraph` (a private API) has two variants; `InternalAssetGraph` (for assets scoped to one location) and `ExternalAssetGraph` (for assets scoped to multiple locations). They share a common interface that is based on a large number of `<property_x>_by_key` dicts, e.g. `code_location_by_key`. These dictionaries are constructed from `AssetsDefinition` (for `InternalAssetGraph`) and `ExternalAssetNode` (for `ExternalAssetGraph`) in lengthy `@staticmethod` constructors. This is complex, error-prone, and adds significant friction when making any changes to assets. This PR eliminates the `<property_x>_by_key` dicts in favor of having `InternalAssetGraph` and `ExternalAssetGraph` wrap `AssetsDefinition` and `ExternalAssetNode` respectively, implementing the common `AssetGraph` interface in terms of lookup operations on these nodes. In the initial version of this PR I have not added any caching on the methods except for `asset_dep_graph`. We may want to cache various sets of asset keys etc. I'm frankly not sure about the memory/perf tradeoff. ## How I Tested These Changes Existing test suite.

…20059) ## Summary & Motivation #19900 led to an auto-materialize performance regression caught by our regression tests. After investigating with a profiler, I found the issue was lack of caching on asset key subset methods in the new `AssetGraph`. This PR adds them. ## How I Tested These Changes Ran perf regression tests against this branch.

…onstructing ExternalAssetGraph (#20062) ## Summary & Motivation #19900 introduced a check in the construction of `ExternalAssetGraph` that would throw an error if two materializable or observable defs having the same key were detected in a deployment (this is an invalid state). However, after running a shadow script against customer deployments, I discovered that ~3% of active deployments have this problem, so I am changing it to a warning (and making the message more informative) until we can find a clean strategy to message our customers who have this problem. ## How I Tested These Changes New test.

smackesey mentioned this pull request Feb 19, 2024

[external-assets] Make observe support external observables #19898

Merged

smackesey force-pushed the sean/external-assets-assets-to-execute branch from e1375b2 to 92c7a9f Compare February 19, 2024 19:59

smackesey force-pushed the sean/external-assets-refactor-asset-graph branch from 44f6b66 to bf0569b Compare February 19, 2024 20:00

smackesey force-pushed the sean/external-assets-assets-to-execute branch from 92c7a9f to 1ae5a2c Compare February 19, 2024 21:34

smackesey force-pushed the sean/external-assets-refactor-asset-graph branch from bf0569b to e839eb2 Compare February 19, 2024 21:45

smackesey mentioned this pull request Feb 19, 2024

[external-assets] Remove source assets from AssetGraph #19901

Merged

smackesey force-pushed the sean/external-assets-assets-to-execute branch from 1ae5a2c to 03b5a31 Compare February 20, 2024 12:29

smackesey force-pushed the sean/external-assets-refactor-asset-graph branch from e839eb2 to ee4dd4e Compare February 20, 2024 12:29

smackesey mentioned this pull request Feb 20, 2024

Register dagster-polars in DagsterLibraryRegistry #19910

Merged

smackesey force-pushed the sean/external-assets-assets-to-execute branch 2 times, most recently from c58fbba to d3673a7 Compare February 20, 2024 15:40

smackesey force-pushed the sean/external-assets-refactor-asset-graph branch from ee4dd4e to d29c798 Compare February 20, 2024 15:40

This was referenced Feb 20, 2024

Sensor error tweak #19916

Merged

[dagster-fivetran] tests tweak to avoid duplicate asset keys #19920

Merged

smackesey force-pushed the sean/external-assets-assets-to-execute branch from d3673a7 to 96bc75b Compare February 20, 2024 20:46

smackesey force-pushed the sean/external-assets-refactor-asset-graph branch from d29c798 to c4ff010 Compare February 20, 2024 20:46

smackesey force-pushed the sean/external-assets-assets-to-execute branch from 96bc75b to 30b2395 Compare February 20, 2024 21:08

smackesey force-pushed the sean/external-assets-refactor-asset-graph branch from c4ff010 to a62ad47 Compare February 20, 2024 21:09

smackesey force-pushed the sean/external-assets-assets-to-execute branch from 30b2395 to b5da3fc Compare February 20, 2024 21:10

smackesey force-pushed the sean/external-assets-refactor-asset-graph branch 2 times, most recently from 0f657ae to c0e8feb Compare February 20, 2024 21:19

smackesey force-pushed the sean/external-assets-assets-to-execute branch from 1d3da61 to 982b7e4 Compare February 20, 2024 21:54

smackesey force-pushed the sean/external-assets-refactor-asset-graph branch from c0e8feb to 1340755 Compare February 20, 2024 21:54

smackesey force-pushed the sean/external-assets-assets-to-execute branch from 982b7e4 to a3e7f79 Compare February 20, 2024 22:58

smackesey force-pushed the sean/external-assets-refactor-asset-graph branch from 1340755 to e6a7dab Compare February 20, 2024 23:01

smackesey force-pushed the sean/external-assets-refactor-asset-graph branch from 5a0382b to 3dd1442 Compare February 22, 2024 20:10

smackesey force-pushed the sean/external-assets-refactor-asset-graph branch 2 times, most recently from 9146696 to ab4801d Compare February 23, 2024 10:22

smackesey mentioned this pull request Feb 23, 2024

[external-assets] Replace source assets with external assets at the repository level #20003

Merged

schrockn reviewed Feb 23, 2024

View reviewed changes

schrockn requested changes Feb 23, 2024

View reviewed changes

smackesey force-pushed the sean/external-assets-refactor-asset-graph branch from ab4801d to 9532c18 Compare February 23, 2024 13:32

schrockn reviewed Feb 23, 2024

View reviewed changes

smackesey force-pushed the sean/external-assets-refactor-asset-graph branch from 9532c18 to 01cf448 Compare February 23, 2024 14:48

schrockn approved these changes Feb 23, 2024

View reviewed changes

[external-assets] refactor AssetGraph

ed6e938

[INTERNAL_BRANCH=sean/external-assets-refactor-asset-graph]

smackesey force-pushed the sean/external-assets-refactor-asset-graph branch from 01cf448 to ed6e938 Compare February 23, 2024 16:44

smackesey merged commit d7d5e62 into master Feb 23, 2024
1 of 2 checks passed

smackesey deleted the sean/external-assets-refactor-asset-graph branch February 23, 2024 17:04

smackesey mentioned this pull request Feb 26, 2024

[external-assets] Cache asset key subsets in refactored AssetGraph #20059

Merged

smackesey mentioned this pull request Feb 26, 2024

[external-assets] Change error to warning for duplicate assets when constructing ExternalAssetGraph #20062

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[external-assets] refactor AssetGraph #19900

[external-assets] refactor AssetGraph #19900

smackesey commented Feb 19, 2024 •

edited

Loading

smackesey commented Feb 19, 2024 •

edited

Loading

github-actions bot commented Feb 22, 2024 •

edited

Loading

schrockn Feb 23, 2024

schrockn Feb 23, 2024

smackesey Feb 23, 2024

schrockn Feb 23, 2024

smackesey Feb 23, 2024 •

edited

Loading

schrockn Feb 23, 2024

smackesey Feb 23, 2024 •

edited

Loading

schrockn Feb 23, 2024

schrockn left a comment

smackesey commented Feb 23, 2024

github-actions bot commented Feb 23, 2024 •

edited

Loading

github-actions bot commented Feb 23, 2024

schrockn Feb 23, 2024

smackesey Feb 23, 2024

schrockn Feb 23, 2024

smackesey Feb 23, 2024

schrockn Feb 23, 2024

smackesey Feb 23, 2024

schrockn left a comment

[external-assets] refactor AssetGraph #19900

[external-assets] refactor AssetGraph #19900

Conversation

smackesey commented Feb 19, 2024 • edited Loading

Summary & Motivation

Additional notes

How I Tested These Changes

smackesey commented Feb 19, 2024 • edited Loading

github-actions bot commented Feb 22, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

smackesey Feb 23, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

smackesey Feb 23, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

schrockn left a comment

Choose a reason for hiding this comment

smackesey commented Feb 23, 2024

github-actions bot commented Feb 23, 2024 • edited Loading

github-actions bot commented Feb 23, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

schrockn left a comment

Choose a reason for hiding this comment

smackesey commented Feb 19, 2024 •

edited

Loading

smackesey commented Feb 19, 2024 •

edited

Loading

github-actions bot commented Feb 22, 2024 •

edited

Loading

smackesey Feb 23, 2024 •

edited

Loading

smackesey Feb 23, 2024 •

edited

Loading

github-actions bot commented Feb 23, 2024 •

edited

Loading