Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Describe External Assets; rename create_unexecutable_observable_assets_def to external_assets_from_specs #16754

Merged
merged 4 commits into from
Sep 27, 2023

Conversation

schrockn
Copy link
Member

@schrockn schrockn commented Sep 25, 2023

Summary & Motivation

This PR renames create_unexecutable_observable_assets_def to external_assets_from_specs. I also want this PR to serve as the final discussion on naming this feature and documents this capability. The verbiage in the docblock:

Create an external assets definition from a sequence of asset specs.

An external asset is an asset that is not materialized by Dagster, but is tracked in the
asset graph and asset catalog.

A common use case for external assets is modeling data produced by an process not
under Dagster's control. For example a daily drop of a file from a third party in s3.

In most systems these are described as sources. This includes Dagster, which includes
:py:class:`SourceAsset`, which will be supplanted by external assets in the near-term
future, as external assets are a superset of the functionality
of Source Assets.

External assets can act as sources, but that is not their only use.

In particular, external assets have themselves have lineage–-specified through the
``deps`` argument of :py:class:`AssetSpec`-– and can depend on other external assets.
External assets are not allowed to depend on non-external assets.

The user can emit `AssetMaterialization`, `AssetObservation`, and `AssetCheckEvaluations`
events attached external assets.  And Dagster now has the ability to have "runless"
events to enable many use cases that were previously not possible.  Runless events
are events generated outside the context of a particular run (for example, in a
sensor or by a script), allowing for greater flexibility in event generation.

This can be done in a few ways:

Note to reviewers that this in an in-progress doc block and the below will have links and examples.

1) DagsterInstance exposes `report_runless_event` that can be used to generate events for
    external assets directly on an instance. See docs.
2) Sensors can build these events and return them using :py:class:`SensorResult`. A use
    case for this is using a sensor to continously monitor the metadata exhaust from
    an external system and inserting events that
    reflect that exhaust. See docs.
3) Dagster Cloud exposes a REST API for ingesting runless events. Users can copy and
    paste a curl command in the their external computations (such as Airflow operator)
    to register metadata associated with those computations See docs.
4) Dagster ops can generate these events directly and yield them or by calling
    ``log_event`` on :py:class:`OpExecutionContext`.  Use cases for this include
    querying metadata in an external system that is too expensive to do so in a sensor. Or
    for adapting pure op-based Dagster code to take advantage of asset-oriented lineage,
    observability, and data quality features, without having to port them wholesale
    to `@asset`- and `@multi_asset`-based code.

This feature set allows users to use Dagster as an observability, lineage, and
data quality tool for assets that are not materialized by Dagster. In addition to
traditional use cases like sources, this feature can model entire lineage graphs of
assets that are scheduled and materialized by other tools and workflow engines. This
allows users to use Dagster as a cross-cutting observability tool without migrating
their entire data platform to a single orchestration engine.

External assets do not have all the features of normal assets: they cannot be
materialized ad hoc by Dagster (this is diabled in the UI); cannot be backfilled; cannot
be scheduled using auto-materialize policies; and opt out of other features around
direct materialization, both now and in the future. External assets also provide fewer
guarantees around the correctness of information of their information in the asset
catalog. In other words, in exchange for the flexibility Dagster provides less guardrails
for external assets than assets that are materialized by Dagster, and there is an increased
chance that they will insert nonsensical information into the asset catalog, potentially
eroding trust.

Suggesting alternative lanuage in this docblock is the best way to talk about an alternative name IMO.

How I Tested These Changes

@schrockn
Copy link
Member Author

Current dependencies on/for this PR:

This comment was auto-generated by Graphite.

@schrockn schrockn changed the title rename create_unexecutable_observable_assets_def to create_external_assets_def_from_specs rename create_unexecutable_observable_assets_def to external_assets_def_from_specs Sep 26, 2023
@github-actions
Copy link

Deploy preview for dagit-storybook ready!

✅ Preview
https://dagit-storybook-cybu18mzv-elementl.vercel.app
https://rename-to-external.components-storybook.dagster-docs.io

Built with commit 71e5d29.
This pull request is being automatically deployed with vercel-action

@github-actions
Copy link

github-actions bot commented Sep 26, 2023

Deploy preview for dagit-core-storybook ready!

✅ Preview
https://dagit-core-storybook-fdlpk1l54-elementl.vercel.app
https://rename-to-external.core-storybook.dagster-docs.io

Built with commit 2629ca6.
This pull request is being automatically deployed with vercel-action

@github-actions
Copy link

Deploy preview for dagster-docs ready!

Preview available at https://dagster-docs-obbuf9osz-elementl.vercel.app
https://rename-to-external.dagster.dagster-docs.io

Direct link to changed pages:

@schrockn schrockn changed the title rename create_unexecutable_observable_assets_def to external_assets_def_from_specs Describe External Assets; rename create_unexecutable_observable_assets_def to external_assets_def_from_specs Sep 26, 2023
@schrockn schrockn marked this pull request as ready for review September 26, 2023 10:14
Copy link
Contributor

@petehunt petehunt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have long thought that external assets would be the right name for this feature so I am supportive! This feature resembles BigQuery's external tables feature so I think it will resonate with practitioners.

Copy link
Member

@alangenfeld alangenfeld left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reservations around "external" have just been the existing meaning in the code base, but those artifacts are not part of the public API so can be renamed.

I think external is much better than unexecutable & observable. I don't think there have been any other real contenders.

I think the external language works well when describing these. In replacing source asset we have some dynamic scoping for what "external" refers to but i think it still makes sense (outside of dagster vs outside of a particular code location).

from dagster._core.definitions.decorators.asset_decorator import asset, multi_asset
from dagster._core.definitions.source_asset import SourceAsset
from dagster._core.errors import DagsterInvariantViolationError


def create_unexecutable_observable_assets_def(specs: Sequence[AssetSpec]):
def external_assets_def_from_specs(specs: Sequence[AssetSpec]) -> AssetsDefinition:
"""Create an external assets definition from a sequence of asset specs.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's a use case where you would create an external asset from multiple AssetSpecs?

Copy link
Contributor

@jamiedemaria jamiedemaria Sep 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as an aside, doing this

a1 = AssetSpec(key="table_1")
a2 = AssetSpec(key="table_2")
a3 = AssetSpec(key="table_3", deps=[a1])

external_assets = external_assets_def_from_specs([a1, a2, a3]) 

causes errors when running the UI

dagster._core.errors.DagsterInvariantViolationError: Expected Tuple annotation for multiple outputs, but received non-tuple annotation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is inconsequential to the behavior of this system. This is just a weird outcropping of the fact that our core abstraction is pluralized. I could change it to produce N asset defs if so desired.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might not fully understand, but the assets in the fn name isn't my concern, it's that you can pass a list of AssetSpecs. I don't really get why you would ever pass a list, so I'm wondering why it can't be

def external_assets_def_from_spec(spec: AssetSpec) -> AssetsDefinition:

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is more about convenience than anything. People will construct N asset specs to represent the graph and I want them to be able to coerce them into a form that can be threaded to Definitions in a single fn call.

Copy link
Contributor

@jamiedemaria jamiedemaria Sep 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok that makes sense. I will note that just with the doc bloc and function signature, I was pretty unsure about what would happen if i passed a list of AssetSpecs. And it does still seem to throw an error if you try to pass a list (but that's fixable). I think the initial confusion can be remedied by addressing what happens when you pass multiple AssetSpecs in the doc block, or by generating an AssetsDefinition per spec. While the latter is maybe less efficient/convenient, it fits my intuituve mental model of what I would expect this function to do, rather than a single multi_asset which feels like it should consist of assets that are related to each other in some way

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm totally cool with creating multiple assets_def. @sryza is that what you would prefer as well?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that feels more grokkable to me. And I agree that it's ergonomic for it to take a list, rather than requiring the user to write a comprehension.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool. Makes sense to me.

Copy link
Member Author

@schrockn schrockn Sep 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually this was a very good call. Error message if you somehow get to the user-defined execute function will be much more clear.

@sryza
Copy link
Contributor

sryza commented Sep 26, 2023

Interestingly, we already use the "external" language heavily in the concept docs where we present source assets: https://docs.dagster.io/concepts/assets/software-defined-assets#defining-external-asset-dependencies. This helps me feel comfortable with this change.

However, it's a little difficult to understand the impact of this change without understanding how it affects the UI and other parts of how we communicate about Dagster. E.g. are there a set of places in the docs where we'll want to add text like "if the asset is an external asset, then it behaves like X instead?"

I feel optimistic about the "external" naming, but the above gives me some apprehension about committing to it before we have the breathing room to understand the user-facing impact a little bit more.

A separate, more tactical reservation I have is that it feels weird to build an AssetsDefinition that has multiple of these. It's functionally equivalent to a single AssetsDefinition per asset spec, right? Allowing multiple to go on a single AssetsDefinition exposes a degree of freedom that might lead users to believe there's a difference in behavior.

All that said, exposing a helper function seems pretty low risk.

@schrockn
Copy link
Member Author

A separate, more tactical reservation I have is that it feels weird to build an AssetsDefinition that has multiple of these. It's functionally equivalent to a single AssetsDefinition per asset spec, right? Allowing multiple to go on a single AssetsDefinition exposes a degree of freedom that might lead users to believe there's a difference in behavior.

#16754 (comment)

@schrockn
Copy link
Member Author

All that said, exposing a helper function seems pretty low risk.

👍🏻 It is not even essential that this go out with 1.5. It is purely additive on top so can be in a follow on. What is important is aligning on visioning and naming prior to launch week event.

@PedramNavid
Copy link
Contributor

Nothing new for me to add, the name is a big improvement over unexecutable. I think there are some questions around docs and how we want to communicate this to our users. I think API docs are part of it, but probably will warrant further prose too.

@schrockn
Copy link
Member Author

schrockn commented Sep 26, 2023

Roger that. It appears we have consensus on external. We will land this at a reasonable pace, after the pipes renaming has landed. It doesn't not need to land in 1.5 to talk about it at Launch Week.

However I want to make it clear that I have very high conviction that this structure replaces source assets and observable source assets, and post launch week will subsequently––as part of our quality/consolidation push––do the following:

  • Embark at eliminating SourceAsset from the core of the framework
  • Enable "active observation" (dagster actively emitting Asset Observations for an external asset) of external assets in vanilla sensors and schedules,
  • Consolidating Materialize and Observe Sources into a single button Execute, the covers but traditional assets and actively observed external assets
  • Fitting actively observed external assets into whatever AMP evolves into.

This has been a laceration in our product and architecture for too long, and I look forward to eliminating it.

…ssets_def_from_specs

to external_asset_def_from_specs

docs

snapshot

snaps
@github-actions
Copy link

Deploy preview for dagster-university ready!

✅ Preview
https://dagster-university-gyi8ec01v-elementl.vercel.app
https://rename-to-external.dagster-university.dagster-docs.io

Built with commit 2629ca6.
This pull request is being automatically deployed with vercel-action

@schrockn schrockn changed the title Describe External Assets; rename create_unexecutable_observable_assets_def to external_assets_def_from_specs Describe External Assets; rename create_unexecutable_observable_assets_def to external_assets_from_specs Sep 27, 2023
Copy link
Member Author

@schrockn schrockn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to land this but not export it top-level.

@schrockn schrockn merged commit 56219ae into master Sep 27, 2023
@schrockn schrockn deleted the rename-to-external branch September 27, 2023 09:39
PedramNavid pushed a commit that referenced this pull request Sep 28, 2023
…s_def to external_assets_from_specs (#16754)

## Summary & Motivation

This PR renames `create_unexecutable_observable_assets_def` to
`external_assets_from_specs`. I also want this PR to serve as the final
discussion on naming this feature and documents this capability. The
verbiage in the docblock:

```
Create an external assets definition from a sequence of asset specs.

An external asset is an asset that is not materialized by Dagster, but is tracked in the
asset graph and asset catalog.

A common use case for external assets is modeling data produced by an process not
under Dagster's control. For example a daily drop of a file from a third party in s3.

In most systems these are described as sources. This includes Dagster, which includes
:py:class:`SourceAsset`, which will be supplanted by external assets in the near-term
future, as external assets are a superset of the functionality
of Source Assets.

External assets can act as sources, but that is not their only use.

In particular, external assets have themselves have lineage-specified through the
``deps`` argument of :py:class:`AssetSpec`- and can depend on other external assets.
External assets are not allowed to depend on non-external assets.

The user can emit `AssetMaterialization`, `AssetObservation`, and `AssetCheckEvaluations`
events attached external assets.  And Dagster now has the ability to have "runless"
events to enable many use cases that were previously not possible.  Runless events
are events generated outside the context of a particular run (for example, in a
sensor or by an script), allowing for greater flexibility in event generation.
This can be done in a few ways:

Note to reviewers that this in an in-progress doc block and the below will have links and examples.

1) DagsterInstance exposes `report_runless_event` that can be used to generate events for
    external assets directly on an instance. See docs.
2) Sensors can build these events and return them using :py:class:`SensorResult`. A use
    case for this is using a sensor to continously monitor the metadata exhaust from
    an external system and inserting events that
    reflect that exhaust. See docs.
3) Dagster Cloud exposes a REST API for ingesting runless events. Users can copy and
    paste a curl command in the their external computations (such as Airflow operator)
    to register metadata associated with those computations See docs.
4) Dagster ops can generate these events directly and yield them or by calling
    ``log_event`` on :py:class:`OpExecutionContext`.  Use cases for this include
    querying metadata in an external system that is too expensive to do so in a sensor. Or
    for adapting pure op-based Dagster code to take advantage of asset-oriented lineage,
    observability, and data quality features, without having to port them wholesale
    to `@asset`- and `@multi_asset`-based code.

This feature set allows users to use Dagster as an observability, lineage, and
data quality tool for assets that are not materialized by Dagster. In addition to
traditional use cases like sources, this feature can model entire lineage graphs of
assets that are scheduled and materialized by other tools and workflow engines. This
allows users to use Dagster as a cross-cutting observability tool without migrating
their entire data platform to a single orchestration engine.

External assets do not have all the features of normal assets: they cannot be
materialized ad hoc by Dagster (this is diabled in the UI); cannot be backfilled; cannot
be scheduled using auto-materialize policies; and opt out of other features around
direct materialization, both now and in the future. External assets also provide fewer
guarantees around the correctness of information of their information in the asset
catalog. In other words, in exchange for the flexibility Dagster provides less guardrails
for external assets than assets that are materialized by Dagster, and there is an increased
chance that they will insert non-sensical information into the asset catalog, potentially
eroding trust.
```

Suggesting alternative lanuage in this docblock is the best way to talk
about an alternative name IMO.

## How I Tested These Changes
erinkcochran87 added a commit that referenced this pull request Oct 12, 2023
## Summary & Motivation

Adds an External Assets concept page (motivation described in
#16754).

This also contains a code change necessary because of the bug
demonstrated in #17077.

## How I Tested These Changes

BK. Also loaded examples in `dagster dev`

---------

Co-authored-by: Erin Cochran <[email protected]>
Co-authored-by: Yuhan Luo <[email protected]>
yuhan added a commit that referenced this pull request Oct 12, 2023
Adds an External Assets concept page (motivation described in
#16754).

This also contains a code change necessary because of the bug
demonstrated in #17077.

BK. Also loaded examples in `dagster dev`

---------

Co-authored-by: Erin Cochran <[email protected]>
Co-authored-by: Yuhan Luo <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants