Replies: 22 comments 71 replies
-
This looks interesting. Two comments from the initial look.
-
This would be a great feature. Here are a couple of ideas:
-
Will checks be available for dbt assets?
-
Consider how the tests are run - the same goes for dbt: if the tests are run in the CI pipeline, re-processing PBs of data might get costly quickly. Some sampling methodology (e.g. only processing the last month of data) will be required. Keep in mind that the actually required sample may depend on the different downstream use cases working with the data.
-
Great stuff! My 2 cents:
-
How are checks going to work in the context of time-partitioned assets? Let's say I am performing a "no nulls" check on a column because I know that downstream I perform a cumulative sum. If yesterday there were nulls, and today there weren't, then the output of the cumulative sum from that null onwards would be null, even if the new data is good when viewed in isolation. In some cases, it would be desirable to have the asset level view say "This asset is passing the checks because all of the most recent data is fine.", whilst in this case it would be "A failure in any past partition will still be impacting things today".
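To make the failure mode above concrete: in a cumulative sum, a single null poisons every subsequent value, so a partition that is clean in isolation can still produce bad output. A minimal illustration in plain Python, with `None` standing in for null:

```python
def cumulative_sum(values):
    """Running sum where any None poisons all later totals,
    mimicking how nulls propagate in SQL/pandas arithmetic."""
    out, running = [], 0
    for v in values:
        running = None if v is None or running is None else running + v
        out.append(running)
    return out

# Yesterday's partition had a null; today's values are fine,
# but every total from the null onward is still null.
print(cumulative_sum([1, 2, None, 4, 5]))  # → [1, 3, None, None, None]
```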
-
This is awesome @sryza ! Already made plans to adopt it as soon as it arrives. Wondering how this will work with dbt tests. We have a bunch of them and it would be great to have them surfaced in the UI like this. Any insights?
-
Shouldn't asset checks have an asset as an argument and use its io_manager to load the data?

```python
@asset_check(
    description="ensure there are no null order_ids",
    severity=CheckSeverity.ERROR,
)
def orders_id_has_no_nulls(context, orders):
    num_null_order_ids = orders["order_id"].isna().sum()
    return CheckResult(
        success=(num_null_order_ids == 0),
        metadata={"num_null_order_ids": num_null_order_ids},
    )
```
-
Currently, I'm using pandera to do these checks. Do you plan to integrate pandera with this slick UI via the extension? If not, should I remove the checks from the pandera schema validation? Can both coexist? Should I completely replace my pandera schema validation with a solution native to Dagster?
-
I think I'd like to see the ability to write checks independent of assets and pass them into the assets through an argument. I'm aware the current mental model is to attach an asset to a check, but I feel like this would cause a lot of code duplication and/or tons of asset check factories. The opposite route feels a little more logical to me, since most checks will likely be applicable to multiple assets. Additionally, it would be great if the UI could differentiate between checks running within the asset computation and those running in their own outside process.
-
Based on this proposal it seems like the check itself has an associated severity level. I could see a use case for changing severity based on the output of the check - for example, a check for what pct of values in a column are null might return a
-
This is really exciting! Any plans for stateful checks that can cache and compare computed values over time, e.g. looking at week-over-week deltas and alerting for large changes?
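A stateful check along these lines could be sketched as: cache the previously computed metric, compare it to the current one, and fail on a large relative delta. A minimal plain-Python sketch (the function name and threshold are illustrative, not part of the proposal):

```python
def week_over_week_check(current, previous, max_rel_change=0.5):
    """Return (passed, rel_change); passes when the metric moved
    by at most max_rel_change relative to the cached baseline."""
    if previous in (None, 0):
        return True, None  # no usable baseline yet: pass by default
    rel_change = abs(current - previous) / abs(previous)
    return rel_change <= max_rel_change, rel_change

print(week_over_week_check(150, 100))  # → (True, 0.5)
print(week_over_week_check(160, 100))  # → (False, 0.6)
```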
-
Will there be a way to override a failed blocking (CheckSeverity.ERROR) check and allow materialization of downstream assets to continue?
-
I'm not sure whether this is feasible, but can
-
Hi all, we've shipped early implementations of these APIs. Try them out! #16266
-
It would also be useful if a check could be lazy (expensive and low severity), e.g. have some kind of a
-
Great to see that quality checks are becoming a core feature of Dagster. It's great to validate an asset and see whether it matches the expected quality, and this fits some of our cases pretty well. But in our real-world scenarios we are facing more and more quality issues that are not at the asset level but at the row level: an asset has many rows that pass the expected quality checks, and only occasionally there is a row with a quality issue. The feature request would be to use checks to split an asset into the rows that pass the quality checks and a lower-quality asset containing the rows with issues - materialize the passing rows for downstream use, and materialize the failing rows somewhere else so we can dig deeper into the root cause of the issues.
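The split described here amounts to partitioning rows by a quality predicate. A minimal plain-Python sketch (the row shape and function names are hypothetical):

```python
def split_by_check(rows, passes_check):
    """Partition rows into (good, quarantined) using a row-level quality predicate."""
    good, quarantined = [], []
    for row in rows:
        (good if passes_check(row) else quarantined).append(row)
    return good, quarantined

rows = [{"order_id": 1}, {"order_id": None}, {"order_id": 3}]
good, bad = split_by_check(rows, lambda r: r["order_id"] is not None)
print(len(good), len(bad))  # → 2 1
```

The `good` rows would back the main materialization for downstream use, while `bad` would be written to a separate quarantine asset for root-cause analysis.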
-
I would like my Should
-
It would be useful to have
-
Started experimenting with asset checks, and liking it so far! I did run into a usage pattern that may not be covered very ergonomically in the framework as it is now. Essentially, we'd like to be better able to separate and compose our checks. Right now, as far as I can tell, you have two options:

We'd like to be able to use both flows to run the same checks 🙂 Here's why:

We use external compute (Databricks via a step launcher). This has caching advantages and reduces spin-up time/latency if we can run a check in the same ephemeral cluster as the main asset materialization (i.e. we want to run an asset check in the same op that materializes the asset; right now we'd look to yield an AssetCheckResult).

We'd also like to decouple the above from the asset code, to be able to run checks on their own. For example, perhaps we extend our asset checks to cover a new bug; we'd like to be able to re-run the asset checks without re-materializing the asset. (This could be even more important in the future when/if partitioned checks are added 🙂)

We can do the above by writing our asset checks as functions, creating a generator that yields the check results during materialization, and then re-using those functions (or a factory) to define standalone asset checks; however, in this case we're not sure how to keep the asset checks from running twice. I suppose another way of looking at this is that we want greater control over when/where asset checks run. As another example, I could see us creating an asset check that can run without external compute (e.g. scan metadata only).
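One way to get both flows from a single definition is to keep the check logic in a plain function and wrap it twice: once yielded inline during materialization, and once as a standalone check over the stored asset. A plain-Python sketch of the idea (the Dagster wiring is omitted; `load_materialized_rows` is a hypothetical stand-in for re-loading the asset):

```python
def order_id_not_null_check(rows):
    """Core check logic, independent of where it runs."""
    num_nulls = sum(1 for r in rows if r.get("order_id") is None)
    return {"success": num_nulls == 0, "num_nulls": num_nulls}

def load_materialized_rows():
    # Hypothetical stand-in for re-loading the already-materialized asset.
    return [{"order_id": 1}, {"order_id": None}]

# Flow 1: run inline right after materializing (same op / same cluster).
fresh_rows = [{"order_id": 1}, {"order_id": 2}]
print(order_id_not_null_check(fresh_rows))  # → {'success': True, 'num_nulls': 0}

# Flow 2: re-run later against the stored asset, without re-materializing it
# (e.g. wrapped into a standalone asset check by a factory).
print(order_id_not_null_check(load_materialized_rows()))  # → {'success': False, 'num_nulls': 1}
```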
-
Started to think about how asset checks can be useful to us beyond running dbt tests, and I'm wondering if there are any plans to incorporate a callback function of sorts where if the check fails, it runs the provided function with the relevant metadata? That way we could create an incident/task in an external system (think Pagerduty, Asana, Zendesk) if some invalid/non-conforming data is caught.
-
@sryza I'm pretty sure the asset check factory example has a bug in it - B023. Specifically, Python closures bind loop variables late, so rows = db_connection.execute(check_blob["sql"]) will always use the value of the last item in the list. Python's usual solution is to pass the value as a parameter with a default value, but we can't do this in dagster. Is there a way that we can use a
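The late-binding behaviour behind B023, and the factory-function workaround that avoids adding a default parameter (which would change the signature Dagster inspects), can be shown in plain Python:

```python
# Buggy: closures bind check_blob late, so all of them see the last value.
checks = []
for check_blob in ["a", "b", "c"]:
    checks.append(lambda: check_blob)
print([c() for c in checks])  # → ['c', 'c', 'c']

# Fix: a factory function gives each closure its own binding, without
# adding a default parameter to the check function itself.
def make_check(blob):
    return lambda: blob

checks = [make_check(check_blob) for check_blob in ["a", "b", "c"]]
print([c() for c in checks])  # → ['a', 'b', 'c']
```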
-
Note
Asset checks are now available in dagster 1.5, and this post is not current. The latest on checks is here.
The following is a spec for a not-yet-implemented Python API for defining and executing asset checks in Dagster. We would love your feedback on any and all aspects of it!
This supersedes #9543.
Asset checks
Dagster allows you to define and execute data quality checks on your software-defined assets. Each asset check verifies some property of a data asset, e.g. that it has no null values in a particular column.
When viewing an asset in Dagster’s UI, you can see all of its checks, and whether they’ve passed, failed, or haven’t run. When launching a run to execute an asset, by default its checks will also be executed. Checks can also be executed on their own, independent of asset materializations.
By setting their severity level to ERROR, you can specify that your checks impact control flow, i.e. only materialize downstream assets if the checks on the upstream assets succeed.
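As a rough sketch of the shape this proposal describes (a sketch only: the decorator arguments and the `asset_check`/`AssetCheckResult` names are taken from examples in this thread and may differ in the final API):

```python
from dagster import asset, asset_check, AssetCheckResult

@asset
def orders():
    ...

@asset_check(asset=orders, description="no null values in the order_id column")
def orders_id_has_no_nulls(orders):
    num_nulls = orders["order_id"].isna().sum()
    return AssetCheckResult(
        success=(num_nulls == 0),
        metadata={"num_null_order_ids": int(num_nulls)},
    )
```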
Defining asset checks
Single asset check that executes in its own op
The following code defines an asset named `orders` and an asset check named `orders_id_is_unique`. When executed, the check verifies a property of the `orders` asset: that all the values in its primary key column are unique.

The `orders_id_is_unique` check runs in its own op. That means that, if you launch a run that materializes the `orders` asset and also executes the `orders_id_is_unique` check, and you're using the `multiprocess_executor`, the check will execute in a separate process from the process that materializes the asset.

Multiple asset checks that execute in a single op
Sometimes, you want to define multiple checks that are executed within the same function. For example, this is useful in situations where loading the data that you want to check is time-consuming.
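A sketch of what this might look like, assuming the `multi_asset_check`, `AssetCheckSpec`, and `AssetCheckResult` names used elsewhere in this post (the exact signatures are hypothetical, and `load_orders` is an illustrative helper):

```python
from dagster import multi_asset_check, AssetCheckSpec, AssetCheckResult

@multi_asset_check(
    check_specs=[
        AssetCheckSpec("orders_id_is_unique", asset="orders"),
        AssetCheckSpec("orders_id_has_no_nulls", asset="orders"),
    ]
)
def orders_checks():
    orders = load_orders()  # hypothetical helper: load the data once for all checks
    yield AssetCheckResult(
        check_name="orders_id_is_unique",
        success=not orders["order_id"].duplicated().any(),
    )
    yield AssetCheckResult(
        check_name="orders_id_has_no_nulls",
        success=not orders["order_id"].isna().any(),
    )
```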
The `check_specs` argument to `multi_asset_check` specifies the set of checks being defined. The decorated function is expected to return an `AssetCheckResult` corresponding to each of the checks (unless you make the multi-check subsettable - see below). `orders_checks` is an object that contains the definitions of all these checks, along with the function that executes them.

Checks that execute in the same op that materializes the asset
Sometimes, it makes the most sense for a single function to both materialize an asset and then execute a check on it.
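A hedged sketch of the shape this could take, using the `AssetCheckDecl` and `AssetCheckResult` names from this post (the decl's arguments and the yielding convention are guesses, not confirmed API):

```python
from dagster import asset, AssetCheckDecl, AssetCheckResult, Output

@asset(checks=[AssetCheckDecl("orders_id_has_no_nulls")])
def orders():
    orders_df = build_orders()  # hypothetical helper that computes the asset
    yield Output(orders_df)
    yield AssetCheckResult(
        check_name="orders_id_has_no_nulls",
        success=not orders_df["order_id"].isna().any(),
    )
```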
When defining an asset using the `@asset` or `@multi_asset` decorators, you can provide values for the `checks` argument. Each provided `AssetCheckDecl` declares a check that the decorated function should yield an `AssetCheckResult` for:

Asset check factories
A common pattern is exposing a SQL or YAML interface that allows data practitioners in your organization to write checks without using Python.
Dagster doesn’t provide its own SQL or YAML interface or set of pre-built checks, because these tend to be specific to the needs of the organization, but here’s an example of how you might construct your own:
Asset checks and control flow
NOTE: severity no longer impacts control flow. See #16569 for the latest APIs.
Sometimes, if a check fails, you want to halt the pipeline instead of letting bad data propagate to your downstream assets.
You can configure this behavior with the severity parameter on your check. The default severity for a check is WARNING. If you set the severity for a check to ERROR, then downstream assets in the same run will wait for the check to succeed and skip materialization if it does not succeed.
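A sketch of the control-flow behavior using the names from the example below (decorator arguments are assumptions based on this thread, not confirmed API):

```python
from dagster import asset, asset_check, AssetCheckResult, CheckSeverity

@asset_check(asset="orders", severity=CheckSeverity.ERROR)
def orders_id_has_no_nulls(orders):
    num_nulls = orders["order_id"].isna().sum()
    return AssetCheckResult(success=(num_nulls == 0))

@asset  # downstream of orders: skipped in-run if the ERROR-severity check fails
def orders_report(orders):
    ...
```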
In this example, the `orders_id_has_no_nulls` check has `severity=CheckSeverity.ERROR`. `orders_report` is an asset that's downstream of `orders`. If you execute a run that includes `orders`, `orders_id_has_no_nulls`, and `orders_report`, then Dagster will only materialize `orders_report` if `orders_id_has_no_nulls` succeeds.

If you're defining your checks using `@multi_asset_check`, `@asset`, or `@multi_asset`, you can set the `severity` parameter on your `AssetCheckSpec`.

Asset checks in the UI
Checks tab on asset details page shows the status of each check:
Click into an individual asset check definition to see its evaluation history:
See asset checks on the asset graph:
There are a couple options for this. The other one is included here: #15938.