Replies: 22 comments 71 replies
-
This looks interesting. Two comments from the initial look.
-
This would be a great feature. Here are a couple of ideas:
-
Will checks be available for dbt assets?
-
Consider how the tests are run - the same goes for dbt: if the tests are run in the CI pipeline, re-processing PBs of data might get costly quickly. Some sampling methodology (e.g. only processing the last month of data) will be required. Keep in mind that the actually required sample may depend on the different downstream use cases working with the data.
-
Great stuff! My 2 cents:
-
How are checks going to work in the context of time-partitioned assets? Let's say I am performing a "no nulls" check on a column because I know that downstream I perform a cumulative sum. If yesterday there were nulls, and today there weren't, then the output of the cumulative sum from that null onwards would be null, even if the new data is good when viewed in isolation. In some cases, it would be desirable to have the asset level view say "This asset is passing the checks because all of the most recent data is fine.", whilst in this case it would be "A failure in any past partition will still be impacting things today".
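To make the failure mode above concrete: in a cumulative sum, a single null poisons every subsequent value, so a partition that is clean in isolation can still produce bad output. A minimal illustration in plain Python, with `None` standing in for null:

```python
def cumulative_sum(values):
    """Running sum where any None poisons all later totals,
    mimicking how nulls propagate in SQL/pandas arithmetic."""
    out, running = [], 0
    for v in values:
        running = None if v is None or running is None else running + v
        out.append(running)
    return out

# Yesterday's partition had a null; today's values are fine,
# but every total from the null onward is still null.
print(cumulative_sum([1, 2, None, 4, 5]))  # → [1, 3, None, None, None]
```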
-
This is awesome @sryza ! Already made plans to adopt it as soon as it arrives. Wondering how this will work with dbt tests. We have a bunch of them and it would be great to have them surfaced in the UI like this. Any insights?
-
Shouldn't asset checks have an asset as an argument and use its io_manager to load the data?

```python
@asset_check(
    description="ensure there are no null order_ids",
    severity=CheckSeverity.ERROR,
)
def orders_id_has_no_nulls(context, orders):
    num_null_order_ids = orders["order_id"].isna().sum()
    return CheckResult(
        success=(num_null_order_ids == 0),
        metadata={"num_null_order_ids": num_null_order_ids},
    )
```
-
Currently, I'm using pandera to do these checks. Do you plan to integrate pandera with this slick UI via the extension? If not, should I remove the checks from the pandera schema validation? Can both coexist? Should I completely replace my pandera schema validation with a solution native to Dagster?
-
I think I'd like to see the ability to write checks independent of assets and pass them into the assets through an argument. I'm aware the current mental model is to attach an asset to a check, but I feel like this would cause a lot of code duplication and/or tons of asset check factories. The opposite route feels a little more logical to me, since most checks will likely be applicable to multiple assets. Additionally, it would be great if the UI could differentiate between checks running within the asset computation and those running in their own outside process.
-
Based on this proposal it seems like the check itself has an associated severity level. I could see a use case for changing severity based on the output of the check - for example, a check for what pct of values in a column are null might return a
-
This is really exciting! Any plans for stateful checks that can cache and compare computed values over time, e.g. looking at week-over-week deltas and alerting for large changes?
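A stateful check along these lines could be sketched as: cache the previously computed metric, compare it to the current one, and fail on a large relative delta. A minimal plain-Python sketch (the function name and threshold are illustrative, not part of the proposal):

```python
def week_over_week_check(current, previous, max_rel_change=0.5):
    """Return (passed, rel_change); passes when the metric moved
    by at most max_rel_change relative to the cached baseline."""
    if previous in (None, 0):
        return True, None  # no usable baseline yet: pass by default
    rel_change = abs(current - previous) / abs(previous)
    return rel_change <= max_rel_change, rel_change

print(week_over_week_check(150, 100))  # → (True, 0.5)
print(week_over_week_check(160, 100))  # → (False, 0.6)
```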
-
Will there be a way to override a failed blocking (CheckSeverity.ERROR) check and allow materialization of downstream assets to continue?
-
I'm not sure whether this is feasible, but can
-
Hi all, we've shipped early implementations of these APIs. Try them out! #16266
-
It would also be useful if a check could be lazy (expensive and low severity), e.g. have some kind of a
-
Great to see that quality checks are becoming a core feature of Dagster. It's great to validate an asset and see whether it matches the expected quality, and this fits some of our cases pretty well. But in our real-world scenarios we are facing more and more quality issues that are not at the asset level but at the row level: an asset has many rows that pass the expected quality checks, and only occasionally there is a row with a quality issue. The feature request would be to use checks to split an asset into the rows that pass the quality checks and a lower-quality asset containing the rows with issues - materialize the passing rows for downstream use, and materialize the failing rows somewhere else so we can dig deeper into the root cause of the issues.
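The split described here amounts to partitioning rows by a quality predicate. A minimal plain-Python sketch (the row shape and function names are hypothetical):

```python
def split_by_check(rows, passes_check):
    """Partition rows into (good, quarantined) using a row-level quality predicate."""
    good, quarantined = [], []
    for row in rows:
        (good if passes_check(row) else quarantined).append(row)
    return good, quarantined

rows = [{"order_id": 1}, {"order_id": None}, {"order_id": 3}]
good, bad = split_by_check(rows, lambda r: r["order_id"] is not None)
print(len(good), len(bad))  # → 2 1
```

The `good` rows would back the main materialization for downstream use, while `bad` would be written to a separate quarantine asset for root-cause analysis.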
-
I would like my Should
-
It would be useful to have
-
Started experimenting with asset checks, and liking it so far! I did run into a usage pattern that may not be covered very ergonomically in the framework as it is now. Essentially, we'd like to be better able to separate and compose our checks. Right now, as far as I can tell, you have two options:

We'd like to be able to use both flows to run the same checks 🙂 Here's why:

We use external compute (Databricks via a step launcher). This has caching advantages and reduces spin-up time/latency if we can run a check in the same ephemeral cluster as the main asset materialization (i.e. we want to run an asset check in the same op that materializes the asset; right now we'd look to yield an AssetCheckResult).

We'd also like to decouple the above from the asset code, to be able to run checks on their own. For example, perhaps we extend our asset checks to cover a new bug; we'd like to be able to re-run the asset checks without re-materializing the asset. (This could be even more important in the future when/if partitioned checks are added 🙂)

We can do the above by writing our asset checks as functions, creating a generator that yields the check results during materialization, and then re-using those functions (or a factory) to define standalone asset checks; however, in this case we're not sure how to keep the asset checks from running twice. I suppose another way of looking at this is that we want greater control over when/where asset checks run. As another example, I could see us creating an asset check that can run without external compute (e.g. scan metadata only).
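One way to get both flows from a single definition is to keep the check logic in a plain function and wrap it twice: once yielded inline during materialization, and once as a standalone check over the stored asset. A plain-Python sketch of the idea (the Dagster wiring is omitted; `load_materialized_rows` is a hypothetical stand-in for re-loading the asset):

```python
def order_id_not_null_check(rows):
    """Core check logic, independent of where it runs."""
    num_nulls = sum(1 for r in rows if r.get("order_id") is None)
    return {"success": num_nulls == 0, "num_nulls": num_nulls}

def load_materialized_rows():
    # Hypothetical stand-in for re-loading the already-materialized asset.
    return [{"order_id": 1}, {"order_id": None}]

# Flow 1: run inline right after materializing (same op / same cluster).
fresh_rows = [{"order_id": 1}, {"order_id": 2}]
print(order_id_not_null_check(fresh_rows))  # → {'success': True, 'num_nulls': 0}

# Flow 2: re-run later against the stored asset, without re-materializing it
# (e.g. wrapped into a standalone asset check by a factory).
print(order_id_not_null_check(load_materialized_rows()))  # → {'success': False, 'num_nulls': 1}
```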
-
Started to think about how asset checks can be useful to us beyond running dbt tests, and I'm wondering if there are any plans to incorporate a callback function of sorts where if the check fails, it runs the provided function with the relevant metadata? That way we could create an incident/task in an external system (think Pagerduty, Asana, Zendesk) if some invalid/non-conforming data is caught.
-
@sryza I'm pretty sure the asset check factory example has a bug in it - B023. Specifically, Python closures bind loop variables late, so rows = db_connection.execute(check_blob["sql"]) will always use the value of the last item in the list. Python's usual solution is to pass the value as a parameter with a default value, but we can't do this in dagster. Is there a way that we can use a
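The late-binding behaviour behind B023, and the factory-function workaround that avoids adding a default parameter (which would change the signature Dagster inspects), can be shown in plain Python:

```python
# Buggy: closures bind check_blob late, so all of them see the last value.
checks = []
for check_blob in ["a", "b", "c"]:
    checks.append(lambda: check_blob)
print([c() for c in checks])  # → ['c', 'c', 'c']

# Fix: a factory function gives each closure its own binding, without
# adding a default parameter to the check function itself.
def make_check(blob):
    return lambda: blob

checks = [make_check(check_blob) for check_blob in ["a", "b", "c"]]
print([c() for c in checks])  # → ['a', 'b', 'c']
```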
-
Note
Asset checks are now available in dagster 1.5, and this post is not current. The latest on checks is here.
The following is a spec for a not-yet-implemented Python API for defining and executing asset checks in Dagster. We would love your feedback on any and all aspects of it!
This supersedes #9543.
Asset checks
Dagster allows you to define and execute data quality checks on your software-defined assets. Each asset check verifies some property of a data asset, e.g. that it has no null values in a particular column.
When viewing an asset in Dagster’s UI, you can see all of its checks, and whether they’ve passed, failed, or haven’t run. When launching a run to execute an asset, by default its checks will also be executed. Checks can also be executed on their own, independent of asset materializations.
By setting their severity level to ERROR, you can specify that your checks impact control flow, i.e. only materialize downstream assets if the checks on the upstream assets succeed.
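As a rough sketch of the shape this proposal describes (a sketch only: the decorator arguments and the `asset_check`/`AssetCheckResult` names are taken from examples in this thread and may differ in the final API):

```python
from dagster import asset, asset_check, AssetCheckResult

@asset
def orders():
    ...

@asset_check(asset=orders, description="no null values in the order_id column")
def orders_id_has_no_nulls(orders):
    num_nulls = orders["order_id"].isna().sum()
    return AssetCheckResult(
        success=(num_nulls == 0),
        metadata={"num_null_order_ids": int(num_nulls)},
    )
```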
Defining asset checks
Single asset check that executes in its own op
The following code defines an asset named `orders` and an asset check named `orders_id_is_unique`. When executed, the check verifies a property of the `orders` asset: that all the values in its primary key column are unique.

The `orders_id_is_unique` check runs in its own op. That means that, if you launch a run that materializes the `orders` asset and also executes the `orders_id_is_unique` check, and you're using the `multiprocess_executor`, the check will execute in a separate process from the process that materializes the asset.

Multiple asset checks that execute in a single op
Sometimes, you want to define multiple checks that are executed within the same function. For example, this is useful in situations where loading the data that you want to check is time-consuming.
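A sketch of what this might look like, assuming the `multi_asset_check`, `AssetCheckSpec`, and `AssetCheckResult` names used elsewhere in this post (the exact signatures are hypothetical, and `load_orders` is an illustrative helper):

```python
from dagster import multi_asset_check, AssetCheckSpec, AssetCheckResult

@multi_asset_check(
    check_specs=[
        AssetCheckSpec("orders_id_is_unique", asset="orders"),
        AssetCheckSpec("orders_id_has_no_nulls", asset="orders"),
    ]
)
def orders_checks():
    orders = load_orders()  # hypothetical helper: load the data once for all checks
    yield AssetCheckResult(
        check_name="orders_id_is_unique",
        success=not orders["order_id"].duplicated().any(),
    )
    yield AssetCheckResult(
        check_name="orders_id_has_no_nulls",
        success=not orders["order_id"].isna().any(),
    )
```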
The `check_specs` argument to `multi_asset_check` specifies the set of checks being defined. The decorated function is expected to return an `AssetCheckResult` corresponding to each of the checks (unless you make the multi-check subsettable - see below). `orders_checks` is an object that contains the definitions of all these checks, along with the function that executes them.

Checks that execute in the same op that materializes the asset
Sometimes, it makes the most sense for a single function to both materialize an asset and then execute a check on it.
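A hedged sketch of the shape this could take, using the `AssetCheckDecl` and `AssetCheckResult` names from this post (the decl's arguments and the yielding convention are guesses, not confirmed API):

```python
from dagster import asset, AssetCheckDecl, AssetCheckResult, Output

@asset(checks=[AssetCheckDecl("orders_id_has_no_nulls")])
def orders():
    orders_df = build_orders()  # hypothetical helper that computes the asset
    yield Output(orders_df)
    yield AssetCheckResult(
        check_name="orders_id_has_no_nulls",
        success=not orders_df["order_id"].isna().any(),
    )
```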
When defining an asset using the `@asset` or `@multi_asset` decorators, you can provide values for the `checks` argument. Each provided `AssetCheckDecl` declares a check that the decorated function should yield an `AssetCheckResult` for:

Asset check factories
A common pattern is exposing a SQL or YAML interface that allows data practitioners in your organization to write checks without using Python.
Dagster doesn’t provide its own SQL or YAML interface or set of pre-built checks, because these tend to be specific to the needs of the organization, but here’s an example of how you might construct your own:
Asset checks and control flow
NOTE: severity no longer impacts control flow. See #16569 for the latest APIs.
Sometimes, if a check fails, you want to halt the pipeline instead of letting bad data propagate to your downstream assets.
You can configure this behavior with the severity parameter on your check. The default severity for a check is WARNING. If you set the severity for a check to ERROR, then downstream assets in the same run will wait for the check to succeed and skip materialization if it does not succeed.
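A sketch of the control-flow behavior using the names from the example below (decorator arguments are assumptions based on this thread, not confirmed API):

```python
from dagster import asset, asset_check, AssetCheckResult, CheckSeverity

@asset_check(asset="orders", severity=CheckSeverity.ERROR)
def orders_id_has_no_nulls(orders):
    num_nulls = orders["order_id"].isna().sum()
    return AssetCheckResult(success=(num_nulls == 0))

@asset  # downstream of orders: skipped in-run if the ERROR-severity check fails
def orders_report(orders):
    ...
```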
In this example, the `orders_id_has_no_nulls` check has `severity=CheckSeverity.ERROR`. `orders_report` is an asset that's downstream of `orders`. If you execute a run that includes `orders`, `orders_id_has_no_nulls`, and `orders_report`, then Dagster will only materialize `orders_report` if `orders_id_has_no_nulls` succeeds.

If you're defining your checks using `@multi_asset_check`, `@asset`, or `@multi_asset`, you can set the `severity` parameter on your `AssetCheckSpec`.

Asset checks in the UI
Checks tab on asset details page shows the status of each check:
Click into an individual asset check definition to see its evaluation history:
See asset checks on the asset graph:
There are a couple options for this. The other one is included here: #15938.