Replies: 3 comments
-
We do something analogous for job tags. Since our custom Snowflake resource initializes with the correct schema (in our case the "project" name, which by definition needs to be both the name of the directory that contains its code and its asset group name), we needed to enforce the existence of certain job tags. Snowflake resource snippet:
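A minimal sketch of the kind of resource described, assuming Dagster's `ConfigurableResource` base class; the class and field names are illustrative, not the commenter's actual code:

```python
from dagster import ConfigurableResource


class ProjectSnowflakeResource(ConfigurableResource):
    """Sketch: a Snowflake resource whose schema is the "project" name,
    which is also the code directory name and the asset group name."""

    project: str

    @property
    def snowflake_schema(self) -> str:
        # The schema is the project name by construction.
        return self.project
```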
To enforce the existence of certain tags, we have a function along these lines.
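A plausible sketch of such a function, using Dagster's `JobDefinition` API and a hypothetical set of required tag keys:

```python
from dagster import JobDefinition

REQUIRED_JOB_TAGS = {"team", "project"}  # hypothetical required tag keys


def enforce_job_tags(job_def: JobDefinition) -> JobDefinition:
    """Raise at definition time if a job is missing any required tags."""
    missing = REQUIRED_JOB_TAGS - set(job_def.tags)
    if missing:
        raise ValueError(
            f"Job {job_def.name!r} is missing required tags: {sorted(missing)}"
        )
    return job_def
```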
With some edits, this function can also enforce ECS resource usage, to keep users from getting their grubby little paws on the most expensive Fargate tasks instead of writing better code, along with any other attribute we can extract from a job definition.

I'm a big fan of this idea generally, but it might be best implemented with some helper factory functions to import and call wherever people decide works best for them.
-
How would this interact with the larger ecosystem of data contracts, like https://github.com/bitol-io/open-data-contract-standard?
-
Does it make sense to build this into Dagster natively? Or would it make more sense to strengthen the dbt ecosystem (for example, dbt-score: https://blog.picnic.nl/picnic-open-sources-dbt-score-linting-model-metadata-with-ease-428278f9f05b) and then call out to that ecosystem?
-
A data catalog is only as good as the metadata that’s inside of it. And even with a pristine catalog, as data pipelines grow and new data assets are developed, it’s easy for standards to become relaxed and for important data attributes to go undocumented.
To ensure that data assets are properly documented, one useful strategy is to use CI to enforce a minimum standard of documentation for any data asset definitions that are added to the platform.
What does it mean to “use CI to enforce” something?
A typical workflow for contributing to a data pipeline looks something like this:

1. A data practitioner makes changes to pipeline code on a branch.
2. They open a pull request with those changes.
3. Once the pull request is approved, they merge it.
“Using CI to enforce a minimum standard of documentation” means adding a step between (2) and (3). In this step, an automated test executes, and it fails if any of the asset definitions in the repository are missing the expected documentation. The data practitioner is expected to get the test to pass before merging.
There are many ways to automatically execute a test on every branch. If you're managing your repository with GitHub, a common one is to use GitHub Actions to execute tests using pytest.
Writing tests to enforce documentation coverage
There’s no single “right” set of attributes for all data assets across all organizations; it’s ultimately up to you to decide what kind of documentation you want to enforce. Some examples:

- Every asset has a description.
- Every asset has an owner.
- Every asset documents all of its columns.
- Every asset carries certain tags, such as a “storage kind”.
Writing these kinds of tests in Dagster is straightforward. The general pattern is:

1. Load the `Definitions` object that you used to [define your Dagster code location](https://docs.dagster.io/concepts/code-locations#defining-code-locations).
2. Pull the `AssetSpec`s from it. Each `AssetSpec` contains the attributes that were supplied to define the asset.
3. Iterate over the `AssetSpec`s and assert that they contain the attributes you expect them to.

Example: test that all assets have a description
Example: test that all assets have a “storage kind” tag that’s either “snowflake” or “s3”
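Along the same lines, assuming the convention is a plain `storage_kind` tag key (swap in whatever key your organization standardizes on):

```python
from my_project.definitions import defs  # hypothetical import path

ALLOWED_STORAGE_KINDS = {"snowflake", "s3"}


def test_all_assets_have_valid_storage_kind():
    for spec in defs.get_all_asset_specs():
        storage_kind = spec.tags.get("storage_kind")
        assert storage_kind in ALLOWED_STORAGE_KINDS, (
            f"Asset {spec.key.to_user_string()} has storage kind "
            f"{storage_kind!r}; expected one of {sorted(ALLOWED_STORAGE_KINDS)}"
        )
```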
Example: test that all assets have all their columns documented
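A sketch that assumes column schemas are attached under Dagster's `dagster/column_schema` metadata key. Depending on how the metadata was supplied, the value may be a raw `TableSchema` or a normalized `TableSchemaMetadataValue`, so this test handles both:

```python
from dagster import TableSchema, TableSchemaMetadataValue

from my_project.definitions import defs  # hypothetical import path


def test_all_columns_documented():
    for spec in defs.get_all_asset_specs():
        value = spec.metadata.get("dagster/column_schema")
        # Unwrap the metadata value if it has been normalized.
        schema = value.schema if isinstance(value, TableSchemaMetadataValue) else value
        assert isinstance(schema, TableSchema), (
            f"Asset {spec.key.to_user_string()} has no column schema"
        )
        for column in schema.columns:
            assert column.description, (
                f"Column {column.name!r} on asset "
                f"{spec.key.to_user_string()} is missing a description"
            )
```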
Example: test that every asset definition in the “core_analytics” asset group has an owner
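And a check scoped to one asset group, using the `group_name` and `owners` attributes of `AssetSpec`:

```python
from my_project.definitions import defs  # hypothetical import path


def test_core_analytics_assets_have_owners():
    for spec in defs.get_all_asset_specs():
        if spec.group_name == "core_analytics":
            assert spec.owners, (
                f"Asset {spec.key.to_user_string()} is in the core_analytics "
                "group but has no owner"
            )
```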
What if I want to grandfather in existing assets but enforce standards for new assets?
Imagine this: your repository already contains many asset definitions that don’t meet your documentation standard, and bringing them all up to standard at once isn’t realistic, but you want every asset added from now on to meet it.
You can implement this by creating a one-time, whitelisted set of old asset definitions that get “grandfathered in”, and excluding them from your completeness tests.
Step 1: compile a list of all the assets you want to grandfather
You can write a script like the following:
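A sketch, reusing the hypothetical `my_project.definitions` import from the tests above:

```python
# One-time script: print every existing asset key so the output can be
# pasted into the test file as the grandfathered set.
from my_project.definitions import defs  # hypothetical import path

print("GRANDFATHERED_ASSETS = {")
for spec in sorted(defs.get_all_asset_specs(), key=lambda s: s.key.to_user_string()):
    print(f'    "{spec.key.to_user_string()}",')
print("}")
```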
This will produce output that looks something like this:
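(The asset names below are placeholders.)

```python
GRANDFATHERED_ASSETS = {
    "customers",
    "orders",
    "raw_events",
}
```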
You can then copy/paste this into your file of coverage tests.
The idea is to run this script just once. New assets added from now on shouldn’t get included.
Step 2: write tests that ignore the grandfathered assets
The following tests that all assets have descriptions, except for the grandfathered assets:
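Same assumptions as before; the grandfathered asset names are placeholders:

```python
from my_project.definitions import defs  # hypothetical import path

# Pasted from the one-time script; shrink this set over time.
GRANDFATHERED_ASSETS = {
    "customers",
    "orders",
    "raw_events",
}


def test_all_assets_have_descriptions():
    for spec in defs.get_all_asset_specs():
        # Skip assets that were grandfathered in.
        if spec.key.to_user_string() in GRANDFATHERED_ASSETS:
            continue
        assert spec.description, (
            f"Asset {spec.key.to_user_string()} is missing a description"
        )
```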
Step 3: incrementally remove assets from the set of grandfathered assets
Over time, as you add documentation to assets, you can take them out of the `GRANDFATHERED_ASSETS` set. The goal is to empty it out entirely in the long run.