Replies: 3 comments
-
We do something analogous for job tags. Since our custom Snowflake resource initializes with the correct schema (in our case the "project" name, which by definition needs to be both the name of the directory that contains its code and its asset group name), we needed to enforce the existence of certain job tags. Snowflake resource snippet:
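A minimal sketch of the kind of resource described, assuming Dagster's `ConfigurableResource` base class; the class and field names are illustrative, not the commenter's actual code:

```python
from dagster import ConfigurableResource


class ProjectSnowflakeResource(ConfigurableResource):
    """Sketch: a Snowflake resource whose schema is the "project" name,
    which is also the code directory name and the asset group name."""

    project: str

    @property
    def snowflake_schema(self) -> str:
        # The schema is the project name by construction.
        return self.project
```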
To enforce the existence of certain tags, we have a function along these lines.
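A plausible sketch of such a function, using Dagster's `JobDefinition` API and a hypothetical set of required tag keys:

```python
from dagster import JobDefinition

REQUIRED_JOB_TAGS = {"team", "project"}  # hypothetical required tag keys


def enforce_job_tags(job_def: JobDefinition) -> JobDefinition:
    """Raise at definition time if a job is missing any required tags."""
    missing = REQUIRED_JOB_TAGS - set(job_def.tags)
    if missing:
        raise ValueError(
            f"Job {job_def.name!r} is missing required tags: {sorted(missing)}"
        )
    return job_def
```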
With some edits, this function can also enforce ECS resource usage, to keep users from getting their grubby little paws on the most expensive Fargate tasks instead of writing better code, along with any other attribute we can extract from a job definition.

I'm a big fan of this idea generally, but it might be best implemented with some helper factory functions to import and call wherever people decide works best for them.
-
How would this interact with the larger ecosystem of data contracts, like https://github.com/bitol-io/open-data-contract-standard?
-
Does it make sense to build this into Dagster natively? Or would it make more sense to strengthen the dbt ecosystem (for example, dbt-score: https://blog.picnic.nl/picnic-open-sources-dbt-score-linting-model-metadata-with-ease-428278f9f05b) and then call out to that ecosystem?
-
A data catalog is only as good as the metadata that’s inside of it. And even with a pristine catalog, as data pipelines grow and new data assets are developed, it’s easy for standards to become relaxed and for important data attributes to go undocumented.
To ensure that data assets are properly documented, one useful strategy is to use CI to enforce a minimum standard of documentation for any data asset definitions that are added to the platform.
What does it mean to “use CI to enforce” something?
A typical workflow for contributing to a data pipeline looks something like this:

1. A data practitioner makes changes to pipeline code on a branch.
2. They open a pull request with those changes.
3. Once the pull request is approved, they merge it.
“Using CI to enforce a minimum standard of documentation” means adding a step between (2) and (3). In this step, an automated test executes, and it fails if any of the asset definitions in the repository are missing the expected documentation. The data practitioner is expected to get the test to pass before merging.
There are many ways to automatically execute a test on every branch. If you're managing your repository with GitHub, a common one is to use GitHub Actions to execute tests using pytest.
Writing tests to enforce documentation coverage
There’s no single “right” set of attributes for all data assets across all organizations; it’s ultimately up to you to decide what kind of documentation you want to enforce. Some examples:

- Every asset has a description.
- Every asset has an owner.
- Every asset documents all of its columns.
- Every asset carries certain tags, such as a “storage kind”.
Writing these kinds of tests in Dagster is straightforward. The general pattern is:

1. Load the `Definitions` object that you used to [define your Dagster code location](https://docs.dagster.io/concepts/code-locations#defining-code-locations).
2. Pull the `AssetSpec`s from it. Each `AssetSpec` contains the attributes that were supplied to define the asset.
3. Iterate over the `AssetSpec`s and assert that they contain the attributes you expect them to.

Example: test that all assets have a description
Example: test that all assets have a “storage kind” tag that’s either “snowflake” or “s3”
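Along the same lines, assuming the convention is a plain `storage_kind` tag key (swap in whatever key your organization standardizes on):

```python
from my_project.definitions import defs  # hypothetical import path

ALLOWED_STORAGE_KINDS = {"snowflake", "s3"}


def test_all_assets_have_valid_storage_kind():
    for spec in defs.get_all_asset_specs():
        storage_kind = spec.tags.get("storage_kind")
        assert storage_kind in ALLOWED_STORAGE_KINDS, (
            f"Asset {spec.key.to_user_string()} has storage kind "
            f"{storage_kind!r}; expected one of {sorted(ALLOWED_STORAGE_KINDS)}"
        )
```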
Example: test that all assets have all their columns documented
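A sketch that assumes column schemas are attached under Dagster's `dagster/column_schema` metadata key. Depending on how the metadata was supplied, the value may be a raw `TableSchema` or a normalized `TableSchemaMetadataValue`, so this test handles both:

```python
from dagster import TableSchema, TableSchemaMetadataValue

from my_project.definitions import defs  # hypothetical import path


def test_all_columns_documented():
    for spec in defs.get_all_asset_specs():
        value = spec.metadata.get("dagster/column_schema")
        # Unwrap the metadata value if it has been normalized.
        schema = value.schema if isinstance(value, TableSchemaMetadataValue) else value
        assert isinstance(schema, TableSchema), (
            f"Asset {spec.key.to_user_string()} has no column schema"
        )
        for column in schema.columns:
            assert column.description, (
                f"Column {column.name!r} on asset "
                f"{spec.key.to_user_string()} is missing a description"
            )
```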
Example: test that every asset definition in the “core_analytics” asset group has an owner
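And a check scoped to one asset group, using the `group_name` and `owners` attributes of `AssetSpec`:

```python
from my_project.definitions import defs  # hypothetical import path


def test_core_analytics_assets_have_owners():
    for spec in defs.get_all_asset_specs():
        if spec.group_name == "core_analytics":
            assert spec.owners, (
                f"Asset {spec.key.to_user_string()} is in the core_analytics "
                "group but has no owner"
            )
```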
What if I want to grandfather in existing assets but enforce standards for new assets?
Imagine this: your repository already contains many asset definitions that don’t meet your documentation standard, and bringing them all up to standard at once isn’t realistic, but you want every asset added from now on to meet it.
You can implement this by creating a one-time, whitelisted set of old asset definitions that get “grandfathered in”, and excluding them from your completeness tests.
Step 1: compile a list of all the assets you want to grandfather
You can write a script like the following:
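A sketch, reusing the hypothetical `my_project.definitions` import from the tests above:

```python
# One-time script: print every existing asset key so the output can be
# pasted into the test file as the grandfathered set.
from my_project.definitions import defs  # hypothetical import path

print("GRANDFATHERED_ASSETS = {")
for spec in sorted(defs.get_all_asset_specs(), key=lambda s: s.key.to_user_string()):
    print(f'    "{spec.key.to_user_string()}",')
print("}")
```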
This will produce output that looks something like this:
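(The asset names below are placeholders.)

```python
GRANDFATHERED_ASSETS = {
    "customers",
    "orders",
    "raw_events",
}
```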
You can then copy/paste this into your file of coverage tests.
The idea is to run this script just once. New assets added from now on shouldn’t get included.
Step 2: write tests that ignore the grandfathered assets
The following tests that all assets have descriptions, except for the grandfathered assets:
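Same assumptions as before; the grandfathered asset names are placeholders:

```python
from my_project.definitions import defs  # hypothetical import path

# Pasted from the one-time script; shrink this set over time.
GRANDFATHERED_ASSETS = {
    "customers",
    "orders",
    "raw_events",
}


def test_all_assets_have_descriptions():
    for spec in defs.get_all_asset_specs():
        # Skip assets that were grandfathered in.
        if spec.key.to_user_string() in GRANDFATHERED_ASSETS:
            continue
        assert spec.description, (
            f"Asset {spec.key.to_user_string()} is missing a description"
        )
```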
Step 3: incrementally remove assets from the set of grandfathered assets
Over time, as you add documentation to assets, you can take them out of the `GRANDFATHERED_ASSETS` set. The goal is to empty it out entirely in the long run.