Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[docs] Data freshness guide #23937

Merged

Conversation

clairelin135
Copy link
Contributor

Summary & Motivation

How I Tested These Changes

Changelog [New | Bug | Docs]

Replace this message with a changelog entry, or NOCHANGELOG

@clairelin135
Copy link
Contributor Author

This stack of pull requests is managed by Graphite. Learn more about stacking.

Join @clairelin135 and the rest of your teammates on Graphite Graphite

Copy link

github-actions bot commented Aug 26, 2024

Deploy preview for dagster-docs-beta ready!

✅ Preview
https://dagster-docs-beta-fgoa0f2v3-elementl.vercel.app
https://claire-doc-332-write-testing-for-data-freshness-guide.dagster.dagster-docs.io

Built with commit 535f355.
This pull request is being automatically deployed with vercel-action


## Next steps

- Explore more [asset checks](/todo)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Re: line +44]

I didn’t even know about this D+ feature, could you add a next step that links to a /todo page for a Dagster+ anomaly detection page?

See this comment inline on Graphite.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added!

@@ -0,0 +1,7 @@
from dagster_cloud.anomaly_detection import build_anomaly_detection_freshness_checks

hourly_sales = ...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn’t this be an asset?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should, but I was hoping I could avoid repeating the whole code example from the previous block....

I'll try making assets=... instead



@dg.observable_source_asset(specs=[dg.AssetSpec("hourly_sales")])
def hourly_sales(snowflake: dg_snowflake.SnowflakeResource):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we name this function something like check_hourly_sales to avoid confusing it with the asset spec?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually expect both names to be the same, since the asset spec is just defining an asset key...

But that actually reminds me that defining an asset spec here is redundant, so I'm removing that

yield dg.ObserveResult(
asset_key=table_name,
metadata={
"dagster/last_updated_timestamp": dg.MetadataValue.timestamp(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this metadata is pretty magical it might be worth adding a brief description of it in the docs, something to the effect of the metadata is a special key used by freshness checks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should add a list of dagster metadata keys in the docs somewhere, along with what the purpose of each key. Currently these special keys are referenced in scattered places (for example) which makes them hard to search for.

And ideally when you're adding metadata, you can look at a single list and see all the different things you can emit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We currently have this: https://docs.dagster.io/concepts/metadata-tags/asset-metadata#standard-asset-metadata-entries

Btw, just noticed that docs search for "metadata" only returns API reference links.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should definitely have a metadata reference page. I think just mentioning that this key is what the freshness check uses should be sufficient, and we can add a link to the metadata page once created.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on the metadata reference.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to mention the key and added a TODO for adding it to the metadata reference

Copy link
Contributor

@PedramNavid PedramNavid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really great. Just a few small questions but otherwise looks nice. I love how concise this explanation of freshness checks is. I don't think I've seen such a succinct explanation of how to add freshness checks.

### Use anomaly detection to test data freshness (Dagster+ Pro)

Instead of applying policies on an asset-by-asset basis, Dagster+ Pro users can take advantage of a time series anomaly detection model to determine if data is arriving later than expected.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is from the OG guide:

This model uses data from past materializations/observations to determine if data is arriving later than expected. Note: If the asset hasn't been updated enough times, the check will pass with a message indicating that more data is needed to detect anomalies.

I think this is important to note (wordsmithing encouraged), otherwise users may get confused when they see this message.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good call, added



@dg.observable_source_asset(specs=[dg.AssetSpec("hourly_sales")])
def hourly_sales(snowflake: dg_snowflake.SnowflakeResource):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

yield dg.ObserveResult(
asset_key=table_name,
metadata={
"dagster/last_updated_timestamp": dg.MetadataValue.timestamp(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@sryza
Copy link
Contributor

sryza commented Aug 27, 2024

Thoughts on calling this "Check for data freshness" instead of "test for data freshness"? A couple reasons:

  • I think the connotation with "test" is often "something that runs outside of production", but these often run in production
  • It will do better in SEO for someone who knows that Dagster has "data freshness checks" and wants to find the page

@sryza sryza requested a review from dpeng817 August 27, 2024 16:36
@clairelin135 clairelin135 marked this pull request as ready for review August 27, 2024 17:01
@graphite-app graphite-app bot added the area: docs Related to documentation in general label Aug 27, 2024
Copy link

graphite-app bot commented Aug 27, 2024

Graphite Automations

"docs-beta - Assign Reviewers" took an action on this PR • (08/27/24)

2 reviewers were added to this PR based on Pedram Navid's automation.

@clairelin135
Copy link
Contributor Author

Agree that "check for data freshness" is a clearer name. Updated to reflect this.

@clairelin135 clairelin135 force-pushed the claire/doc-332-write-testing-for-data-freshness-guide branch from 813dfbf to 77930ba Compare August 28, 2024 04:36
Copy link
Contributor

@erinkcochran87 erinkcochran87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left one suggestion about adding a callout, but I'm happy with this. Nicely done! 🎉

@clairelin135 clairelin135 force-pushed the claire/doc-332-write-testing-for-data-freshness-guide branch from 03efb64 to 535f355 Compare August 28, 2024 18:25
@clairelin135 clairelin135 merged commit c501b3d into master Aug 28, 2024
3 of 4 checks passed
@clairelin135 clairelin135 deleted the claire/doc-332-write-testing-for-data-freshness-guide branch August 28, 2024 18:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: docs Related to documentation in general docathon
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants