-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[docs] Data freshness guide #23937
[docs] Data freshness guide #23937
Conversation
This stack of pull requests is managed by Graphite. Learn more about stacking. Join @clairelin135 and the rest of your teammates on Graphite |
Deploy preview for dagster-docs-beta ready! ✅ Preview Built with commit 535f355. |
|
||
## Next steps | ||
|
||
- Explore more [asset checks](/todo) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn’t even know about this D+ feature, could you add a next step that links to a /todo page for a Dagster+ anomaly detection page?
See this comment inline on Graphite.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added!
@@ -0,0 +1,7 @@ | |||
from dagster_cloud.anomaly_detection import build_anomaly_detection_freshness_checks | |||
|
|||
hourly_sales = ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn’t this be an asset?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should, but I was hoping I could avoid repeating the whole code example from the previous block....
I'll try making assets=...
instead
|
||
|
||
@dg.observable_source_asset(specs=[dg.AssetSpec("hourly_sales")]) | ||
def hourly_sales(snowflake: dg_snowflake.SnowflakeResource): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we name this function something like check_hourly_sales to avoid confusing it with the asset spec?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually expect both names to be the same, since the asset spec is just defining an asset key...
But that actually reminds me that defining an asset spec here is redundant, so I'm removing that
yield dg.ObserveResult( | ||
asset_key=table_name, | ||
metadata={ | ||
"dagster/last_updated_timestamp": dg.MetadataValue.timestamp( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this metadata is pretty magical it might be worth adding a brief description of it in the docs, something to the effect of the metadata is a special key used by freshness checks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should add a list of dagster metadata keys in the docs somewhere, along with what the purpose of each key. Currently these special keys are referenced in scattered places (for example) which makes them hard to search for.
And ideally when you're adding metadata, you can look at a single list and see all the different things you can emit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We currently have this: https://docs.dagster.io/concepts/metadata-tags/asset-metadata#standard-asset-metadata-entries
Btw, just noticed that docs search for "metadata" only returns API reference links.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should definitely have a metadata reference page. I think just mentioning that this key is what the freshness check uses should be sufficient, and we can add a link to the metadata page once created.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on the metadata reference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to mention the key and added a TODO for adding it to the metadata reference
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really great. Just a few small questions but otherwise looks nice. I love how concise this explanation of freshness checks is. I don't think I've seen such a succinct explanation of how to add freshness checks.
### Use anomaly detection to test data freshness (Dagster+ Pro) | ||
|
||
Instead of applying policies on an asset-by-asset basis, Dagster+ Pro users can take advantage of a time series anomaly detection model to determine if data is arriving later than expected. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is from the OG guide:
This model uses data from past materializations/observations to determine if data is arriving later than expected. Note: If the asset hasn't been updated enough times, the check will pass with a message indicating that more data is needed to detect anomalies.
I think this is important to note (wordsmithing encouraged), otherwise users may get confused when they see this message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good call, added
|
||
|
||
@dg.observable_source_asset(specs=[dg.AssetSpec("hourly_sales")]) | ||
def hourly_sales(snowflake: dg_snowflake.SnowflakeResource): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
yield dg.ObserveResult( | ||
asset_key=table_name, | ||
metadata={ | ||
"dagster/last_updated_timestamp": dg.MetadataValue.timestamp( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
Thoughts on calling this "Check for data freshness" instead of "test for data freshness"? A couple reasons:
|
Graphite Automations"docs-beta - Assign Reviewers" took an action on this PR • (08/27/24)2 reviewers were added to this PR based on Pedram Navid's automation. |
Agree that "check for data freshness" is a clearer name. Updated to reflect this. |
813dfbf
to
77930ba
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left one suggestion about adding a callout, but I'm happy with this. Nicely done! 🎉
docs/docs-beta/docs/guides/quality-testing/data-freshness-testing.md
Outdated
Show resolved
Hide resolved
…ing.md Co-authored-by: Erin Cochran <[email protected]>
03efb64
to
535f355
Compare
Summary & Motivation
How I Tested These Changes
Changelog [New | Bug | Docs]