-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
[docs-revamp] - Clean up Sensors guide (#24462)
## Summary & Motivation ## How I Tested These Changes ## Changelog NOCHANGELOG --------- Co-authored-by: Colton Padden <[email protected]>
- Loading branch information
1 parent
a4164ca
commit a48b3e6
Showing
3 changed files
with
124 additions
and
14 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,33 +1,77 @@ | ||
--- | ||
title: Create event-based pipelines with sensors | ||
sidebar_label: Sensors | ||
title: Creating event-based pipelines with sensors | ||
sidebar_label: Event triggers | ||
sidebar_position: 20 | ||
--- | ||
|
||
Sensors are a way to trigger runs in response to events in Dagster. Sensors | ||
run on a regular interval and can either trigger a run, or provide a reason why a run was skipped. | ||
|
||
Sensors allow you to respond to events in external systems. For example, you can trigger a run when a new file arrives in an S3 bucket, or when a row is updated in a database. | ||
Sensors enable you to trigger Dagster runs in response to events from external systems. They run at regular intervals, either triggering a run or explaining why a run was skipped. For example, you can trigger a run when a new file is added to an Amazon S3 bucket or when a database row is updated. | ||
|
||
<details> | ||
<summary>Prerequisites</summary> | ||
|
||
To follow the steps in this guide, you'll need: | ||
|
||
- Familiarity with [Assets](/concepts/assets) | ||
- Familiarity with [Ops and Jobs](/concepts/ops-jobs) | ||
</details> | ||
|
||
## Basic sensor example | ||
|
||
This example includes a `check_for_new_files` function that simulates finding new files. In a real scenario, this function would check an actual system or directory. | ||
## Basic sensor | ||
|
||
The sensor runs every 5 seconds. If it finds new files, it starts a run of `my_job`. If not, it skips the run and logs "No new files found" in the Dagster UI. | ||
Sensors are defined with the `@sensor` decorator. The following example includes a `check_for_new_files` function that simulates finding new files. In a real scenario, this function would check an actual system or directory. | ||
|
||
If the sensor finds new files, it starts a run of `my_job`. If not, it skips the run and logs `No new files found` in the Dagster UI. | ||
|
||
<CodeExample filePath="guides/automation/simple-sensor-example.py" language="python" title="Simple Sensor Example" /> | ||
<CodeExample filePath="guides/automation/simple-sensor-example.py" language="python" /> | ||
|
||
:::tip | ||
Unless a sensor has a `default_status` of `DefaultSensorStatus.RUNNING`, it won't be enabled when first deployed to a Dagster instance. To find and enable the sensor, click **Automation > Sensors** in the Dagster UI. | ||
::: | ||
|
||
## Customizing intervals between evaluations | ||
|
||
The `minimum_interval_seconds` argument allows you to specify the minimum number of seconds that will elapse between sensor evaluations. This means that the sensor won't be evaluated more frequently than the specified interval. | ||
|
||
It's important to note that this interval represents a minimum interval between runs of the sensor and not the exact frequency the sensor runs. If a sensor takes longer to complete than the specified interval, the next evaluation will be delayed accordingly. | ||
|
||
```python | ||
# Sensor will be evaluated at least every 30 seconds | ||
@dg.sensor(job=my_job, minimum_interval_seconds=30) | ||
def new_file_sensor(): | ||
... | ||
``` | ||
|
||
In this example, if the `new_file_sensor`'s evaluation function takes less than a second to run, you can expect the sensor to run consistently around every 30 seconds. However, if the evaluation function takes longer, the interval between evaluations will be longer. | ||
|
||
## Preventing duplicate runs | ||
|
||
To prevent duplicate runs, you can use run keys to uniquely identify each `RunRequest`. In the [previous example](#basic-sensor), the `RunRequest` was constructed with a `run_key`: | ||
|
||
By default, sensors aren't enabled when first deployed to a Dagster instance. | ||
Click "Automation" in the top navigation to find and enable a sensor. | ||
``` | ||
yield dg.RunRequest(run_key=filename) | ||
``` | ||
|
||
For a given sensor, a single run is created for each `RunRequest` with a unique `run_key`. Dagster will skip processing requests with previously used run keys, ensuring that duplicate runs won't be created. | ||
|
||
## Cursors and high volume events | ||
|
||
When dealing with a large number of events, you may want to implement a cursor to optimize sensor performance. Unlike run keys, cursors allow you to implement custom logic that manages state. | ||
|
||
The following example demonstrates how you might use a cursor to only create `RunRequests` for files in a directory that have been updated since the last time the sensor ran. | ||
|
||
<CodeExample filePath="guides/automation/sensor-cursor.py" language="python" /> | ||
|
||
For sensors that consume multiple event streams, you may need to serialize and deserialize a more complex data structure in and out of the cursor string to keep track of the sensor's progress over the multiple streams. | ||
|
||
:::note | ||
The preceding example uses both a `run_key` and a cursor, which means that if the cursor is reset but the files don't change, new runs won't be launched. This is because the run keys associated with the files won't change. | ||
|
||
If you want to be able to reset a sensor's cursor, don't set `run_key`s on `RunRequest`s. | ||
::: | ||
|
||
## Next steps | ||
|
||
By understanding and effectively using these automation methods, you can build more efficient data pipelines that respond to your specific needs and constraints. | ||
|
||
- Run pipelines on a [schedule](/guides/schedules) | ||
- Trigger cross-job dependencies with [asset sensors](/guides/asset-sensors) | ||
- Explore [Declarative Automation](/concepts/automation/declarative-automation) as an alternative to sensors |
56 changes: 56 additions & 0 deletions
56
examples/docs_beta_snippets/docs_beta_snippets/guides/automation/sensor-cursor.py
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,56 @@ | ||
import os | ||
|
||
import dagster as dg | ||
|
||
MY_DIRECTORY = "data" | ||
|
||
|
||
@dg.asset | ||
def my_asset(context: dg.AssetExecutionContext): | ||
context.log.info("Hello, world!") | ||
|
||
|
||
my_job = dg.define_asset_job("my_job", selection=[my_asset]) | ||
|
||
|
||
@dg.sensor( | ||
job=my_job, | ||
minimum_interval_seconds=5, | ||
default_status=dg.DefaultSensorStatus.RUNNING, | ||
) | ||
# highlight-start | ||
# Enable sensor context | ||
def updated_file_sensor(context): | ||
# Get current cursor value from sensor context | ||
last_mtime = float(context.cursor) if context.cursor else 0 | ||
# highlight-end | ||
|
||
max_mtime = last_mtime | ||
|
||
# Loop through directory | ||
for filename in os.listdir(MY_DIRECTORY): | ||
filepath = os.path.join(MY_DIRECTORY, filename) | ||
if os.path.isfile(filepath): | ||
# Get the file's last modification time (st_mtime) | ||
fstats = os.stat(filepath) | ||
file_mtime = fstats.st_mtime | ||
|
||
# If the file was updated since the last eval time, continue | ||
if file_mtime <= last_mtime: | ||
continue | ||
|
||
# Construct the RunRequest with run_key and config | ||
run_key = f"{filename}:{file_mtime}" | ||
run_config = {"ops": {"my_asset": {"config": {"filename": filename}}}} | ||
yield dg.RunRequest(run_key=run_key, run_config=run_config) | ||
|
||
# highlight-start | ||
# Keep the larger value of max_mtime and file last updated | ||
max_mtime = max(max_mtime, file_mtime) | ||
|
||
# Update the cursor | ||
context.update_cursor(str(max_mtime)) | ||
# highlight-end | ||
|
||
|
||
defs = dg.Definitions(assets=[my_asset], jobs=[my_job], sensors=[updated_file_sensor]) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters