# Doc 302 new etl tutorial - part 1 (#25320)
## Summary & Motivation

I'm a little way into this and would like to get feedback from @PedramNavid and @cmpadden on the structure and general flow. This isn't done at this point, but I figure we could collaborate here and iterate from there. I made some changes to the reference file to make it more concise regarding metadata output. The new code example function works great.

Main questions I have at this point:

1. Where will we put pictures for the new docs site? I want to add some UI screenshots for more color.
2. How many explanations should I add here? I know on the previous site, the main complaint was that there was too much going on. I think having a couple of sentences explaining a concept is fine, with a link to the main page for that concept/API.
3. Should we hide the code examples and have the user reveal them, or leave them as is?

## How I Tested These Changes

## Changelog

> Insert changelog entry or delete this section.

---------

Signed-off-by: nikki everett <[email protected]>
Co-authored-by: Nikki Everett <[email protected]>
Co-authored-by: nikki everett <[email protected]>
Commit 38cb6e5 (1 parent: 5bcd9ad)

Showing 25 changed files with 1,028 additions and 393 deletions.
**docs/docs-beta/docs/etl-pipeline-tutorial/automate-your-pipeline.md** (56 additions, 0 deletions)
---
title: Automate your pipeline
description: Set schedules and utilize asset-based automation
last_update:
  author: Alex Noonan
sidebar_position: 60
---

There are several ways to automate pipelines and assets [in Dagster](/guides/automate).

In this step you will:

- Add automation to assets so they run when upstream assets are materialized.
- Create a schedule to run a set of assets on a cron schedule.

## 1. Automate asset materialization

Ideally, the reporting assets created in the last step should refresh whenever the upstream data is updated. Dagster's [declarative automation](/guides/automate/declarative-automation) framework allows you to do this by adding an automation condition to the asset definition.

Update the `monthly_sales_performance` asset to add the automation condition to the decorator:

<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="155" lineEnd="209"/>
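
The heart of the change is a single decorator argument. Here is a minimal sketch, assuming `AutomationCondition.eager()`, which requests a materialization whenever upstream assets update; the dependency and asset body are illustrative, not the tutorial's actual code:

```python
import dagster as dg


@dg.asset(
    deps=["joined_data"],  # illustrative upstream dependency
    # eager() requests a run whenever an upstream asset materializes
    automation_condition=dg.AutomationCondition.eager(),
)
def monthly_sales_performance() -> None:
    ...  # the tutorial's real computation lives in definitions.py
```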

Do the same thing for `product_performance`:

<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="216" lineEnd="267"/>

## 2. Scheduled jobs

Cron-based schedules are common in data orchestration. For our pipeline, assume that updated CSVs are uploaded to a file location at a specific time every week by an external process.

Copy the following code underneath the `product_performance` asset:

<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="268" lineEnd="273"/>
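
Inline, the schedule boils down to a `ScheduleDefinition` attached to an asset job. A minimal sketch, reusing the `analysis_update_job` name from the UI steps below; the asset selection and cron string are illustrative assumptions:

```python
import dagster as dg

# An asset job selecting the reporting assets; this selection is illustrative.
analysis_update_job = dg.define_asset_job(
    name="analysis_update_job",
    selection=["joined_data", "monthly_sales_performance", "product_performance"],
)

# "0 0 * * 1" runs every Monday at midnight.
weekly_update_schedule = dg.ScheduleDefinition(
    job=analysis_update_job,
    cron_schedule="0 0 * * 1",
)
```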

## 3. Enable and test automations

The final step is to enable the automations in the UI.

To accomplish this:

1. Navigate to the Automation page.
2. Select all automations.
3. Using actions, start all automations.
4. Select the `analysis_update_job`.
5. Test the schedule and evaluate for any time in the dropdown menu.
6. Open in Launchpad.

The job is now executing.

Additionally, if you navigate to the Runs tab, you should see that materializations for `monthly_sales_performance` and `product_performance` have run as well.

![2048 resolution](/images/tutorial/etl-tutorial/automation-final.png)

## Next steps

- Continue this tutorial by adding a [sensor-based asset](create-a-sensor-asset)
**docs/docs-beta/docs/etl-pipeline-tutorial/create-a-sensor-asset.md** (69 additions, 0 deletions)
---
title: Create a sensor asset
description: Use sensors to create event-driven pipelines
last_update:
  author: Alex Noonan
sidebar_position: 70
---

[Sensors](/guides/automate/sensors) allow you to automate workflows based on external events or conditions, making them useful for event-driven automation, especially in situations where jobs occur at irregular cadences or in rapid succession.

Consider using sensors in the following situations:

- **Event-driven workflows**: When your workflow depends on external events, such as the arrival of a new data file or a change in an API response.
- **Conditional execution**: When you want to execute jobs only if certain conditions are met, reducing unnecessary computations.
- **Real-time processing**: When you need to process data as soon as it becomes available, rather than waiting for a scheduled time.

In this step you will:

- Create an asset that runs based on an event-driven workflow
- Create a sensor to listen for conditions to materialize the asset

## 1. Create an event-driven asset

For our pipeline, we want to model a situation where an executive asks for a pivot table report of sales results by department and product, and wants it processed in real time as soon as the request comes in.

For this asset, we need to define the structure of the request it expects in the materialization context.

Other than that, defining this asset is the same as our previous assets. Copy the following code beneath `product_performance`:

<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="275" lineEnd="312"/>
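
To make "the structure of the request" concrete: one way to model it is a run-config schema attached to the asset. A minimal sketch, assuming a `dg.Config` class; the field names, asset name, and column names are illustrative assumptions, not the tutorial's exact code:

```python
import dagster as dg
from dagster_duckdb import DuckDBResource


# Illustrative request schema; the tutorial's real fields live in definitions.py.
class AdhocRequestConfig(dg.Config):
    department: str
    product: str
    start_date: str
    end_date: str


@dg.asset(deps=["joined_data"])  # assumed upstream for illustration
def adhoc_request(
    config: AdhocRequestConfig, duckdb: DuckDBResource
) -> dg.MaterializeResult:
    # Column names below are assumptions about the joined view.
    query = f"""
        select department, product_name, sum(dollar_amount) as total_sales
        from joined_data
        where department = '{config.department}'
          and product_name = '{config.product}'
          and date >= '{config.start_date}' and date < '{config.end_date}'
        group by department, product_name
    """
    with duckdb.get_connection() as conn:
        preview_df = conn.execute(query).fetchdf()
    return dg.MaterializeResult(
        metadata={"preview": dg.MetadataValue.md(preview_df.to_markdown(index=False))}
    )
```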

## 2. Build the sensor

To define a sensor in Dagster, use the `@sensor` decorator. This decorator is applied to a function that evaluates whether the conditions for triggering a job are met.

Sensors include the following elements:

- **Job**: The job that the sensor will trigger when the conditions are met.
- **RunRequest**: An object that specifies the configuration for the job run. It includes a `run_key` to ensure idempotency and a `run_config` for job-specific settings.

<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="314" lineEnd="355"/>
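
For a self-contained feel of the pattern, here is a sketch of a sensor that watches a directory for new request files and yields one `RunRequest` per file. The job definition, directory path, and config shape are illustrative assumptions; the tutorial's real sensor is in definitions.py:

```python
import json
import os

import dagster as dg

# Illustrative asset job targeting the request asset from the previous step.
adhoc_request_job = dg.define_asset_job(
    name="adhoc_request_job", selection=["adhoc_request"]
)


@dg.sensor(job=adhoc_request_job)
def automation_request_sensor(context: dg.SensorEvaluationContext):
    requests_dir = "data/requests"  # assumed drop location for request files
    for filename in sorted(os.listdir(requests_dir)):
        if not filename.endswith(".json"):
            continue
        with open(os.path.join(requests_dir, filename)) as f:
            request_config = json.load(f)
        # The run_key makes the sensor idempotent: each file triggers at most one run.
        yield dg.RunRequest(
            run_key=filename,
            run_config={"ops": {"adhoc_request": {"config": request_config}}},
        )
```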

## 3. Materialize the sensor asset

1. Update your Definitions object to the following:

<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="357" lineEnd="373"/>
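
Roughly, the updated object wires the sensor (and, from the previous step, the schedule) into the project. A sketch using the illustrative names from above, assuming the assets and jobs are defined earlier in the same file:

```python
defs = dg.Definitions(
    assets=[
        products,
        sales_reps,
        sales_data,
        joined_data,
        monthly_sales_performance,
        product_performance,
        adhoc_request,
    ],
    jobs=[analysis_update_job, adhoc_request_job],
    schedules=[weekly_update_schedule],
    sensors=[automation_request_sensor],
    resources={"duckdb": DuckDBResource(database="data/mydb.duckdb")},
)
```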

2. Reload your Definitions.

3. Navigate to the Automation page.

4. Turn on the `automation_request_sensor`.

5. Click on the `automation_request_sensor` details.

   ![2048 resolution](/images/tutorial/etl-tutorial/sensor-evaluation.png)

6. Add `request.json` from the `sample_request` folder to the `requests` folder.

7. Click on the green tick to see the run for this request.

   ![2048 resolution](/images/tutorial/etl-tutorial/sensor-asset-run.png)

## Next steps

Now that we have our complete project, the next step is to refactor it into a more manageable structure so we can add to it as needed.

Finish the tutorial by [refactoring your project](refactor-your-project).
**...cs-beta/docs/etl-pipeline-tutorial/create-and-materialize-a-downstream-asset.md** (43 additions, 0 deletions)
---
title: Create and materialize a downstream asset
description: Reference assets as dependencies of other assets
last_update:
  author: Alex Noonan
sidebar_position: 30
---

Now that we have the raw data loaded into DuckDB, we need to create a [downstream asset](/guides/build/create-asset-pipelines/assets-concepts/asset-dependencies) that combines the upstream assets. In this step, you will:

- Create a downstream asset
- Materialize that asset

## 1. Create a downstream asset

With all of our raw data loaded into DuckDB, our next step is to merge it into a view composed of data from all three source tables.

To accomplish this in SQL, we will bring in our `sales_data` table and then left join `sales_reps` and `products` on their respective ID columns. We will also keep this view concise, including only the columns relevant for analysis.

As you can see, the new `joined_data` asset looks a lot like our previous ones, with a few small changes. We put this asset into a different group. To make this asset dependent on the raw tables, we add their asset keys to the `deps` parameter in the asset definition.

<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="89" lineEnd="132"/>
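
If the referenced file isn't handy, here is a sketch of the same shape: raw tables listed in `deps`, a left-joined view built over the DuckDB connection, and a row count returned as metadata. The column names and group name are illustrative assumptions:

```python
import dagster as dg
from dagster_duckdb import DuckDBResource


@dg.asset(
    deps=["products", "sales_reps", "sales_data"],  # asset keys of the raw tables
    group_name="joins",  # a different group from the ingestion assets
)
def joined_data(duckdb: DuckDBResource) -> dg.MaterializeResult:
    with duckdb.get_connection() as conn:
        conn.execute(
            """
            create or replace view joined_data as
            select
                s.date,
                s.dollar_amount,
                r.rep_name,
                r.department,
                p.product_name,
                p.category
            from sales_data s
            left join sales_reps r on s.rep_id = r.rep_id
            left join products p on s.product_id = p.product_id
            """
        )
        (row_count,) = conn.execute("select count(*) from joined_data").fetchone()
    return dg.MaterializeResult(
        metadata={"row_count": dg.MetadataValue.int(row_count)}
    )
```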

## 2. Materialize the asset

1. Add the `joined_data` asset to the Definitions object:

   ```python
   defs = dg.Definitions(
       assets=[
           products,
           sales_reps,
           sales_data,
           joined_data,
       ],
       resources={"duckdb": DuckDBResource(database="data/mydb.duckdb")},
   )
   ```

2. In the Dagster UI, reload definitions and materialize the `joined_data` asset.

## Next steps

- Continue this tutorial by [ensuring data quality with asset checks](ensure-data-quality-with-asset-checks)
**docs/docs-beta/docs/etl-pipeline-tutorial/create-and-materialize-assets.md** (108 additions, 0 deletions)
---
title: Create and materialize assets
description: Load project data and create and materialize assets
last_update:
  author: Alex Noonan
sidebar_position: 20
---

In the first step of the tutorial, you created your Dagster project with the raw data files. In this step, you will:

- Create your initial Definitions object
- Add a DuckDB resource
- Build software-defined assets
- Materialize your assets

## 1. Create a definitions object

In Dagster, the [Definitions](/todo) object is where you define and organize the various components of your project, such as assets and resources.

Open the `definitions.py` file in the `etl_tutorial` directory and copy the following code into it:

```python
import json
import os

from dagster_duckdb import DuckDBResource

import dagster as dg

defs = dg.Definitions(
    assets=[],
    resources={},
)
```

## 2. Define the DuckDB resource

In Dagster, [resources](/todo) are the external services, tools, and storage backends you need to do your job. For the storage backend in this project, we'll use [DuckDB](https://duckdb.org/), a fast, in-process SQL database that runs inside your application. We'll define it once in the definitions object, making it available to all assets and objects that need it.

```python
defs = dg.Definitions(
    assets=[],
    resources={"duckdb": DuckDBResource(database="data/mydb.duckdb")},
)
```

## 3. Create assets

Software-defined [assets](/todo) are the main building blocks in Dagster. An asset is composed of three components:

1. An asset key, or unique identifier.
2. An op, which is a function that is invoked to produce the asset.
3. Upstream dependencies that the asset depends on.

You can read more about our philosophy behind the [asset-centric approach](https://dagster.io/blog/software-defined-assets). For a quick orientation, a minimal skeleton annotating these three components follows.
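
Here is that skeleton, a sketch with illustrative names rather than the tutorial's actual code:

```python
import dagster as dg


@dg.asset(deps=["upstream_table"])  # 3. upstream dependencies this asset relies on
def my_asset() -> None:  # 1. the asset key defaults to the function name
    ...  # 2. the op: the function body invoked to produce the asset
```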

### Products asset

First, we will create an asset that creates a DuckDB table to hold data from the products CSV. This asset takes the `duckdb` resource defined earlier and returns a `MaterializeResult` object. Additionally, this asset contains metadata in the `@dg.asset` decorator parameters to help categorize the asset, and in the `return` block to give us a preview of the asset in the Dagster UI.

To create this asset, open the `definitions.py` file and copy the following code into it:

<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="8" lineEnd="33"/>
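
If you can't open the referenced file, here is a sketch of the same pattern: load a CSV into DuckDB and return a `MaterializeResult` carrying a row count and a preview. The CSV path, table name, and metadata keys are assumptions for illustration:

```python
import dagster as dg
from dagster_duckdb import DuckDBResource


@dg.asset(
    compute_kind="duckdb",  # decorator metadata used to categorize the asset
    group_name="ingestion",
)
def products(duckdb: DuckDBResource) -> dg.MaterializeResult:
    with duckdb.get_connection() as conn:
        conn.execute(
            "create or replace table products as "
            "select * from read_csv_auto('data/products.csv')"
        )
        preview_df = conn.execute("select * from products limit 5").fetchdf()
        (row_count,) = conn.execute("select count(*) from products").fetchone()

    # Metadata in the return block renders a preview of the table in the Dagster UI.
    return dg.MaterializeResult(
        metadata={
            "row_count": dg.MetadataValue.int(row_count),
            "preview": dg.MetadataValue.md(preview_df.to_markdown(index=False)),
        }
    )
```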

### Sales reps asset

The code for the sales reps asset is similar to the products asset code. In the `definitions.py` file, copy the following code below the products asset code:

<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="35" lineEnd="61"/>

### Sales data asset

To add the sales data asset, copy the following code into your `definitions.py` file below the sales reps asset:

<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="62" lineEnd="87"/>

## 4. Add assets to the definitions object

Now, let's pull these assets into our Definitions object. Adding them to the Definitions object makes them available to the Dagster project. Add them to the empty list in the `assets` parameter:

```python
defs = dg.Definitions(
    assets=[
        products,
        sales_reps,
        sales_data,
    ],
    resources={"duckdb": DuckDBResource(database="data/mydb.duckdb")},
)
```

## 5. Materialize assets

To materialize your assets:

1. In a browser, navigate to the URL of the Dagster server that you started earlier.
2. Navigate to **Deployment**.
3. Click **Reload definitions**.
4. Click **Assets**, then click "View global asset lineage" to see all of your assets.

   ![2048 resolution](/images/tutorial/etl-tutorial/etl-tutorial-first-asset-lineage.png)

5. Click **Materialize all**.
6. Navigate to the **Runs** tab and select the most recent run. Here you can see the logs from the run.

   ![2048 resolution](/images/tutorial/etl-tutorial/first-asset-run.png)

## Next steps

- Continue this tutorial by [creating and materializing a downstream asset](create-and-materialize-a-downstream-asset)
---

**Deploy preview for dagster-docs-beta ready!**

✅ Preview: https://dagster-docs-beta-jws0rsqp7-elementl.vercel.app

Built with commit 38cb6e5. This pull request is being automatically deployed with vercel-action.