Doc 302 new etl tutorial - part 1 #25320

Merged · 45 commits · Jan 7, 2025
c275842
file copy
C00ldudeNoonan Oct 11, 2024
054141c
config file creation
C00ldudeNoonan Oct 14, 2024
89be27a
adding additional pages and project config logic
C00ldudeNoonan Oct 16, 2024
59f5a64
add defintions object
C00ldudeNoonan Oct 16, 2024
bf7b65b
Merge remote-tracking branch 'origin/master' into new-etl-tutorial--D…
C00ldudeNoonan Oct 16, 2024
d6d69cf
added intial assets and did some cleanup
C00ldudeNoonan Oct 16, 2024
19d3236
minor typo fixes
C00ldudeNoonan Oct 18, 2024
9b8bdc2
linting
C00ldudeNoonan Oct 18, 2024
6f078db
more to first asset
C00ldudeNoonan Oct 18, 2024
8b6d1f6
consolidated pages and added partitions page
C00ldudeNoonan Oct 21, 2024
8ef90cf
Merge branch 'master' into DOC-302-new-etl-tutorial
C00ldudeNoonan Nov 13, 2024
2425783
add screenshots and update format and writeup
C00ldudeNoonan Nov 14, 2024
49035dd
update name in sidebar for consistency
C00ldudeNoonan Nov 14, 2024
17aff77
vale formatting errors fix
C00ldudeNoonan Nov 14, 2024
75e60fe
applied notes from Nikki
C00ldudeNoonan Nov 15, 2024
d4ff6d3
whitespace fixes
C00ldudeNoonan Nov 15, 2024
b30f860
Update docs/docs-beta/docs/tutorial/03-creating-a-downstream-asset.md
C00ldudeNoonan Nov 19, 2024
140a122
added partitions, automations, and sensors
C00ldudeNoonan Nov 26, 2024
f29065e
add commentary to page 6 and 7
C00ldudeNoonan Dec 2, 2024
130b418
added final pages and screenshots
C00ldudeNoonan Dec 10, 2024
d34c41b
ruff update
C00ldudeNoonan Dec 10, 2024
62e5fd0
Merge branch 'master' into DOC-302-new-etl-tutorial
C00ldudeNoonan Dec 27, 2024
aae8195
updated code references and sidebar
C00ldudeNoonan Dec 27, 2024
1eb255c
page link fixes
C00ldudeNoonan Dec 27, 2024
4148df7
page links
C00ldudeNoonan Dec 27, 2024
aee2029
update links
C00ldudeNoonan Dec 30, 2024
5db379b
update sidebar links to remove folder
C00ldudeNoonan Dec 30, 2024
6974fcb
update 404 link
C00ldudeNoonan Dec 30, 2024
1cb9423
Merge remote-tracking branch 'origin/master' into new-etl-tutorial--D…
C00ldudeNoonan Dec 30, 2024
8c9ea96
Merge remote-tracking branch 'origin/master' into DOC-302-new-etl-tut…
C00ldudeNoonan Jan 2, 2025
c08c901
update tutorial link
C00ldudeNoonan Jan 2, 2025
f1cdc8e
merge master and fix conflict
neverett Jan 3, 2025
3aaf92d
remove empty tutorial pages, move multi-asset integration guide
neverett Jan 3, 2025
bee1dc0
reorganize etl pipeline tutorial
neverett Jan 3, 2025
288893c
update sidebar, fix quickstart links, update index page
neverett Jan 3, 2025
c3b695d
fix links
neverett Jan 3, 2025
9d22054
Merge branch 'master' into DOC-302-new-etl-tutorial
neverett Jan 5, 2025
8094471
fix links
neverett Jan 5, 2025
e64f7ee
fix another link
neverett Jan 5, 2025
2aa8225
change file name and title for consistency
neverett Jan 5, 2025
919a4bb
apply nikki's feedback
C00ldudeNoonan Jan 6, 2025
62ff5c1
typo fixes
C00ldudeNoonan Jan 6, 2025
9f94197
Merge branch 'master' into new-etl-tutorial--DOC-302-
C00ldudeNoonan Jan 6, 2025
1765e03
update code references
C00ldudeNoonan Jan 7, 2025
3dd37ae
Update tense of header
C00ldudeNoonan Jan 7, 2025
@@ -13,7 +13,7 @@ This guide will cover three options for adding a new code location:
<details>
<summary>Prerequisites</summary>

- 1. An existing Dagster project. Refer to the [recommended project structure](/tutorial/create-new-project) and [code requirements](/dagster-plus/deployment/code-requirements) pages for more information.
+ 1. An existing Dagster project. Refer to the [recommended project structure](/guides/build/project-structure) and [code requirements](/dagster-plus/deployment/code-requirements) pages for more information.

2. Editor, Admin, or Organization Admin permissions in Dagster+.

@@ -0,0 +1,56 @@
---
title: Automate your pipeline
description: Set schedules and use asset-based automation
last_update:
author: Alex Noonan
sidebar_position: 60
---

There are several ways to automate pipelines and assets [in Dagster](/guides/automate).

In this step you will:

- Add automation to assets to run when upstream assets are materialized.
- Create a schedule to run a set of assets on a cron schedule.

## 1. Automate asset materialization

Ideally, the reporting assets created in the last step should refresh whenever the upstream data is updated. Dagster's [declarative automation](/guides/automate/declarative-automation) framework allows you to do this by adding an automation condition to the asset definition.

Update the `monthly_sales_performance` asset to add the automation condition to the decorator:

<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="155" lineEnd="209"/>

Do the same thing for `product_performance`:

<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="216" lineEnd="267"/>

## 2. Scheduled jobs

Cron-based schedules are common in data orchestration. For our pipeline, assume that updated CSVs are uploaded to a file location at a specific time every week by an external process.

Copy the following code underneath the `product_performance` asset:

<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="268" lineEnd="273"/>

## 3. Enable and test automations

The final step is to enable the automations in the UI.

To accomplish this:
1. Navigate to the Automation page.
2. Select all automations.

3. Using actions, start all automations.

4. Select the `analysis_update_job`.
5. Test the schedule by evaluating it for any time in the dropdown menu.

6. Open in Launchpad.

The job is now executing.

Additionally, if you navigate to the Runs tab, you should see that materializations for `monthly_sales_performance` and `product_performance` have run as well.

![2048 resolution](/images/tutorial/etl-tutorial/automation-final.png)

## Next steps

- Continue this tutorial by [creating a sensor asset](create-a-sensor-asset).
Original file line number Diff line number Diff line change
@@ -0,0 +1,69 @@
---
title: Create a sensor asset
description: Use sensors to create event-driven pipelines
last_update:
author: Alex Noonan
sidebar_position: 70
---

[Sensors](/guides/automate/sensors) allow you to automate workflows based on external events or conditions, making them useful for event-driven automation, especially in situations where jobs occur at irregular cadences or in rapid succession.

Consider using sensors in the following situations:
- **Event-driven workflows**: When your workflow depends on external events, such as the arrival of a new data file or a change in an API response.
- **Conditional execution**: When you want to execute jobs only if certain conditions are met, reducing unnecessary computations.
- **Real-time processing**: When you need to process data as soon as it becomes available, rather than waiting for a scheduled time.

In this step you will:

- Create an asset that runs based on an event-driven workflow
- Create a sensor to listen for conditions to materialize the asset

## 1. Create an event-driven asset

For our pipeline, we want to model a situation where an executive requests a pivot table report of sales results by department and product, and wants it generated in real time from their request.

For this asset, we need to define the structure of the request that the asset expects in the materialization context.

Other than that, defining this asset is the same as our previous assets. Copy the following code beneath `product_performance`.

<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="275" lineEnd="312"/>

## 2. Build the sensor

To define a sensor in Dagster, use the `@sensor` decorator. This decorator is applied to a function that evaluates whether the conditions for triggering a job are met.

Sensors include the following elements:

- **Job**: The job that the sensor will trigger when the conditions are met.
- **RunRequest**: An object that specifies the configuration for the job run. It includes a `run_key` to ensure idempotency and a `run_config` for job-specific settings.

<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="314" lineEnd="355"/>

## 3. Materialize the sensor asset

1. Update your Definitions object to the following:

<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="357" lineEnd="373"/>

2. Reload your Definitions.

3. Navigate to the Automation page.

4. Turn on the `automation_request_sensor`.

5. Click on the `automation_request_sensor` details.

![2048 resolution](/images/tutorial/etl-tutorial/sensor-evaluation.png)

6. Add `request.json` from the `sample_request` folder to the `requests` folder.

7. Click on the green tick to see the run for this request.

![2048 resolution](/images/tutorial/etl-tutorial/sensor-asset-run.png)


## Next steps

Now that we have our complete project, the next step is to refactor it into a more manageable structure so we can add to it as needed.

Finish the tutorial by [refactoring your project](refactor-your-project).
@@ -0,0 +1,43 @@
---
title: Create and materialize a downstream asset
description: Reference Assets as dependencies to other assets
last_update:
author: Alex Noonan
sidebar_position: 30
---

Now that we have the raw data loaded into DuckDB, we need to create a [downstream asset](/guides/build/create-asset-pipelines/assets-concepts/asset-dependencies) that combines the upstream assets. In this step, you will:

- Create a downstream asset
- Materialize that asset

## 1. Create a downstream asset

Now that we have all of our raw data loaded into DuckDB, our next step is to merge it together in a view composed of data from all three source tables.

To accomplish this in SQL, we will bring in our `sales_data` table and then left join `sales_reps` and `products` on their respective ID columns. Additionally, we will keep this view concise by including only the columns relevant for analysis.

As you can see, the new `joined_data` asset looks a lot like our previous ones, with a few small changes. We put this asset into a different group. To make this asset dependent on the raw tables, we add the asset keys to the `deps` parameter in the asset definition.

<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="89" lineEnd="132"/>

## 2. Materialize the asset

1. Add the `joined_data` asset to the `Definitions` object:

```python
defs = dg.Definitions(
    assets=[
        products,
        sales_reps,
        sales_data,
        joined_data,
    ],
    resources={"duckdb": DuckDBResource(database="data/mydb.duckdb")},
)
```

2. In the Dagster UI, reload definitions and materialize the `joined_data` asset.

## Next steps

- Continue this tutorial by [ensuring data quality with asset checks](ensure-data-quality-with-asset-checks).
@@ -0,0 +1,108 @@
---
title: Create and materialize assets
description: Load project data and create and materialize assets
last_update:
author: Alex Noonan
sidebar_position: 20
---


In the first step of the tutorial, you created your Dagster project with the raw data files. In this step, you will:
- Create your initial Definitions object
- Add a DuckDB resource
- Build software-defined assets
- Materialize your assets

## 1. Create a definitions object

In Dagster, the [`Definitions`](/todo) object is where you define and organize the components of your project, such as assets and resources.

Open the `definitions.py` file in the `etl_tutorial` directory and copy the following code into it:

```python
import json
import os

from dagster_duckdb import DuckDBResource

import dagster as dg

defs = dg.Definitions(
    assets=[],
    resources={},
)
```

## 2. Define the DuckDB resource

In Dagster, [resources](/todo) are the external services, tools, and storage backends you need to do your job. For the storage backend in this project, we'll use [DuckDB](https://duckdb.org/), a fast, in-process SQL database that runs inside your application. We'll define it once in the `Definitions` object, making it available to all assets and objects that need it.

```python
defs = dg.Definitions(
    assets=[],
    resources={"duckdb": DuckDBResource(database="data/mydb.duckdb")},
)
```

## 3. Create assets

Software-defined [assets](/todo) are the main building blocks in Dagster. An asset is composed of three components:
1. An asset key, the asset's unique identifier.
2. An op, a function that is invoked to produce the asset.
3. Upstream dependencies that the asset depends on.

You can read more about our philosophy behind the [asset-centric approach](https://dagster.io/blog/software-defined-assets).

### Products asset

First, we will create an asset that creates a DuckDB table to hold data from the products CSV. This asset takes the `duckdb` resource defined earlier and returns a `MaterializeResult` object.
Additionally, this asset contains metadata in the `@dg.asset` decorator parameters to help categorize the asset, and in the `return` block to give us a preview of the asset in the Dagster UI.

To create this asset, open the `definitions.py` file and copy the following code into it:

<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="8" lineEnd="33"/>

### Sales reps asset

The code for the sales reps asset is similar to the product asset code. In the `definitions.py` file, copy the following code below the product asset code:

<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="35" lineEnd="61"/>

### Sales data asset

To add the sales data asset, copy the following code into your `definitions.py` file below the sales reps asset:

<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="62" lineEnd="87"/>

## 4. Add assets to the definitions object

Now, pull these assets into the `Definitions` object to make them available to the Dagster project. Add them to the empty list in the `assets` parameter.

```python
defs = dg.Definitions(
    assets=[
        products,
        sales_reps,
        sales_data,
    ],
    resources={"duckdb": DuckDBResource(database="data/mydb.duckdb")},
)
```

## 5. Materialize assets

To materialize your assets:
1. In a browser, navigate to the URL of the Dagster server that you started earlier.
2. Navigate to **Deployment**.
3. Click **Reload definitions**.
4. Click **Assets**, then click **View global asset lineage** to see all of your assets.

![2048 resolution](/images/tutorial/etl-tutorial/etl-tutorial-first-asset-lineage.png)

5. Click **Materialize all**.
6. Navigate to the **Runs** tab and select the most recent run to see the logs from the run.
![2048 resolution](/images/tutorial/etl-tutorial/first-asset-run.png)


## Next steps

- Continue this tutorial by [creating and materializing a downstream asset](create-and-materialize-a-downstream-asset).