Doc 302 new etl tutorial - part 1 #25320

Draft · wants to merge 28 commits into base: master

Commits (28):
c275842
file copy
C00ldudeNoonan Oct 11, 2024
054141c
config file creation
C00ldudeNoonan Oct 14, 2024
89be27a
adding additional pages and project config logic
C00ldudeNoonan Oct 16, 2024
59f5a64
add defintions object
C00ldudeNoonan Oct 16, 2024
bf7b65b
Merge remote-tracking branch 'origin/master' into new-etl-tutorial--D…
C00ldudeNoonan Oct 16, 2024
d6d69cf
added intial assets and did some cleanup
C00ldudeNoonan Oct 16, 2024
19d3236
minor typo fixes
C00ldudeNoonan Oct 18, 2024
9b8bdc2
linting
C00ldudeNoonan Oct 18, 2024
6f078db
more to first asset
C00ldudeNoonan Oct 18, 2024
8b6d1f6
consolidated pages and added partitions page
C00ldudeNoonan Oct 21, 2024
8ef90cf
Merge branch 'master' into DOC-302-new-etl-tutorial
C00ldudeNoonan Nov 13, 2024
2425783
add screenshots and update format and writeup
C00ldudeNoonan Nov 14, 2024
49035dd
update name in sidebar for consistency
C00ldudeNoonan Nov 14, 2024
17aff77
vale formatting errors fix
C00ldudeNoonan Nov 14, 2024
75e60fe
applied notes from Nikki
C00ldudeNoonan Nov 15, 2024
d4ff6d3
whitespace fixes
C00ldudeNoonan Nov 15, 2024
b30f860
Update docs/docs-beta/docs/tutorial/03-creating-a-downstream-asset.md
C00ldudeNoonan Nov 19, 2024
140a122
added partitions, automations, and sensors
C00ldudeNoonan Nov 26, 2024
f29065e
add commentary to page 6 and 7
C00ldudeNoonan Dec 2, 2024
130b418
added final pages and screenshots
C00ldudeNoonan Dec 10, 2024
d34c41b
ruff update
C00ldudeNoonan Dec 10, 2024
62e5fd0
Merge branch 'master' into DOC-302-new-etl-tutorial
C00ldudeNoonan Dec 27, 2024
aae8195
updated code references and sidebar
C00ldudeNoonan Dec 27, 2024
1eb255c
page link fixes
C00ldudeNoonan Dec 27, 2024
4148df7
page links
C00ldudeNoonan Dec 27, 2024
aee2029
update links
C00ldudeNoonan Dec 30, 2024
5db379b
update sidebar links to remove folder
C00ldudeNoonan Dec 30, 2024
6974fcb
update 404 link
C00ldudeNoonan Dec 30, 2024
2 changes: 1 addition & 1 deletion docs/docs-beta/docs/getting-started/quickstart.md
@@ -153,5 +153,5 @@ id,name,age,city,age_group

Congratulations! You've just built and run your first pipeline with Dagster. Next, you can:

- - Continue with the [ETL pipeline tutorial](/tutorial/tutorial-etl) to learn how to build a more complex ETL pipeline
+ - Continue with the [ETL pipeline tutorial](/tutorial/etl-tutorial/etl-tutorial-introduction) to learn how to build a more complex ETL pipeline
- Learn how to [Think in assets](/guides/build/assets-concepts/index.md)
136 changes: 136 additions & 0 deletions docs/docs-beta/docs/tutorial/01-etl-tutorial-introduction.md
@@ -0,0 +1,136 @@
---
title: Build an ETL Pipeline
description: Learn how to build an ETL pipeline with Dagster
last_update:
author: Alex Noonan
---

# Build your first ETL pipeline

In this tutorial, you'll build an ETL pipeline with Dagster that:

- Imports sales data to DuckDB
- Transforms data into reports
- Runs scheduled reports automatically
- Generates one-time reports on demand

## You will learn to:

- Set up a Dagster project with the recommended project structure
- Create assets with metadata
- Connect Dagster to external systems with resources
- Build dependencies between assets
- Run a pipeline by materializing assets
- Add schedules, sensors, and partitions to your assets
- Refactor your project as it becomes more complex

## Prerequisites

<details>
<summary>Prerequisites</summary>

To follow the steps in this guide, you'll need:

- Basic Python knowledge
- Python 3.9+ installed on your system. Refer to the [Installation guide](/getting-started/installation) for information.
- Familiarity with SQL and Python data manipulation libraries, such as Pandas.
- Understanding of data pipelines and the extract, transform, and load process.
</details>


## Step 1: Set up your Dagster environment

First, set up a new Dagster project.

1. Open your terminal and create a new directory for your project:

```bash
mkdir dagster-etl-tutorial
cd dagster-etl-tutorial
```

2. Create and activate a virtual environment:

<Tabs>
<TabItem value="macos" label="macOS">
```bash
python -m venv dagster_tutorial
source dagster_tutorial/bin/activate
```
</TabItem>
<TabItem value="windows" label="Windows">
```bash
python -m venv dagster_tutorial
dagster_tutorial\Scripts\activate
```
</TabItem>
</Tabs>

3. Install Dagster and the required dependencies:

```bash
pip install dagster dagster-webserver pandas dagster-duckdb
```

## Step 2: Create the Dagster project structure

Run the following command to create the project directories and files for this tutorial:

```bash
dagster project from-example --example getting_started_etl_tutorial
```

Your project should have this structure:
{/* vale off */}
```
dagster-etl-tutorial/
├── data/
│   ├── products.csv
│   ├── sales_data.csv
│   ├── sales_reps.csv
│   └── sample_request/
│       └── request.json
├── etl_tutorial/
│   └── definitions.py
├── pyproject.toml
├── setup.cfg
└── setup.py
```
{/* vale on */}

:::info
Dagster has several example projects you can install depending on your use case. To see the full list, run `dagster project list-examples`. For more information on the `dagster project` command, see the [API documentation](https://docs-preview.dagster.io/api/cli#dagster-project).
:::

### Dagster project structure

#### `dagster-etl-tutorial` root directory


In the `dagster-etl-tutorial` root directory, there are three configuration files that are common in Python package management. These files manage dependencies and identify the Dagster modules in the project.
| File | Purpose |
|------|---------|
| `pyproject.toml` | Specifies build system requirements and package metadata; part of the standard Python packaging ecosystem. |
| `setup.cfg` | Configures the Python package, including metadata, dependencies, and other options. |
| `setup.py` | The script used to build and distribute the package. |

#### `etl_tutorial` directory


This is the main directory where you will define your assets, jobs, schedules, sensors, and resources.
| File | Purpose |
|------|---------|
| `definitions.py` | Defines the assets, jobs, schedules, sensors, and resources in your project, and lets Dagster load these definitions as a module. |

#### `data` directory

The data directory contains the raw data files for the project. We will reference these files in our software-defined assets in the next step of the tutorial.

## Step 3: Launch the Dagster webserver

To make sure Dagster and its dependencies were installed correctly, navigate to the project root directory and start the Dagster webserver:

```bash
dagster dev
```

## Next steps

- Continue this tutorial by [creating and materializing assets](/tutorial/create-and-materialize-assets)
107 changes: 107 additions & 0 deletions docs/docs-beta/docs/tutorial/02-create-and-materialize-assets.md
@@ -0,0 +1,107 @@
---
title: Create and materialize assets
description: Load project data and create and materialize assets
last_update:
author: Alex Noonan
---


In the first step of the tutorial, you created your Dagster project with the raw data files. In this step, you will:

- Create your initial Definitions object
- Add a DuckDB resource
- Build software-defined assets
- Materialize your assets

## 1. Create a Definitions object

In Dagster, the [Definitions](/todo) object is where you define and organize the components of your project, such as assets and resources.

Open the `definitions.py` file in the `etl_tutorial` directory and copy the following code into it:

```python
import json
import os

from dagster_duckdb import DuckDBResource

import dagster as dg

defs = dg.Definitions(
assets=[],
resources={},
)
```

## 2. Define the DuckDB resource

In Dagster, [resources](/todo) are the external services, tools, and storage backends your pipelines need. For the storage backend in this project, we'll use [DuckDB](https://duckdb.org/), a fast, in-process SQL database that runs inside your application. We'll define the resource once in the Definitions object, making it available to all assets and objects that need it.

```python
defs = dg.Definitions(
assets=[],
resources={"duckdb": DuckDBResource(database="data/mydb.duckdb")},
)
```

## 3. Create assets

Software-defined [assets](/todo) are the main building blocks in Dagster. An asset is composed of three components:

1. An asset key, or unique identifier.
2. An op, which is a function that is invoked to produce the asset.
3. Upstream dependencies, or the assets that this asset depends on.

You can read more about our philosophy behind the [asset centric approach](https://dagster.io/blog/software-defined-assets).

### Products asset

First, we will define an asset that creates a DuckDB table to hold data from the products CSV. This asset takes the `duckdb` resource defined earlier and returns a `MaterializeResult` object.
Additionally, this asset contains metadata in the `@dg.asset` decorator parameters to help categorize the asset, and in the `return` block to give us a preview of the asset in the Dagster UI.

To create this asset, open the `definitions.py` file and copy the following code into it:

<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="8" lineEnd="33"/>

### Sales reps asset

The code for the sales reps asset is similar to the product asset code. In the `definitions.py` file, add the following code below the product asset code:

<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="35" lineEnd="61"/>

### Sales data asset

To add the sales data asset, copy the following code into your `definitions.py` file below the sales reps asset:

<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="62" lineEnd="87"/>

## 4. Add assets to the Definitions object

Next, pull these assets into the Definitions object. Adding them there makes them available to the Dagster project. Add them to the empty list in the `assets` parameter:

```python
defs = dg.Definitions(
    assets=[products, sales_reps, sales_data],
    resources={"duckdb": DuckDBResource(database="data/mydb.duckdb")},
)
```

## 5. Materialize assets

To materialize your assets:

1. In a browser, navigate to the URL of the Dagster server that you started earlier.
2. Navigate to **Deployment**.
3. Click **Reload definitions**.
4. Click **Assets**, then click **View global asset lineage** to see all of your assets.

![Global asset lineage in the Dagster UI](/images/tutorial/etl-tutorial/etl-tutorial-first-asset-lineage.png)

5. Click **Materialize all**.
6. Navigate to the **Runs** tab and select the most recent run to see the logs from the run.

![Run logs in the Dagster UI](/images/tutorial/etl-tutorial/first-asset-run.png)


## Next steps

- Continue this tutorial by [creating and materializing a downstream asset](/tutorial/create-and-materialize-a-downstream-asset)
@@ -0,0 +1,44 @@
---
title: Create and materialize a downstream asset
description: Reference Assets as dependencies to other assets
last_update:
author: Alex Noonan
---

Now that we have the raw data loaded into DuckDB, we need to create a [downstream asset](/guides/build/assets-concepts/asset-dependencies) that combines the upstream assets. In this step, you will:

- Create a downstream asset
- Materialize that asset

## 1. Create a downstream asset

With all of the raw data loaded into DuckDB, our next step is to merge it into a view composed of data from all three source tables.

To accomplish this in SQL, we will bring in the `sales_data` table and then left join `sales_reps` and `products` on their respective ID columns. Additionally, we will keep this view concise, including only the columns relevant for analysis.
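The join logic can be sketched with the standard library's `sqlite3` as a stand-in for DuckDB. Table and column names here are illustrative, not the tutorial's exact schema:

```python
import sqlite3

# In-memory stand-in for the DuckDB tables; schema is illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table products (product_id integer, product_name text);
    create table sales_reps (rep_id integer, rep_name text);
    create table sales_data (order_id integer, product_id integer, rep_id integer, amount real);

    insert into products values (1, 'widget');
    insert into sales_reps values (10, 'Alex');
    insert into sales_data values (100, 1, 10, 25.0), (101, 2, 10, 40.0);
    """
)

# Start from sales_data and left join the dimension tables on their ID
# columns, keeping only the columns relevant for analysis. Rows with no
# matching dimension (product_id 2 here) keep NULLs from the left join.
rows = conn.execute(
    """
    select s.order_id, p.product_name, r.rep_name, s.amount
    from sales_data s
    left join products p on s.product_id = p.product_id
    left join sales_reps r on s.rep_id = r.rep_id
    order by s.order_id
    """
).fetchall()
print(rows)
```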

This asset looks a lot like our previous ones, with a few small changes: we put it into a different group, and to make it dependent on the raw tables, we add their asset keys to the `deps` parameter in the asset definition.

<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="89" lineEnd="132"/>

## 2. Materialize the asset

1. Add the asset we just made to the Definitions object.

Your Definitions object should now look like this:

```python
defs = dg.Definitions(
    assets=[products, sales_reps, sales_data, joined_data],
    resources={"duckdb": DuckDBResource(database="data/mydb.duckdb")},
)
```

2. In the Dagster UI, reload definitions and materialize the `joined_data` asset.

## Next steps

- Continue this tutorial by [ensuring data quality with asset checks](/tutorial/ensuring-data-quality-with-asset-checks)
@@ -0,0 +1,51 @@
---
title: Ensuring data quality with asset checks
description: Ensure assets are correct with asset checks
last_update:
author: Alex Noonan
---

Data quality is critical in data pipelines. Much like in a factory producing cars, inspecting parts after certain steps ensures that defects are caught before the car is completely assembled.


In Dagster, you define [asset checks](/guides/test/asset-checks) in a similar way to how you define an asset. In this step, you will:

- Define an asset check
- Execute that asset check in the UI

## 1. Define the asset check

In this case, we want to create a check that identifies whether any rows in the `joined_data` table are missing a product or sales rep.


Paste the following code beneath the `joined_data` asset.

<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="134" lineEnd="149"/>

## 2. Run the asset check

Before the asset check can be run, it needs to be added to the Definitions object. Asset checks are added to their own list, like assets.


Your definitions object should look like this now:

```python
defs = dg.Definitions(
    assets=[products, sales_reps, sales_data, joined_data],
    asset_checks=[missing_dimension_check],
    resources={"duckdb": DuckDBResource(database="data/mydb.duckdb")},
)
```

Asset checks run when an asset is materialized, but you can also execute them manually in the UI:

1. Reload your definitions.
2. Navigate to the Asset details page for the `joined_data` asset.
3. Select the **Checks** tab.
4. Click **Execute** for `missing_dimension_check`.

![Asset check execution in the Dagster UI](/images/tutorial/etl-tutorial/asset-check.png)

## Next steps

- Continue this tutorial by [creating and materializing partitioned assets](/tutorial/create-and-materialize-partitioned-asset)