Doc 302 new etl tutorial - part 1 #25320

Draft
wants to merge 28 commits into
base: master
Changes from 16 commits
c275842
file copy
C00ldudeNoonan Oct 11, 2024
054141c
config file creation
C00ldudeNoonan Oct 14, 2024
89be27a
adding additional pages and project config logic
C00ldudeNoonan Oct 16, 2024
59f5a64
add defintions object
C00ldudeNoonan Oct 16, 2024
bf7b65b
Merge remote-tracking branch 'origin/master' into new-etl-tutorial--D…
C00ldudeNoonan Oct 16, 2024
d6d69cf
added intial assets and did some cleanup
C00ldudeNoonan Oct 16, 2024
19d3236
minor typo fixes
C00ldudeNoonan Oct 18, 2024
9b8bdc2
linting
C00ldudeNoonan Oct 18, 2024
6f078db
more to first asset
C00ldudeNoonan Oct 18, 2024
8b6d1f6
consolidated pages and added partitions page
C00ldudeNoonan Oct 21, 2024
8ef90cf
Merge branch 'master' into DOC-302-new-etl-tutorial
C00ldudeNoonan Nov 13, 2024
2425783
add screenshots and update format and writeup
C00ldudeNoonan Nov 14, 2024
49035dd
update name in sidebar for consistency
C00ldudeNoonan Nov 14, 2024
17aff77
vale formatting errors fix
C00ldudeNoonan Nov 14, 2024
75e60fe
applied notes from Nikki
C00ldudeNoonan Nov 15, 2024
d4ff6d3
whitespace fixes
C00ldudeNoonan Nov 15, 2024
b30f860
Update docs/docs-beta/docs/tutorial/03-creating-a-downstream-asset.md
C00ldudeNoonan Nov 19, 2024
140a122
added partitions, automations, and sensors
C00ldudeNoonan Nov 26, 2024
f29065e
add commentary to page 6 and 7
C00ldudeNoonan Dec 2, 2024
130b418
added final pages and screenshots
C00ldudeNoonan Dec 10, 2024
d34c41b
ruff update
C00ldudeNoonan Dec 10, 2024
62e5fd0
Merge branch 'master' into DOC-302-new-etl-tutorial
C00ldudeNoonan Dec 27, 2024
aae8195
updated code references and sidebar
C00ldudeNoonan Dec 27, 2024
1eb255c
page link fixes
C00ldudeNoonan Dec 27, 2024
4148df7
page links
C00ldudeNoonan Dec 27, 2024
aee2029
update links
C00ldudeNoonan Dec 30, 2024
5db379b
update sidebar links to remove folder
C00ldudeNoonan Dec 30, 2024
6974fcb
update 404 link
C00ldudeNoonan Dec 30, 2024
144 changes: 144 additions & 0 deletions docs/docs-beta/docs/tutorial/01-etl-tutorial-introduction.md
@@ -0,0 +1,144 @@
---
title: Build an ETL Pipeline
description: Learn how to build an ETL pipeline with Dagster
last_update:
  author: Alex Noonan
---

# Build your first ETL pipeline

In this tutorial, you'll build an ETL pipeline with Dagster that:

1. Imports sales data to DuckDB
2. Transforms data into reports
3. Runs scheduled reports automatically
4. Generates one-time reports on demand

## What you'll learn

- Setting up a Dagster project with the recommended project structure
- Creating Assets with metadata
- Using Resources to connect to external systems
- Building dependencies between assets
- Running a pipeline by materializing assets
- Adding schedules, sensors, and partitions to your assets
- Refactoring the project into the recommended structure

[Add image for what the completed global asset graph looks like]

## Prerequisites

<details>
<summary>Prerequisites</summary>

To follow the steps in this guide, you'll need:

- Basic Python knowledge
- Python 3.9+ installed on your system. Refer to the [Installation guide](/getting-started/installation) for information.
- Familiarity with SQL or Python data manipulation libraries (Pandas or Polars).
- Understanding of data pipelines and the extract, transform, and load process.
</details>


## Step 1: Set up your Dagster environment

First, set up a new Dagster project.

1. Open your terminal and create a new directory for your project:

```bash
mkdir dagster-etl-tutorial
cd dagster-etl-tutorial
```

2. Create and activate a virtual environment:

<Tabs>
<TabItem value="macos" label="macOS">
```bash
python -m venv dagster_tutorial
source dagster_tutorial/bin/activate
```
</TabItem>
<TabItem value="windows" label="Windows">
```bash
python -m venv dagster_tutorial
dagster_tutorial\Scripts\activate
```
</TabItem>
</Tabs>

3. Install Dagster and the required dependencies:

```bash
pip install dagster dagster-webserver pandas dagster-duckdb
```

## Step 2: Create the Dagster project structure

Next, you'll create the project directories and files for this tutorial with the `dagster project from-example` command:

```bash
dagster project from-example --example getting_started_etl_tutorial
```

Your project should have this structure:
{/* vale off */}
```
dagster-etl-tutorial/
├── data/
│   ├── products.csv
│   ├── sales_data.csv
│   ├── sales_reps.csv
│   └── sample_request/
│       └── request.json
├── etl_tutorial/
│   └── definitions.py
├── pyproject.toml
├── setup.cfg
└── setup.py
```
{/* vale on */}
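
If you want to confirm that the scaffolded project matches the structure above, a small check like the following can help. This is a minimal sketch, not part of the tutorial project: the helper name `missing_project_files` and the `EXPECTED_PATHS` list are assumptions made for illustration, with the paths copied from the tree above.

```python
from pathlib import Path

# Relative paths copied from the project tree shown above.
EXPECTED_PATHS = [
    "data/products.csv",
    "data/sales_data.csv",
    "data/sales_reps.csv",
    "data/sample_request/request.json",
    "etl_tutorial/definitions.py",
    "pyproject.toml",
    "setup.cfg",
    "setup.py",
]


def missing_project_files(project_root):
    """Return the expected paths that are not files under project_root."""
    root = Path(project_root)
    return [rel for rel in EXPECTED_PATHS if not (root / rel).is_file()]


if __name__ == "__main__":
    # Run from the project root; "." is the current working directory.
    missing = missing_project_files(".")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("Project layout matches the expected structure.")
```

Run the script from the project root. An empty result means every file in the tree was found.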

:::info
Dagster has several example projects you can install depending on your use case. To see the full list, run `dagster project list-examples`. For more information on the `dagster project` command, see the [API documentation](https://docs-preview.dagster.io/api/cli#dagster-project).
:::


## Dagster Project Structure

The root directory contains three configuration files that are common in Python package management. These files manage dependencies and identify the Dagster modules in the project. The `etl_tutorial` folder holds the Dagster definitions for this code location. The `data` directory stores the raw data for the project; we will reference these files in our software-defined assets.

### File/Directory Descriptions

#### dagster-etl-tutorial directory


In the `dagster-etl-tutorial` root directory, there are three configuration files that are common in Python package management. These files manage dependencies and identify the Dagster modules in the project.
| File | Purpose |
|------|---------|
| pyproject.toml | This file is used to specify build system requirements and package metadata for Python projects. It is part of the Python packaging ecosystem. |
| setup.cfg | This file is used for configuration of your Python package. It can include metadata about the package, dependencies, and other configuration options. |
| setup.py | This script is used to build and distribute your Python package. It is a standard file in Python projects for specifying package details. |

#### etl_tutorial directory

This is the main directory where you will define your assets, jobs, schedules, sensors, and resources.
| File | Purpose |
|------|---------|
| definitions.py | This file is typically used to define jobs, schedules, and sensors. It organizes the various components of your Dagster project. This allows Dagster to load the definitions in a module. |

#### data directory

The data directory contains the raw data files for the project. We will reference these files in our software-defined assets in the next step of the tutorial.

## Launch Dagster

Start the Dagster webserver from your project's root directory. If you are not in the project root directory, navigate there now.
Contributor

I would rephrase this a bit to tighten things up and provide a motivation for this section with something like:

"To make sure Dagster and its dependencies were installed correctly, navigate to the project root directory and start the Dagster webserver:"

followed by a bash code snippet for dagster dev (so it can be copy/pasted).


```bash
cd getting_started_etl_tutorial
```

Run the `dagster dev` command. Dagster should open up in your browser.

## Next steps

- Continue this tutorial with [create and materialize assets](/tutorial/02-create-and-materialize-assets)
@@ -0,0 +1,95 @@
---
title: Create and materialize assets
description: Load project data and create and materialize assets
last_update:
  author: Alex Noonan
---

# Create and materialize assets

In the first step of the tutorial, you created your Dagster project with the raw data files. In this step, you will:
- Create your initial Definitions object
- Add a DuckDB resource
- Build software-defined assets
- Materialize your assets

## 1. Create a Definitions object

In Dagster, the [Definitions](/api/definitions) object is where you define and organize various components within your project, such as assets and resources.

## 2. Define the DuckDB resource

In Dagster, [Resources](/api/resources) are the external services, tools, and storage that your assets need to do their job. In this project, we'll use [DuckDB](https://duckdb.org/), a fast, in-process SQL database that runs inside your application, for storage. We'll define it once in the Definitions object, making it available to all assets and objects that need it.

Open the `definitions.py` file in the `etl_tutorial` directory and copy the following code into it:

```python
import json
import os

from dagster_duckdb import DuckDBResource

import dagster as dg

defs = dg.Definitions(
    assets=[],
    resources={},
)
```

## 3. Create Assets

Software-defined [assets](/api/assets) are the main building blocks in Dagster. An asset is composed of an asset key, which is how it's identified; an op, which is a function that is invoked to produce the asset; and the upstream dependencies that the asset depends on. You can read more about our philosophy behind the [asset-centric approach](https://dagster.io/blog/software-defined-assets).

### Products asset

First, we will create an asset that creates a DuckDB table to hold data from the products CSV. This asset takes the `duckdb` resource defined earlier and returns a `MaterializeResult` object.
Additionally, this asset contains metadata in the `@dg.asset` decorator parameters to help categorize the asset, and in the `return` block to give us a preview of the asset in the Dagster UI.

To create this asset, open the `definitions.py` file and copy the following code into it:

<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="8" lineEnd="33"/>
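
To make the idea of preview metadata more concrete, here is a minimal standalone sketch that computes a row count and a Markdown preview from CSV text using only the Python standard library. It is an illustration rather than the tutorial's asset code: the helper name `build_preview_metadata`, the sample columns, and the sample rows are all assumptions. In the asset itself, values like these are what you would attach to the returned `MaterializeResult` as metadata.

```python
import csv
import io


def build_preview_metadata(csv_text, limit=10):
    """Return a row count and a Markdown preview table for CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)  # first line holds the column names
    rows = list(reader)

    # Build a Markdown table: header row, separator row, then up to `limit` rows.
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in rows[:limit]:
        lines.append("| " + " | ".join(row) + " |")

    return {"row_count": len(rows), "preview": "\n".join(lines)}


# Hypothetical sample data; the real products.csv columns may differ.
sample_csv = "product_id,product_name\n100,Widget\n101,Gadget\n"
metadata = build_preview_metadata(sample_csv, limit=1)
print(metadata["row_count"])
print(metadata["preview"])
```

The row count reflects the whole file, while the preview is capped at `limit` rows so the metadata stays small enough to read comfortably in the UI.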

### Sales Reps Asset

The code for the sales reps asset is similar to the product asset code. In the `definitions.py` file, add the following code below the product asset code:

<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="35" lineEnd="61"/>

### Sales Data Asset

To add the sales data asset, copy the following code into your `definitions.py` file below the sales reps asset:

<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="62" lineEnd="87"/>

## 4. Add assets to the Definitions object

Now, to pull these assets into our Definitions object, add them to the empty list in the `assets` parameter, and register the DuckDB resource under the `duckdb` key in the `resources` parameter:

```python
defs = dg.Definitions(
    assets=[
        products,
        sales_reps,
        sales_data,
    ],
    resources={"duckdb": DuckDBResource(database="data/mydb.duckdb")},
)
```

## 5. Materialize Assets

To materialize your assets:
1. In a browser, navigate to the URL of the Dagster server that you started earlier.
2. Navigate to **Deployment**.
3. Reload the definitions.
4. Click **Assets**, then click **View global asset lineage** to see all of your assets.

![2048 resolution](/images/tutorial/etl-tutorial/etl-tutorial-first-asset-lineage.png)

Click **Materialize all**. Then navigate to the **Runs** tab and select the most recent run to see its logs.

![2048 resolution](/images/tutorial/etl-tutorial/first-asset-run.png)


## Next steps

- Continue this tutorial with [Asset Dependencies and Checks](/tutorial/03-creating-a-downstream-asset)
46 changes: 46 additions & 0 deletions docs/docs-beta/docs/tutorial/03-creating-a-downstream-asset.md
@@ -0,0 +1,46 @@
---
title: Creating a Downstream Asset
description: Reference Assets as dependencies to other assets
last_update:
  author: Alex Noonan
---

# Asset Dependencies

Now that we have the raw data loaded into DuckDB, we need to create a [downstream asset](guides/asset-dependencies.md) that combines the staging assets together. In this step, you will:

- Create a downstream asset
- Materialize that asset

## Creating a Downstream asset

Now that all of our raw data is loaded and staged in DuckDB, our next step is to merge it together.

The data is structured as a fact table (sales data) with two dimensions (sales reps and products). To accomplish this in SQL, we will bring in our `sales_data` table and then left join `sales_reps` and `products` on their respective ID columns. Additionally, we will keep this view concise and include only the columns relevant for analysis.
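
To see the shape of that join in isolation, here is a minimal sketch that runs the same kind of query against an in-memory database. It uses Python's built-in `sqlite3` module as a stand-in for DuckDB so that it runs without extra dependencies, and every column name and sample row is hypothetical; the real CSV files may use different columns.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for DuckDB.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical schemas; the tutorial's CSV columns may differ.
    CREATE TABLE sales_data (
        order_id INTEGER, rep_id INTEGER, product_id INTEGER, dollar_amount REAL
    );
    CREATE TABLE sales_reps (rep_id INTEGER, rep_name TEXT);
    CREATE TABLE products (product_id INTEGER, product_name TEXT);

    INSERT INTO sales_data VALUES
        (1, 10, 100, 250.0), (2, 11, 101, 75.5), (3, 99, 100, 40.0);
    INSERT INTO sales_reps VALUES (10, 'Ada'), (11, 'Grace');
    INSERT INTO products VALUES (100, 'Widget'), (101, 'Gadget');
    """
)

# The fact table drives the query; each dimension is left joined on its ID.
joined_rows = conn.execute(
    """
    SELECT s.order_id, r.rep_name, p.product_name, s.dollar_amount
    FROM sales_data AS s
    LEFT JOIN sales_reps AS r ON s.rep_id = r.rep_id
    LEFT JOIN products AS p ON s.product_id = p.product_id
    ORDER BY s.order_id
    """
).fetchall()

for row in joined_rows:
    print(row)
```

Because these are left joins, every row of `sales_data` is kept. Order 3 references a rep ID with no match in `sales_reps`, so its `rep_name` comes back as `None` instead of the row being dropped.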

<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="89" lineEnd="132"/>

As you can see, this asset looks a lot like our previous ones, with a few small changes. We put this asset into a different group. To make this asset dependent on the raw tables, we add their asset keys to the `deps` parameter in the asset definition.
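
Conceptually, declaring dependencies tells the orchestrator which assets must exist before another one can be built. The sketch below illustrates that idea with the standard library's `graphlib` module (Python 3.9+). It is only a conceptual illustration, not how Dagster is implemented, and the `asset_deps` mapping is written by hand for this example.

```python
from graphlib import TopologicalSorter

# Each key is an asset; each value is the set of upstream assets it depends on.
asset_deps = {
    "products": set(),
    "sales_reps": set(),
    "sales_data": set(),
    "joined_data": {"products", "sales_reps", "sales_data"},
}

# A topological order lists every asset after all of its upstream assets.
materialization_order = list(TopologicalSorter(asset_deps).static_order())
print(materialization_order)
```

Whatever order the three raw assets appear in, `joined_data` always comes last, which mirrors why it can only be materialized after the raw tables exist.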

## Materialize the Asset

We need to add the asset we just made to the Definitions object.

Your Definitions object should now look like this:

```python
defs = dg.Definitions(
    assets=[
        products,
        sales_reps,
        sales_data,
        joined_data,
    ],
    resources={"duckdb": DuckDBResource(database="data/mydb.duckdb")},
)
```

Go back into the UI, reload definitions, and materialize the `joined_data` asset.

## Next steps

- Continue this tutorial with [Asset Checks](/tutorial/04-ensuring-data-quality-with-asset-checks)
10 changes: 10 additions & 0 deletions docs/docs-beta/docs/tutorial/05-partitions.md
@@ -0,0 +1,10 @@
---
title: Partitions
description: Partitioning Assets by datetime and categories
last_update:
  date: 2024-10-16
  author: Alex Noonan
---



62 changes: 0 additions & 62 deletions docs/docs-beta/docs/tutorial/tutorial-etl.md

This file was deleted.

6 changes: 5 additions & 1 deletion docs/docs-beta/sidebars.ts
@@ -11,7 +11,11 @@ const sidebars: SidebarsConfig = {
type: 'category',
label: 'Tutorial',
collapsed: false,
-items: ['tutorial/tutorial-etl'],
+items: [
+'tutorial/etl-tutorial-introduction',
+'tutorial/create-and-materialize-assets',
+'tutorial/creating-a-downstream-asset',
+],
},
{
type: 'category',
2 changes: 1 addition & 1 deletion docs/docs-beta/src/theme/MDXComponents.tsx
@@ -1,6 +1,6 @@
// Import the original mapper
import MDXComponents from '@theme-original/MDXComponents';
-import { PyObject } from '../components/PyObject';
+import {PyObject} from '../components/PyObject';
import CodeExample from '../components/CodeExample';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';