Doc 302 new etl tutorial - part 1 #25320

Draft
wants to merge 29 commits into
base: master
Changes from 8 commits
c275842
file copy
C00ldudeNoonan Oct 11, 2024
054141c
config file creation
C00ldudeNoonan Oct 14, 2024
89be27a
adding additional pages and project config logic
C00ldudeNoonan Oct 16, 2024
59f5a64
add defintions object
C00ldudeNoonan Oct 16, 2024
bf7b65b
Merge remote-tracking branch 'origin/master' into new-etl-tutorial--D…
C00ldudeNoonan Oct 16, 2024
d6d69cf
added intial assets and did some cleanup
C00ldudeNoonan Oct 16, 2024
19d3236
minor typo fixes
C00ldudeNoonan Oct 18, 2024
9b8bdc2
linting
C00ldudeNoonan Oct 18, 2024
6f078db
more to first asset
C00ldudeNoonan Oct 18, 2024
8b6d1f6
consolidated pages and added partitions page
C00ldudeNoonan Oct 21, 2024
8ef90cf
Merge branch 'master' into DOC-302-new-etl-tutorial
C00ldudeNoonan Nov 13, 2024
2425783
add screenshots and update format and writeup
C00ldudeNoonan Nov 14, 2024
49035dd
update name in sidebar for consistency
C00ldudeNoonan Nov 14, 2024
17aff77
vale formatting errors fix
C00ldudeNoonan Nov 14, 2024
75e60fe
applied notes from Nikki
C00ldudeNoonan Nov 15, 2024
d4ff6d3
whitespace fixes
C00ldudeNoonan Nov 15, 2024
b30f860
Update docs/docs-beta/docs/tutorial/03-creating-a-downstream-asset.md
C00ldudeNoonan Nov 19, 2024
140a122
added partitions, automations, and sensors
C00ldudeNoonan Nov 26, 2024
f29065e
add commentary to page 6 and 7
C00ldudeNoonan Dec 2, 2024
130b418
added final pages and screenshots
C00ldudeNoonan Dec 10, 2024
d34c41b
ruff update
C00ldudeNoonan Dec 10, 2024
62e5fd0
Merge branch 'master' into DOC-302-new-etl-tutorial
C00ldudeNoonan Dec 27, 2024
aae8195
updated code references and sidebar
C00ldudeNoonan Dec 27, 2024
1eb255c
page link fixes
C00ldudeNoonan Dec 27, 2024
4148df7
page links
C00ldudeNoonan Dec 27, 2024
aee2029
update links
C00ldudeNoonan Dec 30, 2024
5db379b
update sidebar links to remove folder
C00ldudeNoonan Dec 30, 2024
6974fcb
update 404 link
C00ldudeNoonan Dec 30, 2024
1cb9423
Merge remote-tracking branch 'origin/master' into new-etl-tutorial--D…
C00ldudeNoonan Dec 30, 2024
87 changes: 87 additions & 0 deletions docs/docs-beta/docs/tutorial/01-etl-tutorial-introduction.md
@@ -0,0 +1,87 @@
---
title: Build an ETL Pipeline
description: Learn how to build an ETL pipeline with Dagster
last_update:
  date: 2024-08-10
  author: Pedram Navid
---

# Build your first ETL pipeline

Welcome to this hands-on tutorial, where you'll learn how to build an ETL pipeline with Dagster while exploring its key concepts.
If you haven't already, complete the [Quick Start](/getting-started/quickstart) tutorial to get familiar with Dagster.

## What you'll learn

- Setting up a Dagster project with the recommended project structure
- Creating Assets and using Resources to connect to external systems
- Adding metadata to your assets
- Building dependencies between assets
- Running a pipeline by materializing assets
- Adding schedules, sensors, and partitions to your assets

## Step 1: Set up your Dagster environment

First, set up a new Dagster project.

1. Open your terminal and create a new directory for your project:

```bash title="Create a new directory"
mkdir dagster-etl-tutorial
cd dagster-etl-tutorial
```

2. Create a virtual environment and activate it:

```bash title="Create a virtual environment"
python -m venv venv
source venv/bin/activate
# On Windows, use `venv\Scripts\activate`
```

3. Install Dagster and the required dependencies:

```bash title="Install Dagster and dependencies"
pip install dagster dagster-webserver pandas
```
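
Before moving on, it can help to confirm that the interpreter in use belongs to the virtual environment and that the packages from step 3 are importable. The following is a minimal sketch using only the standard library; the helper names `in_virtualenv` and `check_packages` are illustrative and not part of Dagster.

```python
import importlib.util
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the venv, while sys.base_prefix
    # points at the interpreter the venv was created from.
    return sys.prefix != sys.base_prefix


def check_packages(names: list[str]) -> dict[str, bool]:
    """Map each top-level module name to whether it can be found on the import path."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    # Import names of the packages installed above
    # (the dagster-webserver distribution imports as dagster_webserver).
    for name, found in check_packages(["dagster", "dagster_webserver", "pandas"]).items():
        print(f"{name}: {'ok' if found else 'missing'}")
```

If any package is reported as missing, re-run the `pip install` command with the virtual environment activated.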

## Step 2: Copy the data files

Next, download the raw data for the project.

1. Create a new folder for the raw data:

```bash title="Create the data directory"
mkdir data
cd data
```

2. Copy the raw CSV files:

```bash title="Copy the CSV files"
curl -L -o products.csv https://raw.githubusercontent.com/dagster-io/dagster/refs/heads/master/examples/docs_beta_snippets/docs_beta_snippets/guides/tutorials/etl_tutorial/data/products.csv

curl -L -o sales_reps.csv https://raw.githubusercontent.com/dagster-io/dagster/refs/heads/master/examples/docs_beta_snippets/docs_beta_snippets/guides/tutorials/etl_tutorial/data/sales_reps.csv

curl -L -o sales_data.csv https://raw.githubusercontent.com/dagster-io/dagster/refs/heads/master/examples/docs_beta_snippets/docs_beta_snippets/guides/tutorials/etl_tutorial/data/sales_data.csv

```

3. Copy the sample request JSON file:

```bash title="Create the sample request"
mkdir sample_request
cd sample_request
curl -L -o request.json https://raw.githubusercontent.com/dagster-io/dagster/refs/heads/master/examples/docs_beta_snippets/docs_beta_snippets/guides/tutorials/etl_tutorial/data/sample_request/request.json

# Navigate back to the project root directory
cd ../..
```
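
After the downloads finish, a quick check from the project root can catch a failed or empty download early. This is a minimal sketch, assuming the directory layout produced by the commands above; the helper name `missing_data_files` is hypothetical.

```python
from pathlib import Path

# Relative paths the download commands above are expected to produce.
EXPECTED_FILES = [
    "data/products.csv",
    "data/sales_reps.csv",
    "data/sales_data.csv",
    "data/sample_request/request.json",
]


def missing_data_files(project_root: str = ".") -> list[str]:
    """Return the expected files that are absent or empty under project_root."""
    root = Path(project_root)
    missing = []
    for relative in EXPECTED_FILES:
        path = root / relative
        # Treat a zero-byte file as missing, since a failed download can leave one behind.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(relative)
    return missing


if __name__ == "__main__":
    missing = missing_data_files()
    print("all data files present" if not missing else f"missing: {missing}")
```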


## What you've learned

- Set up a Python virtual environment and installed Dagster
- Copied the raw data for the project

## Next steps

- Continue this tutorial with [setting up your Dagster project](/tutorial/dagster-project-setup)
103 changes: 103 additions & 0 deletions docs/docs-beta/docs/tutorial/02-dagster-project-setup.md
@@ -0,0 +1,103 @@
---
title: Dagster Project Setup
description: Learn how to set up a Dagster project from scratch
last_update:
  date: 2024-10-16
  author: Alex Noonan
---

# Dagster Project Setup

## What you'll learn

- Setting up a Dagster project with the recommended project structure


## Step 1: Create Dagster Project Files

Dagster needs several project files to run. These files are common in Python package management and help manage project configuration and dependencies.

The setup.cfg file is an INI-style configuration file that contains option defaults for setup.py commands.

1. Create Config file

```bash title="Create Config file"
echo -e "[metadata]\nname = dagster_etl_tutorial" > setup.cfg
```

2. Create Setup Python File

The setup.py file is a build script for configuring Python packages. In a Dagster project, you use setup.py to define any Python packages your project depends on, including Dagster itself.

```bash title="Create Setup file"
echo > setup.py
```


Open `setup.py` and add the following code:

```python title="setup.py"
from setuptools import find_packages, setup

setup(
    name="dagster_etl_tutorial",
    packages=find_packages(exclude=["dagster_etl_tutorial_tests"]),
    install_requires=[
        "dagster",
        "dagster-cloud",
        "dagster-duckdb",  # provides dagster_duckdb.DuckDBResource, imported in definitions.py
        "duckdb",
    ],
    extras_require={"dev": ["dagster-webserver", "pytest"]},
)
```

3. Create TOML file

The pyproject.toml file is a configuration file that specifies package core metadata in a static, tool-agnostic way.


```bash title="Create Pyproject file"
echo > pyproject.toml
```

Open `pyproject.toml` and add the following:

```toml
[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"

[tool.dagster]
module_name = "dagster_etl_tutorial.definitions"
code_location_name = "dagster_tutorial"
```
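
Because setup.cfg is INI-style, Python's standard-library `configparser` can read it back. The sketch below parses the same two-line content that the `echo` command in step 1 writes. It only illustrates the file format and is not something Dagster requires you to run; the helper name `read_package_name` is hypothetical.

```python
import configparser

# The same content the echo command in step 1 writes to setup.cfg.
SETUP_CFG_TEXT = "[metadata]\nname = dagster_etl_tutorial\n"


def read_package_name(cfg_text: str) -> str:
    """Return the `name` option from the [metadata] section of a setup.cfg string."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return parser["metadata"]["name"]


if __name__ == "__main__":
    print(read_package_name(SETUP_CFG_TEXT))
```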

## Step 2: Create the Dagster Python module and definitions file

1. Create ETL tutorial directory

```bash title="Create the tutorial directory"
# Use underscores: a Python package name cannot contain hyphens
mkdir dagster_etl_tutorial
cd dagster_etl_tutorial
```

2. Create Dagster Definitions File

In this guide, we use a simplified project structure to focus on core Dagster concepts. To accomplish this, all of our code will live in one definitions file.


```bash title="Create definitions.py file"
echo > definitions.py
```

## What you've learned

- Created the setup.cfg, setup.py, and pyproject.toml project files
- Created the Python module directory and the definitions.py file

## Next steps

- Continue this tutorial with your [first asset](/tutorial/your-first-asset)
86 changes: 86 additions & 0 deletions docs/docs-beta/docs/tutorial/03-your-first-asset.md
@@ -0,0 +1,86 @@
---
title: Your First Asset
description: Get the project data and create your first Asset
last_update:
  date: 2024-10-16
  author: Alex Noonan
---

# Your First Software Defined Asset

Now that we have the raw data files and the Dagster project set up, let's create some assets that load those CSV files into DuckDB.

## What you'll learn

- Creating our initial Definitions object
- Adding a DuckDB resource
- Building some basic software-defined assets

## Building the Definitions object

The Definitions object [need docs reference] in Dagster serves as the central configuration point for defining and organizing the various components within a Dagster project. It acts as a container that holds all the necessary configuration for a code location, keeping everything organized and easily accessible.

1. Create the Definitions object and DuckDB resource

Open the definitions.py file and add the following import statements and Definitions object.

```python
import json
import os

from dagster_duckdb import DuckDBResource

import dagster as dg

defs = dg.Definitions(
    assets=[],
    resources={"duckdb": DuckDBResource(database="data/mydb.duckdb")},
)
```
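
The `database` value is a relative path, so where the file ends up depends on the directory the process is started from (assuming, as is usual for file-based databases, that relative paths resolve against the current working directory). The standard-library sketch below, with the hypothetical helper `resolve_database_path`, shows the absolute location and makes sure the parent `data` directory exists.

```python
from pathlib import Path


def resolve_database_path(database: str = "data/mydb.duckdb") -> Path:
    """Resolve a relative database path against the current working directory
    and create its parent directory if it does not exist yet."""
    path = Path(database).resolve()
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


if __name__ == "__main__":
    # Run this from the directory where you plan to start Dagster.
    print(resolve_database_path())
```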

## Loading raw data

1. Products Asset

We need to create an asset that creates a DuckDB table from the products CSV. We should also add metadata to help categorize this asset and give us a preview of what it looks like in the Dagster UI.

<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="8" lineEnd="33"/>

You'll notice that this asset declares its compute kind as metadata and is part of the ingestion group. Additionally, at the end we add the row count and a preview of what the table looks like.

2. Sales Reps Asset

This code is very similar to the products asset, but this time it's focused on sales reps.

<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="35" lineEnd="61"/>

3. Sales Data Asset

The same pattern applies to the sales data.

<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="62" lineEnd="87"/>

4. Bringing our assets into the Definitions object

Now, to pull these assets into our Definitions object, add them to the empty list in the `assets` parameter.

```python
defs = dg.Definitions(
    assets=[
        products,
        sales_reps,
        sales_data,
    ],
    resources={"duckdb": DuckDBResource(database="data/mydb.duckdb")},
)
```
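
All three ingestion assets follow the same shape: load a CSV into a table, then report the row count and a small preview as metadata. The sketch below illustrates only that load-and-summarize logic, using the standard library's `csv` and `sqlite3` modules as a stand-in for DuckDB so that it runs without extra dependencies. It is not the tutorial's asset code; the function name `load_csv_and_summarize` and the sample columns are hypothetical.

```python
import csv
import io
import sqlite3


def load_csv_and_summarize(csv_text: str, table: str, preview_rows: int = 2) -> dict:
    """Load CSV text into an in-memory table and return the row count plus a preview."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]

    conn = sqlite3.connect(":memory:")  # stand-in for a DuckDB connection
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', data)

    # The same two facts the assets attach as metadata: row count and a preview.
    row_count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    preview = conn.execute(f'SELECT * FROM "{table}" LIMIT ?', (preview_rows,)).fetchall()
    conn.close()
    return {"row_count": row_count, "preview": preview}


if __name__ == "__main__":
    sample = "product_id,name\n1,widget\n2,gadget\n3,gizmo\n"  # made-up sample data
    print(load_csv_and_summarize(sample, "products"))
```

In the actual assets, the equivalent values are returned to Dagster as asset metadata so they appear in the UI.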

## What you've learned

- Created a Dagster Definitions object
- Built our ingestion assets

## Next steps

- Continue this tutorial with your [Asset Dependencies]
62 changes: 0 additions & 62 deletions docs/docs-beta/docs/tutorial/tutorial-etl.md

This file was deleted.

6 changes: 5 additions & 1 deletion docs/docs-beta/sidebars.ts
@@ -11,7 +11,11 @@ const sidebars: SidebarsConfig = {
       type: 'category',
       label: 'Tutorial',
       collapsed: false,
-      items: ['tutorial/tutorial-etl'],
+      items: [
+        'tutorial/01-etl-tutorial-introduction',
+        'tutorial/02-dagster-project-setup',
+        'tutorial/03-your-first-asset',
+      ],
     },
     {
       type: 'category',
2 changes: 1 addition & 1 deletion docs/docs-beta/src/theme/MDXComponents.tsx
@@ -1,6 +1,6 @@
 // Import the original mapper
 import MDXComponents from '@theme-original/MDXComponents';
-import { PyObject } from '../components/PyObject';
+import {PyObject} from '../components/PyObject';
 import CodeExample from '../components/CodeExample';
 import Tabs from '@theme/Tabs';
 import TabItem from '@theme/TabItem';