Doc 302 new etl tutorial - part 1 #25320

Merged
merged 45 commits into from
Jan 7, 2025
Changes from 8 commits
45 commits
c275842
file copy
C00ldudeNoonan Oct 11, 2024
054141c
config file creation
C00ldudeNoonan Oct 14, 2024
89be27a
adding additional pages and project config logic
C00ldudeNoonan Oct 16, 2024
59f5a64
add defintions object
C00ldudeNoonan Oct 16, 2024
bf7b65b
Merge remote-tracking branch 'origin/master' into new-etl-tutorial--D…
C00ldudeNoonan Oct 16, 2024
d6d69cf
added intial assets and did some cleanup
C00ldudeNoonan Oct 16, 2024
19d3236
minor typo fixes
C00ldudeNoonan Oct 18, 2024
9b8bdc2
linting
C00ldudeNoonan Oct 18, 2024
6f078db
more to first asset
C00ldudeNoonan Oct 18, 2024
8b6d1f6
consolidated pages and added partitions page
C00ldudeNoonan Oct 21, 2024
8ef90cf
Merge branch 'master' into DOC-302-new-etl-tutorial
C00ldudeNoonan Nov 13, 2024
2425783
add screenshots and update format and writeup
C00ldudeNoonan Nov 14, 2024
49035dd
update name in sidebar for consistency
C00ldudeNoonan Nov 14, 2024
17aff77
vale formatting errors fix
C00ldudeNoonan Nov 14, 2024
75e60fe
applied notes from Nikki
C00ldudeNoonan Nov 15, 2024
d4ff6d3
whitespace fixes
C00ldudeNoonan Nov 15, 2024
b30f860
Update docs/docs-beta/docs/tutorial/03-creating-a-downstream-asset.md
C00ldudeNoonan Nov 19, 2024
140a122
added partitions, automations, and sensors
C00ldudeNoonan Nov 26, 2024
f29065e
add commentary to page 6 and 7
C00ldudeNoonan Dec 2, 2024
130b418
added final pages and screenshots
C00ldudeNoonan Dec 10, 2024
d34c41b
ruff update
C00ldudeNoonan Dec 10, 2024
62e5fd0
Merge branch 'master' into DOC-302-new-etl-tutorial
C00ldudeNoonan Dec 27, 2024
aae8195
updated code references and sidebar
C00ldudeNoonan Dec 27, 2024
1eb255c
page link fixes
C00ldudeNoonan Dec 27, 2024
4148df7
page links
C00ldudeNoonan Dec 27, 2024
aee2029
update links
C00ldudeNoonan Dec 30, 2024
5db379b
update sidebar links to remove folder
C00ldudeNoonan Dec 30, 2024
6974fcb
update 404 link
C00ldudeNoonan Dec 30, 2024
1cb9423
Merge remote-tracking branch 'origin/master' into new-etl-tutorial--D…
C00ldudeNoonan Dec 30, 2024
8c9ea96
Merge remote-tracking branch 'origin/master' into DOC-302-new-etl-tut…
C00ldudeNoonan Jan 2, 2025
c08c901
update tutorial link
C00ldudeNoonan Jan 2, 2025
f1cdc8e
merge master and fix conflict
neverett Jan 3, 2025
3aaf92d
remove empty tutorial pages, move multi-asset integration guide
neverett Jan 3, 2025
bee1dc0
reorganize etl pipeline tutorial
neverett Jan 3, 2025
288893c
update sidebar, fix quickstart links, update index page
neverett Jan 3, 2025
c3b695d
fix links
neverett Jan 3, 2025
9d22054
Merge branch 'master' into DOC-302-new-etl-tutorial
neverett Jan 5, 2025
8094471
fix links
neverett Jan 5, 2025
e64f7ee
fix another link
neverett Jan 5, 2025
2aa8225
change file name and title for consistency
neverett Jan 5, 2025
919a4bb
apply nikki's feedback
C00ldudeNoonan Jan 6, 2025
62ff5c1
typo fixes
C00ldudeNoonan Jan 6, 2025
9f94197
Merge branch 'master' into new-etl-tutorial--DOC-302-
C00ldudeNoonan Jan 6, 2025
1765e03
update code references
C00ldudeNoonan Jan 7, 2025
3dd37ae
Update tense of header
C00ldudeNoonan Jan 7, 2025
87 changes: 87 additions & 0 deletions docs/docs-beta/docs/tutorial/01-etl-tutorial-introduction.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,87 @@
---
title: Build an ETL Pipeline
description: Learn how to build an ETL pipeline with Dagster
last_update:
  date: 2024-08-10
  author: Pedram Navid
---

# Build your first ETL pipeline

In this hands-on tutorial, you'll build an ETL pipeline with Dagster and explore its key concepts along the way.
If you haven't already, complete the [Quick Start](/getting-started/quickstart) tutorial to get familiar with Dagster.

## What you'll learn

- Setting up a Dagster project with the recommended project structure
- Creating Assets and using Resources to connect to external systems
- Adding metadata to your assets
- Building dependencies between assets
- Running a pipeline by materializing assets
- Adding schedules, sensors, and partitions to your assets

## Step 1: Set up your Dagster environment

First, set up a new Dagster project.

1. Open your terminal and create a new directory for your project:

```bash title="Create a new directory"
mkdir dagster-etl-tutorial
cd dagster-etl-tutorial
```

2. Create a virtual environment and activate it:

```bash title="Create a virtual environment"
python -m venv venv
source venv/bin/activate
# On Windows, use `venv\Scripts\activate`
```

3. Install Dagster and the required dependencies:

```bash title="Install Dagster and dependencies"
pip install dagster dagster-webserver dagster-duckdb pandas
```
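
Before moving on, it can help to confirm that the packages are importable from the active virtual environment. The following is a minimal sketch using only the standard library. `missing_packages` is a hypothetical helper name, and `dagster_webserver` is assumed to be the import name of the `dagster-webserver` package.

```python
import importlib.util


def missing_packages(module_names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Import names assumed to correspond to the packages installed in this step.
    missing = missing_packages(["dagster", "dagster_webserver", "pandas"])
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All packages are importable.")
```

If anything is reported as missing, check that the virtual environment is activated before rerunning the install command.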

## Step 2: Copy the data files

Next, download the raw data for the project.

1. Create a new folder for the raw data:

```bash title="Create the data directory"
mkdir data
cd data
```

2. Download the raw CSV files:

```bash title="Copy the CSV files"
curl -L -o products.csv https://raw.githubusercontent.com/dagster-io/dagster/refs/heads/master/examples/docs_beta_snippets/docs_beta_snippets/guides/tutorials/etl_tutorial/data/products.csv

curl -L -o sales_reps.csv https://raw.githubusercontent.com/dagster-io/dagster/refs/heads/master/examples/docs_beta_snippets/docs_beta_snippets/guides/tutorials/etl_tutorial/data/sales_reps.csv

curl -L -o sales_data.csv https://raw.githubusercontent.com/dagster-io/dagster/refs/heads/master/examples/docs_beta_snippets/docs_beta_snippets/guides/tutorials/etl_tutorial/data/sales_data.csv

```

3. Download the sample request JSON file:

```bash title="Create the sample request"
mkdir sample_request
cd sample_request
curl -L -o request.json https://raw.githubusercontent.com/dagster-io/dagster/refs/heads/master/examples/docs_beta_snippets/docs_beta_snippets/guides/tutorials/etl_tutorial/data/sample_request/request.json

# Navigate back to the project root directory
cd ../..
```
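
To confirm the downloads landed where the later steps expect them, a short check such as the following can be run from the project root. It is a sketch, not part of the tutorial code: `find_missing_files` and `EXPECTED_FILES` are hypothetical names, and the paths simply mirror the commands above.

```python
from pathlib import Path

# Paths are relative to the project root and mirror the download commands above.
EXPECTED_FILES = [
    "data/products.csv",
    "data/sales_reps.csv",
    "data/sales_data.csv",
    "data/sample_request/request.json",
]


def find_missing_files(project_root):
    """Return the expected data files that do not exist under project_root."""
    root = Path(project_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = find_missing_files(".")
    print("Missing files:", missing if missing else "none")
```

An empty result means all four files are in place.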


## What you've learned

- Set up a Python virtual environment and installed Dagster
- Copied the raw data for the project

## Next steps

- Continue this tutorial with [setting up your Dagster project](/tutorial/dagster-project-setup)
103 changes: 103 additions & 0 deletions docs/docs-beta/docs/tutorial/02-dagster-project-setup.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,103 @@
---
title: Dagster Project Setup
description: Learn how to set up a Dagster project from scratch
last_update:
  date: 2024-10-16
  author: Alex Noonan
---

# Dagster Project Setup

## What you'll learn

- Setting up a Dagster project with the recommended project structure


## Step 1: Create Dagster project files

Dagster needs several project files to run. These files are common in Python package management and help manage project configuration and dependencies.

The setup.cfg file is an INI-style configuration file that contains option defaults for setup.py commands.

1. Create the config file:

```bash title="Create Config file"
echo -e "[metadata]\nname = dagster_etl_tutorial" > setup.cfg
```
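
Because setup.cfg uses INI syntax, its contents can be read back with Python's standard `configparser` module. The sketch below parses the same text that the `echo` command writes. `read_package_name` is a hypothetical helper introduced only for illustration.

```python
import configparser


def read_package_name(cfg_text):
    """Parse INI-style text and return the name option from the [metadata] section."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return parser["metadata"]["name"]


if __name__ == "__main__":
    # Same content as the setup.cfg created above.
    sample = "[metadata]\nname = dagster_etl_tutorial\n"
    print(read_package_name(sample))  # prints: dagster_etl_tutorial
```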

2. Create the setup.py file:

The setup.py file is a build script for configuring Python packages. In a Dagster project, you use setup.py to define any Python packages your project depends on, including Dagster itself.

```bash title="Create Setup file"
echo > setup.py
```


Open setup.py and add the following code:


```python title="setup.py"
from setuptools import find_packages, setup

setup(
    name="dagster_etl_tutorial",
    packages=find_packages(exclude=["dagster_etl_tutorial_tests"]),
    install_requires=[
        "dagster",
        "dagster-cloud",
        "duckdb",
    ],
    extras_require={"dev": ["dagster-webserver", "pytest"]},
)
```

3. Create the pyproject.toml file:

The pyproject.toml file is a configuration file that specifies core package metadata in a static, tool-agnostic way.


```bash title="Create Pyproject file"
echo > pyproject.toml
```

Open pyproject.toml and add the following:

```toml
[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"

[tool.dagster]
module_name = "dagster_etl_tutorial.definitions"
code_location_name = "dagster_etl_tutorial"
```

## Step 2: Create the Dagster Python module and definitions file

1. Create the ETL tutorial directory:

```bash title="Create the tutorial directory"
mkdir dagster_etl_tutorial
cd dagster_etl_tutorial
```

2. Create the Dagster definitions file:

This guide uses a simplified project structure to focus on core Dagster concepts, so all of the code lives in a single definitions file.


```bash title="Create definitions.py file"
echo > definitions.py
```

## What you've learned

- Created the project files that Dagster needs: setup.cfg, setup.py, and pyproject.toml
- Created the Python module directory and an empty definitions.py file

## Next steps

- Continue this tutorial with your [first asset](/tutorial/your-first-asset)
86 changes: 86 additions & 0 deletions docs/docs-beta/docs/tutorial/03-your-first-asset.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,86 @@
---
title: Your First Asset

description: Get the project data and create your first Asset
last_update:
  date: 2024-10-16
  author: Alex Noonan
---

# Your First Software-Defined Asset

Now that we have the raw data files and the Dagster project set up, let's create some assets that load those CSV files into DuckDB.

## What you'll learn

- Creating our initial definitions object

- Adding a DuckDB resource
- Building some basic software-defined assets

## Building the definitions object

The definitions object [need docs reference] in Dagster serves as the central configuration point for defining and organizing the various components within a Dagster project. It acts as a container that holds all the necessary configuration for a code location, ensuring that everything is organized and easily accessible.

1. Create the definitions object and DuckDB resource

Open the definitions.py file and add the following import statements and definitions object.

```python
import json
import os

from dagster_duckdb import DuckDBResource

import dagster as dg

defs = dg.Definitions(
    assets=[],
    resources={"duckdb": DuckDBResource(database="data/mydb.duckdb")},
)
```
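
One detail worth noting in the resource configuration above is that `data/mydb.duckdb` is a relative path. The sketch below assumes that a relative database path is resolved against the working directory of the process that launches Dagster, which is the usual behavior for file-based databases. `resolve_database_path` is a hypothetical helper used only to illustrate that assumption; it is not part of Dagster or DuckDB.

```python
from pathlib import PurePosixPath


def resolve_database_path(database, working_dir):
    """Return the location a database path would point to from a given working directory."""
    path = PurePosixPath(database)
    if path.is_absolute():
        # Absolute paths do not depend on the working directory.
        return path
    return PurePosixPath(working_dir) / path


if __name__ == "__main__":
    # The working directory here is an illustrative value, not a required location.
    print(resolve_database_path("data/mydb.duckdb", "/home/me/dagster-etl-tutorial"))
    # prints: /home/me/dagster-etl-tutorial/data/mydb.duckdb
```

Under that assumption, launching Dagster from the project root keeps the database file next to the downloaded data.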

## Loading raw data

1. Products asset

We need to create an asset that creates a DuckDB table for the products CSV. Additionally, we should add metadata to help categorize this asset and give us a preview of what it looks like in the Dagster UI.

<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="8" lineEnd="33"/>

Notice that the asset has metadata for its compute kind and is part of the ingestion group. At the end, it also adds the row count and a preview of the table as metadata.
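
The row count and table preview mentioned above can be illustrated without Dagster or DuckDB. The following is a standard-library stand-in, not the tutorial's asset code: `summarize_csv` is a hypothetical helper that computes the same two pieces of information directly from a CSV file.

```python
import csv


def summarize_csv(path, preview_rows=5):
    """Return the row count and the first few rows of a CSV file with a header row."""
    with open(path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    return {"row_count": len(rows), "preview": rows[:preview_rows]}
```

For example, calling `summarize_csv("data/products.csv")` from the project root would return a dictionary with a `row_count` integer and a `preview` list of row dictionaries. In the real asset, values like these are attached as metadata so they appear in the Dagster UI.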

2. Sales Reps Asset

This code is very similar to the products asset, but this time it's focused on sales reps.

<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="35" lineEnd="61"/>

3. Sales Data Asset

The sales data asset follows the same pattern.

<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="62" lineEnd="87"/>

4. Bringing our assets into the Definitions object

To include these assets in the definitions object, add them to the empty list passed to the assets parameter.

```python
defs = dg.Definitions(
    assets=[
        products,
        sales_reps,
        sales_data,
    ],
    resources={"duckdb": DuckDBResource(database="data/mydb.duckdb")},
)
```

## What you've learned

- Created a Dagster definitions object
- Built our ingestion assets



## Next steps

- Continue this tutorial with your [Asset Dependencies]
62 changes: 0 additions & 62 deletions docs/docs-beta/docs/tutorial/tutorial-etl.md

This file was deleted.

6 changes: 5 additions & 1 deletion docs/docs-beta/sidebars.ts
Original file line number Diff line number Diff line change
@@ -11,7 +11,11 @@ const sidebars: SidebarsConfig = {
       type: 'category',
       label: 'Tutorial',
       collapsed: false,
-      items: ['tutorial/tutorial-etl'],
+      items: [
+        'tutorial/01-etl-tutorial-introduction',
+        'tutorial/02-dagster-project-setup',
+        'tutorial/03-your-first-asset',
+      ],
     },
     {
       type: 'category',
2 changes: 1 addition & 1 deletion docs/docs-beta/src/theme/MDXComponents.tsx
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
// Import the original mapper
import MDXComponents from '@theme-original/MDXComponents';
-import { PyObject } from '../components/PyObject';
+import {PyObject} from '../components/PyObject';
import CodeExample from '../components/CodeExample';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';