
[components] [docs] Initial components guide #26651

Merged
merged 1 commit into from
Dec 21, 2024
202 changes: 202 additions & 0 deletions docs/docs-beta/docs/guides/build/components.md
@@ -0,0 +1,202 @@
---
title: "Components"
sidebar_position: 200
---

Welcome to Dagster Components.

Dagster Components is a new way to structure your Dagster projects. It aims to provide:

- An opinionated project layout that supports ongoing scaffolding from "Hello world" to the most advanced projects

- A class-based interface for dynamically constructing definitions
- A toolkit to build YAML DSL frontends for components so that components can be constructed in a low-code fashion
- A format for components to provide their own scaffolding, to organize and reference integration-specific artifact files


## Project Setup

First, let's install the `dg` command-line tool, which lives in the published Python package `dagster-dg`. `dg` is designed to be installed globally and has no dependency on `dagster` itself. We will use the tool feature of the Python package manager `uv` to install a globally available `dg`. `dg` also uses `uv` internally to manage the Python environment associated with your project.


```bash
brew install uv && uv tool install -e $DAGSTER_GIT_REPO_DIR/python_modules/libraries/dagster-dg/
```

Let's have a look at what's available:

```bash
dg --help

Usage: dg [OPTIONS] COMMAND [ARGS]...

  CLI tools for working with Dagster components.

Commands:
  code-location   Commands for operating code location directories.
  component       Commands for operating on components.
  component-type  Commands for operating on components types.
  deployment      Commands for operating on deployment directories.

Options:
  --builtin-component-lib TEXT  Specify a builitin component library to use.
  --verbose                     Enable verbose output for debugging.
  --disable-cache               Disable caching of component registry data.
  --clear-cache                 Clear the cache before running the command.
  --rebuild-component-registry  Recompute and cache the set of available component types for the current environment.
                                Note that this also happens automatically whenever the cache is detected to be stale.
  --cache-dir PATH              Specify a directory to use for the cache.
  -v, --version                 Show the version and exit.
  -h, --help                    Show this message and exit.
```

We're going to generate a new code location.

```bash
dg code-location generate jaffle_platform
```

Let's have a look at what it generated:

```bash
cd jaffle_platform && tree
```

You can see that we have a basic project structure with a few non-standard files/directories:

- `jaffle_platform/components`: this is where we will define our components
- `jaffle_platform/lib`: this is where we can put custom component types
- `definitions.py`: this comes preloaded with some basic code that will scrape up and merge all the Dagster definitions from our components.

## Hello Platform

We are going to set up a data platform using Sling to ingest data, dbt to process the data, and Python to do AI.


### Ingest

First we set up Sling. If we list the available component types in our code location, we don't see anything Sling-related:

```bash
dg component-type list

dagster_components.pipes_subprocess_script_collection
Assets that wrap Python scripts executed with Dagster's PipesSubprocessClient.
```

This is because the basic `dagster-components` package is lightweight and doesn't include components for specific tools. We can get access to a `sling` component by installing the `sling` extra:


```bash
uv add 'dagster-components[sling]' dagster-embedded-elt
```

Now let's see what's available:

```bash
dg component-type list

dagster_components.pipes_subprocess_script_collection
Assets that wrap Python scripts executed with Dagster's PipesSubprocessClient.
dagster_components.sling_replication
```

Great, we now have the `dagster_components.sling_replication` component type available. Let's create a new instance of this component:

```bash
dg component generate dagster_components.sling_replication ingest_files

Creating a Dagster component instance folder at /Users/smackesey/stm/code/elementl/tmp/jaffle_platform/jaffle_platform/components/ingest_files.
```

This adds a component instance to the project at `jaffle_platform/components/ingest_files`:

```bash
tree jaffle_platform

jaffle_platform/
├── __init__.py
├── __pycache__
│   └── __init__.cpython-312.pyc
├── components
│   └── ingest_files
│       ├── component.yaml
│       └── replication.yaml
├── definitions.py
└── lib
    ├── __init__.py
    └── __pycache__
        └── __init__.cpython-312.pyc

6 directories, 7 files
```

Notice that our component has two files: `component.yaml` and `replication.yaml`. The `component.yaml` file is common to all Dagster components, and specifies the component type and any associated parameters. Right now the parameters are empty:

```yaml
### jaffle_platform/components/ingest_files/component.yaml
component_type: dagster_components.sling_replication

params: {}
```
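To make the role of `component.yaml` concrete, here is a minimal sketch that lists the component instances in a project by looking for that file in each subfolder of `components/`. The helper name `find_component_instances` is hypothetical and the line-based scan is a simplification; this is not how `dagster-components` itself loads components.

```python
from pathlib import Path


def find_component_instances(components_dir: Path) -> dict[str, str]:
    """Map each component instance folder name to its declared component_type.

    Hypothetical helper for illustration only; not part of dagster-components.
    """
    instances: dict[str, str] = {}
    for yaml_path in sorted(components_dir.glob("*/component.yaml")):
        component_type = ""
        for line in yaml_path.read_text().splitlines():
            # Naive scan for the top-level key; a real loader would parse the YAML.
            if line.startswith("component_type:"):
                component_type = line.split(":", 1)[1].strip()
                break
        instances[yaml_path.parent.name] = component_type
    return instances


if __name__ == "__main__":
    found = find_component_instances(Path("jaffle_platform/components"))
    for name, component_type in found.items():
        print(f"{name}: {component_type}")
```

Run from the code location root, this should report `ingest_files` together with the `component_type` value shown above.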

The `replication.yaml` file is a Sling-specific file.

We want to replicate data on the public internet into DuckDB:


```bash
uv run sling conns set DUCKDB type=duckdb instance=/tmp/jaffle_platform.duckdb

4:55PM INF connection `DUCKDB` has been set in /Users/smackesey/.sling/env.yaml. Please test with `sling conns test DUCKDB`
```

```bash
uv run sling conns test DUCKDB

4:55PM INF success!
```

Now let's download the files locally (Sling doesn't support reading from the public internet):


```bash
curl -O https://raw.githubusercontent.com/dbt-labs/jaffle-shop-classic/refs/heads/main/seeds/raw_customers.csv &&
curl -O https://raw.githubusercontent.com/dbt-labs/jaffle-shop-classic/refs/heads/main/seeds/raw_orders.csv &&
curl -O https://raw.githubusercontent.com/dbt-labs/jaffle-shop-classic/refs/heads/main/seeds/raw_payments.csv
```
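Where `curl` is not available, the same three downloads could be scripted with Python's standard library. This is a minimal sketch: `seed_url` and `download_seeds` are hypothetical helper names, and the base URL is the one used in the commands above.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Base URL and file names taken from the curl commands above.
BASE_URL = "https://raw.githubusercontent.com/dbt-labs/jaffle-shop-classic/refs/heads/main/seeds"
SEED_FILES = ["raw_customers.csv", "raw_orders.csv", "raw_payments.csv"]


def seed_url(file_name: str) -> str:
    """Build the download URL for one seed file."""
    return f"{BASE_URL}/{file_name}"


def download_seeds(target_dir: Path) -> list[Path]:
    """Download every seed file into target_dir and return the local paths."""
    target_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for file_name in SEED_FILES:
        destination = target_dir / file_name
        urlretrieve(seed_url(file_name), destination)
        paths.append(destination)
    return paths
```

Calling `download_seeds(Path.cwd())` from the directory where you would have run `curl` fetches the same files.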

Then copy the following into `replication.yaml`:

```yaml
source: LOCAL
target: DUCKDB

defaults:
  mode: full-refresh
  object: "{stream_table}"

streams:
  file://raw_customers.csv:
    object: "main.raw_customers"
  file://raw_orders.csv:
    object: "main.raw_orders"
  file://raw_payments.csv:
    object: "main.raw_payments"
```
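Because the streams point at local files, a quick pre-flight check can catch a missing download before you materialize anything. The sketch below is illustrative only: `missing_stream_files` is a hypothetical helper, the stream mapping is copied by hand from the YAML above to avoid a YAML parser dependency, and it assumes the relative `file://` paths resolve against the directory you pass in.

```python
from pathlib import Path

# Stream keys and target objects copied by hand from replication.yaml above.
STREAMS = {
    "file://raw_customers.csv": "main.raw_customers",
    "file://raw_orders.csv": "main.raw_orders",
    "file://raw_payments.csv": "main.raw_payments",
}


def missing_stream_files(streams: dict[str, str], base_dir: Path) -> list[str]:
    """Return file names referenced by file:// streams that are absent from base_dir."""
    missing = []
    for stream in streams:
        # Assumption: relative file:// paths are resolved against base_dir.
        file_name = stream.removeprefix("file://")
        if not (base_dir / file_name).is_file():
            missing.append(file_name)
    return missing


if __name__ == "__main__":
    missing = missing_stream_files(STREAMS, Path.cwd())
    print("missing files:", missing or "none")
```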

Let's load up our code location in the Dagster UI to see what we've got:

```bash
uv run dagster dev # will be dg dev in the future
```

Click "Materialize All", and we should now have tables in the DuckDB instance. Let's verify on the command line:

```bash
brew install duckdb
duckdb /tmp/jaffle_platform.duckdb -c "SELECT * FROM raw_customers LIMIT 5;"

┌───────┬────────────┬───────────┬──────────────────┐
│  id   │ first_name │ last_name │ _sling_loaded_at │
│ int32 │  varchar   │  varchar  │      int64       │
├───────┼────────────┼───────────┼──────────────────┤
│     1 │ Michael    │ P.        │       1734732030 │
│     2 │ Shawn      │ M.        │       1734732030 │
│     3 │ Kathleen   │ P.        │       1734732030 │
│     4 │ Jimmy      │ C.        │       1734732030 │
│     5 │ Katherine  │ R.        │       1734732030 │
└───────┴────────────┴───────────┴──────────────────┘
```
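The `_sling_loaded_at` column looks like a Unix timestamp in seconds. Under that assumption (the guide itself does not state the unit), a small sketch can turn it into a readable UTC time; `loaded_at_utc` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def loaded_at_utc(epoch_seconds: int) -> datetime:
    """Interpret a _sling_loaded_at value as Unix epoch seconds (an assumption) in UTC."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


if __name__ == "__main__":
    # Value taken from the query output above.
    print(loaded_at_utc(1734732030).isoformat())
```

For the value in the output above, this gives `2024-12-20T22:00:30+00:00`.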