Doc 302 new etl tutorial - part 1 (#25320)
## Summary & Motivation
I'm a little way into this and would like to get feedback from
@PedramNavid and @cmpadden on the structure and general flow. This isn't
done at this point, but I figure we could collaborate here and iterate
from there.

I made some changes to the reference file to make it more concise
regarding metadata output. The new code example function works great.

Main Questions I have at this point:

1. Where will we put pictures for the new docs site? I want to add some
UI screenshots for more color.
2. How many explanations should I add here? I know on the previous site,
the main complaint was that there was too much going on. I think having
a couple of sentences explaining a concept is fine, with a link to the
main page for that concept/API
3. Should we hide the code examples and have the user reveal them or
leave them as is?


## How I Tested These Changes

## Changelog

> Insert changelog entry or delete this section.

---------

Signed-off-by: nikki everett <[email protected]>
Co-authored-by: Nikki Everett <[email protected]>
Co-authored-by: nikki everett <[email protected]>
3 people authored Jan 7, 2025
1 parent 5bcd9ad commit 38cb6e5
Showing 25 changed files with 1,028 additions and 393 deletions.
@@ -13,7 +13,7 @@ This guide will cover three options for adding a new code location:
<details>
<summary>Prerequisites</summary>

- 1. An existing Dagster project. Refer to the [recommended project structure](/tutorial/create-new-project) and [code requirements](/dagster-plus/deployment/code-requirements) pages for more information.
+ 1. An existing Dagster project. Refer to the [recommended project structure](/guides/build/project-structure) and [code requirements](/dagster-plus/deployment/code-requirements) pages for more information.

2. Editor, Admin, or Organization Admin permissions in Dagster+.

@@ -0,0 +1,56 @@
---
title: Automate your pipeline
description: Set schedules and utilize asset-based automation
last_update:
  author: Alex Noonan
sidebar_position: 60
---

There are several ways to automate pipelines and assets [in Dagster](/guides/automate).

In this step you will:

- Add automation to assets to run when upstream assets are materialized.
- Create a schedule to run a set of assets on a cron schedule.

## 1. Automate asset materialization

Ideally, the reporting assets created in the last step should refresh whenever the upstream data is updated. Dagster's [declarative automation](/guides/automate/declarative-automation) framework allows you to do this by adding an automation condition to the asset definition.

Update the `monthly_sales_performance` asset to add the automation condition to the decorator:

<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="155" lineEnd="209"/>

Do the same thing for `product_performance`:

<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="216" lineEnd="267"/>
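
As a rough mental model (not Dagster's actual implementation), an automation condition that reacts to upstream updates can be thought of as a staleness check: the downstream asset is due for a refresh whenever an upstream asset has been materialized more recently than it has. The sketch below is plain Python; the `is_stale` helper, the `upstream_table` key, and the timestamps are hypothetical and only illustrate the idea.

```python
from datetime import datetime
from typing import Optional


def is_stale(
    last_materialized: Optional[datetime],
    upstream_materialized: dict[str, Optional[datetime]],
) -> bool:
    """Return True when the downstream asset should be refreshed."""
    # Upstream assets that have never been materialized cannot trigger a refresh.
    upstream_times = [t for t in upstream_materialized.values() if t is not None]
    if not upstream_times:
        return False
    # A downstream asset that has never run is stale as soon as upstream data exists.
    if last_materialized is None:
        return True
    # Otherwise, refresh only if some upstream asset is newer than the downstream one.
    return max(upstream_times) > last_materialized


# Hypothetical upstream asset materialized at 09:00.
upstream = {"upstream_table": datetime(2025, 1, 7, 9, 0)}
print(is_stale(datetime(2025, 1, 7, 8, 0), upstream))   # True
print(is_stale(datetime(2025, 1, 7, 10, 0), upstream))  # False
```

With declarative automation, Dagster performs this kind of evaluation for you once the automation is enabled, so no polling code like this is needed in the project itself.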

## 2. Scheduled jobs

Cron-based schedules are common in data orchestration. For our pipeline, assume that updated CSVs are uploaded to a file location at a specific time every week by an external process.

Copy the following code underneath the `product_performance` asset:

<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="268" lineEnd="273"/>
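
To make a weekly cadence concrete: a five-field cron string such as `0 0 * * 1` reads as minute 0, hour 0, any day of the month, any month, and day-of-week 1 (Monday). The sketch below is a plain-Python illustration of when such a schedule next fires. It does not use Dagster, the `next_weekly_run` helper is hypothetical, and the Monday-at-midnight timing is an assumption rather than the tutorial's actual schedule.

```python
from datetime import datetime, timedelta


def next_weekly_run(
    now: datetime, weekday: int = 0, hour: int = 0, minute: int = 0
) -> datetime:
    """Return the first weekly run time strictly after `now`.

    `weekday` follows datetime.weekday(): Monday is 0 and Sunday is 6.
    Note that cron numbers Monday as 1.
    """
    # Start from the requested time of day on the current date.
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    # Move forward to the requested weekday (0 to 6 days ahead).
    candidate += timedelta(days=(weekday - now.weekday()) % 7)
    # If that moment has already passed, the next run is one week later.
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate


# Tuesday 2025-01-07 at 12:00 -> the following Monday at midnight.
print(next_weekly_run(datetime(2025, 1, 7, 12, 0)))  # 2025-01-13 00:00:00
```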

## 3. Enable and test automations

The final step is to enable the automations in the UI.

To accomplish this:
1. Navigate to the Automation page.
2. Select all automations.
3. Using actions, start all automations.
4. Select the `analysis_update_job`.
5. Test the schedule and evaluate for any time in the dropdown menu.
6. Open in Launchpad.

The job is now executing.

Additionally, if you navigate to the Runs tab, you should see that materializations for `monthly_sales_performance` and `product_performance` have run as well.

![2048 resolution](/images/tutorial/etl-tutorial/automation-final.png)

## Next steps

- Continue this tutorial by adding a [sensor-based asset](create-a-sensor-asset)
69 changes: 69 additions & 0 deletions docs/docs-beta/docs/etl-pipeline-tutorial/create-a-sensor-asset.md
@@ -0,0 +1,69 @@
---
title: Create a sensor asset
description: Use sensors to create event-driven pipelines
last_update:
  author: Alex Noonan
sidebar_position: 70
---

[Sensors](/guides/automate/sensors) allow you to automate workflows based on external events or conditions, making them useful for event-driven automation, especially in situations where jobs occur at irregular cadences or in rapid succession.

Consider using sensors in the following situations:
- **Event-driven workflows**: When your workflow depends on external events, such as the arrival of a new data file or a change in an API response.
- **Conditional execution**: When you want to execute jobs only if certain conditions are met, reducing unnecessary computations.
- **Real-time processing**: When you need to process data as soon as it becomes available, rather than waiting for a scheduled time.

In this step you will:

- Create an asset that runs based on an event-driven workflow
- Create a sensor to listen for conditions to materialize the asset

## 1. Create an event-driven asset

For our pipeline, we want to model a situation where an executive wants a pivot table report of sales results by department and product. They want the report processed in real time whenever they make a request.

For this asset, we need to define the structure of the request that it is expecting in the materialization context.

Other than that, defining this asset is the same as our previous assets. Copy the following code beneath `product_performance`.

<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="275" lineEnd="312"/>
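
For readers unfamiliar with pivot tables, here is a minimal plain-Python sketch of the aggregation such a report performs: summing sales into a department-by-product grid. The sample rows, the field names, and the `pivot_sales` helper are hypothetical; the tutorial's asset works against the data in DuckDB instead.

```python
from collections import defaultdict

# Hypothetical sample rows; the tutorial's real data lives in DuckDB.
rows = [
    {"department": "Hardware", "product": "Widget", "amount": 120.0},
    {"department": "Hardware", "product": "Gadget", "amount": 80.0},
    {"department": "Software", "product": "Widget", "amount": 50.0},
    {"department": "Hardware", "product": "Widget", "amount": 30.0},
]


def pivot_sales(rows):
    """Sum `amount` into a nested mapping of department -> product -> total."""
    table = defaultdict(lambda: defaultdict(float))
    for row in rows:
        table[row["department"]][row["product"]] += row["amount"]
    # Convert to plain dicts so the result prints and compares cleanly.
    return {dept: dict(products) for dept, products in table.items()}


print(pivot_sales(rows))
```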

## 2. Build the sensor

To define a sensor in Dagster, use the `@sensor` decorator. This decorator is applied to a function that evaluates whether the conditions for triggering a job are met.

Sensors include the following elements:

- **Job**: The job that the sensor will trigger when the conditions are met.
- **RunRequest**: An object that specifies the configuration for the job run. It includes a `run_key` to ensure idempotency and a `run_config` for job-specific settings.

<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="314" lineEnd="355"/>
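
The role of `run_key` can be sketched without Dagster. In the plain-Python illustration below, each evaluation scans a folder for JSON request files and emits a request only for keys it has not seen before, so repeated evaluations do not launch duplicate runs. The `evaluate_requests` helper, the key format, and the `department` field are hypothetical. In Dagster, the record of previously used run keys is kept by the framework; here a plain `set` stands in for it.

```python
import json
import tempfile
from pathlib import Path


def evaluate_requests(request_dir: Path, seen_run_keys: set) -> list:
    """Return one run request per JSON file that has not been handled before."""
    run_requests = []
    for path in sorted(request_dir.glob("*.json")):
        # Including the modification time means an edited file counts as a new request.
        run_key = f"{path.name}:{path.stat().st_mtime_ns}"
        if run_key in seen_run_keys:
            continue  # Already requested: skipping keeps the evaluation idempotent.
        seen_run_keys.add(run_key)
        run_requests.append(
            {"run_key": run_key, "run_config": json.loads(path.read_text())}
        )
    return run_requests


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "request.json").write_text('{"department": "South"}')
    seen = set()
    first = evaluate_requests(folder, seen)   # New file: one request.
    second = evaluate_requests(folder, seen)  # Same file, same key: nothing new.
    print(len(first), len(second))  # 1 0
```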

## 3. Materialize the sensor asset

1. Update your Definitions object to the following:

<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="357" lineEnd="373"/>

2. Reload your Definitions.

3. Navigate to the Automation page.

4. Turn on the `automation_request_sensor`.

5. Click on the `automation_request_sensor` details.

![2048 resolution](/images/tutorial/etl-tutorial/sensor-evaluation.png)

6. Add `request.json` from the `sample_request` folder to the `requests` folder.

7. Click on the green tick to see the run for this request.

![2048 resolution](/images/tutorial/etl-tutorial/sensor-asset-run.png)


## Next steps

Now that we have our complete project, the next step is to refactor the project into a more manageable structure so we can add to it as needed.

Finish the tutorial by [refactoring your project](refactor-your-project).
@@ -0,0 +1,43 @@
---
title: Create and materialize a downstream asset
description: Reference Assets as dependencies to other assets
last_update:
  author: Alex Noonan
sidebar_position: 30
---

Now that we have the raw data loaded into DuckDB, we need to create a [downstream asset](/guides/build/create-asset-pipelines/assets-concepts/asset-dependencies) that combines the upstream assets together. In this step, you will:

- Create a downstream asset
- Materialize that asset

## 1. Create a downstream asset

Now that we have all of our raw data loaded into DuckDB, our next step is to merge it together in a view composed of data from all three source tables.

To accomplish this in SQL, we will start from the `sales_data` table and left join `sales_reps` and `products` on their respective ID columns. Additionally, we will keep this view concise and include only the columns relevant for analysis.

As you can see, the new `joined_data` asset looks a lot like our previous ones, with a few small changes. We put this asset into a different group. To make this asset dependent on the raw tables, we add the asset keys to the `deps` parameter in the asset definition.

<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="89" lineEnd="132"/>
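
The shape of the join can be tried in isolation. The sketch below uses Python's built-in `sqlite3` module as a stand-in for DuckDB, and the column names and sample rows are hypothetical. It shows the key property of a left join: every row of `sales_data` is kept, and a lookup with no match comes back as `NULL` (`None` in Python).

```python
import sqlite3

# sqlite3 is used here only as a stand-in for DuckDB; table contents are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sales_data (order_id INTEGER, product_id INTEGER, rep_id INTEGER, amount REAL);
    CREATE TABLE products (product_id INTEGER, product_name TEXT);
    CREATE TABLE sales_reps (rep_id INTEGER, rep_name TEXT);
    INSERT INTO sales_data VALUES (1, 10, 100, 25.0), (2, 11, 999, 40.0);
    INSERT INTO products VALUES (10, 'Widget'), (11, 'Gadget');
    INSERT INTO sales_reps VALUES (100, 'Ada');

    -- Keep every sales row; unmatched lookups come back as NULL.
    CREATE VIEW joined_data AS
    SELECT s.order_id, p.product_name, r.rep_name, s.amount
    FROM sales_data AS s
    LEFT JOIN sales_reps AS r ON s.rep_id = r.rep_id
    LEFT JOIN products AS p ON s.product_id = p.product_id;
    """
)

rows = conn.execute("SELECT * FROM joined_data ORDER BY order_id").fetchall()
print(rows)  # [(1, 'Widget', 'Ada', 25.0), (2, 'Gadget', None, 40.0)]
```

Order 2 references a sales rep ID that does not exist, so its `rep_name` is `None` rather than the row being dropped, which is what an inner join would do.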

## 2. Materialize the asset

1. Add the `joined_data` asset to the Definitions object:

```python
defs = dg.Definitions(
    assets=[
        products,
        sales_reps,
        sales_data,
        joined_data,
    ],
    resources={"duckdb": DuckDBResource(database="data/mydb.duckdb")},
)
```

2. In the Dagster UI, reload definitions and materialize the `joined_data` asset.
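
Because `joined_data` lists the three raw tables in `deps`, it can only be built after they exist. As a conceptual sketch of how declared dependencies imply an execution order (this is not Dagster's scheduler), the standard library's `graphlib` module can order the same four asset keys:

```python
from graphlib import TopologicalSorter

# Each asset key maps to the set of asset keys it depends on.
asset_deps = {
    "products": set(),
    "sales_reps": set(),
    "sales_data": set(),
    "joined_data": {"products", "sales_reps", "sales_data"},
}

# static_order() yields every node after all of its dependencies.
order = list(TopologicalSorter(asset_deps).static_order())
print(order[-1])  # joined_data
```

The three raw tables have no dependencies on each other, so their relative order is not fixed; only the position of `joined_data` after all of them is guaranteed.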

## Next steps

- Continue this tutorial by [ensuring data quality with asset checks](ensure-data-quality-with-asset-checks).
@@ -0,0 +1,108 @@
---
title: Create and materialize assets
description: Load project data and create and materialize assets
last_update:
  author: Alex Noonan
sidebar_position: 20
---


In the first step of the tutorial, you created your Dagster project with the raw data files. In this step, you will:
- Create your initial Definitions object
- Add a DuckDB resource
- Build software-defined assets
- Materialize your assets

## 1. Create a definitions object

In Dagster, the Definitions object ([API docs](/todo)) is where you define and organize the components of your project, such as assets and resources.

Open the `definitions.py` file in the `etl_tutorial` directory and copy the following code into it:

```python
import json
import os

from dagster_duckdb import DuckDBResource

import dagster as dg

defs = dg.Definitions(
    assets=[],
    resources={},
)
```

## 2. Define the DuckDB resource

In Dagster, resources ([API docs](/todo)) are the external services, tools, and storage backends you need to do your job. For the storage backend in this project, we'll use [DuckDB](https://duckdb.org/), a fast, in-process SQL database that runs inside your application. We'll define it once in the Definitions object, making it available to all assets and objects that need it.

```python
defs = dg.Definitions(
    assets=[],
    resources={"duckdb": DuckDBResource(database="data/mydb.duckdb")},
)
```

## 3. Create assets

Software-defined assets ([API docs](/todo)) are the main building blocks in Dagster. An asset is composed of three components:
1. An asset key, which is its unique identifier.
2. An op, which is the function that is invoked to produce the asset.
3. The upstream dependencies that the asset depends on.

You can read more about our philosophy behind the [asset-centric approach](https://dagster.io/blog/software-defined-assets).

### Products asset

First, we will create an asset that creates a DuckDB table to hold data from the products CSV. This asset takes the `duckdb` resource defined earlier and returns a `MaterializeResult` object.
Additionally, this asset contains metadata in the `@dg.asset` decorator parameters to help categorize the asset, and in the `return` block to give us a preview of the asset in the Dagster UI.

To create this asset, open the `definitions.py` file and copy the following code into it:

<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="8" lineEnd="33"/>
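
The pattern in this asset (rebuild a table from a CSV, then return summary metadata) can be sketched on its own. The example below is a simplified stand-in, not the tutorial's code: it uses the built-in `sqlite3` module in place of DuckDB, a plain `dict` in place of `MaterializeResult`, and hypothetical CSV contents and column names. Reporting a row count alongside the preview is also an assumption.

```python
import csv
import io
import sqlite3

# Hypothetical CSV contents standing in for the tutorial's products file.
PRODUCTS_CSV = "product_id,product_name\n10,Widget\n11,Gadget\n"


def load_products(conn: sqlite3.Connection, csv_text: str) -> dict:
    """Rebuild the products table from CSV text and return preview metadata."""
    records = list(csv.DictReader(io.StringIO(csv_text)))
    # Drop and recreate the table so re-running the load gives the same result.
    conn.execute("DROP TABLE IF EXISTS products")
    conn.execute("CREATE TABLE products (product_id INTEGER, product_name TEXT)")
    conn.executemany(
        "INSERT INTO products VALUES (:product_id, :product_name)", records
    )
    # Collect the kind of summary a UI could display next to the asset.
    row_count = conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]
    preview = conn.execute("SELECT * FROM products LIMIT 10").fetchall()
    return {"row_count": row_count, "preview": preview}


conn = sqlite3.connect(":memory:")
metadata = load_products(conn, PRODUCTS_CSV)
print(metadata)
```

Because the table is replaced rather than appended to, materializing the asset twice leaves the same two rows instead of four.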

### Sales reps asset

The code for the sales reps asset is similar to the product asset code. In the `definitions.py` file, copy the following code below the product asset code:

<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="35" lineEnd="61"/>

### Sales data asset

To add the sales data asset, copy the following code into your `definitions.py` file below the sales reps asset:

<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="62" lineEnd="87"/>

## 4. Add assets to the definitions object

Next, we pull these assets into our Definitions object, which makes them available to the Dagster project. Add them to the empty list in the `assets` parameter:

```python
defs = dg.Definitions(
    assets=[
        products,
        sales_reps,
        sales_data,
    ],
    resources={"duckdb": DuckDBResource(database="data/mydb.duckdb")},
)
```

## 5. Materialize assets

To materialize your assets:
1. In a browser, navigate to the URL of the Dagster server that you started earlier.
2. Navigate to **Deployment**.
3. Click **Reload definitions**.
4. Click **Assets**, then click **View global asset lineage** to see all of your assets.

![2048 resolution](/images/tutorial/etl-tutorial/etl-tutorial-first-asset-lineage.png)

5. Click **Materialize all**.
6. Navigate to the **Runs** tab and select the most recent run. Here you can see the logs from the run.
![2048 resolution](/images/tutorial/etl-tutorial/first-asset-run.png)


## Next steps

- Continue this tutorial by creating [asset dependencies](create-and-materialize-a-downstream-asset)

1 comment on commit 38cb6e5

@github-actions github-actions bot commented on 38cb6e5 Jan 7, 2025

Deploy preview for dagster-docs-beta ready!

✅ Preview
https://dagster-docs-beta-jws0rsqp7-elementl.vercel.app

Built with commit 38cb6e5.
This pull request is being automatically deployed with vercel-action