Commit 1656e58: write-ingesting-data-with-dagster
sryza committed Aug 26, 2024
Showing 1 changed file with 67 additions and 1 deletion.
---
title: Ingesting data with Dagster
description: Learn how to ingest data into Dagster
sidebar_position: 10
sidebar_label: Ingesting data
---

This guide explains how to use Dagster to orchestrate the ingestion of data into a data warehouse or data lake, where it can be queried and transformed. Dagster integrates with several tools that are purpose-built for data ingestion, and it also enables writing custom code for ingesting data.

A data platform typically centers on a small number of data warehouses or object stores, where data is consolidated in standard formats for reporting, transformation, and analysis. The data inside the platform often comes from a large and diverse set of sources, such as application logs, customer relationship management (CRM) software, spreadsheets, public data sources available on the internet, and more. Data ingestion is the process of extracting data from these sources and loading it into the data platform. Data ingestion makes up the "E" and "L" in the extract-load-transform (ELT) paradigm.

## What you'll learn

- How Dagster helps with data ingestion
- How Dagster relates to different technologies and paradigms for data ingestion
- How to get started ingesting data with Dagster, using your preferred technology and paradigm

<details>
<summary>Prerequisites</summary>
- Familiarity with [asset definitions](/concepts/assets)
</details>

## How Dagster helps with data ingestion

As a data orchestrator, Dagster helps with data ingestion in the following ways:
- It can automatically kick off computations that ingest data.
- It can coordinate data ingestion with downstream data transformation, for example to rebuild a set of dbt models after the upstream data they depend on is updated.
- It can represent ingested data assets in its data asset graph, which enables understanding what ingested data exists, how ingested data is used, and where data is ingested from.

### Batch vs. streaming data ingestion

There are two main paradigms for data ingestion: batch and streaming. With batch data ingestion, data is moved in discrete batches. With streaming data ingestion, data continuously flows in.

Dagster has a different relationship to streaming data ingestion than it does to batch data ingestion. For batch data ingestion, Dagster often takes responsibility for kicking off the computations to ingest the batches of data, either at a regular cadence or when it discovers that new data is available in the source system.

For streaming data ingestion, Dagster doesn't orchestrate the data ingestion, but still often represents the ingested data assets in its asset graph. Dagster can hold metadata about these assets, represent the lineage between them and downstream assets, and automatically kick off computations to observe them.

## Orchestrate a data ingestion tool

Dagster integrates with several tools that are purpose-built for data ingestion. These tools roughly fall into two main categories:
- Hosted data ingestion services.
- Embeddable data ingestion libraries.

### Hosted data ingestion services

Dagster provides integrations with two hosted data ingestion services: Fivetran and Airbyte.

With a hosted data ingestion service, the set of data sources to ingest into the data platform is defined inside the service. Dagster's integrations with these services invoke their REST APIs to find out about the set of data assets that they ingest and load those asset definitions into Dagster's asset graph. The "materialize" action on these Dagster asset definitions invokes REST APIs provided by these services to ingest the latest data from the source.

To learn how to use Dagster to orchestrate a hosted data ingestion service, follow the link for that service above.
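
Under the hood, the materialize action for these integrations follows a trigger-and-poll pattern against the service's REST API. A simplified, self-contained illustration of that pattern (not the actual integration code; the callables here are stand-ins for authenticated HTTP requests):

```python
import time


def run_sync(trigger, get_status, poll_interval=0.01, timeout=5.0):
    """Generic trigger-and-poll loop: the core of what a hosted-ingestion
    integration's materialize action does against a service's REST API."""
    job_id = trigger()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status(job_id)
        if status == "succeeded":
            return job_id
        if status == "failed":
            raise RuntimeError(f"sync job {job_id} failed")
        time.sleep(poll_interval)
    raise TimeoutError(f"sync job {job_id} timed out")


# Stand-in callables simulating the hosted service
state = {"polls": 0}


def fake_trigger():
    return "job-1"


def fake_status(job_id):
    state["polls"] += 1
    return "succeeded" if state["polls"] >= 3 else "running"


print(run_sync(fake_trigger, fake_status))  # job-1
```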

### Embeddable data ingestion libraries

With an embeddable data ingestion library, the set of data sources to ingest into the data platform is defined in files that are typically managed in the same git repository as your other Dagster pipelines. Dagster's integrations with these libraries invoke them to interpret these files and load the data assets defined in them into Dagster's asset graph. The "materialize" action on these Dagster asset definitions invokes the library to ingest the latest data from the source.

Dagster provides two integrations with embeddable data ingestion libraries: Sling and dlt (data load tool).

To learn how to use Dagster to orchestrate an embeddable data ingestion library, follow the link for that library above.
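
The way an integration turns a replication config file into asset definitions can be sketched like this (the config format below is invented for illustration; see each integration's docs for the real formats):

```python
import json

# Hypothetical replication config: the kind of file an embeddable ingestion
# library reads from your repository (format invented for illustration)
REPLICATION_CONFIG = json.loads("""
{
  "source": "postgres",
  "target": "snowflake",
  "streams": ["public.users", "public.orders"]
}
""")


def asset_names_from_config(config: dict) -> list:
    # Mimics how an integration turns each configured stream into an
    # asset definition in Dagster's asset graph
    return [stream.split(".")[-1] for stream in config["streams"]]


print(asset_names_from_config(REPLICATION_CONFIG))  # ['users', 'orders']
```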

## Write a custom data ingestion pipeline

It's also common to write code in a language like Python to ingest data into a data platform.

For example, if there's a CSV file on the internet of counties, and you want to load it into your Snowflake data warehouse as a table, you might directly define an asset that represents that table in your warehouse. The asset's materialization function fetches data from the internet and loads it into that table.

```python
from dagster import asset
from dagster_snowflake import SnowflakeResource


@asset
def counties(snowflake: SnowflakeResource) -> None:
    data = fetch_some_data()  # fetch the CSV data from the internet
    with snowflake.get_connection() as conn:
        conn.cursor().execute("INSERT INTO ...")
```
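
The snippet above leaves the fetching and loading elided. A self-contained version of the same pattern, with sqlite3 standing in for Snowflake and an inlined CSV standing in for the downloaded file, looks like this:

```python
import csv
import io
import sqlite3

# Inlined CSV stands in for a file fetched from the internet
CSV_TEXT = "county,state\nTravis,TX\nKing,WA\n"


def load_counties(conn: sqlite3.Connection) -> int:
    # Parse the CSV and insert each row into the warehouse table
    rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
    conn.execute("CREATE TABLE IF NOT EXISTS counties (county TEXT, state TEXT)")
    conn.executemany("INSERT INTO counties VALUES (:county, :state)", rows)
    return len(rows)


conn = sqlite3.connect(":memory:")
print(load_counties(conn))  # 2
```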
