integration docs beta 1/ replicate all integration pages from mkt site to beta docs (#24330)

## Summary & Motivation
Copy over from https://dagster.io/integrations (note: this content is up
to date per the PR stack dagster-io/dagster-website#1280 from a few weeks
ago).

This PR makes the following changes:
1. Update the title to "Dagster & <name>".
2. Add `sidebar_label: <name>` so the left nav doesn't show a wall of
"Dagster &" entries.
3. Fix all Vale errors and warnings, including many additions to the Vale
accept list.
4. Rename files from `dagster-<name>.mdx` to just `<name>.md`.

Next steps in a later stack:
- Move code to Python files
- Improve navigation: add an index page and/or left-nav buckets to group
integrations into categories and differentiate community-owned ones

Later steps:
- Improve doc content page by page (e.g., template the guides and reuse
the good ones from the current docs site)

**Open discussion:** Figure out the relationship between the docs and
marketing sites regarding integrations.
* Option 1: drop dagster.io/integrations and redirect it to
docs.dagster.io/integrations.
  * Yuhan's pick: I'm actually leaning towards this, to completely
consolidate all integration content into the docs site for simplicity
and ease of navigation so there isn't similar content on two different
sites, but I'd need to check the SEO implications of this option.
* Option 2: keep both dagster.io/integrations and
docs.dagster.io/integrations; the marketing site would have no code and
exist only for SEO purposes, while the docs pages focus on more technical
guides/references.

## How I Tested These Changes
**See the preview:
https://dagster-docs-beta-211qncb7r-elementl.vercel.app/integrations**

## Changelog
`NOCHANGELOG`

---------

Co-authored-by: colton <[email protected]>
yuhan and cmpadden authored Sep 11, 2024
1 parent bd231c6 commit 6dca8c2
Showing 56 changed files with 3,061 additions and 13 deletions.
6 changes: 3 additions & 3 deletions docs/docs-beta/docs/getting-started/installation.md
@@ -1,8 +1,8 @@
---
title: "Installing Dagster"
description: "Learn how to install Dagster"
title: Installing Dagster
description: Learn how to install Dagster
sidebar_position: 20
sidebar_label: "Installation"
sidebar_label: Installation
---

# Installing Dagster
2 changes: 1 addition & 1 deletion docs/docs-beta/docs/guides/asset-dependencies.md
@@ -92,7 +92,7 @@ Consider this example:

<CodeExample filePath="guides/data-assets/passing-data-assets/passing-data-avoid.py" language="python" title="Avoid Passing Data Between Assets" />

-This example downloads a zip file from Google Drive, unzips it, and loads the data into a pandas DataFrame. It relies on each asset running on the same file system to perform these operations.
+This example downloads a zip file from Google Drive, unzips it, and loads the data into a Pandas DataFrame. It relies on each asset running on the same file system to perform these operations.

The assets are modeled as tasks, rather than as data assets. For more information on the difference between tasks and data assets, check out the [Thinking in Assets](/concepts/assets/thinking-in-assets) guide.

3 changes: 2 additions & 1 deletion docs/docs-beta/docs/integrations.md
@@ -1,5 +1,6 @@
---
title: "Integrations"
title: 'Integrations'
displayed_sidebar: 'integrations'
---

# Integrations
52 changes: 52 additions & 0 deletions docs/docs-beta/docs/integrations/airbyte.md
@@ -0,0 +1,52 @@
---
layout: Integration
status: published
name: Airbyte
title: Dagster & Airbyte
sidebar_label: Airbyte
excerpt: Orchestrate Airbyte connections and schedule syncs alongside upstream or downstream dependencies.
date: 2022-11-07
apireflink: https://docs.dagster.io/_apidocs/libraries/dagster-airbyte
docslink: https://docs.dagster.io/integrations/airbyte
partnerlink: https://airbyte.com/tutorials/orchestrate-data-ingestion-and-transformation-pipelines
logo: /integrations/airbyte.svg
categories:
- ETL
enabledBy:
enables:
---

### About this integration

Using this integration, you can trigger Airbyte syncs and orchestrate your Airbyte connections from within Dagster, making it easy to chain an Airbyte sync with upstream or downstream steps in your workflow.

### Installation

```bash
pip install dagster-airbyte
```

### Example

```python
from dagster import EnvVar
from dagster_airbyte import AirbyteResource, load_assets_from_airbyte_instance
import os

# Connect to your OSS Airbyte instance
airbyte_instance = AirbyteResource(
    host="localhost",
    port="8000",
    # If using basic auth, include username and password:
    username="airbyte",
    password=EnvVar("AIRBYTE_PASSWORD"),
)

# Load all assets from your Airbyte instance
airbyte_assets = load_assets_from_airbyte_instance(airbyte_instance)

```
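
The loaded assets still need to be registered with a `Definitions` object to show up in a Dagster deployment. Here's a minimal sketch, assuming the `airbyte_assets` value created above:

```python
from dagster import Definitions

# A minimal sketch: register the Airbyte-loaded assets so Dagster can
# discover them. Assumes the `airbyte_assets` value created above with
# load_assets_from_airbyte_instance.
defs = Definitions(assets=[airbyte_assets])
```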

### About Airbyte

**Airbyte** is an open source data integration engine that helps you consolidate your SaaS application and database data into your data warehouses, lakes and databases.
48 changes: 48 additions & 0 deletions docs/docs-beta/docs/integrations/aws-athena.md
@@ -0,0 +1,48 @@
---
layout: Integration
status: published
name: AWS Athena
title: Dagster & AWS Athena
sidebar_label: AWS Athena
excerpt: This integration allows you to connect to AWS Athena and analyze data in Amazon S3 using standard SQL within your Dagster pipelines.
date: 2024-06-21
apireflink: https://docs.dagster.io/_apidocs/libraries/dagster-aws
docslink:
partnerlink: https://aws.amazon.com/
logo: /integrations/aws-athena.svg
categories:
- Storage
enabledBy:
enables:
---

### About this integration

This integration allows you to connect to AWS Athena, a serverless interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Using this integration, you can issue queries to Athena, fetch results, and handle query execution states within your Dagster pipelines.

### Installation

```bash
pip install dagster-aws
```

### Examples

```python
from dagster import Definitions, asset
from dagster_aws.athena import AthenaClientResource


@asset
def example_athena_asset(athena: AthenaClientResource):
    return athena.get_client().execute_query("SELECT 1", fetch_results=True)


defs = Definitions(
    assets=[example_athena_asset], resources={"athena": AthenaClientResource()}
)
```
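
For a quick local check, the asset above can be materialized in-process with `dagster.materialize`. A minimal sketch, assuming AWS credentials and a default Athena workgroup are already configured in the environment running this code:

```python
from dagster import materialize
from dagster_aws.athena import AthenaClientResource

# A sketch: materialize the asset defined above in-process.
# Assumes AWS credentials and an Athena workgroup are available in the
# environment where this runs.
result = materialize(
    [example_athena_asset],
    resources={"athena": AthenaClientResource()},
)
assert result.success
```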

### About AWS Athena

AWS Athena is a serverless, interactive query service that allows you to analyze data directly in Amazon S3 using standard SQL. Athena is easy to use: point to your data in Amazon S3, define the schema, and start querying using standard SQL. Most results are delivered within seconds. With Athena, there is no infrastructure to set up or manage, and you pay only for the queries you run. It scales automatically, executing queries in parallel, so results are fast even with large datasets and complex queries.
63 changes: 63 additions & 0 deletions docs/docs-beta/docs/integrations/aws-cloudwatch.md
@@ -0,0 +1,63 @@
---
layout: Integration
status: published
name: AWS CloudWatch
title: Dagster & AWS CloudWatch
sidebar_label: AWS CloudWatch
excerpt: This integration allows you to send Dagster logs to AWS CloudWatch, enabling centralized logging and monitoring of your Dagster jobs.
date: 2024-06-21
apireflink: https://docs.dagster.io/_apidocs/libraries/dagster-aws
docslink:
partnerlink: https://aws.amazon.com/
logo: /integrations/aws-cloudwatch.svg
categories:
- Monitoring
enabledBy:
enables:
---

### About this integration

This integration allows you to send Dagster logs to AWS CloudWatch, enabling centralized logging and monitoring of your Dagster jobs. By using AWS CloudWatch, you can take advantage of its powerful log management features, such as real-time log monitoring, log retention policies, and alerting capabilities.

Using this integration, you can configure your Dagster jobs to log directly to AWS CloudWatch, making it easier to track and debug your workflows. This is particularly useful for production environments where centralized logging is essential for maintaining observability and operational efficiency.

### Installation

```bash
pip install dagster-aws
```

### Examples

```python
import dagster as dg
from dagster_aws.cloudwatch import cloudwatch_logger


@dg.asset
def my_asset(context: dg.AssetExecutionContext):
    context.log.info("Hello, CloudWatch!")
    context.log.error("This is an error")
    context.log.debug("This is a debug message")


defs = dg.Definitions(
    assets=[my_asset],
    loggers={
        "cloudwatch_logger": cloudwatch_logger,
    },
)
```
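
In practice the logger needs to be pointed at a specific CloudWatch log group and stream via its config. The following is a sketch under the assumption that `log_group_name` and `log_stream_name` are the relevant config fields (check the `dagster-aws` API docs for the exact schema); the group and stream names are hypothetical:

```python
import dagster as dg
from dagster_aws.cloudwatch import cloudwatch_logger

# A sketch, not a verified configuration: point the logger at a
# hypothetical log group and stream through logger config.
configured_cloudwatch_logger = cloudwatch_logger.configured(
    {
        "log_group_name": "/dagster/example",  # hypothetical log group
        "log_stream_name": "example-stream",  # hypothetical log stream
    }
)

defs = dg.Definitions(
    assets=[my_asset],  # the asset defined above
    loggers={"cloudwatch_logger": configured_cloudwatch_logger},
)
```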

### About AWS CloudWatch

AWS CloudWatch is a monitoring and observability service provided by Amazon Web Services (AWS). It allows you to collect, access, and analyze performance and operational data from a variety of AWS resources, applications, and services. With AWS CloudWatch, you can set up alarms, visualize logs and metrics, and gain insights into your infrastructure and applications to ensure they're running smoothly.

AWS CloudWatch provides features such as:

- Real-time monitoring: Track the performance of your applications and infrastructure in real-time.
- Log management: Collect, store, and analyze log data from various sources.
- Alarms and notifications: Set up alarms to automatically notify you of potential issues.
- Dashboards: Create custom dashboards to visualize metrics and logs.
- Integration with other AWS services: Seamlessly integrate with other AWS services for a comprehensive monitoring solution.
58 changes: 58 additions & 0 deletions docs/docs-beta/docs/integrations/aws-ecr.md
@@ -0,0 +1,58 @@
---
layout: Integration
status: published
name: AWS ECR
title: Dagster & AWS ECR
sidebar_label: AWS ECR
excerpt: This integration allows you to connect to AWS Elastic Container Registry (ECR), enabling you to manage your container images more effectively in your Dagster pipelines.
date: 2024-06-21
apireflink: https://docs.dagster.io/_apidocs/libraries/dagster-aws
docslink:
partnerlink: https://aws.amazon.com/
logo: /integrations/aws-ecr.svg
categories:
- Other
enabledBy:
enables:
---

### About this integration

This integration allows you to connect to AWS Elastic Container Registry (ECR). It provides resources to interact with AWS ECR, enabling you to manage your container images.

Using this integration, you can seamlessly integrate AWS ECR into your Dagster pipelines, making it easier to manage and deploy containerized applications.

### Installation

```bash
pip install dagster-aws
```

### Examples

```python
from dagster import asset, Definitions
from dagster_aws.ecr import ECRPublicResource


@asset
def get_ecr_login_password(ecr_public: ECRPublicResource):
    return ecr_public.get_client().get_login_password()


defs = Definitions(
    assets=[get_ecr_login_password],
    resources={
        "ecr_public": ECRPublicResource(
            region_name="us-west-1",
            aws_access_key_id="your_access_key_id",
            aws_secret_access_key="your_secret_access_key",
            aws_session_token="your_session_token",
        )
    },
)
```
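
Hard-coding credentials as above is fine for illustration, but in practice you'd usually read them from the environment. A sketch of the same resource using `EnvVar`, assuming the standard AWS environment variables are set:

```python
from dagster import Definitions, EnvVar
from dagster_aws.ecr import ECRPublicResource

# A sketch: pull AWS credentials from environment variables instead of
# hard-coding them. Assumes AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and
# AWS_SESSION_TOKEN are set in the environment.
defs = Definitions(
    assets=[get_ecr_login_password],  # the asset defined above
    resources={
        "ecr_public": ECRPublicResource(
            region_name="us-west-1",
            aws_access_key_id=EnvVar("AWS_ACCESS_KEY_ID"),
            aws_secret_access_key=EnvVar("AWS_SECRET_ACCESS_KEY"),
            aws_session_token=EnvVar("AWS_SESSION_TOKEN"),
        )
    },
)
```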

### About AWS ECR

AWS Elastic Container Registry (ECR) is a fully managed Docker container registry that makes it easy for developers to store, manage, and deploy Docker container images. AWS ECR is integrated with Amazon Elastic Kubernetes Service (EKS), simplifying your development to production workflow. With ECR, you can securely store and manage your container images and easily integrate with your existing CI/CD pipelines. AWS ECR provides high availability and scalability, ensuring that your container images are always available when you need them.
96 changes: 96 additions & 0 deletions docs/docs-beta/docs/integrations/aws-emr.md
@@ -0,0 +1,96 @@
---
layout: Integration
status: published
name: AWS EMR
title: Dagster & AWS EMR
sidebar_label: AWS EMR
excerpt: The AWS EMR integration allows you to seamlessly integrate AWS EMR into your Dagster pipelines for petabyte-scale data processing using open source tools like Apache Spark, Hive, Presto, and more.
date: 2024-06-21
apireflink: https://docs.dagster.io/_apidocs/libraries/dagster-aws
docslink:
partnerlink: https://aws.amazon.com/
logo: /integrations/aws-emr.svg
categories:
- Compute
enabledBy:
enables:
---

### About this integration

The `dagster-aws` integration provides ways of orchestrating data pipelines that leverage AWS services, including AWS EMR (Elastic MapReduce). This integration allows you to run and scale big data workloads using open source tools such as Apache Spark, Hive, Presto, and more.

Using this integration, you can:

- Seamlessly integrate AWS EMR into your Dagster pipelines.
- Utilize EMR for petabyte-scale data processing.
- Easily manage and monitor EMR clusters and jobs from within Dagster.
- Leverage Dagster's orchestration capabilities to handle complex data workflows involving EMR.

### Installation

```bash
pip install dagster-aws
```

### Examples

```python
from pathlib import Path
from typing import Any

from dagster import Definitions, ResourceParam, asset
from dagster_aws.emr import emr_pyspark_step_launcher
from dagster_aws.s3 import S3Resource
from dagster_pyspark import PySparkResource
from pyspark.sql import DataFrame, Row
from pyspark.sql.types import IntegerType, StringType, StructField, StructType


emr_pyspark = PySparkResource(spark_config={"spark.executor.memory": "2g"})


@asset
def people(
    pyspark: PySparkResource, pyspark_step_launcher: ResourceParam[Any]
) -> DataFrame:
    schema = StructType(
        [StructField("name", StringType()), StructField("age", IntegerType())]
    )
    rows = [
        Row(name="Thom", age=51),
        Row(name="Jonny", age=48),
        Row(name="Nigel", age=49),
    ]
    return pyspark.spark_session.createDataFrame(rows, schema)


@asset
def people_over_50(
    pyspark_step_launcher: ResourceParam[Any], people: DataFrame
) -> DataFrame:
    return people.filter(people["age"] > 50)


defs = Definitions(
    assets=[people, people_over_50],
    resources={
        "pyspark_step_launcher": emr_pyspark_step_launcher.configured(
            {
                "cluster_id": {"env": "EMR_CLUSTER_ID"},
                "local_pipeline_package_path": str(Path(__file__).parent),
                "deploy_local_pipeline_package": True,
                "region_name": "us-west-1",
                "staging_bucket": "my_staging_bucket",
                "wait_for_logs": True,
            }
        ),
        "pyspark": emr_pyspark,
        "s3": S3Resource(),
    },
)
```

### About AWS EMR

**AWS EMR** (Elastic MapReduce) is a cloud big data platform for processing vast amounts of data using open source tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto. It simplifies running big data frameworks, allowing you to process and analyze large datasets quickly and cost-effectively. AWS EMR provides the scalability, flexibility, and reliability needed to handle complex data processing tasks, making it an ideal choice for data engineers and scientists.

2 comments on commit 6dca8c2

@github-actions github-actions bot commented on 6dca8c2 Sep 11, 2024

Deploy preview for dagster-docs-beta ready!

✅ Preview
https://dagster-docs-beta-2locj6pyt-elementl.vercel.app
https://dagster-docs-beta.dagster-docs.io

Built with commit 6dca8c2.
This pull request is being automatically deployed with vercel-action

@github-actions

Deploy preview for dagster-docs ready!

✅ Preview
https://dagster-docs-8no15cmtx-elementl.vercel.app
https://master.dagster.dagster-docs.io

Built with commit 6dca8c2.
This pull request is being automatically deployed with vercel-action
