integration docs beta 1/ replicate all integration pages from mkt site to beta docs (#24330)

## Summary & Motivation

Copy over from https://dagster.io/integrations (note: this content is up to date per PR stack dagster-io/dagster-website#1280 from a few weeks ago).

This PR made the following changes:

1. Updated titles to "Dagster & <name>".
2. Added `sidebar_label: <name>` so the left nav won't show a wall of "Dagster &".
3. Fixed all Vale errors and warnings, including a lot of Vale accept additions.
4. Renamed files from `dagster-<name>.mdx` to just `<name>.md`.

Next steps in a later stack:

- Move code to Python files.
- Improve navigation: an index page and/or left nav to bucket integrations into categories, and differentiate community-owned integrations.

Later steps:

- Improve doc content page by page (e.g. template guides; reuse the good ones from the current docs site).

**Open discussion:** Figure out the relationship between the docs and marketing site regarding integrations.

* Option 1: drop dagster.io/integrations and redirect it to docs.dagster.io/integrations.
  * Yuhan's pick: I'm leaning towards this to completely consolidate all integration content into the docs site, for simplicity and ease of navigation, so there won't be similar content on two different sites; but I'd need to check the SEO implications of this option.
* Option 2: keep both dagster.io/integrations and docs.dagster.io/integrations; no code on the marketing site (SEO purposes only), while the docs pages focus on more technical guides/references.

## How I Tested These Changes

**See in preview: https://dagster-docs-beta-211qncb7r-elementl.vercel.app/integrations**

## Changelog

`NOCHANGELOG`

---------

Co-authored-by: colton <[email protected]>
Showing 56 changed files with 3,061 additions and 13 deletions.
```diff
@@ -1,5 +1,6 @@
 ---
-title: "Integrations"
+title: 'Integrations'
+displayed_sidebar: 'integrations'
 ---

 # Integrations
```
@@ -0,0 +1,52 @@

---
layout: Integration
status: published
name: Airbyte
title: Dagster & Airbyte
sidebar_label: Airbyte
excerpt: Orchestrate Airbyte connections and schedule syncs alongside upstream or downstream dependencies.
date: 2022-11-07
apireflink: https://docs.dagster.io/_apidocs/libraries/dagster-airbyte
docslink: https://docs.dagster.io/integrations/airbyte
partnerlink: https://airbyte.com/tutorials/orchestrate-data-ingestion-and-transformation-pipelines
logo: /integrations/airbyte.svg
categories:
  - ETL
enabledBy:
enables:
---

### About this integration

Using this integration, you can trigger Airbyte syncs and orchestrate your Airbyte connections from within Dagster, making it easy to chain an Airbyte sync with upstream or downstream steps in your workflow.

### Installation

```bash
pip install dagster-airbyte
```

### Example

```python
from dagster import EnvVar
from dagster_airbyte import AirbyteResource, load_assets_from_airbyte_instance

# Connect to your OSS Airbyte instance
airbyte_instance = AirbyteResource(
    host="localhost",
    port="8000",
    # If using basic auth, include username and password:
    username="airbyte",
    password=EnvVar("AIRBYTE_PASSWORD"),
)

# Load all assets from your Airbyte instance
airbyte_assets = load_assets_from_airbyte_instance(airbyte_instance)
```
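The loaded assets can then be registered with Dagster. A minimal sketch of the wiring (an addition for illustration, using the standard `Definitions` API rather than anything from the original example):

```python
from dagster import Definitions

# Register the Airbyte-derived assets so they appear in the asset graph;
# `airbyte_assets` comes from load_assets_from_airbyte_instance above.
defs = Definitions(assets=[airbyte_assets])
```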
### About Airbyte

**Airbyte** is an open source data integration engine that helps you consolidate your SaaS application and database data into your data warehouses, lakes, and databases.
@@ -0,0 +1,48 @@

---
layout: Integration
status: published
name: AWS Athena
title: Dagster & AWS Athena
sidebar_label: AWS Athena
excerpt: This integration allows you to connect to AWS Athena and analyze data in Amazon S3 using standard SQL within your Dagster pipelines.
date: 2024-06-21
apireflink: https://docs.dagster.io/_apidocs/libraries/dagster-aws
docslink:
partnerlink: https://aws.amazon.com/
logo: /integrations/aws-athena.svg
categories:
  - Storage
enabledBy:
enables:
---

### About this integration

This integration allows you to connect to AWS Athena, a serverless interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Using this integration, you can issue queries to Athena, fetch results, and handle query execution states within your Dagster pipelines.

### Installation

```bash
pip install dagster-aws
```

### Examples

```python
from dagster import Definitions, asset
from dagster_aws.athena import AthenaClientResource


@asset
def example_athena_asset(athena: AthenaClientResource):
    return athena.get_client().execute_query("SELECT 1", fetch_results=True)


defs = Definitions(
    assets=[example_athena_asset], resources={"athena": AthenaClientResource()}
)
```
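To try the asset outside of a full deployment, it can be executed directly in a script. This is a minimal sketch (not part of the original page) using Dagster's standard `materialize` helper; it assumes valid AWS credentials are available in the environment:

```python
from dagster import materialize

# Executes the asset in-process; the Athena query runs against your account.
result = materialize(
    [example_athena_asset],
    resources={"athena": AthenaClientResource()},
)
assert result.success
```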
### About AWS Athena

AWS Athena is a serverless, interactive query service that allows you to analyze data directly in Amazon S3 using standard SQL. Athena is easy to use: point to your data in Amazon S3, define the schema, and start querying with standard SQL. Most results are delivered within seconds. With Athena, there is no infrastructure to set up, and you pay only for the queries you run. It scales automatically, executing queries in parallel, so results are fast even with large datasets and complex queries.
@@ -0,0 +1,63 @@

---
layout: Integration
status: published
name: AWS CloudWatch
title: Dagster & AWS CloudWatch
sidebar_label: AWS CloudWatch
excerpt: This integration allows you to send Dagster logs to AWS CloudWatch, enabling centralized logging and monitoring of your Dagster jobs.
date: 2024-06-21
apireflink: https://docs.dagster.io/_apidocs/libraries/dagster-aws
docslink:
partnerlink: https://aws.amazon.com/
logo: /integrations/aws-cloudwatch.svg
categories:
  - Monitoring
enabledBy:
enables:
---

### About this integration

This integration allows you to send Dagster logs to AWS CloudWatch, enabling centralized logging and monitoring of your Dagster jobs. By using AWS CloudWatch, you can take advantage of its powerful log management features, such as real-time log monitoring, log retention policies, and alerting capabilities.

Using this integration, you can configure your Dagster jobs to log directly to AWS CloudWatch, making it easier to track and debug your workflows. This is particularly useful for production environments where centralized logging is essential for maintaining observability and operational efficiency.

### Installation

```bash
pip install dagster-aws
```

### Examples

```python
import dagster as dg
from dagster_aws.cloudwatch import cloudwatch_logger


@dg.asset
def my_asset(context: dg.AssetExecutionContext):
    context.log.info("Hello, CloudWatch!")
    context.log.error("This is an error")
    context.log.debug("This is a debug message")


defs = dg.Definitions(
    assets=[my_asset],
    loggers={
        "cloudwatch_logger": cloudwatch_logger,
    },
)
```
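At launch time the logger still needs to know where to write. A hedged sketch of the run config you might supply (for example in the Launchpad); the group and stream names below are hypothetical, and the config keys are assumptions based on the `cloudwatch_logger` config schema in `dagster-aws`:

```python
# Run config directing the logger at a CloudWatch destination.
# Names below are placeholders; substitute your own log group/stream.
run_config = {
    "loggers": {
        "cloudwatch_logger": {
            "config": {
                "log_group_name": "/dagster/example",  # hypothetical log group
                "log_stream_name": "example-stream",  # hypothetical log stream
                "aws_region": "us-west-1",
            }
        }
    }
}
```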
### About AWS CloudWatch

AWS CloudWatch is a monitoring and observability service provided by Amazon Web Services (AWS). It allows you to collect, access, and analyze performance and operational data from a variety of AWS resources, applications, and services. With AWS CloudWatch, you can set up alarms, visualize logs and metrics, and gain insights into your infrastructure and applications to ensure they're running smoothly.

AWS CloudWatch provides features such as:

- Real-time monitoring: Track the performance of your applications and infrastructure in real time.
- Log management: Collect, store, and analyze log data from various sources.
- Alarms and notifications: Set up alarms to automatically notify you of potential issues.
- Dashboards: Create custom dashboards to visualize metrics and logs.
- Integration with other AWS services: Seamlessly integrate with other AWS services for a comprehensive monitoring solution.
@@ -0,0 +1,58 @@

---
layout: Integration
status: published
name: AWS ECR
title: Dagster & AWS ECR
sidebar_label: AWS ECR
excerpt: This integration allows you to connect to AWS Elastic Container Registry (ECR), enabling you to manage your container images more effectively in your Dagster pipelines.
date: 2024-06-21
apireflink: https://docs.dagster.io/_apidocs/libraries/dagster-aws
docslink:
partnerlink: https://aws.amazon.com/
logo: /integrations/aws-ecr.svg
categories:
  - Other
enabledBy:
enables:
---

### About this integration

This integration allows you to connect to AWS Elastic Container Registry (ECR). It provides resources to interact with AWS ECR, enabling you to manage your container images.

Using this integration, you can seamlessly integrate AWS ECR into your Dagster pipelines, making it easier to manage and deploy containerized applications.

### Installation

```bash
pip install dagster-aws
```

### Examples

```python
from dagster import asset, Definitions
from dagster_aws.ecr import ECRPublicResource


@asset
def get_ecr_login_password(ecr_public: ECRPublicResource):
    return ecr_public.get_client().get_login_password()


defs = Definitions(
    assets=[get_ecr_login_password],
    resources={
        "ecr_public": ECRPublicResource(
            region_name="us-west-1",
            aws_access_key_id="your_access_key_id",
            aws_secret_access_key="your_secret_access_key",
            aws_session_token="your_session_token",
        )
    },
)
```
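The login password is typically handed to a container client. A minimal sketch of that handoff (an illustration, not part of the original example); it assumes the Docker CLI is installed, ambient AWS credentials are available, and `public.ecr.aws` is the target registry:

```python
import subprocess

from dagster_aws.ecr import ECRPublicResource

# Fetch a temporary ECR Public password and pipe it to `docker login`.
password = ECRPublicResource(region_name="us-west-1").get_client().get_login_password()
subprocess.run(
    ["docker", "login", "--username", "AWS", "--password-stdin", "public.ecr.aws"],
    input=password.encode(),
    check=True,
)
```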
### About AWS ECR

AWS Elastic Container Registry (ECR) is a fully managed Docker container registry that makes it easy for developers to store, manage, and deploy Docker container images. AWS ECR is integrated with Amazon Elastic Kubernetes Service (EKS), simplifying your development-to-production workflow. With ECR, you can securely store and manage your container images and easily integrate with your existing CI/CD pipelines. AWS ECR provides high availability and scalability, ensuring that your container images are always available when you need them.
@@ -0,0 +1,96 @@

---
layout: Integration
status: published
name: AWS EMR
title: Dagster & AWS EMR
sidebar_label: AWS EMR
excerpt: The AWS EMR integration allows you to seamlessly integrate AWS EMR into your Dagster pipelines for petabyte-scale data processing using open source tools like Apache Spark, Hive, Presto, and more.
date: 2024-06-21
apireflink: https://docs.dagster.io/_apidocs/libraries/dagster-aws
docslink:
partnerlink: https://aws.amazon.com/
logo: /integrations/aws-emr.svg
categories:
  - Compute
enabledBy:
enables:
---

### About this integration

The `dagster-aws` integration provides ways of orchestrating data pipelines that leverage AWS services, including AWS EMR (Elastic MapReduce). This integration allows you to run and scale big data workloads using open source tools such as Apache Spark, Hive, Presto, and more.

Using this integration, you can:

- Seamlessly integrate AWS EMR into your Dagster pipelines.
- Utilize EMR for petabyte-scale data processing.
- Easily manage and monitor EMR clusters and jobs from within Dagster.
- Leverage Dagster's orchestration capabilities to handle complex data workflows involving EMR.

### Installation

```bash
pip install dagster-aws
```

### Examples

```python
from pathlib import Path
from typing import Any

from dagster import Definitions, ResourceParam, asset
from dagster_aws.emr import emr_pyspark_step_launcher
from dagster_aws.s3 import S3Resource
from dagster_pyspark import PySparkResource
from pyspark.sql import DataFrame, Row
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

emr_pyspark = PySparkResource(spark_config={"spark.executor.memory": "2g"})


@asset
def people(
    pyspark: PySparkResource, pyspark_step_launcher: ResourceParam[Any]
) -> DataFrame:
    schema = StructType(
        [StructField("name", StringType()), StructField("age", IntegerType())]
    )
    rows = [
        Row(name="Thom", age=51),
        Row(name="Jonny", age=48),
        Row(name="Nigel", age=49),
    ]
    return pyspark.spark_session.createDataFrame(rows, schema)


@asset
def people_over_50(
    pyspark_step_launcher: ResourceParam[Any], people: DataFrame
) -> DataFrame:
    return people.filter(people["age"] > 50)


defs = Definitions(
    assets=[people, people_over_50],
    resources={
        "pyspark_step_launcher": emr_pyspark_step_launcher.configured(
            {
                "cluster_id": {"env": "EMR_CLUSTER_ID"},
                "local_pipeline_package_path": str(Path(__file__).parent),
                "deploy_local_pipeline_package": True,
                "region_name": "us-west-1",
                "staging_bucket": "my_staging_bucket",
                "wait_for_logs": True,
            }
        ),
        "pyspark": emr_pyspark,
        "s3": S3Resource(),
    },
)
```
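Note that the step launcher resolves the target cluster from the `EMR_CLUSTER_ID` environment variable (via the `{"env": "EMR_CLUSTER_ID"}` config above). A minimal sketch, with a hypothetical cluster ID:

```python
import os

# Hypothetical cluster ID; the configured step launcher reads
# EMR_CLUSTER_ID from the environment when the run starts.
os.environ["EMR_CLUSTER_ID"] = "j-1ABCDEFGHIJKL"
```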
### About AWS EMR

**AWS EMR** (Elastic MapReduce) is a cloud big data platform for processing vast amounts of data using open source tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto. It simplifies running big data frameworks, allowing you to process and analyze large datasets quickly and cost-effectively. AWS EMR provides the scalability, flexibility, and reliability needed to handle complex data processing tasks, making it an ideal choice for data engineers and scientists.
Deploy preview for dagster-docs-beta ready!

✅ Preview: https://dagster-docs-beta-2locj6pyt-elementl.vercel.app
https://dagster-docs-beta.dagster-docs.io

Built with commit 6dca8c2. This pull request is being automatically deployed with vercel-action.
Deploy preview for dagster-docs ready!

✅ Preview: https://dagster-docs-8no15cmtx-elementl.vercel.app
https://master.dagster.dagster-docs.io

Built with commit 6dca8c2. This pull request is being automatically deployed with vercel-action.