add github workflow for linting
PedramNavid committed Aug 21, 2024
1 parent 09d6005 commit 972e9d1
Showing 42 changed files with 448 additions and 14,250 deletions.
12 changes: 12 additions & 0 deletions .github/workflows/build-docs-revamp.yml
@@ -22,6 +22,18 @@ jobs:
      - name: Checkout docs/revamp branch
        uses: actions/checkout@v4

      - name: Install node
        uses: actions/setup-node@v4
        with:
          node-version: 18
          cache: 'yarn'
          cache-dependency-path: 'docs/docs-beta/yarn.lock'

      - name: Run Linting
        run: |
          cd docs/docs-beta
          yarn run lint
      - name: Publish Preview to Vercel
        uses: amondnet/vercel-action@v25
        with:
6 changes: 2 additions & 4 deletions docs/docs-beta/.remarkrc.js
@@ -1,5 +1,3 @@
module.exports = {
  plugins: ['remark-frontmatter'],
};
16 changes: 8 additions & 8 deletions docs/docs-beta/.vscode/extensions.json
@@ -1,9 +1,9 @@
{
  "recommendations": [
    "dbaeumer.vscode-eslint",
    "unifiedjs.vscode-mdx",
    "esbenp.prettier-vscode",
    "mrmlnc.vscode-remark",
    "chrischinchilla.vale-vscode"
  ]
}
10 changes: 4 additions & 6 deletions docs/docs-beta/README.md
@@ -5,7 +5,7 @@ The documentation site is built using [Docusaurus](https://docusaurus.io/), a mo

### Installation

The site uses [pnpm](https://pnpm.io/) for package management.
It also uses [vale](https://vale.sh/) to check for issues in the documentation.

Install dependencies with:
@@ -21,9 +21,10 @@ Code in `./src` contains custom components, styles, themes, and layouts.
Code in `./content-templates` contains the templates for the documentation pages.
Code in `./docs/` is the source of truth for the documentation.

`./docs/code_examples` contains all code examples for the documentation.

The docs are broken down into the following sections:

- [Tutorials](./docs/tutorials/)
- [Guides](./docs/guides/)
- [Concepts](./docs/concepts/)
@@ -40,7 +41,6 @@ pnpm start

This command starts a local development server and opens up a browser window. Most changes are reflected live without having to restart the server. Access the website at [http://localhost:3050](http://localhost:3050).


To lint the documentation for issues:

```bash
@@ -53,8 +53,6 @@ To autofix linting issues and format with prettier:
pnpm lint:fix
```



### Build

To build the site for production:
@@ -63,4 +61,4 @@ To build the site for production:
pnpm build
```

This command generates static content into the `build` directory and can be served using any static content hosting service.
35 changes: 12 additions & 23 deletions docs/docs-beta/content-templates/concept.md
@@ -5,49 +5,45 @@ description: ''

# [TOPIC]

This section is an intro that includes:

- A brief description of what the topic is,
- An example of how it could be used in the real-world
- What it can do in the UI

---

## Benefits

This section lists the benefits of using the topic, whatever it is. The items listed here should be solutions to real-world problems that the user cares about, ex:

Using schedules helps you:

- Predictably process and deliver data to stakeholders and business-critical applications
- Consistently run data pipelines without the need for manual intervention
- Optimize resource usage by scheduling pipelines to run during off-peak hours

Using [TOPIC] helps you:

- A benefit of the thing
- Another benefit
- And one more

---

## Prerequisites

<!-- This section lists the prerequisites users must complete before they should/can proceed. For concepts, we should list the other concepts they should be familiar with first. -->

Before continuing, you should be familiar with:

- Ex: To use asset checks, users should understand Asset definitions first
- Another one
- One more

---

## How it works

This section provides a high-level overview of how the concept works without getting too into the technical details. Code can be shown here, but this section shouldn't focus on it. The goal is to help the user generally understand how the thing works and what they need to do to get it working without overwhelming them with details.

For example, this is the How it works for Schedules:

@@ -57,40 +53,33 @@ Schedules run jobs at fixed time intervals and have two main components:

- A cron expression, which defines when the schedule runs. Simple and complex schedules are supported, allowing you to have fine-grained control over when runs are executed. With cron syntax, you can:

  - Create custom schedules like Every hour from 9:00AM - 5:00PM with cron expressions (0 9-17 \* \* \*)
  - Quickly create basic schedules like Every day at midnight with predefined cron definitions (@daily, @midnight)

To make creating cron expressions easier, you can use an online tool like Crontab Guru. This tool allows you to create and describe cron expressions in a human-readable format and test the execution dates produced by the expression. Note: While this tool is useful for general cron expression testing, always remember to test your schedules in Dagster to ensure the results are as expected.

For a schedule to run, it must be turned on and an active dagster-daemon process must be running. If you used dagster dev to start the Dagster UI/webserver, the daemon process will be automatically launched alongside the webserver.

After these criteria are met, the schedule will run at the interval specified in the cron expression. Schedules will execute in UTC by default, but you can specify a custom timezone.
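
For illustration, a minimal schedule built from these pieces might look like the following sketch (the op and job are stand-ins, not part of this template):

```python
from dagster import ScheduleDefinition, job, op


@op
def say_hello():
    print("hello")


@job
def my_job():
    say_hello()


# Runs every hour from 9:00AM-5:00PM in a non-UTC timezone
my_schedule = ScheduleDefinition(
    job=my_job,
    cron_schedule="0 9-17 * * *",
    execution_timezone="US/Central",
)
```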


---

## Getting started

This section is a list of guides / links to pages to help the user get started using the topic.

Check out these guides to get started with [CONCEPT]:

From here, you can:

<!-- A list of things the user can do once they've got the basics down. These could be links to additional guides, ex:
- Construct schedules to run partitioned jobs
- Execute jobs in specific timezones
- Learn to test your schedules
- Identify and resolve common issues with our troubleshooting guide -->

### Limitations [and notes]

<!-- This section should describe any known limitations that could impact the user, ex: "Schedules will execute in UTC unless a timezone is specified" -->

---

## Related

A list of related links and resources
6 changes: 0 additions & 6 deletions docs/docs-beta/content-templates/example-reference.md
@@ -18,17 +18,13 @@ This reference contains a variety of examples using Dagster [TOPIC]. Each exampl

This example demonstrates [description of what the example accomplishes].

<!-- Example: This example demonstrates how to use resources in schedules. To specify a resource dependency, annotate the resource as a parameter to the schedule's function. -->

```python title="my_schedule.py"
@schedule(job=my_job, cron_schedule="* * * * *")
def logs_then_skips(context):
    context.log.info("Logging from a schedule!")
    return SkipReason("Nothing to do")
```

<!-- We need to fix the base table implementation before launch. This is a must. -->

| | |
| -------------------- | --- |
| Notes | |
@@ -37,8 +33,6 @@ def logs_then_skips(context):

---

<!-- This section lists a few additional sources of inspiration, such as DOP and GitHub discussions. You shouldn't need to change anything here. -->

import InspirationList from '../partials/\_InspirationList.md';

<InspirationList />
8 changes: 4 additions & 4 deletions docs/docs-beta/content-templates/guide-with-steps.md
Expand Up @@ -23,20 +23,20 @@ To follow the steps in this guide, you'll need:

</details>

## Step 1: Title that describes what this step will do

For section / step headings:

- Titles should describe an action, ex: "Generate a token"
- Don't use gerunds (-ing) in titles, as it can cause issues with translation + SEO
- Each section heading should have an identifier that includes the word 'step' and the number of the step, ex: `#step-1`

### Step 1.1: Title that describes a substep

If a step would benefit from being broken into smaller steps, follow this section's formatting.
Each substep should get an H3 heading and start with "Step N.", followed by the number of the substep (for example, "Step 1.1").

## Step 2: Another step

## Next steps

5 changes: 5 additions & 0 deletions docs/docs-beta/docs/guides/automation/configuring.md
@@ -0,0 +1,5 @@
---
title: Configuring pipelines and runs
sidebar_label: Configuring pipelines
sidebar_position: 40
---
122 changes: 122 additions & 0 deletions docs/docs-beta/docs/guides/data-modeling/asset-dependencies.md
@@ -0,0 +1,122 @@
---
title: Pass data between assets
description: Learn how to pass data between assets in Dagster
sidebar_position: 30
last_update:
date: 2024-08-11
author: Pedram Navid
---

In Dagster, assets are the building blocks of your data pipeline, and it's common to want to pass data between them. This guide explains the different approaches and when to use each.

There are three ways of passing data between assets:

- Explicitly managing data by using external storage
- Implicitly managing data, using I/O managers
- Avoiding passing data between assets altogether by combining several tasks into a single asset

This guide walks through all three methods.

---

<details>
<summary>Prerequisites</summary>

To follow the steps in this guide, you'll need:

- A basic understanding of Dagster concepts such as assets and resources
- Dagster and the `dagster-duckdb-pandas` package installed
</details>

---

## Move data between assets explicitly using external storage

A common and recommended approach to passing data between assets is explicitly managing data using external storage. This example pipeline uses a SQLite database as external storage:

<CodeExample filePath="guides/data-assets/passing-data-assets/passing-data-explicit.py" language="python" title="Using External Storage" />

In this example, the first asset opens a connection to the SQLite database and writes data to it. The second asset opens a connection to the same database and reads data from it. The dependency between the two assets is made explicit through the downstream asset's `deps` argument.
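
The example file referenced above isn't shown in this diff, but a minimal sketch of the pattern (asset, table, and file names are assumed for illustration) might look like this:

```python
import sqlite3

import pandas as pd
from dagster import asset

DB_PATH = "people.db"  # assumed local SQLite file used as external storage


@asset
def raw_people() -> None:
    # Write data directly to external storage instead of returning it
    data = pd.DataFrame({"name": ["Alice", "Bob"], "age": [34, 28]})
    with sqlite3.connect(DB_PATH) as conn:
        data.to_sql("people", conn, if_exists="replace", index=False)


@asset(deps=[raw_people])  # the dependency is declared explicitly via `deps`
def adults() -> None:
    # Read from the same external storage and write a derived table back
    with sqlite3.connect(DB_PATH) as conn:
        people = pd.read_sql("SELECT * FROM people WHERE age >= 18", conn)
        people.to_sql("adults", conn, if_exists="replace", index=False)
```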

The benefits of this approach are:

- It's explicit and easy to understand how data is stored and retrieved
- You have maximum flexibility in terms of how and where data is stored, for example, based on environment

The downsides of this approach are:

- You need to manage connections and transactions manually
- You need to handle errors and edge cases, for example, if the database is down or if a connection is closed

## Move data between assets implicitly using I/O managers

Dagster's I/O managers are a powerful feature that manages data between assets by defining how data is read from and written to external storage. They help separate business logic from I/O operations, reducing boilerplate code and making it easier to change where data is stored.

I/O managers handle:

1. **Input**: Reading data from storage and loading it into memory for use by dependent assets.
2. **Output**: Writing data to the configured storage location.

For a deeper understanding of I/O managers, check out the [Understanding I/O managers](/concepts/io-managers) guide.

<CodeExample filePath="guides/data-assets/passing-data-assets/passing-data-io-manager.py" language="python" title="Using I/O managers" />

In this example, a `DuckDBPandasIOManager` is configured to use a local file. The I/O manager handles both reading from and writing to the database.

:::warning

This example works for local development, but in production each step would typically execute in a separate environment and would not have access to the same file system. For production, consider a cloud-hosted storage backend instead.

:::

The `people()` and `birds()` assets both write their DataFrames to DuckDB for persistent storage. The `combined_data()` asset requests data from both assets by adding them as parameters to its function, and the I/O manager handles reading them from DuckDB and making them available to `combined_data` as DataFrames. **Note**: When you use I/O managers, you don't need to manually add the asset's dependencies through the `deps` argument.
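
As a rough sketch of that pattern (asset names taken from the description above; the DuckDB file path is assumed):

```python
import pandas as pd
from dagster import Definitions, asset
from dagster_duckdb_pandas import DuckDBPandasIOManager


@asset
def people() -> pd.DataFrame:
    # The I/O manager persists this DataFrame to DuckDB
    return pd.DataFrame({"name": ["Alice", "Bob"], "age": [34, 28]})


@asset
def birds() -> pd.DataFrame:
    return pd.DataFrame({"name": ["pigeon", "owl"], "wingspan_cm": [60, 110]})


@asset
def combined_data(people: pd.DataFrame, birds: pd.DataFrame) -> pd.DataFrame:
    # Upstream outputs are read back from DuckDB and passed in as DataFrames
    return pd.concat([people.assign(kind="person"), birds.assign(kind="bird")])


defs = Definitions(
    assets=[people, birds, combined_data],
    resources={"io_manager": DuckDBPandasIOManager(database="local.duckdb")},
)
```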

The benefits of this approach are:

- The reading and writing of data is handled by the I/O manager, reducing boilerplate code
- It's easy to swap out different I/O managers based on environments without changing the underlying asset computation

The downsides of this approach are:

- The I/O manager approach is less flexible if you need to customize how data is read from or written to storage
- The I/O manager makes some decisions for you, such as naming conventions, which can be hard to override

## Avoid passing data between assets by combining assets

In some cases, you can avoid passing data between assets altogether by carefully considering how you have modeled your pipeline.

Consider this example:

<CodeExample filePath="guides/data-assets/passing-data-assets/passing-data-avoid.py" language="python" title="Avoid Passing Data Between Assets" />

This example downloads a zip file from Google Drive, unzips it, and loads the data into a pandas DataFrame. It relies on each asset running on the same file system to perform these operations.

The assets are modeled as tasks, rather than as data assets. For more information on the difference between tasks and data assets, check out the [Thinking in Assets](/concepts/assets/thinking-in-assets) guide.

In this refactor, the `download_files`, `unzip_files`, and `load_data` assets are combined into a single asset, `my_dataset`. This asset downloads the files, unzips them, and loads the data into a data warehouse.

<CodeExample filePath="guides/data-assets/passing-data-assets/passing-data-rewrite-assets.py" language="python" title="Avoid Passing Data Between Assets" />

This approach still handles data movement explicitly, but it does so within a single asset rather than across assets. The pipeline still assumes there is enough disk space and memory available to handle the data, but for smaller datasets it can work well.
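
A sketch of what the combined asset might look like (the URL, archive layout, and final load step are placeholders):

```python
import tempfile
import urllib.request
import zipfile
from pathlib import Path

import pandas as pd
from dagster import asset

ARCHIVE_URL = "https://example.com/my_dataset.zip"  # placeholder URL


@asset
def my_dataset() -> pd.DataFrame:
    # Download, unzip, and load all happen inside a single asset,
    # so no intermediate data needs to be passed between assets.
    with tempfile.TemporaryDirectory() as tmp_dir:
        archive_path = Path(tmp_dir) / "data.zip"
        urllib.request.urlretrieve(ARCHIVE_URL, archive_path)

        with zipfile.ZipFile(archive_path) as archive:
            archive.extractall(tmp_dir)

        # Assumes the archive contains a single CSV file
        csv_path = next(Path(tmp_dir).glob("*.csv"))
        return pd.read_csv(csv_path)
```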

The benefits of this approach are:

- All the computation that defines how an asset is created is contained within a single asset, making it easier to understand and maintain
- It can be faster than relying on external storage, and doesn't require the overhead of setting up additional compute instances.

The downsides of this approach are:

- It makes certain assumptions about how much data is being processed
- It can be difficult to reuse functions across assets, since they're tightly coupled to the data they produce
- It may not always be possible to swap functionality based on the environment you are running in. For example, if you are running in a cloud environment, you may not have access to the local file system.

---

## Related resources

TODO: add links to relevant API documentation here.
@@ -0,0 +1,5 @@
---
title: Configuring assets and ops
sidebar_label: Configuring assets
sidebar_position: 50
---
@@ -1,4 +1,5 @@
---
title: "Creating asset factories"
sidebar_position: 60
sidebar_label: "Creating asset factories"
---
5 changes: 5 additions & 0 deletions docs/docs-beta/docs/guides/data-modeling/external-assets.md
@@ -0,0 +1,5 @@
---
title: Representing external data sources with external assets
sidebar_position: 80
sidebar_label: "Representing external data sources"
---