Add Passing Data Between Assets Guide #23598

Merged: 11 commits merged into docs/revamp from pdrm/add-passing-data on Aug 13, 2024


Contributor

@PedramNavid PedramNavid commented Aug 12, 2024

Added an initial Guide for quick feedback.

Added a collapsible pre-req block.

Walk through three different ways of passing data between assets. I don't know how I feel about the last one.

Also created a new Component called CodeExample which lets you embed and highlight code blocks.

<CodeExample filePath="guides/passing-data-assets/passing-data-explicit.py" language="python" title="Using External Storage" />

I have taken inspiration on how to write How To Guides from here: https://diataxis.fr/how-to-guides/

Outstanding Questions

  • Is the content accurate?
  • Is it at the right level of detail for a How To?
  • Do we like the collapsible pre-req code block?
  • How does the sidebar experience feel? Currently Guides > Data Assets > How to pass data between assets

@graphite-app graphite-app bot added the area: docs Related to documentation in general label Aug 12, 2024
@graphite-app graphite-app bot requested a review from erinkcochran87 August 12, 2024 23:02
Contributor Author

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

This stack of pull requests is managed by Graphite. Learn more about stacking.


netlify bot commented Aug 12, 2024

Deploy Preview for dagsterapidocs canceled.

Name Link
🔨 Latest commit 8ae090a
🔍 Latest deploy log https://app.netlify.com/sites/dagsterapidocs/deploys/66bbb3ce5dfcdd0008c189db

@PedramNavid PedramNavid changed the title This adds the first How To Guide, Passing Data Between Assets. Add Passing Data Between Assets Guide Aug 12, 2024

github-actions bot commented Aug 12, 2024

Deploy preview for dagster-docs ready!

Preview available at https://dagster-docs-ivp0ss0rf-elementl.vercel.app
https://pdrm-add-passing-data.dagster.dagster-docs.io

Direct link to changed pages:

@PedramNavid PedramNavid force-pushed the pdrm/add-passing-data branch from 230b5f0 to 64e2508 Compare August 12, 2024 23:08
@@ -0,0 +1,122 @@
---
Contributor Author

@jamiedemaria Could I bother you to review this for accuracy?

It also creates a new Component called CodeExample which lets you embed
and highlight code blocks.
@PedramNavid PedramNavid force-pushed the pdrm/add-passing-data branch from 64e2508 to c706397 Compare August 12, 2024 23:21
@PedramNavid PedramNavid force-pushed the pdrm/add-passing-data branch from 95aea5f to fee23c4 Compare August 13, 2024 17:27

github-actions bot commented Aug 13, 2024

Deploy preview for dagster-docs-next ready!

✅ Preview
https://dagster-docs-next-3ofu98n0t-elementl.vercel.app

Built with commit 8ae090a.
This pull request is being automatically deployed with vercel-action

Contributor

@slopp slopp left a comment


The length, format, and code examples make sense to me. Minor comments in-line, and one major suggestion: add a fourth example of assets that don't pass (or process) data at all. I suggest this fourth example because I think it would resonate with teams coming from Airflow and "fills out" the mental model for how assets can be used to implement data pipelines.

In Dagster, assets are the building blocks of your data pipeline and it's common to want to pass data between them. This guide will help you understand how to pass data between assets.

There are three ways of passing data between assets:
Contributor

I think it would be useful to include a fourth "fake" case which is "you do not pass data between assets because your pipeline is not processing data directly"

This would be something like:

from dagster import asset


@asset
def people():
    """Call the Lambda function that loads people."""


@asset
def birds():
    """Call the Lambda function that loads birds."""


# deps declares ordering only; no data is handed to people_and_birds.
@asset(deps=[people, birds])
def people_and_birds():
    """Call the stored procedure that concats people and birds."""

@PedramNavid
Contributor Author

Merging this to keep things moving, feel free to continue reviewing however.

@PedramNavid PedramNavid merged commit 387b7a5 into docs/revamp Aug 13, 2024
7 of 8 checks passed
@PedramNavid PedramNavid deleted the pdrm/add-passing-data branch August 13, 2024 22:32
Contributor

@jamiedemaria jamiedemaria left a comment


I really like this breakdown into the different approaches. I agree with slopp that the fourth case would be really useful.

One thing I'm curious about is framing this as "passing data" vs like storing and loading assets. I think "passing data" is probably what a new dagster user would be looking to figure out how to do, but I think it also has an implication that the data is transitory. I think the final example does a good job of communicating "the output of each asset should be the actual data asset you want stored, not an intermediate state". Maybe a paragraph at the beginning that sets up that framework for the reader would be helpful

Also left a couple small copy-edit things I noticed while reading


This example works for local development, but in a production environment
each step would execute in a separate environment and would not have access to the same file system. Consider a cloud-hosted environment for production purposes.
Contributor

"Consider a cloud hosted environment for production purposes"

nit: is this referring to cloud hosted storage? using "environment" to refer to two different things (where the computation is performed and where the database is hosted) confused me a bit

2. **Output**: Writing data to the configured storage location.

For a deeper understanding of IO Managers, check out the [Understanding IO Managers](/concepts/io-managers) guide.
Contributor

nit: IO -> I/O (couple other places where this applies as well)

<CodeExample filePath="guides/data-assets/passing-data-assets/passing-data-rewrite-assets.py" language="python" title="Avoid Passing Data Between Assets" />

This approach still handles passing data explicitly, but no longer does it across assets,
Contributor

I think there's a word missing in this sentence

To follow the steps in this guide, you'll need:

- A basic understanding of Dagster concepts such as assets and resources
Contributor

Something I really appreciate when I read other docs sites is when they link to a page that has the information I need for a basic understanding (so maybe the concept pages for us?). Then I can open it and think "I've read this, I'm good", or, if it's something I know nothing about, I'm given the resource I need to learn it.
