[docs] [pipes] - Add Databricks integration guide #17114

Merged
smackesey merged 12 commits into master from erin/pipes-databricks
Nov 7, 2023

Conversation

Contributor

@erinkcochran87 erinkcochran87 commented Oct 10, 2023

Summary & Motivation

This PR adds a guide for integrating Dagster Pipes with Databricks.

TODO/?s:

  • Finish descriptions of SubmitTask spec
  • Finish UI section
  • Check in code examples
  • Check on some of the PyObjects - may be out of sync due to changes in libraries
  • Add info about sending data back to Dagster (Step 2)

How I Tested These Changes

👀 , bk

@erinkcochran87
Contributor Author

@yuhan @smackesey - Here's a rough first pass of the Databricks + Pipes guide. There are still some TODOs/Questions (see PR description) that I could use help with.

Sean - do you have a working example I can use to take screenshots of the Dagster UI?

Collaborator

@smackesey smackesey left a comment

Thanks for this!

A lot of comments-- there are many (and will be more) subtle things that I think you won't know very well since you don't know Databricks/Pipes the way I do, having been banging my head against them for weeks.

I think the most efficient thing will be for you to incorporate (or dispute) the comments I provided here and make whatever other changes you want to where you are semi-satisfied, and I can follow up directly with tweaks rather than further commenting. We could do this by either you merging a draft or me just taking over pushing to the PR, up to you.


<Note>
<strong>Heads up!</strong> This guide focuses on using an out-of-the-box
Databricks resource. For further customization, use the{" "}
Collaborator

"Databricks resource" -> "Databricks Pipes client"

Contributor Author

But it (sort of?) functions like a resource, and is configured like one. Resource is a term Dagster users are already familiar with, whereas this is a new term. Do you feel strongly about changing this?


- `spark_python_task` - An object specifying the Python file the job should run, which contains the following properties:
- `python_file` - The URI of the Python file to run, located in DBFS.
- `source` - TODO - is this required?
Collaborator

I don't think we should attempt to formally document the various fields here. This is part of the Databricks API and has nothing to do with Dagster per se. We might just say something like:

> Here we are targeting an existing Python script on DBFS. Refer to the Databricks SDK/API documentation for more information on how to specify a Databricks task.

Contributor Author

@erinkcochran87 erinkcochran87 Oct 10, 2023

Yeah, about that - did you find good documentation that describes how to do this? What I found was pretty light and didn't totally match up here. I also would prefer not to include these fields, but if we can't point to good documentation, I don't think it's a good experience for users of our integration.
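
For context on the fields discussed in this thread, here is a minimal sketch of a task spec that targets an existing Python script on DBFS. It is an illustration, not the guide's code: the helper name `build_submit_task_spec`, the default `task_key`, and the example path are hypothetical, and whether `source` is required is left open exactly as in the review. The Databricks API documentation remains the authority on which fields a task accepts.

```python
from typing import Any, Dict, Optional


def build_submit_task_spec(
    python_file: str,
    task_key: str = "dagster-pipes-task",  # hypothetical default, not from the PR
    source: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a plain dict describing a task that runs a Python file stored on DBFS."""
    # The review describes python_file as a URI located in DBFS, so reject anything else.
    if not python_file.startswith("dbfs:/"):
        raise ValueError(f"expected a DBFS URI such as dbfs:/script.py, got {python_file!r}")

    spark_python_task: Dict[str, Any] = {"python_file": python_file}
    # `source` is only included when given; the PR leaves open whether it is required.
    if source is not None:
        spark_python_task["source"] = source

    return {"task_key": task_key, "spark_python_task": spark_python_task}


spec = build_submit_task_spec("dbfs:/my_python_script.py")
```

If you use the Databricks SDK for Python, a dict of this shape is the kind of input `jobs.SubmitTask.from_dict` is meant to take, though a real task will likely also need cluster and library settings that are omitted here.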

@erinkcochran87
Contributor Author

> A lot of comments-- there are many (and will be more) subtle things that I think you won't know very well since you don't know Databricks/Pipes the way I do, having been banging my head against them for weeks.

I'm sure 😂 Definitely a good learning opportunity for me here, though!

> I think the most efficient thing will be for you to incorporate (or dispute) the comments I provided here and make whatever other changes you want to where you are semi-satisfied, and I can follow up directly with tweaks rather than further commenting.

That sounds good! I did respond to a few of your comments, mostly asking for clarification/reasoning on a few things. It's only a few - if you don't mind responding to those at least, I'd love to have the context for the suggestions.

> We could do this by either you merging a draft or me just taking over pushing to the PR, up to you.

Feel free to push to this branch and we can work together! I think that's the easiest way to do this.

@smackesey smackesey force-pushed the erin/pipes-databricks branch from 46286d5 to 3d8b0d8 Compare October 27, 2023 14:49
@github-actions

Deploy preview for dagit-storybook ready!

✅ Preview
https://dagit-storybook-5n2aagbry-elementl.vercel.app
https://erin-pipes-databricks.components-storybook.dagster-docs.io

Built with commit 3d8b0d8.
This pull request is being automatically deployed with vercel-action

@github-actions

Deploy preview for dagit-core-storybook ready!

✅ Preview
https://dagit-core-storybook-4yo7866qt-elementl.vercel.app
https://erin-pipes-databricks.core-storybook.dagster-docs.io

Built with commit 3d8b0d8.
This pull request is being automatically deployed with vercel-action

@github-actions

Deploy preview for dagster-university ready!

✅ Preview
https://dagster-university-dn634922c-elementl.vercel.app
https://erin-pipes-databricks.dagster-university.dagster-docs.io

Built with commit 3d8b0d8.
This pull request is being automatically deployed with vercel-action

@smackesey smackesey force-pushed the erin/pipes-databricks branch 4 times, most recently from 72e608e to 211f481 Compare October 27, 2023 15:53
@smackesey smackesey force-pushed the erin/pipes-databricks branch from 211f481 to 1eb547f Compare October 27, 2023 16:00
@smackesey
Collaborator

@erinkcochran87 @yuhan OK this has been updated. You'll want to do a final proofread or perhaps massage it for other reasons but I think it's in good shape:

  • I made the steps less granular. This is a stylistic choice but made it much easier to write the code snippets and ensure their correctness. Also FWIW I personally find large code blocks that are "complete" easier to comprehend, and I'm pretty sure any user interested in this guide is going to be highly technical.
  • the snippet code now has tests, although those tests are set up to skip on BK due to the difficulty of running Databricks jobs on BK. However, they can be run locally by:
      export DATABRICKS_HOST=<host>
      export DATABRICKS_TOKEN=<your personal access token>
      pytest examples/docs_snippets/docs_snippets_tests/guides_tests/dagster_pipes_tests/test_databricks.py::test_databricks_asset
  • Added screenshots.

  • Filled in TODOs.

  • Various other tweaks.

  • One caveat is that the final "Advanced" section is not under test (though it is linted and type-checked, which is an improvement), in part due to the way it's written. I think that is OK for now.

  • Deleted the Pipes content from Databricks README since it now lives here.
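
The comment above says the snippet tests skip on BK and run locally once `DATABRICKS_HOST` and `DATABRICKS_TOKEN` are exported. The PR conversation does not show how that skip is implemented, so the following is only one possible sketch, using the standard library's `unittest` rather than the repository's actual pytest setup. The helper `has_databricks_credentials` and the test class are hypothetical names.

```python
import os
import unittest
from typing import Mapping

# The two variables named in the comment above.
REQUIRED_ENV_VARS = ("DATABRICKS_HOST", "DATABRICKS_TOKEN")


def has_databricks_credentials(env: Mapping[str, str] = os.environ) -> bool:
    """Return True only if every required variable is present and non-empty."""
    return all(env.get(name) for name in REQUIRED_ENV_VARS)


class DatabricksAssetTest(unittest.TestCase):
    # Assumption: skipping is keyed on missing credentials; the real tests may
    # instead detect the CI environment directly.
    @unittest.skipUnless(
        has_databricks_credentials(),
        "DATABRICKS_HOST and DATABRICKS_TOKEN are not set",
    )
    def test_databricks_asset(self) -> None:
        # Placeholder body: a real test would launch the Databricks job here.
        self.assertTrue(has_databricks_credentials())
```

Keying the skip on credentials rather than on a CI flag means the same test file runs anywhere the variables are set and is reported as skipped, not failed, everywhere else.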

@smackesey
Collaborator

@erinkcochran87 Have you looked at this yet/can we get it merged? There is a user I'd like to share it with.

Collaborator

@smackesey smackesey left a comment

great

@smackesey smackesey merged commit b2ca074 into master Nov 7, 2023
1 check passed
@smackesey smackesey deleted the erin/pipes-databricks branch November 7, 2023 13:35
smackesey added a commit that referenced this pull request Nov 7, 2023
## Summary & Motivation

This PR adds a guide for integrating Dagster Pipes with Databricks.

TODO/?s:

- [x] Finish descriptions of `SubmitTask` spec
- [x] Finish UI section
- [x] Check in code examples
- [x] Check on some of the `PyObjects` - may be out of sync due to
changes in libraries
- [x] Add info about sending data back to Dagster (Step 2)

## How I Tested These Changes

eyes, bk

---------

Co-authored-by: Sean Mackesey <[email protected]>
2 participants