-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[docs] [pipes] - Add Databricks integration guide #17114
Conversation
@yuhan @smackesey - Here's a rough first pass of the Databricks + Pipes guide. There are still some TODOs/Questions (see PR description) that I could use help with. Sean - do you have a working example I can use to take screenshots of the Dagster UI? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this!
A lot of comments-- there are many (and will be more) subtle things that I think you won't know very well since you don't know Databricks/Pipes the way I do, having been banging my head against them for weeks.
I think the most efficient thing will be for you to incorporate (or dispute) the comments I provided here and make whatever other changes you want to where you are semi-satisfied, and I can follow up directly with tweaks rather than further commenting. We could do this by either you merging a draft or me just taking over pushing to the PR, up to you.
docs/content/guides/dagster-pipes/integrating-databricks-with-dagster-pipes.mdx
Outdated
Show resolved
Hide resolved
|
||
<Note> | ||
<strong>Heads up!</strong> This guide focuses on using an out-of-the-box | ||
Databricks resource. For further customization, use the{" "} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Databricks resource" -> "Databricks Pipes client"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But it (sort of?) functions like a resource, and is configured like one. Resource
is a term Dagster users are already familiar with, whereas this is a new term. Do you feel strongly about changing this?
docs/content/guides/dagster-pipes/integrating-databricks-with-dagster-pipes.mdx
Outdated
Show resolved
Hide resolved
docs/content/guides/dagster-pipes/integrating-databricks-with-dagster-pipes.mdx
Outdated
Show resolved
Hide resolved
docs/content/guides/dagster-pipes/integrating-databricks-with-dagster-pipes.mdx
Outdated
Show resolved
Hide resolved
|
||
- `spark_python_task` - An object specifying the Python file the job should run, which contains the following properties: | ||
- `python_file` - The URI of the Python file to run, located in DBFS. | ||
- `source` - TODO - is this required? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should attempt to formally document the various fields here. This is part of the Databricks API and has nothing to do with Dagster per se. We might just say something like:
Here we are targeting an existing python script on DBFS. Refer to the Databricks SDK/API documentation for more information on how to specify a Databricks task.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, about that - did you find good documentation that describes how to do this? What I found was pretty light and didn't totally match up here. I also would prefer not to include these fields, but if we can't point to good documentation, I don't think it's a good experience for users of our integration.
docs/content/guides/dagster-pipes/integrating-databricks-with-dagster-pipes.mdx
Outdated
Show resolved
Hide resolved
docs/content/guides/dagster-pipes/integrating-databricks-with-dagster-pipes.mdx
Outdated
Show resolved
Hide resolved
docs/content/guides/dagster-pipes/integrating-databricks-with-dagster-pipes.mdx
Outdated
Show resolved
Hide resolved
docs/content/guides/dagster-pipes/integrating-databricks-with-dagster-pipes.mdx
Outdated
Show resolved
Hide resolved
I'm sure 😂 Definitely a good learning opportunity for me here, though!
That sounds good! I did respond to a few of your comments, mostly asking for clarification/reasoning on a few things. It's only a few - if you don't mind responding to those at least, I'd love to have the context for the suggestions.
Feel free to push to this branch and we can work together! I think that's the easiest way to do this. |
46286d5
to
3d8b0d8
Compare
Current dependencies on/for this PR: This comment was auto-generated by Graphite. |
Deploy preview for dagit-storybook ready! ✅ Preview Built with commit 3d8b0d8. |
Deploy preview for dagit-core-storybook ready! ✅ Preview Built with commit 3d8b0d8. |
Deploy preview for dagster-university ready! ✅ Preview Built with commit 3d8b0d8. |
72e608e
to
211f481
Compare
211f481
to
1eb547f
Compare
@erinkcochran87 @yuhan OK this has been updated. You'll want to do a final proofread or perhaps massage it for other reasons but I think it's in good shape:
|
@erinkcochran87 Have you looked at this yet/can we get it merged? There is a user I'd like to share it with. |
…/dagster into erin/pipes-databricks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great
## Summary & Motivation This PR adds a guide for integrating Dagster Pipes with Databricks. TODO/?s: - [x] Finish descriptions of `SubmitTask` spec - [x] Finish UI section - [x] Check in code examples - [x] Check on some of the `PyObjects` - may be out of sync due to changes in libraries - [x] Add info about sending data back to Dagster (Step 2) ## How I Tested These Changes eyes, bk --------- Co-authored-by: Sean Mackesey <[email protected]>
Summary & Motivation
This PR adds a guide for integrating Dagster Pipes with Databricks.
TODO/?s:
SubmitTask
specPyObjects
- may be out of sync due to changes in librariesHow I Tested These Changes
👀 , bk