[Feature Request] Save Pins to Databricks #839
Dear Development Team,

Given the increasing collaboration between Posit and Databricks, I believe that the capability to store pins in Databricks, in addition to platforms such as S3 and Azure, could prove to be an appealing feature for enterprise clients.

Sincerely,
Thanks for this suggestion! 🙌 Can you share some specifics about how and what you would like to store in Databricks, perhaps highlighting what is different from the workflows supported by sparklyr?
I have a use case for this, although I'm interested in suggestions if there's a better solution. I work a lot with survey data that comes in an SPSS format. There are Unity Catalog Volumes, but I can't figure out how to store factors on Databricks while retaining read access from my local machine. Being able to save and read pins in Databricks would solve this, because I could write the data directly to a Volume. Thanks for all your work on this package!
In general, I think accessing and storing information in Databricks' Volumes provides some great benefits.
Through the Python SDK you can look up a Volume and its storage location. A pseudo example for reading in a directory of YAML files:

```python
from databricks.sdk import WorkspaceClient

name = f"{catalog}.{database}.{volume}"
wc = WorkspaceClient()
volume = wc.volumes.read(name)

spark.read.text(
    paths=volume.storage_location,  # directory
    wholetext=True,                 # single row
    pathGlobFilter="*.yaml"
)
```

A more complete example in R via reticulate:

```r
library(reticulate)

# https://github.com/databrickslabs/databricks-sdk-r
# package for using the REST API in R
library(databricks)

client <- DatabricksClient()

# this can also be accomplished through reticulate
volume <-
  client |>
  read_volume("{catalog}.{database}.{volume}")
location <- volume$storage_location

# grabbing a cluster that I can access so I can read more data
clusters <-
  client |>
  list_clusters() |>
  subset(startsWith(creator_user_name, "jbarbone"))
cluster <- clusters$cluster_id[1]

# requires the {databricks-sdk} and {databricks-connect} Python packages
db <- import("databricks.sdk")
connect <- import("databricks.connect")

w <- db$WorkspaceClient()
volume <- w$volumes$read("{catalog}.{database}.{volume}")
location <- volume$storage_location

pyspark <- import("pyspark")
spark <-
  connect$
    DatabricksSession$
    builder$
    profile("DEFAULT")$
    clusterId(cluster)$
    getOrCreate()

content <- spark$read$text(
  paths = location,
  wholetext = TRUE,
  pathGlobFilter = "*.yaml"
)
```

Through the REST API you can find the storage location (https://docs.databricks.com/api/workspace/volumes/list), but you may need to access the Spark context to actually read the data. The R example seems to work fine for me, and locally I just have a `DEFAULT` profile configured.
* Ports host and token functions
* Starts board_databricks
* Starts pin list
* Removes pipes
* Centralizes content retrieval, adds pin_exists
* Starts pin_meta function
* Simplifies arguments, renames token and host functions
* First pass at pin_store
* Fixes hashed subfolder
* Adds pin_versions
* Improvements to cache path
* Adds download file helper
* Adds download step to meta, fixes cache discovery
* Adds pin_fetch
* Adds pin_delete, and all supporting functions
* Assigns proper file rights to local cache
* Passes all tests
* Adds board_deparse
* Adds required_pkgs
* Starts testing
* Avoids failing when checking contents of a folder, needed for prefix
* Passes all tests
* Fixes a pkg check finding
* Starts documentation
* Completes documentation
* Adds NEWS item
* Small fix to documentation, adds some instructions to tests
* Properly handles lack of host or token
* Fixes pkgdown failure, addresses oldrel-4 issue by reverting to older mode of `purrr`, and improves some tests
* More consistent filename
* More consistent filename
* Edits to docs
* Redocument
* Update R/board_databricks.R (Co-authored-by: Julia Silge <[email protected]>)
* Update R/board_databricks.R (Co-authored-by: Julia Silge <[email protected]>)
* Removes reference to bucket, and re-documents
* Little more doc refining
* Try running tests in CI
* Update snapshot

Co-authored-by: Julia Silge <[email protected]>
Many thanks to @edgararuiz for his work implementing `board_databricks()`!
Great work, and thank you @juliasilge and @edgararuiz for bringing this feature to the package. I've tested this using version 1.4.0 for various file types and sizes; all worked smoothly.
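Roughly, the basic workflow looks like this (a minimal sketch only: the volume path below is a placeholder and authentication is assumed to come from the `DATABRICKS_HOST`/`DATABRICKS_TOKEN` environment variables):

```r
library(pins)

# Sketch: the folder path is a placeholder for an existing Unity Catalog
# volume, and DATABRICKS_HOST / DATABRICKS_TOKEN are assumed to be set.
board <- board_databricks("/Volumes/my-catalog/my-schema/my-volume")

board |> pin_write(mtcars, "mtcars", type = "rds")
board |> pin_read("mtcars")
```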
I've run into a small issue within my workflow. We track multiple Databricks profiles with separate hosts and tokens configured in a `~/.databrickscfg` file:

```r
some_custom_board_function <- function(profile, ...) {
  # simplified
  cfg <- ini::read.ini("~/.databrickscfg")
  profile <- mark::match_param(profile, names(cfg))
  config <- cfg[[profile]]
  withr::local_envvar(c(
    DATABRICKS_HOST = config$host,
    DATABRICKS_TOKEN = config$token
  ))
  board <- board_databricks(...)
  board
}

board <- some_custom_board_function()
pin_write(board, mtcars)
#> Using `name = 'mtcars'`
#> Guessing `type = 'rds'`
#> Error in `purrr::keep()`:
#> ℹ In index: 1.
#> ℹ With name: message.
#> Caused by error in `.x$is_directory`:
#> ! $ operator is invalid for atomic vectors
```

We have custom board-creating and pin-writing wrappers, so everything still works with a few extra steps, but it would be nice for the board to look for these settings in `~/.databrickscfg` as well. Still very excited for this update and looking forward to using it more.
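In the meantime, a variation on the wrapper above that skips the environment variables entirely (a sketch only; it assumes `board_databricks()` accepts `host` and `token` arguments, as the implementation notes above suggest):

```r
# Sketch: read the chosen profile from ~/.databrickscfg and hand the host
# and token to board_databricks() directly, rather than via env vars.
# Assumes board_databricks() exposes `host` and `token` arguments.
board_from_profile <- function(profile, ...) {
  cfg <- ini::read.ini("~/.databrickscfg")
  profile <- match.arg(profile, names(cfg))
  config <- cfg[[profile]]
  pins::board_databricks(..., host = config$host, token = config$token)
}

# board <- board_from_profile("DEFAULT", "/Volumes/my-catalog/my-schema/my-volume")
```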
Thanks for sharing that @jmbarbone! I have opened #848 to track additional auth needs for Databricks; please add any additional thoughts there. 🙌
This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.