Expose object_store for direct use #1008
I agree! This is why I created obstore, a fast Python binding for the Rust object_store crate. Unfortunately, you can't currently use one store class across multiple Python libraries.
Thanks for pointing out obstore! I did not know about this library. It looks very useful :) One thing we could perhaps try is to provide an easy constructor: if you have a datafusion-python ObjectStore, you can easily create the equivalent obstore ObjectStore. That would be enough for my use case.
Looking at what is exposed for producing object store objects in datafusion-python, it looks like this is only a subset of what obstore allows. It should therefore be viable to make the datafusion-python ObjectStore objects either remember their construction arguments, or query them from the built ObjectStore, and expose them in a way that lets us construct the equivalent obstore version from Python (maybe augmenting this with some extras), something like:

df_store = datafusion.object_store.GoogleCloud('my-cool-bucket')
store = df_store.as_obstore()

I'll try to implement this.
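The remember-the-construction-arguments idea can be sketched in plain Python. Note that GoogleCloud and as_obstore_args below are hypothetical names standing in for the real datafusion-python and obstore constructors; this is only an illustration of the shape of the API, not an actual implementation.

```python
# Hypothetical sketch: a thin wrapper that remembers its construction
# arguments so an equivalent store can later be rebuilt by another
# library. Neither class nor method names are real datafusion-python
# or obstore APIs.

class GoogleCloud:
    def __init__(self, bucket_name, **options):
        # Keep the arguments instead of discarding them once the
        # underlying (Rust-side) store has been built.
        self._init_args = {"bucket_name": bucket_name, **options}

    def as_obstore_args(self):
        # Return what would be fed to the equivalent obstore
        # constructor in a real implementation.
        return dict(self._init_args)

df_store = GoogleCloud("my-cool-bucket")
args = df_store.as_obstore_args()  # {'bucket_name': 'my-cool-bucket'}
```

A real as_obstore() would then pass these arguments to the matching obstore store class.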
Maybe. I think @timsaucer mentioned on Discord that he was considering looking into it, similar to his work on https://docs.rs/datafusion-ffi/latest/datafusion_ffi/.
Wouldn't that in effect be constructing our own stable FFI API across libraries? I'm not sure that's possible and/or maintainable.
What I was imagining was rather a way to create the obstore version from the construction arguments of the datafusion-python ObjectStore. So that wouldn't be FFI; rather, it'd be constructing a whole new instance, but this time managed by obstore.
Yes,
Well, that's adding complexity on the datafusion-python side...
This isn't generally possible. Once you have an
This sounds like replicating the configuration system that exists in Rust in pure Python, which sounds difficult to maintain. I would love to see something like this exist, because it's also necessary for pickling, and would love to see a PR. But so far I'm skeptical of a solution that's maintainable.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I need to be able to delete old resources generated by write_parquet() and similar methods, move them out of the way, or do other such operations that broadly fall in the category of data/artifact cleanup. Datafusion doesn't directly implement such move/delete operations, so this requires a different library. Depending on what environment I'm operating in (local file system, S3, Google bucket), this requires a slightly different setup and operation.

However, datafusion ships with ObjectStore, a generic frontend for many different object store systems, mainly S3 and its equivalents in other cloud environments, but also providing an API-compatible local file storage version. This is used to read from and write to such object stores from within datafusion. In datafusion-python, these ObjectStore objects are opaque handles that are only useful for registering with a session context. In Rust, however, they also allow the user to directly manipulate objects in these stores: fetch them, delete them, move them, etc.

Describe the solution you'd like
I would like the datafusion-python version of ObjectStore to not just be an opaque handle, but instead allow access to the underlying methods. This would let me generically implement operations on generated artifacts that are not doable in datafusion directly.
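As a rough illustration of the surface this would enable, here is a minimal sketch. The class is hypothetical and an in-memory dict stands in for the real backend; the method names loosely mirror the Rust object_store trait (put, get, delete, rename), but none of this is an actual datafusion-python API.

```python
# Hypothetical sketch of a non-opaque ObjectStore handle. An in-memory
# dict stands in for the real backend; method names loosely mirror the
# Rust object_store trait.

class ExposedObjectStore:
    def __init__(self):
        self._objects = {}

    def put(self, path, data):
        self._objects[path] = data

    def get(self, path):
        return self._objects[path]

    def delete(self, path):
        del self._objects[path]

    def rename(self, src, dst):
        self._objects[dst] = self._objects.pop(src)

# Generic artifact cleanup, independent of which backend the store wraps.
store = ExposedObjectStore()
store.put("out/run-1.parquet", b"...")
store.rename("out/run-1.parquet", "archive/run-1.parquet")
store.delete("archive/run-1.parquet")
```

The point is that cleanup code written against such an interface would work unchanged whether the store wraps local files, S3, or a Google bucket.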
Describe alternatives you've considered
The workaround is to use an S3-compatible library directly. This doesn't help with local files, though, which still require a separate code path.
Another possibility is to have a separate python library wrapping the rust object_store crate, as arguably it's not the job of datafusion-python to provide a good API for this. However, it's useful to be able to define just one ObjectStore (like from a configuration) and use it both for datafusion and for related object-store operations like artifact cleanup.
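The separate-code-path problem with the workaround can be sketched as a dispatch on URL scheme. The backend labels here are illustrative; a real implementation would call into, e.g., an S3 client for object-store URLs and pathlib/shutil for local paths.

```python
from urllib.parse import urlparse

def cleanup_backend(url):
    # Illustrative only: pick which code path the workaround forces
    # you to maintain for a given artifact location.
    scheme = urlparse(url).scheme
    if scheme in ("s3", "gs"):
        return "object-store-client"  # e.g. boto3 or google-cloud-storage
    return "local-filesystem"         # e.g. pathlib / shutil

cleanup_backend("s3://bucket/artifacts/part-0.parquet")  # 'object-store-client'
cleanup_backend("/tmp/artifacts/part-0.parquet")         # 'local-filesystem'
```

With a shared ObjectStore interface, this dispatch (and the duplicated cleanup logic behind it) would disappear.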