Expose object_store for direct use #1008

Open
matko opened this issue Jan 29, 2025 · 6 comments
Labels
enhancement New feature or request

Comments

@matko

matko commented Jan 29, 2025

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I need to be able to delete old resources generated by write_parquet() and similar methods, move them out of the way, or do other such operations that broadly fall in the category of 'data/artifact cleanup'. DataFusion doesn't directly implement such move/delete operations, so this requires a different library. Depending on what environment I'm operating in (local file system, S3, Google Cloud Storage bucket), this requires a slightly different setup and operation.

However, DataFusion ships with ObjectStore, a generic frontend for many different object store systems, mainly S3 and its equivalents in other cloud environments, but also including an API-compatible local file storage implementation. DataFusion uses it to read from and write to such object stores. In datafusion-python, these ObjectStore objects are opaque handles that are only useful for registering with a session context. In Rust, however, they also allow the user to directly manipulate objects in these stores: fetch them, delete them, move them, etc.

Describe the solution you'd like
I would like the datafusion-python version of ObjectStore to not just be an opaque handle, but instead allow access to the underlying methods. This would allow me to generically implement operations on generated artifacts that are not doable in DataFusion directly.

Describe alternatives you've considered
The workaround is to use an S3-compatible library directly. This doesn't help with local files though, which still require a separate code path.

Another possibility is to have a separate python library wrapping the rust object_store crate, as arguably it's not the job of datafusion-python to provide a good API for this. However, it's useful to be able to define just one ObjectStore (like from a configuration) and use it both for datafusion and for related object-store operations like artifact cleanup.
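To make the request concrete, here is a minimal sketch of what "one store, one cleanup code path" could look like. The `Store` protocol, `LocalDirStore`, and `archive_artifacts` are hypothetical names invented for illustration; they are not the datafusion-python or obstore API. Only a local-directory backend is shown, but an S3 or GCS backend exposing the same three methods could be passed to the same function.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List, Protocol


class Store(Protocol):
    """Hypothetical minimal store interface (an assumption, not a real library API)."""

    def list(self, prefix: str) -> Iterable[str]: ...
    def delete(self, path: str) -> None: ...
    def rename(self, src: str, dst: str) -> None: ...


class LocalDirStore:
    """Local-filesystem backend for the hypothetical interface above."""

    def __init__(self, root):
        self.root = Path(root)

    def list(self, prefix):
        base = self.root / prefix
        # Return store-relative paths of all files under the prefix.
        return sorted(
            p.relative_to(self.root).as_posix() for p in base.rglob("*") if p.is_file()
        )

    def delete(self, path):
        (self.root / path).unlink()

    def rename(self, src, dst):
        target = self.root / dst
        target.parent.mkdir(parents=True, exist_ok=True)
        (self.root / src).rename(target)


def archive_artifacts(store: Store, prefix: str, archive_prefix: str) -> List[str]:
    """Move everything under `prefix` below `archive_prefix`; backend-agnostic."""
    moved = []
    for path in store.list(prefix):
        new_path = f"{archive_prefix}/{path}"
        store.rename(path, new_path)
        moved.append(new_path)
    return moved


with tempfile.TemporaryDirectory() as tmp:
    store = LocalDirStore(tmp)
    (Path(tmp) / "out").mkdir()
    (Path(tmp) / "out" / "part-0.parquet").write_bytes(b"")
    moved = archive_artifacts(store, "out", "archive")
    remaining = store.list("out")
```

The cleanup function only depends on the interface, which is the property the issue asks for: the choice of backend is made once, at construction time.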

matko added the enhancement (New feature or request) label Jan 29, 2025
@kylebarron
Contributor

kylebarron commented Jan 29, 2025

I agree! This is why I created obstore. It's a fast Python binding for object_store. Early benchmarks indicate 10x higher throughput than s3fs and aioboto3.

> However, it's useful to be able to define just one ObjectStore (like from a configuration) and use it both for datafusion and for related object-store operations like artifact cleanup.

This is the goal of pyo3_object_store, so that we can define configuration and builders around object_store once and then reuse it across many different Rust-Python libraries that internally use object_store.

I.e. obstore is for Python end users who want to use object_store from Python, while pyo3_object_store is for other Rust developers creating their own Python packages who want to use object_store from Rust.

So far, I've put more effort into publishing obstore than pyo3_object_store, but I'd like to polish up pyo3_object_store (especially after the next object_store release) and I'd be happy to explore using it inside datafusion-python if there's interest.

Unfortunately, you can't currently use one store class across multiple Python libraries because object_store is not FFI stable. So you need to use the class exported from obstore whenever you use obstore methods, and you'd need to use the class exported from datafusion-python whenever you use datafusion methods (even though each class would take the same builder params). Or, we could try to solve the object_store FFI problem.

@matko
Author

matko commented Jan 30, 2025

Thanks for pointing out obstore! I did not know about this library. This looks very useful :)
Solving the FFI problem seems infeasible. Maybe one day Rust will get a stable ABI and these problems will go away. For now though, I think the answer here is that I should be using obstore alongside datafusion-python.

One thing we could perhaps try to do is provide an easy constructor, where if you have a datafusion-python ObjectStore, you can easily create the equivalent obstore ObjectStore? That would be enough for my use case.

@matko
Author

matko commented Jan 30, 2025

Looking at what is exposed for producing the object store objects in datafusion-python, it looks like this is only a subset of what obstore allows. It should therefore be viable to make the datafusion-python ObjectStore objects either remember their construction arguments, or query them from the built ObjectStore, and expose them in a way that lets us then construct the equivalent obstore version from Python (maybe allowing this to be augmented with some extras, like ClientConfig and RetryConfig, which obstore offers).

Something like:

df_store = datafusion.object_store.GoogleCloud('my-cool-bucket')
store = df_store.as_obstore()

with as_obstore optionally taking a ClientConfig and RetryConfig.
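A rough sketch of the "remember the construction arguments" variant follows. Everything here is hypothetical: `RememberingStore`, `as_other`, and `fake_obstore_factory` are placeholder names, and the factory merely stands in for whatever constructor the other library provides. The point it illustrates is that the conversion builds a new, independent instance from recorded arguments rather than sharing a native handle.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class StoreSpec:
    """Hypothetical record of how a store was constructed."""

    kind: str
    bucket: str
    options: Dict[str, str] = field(default_factory=dict)


class RememberingStore:
    """Stand-in for a datafusion-python store that keeps its constructor arguments."""

    def __init__(self, kind: str, bucket: str, **options: str):
        self.spec = StoreSpec(kind, bucket, dict(options))

    def as_other(self, factory: Callable[..., Any], **extras: str) -> Any:
        # Build a brand-new instance from the remembered arguments.
        # `extras` plays the role of optional add-ons such as retry settings.
        merged = {**self.spec.options, **extras}
        return factory(self.spec.kind, self.spec.bucket, **merged)


def fake_obstore_factory(kind: str, bucket: str, **config: str) -> dict:
    # Placeholder for the other library's constructor (an assumption).
    return {"kind": kind, "bucket": bucket, "config": config}


df_store = RememberingStore("gcs", "my-cool-bucket", service_account_path="/path/to/key.json")
other = df_store.as_other(fake_obstore_factory, retry_max="3")
```

The cost of this design is that the recorded arguments must be kept in sync with every constructor option the wrapped store accepts.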

I'll try to implement this.

@kylebarron
Contributor

> Solving the FFI problem seems very unfeasible.

Maybe... I think @timsaucer mentioned on Discord that he was considering looking into it, similar to his work on https://docs.rs/datafusion-ffi/latest/datafusion_ffi/.

> One thing we could perhaps try to do is provide an easy constructor, where if you have a datafusion-python ObjectStore, you can easily create the equivalent obstore ObjectStore?

Wouldn't that in effect be constructing our own stable FFI API across libraries? I'm not sure that's possible and/or maintainable.

@matko
Author

matko commented Jan 30, 2025

> Wouldn't that in effect be constructing our own stable FFI API across libraries? I'm not sure that's possible and/or maintainable.

What I was imagining was rather a way to create the obstore version from the construction arguments of the datafusion-python ObjectStore. So that wouldn't be FFI, rather it'd be constructing a whole new instance, but this time managed by obstore.

@kylebarron
Contributor

> this is only a subset of what obstore allows

Yes, obstore is a full binding to object_store and intends to allow the full configuration that you can do in Rust.

> remember their construction arguments

Well, that's adding complexity on the datafusion-python side...

> query them from the built ObjectStore

This isn't generally possible. Once you have an AmazonS3, you can't backtrack to the builder parameters. If instead you store an AmazonS3Builder, then you can access some but not all config parameters (those whose values are representable by String) via get_config_value. But this also means that every time you use the store you'd have to call AmazonS3Builder.build, which has overhead.
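The trade-off can be modeled in a few lines of Python. This is a toy model only: `ToyBuilder` and `ToyStore` are invented for illustration and are not the Rust object_store types; the method name `get_config_value` is borrowed from the comment above purely to mirror it. The builder keeps string-valued configuration that can be queried, while the built store keeps only derived state, so holding the builder means paying for `build()` on every use.

```python
class ToyStore:
    """Toy model of a built store: it keeps derived state, not the config mapping."""

    def __init__(self, config):
        # Only a derived value survives; the original key/value pairs are not kept.
        self._base_url = "https://{}.example.invalid/{}".format(
            config["region"], config["bucket"]
        )


class ToyBuilder:
    """Toy model of a builder that retains its string-valued configuration."""

    def __init__(self):
        self._config = {}
        self.build_calls = 0  # counts how often construction work is repeated

    def with_config(self, key, value):
        self._config[key] = str(value)
        return self

    def get_config_value(self, key):
        return self._config.get(key)

    def build(self):
        self.build_calls += 1
        return ToyStore(self._config)


builder = ToyBuilder().with_config("bucket", "my-cool-bucket").with_config("region", "us-east-1")

# Holding only the builder means every use constructs a fresh store.
store_a = builder.build()
store_b = builder.build()
```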

> expose them in a way that lets us then construct the equivalent obstore version from python

This sounds like replicating the configuration system that exists in Rust in pure Python, which sounds difficult to maintain.


I would love to see something like this exist, because it's also necessary for pickling, and would love to see a PR. But so far I'm skeptical of a solution that's maintainable.
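The link to pickling can be sketched as follows, assuming a store that records its own constructor arguments. `PicklableStore` and `_rebuild_store` are hypothetical names; the `threading.Lock` is only a stand-in for a native handle that cannot be serialized. Pickling then transmits the arguments and rebuilds a fresh store on load, rather than serializing the handle.

```python
import pickle
import threading


class PicklableStore:
    """Hypothetical store that remembers how it was constructed."""

    def __init__(self, bucket, **config):
        self._init_args = (bucket, dict(config))
        # Stand-in for a native, non-picklable handle (locks cannot be pickled).
        self._handle = threading.Lock()

    def __reduce__(self):
        # Serialize only the constructor arguments, never the handle itself.
        bucket, config = self._init_args
        return (_rebuild_store, (bucket, config))


def _rebuild_store(bucket, config):
    # Called by pickle on load: constructs a new, independent store.
    return PicklableStore(bucket, **config)


store = PicklableStore("my-cool-bucket", region="us-east-1")
restored = pickle.loads(pickle.dumps(store))
```

The same maintainability concern applies here: the recorded arguments must cover every option that influences the built store, or the restored instance will differ from the original.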
