Reading parquet from s3 #22

dsebban opened this issue Apr 10, 2023 · 2 comments

dsebban commented Apr 10, 2023

Hi, thank you for this awesome work @chitralverma!
I am trying to read from S3 in Scala. I can see that writing is pretty simple, since you added a utility function in write_utils.rs to pass options:

write.options(
  Map(
    "aws_default_region" -> "us-east-2",
    "aws_access_key_id" -> "ABC",
    "aws_secret_access_key" -> "XYZ"
  )
)

I am trying to do the same when reading into a DataFrame, but passing an S3 path to Polars.parquet.scan obviously throws:

val df = Polars.parquet.scan("s3://test1/myfile.zstd.parquet")

Configuration 'aws' must be provided in order to use s3 cloud urls.

Is there something I need to tweak in the config to be able to read directly from S3? Maybe we need a read_utils.rs. I would be happy to contribute if you have some guidance :)
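For illustration, here is one possible shape for a read-side options API that mirrors the write side above. This is purely a sketch: neither the options method on Polars.parquet nor read_utils.rs exists yet.

// Hypothetical read-side API mirroring write_utils.rs (nothing here exists yet)
val df = Polars.parquet
  .options(
    Map(
      "aws_default_region" -> "us-east-2",
      "aws_access_key_id" -> "ABC",
      "aws_secret_access_key" -> "XYZ"
    )
  )
  .scan("s3://test1/myfile.zstd.parquet")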

chitralverma (Owner) commented

Thanks for writing @dsebban. Hit the star button if you like the project :)

Now, regarding reads from S3 (or any object store): in polars-rust this support exists only for the Parquet format, and it has to be specifically enabled via feature flags. I have purposely not enabled it for the moment, because if things don't work in a similar way for all the formats, it leads to API inconsistency.

The underlying object_store crate relies heavily on Rust async, while polars avoids it for performance reasons, so there is some incompatibility around that as well.

One tricky way to get around this would be to read files as byte streams on the Scala/Java side and then pass them over JNI to the Rust side. Such a stream could then be fed to the polars readers for all formats. The problem with this approach is that I haven't tried it before, and I doubt it can be done in a performant way.
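A minimal sketch of that idea, assuming the AWS SDK v2 for the download; Native.scanParquetBytes is a made-up stand-in for a JNI binding that would hand the buffer to the polars Parquet reader (no such binding exists in scala-polars today):

import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.GetObjectRequest

// Download the whole object into JVM memory (AWS SDK v2).
val s3 = S3Client.create()
val request = GetObjectRequest.builder().bucket("test1").key("myfile.zstd.parquet").build()
val bytes: Array[Byte] = s3.getObjectAsBytes(request).asByteArray()

// Hypothetical JNI entry point that feeds the buffer to the Rust side.
val df = Native.scanParquetBytes(bytes)

Buffering the whole object this way also hints at the performance concern: every byte crosses the JVM heap and the JNI boundary before polars ever sees it.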

If you have any other ideas on this, or if you want to take a stab at it, PRs are most welcome.

dsebban commented Apr 11, 2023

Thank you for your answer! I see your concern about API uniformity between read and write. I also don't think passing bytes through the JNI layer can be done in a performant manner. I will dig into the code to understand a bit more about your statement on async and polars not working well together. Ideally you would like a Spark-like API, right? Something like df.read.options(..).path(..) for S3/local JSON/CSV/Arrow.
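To make that concrete, here is what such a Spark-style reader could look like; purely illustrative, as none of these methods exist in scala-polars:

// Hypothetical Spark-like unified reader API (illustrative only)
val df = Polars.read
  .format("parquet")
  .options(
    Map(
      "aws_default_region" -> "us-east-2",
      "aws_access_key_id" -> "ABC",
      "aws_secret_access_key" -> "XYZ"
    )
  )
  .path("s3://test1/myfile.zstd.parquet")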
