Reading parquet from s3 #22

dsebban opened this issue Apr 10, 2023 · 2 comments

dsebban commented Apr 10, 2023

Hi, thank you for this awesome work @chitralverma!
I am trying to read from S3 in Scala. I can see that writing is pretty simple, since you added a utility function in write_utils.rs to pass options:

write.options(
  Map(
    "aws_default_region" -> "us-east-2",
    "aws_access_key_id" -> "ABC",
    "aws_secret_access_key" -> "XYZ"
  )
)

I am trying to do the same when reading into a DataFrame, but passing an S3 path to Polars.parquet.scan obviously throws:

val df = Polars.parquet.scan("s3://test1/myfile.zstd.parquet")

Configuration 'aws' must be provided in order to use s3 cloud urls.

Is there something I need to tweak in the config to be able to read directly from S3? Maybe we need a read_utils.rs. I would be happy to contribute if you have some guidance :)
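For illustration, here is one possible shape for a read-side options API that mirrors the write side above. This is purely a sketch: neither the options method on Polars.parquet nor read_utils.rs exists yet.

// Hypothetical read-side API mirroring write_utils.rs (nothing here exists yet)
val df = Polars.parquet
  .options(
    Map(
      "aws_default_region" -> "us-east-2",
      "aws_access_key_id" -> "ABC",
      "aws_secret_access_key" -> "XYZ"
    )
  )
  .scan("s3://test1/myfile.zstd.parquet")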

chitralverma (Owner) commented

Thanks for writing @dsebban. Hit the star button if you like the project :)

Now, regarding reads from S3 (or any object store): in polars-rust this support exists only for the Parquet format, and it has to be specifically enabled via feature flags. I have purposely not enabled it for the moment, because if things don't work in a similar way for all the formats, it leads to API inconsistency.

The underlying object_store crate relies heavily on Rust async, while polars avoids it for performance reasons, so there is some incompatibility around that as well.

One tricky way to get around this would be to read files as byte streams on the Scala/Java side and then pass them over JNI to the Rust side. Such a stream could then be fed to the polars readers for all formats. The problem with this approach is that I haven't tried it before, and I doubt it can be done in a performant way.
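A minimal sketch of that idea, assuming the AWS SDK v2 for the download; Native.scanParquetBytes is a made-up stand-in for a JNI binding that would hand the buffer to the polars Parquet reader (no such binding exists in scala-polars today):

import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.GetObjectRequest

// Download the whole object into JVM memory (AWS SDK v2).
val s3 = S3Client.create()
val request = GetObjectRequest.builder().bucket("test1").key("myfile.zstd.parquet").build()
val bytes: Array[Byte] = s3.getObjectAsBytes(request).asByteArray()

// Hypothetical JNI entry point that feeds the buffer to the Rust side.
val df = Native.scanParquetBytes(bytes)

Buffering the whole object this way also hints at the performance concern: every byte crosses the JVM heap and the JNI boundary before polars ever sees it.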

If you have any other ideas on this, or if you want to take a stab at it, PRs are most welcome.

dsebban commented Apr 11, 2023

Thank you for your answer! I see your concern about API uniformity between read and write. I also don't think passing bytes through the JNI layer can be done in a performant manner. I will dig into the code to understand a bit more about your statement on async and polars not working well together. Ideally you would like a Spark-like API, right? Something like df.read.options(..).path(..) for S3/local JSON/CSV/Arrow.
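To make that concrete, here is what such a Spark-style reader could look like; purely illustrative, as none of these methods exist in scala-polars:

// Hypothetical Spark-like unified reader API (illustrative only)
val df = Polars.read
  .format("parquet")
  .options(
    Map(
      "aws_default_region" -> "us-east-2",
      "aws_access_key_id" -> "ABC",
      "aws_secret_access_key" -> "XYZ"
    )
  )
  .path("s3://test1/myfile.zstd.parquet")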
