
Polars Datatype Catalog Entry Cannot Partition on Saving Parquet #908

Open
alexdavis24 opened this issue Oct 18, 2024 · 3 comments
Labels
Community · support: needs more info

Comments

@alexdavis24
alexdavis24 commented Oct 18, 2024

Description

Context

  • Trying to partition a large Polars DataFrame on a single column when saving to Parquet

Steps to Reproduce

  1. Sample dataset that runs locally with no issues:
df = pl.DataFrame(
    {"A": [1, 2, 3], "B": [4, 5, 6]}
)
path = "tmp/test.parquet"
df.write_parquet(
    path,
    use_pyarrow=True,
    pyarrow_options={"partition_cols": ["B"]},
)
# this also runs with no issues
df.write_parquet(
    path,
    partition_by=["B"],
)
  2. Sample code following the same implementation:
# pipelines.py
from kedro.pipeline import Pipeline, node
from kedro.pipeline.modular_pipeline import pipeline
import polars as pl

def my_func():
    return pl.DataFrame(
        {"A": [1, 2, 3], "B": [4, 5, 6]}
    )

def create_pipeline() -> Pipeline:
    return pipeline(
        node(
            func=my_func,
            inputs=None,  # my_func takes no inputs
            outputs="my_entry",
            name="partition_polars",
        )
    )
# catalog.yml
# using Rust
my_entry:
  # also tried with polars.LazyPolarsDataset
  type: polars.EagerPolarsDataset
  filepath: /tmp/test.parquet
  file_format: parquet
  save_args:
    partition_by: 
      - B
# catalog.yml
# using pyarrow (C++)
my_entry:
  type: polars.EagerPolarsDataset
  filepath: /tmp/test.parquet
  file_format: parquet
  save_args:
    use_pyarrow: True
    pyarrow_options:
      partition_cols: 
      - B
  fs_args:
    filesystem: pyarrow._fs.FileSystem

Expected Result

  • New partitioned parquet file should be created locally or in S3

Actual Result

From the Rust implementation:

DatasetError: Failed while saving data to data set
EagerPolarsDataset(file_format=parquet, filepath=/tmp/test.parquet,
load_args={}, protocol=file, save_args={'partition_by': ['dt1y']}).
'BytesIO' object cannot be converted to 'PyString'

From the PyArrow implementation:

DatasetError: Failed while saving data to data set
 LazyPolarsDataset(filepath=/tmp/test.parquet, load_args={}, protocol=file, 
save_args={'pyarrow_options': {'compression': zstd, 'partition_cols': ['dt1y'],
'write_statistics': True}, 'use_pyarrow': True}).
Argument 'filesystem' has incorrect type (expected pyarrow._fs.FileSystem, got 
NoneType)

Your Environment

  • Kedro version used (pip show kedro or kedro -V): 0.19.3
  • Polars: 1.9.0 and 1.6.0
  • Python version used (python -V): 3.11
  • Operating system and version: macOS (Apple M1) using Docker Compose + Docker Desktop
@datajoely
Contributor

Just want to say thanks for such a clear write up and investigation 💪

@SajidAlamQB
Contributor

Thank you @alexdavis24 for reporting this, we'll have a look.

@astrojuanlu astrojuanlu transferred this issue from kedro-org/kedro Oct 25, 2024
@astrojuanlu astrojuanlu added the Community label Nov 18, 2024
@merelcht
Member

Hi @alexdavis24 , I've been able to replicate the issue. I'm not super familiar with Polars or PyArrow, but I think your analysis that the problem lies in saving via BytesIO is correct. Because the save method writes the data to a BytesIO buffer and then uses fsspec to write that buffer to the target path, it completely bypasses the PyArrow filesystem and shouldn't require you to pass a filesystem argument. However, if PyArrow is somehow being invoked with a None filesystem, the issue might be in how fsspec or the BytesIO buffer is handled.

I managed to get things working with the following catalog entry:

my_entry:
  type: polars.EagerPolarsDataset
  filepath: /tmp/test.parquet
  file_format: parquet
  save_args:
    pyarrow_options:
      partition_cols:
      - B

So removing the explicit filesystem argument and also removing use_pyarrow: True. I don't know if this produces the desired result though. Let me know what you think of this.
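One way to check whether the desired result was produced: a successful partitioned write creates a Hive-style key=value directory tree rather than a single file. A stdlib-only sketch (the `demo_partitioned.parquet` path and toy tree are hypothetical, mimicking what partitioning the sample frame on column B should create):

```python
from pathlib import Path

# Hypothetical layout: partitioned Parquet writes create one key=value
# subdirectory per distinct partition value. Build a toy tree mirroring
# what partitioning the sample frame on column B would produce.
root = Path("demo_partitioned.parquet")
for val in (4, 5, 6):  # distinct values of column B
    part = root / f"B={val}"
    part.mkdir(parents=True, exist_ok=True)
    (part / "data.parquet").touch()

# List the tree the same way you would inspect the real /tmp/test.parquet
print(sorted(p.relative_to(root).as_posix() for p in root.rglob("*.parquet")))
# → ['B=4/data.parquet', 'B=5/data.parquet', 'B=6/data.parquet']
```

If the catalog entry above writes a single flat file instead of a tree like this, the partitioning options are being silently ignored.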
