ENH: xorbits's read_parquet compatible with pandas on pyarrow engine #770

Open
luweizheng opened this issue May 20, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

Collaborator

luweizheng commented May 20, 2024

Xorbits integrates the pyarrow backend; see this blog post for more info. We also introduced use_arrow_dtype in read_parquet. If the pyarrow backend is installed, Xorbits detects it and sets use_arrow_dtype to True in the configuration, so read_parquet returns columns with arrow dtypes. Some pyarrow dtypes differ from their pandas counterparts, for example timestamp. Suppose time is a timestamp column. With a pandas dtype we can write df["time"].dt, but with a pyarrow dtype the column does not have the dt attribute.

In other words, whenever arrow is installed, Xorbits uses it and use_arrow_dtype in the configuration is set to True, so the data is read in pyarrow format here: https://github.com/xorbitsai/xorbits/blob/b1f1107af931e9101b22e4f1e000add3820297b5/python/xorbits/_mars/dataframe/datasource/read_parquet.py#L181C1-L201C18

We could either document this behavior or change the default behavior of the ArrowEngine when reading parquet files.

@XprobeBot XprobeBot added the enhancement New feature or request label May 20, 2024
@XprobeBot XprobeBot added this to the v0.7.3 milestone May 20, 2024
@XprobeBot XprobeBot modified the milestones: v0.7.3, v0.7.4 Aug 22, 2024
@luweizheng luweizheng removed this from the v0.7.4 milestone Dec 16, 2024