api: move polars.from_dataframe to polars.interchange.from_dataframe, or even consider removing? #20065

Open
MarcoGorelli opened this issue Nov 29, 2024 · 1 comment
Labels
A-api Area: changes to the public API A-interchange Area: Python dataframe interchange protocol enhancement New feature or an improvement of an existing feature
Milestone

Comments

@MarcoGorelli
Collaborator

polars.from_dataframe uses the dataframe interchange protocol to convert between dataframes. However, users aren't necessarily aware of this, nor are they aware of the limitations of the interchange protocol. Proof: at EuroSciPy this year I saw someone use pl.from_dataframe to convert from pandas to Polars because they thought that was just the recommended way of doing it.

The reason this matters is that the interchange protocol is tied to the pandas and Polars implementations of it, and the pandas implementation had some severe (critical?) bugs before 2.2, which users wouldn't necessarily be aware of. A new Polars user could well use pl.from_dataframe with a pandas dataframe, get nonsense data back, and blame Polars even though the bug was on the pandas side.
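To make the version concern concrete, here is a minimal sketch of a guard that checks a pandas version string against the 2.2 threshold mentioned above. The helper name `interchange_is_safe` and its parsing rules are assumptions for illustration, not anything Polars or pandas ships.

```python
import re


def interchange_is_safe(pandas_version: str, minimum: tuple = (2, 2)) -> bool:
    """Return True if `pandas_version` is at least `minimum` (major, minor).

    Hypothetical helper: only the leading "major.minor" part of the
    version string is inspected; suffixes such as "rc0" are ignored.
    """
    match = re.match(r"(\d+)\.(\d+)", pandas_version)
    if match is None:
        raise ValueError(f"unrecognised version string: {pandas_version!r}")
    found = (int(match.group(1)), int(match.group(2)))
    # Tuples compare element by element, so (2, 10) >= (2, 2) holds.
    return found >= minimum
```

In real code the argument would be `pandas.__version__`; a caller could warn or fall back to another conversion path when the check returns False.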

So, my first inclination would be to move polars.from_dataframe to polars.interchange.from_dataframe
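A move like this is usually done with a deprecated alias at the old location. The following is a rough sketch of that pattern only: the conversion function is a stand-in that returns its input, `SimpleNamespace` plays the role of the new `interchange` namespace, and the warning text is made up. None of it is taken from the Polars codebase.

```python
import warnings
from types import SimpleNamespace


def _from_dataframe(df):
    """Stand-in for the real conversion; here it simply returns its input."""
    return df


# Hypothetical new home of the function: `interchange.from_dataframe`.
interchange = SimpleNamespace(from_dataframe=_from_dataframe)


def from_dataframe(df):
    """Old top-level entry point, kept as a thin deprecated alias."""
    warnings.warn(
        "`from_dataframe` has moved to `interchange.from_dataframe`; "
        "it relies on the dataframe interchange protocol.",
        DeprecationWarning,
        stacklevel=2,  # point the warning at the caller, not this wrapper
    )
    return interchange.from_dataframe(df)
```

The longer name makes the dependency on the interchange protocol visible at the call site, which is the point of the proposal.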

But second, we should probably have a conversation about whether we should keep it at all. Polars already supports the PyCapsule Interface for both import and export, so anyone wishing to convert between dataframes agnostically can already use that. It also opens the door to agnostically accessing the underlying data from, say, C or Rust.
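As a sketch of what "agnostic" means in practice: the two mechanisms are discovered through different dunder methods, `__arrow_c_stream__` for the Arrow PyCapsule Interface and `__dataframe__` for the interchange protocol. The function below only detects which one an object exposes. Its name and return values are assumptions for illustration.

```python
def preferred_protocol(obj) -> str:
    """Report which conversion mechanism `obj` exposes, preferring PyCapsule.

    Hypothetical helper: it inspects attributes only and converts nothing.
    """
    if hasattr(obj, "__arrow_c_stream__"):
        # Arrow PyCapsule Interface (stream export).
        return "pycapsule"
    if hasattr(obj, "__dataframe__"):
        # Dataframe interchange protocol.
        return "interchange"
    raise TypeError(
        f"{type(obj).__name__} exposes neither __arrow_c_stream__ nor __dataframe__"
    )
```

An object reporting "pycapsule" can be handed to any consumer that imports via the PyCapsule Interface. My understanding is that recent Polars versions accept such objects directly in the `pl.DataFrame` constructor, but check the documentation for your version.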

In terms of ecosystem:

  • the interchange protocol hasn't been updated since August 2023. It still doesn't support Date, Duration, or nested dtypes, and I can't see that changing: there's radio silence in the repo. This is a bit sad to me. We had our disagreements over the "dataframe api standard" project, but I was hoping that at least the interchange protocol would have continued
  • All open source users of the interchange protocol, with one exception, have switched to either the PyCapsule Interface or made their code totally dataframe agnostic via Narwhals. The exception is Seaborn, and they're debating what to do, but they don't seem keen on keeping the interchange protocol around
  • cuDF have announced their intention to deprecate support for it: "[FEA] Deprecate and remove data interchange protocol" (rapidsai/cudf#17403)

What I feel bad about is that Stijn was the only person to have read the interchange protocol spec carefully enough to have come up with a correct and useful implementation, and it would be a pity to see that effort go to waste. Do we take ownership of the interchange protocol and drive it forwards, or just let it sink and encourage the PyCapsule Interface for the same use cases?

TL;DR:

  • the dataframe interchange protocol has some design and implementation issues that don't look like they'll ever get resolved
  • The PyCapsule Interface allows for the same things, but much more robustly
  • cuDF want to remove support for the interchange protocol
  • downstream libraries have moved to different solutions
@stinodego stinodego added this to the 2.0.0 milestone Nov 29, 2024
@stinodego stinodego added A-interchange Area: Python dataframe interchange protocol A-api Area: changes to the public API enhancement New feature or an improvement of an existing feature labels Nov 29, 2024
@MarcoGorelli
Collaborator Author

Initial response from discussion:

  • OK to move polars.from_dataframe to polars.interchange.from_dataframe
  • there's no hurry to throw out support for the protocol immediately, we can keep monitoring the situation and see how things progress
