Proposal to allow third-party engines for readers and writers #61584

Open
@datapythonista

PDEP-9 discussed the possibility of allowing third-party packages to automatically add pandas.read_<format> functions and DataFrame.to_<format> methods. One main challenge kept the proposal from moving forward: the complexity of managing multiple packages for the same format (conflicting names, differences in signatures...).

What I propose here is similar, but instead of registering readers/writers for whole formats, third-party packages would register engines for the existing formats. This is less ambitious, since it doesn't allow adding new formats to pandas (e.g. pandas.read_dicom, a format for medical data), but it still has the rest of the advantages of PDEP-9:

  • It still allows third-party packages to provide the code for pandas readers/writers (e.g. a faster csv reader, a new excel reader wrapping another excel library...)
  • It opens the door to removing from our code base connectors that can be better maintained elsewhere. As an example, engines like fastparquet for parquet, as well as others, are basically a mapping between our function signatures and theirs, with a bit of extra logic. I think the engines are far more likely to need changes because of changes in the wrapped library than because of changes in our function signatures, so to me it would be simpler and easier to maintain if each engine was part of the library it wraps (fastparquet, pyarrow). Moving engines out of pandas is something for the future, and it can be discussed individually, since it probably makes sense to keep many and move out some
  • There would be no need to deal with optional dependencies for the engines using this system. Dealing with optional dependencies adds complexity that we can avoid
  • It would simplify our dependencies significantly (if moving engines out of pandas happens), as well as our tests. We had problems in the past because we skip tests depending on whether a library can be imported or not, and for a while we were not running many pandas tests. Having fewer optional dependencies would help prevent this sort of problem.
  • Conflicts in this case seem unlikely. Most of the engines are named after the library they wrap, as opposed to libraries "fighting" to register a format name. Conflicts could still happen in some cases, but only for users with both conflicting packages installed, and we can warn in this case.
  • We will continue to control the signature for all readers and writers, which for the users means that the formats are fixed, and every format has a unique signature which is documented in our docs
  • In some cases we already use **kwargs for engine specific parameters. This provides extra flexibility while keeping most of the signature unified
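To illustrate the "mapping between signatures" idea from the list above, here is a minimal sketch of what a third-party engine could look like. Everything in it is an assumption for illustration: the function name `read_csv_stdlib_engine` is hypothetical, the standard library's `csv` module stands in for the wrapped library, and the function returns plain dicts where a real engine would return a DataFrame.

```python
import csv
import io


def read_csv_stdlib_engine(source, sep=",", **kwargs):
    """Hypothetical engine: map pandas-style arguments onto csv.reader."""
    # pandas readers call the delimiter `sep`; csv.reader calls it `delimiter`.
    # Engine-specific options (e.g. quotechar) pass through via **kwargs.
    rows = list(csv.reader(source, delimiter=sep, **kwargs))
    if not rows:
        return []
    # Treat the first row as column names (a simplification for this sketch).
    names, data = rows[0], rows[1:]
    return [dict(zip(names, row)) for row in data]


records = read_csv_stdlib_engine(io.StringIO("a;b\n1;2\n"), sep=";")
```

The unified part of the signature (`sep`) is translated explicitly, while anything specific to the wrapped library stays in `**kwargs`, which is the split described in the last bullet.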

Implementing this would have no impact on users unless they call a reader/writer with an unknown engine value. At that point, instead of raising as we do now, we would first check the registered entry points. If one exists for the format (e.g. "csv") and the provided engine name (e.g. "arrow-rs", a possible new reader based on Rust's Arrow implementation, if someone implements that), then the function provided by the entry point would handle the request.
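A rough sketch of that lookup, using the standard library's `importlib.metadata`, might look like the following. The helper name `resolve_engine`, the `builtin` mapping, the `candidates` parameter (there to make the function testable without installed packages) and the entry point group name `pandas.<format>.readers` are all assumptions for illustration; this proposal does not define any of them.

```python
import warnings
from importlib.metadata import entry_points


def resolve_engine(fmt, engine, builtin, candidates=None):
    """Return the reader callable for `engine`, preferring built-in engines."""
    if engine in builtin:
        return builtin[engine]

    if candidates is None:
        # Hypothetical group name; the proposal does not define one.
        # The `group=` keyword requires Python 3.10 or later.
        candidates = entry_points(group=f"pandas.{fmt}.readers")

    matches = [ep for ep in candidates if ep.name == engine]
    if not matches:
        # Same outcome as today: an unknown engine is an error.
        raise ValueError(f"Unknown engine {engine!r} for format {fmt!r}")
    if len(matches) > 1:
        # Two installed packages registered the same engine name.
        warnings.warn(
            f"Several packages register the {fmt!r} engine {engine!r}; "
            f"using the one from {matches[0].value!r}."
        )
    # EntryPoint.load() imports the module and returns the registered object.
    return matches[0].load()
```

Built-in engines are checked first, so existing calls behave as before, and entry points are only consulted for names pandas does not recognize.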

The only small drawback I can find is that, since engines would be generic, the API pages of the documentation won't be able to provide engine-specific information for engines that are not in pandas itself. I think this is very reasonable, and we can keep a registry of known connectors in the Ecosystem page with links to their docs, as we usually do.
