ENH: Create Better IntervalDtype using PyArrow structs. #53033

Open
@randolf-scholz

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

Currently, pandas.IntervalArray suffers from 3 major limitations:

  1. They are limited to data with the same closedness on both sides. (Edit: no longer the case, apparently.)
  2. All data points in an array are limited to the same closedness (i.e. the same array can store only closed intervals or only open intervals).
  3. Intervals do not allow missing values.
    • In particular, one cannot represent unbounded intervals for data types that lack an actual infinity value, like int32.
  4. Some dtypes, like string, are not allowed.

A practical application for (1) that I am very interested in is storing information about the range of valid values for the columns of another DataFrame.

Feature Description

Given the better integration with pyarrow since pandas 2.0, we can recreate IntervalDtype using pyarrow.struct:

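import pyarrow as pa

def arrow_interval_dtype(subtype):
    """Return a struct type for intervals whose bounds have type `subtype`."""
    fields = [
        ("lower_bound", subtype),
        ("upper_bound", subtype),
        # Closedness is stored per element, one flag per side.
        ("lower_inclusive", pa.bool_()),
        ("upper_inclusive", pa.bool_()),
    ]
    return pa.struct(fields)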

Contrary to the current IntervalDtype, this would solve all 3 major problems at once:

  1. Each element of the resulting StructArray can have its own closedness.
  2. All pyarrow data types support missing values.
  3. We can in principle use any ordered data type for the subtype.

Alternative Solutions

None.

Additional Context

A common request is adding extra operations for interval dtypes.

Additionally, one could imagine having an IntervalUnion type that can represent finite unions of intervals, combining the interval type discussed here with the pyarrow list type. This type would naturally arise when performing unions of intervals, such as [0, 2]∪[3, 5]. The nice thing here is that the resulting space is mathematically closed under the standard set operations (union, intersection, complement, difference).
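As a rough illustration of why a list of intervals is the natural result type, the following pure-Python sketch normalizes a union of intervals into sorted, disjoint pieces. The `Interval` tuple and the `union` function are hypothetical names for this example only. The sketch assumes non-empty intervals over a densely ordered type, so [0, 2] and [3, 5] are not considered adjacent.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    lower: float
    upper: float
    lower_inclusive: bool = True
    upper_inclusive: bool = True


def union(intervals):
    """Merge non-empty intervals into a sorted list of disjoint pieces."""
    # Sort by lower bound; at equal bounds, inclusive lower bounds come first.
    ordered = sorted(intervals, key=lambda iv: (iv.lower, not iv.lower_inclusive))
    merged = []
    for iv in ordered:
        if merged:
            last = merged[-1]
            # Two pieces join if they overlap, or if they touch at a point
            # that at least one of them contains.
            joins = iv.lower < last.upper or (
                iv.lower == last.upper
                and (iv.lower_inclusive or last.upper_inclusive)
            )
            if joins:
                # Keep the larger upper bound; at a tie, inclusive wins.
                if (iv.upper, iv.upper_inclusive) > (last.upper, last.upper_inclusive):
                    merged[-1] = last._replace(
                        upper=iv.upper, upper_inclusive=iv.upper_inclusive
                    )
                continue
        merged.append(iv)
    return merged


# [0, 2] ∪ [3, 5] cannot be written as a single interval.
separate = union([Interval(0, 2), Interval(3, 5)])
# [0, 2) ∪ [2, 5] collapses to [0, 5].
joined = union([Interval(0, 2, True, False), Interval(2, 5)])
# [0, 2) ∪ (2, 5] stays split because the point 2 belongs to neither piece.
punctured = union([Interval(0, 2, True, False), Interval(2, 5, False, True)])
```

In the proposed design, each element of such a result list would be one struct of the interval type, and the list itself would be one element of a pyarrow list array.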
