Request For Help: unexplained ArrowInvalid overflow #61776

Closed

jbrockmendel
Member

Because of #61775, and to address failures in #61732, I'm trying out calling pd.to_datetime in ArrowEA._box_pa_array when we have a timestamp type. AFAICT this isn't breaking anything at construction time (see the assertion this adds, which isn't failing in any tests). What is breaking is subsequent subtraction operations, which raise pyarrow.lib.ArrowInvalid: overflow.

pytest "pandas/tests/extension/test_arrow.py::TestArrowArray::test_arith_series_with_scalar[__sub__-timestamp[s, tz=US/Eastern]]"
[...]
E   pyarrow.lib.ArrowInvalid: overflow

It is happening on both sub and rsub ops. When I try operating on a subset of the array, it looks like the exception only occurs when I use a slice that contains a null.

To examine the buffers, I added a breakpoint after the assertion in the diff. In the relevant case, alt[8] is null:

left = alt[8:10]
right = pa_array[8:10]

lb = left.buffers()[1]  # data buffer; buffers()[0] is the validity bitmap
rb = right.buffers()[1]

(Pdb) np.asarray(lb[64:72]).view("M8[ns]")
array(['NaT'], dtype='datetime64[ns]')

(Pdb) np.asarray(rb[64:72]).view("M8[ns]")
array(['1970-01-01T00:00:00.000000000'], dtype='datetime64[ns]')

So my current hypothesis is that when we get to the pc.subtract_checked call, it isn't skipping the iNaT entry despite the null bit, and the subtraction for that entry is overflowing. This seems likely unintentional and may be an upstream bug cc @jorisvandenbossche?
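A minimal sketch of that hypothesis (assuming np/pd/pa/pc imported as usual): hand-build a timestamp array with pa.Array.from_buffers so the null slot holds iNaT in the data buffer, the way a zero-copy conversion would, then subtract:

values = np.array([0, np.iinfo("int64").min], dtype="int64")  # slot 1 holds iNaT
validity = np.packbits([1, 0], bitorder="little")  # slot 0 valid, slot 1 null
arr = pa.Array.from_buffers(pa.timestamp("ns"), 2, [pa.py_buffer(validity), pa.py_buffer(values)])
other = pa.scalar(pd.Timestamp("1970-01-01 00:00:01"), type=pa.timestamp("ns"))
pc.subtract_checked(arr, other)  # raises pyarrow.lib.ArrowInvalid: overflow if the masked slot isn't skipped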

Regardless of whether it is an upstream bug, I could use guidance on how to make the construction with to_datetime work. Filtering out Decimal(NaN) manually would be pretty inefficient.

@jbrockmendel jbrockmendel marked this pull request as draft July 4, 2025 02:04
@jorisvandenbossche
Member

So my current hypothesis is that when we get to the pc.subtract_checked call, it isn't skipping the iNaT entry despite the null bit, and the subtraction for that entry is overflowing.

I assume that is indeed what is happening here, because there is in any case an (unfortunately long-standing) bug for exactly this case: apache/arrow#35088 (rereading the issue and based on Weston's comment, it seems the fix should actually be quite easy).

A workaround might be to cast the timestamps to int64 (which should be zero-copy), and then the subtract_checked kernel should work correctly:

>>> arr = pa.array(pd.Series([pd.Timestamp("2020-01-01"), None]))
>>> other = pa.scalar(pd.Timestamp("2019-12-31T20:01:01"), type=arr.type)
>>> 
>>> pc.subtract_checked(arr, other)
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
Cell In[35], line 1
----> 1 pc.subtract_checked(arr, other)

File ~/conda/envs/dev/lib/python3.11/site-packages/pyarrow/compute.py:252, in _make_generic_wrapper.<locals>.wrapper(memory_pool, *args)
    250 if args and isinstance(args[0], Expression):
    251     return Expression._call(func_name, list(args))
--> 252 return func.call(args, None, memory_pool)

File ~/conda/envs/dev/lib/python3.11/site-packages/pyarrow/_compute.pyx:407, in pyarrow._compute.Function.call()

File ~/conda/envs/dev/lib/python3.11/site-packages/pyarrow/error.pxi:155, in pyarrow.lib.pyarrow_internal_check_status()

File ~/conda/envs/dev/lib/python3.11/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()

ArrowInvalid: overflow
>>> pc.subtract_checked(arr.cast("int64"), other.cast("int64")).cast(pa.duration(arr.type.unit)).to_pandas()
0   0 days 03:58:59
1               NaT
dtype: timedelta64[s]

And so you can indeed see that the underlying values would overflow if the value masked by the null is not ignored:

>>> np_arr = np.frombuffer(arr.buffers()[1], dtype="int64")
>>> np_arr
array([          1577836800, -9223372036854775808])
>>> other.value
1577822461
>>> np_arr - other.value
array([              14339, 9223372035276953347])
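For reference, the wrapped second value matches what plain modulo-2**64 int64 arithmetic predicts:

>>> INT64_MIN = -2**63  # numpy's NaT sentinel, as seen in np_arr above
>>> (INT64_MIN - 1577822461) % 2**64
9223372035276953347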

@jorisvandenbossche
Member

Regardless of whether it is an upstream bug, I could use guidance on how to make the construction with to_datetime work. Filtering out Decimal(NaN) manually would be pretty inefficient.

What do you want to change here exactly? Is the issue that pyarrow allows Decimal(NaN) as a null value when constructing from a list of scalars, and pandas does not (or the other way around), creating an inconsistency in behaviour?

@jorisvandenbossche
Member

Seeing #61773, I understand the issue now (it's also related to the fact that we specify pa.array(..., from_pandas=True) to allow NaN, since we support that in pandas for this creation, so we cannot turn that off; but then pyarrow does not seem to distinguish numpy vs. decimal NaN...).

In the end, the reason this overflow comes up in the tests because of this change is that in pd.to_datetime we create a numpy datetime64 array using NaT, and numpy uses the smallest int64 value as the NaT sentinel. When converting that numpy array to pyarrow, the data is converted zero-copy (only a bitmask is added), so the masked value is this smallest integer.
When pa.array(...) creates the array from the python scalars, it defaults to filling masked values with 0, so you don't run (or not as easily) into overflows.
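A minimal sketch contrasting the two construction paths (from_pandas=True here mirrors what pandas passes, as noted above):

>>> from_np = pa.array(np.array(["2020-01-01", "NaT"], dtype="datetime64[us]"), from_pandas=True)
>>> from_py = pa.array([pd.Timestamp("2020-01-01"), None], type=pa.timestamp("us"))
>>> np.frombuffer(from_np.buffers()[1], dtype="int64")  # zero-copy: the NaT sentinel survives
array([    1577836800000000, -9223372036854775808])
>>> np.frombuffer(from_py.buffers()[1], dtype="int64")  # built from scalars: masked slot filled with 0
array([1577836800000000,                0])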

So one workaround would be to also fill the created pyarrow array with zeros. One potential way of doing this:

>>> scalars = [pd.Timestamp("2020-01-01"), None]  # example input, matching the output below
>>> pa_type = pa.timestamp("us")
>>> 
>>> np_arr = pd.to_datetime(scalars).as_unit(pa_type.unit).values
>>> np_arr
array(['2020-01-01T00:00:00.000000',                        'NaT'],
      dtype='datetime64[us]')
>>> mask = np.isnat(np_arr)
>>> np_arr2 = np_arr.astype("int64")
>>> np_arr2
array([    1577836800000000, -9223372036854775808])
>>> np_arr2[mask] = 0
>>> pa_arr = pa.array(np_arr2, mask=mask, type=pa_type)
>>> pa_arr
<pyarrow.lib.TimestampArray object at 0x7f1ad0ef86a0>
[
  2020-01-01 00:00:00.000000,
  null
]
>>> np.frombuffer(pa_arr.buffers()[1], dtype="int64")
array([1577836800000000,                0])
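Packaged as a helper, the zero-fill workaround might look like the sketch below (the function name is hypothetical, not pandas API, and tz handling is ignored):

def _timestamps_to_pa_array(scalars, pa_type):
    # hypothetical sketch: build a pyarrow timestamp array via pd.to_datetime,
    # zero-filling NaT slots so no iNaT sentinel is left behind the null bits
    np_arr = pd.to_datetime(scalars).as_unit(pa_type.unit).values
    mask = np.isnat(np_arr)
    np_ints = np_arr.astype("int64")  # astype copies, so the mutation below is safe
    np_ints[mask] = 0
    return pa.array(np_ints, mask=mask, type=pa_type)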

@jbrockmendel
Member Author

So one workaround would be to also fill the created pyarrow array with zeros.

I eventually stumbled on that idea long after posting. Will give it a go in #61773. Thank you.

@jbrockmendel jbrockmendel deleted the bug-arrow-to_datetime branch July 4, 2025 14:10