Request For Help: unexplained ArrowInvalid overflow #61776

Closed

jbrockmendel
Member

Because of #61775, and to address failures in #61732, I'm trying out calling pd.to_datetime in ArrowEA._box_pa_array when we have a timestamp type. AFAICT this isn't breaking anything at construction time (see the assertion this adds, which isn't failing in any tests). What is breaking is subsequent subtraction operations, which raise pyarrow.lib.ArrowInvalid: overflow.

pytest "pandas/tests/extension/test_arrow.py::TestArrowArray::test_arith_series_with_scalar[__sub__-timestamp[s, tz=US/Eastern]]"
[...]
E   pyarrow.lib.ArrowInvalid: overflow

It is happening on both sub and rsub ops. When I try operating on a subset of the array, it looks like the exception only occurs when I use a slice that contains a null.

To examine the buffers, I added a breakpoint after the assertion in the diff. In the relevant case, alt[8] is null:

left = alt[8:10]
right = pa_array[8:10]

lb = left.buffers()[1]  # data buffer; buffers()[0] is the validity bitmap
rb = right.buffers()[1]

(Pdb) np.asarray(lb[64:72]).view("M8[ns]")
array(['NaT'], dtype='datetime64[ns]')

(Pdb) np.asarray(rb[64:72]).view("M8[ns]")
array(['1970-01-01T00:00:00.000000000'], dtype='datetime64[ns]')

So my current hypothesis is that when we get to the pc.subtract_checked call, it isn't skipping the iNaT entry despite the null bit, and the subtraction for that entry is overflowing. This seems likely unintentional and may be an upstream bug cc @jorisvandenbossche?
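A minimal sketch of that hypothesis (assuming np/pd/pa/pc imported as usual): hand-build a timestamp array with pa.Array.from_buffers so the null slot holds iNaT in the data buffer, the way a zero-copy conversion would, then subtract:

values = np.array([0, np.iinfo("int64").min], dtype="int64")  # slot 1 holds iNaT
validity = np.packbits([1, 0], bitorder="little")  # slot 0 valid, slot 1 null
arr = pa.Array.from_buffers(pa.timestamp("ns"), 2, [pa.py_buffer(validity), pa.py_buffer(values)])
other = pa.scalar(pd.Timestamp("1970-01-01 00:00:01"), type=pa.timestamp("ns"))
pc.subtract_checked(arr, other)  # raises pyarrow.lib.ArrowInvalid: overflow if the masked slot isn't skipped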

Regardless of whether it is an upstream bug, I could use guidance on how to make the construction with to_datetime work. Filtering out Decimal(NaN) manually would be pretty inefficient.

@jbrockmendel jbrockmendel marked this pull request as draft July 4, 2025 02:04
@jorisvandenbossche
Member

So my current hypothesis is that when we get to the pc.subtract_checked call, it isn't skipping the iNaT entry despite the null bit, and the subtraction for that entry is overflowing.

I assume that is indeed what is happening here, because there is in any case an (unfortunately long-standing) bug for exactly this case: apache/arrow#35088 (rereading the issue and based on Weston's comment, it seems the fix should actually be quite easy).

A workaround might be to cast the timestamps to int64 (which should be zero-copy), and then the subtract_checked kernel should work correctly:

>>> arr = pa.array(pd.Series([pd.Timestamp("2020-01-01"), None]))
>>> other = pa.scalar(pd.Timestamp("2019-12-31T20:01:01"), type=arr.type)
>>> 
>>> pc.subtract_checked(arr, other)
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
Cell In[35], line 1
----> 1 pc.subtract_checked(arr, other)

File ~/conda/envs/dev/lib/python3.11/site-packages/pyarrow/compute.py:252, in _make_generic_wrapper.<locals>.wrapper(memory_pool, *args)
    250 if args and isinstance(args[0], Expression):
    251     return Expression._call(func_name, list(args))
--> 252 return func.call(args, None, memory_pool)

File ~/conda/envs/dev/lib/python3.11/site-packages/pyarrow/_compute.pyx:407, in pyarrow._compute.Function.call()

File ~/conda/envs/dev/lib/python3.11/site-packages/pyarrow/error.pxi:155, in pyarrow.lib.pyarrow_internal_check_status()

File ~/conda/envs/dev/lib/python3.11/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()

ArrowInvalid: overflow
>>> pc.subtract_checked(arr.cast("int64"), other.cast("int64")).cast(pa.duration(arr.type.unit)).to_pandas()
0   0 days 03:58:59
1               NaT
dtype: timedelta64[s]

And so you can indeed see that the underlying values would overflow if the value masked by the null is not ignored:

>>> np_arr = np.frombuffer(arr.buffers()[1], dtype="int64")
>>> np_arr
array([          1577836800, -9223372036854775808])
>>> other.value
1577822461
>>> np_arr - other.value
array([              14339, 9223372035276953347])
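For reference, the wrapped second value matches what plain modulo-2**64 int64 arithmetic predicts:

>>> INT64_MIN = -2**63  # numpy's NaT sentinel, as seen in np_arr above
>>> (INT64_MIN - 1577822461) % 2**64
9223372035276953347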

@jorisvandenbossche
Member

Regardless of whether it is an upstream bug, I could use guidance on how to make the construction with to_datetime work. Filtering out Decimal(NaN) manually would be pretty inefficient.

What do you want to change here exactly? Is the issue that pyarrow allows Decimal(NaN) as a null value when constructing from a list of scalars, and pandas does not (or the other way around), creating an inconsistency in behaviour?

@jorisvandenbossche
Member

Seeing #61773, I understand the issue now (it's also related to the fact that we specify pa.array(..., from_pandas=True) to allow NaN, since we support that in pandas for this creation, so we cannot turn that off; but then pyarrow does not seem to distinguish numpy vs. decimal NaN...).

In the end, the reason this overflow comes up in the tests because of this change is that in pd.to_datetime we create a numpy datetime64 array using NaT, and numpy uses the smallest int64 value as the NaT sentinel. When converting that numpy array to pyarrow, the data is converted zero-copy (only a bitmask is added), so the masked value is this smallest integer.
When pa.array(...) creates the array from the python scalars, it defaults to filling masked values with 0, so you don't run (or not as easily) into overflows.
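A minimal sketch contrasting the two construction paths (from_pandas=True here mirrors what pandas passes, as noted above):

>>> from_np = pa.array(np.array(["2020-01-01", "NaT"], dtype="datetime64[us]"), from_pandas=True)
>>> from_py = pa.array([pd.Timestamp("2020-01-01"), None], type=pa.timestamp("us"))
>>> np.frombuffer(from_np.buffers()[1], dtype="int64")  # zero-copy: the NaT sentinel survives
array([    1577836800000000, -9223372036854775808])
>>> np.frombuffer(from_py.buffers()[1], dtype="int64")  # built from scalars: masked slot filled with 0
array([1577836800000000,                0])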

So one workaround would be to also fill the created pyarrow array with zeros. One potential way of doing this:

>>> scalars = [pd.Timestamp("2020-01-01"), None]  # example input, matching the output below
>>> pa_type = pa.timestamp("us")
>>> 
>>> np_arr = pd.to_datetime(scalars).as_unit(pa_type.unit).values
>>> np_arr
array(['2020-01-01T00:00:00.000000',                        'NaT'],
      dtype='datetime64[us]')
>>> mask = np.isnat(np_arr)
>>> np_arr2 = np_arr.astype("int64")
>>> np_arr2
array([    1577836800000000, -9223372036854775808])
>>> np_arr2[mask] = 0
>>> pa_arr = pa.array(np_arr2, mask=mask, type=pa_type)
>>> pa_arr
<pyarrow.lib.TimestampArray object at 0x7f1ad0ef86a0>
[
  2020-01-01 00:00:00.000000,
  null
]
>>> np.frombuffer(pa_arr.buffers()[1], dtype="int64")
array([1577836800000000,                0])
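Packaged as a helper, the zero-fill workaround might look like the sketch below (the function name is hypothetical, not pandas API, and tz handling is ignored):

def _timestamps_to_pa_array(scalars, pa_type):
    # hypothetical sketch: build a pyarrow timestamp array via pd.to_datetime,
    # zero-filling NaT slots so no iNaT sentinel is left behind the null bits
    np_arr = pd.to_datetime(scalars).as_unit(pa_type.unit).values
    mask = np.isnat(np_arr)
    np_ints = np_arr.astype("int64")  # astype copies, so the mutation below is safe
    np_ints[mask] = 0
    return pa.array(np_ints, mask=mask, type=pa_type)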

@jbrockmendel
Member Author

So one workaround would be to also fill the created pyarrow array with zeros.

I eventually stumbled on that idea long after posting. Will give it a go in #61773. Thank you.

@jbrockmendel jbrockmendel deleted the bug-arrow-to_datetime branch July 4, 2025 14:10