Add support for arrow large_string in cudf (#15093)
This PR adds support for the `large_string` type of `arrow` arrays in `cudf`. `cudf` strings columns do not yet support 64-bit offsets; that work is in progress in #13733.

This workaround is essential because `pandas-2.2+` now defaults to the `large_string` type for arrow-strings instead of the `string` type: pandas-dev/pandas#56220

This PR fixes all 25 `dask-cudf` failures.
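
To illustrate why the offset width matters, here is a minimal standard-library sketch of an Arrow-style variable-length string layout: a byte buffer plus `n + 1` offsets, stored as 32-bit integers for `string` and 64-bit integers for `large_string`. The helper names `build_offsets` and `can_narrow_to_32bit` are hypothetical and exist only for this illustration; this is not how `cudf` or `pyarrow` implement the cast.

```python
from array import array

INT32_MAX = 2**31 - 1  # largest value a signed 32-bit offset can hold


def build_offsets(strings, typecode):
    """Return (offsets, data) for an Arrow-style variable-length string layout.

    Hypothetical helper: typecode "i" stands in for the 32-bit offsets of
    `string`, and "q" for the 64-bit offsets of `large_string`.
    """
    encoded = [s.encode("utf-8") for s in strings]
    offsets = array(typecode, [0])
    for chunk in encoded:
        # Each offset marks where the next string starts in the byte buffer.
        offsets.append(offsets[-1] + len(chunk))
    return offsets, b"".join(encoded)


def can_narrow_to_32bit(offsets):
    """True when the final (largest) offset fits in a signed 32-bit integer."""
    return offsets[-1] <= INT32_MAX


offsets64, data = build_offsets(["a", "bc", "def"], "q")
offsets32, _ = build_offsets(["a", "bc", "def"], "i")
```

For small inputs the two layouts hold identical offset values and differ only in width, which is why casting `large_string` to `string` is a reasonable stopgap whenever the total byte length stays within the 32-bit range.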

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Matthew Roeschke (https://github.com/mroeschke)
  - Ashwin Srinath (https://github.com/shwina)

URL: #15093
galipremsagar authored Feb 20, 2024
1 parent 44686ca commit 6903f80
Showing 3 changed files with 17 additions and 0 deletions.
7 changes: 7 additions & 0 deletions python/cudf/cudf/core/column/column.py
@@ -1920,6 +1920,13 @@ def as_column(
        return col

    elif isinstance(arbitrary, (pa.Array, pa.ChunkedArray)):
        if pa.types.is_large_string(arbitrary.type):
            # Pandas-2.2+: Pandas defaults to `large_string` type
            # instead of `string` without data-introspection.
            # Temporary workaround until cudf has native
            # support for `LARGE_STRING` i.e., 64 bit offsets
            arbitrary = arbitrary.cast(pa.string())

        if pa.types.is_float16(arbitrary.type):
            raise NotImplementedError(
                "Type casting from `float16` to `float32` is not "
8 changes: 8 additions & 0 deletions python/cudf/cudf/tests/test_series.py
@@ -2700,3 +2700,11 @@ def test_series_dtype_astypes(data):
    result = cudf.Series(data, dtype="float64")
    expected = cudf.Series([1.0, 2.0, 3.0])
    assert_eq(result, expected)


def test_series_from_large_string():
    pa_large_string_array = pa.array(["a", "b", "c"]).cast(pa.large_string())
    got = cudf.Series(pa_large_string_array)
    expected = pd.Series(pa_large_string_array)

    assert_eq(expected, got)
2 changes: 2 additions & 0 deletions python/cudf/cudf/utils/dtypes.py
@@ -213,6 +213,8 @@ def cudf_dtype_from_pa_type(typ):
        return cudf.core.dtypes.StructDtype.from_arrow(typ)
    elif pa.types.is_decimal(typ):
        return cudf.core.dtypes.Decimal128Dtype.from_arrow(typ)
    elif pa.types.is_large_string(typ):
        return cudf.dtype("str")
    else:
        return cudf.api.types.pandas_dtype(typ.to_pandas_dtype())
