[QST] dask_cudf.read_parquet failed with "NotImplementedError: large_string" #13039

Comments
Team, can you please take a look? This is currently a show-stopper for me and my GPU-related development is completely blocked. Thanks a lot!

I've got a few more bug-like issues; I'll raise them here shortly.

In the meanwhile, if someone has a good introduction/tutorial about cudf other than the ones already posted (like the 10-minute series), please share it here.
I've appended the full error log below.

Reading the log, I found that the pandas-written parquet file was successfully converted at first; the highlighted line of code in the error log shows where the failure happened. By the way, I am not sure whether this involves support for the large_string type itself. Hope you guys can take a look, thanks a lot!
Thanks for raising @stucash - I'll take a look at this today to see how I can help.
@stucash - I cannot be 100% certain without having a complete write + read reproducer to run locally. However, it looks like your original dataset may contain extremely large row-groups. Unfortunately, until very recently the default row-group size in PyArrow was 64Mi rows, which can sometimes result in string columns that cannot be read back by cudf.

If the problem is that the row-groups are too large, you will need to rewrite the files with polars or pandas, passing through a smaller row-group size.

It may also be possible that your row-groups are within the cudf limit, but that pyarrow is choosing to use a large_string when converting the dtype to pandas (note that cudf's read_parquet code currently leans on arrow's native pandas logic to figure out which cudf dtypes to use). If I can get my hands on a reproducer for this, we can probably resolve the problem in cudf/dask-cudf.
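A minimal sketch of what such a rewrite could look like with pyarrow; the file paths and the row-group size of 1,000,000 rows below are illustrative assumptions, not values from this thread:

```python
# Sketch: rewrite an existing parquet file with smaller row groups so that
# each string column chunk stays well under cudf's per-column string limit.
import pyarrow.parquet as pq

table = pq.read_table("input.parquet")  # placeholder path
pq.write_table(
    table,
    "output_smaller_row_groups.parquet",  # placeholder path
    row_group_size=1_000_000,  # illustrative value, not a recommendation from the thread
)

# The rewritten file could then be read back with dask_cudf:
# import dask_cudf
# ddf = dask_cudf.read_parquet("output_smaller_row_groups.parquet")
```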
Hmmm, this is interesting. I was expecting the same pyarrow issue to show up for a pandas-written parquet file as well (since both are presumably using arrow as the backend). Good to know.
@rjzamora Thanks for taking the time to investigate; I've got 7 parquet files (roughly 1.5GB per file), originally written out from pandas. Let me prepare a reproducer with data; in the meanwhile I'll try your suggestion of rewriting with a smaller row-group size.
Thank you @stucash for posting this. It also occurs to me that libcudf does not support the large_string Arrow type.
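As a quick sanity check before involving cudf, one can inspect the Arrow schema that pyarrow would use for a given file and look for large_string fields. This is only a sketch; the file path is a placeholder:

```python
# Sketch: list the columns whose Arrow type would be large_string.
import pyarrow as pa
import pyarrow.parquet as pq

schema = pq.read_schema("data.parquet")  # placeholder path
large_string_cols = [f.name for f in schema if pa.types.is_large_string(f.type)]
print("large_string columns:", large_string_cols)
```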
I've been having this kind of issue using NVTabular, which relies on cudf. Given that pandas was not an option due to its slowness, I managed to save the polars dataframe through a manually crafted function that uses pyarrow. This part simplifies the schema, removing the large_* types:

```python
from warnings import warn

import pyarrow as pa


def pyarrow_simplified_schema(schema: pa.Schema) -> pa.Schema:
    """
    Simplify a PyArrow schema so cudf/NVTabular can read the data.

    Converts LargeList<LargeString> fields to List<String>, top-level
    LargeString fields to String, and downcasts float64 to float32.

    Parameters
    ----------
    schema : pa.Schema
        The original schema of the PyArrow Table.

    Returns
    -------
    pa.Schema
        A new schema with the simplified field types.
    """
    fields = []
    for field in schema:
        if pa.types.is_float64(field.type):
            warn(
                f"NVTabular does not support double precision, downcasting {field.name} to float32"
            )
            fields.append(pa.field(field.name, pa.float32()))
        elif pa.types.is_large_list(field.type) or pa.types.is_list(field.type):
            if pa.types.is_large_string(field.type.value_type):
                fields.append(pa.field(field.name, pa.list_(pa.string())))
            elif pa.types.is_float64(field.type.value_type):
                warn(
                    f"NVTabular does not support double precision, downcasting {field.name} to float32"
                )
                fields.append(pa.field(field.name, pa.list_(pa.float32())))
            else:
                # passthrough on other value types (re-wrapped as a plain list)
                fields.append(pa.field(field.name, pa.list_(field.type.value_type)))
        elif pa.types.is_large_string(field.type):
            fields.append(pa.field(field.name, pa.string()))
        else:
            # passthrough on other types
            fields.append(field)
    return pa.schema(fields)
```

After this I was able to load everything correctly using the cudf implementation of NVTabular.
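For context, here is a hedged usage sketch of the function above (file paths are placeholders, and the cast assumes each column chunk fits within the plain string/list size limits):

```python
# Sketch: apply pyarrow_simplified_schema before writing a cudf-friendly copy.
import pyarrow.parquet as pq

table = pq.read_table("polars_written.parquet")              # placeholder path
table = table.cast(pyarrow_simplified_schema(table.schema))  # drop large_* types
pq.write_table(table, "cudf_friendly.parquet")               # placeholder path
```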
@GregoryKimball @rjzamora, is this now likely resolved due to the merge of large string support and interop with pyarrow in v24.08?
I am a new user of dask/dask_cudf. I have parquet files of various sizes (11GB, 2.5GB, 1.1GB), all of which failed with NotImplementedError: large_string. My dask.dataframe backend is cudf; when the backend is pandas, read_parquet works fine.

Here's an excerpt of what my data looks like in csv format:

What I did was really simple:

The only large string I could think of is the timestamp string; its format is 2023-03-12 09:00:00+00:00. Is there a way around this in cudf, as it is not implemented yet?
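If the timestamps really are the only strings involved, one possible workaround (just a sketch; the column and file names below are assumptions, not taken from the issue) is to parse them into a proper datetime dtype before writing the parquet files, so no string column is produced at all:

```python
# Sketch: store the timestamp as a datetime column instead of a string,
# so the parquet file no longer needs a (large_)string column for it.
import pandas as pd

df = pd.read_csv("source.csv")                                # placeholder path
df["timestamp"] = pd.to_datetime(df["timestamp"], utc=True)   # assumed column name
df.to_parquet("with_datetime_timestamps.parquet")             # placeholder path
```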