[BUG] The cudf to_csv interface cannot write files larger than 2GB and displays a negative size error. #13785

Closed
Ploverain opened this issue Jul 31, 2023 · 3 comments
Labels: 0 - Waiting on Author (waiting for author to respond to review), bug (something isn't working), cuIO (cuIO issue), libcudf (affects libcudf (C++/CUDA) code)

Comments


Ploverain commented Jul 31, 2023

Describe the bug
The cudf to_csv interface cannot write files larger than 2GB and displays a negative size error.

Steps/Code to reproduce bug
import cudf

df = cudf.read_csv("3G.csv")
df.to_csv("result.csv")

Expected behavior
I expected df.to_csv() to create a 3 GB CSV file.

Environment overview (please complete the following information)

  • Environment location: [Bare-metal, Docker, Cloud(specify cloud provider)]
  • Method of cuDF install: [conda, Docker, or from source]
    • If method of install is [Docker], provide docker pull & docker run commands used

Environment details
Please run and paste the output of the cudf/print_env.sh script here, to gather any other relevant environment details

Additional context
Add any other context about the problem here.

Ploverain added the Needs Triage and bug labels Jul 31, 2023
@vuule vuule self-assigned this Aug 1, 2023
GregoryKimball (Contributor) commented

Thank you @Ploverain for sharing this request. It sounds like the CSV writer is limited by our strings column character limit of 2.1B characters (also see #13733). At a minimum we should provide a better error message that recommends partitioning the dataframe.
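For context, the "2.1B" figure matches the largest value of a signed 32-bit integer (2^31 - 1), which would presumably also explain why an oversized write is reported as a negative size. The sketch below illustrates that wraparound and estimates a row count per partition that stays under the limit. The helper names `to_int32` and `suggest_chunksize` are hypothetical (they are not cuDF APIs), and the 0.5 safety factor is an arbitrary assumption.

```python
INT32_MAX = 2**31 - 1  # 2,147,483,647, i.e. the "2.1B" character limit


def to_int32(n: int) -> int:
    """Reinterpret n as a signed 32-bit integer (two's complement wraparound)."""
    return (n + 2**31) % 2**32 - 2**31


def suggest_chunksize(total_chars: int, n_rows: int,
                      limit: int = INT32_MAX, safety: float = 0.5) -> int:
    """Estimate how many rows fit in one partition without exceeding `limit`.

    Assumes rows are roughly uniform in size; `safety` leaves headroom
    for rows that are longer than average.
    """
    if total_chars <= 0 or n_rows <= 0:
        raise ValueError("total_chars and n_rows must be positive")
    avg_chars_per_row = total_chars / n_rows
    rows = int(limit * safety / avg_chars_per_row)
    return max(1, min(n_rows, rows))


if __name__ == "__main__":
    three_gib = 3 * 1024**3
    # A 3 GiB character count does not fit in int32 and wraps to a negative value.
    print(to_int32(three_gib))  # -> -1073741824
    # Hypothetical example: 3 GiB of CSV text spread over 30 million rows.
    print(suggest_chunksize(three_gib, 30_000_000))
```

The estimate is only a starting point; rows with very long string fields may require a smaller value.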

GregoryKimball added the 0 - Backlog, libcudf, and cuIO labels and removed the Needs Triage label Aug 4, 2023
@GregoryKimball GregoryKimball added this to the CSV writer continuous improvement milestone Aug 4, 2023
beckernick (Member) commented Aug 4, 2023

You can use the chunksize parameter to get around this issue. E.g.,

import cudf

df = cudf.read_csv("3G.csv")
df.to_csv("result.csv", chunksize=5_000_000)  # assuming five million rows will work; you may want to try a higher or lower value

We could also explore handling this under the hood in the Python layer, via some kind of data introspection or otherwise (cc @wence-, as this came up in a recent discussion).
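As a rough illustration of what handling this in the Python layer might look like, here is a minimal sketch that writes a frame in row slices. This is not how cuDF implements `chunksize`; `chunk_bounds` and `write_csv_in_chunks` are hypothetical helpers. The sketch assumes a pandas-like frame that supports `len()`, `.iloc` slicing, and a `to_csv()` that returns a string when no path is given.

```python
def chunk_bounds(n_rows: int, chunksize: int):
    """Yield (start, stop) row ranges that cover n_rows in steps of chunksize."""
    if chunksize <= 0:
        raise ValueError("chunksize must be a positive integer")
    for start in range(0, n_rows, chunksize):
        yield start, min(start + chunksize, n_rows)


def write_csv_in_chunks(df, path: str, chunksize: int) -> None:
    """Write df to path one row slice at a time.

    Assumption: df is pandas-like (len(), .iloc, and to_csv() returning
    a str when called without a path). Only the first slice emits the header.
    """
    # An empty frame still gets one (empty) slice so the header is written.
    bounds = list(chunk_bounds(len(df), chunksize)) or [(0, 0)]
    with open(path, "w", newline="") as f:
        for start, stop in bounds:
            piece = df.iloc[start:stop]
            f.write(piece.to_csv(header=(start == 0)))


if __name__ == "__main__":
    # Row ranges for 10 rows written 4 at a time.
    print(list(chunk_bounds(10, 4)))  # -> [(0, 4), (4, 8), (8, 10)]
```

Each slice is converted to text separately, so no single string buffer has to hold the whole file. The built-in `chunksize` parameter remains the simpler option where it works.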

GregoryKimball added the 0 - Waiting on Author label and removed the 0 - Backlog label Aug 8, 2023
@GregoryKimball GregoryKimball modified the milestones: CSV writer continuous improvement, CSV reader continuous improvement May 17, 2024
davidwendt (Contributor) commented

If you have enough GPU memory, this should now work in 24.08; it was fixed in #16148.
Regardless, I would still recommend using the chunksize parameter, as mentioned in Nick's comment above.


5 participants