[FEA] Add support for Parquet rowgroup and ORC stripe cudf size configs #11799
Comments
@ustcfy we currently do not honor https://github.com/rapidsai/cudf/blob/d1bad33caef34b8fa95543c7494780f2084ee603/cpp/include/cudf/io/orc.hpp#L41-L42, and those settings do not behave the same way as the Spark ORC writer configs. The size limit in cuDF applies to pre-compression data sizes. Spark/ORC's size limit, I believe, is post-compression and checked periodically. Beyond that, the RAPIDS Accelerator splits the input data into batches at an arbitrary point (targeting about 1 GiB uncompressed by default), and the cuDF ORC writer will not produce stripes that span these batches. Because of all of those differences, we decided not to expose these configs.
I would really like to understand your use case so that we can produce the correct solution. I am happy to expose these configs in a non-standard way, because they are not the same, but I am not sure that is what you really want.
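To make the batching point concrete, here is a rough model. It is not the actual logic of cuDF or the RAPIDS Accelerator: it assumes each batch is cut into stripes purely by a pre-compression byte limit and that no stripe spans two batches. The function name and all numbers are illustrative assumptions.

```python
MIB = 1024 * 1024


def stripes_per_batch(batch_bytes, stripe_limit_bytes):
    """Return the stripe count for each batch under a simplified model.

    Assumption: a stripe is closed whenever the uncompressed byte limit is
    reached, and always at the end of a batch (stripes never span batches).
    """
    if stripe_limit_bytes <= 0:
        raise ValueError("stripe_limit_bytes must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return [-(-b // stripe_limit_bytes) for b in batch_bytes]


# The same 140 MiB of uncompressed data, written as one batch or as two.
one_batch = stripes_per_batch([140 * MIB], 64 * MIB)
two_batches = stripes_per_batch([70 * MIB, 70 * MIB], 64 * MIB)

print(sum(one_batch))    # 3 stripes: 64 + 64 + 12 MiB
print(sum(two_batches))  # 4 stripes: (64 + 6) + (64 + 6) MiB
```

In this model, where the batch boundary falls changes the stripe count even though the data and the limit are identical, which is one reason the cuDF limits cannot be treated as drop-in equivalents of the Spark/ORC ones.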
I want to address issue #11735. I need to generate multiple stripes in the test, but with the default configuration the stripes seem too large. I feel that exposing the orc.stripe.row.count config would help.
Okay, then we can expose the size and row count configs, but let's do them as RAPIDS-specific configs for now. We can then decide if we want to honor the standard ORC ones. While we are at it, we should do the same for Parquet.
This needs cuDF to expose the relevant interfaces; the related issue is rapidsai/cudf#17785.
Is your feature request related to a problem? Please describe.
I hope support for the orc.stripe.row.count parameter can be added, which would facilitate testing by allowing precise control over the number of stripes generated. The only parameter related to ORC stripes that I found is orc.stripe.size, which is not very convenient to use.
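The convenience argument can be sketched with a little arithmetic. Assuming a row-count limit closes a stripe every N rows within a single batch, the expected stripe count depends only on the number of rows, whereas a byte limit also requires knowing the encoded size of each row. Both helper names are hypothetical, and this is a simplified model rather than the writer's real algorithm.

```python
def expected_stripes_by_rows(num_rows, stripe_row_count):
    """Stripe count when a stripe is closed every `stripe_row_count` rows."""
    if num_rows < 0 or stripe_row_count <= 0:
        raise ValueError("num_rows must be >= 0 and stripe_row_count > 0")
    return -(-num_rows // stripe_row_count)  # integer ceiling division


def expected_stripes_by_size(num_rows, bytes_per_row, stripe_size_bytes):
    """Stripe count under a byte limit, given an assumed size per row."""
    if num_rows < 0 or bytes_per_row <= 0 or stripe_size_bytes <= 0:
        raise ValueError("arguments must be positive (num_rows may be 0)")
    return -(-(num_rows * bytes_per_row) // stripe_size_bytes)


# With a row limit, a test can state the expected count directly.
print(expected_stripes_by_rows(10_000, 1_000))  # 10

# With a byte limit, the answer shifts with the per-row size estimate.
print(expected_stripes_by_size(10_000, 8, 16 * 1024))   # 5
print(expected_stripes_by_size(10_000, 12, 16 * 1024))  # 8
```

A test that asserts on the number of stripes is therefore much easier to write against a row-count setting, because it does not have to guess how many bytes each row occupies.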