[BUG] query95 @ 30TB negative allocation from BaseHashJoinIterator.countGroups with default 200 partitions
#6983
Labels
bug
Something isn't working
cudf_dependency
An issue or PR with this label depends on a new feature in cudf
reliability
Features to improve reliability or bugs that severely impact the reliability of the plugin
I am running into a negative allocation error with query95 when I run it at SF=30K and with the default number of shuffle partitions (200). If I change the shuffle partitions to 400, I don't see it anymore.
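For anyone who needs the workaround in the meantime, this is a minimal sketch of raising the shuffle partition count from a Spark session. `spark.sql.shuffle.partitions` is the standard Spark SQL setting (default 200); the application name is a placeholder, and 400 is simply the value that avoided the error in my run, not a tuned recommendation.

```scala
import org.apache.spark.sql.SparkSession

// Placeholder app name; use whatever your benchmark harness sets.
val spark = SparkSession.builder().appName("query95").getOrCreate()

// Default is 200. With 400 the negative allocation did not reproduce for me.
spark.conf.set("spark.sql.shuffle.partitions", "400")
```

The same setting can be passed at launch with `--conf spark.sql.shuffle.partitions=400`.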
It happens while the join code is estimating its output size, which it does by invoking the cuDF hash aggregate. I added some debug info to try to narrow it down. In this case, with 200 partitions, the build side (which is also the stream side, since this is a join of the table with itself) is too big. This is an instance of #2354: we would like to figure out how to be robust enough to handle such a case without requiring a config change.
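To make the size-estimation step concrete, here is a rough sketch of estimating inner-join output rows from per-key group counts. It uses plain Scala collections and hypothetical names (`JoinSizeEstimate`, `estimateInnerJoinRows`); it is not the plugin's `countGroups` implementation or the cuDF API, only an illustration of the general idea.

```scala
// Illustrative only: plain Scala collections, not the plugin's or cuDF's API.
object JoinSizeEstimate {
  // Count how many rows share each join key (a group-by count).
  def countGroups(keys: Seq[Long]): Map[Long, Long] =
    keys.groupBy(identity).map { case (k, rows) => k -> rows.size.toLong }

  // Inner-join output rows: for each key on the build side, multiply its
  // count by the matching stream-side count, accumulating in a Long.
  def estimateInnerJoinRows(build: Seq[Long], stream: Seq[Long]): Long = {
    val buildCounts = countGroups(build)
    val streamCounts = countGroups(stream)
    buildCounts.iterator.map { case (k, b) => b * streamCounts.getOrElse(k, 0L) }.sum
  }
}
```

For a self-join, both arguments are the same key column, so a key that appears n times contributes n * n rows to the estimate. This is why a large or skewed partition can produce a very large estimate.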