
[BUG] query95 @ 30TB negative allocation from BaseHashJoinIterator.countGroups with default 200 partitions #6983

Closed
abellina opened this issue Nov 2, 2022 · 5 comments
Labels: bug, cudf_dependency, reliability

Comments


abellina commented Nov 2, 2022

I am running into a negative allocation error with query95 when I run it at SF=30K and with the default number of shuffle partitions (200). If I change the shuffle partitions to 400, I don't see it anymore.
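
For reference, a minimal sketch of the workaround (400 is just the partition count that happened to avoid the failure at SF=30K; the session setup around it is illustrative, not taken from the actual run):

    import org.apache.spark.sql.SparkSession

    // Raising spark.sql.shuffle.partitions from the default 200 shrinks each
    // shuffle partition, so the per-task build/stream batch handed to the join
    // stays small enough for the output-size estimation to succeed.
    val spark = SparkSession.builder()
      .appName("query95-sf30k")                       // hypothetical app name
      .config("spark.sql.shuffle.partitions", "400")  // default is 200
      .getOrCreate()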

It is happening while the join code is trying to estimate its output size, which it does by invoking the cuDF hash aggregate. I added some debug info to try to narrow it down. In this case, with 200 partitions, the build side (which is also the stream side, since this is a self-join) is too big. This is an instance of #2354, where we would like the plugin to be robust enough to handle such a case without requiring a config change.

Executor task launch worker for task 7.0 in stage 40.0 (TID 14390) 22/11/02 16:21:14:813 INFO HashJoinIterator: about to call groupBy in countGroups with keysTable of size: 4810449888 B, and rowCount 1202612472
Exception in thread "Executor task launch worker for task 7.0 in stage 40.0 (TID 14390)" java.lang.IllegalArgumentException: requirement failed: onAllocFailure invoked with invalid allocSize -13122048
        at scala.Predef$.require(Predef.scala:281)
        at com.nvidia.spark.rapids.DeviceMemoryEventHandler.onAllocFailure(DeviceMemoryEventHandler.scala:115)
        at ai.rapids.cudf.Table.groupByAggregate(Native Method)
        at ai.rapids.cudf.Table.access$3000(Table.java:41)
        at ai.rapids.cudf.Table$GroupByOperation.aggregate(Table.java:3657)
        at org.apache.spark.sql.rapids.execution.BaseHashJoinIterator.$anonfun$countGroups$1(GpuHashJoin.scala:354)
        at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:61)
        at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:59)
        at com.nvidia.spark.rapids.AbstractGpuJoinIterator.withResource(AbstractGpuJoinIterator.scala:53)
        at org.apache.spark.sql.rapids.execution.BaseHashJoinIterator.countGroups(GpuHashJoin.scala:349)
        at org.apache.spark.sql.rapids.execution.BaseHashJoinIterator.guessStreamMagnificationFactor(GpuHashJoin.scala:374)
        at org.apache.spark.sql.rapids.execution.BaseHashJoinIterator.$anonfun$streamMagnificationFactor$1(GpuHashJoin.scala:256)
        at org.apache.spark.sql.rapids.execution.BaseHashJoinIterator.$anonfun$streamMagnificationFactor$1$adapted(GpuHashJoin.scala:255)
        at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:61)
        at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:59)
        at com.nvidia.spark.rapids.AbstractGpuJoinIterator.withResource(AbstractGpuJoinIterator.scala:53)
        at org.apache.spark.sql.rapids.execution.BaseHashJoinIterator.streamMagnificationFactor$lzycompute(GpuHashJoin.scala:255)
        at org.apache.spark.sql.rapids.execution.BaseHashJoinIterator.streamMagnificationFactor(GpuHashJoin.scala:253)
        at org.apache.spark.sql.rapids.execution.BaseHashJoinIterator.computeNumJoinRows(GpuHashJoin.scala:268)
        at com.nvidia.spark.rapids.SplittableJoinIterator.$anonfun$setupNextGatherer$6(AbstractGpuJoinIterator.scala:214)
        at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:61)
        at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:59)
        at com.nvidia.spark.rapids.AbstractGpuJoinIterator.withResource(AbstractGpuJoinIterator.scala:53)
        at com.nvidia.spark.rapids.SplittableJoinIterator.$anonfun$setupNextGatherer$5(AbstractGpuJoinIterator.scala:213)
        at com.nvidia.spark.rapids.GpuMetric.ns(GpuExec.scala:166)
        at com.nvidia.spark.rapids.SplittableJoinIterator.setupNextGatherer(AbstractGpuJoinIterator.scala:213)
        at com.nvidia.spark.rapids.AbstractGpuJoinIterator.hasNext(AbstractGpuJoinIterator.scala:93)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at com.nvidia.spark.rapids.CollectTimeIterator.$anonfun$hasNext$1(GpuExec.scala:193)
        at com.nvidia.spark.rapids.CollectTimeIterator.$anonfun$hasNext$1$adapted(GpuExec.scala:192)
        at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:61)
        at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:59)
        at com.nvidia.spark.RebaseHelper$.withResource(RebaseHelper.scala:26)
        at com.nvidia.spark.rapids.CollectTimeIterator.hasNext(GpuExec.scala:192)
        at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.hasNext(GpuCoalesceBatches.scala:288)
        at com.nvidia.spark.rapids.GpuShuffledHashJoinExec$.$anonfun$getBuiltBatchAndStreamIter$1(GpuShuffledHashJoinExec.scala:241)
        at com.nvidia.spark.rapids.Arm.closeOnExcept(Arm.scala:120)
        at com.nvidia.spark.rapids.Arm.closeOnExcept$(Arm.scala:118)
        at com.nvidia.spark.rapids.GpuShuffledHashJoinExec$.closeOnExcept(GpuShuffledHashJoinExec.scala:192)
        at com.nvidia.spark.rapids.GpuShuffledHashJoinExec$.getBuiltBatchAndStreamIter(GpuShuffledHashJoinExec.scala:235)
        at com.nvidia.spark.rapids.GpuShuffledHashJoinExec.$anonfun$doExecuteColumnar$1(GpuShuffledHashJoinExec.scala:174)
        at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
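
For context, the estimation path in the trace above (countGroups / guessStreamMagnificationFactor) boils down to grouping the build-side join keys and counting the groups to derive an average rows-per-key "magnification factor". The sketch below restates that idea in plain Spark DataFrame terms; it is an illustration of the logic only, not the plugin's GPU code path, and the function name is made up:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{avg, count}

    // Rough CPU-side analogue of the size estimation: the average number of
    // build rows per distinct join key approximates how much each stream row
    // may "magnify" in the join output.
    def estimateMagnification(build: DataFrame, keys: Seq[String]): Double = {
      val groupSizes = build.groupBy(keys.map(build.col): _*).agg(count("*").as("n"))
      groupSizes.agg(avg("n")).head().getDouble(0)
    }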

@abellina added the bug, ? - Needs Triage, and reliability labels on Nov 2, 2022

abellina commented Nov 2, 2022

Also, this is an inner join of the table with itself, with an additional condition (roughly the shape sketched below).
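
For readers unfamiliar with query95, the relevant part of TPC-DS q95 looks roughly like this (paraphrased from the benchmark query; the surrounding Spark session is assumed):

    // Self-join of web_sales on the order number, with an inequality condition
    // on the warehouse key -- the join whose size estimate blows up here.
    spark.sql("""
      SELECT ws1.ws_order_number
      FROM web_sales ws1, web_sales ws2
      WHERE ws1.ws_order_number = ws2.ws_order_number
        AND ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk
    """)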


abellina commented Nov 3, 2022

The negative overflow part of this is tracked in the cuDF issue rapidsai/cudf#12058.


ttnghia commented Nov 4, 2022

So from what we found, there are many places that can trigger overflow, especially in the context of the cuDF hash-based aggregate. In particular, the cuDF hash-based aggregate relies on a hash map data structure that is allocated with a "desired occupancy" of 50%. That means the hash table will allocate memory 2X larger than the input, which may cause overflow in various algorithms.

So the rule of thumb here may be to conservatively limit the input size to be smaller than INT_MAX / 2.
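
A minimal sketch of that rule of thumb (the threshold constant and the check are illustrative, not actual plugin or cuDF code):

    // With ~50% desired occupancy, the hash table is sized at roughly 2X the
    // input rows, so keep the input below INT_MAX / 2 to avoid 32-bit overflow.
    val maxHashAggInputRows: Long = Int.MaxValue.toLong / 2  // 1,073,741,823

    def exceedsSafeAggSize(rowCount: Long): Boolean = rowCount > maxHashAggInputRows

    // The failing batch above had 1,202,612,472 rows, which is over the limit.
    assert(exceedsSafeAggSize(1202612472L))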


abellina commented Nov 9, 2022

I found other issues with this query after applying @davidwent's fix rapidsai/cudf#12079. I'll file a separate issue for that and likely something either in cuco or cuDF.


sameerz commented Nov 18, 2022

The cuDF issues are resolved; we can now run Q95 at SF=30K with 200 partitions.

@sameerz closed this as completed on Nov 18, 2022