
[BUG] query95 @ 30TB negative allocation from BaseHashJoinIterator.countGroups with default 200 partitions #6983

Closed
abellina opened this issue Nov 2, 2022 · 5 comments
Labels: bug, cudf_dependency, reliability

Comments


abellina commented Nov 2, 2022

I am running into a negative allocation error with query95 when I run it at SF=30K and with the default number of shuffle partitions (200). If I change the shuffle partitions to 400, I don't see it anymore.
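
For reference, a minimal sketch of the workaround (400 is just the partition count that happened to avoid the failure at SF=30K; the session setup around it is illustrative, not taken from the actual run):

    import org.apache.spark.sql.SparkSession

    // Raising spark.sql.shuffle.partitions from the default 200 shrinks each
    // shuffle partition, so the per-task build/stream batch handed to the join
    // stays small enough for the output-size estimation to succeed.
    val spark = SparkSession.builder()
      .appName("query95-sf30k")                       // hypothetical app name
      .config("spark.sql.shuffle.partitions", "400")  // default is 200
      .getOrCreate()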

It is happening while the join code is trying to estimate its output size, which it does by invoking the cuDF hash aggregate. I added some debug info to try to narrow it down. In this case, with 200 partitions, the build side (which is also the stream side, since this is a self-join) is too big. This is an instance of #2354, where we would like the plugin to be robust enough to handle such a case without requiring a config change.

Executor task launch worker for task 7.0 in stage 40.0 (TID 14390) 22/11/02 16:21:14:813 INFO HashJoinIterator: about to call groupBy in countGroups with keysTable of size: 4810449888 B, and rowCount 1202612472
Exception in thread "Executor task launch worker for task 7.0 in stage 40.0 (TID 14390)" java.lang.IllegalArgumentException: requirement failed: onAllocFailure invoked with invalid allocSize -13122048
        at scala.Predef$.require(Predef.scala:281)
        at com.nvidia.spark.rapids.DeviceMemoryEventHandler.onAllocFailure(DeviceMemoryEventHandler.scala:115)
        at ai.rapids.cudf.Table.groupByAggregate(Native Method)
        at ai.rapids.cudf.Table.access$3000(Table.java:41)
        at ai.rapids.cudf.Table$GroupByOperation.aggregate(Table.java:3657)
        at org.apache.spark.sql.rapids.execution.BaseHashJoinIterator.$anonfun$countGroups$1(GpuHashJoin.scala:354)
        at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:61)
        at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:59)
        at com.nvidia.spark.rapids.AbstractGpuJoinIterator.withResource(AbstractGpuJoinIterator.scala:53)
        at org.apache.spark.sql.rapids.execution.BaseHashJoinIterator.countGroups(GpuHashJoin.scala:349)
        at org.apache.spark.sql.rapids.execution.BaseHashJoinIterator.guessStreamMagnificationFactor(GpuHashJoin.scala:374)
        at org.apache.spark.sql.rapids.execution.BaseHashJoinIterator.$anonfun$streamMagnificationFactor$1(GpuHashJoin.scala:256)
        at org.apache.spark.sql.rapids.execution.BaseHashJoinIterator.$anonfun$streamMagnificationFactor$1$adapted(GpuHashJoin.scala:255)
        at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:61)
        at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:59)
        at com.nvidia.spark.rapids.AbstractGpuJoinIterator.withResource(AbstractGpuJoinIterator.scala:53)
        at org.apache.spark.sql.rapids.execution.BaseHashJoinIterator.streamMagnificationFactor$lzycompute(GpuHashJoin.scala:255)
        at org.apache.spark.sql.rapids.execution.BaseHashJoinIterator.streamMagnificationFactor(GpuHashJoin.scala:253)
        at org.apache.spark.sql.rapids.execution.BaseHashJoinIterator.computeNumJoinRows(GpuHashJoin.scala:268)
        at com.nvidia.spark.rapids.SplittableJoinIterator.$anonfun$setupNextGatherer$6(AbstractGpuJoinIterator.scala:214)
        at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:61)
        at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:59)
        at com.nvidia.spark.rapids.AbstractGpuJoinIterator.withResource(AbstractGpuJoinIterator.scala:53)
        at com.nvidia.spark.rapids.SplittableJoinIterator.$anonfun$setupNextGatherer$5(AbstractGpuJoinIterator.scala:213)
        at com.nvidia.spark.rapids.GpuMetric.ns(GpuExec.scala:166)
        at com.nvidia.spark.rapids.SplittableJoinIterator.setupNextGatherer(AbstractGpuJoinIterator.scala:213)
        at com.nvidia.spark.rapids.AbstractGpuJoinIterator.hasNext(AbstractGpuJoinIterator.scala:93)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at com.nvidia.spark.rapids.CollectTimeIterator.$anonfun$hasNext$1(GpuExec.scala:193)
        at com.nvidia.spark.rapids.CollectTimeIterator.$anonfun$hasNext$1$adapted(GpuExec.scala:192)
        at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:61)
        at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:59)
        at com.nvidia.spark.RebaseHelper$.withResource(RebaseHelper.scala:26)
        at com.nvidia.spark.rapids.CollectTimeIterator.hasNext(GpuExec.scala:192)
        at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.hasNext(GpuCoalesceBatches.scala:288)
        at com.nvidia.spark.rapids.GpuShuffledHashJoinExec$.$anonfun$getBuiltBatchAndStreamIter$1(GpuShuffledHashJoinExec.scala:241)
        at com.nvidia.spark.rapids.Arm.closeOnExcept(Arm.scala:120)
        at com.nvidia.spark.rapids.Arm.closeOnExcept$(Arm.scala:118)
        at com.nvidia.spark.rapids.GpuShuffledHashJoinExec$.closeOnExcept(GpuShuffledHashJoinExec.scala:192)
        at com.nvidia.spark.rapids.GpuShuffledHashJoinExec$.getBuiltBatchAndStreamIter(GpuShuffledHashJoinExec.scala:235)
        at com.nvidia.spark.rapids.GpuShuffledHashJoinExec.$anonfun$doExecuteColumnar$1(GpuShuffledHashJoinExec.scala:174)
        at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
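
For context, the estimation path in the trace above (countGroups / guessStreamMagnificationFactor) boils down to grouping the build-side join keys and counting the groups to derive an average rows-per-key "magnification factor". The sketch below restates that idea in plain Spark DataFrame terms; it is an illustration of the logic only, not the plugin's GPU code path, and the function name is made up:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{avg, count}

    // Rough CPU-side analogue of the size estimation: the average number of
    // build rows per distinct join key approximates how much each stream row
    // may "magnify" in the join output.
    def estimateMagnification(build: DataFrame, keys: Seq[String]): Double = {
      val groupSizes = build.groupBy(keys.map(build.col): _*).agg(count("*").as("n"))
      groupSizes.agg(avg("n")).head().getDouble(0)
    }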

@abellina added the bug, ? - Needs Triage, and reliability labels on Nov 2, 2022

abellina commented Nov 2, 2022

Also, this is an inner join of the table with itself, with an additional condition (roughly the shape sketched below).
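
For readers unfamiliar with query95, the relevant part of TPC-DS q95 looks roughly like this (paraphrased from the benchmark query; the surrounding Spark session is assumed):

    // Self-join of web_sales on the order number, with an inequality condition
    // on the warehouse key -- the join whose size estimate blows up here.
    spark.sql("""
      SELECT ws1.ws_order_number
      FROM web_sales ws1, web_sales ws2
      WHERE ws1.ws_order_number = ws2.ws_order_number
        AND ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk
    """)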


abellina commented Nov 3, 2022

The negative overflow part of this is tracked in the cuDF issue rapidsai/cudf#12058.


ttnghia commented Nov 4, 2022

So from what we found, there are many places that can trigger overflow, especially in the context of the cuDF hash-based aggregate. In particular, the cuDF hash-based aggregate relies on a hash map data structure that is allocated with a "desired occupancy" of 50%. That means the hash table will allocate memory 2X larger than the input, which may cause overflow in various algorithms.

So the rule of thumb here may be to conservatively limit the input size to be smaller than INT_MAX / 2.
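
A minimal sketch of that rule of thumb (the threshold constant and the check are illustrative, not actual plugin or cuDF code):

    // With ~50% desired occupancy, the hash table is sized at roughly 2X the
    // input rows, so keep the input below INT_MAX / 2 to avoid 32-bit overflow.
    val maxHashAggInputRows: Long = Int.MaxValue.toLong / 2  // 1,073,741,823

    def exceedsSafeAggSize(rowCount: Long): Boolean = rowCount > maxHashAggInputRows

    // The failing batch above had 1,202,612,472 rows, which is over the limit.
    assert(exceedsSafeAggSize(1202612472L))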


abellina commented Nov 9, 2022

I found other issues with this query after applying @davidwent's fix rapidsai/cudf#12079. I'll file a separate issue for that and likely something either in cuco or cuDF.


sameerz commented Nov 18, 2022

The cuDF issues are resolved; we can now run Q95 at SF=30K with 200 partitions.

@sameerz closed this as completed on Nov 18, 2022