[FEA] Profiling tool auto-tuner should keep doubling shuffle partitions and related AQE setting if shuffle stage's tasks failed with exit code 137
#1566 · Open · viadea opened this issue on Feb 28, 2025 · 3 comments
In a shuffle stage, if tasks fail with exit code 137 (SIGKILL, i.e. 128 + 9, which is typically what the kernel's OOM killer sends) and the stage itself fails, I suspect the OS is running out of memory.
So increasing the shuffle partitions and the related AQE settings might help.
For example:
For spark.sql.adaptive.coalescePartitions.initialPartitionNum: the initial number of shuffle partitions before coalescing. If not set, it defaults to spark.sql.shuffle.partitions.
Because the bootstrap-initialized GPU configs will always set spark.sql.adaptive.coalescePartitions.initialPartitionNum explicitly, we need to double both settings to be safe (see the sketch below).
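A minimal sketch of what the doubled settings could look like, assuming a hypothetical job that previously ran with 200 shuffle partitions (the values here are illustrative, not the auto-tuner's actual output):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Doubled from the hypothetical previous value of 200.
    .config("spark.sql.shuffle.partitions", "400")
    # Must be doubled as well, since bootstrap GPU configs set it explicitly
    # and it would otherwise override spark.sql.shuffle.partitions.
    .config("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "400")
    .getOrCreate()
)
```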
We have met this issue quite often in other customers' engagements in the past. In our case, it was due to unexpected overhead memory usage in the container. With the great work from @binmahone, we resolved this type of issue.
A summary here:
- Please use spark.rapids.memory.host.offHeapLimit.enabled to ensure there is a memory cap on all native memory usage.
- Turn on jemalloc, though expect less benefit in your case; for that customer the issue was due to Celeborn usage. A sketch of both settings follows this list.
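A sketch of the two mitigations above, assuming a PySpark session; the jemalloc path below is a placeholder for wherever libjemalloc is installed in your executor image:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Cap all native (off-heap) memory usage on the host.
    .config("spark.rapids.memory.host.offHeapLimit.enabled", "true")
    # Load jemalloc in executors via LD_PRELOAD (path is an assumption).
    .config("spark.executorEnv.LD_PRELOAD",
            "/usr/lib/x86_64-linux-gnu/libjemalloc.so.2")
    .getOrCreate()
)
```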
Thx @winningsix. Do you think this could be a zero-config effort, so that spark.rapids.memory.host.offHeapLimit.enabled could be enabled automatically with a limit derived from the spark.executor.memory/overhead configs?
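A hypothetical sketch of that zero-config idea, deriving a native-memory cap from the executor memory/overhead settings; the function name and the heuristic itself are illustrative, not the plugin's actual logic:

```python
def derive_offheap_limit_mb(executor_memory_mb, memory_overhead_mb=None):
    # Spark's default executor overhead is max(10% of heap, 384 MiB).
    if memory_overhead_mb is None:
        memory_overhead_mb = max(int(executor_memory_mb * 0.10), 384)
    # Cap native allocations at the container's overhead allotment, so the
    # process stays inside its memory limit and avoids the OOM killer.
    return memory_overhead_mb

# Example: a 16 GiB executor heap with the default overhead.
print(derive_offheap_limit_mb(16 * 1024))  # -> 1638
```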