[FEA] Profiling tool auto-tuner should keep doubling shuffle partitions and related AQE setting if shuffle stage's tasks failed with exit code 137
#1566 · Open · viadea opened this issue on Feb 28, 2025 · 3 comments
In a shuffle stage, if tasks fail with exit code 137 (SIGKILL, i.e. 128 + 9, which is typically what the kernel's OOM killer sends) and the stage itself fails, I suspect the OS is running out of memory.
So increasing the shuffle partitions and the related AQE settings might help.
For example:
For spark.sql.adaptive.coalescePartitions.initialPartitionNum: the initial number of shuffle partitions before coalescing. If not set, it defaults to spark.sql.shuffle.partitions.
Because the bootstrap-initialized GPU configs will always set spark.sql.adaptive.coalescePartitions.initialPartitionNum explicitly, we need to double both settings to be safe (see the sketch below).
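A minimal sketch of what the doubled settings could look like, assuming a hypothetical job that previously ran with 200 shuffle partitions (the values here are illustrative, not the auto-tuner's actual output):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Doubled from the hypothetical previous value of 200.
    .config("spark.sql.shuffle.partitions", "400")
    # Must be doubled as well, since bootstrap GPU configs set it explicitly
    # and it would otherwise override spark.sql.shuffle.partitions.
    .config("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "400")
    .getOrCreate()
)
```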
We have met this issue quite often in other customers' engagements in the past. In our case, it was due to unexpected overhead memory usage in the container. With the great work from @binmahone, we resolved this type of issue.
A summary here:
- Please use spark.rapids.memory.host.offHeapLimit.enabled to ensure there is a memory cap on all native memory usage.
- Turn on jemalloc, though expect less benefit in your case; for that customer the issue was due to Celeborn usage. A sketch of both settings follows this list.
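A sketch of the two mitigations above, assuming a PySpark session; the jemalloc path below is a placeholder for wherever libjemalloc is installed in your executor image:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Cap all native (off-heap) memory usage on the host.
    .config("spark.rapids.memory.host.offHeapLimit.enabled", "true")
    # Load jemalloc in executors via LD_PRELOAD (path is an assumption).
    .config("spark.executorEnv.LD_PRELOAD",
            "/usr/lib/x86_64-linux-gnu/libjemalloc.so.2")
    .getOrCreate()
)
```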
Thx @winningsix. Do you think this could be a zero-config effort, so that spark.rapids.memory.host.offHeapLimit.enabled could be enabled automatically with a limit derived from the spark.executor.memory/overhead configs?
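A hypothetical sketch of that zero-config idea, deriving a native-memory cap from the executor memory/overhead settings; the function name and the heuristic itself are illustrative, not the plugin's actual logic:

```python
def derive_offheap_limit_mb(executor_memory_mb, memory_overhead_mb=None):
    # Spark's default executor overhead is max(10% of heap, 384 MiB).
    if memory_overhead_mb is None:
        memory_overhead_mb = max(int(executor_memory_mb * 0.10), 384)
    # Cap native allocations at the container's overhead allotment, so the
    # process stays inside its memory limit and avoids the OOM killer.
    return memory_overhead_mb

# Example: a 16 GiB executor heap with the default overhead.
print(derive_offheap_limit_mb(16 * 1024))  # -> 1638
```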