
[FEA] Profiling tool auto-tuner should keep doubling shuffle partitions and related AQE settings if a shuffle stage's tasks failed with exit code 137 #1566

Open
viadea opened this issue Feb 28, 2025 · 3 comments
Labels: autotuner, core_tools (Scope the core module (scala)), feature request (New feature or request)

viadea commented Feb 28, 2025

In a shuffle stage, if tasks fail with exit code 137 and the stage itself also fails, I suspect the OS is killing the processes because it is running out of memory.
Increasing the shuffle partitions and the related AQE settings might help.
For example:

spark.sql.shuffle.partitions=2000
spark.sql.adaptive.coalescePartitions.initialPartitionNum=2000
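
As a rough sketch of the proposed heuristic in Scala (the types and names here are hypothetical illustrations, not the actual profiling-tool API):

object ShufflePartitionHeuristic {
  // Exit code 137 = 128 + SIGKILL(9), typically the OS OOM killer.
  val OomExitCode = 137
  // Arbitrary safety cap so repeated doubling terminates (an assumption).
  val MaxPartitions = 32768

  // Minimal stand-in for the profiler's per-stage data (hypothetical).
  case class StageInfo(isShuffle: Boolean, failed: Boolean, taskExitCodes: Seq[Int])

  // Returns the doubled partition count when a failed shuffle stage has
  // tasks killed with exit code 137, otherwise no recommendation.
  def recommend(stages: Seq[StageInfo], currentPartitions: Int): Option[Int] = {
    val oomShuffleFailure = stages.exists { s =>
      s.isShuffle && s.failed && s.taskExitCodes.contains(OomExitCode)
    }
    if (oomShuffleFailure) Some(math.min(currentPartitions * 2, MaxPartitions))
    else None
  }
}

// The recommendation should update both configs together:
// ShufflePartitionHeuristic.recommend(stages, 1000).foreach { p =>
//   println(s"spark.sql.shuffle.partitions=$p")
//   println(s"spark.sql.adaptive.coalescePartitions.initialPartitionNum=$p")
// }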
viadea commented Feb 28, 2025

For spark.sql.adaptive.coalescePartitions.initialPartitionNum: this is the initial number of shuffle partitions before coalescing. If it is not set, it defaults to spark.sql.shuffle.partitions.

Because the bootstrap init GPU configs always set spark.sql.adaptive.coalescePartitions.initialPartitionNum explicitly, we need to raise both settings to be safe.
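
To illustrate (the values are made up): if the bootstrapped configs pinned initialPartitionNum, raising only spark.sql.shuffle.partitions would not change the partition count AQE starts coalescing from:

spark.sql.shuffle.partitions=2000
spark.sql.adaptive.coalescePartitions.initialPartitionNum=1000   (stale bootstrapped value still wins for AQE)

So the tuner should emit both:

spark.sql.shuffle.partitions=2000
spark.sql.adaptive.coalescePartitions.initialPartitionNum=2000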

@winningsix

We have run into this issue quite often in other customers' engagements in the past. In our case it was caused by unexpected overhead memory usage in the container. With the great work from @binmahone, we resolved this type of issue.

A summary here:

  1. Please use spark.rapids.memory.host.offHeapLimit.enabled to ensure there is a memory cap on all native memory usage (a config example is sketched after this list).
  2. Turn on jemalloc, though expect less benefit in your case; for that customer it was needed because of Celeborn usage:
spark.executorEnv.LD_PRELOAD=/usr/local/lib/libjemalloc.so
spark.executorEnv.MALLOC_CONF=dirty_decay_ms:0,muzzy_decay_ms:0
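
For reference, a minimal example of the off-heap cap (the size value is illustrative; check your spark-rapids plugin version's docs for the companion spark.rapids.memory.host.offHeapLimit.size setting):

spark.rapids.memory.host.offHeapLimit.enabled=true
spark.rapids.memory.host.offHeapLimit.size=4g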

viadea commented Mar 4, 2025

Thanks @winningsix. Do you think this could be made a zero-config effort, so that spark.rapids.memory.host.offHeapLimit.enabled is turned on automatically with a limit derived from the spark.executor.memory/overhead configs?
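
For example, one possible derivation (a sketch with an assumed headroom factor; the plugin's real memory accounting may differ):

// Hypothetical zero-config derivation of an off-heap cap, in Scala.
def defaultOffHeapLimitMb(executorMemoryMb: Long, memoryOverheadMb: Option[Long]): Long = {
  // Spark's default overhead is max(10% of executor memory, 384 MiB).
  val overhead = memoryOverheadMb.getOrElse(math.max((executorMemoryMb * 0.1).toLong, 384L))
  // Reserve an assumed 20% headroom for native allocations the limit cannot track.
  (overhead * 0.8).toLong
}

// e.g. spark.executor.memory=16g with no explicit overhead:
// defaultOffHeapLimitMb(16384, None) == 1310 (MiB)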
