Currently, aggregateDiagnosticMetricByStage and aggregateSparkMetricsByStageInternal use the getAllStages method of the stageModelManager, which returns all stages (failed, successful, incomplete, etc.):

spark-rapids-tools/core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSparkMetricsAnalyzer.scala, Line 395 in 78cab00
spark-rapids-tools/core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSparkMetricsAnalyzer.scala, Line 324 in 78cab00

This can lead to incorrect aggregation, since different stage attempts can override each other. The behavior is non-deterministic because we cannot tell which attempt will end up being processed last.
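To make the issue concrete, here is a minimal sketch of how aggregating over every attempt can silently override values. The StageAttempt case class and its fields are illustrative placeholders, not the actual classes in spark-rapids-tools:

```scala
// Illustrative model only -- StageAttempt and its fields are hypothetical,
// standing in for the tool's stage/attempt representation.
case class StageAttempt(stageId: Int, attemptId: Int, failed: Boolean, durationMs: Long)

object AttemptOverrideDemo {
  def main(args: Array[String]): Unit = {
    // Stage 3 had a failed attempt 0 and a successful attempt 1.
    val allStages = Seq(
      StageAttempt(stageId = 3, attemptId = 0, failed = true,  durationMs = 120000L),
      StageAttempt(stageId = 3, attemptId = 1, failed = false, durationMs = 45000L)
    )

    // Aggregating over *all* attempts keyed only by stageId: whichever attempt
    // happens to be visited last wins, so the reported value depends on order.
    val byStage = allStages.foldLeft(Map.empty[Int, Long]) { (acc, s) =>
      acc.updated(s.stageId, s.durationMs)
    }
    println(byStage) // Map(3 -> 45000) here, but only by accident of iteration order
  }
}
```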
Solution
We want to enforce behavior that only counts the successful attempts of a stage, since we already dump the failed stages in a separate report.
When aggregating metrics, we should make sure that we do not mix and match between different attempts.
This can be done by only picking the attempts that have not failed. The same applies to incomplete attempts, since those can override each other as well.
Another alternative is to aggregate per stage attempt, but this might not be ideal because failed stages do not have associated accumulables in the event log.
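A hedged sketch of how that filtering could look, continuing the illustrative StageAttempt model above (the failed/complete flags and the aggregation shape are assumptions, not the real stageModelManager API):

```scala
// Hypothetical sketch -- field names and the filtering predicate are assumptions,
// not the actual spark-rapids-tools API.
case class StageAttempt(stageId: Int, attemptId: Int, failed: Boolean,
                        complete: Boolean, durationMs: Long)

object FilteredAggregationDemo {
  def main(args: Array[String]): Unit = {
    val allStages = Seq(
      StageAttempt(3, 0, failed = true,  complete = true,  durationMs = 120000L),
      StageAttempt(3, 1, failed = false, complete = true,  durationMs = 45000L),
      StageAttempt(7, 0, failed = false, complete = false, durationMs = 10000L)
    )

    // Keep only attempts that succeeded and completed, so attempts of the same
    // stage can no longer override each other during aggregation.
    val successfulOnly = allStages.filter(s => !s.failed && s.complete)

    val byStage = successfulOnly
      .groupBy(_.stageId)
      .map { case (stageId, attempts) => stageId -> attempts.map(_.durationMs).sum }

    println(byStage) // Map(3 -> 45000); stage 7's incomplete attempt is excluded
  }
}
```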