Currently, aggregateDiagnosticMetricByStage and aggregateSparkMetricsByStageInternal use the getAllStages method of the stageModelManager, which returns all stages (failed, successful, incomplete, etc.):

spark-rapids-tools/core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSparkMetricsAnalyzer.scala, Line 395 in 78cab00
spark-rapids-tools/core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSparkMetricsAnalyzer.scala, Line 324 in 78cab00

This can lead to incorrect aggregation, since different stage attempts can override each other. The behavior is non-deterministic because we cannot tell which attempt will end up being processed last.
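To make the issue concrete, here is a minimal sketch of how aggregating over every attempt can silently override values. The StageAttempt case class and its fields are illustrative placeholders, not the actual classes in spark-rapids-tools:

```scala
// Illustrative model only -- StageAttempt and its fields are hypothetical,
// standing in for the tool's stage/attempt representation.
case class StageAttempt(stageId: Int, attemptId: Int, failed: Boolean, durationMs: Long)

object AttemptOverrideDemo {
  def main(args: Array[String]): Unit = {
    // Stage 3 had a failed attempt 0 and a successful attempt 1.
    val allStages = Seq(
      StageAttempt(stageId = 3, attemptId = 0, failed = true,  durationMs = 120000L),
      StageAttempt(stageId = 3, attemptId = 1, failed = false, durationMs = 45000L)
    )

    // Aggregating over *all* attempts keyed only by stageId: whichever attempt
    // happens to be visited last wins, so the reported value depends on order.
    val byStage = allStages.foldLeft(Map.empty[Int, Long]) { (acc, s) =>
      acc.updated(s.stageId, s.durationMs)
    }
    println(byStage) // Map(3 -> 45000) here, but only by accident of iteration order
  }
}
```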
Solution
We want to enforce behavior that only counts the successful attempts of a stage, since we already dump the failed stages in a separate report.
When aggregating metrics, we should make sure that we do not mix and match between different attempts.
This can be done by only picking the attempts that have not failed. The same applies to incomplete attempts, since those can override each other as well.
Another alternative is to aggregate per stage attempt, but this might not be ideal because failed stages do not have associated accumulables in the event log.
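A hedged sketch of how that filtering could look, continuing the illustrative StageAttempt model above (the failed/complete flags and the aggregation shape are assumptions, not the real stageModelManager API):

```scala
// Hypothetical sketch -- field names and the filtering predicate are assumptions,
// not the actual spark-rapids-tools API.
case class StageAttempt(stageId: Int, attemptId: Int, failed: Boolean,
                        complete: Boolean, durationMs: Long)

object FilteredAggregationDemo {
  def main(args: Array[String]): Unit = {
    val allStages = Seq(
      StageAttempt(3, 0, failed = true,  complete = true,  durationMs = 120000L),
      StageAttempt(3, 1, failed = false, complete = true,  durationMs = 45000L),
      StageAttempt(7, 0, failed = false, complete = false, durationMs = 10000L)
    )

    // Keep only attempts that succeeded and completed, so attempts of the same
    // stage can no longer override each other during aggregation.
    val successfulOnly = allStages.filter(s => !s.failed && s.complete)

    val byStage = successfulOnly
      .groupBy(_.stageId)
      .map { case (stageId, attempts) => stageId -> attempts.map(_.durationMs).sum }

    println(byStage) // Map(3 -> 45000); stage 7's incomplete attempt is excluded
  }
}
```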