MueLu: Race condition in refmaxwell relaxation mtgs parallel coloring algorithm #11280
Comments
One additional observation is that with the 'Default' coloring algorithm and a single thread the issue also goes away. I don't think that there were any changes in Ifpack2 that should have caused this. @brian-kelley did the Kokkos promotion get us any changes from Kokkos Kernels that would have affected the coloring? I think there was some refactoring, but not sure about actual changes in functionality.
@cgcgcg No, the coloring hasn't changed in a while. The last change that could possibly be related is that I changed the default algo for non-GPU parallel (so OpenMP included) from VB to VBBIT in January (in KokkosKernels develop). This change was just for performance. It's a slightly different impl of the same algorithm, but it still tested pretty well across all the backends we support. @rppawlo So it failed to converge at all, not just failed to converge within an expected number of iters? We have seen the latter issue come up with nondeterministic coloring. If this can be replicated reliably, could you please try "relaxation: mtgs coloring algorithm" = "vb"?
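For reference, a minimal sketch of forcing the coloring algorithm through the relaxation parameters; where exactly this sublist lives depends on how the application configures MueLu/Ifpack2, so the surrounding structure here is an assumption:

```cpp
// Hypothetical sketch: select the MTGS coloring algorithm through the Ifpack2
// relaxation parameters. The sublist placement within the overall solver
// configuration is an assumption and depends on the application's setup.
#include <Teuchos_ParameterList.hpp>

Teuchos::ParameterList makeSmootherParams() {
  Teuchos::ParameterList p;
  p.set("relaxation: type", "MT Gauss-Seidel");
  p.set("relaxation: mtgs coloring algorithm", "vb");  // instead of the default
  return p;
}
```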
The runs went from <10 iterations to 500 == max. The input deck does not set anything for the coloring algorithm, so it was using the default.
Running with "vb" results in the same failures as the default algorithm. It was also suggested to run with "vbd". That causes a seg fault.
The VBD/VBDBIT segfault is now fixed in KokkosKernels develop (and this will make it into Kokkos 4.0/Trilinos 14 releases, though I could make a patch if you need it sooner). For WaveLaunch2D failing to converge with VB, I have replicated it and compared what's going on when it converges vs. when it doesn't. So far, it really just looks like bad luck when it fails. The coloring is still valid, the number of colors is the same, the inverse diagonals are correct, and it's permuting and updating the LHS correctly.
@brian-kelley So you think we have stumbled across a matrix where the coloring solution makes the difference between convergence and stalling? For fun and giggles, can you change the damping factor from 1.0 to 0.99?
@cgcgcg Yes, seems like it. I tried 0.99 and 0.95 and it still stalled. At 0.9 it seems to converge within the 500 iter limit every time (but the number of iters is still variable).
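If it helps with reproducing the sensitivity, a short sketch of lowering the damping factor alongside the coloring choice; as above, the exact sublist placement is an assumption:

```cpp
// Hypothetical sketch: lower the relaxation damping factor (e.g. to 0.9) to
// probe the convergence sensitivity discussed above.
#include <Teuchos_ParameterList.hpp>

void setDamping(Teuchos::ParameterList& smootherParams, double omega) {
  smootherParams.set("relaxation: damping factor", omega);
}
```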
Is this just for GPUs? I was reminded of how Volta's Independent Thread Scheduling required code changes a while back to keep algorithms correct.
@mhoemmen The coloring is all in terms of parallel_fors over RangePolicies, so the same code is run on CPUs and GPUs. The intended behavior involves data races - it picks a color for V that isn't taken by V's neighbors, but all of V's neighbors are doing the same thing at the same time. The conflicts are resolved in another pass. So there's really no good way that we've found to make the speculative coloring deterministic while keeping the performance and low number of colors. You're right though, we had other code in SpGEMM that needed to change for Volta because it couldn't have any data races.
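To make that structure concrete, here is a simplified, hypothetical sketch of speculative coloring followed by conflict resolution. It is not the KokkosKernels VB implementation; the CSR layout, OpenMP loops, and names are assumptions for illustration only.

```cpp
// Simplified, hypothetical sketch of speculative greedy coloring with a
// conflict-resolution pass (in the spirit of the VB algorithm described above).
// NOT the KokkosKernels implementation; layout and names are illustrative.
#include <vector>

void speculativeColor(const std::vector<int>& rowptr,  // CSR row offsets, size n+1
                      const std::vector<int>& colidx,  // CSR neighbor indices
                      std::vector<int>& color)         // in/out, size n, 0 = uncolored
{
  const int n = static_cast<int>(rowptr.size()) - 1;
  std::vector<int> worklist(n);
  for (int i = 0; i < n; ++i) worklist[i] = i;

  while (!worklist.empty()) {
    // Pass 1 (racy on purpose): each uncolored vertex picks the smallest color
    // not used by its neighbors, but neighbors may be recoloring concurrently,
    // so adjacent vertices can speculatively pick the same color.
    #pragma omp parallel for
    for (int w = 0; w < static_cast<int>(worklist.size()); ++w) {
      const int v = worklist[w];
      std::vector<char> used(n + 2, 0);  // oversized for simplicity
      for (int j = rowptr[v]; j < rowptr[v + 1]; ++j)
        if (color[colidx[j]] > 0) used[color[colidx[j]]] = 1;
      int c = 1;
      while (used[c]) ++c;
      color[v] = c;
    }

    // Pass 2: detect conflicts; the higher-indexed vertex of a conflicting
    // edge gives up its color and is retried in the next round.
    #pragma omp parallel for
    for (int w = 0; w < static_cast<int>(worklist.size()); ++w) {
      const int v = worklist[w];
      for (int j = rowptr[v]; j < rowptr[v + 1]; ++j) {
        const int u = colidx[j];
        if (u != v && color[u] == color[v] && v > u) { color[v] = 0; break; }
      }
    }

    std::vector<int> next;
    for (int v : worklist)
      if (color[v] == 0) next.push_back(v);
    worklist.swap(next);
  }
}
```

The final coloring is always valid, but which vertices "win" in pass 1 depends on thread timing, which is why the coloring, and hence the effective smoother ordering, can vary from run to run.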
This might also be related to #11026. I believe OpenMP is the only build where we use nondeterministic aggregation by default, which catches people off-guard from time to time. It's on my to-do list to change the default to deterministic, but it's a backwards-incompatible change that will affect a lot of upstream applications with OpenMP builds and slow down aggregation on them, so I've been holding off for fear of breaking a lot of things...
@GrahamBenHarper I can't speak for the users any more, but my past experience is that they would prefer default behavior to minimize variance and maximize likelihood of success, even at the cost of some average performance.
This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
#12351 is open to fix issues like this. There were some discussions with @trilinos/muelu at TUG about how to move forward. If I understand correctly, the problem is we would like the default behavior in all cases (serial/cuda/openmp) to
- be deterministic,
- be performant, and
- allow control over aggregate sizes.

These are at odds with each other to some extent because mis2 aggregation/coarsening does not allow selection of aggregate sizes, despite being deterministic and the most performant. Somebody from MueLu, please correct me if my summary is wrong.
At TUG we decided to switch our default aggregation strategy to something deterministic once we have a better understanding of the performance implications. In practice it doesn't seem anyone cares about setting bounds on aggregate sizes.
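For applications that want deterministic behavior before the default changes, opting in might look like the sketch below; the parameter name "aggregation: deterministic" is my recollection and should be verified against the MueLu master list.

```cpp
// Hypothetical sketch: explicitly request deterministic aggregation instead of
// relying on the (currently nondeterministic) OpenMP default. The parameter
// name "aggregation: deterministic" is assumed; check the MueLu master list.
#include <Teuchos_ParameterList.hpp>

void requestDeterministicAggregation(Teuchos::ParameterList& mueluParams) {
  mueluParams.set("aggregation: deterministic", true);
}
```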
Bug Report
@trilinos/muelu
Description
About 2 months ago we started seeing random failures in EMPIRE's nightly Trilinos sync. The linear solver for an EM problem was failing to converge.
After ruling out a number of EMPIRE-side possibilities, we worked with the MueLu team yesterday to diagnose. In the end, @cgcgcg suggested:
This fixed the problem in EMPIRE.
So there seems to be a race condition in MueLu or an underlying library it uses. Since it occurs only for MPI-parallel tests, I'm guessing that a fence is needed before an MPI communication is launched.
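If a missing fence is indeed the cause, the usual fix is to fence the device before MPI touches device-filled data. A minimal sketch of that pattern, with placeholder names (this is not EMPIRE or MueLu code):

```cpp
// Hypothetical sketch of the "fence before MPI" pattern: make sure asynchronous
// device work that fills sendBuffer has completed before MPI reads the buffer.
// Assumes a CUDA-aware (or host-space) buffer; names are placeholders.
#include <Kokkos_Core.hpp>
#include <mpi.h>

void sendResult(Kokkos::View<double*> sendBuffer, int dest, MPI_Comm comm) {
  // ... asynchronous kernels that fill sendBuffer would run here ...
  Kokkos::fence();  // complete all outstanding device work before communicating
  MPI_Send(sendBuffer.data(), static_cast<int>(sendBuffer.extent(0)),
           MPI_DOUBLE, dest, /*tag=*/0, comm);
}
```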
Steps to Reproduce