
MueLu: Race condition in refmaxwell relaxation mtgs parallel coloring algorithm #11280

rppawlo opened this issue Nov 17, 2022 · 14 comments
Labels: client: EMPIRE, PA: Linear Solvers, pkg: Ifpack2, pkg: KokkosKernels, pkg: MueLu, type: bug

Comments


rppawlo commented Nov 17, 2022

Bug Report

@trilinos/muelu

Description

About 2 months ago we started seeing random failures in EMPIRE's nightly Trilinos sync. The linear solver for an EM problem was failing to converge.

  1. This only fails for two EMPIRE tests, so a very small subset of the EMPIRE test suite.
  2. The failure rate is high only on the Intel build (which uses the OpenMP backend). We have also seen this failure roughly once every two weeks on the GNU OpenMP build, so it is less frequent there but not tied to a single compiler. We have not seen it on the CUDA or Serial backends so far.
  3. It only happens in parallel. Single MPI process runs consistently pass.
  4. We dump the linear system right before the Belos solve; the matrix, left-hand side, and right-hand side all match when comparing values between passing and failing runs.
  5. If I switch the preconditioner from RefMaxwell to Ifpack2, the test passes consistently.

After ruling out a number of EMPIRE-side possibilities, we worked with the MueLu team yesterday to diagnose. In the end, @cgcgcg suggested:

One more thing you could try. There are 6 blocks of MT Gauss-Seidel settings (3x "smoother: pre params" and 3x "smoother: post params"). In each of them, you could set

"relaxation: mtgs coloring algorithm" = "serial"

Setting this parameter fixed the problem in EMPIRE.

So there seems to be a race condition in MueLu or an underlying library it uses. Since the failure occurs only in MPI-parallel tests, my guess is that a fence is needed before an MPI communication is launched.

Steps to Reproduce

  1. See EMPIRE#2400 for more details to reproduce.
@rppawlo rppawlo added the type: bug The primary issue is a bug in Trilinos code or tests label Nov 17, 2022
@rppawlo rppawlo added pkg: MueLu client: EMPIRE All issues that most directly target the ATDM EMPIRE code PA: Linear Solvers Issues that fall under the Trilinos Linear Solvers Product Area labels Nov 17, 2022

cgcgcg commented Nov 17, 2022

One additional observation is that with the 'Default' coloring algorithm and a single thread the issue also goes away. I don't think that there were any changes in Ifpack2 that should have caused this.

@brian-kelley did the Kokkos promotion get us any changes from Kokkos Kernels that would have affected the coloring? I think there was some refactoring, but not sure about actual changes in functionality.


brian-kelley commented Nov 17, 2022

@cgcgcg No, the coloring hasn't changed in a while. The last change that could possibly be related is that I changed the default algorithm for non-GPU parallel backends (so OpenMP included) from VB to VBBIT in January (in KokkosKernels develop). This change was just for performance; it's a slightly different implementation of the same algorithm, and it still tested pretty well across all the backends we support.

@rppawlo So it failed to converge at all, not just failed to converge within the expected number of iterations? We have seen the latter issue come up with nondeterministic coloring. If this can be replicated reliably, could you please try "relaxation: mtgs coloring algorithm" = "vb"?


cgcgcg commented Nov 17, 2022

The runs went from fewer than 10 iterations to hitting the maximum of 500. The input deck does not set anything for the coloring algorithm, so it was using the default.


rppawlo commented Nov 17, 2022

Running with "vb" results in the same failures as the default algorithm. It was also suggested to run with "vbd"; that causes a segfault.

@brian-kelley

The VBD/VBDBIT segfault is now fixed in KokkosKernels develop (and this will make it into Kokkos 4.0/Trilinos 14 releases, though I could make a patch if you need it sooner).

For WaveLaunch2D failing to converge with VB, I have replicated it and compared what's going on when it converges vs. when it doesn't. So far, it really just looks like bad luck when it fails. The coloring is still valid, the number of colors is the same, the inverse diagonals are correct, and it's permuting and updating the LHS correctly.


cgcgcg commented Nov 22, 2022

@brian-kelley So you think we have stumbled across a matrix where the coloring solution makes the difference between convergence and stalling? For fun and giggles, can you change the damping factor from 1.0 to 0.99?
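
For reference, a hedged sketch of where that knob would be set, assuming the same kind of smoother sublists as in the coloring workaround above ("relaxation: damping factor" is the standard Ifpack2 relaxation parameter; the function and sublist names are illustrative):

```cpp
#include <Teuchos_ParameterList.hpp>

// Hedged sketch: lower the Gauss-Seidel damping factor on one smoother block,
// e.g. setDampingFactor(smootherParams, 0.99).
void setDampingFactor(Teuchos::ParameterList& smootherParams, double omega) {
  smootherParams.sublist("smoother: pre params")
      .set("relaxation: damping factor", omega);
}
```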

@brian-kelley

@cgcgcg Yes, seems like it. I tried 0.99 and 0.95 and it still stalled. At 0.9 it seems to converge within the 500 iteration limit every time (but the number of iterations is still variable).

@mhoemmen

Is this just for GPUs? I was reminded of how Volta's Independent Thread Scheduling required code changes to ensure correct algorithms a while back.

@brian-kelley

@mhoemmen The coloring is all in terms of parallel_fors over RangePolicies, so the same code is run on CPUs and GPUs. The intended behavior involves data races: each vertex V picks a color that isn't taken by V's neighbors, but all of V's neighbors are doing the same thing at the same time. The conflicts are resolved in another pass. So there's really no good way that we've found to make the speculative coloring deterministic while keeping the performance and the low number of colors.

You're right though, we had other code in SpGEMM that needed to change for Volta because it couldn't have any data races.
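
To make the speculative scheme above concrete, here is a hedged, toy-sized sketch (not the KokkosKernels implementation; the graph, View names, and fixed color bound are all illustrative) of greedy coloring with a conflict-resolution pass. The benign race is each vertex reading its neighbors' colors while those neighbors are writing their own:

```cpp
#include <Kokkos_Core.hpp>

int main(int argc, char** argv) {
  Kokkos::initialize(argc, argv);
  {
    using Ordinal = int;
    const Ordinal numVerts = 5;
    // Tiny 5-vertex cycle graph in CSR form, just for illustration.
    const Ordinal rowPtrHost[] = {0, 2, 4, 6, 8, 10};
    const Ordinal colIndHost[] = {1, 4, 0, 2, 1, 3, 2, 4, 3, 0};
    Kokkos::View<Ordinal*> rowPtr("rowPtr", 6), colInd("colInd", 10);
    Kokkos::View<Ordinal*> colors("colors", numVerts);  // 0 means uncolored
    auto rowPtrH = Kokkos::create_mirror_view(rowPtr);
    auto colIndH = Kokkos::create_mirror_view(colInd);
    for (int i = 0; i < 6; ++i) rowPtrH(i) = rowPtrHost[i];
    for (int i = 0; i < 10; ++i) colIndH(i) = colIndHost[i];
    Kokkos::deep_copy(rowPtr, rowPtrH);
    Kokkos::deep_copy(colInd, colIndH);

    Ordinal numConflicts = 1;
    while (numConflicts > 0) {
      // Speculative pass: each uncolored vertex picks the smallest color not
      // seen on its neighbors. Neighbors do the same thing at the same time,
      // so these reads race with their writes (the benign race).
      Kokkos::parallel_for("speculate", numVerts, KOKKOS_LAMBDA(const Ordinal v) {
        if (colors(v) != 0) return;
        bool used[8] = {};  // small fixed color bound, enough for this toy graph
        for (Ordinal j = rowPtr(v); j < rowPtr(v + 1); ++j)
          used[colors(colInd(j))] = true;
        for (Ordinal c = 1; c < 8; ++c)
          if (!used[c]) { colors(v) = c; break; }
      });
      // Conflict pass: if two adjacent vertices grabbed the same color, the
      // higher-numbered one is uncolored and retried on the next sweep.
      Kokkos::parallel_reduce("detect conflicts", numVerts,
        KOKKOS_LAMBDA(const Ordinal v, Ordinal& bad) {
          for (Ordinal j = rowPtr(v); j < rowPtr(v + 1); ++j) {
            const Ordinal u = colInd(j);
            if (u < v && colors(u) == colors(v)) { colors(v) = 0; ++bad; break; }
          }
        }, numConflicts);
    }
  }
  Kokkos::finalize();
  return 0;
}
```

The coloring is always valid after the loop terminates; only which color each vertex ends up with depends on the timing of the racy reads, which is why repeated runs can produce different (but equally legal) colorings.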

@GrahamBenHarper

This might also be related to #11026.

I believe OpenMP is the only build where we use nondeterministic aggregation by default, which catches people off-guard from time to time. It's on my to-do list to change the default to deterministic, but it's a backwards-incompatible change that will affect a lot of upstream applications with OpenMP builds and slow down aggregation on them, so I've been holding off for fear of breaking a lot of things...


mhoemmen commented Dec 1, 2022

@GrahamBenHarper I can't speak for the users any more, but my past experience is that they would prefer default behavior that minimizes variance and maximizes the likelihood of success, even at the cost of some average performance.


github-actions bot commented Dec 2, 2023

This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE label.
If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE.
If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.

@github-actions github-actions bot added the MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. label Dec 2, 2023

GrahamBenHarper commented Dec 2, 2023

#12351 is open to fix issues like this.

There were some discussions with @trilinos/muelu at TUG about how to move forward. If I understand correctly, the problem is that we would like the default behavior in all cases (Serial/CUDA/OpenMP) to

  1. Be performant
  2. Allow selection of aggregate sizes
  3. Be deterministic

These are at odds with each other to some extent because MIS-2 aggregation/coarsening does not allow selection of aggregate sizes, despite being deterministic and the most performant option. Somebody from MueLu, please correct me if my summary is wrong.

@github-actions github-actions bot removed the MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. label Dec 3, 2023

cgcgcg commented Dec 4, 2023

At TUG we decided to switch our default aggregation strategy to something deterministic once we have a better understanding of the performance implications. In practice it doesn't seem anyone cares about setting bounds on aggregate sizes.

@jhux2 jhux2 added this to MueLu Aug 12, 2024
@jhux2 jhux2 moved this to Backlog in MueLu Aug 12, 2024