MueLu: Race condition in refmaxwell relaxation mtgs parallel coloring algorithm #11280
Comments
One additional observation is that with the 'Default' coloring algorithm and a single thread the issue also goes away. I don't think that there were any changes in Ifpack2 that should have caused this. @brian-kelley did the Kokkos promotion get us any changes from Kokkos Kernels that would have affected the coloring? I think there was some refactoring, but not sure about actual changes in functionality.
@cgcgcg No, the coloring hasn't changed in a while. The last change that could possibly be related is that I changed the default algo for non-GPU parallel (so OpenMP included) from VB to VBBIT in January (in KokkosKernels develop). This change was just for performance. It's a slightly different impl of the same algorithm, but it still tested pretty well across all the backends we support. @rppawlo So it failed to converge at all, not just failed to converge within an expected number of iters? We have seen the latter issue come up with nondeterministic coloring. If this can be replicated reliably, could you please try "relaxation: mtgs coloring algorithm" = "vb"?
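For reference, a minimal sketch of forcing the coloring algorithm through the relaxation parameters; where exactly this sublist lives depends on how the application configures MueLu/Ifpack2, so the surrounding structure here is an assumption:

```cpp
// Hypothetical sketch: select the MTGS coloring algorithm through the Ifpack2
// relaxation parameters. The sublist placement within the overall solver
// configuration is an assumption and depends on the application's setup.
#include <Teuchos_ParameterList.hpp>

Teuchos::ParameterList makeSmootherParams() {
  Teuchos::ParameterList p;
  p.set("relaxation: type", "MT Gauss-Seidel");
  p.set("relaxation: mtgs coloring algorithm", "vb");  // instead of the default
  return p;
}
```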
The runs went from <10 iterations to 500 == max. The input deck does not set anything for the coloring algorithm, so it was using the default.
Running with "vb" results in the same failures as the default algorithm. It was also suggested to run with "vbd". That causes a seg fault.
The VBD/VBDBIT segfault is now fixed in KokkosKernels develop (and this will make it into Kokkos 4.0/Trilinos 14 releases, though I could make a patch if you need it sooner). For WaveLaunch2D failing to converge with VB, I have replicated it and compared what's going on when it converges vs. when it doesn't. So far, it really just looks like bad luck when it fails. The coloring is still valid, the number of colors is the same, the inverse diagonals are correct, and it's permuting and updating the LHS correctly.
@brian-kelley So you think we have stumbled across a matrix where the coloring solution makes the difference between convergence and stalling? For fun and giggles, can you change the damping factor from 1.0 to 0.99?
@cgcgcg Yes, seems like it. I tried 0.99 and 0.95 and it still stalled. At 0.9 it seems to converge within the 500 iter limit every time (but the number of iters is still variable).
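If it helps with reproducing the sensitivity, a short sketch of lowering the damping factor alongside the coloring choice; as above, the exact sublist placement is an assumption:

```cpp
// Hypothetical sketch: lower the relaxation damping factor (e.g. to 0.9) to
// probe the convergence sensitivity discussed above.
#include <Teuchos_ParameterList.hpp>

void setDamping(Teuchos::ParameterList& smootherParams, double omega) {
  smootherParams.set("relaxation: damping factor", omega);
}
```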
Is this just for GPUs? I was reminded of how Volta's Independent Thread Scheduling required code changes a while back to keep algorithms correct.
@mhoemmen The coloring is all in terms of parallel_fors over RangePolicies, so the same code is run on CPUs and GPUs. The intended behavior involves data races - it picks a color for V that isn't taken by V's neighbors, but all of V's neighbors are doing the same thing at the same time. The conflicts are resolved in another pass. So there's really no good way that we've found to make the speculative coloring deterministic while keeping the performance and low number of colors. You're right though, we had other code in SpGEMM that needed to change for Volta because it couldn't have any data races.
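To make that structure concrete, here is a simplified, hypothetical sketch of speculative coloring followed by conflict resolution. It is not the KokkosKernels VB implementation; the CSR layout, OpenMP loops, and names are assumptions for illustration only.

```cpp
// Simplified, hypothetical sketch of speculative greedy coloring with a
// conflict-resolution pass (in the spirit of the VB algorithm described above).
// NOT the KokkosKernels implementation; layout and names are illustrative.
#include <vector>

void speculativeColor(const std::vector<int>& rowptr,  // CSR row offsets, size n+1
                      const std::vector<int>& colidx,  // CSR neighbor indices
                      std::vector<int>& color)         // in/out, size n, 0 = uncolored
{
  const int n = static_cast<int>(rowptr.size()) - 1;
  std::vector<int> worklist(n);
  for (int i = 0; i < n; ++i) worklist[i] = i;

  while (!worklist.empty()) {
    // Pass 1 (racy on purpose): each uncolored vertex picks the smallest color
    // not used by its neighbors, but neighbors may be recoloring concurrently,
    // so adjacent vertices can speculatively pick the same color.
    #pragma omp parallel for
    for (int w = 0; w < static_cast<int>(worklist.size()); ++w) {
      const int v = worklist[w];
      std::vector<char> used(n + 2, 0);  // oversized for simplicity
      for (int j = rowptr[v]; j < rowptr[v + 1]; ++j)
        if (color[colidx[j]] > 0) used[color[colidx[j]]] = 1;
      int c = 1;
      while (used[c]) ++c;
      color[v] = c;
    }

    // Pass 2: detect conflicts; the higher-indexed vertex of a conflicting
    // edge gives up its color and is retried in the next round.
    #pragma omp parallel for
    for (int w = 0; w < static_cast<int>(worklist.size()); ++w) {
      const int v = worklist[w];
      for (int j = rowptr[v]; j < rowptr[v + 1]; ++j) {
        const int u = colidx[j];
        if (u != v && color[u] == color[v] && v > u) { color[v] = 0; break; }
      }
    }

    std::vector<int> next;
    for (int v : worklist)
      if (color[v] == 0) next.push_back(v);
    worklist.swap(next);
  }
}
```

The final coloring is always valid, but which vertices "win" in pass 1 depends on thread timing, which is why the coloring, and hence the effective smoother ordering, can vary from run to run.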
This might also be related to #11026. I believe OpenMP is the only build where we use nondeterministic aggregation by default, which catches people off-guard from time to time. It's on my to-do list to change the default to deterministic, but it's a backwards-incompatible change that will affect a lot of upstream applications with OpenMP builds and slow down aggregation on them, so I've been holding off for fear of breaking a lot of things...
@GrahamBenHarper I can't speak for the users any more, but my past experience is that they would prefer default behavior to minimize variance and maximize likelihood of success, even at the cost of some average performance.
This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
#12351 is open to fix issues like this. There were some discussions with @trilinos/muelu at TUG about how to move forward. If I understand correctly, the problem is we would like the default behavior in all cases (serial/cuda/openmp) to
- be deterministic,
- be performant, and
- allow control over aggregate sizes.

These are at odds with each other to some extent because mis2 aggregation/coarsening does not allow selection of aggregate sizes, despite being deterministic and the most performant. Somebody from MueLu, please correct me if my summary is wrong.
At TUG we decided to switch our default aggregation strategy to something deterministic once we have a better understanding of the performance implications. In practice it doesn't seem anyone cares about setting bounds on aggregate sizes.
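For applications that want deterministic behavior before the default changes, opting in might look like the sketch below; the parameter name "aggregation: deterministic" is my recollection and should be verified against the MueLu master list.

```cpp
// Hypothetical sketch: explicitly request deterministic aggregation instead of
// relying on the (currently nondeterministic) OpenMP default. The parameter
// name "aggregation: deterministic" is assumed; check the MueLu master list.
#include <Teuchos_ParameterList.hpp>

void requestDeterministicAggregation(Teuchos::ParameterList& mueluParams) {
  mueluParams.set("aggregation: deterministic", true);
}
```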
Bug Report
@trilinos/muelu
Description
About 2 months ago we started seeing random failures in EMPIRE's nightly Trilinos sync. The linear solver for an EM problem was failing to converge.
After ruling out a number of EMPIRE-side possibilities, we worked with the MueLu team yesterday to diagnose. In the end, @cgcgcg suggested:
This fixed the problem in EMPIRE.
So there seems to be a race condition in MueLu or an underlying library it uses. Since it occurs only for MPI-parallel tests, I'm guessing that a fence is needed before an MPI communication is launched.
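If a missing fence is indeed the cause, the usual fix is to fence the device before MPI touches device-filled data. A minimal sketch of that pattern, with placeholder names (this is not EMPIRE or MueLu code):

```cpp
// Hypothetical sketch of the "fence before MPI" pattern: make sure asynchronous
// device work that fills sendBuffer has completed before MPI reads the buffer.
// Assumes a CUDA-aware (or host-space) buffer; names are placeholders.
#include <Kokkos_Core.hpp>
#include <mpi.h>

void sendResult(Kokkos::View<double*> sendBuffer, int dest, MPI_Comm comm) {
  // ... asynchronous kernels that fill sendBuffer would run here ...
  Kokkos::fence();  // complete all outstanding device work before communicating
  MPI_Send(sendBuffer.data(), static_cast<int>(sendBuffer.extent(0)),
           MPI_DOUBLE, dest, /*tag=*/0, comm);
}
```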
Steps to Reproduce