`ManualEnsemble` class for specifying all comms in an `Ensemble` #189

JHopeCollins · 2024-05-07T17:21:29Z

This PR slightly changes the way we split a large Ensemble into multiple smaller Ensembles, either in the SliceJacobiPC or in the nonlinear Gauss-Seidel iterations.
firedrake.Ensemble essentially has two responsibilities:

Make sure we have three comms on each rank, a global_comm, a spatial_comm, and an ensemble_comm. Currently, firedrake.Ensemble does this by taking a global_comm and splitting it into a Cartesian product of comms, providing a spatial_comm and an ensemble_comm from this product.
Wrap mpi4py calls so that we can send firedrake.Functions from one spatial_comm to another across the ensemble_comm with a simple API and some sanity checks.

Creating a split ensemble involves intercepting the logic in 1 to make sure the comms in the split ensemble relate properly to the comms in the larger ensemble. Specifically, the split ensemble needs three communicators:

A global_comm split from the global_comm of the larger ensemble
An ensemble_comm split from the global_comm or ensemble_comm of the larger ensemble
A spatial_comm that is the same as the spatial_comm from the larger ensemble so we can use the same mesh with both Ensembles.

The main issue here is that we need the spatial_comm of the split ensemble to be the same comm as the spatial_comm of the large ensemble, not just congruent. This means that we can't just make the smaller global_comm for the split ensemble and reuse the existing firedrake.Ensemble.__init__.

Previously I made a new EnsembleConnector class (terrible name, it connects existing spatial_comms, it doesn't connect different Ensembles). This class inherited from firedrake.Ensemble but overrode __init__, taking a global_comm and a specific spatial_comm, then created a new ensemble_comm by splitting the provided global_comm. To go with this is a split_ensemble function that takes in a large ensemble, splits it's global_comm, and passes the split global_comm and the spatial_comm to the new EnsembleConnector.
This works fine for our case but has a couple of issues (other than the naming issues that already plague Ensemble).

The main issue is that it mixes up responsibilities. The split_ensemble function does some, but not all of task 1, making sure we have three viable comms. It sorts out the global_comm and spatial_comm but not the ensemble_comm, which is left to the EnsembleConnector.
It is specific to the case of already having the spatial_comm but not the ensemble_comm (what about the case where I already have the ensemble_comm but not the spatial_comm, or already have both).

This PR changes to having a ManualEnsemble class that inherits from firedrake.Ensemble but just takes three comms and checks that they look like a global/spatial/ensemble comm set (i.e. they look like a cartesian product of comms). It essentially is only doing task 2, wrapping mpi4py calls, and trusts that the three provided comms are a valid set to use.
The split_ensemble function now does all of task 1, taking in a larger Ensemble, splitting the global_comm and the ensemble_comm, and passing these plus the original spatial_comm to ManualEnsemble.

ManualEnsemble is simpler and more general than EnsembleConnector was (also more of a footgun, but I've tried to add enough checks), and split_ensemble now takes care of all of the logic of splitting an Ensemble, rather than just some of it.

JHopeCollins · 2024-05-14T12:29:07Z

@JDBetteridge this PR only involves the split_ensemble function and the ManualEnsemble class, the create_ensemble is just a convenience function not used here.

JDBetteridge

I may need to think about this some more...

JDBetteridge · 2024-05-15T13:53:04Z

asQ/ensemble.py

+    weakref.finalize(new_ensemble, split_global_comm.Free)
+    weakref.finalize(new_ensemble, split_ensemble_comm.Free)


I think that the finalizer should be set in the __init__ method of the ManualEnsemble. It's the pattern used elsewhere and it prevents someone (user or developer) forgetting to add the finalizers.

This comes back to the question of what ManualEnsemble should be responsible for. In this case the ensemble_comm and the global_comm need finalising but the spatial_comm doesn't.

Currently I've gone with "the user of ManualEnsemble is totally responsible for the comms they pass in", but we could have optional arguments to ManualEnsemble.__init__ for which comms to set finalizers for, I'd be ok with that.

Just to note, I'm saying "user" here but ManualEnsemble isn't exposed publicly, it's meant for internal use so I'd expect it to always be wrapped in something like the split_ensemble function which has more knowledge about comm lifetime.

e.g.

class ManualEnsemble(Ensemble) def __init__(self, global_comm, spatial_comm, ensemble_comm, finalize_global_comm=False, finalize_spatial_comm=False, finalize_ensemble_comm=False) if finalize_global_comm: weakref.finalize(self, global_comm.Free) if finalize_spatial_comm: weakref.finalize(self, spatial_comm.Free) if finalize_ensemble_comm: weakref.finalize(self, ensemble_comm.Free) ...

JDBetteridge · 2024-05-15T16:07:20Z

asQ/ensemble.py

+            raise PyOP2CommError("spatial_comm must be subgroup of global_comm")
+        if not is_subgroup(ensemble_group, global_group):
+            raise PyOP2CommError("ensemble_comm must be subgroup of global_comm")
+


This logic isn't completely exhaustive, it doesn't currently check whether you have the same communicator. For example:

if __name__ == "__main__": r = COMM_WORLD.rank s = COMM_WORLD.size ensemble_color1 = int(r < s/2) ensemble1 = COMM_WORLD.Split(color=ensemble_color1, key=r) ensemble_color2 = r >= s/2 ensemble2 = COMM_WORLD.Split(color=ensemble_color2, key=r) spatial_color = r % (s/2) spatial = COMM_WORLD.Split(color=spatial_color, key=r) correct = ManualEnsemble(COMM_WORLD, spatial, ensemble1) if r < s/2: broken = ManualEnsemble(COMM_WORLD, spatial, ensemble1) else: broken = ManualEnsemble(COMM_WORLD, spatial, ensemble2) print("ALL PASSED")

Will run just fine, but the broken ensemble uses two different communicators.

This is broken in a very subtle way as the mismatched comm will be destroyed when the ensemble is destroyed.

I don't see what is broken about this.

Assuming 8 ranks, ranks 0-3 use the ensemble comm from the first Split, and ranks 4-7 use the ensemble comm from the second Split call, but this doesn't matter. All ranks in each ensemble comm use the same one, which is what matters.
They don't "know", and don't need to know, what the other half is doing so long as every rank has an ensemble comm that connects the same part of all spatial comms.

As written so far, ManualEnsemble doesn't destroy any of the comms its given (the docstring explicitly says that the caller is responsible for this).

ManualEnsemble class for specifying all comms in Ensemble

a66f5cf

JHopeCollins added refactor Tidy up existing functionality Core functionality Adding to the main paradiag functionality labels May 7, 2024

JHopeCollins self-assigned this May 7, 2024

JHopeCollins added 2 commits May 7, 2024 18:31

use MPI.Group.Compare for the comm tests in ManualEnsemble

b5ccbfe

use MPI.Group.Compare for the comm tests in ManualEnsemble

9113af8

JHopeCollins requested a review from colinjcotter May 14, 2024 10:00

split_ensemble - use ensemble_comm.rank as key to split ensemble_comm

714d9fb

JHopeCollins requested a review from JDBetteridge May 14, 2024 12:28

make sure the new comms in split_comm are cleaned up

b7fd79b

colinjcotter approved these changes May 14, 2024

View reviewed changes

Merge branch 'master' into update_ensemble_splitting

5f70d3b

JDBetteridge requested changes May 15, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`ManualEnsemble` class for specifying all comms in an `Ensemble` #189

`ManualEnsemble` class for specifying all comms in an `Ensemble` #189

JHopeCollins commented May 7, 2024 •

edited

Loading

JHopeCollins commented May 14, 2024 •

edited

Loading

JDBetteridge left a comment

JDBetteridge May 15, 2024

JHopeCollins May 16, 2024

JHopeCollins May 16, 2024 •

edited

Loading

JDBetteridge May 15, 2024

JHopeCollins May 16, 2024 •

edited

Loading

		weakref.finalize(new_ensemble, split_global_comm.Free)
		weakref.finalize(new_ensemble, split_ensemble_comm.Free)

ManualEnsemble class for specifying all comms in an Ensemble #189

Are you sure you want to change the base?

ManualEnsemble class for specifying all comms in an Ensemble #189

Conversation

JHopeCollins commented May 7, 2024 • edited Loading

JHopeCollins commented May 14, 2024 • edited Loading

JDBetteridge left a comment

Choose a reason for hiding this comment

JDBetteridge May 15, 2024

Choose a reason for hiding this comment

JHopeCollins May 16, 2024

Choose a reason for hiding this comment

JHopeCollins May 16, 2024 • edited Loading

Choose a reason for hiding this comment

JDBetteridge May 15, 2024

Choose a reason for hiding this comment

JHopeCollins May 16, 2024 • edited Loading

Choose a reason for hiding this comment

`ManualEnsemble` class for specifying all comms in an `Ensemble` #189

`ManualEnsemble` class for specifying all comms in an `Ensemble` #189

JHopeCollins commented May 7, 2024 •

edited

Loading

JHopeCollins commented May 14, 2024 •

edited

Loading

JHopeCollins May 16, 2024 •

edited

Loading

JHopeCollins May 16, 2024 •

edited

Loading