ManualEnsemble class for specifying all comms in an Ensemble #189
base: master
@JDBetteridge this PR only involves the |
I may need to think about this some more...
weakref.finalize(new_ensemble, split_global_comm.Free)
weakref.finalize(new_ensemble, split_ensemble_comm.Free)
I think that the finalizer should be set in the __init__ method of the ManualEnsemble. It's the pattern used elsewhere and it prevents someone (user or developer) forgetting to add the finalizers.
This comes back to the question of what ManualEnsemble should be responsible for. In this case the ensemble_comm and the global_comm need finalising but the spatial_comm doesn't.
Currently I've gone with "the user of ManualEnsemble is totally responsible for the comms they pass in", but we could have optional arguments to ManualEnsemble.__init__ for which comms to set finalizers for, I'd be ok with that.
Just to note, I'm saying "user" here but ManualEnsemble isn't exposed publicly, it's meant for internal use so I'd expect it to always be wrapped in something like the split_ensemble function which has more knowledge about comm lifetime.
e.g.
import weakref

class ManualEnsemble(Ensemble):
    def __init__(self, global_comm, spatial_comm, ensemble_comm,
                 finalize_global_comm=False, finalize_spatial_comm=False,
                 finalize_ensemble_comm=False):
        # Only free the comms that the caller explicitly asks us to finalize
        if finalize_global_comm:
            weakref.finalize(self, global_comm.Free)
        if finalize_spatial_comm:
            weakref.finalize(self, spatial_comm.Free)
        if finalize_ensemble_comm:
            weakref.finalize(self, ensemble_comm.Free)
        ...
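A minimal, MPI-free sketch of how opt-in finalizers like these would behave. FakeComm and OwnsComms are hypothetical stand-ins, not firedrake or mpi4py classes; only weakref.finalize is a real library call, and the immediate cleanup on del relies on CPython reference counting.

```python
import gc
import weakref


class FakeComm:
    """Hypothetical stand-in for an MPI communicator; it only records Free()."""

    def __init__(self, name):
        self.name = name
        self.freed = False

    def Free(self):
        self.freed = True


class OwnsComms:
    """Hypothetical holder that frees only the comms it is told to own."""

    def __init__(self, global_comm, spatial_comm, ensemble_comm,
                 finalize_global_comm=False,
                 finalize_spatial_comm=False,
                 finalize_ensemble_comm=False):
        self.global_comm = global_comm
        self.spatial_comm = spatial_comm
        self.ensemble_comm = ensemble_comm
        # comm.Free is bound to the comm, not to self, so registering it
        # does not keep this object alive.
        if finalize_global_comm:
            weakref.finalize(self, global_comm.Free)
        if finalize_spatial_comm:
            weakref.finalize(self, spatial_comm.Free)
        if finalize_ensemble_comm:
            weakref.finalize(self, ensemble_comm.Free)


def demo():
    """Return which of the (global, spatial, ensemble) comms were freed."""
    g, s, e = FakeComm("global"), FakeComm("spatial"), FakeComm("ensemble")
    holder = OwnsComms(g, s, e,
                       finalize_global_comm=True,
                       finalize_ensemble_comm=True)
    del holder      # drop the last reference to the holder
    gc.collect()    # not needed on CPython, but harmless elsewhere
    return g.freed, s.freed, e.freed
```

With these flags the borrowed spatial comm is left alone while the two split comms are freed, which matches the ownership described above for split_ensemble.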
    raise PyOP2CommError("spatial_comm must be subgroup of global_comm")
if not is_subgroup(ensemble_group, global_group):
    raise PyOP2CommError("ensemble_comm must be subgroup of global_comm")
This logic isn't completely exhaustive, it doesn't currently check whether you have the same communicator. For example:
if __name__ == "__main__":
    r = COMM_WORLD.rank
    s = COMM_WORLD.size
    # Two splits that give the same rank groupings but distinct communicators
    ensemble_color1 = int(r < s/2)
    ensemble1 = COMM_WORLD.Split(color=ensemble_color1, key=r)
    ensemble_color2 = int(r >= s/2)
    ensemble2 = COMM_WORLD.Split(color=ensemble_color2, key=r)
    # Split colors must be integers, so use integer division here
    spatial_color = r % (s//2)
    spatial = COMM_WORLD.Split(color=spatial_color, key=r)
    correct = ManualEnsemble(COMM_WORLD, spatial, ensemble1)
    # Each half of the ranks passes an ensemble comm from a different Split
    if r < s/2:
        broken = ManualEnsemble(COMM_WORLD, spatial, ensemble1)
    else:
        broken = ManualEnsemble(COMM_WORLD, spatial, ensemble2)
    print("ALL PASSED")
Will run just fine, but the broken ensemble uses two different communicators.
This is broken in a very subtle way as the mismatched comm will be destroyed when the ensemble is destroyed.
I don't see what is broken about this.
Assuming 8 ranks, ranks 0-3 use the ensemble comm from the first Split call, and ranks 4-7 use the ensemble comm from the second Split call, but this doesn't matter. All ranks in each ensemble comm use the same one, which is what matters.
They don't "know", and don't need to know, what the other half is doing so long as every rank has an ensemble comm that connects the same part of all spatial comms.
As written so far, ManualEnsemble doesn't destroy any of the comms it's given (the docstring explicitly says that the caller is responsible for this).
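To make the 8-rank argument concrete, here is a pure-Python model of the rank groupings only. simulate_split is a hypothetical helper that mimics how Comm.Split groups ranks by color; it does not model MPI communicator contexts or comm lifetime, so it says nothing about the destruction concern raised above.

```python
from collections import defaultdict


def simulate_split(size, color_of):
    """Mimic Comm.Split on ranks 0..size-1 (key=rank): equal colors share a comm.

    Returns a dict mapping each rank to the tuple of ranks in its new comm.
    """
    groups = defaultdict(list)
    for rank in range(size):
        groups[color_of(rank)].append(rank)
    return {rank: tuple(members)
            for members in groups.values() for rank in members}


size = 8
first = simulate_split(size, lambda r: int(r < size / 2))
second = simulate_split(size, lambda r: int(r >= size / 2))

# The mixed choice from the example: the low half of the ranks takes the
# comm from the first split, the high half takes it from the second.
mixed = {r: (first[r] if r < size / 2 else second[r]) for r in range(size)}

# Every peer in a rank's ensemble comm must have picked the same split,
# otherwise the ranks would disagree about which comm they share.
consistent = all(
    (peer < size / 2) == (r < size / 2)
    for r in range(size) for peer in mixed[r]
)
```

In this model both splits produce the groups (0, 1, 2, 3) and (4, 5, 6, 7), and no group straddles the two halves, so every rank agrees with all of its peers about which split its ensemble comm came from.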
This PR slightly changes the way we split a large Ensemble into multiple smaller Ensembles, either in the SliceJacobiPC or in the nonlinear Gauss-Seidel iterations.

firedrake.Ensemble essentially has two responsibilities:

1. Making sure we have three viable comms: a global_comm, a spatial_comm, and an ensemble_comm. Currently, firedrake.Ensemble does this by taking a global_comm and splitting it into a Cartesian product of comms, providing a spatial_comm and an ensemble_comm from this product.
2. Wrapping mpi4py calls so that we can send firedrake.Functions from one spatial_comm to another across the ensemble_comm with a simple API and some sanity checks.

Creating a split ensemble involves intercepting the logic in 1 to make sure the comms in the split ensemble relate properly to the comms in the larger ensemble. Specifically, the split ensemble needs three communicators:
- A global_comm split from the global_comm of the larger ensemble.
- An ensemble_comm split from the global_comm or ensemble_comm of the larger ensemble.
- A spatial_comm that is the same as the spatial_comm from the larger ensemble so we can use the same mesh with both Ensembles.

The main issue here is that we need the spatial_comm of the split ensemble to be the same comm as the spatial_comm of the large ensemble, not just congruent. This means that we can't just make the smaller global_comm for the split ensemble and reuse the existing firedrake.Ensemble.__init__.

Previously I made a new EnsembleConnector class (terrible name, it connects existing spatial_comms, it doesn't connect different Ensembles). This class inherited from firedrake.Ensemble but overrode __init__, taking a global_comm and a specific spatial_comm, then created a new ensemble_comm by splitting the provided global_comm. To go with this is a split_ensemble function that takes in a large ensemble, splits its global_comm, and passes the split global_comm and the spatial_comm to the new EnsembleConnector.

This works fine for our case but has a couple of issues (other than the naming issues that already plague Ensemble).

1. The split_ensemble function does some, but not all, of task 1, making sure we have three viable comms. It sorts out the global_comm and spatial_comm but not the ensemble_comm, which is left to the EnsembleConnector.
2. EnsembleConnector assumes we already have the spatial_comm but not the ensemble_comm (what about the case where I already have the ensemble_comm but not the spatial_comm, or already have both?).

This PR changes to having a ManualEnsemble class that inherits from firedrake.Ensemble but just takes three comms and checks that they look like a global/spatial/ensemble comm set (i.e. they look like a Cartesian product of comms). It essentially is only doing task 2, wrapping mpi4py calls, and trusts that the three provided comms are a valid set to use.

The split_ensemble function now does all of task 1, taking in a larger Ensemble, splitting the global_comm and the ensemble_comm, and passing these plus the original spatial_comm to ManualEnsemble.

ManualEnsemble is simpler and more general than EnsembleConnector was (also more of a footgun, but I've tried to add enough checks), and split_ensemble now takes care of all of the logic of splitting an Ensemble, rather than just some of it.
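
As a rough illustration of the comm layout described above, the following pure-Python sketch works on sets of global ranks rather than real communicators. split_layout and looks_cartesian are hypothetical helpers, not the functions in this PR. The sketch assumes each spatial comm is a contiguous block of global ranks (so ensemble_rank = rank // spatial_size) and that the sizes divide evenly.

```python
def split_layout(global_size, spatial_size, members_per_split):
    """Rank sets of the split ensemble that each global rank ends up in.

    Assumption: spatial comms are contiguous blocks of global ranks.
    """
    layout = {}
    for rank in range(global_size):
        ensemble_rank = rank // spatial_size
        color = ensemble_rank // members_per_split
        # Ranks whose ensemble member falls in the same split as this rank
        same_split = [r for r in range(global_size)
                      if (r // spatial_size) // members_per_split == color]
        layout[rank] = {
            # new, smaller global comm: everything in this split
            "global": set(same_split),
            # spatial comm is reused unchanged from the larger ensemble
            "spatial": {r for r in range(global_size)
                        if r // spatial_size == ensemble_rank},
            # new ensemble comm: same spatial rank, restricted to this split
            "ensemble": {r for r in same_split
                         if r % spatial_size == rank % spatial_size},
        }
    return layout


def looks_cartesian(comms, rank):
    """Check that the three rank sets look like a Cartesian product."""
    g, s, e = comms["global"], comms["spatial"], comms["ensemble"]
    return (s <= g and e <= g           # both are subgroups of global
            and s & e == {rank}         # they only meet at this rank
            and len(s) * len(e) == len(g))


layout = split_layout(global_size=8, spatial_size=2, members_per_split=2)
all_valid = all(looks_cartesian(layout[r], r) for r in range(8))
```

Here 8 ranks with 2 ranks per spatial comm form 4 ensemble members, which are split into two ensembles of 2 members each; every rank keeps its original spatial set while its global and ensemble sets shrink to the split.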