Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ManualEnsemble class for specifying all comms in an Ensemble #189

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

JHopeCollins
Copy link
Member

@JHopeCollins JHopeCollins commented May 7, 2024

This PR slightly changes the way we split a large Ensemble into multiple smaller Ensembles, either in the SliceJacobiPC or in the nonlinear Gauss-Seidel iterations.
firedrake.Ensemble essentially has two responsibilities:

  1. Make sure we have three comms on each rank, a global_comm, a spatial_comm, and an ensemble_comm. Currently, firedrake.Ensemble does this by taking a global_comm and splitting it into a Cartesian product of comms, providing a spatial_comm and an ensemble_comm from this product.
  2. Wrap mpi4py calls so that we can send firedrake.Functions from one spatial_comm to another across the ensemble_comm with a simple API and some sanity checks.

Creating a split ensemble involves intercepting the logic in 1 to make sure the comms in the split ensemble relate properly to the comms in the larger ensemble. Specifically, the split ensemble needs three communicators:

  1. A global_comm split from the global_comm of the larger ensemble
  2. An ensemble_comm split from the global_comm or ensemble_comm of the larger ensemble
  3. A spatial_comm that is the same as the spatial_comm from the larger ensemble so we can use the same mesh with both Ensembles.

The main issue here is that we need the spatial_comm of the split ensemble to be the same comm as the spatial_comm of the large ensemble, not just congruent. This means that we can't just make the smaller global_comm for the split ensemble and reuse the existing firedrake.Ensemble.__init__.

Previously I made a new EnsembleConnector class (terrible name, it connects existing spatial_comms, it doesn't connect different Ensembles). This class inherited from firedrake.Ensemble but overrode __init__, taking a global_comm and a specific spatial_comm, then created a new ensemble_comm by splitting the provided global_comm. To go with this is a split_ensemble function that takes in a large ensemble, splits it's global_comm, and passes the split global_comm and the spatial_comm to the new EnsembleConnector.
This works fine for our case but has a couple of issues (other than the naming issues that already plague Ensemble).

  1. The main issue is that it mixes up responsibilities. The split_ensemble function does some, but not all of task 1, making sure we have three viable comms. It sorts out the global_comm and spatial_comm but not the ensemble_comm, which is left to the EnsembleConnector.
  2. It is specific to the case of already having the spatial_comm but not the ensemble_comm (what about the case where I already have the ensemble_comm but not the spatial_comm, or already have both).

This PR changes to having a ManualEnsemble class that inherits from firedrake.Ensemble but just takes three comms and checks that they look like a global/spatial/ensemble comm set (i.e. they look like a cartesian product of comms). It essentially is only doing task 2, wrapping mpi4py calls, and trusts that the three provided comms are a valid set to use.
The split_ensemble function now does all of task 1, taking in a larger Ensemble, splitting the global_comm and the ensemble_comm, and passing these plus the original spatial_comm to ManualEnsemble.

ManualEnsemble is simpler and more general than EnsembleConnector was (also more of a footgun, but I've tried to add enough checks), and split_ensemble now takes care of all of the logic of splitting an Ensemble, rather than just some of it.

@JHopeCollins JHopeCollins added refactor Tidy up existing functionality Core functionality Adding to the main paradiag functionality labels May 7, 2024
@JHopeCollins JHopeCollins self-assigned this May 7, 2024
@JHopeCollins
Copy link
Member Author

JHopeCollins commented May 14, 2024

@JDBetteridge this PR only involves the split_ensemble function and the ManualEnsemble class, the create_ensemble is just a convenience function not used here.

Copy link
Member

@JDBetteridge JDBetteridge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may need to think about this some more...

Comment on lines +56 to +57
weakref.finalize(new_ensemble, split_global_comm.Free)
weakref.finalize(new_ensemble, split_ensemble_comm.Free)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the finalizer should be set in the __init__ method of the ManualEnsemble. It's the pattern used elsewhere and it prevents someone (user or developer) forgetting to add the finalizers.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comes back to the question of what ManualEnsemble should be responsible for. In this case the ensemble_comm and the global_comm need finalising but the spatial_comm doesn't.

Currently I've gone with "the user of ManualEnsemble is totally responsible for the comms they pass in", but we could have optional arguments to ManualEnsemble.__init__ for which comms to set finalizers for, I'd be ok with that.

Just to note, I'm saying "user" here but ManualEnsemble isn't exposed publicly, it's meant for internal use so I'd expect it to always be wrapped in something like the split_ensemble function which has more knowledge about comm lifetime.

Copy link
Member Author

@JHopeCollins JHopeCollins May 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e.g.

class ManualEnsemble(Ensemble)
    def __init__(self, global_comm, spatial_comm, ensemble_comm,
                 finalize_global_comm=False, finalize_spatial_comm=False, finalize_ensemble_comm=False)

        if finalize_global_comm:
            weakref.finalize(self, global_comm.Free)
        if finalize_spatial_comm:
            weakref.finalize(self, spatial_comm.Free)
        if finalize_ensemble_comm:
            weakref.finalize(self, ensemble_comm.Free)
        ...

raise PyOP2CommError("spatial_comm must be subgroup of global_comm")
if not is_subgroup(ensemble_group, global_group):
raise PyOP2CommError("ensemble_comm must be subgroup of global_comm")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic isn't completely exhaustive, it doesn't currently check whether you have the same communicator. For example:

if __name__ == "__main__":
    r = COMM_WORLD.rank
    s = COMM_WORLD.size

    ensemble_color1 = int(r < s/2)
    ensemble1 = COMM_WORLD.Split(color=ensemble_color1, key=r)

    ensemble_color2 = r >= s/2
    ensemble2 = COMM_WORLD.Split(color=ensemble_color2, key=r)

    spatial_color = r % (s/2)
    spatial = COMM_WORLD.Split(color=spatial_color, key=r)

    correct = ManualEnsemble(COMM_WORLD, spatial, ensemble1)
    if r < s/2:
        broken = ManualEnsemble(COMM_WORLD, spatial, ensemble1)
    else:
        broken = ManualEnsemble(COMM_WORLD, spatial, ensemble2)

    print("ALL PASSED")

Will run just fine, but the broken ensemble uses two different communicators.

This is broken in a very subtle way as the mismatched comm will be destroyed when the ensemble is destroyed.

Copy link
Member Author

@JHopeCollins JHopeCollins May 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see what is broken about this.

Assuming 8 ranks, ranks 0-3 use the ensemble comm from the first Split, and ranks 4-7 use the ensemble comm from the second Split call, but this doesn't matter. All ranks in each ensemble comm use the same one, which is what matters.
They don't "know", and don't need to know, what the other half is doing so long as every rank has an ensemble comm that connects the same part of all spatial comms.

As written so far, ManualEnsemble doesn't destroy any of the comms its given (the docstring explicitly says that the caller is responsible for this).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Core functionality Adding to the main paradiag functionality refactor Tidy up existing functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants