
Use a temp pass through errhandler for trivial errors causing abort #2231


Merged: 2 commits merged into wrf-model:release-v4.7.1 on May 31, 2025

Conversation

Collaborator

@islas islas commented May 30, 2025

TYPE: bug fix

KEYWORDS: mpi, quilting, comm

SOURCE: internal

DESCRIPTION OF CHANGES:
Problem:
PR #2157 added changes to match an appropriate MPI_Datatype to a specific typesize during col_on_comm() and dst_on_comm(). This relies on MPI_Type_match_size() to query MPI for the MPI definition equivalent to a particular datatype size. There are fallback checks that query MPI_TYPECLASS_INTEGER if the MPI_TYPECLASS_REAL query fails.

However, when given a datatype size that has no matching MPI_TYPECLASS_REAL value (e.g., 1 byte, since no single-byte real type exists), the query does not simply fail via its return code; instead the failure is treated as critical and the program fully aborts. Since the query does not rely on critical process handling, and since adequate checks already exist to abort if no suitable type is found, this preemptive abort is unnecessary.

Solution:
Temporarily install a pass-through errhandler that does not modify the return code but also does not abort, and let the if statements that check for a correct MPI_Datatype abort if deemed necessary. Once the checks are complete, reinstate any previous errhandler and free the pass-through handle.

ISSUE:
#2225

TESTS CONDUCTED:

  1. Tested the stability of this call with both matching and non-matching type sizes over many invocations, repeatedly installing and restoring the error handler each time.

RELEASE NOTE:
In collect_on_comm.c, use a temporary pass-through errhandler to allow MPI_Type_match_size to fail correctly with an error code rather than fully aborting the program.

@islas islas requested a review from a team as a code owner May 30, 2025 18:51
@islas islas changed the base branch from master to release-v4.7.1 May 30, 2025 18:51
// handler temporarily
ierr = MPI_Comm_create_errhandler( &temp_errhandler, &errhandler );
ierr = MPI_Errhandler_get( MPI_COMM_WORLD, &previous );
ierr = MPI_Errhandler_set( MPI_COMM_WORLD, errhandler );
Collaborator

If my understanding is correct, MPI_Errhandler_get and MPI_Errhandler_set have been deprecated, and MPI_Comm_get_errhandler and MPI_Comm_set_errhandler are the MPI-2 replacements. See https://www.mcs.anl.gov/research/projects/mpi/mpi-standard/mpi-report-2.0-sf/mpi2-report.htm#Node18 .

Collaborator Author

Ah, yeah, I see now. I found the same for MPI_Comm_create_errhandler() but had not looked closely at the get/set functions since I didn't get a deprecation warning on those. Interestingly, there is no new replacement call for MPI_Errhandler_free().
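For reference, the non-deprecated MPI-2 sequence discussed here would look roughly like the following. This is a sketch only, assuming the same temp_errhandler callback and MPI_COMM_WORLD as in the snippet above; it is a fragment that requires an MPI implementation to build and is not the merged code:

```c
#include <mpi.h>

MPI_Errhandler errhandler, previous;
int ierr;

/* Same temporary pass-through pattern, using the MPI-2
 * communicator-scoped replacements for the deprecated calls. */
ierr = MPI_Comm_create_errhandler( &temp_errhandler, &errhandler );
ierr = MPI_Comm_get_errhandler( MPI_COMM_WORLD, &previous );
ierr = MPI_Comm_set_errhandler( MPI_COMM_WORLD, errhandler );

/* ... MPI_Type_match_size checks happen here ... */

/* Restore the previous handler and free ours. MPI_Errhandler_free
 * itself was never deprecated, which is why it has no replacement. */
ierr = MPI_Comm_set_errhandler( MPI_COMM_WORLD, previous );
ierr = MPI_Errhandler_free( &errhandler );
```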

@weiwangncar
Collaborator

The Jenkins test results:

Test Type              | Expected | Received | Failed
=======================|==========|==========|=======
Number of Tests        |       23 |       24 |
Number of Builds       |       60 |       57 |
Number of Simulations  |      158 |      150 |      0
Number of Comparisons  |       95 |       86 |      0

Failed Simulations are: 
None
Which comparisons are not bit-for-bit: 
None

@weiwangncar
Collaborator

I passed the routine to another user who has had problems with io_quilting on Derecho, and it worked for him!

@islas islas merged commit 46f9750 into wrf-model:release-v4.7.1 May 31, 2025
1 check passed

3 participants