17.2.468. MPIX_Comm_ack_failed

MPIX_Comm_ack_failed - Acknowledge failed processes in a communicator.

This is part of the User-Level Fault Mitigation (ULFM) extension.

17.2.468.1. SYNTAX

17.2.468.1.1. C Syntax

#include <mpi.h>
#include <mpi-ext.h>

int MPIX_Comm_ack_failed(MPI_Comm comm, int num_to_ack, int *num_acked)

17.2.468.1.2. Fortran Syntax

USE MPI
USE MPI_EXT
! or the older form: INCLUDE 'mpif.h'

MPIX_COMM_ACK_FAILED(COMM, NUM_TO_ACK, NUM_ACKED, IERROR)
     INTEGER COMM, NUM_TO_ACK, NUM_ACKED, IERROR

17.2.468.1.3. Fortran 2008 Syntax

USE mpi_f08
USE mpi_ext_f08

MPIX_Comm_ack_failed(comm, num_to_ack, num_acked, ierror)
     TYPE(MPI_Comm), INTENT(IN) :: comm
     INTEGER, INTENT(IN) :: num_to_ack
     INTEGER, INTENT(OUT) :: num_acked
     INTEGER, OPTIONAL, INTENT(OUT) :: ierror

17.2.468.2. INPUT PARAMETERS

  • comm: Communicator (handle).

  • num_to_ack: Maximum number of process failures to acknowledge in comm (integer).

17.2.468.3. OUTPUT PARAMETERS

  • num_acked: Number of acknowledged failures in comm (integer).

  • ierror: Fortran only: Error status (integer).

17.2.468.4. DESCRIPTION

This local operation gives users a way to acknowledge locally notified failures on comm. The operation acknowledges the first num_to_ack process failures on comm, that is, it acknowledges the failure of members with a rank lower than num_to_ack in the group that would be produced by a concurrent call to MPIX_Comm_get_failed on the same comm.

The operation also sets num_acked to the current number of acknowledged process failures in comm. In other words, a process failure has been acknowledged on comm if and only if the rank of the failed process is lower than num_acked in the group that would be produced by a subsequent call to MPIX_Comm_get_failed on the same comm.

num_acked can be larger than num_to_ack when process failures have been acknowledged in a prior call to MPIX_Comm_ack_failed.
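
The bookkeeping described above can be summarized with a toy model in plain C. This is only a sketch and not Open MPI's implementation: the struct ack_model and the function model_ack_failed are hypothetical names, and the model assumes that the failed group only grows by appending newly detected failures, so that ranks within it stay stable between calls.

```c
/* Toy model of the acknowledgment counter (not Open MPI code).
 * All names here are hypothetical and exist only for illustration. */
struct ack_model {
    int num_known_failed; /* size of the group MPIX_Comm_get_failed would return */
    int num_acked;        /* failures already acknowledged on the communicator */
};

/* Mimics the num_acked result of MPIX_Comm_ack_failed(comm, num_to_ack, ...):
 * acknowledge the first num_to_ack known failures, never "un-acknowledge"
 * earlier ones, and never acknowledge more failures than are known. */
static int model_ack_failed(struct ack_model *m, int num_to_ack)
{
    int target = num_to_ack < m->num_known_failed ? num_to_ack
                                                  : m->num_known_failed;
    if (target > m->num_acked) {
        m->num_acked = target;
    }
    return m->num_acked;
}
```

In this model, with three known failures, a request for 2 yields 2, a later request for 0 or 1 still yields 2 (the case where num_acked is larger than num_to_ack), and a request for 10 yields 3.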

17.2.468.5. EFFECT OF ACKNOWLEDGING FAILURES

After an MPI process failure is acknowledged on comm, unmatched MPI_ANY_SOURCE receive operations on the same comm that would have raised an error of class MPIX_ERR_PROC_FAILED_PENDING proceed without further raising errors due to this acknowledged failure.

Also, MPIX_Comm_agree on the same comm will not raise an error of class MPIX_ERR_PROC_FAILED due to this acknowledged failure.
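
As an illustration of the MPI_ANY_SOURCE case, the following sketch keeps a nonblocking wildcard receive alive across process failures. It assumes an Open MPI build with ULFM enabled and that MPI_ERRORS_RETURN is set on the communicator; the helper name recv_any_source_with_ack is hypothetical, and a real application would also need to decide when no live sender remains.

```c
#include <mpi.h>
#include <mpi-ext.h>
#include <stdio.h>

/* Hypothetical helper: receive one int from any source on comm (assumed to be
 * an intra-communicator), acknowledging failures reported while the receive
 * is pending. Assumes MPI_ERRORS_RETURN is set on comm. */
static int recv_any_source_with_ack(MPI_Comm comm, int tag, int *value,
                                    MPI_Status *status)
{
    MPI_Request req;
    int rc = MPI_Irecv(value, 1, MPI_INT, MPI_ANY_SOURCE, tag, comm, &req);
    if (rc != MPI_SUCCESS) return rc;

    for (;;) {
        int eclass, size, num_acked;

        rc = MPI_Wait(&req, status);
        if (rc == MPI_SUCCESS) return MPI_SUCCESS;

        MPI_Error_class(rc, &eclass);
        if (eclass != MPIX_ERR_PROC_FAILED_PENDING) return rc;

        /* The request is still pending. Acknowledge every failure known so
         * far so that waiting again does not report the same failures. */
        MPI_Comm_size(comm, &size);
        rc = MPIX_Comm_ack_failed(comm, size, &num_acked);
        if (rc != MPI_SUCCESS) return rc;
        fprintf(stderr, "acknowledged %d failure(s), still waiting\n", num_acked);
        /* Caution: if every potential sender has failed, the next MPI_Wait
         * may never complete; real code should check for that here. */
    }
}

int main(int argc, char **argv)
{
    int rank, size, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2 && rank == 0) {
        MPI_Status st;
        if (recv_any_source_with_ack(MPI_COMM_WORLD, 0, &value, &st) == MPI_SUCCESS)
            printf("received %d from rank %d\n", value, st.MPI_SOURCE);
    } else if (size >= 2 && rank == 1) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```

A failure detected after the acknowledgment is a new, unacknowledged failure, so the loop may see MPIX_ERR_PROC_FAILED_PENDING again and acknowledge once more.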

17.2.468.6. USAGE PATTERNS

One may query, without side effects, the number of currently acknowledged process failures in comm by supplying 0 in num_to_ack.

Conversely, one may unconditionally acknowledge all currently known process failures in comm by supplying the size of the group of comm in num_to_ack.

Note that the number of acknowledged processes, as returned in num_acked, can be smaller or larger than the value supplied in num_to_ack; it is, however, never larger than the size of the group returned by a subsequent call to MPIX_Comm_get_failed.
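
The two patterns above might look as follows in a complete program. This is a minimal sketch that assumes an Open MPI build with ULFM enabled and an intra-communicator (here MPI_COMM_WORLD); the variable names are illustrative only.

```c
#include <mpi.h>
#include <mpi-ext.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, acked_before = 0, acked_after = 0;

    MPI_Init(&argc, &argv);
    /* Return error codes instead of aborting, so failures can be handled. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Pattern 1: num_to_ack == 0 only queries the current count. */
    if (MPIX_Comm_ack_failed(MPI_COMM_WORLD, 0, &acked_before) != MPI_SUCCESS)
        fprintf(stderr, "rank %d: query failed\n", rank);

    /* Pattern 2: num_to_ack == communicator size acknowledges every failure
     * that is currently known locally. */
    if (MPIX_Comm_ack_failed(MPI_COMM_WORLD, size, &acked_after) != MPI_SUCCESS)
        fprintf(stderr, "rank %d: acknowledge failed\n", rank);

    printf("rank %d: acknowledged before=%d, after=%d\n",
           rank, acked_before, acked_after);

    MPI_Finalize();
    return 0;
}
```

Because the operation is local, different ranks may print different counts, depending on which failures each of them has been notified of.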

17.2.468.7. EFFECT ON COLLECTIVE OPERATIONS

Calling MPIX_Comm_ack_failed on a communicator with failed MPI processes has no effect on collective operations (except for MPIX_Comm_agree). If a collective operation would raise an error because the communicator contains a failed process, it will continue to raise an error even after the failure has been acknowledged. To use collective operations between MPI processes of a communicator that contains failed MPI processes, users should create a new communicator (e.g., by calling MPIX_Comm_shrink).
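
A minimal sketch of that recovery step is shown below. It assumes an Open MPI build with ULFM enabled and that every surviving process has already decided to rebuild the communicator (how that decision is reached is outside the scope of this sketch); the helper name switch_to_survivors is hypothetical.

```c
#include <mpi.h>
#include <mpi-ext.h>
#include <stdio.h>

/* Hypothetical helper: replace *comm with a communicator that contains only
 * the surviving processes. MPIX_Comm_shrink is collective, so every surviving
 * process of *comm must call this helper. */
static int switch_to_survivors(MPI_Comm *comm)
{
    MPI_Comm survivors;
    int rc = MPIX_Comm_shrink(*comm, &survivors);
    if (rc != MPI_SUCCESS) return rc;

    /* Set the error handler explicitly rather than relying on inheritance. */
    MPI_Comm_set_errhandler(survivors, MPI_ERRORS_RETURN);

    MPI_Comm_free(comm);  /* the old communicator is no longer used */
    *comm = survivors;
    return MPI_SUCCESS;
}

int main(int argc, char **argv)
{
    MPI_Comm work;
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_dup(MPI_COMM_WORLD, &work);
    MPI_Comm_set_errhandler(work, MPI_ERRORS_RETURN);

    /* Acknowledging failures would not make collectives on 'work' succeed
     * again, so the communicator is rebuilt instead. Without any failure,
     * the new communicator simply has the same members as the old one. */
    if (switch_to_survivors(&work) == MPI_SUCCESS) {
        MPI_Barrier(work);  /* collectives now run among the survivors */
        MPI_Comm_rank(work, &rank);
        MPI_Comm_size(work, &size);
        if (rank == 0) printf("continuing with %d process(es)\n", size);
    }

    MPI_Comm_free(&work);
    MPI_Finalize();
    return 0;
}
```

Ranks in the shrunk communicator may differ from ranks in the original one, so rank-dependent application state usually has to be recomputed after the switch.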

17.2.468.8. WHEN COMMUNICATOR IS AN INTER-COMMUNICATOR

When the communicator is an inter-communicator, the failures of members in both the local and the remote groups of comm are acknowledged.

17.2.468.9. ERRORS

Almost all MPI routines return an error value; C routines as the return result of the function and Fortran routines in the last argument.

Before the error value is returned, the current MPI error handler associated with the communication object (e.g., communicator, window, file) is called. If no communication object is associated with the MPI call, then the call is considered attached to MPI_COMM_SELF and will call the associated MPI error handler. When MPI_COMM_SELF is not initialized (i.e., before MPI_Init/MPI_Init_thread, after MPI_Finalize, or when using the Sessions Model exclusively), the error raises the initial error handler. The initial error handler can be changed by calling MPI_Comm_set_errhandler on MPI_COMM_SELF when using the World model, or by passing mpi_initial_errhandler as a CLI argument to mpiexec or as an info key to MPI_Comm_spawn/MPI_Comm_spawn_multiple. If no other appropriate error handler has been set, then the MPI_ERRORS_RETURN error handler is called for MPI I/O functions and the MPI_ERRORS_ABORT error handler is called for all other MPI functions.

Open MPI includes three predefined error handlers that can be used:

  • MPI_ERRORS_ARE_FATAL Causes the program to abort all connected MPI processes.

  • MPI_ERRORS_ABORT An error handler that can be invoked on a communicator, window, file, or session. When called on a communicator, it acts as if MPI_Abort was called on that communicator. If called on a window or file, it acts as if MPI_Abort was called on a communicator containing the group of processes in the corresponding window or file. If called on a session, it aborts only the local process.

  • MPI_ERRORS_RETURN Returns an error code to the application.

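MPI applications can also implement their own error handlers by calling:

  • MPI_Comm_create_errhandler then MPI_Comm_set_errhandler

  • MPI_File_create_errhandler then MPI_File_set_errhandler

  • MPI_Session_create_errhandler then MPI_Session_set_errhandler or at MPI_Session_init

  • MPI_Win_create_errhandler then MPI_Win_set_errhandler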

Note that MPI does not guarantee that an MPI program can continue past an error.

See the MPI man page for a full list of MPI error codes.

See the Error Handling section of the MPI-3.1 standard for more information.