Available Collective Components
===============================
Open MPI's ``coll`` framework provides a number of components
implementing collective communication, each of which targets a
different environment or scenario. Some of these components may not
be available depending on how Open MPI was compiled and what hardware
is available on the system. A run-time decision based on each
component's self reported priority, selects which component will be
used. These priorities may be adjusted on the command line or with
any of the other usual ways of setting MCA variables, giving us a way
to influence or override component selection. In the end, which of
the available components is selected depends on a number of factors
such as the underlying hardware and the whether or not a specific
collective is provided by the component as not all components
implement all collectives. However, there is always a fallback
``basic`` component that steps in and takes over when another component
fails to provide an implementation.
The following provides a list of components and their primary target scenario:
- ``han`` : component providing hierarchical algorithms.
- ``libnbc``: component providing non-blocking collective operations
based on a modified version of the libNBC library.
- ``self``: component providing single-process collective algorithms.
- ``tuned``: component providing fine grained mechanisms to switch
between algorithms for each operation and message size. See :doc:`tuned` for
more details.
- ``ucc``: component using the `UCC library `_
for collective operations. See :doc:`ucc` for more details.
- ``xhc``: shared memory collective component, employing hierarchical &
topology-aware algorithms, with XPMEM for data transfers. See :doc:`xhc` for
more details.
- ``acoll``: collective component tuned for AMD Zen architectures. See :doc:`acoll` for
more details.
- ``accelerator``: component providing host-proxy algorithms for some
collective operations using device buffers.
- ``ftagree``: component providing fault-tolerant collective operations.
- ``inter``: component providing collective operations for inter-communicators.
- ``basic``: component providing basic algorithms, used as a fall-back component.
- ``sync``: component used in scenarios where some nodes can be
overrun with messages. This component can be used to insert
synchronization points every *n-th* execution of a collective
operations.
- ``portals4``: component targeting portals4 networks.
Different component can and will be used for different collective
operations, since no component is providing implementations for all
operations defined in the MPI specification.
Displaying collective component selection
-----------------------------------------
Open MPI 6.0.x provides a mechanism to display which component has
been selected for a particular communication and communicator by setting the verbosity level
of the *coll_base_verbose* mca variable.
Specifically, setting *coll_base_verbose* to certain values will
influence which functions are precisely being displayed:
- values between *1 - 19*: will print the selected component for
blocking and non-blocking collectives assigned to MPI_COMM_WORLD,
but not for persistent collective operations
- value *20*: will print the selected component for all blocking and
non-blocking collectives for all communicators, but not the
persistent collectives
- values larger than *20*: will print the selected component for all
communicators and all collective operations
Example:
.. code-block:: sh
shell$ mpiexec --mca coll_base_verbose 10 -n 4 ./
...
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 allgather -> accelerator
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 allgatherv -> tuned
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 allreduce -> accelerator
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 alltoall -> accelerator
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 alltoallv -> tuned
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 alltoallw -> basic
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 barrier -> tuned
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 bcast -> accelerator
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 exscan -> accelerator
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 gather -> tuned
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 gatherv -> basic
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 reduce -> accelerator
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 reduce_scatter_block -> accelerator
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 reduce_scatter -> accelerator
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 scan -> accelerator
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 scatter -> tuned
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 scatterv -> basic
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 neighbor_allgather -> basic
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 neighbor_allgatherv -> basic
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 neighbor_alltoall -> basic
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 neighbor_alltoallv -> basic
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 neighbor_alltoallw -> basic
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 reduce_local -> accelerator
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 iallgather -> libnbc
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 iallgatherv -> libnbc
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 iallreduce -> libnbc
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 ialltoall -> libnbc
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 ialltoallv -> libnbc
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 ialltoallw -> libnbc
.. note:: While this output can provide valuable information, it might
not always accurately reflect which component executes the
operation, since some components have built-in logic to call the
next component in the priority list if certain conditions are not
met. For example, the `accelerator` collective component will use
this mechanism to hand-off the execution of the operation to the
next component in the priority list if the collective operation
invoked does not use device buffers.