The ``ucc`` Component
=====================
The ``ucc`` collective component uses the Unified Collective
Communication (UCC) library to
offload selected MPI collective operations to UCC. This component is
useful on systems where UCC has been configured for the target transport
or accelerator environment.
Building with UCC
-----------------
Open MPI must be configured with UCC support:

.. code-block:: sh

   shell$ ./configure --with-ucc=/path/to/ucc-install

If UCC support is explicitly requested and the UCC headers and library
cannot be found, ``configure`` aborts. The ``ucc`` component is disabled
when Open MPI is configured with progress thread support, because the UCC
driver does not currently support progress threads.
Enabling the Component
----------------------
The component is not enabled by default. Enable it at run time and give
it a high enough priority to be selected:

.. code-block:: sh

   shell$ mpirun --mca coll_ucc_enable 1 \
                 --mca coll_ucc_priority 100 \
                 -np 64 ./my_mpi_app

The ``ucc`` component is considered only for intracommunicators whose
size is at least ``coll_ucc_np``. The default value of ``coll_ucc_np``
is ``2``.
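
The same settings can also be supplied through the environment, because
Open MPI reads any MCA parameter from a variable named
``OMPI_MCA_<parameter>``. A minimal sketch for a job script, reusing the
values from the ``mpirun`` example above:

.. code-block:: sh

   # Equivalent to "--mca coll_ucc_enable 1 --mca coll_ucc_priority 100".
   export OMPI_MCA_coll_ucc_enable=1
   export OMPI_MCA_coll_ucc_priority=100

A subsequent ``mpirun -np 64 ./my_mpi_app`` picks up these values. MCA
parameters given on the ``mpirun`` command line take precedence over the
environment.
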
UCC Layers and Protocols
------------------------
For each MPI communicator selected for UCC, Open MPI creates a UCC
``team``: the UCC group object used to initialize and execute collective
operations. Inside UCC, collective implementations are selected through
two kinds of layers:

* Collective layers (CLs), such as ``basic`` and ``hier``, decide how a
  collective is decomposed.
* Team layers (TLs), such as ``ucp``, ``self``, ``cuda``, ``nccl``,
  ``rccl``, ``sharp``, and ``mlx5``, provide the underlying transport or
  accelerator implementation.

For example, the ``ucp`` TL uses UCX/UCP transports such as InfiniBand,
RoCE, and shared memory; ``sharp`` uses SHARP in-network collective
offload; and ``nccl`` or ``rccl`` can be used for GPU collectives on
CUDA or ROCm memory.
The ``basic`` CL is the general-purpose layer. The ``hier`` CL can use
system hierarchy when it is available; for example, it may split work
across ``NODE`` and ``NET`` subgroups, plus the ``FULL`` group, and then
pipeline phases through different TLs. A typical hierarchical protocol
could use an intra-node reduction, an inter-node operation such as
SHARP, and an intra-node broadcast.
The exact CLs, TLs, and algorithms available depend on how UCC was
built. Use UCC's own tools to inspect the installed library:

.. code-block:: sh

   shell$ ucc_info -s    # Show available CLs and TLs
   shell$ ucc_info -A    # Show supported collective algorithms
   shell$ ucc_info -caf  # Show UCC configuration variables

Open MPI's ``coll_ucc_cls`` MCA parameter is passed to UCC as its
``CLS`` setting. It can be used to restrict team creation to specific
UCC collective layers, for example:

.. code-block:: sh

   shell$ mpirun --mca coll_ucc_enable 1 \
                 --mca coll_ucc_cls hier \
                 ./my_mpi_app

For lower-level TL tuning, use UCC environment variables such as
``UCC_TL_<NAME>_TUNE``, where ``<NAME>`` is the TL name, or a UCC
configuration file. UCC scores TLs based on factors including the
collective type, message size, memory type, and team size.
Selecting Collective Operations
-------------------------------
Use ``coll_ucc_cts`` to choose which collective operations the component
should provide. By default, the component enables all supported blocking
and nonblocking operations.

.. code-block:: sh

   shell$ mpirun --mca coll_ucc_enable 1 \
                 --mca coll_ucc_cts allreduce,iallreduce,bcast,ibcast \
                 ./my_mpi_app

Prefix the value with ``^`` to start from all supported operations and
disable specific operations from that set:

.. code-block:: sh

   shell$ mpirun --mca coll_ucc_enable 1 \
                 --mca coll_ucc_cts ^alltoall,ialltoall \
                 ./my_mpi_app

The supported operation names are:

* ``barrier``, ``bcast``, ``allreduce``, ``alltoall``, ``alltoallv``,
  ``allgather``, ``allgatherv``, ``reduce``, ``gather``, ``gatherv``,
  ``reduce_scatter_block``, ``reduce_scatter``, ``scatterv``, and
  ``scatter``
* ``ibarrier``, ``ibcast``, ``iallreduce``, ``ialltoall``,
  ``ialltoallv``, ``iallgather``, ``iallgatherv``, ``ireduce``,
  ``igather``, ``igatherv``, ``ireduce_scatter_block``,
  ``ireduce_scatter``, ``iscatterv``, and ``iscatter``

The aliases ``colls_b``, ``colls_i`` (or ``colls_nb``), and ``colls_p``
select all blocking, nonblocking, and persistent collective operations,
respectively. Individual persistent collective operations can be
selected by adding the ``_init`` suffix to the blocking operation name,
for example ``allreduce_init``.
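
The naming rule described above (blocking name, ``i`` prefix for the
nonblocking variant, ``_init`` suffix for the persistent variant) can be
sketched as a small shell loop that builds a ``coll_ucc_cts`` value. The
variable names ``ops`` and ``cts`` are illustrative only, and the sketch
assumes that blocking, nonblocking, and persistent names may be combined
in a single list:

.. code-block:: sh

   # Blocking operation names to expand (illustrative selection).
   ops="allreduce bcast"

   cts=""
   for op in $ops; do
       # Append the blocking, nonblocking, and persistent names,
       # inserting a comma only when the list is already non-empty.
       cts="${cts:+$cts,}$op,i$op,${op}_init"
   done

   echo "$cts"

The resulting string can then be passed to ``mpirun`` as
``--mca coll_ucc_cts "$cts"`` together with ``--mca coll_ucc_enable 1``.
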
Other MCA Parameters
--------------------

.. list-table::
   :header-rows: 1
   :widths: 30 15 55

   * - Parameter
     - Default
     - Description
   * - ``coll_ucc_enable``
     - ``0``
     - Enable or disable the component.
   * - ``coll_ucc_priority``
     - ``10``
     - Component selection priority.
   * - ``coll_ucc_verbose``
     - ``0``
     - Verbosity level for component logging.
   * - ``coll_ucc_np``
     - ``2``
     - Minimum communicator size for enabling the component.
   * - ``coll_ucc_cls``
     - UCC default
     - Comma-separated list of UCC collective layers to use for team
       creation, passed to UCC as ``CLS``.
   * - ``coll_ucc_cts``
     - All supported blocking and nonblocking operations
     - Comma-separated list of UCC collective types to enable.

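
The defaults listed above may differ between releases, so it can help to
query the installed build directly. The following sketch uses
``ompi_info`` to print the ``coll`` framework's ``ucc`` parameters; the
``ucc_params_listed`` variable is purely illustrative and only records
whether ``ompi_info`` was found:

.. code-block:: sh

   # List the coll/ucc MCA parameters with their current values.
   # Level 9 shows all parameters, not only the basic ones.
   if command -v ompi_info >/dev/null 2>&1; then
       ompi_info --param coll ucc --level 9
       ucc_params_listed=yes
   else
       echo "ompi_info not found in PATH" >&2
       ucc_params_listed=no
   fi

If the command prints no ``coll_ucc_*`` parameters, the component was
most likely not built into this Open MPI installation.
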
Verifying Selection
-------------------
Use ``coll_base_verbose`` to check which collective component Open MPI
selects for each operation:

.. code-block:: sh

   shell$ mpirun --mca coll_ucc_enable 1 \
                 --mca coll_ucc_priority 100 \
                 --mca coll_base_verbose 20 \
                 ./my_mpi_app

See :doc:`components` for more details about interpreting collective
component selection output.