4.15. Advice for packagers

4.15.1. Do not use Open MPI’s internal dependent libraries

The Open MPI community strongly suggests that binary Open MPI packages should not include Hwloc, Libevent, PMIx, or PRRTE. Although several of these libraries are required by Open MPI (and are therefore bundled in the Open MPI source code distribution for end-user convenience), binary Open MPI packages should limit themselves solely to Open MPI artifacts. Specifically: configure and build Open MPI against external installations of these required packages.

Packagers may therefore wish to configure Open MPI with something like the following:

# Install Sphinx so that Open MPI can re-build its docs with the
# installed PRRTE's docs

virtualenv venv
. ./venv/bin/activate
pip install -r docs/requirements.txt

./configure --with-libevent=external --with-hwloc=external \
    --with-pmix=external --with-prrte=external ...

Important

Note the installation of the Sphinx tool so that Open MPI can re-build its documentation with the external PRRTE’s documentation.

Failure to do this will mean Open MPI’s documentation will be correct for the version of PRRTE that is bundled in the Open MPI distribution, but may not be entirely correct for the version of PRRTE that you are building against.

The external keywords will force Open MPI’s configure to ignore all the bundled libraries and only look for external versions of these support libraries. This also has the benefit of causing configure to fail if it cannot find the required support libraries outside of the Open MPI source tree — a good sanity check to ensure that your package is correctly relying on the independently-built and installed versions.
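
If the external installations are not in your compiler's and linker's default search paths, the same --with-FOO options can be given an installation prefix instead of the external keyword. A minimal sketch (the /opt/... prefixes below are hypothetical; substitute your own installation locations):

# Point configure at specific installation prefixes (paths are
# illustrative only)
shell$ ./configure --with-libevent=/opt/libevent --with-hwloc=/opt/hwloc \
    --with-pmix=/opt/pmix --with-prrte=/opt/prrte ...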

See this section for more information about the required support library --with-FOO command line options.

4.15.2. Have Sphinx installed

Since you should be installing Open MPI against an external PRRTE and PMIx, you should have Sphinx installed before running Open MPI’s configure script.

This will allow Open MPI to (re-)build its documentation according to the PMIx and PRRTE that you are building against.

To be clear: the Open MPI distribution tarball comes with pre-built documentation — rendered in HTML and nroff — that is suitable for the versions of PRRTE and PMIx that are bundled in that tarball.

However, if you are building Open MPI against not-bundled versions of PRRTE / PMIx (as all packagers should be), Open MPI needs to re-build its documentation with specific information from those external PRRTE / PMIx installs. For that, you need to have Sphinx installed before running Open MPI’s configure script.
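
A quick sanity check before running configure is to confirm that Sphinx is visible in the build environment; for example:

# Verify that Sphinx is available before running Open MPI's configure
shell$ sphinx-build --version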

4.15.3. Components (“plugins”): static or DSO?

Open MPI contains a large number of components (sometimes called “plugins”) to effect different types of functionality in MPI. For example, some components effect Open MPI’s networking functionality: they may link against specialized libraries to provide highly-optimized network access.

Open MPI can build its components as Dynamic Shared Objects (DSOs) or include them statically in its core libraries (regardless of whether those libraries are built as shared or static libraries).

Note

As of Open MPI head of development, configure’s global default is to build all components as static (i.e., part of the Open MPI core libraries, not as DSOs). Prior to Open MPI v5.0.0, the global default behavior was to build most components as DSOs.

4.15.3.1. Why build components as DSOs?

There are advantages to building components as DSOs:

  • Open MPI’s core libraries — and therefore MPI applications — will have very few dependencies. For example, if you build Open MPI with support for a specific network stack, the libraries in that network stack will be dependencies of the DSOs, not Open MPI’s core libraries (or MPI applications).

  • Removing Open MPI functionality that you do not want is as simple as removing a DSO from $libdir/openmpi (a sketch follows this list).
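
For illustration only, using the UCX PML component mentioned later in this section as an example ($libdir is a placeholder for your installation’s library directory):

# List the installed component DSOs
shell$ ls $libdir/openmpi/mca_*.so

# Remove a single component, e.g., the UCX PML
shell$ rm $libdir/openmpi/mca_pml_ucx.so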

4.15.3.2. Why build components as part of Open MPI’s core libraries?

The biggest advantage of building the components as part of Open MPI’s core libraries appears when running at (very) large scale with Open MPI installed on a networked filesystem (vs. being installed on a local filesystem).

For example, consider launching a single MPI process on each of 1,000 nodes. In this scenario, the following is accessed from the network filesystem:

  1. The MPI application

  2. The core Open MPI libraries and their dependencies (e.g., libmpi)

    • Depending on your configuration, this is probably on the order of 10-20 library files.

  3. All DSO component files and their dependencies

    • Depending on your configuration, this can be 200+ component files.

If all components are physically located in the libraries, then the third item loads zero DSO component files. With 1,000 nodes and roughly 200 component files per process, that is on the order of 200,000 file opens that no longer need to be served by the networked filesystem at launch time. When using a networked filesystem while launching at scale, this can translate to large performance savings.

Note

If not using a networked filesystem, or if not launching at scale, loading a large number of DSO files may not consume a noticeable amount of time during MPI process launch. Put simply: loading DSOs as individual files generally only matters when using a networked filesystem while launching at scale.

4.15.3.3. Direct controls for building components as DSOs or not

Open MPI head of development has two configure-time defaults regarding the treatment of components that may be of interest to packagers:

  1. Open MPI’s libraries default to building as shared libraries (vs. static libraries). For example, on Linux, Open MPI will default to building libmpi.so (vs. libmpi.a).

    Note

    See the descriptions of --disable-shared and --enable-static in this section for more details about how to change this default.

    Also be sure to see this warning about building static apps.

  2. Open MPI will default to including its components in its libraries (as opposed to being compiled as dynamic shared objects, or DSOs). For example, libmpi.so on Linux systems will contain the UCX PML component, instead of the UCX PML being compiled into mca_pml_ucx.so and dynamically opened at run time via dlopen(3).

    Note

    See the descriptions of --enable-mca-dso and --enable-mca-static in this section for more details about how to change these defaults (a brief sketch follows this list).
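
To make the two defaults concrete, here is a minimal sketch of configure invocations that change them (all other arguments elided; see the sections referenced above for full details):

# Sketch: build static core libraries instead of shared libraries
shell$ ./configure --enable-static --disable-shared ...

# Sketch: build all components as DSOs instead of including them in
# the core libraries
shell$ ./configure --enable-mca-dso ...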

A side effect of these two defaults is that all the components included in the Open MPI libraries will bring their dependencies with them. For example (on Linux), if the XYZ PML component in the MPI layer requires libXYZ.so, then these defaults mean that libmpi.so will depend on libXYZ.so. This dependency will likely be telegraphed into the Open MPI binary package that includes libmpi.so.

Conversely, if the XYZ PML component was built as a DSO, then (assuming no other parts of Open MPI require libXYZ.so) libmpi.so would not be dependent on libXYZ.so. Instead, the mca_pml_xyz.so DSO would have the dependency upon libXYZ.so.
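
One way to see where such a dependency ends up is to inspect the installed files with ldd(1). A sketch, continuing the hypothetical XYZ example (file names are illustrative and $libdir is a placeholder):

# XYZ built into the core library: the dependency appears on libmpi.so
shell$ ldd $libdir/libmpi.so | grep libXYZ

# XYZ built as a DSO: the dependency appears on the component file instead
shell$ ldd $libdir/openmpi/mca_pml_xyz.so | grep libXYZ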

Packagers can use these facts to create multiple binary Open MPI packages, each with different dependencies: for example, use --enable-mca-dso to selectively build some components as DSOs and leave the others included in their respective Open MPI libraries.

See the section on building accelerator support for a practical example where this can be useful.

4.15.3.4. GNU Libtool dependency flattening

When compiling Open MPI’s components statically as part of Open MPI’s core libraries, GNU Libtool — which is used as part of Open MPI’s build system — will attempt to “flatten” dependencies.

For example, the ompi_info(1) command links against the Open MPI core library libopen-pal. This library will have dependencies on various HPC-class network stack libraries. For simplicity, the discussion below assumes that Open MPI was built with support for Libfabric and UCX, and therefore libopen-pal has direct dependencies on libfabric and libucx.

In this scenario, GNU Libtool will automatically attempt to “flatten” these dependencies by linking ompi_info(1) directly to libfabric and libucx (vs. letting libopen-pal pull the dependencies in at run time).

  • In some environments (e.g., Ubuntu 22.04), the compiler and/or linker will automatically utilize the linker CLI flag -Wl,--as-needed, which will effectively cause these dependencies to not be flattened: ompi_info(1) will not have direct dependencies on either libfabric or libucx.

  • In other environments (e.g., Fedora 38), the compiler and linker will not utilize the -Wl,--as-needed linker CLI flag. As such, ompi_info(1) will show direct dependencies on libfabric and libucx.

Just to be clear: these flattened dependencies are not a problem. Open MPI will function correctly with or without the flattened dependencies. There is no performance impact associated with having — or not having — the flattened dependencies. We mention this situation here in the documentation simply because it surprised some Open MPI downstream package managers to see that ompi_info(1) in Open MPI head of development had more shared library dependencies than it did in prior Open MPI releases.
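
Packagers who want to check which behavior their build environment produced can inspect ompi_info(1)’s direct shared library dependencies. A sketch using readelf(1) (any equivalent tool works; the library names that appear will vary with your configuration):

# Show ompi_info's direct shared library dependencies (DT_NEEDED
# entries); with flattened dependencies, entries such as libfabric
# and libucx will appear here
shell$ readelf -d $(which ompi_info) | grep NEEDED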

If packagers want ompi_info(1) to not have these flattened dependencies, use either of the following mechanisms:

  1. Use --enable-mca-dso to force all components to be built as DSOs (this was actually the default behavior before Open MPI v5.0.0).

  2. Add LDFLAGS=-Wl,--as-needed to the configure command line when building Open MPI (see the example after the note below).

    Note

    The Open MPI community specifically chose not to automatically utilize this linker flag for the following reasons:

    1. Having the flattened dependencies does not cause any correctness or performance problems.

    2. There are multiple mechanisms (see above) for users or packagers to change this behavior, if desired.

    3. Certain environments have chosen to have — or not have — this flattened dependency behavior. It is not Open MPI’s place to override these choices.

    4. In general, Open MPI’s configure script only utilizes compiler and linker flags if they are needed. All other flags should be the user’s / packager’s choice.
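
A minimal sketch of the second mechanism (all other configure arguments elided):

# Ask the linker to omit direct dependencies that are not actually
# needed, which avoids the flattened dependencies described above
shell$ ./configure LDFLAGS=-Wl,--as-needed ...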

4.15.3.5. Building accelerator support as DSOs

If you are building a package that includes support for one or more accelerators, it may be desirable to build accelerator-related components as DSOs (see the static or DSO? section for details).

Rationale

Accelerator hardware is expensive, and may only be present on some compute nodes in an HPC cluster. Specifically: there may not be any accelerator hardware on “head” or compile nodes in an HPC cluster. As such, invoking Open MPI commands on a “head” node that has no accelerator hardware, using an Open MPI that was built with static accelerator support, may fail to launch because of run-time linker issues (the accelerator support libraries are likely not present on that node).

Building Open MPI’s accelerator-related components as DSOs allows Open MPI to try opening the accelerator components, but proceed if those DSOs fail to open due to the lack of support libraries.

Using the --enable-mca-dso command line parameter to Open MPI’s configure command allows packagers to build all accelerator-related components as DSOs. For example:

# Build all the accelerator-related components as DSOs (all other
# components will default to being built in their respective
# libraries)
shell$ ./configure --enable-mca-dso=btl-smcuda,rcache-rgpusm,rcache-gpusm,accelerator

Per the example above, this allows packaging $libdir as part of the “main” Open MPI binary package, but then packaging $libdir/openmpi/mca_accelerator_*.so and the other named components as sub-packages. These sub-packages may inherit dependencies on the CUDA and/or ROCM packages, for example. The “main” package can be installed on all nodes, and the accelerator-specific sub-packages can be installed only on the nodes with accelerator hardware and support libraries.
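
As a hypothetical illustration of that split, using the components from the configure example above, the file lists of the two packages might look roughly like this (exact file names depend on your configuration and platform):

# "main" Open MPI package:
#     $libdir/libmpi.so.*
#     $libdir/libopen-pal.so.*
#     $libdir/openmpi/mca_*.so          (all non-accelerator components)
#
# accelerator sub-package (depends on the CUDA and/or ROCM packages):
#     $libdir/openmpi/mca_accelerator_*.so
#     $libdir/openmpi/mca_btl_smcuda.so
#     $libdir/openmpi/mca_rcache_rgpusm.so
#     $libdir/openmpi/mca_rcache_gpusm.so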