13.13. LLM-friendly documentation artifacts

In addition to the human-facing HTML and Unix man pages, the Open MPI documentation build publishes machine-readable, LLM-friendly artifacts for the public MPI APIs (C, mpif.h, use mpi, and use mpi_f08). They are indexed from a top-level llms.txt and published under llms/ in each documentation version (for example, https://docs.open-mpi.org/en/<version>/llms.txt, where <version> is the Read the Docs version slug, such as v6.1.x or main).

The intent is to give LLMs, retrieval systems, and coding assistants concise, authoritative, version-correct MPI API information — a signature, its parameters, the right language interface, and a link back to the human docs — without scraping themed HTML or guessing from filenames, and without creating a second hand-maintained API reference that could drift from the real documentation.

This page is for Open MPI developers and release managers who maintain these artifacts. It records the design intent and the day-to-day maintenance rules; the full design record (every field, every alternative considered) lives in specs/llms-friendly-docs/spec.md in the source tree.

13.13.1. Background: the llms.txt convention

llms.txt is a convention proposed by Jeremy Howard in September 2024 (llmstxt.org) for exposing LLM-friendly content at a well-known location — a Markdown file at the site root (/llms.txt), alongside the established robots.txt and sitemap.xml. The bare convention is an index: an H1 project name, a short summary, and H2-delimited lists of Markdown links, optionally with companion .md versions of HTML pages.

Read the Docs supports this convention: it does not auto-generate the file, but if a built, public, active default version contains an llms.txt in its HTML output, Read the Docs serves it at the project root.

Open MPI extends the convention rather than merely conforming to it. A bare link index is not enough for a tool that needs to answer signature, parameter, and interface questions programmatically, so alongside the llms.txt index Open MPI also publishes a structured JSONL API catalog, language-specific Markdown corpora, per-symbol Markdown pages, and a machine-readable manifest (see What is generated). The llms.txt index itself follows the spirit of the convention (an H1 name, a summary, and link sections) but adds prose that an automated consumer needs — most importantly, a description of the URL scheme for other versions (see Read the Docs and the version-neutral llms.txt).

13.13.2. Design principles

Five cross-cutting principles explain most of the decisions below. When in doubt, preserve these properties.

  • The existing RST man pages and the MPI Forum JSON binding metadata are the only source of truth. A core goal is to avoid a second hand-maintained API reference. The generated artifacts are derived, never authored: the set of documented procedures comes from the man-openmpi/man3/MPI_*.3.rst man pages and the command corpus from the man-openmpi/man1/*.rst pages, while standard-API signatures/parameters come from the MPI Forum metadata (mpi-standard-apis.json via the embedded pympistandard library). This is why adding a new API is a matter of adding a man page, not editing the generator (see Updating the documentation when APIs change), and why the curated examples reuse the top-level examples/ tree instead of copying it.

  • Each artifact’s content hash is a pure function of its semantic content. Build-identity fields (git_commit, git_describe, generated_at) live only in the manifest — never in the catalog records, corpora, or per-symbol pages. If a commit hash were embedded in every record, every artifact’s hash would change on every repository commit, even one that touches no documentation, making the manifest’s per-artifact hashes useless as change detectors. Keeping build identity in one place gives consumers the clean property “hash changeddocumentation content changed.” This is also why the versioned llms.txt is timestamp-free (it carries the version, not a generation time) and why the manifest is the one artifact that does not inventory itself.

  • Reproducible under Open MPI’s existing knob. The build fits the project’s established reproducible-builds model rather than inventing a new one: the semantic artifacts contain no wall-clock data and are byte-identical across reruns at a given commit; only the manifest’s generated_at varies, and it is derived from SOURCE_DATE_EPOCH when set (see Reproducibility and the per-release manifest).

  • The schemas are simultaneously the published contract and the CI validator. docs/llms-src/*.schema.json are shipped as artifacts and used directly by make check, so the contract and the check cannot drift. They are “open” objects (no additionalProperties: false) so that an additive field never breaks a consumer validating against an older cached copy of the schema (see Schema evolution).

  • The artifacts ride inside the HTML output tree. Rather than a parallel distribution mechanism, the llms/ tree and llms.txt are copied into the Sphinx HTML output, so they inherit the existing HTML packaging, installation, and tarball machinery for free, and the link strategy follows the build type (see How they are built and Link strategy: relative vs. absolute links).

13.13.3. What is generated

  • llms/openmpi-mpi-api.jsonl — one JSON record per documented MPI procedure (standard, extension, and deprecated/removed). A record is per procedure, not per page.

  • llms/openmpi-mpi-api.md and the four per-interface corpora (C, mpif.h, use mpi, use mpi_f08). mpif.h and use mpi share the same f90 signature and differ only in the access preamble, so those two corpora are near-duplicates by design; they are kept separate for audience clarity and possible future divergence.

  • llms/man-openmpi/man3/MPI_<name>.3.md — one Markdown page per man page, 1:1 with the human man pages so canonical URLs line up. Overview/non-procedure pages (e.g. MPI_T.3, MPI_Errors.3) get a Markdown page but no JSONL record.

  • llms/man-openmpi/man1/<command>.1.md — one Markdown page per command man page (mpirun, ompi_info, the wrapper compilers, …). These document Open MPI commands, not MPI APIs, so they are a Markdown corpus only — there are no JSONL catalog records for them.

  • llms/openmpi-docs-manifest.json — the artifact inventory; the only artifact that carries build identity (git commit/describe and generated_at).

  • The curated llms/openmpi-mpi-interface-guide.md, llms/openmpi-mpi-examples.md, and llms/openmpi-runtime-introspection.md (hand-written sources under docs/llms-src/), plus the two published *.schema.json files. The runtime-introspection guide tells a consumer how to query an installed Open MPI with ompi_info — its version, build configuration, available MCA components, and the run-time MCA parameters those components expose. That surface is installation-specific (it depends on which components were built), so it is intentionally not snapshotted into the corpus; an ompi_info --all --parsable snapshot would bake in one machine’s paths and component set, immediately drift, and could not be produced at all on Read the Docs (which builds the docs without a full Open MPI install). The corpus therefore points consumers at the live, self-describing command output instead.

13.13.4. How they are built

docs/generate-llm-docs.py produces the artifacts into a build-tree staging directory (docs/llms-build/). It is run in both documentation build paths, just like the man-page bindings generator: from a sentinel target in docs/Makefile.am (after the man3 bindings, before Sphinx) for make builds, and from .readthedocs-pre-create-environment.sh for Read the Docs builds (which run sphinx-build directly and never run make). Shared MPI metadata logic (binding rendering, VERSION parsing, build identity) lives in docs/ompi_docs_common.py, which is also used by the man-page bindings generator. No separate command is needed: building the docs builds the LLM artifacts.

A build-finished hook in docs/conf.py then copies the staging tree (llms-build/llms<output>/llms and llms-build/llms.txt<output>/llms.txt) into the Sphinx HTML output. The copy lives in a Sphinx hook — rather than html_extra_path or a Makefile step — for two reasons: (1) the staged Markdown is deliberately excluded from Sphinx source discovery (exclude_patterns), which would also suppress an html_extra_path entry pointing at it; and (2) the hook runs inside sphinx-build, so publication works identically under make and on Read the Docs. Because the artifacts ride inside the HTML output tree, the existing html-local / EXTRA_DIST (tarballs), install-data-hook (install), and uninstall-hook (uninstall) machinery handles them with no new rules.

13.13.6. Read the Docs and the version-neutral llms.txt

Read the Docs serves the version-neutral https://docs.open-mpi.org/llms.txt by serving the default version’s own llms.txt at the site root. There is no separate, hand-maintained top-level index file: the root URL is simply whichever llms.txt the current default version produced. Each documentation version emits exactly one self-describing llms.txt for itself.

This RTD behavior directly drove the content of llms.txt. Because the file served at the root is just some version’s file — and the default version may be a series that does not even carry these artifacts yet — llms.txt cannot assume it is authoritative for the whole project. So it self-describes the version-slug URL scheme:

  • it states the https://docs.open-mpi.org/en/VERSION_SLUG/ scheme and how to read a slug (main = the main-branch build; vA.B.x = a release branch; vA.B.C = a specific tagged release), worded so an LLM that wants a different version can construct the URL itself;

  • it notes that documentation for Open MPI versions older than v5.0.0 is not published in this format, and points at the legacy README/FAQ/doc pages;

  • for a Read the Docs main build it uses dual attribution — naming both the main slug and the vA.B.x series that build currently represents — because the same file is both “the development tip” and “the current pre-release series.”

If a future need arises for the root /llms.txt to be something other than the default version’s copy, Read the Docs exact redirects can point /llms.txt at a chosen versioned path; no generator change would be required.

13.13.7. Validation

make check builds the artifacts if they are not already present and then runs docs/validate-llm-docs.py, which validates the catalog and manifest against the published JSON Schemas (docs/llms-src/*.schema.json), checks cross-field invariants (for example, that a record’s languages equals the distinct set of its bindings[].language) and manifest integrity (hashes, byte sizes, coverage), confirms the generated Markdown is free of unresolved RST (.. include:: directives, :ref: roles, Sphinx-only substitutions), confirms the versioned llms.txt carries no generation timestamp, verifies the committed sample records (specs/llms-friendly-docs/sample-records.jsonl) match the generated catalog, and confirms the generator is deterministic at a fixed SOURCE_DATE_EPOCH.

13.13.8. Reproducibility and the per-release manifest

The artifacts are regenerated by the normal documentation build, so there is no separate regeneration step for release managers, and no separate command for developers: make (or a Read the Docs build) regenerates everything, including the manifest, every time.

The manifest (llms/openmpi-docs-manifest.json) is rebuilt on every build and is the single place that carries build identity: git_commit, git_describe, generated_at, the Open MPI version/series (from the top-level VERSION file), the Read the Docs slug when present, and one entry per artifact (path, URL, media type, SHA-256, byte size, estimated token count, and the symbols/languages it covers). It does not inventory itself — a file cannot record its own hash and size without changing them. Consumers do not need a separately incremented “docs release number”: they compare the manifest’s identity fields and per-artifact hashes. If, say, the v6.1.x branch gets a documentation fix before v6.1.1 ships, the Open MPI version may be unchanged but the git identity and the hashes of the affected artifacts change, while everything else stays byte-identical.

For a reproducible release tarball, set SOURCE_DATE_EPOCH (as already documented for reproducible Open MPI builds): the generator honors it for the manifest generated_at timestamp, exactly as config/getdate.sh and Sphinx’s format_date do for the rest of the docs build. With SOURCE_DATE_EPOCH set, the whole documentation build (HTML, man pages, and LLM artifacts) is reproducible. In a from-tarball build with no .git and no SOURCE_DATE_EPOCH, the generator degrades gracefully rather than failing: the git fields become unknown/omitted and generated_at falls back to the build date, mirroring config/opal_get_version.sh. Distribution tarballs ship the already-rendered llms/ tree inside html/, so installing or packaging them never requires Sphinx; Sphinx is needed only to regenerate them in a developer clone.

13.13.9. Schema evolution

The catalog and manifest each conform to a published JSON Schema that doubles as the CI validator.

  • Changes are additive within a schema_version: new fields must be optional, and a field may be marked deprecated (still emitted) for one Open MPI release series before removal. Never remove, rename, repurpose, or change the meaning of an existing field without bumping the schema version. The published schemas use open objects so an additive field does not break a consumer validating against an older cached schema.

  • schema_version / artifact_schema_version are bumped only on a breaking change, a decision owned by the documentation maintainers. The schema_version is independent of the Open MPI release number, so the schema can stay fixed across many releases.

  • A schema-version bump or notable field change gets a changelog entry under docs/release-notes/changelog/. Routine regeneration (content that flows automatically from changed RST or metadata) does not.

13.13.10. Updating the documentation when APIs change

The generated artifacts cannot drift, because they regenerate from the RST man pages and the MPI Forum JSON binding metadata. The practical consequence is that you update the LLM docs by updating the ordinary documentation, not by editing the generator.

Adding or changing an MPI API function. When a new MPI Standard function is added (for example, when a new MPI version lands), the LLM artifacts pick it up automatically once its man page exists:

  1. Add or edit docs/man-openmpi/man3/MPI_<Name>.3.rst — the same hand-written man page that produces the human HTML/man output. This man page is what makes the function “documented”; both the man3 binding generator and the LLM generator enumerate the man3 .rst files (via os.listdir), so a function that is in the metadata but has no .rst page is silently not documented anywhere.

  2. Add the page to the explicit OMPI_MAN3 list in docs/Makefile.am. The generators auto-discover the file and the RST_SOURCE_FILES wildcard already makes it a rebuild dependency, but the installation list is explicit (Automake installs man pages by name), so a new page must be listed there to be installed.

  3. For standard APIs there is nothing to author for the bindings: the C, mpif.h/use mpi, and use mpi_f08 signatures (and any large-count “embiggened” variant) are rendered from pympistandard + mpi-standard-apis.json.

  4. If one man page documents several procedures, mark them with a .. mpi-bindings: MPI_Foo, MPI_Bar comment line so each co-documented procedure gets its own bindings, catalog record, and documented_with linkage.

No change to generate-llm-docs.py, the schemas, or the validator is needed for a routine new function: a new .rst automatically yields a new man page, a per-symbol Markdown page, and a JSONL catalog record.

Open MPI extensions (MPIX_*, OMPI_*) are not in pympistandard, so their signatures are taken verbatim from the RST SYNTAX block and their structured parameter fields are best-effort (unknown where they cannot be extracted reliably). The catalog’s kind field lets consumers tell standard records from extension records.

Upgrading the MPI Standard metadata (for example, replacing the 4.1 apis.json with a 5.0 one): update docs/mpi-standard-apis.json (it is a symlink to the versioned mpi-standard-<ver>-apis.json). Note that load_pympistandard calls use_api_version(1, ...) — the 1 is pympistandard’s data-format version, not the MPI version. A new MPI metadata file is only loadable if the vendored 3rd-party/pympistandard can parse that format; if the MPI Forum bumped the JSON format, update the vendored library in lockstep and smoke-test that the new JSON loads.

Coverage gaps are currently silent. There is no check that flags a procedure present in the metadata but missing a man page (or vice versa); such a procedure is simply absent from the artifacts. (The spec envisions a lightweight CI “drift hint” for this; it is not yet implemented.) When adding a batch of new functions, cross-check that every intended function actually has an MPI_*.3.rst page.

Curated docs and samples. The generated artifacts cannot drift, but the curated docs/llms-src/ files (interface guide, examples, and runtime-introspection guide) can. A pull request that changes public MPI documentation should also update the affected curated files when relevant — for example, the runtime-introspection guide if the ompi_info interface or the MCA parameter-setting conventions change (this expectation is also recorded in the top-level AGENTS.md). When the curated examples or a schema change, regenerate specs/llms-friendly-docs/sample-records.jsonl so make check continues to pass.

13.13.11. Verifying publication

After a merge that affects the artifacts, confirm Read the Docs published them for the version you merged to: fetch that version’s llms.txt (for example https://docs.open-mpi.org/en/main/llms.txt for the main branch) and one versioned artifact (for example the manifest, .../en/<version>/llms/openmpi-docs-manifest.json) and confirm they resolve and parse. Because the version-neutral https://docs.open-mpi.org/llms.txt follows Read the Docs’ default version — which may not be the version you just merged — always spot-check the specific version slug you published.

13.13.12. Alternatives considered and deferred

For the historical record (details in specs/llms-friendly-docs/spec.md):

  • OpenSHMEM public APIs can later reuse this same artifact model, but support is deferred until there is concrete demand; this effort covers MPI only.

  • An llms-full.txt single-payload entry point, compressed downloadable bundles, and stable redirects for renamed artifacts were each considered and dropped — the aggregate Markdown corpora, the release tarballs plus the manifest, and ad hoc Read the Docs redirects respectively make them unnecessary. The JSON Schema files, by contrast, were promoted into scope as the shared contract/validator.