From gitignore to docinclude

Creative documentation techniques

We have a lot of documentation at Vertalo: policies, procedures, regulatory information, and loads of technical docs. The latter are scattered around all of our code repositories and are periodically gathered into a centralized internal documentation site alongside non-technical documentation.

During a recent project to improve how we develop docs locally within a repo and how we gather repo-specific technical docs into the internal docs site, we encountered an interesting opportunity. Can we write code into each repo that specifies how that repo’s docs are to be gathered for company-wide distribution in the internal site?

In our planning and discussions, we realized that we were looking for the opposite of the .gitignore methodology: a file that lists what to include rather than what to ignore. We expected there to be an open source implementation of the logic behind .gitignore that we could leverage. As one of our colleagues put it, “Surely someone has already solved this problem.”

Enter pathspec

Since our docs are typically implemented in MkDocs, which is a Python package, we did some quick research and found a possible solution. The pathspec Python package implements, among other things, Git’s “wildmatch” pattern matching, the same algorithm that Git uses to ignore files when adding and committing.

Consider the following example tree of a simple Python app:

├── .docinclude
├── README.md
├── app
│   ├── __init__.py
│   └── app.py
├── config.py
├── docs
│   ├── about.md
│   ├── css
│   │   └── custom.css
│   ├── img
│   │   └── logo.png
│   ├── index.md
│   └── private.md
├── extraneous
│   └── something.md
├── notes.md
└── requirements.txt

We want to use pathspec to include files for some operation. In our particular case, we want a list of particular files to be copied into the centralized documentation site for publication. But it could be any operation on a selected list of file paths.

The .docinclude file

To emulate .gitignore, we first create a file called .docinclude for pathspec to read. Consider this example:

# Include all docs directories and their contents, plus other markdown files
# (see https://pypi.org/project/pathspec/)
**/docs/*
**/*.md

# Exclude anything in the extraneous/ directory
!extraneous/

# Exclude private.md files
!private.md

Now consider a simple Python script using pathspec and the .docinclude file:

from pathlib import Path

import pathspec

DOCS_DIR = 'mkdocs-docs'
SPEC_FILE = '.docinclude'

def schema():
    # Create PathSpec object from .docinclude
    # (see https://pypi.org/project/pathspec/)
    with open(SPEC_FILE, "r") as fp:
        spec = pathspec.PathSpec.from_lines("gitwildmatch", fp)

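    # Use the PathSpec object to match our desired documentation assets.
    # match_tree() walks the given root and yields matching relative paths;
    # sorting keeps the printed order stable from run to run.
    for match in sorted(spec.match_tree('.')):
        print(match)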

if __name__ == "__main__":
    schema()

And the output:

README.md
docs/about.md
docs/css/custom.css
docs/img/logo.png
docs/index.md
notes.md

We have now successfully used pathspec and a little Python to:

  • Select all contents of the docs/ directory along with other Markdown files, while
  • excluding the contents of extraneous/ and any private.md file.

While there are many ways to tackle a given problem, we like this approach because it separates specification of which files to select from the code that does the selecting. The .docinclude file becomes a single source of truth for the documents within a repo to be included in our centralization operation.
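
Printing the matches is only the first half of the job. As a rough sketch of the copy step, assuming the matched paths are relative to the repo root, something like the following could place them in a destination directory while preserving the folder layout. The function name `copy_matches` and its parameters are hypothetical and not part of pathspec; only the standard library is used.

```python
import shutil
from pathlib import Path


def copy_matches(matches, source_root, dest_root):
    """Copy each relative path in `matches` from source_root into dest_root.

    Directory structure is preserved. Returns the list of destination paths.
    (Hypothetical helper for illustration.)
    """
    source_root = Path(source_root)
    dest_root = Path(dest_root)
    copied = []
    for relative in sorted(matches):
        target = dest_root / relative
        # Create intermediate directories such as docs/css/ on demand.
        target.parent.mkdir(parents=True, exist_ok=True)
        # copy2 also carries over file metadata such as modification times.
        shutil.copy2(source_root / relative, target)
        copied.append(target)
    return copied
```

With the script above, a call might look like `copy_matches(spec.match_tree('.'), '.', DOCS_DIR)`. If the destination lives inside the tree being scanned, later runs would match the copies as well, so it is probably safer to keep it outside the repo or to exclude it in .docinclude.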


See also