We have a lot of documentation at Vertalo: policies, procedures, regulatory information, and loads of technical docs. The latter are scattered across all of our code repositories and are periodically gathered into a centralized internal documentation site alongside non-technical documentation.
During a recent project to improve how we develop docs locally within a repo and how we gather repo-specific technical docs into the internal docs site, we encountered an interesting opportunity. Can we write code into each repo that specifies how that repo’s docs are to be gathered for company-wide distribution in the internal site?
In our planning and discussions, we realized that we were looking for the opposite of the .gitignore methodology: a file that lists what to include rather than what to ignore. Surely, we thought, there is an open source implementation of the logic behind .gitignore that we can leverage. As one of our colleagues put it, “Surely someone has already solved this problem.”
Enter pathspec
Since our docs are typically implemented in mkdocs, which is a Python package, we did some quick research and found a possible solution. The pathspec Python package implements, among other things, Git’s “wildmatch” pattern matching, the same algorithm that Git uses to ignore files when adding and committing.
Consider the following example tree of a simple Python app:
├── .docinclude
├── README.md
├── app
│   ├── __init__.py
│   └── app.py
├── config.py
├── docs
│   ├── about.md
│   ├── css
│   │   └── custom.css
│   ├── img
│   │   └── logo.png
│   ├── index.md
│   └── private.md
├── extraneous
│   └── something.md
├── notes.md
├── requirements.txt
We want to use pathspec to select files for some operation. In our particular case, we want a list of files to be copied into the centralized documentation site for publication, but it could be any operation on a selected list of file paths.
The .docinclude file
To emulate .gitignore, we first create a file called .docinclude for pathspec to read. Consider this example:
# Include all docs directories and their contents, plus other markdown files
# (see https://pypi.org/project/pathspec/)
**/docs/*
**/*.md
# Exclude anything in the extraneous/ directory
!extraneous/
# Exclude private.md files
!private.md
Now consider a simple Python script using pathspec and the .docinclude file:
from pathlib import Path

import pathspec

DOCS_DIR = 'mkdocs-docs'
SPEC_FILE = '.docinclude'


def schema():
    # Create PathSpec object from .docinclude
    # (see https://pypi.org/project/pathspec/)
    with open(SPEC_FILE, "r") as fp:
        spec = pathspec.PathSpec.from_lines("gitwildmatch", fp)

    # Use the PathSpec object to match our desired documentation assets
    matches = spec.match_tree('.')

    # Print each matching path, sorted so the output order is stable
    for match in sorted(matches):
        print(match)


if __name__ == "__main__":
    schema()
And the output:
README.md
docs/about.md
docs/css/custom.css
docs/img/logo.png
docs/index.md
notes.md
We have now successfully used pathspec and a little Python to:
- select all contents of the docs/ directory along with other Markdown files, while
- excluding the contents of extraneous/ and any private.md file.
While there are many ways to tackle a given problem, we like this approach because it separates the specification of which files to select from the code that does the selecting. The .docinclude file becomes a single source of truth for the documents within a repo to be included in our centralization operation.
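The script above only prints the matched paths. The copy step itself could look something like the sketch below. This is an illustration under assumptions, not our production code: the function name copy_selected and its parameters are hypothetical, and it takes any iterable of relative paths so that it does not depend on how the selection was made.

```python
import shutil
from pathlib import Path


def copy_selected(paths, source_root, dest_root):
    """Copy each relative path from source_root into dest_root,
    recreating subdirectories so the layout is preserved."""
    source_root = Path(source_root)
    dest_root = Path(dest_root)
    copied = []
    for relative in sorted(paths):
        target = dest_root / relative
        # Make sure the destination subdirectory exists before copying.
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source_root / relative, target)
        copied.append(target)
    return copied
```

With the names from the earlier script, a call might be `copy_selected(spec.match_tree('.'), '.', DOCS_DIR)`. One caveat with that particular call: if the destination directory lives inside the tree being matched, a later run would also match the previously copied files, so a destination outside the repo (or an extra exclusion pattern) is probably the safer choice.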