CI (docs): Reusable Mintlify docs checks for integrations repos#106258
Conversation
|
Workflow [PR], commit [1a5f6aa] Summary: ✅ AI ReviewSummaryThis PR factors the Mintlify docs checks into a reusable standalone driver, adds a guard for one-way copied docs, and updates changed-file collection so rename sources can be inspected. The earlier safety issues around stale overlay files, missing changed-file data, rename-outs, and escaped Missing context / blind spots
Tests
Final VerdictStatus: Minimum required action: add focused regression tests for the read-only copied-docs guard and |
Decouple the Mintlify docs check logic from Praktika so another repository (e.g. a language client whose docs are a slice of the larger site) can run the same checks without vendoring Praktika. - Add `ci/jobs/scripts/docs/mintlify_docs_check.py`, a self-contained driver that pulls the docs image, shallow/sparse-clones the aggregator docs repo (including `ci/jobs/scripts/docs` so shared check scripts come along), overlays one or more local folders, and runs the checks inside the image. The host only needs `docker`, `git` and `python3`; `mint` is provided by the image. The check definitions live in `DEFAULT_CHECKS` as the single source of truth. - Rewrite `ci/jobs/docs_job_mintlify.py` as a thin Praktika adapter that imports `DEFAULT_CHECKS` and runs them natively (it already runs inside the image), so new checks are declared in one place. - Add `ci/jobs/scripts/docs/check_readonly_copies.py` and `readonly_copies.json`, a one-way-sync guard that fails edits to docs folders whose canonical source lives in another repo and points contributors at that source. Wired into the Praktika job only (aggregator side); deliberately not part of `DEFAULT_CHECKS`. - Track `ci/jobs/scripts/docs` in the docs job digest so changes to the shared check scripts re-run the job. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The standalone check should validate the exact docs slice the consuming repo would sync. The old `--overlay` defaulted to a merge, so a page deleted or renamed in the source repo left the stale file from the cloned aggregator under the destination, letting `mint validate` and `mint broken-links` pass even though the next sync would remove that page. Since the docs sync is one-way (the consuming repo is the source of truth), there is never a need to merge. Rename the function and the flag to `replace` and always wipe the destination before copying, so deletions and renames are visible to the checks. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`clone_aggregator` ran `git clone --filter=blob:none` with no `--depth`, so it downloaded the aggregator's entire commit history (just without blobs) before the shallow `fetch` -- enormous for a repo like ClickHouse, run on every docs change. Build the repo with `git init` + a single `git fetch --depth 1 --filter=blob:none` over the sparse paths instead, so only the tip commit and only the docs slice are fetched. Measured against ClickHouse master: 3.6 s, 31 MB, 1 commit -- no src/ or contrib/ subtrees. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The driver orchestrated docker itself (`docker pull` + `docker run` per check), which required docker on the host and could not run from inside a container. Run the checks natively instead: the driver is now meant to run inside the docs image (`clickhouse/docs-builder`), where `mint`, `git` and `python3` are all present. It clones the aggregator, replaces the target folder, and runs each check from the docs root via `sh -lc`. This lets a consuming repo run the image as its CI job container, fetch this one file, and run it -- vendoring nothing. The shallow/sparse clone and one-way replace are unchanged, and the shared `DEFAULT_CHECKS` list is still the single source of truth with the Praktika job. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
9c08cb6 to
6a591d8
Compare
`Info().get_changed_files` returns `None` when Praktika could not fetch or store the PR file list; treating that as an empty list made the guard report success without checking anything. Distinguish `None` from an empty list and fail the job instead. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
`GH.get_changed_files` used `gh pr view --json files --jq '.files[].path'`,
which reports only the destination path of a renamed file. Renaming a file out
of a protected folder therefore bypassed the read-only docs guard: no path in
the list started with the protected prefix, although the rename deletes the
file from the copied tree. The same blind spot made job filtering miss renames
out of a watched directory.
Switch to the REST files API (`pulls/{n}/files` for PRs, `commits/{sha}`
for branches) and emit `previous_filename` alongside `filename`, so both
sides of a rename appear in the list. Rename sources do not exist in the
checkout at `HEAD`, exactly like deleted files, which were always included;
all consumers of the list either match strings or check file existence first.
The REST endpoint with `--paginate` also lifts the 100-file cap of
`gh pr view --json files`.
Also document in `check_readonly_copies` that the input must cover both
sides of renames, and that standalone callers should pipe
`git diff --name-only --no-renames`.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
`replace` wipes its destination with `rmtree` before copying, and the destination was joined to the docs root without validation, so an absolute path, a `..` component, or a symlink pointing out of the clone would delete an unrelated directory on the host. Resolve the destination with `realpath` against the docs root and fail unless it lands strictly inside it. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
Hey @maxknv, could I please ask for a review when you get a chance? 🙏 Open to ideas here if there is a better way to do this |
| return path.strip().removeprefix("./") | ||
|
|
||
|
|
||
| def find_violations(changed_files, sources): |
There was a problem hiding this comment.
This is now the enforcement boundary for one-way copied docs, but the PR does not add focused coverage for the contract it depends on. The current series has fixed several edge cases by review (None changed-file data passing as success, rename sources escaping the guard, and unsafe --replace destinations reaching rmtree), so these invariants should be locked down before this becomes reusable CI.
Please add small Python tests covering at least: a protected-path edit failing with the canonical-source message, an unrelated path passing, a rename-out source path being treated as a violation, _readonly_copies_guard failing when changed_files is unavailable, and resolve_replace_dest rejecting absolute and .. destinations before replace can delete anything.
Structures the Mintlify docs check such that it can be easily run standalone without Praktika in other repos. The motivation for this is to enable us to easily run the same Mintlify docs checks that will run in this repo in any of the language client repositories.
Context
Our integrations team would like to keep docs close to the source code so they can make docs updates alongside code changes. Currently the docs are kept in clickhouse-docs repo and it's necessary to context switch to make docs updates.
With the Mintlify migration, beginning with clickhouse-connect: ClickHouse/clickhouse-connect#768, we will move language client docs back into their own repositories.
docs_builderimageWe want the sync to be unidirectional, so there is a check added to this repository to fail the docs check with an error message pointing to the canonical source if someone tries to make the updates to the copies in ClickHouse/ClickHouse docs folder.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
...