From 4c9dc4792410294be78db39d5865735ce0c6936b Mon Sep 17 00:00:00 2001
From: docushell-admin
Date: Wed, 17 Jun 2026 08:37:02 +0530
Subject: [PATCH 1/2] Bind font isolation fixtures

Signed-off-by: docushell-admin
---
 .../fixtures/font-isolation/manifest.json     |  33 +++++
 docs/execution-status.md                      |   4 +-
 fixtures/validate_fixtures.py                 | 129 +++++++++++++++++-
 3 files changed, 162 insertions(+), 4 deletions(-)
 create mode 100644 crates/ethos-cli/tests/fixtures/font-isolation/manifest.json

diff --git a/crates/ethos-cli/tests/fixtures/font-isolation/manifest.json b/crates/ethos-cli/tests/fixtures/font-isolation/manifest.json
new file mode 100644
index 0000000..eeac8b3
--- /dev/null
+++ b/crates/ethos-cli/tests/fixtures/font-isolation/manifest.json
@@ -0,0 +1,33 @@
+{
+  "manifest_version": 1,
+  "root": "crates/ethos-cli/tests/fixtures/font-isolation",
+  "fixtures": [
+    {
+      "id": "cid-cjk-like",
+      "file": "cid-cjk-like.pdf",
+      "sha256": "9eb58cc130b6d4d03d3cb7c0d6c71fbd5bdbc6839c51e2fc14f6dea9d9514d0e",
+      "subsets": ["fonts", "cid"],
+      "expected_behavior": "deterministic_success_or_stable_error",
+      "provenance": "Synthetic PDF generated by Ethos maintainers for internal CID/CJK-like font isolation checks.",
+      "license": "Apache-2.0"
+    },
+    {
+      "id": "missing-font",
+      "file": "missing-font.pdf",
+      "sha256": "bdc7633fb937493153fd063d561bf066a2a226aaf43913f98a514cc92610511c",
+      "subsets": ["fonts", "missing_font"],
+      "expected_behavior": "deterministic_substitution_id",
+      "provenance": "Synthetic PDF generated by Ethos maintainers for internal missing-font substitution isolation checks.",
+      "license": "Apache-2.0"
+    },
+    {
+      "id": "standard14-fonts",
+      "file": "standard14-fonts.pdf",
+      "sha256": "af767edb5d64ded23fa4c23c35f93cf3765b5952905c46cde3e7179d9660d421",
+      "subsets": ["fonts", "standard14"],
+      "expected_behavior": "deterministic_substitution_id",
+      "provenance": "Synthetic PDF generated by Ethos maintainers for internal Standard 14 font substitution isolation checks.",
+      "license": "Apache-2.0"
+    }
+  ]
+}
diff --git a/docs/execution-status.md b/docs/execution-status.md
index 77829b5..2552b75 100644
--- a/docs/execution-status.md
+++ b/docs/execution-status.md
@@ -17,7 +17,7 @@ The committed implementation now includes:
 - `ethos doc parse` / `ethos fingerprint` PDF execution through a worker process with `max_parse_ms` timeout enforcement, stable error-envelope relay, diagnostics-gated worker stderr, and page-range validation/filtering.
 - Quantized page/span extraction at the backend boundary, plus a basic deterministic layout pass that assembles paragraph `text_block` elements, fixture-backed alpha heading and flat list-item elements, and simple column reading order for the current born-digital fixtures. Current alpha layout confidence is explicit for heading signals, and below-threshold layout confidence emits deterministic `low_confidence_reading_order` diagnostics instead of staying silent. Fixture validation binds selected `fixture.json` expectations to committed extraction/layout goldens and binds current alpha text/Markdown exports to committed layout output so current read-order, element-type, heading-export, list-item, and export cases fail closed on drift.
 - An internal layout evaluator scaffold exists at `fixtures/evaluate_layout_alpha.py` and `make layout-evaluator-alpha`. It reads committed `fixture.json`, `extraction.json`, `layout.json`, `text.txt`, and `markdown.md` files, summarizes alpha element-type and subset coverage, and fails closed on missing layout expectations, dangling/invalid warning references, confidence-policy drift, export-golden drift, invalid span expectation metadata, expected page/span-text/font-id drift, expected rotation drift, or drift in fixture-backed reading order / heading / list-item / hyphenation / ligature cases. PR CI runs the evaluator and has a static workflow guard for that wiring.
-- Schema/example/profile validation is green through `schemas/validate_examples.py` using `jsonschema` draft 2020-12 validation, including the crop descriptor artifact contract plus referential-integrity and bbox sanity checks outside JSON Schema.
+- Schema/example/profile validation is green through `schemas/validate_examples.py` using `jsonschema` draft 2020-12 validation, including the crop descriptor artifact contract plus referential-integrity and bbox sanity checks outside JSON Schema. Fixture validation also binds internal font-isolation PDFs to committed manifest hashes.
 - `ethos verify` now produces non-empty quote, value, presence, and table-cell verification checks over native Ethos document JSON and synthetic OpenDataLoader-style JSON through `--grounding opendataloader-json`; it also verifies quote/value/presence citations over pinned real OpenDataLoader 2.4.7 JSON, including grounded and ungrounded cases, maps explicit real OpenDataLoader-style row/cell structures to table-cell grounding, and normalizes conservative real-style text/child-container aliases when page/bbox/text data remains explicit. Citation/config inputs are rejected when they drift outside the closed schemas. The public demo harness covers grounded, ungrounded, split-quote, not-found, stale-fingerprint, unsupported non-v1 claim, capability-limited, malformed-citation, malformed OpenDataLoader-style input, and summary-format reject paths.
 - Verification semantics are now trust-honest at alpha scope: quote containment is explicitly labeled, value/table-cell checks require normalized equality, fingerprint-pinned citations fail closed when source fingerprints are unavailable, and structured capability limits explain why a run is downgraded.
 - `make verify-alpha` is the current alpha trust-loop command: it checks native examples, split-quote evidence matching, unsupported non-v1 claim reporting, synthetic OpenDataLoader-style examples, pinned real OpenDataLoader grounded/ungrounded examples, schema validation, verify-alpha case inventory coverage, usage diagnostics for malformed citations and malformed OpenDataLoader-style structures, byte-identical repeated verification reports, byte-identical native crop descriptors, summary diagnostics for an ungrounded native case, and foreign fixture manifest hash binding.
@@ -55,7 +55,7 @@ Milestone A has an accepted internal Gate Zero decision for roadmap control, so
 | Layout groundwork | Landed: basic paragraph text blocks, fixture-backed alpha heading and flat list-item elements, simple column reading order over quantized spans, explicit alpha heading-confidence values, deterministic below-threshold confidence diagnostics, fixture metadata checks against committed extraction/layout goldens for current read-order and element-type expectations, and alpha text/Markdown export goldens derived from committed layout output | Tables, nested/richer list and heading semantics, broader rotation/quirk handling, and broader confidence dimensions remain future work |
 | Layout evaluator scaffold | Landed: deterministic internal evaluator over committed extraction/layout fixture expectations, with heading/list/reading-order/rotation/hyphenation/ligature/font-identity/span-expectation coverage checks, expected page/span-text/font-id checks, expected-spans metadata validation, warning-reference checks, confidence-policy checks, text/Markdown export-golden checks, expectation drift diagnostics, report JSON, Make target, unit coverage, PR CI wiring, and static CI workflow guard coverage | Broader evaluator dimensions remain future work |
 | Python surface scaffold | Landed: internal stdlib wrapper over a caller-provided local `ethos doc parse` command, with explicit JSON/Markdown/text methods, page selection passthrough, diagnostics passthrough, timeout handling, command failure reporting, and mocked-command unit coverage | Native binding work, broader API design, and public setup path remain future work |
-| Font policy groundwork | Partially landed: substitution table and profile policy are present; substitution-table bytes are pinned by the deterministic profile and checked by schema/example validation; absent bundled fallback assets must remain represented by a null fallback-bundle hash; fixture output uses deterministic substitution IDs, committed embedded-font fixture metadata now binds expected extraction font identity, and document schema/font extraction keep emitted font ids inside the deterministic ASCII `embedded:` / `subst:` contract | Bundled fallback asset introduction/hash pinning and broader font/CID validation remain open |
+| Font policy groundwork | Partially landed: substitution table and profile policy are present; substitution-table bytes are pinned by the deterministic profile and checked by schema/example validation; absent bundled fallback assets must remain represented by a null fallback-bundle hash; fixture output uses deterministic substitution IDs, committed embedded-font fixture metadata now binds expected extraction font identity, document schema/font extraction keep emitted font ids inside the deterministic ASCII `embedded:` / `subst:` contract, and CLI font-isolation PDFs are manifest/hash-bound | Bundled fallback asset introduction/hash pinning and broader font/CID validation remain open |
 | Schema/example validation | Landed: schemas, examples, deterministic profile, referential integrity, and bbox sanity pass the `jsonschema` validation gate | Contract changes still require explicit versioning and compatibility review |
 | Trust-layer implementation | Landed: `ethos verify` quote/value/presence/table-cell checks, explicit quote-containment labeling, normalized equality for value/table-cell checks, stale and unverifiable fingerprint handling, unsupported claim reporting, structured capability limits, native Ethos JSON path, ODL-style adapter path with synthetic table/cell mapping, explicit real ODL-style row/cell table grounding, conservative real-style text/child-container alias normalization, pinned real OpenDataLoader 2.4.7 grounded/ungrounded fixtures, foreign fixture manifest hash validation, crop-ref evidence plumbing, stable logical native crop refs, native crop descriptor artifacts, raw BGRA crop rendering in `ethos-pdf`, CLI PNG crop artifact production for bound native source PDFs, same-host rendered crop repeatability check, rendered-crop run comparison helper, strict citation/config input validation, citation input schema, split-quote fixture coverage, explicit unsupported non-v1 claim reporting, OpenDataLoader-style structure diagnostics for malformed bbox and unknown-page references, verify-alpha case inventory checks, and demo fixtures | Still needed: additional adapter hardening against broader real output shapes, future claim-kind expansion outside the current v1 alpha policy, and a decision on whether cross-platform rendered crop artifact equality is worth pursuing after the current macOS/Linux bbox drift finding |
 | WS-HARNESS readiness | Partially landed: readiness path is green for frozen corpus/hardware and pinned competitors, Gate Zero evidence preflight validates the current `ethos-bench` handoff, and gates fail closed if those records regress | Public-safe comparison report flow, release/package approval, claim-wording approval, and future evidence-refresh workflow still need hardening |
diff --git a/fixtures/validate_fixtures.py b/fixtures/validate_fixtures.py
index 9f833c5..669acf5 100644
--- a/fixtures/validate_fixtures.py
+++ b/fixtures/validate_fixtures.py
@@ -24,7 +24,8 @@
 4. manifest corpus metadata matches fixture.json;
 5. manifest sha256 == fixture.json sha256 == sha256(document.pdf);
 6. successful parse fixtures carry stage goldens with the expected v1 shape;
-7. foreign parser fixture packages bind their manifest hashes to committed files.
+7. foreign parser fixture packages bind their manifest hashes to committed files;
+8. CLI font-isolation PDFs, including the CID/CJK-like fixture, are manifest-bound.
 
 Exit 0 = green. Any failure prints the offending file/context and exits 1.
 """
@@ -36,6 +37,7 @@
 from pathlib import Path
 
 ROOT = Path(__file__).resolve().parent
+REPO_ROOT = ROOT.parent
 MANIFEST = ROOT / "manifest.json"
 ALLOWED_CATEGORIES = {"failure", "public", "security", "synthetic"}
 MANIFEST_KEYS = {"manifest_version", "root", "subsets_declared", "fixtures"}
@@ -58,6 +60,23 @@
     "source_provenance",
     "license",
 }
+FONT_ISOLATION_ROOT_VALUE = "crates/ethos-cli/tests/fixtures/font-isolation"
+FONT_ISOLATION_MANIFEST = REPO_ROOT / FONT_ISOLATION_ROOT_VALUE / "manifest.json"
+FONT_ISOLATION_MANIFEST_KEYS = {"manifest_version", "root", "fixtures"}
+FONT_ISOLATION_ENTRY_KEYS = {
+    "id",
+    "file",
+    "sha256",
+    "subsets",
+    "expected_behavior",
+    "provenance",
+    "license",
+}
+FONT_ISOLATION_SUBSETS = {"fonts", "cid", "missing_font", "standard14"}
+FONT_ISOLATION_BEHAVIORS = {
+    "deterministic_substitution_id",
+    "deterministic_success_or_stable_error",
+}
 HEX256 = re.compile(r"^[0-9a-f]{64}$")
 SLUG = re.compile(r"^[a-z0-9][a-z0-9-]*$")
 SUBSET = re.compile(r"^[a-z0-9][a-z0-9_]*$")
@@ -77,11 +96,21 @@ def ok(msg: str) -> None:
     print(f"ok {msg}")
 
 
+def display_path(path: Path) -> str:
+    try:
+        return str(path.relative_to(REPO_ROOT))
+    except ValueError:
+        return str(path)
+
+
 def load_json(path: Path):
     try:
         return json.loads(path.read_text(encoding="utf-8"))
+    except OSError as exc:
+        fail(f"{display_path(path)} is not readable: {exc}")
+        return None
     except json.JSONDecodeError as exc:
-        fail(f"{path.relative_to(ROOT)} is not valid JSON: {exc}")
+        fail(f"{display_path(path)} is not valid JSON: {exc}")
         return None
 
 
@@ -317,6 +346,100 @@ def validate_foreign_fixture_packages() -> int:
     return count
 
 
+def validate_font_isolation_manifest() -> int:
+    ctx = display_path(FONT_ISOLATION_MANIFEST)
+    manifest = load_json(FONT_ISOLATION_MANIFEST)
+    if manifest is None:
+        return 0
+    if not isinstance(manifest, dict):
+        fail(f"{ctx} must be an object")
+        return 0
+    if set(manifest) != FONT_ISOLATION_MANIFEST_KEYS:
+        fail(f"{ctx} must contain exactly {sorted(FONT_ISOLATION_MANIFEST_KEYS)}")
+        return 0
+    if manifest.get("manifest_version") != 1:
+        fail(f"{ctx}.manifest_version must be 1")
+    if manifest.get("root") != FONT_ISOLATION_ROOT_VALUE:
+        fail(f"{ctx}.root must be {FONT_ISOLATION_ROOT_VALUE}")
+
+    entries = manifest.get("fixtures")
+    if not isinstance(entries, list) or not entries:
+        fail(f"{ctx}.fixtures must be a non-empty array")
+        return 0
+
+    package_dir = FONT_ISOLATION_MANIFEST.parent
+    pdf_files = sorted(path.name for path in package_dir.glob("*.pdf"))
+    indexed_files = []
+    seen_ids = set()
+    seen_files = set()
+
+    for index, entry in enumerate(entries):
+        entry_ctx = f"{ctx} fixtures[{index}]"
+        if not isinstance(entry, dict):
+            fail(f"{entry_ctx} must be an object")
+            continue
+        if set(entry) != FONT_ISOLATION_ENTRY_KEYS:
+            fail(f"{entry_ctx} must contain exactly {sorted(FONT_ISOLATION_ENTRY_KEYS)}")
+            continue
+
+        fixture_id = entry.get("id")
+        fixture_file = entry.get("file")
+        sha = entry.get("sha256")
+        subsets = entry.get("subsets")
+        expected_behavior = entry.get("expected_behavior")
+
+        if not isinstance(fixture_id, str) or not SLUG.fullmatch(fixture_id):
+            fail(f"{entry_ctx}.id must be a slug")
+        elif fixture_id in seen_ids:
+            fail(f"{entry_ctx}.id duplicates '{fixture_id}'")
+        else:
+            seen_ids.add(fixture_id)
+
+        if not isinstance(fixture_file, str) or not is_safe_relative_path(fixture_file):
+            fail(f"{entry_ctx}.file must be a safe relative path")
+            continue
+        if Path(fixture_file).parts != (fixture_file,) or not fixture_file.endswith(".pdf"):
+            fail(f"{entry_ctx}.file must be a PDF filename in {FONT_ISOLATION_ROOT_VALUE}")
+            continue
+        if fixture_file in seen_files:
+            fail(f"{entry_ctx}.file duplicates '{fixture_file}'")
+        seen_files.add(fixture_file)
+        indexed_files.append(fixture_file)
+
+        if not isinstance(sha, str) or not HEX256.fullmatch(sha):
+            fail(f"{entry_ctx}.sha256 must be lowercase hex sha256")
+        if not isinstance(subsets, list) or not subsets:
+            fail(f"{entry_ctx}.subsets must be a non-empty array")
+        else:
+            for subset in subsets:
+                if not isinstance(subset, str) or subset not in FONT_ISOLATION_SUBSETS:
+                    fail(f"{entry_ctx}.subsets contains invalid subset '{subset}'")
+        if expected_behavior not in FONT_ISOLATION_BEHAVIORS:
+            fail(f"{entry_ctx}.expected_behavior is not recognized")
+        non_placeholder_text(entry.get("provenance"), f"{entry_ctx}.provenance")
+        non_placeholder_text(entry.get("license"), f"{entry_ctx}.license")
+
+        pdf_path = package_dir / fixture_file
+        if not pdf_path.is_file():
+            fail(f"{entry_ctx}.file missing: {fixture_file}")
+            continue
+        actual_sha = sha256_file(pdf_path)
+        if actual_sha != sha:
+            fail(f"{entry_ctx}.sha256 {sha} does not match {fixture_file} {actual_sha}")
+
+    if indexed_files != sorted(indexed_files):
+        fail(f"{ctx}.fixtures must be sorted by file")
+
+    missing_manifest = sorted(set(pdf_files) - set(indexed_files))
+    extra_manifest = sorted(set(indexed_files) - set(pdf_files))
+    for path in missing_manifest:
+        fail(f"{FONT_ISOLATION_ROOT_VALUE}/{path} is missing from font-isolation manifest")
+    for path in extra_manifest:
+        fail(f"{FONT_ISOLATION_ROOT_VALUE}/{path} appears in manifest but has no PDF")
+
+    return len(entries)
+
+
 def validate_expected_count(value, expected, ctx: str) -> None:
     if expected is None:
         return
@@ -638,6 +761,7 @@ def validate_stage_expectations(metadata_path: Path, metadata, extraction, layou
     fail(f"{path} appears in manifest but has no document.pdf")
 
 foreign_package_count = validate_foreign_fixture_packages()
+font_isolation_fixture_count = validate_font_isolation_manifest()
 
 if not failures:
     ok(f"fixture manifest indexes {len(entries)} fixtures")
@@ -648,6 +772,7 @@ def validate_stage_expectations(metadata_path: Path, metadata, extraction, layou
     ok("successful fixture metadata expectations match committed stage goldens")
     ok("successful fixture text and Markdown exports match committed layout goldens")
     ok(f"foreign fixture manifests bind {foreign_package_count} package(s) to committed hashes")
+    ok(f"font-isolation manifest binds {font_isolation_fixture_count} PDF fixture(s)")
 
 if failures:
     print(f"\n{failures} failure(s)")

From 26c91553006d7c647c04ee4acf52c0ba2b7a6e3b Mon Sep 17 00:00:00 2001
From: docushell-admin
Date: Wed, 17 Jun 2026 08:38:39 +0530
Subject: [PATCH 2/2] Add Milestone B internal check target

Signed-off-by: docushell-admin
---
 Makefile                 | 11 ++++++++++-
 docs/execution-status.md |  2 +-
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/Makefile b/Makefile
index 51e558c..9f25686 100644
--- a/Makefile
+++ b/Makefile
@@ -13,7 +13,7 @@ COMPARE_RENDERED_CROPS_LEFT ?= $(VERIFY_RENDERED_CROPS_OUT)/run1
 COMPARE_RENDERED_CROPS_RIGHT ?= $(VERIFY_RENDERED_CROPS_OUT)/run2
 LAYOUT_EVALUATOR_OUT ?= $(ROOT)/target/layout-evaluator-alpha
 
-.PHONY: verify-alpha verify-alpha-tree verify-rendered-crops compare-rendered-crops layout-evaluator-alpha python-surface-test release-hygiene release-advisory third-party-license-manifest release-notice-draft
+.PHONY: verify-alpha verify-alpha-tree verify-rendered-crops compare-rendered-crops layout-evaluator-alpha python-surface-test milestone-b-internal-checks release-hygiene release-advisory third-party-license-manifest release-notice-draft
 
 $(ETHOS_BIN):
 	cargo build --locked -p ethos-cli
@@ -48,6 +48,15 @@ layout-evaluator-alpha:
 python-surface-test:
 	PYTHONPATH=$(ROOT)/python $(PYTHON) -m unittest discover -s python/tests
 
+milestone-b-internal-checks:
+	$(PYTHON) fixtures/validate_fixtures.py
+	$(MAKE) verify-alpha PYTHON=$(PYTHON)
+	$(MAKE) layout-evaluator-alpha PYTHON=$(PYTHON)
+	$(MAKE) python-surface-test PYTHON=$(PYTHON)
+	$(PYTHON) .github/scripts/claims_gate.py
+	$(PYTHON) .github/scripts/readiness_gate.py public
+	git diff --check
+
 release-hygiene:
 	cargo metadata --locked --offline --format-version 1 --no-deps >/dev/null
 	$(CARGO_DENY) --version
diff --git a/docs/execution-status.md b/docs/execution-status.md
index 2552b75..8d670e7 100644
--- a/docs/execution-status.md
+++ b/docs/execution-status.md
@@ -20,7 +20,7 @@ The committed implementation now includes:
 - Schema/example/profile validation is green through `schemas/validate_examples.py` using `jsonschema` draft 2020-12 validation, including the crop descriptor artifact contract plus referential-integrity and bbox sanity checks outside JSON Schema. Fixture validation also binds internal font-isolation PDFs to committed manifest hashes.
 - `ethos verify` now produces non-empty quote, value, presence, and table-cell verification checks over native Ethos document JSON and synthetic OpenDataLoader-style JSON through `--grounding opendataloader-json`; it also verifies quote/value/presence citations over pinned real OpenDataLoader 2.4.7 JSON, including grounded and ungrounded cases, maps explicit real OpenDataLoader-style row/cell structures to table-cell grounding, and normalizes conservative real-style text/child-container aliases when page/bbox/text data remains explicit. Citation/config inputs are rejected when they drift outside the closed schemas. The public demo harness covers grounded, ungrounded, split-quote, not-found, stale-fingerprint, unsupported non-v1 claim, capability-limited, malformed-citation, malformed OpenDataLoader-style input, and summary-format reject paths.
 - Verification semantics are now trust-honest at alpha scope: quote containment is explicitly labeled, value/table-cell checks require normalized equality, fingerprint-pinned citations fail closed when source fingerprints are unavailable, and structured capability limits explain why a run is downgraded.
-- `make verify-alpha` is the current alpha trust-loop command: it checks native examples, split-quote evidence matching, unsupported non-v1 claim reporting, synthetic OpenDataLoader-style examples, pinned real OpenDataLoader grounded/ungrounded examples, schema validation, verify-alpha case inventory coverage, usage diagnostics for malformed citations and malformed OpenDataLoader-style structures, byte-identical repeated verification reports, byte-identical native crop descriptors, summary diagnostics for an ungrounded native case, and foreign fixture manifest hash binding.
+- `make verify-alpha` is the current alpha trust-loop command: it checks native examples, split-quote evidence matching, unsupported non-v1 claim reporting, synthetic OpenDataLoader-style examples, pinned real OpenDataLoader grounded/ungrounded examples, schema validation, verify-alpha case inventory coverage, usage diagnostics for malformed citations and malformed OpenDataLoader-style structures, byte-identical repeated verification reports, byte-identical native crop descriptors, summary diagnostics for an ungrounded native case, and foreign fixture manifest hash binding. `make milestone-b-internal-checks` composes the current internal Milestone B validation path across fixture validation, verify alpha, layout evaluator, Python surface tests, and policy gates.
 - An internal Python surface scaffold exists under `python/ethos_pdf`. It shells out to a caller-provided local `ethos` CLI binary for `ethos doc parse` JSON, Markdown, and text output, and has stdlib unit tests that use a fake local command. This is pre-alpha scaffolding for Milestone B API shape work, not a public installation or publication path.
 - Native Ethos verification can emit deterministic, schema-backed crop descriptor JSON artifacts through `--crop-dir`; these bind `document_fingerprint`, page, bbox, and check ids. Native `crop_ref` filenames are logical evidence references derived from document fingerprint, check id, and page, while descriptors still record the exact observed bbox. When `--crop-source-pdf` is supplied, the CLI validates source-PDF fingerprint binding and emits PNG crop artifacts whose filenames, byte hashes, dimensions, and source fingerprint are bound from the descriptor. `make verify-rendered-crops` checks same-host repeated-run stability for the rendered artifact path, and `make compare-rendered-crops` classifies two rendered-crop runs by separating logical evidence identity from rendered artifact byte equality. Cross-platform rendered image determinism is not claimed; the 2026-06-14 macOS arm64 vs Linux x64 validation record in `docs/validation/rendered-crops-2026-06-14.md` preserved document fingerprint and `payload_sha256` but failed rendered artifact byte equality because the evidence bbox differed slightly across platforms.