diff --git a/docs/execution-status.md b/docs/execution-status.md
index 2863984..f93d2f1 100644
--- a/docs/execution-status.md
+++ b/docs/execution-status.md
@@ -14,7 +14,7 @@ The committed implementation now includes:
 - A pinned Phase 1 PDFium profile in `docs/pdfium-profile.md` and `profiles/ethos-deterministic-v1.json`: `chromium/7881`, V8/XFA disabled, platform artifact hashes, runtime library hashes, and provenance are recorded.
 - Runtime checks that reject missing or mismatched PDFium versions, release artifacts, and extracted libraries with stable errors before dynamic loading.
 - `ethos doc parse` / `ethos fingerprint` PDF execution through a worker process with `max_parse_ms` timeout enforcement, stable error-envelope relay, diagnostics-gated worker stderr, and page-range validation/filtering.
-- Quantized page/span extraction at the backend boundary, plus a basic deterministic layout pass that assembles paragraph `text_block` elements and simple column reading order for the current born-digital fixtures. Fixture validation binds selected `fixture.json` expectations to committed extraction/layout goldens so current read-order cases fail closed on drift.
+- Quantized page/span extraction at the backend boundary, plus a basic deterministic layout pass that assembles paragraph `text_block` elements and simple column reading order for the current born-digital fixtures. Fixture validation binds selected `fixture.json` expectations to committed extraction/layout goldens so current read-order and element-type cases fail closed on drift.
 - Schema/example/profile validation is green through `schemas/validate_examples.py` using `jsonschema` draft 2020-12 validation, including the crop descriptor artifact contract plus referential-integrity and bbox sanity checks outside JSON Schema.
 - `ethos verify` now produces non-empty quote, value, presence, and table-cell verification checks over native Ethos document JSON and synthetic OpenDataLoader-style JSON through `--grounding opendataloader-json`; it also verifies quote/value/presence citations over pinned real OpenDataLoader 2.4.7 JSON, including grounded and ungrounded cases. Citation/config inputs are rejected when they drift outside the closed schemas. The public demo harness covers grounded, ungrounded, split-quote, not-found, stale-fingerprint, unsupported non-v1 claim, capability-limited, malformed-citation, malformed OpenDataLoader-style input, and summary-format reject paths.
 - Verification semantics are now trust-honest at alpha scope: quote containment is explicitly labeled, value/table-cell checks require normalized equality, fingerprint-pinned citations fail closed when source fingerprints are unavailable, and structured capability limits explain why a run is downgraded.
@@ -49,7 +49,7 @@ Milestone A has an accepted internal Gate Zero decision for roadmap control, so
 | PDFium Phase 1 profile | Landed: pinned profile, V8/XFA-disabled state, platform hashes, runtime library hashes, and provenance are recorded | Phase 2 project-maintained builds still block Public Beta |
 | PDFium loader/runtime checks | Landed: missing/mismatched version, artifact, and runtime library hashes fail deterministically | Release packaging and operator setup path still need hardening |
 | Real PDF backend | Landed for simple born-digital PDFs: page count, quantized spans, worker execution, timeout, page filtering, and fingerprint path exist | Wider corpus coverage, failure fixtures, memory-limit behavior, quirk log, and Gate Zero run are still missing |
-| Layout groundwork | Landed: basic paragraph text blocks, simple column reading order over quantized spans, and fixture metadata checks against committed extraction/layout goldens | Tables, headings, lists, rotation/quirk handling, and confidence policy remain future work |
+| Layout groundwork | Landed: basic paragraph text blocks, simple column reading order over quantized spans, and fixture metadata checks against committed extraction/layout goldens for current read-order and element-type expectations | Tables, headings, lists, rotation/quirk handling, and confidence policy remain future work |
 | Font policy groundwork | Partially landed: substitution table and profile policy are present; fixture output uses deterministic substitution IDs | Bundled fallback asset hashing and broader font/CID validation remain open |
 | Schema/example validation | Landed: schemas, examples, deterministic profile, referential integrity, and bbox sanity pass the `jsonschema` validation gate | Contract changes still require explicit versioning and compatibility review |
 | Trust-layer implementation | Landed: `ethos verify` quote/value/presence/table-cell checks, explicit quote-containment labeling, normalized equality for value/table-cell checks, stale and unverifiable fingerprint handling, unsupported claim reporting, structured capability limits, native Ethos JSON path, ODL-style adapter path with synthetic table/cell mapping, pinned real OpenDataLoader 2.4.7 grounded/ungrounded fixtures, foreign fixture manifest hash validation, crop-ref evidence plumbing, stable logical native crop refs, native crop descriptor artifacts, raw BGRA crop rendering in `ethos-pdf`, CLI PNG crop artifact production for bound native source PDFs, same-host rendered crop repeatability check, rendered-crop run comparison helper, strict citation/config input validation, citation input schema, split-quote fixture coverage, explicit unsupported non-v1 claim reporting, OpenDataLoader-style structure diagnostics for malformed bbox and unknown-page references, verify-alpha case inventory checks, and demo fixtures | Still needed: real OpenDataLoader table-cell grounding, additional adapter hardening against broader real output shapes, future claim-kind expansion outside the current v1 alpha policy, and a decision on whether cross-platform rendered crop artifact equality is worth pursuing after the current macOS/Linux bbox drift finding |
diff --git a/fixtures/README.md b/fixtures/README.md
index b2b5ce2..a987d51 100644
--- a/fixtures/README.md
+++ b/fixtures/README.md
@@ -43,6 +43,7 @@ expectations to those committed goldens:
 - `expected_pages`: exact `extraction.json` page count.
 - `expected_span_text`: exact `extraction.json` span text order.
 - `expected_elements`: exact `layout.json` element count.
+- `expected_element_types`: exact `layout.json` element type order.
 - `expected_text`: exact `layout.json` element text order. Use a string for a single layout element and a string array when reading order spans multiple elements.
diff --git a/fixtures/synthetic/hyphenated-line-break/fixture.json b/fixtures/synthetic/hyphenated-line-break/fixture.json
index 608149c..630197d 100644
--- a/fixtures/synthetic/hyphenated-line-break/fixture.json
+++ b/fixtures/synthetic/hyphenated-line-break/fixture.json
@@ -10,5 +10,6 @@
     "two-line paragraph grouping",
     "Type1 base font text spans"
   ],
-  "expected_text": "hyphen ated"
+  "expected_text": "hyphen ated",
+  "expected_element_types": ["text_block"]
 }
diff --git a/fixtures/synthetic/ligature-fi-embedded-font/fixture.json b/fixtures/synthetic/ligature-fi-embedded-font/fixture.json
index afd4bd8..c219bf3 100644
--- a/fixtures/synthetic/ligature-fi-embedded-font/fixture.json
+++ b/fixtures/synthetic/ligature-fi-embedded-font/fixture.json
@@ -16,5 +16,6 @@
   "expected_spans": [
     { "text": "office", "char_start": 0, "char_end": 6 },
     { "text": "file", "char_start": 7, "char_end": 11 }
-  ]
+  ],
+  "expected_element_types": ["text_block"]
 }
diff --git a/fixtures/synthetic/rotation-90/fixture.json b/fixtures/synthetic/rotation-90/fixture.json
index 4eac1b3..cfe535a 100644
--- a/fixtures/synthetic/rotation-90/fixture.json
+++ b/fixtures/synthetic/rotation-90/fixture.json
@@ -11,5 +11,6 @@
     "Type1 base font text spans"
   ],
   "expected_text": "Rotate Ninety",
-  "expected_rotation": 90
+  "expected_rotation": 90,
+  "expected_element_types": ["text_block"]
 }
diff --git a/fixtures/synthetic/simple-text/fixture.json b/fixtures/synthetic/simple-text/fixture.json
index ee5c7b3..b1886f6 100644
--- a/fixtures/synthetic/simple-text/fixture.json
+++ b/fixtures/synthetic/simple-text/fixture.json
@@ -11,5 +11,6 @@
     "PDFium char-box quantization",
     "canonical document assembly"
   ],
-  "expected_text": "Hello Ethos"
+  "expected_text": "Hello Ethos",
+  "expected_element_types": ["text_block"]
 }
diff --git a/fixtures/synthetic/two-columns/fixture.json b/fixtures/synthetic/two-columns/fixture.json
index bc01820..efbf876 100644
--- a/fixtures/synthetic/two-columns/fixture.json
+++ b/fixtures/synthetic/two-columns/fixture.json
@@ -24,6 +24,7 @@
   ],
   "expected_pages": 1,
   "expected_elements": 2,
+  "expected_element_types": ["text_block", "text_block"],
   "exercises": [
     "two-column born-digital text extraction",
     "geometry-based column ordering",
diff --git a/fixtures/synthetic/two-lines/fixture.json b/fixtures/synthetic/two-lines/fixture.json
index 18490d2..6e0cf74 100644
--- a/fixtures/synthetic/two-lines/fixture.json
+++ b/fixtures/synthetic/two-lines/fixture.json
@@ -10,5 +10,6 @@
   "pages": 1,
   "expected_text": "First line Second line",
   "expected_pages": 1,
-  "expected_elements": 1
+  "expected_elements": 1,
+  "expected_element_types": ["text_block"]
 }
diff --git a/fixtures/validate_fixtures.py b/fixtures/validate_fixtures.py
index 7ee52b0..1e1f44f 100644
--- a/fixtures/validate_fixtures.py
+++ b/fixtures/validate_fixtures.py
@@ -278,6 +278,21 @@ def validate_expected_text(metadata, layout, ctx: str) -> None:
         fail(f"{ctx} expected_text must be a string or string array")
 
 
+def validate_expected_element_types(metadata, layout, ctx: str) -> None:
+    if "expected_element_types" not in metadata:
+        return
+    expected = metadata["expected_element_types"]
+    if not isinstance(expected, list) or not all(isinstance(item, str) for item in expected):
+        fail(f"{ctx} expected_element_types must be a string array")
+        return
+    elements = layout.get("elements") if isinstance(layout, dict) else None
+    if not isinstance(elements, list):
+        return
+    actual = [element.get("type") for element in elements]
+    if actual != expected:
+        fail(f"{ctx} expected_element_types must match layout element type order")
+
+
 def validate_expected_span_text(metadata, extraction, ctx: str) -> None:
     if "expected_span_text" not in metadata:
         return
@@ -309,6 +324,7 @@ def validate_stage_expectations(metadata_path: Path, metadata, extraction, layou
         f"{ctx} expected_elements",
     )
     validate_expected_text(metadata, layout, ctx)
+    validate_expected_element_types(metadata, layout, ctx)
 
 
 manifest = load_json(MANIFEST)
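
Reviewer note: the new `expected_element_types` check can be tried in isolation with the standalone sketch below. It is not the code in `fixtures/validate_fixtures.py`. The function name `check_element_types` is hypothetical, it returns error strings instead of calling the repository's `fail()` helper, and the guard for non-dict elements is an added assumption. The layout shape (an `elements` list whose items carry a `type` key) is taken from the diff above.

```python
def check_element_types(metadata: dict, layout: dict) -> list[str]:
    """Return error strings; an empty list means the check passed or was skipped."""
    # The expectation is optional: fixtures without the key are not checked.
    if "expected_element_types" not in metadata:
        return []
    expected = metadata["expected_element_types"]
    if not isinstance(expected, list) or not all(isinstance(t, str) for t in expected):
        return ["expected_element_types must be a string array"]
    elements = layout.get("elements") if isinstance(layout, dict) else None
    if not isinstance(elements, list):
        # A malformed layout is assumed to be reported by other checks.
        return []
    # Assumption: non-dict elements map to None rather than raising.
    actual = [e.get("type") if isinstance(e, dict) else None for e in elements]
    if actual != expected:
        return ["expected_element_types must match layout element type order"]
    return []


# Values mirror the two-columns fixture in this diff.
metadata = {"expected_elements": 2, "expected_element_types": ["text_block", "text_block"]}
golden = {"elements": [{"type": "text_block"}, {"type": "text_block"}]}
drifted = {"elements": [{"type": "text_block"}, {"type": "table"}]}

ok_errors = check_element_types(metadata, golden)
drift_errors = check_element_types(metadata, drifted)
```

Because the comparison is an exact list equality, both a changed element type and a changed element count surface as the same order-mismatch error.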