[EPAC-2301]: Recover intermediate bill XML links for diff ingestion #825

Merged
riddim-developer-bot[bot] merged 1 commit into main from symphony/epac-2301-recover-intermediate-bill-xml-links-for-diff-ing on Jun 15, 2026

Scope

Recover intermediate-stage bill XML links in backend/bills-indexer so later diff generation has version text on both sides.

PR #813 proved the bill-diff route and 204 unavailable path work in staging, but the producer emitted bill_diffs=77 bill_clause_diffs=0: intermediate bill versions had no ingested text. The root cause was in the LEGISinfo adapter's document-link discovery.

Root cause

enrichVersions only scraped the first publication's DocumentViewer page for an .xml/.pdf anchor. Every later stage's URL was then derived by sort order (_1/_N/) and persisted without validation. When parl.ca's per-version document number does not track the publication order — or the page renders its links client-side — the derived URL was wrong, so as-amended-by-committee, as-passed-by-the-house-of-commons, and as-passed-by-the-senate ingested no text.

What changed

enrichVersions now reads each stage's own DocumentViewer page and resolves its XML through a validated candidate chain (most-trusted first):

  1. the stage page's direct .xml anchor;
  2. the XML sibling beside the stage page's PDF anchor — parl.ca keeps C-11_3/C-11_3.PDF next to C-11_3/C-11_E.xml, so the PDF's directory pins the correct version;
  3. a sort-order sibling derived from the first resolved stage (best-effort, only when a page exposes no links of its own).
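
The candidate ordering above can be sketched roughly as follows. This is a minimal illustration, not the adapter's actual code: the helper names echo the test names in this PR (TestXMLSiblingFromPDF, TestDedupeNonEmpty), but their signatures, the sortOrderSibling and candidateChain helpers, the `_E.xml` naming rule, and the example.test URLs are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

var (
	// Trailing "_<n>" of a PDF base name such as "C-11_3".
	versionSuffix = regexp.MustCompile(`_\d+$`)
	// Last "_<n>/" directory followed by the file name.
	versionDir = regexp.MustCompile(`_\d+/([^/]+)$`)
)

// xmlSiblingFromPDF (hypothetical signature) maps ".../C-11_3/C-11_3.PDF"
// to ".../C-11_3/C-11_E.xml", staying inside the PDF's own directory.
func xmlSiblingFromPDF(pdfURL string) string {
	slash := strings.LastIndex(pdfURL, "/")
	if slash < 0 || !strings.HasSuffix(strings.ToLower(pdfURL), ".pdf") {
		return ""
	}
	dir, file := pdfURL[:slash+1], pdfURL[slash+1:]
	base := file[:len(file)-len(".pdf")]
	bill := versionSuffix.ReplaceAllString(base, "")
	if bill == base {
		return "" // no version suffix to strip, so no safe guess
	}
	return dir + bill + "_E.xml"
}

// sortOrderSibling (hypothetical) rewrites the "_<n>/" directory of the
// first resolved stage's XML URL to the n-th publication in sort order.
func sortOrderSibling(firstXML string, n int) string {
	if n < 1 || !versionDir.MatchString(firstXML) {
		return ""
	}
	return versionDir.ReplaceAllString(firstXML, "_"+strconv.Itoa(n)+"/${1}")
}

// dedupeNonEmpty (hypothetical signature) drops blanks and repeats while
// keeping the most-trusted-first order.
func dedupeNonEmpty(in []string) []string {
	seen := make(map[string]bool, len(in))
	var out []string
	for _, s := range in {
		s = strings.TrimSpace(s)
		if s == "" || seen[s] {
			continue
		}
		seen[s] = true
		out = append(out, s)
	}
	return out
}

// candidateChain orders the three sources: direct anchor, PDF sibling,
// then the sort-order guess.
func candidateChain(directXML, pdfURL, firstXML string, n int) []string {
	return dedupeNonEmpty([]string{
		directXML,
		xmlSiblingFromPDF(pdfURL),
		sortOrderSibling(firstXML, n),
	})
}

func main() {
	chain := candidateChain("",
		"https://example.test/bills/C-11_3/C-11_3.PDF",
		"https://example.test/bills/C-11_1/C-11_E.xml", 2)
	fmt.Println(chain)
}
```

In this sketch the PDF sibling and the sort-order guess disagree (`_3` versus `_2`), and the PDF-derived URL is tried first because it comes from the stage's own page.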

A candidate is persisted as xml_url only when it fetches as a 2xx <Bill> XML payload (looksLikeBillXML). A guess that 404s, or that 302-redirects to a 200 HTML error page, is dropped rather than stored. Direct first-reading and royal-assent anchors keep resolving exactly as before. No domain types changed; all new logic stays in the LEGISinfo adapter.
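
A validation gate of this kind might look like the sketch below. The function name looksLikeBillXML comes from this PR, but the signature shown here (final status code plus body bytes) and the use of encoding/xml to find the root element are assumptions for illustration; the adapter's real check may differ.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
)

// looksLikeBillXML (hypothetical signature) accepts a response only when the
// final status is 2xx and the first element in the body is <Bill>.
// Any decode error, including an empty body, counts as "not bill XML".
func looksLikeBillXML(status int, body []byte) bool {
	if status < 200 || status > 299 {
		return false
	}
	dec := xml.NewDecoder(bytes.NewReader(body))
	for {
		tok, err := dec.Token()
		if err != nil {
			return false
		}
		// Skip the XML declaration, comments, and whitespace; decide on
		// the first start element, which is the document root.
		if start, ok := tok.(xml.StartElement); ok {
			return start.Name.Local == "Bill"
		}
	}
}

func main() {
	bill := []byte(`<?xml version="1.0" encoding="UTF-8"?><Bill bill-origin="commons" xml:lang="en"></Bill>`)
	soft404 := []byte(`<html><head><title>Error Page - Page d'erreur - 404</title></head></html>`)

	fmt.Println(looksLikeBillXML(200, bill))    // live bill XML
	fmt.Println(looksLikeBillXML(200, soft404)) // 200 HTML error page after a redirect
	fmt.Println(looksLikeBillXML(404, bill))    // hard 404
}
```

Checking the root element rather than the Content-Type header is what lets a check like this reject the "200 HTML error page" case, since the status code alone looks healthy there.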

Bugfix SPEC

  • Spec: EPAC-2301 (full observed/expected/acceptance/evidence/validation/non-goals in the issue body)
  • Trace ID: n/a (Linear-issue-driven, not an ad-hoc bugfix intake)

Acceptance criteria

| Criterion | Where |
| --- | --- |
| Intermediate stage with source XML → XMLURL populated with a live HTTPS URL | resolveVersionXML runs per stage; TestEnrichVersionsResolvesDirectAndPDFSiblingXMLLinks, …DerivesSortOrderSiblingWhenPageHasNoLinks |
| C-11 first-reading / as-amended-by-committee / as-passed-by-the-house-of-commons populated when source returns 2xx | alive-check below + the two tests above |
| Existing first-reading / royal-assent XML+PDF anchors stay stable | direct anchor wins first; asserted in …ResolvesDirectAndPDFSiblingXMLLinks and the original TestFetcherBuildsRelationalBillRecordsFromLegisInfoExports |
| Derived candidate without a 2xx bill-XML payload is not persisted | looksLikeBillXML gate; TestEnrichVersionsDropsUnvalidatedXMLCandidates (soft-200 HTML + 404) |
| Deterministic discovery over direct anchors / alternate links / missing source | TestEnrichVersionsDropsUnvalidatedXMLCandidates, TestLooksLikeBillXML, TestXMLSiblingFromPDF, TestDedupeNonEmpty, TestEnrichVersionsDoesNotReuseBaseXMLForLaterStages |
| Logs/stats stay machine-readable, no per-bill spam | no new per-stage logging added; existing aggregate fetch_* logs unchanged |

Testing notes

  • Automated tests run: go test -count=1 ./... (whole bills-indexer module) and go test -race ./internal/adapter/legisinfo/ — all green. gofmt -l clean, go vet ./... clean.
  • Manual verification: Backend ingestion change; no iOS UI surface, so Simulator verification is not applicable. The iOS diff viewer already consumes the endpoint unchanged.

Alive-check evidence (parl.ca, curl -A "Mozilla/5.0")

Confirms the recovered URL pattern is live, that the document root is <Bill> (what looksLikeBillXML accepts), and that a non-existent version redirects to a 200 HTML "404" page (what the validator rejects so it is never stored):

C-11_1/C-11_E.xml -> HTTP 200
C-11_2/C-11_E.xml -> HTTP 200
C-11_3/C-11_E.xml -> HTTP 200
C-11_4/C-11_E.xml -> HTTP 302   # version does not exist
C-11_5/C-11_E.xml -> HTTP 302   # version does not exist

# root element of C-11_2/C-11_E.xml:
<?xml version="1.0" encoding="UTF-8"?>...<Bill bill-origin="commons" bill-type="govt-public" xml:lang="en">...

# following the C-11_4 redirect:
final_status=200 content_type=text/html   ->  <html>...<title>Error Page - Page d'erreur - 404</title>

# PDF and XML share a per-version directory:
C-11_2/C-11_2.PDF -> HTTP 200 (application/pdf)

C-11_2 and C-11_3 are exactly the intermediate stages PR #813 reported as 0/N; both now return live <Bill> XML, and the soft-error redirect for missing versions (C-11_4, C-11_5) is dropped by the validator.
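
The evidence above can be replayed offline to show why a missing version never reaches xml_url. The sketch below is illustrative only: resolveXMLURL, the response type, the fakeParl fetcher, and the simplified standInValid check are hypothetical names for this example, and the fake responses merely mimic the statuses recorded above (after following redirects).

```go
package main

import (
	"bytes"
	"fmt"
)

// response is the final response after redirects have been followed.
type response struct {
	status int
	body   []byte
}

// resolveXMLURL (hypothetical) returns the first candidate whose response
// passes validation, or "" when none does, so nothing unvalidated is stored.
func resolveXMLURL(candidates []string, fetch func(string) response, valid func(response) bool) string {
	for _, c := range candidates {
		if valid(fetch(c)) {
			return c
		}
	}
	return ""
}

// fakeParl mimics the alive-check: versions 1-3 serve bill XML, anything
// else ends on a 200 HTML error page.
func fakeParl(url string) response {
	switch url {
	case "C-11_1/C-11_E.xml", "C-11_2/C-11_E.xml", "C-11_3/C-11_E.xml":
		return response{200, []byte(`<?xml version="1.0" encoding="UTF-8"?><Bill bill-origin="commons"></Bill>`)}
	default:
		return response{200, []byte(`<html><title>Error Page - Page d'erreur - 404</title></html>`)}
	}
}

// standInValid is a deliberately simplified stand-in for the real gate:
// 2xx status and a <Bill marker somewhere in the body.
func standInValid(r response) bool {
	return r.status >= 200 && r.status <= 299 && bytes.Contains(r.body, []byte("<Bill"))
}

func main() {
	// A wrong guess (C-11_4) is skipped; the next candidate is kept.
	fmt.Println(resolveXMLURL([]string{"C-11_4/C-11_E.xml", "C-11_3/C-11_E.xml"}, fakeParl, standInValid))
	// With only a wrong guess, the result is empty rather than a dead URL.
	fmt.Printf("%q\n", resolveXMLURL([]string{"C-11_5/C-11_E.xml"}, fakeParl, standInValid))
}
```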

Screenshots

N/A — backend bills-indexer change, no UI surface.

Notes

  • Cost: the adapter now fetches one DocumentViewer page per version (previously one per bill). This is bounded by the publication count and acceptable for the periodic indexer; XML payloads were already fetched per version for hashing.
  • Out of scope: generating bill_clause_diffs for all version pairs (follow-up comparable-pair producer), the /api/v1/bills/{id}/diff response shape, and iOS changes. Live staging coverage is recorded in EPAC-2290.
  • Release-Note: none — backend ingestion only; no user-visible change ships in this PR (diff data surfaces after the EPAC-2290 staging backfill rerun).

Related issue

  • Closes: EPAC-2301
  • Validation gates: EPAC-2290 (staging coverage rerun), EPAC-2292 (route verification)

The bills indexer only scraped the first publication's DocumentViewer page
for an XML/PDF anchor, then derived every later stage's URL by sort order and
persisted it unvalidated. Intermediate stages (as-amended-by-committee,
as-passed-by-the-house-of-commons, as-passed-by-the-senate) ended up with no
ingested text, so the diff producer emitted bill_clause_diffs=0.

enrichVersions now reads each stage's own DocumentViewer page and resolves its
XML through a validated candidate chain: the page's direct .xml anchor, the XML
sibling beside the page's PDF anchor (same parl.ca per-version directory), then
a sort-order sibling derived from the first stage. A candidate is persisted only
when it fetches as a 2xx <Bill> XML payload, so a guess that 404s or redirects to
an HTML error page is dropped instead of stored as xml_url. Direct first-reading
and royal-assent anchors keep resolving unchanged.

@riddim-developer-bot (bot) enabled auto-merge (squash) June 15, 2026 01:04
@riddim-developer-bot (bot) merged commit 54b4c20 into main Jun 15, 2026
62 checks passed
@riddim-developer-bot (bot) deleted the symphony/epac-2301-recover-intermediate-bill-xml-links-for-diff-ing branch June 15, 2026 01:07