[EPAC-2301]: Recover intermediate bill XML links for diff ingestion #825
Merged
riddim-developer-bot[bot] merged 1 commit on Jun 15, 2026
The bills indexer only scraped the first publication's DocumentViewer page for an XML/PDF anchor, then derived every later stage's URL by sort order and persisted it unvalidated. Intermediate stages (as-amended-by-committee, as-passed-by-the-house-of-commons, as-passed-by-the-senate) ended up with no ingested text, so the diff producer emitted bill_clause_diffs=0. enrichVersions now reads each stage's own DocumentViewer page and resolves its XML through a validated candidate chain: the page's direct .xml anchor, the XML sibling beside the page's PDF anchor (same parl.ca per-version directory), then a sort-order sibling derived from the first stage. A candidate is persisted only when it fetches as a 2xx <Bill> XML payload, so a guess that 404s or redirects to an HTML error page is dropped instead of stored as xml_url. Direct first-reading and royal-assent anchors keep resolving unchanged.
Scope
Recover intermediate-stage bill XML links in backend/bills-indexer so later diff generation has version text on both sides.
PR #813 proved the bill-diff route and 204 unavailable path work in staging, but the producer emitted bill_diffs=77 bill_clause_diffs=0: intermediate bill versions had no ingested text. The root cause was in the LEGISinfo adapter's document-link discovery.
Root cause
enrichVersions only scraped the first publication's DocumentViewer page for an .xml/.pdf anchor. Every later stage's URL was then derived by sort order (_1/ → _N/) and persisted without validation. When parl.ca's per-version document number does not track the publication order, or the page renders its links client-side, the derived URL was wrong, so as-amended-by-committee, as-passed-by-the-house-of-commons, and as-passed-by-the-senate ingested no text.
What changed
enrichVersions now reads each stage's own DocumentViewer page and resolves its XML through a validated candidate chain (most-trusted first):
1. the page's direct .xml anchor;
2. the XML sibling beside the page's PDF anchor, e.g. C-11_3/C-11_3.PDF next to C-11_3/C-11_E.xml, so the PDF's directory pins the correct version;
3. a sort-order sibling derived from the first stage.
A candidate is persisted as xml_url only when it fetches as a 2xx <Bill> XML payload (looksLikeBillXML). A guess that 404s, or that 302-redirects to a 200 HTML error page, is dropped rather than stored. Direct first-reading and royal-assent anchors keep resolving exactly as before. No domain types changed; all new logic stays in the LEGISinfo adapter.
Bugfix SPEC
Acceptance criteria
XMLURL populated with a live HTTPS URL: resolveVersionXML runs per stage; TestEnrichVersionsResolvesDirectAndPDFSiblingXMLLinks, …DerivesSortOrderSiblingWhenPageHasNoLinks
first-reading/as-amended-by-committee/as-passed-by-the-house-of-commons populated when source returns 2xx: …ResolvesDirectAndPDFSiblingXMLLinks and the original TestFetcherBuildsRelationalBillRecordsFromLegisInfoExports
looksLikeBillXML gate; TestEnrichVersionsDropsUnvalidatedXMLCandidates (soft-200 HTML + 404)
TestEnrichVersionsDropsUnvalidatedXMLCandidates, TestLooksLikeBillXML, TestXMLSiblingFromPDF, TestDedupeNonEmpty, TestEnrichVersionsDoesNotReuseBaseXMLForLaterStages
fetch_* logs unchanged
Testing notes
go test -count=1 ./... (whole bills-indexer module) and go test -race ./internal/adapter/legisinfo/: all green. gofmt -l clean, go vet ./... clean.
Alive-check evidence (parl.ca, curl -A "Mozilla/5.0")
Confirms the recovered URL pattern is live, that the document root is <Bill> (what looksLikeBillXML accepts), and that a non-existent version redirects to a 200 HTML "404" page (what the validator rejects so it is never stored):
C-11_2/C-11_3 are exactly the intermediate stages PR #813 reported as 0/N; both now return live <Bill> XML, and the soft-error redirect for missing versions is dropped by the validator.
Screenshots
N/A — backend bills-indexer change, no UI surface.
Notes
bill_clause_diffs for all version pairs (follow-up comparable-pair producer), the /api/v1/bills/{id}/diff response shape, and iOS changes. Live staging coverage is recorded in EPAC-2290.
Related issue