Populate genome_assembly for 4DN files (empty viz: 4DN never exposes an assembly). Closes #74 (#75)
Merged
The 4DN Search API (Fourfront/Elasticsearch) caps every result window at 10,000 rows, so fetch_file_metadata_bulk's from-based deep pagination of FileProcessed and FileFastq retrieved at most 10k files per type and silently stopped there. With tens of thousands of 4DN files, most never had their metadata fetched, so their genome_assembly and output_type were never promoted to the top-level file fields the API exposes as genomeAssembly. Downstream visualizations that key off the assembly rendered empty.

Take the accessions to enrich as an argument and query the Search API filtered by accession in bounded batches: a single type=File query per batch, with one accession filter per requested file. Every batch's result set stays far under the 10k window, so every requested file is fetched regardless of corpus size. A per-batch non-200 response or network error is logged and skipped rather than aborting or silently truncating the whole fetch.

The 4DN enrichment step now builds the accession-to-id map from the materialized files first and passes those accessions to the fetch, so enrichment is scoped to files actually held rather than a blind scan.
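The wiring described above can be sketched as follows. This is an illustration only, not the code in sync.py: the function name `enrich`, the dict shapes, and the injected `fetch_metadata` callable are all hypothetical, and only the field names `genome_assembly` and `output_type` come from this PR.

```python
def enrich(files, fetch_metadata):
    """Return {file_id: promoted_fields} for files whose metadata was fetched.

    `files` is assumed to be a list of dicts with "accession" and "id" keys
    (hypothetical shape). `fetch_metadata` is assumed to take a list of
    accessions and return {accession: metadata_dict}.
    """
    # Build the accession-to-id map first, from files actually held.
    acc_to_id = {f["accession"]: f["id"] for f in files if f.get("accession")}

    # Ask only for the accessions we hold, instead of scanning everything.
    metadata = fetch_metadata(list(acc_to_id))

    updates = {}
    for accession, meta in metadata.items():
        file_id = acc_to_id.get(accession)
        if file_id is None:
            continue  # ignore anything we did not ask for
        promoted = {
            key: meta[key]
            for key in ("genome_assembly", "output_type")
            if meta.get(key)  # skip missing or falsy values
        }
        if promoted:
            updates[file_id] = promoted
    return updates
```

Passing the fetch in as a callable keeps the sketch testable without any network access.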
Add TestFetchFileMetadataBulk coverage for the batched fetch: empty and all-falsy input short-circuit with no request, accession dedup, exact query-URL shape (type=File, per-batch limit, format, field set, no from offset), batch boundaries at exactly the batch size and one over, graph parsing (missing accession skipped, empty entry dropped, all direct and track fields mapped, falsy track fields skipped, extra_files parsed or omitted, accession absent from the graph), and failure isolation across all-fail and middle-batch-fail runs. Add property-based tests asserting the requested union equals the deduped input, batches partition the input within the size bound, result keys are a subset of the input, and no query ever emits a from offset.
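The partition properties listed above can be illustrated with a small randomized check. This sketch uses only the standard library `random` module rather than a property-testing framework, and the helper names (`batches`, `check_partition`, `run_random_checks`) and the synthetic accessions are assumptions made for illustration, not the tests added in this PR.

```python
import random


def batches(accessions, size):
    """Drop falsy entries, dedupe in first-seen order, split into batches."""
    unique = list(dict.fromkeys(a for a in accessions if a))
    return [unique[i:i + size] for i in range(0, len(unique), size)]


def check_partition(accessions, size):
    """Assert the batches partition the deduped input within the size bound."""
    expected = list(dict.fromkeys(a for a in accessions if a))
    got = batches(accessions, size)
    # Every batch is non-empty and no larger than the bound.
    assert all(1 <= len(batch) <= size for batch in got)
    # Concatenating the batches gives back the deduped input exactly,
    # so nothing is lost, duplicated, or invented.
    assert [a for batch in got for a in batch] == expected


def run_random_checks(trials=200, seed=0):
    """Run check_partition on random inputs that include duplicates and falsy values."""
    rng = random.Random(seed)
    pool = [f"4DNFI{n:07d}" for n in range(50)] + [None, ""]  # synthetic accessions
    for _ in range(trials):
        sample = [rng.choice(pool) for _ in range(rng.randint(0, 120))]
        check_partition(sample, rng.randint(1, 25))
    return trials
```

A fixed seed keeps the run reproducible; a property-testing library would add input shrinking on failure.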
Summary
Populate the genome assembly (and other portal-derived file fields) for every 4DN file so downstream visualizations can select the correct genome.

4DN file enrichment was truncated: `fetch_file_metadata_bulk` deep-paginated the 4DN/Fourfront Search API with a `from` offset, but that API caps every result window at 10,000 rows, so only the first ~10k of each file subtype were fetched and the tens of thousands of remaining files were never enriched. Their `genome_assembly` / `output_type` were therefore never promoted to the top-level file fields the API exposes as `genomeAssembly` / `outputType`, and any visualization keyed off the assembly rendered empty.

Replace the `from`-paginated full scan with accession-filtered batches: the enrichment takes the accessions it needs and queries the Search API filtered by accession, a bounded batch per request. Each batch's result set stays far under the 10k window, so every requested file is fetched regardless of corpus size. A single `type=File` query covers both `FileProcessed` and `FileFastq`. A per-batch failure is logged and skipped rather than aborting or silently truncating.

Verified end-to-end against a local stack (real Rust materializer + `POST /sync?dccs=4dn`): the reported file `4DNFIMTTOWBN` now resolves to `GRCh38`, and `genome_assembly` coverage across 53,697 materialized 4DN files rose from 4,894 (9%) to 19,659 (36.6%). The remainder are `FileFastq` reads that legitimately have no assembly. `output_type` reached 99.7%.

Deploy note: this changes only the enrichment code path, which runs during a sync. Existing materialized data was produced by the old truncating code and is not rewritten retroactively; a 4DN sync (`POST /sync?dccs=4dn`) must run against the deployed code for the assembly to populate on live data.

Closes #74
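As a rough illustration of the accession-filtered batching described in this summary, the sketch below builds one query URL per batch. The base URL, the `field`/`limit`/`format` parameter spellings, the field list, and the helper names are assumptions made for illustration; only `type=File`, the repeated accession filter, the batch size of 100, and the absence of a `from` offset are taken from this PR.

```python
from urllib.parse import urlencode

BATCH_SIZE = 100  # mirrors _FILE_METADATA_BATCH_SIZE from the PR text
SEARCH_URL = "https://data.4dnucleome.org/search/"  # assumed base URL
# Assumed field list, drawn from the fields this PR says are parsed.
FIELDS = ("accession", "genome_assembly", "file_type",
          "file_type_detailed", "extra_files")


def dedupe(accessions):
    """Drop falsy entries and duplicates, keeping first-seen order."""
    return list(dict.fromkeys(a for a in accessions if a))


def batch_urls(accessions, batch_size=BATCH_SIZE):
    """Build one Search API URL per batch of unique accessions."""
    unique = dedupe(accessions)
    urls = []
    for start in range(0, len(unique), batch_size):
        batch = unique[start:start + batch_size]
        params = [("type", "File")]
        params += [("accession", acc) for acc in batch]  # one filter per file
        params += [("field", name) for name in FIELDS]
        params += [("limit", str(batch_size)), ("format", "json")]
        # No offset parameter is ever added: each batch is a fresh,
        # small query that fits inside the 10k result window.
        urls.append(f"{SEARCH_URL}?{urlencode(params)}")
    return urls
```

Because each URL names at most `batch_size` accessions, the result set per request is bounded by the batch size regardless of how many files the portal holds.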
Proposed changes
Accession-batched Search-API fetch (`src/cfdb/services/fourdn.py`)

Change `fetch_file_metadata_bulk` from a no-arg full scan to `fetch_file_metadata_bulk(accessions)`. Dedupe the accessions and query the Search API in batches of `_FILE_METADATA_BATCH_SIZE` (100): one `type=File` query per batch with one `accession=` filter each and no `from` offset. Per-item parsing (`genome_assembly`, `file_type`, `file_type_detailed`, `track_and_facet_info` fields, `extra_files`) is unchanged. A non-200 or network error on a batch is logged and skipped, and the final log reports the fetched count and failed-batch count so a partial fetch is visible.

Targeted enrichment wiring (`src/cfdb/services/sync.py`)

`_enrich_4dn_api_metadata` builds the accession-to-id map from the materialized files first, then passes those accessions to the fetch, scoping enrichment to files actually held rather than a blind scan of all 4DN files.

Test cases
All rows are in the class `TestFetchFileMetadataBulk`. Recoverable cell fragments from the table: `from` offset; `type=File`, one accession filter, the field set, limit, format, and no `from`; `file_type_detailed` and all track fields; `extra_files` list; `extra_files` equals `parse_extra_files` output; `from`.
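The failure isolation these tests exercise (all-fail and middle-batch-fail runs) might look like the sketch below. It is not the code in `fourdn.py`: `fetch_in_batches` and the injected `fetch_batch` callable are hypothetical, and `fetch_batch` is assumed to raise on a non-200 response or network error and otherwise return a list of item dicts.

```python
import logging

logger = logging.getLogger("fourdn_sketch")  # hypothetical logger name


def fetch_in_batches(accessions, fetch_batch, batch_size=100):
    """Fetch metadata batch by batch, skipping any batch that fails.

    Returns (results, failed_batches) so a partial fetch stays visible.
    """
    unique = list(dict.fromkeys(a for a in accessions if a))
    results, failed = {}, 0
    for start in range(0, len(unique), batch_size):
        batch = unique[start:start + batch_size]
        try:
            items = fetch_batch(batch)
        except Exception as exc:  # assumed: non-200 or network error
            failed += 1
            logger.warning("batch at offset %d failed: %s", start, exc)
            continue  # later batches still run
        for item in items:
            accession = item.get("accession")
            if accession:  # entries without an accession are skipped
                results[accession] = item
    logger.info("fetched %d files, %d failed batches", len(results), failed)
    return results, failed
```

Returning the failed-batch count alongside the results lets a caller decide whether a partial fetch is acceptable.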