
Populate genome_assembly for 4DN files (empty viz - 4DN never exposes an assembly) - Closes #74 #75

Merged: conradbzura merged 2 commits into master from 74-populate-4dn-genome-assembly on Jul 2, 2026

Conversation

@conradbzura

Collaborator

Summary

Populate the genome assembly (and other portal-derived file fields) for every 4DN file so downstream visualizations can select the correct genome. 4DN file enrichment was truncated: fetch_file_metadata_bulk deep-paginated the 4DN/Fourfront Search API with a from offset, but that API caps every result window at 10,000 rows, so only the first ~10k of each file subtype were fetched and the tens of thousands of remaining files were never enriched. Their genome_assembly/output_type were therefore never promoted to the top-level file fields the API exposes as genomeAssembly/outputType, and any visualization keyed off the assembly rendered empty.
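The truncation described above can be reproduced with a small simulation. This is only a sketch under stated assumptions: `fake_search` is a hypothetical stand-in for a search endpoint with a 10,000-row result window, and modelling "past the window" as an empty page is an assumption about behavior, not a description of the real 4DN API.

```python
MAX_WINDOW = 10_000  # result-window cap described in the summary


def fake_search(corpus, start, limit):
    """Hypothetical capped endpoint: rows at or past MAX_WINDOW are unreachable."""
    end = min(start + limit, MAX_WINDOW)
    return corpus[start:end] if start < end else []


def paginate_with_offset(corpus, page_size=1000):
    """Offset-based deep pagination, stopping at the first empty page."""
    fetched, start = [], 0
    while True:
        page = fake_search(corpus, start, page_size)
        if not page:
            break  # looks like "end of results", but it is really the window cap
        fetched.extend(page)
        start += len(page)
    return fetched


corpus = [f"ACC{i:06d}" for i in range(25_000)]
fetched = paginate_with_offset(corpus)
print(len(fetched), len(corpus) - len(fetched))  # 10000 15000
```

The loop ends cleanly with no error, which is why the missing rows went unnoticed.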

Replace the from-paginated full scan with accession-filtered batches: the enrichment takes the accessions it needs and queries the Search API filtered by accession, a bounded batch per request. Each batch's result set stays far under the 10k window, so every requested file is fetched regardless of corpus size. A single type=File query covers both FileProcessed and FileFastq. A per-batch failure is logged and skipped rather than aborting or silently truncating.

Verified end-to-end against a local stack (real Rust materializer + POST /sync?dccs=4dn): the reported file 4DNFIMTTOWBN now resolves to GRCh38, and genome_assembly coverage across 53,697 materialized 4DN files rose from 4,894 (9%) to 19,659 (36.6%) — the remainder are FileFastq reads that legitimately have no assembly — while output_type reached 99.7%.

Deploy note: this changes only the enrichment code path, which runs during a sync. Existing materialized data was produced by the old truncating code and is not rewritten retroactively — a 4DN sync (POST /sync?dccs=4dn) must run against the deployed code for the assembly to populate on live data.

Closes #74

Proposed changes

Accession-batched Search-API fetch — src/cfdb/services/fourdn.py

Change fetch_file_metadata_bulk from a no-argument full scan to fetch_file_metadata_bulk(accessions). Dedupe the accessions and query the Search API in batches of _FILE_METADATA_BATCH_SIZE (100): each batch issues one type=File query carrying one accession= filter per accession in the batch, a limit equal to the batch length, and no from offset. Per-item parsing (genome_assembly, file_type, file_type_detailed, track_and_facet_info fields, extra_files) is unchanged. A non-200 response or network error on a batch is logged and that batch is skipped; the final log reports the fetched count and failed-batch count so a partial fetch is visible.
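A minimal sketch of the batching shape described above, not the code in src/cfdb/services/fourdn.py. Everything named here is an assumption for illustration: `build_batch_url`, `fetch_in_batches`, the injected `fetch_json` callable, the `SEARCH_URL` value, the `field` parameter name, the `@graph` response key, and the reduced field list. Item parsing is also simplified to "keep truthy fields".

```python
import logging
from urllib.parse import urlencode

logger = logging.getLogger(__name__)

BATCH_SIZE = 100  # mirrors _FILE_METADATA_BATCH_SIZE from the description
SEARCH_URL = "https://data.4dnucleome.org/search/"  # assumed endpoint
FIELDS = ("accession", "genome_assembly", "file_type")  # illustrative subset


def build_batch_url(batch):
    """One type=File query, one accession= filter per accession, no from offset."""
    params = [("type", "File")]
    params += [("accession", acc) for acc in batch]
    params += [("field", name) for name in FIELDS]
    params += [("limit", str(len(batch))), ("format", "json")]
    return f"{SEARCH_URL}?{urlencode(params)}"


def fetch_in_batches(accessions, fetch_json, batch_size=BATCH_SIZE):
    """Fetch metadata per batch; a failed batch is logged and skipped.

    fetch_json is a hypothetical callable (url -> parsed JSON dict) that raises
    on a non-200 response or a network error.
    """
    unique = list(dict.fromkeys(a for a in accessions if a))  # dedupe, keep order
    results, failed = {}, 0
    for i in range(0, len(unique), batch_size):
        batch = unique[i:i + batch_size]
        try:
            payload = fetch_json(build_batch_url(batch))
        except Exception as exc:
            failed += 1
            logger.warning("batch starting at index %d failed: %s", i, exc)
            continue
        for item in payload.get("@graph", []):
            acc = item.get("accession")
            if not acc:
                continue  # item without an accession cannot be keyed
            entry = {k: v for k, v in item.items() if k != "accession" and v}
            if entry:
                results[acc] = entry
    logger.info("fetched %d entries, %d failed batches", len(results), failed)
    return results
```

Injecting `fetch_json` keeps the sketch free of any particular HTTP client and makes the failure path easy to exercise with a fake.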

Targeted enrichment wiring — src/cfdb/services/sync.py

_enrich_4dn_api_metadata builds the accession-to-id map from the materialized files first, then passes those accessions to the fetch, scoping enrichment to files actually held rather than a blind scan of all 4DN files.
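The ordering described above (map first, then a targeted fetch) might look roughly like the following. This is a sketch only: `enrich_files`, the in-memory `files` records with `id` and `accession` keys, and the `fetch_metadata` callable are all hypothetical and do not reflect how `_enrich_4dn_api_metadata` stores or updates data.

```python
def enrich_files(files, fetch_metadata):
    """Copy fetched portal metadata onto the file records that are actually held."""
    # Build the accession -> id map first, from the materialized files.
    acc_to_id = {f["accession"]: f["id"] for f in files if f.get("accession")}
    # Ask only for the accessions we hold, instead of scanning every 4DN file.
    metadata = fetch_metadata(list(acc_to_id))
    by_id = {f["id"]: f for f in files}
    enriched = 0
    for acc, entry in metadata.items():
        file_id = acc_to_id.get(acc)
        if file_id is None:
            continue  # ignore anything returned that we do not hold
        by_id[file_id].update(entry)  # e.g. sets genome_assembly at the top level
        enriched += 1
    return enriched


files = [{"id": 1, "accession": "4DNFIMTTOWBN"}, {"id": 2, "accession": "EXAMPLE"}]
count = enrich_files(files, lambda accs: {"4DNFIMTTOWBN": {"genome_assembly": "GRCh38"}})
print(count, files[0]["genome_assembly"])  # 1 GRCh38
```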

Test cases

| # | Test Suite | Given | When | Then | Coverage Target |
|---|---|---|---|---|---|
| 1 | TestFetchFileMetadataBulk | More accessions than the batch size | Metadata is requested | One bounded query per batch is issued with no from offset | Batching without deep pagination |
| 2 | TestFetchFileMetadataBulk | Accessions spanning multiple batches | Metadata is requested | Every accession's entry is returned | Cross-batch aggregation |
| 3 | TestFetchFileMetadataBulk | An empty or all-falsy accession list | Metadata is requested | An empty dict is returned with no HTTP request | Input short-circuit |
| 4 | TestFetchFileMetadataBulk | Duplicate accessions | Metadata is requested | Each distinct accession is requested once | Deduplication |
| 5 | TestFetchFileMetadataBulk | A single accession | Metadata is requested | The query carries type=File, one accession filter, the field set, limit, format, and no from | Query-URL correctness |
| 6 | TestFetchFileMetadataBulk | A partial final batch | Metadata is requested | Each query's limit equals its batch length | Per-batch limit |
| 7 | TestFetchFileMetadataBulk | Exactly the batch size, and one over | Metadata is requested | One query, and two queries, are issued | Batch boundaries |
| 8 | TestFetchFileMetadataBulk | A graph item missing an accession | Metadata is requested | The item is skipped | Missing-accession guard |
| 9 | TestFetchFileMetadataBulk | An item with no mappable fields | Metadata is requested | The accession is omitted from the result | Empty-entry drop |
| 10 | TestFetchFileMetadataBulk | An item with file_type_detailed and all track fields | Metadata is requested | Every field is mapped into the entry | Field mapping |
| 11 | TestFetchFileMetadataBulk | An item with falsy track sub-fields | Metadata is requested | Falsy sub-fields are omitted | Truthiness guard |
| 12 | TestFetchFileMetadataBulk | An item with a raw extra_files list | Metadata is requested | extra_files equals parse_extra_files output | extra_files pass-through |
| 13 | TestFetchFileMetadataBulk | A first, middle, or all-batch request failure | Metadata is requested | Remaining batches are kept and no failure aborts the fetch | Failure isolation |
| 14 | TestFetchFileMetadataBulk | Arbitrary accession lists | Metadata is requested | Requested union equals deduped input, batches partition within the size bound, result keys stay a subset of input, and no query emits from | Batching invariants (property-based) |
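The batching invariants in row 14 can be illustrated with a randomized check. This sketch swaps in the standard library's seeded `random` module in place of a property-based testing framework, and `plan_batches` is a hypothetical helper that isolates only the dedupe-and-split step; it is not the test code from this PR.

```python
import random


def plan_batches(accessions, batch_size):
    """Drop falsy values, dedupe in order, then split into bounded batches."""
    unique = list(dict.fromkeys(a for a in accessions if a))
    return [unique[i:i + batch_size] for i in range(0, len(unique), batch_size)]


def check_invariants(accessions, batch_size):
    batches = plan_batches(accessions, batch_size)
    flat = [acc for batch in batches for acc in batch]
    # Every batch is non-empty and within the size bound.
    assert all(1 <= len(batch) <= batch_size for batch in batches)
    # Batches partition the input: no accession is requested twice.
    assert len(flat) == len(set(flat))
    # The requested union equals the deduped, truthy input.
    assert set(flat) == {acc for acc in accessions if acc}


rng = random.Random(0)  # seeded so the run is repeatable
for _ in range(200):
    n = rng.randint(0, 350)
    sample = [rng.choice(["", f"ACC{rng.randint(0, 120)}"]) for _ in range(n)]
    check_invariants(sample, batch_size=100)
```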

The 4DN Search API (Fourfront/Elasticsearch) caps every result window at
10,000 rows, so fetch_file_metadata_bulk's from-based deep pagination of
FileProcessed and FileFastq retrieved at most 10k per type and silently
stopped there. With tens of thousands of 4DN files, most never had their
metadata fetched, so their genome_assembly and output_type were never
promoted to the top-level file fields the API exposes as genomeAssembly.
Downstream visualizations that key off the assembly rendered empty.

Take the accessions to enrich as an argument and query the Search API
filtered by accession in bounded batches (a single type=File query per
batch, one accession filter each). Every batch's result set stays far
under the 10k window, so every requested file is fetched regardless of
corpus size. A per-batch non-200 or network error is logged and skipped
rather than aborting or silently truncating the whole fetch.

The 4DN enrichment step now builds the accession-to-id map from the
materialized files first and passes those accessions to the fetch, so
enrichment is scoped to files actually held rather than a blind scan.

Add TestFetchFileMetadataBulk coverage for the batched fetch: empty and
all-falsy input short-circuit with no request, accession dedup, exact
query-URL shape (type=File, per-batch limit, format, field set, no from
offset), batch boundaries at exactly the batch size and one over, graph
parsing (missing accession skipped, empty entry dropped, all direct and
track fields mapped, falsy track fields skipped, extra_files parsed or
omitted, accession absent from the graph), and failure isolation across
all-fail and middle-batch-fail runs.

Add property-based tests asserting the requested union equals the deduped
input, batches partition the input within the size bound, result keys are
a subset of the input, and no query ever emits a from offset.
@conradbzura conradbzura self-assigned this Jul 2, 2026
@conradbzura conradbzura marked this pull request as ready for review July 2, 2026 14:44
@conradbzura conradbzura merged commit b102668 into master Jul 2, 2026
3 checks passed