Populate genome_assembly for 4DN files (empty viz — 4DN never exposes an assembly) — Closes #74 by conradbzura · Pull Request #75 · abdenlab/cfdb

conradbzura · 2026-07-02T14:25:25Z

Summary

Populate the genome assembly (and other portal-derived file fields) for every 4DN file so downstream visualizations can select the correct genome. 4DN file enrichment was truncated: fetch_file_metadata_bulk deep-paginated the 4DN/Fourfront Search API with a from offset, but that API caps every result window at 10,000 rows, so only the first ~10k of each file subtype were fetched and the tens of thousands of remaining files were never enriched. Their genome_assembly/output_type were therefore never promoted to the top-level file fields the API exposes as genomeAssembly/outputType, and any visualization keyed off the assembly rendered empty.
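The truncation described above can be reproduced with a small simulation. This is a sketch under stated assumptions, not the Fourfront implementation: `fake_search`, the `ACC…` identifiers, and the page size are hypothetical, and the stand-in simply returns nothing once `from + limit` would pass the 10,000-row window.

```python
from typing import List

MAX_RESULT_WINDOW = 10_000  # the result-window cap described in the summary


def fake_search(corpus: List[str], start: int, limit: int) -> List[str]:
    """Hypothetical stand-in for a search endpoint with a capped result window."""
    if start + limit > MAX_RESULT_WINDOW:
        # Assumption: a request past the window yields no usable rows.
        return []
    return corpus[start:start + limit]


def paginate_with_from(corpus: List[str], page_size: int = 1000) -> List[str]:
    """Deep-paginate with a growing offset until a page comes back empty."""
    fetched: List[str] = []
    start = 0
    while True:
        page = fake_search(corpus, start, page_size)
        if not page:
            break
        fetched.extend(page)
        start += page_size
    return fetched


# 25,000 hypothetical files: everything past the window is never seen.
corpus = [f"ACC{i:06d}" for i in range(25_000)]
fetched = paginate_with_from(corpus)
print(len(fetched), "of", len(corpus), "fetched")
```

The loop ends without any error, which is why a truncation of this kind is easy to miss.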

Replace the from-paginated full scan with accession-filtered batches: the enrichment takes the accessions it needs and queries the Search API filtered by accession, a bounded batch per request. Each batch's result set stays far under the 10k window, so every requested file is fetched regardless of corpus size. A single type=File query covers both FileProcessed and FileFastq. A per-batch failure is logged and skipped rather than aborting or silently truncating.
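A minimal sketch of the dedupe-then-batch step, assuming accessions arrive as plain strings. The helper names `dedupe` and `batched` are illustrative and are not taken from the PR; only the batch size of 100 comes from the description.

```python
from typing import Iterable, Iterator, List

BATCH_SIZE = 100  # mirrors the _FILE_METADATA_BATCH_SIZE value quoted in this PR


def dedupe(accessions: Iterable[str]) -> List[str]:
    """Drop falsy values and duplicates while keeping first-seen order."""
    return list(dict.fromkeys(a for a in accessions if a))


def batched(items: List[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


# Hypothetical input: 250 distinct accessions, some repeated, plus falsy noise.
raw = [f"ACC{i:04d}" for i in range(250)] + ["ACC0001", "", "ACC0002"]
batches = list(batched(dedupe(raw)))
print([len(b) for b in batches])
```

Because each batch holds at most 100 accessions, no single request can return more rows than the batch size, which keeps every result set well inside the 10k window.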

Verified end-to-end against a local stack (real Rust materializer + POST /sync?dccs=4dn): the reported file 4DNFIMTTOWBN now resolves to GRCh38, and genome_assembly coverage across 53,697 materialized 4DN files rose from 4,894 (9.1%) to 19,659 (36.6%). The remainder are FileFastq reads that legitimately have no assembly. output_type coverage reached 99.7%.

Deploy note: this changes only the enrichment code path, which runs during a sync. Existing materialized data was produced by the old truncating code and is not rewritten retroactively — a 4DN sync (POST /sync?dccs=4dn) must run against the deployed code for the assembly to populate on live data.

Closes #74

Proposed changes

Accession-batched Search-API fetch — `src/cfdb/services/fourdn.py`

Change fetch_file_metadata_bulk from a no-argument full scan to fetch_file_metadata_bulk(accessions). Dedupe the accessions and query the Search API in batches of _FILE_METADATA_BATCH_SIZE (100): one type=File query per batch, carrying one accession= filter per accession in the batch and no from offset. Per-item parsing (genome_assembly, file_type, file_type_detailed, track_and_facet_info fields, extra_files) is unchanged. A non-200 response or network error on a batch is logged and the batch is skipped; the final log reports the fetched count and failed-batch count so a partial fetch is visible.
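The query shape and the failure isolation can be sketched as follows. This is not the code in `fourdn.py`: the endpoint URL, the `field` parameter name, the `@graph` response key, the field subset, and the injected `fetch_json` callable are all assumptions made for illustration, and the sketch keeps each raw item instead of mapping individual fields.

```python
from typing import Callable, Dict, Iterable, List, Tuple
from urllib.parse import parse_qs, urlencode, urlsplit

SEARCH_URL = "https://data.4dnucleome.org/search/"  # assumed endpoint
FIELDS = ["accession", "genome_assembly", "file_type"]  # illustrative subset


def build_query(batch: List[str]) -> str:
    """Build one type=File query with one accession filter per accession."""
    params: List[Tuple[str, str]] = [("type", "File")]
    params += [("accession", acc) for acc in batch]
    params += [("field", name) for name in FIELDS]  # parameter name assumed
    params += [("limit", str(len(batch))), ("format", "json")]
    return f"{SEARCH_URL}?{urlencode(params)}"  # note: no "from" offset


def fetch_all(
    batches: Iterable[List[str]],
    fetch_json: Callable[[str], dict],
) -> Tuple[Dict[str, dict], int]:
    """Fetch every batch; log and skip a failing batch instead of aborting."""
    results: Dict[str, dict] = {}
    failed = 0
    for batch in batches:
        try:
            payload = fetch_json(build_query(batch))
        except Exception as exc:  # a non-200 or network error in a real client
            failed += 1
            print(f"skipping batch of {len(batch)}: {exc}")
            continue
        for item in payload.get("@graph", []):  # response key assumed
            accession = item.get("accession")
            if accession:  # items without an accession are skipped
                results[accession] = item
    print(f"fetched {len(results)} entries, {failed} failed batch(es)")
    return results, failed


url = build_query(["ACC0001", "ACC0002"])
query = parse_qs(urlsplit(url).query)
print(query["accession"], "from" in query)
```

Injecting `fetch_json` keeps the sketch free of any particular HTTP client and makes the skip-on-failure path easy to exercise with a stub.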

Targeted enrichment wiring — `src/cfdb/services/sync.py`

_enrich_4dn_api_metadata builds the accession-to-id map from the materialized files first, then passes those accessions to the fetch, scoping enrichment to files actually held rather than a blind scan of all 4DN files.
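A rough sketch of that wiring, assuming each materialized file is a dict with `id` and `accession` keys. The record shape, the `enrich` function, and the stubbed `fetch_bulk` are hypothetical; the real `_enrich_4dn_api_metadata` may store and update files quite differently.

```python
from typing import Callable, Dict, Iterable, List


def enrich(
    files: List[dict],
    fetch_bulk: Callable[[Iterable[str]], Dict[str, dict]],
) -> int:
    """Enrich only the files actually held; return how many were updated."""
    # Build the accession-to-id map from the materialized files first.
    acc_to_id = {f["accession"]: f["id"] for f in files if f.get("accession")}
    # Pass exactly those accessions to the fetch (no blind full scan).
    metadata = fetch_bulk(list(acc_to_id))
    by_id = {f["id"]: f for f in files}
    updated = 0
    for accession, entry in metadata.items():
        file_id = acc_to_id.get(accession)
        if file_id is None:
            continue  # ignore anything that was not requested
        target = by_id[file_id]
        for key in ("genome_assembly", "output_type"):
            if entry.get(key):
                target[key] = entry[key]  # promote to the top-level field
        updated += 1
    return updated


# Stubbed fetch standing in for fetch_file_metadata_bulk(accessions).
def stub_fetch(accessions: Iterable[str]) -> Dict[str, dict]:
    known = {"4DNFIMTTOWBN": {"genome_assembly": "GRCh38"}}
    return {a: known[a] for a in accessions if a in known}


files = [
    {"id": 1, "accession": "4DNFIMTTOWBN"},
    {"id": 2, "accession": "ACC0002"},  # hypothetical file with no metadata
]
print(enrich(files, stub_fetch), files[0].get("genome_assembly"))
```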

Test cases

| # | Test Suite | Given | When | Then | Coverage Target |
|---|---|---|---|---|---|
| 1 | `TestFetchFileMetadataBulk` | More accessions than the batch size | Metadata is requested | One bounded query per batch is issued with no `from` offset | Batching without deep pagination |
| 2 | `TestFetchFileMetadataBulk` | Accessions spanning multiple batches | Metadata is requested | Every accession's entry is returned | Cross-batch aggregation |
| 3 | `TestFetchFileMetadataBulk` | An empty or all-falsy accession list | Metadata is requested | An empty dict is returned with no HTTP request | Input short-circuit |
| 4 | `TestFetchFileMetadataBulk` | Duplicate accessions | Metadata is requested | Each distinct accession is requested once | Deduplication |
| 5 | `TestFetchFileMetadataBulk` | A single accession | Metadata is requested | The query carries `type=File`, one accession filter, the field set, limit, format, and no `from` | Query-URL correctness |
| 6 | `TestFetchFileMetadataBulk` | A partial final batch | Metadata is requested | Each query's limit equals its batch length | Per-batch limit |
| 7 | `TestFetchFileMetadataBulk` | Exactly the batch size, and one over | Metadata is requested | One query, and two queries, are issued | Batch boundaries |
| 8 | `TestFetchFileMetadataBulk` | A graph item missing an accession | Metadata is requested | The item is skipped | Missing-accession guard |
| 9 | `TestFetchFileMetadataBulk` | An item with no mappable fields | Metadata is requested | The accession is omitted from the result | Empty-entry drop |
| 10 | `TestFetchFileMetadataBulk` | An item with `file_type_detailed` and all track fields | Metadata is requested | Every field is mapped into the entry | Field mapping |
| 11 | `TestFetchFileMetadataBulk` | An item with falsy track sub-fields | Metadata is requested | Falsy sub-fields are omitted | Truthiness guard |
| 12 | `TestFetchFileMetadataBulk` | An item with a raw `extra_files` list | Metadata is requested | `extra_files` equals `parse_extra_files` output | extra_files pass-through |
| 13 | `TestFetchFileMetadataBulk` | A first, middle, or all-batch request failure | Metadata is requested | Remaining batches are kept and no failure aborts the fetch | Failure isolation |
| 14 | `TestFetchFileMetadataBulk` | Arbitrary accession lists | Metadata is requested | Requested union equals deduped input, batches partition within the size bound, result keys stay a subset of input, and no query emits `from` | Batching invariants (property-based) |

The 4DN Search API (Fourfront/Elasticsearch) caps every result window at 10,000 rows, so fetch_file_metadata_bulk's from-based deep pagination of FileProcessed and FileFastq retrieved at most 10k per type and silently stopped there. With tens of thousands of 4DN files, most never had their metadata fetched, so their genome_assembly and output_type were never promoted to the top-level file fields the API exposes as genomeAssembly. Downstream visualizations that key off the assembly rendered empty.

Take the accessions to enrich as an argument and query the Search API filtered by accession in bounded batches (a single type=File query per batch, one accession filter each). Every batch's result set stays far under the 10k window, so every requested file is fetched regardless of corpus size. A per-batch non-200 or network error is logged and skipped rather than aborting or silently truncating the whole fetch.

The 4DN enrichment step now builds the accession-to-id map from the materialized files first and passes those accessions to the fetch, so enrichment is scoped to files actually held rather than a blind scan.

Add TestFetchFileMetadataBulk coverage for the batched fetch: empty and all-falsy input short-circuit with no request, accession dedup, exact query-URL shape (type=File, per-batch limit, format, field set, no from offset), batch boundaries at exactly the batch size and one over, graph parsing (missing accession skipped, empty entry dropped, all direct and track fields mapped, falsy track fields skipped, extra_files parsed or omitted, accession absent from the graph), and failure isolation across all-fail and middle-batch-fail runs. Add property-based tests asserting the requested union equals the deduped input, batches partition the input within the size bound, result keys are a subset of the input, and no query ever emits a from offset.
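The batching invariants listed above can be illustrated with a randomized check. The PR does not say which property-testing library the tests use, so this sketch substitutes the standard library's `random` module; `plan_batches` and `check_invariants` are hypothetical helpers, not functions from the repository.

```python
import random
from typing import List, Optional

BATCH_SIZE = 100  # the batch size quoted in this PR


def plan_batches(
    accessions: List[Optional[str]], size: int = BATCH_SIZE
) -> List[List[str]]:
    """Dedupe (dropping falsy values) and split into bounded batches."""
    unique = list(dict.fromkeys(a for a in accessions if a))
    return [unique[i:i + size] for i in range(0, len(unique), size)]


def check_invariants(accessions: List[Optional[str]]) -> None:
    batches = plan_batches(accessions)
    requested = [a for batch in batches for a in batch]
    expected = {a for a in accessions if a}
    # The requested union equals the deduped input.
    assert set(requested) == expected
    # Batches partition the input: nothing is requested twice.
    assert len(requested) == len(expected)
    # Every batch is non-empty and within the size bound.
    assert all(1 <= len(batch) <= BATCH_SIZE for batch in batches)


rng = random.Random(0)  # fixed seed so a failure is reproducible
for _ in range(200):
    n = rng.randint(0, 350)
    sample = [rng.choice(["", f"ACC{rng.randint(0, 400):04d}"]) for _ in range(n)]
    check_invariants(sample)
print("invariants held for 200 random inputs")
```

A dedicated property-based framework would additionally shrink failing inputs to a minimal case, which this stand-in does not attempt.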

conradbzura added 2 commits July 2, 2026 10:14

conradbzura self-assigned this Jul 2, 2026

conradbzura marked this pull request as ready for review July 2, 2026 14:44

conradbzura merged commit b102668 into master Jul 2, 2026
3 checks passed

conradbzura merged 2 commits into master from 74-populate-4dn-genome-assembly
