feat: use multiple reference sequences for minimizer index generation #404
Background:
Some pathogen datasets have significant genetic diversity that a single reference sequence cannot fully represent. This limits the accuracy of dataset auto-detection when query sequences are distant from the chosen reference. By allowing multiple reference sequences per dataset, the minimizer index can capture broader sequence diversity and improve detection rates.
Implementation:
- files.minimizerReferences field in pathogen.json (array of FASTA file paths)
- get_minimizer_refs() function reads sequences from all listed files, falls back to main reference if field is absent
- make_ref_search_index() combines minimizers from all references using set union; uses average length for normalization

Usage:
In pathogen.json, add an array of FASTA paths containing representative sequences for the dataset:

    {
      "files": {
        "reference": "reference.fasta",
        "minimizerReferences": [
          "clade_a.fasta",
          "clade_b.fasta"
        ]
      }
    }

Each FASTA file can contain one or more sequences. All sequences across all files contribute minimizers to the dataset's index entry.
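
For reviewers, here is a minimal Python sketch of the behaviour described in this PR: resolving which FASTA files feed the index (with fallback to the main reference when the field is absent), then merging per-sequence minimizers by set union and normalizing by average length. It is not the code in this PR. The names `resolve_minimizer_ref_paths`, `toy_minimizers`, and `combine_index_entry` are hypothetical, and the all-k-mers set is only a stand-in for the real minimizer extraction.

```python
import json
from statistics import mean


def resolve_minimizer_ref_paths(pathogen_json: dict) -> list[str]:
    """Hypothetical helper: pick the FASTA paths that feed the index.

    Uses files.minimizerReferences when present, otherwise falls back
    to the single main reference.
    """
    files = pathogen_json["files"]
    refs = files.get("minimizerReferences")
    if refs is None:
        return [files["reference"]]
    return list(refs)


def toy_minimizers(seq: str, k: int = 3) -> set[str]:
    """Stand-in for real minimizer extraction: the set of all k-mers."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def combine_index_entry(seqs: list[str], k: int = 3) -> dict:
    """Hypothetical helper: one index entry built from several sequences.

    Minimizers are merged with set union, so shared ones are stored once;
    the normalization length is the mean of the sequence lengths.
    """
    combined: set[str] = set()
    for seq in seqs:
        combined |= toy_minimizers(seq, k)
    return {"minimizers": combined, "length": mean(len(s) for s in seqs)}


# Usage with the pathogen.json fragment from the description.
pathogen = json.loads(
    '{"files": {"reference": "reference.fasta",'
    ' "minimizerReferences": ["clade_a.fasta", "clade_b.fasta"]}}'
)
paths = resolve_minimizer_ref_paths(pathogen)

# Two short made-up sequences stand in for the FASTA contents.
entry = combine_index_entry(["ACGTAC", "ACGGTTAA"])
```

Because the union stores each shared minimizer only once, adding a reference that is close to an existing one grows the entry only by its novel minimizers.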
Checklist
Check if changes affect downstream workflows which depend on this dataset. For instance, Nextstrain ingest workflows may break if clade nomenclature changes. Consider fixing those workflows or making an issue at least.
Not applicable