feat: use multiple reference sequences for minimizer index generation #404

ivan-aksamentov · 2026-01-13T15:41:10Z

Background:

Some pathogen datasets have significant genetic diversity that a single reference sequence cannot fully represent. This limits the accuracy of dataset auto-detection when query sequences are distant from the chosen reference. By allowing multiple reference sequences per dataset, the minimizer index can capture broader sequence diversity and improve detection rates.

Implementation:

- Add an optional `files.minimizerReferences` field to `pathogen.json` (an array of FASTA file paths).
- New `get_minimizer_refs()` function reads sequences from all listed files and falls back to the main reference if the field is absent.
- `make_ref_search_index()` combines minimizers from all references using set union and uses the average length for normalization.
- Backward compatible: existing datasets work unchanged.
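
The combination step can be illustrated with a minimal sketch. It is not the code from this PR: `toy_minimizers()` and `combine_refs()` are hypothetical names, the k-mer hashing is a simplified stand-in for the real minimizer function, and "average length" is assumed here to mean the mean length of the reference sequences.

```python
import hashlib
from typing import List, Set


def toy_minimizers(seq: str, k: int = 17, cutoff: int = 1 << 28) -> Set[int]:
    """Toy stand-in for a minimizer function (hypothetical, not the PR's code).

    Hashes every k-mer to a 32-bit integer and keeps hashes below `cutoff`.
    """
    seq = seq.upper()
    kept = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) - set("ACGT"):
            continue  # skip k-mers containing ambiguous characters
        digest = hashlib.blake2b(kmer.encode(), digest_size=4).digest()
        h = int.from_bytes(digest, "big")
        if h < cutoff:
            kept.add(h)
    return kept


def combine_refs(seqs: List[str], k: int = 17, cutoff: int = 1 << 28) -> dict:
    """Union the minimizers of all references and average their lengths."""
    if not seqs:
        raise ValueError("at least one reference sequence is required")
    minimizers: Set[int] = set()
    for seq in seqs:
        minimizers |= toy_minimizers(seq, k, cutoff)  # set union across references
    avg_length = sum(len(s) for s in seqs) / len(seqs)
    # The key names below are illustrative, not the real index format.
    return {"minimizers": sorted(minimizers), "avg_length": avg_length}
```

With a single reference the result reduces to that reference's own minimizer set and length, which is consistent with the backward-compatibility claim above.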

Usage:

In pathogen.json, add an array of FASTA paths containing representative sequences for the dataset:

{
  "files": {
    "reference": "reference.fasta",
    "minimizerReferences": [
      "clade_a.fasta",
      "clade_b.fasta"
    ]
  }
}

Each FASTA file can contain one or more sequences. All sequences across all files contribute minimizers to the dataset's index entry.
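
As a rough illustration of how such a configuration might be consumed, the sketch below resolves which FASTA files to read and parses multi-sequence FASTA text. The helper names `minimizer_ref_paths()` and `parse_fasta()` are hypothetical and are not the functions added in this PR. How an empty array is handled is not stated in the description, so this sketch falls back only when the field is absent.

```python
from typing import List, Tuple


def minimizer_ref_paths(pathogen_json: dict) -> List[str]:
    """Return the FASTA paths to index (hypothetical helper).

    Falls back to the main reference when `files.minimizerReferences` is absent.
    """
    files = pathogen_json.get("files", {})
    paths = files.get("minimizerReferences")
    if paths is None:
        return [files["reference"]]
    return list(paths)


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Parse FASTA text that may hold one or more records into (name, sequence) pairs."""
    records: List[Tuple[str, str]] = []
    name = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)  # sequence lines may be wrapped
    if name is not None:
        records.append((name, "".join(chunks)))
    return records
```

Every record returned by `parse_fasta()` for every path returned by `minimizer_ref_paths()` would then contribute minimizers to the dataset's index entry.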

Checklist

Check whether the changes affect downstream workflows that depend on this dataset. For instance, Nextstrain ingest workflows may break if clade nomenclature changes. Consider fixing those workflows, or at least opening an issue. Not applicable.

Co-Authored-By: Claude <noreply@anthropic.com>

ivan-aksamentov deployed to refs/pull/404/merge January 13, 2026 15:41 — with GitHub Actions Active
