@ivan-aksamentov
Member

Background:

Some pathogen datasets have significant genetic diversity that a single reference sequence cannot fully represent. This limits the accuracy of dataset auto-detection when query sequences are distant from the chosen reference. By allowing multiple reference sequences per dataset, the minimizer index can capture broader sequence diversity and improve detection rates.

Implementation:

  • Add an optional files.minimizerReferences field to pathogen.json (an array of FASTA file paths).
  • A new get_minimizer_refs() function reads sequences from all listed files and falls back to the main reference if the field is absent.
  • make_ref_search_index() combines the minimizers from all references by set union and uses their average length for normalization.
  • Backward compatible: existing datasets work unchanged.
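
The combining step described above can be illustrated with a minimal Python sketch. It is not the code from this PR: `toy_minimizers()`, `combine_references()`, the k-mer length, and the hash cutoff are all hypothetical stand-ins, and the sketch assumes that "average length" means the mean length of the reference sequences. It only shows the shape of the idea: one minimizer set per reference, merged by set union, with a single averaged length stored for later normalization.

```python
import hashlib

K = 5             # hypothetical k-mer length, chosen only for illustration
CUTOFF = 1 << 30  # hypothetical cutoff: keep hashes in the lowest quarter of 32 bits


def toy_minimizers(seq, k=K, cutoff=CUTOFF):
    """Toy stand-in for a minimizer function: hash every k-mer, keep small hashes."""
    selected = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k].upper()
        # Use a stable hash (Python's built-in hash() is salted per process).
        digest = hashlib.sha1(kmer.encode("ascii")).digest()
        value = int.from_bytes(digest[:4], "big")
        if value < cutoff:
            selected.add(value)
    return selected


def combine_references(seqs):
    """Union the minimizers of all references and record their average length."""
    if not seqs:
        raise ValueError("at least one reference sequence is required")
    minimizers = set()
    for seq in seqs:
        minimizers |= toy_minimizers(seq)  # set union across references
    avg_length = sum(len(seq) for seq in seqs) / len(seqs)
    return {"minimizers": sorted(minimizers), "length": avg_length}


if __name__ == "__main__":
    entry = combine_references(["ACGTACGTAC", "TTTTGGGGCCCCAA"])
    print(len(entry["minimizers"]), entry["length"])
```

With a single reference the result reduces to that reference's own minimizers and length, which matches the backward-compatibility point in the list above.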

Usage:

In pathogen.json, add an array of FASTA paths that contain representative sequences for the dataset:

{
  "files": {
    "reference": "reference.fasta",
    "minimizerReferences": [
      "clade_a.fasta",
      "clade_b.fasta"
    ]
  }
}

Each FASTA file can contain one or more sequences. All sequences across all files contribute minimizers to the dataset's index entry.
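
For illustration, the lookup-with-fallback behavior might look roughly like the Python sketch below. This is not the PR's `get_minimizer_refs()`: the helper names `minimizer_ref_paths()`, `parse_fasta()`, and `load_minimizer_refs()` are hypothetical, and the sketch assumes that the listed paths are relative to the directory containing pathogen.json. How an empty array is treated is not stated in this description; here only an absent field triggers the fallback.

```python
import json
from pathlib import Path


def minimizer_ref_paths(pathogen):
    """Return the FASTA paths to index, falling back to the main reference."""
    files = pathogen.get("files", {})
    refs = files.get("minimizerReferences")
    if refs is None:
        # Field absent: behave like a dataset without the new field.
        return [files["reference"]]
    return list(refs)


def parse_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def load_minimizer_refs(dataset_dir):
    """Collect every sequence from every listed FASTA file of one dataset."""
    dataset_dir = Path(dataset_dir)
    pathogen = json.loads((dataset_dir / "pathogen.json").read_text())
    records = []
    for rel_path in minimizer_ref_paths(pathogen):
        # Assumption: paths are relative to the dataset directory.
        records.extend(parse_fasta((dataset_dir / rel_path).read_text()))
    return records
```

Because a file may hold several sequences, the function returns a flat list of records rather than one record per file.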

Checklist

  • Check whether the changes affect downstream workflows that depend on this dataset. For instance, Nextstrain ingest workflows may break if clade nomenclature changes. Consider fixing those workflows, or at least opening an issue. (Not applicable)

Co-Authored-By: Claude <noreply@anthropic.com>
@ivan-aksamentov ivan-aksamentov deployed to refs/pull/404/merge January 13, 2026 15:41 — with GitHub Actions Active