feat(qc): recombination: A: Weighted Threshold #1736
Open: ivan-aksamentov wants to merge 2 commits into master from feat/qc-recomb-strategy-a
Diff: +27,063 −7
## Recombination Detection: Strategy A - Weighted Threshold
### Scientific Motivation
Recombinant viruses inherit genetic material from multiple parental lineages,
resulting in a distinctive pattern of private mutations when compared to a
single reference sequence. The key insight is that different mutation types
carry different diagnostic weight:
- Unlabeled mutations: Novel mutations not associated with known lineages
- Labeled mutations: Mutations characteristic of specific viral lineages
- Reversions: Positions where the sample reverts to the ancestral state
Reversions are particularly significant because they often indicate that a
genomic region derives from a different parental lineage than the reference.
When a recombinant inherits a segment from a lineage closer to the ancestral
sequence, positions that mutated in the reference lineage will appear as
reversions. By weighting reversions more heavily, this strategy increases
sensitivity to this recombination signature.
### Mechanism
The weighted threshold strategy calculates a weighted sum of private mutation
counts by type:
    weightedCount = unlabeled  * weightUnlabeled
                  + labeled    * weightLabeled
                  + reversions * weightReversion
When weightedCount exceeds the configured threshold, the strategy computes:
    excess = max(0, weightedCount - threshold)
    strategyScore = excess * weight / threshold
The final QC score is: strategyScore * scoreWeight
Default weights (1.0, 1.0, 2.0) reflect that reversions are more diagnostic
of recombination than other private mutations. Dataset maintainers can tune
these weights based on the evolutionary dynamics of their specific pathogen.
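
The scoring described above can be sketched in Rust as follows. This is a minimal illustration, not the code from `qc_rule_recombinants.rs`: the struct and function names are hypothetical, the values in `from_example_config()` are taken from the example `pathogen.json` in this description, and the sketch assumes `threshold` is greater than zero.

```rust
/// Private mutation counts by type (hypothetical struct, for illustration only).
#[derive(Debug, Clone, Copy)]
pub struct PrivateMutationCounts {
    pub unlabeled: usize,
    pub labeled: usize,
    pub reversions: usize,
}

/// Parameters mirroring the `weightedThreshold` settings (hypothetical names).
#[derive(Debug, Clone, Copy)]
pub struct WeightedThresholdParams {
    pub threshold: f64,
    pub weight: f64,
    pub weight_unlabeled: f64,
    pub weight_labeled: f64,
    pub weight_reversion: f64,
}

impl WeightedThresholdParams {
    /// Values copied from the example configuration in this description.
    pub fn from_example_config() -> Self {
        Self {
            threshold: 20.0,
            weight: 1.0,
            weight_unlabeled: 1.0,
            weight_labeled: 1.0,
            weight_reversion: 2.0,
        }
    }
}

/// weightedCount = unlabeled * wU + labeled * wL + reversions * wR
pub fn weighted_count(c: &PrivateMutationCounts, p: &WeightedThresholdParams) -> f64 {
    c.unlabeled as f64 * p.weight_unlabeled
        + c.labeled as f64 * p.weight_labeled
        + c.reversions as f64 * p.weight_reversion
}

/// strategyScore = max(0, weightedCount - threshold) * weight / threshold
/// Assumes `threshold > 0`; a zero threshold would divide by zero.
pub fn strategy_score(c: &PrivateMutationCounts, p: &WeightedThresholdParams) -> f64 {
    let excess = (weighted_count(c, p) - p.threshold).max(0.0);
    excess * p.weight / p.threshold
}

/// Final QC score = strategyScore * scoreWeight
pub fn qc_score(c: &PrivateMutationCounts, p: &WeightedThresholdParams, score_weight: f64) -> f64 {
    strategy_score(c, p) * score_weight
}
```

For example, with 10 unlabeled mutations, 4 labeled mutations and 5 reversions, the weighted count is 10 + 4 + 2 * 5 = 24. With a threshold of 20 the excess is 4, the strategy score is 4 * 1.0 / 20 = 0.2, and with `scoreWeight` of 100 the QC score is 20. A sample whose weighted count stays at or below the threshold scores 0.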
### Configuration
In pathogen.json, configure under qc.recombinants:

    "qc": {
      "recombinants": {
        "enabled": true,
        "scoreWeight": 100,
        "weightedThreshold": {
          "enabled": true,
          "threshold": 20,
          "weight": 1.0,
          "weightUnlabeled": 1.0,
          "weightLabeled": 1.0,
          "weightReversion": 2.0
        }
      }
    }

Parameters:
- enabled: Master switch for recombinant detection
- scoreWeight: Multiplier for the rule's contribution to overall QC score
- weightedThreshold.enabled: Enable this specific strategy
- weightedThreshold.threshold: Weighted count above which recombinant is flagged
- weightedThreshold.weight: Scaling factor for excess calculation
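- weightedThreshold.weightUnlabeled, weightedThreshold.weightLabeled,
weightedThreshold.weightReversion: Per-mutation-type weights applied to the
corresponding private mutation counts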
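
This PR keeps a legacy `mutationsThreshold` field for backward compatibility, but the description does not spell out how the fallback resolves. The sketch below shows one plausible resolution order under stated assumptions: an explicit `weightedThreshold` block wins, otherwise the legacy threshold is paired with the documented default weights (1.0, 1.0, 2.0) and a scaling `weight` of 1.0. All type and method names here are hypothetical and are not the actual `get_weighted_threshold_config()` implementation.

```rust
/// Hypothetical mirror of the `weightedThreshold` config block.
#[derive(Debug, Clone, PartialEq)]
pub struct WeightedThresholdConfig {
    pub enabled: bool,
    pub threshold: f64,
    pub weight: f64,
    pub weight_unlabeled: f64,
    pub weight_labeled: f64,
    pub weight_reversion: f64,
}

/// Hypothetical mirror of `qc.recombinants`.
#[derive(Debug, Clone, Default)]
pub struct RecombinantsConfig {
    pub enabled: bool,
    pub score_weight: f64,
    /// Legacy field, present only in older datasets.
    pub mutations_threshold: Option<usize>,
    /// New strategy-specific block.
    pub weighted_threshold: Option<WeightedThresholdConfig>,
}

impl RecombinantsConfig {
    /// Assumed resolution order (not confirmed by the PR):
    /// 1. rule disabled: no strategy config;
    /// 2. explicit `weightedThreshold` block: used if its own `enabled` is true;
    /// 3. legacy `mutationsThreshold`: converted using the default weights.
    pub fn resolve_weighted_threshold(&self) -> Option<WeightedThresholdConfig> {
        if !self.enabled {
            return None;
        }
        if let Some(cfg) = &self.weighted_threshold {
            return cfg.enabled.then(|| cfg.clone());
        }
        self.mutations_threshold.map(|t| WeightedThresholdConfig {
            enabled: true,
            threshold: t as f64,
            weight: 1.0,
            weight_unlabeled: 1.0,
            weight_labeled: 1.0,
            weight_reversion: 2.0,
        })
    }
}
```

Returning `Option` keeps the "strategy not configured" case explicit, so a caller can skip the rule instead of scoring against a made-up threshold.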
### Advantages
- Simple, interpretable algorithm with clear biological rationale
- Configurable weights allow tuning for different pathogens
- Low computational overhead - O(1) calculation from pre-computed counts
- Backward compatible with existing datasets via legacy fallback
- Reversion weighting captures a key recombination signature
### Limitations
- Does not consider spatial distribution of mutations along the genome
- Cannot distinguish recombination from convergent evolution
- Requires careful threshold tuning per pathogen to avoid false positives
- Single threshold may not suit all lineage combinations
- No direct evidence of breakpoint locations
### Comparison to Other Strategies
Strategy A is the simplest approach, using only mutation counts without
positional information. Other strategies extend detection capabilities:
- B (Spatial Uniformity): Uses coefficient of variation to detect clustered
mutations - recombinants often show mutations concentrated in segments
from the divergent parent
- C (Cluster Gaps): Analyzes gaps between SNP clusters to identify potential
breakpoint regions
- D (Reversion Clustering): Looks for spatially clustered reversions, which
indicate a segment from a more ancestral lineage
- E (Multi-Ancestor): Compares to multiple reference sequences to find
regions matching different ancestors
- F (Label Switching): Detects when different genome regions show mutations
characteristic of different lineages
Strategy A serves as a baseline detector. For comprehensive recombinant
identification, combine with spatial strategies (B, C, D) when available.
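
Strategy B is not part of this PR, but the coefficient of variation it refers to is easy to illustrate. The sketch below bins private mutation positions into equal-width windows and computes the standard deviation divided by the mean of the per-window counts. The function names, the windowing scheme and the use of population variance are assumptions made for this example, not the planned implementation.

```rust
/// Count mutations per equal-width window along the genome.
/// Positions are 0-based; positions at or beyond `genome_len` are ignored.
pub fn window_counts(positions: &[usize], genome_len: usize, num_windows: usize) -> Vec<usize> {
    let mut counts = vec![0usize; num_windows];
    if genome_len == 0 || num_windows == 0 {
        return counts;
    }
    for &pos in positions {
        if pos < genome_len {
            // pos < genome_len guarantees the index is < num_windows.
            let idx = pos * num_windows / genome_len;
            counts[idx] += 1;
        }
    }
    counts
}

/// Coefficient of variation (population std dev / mean) of the window counts.
/// Returns `None` when there are no windows or no mutations at all.
pub fn coefficient_of_variation(counts: &[usize]) -> Option<f64> {
    if counts.is_empty() {
        return None;
    }
    let n = counts.len() as f64;
    let mean = counts.iter().sum::<usize>() as f64 / n;
    if mean == 0.0 {
        return None;
    }
    let variance = counts
        .iter()
        .map(|&c| {
            let d = c as f64 - mean;
            d * d
        })
        .sum::<f64>()
        / n;
    Some(variance.sqrt() / mean)
}
```

Evenly spread mutations give a coefficient of variation near 0, while mutations concentrated in one window give a larger value, which is the kind of clustering signal a count-only strategy such as A cannot see.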
### Implementation Summary
Files modified:
- packages/nextclade/src/qc/qc_config.rs - Strategy config structs with
defaults and backward compatibility for legacy mutationsThreshold
- packages/nextclade/src/qc/qc_rule_recombinants.rs - Main rule implementation
with weighted count calculation and scoring
- packages/nextclade/src/qc/qc_recomb_utils.rs - Shared utilities for position
clustering and spatial analysis (foundation for strategies B-D)
- packages/nextclade/src/qc/mod.rs - Module exports
- packages/nextclade/src/qc/qc_run.rs - Integration into QC pipeline
- packages/nextclade-web/src/helpers/formatQCRecombinants.ts - Web UI message
formatting with weighted count details
- packages/nextclade-web/src/components/Results/* - UI integration
- packages/nextclade-schemas/*.json,yaml - Updated JSON schemas
Test dataset added:
- data/recomb/enpen/enterovirus/ev-d68/ - Enterovirus D68 dataset with
recombinants rule enabled for testing the weighted threshold strategy
### Future Work
- Add unit tests for edge cases (zero mutations, extreme weights)
- Implement strategies B-F for comprehensive recombination detection
- Add visualization of mutation types in web UI results table
- Consider adaptive thresholds based on genome length and diversity
- Explore machine learning approaches trained on known recombinants
Co-Authored-By: Claude <noreply@anthropic.com>
The mutations_threshold field is still used for backward compatibility in the get_weighted_threshold_config() method. Since no stable public interface is being maintained, the deprecation warning is unnecessary and causes CI failures when warnings are treated as errors.
Co-Authored-By: Claude <noreply@anthropic.com>
Comment from the PR author:
Test with strategy-specific dataset: