feat(qc): recombination: B: Spatial Uniformity #1737
Open: ivan-aksamentov wants to merge 2 commits into `master` from `feat/qc-recomb-strategy-b` (+26,907 −7)
## Recombination Detection: Strategy B - Spatial Uniformity
### Scientific Motivation
Non-recombinant sequences are expected to accumulate private mutations roughly uniformly across the genome, through random point mutations during replication. In contrast, recombinants inherit large genomic segments from different parental lineages, so their private mutations (differences from the assigned reference) cluster in the regions where the sequence differs from its primary ancestor.
The coefficient of variation (CV), defined as the standard deviation of the per-segment mutation counts divided by their mean, quantifies this non-uniformity. A CV near zero indicates an even distribution (consistent with normal evolution), while a high CV suggests clustering (consistent with recombination). Because the CV is normalized by the mean, it reflects the spatial pattern of mutations rather than their raw count.
### Mechanism
1. **Genome segmentation**: The reference genome is divided into N equal-sized segments (configurable, default 10)
2. **Mutation counting**: Private substitutions are counted per segment, producing a distribution vector
3. **CV calculation**:
   - Compute mean mutation count across segments
   - Compute standard deviation of counts
   - CV = std_dev / mean
   - If mean is zero (no mutations), CV = 0
4. **Threshold evaluation**:
   - If CV > cvThreshold, the sequence is flagged
   - Score = (CV - cvThreshold) * weight
   - Score of 0 if CV <= threshold
5. **Score aggregation**: The spatial uniformity score is combined with other enabled strategy scores, then multiplied by the overall recombinants rule weight.
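Steps 1 to 4 can be sketched in Rust roughly as follows. This is a minimal illustration, not the code from this PR. The names `segment_mutation_counts` and `compute_cv` are taken from the Implementation Summary, but their signatures here are assumptions, and `spatial_uniformity_score` is a hypothetical helper. The sketch also assumes 0-based positions and a population standard deviation (dividing by the number of segments), neither of which is stated in the description.

```rust
/// Hypothetical signature: count private substitutions per genome segment.
/// `positions` are assumed to be 0-based nucleotide positions on the reference.
fn segment_mutation_counts(positions: &[usize], genome_len: usize, num_segments: usize) -> Vec<usize> {
    let mut counts = vec![0usize; num_segments];
    if genome_len == 0 || num_segments == 0 {
        return counts;
    }
    for &pos in positions {
        // Map the position to a segment index; clamp so that any
        // out-of-range position falls into the last segment.
        let idx = (pos * num_segments / genome_len).min(num_segments - 1);
        counts[idx] += 1;
    }
    counts
}

/// Coefficient of variation of the per-segment counts.
/// Assumption: population standard deviation (divide by N, not N - 1).
fn compute_cv(counts: &[usize]) -> f64 {
    if counts.is_empty() {
        return 0.0;
    }
    let n = counts.len() as f64;
    let mean = counts.iter().sum::<usize>() as f64 / n;
    if mean == 0.0 {
        // No mutations: CV is defined as 0 (step 3).
        return 0.0;
    }
    let variance = counts
        .iter()
        .map(|&c| {
            let d = c as f64 - mean;
            d * d
        })
        .sum::<f64>()
        / n;
    variance.sqrt() / mean
}

/// Hypothetical helper for step 4: score is zero at or below the threshold.
fn spatial_uniformity_score(cv: f64, cv_threshold: f64, weight: f64) -> f64 {
    if cv > cv_threshold {
        (cv - cv_threshold) * weight
    } else {
        0.0
    }
}
```

With these assumptions, ten substitutions that all fall into the first of 10 segments give a CV of 3 and, with `cvThreshold` 1.5 and `weight` 50, a score of 75. One substitution per segment gives a CV of 0 and a score of 0.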
### Configuration
In `pathogen.json`, configure under `qc.recombinants.spatialUniformity`:
```json
{
  "qc": {
    "recombinants": {
      "enabled": true,
      "scoreWeight": 100.0,
      "spatialUniformity": {
        "enabled": true,
        "weight": 50.0,
        "numSegments": 10,
        "cvThreshold": 1.5
      }
    }
  }
}
```
Parameters:
- `enabled`: Toggle this strategy on/off
- `weight`: Multiplier for this strategy's contribution to combined score
- `numSegments`: Number of genome segments for distribution analysis
- `cvThreshold`: CV above this value triggers detection
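As a worked example of how `cvThreshold` and `weight` interact, take the configuration above and a sequence whose 10 private substitutions all fall into a single segment. The calculation assumes a population standard deviation (dividing by `numSegments`). That is an assumption: the description does not say which form the implementation uses.

```math
\begin{aligned}
\mu &= \frac{10}{10} = 1, \\
\sigma &= \sqrt{\frac{(10 - 1)^2 + 9\,(0 - 1)^2}{10}} = \sqrt{9} = 3, \\
\mathrm{CV} &= \frac{\sigma}{\mu} = 3, \\
\text{score} &= (\mathrm{CV} - \texttt{cvThreshold}) \times \texttt{weight} = (3 - 1.5) \times 50 = 75.
\end{aligned}
```

This is the strategy's own score, before it is combined with other strategies and the overall rule weight. Under the same assumption, the largest CV reachable with N segments is $\sqrt{N - 1}$ (all mutations in one segment), so with `numSegments` set to 10 a `cvThreshold` of 3 or more would never trigger.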
### Advantages
- Statistically principled: uses coefficient of variation, a well-understood measure of dispersion
- Position-agnostic: detects clustering without requiring knowledge of specific breakpoint locations
- Normalized for mutation load: dividing by the mean makes CV comparable across sequences with different total mutation counts (very low counts remain unreliable, see Limitations)
- Low computational cost: O(n) where n is number of mutations
- Configurable sensitivity: tunable via segment count and threshold
### Limitations
- Segment boundaries are arbitrary and may split natural clusters
- Very short genomes with few segments may produce noisy CV values
- Does not identify actual breakpoint locations, only detects clustering
- May produce false positives for sequences with genuine localized hypermutation
- Requires sufficient mutations (low counts produce unreliable CV)
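The low-count limitation can be made concrete with a small Rust sketch. It is an illustration only: it assumes a population standard deviation and no minimum-mutation guard, and the actual implementation may differ on both points. All function names are hypothetical. The sketch places `k` mutations one per segment, which is the most even layout possible when `k` is at most the number of segments, and finds the smallest `k` whose CV does not exceed the threshold.

```rust
/// CV of per-segment counts, assuming a population standard deviation (divide by N).
fn population_cv(counts: &[usize]) -> f64 {
    if counts.is_empty() {
        return 0.0;
    }
    let n = counts.len() as f64;
    let mean = counts.iter().sum::<usize>() as f64 / n;
    if mean == 0.0 {
        return 0.0;
    }
    let variance = counts
        .iter()
        .map(|&c| {
            let d = c as f64 - mean;
            d * d
        })
        .sum::<f64>()
        / n;
    variance.sqrt() / mean
}

/// Counts for `k` mutations placed one per segment (hypothetical helper).
/// Intended for k <= num_segments; larger k fills every segment once.
fn one_per_segment(k: usize, num_segments: usize) -> Vec<usize> {
    (0..num_segments).map(|i| if i < k { 1 } else { 0 }).collect()
}

/// Smallest k whose most even layout stays at or below the threshold,
/// or None if no k in 1..=num_segments does (hypothetical helper).
fn min_count_not_flagged(num_segments: usize, cv_threshold: f64) -> Option<usize> {
    (1..=num_segments).find(|&k| population_cv(&one_per_segment(k, num_segments)) <= cv_threshold)
}
```

With 10 segments and a threshold of 1.5, the CV for 1, 2, and 3 evenly spread mutations is about 3.0, 2.0, and 1.53 respectively, so under these assumptions any sequence with three or fewer private substitutions exceeds the threshold no matter where they fall. Four is the smallest count that can stay below it.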
### Comparison to Other Strategies
| Strategy | Signal Used | Best For |
|----------|-------------|----------|
| A: Weighted Threshold | Total mutation count | High-divergence recombinants |
| **B: Spatial Uniformity** | **Mutation clustering (CV)** | **Position-agnostic clustering** |
| C: Cluster Gaps | Gaps between SNP clusters | Clear breakpoint regions |
| D: Reversion Clustering | Clustered reversions | Recombinants with reversion hotspots |
| E: Multi-Ancestor | Segment-wise ancestor matching | Known reference set |
| F: Label Switching | Lineage label changes | Labeled datasets |
Strategy B is preferred when:
- Breakpoint locations are unknown or variable
- Dataset lacks mutation labels or multiple references
- A general-purpose clustering metric is needed
### Implementation Summary
Files modified:
- `packages/nextclade/src/qc/qc_config.rs` - Added `QcRecombConfigSpatialUniformity` struct
- `packages/nextclade/src/qc/qc_rule_recombinants.rs` - Added `strategy_spatial_uniformity` function and `RecombResultSpatialUniformity` result struct
- `packages/nextclade/src/qc/qc_recomb_utils.rs` - Added `segment_mutation_counts` and `compute_cv` utilities
- `packages/nextclade/src/qc/mod.rs` - Registered new utility module
- `packages/nextclade-web/src/helpers/formatQCRecombinants.ts` - Added UI formatting for spatial uniformity results
Tests added:
- Unit tests for `strategy_spatial_uniformity` covering disabled config, empty input, uniform distribution, and clustered distribution cases
- Unit tests for `segment_mutation_counts` and `compute_cv` utilities
Dataset included:
- `data/recomb/enpen/enterovirus/ev-d68/` - Test dataset with spatial uniformity detection enabled
### Future Work
- Adaptive segment sizing based on genome length
- Sliding window variant for smoother detection
- Multi-resolution analysis (multiple segment sizes)
- Integration with breakpoint detection to report likely recombination regions
- Weighted CV that accounts for expected mutation rates in different genome regions
Co-Authored-By: Claude <noreply@anthropic.com>
Author comment: Test with strategy-specific dataset: