@ivan-aksamentov
Member

## Recombination Detection: Strategy B - Spatial Uniformity

### Scientific Motivation

Non-recombinant sequences are expected to accumulate private mutations (differences from the assigned reference) roughly uniformly across the genome, because point mutations arise at random during replication. In contrast, recombinants inherit large genomic segments from different parental lineages, so their private mutations cluster in the regions where the sequence differs from its primary ancestor.

The coefficient of variation (CV) - standard deviation divided by mean - of the per-segment mutation counts quantifies this non-uniformity. A CV near zero indicates uniform distribution (consistent with normal evolution), while high CV values suggest clustering (consistent with recombination). Because the CV is normalized by the mean, it reflects the spatial pattern rather than the raw mutation count, although it becomes unreliable when the total count is small (see Limitations).

### Mechanism

1. **Genome segmentation**: The reference genome is divided into N equal-sized segments (N is set by `numSegments`, default 10)

2. **Mutation counting**: Private substitutions are counted per segment, producing a distribution vector

3. **CV calculation**:
   - Compute mean mutation count across segments
   - Compute standard deviation of counts
   - CV = std_dev / mean
   - If mean is zero (no mutations), CV = 0

4. **Threshold evaluation**:
   - If CV > `cvThreshold`, the sequence is flagged
   - Score = (CV - `cvThreshold`) * `weight`
   - Score is 0 if CV <= `cvThreshold`

5. **Score aggregation**: The spatial uniformity score is combined with other enabled strategy scores, then multiplied by the overall recombinants rule weight.
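
Steps 1 to 4 can be sketched as follows. This is a minimal illustration, not the code in this PR: the function names `segment_mutation_counts` and `compute_cv` are borrowed from the Implementation Summary, but their signatures and bodies here are assumptions, as are the 0-based positions, the use of the population standard deviation, and the helper `spatial_uniformity_score`.

```rust
/// Count private substitutions per segment (steps 1 and 2).
/// Assumption: `positions` are 0-based genome coordinates; positions
/// at or beyond `genome_len` are ignored.
fn segment_mutation_counts(positions: &[usize], genome_len: usize, num_segments: usize) -> Vec<usize> {
    let mut counts = vec![0usize; num_segments];
    if genome_len == 0 || num_segments == 0 {
        return counts;
    }
    for &pos in positions {
        if pos < genome_len {
            // Integer mapping keeps the index in 0..num_segments even when
            // genome_len is not an exact multiple of num_segments.
            counts[pos * num_segments / genome_len] += 1;
        }
    }
    counts
}

/// Coefficient of variation of the per-segment counts (step 3).
/// Assumption: population standard deviation (divide by N, not N - 1).
fn compute_cv(counts: &[usize]) -> f64 {
    if counts.is_empty() {
        return 0.0;
    }
    let n = counts.len() as f64;
    let mean = counts.iter().sum::<usize>() as f64 / n;
    if mean == 0.0 {
        // No mutations: CV is defined as 0.
        return 0.0;
    }
    let variance = counts.iter().map(|&c| (c as f64 - mean).powi(2)).sum::<f64>() / n;
    variance.sqrt() / mean
}

/// Threshold evaluation (step 4): excess CV over the threshold, scaled by weight.
fn spatial_uniformity_score(cv: f64, cv_threshold: f64, weight: f64) -> f64 {
    if cv > cv_threshold {
        (cv - cv_threshold) * weight
    } else {
        0.0
    }
}
```

Under these assumptions, five substitutions that all fall in the first of 10 segments give counts `[5, 0, ..., 0]`, a CV of 3.0, and, with `cvThreshold` 1.5 and `weight` 50, a strategy score of 75 before aggregation.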

### Configuration

In `pathogen.json`, configure under `qc.recombinants.spatialUniformity`:

```json
{
  "qc": {
    "recombinants": {
      "enabled": true,
      "scoreWeight": 100.0,
      "spatialUniformity": {
        "enabled": true,
        "weight": 50.0,
        "numSegments": 10,
        "cvThreshold": 1.5
      }
    }
  }
}
```

Parameters:
- `enabled`: Toggle this strategy on/off
- `weight`: Multiplier for this strategy's contribution to combined score
- `numSegments`: Number of genome segments for distribution analysis
- `cvThreshold`: CV above this value triggers detection
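
When tuning these parameters, it may help to remember that `numSegments` bounds the CV from above: if all mutations fall in one segment, the CV equals sqrt(N - 1) for the population standard deviation, so a `cvThreshold` at or above that value can never trigger. The sketch below checks a configuration for this. The struct `SpatialUniformityConfig`, its field names, and both methods are hypothetical and are not the `QcRecombConfigSpatialUniformity` struct from this PR; the example values are copied from the JSON above.

```rust
/// Hypothetical in-memory mirror of the `spatialUniformity` JSON block.
#[derive(Debug, Clone, PartialEq)]
struct SpatialUniformityConfig {
    enabled: bool,
    weight: f64,
    num_segments: usize,
    cv_threshold: f64,
}

/// Values copied from the JSON example; not necessarily the real defaults.
fn example_config() -> SpatialUniformityConfig {
    SpatialUniformityConfig {
        enabled: true,
        weight: 50.0,
        num_segments: 10,
        cv_threshold: 1.5,
    }
}

impl SpatialUniformityConfig {
    /// Largest CV any count vector can reach with this segment count,
    /// assuming the population standard deviation: sqrt(N - 1).
    fn max_cv(&self) -> f64 {
        (self.num_segments.saturating_sub(1) as f64).sqrt()
    }

    /// Reject settings under which the strategy cannot work as intended.
    fn validate(&self) -> Result<(), String> {
        if self.num_segments < 2 {
            return Err("numSegments must be at least 2".to_string());
        }
        if !(self.weight >= 0.0) {
            return Err("weight must be a non-negative number".to_string());
        }
        if !(self.cv_threshold >= 0.0) {
            return Err("cvThreshold must be a non-negative number".to_string());
        }
        if self.cv_threshold >= self.max_cv() {
            return Err(format!(
                "cvThreshold {} can never be exceeded: maximum CV for {} segments is {:.3}",
                self.cv_threshold,
                self.num_segments,
                self.max_cv()
            ));
        }
        Ok(())
    }
}
```

With the example values, the maximum attainable CV is 3.0, so the threshold of 1.5 sits halfway up the attainable range.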

### Advantages

- Statistically principled: uses coefficient of variation, a well-understood measure of dispersion
- Position-agnostic: detects clustering without requiring knowledge of specific breakpoint locations
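- Normalized for mutation load: dividing by the mean makes CV values comparable across sequences with different mutation counts, provided there are enough mutations (see Limitations)
- Low computational cost: O(n + N) where n is the number of private substitutions and N is the number of segments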
- Configurable sensitivity: tunable via segment count and threshold

### Limitations

- Segment boundaries are arbitrary and may split natural clusters
- Very short genomes with few segments may produce noisy CV values
- Does not identify actual breakpoint locations, only detects clustering
- May produce false positives for sequences with genuine localized hypermutation
- Requires sufficient mutations: when there are fewer mutations than segments, most segments are empty and the CV is high even if the mutations are evenly spread
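
The last limitation can be made concrete with a small sketch. It assumes the CV uses the population standard deviation, which this description does not specify; `min_cv_for_count` and `min_informative_count` are hypothetical helpers for illustration only. The idea is to compute the smallest CV that k mutations can produce (the most even spread) and find the smallest k for which that best case falls at or below the threshold.

```rust
/// CV with population standard deviation (assumed); 0 when there are no mutations.
fn compute_cv(counts: &[usize]) -> f64 {
    if counts.is_empty() {
        return 0.0;
    }
    let n = counts.len() as f64;
    let mean = counts.iter().sum::<usize>() as f64 / n;
    if mean == 0.0 {
        return 0.0;
    }
    let variance = counts.iter().map(|&c| (c as f64 - mean).powi(2)).sum::<f64>() / n;
    variance.sqrt() / mean
}

/// Smallest CV achievable with `k` mutations over `num_segments` segments,
/// obtained by spreading the mutations as evenly as possible.
fn min_cv_for_count(k: usize, num_segments: usize) -> f64 {
    if k == 0 || num_segments == 0 {
        return 0.0;
    }
    let base = k / num_segments;
    let extra = k % num_segments;
    // The first `extra` segments receive one additional mutation.
    let counts: Vec<usize> = (0..num_segments)
        .map(|i| base + usize::from(i < extra))
        .collect();
    compute_cv(&counts)
}

/// Smallest mutation count at which a perfectly spread sequence is NOT flagged.
/// Below this count, every placement of the mutations exceeds the threshold.
fn min_informative_count(num_segments: usize, cv_threshold: f64) -> Option<usize> {
    (1..=num_segments).find(|&k| min_cv_for_count(k, num_segments) <= cv_threshold)
}
```

Under these assumptions, with `numSegments` 10 and `cvThreshold` 1.5, one mutation gives a CV of 3.0, two mutations in different segments give 2.0, and three give about 1.53, so a sequence with one to three private substitutions exceeds the threshold wherever they fall. A minimum mutation count guard, here at least 4, would be one way to avoid such false positives.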

### Comparison to Other Strategies

| Strategy | Signal Used | Best For |
|----------|-------------|----------|
| A: Weighted Threshold | Total mutation count | High-divergence recombinants |
| **B: Spatial Uniformity** | **Mutation clustering (CV)** | **Position-agnostic clustering** |
| C: Cluster Gaps | Gaps between SNP clusters | Clear breakpoint regions |
| D: Reversion Clustering | Clustered reversions | Recombinants with reversion hotspots |
| E: Multi-Ancestor | Segment-wise ancestor matching | Known reference set |
| F: Label Switching | Lineage label changes | Labeled datasets |

Strategy B is preferred when:
- Breakpoint locations are unknown or variable
- The dataset lacks mutation labels or multiple references
- A general-purpose clustering metric is needed

### Implementation Summary

Files modified:
- `packages/nextclade/src/qc/qc_config.rs` - Added `QcRecombConfigSpatialUniformity` struct
- `packages/nextclade/src/qc/qc_rule_recombinants.rs` - Added `strategy_spatial_uniformity` function and `RecombResultSpatialUniformity` result struct
- `packages/nextclade/src/qc/qc_recomb_utils.rs` - Added `segment_mutation_counts` and `compute_cv` utilities
- `packages/nextclade/src/qc/mod.rs` - Registered new utility module
- `packages/nextclade-web/src/helpers/formatQCRecombinants.ts` - Added UI formatting for spatial uniformity results

Tests added:
- Unit tests for `strategy_spatial_uniformity` covering disabled config, empty input, uniform distribution, and clustered distribution cases
- Unit tests for `segment_mutation_counts` and `compute_cv` utilities

Dataset included:
- `data/recomb/enpen/enterovirus/ev-d68/` - Test dataset with spatial uniformity detection enabled

### Future Work

- Adaptive segment sizing based on genome length
- Sliding window variant for smoother detection
- Multi-resolution analysis (multiple segment sizes)
- Integration with breakpoint detection to report likely recombination regions
- Weighted CV that accounts for expected mutation rates in different genome regions

Co-Authored-By: Claude <noreply@anthropic.com>

@ivan-aksamentov
Member Author

Test with strategy-specific dataset:

Preview with EV-D68 test dataset
