feat(qc): recombination: B: Spatial Uniformity #1737

ivan-aksamentov · 2026-01-13T13:10:48Z

Recombination Detection: Strategy B - Spatial Uniformity

Scientific Motivation

Non-recombinant sequences are expected to accumulate private mutations roughly uniformly across the genome, through random point mutations during replication. In contrast, recombinants inherit large genomic segments from different parental lineages, so their private mutations (differences from the assigned reference) cluster in the regions inherited from a parent other than the primary ancestor.

The coefficient of variation (CV), the standard deviation divided by the mean, quantifies this non-uniformity. A CV near zero indicates a uniform distribution (consistent with normal evolution), while high CV values suggest clustering (consistent with recombination). Because CV is normalized by the mean, it reflects the spatial pattern of mutations rather than their raw count.

Mechanism

1. Genome segmentation: The reference genome is divided into N equal-sized segments (configurable, default 10)
2. Mutation counting: Private substitutions are counted per segment, producing a distribution vector
3. CV calculation:
   - Compute mean mutation count across segments
   - Compute standard deviation of counts
   - CV = std_dev / mean
   - If mean is zero (no mutations), CV = 0
4. Threshold evaluation:
   - If CV > cvThreshold, the sequence is flagged
   - Score = (CV - cvThreshold) * weight
   - Score of 0 if CV <= cvThreshold
5. Score aggregation: The spatial uniformity score is combined with other enabled strategy scores, then multiplied by the overall recombinants rule weight.
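
The first four steps can be sketched in a few lines of Rust. This is an illustrative sketch only, not the code from this PR: the function names echo the utilities listed under Implementation Summary, but the signatures, the 0-based position convention, the integer segment mapping, and the use of the population standard deviation (dividing by N rather than N - 1) are all assumptions made for the example.

```rust
/// Count mutations per segment.
/// Assumption: `positions` are 0-based reference coordinates of private
/// substitutions; positions outside the genome are skipped.
fn segment_mutation_counts(positions: &[usize], genome_len: usize, num_segments: usize) -> Vec<usize> {
    let mut counts = vec![0usize; num_segments];
    if genome_len == 0 || num_segments == 0 {
        return counts;
    }
    for &pos in positions {
        if pos >= genome_len {
            continue;
        }
        // Integer scaling maps [0, genome_len) onto [0, num_segments).
        let idx = pos * num_segments / genome_len;
        counts[idx] += 1;
    }
    counts
}

/// Coefficient of variation of the per-segment counts.
/// Assumption: population standard deviation. Returns 0 when the mean is 0.
fn compute_cv(counts: &[usize]) -> f64 {
    if counts.is_empty() {
        return 0.0;
    }
    let n = counts.len() as f64;
    let mean = counts.iter().sum::<usize>() as f64 / n;
    if mean == 0.0 {
        return 0.0;
    }
    let variance = counts
        .iter()
        .map(|&c| {
            let d = c as f64 - mean;
            d * d
        })
        .sum::<f64>()
        / n;
    variance.sqrt() / mean
}

/// Threshold evaluation: (CV - cvThreshold) * weight, clamped at 0.
fn spatial_uniformity_score(cv: f64, cv_threshold: f64, weight: f64) -> f64 {
    (cv - cv_threshold).max(0.0) * weight
}
```

Under these assumptions, a 1000 nt genome with one mutation in each of 10 segments gives CV = 0 and a score of 0. Eight mutations all falling between positions 800 and 899 give counts of 8 in one segment and 0 elsewhere, so CV is about 3.0 and, with cvThreshold = 1.5 and weight = 50, the strategy score is about 75 before aggregation.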

Configuration

In pathogen.json, configure under qc.recombinants.spatialUniformity:

{
  "qc": {
    "recombinants": {
      "enabled": true,
      "scoreWeight": 100.0,
      "spatialUniformity": {
        "enabled": true,
        "weight": 50.0,
        "numSegments": 10,
        "cvThreshold": 1.5
      }
    }
  }
}

Parameters:

enabled: Toggle this strategy on/off
weight: Multiplier for this strategy's contribution to combined score
numSegments: Number of genome segments for distribution analysis
cvThreshold: CV above this value triggers detection
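
When tuning these parameters it may help to keep in mind that numSegments bounds the CV: if the population standard deviation is used (an assumption, the PR text does not say which variant is computed), the largest possible CV occurs when every mutation falls into a single segment and equals sqrt(numSegments - 1), which is 3.0 for 10 segments. The sketch below uses a hypothetical struct, not the `QcRecombConfigSpatialUniformity` from this PR, with values copied from the example JSON, to show a sanity check that rejects thresholds that could never be exceeded.

```rust
/// Hypothetical mirror of `qc.recombinants.spatialUniformity`, for illustration only.
#[derive(Debug, Clone, PartialEq)]
struct SpatialUniformityConfig {
    enabled: bool,
    weight: f64,
    num_segments: usize,
    cv_threshold: f64,
}

impl SpatialUniformityConfig {
    /// Values taken from the example JSON above (not confirmed defaults).
    fn from_example() -> Self {
        Self { enabled: true, weight: 50.0, num_segments: 10, cv_threshold: 1.5 }
    }

    /// Largest attainable CV, assuming population standard deviation:
    /// all mutations in one of N segments gives sqrt(N - 1).
    fn max_cv(&self) -> f64 {
        (self.num_segments.saturating_sub(1) as f64).sqrt()
    }

    /// Reject settings under which the strategy could never fire.
    fn validate(&self) -> Result<(), String> {
        if self.num_segments < 2 {
            return Err("numSegments must be at least 2; with one segment CV is always 0".to_string());
        }
        if self.weight < 0.0 {
            return Err("weight must be non-negative".to_string());
        }
        if self.cv_threshold < 0.0 {
            return Err("cvThreshold must be non-negative".to_string());
        }
        if self.cv_threshold >= self.max_cv() {
            return Err(format!(
                "cvThreshold {} can never be exceeded with {} segments (max CV is {:.3})",
                self.cv_threshold,
                self.num_segments,
                self.max_cv()
            ));
        }
        Ok(())
    }

    /// Strategy score for a given CV; 0 when disabled or at or below the threshold.
    fn score(&self, cv: f64) -> f64 {
        if !self.enabled {
            return 0.0;
        }
        (cv - self.cv_threshold).max(0.0) * self.weight
    }
}
```

With the example values, the threshold of 1.5 sits at half of the maximum attainable CV of 3.0, so the strategy score ranges from 0 to 75 before it is combined with other strategies.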

Advantages

Statistically principled: uses coefficient of variation, a well-understood measure of dispersion
Position-agnostic: detects clustering without requiring knowledge of specific breakpoint locations
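Normalized for mutation count: dividing by the mean makes CV comparable across mutation loads, although very low counts still give noisy values (see Limitations)
Low computational cost: O(n + N) where n is the number of private mutations and N is the number of segments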
Configurable sensitivity: tunable via segment count and threshold

Limitations

Segment boundaries are arbitrary and may split natural clusters
Very short genomes with few segments may produce noisy CV values
Does not identify actual breakpoint locations, only detects clustering
May produce false positives for sequences with genuine localized hypermutation
Requires sufficient mutations (low counts produce unreliable CV)
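
The last limitation can be made quantitative under a simple null model. If m mutations fall independently and uniformly into N equal segments (a multinomial assumption, not something stated in this PR) and the population standard deviation is used, the expected squared CV is (N - 1) / m. The helper names below are hypothetical and only illustrate how many mutations are needed before a uniform sequence typically stays below a given threshold.

```rust
/// Root-mean-square CV expected when `num_mutations` mutations are spread
/// independently and uniformly over `num_segments` equal segments.
/// Assumes the multinomial null model and population standard deviation.
fn null_rms_cv(num_mutations: usize, num_segments: usize) -> Option<f64> {
    if num_mutations == 0 || num_segments == 0 {
        return None;
    }
    Some(((num_segments - 1) as f64 / num_mutations as f64).sqrt())
}

/// Smallest mutation count at which the null RMS CV is strictly below
/// `cv_threshold`, i.e. the smallest m with (N - 1) / m < threshold^2.
fn min_mutations_for_threshold(num_segments: usize, cv_threshold: f64) -> Option<usize> {
    if num_segments < 2 || !(cv_threshold > 0.0) {
        return None;
    }
    let bound = (num_segments - 1) as f64 / (cv_threshold * cv_threshold);
    Some(bound.floor() as usize + 1)
}
```

For the example configuration (10 segments, cvThreshold 1.5) this gives a minimum of 5 mutations: with 4 or fewer, even a perfectly random placement has an RMS CV of 1.5 or more, and a single private mutation always produces the maximum CV of 3.0. A minimum-mutation guard along these lines is one possible mitigation, not a feature described in this PR.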

Comparison to Other Strategies

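| Strategy | Signal Used | Best For |
|----------|-------------|----------|
| A: Weighted Threshold | Total mutation count | High-divergence recombinants |
| B: Spatial Uniformity | Mutation clustering (CV) | Position-agnostic clustering |
| C: Cluster Gaps | Gaps between SNP clusters | Clear breakpoint regions |
| D: Reversion Clustering | Clustered reversions | Recombinants with reversion hotspots |
| E: Multi-Ancestor | Segment-wise ancestor matching | Known reference set |
| F: Label Switching | Lineage label changes | Labeled datasets |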

Strategy B is preferred when:

Breakpoint locations are unknown or variable
Dataset lacks mutation labels or multiple references
A general-purpose clustering metric is needed

Implementation Summary

Files modified:

packages/nextclade/src/qc/qc_config.rs - Added QcRecombConfigSpatialUniformity struct
packages/nextclade/src/qc/qc_rule_recombinants.rs - Added strategy_spatial_uniformity function and RecombResultSpatialUniformity result struct
packages/nextclade/src/qc/qc_recomb_utils.rs - Added segment_mutation_counts and compute_cv utilities
packages/nextclade/src/qc/mod.rs - Registered new utility module
packages/nextclade-web/src/helpers/formatQCRecombinants.ts - Added UI formatting for spatial uniformity results

Tests added:

Unit tests for strategy_spatial_uniformity covering disabled config, empty input, uniform distribution, and clustered distribution cases
Unit tests for segment_mutation_counts and compute_cv utilities

Dataset included:

data/recomb/enpen/enterovirus/ev-d68/ - Test dataset with spatial uniformity detection enabled

Future Work

Adaptive segment sizing based on genome length
Sliding window variant for smoother detection
Multi-resolution analysis (multiple segment sizes)
Integration with breakpoint detection to report likely recombination regions
Weighted CV that accounts for expected mutation rates in different genome regions


Co-Authored-By: Claude <noreply@anthropic.com>

github-actions · 2026-01-13T14:21:56Z

Preview: https://nextstrain--nextclade--pr-1737.previews.neherlab.click

(ci)

ivan-aksamentov · 2026-01-13T14:36:35Z

Test with strategy-specific dataset:

Preview with EV-D68 test dataset

ivan-aksamentov requested a review from Copilot January 13, 2026 13:13

Copilot started reviewing on behalf of ivan-aksamentov January 13, 2026 13:13

ivan-aksamentov mentioned this pull request Jan 13, 2026

QC label for recombinant sequences #1699

Open


fix: remove unnecessary deprecation attribute

4b061c1

Co-Authored-By: Claude <noreply@anthropic.com>
