@ivan-aksamentov
Member

Recombination Detection: Strategy A - Weighted Threshold

Scientific Motivation

Recombinant viruses inherit genetic material from multiple parental lineages, resulting in a distinctive pattern of private mutations when compared to a single reference sequence. The key insight is that different mutation types carry different diagnostic weight:

  • Unlabeled mutations: Novel mutations not associated with known lineages
  • Labeled mutations: Mutations characteristic of specific viral lineages
  • Reversions: Positions where the sample reverts to the ancestral state

Reversions are particularly significant because they often indicate that a genomic region derives from a different parental lineage than the reference. When a recombinant inherits a segment from a lineage closer to the ancestral sequence, positions that mutated in the reference lineage will appear as reversions. By weighting reversions more heavily, this strategy increases sensitivity to this recombination signature.

Mechanism

The weighted threshold strategy calculates a weighted sum of private mutation counts by type:

weightedCount = unlabeled * weightUnlabeled
              + labeled * weightLabeled
              + reversions * weightReversion

The strategy then measures how far weightedCount exceeds the configured threshold. The score is zero whenever weightedCount is at or below the threshold:

excess = max(0, weightedCount - threshold)
strategyScore = excess * weight / threshold

The final QC score is: strategyScore * scoreWeight

Default weights (1.0, 1.0, 2.0) reflect that reversions are more diagnostic of recombination than other private mutations. Dataset maintainers can tune these weights based on the evolutionary dynamics of their specific pathogen.
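The formulas above can be sketched in Rust as follows. This is a minimal illustration, not the code from qc_rule_recombinants.rs: the struct and function names are hypothetical, and treating a non-positive threshold as "no score" is an assumption made here to avoid division by zero.

```rust
/// Private mutation counts by type (hypothetical struct, for illustration only).
#[derive(Debug, Clone, Copy)]
pub struct PrivateMutationCounts {
    pub unlabeled: usize,
    pub labeled: usize,
    pub reversions: usize,
}

/// Strategy parameters, named after the config keys described in this PR.
#[derive(Debug, Clone, Copy)]
pub struct WeightedThresholdParams {
    pub threshold: f64,
    pub weight: f64,
    pub weight_unlabeled: f64,
    pub weight_labeled: f64,
    pub weight_reversion: f64,
}

/// weightedCount = unlabeled * weightUnlabeled + labeled * weightLabeled + reversions * weightReversion
pub fn weighted_count(c: &PrivateMutationCounts, p: &WeightedThresholdParams) -> f64 {
    c.unlabeled as f64 * p.weight_unlabeled
        + c.labeled as f64 * p.weight_labeled
        + c.reversions as f64 * p.weight_reversion
}

/// strategyScore = max(0, weightedCount - threshold) * weight / threshold
pub fn strategy_score(weighted_count: f64, p: &WeightedThresholdParams) -> f64 {
    // Assumption: a non-positive threshold yields no score instead of dividing by zero.
    if p.threshold <= 0.0 {
        return 0.0;
    }
    let excess = (weighted_count - p.threshold).max(0.0);
    excess * p.weight / p.threshold
}

/// Final QC score = strategyScore * scoreWeight
pub fn qc_score(c: &PrivateMutationCounts, p: &WeightedThresholdParams, score_weight: f64) -> f64 {
    strategy_score(weighted_count(c, p), p) * score_weight
}
```

With weights (1.0, 1.0, 2.0), threshold 20, weight 1.0 and scoreWeight 100, a sample with 12 unlabeled mutations, 6 labeled mutations and 6 reversions has weightedCount = 30, excess = 10, strategyScore = 0.5 and a final QC score of 50.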

Configuration

In pathogen.json, configure under qc.recombinants:

"qc": {
  "recombinants": {
    "enabled": true,
    "scoreWeight": 100,
    "weightedThreshold": {
      "enabled": true,
      "threshold": 20,
      "weight": 1.0,
      "weightUnlabeled": 1.0,
      "weightLabeled": 1.0,
      "weightReversion": 2.0
    }
  }
}

Parameters:

  • enabled: Master switch for recombinant detection
  • scoreWeight: Multiplier for the rule's contribution to overall QC score
  • weightedThreshold.enabled: Enable this specific strategy
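  • weightedThreshold.threshold: Weighted count above which a sample is flagged as a potential recombinant
  • weightedThreshold.weight: Scaling factor applied to the excess when computing strategyScore
  • weightedThreshold.weightUnlabeled, weightLabeled, weightReversion: Per-mutation-type weights used in weightedCount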
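For orientation, the configuration above could map onto Rust structs roughly like this. It is a sketch under assumptions: the struct names, the use of Option, and the helper method are hypothetical, the defaults for threshold and weight are copied from the example config rather than from confirmed defaults, and deserialization (for example with serde) is left out so the block stays standard-library only.

```rust
/// Settings for the weighted threshold strategy (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct WeightedThresholdConfig {
    pub enabled: bool,
    pub threshold: f64,
    pub weight: f64,
    pub weight_unlabeled: f64,
    pub weight_labeled: f64,
    pub weight_reversion: f64,
}

impl Default for WeightedThresholdConfig {
    fn default() -> Self {
        Self {
            enabled: true,
            // Assumption: threshold and weight are taken from the example config above.
            threshold: 20.0,
            weight: 1.0,
            // Default per-type weights (1.0, 1.0, 2.0) as stated in the Mechanism section.
            weight_unlabeled: 1.0,
            weight_labeled: 1.0,
            weight_reversion: 2.0,
        }
    }
}

/// Settings under qc.recombinants (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct RecombinantsConfig {
    pub enabled: bool,
    pub score_weight: f64,
    pub weighted_threshold: Option<WeightedThresholdConfig>,
}

impl RecombinantsConfig {
    /// Assumption: the strategy runs only when both the master switch and
    /// the strategy's own `enabled` flag are set.
    pub fn is_weighted_threshold_active(&self) -> bool {
        self.enabled && self.weighted_threshold.as_ref().map_or(false, |s| s.enabled)
    }
}
```

Modelling the strategy block as an Option is one way to let datasets omit it entirely. Whether the actual config structs do this is not shown here.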

Advantages

  • Simple, interpretable algorithm with clear biological rationale
  • Configurable weights allow tuning for different pathogens
  • Low computational overhead - O(1) calculation from pre-computed counts
  • Backward compatible with existing datasets via fallback to the legacy mutationsThreshold setting
  • Reversion weighting captures a key recombination signature

Limitations

  • Does not consider spatial distribution of mutations along the genome
  • Cannot distinguish recombination from convergent evolution
  • Requires careful threshold tuning per pathogen to avoid false positives
  • Single threshold may not suit all lineage combinations
  • No direct evidence of breakpoint locations

Comparison to Other Strategies

Strategy A is the simplest approach, using only mutation counts without positional information. Other strategies extend detection capabilities:

  • B (Spatial Uniformity): Uses coefficient of variation to detect clustered mutations - recombinants often show mutations concentrated in segments from the divergent parent
  • C (Cluster Gaps): Analyzes gaps between SNP clusters to identify potential breakpoint regions
  • D (Reversion Clustering): Looks for spatially clustered reversions, which indicate a segment from a more ancestral lineage
  • E (Multi-Ancestor): Compares to multiple reference sequences to find regions matching different ancestors
  • F (Label Switching): Detects when different genome regions show mutations characteristic of different lineages

Strategy A serves as a baseline detector. For comprehensive recombinant identification, combine with spatial strategies (B, C, D) when available.
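To make the coefficient of variation mentioned for strategy B concrete, here is a small sketch that computes it over the gaps between consecutive mutation positions. How strategy B actually applies this statistic is not specified in this PR, so the choice of gaps as the input, the population variance, and the function name are all assumptions.

```rust
/// Coefficient of variation (standard deviation / mean) of the gaps between
/// consecutive mutation positions. Evenly spread positions give a value near 0;
/// tightly clustered positions with long empty stretches give a large value.
/// Returns None when there are fewer than two gaps or the mean gap is zero.
pub fn gap_coefficient_of_variation(positions: &[usize]) -> Option<f64> {
    let mut sorted = positions.to_vec();
    sorted.sort_unstable();

    // Distances between neighbouring positions along the genome.
    let gaps: Vec<f64> = sorted.windows(2).map(|w| (w[1] - w[0]) as f64).collect();
    if gaps.len() < 2 {
        return None;
    }

    let n = gaps.len() as f64;
    let mean = gaps.iter().sum::<f64>() / n;
    if mean == 0.0 {
        return None;
    }

    // Population variance of the gaps.
    let variance = gaps.iter().map(|g| (g - mean).powi(2)).sum::<f64>() / n;
    Some(variance.sqrt() / mean)
}
```

Positions 100, 200, 300, 400 have identical gaps and a coefficient of variation of 0. Positions 100, 105, 110, 5000 have one long gap after a tight cluster and give a value above 1.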

Implementation Summary

Files modified:

  • packages/nextclade/src/qc/qc_config.rs - Strategy config structs with defaults and backward compatibility for legacy mutationsThreshold
  • packages/nextclade/src/qc/qc_rule_recombinants.rs - Main rule implementation with weighted count calculation and scoring
  • packages/nextclade/src/qc/qc_recomb_utils.rs - Shared utilities for position clustering and spatial analysis (foundation for strategies B-D)
  • packages/nextclade/src/qc/mod.rs - Module exports
  • packages/nextclade/src/qc/qc_run.rs - Integration into QC pipeline
  • packages/nextclade-web/src/helpers/formatQCRecombinants.ts - Web UI message formatting with weighted count details
  • packages/nextclade-web/src/components/Results/* - UI integration
  • packages/nextclade-schemas/*.json,yaml - Updated JSON schemas

Test dataset added:

  • data/recomb/enpen/enterovirus/ev-d68/ - Enterovirus D68 dataset with recombinants rule enabled for testing the weighted threshold strategy

Future Work

  • Add unit tests for edge cases (zero mutations, extreme weights)
  • Implement strategies B-F for comprehensive recombination detection
  • Add visualization of mutation types in web UI results table
  • Consider adaptive thresholds based on genome length and diversity
  • Explore machine learning approaches trained on known recombinants

Co-Authored-By: Claude <noreply@anthropic.com>

The mutations_threshold field is still used for backward compatibility
in the get_weighted_threshold_config() method. Since no stable public
interface is being maintained, the deprecation warning is unnecessary,
and it causes CI failures when warnings are treated as errors.
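For readers unfamiliar with this kind of fallback, the sketch below shows one way a legacy threshold could be promoted into a weighted threshold config. It is a guess, not the code in qc_config.rs: the struct shapes, the method signature, and the choice of equal weights of 1.0 for the legacy case are all assumptions.

```rust
/// Weighted threshold settings (hypothetical shape, for illustration only).
#[derive(Debug, Clone, PartialEq)]
pub struct WeightedThresholdConfig {
    pub threshold: f64,
    pub weight: f64,
    pub weight_unlabeled: f64,
    pub weight_labeled: f64,
    pub weight_reversion: f64,
}

/// Recombinants rule settings with both the legacy and the new field (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct RecombinantsConfig {
    /// Legacy field, kept without a deprecation attribute so builds that
    /// treat warnings as errors do not fail on its remaining internal use.
    pub mutations_threshold: Option<usize>,
    pub weighted_threshold: Option<WeightedThresholdConfig>,
}

impl RecombinantsConfig {
    /// Prefers the explicit weighted threshold config. Otherwise falls back to
    /// the legacy threshold, assuming equal weights so that the weighted count
    /// reduces to a plain count of private mutations.
    pub fn get_weighted_threshold_config(&self) -> Option<WeightedThresholdConfig> {
        self.weighted_threshold.clone().or_else(|| {
            self.mutations_threshold.map(|t| WeightedThresholdConfig {
                threshold: t as f64,
                weight: 1.0,
                weight_unlabeled: 1.0,
                weight_labeled: 1.0,
                weight_reversion: 1.0,
            })
        })
    }
}
```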

Co-Authored-By: Claude <noreply@anthropic.com>
@ivan-aksamentov
Member Author

Test with strategy-specific dataset:

Preview with EV-D68 test dataset
