LDLR VUS Classifier

Machine-learning triage of LDLR variants of uncertain significance (VUS) using predictor concordance and ACMG-aligned evidence mapping.

Overview

This pipeline classifies 1,092 LDLR VUS from ClinVar using a calibrated XGBoost model trained on 2,870 pathogenic/benign variants with 68 engineered features. It integrates multi-source in silico annotations, explicit predictor concordance checking between REVEL and AlphaMissense, and maps outputs to ACMG/AMP PP3/BP4 evidence codes.

Key results

| Metric | All variants | Missense only |
|---|---|---|
| ROC-AUC (nested CV) | 0.994 | 0.973 |
| External validation (FH-VCEP) | 0.986 | 0.975 |

| VUS classification | n | % |
|---|---|---|
| Pathogenic-leaning | 485 | 44.4% |
| Benign-leaning | 268 | 24.5% |
| Predictor-discordant | 170 | 15.6% |
| Unresolved | 169 | 15.5% |

Pipeline architecture

1. Load ClinVar       2. Annotate (API)      3. Splice annotation
   3,962 variants        REVEL, CADD, AM        VEP SpliceRegion
   LDLR_3962 sheet       PP2, SIFT, MT          636 non-coding SNVs
        │                     │                       │
        ▼                     ▼                       ▼
4. gnomAD expansion    5. Feature engineering  6. Train models
   pseudo-benign          45 raw → 68 encoded     RF + XGBoost + SVM
   AF ≥ 1e-4              9 feature groups        nested 5×3 CV
        │                     │                       │
        ▼                     ▼                       ▼
7. Calibrate           8. Classify VUS         9. Save outputs
   isotonic regression    18-tier framework       vus_predictions.csv
   cv = 5                 concordance + splice    model + metrics

Repository structure

ldlr-vus-classifier/
├── README.md
├── LICENSE
├── requirements.txt
├── data/
│   └── input/
│       └── clinvar_all_classes_genes_cmg.xlsx    # ClinVar LDLR variants
├── src/
│   ├── pipeline/
│   │   ├── ldlr_pipeline.py                      # Core pipeline
│   │   └── annotate.py                           # Batch API annotation
│   └── analysis/
│       ├── run_ablation.py                       # Feature-group ablation
│       ├── run_shap.py                           # SHAP interpretation
│       ├── run_pp3bp4.py                         # ACMG PP3/BP4 mapping
│       ├── run_external_validation.py            # FH-VCEP validation
│       └── run_split_analysis.py                 # Split-half stability
└── output/
    ├── figures/                                  # Publication figures (PNG)
    ├── vus_predictions.csv                       # VUS classifications
    ├── vus_predictions_with_acmg.csv             # + ACMG evidence codes
    ├── model_metrics.csv                         # Nested CV metrics
    ├── model_summary.csv                         # Mean AUC per model
    ├── feature_importance.csv                    # Feature importances
    ├── ablation_study.csv                        # Ablation results
    ├── ablation_summary.csv                      # Ablation summary
    ├── shap_feature_importance.csv               # SHAP values
    ├── acmg_summary.csv                          # PP3/BP4 code counts
    ├── acmg_tier_crosstab.csv                    # Tier x PP3/BP4 crosstab
    ├── external_validation_metrics.csv           # FH-VCEP metrics
    ├── external_validation_predictions.csv       # FH-VCEP per-variant
    ├── external_validation_stratified.csv        # Per-consequence AUC
    ├── ldlr_annotations.csv                      # In silico scores
    └── gnomad_pseudo_benign_missense.csv         # gnomAD expansion set

Quick start

Installation

git clone https://github.com/glbala87/ldlr-vus-classifier.git
cd ldlr-vus-classifier
pip install -r requirements.txt

Run the pipeline

# Step 1: Core pipeline (load, annotate, train, classify)
python3 src/pipeline/ldlr_pipeline.py

# Step 2: Post-hoc analyses (run after pipeline completes)
python3 src/analysis/run_ablation.py
python3 src/analysis/run_shap.py
python3 src/analysis/run_pp3bp4.py
python3 src/analysis/run_external_validation.py
python3 src/analysis/run_split_analysis.py

All API responses are cached in output/annotation_cache.json on first run. Subsequent runs are fully offline and deterministic (random seed = 42).
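The cache-first behaviour described above can be sketched as a simple load-or-fetch helper (function and parameter names here are illustrative, not the pipeline's actual API; only the cache path comes from the text):

```python
import json
import os

def cached_annotations(fetch_fn, cache_path="output/annotation_cache.json"):
    """Return cached annotations if present; otherwise fetch and cache them.

    fetch_fn is whatever callable performs the live API calls on a cache miss.
    """
    if os.path.exists(cache_path):
        with open(cache_path) as fh:
            return json.load(fh)           # subsequent runs: offline, deterministic
    data = fetch_fn()                      # first run: hit the external APIs
    os.makedirs(os.path.dirname(cache_path), exist_ok=True)
    with open(cache_path, "w") as fh:
        json.dump(data, fh)
    return data
```

On the second call the fetch function is never invoked, which is what makes re-runs fully offline.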

Methods summary

Data

  • Source: ClinVar LDLR variants (sheet LDLR_3962)
  • Training: 1,880 P/LP + 990 B/LB = 2,870 labelled variants
  • Target: 1,092 VUS (776 missense, 316 non-missense)

Annotation

Two external APIs retrieve in silico predictor scores:

  • myvariant.info: REVEL, CADD, PolyPhen-2, SIFT, AlphaMissense, MutationTaster, GERP++ (via dbNSFP v4)
  • Ensembl VEP: protein position, amino-acid change, canonical consequence, splice-region annotation
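A batch request to myvariant.info can be sketched as below. The field paths follow dbNSFP's naming and the variant identifiers are illustrative; the pipeline's actual field list and query code may differ.

```python
def build_myvariant_query(hgvs_ids,
                          fields=("dbnsfp.revel.score",
                                  "cadd.phred",
                                  "dbnsfp.alphamissense.score")):
    """Build the POST payload for myvariant.info's batch variant endpoint."""
    return {
        "ids": ",".join(hgvs_ids),     # comma-separated HGVS g. identifiers
        "fields": ",".join(fields),    # restrict the response to requested scores
    }

payload = build_myvariant_query(["chr19:g.11200236G>A", "chr19:g.11217317C>T"])
```

Restricting `fields` keeps responses small, which matters when annotating thousands of variants in batches.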

Feature engineering

45 base features expand to 68 after one-hot encoding:

| Feature group | Pre-encode | Post-encode |
|---|---|---|
| Sequence | 6 | 6 |
| Allele frequency | 9 | 13 |
| Molecular consequence | 1 | 14 |
| ClinVar variant class | 1 | 7 |
| In silico predictors | 14 | 14 |
| Amino-acid biochemistry | 7 | 7 |
| Splice indicator | 1 | 1 |
| Allele origin | 1 | 1 |
| Interaction features | 4 | 4 |
| **Total** | **45** | **68** |
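The one-hot expansion step can be illustrated with a toy frame (column names here are invented for the example; they are not the pipeline's real feature names):

```python
import pandas as pd

# One numeric feature passes through unchanged; one categorical feature with
# two levels expands into two indicator columns, so 2 raw columns become 3.
raw = pd.DataFrame({
    "revel": [0.91, 0.12],
    "consequence": ["missense_variant", "synonymous_variant"],
})
encoded = pd.get_dummies(raw, columns=["consequence"])
```

Applied across the categorical groups above (molecular consequence, ClinVar class, allele-frequency bins), the same mechanism takes the feature count from 45 to 68.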

Models

Three classifiers trained under nested 5x3 cross-validation:

  • RandomForest (scikit-learn, class_weight="balanced")
  • XGBoost (inverse-frequency sample weights)
  • SVM (RBF/linear kernels, StandardScaler, class_weight="balanced")

Best model (XGBoost) is isotonically calibrated (cv=5) for probability estimation.
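The calibration step can be sketched with scikit-learn; `GradientBoostingClassifier` stands in for XGBoost here so the example needs only scikit-learn, and the data is synthetic:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the labelled P/LP vs B/LB training set.
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# Isotonic calibration with 5-fold internal CV, as described above.
base = GradientBoostingClassifier(random_state=42)
calibrated = CalibratedClassifierCV(base, method="isotonic", cv=5)
calibrated.fit(X, y)
proba = calibrated.predict_proba(X)[:, 1]  # calibrated P(pathogenic)
```

Isotonic regression remaps the raw classifier scores monotonically so that the output probabilities are usable as evidence weights downstream.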

Predictor concordance

REVEL and AlphaMissense are checked for agreement before classification:

  • Concordant pathogenic: REVEL >= 0.7 AND AM >= 0.564
  • Concordant benign: REVEL < 0.3 AND AM < 0.34
  • Discordant: flagged, not classified (170 VUS = 15.6%)
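The three rules above translate directly into a small function (a sketch; the pipeline's actual function name and labels are assumptions, and it may distinguish intermediate scores from true discordance):

```python
def concordance(revel, am):
    """Classify REVEL / AlphaMissense agreement using the thresholds above."""
    if revel >= 0.7 and am >= 0.564:
        return "concordant_pathogenic"
    if revel < 0.3 and am < 0.34:
        return "concordant_benign"
    # Everything else is flagged rather than classified.
    return "discordant"
```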

18-tier classification

VUS are routed through an 18-tier framework based on calibrated P(pathogenic), concordance label, and splice flag. Probability cutoffs (0.90/0.80/0.65/0.35) approximate the Bayesian odds thresholds of Tavtigian et al.
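Routing by probability alone can be sketched with the four cutoffs quoted above. The band names are illustrative; the real 18-tier framework also folds in the concordance label and splice flag:

```python
def probability_band(p):
    """Map calibrated P(pathogenic) to a coarse band via the 0.90/0.80/0.65/0.35 cutoffs."""
    if p >= 0.90:
        return "pathogenic-leaning (strong)"
    if p >= 0.80:
        return "pathogenic-leaning (moderate)"
    if p >= 0.65:
        return "pathogenic-leaning (supporting)"
    if p <= 0.35:
        return "benign-leaning"
    return "unresolved"
```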

ACMG PP3/BP4 mapping

Each VUS is mapped to Pejaver-calibrated REVEL thresholds (PP3_Strong >= 0.932, PP3_Moderate >= 0.773, PP3_Supporting >= 0.644, BP4_Supporting <= 0.290, BP4_Moderate <= 0.183, BP4_Strong <= 0.016) and AlphaMissense boundaries (PP3_Supporting >= 0.564, BP4_Supporting < 0.34).

Key output files

| File | Description |
|---|---|
| vus_predictions.csv | All 1,092 VUS with probabilities, tiers, concordance |
| vus_predictions_with_acmg.csv | Same + PP3/BP4 evidence codes |
| model_metrics.csv | Per-fold nested CV metrics (AUC, F1, Brier) |
| external_validation_metrics.csv | FH-VCEP held-out performance |
| ablation_summary.csv | Feature-group contribution (delta-AUC) |
| shap_feature_importance.csv | SHAP feature importance values |

External validation

The model was retrained on 2,651 non-expert-panel ClinVar variants and tested on 219 ClinGen FH-VCEP expert-panel-reviewed variants:

| Subset | n | ROC-AUC | F1 | Precision | Sensitivity |
|---|---|---|---|---|---|
| All | 219 | 0.986 | 0.960 | 0.994 | 0.929 |
| Missense | 161 | 0.975 | 0.970 | 0.993 | 0.947 |

Citation

If you use this pipeline, please cite:

Bala, Faisal, Nader, et al. Machine-Learning Triage of LDLR Variants of Uncertain Significance Using Predictor Concordance and ACMG-Aligned Evidence Mapping. 2026. [manuscript in preparation]

License

MIT License. See LICENSE for details.

Acknowledgements
