Machine-learning triage of LDLR variants of uncertain significance (VUS) using predictor concordance and ACMG-aligned evidence mapping.
This pipeline classifies 1,092 LDLR VUS from ClinVar using a calibrated XGBoost model trained on 2,870 pathogenic/benign variants with 68 engineered features. It integrates multi-source in silico annotations, explicitly checks predictor concordance between REVEL and AlphaMissense, and maps outputs to ACMG/AMP PP3/BP4 evidence codes.
| Metric | All variants | Missense only |
|---|---|---|
| ROC-AUC (nested CV) | 0.994 | 0.973 |
| External validation (FH-VCEP) | 0.986 | 0.975 |

| VUS classification | n | % |
|---|---|---|
| Pathogenic-leaning | 485 | 44.4% |
| Benign-leaning | 268 | 24.5% |
| Predictor-discordant | 170 | 15.6% |
| Unresolved | 169 | 15.5% |
1. Load ClinVar: 3,962 variants, LDLR_3962 sheet
2. Annotate (API): REVEL, CADD, AM, PP2, SIFT, MT
3. Splice annotation: VEP SpliceRegion, 636 non-coding SNVs
4. gnomAD expansion: pseudo-benign, AF ≥ 1e-4
5. Feature engineering: 45 raw → 68 encoded, 9 feature groups
6. Train models: RF + XGBoost + SVM, nested 5×3 CV
7. Calibrate: isotonic regression, cv = 5
8. Classify VUS: 18-tier framework, concordance + splice
9. Save outputs: vus_predictions.csv, model + metrics
ldlr-vus-classifier/
├── README.md
├── LICENSE
├── requirements.txt
├── data/
│   └── input/
│       └── clinvar_all_classes_genes_cmg.xlsx  # ClinVar LDLR variants
├── src/
│   ├── pipeline/
│   │   ├── ldlr_pipeline.py  # Core pipeline
│   │   └── annotate.py  # Batch API annotation
│   └── analysis/
│       ├── run_ablation.py  # Feature-group ablation
│       ├── run_shap.py  # SHAP interpretation
│       ├── run_pp3bp4.py  # ACMG PP3/BP4 mapping
│       ├── run_external_validation.py  # FH-VCEP validation
│       └── run_split_analysis.py  # Split-half stability
└── output/
    ├── figures/  # Publication figures (PNG)
    ├── vus_predictions.csv  # VUS classifications
    ├── vus_predictions_with_acmg.csv  # + ACMG evidence codes
    ├── model_metrics.csv  # Nested CV metrics
    ├── model_summary.csv  # Mean AUC per model
    ├── feature_importance.csv  # Feature importances
    ├── ablation_study.csv  # Ablation results
    ├── ablation_summary.csv  # Ablation summary
    ├── shap_feature_importance.csv  # SHAP values
    ├── acmg_summary.csv  # PP3/BP4 code counts
    ├── acmg_tier_crosstab.csv  # Tier x PP3/BP4 crosstab
    ├── external_validation_metrics.csv  # FH-VCEP metrics
    ├── external_validation_predictions.csv  # FH-VCEP per-variant
    ├── external_validation_stratified.csv  # Per-consequence AUC
    ├── ldlr_annotations.csv  # In silico scores
    └── gnomad_pseudo_benign_missense.csv  # gnomAD expansion set
git clone https://github.com/glbala87/ldlr-vus-classifier.git
cd ldlr-vus-classifier
pip install -r requirements.txt
# Step 1: Core pipeline (load, annotate, train, classify)
python3 src/pipeline/ldlr_pipeline.py
# Step 2: Post-hoc analyses (run after pipeline completes)
python3 src/analysis/run_ablation.py
python3 src/analysis/run_shap.py
python3 src/analysis/run_pp3bp4.py
python3 src/analysis/run_external_validation.py
python3 src/analysis/run_split_analysis.py
All API responses are cached in output/annotation_cache.json on first run. Subsequent runs are fully offline and deterministic (random seed = 42).
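The caching behaviour described above can be illustrated with a minimal sketch. It assumes the cache is a flat JSON object keyed by variant identifier; the function names `load_cache` and `cached_annotation`, the `fetch` callback, and the placeholder score are hypothetical and are not taken from `annotate.py`.

```python
import json
import tempfile
from pathlib import Path


def load_cache(cache_path):
    """Return cached annotations, or an empty dict on the very first run."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return json.loads(cache_path.read_text(encoding="utf-8"))
    return {}


def cached_annotation(variant_id, cache, cache_path, fetch):
    """Return one variant's annotation, calling fetch() only on a cache miss."""
    if variant_id not in cache:
        cache[variant_id] = fetch(variant_id)  # the only step that needs the network
        Path(cache_path).write_text(
            json.dumps(cache, indent=2, sort_keys=True), encoding="utf-8"
        )
    return cache[variant_id]


# Demo with a stand-in fetch function, so no network access is needed.
calls = []


def fake_fetch(variant_id):
    calls.append(variant_id)
    return {"revel": 0.5}  # placeholder score, not real data


with tempfile.TemporaryDirectory() as tmp:
    cache_path = Path(tmp) / "annotation_cache.json"
    first = cached_annotation("example-variant", load_cache(cache_path), cache_path, fake_fetch)
    second = cached_annotation("example-variant", load_cache(cache_path), cache_path, fake_fetch)
```

In this sketch the second lookup is served from disk, so `fake_fetch` is called once. Writing the JSON with sorted keys keeps the cache file stable across runs.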
- Source: ClinVar LDLR variants (sheet LDLR_3962)
- Training: 1,880 P/LP + 990 B/LB = 2,870 labelled variants
- Target: 1,092 VUS (776 missense, 316 non-missense)
Two external APIs retrieve in silico predictor scores:
- myvariant.info: REVEL, CADD, PolyPhen-2, SIFT, AlphaMissense, MutationTaster, GERP++ (via dbNSFP v4)
- Ensembl VEP: protein position, amino-acid change, canonical consequence, splice-region annotation
45 base features expand to 68 after one-hot encoding:
| Feature group | Pre-encode | Post-encode |
|---|---|---|
| Sequence | 6 | 6 |
| Allele frequency | 9 | 13 |
| Molecular consequence | 1 | 14 |
| ClinVar variant class | 1 | 7 |
| In silico predictors | 14 | 14 |
| Amino-acid biochemistry | 7 | 7 |
| Splice indicator | 1 | 1 |
| Allele origin | 1 | 1 |
| Interaction features | 4 | 4 |
| Total | 45 | 68 |
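The expansion in the table comes from one-hot encoding: a single categorical column becomes one 0/1 indicator column per category, while numeric columns pass through unchanged. The sketch below shows the idea in plain Python. The column names and values are hypothetical toy data, not the pipeline's real schema, and the pipeline itself may use a library encoder instead.

```python
def one_hot_encode(rows, column):
    """Replace one categorical column with a 0/1 indicator column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}={category}"] = int(row[column] == category)
        encoded.append(new_row)
    return encoded


# Toy rows (hypothetical values): 2 columns before encoding.
rows = [
    {"revel": 0.91, "consequence": "missense_variant"},
    {"revel": 0.12, "consequence": "synonymous_variant"},
    {"revel": 0.55, "consequence": "splice_region_variant"},
]
encoded = one_hot_encode(rows, "consequence")
```

With three distinct consequence values, the two input columns become four. This is the same mechanism by which one molecular-consequence column yields 14 encoded columns in the table.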
Three classifiers are trained under nested 5×3 cross-validation:
- RandomForest (scikit-learn, class_weight="balanced")
- XGBoost (inverse-frequency sample weights)
- SVM (RBF/linear kernels, StandardScaler, class_weight="balanced")
The best model (XGBoost) is then calibrated with isotonic regression (cv=5) to produce probability estimates.
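A minimal sketch of nested cross-validation followed by isotonic calibration, using scikit-learn. It rests on several assumptions: "5×3" is read as 5 outer folds and 3 inner folds, a RandomForest stands in for XGBoost so the example needs only scikit-learn, the hyperparameter grid is invented for illustration, and the data are synthetic rather than the 2,870 labelled variants.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

SEED = 42

# Synthetic stand-in for the training matrix (NOT the real variant data).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=8,
    weights=[0.35, 0.65], random_state=SEED,
)

# Inner loop (3 folds) tunes hyperparameters; outer loop (5 folds) estimates AUC.
inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=SEED)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)

search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=SEED),
    param_grid={"n_estimators": [50], "max_depth": [3, None]},  # illustrative grid
    scoring="roc_auc",
    cv=inner,
)

# Each outer fold reruns the full inner search, so test folds never influence tuning.
outer_auc = cross_val_score(search, X, y, scoring="roc_auc", cv=outer)

# Isotonic calibration with 5-fold internal splitting, as in the text (cv=5).
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=SEED),
    method="isotonic",
    cv=5,
)
calibrated.fit(X, y)
proba = calibrated.predict_proba(X)[:, 1]  # calibrated P(class 1)
```

Keeping the hyperparameter search inside the outer loop is what makes the reported AUC an estimate for the whole tuning procedure rather than for one tuned model.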
REVEL and AlphaMissense are checked for agreement before classification:
- Concordant pathogenic: REVEL >= 0.7 AND AM >= 0.564
- Concordant benign: REVEL < 0.3 AND AM < 0.34
- Discordant: flagged, not classified (170 VUS = 15.6%)
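The concordance rules above can be sketched as a small function. The thresholds are the ones listed; the label strings and the handling of missing scores are hypothetical. The sketch also assumes that every combination that is neither concordant pathogenic nor concordant benign is flagged as discordant, which may be broader than the pipeline's own definition.

```python
def concordance_label(revel, alphamissense):
    """Label REVEL/AlphaMissense agreement using the thresholds listed above."""
    if revel is None or alphamissense is None:
        # Assumption: variants lacking either score are not concordance-checked.
        return "not_applicable"
    if revel >= 0.7 and alphamissense >= 0.564:
        return "concordant_pathogenic"
    if revel < 0.3 and alphamissense < 0.34:
        return "concordant_benign"
    # Assumption: all remaining combinations are flagged as discordant.
    return "discordant"


labels = [
    concordance_label(0.90, 0.80),  # both in the pathogenic range
    concordance_label(0.10, 0.10),  # both in the benign range
    concordance_label(0.90, 0.10),  # predictors disagree
    concordance_label(None, 0.80),  # REVEL missing
]
```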
VUS are routed through an 18-tier framework based on calibrated P(pathogenic), concordance label, and splice flag. Probability cutoffs (0.90/0.80/0.65/0.35) approximate the Bayesian odds thresholds of Tavtigian et al.
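As a rough illustration of the probability component only, the four cutoffs split calibrated P(pathogenic) into five bands. The band names below are neutral placeholders, not the pipeline's tier names, and treating each cutoff as inclusive on its upper band is an assumption. The full 18-tier routing, which also uses the concordance label and splice flag, is not reproduced here.

```python
from bisect import bisect_right

CUTOFFS = (0.35, 0.65, 0.80, 0.90)  # cutoffs from the text, in ascending order
BANDS = (  # placeholder names, one per interval
    "below_0.35",
    "0.35_to_0.65",
    "0.65_to_0.80",
    "0.80_to_0.90",
    "at_least_0.90",
)


def probability_band(p_pathogenic):
    """Return the probability band for a calibrated P(pathogenic)."""
    if not 0.0 <= p_pathogenic <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    # bisect_right counts the cutoffs that are <= p, which is the band index.
    return BANDS[bisect_right(CUTOFFS, p_pathogenic)]


bands = [probability_band(p) for p in (0.20, 0.35, 0.70, 0.85, 0.90)]
```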
Each VUS is mapped to Pejaver-calibrated REVEL thresholds (PP3_Strong >= 0.932, PP3_Moderate >= 0.773, PP3_Supporting >= 0.644, BP4_Supporting <= 0.290, BP4_Moderate <= 0.183, BP4_Strong <= 0.016) and AlphaMissense boundaries (PP3_Supporting >= 0.564, BP4_Supporting < 0.34).
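The threshold mapping above can be sketched as two lookup functions. The thresholds are copied from the text; the function names and the use of `None` for scores that fall in no evidence range are hypothetical choices for this example, not necessarily what `run_pp3bp4.py` does.

```python
def revel_evidence(revel):
    """Map a REVEL score to a PP3/BP4 evidence code, or None if indeterminate."""
    if revel >= 0.932:
        return "PP3_Strong"
    if revel >= 0.773:
        return "PP3_Moderate"
    if revel >= 0.644:
        return "PP3_Supporting"
    # Benign side: test the most stringent (lowest) threshold first.
    if revel <= 0.016:
        return "BP4_Strong"
    if revel <= 0.183:
        return "BP4_Moderate"
    if revel <= 0.290:
        return "BP4_Supporting"
    return None  # 0.290 < REVEL < 0.644: no evidence code


def alphamissense_evidence(score):
    """Map an AlphaMissense score to a supporting-level code, or None."""
    if score >= 0.564:
        return "PP3_Supporting"
    if score < 0.34:
        return "BP4_Supporting"
    return None


codes = [revel_evidence(s) for s in (0.95, 0.80, 0.65, 0.50, 0.29, 0.10, 0.01)]
```

Checking thresholds from the most to the least stringent on each side ensures a score is always assigned the strongest code it qualifies for.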
| File | Description |
|---|---|
| vus_predictions.csv | All 1,092 VUS with probabilities, tiers, concordance |
| vus_predictions_with_acmg.csv | Same + PP3/BP4 evidence codes |
| model_metrics.csv | Per-fold nested CV metrics (AUC, F1, Brier) |
| external_validation_metrics.csv | FH-VCEP held-out performance |
| ablation_summary.csv | Feature-group contribution (delta-AUC) |
| shap_feature_importance.csv | SHAP feature importance values |
The model was retrained on 2,651 non-expert-panel ClinVar variants and tested on 219 ClinGen FH-VCEP expert-panel-reviewed variants:
| Subset | n | ROC-AUC | F1 | Precision | Sensitivity |
|---|---|---|---|---|---|
| All | 219 | 0.986 | 0.960 | 0.994 | 0.929 |
| Missense | 161 | 0.975 | 0.970 | 0.993 | 0.947 |
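As a quick consistency check on the table, F1 is the harmonic mean of precision and sensitivity, so it can be recomputed from the other two columns. The sketch below uses only the rounded values reported above; the function name is illustrative.

```python
def f1_from_precision_sensitivity(precision, sensitivity):
    """F1 is the harmonic mean of precision and sensitivity (recall)."""
    return 2 * precision * sensitivity / (precision + sensitivity)


f1_all = f1_from_precision_sensitivity(0.994, 0.929)       # "All" row
f1_missense = f1_from_precision_sensitivity(0.993, 0.947)  # "Missense" row
```

Both results agree with the reported F1 values to within 0.001. The small gap for the missense row is expected, because the inputs are themselves rounded to three decimals.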
If you use this pipeline, please cite:
Bala, Faisal, Nader et al. Machine-Learning Triage of LDLR Variants of Uncertain Significance Using Predictor Concordance and ACMG-Aligned Evidence Mapping. (2026). [manuscript in preparation]
MIT License. See LICENSE for details.
- ClinVar for variant data
- ClinGen FH-VCEP for expert-panel reviewed variants
- myvariant.info and Ensembl VEP for annotation APIs
- gnomAD for population frequency data