Source code for SeizeVar, a two-layer pipeline for mechanism-aware variant interpretation in monogenic epilepsy. Pairs a saturating pathogenicity head (Random Forest + ESM-2 LoRA cross-attention) with a dedicated gain-versus-loss-of-function mechanism classifier and a deterministic sodium-channel prescribing rule, scored under a leave-one-out dynamic-evidence framework that re-uses the live ClinVar reference at scoring time.
Companion paper: From pathogenicity to prescription: mechanism-aware variant interpretation in monogenic epilepsy — Ye and Chen, 2026.
This repository contains only source code and small reference tables. The full training data, trained model weights, and per-variant prediction tables are deposited on Zenodo (DOI to be assigned at acceptance) because they are too large for version control and partly subject to upstream redistribution constraints (AlphaFold v4, UniProt, ClinVar snapshots).
| Released here | Released on Zenodo |
|---|---|
| Pipeline source code (≈ 35 .py + 5 notebooks) | Raw inputs (data/, ≈ 450 MB) |
| Figure-generation scripts | Trained model weights (models_trained/, ≈ 580 MB) |
| Small reference tables (data_release/) | Per-stage intermediate outputs |
| Build scripts | Full per-variant prediction tables |
seizevar/
├── 01_data/code/ Data extraction, splits, augmentation, leakage audit
├── 02_features/code/ 39-feature computation (gene / residue / substitution / evidence)
├── 03_models/
│ ├── pathogenicity/code/ Random-Forest + ESM-2 LoRA training and inference
│ └── mechanism/code/ Gain-vs-loss-of-function mechanism head training
├── 05_vus_application/code/ Prospective scoring of the 29,293-variant VUS pool
├── 06_competitors/code/ AlphaMissense / REVEL / MetaRNN / LoGoFunc benchmarking
├── data_release/ Curated small-table release (full release on Zenodo)
├── notebooks/ 01_data_pipeline.ipynb, 02_features_pipeline.ipynb
├── infra/ Colab packing + training and scoring notebooks
└── figures/scripts/ Paper figure generation (Fig 1–6 + Fig S1–S7)
The scripts are stage-numbered and intended to be run in order. Each stage
reads from the previous stage's outputs/ directory and writes its own outputs/ for the next stage.
# 1. Extract ClinVar, build splits, audit leakage
python 01_data/code/01_extract_clinvar.py
python 01_data/code/02_build_splits.py
python 01_data/code/03_augment_train.py
python 01_data/code/04_build_extra_valsets.py
python 01_data/code/05_leakage_audit.py
# 2. Compute 39-dimensional feature matrix
python 02_features/code/compute_features.py
python 02_features/code/audit_features.py
# 3. Train models
python 03_models/pathogenicity/code/train_rf_pathogenicity.py
python 03_models/pathogenicity/code/train_esm_lora.py # GPU recommended (Colab)
python 03_models/mechanism/code/train_rf_mechanism.py
# 4. Score the prospective VUS pool
python 05_vus_application/code/predict_vus_rf.py
python 05_vus_application/code/predict_vus_esm.py
python 05_vus_application/code/merge_vus_predictions.py
# 5. Competitor comparison
python 06_competitors/code/fetch_am_for_vus.py
python 06_competitors/code/extract_logofunc.py
python 06_competitors/code/compute_auroc.py
python 06_competitors/code/mechanism_benchmark.py
The ESM-2 LoRA fine-tune is GPU-bound; infra/seizevar_colab_pro.ipynb
mirrors the local script for execution on Colab Pro. The matched VUS
scoring notebook is infra/seizevar_vus_colab.ipynb.
data_release/build_release.py regenerates the curated release tables.
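Because each stage depends on the outputs of the one before it, it can help to drive the scripts from a single runner that stops at the first failure. The following is a minimal sketch, not part of the repository: the script paths are copied from the command listing above, while `STAGE_SCRIPTS` and `run_in_order` are hypothetical names introduced here for illustration. It assumes the scripts are launched from the repository root with the current Python interpreter.

```python
import subprocess
import sys
from pathlib import Path

# Script paths copied from the command listing above. Collecting them in one
# flat, ordered list is an illustrative choice, not something the repository does.
STAGE_SCRIPTS = [
    # 1. Extract ClinVar, build splits, audit leakage
    "01_data/code/01_extract_clinvar.py",
    "01_data/code/02_build_splits.py",
    "01_data/code/03_augment_train.py",
    "01_data/code/04_build_extra_valsets.py",
    "01_data/code/05_leakage_audit.py",
    # 2. Compute 39-dimensional feature matrix
    "02_features/code/compute_features.py",
    "02_features/code/audit_features.py",
    # 3. Train models
    "03_models/pathogenicity/code/train_rf_pathogenicity.py",
    "03_models/pathogenicity/code/train_esm_lora.py",
    "03_models/mechanism/code/train_rf_mechanism.py",
    # 4. Score the prospective VUS pool
    "05_vus_application/code/predict_vus_rf.py",
    "05_vus_application/code/predict_vus_esm.py",
    "05_vus_application/code/merge_vus_predictions.py",
    # 5. Competitor comparison
    "06_competitors/code/fetch_am_for_vus.py",
    "06_competitors/code/extract_logofunc.py",
    "06_competitors/code/compute_auroc.py",
    "06_competitors/code/mechanism_benchmark.py",
]


def run_in_order(scripts, repo_root=".", dry_run=False):
    """Run each script in sequence, stopping at the first failure.

    Returns the commands that were (or, with dry_run=True, would be) executed.
    """
    root = Path(repo_root)
    commands = []
    for script in scripts:
        cmd = [sys.executable, script]
        commands.append(cmd)
        if dry_run:
            continue
        # check=True raises CalledProcessError on a non-zero exit code,
        # so later stages never run against incomplete outputs.
        subprocess.run(cmd, cwd=root, check=True)
    return commands
```

Calling `run_in_order(STAGE_SCRIPTS, dry_run=True)` lists the commands without executing anything, which is a cheap way to confirm the order before starting a long run. To resume after a failure, pass a slice such as `STAGE_SCRIPTS[5:]`.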
The scripts read from data/ and 01_data/outputs/, neither of which is in
this repository. Pull both from Zenodo at the DOI listed in the companion
paper, then unpack at the repository root before running.
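A quick pre-flight check can confirm that both directories were unpacked in the right place before any stage is started. This is a minimal sketch under stated assumptions: the two directory names come from the text above, `REQUIRED_DIRS` and `missing_inputs` are hypothetical names that do not exist in the repository, and an empty directory is treated as missing on the assumption that the Zenodo archive was not unpacked into it.

```python
from pathlib import Path

# Directories the text above says must be present at the repository root.
REQUIRED_DIRS = ("data", "01_data/outputs")


def missing_inputs(repo_root=".", required=REQUIRED_DIRS):
    """Return the required directories that are absent or empty under repo_root."""
    root = Path(repo_root)
    missing = []
    for rel in required:
        path = root / rel
        # Assumption: an empty directory means the archive was not unpacked here.
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(rel)
    return missing
```

For example, `missing_inputs(".")` run from the repository root returns an empty list once both archives are in place, and otherwise names the directories that still need to be pulled from Zenodo.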
If you use this code, please cite the companion paper:
Ye S, Chen P. From pathogenicity to prescription: mechanism-aware variant interpretation in monogenic epilepsy. Submitted, 2026.
MIT — see LICENSE.