LitVarAI is an AI-assisted pipeline for automating major components of a biomedical literature review. It discovers relevant articles, retrieves available full text, extracts article text and figure content, identifies reported genomic variants using a large language model, highlights supporting evidence, and compares extracted variants with ClinVar.

Telomere and telomerase-associated genes are used as a biomedical case study: CTC1, NAF1, NHP2, NOP10, PARN, POT1, RTEL1, TER/TERC, TERT, TINF2, TPP1/ACD, WRAP53, and ZCCHC8.

Summary of pipeline results:

| Metric | Result |
|---|---|
| Target genes | 13 |
| ClinVar variant records collected | 7,519 |
| ClinVar-linked article records | 664 |
| Telomerase Database article records | 146 |
| Merged unique article records | 730 |
| Articles with available PDFs | 617 |
| Articles with successfully extracted text | 617 |
| Articles with AI responses | 543 |
| AI responses containing at least one variant | 342 |
| AI-extracted variant rows | 2,181 |
| Unique comparable ClinVar variants | 7,517 |
| Unique comparable AI variants | 1,329 |
| Provisional common variants | 194 |
| Strict protein-concordant common variants | 183 |
| Common variants requiring review | 11 |

Article processing by gene:

| Gene | Merged Articles | PDFs Available | Missing PDFs | Text Extracted | AI Responses | AI Responses with Variants |
|---|---|---|---|---|---|---|
| CTC1 | 60 | 53 | 7 | 53 | 53 | 26 |
| NAF1 | 22 | 16 | 6 | 16 | 16 | 4 |
| NHP2 | 20 | 14 | 6 | 14 | 14 | 8 |
| NOP10 | 13 | 9 | 4 | 9 | 9 | 5 |
| PARN | 55 | 46 | 9 | 46 | 46 | 20 |
| POT1 | 83 | 70 | 13 | 70 | 0 | 0 |
| RTEL1 | 102 | 83 | 19 | 83 | 82 | 53 |
| TER | 1 | 1 | 0 | 1 | 1 | 1 |
| TERT | 246 | 217 | 29 | 217 | 214 | 166 |
| TINF2 | 59 | 51 | 8 | 51 | 51 | 35 |
| TPP1 | 32 | 28 | 4 | 28 | 28 | 12 |
| WRAP53 | 23 | 19 | 4 | 19 | 19 | 9 |
| ZCCHC8 | 14 | 10 | 4 | 10 | 10 | 3 |
| Total | 730 | 617 | 113 | 617 | 543 | 342 |

ClinVar and AI variant comparison by gene:

| Gene | ClinVar Variants | AI Variants | Provisional Common | Strict Common | Review Needed | ClinVar Only | AI Only |
|---|---|---|---|---|---|---|---|
| CTC1 | 732 | 121 | 14 | 14 | 0 | 718 | 107 |
| NAF1 | 191 | 27 | 1 | 1 | 0 | 190 | 26 |
| NHP2 | 98 | 37 | 4 | 4 | 0 | 94 | 33 |
| NOP10 | 32 | 5 | 4 | 4 | 0 | 28 | 1 |
| PARN | 401 | 106 | 11 | 11 | 0 | 390 | 95 |
| POT1 | 1,267 | 0 | 0 | 0 | 0 | 1,267 | 0 |
| RTEL1 | 1,842 | 292 | 26 | 23 | 3 | 1,816 | 266 |
| TER | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| TERT | 1,563 | 594 | 111 | 107 | 4 | 1,452 | 483 |
| TINF2 | 304 | 75 | 19 | 16 | 3 | 285 | 56 |
| TPP1 | 549 | 43 | 1 | 1 | 0 | 548 | 42 |
| WRAP53 | 260 | 25 | 2 | 2 | 0 | 258 | 23 |
| ZCCHC8 | 278 | 4 | 1 | 0 | 1 | 277 | 3 |
| Total | 7,517 | 1,329 | 194 | 183 | 11 | 7,323 | 1,135 |

"Review Needed" combines common variants with partial protein evidence and common variants with conflicting protein changes. Detailed and reproducible results are available in `pipeline_statistics/`, particularly `table_09_gene_level_mapping_summary.csv` and `table_01_all_clinvar_ai_common_variants.csv`.

LitVarAI was developed to investigate whether an automated workflow can:
- Construct a targeted biomedical literature corpus from multiple databases.
- Convert full-text articles and figures into machine-readable evidence.
- Extract structured variant and clinical information using an AI model.
- Link each extracted variant to supporting evidence from its source article.
- Compare literature-extracted variants with an established reference database.
- Generate transparent, auditable tables for manual review and publication.

The pipeline should be considered AI-assisted, rather than fully autonomous. Extracted variants and evidence should be reviewed by domain experts before clinical or research use.

| Step | Script | Purpose |
|---|---|---|
| 1 | `01_collect_clinvar_variants.py` | Collect ClinVar missense variants and linked citations |
| 2 | `02_collect_telomerase_database.py` | Collect variants and publications from the Telomerase Database |
| 3 | `03_extract_clinvar_pmids.py` | Extract and deduplicate ClinVar-linked PubMed identifiers |
| 4 | `04_merge_article_records.py` | Merge and deduplicate article records from both sources |
| 5 | `05_download_articles.py` | Retrieve available full-text articles |
| 6 | `06_extract_article_text.py` | Extract article text, figures, and figure OCR |
| 7 | `07_extract_variants_with_ai.py` | Extract structured variants and annotate supporting evidence |
| 8 | `08_evaluate_variant_extraction.py` | Compare normalized AI-extracted variants with ClinVar |
| 9 | `09_build_share_package.py` | Build an evidence package for manual review |
| 10 | `10_generate_pipeline_statistics.py` | Generate publication-ready summary and mapping tables |

    ClinVar variants and citations ─┐
                                    ├─> Merge article records
    Telomerase Database records ────┘
                                                  |
                                                  v
                                     Retrieve available articles
                                                  |
                                                  v
                                   Extract text, figures, and OCR
                                                  |
                                                  v
                                   AI-assisted variant extraction
                                                  |
                                                  v
                                Evidence annotation and manual review
                                                  |
                                                  v
                                  ClinVar normalization and mapping
                                                  |
                                                  v
                                 Publication-ready statistics tables
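The merge in step 4 can be pictured with a minimal sketch. It assumes each article record is a dictionary with hypothetical `pmid`, `gene`, and `source` keys, and that the PubMed identifier is the deduplication key; the actual `04_merge_article_records.py` may use different fields or matching rules. The PMIDs below are placeholders, not real records.

```python
def merge_article_records(*record_lists):
    """Merge article records from several sources, keeping one entry per PMID.

    Assumption: records are dicts with hypothetical 'pmid', 'gene', and
    'source' keys. Records without a PMID are skipped in this sketch.
    """
    merged = {}
    for records in record_lists:
        for record in records:
            pmid = str(record.get("pmid") or "").strip()
            if not pmid:
                continue
            # Create the entry on first sight, then accumulate genes and sources.
            entry = merged.setdefault(
                pmid, {"pmid": pmid, "genes": set(), "sources": set()}
            )
            entry["genes"].add(record["gene"])
            entry["sources"].add(record["source"])
    return sorted(merged.values(), key=lambda entry: entry["pmid"])


# Placeholder records for illustration only.
clinvar_records = [
    {"pmid": "11111111", "gene": "TERT", "source": "ClinVar"},
    {"pmid": "22222222", "gene": "RTEL1", "source": "ClinVar"},
]
telomerase_db_records = [
    {"pmid": "11111111", "gene": "TERT", "source": "Telomerase Database"},
    {"pmid": "33333333", "gene": "TERC", "source": "Telomerase Database"},
]

merged = merge_article_records(clinvar_records, telomerase_db_records)
print(len(merged))  # 3 unique articles; PMID 11111111 appears in both sources
```

Keeping the set of contributing sources on each merged entry is one way to preserve provenance for later auditing.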
Variants are normalized before comparison:

- ClinVar cDNA changes are parsed from `Variant_Name`.
- AI-extracted cDNA changes are read from `cDNA_Change`.
- Formatting differences such as spaces and parentheses are removed.
- Protein three-letter amino-acid codes are converted to one-letter codes.
- The normalized cDNA change is used as the primary identifier.
- The normalized protein change is used only when cDNA is unavailable.

A provisional common variant has the same normalized identifier in both sources. A strict common variant also has concordant protein-level evidence.
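The normalization rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `08_evaluate_variant_extraction.py`: the function names are hypothetical, the regular expressions are one plausible reading of "parsed from `Variant_Name`", and the example variant string is illustrative rather than taken from the results.

```python
import re

# Standard three-letter to one-letter amino-acid codes; "Ter" marks a stop.
AA3_TO_1 = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
    "Ter": "*",
}


def strip_formatting(text):
    """Remove whitespace and parentheses from a variant string."""
    return re.sub(r"[\s()]", "", text or "")


def to_one_letter(protein_change):
    """Convert three-letter residues to one-letter codes (p.Pro704Ser -> p.P704S)."""
    return re.sub(
        r"[A-Z][a-z]{2}",
        lambda match: AA3_TO_1.get(match.group(0), match.group(0)),
        protein_change,
    )


def cdna_from_variant_name(variant_name):
    """Pull the c. change out of a ClinVar-style name; return "" if none is found."""
    match = re.search(r"c\.[^\s()]+", variant_name or "")
    return match.group(0) if match else ""


def normalized_identifier(cdna_change, protein_change=""):
    """Prefer the normalized cDNA change; fall back to the protein change."""
    cdna = strip_formatting(cdna_change)
    if cdna:
        return cdna
    return to_one_letter(strip_formatting(protein_change))


# Illustrative inputs: a ClinVar-style name and a loosely formatted AI extraction.
clinvar_id = normalized_identifier(
    cdna_from_variant_name("NM_198253.3(TERT):c.2110C>T (p.Pro704Ser)")
)
ai_id = normalized_identifier("c.2110 C>T", "p.(Pro704Ser)")
protein_only_id = normalized_identifier("", "p.(Pro704Ser)")
print(clinvar_id == ai_id, protein_only_id)  # True p.P704S
```

Because the sketch compares strings without reference to a transcript, two sources that report the same change against different transcripts would not match, which is consistent with the limitation on transcript-aware HGVS normalization.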
Important limitations:
- ClinVar collection is restricted to missense variants.
- AI extraction includes all reported variant types.
- Transcript-aware HGVS normalization is not yet implemented.
- AI-only variants are not automatically false positives.
- Protein-conflicting overlaps require manual review.
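To make the mapping categories concrete, here is a small sketch of how the overlap between the two sources could be classified. It assumes each source has already been reduced to a dictionary from normalized identifier to normalized protein change (an empty string when no protein change is reported). The status labels and the reading of "partial protein evidence" as "protein change present in only one source" are assumptions made for illustration.

```python
def compare_variants(clinvar, ai):
    """Classify variants shared by, or unique to, two sources.

    Both arguments map a normalized identifier to a normalized protein
    change ("" when none is reported). Status names are hypothetical.
    """
    report = {
        "strict_common": [],
        "partial_protein_evidence": [],
        "conflicting_protein": [],
    }
    # Provisional common variants: same normalized identifier in both sources.
    for identifier in sorted(clinvar.keys() & ai.keys()):
        clinvar_protein = clinvar[identifier]
        ai_protein = ai[identifier]
        if not clinvar_protein or not ai_protein:
            status = "partial_protein_evidence"
        elif clinvar_protein == ai_protein:
            status = "strict_common"
        else:
            status = "conflicting_protein"
        report[status].append(identifier)
    # Unmatched variants are reported, not treated as errors.
    report["clinvar_only"] = sorted(clinvar.keys() - ai.keys())
    report["ai_only"] = sorted(ai.keys() - clinvar.keys())
    return report


# Toy identifiers for illustration only.
clinvar = {"c.1A>G": "p.M1V", "c.2T>C": "p.M1T", "c.5G>A": "p.R2H", "c.9C>T": ""}
ai = {"c.1A>G": "p.M1V", "c.2T>C": "", "c.5G>A": "p.R2Q", "c.100A>T": "p.K34*"}

report = compare_variants(clinvar, ai)
review_needed = report["partial_protein_evidence"] + report["conflicting_protein"]
print(report["strict_common"], review_needed)
```

In this sketch the provisional common count equals the strict common count plus the review-needed count, which matches the relationship between the corresponding columns in the gene-level table.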

Run:

    python 10_generate_pipeline_statistics.py

The script prints the statistics and creates CSV and JSON outputs under `pipeline_statistics/`.

Key GitHub-ready outputs include:

| File | Description |
|---|---|
| `table_01_all_clinvar_ai_common_variants.csv` | Master union of all ClinVar and AI variants with mapping status |
| `table_02_common_variants_mapping_audit.csv` | Detailed audit of all provisional common variants |
| `table_09_gene_level_mapping_summary.csv` | Audited gene-level ClinVar and AI comparison |
| `table_04_article_processing_by_gene.csv` | Article retrieval, text extraction, and AI processing results |
| `table_07_ai_requested_fields.csv` | Structured fields requested from the AI model |
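After the statistics script has run, a quick way to inspect what it produced is to list each CSV with its header and row count. The helper below is a hypothetical convenience, not part of the pipeline; it makes no assumption about column names and only relies on the `pipeline_statistics/` directory named above.

```python
import csv
from pathlib import Path


def summarize_csv_outputs(directory="pipeline_statistics"):
    """Return {file name: {"columns": header, "rows": data row count}}."""
    root = Path(directory)
    if not root.is_dir():
        return {}
    summary = {}
    for path in sorted(root.glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            header = next(reader, [])  # first line is treated as the header
            row_count = sum(1 for _ in reader)
        summary[path.name] = {"columns": header, "rows": row_count}
    return summary


for name, info in summarize_csv_outputs().items():
    print(f"{name}: {info['rows']} rows, {len(info['columns'])} columns")
```

If the directory does not exist yet, the function returns an empty dictionary and the loop prints nothing.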

Create a virtual environment:

    python -m venv .venv

Activate the environment and install the dependencies needed by the pipeline. Major Python packages include:

    pip install biopython requests beautifulsoup4 pymupdf marker-pdf easyocr pytesseract pillow google-generativeai

Tesseract OCR must also be installed separately when figure OCR is required.

Set the API keys and contact email as environment variables:

    export GEMINI_API_KEY="your-key"
    export ELSEVIER_API_KEY="your-key"
    export NCBI_EMAIL="your-email@example.com"

Run the pipeline scripts in order:

    python 01_collect_clinvar_variants.py
    python 02_collect_telomerase_database.py
    python 03_extract_clinvar_pmids.py
    python 04_merge_article_records.py
    python 05_download_articles.py
    python 06_extract_article_text.py
    python 07_extract_variants_with_ai.py
    python 08_evaluate_variant_extraction.py
    python 10_generate_pipeline_statistics.py

Each script can also be run independently when its required input files are available.
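For unattended runs, the same sequence could be driven from a small wrapper that first checks the environment variables. This is a hypothetical helper, not a script shipped with the pipeline. It assumes all three variables are needed for a full run and mirrors the command list above, which does not include `09_build_share_package.py`.

```python
import os
import subprocess
import sys

# Assumption: all three variables are required for a complete run.
REQUIRED_ENV = ("GEMINI_API_KEY", "ELSEVIER_API_KEY", "NCBI_EMAIL")

# Same order as the command list above.
SCRIPTS = (
    "01_collect_clinvar_variants.py",
    "02_collect_telomerase_database.py",
    "03_extract_clinvar_pmids.py",
    "04_merge_article_records.py",
    "05_download_articles.py",
    "06_extract_article_text.py",
    "07_extract_variants_with_ai.py",
    "08_evaluate_variant_extraction.py",
    "10_generate_pipeline_statistics.py",
)


def missing_env(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV if not env.get(name)]


def run_pipeline(scripts=SCRIPTS):
    """Run each script with the current interpreter, stopping at the first failure."""
    missing = missing_env()
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
    for script in scripts:
        print(f"Running {script}")
        subprocess.run([sys.executable, script], check=True)
```

Calling `run_pipeline()` from the repository root starts the sequence; `check=True` raises `subprocess.CalledProcessError` if a step exits with a non-zero status, so later steps do not run on incomplete inputs.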
- Only redistribute articles when their licenses permit redistribution.
- API keys and personal credentials must never be committed.
- AI-generated extractions require human verification.
- This software is intended for research and methodological evaluation.
- It is not a clinical diagnostic tool.
- Large datasets and copyrighted full-text articles should be deposited or shared separately under appropriate permissions.

If you use LitVarAI, please cite the associated publication when available. Until then, the software can be cited as:

    @software{muneeb2026litvarai,
      author = {Muneeb, Muhammad and Ascher, David},
      title = {LitVarAI: An AI-Assisted Pipeline for Biomedical Literature Review and Genomic Variant Extraction},
      year = {2026},
      publisher = {GitHub},
      url = {https://github.com/MuhammadMuneeb007}
    }

Muhammad Muneeb, PhD Candidate, The University of Queensland, Australia (BioSig Lab)

- Email: m.muneeb@uq.edu.au
- GitHub: MuhammadMuneeb007
- Google Scholar: Muhammad Muneeb
- ResearchGate: Muhammad Muneeb

- Supervisor: Prof. David Ascher
- Research Group: BioSig Lab
- Institution: The University of Queensland

This project is licensed under the MIT License.