MuhammadMuneeb007/LitVarAI


LitVarAI

An AI-Assisted Pipeline for Biomedical Literature Review and Genomic Variant Extraction

Overview

LitVarAI is an AI-assisted pipeline for automating major components of a biomedical literature review. It discovers relevant articles, retrieves available full text, extracts article text and figure content, identifies reported genomic variants using a large language model, highlights supporting evidence, and compares extracted variants with ClinVar.

Telomere and telomerase-associated genes are used as a biomedical case study:

CTC1, NAF1, NHP2, NOP10, PARN, POT1, RTEL1, TER/TERC, TERT, TINF2, TPP1/ACD, WRAP53, and ZCCHC8.

Results at a Glance

Overall Pipeline Results

| Metric | Result |
| --- | --- |
| Target genes | 13 |
| ClinVar variant records collected | 7,519 |
| ClinVar-linked article records | 664 |
| Telomerase Database article records | 146 |
| Merged unique article records | 730 |
| Articles with available PDFs | 617 |
| Articles with successfully extracted text | 617 |
| Articles with AI responses | 543 |
| AI responses containing at least one variant | 342 |
| AI-extracted variant rows | 2,181 |
| Unique comparable ClinVar variants | 7,517 |
| Unique comparable AI variants | 1,329 |
| Provisional common variants | 194 |
| Strict protein-concordant common variants | 183 |
| Common variants requiring review | 11 |

Article Processing by Gene

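| Gene | Merged Articles | PDFs Available | Missing PDFs | Text Extracted | AI Responses | AI Responses with Variants |
| --- | --- | --- | --- | --- | --- | --- |
| CTC1 | 60 | 53 | 7 | 53 | 53 | 26 |
| NAF1 | 22 | 16 | 6 | 16 | 16 | 4 |
| NHP2 | 20 | 14 | 6 | 14 | 14 | 8 |
| NOP10 | 13 | 9 | 4 | 9 | 9 | 5 |
| PARN | 55 | 46 | 9 | 46 | 46 | 20 |
| POT1 | 83 | 70 | 13 | 70 | 0 | 0 |
| RTEL1 | 102 | 83 | 19 | 83 | 82 | 53 |
| TER | 1 | 1 | 0 | 1 | 1 | 1 |
| TERT | 246 | 217 | 29 | 217 | 214 | 166 |
| TINF2 | 59 | 51 | 8 | 51 | 51 | 35 |
| TPP1 | 32 | 28 | 4 | 28 | 28 | 12 |
| WRAP53 | 23 | 19 | 4 | 19 | 19 | 9 |
| ZCCHC8 | 14 | 10 | 4 | 10 | 10 | 3 |
| Total | 730 | 617 | 113 | 617 | 543 | 342 |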

Audited Variant Mapping by Gene

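| Gene | ClinVar Variants | AI Variants | Provisional Common | Strict Common | Review Needed | ClinVar Only | AI Only |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CTC1 | 732 | 121 | 14 | 14 | 0 | 718 | 107 |
| NAF1 | 191 | 27 | 1 | 1 | 0 | 190 | 26 |
| NHP2 | 98 | 37 | 4 | 4 | 0 | 94 | 33 |
| NOP10 | 32 | 5 | 4 | 4 | 0 | 28 | 1 |
| PARN | 401 | 106 | 11 | 11 | 0 | 390 | 95 |
| POT1 | 1,267 | 0 | 0 | 0 | 0 | 1,267 | 0 |
| RTEL1 | 1,842 | 292 | 26 | 23 | 3 | 1,816 | 266 |
| TER | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| TERT | 1,563 | 594 | 111 | 107 | 4 | 1,452 | 483 |
| TINF2 | 304 | 75 | 19 | 16 | 3 | 285 | 56 |
| TPP1 | 549 | 43 | 1 | 1 | 0 | 548 | 42 |
| WRAP53 | 260 | 25 | 2 | 2 | 0 | 258 | 23 |
| ZCCHC8 | 278 | 4 | 1 | 0 | 1 | 277 | 3 |
| Total | 7,517 | 1,329 | 194 | 183 | 11 | 7,323 | 1,135 |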

Review Needed combines common variants with partial protein evidence and common variants with conflicting protein changes. Detailed and reproducible results are available in pipeline_statistics/, particularly table_09_gene_level_mapping_summary.csv and table_01_all_clinvar_ai_common_variants.csv.
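
The figures in the mapping table appear to satisfy three simple identities: Provisional Common = Strict Common + Review Needed, ClinVar Only = ClinVar Variants - Provisional Common, and AI Only = AI Variants - Provisional Common. These identities are inferred from the published numbers rather than stated by the pipeline, so the sketch below is only a sanity check a reviewer might run. The names MAPPING_ROWS, row_problems, and column_totals are hypothetical and are not part of the repository.

```python
# Rows copied from the "Audited Variant Mapping by Gene" table.
# Tuple order: ClinVar, AI, provisional common, strict common,
# review needed, ClinVar only, AI only.
MAPPING_ROWS = {
    "CTC1": (732, 121, 14, 14, 0, 718, 107),
    "NAF1": (191, 27, 1, 1, 0, 190, 26),
    "NHP2": (98, 37, 4, 4, 0, 94, 33),
    "NOP10": (32, 5, 4, 4, 0, 28, 1),
    "PARN": (401, 106, 11, 11, 0, 390, 95),
    "POT1": (1267, 0, 0, 0, 0, 1267, 0),
    "RTEL1": (1842, 292, 26, 23, 3, 1816, 266),
    "TER": (0, 0, 0, 0, 0, 0, 0),
    "TERT": (1563, 594, 111, 107, 4, 1452, 483),
    "TINF2": (304, 75, 19, 16, 3, 285, 56),
    "TPP1": (549, 43, 1, 1, 0, 548, 42),
    "WRAP53": (260, 25, 2, 2, 0, 258, 23),
    "ZCCHC8": (278, 4, 1, 0, 1, 277, 3),
}


def row_problems(row):
    """Return the identities (assumed, see lead-in) that one row violates."""
    clinvar, ai, provisional, strict, review, clinvar_only, ai_only = row
    problems = []
    if provisional != strict + review:
        problems.append("provisional != strict + review")
    if clinvar_only != clinvar - provisional:
        problems.append("ClinVar only != ClinVar - provisional")
    if ai_only != ai - provisional:
        problems.append("AI only != AI - provisional")
    return problems


def column_totals(rows):
    """Sum each column across all genes, in the tuple order above."""
    return tuple(sum(column) for column in zip(*rows.values()))
```

Calling row_problems on every entry of MAPPING_ROWS and comparing column_totals(MAPPING_ROWS) with the Total row is one way to catch transcription errors when the table is regenerated.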

Research Objectives

LitVarAI was developed to investigate whether an automated workflow can:

  1. Construct a targeted biomedical literature corpus from multiple databases.
  2. Convert full-text articles and figures into machine-readable evidence.
  3. Extract structured variant and clinical information using an AI model.
  4. Link each extracted variant to supporting evidence from its source article.
  5. Compare literature-extracted variants with an established reference database.
  6. Generate transparent, auditable tables for manual review and publication.

The pipeline should be considered AI-assisted, rather than fully autonomous. Extracted variants and evidence should be reviewed by domain experts before clinical or research use.

Pipeline

| Step | Script | Purpose |
| --- | --- | --- |
| 1 | 01_collect_clinvar_variants.py | Collect ClinVar missense variants and linked citations |
| 2 | 02_collect_telomerase_database.py | Collect variants and publications from the Telomerase Database |
| 3 | 03_extract_clinvar_pmids.py | Extract and deduplicate ClinVar-linked PubMed identifiers |
| 4 | 04_merge_article_records.py | Merge and deduplicate article records from both sources |
| 5 | 05_download_articles.py | Retrieve available full-text articles |
| 6 | 06_extract_article_text.py | Extract article text, figures, and figure OCR |
| 7 | 07_extract_variants_with_ai.py | Extract structured variants and annotate supporting evidence |
| 8 | 08_evaluate_variant_extraction.py | Compare normalized AI-extracted variants with ClinVar |
| 9 | 09_build_share_package.py | Build an evidence package for manual review |
| 10 | 10_generate_pipeline_statistics.py | Generate publication-ready summary and mapping tables |
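
As an illustration of the merge in step 4, the sketch below combines two lists of article records and keeps one entry per article. It assumes that records are dictionaries with a "pmid" key and an optional "title" key, and that the PubMed identifier is the deduplication key. Both are assumptions made for this example. The actual 04_merge_article_records.py may use different fields or matching rules, and Article and merge_article_records are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Article:
    """One merged article record (hypothetical structure)."""
    pmid: str
    title: str = ""
    sources: set = field(default_factory=set)


def merge_article_records(clinvar_records, telomerase_records):
    """Merge two record lists into a dict keyed by PMID.

    Records without a PMID are skipped, because this sketch has no
    other key to deduplicate on.
    """
    merged = {}
    labelled = (
        ("clinvar", clinvar_records),
        ("telomerase_db", telomerase_records),
    )
    for source, records in labelled:
        for record in records:
            pmid = str(record.get("pmid", "")).strip()
            if not pmid:
                continue
            article = merged.setdefault(pmid, Article(pmid=pmid))
            article.sources.add(source)
            # Keep the first non-empty title that was seen.
            if not article.title and record.get("title"):
                article.title = record["title"]
    return merged
```

Tracking the contributing sources on each merged record makes it possible to report how many articles came from ClinVar, from the Telomerase Database, or from both.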

Pipeline Architecture

ClinVar variants and citations ─┐
                                ├─> Merge article records
Telomerase Database records ────┘
                                      |
                                      v
                            Retrieve available articles
                                      |
                                      v
                         Extract text, figures, and OCR
                                      |
                                      v
                         AI-assisted variant extraction
                                      |
                                      v
                    Evidence annotation and manual review
                                      |
                                      v
                       ClinVar normalization and mapping
                                      |
                                      v
                     Publication-ready statistics tables

Variant Mapping

Variants are normalized before comparison:

  1. ClinVar cDNA changes are parsed from Variant_Name.
  2. AI-extracted cDNA changes are read from cDNA_Change.
  3. Formatting differences such as spaces and parentheses are removed.
  4. Protein three-letter amino-acid codes are converted to one-letter codes.
  5. The normalized cDNA change is used as the primary identifier.
  6. The normalized protein change is used only when cDNA is unavailable.

A provisional common variant has the same normalized identifier in both sources. A strict common variant also has concordant protein-level evidence.
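
The normalization steps and the two match levels can be sketched as follows. This is a minimal illustration, not the code in 08_evaluate_variant_extraction.py. It assumes that a ClinVar Variant_Name looks like "NM_000000.0(GENE):c.100A>G (p.Thr34Ala)" (a made-up placeholder, not a real record), and every function name here is hypothetical. Like the pipeline, it is not transcript-aware.

```python
import re

# Three-letter to one-letter amino-acid codes ("Ter" is the stop codon).
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
    "Ter": "*",
}
_AA_PATTERN = re.compile("|".join(THREE_TO_ONE))
# Assumed Variant_Name layout: "<transcript>(<gene>):c.<change> (p.<change>)".
_CLINVAR_NAME = re.compile(r":(c\.[^\s(]+)(?:\s*\((p\.[^)]+)\))?")


def parse_clinvar_name(name):
    """Return (cdna, protein) parsed from a ClinVar-style variant name."""
    match = _CLINVAR_NAME.search(name or "")
    if not match:
        return ("", "")
    return (match.group(1), match.group(2) or "")


def normalize_cdna(change):
    """Remove whitespace and parentheses from a cDNA change."""
    return re.sub(r"[\s()]", "", change or "")


def normalize_protein(change):
    """Remove formatting, then convert three-letter codes to one-letter."""
    cleaned = re.sub(r"[\s()]", "", change or "")
    return _AA_PATTERN.sub(lambda m: THREE_TO_ONE[m.group(0)], cleaned)


def variant_key(cdna, protein):
    """cDNA is the primary identifier; protein is used only as a fallback."""
    cdna_key = normalize_cdna(cdna)
    if cdna_key:
        return ("cdna", cdna_key)
    protein_key = normalize_protein(protein)
    return ("protein", protein_key) if protein_key else None


def classify_overlap(clinvar, ai):
    """Classify two (cdna, protein) pairs as no_match, strict, or review."""
    key = variant_key(*clinvar)
    if key is None or key != variant_key(*ai):
        return "no_match"
    clinvar_protein = normalize_protein(clinvar[1])
    ai_protein = normalize_protein(ai[1])
    if clinvar_protein and ai_protein and clinvar_protein == ai_protein:
        return "strict"
    # Missing (partial) or conflicting protein evidence needs manual review.
    return "review"
```

In this sketch, every pair that is not "no_match" counts as a provisional common variant, and "review" covers both partial and conflicting protein evidence.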

Important limitations:

  • ClinVar collection is restricted to missense variants.
  • AI extraction includes all reported variant types.
  • Transcript-aware HGVS normalization is not yet implemented.
  • AI-only variants are not automatically false positives.
  • Protein-conflicting overlaps require manual review.

Generated Tables

Run:

python 10_generate_pipeline_statistics.py

The script prints the statistics and creates CSV and JSON outputs under pipeline_statistics/.

Key GitHub-ready outputs include:

| File | Description |
| --- | --- |
| table_01_all_clinvar_ai_common_variants.csv | Master union of all ClinVar and AI variants with mapping status |
| table_02_common_variants_mapping_audit.csv | Detailed audit of all provisional common variants |
| table_09_gene_level_mapping_summary.csv | Audited gene-level ClinVar and AI comparison |
| table_04_article_processing_by_gene.csv | Article retrieval, text extraction, and AI processing results |
| table_07_ai_requested_fields.csv | Structured fields requested from the AI model |

Quick Start

1. Create an environment

python -m venv .venv

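Activate the environment (for example, source .venv/bin/activate on Linux or macOS, or .venv\Scripts\activate on Windows), then install the dependencies needed by the pipeline. The major Python packages can be installed with: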

pip install biopython requests beautifulsoup4 pymupdf marker-pdf easyocr pytesseract pillow google-generativeai

Tesseract OCR must also be installed separately when figure OCR is required.

2. Configure credentials

export GEMINI_API_KEY="your-key"
export ELSEVIER_API_KEY="your-key"
export NCBI_EMAIL="your-email@example.com"
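
Before starting a long run, it can help to confirm that the variables above are set. The sketch below is one possible check and is not part of the repository. It assumes that all three variables are required, which may not hold for every script, and missing_credentials is a hypothetical helper name.

```python
import os

# Variable names taken from the export commands above.
REQUIRED_VARS = ("GEMINI_API_KEY", "ELSEVIER_API_KEY", "NCBI_EMAIL")


def missing_credentials(env=None):
    """Return the required variables that are unset or blank.

    `env` defaults to the process environment; passing a dict makes
    the function easy to test without touching real credentials.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
```

A script could call missing_credentials() at startup and stop with a clear message when the returned list is not empty, instead of failing partway through an API request.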

3. Run the pipeline

python 01_collect_clinvar_variants.py
python 02_collect_telomerase_database.py
python 03_extract_clinvar_pmids.py
python 04_merge_article_records.py
python 05_download_articles.py
python 06_extract_article_text.py
python 07_extract_variants_with_ai.py
python 08_evaluate_variant_extraction.py
python 10_generate_pipeline_statistics.py

Each script can also be run independently when its required input files are available.
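
The same sequence can be driven from Python, which makes it easier to resume from a later step after a failure. The sketch below is a hypothetical wrapper, not a script shipped with the repository. Its script list mirrors the commands above, which do not include 09_build_share_package.py, and build_commands and run_pipeline are assumed names.

```python
import subprocess
import sys
from pathlib import Path

# Same order as the commands listed in this section.
PIPELINE_SCRIPTS = [
    "01_collect_clinvar_variants.py",
    "02_collect_telomerase_database.py",
    "03_extract_clinvar_pmids.py",
    "04_merge_article_records.py",
    "05_download_articles.py",
    "06_extract_article_text.py",
    "07_extract_variants_with_ai.py",
    "08_evaluate_variant_extraction.py",
    "10_generate_pipeline_statistics.py",
]


def build_commands(scripts=PIPELINE_SCRIPTS, start_at=1, python=sys.executable):
    """Return one command per script whose numeric prefix is >= start_at."""
    return [
        [python, script]
        for script in scripts
        if int(script.split("_", 1)[0]) >= start_at
    ]


def run_pipeline(start_at=1, workdir="."):
    """Run the scripts in order and stop at the first failure."""
    for command in build_commands(start_at=start_at):
        script_path = Path(workdir) / command[1]
        if not script_path.exists():
            raise FileNotFoundError(script_path)
        print("Running", command[1])
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(command, cwd=workdir, check=True)
```

For example, run_pipeline(start_at=7) would rerun AI extraction, evaluation, and statistics, assuming the outputs of the earlier steps are already on disk.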

Data Sources

Responsible Use and Data Availability

  • Only redistribute articles when their licenses permit redistribution.
  • API keys and personal credentials must never be committed.
  • AI-generated extractions require human verification.
  • This software is intended for research and methodological evaluation.
  • It is not a clinical diagnostic tool.
  • Large datasets and copyrighted full-text articles should be deposited or shared separately under appropriate permissions.

Citation

If you use LitVarAI, please cite the associated publication when available. Until then, the software can be cited as:

@software{muneeb2026litvarai,
  author    = {Muneeb, Muhammad and Ascher, David},
  title     = {LitVarAI: An AI-Assisted Pipeline for Biomedical Literature Review and Genomic Variant Extraction},
  year      = {2026},
  publisher = {GitHub},
  url       = {https://github.com/MuhammadMuneeb007/LitVarAI}
}

Author and Research Group

Muhammad Muneeb
PhD Candidate, The University of Queensland, Australia
BioSig Lab

Supervisor: Prof. David Ascher
Research Group: BioSig Lab
Institution: The University of Queensland

License

This project is licensed under the MIT License.
