LitVarAI

An AI-Assisted Pipeline for Biomedical Literature Review and Genomic Variant Extraction

Overview

LitVarAI is an AI-assisted pipeline for automating major components of a biomedical literature review. It discovers relevant articles, retrieves available full text, extracts article text and figure content, identifies reported genomic variants using a large language model, highlights supporting evidence, and compares extracted variants with ClinVar.

Telomere and telomerase-associated genes are used as a biomedical case study:

CTC1, NAF1, NHP2, NOP10, PARN, POT1, RTEL1, TER/TERC, TERT, TINF2, TPP1/ACD, WRAP53, and ZCCHC8.

Results at a Glance

Overall Pipeline Results

Metric	Result
Target genes	13
ClinVar variant records collected	7,519
ClinVar-linked article records	664
Telomerase Database article records	146
Merged unique article records	730
Articles with available PDFs	617
Articles with successfully extracted text	617
Articles with AI responses	543
AI responses containing at least one variant	342
AI-extracted variant rows	2,181
Unique comparable ClinVar variants	7,517
Unique comparable AI variants	1,329
Provisional common variants	194
Strict protein-concordant common variants	183
Common variants requiring review	11
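
The article counts above form a funnel, and the stage-to-stage retention can be derived directly from them. The snippet below is a minimal sketch that only re-uses numbers from the table; the dictionary keys and the `stage_rates` helper are illustrative names and are not part of the pipeline scripts.

```python
# Counts copied from the "Overall Pipeline Results" table.
counts = {
    "merged_articles": 730,
    "pdfs_available": 617,
    "text_extracted": 617,
    "ai_responses": 543,
    "ai_responses_with_variants": 342,
}


def stage_rates(stage_counts):
    """Return the percentage of items retained between consecutive stages."""
    names = list(stage_counts)
    rates = {}
    for prev, curr in zip(names, names[1:]):
        rates[f"{prev} -> {curr}"] = round(
            100 * stage_counts[curr] / stage_counts[prev], 1
        )
    return rates


rates = stage_rates(counts)
for step, pct in rates.items():
    print(f"{step}: {pct}%")
```

Reading the rates side by side shows where articles drop out of the workflow, for example between text extraction and AI response.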

Article Processing by Gene

Gene	Merged Articles	PDFs Available	Missing PDFs	Text Extracted	AI Responses	AI Responses with Variants
CTC1	60	53	7	53	53	26
NAF1	22	16	6	16	16	4
NHP2	20	14	6	14	14	8
NOP10	13	9	4	9	9	5
PARN	55	46	9	46	46	20
POT1	83	70	13	70	0	0
RTEL1	102	83	19	83	82	53
TER	1	1	0	1	1	1
TERT	246	217	29	217	214	166
TINF2	59	51	8	51	51	35
TPP1	32	28	4	28	28	12
WRAP53	23	19	4	19	19	9
ZCCHC8	14	10	4	10	10	3
Total	730	617	113	617	543	342

Audited Variant Mapping by Gene

Gene	ClinVar Variants	AI Variants	Provisional Common	Strict Common	Review Needed	ClinVar Only	AI Only
CTC1	732	121	14	14	0	718	107
NAF1	191	27	1	1	0	190	26
NHP2	98	37	4	4	0	94	33
NOP10	32	5	4	4	0	28	1
PARN	401	106	11	11	0	390	95
POT1	1,267	0	0	0	0	1,267	0
RTEL1	1,842	292	26	23	3	1,816	266
TER	0	0	0	0	0	0	0
TERT	1,563	594	111	107	4	1,452	483
TINF2	304	75	19	16	3	285	56
TPP1	549	43	1	1	0	548	42
WRAP53	260	25	2	2	0	258	23
ZCCHC8	278	4	1	0	1	277	3
Total	7,517	1,329	194	183	11	7,323	1,135

Review Needed combines common variants with partial protein evidence and common variants with conflicting protein changes. Detailed and reproducible results are available in pipeline_statistics/, particularly table_09_gene_level_mapping_summary.csv and table_01_all_clinvar_ai_common_variants.csv.
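
The columns of the mapping table are related arithmetically: in every row, Provisional Common equals Strict Common plus Review Needed, ClinVar Only equals ClinVar Variants minus Provisional Common, and AI Only equals AI Variants minus Provisional Common. These identities are read off the published numbers, not taken from the pipeline code, so the sketch below should be treated as a consistency check and not as the authors' definition. The `check_row` helper is a hypothetical name.

```python
# Rows copied from "Audited Variant Mapping by Gene":
# (gene, clinvar, ai, provisional, strict, review, clinvar_only, ai_only)
rows = [
    ("CTC1", 732, 121, 14, 14, 0, 718, 107),
    ("NAF1", 191, 27, 1, 1, 0, 190, 26),
    ("NHP2", 98, 37, 4, 4, 0, 94, 33),
    ("NOP10", 32, 5, 4, 4, 0, 28, 1),
    ("PARN", 401, 106, 11, 11, 0, 390, 95),
    ("POT1", 1267, 0, 0, 0, 0, 1267, 0),
    ("RTEL1", 1842, 292, 26, 23, 3, 1816, 266),
    ("TER", 0, 0, 0, 0, 0, 0, 0),
    ("TERT", 1563, 594, 111, 107, 4, 1452, 483),
    ("TINF2", 304, 75, 19, 16, 3, 285, 56),
    ("TPP1", 549, 43, 1, 1, 0, 548, 42),
    ("WRAP53", 260, 25, 2, 2, 0, 258, 23),
    ("ZCCHC8", 278, 4, 1, 0, 1, 277, 3),
]
totals = (7517, 1329, 194, 183, 11, 7323, 1135)


def check_row(clinvar, ai, provisional, strict, review, clinvar_only, ai_only):
    """Return True when one row satisfies the three identities."""
    return (
        provisional == strict + review
        and clinvar_only == clinvar - provisional
        and ai_only == ai - provisional
    )


inconsistent = [row[0] for row in rows if not check_row(*row[1:])]
column_sums = tuple(sum(column) for column in zip(*(row[1:] for row in rows)))

print("Inconsistent rows:", inconsistent)
print("Column sums match the Total row:", column_sums == totals)
```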

Research Objectives

LitVarAI was developed to investigate whether an automated workflow can:

Construct a targeted biomedical literature corpus from multiple databases.
Convert full-text articles and figures into machine-readable evidence.
Extract structured variant and clinical information using an AI model.
Link each extracted variant to supporting evidence from its source article.
Compare literature-extracted variants with an established reference database.
Generate transparent, auditable tables for manual review and publication.

The pipeline is AI-assisted rather than fully autonomous. Extracted variants and their supporting evidence should be reviewed by domain experts before any clinical or research use.

Pipeline

Step	Script	Purpose
1	`01_collect_clinvar_variants.py`	Collect ClinVar missense variants and linked citations
2	`02_collect_telomerase_database.py`	Collect variants and publications from the Telomerase Database
3	`03_extract_clinvar_pmids.py`	Extract and deduplicate ClinVar-linked PubMed identifiers
4	`04_merge_article_records.py`	Merge and deduplicate article records from both sources
5	`05_download_articles.py`	Retrieve available full-text articles
6	`06_extract_article_text.py`	Extract article text, figures, and figure OCR
7	`07_extract_variants_with_ai.py`	Extract structured variants and annotate supporting evidence
8	`08_evaluate_variant_extraction.py`	Compare normalized AI-extracted variants with ClinVar
9	`09_build_share_package.py`	Build an evidence package for manual review
10	`10_generate_pipeline_statistics.py`	Generate publication-ready summary and mapping tables

Pipeline Architecture

ClinVar variants and citations ─┐
                                ├─> Merge article records
Telomerase Database records ────┘
                                      |
                                      v
                            Retrieve available articles
                                      |
                                      v
                         Extract text, figures, and OCR
                                      |
                                      v
                         AI-assisted variant extraction
                                      |
                                      v
                    Evidence annotation and manual review
                                      |
                                      v
                       ClinVar normalization and mapping
                                      |
                                      v
                     Publication-ready statistics tables
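
The merge step at the top of the diagram combines article records from two sources. The reported counts (664 ClinVar-linked plus 146 Telomerase Database records, against 730 merged unique records) are consistent with overlapping records being collapsed. The sketch below illustrates one way to do this. The README does not state which key `04_merge_article_records.py` uses, so deduplicating on a `pmid` field is an assumption, and the function name and `sources` field are hypothetical.

```python
def merge_article_records(clinvar_records, telomerase_records, key="pmid"):
    """Merge two lists of article dicts, keeping one record per identifier.

    Assumption: records are dicts and `key` holds a shared identifier.
    Records with no identifier are kept as-is because they cannot be matched.
    """
    merged = {}
    unkeyed = []
    labelled = (("clinvar", clinvar_records), ("telomerase_db", telomerase_records))
    for source, records in labelled:
        for record in records:
            identifier = str(record.get(key) or "").strip()
            if not identifier:
                unkeyed.append({**record, "sources": [source]})
                continue
            entry = merged.setdefault(identifier, {**record, "sources": []})
            if source not in entry["sources"]:
                entry["sources"].append(source)
    return list(merged.values()) + unkeyed


# Hypothetical records, for illustration only.
clinvar = [{"pmid": "111", "title": "A"}, {"pmid": "222", "title": "B"}]
telomerase = [{"pmid": "222", "title": "B"}, {"pmid": "333", "title": "C"}]
merged = merge_article_records(clinvar, telomerase)
print(len(merged), "unique article records")
```

Tracking the contributing sources on each merged record keeps the provenance visible for later audit.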

Variant Mapping

Variants are normalized before comparison:

ClinVar cDNA changes are parsed from Variant_Name.
AI-extracted cDNA changes are read from cDNA_Change.
Formatting differences such as spaces and parentheses are removed.
Protein three-letter amino-acid codes are converted to one-letter codes.
The normalized cDNA change is used as the primary identifier.
The normalized protein change is used only when cDNA is unavailable.
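
The normalization rules above can be sketched as follows. This is an illustration under stated assumptions, not the code in `08_evaluate_variant_extraction.py`: the function names are hypothetical, the regular expression assumes ClinVar names of the form `transcript(GENE):c.change (p.change)`, and the example strings are invented placeholders and not pipeline data.

```python
import re

# Standard three-letter to one-letter amino-acid codes ("Ter" marks a stop).
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
    "Ter": "*",
}
_AA_PATTERN = re.compile("|".join(THREE_TO_ONE))


def parse_clinvar_cdna(variant_name):
    """Pull the cDNA change (c.…) out of a ClinVar-style variant name."""
    match = re.search(r":(c\.[^\s()]+)", variant_name or "")
    return match.group(1) if match else ""


def strip_formatting(value):
    """Remove whitespace and parentheses."""
    return re.sub(r"[\s()]", "", value or "")


def normalize_protein(value):
    """Strip formatting and convert three-letter codes to one-letter codes."""
    return _AA_PATTERN.sub(lambda m: THREE_TO_ONE[m.group(0)], strip_formatting(value))


def variant_identifier(cdna, protein):
    """Prefer the normalized cDNA change; fall back to the protein change."""
    return strip_formatting(cdna) or normalize_protein(protein)


# Placeholder variant name, for illustration only.
name = "NM_000000.0(GENE):c.100A>G (p.Thr34Ala)"
print(variant_identifier(parse_clinvar_cdna(name), "p.(Thr34Ala)"))
print(variant_identifier("", "p.(Thr34Ala)"))
```

Because this sketch compares strings only, two descriptions of the same variant on different transcripts would not match, which is the limitation noted for transcript-aware HGVS normalization.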

A provisional common variant has the same normalized identifier in both sources. A strict common variant also has concordant protein-level evidence.
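
The classification can be sketched as a comparison of two lookup tables keyed by the normalized identifier. The status labels and the `classify_overlap` function are hypothetical, and treating "partial protein evidence" as a protein change missing from one side is an assumption made for illustration.

```python
def classify_overlap(clinvar, ai):
    """Classify each identifier found in either source.

    `clinvar` and `ai` map a normalized identifier to a normalized protein
    change, or to None when no protein change is available.
    """
    statuses = {}
    for identifier in sorted(clinvar.keys() | ai.keys()):
        if identifier not in ai:
            statuses[identifier] = "clinvar_only"
        elif identifier not in clinvar:
            statuses[identifier] = "ai_only"
        else:
            # Provisional common: same identifier in both sources.
            protein_clinvar, protein_ai = clinvar[identifier], ai[identifier]
            if not (protein_clinvar and protein_ai):
                statuses[identifier] = "review_partial_protein"
            elif protein_clinvar == protein_ai:
                statuses[identifier] = "strict_common"
            else:
                statuses[identifier] = "review_conflicting_protein"
    return statuses


# Hypothetical identifiers, for illustration only.
clinvar_variants = {"c.1A>G": "p.M1V", "c.5C>T": "p.A2V", "c.9G>A": None, "c.20T>C": "p.L7P"}
ai_variants = {"c.1A>G": "p.M1V", "c.5C>T": "p.A2G", "c.9G>A": "p.W3*", "c.30A>T": None}

for identifier, status in classify_overlap(clinvar_variants, ai_variants).items():
    print(identifier, status)
```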

Important limitations:

ClinVar collection is restricted to missense variants.
AI extraction includes all reported variant types.
Transcript-aware HGVS normalization is not yet implemented.
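AI-only variants are not automatically false positives; because ClinVar collection is limited to missense variants, some may be other variant types or variants not yet recorded in ClinVar.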
Protein-conflicting overlaps require manual review.

Generated Tables

Run:

python 10_generate_pipeline_statistics.py

The script prints the statistics and creates CSV and JSON outputs under pipeline_statistics/.
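
After the script finishes, the generated CSV files can be inspected without knowing their columns in advance. The sketch below uses only the Python standard library; the `summarize_csv_outputs` helper is a hypothetical name, and it assumes the files are UTF-8 encoded with a header row.

```python
import csv
from pathlib import Path


def summarize_csv_outputs(directory="pipeline_statistics"):
    """Return {filename: (column_names, data_row_count)} for each CSV file."""
    folder = Path(directory)
    if not folder.is_dir():
        return {}
    summary = {}
    for path in sorted(folder.glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            header = next(reader, [])
            summary[path.name] = (header, sum(1 for _ in reader))
    return summary


for name, (columns, n_rows) in summarize_csv_outputs().items():
    print(f"{name}: {n_rows} rows, columns={columns}")
```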

Key GitHub-ready outputs include:

File	Description
`table_01_all_clinvar_ai_common_variants.csv`	Master union of all ClinVar and AI variants with mapping status
`table_02_common_variants_mapping_audit.csv`	Detailed audit of all provisional common variants
`table_09_gene_level_mapping_summary.csv`	Audited gene-level ClinVar and AI comparison
`table_04_article_processing_by_gene.csv`	Article retrieval, text extraction, and AI processing results
`table_07_ai_requested_fields.csv`	Structured fields requested from the AI model

Quick Start

1. Create an environment

python -m venv .venv

Activate the environment, then install the main Python dependencies used by the pipeline:

pip install biopython requests beautifulsoup4 pymupdf marker-pdf easyocr pytesseract pillow google-generativeai

Tesseract OCR must also be installed separately when figure OCR is required.
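
A quick way to confirm that the Tesseract binary is reachable before running figure OCR is to look it up on `PATH`. This is a small sketch; the `find_tesseract` helper is a hypothetical name, and it assumes the executable is called `tesseract`, which is the default name used by standard installations.

```python
import shutil


def find_tesseract(executable="tesseract"):
    """Return the full path of the executable on PATH, or None if absent."""
    return shutil.which(executable)


location = find_tesseract()
if location:
    print("Tesseract found at", location)
else:
    print("Tesseract not found on PATH; figure OCR may be unavailable.")
```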

2. Configure credentials

export GEMINI_API_KEY="your-key"
export ELSEVIER_API_KEY="your-key"
export NCBI_EMAIL="your-email@example.com"
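
Before starting a long run, it can help to confirm that the variables are visible to Python. The check below is a sketch: the variable names come from the commands above, while the `missing_variables` helper is hypothetical, and whether every script needs all three values is an assumption.

```python
import os

REQUIRED_VARIABLES = ("GEMINI_API_KEY", "ELSEVIER_API_KEY", "NCBI_EMAIL")


def missing_variables(environ=None, required=REQUIRED_VARIABLES):
    """Return the names of required variables that are unset or blank."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name, "").strip()]


missing = missing_variables()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All credentials are set.")
```

The check reports only variable names and never prints the values, so it is safe to include in logs.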

3. Run the pipeline

python 01_collect_clinvar_variants.py
python 02_collect_telomerase_database.py
python 03_extract_clinvar_pmids.py
python 04_merge_article_records.py
python 05_download_articles.py
python 06_extract_article_text.py
python 07_extract_variants_with_ai.py
python 08_evaluate_variant_extraction.py
python 09_build_share_package.py
python 10_generate_pipeline_statistics.py

Each script can also be run independently when its required input files are available.
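
For unattended runs, the scripts can be chained from a small driver that stops at the first failure. The sketch below is not part of the repository: the `run_pipeline` function is hypothetical, the script names are those listed in the Pipeline table, and it assumes each script signals failure through a non-zero exit code.

```python
import subprocess
import sys

# Script names as listed in the Pipeline table.
SCRIPTS = [
    "01_collect_clinvar_variants.py",
    "02_collect_telomerase_database.py",
    "03_extract_clinvar_pmids.py",
    "04_merge_article_records.py",
    "05_download_articles.py",
    "06_extract_article_text.py",
    "07_extract_variants_with_ai.py",
    "08_evaluate_variant_extraction.py",
    "09_build_share_package.py",
    "10_generate_pipeline_statistics.py",
]


def run_pipeline(scripts=SCRIPTS, runner=subprocess.run):
    """Run scripts in order; return the list of scripts that succeeded.

    `runner` is injectable so the loop can be exercised without
    launching real processes.
    """
    completed = []
    for script in scripts:
        print(f"Running {script} ...")
        result = runner([sys.executable, script])
        if result.returncode != 0:
            print(f"{script} exited with code {result.returncode}; stopping.")
            break
        completed.append(script)
    return completed
```

Calling `run_pipeline()` from the repository root executes the steps with the current interpreter, and passing a shorter list resumes from a later step.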

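Data Sources

ClinVar: missense variant records and their linked citations.
Telomerase Database: variant and publication records.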

Responsible Use and Data Availability

Only redistribute articles when their licenses permit redistribution.
API keys and personal credentials must never be committed.
AI-generated extractions require human verification.
This software is intended for research and methodological evaluation.
It is not a clinical diagnostic tool.
Large datasets and copyrighted full-text articles should be deposited or shared separately under appropriate permissions.

Citation

If you use LitVarAI, please cite the associated publication when available. Until then, the software can be cited as:

@software{muneeb2026litvarai,
  author    = {Muneeb, Muhammad and Ascher, David},
  title     = {LitVarAI: An AI-Assisted Pipeline for Biomedical Literature Review and Genomic Variant Extraction},
  year      = {2026},
  publisher = {GitHub},
  url       = {https://github.com/MuhammadMuneeb007}
}

Author and Research Group

Muhammad Muneeb
PhD Candidate, The University of Queensland, Australia
BioSig Lab

Email: m.muneeb@uq.edu.au
GitHub: MuhammadMuneeb007
Google Scholar: Muhammad Muneeb
ResearchGate: Muhammad Muneeb

Supervisor: Prof. David Ascher
Research Group: BioSig Lab
Institution: The University of Queensland

License

This project is licensed under the MIT License.
