A comprehensive biomedical NLP system for extracting drug-gene-mutation-cancer relations from scientific literature, featuring evidence-based relation extraction and multi-task annotation.
This project implements an end-to-end pipeline for cancer pharmacogenomics relation extraction with the following key innovations:
- Evidence-First Architecture: Combines Evidence Sentence Selection (ESS) with Evidence-Conditioned Relation Extraction (EC-RE)
- Multi-Task Annotation: Integrates stance detection, certainty assessment, and study type classification
- Enhanced Entity Recognition: MUTATION-focused NER with improved recall for genetic variants
- Cross-Sentence Relations: Capable of extracting relations spanning multiple sentences
NER (Entity Recognition)
↓
ESS (Evidence Selection)
↓
EC-RE (Relation Classification)
↓
Stance & Certainty Annotation
↓
Study Type Classification
↓
Post-Processing & Validation
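The stage order above can be sketched as a simple chain of callables. This is a minimal illustration, not the project's actual integration code: `run_pipeline`, `placeholder_stage`, `STAGES`, and the state keys are all hypothetical names, and each real stage would call a trained model instead of returning an empty list.

```python
from typing import Callable, Dict, List, Tuple

# A stage takes the running document state and returns an updated state.
Stage = Callable[[Dict], Dict]


def placeholder_stage(output_key: str) -> Stage:
    """Build a stand-in stage (hypothetical) that writes an empty result."""
    def stage(state: Dict) -> Dict:
        return {**state, output_key: []}
    return stage


# Stage order mirrors the diagram; the output keys are assumptions.
STAGES: List[Tuple[str, Stage]] = [
    ("NER", placeholder_stage("entities")),
    ("ESS", placeholder_stage("evidence_sentences")),
    ("EC-RE", placeholder_stage("relations")),
    ("Stance & Certainty", placeholder_stage("stance_certainty")),
    ("Study Type", placeholder_stage("study_types")),
    ("Post-Processing", placeholder_stage("validated_relations")),
]


def run_pipeline(text: str, stages: List[Tuple[str, Stage]]) -> Dict:
    """Apply each stage in order and record which stages ran."""
    state: Dict = {"text": text, "trace": []}
    for name, stage in stages:
        state = stage(state)
        state["trace"].append(name)
    return state


result = run_pipeline("Gefitinib inhibits EGFR L858R.", STAGES)
print(result["trace"])
```

Because every stage reads and returns the same state dictionary, later stages can consume the outputs of earlier ones (for example, EC-RE reading the evidence sentences selected by ESS).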
- 07_NER_data/ - Named Entity Recognition data preparation and processing
- 08_RE_data/ - Relation Extraction data preparation and augmentation
- 09_NER/ - NER model training with MUTATION-focused optimization
- 10_RE/ - Multi-model training for evidence selection, relation extraction, stance/certainty, and study types
- 11_end2end/ - Complete pipeline integration, evaluation, and deployment
- 11.2_silver_dataset/ - GPT-adjudicated silver standard dataset for quality assessment
NER (Entity Recognition):
- Base model: BioLinkBERT-base
- Entity types: DRUG, GENE, MUTATION, CANCER
- MUTATION-focused training with enhanced recall
- Performance: F1 0.894, MUTATION F1 0.947
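For the four NER entity types, token-level predictions are commonly turned into entity spans before entity-level scoring (seqeval scores spans, not tokens). The sketch below assumes a BIO tagging scheme, which the README does not state explicitly; `LABELS` and `decode_bio` are hypothetical names, and the example sentence is made up.

```python
from typing import List, Tuple

ENTITY_TYPES = ["DRUG", "GENE", "MUTATION", "CANCER"]
# Assumed BIO label set: "O" plus B-/I- tags for each entity type.
LABELS = ["O"] + [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")]


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type = None
    for token, tag in zip(tokens, tags):
        prefix, _, etype = tag.partition("-")
        if prefix == "B" or (prefix == "I" and etype != current_type):
            # Start a new span; an I- tag without a matching B- also starts one.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_tokens, current_type = [token], etype
        elif prefix == "I":
            current_tokens.append(token)
        else:
            # "O" closes any open span.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


tokens = ["Gefitinib", "inhibits", "EGFR", "L858R", "in", "lung", "cancer"]
tags = ["B-DRUG", "O", "B-GENE", "B-MUTATION", "O", "B-CANCER", "I-CANCER"]
print(decode_bio(tokens, tags))
```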
ESS (Evidence Selection):
- Binary classification for evidence sentence detection
- Cross-sentence capability with distant relation support
- Performance: Overall F1 0.934, Distant F1 0.689
EC-RE (Relation Classification):
- Multi-task model with balanced class weights
- Relation types: INHIBITS, BINDING, ACTIVATES, UPREGULATES, DOWNREGULATES, CAUSES_SENSITIVITY, CAUSES_RESISTANCE
- Performance: Macro F1 0.925 (balanced version)
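One common way to obtain balanced class weights for the relation types is inverse-frequency weighting, `n_samples / (n_classes * count)`, the same heuristic scikit-learn uses for its "balanced" mode. Whether this project uses exactly this formula is an assumption; `balanced_class_weights` is a hypothetical helper and the label counts below are invented for illustration.

```python
from collections import Counter
from typing import Dict, List


def balanced_class_weights(labels: List[str]) -> Dict[str, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Made-up label distribution: frequent classes get weights below 1,
# rare classes get weights above 1.
example_labels = ["INHIBITS"] * 6 + ["BINDING"] * 3 + ["ACTIVATES"] * 1
weights = balanced_class_weights(example_labels)
for label, weight in sorted(weights.items()):
    print(f"{label}: {weight:.3f}")
```

Weights of this form can then be passed to a weighted cross-entropy loss so that rare relation types contribute more per example.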
Stance & Certainty Annotation:
- Binary stance: SUPPORTS vs NEUTRAL
- Tri-level certainty: HIGH, MEDIUM, LOW
- Negation and speculation detection
- Performance: Combined F1 0.814
Study Type Classification:
- Multi-label classification for 12 research types
- Categories: Clinical trials, experimental studies, secondary research
- Automated keyword-based labeling with 77.8% coverage
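Keyword-based multi-label labeling of study types can be sketched as follows. The keyword lists and label names here are hypothetical (the README does not list the 12 research types or their keywords); the point is only to show how one abstract can receive several labels and how coverage, the fraction of abstracts with at least one label, is computed.

```python
from typing import Dict, List

# Hypothetical keyword map; the project's actual 12 types are not shown here.
KEYWORDS: Dict[str, List[str]] = {
    "clinical_trial": ["randomized", "phase ii", "phase iii"],
    "in_vitro": ["cell line", "in vitro"],
    "meta_analysis": ["meta-analysis", "systematic review"],
}


def label_study_types(abstract: str) -> List[str]:
    """Return every study type whose keywords occur in the abstract."""
    text = abstract.lower()
    return sorted(
        label for label, keywords in KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    )


def coverage(abstracts: List[str]) -> float:
    """Fraction of abstracts that received at least one label."""
    if not abstracts:
        return 0.0
    labeled = sum(1 for abstract in abstracts if label_study_types(abstract))
    return labeled / len(abstracts)


abstracts = [
    "A randomized phase III trial of osimertinib in EGFR-mutant lung cancer.",
    "We performed a meta-analysis of in vitro studies.",
    "We describe a new hypothesis.",
]
for abstract in abstracts:
    print(label_study_types(abstract))
print(f"coverage: {coverage(abstracts):.3f}")
```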
- NER: 75,831 training examples with 33,493 MUTATION annotations
- ESS: 45,242 sentence-level examples with hard negatives
- EC-RE: 5,281 relation examples with evidence packs
- Stance/Certainty: 421 GPT-adjudicated examples
- Study Type: 6,328 abstracts with multi-label annotations
- 661 GPT-adjudicated documents
- 421 high-quality relations with stance and certainty labels
- Used for model validation and certainty training
- Total relations extracted: 1,981 (vs baseline 495)
- Pharmacogenomic relations: 142 (vs baseline 119)
- INHIBITS relations: 1,150 with high precision
- Entity recognition: 95,462 entities from 8,135 abstracts
- Evidence quality score: 0.985 average
- Cross-sentence relations: 7.0% of total
- High confidence relations: 48.5%
- Strong evidence support: 47.6%
- Framework: PyTorch, Transformers (HuggingFace)
- Base Models: michiyasunaga/BioLinkBERT-base
- Training: Focal Loss, class weighting, early stopping
- Evaluation: seqeval (NER), macro/micro F1 (classification)
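The difference between the macro and micro F1 used for the classification models can be illustrated with a small pure-Python sketch. It covers the single-label case only; `macro_micro_f1` is a hypothetical helper, not the project's evaluation code, and the labels below are invented.

```python
from collections import Counter
from typing import List, Tuple


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(y_true: List[str], y_pred: List[str]) -> Tuple[float, float]:
    """Macro F1 averages per-class F1; micro F1 pools counts over all classes."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1
    classes = sorted(set(y_true) | set(y_pred))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


y_true = ["INHIBITS", "INHIBITS", "INHIBITS", "BINDING"]
y_pred = ["INHIBITS", "INHIBITS", "BINDING", "BINDING"]
macro, micro = macro_micro_f1(y_true, y_pred)
print(f"macro F1 = {macro:.3f}, micro F1 = {micro:.3f}")
```

Macro F1 gives every class equal weight, so it is the more informative number when relation types are imbalanced; in the single-label setting micro F1 equals accuracy.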
- ESS model training and evaluation
- EC-RE model training with class balancing
- NER v3 with MUTATION focus
- Pipeline v3 integration and testing
- Stance & Certainty multi-task classifier
- Study Type multi-label classifier
- Complete pipeline integration with all components
Current Status: Script preparation complete for Days 2-5
Key Documents:
- Execution Checklist: docs/METHODOLOGY_FIX_EXECUTION_CHECKLIST.md
- Script Reference: docs/DAY2_DAY3_SCRIPT_REFERENCE.md
- Session Summary: docs/SESSION_SUMMARY_2026_01_08.md
What's Being Fixed:
- Proper train/dev/test splits (70/15/15)
- Test set integrity (the test set is evaluated only once)
- Removal of negation detection from the Stance model (insufficient data)
- Complete verification of methodology
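A reproducible 70/15/15 split can be sketched as below. This is an illustration under assumptions, not the project's split script: `split_dataset`, the fixed seed, and the use of integer IDs as items are all hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(
    items: Sequence,
    seed: int = 42,
    train_frac: float = 0.70,
    dev_frac: float = 0.15,
) -> Tuple[List, List, List]:
    """Shuffle with a fixed seed, then cut into train/dev/test partitions."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_dev = round(len(shuffled) * dev_frac)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # remainder, about 15%
    return train, dev, test


# Hypothetical example: 100 abstract IDs.
train, dev, test = split_dataset(range(100))
print(len(train), len(dev), len(test))
```

Fixing the seed makes the partition repeatable, so the test portion can be set aside once and left untouched until the final evaluation.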
This system builds upon:
- Evidence-based relation extraction methodologies
- Multi-task learning for biomedical NLP
- GPT-assisted silver standard creation
- Cross-sentence relation extraction techniques