MRCIEU/assertion_extraction

Cancer Pharmacogenomics Relation Extraction with Evidence Selection

A biomedical NLP system that extracts drug-gene-mutation-cancer relations from scientific literature, combining evidence-based relation extraction with multi-task annotation.

Project Overview

This project implements an end-to-end pipeline for cancer pharmacogenomics relation extraction with the following key innovations:

  1. Evidence-First Architecture: Combines Evidence Sentence Selection (ESS) with Evidence-Conditioned Relation Extraction (EC-RE)
  2. Multi-Task Annotation: Integrates stance detection, certainty assessment, and study type classification
  3. Enhanced Entity Recognition: MUTATION-focused NER with improved recall for genetic variants
  4. Cross-Sentence Relations: Extracts relations that span multiple sentences

System Architecture

NER (Entity Recognition)
  ↓
ESS (Evidence Selection)
  ↓
EC-RE (Relation Classification)
  ↓
Stance & Certainty Annotation
  ↓
Study Type Classification
  ↓
Post-Processing & Validation
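
The stage order above can be sketched as a chain of functions that each take and return a document. This is a minimal illustration only: `Document`, `make_stage`, `run_pipeline`, and the stage names are hypothetical and are not taken from the repository code.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    """Container handed from stage to stage (hypothetical structure)."""
    text: str
    annotations: Dict[str, list] = field(default_factory=dict)


Stage = Callable[[Document], Document]

# Stage names mirror the diagram; the identifiers themselves are assumptions.
STAGE_ORDER: List[str] = [
    "ner",
    "ess",
    "ec_re",
    "stance_certainty",
    "study_type",
    "post_processing",
]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only records that it ran."""
    def stage(doc: Document) -> Document:
        doc.annotations.setdefault("stages_run", []).append(name)
        return doc
    return stage


def run_pipeline(doc: Document, stages: List[Stage]) -> Document:
    """Apply each stage in order, feeding the output of one into the next."""
    for stage in stages:
        doc = stage(doc)
    return doc


if __name__ == "__main__":
    result = run_pipeline(
        Document("Gefitinib targets EGFR L858R in NSCLC."),
        [make_stage(name) for name in STAGE_ORDER],
    )
    print(result.annotations["stages_run"])
```

In a real pipeline each placeholder would be replaced by a model call, with later stages reading the annotations written by earlier ones.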

Project Structure

Data Processing Modules

  • 07_NER_data/ - Named Entity Recognition data preparation and processing
  • 08_RE_data/ - Relation Extraction data preparation and augmentation

Model Training Modules

  • 09_NER/ - NER model training with MUTATION-focused optimization
  • 10_RE/ - Multi-model training for evidence selection, relation extraction, stance/certainty, and study types

Integration Module

  • 11_end2end/ - Complete pipeline integration, evaluation, and deployment
  • 11.2_silver_dataset/ - GPT-adjudicated silver standard dataset for quality assessment

Key Components

Named Entity Recognition (NER)

  • Base model: BioLinkBERT-base
  • Entity types: DRUG, GENE, MUTATION, CANCER
  • MUTATION-focused training with enhanced recall
  • Performance: F1 0.894, MUTATION F1 0.947
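
As an illustration of how the four entity types map onto token labels, the sketch below builds a BIO label set and decodes a tag sequence into entity spans. It assumes a BIO tagging scheme, which is what seqeval expects but which the README does not state explicitly; `LABELS` and `bio_to_spans` are hypothetical names.

```python
from typing import List, Tuple

ENTITY_TYPES = ["DRUG", "GENE", "MUTATION", "CANCER"]

# "O" plus a B-/I- pair per entity type (assumed BIO scheme): 9 labels.
LABELS = ["O"] + [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")]


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Convert BIO tags into (entity_type, start, end) spans, end exclusive."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing "O" flushes the last span
        prefix, _, etype = tag.partition("-")
        continues = prefix == "I" and etype == current
        if current is not None and not continues:
            spans.append((current, start, i))
            start, current = None, None
        # A stray I- tag with no open span is treated leniently as a span start.
        if prefix == "B" or (prefix == "I" and current is None):
            start, current = i, etype
    return spans


if __name__ == "__main__":
    tokens = ["Gefitinib", "targets", "EGFR", "L858R", "in", "NSCLC"]
    tags = ["B-DRUG", "O", "B-GENE", "B-MUTATION", "O", "B-CANCER"]
    for etype, start, end in bio_to_spans(tags):
        print(etype, " ".join(tokens[start:end]))
```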

Evidence Sentence Selection (ESS)

  • Binary classification for evidence sentence detection
  • Cross-sentence capability with distant relation support
  • Performance: Overall F1 0.934, Distant F1 0.689
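
One way to picture the sentence-level setup is to enumerate candidate sentences for an entity pair and check whether any single sentence mentions both entities. The sketch below is an assumption-laden illustration: the README does not define "distant", so here a pair is treated as cross-sentence when no sentence contains both mentions, and matching is plain case-insensitive substring search rather than NER output. `Candidate`, `build_candidates`, and `is_cross_sentence` are hypothetical names.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    sentence_index: int
    text: str
    mentions_head: bool
    mentions_tail: bool


def build_candidates(sentences: List[str], head: str, tail: str) -> List[Candidate]:
    """Keep every sentence that mentions at least one entity of the pair."""
    candidates = []
    for i, sentence in enumerate(sentences):
        lower = sentence.lower()
        has_head = head.lower() in lower
        has_tail = tail.lower() in lower
        if has_head or has_tail:
            candidates.append(Candidate(i, sentence, has_head, has_tail))
    return candidates


def is_cross_sentence(candidates: List[Candidate]) -> bool:
    """True when candidates exist but no single sentence mentions both entities."""
    return bool(candidates) and not any(
        c.mentions_head and c.mentions_tail for c in candidates
    )


if __name__ == "__main__":
    sentences = [
        "EGFR T790M was detected at progression.",
        "The patients had received gefitinib.",
        "Follow-up was 12 months.",
    ]
    cands = build_candidates(sentences, head="gefitinib", tail="T790M")
    print([c.sentence_index for c in cands], is_cross_sentence(cands))
```

Each candidate would then be scored by the binary ESS classifier; sentences mentioning an entity but carrying no evidence are natural hard negatives.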

Evidence-Conditioned Relation Extraction (EC-RE)

  • Multi-task model with balanced class weights
  • Relation types: INHIBITS, BINDING, ACTIVATES, UPREGULATES, DOWNREGULATES, CAUSES_SENSITIVITY, CAUSES_RESISTANCE
  • Performance: Macro F1 0.925 (balanced version)
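
"Balanced class weights" commonly means weighting each class inversely to its frequency, for example `n_samples / (n_classes * class_count)`. Whether this project uses exactly that formula is an assumption; the sketch below shows the idea with made-up label counts, and `balanced_class_weights` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, List

RELATION_TYPES = [
    "INHIBITS",
    "BINDING",
    "ACTIVATES",
    "UPREGULATES",
    "DOWNREGULATES",
    "CAUSES_SENSITIVITY",
    "CAUSES_RESISTANCE",
]


def balanced_class_weights(labels: List[str]) -> Dict[str, float]:
    """Weight each observed class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


if __name__ == "__main__":
    # Illustrative counts only, not the project's real label distribution.
    labels = ["INHIBITS"] * 6 + ["BINDING"] * 3 + ["CAUSES_RESISTANCE"] * 1
    for cls, weight in balanced_class_weights(labels).items():
        print(f"{cls}: {weight:.3f}")
```

Rare classes receive weights above 1 and frequent classes below 1, so each class contributes comparably to the loss.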

Stance & Certainty Classifier

  • Binary stance: SUPPORTS vs NEUTRAL
  • Tri-level certainty: HIGH, MEDIUM, LOW
  • Negation and speculation detection
  • Performance: Combined F1 0.814
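
The label space above can be written down as a small schema that rejects values outside the two stance classes and three certainty levels. This is a sketch, not the project's data model: `AssertionAnnotation`, `parse_annotation`, and the `negated` and `speculative` field names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Stance(Enum):
    SUPPORTS = "SUPPORTS"
    NEUTRAL = "NEUTRAL"


class Certainty(Enum):
    HIGH = "HIGH"
    MEDIUM = "MEDIUM"
    LOW = "LOW"


@dataclass(frozen=True)
class AssertionAnnotation:
    stance: Stance
    certainty: Certainty
    negated: bool = False       # assumed flag for negation detection
    speculative: bool = False   # assumed flag for speculation detection


def parse_annotation(raw: dict) -> AssertionAnnotation:
    """Validate a raw record; unknown labels raise ValueError via the Enum lookup."""
    return AssertionAnnotation(
        stance=Stance(raw["stance"].upper()),
        certainty=Certainty(raw["certainty"].upper()),
        negated=bool(raw.get("negated", False)),
        speculative=bool(raw.get("speculative", False)),
    )


if __name__ == "__main__":
    print(parse_annotation({"stance": "supports", "certainty": "High"}))
```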

Study Type Classifier

  • Multi-label classification for 12 research types
  • Categories: Clinical trials, experimental studies, secondary research
  • Automated keyword-based labeling with 77.8% coverage
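
Keyword-based multi-label labeling and its coverage (the share of abstracts that receive at least one label) can be sketched as follows. The label names and keyword lists are invented for illustration; the README does not list the 12 study types or their keywords.

```python
from typing import Dict, List

# Hypothetical keyword lists; the real system covers 12 study types.
KEYWORDS: Dict[str, List[str]] = {
    "clinical_trial": ["randomized", "phase ii", "phase iii"],
    "experimental_study": ["in vitro", "cell line", "xenograft"],
    "secondary_research": ["meta-analysis", "systematic review"],
}


def label_abstract(text: str) -> List[str]:
    """Return every label with at least one keyword in the text (multi-label)."""
    lower = text.lower()
    return sorted(
        label
        for label, keywords in KEYWORDS.items()
        if any(keyword in lower for keyword in keywords)
    )


def coverage(abstracts: List[str]) -> float:
    """Fraction of abstracts that receive at least one label."""
    if not abstracts:
        return 0.0
    labeled = sum(1 for abstract in abstracts if label_abstract(abstract))
    return labeled / len(abstracts)


if __name__ == "__main__":
    abstracts = [
        "A randomized phase III trial with xenograft validation.",
        "We report a single case.",
    ]
    print(label_abstract(abstracts[0]), coverage(abstracts))
```

Abstracts that match no keyword stay unlabeled, which is why coverage can fall below 100%.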

Data

Training Data

  • NER: 75,831 training examples with 33,493 MUTATION annotations
  • ESS: 45,242 sentence-level examples with hard negatives
  • EC-RE: 5,281 relation examples with evidence packs
  • Stance/Certainty: 421 GPT-adjudicated examples
  • Study Type: 6,328 abstracts with multi-label annotations

Silver Standard

  • 661 GPT-adjudicated documents
  • 421 high-quality relations with stance and certainty labels
  • Used for model validation and certainty training

Performance Metrics

End-to-End Pipeline (v3)

  • Total relations extracted: 1,981 (vs baseline 495)
  • Pharmacogenomic relations: 142 (vs baseline 119)
  • INHIBITS relations: 1,150 with high precision
  • Entity recognition: 95,462 entities from 8,135 abstracts

Quality Indicators

  • Evidence quality score: 0.985 average
  • Cross-sentence relations: 7.0% of total
  • High confidence relations: 48.5%
  • Strong evidence support: 47.6%

Technical Stack

  • Framework: PyTorch, Transformers (HuggingFace)
  • Base Models: michiyasunaga/BioLinkBERT-base
  • Training: Focal Loss, class weighting, early stopping
  • Evaluation: seqeval (NER), macro/micro F1 (classification)
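
For reference, the standard focal loss (Lin et al.) down-weights examples the model already classifies confidently. The README does not state which values of the focusing parameter or class weights the project uses, so the symbols below are generic rather than project settings:

```latex
\mathrm{FL}(p_t) = -\alpha_t \,(1 - p_t)^{\gamma} \,\log(p_t)
```

Here $p_t$ is the predicted probability of the true class, $\alpha_t$ is an optional per-class weight, and $\gamma \ge 0$ is the focusing parameter. Setting $\gamma = 0$ and $\alpha_t = 1$ recovers ordinary cross-entropy.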

Development Status

  • ESS model training and evaluation
  • EC-RE model training with class balancing
  • NER v3 with MUTATION focus
  • Pipeline v3 integration and testing
  • Stance & Certainty multi-task classifier
  • Study Type multi-label classifier
  • Complete pipeline integration with all components

Methodology Fix (January 2026)

Current Status: Script preparation complete for Days 2-5

Key Documents:

  • Execution Checklist: docs/METHODOLOGY_FIX_EXECUTION_CHECKLIST.md
  • Script Reference: docs/DAY2_DAY3_SCRIPT_REFERENCE.md
  • Session Summary: docs/SESSION_SUMMARY_2026_01_08.md

What's Being Fixed:

  • Proper train/dev/test splits (70/15/15)
  • Test set integrity (evaluate once only)
  • Negation removal from Stance model (insufficient data)
  • Complete verification of methodology
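
A reproducible 70/15/15 split can be sketched with a seeded shuffle and integer arithmetic. The function name, the default seed, and the decision to give rounding leftovers to the test set are assumptions made for illustration, not details from the repository scripts.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(
    items: Sequence[T], seed: int = 42, train_pct: int = 70, dev_pct: int = 15
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle with a fixed seed, then cut into train/dev/test partitions."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # local RNG, global state untouched
    n_train = len(shuffled) * train_pct // 100
    n_dev = len(shuffled) * dev_pct // 100
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # remainder, including rounding leftovers
    return train, dev, test


if __name__ == "__main__":
    train, dev, test = split_dataset(range(100))
    print(len(train), len(dev), len(test))
```

Fixing the seed keeps the test partition identical across runs, which supports the rule that the test set is evaluated only once.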

Related Work

This system builds upon:

  • Evidence-based relation extraction methodologies
  • Multi-task learning for biomedical NLP
  • GPT-assisted silver standard creation
  • Cross-sentence relation extraction techniques
