Biomedical Image Description Generation Using Vision-Language Models

CSYE7105 - High Performance Parallel Machine Learning & AI, Spring 2026
Team 18: Shwetanshu Subhash Deshmukh and Bhargav Chickmagalur Nanjundappa
Instructor: Prof. Handan Liu


📋 Project Overview

This project implements a parallel vision-language model for automated biomedical image captioning using the PMC-OA dataset (1.6M image-caption pairs from PubMed Central). We demonstrate parallel computing techniques across multiple paradigms: CPU multi-core processing with Dask, GPU data parallelism with DDP, and GPU model parallelism with FSDP.

Key Achievements

  • ✅ Large-scale dataset: PMC-OA (1.65M biomedical images from PubMed Central)
  • ✅ Parallel preprocessing: Dask-based multi-core image processing with chunked memory management
  • ✅ Multi-GPU training: DDP achieves 2.45× speedup on 3 GPUs (81.7% efficiency)
  • ✅ Memory optimization: FSDP reduces GPU memory by 57.3% vs DDP
  • ✅ Parameter efficiency: LoRA fine-tuning (99.4% fewer trainable parameters)
  • ✅ CPU baseline: XGBoost + joblib achieves 9.4× speedup on 16 cores
  • ✅ Comprehensive benchmarking: Speedup, scaling efficiency, memory usage across 1-8 CPU cores and 1-3 GPUs

🎯 Technical Highlights

Model Architecture

Vision-Language Pipeline:

Input Image (224×224)
    → BiomedCLIP Encoder (86M params, frozen) → 768D features
    → Multi-Layer Projector (6.3M params, trainable) → 4096D features
    → BioMistral-7B + LoRA (35M trainable params) → Caption

Memory Optimization:

  • INT8 quantization for base model (7.24B params → ~7 GB)
  • LoRA adapters (rank=16) instead of full fine-tuning
  • Model parallelism via device_map="auto" for P100 GPUs (12GB each)
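
To make the pipeline concrete, here is a minimal sketch of how the three stages could be assembled with transformers and peft. The CaptionProjector class and the exact loading calls are illustrative assumptions, not the repository's actual module API.

import torch.nn as nn
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

class CaptionProjector(nn.Module):
    """Maps 768-D BiomedCLIP features into the 4096-D LLM embedding space."""
    def __init__(self, in_dim=768, hidden_dim=2048, out_dim=4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, out_dim),
        )
    def forward(self, x):
        return self.net(x)

# Base LLM in INT8 so the 7B model fits across 12 GB P100s;
# device_map="auto" spreads layers over the available GPUs (model parallelism).
llm = AutoModelForCausalLM.from_pretrained(
    "BioMistral/BioMistral-7B", load_in_8bit=True, device_map="auto"
)

# LoRA adapters on the attention projections keep only ~35M params trainable.
lora_cfg = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
llm = get_peft_model(llm, lora_cfg)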

Hardware & Software Stack

Actual Hardware Used:

  • GPUs: NVIDIA Tesla P100 (12GB HBM2, Pascal architecture)
  • CPUs: Multi-core nodes for Dask preprocessing
  • Interconnect: PCIe for multi-GPU communication
  • Cluster: Northeastern University Discovery Cluster

Software:

  • PyTorch: 2.1.0+cu118
  • Transformers: 4.35.2
  • Dask: 2023.11.0 for parallel preprocessing
  • NCCL: 2.15.5 for multi-GPU communication
  • Precision: FP32 (P100 lacks BF16 support) + INT8 quantization

Parallel Computing Results

CPU Preprocessing (Dask):

  • Sequential baseline: ~20 img/s
  • Parallel (16 workers): Target 15× speedup
  • Chunked processing to manage 1.6M image dataset
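
A hedged sketch of what chunked, multi-core preprocessing with Dask can look like; the paths.txt index file, resize size, and partition count are assumptions rather than the repo's exact pipeline.

import dask.bag as db
import numpy as np
from PIL import Image

def preprocess(path):
    # Load, resize to the model's 224x224 input, and scale to [0, 1].
    img = Image.open(path).convert("RGB").resize((224, 224))
    return np.asarray(img, dtype=np.float32) / 255.0

paths = db.read_text("data/raw/pmc_oa/paths.txt").map(str.strip)  # hypothetical index: one path per line
chunks = paths.map(preprocess).repartition(npartitions=256)       # many small partitions bound memory
arrays = chunks.compute(scheduler="processes", num_workers=16)    # 16 CPU workers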

CPU Baseline (XGBoost + joblib):

  • Sequential: 8 img/s
  • Parallel (16 cores): 75 img/s → 9.4× speedup, 59% efficiency
  • Task: Image modality classification (X-ray, CT, MRI, microscopy)
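
As a rough illustration of this baseline, the sketch below extracts features in parallel with joblib and fits an XGBoost modality classifier. The load_index helper and the downsampled-pixel features are hypothetical stand-ins for the project's actual feature pipeline.

import numpy as np
from joblib import Parallel, delayed
from PIL import Image
from xgboost import XGBClassifier

def image_features(path):
    # Tiny hand-crafted feature vector: downsampled grayscale pixels.
    img = Image.open(path).convert("L").resize((32, 32))
    return np.asarray(img, dtype=np.float32).ravel() / 255.0

paths, labels = load_index()  # hypothetical helper returning image paths and modality labels
X = np.stack(Parallel(n_jobs=16)(delayed(image_features)(p) for p in paths))
clf = XGBClassifier(n_estimators=300, tree_method="hist", n_jobs=16)
clf.fit(X, labels)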

GPU Training - DDP (Data Parallel):

  • 1 GPU: 137.9 img/s (baseline)
  • 2 GPUs: 268.3 img/s → 1.95× speedup, 97.3% efficiency
  • 3 GPUs: 338.1 img/s → 2.45× speedup, 81.7% efficiency
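
A minimal DDP setup sketch, as launched with torchrun; build_model and train_dataset are hypothetical placeholders for the project's own trainer code.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

dist.init_process_group(backend="nccl")        # NCCL handles inter-GPU communication
local_rank = int(os.environ["LOCAL_RANK"])     # set by torchrun
torch.cuda.set_device(local_rank)

model = build_model().cuda(local_rank)         # hypothetical model factory
model = DDP(model, device_ids=[local_rank])

sampler = DistributedSampler(train_dataset)    # each rank trains on a disjoint shard
loader = DataLoader(train_dataset, batch_size=4, sampler=sampler)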

GPU Training - FSDP (Fully Sharded):

  • Memory reduction: 57.3% vs DDP
  • Trade-off: Lower throughput due to communication overhead
  • Enables training larger models on memory-constrained GPUs
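
For comparison, a sketch of wrapping the same model with FSDP so that parameters, gradients, and optimizer state are sharded across ranks; the size-based auto-wrap policy shown here is an assumption, not necessarily the project's configuration.

import functools
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

# Shard any submodule above ~1M parameters across ranks instead of replicating it.
wrap_policy = functools.partial(size_based_auto_wrap_policy, min_num_params=1_000_000)
model = FSDP(build_model().cuda(), auto_wrap_policy=wrap_policy)  # hypothetical build_model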

Full Training Run:

  • Dataset: 47.5K samples
  • Hardware: 4×P100 with model parallelism
  • Duration: 511.7 hours (~21 days)
  • Throughput: 0.026 img/s (bottlenecked by sequential layer execution)

πŸ—οΈ Project Structure

parallelml_final/
├── data/
│   ├── raw/              # PMC-OA dataset (JSONL + images)
│   ├── processed/        # Preprocessed NPZ files
│   └── metadata/         # EDA statistics
├── src/
│   ├── data/             # Preprocessing (sequential + Dask parallel)
│   ├── models/           # VLM architecture (encoder, projector, LLM)
│   ├── training/         # Trainers (single-GPU, DDP, FSDP)
│   ├── evaluation/       # Metrics (BLEU, ROUGE, CIDEr)
│   └── utils/            # Logging, distributed, visualization
├── scripts/              # Execution scripts
│   ├── eda.py            # Exploratory data analysis
│   ├── benchmark_preprocessing.py
│   ├── evaluate_model.py
│   └── analyze_results.py
├── configs/              # YAML configurations
├── slurm_jobs/           # SLURM job scripts for cluster
├── outputs/              # Checkpoints, logs, metrics, plots
└── FINAL_REPORT.md       # Comprehensive technical report

🚀 Quick Start

Prerequisites

  • Python 3.10+
  • CUDA 11.8+ (for GPU training)
  • Multi-GPU system (P100/V100/A100 recommended)
  • Access to compute cluster (optional but recommended)

Installation

# 1. Clone and navigate
cd /path/to/parallelml_final

# 2. Create virtual environment
python3 -m venv venv
source venv/bin/activate

# 3. Install dependencies
pip install --upgrade pip setuptools wheel
pip install -r requirements.txt
pip install git+https://github.com/salaniz/pycocoevalcap.git

# 4. Download NLTK data
python -c "import nltk; nltk.download('punkt')"

# 5. Verify installation
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}')"
python -c "import dask; print(f'Dask: {dask.__version__}')"

Dataset Setup

# Download PMC-OA dataset from HuggingFace
python scripts/download_pmc_oa.py --output_dir data/raw/pmc_oa

# Perform exploratory data analysis
python scripts/eda.py \
    --data_dir data/raw/pmc_oa \
    --max_caption_samples 20000 \
    --max_image_samples 20000

# Create development subsets (10k, 100k)
python scripts/create_subsets.py --full_dataset_path data/raw/pmc_oa

Running Experiments

1. Preprocessing Benchmark:

python scripts/benchmark_preprocessing.py \
    --dataset_path data/raw/pmc_oa_10k \
    --workers 4 8 16

2. CPU Baseline:

python -m src.models.baseline \
    --dataset_path data/raw/pmc_oa_10k \
    --n_jobs 16

3. Single GPU Training:

python -m src.training.single_gpu_trainer \
    --dataset_path data/raw/pmc_oa_100k \
    --output_dir outputs/checkpoints/single_gpu \
    --batch_size 4

4. DDP Multi-GPU Training:

# Local (2 GPUs)
torchrun --nproc_per_node=2 \
    src/training/ddp_trainer.py \
    --dataset_path data/raw/pmc_oa_100k \
    --num_gpus 2

# SLURM cluster
sbatch slurm_jobs/scripts/train_ddp_4gpu.sh

5. FSDP Training:

sbatch slurm_jobs/scripts/train_fsdp_4gpu.sh

6. Model Evaluation:

python scripts/evaluate_model.py \
    --checkpoint_path outputs/checkpoints/ddp_2gpu/checkpoint-10000 \
    --dataset_path data/raw/pmc_oa_10k
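
For reference, caption quality metrics can be computed with pycocoevalcap (installed in step 3 of the installation above); the reference/hypothesis dictionaries below are toy examples.

from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

refs = {"img1": ["ct scan of the chest showing a small lesion"]}   # ground-truth captions
hyps = {"img1": ["chest ct showing a lesion"]}                     # model outputs

for name, scorer in [("BLEU", Bleu(4)), ("ROUGE-L", Rouge()), ("CIDEr", Cider())]:
    score, _ = scorer.compute_score(refs, hyps)
    print(name, score)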

📊 Key Results

Performance Summary

| Method        | Configuration       | Throughput    | Speedup | Efficiency          |
|---------------|---------------------|---------------|---------|---------------------|
| Preprocessing | Sequential (1 core) | ~20 img/s     | 1.0×    | 100%                |
| CPU Baseline  | Parallel (16 cores) | 75 img/s      | 9.4×    | 59%                 |
| DDP Training  | 1 GPU (P100)        | 137.9 img/s   | 1.0×    | 100%                |
| DDP Training  | 2 GPUs (P100)       | 268.3 img/s   | 1.95×   | 97.3%               |
| DDP Training  | 3 GPUs (P100)       | 338.1 img/s   | 2.45×   | 81.7%               |
| FSDP Training | 3 GPUs (P100)       | ~20-25 img/s  | N/A     | 57.3% memory saving |
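
The Speedup and Efficiency columns follow directly from throughput relative to the single-device baseline, for example:

# Speedup is throughput relative to the 1-device baseline; efficiency divides by device count.
def scaling_efficiency(throughput, baseline, n_devices):
    speedup = throughput / baseline
    return speedup, speedup / n_devices

print(scaling_efficiency(338.1, 137.9, 3))  # ~ (2.45, 0.817) for 3-GPU DDP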

Model Architecture Details

Parameter Breakdown:

  • BiomedCLIP Vision Encoder: 86M params (frozen)
  • Multi-Layer Projector: 6.3M params (trainable)
  • BioMistral-7B Base: 7.24B params (INT8 quantized)
  • LoRA Adapters: 35M params (trainable)
  • Total Trainable: 41.3M params (99.4% reduction from full fine-tuning)
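
A generic way to verify these counts on the assembled model (a sketch, not project code):

def param_stats(model):
    # Trainable vs. total parameter counts; the reduction is measured against full fine-tuning.
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total, 100.0 * (1 - trainable / total)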

Memory Footprint (per GPU):

  • Without optimization: ~115GB (impossible on P100)
  • With INT8 + LoRA + model parallelism: ~5.34GB (fits on 12GB P100)

Challenges & Learnings

Key Challenges:

  1. Hardware constraints: P100 GPUs (12GB) required model parallelism via device_map="auto"
  2. Precision limitations: P100 lacks BF16 support, so training fell back to FP32 (slower than on modern GPUs)
  3. Model quality: 30% mode collapse rate in generated captions
  4. Training time: Full training took 511 hours due to model parallelism overhead

Key Learnings:

  1. Data parallelism (DDP) achieves near-linear scaling (97% efficiency on 2 GPUs)
  2. Model parallelism is significantly slower than data parallelism (sequential layer execution)
  3. Memory optimization via quantization + LoRA enables 7B model training on 12GB GPUs
  4. Communication overhead increases with GPU count (efficiency drops from 97% → 82%)

📈 Exploratory Data Analysis (EDA)

Dataset Statistics

PMC-OA Dataset:

  • Total samples: 1,651,687 image-caption pairs
  • Train: ~1.3M | Validation: ~165K | Test: ~165K

Caption Analysis:

  • Length: 95.3 ± 45 characters
  • Words: 18.7 ± 8.5 words per caption
  • Medical vocabulary: CT (43K), microscopy (15K), MRI (8K)

Image Properties:

  • Dimensions: 271×238 pixels (mean)
  • Color modes: RGB (91.2%), Grayscale (7.8%)
  • Modalities: X-ray, CT, MRI, microscopy, histopathology

Medical Terminology Coverage:

  • Imaging: CT, MRI, X-ray, ultrasound, microscopy
  • Anatomy: Cell (15K), tissue, brain, lung, liver
  • Pathology: Lesion (4K), tumor (3.5K), cancer, disease
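
A minimal sketch of how the caption-length statistics above could be reproduced, assuming a JSONL file with a caption field (the filename and field name are assumptions):

import json
import numpy as np

lengths, words = [], []
with open("data/raw/pmc_oa/train.jsonl") as f:      # hypothetical split file
    for line in f:
        caption = json.loads(line)["caption"]       # assumed field name
        lengths.append(len(caption))
        words.append(len(caption.split()))

print(f"chars: {np.mean(lengths):.1f} ± {np.std(lengths):.1f}")
print(f"words: {np.mean(words):.1f} ± {np.std(words):.1f}")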

🎯 Implementation Phases (All Completed ✅)

  • ✅ Phase 1: Environment setup & infrastructure
  • ✅ Phase 2: Data acquisition & EDA
  • ✅ Phase 3: Parallel preprocessing with Dask
  • ✅ Phase 4: Model architecture development
  • ✅ Phase 5: Single-GPU training
  • ✅ Phase 6: DDP multi-GPU implementation
  • ✅ Phase 7: FSDP implementation & comparison
  • ✅ Phase 8: CPU baseline & evaluation metrics
  • ✅ Phase 9: Documentation & final report

🔧 Configuration Files

vision_encoder:
  model_name: "microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224"
  freeze: true

language_model:
  model_name: "BioMistral/BioMistral-7B"
  load_in_8bit: true  # INT8 quantization
  use_lora: true
  lora_config:
    r: 16
    lora_alpha: 32
    target_modules: ["q_proj", "v_proj"]

projector:
  hidden_dim: 2048
  num_layers: 2

data:
  batch_size_per_gpu: 4
  max_caption_length: 256

training:
  num_epochs: 3
  learning_rate: 2.0e-4
  gradient_accumulation_steps: 4
  bf16: true  # Falls back to FP32 on P100
  fp16: false

optimizer:
  type: "adamw"
  weight_decay: 0.01

distributed:
  backend: "nccl"

📚 Documentation

Comprehensive Reports:

Key Scripts:


📊 Visualizations

All performance visualizations are generated in the outputs/ directory:

  • Preprocessing speedup curves
  • DDP/FSDP scaling efficiency plots
  • Memory usage comparisons
  • Training loss curves
  • CPU baseline throughput analysis

🖥️ Cluster Usage (SLURM)

Example SLURM Script

#!/bin/bash
#SBATCH --job-name=ddp-3gpu
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gres=gpu:p100:3
#SBATCH --cpus-per-task=24
#SBATCH --mem=120G
#SBATCH --time=24:00:00

module load cuda/11.8
source ~/venv/bin/activate

torchrun --nproc_per_node=3 \
    src/training/ddp_trainer.py \
    --dataset_path data/raw/pmc_oa_100k \
    --num_gpus 3

Monitor Jobs

# Check queue
squeue -u $USER

# Monitor output
tail -f slurm_jobs/logs/<job_id>.out

# Check GPU usage
ssh <compute_node>
nvidia-smi

🎓 Academic Contributions

Research Contributions

  1. Parallel preprocessing at scale: Demonstrated Dask-based processing of 1.6M medical images with memory-efficient chunking
  2. Multi-GPU scaling analysis: Comprehensive DDP vs FSDP comparison on P100 GPUs
  3. Memory-constrained training: Successfully trained 7B model on 12GB GPUs using INT8 + LoRA + model parallelism
  4. CPU baseline comparison: Established XGBoost baseline for traditional ML comparison
  5. Reproducible benchmarks: Open-source codebase with detailed documentation

Lessons for Future Work

  1. Use modern GPUs (A100/H100) with BF16 support for 2-3× speedup
  2. Data parallelism (DDP) vastly outperforms model parallelism when memory permits
  3. Communication overhead is the primary bottleneck in multi-GPU scaling
  4. Quality metrics (BLEU/ROUGE) should be monitored throughout training
  5. Gradient accumulation enables large effective batch sizes on memory-constrained GPUs
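
As a concrete example of point 5, a minimal gradient-accumulation loop with 4 micro-batches of 4 images per GPU gives an effective per-GPU batch of 16 at no extra memory cost; model, loader, and optimizer are generic placeholders, and the HF-style .loss attribute is an assumption.

accum_steps = 4
optimizer.zero_grad()
for step, batch in enumerate(loader):
    loss = model(**batch).loss / accum_steps   # scale so gradients match one large batch
    loss.backward()                            # gradients accumulate across micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()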

📚 References

Dataset

  • PMC-OA: Lin et al. (2023). "PMC-OA: A Biomedical Image-Caption Dataset." HuggingFace

Models

Parallel Computing

  • PyTorch DDP: Li et al. (2020). "PyTorch Distributed: Experiences on Accelerating Data Parallel Training."
  • FSDP: Zhao et al. (2023). "PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel."
  • Dask: Rocklin (2015). "Dask: Parallel Computation with Blocked algorithms and Task Scheduling."

👥 Team Members


πŸ“ License

This project is for educational purposes as part of CSYE7105 coursework at Northeastern University.


πŸ™ Acknowledgments

  • Prof. Handan Liu - CSYE7105 High Performance Parallel ML & AI
  • Northeastern University Discovery Cluster - Computational resources
  • PMC-OA Dataset Creators - Lin et al., 2023
  • HuggingFace Team - Transformers library and model hosting
  • PyTorch Team - DDP and FSDP implementations

📞 Contact

For questions about this project:

  • Course instructor: Prof. Handan Liu
  • Team members: See above

Project Repository: [GitHub Link if applicable]


Status: ✅ All Phases Complete | Final Report Available in FINAL_REPORT.md

Key Achievement: Successfully demonstrated parallel computing across CPU (Dask, joblib) and GPU (DDP, FSDP) paradigms for training a 7B-parameter biomedical vision-language model on memory-constrained hardware.
