CSYE7105 - High Performance Parallel Machine Learning & AI, Spring 2026
Team 18: Shwetanshu Subhash Deshmukh and Bhargav Chickmagalur Nanjundappa
Instructor: Prof. Handan Liu
This project implements a parallel vision-language model for automated biomedical image captioning using the PMC-OA dataset (1.6M image-caption pairs from PubMed Central). We demonstrate parallel computing techniques across multiple paradigms: CPU multi-core processing with Dask, GPU data parallelism with DDP, and memory-sharded training with FSDP.
- ✅ Large-scale dataset: PMC-OA (1.65M biomedical images from PubMed Central)
- ✅ Parallel preprocessing: Dask-based multi-core image processing with chunked memory management
- ✅ Multi-GPU training: DDP achieves 2.45× speedup on 3 GPUs (81.7% efficiency)
- ✅ Memory optimization: FSDP reduces GPU memory by 57.3% vs DDP
- ✅ Parameter efficiency: LoRA fine-tuning (99.4% fewer trainable parameters)
- ✅ CPU baseline: XGBoost + joblib achieves 9.4× speedup on 16 cores
- ✅ Comprehensive benchmarking: Speedup, scaling efficiency, and memory usage across 1-8 CPU cores and 1-3 GPUs
Vision-Language Pipeline:
Input Image (224×224)
  ↓ BiomedCLIP Encoder (86M params, frozen) → 768D features
  ↓ Multi-Layer Projector (6.3M params, trainable) → 4096D features
  ↓ BioMistral-7B + LoRA (35M trainable params) → Caption
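To make the data flow concrete, here is a minimal PyTorch sketch of the trainable projector stage. The class name, exact layer widths, and GELU activation are illustrative assumptions, not the project's exact implementation (see src/models/).

```python
import torch
import torch.nn as nn

class MultiLayerProjector(nn.Module):
    """Maps 768-D BiomedCLIP features into the 4096-D LLM embedding space."""
    def __init__(self, in_dim=768, hidden_dim=2048, out_dim=4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, image_features):
        return self.net(image_features)

projector = MultiLayerProjector()
image_features = torch.randn(4, 768)    # a batch of frozen-encoder outputs
llm_inputs = projector(image_features)  # shape: (4, 4096), fed to the LLM
print(llm_inputs.shape)
```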
Memory Optimization:
- INT8 quantization for base model (7.24B params → ~7GB)
- LoRA adapters (rank=16) instead of full fine-tuning
- Model parallelism via `device_map="auto"` for P100 GPUs (12GB each)
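A hedged sketch of this recipe using the Transformers and PEFT APIs; the arguments mirror configs/model_config.yaml shown later in this README, but treat the snippet as illustrative rather than the project's exact loading code.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "BioMistral/BioMistral-7B",
    load_in_8bit=True,   # INT8 quantization: 7.24B params -> ~7GB of weights
    device_map="auto",   # shard layers across the available P100s
)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```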
Actual Hardware Used:
- GPUs: NVIDIA Tesla P100 (12GB HBM2, Pascal architecture)
- CPUs: Multi-core nodes for Dask preprocessing
- Interconnect: PCIe for multi-GPU communication
- Cluster: Northeastern University Discovery Cluster
Software:
- PyTorch: 2.1.0+cu118
- Transformers: 4.35.2
- Dask: 2023.11.0 for parallel preprocessing
- NCCL: 2.15.5 for multi-GPU communication
- Precision: FP32 (P100 lacks BF16 support) + INT8 quantization
CPU Preprocessing (Dask):
- Sequential baseline: ~20 img/s
- Parallel (16 workers): target 15× speedup
- Chunked processing to manage 1.6M image dataset
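A minimal sketch of the chunked Dask approach, assuming a flat directory of JPEGs; the chunk size, paths, and function names are assumptions, and the real pipeline in src/data/ handles the full JSONL metadata and NPZ output.

```python
import numpy as np
from dask import compute, delayed
from dask.distributed import Client, LocalCluster
from pathlib import Path
from PIL import Image

def preprocess(path):
    """Load one image, resize to 224x224, and scale to [0, 1]."""
    img = Image.open(path).convert("RGB").resize((224, 224))
    return np.asarray(img, dtype=np.float32) / 255.0

@delayed
def process_chunk(chunk_paths, idx):
    """Preprocess one chunk and write it to a compressed NPZ file."""
    arrays = np.stack([preprocess(p) for p in chunk_paths])
    np.savez_compressed(f"data/processed/chunk_{idx:04d}.npz", images=arrays)
    return len(chunk_paths)

if __name__ == "__main__":
    client = Client(LocalCluster(n_workers=16, threads_per_worker=1))
    paths = sorted(str(p) for p in Path("data/raw/pmc_oa").glob("*.jpg"))
    # Fixed-size chunks bound per-worker memory, so the 1.6M-image dataset
    # never needs to be resident in RAM all at once.
    chunks = [paths[i:i + 1000] for i in range(0, len(paths), 1000)]
    tasks = [process_chunk(c, i) for i, c in enumerate(chunks)]
    total = sum(compute(*tasks))
    print(f"Preprocessed {total} images")
```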
CPU Baseline (XGBoost + joblib):
- Sequential: 8 img/s
- Parallel (16 cores): 75 img/s → 9.4× speedup, 59% efficiency
- Task: Image modality classification (X-ray, CT, MRI, microscopy)
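For reference, a compact sketch of how such a baseline can be wired together: joblib fans feature extraction out across cores, and XGBoost (itself multi-threaded via n_jobs) classifies modality. The thumbnail feature and function names are illustrative; the actual baseline lives in src/models/baseline.py.

```python
import numpy as np
from joblib import Parallel, delayed
from xgboost import XGBClassifier
from PIL import Image

def extract_features(path):
    """Toy example feature: a flattened 32x32 grayscale thumbnail."""
    img = Image.open(path).convert("L").resize((32, 32))
    return np.asarray(img, dtype=np.float32).ravel() / 255.0

def run_baseline(paths, labels, n_jobs=16):
    # Feature extraction is embarrassingly parallel across images
    X = np.stack(
        Parallel(n_jobs=n_jobs)(delayed(extract_features)(p) for p in paths)
    )
    clf = XGBClassifier(n_estimators=200, n_jobs=n_jobs)
    clf.fit(X, labels)  # labels: X-ray / CT / MRI / microscopy
    return clf
```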
GPU Training - DDP (Data Parallel):
- 1 GPU: 137.9 img/s (baseline)
- 2 GPUs: 268.3 img/s → 1.95× speedup, 97.3% efficiency
- 3 GPUs: 338.1 img/s → 2.45× speedup, 81.7% efficiency
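A skeletal DDP loop consistent with these runs; the linear model, shapes, and synthetic data are stand-ins for the full trainer in src/training/ddp_trainer.py. Launch with torchrun, e.g. `torchrun --nproc_per_node=3 ddp_sketch.py`.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    dist.init_process_group(backend="nccl")       # torchrun sets the env vars
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(768, 4096).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])   # replicate, sync gradients

    dataset = TensorDataset(torch.randn(1024, 768), torch.randn(1024, 4096))
    sampler = DistributedSampler(dataset)         # shard data across ranks
    loader = DataLoader(dataset, batch_size=4, sampler=sampler)

    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
    for x, y in loader:
        optimizer.zero_grad()
        out = model(x.cuda(local_rank))
        loss = torch.nn.functional.mse_loss(out, y.cuda(local_rank))
        loss.backward()                           # gradients all-reduced here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```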
GPU Training - FSDP (Fully Sharded):
- Memory reduction: 57.3% vs DDP
- Trade-off: Lower throughput due to communication overhead
- Enables training larger models on memory-constrained GPUs
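To contrast with DDP, a minimal sketch of FSDP wrapping under the same torchrun launch model; the toy network and hyperparameters are placeholders for the real trainer in src/training/fsdp_trainer.py.

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Sequential(
    torch.nn.Linear(768, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 768),
).cuda(local_rank)

# Unlike DDP, each rank now holds only a shard of every parameter, gradient,
# and optimizer state; full weights are gathered on the fly per layer during
# forward/backward, which is where the extra communication cost comes from.
model = FSDP(model)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
```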
Full Training Run:
- Dataset: 47.5K samples
- Hardware: 4ΓP100 with model parallelism
- Duration: 511.7 hours (~21 days)
- Throughput: 0.026 img/s (bottlenecked by sequential layer execution)
parallelml_final/
├── data/
│   ├── raw/              # PMC-OA dataset (JSONL + images)
│   ├── processed/        # Preprocessed NPZ files
│   └── metadata/         # EDA statistics
├── src/
│   ├── data/             # Preprocessing (sequential + Dask parallel)
│   ├── models/           # VLM architecture (encoder, projector, LLM)
│   ├── training/         # Trainers (single-GPU, DDP, FSDP)
│   ├── evaluation/       # Metrics (BLEU, ROUGE, CIDEr)
│   └── utils/            # Logging, distributed, visualization
├── scripts/              # Execution scripts
│   ├── eda.py            # Exploratory data analysis
│   ├── benchmark_preprocessing.py
│   ├── evaluate_model.py
│   └── analyze_results.py
├── configs/              # YAML configurations
├── slurm_jobs/           # SLURM job scripts for cluster
├── outputs/              # Checkpoints, logs, metrics, plots
└── FINAL_REPORT.md       # Comprehensive technical report
- Python 3.10+
- CUDA 11.8+ (for GPU training)
- Multi-GPU system (P100/V100/A100 recommended)
- Access to compute cluster (optional but recommended)
# 1. Clone and navigate
cd /path/to/parallelml_final
# 2. Create virtual environment
python3 -m venv venv
source venv/bin/activate
# 3. Install dependencies
pip install --upgrade pip setuptools wheel
pip install -r requirements.txt
pip install git+https://github.com/salaniz/pycocoevalcap.git
# 4. Download NLTK data
python -c "import nltk; nltk.download('punkt')"
# 5. Verify installation
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}')"
python -c "import dask; print(f'Dask: {dask.__version__}')"# Download PMC-OA dataset from HuggingFace
python scripts/download_pmc_oa.py --output_dir data/raw/pmc_oa
# Perform exploratory data analysis
python scripts/eda.py \
--data_dir data/raw/pmc_oa \
--max_caption_samples 20000 \
--max_image_samples 20000
# Create development subsets (10k, 100k)
python scripts/create_subsets.py --full_dataset_path data/raw/pmc_oa

1. Preprocessing Benchmark:
python scripts/benchmark_preprocessing.py \
--dataset_path data/raw/pmc_oa_10k \
--workers 4 8 16

2. CPU Baseline:
python -m src.models.baseline \
--dataset_path data/raw/pmc_oa_10k \
--n_jobs 16

3. Single GPU Training:
python -m src.training.single_gpu_trainer \
--dataset_path data/raw/pmc_oa_100k \
--output_dir outputs/checkpoints/single_gpu \
--batch_size 4

4. DDP Multi-GPU Training:
# Local (2 GPUs)
torchrun --nproc_per_node=2 \
src/training/ddp_trainer.py \
--dataset_path data/raw/pmc_oa_100k \
--num_gpus 2
# SLURM cluster
sbatch slurm_jobs/scripts/train_ddp_4gpu.sh

5. FSDP Training:
sbatch slurm_jobs/scripts/train_fsdp_4gpu.sh

6. Model Evaluation:
python scripts/evaluate_model.py \
--checkpoint_path outputs/checkpoints/ddp_2gpu/checkpoint-10000 \
--dataset_path data/raw/pmc_oa_10k
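As a sketch of what the evaluation step computes, the pycocoevalcap package installed in step 3 can score generated captions directly. The ids and caption strings below are dummy data; both dicts map an example id to a list of strings.

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

refs = {"img1": ["ct scan of the chest showing a nodule"]}   # ground truth
hyps = {"img1": ["ct image of the chest with a small nodule"]}  # generated

bleu, _ = Bleu(4).compute_score(refs, hyps)   # BLEU-1 through BLEU-4
rouge, _ = Rouge().compute_score(refs, hyps)  # ROUGE-L
cider, _ = Cider().compute_score(refs, hyps)  # CIDEr (corpus-level tf-idf)
print(f"BLEU-4: {bleu[3]:.3f}  ROUGE-L: {rouge:.3f}  CIDEr: {cider:.3f}")
```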
| Method | Configuration | Throughput | Speedup | Efficiency |
|---|---|---|---|---|
| Preprocessing | Sequential (1 core) | ~20 img/s | 1.0× | 100% |
| CPU Baseline | Parallel (16 cores) | 75 img/s | 9.4× | 59% |
| DDP Training | 1 GPU (P100) | 137.9 img/s | 1.0× | 100% |
| DDP Training | 2 GPUs (P100) | 268.3 img/s | 1.95× | 97.3% |
| DDP Training | 3 GPUs (P100) | 338.1 img/s | 2.45× | 81.7% |
| FSDP Training | 3 GPUs (P100) | ~20-25 img/s | N/A | 57.3% memory saving |
Parameter Breakdown:
- BiomedCLIP Vision Encoder: 86M params (frozen)
- Multi-Layer Projector: 6.3M params (trainable)
- BioMistral-7B Base: 7.24B params (INT8 quantized)
- LoRA Adapters: 35M params (trainable)
- Total Trainable: 41.3M params (99.4% reduction from full fine-tuning)
Memory Footprint (per GPU):
- Without optimization: ~115GB (impossible on P100)
- With INT8 + LoRA + model parallelism: ~5.34GB (fits on 12GB P100)
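A back-of-envelope check of these two figures, using hypothetical byte accounting that ignores activations, buffers, and CUDA context (which is why the measured ~5.34GB per GPU exceeds the raw weight math below).

```python
base_params = 7.24e9       # BioMistral-7B
trainable_params = 41.3e6  # projector + LoRA adapters

# Full FP32 fine-tuning with AdamW keeps, per parameter:
# 4 B weights + 4 B gradients + 8 B optimizer moments (m, v) = 16 B
full_ft_gb = base_params * 16 / 1e9
print(f"Full FP32 fine-tuning: ~{full_ft_gb:.0f} GB")  # ~116 GB, matches ~115GB

# Optimized: INT8 weights (1 B/param) plus full FP32 training state
# only for the small trainable subset, sharded across 4 GPUs
optimized_gb = (base_params * 1 + trainable_params * 16) / 1e9
print(f"INT8 + LoRA, whole model: ~{optimized_gb:.1f} GB")
print(f"Raw weights per GPU across 4 GPUs: ~{optimized_gb / 4:.1f} GB")
```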
Key Challenges:
- Hardware constraints: P100 GPUs (12GB) required model parallelism via `device_map="auto"`
- Precision limitations: P100 lacks BF16 support, so training fell back to FP32 (slower than on modern GPUs)
- Model quality: 30% mode collapse rate in generated captions
- Training time: Full training took 511 hours due to model parallelism overhead
Key Learnings:
- Data parallelism (DDP) achieves near-linear scaling (97% efficiency on 2 GPUs)
- Model parallelism is significantly slower than data parallelism (sequential layer execution)
- Memory optimization via quantization + LoRA enables 7B model training on 12GB GPUs
- Communication overhead increases with GPU count (efficiency drops from 97% → 82%)
PMC-OA Dataset:
- Total samples: 1,651,687 image-caption pairs
- Train: ~1.3M | Validation: ~165K | Test: ~165K
Caption Analysis:
- Length: 95.3 ± 45 characters
- Words: 18.7 ± 8.5 words per caption
- Medical vocabulary: CT (43K), microscopy (15K), MRI (8K)
Image Properties:
- Dimensions: 271×238 pixels (mean)
- Color modes: RGB (91.2%), Grayscale (7.8%)
- Modalities: X-ray, CT, MRI, microscopy, histopathology
Medical Terminology Coverage:
- Imaging: CT, MRI, X-ray, ultrasound, microscopy
- Anatomy: Cell (15K), tissue, brain, lung, liver
- Pathology: Lesion (4K), tumor (3.5K), cancer, disease
- ✅ Phase 1: Environment setup & infrastructure
- ✅ Phase 2: Data acquisition & EDA
- ✅ Phase 3: Parallel preprocessing with Dask
- ✅ Phase 4: Model architecture development
- ✅ Phase 5: Single-GPU training
- ✅ Phase 6: DDP multi-GPU implementation
- ✅ Phase 7: FSDP implementation & comparison
- ✅ Phase 8: CPU baseline & evaluation metrics
- ✅ Phase 9: Documentation & final report
Model Config (configs/model_config.yaml)
vision_encoder:
  model_name: "microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224"
  freeze: true

language_model:
  model_name: "BioMistral/BioMistral-7B"
  load_in_8bit: true  # INT8 quantization
  use_lora: true
  lora_config:
    r: 16
    lora_alpha: 32
    target_modules: ["q_proj", "v_proj"]

projector:
  hidden_dim: 2048
  num_layers: 2

Training Config (configs/training_config.yaml)
data:
  batch_size_per_gpu: 4
  max_caption_length: 256

training:
  num_epochs: 3
  learning_rate: 2.0e-4
  gradient_accumulation_steps: 4
  bf16: true  # Falls back to FP32 on P100
  fp16: false

optimizer:
  type: "adamw"
  weight_decay: 0.01

distributed:
  backend: "nccl"

Comprehensive Reports:
- FINAL_REPORT.md - Full technical report with methodology, results, and analysis
- PHASE1-9_COMPLETE.md - Detailed phase-by-phase implementation logs
- IMPLEMENTATION_STATUS.md - Project status and deliverables
Key Scripts:
- scripts/eda.py - Exploratory data analysis
- scripts/benchmark_preprocessing.py - Preprocessing benchmarks
- src/training/ddp_trainer.py - DDP implementation
- src/training/fsdp_trainer.py - FSDP implementation
- src/models/baseline.py - CPU XGBoost baseline
All performance visualizations are generated in outputs/ directory:
- Preprocessing speedup curves
- DDP/FSDP scaling efficiency plots
- Memory usage comparisons
- Training loss curves
- CPU baseline throughput analysis
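As one hedged example, the DDP speedup and efficiency curves can be regenerated from the results table above with a few lines of matplotlib; the output filename is an assumption.

```python
import matplotlib.pyplot as plt

gpus = [1, 2, 3]
throughput = [137.9, 268.3, 338.1]  # img/s, from the results table above
speedup = [t / throughput[0] for t in throughput]
efficiency = [100 * s / n for s, n in zip(speedup, gpus)]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(gpus, speedup, "o-", label="measured")
ax1.plot(gpus, gpus, "--", label="ideal")  # linear-scaling reference line
ax1.set(xlabel="GPUs", ylabel="Speedup", title="DDP Speedup")
ax1.legend()
ax2.plot(gpus, efficiency, "o-")
ax2.set(xlabel="GPUs", ylabel="Efficiency (%)", title="Scaling Efficiency")
fig.tight_layout()
fig.savefig("outputs/ddp_scaling.png", dpi=150)  # illustrative output path
```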
#!/bin/bash
#SBATCH --job-name=ddp-3gpu
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gres=gpu:p100:3
#SBATCH --cpus-per-task=24
#SBATCH --mem=120G
#SBATCH --time=24:00:00
module load cuda/11.8
source ~/venv/bin/activate
torchrun --nproc_per_node=3 \
src/training/ddp_trainer.py \
--dataset_path data/raw/pmc_oa_100k \
--num_gpus 3

# Check queue
squeue -u $USER
# Monitor output
tail -f slurm_jobs/logs/<job_id>.out
# Check GPU usage
ssh <compute_node>
nvidia-smi

- Parallel preprocessing at scale: Demonstrated Dask-based processing of 1.6M medical images with memory-efficient chunking
- Multi-GPU scaling analysis: Comprehensive DDP vs FSDP comparison on P100 GPUs
- Memory-constrained training: Successfully trained 7B model on 12GB GPUs using INT8 + LoRA + model parallelism
- CPU baseline comparison: Established XGBoost baseline for traditional ML comparison
- Reproducible benchmarks: Open-source codebase with detailed documentation
- Use modern GPUs (A100/H100) with BF16 support for 2-3Γ speedup
- Data parallelism (DDP) vastly outperforms model parallelism when memory permits
- Communication overhead is the primary bottleneck in multi-GPU scaling
- Quality metrics (BLEU/ROUGE) should be monitored throughout training
- Gradient accumulation enables large effective batch sizes on memory-constrained GPUs (see the sketch below)
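A minimal, self-contained illustration of that last point, assuming a toy linear model and synthetic batches; the accumulation factor matches gradient_accumulation_steps: 4 in the training config, so with batch_size_per_gpu: 4 the effective batch is 16 per GPU.

```python
import torch

model = torch.nn.Linear(768, 4096)  # stand-in for the full VLM
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
loader = [(torch.randn(4, 768), torch.randn(4, 4096)) for _ in range(8)]

accumulation_steps = 4  # matches the training config
optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    # Scale each micro-batch loss so the accumulated gradient is an average
    loss = torch.nn.functional.mse_loss(model(x), y) / accumulation_steps
    loss.backward()  # gradients accumulate in .grad across micro-batches
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()       # one weight update per 4 micro-batches
        optimizer.zero_grad()
```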
- PMC-OA: Lin et al. (2023). "PMC-OA: A Biomedical Image-Caption Dataset." HuggingFace
- BiomedCLIP: Zhang et al. (2023). microsoft/BiomedCLIP
- BioMistral-7B: BioMistral/BioMistral-7B
- PyTorch DDP: Li et al. (2020). "PyTorch Distributed: Experiences on Accelerating Data Parallel Training."
- FSDP: Zhao et al. (2023). "PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel."
- Dask: Rocklin (2015). "Dask: Parallel Computation with Blocked Algorithms and Task Scheduling."
- Shwetanshu Subhash Deshmukh - deshmukh.shw@northeastern.edu
- Bhargav Chickmagalur Nanjundappa - chickmagalur.b@northeastern.edu
This project is for educational purposes as part of CSYE7105 coursework at Northeastern University.
- Prof. Handan Liu - CSYE7105 High Performance Parallel ML & AI
- Northeastern University Discovery Cluster - Computational resources
- PMC-OA Dataset Creators - Lin et al., 2023
- HuggingFace Team - Transformers library and model hosting
- PyTorch Team - DDP and FSDP implementations
For questions about this project:
- Course instructor: Prof. Handan Liu
- Team members: See above
Project Repository: [GitHub Link if applicable]
Status: ✅ All Phases Complete | Final Report Available in FINAL_REPORT.md
Key Achievement: Successfully demonstrated parallel computing across CPU (Dask, joblib) and GPU (DDP, FSDP) paradigms for training a 7B-parameter biomedical vision-language model on memory-constrained hardware.