MedVLMBench: A Unified Benchmark for Medical Vision-Language Models

MedVLMBench is the first unified benchmark for systematically evaluating generalist and medical-specialist Vision-Language Models (VLMs). It covers 30+ models, 14 datasets, and 3 task types (VQA, diagnosis, captioning) across radiology, pathology, dermatology, and ophthalmology — with support for off-the-shelf inference, linear probing, LoRA fine-tuning, and multi-agent reasoning.


Highlights

  • 30+ models supported — CLIP-based (BioMedCLIP, MedCLIP, PLIP, SigLIP …) and generative (LLaVA, MedGemma, Qwen2-VL, InternVL3, Gemini 2.5 Pro, o3 …)
  • 14 medical datasets — SLAKE, PathVQA, VQA-RAD, MedXpertQA, OmniMedVQA, PneumoniaMNIST, HAM10000, CheXpert, MIMIC-CXR and more
  • 3 evaluation tasks — Visual Question Answering (VQA), Diagnostic Classification, Report Captioning
  • Flexible fine-tuning — off-the-shelf (OTS), linear probing (LP), LoRA, and full fine-tuning
  • Multi-agent reasoning — MDAgent and UCAgent wrappers for chain-of-thought and debate-style inference
  • Reproducible results — full CLI + Jupyter notebook tutorials included

News

  • 2025-06 Paper released on arXiv (2506.17337)
  • 2025-06 Added MedXpertQA and OmniMedVQA benchmark datasets
  • 2025-06 Added MDAgent (multi-specialist reasoning) and UCAgent (hierarchical debate) wrappers
  • 2025-06 Added InternVL3, Gemma3, Qwen2-VL, Qwen2.5-VL, Lingshu, o3, Gemini 2.5 Pro

Companion Benchmark: FairMedFM

Evaluating fairness of medical FMs? See our companion benchmark FairMedFM — the first fairness benchmark covering 20 medical imaging FMs across 17 datasets with bias metrics over sex, race, and age.

MedVLMBench and FairMedFM form a two-part evaluation suite for medical foundation models: MedVLMBench measures capability and FairMedFM measures fairness, on an overlapping set of models and datasets.

| | MedVLMBench | FairMedFM |
| --- | --- | --- |
| Focus | Capability: accuracy, AUROC, VQA scores | Fairness across sex, race, age |
| Model paradigm | Generative VLMs + discriminative models | Discriminative FMs (CLIP, SAM variants) |
| Tasks | VQA, Diagnosis, Captioning | Classification, Segmentation |
| Scale | 30+ VLMs · 14 datasets | 20 FMs · 17 datasets |

Models evaluated in both: BioMedCLIP · MedCLIP · PLIP · SigLIP · MedSigLIP · CLIP · BLIP · BLIP2 · PubMedCLIP

Datasets in both: HAM10000 · CheXpert · MIMIC-CXR · FairVLMed10k · GF3300 · PAPILA


Getting Started

Prerequisites

Python 3.11, PyTorch 2.1+, CUDA 11.8+.

Installation

Option A — pip

git clone https://github.com/ubc-tea/MedVLMBench.git
cd MedVLMBench
pip install -r requirements.txt

Option B — conda

git clone https://github.com/ubc-tea/MedVLMBench.git
cd MedVLMBench
conda env create -f environment.yml
conda activate medvlmbench

Optional fast-attention deps (require CUDA build tools): uncomment flash-attn and/or xformers in requirements.txt before installing.

Download Datasets and Models

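All pretrained models should be stored under MedVLMBench/pretrained_models, and all data under MedVLMBench/data. The examples below clone from Hugging Face with git, which requires Git LFS (git lfs install) so that the actual weight and data files are downloaded rather than pointer stubs.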

mkdir pretrained_models data

Example: LLaVA-1.5

cd pretrained_models
git clone https://huggingface.co/liuhaotian/llava-v1.5-7b
cd ..

Example: MedXpertQA

cd data
git clone https://huggingface.co/datasets/TsinghuaC3I/MedXpertQA
cd ..

Example: OmniMedVQA

cd data
git clone https://huggingface.co/datasets/foreverbeliever/OmniMedVQA
cd ..

Available Models and Datasets

Datasets

| Dataset | Task | Modality |
| --- | --- | --- |
| SLAKE | VQA | Radiology |
| PathVQA | VQA | Pathology |
| VQA-RAD | VQA | Radiology |
| FairVLMed10k | VQA / Diagnosis / Captioning | Ophthalmology |
| MedXpertQA | VQA (multi-choice) | Multi-modal |
| OmniMedVQA | VQA (multi-choice) | Multi-modal |
| MIMIC-CXR | Captioning | Radiology |
| PneumoniaMNIST | Diagnosis | Radiology |
| BreastMNIST | Diagnosis | Radiology |
| DermaMNIST | Diagnosis | Dermatology |
| Camelyon17 | Diagnosis | Pathology |
| HAM10000 | Diagnosis | Dermatology |
| CheXpert | Diagnosis | Radiology |
| ChestXray14 | Diagnosis | Radiology |
| GF3300 | Diagnosis | Ophthalmology |
| PAPILA | Diagnosis | Ophthalmology |
| Drishti | Diagnosis | Ophthalmology |

Models

Generative VLMs (VQA / Captioning)

| Model | Evaluation | Training |
| --- | --- | --- |
| o3 (OpenAI) | Done | NA |
| Gemini 2.5 Pro | Done | NA |
| InternVL3 | Done | Coming Soon |
| LLaVA-1.5 | Done | Done |
| LLaVA-Med | Done | Done |
| Gemma3 | Done | Coming Soon |
| MedGemma | Done | Done |
| Qwen2-VL | Done | Coming Soon |
| Qwen2.5-VL | Done | Coming Soon |
| NVILA | Done | Done |
| VILA-M3 | Done | Done |
| VILA1.5 | Done | Done |
| Lingshu | Done | Done |
| XrayGPT | Done | Done |
| BLIP | Done | Done |
| BLIP2-2.7b | Done | Done |

Contrastive / CLIP-based Models (Diagnosis)

| Model | Evaluation | Training |
| --- | --- | --- |
| BioMedCLIP | Done | Done |
| CLIP | Done | Done |
| MedCLIP | Done | Done |
| PMCCLIP | Done | Done |
| PLIP | Done | Done |
| MedSigLIP | Done | Done |
| PubMedCLIP | Done | Done |
| SigLIP | Done | Done |

Usage

run_eval.py is the main entry point for evaluation. run_train.py is the main entry point for fine-tuning.

Notebook Tutorials

| Feature | Notebook |
| --- | --- |
| Off-the-shelf Diagnosis | Open In Colab |
| Off-the-shelf VQA | Open In Colab |
| LP Diagnosis | Open In Colab |
| LoRA Adaptation VQA | Open In Colab |

Command-Line Interface

Off-the-shelf Evaluation

Diagnosis (zero-shot CLIP)

python run_eval.py \
  --task diagnosis --usage clip-zs --dataset PAPILA --split test \
  --image_path ./data \
  --exp_path ./log \
  --model CLIP --model_path "original_pretrained" \
  --save_pred \
  --cache_dir ./cache
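
For orientation, zero-shot CLIP classification generally compares an image embedding with the text embedding of one prompt per class and predicts the most similar class. The sketch below illustrates that idea with toy vectors in pure Python. The function names, the prompts, the embedding values, and the temperature of 100 are illustrative assumptions and are not taken from MedVLMBench's code; a real run would obtain the embeddings from the CLIP image and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_emb, class_text_embs, temperature=100.0):
    """Return the best-matching prompt and a prompt -> probability mapping."""
    img = l2_normalize(image_emb)
    names = list(class_text_embs)
    logits = []
    for name in names:
        txt = l2_normalize(class_text_embs[name])
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(temperature * cosine)  # assumed logit scale
    probs = softmax(logits)
    best = max(range(len(names)), key=lambda i: probs[i])
    return names[best], dict(zip(names, probs))


# Toy embeddings (hypothetical values standing in for encoder outputs).
text_embs = {
    "a fundus photo of a healthy eye": [0.9, 0.1, 0.0],
    "a fundus photo of glaucoma": [0.1, 0.9, 0.2],
}
image_emb = [0.2, 0.8, 0.1]

label, probs = zero_shot_classify(image_emb, text_embs)
print(label)
```

No class-specific training happens here: changing the label set only means changing the prompts.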

VQA (generative model)

python run_eval.py \
  --task vqa --dataset SLAKE --split test \
  --image_path ./data/SLAKE/imgs \
  --model LLaVA-1.5 --model_path ./pretrained_models/llava-v1.5-7b \
  --exp_path ./log \
  --cache_dir ./cache \
  --save_pred

MDAgent multi-specialist reasoning

Wrap any supported VLM backbone with multi-agent reasoning by adding --usage mdagent:

python run_eval.py \
  --task vqa --dataset VQA-RAD --split test \
  --image_path ./data \
  --model Qwen2-VL \
  --model_path ./pretrained_models/Qwen2-VL-2B-Instruct \
  --usage mdagent \
  --mdagent_mode adaptive \
  --exp_path ./log \
  --cache_dir ./cache \
  --save_pred

MDAgent modes: basic, intermediate, advanced, adaptive (recommended). When --save_pred is set, the output file includes the full reasoning trace per sample.

UCAgent hierarchical debate reasoning

python run_eval.py \
  --task vqa --dataset MedXpertQA --split test \
  --image_path ./data/MedXpertQA \
  --model MedGemma \
  --model_path ./pretrained_models/medgemma-4b-it \
  --usage ucagent \
  --exp_path ./log \
  --cache_dir ./cache \
  --save_pred

UCAgent runs a 3-level hierarchical diagnosis: two independent expert assessments → senior expert verification → critic-panel debate with leader adjudication.
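
The three levels described above can be pictured as a simple control flow. The sketch below only illustrates that description: the `ask` callable, the role names, the prompt wording, and the default of two critics are hypothetical, and none of it is MedVLMBench's actual implementation.

```python
from typing import Callable, List, Tuple

# `ask(role, prompt)` stands in for one call to the wrapped VLM (hypothetical signature).
Ask = Callable[[str, str], str]


def hierarchical_diagnosis(question: str, ask: Ask, n_critics: int = 2) -> dict:
    """Run the three levels in order and keep a per-call reasoning trace."""
    trace: List[Tuple[str, str]] = []

    def call(role: str, prompt: str) -> str:
        answer = ask(role, prompt)
        trace.append((role, answer))
        return answer

    # Level 1: two experts answer independently (neither sees the other's answer).
    expert_1 = call("expert_1", question)
    expert_2 = call("expert_2", question)

    # Level 2: a senior expert verifies the two assessments.
    senior = call(
        "senior_expert",
        f"{question}\nExpert 1: {expert_1}\nExpert 2: {expert_2}\nVerify.",
    )

    # Level 3: a critic panel debates the verified answer, then a leader adjudicates.
    critiques = [
        call(f"critic_{i + 1}", f"{question}\nProposed: {senior}\nCritique.")
        for i in range(n_critics)
    ]
    final = call(
        "leader",
        f"{question}\nProposed: {senior}\nCritiques: {critiques}\nDecide.",
    )
    return {"answer": final, "trace": trace}


# Usage with a stub in place of a real model: every role answers "B".
def stub_ask(role: str, prompt: str) -> str:
    return "B"


result = hierarchical_diagnosis("Which option is correct, A or B?", stub_ask)
roles = [role for role, _ in result["trace"]]
print(result["answer"], roles)
```

Passing the model as a callable keeps the flow independent of any particular backbone, which mirrors the idea of wrapping a supported VLM.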

Fine-tuning

Linear probing (diagnosis)

python run_train.py \
  --task diagnosis --usage lp --dataset HAM10000 --split train \
  --image_path ./data \
  --output_dir ./log \
  --model CLIP --model_path not_given \
  --cache_dir ./cache \
  --num_train_epochs 50 \
  --learning_rate 5e-5

Other fine-tuning modes: img-lora-lp (LP + image encoder LoRA), clip-img-lora (CLIP image encoder LoRA).
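
As background, linear probing keeps the pretrained encoder frozen and trains only a linear classifier on top of its features. The sketch below shows that idea in pure Python with a logistic-regression head fitted by gradient descent. The feature vectors, labels, learning rate, and epoch count are toy assumptions chosen for illustration and are unrelated to the defaults in run_train.py.

```python
import math


def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression head on fixed (frozen) feature vectors."""
    dim = len(features[0])
    weights = [0.0] * dim
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of class 1
            err = p - y                      # gradient of the log loss w.r.t. z
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        # Only the head's parameters are updated; the features never change.
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    """Return 1 if the linear score is positive, else 0."""
    return int(sum(w * xi for w, xi in zip(weights, x)) + bias > 0)


# Toy "frozen encoder outputs" for four images (hypothetical values).
feats = [[1.0, 0.2], [0.9, 0.1], [0.2, 1.0], [0.1, 0.8]]
labels = [0, 0, 1, 1]

weights, bias = train_linear_probe(feats, labels)
preds = [predict(weights, bias, x) for x in feats]
print(preds)
```

Because only the small head is trained, features can be computed once and reused across epochs.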

LoRA fine-tuning (VQA)

deepspeed run_train.py \
  --peft lora --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \
  --deepspeed ./script/zero3.json \
  --task vqa --dataset SLAKE \
  --model LLaVA-1.5 --version v1 \
  --image_path ./data/SLAKE/imgs \
  --model_path ./pretrained_models/llava-v1.5-7b \
  --mm_projector_type mlp2x_gelu \
  --mm_vision_select_layer -2 \
  --mm_use_im_start_end False \
  --mm_use_im_patch_token False \
  --image_aspect_ratio pad \
  --group_by_modality_length True \
  --bf16 True \
  --output_dir ./log \
  --cache_dir ./cache \
  --num_train_epochs 1 \
  --per_device_train_batch_size 8 \
  --gradient_accumulation_steps 2 \
  --learning_rate 2e-4 \
  --warmup_ratio 0.03 \
  --lr_scheduler_type cosine \
  --tune_modules L
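
To put some of the flags above in context, the following sketch works through three pieces of arithmetic: the standard LoRA scaling factor alpha / r, the number of weights a rank-r adapter adds to a single matrix, and the effective batch size. The 4096 x 4096 layer shape and the count of 4 GPUs are illustrative assumptions and are not stated in this README.

```python
def lora_scaling(lora_alpha: int, lora_r: int) -> float:
    """Standard LoRA multiplies the low-rank update B @ A by alpha / r."""
    return lora_alpha / lora_r


def lora_added_params(d_out: int, d_in: int, lora_r: int) -> int:
    """A is (r x d_in) and B is (d_out x r), so the adapter adds r * (d_in + d_out) weights."""
    return lora_r * (d_in + d_out)


def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int) -> int:
    """Samples contributing to one optimizer step across all devices."""
    return per_device * grad_accum * num_gpus


# Values from the command above: --lora_r 128 --lora_alpha 256,
# --per_device_train_batch_size 8, --gradient_accumulation_steps 2.
scaling = lora_scaling(256, 128)

# Assumed layer shape (hypothetical): one 4096 x 4096 projection matrix.
added = lora_added_params(4096, 4096, 128)
full = 4096 * 4096
ratio = added / full

# Assumed hardware (hypothetical): 4 GPUs.
batch = effective_batch_size(8, 2, 4)

print(scaling, added, ratio, batch)
```

Under these assumptions the adapter for that one matrix holds 1,048,576 weights, or 6.25% of the full matrix, and each optimizer step sees 64 samples.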

Abstract

Background: Vision–Language Models (VLMs) have shown promise in automating image diagnosis and interpretation in clinical settings. However, developing medical-specialist VLMs requires substantial computational resources and carefully curated datasets, and it remains unclear under which conditions generalist and medical specialist VLMs each perform best.

Methods: This paper introduces MedVLMBench, the first unified benchmark for systematically evaluating generalist and medical-specialist VLMs. We assessed 18 models spanning contrastive and generative paradigms on 10 publicly available datasets across radiology, pathology, dermatology, and ophthalmology, encompassing 144 diagnostic and 80 VQA settings. MedVLMBench focuses on assessing both in-domain (ID) and out-of-domain (OOD) performance, with off-the-shelf and parameter-efficient fine-tuning (e.g., linear probing, LoRA). Diagnostic classification tasks were evaluated using AUROC, while VQA tasks were assessed with BLEU-1, ROUGE-L, Exact Match, F1 Score, and GPT-based semantic scoring.

Results: Off-the-shelf medical VLMs generally outperformed generalist VLMs on in-domain tasks. However, with lightweight fine-tuning, general-purpose VLMs achieved superior performance in most in-domain evaluations and demonstrated better generalization on out-of-domain tasks. Fine-tuning required only 3% of the parameters associated with full medical pretraining.

Conclusions: Efficiently fine-tuned generalist VLMs can match or surpass medical-specialist VLMs in most tasks, offering a scalable and cost-effective pathway for clinical AI development.
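
Two of the VQA metrics named in the Methods paragraph, Exact Match and F1 Score, are commonly computed at the token level. The sketch below shows one conventional formulation. The normalization (lowercasing, stripping ASCII punctuation) and whitespace tokenization are assumptions, and the benchmark's own scoring code may differ.

```python
import string
from collections import Counter


def normalize(text: str) -> list:
    """Lowercase, drop ASCII punctuation, and split on whitespace (assumed normalization)."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


em = exact_match("Left lung.", "left lung")
f1 = token_f1("the left lung", "left lung")
miss = token_f1("right kidney", "left lung")
print(em, round(f1, 3), miss)
```

In the second example the prediction has three tokens, two of which match the two reference tokens, so precision is 2/3, recall is 1, and F1 is 0.8.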


Citation

If you find this repository useful, please consider citing our paper:

@article{zhong2025can,
  title={Can Common VLMs Rival Medical VLMs? Evaluation and Strategic Insights},
  author={Zhong, Yuan and Jin, Ruinan and Li, Xiaoxiao and Dou, Qi},
  journal={arXiv preprint arXiv:2506.17337},
  year={2025}
}

If you also use our companion benchmark FairMedFM for fairness evaluation, please cite:

@article{jin2024fairmedfm,
  title={FairMedFM: Fairness Benchmarking for Medical Imaging Foundation Models},
  author={Jin, Ruinan and Xu, Zikang and Zhong, Yuan and Yao, Qiongsong and Dou, Qi and Zhou, S Kevin and Li, Xiaoxiao},
  journal={arXiv preprint arXiv:2407.00983},
  year={2024}
}
