FunBench: Benchmarking Fundus Reading Skills of MLLMs

View on Arxiv

News

  • [2025-12-21] Add more results.
  • [2025-11-06] Add preprocess script for RETOUCH.
  • [2025-05-13] FunBench has been early accepted by MICCAI 2025! 🎉🎉🎉
  • [2025-03-28] FunBench is publicly available on Hugging Face

TODO

  • Automated download script for datasets.

Introduction

Multimodal Large Language Models (MLLMs) have shown significant potential in medical image analysis. However, their capabilities in interpreting fundus images, a critical skill for ophthalmology, remain under-evaluated. Existing benchmarks lack fine-grained task divisions and fail to provide modular analysis of an MLLM's two key modules, i.e., the large language model (LLM) and the vision encoder (VE). This paper introduces FunBench, a novel visual question answering (VQA) benchmark designed to comprehensively evaluate MLLMs’ fundus reading skills. FunBench features a hierarchical task organization across four levels (modality perception, anatomy perception, lesion analysis, and disease diagnosis). It also offers three targeted evaluation modes: linear-probe based VE evaluation, knowledge-prompted LLM evaluation, and holistic evaluation. Experiments on ten open-source MLLMs plus GPT-4o reveal significant deficiencies in fundus reading skills, particularly in basic tasks such as laterality recognition. The results highlight the limitations of current MLLMs and emphasize the need for domain-specific training and improved LLMs and VEs.

Hierarchical Task Organization

FunBench comprises 10 tasks organized into 4 levels.

  • Level 1 (L1): Modality perception
  • Level 2 (L2): Anatomy perception
  • Level 3 (L3): Lesion analysis
  • Level 4 (L4): Disease diagnosis

Targeted Evaluation Modes

Three targeted evaluation modes (E-Modes) are provided.

  • E-Mode I: Linear-probe based VE evaluation
  • E-Mode II: Knowledge-prompted LLM evaluation
  • E-Mode III: Holistic evaluation
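
E-Mode I relies on linear probing: the vision encoder stays frozen and only a linear classifier is trained on its output features. The sketch below illustrates that idea in pure Python. The two-dimensional synthetic "features" stand in for real VE embeddings, and every function name, hyperparameter, and data point is an assumption made for illustration, not FunBench's actual implementation.

```python
import math
import random


def softmax(zs):
    m = max(zs)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def logits(weights, biases, x):
    return [sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
            for w, b in zip(weights, biases)]


def train_linear_probe(features, labels, num_classes, epochs=200, lr=0.5):
    """Fit a softmax-regression classifier on fixed (frozen) feature vectors."""
    dim = len(features[0])
    n = len(features)
    weights = [[0.0] * dim for _ in range(num_classes)]
    biases = [0.0] * num_classes
    for _ in range(epochs):
        grad_w = [[0.0] * dim for _ in range(num_classes)]
        grad_b = [0.0] * num_classes
        for x, y in zip(features, labels):
            probs = softmax(logits(weights, biases, x))
            for c in range(num_classes):
                # Cross-entropy gradient w.r.t. logit c: p_c - 1[c == y]
                err = probs[c] - (1.0 if c == y else 0.0)
                grad_b[c] += err
                for j in range(dim):
                    grad_w[c][j] += err * x[j]
        # Full-batch gradient descent step on the averaged gradient
        for c in range(num_classes):
            biases[c] -= lr * grad_b[c] / n
            for j in range(dim):
                weights[c][j] -= lr * grad_w[c][j] / n
    return weights, biases


def probe_predict(weights, biases, x):
    scores = logits(weights, biases, x)
    return max(range(len(scores)), key=scores.__getitem__)


def probe_accuracy(weights, biases, features, labels):
    hits = sum(probe_predict(weights, biases, x) == y
               for x, y in zip(features, labels))
    return hits / len(labels)


def make_toy_features(rng, n_per_class):
    """Synthetic stand-in for frozen VE embeddings: two Gaussian clusters."""
    centers = [(-2.0, 0.0), (2.0, 0.0)]
    feats, labs = [], []
    for label, (cx, cy) in enumerate(centers):
        for _ in range(n_per_class):
            feats.append([rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)])
            labs.append(label)
    return feats, labs


rng = random.Random(0)
train_x, train_y = make_toy_features(rng, 50)
test_x, test_y = make_toy_features(rng, 50)
probe_w, probe_b = train_linear_probe(train_x, train_y, num_classes=2)
probe_acc = probe_accuracy(probe_w, probe_b, test_x, test_y)
print(f"linear-probe accuracy on toy features: {probe_acc:.2f}")
```

Because only the linear layer is trained, the resulting accuracy reflects how linearly separable the encoder's features are for a task, independent of the LLM.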

Results

Results on TDIUC (general domain) vs. results on FunBench.

Preparation

1. Download FunBench

FunBench is available at https://huggingface.co/datasets/AIMClab-RUC/FunBench
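
One possible way to fetch the repository programmatically is the `huggingface_hub` package's `snapshot_download` function. This is a sketch under assumptions: the repository ID is taken from the URL above, while the function name `download_funbench` and the default target directory are hypothetical choices, and the package must be installed separately.

```python
REPO_ID = "AIMClab-RUC/FunBench"  # taken from the dataset URL above


def download_funbench(local_dir="FunBench"):
    """Download the FunBench dataset repository from the Hugging Face Hub."""
    # Imported lazily so the module loads even if huggingface_hub is not installed.
    from huggingface_hub import snapshot_download

    # repo_type="dataset" is required because this is a dataset repo, not a model repo.
    return snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",
        local_dir=local_dir,  # assumed target directory
    )
```

Calling `download_funbench()` requires network access and returns the local path of the downloaded snapshot. The images from the source datasets are not covered by this call and are obtained in the next step.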

2. Download images

We adopt 14 public datasets in FunBench. Please download the images from the provided links and place them in the same directory.
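
Before preprocessing, it can help to verify that every dataset actually landed in the shared directory. The helper below is a hypothetical sketch: it assumes each dataset lives in a subfolder named after the dataset, which the instructions above do not specify, and the demo uses only the two dataset names mentioned in this README.

```python
import tempfile
from pathlib import Path


def find_missing_datasets(image_root, dataset_names):
    """Return the names with no non-empty subdirectory under image_root."""
    root = Path(image_root)
    missing = []
    for name in dataset_names:
        folder = root / name  # assumed layout: <image_root>/<dataset name>/
        if not folder.is_dir() or not any(folder.iterdir()):
            missing.append(name)
    return missing


# Demo on a temporary directory: one populated folder, one empty folder.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "RETOUCH").mkdir()
    (Path(tmp) / "RETOUCH" / "placeholder.png").touch()
    (Path(tmp) / "Retinal-Lesions").mkdir()  # present but empty
    demo_missing = find_missing_datasets(tmp, ["RETOUCH", "Retinal-Lesions"])

print(demo_missing)
```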

3. Image preprocess

We run preprocess.py on the RETOUCH dataset and on color fundus photography (CFP) images.

For the RETOUCH dataset, the script extracts images and masks from the raw data. For CFP images, it crops out the retina area and ensures the images are square. In addition, some images in Retinal-Lesions are rotated by 180 degrees so that their laterality labels are consistent with the image contents.
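
The CFP steps described above (crop to the retina, make the image square, rotate by 180 degrees) can be sketched as follows. This is not the code in preprocess.py: it works on a grayscale image stored as a plain list of rows to stay dependency-free, and the brightness threshold, the choice to pad with black pixels rather than crop further, and the function names are all assumptions.

```python
def crop_to_square(image, threshold=10):
    """Crop a grayscale image (list of rows) to its bright region, then pad to a square."""
    rows = [r for r, row in enumerate(image) if any(v > threshold for v in row)]
    cols = [c for c in range(len(image[0])) if any(row[c] > threshold for row in image)]
    if not rows or not cols:
        return [row[:] for row in image]  # nothing above threshold: return a copy
    top, bottom, left, right = rows[0], rows[-1], cols[0], cols[-1]
    cropped = [row[left:right + 1] for row in image[top:bottom + 1]]

    height, width = len(cropped), len(cropped[0])
    side = max(height, width)
    pad_top = (side - height) // 2
    pad_left = (side - width) // 2
    # Place the crop on a black square canvas (padding is an assumed strategy).
    canvas = [[0] * side for _ in range(side)]
    for r, row in enumerate(cropped):
        canvas[pad_top + r][pad_left:pad_left + width] = row
    return canvas


def rotate_180(image):
    """Rotate by 180 degrees: reverse the row order and reverse each row."""
    return [row[::-1] for row in image[::-1]]


demo_image = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 90, 120, 80, 0],
    [0, 0, 100, 200, 110, 0],
    [0, 0, 0, 0, 0, 0],
]
demo_square = crop_to_square(demo_image)
demo_rotated = rotate_180(demo_square)
print(demo_square)
print(demo_rotated)
```

A 180-degree rotation mirrors the image both horizontally and vertically, which moves the optic disc to the opposite side and is why it can reconcile an image with its laterality label.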

The preprocessing may take 1-2 hours.

Evaluation

Run predict.py to obtain predictions from the MLLMs, then run evaluation.py to compute the metrics. The Predictor class in predict.py is customized per MLLM.
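
To illustrate the predict-then-evaluate split, here is a minimal sketch of a per-model predictor interface followed by a metric computation. Everything in it is hypothetical: the class names, the sample fields (`image`, `question`, `answer`), and the use of exact-match accuracy are assumptions, and the real Predictor class and metrics in predict.py and evaluation.py may differ.

```python
from abc import ABC, abstractmethod


class BasePredictor(ABC):
    """Hypothetical interface: one subclass per MLLM wraps that model's inference."""

    @abstractmethod
    def predict(self, image_path: str, question: str) -> str:
        """Return the model's answer to a question about one image."""


class ConstantPredictor(BasePredictor):
    """Toy stand-in for a real MLLM wrapper: always returns the same answer."""

    def __init__(self, answer: str):
        self.answer = answer

    def predict(self, image_path: str, question: str) -> str:
        return self.answer


def run_predictions(predictor, samples):
    """Prediction stage: query the model once per VQA sample."""
    return [predictor.predict(s["image"], s["question"]) for s in samples]


def exact_match_accuracy(predictions, samples):
    """Evaluation stage: case-insensitive exact match against the reference answer."""
    correct = sum(
        p.strip().lower() == s["answer"].strip().lower()
        for p, s in zip(predictions, samples)
    )
    return correct / len(samples)


# Made-up samples for demonstration only; not taken from FunBench.
demo_samples = [
    {"image": "a.png", "question": "Left or right eye?", "answer": "left"},
    {"image": "b.png", "question": "Left or right eye?", "answer": "right"},
    {"image": "c.png", "question": "Left or right eye?", "answer": "left"},
    {"image": "d.png", "question": "Left or right eye?", "answer": "left"},
]
demo_predictions = run_predictions(ConstantPredictor("Left"), demo_samples)
demo_accuracy = exact_match_accuracy(demo_predictions, demo_samples)
print(demo_accuracy)
```

Keeping the model-specific code behind one small interface means the evaluation stage stays identical across models, so only the predictor subclass changes when a new MLLM is added.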

Citation

If you find our work useful, please consider citing:

@inproceedings{miccai25-funbench,
title = {FunBench: Benchmarking Fundus Reading Skills of MLLMs},
author = {Qijie Wei and Kaiheng Qian and Xirong Li},
booktitle = {MICCAI},
year = {2025}
}

Contact

If you encounter any issues, please feel free to reach us either by creating a new issue on GitHub or by emailing
