Columbia-ICSL/TDBench

TDBench: Benchmarking Vision-Language Models in Understanding Top-Down Images

* Equally contributing first authors

Intelligent and Connected Systems Lab (ICSL), Columbia University

paper HuggingFace

Figure: 8 Representative VLMs on 10 dimensions in TDBench

Abstract: The rapid emergence of Vision-Language Models (VLMs) has significantly advanced multimodal understanding, enabling applications in scene comprehension and visual reasoning. While these models have been primarily evaluated and developed for front-view image understanding, their capabilities in interpreting top-down images have received limited attention, partly due to the scarcity of diverse top-down datasets and the challenges in collecting such data. In contrast, top-down vision provides explicit spatial overviews and improved contextual understanding of scenes, making it particularly valuable for tasks like autonomous navigation, aerial imaging, and spatial planning. In this work, we address this gap by introducing TDBench, a comprehensive benchmark for VLMs in top-down image understanding. TDBench is constructed from public top-down view datasets and high-quality simulated images, including diverse real-world and synthetic scenarios. TDBench consists of visual question-answer pairs across ten evaluation dimensions of image understanding. Moreover, we conduct four case studies on situations that commonly occur in real-world scenarios but are less explored. By revealing the strengths and limitations of existing VLMs through evaluation results, we hope TDBench will provide insights that motivate future research.

📢 Latest Updates

  • Apr-29-25: TDBench has been merged into VLMEvalKit.
  • Apr-10-25: The arXiv preprint is released (arXiv:2504.03748). 🔥🔥
  • Apr-01-25: We release the benchmark dataset.

💡 Overview

🏆 Contributions

  • TDBench Benchmark. We introduce TDBench, a benchmark designed specifically for evaluating VLMs on top-down images that originate from real scenarios such as aerial operations and drone applications. We manually curated a dataset comprising a total of 2000 questions.
  • Rotational Evaluation. We introduce an evaluation strategy, RotationalEval, specifically designed for top-down images. Because of the nature of top-down images, rotating them does not change their semantic meaning, whereas rotating a front-view image produces a scene that does not make physical sense.
  • Four Case Studies. We performed 4 case studies on situations that frequently occur in the real world. These studies evaluate specific capabilities of VLMs under controlled conditions, providing actionable insights for practical deployment while identifying critical challenges that must be addressed for reliable aerial image understanding.
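
The idea behind RotationalEval can be illustrated with a minimal sketch. It assumes that a question counts as correct only when the model answers it correctly under all four rotations; this aggregation rule, the function name `rotational_accuracy`, and the input layout are assumptions made for illustration and are not taken from this README. In practice VLMEvalKit computes RotationalEval for you.

```python
from typing import Dict, Iterable

# The four rotation angles used by the TDBench datasets.
ROTATIONS = (0, 90, 180, 270)


def rotational_accuracy(
    per_rotation: Dict[int, Dict[str, bool]],
    rotations: Iterable[int] = ROTATIONS,
) -> float:
    """Fraction of questions answered correctly under every rotation.

    per_rotation maps a rotation angle to {question_id: is_correct}.
    ASSUMPTION: a question passes only if all rotations are correct.
    """
    rotations = list(rotations)
    # Only score questions that were evaluated under every rotation.
    question_ids = set.intersection(*(set(per_rotation[r]) for r in rotations))
    if not question_ids:
        return 0.0
    passed = sum(
        all(per_rotation[r][q] for r in rotations) for q in question_ids
    )
    return passed / len(question_ids)


# Hypothetical per-question results for three questions.
example = {
    0: {"q1": True, "q2": True, "q3": False},
    90: {"q1": True, "q2": False, "q3": False},
    180: {"q1": True, "q2": True, "q3": False},
    270: {"q1": True, "q2": True, "q3": True},
}
print(rotational_accuracy(example))
```

Under this rule only `q1` passes, so a model that is right on three of four rotations for `q2` gets no credit for it, which penalizes answers that depend on image orientation.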

📊 Benchmarks Comparison

Dataset Comparison table

Table: Overview of the performance of 8 open-source VLMs and 6 proprietary VLMs on 10 dimensions with the RotationalEval method.


πŸ—‚οΈ Case Studies

Top-down images are usually captured from a relatively high altitude, which introduces challenges such as small objects and an unfamiliar perspective. Furthermore, top-down images do not contain depth information in most cases, yet depth is very important for many real-world applications such as building height estimation, autonomous drone navigation, and obstacle avoidance. Based on these considerations, we also conduct the following four case studies in the paper.

  1. Digital Magnification for Small Object Detection

    • Provide insights on post-processing the images to enable VLMs to see small objects
  2. Altitude Effects on Object Detection

    • Guidelines on drones' hovering height for different object detection tasks
  3. Object Visibility and Partial Occlusion

    • Study when objects are partially hidden or occluded by other objects
  4. Z-Axis Perception and Depth Understanding

    • Assessing the depth reasoning from top-down images
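
As an illustration of the first case study, digital magnification can be thought of as cropping a region of the image and enlarging it before passing it to the VLM. The sketch below only computes the crop box; the helper name `zoom_crop_box` and the centre-plus-factor parameterization are hypothetical, and the procedure used in the paper may differ.

```python
from typing import Tuple


def zoom_crop_box(
    width: int, height: int, cx: int, cy: int, factor: float
) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of a crop that magnifies by `factor`.

    The crop is centred on (cx, cy) where possible and shifted to stay
    inside the image when the centre is close to a border.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1 for magnification")
    # A zoom of `factor` keeps 1/factor of each image dimension.
    crop_w = max(1, round(width / factor))
    crop_h = max(1, round(height / factor))
    # Clamp the top-left corner so the box never leaves the image.
    left = min(max(cx - crop_w // 2, 0), width - crop_w)
    top = min(max(cy - crop_h // 2, 0), height - crop_h)
    return left, top, left + crop_w, top + crop_h


print(zoom_crop_box(1000, 800, 500, 400, 2))  # (250, 200, 750, 600)
print(zoom_crop_box(1000, 800, 10, 10, 4))    # (0, 0, 250, 200)
```

The resulting box could then be handed to an image library, for instance Pillow's `Image.crop` followed by `Image.resize`, to produce the magnified input.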

🤖 How to run TDBench

TDBench is fully compatible with VLMEvalKit.

Installation

  1. First, install the VLMEvalKit environment by following the instructions in the official repository
  2. Set up your model configuration and APIs according to VLMEvalKit requirements

Datasets (for VLMEvalKit run.py)

  • Standard Evaluation - Tests 9 dimensions with 4 rotation angles

    • tdbench_rot0 (0° rotation)
    • tdbench_rot90 (90° rotation)
    • tdbench_rot180 (180° rotation)
    • tdbench_rot270 (270° rotation)
  • Visual Grounding - Tests visual grounding with 4 rotation angles

    • tdbench_grounding_rot0 (0° rotation)
    • tdbench_grounding_rot90 (90° rotation)
    • tdbench_grounding_rot180 (180° rotation)
    • tdbench_grounding_rot270 (270° rotation)
  • Case Studies - 4 studies

    • tdbench_cs_zoom
    • tdbench_cs_height
    • tdbench_cs_integrity
    • tdbench_cs_depth
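
The dataset names above follow a regular pattern, so they can be generated rather than typed by hand. The following is a small convenience sketch; the helper function names are hypothetical, and only the resulting dataset names come from the list above.

```python
from typing import List

ROTATION_ANGLES = (0, 90, 180, 270)
CASE_STUDIES = ("zoom", "height", "integrity", "depth")


def standard_datasets() -> List[str]:
    """Dataset names for the standard evaluation, one per rotation."""
    return [f"tdbench_rot{angle}" for angle in ROTATION_ANGLES]


def grounding_datasets() -> List[str]:
    """Dataset names for visual grounding, one per rotation."""
    return [f"tdbench_grounding_rot{angle}" for angle in ROTATION_ANGLES]


def case_study_datasets() -> List[str]:
    """Dataset names for the four case studies."""
    return [f"tdbench_cs_{name}" for name in CASE_STUDIES]


print(standard_datasets())
print(grounding_datasets())
print(case_study_datasets())
```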

Usage Examples

Standard Evaluation

To only evaluate a single rotation

python run.py --data tdbench_rot0 \
              --model <model_name> \
              --verbose \
              --work-dir <results_directory>

To apply RotationalEval, simply run all rotations

python run.py --data tdbench_rot0 tdbench_rot90 tdbench_rot180 tdbench_rot270 \
              --model <model_name> \
              --verbose \
              --work-dir <results_directory>

Visual Grounding Evaluation

To only evaluate a single rotation

python run.py --data tdbench_grounding_rot0 \
              --model <model_name> \
              --verbose \
              --judge centroid \
              --work-dir <results_directory>

To apply RotationalEval, simply run all rotations

python run.py --data tdbench_grounding_rot0 tdbench_grounding_rot90 tdbench_grounding_rot180 tdbench_grounding_rot270 \
              --model <model_name> \
              --verbose \
              --judge centroid \
              --work-dir <results_directory>

Case Studies

Run all case studies with:

python run.py --data tdbench_cs_zoom tdbench_cs_height tdbench_cs_integrity tdbench_cs_depth \
              --model <model_name> \
              --verbose \
              --work-dir <results_directory>
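
When scripting several runs, the commands above can be assembled programmatically. This is a minimal sketch that only builds the argument list and prints it; it uses exactly the flags shown in the examples above, while the function name `build_run_command` and the values `my_model` and `results` are placeholders chosen for illustration.

```python
import shlex
from typing import List, Optional, Sequence


def build_run_command(
    datasets: Sequence[str],
    model: str,
    work_dir: str,
    judge: Optional[str] = None,
) -> List[str]:
    """Assemble a VLMEvalKit run.py invocation as an argument list."""
    cmd = ["python", "run.py", "--data", *datasets, "--model", model, "--verbose"]
    # The grounding examples above pass `--judge centroid`; other runs omit it.
    if judge is not None:
        cmd += ["--judge", judge]
    cmd += ["--work-dir", work_dir]
    return cmd


grounding = [f"tdbench_grounding_rot{a}" for a in (0, 90, 180, 270)]
cmd = build_run_command(grounding, "my_model", "results", judge="centroid")
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from inside the VLMEvalKit checkout to launch the evaluation.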

Output

VLMEvalKit prints and saves each dataset's output in <results_directory>/<model_name>. Check xxx_acc.csv for the accuracy score and xxx_result.xlsx for the detailed VLM outputs. RotationalEval is triggered automatically once all four rotations have been run; its results are printed and saved as xxx_REresult.csv.
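
To gather these files after a run, a short helper can search the model's output directory by suffix. This is a sketch under the assumption that the files sit somewhere below <results_directory>/<model_name>; the `xxx` prefix is not spelled out here, so the helper matches on suffix only, and the name `collect_outputs` is hypothetical.

```python
from pathlib import Path
from typing import Dict, List

# File suffixes named in the Output section above.
SUFFIXES = ("_acc.csv", "_result.xlsx", "_REresult.csv")


def collect_outputs(work_dir: str, model_name: str) -> Dict[str, List[Path]]:
    """Group TDBench output files under work_dir/model_name by suffix."""
    model_dir = Path(work_dir) / model_name
    # rglob also searches subdirectories, in case runs are nested (assumption).
    return {
        suffix: sorted(model_dir.rglob(f"*{suffix}")) for suffix in SUFFIXES
    }
```

For example, `collect_outputs("results", "my_model")["_REresult.csv"]` would list the RotationalEval summaries, and returns an empty list if not all rotations have been run yet.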


📜 Citation

If you find our work and this repository useful, please consider giving our repo a star and citing our paper as follows:

@article{hou2025tdbench,
  title={TDBench: Benchmarking Vision-Language Models in Understanding Top-Down Images},
  author={Hou, Kaiyuan and Zhao, Minghui and Xu, Lilin and Fan, Yuang and Jiang, Xiaofan},
  journal={arXiv preprint arXiv:2504.03748},
  year={2025}
}

📨 Contact

If you have any questions, please create an issue on this repository or contact us at kh3119@columbia.edu or mz2866@columbia.edu.


Official GitHub repository for TDBench: Benchmarking Vision-Language Models in Understanding Top-Down Images.
