TDBench: Benchmarking Vision-Language Models in Understanding Top-Down Images

Kaiyuan Hou*, Minghui Zhao*, Lilin Xu, Yuang Fan, Xiaofan Jiang

* Equally contributing first authors

Intelligent and Connected Systems Lab (ICSL), Columbia University

Figure: 8 Representative VLMs on 10 dimensions in TDBench

Abstract: The rapid emergence of Vision-Language Models (VLMs) has significantly advanced multimodal understanding, enabling applications in scene comprehension and visual reasoning. While these models have been primarily evaluated and developed for front-view image understanding, their capabilities in interpreting top-down images have received limited attention, partly due to the scarcity of diverse top-down datasets and the challenges in collecting such data. Yet top-down vision provides explicit spatial overviews and improved contextual understanding of scenes, making it particularly valuable for tasks like autonomous navigation, aerial imaging, and spatial planning. In this work, we address this gap by introducing TDBench, a comprehensive benchmark for VLMs in top-down image understanding. TDBench is constructed from public top-down view datasets and high-quality simulated images, covering diverse real-world and synthetic scenarios. It consists of visual question-answer pairs across ten evaluation dimensions of image understanding. Moreover, we conduct four case studies on situations that commonly arise in real-world scenarios but are less explored. By revealing the strengths and limitations of existing VLMs through evaluation results, we hope TDBench provides insights that motivate future research.

📢 Latest Updates

Apr-29-25: TDBench has been merged into VLMEvalKit.
Apr-10-25: arXiv preprint is released (arXiv:2504.03748). 🔥🔥
Apr-01-25: We release the benchmark dataset.

💡 Overview

🏆 Contributions

TDBench Benchmark. We introduce TDBench, a benchmark designed specifically for evaluating VLMs on top-down images originating from real scenarios in aerial operations or drone applications. We manually curated a dataset comprising a total of 2000 questions.
Rotational Evaluation. We introduce RotationalEval, an evaluation strategy designed specifically for top-down images. Rotating a top-down image does not change its semantic meaning, whereas a rotated front-view image no longer makes physical sense.
Four Case Studies. We conduct four case studies on conditions that frequently occur in the real world. These studies evaluate specific capabilities of VLMs under controlled conditions, providing actionable insights for practical deployment while identifying critical challenges that must be addressed for reliable aerial image understanding.
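
The idea behind RotationalEval can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's implementation: it assumes that a question counts as correct only when the model answers it correctly under every rotation (0°, 90°, 180°, 270°). The README does not spell out the aggregation rule, and the function name `rotational_accuracy` and the input layout are hypothetical.

```python
from typing import Dict

ROTATIONS = (0, 90, 180, 270)  # the four rotation angles used by TDBench


def rotational_accuracy(results: Dict[str, Dict[int, bool]]) -> float:
    """Fraction of questions answered correctly under every rotation.

    `results` maps a question id to {rotation_angle: was_correct}.
    Assumption: a question passes only if all four rotations are correct.
    """
    if not results:
        return 0.0
    passed = sum(
        all(per_rotation.get(angle, False) for angle in ROTATIONS)
        for per_rotation in results.values()
    )
    return passed / len(results)


example = {
    "q1": {0: True, 90: True, 180: True, 270: True},   # consistent across rotations
    "q2": {0: True, 90: False, 180: True, 270: True},  # fails at 90 degrees
}
print(rotational_accuracy(example))  # 0.5
```

Under this assumed rule, a model that is right on a single orientation by chance gets no credit unless it is right on all of them.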

📊 Benchmarks Comparison

Table: Performance overview of 8 open-source VLMs and 6 proprietary VLMs on 10 dimensions with the RotationalEval method.

🗂️ Case Studies

Top-down images are usually captured from a relatively high altitude, which introduces challenges such as small objects and unfamiliar perspectives. Furthermore, top-down images do not contain depth information in most cases, yet depth is very important for many real-world applications such as building height estimation, autonomous drone navigation, and obstacle avoidance. Based on these considerations, we also conduct the following four case studies in the paper.

Digital Magnification for Small Object Detection
- Provide insights on post-processing the images to enable VLMs to see small objects
Altitude Effects on Object Detection
- Guidelines on drones' hovering height for different object detection tasks
Object Visibility and Partial Occlusion
- Study when objects are partially hidden or occluded by other objects
Z-Axis Perception and Depth Understanding
- Assess depth reasoning from top-down images
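
Digital magnification amounts to cropping a region of the image and enlarging it before passing it to the model. The helper below is a hypothetical sketch (the name `magnification_box` and its parameters are not from TDBench) that computes the crop rectangle for a given zoom factor, clamped to the image bounds.

```python
from typing import Tuple


def magnification_box(
    width: int, height: int, cx: float, cy: float, zoom: float
) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of a crop centred near (cx, cy).

    The crop covers 1/zoom of each image side; resizing it back to
    (width, height) magnifies the content by `zoom`.
    """
    if zoom < 1:
        raise ValueError("zoom must be >= 1")
    crop_w = max(1, round(width / zoom))
    crop_h = max(1, round(height / zoom))
    # Shift the window so that it stays inside the image.
    left = min(max(round(cx - crop_w / 2), 0), width - crop_w)
    top = min(max(round(cy - crop_h / 2), 0), height - crop_h)
    return left, top, left + crop_w, top + crop_h


print(magnification_box(1000, 800, cx=950, cy=100, zoom=4))  # (750, 0, 1000, 200)
```

With an imaging library such as Pillow, the resulting box could be passed to `Image.crop` and the crop resized with `Image.resize`; that step is left out here to keep the sketch dependency-free.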

🤖 How to run TDBench

TDBench is fully compatible with VLMEvalKit.

Installation

First, install the VLMEvalKit environment by following the instructions in the official repository
Set up your model configuration and APIs according to VLMEvalKit requirements

Datasets (for VLMEvalKit run.py)

Standard Evaluation - Tests 9 dimensions with 4 rotation angles
- tdbench_rot0 (0° rotation)
- tdbench_rot90 (90° rotation)
- tdbench_rot180 (180° rotation)
- tdbench_rot270 (270° rotation)
Visual Grounding - Tests visual grounding with 4 rotation angles
- tdbench_grounding_rot0 (0° rotation)
- tdbench_grounding_rot90 (90° rotation)
- tdbench_grounding_rot180 (180° rotation)
- tdbench_grounding_rot270 (270° rotation)
Case Studies - 4 studies
- tdbench_cs_zoom
- tdbench_cs_height
- tdbench_cs_integrity
- tdbench_cs_depth
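
Because the rotation datasets follow a regular naming pattern, the dataset list and the matching `run.py` command line can be generated rather than typed by hand. The snippet below is a small convenience sketch: the helper names `rotation_datasets` and `build_command` are hypothetical, `my_model` and `./results` are placeholder values, and only the flags shown in this README are used.

```python
import shlex
from typing import List

ROTATIONS = (0, 90, 180, 270)


def rotation_datasets(prefix: str = "tdbench") -> List[str]:
    """Dataset names for the four rotations, e.g. tdbench_rot0 ... tdbench_rot270."""
    return [f"{prefix}_rot{angle}" for angle in ROTATIONS]


def build_command(datasets: List[str], model: str, work_dir: str) -> str:
    """Assemble a run.py invocation from the flags documented in this README."""
    args = ["python", "run.py", "--data", *datasets,
            "--model", model, "--verbose", "--work-dir", work_dir]
    return shlex.join(args)


# Placeholder model name and output directory; replace with your own.
print(build_command(rotation_datasets(), "my_model", "./results"))
```

Passing `prefix="tdbench_grounding"` yields the visual grounding dataset names instead.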

Usage Examples

Standard Evaluation

To evaluate only a single rotation:

python run.py --data tdbench_rot0 \
              --model <model_name> \
              --verbose \
              --work-dir <results_directory>

To apply RotationalEval, simply run all rotations

python run.py --data tdbench_rot0 tdbench_rot90 tdbench_rot180 tdbench_rot270 \
              --model <model_name> \
              --verbose \
              --work-dir <results_directory>

Visual Grounding Evaluation

To evaluate only a single rotation:

python run.py --data tdbench_grounding_rot0 \
              --model <model_name> \
              --verbose \
              --judge centroid \
              --work-dir <results_directory>

To apply RotationalEval, simply run all rotations

python run.py --data tdbench_grounding_rot0 tdbench_grounding_rot90 tdbench_grounding_rot180 tdbench_grounding_rot270 \
              --model <model_name> \
              --verbose \
              --judge centroid \
              --work-dir <results_directory>
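
The `--judge centroid` option suggests that a grounding prediction is scored by where the centre of the predicted box lands. The sketch below illustrates one plausible reading, which is an assumption here rather than something taken from the VLMEvalKit source: a prediction counts as a hit when the centroid of the predicted box falls inside the ground-truth box. The function names and the (x1, y1, x2, y2) box format are hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), an assumed format


def centroid(box: Box) -> Tuple[float, float]:
    """Centre point of an axis-aligned box."""
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2, (y1 + y2) / 2


def centroid_hit(pred: Box, gt: Box) -> bool:
    """True if the centre of `pred` lies inside `gt` (assumed scoring rule)."""
    cx, cy = centroid(pred)
    x1, y1, x2, y2 = gt
    return x1 <= cx <= x2 and y1 <= cy <= y2


print(centroid_hit((10, 10, 30, 30), (15, 15, 40, 40)))  # True
print(centroid_hit((0, 0, 10, 10), (15, 15, 40, 40)))    # False
```

A centre-based check of this kind is more forgiving of box size errors than an overlap threshold, which may matter for the small objects typical of top-down images.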

Case Studies

Run all case studies with:

python run.py --data tdbench_cs_zoom tdbench_cs_height tdbench_cs_integrity tdbench_cs_depth \
              --model <model_name> \
              --verbose \
              --work-dir <results_directory>

Output

VLMEvalKit prints and saves each dataset's output in <results_directory>/<model_name>. Check xxx_acc.csv for the accuracy scores and xxx_result.xlsx for the detailed VLM outputs. RotationalEval is triggered automatically once all rotations have been run; its results are printed and saved as xxx_REresult.csv.
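
To inspect the saved scores programmatically, the CSV files can be collected by their suffix. The following is a minimal sketch using only the Python standard library. The helper name `collect_csv_rows` is hypothetical, `./results` is a placeholder path, and the column layout of the files is not assumed, so rows are returned as raw lists of strings.

```python
import csv
from pathlib import Path
from typing import Dict, List


def collect_csv_rows(results_dir: str, suffix: str) -> Dict[str, List[List[str]]]:
    """Map each CSV file under results_dir whose name ends in `suffix` to its rows."""
    root = Path(results_dir)
    if not root.is_dir():
        return {}
    found: Dict[str, List[List[str]]] = {}
    for path in sorted(root.rglob(f"*{suffix}")):
        with path.open(newline="") as handle:
            found[path.name] = list(csv.reader(handle))
    return found


# Placeholder directory; prints nothing if it does not exist.
for name, rows in collect_csv_rows("./results", "_acc.csv").items():
    print(name, rows)
```

Calling the helper with the suffix `_REresult.csv` would gather the RotationalEval summaries in the same way.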

📜 Citation

If you find our work and this repository useful, please consider giving our repo a star and citing our paper as follows:

@article{hou2025tdbench,
  title={TDBench: Benchmarking Vision-Language Models in Understanding Top-Down Images},
  author={Hou, Kaiyuan and Zhao, Minghui and Xu, Lilin and Fan, Yuang and Jiang, Xiaofan},
  journal={arXiv preprint arXiv:2504.03748},
  year={2025}
}

📨 Contact

If you have any questions, please create an issue on this repository or contact us at kh3119@columbia.edu or mz2866@columbia.edu.
