CogRail is the first multimodal benchmark and open-source framework dedicated to cognitive railway intrusion perception. It is built on real-world surveillance scenes with cognition-driven, multi-dimensional, instruction-level annotations (the CogRail dataset), and it supports spatio-temporal reasoning, motion prediction, and threat assessment for objects of interest (OOIs) in railway environments. The project integrates visual question-answer annotations with expert-defined threat semantics and uses instance synthesis to increase data diversity while keeping a consistent label space across subsets.

CogRail systematically evaluates state-of-the-art vision-language models (such as Qwen-VL and LLaVA) in railway scenarios, revealing their strengths and limitations in complex spatio-temporal reasoning. It also introduces the RAILGPT multi-task fine-tuning framework, which combines visual prompts, textual instructions, and specialized agents to improve cognitive capabilities across three tasks: position awareness (CogRailPos), motion prediction (CogRailMove), and threat analysis (CogRailThreat).

After joint fine-tuning, RAILGPT achieves an 18.6% F1 improvement on the threat analysis task. This demonstrates the effectiveness of structured multi-task learning in safety-critical scenarios and provides a complete benchmark toolchain for both research and engineering applications. The paper is available at: https://arxiv.org/abs/2601.09613
- First CogRail Benchmark: Integrates open-source surveillance scenarios with cognition-driven question-answer annotations, supporting spatio-temporal reasoning and intrusion risk prediction.
- Systematic Evaluation of Representative VLMs: Reveals model strengths and weaknesses in cognitive railway scenarios.
- Multi-task Joint Fine-tuning (RAILGPT): Combines visual prompts, textual prompts, and dedicated agents to significantly enhance accuracy and interpretability.
CogRail systematically evaluates vision-language models in railway intrusion perception scenarios. It defines three core tasks and provides unified annotations, with synthesized instances adding data diversity.
- CogRailPos (Spatial Awareness): Determine the OOI's location relative to railway infrastructure.
- CogRailMove (Motion Prediction): Predict the threat level of the OOI's movement.
- CogRailThreat (Threat Assessment): Integrate spatial and motion information to assess the threat.
Dataset Sources & Labels
- Sources:
- Labels:
- Unified Label Space

The CogRail dataset contains two main folders: Cog-MRSI/ and Cog-RailSem19/. Each folder has a training set (train) and a test set (test).
The CogRail dataset can be accessed at: https://huggingface.co/datasets/BITZhangqy/Cog-Rail/
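The folder layout described above can be checked against a local copy with a short script. This is a minimal sketch, not part of the CogRail toolchain: it assumes the dataset has already been downloaded to a local directory and that the split folders are literally named `train` and `test`. The helper names `expected_split_dirs` and `missing_split_dirs` are hypothetical.

```python
from pathlib import Path

# Subset and split names as described in this README (assumed to be the
# literal on-disk folder names).
SUBSETS = ("Cog-MRSI", "Cog-RailSem19")
SPLITS = ("train", "test")


def expected_split_dirs(root):
    """Return the four subset/split directories expected under `root`."""
    root = Path(root)
    return [root / subset / split for subset in SUBSETS for split in SPLITS]


def missing_split_dirs(root):
    """Return the expected directories that do not exist under `root`."""
    return [path for path in expected_split_dirs(root) if not path.is_dir()]
```

Calling `missing_split_dirs("path/to/Cog-Rail")` on a complete local copy should return an empty list under these assumptions; any returned path points at a subset or split that was not found.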
Performance Comparison among SOTA VLMs on CogRail, averaged over prompt types and sub-datasets
Performance (F1) Comparison on Type-I Visual Prompt in the Cog-RailSem19 dataset via Individual Fine-tuning
Performance (F1) Comparison on Type-II Visual Prompt in the Cog-RailSem19 dataset via Individual Fine-tuning
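The comparisons above are reported as F1 scores. For readers reproducing such numbers, the following is a minimal sketch of per-class and macro-averaged F1 over predicted labels. It is an assumption that macro averaging is appropriate: this README does not state which F1 variant (macro, micro, or binary) CogRail uses. The function names and the example labels are hypothetical.

```python
def f1_per_class(y_true, y_pred):
    """Return a dict mapping each label to its F1 score."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (0.0 for empty input)."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores) if scores else 0.0
```

For example, with hypothetical labels `y_true = ["safe", "danger", "danger", "safe"]` and `y_pred = ["safe", "danger", "safe", "safe"]`, the per-class scores are 2/3 for `danger` and 0.8 for `safe`, giving a macro F1 of about 0.733.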
If you find our work helpful in your research, please consider citing:
@misc{tian2026cograilbenchmarkingvlmscognitive,
title={CogRail: Benchmarking VLMs in Cognitive Intrusion Perception for Intelligent Railway Transportation Systems},
author={Yonglin Tian and Qiyao Zhang and Wei Xu and Yutong Wang and Yihao Wu and Xinyi Li and Xingyuan Dai and Hui Zhang and Zhiyong Cui and Baoqing Guo and Zujun Yu and Yisheng Lv},
year={2026},
eprint={2601.09613},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2601.09613},
}







