A plug-and-play method that utilizes visual cues for more effective token pruning in vision-language models.
[📄 Paper] [🌐 Project Page]
🔥 [2025/06/28] Our Project Page is updated!
🔥 [2025/06/26] Our VisPruner is accepted by ICCV 2025! 🎉
🔥 [2025/06/24] A new version of FasterVLM named VisPruner is released!
In this work, we find that the text-visual attention in the language model is not an ideal indicator for visual token pruning. Based on this analysis, we propose VisPruner, a plug-and-play method that utilizes visual cues for more effective token pruning in vision-language models. Specifically, we first use visual attention to select a limited number of important tokens. Then, we remove duplicate tokens from the remaining ones based on their similarity. By retaining diverse tokens alongside the initially selected important tokens, we maximally preserve the visual information of the input image.
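The two-stage procedure above can be sketched as follows. This is a minimal NumPy illustration with hypothetical function and argument names, not the released implementation (see the annotated source files for the actual code):

```python
import numpy as np

def prune_tokens(features, attn, num_keep, important_ratio):
    """Sketch of two-stage pruning: (1) keep the tokens with the highest
    visual attention, (2) greedily keep diverse tokens from the rest.
    Hypothetical helper for illustration only."""
    n_important = int(num_keep * important_ratio)
    n_diverse = num_keep - n_important

    # Stage 1: rank patch tokens by visual attention, keep the top ones.
    order = np.argsort(attn)[::-1]
    important = order[:n_important]
    remaining = order[n_important:]

    # Stage 2: greedily add the token least similar (cosine) to the kept set,
    # which removes near-duplicates among the remaining tokens.
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    kept = list(important)
    pool = list(remaining)
    for _ in range(n_diverse):
        sims = feats[pool] @ feats[kept].T          # (pool, kept) cosine sims
        idx = int(np.argmin(sims.max(axis=1)))      # farthest from kept set
        kept.append(pool.pop(idx))
    return np.sort(np.array(kept))

# Toy example: 16 tokens with 8-dim features, keep 4 (ratio 0.5 -> 2 + 2).
rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 8))
attn = rng.random(16)
kept = prune_tokens(feats, attn, num_keep=4, important_ratio=0.5)
print(kept)  # indices of the 4 retained tokens
```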
- Clone this repository.

```bash
git clone https://github.com/anonymous-4869/VisPruner.git
cd VisPruner
```

- Install necessary packages.

```bash
conda create -n vispruner python=3.10 -y
conda activate vispruner
pip install -e .
```

- (Optional) Install FlashAttention for further inference acceleration.

```bash
pip install flash-attn --no-build-isolation
```

Download the corresponding LLaVA checkpoints from Hugging Face 🤗:
| Version | LLM | Checkpoint |
|---|---|---|
| LLaVA-1.5 | Vicuna-7B | liuhaotian/llava-v1.5-7b |
| LLaVA-1.5 | Vicuna-13B | liuhaotian/llava-v1.5-13b |
| LLaVA-1.6 (LLaVA-NeXT) | Vicuna-7B | liuhaotian/llava-v1.6-vicuna-7b |
| LLaVA-1.6 (LLaVA-NeXT) | Vicuna-13B | liuhaotian/llava-v1.6-vicuna-13b |
Download each dataset according to EVAL.md.
The main implementation of VisPruner is highlighted with [VisPruner] annotations, mainly in llava_llama.py, llava_arch.py and clip_encoder.py.
We provide the evaluation scripts for each benchmark:
```bash
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/${DATASET}.sh ${VISUAL_TOKEN_NUMBER} ${IMPORTANT_RATIO}
```

You only need to set the remaining visual token number and the important ratio as bash arguments. For example, to evaluate VisPruner with 64 important tokens and 64 diverse tokens retained on the VQAv2 benchmark, run the script ./scripts/v1_5/eval/vqav2.sh with arguments 128 and 0.5:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/vqav2.sh 128 0.5
```

To evaluate VisPruner with 48 important tokens and 16 diverse tokens retained on the MME benchmark, run:

```bash
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mme.sh 64 0.75
```

For evaluation with the 13B LLM, simply replace the CKPT argument in each script from llava-v1.5-7b to llava-v1.5-13b. For evaluation with LLaVA-NeXT, use the scripts in ./scripts/v1_6/eval. For example, to evaluate VisPruner with 32 * 5 = 160 visual tokens retained on the TextVQA benchmark, run:

```bash
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_6/eval/textvqa.sh 32 0.5
```

Detailed guidance on the evaluation commands and online submission for each benchmark can be found in EVAL.md.
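Based on the examples above, the two script arguments appear to map to the important/diverse token split as follows (my reading of the examples, shown with a hypothetical helper; not an official API):

```python
def token_split(num_tokens, important_ratio):
    """Split the retained visual token budget into important and diverse
    tokens, matching the worked examples in this README."""
    n_important = int(num_tokens * important_ratio)
    return n_important, num_tokens - n_important

print(token_split(128, 0.5))   # VQAv2 example: 64 important + 64 diverse
print(token_split(64, 0.75))   # MME example:   48 important + 16 diverse
```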
If you find VisPruner useful for your research and applications, please cite using this BibTeX:
```bibtex
@article{zhang2025vispruner,
  title={Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs},
  author={Zhang, Qizhe and Cheng, Aosong and Lu, Ming and Zhang, Renrui and Zhuo, Zhiyong and Cao, Jiajun and Guo, Shaobo and She, Qi and Zhang, Shanghang},
  journal={arXiv preprint arXiv:2412.01818},
  year={2025},
}
```

This project is released under the Apache 2.0 license.

