Open-source implementation of WFS-SB, a training-free frame selection framework for long-video understanding with LVLMs.
Paper · Highlights · Quick Start · Full Pipeline Workflow · Project Structure · Citation
Long videos contain heavy frame redundancy, while Large Vision-Language Models (LVLMs) operate under limited context budgets. Most query-aware frame selection methods focus only on frame relevance, which often yields fragmented visual evidence and ignores the video's narrative structure.
WFS-SB addresses this issue by detecting semantic boundaries in the query-frame similarity signal. It first uses wavelet-based multi-resolution analysis to suppress high-frequency noise, then identifies boundary points that divide a video into coherent clips. Based on these clips, WFS-SB allocates the frame budget adaptively and selects frames with Maximal Marginal Relevance (MMR) to preserve both relevance and diversity.
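The within-clip selection step can be illustrated with a greedy Maximal Marginal Relevance loop. This is a minimal NumPy sketch, not the repository's implementation: `relevance` holds the query-frame similarity scores for one clip, `features` the frame embeddings, and `lam` is an assumed relevance/diversity trade-off weight.

```python
import numpy as np

def mmr_select(relevance, features, k, lam=0.7):
    """Greedy MMR: trade query relevance against similarity
    to the frames already selected."""
    # Normalize features so dot products are cosine similarities.
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    selected, candidates = [], list(range(len(relevance)))
    while candidates and len(selected) < k:
        if not selected:
            # First pick: most relevant frame.
            best = max(candidates, key=lambda i: relevance[i])
        else:
            sel = feats[selected]
            def score(i):
                # Penalize frames similar to anything already chosen.
                redundancy = float(np.max(sel @ feats[i]))
                return lam * relevance[i] - (1 - lam) * redundancy
            best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return sorted(selected)
```

With two near-duplicate relevant frames and one distinct frame, the loop keeps one duplicate and then prefers the distinct frame, which is the diversity behavior MMR is used for here.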
## Highlights

- 🌟 Training-free pipeline that plugs into long-video LVLM inference without extra model training.
- 🌊 Wavelet-based denoising helps recover robust semantic change signals from noisy query-frame similarities.
- 🧩 Two-stage selection strategy combines clip-level budget allocation with within-clip MMR sampling.
- 📈 Strong reported gains over prior frame selection strategies on three long-video benchmarks.
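The wavelet-denoising idea can be sketched with a minimal Haar transform (illustrative only; the wavelet family and number of decomposition levels used by WFS-SB may differ): decompose the 1-D similarity signal, discard the high-frequency detail coefficients at each level, and reconstruct a smoothed signal whose large-scale changes mark candidate semantic boundaries.

```python
import numpy as np

def haar_smooth(signal, levels=2):
    """Smooth a 1-D signal by a Haar wavelet transform that discards
    the high-frequency detail coefficients at every level.
    Length of `signal` must be divisible by 2**levels."""
    approx = np.asarray(signal, dtype=float)
    n_steps = []
    for _ in range(levels):
        # One Haar analysis step: pairwise orthonormal averages form the
        # approximation; the pairwise differences (detail) are dropped.
        approx = (approx[0::2] + approx[1::2]) / np.sqrt(2)
        n_steps.append(len(approx))
    # Synthesis with all detail coefficients set to zero.
    for _ in reversed(n_steps):
        up = np.empty(2 * len(approx))
        up[0::2] = approx / np.sqrt(2)
        up[1::2] = approx / np.sqrt(2)
        approx = up
    return approx
```

A constant signal passes through unchanged, while high-frequency oscillations are averaged out, which is exactly the behavior wanted before boundary detection.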
## News

- [2026-02-21] 🎉 Our paper was accepted to CVPR 2026.
- 🎞️ `preprocess/`: frame sampling, feature extraction, and query-frame similarity scoring.
- 🧠 `wfs/`: the unified WFS pipeline for VideoMME, LongVideoBench, and MLVU.
- 📁 `datasets/`: annotation files and reproduction keyframe JSONs.
- 🩹 `lmms-eval-diff/`: `lmms-eval` patch artifacts and integration notes.
## Quick Start

This repository provides the WFS code and the patch artifacts for `lmms-eval`. If you do not already have a compatible `lmms-eval` checkout under the repository root, prepare it first and then install the environment.
```bash
# Prepare a compatible lmms-eval checkout
git clone https://github.com/EvolvingLMMs-Lab/lmms-eval
cd lmms-eval
git checkout bb1ebe76e7a942386c25c4664f902e0e59e8a401
git apply ../lmms-eval-diff/lmms_eval_wfs.patch
cd ..

# Create and activate the environment
conda create -n wfs python=3.10 -y
conda activate wfs

# Install dependencies
pip install -e ./lmms-eval
pip install -r requirements.txt
```

For FlashAttention 2, install a wheel that matches your local Python, PyTorch, and CUDA versions. For the environment in `requirements.txt`, choose a wheel built for Python 3.10 and for the Torch and CUDA versions you have installed; note that the example wheel below is a CUDA 12.2 / Torch 2.4 build.

```bash
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.0/flash_attn-2.6.0+cu122torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.6.0+cu122torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```

See the official FlashAttention releases page for a matching build: https://github.com/Dao-AILab/flash-attention/releases
If you prefer to rebuild `lmms-eval` from upstream and apply the patch manually, see `lmms-eval-diff/README.md`.
This repository includes annotation files and reproduction keyframe JSONs, but it does not include the raw benchmark videos. Please download the videos from the official dataset sources and place them in the expected directories.
Organize the datasets as follows:
```
datasets/
├── videomme/
│   ├── data/                      # Put VideoMME .mp4 files here
│   ├── videomme_json_file.json
│   └── keyframe_dir/
│       ├── reproduce_videomme_f8.json
│       ├── reproduce_videomme_f16.json
│       └── reproduce_videomme_f32.json
├── longvideobench/
│   ├── videos/                    # Put LongVideoBench .mp4 files here
│   ├── lvb_val.json
│   └── keyframe_dir/
│       ├── reproduce_lvb_f8.json
│       ├── reproduce_lvb_f16.json
│       └── reproduce_lvb_f32.json
└── mlvu/
    ├── video/                     # Put MLVU .mp4 files here
    ├── mlvu_dev.json
    └── keyframe_dir/
        ├── reproduce_mlvu_f8.json
        ├── reproduce_mlvu_f16.json
        └── reproduce_mlvu_f32.json
```
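Before launching any runs, the layout can be sanity-checked with a few lines of Python. This is a convenience sketch; the file and directory names are taken from the tree above, and `root` is assumed to be the `datasets/` directory.

```python
from pathlib import Path

# Expected per-benchmark entries, taken from the directory tree above.
EXPECTED = {
    "videomme": ["data", "videomme_json_file.json", "keyframe_dir"],
    "longvideobench": ["videos", "lvb_val.json", "keyframe_dir"],
    "mlvu": ["video", "mlvu_dev.json", "keyframe_dir"],
}

def missing_paths(root="datasets"):
    """Return the expected dataset paths that do not exist yet."""
    root = Path(root)
    return [str(root / bench / name)
            for bench, names in EXPECTED.items()
            for name in names
            if not (root / bench / name).exists()]
```

An empty return value means the layout matches; otherwise the listed paths still need to be created or downloaded.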
If your local paths differ, update `configs/dataset_paths.example.yaml` accordingly.
After the raw videos are in place, you can directly reproduce inference results with the provided keyframe JSON files.
### Uniform baseline
```bash
export QWEN_CKPT=Qwen/Qwen2.5-VL-7B-Instruct

CUDA_VISIBLE_DEVICES=0 python -m lmms_eval \
    --model qwen2_5_vl \
    --tasks videomme \
    --model_args max_num_frames=16,pretrained=${QWEN_CKPT},max_pixels=12845056,attn_implementation=flash_attention_2,interleave_visuals=False \
    --batch_size 1 \
    --output_path ./results/videomme/uni
```

### WFS reproduction JSONs
```bash
export QWEN_CKPT=Qwen/Qwen2.5-VL-7B-Instruct

# VideoMME, K=16
CUDA_VISIBLE_DEVICES=0 python -m lmms_eval \
    --model qwen2_5_vl \
    --tasks videomme \
    --model_args max_num_frames=16,use_keyframe=True,pretrained=${QWEN_CKPT},max_pixels=12845056,attn_implementation=flash_attention_2,interleave_visuals=False \
    --batch_size 1 \
    --output_path ./results/videomme/ \
    --data_files '{"test": "keyframe_dir/reproduce_videomme_f16.json"}'

# LongVideoBench, K=16
CUDA_VISIBLE_DEVICES=0 python -m lmms_eval \
    --model qwen2_5_vl \
    --tasks longvideobench_val_v \
    --model_args max_num_frames=16,use_keyframe=True,pretrained=${QWEN_CKPT},max_pixels=12845056,attn_implementation=flash_attention_2,interleave_visuals=False \
    --batch_size 1 \
    --output_path ./results/longvideobench_val_v/ \
    --data_files '{"validation": "keyframe_dir/reproduce_lvb_f16.json"}'

# MLVU, K=16
CUDA_VISIBLE_DEVICES=0 python -m lmms_eval \
    --model qwen2_5_vl \
    --tasks mlvu_dev \
    --model_args max_num_frames=16,use_keyframe=True,pretrained=${QWEN_CKPT},max_pixels=12845056,attn_implementation=flash_attention_2,interleave_visuals=False \
    --batch_size 1 \
    --output_path ./results/mlvu_dev/ \
    --data_files '{"test": "keyframe_dir/reproduce_mlvu_f16.json"}'
```

For additional model examples, refer to the official lmms-eval examples: https://github.com/EvolvingLMMs-Lab/lmms-eval/tree/main/examples
## Full Pipeline Workflow

The full WFS workflow consists of three steps:

- 🎞️ Extract frame-level features and query-frame similarity scores.
- 🌊 Run WFS to generate keyframe JSON files with `keyframe_indices`.
- 🤖 Feed the generated JSONs into `lmms-eval` for LVLM inference.
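The JSON files produced in the second step can be inspected with a few lines of Python. The exact schema is defined by the repository; this sketch only assumes that each record carries a `keyframe_indices` list, the field named above.

```python
import json

def summarize_keyframes(path):
    """Report how many frames were selected per record in a keyframe JSON.
    Assumes each record is a dict with a `keyframe_indices` list
    (field name taken from the workflow description above)."""
    with open(path) as f:
        records = json.load(f)
    # Accept either a list of records or an id -> record mapping.
    items = records.values() if isinstance(records, dict) else records
    counts = [len(r["keyframe_indices"]) for r in items]
    print(f"{len(counts)} records, "
          f"{sum(counts) / max(len(counts), 1):.1f} frames selected on average")
    return counts
```

Running this on a generated file is a quick check that every record received the expected budget (e.g. 16 frames for a K=16 run).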
The unified pipeline currently supports:

- Benchmarks: `videomme`, `lvb`, `mlvu`
- Feature models: `blip2`, `blip1`, `clip`, `siglip`
Extract frame-level features and similarity scores before running WFS.
**Example: VideoMME + BLIP2**
```bash
python -m preprocess.extract \
    --benchmark videomme \
    --feature_model blip2 \
    --dataset_root datasets/videomme \
    --json_file datasets/videomme/videomme_json_file.json \
    --output_dir datasets/videomme/blip2_features_and_scores \
    --device cuda \
    --batch_size 256 \
    --sample_fps 1.0
```

Run the WFS pipeline to generate a keyframe JSON file containing `keyframe_indices`.
```bash
python -m wfs.pipeline \
    --benchmark videomme \
    --feature_model blip2 \
    --max_frames 16 \
    --dataset_root datasets/videomme \
    --questions_file datasets/videomme/videomme_json_file.json \
    --features_dir datasets/videomme/blip2_features_and_scores \
    --output_path datasets/videomme/keyframe_dir/WFS_videomme_blip2_16f.json
```

For the other benchmarks, replace the dataset-specific paths accordingly:

- `lvb`: `datasets/longvideobench/lvb_val.json`
- `mlvu`: `datasets/mlvu/mlvu_dev.json`
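The clip-level budget allocation described in the paper summary can be sketched as proportional allocation with largest-remainder rounding. This is an illustrative sketch; the actual allocation rule in `wfs/core.py` may differ.

```python
import numpy as np

def allocate_budget(clip_relevance, k):
    """Split a frame budget of k across clips in proportion to each
    clip's total relevance, with largest-remainder rounding so the
    per-clip allocations always sum to exactly k."""
    w = np.asarray(clip_relevance, dtype=float)
    quotas = k * w / w.sum()
    alloc = np.floor(quotas).astype(int)
    # Hand the leftover frames to the clips with the largest
    # fractional quotas (stable sort keeps ties in clip order).
    remainder = k - alloc.sum()
    order = np.argsort(-(quotas - alloc), kind="stable")
    alloc[order[:remainder]] += 1
    return alloc.tolist()
```

For example, relevance masses `[5, 2, 3]` with a budget of 8 yield integer quotas of 4, 1, and 2 plus one leftover frame, which goes to the clip with the largest fractional part.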
Run `lmms-eval` with `qwen2_5_vl` and the generated keyframe JSON.
```bash
export QWEN_CKPT=Qwen/Qwen2.5-VL-7B-Instruct

CUDA_VISIBLE_DEVICES=0 python -m lmms_eval \
    --model qwen2_5_vl \
    --tasks videomme \
    --model_args max_num_frames=16,use_keyframe=True,pretrained=${QWEN_CKPT},max_pixels=12845056,attn_implementation=flash_attention_2,interleave_visuals=False \
    --batch_size 1 \
    --output_path ./results/videomme/ \
    --data_files '{"test": "keyframe_dir/WFS_videomme_blip2_16f.json"}'
```

## Project Structure

```
WFS-OpenSource/
├── configs/
│   ├── dataset_paths.example.yaml
│   └── wfs_defaults.yaml
├── datasets/
│   ├── videomme/
│   ├── longvideobench/
│   └── mlvu/
├── lmms-eval-diff/
│   ├── README.md
│   ├── lmms_eval_wfs.patch
│   └── modified_files/
├── preprocess/
│   └── extract.py
├── run_qwen2_5_vl_lmms_eval_reproduce.sh
├── requirements.txt
└── wfs/
    ├── benchmarks.py
    ├── core.py
    └── pipeline.py
```
## Citation

If you find this project useful, please cite our paper:

```bibtex
@article{chen2026wavelet,
  title={Wavelet-based Frame Selection by Detecting Semantic Boundary for Long Video Understanding},
  author={Chen, Wang and Zeng, Yuhui and Luo, Yongdong and Xie, Tianyu and Lin, Luojun and Ji, Jiayi and Zhang, Yan and Zheng, Xiawu},
  journal={arXiv preprint arXiv:2603.00512},
  year={2026}
}
```