A training-free framework that adapts image-pretrained VLMs to video understanding, achieving SOTA on 7 benchmarks through dynamic compression and question decomposition, with no fine-tuning required.
D-CoDe achieves state-of-the-art performance across 7 video understanding benchmarks, all without any training.
Multiple-Choice VideoQA (↑ higher is better)

Open-Ended VideoQA, Accuracy (↑ higher is better)
Highlight: On the challenging long-video benchmark EgoSchema, D-CoDe achieves 58.0% accuracy, a +7.8% improvement over the previous best training-free method (TS-LLaVA, 50.2%).
```python
from Dcode import (
    generate_subquestions,
    supp_frame_selection,
    token_select_and_merge,
    load_clip_model,
)

# 1. Question Decomposition (requires the OPENAI_API_KEY environment variable)
subquestions = generate_subquestions(
    question="What did the person do after picking up the cup?",
    prompt_variant="original",
)

# 2. Frame Selection (based on semantic diversity)
clip_processor, clip_model = load_clip_model()
selected_frames, frame_idxs = supp_frame_selection(
    video_frames,          # list of PIL Images
    N=15,                  # number of frames to select
    uniform_ratio=0.85,    # ratio of frames taken by uniform sampling
    clip_model=clip_model,
    clip_processor=clip_processor,
)

# 3. Token Selection and Merge
merged_features = token_select_and_merge(
    image_features,            # tensor of shape (T, N, D)
    top_k=288,                 # tokens to keep per frame
    merge_strategy="mean",     # options: "mean", "max", "weighted_mean"
    similarity_threshold=0.8,  # similarity threshold for merging
)
```

Run the evaluation scripts:

```bash
# Multiple-Choice VideoQA
bash scripts/run_eval_egoschema.sh
bash scripts/run_eval_nextqa.sh
bash scripts/run_eval_intentqa.sh

# Open-Ended VideoQA
bash scripts/run_eval_msvd.sh
bash scripts/run_eval_msrvtt.sh
bash scripts/run_eval_tgif.sh
bash scripts/run_eval_activitynet.sh
```

Create the conda environment:

```bash
conda create -n d_code python=3.10.12
conda activate d_code
bash setup_env.sh
```

Set up your OpenAI API key for question decomposition:

```bash
export OPENAI_API_KEY=$YOUR_OPENAI_API_KEY
```

Download the pre-trained LLaVA-NeXT weights:

```bash
git lfs clone https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b liuhaotian/llava-v1.6-vicuna-7b
```

Data setup:
Ground-truth QA CSV files are already included in playground/gt_qa_files for: MSVD-QA, MSRVTT-QA, TGIF-QA, ActivityNet-QA, NExT-QA, EgoSchema, and IntentQA.
- Open-Ended VideoQA
  - [Recommended] Follow the instructions in Video-LLaVA to download the raw videos.
  - Or download them directly: MSVD-QA · MSRVTT-QA · TGIF-QA · ActivityNet-QA
- Multiple-Choice VideoQA
```
playground/data/
├── video_qa/
│   ├── MSVD_Zero_Shot_QA/videos/
│   ├── MSRVTT_Zero_Shot_QA/videos/all/
│   ├── TGIF_Zero_Shot_QA/mp4/
│   └── Activitynet_Zero_Shot_QA/all_test/
└── multiple_choice_qa/
    ├── NExTQA/video/
    ├── EgoSchema/video/
    └── IntentQA/video/
```
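For intuition about the dynamic compression stage, here is a minimal, hypothetical sketch of diversity-aware frame selection in the spirit of supp_frame_selection: a uniform backbone of frames plus a greedy supplement that maximizes semantic diversity over CLIP-like frame embeddings. The function name, defaults, and selection logic below are illustrative assumptions, not the repository's actual implementation.

```python
import numpy as np

def select_diverse_frames(features, n_select, uniform_ratio=0.85):
    """Pick n_select frame indices: a uniform backbone plus a greedy,
    diversity-maximizing supplement over per-frame embeddings.

    features: (T, D) array of L2-normalized frame embeddings.
    (Illustrative sketch only, not the repository's supp_frame_selection.)
    """
    T = features.shape[0]
    n_uniform = min(max(1, int(round(n_select * uniform_ratio))), n_select)
    # Uniform backbone: evenly spaced frame indices.
    chosen = sorted(set(np.linspace(0, T - 1, n_uniform).round().astype(int)))
    # Greedy supplement: repeatedly add the frame least similar
    # (lowest max cosine similarity) to everything chosen so far.
    while len(chosen) < n_select and len(chosen) < T:
        sims = features @ features[chosen].T   # (T, |chosen|)
        max_sim = sims.max(axis=1)             # closeness to current selection
        max_sim[chosen] = np.inf               # exclude already-chosen frames
        chosen.append(int(max_sim.argmin()))
    return sorted(chosen)
```

In this sketch, with N=15 and uniform_ratio=0.85 (the quickstart defaults), 13 indices come from uniform sampling and 2 from the diversity supplement.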
Module Ablation (EgoSchema)
| Module | Acc. (%) |
|---|---|
| Baseline | 44.8 |
| + Dynamic Spatial Token Compression | 50.6 |
| + Dynamic Temporal Frame Selection | 51.8 |
| + Question Decomposition | 58.0 |
Full Module Ablation (All Benchmarks; open-ended results are reported as Accuracy/Score)
| Module | NExT-QA | IntentQA | MSVD | MSRVTT | TGIF | ANet |
|---|---|---|---|---|---|---|
| Baseline | 65.4 | 61.3 | 77.8/4.0 | 62.8/3.5 | 76.9/4.0 | 54.2/3.3 |
| + Spatial Compression | 66.7 | 62.2 | 79.4/4.0 | 63.6/3.5 | 78.9/4.1 | 55.4/3.3 |
| + Temporal Selection | 67.0 | 62.9 | 80.0/4.1 | 64.2/3.5 | 79.1/4.1 | 56.4/3.4 |
| + Question Decomposition | 68.3 | 64.2 | 72.4/3.8 | 62.2/3.5 | 75.7/4.0 | 53.8/3.3 |
Efficiency Analysis (EgoSchema)
| Module | Acc. (%) | Time (s/sample) |
|---|---|---|
| Baseline | 44.8 | 3.927 |
| + Dynamic Compression | 51.8 | 6.115 |
| + Question Decomposition | 58.0 | 37.395 |
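The roughly 37 s/sample of the full pipeline is dominated by the GPT-3.5 calls made during question decomposition. As a shape-only sketch of that step, the prompt wording, helper names, and parsing below are illustrative assumptions, not the repository's code:

```python
def build_decomposition_prompt(question, n_sub=3):
    """Ask an LLM to split a video question into simpler sub-questions.
    (Illustrative prompt wording, not the paper's actual prompt.)"""
    return (
        f"Decompose the following video question into {n_sub} simpler "
        f"sub-questions, one per line, each starting with 'Q:'.\n"
        f"Question: {question}"
    )

def parse_subquestions(reply):
    """Extract the 'Q:'-prefixed lines from the model's reply."""
    subs = []
    for line in reply.splitlines():
        line = line.strip()
        if line.startswith("Q:"):
            subs.append(line[2:].strip())
    return subs

# The actual API call would look roughly like (requires OPENAI_API_KEY):
#   from openai import OpenAI
#   reply = OpenAI().chat.completions.create(
#       model="gpt-3.5-turbo",
#       messages=[{"role": "user",
#                  "content": build_decomposition_prompt(question)}],
#   ).choices[0].message.content
#   subquestions = parse_subquestions(reply)
```

One round trip per question (plus one per sub-question answer) explains why this module adds far more latency than the compression modules, which run locally.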
The core implementation is in Dcode.py:
| Function | Description | Paper Method |
|---|---|---|
| `generate_subquestions()` | Decompose questions into sub-questions using GPT-3.5 | Question Decomposition |
| `supp_frame_selection()` | Select frames based on CLIP semantic similarity | Dynamic Compression (Frame) |
| `token_select_and_merge()` | Select and merge visual tokens to reduce redundancy | Dynamic Compression (Token) |
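As a rough illustration of what a token_select_and_merge-style compression can look like, the sketch below keeps the top_k highest-norm tokens per frame and then mean-merges near-duplicates above a cosine-similarity threshold. The saliency proxy (token norm) and the greedy grouping are assumptions for illustration; the repository's actual selection criterion may differ.

```python
import numpy as np

def select_and_merge_tokens(tokens, top_k, sim_threshold=0.8):
    """Hypothetical per-frame token compression:
    1) keep the top_k highest-norm tokens in each frame,
    2) greedily mean-merge kept tokens whose cosine similarity
       exceeds sim_threshold ("mean" merge strategy).

    tokens: (T, N, D) array of visual tokens per frame.
    Returns a list of T arrays, each with at most top_k tokens.
    """
    compressed = []
    for frame in tokens:                               # frame: (N, D)
        # Step 1: saliency proxy = token norm; keep the top_k tokens.
        norms = np.linalg.norm(frame, axis=1)
        keep = frame[np.argsort(-norms)[:top_k]]
        # Step 2: greedy merge of near-duplicate tokens.
        unit = keep / (np.linalg.norm(keep, axis=1, keepdims=True) + 1e-8)
        merged, used = [], np.zeros(len(keep), dtype=bool)
        for i in range(len(keep)):
            if used[i]:
                continue
            group = unit[i] @ unit.T > sim_threshold   # tokens similar to i
            group &= ~used                             # only unmerged tokens
            group[i] = True
            used |= group
            merged.append(keep[group].mean(axis=0))    # mean-pool the group
        compressed.append(np.stack(merged))
    return compressed
```

Near-duplicate tokens (e.g., repeated background patches) collapse into a single averaged token, which is what lets the per-frame token budget shrink without discarding distinct content.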
We extend our gratitude to the following projects: LLaVA, IG-VLM, Video-LLaVA, SF-LLaVA and TS-LLaVA.
If you find this work useful, please cite our paper:
```bibtex
@inproceedings{huang-etal-2025-code,
  title = "{D}-{C}o{D}e: Scaling Image-Pretrained {VLM}s to Video via Dynamic Compression and Question Decomposition",
  author = "Huang, Yiyang and Wang, Yizhou and Fu, Yun",
  booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
  year = "2025",
  pages = "11798--11811",
}
```

arXiv version:
```bibtex
@article{huang2025d,
  title={D-CoDe: Scaling Image-Pretrained VLMs to Video via Dynamic Compression and Question Decomposition},
  author={Huang, Yiyang and Wang, Yizhou and Fu, Yun},
  journal={arXiv preprint arXiv:2510.08818},
  year={2025}
}
```

This project is released under the Apache 2.0 License.
