hukcc/D-CoDe
D-CoDe: Scaling Image-Pretrained VLMs to Video via Dynamic Compression and Question Decomposition

A training-free framework that adapts image-pretrained VLMs to video understanding, achieving SOTA on 7 benchmarks through dynamic compression and question decomposition, with no fine-tuning required.

EMNLP 2025 arXiv Paper Project Page Code License Python PyTorch

Key Results

D-CoDe achieves state-of-the-art performance across 7 video understanding benchmarks, all without any training.

Multiple-Choice VideoQA (↑ higher is better)

| Method | NExT-QA | EgoSchema | IntentQA |
|---|---|---|---|
| SF-LLaVA | 64.2 | 47.2 | 60.1 |
| TS-LLaVA | 66.5 | 50.2 | 61.7 |
| D-CoDe | 68.3 | 58.0 | 64.2 |

Open-Ended VideoQA, Accuracy (↑ higher is better)

| Method | MSVD | MSRVTT | TGIF | ANet |
|---|---|---|---|---|
| SF-LLaVA | 79.1 | 65.8 | 78.7 | 55.5 |
| TS-LLaVA | 79.0 | 65.1 | 77.7 | 56.7 |
| D-CoDe | 80.0 | 64.2 | 79.1 | 56.4 |

Highlight: On the challenging long-video benchmark EgoSchema, D-CoDe achieves 58.0% accuracy, a +7.8-point improvement over the previous best training-free method (TS-LLaVA, 50.2%).

Quick Start

from Dcode import generate_subquestions, supp_frame_selection, token_select_and_merge, load_clip_model

# 1. Question Decomposition (requires OPENAI_API_KEY environment variable)
subquestions = generate_subquestions(
    question="What did the person do after picking up the cup?",
    prompt_variant="original"
)

# 2. Frame Selection (based on semantic diversity)
clip_processor, clip_model = load_clip_model()
selected_frames, frame_idxs = supp_frame_selection(
    video_frames,           # List of PIL Images
    N=15,                   # Number of frames to select
    uniform_ratio=0.85,     # Ratio for uniform sampling
    clip_model=clip_model,
    clip_processor=clip_processor
)

# 3. Token Selection and Merge
merged_features = token_select_and_merge(
    image_features,                  # Tensor (T, N, D)
    top_k=288,                       # Tokens to keep per frame
    merge_strategy="mean",           # Options: "mean", "max", "weighted_mean"
    similarity_threshold=0.8         # Similarity threshold for merging
)
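The repository's actual merging logic lives in Dcode.py; as a rough illustration of what similarity-based token merging with a "mean" strategy can look like (a sketch, not the repo's implementation; the greedy grouping rule and shapes here are assumptions), consider:

```python
import numpy as np

def merge_similar_tokens(tokens: np.ndarray, threshold: float = 0.8) -> np.ndarray:
    """Greedily group tokens whose cosine similarity to a kept token
    exceeds `threshold`, then average each group (illustrative only)."""
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    groups = []  # each group: list of row indices merged together
    for i in range(len(tokens)):
        for g in groups:
            # compare against the group's first (representative) token
            if normed[i] @ normed[g[0]] > threshold:
                g.append(i)
                break
        else:
            groups.append([i])
    return np.stack([tokens[g].mean(axis=0) for g in groups])

# Near-duplicate tokens collapse into a single merged token
toks = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]])
merged = merge_similar_tokens(toks, threshold=0.9)
print(merged.shape)  # (2, 2)
```

The idea is the same as in the paper's dynamic compression: redundant visual tokens (common across consecutive video frames) are collapsed, so the LLM sees far fewer tokens per frame.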

Run Full Evaluation

# Multiple-Choice VideoQA
bash scripts/run_eval_egoschema.sh
bash scripts/run_eval_nextqa.sh
bash scripts/run_eval_intentqa.sh

# Open-Ended VideoQA
bash scripts/run_eval_msvd.sh
bash scripts/run_eval_msrvtt.sh
bash scripts/run_eval_tgif.sh
bash scripts/run_eval_activitynet.sh

Installation

conda create -n d_code python=3.10.12
conda activate d_code
bash setup_env.sh

Set up your OpenAI API key for question decomposition:

export OPENAI_API_KEY=$YOUR_OPENAI_API_KEY

Download pre-trained LLaVA-NeXT weights:

git lfs clone https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b liuhaotian/llava-v1.6-vicuna-7b

Data Preparation


Ground-Truth QA Files

GT question and answer CSV files are already included in playground/gt_qa_files: MSVD-QA, MSRVTT-QA, TGIF-QA, ActivityNet-QA, NExT-QA, EgoSchema, IntentQA.

Download Raw Videos

Expected Directory Structure

playground/data/
├── video_qa/
│   ├── MSVD_Zero_Shot_QA/videos/
│   ├── MSRVTT_Zero_Shot_QA/videos/all/
│   ├── TGIF_Zero_Shot_QA/mp4/
│   └── Activitynet_Zero_Shot_QA/all_test/
└── multiple_choice_qa/
    ├── NExTQA/video/
    ├── EgoSchema/video/
    └── IntentQA/video/
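Before launching the evaluation scripts, it can save time to sanity-check this layout. A minimal checker (the paths come from the tree above; the function name and its location are my own, not part of the repo) might be:

```python
from pathlib import Path

# Relative paths from the expected directory structure above
EXPECTED_DIRS = [
    "video_qa/MSVD_Zero_Shot_QA/videos",
    "video_qa/MSRVTT_Zero_Shot_QA/videos/all",
    "video_qa/TGIF_Zero_Shot_QA/mp4",
    "video_qa/Activitynet_Zero_Shot_QA/all_test",
    "multiple_choice_qa/NExTQA/video",
    "multiple_choice_qa/EgoSchema/video",
    "multiple_choice_qa/IntentQA/video",
]

def missing_dirs(root: str = "playground/data") -> list:
    """Return the expected sub-directories that do not exist under `root`."""
    base = Path(root)
    return [d for d in EXPECTED_DIRS if not (base / d).is_dir()]
```

Running `missing_dirs()` from the repo root lists any benchmark directories still to be downloaded.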

Detailed Results

Module Ablation (EgoSchema)

| Module | Acc. (%) |
|---|---|
| Baseline | 44.8 |
| + Dynamic Spatial Token Compression | 50.6 |
| + Dynamic Temporal Frame Selection | 51.8 |
| + Question Decomposition | 58.0 |

Full Module Ablation (All Benchmarks; open-ended results reported as Accuracy/Score)

| Module | NExT-QA | IntentQA | MSVD | MSRVTT | TGIF | ANet |
|---|---|---|---|---|---|---|
| Baseline | 65.4 | 61.3 | 77.8/4.0 | 62.8/3.5 | 76.9/4.0 | 54.2/3.3 |
| + Spatial Compression | 66.7 | 62.2 | 79.4/4.0 | 63.6/3.5 | 78.9/4.1 | 55.4/3.3 |
| + Temporal Selection | 67.0 | 62.9 | 80.0/4.1 | 64.2/3.5 | 79.1/4.1 | 56.4/3.4 |
| + Question Decomposition | 68.3 | 64.2 | 72.4/3.8 | 62.2/3.5 | 75.7/4.0 | 53.8/3.3 |

Efficiency Analysis (EgoSchema)

| Module | Acc. (%) | s/sample |
|---|---|---|
| Baseline | 44.8 | 3.927 |
| + Dynamic Compression | 51.8 | 6.115 |
| + Question Decomposition | 58.0 | 37.395 |
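The relative per-sample cost of each stage follows directly from the efficiency numbers above; question decomposition dominates runtime (roughly 9.5x baseline latency, plausibly because each sample incurs external GPT API calls):

```python
# Per-sample latencies (seconds) from the efficiency table above
baseline = 3.927
compression = 6.115
decomposition = 37.395

# Latency multipliers relative to the baseline
print(round(compression / baseline, 2))    # 1.56
print(round(decomposition / baseline, 2))  # 9.52
```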

Core Components

The core implementation is in Dcode.py:

| Function | Description | Paper Method |
|---|---|---|
| generate_subquestions() | Decompose questions into sub-questions using GPT-3.5 | Question Decomposition |
| supp_frame_selection() | Select frames based on CLIP semantic similarity | Dynamic Compression (Frame) |
| token_select_and_merge() | Select and merge visual tokens to reduce redundancy | Dynamic Compression (Token) |
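As a rough sketch of diversity-driven frame selection over CLIP embeddings (the actual supp_frame_selection also mixes in uniform sampling via uniform_ratio; the greedy max-min rule below is an assumption for illustration, not the paper's exact criterion):

```python
import numpy as np

def greedy_diverse_select(embs: np.ndarray, n: int) -> list:
    """Pick `n` frame indices by greedy max-min distance over
    L2-normalized embeddings (farthest-point sampling)."""
    normed = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    chosen = [0]  # start from the first frame
    while len(chosen) < n:
        # distance of every frame to its nearest already-chosen frame
        dists = np.min(
            np.linalg.norm(normed[:, None] - normed[chosen][None], axis=-1),
            axis=1,
        )
        chosen.append(int(np.argmax(dists)))
    return sorted(chosen)

rng = np.random.default_rng(0)
frames = rng.normal(size=(32, 8))  # stand-in for CLIP frame embeddings
idxs = greedy_diverse_select(frames, n=5)
print(len(idxs))  # 5
```

Frames whose embeddings are far from everything already selected carry the most new semantic content, which is the intuition behind selecting supplementary frames beyond the uniform baseline.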

Acknowledgement

We extend our gratitude to the following projects: LLaVA, IG-VLM, Video-LLaVA, SF-LLaVA and TS-LLaVA.

Citation

If you find this work useful, please cite our paper:

@inproceedings{huang-etal-2025-code,
    title = "{D}-{C}o{D}e: Scaling Image-Pretrained {VLM}s to Video via Dynamic Compression and Question Decomposition",
    author = "Huang, Yiyang  and
      Wang, Yizhou  and
      Fu, Yun",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    year = "2025",
    pages = "11798--11811",
}

arXiv version:

@article{huang2025d,
    title={D-CoDe: Scaling Image-Pretrained VLMs to Video via Dynamic Compression and Question Decomposition},
    author={Huang, Yiyang and Wang, Yizhou and Fu, Yun},
    journal={arXiv preprint arXiv:2510.08818},
    year={2025}
}

License

This project is released under the Apache 2.0 License.
