A training-free framework that adapts image-pretrained VLMs to video understanding, achieving SOTA on 7 benchmarks through dynamic compression and question decomposition, with no fine-tuning required.
D-CoDe achieves state-of-the-art performance across 7 video understanding benchmarks, all without any training.
Multiple-Choice VideoQA (↑ higher is better)

Open-Ended VideoQA, Accuracy (↑ higher is better)
Highlight: On the challenging long-video benchmark EgoSchema, D-CoDe achieves 58.0% accuracy, a +7.8% improvement over the previous best training-free method (TS-LLaVA, 50.2%).
```python
from Dcode import (
    generate_subquestions,
    supp_frame_selection,
    token_select_and_merge,
    load_clip_model,
)

# 1. Question Decomposition (requires the OPENAI_API_KEY environment variable)
subquestions = generate_subquestions(
    question="What did the person do after picking up the cup?",
    prompt_variant="original",
)

# 2. Frame Selection (based on semantic diversity)
clip_processor, clip_model = load_clip_model()
selected_frames, frame_idxs = supp_frame_selection(
    video_frames,          # list of PIL Images
    N=15,                  # number of frames to select
    uniform_ratio=0.85,    # ratio of frames taken by uniform sampling
    clip_model=clip_model,
    clip_processor=clip_processor,
)

# 3. Token Selection and Merge
merged_features = token_select_and_merge(
    image_features,            # tensor of shape (T, N, D)
    top_k=288,                 # tokens to keep per frame
    merge_strategy="mean",     # options: "mean", "max", "weighted_mean"
    similarity_threshold=0.8,  # similarity threshold for merging
)
```

Run the evaluation scripts:

```bash
# Multiple-Choice VideoQA
bash scripts/run_eval_egoschema.sh
bash scripts/run_eval_nextqa.sh
bash scripts/run_eval_intentqa.sh

# Open-Ended VideoQA
bash scripts/run_eval_msvd.sh
bash scripts/run_eval_msrvtt.sh
bash scripts/run_eval_tgif.sh
bash scripts/run_eval_activitynet.sh
```

Create the conda environment:

```bash
conda create -n d_code python=3.10.12
conda activate d_code
bash setup_env.sh
```

Set up your OpenAI API key for question decomposition:

```bash
export OPENAI_API_KEY=$YOUR_OPENAI_API_KEY
```

Download the pre-trained LLaVA-NeXT weights:

```bash
git lfs clone https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b liuhaotian/llava-v1.6-vicuna-7b
```

Data setup:
Ground-truth QA CSV files are already included in playground/gt_qa_files for: MSVD-QA, MSRVTT-QA, TGIF-QA, ActivityNet-QA, NExT-QA, EgoSchema, and IntentQA.
- Open-Ended VideoQA
  - [Recommended] Follow the instructions in Video-LLaVA to download the raw videos.
  - Or download them directly: MSVD-QA · MSRVTT-QA · TGIF-QA · ActivityNet-QA
- Multiple-Choice VideoQA
```
playground/data/
├── video_qa/
│   ├── MSVD_Zero_Shot_QA/videos/
│   ├── MSRVTT_Zero_Shot_QA/videos/all/
│   ├── TGIF_Zero_Shot_QA/mp4/
│   └── Activitynet_Zero_Shot_QA/all_test/
└── multiple_choice_qa/
    ├── NExTQA/video/
    ├── EgoSchema/video/
    └── IntentQA/video/
```
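For intuition about the dynamic compression stage, here is a minimal, hypothetical sketch of diversity-aware frame selection in the spirit of supp_frame_selection: a uniform backbone of frames plus a greedy supplement that maximizes semantic diversity over CLIP-like frame embeddings. The function name, defaults, and selection logic below are illustrative assumptions, not the repository's actual implementation.

```python
import numpy as np

def select_diverse_frames(features, n_select, uniform_ratio=0.85):
    """Pick n_select frame indices: a uniform backbone plus a greedy,
    diversity-maximizing supplement over per-frame embeddings.

    features: (T, D) array of L2-normalized frame embeddings.
    (Illustrative sketch only, not the repository's supp_frame_selection.)
    """
    T = features.shape[0]
    n_uniform = min(max(1, int(round(n_select * uniform_ratio))), n_select)
    # Uniform backbone: evenly spaced frame indices.
    chosen = sorted(set(np.linspace(0, T - 1, n_uniform).round().astype(int)))
    # Greedy supplement: repeatedly add the frame least similar
    # (lowest max cosine similarity) to everything chosen so far.
    while len(chosen) < n_select and len(chosen) < T:
        sims = features @ features[chosen].T   # (T, |chosen|)
        max_sim = sims.max(axis=1)             # closeness to current selection
        max_sim[chosen] = np.inf               # exclude already-chosen frames
        chosen.append(int(max_sim.argmin()))
    return sorted(chosen)
```

In this sketch, with N=15 and uniform_ratio=0.85 (the quickstart defaults), 13 indices come from uniform sampling and 2 from the diversity supplement.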
Module Ablation (EgoSchema)
| Module | Acc. (%) |
|---|---|
| Baseline | 44.8 |
| + Dynamic Spatial Token Compression | 50.6 |
| + Dynamic Temporal Frame Selection | 51.8 |
| + Question Decomposition | 58.0 |
Full Module Ablation (All Benchmarks; open-ended results are reported as Accuracy/Score)
| Module | NExT-QA | IntentQA | MSVD | MSRVTT | TGIF | ANet |
|---|---|---|---|---|---|---|
| Baseline | 65.4 | 61.3 | 77.8/4.0 | 62.8/3.5 | 76.9/4.0 | 54.2/3.3 |
| + Spatial Compression | 66.7 | 62.2 | 79.4/4.0 | 63.6/3.5 | 78.9/4.1 | 55.4/3.3 |
| + Temporal Selection | 67.0 | 62.9 | 80.0/4.1 | 64.2/3.5 | 79.1/4.1 | 56.4/3.4 |
| + Question Decomposition | 68.3 | 64.2 | 72.4/3.8 | 62.2/3.5 | 75.7/4.0 | 53.8/3.3 |
Efficiency Analysis (EgoSchema)
| Module | Acc. (%) | Time (s/sample) |
|---|---|---|
| Baseline | 44.8 | 3.927 |
| + Dynamic Compression | 51.8 | 6.115 |
| + Question Decomposition | 58.0 | 37.395 |
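The roughly 37 s/sample of the full pipeline is dominated by the GPT-3.5 calls made during question decomposition. As a shape-only sketch of that step, the prompt wording, helper names, and parsing below are illustrative assumptions, not the repository's code:

```python
def build_decomposition_prompt(question, n_sub=3):
    """Ask an LLM to split a video question into simpler sub-questions.
    (Illustrative prompt wording, not the paper's actual prompt.)"""
    return (
        f"Decompose the following video question into {n_sub} simpler "
        f"sub-questions, one per line, each starting with 'Q:'.\n"
        f"Question: {question}"
    )

def parse_subquestions(reply):
    """Extract the 'Q:'-prefixed lines from the model's reply."""
    subs = []
    for line in reply.splitlines():
        line = line.strip()
        if line.startswith("Q:"):
            subs.append(line[2:].strip())
    return subs

# The actual API call would look roughly like (requires OPENAI_API_KEY):
#   from openai import OpenAI
#   reply = OpenAI().chat.completions.create(
#       model="gpt-3.5-turbo",
#       messages=[{"role": "user",
#                  "content": build_decomposition_prompt(question)}],
#   ).choices[0].message.content
#   subquestions = parse_subquestions(reply)
```

One round trip per question (plus one per sub-question answer) explains why this module adds far more latency than the compression modules, which run locally.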
The core implementation is in Dcode.py:
| Function | Description | Paper Method |
|---|---|---|
| `generate_subquestions()` | Decompose questions into sub-questions using GPT-3.5 | Question Decomposition |
| `supp_frame_selection()` | Select frames based on CLIP semantic similarity | Dynamic Compression (Frame) |
| `token_select_and_merge()` | Select and merge visual tokens to reduce redundancy | Dynamic Compression (Token) |
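As a rough illustration of what a token_select_and_merge-style compression can look like, the sketch below keeps the top_k highest-norm tokens per frame and then mean-merges near-duplicates above a cosine-similarity threshold. The saliency proxy (token norm) and the greedy grouping are assumptions for illustration; the repository's actual selection criterion may differ.

```python
import numpy as np

def select_and_merge_tokens(tokens, top_k, sim_threshold=0.8):
    """Hypothetical per-frame token compression:
    1) keep the top_k highest-norm tokens in each frame,
    2) greedily mean-merge kept tokens whose cosine similarity
       exceeds sim_threshold ("mean" merge strategy).

    tokens: (T, N, D) array of visual tokens per frame.
    Returns a list of T arrays, each with at most top_k tokens.
    """
    compressed = []
    for frame in tokens:                               # frame: (N, D)
        # Step 1: saliency proxy = token norm; keep the top_k tokens.
        norms = np.linalg.norm(frame, axis=1)
        keep = frame[np.argsort(-norms)[:top_k]]
        # Step 2: greedy merge of near-duplicate tokens.
        unit = keep / (np.linalg.norm(keep, axis=1, keepdims=True) + 1e-8)
        merged, used = [], np.zeros(len(keep), dtype=bool)
        for i in range(len(keep)):
            if used[i]:
                continue
            group = unit[i] @ unit.T > sim_threshold   # tokens similar to i
            group &= ~used                             # only unmerged tokens
            group[i] = True
            used |= group
            merged.append(keep[group].mean(axis=0))    # mean-pool the group
        compressed.append(np.stack(merged))
    return compressed
```

Near-duplicate tokens (e.g., repeated background patches) collapse into a single averaged token, which is what lets the per-frame token budget shrink without discarding distinct content.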
We extend our gratitude to the following projects: LLaVA, IG-VLM, Video-LLaVA, SF-LLaVA and TS-LLaVA.
If you find this work useful, please cite our paper:
```bibtex
@inproceedings{huang-etal-2025-code,
  title = "{D}-{C}o{D}e: Scaling Image-Pretrained {VLM}s to Video via Dynamic Compression and Question Decomposition",
  author = "Huang, Yiyang and Wang, Yizhou and Fu, Yun",
  booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
  year = "2025",
  pages = "11798--11811",
}
```

arXiv version:
```bibtex
@article{huang2025d,
  title={D-CoDe: Scaling Image-Pretrained VLMs to Video via Dynamic Compression and Question Decomposition},
  author={Huang, Yiyang and Wang, Yizhou and Fu, Yun},
  journal={arXiv preprint arXiv:2510.08818},
  year={2025}
}
```

This project is released under the Apache 2.0 License.
