Follow these steps to set up the project environment.
This project uses LLaVA-Video and LLaVA-OneVision as its base models, which require the environment setup of LLaVA-NeXT. Please follow these steps before running our pipeline:
```bash
# Clone LLaVA-NeXT
git clone https://github.com/LLaVA-VL/LLaVA-NeXT
cd LLaVA-NeXT

# Create environment
conda create -n llava python=3.10 -y
conda activate llava

# Upgrade pip and install training dependencies
pip install --upgrade pip  # Enable PEP 660 support
pip install -e ".[train]"
```

Once this is complete, you can return to this repository and continue the setup as described below.
```bash
git clone https://github.com/YXY0807/SLFG.git
cd SLFG
```

Modify the `configs/paths.json` file to match your local directory structure. You need to provide paths to your video files, datasets, model checkpoints, and desired output directory.
```json
{
  "video_data_dir": "/path/to/your/videos",
  "dataset_dir": "/path/to/your/datasets",
  "output_dir": "/path/to/project/outputs",
  "model_checkpoints_dir": "/path/to/llm/and/vision/models"
}
```
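The pipeline scripts read these paths at runtime. As a minimal sketch, assuming the configuration is loaded with Python's standard `json` module (the helper name and validation below are illustrative, not the project's actual code):

```python
import json
from pathlib import Path

def load_paths(config_file: str = "configs/paths.json") -> dict:
    """Hypothetical helper: read configs/paths.json and sanity-check the entries."""
    with open(config_file, "r") as f:
        paths = json.load(f)
    # Fail early if a configured input directory does not exist yet;
    # the output directory is created on demand instead.
    for key, value in paths.items():
        if key == "output_dir":
            Path(value).mkdir(parents=True, exist_ok=True)
        elif not Path(value).is_dir():
            raise FileNotFoundError(f"{key} points to a missing directory: {value}")
    return paths

if __name__ == "__main__":
    print(load_paths())
```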
Run the scripts in the specified order. The output of each script serves as the input for the next.

Step 1: Segment Videos

Divides each video into a fixed number of uniform time segments.

```bash
python step_1_segment_videos.py
```
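As a rough illustration of what uniform segmentation means here (a hypothetical helper, not the logic of `step_1_segment_videos.py` itself), each video's duration is split into equal-length time windows:

```python
# Hypothetical sketch of uniform time segmentation; the actual
# step_1_segment_videos.py may segment videos differently.
def uniform_segments(duration_s: float, num_segments: int = 8) -> list[tuple[float, float]]:
    """Split a duration (in seconds) into equal-length (start, end) windows."""
    step = duration_s / num_segments
    return [(i * step, (i + 1) * step) for i in range(num_segments)]

if __name__ == "__main__":
    # A 120-second video split into 8 windows of 15 seconds each.
    for start, end in uniform_segments(120.0, 8):
        print(f"{start:6.1f}s -> {end:6.1f}s")
```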
Step 2: Generate Captions

Generates a detailed text description for each video segment created in Step 1.

```bash
python step_2_generate_captions.py --gpu-id 1
```

Note: Adjust the `--gpu-id` parameter based on your available devices.
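The `--gpu-id` flag used throughout the pipeline typically pins a script to a single CUDA device. A minimal sketch of that convention, assuming `argparse` and PyTorch (the actual scripts may handle device selection differently):

```python
import argparse

import torch

# Illustrative only: how a --gpu-id flag is commonly mapped to a device;
# the real step scripts may implement this differently.
parser = argparse.ArgumentParser()
parser.add_argument("--gpu-id", type=int, default=0, help="CUDA device index to use")
args = parser.parse_args()

device = torch.device(f"cuda:{args.gpu_id}" if torch.cuda.is_available() else "cpu")
print(f"Loading the captioning model on {device}")
```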
Step 3: Extract Scenes

Breaks down the long, generated captions into smaller, more coherent scene descriptions.

```bash
python step_3_extract_scenes.py --gpu-id 0
```
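Conceptually, this step turns one long caption into several shorter scene-level descriptions. The toy sketch below only illustrates that input/output shape with naive sentence splitting; the actual script runs on a GPU and presumably uses a model rather than a regex:

```python
import re

# Toy stand-in for scene extraction: split a long caption into sentence-level
# chunks. The real step_3_extract_scenes.py presumably uses a model for this.
def naive_scene_split(caption: str) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", caption.strip())
    return [s for s in sentences if s]

if __name__ == "__main__":
    caption = (
        "A man walks into the kitchen and opens the fridge. "
        "He takes out a bottle of milk and pours a glass. "
        "Later, he sits at the table and reads a newspaper."
    )
    for i, scene in enumerate(naive_scene_split(caption), start=1):
        print(f"Scene {i}: {scene}")
```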
Step 4: Generate Queries

These two scripts process the input questions to generate queries for retrieval. They can be run in parallel.

4a. Generate Query Sentences:

```bash
python step_4a_generate_query_sentences.py --gpu-id 2
```

4b. Generate Query Items:

```bash
python step_4b_generate_query_items.py --gpu-id 2
```
Step 5: Match Queries to Scenes

Uses a sentence-transformer model to find the most semantically relevant video scenes (from Step 3) for each query (from Step 4).

```bash
python step_5_match_queries_to_scenes.py --gpu-id 2
```
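A minimal sketch of this kind of semantic matching with the `sentence-transformers` library; the model name, example texts, and `top_k` value are illustrative assumptions, not necessarily what `step_5_match_queries_to_scenes.py` uses:

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative query-to-scene retrieval; model name, texts, and top_k are
# assumptions, not the actual configuration of step_5_match_queries_to_scenes.py.
model = SentenceTransformer("all-MiniLM-L6-v2")

scenes = [
    "A man opens the fridge and takes out a bottle of milk.",
    "Two children play football in the backyard.",
    "A woman writes on a whiteboard during a meeting.",
]
query = "What does the man take out of the fridge?"

scene_embeddings = model.encode(scenes, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Rank scenes by cosine similarity and keep the best matches.
hits = util.semantic_search(query_embedding, scene_embeddings, top_k=2)[0]
for hit in hits:
    print(f"{hit['score']:.3f}  {scenes[hit['corpus_id']]}")
```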
Step 6: Run Final Inference

Takes the top-ranked video segments identified in Step 5 and feeds them, along with the original question, into the final multimodal model to generate an answer.

```bash
python step_6_run_final_inference.py --gpu-id 2 --max-frames 32
```

Note: Adjust `--max-frames` based on your model's requirements and GPU memory.
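A `--max-frames` budget is typically applied by sampling at most that many frames, spaced evenly across a clip, before handing them to the multimodal model. The sketch below is a hypothetical illustration, not the project's actual sampling code:

```python
import numpy as np

# Hypothetical illustration of a --max-frames budget; the actual
# step_6_run_final_inference.py may sample frames differently.
def sample_frame_indices(total_frames: int, max_frames: int = 32) -> np.ndarray:
    """Pick at most max_frames indices, evenly spaced across the clip."""
    if total_frames <= max_frames:
        return np.arange(total_frames)
    return np.linspace(0, total_frames - 1, num=max_frames, dtype=int)

if __name__ == "__main__":
    # A 900-frame clip reduced to 32 evenly spaced frame indices.
    print(sample_frame_indices(900, max_frames=32))
```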
- `/configs`: Contains the `paths.json` configuration file.
- `1_segment_videos.py`: Segments videos into clips.
- `2_generate_captions.py`: Generates captions for clips.
- `3_extract_scenes.py`: Extracts scenes from captions.
- `4a_generate_query_sentences.py`: Extracts sentence queries from questions.
- `4b_generate_query_items.py`: Extracts item queries from questions.
- `5_match_queries_to_scenes.py`: Retrieves relevant scenes based on queries.
- `6_run_final_inference.py`: Generates final answer from retrieved clips.
- `README.md`: This file.
This project is licensed under the MIT License. See the LICENSE file for details.