Follow these steps to set up the project environment.
This project uses LLaVA-Video and LLaVA-OneVision as its base models, which require the environment setup of LLaVA-NeXT. Please follow these steps before running our pipeline:
```bash
# Clone LLaVA-NeXT
git clone https://github.com/LLaVA-VL/LLaVA-NeXT
cd LLaVA-NeXT

# Create environment
conda create -n llava python=3.10 -y
conda activate llava

# Upgrade pip and install training dependencies
pip install --upgrade pip  # Enable PEP 660 support
pip install -e ".[train]"
```

Once this is complete, you can return to this repository and continue the setup as described below.
```bash
git clone https://github.com/YXY0807/SLFG.git
cd SLFG
```

Modify the `configs/paths.json` file to match your local directory structure. You need to provide paths to your video files, datasets, model checkpoints, and desired output directory.
```json
{
  "video_data_dir": "/path/to/your/videos",
  "dataset_dir": "/path/to/your/datasets",
  "output_dir": "/path/to/project/outputs",
  "model_checkpoints_dir": "/path/to/llm/and/vision/models"
}
```
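The pipeline scripts read these paths at runtime. As a minimal sketch, assuming the configuration is loaded with Python's standard `json` module (the helper name and validation below are illustrative, not the project's actual code):

```python
import json
from pathlib import Path

def load_paths(config_file: str = "configs/paths.json") -> dict:
    """Hypothetical helper: read configs/paths.json and sanity-check the entries."""
    with open(config_file, "r") as f:
        paths = json.load(f)
    # Fail early if a configured input directory does not exist yet;
    # the output directory is created on demand instead.
    for key, value in paths.items():
        if key == "output_dir":
            Path(value).mkdir(parents=True, exist_ok=True)
        elif not Path(value).is_dir():
            raise FileNotFoundError(f"{key} points to a missing directory: {value}")
    return paths

if __name__ == "__main__":
    print(load_paths())
```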
Run the scripts in the specified order. The output of each script serves as the input for the next.

Step 1: Segment Videos

Divides each video into a fixed number of uniform time segments.

```bash
python step_1_segment_videos.py
```
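As a rough illustration of what uniform segmentation means here (a hypothetical helper, not the logic of `step_1_segment_videos.py` itself), each video's duration is split into equal-length time windows:

```python
# Hypothetical sketch of uniform time segmentation; the actual
# step_1_segment_videos.py may segment videos differently.
def uniform_segments(duration_s: float, num_segments: int = 8) -> list[tuple[float, float]]:
    """Split a duration (in seconds) into equal-length (start, end) windows."""
    step = duration_s / num_segments
    return [(i * step, (i + 1) * step) for i in range(num_segments)]

if __name__ == "__main__":
    # A 120-second video split into 8 windows of 15 seconds each.
    for start, end in uniform_segments(120.0, 8):
        print(f"{start:6.1f}s -> {end:6.1f}s")
```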
Step 2: Generate Captions

Generates a detailed text description for each video segment created in Step 1.

```bash
python step_2_generate_captions.py --gpu-id 1
```

Note: Adjust the `--gpu-id` parameter based on your available devices.
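The `--gpu-id` flag used throughout the pipeline typically pins a script to a single CUDA device. A minimal sketch of that convention, assuming `argparse` and PyTorch (the actual scripts may handle device selection differently):

```python
import argparse

import torch

# Illustrative only: how a --gpu-id flag is commonly mapped to a device;
# the real step scripts may implement this differently.
parser = argparse.ArgumentParser()
parser.add_argument("--gpu-id", type=int, default=0, help="CUDA device index to use")
args = parser.parse_args()

device = torch.device(f"cuda:{args.gpu_id}" if torch.cuda.is_available() else "cpu")
print(f"Loading the captioning model on {device}")
```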
Step 3: Extract Scenes

Breaks down the long, generated captions into smaller, more coherent scene descriptions.

```bash
python step_3_extract_scenes.py --gpu-id 0
```
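Conceptually, this step turns one long caption into several shorter scene-level descriptions. The toy sketch below only illustrates that input/output shape with naive sentence splitting; the actual script runs on a GPU and presumably uses a model rather than a regex:

```python
import re

# Toy stand-in for scene extraction: split a long caption into sentence-level
# chunks. The real step_3_extract_scenes.py presumably uses a model for this.
def naive_scene_split(caption: str) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", caption.strip())
    return [s for s in sentences if s]

if __name__ == "__main__":
    caption = (
        "A man walks into the kitchen and opens the fridge. "
        "He takes out a bottle of milk and pours a glass. "
        "Later, he sits at the table and reads a newspaper."
    )
    for i, scene in enumerate(naive_scene_split(caption), start=1):
        print(f"Scene {i}: {scene}")
```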
Step 4: Generate Queries

These two scripts process the input questions to generate queries for retrieval. They can be run in parallel.

4a. Generate Query Sentences:

```bash
python step_4a_generate_query_sentences.py --gpu-id 2
```

4b. Generate Query Items:

```bash
python step_4b_generate_query_items.py --gpu-id 2
```
Step 5: Match Queries to Scenes

Uses a sentence-transformer model to find the most semantically relevant video scenes (from Step 3) for each query (from Step 4).

```bash
python step_5_match_queries_to_scenes.py --gpu-id 2
```
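A minimal sketch of this kind of semantic matching with the `sentence-transformers` library; the model name, example texts, and `top_k` value are illustrative assumptions, not necessarily what `step_5_match_queries_to_scenes.py` uses:

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative query-to-scene retrieval; model name, texts, and top_k are
# assumptions, not the actual configuration of step_5_match_queries_to_scenes.py.
model = SentenceTransformer("all-MiniLM-L6-v2")

scenes = [
    "A man opens the fridge and takes out a bottle of milk.",
    "Two children play football in the backyard.",
    "A woman writes on a whiteboard during a meeting.",
]
query = "What does the man take out of the fridge?"

scene_embeddings = model.encode(scenes, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Rank scenes by cosine similarity and keep the best matches.
hits = util.semantic_search(query_embedding, scene_embeddings, top_k=2)[0]
for hit in hits:
    print(f"{hit['score']:.3f}  {scenes[hit['corpus_id']]}")
```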
Step 6: Run Final Inference

Takes the top-ranked video segments identified in Step 5 and feeds them, along with the original question, into the final multimodal model to generate an answer.

```bash
python step_6_run_final_inference.py --gpu-id 2 --max-frames 32
```

Note: Adjust `--max-frames` based on your model's requirements and GPU memory.
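A `--max-frames` budget is typically applied by sampling at most that many frames, spaced evenly across a clip, before handing them to the multimodal model. The sketch below is a hypothetical illustration, not the project's actual sampling code:

```python
import numpy as np

# Hypothetical illustration of a --max-frames budget; the actual
# step_6_run_final_inference.py may sample frames differently.
def sample_frame_indices(total_frames: int, max_frames: int = 32) -> np.ndarray:
    """Pick at most max_frames indices, evenly spaced across the clip."""
    if total_frames <= max_frames:
        return np.arange(total_frames)
    return np.linspace(0, total_frames - 1, num=max_frames, dtype=int)

if __name__ == "__main__":
    # A 900-frame clip reduced to 32 evenly spaced frame indices.
    print(sample_frame_indices(900, max_frames=32))
```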
- `/configs`: Contains the `paths.json` configuration file.
- `1_segment_videos.py`: Segments videos into clips.
- `2_generate_captions.py`: Generates captions for clips.
- `3_extract_scenes.py`: Extracts scenes from captions.
- `4a_generate_query_sentences.py`: Extracts sentence queries from questions.
- `4b_generate_query_items.py`: Extracts item queries from questions.
- `5_match_queries_to_scenes.py`: Retrieves relevant scenes based on queries.
- `6_run_final_inference.py`: Generates final answer from retrieved clips.
- `README.md`: This file.
This project is licensed under the MIT License. See the LICENSE file for details.