This project implements a video analytics system designed for semantic retrieval and object counting. It utilizes CLIP embeddings for semantic understanding, keyframe selection algorithms for data compression, and lightweight MLPs for count prediction. It includes pipelines for training, indexing (FAISS), and benchmarking against YOLO baselines.
- Python 3.10+
- CUDA-enabled GPU (recommended)
- FFmpeg (for video decoding via av and opencv)
This project uses Poetry for dependency management.
- Clone the repository:
git clone <repository_url>
cd traffic-video-pipeline
- Install dependencies:
poetry install
Alternatively, if using pip:
pip install torch torchvision opencv-python transformers ultralytics faiss-cpu av pandas numpy pycocotools scikit-learn tabulate matplotlib
- Set up the source path: Ensure the src directory is in your PYTHONPATH.
export PYTHONPATH=$PYTHONPATH:$(pwd)
This project uses the VIRAT Video Dataset for benchmarking and fine-tuning.
- Download the VIRAT Video Dataset (Release 2.0) videos. You can find them at the official website or standard dataset repositories.
- Create a directory for source videos:
mkdir -p videos_source
- Place the downloaded .mp4 files into videos_source/.
If you intend to pre-train the count predictor, download the COCO 2017 dataset. A helper script is provided:
python src/training/download_coco.py
This will download and extract data to data/coco/.
The pipeline generates artifacts in a structured data/ directory.
traffic-video-pipeline/
├── data/
│ ├── coco/ # COCO dataset (images/annotations)
│ └── VIRAT/ # Processed VIRAT data
│ ├── VIRAT_S_000001/ # Per-video directory
│ │ ├── counts.csv # Ground truth counts (generated by YOLO)
│ │ ├── frames/ # Extracted frames (optional)
│ │ ├── embeddings/ # CLIP embeddings (.npy) and metadata
│ │ └── keyframes/ # Selected keyframe images
│ └── ...
├── models/
│ └── checkpoints/ # Saved model weights (.pth)
├── src/ # Source code
├── videos_source/ # Raw input MP4 files
└── ...
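The layout above can be traversed with a few lines of standard-library Python. The following is a minimal sketch, assuming each processed video has its own directory under data/VIRAT/ and is ready for training once counts.csv exists. The helper name find_processed_videos is hypothetical and not part of the repository.

```python
from pathlib import Path


def find_processed_videos(virat_root):
    """Return sorted names of per-video directories that contain counts.csv.

    Hypothetical helper: assumes the data/VIRAT/<video_name>/counts.csv layout.
    """
    root = Path(virat_root)
    if not root.is_dir():
        # Nothing has been processed yet (or the path is wrong).
        return []
    return sorted(
        d.name
        for d in root.iterdir()
        if d.is_dir() and (d / "counts.csv").is_file()
    )
```

Calling find_processed_videos("data/VIRAT") from the repository root would list the videos whose ground truth has already been generated, which can help confirm that the detection step finished before training starts.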
Before training or benchmarking, generate ground truth counts using a high-accuracy object detector (YOLOv8/11).
Modify src/main_train.py to point to your videos_source directory and run run_detection_on_dir:
# Inside src/main_train.py
run_detection_on_dir(
    videos_dir="videos_source",
    model_name="yolov8l",
    annotated=False
)
Run the script:
python src/main_train.py
This creates counts.csv files in data/VIRAT/<video_name>/.
The system uses an MLP to predict object counts from CLIP embeddings.
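To illustrate the idea, here is a rough sketch of a two-layer MLP forward pass that maps embedding vectors to non-negative counts. It is written in NumPy to stay dependency-light and is not the project's CountPredictor. The embedding size (512), hidden size (128), random weights, and the function names init_mlp and predict_counts are all assumptions for illustration. The LARGE3 configuration used below is not reproduced here.

```python
import numpy as np


def init_mlp(embed_dim=512, hidden_dim=128, seed=0):
    """Create random weights for a two-layer MLP (sizes are illustrative assumptions)."""
    rng = np.random.default_rng(seed)
    return {
        "w1": rng.normal(0.0, 0.02, size=(embed_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "w2": rng.normal(0.0, 0.02, size=(hidden_dim, 1)),
        "b2": np.zeros(1),
    }


def predict_counts(params, embeddings):
    """Map embeddings of shape (n, embed_dim) to n non-negative count estimates."""
    hidden = np.maximum(embeddings @ params["w1"] + params["b1"], 0.0)  # ReLU
    raw = hidden @ params["w2"] + params["b2"]
    # Clamp at zero: an object count cannot be negative.
    return np.maximum(raw, 0.0).ravel()
```

With untrained weights the outputs are meaningless. The sketch only shows the shape of the computation: one embedding in, one scalar count out.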
Pre-training (COCO):
# Inside src/main_train.py
pretrain_on_coco(
    coco_dir="data/coco",
    target="car",
    model_config=LARGE3
)
Fine-tuning (VIRAT):
# Inside src/main_train.py
finetune_on_virat(
    data_dir="data/VIRAT",
    target="car",
    pretrained_checkpoint="models/checkpoints/car_coco_pretrained.pth",
    model_config=LARGE3
)
To evaluate different pipeline configurations (Standard Embedding vs. Keyframe-based) and keyframe selection algorithms (FrameDiff, SSIM, MOG2, Flow):
Run the comprehensive test suite:
python src/comprehensive_test.py
Arguments in src/comprehensive_test.py allow you to configure:
- keyframe_selectors: List of methods to test.
- keyframe_params: Parameters for selectors (e.g., k_mad, min_spacing).
- test_keyframes: Boolean to enable/disable keyframe logic.
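The meaning of k_mad and min_spacing is not spelled out here, so the sketch below rests on an assumed interpretation: a frame becomes a keyframe when its difference score exceeds median + k_mad * MAD (median absolute deviation), and consecutive keyframes must be at least min_spacing frames apart. The function select_keyframes is hypothetical and only mirrors the parameter names above. It does not reproduce any selector in src/keyframe/.

```python
from statistics import median


def select_keyframes(diff_scores, k_mad=3.0, min_spacing=5):
    """Pick frame indices whose difference score is a robust outlier.

    Assumed interpretation: a frame qualifies when its score exceeds
    median + k_mad * MAD, and selected frames are >= min_spacing apart.
    """
    if not diff_scores:
        return []
    med = median(diff_scores)
    mad = median([abs(s - med) for s in diff_scores])
    threshold = med + k_mad * mad
    selected = []
    for idx, score in enumerate(diff_scores):
        if score <= threshold:
            continue  # not different enough from the typical frame
        if selected and idx - selected[-1] < min_spacing:
            continue  # too close to the previously selected keyframe
        selected.append(idx)
    return selected
```

Under this reading, a larger k_mad keeps fewer frames because the threshold moves further from the median, and a larger min_spacing thins out bursts of consecutive high-motion frames.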
To perform semantic retrieval (e.g., "Find frames with > 2 cars"):
# Inside src/main_train.py
results = evaluate_retrieval(
    data_dir="data/VIRAT",
    checkpoint_path="models/checkpoints/car_virat_finetuned.pth",
    target="car",
    count_threshold=2
)
- src/keyframe/: Contains pre-selection logic.
  - FrameDiffPreselector: Based on pixel difference.
  - SSIMPreselector: Based on Structural Similarity Index.
  - MOG2Preselector: Based on Background Subtraction.
  - FlowPreselector: Based on Optical Flow magnitude.
- src/embeddings/: CLIP embedding generation (supports OpenAI CLIP and MobileCLIP).
- src/indexing/: FAISS index construction and flat-file management.
- src/models/: MLP architecture (CountPredictor) definition.
- src/training/: Training loops with loss handling for under-prediction penalties.
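One way an under-prediction penalty can be expressed is an asymmetric squared error, where errors below the target are weighted more heavily than errors above it. The sketch below is an assumption about the general technique, not the loss implemented in src/training/. The function name asymmetric_count_loss and the default under_weight of 2.0 are hypothetical.

```python
def asymmetric_count_loss(predicted, target, under_weight=2.0):
    """Mean squared error where under-predictions are scaled by under_weight.

    Illustrative only: the weighting scheme and default are assumptions.
    """
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    if not predicted:
        return 0.0
    total = 0.0
    for p, t in zip(predicted, target):
        err = p - t
        # err < 0 means the model predicted fewer objects than were present.
        weight = under_weight if err < 0 else 1.0
        total += weight * err * err
    return total / len(predicted)
```

With under_weight above 1.0, missing two cars costs more than hallucinating two cars. This pushes a count predictor toward recall, which matters for threshold queries such as "more than 2 cars".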