This project compares videos using Vision Transformer (ViT) models. It extracts embeddings from sampled video frames, aggregates them into a per-video embedding, and computes the cosine similarity between video embeddings to measure how visually similar two videos are.
- Extract frame embeddings from videos using Vision Transformer models
- Compare videos based on visual similarity
- Process multiple videos in batch
- Generate similarity reports in JSON format
- Python 3.6+
- PyTorch
- OpenCV
- NumPy
- timm (PyTorch Image Models)
- Clone this repository:

```bash
git clone https://github.com/yourusername/ViT-model.git
cd ViT-model
```

- Install the required dependencies:

```bash
pip install torch torchvision opencv-python numpy timm
```
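Optionally, you can confirm that the dependencies import correctly. This check is only a suggestion and not part of the project's own setup steps:

```python
# Quick sanity check that the required packages are installed and importable.
import torch, cv2, numpy, timm
print("torch", torch.__version__, "| opencv", cv2.__version__,
      "| numpy", numpy.__version__, "| timm", timm.__version__)
```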
- Place videos to be monitored in the `monitored` folder
- Place videos to be compared against in the `watched` folder
- Run the main script:

```bash
python test/main.py
```

- Results will be saved to `similarity_results.json`
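The exact schema of the results file is not documented here; as a hedged illustration, you can inspect it from Python after a run completes:

```python
# Load and pretty-print the similarity results written by test/main.py.
# The structure of the JSON (e.g. how video pairs and scores are keyed) is
# not specified in this README, so inspect the output to see what it contains.
import json

with open("similarity_results.json") as f:
    results = json.load(f)

print(json.dumps(results, indent=2))
```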
- The tool samples frames from each video at a specified rate (default: 1 frame per second)
- Each frame is preprocessed and normalized
- A Vision Transformer model extracts embeddings from the frames
- Frame embeddings are aggregated to create a video-level embedding
- Cosine similarity is computed between video embeddings
- Results are sorted by similarity and saved to a JSON file (a sketch of the full pipeline follows below)
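The snippet below is a minimal sketch of this pipeline, not the actual code in `test/main.py`. The model name (`vit_base_patch16_224`), the mean-pooling aggregation, and the helper names are assumptions for illustration:

```python
# Sketch: sample frames, embed them with a timm ViT, mean-pool into a
# video-level embedding, and compare two videos with cosine similarity.
import cv2
import torch
import timm
from timm.data import resolve_data_config, create_transform
from PIL import Image

# Load a pretrained ViT as a feature extractor (num_classes=0 drops the head)
# along with the preprocessing transform that matches its training setup.
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
model.eval()
transform = create_transform(**resolve_data_config({}, model=model))

def video_embedding(path, frames_per_second=1.0):
    """Sample frames at roughly `frames_per_second`, embed each frame, and mean-pool."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(round(fps / frames_per_second)), 1)
    embeddings, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            # OpenCV returns BGR frames; convert to RGB before the ViT transform.
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            tensor = transform(Image.fromarray(rgb)).unsqueeze(0)
            with torch.no_grad():
                embeddings.append(model(tensor).squeeze(0))
        idx += 1
    cap.release()
    return torch.stack(embeddings).mean(dim=0)

def cosine_similarity(a, b):
    return torch.nn.functional.cosine_similarity(a, b, dim=0).item()
```

With two such embeddings, `cosine_similarity(video_embedding("monitored/a.mp4"), video_embedding("watched/b.mp4"))` returns a score in [-1, 1], where values close to 1 indicate visually similar videos.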
- `test/main.py`: Main script for video comparison
- `monitored/`: Directory for videos to be monitored
- `watched/`: Directory for videos to be compared against
- `similarity_results.json`: Output file with similarity results
This project uses the timm library for Vision Transformer models.