Audio-visual synchronization detection using deep learning with modern Python architecture.
This is a refactored and enhanced version of the original SyncNet implementation by Joon Son Chung, updated for Python 3.9+ with clean architecture, comprehensive error handling, and performance optimizations.
SyncNet Python is a PyTorch implementation of the SyncNet model, which detects audio-visual synchronization in videos. It can identify lip-sync errors by analyzing the correspondence between mouth movements and spoken audio.
- 🎥 Audio-Visual Sync Detection: Accurately detect synchronization between audio and video
- 🔍 Face Detection: Automatic face detection and tracking using S3FD
- 📊 Detailed Analysis: Per-crop offsets, confidence scores, and minimum distances
- 🚀 Batch Processing: Process multiple videos efficiently
- 🐍 Python API: Easy-to-use Python interface with proper error handling
- 🏗️ Clean Architecture: Abstract base classes and factory patterns
- ⚡ Performance Optimized: Parallel processing and memory management
- 🛡️ Robust Error Handling: Comprehensive exception hierarchy
- ⚙️ Configuration Management: YAML/JSON configuration support
- 📝 Advanced Logging: Structured logging with progress tracking
- 🔄 Backward Compatibility: Maintains compatibility with original API
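As a sketch of what the JSON side of the configuration support could look like — the key names below are hypothetical, not the package's documented schema:

```python
import json

# Hypothetical JSON configuration -- key names are illustrative only.
config_text = """
{
  "device": "cuda",
  "weights": {
    "s3fd": "weights/sfd_face.pth",
    "syncnet": "weights/syncnet_v2.model"
  }
}
"""

config = json.loads(config_text)
print(config["device"])
print(config["weights"]["syncnet"])
```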
```shell
pip install syncnet-python
```
FFmpeg: Required for video processing
```shell
# Ubuntu/Debian
sudo apt-get install ffmpeg

# macOS
brew install ffmpeg
```
- Model Weights: Download pre-trained weights
  - Download `sfd_face.pth` and `syncnet_v2.model`
  - Place them in a `weights/` directory
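Since the pipeline shells out to FFmpeg, a quick preflight check can save a confusing failure later. This is a minimal sketch using only the standard library; it is not part of the package's API:

```python
import shutil

# FFmpeg is required for video processing; locate it on PATH before running.
ffmpeg_path = shutil.which("ffmpeg")
if ffmpeg_path is None:
    print("ffmpeg not found on PATH -- install it before running the pipeline")
else:
    print(f"ffmpeg found at: {ffmpeg_path}")
```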
```python
import json

from syncnet_python import SyncNetPipeline

# Initialize pipeline
pipeline = SyncNetPipeline(
    s3fd_weights="weights/sfd_face.pth",
    syncnet_weights="weights/syncnet_v2.model",
    device="cuda"  # or "cpu"
)

# Process video
results = pipeline.inference(
    video_path="video.mp4",
    audio_path=None  # Extract from video
)

# Extract results (returns tuple)
offset_list, confidence_list, min_dist_list, best_confidence, best_min_dist, detections_json, success = results

# Get best results
offset = offset_list[0]          # AV offset in frames
confidence = confidence_list[0]  # Confidence score
min_distance = min_dist_list[0]  # Minimum distance

print(f"AV Offset: {offset} frames")
print(f"Confidence: {confidence:.3f}")
print(f"Min Distance: {min_distance:.3f}")

# For detailed per-crop analysis
for i, (offset, conf, dist) in enumerate(zip(offset_list, confidence_list, min_dist_list)):
    print(f"Crop {i+1}: offset={offset}, confidence={conf:.3f}, min_dist={dist:.3f}")

# Parse face detections
detections = json.loads(detections_json)
print(f"Total frames with face detection: {len(detections)}")
```

```shell
# Process single video
syncnet-python video.mp4

# Process multiple videos
syncnet-python video1.mp4 video2.mp4 --output results.json

# Use CPU instead of GPU
syncnet-python video.mp4 --device cpu
```

Tested with example files:
- Processing Speed: 191.4 fps
- Face Detection: 100% success rate
- Accuracy: Detects 1-frame offsets with high confidence (4.5+)
- Compute Time: ~0.65 seconds for 134 frames
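The reported AV offset is measured in video frames; to express it in milliseconds, divide by the frame rate. A minimal sketch (`frames_to_ms` is a hypothetical helper, and 25 fps is only an assumed default — read the real frame rate from your video):

```python
def frames_to_ms(offset_frames: int, fps: float = 25.0) -> float:
    """Convert an AV offset in frames to milliseconds (hypothetical helper)."""
    return offset_frames / fps * 1000.0

# A 1-frame offset at 25 fps is 40 ms; at 30 fps it is ~33.3 ms.
print(frames_to_ms(1))        # 40.0
print(frames_to_ms(1, 30.0))
```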
- `syncnet/core/` - Modern refactored implementation
  - `base.py` - Abstract base classes and interfaces
  - `models.py` - Enhanced SyncNet model with factory pattern
  - `audio.py` - MFCC audio processing with streaming support
  - `video.py` - Parallel video processing with OpenCV
  - `sync_analyzer.py` - Optimized sync analysis with caching
  - `config.py` - Configuration management system
  - `exceptions.py` - Comprehensive error handling
  - `logging.py` - Advanced logging with progress tracking
  - `utils.py` - Memory management and utility functions
- `syncnet_python/` - Maintains original API compatibility
  - Full backward compatibility with existing code
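The exception hierarchy might be organized along these lines. This is only a sketch: the class names `SyncNetError`, `FaceDetectionError`, and `AudioExtractionError` are assumptions for illustration, not the package's actual API:

```python
class SyncNetError(Exception):
    """Hypothetical base class for all pipeline errors."""

class FaceDetectionError(SyncNetError):
    """Hypothetical: raised when no face can be tracked in the video."""

class AudioExtractionError(SyncNetError):
    """Hypothetical: raised when the audio track cannot be extracted."""

def run_pipeline(video_path: str) -> None:
    # Toy stand-in for the real pipeline, used only to show the pattern.
    if not video_path:
        raise AudioExtractionError("no input video given")

try:
    run_pipeline("")
except SyncNetError as exc:
    # A single base class lets callers catch any pipeline failure at once.
    print(f"pipeline failed: {exc}")
```

Catching the base class keeps calling code stable even if more specific error types are added later.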
- Python 3.9+ (tested on 3.13)
- PyTorch 2.0+
- CUDA (optional but recommended)
- FFmpeg
- Additional dependencies: OpenCV, SciPy, NumPy, pandas
This package is based on the original SyncNet implementation by Joon Son Chung, enhanced with modern Python architecture and performance optimizations.
If you use this code in your research, please cite the original paper:
```bibtex
@inproceedings{chung2016out,
  title={Out of time: automated lip sync in the wild},
  author={Chung, Joon Son and Zisserman, Andrew},
  booktitle={Asian Conference on Computer Vision},
  year={2016}
}
```

MIT License - see LICENSE file for details.