Real-time video style transfer with temporal consistency - Transform videos into artistic masterpieces while maintaining smooth frame-to-frame transitions using neural style transfer and optical flow.
Trained on 118,287 MS-COCO images over 15 epochs (~36 hours) on RTX 4090 Super
- Real-time Processing: 6.45 FPS on 1080p video (301 frames in 47 seconds)
- Adaptive Instance Normalization (AdaIN): Fast, flexible style transfer with a pre-trained VGG19 encoder
- Temporal Consistency: RAFT optical-flow-based smoothing eliminates flickering between frames
- High-Resolution Training: 512×512 training resolution for professional-quality results
- Production-Scale Training: 118K MS-COCO images, 14 diverse artistic styles, 50x style weight
- GPU-Optimized: Mixed-precision (AMP) training, distributed multi-GPU support (DDP)
- Convergent Training: Achieved stable loss convergence (final loss: 9.89) over 36 hours
# Clone repository
git clone https://github.com/Romeo-5/temporal-style-net.git
cd temporal-style-net
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt

- Python 3.8+
- PyTorch 2.0+ with CUDA support
- NVIDIA GPU with 8GB+ VRAM (16GB+ recommended for training)
- FFmpeg for video processing
# Process video with trained model
python scripts/inference.py \
--input data/videos/input.mp4 \
--style data/styles/starry_night.jpg \
--output data/outputs/result.mp4 \
--model-path checkpoints/final_model.pth
# With temporal consistency (smoother results)
python scripts/inference.py \
--input data/videos/input.mp4 \
--style data/styles/starry_night.jpg \
--output data/outputs/result_smooth.mp4 \
--model-path checkpoints/final_model.pth \
--temporal

python demo/app.py
# Open http://localhost:7860 in your browser

from src.inference.video_processor import VideoStyleTransfer
# Initialize processor
processor = VideoStyleTransfer(
method='adain',
device='cuda',
use_temporal_consistency=True
)
# Process video
processor.process_video(
input_path='data/videos/input.mp4',
style_path='data/styles/starry_night.jpg',
output_path='data/outputs/stylized.mp4',
alpha=1.0 # Style strength (0-1)
)

| Resolution | GPU | FPS | Processing Time | Total Frames |
|---|---|---|---|---|
| 1080p | RTX 4090 Super | 6.45 | 46.7s (12s video) | 301 frames |
| 1080p | RTX 3080 | ~4.5 | ~67s (12s video) | 301 frames |
| 720p | RTX 4090 Super | ~12 | ~25s (12s video) | 301 frames |
| Configuration | GPU | Time per Epoch | Total Training Time | Final Loss |
|---|---|---|---|---|
| 512px, batch=4 | RTX 4090 Super | ~2.4 hours | 36 hours (15 epochs) | 9.89 |
| 256px, batch=8 | RTX 4090 Super | 25 minutes | 8 hours (20 epochs) | ~15-20 |
| 512px, 4-GPU DDP | 4x RTX 3090 | ~40 minutes | ~10 hours (15 epochs) | ~10-12 |
Training Details:
- Dataset: 118,287 MS-COCO 2017 images
- Style Images: 14 diverse artistic paintings
- Iterations per Epoch: 29,572 (at batch size 4)
- Total Iterations: 443,580 over 36 hours
- Style Weight: 50.0 (strong stylization)
- Optimizer: Adam (lr=1e-4)
- Mixed Precision: Enabled (AMP)
| Metric | Initial (Epoch 1) | Final (Epoch 15) | Improvement |
|---|---|---|---|
| Total Loss | ~2230 | 9.89 | 99.6% ↓ |
| Content Loss | ~26 | 6.84 | 73.7% ↓ |
| Style Loss | ~44 | 0.061 | 99.9% ↓ |
| Weighted Style | ~2204 | 3.06 | 99.9% ↓ |
TemporalStyleNet implements the AdaIN (Adaptive Instance Normalization) style transfer architecture with custom temporal consistency:
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Content   │────▶│   Encoder   │────▶│    AdaIN    │
│    Frame    │     │   (VGG19)   │     │  Transform  │
└─────────────┘     └─────────────┘     └──────┬──────┘
                                               │
┌─────────────┐     ┌─────────────┐            │
│    Style    │────▶│   Encoder   │───────────▶│
│    Image    │     │   (VGG19)   │            │
└─────────────┘     └─────────────┘            │
                                               ▼
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  Previous   │────▶│ Optical Flow│────▶│  Temporal   │
│    Frame    │     │   (RAFT)    │     │   Warping   │
└─────────────┘     └─────────────┘     └──────┬──────┘
                                               │
                                               ▼
                                        ┌─────────────┐
                                        │   Decoder   │
                                        │  (Trained)  │
                                        └──────┬──────┘
                                               │
                                               ▼
                                        ┌─────────────┐
                                        │  Stylized   │
                                        │   Output    │
                                        └─────────────┘
1. Style Transfer Network (AdaIN)
- Encoder: Pre-trained VGG19 (frozen) for feature extraction
- Decoder: Custom 4-layer upsampling network (trained from scratch)
- AdaIN Layer: Transfers style statistics (mean/std) from style to content features
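In code, the AdaIN operation itself is only a few lines. A minimal sketch of the idea (function and argument names here are illustrative, not the repository's actual API):

```python
import torch

def adain(content_feat: torch.Tensor, style_feat: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Align the channel-wise mean/std of content features to those of the style features.

    content_feat, style_feat: (N, C, H, W) VGG19 feature maps.
    """
    c_mean = content_feat.mean(dim=(2, 3), keepdim=True)
    c_std = content_feat.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style_feat.mean(dim=(2, 3), keepdim=True)
    s_std = style_feat.std(dim=(2, 3), keepdim=True) + eps
    # Normalize away the content statistics, then re-scale with the style statistics
    return s_std * (content_feat - c_mean) / c_std + s_mean
```

The decoder is then trained to map these re-normalized features back to an image.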
2. Temporal Consistency Module
- Optical Flow: RAFT-based motion estimation between consecutive frames
- Feature Warping: Bilinear sampling guided by flow vectors
- Consistency Loss: L2 distance between warped previous features and current features
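The warping step reduces to building a sampling grid from the flow and calling `grid_sample`. A minimal sketch under the assumption that the flow is given in pixel units with (x, y) channel order; the repository's `temporal.py` may handle flow direction and padding differently:

```python
import torch
import torch.nn.functional as F

def warp_with_flow(prev: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp previous-frame features into the current frame using optical flow.

    prev: (N, C, H, W) float tensor, flow: (N, 2, H, W) displacements in pixels.
    """
    n, _, h, w = prev.shape
    # Base pixel grid (x = column index, y = row index)
    ys, xs = torch.meshgrid(torch.arange(h, device=prev.device),
                            torch.arange(w, device=prev.device), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0) + flow   # (N, 2, H, W)
    # Normalize coordinates to [-1, 1] as expected by grid_sample
    grid[:, 0] = 2.0 * grid[:, 0] / (w - 1) - 1.0
    grid[:, 1] = 2.0 * grid[:, 1] / (h - 1) - 1.0
    return F.grid_sample(prev, grid.permute(0, 2, 3, 1), mode="bilinear", align_corners=True)
```

The consistency loss is then the L2 distance between the warped previous features and the current features, as described above.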
3. Training Pipeline
- Dataset: MS-COCO 2017 (118,287 images) for content + 14 diverse artistic styles
- Distributed Training: PyTorch DDP with gradient synchronization across GPUs
- Mixed Precision: Automatic Mixed Precision (AMP) for 2x memory efficiency
- Optimization: Adam optimizer with content loss + weighted style loss (50x)
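The AMP update follows PyTorch's standard `GradScaler` pattern. A self-contained sketch of that pattern with a dummy model and batch (the project's `trainer.py` combines the content and weighted style losses described below rather than the stand-in MSE used here):

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, kernel_size=3, padding=1).cuda()    # stand-in for the decoder
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

content = torch.randn(4, 3, 512, 512, device="cuda")        # dummy 512px batch (batch_size=4)
target = torch.randn_like(content)

optimizer.zero_grad(set_to_none=True)
with torch.cuda.amp.autocast():                              # forward pass in mixed precision
    out = model(content)
    loss = nn.functional.mse_loss(out, target)               # stand-in for content + 50 * style loss
scaler.scale(loss).backward()                                # scale loss to avoid fp16 gradient underflow
scaler.step(optimizer)                                       # unscale gradients, then Adam step
scaler.update()
```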
The model is trained to minimize:
L_total = L_content + λ_style * L_style

# Content Loss (MSE in feature space)
L_content = ||φ(output) - φ(content)||²

# Style Loss (MSE between Gram matrices)
L_style = Σ_i ||G(φ_i(output)) - G(φ_i(style))||²

Where:
- `φ(x)` = VGG19 encoder features
- `G(x)` = Gram matrix (captures style statistics)
- `λ_style = 50.0` for strong stylization
Key Implementation Detail: Gram matrix normalization was critical for training success. The initial implementation used `gram / (C × H × W)`, which over-normalized the features by a factor of 512 and drove the style loss to zero. Correcting it to `gram / (H × W)` restored proper convergence.
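As a concrete sketch of that normalization (illustrative code, not necessarily the exact implementation in `src/models/losses.py`):

```python
import torch
import torch.nn.functional as F

def gram_matrix(features: torch.Tensor) -> torch.Tensor:
    """Gram matrix of a (N, C, H, W) feature map, normalized by H*W only."""
    n, c, h, w = features.shape
    f = features.reshape(n, c, h * w)
    gram = torch.bmm(f, f.transpose(1, 2))    # (N, C, C) channel correlations
    return gram / (h * w)                      # NOT / (c * h * w): dividing by C (=512 at relu4_1) collapses the loss

def style_loss(output_feats, style_feats):
    """Sum of MSE between Gram matrices over the chosen VGG19 layers."""
    return sum(F.mse_loss(gram_matrix(o), gram_matrix(s))
               for o, s in zip(output_feats, style_feats))
```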
Frame-to-frame coherence is achieved through:
# Optical flow estimation
flow = RAFT(frame_t, frame_{t-1})
# Warp previous features
features_warped = warp(features_{t-1}, flow)
# Temporal consistency loss
L_temporal = ||features_t - features_warped||²

# PyTorch Distributed Data Parallel
model = nn.parallel.DistributedDataParallel(
model,
device_ids=[local_rank],
find_unused_parameters=False
)
# Gradient synchronization: DDP all-reduces gradients across GPUs during backward()
loss.backward()
optimizer.step()

temporal-style-net/
├── src/
│   ├── models/
│   │   ├── style_transfer.py      # AdaIN encoder-decoder
│   │   ├── temporal.py            # Optical flow module
│   │   └── losses.py              # Perceptual losses
│   ├── training/
│   │   ├── trainer.py             # Training loop
│   │   └── dataset.py             # Data loading pipeline
│   └── inference/
│       └── video_processor.py     # Video processing pipeline
├── scripts/
│   ├── train.py                   # Training entry point
│   ├── inference.py               # Inference CLI
│   └── evaluate.py                # Quality metrics
├── configs/
│   ├── default_config.yaml        # Single GPU config
│   ├── multi_gpu_config.yaml      # Multi-GPU config
│   └── high_res_config.yaml       # 512px training
├── demo/
│   └── app.py                     # Gradio web interface
├── data/
│   ├── videos/                    # Input videos
│   ├── styles/                    # Style images
│   ├── outputs/                   # Processed results
│   └── train/                     # Training data
│       ├── content/               # MS-COCO images
│       └── styles/                # Training style images
├── checkpoints/                   # Saved model weights
├── docs/                          # Documentation and results
├── requirements.txt
└── README.md
# MS-COCO 2017 Training Set (~18GB)
cd data/train/content
wget http://images.cocodataset.org/zips/train2017.zip
unzip train2017.zip
# Add your style images to data/train/styles/
# (10-20 diverse artistic styles recommended)

Edit `configs/high_res_config.yaml`:
# Model settings
image_size: 512 # High resolution for quality
batch_size: 4 # Adjust for your GPU memory
epochs: 15 # 15-20 recommended for 512px
# Loss weights
content_weight: 1.0
style_weight: 50.0 # Strong stylization
# Training settings
use_amp: true # Mixed precision
num_workers: 4
save_interval: 2       # Save every 2 epochs

# Single GPU
python scripts/train.py --config configs/high_res_config.yaml
# Multi-GPU (DDP)
python scripts/train.py --config configs/multi_gpu_config.yaml --gpus 4
# Monitor with TensorBoard
tensorboard --logdir logs/tensorboard

# Test after training
python scripts/inference.py \
--input data/videos/test.mp4 \
--style data/styles/starry_night.jpg \
--output data/outputs/result.mp4 \
--model-path checkpoints_512/final_model.pth

| Parameter | Description | Recommended | Used in This Project |
|---|---|---|---|
| `image_size` | Training resolution | 256 (fast), 512 (quality) | 512 |
| `batch_size` | Images per GPU | 8 (256px), 4 (512px) | 4 |
| `learning_rate` | Adam learning rate | 1e-4, 5e-5 (512px) | 1e-4 |
| `style_weight` | Style loss multiplier | 50.0 (strong), 10.0 (subtle) | 50.0 |
| `epochs` | Number of training epochs | 15-20 (512px), 20-30 (256px) | 15 |
| `save_interval` | Checkpoint frequency (epochs) | 2-3 | 2 |

| Parameter | Description | Default |
|---|---|---|
| `--alpha` | Style strength (0-1) | 1.0 |
| `--temporal` | Enable temporal smoothing | False |
| `--max-frames` | Limit number of frames processed | None |
| `--lightweight` | Use lightweight model | False |

Low FPS / Slow Processing
- Reduce `batch_size` in config
- Use the `--lightweight` flag for a faster model
- Process at lower resolution

CUDA Out of Memory
- Reduce `batch_size` or `image_size`
- Enable gradient checkpointing
- Use mixed precision training (enabled by default)

Style Too Weak
- Increase `style_weight` in config (try 50-100)
- Train for more epochs (20-30)
- Check Gram matrix normalization (should be `gram / (H * W)`, not `gram / (C * H * W)`)

Style Loss Zero During Training
- Critical bug: over-normalized Gram matrix
- Fix: change `gram / (C * H * W)` to `gram / (H * W)` in `StyleLoss.gram_matrix()`
- This fix increased the style loss from 0.0003 to ~44, enabling proper training

Temporal Flickering
- Enable the `--temporal` flag during inference
- Reduce the frame rate of the output video
- Use optical flow-based smoothing

This implementation builds upon:

- Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization
  - Huang & Belongie, ICCV 2017
  - Paper | Original Code
- RAFT: Recurrent All-Pairs Field Transforms for Optical Flow
  - Teed & Deng, ECCV 2020
- ReCoNet: Real-time Coherent Video Style Transfer Network
  - Chen et al., ACCV 2018
  - Paper
- ControlNet Integration: Stable Diffusion-based style transfer with structural control
- 3D Consistency: Depth-aware styling for multi-view consistency
- Real-time Streaming: WebRTC support for live video stylization
- Style Interpolation: Smooth transitions between multiple styles
- Mobile Deployment: ONNX/TensorRT optimization for edge devices
- NeRF Integration: Neural radiance fields for novel view synthesis with style
Contributions welcome! Areas of interest:
- Performance optimizations
- New style transfer architectures
- Quality improvements
- Bug fixes and documentation
Please open an issue first to discuss proposed changes.
This project is licensed under the MIT License - see LICENSE for details.
- MS-COCO Dataset: Lin et al., ECCV 2014 - Training content images
- PyTorch Team: Framework and distributed training utilities
- NVIDIA RAFT: Optical flow implementation
- AdaIN Implementation: Inspired by naoto0804's PyTorch-AdaIN
Romeo Nickel
MS Computer Science (AI) - University of Southern California
Research Assistant - USC ISI Polymorphic Robotics Lab
- LinkedIn: linkedin.com/in/romeo-nickel
- Email: rjnickel@usc.edu
- GitHub: @Romeo-5
Built with PyTorch | Trained on RTX 4090 Super | 36 Hours of Training


