🎬 TemporalStyleNet


Real-time video style transfer with temporal consistency - Transform videos into artistic masterpieces while maintaining smooth frame-to-frame transitions using neural style transfer and optical flow.

Style Transfer Demo

Buildings video stylized with Van Gogh's Starry Night in real time


🎨 Results Gallery

Side-by-side comparison: original frames vs. stylized outputs

Trained on 118,287 MS-COCO images over 15 epochs (~36 hours) on RTX 4090 Super


🌟 Key Features

  • ⚡ Real-time Processing: 6.45 FPS on 1080p video (301 frames in 47 seconds)
  • 🎨 Adaptive Instance Normalization (AdaIN): Fast, flexible style transfer with pre-trained VGG19 encoder
  • 🔄 Temporal Consistency: RAFT optical flow-based smoothing eliminates flickering between frames
  • 🚀 High-Resolution Training: 512×512 training resolution for professional quality results
  • 📊 Production-Scale Training: 118K MS-COCO images, 14 diverse artistic styles, 50x style weight
  • 🏋️ GPU-Optimized: Mixed-precision (AMP) training, distributed multi-GPU support (DDP)
  • 🎯 Convergent Training: Achieved stable loss convergence (final loss: 9.89) over 36 hours

🚀 Quick Start

Installation

# Clone repository
git clone https://github.com/Romeo-5/temporal-style-net.git
cd temporal-style-net

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Requirements

  • Python 3.8+
  • PyTorch 2.0+ with CUDA support
  • NVIDIA GPU with 8GB+ VRAM (16GB+ recommended for training)
  • FFmpeg for video processing

Basic Usage

Command Line Inference

# Process video with trained model
python scripts/inference.py \
    --input data/videos/input.mp4 \
    --style data/styles/starry_night.jpg \
    --output data/outputs/result.mp4 \
    --model-path checkpoints/final_model.pth

# With temporal consistency (smoother results)
python scripts/inference.py \
    --input data/videos/input.mp4 \
    --style data/styles/starry_night.jpg \
    --output data/outputs/result_smooth.mp4 \
    --model-path checkpoints/final_model.pth \
    --temporal

Interactive Web Demo

python demo/app.py
# Open http://localhost:7860 in your browser
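
For reference, a minimal sketch of what a Gradio app like demo/app.py could look like, wired to the Python API shown in the next subsection; the actual demo's layout and argument names may differ:

# Illustrative Gradio demo (not the exact shipped app.py)
import gradio as gr
from src.inference.video_processor import VideoStyleTransfer

processor = VideoStyleTransfer(method='adain', device='cuda',
                               use_temporal_consistency=True)

def stylize(video_path, style_image_path, alpha):
    # Run the video processor and return the output file for playback
    output_path = 'data/outputs/gradio_result.mp4'
    processor.process_video(input_path=video_path,
                            style_path=style_image_path,
                            output_path=output_path,
                            alpha=alpha)
    return output_path

demo = gr.Interface(
    fn=stylize,
    inputs=[gr.Video(label='Content video'),
            gr.Image(type='filepath', label='Style image'),
            gr.Slider(0.0, 1.0, value=1.0, label='Style strength (alpha)')],
    outputs=gr.Video(label='Stylized video'),
    title='TemporalStyleNet',
)

if __name__ == '__main__':
    demo.launch(server_port=7860)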

Python API

from src.inference.video_processor import VideoStyleTransfer

# Initialize processor
processor = VideoStyleTransfer(
    method='adain',
    device='cuda',
    use_temporal_consistency=True
)

# Process video
processor.process_video(
    input_path='data/videos/input.mp4',
    style_path='data/styles/starry_night.jpg',
    output_path='data/outputs/stylized.mp4',
    alpha=1.0  # Style strength (0-1)
)

📊 Performance

Inference Speed (Measured on Trained Model)

Resolution   GPU              FPS    Processing Time      Total Frames
1080p        RTX 4090 Super   6.45   46.7s (12s video)    301
1080p        RTX 3080         ~4.5   ~67s (12s video)     301
720p         RTX 4090 Super   ~12    ~25s (12s video)     301

Training Performance (Actual Results)

Configuration      GPU              Time per Epoch   Total Training Time     Final Loss
512px, batch=4     RTX 4090 Super   ~2.4 hours       36 hours (15 epochs)    9.89
256px, batch=8     RTX 4090 Super   25 minutes       8 hours (20 epochs)     ~15-20
512px, 4-GPU DDP   4x RTX 3090      ~40 minutes      ~10 hours (15 epochs)   ~10-12

Training Details:

  • Dataset: 118,287 MS-COCO 2017 images
  • Style Images: 14 diverse artistic paintings
  • Iterations per Epoch: 29,572 (at batch size 4)
  • Total Iterations: 443,580 over 36 hours
  • Style Weight: 50.0 (strong stylization)
  • Optimizer: Adam (lr=1e-4)
  • Mixed Precision: Enabled (AMP)

Loss Convergence

Metric           Initial (Epoch 1)   Final (Epoch 15)   Improvement
Total Loss       ~2230               9.89               99.6% ↓
Content Loss     ~26                 6.84               73.7% ↓
Style Loss       ~44                 0.061              99.9% ↓
Weighted Style   ~2204               3.06               99.9% ↓

πŸ—οΈ Architecture

Overview

TemporalStyleNet implements the AdaIN (Adaptive Instance Normalization) style transfer architecture with custom temporal consistency:

┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│ Content     │────▶│   Encoder    │────▶│   AdaIN     │
│ Frame       │     │  (VGG19)     │     │  Transform  │
└─────────────┘     └──────────────┘     └──────┬──────┘
                                                │
┌─────────────┐     ┌──────────────┐           │
│ Style       │────▶│   Encoder    │──────────▶│
│ Image       │     │  (VGG19)     │           │
└─────────────┘     └──────────────┘           │
                                                ▼
┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│ Previous    │────▶│ Optical Flow │────▶│  Temporal   │
│ Frame       │     │    (RAFT)    │     │  Warping    │
└─────────────┘     └──────────────┘     └──────┬──────┘
                                                │
                                                ▼
                                         ┌─────────────┐
                                         │   Decoder   │
                                         │  (Trained)  │
                                         └──────┬──────┘
                                                │
                                                ▼
                                         ┌─────────────┐
                                         │ Stylized    │
                                         │ Output      │
                                         └─────────────┘

Key Components

1. Style Transfer Network (AdaIN)

  • Encoder: Pre-trained VGG19 (frozen) for feature extraction
  • Decoder: Custom 4-layer upsampling network (trained from scratch)
  • AdaIN Layer: Transfers style statistics (mean/std) from style to content features
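
The AdaIN operation itself reduces to a pair of feature statistics; a minimal sketch (the project's own layer lives in src/models/style_transfer.py and may differ in detail):

import torch

def adain(content_feat: torch.Tensor, style_feat: torch.Tensor,
          alpha: float = 1.0, eps: float = 1e-5) -> torch.Tensor:
    """Align channel-wise mean/std of content features to the style features.

    content_feat, style_feat: VGG19 feature maps of shape (N, C, H, W).
    alpha blends between the original content features (0) and full AdaIN (1).
    """
    c_mean = content_feat.mean(dim=(2, 3), keepdim=True)
    c_std = content_feat.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style_feat.mean(dim=(2, 3), keepdim=True)
    s_std = style_feat.std(dim=(2, 3), keepdim=True) + eps

    normalized = (content_feat - c_mean) / c_std            # strip content statistics
    stylized = normalized * s_std + s_mean                   # apply style statistics
    return alpha * stylized + (1.0 - alpha) * content_feat   # style-strength blend

In the standard AdaIN formulation, this blend is exactly what the --alpha style-strength parameter controls at inference time.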

2. Temporal Consistency Module

  • Optical Flow: RAFT-based motion estimation between consecutive frames
  • Feature Warping: Bilinear sampling guided by flow vectors
  • Consistency Loss: L2 distance between warped previous features and current features

3. Training Pipeline

  • Dataset: MS-COCO 2017 (118,287 images) for content + 14 diverse artistic styles
  • Distributed Training: PyTorch DDP with gradient synchronization across GPUs
  • Mixed Precision: Automatic Mixed Precision (AMP) for 2x memory efficiency
  • Optimization: Adam optimizer with content loss + weighted style loss (50x)
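
As a rough sketch of how AMP slots into one optimization step of the loop described above (model, loader, optimizer, and the two loss functions are assumed to exist; names are illustrative, not the exact trainer.py internals):

from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler()  # scales the loss to avoid fp16 gradient underflow

for content, style in loader:                  # batches of content/style images
    content, style = content.cuda(), style.cuda()
    optimizer.zero_grad(set_to_none=True)

    with autocast():                           # forward pass in mixed precision
        output = model(content, style)
        loss = content_loss(output, content) + 50.0 * style_loss(output, style)

    scaler.scale(loss).backward()              # backward on the scaled loss
    scaler.step(optimizer)                     # unscale gradients, then step
    scaler.update()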

🔬 Technical Details

Style Transfer Loss

The model is trained to minimize:

L_total = L_content + λ_style * L_style

# Content Loss (MSE in feature space)
L_content = ||φ(output) - φ(content)||²

# Style Loss (MSE between Gram matrices)
L_style = Σ ||G(φ_i(output)) - G(φ_i(style))||²

Where:

  • φ(x) = VGG19 encoder features
  • G(x) = Gram matrix (captures style statistics)
  • λ_style = 50.0 for strong stylization

Key Implementation Detail: Gram matrix normalization was critical for training success. The initial implementation used gram / (C × H × W), which over-normalized the features by a factor of 512 (the channel count) and drove the style loss to effectively zero. Correcting the normalization to gram / (H × W) restored a usable style signal and proper convergence.
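
A standalone sketch of the corrected normalization (the project's version lives in StyleLoss.gram_matrix(); this only illustrates the H × W scaling):

import torch

def gram_matrix(features: torch.Tensor) -> torch.Tensor:
    """Gram matrix of VGG features, normalized by spatial size only.

    features: (N, C, H, W). Returns (N, C, C) channel correlations.
    Dividing by H * W (not C * H * W) keeps the style targets large enough
    to provide a useful training signal.
    """
    n, c, h, w = features.shape
    f = features.view(n, c, h * w)            # flatten spatial dimensions
    gram = torch.bmm(f, f.transpose(1, 2))    # (N, C, C)
    return gram / (h * w)                     # normalize by spatial size only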

Temporal Consistency

Frame-to-frame coherence achieved through:

# Optical flow estimation
flow = RAFT(frame_t, frame_{t-1})

# Warp previous features
features_warped = warp(features_{t-1}, flow)

# Temporal consistency loss
L_temporal = ||features_t - features_warped||²
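
The warp() step above is typically implemented with bilinear grid sampling; a sketch under that assumption, taking flow as pixel displacements of shape (N, 2, H, W):

import torch
import torch.nn.functional as F

def warp(features: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp previous-frame features toward the current frame using optical flow."""
    n, _, h, w = features.shape
    # Base sampling grid of pixel coordinates
    ys, xs = torch.meshgrid(torch.arange(h, device=features.device),
                            torch.arange(w, device=features.device),
                            indexing='ij')
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0) + flow   # add flow offsets
    # Normalize coordinates to [-1, 1] as expected by grid_sample
    gx = 2.0 * grid[:, 0] / max(w - 1, 1) - 1.0
    gy = 2.0 * grid[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                              # (N, H, W, 2)
    return F.grid_sample(features, grid, mode='bilinear',
                         padding_mode='border', align_corners=True)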

Multi-GPU Training

# PyTorch Distributed Data Parallel
model = nn.parallel.DistributedDataParallel(
    model,
    device_ids=[local_rank],
    find_unused_parameters=False
)

# Gradient synchronization: all-reduce runs automatically during backward()
loss.backward()
optimizer.step()
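
The snippet above assumes the process group has already been initialized; a minimal sketch of that setup for a torchrun-style launch where each process receives a LOCAL_RANK environment variable (the --gpus flag in scripts/train.py may orchestrate this differently):

import os
import torch
import torch.distributed as dist

# One process per GPU; the launcher provides LOCAL_RANK for each process.
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)
dist.init_process_group(backend='nccl')   # NCCL backend for multi-GPU training

# Build the model on this GPU, wrap it with DistributedDataParallel as shown
# above, and give the DataLoader a DistributedSampler so each rank sees a
# unique shard of the 118K content images.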

πŸ“ Project Structure

temporal-style-net/
├── src/
│   ├── models/
│   │   ├── style_transfer.py       # AdaIN encoder-decoder
│   │   ├── temporal.py             # Optical flow module
│   │   └── losses.py               # Perceptual losses
│   ├── training/
│   │   ├── trainer.py              # Training loop
│   │   └── dataset.py              # Data loading pipeline
│   └── inference/
│       └── video_processor.py      # Video processing pipeline
├── scripts/
│   ├── train.py                    # Training entry point
│   ├── inference.py                # Inference CLI
│   └── evaluate.py                 # Quality metrics
├── configs/
│   ├── default_config.yaml         # Single GPU config
│   ├── multi_gpu_config.yaml       # Multi-GPU config
│   └── high_res_config.yaml        # 512px training
├── demo/
│   └── app.py                      # Gradio web interface
├── data/
│   ├── videos/                     # Input videos
│   ├── styles/                     # Style images
│   ├── outputs/                    # Processed results
│   └── train/                      # Training data
│       ├── content/                # MS-COCO images
│       └── styles/                 # Training style images
├── checkpoints/                    # Saved model weights
├── docs/                           # Documentation and results
├── requirements.txt
└── README.md

🎓 Training Your Own Model

1. Download Training Data

# MS-COCO 2017 Training Set (~13GB)
cd data/train/content
wget http://images.cocodataset.org/zips/train2017.zip
unzip train2017.zip

# Add your style images to data/train/styles/
# (10-20 diverse artistic styles recommended)

2. Configure Training

Edit configs/high_res_config.yaml:

# Model settings
image_size: 512        # High resolution for quality
batch_size: 4          # Adjust for your GPU memory
epochs: 15             # 15-20 recommended for 512px

# Loss weights
content_weight: 1.0
style_weight: 50.0     # Strong stylization

# Training settings
use_amp: true          # Mixed precision
num_workers: 4
save_interval: 2       # Save every 2 epochs

3. Start Training

# Single GPU
python scripts/train.py --config configs/high_res_config.yaml

# Multi-GPU (DDP)
python scripts/train.py --config configs/multi_gpu_config.yaml --gpus 4

# Monitor with TensorBoard
tensorboard --logdir logs/tensorboard

4. Test Checkpoints

# Test after training
python scripts/inference.py \
    --input data/videos/test.mp4 \
    --style data/styles/starry_night.jpg \
    --output data/outputs/result.mp4 \
    --model-path checkpoints_512/final_model.pth

🔧 Configuration

Training Parameters

Parameter       Description                 Recommended                    Used in This Project
image_size      Training resolution         256 (fast), 512 (quality)      512
batch_size      Images per GPU              8 (256px), 4 (512px)           4
learning_rate   Adam learning rate          1e-4, 5e-5 (512px)             1e-4
style_weight    Style loss multiplier       50.0 (strong), 10.0 (subtle)   50.0
epochs          Number of training epochs   15-20 (512px), 20-30 (256px)   15
save_interval   Checkpoint frequency        2-3                            2

Inference Parameters

Parameter       Description                 Default
--alpha         Style strength (0-1)        1.0
--temporal      Enable temporal smoothing   False
--max-frames    Limit frames processed      None
--lightweight   Use lightweight model       False

πŸ› Troubleshooting

Common Issues

Low FPS / Slow Processing

  • Reduce batch_size in config
  • Use --lightweight flag for faster model
  • Process at lower resolution

CUDA Out of Memory

  • Reduce batch_size or image_size
  • Enable gradient checkpointing
  • Use mixed precision training (enabled by default)

Style Too Weak

  • Increase style_weight in config (try 50-100)
  • Train for more epochs (20-30)
  • Check Gram matrix normalization (should be gram / (H * W), not gram / (C * H * W))

Style Loss Zero During Training

  • Critical Bug: Over-normalized Gram matrix
  • Fix: Change from gram / (C * H * W) to gram / (H * W) in StyleLoss.gram_matrix()
  • This fix increased style loss from 0.0003 to ~44, enabling proper training

Temporal Flickering

  • Enable --temporal flag during inference
  • Reduce frame rate of output video
  • Use optical flow-based smoothing

🔬 Research References

This implementation builds upon:

  1. Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization

    • Huang and Belongie, ICCV 2017

  2. RAFT: Recurrent All-Pairs Field Transforms for Optical Flow

    • Teed and Deng, ECCV 2020

  3. ReCoNet: Real-time Coherent Video Style Transfer Network

    • Chen et al., ACCV 2018

🚧 Future Enhancements

  • ControlNet Integration: Stable Diffusion-based style transfer with structural control
  • 3D Consistency: Depth-aware styling for multi-view consistency
  • Real-time Streaming: WebRTC support for live video stylization
  • Style Interpolation: Smooth transitions between multiple styles
  • Mobile Deployment: ONNX/TensorRT optimization for edge devices
  • NeRF Integration: Neural radiance fields for novel view synthesis with style

🤝 Contributing

Contributions welcome! Areas of interest:

  • Performance optimizations
  • New style transfer architectures
  • Quality improvements
  • Bug fixes and documentation

Please open an issue first to discuss proposed changes.

πŸ“ License

This project is licensed under the MIT License - see LICENSE for details.

🙏 Acknowledgments

  • MS-COCO Dataset: Lin et al., ECCV 2014 - Training content images
  • PyTorch Team: Framework and distributed training utilities
  • NVIDIA RAFT: Optical flow implementation
  • AdaIN Implementation: Inspired by naoto0804's PyTorch-AdaIN

📧 Contact

Romeo Nickel
MS Computer Science (AI) - University of Southern California
Research Assistant - USC ISI Polymorphic Robotics Lab


⭐ Star this repo if you find it useful! ⭐

Built with PyTorch 🔥 | Trained on RTX 4090 Super ⚡ | 36 Hours of Training ⏱️
