A GAN-based diffusion architecture for enhancing low-bitrate, low-resolution video frames into high-fidelity, temporally stable video sequences.
alen-vfe combines Latent Diffusion Models, adversarial training, and temporal frame interpolation to deliver state-of-the-art video enhancement.
- Fast Inference: few-step DDIM sampling
- Memory Efficient: LoRA fine-tuning
- High Quality: combined loss (MSE + LPIPS + Adversarial) ensures sharp, realistic results
- Temporal Stability: RIFE integration eliminates flickering
```
┌──────────────────────────────────────────────────────────────┐
│                     Input: Low-Res Video                     │
└──────────────────────────────┬───────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│              Generator (Stable Diffusion v1.5)               │
│                      + LoRA Fine-tuning                      │
│                   (1-step DDIM inference)                    │
└──────────────────────────────┬───────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│                   Discriminator (PatchGAN)                   │
│               Evaluates realism of enhancements              │
└──────────────────────────────┬───────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│                    Smoothing Layer (RIFE)                    │
│                 Temporal Frame Interpolation                 │
└──────────────────────────────┬───────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│                    Output: High-Res Video                    │
└──────────────────────────────────────────────────────────────┘
```
Generator: Lightweight Latent Diffusion Model (Stable Diffusion v1.5)
- Fine-tuned with LoRA (Low-Rank Adaptation)
- Optimized inference using DDIM

Discriminator: Pre-trained PatchGAN
- Evaluates high-frequency detail realism
- Provides adversarial feedback during training

Smoothing Layer: RIFE (Real-Time Intermediate Flow Estimation)
- Optical flow-based frame interpolation
- Ensures temporal consistency
- Eliminates flickering between frames
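At inference time the components above chain per frame, with RIFE inserting intermediate frames for temporal smoothness. A minimal sketch with stub modules (`StubGenerator`, `rife_interpolate`, and `enhance_frames` are illustrative names, not the repository's API; the PatchGAN discriminator only supplies training-time feedback, so it does not appear here):

```python
import torch
import torch.nn as nn

class StubGenerator(nn.Module):
    """Stand-in for the real generator (SD v1.5 UNet + LoRA, 1-step DDIM).
    Here a plain bicubic 4x upscale takes its place."""
    def forward(self, lr_frame):
        return nn.functional.interpolate(lr_frame, scale_factor=4, mode="bicubic")

def rife_interpolate(f0, f1):
    """Stand-in for RIFE: the real model predicts optical flow between
    frames; a naive midpoint blend is used here for illustration."""
    return 0.5 * (f0 + f1)

def enhance_frames(frames, gen):
    """Enhance each frame, then insert interpolated frames between
    consecutive outputs (the 2x fps-multiplier case)."""
    hires = [gen(f) for f in frames]          # per-frame enhancement
    smoothed = [hires[0]]
    for prev, cur in zip(hires, hires[1:]):
        smoothed.append(rife_interpolate(prev, cur))  # temporal smoothing
        smoothed.append(cur)
    return smoothed
```

With 3 input frames this yields 5 output frames: the 3 enhanced frames plus 2 interpolated in-betweens.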
- Python 3.8+
- CUDA 11.7+ (for NVIDIA GPUs) or Mac M4 with MPS support
- FFmpeg (for video processing)
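Since both CUDA and Apple-silicon MPS backends are supported, a small helper (illustrative, not part of the codebase) can select whichever accelerator is available:

```python
import torch

def pick_device():
    """Prefer CUDA (NVIDIA), then MPS (Apple silicon), else fall back to CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")
```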
```bash
# Clone the repository
git clone https://github.com/yourusername/alen-vfe.git
cd alen-vfe

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Download RIFE pretrained model
python scripts/download_rife.py
```

```python
from inference.enhancer import VideoEnhancer
from omegaconf import OmegaConf

# Load configuration
config = OmegaConf.load("config/inference_config.yaml")

# Initialize enhancer
enhancer = VideoEnhancer(config)

# Enhance video
enhancer.enhance_video(
    input_path="input_video.mp4",
    output_path="enhanced_video.mp4",
    scale_factor=4,
)
```

```bash
python inference/enhance.py \
    --input input_video.mp4 \
    --output enhanced_video.mp4 \
    --checkpoint checkpoints/best_model.pth \
    --scale 4 \
    --enable-rife
```

We use the Vimeo-90K dataset for training:
```bash
# Download dataset
python data/download.py --dataset vimeo90k --output ./data

# Prepare training data
python data/prepare_dataset.py \
    --dataset vimeo90k \
    --downscale-factor 4 \
    --output ./data/processed
```

- Upload the project to Kaggle
- Open `notebooks/train_kaggle.ipynb`
- Ensure GPU accelerator is enabled (T4 recommended)
- Run all cells
```bash
python training/train.py \
    --config config/training_config.yaml \
    --output-dir ./checkpoints
```

DIV2K:
- Size: ~7GB (perfect for a quick start!)
- Images: 800 training + 100 validation
- Resolution: up to 2K high-quality images
- Download: Official Link
- Why: much smaller than Vimeo-90K, faster downloads, great for testing

Vimeo-90K:
- Size: ~82GB
- Sequences: 89,800 triplets (3 frames each)
- Resolution: 448×256
- Download: Official Link
- Why: video-specific data, more data for production models
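Dataset preparation amounts to bicubically downscaling each high-resolution frame by the chosen factor to form LR/HR training pairs. A sketch of that step (the function name and output layout are illustrative; the real logic lives in `data/prepare_dataset.py`):

```python
from pathlib import Path
from PIL import Image

def make_lr_hr_pair(hr_path, out_dir, factor=4):
    """Save a bicubically downscaled copy of a high-res frame alongside the
    original, mirroring what the --downscale-factor flag does (sketch)."""
    hr = Image.open(hr_path).convert("RGB")
    w, h = hr.size
    lr = hr.resize((w // factor, h // factor), Image.BICUBIC)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    stem = Path(hr_path).stem
    lr.save(out / f"{stem}_lr.png")
    hr.save(out / f"{stem}_hr.png")
    return lr.size
```

For a 448×256 Vimeo-90K frame and factor 4, this yields a 112×64 low-res input.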
Edit `config/training_config.yaml`:

```yaml
model:
  generator:
    lora_rank: 8        # Higher = more capacity, more VRAM
    inference_steps: 4  # 1-4 steps for fast inference

training:
  batch_size: 8
  num_epochs: 100
  learning_rate:
    generator: 1.0e-5
    discriminator: 4.0e-4

loss:
  weights:
    mse: 1.0
    lpips: 0.5
    adversarial: 0.1
```

Edit `config/inference_config.yaml`:
```yaml
enhancement:
  scale_factor: 4
  enable_rife: true
  target_fps_multiplier: 2

video:
  batch_size: 10
  output_codec: "libx264"
  output_crf: 18
```

```
alen-vfe/
├── config/                 # Configuration files
│   ├── training_config.yaml
│   └── inference_config.yaml
├── data/                   # Dataset utilities
│   ├── __init__.py
│   ├── dataset.py
│   ├── download.py
│   └── prepare_dataset.py
├── models/                 # Model architectures
│   ├── __init__.py
│   ├── generator.py        # Stable Diffusion + LoRA
│   ├── discriminator.py    # PatchGAN
│   └── rife.py             # RIFE integration
├── training/               # Training infrastructure
│   ├── __init__.py
│   ├── losses.py
│   ├── trainer.py
│   └── utils.py
├── inference/              # Inference pipeline
│   ├── __init__.py
│   ├── enhancer.py
│   ├── enhance.py          # CLI script
│   └── video_utils.py
├── notebooks/              # Jupyter notebooks for experiments and training
├── runs/                   # TensorBoard event logs for training visualization
├── outputs/                # Enhanced video outputs and sample results
├── checkpoints/            # Model checkpoints (ignored by git)
├── dataset/                # Training datasets (ignored by git)
├── requirements.txt
└── README.md
```
- `notebooks/`: Jupyter notebooks for exploratory data analysis, experimental training runs, and Kaggle-specific setup.
- `runs/`: TensorBoard event files. Visualize training progress by running `tensorboard --logdir runs/`.
- `outputs/`: All enhanced videos, preview images, and test results.
- `checkpoints/`: Model weights saved during training.
- `dataset/`: Local storage for training data such as DIV2K or Vimeo-90K.
```bash
# Run unit tests
pytest tests/

# Test inference pipeline
python tests/test_pipeline.py --checkpoint checkpoints/best_model.pth

# Benchmark performance
python tests/benchmark.py --device cuda
```

The model uses a combined loss function:
L_total = λ₁·L_MSE + λ₂·L_LPIPS + λ₃·L_ADV
- L_MSE: Pixel-wise Mean Squared Error (structural accuracy)
- L_LPIPS: Learned Perceptual Image Patch Similarity (perceptual quality)
- L_ADV: Adversarial Loss (realism)
Default weights: λ₁ = 1.0, λ₂ = 0.5, λ₃ = 0.1
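In code, the weighted sum can be sketched as follows (a simplified stand-in for `training/losses.py`; the function name and the optional `lpips_fn` hook are illustrative — plug in e.g. `lpips.LPIPS(net="vgg")` for the perceptual term):

```python
import torch
import torch.nn.functional as F

def combined_loss(fake, real, d_fake_logits,
                  w_mse=1.0, w_lpips=0.5, w_adv=0.1, lpips_fn=None):
    """Weighted sum of pixel, perceptual, and adversarial generator losses."""
    l_mse = F.mse_loss(fake, real)                                   # structural accuracy
    l_lpips = lpips_fn(fake, real).mean() if lpips_fn else fake.new_zeros(())  # perceptual
    # Generator objective: push the discriminator's logits on fakes toward "real" (1)
    l_adv = F.binary_cross_entropy_with_logits(
        d_fake_logits, torch.ones_like(d_fake_logits))
    return w_mse * l_mse + w_lpips * l_lpips + w_adv * l_adv
```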
Important
This project is currently in an experimental phase.
- Fine-tuning: We are experimenting with using a text-to-image model for video fine-tuning, which is a suboptimal approach and may lead to unexpected results.
- Resources: Due to limited training resources (GPU time/memory), the current model outputs may not yet reach production-grade quality.
- Outputs: The latest experimental outputs are shared in the `outputs/` folder for review.
Latest enhancement preview. See the `outputs/` folder for full video results.
- Stable Diffusion by Stability AI
- RIFE by Megvii Research
- pix2pix for PatchGAN architecture
- LPIPS for perceptual loss
MIT License - see LICENSE file for details
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
For questions or issues, please open a GitHub issue.