
IMTalker: Efficient Audio-driven Talking Face Generation with Implicit Motion Transfer

Paper | Hugging Face Model | Hugging Face Space Demo

📖 Overview

IMTalker accepts diverse portrait styles and reaches 40 FPS for video-driven and 42 FPS for audio-driven talking-face generation on an NVIDIA RTX 4090 GPU at 512 × 512 resolution. It also offers fine-grained controllability, accepting precise head-pose and eye-gaze inputs alongside the driving audio.

📢 News

  • [2025.12.16] 🚀 The training code is released!
  • [2025.11.27] 🚀 The inference code and pretrained weights are released!

πŸ› οΈ Installation

1. Environment Setup

conda create -n IMTalker python=3.10
conda activate IMTalker
pip install torch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 --index-url https://download.pytorch.org/whl/cu121
conda install -c conda-forge ffmpeg 

2. Install with pip:

git clone https://github.com/cbsjtu01/IMTalker.git
cd IMTalker
pip install -r requirement.txt

⚡ Quick Start

You can simply run the Gradio demo to get started. The script will automatically download the required pretrained models to the ./checkpoints directory if they are missing.

python app.py

📦 Model Zoo

Please download the pretrained models and place them in the ./checkpoints directory.

| Component | Checkpoint | Description | Download |
| --- | --- | --- | --- |
| Audio Encoder | wav2vec2-base-960h | Wav2Vec2 Base model | 🤗 Link |
| Generator | generator.ckpt | Flow Matching Generator | 🤗 Link |
| Renderer | renderer.ckpt | IMT Renderer | 🤗 Link |

📂 Directory Structure

Ensure your file structure looks like this after downloading:

./checkpoints
├── renderer.ckpt                     # The main renderer
├── generator.ckpt                    # The main generator
└── wav2vec2-base-960h/               # Audio encoder folder
    ├── config.json
    ├── pytorch_model.bin
    └── ...
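
If you want to sanity-check the layout before running inference, a small script can do it. A minimal sketch (the file list mirrors the tree above; `missing_checkpoints` is a hypothetical helper, not part of the repo):

```python
from pathlib import Path

# Files the inference scripts expect, per the tree above.
REQUIRED = [
    "renderer.ckpt",
    "generator.ckpt",
    "wav2vec2-base-960h/config.json",
    "wav2vec2-base-960h/pytorch_model.bin",
]

def missing_checkpoints(root="./checkpoints"):
    """Return the expected files that are absent under `root`."""
    base = Path(root)
    return [rel for rel in REQUIRED if not (base / rel).exists()]
```

After a complete download, `missing_checkpoints()` should return an empty list.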

🚀 Inference

1. Audio-driven Inference

Generate a talking face from a source image and an audio file.

python generator/generate.py \
    --ref_path "./assets/source_image.jpg" \
    --aud_path "./assets/input_audio.wav" \
    --res_dir "./results/" \
    --generator_path "./checkpoints/generator.ckpt" \
    --renderer_path "./checkpoints/renderer.ckpt" \
    --a_cfg_scale 2 \
    --crop

2. Video-driven Inference

Generate a talking face from a source image and a driving video.

python renderer/inference.py \
    --source_path "./assets/source_image.jpg" \
    --driving_path "./assets/driving_video.mp4" \
    --save_path "./results/" \
    --renderer_path "./checkpoints/renderer.ckpt" \
    --crop
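
To run the same driver against several source images, the command above can be wrapped in a small batch launcher. A sketch, assuming it is run from the repo root (`inference_cmd` and `run_batch` are illustrative helpers, not part of the repo):

```python
import subprocess

def inference_cmd(source, driving, save_dir="./results/",
                  renderer="./checkpoints/renderer.ckpt"):
    """Assemble the renderer/inference.py command line for one source image."""
    return [
        "python", "renderer/inference.py",
        "--source_path", str(source),
        "--driving_path", str(driving),
        "--save_path", str(save_dir),
        "--renderer_path", str(renderer),
        "--crop",
    ]

def run_batch(sources, driving, **kwargs):
    """Run video-driven inference for each source image in turn."""
    for src in sources:
        subprocess.run(inference_cmd(src, driving, **kwargs), check=True)
```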

🚀 Train

1. Train the renderer

Data Preparation

You can follow the dataset processing pipeline in talkingfaceprocess to crop the raw video data into 512×512 resolution videos where the face occupies the main region, and to extract landmarks for each video. Ensure your dataset directory is organized as follows.

/path/to/renderer_dataset
├── video_frame
│   ├── video_0001
│   │   ├── image_001.jpg
│   │   ├── image_002.jpg
│   │   └── ...
│   ├── video_0002
│   └── ...
└── lmd
    ├── video_0001.txt
    ├── video_0002.txt
    └── ...
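
Mismatches between video_frame/ and lmd/ (a video with frames but no landmark file, or the reverse) are easy to introduce during preprocessing. A quick check, assuming the layout above (`unpaired_ids` is a hypothetical helper, not part of the repo):

```python
from pathlib import Path

def unpaired_ids(root):
    """Return video ids present in only one of video_frame/ and lmd/."""
    root = Path(root)
    frame_ids = {p.name for p in (root / "video_frame").iterdir() if p.is_dir()}
    lmd_ids = {p.stem for p in (root / "lmd").glob("*.txt")}
    return frame_ids ^ lmd_ids  # symmetric difference: unmatched on either side
```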

Training Command

Then you can execute the following command to train our renderer. In our experiments, we used 4 × A100 (80 GB) GPUs; with a batch size of 4, the GPU memory usage did not exceed 50 GB, and each iteration took approximately 1 second. You can adjust the batch size and learning rate according to your hardware configuration.

python renderer/train.py \
    --dataset_path /path/to/renderer_dataset \
    --exp_name renderer_exp \
    --batch_size 4 \
    --iter 7000000 \
    --lr 1e-4

2. Train the generator

Data Preparation

In the second step, you need to train our motion generator to enable speech-driven animation. To accelerate training, we pre-extract and store all required features, including: motion latents obtained by feeding each video frame into the motion encoder in the renderer; final-layer features extracted from audio WAV files using Wav2Vec2; 6D pose parameters for each frame extracted with SMIRK; and gaze directions extracted using L2CS-Net. Ensure your dataset directory is organized as follows.

/path/to/generator_dataset
├── motion
│   ├── video_0001.pt
│   ├── video_0002.pt
│   └── ...
├── audio
│   ├── video_0001.npy
│   ├── video_0002.npy
│   └── ...
├── smirk
│   ├── video_0001.pt
│   ├── video_0002.pt
│   └── ...
└── gaze
    ├── video_0001.npy
    ├── video_0002.npy
    └── ...
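
The README does not specify how the pre-extracted audio features are aligned with per-frame motion latents. Wav2Vec2 emits features at roughly 50 Hz while video is commonly 25 fps, so the two streams rarely match in length; one common approach is linear resampling to one feature vector per video frame. A sketch under that assumption (`align_audio_to_frames` is illustrative, not part of the repo):

```python
import numpy as np

def align_audio_to_frames(audio_feats, n_frames):
    """Linearly resample per-timestep audio features (T_a, D) to one
    vector per video frame, giving an (n_frames, D) array."""
    t_a = audio_feats.shape[0]
    pos = np.linspace(0, t_a - 1, n_frames)   # fractional audio index per frame
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, t_a - 1)
    w = (pos - lo)[:, None]                   # interpolation weight per frame
    return (1 - w) * audio_feats[lo] + w * audio_feats[hi]
```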

Training Command

Then you can execute the following command to train the generator. In our experiments, we used 4 × A100 (80 GB) GPUs; with a batch size of 16, the GPU memory usage did not exceed 20 GB, achieving approximately 10 iterations per second, and the model converged within a few hours. You can adjust the batch size and learning rate according to your hardware configuration.

python generator/train.py \
    --dataset_path /path/to/generator_dataset \
    --exp_name generator_exp \
    --batch_size 16 \
    --iter 5000000 \
    --lr 1e-4

💡 Best Practices

To obtain the highest quality generation results, we recommend following these guidelines:

  1. Input Image Composition: Please ensure the input image features the person's head as the primary subject. Since our model is explicitly trained on facial data, it does not support full-body video generation.

    • The inference pipeline automatically crops the input image to focus on the face by default.
    • Note on Resolution: The model generates video at a fixed resolution of 512×512. Using extremely high-resolution inputs will result in downscaling, so prioritize facial clarity over raw image dimensions.
  2. Audio Selection: Our model was trained primarily on English datasets. Consequently, we recommend using English audio inputs to achieve the best lip-synchronization performance and naturalness.

  3. Background Quality: We strongly recommend using source images with solid colored or blurred (bokeh) backgrounds. Complex or highly detailed backgrounds may lead to visual artifacts or jitter in the generated video.
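
Since the output resolution is fixed at 512×512, a tight pre-crop around the head preserves more facial detail than feeding a large frame and letting it downscale. The pipeline's `--crop` flag does this with face alignment; the numpy sketch below is only a generic stand-in (center crop plus nearest-neighbor resize) to illustrate the idea:

```python
import numpy as np

def center_crop_resize(img, size=512):
    """Center-crop an (H, W, C) image to a square, then nearest-neighbor
    resize to size x size. A generic stand-in for the face-aware crop."""
    h, w = img.shape[:2]
    s = min(h, w)
    top, left = (h - s) // 2, (w - s) // 2
    square = img[top:top + s, left:left + s]
    idx = (np.arange(size) * s) // size   # nearest-neighbor sampling grid
    return square[idx][:, idx]
```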

πŸ“ To-Do List

  • Release inference code and pretrained models.
  • Launch Hugging Face online demo.
  • Release training code.

📜 Citation

If you find our work useful for your research, please consider citing:

@article{imtalker2025,
  title={IMTalker: Efficient Audio-driven Talking Face Generation with Implicit Motion Transfer},
  author={Chen, Bo and Liu, Tao and Chen, Qi and Chen, Xie and Zheng, Zilong},
  journal={arXiv preprint arXiv:2511.22167},
  year={2025}
}

πŸ™ Acknowledgement

We express our sincerest gratitude to the excellent previous works that inspired this project:

  • IMF: We adapted the framework and training pipeline from IMF and its open-source reproduction.
  • FLOAT: We referenced the model architecture and implementation of FLOAT for our generator.
  • Wav2Vec2: We use Wav2Vec2 as our audio encoder.
  • Face-Alignment: We use Face-Alignment for cropping images and videos.
