IMTalker accepts diverse portrait styles and achieves 40 FPS for video-driven and 42 FPS for audio-driven talking-face generation on an NVIDIA RTX 4090 GPU at 512 × 512 resolution. It also offers fine-grained controllability, accepting precise head-pose and eye-gaze inputs alongside audio.
- [2025.12.16] The training code is released!
- [2025.11.27] The inference code and pretrained weights are released!
```shell
conda create -n IMTalker python=3.10
conda activate IMTalker
pip install torch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 --index-url https://download.pytorch.org/whl/cu121
conda install -c conda-forge ffmpeg
```

2. Install with pip:
```shell
git clone https://github.com/cbsjtu01/IMTalker.git
cd IMTalker
pip install -r requirement.txt
```

You can simply run the Gradio demo to get started. The script will automatically download the required pretrained models to the `./checkpoints` directory if they are missing.

```shell
python app.py
```

Please download the pretrained models and place them in the `./checkpoints` directory.
| Component | Checkpoint | Description | Download |
|---|---|---|---|
| Audio Encoder | wav2vec2-base-960h | Wav2Vec2 Base model | 🤗 Link |
| Generator | generator.ckpt | Flow Matching Generator | 🤗 Link |
| Renderer | renderer.ckpt | IMT Renderer | 🤗 Link |
Ensure your file structure looks like this after downloading:
```
./checkpoints
├── renderer.ckpt          # The main renderer
├── generator.ckpt         # The main generator
└── wav2vec2-base-960h/    # Audio encoder folder
    ├── config.json
    ├── pytorch_model.bin
    └── ...
```
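Before running inference, it can save time to verify that the checkpoint layout above is complete. A minimal sketch (the file names follow the layout shown in this README; adjust the root path if yours differs):

```python
# Sketch: verify the ./checkpoints layout described above.
# Expected file names are taken from this README.
from pathlib import Path

def check_checkpoints(root="./checkpoints"):
    """Return a list of missing files from the expected layout."""
    root = Path(root)
    expected = [
        root / "renderer.ckpt",
        root / "generator.ckpt",
        root / "wav2vec2-base-960h" / "config.json",
        root / "wav2vec2-base-960h" / "pytorch_model.bin",
    ]
    return [str(p) for p in expected if not p.exists()]

missing = check_checkpoints()
if missing:
    print("Missing:", *missing, sep="\n  ")
```

Run this from the repository root; an empty result means all expected checkpoint files are in place.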
Generate a talking face from a source image and an audio file.
```shell
python generator/generate.py \
  --ref_path "./assets/source_image.jpg" \
  --aud_path "./assets/input_audio.wav" \
  --res_dir "./results/" \
  --generator_path "./checkpoints/generator.ckpt" \
  --renderer_path "./checkpoints/renderer.ckpt" \
  --a_cfg_scale 2 \
  --crop
```

Generate a talking face from a source image and a driving video file.
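The `--a_cfg_scale 2` flag in the audio-driven command above sets the strength of classifier-free guidance on the audio condition. A minimal sketch of the standard CFG combination (this is the common formulation used by flow-matching generators; the function below is illustrative, not IMTalker's exact code):

```python
# Sketch of classifier-free guidance (CFG), the standard formulation
# that --a_cfg_scale likely controls. Illustrative, not IMTalker's code.
def apply_cfg(uncond, cond, scale):
    """Blend unconditional and audio-conditional predictions.

    scale = 1.0 reproduces the conditional prediction; larger values
    push the output further in the audio-conditioned direction.
    """
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]

# With scale = 2 (the README default):
#   out = uncond + 2 * (cond - uncond) = 2 * cond - uncond
```

Higher scales typically strengthen lip-sync at the cost of naturalness, so small values around the default are a reasonable starting point.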
```shell
python renderer/inference.py \
  --source_path "./assets/source_image.jpg" \
  --driving_path "./assets/driving_video.mp4" \
  --save_path "./results/" \
  --renderer_path "./checkpoints/renderer.ckpt" \
  --crop
```

You can follow the dataset processing pipeline in talkingfaceprocess to crop the raw video data into 512×512 resolution videos where the face occupies the main region, and to extract landmarks for each video. Ensure your dataset directory is organized as follows.
```
/path/to/renderer_dataset
├── video_frame
│   ├── video_0001
│   │   ├── image_001.jpg
│   │   ├── image_002.jpg
│   │   └── ...
│   └── video_0002
│       └── ...
└── lmd
    ├── video_0001.txt
    ├── video_0002.txt
    └── ...
```
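A quick way to catch layout mistakes before training is to check that every frame folder under `video_frame/` has a matching landmark file under `lmd/`. A hedged sketch following the layout above (the helper name is ours, not part of the repository):

```python
# Sketch: pair video_frame/<clip_id>/ folders with lmd/<clip_id>.txt
# files, per the renderer dataset layout shown above.
from pathlib import Path

def find_unpaired(root):
    """Return (clips missing landmarks, landmarks missing frames)."""
    root = Path(root)
    clips = {p.name for p in (root / "video_frame").iterdir() if p.is_dir()}
    landmarks = {p.stem for p in (root / "lmd").glob("*.txt")}
    return sorted(clips - landmarks), sorted(landmarks - clips)
```

Both returned lists should be empty for a well-formed dataset.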
Then you can execute the following command to train our renderer. In our experiments, we used 4 × A100 (80 GB) GPUs; with a batch size of 4, the GPU memory usage did not exceed 50 GB, and each iteration took approximately 1 second. You can adjust the batch size and learning rate according to your hardware configuration.
```shell
python renderer/train.py \
  --dataset_path /path/to/renderer_dataset \
  --exp_name renderer_exp \
  --batch_size 4 \
  --iter 7000000 \
  --lr 1e-4
```
In the second step, you need to train our motion generator to enable speech-driven animation. To accelerate training, we pre-extract and store all required features, including: motion latents obtained by feeding each video frame into the motion encoder in the renderer; final-layer features extracted from audio WAV files using Wav2Vec2; 6D pose parameters for each frame extracted with SMIRK; and gaze directions extracted using L2CS-Net. Ensure your dataset directory is organized as follows.
```
/path/to/generator_dataset
├── motion
│   ├── video_0001.pt
│   ├── video_0002.pt
│   └── ...
├── audio
│   ├── video_0001.npy
│   ├── video_0002.npy
│   └── ...
├── smirk
│   ├── video_0001.pt
│   ├── video_0002.pt
│   └── ...
└── gaze
    ├── video_0001.npy
    ├── video_0002.npy
    └── ...
```
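Since the generator reads four pre-extracted feature folders, it is worth confirming they all contain the same clip IDs with the extensions shown above. A hedged sketch (folder names and extensions follow this README; the helper itself is ours):

```python
# Sketch: check that motion/, audio/, smirk/ and gaze/ share the same
# clip IDs, using the extensions from the layout above (.pt / .npy).
from pathlib import Path

FOLDERS = {"motion": ".pt", "audio": ".npy", "smirk": ".pt", "gaze": ".npy"}

def clip_ids(root):
    """Return (IDs present in all four folders, per-folder extras)."""
    root = Path(root)
    sets = {name: {p.stem for p in (root / name).glob(f"*{ext}")}
            for name, ext in FOLDERS.items()}
    common = set.intersection(*sets.values())
    extra = {name: sorted(ids - common)
             for name, ids in sets.items() if ids - common}
    return sorted(common), extra
```

An empty `extra` dict means every clip has all four feature files; otherwise it names the folders with unmatched clips.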
Then you can execute the following command to train the generator. In our experiments, we used 4 × A100 (80 GB) GPUs; with a batch size of 16, the GPU memory usage did not exceed 20 GB, achieving approximately 10 iterations per second, and the model converged within a few hours. You can adjust the batch size and learning rate according to your hardware configuration.
```shell
python generator/train.py \
  --dataset_path /path/to/generator_dataset \
  --exp_name generator_exp \
  --batch_size 16 \
  --iter 5000000 \
  --lr 1e-4
```
To obtain the highest quality generation results, we recommend following these guidelines:

- Input Image Composition: Please ensure the input image features the person's head as the primary subject. Since our model is explicitly trained on facial data, it does not support full-body video generation.
  - The inference pipeline automatically crops the input image to focus on the face by default.
  - Note on Resolution: The model generates video at a fixed resolution of 512×512. Using extremely high-resolution inputs will result in downscaling, so prioritize facial clarity over raw image dimensions.
- Audio Selection: Our model was trained primarily on English datasets. Consequently, we recommend using English audio inputs to achieve the best lip-synchronization performance and naturalness.
- Background Quality: We strongly recommend using source images with solid-colored or blurred (bokeh) backgrounds. Complex or highly detailed backgrounds may lead to visual artifacts or jitter in the generated video.
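The automatic face crop mentioned above can be thought of as expanding a detected face box into a square window before resizing to 512×512. A minimal sketch of that geometry (the margin factor and clamping behaviour here are assumptions, not IMTalker's exact logic):

```python
# Sketch of square face cropping: expand a face box into a square
# window, clamp it to the image, then resize the crop to 512x512.
# The margin value is an illustrative assumption.
def square_crop_box(x0, y0, x1, y1, img_w, img_h, margin=1.6):
    """Return (left, top, right, bottom) of a square crop window
    centred on the face box (x0, y0, x1, y1)."""
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    side = max(x1 - x0, y1 - y0) * margin
    side = min(side, img_w, img_h)           # cannot exceed the image
    left = min(max(cx - side / 2, 0), img_w - side)
    top = min(max(cy - side / 2, 0), img_h - side)
    return left, top, left + side, top + side
```

This also explains the resolution tip: whatever the input size, only the cropped face window survives, so facial sharpness matters more than total pixel count.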
- Release inference code and pretrained models.
- Launch Hugging Face online demo.
- Release training code.
If you find our work useful for your research, please consider citing:
```bibtex
@article{imtalker2025,
  title={IMTalker: Efficient Audio-driven Talking Face Generation with Implicit Motion Transfer},
  author={Bo, Chen and Tao, Liu and Qi, Chen and Xie, Chen and Zilong Zheng},
  journal={arXiv preprint arXiv:2511.22167},
  year={2025}
}
```

We express our sincerest gratitude to the excellent previous works that inspired this project:
- IMF: We adapted the framework and training pipeline from IMF and its reproduction code IMF.
- FLOAT: We referenced the model architecture and implementation of Float for our generator.
- Wav2Vec2: We utilized Wav2Vec2 as our audio encoder.
- Face-Alignment: We used FaceAlignment for cropping images and videos.
