IMTalker accepts diverse portrait styles and achieves 40 FPS for video-driven and 42 FPS for audio-driven talking-face generation on an NVIDIA RTX 4090 GPU at 512 × 512 resolution. It also offers fine-grained controllability, accepting precise head-pose and eye-gaze inputs alongside audio.
- [2025.12.16] The training code is released!
- [2025.11.27] The inference code and pretrained weights are released!
```shell
conda create -n IMTalker python=3.10
conda activate IMTalker
pip install torch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 --index-url https://download.pytorch.org/whl/cu121
conda install -c conda-forge ffmpeg
```

2. Install with pip:
```shell
git clone https://github.com/cbsjtu01/IMTalker.git
cd IMTalker
pip install -r requirement.txt
```

You can simply run the Gradio demo to get started. The script will automatically download the required pretrained models to the `./checkpoints` directory if they are missing.

```shell
python app.py
```

Please download the pretrained models and place them in the `./checkpoints` directory.
| Component | Checkpoint | Description | Download |
|---|---|---|---|
| Audio Encoder | wav2vec2-base-960h | Wav2Vec2 Base model | 🤗 Link |
| Generator | generator.ckpt | Flow Matching Generator | 🤗 Link |
| Renderer | renderer.ckpt | IMT Renderer | 🤗 Link |
Ensure your file structure looks like this after downloading:
```
./checkpoints
├── renderer.ckpt          # The main renderer
├── generator.ckpt         # The main generator
└── wav2vec2-base-960h/    # Audio encoder folder
    ├── config.json
    ├── pytorch_model.bin
    └── ...
```
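Before running inference, it can save time to verify that the checkpoint layout above is complete. A minimal sketch (the file names follow the layout shown in this README; adjust the root path if yours differs):

```python
# Sketch: verify the ./checkpoints layout described above.
# Expected file names are taken from this README.
from pathlib import Path

def check_checkpoints(root="./checkpoints"):
    """Return a list of missing files from the expected layout."""
    root = Path(root)
    expected = [
        root / "renderer.ckpt",
        root / "generator.ckpt",
        root / "wav2vec2-base-960h" / "config.json",
        root / "wav2vec2-base-960h" / "pytorch_model.bin",
    ]
    return [str(p) for p in expected if not p.exists()]

missing = check_checkpoints()
if missing:
    print("Missing:", *missing, sep="\n  ")
```

Run this from the repository root; an empty result means all expected checkpoint files are in place.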
Generate a talking face from a source image and an audio file.
```shell
python generator/generate.py \
  --ref_path "./assets/source_image.jpg" \
  --aud_path "./assets/input_audio.wav" \
  --res_dir "./results/" \
  --generator_path "./checkpoints/generator.ckpt" \
  --renderer_path "./checkpoints/renderer.ckpt" \
  --a_cfg_scale 2 \
  --crop
```

Generate a talking face from a source image and a driving video file.
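The `--a_cfg_scale 2` flag in the audio-driven command above sets the strength of classifier-free guidance on the audio condition. A minimal sketch of the standard CFG combination (this is the common formulation used by flow-matching generators; the function below is illustrative, not IMTalker's exact code):

```python
# Sketch of classifier-free guidance (CFG), the standard formulation
# that --a_cfg_scale likely controls. Illustrative, not IMTalker's code.
def apply_cfg(uncond, cond, scale):
    """Blend unconditional and audio-conditional predictions.

    scale = 1.0 reproduces the conditional prediction; larger values
    push the output further in the audio-conditioned direction.
    """
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]

# With scale = 2 (the README default):
#   out = uncond + 2 * (cond - uncond) = 2 * cond - uncond
```

Higher scales typically strengthen lip-sync at the cost of naturalness, so small values around the default are a reasonable starting point.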
```shell
python renderer/inference.py \
  --source_path "./assets/source_image.jpg" \
  --driving_path "./assets/driving_video.mp4" \
  --save_path "./results/" \
  --renderer_path "./checkpoints/renderer.ckpt" \
  --crop
```

You can follow the dataset processing pipeline in talkingfaceprocess to crop the raw video data into 512×512 resolution videos where the face occupies the main region, and to extract landmarks for each video. Ensure your dataset directory is organized as follows.
```
/path/to/renderer_dataset
├── video_frame
│   ├── video_0001
│   │   ├── image_001.jpg
│   │   ├── image_002.jpg
│   │   └── ...
│   └── video_0002
│       └── ...
└── lmd
    ├── video_0001.txt
    ├── video_0002.txt
    └── ...
```
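A quick way to catch layout mistakes before training is to check that every frame folder under `video_frame/` has a matching landmark file under `lmd/`. A hedged sketch following the layout above (the helper name is ours, not part of the repository):

```python
# Sketch: pair video_frame/<clip_id>/ folders with lmd/<clip_id>.txt
# files, per the renderer dataset layout shown above.
from pathlib import Path

def find_unpaired(root):
    """Return (clips missing landmarks, landmarks missing frames)."""
    root = Path(root)
    clips = {p.name for p in (root / "video_frame").iterdir() if p.is_dir()}
    landmarks = {p.stem for p in (root / "lmd").glob("*.txt")}
    return sorted(clips - landmarks), sorted(landmarks - clips)
```

Both returned lists should be empty for a well-formed dataset.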
Then you can execute the following command to train our renderer. In our experiments, we used 4 × A100 (80 GB) GPUs; with a batch size of 4, the GPU memory usage did not exceed 50 GB, and each iteration took approximately 1 second. You can adjust the batch size and learning rate according to your hardware configuration.
```shell
python renderer/train.py \
  --dataset_path /path/to/renderer_dataset \
  --exp_name renderer_exp \
  --batch_size 4 \
  --iter 7000000 \
  --lr 1e-4
```
In the second step, you need to train our motion generator to enable speech-driven animation. To accelerate training, we pre-extract and store all required features, including: motion latents obtained by feeding each video frame into the motion encoder in the renderer; final-layer features extracted from audio WAV files using Wav2Vec2; 6D pose parameters for each frame extracted with SMIRK; and gaze directions extracted using L2CS-Net. Ensure your dataset directory is organized as follows.
```
/path/to/generator_dataset
├── motion
│   ├── video_0001.pt
│   ├── video_0002.pt
│   └── ...
├── audio
│   ├── video_0001.npy
│   ├── video_0002.npy
│   └── ...
├── smirk
│   ├── video_0001.pt
│   ├── video_0002.pt
│   └── ...
└── gaze
    ├── video_0001.npy
    ├── video_0002.npy
    └── ...
```
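Since the generator reads four pre-extracted feature folders, it is worth confirming they all contain the same clip IDs with the extensions shown above. A hedged sketch (folder names and extensions follow this README; the helper itself is ours):

```python
# Sketch: check that motion/, audio/, smirk/ and gaze/ share the same
# clip IDs, using the extensions from the layout above (.pt / .npy).
from pathlib import Path

FOLDERS = {"motion": ".pt", "audio": ".npy", "smirk": ".pt", "gaze": ".npy"}

def clip_ids(root):
    """Return (IDs present in all four folders, per-folder extras)."""
    root = Path(root)
    sets = {name: {p.stem for p in (root / name).glob(f"*{ext}")}
            for name, ext in FOLDERS.items()}
    common = set.intersection(*sets.values())
    extra = {name: sorted(ids - common)
             for name, ids in sets.items() if ids - common}
    return sorted(common), extra
```

An empty `extra` dict means every clip has all four feature files; otherwise it names the folders with unmatched clips.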
Then you can execute the following command to train the generator. In our experiments, we used 4 × A100 (80 GB) GPUs; with a batch size of 16, the GPU memory usage did not exceed 20 GB, achieving approximately 10 iterations per second, and the model converged within a few hours. You can adjust the batch size and learning rate according to your hardware configuration.
```shell
python generator/train.py \
  --dataset_path /path/to/generator_dataset \
  --exp_name generator_exp \
  --batch_size 16 \
  --iter 5000000 \
  --lr 1e-4
```
To obtain the highest quality generation results, we recommend following these guidelines:

- Input Image Composition: Please ensure the input image features the person's head as the primary subject. Since our model is explicitly trained on facial data, it does not support full-body video generation.
  - The inference pipeline automatically crops the input image to focus on the face by default.
  - Note on Resolution: The model generates video at a fixed resolution of 512×512. Using extremely high-resolution inputs will result in downscaling, so prioritize facial clarity over raw image dimensions.
- Audio Selection: Our model was trained primarily on English datasets. Consequently, we recommend using English audio inputs to achieve the best lip-synchronization performance and naturalness.
- Background Quality: We strongly recommend using source images with solid-colored or blurred (bokeh) backgrounds. Complex or highly detailed backgrounds may lead to visual artifacts or jitter in the generated video.
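The automatic face crop mentioned above can be thought of as expanding a detected face box into a square window before resizing to 512×512. A minimal sketch of that geometry (the margin factor and clamping behaviour here are assumptions, not IMTalker's exact logic):

```python
# Sketch of square face cropping: expand a face box into a square
# window, clamp it to the image, then resize the crop to 512x512.
# The margin value is an illustrative assumption.
def square_crop_box(x0, y0, x1, y1, img_w, img_h, margin=1.6):
    """Return (left, top, right, bottom) of a square crop window
    centred on the face box (x0, y0, x1, y1)."""
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    side = max(x1 - x0, y1 - y0) * margin
    side = min(side, img_w, img_h)           # cannot exceed the image
    left = min(max(cx - side / 2, 0), img_w - side)
    top = min(max(cy - side / 2, 0), img_h - side)
    return left, top, left + side, top + side
```

This also explains the resolution tip: whatever the input size, only the cropped face window survives, so facial sharpness matters more than total pixel count.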
- Release inference code and pretrained models.
- Launch Hugging Face online demo.
- Release training code.
If you find our work useful for your research, please consider citing:
```bibtex
@article{imtalker2025,
  title={IMTalker: Efficient Audio-driven Talking Face Generation with Implicit Motion Transfer},
  author={Bo, Chen and Tao, Liu and Qi, Chen and Xie, Chen and Zilong Zheng},
  journal={arXiv preprint arXiv:2511.22167},
  year={2025}
}
```

We express our sincerest gratitude to the excellent previous works that inspired this project:
- IMF: We adapted the framework and training pipeline from IMF and its reproduction code IMF.
- FLOAT: We referenced the model architecture and implementation of Float for our generator.
- Wav2Vec2: We utilized Wav2Vec2 as our audio encoder.
- Face-Alignment: We used FaceAlignment for cropping images and videos.
