😭 SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation
Yu Guo1 Ying Shan 2 Fei Wang 1
CVPR 2023
TL;DR: A realistic and stylized talking head video generation method from a single image and audio.
- [2023.03.22]: New feature: generating the 3D face animation from a single image. New applications built on it will be added.
- [2023.03.22]: New feature: still mode, where only a small head pose is produced, via `python inference.py --still`.
- [2023.03.18]: Support for expression intensity: you can now change the intensity of the generated motion with `python inference.py --expression_scale 1.3` (some value > 1).
- [2023.03.18]: Reconfigured the data folders; you can now download the checkpoints automatically using `bash scripts/download_models.sh`.
- [2023.03.18]: We have officially integrated GFPGAN for face enhancement; use `python inference.py --enhancer gfpgan` for better visual quality.
- [2023.03.14]: Pinned the version of the `joblib` package to fix errors when using `librosa`.
Previous Changelogs
- 2023.03.06 Fixed some bugs in the code and errors in installation
- 2023.03.03 Release the test code for audio-driven single image animation!
- 2023.02.28 SadTalker has been accepted by CVPR 2023!
- Generating 2D face from a single Image.
- Generating 3D face from Audio.
- Generating 4D free-view talking examples from audio and a single image.
- Gradio/Colab Demo.
- Full body/image Generation.
- Training code for each component.
- Audio-driven Anime Avatar.
- Integrate ChatGPT for a conversation demo 🤔
- Integrate with stable-diffusion-web-ui. (stay tuned!)
git clone https://github.com/Winfredy/SadTalker.git
cd SadTalker
conda create -n sadtalker python=3.8
source activate sadtalker
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113
conda install ffmpeg
pip install dlib-bin # dlib-bin installs much faster than dlib; alternatively: conda install dlib
pip install -r requirements.txt
### install GFPGAN for the enhancer
pip install git+https://github.com/TencentARC/GFPGAN
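After the installation steps above, a quick check along the following lines can confirm that the main packages are importable. This is only a sketch: the module names in `ASSUMED_MODULES` are assumed to be the import names of the packages installed above, and `missing_packages` is a hypothetical helper that is not part of SadTalker.

```python
import importlib.util

# Import names assumed to correspond to the packages installed above
# (an assumption for illustration, not taken from the repository).
ASSUMED_MODULES = ["torch", "torchvision", "torchaudio", "dlib", "gfpgan"]


def missing_packages(names):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(ASSUMED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All expected packages are importable.")
```

Run it inside the activated `sadtalker` environment; an empty result suggests the environment is ready, although it does not verify package versions or CUDA support.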
You can run the following script to put all the models in the right place.
bash scripts/download_models.sh
Alternatively, download our pre-trained models from Google Drive or our GitHub release page, and then put them in ./checkpoints.
Model | Description |
---|---|
checkpoints/auido2exp_00300-model.pth | Pre-trained ExpNet in SadTalker. |
checkpoints/auido2pose_00140-model.pth | Pre-trained PoseVAE in SadTalker. |
checkpoints/mapping_00229-model.pth.tar | Pre-trained MappingNet in SadTalker. |
checkpoints/facevid2vid_00189-model.pth.tar | Pre-trained face-vid2vid model from the reproduction of face-vid2vid. |
checkpoints/epoch_20.pth | Pre-trained 3DMM extractor in Deep3DFaceReconstruction. |
checkpoints/wav2lip.pth | Highly accurate lip-sync model in Wav2lip. |
checkpoints/shape_predictor_68_face_landmarks.dat | Face landmark model used in dlib. |
checkpoints/BFM | 3DMM library file. |
checkpoints/hub | Face detection models used in face alignment. |
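If you download the models manually, a small check like the sketch below can confirm that every entry from the table is present before running inference. The names are copied verbatim from the table (including the `auido2...` spelling); `missing_checkpoints` is a hypothetical helper, not part of the repository, and it only checks that each path exists, not that the files are valid.

```python
from pathlib import Path

# File and folder names copied from the table above.
EXPECTED_CHECKPOINTS = [
    "auido2exp_00300-model.pth",
    "auido2pose_00140-model.pth",
    "mapping_00229-model.pth.tar",
    "facevid2vid_00189-model.pth.tar",
    "epoch_20.pth",
    "wav2lip.pth",
    "shape_predictor_68_face_landmarks.dat",
    "BFM",
    "hub",
]


def missing_checkpoints(checkpoint_dir="./checkpoints"):
    """Return the expected entries that do not exist under checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in EXPECTED_CHECKPOINTS if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing from ./checkpoints:")
        for name in missing:
            print("  " + name)
    else:
        print("All expected checkpoints are present.")
```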
python inference.py --driven_audio <audio.wav> \
--source_image <video.mp4 or picture.png> \
--batch_size <default is 2; a larger value runs faster> \
--expression_scale <default is 1.0; a larger value makes the motion stronger> \
--result_dir <a folder to store results> \
--enhancer <default is None; you can choose gfpgan or RestoreFormer>
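For scripted or batch runs, the command above can be assembled programmatically. The following is a minimal sketch that uses only the flags documented in this README; `build_inference_command` is a hypothetical helper, and the defaults it passes simply mirror the values stated above.

```python
def build_inference_command(driven_audio, source_image, result_dir,
                            batch_size=2, expression_scale=1.0,
                            enhancer=None, still=False):
    """Build the argument list for `python inference.py` from the documented flags."""
    cmd = [
        "python", "inference.py",
        "--driven_audio", str(driven_audio),
        "--source_image", str(source_image),
        "--result_dir", str(result_dir),
        "--batch_size", str(batch_size),
        "--expression_scale", str(expression_scale),
    ]
    if enhancer is not None:
        # The README lists gfpgan and RestoreFormer as choices.
        cmd += ["--enhancer", enhancer]
    if still:
        # Still mode: only a small head pose is produced.
        cmd.append("--still")
    return cmd
```

Passing the returned list to `subprocess.run(cmd, check=True)` from the repository root would launch the run; building a list rather than a single string avoids shell quoting issues with paths that contain spaces.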
basic | w/ still mode | w/ exp_scale 1.3 | w/ gfpgan |
---|---|---|---|
art_0.japanese.mp4 | art_0.japanese_still.mp4 | art_0.japanese_scale1.3.mp4 | art_0.japanese_es1.mp4 |
Please make sure to turn on the audio, since videos embedded on GitHub are muted by default.
Input | Animated 3d face |
---|---|
3dface.mp4 |
More details on generating the 3D face can be found here.
We use `camera_yaw`, `camera_pitch`, and `camera_roll` to control the camera pose. For example, `--camera_yaw -20 30 10` means the camera yaw changes from -20 degrees to 30 degrees and then from 30 degrees to 10 degrees.
python inference.py --driven_audio <audio.wav> \
--source_image <video.mp4 or picture.png> \
--result_dir <a folder to store results> \
--camera_yaw -20 30 10
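To illustrate how a waypoint list such as `-20 30 10` could translate into per-frame angles, here is a minimal sketch. It assumes linear interpolation with the frames spread evenly across the whole path; this is an illustration only and may differ from how SadTalker actually schedules the camera pose. `expand_waypoints` is a hypothetical helper, not a function from the repository.

```python
def expand_waypoints(waypoints, num_frames):
    """Linearly interpolate a list of angles (in degrees) over num_frames frames."""
    if not waypoints:
        raise ValueError("at least one waypoint is required")
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    if len(waypoints) == 1 or num_frames == 1:
        # Nothing to interpolate: hold the first angle for every frame.
        return [float(waypoints[0])] * num_frames

    segments = len(waypoints) - 1
    angles = []
    for frame in range(num_frames):
        # Position along the path: 0 at the first waypoint, `segments` at the last.
        position = frame * segments / (num_frames - 1)
        index = min(int(position), segments - 1)
        fraction = position - index
        start, end = waypoints[index], waypoints[index + 1]
        angles.append(start + (end - start) * fraction)
    return angles


if __name__ == "__main__":
    # Yaw goes from -20 to 30, then from 30 to 10, over five frames.
    print(expand_waypoints([-20, 30, 10], 5))
```

Under these assumptions, the five-frame example yields the angles -20, 5, 30, 20, and 10 degrees: the yaw rises to the middle waypoint and then falls back toward the last one.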
If you find our work useful in your research, please consider citing:
@article{zhang2022sadtalker,
title={SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation},
author={Zhang, Wenxuan and Cun, Xiaodong and Wang, Xuan and Zhang, Yong and Shen, Xi and Guo, Yu and Shan, Ying and Wang, Fei},
journal={arXiv preprint arXiv:2211.12194},
year={2022}
}
The facerender code borrows heavily from zhanglonghao's reproduction of face-vid2vid and from PIRender. We thank the authors for sharing their wonderful code. In the training process, we also use the models from Deep3DFaceReconstruction and Wav2lip, and we thank the authors for their wonderful work.
- StyleHEAT: One-Shot High-Resolution Editable Talking Face Generation via Pre-trained StyleGAN (ECCV 2022)
- CodeTalker: Speech-Driven 3D Facial Animation with Discrete Motion Prior (CVPR 2023)
- VideoReTalking: Audio-based Lip Synchronization for Talking Head Video Editing In the Wild (SIGGRAPH Asia 2022)
- DPE: Disentanglement of Pose and Expression for General Video Portrait Editing (CVPR 2023)
- 3D GAN Inversion with Facial Symmetry Prior (CVPR 2023)
- T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete Representations (CVPR 2023)
This is not an official product of Tencent. This repository can only be used for personal/research/non-commercial purposes.