Unit-based Audiovisual Translation for Korean
Text-free Direct Speech Translation with Synchronized Lip Movement
NetfLips is a project that takes an English video as input and generates a Korean-translated video whose speech and lip movements are synchronized.
- 🎯 Unit-based Translation: speech and visual information are modeled directly as shared units, with no intermediate text representation
- 🔄 Speech & Visual Sync: audio and video are aligned as units in a common feature space, enabling robust translation
- 🇰🇷 Korean Fine-tuning: fine-tuned to add Korean, which the base model does not support
- 🎬 Natural Synthesis: natural speech synthesis and lip-sync generation
#Unit-based Audiovisual Translation #Text-free Direct Speech Translation #Lip Sync #Speech Translation
🔗 Demo Link
NetfLips consists of a three-stage pipeline:

**1. AV2Unit (Audio-Visual → Unit)**
- Decode FLAC audio to wav
- Feature extraction (mel spectrogram)
- Unit classification with K-means
- Conversion to an integer unit sequence (sketched below)
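A minimal sketch of this audio-side quantization, assuming librosa for mel features and scikit-learn for K-means (the repository's actual extractor lives under av2unit/, and the cluster count here is hypothetical):

```python
import librosa
import numpy as np
from sklearn.cluster import KMeans

def extract_units(wav_path: str, n_units: int = 200) -> list[int]:
    """Quantize a wav file into a sequence of discrete unit IDs."""
    wav, sr = librosa.load(wav_path, sr=16000)    # decode to 16 kHz mono
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=80, hop_length=160)
    frames = np.log(mel + 1e-6).T                 # (T, 80) log-mel frames
    # In the real pipeline the K-means codebook is trained once on a large
    # corpus and reused; fitting per file here is only for illustration.
    km = KMeans(n_clusters=n_units, n_init=10).fit(frames)
    return km.predict(frames).tolist()            # the integer unit sequence
```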
**2. Unit2Unit (Translation)**
- Base Model: AV2AV (Choi, J., et al., 2024)
- Translation: English units → Korean units
- Framework: unit-sequence training on the Fairseq toolkit
- Backbone: the large-scale pretrained model mBART
**3. Unit2AV (Unit → Audio-Visual)**
- Unit → audio conversion
- Uses Korean units and speaker embeddings
- Speech resynthesis (see the sketch after this list)
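To make the unit-to-audio step concrete, here is a toy PyTorch renderer with invented hyper-parameters; the project's actual model is defined in unit2av/model.py and also renders the lip-synced video stream:

```python
import torch
import torch.nn as nn

class UnitToAudio(nn.Module):
    """Toy renderer: unit IDs + speaker embedding -> waveform samples."""
    def __init__(self, n_units=200, dim=128, spk_dim=64, hop=160):
        super().__init__()
        self.unit_emb = nn.Embedding(n_units, dim)   # discrete unit -> vector
        self.spk_proj = nn.Linear(spk_dim, dim)      # condition on the speaker
        self.upsample = nn.ConvTranspose1d(          # 1 unit -> `hop` samples
            dim, 1, kernel_size=hop * 2, stride=hop, padding=hop // 2)

    def forward(self, units, spk):
        x = self.unit_emb(units) + self.spk_proj(spk).unsqueeze(1)  # (B, T, dim)
        return self.upsample(x.transpose(1, 2)).squeeze(1)          # (B, T * hop)

model = UnitToAudio()
units = torch.randint(0, 200, (1, 50))     # 50 Korean units
wav = model(units, torch.randn(1, 64))     # -> (1, 8000) raw samples
```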
This project was trained on the following datasets:

| Dataset | Description | Size |
|---|---|---|
| Zeroth Korean ASR | Korean speech recognition data | 12,245 sentences |
| AIHub Ko-X Interpretation & Translation Speech | Korean-English (US) parallel speech data | 169,488 sentences |
```bash
# 1. Clone the repository
git clone https://github.com/Prometheus-AI-3team/NetfLips.git
cd NetfLips

# 2. Initialize the fairseq submodule
git submodule init
git submodule update

# 3. Create the base Conda environment
conda env create -f environment.yml
conda activate unit2a

# 4. Downgrade pip (avoids a metadata error)
pip install "pip<24.1"

# 5. Install PyTorch (CUDA 11.7 builds)
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117

# 6. Install the remaining libraries
pip install -r requirements.txt

# 7. Install Fairseq
cd av2av-main/fairseq
pip install -e .
```

| Model | Name | Link |
|---|---|---|
| AV2Unit | mav_hubert_large_noise.pt | download |
| Unit2Unit | utut_sts_ft.pt | download |
| Unit2AV | unit_av_renderer_withKO.pt | download |
```bash
PYTHONPATH=fairseq python inference.py \
  --in-vid-path /path/to/input.mp4 \
  --out-vid-path /path/to/output.mp4 \
  --src-lang en --tgt-lang ko \
  --av2unit-path /path/to/mavhubert_large_noise.pt \
  --utut-path /path/to/utut_sts_ft.pt \
  --unit2av-path /path/to/unit_av_renderer_withKO.pt
```

For training and inference of each module (av2unit, unit2unit, unit2av), see the README.md of the corresponding module.
```
NetfLips/
├── av2unit/                  # Audio-Visual to Unit Extraction
│   ├── avhubert/             # Feature extraction using AV-HuBERT
│   └── inference.py          # Unit extraction inference script
├── unit2unit/                # Unit to Unit Translation
│   ├── utut_pretrain/        # Pre-training modules
│   ├── utut_finetune/        # Fine-tuning modules
│   └── inference.py          # Translation inference script
├── unit2av/                  # Unit to Audio-Visual Generation
│   ├── model.py              # Unit2AV model definition
│   ├── train_unit2a.py       # Training script for Unit2Audio
│   └── inference_unit2av.py  # Inference scripts
├── fairseq/                  # Fairseq Toolkit (Submodule)
├── scripts/                  # Utility Scripts for Data Preparation
├── inference_av2av.py        # Main End-to-End Inference Script
├── environment.yml           # Conda Environment Configuration
└── requirements.txt          # Python Dependencies
```
**AV2Unit**
- Restores FLAC files and converts them to wav
- Mel-spectrogram-based feature extraction
- Unit classification via K-means clustering
**Unit2Unit**
- mBART-based sequence-to-sequence training
- Built on the Fairseq toolkit (data preparation is sketched after this list)
- Optimized for unit-to-unit translation
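The Fairseq toolkit consumes unit sequences as text-like token files. The sketch below shows one way to serialize parallel data; the file names and the collapsing of repeated units are assumptions that follow common practice for speech units, not the project's exact scripts:

```python
from itertools import groupby

def to_line(units: list[int]) -> str:
    """Render a unit sequence as one space-separated line, collapsing runs."""
    return " ".join(str(u) for u, _ in groupby(units))   # 5 5 9 9 3 -> "5 9 3"

def write_parallel(pairs, prefix="train"):
    """Write source/target unit files (train.en / train.ko) for Fairseq."""
    with open(f"{prefix}.en", "w") as f_en, open(f"{prefix}.ko", "w") as f_ko:
        for en_units, ko_units in pairs:
            f_en.write(to_line(en_units) + "\n")   # English units (source)
            f_ko.write(to_line(ko_units) + "\n")   # Korean units (target)

write_parallel([([5, 5, 9, 9, 3], [7, 7, 2])])     # -> "5 9 3" / "7 2"
```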
**Unit2AV**
- Speech resynthesis from Korean units
- Natural speech generation using speaker embeddings
- Lip-synced video generation
**Base Model**
- AV2AV: Audio-Visual to Audio-Visual translation model
- Reference: Choi, J., et al., 2024
**Korean Fine-tuning**
- Fine-tuning to add Korean, which the base model does not support
- Uses parallel Korean-English speech data
- Unit-level translation training (language tagging is sketched below)
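As a purely illustrative sketch of the mBART-style setup (the token names are invented, not the project's actual vocabulary), source and target unit sequences share one vocabulary while language ID tokens mark the translation direction:

```python
LANG = {"en": "<en>", "ko": "<ko>"}  # hypothetical language ID tokens

def make_example(src_units, tgt_units, src_lang="en", tgt_lang="ko"):
    """Tag unit sequences mBART-style: the source ends with its language ID,
    and the decoder is primed with the target language ID."""
    src = [str(u) for u in src_units] + [LANG[src_lang]]
    tgt = [LANG[tgt_lang]] + [str(u) for u in tgt_units]
    return src, tgt

src, tgt = make_example([5, 9, 3], [7, 2])
# src: ['5', '9', '3', '<en>']    tgt: ['<ko>', '7', '2']
```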
| Name | Batch |
|---|---|
| μ₯μ§μ | 6th |
| μ μ§ν | 6th |
| μ κ·μ² | 8th |
| μ΄κ°μ° | 8th |
```bibtex
@misc{netflips2024,
  title={NetfLips: Unit-based Audiovisual Translation for Korean},
  author={μ₯μ§μ, μ μ§ν, μ κ·μ² , μ΄κ°μ°},
  year={2024}
}
```

- Choi, J., et al. (2024). AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation. CVPR 2024.
This project is distributed under the MIT License. See the LICENSE file for details.
This repository is built upon AV2AV and Fairseq. We are grateful to these open-source projects.