
🎬 NetfLips

Unit-based Audiovisual Translation for Korean
Text-free Direct Speech Translation with Synchronized Lip Movement

📋 Overview

NetfLipsλŠ” μ˜μ–΄ μ˜μƒμ„ μž…λ ₯λ°›μ•„ μŒμ„±κ³Ό μž… λͺ¨μ–‘이 λ™κΈ°ν™”λœ ν•œκ΅­μ–΄ λ²ˆμ—­ μ˜μƒμ„ μƒμ„±ν•˜λŠ” ν”„λ‘œμ νŠΈμž…λ‹ˆλ‹€.

✨ Key Features

  • 🎯 Unit-based Translation: models speech and visual information directly as a shared unit representation, with no intermediate text
  • 🔊 Speech & Visual Sync: aligns audio and video as units in a shared feature space for robust translation
  • 🇰🇷 Korean Fine-tuning: fine-tuned to add Korean, which the base model did not support
  • 💬 Natural Synthesis: natural speech synthesis and lip-sync generation

🎯 Keywords

#Unit-based Audiovisual Translation #Text-free Direct Speech Translation #Lip Sync #Speech Translation


🎥 Demo

🌐 Demo Link

πŸ—οΈ Architecture

NetfLips is built as a three-stage pipeline:

1️⃣ Unit Extraction

  • Restore FLAC audio and convert it to wav
  • Extract features (mel spectrogram)
  • Cluster frames with K-means
  • Convert to integer unit sequences (see the sketch below)
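As a rough illustration of this stage, the sketch below extracts mel-spectrogram frames and quantizes them with a pretrained K-means model. The file names (`sample.flac`, `km.bin`) and parameters are placeholders; the repo's `av2unit/inference.py` is the actual entry point.

```python
# Minimal sketch of unit extraction (hypothetical paths and sizes).
import joblib
import torchaudio

wav, sr = torchaudio.load("sample.flac")              # FLAC -> waveform
wav = torchaudio.functional.resample(wav, sr, 16000)  # resample to 16 kHz

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_mels=80
)(wav)                                                # (1, 80, T) features

kmeans = joblib.load("km.bin")                        # pretrained K-means codebook
frames = mel.squeeze(0).transpose(0, 1).numpy()       # (T, 80) frame vectors
units = kmeans.predict(frames)                        # integer unit sequence
print(units[:20])
```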

2️⃣ Unit Translation

  • Base Model: AV2AV (Choi, J., et al., 2024)
  • Translation: English units → Korean units
  • Framework: unit-sequence training on the Fairseq toolkit
  • Backbone: the large-scale pretrained mBART model (loading sketched below)
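The checkpoint can be loaded with Fairseq's generic ensemble loader, as sketched below. The full generation loop lives in `unit2unit/inference.py`, so treat this as orientation only.

```python
# Loading the unit-to-unit checkpoint via Fairseq's generic loader.
from fairseq import checkpoint_utils

models, cfg, task = checkpoint_utils.load_model_ensemble_and_task(
    ["utut_sts_ft.pt"]  # path to the Unit2Unit checkpoint
)
model = models[0].eval()  # mBART-style encoder-decoder over unit tokens
```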

3️⃣ AV Generation

  • Unit → audio conversion
  • Uses Korean units and a speaker embedding (toy sketch below)
  • Speech resynthesis
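The shape of this stage is sketched below with a toy renderer: discrete units are embedded, conditioned on a speaker embedding, and decoded to waveform samples. All layer sizes here are invented; the real model is `unit2av/model.py`.

```python
# Toy unit-to-waveform renderer (illustrative only; sizes are made up).
import torch
import torch.nn as nn

class TinyUnitVocoder(nn.Module):
    def __init__(self, n_units=1000, d=256, spk_dim=192, hop=320):
        super().__init__()
        self.unit_emb = nn.Embedding(n_units, d)  # discrete units -> vectors
        self.spk_proj = nn.Linear(spk_dim, d)     # speaker embedding -> d
        self.rnn = nn.GRU(d, d, batch_first=True)
        self.head = nn.Linear(d, hop)             # hop samples per unit frame

    def forward(self, units, spk):
        x = self.unit_emb(units) + self.spk_proj(spk).unsqueeze(1)
        h, _ = self.rnn(x)
        return self.head(h).flatten(1)            # (B, T * hop) waveform

units = torch.randint(0, 1000, (1, 50))           # translated Korean units
spk = torch.randn(1, 192)                         # speaker embedding
wav = TinyUnitVocoder()(units, spk)               # (1, 16000): ~1 s at 16 kHz
```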

📊 Dataset

The models were trained on the following datasets:

| Dataset | Description | Size |
| --- | --- | --- |
| Zeroth Korean ASR | Korean speech recognition data | 12,245 sentences |
| AIHub Ko-X interpreting/translation speech | Korean-English (US) parallel speech data | 169,488 sentences |

🚀 Getting Started

Prerequisites

# 1. λ ˆν¬μ§€ν† λ¦¬ 클둠
git clone https://github.com/Prometheus-AI-3team/NetfLips.git

cd NetfLips

# 2. μ„œλΈŒλͺ¨λ“ˆ(fairseq) update
git submodule init
git submodule update

# 2. Conda κΈ°λ³Έ ν™˜κ²½ 생성
conda env create -f environment.yml
conda activate unit2a

# 3. Pip λ‹€μš΄κ·Έλ ˆμ΄λ“œ (메타데이터 μ—λŸ¬ λ°©μ§€)
pip install "pip<24.1"

# 4. PyTorch μ„€μΉ˜ (CUDA 11.7 κΈ°μ€€)
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117

Installation

```bash
# 6. Install the remaining libraries
pip install -r requirements.txt

# 7. Install Fairseq
cd av2av-main/fairseq
pip install -e .
```

💻 Usage

Checkpoints

| Model | Checkpoint | Link |
| --- | --- | --- |
| AV2Unit | mav_hubert_large_noise.pt | download |
| Unit2Unit | utut_sts_ft.pt | download |
| Unit2AV | unit_av_renderer_withKO.pt | download |
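After downloading, a quick way to confirm a checkpoint deserializes correctly is to load it on CPU, as below. Key names vary by checkpoint; this is only a sanity check.

```python
# Sanity-check a downloaded checkpoint (CPU only, no model construction).
import torch

ckpt = torch.load("utut_sts_ft.pt", map_location="cpu")
print(type(ckpt), list(ckpt)[:5] if isinstance(ckpt, dict) else "")
```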

End-to-End Inference

```bash
PYTHONPATH=fairseq python inference.py \
  --in-vid-path /path/to/input.mp4 \
  --out-vid-path /path/to/output.mp4 \
  --src-lang en --tgt-lang ko \
  --av2unit-path /path/to/mavhubert_large_noise.pt \
  --utut-path /path/to/utut_sts_ft.pt \
  --unit2av-path /path/to/unit_av_renderer_withKO.pt
```
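To translate a whole folder of videos, the documented CLI can be wrapped in a small loop. The `inputs/`, `outputs/`, and `ckpt/` paths below are placeholders.

```python
# Batch wrapper around the end-to-end CLI above (paths are placeholders).
import os
import subprocess
from pathlib import Path

for vid in sorted(Path("inputs").glob("*.mp4")):
    subprocess.run(
        ["python", "inference.py",
         "--in-vid-path", str(vid),
         "--out-vid-path", f"outputs/{vid.stem}_ko.mp4",
         "--src-lang", "en", "--tgt-lang", "ko",
         "--av2unit-path", "ckpt/mavhubert_large_noise.pt",
         "--utut-path", "ckpt/utut_sts_ft.pt",
         "--unit2av-path", "ckpt/unit_av_renderer_withKO.pt"],
        check=True,
        env={**os.environ, "PYTHONPATH": "fairseq"},  # as in the CLI example
    )
```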

Training & Inference

For training and inference of each module (av2unit, unit2unit, unit2av), see the README.md inside the corresponding module directory.

πŸ“ Project Structure

```
NetfLips/
├── av2unit/                  # Audio-Visual to Unit Extraction
│   ├── avhubert/             # Feature extraction using AV-HuBERT
│   └── inference.py          # Unit extraction inference script
├── unit2unit/                # Unit to Unit Translation
│   ├── utut_pretrain/        # Pre-training modules
│   ├── utut_finetune/        # Fine-tuning modules
│   └── inference.py          # Translation inference script
├── unit2av/                  # Unit to Audio-Visual Generation
│   ├── model.py              # Unit2AV model definition
│   ├── train_unit2a.py       # Training script for Unit2Audio
│   └── inference_unit2av.py  # Inference scripts
├── fairseq/                  # Fairseq Toolkit (Submodule)
├── scripts/                  # Utility Scripts for Data Preparation
├── inference_av2av.py        # Main End-to-End Inference Script
├── environment.yml           # Conda Environment Configuration
└── requirements.txt          # Python Dependencies
```

🔬 Methodology

Data Preprocessing

  • Restore FLAC files and convert them to wav
  • Extract mel-spectrogram features
  • Assign units via K-means clustering (see the note below)
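One detail worth noting: unit-based pipelines commonly collapse consecutive duplicate units after K-means assignment. Whether NetfLips applies this step is not stated here, so the snippet below is a generic illustration.

```python
# Collapsing runs of repeated units (a common, assumed normalization).
import itertools

units = [17, 17, 17, 4, 4, 903, 17]
deduped = [u for u, _ in itertools.groupby(units)]
print(deduped)  # [17, 4, 903, 17]
```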

Model Training

  • mBART-based sequence-to-sequence training (illustrative step below)
  • Built on the Fairseq toolkit
  • Optimized for unit-to-unit translation
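Conceptually, training treats unit sequences like token sequences in machine translation. The sketch below shows one teacher-forced training step with a generic Transformer standing in for mBART; the vocabulary size and shapes are illustrative, since the actual training runs through Fairseq.

```python
# One illustrative seq2seq training step over unit tokens (not the
# project's actual mBART/Fairseq setup).
import torch
import torch.nn as nn

model = nn.Transformer(d_model=256, batch_first=True)
emb = nn.Embedding(1000, 256)                     # shared unit vocabulary
head = nn.Linear(256, 1000)
params = [*model.parameters(), *emb.parameters(), *head.parameters()]
opt = torch.optim.Adam(params, lr=3e-4)

src = torch.randint(0, 1000, (8, 120))            # English unit batch
tgt = torch.randint(0, 1000, (8, 90))             # Korean unit batch

tgt_mask = nn.Transformer.generate_square_subsequent_mask(89)
out = model(emb(src), emb(tgt[:, :-1]), tgt_mask=tgt_mask)  # teacher forcing
loss = nn.functional.cross_entropy(head(out).transpose(1, 2), tgt[:, 1:])
loss.backward(); opt.step(); opt.zero_grad()
```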

Audio-Visual Generation

  • Speech resynthesis from Korean units
  • Natural speech generation using a speaker embedding
  • Generation of lip-synchronized video

🛠️ Technical Details

Base Model

  • AV2AV: Audio-Visual to Audio-Visual translation model
  • Reference: Choi, J., et al., 2024

Fine-tuning Strategy

  • Fine-tuning to resolve the base model's missing Korean support
  • Uses parallel Korean-English speech data
  • Trains translation at the unit level

👥 Team Members from Prometheus (AI club)

| Name | Batch |
| --- | --- |
| 장지수 | 6th |
| 유지혜 | 6th |
| 신규철 | 8th |
| 이가연 | 8th |

πŸ“ Citation

```bibtex
@misc{netflips2024,
  title={NetfLips: Unit-based Audiovisual Translation for Korean},
  author={장지수 and 유지혜 and 신규철 and 이가연},
  year={2024}
}
```

References

  • Choi, J., et al. (2024). AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation. CVPR.

License

This project is distributed under the MIT License. See the LICENSE file for details.


Acknowledgments

This repository is built upon AV2AV and Fairseq. We are grateful to these open-source projects.
