feat: add German G2P with text normalization #97

Open: semidark wants to merge 2 commits into hexgrad:main from semidark:feat/german-g2p-upstream

@semidark

Summary

Add German G2P module (misaki/de.py) with text normalization and espeak-ng phonemization.

Changes

  • misaki/de.py: DEG2P class + normalize_text_de() function
  • tests/test_de.py: 61 unit tests (CI-safe, no espeak needed) + 4 integration tests
  • pyproject.toml: add de optional dependency

What normalize_text_de() handles

  • Cardinal numbers (42 -> zweiundvierzig)
  • Ordinals (3. -> dritte)
  • Years (1985 -> neunzehnhundertfünfundachtzig)
  • Dates (24.12.2024 -> vierundzwanzigste Dezember zweitausendvierundzwanzig)
  • Times (14:30 -> vierzehn Uhr dreißig)
  • Currency (€29,99 -> neunundzwanzig Euro und neunundneunzig Cent)
  • 30+ German abbreviations (Dr., GmbH, z.B., usw., months, etc.)
  • German-format numbers (1.234,56 with thousand dots + decimal comma)
  • Quote normalization, whitespace cleanup
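
As an illustration of the cardinal and year rules listed above, here is a minimal sketch of German number spelling. It is not the code from misaki/de.py: the names `spell_cardinal_de` and `spell_year_de` are hypothetical, the range is limited to 0..999,999, and reading years 1100 to 1999 as "hundreds" is an assumption about the convention the normalizer follows.

```python
# Hypothetical sketch, not the PR's implementation.
UNITS = ["null", "eins", "zwei", "drei", "vier", "fünf", "sechs", "sieben",
         "acht", "neun", "zehn", "elf", "zwölf", "dreizehn", "vierzehn",
         "fünfzehn", "sechzehn", "siebzehn", "achtzehn", "neunzehn"]
TENS = {2: "zwanzig", 3: "dreißig", 4: "vierzig", 5: "fünfzig",
        6: "sechzig", 7: "siebzig", 8: "achtzig", 9: "neunzig"}


def _below_100(n: int, final: bool = True) -> str:
    """Spell 1..99. 'eins' becomes 'ein' when more of the word follows."""
    if n == 1:
        return "eins" if final else "ein"
    if n < 20:
        return UNITS[n]
    tens, unit = divmod(n, 10)
    if unit == 0:
        return TENS[tens]
    # German puts the unit first: 42 -> zwei + und + vierzig.
    prefix = "ein" if unit == 1 else UNITS[unit]
    return f"{prefix}und{TENS[tens]}"


def _below_1000(n: int, final: bool = True) -> str:
    """Spell 1..999."""
    hundreds, rest = divmod(n, 100)
    word = ""
    if hundreds:
        word = _below_100(hundreds, final=False) + "hundert"
    if rest:
        word += _below_100(rest, final)
    return word


def spell_cardinal_de(n: int) -> str:
    """Spell a cardinal in the range 0..999_999."""
    if not 0 <= n < 1_000_000:
        raise ValueError("sketch only covers 0..999999")
    if n == 0:
        return "null"
    thousands, rest = divmod(n, 1000)
    word = ""
    if thousands:
        word = _below_1000(thousands, final=False) + "tausend"
    if rest:
        word += _below_1000(rest)
    return word


def spell_year_de(year: int) -> str:
    """Assumed convention: 1100..1999 are read as '<century>hundert<rest>'."""
    if 1100 <= year <= 1999:
        century, rest = divmod(year, 100)
        word = _below_100(century, final=False) + "hundert"
        return word + (_below_100(rest) if rest else "")
    return spell_cardinal_de(year)


print(spell_cardinal_de(42))    # zweiundvierzig
print(spell_year_de(1985))      # neunzehnhundertfünfundachtzig
print(spell_cardinal_de(2024))  # zweitausendvierundzwanzig
```

The `final` flag exists only to switch between "eins" (standalone or word-final) and "ein" (inside a compound such as einundzwanzig or eintausend).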

Architecture

DEG2P follows the same pattern as KOG2P (Korean) and HEG2P (Hebrew):

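from typing import Tuple

class DEG2P:
    def __init__(self):
        # Deferred import: the espeak backend is only loaded when a DEG2P is constructed.
        from .espeak import EspeakG2P
        self.espeak = EspeakG2P(language='de')

    def __call__(self, text) -> Tuple[str, None]:
        # Expand numbers, dates, abbreviations, etc. before phonemization.
        text = normalize_text_de(text)
        return self.espeak(text)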

Because the espeak import is deferred to __init__, normalize_text_de() can be imported and tested without phonemizer installed.
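
The effect of the deferred import can be shown in isolation. The sketch below is a stand-in, not misaki code: `LazyG2P`, `normalize`, and the module name `hypothetical_espeak_backend` are all made up for illustration, and the backend is deliberately a package that does not exist.

```python
def normalize(text: str) -> str:
    """Pure-Python helper with no third-party dependencies."""
    return " ".join(text.split())


class LazyG2P:
    def __init__(self):
        # The import runs only when an instance is created, so importing
        # this module (and calling normalize) never touches the backend.
        import hypothetical_espeak_backend as backend  # hypothetical, not a real package
        self.backend = backend

    def __call__(self, text: str):
        return self.backend.phonemize(normalize(text))


def backend_available() -> bool:
    """Report whether LazyG2P can be constructed in this environment."""
    try:
        LazyG2P()
    except ImportError:
        return False
    return True


# The normalizer works even though the backend is missing.
print(normalize("  Dr.   Müller  "))
print(backend_available())
```

With a module-level import instead, the `ImportError` would be raised as soon as the file is imported, and the normalizer tests could not run in an environment without the backend.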

Usage

from misaki.de import DEG2P

g2p = DEG2P()
phonemes, _ = g2p('Dr. Müller kaufte am 3. Mai um 14:30 Uhr 3 Pakete für €29,99.')

Context

This module was developed as part of kokoro-deutsch, a community project for fine-tuning Kokoro for German. A trained German multi-speaker base model is published at dida-80b/kokoro-deutsch-hui-base.

No new heavy dependencies are introduced. The de extra only adds phonemizer-fork and espeakng-loader, which are already used by misaki[en].

Add misaki/de.py with:
- normalize_text_de(): expands numbers, dates, times, currency,
  abbreviations, ordinals, and years to spelled-out German text
- DEG2P: wraps normalize_text_de() + EspeakG2P(language='de')

Normalizer handles:
- Cardinal numbers (zweiundvierzig), ordinals (dritte)
- Years (neunzehnhundertfünfundachtzig)
- Dates DD.MM.YYYY, times HH:MM
- Currency (Euro, Dollar, Pfund, Yen with cents)
- 30+ German abbreviations (Dr., GmbH, z.B., usw., months)
- German-format numbers (1.234,56 with thousand dots + decimal comma)
- Quote normalization, whitespace cleanup

Add tests/test_de.py with 60 unit tests (CI-safe, no espeak needed)
and 4 integration tests (skipped when phonemizer unavailable).

Add 'de' optional dependency in pyproject.toml.

Previously, '14:30 Uhr' was normalized to 'vierzehn Uhr dreißig Uhr'.
The regex now optionally consumes a trailing ' Uhr' after HH:MM.
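
A minimal sketch of the idea behind this fix follows. It is not the regex from the PR: the pattern, the helper `_spell`, and the function name `normalize_time_de` are assumptions for illustration. The range-restricted hour and minute groups and the `\b` after "Uhr" are additions in this sketch.

```python
import re

_UNITS = ["null", "eins", "zwei", "drei", "vier", "fünf", "sechs", "sieben",
          "acht", "neun", "zehn", "elf", "zwölf", "dreizehn", "vierzehn",
          "fünfzehn", "sechzehn", "siebzehn", "achtzehn", "neunzehn"]
_TENS = {2: "zwanzig", 3: "dreißig", 4: "vierzig", 5: "fünfzig"}


def _spell(n: int) -> str:
    """Spell 0..59, which is all a clock time needs."""
    if n < 20:
        return _UNITS[n]
    tens, unit = divmod(n, 10)
    if unit == 0:
        return _TENS[tens]
    return ("ein" if unit == 1 else _UNITS[unit]) + "und" + _TENS[tens]


# HH:MM with hours 0..23 and minutes 00..59, not embedded in a longer
# digit/colon run, optionally followed by the standalone word "Uhr".
_TIME_RE = re.compile(
    r"(?<![\d:])([01]?\d|2[0-3]):([0-5]\d)(?![\d:])(?:\s+Uhr\b)?"
)


def _expand(match: re.Match) -> str:
    hour, minute = int(match.group(1)), int(match.group(2))
    hour_word = "ein" if hour == 1 else _spell(hour)  # "ein Uhr", not "eins Uhr"
    return f"{hour_word} Uhr" + (f" {_spell(minute)}" if minute else "")


def normalize_time_de(text: str) -> str:
    return _TIME_RE.sub(_expand, text)


print(normalize_time_de("um 14:30 Uhr"))  # um vierzehn Uhr dreißig
print(normalize_time_de("um 14:30"))      # um vierzehn Uhr dreißig
```

Because the trailing " Uhr" is part of the match, it is replaced together with the digits rather than left behind to duplicate the "Uhr" that the expansion inserts.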
@apples-kksk

I found a few small German-normalizer edge cases while testing this PR and opened a follow-up against your branch here: semidark#1

It covers:

  • not consuming Uhr inside longer words like Uhrzeit
  • leaving invalid times such as 25:00 Uhr / 23:99 Uhr unchanged instead of expanding around the colon
  • using Decimal for currency rounding so €9,999 becomes zehn Euro rather than neun Euro und einhundert Cent
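
The currency rounding issue can be reproduced in a few lines. The sketch below is hypothetical and not taken from either branch: `split_currency_naive` is one plausible way to arrive at "neun Euro und einhundert Cent", and `split_currency_de` shows a Decimal-based alternative. The choice of `ROUND_HALF_UP` is an assumption, and negative amounts are not handled.

```python
from decimal import Decimal, ROUND_HALF_UP
from typing import Tuple


def _to_canonical(amount: str) -> str:
    """Convert German format ('1.234,56') to a plain decimal string ('1234.56')."""
    return amount.replace(".", "").replace(",", ".")


def split_currency_naive(amount: str) -> Tuple[int, int]:
    """Float-based split: rounds the cents only, so 99.9 cents becomes 100."""
    value = float(_to_canonical(amount))
    euros = int(value)
    cents = round((value - euros) * 100)
    return euros, cents


def split_currency_de(amount: str) -> Tuple[int, int]:
    """Decimal-based split: round the whole amount to two places first,
    so a carry from the cents propagates into the euros."""
    value = Decimal(_to_canonical(amount)).quantize(
        Decimal("0.01"), rounding=ROUND_HALF_UP
    )
    euros = int(value)
    cents = int((value - euros) * 100)
    return euros, cents


print(split_currency_naive("9,999"))  # (9, 100)
print(split_currency_de("9,999"))     # (10, 0)
print(split_currency_de("29,99"))     # (29, 99)
```

Rounding the full amount before splitting is what lets €9,999 come out as zehn Euro. Parsing with `Decimal` directly from the string also avoids binary floating-point representation error in the cents.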

Verification:

  • local default env: python -m pytest tests/test_de.py -q -> 65 passed, 4 skipped
  • isolated Python 3.11 with .[de]: full tests/test_de.py -> 69 passed

I also noticed pyproject.toml adds the de extra, but uv.lock has not been regenerated for it yet. I left that untouched because uv lock currently fails for the existing he extra resolution on Python 3.12/3.13 (mishkal-hebrew>=0.3.2).
