feat: add German G2P with text normalization by semidark · Pull Request #97 · hexgrad/misaki

semidark · 2026-04-17T13:55:52Z

Summary

Add German G2P module (misaki/de.py) with text normalization and espeak-ng phonemization.

Changes

misaki/de.py: DEG2P class + normalize_text_de() function
tests/test_de.py: 61 unit tests (CI-safe, no espeak needed) + 4 integration tests
pyproject.toml: add de optional dependency

What `normalize_text_de()` handles

Cardinal numbers (42 -> zweiundvierzig)
Ordinals (3. -> dritte)
Years (1985 -> neunzehnhundertfünfundachtzig)
Dates (24.12.2024 -> vierundzwanzigste Dezember zweitausendvierundzwanzig)
Times (14:30 -> vierzehn Uhr dreißig)
Currency (€29,99 -> neunundzwanzig Euro und neunundneunzig Cent)
30+ German abbreviations (Dr., GmbH, z.B., usw., months, etc.)
German-format numbers (1.234,56 with thousand dots + decimal comma)
Quote normalization, whitespace cleanup
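The cardinal and year expansions listed above can be illustrated with a minimal sketch. This is not the code from misaki/de.py: the names spell_cardinal_de, spell_year_de and _stem are hypothetical, the supported range (0 to 999999) is an arbitrary limit chosen for the example, and reading years 1100 to 1999 as "<century>hundert<rest>" is an assumption based on the 1985 example.

```python
# Illustrative sketch only; names and range limits are assumptions, not the PR's code.
_ONES = [
    "null", "eins", "zwei", "drei", "vier", "fünf", "sechs", "sieben", "acht",
    "neun", "zehn", "elf", "zwölf", "dreizehn", "vierzehn", "fünfzehn",
    "sechzehn", "siebzehn", "achtzehn", "neunzehn",
]
_TENS = {2: "zwanzig", 3: "dreißig", 4: "vierzig", 5: "fünfzig",
         6: "sechzig", 7: "siebzig", 8: "achtzig", 9: "neunzig"}


def _stem(word: str) -> str:
    """Compound-internal form: 'eins' drops its final 's' (einhundert, eintausend)."""
    return word[:-1] if word.endswith("eins") else word


def spell_cardinal_de(n: int) -> str:
    """Spell 0..999999 as a single German compound word."""
    if not 0 <= n < 1_000_000:
        raise ValueError("supported range is 0..999999")
    if n < 20:
        return _ONES[n]
    if n < 100:
        tens, unit = divmod(n, 10)
        if unit == 0:
            return _TENS[tens]
        # German puts the unit before the ten: 42 -> zwei-und-vierzig.
        return f"{_stem(_ONES[unit])}und{_TENS[tens]}"
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        head = _stem(_ONES[hundreds]) + "hundert"
    else:
        thousands, rest = divmod(n, 1000)
        head = _stem(spell_cardinal_de(thousands)) + "tausend"
    return head + (spell_cardinal_de(rest) if rest else "")


def spell_year_de(year: int) -> str:
    """Read 1100..1999 as '<century>hundert<rest>'; otherwise as a cardinal."""
    if 1100 <= year <= 1999:
        century, rest = divmod(year, 100)
        tail = spell_cardinal_de(rest) if rest else ""
        return spell_cardinal_de(century) + "hundert" + tail
    return spell_cardinal_de(year)


print(spell_cardinal_de(42))    # zweiundvierzig
print(spell_cardinal_de(2024))  # zweitausendvierundzwanzig
print(spell_year_de(1985))      # neunzehnhundertfünfundachtzig
```

The _stem helper is what keeps compounds such as einundzwanzig and einhundert from coming out as "einsundzwanzig" or "einshundert".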

Architecture

DEG2P follows the same pattern as KOG2P (Korean) and HEG2P (Hebrew):

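from typing import Tuple

class DEG2P:
    def __init__(self):
        # Deferred import: normalize_text_de() stays importable without phonemizer.
        from .espeak import EspeakG2P
        self.espeak = EspeakG2P(language='de')

    def __call__(self, text) -> Tuple[str, None]:
        # Spell out numbers, dates, abbreviations, etc. before phonemization.
        text = normalize_text_de(text)
        return self.espeak(text)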
The deferred import allows normalize_text_de() to be tested without phonemizer installed.
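The effect of the deferred import can be shown in isolation. The sketch below is a stand-alone illustration, not code from this PR: normalize_whitespace, LazyBackendG2P, backend_available and the module name hypothetical_phonemizer_backend are all made up for the example, and the backend's phonemize() method is likewise assumed.

```python
import importlib
import importlib.util


def normalize_whitespace(text: str) -> str:
    """Stand-in for a pure-Python normalizer with no third-party imports."""
    return " ".join(text.split())


class LazyBackendG2P:
    # Hypothetical module name, chosen so that it does not exist anywhere.
    BACKEND = "hypothetical_phonemizer_backend"

    def __init__(self):
        # The optional backend is resolved only when an instance is created,
        # so importing this module never fails because the backend is missing.
        self.backend = importlib.import_module(self.BACKEND)

    def __call__(self, text: str):
        # phonemize() is an assumed method of the hypothetical backend.
        return self.backend.phonemize(normalize_whitespace(text))


def backend_available(name: str) -> bool:
    """True if a top-level module with this name can be found."""
    return importlib.util.find_spec(name) is not None


# The normalizer works even though the backend is not installed.
print(normalize_whitespace("  Dr.   Müller  "))

# Only constructing the G2P object needs the backend.
if not backend_available(LazyBackendG2P.BACKEND):
    print("backend missing: integration tests would be skipped")
```

A test suite can use a check like backend_available() to skip integration tests while still running the normalizer unit tests.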

Usage

from misaki.de import DEG2P

g2p = DEG2P()
phonemes, _ = g2p('Dr. Müller kaufte am 3. Mai um 14:30 Uhr 3 Pakete für €29,99.')

Context

This module was developed as part of kokoro-deutsch, a community project for fine-tuning Kokoro for German. A trained German multi-speaker base model is published at dida-80b/kokoro-deutsch-hui-base.

No new heavy dependencies are introduced. The de extra only adds phonemizer-fork and espeakng-loader, which are already used by misaki[en].

Add misaki/de.py with:
- normalize_text_de(): expands numbers, dates, times, currency, abbreviations, ordinals, and years to spelled-out German text
- DEG2P: wraps normalize_text_de() + EspeakG2P(language='de')

Normalizer handles:
- Cardinal numbers (zweiundvierzig), ordinals (dritte)
- Years (neunzehnhundertfünfundachtzig)
- Dates DD.MM.YYYY, times HH:MM
- Currency (Euro, Dollar, Pfund, Yen with cents)
- 30+ German abbreviations (Dr., GmbH, z.B., usw., months)
- German-format numbers (1.234,56 with thousand dots + decimal comma)
- Quote normalization, whitespace cleanup

Add tests/test_de.py with 60 unit tests (CI-safe, no espeak needed) and 4 integration tests (skipped when phonemizer unavailable).

Add 'de' optional dependency in pyproject.toml.

apples-kksk · 2026-05-12T11:50:19Z

I found a few small German-normalizer edge cases while testing this PR and opened a follow-up against your branch here: semidark#1

It covers:

not consuming Uhr inside longer words like Uhrzeit
leaving invalid times such as 25:00 Uhr / 23:99 Uhr unchanged instead of expanding around the colon
using Decimal for currency rounding so €9,999 becomes zehn Euro rather than neun Euro und einhundert Cent

Verification:

local default env: python -m pytest tests/test_de.py -q -> 65 passed, 4 skipped
isolated Python 3.11 with .[de]: full tests/test_de.py -> 69 passed

I also noticed pyproject.toml adds the de extra, but uv.lock has not been regenerated for it yet. I left that untouched because uv lock currently fails for the existing he extra resolution on Python 3.12/3.13 (mishkal-hebrew>=0.3.2).

semidark added 2 commits April 17, 2026 15:55

fix: prevent double 'Uhr' when time precedes 'Uhr' in text

3e7b637

'14:30 Uhr' was normalized to 'vierzehn Uhr dreißig Uhr'. Now the regex optionally consumes trailing ' Uhr' after HH:MM.
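A sketch of how such a pattern might look, including the word-boundary and range checks raised in the review comment. The regex, expand_times_de and the helper names are assumptions for illustration and are not taken from misaki/de.py. Reading a full hour as "<hour> Uhr" and hour 1 as "ein Uhr" are also assumptions.

```python
import re

_ONES = [
    "null", "eins", "zwei", "drei", "vier", "fünf", "sechs", "sieben", "acht",
    "neun", "zehn", "elf", "zwölf", "dreizehn", "vierzehn", "fünfzehn",
    "sechzehn", "siebzehn", "achtzehn", "neunzehn",
]
_TENS = {2: "zwanzig", 3: "dreißig", 4: "vierzig", 5: "fünfzig"}


def _spell(n: int) -> str:
    """Spell 0..59, enough for hours and minutes."""
    if n < 20:
        return _ONES[n]
    tens, unit = divmod(n, 10)
    if unit == 0:
        return _TENS[tens]
    return ("ein" if unit == 1 else _ONES[unit]) + "und" + _TENS[tens]


# HH:MM not glued to other digits or colons, optionally followed by a
# standalone ' Uhr'. The \b keeps 'Uhrzeit' from being partially consumed.
_TIME_RE = re.compile(r"(?<![\d:])(\d{1,2}):(\d{2})(?![\d:])(?: Uhr\b)?")


def _expand(match: re.Match) -> str:
    hour, minute = int(match.group(1)), int(match.group(2))
    if hour > 23 or minute > 59:
        return match.group(0)  # not a valid time: leave the text unchanged
    hour_word = "ein" if hour == 1 else _spell(hour)
    if minute == 0:
        return f"{hour_word} Uhr"
    return f"{hour_word} Uhr {_spell(minute)}"


def expand_times_de(text: str) -> str:
    return _TIME_RE.sub(_expand, text)


print(expand_times_de("um 14:30 Uhr"))   # um vierzehn Uhr dreißig
print(expand_times_de("14:30 Uhrzeit"))  # vierzehn Uhr dreißig Uhrzeit
print(expand_times_de("25:00 Uhr"))      # 25:00 Uhr
```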

This was referenced Apr 17, 2026

feat: German text normalization via misaki fork semidark/kikiri-tts#17

Open

feat: add German (de) language support hexgrad/kokoro#317

Open

apples-kksk mentioned this pull request May 12, 2026

fix: tighten German normalization edge cases semidark/misaki#1

Open

semidark wants to merge 2 commits into hexgrad:main from semidark:feat/german-g2p-upstream
