feat: Add Turkish Syllabification Module (#43) by ada-cinar · Pull Request #50 · cdliai/durak

ada-cinar · 2026-01-26T20:36:22Z

Summary

Implements research-critical Turkish syllabification functionality following linguistic principles of onset-nucleus-coda structure and Turkish phonotactics.

Motivation

Syllabification (hece ayırma) is fundamental for:

Prosody Research: Stress patterns, rhythm analysis
Poetry Analysis: Rhyme detection, meter identification
TTS/Speech Synthesis: Pronunciation modeling
Morphophonology: Vowel harmony and suffix patterns

Implementation

Core Algorithm

Onset Maximization: Single consonants go to next syllable (ki.tap, not kit.ap)
Turkish Structure: (C)(C)V(C)(C) where V is mandatory
Cluster Handling: Recognizes valid Turkish onset clusters (tr, pr, kr, etc.)
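
The (C)(C)V(C)(C) template above can be checked mechanically. The following is a minimal sketch, not the module's code: `cv_pattern` and `fits_template` are hypothetical helper names, and the vowel set (including circumflex vowels) is an assumption about what the module treats as a nucleus.

```python
import re

# Assumed vowel inventory, upper and lower case, including circumflex vowels.
VOWELS = set("aeıioöuüâîûAEIİOÖUÜÂÎÛ")

# (C)(C)V(C)(C): up to two onset consonants, one vowel, up to two coda consonants.
TEMPLATE = re.compile(r"C{0,2}VC{0,2}")


def cv_pattern(syllable: str) -> str:
    """Map each character to 'V' if it is a vowel, otherwise 'C'."""
    return "".join("V" if ch in VOWELS else "C" for ch in syllable)


def fits_template(syllable: str) -> bool:
    """Return True if the syllable matches the (C)(C)V(C)(C) shape."""
    return TEMPLATE.fullmatch(cv_pattern(syllable)) is not None


print(cv_pattern("türk"), fits_template("türk"))    # CVCC True
print(cv_pattern("kitap"), fits_template("kitap"))  # CVCVC False
```

A check like this is useful as a test-side invariant: every syllable a syllabifier emits for a well-formed word should pass `fits_template`.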

Turkish Syllable Boundary Rules

V.V       → break between vowels    (şa-ir)
V.CV      → C starts next syllable  (ki-tap, a-na, o-ku)
VC.CV     → split between Cs        (mer-ha-ba, an-la-mak)
VCC.CV    → last C starts next      (türk-çe)
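
The boundary rules can be illustrated with a short vowel-anchored pass. This is a simplified sketch under stated assumptions, not the implementation in this PR: `syllabify_sketch` is a hypothetical name, input is assumed to be a single NFC-normalized word without apostrophes, and for any run of consonants between two vowels only the last consonant is moved to the next syllable (which reproduces türk-çe but does not model special onset clusters).

```python
# Assumed vowel inventory, upper and lower case, including circumflex vowels.
VOWELS = set("aeıioöuüâîûAEIİOÖUÜÂÎÛ")


def syllabify_sketch(word: str) -> list[str]:
    """Split a word into syllables by placing one boundary per vowel pair."""
    vowel_positions = [i for i, ch in enumerate(word) if ch in VOWELS]
    if len(vowel_positions) < 2:
        # Zero or one vowel: nothing to split.
        return [word] if word else []

    boundaries = []
    for prev, nxt in zip(vowel_positions, vowel_positions[1:]):
        consonants_between = nxt - prev - 1
        if consonants_between == 0:
            boundaries.append(nxt)      # V.V: break between the vowels
        else:
            boundaries.append(nxt - 1)  # last consonant opens the next syllable

    pieces, start = [], 0
    for b in boundaries:
        pieces.append(word[start:b])
        start = b
    pieces.append(word[start:])
    return pieces


for w in ["şair", "kitap", "merhaba", "türkçe"]:
    print("-".join(syllabify_sketch(w)))
# şa-ir
# ki-tap
# mer-ha-ba
# türk-çe
```

Because slicing works on the original string, case is preserved without extra handling: `syllabify_sketch("İstanbul")` gives `['İs', 'tan', 'bul']`.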

API

Basic Usage

from durak import syllabify

syllabify('merhaba')
# ['mer', 'ha', 'ba']

syllabify('kitap', separator='-')
# 'ki-tap'

Advanced Analysis

from durak import Syllabifier

syl = Syllabifier()
info = syl.analyze('anlamak')
# SyllableInfo(
#     word='anlamak',
#     syllables=['an', 'la', 'mak'],
#     count=3,
#     structure=['VC', 'CV', 'CVC']
# )
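
For readers who want to see how such a result object could be assembled, here is a hedged sketch. `SyllableInfoSketch` and `describe` are hypothetical names; the field names mirror the output shown above, but the real `SyllableInfo` dataclass may differ in types, defaults, or extra fields.

```python
from dataclasses import dataclass

# Assumed vowel inventory, upper and lower case, including circumflex vowels.
VOWELS = set("aeıioöuüâîûAEIİOÖUÜÂÎÛ")


@dataclass(frozen=True)
class SyllableInfoSketch:
    word: str
    syllables: list[str]
    count: int
    structure: list[str]


def describe(word: str, syllables: list[str]) -> SyllableInfoSketch:
    """Build an analysis record from an already syllabified word."""
    structure = [
        "".join("V" if ch in VOWELS else "C" for ch in syllable)
        for syllable in syllables
    ]
    return SyllableInfoSketch(word, list(syllables), len(syllables), structure)


info = describe("anlamak", ["an", "la", "mak"])
print(info.count, info.structure)  # 3 ['VC', 'CV', 'CVC']
```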

Features

✅ Onset maximization (linguistically correct)
✅ Turkish consonant cluster recognition
✅ Case preservation (İSTANBUL → İS-TAN-BUL)
✅ Unicode NFC normalization (handles İ/i properly)
✅ Custom separator support
✅ Detailed syllable structure analysis
✅ Integration-ready (works with tokenizer)
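
The İ/i handling matters because Python's default `str.lower()` is not Turkish-aware: it maps `İ` to `i` plus a combining dot, and `I` to `i` instead of `ı`. The sketch below shows one way to normalize and lowercase for vowel classification while the original casing is kept for output. `turkish_lower` is a hypothetical helper, not a function exported by this PR.

```python
import unicodedata

# Turkish-specific case pairs that the default lowercasing gets wrong.
_TR_LOWER = str.maketrans({"İ": "i", "I": "ı"})


def turkish_lower(text: str) -> str:
    """NFC-normalize, then lowercase using Turkish dotted/dotless i rules."""
    text = unicodedata.normalize("NFC", text)
    return text.translate(_TR_LOWER).lower()


print(turkish_lower("İSTANBUL"))          # istanbul
print(turkish_lower("ISPARTA"))           # ısparta
print("İSTANBUL".lower() == "istanbul")   # False: default lowercasing adds U+0307
```

NFC normalization runs first so that a decomposed `I` followed by U+0307 is composed into a single `İ` before the translation table is applied.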

Testing

Comprehensive test suite with 70+ test cases:

Basic syllabification patterns
Linguistic accuracy (common words)
Edge cases (vowel-initial, clusters, apostrophes)
Integration with tokenizer
Case preservation
Circumflex vowels (â, î, û)

Examples

# Poetry analysis - filter by syllable count
from durak import Syllabifier

syl = Syllabifier()  # reuse one instance instead of constructing one per word
words = ['ev', 'kitap', 'merhaba', 'bilgisayar']
three_syllable = [w for w in words if syl.count(w) == 3]
# ['merhaba']

# Integration with tokenizer
from durak import tokenize, syllabify
tokens = tokenize('Kitabı okudum.')
{t: syllabify(t) for t in tokens if t.isalpha()}
# {'Kitabı': ['Ki', 'ta', 'bı'], 'okudum': ['o', 'ku', 'dum']}
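
As a cross-check for syllable counts in the examples above, note that each Turkish syllable has a single vowel nucleus, so counting vowels gives the syllable count without computing boundaries. The sketch below relies on that property; `count_syllables` and `line_syllables` are hypothetical helpers, independent of the `durak` API, and the vowel set is an assumption.

```python
# Assumed vowel inventory, upper and lower case, including circumflex vowels.
VOWELS = set("aeıioöuüâîûAEIİOÖUÜÂÎÛ")


def count_syllables(word: str) -> int:
    """Count syllables as the number of vowel nuclei in the word."""
    return sum(ch in VOWELS for ch in word)


def line_syllables(line: str) -> int:
    """Total syllables in a line, e.g. for checking syllabic meter."""
    return sum(count_syllables(word) for word in line.split())


words = ["ev", "kitap", "merhaba", "bilgisayar"]
print([count_syllables(w) for w in words])  # [1, 2, 3, 4]
print(line_syllables("Kitabı okudum."))     # 6
```

Punctuation does not need to be stripped here because non-vowel characters never contribute to the count.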

Performance

Pure Python implementation for Phase 1. Future optimization:

Phase 2: Rust-accelerated version (10-50x faster)
Batch processing support
Stress detection
Morpheme-aware syllabification

Related

#6 [Feature] Implement Tiered Hybrid Lemmatizer in Rust: lemmatizer (syllables inform morpheme boundaries)
#8 [Feature] Implement Fast TF-IDF and Count Vectorizer in Rust: TF-IDF (syllable n-grams as features)
#28 [Documentation] Add PyTorch and HuggingFace Integration Examples: PyTorch/HF integration (syllable embeddings)

Closes #43

- Implement rule-based Turkish syllabifier following onset maximization
- Support (C)(C)V(C)(C) syllable structure
- Handle Turkish-specific phonotactics and consonant clusters
- Expose syllabify() convenience function and Syllabifier class
- Add SyllableInfo dataclass for detailed analysis
- Include comprehensive test suite (70+ test cases)
- Support case preservation and custom separators
- Add Unicode NFC normalization for Turkish characters

Closes #43

- Add fast syllabification algorithm in Rust (src/lib.rs)
- Implement Turkish phonotactic rules:
  * VCV → V.CV (break before single consonant)
  * VCCV → VC.CV (split between double consonants)
  * VCC+V → VC.C+V (take first consonant with current syllable)
- Add Python API with Syllabifier class and SyllableInfo dataclass
- Export syllabify(), syllabify_with_separator(), syllable_count()
- Add comprehensive test suite (39 tests, 100% pass)
- Support edge cases: vowel sequences, consonant clusters, circumflex
- Performance: Rust implementation for fast batch processing

Examples:
  syllabify('merhaba') → ['mer', 'ha', 'ba']
  syllabify('anlamak') → ['an', 'la', 'mak']
  syllabify('İstanbul') → ['İs', 'tan', 'bul']

Issue #43 - research-critical feature for prosody, poetry analysis, TTS

ada-cinar added 2 commits January 26, 2026 23:35

ada-cinar wants to merge 2 commits into main from feature/43-turkish-syllabification
