feat: Add Turkish Syllabification Module (#43) #50

Open
ada-cinar wants to merge 2 commits into main from feature/43-turkish-syllabification

Summary

Implements rule-based Turkish syllabification built on onset-nucleus-coda syllable structure and Turkish phonotactics.

Motivation

Syllabification (hece ayırma) is fundamental for:

  • Prosody Research: Stress patterns, rhythm analysis
  • Poetry Analysis: Rhyme detection, meter identification
  • TTS/Speech Synthesis: Pronunciation modeling
  • Morphophonology: Vowel harmony and suffix patterns

Implementation

Core Algorithm

  • Onset Maximization: Single consonants go to next syllable (ki.tap, not kit.ap)
  • Turkish Structure: (C)(C)V(C)(C) where V is mandatory
  • Cluster Handling: Recognizes valid Turkish onset clusters (tr, pr, kr, etc.)

Turkish Syllable Boundary Rules

V.V       → break between vowels          (sa-at, şa-ir)
V.CV      → C starts next syllable        (ki-tap, a-na)
VC.CV     → split between the two Cs      (mer-ha-ba, an-la-mak)
VCC.CV    → last C starts next syllable   (türk-çe)
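
The boundary rules above can be sketched in a few lines of Python. This is an illustration only, not the code in this PR: `sketch_syllabify`, the `VOWELS` set, and the short `ONSET_CLUSTERS` set are hypothetical names, and the handling of three or more consonants (keep a recognized onset cluster together, otherwise move only the last consonant forward) is an assumption chosen to match the `türk-çe` example.

```python
# Illustrative sketch only; names and cluster list are assumptions.
VOWELS = set("aeıioöuüâîûAEIİOÖUÜÂÎÛ")
ONSET_CLUSTERS = {"tr", "pr", "kr"}  # assumed subset of valid onset clusters


def sketch_syllabify(word: str) -> list[str]:
    """Split a word at boundaries placed between consecutive vowels."""
    vowel_idx = [i for i, ch in enumerate(word) if ch in VOWELS]
    if not vowel_idx:
        # No nucleus: return the word unsplit (or nothing for empty input).
        return [word] if word else []

    cuts = []
    for left, right in zip(vowel_idx, vowel_idx[1:]):
        gap = right - left - 1  # number of consonants between the two vowels
        if gap <= 1:
            cut = left + 1      # V.V or V.CV
        elif gap == 2:
            cut = left + 2      # VC.CV
        elif word[right - 2:right].lower() in ONSET_CLUSTERS:
            cut = right - 2     # keep a recognized onset cluster together
        else:
            cut = right - 1     # only the last consonant starts the next syllable
        cuts.append(cut)

    bounds = [0, *cuts, len(word)]
    return [word[a:b] for a, b in zip(bounds, bounds[1:])]


print(sketch_syllabify("merhaba"))  # ['mer', 'ha', 'ba']
print(sketch_syllabify("türkçe"))   # ['türk', 'çe']
```

Because the slices are taken from the original string, letter case is carried through unchanged, which is one simple way to get the case-preservation behavior listed under Features.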

API

Basic Usage

from durak import syllabify

syllabify('merhaba')
# ['mer', 'ha', 'ba']

syllabify('kitap', separator='-')
# 'ki-tap'

Advanced Analysis

from durak import Syllabifier

syl = Syllabifier()
info = syl.analyze('anlamak')
# SyllableInfo(
#     word='anlamak',
#     syllables=['an', 'la', 'mak'],
#     count=3,
#     structure=['VC', 'CV', 'CVC']
# )
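
For readers wondering how the `structure` field relates to the syllables, here is a minimal sketch that maps each letter to `C` or `V`. `SketchInfo` and `cv_pattern` are hypothetical stand-ins that mirror the fields shown above; the actual `SyllableInfo` in this PR may be built differently, and non-letter characters such as apostrophes are not handled here.

```python
from dataclasses import dataclass

# Illustrative sketch only; SketchInfo is a stand-in for SyllableInfo.
VOWELS = set("aeıioöuüâîûAEIİOÖUÜÂÎÛ")


@dataclass
class SketchInfo:
    word: str
    syllables: list[str]
    count: int
    structure: list[str]


def cv_pattern(syllable: str) -> str:
    """Map each letter to 'V' (vowel) or 'C' (anything else)."""
    return "".join("V" if ch in VOWELS else "C" for ch in syllable)


def describe(word: str, syllables: list[str]) -> SketchInfo:
    """Bundle already-split syllables with their count and CV patterns."""
    return SketchInfo(
        word=word,
        syllables=syllables,
        count=len(syllables),
        structure=[cv_pattern(s) for s in syllables],
    )


info = describe("anlamak", ["an", "la", "mak"])
print(info.structure)  # ['VC', 'CV', 'CVC']
```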

Features

✅ Onset maximization (linguistically correct)
✅ Turkish consonant cluster recognition
✅ Case preservation (İSTANBUL → İS-TAN-BUL)
✅ Unicode NFC normalization (handles İ/i properly)
✅ Custom separator support
✅ Detailed syllable structure analysis
✅ Integration-ready (works with tokenizer)
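
The İ/i handling deserves a concrete illustration, because Python's default case mapping is not Turkish-aware: `str.lower()` turns `I` into `i` (not `ı`) and turns `İ` into `i` followed by a combining dot. The helpers below are a hypothetical sketch of one way to deal with this using only the standard library; `tr_lower` and `tr_upper` are not part of this PR's API.

```python
import unicodedata


def tr_lower(text: str) -> str:
    """Lowercase using Turkish dotted/dotless i rules (sketch)."""
    # NFC first, so that "I" + combining dot above becomes the single letter İ.
    text = unicodedata.normalize("NFC", text)
    # Map the two Turkish-specific capitals before the generic lower().
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()


def tr_upper(text: str) -> str:
    """Uppercase using Turkish dotted/dotless i rules (sketch)."""
    text = unicodedata.normalize("NFC", text)
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()


print(tr_lower("İSTANBUL"))  # istanbul
print(tr_lower("ISPARTA"))   # ısparta
print(tr_upper("istanbul"))  # İSTANBUL
```

Normalizing before any vowel lookup also keeps circumflex vowels (â, î, û) as single code points, so a set-membership check against a vowel table stays reliable.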

Testing

Comprehensive test suite with 70+ test cases:

  • Basic syllabification patterns
  • Linguistic accuracy (common words)
  • Edge cases (vowel-initial, clusters, apostrophes)
  • Integration with tokenizer
  • Case preservation
  • Circumflex vowels (â, î, û)

Examples

# Poetry analysis - filter by syllable count
from durak import Syllabifier

syl = Syllabifier()  # create once and reuse for every word
words = ['ev', 'kitap', 'merhaba', 'bilgisayar']
three_syllable = [w for w in words if syl.count(w) == 3]
# ['merhaba']

# Integration with tokenizer
from durak import tokenize, syllabify
tokens = tokenize('Kitabı okudum.')
{t: syllabify(t) for t in tokens if t.isalpha()}
# {'Kitabı': ['Ki', 'ta', 'bı'], 'okudum': ['o', 'ku', 'dum']}
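
Since the (C)(C)V(C)(C) structure requires exactly one vowel per syllable, counting vowels gives an independent check on syllable counts, which can be handy when comparing line lengths for syllabic meter. The sketch below is standalone and does not use `durak`; `count_syllables` and `line_syllables` are hypothetical helper names.

```python
# Illustrative sketch only; relies on "one vowel per syllable".
VOWELS = set("aeıioöuüâîûAEIİOÖUÜÂÎÛ")


def count_syllables(word: str) -> int:
    """Count syllables by counting vowels; punctuation is ignored."""
    return sum(ch in VOWELS for ch in word)


def line_syllables(line: str) -> int:
    """Total syllable count for a whitespace-separated line of text."""
    return sum(count_syllables(w) for w in line.split())


print(count_syllables("bilgisayar"))     # 4
print(line_syllables("Kitabı okudum."))  # 6
```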

Performance

Phase 1 is a pure Python implementation. Planned follow-up work:

  • Phase 2: Rust-accelerated version (estimated 10-50x faster)
  • Batch processing support
  • Stress detection
  • Morpheme-aware syllabification

Related

Enables:

Closes #43

- Implement rule-based Turkish syllabifier following onset maximization
- Support (C)(C)V(C)(C) syllable structure
- Handle Turkish-specific phonotactics and consonant clusters
- Expose syllabify() convenience function and Syllabifier class
- Add SyllableInfo dataclass for detailed analysis
- Include comprehensive test suite (70+ test cases)
- Support case preservation and custom separators
- Add Unicode NFC normalization for Turkish characters

Closes #43

- Add fast syllabification algorithm in Rust (src/lib.rs)
- Implement Turkish phonotactic rules:
  * VCV → V.CV (break before single consonant)
  * VCCV → VC.CV (split between double consonants)
  * VCC+V → VC.C+V (take first consonant with current syllable)
- Add Python API with Syllabifier class and SyllableInfo dataclass
- Export syllabify(), syllabify_with_separator(), syllable_count()
- Add comprehensive test suite (39 tests, 100% pass)
- Support edge cases: vowel sequences, consonant clusters, circumflex
- Performance: Rust implementation for fast batch processing

Examples:
  syllabify('merhaba')  → ['mer', 'ha', 'ba']
  syllabify('anlamak')  → ['an', 'la', 'mak']
  syllabify('İstanbul') → ['İs', 'tan', 'bul']

Issue #43 - research-critical feature for prosody, poetry analysis, TTS