feat: Add vowel harmony validator for suffix stripping (#52) #53

Open
ada-cinar wants to merge 2 commits into main from feature/52-vowel-harmony-validator

Summary

Implements vowel harmony validation to prevent false positives in suffix stripping. Turkish suffixes must respect backness harmony (back vowels: a/ı/o/u, front vowels: e/i/ö/ü).

Changes

Core Implementation

  • ✅ Vowel harmony checker (back/front validation)
  • ✅ Updated strip_suffixes() to validate harmony before stripping
  • ✅ Python API: check_vowel_harmony_py(stem, suffix)

Tests

  • ✅ 14 Rust unit tests (100% passing)
  • ✅ 7 Python integration tests

Examples

Before (naive stripping):

strip_suffixes('kitapler')  # → 'kitap' ❌ (a-e mismatch ignored)

After (harmony-aware):

strip_suffixes('kitapler')  # → 'kitapler' ✅ (a-e mismatch, left unstripped)
strip_suffixes('kitaplar')  # → 'kitap' ✅ (valid a-a harmony)

Impact

  • Improves morphological accuracy (fewer false positives)
  • Foundation for advanced morphological analyzer
  • Exposes linguistic primitives to Python for research

Closes #52

## Changes

### Rust Core (src/lib.rs)
- ✅ Add vowel harmony validation functions:
  - is_back_vowel() / is_front_vowel()
  - get_last_vowel() / get_first_vowel()
  - check_vowel_harmony() - validates backness harmony
  - check_vowel_harmony_py() - Python API wrapper

- ✅ Update strip_suffixes() to respect vowel harmony:
  - Prevents false positives (e.g., 'kitapler' with a-e mismatch)
  - Maintains min stem length constraint (>= 2 chars)
  - Recursive stripping validates harmony at each step
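The stripping rules above can be sketched in Python. This is an illustration only, not the Rust code in src/lib.rs: the suffix list, the `harmonizes` and `_backness` helpers, and the choice to treat a vowel-less stem or suffix as harmonious are all assumptions made for the example. Input is assumed to be lowercase already.

```python
BACK = frozenset("aıou")
FRONT = frozenset("eiöü")
SUFFIXES = ("lar", "ler", "dan", "den")  # hypothetical toy suffix list
MIN_STEM_LEN = 2  # minimum stem length stated in this PR (>= 2 chars)


def _backness(text, from_end):
    """Return 'back' or 'front' for the last (or first) vowel, or None if there is no vowel."""
    chars = reversed(text) if from_end else text
    for ch in chars:
        if ch in BACK:
            return "back"
        if ch in FRONT:
            return "front"
    return None


def harmonizes(stem, suffix):
    """Compare the stem's last vowel with the suffix's first vowel.

    Assumption: if either side has no vowel, there is nothing to violate.
    """
    stem_class = _backness(stem, from_end=True)
    suffix_class = _backness(suffix, from_end=False)
    return stem_class is None or suffix_class is None or stem_class == suffix_class


def strip_suffixes(word):
    """Strip suffixes recursively, checking length and harmony at every step."""
    for suffix in SUFFIXES:
        if not word.endswith(suffix):
            continue
        stem = word[: -len(suffix)]
        if len(stem) >= MIN_STEM_LEN and harmonizes(stem, suffix):
            return strip_suffixes(stem)
    return word


print(strip_suffixes("kitaplardan"))  # kitap
print(strip_suffixes("kitapler"))     # kitapler
```

Because the harmony check runs inside the recursion, `kitaplardan` is only reduced to `kitap` when both `-dan` and `-lar` agree with the stem they attach to.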

### Tests
- ✅ 14 Rust unit tests (all passing):
  - Vowel detection (back/front)
  - Vowel extraction (first/last)
  - Harmony validation (valid/invalid cases)
  - Suffix stripping with harmony constraints
  - Edge cases (no vowels, short stems, recursion)

- ✅ 7 Python integration tests (added):
  - check_vowel_harmony_py() API tests
  - Lemmatizer harmony validation
  - Recursive stripping with harmony

## Linguistic Background

Turkish suffixes follow strict vowel harmony:
- **Backness harmony**: The suffix's first vowel must match the backness of the stem's last vowel
  - Back vowels: a, ı, o, u
  - Front vowels: e, i, ö, ü

Examples:
- ✅ kitap + lar → Valid (a-a harmony)
- ✅ ev + ler → Valid (e-e harmony)
- ❌ kitap + ler → Invalid (a-e mismatch)
- ❌ ev + lar → Invalid (e-a mismatch)
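The four cases above can be reproduced with a small Python sketch. The function names mirror the Rust helpers listed in this PR, but the bodies are an assumed re-implementation for illustration. In particular, returning `True` when the stem or suffix has no vowel is a guess, not documented behavior, and input is assumed to be lowercase.

```python
from typing import Optional

BACK_VOWELS = frozenset("aıou")
FRONT_VOWELS = frozenset("eiöü")


def is_back_vowel(ch: str) -> bool:
    return ch in BACK_VOWELS


def is_front_vowel(ch: str) -> bool:
    return ch in FRONT_VOWELS


def get_last_vowel(word: str) -> Optional[str]:
    """Return the last vowel in `word`, or None if it has none."""
    for ch in reversed(word):
        if is_back_vowel(ch) or is_front_vowel(ch):
            return ch
    return None


def get_first_vowel(word: str) -> Optional[str]:
    """Return the first vowel in `word`, or None if it has none."""
    for ch in word:
        if is_back_vowel(ch) or is_front_vowel(ch):
            return ch
    return None


def check_vowel_harmony(stem: str, suffix: str) -> bool:
    """True when the stem's last vowel and the suffix's first vowel share backness."""
    last = get_last_vowel(stem)
    first = get_first_vowel(suffix)
    if last is None or first is None:
        return True  # assumption: nothing to compare, so no violation
    return is_back_vowel(last) == is_back_vowel(first)


for stem, suffix in [("kitap", "lar"), ("ev", "ler"), ("kitap", "ler"), ("ev", "lar")]:
    print(stem, "+", suffix, "->", check_vowel_harmony(stem, suffix))
```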

## Impact

- Improves morphological analysis accuracy
- Prevents over-stripping in heuristic lemmatizer
- Foundation for future morphological analyzer
- Exposes harmony checker to Python API for research use

Closes #52
ada-cinar self-assigned this on Jan 26, 2026
- Update README.md with vowel harmony feature examples
- Add check_vowel_harmony_py to Python type stubs (_durak_core.pyi)
- Document vowel harmony constraint in strip_suffixes docstring
- Add usage examples for harmony validation
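
For orientation, the stub entry mentioned above might look roughly like the following. The function name and the `(stem, suffix)` parameters come from this PR; the `str` parameter types and the `bool` return type are assumptions, since the stub itself is not shown here.

```python
def check_vowel_harmony_py(stem: str, suffix: str) -> bool:
    """Report whether `suffix` agrees in backness with `stem` (types assumed)."""
    ...
```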

Closes #52

Development

Successfully merging this pull request may close these issues.

[Enhancement] Add Vowel Harmony Validator for Suffix Stripping
