feat: Add Lemmatization Evaluation Framework (#56) by ada-cinar · Pull Request #60 · cdliai/durak

ada-cinar · 2026-01-27T01:35:54Z

Summary

Closes #56. Implements an evaluation framework for comparing lemmatization strategies, with automated regression testing in CI.

Changes

📊 Gold-Standard Test Set (109 cases)

Expanded from the initial 73 to 109 hand-curated test pairs, covering:

✅ Nouns: plural, cases, possessives (ev → evler, kitabı, evim)
✅ Verbs: present/past/future + NEW: conditional, imperative, participles
✅ Pronouns: personal with cases
✅ NEW: Proper nouns with apostrophes (Ahmet'in, İstanbul'da)
✅ NEW: Compound words (görebildim, yapabiliyorum)
✅ NEW: Adjective-to-noun derivations (güzellik, zenginlik)
✅ Edge cases: protection rules, unknown words
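As a rough illustration of how such a test set might be consumed, the sketch below reads word/lemma pairs from tab-separated text. The two-column layout (surface form, then expected lemma), the `#` comment convention, and the `load_pairs` helper are assumptions made for this example; the actual layout of gold_standard.tsv is not shown in this PR description, and the sample rows are illustrative rather than copied from the file.

```python
import csv
import io

# Illustrative rows only; assumed layout is "surface form<TAB>expected lemma".
SAMPLE_TSV = "evler\tev\nevim\tev\nİstanbul'da\tİstanbul\n"


def load_pairs(handle):
    """Read (word, lemma) pairs from a tab-separated text stream (hypothetical helper)."""
    pairs = []
    # QUOTE_NONE keeps quote characters in words as literal text.
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        # Skip blank lines, short rows, and (assumed) comment lines starting with '#'.
        if len(row) < 2 or row[0].startswith("#"):
            continue
        pairs.append((row[0].strip(), row[1].strip()))
    return pairs


pairs = load_pairs(io.StringIO(SAMPLE_TSV))
print(pairs)
```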

🔧 Evaluation Script

scripts/evaluate_lemmatizer.py - Full-featured CLI tool
Metrics: accuracy, correct/incorrect counts, detailed error analysis
Single strategy or compare-all mode
Baseline storage + regression detection
Pretty-printed comparison tables
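The metrics listed above can be sketched in a few lines. This is a minimal illustration, not the code in scripts/evaluate_lemmatizer.py: `evaluate`, `EvalResult`, and the toy `toy_lookup` strategy are hypothetical names, and the three test pairs are illustrative rather than taken from the gold-standard set.

```python
from typing import Callable, Iterable, List, NamedTuple, Tuple


class EvalResult(NamedTuple):
    total: int
    correct: int
    accuracy: float  # fraction in [0, 1]
    errors: List[Tuple[str, str, str]]  # (word, expected lemma, predicted lemma)


def evaluate(pairs: Iterable[Tuple[str, str]], lemmatize: Callable[[str], str]) -> EvalResult:
    """Score one strategy: exact-match accuracy plus the list of mismatches."""
    total = 0
    errors = []
    for word, expected in pairs:
        total += 1
        predicted = lemmatize(word)
        if predicted != expected:
            errors.append((word, expected, predicted))
    correct = total - len(errors)
    accuracy = correct / total if total else 0.0
    return EvalResult(total, correct, accuracy, errors)


# Toy stand-in for a dictionary-based strategy; unknown words are returned unchanged.
TOY_LEMMAS = {"evler": "ev", "evim": "ev"}


def toy_lookup(word: str) -> str:
    return TOY_LEMMAS.get(word, word)


result = evaluate([("evler", "ev"), ("evim", "ev"), ("kitabı", "kitap")], toy_lookup)
print(f"{result.correct}/{result.total} correct ({result.accuracy:.1%})")
for word, expected, predicted in result.errors:
    print(f"  {word}: expected {expected}, got {predicted}")
```

Keeping the mismatches alongside the counts is what makes a detailed error report possible without a second pass over the data.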

📈 Updated Baseline Results (v0.4.0)

Strategy	Accuracy	Use Case
lookup	68.8%	Formal text (news, official docs)
heuristic	18.3%	OOV-heavy (social media, slang)
hybrid	69.7%	Balanced (default, recommended)

Note: Accuracy is lower than on the initial 73-case set (97.3% for lookup and hybrid) because the expanded set covers harder, more realistic forms (conditionals, compounds, apostrophes).

🤖 CI Integration (NEW)

✅ Added Evaluate Lemmatizer Quality step to .github/workflows/tests.yml
✅ Runs on Python 3.11 after unit tests
✅ Fails build if accuracy drops >5% from baseline
✅ Baseline stored in benchmarks/lemmatization_baseline.json
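The regression gate described above could look roughly like the sketch below. Two things are assumed here and not confirmed by the PR: that the baseline file maps each strategy name to an accuracy percentage, and that ">5%" means a drop of more than 5 percentage points (rather than 5% relative to the baseline). `find_regressions` is a hypothetical helper, and the `current` values are made up to trigger a failure.

```python
import json

# Assumed schema: {"strategy": accuracy_percent}. The real layout of
# benchmarks/lemmatization_baseline.json may differ.
BASELINE_JSON = '{"lookup": 68.8, "heuristic": 18.3, "hybrid": 69.7}'

MAX_DROP_POINTS = 5.0  # assumption: ">5%" is read as percentage points


def find_regressions(baseline, current, max_drop=MAX_DROP_POINTS):
    """Return {strategy: drop} for every strategy that fell by more than max_drop."""
    regressions = {}
    for strategy, base_acc in baseline.items():
        # A strategy missing from the current run is treated as 0% accuracy.
        drop = base_acc - current.get(strategy, 0.0)
        if drop > max_drop:
            regressions[strategy] = round(drop, 1)
    return regressions


baseline = json.loads(BASELINE_JSON)
current = {"lookup": 68.8, "heuristic": 18.3, "hybrid": 62.0}  # made-up run
regressions = find_regressions(baseline, current)

# A CI step would pass this to sys.exit(); a non-zero code fails the build.
exit_code = 1 if regressions else 0
print(regressions, exit_code)
```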

📝 Documentation (NEW)

✅ Added "Choosing a Lemmatization Strategy" section to docs/BEST_PRACTICES.md
✅ Includes accuracy benchmarks, usage guidelines, strategy selection criteria
✅ Links to evaluation README for custom test sets

Usage

# Compare all strategies
python scripts/evaluate_lemmatizer.py --all

# Save baseline (already done)
python scripts/evaluate_lemmatizer.py --all --save-baseline

# Check for regressions (CI-ready)
python scripts/evaluate_lemmatizer.py --all --check-regression

# Show detailed errors
python scripts/evaluate_lemmatizer.py --all --show-errors

Testing Results

All strategies evaluated successfully on 109 test cases:

✅ Lookup: 68.8% accuracy (75/109)
✅ Heuristic: 18.3% accuracy (20/109)
✅ Hybrid: 69.7% accuracy (76/109)

Baseline updated in benchmarks/lemmatization_baseline.json for CI regression detection.
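The reported percentages follow directly from the raw counts, which is easy to verify:

```python
# Correct counts per strategy, as reported in the testing results.
TOTAL_CASES = 109
correct_counts = {"lookup": 75, "heuristic": 20, "hybrid": 76}

percentages = {
    name: f"{count / TOTAL_CASES:.1%}" for name, count in correct_counts.items()
}

for name, pct in percentages.items():
    print(f"{name}: {correct_counts[name]}/{TOTAL_CASES} = {pct}")
```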

Success Criteria (from #56)

✅ At least 100 hand-curated test pairs (109 delivered)
✅ Evaluation script outputs metrics (accuracy + error analysis)
✅ Baseline metrics stored in repo (benchmarks/lemmatization_baseline.json)
✅ CI job for regression detection (fails if >5% drop)
✅ Documentation with strategy comparison (BEST_PRACTICES.md)

All requirements from issue #56 now complete. 🎉

Future Work (from #56)

Add domain-specific test sets (news, social media, literature)
Expand to 200+ test cases for even better coverage
Add morphological feature annotations (POS tags, case markers)
Cross-validate against TRMorph gold standard
Add inter-annotator agreement metrics for manual curation

Related Issues

Complements #54, "[Bug] Lemma Dictionary in Rust Core Contains Only Mock Data (3 Entries)" (Lemma Dictionary Expansion): quality validation
Implements testing for #6, "[Feature] Implement Tiered Hybrid Lemmatizer in Rust" (Tiered Hybrid Lemmatizer)
Related to #48, "[Test Coverage] Add Native Rust Unit Tests to src/lib.rs" (Rust Unit Tests): complementary quality infrastructure

Ready for merge! 🚀 This PR delivers complete evaluation infrastructure for quality assurance and informed strategy selection.

- Add gold-standard test set with 73 Turkish word-lemma pairs
- Create evaluate_lemmatizer.py script for strategy comparison
- Implement baseline storage for regression detection
- Achieve 97.3% accuracy with lookup/hybrid strategies
- Add comprehensive evaluation documentation

Resolves #56

- Expand gold_standard.tsv to 109 test cases (100+ requirement met)
  - Add conditional tense, imperatives, participles
  - Add proper nouns with apostrophes
  - Add compound words and complex suffix chains
  - Add adjective-to-noun derivations
- Update baseline metrics (lookup: 68.8%, hybrid: 69.7%, heuristic: 18.3%)
  - Lower accuracy reflects more challenging test set
  - Better represents real-world lemmatization complexity
- Add CI regression testing to .github/workflows/tests.yml
  - Fails build if accuracy drops >5% from baseline
  - Runs on Python 3.11 after unit tests
- Document strategy selection in BEST_PRACTICES.md
  - Add comparison table with accuracy benchmarks
  - Provide usage guidelines for each strategy
  - Include custom dataset evaluation instructions

All success criteria from issue #56 now met:
✅ 100+ hand-curated test pairs
✅ Evaluation script with metrics
✅ Baseline metrics stored
✅ CI job for regression detection
✅ Strategy comparison documentation

- Split long lines to comply with 88 char limit
- Extract variables to improve readability

ada-cinar added 3 commits January 27, 2026 04:35

fix: resolve ruff E501 linting errors in evaluate_lemmatizer.py

ac9830d


ada-cinar wants to merge 3 commits into main from feature/56-lemma-evaluation-framework
