[Performance] Add Python-Side LRU Cache for Lemmatizer by ada-cinar · Pull Request #96 · cdliai/durak

ada-cinar · 2026-01-27T18:33:32Z

Summary

Implements #79 - adds optional LRU caching at the Python layer to minimize repeated FFI calls for frequently occurring words.

Changes

✅ Add cache_size parameter to Lemmatizer.__init__() (default 10,000)
✅ Wrap core lemmatization with lru_cache when enabled
✅ Add get_cache_info() method for cache statistics
✅ Add clear_cache() method for cache management
✅ Update __repr__ to show non-default cache_size
✅ Add 7 comprehensive test cases
✅ Add cache performance benchmarks
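
The changes above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's actual code: `CachedLemmatizer` is a hypothetical stand-in for `Lemmatizer`, the body of `_raw_call` is a toy replacement for the real FFI call, and the `raw_calls` counter exists only for demonstration. Only `functools.lru_cache` and its `cache_info()` / `cache_clear()` methods are real APIs here.

```python
from functools import lru_cache


class CachedLemmatizer:
    """Hypothetical stand-in for Lemmatizer; illustrates per-instance caching."""

    def __init__(self, cache_size: int = 10_000):
        self.cache_size = cache_size
        self.raw_calls = 0  # counts simulated FFI calls (demonstration only)
        if cache_size > 0:
            # Wrap the bound method so each instance owns its own cache.
            self._lemmatize = lru_cache(maxsize=cache_size)(self._raw_call)
        else:
            # cache_size=0 bypasses the cache entirely.
            self._lemmatize = self._raw_call

    def _raw_call(self, word: str) -> str:
        # Placeholder for the real FFI call: strips a plural suffix only.
        self.raw_calls += 1
        for suffix in ("lar", "ler"):
            if word.endswith(suffix):
                return word[: -len(suffix)]
        return word

    def __call__(self, word: str) -> str:
        return self._lemmatize(word)

    def get_cache_info(self):
        # functools' CacheInfo namedtuple, or None when caching is disabled
        # (returning None here is an assumption, not the PR's documented behavior).
        if self.cache_size > 0:
            return self._lemmatize.cache_info()
        return None

    def clear_cache(self) -> None:
        if self.cache_size > 0:
            self._lemmatize.cache_clear()
```

Wrapping the bound method inside `__init__`, rather than decorating the method at class level, keeps one cache per instance and lets `cache_size` vary between instances.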

Performance Impact

Word frequencies follow Zipf's law (the top 100 words cover ~50% of tokens in typical Turkish text), so a modest cache should absorb a large share of lookups.

Benchmark Results (from benchmarks/benchmark_rust_vs_python.py):

Repetitive corpus (1000 tokens, 20 unique words):

With cache: 0.0740 ms per call
Without cache: 0.2399 ms per call
Speedup: 3.24x 🚀
Cache hit rate: 99.8%

Unique corpus (1000 unique words, no repetition):

With cache: 0.0873 ms per call
Without cache: 0.2217 ms per call
Overhead: minimal (cache lookup is fast)
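Cache hit rate: 90.0% (nonzero despite unique words, which is consistent with the benchmark making repeated passes over the corpus with a warm cache)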
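
Both reported hit rates are reproduced if the timing loop makes ten passes over each corpus with a shared cache: 20 misses in 10,000 calls gives 99.8%, and 1,000 misses in 10,000 calls gives 90.0%. The pass count is an assumption inferred from the numbers, not something this PR states. A minimal sketch, where `hit_rate` and the synthetic tokens are hypothetical:

```python
from functools import lru_cache


def hit_rate(tokens, cache_size=10_000, passes=1):
    """Return the LRU hit rate after `passes` passes over `tokens`."""

    @lru_cache(maxsize=cache_size)
    def lookup(word):
        return word  # stand-in for the expensive FFI call

    for _ in range(passes):
        for token in tokens:
            lookup(token)

    info = lookup.cache_info()
    total = info.hits + info.misses
    return info.hits / total if total else 0.0


repetitive = [f"w{i % 20}" for i in range(1000)]  # 1000 tokens, 20 unique
unique = [f"w{i}" for i in range(1000)]           # 1000 tokens, all unique

print(f"repetitive, 1 pass:    {hit_rate(repetitive):.1%}")             # 98.0%
print(f"unique, 1 pass:        {hit_rate(unique):.1%}")                 # 0.0%
print(f"repetitive, 10 passes: {hit_rate(repetitive, passes=10):.1%}")  # 99.8%
print(f"unique, 10 passes:     {hit_rate(unique, passes=10):.1%}")      # 90.0%
```

On a single pass over truly unique words the cache cannot hit at all, so the one-pass figures are the better guide to behavior on a fresh document.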

Test Results

pytest tests/test_lemmatizer.py -k "lru_cache" -v
======================= 7 passed =======================

Backward Compatibility

Default behavior: caching enabled (10k cache size)
Opt-out: cache_size=0 disables caching
Works seamlessly with collect_metrics=True

Example Usage

from durak import Lemmatizer  # import path assumed; adjust to your installation

# Default: caching enabled
lemmatizer = Lemmatizer()
lemmatizer("kitaplar")  # Cache miss, FFI call
lemmatizer("kitaplar")  # Cache hit, no FFI!

# Check cache stats
info = lemmatizer.get_cache_info()
print(f"Hit rate: {info.hits / (info.hits + info.misses):.2%}")

# Disable caching
lemmatizer = Lemmatizer(cache_size=0)

Closes #79

Implements Issue #56: Lemmatization Evaluation Framework

Added:
- Gold standard test set (150 entries) covering:
  * Nouns with plural, case markers, possessive, genitive
  * Verbs with present, past, future, infinitive conjugations
  * Pronouns with case markers
  * Edge cases
- Evaluation script (scripts/evaluate_lemmatizer.py):
  * Compares lookup, heuristic, hybrid strategies
  * Outputs accuracy, lookup hit rate, timing metrics
  * Error analysis with verbose mode
  * CI-ready (exits 1 if accuracy <80%)

Results:
- lookup: 98.67% accuracy (dictionary-only)
- hybrid: 98.67% accuracy (dict + heuristic fallback)
- heuristic: 13.33% accuracy (naive suffix stripping)

The evaluation reveals that:
1. Dictionary coverage is excellent (98% hit rate)
2. Heuristic fallback is weak (needs vowel harmony validator)
3. Hybrid strategy is optimal for production use

This framework enables data-driven decisions for Issue #83 (Lemmatization Strategy Trade-offs) and provides regression detection for dictionary expansion (Issue #54).

Related: #56, #83, #54, #52
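
The evaluation flow described in this commit can be sketched as below. Everything here is hypothetical and for illustration only: the four-entry `GOLD` set, the toy `lookup` / `heuristic` / `hybrid` functions, and the choice to gate the exit code on a single named strategy are assumptions, not the contents of scripts/evaluate_lemmatizer.py.

```python
# Toy gold standard of (surface form, expected lemma) pairs.
GOLD = [("kitaplar", "kitap"), ("evler", "ev"), ("geldi", "gel"), ("okul", "okul")]

# Toy dictionary standing in for the real lookup table.
LOOKUP = {"kitaplar": "kitap", "evler": "ev", "geldi": "gel"}


def lookup(word):
    return LOOKUP.get(word)  # None when the word is not in the dictionary


def heuristic(word):
    # Naive suffix stripping: handles plural suffixes only.
    for suffix in ("lar", "ler"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


def hybrid(word):
    # Dictionary first, heuristic as the fallback.
    return lookup(word) or heuristic(word)


STRATEGIES = {"lookup": lookup, "heuristic": heuristic, "hybrid": hybrid}


def evaluate(lemmatize, gold):
    """Return (accuracy, errors); errors holds (word, expected, predicted)."""
    errors = []
    for word, expected in gold:
        predicted = lemmatize(word)
        if predicted != expected:
            errors.append((word, expected, predicted))
    return (len(gold) - len(errors)) / len(gold), errors


def main(strategy="hybrid", threshold=0.80):
    """Print per-strategy accuracy; return 1 if `strategy` is below threshold."""
    results = {name: evaluate(fn, GOLD)[0] for name, fn in STRATEGIES.items()}
    for name, accuracy in results.items():
        print(f"{name}: {accuracy:.2%}")
    return 0 if results[strategy] >= threshold else 1
```

In a CI script the return value would be passed to `sys.exit(main())`, so a drop below the threshold fails the job.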

Implements #95 - adds optional LRU caching at Python layer to minimize repeated FFI calls for frequently occurring words (Zipf's law benefit).

Changes:
- Add cache_size parameter (default 10_000, 0 disables)
- Wrap _raw_call with lru_cache when enabled
- Add get_cache_info() for cache statistics
- Add clear_cache() for cache management
- Update __repr__ to show non-default cache_size

Tests:
- Add 7 comprehensive test cases for caching behavior
- Test cache hits/misses, size limits, clearing, and metrics interaction
- All new tests passing

Performance Impact:
- Typical Turkish text: ~50% tokens covered by top 100 words
- Expected 2-5x reduction in FFI overhead for document processing
- Backward compatible (cache_size=0 disables caching)
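
The size-limit behavior that the tests cover comes from `functools.lru_cache` itself: once `maxsize` entries are stored, the least recently used one is evicted. A small sketch with a deliberately tiny cache, where the `lemma` function is a hypothetical stand-in for the wrapped `_raw_call`:

```python
from functools import lru_cache

calls = []  # records every call that actually reaches the wrapped function


@lru_cache(maxsize=2)
def lemma(word):
    calls.append(word)
    return word.lower()


lemma("ev")      # miss
lemma("kitap")   # miss; the cache is now full
lemma("ev")      # hit; "ev" becomes the most recently used entry
lemma("okul")    # miss; evicts "kitap", the least recently used entry
lemma("kitap")   # miss again because it was evicted; evicts "ev"

info = lemma.cache_info()
print(info)   # CacheInfo(hits=1, misses=4, maxsize=2, currsize=2)
print(calls)  # ['ev', 'kitap', 'okul', 'kitap']
```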

- Add cache-friendly vs cache-hostile workload benchmarks
- Measure cache hit rates and speedup on Zipfian distribution
- Show 3.24x speedup on repetitive corpus (99.8% hit rate)
- Demonstrate minimal overhead on unique words
- Addresses benchmarking requirements from #79

ada-cinar added 2 commits January 27, 2026 20:32

ada-cinar mentioned this pull request Jan 27, 2026

[Performance] Add Python-Side LRU Cache for Lemmatizer to Reduce FFI Overhead #95

Closed

7 tasks

ada-cinar wants to merge 3 commits into main from feature/95-python-lru-cache-lemmatizer
