[Performance] Add Python-Side LRU Cache for Lemmatizer#96
Open
ada-cinar wants to merge 3 commits into
Implements Issue #56: Lemmatization Evaluation Framework
Added:
- Gold standard test set (150 entries) covering:
  * Nouns with plural, case markers, possessive, genitive
  * Verbs with present, past, future, infinitive conjugations
  * Pronouns with case markers
  * Edge cases
- Evaluation script (scripts/evaluate_lemmatizer.py):
  * Compares lookup, heuristic, hybrid strategies
  * Outputs accuracy, lookup hit rate, timing metrics
  * Error analysis with verbose mode
  * CI-ready (exits 1 if accuracy <80%)
Results:
- lookup: 98.67% accuracy (dictionary-only)
- hybrid: 98.67% accuracy (dict + heuristic fallback)
- heuristic: 13.33% accuracy (naive suffix stripping)
The evaluation reveals that:
1. Dictionary coverage is excellent (98% hit rate)
2. Heuristic fallback is weak (needs vowel harmony validator)
3. Hybrid strategy is optimal for production use
This framework enables data-driven decisions for Issue #83 (Lemmatization Strategy Trade-offs) and provides regression detection for dictionary expansion (Issue #54).
Related: #56, #83, #54, #52
Implements #95 - adds optional LRU caching at Python layer to minimize repeated FFI calls for frequently occurring words (Zipf's law benefit).
Changes:
- Add cache_size parameter (default 10_000, 0 disables)
- Wrap _raw_call with lru_cache when enabled
- Add get_cache_info() for cache statistics
- Add clear_cache() for cache management
- Update __repr__ to show non-default cache_size
Tests:
- Add 7 comprehensive test cases for caching behavior
- Test cache hits/misses, size limits, clearing, and metrics interaction
- All new tests passing
Performance Impact:
- Typical Turkish text: ~50% tokens covered by top 100 words
- Expected 2-5x reduction in FFI overhead for document processing
- Backward compatible (cache_size=0 disables caching)
- Add cache-friendly vs cache-hostile workload benchmarks
- Measure cache hit rates and speedup on Zipfian distribution
- Show 3.24x speedup on repetitive corpus (99.8% hit rate)
- Demonstrate minimal overhead on unique words
- Addresses benchmarking requirements from #79
Summary
Implements #79: adds optional LRU caching at the Python layer to minimize repeated FFI calls for frequently occurring words.
Changes
- Add cache_size parameter to Lemmatizer.__init__() (default 10,000)
- Wrap _raw_call with lru_cache when enabled
- Add get_cache_info() method for cache statistics
- Add clear_cache() method for cache management
- Update __repr__ to show non-default cache_size
Performance Impact
Based on Zipf's law (top 100 words cover ~50% of tokens in typical Turkish text):
Benchmark Results (from benchmarks/benchmark_rust_vs_python.py):
Repetitive corpus (1000 tokens, 20 unique words):
Unique corpus (1000 unique words, no repetition):
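As a rough illustration of why these two corpus shapes sit at opposite ends of the cache's behavior, the sketch below counts hits and misses with functools.lru_cache directly. It is not the PR's benchmark script: fake_lemmatize and measure_hit_rate are hypothetical stand-ins for the FFI call and the harness, the corpora are synthetic, and no timing is measured.

```python
from functools import lru_cache


def measure_hit_rate(tokens, cache_size=10_000):
    """Run tokens through an LRU-cached stand-in and return (hits, misses)."""

    @lru_cache(maxsize=cache_size)
    def fake_lemmatize(word):
        # Hypothetical stand-in for the FFI call; the real work is irrelevant here.
        return word.lower()

    for token in tokens:
        fake_lemmatize(token)
    info = fake_lemmatize.cache_info()
    return info.hits, info.misses


# Repetitive corpus: 1000 tokens cycling over 20 unique words.
vocab = [f"word{i}" for i in range(20)]
repetitive = [vocab[i % len(vocab)] for i in range(1000)]

# Unique corpus: 1000 distinct words, no repetition.
unique = [f"unique{i}" for i in range(1000)]

rep_hits, rep_misses = measure_hit_rate(repetitive)
uniq_hits, uniq_misses = measure_hit_rate(unique)

print(f"repetitive: {rep_hits} hits, {rep_misses} misses")  # 980 hits, 20 misses
print(f"unique:     {uniq_hits} hits, {uniq_misses} misses")  # 0 hits, 1000 misses
```

In the repetitive case only the first occurrence of each word reaches the underlying call; in the unique case every lookup is a miss, so the cache adds bookkeeping without saving any calls.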
Test Results
pytest tests/test_lemmatizer.py -k "lru_cache" -v
======================= 7 passed =======================
Backward Compatibility
- cache_size=0 disables caching
- collect_metrics=True
Example Usage
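The snippet that originally appeared here is not available. The following is a minimal sketch of the design described under Changes, assuming a per-instance functools.lru_cache wrapped around _raw_call. SketchLemmatizer, its raw_calls counter, the lowercase placeholder body, and the choice to return None from get_cache_info() when caching is off are all assumptions for illustration, not the project's actual Lemmatizer.

```python
from functools import lru_cache


class SketchLemmatizer:
    """Hypothetical stand-in, not the project's Lemmatizer."""

    def __init__(self, cache_size=10_000):
        self.cache_size = cache_size
        self.raw_calls = 0  # counts how often the uncached path runs
        if cache_size > 0:
            # Wrap the bound method so each instance gets its own cache.
            self._call = lru_cache(maxsize=cache_size)(self._raw_call)
        else:
            self._call = self._raw_call

    def _raw_call(self, word):
        # Placeholder for the FFI call into the native lemmatizer.
        self.raw_calls += 1
        return word.lower()

    def __call__(self, word):
        return self._call(word)

    def get_cache_info(self):
        # Returning None when caching is disabled is an assumption of this sketch.
        return self._call.cache_info() if self.cache_size > 0 else None

    def clear_cache(self):
        if self.cache_size > 0:
            self._call.cache_clear()

    def __repr__(self):
        if self.cache_size == 10_000:
            return "SketchLemmatizer()"
        return f"SketchLemmatizer(cache_size={self.cache_size})"


lemma = SketchLemmatizer(cache_size=2)
for word in ["Ev", "Ev", "Ev"]:
    lemma(word)
info = lemma.get_cache_info()
print(info.hits, info.misses, lemma.raw_calls)  # 2 1 1

uncached = SketchLemmatizer(cache_size=0)
uncached("Ev")
uncached("Ev")
print(uncached.raw_calls)  # 2
```

Building the cache in __init__ rather than decorating the method at class level keeps cache_size configurable per instance and keeps one instance's statistics separate from another's.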
Closes #79