18 changes: 17 additions & 1 deletion .github/workflows/tests.yml
@@ -14,7 +14,7 @@ jobs:
        python-version: ["3.9", "3.10", "3.11", "3.12"]
    steps:
      - name: Checkout repository
-       uses: actions/checkout@v4
+       uses: actions/checkout@v6

      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
@@ -24,6 +24,19 @@ jobs:
      - name: Install Rust toolchain
        uses: dtolnay/rust-toolchain@stable

      # Run Rust checks once (not for every Python version)
      - name: Run Rust unit tests
        if: matrix.python-version == '3.11'
        run: cargo test --all-features

      - name: Lint Rust code with Clippy
        if: matrix.python-version == '3.11'
        run: cargo clippy -- -D warnings

      - name: Check Rust formatting
        if: matrix.python-version == '3.11'
        run: cargo fmt --check

      - name: Install dependencies and build package
        run: |
          python -m pip install --upgrade pip
@@ -38,6 +51,9 @@ jobs:
      - name: Run tests with coverage
        run: pytest --cov=durak --cov-report=xml --cov-report=term

      - name: Run property-based tests with statistics
        run: pytest tests/test_properties.py --hypothesis-show-statistics -v

      - name: Upload coverage to Codecov
        uses: codecov/codecov-action@v4
        if: matrix.python-version == '3.11'
70 changes: 70 additions & 0 deletions README.md
@@ -113,10 +113,80 @@ suffixes = _durak_core.get_detached_suffixes()
- **Unicode-aware cleaning**: Turkish-specific normalization (İ/ı, I/i handling)
- **Configurable stopword management**: Keep-lists, custom additions, domain-specific sets
- **Regex-based tokenizer**: Preserves Turkish morphology (clitics, suffixes, apostrophes)
- **Tiered lemmatization**: Dictionary lookup + heuristic fallback with performance metrics
- **Offset tracking**: Character-accurate positions for NER and span tasks
- **Embedded resources**: Zero file I/O, compiled directly into binary
- **Type-safe**: Complete `.pyi` stubs for IDE support and static analysis

## Lemmatization

Durak provides a **tiered lemmatizer** that combines dictionary lookup with heuristic suffix stripping. Three strategies are available:

- **`lookup`**: Fast exact dictionary matches (high precision, lower recall)
- **`heuristic`**: Rule-based suffix stripping (handles OOV words)
- **`hybrid`**: Lookup first, then fall back to heuristic (default, best balance)
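
The tiering described above can be sketched in a few lines. This is a toy illustration only: `TOY_LEMMAS`, `TOY_SUFFIXES`, and the three helper functions are hypothetical stand-ins, not Durak's embedded resources or its actual rules, and returning the input unchanged on a lookup miss is an assumption.

```python
# Hypothetical toy data; Durak's real dictionary and suffix rules are not shown here.
TOY_LEMMAS = {"kitaplar": "kitap", "geliyorum": "gel"}
TOY_SUFFIXES = ("lerde", "larda", "leri", "ları", "ler", "lar")  # longest first


def lookup(word):
    """Return the dictionary lemma, or None when the word is not covered."""
    return TOY_LEMMAS.get(word)


def heuristic(word, min_stem=2):
    """Strip the first matching suffix, keeping at least `min_stem` characters."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def lemmatize(word, strategy="hybrid"):
    """Dispatch on strategy; hybrid tries the dictionary before the heuristic."""
    if strategy == "lookup":
        return lookup(word) or word  # assumption: a miss returns the input as-is
    if strategy == "heuristic":
        return heuristic(word)
    if strategy == "hybrid":
        return lookup(word) or heuristic(word)
    raise ValueError(f"unknown strategy: {strategy}")


for w in ["kitaplar", "evlerde", "geliyorum"]:
    print(w, [lemmatize(w, s) for s in ("lookup", "heuristic", "hybrid")])
```

In this toy setup, only the hybrid path resolves all three words, which is the trade-off the list above describes: lookup misses out-of-vocabulary forms, and the heuristic misses forms its suffix list does not cover.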

### Basic Usage

```python
from durak import Lemmatizer

lemmatizer = Lemmatizer(strategy="hybrid")

print(lemmatizer("kitaplar")) # "kitap" (plural → singular)
print(lemmatizer("geliyorum")) # "gel" (conjugated → root)
print(lemmatizer("evleri")) # "ev" (possessive + plural → root)
```

### Performance Metrics

Enable metrics collection to compare strategies and monitor performance:

```python
lemmatizer = Lemmatizer(strategy="hybrid", collect_metrics=True)

# Process your corpus (any iterable of word strings)
for word in corpus:
    lemma = lemmatizer(word)

# View detailed metrics
print(lemmatizer.get_metrics())
```

**Output:**
```
Lemmatizer Metrics:
Total Calls: 10,000
Lookup Hits: 7,234 (72.3% of all calls)
Lookup Hit Rate: 72.3%
Heuristic Fallbacks: 2,766
Avg Call Time: 0.042ms
Total Time: 0.420s
Lookup Time: 0.274s
Heuristic Time: 0.146s
```
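
To make the relationship between these numbers explicit, here is a minimal sketch of how the derived values could follow from raw counters. `ToyMetrics` is a hypothetical class; its field names mirror the attributes used elsewhere in this README, but the formulas are assumptions, not the actual `LemmatizerMetrics` implementation.

```python
from dataclasses import dataclass


@dataclass
class ToyMetrics:
    """Hypothetical counters; not Durak's LemmatizerMetrics."""

    total_calls: int = 0
    lookup_hits: int = 0
    heuristic_calls: int = 0
    total_time: float = 0.0  # seconds

    @property
    def cache_hit_rate(self):
        # Assumed definition: share of calls answered by the dictionary.
        return self.lookup_hits / self.total_calls if self.total_calls else 0.0

    @property
    def avg_call_time_ms(self):
        # Assumed definition: mean wall-clock time per call, in milliseconds.
        if not self.total_calls:
            return 0.0
        return 1000 * self.total_time / self.total_calls


# Counters taken from the sample output above.
m = ToyMetrics(total_calls=10_000, lookup_hits=7_234,
               heuristic_calls=2_766, total_time=0.420)
print(f"{m.cache_hit_rate:.1%} hit rate, {m.avg_call_time_ms:.3f}ms avg")
```

Under these assumed definitions the sample output is internally consistent: hits plus fallbacks equal the total call count, and the lookup and heuristic times sum to the total time.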

### Strategy Comparison

Compare all three strategies empirically:

```python
corpus = load_your_corpus()
strategies = ["lookup", "heuristic", "hybrid"]

for strategy in strategies:
    lemmatizer = Lemmatizer(strategy=strategy, collect_metrics=True)

    for word in corpus:
        lemmatizer(word)

    metrics = lemmatizer.get_metrics()
    print(f"\n{strategy.upper()}: {metrics.cache_hit_rate:.1%} hit rate, "
          f"{metrics.avg_call_time_ms:.3f}ms avg")
```

See [`examples/lemmatizer_metrics_demo.py`](examples/lemmatizer_metrics_demo.py) for comprehensive usage examples.

## Development Setup

### Building from Source
21 changes: 21 additions & 0 deletions benchmarks/lemmatization_baseline.json
@@ -0,0 +1,21 @@
{
  "baseline_version": "0.4.0",
  "test_set": "gold_standard.tsv",
  "strategies": {
    "lookup": {
      "accuracy": 0.6880733944954128,
      "correct": 75,
      "total": 109
    },
    "heuristic": {
      "accuracy": 0.1834862385321101,
      "correct": 20,
      "total": 109
    },
    "hybrid": {
      "accuracy": 0.6972477064220184,
      "correct": 76,
      "total": 109
    }
  }
}
55 changes: 55 additions & 0 deletions docs/BEST_PRACTICES.md
@@ -199,6 +199,61 @@ pipeline = Pipeline([
])
```

### Choosing a Lemmatization Strategy

**Durak supports three lemmatization strategies.** Accuracy figures come from the baseline in `benchmarks/lemmatization_baseline.json` (`gold_standard.tsv`, 109 test items):

| Strategy             | Accuracy       | Best For                                  |
|----------------------|----------------|-------------------------------------------|
| **lookup**           | 68.8% (75/109) | Formal/standard Turkish (news, documents) |
| **heuristic**        | 18.3% (20/109) | OOV-heavy domains (social media, slang)   |
| **hybrid** (default) | 69.7% (76/109) | Balanced precision/recall (most research) |

**When to use `lookup`:**
- Corpus is formal/standard Turkish
- Need high precision for dictionary-covered words
- Fast processing required
- Example: news articles, official documents

**When to use `heuristic`:**
- OOV-heavy domains (social media, misspellings)
- Better recall on unknown words is needed
- Can tolerate lower precision
- Example: Twitter data, informal chat

**When to use `hybrid` (recommended):**
- General-purpose NLP tasks
- Balanced precision/recall trade-off
- Most research and production applications
- Falls back to heuristic when lookup fails

**Usage:**

```python
from durak.lemmatizer import Lemmatizer

# Default: hybrid strategy
lemmatizer = Lemmatizer()

# Explicit strategy
lemmatizer = Lemmatizer(strategy="lookup")

# Example usage
lemmas = [lemmatizer(word) for word in ["kitaplar", "geliyorum", "evlerde"]]
```

**Evaluating custom datasets:**

```bash
# Run evaluation on your own test set
python scripts/evaluate_lemmatizer.py --all --test-set my_test.tsv

# Check for regressions after dictionary updates
python scripts/evaluate_lemmatizer.py --all --check-regression
```

See `resources/tr/lemmas/eval/README.md` for details on creating custom test sets and interpreting results.
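
The idea behind a regression check can be sketched as follows. This is not the implementation of `scripts/evaluate_lemmatizer.py`, which is not shown here; `find_regressions`, its `tolerance` parameter, and the `(correct, total)` input shape are hypothetical. Only the baseline values are taken from `benchmarks/lemmatization_baseline.json`.

```python
import json

# Values copied from benchmarks/lemmatization_baseline.json.
BASELINE = json.loads("""
{
  "strategies": {
    "lookup":    {"accuracy": 0.6880733944954128, "correct": 75, "total": 109},
    "heuristic": {"accuracy": 0.1834862385321101, "correct": 20, "total": 109},
    "hybrid":    {"accuracy": 0.6972477064220184, "correct": 76, "total": 109}
  }
}
""")


def find_regressions(baseline, current_counts, tolerance=1e-9):
    """Return {strategy: (baseline_acc, current_acc)} for every accuracy drop.

    `current_counts` maps a strategy name to a (correct, total) pair.
    """
    regressions = {}
    for name, entry in baseline["strategies"].items():
        correct, total = current_counts[name]
        current_acc = correct / total
        # The small tolerance absorbs floating-point noise in stored accuracies.
        if current_acc < entry["accuracy"] - tolerance:
            regressions[name] = (entry["accuracy"], current_acc)
    return regressions


unchanged = {"lookup": (75, 109), "heuristic": (20, 109), "hybrid": (76, 109)}
worse = {**unchanged, "hybrid": (70, 109)}  # made-up counts for illustration

print(find_regressions(BASELINE, unchanged))
print(sorted(find_regressions(BASELINE, worse)))
```

Comparing against stored accuracy rather than raw counts keeps the check meaningful if the test set grows, at the cost of needing a tolerance for float comparison.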

### Suffix Configuration

**Current state**: Rust suffixes are hard-coded for demo purposes
161 changes: 161 additions & 0 deletions examples/lemmatizer_metrics_demo.py
@@ -0,0 +1,161 @@
#!/usr/bin/env python3
"""
Lemmatizer Performance Metrics Demo

Demonstrates how to use the metrics collection feature to compare
lemmatization strategies and monitor performance.

Issue #63: Add Strategy Performance Metrics to Lemmatizer
"""

from durak.lemmatizer import Lemmatizer


def demo_basic_metrics():
    """Basic metrics collection example"""
    print("=" * 60)
    print("BASIC METRICS COLLECTION")
    print("=" * 60)

    lemmatizer = Lemmatizer(strategy="hybrid", collect_metrics=True)

    # Process some sample words
    test_words = [
        "kitaplar", "evler", "geliyorum", "gidiyorum",
        "unknownword123", "testleri", "arabalar",
    ]

    results = {}
    for word in test_words:
        lemma = lemmatizer(word)
        results[word] = lemma

    # Display results
    print("\nLemmatization Results:")
    for word, lemma in results.items():
        status = "📖" if lemma != word else "🔧"
        print(f" {status} {word:<20} → {lemma}")

    # Show metrics
    print(f"\n{lemmatizer.get_metrics()}")


def demo_strategy_comparison():
    """Compare all three strategies side-by-side"""
    print("\n" + "=" * 60)
    print("STRATEGY COMPARISON")
    print("=" * 60)

    # Test corpus
    corpus = [
        # Words likely in dictionary
        "kitaplar", "evler", "geliyorum", "gittim",
        # Words likely NOT in dictionary
        "unknownword", "testleri", "deneysel",
        # Common words
        "insanlar", "çocuklar", "yapıyorum",
    ]

    strategies = ["lookup", "heuristic", "hybrid"]

    for strategy in strategies:
        lemmatizer = Lemmatizer(strategy=strategy, collect_metrics=True)

        for word in corpus:
            _ = lemmatizer(word)

        metrics = lemmatizer.get_metrics()

        print(f"\n{'─' * 60}")
        print(f"Strategy: {strategy.upper()}")
        print(f"{'─' * 60}")
        print(f" Total Calls: {metrics.total_calls:,}")
        print(f" Lookup Hits: {metrics.lookup_hits:,}")
        print(f" Heuristic Calls: {metrics.heuristic_calls:,}")
        print(f" Cache Hit Rate: {metrics.cache_hit_rate:.1%}")
        print(f" Avg Call Time: {metrics.avg_call_time_ms:.3f}ms")


def demo_large_corpus():
    """Benchmark with larger corpus"""
    print("\n" + "=" * 60)
    print("LARGE CORPUS BENCHMARK")
    print("=" * 60)

    # Simulate larger corpus (repeated words)
    base_words = [
        "kitaplar", "evler", "insanlar", "çocuklar",
        "geliyorum", "gidiyorum", "yapıyorum",
        "arabalar", "masalar", "testleri",
    ]

    # Repeat to create ~1000 calls
    corpus = base_words * 100

    lemmatizer = Lemmatizer(strategy="hybrid", collect_metrics=True)

    for word in corpus:
        _ = lemmatizer(word)

    metrics = lemmatizer.get_metrics()

    print(f"\nProcessed {metrics.total_calls:,} words")
    hit_pct = metrics.cache_hit_rate
    print(f"Lookup Hits: {metrics.lookup_hits:,} ({hit_pct:.1%})")
    print(f"Heuristic Fallbacks: {metrics.heuristic_calls:,}")
    print(f"Total Time: {metrics.total_time:.3f}s")
    print(f"Avg Call Time: {metrics.avg_call_time_ms:.4f}ms")
    throughput = metrics.total_calls / metrics.total_time
    print(f"Throughput: {throughput:,.0f} words/sec")


def demo_incremental_monitoring():
    """Monitor metrics over time with resets"""
    print("\n" + "=" * 60)
    print("INCREMENTAL MONITORING")
    print("=" * 60)

    lemmatizer = Lemmatizer(strategy="hybrid", collect_metrics=True)

    batches = [
        ["kitaplar", "evler", "geliyorum"],
        ["arabalar", "masalar", "testleri"],
        ["unknownword1", "unknownword2", "unknownword3"],
    ]

    for i, batch in enumerate(batches, 1):
        lemmatizer.reset_metrics()

        for word in batch:
            _ = lemmatizer(word)

        metrics = lemmatizer.get_metrics()

        print(f"\nBatch {i}:")
        print(f" Words: {metrics.total_calls}")
        print(f" Lookup Hits: {metrics.lookup_hits} ({metrics.cache_hit_rate:.0%})")
        print(f" Heuristic: {metrics.heuristic_calls}")


def main():
    """Run all demos"""
    print("\n🔬 Lemmatizer Performance Metrics Demo")
    print("Issue #63: Strategy Performance Metrics\n")

    try:
        demo_basic_metrics()
        demo_strategy_comparison()
        demo_large_corpus()
        demo_incremental_monitoring()

        print("\n" + "=" * 60)
        print("✅ All demos completed successfully!")
        print("=" * 60)

    except ImportError as e:
        print(f"\n❌ Error: {e}")
        print("Make sure durak is installed: pip install -e .")


if __name__ == "__main__":
    main()
3 changes: 2 additions & 1 deletion python/durak/__init__.py
@@ -5,7 +5,7 @@
from importlib import metadata

from .cleaning import clean_text, collapse_whitespace, normalize_case, normalize_unicode
-from .lemmatizer import Lemmatizer
+from .lemmatizer import Lemmatizer, LemmatizerMetrics
from .normalizer import Normalizer
from .pipeline import Pipeline, process_text
from .stopwords import (
@@ -40,6 +40,7 @@
    "DEFAULT_STOPWORD_RESOURCE",
    "DEFAULT_DETACHED_SUFFIXES",
    "Lemmatizer",
    "LemmatizerMetrics",
    "Normalizer",
    "Pipeline",
    "StopwordManager",