Modularization and Backend Upgrade #154
…ickstart, switch to MIT license, refactor CLI tests, add new regression tests, and include CI Rust validation.
…tion, and update the API with a new `Pipeline` class and MIT license.
…st compilation check, and introduce regression tests and an architectural plan.
… tests, improve CLI test harness, and fix Rust compilation issues.
…revent over-stripping, and add `cargo check` to CI.
…n implementations, centralize resource management, and enhance Rust-based text processing capabilities.
…ry for improved modularity and CLI structure.
…toring, centralize resource management, and add architectural planning and rule ownership documentation.
…to a dedicated inventory module.
…x inventory, and comprehensive end-to-end tests.
…e boundaries, unify resources, and implement a hybrid backend strategy.
…ng stages as part of an architectural refactoring plan.
…enization and suffix handling, new core data models, and updated documentation.
…uffix attachment logic, and establish core module structure with new documentation.
… backend control, formalizing pipeline stages and rule ownership.
…and migrate Turkish suffix inventory to a Rust backend.
…ance suffix attachment with a resource provider and Rust inventory.
Pull request overview
This PR modularizes Durak’s Python/Rust internals, adds an explicit backend control layer, centralizes resource loading, and introduces broader regression/E2E test coverage and fixtures—while aligning project governance/docs (license, contribution guidance, architecture plan) and tightening CI with a Rust compile check.
Changes:
- Added backend orchestration (`DurakController`), tokenizer “auto” strategy, and capability matrix + docs.
- Centralized Python resource loading and pipeline stage composition; introduced context-aware pipeline execution.
- Added golden + E2E regression fixtures/tests and Rust-side morphology rule modularization (suffix inventory, vowel harmony edge-case handling).
Reviewed changes
Copilot reviewed 41 out of 42 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/test_tokenizer_parity.py | Adds parity checks between regex and Rust tokenizers. |
| tests/test_suffixes.py | Updates suffix-attachment expectations and adds safety regressions. |
| tests/test_story.py | Updates story test to use Pipeline and aligns license header. |
| tests/test_resources_provider.py | Adds tests for centralized resource provider loaders. |
| tests/test_regression_golden.py | Adds golden regression tests across normalization/tokenization/pipeline/lemmatization. |
| tests/test_pipeline_e2e_canonical_gold.py | Adds strict canonical end-to-end gold-output assertions. |
| tests/test_pipeline_e2e_analysis.py | Adds corpus-scale E2E determinism/quality/parity analysis tests. |
| tests/test_pipeline.py | Adds context-aware pipeline tests and updates imports. |
| tests/test_control.py | Adds tests for backend resolver/controller and capability matrix. |
| tests/test_cli.py | Makes CLI subprocess tests reliable by setting PYTHONPATH in helper. |
| tests/data/golden_regression_cases.json | Adds golden fixtures for regression tests. |
| tests/data/e2e_pipeline_corpus_cases.json | Adds diverse corpus fixture for E2E quality checks. |
| tests/data/e2e_pipeline_canonical_gold.json | Adds canonical gold outputs for strict E2E assertions. |
| src/vowel_harmony.rs | Adds special-casing for progressive “-yor” family harmony checks. |
| src/suffix_inventory.rs | Extracts and centralizes Rust suffix inventories/constants. |
| src/morphotactics.rs | Refactors morphotactic classifier to consume shared suffix inventory. |
| src/lib.rs | Switches lemmatization stripping to shared suffix inventory and adds over-stripping guards. |
| resources/tr/config/lemma_suffixes.txt | Marks file as reference-only; canonical rules moved to Rust inventory. |
| python/durak/tokenizer.py | Defaults tokenization strategy to auto and adds Rust availability flag/registry. |
| python/durak/suffixes.py | Routes suffix/apostrophe loading through resource provider; makes joining conservative by default. |
| python/durak/stopwords.py | Routes metadata path constants through resource provider and adds provider-backed metadata read path. |
| python/durak/stages.py | Introduces dedicated pipeline stage callables and a shared step registry. |
| python/durak/resources_provider.py | Adds centralized resource provider with Rust fast paths and file fallbacks. |
| python/durak/pipeline.py | Uses dedicated stage registry; adds run_with_context and process_text_with_context. |
| python/durak/core/types.py | Adds shared internal data models (Document, TokenSpan). |
| python/durak/core/interfaces.py | Adds internal protocol contracts for module boundaries. |
| python/durak/core/__init__.py | Exposes core types/interfaces via durak.core. |
| python/durak/control.py | Adds backend control layer (DurakController) + capability matrix and resolver. |
| python/durak/cli.py | Refactors CLI flow into helpers; reuses shared tokenization pipeline steps. |
| python/durak/__init__.py | Exposes backend control APIs and context-aware processing wrapper. |
| pyproject.toml | Updates SPDX license identifier comment to MIT. |
| docs/RULE_OWNERSHIP.md | Documents canonical sources for rules/resources and maintenance rules. |
| docs/BEST_PRACTICES.md | Formatting/consistency cleanup. |
| docs/BACKENDS.md | Documents backend names, capability matrix, and usage. |
| docs/ARCHITECTURE.md | Formatting tweaks for directory layout depiction. |
| SECURITY.md | Removes standalone security policy file (guidance moved elsewhere). |
| README.md | Aligns licensing text/badge, updates quickstart to Pipeline, and adds CLI quickstart + docs links. |
| PLAN.md | Adds comprehensive architecture/refactoring plan and governance guardrails. |
| CONTRIBUTING.md | Re-formats and expands contribution guidance. |
| CODE_OF_CONDUCT.md | Re-formats pledge text for readability. |
| CHANGELOG.md | Adds entries documenting doc/license alignment and quickstart updates. |
| .github/workflows/tests.yml | Adds cargo check to CI to validate Rust compilation. |
Comments suppressed due to low confidence (1)
CONTRIBUTING.md:69
- The line ending with `*** End Patch` looks like an accidental patch-marker artifact committed into the document. It should be removed so the contribution guide renders cleanly.
- For security-sensitive reports, please email `dev@karagoz.io` instead of opening a public issue.
We appreciate your contributions and look forward to collaborating!*** End Patch

    try:
        from . import _durak_core
        tokenize_with_offsets = _durak_core.tokenize_with_offsets
        tokenize_with_normalized_offsets = _durak_core.tokenize_with_normalized_offsets
        RUST_TOKENIZER_AVAILABLE = True
    except ImportError:
        def tokenize_with_normalized_offsets(text: str) -> list[tuple[str, int, int]]:
            raise RustExtensionError(
                "Rust extension not installed. Run: maturin develop"
            )


    def rust_tokenize(text: str) -> list[str]:
        """Tokenize text via Rust tokenizer offsets."""
        return [token for token, _, _ in tokenize_with_offsets(text)]

In the except ImportError branch you only define tokenize_with_normalized_offsets, but tokenize_with_offsets is not defined. Because durak.control (and potentially other modules) imports tokenize_with_offsets unconditionally, importing durak on a machine without the Rust extension will raise an ImportError. Define a Python fallback tokenize_with_offsets stub that raises RustExtensionError (similar to tokenize_with_normalized_offsets) so the pure-Python installation path remains importable.
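A minimal sketch of the fallback this comment asks for, under stated assumptions: `RustExtensionError` is redefined here as a stand-in for the package's own exception, `_missing_rust` is a hypothetical helper, and the absolute `import _durak_core` replaces the package-relative import so the snippet runs on its own. Setting `RUST_TOKENIZER_AVAILABLE = False` in the fallback branch is also an assumption, since the diff does not show where the flag is initialized.

```python
class RustExtensionError(RuntimeError):
    """Stand-in for Durak's own exception type (assumed to exist in the package)."""


def _missing_rust(name: str):
    """Hypothetical helper: build a stub that fails only when it is called."""

    def _stub(text: str) -> list[tuple[str, int, int]]:
        raise RustExtensionError(
            f"{name} requires the Rust extension. Run: maturin develop"
        )

    _stub.__name__ = name
    return _stub


try:
    # Inside the package this would be `from . import _durak_core`.
    import _durak_core

    tokenize_with_offsets = _durak_core.tokenize_with_offsets
    tokenize_with_normalized_offsets = _durak_core.tokenize_with_normalized_offsets
    RUST_TOKENIZER_AVAILABLE = True
except ImportError:
    # Both names stay importable; the error surfaces at call time instead.
    tokenize_with_offsets = _missing_rust("tokenize_with_offsets")
    tokenize_with_normalized_offsets = _missing_rust("tokenize_with_normalized_offsets")
    RUST_TOKENIZER_AVAILABLE = False
```

With both names always bound, a module that imports `tokenize_with_offsets` unconditionally can still be imported on a pure-Python install, and callers can branch on `RUST_TOKENIZER_AVAILABLE` before invoking the Rust path.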

     metrics_obj = lemmatizer_obj.get_metrics()
     if output_format == "json":
    -    result = result.rstrip("}") + f', "metrics": {metrics_obj.to_dict()}}}'
    +    result = result.rstrip("}") + f', "metrics": {metrics_obj.to_dict()}'

The JSON branch that appends metrics produces invalid JSON: it strips the closing } from json.dumps(...) and appends the metrics payload but never adds the final closing brace back. Make sure the resulting result string remains a valid JSON object when --metrics is used with --format json (e.g., by inserting the metrics field and then re-closing the object).

    - result = result.rstrip("}") + f', "metrics": {metrics_obj.to_dict()}'
    + base_obj = json.loads(result)
    + base_obj["metrics"] = metrics_obj.to_dict()
    + result = json.dumps(base_obj, ensure_ascii=False, indent=2)

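The difference between the two approaches can be checked in isolation. The sketch below is illustrative only: the `result` payload and the `metrics` dict are made-up values standing in for the CLI output and for whatever `metrics_obj.to_dict()` returns (assumed here to be a plain dict).

```python
import json

# Hypothetical stand-ins for the CLI's serialized output and the metrics dict.
result = json.dumps({"lemma": "kitap"}, ensure_ascii=False)
metrics = {"calls": 3, "cache_hits": 1}

# String splicing, as in the flagged line: the closing brace is never restored.
broken = result.rstrip("}") + f', "metrics": {metrics}'
try:
    json.loads(broken)
    broken_is_valid = True
except json.JSONDecodeError:
    broken_is_valid = False

# Round-trip through a dict, as in the suggested change.
base_obj = json.loads(result)
base_obj["metrics"] = metrics
fixed = json.dumps(base_obj, ensure_ascii=False, indent=2)

print(broken_is_valid)
print(fixed)
```

Two further hazards of the splicing approach are worth noting: interpolating a dict into an f-string emits its Python `repr` (single-quoted keys), which is not JSON even when the brace is re-added, and `rstrip("}")` removes every trailing brace, so a payload ending in a nested object would lose more than one.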

     @cache
     def _read_stopword_metadata(resolved_metadata_path: str) -> dict[str, Any]:
         metadata_path = Path(resolved_metadata_path)
         try:
    -        raw = metadata_path.read_text(encoding="utf-8")
    -    except FileNotFoundError as exc:
    +        if metadata_path.resolve() == STOPWORD_METADATA_PATH.resolve():
    +            raw = DEFAULT_RESOURCE_PROVIDER.load_stopwords_metadata_text()
    +        else:
    +            raw = metadata_path.read_text(encoding="utf-8")
    +    except (FileNotFoundError, StopwordError) as exc:
             raise StopwordMetadataError(
                 f"Stopword metadata file not found at '{metadata_path}'."
             ) from exc

DEFAULT_RESOURCE_PROVIDER.load_stopwords_metadata_text() raises ResourceError on missing files, but _read_stopword_metadata only catches (FileNotFoundError, StopwordError). This means missing/failed provider loads will bypass the intended StopwordMetadataError wrapping. Consider catching ResourceError here (or having the provider raise StopwordError) so callers consistently receive StopwordMetadataError for metadata-loading failures.
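One way to apply the first option (catching `ResourceError` at the call site) is sketched below. Everything here is a stand-in so the snippet runs on its own: the three exception classes mimic the names in the diff without claiming their real hierarchy, `load_stopwords_metadata_text` simulates a provider whose packaged file is missing, and `read_metadata_text` is a hypothetical reduction of `_read_stopword_metadata` to just the loading step.

```python
from pathlib import Path


class ResourceError(Exception):
    """Stand-in for the provider's error type."""


class StopwordError(Exception):
    """Stand-in for the stopword module's base error."""


class StopwordMetadataError(Exception):
    """Stand-in for the error callers are meant to receive."""


def load_stopwords_metadata_text() -> str:
    # Simulates a provider that cannot find its packaged metadata.
    raise ResourceError("packaged stopword metadata is missing")


def read_metadata_text(metadata_path: Path, default_path: Path) -> str:
    try:
        if metadata_path.resolve() == default_path.resolve():
            return load_stopwords_metadata_text()
        return metadata_path.read_text(encoding="utf-8")
    except (FileNotFoundError, ResourceError, StopwordError) as exc:
        # Provider failures and plain file failures now share one wrapper.
        raise StopwordMetadataError(
            f"Stopword metadata file not found at '{metadata_path}'."
        ) from exc
```

Because the wrapper uses `raise ... from exc`, the original `ResourceError` or `FileNotFoundError` remains available as `__cause__` for debugging while callers only need to handle `StopwordMetadataError`.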
This pull request introduces significant improvements to documentation, project governance, and developer experience for the Durak Turkish NLP toolkit. The main changes include aligning licensing and usage documentation, adding a comprehensive architecture and refactoring plan, updating contribution and code of conduct guidelines, and enhancing CI reliability with Rust build validation. These updates help clarify project direction, improve onboarding for contributors, and ensure consistency across code, docs, and workflows.
Documentation and Governance Enhancements:
- Added `PLAN.md`, detailing the architecture, refactoring phases, anti-god-object guardrails, backend strategy, feature roadmap, and governance rules for the project. This document provides a clear long-term direction and operational guidelines.
- Updated `README.md` to reflect MIT licensing, document CLI usage, promote the new `Pipeline` API over the deprecated `process_text`, and add references to the backend and rule ownership guides. The changelog also notes these documentation alignments.
- Added `docs/BACKENDS.md`, describing the backend control layer (`DurakController`), backend selection, and the capability matrix for Rust and Python implementations.
- Removed `SECURITY.md` and consolidated security reporting guidance into the contribution docs, streamlining governance files.

Developer Experience and CI Reliability:
- Updated `.github/workflows/tests.yml` to include an explicit `cargo check` step, ensuring the Rust core compiles successfully as part of CI.

Contribution and Code of Conduct Updates:
- Reformatted `CONTRIBUTING.md`, expanding workflow, pull request, and issue reporting instructions for contributors.
- Reformatted `CODE_OF_CONDUCT.md` for readability, maintaining the same standards.

Best Practices Documentation:
- Cleaned up `docs/BEST_PRACTICES.md` for clarity and consistency.