Modularization and Backend Upgrade by fbkaragoz · Pull Request #154 · cdliai/durak

fbkaragoz · 2026-03-03T02:03:35Z

This pull request introduces significant improvements to documentation, project governance, and developer experience for the Durak Turkish NLP toolkit. The main changes include aligning licensing and usage documentation, adding a comprehensive architecture and refactoring plan, updating contribution and code of conduct guidelines, and enhancing CI reliability with Rust build validation. These updates help clarify project direction, improve onboarding for contributors, and ensure consistency across code, docs, and workflows.

Documentation and Governance Enhancements:

- Added PLAN.md detailing the architecture, refactoring phases, anti-god-object guardrails, backend strategy, feature roadmap, and governance rules for the project. This document provides a clear long-term direction and operational guidelines.
- Updated README.md to reflect MIT licensing, document CLI usage, promote the new Pipeline API over the deprecated process_text, and add references to backend and rule ownership guides. The changelog also notes these documentation alignments. [1] [2] [3] [4] [5] [6] [7]
- Added docs/BACKENDS.md describing the backend control layer (DurakController), backend selection, and capability matrix for Rust and Python implementations.
- Removed SECURITY.md and consolidated security reporting guidance into contribution docs, streamlining governance files.

Developer Experience and CI Reliability:

- Updated .github/workflows/tests.yml to include an explicit cargo check step, ensuring the Rust core compiles successfully as part of CI.

Contribution and Code of Conduct Updates:

- Improved formatting and clarity in CONTRIBUTING.md, expanding workflow, pull request, and issue reporting instructions for contributors. [1] [2]
- Reformatted the pledge in CODE_OF_CONDUCT.md for readability, maintaining the same standards.

Best Practices Documentation:

- Minor formatting update to docs/BEST_PRACTICES.md for clarity and consistency.


Copilot

Pull request overview

This PR modularizes Durak’s Python/Rust internals, adds an explicit backend control layer, centralizes resource loading, and introduces broader regression/E2E test coverage and fixtures—while aligning project governance/docs (license, contribution guidance, architecture plan) and tightening CI with a Rust compile check.

Changes:

- Added backend orchestration (DurakController), tokenizer “auto” strategy, and capability matrix + docs.
- Centralized Python resource loading and pipeline stage composition; introduced context-aware pipeline execution.
- Added golden + E2E regression fixtures/tests and Rust-side morphology rule modularization (suffix inventory, vowel harmony edge-case handling).
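
The tokenizer "auto" strategy mentioned above can be pictured with a minimal sketch. This is not Durak's actual resolver: `resolve_tokenizer`, `regex_tokenize`, and the strategy names `"rust"` and `"regex"` are hypothetical, and the only behaviour assumed is that "auto" prefers the Rust backend when it is available and otherwise falls back to a pure-Python tokenizer.

```python
import re
from typing import Callable, Optional

# Hypothetical alias: a tokenizer maps text to a list of token strings.
Tokenizer = Callable[[str], list]


def regex_tokenize(text: str) -> list:
    """Pure-Python fallback: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def resolve_tokenizer(strategy: str, rust_tokenizer: Optional[Tokenizer]):
    """Return (backend_name, tokenizer) for the requested strategy.

    "auto" picks the Rust tokenizer when one is supplied, else the regex one.
    """
    if strategy == "auto":
        strategy = "rust" if rust_tokenizer is not None else "regex"
    if strategy == "rust":
        if rust_tokenizer is None:
            raise RuntimeError("Rust backend requested but the extension is not installed")
        return "rust", rust_tokenizer
    if strategy == "regex":
        return "regex", regex_tokenize
    raise ValueError(f"unknown tokenizer strategy: {strategy!r}")


# With no Rust tokenizer supplied, "auto" resolves to the regex fallback.
backend, tokenize = resolve_tokenizer("auto", None)
print(backend, tokenize("Merhaba, dünya!"))
```

Returning the resolved backend name alongside the callable lets callers log or assert which implementation actually ran, which is useful for the kind of parity tests listed in this PR.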

Reviewed changes

Copilot reviewed 41 out of 42 changed files in this pull request and generated 3 comments.

Show a summary per file

| File | Description |
| --- | --- |
| `tests/test_tokenizer_parity.py` | Adds parity checks between regex and Rust tokenizers. |
| `tests/test_suffixes.py` | Updates suffix-attachment expectations and adds safety regressions. |
| `tests/test_story.py` | Updates story test to use `Pipeline` and aligns license header. |
| `tests/test_resources_provider.py` | Adds tests for centralized resource provider loaders. |
| `tests/test_regression_golden.py` | Adds golden regression tests across normalization/tokenization/pipeline/lemmatization. |
| `tests/test_pipeline_e2e_canonical_gold.py` | Adds strict canonical end-to-end gold-output assertions. |
| `tests/test_pipeline_e2e_analysis.py` | Adds corpus-scale E2E determinism/quality/parity analysis tests. |
| `tests/test_pipeline.py` | Adds context-aware pipeline tests and updates imports. |
| `tests/test_control.py` | Adds tests for backend resolver/controller and capability matrix. |
| `tests/test_cli.py` | Makes CLI subprocess tests reliable by setting `PYTHONPATH` in helper. |
| `tests/data/golden_regression_cases.json` | Adds golden fixtures for regression tests. |
| `tests/data/e2e_pipeline_corpus_cases.json` | Adds diverse corpus fixture for E2E quality checks. |
| `tests/data/e2e_pipeline_canonical_gold.json` | Adds canonical gold outputs for strict E2E assertions. |
| `src/vowel_harmony.rs` | Adds special-casing for progressive “-yor” family harmony checks. |
| `src/suffix_inventory.rs` | Extracts and centralizes Rust suffix inventories/constants. |
| `src/morphotactics.rs` | Refactors morphotactic classifier to consume shared suffix inventory. |
| `src/lib.rs` | Switches lemmatization stripping to shared suffix inventory and adds over-stripping guards. |
| `resources/tr/config/lemma_suffixes.txt` | Marks file as reference-only; canonical rules moved to Rust inventory. |
| `python/durak/tokenizer.py` | Defaults tokenization strategy to `auto` and adds Rust availability flag/registry. |
| `python/durak/suffixes.py` | Routes suffix/apostrophe loading through resource provider; makes joining conservative by default. |
| `python/durak/stopwords.py` | Routes metadata path constants through resource provider and adds provider-backed metadata read path. |
| `python/durak/stages.py` | Introduces dedicated pipeline stage callables and a shared step registry. |
| `python/durak/resources_provider.py` | Adds centralized resource provider with Rust fast paths and file fallbacks. |
| `python/durak/pipeline.py` | Uses dedicated stage registry; adds `run_with_context` and `process_text_with_context`. |
| `python/durak/core/types.py` | Adds shared internal data models (`Document`, `TokenSpan`). |
| `python/durak/core/interfaces.py` | Adds internal protocol contracts for module boundaries. |
| `python/durak/core/__init__.py` | Exposes core types/interfaces via `durak.core`. |
| `python/durak/control.py` | Adds backend control layer (`DurakController`) + capability matrix and resolver. |
| `python/durak/cli.py` | Refactors CLI flow into helpers; reuses shared tokenization pipeline steps. |
| `python/durak/__init__.py` | Exposes backend control APIs and context-aware processing wrapper. |
| `pyproject.toml` | Updates SPDX license identifier comment to MIT. |
| `docs/RULE_OWNERSHIP.md` | Documents canonical sources for rules/resources and maintenance rules. |
| `docs/BEST_PRACTICES.md` | Formatting/consistency cleanup. |
| `docs/BACKENDS.md` | Documents backend names, capability matrix, and usage. |
| `docs/ARCHITECTURE.md` | Formatting tweaks for directory layout depiction. |
| `SECURITY.md` | Removes standalone security policy file (guidance moved elsewhere). |
| `README.md` | Aligns licensing text/badge, updates quickstart to `Pipeline`, and adds CLI quickstart + docs links. |
| `PLAN.md` | Adds comprehensive architecture/refactoring plan and governance guardrails. |
| `CONTRIBUTING.md` | Re-formats and expands contribution guidance. |
| `CODE_OF_CONDUCT.md` | Re-formats pledge text for readability. |
| `CHANGELOG.md` | Adds entries documenting doc/license alignment and quickstart updates. |
| `.github/workflows/tests.yml` | Adds `cargo check` to CI to validate Rust compilation. |
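
The `tests/test_cli.py` entry says the CLI subprocess tests were made reliable by setting `PYTHONPATH` in a helper. The helper itself is not shown in this review, so the following is only a sketch of that general technique. `run_module` and its parameters are hypothetical names; the idea is to prepend the source directory to `PYTHONPATH` so that the child interpreter can import the package even when it is not installed.

```python
import os
import subprocess
import sys
from pathlib import Path


def run_module(module: str, args: list, source_root: Path) -> subprocess.CompletedProcess:
    """Run `python -m module args...` with source_root importable in the child."""
    env = os.environ.copy()
    existing = env.get("PYTHONPATH")
    # Prepend so the checked-out sources win over any installed copy.
    env["PYTHONPATH"] = (
        str(source_root) if not existing else os.pathsep.join([str(source_root), existing])
    )
    return subprocess.run(
        [sys.executable, "-m", module, *args],
        env=env,
        capture_output=True,
        text=True,
        check=False,  # let the test assert on returncode itself
    )
```

Using `sys.executable` rather than a bare `python` keeps the child in the same interpreter and virtual environment as the test runner.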

Comments suppressed due to low confidence (1)

CONTRIBUTING.md:69

The line ending with *** End Patch looks like an accidental patch-marker artifact committed into the document. It should be removed so the contribution guide renders cleanly.

- For security-sensitive reports, please email `dev@karagoz.io` instead of opening a public issue.

We appreciate your contributions and look forward to collaborating!*** End Patch


Copilot · 2026-03-03T02:10:30Z

 try:
    from . import _durak_core
    tokenize_with_offsets = _durak_core.tokenize_with_offsets
    tokenize_with_normalized_offsets = _durak_core.tokenize_with_normalized_offsets
+    RUST_TOKENIZER_AVAILABLE = True
 except ImportError:
    def tokenize_with_normalized_offsets(text: str) -> list[tuple[str, int, int]]:
        raise RustExtensionError(
            "Rust extension not installed. Run: maturin develop"
        )


+def rust_tokenize(text: str) -> list[str]:
+    """Tokenize text via Rust tokenizer offsets."""
+    return [token for token, _, _ in tokenize_with_offsets(text)]
+


In the except ImportError branch you only define tokenize_with_normalized_offsets, but tokenize_with_offsets is not defined. Because durak.control (and potentially other modules) imports tokenize_with_offsets unconditionally, importing durak on a machine without the Rust extension will raise an ImportError. Define a Python fallback tokenize_with_offsets stub that raises RustExtensionError (similar to tokenize_with_normalized_offsets) so the pure-Python installation path remains importable.
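
A minimal sketch of the fallback this comment asks for, under stated assumptions: the module name `_durak_core_not_installed_demo` is deliberately fake so the `except` branch always runs here, `RustExtensionError` is redefined locally as a plain `RuntimeError` subclass (its real base class is not shown in the diff), and setting `RUST_TOKENIZER_AVAILABLE = False` in the fallback branch is an assumption about the intended flag behaviour.

```python
class RustExtensionError(RuntimeError):
    """Local stand-in for the project's exception; the real base class may differ."""


def _missing_extension(name: str):
    """Build a stub that fails at call time instead of at import time."""
    def stub(text: str):
        raise RustExtensionError(
            f"{name} requires the Rust extension. Run: maturin develop"
        )
    stub.__name__ = name
    return stub


try:
    # Hypothetical module name, chosen so that the import fails in this sketch.
    import _durak_core_not_installed_demo as _core
    tokenize_with_offsets = _core.tokenize_with_offsets
    tokenize_with_normalized_offsets = _core.tokenize_with_normalized_offsets
    RUST_TOKENIZER_AVAILABLE = True
except ImportError:
    # Define BOTH names so `from module import tokenize_with_offsets` still works.
    tokenize_with_offsets = _missing_extension("tokenize_with_offsets")
    tokenize_with_normalized_offsets = _missing_extension("tokenize_with_normalized_offsets")
    RUST_TOKENIZER_AVAILABLE = False
```

With both names bound in either branch, importing the module never fails on a pure-Python install; the error only surfaces if a caller actually invokes a Rust-only function.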

Copilot · 2026-03-03T02:10:30Z

            metrics_obj = lemmatizer_obj.get_metrics()
            if output_format == "json":
-                result = result.rstrip("}") + f', "metrics": {metrics_obj.to_dict()}}}'
+                result = result.rstrip("}") + f', "metrics": {metrics_obj.to_dict()}'


The JSON branch that appends metrics produces invalid JSON: it strips the closing } from json.dumps(...) and appends the metrics payload but never adds the final closing brace back. Make sure the resulting result string remains a valid JSON object when --metrics is used with --format json (e.g., by inserting the metrics field and then re-closing the object).

Suggested change

-                result = result.rstrip("}") + f', "metrics": {metrics_obj.to_dict()}'
+                base_obj = json.loads(result)
+                base_obj["metrics"] = metrics_obj.to_dict()
+                result = json.dumps(base_obj, ensure_ascii=False, indent=2)
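
The difference between the two approaches can be checked in isolation. The sketch below uses a made-up payload and a plain dict in place of `metrics_obj.to_dict()`; the function names are hypothetical. It shows that splicing text into serialized JSON leaves the object unclosed and embeds a Python dict repr (single quotes), while the parse, mutate, and re-serialize route keeps the result valid.

```python
import json


def append_metrics_by_string(result: str, metrics: dict) -> str:
    """Fragile: strips trailing braces and splices a Python repr into JSON text."""
    return result.rstrip("}") + f', "metrics": {metrics}'


def append_metrics_by_object(result: str, metrics: dict) -> str:
    """Robust: parse, add the field, serialize again."""
    base_obj = json.loads(result)
    base_obj["metrics"] = metrics
    return json.dumps(base_obj, ensure_ascii=False, indent=2)


payload = json.dumps({"lemmas": ["kitap", "ev"]}, ensure_ascii=False)
metrics = {"cache_hits": 3}  # illustrative values only

try:
    json.loads(append_metrics_by_string(payload, metrics))
except json.JSONDecodeError as exc:
    print("string splice is not valid JSON:", exc.msg)

print(append_metrics_by_object(payload, metrics))
```

Note also that `str.rstrip("}")` removes every trailing `}`, not just one, so a payload ending in a nested object would lose more than the outer brace.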

Copilot · 2026-03-03T02:10:30Z

 @cache
 def _read_stopword_metadata(resolved_metadata_path: str) -> dict[str, Any]:
    metadata_path = Path(resolved_metadata_path)
    try:
-        raw = metadata_path.read_text(encoding="utf-8")
-    except FileNotFoundError as exc:
+        if metadata_path.resolve() == STOPWORD_METADATA_PATH.resolve():
+            raw = DEFAULT_RESOURCE_PROVIDER.load_stopwords_metadata_text()
+        else:
+            raw = metadata_path.read_text(encoding="utf-8")
+    except (FileNotFoundError, StopwordError) as exc:
        raise StopwordMetadataError(
            f"Stopword metadata file not found at '{metadata_path}'."
        ) from exc


DEFAULT_RESOURCE_PROVIDER.load_stopwords_metadata_text() raises ResourceError on missing files, but _read_stopword_metadata only catches (FileNotFoundError, StopwordError). This means missing/failed provider loads will bypass the intended StopwordMetadataError wrapping. Consider catching ResourceError here (or having the provider raise StopwordError) so callers consistently receive StopwordMetadataError for metadata-loading failures.
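
One way to read this suggestion is sketched below. All classes are local stand-ins, and the exception hierarchy (in particular `StopwordMetadataError` deriving from `StopwordError`) is an assumption; `load_from_provider` and `read_metadata_text` are hypothetical names that only mimic the shape of the code under review.

```python
from pathlib import Path


class ResourceError(Exception):
    """Stand-in for the provider's error type."""


class StopwordError(Exception):
    """Stand-in for the stopword module's base error."""


class StopwordMetadataError(StopwordError):
    """Stand-in; assumed to be what callers expect for metadata failures."""


def load_from_provider() -> str:
    """Simulates a provider load that fails because the resource is missing."""
    raise ResourceError("packaged stopword metadata is missing")


def read_metadata_text(path: Path, use_provider: bool) -> str:
    try:
        if use_provider:
            return load_from_provider()
        return path.read_text(encoding="utf-8")
    # ResourceError is included so provider failures are wrapped as well.
    except (FileNotFoundError, StopwordError, ResourceError) as exc:
        raise StopwordMetadataError(
            f"Stopword metadata file not found at '{path}'."
        ) from exc
```

Chaining with `from exc` preserves the original provider error on `__cause__`, so the wrapped exception stays debuggable.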

fbkaragoz added 20 commits March 3, 2026 03:34

pydoc: typehint for StepType usage

4aa5fb7

style: remove emojis from documentation and refine formatting in vari…

3ba2f9e

…ous files.

removed SECURITY.md

eff33ae

docs: Update README and CHANGELOG, promote Pipeline API, add CLI qu…

94eab30

…ickstart, switch to MIT license, refactor CLI tests, add new regression tests, and include CI Rust validation.

feat: Introduce golden file regression tests, refactor CLI test execu…

1ea4ae3

…tion, and update the API with a new `Pipeline` class and MIT license.

refactor: Improve suffix stripping logic with lemma checks, add CI Ru…

839930a

…st compilation check, and introduce regression tests and an architectural plan.

feat: Introduce architectural refactoring plan, add golden regression…

2f5d799

… tests, improve CLI test harness, and fix Rust compilation issues.

feat: Introduce architecture refactoring plan, refine lemmatizer to p…

ca1c037

…revent over-stripping, and add `cargo check` to CI.

feat: Implement a backend control layer to orchestrate Rust and Pytho…

33d5ea3

…n implementations, centralize resource management, and enhance Rust-based text processing capabilities.

refactor: introduce core components, interfaces, and a suffix invento…

39d9df0

…ry for improved modularity and CLI structure.

feat: Introduce ProcessingContext and Document for pipeline refac…

bbcc2a0

…toring, centralize resource management, and add architectural planning and rule ownership documentation.

feat: Introduce a resource provider and refactor suffix management in…

1b004c7

…to a dedicated inventory module.

feat: introduce a modular pipeline with new stages, core types, suffi…

8c25f68

…x inventory, and comprehensive end-to-end tests.

refactor: Initiate architectural refactoring to establish clear modul…

06a4c4a

…e boundaries, unify resources, and implement a hybrid backend strategy.

refactor: Introduce a backend control layer, core types, and processi…

365be7a

…ng stages as part of an architectural refactoring plan.

feat: Introduce a backend control layer with Rust integration for tok…

e75843f

…enization and suffix handling, new core data models, and updated documentation.

feat: Introduce backend control layer for Rust/Python NLP, refactor s…

7e88c8a

…uffix attachment logic, and establish core module structure with new documentation.

feat: Implement Rust-based suffix inventory, core pipeline types, and…

49dc424

… backend control, formalizing pipeline stages and rule ownership.

feat: Introduce a backend control layer, modularize pipeline stages, …

d033213

…and migrate Turkish suffix inventory to a Rust backend.

feat: Establish core data models and modular pipeline stages, and enh…

34e6152

…ance suffix attachment with a resource provider and Rust inventory.

fbkaragoz self-assigned this Mar 3, 2026

Copilot AI review requested due to automatic review settings March 3, 2026 02:03

fbkaragoz added core refactoring labels Mar 3, 2026

Copilot started reviewing on behalf of fbkaragoz March 3, 2026 02:04 View session

Copilot AI reviewed Mar 3, 2026

fbkaragoz wants to merge 20 commits into cdliai:main from fbkaragoz:main
