Modularization and Backend Upgrade #154

Open: fbkaragoz wants to merge 20 commits into cdliai:main from fbkaragoz:main

@fbkaragoz

Member

This pull request introduces significant improvements to documentation, project governance, and developer experience for the Durak Turkish NLP toolkit. The main changes include aligning licensing and usage documentation, adding a comprehensive architecture and refactoring plan, updating contribution and code of conduct guidelines, and enhancing CI reliability with Rust build validation. These updates help clarify project direction, improve onboarding for contributors, and ensure consistency across code, docs, and workflows.

Documentation and Governance Enhancements:

  • Added PLAN.md detailing the architecture, refactoring phases, anti-god-object guardrails, backend strategy, feature roadmap, and governance rules for the project. This document provides a clear long-term direction and operational guidelines.
  • Updated README.md to reflect MIT licensing, document CLI usage, promote the new Pipeline API over the deprecated process_text, and add references to backend and rule ownership guides. The changelog also notes these documentation alignments.
  • Added docs/BACKENDS.md describing the backend control layer (DurakController), backend selection, and capability matrix for Rust and Python implementations.
  • Removed SECURITY.md and consolidated security reporting guidance into contribution docs, streamlining governance files.

Developer Experience and CI Reliability:

  • Updated .github/workflows/tests.yml to include an explicit cargo check step, ensuring the Rust core compiles successfully as part of CI.

Contribution and Code of Conduct Updates:

  • Improved formatting and clarity in CONTRIBUTING.md, expanding workflow, pull request, and issue reporting instructions for contributors.
  • Reformatted the pledge in CODE_OF_CONDUCT.md for readability, maintaining the same standards.

Best Practices Documentation:

  • Minor formatting update to docs/BEST_PRACTICES.md for clarity and consistency.

fbkaragoz added 20 commits March 3, 2026 03:34
…ickstart, switch to MIT license, refactor CLI tests, add new regression tests, and include CI Rust validation.
…tion, and update the API with a new `Pipeline` class and MIT license.
…st compilation check, and introduce regression tests and an architectural plan.
… tests, improve CLI test harness, and fix Rust compilation issues.
…revent over-stripping, and add `cargo check` to CI.
…n implementations, centralize resource management, and enhance Rust-based text processing capabilities.
…ry for improved modularity and CLI structure.
…toring, centralize resource management, and add architectural planning and rule ownership documentation.
…x inventory, and comprehensive end-to-end tests.
…e boundaries, unify resources, and implement a hybrid backend strategy.
…ng stages as part of an architectural refactoring plan.
…enization and suffix handling, new core data models, and updated documentation.
…uffix attachment logic, and establish core module structure with new documentation.
… backend control, formalizing pipeline stages and rule ownership.
…and migrate Turkish suffix inventory to a Rust backend.
…ance suffix attachment with a resource provider and Rust inventory.
@fbkaragoz fbkaragoz self-assigned this Mar 3, 2026
Copilot AI review requested due to automatic review settings March 3, 2026 02:03

Copilot AI left a comment

Contributor

Pull request overview

This PR modularizes Durak’s Python/Rust internals, adds an explicit backend control layer, centralizes resource loading, and introduces broader regression/E2E test coverage and fixtures—while aligning project governance/docs (license, contribution guidance, architecture plan) and tightening CI with a Rust compile check.

Changes:

  • Added backend orchestration (DurakController), tokenizer “auto” strategy, and capability matrix + docs.
  • Centralized Python resource loading and pipeline stage composition; introduced context-aware pipeline execution.
  • Added golden + E2E regression fixtures/tests and Rust-side morphology rule modularization (suffix inventory, vowel harmony edge-case handling).
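A minimal sketch of how an "auto" tokenizer strategy could resolve to a concrete backend. This is not the PR's `DurakController` code: the function name `resolve_tokenizer_backend`, the `BackendChoice` record, and the rule "prefer Rust when available, otherwise fall back to regex" are all assumptions made for illustration. Only the backend names (regex and Rust) come from the review summary.

```python
from dataclasses import dataclass

# Backend names taken from the review summary; everything else is hypothetical.
SUPPORTED_BACKENDS = ("regex", "rust")


@dataclass(frozen=True)
class BackendChoice:
    """What the caller asked for and what was actually selected."""

    requested: str
    resolved: str


def resolve_tokenizer_backend(
    requested: str = "auto", *, rust_available: bool
) -> BackendChoice:
    """Map a requested strategy onto a concrete backend name."""
    if requested == "auto":
        # Assumed policy: use the Rust backend when it is importable.
        return BackendChoice(requested, "rust" if rust_available else "regex")
    if requested not in SUPPORTED_BACKENDS:
        raise ValueError(f"Unknown tokenizer backend: {requested!r}")
    if requested == "rust" and not rust_available:
        # An explicit request should fail loudly rather than silently degrade.
        raise RuntimeError("Rust backend requested but the extension is missing")
    return BackendChoice(requested, requested)


choice = resolve_tokenizer_backend("auto", rust_available=False)
print(choice)
```

Keeping the requested and resolved names side by side makes it easy for tests or logs to show when "auto" silently picked the fallback.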

Reviewed changes

Copilot reviewed 41 out of 42 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| `tests/test_tokenizer_parity.py` | Adds parity checks between regex and Rust tokenizers. |
| `tests/test_suffixes.py` | Updates suffix-attachment expectations and adds safety regressions. |
| `tests/test_story.py` | Updates story test to use Pipeline and aligns license header. |
| `tests/test_resources_provider.py` | Adds tests for centralized resource provider loaders. |
| `tests/test_regression_golden.py` | Adds golden regression tests across normalization/tokenization/pipeline/lemmatization. |
| `tests/test_pipeline_e2e_canonical_gold.py` | Adds strict canonical end-to-end gold-output assertions. |
| `tests/test_pipeline_e2e_analysis.py` | Adds corpus-scale E2E determinism/quality/parity analysis tests. |
| `tests/test_pipeline.py` | Adds context-aware pipeline tests and updates imports. |
| `tests/test_control.py` | Adds tests for backend resolver/controller and capability matrix. |
| `tests/test_cli.py` | Makes CLI subprocess tests reliable by setting PYTHONPATH in helper. |
| `tests/data/golden_regression_cases.json` | Adds golden fixtures for regression tests. |
| `tests/data/e2e_pipeline_corpus_cases.json` | Adds diverse corpus fixture for E2E quality checks. |
| `tests/data/e2e_pipeline_canonical_gold.json` | Adds canonical gold outputs for strict E2E assertions. |
| `src/vowel_harmony.rs` | Adds special-casing for progressive “-yor” family harmony checks. |
| `src/suffix_inventory.rs` | Extracts and centralizes Rust suffix inventories/constants. |
| `src/morphotactics.rs` | Refactors morphotactic classifier to consume shared suffix inventory. |
| `src/lib.rs` | Switches lemmatization stripping to shared suffix inventory and adds over-stripping guards. |
| `resources/tr/config/lemma_suffixes.txt` | Marks file as reference-only; canonical rules moved to Rust inventory. |
| `python/durak/tokenizer.py` | Defaults tokenization strategy to auto and adds Rust availability flag/registry. |
| `python/durak/suffixes.py` | Routes suffix/apostrophe loading through resource provider; makes joining conservative by default. |
| `python/durak/stopwords.py` | Routes metadata path constants through resource provider and adds provider-backed metadata read path. |
| `python/durak/stages.py` | Introduces dedicated pipeline stage callables and a shared step registry. |
| `python/durak/resources_provider.py` | Adds centralized resource provider with Rust fast paths and file fallbacks. |
| `python/durak/pipeline.py` | Uses dedicated stage registry; adds run_with_context and process_text_with_context. |
| `python/durak/core/types.py` | Adds shared internal data models (Document, TokenSpan). |
| `python/durak/core/interfaces.py` | Adds internal protocol contracts for module boundaries. |
| `python/durak/core/__init__.py` | Exposes core types/interfaces via durak.core. |
| `python/durak/control.py` | Adds backend control layer (DurakController) + capability matrix and resolver. |
| `python/durak/cli.py` | Refactors CLI flow into helpers; reuses shared tokenization pipeline steps. |
| `python/durak/__init__.py` | Exposes backend control APIs and context-aware processing wrapper. |
| `pyproject.toml` | Updates SPDX license identifier comment to MIT. |
| `docs/RULE_OWNERSHIP.md` | Documents canonical sources for rules/resources and maintenance rules. |
| `docs/BEST_PRACTICES.md` | Formatting/consistency cleanup. |
| `docs/BACKENDS.md` | Documents backend names, capability matrix, and usage. |
| `docs/ARCHITECTURE.md` | Formatting tweaks for directory layout depiction. |
| `SECURITY.md` | Removes standalone security policy file (guidance moved elsewhere). |
| `README.md` | Aligns licensing text/badge, updates quickstart to Pipeline, and adds CLI quickstart + docs links. |
| `PLAN.md` | Adds comprehensive architecture/refactoring plan and governance guardrails. |
| `CONTRIBUTING.md` | Re-formats and expands contribution guidance. |
| `CODE_OF_CONDUCT.md` | Re-formats pledge text for readability. |
| `CHANGELOG.md` | Adds entries documenting doc/license alignment and quickstart updates. |
| `.github/workflows/tests.yml` | Adds cargo check to CI to validate Rust compilation. |
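The `tests/test_cli.py` entry mentions setting PYTHONPATH in a helper so subprocess tests can import the package. The sketch below shows that general technique only; `run_python` and the throwaway `demo_pkg` module are hypothetical names, and the actual helper in the PR may be structured differently.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def run_python(args: list[str], src_dir: Path) -> subprocess.CompletedProcess:
    """Run the current interpreter with src_dir prepended to PYTHONPATH."""
    env = os.environ.copy()
    existing = env.get("PYTHONPATH")
    # Prepend so the checkout under test wins over any installed copy.
    env["PYTHONPATH"] = (
        os.pathsep.join([str(src_dir), existing]) if existing else str(src_dir)
    )
    return subprocess.run(
        [sys.executable, *args],
        env=env,
        capture_output=True,
        text=True,
        check=False,
    )


# Demonstration with a throwaway module instead of the real package.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "demo_pkg.py").write_text("VALUE = 'ok'\n", encoding="utf-8")
    completed = run_python(
        ["-c", "import demo_pkg; print(demo_pkg.VALUE)"], Path(tmp)
    )

print(completed.returncode, completed.stdout.strip())
```

Copying `os.environ` rather than building a fresh mapping keeps variables such as `PATH` intact, which the child interpreter may need.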
Comments suppressed due to low confidence (1)

CONTRIBUTING.md:69

  • The line ending with `*** End Patch` looks like an accidental patch-marker artifact committed into the document. It should be removed so the contribution guide renders cleanly.
- For security-sensitive reports, please email `dev@karagoz.io` instead of opening a public issue.

We appreciate your contributions and look forward to collaborating!*** End Patch


Comment thread python/durak/tokenizer.py
Comment on lines 139 to +154
try:
    from . import _durak_core
    tokenize_with_offsets = _durak_core.tokenize_with_offsets
    tokenize_with_normalized_offsets = _durak_core.tokenize_with_normalized_offsets
    RUST_TOKENIZER_AVAILABLE = True
except ImportError:
    def tokenize_with_normalized_offsets(text: str) -> list[tuple[str, int, int]]:
        raise RustExtensionError(
            "Rust extension not installed. Run: maturin develop"
        )


def rust_tokenize(text: str) -> list[str]:
    """Tokenize text via Rust tokenizer offsets."""
    return [token for token, _, _ in tokenize_with_offsets(text)]

Copilot AI Mar 3, 2026

In the `except ImportError` branch you only define `tokenize_with_normalized_offsets`, but `tokenize_with_offsets` is not defined. Because `durak.control` (and potentially other modules) imports `tokenize_with_offsets` unconditionally, importing `durak` on a machine without the Rust extension will raise an `ImportError`. Define a Python fallback `tokenize_with_offsets` stub that raises `RustExtensionError` (similar to `tokenize_with_normalized_offsets`) so the pure-Python installation path remains importable.
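One possible shape for the fallback this comment asks for, written as a self-contained sketch rather than the project's actual fix. `RustExtensionError` is redefined here as a stand-in, `_missing_rust` is a hypothetical helper, and the module name `_durak_core_not_installed` is deliberately fake so the `except ImportError` path runs.

```python
class RustExtensionError(RuntimeError):
    """Stand-in for the project's error type (assumed, not copied from durak)."""


def _missing_rust(name: str):
    """Build a stub that fails at call time instead of at import time."""

    def stub(text: str) -> list[tuple[str, int, int]]:
        raise RustExtensionError(
            f"{name} needs the Rust extension. Run: maturin develop"
        )

    stub.__name__ = name
    return stub


try:
    # Hypothetical module name, chosen so this import always fails here.
    import _durak_core_not_installed as _core

    tokenize_with_offsets = _core.tokenize_with_offsets
    tokenize_with_normalized_offsets = _core.tokenize_with_normalized_offsets
    RUST_TOKENIZER_AVAILABLE = True
except ImportError:
    # Define BOTH names so `from .tokenizer import tokenize_with_offsets`
    # keeps working on a pure-Python install.
    tokenize_with_offsets = _missing_rust("tokenize_with_offsets")
    tokenize_with_normalized_offsets = _missing_rust(
        "tokenize_with_normalized_offsets"
    )
    RUST_TOKENIZER_AVAILABLE = False

print(RUST_TOKENIZER_AVAILABLE)
```

With both names always bound, importing the module succeeds everywhere and the error surfaces only when a Rust-only function is actually called.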

Comment thread python/durak/cli.py
 metrics_obj = lemmatizer_obj.get_metrics()
 if output_format == "json":
-    result = result.rstrip("}") + f', "metrics": {metrics_obj.to_dict()}}}'
+    result = result.rstrip("}") + f', "metrics": {metrics_obj.to_dict()}'

Copilot AI Mar 3, 2026

The JSON branch that appends metrics produces invalid JSON: it strips the closing `}` from `json.dumps(...)` and appends the metrics payload but never adds the final closing brace back. Make sure the resulting `result` string remains a valid JSON object when `--metrics` is used with `--format json` (e.g., by inserting the metrics field and then re-closing the object).

Suggested change
-    result = result.rstrip("}") + f', "metrics": {metrics_obj.to_dict()}'
+    base_obj = json.loads(result)
+    base_obj["metrics"] = metrics_obj.to_dict()
+    result = json.dumps(base_obj, ensure_ascii=False, indent=2)
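A small standalone check of why the string-splicing approach is fragile and why the suggested round trip is safer. The payload keys (`lemmas`, `cache_hits`, `strategy`) are made up for illustration, and the sketch assumes `to_dict()` returns a plain `dict`. Under that assumption, even a version that re-closes the brace is not valid JSON, because formatting a dict inside an f-string uses Python's repr with single quotes.

```python
import json


def is_valid_json(text: str) -> bool:
    """Return True if text parses as JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# Hypothetical CLI output and metrics payload.
result = json.dumps({"lemmas": ["kitap", "ev"]}, ensure_ascii=False)
metrics = {"cache_hits": 3, "strategy": "lookup"}

# Splicing without re-closing the object: the final brace is missing.
spliced = result.rstrip("}") + f', "metrics": {metrics}'

# Splicing and re-closing: still invalid, since the dict repr is not JSON.
spliced_closed = result.rstrip("}") + f', "metrics": {metrics}}}'

# Round trip through the parser, as in the suggested change.
merged = json.loads(result)
merged["metrics"] = metrics
fixed = json.dumps(merged, ensure_ascii=False, indent=2)

print(is_valid_json(spliced), is_valid_json(spliced_closed), is_valid_json(fixed))
```

Note also that `rstrip("}")` removes every trailing brace, so an output ending in a nested object would lose more than one.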

Comment thread python/durak/stopwords.py
Comment on lines 48 to 59
 @cache
 def _read_stopword_metadata(resolved_metadata_path: str) -> dict[str, Any]:
     metadata_path = Path(resolved_metadata_path)
     try:
-        raw = metadata_path.read_text(encoding="utf-8")
-    except FileNotFoundError as exc:
+        if metadata_path.resolve() == STOPWORD_METADATA_PATH.resolve():
+            raw = DEFAULT_RESOURCE_PROVIDER.load_stopwords_metadata_text()
+        else:
+            raw = metadata_path.read_text(encoding="utf-8")
+    except (FileNotFoundError, StopwordError) as exc:
         raise StopwordMetadataError(
             f"Stopword metadata file not found at '{metadata_path}'."
         ) from exc

Copilot AI Mar 3, 2026

`DEFAULT_RESOURCE_PROVIDER.load_stopwords_metadata_text()` raises `ResourceError` on missing files, but `_read_stopword_metadata` only catches `(FileNotFoundError, StopwordError)`. This means missing/failed provider loads will bypass the intended `StopwordMetadataError` wrapping. Consider catching `ResourceError` here (or having the provider raise `StopwordError`) so callers consistently receive `StopwordMetadataError` for metadata-loading failures.
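The first option in this comment (catching `ResourceError` alongside the existing exceptions) could look roughly like the sketch below. The three exception classes are stand-ins with an assumed, flat hierarchy, and `read_metadata_text`, `failing_provider`, and the example path are hypothetical; the real function also parses the text and is cached, which is omitted here.

```python
from pathlib import Path
from typing import Callable


# Stand-in exception types; the real hierarchy in durak may differ.
class ResourceError(Exception):
    pass


class StopwordError(Exception):
    pass


class StopwordMetadataError(Exception):
    pass


def read_metadata_text(
    path: Path, default_path: Path, load_default: Callable[[], str]
) -> str:
    """Read metadata text, wrapping every load failure in one error type."""
    try:
        if path.resolve() == default_path.resolve():
            return load_default()
        return path.read_text(encoding="utf-8")
    except (FileNotFoundError, StopwordError, ResourceError) as exc:
        # ResourceError is now wrapped too, so callers see one exception type.
        raise StopwordMetadataError(
            f"Stopword metadata file not found at '{path}'."
        ) from exc


def failing_provider() -> str:
    """Simulates a provider whose bundled resource is missing."""
    raise ResourceError("bundled metadata missing")


default = Path("resources/stopwords_metadata.json")  # hypothetical location
wrapped_cause = None
try:
    read_metadata_text(default, default, failing_provider)
except StopwordMetadataError as err:
    wrapped_cause = err.__cause__

print(type(wrapped_cause).__name__)
```

Using `raise ... from exc` preserves the original `ResourceError` as `__cause__`, so the wrapping does not hide the underlying failure in tracebacks.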
