Skip to content

Fix O(n²) transliteration, broken offset tracking, no-op suffix config, and doc/API inconsistencies#152

Open
Copilot wants to merge 2 commits into
origin/devfrom
copilot/sub-pr-150-again
Open

Fix O(n²) transliteration, broken offset tracking, no-op suffix config, and doc/API inconsistencies#152
Copilot wants to merge 2 commits into
origin/devfrom
copilot/sub-pr-150-again

Conversation

Copilot AI commented Feb 21, 2026

Copy link
Copy Markdown
Contributor

Multiple review findings in the Ottoman module and TextProcessor pipeline: quadratic performance in transliteration, token mappings storing local instead of document-level offsets, custom_suffixes silently doing nothing, and several API/doc mismatches.

Performance

  • arabic_to_latin / scholarly_to_modern: Replaced len("".join(chars)) (O(n) per iteration → O(n²) overall) with a running trans_pos integer counter updated as each character is appended.

Offset tracking

  • Token TransliterationMapping: Now stores document-level offsets (orig_start, orig_end, trans_idx, trans_idx + len(token)) derived from the full mapping when resolvable, falling back to token-local offsets. get_offsets_for_token consequently returns correct absolute spans for OCR/bounding-box alignment.

Unimplemented config option

  • OttomanConfig.custom_suffixes: Was accepted and stored but never applied. Added a suffix-stripping step in OttomanProcessor.process() that removes the longest matching suffix from each processed token.

API correctness

  • ProcessorConfig.lowercase: Was documented but ineffective — clean_text always lowercased via DEFAULT_CLEANING_STEPS. When lowercase=False, the processor now passes a custom steps pipeline that excludes normalize_case.
proc = TextProcessor(ProcessorConfig(lowercase=False))
proc.process("Türkiye NLP").tokens  # ['Türkiye', 'NLP']  (was always lowercased)

Docs / examples

  • frequencies.py docstring: Added missing ProcessorConfig to the example import — snippet was broken if copied verbatim.
  • README.md process_text example: Replaced invalid steps=[...] with correct keyword args (remove_stopwords=True, attach_suffixes=True).
  • docs/USER_GUIDE.md emoji example: Added processor = TextProcessor(config) after setting emoji_mode="extract" — without it result.emojis was always empty.

✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.

…, suffix stripping, docs

Co-authored-by: fbkaragoz <59958216+fbkaragoz@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix syntax errors and update documentation for Durak NLP Toolkit Fix O(n²) transliteration, broken offset tracking, no-op suffix config, and doc/API inconsistencies Feb 21, 2026
Copilot AI requested a review from fbkaragoz February 21, 2026 05:09
@fbkaragoz fbkaragoz added bug Something isn't working documentation Improvements or additions to documentation labels Feb 21, 2026
@fbkaragoz fbkaragoz marked this pull request as ready for review February 23, 2026 01:02
Copilot AI review requested due to automatic review settings February 23, 2026 01:02

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses multiple issues in the Ottoman Turkish module and TextProcessor pipeline, including quadratic performance in transliteration, incorrect offset tracking, a non-functional config option, and several documentation inconsistencies.

Changes:

  • Performance optimization: Replaced O(n²) string position tracking with O(n) running counters in transliteration methods
  • Fixed offset tracking to use document-level offsets instead of token-local offsets for OCR/bounding-box alignment
  • Implemented previously non-functional custom_suffixes configuration option to strip suffixes from processed tokens
  • Fixed lowercase=False configuration which was previously ignored due to hardcoded case normalization
  • Corrected documentation examples that had missing imports or incorrect API usage

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
python/durak/stats/frequencies.py Added missing ProcessorConfig import to docstring example
python/durak/processor.py Implemented lowercase=False by filtering out normalize_case from cleaning steps; added DEFAULT_CLEANING_STEPS import
python/durak/ottoman/transliterator.py Replaced O(n²) len("".join(chars)) with O(n) running counter trans_pos in both arabic_to_latin and scholarly_to_modern methods
python/durak/ottoman/processor.py Fixed token offset mappings to use document-level offsets; implemented suffix stripping for custom_suffixes config
docs/USER_GUIDE.md Added missing processor = TextProcessor(config) instantiation in emoji extraction example
README.md Replaced invalid steps=[...] parameter with correct keyword arguments (remove_stopwords=True, attach_suffixes=True)

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +380 to +389
stripped = []
for token in processed_tokens:
# Remove the longest matching suffix (at most one per token)
for suffix in sorted_suffixes:
if token.endswith(suffix) and len(token) > len(suffix):
token = token[: -len(suffix)]
break
stripped.append(token)
processed_tokens = stripped

Copilot AI Feb 23, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suffix stripping creates an inconsistency: result.tokens contains suffix-stripped tokens, but result.offset_mappings[i].transliterated still contains the original token before suffix removal. This breaks the invariant that result.tokens[i] should match result.offset_mappings[i].transliterated. The offset mappings should be updated after suffix stripping to reflect the actual tokens, or the transliterated field should be updated to match the stripped token.

Suggested change
stripped = []
for token in processed_tokens:
# Remove the longest matching suffix (at most one per token)
for suffix in sorted_suffixes:
if token.endswith(suffix) and len(token) > len(suffix):
token = token[: -len(suffix)]
break
stripped.append(token)
processed_tokens = stripped
stripped_tokens: list[str] = []
stripped_mappings: list[TransliterationMapping] = []
for idx, (token, mapping) in enumerate(zip(processed_tokens, offset_mappings)):
original_token = token
removed_len = 0
matched_suffix = ""
# Remove the longest matching suffix (at most one per token)
for suffix in sorted_suffixes:
if token.endswith(suffix) and len(token) > len(suffix):
removed_len = len(suffix)
matched_suffix = suffix
token = token[: -removed_len]
break
stripped_tokens.append(token)
# Keep offset mappings consistent with suffix-stripped tokens.
# If we did not remove any suffix, reuse the existing mapping.
if removed_len == 0:
stripped_mappings.append(mapping)
continue
# Update the transliterated form by stripping the same suffix length.
transliterated = mapping.transliterated
if len(transliterated) >= removed_len:
new_transliterated = transliterated[:-removed_len]
else:
# Fallback: avoid negative slicing if lengths somehow mismatch.
new_transliterated = ""
new_trans_len = len(new_transliterated)
new_char_mappings: list[tuple[int, int, int, int]] = []
for orig_start, orig_end, trans_start, trans_end in mapping.char_mappings:
# Skip mappings that start beyond the new transliterated length.
if trans_start >= new_trans_len:
break
# If the entire mapping lies within the new length, keep as is.
if trans_end <= new_trans_len:
new_char_mappings.append((orig_start, orig_end, trans_start, trans_end))
else:
# Truncate a partially overlapping mapping at the new end.
new_char_mappings.append((orig_start, orig_end, trans_start, new_trans_len))
break
stripped_mappings.append(
TransliterationMapping(
original=mapping.original,
transliterated=new_transliterated,
char_mappings=new_char_mappings,
)
)
processed_tokens = stripped_tokens
offset_mappings = stripped_mappings

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working documentation Improvements or additions to documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants