Fix O(n²) transliteration, broken offset tracking, no-op suffix config, and doc/API inconsistencies by Copilot · Pull Request #152 · cdliai/durak

Copilot · 2026-02-21T05:02:15Z

Multiple review findings in the Ottoman module and TextProcessor pipeline: quadratic performance in transliteration, token mappings storing local instead of document-level offsets, custom_suffixes silently doing nothing, and several API/doc mismatches.

Performance

arabic_to_latin / scholarly_to_modern: Replaced len("".join(chars)) (O(n) per iteration → O(n²) overall) with a running trans_pos integer counter updated as each character is appended.

Offset tracking

Token TransliterationMapping: Now stores document-level offsets (orig_start, orig_end, trans_idx, trans_idx + len(token)) derived from the full mapping when resolvable, falling back to token-local offsets. get_offsets_for_token consequently returns correct absolute spans for OCR/bounding-box alignment.

Unimplemented config option

OttomanConfig.custom_suffixes: Was accepted and stored but never applied. Added a suffix-stripping step in OttomanProcessor.process() that removes the longest matching suffix from each processed token.

API correctness

ProcessorConfig.lowercase: Was documented but ineffective — clean_text always lowercased via DEFAULT_CLEANING_STEPS. When lowercase=False, the processor now passes a custom steps pipeline that excludes normalize_case.

proc = TextProcessor(ProcessorConfig(lowercase=False))
proc.process("Türkiye NLP").tokens  # ['Türkiye', 'NLP']  (was always lowercased)

Docs / examples

frequencies.py docstring: Added missing ProcessorConfig to the example import — snippet was broken if copied verbatim.
README.md process_text example: Replaced invalid steps=[...] with correct keyword args (remove_stopwords=True, attach_suffixes=True).
docs/USER_GUIDE.md emoji example: Added processor = TextProcessor(config) after setting emoji_mode="extract" — without it result.emojis was always empty.

✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.

…, suffix stripping, docs Co-authored-by: fbkaragoz <59958216+fbkaragoz@users.noreply.github.com>

Copilot

Pull request overview

This PR addresses multiple issues in the Ottoman Turkish module and TextProcessor pipeline, including quadratic performance in transliteration, incorrect offset tracking, a non-functional config option, and several documentation inconsistencies.

Changes:

Performance optimization: Replaced O(n²) string position tracking with O(n) running counters in transliteration methods
Fixed offset tracking to use document-level offsets instead of token-local offsets for OCR/bounding-box alignment
Implemented previously non-functional custom_suffixes configuration option to strip suffixes from processed tokens
Fixed lowercase=False configuration which was previously ignored due to hardcoded case normalization
Corrected documentation examples that had missing imports or incorrect API usage

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
python/durak/stats/frequencies.py	Added missing `ProcessorConfig` import to docstring example
python/durak/processor.py	Implemented `lowercase=False` by filtering out `normalize_case` from cleaning steps; added `DEFAULT_CLEANING_STEPS` import
python/durak/ottoman/transliterator.py	Replaced O(n²) `len("".join(chars))` with O(n) running counter `trans_pos` in both `arabic_to_latin` and `scholarly_to_modern` methods
python/durak/ottoman/processor.py	Fixed token offset mappings to use document-level offsets; implemented suffix stripping for `custom_suffixes` config
docs/USER_GUIDE.md	Added missing `processor = TextProcessor(config)` instantiation in emoji extraction example
README.md	Replaced invalid `steps=[...]` parameter with correct keyword arguments (`remove_stopwords=True, attach_suffixes=True`)

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-02-23T01:08:11Z

+            stripped = []
+            for token in processed_tokens:
+                # Remove the longest matching suffix (at most one per token)
+                for suffix in sorted_suffixes:
+                    if token.endswith(suffix) and len(token) > len(suffix):
+                        token = token[: -len(suffix)]
+                        break
+                stripped.append(token)
+            processed_tokens = stripped
+


Suffix stripping creates an inconsistency: result.tokens contains suffix-stripped tokens, but result.offset_mappings[i].transliterated still contains the original token before suffix removal. This breaks the invariant that result.tokens[i] should match result.offset_mappings[i].transliterated. The offset mappings should be updated after suffix stripping to reflect the actual tokens, or the transliterated field should be updated to match the stripped token.

Suggested change

stripped = []

for token in processed_tokens:

# Remove the longest matching suffix (at most one per token)

for suffix in sorted_suffixes:

if token.endswith(suffix) and len(token) > len(suffix):

token = token[: -len(suffix)]

break

stripped.append(token)

processed_tokens = stripped

stripped_tokens: list[str] = []

stripped_mappings: list[TransliterationMapping] = []

for idx, (token, mapping) in enumerate(zip(processed_tokens, offset_mappings)):

original_token = token

removed_len = 0

matched_suffix = ""

# Remove the longest matching suffix (at most one per token)

for suffix in sorted_suffixes:

if token.endswith(suffix) and len(token) > len(suffix):

removed_len = len(suffix)

matched_suffix = suffix

token = token[: -removed_len]

break

stripped_tokens.append(token)

# Keep offset mappings consistent with suffix-stripped tokens.

# If we did not remove any suffix, reuse the existing mapping.

if removed_len == 0:

stripped_mappings.append(mapping)

continue

# Update the transliterated form by stripping the same suffix length.

transliterated = mapping.transliterated

if len(transliterated) >= removed_len:

new_transliterated = transliterated[:-removed_len]

else:

# Fallback: avoid negative slicing if lengths somehow mismatch.

new_transliterated = ""

new_trans_len = len(new_transliterated)

new_char_mappings: list[tuple[int, int, int, int]] = []

for orig_start, orig_end, trans_start, trans_end in mapping.char_mappings:

# Skip mappings that start beyond the new transliterated length.

if trans_start >= new_trans_len:

break

# If the entire mapping lies within the new length, keep as is.

if trans_end <= new_trans_len:

new_char_mappings.append((orig_start, orig_end, trans_start, trans_end))

else:

# Truncate a partially overlapping mapping at the new end.

new_char_mappings.append((orig_start, orig_end, trans_start, new_trans_len))

break

stripped_mappings.append(

TransliterationMapping(

original=mapping.original,

transliterated=new_transliterated,

char_mappings=new_char_mappings,

)

)

processed_tokens = stripped_tokens

offset_mappings = stripped_mappings

Initial plan

a41f83f

Copilot AI assigned Copilot and fbkaragoz Feb 21, 2026

Copilot AI mentioned this pull request Feb 21, 2026

Origin/dev #150

Open

Copilot started work on behalf of fbkaragoz February 21, 2026 05:02 View session

fix: address all PR review comments - performance, offsets, lowercase…

7f0a425

…, suffix stripping, docs Co-authored-by: fbkaragoz <59958216+fbkaragoz@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Fix syntax errors and update documentation for Durak NLP Toolkit~~ Fix O(n²) transliteration, broken offset tracking, no-op suffix config, and doc/API inconsistencies Feb 21, 2026

Copilot AI requested a review from fbkaragoz February 21, 2026 05:09

Copilot finished work on behalf of fbkaragoz February 21, 2026 05:09

fbkaragoz added bug Something isn't working documentation Improvements or additions to documentation labels Feb 21, 2026

fbkaragoz marked this pull request as ready for review February 23, 2026 01:02

Copilot AI review requested due to automatic review settings February 23, 2026 01:02

Copilot started reviewing on behalf of fbkaragoz February 23, 2026 01:02 View session

Copilot AI reviewed Feb 23, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix O(n²) transliteration, broken offset tracking, no-op suffix config, and doc/API inconsistencies#152

Fix O(n²) transliteration, broken offset tracking, no-op suffix config, and doc/API inconsistencies#152
Copilot wants to merge 2 commits into
origin/devfrom
copilot/sub-pr-150-again

Copilot AI commented Feb 21, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Feb 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

-            stripped = []
-            for token in processed_tokens:
-                # Remove the longest matching suffix (at most one per token)
-                for suffix in sorted_suffixes:
-                    if token.endswith(suffix) and len(token) > len(suffix):
-                        token = token[: -len(suffix)]
-                        break
-                stripped.append(token)
-            processed_tokens = stripped
+            stripped_tokens: list[str] = []
+            stripped_mappings: list[TransliterationMapping] = []
+            for idx, (token, mapping) in enumerate(zip(processed_tokens, offset_mappings)):
+                original_token = token
+                removed_len = 0
+                matched_suffix = ""
+                # Remove the longest matching suffix (at most one per token)
+                for suffix in sorted_suffixes:
+                    if token.endswith(suffix) and len(token) > len(suffix):
+                        removed_len = len(suffix)
+                        matched_suffix = suffix
+                        token = token[: -removed_len]
+                        break
+                stripped_tokens.append(token)
+                # Keep offset mappings consistent with suffix-stripped tokens.
+                # If we did not remove any suffix, reuse the existing mapping.
+                if removed_len == 0:
+                    stripped_mappings.append(mapping)
+                    continue
+                # Update the transliterated form by stripping the same suffix length.
+                transliterated = mapping.transliterated
+                if len(transliterated) >= removed_len:
+                    new_transliterated = transliterated[:-removed_len]
+                else:
+                    # Fallback: avoid negative slicing if lengths somehow mismatch.
+                    new_transliterated = ""
+                new_trans_len = len(new_transliterated)
+                new_char_mappings: list[tuple[int, int, int, int]] = []
+                for orig_start, orig_end, trans_start, trans_end in mapping.char_mappings:
+                    # Skip mappings that start beyond the new transliterated length.
+                    if trans_start >= new_trans_len:
+                        break
+                    # If the entire mapping lies within the new length, keep as is.
+                    if trans_end <= new_trans_len:
+                        new_char_mappings.append((orig_start, orig_end, trans_start, trans_end))
+                    else:
+                        # Truncate a partially overlapping mapping at the new end.
+                        new_char_mappings.append((orig_start, orig_end, trans_start, new_trans_len))
+                        break
+                stripped_mappings.append(
+                    TransliterationMapping(
+                        original=mapping.original,
+                        transliterated=new_transliterated,
+                        char_mappings=new_char_mappings,
+                    )
+                )
+            processed_tokens = stripped_tokens
+            offset_mappings = stripped_mappings

Conversation

Copilot AI commented Feb 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Performance

Offset tracking

Unimplemented config option

API correctness

Docs / examples

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Feb 23, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Copilot AI commented Feb 21, 2026 •

edited

Loading