Fix O(n²) transliteration, broken offset tracking, no-op suffix config, and doc/API inconsistencies#152
Fix O(n²) transliteration, broken offset tracking, no-op suffix config, and doc/API inconsistencies#152Copilot wants to merge 2 commits into
Conversation
…, suffix stripping, docs Co-authored-by: fbkaragoz <59958216+fbkaragoz@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR addresses multiple issues in the Ottoman Turkish module and TextProcessor pipeline, including quadratic performance in transliteration, incorrect offset tracking, a non-functional config option, and several documentation inconsistencies.
Changes:
- Performance optimization: Replaced O(n²) string position tracking with O(n) running counters in transliteration methods
- Fixed offset tracking to use document-level offsets instead of token-local offsets for OCR/bounding-box alignment
- Implemented previously non-functional
custom_suffixesconfiguration option to strip suffixes from processed tokens - Fixed
lowercase=Falseconfiguration which was previously ignored due to hardcoded case normalization - Corrected documentation examples that had missing imports or incorrect API usage
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| python/durak/stats/frequencies.py | Added missing ProcessorConfig import to docstring example |
| python/durak/processor.py | Implemented lowercase=False by filtering out normalize_case from cleaning steps; added DEFAULT_CLEANING_STEPS import |
| python/durak/ottoman/transliterator.py | Replaced O(n²) len("".join(chars)) with O(n) running counter trans_pos in both arabic_to_latin and scholarly_to_modern methods |
| python/durak/ottoman/processor.py | Fixed token offset mappings to use document-level offsets; implemented suffix stripping for custom_suffixes config |
| docs/USER_GUIDE.md | Added missing processor = TextProcessor(config) instantiation in emoji extraction example |
| README.md | Replaced invalid steps=[...] parameter with correct keyword arguments (remove_stopwords=True, attach_suffixes=True) |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| stripped = [] | ||
| for token in processed_tokens: | ||
| # Remove the longest matching suffix (at most one per token) | ||
| for suffix in sorted_suffixes: | ||
| if token.endswith(suffix) and len(token) > len(suffix): | ||
| token = token[: -len(suffix)] | ||
| break | ||
| stripped.append(token) | ||
| processed_tokens = stripped | ||
|
|
There was a problem hiding this comment.
Suffix stripping creates an inconsistency: result.tokens contains suffix-stripped tokens, but result.offset_mappings[i].transliterated still contains the original token before suffix removal. This breaks the invariant that result.tokens[i] should match result.offset_mappings[i].transliterated. The offset mappings should be updated after suffix stripping to reflect the actual tokens, or the transliterated field should be updated to match the stripped token.
| stripped = [] | |
| for token in processed_tokens: | |
| # Remove the longest matching suffix (at most one per token) | |
| for suffix in sorted_suffixes: | |
| if token.endswith(suffix) and len(token) > len(suffix): | |
| token = token[: -len(suffix)] | |
| break | |
| stripped.append(token) | |
| processed_tokens = stripped | |
| stripped_tokens: list[str] = [] | |
| stripped_mappings: list[TransliterationMapping] = [] | |
| for idx, (token, mapping) in enumerate(zip(processed_tokens, offset_mappings)): | |
| original_token = token | |
| removed_len = 0 | |
| matched_suffix = "" | |
| # Remove the longest matching suffix (at most one per token) | |
| for suffix in sorted_suffixes: | |
| if token.endswith(suffix) and len(token) > len(suffix): | |
| removed_len = len(suffix) | |
| matched_suffix = suffix | |
| token = token[: -removed_len] | |
| break | |
| stripped_tokens.append(token) | |
| # Keep offset mappings consistent with suffix-stripped tokens. | |
| # If we did not remove any suffix, reuse the existing mapping. | |
| if removed_len == 0: | |
| stripped_mappings.append(mapping) | |
| continue | |
| # Update the transliterated form by stripping the same suffix length. | |
| transliterated = mapping.transliterated | |
| if len(transliterated) >= removed_len: | |
| new_transliterated = transliterated[:-removed_len] | |
| else: | |
| # Fallback: avoid negative slicing if lengths somehow mismatch. | |
| new_transliterated = "" | |
| new_trans_len = len(new_transliterated) | |
| new_char_mappings: list[tuple[int, int, int, int]] = [] | |
| for orig_start, orig_end, trans_start, trans_end in mapping.char_mappings: | |
| # Skip mappings that start beyond the new transliterated length. | |
| if trans_start >= new_trans_len: | |
| break | |
| # If the entire mapping lies within the new length, keep as is. | |
| if trans_end <= new_trans_len: | |
| new_char_mappings.append((orig_start, orig_end, trans_start, trans_end)) | |
| else: | |
| # Truncate a partially overlapping mapping at the new end. | |
| new_char_mappings.append((orig_start, orig_end, trans_start, new_trans_len)) | |
| break | |
| stripped_mappings.append( | |
| TransliterationMapping( | |
| original=mapping.original, | |
| transliterated=new_transliterated, | |
| char_mappings=new_char_mappings, | |
| ) | |
| ) | |
| processed_tokens = stripped_tokens | |
| offset_mappings = stripped_mappings |
Multiple review findings in the Ottoman module and
TextProcessorpipeline: quadratic performance in transliteration, token mappings storing local instead of document-level offsets,custom_suffixessilently doing nothing, and several API/doc mismatches.Performance
arabic_to_latin/scholarly_to_modern: Replacedlen("".join(chars))(O(n) per iteration → O(n²) overall) with a runningtrans_posinteger counter updated as each character is appended.Offset tracking
TransliterationMapping: Now stores document-level offsets(orig_start, orig_end, trans_idx, trans_idx + len(token))derived from the full mapping when resolvable, falling back to token-local offsets.get_offsets_for_tokenconsequently returns correct absolute spans for OCR/bounding-box alignment.Unimplemented config option
OttomanConfig.custom_suffixes: Was accepted and stored but never applied. Added a suffix-stripping step inOttomanProcessor.process()that removes the longest matching suffix from each processed token.API correctness
ProcessorConfig.lowercase: Was documented but ineffective —clean_textalways lowercased viaDEFAULT_CLEANING_STEPS. Whenlowercase=False, the processor now passes a custom steps pipeline that excludesnormalize_case.Docs / examples
frequencies.pydocstring: Added missingProcessorConfigto the example import — snippet was broken if copied verbatim.README.mdprocess_textexample: Replaced invalidsteps=[...]with correct keyword args (remove_stopwords=True, attach_suffixes=True).docs/USER_GUIDE.mdemoji example: Addedprocessor = TextProcessor(config)after settingemoji_mode="extract"— without itresult.emojiswas always empty.✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.