[Feature] Token-to-Original-Text Alignment for NER (#7) #32
Open
ada-cinar wants to merge 8 commits into
Implements #7 - Token-to-Original-Text Alignment

Key changes:
- Added Token struct with text, start, end (byte offsets)
- Added tokenize_normalized() function for NER workflows
- Preserves original text positions even after normalization
- Added comprehensive Rust unit tests for Turkish edge cases
- Byte offsets compatible with modern NLP tools (Transformers)

This enables users to:
1. Get normalized tokens for processing
2. Map back to exact positions in original raw text
3. Use with labeled training data without breaking labels
4. Integrate with BERT/Transformers tokenizers

Tests cover:
- Turkish İ/I normalization with offset preservation
- Whitespace handling
- Real NER use cases with entity extraction
Shows how to use Token struct with offset mapping to extract entities while preserving original text positions.
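The workflow this example commit describes can be sketched in pure Python. This is a minimal stand-in, not the Rust implementation from this PR: the `Token` dataclass, `normalize_tr`, the regex-based `tokenize_normalized`, and the `predicted` spans are all assumptions made for illustration. It uses character offsets, which is what a later commit in this PR switches to.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str   # normalized token text
    start: int  # character offset into the original string (inclusive)
    end: int    # character offset into the original string (exclusive)


# Turkish-aware lowercasing: "I" -> "ı" and "İ" -> "i", one character each.
# Plain str.lower() would turn "İ" into "i" followed by a combining dot.
_TR_LOWER = str.maketrans({"I": "ı", "İ": "i"})


def normalize_tr(word: str) -> str:
    return word.translate(_TR_LOWER).lower()


def tokenize_normalized(text: str) -> list[Token]:
    # Offsets are taken from the raw text; only the token text is normalized.
    return [
        Token(normalize_tr(m.group()), m.start(), m.end())
        for m in re.finditer(r"\w+", text)
    ]


raw = "Ayşe IŞIK İstanbul'da yaşıyor."
tokens = tokenize_normalized(raw)

# Hypothetical model output: (label, first token index, last token index).
predicted = [("PER", 0, 1), ("LOC", 2, 2)]

# Slice the raw text with token offsets to recover each entity as written.
entities = [(label, raw[tokens[i].start:tokens[j].end]) for label, i, j in predicted]
```

Here `entities` holds `("PER", "Ayşe IŞIK")` and `("LOC", "İstanbul")`: the model sees the normalized tokens `ışık` and `istanbul`, while the extracted spans keep the original casing.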
…ze_normalized

- Upgrade pyo3 from 0.23.3 to 0.27.2 (fixes CI Python 3.14 compatibility)
- Export tokenize_normalized and Token from __init__.py
- All tests pass (44 passed, 9 skipped)
- Resolves CI failure in PR #32
- Reorganize imports in test files to match ruff conventions
- Fixes CI lint check failures
- No functional changes
- Changed Token offsets from byte-based to character-based
- Ensures text[start:end] works correctly in Python
- Updated all Rust tests to use character offsets
- Added comprehensive Python integration tests (9 tests)
- Fixed README example with correct offsets
- Added NER example demonstrating offset mapping usage

Fixes #7 - Token-to-Original-Text Alignment for NER tasks
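The reason for this change can be seen with nothing but the standard library. The sketch below is illustrative only (the sample sentence is an assumption): Python strings are indexed by code point, while "İ" occupies two bytes in UTF-8, so a byte offset points one position too far for everything after it.

```python
text = "İstanbul'a gittim"
word = "gittim"

# Character offset: counts Unicode code points, which is what str slicing uses.
char_start = text.index(word)

# Byte offset: counts UTF-8 bytes, so the two-byte "İ" shifts it by one.
byte_start = text.encode("utf-8").index(word.encode("utf-8"))

char_slice = text[char_start:char_start + len(word)]  # "gittim"
byte_slice = text[byte_start:byte_start + len(word)]  # "ittim" (wrong)

# A byte offset that falls on a character boundary can be converted
# by decoding the prefix and counting its characters.
converted = len(text.encode("utf-8")[:byte_start].decode("utf-8"))
```

With this input `char_start` is 11 and `byte_start` is 12, so slicing with the byte offset silently drops the first letter of the token.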
Resolves #35 - Missing Type Stubs for Token and tokenize_normalized

- Add Token class definition to _durak_core.pyi with full docstrings
- Add tokenize_normalized function stub with comprehensive examples
- Update __all__ exports to include Token and tokenize_normalized
- Verify all tests pass (19 passed, 3 skipped)

This enables proper type checking and IDE autocomplete for NER workflows using the offset mapping functionality from #7.

Type stubs now provide:
- Complete Token class interface (text, start, end attributes)
- tokenize_normalized return type (list[Token])
- Detailed docstrings with NER use-case examples
- Python slice verification examples for offset accuracy
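As a rough idea of what such stubs could look like, here is a sketch built only from the names listed in this commit message. The parameter name `text`, the docstring wording, and the inclusive/exclusive reading of the offsets are assumptions; the actual contents of _durak_core.pyi are not shown on this page.

```python
class Token:
    """One normalized token plus its span in the original text."""

    text: str   # normalized token text
    start: int  # character offset where the token starts (assumed inclusive)
    end: int    # character offset where the token ends (assumed exclusive)


def tokenize_normalized(text: str) -> list[Token]:
    """Return normalized tokens whose offsets index into the raw text."""
    ...
```

With declarations of this shape, a type checker can flag a misspelled attribute such as `token.begin` and an editor can offer `text`, `start`, and `end` as completions.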
📝 Latest Update: Type Stubs Added (2026-01-26 18:30 TRT)
Commit:
What's New
Added complete type stubs to
Impact
Test Status
Next Steps
PR is ready for review. All critical functionality implemented:
- Fixed import path from 'import _durak_core' to 'from durak import _durak_core'
- Tests now properly detect and use the Rust extension
- All offset mapping tests passing (3/3)

feat(examples): enhance NER offset mapping example

- Comprehensive NER workflow demonstration
- Multiple Turkish text scenarios (I/İ handling, entity extraction)
- Real-world integration guidance with transformers/BERT
- Improved documentation and type hints
- Break long lines in ner_offset_mapping.py for readability
- Split long comment in _durak_core.pyi docstring
- Remove unused Token import
- All ruff checks now passing
Summary
Implements #7 - Adds Token struct with offset mapping for NER and span-based tasks.
Key Changes
- Token struct with text, start, end (byte offsets)
- tokenize_normalized() function for NER workflows

Why This Matters
For research tasks like NER, we need:
Example Usage
Output:
Tests
All Rust unit tests pass:
Next Steps
Closes #7