Skip to content

Fix Id2Token Bugs#1060

Merged
sayanshaw24 merged 2 commits into
mainfrom
sayanshaw/id2token-bugs
May 13, 2026
Merged

Fix Id2Token Bugs#1060
sayanshaw24 merged 2 commits into
mainfrom
sayanshaw/id2token-bugs

Conversation

@sayanshaw24
Copy link
Copy Markdown
Collaborator

@sayanshaw24 sayanshaw24 commented May 11, 2026

Fix three Id2Token bugs in case-encoder decoding

Problem

Three bugs were found in SpmUgmDecoder::Id2Token (operators/tokenizer/ugm_kernels.hpp) when decoding tokens from models that use case-encoding. The previous implementation only recognized case-encoder markers at position 0 of a unigram piece, which broke decoding in three scenarios:

  1. Mode doesn't propagate across pieces — The uppercase (U) mode was not carried across SPM piece boundaries. For example, "Umc"+"p" decoded to "MCp" instead of "MCP".
  2. Markers mid-piece are ignored — When the SPM unigram lattice merged a case marker into the middle of a piece (e.g. "iTphone" where T is a titlecase marker), the marker was emitted literally instead of being applied, producing "iTphone" instead of "iPhone".
  3. Implicit L reset not recovered — When the SPM lattice dropped an explicit L (lowercase) marker at a non-letter codepoint boundary, the decoder failed to reset the mode. For example, "Upp"+"v"+"-"+"mp" decoded to "PPV-MP" instead of "PPV-mp".

Fix

Replaced the position-0-only marker check in Id2Token with a proper per-piece byte-level state machine that:

  • Scans every byte of the piece for marker characters (U, A, T, L, P), not just position 0
  • Propagates the active case mode (signature_) across piece boundaries via the TokenizerDecodingState
  • Handles SPM space markers (▁) inline, resetting U/T modes at word boundaries while allowing A (all-uppercase) to persist across spaces
  • Implicitly resets U/T mode at non-letter codepoints when the encoder's explicit L was dropped by SPM scoring

Changes

  • operators/tokenizer/ugm_kernels.hpp — Rewrote SpmUgmDecoder::Id2Token case-encoding path (-51/+138 lines). Added #include <cstring> for std::memcmp.
  • test/pp_api_test/test_tokenizer_capi.cc — Added 4 regression tests using NMT tokenizer roundtrip (tokenize → detokenize → compare):
    • MarianId2Token_CrossPieceModePropagate — Bug 1
    • MarianId2Token_MidPieceMarker — Bug 2
    • MarianId2Token_ImplicitLReset — Bug 3
    • MarianId2Token_CombinedBugs — All three in one sentence

Validation

  • All existing tests pass. New tests pass with the fix and fail without it.

Copilot AI review requested due to automatic review settings May 11, 2026 21:32
@sayanshaw24 sayanshaw24 requested a review from a team as a code owner May 11, 2026 21:32
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR fixes Marian case-encoding decoding issues in SpmUgmDecoder::Id2Token by replacing a position-0-only marker check with a byte-level per-piece state machine, and adds regression tests to cover real-world failure cases.

Changes:

  • Reworked Id2Token to scan markers throughout each piece and propagate case mode across piece boundaries.
  • Added inline handling for SPM space markers and implicit mode resets at non-letter boundaries.
  • Added 4 C API roundtrip regression tests covering the three reported bugs and a combined scenario.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
operators/tokenizer/ugm_kernels.hpp Rewrites Marian case-decoding logic in Id2Token using a per-piece UTF-8 aware state machine and persists mode in decoding state.
test/pp_api_test/test_tokenizer_capi.cc Adds regression tests that tokenize→detokenize and assert exact text roundtrip for the three bugs plus a combined case.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread test/pp_api_test/test_tokenizer_capi.cc Outdated
Comment thread operators/tokenizer/ugm_kernels.hpp Outdated
Comment thread operators/tokenizer/ugm_kernels.hpp Outdated
Comment thread operators/tokenizer/ugm_kernels.hpp Outdated
@sayanshaw24 sayanshaw24 merged commit b62dd46 into main May 13, 2026
38 checks passed
@sayanshaw24 sayanshaw24 deleted the sayanshaw/id2token-bugs branch May 13, 2026 00:32
sayanshaw24 added a commit to microsoft/onnxruntime-genai that referenced this pull request May 13, 2026
Includes fixes in
microsoft/onnxruntime-extensions#1060.

Co-authored-by: Sayan Shaw <sayanshaw@microsoft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants