Fix Id2Token Bugs by sayanshaw24 · Pull Request #1060 · microsoft/onnxruntime-extensions

sayanshaw24 · 2026-05-11T21:32:05Z

Fix three Id2Token bugs in case-encoder decoding

Problem

Three bugs were found in SpmUgmDecoder::Id2Token (operators/tokenizer/ugm_kernels.hpp) when decoding tokens from models that use case-encoding. The previous implementation only recognized case-encoder markers at position 0 of a unigram piece, which broke decoding in three scenarios:

1. Mode doesn't propagate across pieces: the uppercase (U) mode was not carried across SPM piece boundaries. For example, "Umc"+"p" decoded to "MCp" instead of "MCP".
2. Markers mid-piece are ignored: when the SPM unigram lattice merged a case marker into the middle of a piece (e.g. "iTphone", where T is a titlecase marker), the marker was emitted literally instead of being applied, producing "iTphone" instead of "iPhone".
3. Implicit L reset not recovered: when the SPM lattice dropped an explicit L (lowercase) marker at a non-letter codepoint boundary, the decoder failed to reset the mode. For example, "Upp"+"v"+"-"+"mp" decoded to "PPV-MP" instead of "PPV-mp".

Fix

Replaced the position-0-only marker check in Id2Token with a proper per-piece byte-level state machine that:

- Scans every byte of the piece for marker characters (U, A, T, L, P), not just position 0
- Propagates the active case mode (signature_) across piece boundaries via the TokenizerDecodingState
- Handles SPM space markers (▁) inline, resetting U/T modes at word boundaries while allowing A (all-uppercase) to persist across spaces
- Implicitly resets U/T mode at non-letter codepoints when the encoder's explicit L was dropped by SPM scoring
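
As a rough illustration of the state machine described above, the following C++ sketch decodes pieces byte by byte while carrying the case mode in a small state struct. It is not the code from ugm_kernels.hpp: `CaseState`, `DecodePiece`, and `kSpace` are hypothetical names, only ASCII letters are handled, ending A mode on an explicit L is an assumption, and the P marker is left out because its meaning is not described in this PR.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical stand-in for the mode the PR keeps in TokenizerDecodingState.
struct CaseState {
  char mode = 'L';  // 'L' lower, 'U' upper run, 'T' title (next letter), 'A' all-upper
};

// UTF-8 bytes of the SPM space marker U+2581.
constexpr std::string_view kSpace = "\xE2\x96\x81";

// Decodes one piece and updates `state`, so the mode survives into the next piece.
// Assumption: letters are ASCII and the encoder lowercased the text before
// inserting markers, so an uppercase U/A/T/L byte is always a marker.
std::string DecodePiece(std::string_view piece, CaseState& state) {
  std::string out;
  for (std::size_t i = 0; i < piece.size();) {
    // Space marker: emit a space; U/T end at the word boundary, A persists.
    if (piece.substr(i, kSpace.size()) == kSpace) {
      out += ' ';
      if (state.mode == 'U' || state.mode == 'T') state.mode = 'L';
      i += kSpace.size();
      continue;
    }
    const char c = piece[i++];
    // A marker anywhere in the piece switches the mode and emits nothing.
    if (c == 'U' || c == 'A' || c == 'T' || c == 'L') {
      state.mode = c;
      continue;
    }
    if (c >= 'a' && c <= 'z') {
      const bool upper = state.mode != 'L';
      out += upper ? static_cast<char>(c - 'a' + 'A') : c;
      if (state.mode == 'T') state.mode = 'L';  // titlecase covers one letter only
      continue;
    }
    // Any other byte (including the unmodeled P marker): implicit L reset
    // for U/T, A is left alone, and the byte is copied through unchanged.
    if (state.mode == 'U' || state.mode == 'T') state.mode = 'L';
    out += c;
  }
  return out;
}
```

Under these assumptions, feeding "Umc" and then "p" through one shared `CaseState` yields "MCP", "iTphone" yields "iPhone", and "Upp", "v", "-", "mp" yield "PPV-mp", which matches the expected outputs listed in the Problem section.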

Changes

operators/tokenizer/ugm_kernels.hpp — Rewrote SpmUgmDecoder::Id2Token case-encoding path (-51/+138 lines). Added #include <cstring> for std::memcmp.
test/pp_api_test/test_tokenizer_capi.cc — Added 4 regression tests using NMT tokenizer roundtrip (tokenize → detokenize → compare):
- MarianId2Token_CrossPieceModePropagate — Bug 1
- MarianId2Token_MidPieceMarker — Bug 2
- MarianId2Token_ImplicitLReset — Bug 3
- MarianId2Token_CombinedBugs — All three in one sentence
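
The PR text does not say which comparison `std::memcmp` is used for. One plausible use in a byte-level scanner is matching the three-byte UTF-8 encoding of the SPM space marker (▁, U+2581) at an arbitrary offset. The sketch below shows that idea only; `kSpmSpace`, `IsSpaceMarkerAt`, and `ReplaceSpaceMarkers` are hypothetical names, not identifiers from ugm_kernels.hpp.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// UTF-8 encoding of U+2581 and its length without the trailing NUL.
constexpr char kSpmSpace[] = "\xE2\x96\x81";
constexpr std::size_t kSpmSpaceLen = sizeof(kSpmSpace) - 1;

// Returns true when the bytes starting at `pos` spell the space marker.
// The length checks run first, so memcmp never reads past the end of the piece.
bool IsSpaceMarkerAt(const std::string& piece, std::size_t pos) {
  return pos <= piece.size() && piece.size() - pos >= kSpmSpaceLen &&
         std::memcmp(piece.data() + pos, kSpmSpace, kSpmSpaceLen) == 0;
}

// Replaces every space marker in `piece` with an ASCII space and copies
// all other bytes through unchanged.
std::string ReplaceSpaceMarkers(const std::string& piece) {
  std::string out;
  for (std::size_t i = 0; i < piece.size();) {
    if (IsSpaceMarkerAt(piece, i)) {
      out += ' ';
      i += kSpmSpaceLen;
    } else {
      out += piece[i++];
    }
  }
  return out;
}
```

Checking the remaining length before calling `std::memcmp` matters when a piece ends partway through a multi-byte sequence, since the comparison would otherwise read out of bounds.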

Validation

All existing tests pass. New tests pass with the fix and fail without it.

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR fixes Marian case-encoding decoding issues in SpmUgmDecoder::Id2Token by replacing a position-0-only marker check with a byte-level per-piece state machine, and adds regression tests to cover real-world failure cases.

Changes:

- Reworked Id2Token to scan markers throughout each piece and propagate case mode across piece boundaries.
- Added inline handling for SPM space markers and implicit mode resets at non-letter boundaries.
- Added 4 C API roundtrip regression tests covering the three reported bugs and a combined scenario.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File	Description
operators/tokenizer/ugm_kernels.hpp	Rewrites Marian case-decoding logic in `Id2Token` using a per-piece UTF-8 aware state machine and persists mode in decoding state.
test/pp_api_test/test_tokenizer_capi.cc	Adds regression tests that tokenize→detokenize and assert exact text roundtrip for the three bugs plus a combined case.

fix id2token bugs

ca7d85d

Copilot AI review requested due to automatic review settings May 11, 2026 21:32

sayanshaw24 requested a review from a team as a code owner May 11, 2026 21:32

Copilot AI reviewed May 11, 2026

Comment thread test/pp_api_test/test_tokenizer_capi.cc Outdated

Comment thread operators/tokenizer/ugm_kernels.hpp Outdated

Comment thread operators/tokenizer/ugm_kernels.hpp Outdated

Comment thread operators/tokenizer/ugm_kernels.hpp Outdated

Copilot started reviewing on behalf of sayanshaw24 May 11, 2026 22:03

resolve Copilot comments

0432b2c

apsonawane approved these changes May 13, 2026

sayanshaw24 merged commit b62dd46 into main May 13, 2026
38 checks passed

sayanshaw24 deleted the sayanshaw/id2token-bugs branch May 13, 2026 00:32

sayanshaw24 mentioned this pull request May 13, 2026

Update Extensions Commit to Fix Id2Token Bugs microsoft/onnxruntime-genai#2159

Merged

sayanshaw24 added a commit to microsoft/onnxruntime-genai that referenced this pull request May 13, 2026

Update Extensions Commit to Fix Id2Token Bugs (#2159)

5a1bcbb

Includes fixes in microsoft/onnxruntime-extensions#1060. Co-authored-by: Sayan Shaw <sayanshaw@microsoft.com>

sayanshaw24 merged 2 commits into main from sayanshaw/id2token-bugs

3 participants