Support Split pre-tokenizer in BPE Sequence #1059
Conversation
When tokenizer.json has a Sequence pre-tokenizer containing Split + ByteLevel steps (as used by Hunyuan), all Split regex patterns are now accumulated and fused into the pre-tokenizer regex as higher-priority alternation branches, instead of only keeping the last one.

Key changes:
- `LoadPreTokenizer` accumulates Split regexes in a vector instead of overwriting a single string, and recognizes ByteLevel entries
- `GetPreTokenizerRegex` fuses Split patterns with the base GPT-2/Llama regex so they take priority during matching
- JSON-loaded regexes are normalized (CR/LF bytes to escape sequences) to match the Compile pattern table format
- New hardcoded matchers for CJK, Unicode punctuation/symbol, and ASCII-punctuation+letter patterns used by Hunyuan's Split entries

Fixes a silent tokenization failure where models using Split pre-tokenizers produced garbage token IDs (mostly spaces).
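The fusion described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: `FuseRegexes` and its parameters are invented names standing in for the logic inside `GetPreTokenizerRegex`.

```cpp
#include <string>
#include <vector>

// Sketch: fuse accumulated Split regexes ahead of the base pre-tokenizer
// regex as higher-priority alternation branches. Earlier alternation
// branches win in left-to-right regex matching, so Split patterns take
// priority; the base GPT-2/Llama regex remains as a fallback.
std::string FuseRegexes(const std::vector<std::string>& split_regexes,
                        const std::string& base_regex) {
  if (split_regexes.empty()) return base_regex;  // nothing to fuse
  std::string fused;
  for (const auto& sr : split_regexes) {
    if (!fused.empty()) fused += "|";
    fused += sr;
  }
  // Base regex is appended last so the Split branches match first.
  return fused + "|" + base_regex;
}
```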
Pull request overview
This PR improves BPE tokenizer compatibility with Hugging Face tokenizer.json files that use a Sequence pre-tokenizer containing multiple Split steps (plus ByteLevel), by accumulating all Split regex patterns and fusing them (at higher priority) ahead of the base GPT-2/Llama pre-tokenizer regex to avoid incorrect/degenerate tokenization.
Changes:
- Accumulate `Split` regexes from a Sequence pre-tokenizer (instead of overwriting) and fuse them ahead of the base pre-tokenizer regex.
- Normalize CR/LF bytes in JSON-loaded regex strings to `\r`/`\n` to align with the compile-time pattern table matching logic.
- Add hardcoded matchers + compile-time pattern-table entries for Hunyuan-specific Split regex patterns (CJK, punctuation/symbol, and ASCII-punct+letter).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| operators/tokenizer/bpe_utils.hpp | Adds new hardcoded matchers and registers Hunyuan-specific regex branches in the pattern table used by `PreTokenizerWithRegEx::Compile`. |
| operators/tokenizer/bpe_tokenizer_model.hpp | Updates pre-tokenizer loading to collect/normalize multiple Split regexes and returns a fused regex (Split branches + base regex) from `GetPreTokenizerRegex`. |
```cpp
}

static bool IsCJK(char32_t ch) {
  return (ch >= 0x4E00 && ch <= 0x9FFF) || // CJK Unified Ideographs (一-龥 approx)
```
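For context, a fuller CJK matcher along these lines might look like the sketch below. The ranges beyond the Unified Ideographs block are assumptions about what a Hunyuan-style CJK Split pattern would target, not necessarily what the PR implements.

```cpp
// Sketch of a CJK code-point matcher (ranges beyond 0x4E00-0x9FFF are
// illustrative additions, not taken from the PR).
static bool IsCJK(char32_t ch) {
  return (ch >= 0x4E00 && ch <= 0x9FFF) ||    // CJK Unified Ideographs
         (ch >= 0x3400 && ch <= 0x4DBF) ||    // CJK Extension A
         (ch >= 0xF900 && ch <= 0xFAFF) ||    // CJK Compatibility Ideographs
         (ch >= 0x20000 && ch <= 0x2A6DF);    // CJK Extension B
}
```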
```cpp
} else if (pre_type == "ByteLevel") {
  has_byte_level_in_sequence_ = true;
```
```cpp
std::string fused;
for (const auto& sr : split_regexes_) {
  if (!fused.empty()) fused += "|";
  fused += sr;
}
```
This PR's branch predates #1045, which landed on main and added a top-level Split pre-tokenizer handler.
```cpp
if (model_name == "Llama" || spm_model) {
  return bpe::PreTokenizerWithRegEx::LLAMA_REGEX_PATTERN;
if (split_regexes_.empty()) {
```
In addition to the Copilot comment here: when `split_regexes_` is non-empty, the fused result always appends the base GPT-2 or Llama regex after the split patterns. Is that intentional for Hunyuan? If the Split regexes fully define the pre-tokenization, the base regex may cause unexpected matches. Worth a brief comment explaining the rationale for always including the base regex as a fallback.
```cpp
pre_tokenizer_regex_ = regex_str->get<std::string>();
// Validate the regex pattern
auto regex = regex_str->get<std::string>();
// JSON decodes \r and \n into literal CR/LF bytes, but the Compile()
```
The CR/LF normalization only handles `\r` and `\n`, but JSON also decodes `\t` into a literal tab (and potentially other escape sequences). If a future model's Split pattern includes `\t`, the same substring-matching failure would occur. Consider also normalizing literal tab bytes to `\t` here, or using a more general approach.
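The suggested generalization could look roughly like this; the function name is illustrative, and the set of handled escapes is just the three mentioned in the review.

```cpp
#include <string>

// Sketch: re-escape literal control bytes (decoded from JSON strings)
// back into their two-character escape sequences, so substring matching
// against the compile-time pattern table works.
std::string NormalizeRegexEscapes(const std::string& s) {
  std::string out;
  out.reserve(s.size());
  for (char c : s) {
    switch (c) {
      case '\r': out += "\\r"; break;
      case '\n': out += "\\n"; break;
      case '\t': out += "\\t"; break;  // the case the review flags as missing
      default:   out += c;             // everything else passes through
    }
  }
  return out;
}
```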
```cpp
while (i < m_text.size() && IsLM(m_text[i])) {
  i++;
}
if (i == 0) return {};
```
This `if (i == 0) return {};` is unreachable. If the optional leading character is consumed, `i >= 1` entering the while loop. If it isn't consumed, `i == 0` and the guard on line 695 (`if (i >= m_text.size() || !IsLM(m_text[i])) return {};`) either returns early or ensures the while loop runs at least once (making `i >= 1`). We can remove this or replace it with `assert(i > 0)`.
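The shape of the argument can be illustrated with a reduced, stand-alone version of the matcher. `IsLM` here is a toy predicate and the return type is simplified to a length; both are illustrative stand-ins, not the real implementation.

```cpp
#include <cassert>
#include <string>

// Toy stand-in for the real letter-mark predicate.
static bool IsLM(char c) { return c == 'x'; }

// Reduced matcher: guard first, then the run loop. After the guard,
// the loop body runs at least once, so i == 0 is impossible afterward;
// the assert documents that invariant in place of a dead early-return.
static size_t MatchRun(const std::string& text) {
  size_t i = 0;
  if (i >= text.size() || !IsLM(text[i])) return 0;  // early-out guard
  while (i < text.size() && IsLM(text[i])) {
    i++;
  }
  assert(i > 0);  // replaces the unreachable `if (i == 0) return {};`
  return i;
}
```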