
Support Split pre-tokenizer in BPE Sequence #1059

Open

tanzeel-amd wants to merge 3 commits into microsoft:main from tanzeel-amd:turrahma/split_fix

Conversation

@tanzeel-amd

When tokenizer.json has a Sequence pre-tokenizer containing Split + ByteLevel steps (as used by Hunyuan), all Split regex patterns are now accumulated and fused into the pre-tokenizer regex as higher-priority alternation branches, instead of only keeping the last one.

Key changes:

  • LoadPreTokenizer accumulates Split regexes in a vector instead of overwriting a single string, and recognizes ByteLevel entries
  • GetPreTokenizerRegex fuses Split patterns with the base GPT-2/Llama regex so they take priority during matching
  • JSON-loaded regexes are normalized (CR/LF bytes to escape sequences) to match the Compile pattern table format
  • New hardcoded matchers for CJK, Unicode punctuation/symbol, and ASCII-punctuation+letter patterns used by Hunyuan's Split entries

Fixes silent tokenization failure where models using Split pre-tokenizers produced garbage token IDs (mostly spaces).
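
A minimal sketch of that fusion, assuming a free helper function (the PR does this inside GetPreTokenizerRegex using the split_regexes_ member; the helper name below is illustrative):

#include <string>
#include <vector>

// Sketch only: fuse the accumulated Split patterns ahead of the base
// GPT-2/Llama pattern. Alternation tries earlier branches first, so the
// Split patterns take priority and the base regex stays as a fallback branch.
std::string FusePreTokenizerRegex(const std::vector<std::string>& split_regexes,
                                  const std::string& base_regex) {
  std::string fused;
  for (const auto& sr : split_regexes) {
    if (!fused.empty()) fused += "|";
    fused += sr;
  }
  if (fused.empty()) return base_regex;  // no Split steps: unchanged behavior
  return fused + "|" + base_regex;       // Split branches first, base regex last
}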

Ur Rahman and others added 3 commits April 16, 2026 08:18
Copilot AI review requested due to automatic review settings May 8, 2026 11:09
tanzeel-amd requested a review from a team as a code owner May 8, 2026 11:09
Contributor

Copilot AI left a comment


Pull request overview

This PR improves BPE tokenizer compatibility with Hugging Face tokenizer.json files that use a Sequence pre-tokenizer containing multiple Split steps (plus ByteLevel), by accumulating all Split regex patterns and fusing them (at higher priority) ahead of the base GPT-2/Llama pre-tokenizer regex to avoid incorrect/degenerate tokenization.

Changes:

  • Accumulate Split regexes from a Sequence pre-tokenizer (instead of overwriting) and fuse them ahead of the base pre-tokenizer regex.
  • Normalize CR/LF bytes in JSON-loaded regex strings to \\r/\\n to align with the compile-time pattern table matching logic.
  • Add hardcoded matchers + compile-time pattern-table entries for Hunyuan-specific Split regex patterns (CJK, punctuation/symbol, and ASCII-punct+letter).
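
As a rough illustration of the third change, a hardcoded code-point matcher can be as simple as a range check. The exact ranges the PR's Hunyuan matchers cover are not fully visible here, so the extra blocks below are assumptions:

// Illustrative only: a CJK matcher in the style of the PR's hardcoded
// matchers. The diff shows the main Unified Ideographs block; Extension A
// and Compatibility Ideographs are added here purely as an example.
static bool IsCJKExample(char32_t ch) {
  return (ch >= 0x4E00 && ch <= 0x9FFF) ||  // CJK Unified Ideographs
         (ch >= 0x3400 && ch <= 0x4DBF) ||  // CJK Extension A
         (ch >= 0xF900 && ch <= 0xFAFF);    // CJK Compatibility Ideographs
}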

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Files changed:
  • operators/tokenizer/bpe_utils.hpp — Adds new hardcoded matchers and registers Hunyuan-specific regex branches in the pattern table used by PreTokenizerWithRegEx::Compile.
  • operators/tokenizer/bpe_tokenizer_model.hpp — Updates pre-tokenizer loading to collect/normalize multiple Split regexes and returns a fused regex (Split branches + base regex) from GetPreTokenizerRegex.


}

static bool IsCJK(char32_t ch) {
  return (ch >= 0x4E00 && ch <= 0x9FFF) ||  // CJK Unified Ideographs (一-龥 approx)
Comment on lines +100 to +101
} else if (pre_type == "ByteLevel") {
  has_byte_level_in_sequence_ = true;
Comment on lines +483 to +487
std::string fused;
for (const auto& sr : split_regexes_) {
  if (!fused.empty()) fused += "|";
  fused += sr;
}
@sayanshaw24
Collaborator

This PR's branch predates #1045, which landed on main and added a top-level Split pre-tokenizer handler, no_op_pretokenizer_, and the spm_model parameter to GetPreTokenizerRegex. Please rebase on main — the Sequence loop changes here conflict with that handler, and GetPreTokenizerRegex needs the spm_model parameter restored so SPM models continue to get the Llama regex.
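
One possible shape after that rebase (a sketch, not the PR's diff): keep the spm_model parameter from #1045 so SPM models still get the Llama pattern, and only otherwise return the fused result. FuseSplitWithBase() is a hypothetical stand-in for the PR's fusion of split_regexes_ with the base GPT-2 regex.

std::string GetPreTokenizerRegex(const std::string& model_name, bool spm_model) const {
  if (model_name == "Llama" || spm_model) {
    // spm_model came from #1045; SPM models continue to get the Llama regex.
    return bpe::PreTokenizerWithRegEx::LLAMA_REGEX_PATTERN;
  }
  // Otherwise fall back to the Split branches fused ahead of the base regex.
  return FuseSplitWithBase();
}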


if (model_name == "Llama" || spm_model) {
return bpe::PreTokenizerWithRegEx::LLAMA_REGEX_PATTERN;
if (split_regexes_.empty()) {
Collaborator


In addition to the Copilot comment here, when split_regexes_ is non-empty, the fused result always appends the base GPT-2 or Llama regex after the split patterns. Is that intentional for Hunyuan? If the split regexes fully define the pre-tokenization, the base regex may cause unexpected matches. Worth a brief comment explaining the rationale for always including the base regex as a fallback.

pre_tokenizer_regex_ = regex_str->get<std::string>();
// Validate the regex pattern
auto regex = regex_str->get<std::string>();
// JSON decodes \r and \n into literal CR/LF bytes, but the Compile()
Collaborator


The CR/LF normalization only handles \r and \n, but JSON also decodes \t into a literal tab (and potentially other escape sequences). If a future model's Split pattern includes \t, the same substring-matching failure would occur. Consider also normalizing a literal tab back to the \t escape here, or using a more general approach.
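
A sketch of that more general approach, assuming the normalization sits in a small helper: re-escape every control byte the pattern table stores as a two-character escape (CR, LF, and tab) instead of special-casing only CR and LF.

#include <string>

// Sketch only: turn raw control bytes produced by JSON decoding back into the
// two-character escape sequences used by the compile-time pattern table.
std::string NormalizeRegexEscapes(const std::string& pattern) {
  std::string out;
  out.reserve(pattern.size());
  for (char c : pattern) {
    switch (c) {
      case '\r': out += "\\r"; break;
      case '\n': out += "\\n"; break;
      case '\t': out += "\\t"; break;
      default:   out += c;     break;
    }
  }
  return out;
}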

while (i < m_text.size() && IsLM(m_text[i])) {
  i++;
}
if (i == 0) return {};
Collaborator


This if (i == 0) return {}; is unreachable. If the optional leading character is consumed, i >= 1 entering the while loop. If it isn't consumed, i == 0 and the guard on line 695 (if (i >= m_text.size() || !IsLM(m_text[i])) return {};) either returns early or ensures the while loop runs at least once (making i >= 1). We can remove this or replace with assert(i > 0).
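
A self-contained illustration of that suggestion on a simplified matcher (the names here are hypothetical, not the PR's): the early-return guard guarantees the loop consumes at least one character, so the dead check becomes an assertion documenting the invariant.

#include <cassert>
#include <string>

// Simplified stand-in for the matcher discussed above: return how many leading
// characters satisfy pred, or 0 if the first character does not match.
static size_t MatchRun(const std::u32string& text, bool (*pred)(char32_t)) {
  if (text.empty() || !pred(text[0])) return 0;  // early-return guard
  size_t i = 0;
  while (i < text.size() && pred(text[i])) {
    ++i;
  }
  assert(i > 0);  // invariant: the guard above ensures at least one match
  return i;
}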
