Support Split pre-tokenizer in BPE Sequence #1059
Conversation
When tokenizer.json has a Sequence pre-tokenizer containing Split + ByteLevel steps (as used by Hunyuan), all Split regex patterns are now accumulated and fused into the pre-tokenizer regex as higher-priority alternation branches, instead of only keeping the last one.

Key changes:
- `LoadPreTokenizer` accumulates Split regexes in a vector instead of overwriting a single string, and recognizes ByteLevel entries
- `GetPreTokenizerRegex` fuses Split patterns with the base GPT-2/Llama regex so they take priority during matching
- JSON-loaded regexes are normalized (CR/LF bytes to escape sequences) to match the Compile pattern table format
- New hardcoded matchers for CJK, Unicode punctuation/symbol, and ASCII-punctuation+letter patterns used by Hunyuan's Split entries

Fixes a silent tokenization failure where models using Split pre-tokenizers produced garbage token IDs (mostly spaces).
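The fusion described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: `FuseRegexes` and its parameters are invented names standing in for the logic inside `GetPreTokenizerRegex`.

```cpp
#include <string>
#include <vector>

// Sketch: fuse accumulated Split regexes ahead of the base pre-tokenizer
// regex as higher-priority alternation branches. Earlier alternation
// branches win in left-to-right regex matching, so Split patterns take
// priority; the base GPT-2/Llama regex remains as a fallback.
std::string FuseRegexes(const std::vector<std::string>& split_regexes,
                        const std::string& base_regex) {
  if (split_regexes.empty()) return base_regex;  // nothing to fuse
  std::string fused;
  for (const auto& sr : split_regexes) {
    if (!fused.empty()) fused += "|";
    fused += sr;
  }
  // Base regex is appended last so the Split branches match first.
  return fused + "|" + base_regex;
}
```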
Pull request overview
This PR improves BPE tokenizer compatibility with Hugging Face tokenizer.json files that use a Sequence pre-tokenizer containing multiple Split steps (plus ByteLevel), by accumulating all Split regex patterns and fusing them (at higher priority) ahead of the base GPT-2/Llama pre-tokenizer regex to avoid incorrect/degenerate tokenization.
Changes:
- Accumulate `Split` regexes from a Sequence pre-tokenizer (instead of overwriting) and fuse them ahead of the base pre-tokenizer regex.
- Normalize CR/LF bytes in JSON-loaded regex strings to `\r`/`\n` to align with the compile-time pattern table matching logic.
- Add hardcoded matchers + compile-time pattern-table entries for Hunyuan-specific Split regex patterns (CJK, punctuation/symbol, and ASCII-punct+letter).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| operators/tokenizer/bpe_utils.hpp | Adds new hardcoded matchers and registers Hunyuan-specific regex branches in the pattern table used by `PreTokenizerWithRegEx::Compile`. |
| operators/tokenizer/bpe_tokenizer_model.hpp | Updates pre-tokenizer loading to collect/normalize multiple Split regexes and returns a fused regex (Split branches + base regex) from `GetPreTokenizerRegex`. |
```cpp
}

static bool IsCJK(char32_t ch) {
  return (ch >= 0x4E00 && ch <= 0x9FFF) || // CJK Unified Ideographs (一-龥 approx)
```
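For context, a fuller CJK matcher along these lines might look like the sketch below. The ranges beyond the Unified Ideographs block are assumptions about what a Hunyuan-style CJK Split pattern would target, not necessarily what the PR implements.

```cpp
// Sketch of a CJK code-point matcher (ranges beyond 0x4E00-0x9FFF are
// illustrative additions, not taken from the PR).
static bool IsCJK(char32_t ch) {
  return (ch >= 0x4E00 && ch <= 0x9FFF) ||    // CJK Unified Ideographs
         (ch >= 0x3400 && ch <= 0x4DBF) ||    // CJK Extension A
         (ch >= 0xF900 && ch <= 0xFAFF) ||    // CJK Compatibility Ideographs
         (ch >= 0x20000 && ch <= 0x2A6DF);    // CJK Extension B
}
```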
```cpp
} else if (pre_type == "ByteLevel") {
  has_byte_level_in_sequence_ = true;
```
```cpp
std::string fused;
for (const auto& sr : split_regexes_) {
  if (!fused.empty()) fused += "|";
  fused += sr;
}
```
This PR's branch predates #1045, which landed on main and added a top-level Split pre-tokenizer handler.
```cpp
if (model_name == "Llama" || spm_model) {
  return bpe::PreTokenizerWithRegEx::LLAMA_REGEX_PATTERN;
if (split_regexes_.empty()) {
```
In addition to the Copilot comment here: when `split_regexes_` is non-empty, the fused result always appends the base GPT-2 or Llama regex after the split patterns. Is that intentional for Hunyuan? If the Split regexes fully define the pre-tokenization, the base regex may cause unexpected matches. Worth a brief comment explaining the rationale for always including the base regex as a fallback.
```cpp
pre_tokenizer_regex_ = regex_str->get<std::string>();
// Validate the regex pattern
auto regex = regex_str->get<std::string>();
// JSON decodes \r and \n into literal CR/LF bytes, but the Compile()
```
The CR/LF normalization only handles `\r` and `\n`, but JSON also decodes `\t` into a literal tab (and potentially other escape sequences). If a future model's Split pattern includes `\t`, the same substring-matching failure would occur. Consider also normalizing literal tab bytes to `\t` here, or using a more general approach.
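The suggested generalization could look roughly like this; the function name is illustrative, and the set of handled escapes is just the three mentioned in the review.

```cpp
#include <string>

// Sketch: re-escape literal control bytes (decoded from JSON strings)
// back into their two-character escape sequences, so substring matching
// against the compile-time pattern table works.
std::string NormalizeRegexEscapes(const std::string& s) {
  std::string out;
  out.reserve(s.size());
  for (char c : s) {
    switch (c) {
      case '\r': out += "\\r"; break;
      case '\n': out += "\\n"; break;
      case '\t': out += "\\t"; break;  // the case the review flags as missing
      default:   out += c;             // everything else passes through
    }
  }
  return out;
}
```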
```cpp
while (i < m_text.size() && IsLM(m_text[i])) {
  i++;
}
if (i == 0) return {};
```
This `if (i == 0) return {};` is unreachable. If the optional leading character is consumed, `i >= 1` entering the while loop. If it isn't consumed, `i == 0` and the guard on line 695 (`if (i >= m_text.size() || !IsLM(m_text[i])) return {};`) either returns early or ensures the while loop runs at least once (making `i >= 1`). We can remove this or replace it with `assert(i > 0)`.
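The shape of the argument can be illustrated with a reduced, stand-alone version of the matcher. `IsLM` here is a toy predicate and the return type is simplified to a length; both are illustrative stand-ins, not the real implementation.

```cpp
#include <cassert>
#include <string>

// Toy stand-in for the real letter-mark predicate.
static bool IsLM(char c) { return c == 'x'; }

// Reduced matcher: guard first, then the run loop. After the guard,
// the loop body runs at least once, so i == 0 is impossible afterward;
// the assert documents that invariant in place of a dead early-return.
static size_t MatchRun(const std::string& text) {
  size_t i = 0;
  if (i >= text.size() || !IsLM(text[i])) return 0;  // early-out guard
  while (i < text.size() && IsLM(text[i])) {
    i++;
  }
  assert(i > 0);  // replaces the unreachable `if (i == 0) return {};`
  return i;
}
```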